atextcrawler/doc/source/devel/related_work.md

65 lines
3.2 KiB
Markdown
Raw Normal View History

2021-11-29 09:16:31 +00:00
## Related work
* [collection of crawlers](https://github.com/adbar/awesome-crawler)
* [collection of webscrapers](https://github.com/adbar/awesome-web-scraper)
### crawlers
* [acrawler](https://acrawler.readthedocs.io/en/latest/)
* [trafilatura](https://trafilatura.readthedocs.io/en/latest/index.html)
* [repo](https://github.com/adbar/trafilatura)
* [intro](https://adrien.barbaresi.eu/blog/trafilatura-main-text-content-python.html)
* [aiohttp_spider](https://github.com/niklak/aiohttp_spider/)
* [scrapy](https://docs.scrapy.org/en/latest/)
* [heritrix3](https://github.com/internetarchive/heritrix3/)
* [YaCy](https://yacy.net/)
* [searchmysite](https://searchmysite.net/)
* [spiderling](http://corpus.tools/raw-attachment/wiki/Downloads/spiderling-src-0.84.tar.xz)
* [aiohttp_spider](https://github.com/niklak/aiohttp_spider)
* https://github.com/riteshnaik/Crawling-and-Deduplication-of-Polar-Datasets-Using-Nutch-and-Tika
* [edge search engine](https://memex.marginalia.nu/projects/edge/about.gmi)
#### general
* [elastic enterprise search](https://www.elastic.co/blog/building-a-scalable-easy-to-use-web-crawler-for-elastic-enterprise-search)
### sitemap parsers
* [ultimate-sitemap-parser](https://github.com/mediacloud/ultimate-sitemap-parser)
### url handling
* [courlan](https://pypi.org/project/courlan/)
### language detection
* [overview](https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language)
* [guess_language-spirit](https://pypi.org/project/guess_language-spirit/)
* [guess_language](https://pypi.org/project/guess-language/)
* [cld3](https://github.com/google/cld3)
### text extraction
* [JusText](http://corpus.tools/wiki/Justext_changelog) [demo](https://nlp.fi.muni.cz/projects/justext/)
### deduplication
* [PostgreSQL extension smlar](https://github.com/jirutka/smlar)
* [use smlar](https://medium.datadriveninvestor.com/the-smlar-plug-in-for-effective-retrieval-of-massive-volumes-of-simhash-data-e429c19da1a3)
* remove paragraphs with more than 50% word-7-tuples encountered previously
### Extract more meta tags
* https://github.com/shareaholic/shareaholic-api-docs/blob/master/shareaholic_meta_tags.md
https://support.shareaholic.com/hc/en-us/articles/115003085186
### Date parsing dependent on language
* https://en.wikipedia.org/wiki/Date_format_by_country
* https://en.wikipedia.org/wiki/Common_Locale_Data_Repository
* https://pypi.org/project/dateparser/
* https://github.com/ovalhub/pyicu
* https://github.com/night-crawler/cldr-language-helpers
* https://stackoverflow.com/questions/19927654/using-dateutil-parser-to-parse-a-date-in-another-language
ICU
* https://unicode-org.github.io/icu/userguide/format_parse/datetime/examples.html#parse
* https://gist.github.com/dpk/8325992
* https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1DateFormat.html
* https://unicode-org.github.io/icu/userguide/
* https://unicode-org.github.io/icu-docs/#/icu4c/
* https://github.com/ovalhub/pyicu/blob/master/samples/break.py
* https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table
* https://www.unicode.org/reports/tr35/tr35-dates.html#months_days_quarters_eras
* https://unicode-org.github.io/icu/userguide/format_parse/datetime/#formatting-dates-and-times-overview