atextcrawler/doc/source/devel/related_work.md

## Related work
* [collection of crawlers](https://github.com/adbar/awesome-crawler)
* [collection of webscrapers](https://github.com/adbar/awesome-web-scraper)

### crawlers
* [acrawler](https://acrawler.readthedocs.io/en/latest/)
* [trafilatura](https://trafilatura.readthedocs.io/en/latest/index.html)
  * [repo](https://github.com/adbar/trafilatura)
  * [intro](https://adrien.barbaresi.eu/blog/trafilatura-main-text-content-python.html)
* [aiohttp_spider](https://github.com/niklak/aiohttp_spider/)
* [scrapy](https://docs.scrapy.org/en/latest/)
* [heritrix3](https://github.com/internetarchive/heritrix3/)
* [YaCy](https://yacy.net/)
* [searchmysite](https://searchmysite.net/)
* [spiderling](http://corpus.tools/raw-attachment/wiki/Downloads/spiderling-src-0.84.tar.xz)
* https://github.com/riteshnaik/Crawling-and-Deduplication-of-Polar-Datasets-Using-Nutch-and-Tika
* [edge search engine](https://memex.marginalia.nu/projects/edge/about.gmi)

#### general
* [elastic enterprise search](https://www.elastic.co/blog/building-a-scalable-easy-to-use-web-crawler-for-elastic-enterprise-search)

### sitemap parsers
* [ultimate-sitemap-parser](https://github.com/mediacloud/ultimate-sitemap-parser)

### url handling
* [courlan](https://pypi.org/project/courlan/)

### language detection
* [overview](https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language)
* [guess_language-spirit](https://pypi.org/project/guess_language-spirit/)
* [guess_language](https://pypi.org/project/guess-language/)
* [cld3](https://github.com/google/cld3)

### text extraction
* [JusText](http://corpus.tools/wiki/Justext_changelog) [demo](https://nlp.fi.muni.cz/projects/justext/)

### deduplication
* [PostgreSQL extension smlar](https://github.com/jirutka/smlar)
* [use smlar](https://medium.datadriveninvestor.com/the-smlar-plug-in-for-effective-retrieval-of-massive-volumes-of-simhash-data-e429c19da1a3)
* remove paragraphs in which more than 50% of the word 7-tuples (sequences of 7 consecutive words) have been encountered previously
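
The paragraph-level rule in the last bullet can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method of any tool linked above: the names `shingles` and `deduplicate` are hypothetical, words are split on whitespace and lowercased, paragraphs shorter than 7 words are always kept, and only kept paragraphs add their tuples to the set of known ones.

```python
from typing import Iterable, List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 7) -> List[Shingle]:
    """Return all tuples of n consecutive words (assumption: whitespace tokens)."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def deduplicate(
    paragraphs: Iterable[str], n: int = 7, threshold: float = 0.5
) -> List[str]:
    """Drop paragraphs whose share of already-seen word n-tuples exceeds threshold."""
    seen: Set[Shingle] = set()
    kept: List[str] = []
    for paragraph in paragraphs:
        grams = shingles(paragraph, n)
        if grams:
            known = sum(1 for gram in grams if gram in seen)
            if known / len(grams) > threshold:
                continue  # mostly seen before: treat as a duplicate
        # Assumption: paragraphs with fewer than n words yield no tuples
        # and are kept; only kept paragraphs extend the set of seen tuples.
        kept.append(paragraph)
        seen.update(grams)
    return kept
```

For a large crawl the set of seen tuples would probably have to be stored as hashes in a database rather than in memory; that choice is left open here.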

### Extract more meta tags
* https://github.com/shareaholic/shareaholic-api-docs/blob/master/shareaholic_meta_tags.md
* https://support.shareaholic.com/hc/en-us/articles/115003085186

### Date parsing dependent on language
* https://en.wikipedia.org/wiki/Date_format_by_country
* https://en.wikipedia.org/wiki/Common_Locale_Data_Repository
* https://pypi.org/project/dateparser/
* https://github.com/ovalhub/pyicu
* https://github.com/night-crawler/cldr-language-helpers
* https://stackoverflow.com/questions/19927654/using-dateutil-parser-to-parse-a-date-in-another-language
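
A small illustration of why the language (or locale) has to be known before parsing: the same numeric string denotes different days under a month-first and a day-first convention. The sketch uses only the standard library; the table `FORMATS_BY_LANGUAGE` and the helper `parse_date` are hypothetical names, and a real implementation would presumably take its patterns from CLDR (for instance via one of the libraries listed above) instead of a hand-written table.

```python
from datetime import date, datetime

# Hypothetical, deliberately tiny table; real patterns would come from CLDR.
FORMATS_BY_LANGUAGE = {
    "en-US": "%m/%d/%Y",  # month first
    "en-GB": "%d/%m/%Y",  # day first
    "de": "%d.%m.%Y",     # day first, dot-separated
}


def parse_date(text: str, language: str) -> date:
    """Parse a numeric date string using the convention of the given language."""
    fmt = FORMATS_BY_LANGUAGE[language]
    return datetime.strptime(text, fmt).date()


print(parse_date("03/04/2021", "en-US"))  # 2021-03-04
print(parse_date("03/04/2021", "en-GB"))  # 2021-04-03
print(parse_date("03.04.2021", "de"))     # 2021-04-03
```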

#### ICU
* https://unicode-org.github.io/icu/userguide/format_parse/datetime/examples.html#parse
* https://gist.github.com/dpk/8325992
* https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1DateFormat.html
* https://unicode-org.github.io/icu/userguide/
* https://unicode-org.github.io/icu-docs/#/icu4c/
* https://github.com/ovalhub/pyicu/blob/master/samples/break.py
* https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table
* https://www.unicode.org/reports/tr35/tr35-dates.html#months_days_quarters_eras
* https://unicode-org.github.io/icu/userguide/format_parse/datetime/#formatting-dates-and-times-overview

Put under version control 2021-11-29 09:16:31 +00:00