## Related work

* [collection of crawlers](https://github.com/adbar/awesome-crawler)
* [collection of web scrapers](https://github.com/adbar/awesome-web-scraper)

### crawlers

* [acrawler](https://acrawler.readthedocs.io/en/latest/)
* [trafilatura](https://trafilatura.readthedocs.io/en/latest/index.html)
  * [repo](https://github.com/adbar/trafilatura)
  * [intro](https://adrien.barbaresi.eu/blog/trafilatura-main-text-content-python.html)
* [aiohttp_spider](https://github.com/niklak/aiohttp_spider/)
* [scrapy](https://docs.scrapy.org/en/latest/)
* [heritrix3](https://github.com/internetarchive/heritrix3/)
* [YaCy](https://yacy.net/)
* [searchmysite](https://searchmysite.net/)
* [spiderling](http://corpus.tools/raw-attachment/wiki/Downloads/spiderling-src-0.84.tar.xz)
* https://github.com/riteshnaik/Crawling-and-Deduplication-of-Polar-Datasets-Using-Nutch-and-Tika
* [edge search engine](https://memex.marginalia.nu/projects/edge/about.gmi)

#### general

* [elastic enterprise search](https://www.elastic.co/blog/building-a-scalable-easy-to-use-web-crawler-for-elastic-enterprise-search)

### sitemap parsers

* [ultimate-sitemap-parser](https://github.com/mediacloud/ultimate-sitemap-parser)

### url handling

* [courlan](https://pypi.org/project/courlan/)

### language detection

* [overview](https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language)
* [guess_language-spirit](https://pypi.org/project/guess_language-spirit/)
* [guess_language](https://pypi.org/project/guess-language/)
* [cld3](https://github.com/google/cld3)

### text extraction

* [JusText](http://corpus.tools/wiki/Justext_changelog) [demo](https://nlp.fi.muni.cz/projects/justext/)

### deduplication

* [PostgreSQL extension smlar](https://github.com/jirutka/smlar)
* [use smlar](https://medium.datadriveninvestor.com/the-smlar-plug-in-for-effective-retrieval-of-massive-volumes-of-simhash-data-e429c19da1a3)
* remove paragraphs with more than 50% word-7-tuples encountered previously
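
The paragraph-level rule in the last item could be sketched roughly as follows. This is only an illustration under assumptions: the function names `word_tuples` and `dedup_paragraphs` are hypothetical, words are split on whitespace, paragraphs shorter than 7 words are always kept, and the tuples of dropped paragraphs are still recorded as seen.

```python
from typing import Iterable, List, Optional, Set, Tuple

WordTuple = Tuple[str, ...]


def word_tuples(text: str, n: int = 7) -> List[WordTuple]:
    """Return all runs of n consecutive words (split on whitespace)."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def dedup_paragraphs(
    paragraphs: Iterable[str],
    seen: Optional[Set[WordTuple]] = None,
    n: int = 7,
    threshold: float = 0.5,
) -> List[str]:
    """Drop paragraphs whose share of already-seen word-n-tuples exceeds threshold."""
    if seen is None:
        seen = set()
    kept = []
    for paragraph in paragraphs:
        tuples = word_tuples(paragraph, n)
        # Assumption: a paragraph with fewer than n words has no tuples and is kept.
        if tuples:
            duplicates = sum(1 for t in tuples if t in seen)
            if duplicates / len(tuples) > threshold:
                # Assumption: tuples of dropped paragraphs still count as seen.
                seen.update(tuples)
                continue
        seen.update(tuples)
        kept.append(paragraph)
    return kept
```

Passing the same `seen` set across several documents would extend the check from a single page to a whole crawl; for large crawls the set could be replaced by hashes of the tuples to save memory.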

### Extract more meta tags

* https://github.com/shareaholic/shareaholic-api-docs/blob/master/shareaholic_meta_tags.md
* https://support.shareaholic.com/hc/en-us/articles/115003085186

### Date parsing dependent on language

* https://en.wikipedia.org/wiki/Date_format_by_country
* https://en.wikipedia.org/wiki/Common_Locale_Data_Repository
* https://pypi.org/project/dateparser/
* https://github.com/ovalhub/pyicu
* https://github.com/night-crawler/cldr-language-helpers
* https://stackoverflow.com/questions/19927654/using-dateutil-parser-to-parse-a-date-in-another-language

ICU

* https://unicode-org.github.io/icu/userguide/format_parse/datetime/examples.html#parse
* https://gist.github.com/dpk/8325992
* https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1DateFormat.html
* https://unicode-org.github.io/icu/userguide/
* https://unicode-org.github.io/icu-docs/#/icu4c/
* https://github.com/ovalhub/pyicu/blob/master/samples/break.py
* https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table
* https://www.unicode.org/reports/tr35/tr35-dates.html#months_days_quarters_eras
* https://unicode-org.github.io/icu/userguide/format_parse/datetime/#formatting-dates-and-times-overview
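
To illustrate why the page language matters for date parsing, here is a minimal standard-library sketch. It is not how dateparser or ICU work internally; the `MONTHS` table, the `parse_date` function, and the restriction to "day month year" order are assumptions made for the example, whereas the linked libraries rely on full CLDR locale data.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical, hand-written month tables; real tools take these from CLDR.
MONTHS = {
    "en": ["january", "february", "march", "april", "may", "june", "july",
           "august", "september", "october", "november", "december"],
    "de": ["januar", "februar", "märz", "april", "mai", "juni", "juli",
           "august", "september", "oktober", "november", "dezember"],
    "fr": ["janvier", "février", "mars", "avril", "mai", "juin", "juillet",
           "août", "septembre", "octobre", "novembre", "décembre"],
}

# Assumption: only the "day month-name year" order is handled, e.g. "3. März 2021".
DAY_MONTH_YEAR = re.compile(r"(\d{1,2})\.?\s+(\w+)\s+(\d{4})")


def parse_date(text: str, lang: str) -> Optional[date]:
    """Parse a 'day month year' date using the month names of the given language."""
    match = DAY_MONTH_YEAR.search(text.lower())
    if match is None:
        return None
    day, month_name, year = match.groups()
    try:
        month = MONTHS[lang].index(month_name) + 1
    except (KeyError, ValueError):
        # Unknown language, or the month name does not belong to it.
        return None
    try:
        return date(int(year), month, int(day))
    except ValueError:
        # For example day 31 in a 30-day month.
        return None
```

The same string can parse in one language and fail in another, which is why the detected language of a page would need to be passed on to the date parser.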