Related work
crawlers
- acrawler
- trafilatura
- aiohttp_spider
- scrapy
- heritrix3
- YaCy
- searchmysite
- spiderling
- https://github.com/riteshnaik/Crawling-and-Deduplication-of-Polar-Datasets-Using-Nutch-and-Tika
- edge search engine
general
sitemap parsers
url handling
language detection
text extraction
deduplication
- PostgreSQL extension smlar
- use smlar
- remove paragraphs in which more than 50% of the word 7-tuples (7 consecutive words) have been encountered previously
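
The paragraph rule above can be sketched in plain Python, independent of smlar. This is only an illustration under assumptions: the names `word_shingles` and `drop_repeated_paragraphs` are hypothetical, tokenization is a simple lowercase `\w+` split, each distinct 7-tuple is counted once per paragraph, and paragraphs shorter than 7 words are always kept.

```python
import re


def word_shingles(text, n=7):
    """Return the set of distinct word n-tuples in text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def drop_repeated_paragraphs(paragraphs, seen=None, n=7, threshold=0.5):
    """Keep paragraphs unless more than `threshold` of their n-tuples
    were already encountered in earlier paragraphs.

    `seen` may be passed in to carry state across documents (assumption:
    an in-memory set; a real crawler would likely persist this).
    """
    seen = set() if seen is None else seen
    kept = []
    for paragraph in paragraphs:
        shingles = word_shingles(paragraph, n)
        if shingles:
            repeated = sum(1 for s in shingles if s in seen)
            is_duplicate = repeated / len(shingles) > threshold
        else:
            # Too short to form a single n-tuple: keep it (assumption).
            is_duplicate = False
        # Remember the tuples either way so later copies are detected.
        seen.update(shingles)
        if not is_duplicate:
            kept.append(paragraph)
    return kept
```

Passing the same `seen` set to successive calls would extend the check across pages of one site, for example to drop repeated footers or cookie notices.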
Extract more meta tags
- https://github.com/shareaholic/shareaholic-api-docs/blob/master/shareaholic_meta_tags.md
- https://support.shareaholic.com/hc/en-us/articles/115003085186
Date parsing dependent on language
- https://en.wikipedia.org/wiki/Date_format_by_country
- https://en.wikipedia.org/wiki/Common_Locale_Data_Repository
- https://pypi.org/project/dateparser/
- https://github.com/ovalhub/pyicu
- https://github.com/night-crawler/cldr-language-helpers
- https://stackoverflow.com/questions/19927654/using-dateutil-parser-to-parse-a-date-in-another-language
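
To make the problem concrete, here is a minimal stdlib-only sketch of why the page language has to be known before parsing a date. Everything in it is an assumption for illustration: the `MONTHS` table is hand-written for two languages only, and `parse_day_month_year` is a hypothetical helper that handles just the "day month-name year" shape. Real month names and patterns would come from CLDR, for example via the dateparser or PyICU packages linked above.

```python
import re
from datetime import date

# Hand-written sample data (assumption); CLDR provides this per locale.
MONTHS = {
    "en": ["January", "February", "March", "April", "May", "June", "July",
           "August", "September", "October", "November", "December"],
    "de": ["Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
           "August", "September", "Oktober", "November", "Dezember"],
}


def parse_day_month_year(text, lang):
    """Parse strings like '3 March 2021' or '3. März 2021'.

    Returns a datetime.date, or None if the text does not match the
    pattern or the month name is unknown in the given language.
    """
    match = re.fullmatch(r"\s*(\d{1,2})\.?\s+(\w+)\s+(\d{4})\s*", text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    names = [name.lower() for name in MONTHS.get(lang, [])]
    try:
        month = names.index(month_name.lower()) + 1
        return date(int(year), month, int(day))
    except ValueError:
        # Unknown month name for this language, or an impossible date.
        return None
```

The same string parses under one language and fails under another, which is why language detection has to run before date extraction.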
ICU
- https://unicode-org.github.io/icu/userguide/format_parse/datetime/examples.html#parse
- https://gist.github.com/dpk/8325992
- https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1DateFormat.html
- https://unicode-org.github.io/icu/userguide/
- https://unicode-org.github.io/icu-docs/#/icu4c/
- https://github.com/ovalhub/pyicu/blob/master/samples/break.py
- https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table
- https://www.unicode.org/reports/tr35/tr35-dates.html#months_days_quarters_eras
- https://unicode-org.github.io/icu/userguide/format_parse/datetime/#formatting-dates-and-times-overview