## TODO

* parse HTML time tags
* site annotations:
  * categories
  * historical (no changes for n months)
  * news
  * local focus - geonames: http://download.geonames.org/export/dump/cities15000.zip
* allow for TLS in elasticsearch config
* replace dashes, dots and quotes: https://github.com/kovidgoyal/calibre/blob/3dd95981398777f3c958e733209f3583e783b98c/src/calibre/utils/unsmarten.py
  ```
  '–': '--',
  '—': '---',
  '…': '...',
  '“': '"',
  '”': '"',
  '„': '"',
  '″': '"',
  '‘': "'",
  '’': "'",
  '′': "'",
  ```
* normalize quotation marks and punctuation in general
  * https://unicode-table.com/en/sets/quotation-marks/
  * https://github.com/avian2/unidecode/blob/master/unidecode/x020.py
  * https://www.fileformat.info/info/unicode/category/Po/list.htm
  * https://www.gaijin.at/en/infos/unicode-character-table-punctuation
  * ⁝
* cancel crawls that take too long
* search for "TODO" in code
* feedparser has support for JSON feeds since commit a5939702b1fd0ec75d2b586255ff0e29e5a8a6fc (as of 2020-10-26 in "develop" branch, not part of a release); the version names are 'json1' and 'json11'
* allow site URLs with path, e.g. https://web.archive.org/web/20090320055457/http://www.geocities.com/kk_abacus/
* add more languages

## Ideas

* use [python-libzim](https://github.com/openzim/python-libzim) to create ZIM archives
* [spacy-langdetect](https://pypi.org/project/spacy-langdetect/)
* [langid.py](https://github.com/saffsd/langid.py)
* [gain](https://github.com/gaojiuli/gain)
* [ruia](https://docs.python-ruia.org/)
* [demiurge](https://demiurge.readthedocs.io/)
* [cocrawler](https://github.com/cocrawler/cocrawler/)
* [aiocrawler](https://github.com/tapanpandita/aiocrawler/)
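
The TODO item on replacing dashes, dots and quotes could be approached roughly as sketched below. This is only a minimal illustration, not existing project code: the names `UNSMARTEN_TABLE` and `unsmarten` are hypothetical, and the mapping simply mirrors the characters listed under that item. It relies on `str.maketrans` accepting a dict whose values may be multi-character strings.

```python
# Hypothetical helper (not part of the project): map typographic
# punctuation to plain ASCII, using the characters from the TODO item.
UNSMARTEN_TABLE = str.maketrans({
    "\u2013": "--",   # en dash
    "\u2014": "---",  # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u201e": '"',    # double low-9 quotation mark
    "\u2033": '"',    # double prime
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u2032": "'",    # prime
})


def unsmarten(text: str) -> str:
    """Return text with typographic punctuation replaced by ASCII."""
    return text.translate(UNSMARTEN_TABLE)


if __name__ == "__main__":
    print(unsmarten("\u201eHello\u201c \u2013 it\u2019s\u2026"))
    # prints: "Hello" -- it's...
```

A translation table keeps the replacement to a single pass over the text; covering punctuation "in general" would presumably need a larger table built from the linked Unicode lists.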