atextcrawler/doc/source/devel/todo.md at 028be1631ded9038e6f23ae11c7502e6e3282cf3

ibu a6af5b12d2 Put under version control

2021-11-29 09:16:31 +00:00

2.2 KiB


TODO

parse html time tags
site annotations:
- categories
  - historical (no changes for n months)
  - news
- local focus - geonames: http://download.geonames.org/export/dump/cities15000.zip
allow for tls in elasticsearch config
replace dashes, dots and quotes: 3dd9598139/src/calibre/utils/unsmarten.py

        '&#8211;': '--',
        '&ndash;': '--',
        '–': '--',
        '&#8212;': '---',
        '&mdash;': '---',
        '—': '---',
        '&#8230;': '...',
        '&hellip;': '...',
        '…': '...',
        '&#8220;': '"',
        '&#8221;': '"',
        '&#8222;': '"',
        '&#8243;': '"',
        '&ldquo;': '"',
        '&rdquo;': '"',
        '&bdquo;': '"',
        '&Prime;': '"',
        '“': '"',
        '”': '"',
        '„': '"',
        '″': '"',
        '&#8216;': "'",
        '&#8217;': "'",
        '&#8242;': "'",
        '&lsquo;': "'",
        '&rsquo;': "'",
        '&prime;': "'",
        '‘': "'",
        '’': "'",
        '′': "'",
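
One way the mapping above could be applied, as a rough sketch only: the names `unsmarten`, `_BASE` and `REPLACEMENTS` are made up for this note and are not taken from calibre or from atextcrawler. The table is generated from the ten entity names in the list, so each character is covered in its named, numeric and literal form.

```python
import re
from html.entities import name2codepoint

# Entity name -> plain ASCII replacement (same pairs as in the list above).
_BASE = {
    "ndash": "--", "mdash": "---", "hellip": "...",
    "ldquo": '"', "rdquo": '"', "bdquo": '"', "Prime": '"',
    "lsquo": "'", "rsquo": "'", "prime": "'",
}

# Expand each entry to its three spellings: &name;, &#NNNN; and the literal char.
REPLACEMENTS = {}
for name, plain in _BASE.items():
    codepoint = name2codepoint[name]
    REPLACEMENTS[f"&{name};"] = plain
    REPLACEMENTS[f"&#{codepoint};"] = plain
    REPLACEMENTS[chr(codepoint)] = plain

# One alternation over all keys, so the text is scanned only once.
_PATTERN = re.compile("|".join(re.escape(key) for key in REPLACEMENTS))


def unsmarten(text: str) -> str:
    """Replace typographic dashes, ellipses and quotes with ASCII equivalents."""
    return _PATTERN.sub(lambda match: REPLACEMENTS[match.group(0)], text)
```

Other entities such as `&amp;` are left untouched on purpose; only the forms from the list are rewritten.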

normalize quotation marks and punctuation in general
cancel crawls that take too long
search for "TODO" in code
feedparser has supported JSON feeds since commit a5939702b1fd0ec75d2b586255ff0e29e5a8a6fc (as of 2020-10-26 only in the "develop" branch, not part of a release); the version names are 'json1' and 'json11'
allow site URLs with path, e.g. https://web.archive.org/web/20090320055457/http://www.geocities.com/kk_abacus/
add more languages
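
For the "cancel crawls that take too long" item, a minimal sketch, assuming the crawl runs as an asyncio coroutine. `crawl_site` and `crawl_with_timeout` are hypothetical names, and the crawl itself is simulated with a sleep; only `asyncio.wait_for` is the real mechanism shown here.

```python
import asyncio


async def crawl_site(site: str, duration: float) -> str:
    # Hypothetical stand-in for a real crawl: it only sleeps for `duration` seconds.
    await asyncio.sleep(duration)
    return f"{site}: done"


async def crawl_with_timeout(site: str, duration: float, max_seconds: float) -> str:
    # wait_for() cancels the inner coroutine once max_seconds have elapsed
    # and then raises a timeout error in the caller.
    try:
        return await asyncio.wait_for(crawl_site(site, duration), timeout=max_seconds)
    except asyncio.TimeoutError:
        return f"{site}: cancelled"
```

Usage would look like `asyncio.run(crawl_with_timeout("example.org", 0.01, 1.0))`. A real crawl would also need to release its resources (open connections, partial results) when it is cancelled.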

Ideas

use python-libzim to create ZIM archives
spacy-langdetect
langid.py
gain
ruia
demiurge
cocrawler
aiocrawler
