## TODO

* parse HTML `<time>` tags
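
A minimal sketch of what parsing `<time>` tags could look like, using only the standard library. The names `TimeTagParser` and `extract_datetimes` are hypothetical and not part of the existing code; only values that `datetime.fromisoformat` accepts are kept.

```python
from datetime import datetime
from html.parser import HTMLParser


class TimeTagParser(HTMLParser):
    """Collect the datetime attribute of every <time> element."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == 'time':
            value = dict(attrs).get('datetime')
            if value:
                self.values.append(value)


def extract_datetimes(html):
    """Return parsed datetimes; values that do not parse are skipped."""
    parser = TimeTagParser()
    parser.feed(html)
    result = []
    for value in parser.values:
        try:
            result.append(datetime.fromisoformat(value))
        except ValueError:
            continue  # e.g. durations such as "PT2H"
    return result
```

Tags without a `datetime` attribute are ignored here; falling back to the element text would be a possible extension.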

* site annotations:
  * categories
    * historical (no changes for n months)
    * news
  * local focus - geonames: http://download.geonames.org/export/dump/cities15000.zip

* allow for TLS in the Elasticsearch config

* replace dashes, dots and quotes: https://github.com/kovidgoyal/calibre/blob/3dd95981398777f3c958e733209f3583e783b98c/src/calibre/utils/unsmarten.py
```
'–': '--',
'–': '--',
'–': '--',
'—': '---',
'—': '---',
'—': '---',
'…': '...',
'…': '...',
'…': '...',
'“': '"',
'”': '"',
'„': '"',
'″': '"',
'“': '"',
'”': '"',
'„': '"',
'″': '"',
'“':'"',
'”':'"',
'„':'"',
'″':'"',
'‘':"'",
'’':"'",
'′':"'",
'‘':"'",
'’':"'",
'′':"'",
'‘':"'",
'’':"'",
'′':"'",
```
* normalize quotation marks and punctuation in general
  * https://unicode-table.com/en/sets/quotation-marks/
  * https://github.com/avian2/unidecode/blob/master/unidecode/x020.py
  * https://www.fileformat.info/info/unicode/category/Po/list.htm
  * https://www.gaijin.at/en/infos/unicode-character-table-punctuation
  * ⁝
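
The replacement table above could be applied roughly as follows. This is only a sketch: `REPLACEMENTS` and `unsmarten` are hypothetical names, only the literal characters from the table are covered, and HTML entities are assumed to have been decoded earlier (for example with `html.unescape`).

```python
# Literal characters from the table above, mapped to ASCII equivalents.
REPLACEMENTS = {
    '–': '--',
    '—': '---',
    '…': '...',
    '“': '"', '”': '"', '„': '"', '″': '"',
    '‘': "'", '’': "'", '′': "'",
}

# str.maketrans accepts single-character keys and string values of any length.
_TABLE = str.maketrans(REPLACEMENTS)


def unsmarten(text):
    """Replace typographic dashes, ellipses and quotes with ASCII."""
    return text.translate(_TABLE)
```

A translation table handles all replacements in one pass over the text; a broader punctuation normalization would need a larger table built from the references listed above.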

* cancel crawls that take too long
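
One way to cancel a long-running crawl, assuming crawls run as asyncio coroutines (an assumption, not something stated in these notes). `crawl_with_timeout` and `slow_crawl` are hypothetical names used for illustration only.

```python
import asyncio


async def crawl_with_timeout(crawl, timeout):
    """Await the crawl; return None if it exceeds `timeout` seconds.

    asyncio.wait_for cancels the awaited coroutine when the timeout expires.
    """
    try:
        return await asyncio.wait_for(crawl, timeout=timeout)
    except asyncio.TimeoutError:
        return None


async def slow_crawl(delay):
    # Stand-in for a real crawl: sleeps, then reports success.
    await asyncio.sleep(delay)
    return 'done'
```

For example, `asyncio.run(crawl_with_timeout(slow_crawl(5), timeout=1))` gives up after about one second and returns `None`.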

* search for "TODO" in code

* feedparser has support for JSON feeds since commit
  a5939702b1fd0ec75d2b586255ff0e29e5a8a6fc
  (as of 2020-10-26 in the "develop" branch, not part of a release);
  the version names are 'json1' and 'json11'

* allow site URLs with path, e.g.
  https://web.archive.org/web/20090320055457/http://www.geocities.com/kk_abacus/
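
A possible starting point for sites identified by a URL with a path: treat the site path as a prefix and check whether a crawled URL falls under it. `belongs_to_site` is a hypothetical helper, and the prefix rule is an assumption about how such sites should be scoped.

```python
from urllib.parse import urlsplit


def belongs_to_site(site_url, url):
    """Return True if `url` has the same origin and lies under the site path."""
    site = urlsplit(site_url)
    target = urlsplit(url)
    if (site.scheme, site.netloc) != (target.scheme, target.netloc):
        return False
    # Compare against a prefix ending in "/" so "/foo" does not match "/foobar".
    prefix = site.path if site.path.endswith('/') else site.path + '/'
    return target.path == site.path or target.path.startswith(prefix)
```

For archived sites like the example above this is only approximate, since Wayback Machine URLs for pages of the same site may carry different timestamps.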

* add more languages

## Ideas

* use [python-libzim](https://github.com/openzim/python-libzim) to create ZIM archives

* [spacy-langdetect](https://pypi.org/project/spacy-langdetect/)
* [langid.py](https://github.com/saffsd/langid.py)

* [gain](https://github.com/gaojiuli/gain)
* [ruia](https://docs.python-ruia.org/)
* [demiurge](https://demiurge.readthedocs.io/)
* [cocrawler](https://github.com/cocrawler/cocrawler/)
* [aiocrawler](https://github.com/tapanpandita/aiocrawler/)