atextcrawler/doc/source/devel/todo.md
2021-11-29 09:16:31 +00:00

## TODO
* parse HTML `<time>` tags
* site annotations:
  * categories
  * historical (no changes for n months)
  * news
  * local focus - geonames: http://download.geonames.org/export/dump/cities15000.zip
* allow TLS in the Elasticsearch config
* replace dashes, dots and quotes: https://github.com/kovidgoyal/calibre/blob/3dd95981398777f3c958e733209f3583e783b98c/src/calibre/utils/unsmarten.py
```
'–': '--',
'—': '---',
'…': '...',
'“': '"',
'”': '"',
'„': '"',
'″': '"',
'‘': "'",
'’': "'",
'′': "'",
```
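A minimal sketch of how such a table could be applied, assuming only literal Unicode characters need replacing. The names `UNSMARTEN` and `unsmarten` are illustrative and not taken from the project or from calibre; the keys are written as escapes to avoid look-alike characters.

```python
# Hypothetical mapping of typographic characters to ASCII equivalents.
UNSMARTEN = {
    "\u2013": "--",   # en dash
    "\u2014": "---",  # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u201e": '"',    # double low-9 quotation mark
    "\u2033": '"',    # double prime
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u2032": "'",    # prime
}

# str.maketrans accepts single-character keys mapped to strings of any length.
_TABLE = str.maketrans(UNSMARTEN)


def unsmarten(text: str) -> str:
    """Replace typographic dashes, ellipses and quotes with ASCII."""
    return text.translate(_TABLE)
```

Building the translation table once and reusing it keeps the per-document cost to a single pass over the text.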
* normalize quotation marks and punctuation in general
  * https://unicode-table.com/en/sets/quotation-marks/
  * https://github.com/avian2/unidecode/blob/master/unidecode/x020.py
  * https://www.fileformat.info/info/unicode/category/Po/list.htm
  * https://www.gaijin.at/en/infos/unicode-character-table-punctuation
* cancel crawls that take too long
* search for "TODO" in code
* feedparser has supported JSON feeds since commit
  a5939702b1fd0ec75d2b586255ff0e29e5a8a6fc
  (as of 2020-10-26 only in the "develop" branch, not part of a release);
  the version names are 'json1' and 'json11'
* allow site URLs with a path, e.g.
  https://web.archive.org/web/20090320055457/http://www.geocities.com/kk_abacus/
* add more languages
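
For the item on cancelling crawls that take too long, one possible approach is sketched below. It assumes each crawl runs as an asyncio coroutine; `crawl_site` and `crawl_with_timeout` are hypothetical names used only for illustration, and the sleep stands in for real crawling work.

```python
import asyncio


async def crawl_site(delay: float) -> str:
    """Hypothetical stand-in for a crawl; it only sleeps for `delay` seconds."""
    await asyncio.sleep(delay)
    return "done"


async def crawl_with_timeout(delay: float, timeout: float) -> str:
    """Run the crawl and cancel it if it exceeds `timeout` seconds."""
    try:
        return await asyncio.wait_for(crawl_site(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the inner coroutine at this point.
        return "cancelled"


if __name__ == "__main__":
    print(asyncio.run(crawl_with_timeout(0.01, 1.0)))
    print(asyncio.run(crawl_with_timeout(1.0, 0.01)))
```

Since `asyncio.wait_for` cancels the wrapped coroutine on timeout, any cleanup the crawl needs would have to live in a `try`/`finally` inside the crawl itself.
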
## Ideas
* use [python-libzim](https://github.com/openzim/python-libzim) to create ZIM archives
* [spacy-langdetect](https://pypi.org/project/spacy-langdetect/)
* [langid.py](https://github.com/saffsd/langid.py)
* [gain](https://github.com/gaojiuli/gain)
* [ruia](https://docs.python-ruia.org/)
* [demiurge](https://demiurge.readthedocs.io/)
* [cocrawler](https://github.com/cocrawler/cocrawler/)
* [aiocrawler](https://github.com/tapanpandita/aiocrawler/)