Put under version control

This commit is contained in:
parent d26d23348b
commit a6af5b12d2

83 changed files with 20130 additions and 0 deletions

doc/source/devel/devel.md (new file, 63 lines)

## Setup dev environment

1. You need Python 3.9 or later.
1. Have pipenv installed, e.g. like this: install pip3 (e.g. with `apt install python3-pip`), then run `pip3 install --user pipenv`.
1. Clone the repo and set up a virtualenv:

```
cd YOUR_DEV_DIR
git clone ssh://gitea@gitea-ssh.multiname.org:20106/a-text/atextcrawler.git
cd atextcrawler
pipenv install -d
```

## Configure the instance

See [installation](installation.md).

## Run

```
python -m atextcrawler
```

## Logging

Use the configured instance_name (e.g. `atextcrawler_dev`) to select journal messages:

```
journalctl -ef SYSLOG_IDENTIFIER=atextcrawler_dev
```

## Upgrading

Upgrade dev tools:

```
pre-commit autoupdate
```

## Test and clean manually

```
AIOPGQ_POSTGRESQL="host=127.0.0.1 port=5432 database=atextcrawler-dev user=atextcrawler-dev password=*************" python -W ignore -m unittest discover
mypy --ignore-missing-imports src/atextcrawler
isort src/atextcrawler
black -S -t py37 -l 79 src/atextcrawler
pybetter --exclude B004,B007,B008 src/atextcrawler
interrogate -i -I -m -v src/atextcrawler
```

## Release

There are no releases (currently).

## Useful commands

### Fetch a resource or a site manually

```
python -m atextcrawler.resource https://www.katesharpleylibrary.net/
python -m atextcrawler.site https://www.katesharpleylibrary.net/
```

### SQL

```
drop table crawl; drop table site_path; drop table resource; drop table site cascade; drop table site_feed; drop table site_link; drop table site_queue; drop table kvs;

http -j --auth elastic:*********************** DELETE http://127.0.0.1:9200/anarchism_text_*

http -j --auth elastic:*********************** GET http://127.0.0.1:9200/_cat/indices

-- stats: sites, paths, resources
select s.id site_id, s.base_url, spr.n_paths, spr.n_resources, spr.n_chars from site s left join (select sp.site_id, count(sp.path) n_paths, count(r.id) n_resources, sum(r.text_len) n_chars from site_path sp left join resource r on sp.resource_id=r.id group by sp.site_id) spr on spr.site_id=s.id where s.relevant order by s.id;
```

doc/source/devel/related_work.md (new file, 64 lines)

## Related work

* [collection of crawlers](https://github.com/adbar/awesome-crawler)
* [collection of webscrapers](https://github.com/adbar/awesome-web-scraper)

### crawlers

* [acrawler](https://acrawler.readthedocs.io/en/latest/)
* [trafilatura](https://trafilatura.readthedocs.io/en/latest/index.html)
  * [repo](https://github.com/adbar/trafilatura)
  * [intro](https://adrien.barbaresi.eu/blog/trafilatura-main-text-content-python.html)
* [aiohttp_spider](https://github.com/niklak/aiohttp_spider/)
* [scrapy](https://docs.scrapy.org/en/latest/)
* [heritrix3](https://github.com/internetarchive/heritrix3/)
* [YaCy](https://yacy.net/)
* [searchmysite](https://searchmysite.net/)
* [spiderling](http://corpus.tools/raw-attachment/wiki/Downloads/spiderling-src-0.84.tar.xz)
* https://github.com/riteshnaik/Crawling-and-Deduplication-of-Polar-Datasets-Using-Nutch-and-Tika
* [edge search engine](https://memex.marginalia.nu/projects/edge/about.gmi)

#### general

* [elastic enterprise search](https://www.elastic.co/blog/building-a-scalable-easy-to-use-web-crawler-for-elastic-enterprise-search)

### sitemap parsers

* [ultimate-sitemap-parser](https://github.com/mediacloud/ultimate-sitemap-parser)

### url handling

* [courlan](https://pypi.org/project/courlan/)

### language detection

* [overview](https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language)
* [guess_language-spirit](https://pypi.org/project/guess_language-spirit/)
* [guess_language](https://pypi.org/project/guess-language/)
* [cld3](https://github.com/google/cld3)

### text extraction

* [JusText](http://corpus.tools/wiki/Justext_changelog) ([demo](https://nlp.fi.muni.cz/projects/justext/))

### deduplication

* [PostgreSQL extension smlar](https://github.com/jirutka/smlar)
  * [use smlar](https://medium.datadriveninvestor.com/the-smlar-plug-in-for-effective-retrieval-of-massive-volumes-of-simhash-data-e429c19da1a3)
* remove paragraphs with more than 50% word-7-tuples encountered previously

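The 7-tuple rule above can be sketched in plain Python: a paragraph counts as a duplicate when more than 50% of its word 7-tuples were already seen. The names (`Deduplicator`, `is_duplicate`) are illustrative, not taken from the codebase:

```python
class Deduplicator:
    """Track word 7-tuples; flag paragraphs that are mostly repeats."""

    def __init__(self, ngram_size=7, threshold=0.5):
        self.ngram_size = ngram_size
        self.threshold = threshold
        self.seen = set()  # all word tuples encountered so far

    def is_duplicate(self, paragraph: str) -> bool:
        words = paragraph.split()
        n = self.ngram_size
        tuples = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        if not tuples:
            return False  # too short to judge
        repeats = sum(1 for t in tuples if t in self.seen)
        self.seen.update(tuples)
        return repeats / len(tuples) > self.threshold
```

A real implementation would hash the tuples (e.g. simhash, as smlar suggests) instead of storing them verbatim.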
### Extract more meta tags

* https://github.com/shareaholic/shareaholic-api-docs/blob/master/shareaholic_meta_tags.md
* https://support.shareaholic.com/hc/en-us/articles/115003085186

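Collecting meta tags needs nothing beyond the stdlib; a minimal sketch (names `MetaTagParser`/`extract_meta` are illustrative, not from the codebase):

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta name=...|property=...> content values into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        d = dict(attrs)
        key = d.get('name') or d.get('property')
        if key and 'content' in d:
            self.meta[key] = d['content']


def extract_meta(html: str) -> dict:
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta
```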
### Date parsing dependent on language

* https://en.wikipedia.org/wiki/Date_format_by_country
* https://en.wikipedia.org/wiki/Common_Locale_Data_Repository
* https://pypi.org/project/dateparser/
* https://github.com/ovalhub/pyicu
* https://github.com/night-crawler/cldr-language-helpers
* https://stackoverflow.com/questions/19927654/using-dateutil-parser-to-parse-a-date-in-another-language

ICU:

* https://unicode-org.github.io/icu/userguide/format_parse/datetime/examples.html#parse
* https://gist.github.com/dpk/8325992
* https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1DateFormat.html
* https://unicode-org.github.io/icu/userguide/
* https://unicode-org.github.io/icu-docs/#/icu4c/
* https://github.com/ovalhub/pyicu/blob/master/samples/break.py
* https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table
* https://www.unicode.org/reports/tr35/tr35-dates.html#months_days_quarters_eras
* https://unicode-org.github.io/icu/userguide/format_parse/datetime/#formatting-dates-and-times-overview

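The core problem: the same digit string parses to different dates depending on the region's conventions. A stdlib-only toy (the `DATE_FORMATS` table is a hand-rolled stand-in for CLDR/ICU data, which is what dateparser or PyICU would supply in practice):

```python
from datetime import date, datetime

# Tiny illustrative stand-in for CLDR short-date patterns by region;
# real coverage would come from ICU/CLDR or the dateparser package.
DATE_FORMATS = {
    'US': '%m/%d/%Y',   # month first
    'GB': '%d/%m/%Y',   # day first
    'DE': '%d.%m.%Y',   # day first, dot-separated
}


def parse_local_date(text: str, region: str) -> date:
    """Parse a short date string according to a region's convention."""
    return datetime.strptime(text, DATE_FORMATS[region]).date()
```

The ambiguity is visible directly: `'03/04/2021'` is March 4 under US rules but April 3 under GB rules.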

doc/source/devel/todo.md (new file, 77 lines)

## TODO

* parse html time tags

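Parsing html `<time>` tags can be sketched with the stdlib alone; the names (`TimeTagParser`, `extract_times`) are illustrative, not from the codebase:

```python
from datetime import datetime
from html.parser import HTMLParser


class TimeTagParser(HTMLParser):
    """Collect the datetime attributes of <time> tags as datetime objects."""

    def __init__(self):
        super().__init__()
        self.times = []

    def handle_starttag(self, tag, attrs):
        if tag != 'time':
            return
        value = dict(attrs).get('datetime')
        if value:
            try:
                self.times.append(datetime.fromisoformat(value))
            except ValueError:
                pass  # ignore values that are not ISO 8601


def extract_times(html: str) -> list:
    parser = TimeTagParser()
    parser.feed(html)
    return parser.times
```

Note that on Python 3.9 `datetime.fromisoformat` does not accept a trailing `Z`; timezone suffixes would need extra handling.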
* site annotations:
  * categories
  * historical (no more changes for n months)
  * news
  * local focus - geonames: http://download.geonames.org/export/dump/cities15000.zip

* allow for tls in elasticsearch config

* replace dashes, dots and quotes: https://github.com/kovidgoyal/calibre/blob/3dd95981398777f3c958e733209f3583e783b98c/src/calibre/utils/unsmarten.py

```
'–': '--',
'—': '---',
'…': '...',
'“': '"',
'”': '"',
'„': '"',
'″': '"',
'‘': "'",
'’': "'",
'′': "'",
```
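A mapping like this can be applied in one pass with `str.translate`, which accepts multi-character replacement strings; a minimal sketch (the `unsmarten` name is illustrative, not from calibre or the codebase):

```python
# Sketch: apply dash/dot/quote replacements in one pass via str.translate.
UNSMARTEN = str.maketrans({
    '\u2013': '--',   # – en dash
    '\u2014': '---',  # — em dash
    '\u2026': '...',  # … ellipsis
    '\u201c': '"',    # “ left double quote
    '\u201d': '"',    # ” right double quote
    '\u201e': '"',    # „ low double quote
    '\u2033': '"',    # ″ double prime
    '\u2018': "'",    # ‘ left single quote
    '\u2019': "'",    # ’ right single quote
    '\u2032': "'",    # ′ prime
})


def unsmarten(text: str) -> str:
    return text.translate(UNSMARTEN)
```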

* normalize quotation marks and punctuation in general
  * https://unicode-table.com/en/sets/quotation-marks/
  * https://github.com/avian2/unidecode/blob/master/unidecode/x020.py
  * https://www.fileformat.info/info/unicode/category/Po/list.htm
  * https://www.gaijin.at/en/infos/unicode-character-table-punctuation
  * ⁝

* cancel crawls that take too long

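Since the crawler is asyncio-based, cancelling overlong crawls could use `asyncio.wait_for`, which cancels the wrapped coroutine on timeout. A sketch with hypothetical names (`run_with_timeout`, the dummy crawl coroutines):

```python
import asyncio


async def run_with_timeout(coro, timeout: float):
    """Run a crawl coroutine, cancelling it if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(coro, timeout)
    except asyncio.TimeoutError:
        return None  # crawl was cancelled


async def fast_crawl():
    return 'done'


async def slow_crawl():
    await asyncio.sleep(10)  # stands in for a crawl that hangs
    return 'done'
```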

* search for "TODO" in code

* feedparser has support for JSON feeds since commit a5939702b1fd0ec75d2b586255ff0e29e5a8a6fc (as of 2020-10-26 in the "develop" branch, not part of a release); the version names are 'json1' and 'json11'

* allow site URLs with path, e.g. https://web.archive.org/web/20090320055457/http://www.geocities.com/kk_abacus/

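Allowing a path in the site URL means the "does this URL belong to the site" check must compare path prefixes, not just scheme and host. A stdlib sketch (the `in_site` helper is hypothetical, not from the codebase):

```python
from urllib.parse import urlsplit


def in_site(base_url: str, url: str) -> bool:
    """True if url lies under base_url, honoring a path prefix in the base."""
    base, u = urlsplit(base_url), urlsplit(url)
    if (base.scheme, base.netloc) != (u.scheme, u.netloc):
        return False
    return u.path.startswith(base.path)
```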

* add more languages


## Ideas

* use [python-libzim](https://github.com/openzim/python-libzim) to create ZIM archives

* [spacy-langdetect](https://pypi.org/project/spacy-langdetect/)
* [langid.py](https://github.com/saffsd/langid.py)

* [gain](https://github.com/gaojiuli/gain)
* [ruia](https://docs.python-ruia.org/)
* [demiurge](https://demiurge.readthedocs.io/)
* [cocrawler](https://github.com/cocrawler/cocrawler/)
* [aiocrawler](https://github.com/tapanpandita/aiocrawler/)