## Setup dev environment
1. You need Python 3.9 or later.
1. Install pipenv, e.g. like this: install pip3 (e.g. with `apt install python3-pip`), then run `pip3 install --user pipenv`.
1. Clone the repo and set up a virtualenv:
```
cd YOUR_DEV_DIR
git clone ssh://gitea@gitea-ssh.multiname.org:20106/a-text/atextcrawler.git
cd atextcrawler
pipenv install -d
```

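The Python requirement from step 1 can be checked before creating the virtualenv. The following is a minimal sketch, not part of atextcrawler; the names `MIN_PYTHON` and `python_is_supported` are made up for illustration.

```python
import sys

# Minimum interpreter version stated in the setup steps (assumed constant name).
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) part of version_info meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not python_is_supported():
        raise SystemExit(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]} or later is required")
    print("Python version OK")
```
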
## Configure the instance
See [installation](installation.md).

## Run
```
python -m atextcrawler
```

## Logging
Use the configured `instance_name` (e.g. `atextcrawler_dev`) to select journal messages:
```
journalctl -ef SYSLOG_IDENTIFIER=atextcrawler_dev
```

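If you switch between several instances, the journal filter can be built from the instance name. This is only a sketch: `journalctl_command` is a hypothetical helper, not something atextcrawler provides, and it merely assembles the same `journalctl` call as above (`-e` jumps to the end, `-f` follows new messages).

```python
import shlex


def journalctl_command(instance_name, follow=True):
    """Build the journalctl call that selects one instance's messages."""
    cmd = ["journalctl", "-e"]
    if follow:
        cmd.append("-f")
    # journalctl matches journal fields given as FIELD=value arguments.
    cmd.append(f"SYSLOG_IDENTIFIER={instance_name}")
    return cmd


if __name__ == "__main__":
    # Print the command line; pass the list to subprocess.run() to execute it.
    print(shlex.join(journalctl_command("atextcrawler_dev")))
```
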
## Upgrading
Upgrade dev tools:
```
pre-commit autoupdate
```

## Test and clean manually
```
AIOPGQ_POSTGRESQL="host=127.0.0.1 port=5432 database=atextcrawler-dev user=atextcrawler-dev password=*************" python -W ignore -m unittest discover
mypy --ignore-missing-imports src/atextcrawler
isort src/atextcrawler
black -S -t py37 -l 79 src/atextcrawler
pybetter --exclude B004,B007,B008 src/atextcrawler
interrogate -i -I -m -v src/atextcrawler
```

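The commands above can also be run in one go. The sketch below is an assumption-laden convenience, not part of the repository: `CHECKS`, `postgresql_dsn` and `run_checks` are hypothetical names, the tool options are copied from the commands above, and `AIOPGQ_POSTGRESQL` is passed to every tool although only the unit tests appear to need it.

```python
import os
import subprocess

SRC = "src/atextcrawler"

# Same tools, options and order as the manual commands above.
CHECKS = [
    ["python", "-W", "ignore", "-m", "unittest", "discover"],
    ["mypy", "--ignore-missing-imports", SRC],
    ["isort", SRC],
    ["black", "-S", "-t", "py37", "-l", "79", SRC],
    ["pybetter", "--exclude", "B004,B007,B008", SRC],
    ["interrogate", "-i", "-I", "-m", "-v", SRC],
]


def postgresql_dsn(host, port, database, user, password):
    """Format connection settings like the AIOPGQ_POSTGRESQL value above."""
    return (
        f"host={host} port={port} database={database} "
        f"user={user} password={password}"
    )


def run_checks(dsn, runner=subprocess.run):
    """Run every check in order; return the names of the tools that failed."""
    env = dict(os.environ, AIOPGQ_POSTGRESQL=dsn)
    failed = []
    for cmd in CHECKS:
        # A non-zero exit status marks the tool as failed; later tools still run.
        if runner(cmd, env=env).returncode != 0:
            failed.append(cmd[0])
    return failed
```

For example, `run_checks(postgresql_dsn("127.0.0.1", 5432, "atextcrawler-dev", "atextcrawler-dev", password))` returns an empty list when all six tools exit with status 0. The `runner` parameter exists only so the function can be exercised without the tools installed.
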
## Release
There are currently no releases.

## Useful commands

### Fetch a resource or a site manually
```
python -m atextcrawler.resource https://www.katesharpleylibrary.net/
python -m atextcrawler.site https://www.katesharpleylibrary.net/
```

### SQL
Drop the crawler tables, delete or list the Elasticsearch indices (the `http` commands use HTTPie), and query per-site statistics. The `drop table` statements and the `DELETE` request are destructive.
```
drop table crawl; drop table site_path; drop table resource; drop table site cascade; drop table site_feed; drop table site_link; drop table site_queue; drop table kvs;

http -j --auth elastic:*********************** DELETE http://127.0.0.1:9200/anarchism_text_*

http -j --auth elastic:*********************** GET http://127.0.0.1:9200/_cat/indices

-- stats: sites, paths, resources
select s.id site_id, s.base_url, spr.n_paths, spr.n_resources, spr.n_chars
from site s
left join (
    select sp.site_id, count(sp.path) n_paths, count(r.id) n_resources, sum(r.text_len) n_chars
    from site_path sp left join resource r on sp.resource_id=r.id
    group by sp.site_id
) spr on spr.site_id=s.id
where s.relevant
order by s.id;
```

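To see what the stats query returns without touching a real database, it can be tried against throwaway data. The sketch below swaps in an in-memory SQLite database for PostgreSQL purely to stay self-contained. The schema is a hypothetical minimum containing only the columns the query references (the real tables have more), and the sample sites are invented.

```python
import sqlite3

# The stats query from above, unchanged apart from line breaks.
STATS_QUERY = """
select s.id site_id, s.base_url, spr.n_paths, spr.n_resources, spr.n_chars
from site s
left join (
    select sp.site_id, count(sp.path) n_paths, count(r.id) n_resources,
           sum(r.text_len) n_chars
    from site_path sp left join resource r on sp.resource_id=r.id
    group by sp.site_id
) spr on spr.site_id=s.id
where s.relevant
order by s.id;
"""

# Hypothetical minimal schema: only the columns the query touches.
SCHEMA = """
create table site (id integer primary key, base_url text, relevant boolean);
create table resource (id integer primary key, text_len integer);
create table site_path (site_id integer, path text, resource_id integer);
"""


def demo_stats():
    """Load invented sample rows and return the rows of the stats query."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany("insert into site values (?, ?, ?)", [
        (1, "https://a.example/", True),   # relevant, has paths
        (2, "https://b.example/", True),   # relevant, not crawled yet
        (3, "https://c.example/", False),  # not relevant, filtered out
    ])
    con.executemany("insert into resource values (?, ?)", [(10, 1200), (11, 800)])
    con.executemany("insert into site_path values (?, ?, ?)", [
        (1, "/", 10),
        (1, "/about", 11),
        (1, "/pending", None),  # path without a fetched resource
        (3, "/", None),
    ])
    rows = con.execute(STATS_QUERY).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in demo_stats():
        print(row)
```

With this sample data, site 1 is reported with 3 paths, 2 resources and 2000 characters. Site 2 has no `site_path` rows, so the outer `left join` yields NULL (`None`) in all three count columns rather than 0.
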