atextcrawler/doc/source/devel/devel.md

## Setup dev environment
1. You need Python 3.9 or later.
1. Install pipenv. For example, install pip3 with `apt install python3-pip`, then run `pip3 install --user pipenv`.
1. Clone the repo and set up a virtualenv:
```
cd YOUR_DEV_DIR
git clone ssh://gitea@gitea-ssh.multiname.org:20106/a-text/atextcrawler.git
cd atextcrawler
pipenv install -d
```
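
Steps 1 and 2 can be checked before running `pipenv install -d`. The following is a minimal sketch and not part of atextcrawler: the function name `check_prerequisites` and its parameters are made up for illustration. It only checks the interpreter version and whether a `pipenv` executable is on the `PATH`.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in step 1


def check_prerequisites(version=None, which=shutil.which):
    """Return a list of problems found; the list is empty if all is fine.

    `version` and `which` are hypothetical parameters that exist only so
    the check can be exercised without touching the real environment.
    """
    version = tuple(version or sys.version_info[:2])
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]} or later is required, "
            f"found {version[0]}.{version[1]}"
        )
    if which("pipenv") is None:
        problems.append("pipenv was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

If pipenv was installed with `pip3 install --user`, its executable typically ends up in `~/.local/bin`, so that directory may have to be added to `PATH` before the second check passes.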

## Configure the instance
See [installation](installation.md).

## Run
```
python -m atextcrawler
```

## Logging
Use the configured `instance_name` (e.g. `atextcrawler_dev`) as the syslog identifier to select journal messages:
```
journalctl -ef SYSLOG_IDENTIFIER=atextcrawler_dev
```
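
When several instances run on one machine, the same filter can be built per instance. This is a small sketch, assuming only that `journalctl` is available; the helper names `journalctl_command` and `follow_journal` are hypothetical and do not exist in atextcrawler.

```python
import subprocess


def journalctl_command(instance_name, follow=True):
    """Build the journalctl argument list for one instance."""
    command = ["journalctl", "-e"]  # -e: jump to the end of the journal
    if follow:
        command.append("-f")  # -f: keep printing new messages
    # journalctl accepts FIELD=VALUE matches as positional arguments.
    command.append(f"SYSLOG_IDENTIFIER={instance_name}")
    return command


def follow_journal(instance_name):
    """Stream matching journal messages to the terminal until interrupted."""
    return subprocess.run(journalctl_command(instance_name), check=False)
```

Calling `follow_journal("atextcrawler_dev")` runs the same command as shown above, with `-ef` written as two separate flags.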

## Upgrading
Upgrade the dev tools (the pre-commit hooks) to their latest versions:
```
pre-commit autoupdate
```

## Test and clean manually
```
AIOPGQ_POSTGRESQL="host=127.0.0.1 port=5432 database=atextcrawler-dev user=atextcrawler-dev password=*************" python -W ignore -m unittest discover
mypy --ignore-missing-imports src/atextcrawler
isort src/atextcrawler
black -S -t py37 -l 79 src/atextcrawler
pybetter --exclude B004,B007,B008 src/atextcrawler
interrogate -i -I -m -v src/atextcrawler
```
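
The static checks above can also be chained from one script. The sketch below is an assumption about how one might do that, not an existing project script: `CHECKS`, `run_checks` and the `runner` parameter are hypothetical names. The unittest line is left out because it needs the `AIOPGQ_POSTGRESQL` connection string.

```python
import shlex
import subprocess

# Commands copied from the block above, in the same order.
CHECKS = [
    "mypy --ignore-missing-imports src/atextcrawler",
    "isort src/atextcrawler",
    "black -S -t py37 -l 79 src/atextcrawler",
    "pybetter --exclude B004,B007,B008 src/atextcrawler",
    "interrogate -i -I -m -v src/atextcrawler",
]


def run_checks(checks=CHECKS, runner=subprocess.run):
    """Run every check in order and return the commands that failed.

    A command counts as failed when it exits with a non-zero status.
    All checks are run even if an earlier one fails.
    """
    failed = []
    for command in checks:
        result = runner(shlex.split(command))
        if result.returncode != 0:
            failed.append(command)
    return failed


if __name__ == "__main__":
    for command in run_checks():
        print(f"FAILED: {command}")
```

Running all checks instead of stopping at the first failure is a design choice of this sketch; it gives one complete report per run.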

## Release
There are currently no releases.

## Useful commands

### Fetch a resource or a site manually
```
python -m atextcrawler.resource https://www.katesharpleylibrary.net/
python -m atextcrawler.site https://www.katesharpleylibrary.net/
```

### SQL
```
-- Drop the crawler tables (destructive)
drop table crawl; drop table site_path; drop table resource; drop table site cascade; drop table site_feed; drop table site_link; drop table site_queue; drop table kvs;

-- Shell (HTTPie), not SQL: delete the anarchism_text_* indices in Elasticsearch
http -j --auth elastic:*********************** DELETE 'http://127.0.0.1:9200/anarchism_text_*'

-- Shell (HTTPie), not SQL: list the Elasticsearch indices
http -j --auth elastic:*********************** GET http://127.0.0.1:9200/_cat/indices

-- stats: sites, paths, resources
select s.id site_id, s.base_url, spr.n_paths, spr.n_resources, spr.n_chars
from site s
left join (
    select sp.site_id, count(sp.path) n_paths, count(r.id) n_resources, sum(r.text_len) n_chars
    from site_path sp
    left join resource r on sp.resource_id=r.id
    group by sp.site_id
) spr on spr.site_id=s.id
where s.relevant
order by s.id;
```
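
To see what the stats query returns, it can be run against a toy dataset. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL, which is an assumption made only to keep the example self-contained. The schema is hypothetical and contains only the columns the query touches, and the URLs and numbers are made up.

```python
import sqlite3

# Hypothetical minimal schema: only the columns used by the stats query.
SCHEMA = """
create table site (id integer primary key, base_url text, relevant boolean);
create table resource (id integer primary key, text_len integer);
create table site_path (site_id integer, path text, resource_id integer);
"""

STATS_QUERY = """
select s.id site_id, s.base_url, spr.n_paths, spr.n_resources, spr.n_chars
from site s
left join (
    select sp.site_id, count(sp.path) n_paths, count(r.id) n_resources,
           sum(r.text_len) n_chars
    from site_path sp
    left join resource r on sp.resource_id = r.id
    group by sp.site_id
) spr on spr.site_id = s.id
where s.relevant
order by s.id
"""


def demo_stats():
    """Load made-up rows, run the stats query, and return its result rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "insert into site values (?, ?, ?)",
        [
            (1, "https://a.example/", 1),  # relevant, has paths
            (2, "https://b.example/", 1),  # relevant, no paths yet
            (3, "https://c.example/", 0),  # not relevant, filtered out
        ],
    )
    conn.executemany(
        "insert into resource values (?, ?)", [(10, 100), (11, 250)]
    )
    conn.executemany(
        "insert into site_path values (?, ?, ?)",
        [(1, "/", 10), (1, "/about", 11), (1, "/missing", None)],
    )
    rows = conn.execute(STATS_QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in demo_stats():
        print(row)
```

With this data, site 1 is reported with 3 paths, 2 resources and 350 characters, because `count(r.id)` and `sum(r.text_len)` skip the path that has no resource. Site 2 appears with `None` in all three counters, since the outer left join finds no matching paths, and site 3 is excluded by `where s.relevant`.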

Put under version control 2021-11-29 09:16:31 +00:00