Improve doc: installation

ibu 2021-11-29 11:39:09 +00:00
parent a6af5b12d2
commit 85b194931e
3 changed files with 33 additions and 11 deletions


@@ -38,7 +38,7 @@ crawl:
 # Number of concurrent workers
 # Default value: 10
 # Allowed values: integer >=0 and <=1000
-#workers: 3
+workers: 10
 # Delay in seconds between attempts to fetch items
 # from site_queue if the last attempt gave no item
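The allowed range for `workers` can be checked mechanically before starting the crawler. A minimal sketch — the file path and the config fragment below are illustrative only, not the real `main.yaml`:

```shell
# Sketch: extract `workers` from a YAML fragment and check the allowed
# range (integer >= 0 and <= 1000). Path and fragment are stand-ins.
cat > /tmp/crawl_fragment.yaml <<'EOF'
crawl:
  workers: 10
EOF
workers=$(awk '/^[[:space:]]*workers:/ {print $2}' /tmp/crawl_fragment.yaml)
if [ "$workers" -ge 0 ] && [ "$workers" -le 1000 ]; then
  echo "workers ok: $workers"
else
  echo "workers out of range: $workers" >&2
fi
```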


@@ -12,6 +12,11 @@ pipenv install -d
 ## Configure the instance
 See [installation](installation.md).
+Finally also do
+```
+pre-commit install
+```
 ## Run
 ```
 python -m atextcrawler


@@ -5,9 +5,11 @@ The instructions below are for this system.
 ## System packages
 ```
-apt install pandoc tidy python3-systemd protobuf-compiler libprotobuf-dev
+apt install pandoc tidy python3-systemd openjdk-17-jre-headless
+apt install protobuf-compiler libprotobuf-dev build-essential libpython3-dev
 ```
-The protobuf packages are required for python package gcld3 (see below).
+Java is needed for tika.
+The second line is required for python package gcld3 (see below).

 ## PostgreSQL database
 We need access to a PostgreSQL database. Install PostgreSQL or provide connectivity to a PostgreSQL database over TCP/IP. Create a new database:
@@ -22,6 +24,11 @@ Note: TLS is not yet supported, so install this service locally.
 See [elasticsearch howto](elasticsearch.md).
+Create an API key (using the password for user elastic):
+```
+http --auth elastic:******************* -j POST http://127.0.0.1:9200/_security/api_key name=atext role_descriptors:='{"atext": {"cluster": [], "index": [{"names": ["atext_*"], "privileges": ["all"]}]}}'
+```
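The response of that call contains the generated key, which the crawler config will need. A sketch of extracting it — the hard-coded JSON below is a stand-in for the real response, not actual output:

```shell
# Sketch: parse api_key out of the JSON response. `response` is a
# hard-coded stand-in for the output of the http call above.
response='{"id":"abc123","name":"atext","api_key":"s3cr3tkey"}'
api_key=$(printf '%s' "$response" \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["api_key"])')
echo "api_key: $api_key"
```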
 ## Tensorflow model server
 We need access to a tensorflow model server (over TCP/IP).
 It should serve `universal_sentence_encoder_multilingual`
@@ -40,18 +47,19 @@ cat >>.bashrc <<EOF
 export PYTHONPATH=\$HOME/repo/src
 EOF
 pip3 install --user pipenv
+mkdir repo
 cat >>.profile <<EOF
 PYTHONPATH=\$HOME/repo/src
 PATH=\$HOME/.local/bin:$PATH
+cd repo
 \$HOME/.local/bin/pipenv shell
 EOF
 exit
 su - atextcrawler
-git clone https://gitea.multiname.org/a-text/atextcrawler.git repo
-cd repo
-virtualenv --system-site-packages `pipenv --venv` # for systemd
+rm Pipfile
+git clone https://gitea.multiname.org/a-text/atextcrawler.git $HOME/repo
 pipenv sync
+pipenv install --site-packages # for systemd
+pre-commit install
 ```
 Note: One of the dependencies, Python package `tldextract`,
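Since both `.bashrc` and `.profile` above export `PYTHONPATH`, a quick sanity check before running the crawler can save debugging time. A sketch, assuming the `$HOME/repo/src` path used in those snippets:

```shell
# Sketch: confirm PYTHONPATH ends in repo/src, as the profile snippets
# above intend, before starting the crawler.
export PYTHONPATH=$HOME/repo/src
case "$PYTHONPATH" in
  */repo/src) echo "PYTHONPATH ok: $PYTHONPATH" ;;
  *)          echo "PYTHONPATH misconfigured: $PYTHONPATH" >&2 ;;
esac
```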
@@ -63,7 +71,7 @@ $HOME/.cache/python-tldextract/
 ## Configure atextcrawler
 As user `atextcrawler` execute
 ```
-mkdir $HOME/.config
+mkdir -p $HOME/.config
 cp -r $HOME/repo/doc/source/config_template $HOME/.config/atextcrawler
 ```
@@ -72,7 +80,7 @@ Edit `$HOME/.config/atextcrawler/main.yaml`.
 If you want to override a plugin, copy it to the plugins directory
 and edit it, e.g.
 ```
-cp /srv/atextcrawler/repo/src/atextcrawler/plugin_defaults/filter_site.py $HOME/.config/plugins
+cp $HOME/repo/doc/source/config_template/plugins/filter_site.py $HOME/.config/atextcrawler/plugins
 ```
 Optionally edit `$HOME/.config/atextcrawler/initial_data/seed_urls.list`.
@@ -87,7 +95,12 @@ To see if it works, run `atextcrawler` from the command line:
 ```
 python -m atextcrawler
 ```
-You can stop it with `Ctrl-C`; stopping may take a few seconds or even minutes.
+You can follow the log with:
+```
+journalctl -ef SYSLOG_IDENTIFIER=atextcrawler
+```
+You can stop with `Ctrl-C`; stopping may take a few seconds or even minutes.
 ## Install systemd service
 To make the service persistent, create a systemd unit file
@@ -108,7 +121,7 @@ Environment=PYTHONPATH=/srv/atextcrawler/repo/src
 ExecStart=/srv/atextcrawler/.local/bin/pipenv run python -m atextcrawler
 TimeoutStartSec=30
 ExecStop=/bin/kill -INT $MAINPID
-TimeoutStopSec=180
+TimeoutStopSec=300
 Restart=on-failure

 [Install]
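The unit stops the crawler with SIGINT (the same signal as `Ctrl-C`), and the raised `TimeoutStopSec` gives it time to finish in-flight work. A toy demonstration of that shutdown pattern — the trap here merely stands in for the crawler's actual signal handling:

```shell
# Toy sketch: a process that traps INT exits cleanly when sent the stop
# signal, instead of running to the end of its work (the sleep).
out=$(sh -c 'trap "echo clean shutdown; exit 0" INT; kill -INT $$; sleep 5')
echo "$out"
```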
@@ -120,3 +133,7 @@ systemctl daemon-reload
 systemctl enable atextcrawler
 systemctl start atextcrawler
 ```
+Then follow the log with:
+```
+journalctl -efu atextcrawler
+```