# Installation

Installation has only been tested on Debian bullseye (amd64).
The instructions below are for this system; please adapt them to other environments.

## System packages

```
apt install pandoc tidy python3-systemd openjdk-17-jre-headless
apt install protobuf-compiler libprotobuf-dev build-essential libpython3-dev
```

Java is needed for tika.
The second line is required for building the Python package `gcld3` (see below).
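
Optionally, you can check that the toolchain is in place before proceeding (a quick sanity sketch; the exact version output varies):

```
java -version
pandoc --version
tidy -version
protoc --version
```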
## PostgreSQL database

We need access to a PostgreSQL database: either install PostgreSQL locally or provide connectivity to a PostgreSQL server over TCP/IP. Create a new database:

```
createdb -E UTF8 --lc-collate=C --lc-ctype=C -T template0 -O atextcrawler atextcrawler
```
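
The `-O atextcrawler` option presupposes a database role named `atextcrawler`. If it does not exist yet, one way to create it (a sketch, assuming password authentication; run as the PostgreSQL superuser):

```
sudo -u postgres createuser --pwprompt atextcrawler
```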
## Elasticsearch

We need access to an elasticsearch instance (over TCP/IP).

Note: TLS is not yet supported, so install this service locally.

See [elasticsearch howto](elasticsearch.md).

Create an API key (using the password for user elastic):

```
http --auth elastic:******************* -j POST http://127.0.0.1:9200/_security/api_key name=atext role_descriptors:='{"atext": {"cluster": [], "index": [{"names": ["atext_*"], "privileges": ["all"]}]}}'
```
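
You can verify the new key by authenticating with it (a sketch; `ID` and `API_KEY` stand for the `id` and `api_key` values returned by the call above):

```
curl -H "Authorization: ApiKey $(printf '%s' 'ID:API_KEY' | base64 -w0)" http://127.0.0.1:9200/_security/_authenticate
```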
## Tensorflow model server

We need access to a tensorflow model server (over TCP/IP).
It should serve `universal_sentence_encoder_multilingual` or a similar language model.

Note: TLS is not yet supported, so install this service locally.

See [tensorflow howto](tensorflow_model_server.md).
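
Once the model server is running, you can check that the model is being served via its REST API (a sketch, assuming the default REST port 8501 and that the model is served under the name `universal_sentence_encoder_multilingual`):

```
curl http://127.0.0.1:8501/v1/models/universal_sentence_encoder_multilingual
```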
## Setup virtualenv and install atextcrawler

```
apt install python3-pip
adduser --home /srv/atextcrawler --disabled-password --gecos "" atextcrawler
su - atextcrawler
cat >>.bashrc <<EOF
export PYTHONPATH=\$HOME/repo/src
EOF
pip3 install --user pipenv
mkdir repo
cat >>.profile <<EOF
PYTHONPATH=\$HOME/repo/src
PATH=\$HOME/.local/bin:\$PATH
cd repo
\$HOME/.local/bin/pipenv shell
EOF
exit
# logging in again runs .profile, which drops us into a pipenv shell in ~/repo
su - atextcrawler
rm Pipfile    # remove the Pipfile pipenv just created, so the clone below finds an empty directory
git clone https://gitea.multiname.org/a-text/atextcrawler.git $HOME/repo
virtualenv --system-site-packages `pipenv --venv`    # for systemd: let the venv see apt-installed packages like python3-systemd
pipenv sync
```
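
If you want to confirm that the virtualenv really sees the system-wide packages (the reason for the `--system-site-packages` call above), a quick check from within `~/repo` (a sketch):

```
pipenv run python -c "import systemd.journal; print('ok')"
```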

Note: One of the dependencies, Python package `tldextract`,
uses this directory for caching:

```
$HOME/.cache/python-tldextract/
```
## Configure atextcrawler

As user `atextcrawler`, execute:

```
mkdir -p $HOME/.config
cp -r $HOME/repo/doc/source/config_template $HOME/.config/atextcrawler
```

Edit `$HOME/.config/atextcrawler/main.yaml`.

If you want to override a plugin, copy it to the plugins directory and edit it, e.g.:

```
cp $HOME/repo/doc/source/config_template/plugins/filter_site.py $HOME/.config/atextcrawler/plugins
```

Optionally edit `$HOME/.config/atextcrawler/initial_data/seed_urls.list`.

Check (and print) the instance configuration:

```
python -m atextcrawler.config
```
## Test run

To see if it works, run `atextcrawler` from the command line:

```
python -m atextcrawler
```

You can follow the log with:

```
journalctl -ef SYSLOG_IDENTIFIER=atextcrawler
```

You can stop it with `Ctrl-C`; stopping may take a few seconds or even minutes.
## Install systemd service

To make the service persistent, create a systemd unit file `/etc/systemd/system/atextcrawler.service` with this content:

```
[Unit]
Description=atextcrawler web crawler
Documentation=https://gitea.multiname.org/a-text/atextcrawler
Requires=network.target elasticsearch.service tensorflow.service
After=network-online.target elasticsearch.service tensorflow.service

[Service]
Type=simple
User=atextcrawler
Group=atextcrawler
WorkingDirectory=/srv/atextcrawler/repo
Environment=PYTHONPATH=/srv/atextcrawler/repo/src
ExecStart=/srv/atextcrawler/.local/bin/pipenv run python -m atextcrawler
TimeoutStartSec=30
ExecStop=/bin/kill -INT $MAINPID
TimeoutStopSec=300
Restart=on-failure

[Install]
WantedBy=multi-user.target
```
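
Optionally, you can check the unit file for obvious mistakes before enabling it (a sketch using a standard systemd tool):

```
systemd-analyze verify /etc/systemd/system/atextcrawler.service
```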

Then reload systemd and activate the service:

```
systemctl daemon-reload
systemctl enable atextcrawler
systemctl start atextcrawler
```

Follow the log with:

```
journalctl -efu atextcrawler
```
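
You can check the state of the service at any time with the standard systemd commands, e.g.:

```
systemctl status atextcrawler
```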