Installation
Installation has only been tested on Debian bullseye (amd64); the instructions below target this system. Please adapt them to other environments as needed.
System packages
apt install pandoc tidy python3-systemd openjdk-17-jre-headless
apt install protobuf-compiler libprotobuf-dev build-essential libpython3-dev
Java is needed for tika. The packages on the second line are required to build the Python package gcld3 (see below).
PostgreSQL database
We need access to a PostgreSQL database, either installed locally or reachable over TCP/IP. Create the database owner and a new database:
createuser atextcrawler
createdb -E UTF8 --lc-collate=C --lc-ctype=C -T template0 -O atextcrawler atextcrawler
Elasticsearch
We need access to an elasticsearch instance (over TCP/IP).
Note: TLS is not yet supported, so install this service locally.
See elasticsearch howto.
Create an API key (using the password for user elastic):
http --auth elastic:******************* -j POST http://127.0.0.1:9200/_security/api_key name=atext role_descriptors:='{"atext": {"cluster": [], "index": [{"names": ["atext_*"], "privileges": ["all"]}]}}'
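The response contains an id and an api_key. To authenticate later requests with the key, Elasticsearch expects an Authorization: ApiKey header whose token is the base64 encoding of id:api_key. A minimal sketch of building that header (function name is illustrative, not part of atextcrawler):

```python
import base64

def api_key_header(key_id: str, api_key: str) -> dict:
    """Build the Elasticsearch ApiKey authorization header.

    The token is base64("<id>:<api_key>"), using the values returned
    by the _security/api_key endpoint above.
    """
    token = base64.b64encode(f"{key_id}:{api_key}".encode()).decode()
    return {"Authorization": f"ApiKey {token}"}
```

The resulting dict can be passed as extra headers to any HTTP client.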
Tensorflow model server
We need access to a tensorflow model server (over TCP/IP). It should serve universal_sentence_encoder_multilingual or a similar language model.
Note: TLS is not yet supported, so install this service locally.
See tensorflow howto.
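For reference, TensorFlow Serving exposes a REST predict endpoint at /v1/models/&lt;name&gt;:predict that accepts a JSON body of the form {"instances": [...]}. A sketch of building such a request for sentence embeddings (the host, port, and exact model name served are deployment-specific assumptions):

```python
import json

def predict_request(sentences, model="universal_sentence_encoder_multilingual"):
    """Build the URL path and JSON body for a TF Serving REST predict call.

    The caller would POST `body` to http://<host>:8501<path>.
    """
    path = f"/v1/models/{model}:predict"
    body = json.dumps({"instances": list(sentences)})
    return path, body
```

This is only a sanity-check helper; atextcrawler itself talks to the model server through its own client code.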
Setup virtualenv and install atextcrawler
apt install python3-pip
adduser --home /srv/atextcrawler --disabled-password --gecos "" atextcrawler
su - atextcrawler
cat >>.bashrc <<EOF
export PYTHONPATH=\$HOME/repo/src
EOF
pip3 install --user pipenv
mkdir repo
cat >>.profile <<EOF
PYTHONPATH=\$HOME/repo/src
PATH=\$HOME/.local/bin:\$PATH
cd repo
\$HOME/.local/bin/pipenv shell
EOF
exit
su - atextcrawler
rm Pipfile
git clone https://gitea.multiname.org/a-text/atextcrawler.git $HOME/repo
virtualenv --system-site-packages $(pipenv --venv)  # make system site-packages (e.g. python3-systemd) visible; needed for systemd
pipenv sync
Note: One of the dependencies, the Python package tldextract, uses this directory for caching: $HOME/.cache/python-tldextract/
Configure atextcrawler
As user atextcrawler, execute:
mkdir -p $HOME/.config
cp -r $HOME/repo/doc/source/config_template $HOME/.config/atextcrawler
Edit $HOME/.config/atextcrawler/main.yaml.
If you want to override a plugin, copy it to the plugins directory and edit it, e.g.
cp $HOME/repo/doc/source/config_template/plugins/filter_site.py $HOME/.config/atextcrawler/plugins
Optionally edit $HOME/.config/atextcrawler/initial_data/seed_urls.list.
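The exact format of seed_urls.list is defined by the config template; assuming it holds one URL per line, with blank lines and #-comments ignored, a quick sanity check of an edited file could look like this (a sketch, not part of atextcrawler):

```python
from urllib.parse import urlsplit

def check_seed_urls(text: str) -> list:
    """Return the lines that do not parse as absolute http(s) URLs.

    Assumes one URL per line; blank lines and '#' comments are skipped.
    """
    bad = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = urlsplit(line)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            bad.append(line)
    return bad
```

If the template uses additional per-line markers, extend the check accordingly.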
Check (and print) the instance configuration:
python -m atextcrawler.config
Test run
To see if it works, run atextcrawler from the command line:
python -m atextcrawler
You can follow the log with:
journalctl -ef SYSLOG_IDENTIFIER=atextcrawler
You can stop it with Ctrl-C; stopping may take a few seconds or even minutes.
Install systemd service
To make the service persistent, create a systemd unit file
/etc/systemd/system/atextcrawler.service
with this content:
[Unit]
Description=atextcrawler web crawler
Documentation=https://gitea.multiname.org/a-text/atextcrawler
Requires=network-online.target elasticsearch.service tensorflow.service
After=network-online.target elasticsearch.service tensorflow.service
[Service]
Type=simple
User=atextcrawler
Group=atextcrawler
WorkingDirectory=/srv/atextcrawler/repo
Environment=PYTHONPATH=/srv/atextcrawler/repo/src
ExecStart=/srv/atextcrawler/.local/bin/pipenv run python -m atextcrawler
TimeoutStartSec=30
ExecStop=/bin/kill -INT $MAINPID
TimeoutStopSec=300
Restart=on-failure
[Install]
WantedBy=multi-user.target
Then reload systemd and enable and start the service:
systemctl daemon-reload
systemctl enable atextcrawler
systemctl start atextcrawler
Then follow the log with:
journalctl -efu atextcrawler