Installation
Installation has only been tested on Debian bullseye (on amd64). The instructions below are for this system; please adapt them for other environments.
System packages
apt install pandoc tidy python3-systemd protobuf-compiler libprotobuf-dev
The protobuf packages are required by the Python package gcld3 (see below).
PostgreSQL database
We need access to a PostgreSQL database. Install PostgreSQL or provide connectivity to a PostgreSQL database over TCP/IP. Create a new database:
createdb -E UTF8 --lc-collate=C --lc-ctype=C -T template0 -O atextcrawler atextcrawler
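The -O atextcrawler option assumes that a PostgreSQL role named atextcrawler already exists. If it does not, one way to create it is (role name matches the database owner above; the password prompt is an assumption of this sketch):

```shell
# Run as a PostgreSQL superuser (e.g. the postgres system user):
# create a login role "atextcrawler" and prompt for its password
sudo -u postgres createuser --pwprompt atextcrawler
```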
Elasticsearch
We need access to an Elasticsearch instance (over TCP/IP).
Note: TLS is not yet supported, so install this service locally.
See elasticsearch howto.
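To verify that the instance is reachable, you can query its info endpoint (assuming Elasticsearch's default port 9200 on localhost):

```shell
# Should return a JSON document with cluster name and version
curl http://localhost:9200
```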
Tensorflow model server
We need access to a TensorFlow model server (over TCP/IP). It should serve universal_sentence_encoder_multilingual or a similar language model.
Note: TLS is not yet supported, so install this service locally.
See tensorflow howto.
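To verify that the model server is up, you can query its model status endpoint (this sketch assumes TensorFlow Serving's default REST API port 8501 and the model name mentioned above):

```shell
# Should report the model version with state "AVAILABLE"
curl http://localhost:8501/v1/models/universal_sentence_encoder_multilingual
```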
Setup virtualenv and install atextcrawler
apt install python3-pip
adduser --home /srv/atextcrawler --disabled-password --gecos "" atextcrawler
su - atextcrawler
cat >>.bashrc <<EOF
export PYTHONPATH=\$HOME/repo/src
EOF
pip3 install --user pipenv
cat >>.profile <<EOF
PYTHONPATH=\$HOME/repo/src
PATH=\$HOME/.local/bin:\$PATH
\$HOME/.local/bin/pipenv shell
EOF
exit
su - atextcrawler
git clone https://gitea.multiname.org/a-text/atextcrawler.git repo
cd repo
pipenv sync
pipenv install --site-packages # for systemd
pre-commit install
Note: One of the dependencies, the Python package tldextract, uses this directory for caching: $HOME/.cache/python-tldextract/
Configure atextcrawler
As user atextcrawler, execute:
mkdir $HOME/.config
cp -r $HOME/repo/doc/source/config_template $HOME/.config/atextcrawler
Edit $HOME/.config/atextcrawler/main.yaml.
If you want to override a plugin, copy it to the plugins directory and edit it, e.g.
cp /srv/atextcrawler/repo/src/atextcrawler/plugin_defaults/filter_site.py $HOME/.config/atextcrawler/plugins
Optionally edit $HOME/.config/atextcrawler/initial_data/seed_urls.list.
Check (and print) the instance configuration:
python -m atextcrawler.config
Test run
To see if it works, run atextcrawler from the command line:
python -m atextcrawler
You can stop it with Ctrl-C; stopping may take a few seconds or even minutes.
Install systemd service
To make the service persistent, create a systemd unit file
/etc/systemd/system/atextcrawler.service
with this content:
[Unit]
Description=atextcrawler web crawler
Documentation=https://gitea.multiname.org/a-text/atextcrawler
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=atextcrawler
Group=atextcrawler
WorkingDirectory=/srv/atextcrawler/repo
Environment=PYTHONPATH=/srv/atextcrawler/repo/src
ExecStart=/srv/atextcrawler/.local/bin/pipenv run python -m atextcrawler
TimeoutStartSec=30
ExecStop=/bin/kill -INT $MAINPID
TimeoutStopSec=180
Restart=on-failure
[Install]
WantedBy=multi-user.target
Then reload systemd and enable and start the service:
systemctl daemon-reload
systemctl enable atextcrawler
systemctl start atextcrawler
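To check that the service came up and to follow its log output:

```shell
# Show the current service state
systemctl status atextcrawler
# Follow the service's journal output
journalctl -u atextcrawler -f
```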