Improve doc: installation
parent a6af5b12d2
commit 85b194931e
3 changed files with 33 additions and 11 deletions
@@ -38,7 +38,7 @@ crawl:
 # Number of concurrent workers
 # Default value: 10
 # Allowed values: integer >=0 and <=1000
-#workers: 3
+workers: 10
 
 # Delay in seconds between attempts to fetch items
 # from site_queue if the last attempt gave no item
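
The `workers` setting changed in this hunk is documented as an integer between 0 and 1000 with a default of 10. As a minimal sketch of what checking that range could look like, assuming nothing about how atextcrawler itself validates its config (the function name `validate_workers` is hypothetical):

```python
def validate_workers(value: object, default: int = 10) -> int:
    """Return a worker count within the documented range 0..1000.

    Hypothetical helper for illustration; not part of atextcrawler.
    """
    if value is None:
        # Key missing or commented out: fall back to the documented default.
        return default
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"workers must be an integer, got {value!r}")
    if not 0 <= value <= 1000:
        raise ValueError(f"workers must be >=0 and <=1000, got {value}")
    return value
```

With the old template line `#workers: 3` the key was commented out, so the default of 10 applied; the new line `workers: 10` states that value explicitly.
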
@@ -12,6 +12,11 @@ pipenv install -d
 ## Configure the instance
 See [installation](installation.md).
 
+Finally also do
+```
+pre-commit install
+```
+
 ## Run
 ```
 python -m atextcrawler
@@ -5,9 +5,11 @@ The instructions below are for this system.
 
 ## System packages
 ```
-apt install pandoc tidy python3-systemd protobuf-compiler libprotobuf-dev
+apt install pandoc tidy python3-systemd openjdk-17-jre-headless
+apt install protobuf-compiler libprotobuf-dev build-essential libpython3-dev
 ```
-The protobuf packages are required for python package gcld3 (see below).
+Java is needed for tika.
+The second line is required for python package gcld3 (see below).
 
 ## PostgreSQL database
 We need access to a PostgreSQL database. Install PostgreSQL or provide connectivity to a PostgreSQL database over TCP/IP. Create a new database:
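
The system packages listed in this hunk provide command-line tools that can be checked for before continuing. The following is a rough sketch, assuming the usual Debian mapping from package name to executable (`java` for `openjdk-17-jre-headless`, `protoc` for `protobuf-compiler`); the helper name `missing_executables` is made up for illustration, and library-only packages such as `libprotobuf-dev` are not covered.

```python
import shutil

# Assumed mapping: Debian package -> an executable that package installs.
REQUIRED = {
    "pandoc": "pandoc",
    "tidy": "tidy",
    "openjdk-17-jre-headless": "java",
    "protobuf-compiler": "protoc",
}


def missing_executables(required=REQUIRED, which=shutil.which):
    """Return the sorted package names whose executable is not on PATH."""
    return sorted(pkg for pkg, exe in required.items() if which(exe) is None)


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Missing, try: apt install " + " ".join(missing))
    else:
        print("All checked executables were found.")
```
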
@@ -22,6 +24,11 @@ Note: TLS is not yet supported, so install this service locally.
 
 See [elasticsearch howto](elasticsearch.md).
 
+Create an API key (using the password for user elastic):
+```
+http --auth elastic:******************* -j POST http://127.0.0.1:9200/_security/api_key name=atext role_descriptors:='{"atext": {"cluster": [], "index": [{"names": ["atext_*"], "privileges": ["all"]}]}}'
+```
+
 ## Tensorflow model server
 We need access to a tensorflow model server (over TCP/IP).
 It should serve `universal_sentence_encoder_multilingual`
@@ -40,18 +47,19 @@ cat >>.bashrc <<EOF
 export PYTHONPATH=\$HOME/repo/src
 EOF
 pip3 install --user pipenv
+mkdir repo
 cat >>.profile <<EOF
 PYTHONPATH=\$HOME/repo/src
 PATH=\$HOME/.local/bin:$PATH
+cd repo
 \$HOME/.local/bin/pipenv shell
 EOF
 exit
 su - atextcrawler
-git clone https://gitea.multiname.org/a-text/atextcrawler.git repo
-cd repo
+rm Pipfile
+git clone https://gitea.multiname.org/a-text/atextcrawler.git $HOME/repo
+virtualenv --system-site-packages `pipenv --venv` # for systemd
 pipenv sync
-pipenv install --site-packages # for systemd
-pre-commit install
 ```
 
 Note: One of the dependencies, Python package `tldextract`,
@@ -63,7 +71,7 @@ $HOME/.cache/python-tldextract/
 ## Configure atextcrawler
 As user `atextcrawler` execute
 ```
-mkdir $HOME/.config
+mkdir -p $HOME/.config
 cp -r $HOME/repo/doc/source/config_template $HOME/.config/atextcrawler
 ```
 
@@ -72,7 +80,7 @@ Edit `$HOME/.config/atextcrawler/main.yaml`.
 If you want to override a plugin, copy it to the plugins directory
 and edit it, e.g.
 ```
-cp /srv/atextcrawler/repo/src/atextcrawler/plugin_defaults/filter_site.py $HOME/.config/plugins
+cp $HOME/repo/doc/source/config_template/plugins/filter_site.py $HOME/.config/atextcrawler/plugins
 ```
 
 Optionally edit `$HOME/.config/atextcrawler/initial_data/seed_urls.list`.
@@ -87,7 +95,12 @@ To see if it works, run `atextcrawler` from the command line:
 ```
 python -m atextcrawler
 ```
-You can stop it with `Ctrl-C`; stopping may take a few seconds or even minutes.
+You can follow the log with:
+```
+journalctl -ef SYSLOG_IDENTIFIER=atextcrawler
+```
+
+You can stop with `Ctrl-C`; stopping may take a few seconds or even minutes.
 
 ## Install systemd service
 To make the service persistent, create a systemd unit file
@@ -108,7 +121,7 @@ Environment=PYTHONPATH=/srv/atextcrawler/repo/src
 ExecStart=/srv/atextcrawler/.local/bin/pipenv run python -m atextcrawler
 TimeoutStartSec=30
 ExecStop=/bin/kill -INT $MAINPID
-TimeoutStopSec=180
+TimeoutStopSec=300
 Restart=on-failure
 
 [Install]
@@ -120,3 +133,7 @@ systemctl daemon-reload
 systemctl enable atextcrawler
 systemctl start atextcrawler
 ```
+Then follow the log with:
+```
+journalctl -efu atextcrawler
+```