Improve doc: installation

2021-11-29 11:39:09 +00:00 · 85b194931e
commit 85b194931e
parent a6af5b12d2
3 changed files with 33 additions and 11 deletions
--- a/doc/source/config_template/main.yaml
+++ b/doc/source/config_template/main.yaml
@@ -38,7 +38,7 @@ crawl:
    # Number of concurrent workers
    # Default value: 10
    # Allowed values: integer >=0 and <=1000
-    #workers: 3
+    workers: 10

    # Delay in seconds between attempts to fetch items
    # from site_queue if the last attempt gave no item
--- a/doc/source/devel/devel.md
+++ b/doc/source/devel/devel.md
@@ -12,6 +12,11 @@ pipenv install -d
 ## Configure the instance
 See [installation](installation.md).

+Finally, also install the pre-commit hooks:
+```
+pre-commit install
+```
+
 ## Run
 ```
 python -m atextcrawler
--- a/doc/source/installation.md
+++ b/doc/source/installation.md
@@ -5,9 +5,11 @@ The instructions below are for this system.

 ## System packages
 ```
-apt install pandoc tidy python3-systemd protobuf-compiler libprotobuf-dev
+apt install pandoc tidy python3-systemd openjdk-17-jre-headless
+apt install protobuf-compiler libprotobuf-dev build-essential libpython3-dev
 ```
-The protobuf packages are required for python package gcld3 (see below).
+Java is needed for Tika.
+The packages on the second line are required for the Python package gcld3 (see below).

 ## PostgreSQL database
 We need access to a PostgreSQL database. Install PostgreSQL or provide connectivity to a PostgreSQL database over TCP/IP. Create a new database:
@@ -22,6 +24,11 @@ Note: TLS is not yet supported, so install this service locally.

 See [elasticsearch howto](elasticsearch.md).

+Create an API key (using the password for user elastic):
+```
+http --auth elastic:******************* -j POST http://127.0.0.1:9200/_security/api_key name=atext role_descriptors:='{"atext": {"cluster": [], "index": [{"names": ["atext_*"], "privileges": ["all"]}]}}'
+```
+
 ## Tensorflow model server
 We need access to a tensorflow model server (over TCP/IP).
 It should serve `universal_sentence_encoder_multilingual`
@@ -40,18 +47,19 @@ cat >>.bashrc <<EOF
 export PYTHONPATH=\$HOME/repo/src
 EOF
 pip3 install --user pipenv
+mkdir repo
 cat >>.profile <<EOF
 PYTHONPATH=\$HOME/repo/src
 PATH=\$HOME/.local/bin:$PATH
+cd repo
 \$HOME/.local/bin/pipenv shell
 EOF
 exit
 su - atextcrawler
-git clone https://gitea.multiname.org/a-text/atextcrawler.git repo
-cd repo
+rm Pipfile
+git clone https://gitea.multiname.org/a-text/atextcrawler.git $HOME/repo
+virtualenv --system-site-packages `pipenv --venv`  # for systemd
 pipenv sync
-pipenv install --site-packages  # for systemd
-pre-commit install
 ```

 Note: One of the dependencies, Python package `tldextract`,
@@ -63,7 +71,7 @@ $HOME/.cache/python-tldextract/
 ## Configure atextcrawler
 As user `atextcrawler` execute
 ```
-mkdir $HOME/.config
+mkdir -p $HOME/.config
 cp -r $HOME/repo/doc/source/config_template $HOME/.config/atextcrawler
 ```

@@ -72,7 +80,7 @@ Edit `$HOME/.config/atextcrawler/main.yaml`.
 If you want to override a plugin, copy it to the plugins directory
 and edit it, e.g.
 ```
-cp /srv/atextcrawler/repo/src/atextcrawler/plugin_defaults/filter_site.py $HOME/.config/plugins
+cp $HOME/repo/doc/source/config_template/plugins/filter_site.py $HOME/.config/atextcrawler/plugins
 ```

 Optionally edit `$HOME/.config/atextcrawler/initial_data/seed_urls.list`.
@@ -87,7 +95,12 @@ To see if it works, run `atextcrawler` from the command line:
 ```
 python -m atextcrawler
 ```
-You can stop it with `Ctrl-C`; stopping may take a few seconds or even minutes.
+You can follow the log with:
+```
+journalctl -ef SYSLOG_IDENTIFIER=atextcrawler
+```
+
+You can stop with `Ctrl-C`; stopping may take a few seconds or even minutes.

 ## Install systemd service
 To make the service persistent, create a systemd unit file
@@ -108,7 +121,7 @@ Environment=PYTHONPATH=/srv/atextcrawler/repo/src
 ExecStart=/srv/atextcrawler/.local/bin/pipenv run python -m atextcrawler
 TimeoutStartSec=30
 ExecStop=/bin/kill -INT $MAINPID
-TimeoutStopSec=180
+TimeoutStopSec=300
 Restart=on-failure

 [Install]
@@ -120,3 +133,7 @@ systemctl daemon-reload
 systemctl enable atextcrawler
 systemctl start atextcrawler
 ```
+Then follow the log with:
+```
+journalctl -efu atextcrawler
+```