diff --git a/doc/source/config_template/main.yaml b/doc/source/config_template/main.yaml
index 8a12feb..308deeb 100644
--- a/doc/source/config_template/main.yaml
+++ b/doc/source/config_template/main.yaml
@@ -38,7 +38,7 @@ crawl:
   # Number of concurrent workers
   # Default value: 10
   # Allowed values: integer >=0 and <=1000
-  #workers: 3
+  workers: 10
 
   # Delay in seconds between attempts to fetch items
   # from site_queue if the last attempt gave no item
diff --git a/doc/source/devel/devel.md b/doc/source/devel/devel.md
index 18ce86b..4937c38 100644
--- a/doc/source/devel/devel.md
+++ b/doc/source/devel/devel.md
@@ -12,6 +12,11 @@ pipenv install -d
 ## Configure the instance
 See [installation](installation.md).
 
+Finally also do
+```
+pre-commit install
+```
+
 ## Run
 ```
 python -m atextcrawler
diff --git a/doc/source/installation.md b/doc/source/installation.md
index 300c94b..3491867 100644
--- a/doc/source/installation.md
+++ b/doc/source/installation.md
@@ -5,9 +5,11 @@ The instructions below are for this system.
 
 ## System packages
 ```
-apt install pandoc tidy python3-systemd protobuf-compiler libprotobuf-dev
+apt install pandoc tidy python3-systemd openjdk-17-jre-headless
+apt install protobuf-compiler libprotobuf-dev build-essential libpython3-dev
 ```
-The protobuf packages are required for python package gcld3 (see below).
+Java is needed for tika.
+The second line is required for python package gcld3 (see below).
 
 ## PostgreSQL database
 We need access to a PostgreSQL database. Install PostgreSQL or provide connectivity to a PostgreSQL database over TCP/IP. Create a new database:
@@ -22,6 +24,11 @@ Note: TLS is not yet supported, so install this service locally.
 
 See [elasticsearch howto](elasticsearch.md).
 
+Create an API key (using the password for user elastic):
+```
+http --auth elastic:******************* -j POST http://127.0.0.1:9200/_security/api_key name=atext role_descriptors:='{"atext": {"cluster": [], "index": [{"names": ["atext_*"], "privileges": ["all"]}]}}'
+```
+
 ## Tensorflow model server
 We need access to a tensorflow model server (over TCP/IP).
 It should serve `universal_sentence_encoder_multilingual`
@@ -40,18 +47,19 @@ cat >>.bashrc <>.profile <