atextcrawler/README.md

atextcrawler is an asynchronous web crawler that indexes text for literal and semantic search.

Its client-side counterpart is [atextsearch](https://gitea.multiname.org/a-text/atextsearch).

atextcrawler crawls and indexes selected websites.
It starts from a few seed sites and follows their external links.
Criteria defined in plugin code determine which linked sites (and
which of their resources) are (recursively) added to the pool.
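
The seed-and-follow behaviour can be illustrated with a minimal sketch. It is not atextcrawler's actual plugin interface: the names `LINKS`, `site_of`, `accept_site` and `build_pool` are hypothetical, and an in-memory link table stands in for real HTTP fetching.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link table replacing real HTTP requests:
# site base URL -> external links found on that site.
LINKS = {
    "https://seed.example": ["https://blog.example/post", "https://ads.example/x"],
    "https://blog.example": ["https://wiki.example/", "https://seed.example/about"],
    "https://wiki.example": [],
}


def site_of(url):
    """Reduce a URL to its site base (scheme and host)."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"


def accept_site(base_url):
    """Hypothetical plugin criterion: reject hosts that start with 'ads.'."""
    return not urlparse(base_url).netloc.startswith("ads.")


def build_pool(seeds, accept=accept_site):
    """Breadth-first expansion of the site pool, starting from the seeds."""
    pool = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        site = queue.popleft()
        pool.append(site)
        for link in LINKS.get(site, []):
            target = site_of(link)
            # A linked site joins the pool only if the criterion accepts it.
            if target not in seen and accept(target):
                seen.add(target)
                queue.append(target)
    return pool


pool = build_pool(["https://seed.example"])
print(pool)
```

Passing the criterion as a plain function (`accept`) mirrors the idea that selection logic lives in replaceable plugin code rather than in the crawl loop itself.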

atextcrawler is written in Python and runs a configurable number of
async workers concurrently in a single process. It uses TensorFlow to
embed paragraph-sized text chunks with a (multi-)language model,
and it stores metadata in PostgreSQL and texts in Elasticsearch.
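
As a rough illustration of running a configurable number of async workers in one process, the sketch below uses only `asyncio` from the standard library. It is an assumption about the general pattern, not atextcrawler's code: `process`, `worker`, `crawl` and `n_workers` are hypothetical names, and `process` is a placeholder for fetching, embedding and storing a resource.

```python
import asyncio


async def process(url):
    """Placeholder for fetch + embed + store; yields control like real I/O."""
    await asyncio.sleep(0)
    return f"indexed {url}"


async def worker(name, queue, results):
    """Take URLs from the shared queue until the task is cancelled."""
    while True:
        url = await queue.get()
        try:
            results.append((name, await process(url)))
        finally:
            queue.task_done()


async def crawl(urls, n_workers=3):
    """Run n_workers concurrent workers in one process over a shared queue."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = []
    workers = [
        asyncio.create_task(worker(f"w{i}", queue, results))
        for i in range(n_workers)
    ]
    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    demo_urls = [f"https://site.example/page{i}" for i in range(4)]
    for name, message in asyncio.run(crawl(demo_urls, n_workers=2)):
        print(name, message)
```

All workers share one event loop, so concurrency comes from overlapping I/O waits rather than from multiple processes or threads.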

Put under version control 2021-11-29 09:16:31 +00:00