atextcrawler is an asynchronous webcrawler indexing text for literal and semantic search.

Its client-side counterpart is atextsearch.

atextcrawler crawls and indexes selected websites. It starts from a few seed sites and follows their external links. Criteria defined in plugin code determine which linked sites (and which of their resources) are (recursively) added to the pool.
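
For illustration, such a plugin criterion could look roughly like the sketch below. The hook names (`accept_site`, `accept_resource`) and the attributes of the `site`/`resource` objects are assumptions made here for the example, not the actual plugin API:

```python
# Hypothetical plugin sketch; the hook names and the attributes of the
# site/resource objects are illustrative assumptions, not the actual
# atextcrawler plugin interface.

def accept_site(site) -> bool:
    """Decide whether a linked site is (recursively) added to the pool."""
    # Example criterion: only accept sites declaring a wanted language.
    wanted_langs = {'en', 'de'}
    return bool(set(getattr(site, 'langs', []) or []) & wanted_langs)


def accept_resource(site, resource) -> bool:
    """Decide which resources of an accepted site are indexed."""
    # Example criterion: skip very short texts.
    return len(getattr(resource, 'text', '') or '') >= 200
```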

atextcrawler is written in Python. It runs a configurable number of async workers concurrently (in one process), uses tensorflow to embed paragraph-sized text chunks with a (multi-)language model, stores metadata in PostgreSQL, and stores the texts in elasticsearch.
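
As a rough sketch of the embedding step, assuming the multilingual Universal Sentence Encoder from tensorflow_hub (the model actually used by atextcrawler may differ):

```python
# Sketch of embedding paragraph-sized text chunks with a multilingual
# tensorflow model; the model choice is an assumption for this example.
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  -- registers ops the model needs

model = hub.load(
    'https://tfhub.dev/google/universal-sentence-encoder-multilingual/3'
)

def embed(chunks: list[str]):
    """Embed paragraph-sized text chunks into 512-dimensional vectors."""
    return model(chunks).numpy()

vectors = embed([
    'A short paragraph in English.',
    'Ein kurzer Absatz auf Deutsch.',
])
print(vectors.shape)  # (2, 512)
```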