atextcrawler is an asynchronous web crawler that indexes text for literal and semantic search.

Its client-side counterpart is [atextsearch](https://gitea.multiname.org/a-text/atextsearch).

atextcrawler crawls and indexes selected websites.
It starts from a few seed sites and follows their external links.
Criteria defined in plugin code determine which linked sites (and
which of their resources) are (recursively) added to the pool.

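
The pool-building idea described above can be sketched roughly as follows. This is only an illustration, not atextcrawler's actual plugin interface: `LINKS`, `accept_site` and `build_pool` are hypothetical names, the link graph is an in-memory stand-in for real HTTP fetches, and the sketch assumes that seed sites are always accepted.

```python
from collections import deque
from typing import Callable, Dict, List, Set

# Hypothetical in-memory link graph standing in for real HTTP fetches:
# each site maps to the external sites it links to.
LINKS: Dict[str, List[str]] = {
    "seed.example": ["blog.example", "shop.example"],
    "blog.example": ["wiki.example", "seed.example"],
    "shop.example": ["ads.example"],
    "wiki.example": [],
    "ads.example": ["wiki.example"],
}


def accept_site(site: str) -> bool:
    """Hypothetical plugin criterion: reject shop and ad sites."""
    return not site.startswith(("shop.", "ads."))


def build_pool(
    seeds: List[str],
    links: Dict[str, List[str]],
    accept: Callable[[str], bool],
) -> List[str]:
    """Breadth-first walk from the seeds, adding only accepted sites."""
    pool: List[str] = []
    seen: Set[str] = set(seeds)  # assumption: seeds are always accepted
    queue = deque(seeds)
    while queue:
        site = queue.popleft()
        pool.append(site)
        # Links of an accepted site are followed in turn (the recursive part).
        for target in links.get(site, []):
            if target not in seen and accept(target):
                seen.add(target)
                queue.append(target)
    return pool


pool = build_pool(["seed.example"], LINKS, accept_site)
print(pool)
```

Because rejected sites never enter the queue, their outgoing links are not followed either; in this sketch `ads.example` is unreachable for that reason as well as being rejected by the criterion itself.
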

atextcrawler is written in Python, runs a configurable number of
async workers concurrently (in one process), uses TensorFlow to
embed (paragraph-sized) text chunks with a (multi-)language model
and stores metadata in PostgreSQL and texts in Elasticsearch.
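
As a rough sketch of what "a configurable number of async workers in one process" can look like, the following uses only the standard-library `asyncio` module. All names here (`WORKER_COUNT`, `split_into_chunks`, `worker`, `run`) are assumptions made for illustration and are not taken from atextcrawler; the embedding and storage steps are deliberately left out, and splitting at blank lines is just one possible way to obtain paragraph-sized chunks.

```python
import asyncio
from typing import List, Tuple

WORKER_COUNT = 3  # hypothetical setting; the real configuration is not shown here


def split_into_chunks(text: str) -> List[str]:
    """Split text into paragraph-sized chunks at blank lines."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


async def worker(name: str, queue: asyncio.Queue, results: list) -> None:
    """Take texts from the queue until cancelled."""
    while True:
        text = await queue.get()
        try:
            await asyncio.sleep(0)  # stand-in for awaiting network or database I/O
            results.append((name, split_into_chunks(text)))
        finally:
            queue.task_done()


async def run(texts: List[str], worker_count: int = WORKER_COUNT) -> List[Tuple[str, List[str]]]:
    queue: asyncio.Queue = asyncio.Queue()
    results: List[Tuple[str, List[str]]] = []
    for text in texts:
        queue.put_nowait(text)
    # All workers share one event loop, so they run concurrently in one process.
    workers = [
        asyncio.create_task(worker(f"worker-{i}", queue, results))
        for i in range(worker_count)
    ]
    await queue.join()  # wait until every queued text has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


results = asyncio.run(run(["First paragraph.\n\nSecond paragraph.", "Only one."]))
```

Changing `worker_count` changes how many texts are in flight at once without starting additional processes or threads.
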