atextcrawler
============

atextcrawler is an asynchronous web crawler that indexes text
for literal and semantic search.

Its client-side counterpart is atextsearch_.

atextcrawler crawls and indexes selected websites.
It starts from a few seed sites and follows their external links.
Criteria defined in plugin code determine which linked sites (and
which of their resources) are added, recursively, to the pool.

atextcrawler is written in Python and runs a configurable number of
async workers concurrently in a single process. It uses TensorFlow to
embed paragraph-sized text chunks with a (multi-)language model, and
it stores metadata in PostgreSQL and texts in Elasticsearch.

.. _atextsearch: https://gitea.multiname.org/a-text/atextsearch

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   introduction
   installation
   maintenance
   development
   reference/modules


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`