atextcrawler
============

atextcrawler is an asynchronous web crawler that indexes text
for literal and semantic search.

Its client-side counterpart is atextsearch_.

atextcrawler crawls and indexes selected websites.
It starts from a few seed sites and follows their external links.
Criteria defined in plugin code determine which linked sites (and
which of their resources) are recursively added to the pool.

atextcrawler is written in Python and runs a configurable number of
async workers concurrently in a single process. It uses TensorFlow to
embed paragraph-sized text chunks with a (multi-)language model, and
it stores metadata in PostgreSQL and texts in Elasticsearch.

.. _atextsearch: https://gitea.multiname.org/a-text/atextsearch

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   introduction
   installation
   maintenance
   development
   reference/modules


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`