# Introduction
## What atextcrawler does:
* Start from a seed list (whitelist and blacklist) of website base URLs
* Loop over sites selected by applying criteria to the content
  of the site's start page
* Crawl each site, i.e. loop over its resources
* Extract plaintext content from each resource (HTML parsing is
  optimized for HTML5); discard non-text content, but handle feeds
  and sitemaps
* Extract internal and external links; external links contribute
  to the site list
* Keep track of sites and resources in a PostgreSQL database
* Store the plaintext content of resources in an Elasticsearch index
* Store vector embeddings of the plaintexts in Elasticsearch as well,
  computed by a TensorFlow model server running a multilingual
  language model (see the sketch below)
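
The last two points could look roughly like the following sketch. It is
illustrative only and assumes the async `elasticsearch` client (8.x), a
TensorFlow Serving REST endpoint, and made-up index, field and model
names; atextcrawler's actual index schema may differ.

```python
import aiohttp
from elasticsearch import AsyncElasticsearch

ES_INDEX = 'resources'  # assumed index name
# assumed TensorFlow Serving endpoint and model name
MODEL_URL = 'http://localhost:8501/v1/models/embed:predict'

async def embed(session: aiohttp.ClientSession, text: str) -> list[float]:
    """Fetch a sentence embedding for `text` from the model server."""
    async with session.post(MODEL_URL, json={'instances': [text]}) as resp:
        data = await resp.json()
        return data['predictions'][0]

async def index_resource(es: AsyncElasticsearch,
                         session: aiohttp.ClientSession,
                         resource_id: int, text: str) -> None:
    """Store plaintext and its embedding in one Elasticsearch document."""
    doc = {
        'text': text,                             # plaintext content
        'embedding': await embed(session, text),  # dense_vector field
    }
    await es.index(index=ES_INDEX, id=str(resource_id), document=doc)
```
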
## Architecture
There is only one Python process; concurrency is achieved with asyncio,
which is used wherever possible (almost everywhere).
1. There is a queue of websites, see database table `site_queue`.
   The queue is fed a) on first startup with seeds, b) manually
   and c) from crawls which find external links.
   When the queue is handled, new sites are stored to table `site`.
   New sites are updated immediately; existing sites are only updated
   if their last update was more than `crawl.site_revisit_delay`
   seconds in the past.
   After the queue has been handled there is a delay of
   `crawl.site_delay` seconds before it is processed again
   (see the first sketch after this list).
1. Updating a site means: the start page is fetched and criteria
   are applied to its content to determine whether the site is
   relevant. (It is assumed that relevance, or its absence, is
   already obvious from the start page.) If the site is relevant,
   more information is fetched (e.g. sitemaps).
1. There is a configurable number of crawler workers (config
   `crawl.workers`) which concurrently crawl sites, one site at a
   time per worker. A worker picks a relevant site which has not
   been crawled for a certain time ("checkout"), crawls it, and
   finally marks it as crawled again ("checkin") and schedules the
   next crawl; during the crawl the site is locked by setting
   `crawl_active=true`, and checkin resets it to `crawl_active=false`
   (see the second sketch after this list).
   Each crawl (with begin time, end time and the number of (new)
   resources found) is stored in table `crawl`.
1. Crawls are either full crawls (all paths reachable through links
   from the start page are fetched) or feed crawls (only paths
   listed in one of the site's feeds are fetched). The minimum
   intervals at which these crawls happen are `full_crawl_interval`
   and `feed_crawl_interval`, respectively; feed crawls can happen
   more frequently (e.g. daily).
1. When a path is fetched, the result is either a MetaResource (feed
   or sitemap) or a TextResource (redirects are followed and
   irrelevant content is ignored). A TextResource obtained from one
   path can be very similar to a resource obtained from another path;
   in this case no new resource is created, but both paths are linked
   to the same resource (see tables `site_path` and `resource`, and
   the third sketch after this list).
1. If a MetaResource is fetched and it is a sitemap, its paths are
   added to table `site_path`. If it is a feed, the feed is stored
   in table `site_feed` and its paths are added to table `site_path`.
1. Links between sites are stored in table `site_link`.
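
The following sketches are illustrative only; they are not
atextcrawler's actual code. First, handling the site queue (step 1):
the table names `site_queue` and `site` are from above, but the column
names and the helper `update_site` are assumptions.

```python
import asyncio
from datetime import datetime, timedelta, timezone

import asyncpg

async def process_site_queue(pool: asyncpg.Pool, site_delay: float,
                             site_revisit_delay: float) -> None:
    """Drain table site_queue into table site, then sleep and repeat."""
    revisit = timedelta(seconds=site_revisit_delay)
    while True:
        async with pool.acquire() as conn:
            for row in await conn.fetch('SELECT id, base_url FROM site_queue'):
                site = await conn.fetchrow(
                    'SELECT id, last_update FROM site WHERE base_url=$1',
                    row['base_url'])
                if site is None:
                    # unknown base_url: store a new site and update it
                    await conn.execute(
                        'INSERT INTO site (base_url) VALUES ($1)',
                        row['base_url'])
                    await update_site(conn, row['base_url'])  # hypothetical
                elif datetime.now(timezone.utc) - site['last_update'] > revisit:
                    # existing site: update only if the last update is old
                    await update_site(conn, row['base_url'])  # hypothetical
                await conn.execute('DELETE FROM site_queue WHERE id=$1',
                                   row['id'])
        await asyncio.sleep(site_delay)  # crawl.site_delay
```
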
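Second, the checkout/crawl/checkin cycle of a worker (steps 3 and 4):
apart from `crawl_active` and the tables `site` and `crawl`, the column
names, the scheduling details and the helper `crawl_site` are
assumptions.

```python
import asyncio
from datetime import datetime, timezone

import asyncpg

async def crawler_worker(pool: asyncpg.Pool, config: dict) -> None:
    """One of the `crawl.workers` concurrent workers."""
    while True:
        async with pool.acquire() as conn:
            # "checkout": lock one relevant site that is due for a crawl
            site = await conn.fetchrow(
                '''
                UPDATE site SET crawl_active = true
                 WHERE id = (SELECT id FROM site
                              WHERE relevant AND NOT crawl_active
                                AND next_crawl < now()
                              LIMIT 1 FOR UPDATE SKIP LOCKED)
                RETURNING id, base_url, last_full_crawl
                ''')
        if site is None:
            await asyncio.sleep(10)  # nothing is due right now
            continue
        # full crawls are due less often than feed crawls
        age = datetime.now(timezone.utc) - site['last_full_crawl']
        full = age.total_seconds() > config['full_crawl_interval']
        begin = datetime.now(timezone.utc)
        n_resources = await crawl_site(site, full=full)  # hypothetical
        async with pool.acquire() as conn:
            # "checkin": unlock the site, record the crawl, schedule next
            await conn.execute(
                'UPDATE site SET crawl_active = false WHERE id = $1',
                site['id'])
            await conn.execute(
                'INSERT INTO crawl (site_id, t_begin, t_end, n_resources)'
                ' VALUES ($1, $2, now(), $3)',
                site['id'], begin, n_resources)

async def run_workers(pool: asyncpg.Pool, config: dict) -> None:
    """Spawn the configured number of workers (config crawl.workers)."""
    await asyncio.gather(*(crawler_worker(pool, config)
                           for _ in range(config['workers'])))
```
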
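Third, one possible way to decide that a newly fetched TextResource
duplicates an existing one (step 5); the similarity measure and the
threshold are assumptions, not necessarily what atextcrawler uses.

```python
from difflib import SequenceMatcher

def is_duplicate(new_text: str, existing_text: str,
                 threshold: float = 0.95) -> bool:
    """True if two plaintexts are similar enough to count as the same
    resource; in that case the new site_path is linked to the existing
    resource instead of creating a new one."""
    return SequenceMatcher(None, new_text, existing_text).ratio() >= threshold
```
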
## Site annotations
Database table `site_annotation` can hold any number of annotations
for a base_url. While crawling, these annotations are taken into
account: blacklisting or whitelisting has precedence over function
`site_filter` (in plugin `filter_site`), as sketched below.
Annotations cannot be managed from within atextcrawler;
this requires another application, usually [`atextsearch`](https://TODO).
Each annotation requires the base_url of the annotated site, and
if a site with this base_url exists in the `site` table,
the annotation should also be associated with the site's id
(column `site_id`).
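
A minimal sketch of this precedence, assuming an annotation type column
(here called `ann_type`) with values `blacklist` and `whitelist`; the
actual column names and the signature of `site_filter` may differ:

```python
def is_relevant(site, annotations, site_filter) -> bool:
    """Decide whether to crawl a site, honouring annotations first."""
    for annotation in annotations:  # rows from table site_annotation
        if annotation['ann_type'] == 'blacklist':
            return False            # blacklisting beats the plugin
        if annotation['ann_type'] == 'whitelist':
            return True             # whitelisting beats the plugin
    # no decisive annotation: fall back to site_filter (plugin filter_site)
    return site_filter(site)
```
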
## Limitations
* atextcrawler is not optimized for speed; it is meant to be run as a
  background task on a server with limited resources
  (or even an SBC, like a Raspberry Pi, with attached storage)
* atextcrawler only indexes text, not other resources such as images