Introduction
What atextcrawler does:
- Start from a seed list (whitelist and blacklist) of website base URLs
- Loop over sites selected by applying criteria to the content of the site's start page
- Crawl the site, i.e. loop over resources of the site
- Extract plaintext content from the resource (HTML parsing is optimized for HTML5); discard non-text content, but handle feeds and sitemaps
- Extract internal and external links; external links contribute to the site list
- Keep track of the sites and resources in a PostgreSQL database
- Store plaintext content of resources in an Elasticsearch index
- Also store vector embeddings of the plaintexts in Elasticsearch, using a TensorFlow model server with a multilingual language model
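To make the last two points concrete, here is a minimal sketch (not atextcrawler's actual code) of storing a resource's plaintext together with its embedding vector in Elasticsearch; the index name, the field names and the use of the elasticsearch-py 8.x async client are assumptions.

```python
# Minimal sketch (not atextcrawler's actual code): store a resource's plaintext
# and its embedding vector in Elasticsearch. Index and field names are assumed.
import asyncio

from elasticsearch import AsyncElasticsearch  # elasticsearch-py 8.x async client

async def index_resource(es: AsyncElasticsearch, resource_id: int,
                         text: str, embedding: list[float]) -> None:
    await es.index(
        index="text_resources",       # assumed index name
        id=str(resource_id),
        document={
            "text": text,             # plaintext extracted from the page
            "embedding": embedding,   # vector from the multilingual model
        },
    )

async def main() -> None:
    es = AsyncElasticsearch("http://localhost:9200")
    try:
        await index_resource(es, 1, "example plaintext", [0.0] * 512)
    finally:
        await es.close()

if __name__ == "__main__":
    asyncio.run(main())
```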
Architecture
There is only one Python process; concurrency is achieved with asyncio, which is used where possible (almost everywhere).
- There is a queue of websites, see database table site_queue. The queue is fed a) on first startup with seeds, b) manually and c) from crawls which find external links. When the queue is handled, new sites are stored to table site. New sites are always updated; existing sites only if their last update was more than crawl.site_revisit_delay seconds in the past. After the queue has been handled there is a delay (crawl.site_delay seconds) before repetition.
- Updating a site means: the start page is fetched and criteria are applied to its content to determine whether the site is relevant. (It is assumed that (non-)relevance is obvious from the start page already.) If the site is relevant, more information is fetched (e.g. sitemaps).
- There is a configurable number of crawler workers (config crawl.workers) which concurrently crawl sites, one at a time per worker. (During the crawl the site is marked as locked using crawl_active=true.) They pick a relevant site which has not been crawled for a certain time ("checkout"), crawl it, and finally mark it as crawled (crawl_active=false, "checkin") and schedule the next crawl. Each crawl (with begin time, end time and number of found (new) resources) is stored in table crawl. (See the checkout/checkin sketch after this list.)
- Crawls are either full crawls (all paths reachable through links from the start page are fetched) or feed crawls (only paths listed in a feed of the site are fetched). The respective (minimum) intervals at which these crawls happen are full_crawl_interval and feed_crawl_interval. Feed crawls can happen more frequently (e.g. daily).
- When a path is fetched it can result in a MetaResource (feed or sitemap) or a TextResource (redirects are followed and irrelevant content is ignored). A TextResource obtained from a path can be very similar to a resource obtained from another path; in this case no new resource is created, but both paths are linked to the same resource (see tables site_path and resource, and the deduplication sketch after this list).
- If a MetaResource is fetched and it is a sitemap, its paths are added to table site_path. If it is a feed, the feed is stored in table site_feed and its paths are added to table site_path.
- Links between sites are stored in table site_link.
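The checkout/checkin cycle of a crawler worker could look roughly like the following sketch. It assumes asyncpg and a simplified schema: only the tables site and crawl, the column crawl_active and the config value full_crawl_interval come from the description above; all other names and the SQL are assumptions, not atextcrawler's actual code.

```python
# Sketch of one crawler worker's checkout/crawl/checkin cycle, assuming asyncpg.
# Only the tables site and crawl and the column crawl_active appear in the
# documentation; the other column names and the crawl itself are placeholders.
import asyncio
from datetime import datetime, timezone

import asyncpg

async def crawl_site(site: asyncpg.Record, full: bool) -> int:
    """Placeholder for the actual crawl; returns the number of found resources."""
    await asyncio.sleep(0)
    return 0

async def worker_loop(pool: asyncpg.Pool, config: dict) -> None:
    while True:
        async with pool.acquire() as conn:
            # "Checkout": pick one relevant site that is due and not locked.
            site = await conn.fetchrow(
                """
                UPDATE site SET crawl_active = true
                WHERE id = (
                    SELECT id FROM site
                    WHERE relevant AND NOT crawl_active AND next_crawl <= now()
                    ORDER BY next_crawl LIMIT 1
                    FOR UPDATE SKIP LOCKED
                )
                RETURNING *
                """
            )
        if site is None:
            await asyncio.sleep(60)  # nothing is due; try again later
            continue
        # A full crawl follows all internal links; a feed crawl only fetches
        # paths found in the site's feeds and may run more frequently.
        full = site["last_full_crawl"] is None or (
            datetime.now(timezone.utc) - site["last_full_crawl"]
        ).total_seconds() > config["full_crawl_interval"]
        t_begin = datetime.now(timezone.utc)
        n_resources = await crawl_site(site, full)
        async with pool.acquire() as conn:
            # "Checkin": record the crawl and release the lock.
            await conn.execute(
                "INSERT INTO crawl (site_id, t_begin, t_end, n_resources)"
                " VALUES ($1, $2, $3, $4)",
                site["id"], t_begin, datetime.now(timezone.utc), n_resources,
            )
            await conn.execute(
                "UPDATE site SET crawl_active = false WHERE id = $1",
                site["id"],
            )
```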
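Linking several paths to the same (near-duplicate) resource could, in simplified form, look like the following sketch. The tables site_path and resource come from the description above; the similarity test (here reduced to equality of a text hash) and all column names are assumptions.

```python
# Sketch of linking a newly fetched path to an existing, near-identical resource
# instead of creating a duplicate. The similarity test is simplified to hash
# equality; column names and the unique constraint on (site_id, path) are assumed.
import hashlib

import asyncpg

async def store_text_resource(conn: asyncpg.Connection, site_id: int,
                              path: str, text: str) -> int:
    text_hash = hashlib.sha256(text.encode()).hexdigest()
    # Reuse an existing resource of the same site with (effectively) the same
    # content; otherwise create a new one.
    resource_id = await conn.fetchval(
        "SELECT id FROM resource WHERE site_id = $1 AND text_hash = $2",
        site_id, text_hash,
    )
    if resource_id is None:
        resource_id = await conn.fetchval(
            "INSERT INTO resource (site_id, text_hash) VALUES ($1, $2)"
            " RETURNING id",
            site_id, text_hash,
        )
    # Both paths end up pointing to the same resource row.
    await conn.execute(
        "INSERT INTO site_path (site_id, path, resource_id) VALUES ($1, $2, $3)"
        " ON CONFLICT (site_id, path) DO UPDATE SET resource_id = EXCLUDED.resource_id",
        site_id, path, resource_id,
    )
    return resource_id
```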
Site annotations
Database table site_annotation can hold any number of annotations per base_url. While crawling, these annotations are taken into account: blacklisting or whitelisting has precedence over the function site_filter (in plugin filter_site).
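A minimal sketch of this precedence check, assuming asyncpg; the column ann_type, its values and the signature of site_filter are assumptions:

```python
# Sketch of annotation precedence over the site_filter plugin function.
# The column ann_type and its values 'blacklist'/'whitelist' are assumptions.
import asyncpg

async def is_site_relevant(conn: asyncpg.Connection, base_url: str,
                           site, site_filter) -> bool:
    ann_type = await conn.fetchval(
        "SELECT ann_type FROM site_annotation WHERE base_url = $1"
        " ORDER BY id DESC LIMIT 1",
        base_url,
    )
    if ann_type == "blacklist":
        return False
    if ann_type == "whitelist":
        return True
    # No decisive annotation: fall back to the plugin's site_filter function.
    return site_filter(site)
```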
Annotations cannot be managed from within atextcrawler; this requires another application, usually atextsearch.
Each annotation requires the base_url of the annotated site; if a site with this base_url exists in the site table, the annotation should also be associated with the site's id (column site_id).
Limitations
- atextcrawler is not optimized for speed; it is meant to be run as a background task on a server with limited resources (or even an SBC, like a Raspberry Pi, with attached storage)
- atextcrawler only indexes text; other resources such as images are not indexed