# Introduction

## What atextcrawler does

* Start from a seed (white+black-)list of website base URLs
* Loop over sites selected by applying criteria to the content
  of the site's start page
* Crawl the site, i.e. loop over resources of the site
* Extract plaintext content from the resource (html parsing is
  optimized for html5); discard non-text content, but handle feeds
  and sitemaps
* Extract internal and external links; external links contribute
  to the site list
* Keep track of the sites and resources in a PostgreSQL database
* Store plaintext content of resources in an Elasticsearch index
* Store vector embeddings of plaintexts also in Elasticsearch,
  using tensorflow model server with a multilingual language model
  (a sketch follows below)
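
For illustration, a minimal sketch of the last two steps. The tensorflow
model server's REST `:predict` endpoint is standard; the host, model
name, index name and mapping are assumptions, not atextcrawler's actual
configuration:

```python
import json
import urllib.request

from elasticsearch import Elasticsearch  # pip install elasticsearch

# Assumed deployment; adjust host/ports, model and index names.
TFS_URL = 'http://localhost:8501/v1/models/multilingual_embed:predict'
es = Elasticsearch('http://localhost:9200')

def embed(texts):
    """Get embedding vectors from a tensorflow model server (REST API)."""
    data = json.dumps({'inputs': texts}).encode()
    req = urllib.request.Request(
        TFS_URL, data=data, headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)['outputs']

def store_plaintext(resource_id, text):
    """Store the plaintext and its embedding in one Elasticsearch doc."""
    es.index(index='resource_text',  # assumed index name
             id=resource_id,
             document={'text': text, 'embedding': embed([text])[0]})
```
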
## Architecture

There is only one Python process; concurrency is achieved with
asyncio, which we use where possible (almost everywhere).

1. There is a queue of websites, see database table `site_queue`.
   The queue is fed a) on first startup with seeds, b) manually,
   and c) from crawls which find external links.
   When the queue is handled, new sites are stored to table `site`.
   New sites are always updated; existing sites only if their last
   update was more than `crawl.site_revisit_delay` seconds in the
   past. After the queue has been handled there is a delay
   (`crawl.site_delay` seconds) before repetition. (A sketch of
   this loop follows after this list.)
1. Updating a site means: the start page is fetched and criteria
   are applied to its content to determine whether the site is
   relevant. (It is assumed that (non-)relevance is obvious from
   the start page already.) If the site is relevant, more
   information is fetched (e.g. sitemaps).
1. There is a configurable number of crawler workers (config
   `crawl.workers`) which concurrently crawl sites, one at a time
   per worker. (During the crawl the site is marked as locked
   using `crawl_active=true`.) A worker picks a relevant site
   which has not been crawled for a certain time ("checkout"),
   crawls it, and finally marks it as crawled (`crawl_active=false`,
   "checkin") and schedules the next crawl; see the worker sketch
   below. Each crawl (with begin time, end time and the number of
   found (new) resources) is stored in table `crawl`.
1. Crawls are either full crawls (all paths reachable through
   links from the start page are fetched) or feed crawls (only
   paths listed in a feed of the site are fetched). The respective
   (minimum) intervals at which these crawls happen are
   `full_crawl_interval` and `feed_crawl_interval`; feed crawls
   can happen more frequently (e.g. daily). A sketch of this
   decision follows below.
1. When a path is fetched it can result in a MetaResource (feed
   or sitemap) or a TextResource (redirects are followed and
   irrelevant content is ignored). A TextResource obtained from
   a path can be very similar to a resource obtained from another
   path; in this case no new resource is created, but both paths
   are linked to the same resource (see tables `site_path` and
   `resource`, and the deduplication sketch below).
1. If a MetaResource is fetched and it is a sitemap, its paths
   are added to table `site_path`. If it is a feed, the feed is
   stored in table `site_feed` and its paths are added to table
   `site_path`.
1. Links between sites are stored in table `site_link`.
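
The following sketches illustrate steps 1, 3, 4 and 5 in simplified
form; they are approximations under assumed schemas, not atextcrawler's
actual code. First, the queue handling of step 1, assuming asyncpg and
simplified column names (`update_site` stands in for the updating
described in step 2):

```python
import asyncio
import asyncpg

async def process_site_queue(pool, site_delay, site_revisit_delay):
    """Step 1: drain `site_queue` into `site`, then wait and repeat."""
    while True:
        async with pool.acquire() as conn:
            for row in await conn.fetch('DELETE FROM site_queue RETURNING url'):
                # New sites are inserted; already known ones are left alone.
                await conn.execute(
                    'INSERT INTO site (base_url) VALUES ($1)'
                    ' ON CONFLICT (base_url) DO NOTHING', row['url'])
            # Update new sites and sites whose last update is too old.
            stale = await conn.fetch(
                "SELECT id FROM site WHERE last_update IS NULL OR"
                " last_update < now() - $1 * interval '1 second'",
                site_revisit_delay)
            for site in stale:
                await update_site(conn, site['id'])  # hypothetical, see step 2
        await asyncio.sleep(site_delay)  # crawl.site_delay
```
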
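Step 3, the worker loop: the "checkout" can be made atomic with a
single `UPDATE … RETURNING`, so concurrent workers never pick the same
site. The `relevant` and `next_crawl` columns and the rescheduling
policy are assumptions:

```python
import asyncio

async def crawler_worker(pool):
    """Step 3: checkout a due site, crawl it, checkin and reschedule."""
    while True:
        async with pool.acquire() as conn:
            site = await conn.fetchrow(
                "UPDATE site SET crawl_active = true"
                " WHERE id = (SELECT id FROM site"
                "             WHERE relevant AND NOT crawl_active"
                "               AND next_crawl < now()"
                "             ORDER BY next_crawl LIMIT 1"
                "             FOR UPDATE SKIP LOCKED)"
                " RETURNING id, base_url")
            if site is None:
                await asyncio.sleep(10)  # nothing due; poll again later
                continue
            try:
                await crawl_site(conn, site)  # hypothetical crawl routine
            finally:  # "checkin", even if the crawl failed
                await conn.execute(
                    "UPDATE site SET crawl_active = false,"
                    " next_crawl = now() + interval '1 day'"
                    " WHERE id = $1", site['id'])
```
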
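Step 4, deciding which kind of crawl is due (the timestamp attributes
are assumptions):

```python
import time

def due_crawl_type(site, full_crawl_interval, feed_crawl_interval):
    """Step 4: return 'full', 'feed' or None, depending on what is due."""
    now = time.time()
    if now - site.last_full_crawl >= full_crawl_interval:
        return 'full'  # fetch all paths reachable from the start page
    if now - site.last_feed_crawl >= feed_crawl_interval:
        return 'feed'  # fetch only paths listed in the site's feeds
    return None  # no crawl due yet
```
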
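Step 5, recognizing near-duplicate text resources: simhash is one
common technique for this; it is shown here only as an illustration,
atextcrawler's actual similarity measure may differ:

```python
import hashlib

def simhash(text, bits=64):
    """Compute a simhash over whitespace-separated tokens."""
    v = [0] * bits
    for token in text.split():
        h = int.from_bytes(
            hashlib.blake2b(token.encode(), digest_size=8).digest(), 'big')
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def is_near_duplicate(hash_a, hash_b, max_distance=3):
    """Near-duplicates: the simhashes differ in only a few bits."""
    return bin(hash_a ^ hash_b).count('1') <= max_distance
```
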
## Site annotations

Database table `site_annotation` can have any number of annotations
for a base_url. While crawling, these annotations are considered:
blacklisting or whitelisting takes precedence over the function
`site_filter` (in plugin `filter_site`), as sketched below.
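
Roughly, this precedence rule amounts to the following (a sketch with
assumed signatures):

```python
def site_is_relevant(site, annotation, site_filter):
    """Explicit annotations win; otherwise the plugin decides."""
    if annotation == 'blacklist':
        return False
    if annotation == 'whitelist':
        return True
    return site_filter(site)  # `site_filter` from plugin `filter_site`
```
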
Annotations cannot be managed from within atextcrawler;
this requires another application, usually [`atextsearch`](https://TODO).

Each annotation requires a base_url of the annotated site, and
if a site with this base_url exists in the `site` table, it should
also be associated with the site's id (column `site_id`).

## Limitations

* atextcrawler is not optimized for speed; it is meant to be run
  as a background task on a server with limited resources
  (or even an SBC, like a raspberry pi, with attached storage)
* atextcrawler only indexes text, no other resources like images