# Introduction

## What atextcrawler does

* Start from a seed (white+black-)list of website base URLs
* Loop over sites selected by applying criteria to the content
  of the site's start page
* Crawl the site, i.e. loop over resources of the site
* Extract plaintext content from the resource (html parsing is
  optimized for html5); discard non-text content, but handle feeds
  and sitemaps
* Extract internal and external links; external links contribute
  to the site list
* Keep track of the sites and resources in a PostgreSQL database
* Store plaintext content of resources in an Elasticsearch index
* Store vector embeddings of plaintexts also in Elasticsearch,
  using tensorflow model server with a multilingual language model
  (a sketch follows below)
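
For illustration, a minimal sketch of the last two steps. The tensorflow
model server's REST `:predict` endpoint is standard; the host, model
name, index name and mapping are assumptions, not atextcrawler's actual
configuration:

```python
import json
import urllib.request

from elasticsearch import Elasticsearch  # pip install elasticsearch

# Assumed deployment; adjust host/ports, model and index names.
TFS_URL = 'http://localhost:8501/v1/models/multilingual_embed:predict'
es = Elasticsearch('http://localhost:9200')

def embed(texts):
    """Get embedding vectors from a tensorflow model server (REST API)."""
    data = json.dumps({'inputs': texts}).encode()
    req = urllib.request.Request(
        TFS_URL, data=data, headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)['outputs']

def store_plaintext(resource_id, text):
    """Store the plaintext and its embedding in one Elasticsearch doc."""
    es.index(index='resource_text',  # assumed index name
             id=resource_id,
             document={'text': text, 'embedding': embed([text])[0]})
```
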
## Architecture

There is only one Python process; concurrency is achieved with
asyncio, which we use where possible (almost everywhere).

1. There is a queue of websites, see database table `site_queue`.
   The queue is fed a) on first startup with seeds, b) manually,
   and c) from crawls which find external links.
   When the queue is handled, new sites are stored to table `site`.
   New sites are always updated; existing sites only if their last
   update was more than `crawl.site_revisit_delay` seconds in the
   past. After the queue has been handled there is a delay
   (`crawl.site_delay` seconds) before repetition. (A sketch of
   this loop follows after this list.)
1. Updating a site means: the start page is fetched and criteria
   are applied to its content to determine whether the site is
   relevant. (It is assumed that (non-)relevance is obvious from
   the start page already.) If the site is relevant, more
   information is fetched (e.g. sitemaps).
1. There is a configurable number of crawler workers (config
   `crawl.workers`) which concurrently crawl sites, one at a time
   per worker. (During the crawl the site is marked as locked
   using `crawl_active=true`.) A worker picks a relevant site
   which has not been crawled for a certain time ("checkout"),
   crawls it, and finally marks it as crawled (`crawl_active=false`,
   "checkin") and schedules the next crawl; see the worker sketch
   below. Each crawl (with begin time, end time and the number of
   found (new) resources) is stored in table `crawl`.
1. Crawls are either full crawls (all paths reachable through
   links from the start page are fetched) or feed crawls (only
   paths listed in a feed of the site are fetched). The respective
   (minimum) intervals at which these crawls happen are
   `full_crawl_interval` and `feed_crawl_interval`; feed crawls
   can happen more frequently (e.g. daily). A sketch of this
   decision follows below.
1. When a path is fetched it can result in a MetaResource (feed
   or sitemap) or a TextResource (redirects are followed and
   irrelevant content is ignored). A TextResource obtained from
   a path can be very similar to a resource obtained from another
   path; in this case no new resource is created, but both paths
   are linked to the same resource (see tables `site_path` and
   `resource`, and the deduplication sketch below).
1. If a MetaResource is fetched and it is a sitemap, its paths
   are added to table `site_path`. If it is a feed, the feed is
   stored in table `site_feed` and its paths are added to table
   `site_path`.
1. Links between sites are stored in table `site_link`.
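
The following sketches illustrate steps 1, 3, 4 and 5 in simplified
form; they are approximations under assumed schemas, not atextcrawler's
actual code. First, the queue handling of step 1, assuming asyncpg and
simplified column names (`update_site` stands in for the updating
described in step 2):

```python
import asyncio
import asyncpg

async def process_site_queue(pool, site_delay, site_revisit_delay):
    """Step 1: drain `site_queue` into `site`, then wait and repeat."""
    while True:
        async with pool.acquire() as conn:
            for row in await conn.fetch('DELETE FROM site_queue RETURNING url'):
                # New sites are inserted; already known ones are left alone.
                await conn.execute(
                    'INSERT INTO site (base_url) VALUES ($1)'
                    ' ON CONFLICT (base_url) DO NOTHING', row['url'])
            # Update new sites and sites whose last update is too old.
            stale = await conn.fetch(
                "SELECT id FROM site WHERE last_update IS NULL OR"
                " last_update < now() - $1 * interval '1 second'",
                site_revisit_delay)
            for site in stale:
                await update_site(conn, site['id'])  # hypothetical, see step 2
        await asyncio.sleep(site_delay)  # crawl.site_delay
```
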
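Step 3, the worker loop: the "checkout" can be made atomic with a
single `UPDATE … RETURNING`, so concurrent workers never pick the same
site. The `relevant` and `next_crawl` columns and the rescheduling
policy are assumptions:

```python
import asyncio

async def crawler_worker(pool):
    """Step 3: checkout a due site, crawl it, checkin and reschedule."""
    while True:
        async with pool.acquire() as conn:
            site = await conn.fetchrow(
                "UPDATE site SET crawl_active = true"
                " WHERE id = (SELECT id FROM site"
                "             WHERE relevant AND NOT crawl_active"
                "               AND next_crawl < now()"
                "             ORDER BY next_crawl LIMIT 1"
                "             FOR UPDATE SKIP LOCKED)"
                " RETURNING id, base_url")
            if site is None:
                await asyncio.sleep(10)  # nothing due; poll again later
                continue
            try:
                await crawl_site(conn, site)  # hypothetical crawl routine
            finally:  # "checkin", even if the crawl failed
                await conn.execute(
                    "UPDATE site SET crawl_active = false,"
                    " next_crawl = now() + interval '1 day'"
                    " WHERE id = $1", site['id'])
```
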
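Step 4, deciding which kind of crawl is due (the timestamp attributes
are assumptions):

```python
import time

def due_crawl_type(site, full_crawl_interval, feed_crawl_interval):
    """Step 4: return 'full', 'feed' or None, depending on what is due."""
    now = time.time()
    if now - site.last_full_crawl >= full_crawl_interval:
        return 'full'  # fetch all paths reachable from the start page
    if now - site.last_feed_crawl >= feed_crawl_interval:
        return 'feed'  # fetch only paths listed in the site's feeds
    return None  # no crawl due yet
```
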
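Step 5, recognizing near-duplicate text resources: simhash is one
common technique for this; it is shown here only as an illustration,
atextcrawler's actual similarity measure may differ:

```python
import hashlib

def simhash(text, bits=64):
    """Compute a simhash over whitespace-separated tokens."""
    v = [0] * bits
    for token in text.split():
        h = int.from_bytes(
            hashlib.blake2b(token.encode(), digest_size=8).digest(), 'big')
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def is_near_duplicate(hash_a, hash_b, max_distance=3):
    """Near-duplicates: the simhashes differ in only a few bits."""
    return bin(hash_a ^ hash_b).count('1') <= max_distance
```
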
## Site annotations

Database table `site_annotation` can have any number of annotations
for a base_url. While crawling, these annotations are considered:
blacklisting or whitelisting takes precedence over the function
`site_filter` (in plugin `filter_site`), as sketched below.
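
Roughly, this precedence rule amounts to the following (a sketch with
assumed signatures):

```python
def site_is_relevant(site, annotation, site_filter):
    """Explicit annotations win; otherwise the plugin decides."""
    if annotation == 'blacklist':
        return False
    if annotation == 'whitelist':
        return True
    return site_filter(site)  # `site_filter` from plugin `filter_site`
```
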
Annotations cannot be managed from within atextcrawler;
this requires another application, usually [`atextsearch`](https://TODO).

Each annotation requires a base_url of the annotated site, and
if a site with this base_url exists in the `site` table, it should
also be associated with the site's id (column `site_id`).

## Limitations

* atextcrawler is not optimized for speed; it is meant to be run
  as a background task on a server with limited resources
  (or even an SBC, like a raspberry pi, with attached storage)
* atextcrawler only indexes text, no other resources like images