Docido SDK provides several ways for a pull-crawler to push web resources and binary files. When such content is pushed, the framework tries to extract additional information from the payload, such as the text it contains, a title, or a description.
There are 3 different kinds of content a pull-crawler may push, whose analysis is performed later on:
- a URL to a web page, a text, or a binary file
- a blob of bytes with a MIME type
- a stream to any resource, useful when HTTP headers are required to access the resource.
A reference to a URL can be specified in a nested document in the attachments field. The type of the attachment must be link. For instance:
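As a sketch, such a card could look like the nested document below. The identifiers, title, and URL are made-up placeholders, and the exact card schema may provide additional fields:

```python
# Hypothetical card referencing a web page through a "link" attachment.
card = {
    'id': 'my-card-1',  # made-up identifier
    'title': 'A document referencing a web page',
    'attachments': [
        {
            'type': 'link',
            'url': 'http://www.example.com/some-page.html',  # made-up URL
        },
    ],
}
```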
For such an attachment, Docido tries to download the web page and enriches the nested attachment with the page title, description, and text. This also works for links to binary files like PDF or Word documents.
The MIME type is deduced from the Content-Type header of the HTTP response. If the pull-crawler already knows it, it should force its value by providing the mime_type attribute.
This behavior can be disabled by setting the special _analysis attribute to False in the nested document:
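A minimal sketch of such an attachment; the URL is a made-up placeholder:

```python
attachment = {
    'type': 'link',
    'url': 'http://www.example.com/report.pdf',  # made-up URL
    '_analysis': False,  # Docido will neither download nor analyze this link
}
```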
Blob of bytes
If the pull-crawler has the binary content of the payload to index, it can be specified in the bytes attribute of an attachment. The attachment must also provide the mime_type attribute so that the framework can use the most appropriate content analyzer. The crawler may also provide in the same attachment any other information regarding the payload, for instance:
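A sketch of such an attachment; the byte content, title, and description are made up, and the exact attachment schema may provide other fields:

```python
attachment = {
    'bytes': b'%PDF-1.4 fake pdf content',  # the raw payload to index
    'mime_type': 'application/pdf',  # lets Docido pick the right analyzer
    'title': 'Q3 financial report',  # any extra payload information
    'description': 'Quarterly report exported from the remote service',
}
```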
Stream
This mode allows you to delay the download of the payload, in order to reduce the memory footprint of your process. The stream must be an instance of docido_sdk.toolbox.http_ext.delayed_request, specified in the stream attribute. For instance:
from docido_sdk.toolbox.http_ext import delayed_request
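A sketch of a stream attachment, assuming delayed_request mirrors the requests API and accepts a method, a URL, and keyword arguments such as headers; the URL and authorization header below are made up. The HTTP call is only performed when Docido analyzes the attachment, not when the card is pushed:

```python
from docido_sdk.toolbox.http_ext import delayed_request

attachment = {
    # Assumption: delayed_request('GET', url, headers=...) builds a lazy
    # HTTP request, executed later by the content analyzer.
    'stream': delayed_request(
        'GET',
        'https://api.example.com/files/42/content',  # made-up URL
        headers={'Authorization': 'Bearer MY_TOKEN'},  # made-up token
    ),
}
```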
Pull-crawler tasks are executed with Celery, and Docido SDK provides various methods to control how these tasks are scheduled:
- independent sub-tasks: Crawler.iter_crawl_tasks simply returns a list of tasks, executed in parallel.
- groups of sub-tasks: Crawler.iter_crawl_tasks returns a list of lists of tasks:
  - tasks of a given list are executed sequentially;
  - two different lists of tasks are executed in parallel.
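As a sketch, sub-tasks are plain callables, here built with functools.partial; the fetch_folder function, its signature, and the folder names are hypothetical:

```python
import functools

def fetch_folder(index, token, prev_result, logger, folder=None):
    """Hypothetical sub-task crawling one folder of the remote service."""

# Independent sub-tasks: all three may run in parallel.
tasks = [functools.partial(fetch_folder, folder=name)
         for name in ('INBOX', 'Sent', 'Drafts')]

# Groups of sub-tasks: each inner list runs sequentially,
# while the two lists run in parallel.
grouped = [
    [functools.partial(fetch_folder, folder='INBOX'),
     functools.partial(fetch_folder, folder='Archive')],
    [functools.partial(fetch_folder, folder='Sent')],
]
```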
Max concurrent tasks per crawl
When a pull-crawler simply provides a list of tasks, Docido's internal framework splits it into sub-lists to control how many tasks are executed in parallel. The default value is 10, and it can be updated if necessary.
For instance, if the API your crawler fetches accepts no more than 2 simultaneous connections, you can override the default max_concurrent_tasks in the dict returned by the Crawler.iter_crawl_tasks method and specify 2 instead:
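A sketch of the dict returned by iter_crawl_tasks with this override; crawl_page is a made-up sub-task:

```python
def crawl_page(index, token, prev_result, logger, page=1):
    """Hypothetical sub-task fetching one page of the remote API."""

# At most 2 sub-tasks of this crawl will run at the same time.
crawl_plan = dict(
    tasks=[crawl_page],
    max_concurrent_tasks=2,  # default is 10
)
```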
If you want to set a value greater than the default (10), please contact Cogniteev's developers and explain your use-case.
Passing data from a task to another
If your crawler returns a list of task sequences, then you can leverage the prev_result parameter given to each sub-task: it contains what the previous task returned. Note that the prev_result parameter given to the first task of every sequence is None.
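A sketch of a two-task sequence chained through prev_result; the function names and returned identifiers are hypothetical:

```python
def list_folder_ids(index, token, prev_result, logger):
    # First task of the sequence: prev_result is None here.
    assert prev_result is None
    return ['folder-1', 'folder-2']

def crawl_folders(index, token, prev_result, logger):
    # Receives whatever list_folder_ids returned.
    for folder_id in prev_result:
        pass  # fetch and index the folder content here

# One sequence: list_folder_ids runs first, then crawl_folders.
task_sequences = [[list_folder_ids, crawl_folders]]
```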
Sub-task retry mechanism
There are many use-cases where you want to retry a task later on:
- you cannot contact the API;
- you reached the API's rate limits.
To do so, a sub-task can raise an instance of the docido_sdk.crawler.Retry exception class. The Retry constructor accepts a bunch of arguments to specify when to retry the current task. Furthermore, you can give the Retry class keyword arguments that will be passed to the retried task.
The sample below highlights the retry capabilities:
- During the initial scan, let's assume there are 10 pages to crawl. iter_crawl_tasks asks for the crawl_page sub-task to be called with crawl_start=UNIX_TIMESTAMP, for instance 1447083795.
- The first call to crawl_page, with page=1, submits cards to the Docido index and asks for the crawl_page sub-task to be called again with page=2, and so on.
- When crawl_page is called with page=11, the fetch_page method raises an UnknownPage exception, meaning that the crawl is terminated and it is time to update the crawler checkpoint.
- When the account synchronization is triggered again a few hours later, since is set to the date when the previous crawl began, so that ClientAPI.fetch_page only provides changes that occurred since then.
import functools

from docido_sdk.core import Core, implements
from docido_sdk.crawler import ICrawler, Retry
from docido_sdk.toolbox.date_ext import timestamp_ms

def crawl_page(index, token, prev_result, logger,
               since=None, crawl_start=None, page=1):
    client = ClientApi(token)  # ClientApi and UnknownPage are hypothetical
    try:
        index.push_cards(client.fetch_page(page=page, since=since))
    except UnknownPage:
        return  # crawl is over: update the crawler checkpoint here
    raise Retry(page=page + 1, countdown=60)  # retry with the next page

class MyCrawler(Core):
    implements(ICrawler)
    service_name = 'my_service'
    def iter_crawl_tasks(self, index, *args, **kwargs):
        return dict(tasks=[functools.partial(
            crawl_page, crawl_start=timestamp_ms.now())])
Passing huge payload
It is not recommended to use huge objects:
- as parameters passed to sub-tasks;
- as values returned by sub-tasks.
You should only pass object identifiers, not their content.
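For instance, pass only an identifier and let the sub-task fetch the content itself; crawl_document and the identifier below are made up:

```python
import functools

def crawl_document(index, token, prev_result, logger, doc_id=None):
    """Hypothetical sub-task: downloads and indexes one document."""

# Good: the task parameter is a small identifier, not the document body.
tasks = [functools.partial(crawl_document, doc_id='42')]
```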
Dispatch your API calls among different sub-tasks
The iter_crawl_tasks method is only meant to enumerate what is to be done by the returned sub-tasks. It should not call the crawled API to retrieve documents' content.
Do not unnecessarily delete objects from index
When a crawler needs to perform incremental indexing of a set of objects, one basic pattern is to first remove the set from the index and then push the new set. This is not the right way to proceed because, for a certain time, those objects are not indexed, and therefore not searchable.
What you actually need to do is:
- push objects that are in the source but not in the index;
- update items already present in the index. In practice, those objects are often simply reindexed because, in most cases, the index queries required to know whether the indexed objects differ from the source's are very costly;
- remove from the index the objects that are no longer in the source.
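The three steps above can be sketched with plain set arithmetic, assuming the identifiers from the crawled service and from the Docido index have already been fetched (the sets below are made up):

```python
source_ids = {'a', 'b', 'c'}   # identifiers present in the crawled service
indexed_ids = {'b', 'c', 'd'}  # identifiers currently in the Docido index

to_push = source_ids - indexed_ids    # new objects, not yet indexed
to_update = source_ids & indexed_ids  # reindexed rather than diffed
to_delete = indexed_ids - source_ids  # gone from the source
```

The objects stay searchable throughout: nothing is removed except what has actually disappeared from the source.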