Docido SDK provides several ways for a pull-crawler to push web resources and binary files. When such content is pushed, the framework tries to extract additional information from the payload, such as the text it contains, a title, or a description.
There are 3 different kinds of content a pull-crawler may push, whose analysis is performed later on:
- a URL to a web page, a text, or a binary file
- a blob of bytes with a MIME type
- a stream to any resource, useful when HTTP headers are required to access the resource.
A reference to a URL can be specified in a nested document in the attachments field. The type of the attachment must be link. For instance:
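As a sketch, such a card could look like the nested document below. The identifiers, title, and URL are made-up placeholders, and the exact card schema may provide additional fields:

```python
# Hypothetical card referencing a web page through a "link" attachment.
card = {
    'id': 'my-card-1',  # made-up identifier
    'title': 'A document referencing a web page',
    'attachments': [
        {
            'type': 'link',
            'url': 'http://www.example.com/some-page.html',  # made-up URL
        },
    ],
}
```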
For such an attachment, Docido tries to download the web page and enriches the nested attachment with the page title, description, and text. This also works for links to binary files like PDF or Word documents.
The MIME type is deduced from the Content-Type header of the HTTP response. If the pull-crawler already knows it, it should force its value by providing the mime_type attribute.
This behavior can be disabled by setting the special _analysis attribute to False in the nested document:
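A minimal sketch of such an attachment; the URL is a made-up placeholder:

```python
attachment = {
    'type': 'link',
    'url': 'http://www.example.com/report.pdf',  # made-up URL
    '_analysis': False,  # Docido will neither download nor analyze this link
}
```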
Blob of bytes
If the pull-crawler has the binary content of the payload to index, it can be specified in the bytes attribute of an attachment. The attachment must also provide the mime_type attribute so that the framework can use the most appropriate content analyzer. The crawler may also provide in the same attachment any other information regarding the payload, for instance:
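A sketch of such an attachment; the byte content, title, and description are made up, and the exact attachment schema may provide other fields:

```python
attachment = {
    'bytes': b'%PDF-1.4 fake pdf content',  # the raw payload to index
    'mime_type': 'application/pdf',  # lets Docido pick the right analyzer
    'title': 'Q3 financial report',  # any extra payload information
    'description': 'Quarterly report exported from the remote service',
}
```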
Stream
This mode allows you to delay the download of the payload, in order to reduce the memory footprint of your process. The stream must be an instance of docido_sdk.toolbox.http_ext.delayed_request, specified in the stream attribute. For instance:
from docido_sdk.toolbox.http_ext import delayed_request
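A sketch of a stream attachment, assuming delayed_request mirrors the requests API and accepts a method, a URL, and keyword arguments such as headers; the URL and authorization header below are made up. The HTTP call is only performed when Docido analyzes the attachment, not when the card is pushed:

```python
from docido_sdk.toolbox.http_ext import delayed_request

attachment = {
    # Assumption: delayed_request('GET', url, headers=...) builds a lazy
    # HTTP request, executed later by the content analyzer.
    'stream': delayed_request(
        'GET',
        'https://api.example.com/files/42/content',  # made-up URL
        headers={'Authorization': 'Bearer MY_TOKEN'},  # made-up token
    ),
}
```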
Pull-crawler tasks are executed with Celery, and Docido SDK provides various methods to control how these tasks are scheduled:
- independent sub-tasks: Crawler.iter_crawl_tasks simply returns a list of tasks, executed in parallel.
- groups of sub-tasks: Crawler.iter_crawl_tasks returns a list of lists of tasks:
  - tasks of a given list are executed sequentially;
  - two different lists of tasks are executed in parallel.
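As a sketch, sub-tasks are plain callables, here built with functools.partial; the fetch_folder function, its signature, and the folder names are hypothetical:

```python
import functools

def fetch_folder(index, token, prev_result, logger, folder=None):
    """Hypothetical sub-task crawling one folder of the remote service."""

# Independent sub-tasks: all three may run in parallel.
tasks = [functools.partial(fetch_folder, folder=name)
         for name in ('INBOX', 'Sent', 'Drafts')]

# Groups of sub-tasks: each inner list runs sequentially,
# while the two lists run in parallel.
grouped = [
    [functools.partial(fetch_folder, folder='INBOX'),
     functools.partial(fetch_folder, folder='Archive')],
    [functools.partial(fetch_folder, folder='Sent')],
]
```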
Max concurrent tasks per crawl
When a pull-crawler simply provides a list of tasks, Docido's internal framework splits it into sub-lists to control how many tasks are executed in parallel. The default value is 10, and it can be updated if necessary.
For instance, if the API your crawler fetches accepts no more than 2 simultaneous connections, you can override the default max_concurrent_tasks in the dict returned by the Crawler.iter_crawl_tasks method and specify 2 instead:
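A sketch of the dict returned by iter_crawl_tasks with this override; crawl_page is a made-up sub-task:

```python
def crawl_page(index, token, prev_result, logger, page=1):
    """Hypothetical sub-task fetching one page of the remote API."""

# At most 2 sub-tasks of this crawl will run at the same time.
crawl_plan = dict(
    tasks=[crawl_page],
    max_concurrent_tasks=2,  # default is 10
)
```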
If you want to set a value greater than the default (10), please contact Cogniteev's developers and explain your use-case.
Passing data from a task to another
If your crawler returns a list of task sequences, then you can leverage the prev_result parameter given to each sub-task: it contains what the previous task returned. Note that the prev_result parameter given to the first task of every sequence is None.
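A sketch of a two-task sequence chained through prev_result; the function names and returned identifiers are hypothetical:

```python
def list_folder_ids(index, token, prev_result, logger):
    # First task of the sequence: prev_result is None here.
    assert prev_result is None
    return ['folder-1', 'folder-2']

def crawl_folders(index, token, prev_result, logger):
    # Receives whatever list_folder_ids returned.
    for folder_id in prev_result:
        pass  # fetch and index the folder content here

# One sequence: list_folder_ids runs first, then crawl_folders.
task_sequences = [[list_folder_ids, crawl_folders]]
```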
Sub-task retry mechanism
There are many use-cases where you want to retry a task later on:
- you cannot contact the API;
- you reached the API's rate limits.
To do so, a sub-task can raise an instance of the docido_sdk.crawler.Retry exception class. The Retry constructor accepts a bunch of arguments to specify when to retry the current task. Furthermore, you can give the Retry class keyword arguments that will be passed to the retried task.
The sample below highlights the retry capabilities:
- During the initial scan, let's assume there are 10 pages to crawl. iter_crawl_tasks asks for the crawl_page sub-task to be called with crawl_start=UNIX_TIMESTAMP, for instance 1447083795.
- The first call to crawl_page, with page=1, submits cards to the Docido index and asks for the crawl_page sub-task to be called again with page=2, and so on.
- When crawl_page is called with page=11, the fetch_page method raises an UnknownPage exception, meaning that the crawl is terminated and it is time to update the crawler checkpoint.
- When the account synchronization is triggered again a few hours later, since is set to the date when the previous crawl began, so that ClientAPI.fetch_page only provides changes that occurred since then.
import functools

from docido_sdk.core import Core, implements
from docido_sdk.crawler import ICrawler, Retry
from docido_sdk.toolbox.date_ext import timestamp_ms

def crawl_page(index, token, prev_result, logger,
               since=None, crawl_start=None, page=1):
    client = ClientApi(token)  # ClientApi and UnknownPage are hypothetical
    try:
        index.push_cards(client.fetch_page(page=page, since=since))
    except UnknownPage:
        return  # crawl is over: update the crawler checkpoint here
    raise Retry(page=page + 1, countdown=60)  # retry with the next page

class MyCrawler(Core):
    implements(ICrawler)
    service_name = 'my_service'
    def iter_crawl_tasks(self, index, *args, **kwargs):
        return dict(tasks=[functools.partial(
            crawl_page, crawl_start=timestamp_ms.now())])
Passing huge payload
It is not recommended to use huge objects:
- as parameters passed to sub-tasks;
- as values returned by sub-tasks.
You should only pass object identifiers, not their content.
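For instance, pass only an identifier and let the sub-task fetch the content itself; crawl_document and the identifier below are made up:

```python
import functools

def crawl_document(index, token, prev_result, logger, doc_id=None):
    """Hypothetical sub-task: downloads and indexes one document."""

# Good: the task parameter is a small identifier, not the document body.
tasks = [functools.partial(crawl_document, doc_id='42')]
```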
Dispatch your API calls among different sub-tasks
The iter_crawl_tasks method is only meant to enumerate what is to be done by the returned sub-tasks. It should not call the crawled API to retrieve documents' content.
Do not unnecessarily delete objects from index
When a crawler needs to perform incremental indexing of a set of objects, one basic pattern is to first remove the set from the index and then push the new set. This is not the right way to proceed because, for a certain time, those objects are not indexed, and therefore not searchable.
What you actually need to do is:
- push objects that are in the source but not in the index;
- update items already present in the index. In practice, those objects are often simply reindexed because, in most cases, the index queries required to know whether the indexed objects differ from the source's are very costly;
- remove from the index the objects that are no longer in the source.
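The three steps above can be sketched with plain set arithmetic, assuming the identifiers from the crawled service and from the Docido index have already been fetched (the sets below are made up):

```python
source_ids = {'a', 'b', 'c'}   # identifiers present in the crawled service
indexed_ids = {'b', 'c', 'd'}  # identifiers currently in the Docido index

to_push = source_ids - indexed_ids    # new objects, not yet indexed
to_update = source_ids & indexed_ids  # reindexed rather than diffed
to_delete = indexed_ids - source_ids  # gone from the source
```

The objects stay searchable throughout: nothing is removed except what has actually disappeared from the source.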