This page explains how to test your crawlers on your workstation.

Unit-tests

You can write unit-tests, executed by tox, to cover the code used by your crawler.
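For instance, a minimal test module that tox could collect might look like the following sketch; the mycrawler package and the parse_repository helper are purely illustrative names, not part of the SDK:

# test_parsing.py -- minimal, hypothetical test module collected by tox.
# `mycrawler.parsing.parse_repository` is an illustrative helper, not an SDK API.
import unittest

from mycrawler.parsing import parse_repository


class ParseRepositoryTest(unittest.TestCase):
    def test_extracts_title(self):
        payload = {'name': 'docido-sdk', 'private': False}
        self.assertEqual(parse_repository(payload)['title'], 'docido-sdk')

    def test_rejects_private_repositories(self):
        with self.assertRaises(ValueError):
            parse_repository({'name': 'secret', 'private': True})


if __name__ == '__main__':
    unittest.main()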

Code coverage

The source code of a compiled programming language is, to some extent, checked by the compiler. Since we are using Python, there is no such safety net, so writing bullet-proof code requires extra care.

One interesting indicator is the coverage of your unit-tests, i.e. how much of your library's code is executed when the tests run.

Build HTML report

The tox -e stats command writes a summary of code coverage to the console output and generates an HTML report in the ./htmlcov directory providing full coverage details:

  • executed and non-executed lines
  • lines excluded with the # pragma: no cover comment (see the snippet below)
  • branches only partially executed, for instance when the else branch of a condition is never taken

Coverage.py sample
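For instance, a defensive branch that you deliberately leave untested can be excluded from the report with the pragma comment; the function below is purely illustrative:

def load_token(path):
    """Read an API token from disk (hypothetical helper, for illustration only)."""
    try:
        with open(path) as token_file:
            return token_file.read().strip()
    except IOError:  # pragma: no cover
        # Defensive branch deliberately excluded from the coverage report.
        return None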

Frameworks

Here are some pointers to interesting libraries that can help you write unit-tests efficiently:

  • unittest: the Python core testing library.
  • mock: a Python mocking library, part of the Python standard library (as unittest.mock) since version 3.3.
  • vcrpy: records the HTTP requests made by your code so they can be replayed later on.
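As a rough sketch, the two libraries can be combined as follows; the mycrawler.client module, the fetch_user helper and the cassette path are hypothetical, and on Python 2 the standalone mock package replaces unittest.mock:

import unittest
from unittest import mock  # use `import mock` on Python 2

import vcr

from mycrawler import client  # hypothetical module wrapping the remote API


class ClientTest(unittest.TestCase):
    @vcr.use_cassette('fixtures/fetch_user.yaml')
    def test_replays_recorded_http_exchange(self):
        # The first run records real HTTP traffic in the cassette,
        # later runs replay it without touching the network.
        user = client.fetch_user('octocat')
        self.assertEqual(user['login'], 'octocat')

    def test_error_handling(self):
        # Replace the network call with a mock to exercise the failure path.
        with mock.patch.object(client, 'fetch_user', side_effect=IOError):
            self.assertRaises(IOError, client.fetch_user, 'octocat')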

Execute crawls on your workstation

The Docido SDK provides a command-line utility named dcc-run that lets you run crawls on your workstation.

Installation

You can create a dedicated virtualenv for this purpose, directly from your crawler package:

cd /path/to/my/crawler
virtualenv .test            # create a dedicated Python environment
. .test/bin/activate        # activate it in the current shell
pip install .               # install the crawler and its dependencies, including the SDK
hash -r                     # make the shell forget previously hashed command locations
dcc-run --help

Configuration

The script relies on two configuration files to build the proper testing environment:

  • Global YAML configuration, describing the crawler's environment (Index API pipeline, extra schemas to check, extra fields to add to the Elasticsearch mapping...). Two basic configurations are provided:
    • settings.yml: the simplest one, with no third-party requirements. Items pushed by your crawler are stored locally.
    • settings-es.yml: describes an environment where documents emitted by your crawler are stored in Elasticsearch. This configuration is required when your crawl needs to execute Elasticsearch queries to perform its incremental scan.
  • .dcc-runs.yml, describing the crawls to launch.

By default, settings.yml is used as the crawler environment. If your crawler needs Elasticsearch, you can specify another YAML configuration file in .dcc-runs.yml.

Once the SDK is installed and the configuration files filled in, the crawl can be run locally with the dcc-run command, executed from the crawler module's root directory.

Launch initial crawl

The dcc-run -v command launches all crawls referenced in .dcc-runs.yml.

All data persisted by your crawler will be stored in the .dcc-runs directory.

Incremental run

You can feed your crawler the content of a previous crawl to test incremental indexing. To do so, pass the path of a previous crawl to the --incremental option, for instance:

dcc-run -i .dcc-runs/github-full-20151125-121432