This page explains how to test your crawlers on your workstation.
Unit-tests
You can write unit-tests, executed by tox, to cover all the code used by
your crawler.
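For instance, a minimal test module, runnable by tox or directly with
python -m unittest, could look like this (mycrawler.utils and
build_document are hypothetical names, to be replaced with your own code):

import unittest

# Hypothetical helper from your own crawler package.
from mycrawler.utils import build_document


class BuildDocumentTest(unittest.TestCase):
    def test_title_is_kept(self):
        doc = build_document({'title': 'hello'})
        self.assertEqual(doc['title'], 'hello')


if __name__ == '__main__':
    unittest.main()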
Code coverage
The source code of a compiled programming language is checked, to some
extent, by its compiler. Python offers no such safety net, so writing
bullet-proof code requires extra care.
One interesting indicator is the coverage of your unit-tests, i.e. how much
of your library's code is executed when the tests run.
Build HTML report
The tox -e stats command writes a summary of code coverage to the console
output and generates an HTML report in the ./htmlcov directory providing full
coverage details:
- executed and non-executed lines
- lines excluded with the # pragma: no cover comment
- branches partially executed, for instance when the else branch of a
condition is never taken
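For instance, a defensive branch can be excluded from the report with the
pragma comment (read_token is an illustrative function, not part of the SDK):

def read_token(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except IOError:  # pragma: no cover
        # Defensive branch, excluded from coverage statistics.
        return None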

Frameworks
Here are some pointers to interesting libraries that can help you write
unit-tests efficiently:
- unittest:
the Python core testing library.
- mock: a Python mocking library, part of the
Python standard library (as unittest.mock) since Python 3.3.
- vcr-py: to record HTTP requests
made by your code and replay them later on (see the sketch below).
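As a sketch, a vcr-py based test could look like this (the requests library,
the URL and the cassette path are illustrative assumptions, not part of the
SDK):

import unittest

import requests
import vcr


class HttpTest(unittest.TestCase):
    # The first run performs the real HTTP request and records it in the
    # cassette file; subsequent runs replay the recorded response offline.
    @vcr.use_cassette('fixtures/github-api.yaml')
    def test_api_root(self):
        response = requests.get('https://api.github.com')
        self.assertEqual(response.status_code, 200)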
Execute crawls on your workstation
The Docido SDK provides a command-line utility named dcc-run that allows you
to run crawls on your workstation.
Installation
You can create a dedicated virtualenv for this, directly from
your crawler package:
cd /path/to/my/crawler
virtualenv .test        # create an isolated Python environment
. .test/bin/activate    # activate it
pip install .           # install the crawler and its dependencies, including the SDK
hash -r                 # refresh the shell's command cache
dcc-run --help
Configuration
The script relies on two configuration files to build the proper testing
environment:
- a global YAML configuration, describing the crawler's environment
(Index API pipeline, extra schemas to check, extra fields to add to the
Elasticsearch mapping...). There are 2 basic configurations:
  - settings.yml: the simplest one, with no required third parties. Items
  pushed by your crawler are stored locally.
  - settings-es.yml: describes an environment where documents emitted by your
  crawler are stored in Elasticsearch. This configuration is required when
  your crawl needs to execute Elasticsearch queries to perform its
  incremental scan.
- .dcc-runs.yml, describing the crawls to launch.
By default, settings.yml is used as the crawler environment.
If your crawler needs Elasticsearch, you can specify another YAML
configuration file in .dcc-runs.yml.
Once the SDK is installed and the configuration files filled in, the crawl can
be run locally with the dcc-run command, executed from the crawler module's
root directory.
Launch initial crawl
The dcc-run -v command launches all crawls
referenced in .dcc-runs.yml.
Tip:
-v stands for verbose; it is advised, as dcc-run logs very little by default.
All data persisted by your crawler will be stored in the .dcc-runs directory.
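For instance, after an initial run of a github crawler, you may end up with a
directory like this one (the timestamp suffix is illustrative):

ls .dcc-runs
github-full-20151125-121432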
Incremental run
You can provide your crawler with the content of a previous crawl to test
incremental indexing. To do so, pass the path of a previous crawl to the
--incremental option (-i for short), for instance:
dcc-run -i .dcc-runs/github-full-20151125-121432