Test your crawlers on your workstation
You can write unit tests, executed by tox, to cover all the code used by your crawler.
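For instance, a minimal test module runnable by pytest (the test runner typically driven by tox) could look like the sketch below. The extract_title helper is hypothetical, standing in for your own crawler code:

```python
# test_crawler.py -- a minimal pytest-style test module.
# extract_title is a made-up example; import your real crawler
# helpers instead.

def extract_title(document):
    """Hypothetical crawler helper: return the stripped title, or ''."""
    return document.get("title", "").strip()

def test_title_is_stripped():
    assert extract_title({"title": "  Hello  "}) == "Hello"

def test_missing_title_yields_empty_string():
    assert extract_title({}) == ""
```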
Source code written in a compiled language is, to some extent, checked by the compiler. Here we are using Python, which means that writing bullet-proof code takes more discipline.
One interesting indicator is the coverage of your unit tests, i.e. how much of your library's code is executed when the tests run.
Build HTML report
The tox -e stats command writes a summary of code coverage to the console
and generates an HTML report in the ./htmlcov directory providing full coverage details:
- executed and non-executed lines
- lines excluded with the # pragma: no cover comment
- branches partially executed, for instance when the else branch of a condition is never taken
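To illustrate the last two points, here is a sketch of how these annotations show up in practice (both functions are made up for the example). If your tests only ever call parse_status(200), the report flags the condition as a partial branch, while debug_dump is excluded entirely:

```python
def parse_status(code):
    # If the test suite only exercises code == 200, coverage reports
    # this condition as a partially executed branch.
    if code == 200:
        return "ok"
    else:
        return "error"

def debug_dump(payload):  # pragma: no cover
    # Excluded from the coverage report by the pragma comment above.
    print(payload)
```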
Here are some pointers to interesting libraries that can help you write tests, in addition to the Python standard testing library.
- mock: a Python mock library, part of the Python standard library since 3.3 (as unittest.mock).
- vcr-py: records the HTTP requests made by your code so that you can replay them later on.
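As a sketch of the first approach, unittest.mock lets you test code that performs HTTP requests without hitting the network. The fetch_user function and its URL are hypothetical; here it takes the opener as a parameter so the test can inject a fake in place of urllib.request.urlopen:

```python
import json
from unittest import mock

def fetch_user(login, opener):
    # Hypothetical crawler helper; 'opener' is expected to behave
    # like urllib.request.urlopen (a context manager whose value
    # exposes read()).
    with opener("https://api.example.com/users/%s" % login) as response:
        return json.loads(response.read().decode("utf-8"))

def test_fetch_user_without_network():
    # Build a fake response object instead of performing real HTTP.
    fake_response = mock.MagicMock()
    fake_response.read.return_value = json.dumps({"login": "alice"}).encode("utf-8")
    fake_response.__enter__.return_value = fake_response
    fake_opener = mock.MagicMock(return_value=fake_response)
    assert fetch_user("alice", opener=fake_opener) == {"login": "alice"}

test_fetch_user_without_network()
```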
Execute crawls on your workstation
The Docido SDK provides a command line utility named dcc-run that allows you to run crawls on your workstation. You can create a dedicated virtualenv environment for that, and install directly from your crawler package:
pip install .
The script relies on two configuration files to build the proper testing environment:
- A global YAML configuration, describing the crawler's environment
(Index API pipeline, extra schemas to check, extra fields to add to the
Elasticsearch mapping...). There are 2 basic configurations:
settings.yml: the simplest one, without required third parties. Items
pushed by your crawler are stored locally.
settings-es.yml: describes an environment where documents emitted by your
crawler are stored in Elasticsearch. This configuration is required when
your crawl needs to execute Elasticsearch queries to perform its work.
- .dcc-runs.yml, describing the crawls to launch.
settings.yml is used as the default crawler environment.
If your crawler needs Elasticsearch, you can specify another YAML
configuration file instead.
Once the SDK is installed and the configuration files filled, the crawl can be
run locally via the dcc-run command, executed at the crawler module root.
Launch initial crawl
The dcc-run -v command launches all crawls.
-v stands for verbose, and is advised as dcc-run is pretty shy on logs.
All data persisted by your crawler will be stored in the .dcc-runs directory.
You can provide your crawler with the content of a previous crawl to test incremental
indexing. To do so, pass the path of a previous crawl to the -i
option, for instance:
dcc-run -i .dcc-runs/github-full-20151125-121432