Describes the crawler source code layout and the development process.
Git branching model
Crawler projects follow the Git Flow branching model.
There is no particularly strong reason for this choice, other than that at
least one common workflow is required.
To summarize the workflow:
- Crawler developers push to the develop branch.
- When ready, integrators create a release/vX.X branch, deploy it, and
iterate until the version is fit for production.
- Integrators then merge the release/vX.X branch into the master branch,
and tag it.
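The workflow above can be sketched with plain git commands. The sequence below exercises it in a throwaway repository; "v1.0" is a placeholder version, and the commit messages are illustrative.

```shell
# Sketch of the release cycle described above, in a throwaway repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev
git commit -q --allow-empty -m "initial commit"
git branch -M master                     # normalize the default branch name

git checkout -q -b develop               # developers push their work here
git commit -q --allow-empty -m "feature work"

git checkout -q -b release/v1.0          # integrators cut a release branch
git commit -q --allow-empty -m "release fix"  # iterate until production-ready

git checkout -q master                   # merge the release into master...
git merge -q --no-ff -m "release v1.0" release/v1.0
git tag v1.0                             # ...and tag it
```

The `--no-ff` merge keeps an explicit merge commit on master, so each release remains visible in the history.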
Crawler repositories are all based on a template project owned by the
Cogniteev core developers. When the template is modified, the changes are
propagated to all crawlers, so stay tuned!
A crawler project provides, among others, the following files:
- Dockerfile: builds the Docker image used by the Docido application.
- requirements-dev.txt: provides additional Python packages required to test
the crawler.
- tox.ini: tox configuration file; see the Validate your changes section.
- dcc-run environment configuration file.
- dcc-run utility input file, for crawlers that need Elasticsearch to run
properly.
- dcc-run input file, providing crawls configuration.
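As a rough illustration of the Dockerfile's role, a minimal image for a Python 2.7 crawler could look like the sketch below. This is a hypothetical shape only; the actual Dockerfile shipped with the template may differ.

```dockerfile
# Hypothetical sketch; the template's real Dockerfile may differ.
FROM python:2.7
WORKDIR /opt/crawler
# Install the crawler package and its dependencies into the image.
COPY . /opt/crawler
RUN pip install .
```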
Bootstrap development environment
You may use tox to ensure that your pull-crawler source code is sane.
Run the following commands to bootstrap your development environment:
git clone .../docido-pull-crawler-foo.git
pip install tox
You may test your development against Python 2.7.9.
Tip: you can increase the quality of your code by testing it against
several Python versions; add them in tox.ini.
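For instance, additional interpreters can be listed in the envlist of tox.ini. The fragment below is an illustrative sketch only; the section names and commands in the template's real tox.ini may differ.

```ini
# Illustrative sketch; the template's real tox.ini may differ.
# Add more interpreters to envlist (e.g. py34) to test several versions.
[tox]
envlist = py27

[testenv]
deps = -rrequirements-dev.txt
commands =
    coverage run -m pytest
    flake8 .
```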
Validate your changes
Before pushing changes to the develop branch, you must ensure that the
tox command works properly. This command executes:
- unit-test coverage. There is no minimum percentage required, but you should
ensure that it never decreases!
- source code compliance checks against the PEP 8 standard