Learn how to consolidate your crawl's data.

Rationale

A crawl aggregation lets you build analytics over the metrics extracted during a crawl of a site. It returns aggregated data computed over the pages matching a search query.

Making Requests

Aggregations are performed through GET and POST requests to /api/crawls/{crawl_id}/aggs. The endpoint is designed to perform several aggregations in a single call.

An aggregation query is formed by:

  • One or more filters. Each filter selects a set of pages and restricts the scope on which the aggregation is performed. If more than one filter is given, the final scope of pages is the conjunction (logical AND) of all filters. The list of available filters can be obtained at api/{project_id}/quickfilters.

  • Zero or more aggregation fields. Each aggregation stage distributes the pages according to their values for the aggregation fields. If more than one field is provided, pages are first distributed according to their field1 values, forming buckets. Then, within each field1 bucket, pages are distributed according to their field2 values, and so on.
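The filter names accepted by filters can be listed beforehand through the quickfilters endpoint mentioned above. A minimal sketch of that lookup using only the standard library (the request is built but not sent here, and the response shape is not documented in this guide):

```python
from urllib.request import Request

token = 'TOKEN'            # placeholder API token
project_id = 'PROJECT-ID'  # placeholder project identifier

# Build the lookup request; pass it to urllib.request.urlopen() to send it.
req = Request(
    'https://app.oncrawl.com/api/{}/quickfilters'.format(project_id),
    headers={'x-oncrawl-token': token})
print(req.full_url)
```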

By default, the result of aggregations is the number of pages in each bucket. But it is also possible to retrieve:

  • the sum of values of a field, for instance: nb_inlinks:sum

  • the average value of a field, for instance: nb_inlinks:avg
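Since the endpoint accepts several aggregations in a single call, a count, a sum, and an average can be requested together. A sketch of such a payload, reusing the filter and field names from the examples below:

```python
import json

# One /aggs call carrying three aggregation queries over the same buckets.
payload = {
    'aggs': [
        {'filters': 'all_pages', 'fields': 'page_group'},  # page count (default)
        {'filters': 'all_pages', 'fields': 'page_group',
         'value': 'nb_inlinks:sum'},                       # sum of inlinks
        {'filters': 'all_pages', 'fields': 'page_group',
         'value': 'nb_inlinks:avg'},                       # average inlinks
    ]
}
print(json.dumps(payload, indent=2))
```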

Examples

Count number of pages per page group

Create one bucket per page group and count the number of pages in each of them.

import json, requests

token = 'TOKEN'
crawl_id = 'CRAWL-ID'
aggregation = {
  'aggs': [
    {
      'filters': 'all_pages',
      'fields': 'page_group',
    }
  ]
}
resp = requests.post('https://app.oncrawl.com/api/crawls/{}/aggs'.format(crawl_id),
                     headers={'x-oncrawl-token': token},
                     json=aggregation)
print(json.dumps(resp.json(), indent=2, sort_keys=True))

Below is the JSON response to this request:

{
  "aggs": [
    {
      "cols": [
        "page_group", 
        "page_count"
      ], 
      "rows": [
        [
          "photos", 
          5
        ], 
        [
          "other", 
          110497
        ]
      ]
    }
  ]
}
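The cols/rows layout of each result is easy to turn into a list of records, one dict per bucket. A small helper (the function name is illustrative, not part of the API):

```python
def rows_to_records(agg):
    """Zip each row with the column names to get one dict per bucket."""
    return [dict(zip(agg['cols'], row)) for row in agg['rows']]

# Applied to the response above:
agg = {
    'cols': ['page_group', 'page_count'],
    'rows': [['photos', 5], ['other', 110497]],
}
print(rows_to_records(agg))
# → [{'page_group': 'photos', 'page_count': 5},
#    {'page_group': 'other', 'page_count': 110497}]
```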

Count number of pages per depth and page group

This demonstrates how to use multiple aggregation fields to nest the buckets.

import json, requests

token = 'TOKEN'
crawl_id = 'CRAWL-ID'
aggregation = {
  'aggs': [
    {
      'filters': 'all_pages',
      'fields': 'depth,page_group',
    }
  ]
}
resp = requests.post('https://app.oncrawl.com/api/crawls/{}/aggs'.format(crawl_id),
                     headers={'x-oncrawl-token': token},
                     json=aggregation)
print(json.dumps(resp.json(), indent=2, sort_keys=True))

Below is the JSON response to this request:

{
  "aggs": [
    {
      "cols": [
        "depth", 
        "page_group", 
        "page_count"
      ], 
      "rows": [
        [
          1, 
          "photos", 
          0
        ], 
        [
          1, 
          "other", 
          1
        ], 
        [
          2, 
          "photos", 
          0
        ], 
        [
          2, 
          "other", 
          685
        ], 
        [
          3, 
          "photos", 
          0
        ], 
        [
          3, 
          "other", 
          4316
        ], 
        [
          4, 
          "photos", 
          0
        ], 
        [
          4, 
          "other", 
          20
        ]
      ]
    }
  ]
}

Sum inlinks per page group

This example shows how to retrieve an aggregated field value instead of a number of pages.

import json, requests

token = 'TOKEN'
crawl_id = 'CRAWL-ID'
aggregation = {
  'aggs': [
    {
      'filters': 'all_pages',
      'fields': 'page_group',
      'value': 'nb_inlinks:sum'
    }
  ]
}
resp = requests.post('https://app.oncrawl.com/api/crawls/{}/aggs'.format(crawl_id),
                     headers={'x-oncrawl-token': token},
                     json=aggregation)
print(json.dumps(resp.json(), indent=2, sort_keys=True))

Below is the JSON response to this request:

{
  "aggs": [
    {
      "cols": [
        "page_group", 
        "nb_inlinks:sum"
      ], 
      "rows": [
        [
          "photos", 
          0.0
        ], 
        [
          "other", 
          3112553.0
        ]
      ]
    }
  ]
}

GET method

Because it is not always possible to perform a POST request, a GET version is available. The downside is that the JSON payload must be encoded in the URL: the aggregation request has to be serialized in the Rison format. More details about the supported parameters can be found in the API Reference guide.
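As a sketch, here is the first example's payload serialized by hand in Rison and percent-encoded for the URL. The query-parameter name (aggs here) is an assumption to check against the API Reference; the third-party prison package on PyPI can generate Rison programmatically instead of writing it by hand:

```python
from urllib.parse import quote

# Hand-written Rison for {'aggs': [{'filters': 'all_pages', 'fields': 'page_group'}]}
rison = '(aggs:!((fields:page_group,filters:all_pages)))'

# Query-parameter name 'aggs' is an assumption; see the API Reference.
url = ('https://app.oncrawl.com/api/crawls/CRAWL-ID/aggs?aggs='
       + quote(rison))
print(url)
```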

Tags: guide