Grouping

If a connector supports aggregates, it may also support grouping data and then aggregating data in those groups. This ability is tracked by the query.aggregates.group_by capability.

Grouping is requested in the query API alongside fields and aggregates, in the groups field of the Query object.

A grouping operation specifies one or more dimensions along which to partition the row set. Each dimension selects a column from which to draw values (see Dimension::Column). For each group, every row should have equal values in each of those dimension columns.

If the dimension's column's schema defines arguments, then the arguments property is used to provide values for those arguments.

In addition, a grouping operation specifies aggregates which should be computed and returned for each group separately.

Dimensions

Dimension columns can be:

  • A column
  • A object-nested column
  • A column across an object relationship
  • A column across an object-nested object relationship

A key property is that nested arrays or nested relationships cannot be traversed from the rows being grouped over when selecting a dimension column. Only nested objects or object relationships can be traversed.

Extraction Functions and Complex Dimensions

We can also group by components of scalar types using extraction functions.

In order to apply an extraction function to the value of a dimension, the Dimension should specify an extraction property, which is the name of the extraction function to apply.

For example, this query groups articles by the year component of their published date:

{
  "collection": "articles",
  "arguments": {},
  "query": {
    "groups": {
      "aggregates": {
        "count": {
          "type": "star_count"
        }
      },
      "dimensions": [
        {
          "type": "column",
          "column_name": "published_date",
          "path": [],
          "extraction": "year"
        }
      ]
    }
  },
  "collection_relationships": {}
}

Filtering

Grouping operations have two types of filtering:

  • The initial row set can be filtered before the grouping operation, using the predicate field of the Query object as usual, and
  • The groups themselves can be filtered after the grouping operation, using the predicate field of the Grouping object. This is controlled by the query.aggregates.group_by.filter capability.

Unlike regular predicates on rows, group predicates are not allowed to compare columns, but must instead compare values of aggregates over the group. For example, we can filter groups by comparing a count of rows in the group, but not by comparing values in individual rows.

Ordering

As with filtering, group operations support two types of ordering:

  • The initial row set can be ordered before the grouping operation, using the order_by field of the Query object as usual, and
  • The groups themselves can be ordered after the grouping operation, using the order_by field of the Grouping object. This is controlled by the query.aggregates.group_by.order capability.

Group sort orders are restricted to comparing aggregate values, similar to filtering. For example, we can order groups by a count, but not by the value of individual rows. However, we can also choose to sort by the selected grouping dimensions.

Pagination

Pagination can also be applied both before and after grouping:

  • The initial row set can be paginated before the grouping operation, using the limit and offset fields of the Query object as usual, and
  • The groups themselves can be paginated after the grouping operation, using the limit and offset fields of the Grouping object. This is controlled by the query.aggregates.group_by.paginate capability.

Examples

This example partitions the articles collection by author_id, and then returns the row count for each group. That is, it computes the number of articles written by each author:

{
  "collection": "articles",
  "arguments": {},
  "query": {
    "groups": {
      "aggregates": {
        "article_count": {
          "type": "star_count"
        }
      },
      "dimensions": [
        {
          "type": "column",
          "column_name": "author_id",
          "path": []
        }
      ]
    }
  },
  "collection_relationships": {}
}

Filtering examples

This example applies a predicate to the rows before grouping:

{
  "collection": "articles",
  "arguments": {},
  "query": {
    "groups": {
      "aggregates": {
        "min_id": {
          "type": "single_column",
          "column": "id",
          "function": "min"
        },
        "max_id": {
          "type": "single_column",
          "column": "id",
          "function": "max"
        }
      },
      "dimensions": [
        {
          "type": "column",
          "column_name": "author_id",
          "path": []
        }
      ]
    },
    "predicate": {
      "type": "binary_comparison_operator",
      "column": {
        "type": "column",
        "name": "author_id",
        "path": []
      },
      "operator": "eq",
      "value": {
        "type": "scalar",
        "value": 1
      }
    }
  },
  "collection_relationships": {}
}

This example applies a predicate to the groups themselves, after grouping. It computes some aggregates for author groups which have exactly two articles:

{
  "collection": "articles",
  "arguments": {},
  "query": {
    "groups": {
      "aggregates": {
        "min_id": {
          "type": "single_column",
          "column": "id",
          "function": "min"
        },
        "max_id": {
          "type": "single_column",
          "column": "id",
          "function": "max"
        }
      },
      "dimensions": [
        {
          "type": "column",
          "column_name": "author_id",
          "path": []
        }
      ],
      "predicate": {
        "type": "binary_comparison_operator",
        "target": {
          "type": "aggregate",
          "aggregate": {
            "type": "star_count"
          }
        },
        "operator": "eq",
        "value": {
          "type": "scalar",
          "value": 2
        }
      }
    }
  },
  "collection_relationships": {}
}

Ordering and pagination

This example computes the article count for the author with the most articles, by ordering the groups by article count, and then using pagination to select the first group:

{
  "collection": "articles",
  "arguments": {},
  "query": {
    "groups": {
      "aggregates": {
        "article_count": {
          "type": "star_count"
        }
      },
      "dimensions": [
        {
          "type": "column",
          "column_name": "author_id",
          "path": []
        }
      ],
      "limit": 1,
      "offset": 0,
      "order_by": {
        "elements": [
          {
            "order_direction": "desc",
            "target": {
              "type": "aggregate",
              "aggregate": {
                "type": "star_count"
              },
              "path": []
            }
          }
        ]
      }
    }
  },
  "collection_relationships": {}
}

This example sorts the groups by the values of their dimensions. It groups articles by their author_id, and then sorts the groups by that author_id dimension, descending:

{
  "collection": "articles",
  "arguments": {},
  "query": {
    "groups": {
      "aggregates": {
        "article_count": {
          "type": "star_count"
        }
      },
      "dimensions": [
        {
          "type": "column",
          "column_name": "author_id",
          "path": []
        }
      ],
      "order_by": {
        "elements": [
          {
            "order_direction": "desc",
            "target": {
              "type": "dimension",
              "index": 0,
              "path": []
            }
          }
        ]
      }
    }
  },
  "collection_relationships": {}
}

Requirements

  • If the Query object specifies the groups field, then each correponding RowSet object must contain a non-null groups field.
  • Each returned Group object must contain values for each requested dimension, in the order in which they were requested:
    • The connector should effectively partition the RowSet described by the Query object into groups, such that the dimension tuples are unique within each group.
  • Each returned Group object must contain values for each requested aggregate, using the same key as used to request it:
    • Aggregates should be computed over the rows in each group in turn.

See also