group

The group command defines the scope of the computation an aggregate function, i.e it groups together rows that have the same values for a given group key (a field or set of fields).

group( { field_name | '@'position | alias | expression [ sort_param ] } [, ... ] )

Before computing the aggregate function the dataset is sorted by group key(s) and then the aggregation function is applied. When group key changes, then the aggregate value for that key is emitted and aggregate function is reset. The process continues until the end of the sorted dataset.

Note

In order to take the effect of grouping the group command must follow immediately the select command specifying the aggregation function.

The resultset of computing an aggregation function depends on the fields specified for grouping:

  1. Grouping all non-aggregated fields will yield computing aggregations for all unique values of combinations of these fields.
  2. Grouping some of the non-aggregated fields will yield computing aggregations for all unique values of combinations of grouped fields. The values of other non-aggregated fields in the resultset are undetermined.
  3. when group command is omitted the resultset will contain one computed aggregation for the whole selected dataset. The values of other non-aggregated fields in the resultset are undetermined.

Example 1. Compute the count of odd and even numbers from 0 to 9

1
2
3
4
dual(10)                           // generate 10 records of data
| sort(RANDOM())                   // randomize the order of records
| select(parity:i%2, cnt:count(*))
| group(parity)

As expected, we get that there are five odd and even numbers in that set:

parity cnt
0 5
1 5

Example 2. Compute the count of parity and integer pairs in the same set. Note that we need to group by both non-aggregated fields (parity and i)

1
2
3
4
dual(10)                           // generate 10 records of data
| sort(RANDOM())                   // randomize the order of records
| select(parity:i%2, i, cnt:count(*))
| group(parity, i)

The result:

parity i cnt
0 0 1
0 2 1
0 4 1
0 6 1
0 8 1
1 1 1
1 3 1
1 5 1
1 7 1
1 9 1

Note

Had we omitted field i from grouping (i.e grouping only by parity), we would’ve gotten the count of odd and even numbers accompanied with randomly picked value of i which would change every time you execute the query:

parity i cnt
0 4 5
1 1 5

Example 3. Omitting group will yield to count(*) being performed on the entire selected dataset:

1
2
3
dual(10)
| sort(RANDOM()) // randomize the order of records
| select(parity:i%2, i, cnt:count(*))

Note that the non-aggregated fields will have randomly picked values at every execution of the query.

parity i cnt
0 6 10

Group Sort

The grouping has its own sorting parameter, allowing to control the behavior of sorting which forms the basis of grouping. It allows specifying the field for sorting or disable sorting altogether:

sort_param:
        {   SORT { field_name | '@'position | alias | expression } [ASC | DESC]
          | NOSORT
        }

With group-sort, the sort key is applied independently of the grouping key. While the latter remains the same the dataset gets sorted by sort key instead. The aggregates are still computed looking at the group key in a sorted dataset: when it changes then computed aggregate value is emitted and aggregate function is reset. This allows you to compute aggregates on a completely different set of groups which sometimes becomes very useful. Take for instance computing aggregates on non-contiguous ranges.

Example 4. Computing US IPv4 address ranges requires applying MIN() and MAX() aggregate functions over IP addresses grouped by IP-country code. With default sorting behavior we would get just the beginning of the lowest and the end of the highest of ranges in the US:

1
2
3
4
5
6
7
dual(16909050, 20330757)  //generate some IPv4 addresses
| select(ipCC:cc(ip),     //compute ip-country code
         min(ip),         //begin of range
         max(ip))         //end of range
| group(ipCC)             //compute min max for each ip-country
| filter(ipCC='US')       //filter out US
;
ipCC min_ip max_ip
US 1.32.232.0 2.56.59.254

However, when we change group sort to NOSORT (i.e leaving dataset unsorted) we will get all IPv4 subranges since the addresses are generated by dual in increasing order. The group key changes at the border of subranges and we’ll get min, max computed exactly there. Applying filter will show us only the subranges belonging to the US:

1
2
3
4
5
6
7
dual(16909050, 20330757)  //generate some ip addresses
| select(ipCC:cc(ip),     //compute ip-country code
         min(ip),         //begin of range
         max(ip))         //end of range
| group(ipCC NOSORT)      //compute min max for each ip-country
| filter(ipCC='US')       //filter out US
;
ipCC min_ip max_ip
US 1.32.232.0 1.32.239.255
US 2.56.9.245 2.56.9.248
US 2.56.20.0 2.56.23.255
US 2.56.32.0 2.56.35.255
US 2.56.44.0 2.56.51.255
US 2.56.56.0 2.56.59.254