How can I count the number of times a data element changes its value?¶

Sample scripts: scripts.tar.gz Sample data: sample_data.tar.gz

Counting all occurrences of a data element or their unique (distinct) values are the two most popular methods used to describe a dataset. Various descriptive statistics can be derived from these, such as the average, min, max, standard deviation, etc.

However, sometimes (surprisingly often) a third type of quantification becomes really handy: counting the number of times a data element changes its value. This enables you to find events and actors by behavioural patterns. For instance, sensors measuring fluctuating values, users often changing their location, etc. SpectX provides an elegant way of computing changes by allowing to specify the internal sort key to the group function.

Let’s take a simple example and calculate all three different counts for the following (ordered) dataset. As you can see, the ip address changes per username:

alice   193.0.0.1
alice   193.0.0.1
alice   193.0.0.1
alice   193.0.0.1
alice   193.0.0.1
alice   193.0.0.2
alice   193.0.0.2
alice   193.0.0.2
alice   193.0.0.2
alice   193.0.0.2
bob     193.0.24.1
bob     193.0.24.2
bob     193.0.24.1
bob     193.0.24.2
bob     193.0.24.1
bob     193.0.24.2
bob     193.0.24.1
bob     193.0.24.2
bob     193.0.24.1
bob     193.0.24.2

Extract username and ip-address:

$pat = "LOWER:user ';' IPV4:ip EOL"
@src=PARSE(pattern:$pat, src:'http://docs.spectx.com/_downloads/group_sort_data.csv');

Total count of ip-addresses per user:

@src
 .select(user, count(*) as totalCnt)
 .group(user);
user totalCnt
alice 10
bob 10

Count of unique ip-addresses per user:

@src
 .select(user, ip, count(*))          //compute unique pairs of the user and ip-address
 .group(@1, @2)                       //group by the first and second field
 .select(user, count(*) as uniqCnt)   //count occurrences of unique ip addresses from the previous result
 .group(@1)
;
user uniqCnt
alice 2
bob 2

Counting ip-address changes per user:

@src
 .select(user, ip, count(*))          //compute unique pairs of users and ip-addresses
 .group(ip sort user)                 //sort by user for aggregation, compute aggregation per ip
 .select(user, count(*) as changeCnt) //count all resulting rows from previous (i.e the changes of ip address)
 .group(user)                         //group by user
user uniqCnt
alice 2
bob 10

Perhaps it deserves to be explained how the magic of group-sort works. The way aggregation functions work is that first the dataset gets sorted by a group key and then the aggregation function is applied. When the group key changes, the result of the aggregation function is emitted for that group (in our case it is count() function) and the process continues until the end of the dataset.

What’s more, when sort is specified as an argument to the group() function, the dataset is sorted by the provided sort key. Next, the aggregation function is applied the same way using the group key. When it changes, the result of the aggregation function is emitted. In our case, we sort the dataset by the user and group by ip:

@src
 .select(user, ip, count(*)).group(ip sort user)
user ip f_2
alice 193.0.0.1 5
alice 193.0.0.2 5
bob 193.0.24.1 1
bob 193.0.24.2 1
bob 193.0.24.1 1
bob 193.0.24.2 1
bob 193.0.24.1 1
bob 193.0.24.2 1
bob 193.0.24.1 1
bob 193.0.24.2 1
bob 193.0.24.1 1
bob 193.0.24.2 1

Counting rows by user gives us the count of changes of ip-address. Voilà!