Handling Complex Data Structures and Unexpected Values

Parsing simple tabulated data (TSV, CSV, ...) may seem easy at first glance. Unfortunately, as you look closer, some nasty aspects start emerging:

  • what if a field doesn’t contain the expected data type?
  • what if a field is longer or shorter than expected, or is missing altogether?
  • what if a field can contain values of different types?
  • what if a field can contain one OR more values?
  • and so on ...

A seemingly simple task suddenly appears increasingly complex. And this is just the beginning. In forensic log analysis you can never rely on the expected behaviour of applications, particularly when you’re tracing exploits. Consequently you also can’t dismiss unexpected values, missing fields, etc. in their logs. In fact, such exceptions are very often exactly what you are looking for: they may point to situations where an application has failed to correctly interpret data submitted by end users. A fertile ground for potential exploits. Being an infosec analyst, you don’t want to miss these, right?

Unexpected situations are actually even more common than that. Failures of log collection, file rotation while the system is under high load or being sabotaged, log poisoning attacks by attackers trying to cover their trails, etc. can all produce similar situations. Sometimes the metadata of messages turns out to be the most valuable source of information. There are countless situations which a data scientist studying normal behaviour would dismiss, but which are of primary interest to an infosec analyst.

What qualities should a log analytics tool possess to support learning and to handle complexity and edge cases?

Parsing Multivalue Fields

Let’s take, for example, a log I came across when analyzing a domain typosquatting scheme. The main log I used for the analysis had a tab-separated structure with the following fields:

  1. id - integer value holding the id of the typosquatted domain
  2. domain - string value with the typosquatted domain name. Always present, minimum length 1.
  3. xff - multivalue field which may contain one or more of the following values: i) the string “unknown”, ii) an IPv4 address, iii) an IPv6 address, or iv) a string holding a domain name. Multiple values are separated by a comma, optionally followed by space(s). Can be empty.
  4. domainLabel - string containing the prefix to the TLD. Sometimes empty.
  5. uriQuery - string containing the URI query part. Can be empty; some are up to 50 KB long.
  6. lang - string containing the language header value. Can be empty.
  7. referrer - string containing the referrer URL. Can be empty; some are up to 50 KB long.
  8. userAgent - user agent string. Can be empty; some are up to 50 KB long.
  9. ipgeo - optional (i.e. possibly missing) field containing a key-value pair. The string key can be either “ipgeo=” or “ip_geo=”. The value is an integer.
  10. httpgeo - optional field containing a key-value pair. The string key is “httpgeo=”. The value is an integer.

The third field is the most complex of them. It looks suspiciously similar to the X-Forwarded-For HTTP header, except that it also contains the string value “unknown” and domain names.

Here’s the SpectX pattern to parse all that. Supplied with comments, it should be pretty understandable even for those of you not intimately familiar with the SpectX pattern language.

LONG:id                             //first field contains numeric values, assign column name 'id'
'\t' LD:domain                      //second field is name of typosquatted domain (campaign). min len = 1.
'\t' (ARRAY{                        //third field can contain one or more following elements:
             ('unknown' |           //string 'unknown' or
              IPV4:ipv4 |           //ipv4 address or
              IPV6:ipv6 |           //ipv6 address or
              [a-zA-Z.-_0-9]+:host  //domain name
             ) ','? [ ]*            //which, if more than one, can be separated by comma and zero or more spaces
           }+:xff                   //match at least one or more times, assign column name 'xff'
      | LD{0,50000}:invalidClient   //or if any of previous elements were not matched then capture it as 'invalidClient'
      )
'\t' LD*:domainLabel                //4'th field seems to contain prefix label to tld. Sometimes empty.
'\t' LD{0,50000}:uriQuery           //uri query part. Some of them are up to 50 Kb long
'\t' LD*:lang                       //string identifying language (from Accept header)
'\t' LD{0,50000}:referrer           //referrer, some are very long exceeding default LD capture size of 4096
'\t' LD{0,50000}:userAgent          //user agent, some are very long.
('\t' LD*)?                         //ipgeo or ip_geo=number, httpgeo=number. Dunno what those are, so let's ignore.
(EOL | EOS)                         //line feed or end of file

Here LONG captures numeric data as the long data type, and LD is a wildcard matching any characters until the next defined matcher within a line.

The xff field deserves a more detailed explanation. It contains an alternatives group with two members: ARRAY and LD. The parsing engine first tries to match ARRAY, and if that fails, matching with LD is attempted. As we have supplied LD with length limits from 0 to 50,000 bytes, it will capture an empty field value as well as any unexpected data up to 50 KB.

The ARRAY tries to match a repeating sequence of an alternatives group followed by a comma (made optional by ?) and zero or more space characters (defined by *). The alternative elements are the constant string ‘unknown’, an IPv4 address, an IPv6 address, or a domain name.

Let’s walk through an example:

192.168.0.10,10.0.5.14, 84.50.125.243

Matching starts with the data pointer at the beginning of the first IP address. The engine takes the first element of the alternatives group: ARRAY. The first matcher in ARRAY is also an alternatives group, so the engine first tries to match 192.168.0.10 with the constant string ‘unknown’, which obviously fails. Next it tries matching with IPV4, which succeeds. The engine moves the data pointer past the IP address (now pointing to the comma) and tries matching with the constant string ‘,’ (since one of the alternatives has matched). This succeeds too and the pointer is advanced to 10.0.5.14. Now the engine tries to match the space character (character class matcher [ ]*). This succeeds because the quantifier allows matching zero or more times (i.e. the space is optional) and the data pointer is not advanced (more precisely, it is advanced by the number of bytes consumed, which is zero).

The ‘+’ quantifier after the closing brace of ARRAY tells it to match one or more (up to 4096) times, so it goes back to matching the constant string ‘unknown’ again. The same sequence repeats as described above, with the exception of matching the space character after the second comma: this time there actually is a space character there.

When the last IP address (84.50.125.243) is consumed, the parsing engine advances to the ‘\t’ field separator. This does not match any of the matchers in ARRAY, therefore the engine considers the ARRAY complete. Since it was the first of the top-level alternatives, invalidClient is not evaluated (it gets a NULL value). The engine then moves on to the next matcher in the pattern, which is the constant string ‘\t’, and proceeds with the rest of the fields.

The last two fields with the ipgeo and httpgeo key-value pairs are optional, i.e. they can be missing altogether. Considering that I did not need them in the analysis, I chose not to export them at all. So we can match them with the LD wildcard and handle the missing case by wrapping the field separator and the wildcard in a sequence group made optional by the ? modifier.
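
If you did need those values, the two optional trailing fields could be captured explicitly instead of the catch-all ('\t' LD*)? line. A minimal, untested sketch, assuming the key spellings are exactly as described in the field list above:

('\t' ('ipgeo=' | 'ip_geo=') LONG:ipgeo)?    //optional 9'th field: either key spelling, integer value
('\t' 'httpgeo=' LONG:httpgeo)?              //optional 10'th field: integer value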

Capturing Unexpected Values

A large part of security and fraud analysis consists of behavioural analysis; more precisely, of separating bad or malicious behaviour from normal. Most of the time this is quite difficult, as bad actors try hard to mask their behaviour to look normal. Even so, they still leave fingerprints in the meta-information of the original communication messages, such as the sequence of fields and events, distinct ways of generating values, etc. The ability to read such meta-information often makes the difference between success and failure. This is also the main reason why logs must be retained in their original form, as each transformation loses meta-info.

I used an alternatives group to capture the third field: it can contain either an ARRAY of different elements captured to the column xff, or, if none of those matched (or the field is empty), the raw value captured as a string in the column invalidClient (which can be up to 50 KB long).
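
To examine what actually ended up in that fallback column, a query along these lines should do. A sketch, assuming WHERE filtering works as in standard SQL (the ‘is not null’ test itself appears in the user-defined functions later on):

SELECT domain, invalidClient
  FROM @stream
 WHERE invalidClient IS NOT NULL;   //only records where the third field did not parse as an xff ARRAY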

Such an approach works well for handling exceptional values in fields where you anticipate this kind of behaviour. Usually these are the fields containing user-supplied data, like the URI path and query fields in an Apache access log. But how can we detect and examine the records which are broken at unanticipated fields?

Examining the edge cases of sample data when developing a pattern is easy: in the query script you can always include the _unmatched field, which contains the data not matched by the current pattern. Going through these exceptions is a routine task when developing a pattern.
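
For instance, including it explicitly makes the leftovers easy to spot. A sketch:

SELECT _unmatched, * FROM @stream;   //data not consumed by the pattern, if any, shows up in the _unmatched column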

However, when you’re running recurring queries (i.e. reports) on logs which are continuously updated, how can you detect such occurrences? They are important, as they may indicate potential hacking attempts (including log poisoning attacks) or unexpected changes in the structure of fields (which can lead to wrong query results).

At the point where you consider the pattern complete you will also know the number and percentage of unmatched bytes; this is always provided with the results of a query. Depending on whether you aim for a completely or only partially matching pattern, this number will be zero or more. With regard to further detection, however, this percentage of unmatched bytes serves as a baseline: as long as the structure of the log records is stable, it will not change much. Deviations from the baseline over a (sensible) threshold indicate unexpected changes in the log structure.
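
If the unmatched percentage is awkward to capture from the query results directly, a recurring report can compute a closely related per-record metric itself. A rough sketch, assuming count() aggregates behave as in standard SQL (count(field) counts only non-NULL values):

SELECT count(_unmatched) * 100.0 / count(*) as unmatchedPct   //share of records containing unmatched data
  FROM @stream;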

Handling Missing Timestamps

You may have noticed that the log record above did not contain a timestamp. This is most uncommon for a log: how could one determine when an event happened? I have no explanation for how this came to be, but in this case the events were stored in daily rotated files, each of which had the date of rotation embedded in the file name: [dd-MM-yyyy].log. We can use this to do time analysis with day precision, which is better than nothing. There is a small technical task though: how do we assign each record the timestamp taken from the relevant file?

An obvious solution is to write a shell/Python/Perl script (pick your favourite programming language). However, SpectX offers a much more elegant way of accomplishing this task. The file listing is an integral part of query processing in SpectX, and its output (i.e. the file name, size, last modified date, etc.) is always included in the stream of parsed fields; therefore for each record you can see the name of the file it came from, and so on.

SpectX can parse the time and date from the file path and name when a time pattern is specified in the URI. As a result the _path_time field will evaluate to the parsed time. All we have to do is include it in our select statement:

$pattern =  $[/user/varia/cm/main.sxp];
@list = LIST({uri:'file:/data/cases/dot-cm/logs/$dd$-$MM$-$yyyy$.log.sx.gz', tz:'UTC'});
@stream = PARSE(pattern:$pattern, src:@list);

SELECT _path_time as date, * FROM @stream;

Note that we don’t actually know which timezone the server operated in, therefore we set the timestamp to UTC. When querying, we can access different elements of the ARRAY as follows:

@stream
  .select(xff[0][ipv4]              //select leftmost element containing ipv4 address
          ,xff[0][ipv6])            //select leftmost element containing ipv6 address
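
The last (i.e. rightmost) element can be reached in a similar way, using the ARRAY_LEN and ARRAY_SELECT functions which also appear in the user-defined functions below; a sketch:

@stream
  .select(ARRAY_SELECT(xff, INTEGER(ARRAY_LEN(xff)-1))[ipv4] as lastIpv4);   //ipv4 of the rightmost element, if present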

Using Meta-Info

According to the X-Forwarded-For format, the leftmost element (i.e. the first in the array) should be the client’s address. The reality is quite different though: in a significant number of records the addresses are listed in the opposite order, with the client’s address in the last position of the array. Fun, eh?

This can be remedied quite easily though. Assuming that a client address should normally belong to the public IP space, we can pick whichever of the first or last element of the array is not from the private space. User-defined functions come in very handy in accomplishing this:

//define function extracting the ipv4 address from the array.
//takes two arguments: the array and its length
$extractIpv4(client::ARRAY, len::INT) =
 IF($len > 1,          //if len>1 we need to choose between the first and last element in the array:
    $selectIp($client[0][ipv4], ARRAY_SELECT($client, INTEGER($len-1))[ipv4]),
    $client[0][ipv4]   //when there is only one element, take the ipv4 from the first position
 )
;

//define function to choose between the first and last address, preferring the public one
$selectIp(first::IPV4, last::IPV4) =
    CASE
        WHEN $isPrivate($first)=true AND $isPrivate($last)=false THEN $last
        WHEN $isPrivate($first)=false AND $isPrivate($last)=true THEN $first
        ELSE $first
    END
;

//define function to test whether an ipv4 address is in private space (a missing address is treated as private)
$isPrivate(ip::IPV4) =
 IF($ip is not null,
    IPV4IN($ip, 10.0.0.0/8) or IPV4IN($ip, 172.16.0.0/12) or IPV4IN($ip, 192.168.0.0/16),
    true)
;

// our select statement uses the function defined above to extract the client ipv4 from the xff array:
@stream
.select($extractIpv4(xff, ARRAY_LEN(xff)) as c_ipv4);
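
From here the extracted client address behaves like any other column, for example for counting requests per client. A sketch, assuming count(*) aggregation and GROUP BY (referencing the column alias) behave as in standard SQL:

SELECT $extractIpv4(xff, ARRAY_LEN(xff)) as c_ipv4, count(*) as requests
  FROM @stream
 GROUP BY c_ipv4;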