Parsing Recurring Key-Value Pairs from CEF Files

Key-value pairs most often appear in JSON structures, where the syntax strictly defines field and record separation, value encoding, and so on. In real life, however, you may encounter key-value structures with much more relaxed syntax rules. Take, for instance, the Extension field of HPE's Common Event Format (CEF), which can contain any number of key-value pairs separated by spaces. A value may itself contain a space, as in a file name; this is valid and is logged exactly as is, as shown below:

filePath=/user/username/dir/my file name.txt
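To see why the relaxed syntax is a problem, here is a minimal Python sketch (Python is used only for illustration; it is not the pattern language of this article). It assumes that a value runs until the next `key=` token or the end of the line; the names `KV`, `naive` and `pairs` are made up for this example.

```python
import re

line = "filePath=/user/username/dir/my file name.txt"

# Naive approach: splitting on spaces tears the value apart.
naive = line.split(" ")

# Assumption: a value ends where the next "key=" token begins, or at end of line.
# (\w+)=   the key
# (.*?)    the value, matched lazily
# (?=...)  lookahead: next key or end of string, not consumed
KV = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|\s*$)")
pairs = dict(KV.findall(line))

print(naive)
print(pairs)
```

The naive split yields three fragments, while the lookahead version keeps the file name, spaces included, as a single value.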

Secondly, when parsing an arbitrary set of key-value pairs, their ordering matters a lot, because the parsing engine has only limited means of dealing with data elements of variable ordering (an alternatives group). By ordered we mean that key c always comes after key b, which in turn always comes after key a. This holds even when some key-value pairs or their values are missing (both cases are easily handled by the Quantifier and the Optional Modifier). So when key-value pairs are ordered, values can be parsed by key name as usual.

However, when there are many key-value pairs whose order is undetermined, we need to take another approach.

Parsing Ordered Key-Value Pairs

Let's consider the following example of CEF-formatted data, /user/examples/patterns/parsing_kvpairs_cef.sxp:

CEF:0|bart| a=a1 b=1 c=Eat my shorts d=d1
CEF:0|bart| a= b=2 c=Eep!
CEF:0|bart| a=a3 c=Ay caramba! d=
CEF:0|bart| b=4 c=cowabunga d=foo
CEF:0|homer| e=e1 f=d'oh! g=g1
CEF:0|homer| e=e2 f=Whatever, I'll be at Moe's. g=
CEF:0|homer| e= f=Woo hoo! 

The first two fields, cefVersion and deviceVendor, are common to all records. As you can see, there are two vendors, bart and homer, each with a different set of key-value pairs. Messages from vendor bart have keys a, b, c, d and those from vendor homer have keys e, f, g. Provided that the order of the pairs remains the same across all records, we can parse them with the following pattern:

/user/examples/patterns/parsing_kvpairs_cef.sxp
/*
  The following pattern assumes the ORDER of key-value pairs remains the same across all records.
  Missing kvpairs and values are allowed.
*/

// each message begins with CEF version and vendor, so let's capture them in header
$hdr =
(
  LD:cefVersion '|'
  LD?:deviceVendor '|'
);

// We need to define a separate structure for each vendor, as they have different sets of key-value pairs:

$bart = 
('a=' LD*:a)? >>((WORD '=') | EOL)  //kvpair is optional, terminated by next key or EOL
('b=' INT*:b LD)? >>((WORD '=') | EOL)
('c=' LD*:c)? >>((WORD '=') | EOL)
('d=' LD*:d)?                       //last kvpair is terminated only by EOL
EOL
;

$homer= 
('e=' LD*:e)? >>((WORD '=') | EOL)
('f=' LD*:f)? >>((WORD '=') | EOL)
('g=' LD*:g)?
EOL
;

// CEF record:
$hdr ' '          // header followed by a space
($bart | $homer)  // followed by either bart or homer records

How does it work? First the engine tries to match the sequence of header fields followed by a single space character. When that matches, the engine has two alternatives to try: bart or homer. Following bart, the first element to match is the key a=. A match occurs, so the data pointer advances to the value string. Now the engine starts matching bytes for LD*:a. Recall that at each byte LD checks whether it has reached a line feed or whether the next non-wildcard matcher matches. In our case that matcher is >>((WORD '=') | EOL), which tries to match either a WORD followed by = or the EOL symbol. The lookaround modifier >> prevents the whole sequence group from advancing the data pointer when a match occurs, so that the matcher for the next key-value pair can continue from the same point. The optional modifier ? on the sequence group ('a=' LD*:a)? handles the case where the key-value pair is missing (record 4). The quantifier * in LD*:a handles the case where the value is missing (record 2).
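The same mechanism can be mimicked with an ordinary regular expression. The sketch below is a rough Python analog, not the engine's actual implementation: each key is an optional group, its value is matched lazily, and a lookahead for "next key or end of line" ends the value without consuming input. The names `NEXT_KEY_OR_EOL`, `kv`, `BART` and `parse_bart` are hypothetical, and all values are kept as strings (the INT typing of b is not reproduced).

```python
import re

# Lookahead: the value ends before the next "word=" or at end of line.
# Like the >> modifier, it does not advance the match position.
NEXT_KEY_OR_EOL = r"(?=\s+\w+=|\s*$)"

def kv(key):
    # Optional group: the whole pair may be missing.
    # Lazy .*? : the value may be empty.
    return rf"(?:\s*{key}=(?P<{key}>.*?){NEXT_KEY_OR_EOL})?"

# Keys must appear in the order a, b, c, d (any of them may be absent).
BART = re.compile("^" + "".join(kv(k) for k in "abcd") + r"\s*$")

def parse_bart(extension):
    """Parse the extension part of a bart record; None if it does not match."""
    m = BART.match(extension)
    return m.groupdict() if m else None

print(parse_bart("a=a1 b=1 c=Eat my shorts d=d1"))
print(parse_bart("a=a3 c=Ay caramba! d="))
print(parse_bart("b=4 c=cowabunga d=foo"))
```

A missing pair shows up as None and a missing value as an empty string, which corresponds to the difference between NULL and an empty cell in the parsing result.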

The result of parsing looks like this:

cefVersion | deviceVendor | a    | b    | c             | d    | e    | f                           | g
CEF:0      | bart         | a1   | 1    | Eat my shorts | d1   | NULL | NULL                        | NULL
CEF:0      | bart         |      | 2    | Eep!          | NULL | NULL | NULL                        | NULL
CEF:0      | bart         | a3   | NULL | Ay caramba!   |      | NULL | NULL                        | NULL
CEF:0      | bart         | NULL | 4    | cowabunga     | foo  | NULL | NULL                        | NULL
CEF:0      | homer        | NULL | NULL | NULL          | NULL | e1   | d'oh!                       | g1
CEF:0      | homer        | NULL | NULL | NULL          | NULL | e2   | Whatever, I'll be at Moe's. |
CEF:0      | homer        | NULL | NULL | NULL          | NULL |      | Woo hoo!                    | NULL

Parsing Unordered Key-Value Pairs

When parsing more than two unordered key-value pairs, the complexity of the parsing script grows and its performance quickly degrades to an unacceptable level. In this case it makes sense to capture the key-value pairs as a single string and extract the values in the query layer using the PARSE function.

This approach has two advantages: a) parsing performance increases significantly (the more key-value pairs, the bigger the improvement) and b) all key-value pairs remain available for querying. In many cases only a few of the key-value pairs are relevant to a query; extracting them on a need-to-know basis helps maintain query performance.

$pattern = <<<PATTERN
  LD:cefVersion '|'   //version and
  LD:deviceVendor '|' //vendor fields are present for all messages
  LD:message          //capture key-value pairs as string value
  EOL
PATTERN;

@src = PARSE(pattern:$pattern, src:'sx:/user/examples/patterns/parsing_kvpairs_cef.sxp.data');

$c(srcStr) =
 PARSE("LD* 'c=' LD*:c >>((WORD '=') | EOS) LD* EOS",$srcStr);

@src
 .filter(deviceVendor = 'bart')
 .select(cefVersion, deviceVendor, $c(message) as Bart_catchphrases, message)
;

The result of the query looks like this:

cefVersion | deviceVendor | bart_catchphrases | message
CEF:0      | bart         | Eat my shorts     | a=a1 b=1 c=Eat my shorts d=d1
CEF:0      | bart         | Eep!              | a= b=2 c=Eep!
CEF:0      | bart         | Ay caramba!       | a=a3 c=Ay caramba! d=
CEF:0      | bart         | cowabunga         | b=4 c=cowabunga d=foo
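
For readers who want to experiment with the idea outside the query layer, the on-demand extraction can be approximated in Python. This is only a sketch of the technique, not a replacement for the PARSE function: the helper name `extract` is made up, and it assumes a value ends at the next `key=` token or at the end of the string.

```python
import re

def extract(key, message):
    """Return the value of `key` from a key=value string, or None if absent."""
    # The key may appear anywhere in the message, so pair order does not matter.
    pattern = rf"(?:^|\s){re.escape(key)}=(.*?)(?=\s+\w+=|\s*$)"
    m = re.search(pattern, message)
    return m.group(1) if m else None

messages = [
    "a=a1 b=1 c=Eat my shorts d=d1",
    "a= b=2 c=Eep!",
    "a=a3 c=Ay caramba! d=",
    "b=4 c=cowabunga d=foo",
]

# Only key c is extracted; the remaining pairs stay untouched in the string.
catchphrases = [extract("c", m) for m in messages]
print(catchphrases)
```

Because the message stays a plain string, any other key can be pulled out later in the same way, without reparsing the source file.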