Parsing Recurring Key-Value Pairs from CEF Files
Key-value pairs most often appear in JSON structures, where the syntax strictly governs field and record separation, value encoding, and so on. In real life, however, you may encounter key-value structures whose syntax rules are much more relaxed. Take, for instance, the Extension field of HPE's Common Event Format (CEF), which can contain any number of key-value pairs separated by spaces. If a value contains a space, such as a file name, this is valid and can be logged in exactly that manner, as shown below:
filePath=/user/username/dir/my file name.txt
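To illustrate why such relaxed syntax is awkward to parse, here is a minimal Python sketch (not SpectX code) that compares a naive split on spaces with a key-aware regular expression. The second pair, fsize=42, is a hypothetical addition for illustration only, and the regular expression assumes that keys are plain words and that no value contains a "word=" token of its own.

```python
import re

# Hypothetical extension string: the fsize pair is made up for illustration.
extension = "filePath=/user/username/dir/my file name.txt fsize=42"

# Naive approach: split on spaces, then on '='. The file name is torn apart.
naive = [token.split("=", 1) for token in extension.split(" ")]

# Key-aware approach: a value runs until the next " key=" token
# or the end of the string, so embedded spaces stay inside the value.
PAIR = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")
pairs = dict(PAIR.findall(extension))

print(naive)
print(pairs)
```

The naive split yields fragments such as a lone "file" token with no key, while the key-aware expression keeps the whole file name as one value.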
Secondly, when parsing an arbitrary set of key-value pairs, their ordering matters a lot, because the parsing engine has only limited means of dealing with data elements of variable ordering (an alternatives group). By ordered we mean that key c always comes after key b, which in turn always comes after key a. This holds even when some of the key-value pairs or their values are missing (which is easily handled by a Quantifier and the Optional Modifier). So, for ordered key-value pairs, there is no problem parsing values by key name as usual.
However, when there are many key-value pairs whose order is undetermined, we need to take another approach.
Parsing Ordered Key-Value Pairs
Let's consider the following example of CEF-formatted data (/user/examples/patterns/parsing_kvpairs_cef.sxp):
CEF:0|bart| a=a1 b=1 c=Eat my shorts d=d1
CEF:0|bart| a=a2 b= c=Eep!
CEF:0|bart| a=a3 c=Ay caramba! d=
CEF:0|bart| b=4 c=cowabunga d=foo
CEF:0|homer| e=e1 f=d'oh! g=g1
CEF:0|homer| e=e2 f=Whatever, I'll be at Moe's. g=
CEF:0|homer| e= f=Woo hoo!
The first two fields, cefVersion and deviceVendor, are common to all records. As you can see, there are two vendors, bart and homer, each with a different set of key-value pairs: bart messages have keys a, b, c, d, and homer messages have keys e, f, g. Provided that the order of the pairs remains the same across all records, we can parse them with the following pattern:
/*
  Following pattern assumes the ORDER of key-value pairs remain the same across all records.
  Missing kvpairs and values are allowed.
*/

// each message begins with CEF version and vendor, so let's capture them in header
$hdr = (
    LD:cefVersion '|'
    LD?:deviceVendor '|'
);

//We need to define different structure for vendors as they have different set of key-value pairs:
$bart =
    ('a=' LD*:a)? >>((WORD '=') | EOL)      //kvpair is optional, terminated by next key or EOL
    ('b=' INT*:b LD)? >>((WORD '=') | EOL)
    ('c=' LD*:c)? >>((WORD '=') | EOL)
    ('d=' LD*:d)?                           //last kvpair is terminated only by EOL
    EOL
;

$homer=
    ('e=' LD*:e)? >>((WORD '=') | EOL)
    ('f=' LD*:f)? >>((WORD '=') | EOL)
    ('g=' LD*:g)?
    EOL
;

// CEF record:
$hdr ' '            // header followed by a space
($bart | $homer)    // followed by either bart or homer records
How does it work? First, the engine tries to match the sequence of header fields and a single space character. When that matches, the engine has to try two alternatives: bart or homer. Following bart, the first element to match is the key 'a='. A match occurs, so the data pointer is advanced to point at the value string. Now the engine starts matching bytes for LD*:a. Recall that LD checks at each byte whether it is a line feed or whether the next non-wildcard matcher matches. In our case that matcher is >>((WORD '=') | EOL), which tries to match either a WORD followed by = or the EOL symbol. The lookaround >> makes the whole sequence group NOT advance the data pointer when a match occurs, so that the matcher for the next key-value pair can continue at the same point. The optional modifier ? on the sequence group ('a=' LD*:a)? handles the case when the key-value pair is missing (record 4). The quantifier * on the value matcher, as in LD*:a, handles the case when the value is missing (record 2).
The result of parsing looks like this:
|CEF:0||bart||a1||1||Eat my shorts||d1||NULL||NULL||NULL|
|CEF:0||homer||NULL||NULL||NULL||NULL||e2||Whatever, I’ll be at Moe’s.|
Parsing Unordered Key-Value Pairs
When parsing more than two unordered key-value pairs, the complexity of the parsing script grows and its performance quickly degrades to unacceptable levels. In this case it makes sense to parse the key-value pairs as a single string and extract the values in the query layer using the PARSE function.
This approach has two advantages: a) parsing performance increases significantly (the more key-value pairs, the bigger the improvement) and b) all key-value pairs remain available for querying. In many cases only a few of the key-value pairs are relevant to a query, so extracting them on a need-to-know basis helps maintain query performance.
$pattern = <<<PATTERN
    LD:cefVersion '|'       //version and
    LD:deviceVendor '|'     //vendor fields are present for all messages
    LD:message              //capture key-value pairs as string value
    EOL
PATTERN;

@src = PARSE(pattern:$pattern, src:'sx:/user/examples/patterns/parsing_kvpairs_cef.sxp.data');

$c(srcStr) = PARSE("LD* 'c=' LD*:c >>((WORD '=') | EOS) LD* EOS",$srcStr);

@src
 .filter(deviceVendor = 'bart')
 .select(cefVersion, deviceVendor, $c(message) as Bart_catchphrases, message)
;
|CEF:0||bart||Eat my shorts||a=a1 b=1 c=Eat my shorts d=d1|
|CEF:0||bart||Eep!||a= b=2 c=Eep!|
|CEF:0||bart||Ay caramba!||a=a3 c=Ay caramba! d=|
|CEF:0||bart||cowabunga||b=4 c=cowabunga d=foo|
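The same extract-on-demand idea can be sketched in Python: keep the whole extension as one string and pull out a single key only when a query needs it. This is an illustration of the technique rather than SpectX code; the function name extract is hypothetical, and the expression assumes keys are plain words.

```python
import re

def extract(key, message):
    """Return the value of `key` from a key-value string, or None if the key is absent."""
    pattern = rf"(?:^|\s){re.escape(key)}=(.*?)(?=\s+\w+=|\s*$)"
    match = re.search(pattern, message)
    return match.group(1) if match else None

messages = [
    "a=a1 b=1 c=Eat my shorts d=d1",
    "a=a3 c=Ay caramba! d=",
    "b=4 c=cowabunga d=foo",
]

# Only key c is extracted; the other pairs stay untouched in the raw string.
catchphrases = [extract("c", m) for m in messages]
print(catchphrases)
```

Because the expression searches for the key anywhere in the string, it does not depend on the position of the pair, which is what makes this approach suitable for unordered data.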