SpectX Pattern Matching Language Reference Manual

A pattern in SXQL is described as a collection of matcher expressions. There has to be at least one matcher expression to describe a pattern, but usually there are more, in which case they can be separated by whitespace, commas or newlines.

A matcher can be any of the matcher types described below. Matchers can also be grouped, and matchers and groupings can be applied with operators:
  • Some matchers allow configuration specifying their behaviour. For instance, a timestamp needs an expected format definition.
  • Most matchers and groupings can take a quantifier - telling the engine how many times it should try to match.
  • All matchers and groupings can be declared optional - i.e if the element in the expected position is missing, the engine will output NULL to the resultset and continue with the next matcher in the expression.
  • All matchers and groupings can be assigned an export_name - the name of the field exposed to the query layer. The sole purpose of pattern matching is to make data elements available to the query engine. However, not all matched elements are needed for queries (such as field separators in tabulated files), therefore export_name is the mechanism for the end user to declare which data elements are exposed for queries (at the same time providing a name for the query fields). So, a matcher without an export_name still does its job of matching the pattern, but it is not visible in queries.
  • All matchers and groupings can "look around" (backward or forward) - mainly for the purpose of enabling decision making (conditional branching).

Don't worry if you don't grasp all these in detail immediately. We will explore each of them below. For now, the easiest way to illustrate the above is to walk through an example.

Example 1: Suppose we have a comma separated record (terminated with the line feed character) with the following fields:

  • order number - integer
  • username - consisting of upper and lower case letters and numbers (but not a comma)
  • ipv4 address of the user
1,alice,192.168.1.1
2,bob,10.6.24.18
3,mallory,192.168.1.3

This structure can be described by the following pattern expression:

INT:seq            //integer matcher for the order number, visible in queries as 'seq'
','                //constant_string matcher for the field separator, not visible in queries
[a-zA-Z0-9]*:uname //chargroup matcher, quantifier * (zero or more chars), visible in queries as 'uname'
','
IPV4ADDR:user_ip  //ip v4 address matcher, visible in queries as 'user_ip'
[\n]              //chargroup matcher for the line feed terminating the record

The SpectX pattern matching engine applies the pattern by trying the matchers in the order they were defined. In the example above it starts by trying to match INT:seq at the first byte of input data. This happens to be '1'. As it is suitable for an integer type, the engine moves on to the next byte and finds it to be a comma. This does not match an integer, therefore the INT:seq matcher gets completed by converting '1' to an integer and the next matcher in the pattern is selected: ','. The engine tries it at the current position of data and finds a match. So the data pointer is moved on to the next byte (pointing to the first letter of 'alice'). As the constant string matcher contained just one character, the matcher is considered complete and the engine takes the next one in the pattern: [a-zA-Z0-9]*:uname. The quantifier * allows [a-zA-Z0-9]*:uname to consume a variable number of bytes (zero or more), so it keeps matching until it finds a byte not matching its defined characters. This happens at the second comma (just after 'alice'); the engine considers the [a-zA-Z0-9]*:uname matcher complete and takes the next one: ','. Again, it tries to match it to the byte at the current position and succeeds. The data pointer is moved to the next byte, pointing to the beginning of '192.168.1.1', and the engine takes IPV4ADDR:user_ip. Trying it from the current position, a match is found and the data pointer gets moved forward 11 bytes, now pointing to a newline character. The engine finds a match for it using the last matcher in the pattern: [\n]. Now the data pointer is advanced to the next byte, the pattern iterator is reset and the cycle continues by trying the first matcher of the pattern again, against the currently pointed data. This continues until the end of the input data.

Should the engine encounter data for which it is unable to find a match, it resets the pattern iterator, marks the byte as unmatched and moves on to the next byte. This continues until a match is found or there is no more data. Eventually, the following data is available for query:

seq uname user_ip
1 alice 192.168.1.1
2 bob 10.6.24.18
3 mallory 192.168.1.3

Hopefully, the example above has explained the basics of pattern matching. Of course, it just scratches the surface: the data structure is a very straightforward CSV (yet widely used in real life), the pattern expression is simplistic for illustrating the principles (yet fully functioning), and we covered only one positive scenario of the engine parsing through a stream of data bytes. Yes, there is a lot more to it than we described in this example, and we all know the devil is in the details. But don't worry, we'll go through these below with more examples to demonstrate how we can easily describe very complex data structures and extract information from the midst of noise.

It is important to note that since SpectX Pattern Matching Language is a strongly typed language, the pattern matching engine converts exported data tokens to one of SpectX data types.

Matcher Expression Syntax

A matcher expression consists of the matcher itself and optional controlling elements (operators), arranged in the following order (from left to right, square brackets indicate optional elements):

[lookaround_modifier] MATCHER [ '(' configuration ')' ] [quantifier] [optional_modifier] [':'export_name]

Note that whitespaces and newlines are allowed between the elements. Placing elements in a different order (for instance by placing optional_modifier after the export_name) will trigger a syntax error.
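
For illustration, the following sketch uses every optional element in the required order (the field name hexId is made up for this example):

[0-9a-f] (charset='UTF-8') {2,8} ? :hexId   // chargroup with configuration, quantifier,
                                            // optional modifier and export name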

Before diving deep into matchers, let's have a look at what operators do.

Operators

Lookaround Modifiers

The term "look around" means to "peek" either forward or backward from the current position in the input stream, without moving that position. This allows making binary decisions based on the result of the lookaround matching:

Modifier Description
>> positive look ahead - returns true if bytes forward from the current position match
!>> negative look ahead - returns true if bytes forward from the current position do not match
<< positive look behind - returns true if up to 64 bytes backwards from the current position match
!<< negative look behind - returns true if up to 64 bytes backwards from the current position do not match

The modifier must be placed immediately before the matcher.

Example 2: positive look ahead. Suppose we have data containing date and time, datasource name and a variable number of key-value pairs in the record, separated by a single space character. However, the space is also allowed inside the value (notice the last filename in row 3).

2016-03-22 12:37:44|firewall| src=192.168.22.59 dst=8.8.8.8 res=ok
2016-03-22 15:02:39|myApp| a=1 b=ohoo file=/data/my/f1.txt
2016-03-22 15:06:12|myApp| a=2 b=noo%one file=/data/oh-my/new file.txt
                                                    // as date, time and source name appear in every record
$hdr =                                              // let's declare them as header
(                                                   // a sequence group of:
  TIMESTAMP('yyyy-MM-dd HH:mm:ss'):dateTime         // date and time, made available for query as 'dateTime'
  '|'                                               // followed by a separator
  WORD:deviceName                                   // datasource name, made available for query as 'deviceName'
  '|'                                               // last separator
);

/*
    To parse a variable number of elements we use the ARRAY grouping, which allows specifying a quantifier.
    Note that kv-pairs are separated with a single space, which is also allowed within the value. Therefore we need
    to look ahead after parsing the value, checking whether it is followed by a key (a word followed by '=').
*/

// here's our pattern describing the record:
$hdr ' '                    // header, followed by a single space
ARRAY{                      // array of:
 WORD:key '=' LD*:val       // key value pair separated by '=', made available for query respectively
 (>>(WORD '=') | EOL)       // followed by either the next key (a word followed by '=') or the end of line
}{1,100}:kvData             // the array can consist of minimum 1 and maximum 100 kv-pairs, exported as 'kvData'
;

The engine starts by matching the first bytes in the input stream with the matchers defined in the record header: TIMESTAMP. It does match, so the pointer is moved forward past the time. Next it tries to match the separator '|', which also happens to match. Again, the pointer is moved forward, now pointing at the first character of the deviceName ('firewall'). It now starts to match the WORD:deviceName matcher, which by definition matches upper and lowercase letters, numbers and underscores until a whitespace character. The match stops at the next separator, which in turn gets matched by the last Constant String matcher '|' in the header. Now the pointer is at the space following the last separator character. This will be matched by ' '.

Next, the engine starts matching key-value pairs with the pointer set at 'src='. The key is defined as a word followed by '='. This matches WORD:key, so the pointer gets moved on to '192.168.22.59'. The matcher for parsing the value is defined as LD*:val, which by definition matches any character until the next non-wildcard matcher or a newline character in the data stream. Therefore, at each position the engine evaluates whether the next matcher matches. In our case the next matcher is an alternatives group matching either a key (a look-ahead sequence of a word followed by '=') or an end of line character. The engine can move through the ip address and the following space until it encounters the next key 'dst='. At this point the look-ahead sequence matching the key causes the ARRAY to start from the beginning again (since it was the last element in the ARRAY), matching the next key-value pair.

Notice the vital role of the look-ahead modifier here. Without it, the pointer in the data stream would have moved past the next key ('dst='), and when the ARRAY restarts trying to match a key-value pair, it would fail.

When the value is followed by a newline character, the ARRAY will restart again but won't match any more, causing our record pattern to restart, beginning with matching the header again. Finally, the following result will be available for the query (time values are adjusted according to your local timezone, here it is UTC+2):

dateTime deviceName kvData
2016-03-22 14:37:44.000 +0200 firewall [{key="src" val="192.168.22.59 "} {key="dst" val="8.8.8.8 "} {key="res" val="ok"}]
2016-03-22 17:02:39.000 +0200 myApp [{key="a" val="1"} {key="b" val="ohoo "} {key="file" val="/data/my/f1.txt"}]
2016-03-22 17:06:12.000 +0200 myApp [{key="a" val="2"} {key="b" val="noo%one "} {key="file" val="/data/oh-my/new file.txt"}]

Example 3: conditional matching with look-behind. Suppose our data record is an ipv4 address:

1.2.3.4
1.2.3.40
1.2.3.55
// We are interested only in addresses whose last octet value is greater than 50
// we can accomplish this with a pattern:

IPV4:ipAddr     // a field with ip v4 address, exported under the name of 'ipAddr'
<<INT(min=50)   // look back for an integer with the minimum value of 50
EOL             // the record is terminated with end of line

The engine starts matching with the IPV4:ipAddr matcher from the first position in the data. It does match '1.2.3.4' and the pointer is moved forward accordingly, now pointing at the line feed. The engine proceeds with the next matcher, the look-behind INT with a minimum value of 50. It encounters the value 4 (the last octet of the ip address), but since it is less than the specified minimum, the look-behind fails and the match of the whole record is discarded. The pattern is then restarted from the beginning at the next record. The same cycle is repeated for the second row, which also gets discarded (the last octet value 40 of the ip-address is less than the specified minimum). Only the ip-address in the last row will be matched and made available for query:

ipAddr
1.2.3.55

Configuration

Configuration is the way of passing execution-specific input parameters to a matcher. Parameters are specified in the form of one or more key-value pairs, enclosed in parentheses and separated by commas:

matcher '(' paramName '=' value [, ...] ')'
where value is:
    - a constant value (integer, string, float, ...), or
    - a matcher expression, enclosed in curly brackets

Configuration is optional and specific to matchers.

Example 4: the TIMESTAMP matcher takes a configuration string which specifies the date and time format:

2016-03-22 15:02:39 +0300|myApp| a=1 b=ohoo file=/data/my/f1.txt
2016-03-22 15:06:12 +0300|myApp| a=2 b=noo%one file=/data/oh-my/new file.txt
TIMESTAMP(format='yyyy-MM-dd HH:mm:ss'):tstamp
LD EOL

Example 5: Sequence Group can take configuration in the form of a matcher expression specifying alternative field separators:

1,alice;192.168.1.1
2,bob;10.6.24.18
3,mallory;192.168.1.3
(
  INT:seq
  [a-z]*:userName
  IPV4:ip
)(fs={[,;]})
EOL

Quantifier

Every now and then you might encounter a record with repeating elements. When the number of elements is not fixed then we need a dynamic way of resolving it. This is what quantifiers are about: to match a variable number of repeating data elements within predefined limits.

The following table lists quantifier syntax and descriptions:

Quantifier Description
{ min, max } specifies repetition with minimum = min times, maximum = max times
{ min, } specifies repetition with minimum = min times, maximum = 4096 times
{ ,max } specifies repetition with minimum = 0 times, maximum = max times
{ val } specifies the exact number of repetitions
* specifies repetition with min 0 times and max 4096 times (i.e same as {0,} )
+ specifies repetition with at least 1 times and max 4096 times (i.e same as {1,} )

Example 6: Parsing a username that can contain numbers, lower and uppercase letters and which must be at least 1 character long:

john
Mary01
albert
[a-zA-Z0-9]+:username // a chargroup matching lower and uppercase letters and numbers. Chargroup matches one
                      // character by default, so in order to match more we need to use a quantifier.
                      // We'll use '+' as min username length is 1
EOL

Example 7: Repeated data elements are most often parsed using ARRAY. For instance, parsing a repeating sequence of ip-addresses can be done as follows:

1.1.1.1; 1.1.1.2; 1.1.1.3; 192.168.1.0
1.1.1.4; 1.1.1.5; 1.1.1.6; 192.168.1.1
ARRAY {         //an array consisting of:
  IPV4:ipaddr   //ip v4 address
  '; '?         //field separator. Is optional because the last item does not have a trailing separator
}{1,4}:ipRec    //array elements are expected to be repeated at least 1 and max 4 times, exported as 'ipRec'
EOL;            //the record ends with line feed

Parsed data is exposed to query as an array:

ipRec
[1.1.1.1; 1.1.1.2; 1.1.1.3; 192.168.1.0]
[1.1.1.4; 1.1.1.5; 1.1.1.6; 192.168.1.1]

Optional Modifier

Irregularities in logs are very common. Even when the structure of records is defined, one may still have to deal with missing fields in the record. This can be handled by making a matcher optional in the pattern expression.

Syntax: place a question mark after the quantifier (and before the export_name):

matcher '?'

It’s important to note that there are two different situations with missing data:

  1. the field is there but field value is missing (the separator is still in place, line 2 in the example below)
  2. the field value and following separator both are missing (line 3 in the example below).

From the parser's standpoint these are different: in line 2 the username field contains no value; in line 3 the username field is omitted completely. The Optional Modifier can handle both. Using a quantifier that allows a minimum of zero repetitions lets you parse the second line (i.e a zero-length match is allowed), but it fails on the third line.

Example 8: Note that in the first record all fields are present, in the second the username value is missing, and in the third the username field is missing altogether.

14/Mar/2016:23:37:06 +0200,INFO,mary01,200
14/Mar/2016:23:37:07 +0200,INFO,,200
14/Mar/2016:23:37:13 +0200,INFO,500
(                   //a sequence_group of the following matchers:
 HTTPDATE:timestamp
 UPPER:severity
 ALNUM?:user        //optional_modifier allows the engine to continue parsing when a field is missing
 INT:response
)(fs=',')           //separated by a comma
EOL;

Results:

timestamp severity user response
2016-03-14 23:37:06 +0200 INFO mary01 200
2016-03-14 23:37:07 +0200 INFO NULL 200
2016-03-14 23:37:13 +0200 INFO NULL 500

Export Name

Not all data in your source data stream is interesting or useful. For instance, field separators are not really necessary when analyzing content of a CSV file (although the parser engine needs to know where they are). To tell the parsing engine which elements should be made available to the query layer, an export_name must be assigned to the respective matcher. The assigned name becomes the name of the field you can refer to in the query. It will also be the name of the respective column heading of the resultset.

Export name is a string which:

  • is preceded by a ‘:’ (colon) symbol
  • must begin with an upper or lowercase letter and
  • is at least one character long and
  • may contain only lower or uppercase letters and numbers

The export name is always the last item in the matcher expression (after Configuration, Quantifier and Optional Modifier).
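
For instance (a quick sketch, with made-up names):

WORD:userName     // valid export name
WORD:userName2    // valid - numbers are allowed after the first letter
WORD:2user        // syntax error - an export name must begin with a letter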

Any matcher output can be exported, including all matcher groups: Character Group, Sequence Group and Alternatives Group. Matched data is exported as STRING.

Example 9: Suppose our data record has only one field, which may contain either an ipv4 address, an ipv6 address or a domain name. We parse this field using an Alternatives Group.

192.168.0.1
0000:0000:0000:0159:0000:0000:0000:0016
www.example.com
(                           // alternatives group:
  IPV4:ipv4 |               // ipv4 address or
  IPV6:ipv6 |               // ipv6 address or
  [a-zA-Z0-9.-]*:fqdn       // FQDN
):remote                    // exporting group output as 'remote'

Each exported member of an alternatives group will have a value only when matched. By also exporting the output of the group, we will have the field 'remote', which always has a value:

ipv4 ipv6 fqdn remote
192.168.0.1 NULL NULL 192.168.0.1
NULL 0000:0000:0000:0159:0000:0000:0000:0016 NULL 0000:0000:0000:0159:0000:0000:0000:0016
NULL NULL www.example.com www.example.com

Visibility of members of the composite data types (Structure, Array and Json) is somewhat different from groups. To make the members of a composite data entity (a structure, array or json) visible to the query layer, the entity itself must be exported. See Structure, Array and Json Objects for details.

And finally, we also have references to patterns, which may contain exported members and which can themselves have an export name (and therefore be exported as a whole). The rules are simple:

  • if the pattern reference is not exported, then any of its member exports are visible with their own export names
  • if a pattern reference is exported (i.e has an assigned export_name), all its exported members are visible in the structure named by the pattern reference's export_name. If no member was exported, all matched data is exported as a string.

See examples in Referencing Pattern Expressions.
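
As a quick sketch of both rules, assume a hypothetical pattern reference $kv with exported members 'key' and 'val':

$kv = WORD:key '=' LD*:val;

$kv EOL;          // not exported: fields 'key' and 'val' are visible directly
//$kv:pair EOL;   // exported: 'key' and 'val' become members of a structure named 'pair'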

Now it’s time to have a closer look at matchers.

Constant String

A constant string matcher defines a UTF-8 string that the engine tries to match against the bytes in the input stream. The constant string matcher is mostly used to define field separators, but it is also useful for matching any known fixed strings in the data stream (such as key names in key-value type records).

Syntax:

string '(' 'charset' = charset_name ')'
where:
  • string is the string to be matched. Must be placed between single quotes (character 0x27 in ASCII table) or double quotes (0x22 ASCII).
  • charset is the name of an optional configuration parameter specifying the character set for converting bytes to characters.
  • charset_name is the string specifying the character set name enclosed in single or double quotes (for example ISO-8859-1)

In case the string contains a single quote or double quote you may either use the other for enclosing (i.e use double quotes for enclosing if the string contains a single quote or vice versa). Alternatively, you can escape it with a preceding backslash character (0x5c ASCII).

Constant string matched data will be converted to a String according to the specified character set. The default character set is UTF-8 (i.e when used without specifying a configuration).

The constant string matcher does not take a quantifier.
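
A few sketches of constant string matchers:

'|'                            // field separator
"it's easy"                    // double quotes enclose a string containing a single quote
'it\'s easy'                   // the same string with the single quote escaped
'süß'(charset='ISO-8859-1')    // bytes are decoded as ISO-8859-1 before matching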

Character Group

Character group matcher allows matching a single character out of several in a defined group. When used together with the quantifier, it allows matching variable strings in the input data stream.

The syntax is compatible with Regular Expression Character Class:

'[' {'!' | '^'} char ... ']' '(' 'charset' = charset_name ')'
'[' {'!' | '^'} start_char-end_char ']' '(' 'charset' = charset_name ')'
where:
  • char ... is one or more characters to be matched. Note that it must be placed between square brackets. In case you want to match a square bracket character, it must be escaped by a preceding backslash character (0x5c ASCII). Characters can also be expressed as ranges, for instance [0-9] matches any digit from 0 to 9. Negating is supported by placing a caret or an exclamation mark before the characters.
  • charset is the name of an optional configuration parameter specifying character set for converting bytes to characters.
  • charset_name is the string specifying a character set name (for example ISO-8859-1) enclosed in single or double quotes.

As a courtesy to the European Mac users having trouble locating caret on their keyboard, we have extended the negating character class by allowing to use an exclamation mark (‘!’, 0x21 in ASCII table) as an alternative to the caret (‘^’, 0x5e in ASCII table).

Example 10:

foo;22-40035-abCD
e;22-339528-Tech
[a-z]+:f1       //matches one or more lowercase letters
[;]             //matches a single semicolon character
[! \n]*:other   //negated chargroup of the space and line feed characters,
                //i.e it matches any character except space and line feed, zero or more times
[\n]            //matches the single line feed character

The pattern above produces two fields for the query layer:

f1 other
foo 22-40035-abCD
e 22-339528-Tech

NB! Make sure to exclude field separator symbols from your character group definition.

POSIX Character Classes

POSIX Character Class matchers will match one or more characters corresponding to any of the characters in the specified group. Note that this behaviour is different from the Character Group matcher, which matches a single character.

The following table lists all 12 standard classes, plus a few extras such as the [:ascii:] and [:word:] classes that some regex flavors support.

Matcher name POSIX Character Class Description
ALNUM [:alnum:] Alphanumeric characters a-z; A-Z; 0-9
ALPHA [:alpha:] Alphabetic characters a-z; A-Z
BLANK [:blank:] Space (0x20) and tab (0x9) characters
CNTRL [:cntrl:] Control characters in ASCII range 0x1-0x1F; 0x7F
DIGIT [:digit:] Digit in range of 0-9
GRAPH [:graph:] Visible characters in the ASCII code range 0x21 - 0x7E
LOWER [:lower:] Lowercase letters a-z
PRINT [:print:] Printable characters in the ASCII code range 0x20 - 0x7E
PUNCT [:punct:] Punctuation and symbols !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
SPACE [:space:] All whitespace characters. In ASCII codes: 0x20; 0x9; 0xA; 0xB; 0xC; 0xD
NSPACE [!:space:] Matches all characters except whitespace
UPPER [:upper:] Uppercase letters A-Z
XDIGIT [:xdigit:] Digit in hexadecimal notation 0x0 - 0xF
ASCII [:ascii:] All ASCII characters in range of 0x0 - 0x7F
WORD [:word:] Word characters: letters a-z; A-Z; numbers 0-9 and underscore _
[:any:] Matches any character in range 0x0 - 0xff

All Posix character class matchers accept the optional configuration parameter:

'(' 'charset' = charset_name ')'
where:
  • charset is a name of optional configuration parameter specifying the character set for converting bytes to characters.
  • charset_name is the string specifying character set name (for example ISO-8859-1) enclosed in single or double quotes.

Example 8 above illustrates using POSIX character class matchers.

Hint: Make sure your Posix character class matchers do not collide with the field separator symbols.
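
For example, a record consisting of a username and a numeric id separated by a comma (a character belonging to neither class) could be parsed with the following sketch (field names are made up):

mary,1042
bob,7

ALNUM:user    // one or more alphanumeric characters
','           // the separator - not a member of either class
DIGIT:id      // one or more digits
EOL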

Lines and Strings

Log records are often formed as lines in a text file - i.e character sequences terminated by \n or \r\n. Also, strings are often quoted. Line Data matchers offer easy-to-remember helpers to parse such records: wildcards, line terminators and quotations.

Matcher name Description
LDATA; LD
Matches any characters until the next non-wildcard matcher in the scope of a line (see Example 11).
Therefore, the pattern expression must contain at least two matchers when using LDATA
DATA
Matches any characters until the next non-optional matcher of pattern expression (see Example 12).
Therefore the pattern expression must contain at least two matchers when using DATA
EOL; LF
Matches the single line feed character (ASCII 0xa). Does not take quantifier.
CR
Matches single carriage return character (ASCII 0xd). Does not take quantifier
EOLWIN
Matches two characters: carriage return followed by line feed. Does not take quantifier
SQS
Matches string enclosed between single quotes (ASCII 0x27). Any single quote inside the string must be escaped by
backslash character (ASCII 0x5c).
DQS
Matches string enclosed between double quote characters (ASCII 0x22). Any double quote inside the string must be
escaped by backslash character (ASCII 0x5c).
CSVSQS
Matches string enclosed between single quotes (ASCII 0x27). Any single quote inside the string must be escaped by single
quote character (CSV style).
CSVDQS
Matches string enclosed between double quote characters (ASCII 0x22). Any double quote inside the string must be
escaped by double quote character (CSV style).
BOF, BOS
Matches first byte of stream/file
MOF, MOS
Matches any bytes in the middle of file/stream
EOF, EOS
Matches last byte of file/stream

All string matchers accept the optional configuration parameter:

'(' 'charset' = charset_name ')'
where:
  • charset is the name of optional configuration parameter specifying character set for converting bytes to characters.
  • charset_name is the string specifying character set name (for example ISO-8859-1) enclosed in single or double quotes.
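
For example, a quoted string in a CSV-style record could be parsed with DQS (a sketch, assuming backslash-escaped quotes in the data):

"hello, \"world\"",200

DQS:msg    // the double-quoted string; the comma inside the quotes belongs to the value
','
INT:code
EOL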

Example 11: LDATA wildcard is useful when exploring logs with records arranged in lines.

14/Mar/2016:23:37:06 +0200,INFO,mary01,127.0.0.1,reqid=52d29ae72bb69;method=POST;amt=100;,200
14/Mar/2016:23:37:07 +0200,INFO,john,127.0.0.1,reqid=470f97a9bc34d;method=POST;amt=86;,404
14/Mar/2016:23:37:13 +0200,ERROR,albert,127.0.0.1,reqid=34bc7028ee1392;method=POST;amt=3000;,500
HTTPDATE:timestamp  //timestamp followed by a comma
','
UPPER:severity      //the severity of log record
LDATA:message       //parse rest of the line using LDATA as wildcard
EOL

This will result in records parsed as:

timestamp severity message
2016-03-14 23:37:06 +0200 INFO ,mary01,127.0.0.1,reqid=52d29ae72bb69;method=POST;amt=100;,200
2016-03-14 23:37:07 +0200 INFO ,john,127.0.0.1,reqid=470f97a9bc34d;method=POST;amt=86;,404
2016-03-14 23:37:13 +0200 ERROR ,albert,127.0.0.1,reqid=34bc7028ee1392;method=POST;amt=3000;,500

Example 12: DATA wildcard is useful in exploring logs with multiline records. Suppose we have a debug log where stack traces are laid out on multiple lines and the record is terminated by \n-----\n character sequence:

2015.10.03 16:32:51     ?:?     ERROR   com.spectx.webconsole.jsp.data.SQLTimeSeriesCache -- SQLTimeSeries remote fetch failed
org.postgresql.util.PSQLException: Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
        at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:66)
        at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:125)
        at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:30)
        at org.postgresql.jdbc3g.AbstractJdbc3gConnection.<init>(AbstractJdbc3gConnection.java:22)
        at org.postgresql.jdbc4.AbstractJdbc4Connection.<init>(AbstractJdbc4Connection.java:30)
        at org.postgresql.jdbc4.Jdbc4Connection.<init>(Jdbc4Connection.java:24)
        at org.postgresql.Driver.makeConnection(Driver.java:393)
        at org.postgresql.Driver.connect(Driver.java:267)
        at java.sql.DriverManager.getConnection(DriverManager.java:582)
        at java.sql.DriverManager.getConnection(DriverManager.java:154)
        at org.logicalcobwebs.proxool.Prototyper.buildConnection(Prototyper.java:159)
        at org.logicalcobwebs.proxool.ConnectionPool.getConnection(ConnectionPool.java:211)
        at org.logicalcobwebs.proxool.ProxoolDriver.connect(ProxoolDriver.java:89)
        at java.sql.DriverManager.getConnection(DriverManager.java:582)
        at java.sql.DriverManager.getConnection(DriverManager.java:207)
Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:432)
        at java.net.Socket.connect(Socket.java:529)
        at java.net.Socket.connect(Socket.java:478)
        at java.net.Socket.<init>(Socket.java:375)
        at java.net.Socket.<init>(Socket.java:189)
        at org.postgresql.core.PGStream.<init>(PGStream.java:62)
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:76)
        ... 23 more
-----
DATA:stacktrace     //DATA matches any character also between the lines, until
'\n-----\n'         //record termination sequence is encountered

The pattern above will yield the whole record (i.e the stack trace) parsed into one string:

stacktrace
2015.10.03 16:32:51 ?:? ERROR com.spectx.webconsole.jsp.data.SQLTimeSeriesCache -- SQLTimeSeries remote fetch failed org.postgresql.util.PSQLException: Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections ...

Time and Date

Parsing date and time means correctly assigning value to a timestamp - information describing a point in time. SpectX keeps timestamps internally as Unix time (or epoch time) values - defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970.

NB! When parsing a time string, a conversion from the original time zone to UTC must happen (otherwise the time info will have an incorrect value when converted to UTC). When the timezone is part of the time string, the TIMESTAMP parser can use it in the conversion.

However, in real life that is often not the case, and then you have the option of specifying the timezone yourself as a parameter to the TIMESTAMP matcher. In case you don't, the operating system default timezone is used to convert the time information to UTC.

TIMESTAMP

TIMESTAMP allows parsing time and date fields in any format. The generic syntax is:

TIMESTAMP '(' ['format=']pattern_str [, 'timezone='timezone_str ] [,'locale='locale_tag ]')'

Parameters:

format - specifies the time and date pattern string (see Date and Time Patterns table below). Must be enclosed within single quotes. The ‘format=’ key is optional. Example 13:

2017-03-28 15:36:22.456
//Timestamp for 4-digit year, 2-digit month in year, 1 or 2 digit day in month,
//2-digit hours, 2-digit minutes, 2-digit seconds, 3-digit milliseconds:

TIMESTAMP(format='yyyy-MM-d HH:mm:ss.SSS'):tstamp

timezone (or tz) - string specifying timezone name (as defined in IANA Time Zone Database) enclosed in single quotes. This parameter is optional. Default timezone is UTC. Note that timezone included in the timestamp overrides the specified timezone.

Example 14:

2017-03-28 15:36:22
2017-03-28 15:36:22
TIMESTAMP(format='yyyy-MM-dd HH:mm:ss', timezone='GMT'):ts1 EOL
TIMESTAMP(format='yyyy-MM-dd HH:mm:ss', timezone='America/Los_Angeles'):ts2 EOL

locale - a string specifying an IETF BCP 47 language tag, enclosed in single quotes. This parameter is optional. It allows parsing locale-specific month and day names. The default locale is English.

Example 15: Setting locale for parsing abbreviated month names in German.

2017-Jan-28 15:00:24
2017-März-28 15:00:24
2017-Okt-28 15:00:24
TIMESTAMP(format='yyyy-MMM-dd HH:mm:ss', locale='de'):t
EOL

Date and Time Patterns

Letter Date or Time component Presentation Example
G Era marker   case insensitive AD or BC
y Year Year 2012; 96; 0015
Y Week year Year  
M Month in year Month July; Jul; 07; 7
w Week in year Numeric 27
W Week in month Numeric 2
D Day in year Numeric 189
d Day in month Numeric 10
F Day of week in month Numeric 3
E Day name in week Text Tue; Tuesday
u Unnecessary numeric metadata Unnecessary  
a am/pm marker   case insensitive am or pm
H hour in day of zero based 24 hour clock (0 - 23) Numeric 0
k hour in day of one based 24 hour clock (1 - 24) Numeric 24
K hour in day of zero based 12-hour clock (0 - 11) Numeric 3
h hour in day of one based 12-hour clock (1 - 12) Numeric 1
m Minute in hour Numeric 30
s Second in minute Numeric 51
S Millisecond Millisecond 957; 864539
z,Z Time zone Timezone GMT+02:00; EET

Time parsing is backed by the Java Calendar class. Depending on user Locale settings the Calendar may be Gregorian or locale-specific. Time and Date pattern behaviour may be specific to the Calendar instance.

Pattern letters are usually repeated, as their number determines the exact presentation:

  • Text: If the number of pattern letters is 4 or more, the full name of a field is expected by the parser. Otherwise the abbreviated name is expected. For instance pattern "EE" expects the abbreviated name of the day in week, such as "Tue".

  • Numeric: Digits 0 - 9, leading zeroes and spaces are allowed. Depending on the number of letters in the pattern specification, the behaviour of the parser is as follows:

    • 1 letter pattern is treated as variable length parser accepting any number of digits.
    • 2 - 4 letter patterns are treated as fixed length parsers accepting only respective number of digits.
    • 5 or more letter patterns are treated as variable length patterns accepting any number of digits.
  • Year: Only numeric data is allowed. If the calendar is Gregorian then:

    • if the number of parsing letters is more than 2, the year is interpreted literally regardless of the number of digits. Therefore using the pattern "MM-dd-yyyy", "01-11-12" parses to Jan 11'th, 12 AD.
    • when parsing with pattern "y" or "yy", TIMESTAMP must interpret the abbreviated year relative to some century. When the year value is less than 32, the date is adjusted to the 21'st century, otherwise to the 20'th century. Therefore using the pattern "MM-dd-yy", "01-11-12" parses to Jan 11'th, 2012 and "01-11-72" parses to Jan 11'th, 1972.

    If the calendar is not Gregorian and the number of pattern letters is 4 or more, a calendar-specific long form is used. Otherwise a calendar-specific short form is used.

    Patterns with 2 and 4 parsing letters (for instance "yy" and "yyyy" respectively) are treated as fixed length parsers. So, the pattern "yy" will successfully parse only 2 digit long years and fail for any other length. Patterns with any other length are treated as variable length, which accept any length of years. For instance the pattern "y" parses successfully both "2" and "1256". Hence variable length time units placed consecutively without non-numeric separators in-between are impossible to parse correctly.

  • Month: If the number of pattern letters is 3 or more, the month is interpreted as text, otherwise as numeric:

    • 1 letter pattern is treated as variable length parser, which accepts both one and two digit months
    • 2 letter pattern is treated as fixed length parser, which accepts only two digit months
    • 3 letter pattern expects abbreviated month names. For instance pattern “MMM-dd-yyyy” parses “Jan-11-2012” to Jan 11’th, 2012.
    • 4 or more letter pattern expects full month names. For instance pattern “MMMM-dd-yyyy” parses “January-11-2012” to Jan 11’th, 2012.
  • Unnecessary: intended for skipping numeric parts of time and date which do not contribute to timestamp computation, for example the number of the day in week. These parts of the timestamp will be parsed, but are ignored in the computation of the timestamp value.

  • Millisecond: millisecond data are treated as numeric. Data with microsecond precision (pattern with length of 4-6 letters) are divided respectively by 10, 100 and 1000 to translate them to milliseconds.

  • Timezone: parses time zone expressed either as:

    1. offset from GMT or UTC timezone in the form:

    {GMT | UTC} { + | -} hours[:]minutes
    
    where:
    • hours - is one or two digit hours value
    • minutes - is one or two digit minutes value
    • Colon (‘:’) between Hours and Minutes may be omitted

    2. timezone full name or abbreviation in English (see https://www.timeanddate.com/time/zones/)

    Example 16:

    2017-06-23 15:34:15 GMT+02:00
    2017-06-23 15:34:15 G+02:00
    2017-06-23 15:34:15 +02:00
    2017-06-23 15:34:15 +0200
    2017-06-23 15:34:15 EET
    2017-06-23 15:34:15 Eastern European Time
    
    TIMESTAMP('yyyy-MM-dd HH:mm:ss Z'):ts
    

Predefined Format Timestamps

Matcher name SpectX data type Description
ISO8601 TIMESTAMP Matches timestamp in the form of yyyy-MM-ddTHH:mm:ssZ
HTTPDATE TIMESTAMP Matches timestamp in the form of dd/MMM/yyyy:HH:mm:ss Z
JSONTIMESTAMP TIMESTAMP Matches timestamp in the form of yyyy-MM-ddTHH:mm:ss.SSSZ
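
For instance, an ISO 8601 formatted record could be parsed with a minimal sketch like:

2016-03-22T12:37:44+02:00

ISO8601:ts EOL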

Numeric Data

Matcher name SpectX data type Description
BOOLEAN BOOLEAN
Matches case insensitive strings true and false
FLOAT FLOAT
Matches floating point numbers in form of
[+|-]?[0-9]+[.0-9]* or [+|-]?[0-9]+[E|e0-9]*

Accepts configuration parameter min=
Examples: 0.129 ; 12E23
CFLOAT FLOAT
Same as FLOAT, but with separator comma:
[+|-]?[0-9]+[,0-9]* or [+|-]?[0-9]+[E|e0-9]*

Accepts configuration parameter min=
Examples: 0,129 ; 12E23
DOUBLE DOUBLE
Matches floating point numbers in form of
[+|-]?[0-9]+[.0-9]* or [+|-]?[0-9]+[E|e0-9]*

Accepts configuration parameter min=
Examples: 0.129 12E23
CDOUBLE DOUBLE
Same as DOUBLE, but with separator comma:
[+|-]?[0-9]+[,0-9]* or [+|-]?[0-9]+[E|e0-9]*

Accepts configuration parameter min=
Examples: 0,129 12E23
INT, INTEGER INTEGER
Matches integral numbers in the range
-2147483648 to 2147483647

Examples: 1 ; 33 ; 12369275
HEXINT INTEGER
Matches integral numbers in the form of
[+|-]?0?x?[0-9a-fA-F]+ and in the range
-0xFFFF to 0xFFFE

Examples: 0xa01F ; xFE ; 10fE
LONG LONG
Matches integral numbers in the range
-18446744073709551615 to 18446744073709551614

Examples: 2000 18446744073709551613
HEXLONG LONG
Matches integral numbers in the form of
[+|-]?0?x?[0-9a-fA-F]+ and in the range
-0xFFFFFFFF to 0xFFFFFFFE

Examples: 0x11111111 ; xFEFEFE ; 10AABBFF

All numeric named matchers (except BOOLEAN) accept the following configuration parameters:

  • min - the matched numeric value has to be greater than or equal to this value,
  • max - the matched numeric value has to be less than or equal to this value.

The specified value has to correspond to the respective matcher type - i.e you'll get a syntax error when trying to specify a floating point configuration value for an integral matcher. Configuration comes in handy when you want to match numeric values within a predefined range.
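
For example, to match only HTTP client error status codes (400-499; the field name is made up):

INT(min=400, max=499):clientError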

Network data

Matcher name SpectX data type Description
IPADDR IPADDR
Matches IPv4 and IPv6 addresses
IPV4, IPV4ADDR IPV4ADDR
Matches IPv4 address

Example: 192.168.33.1
IPV4SOCKET IPV4SOCKET
Matches IPv4 address and port separated by ':' (colon) or '.' (dot) symbols

Example: 192.168.33.1:22 192.168.33.1.443
IPV4NET  
Matches IPv4 network in CIDR notation.

Example: 192.168.3.40/24
IPV6, IPV6ADDR IPV6ADDR
Matches IPv6 address. All forms implemented by
Java Inet6Address are supported.

Examples: 1080:0:0:0:8:800:200C:417A
::FFFF:129.144.52.38
MACADDR MACADDR
Matches MAC address in the form of 12 hexadecimal digits,
optionally separated by hyphen (-), colon (:) or dot (.)

Example: 00:00:00:00:00:01
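
For instance, a record holding a socket address could be parsed with a minimal sketch like:

192.168.33.1:443

IPV4SOCKET:sock EOL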

Credit Card data

Credit card numbers come in a variety of lengths (depending on the credit card scheme), formats (continuous, space delimited) or encodings (when submitted in the url). The matcher for valid credit card numbers (i.e with a valid Luhn checksum) in all these varieties is:

CREDITCARD

See example in Distributed Grep of Creditcard Numbers and SSN’s.
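
A minimal sketch, assuming records consist of just a card number per line (4111111111111111 carries a valid Luhn checksum):

4111111111111111
4111 1111 1111 1111

CREDITCARD:ccNum EOL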

Sequence Group

Sequences are the most common way data elements appear in records. To create a sequence_group, enclose matcher expressions (one or more) between parentheses:

'(' matcher, ... ')' [ ':'export_name ]

Sequence groups are useful in many ways. In the examples above we parsed CSV structured records using constant_string matchers to mark the field separators. While this served its purpose for explaining the principle, it can be quite tedious to express the pattern for longer and more complex structures this way. It would be much easier, and also better to read, if the field separator could be specified only once. And this is what a sequence_group allows you to do.

A sequence group can take an optional configuration with the parameter fs to specify the value of the field separator:

[ '(' 'fs='{ STRING | '{' MATCHER_EXPR '}' } ')' ]
where fs value can be either:
    STRING - is string value representing field separator enclosed in single or double quotes OR
    MATCHER_EXPR - is matcher expression evaluating to string enclosed in curly brackets

Example 17: we could rewrite the Example 1 as follows:

1,alice,192.168.1.1
2,bob,10.6.24.18
3,mallory,192.168.1.3
(                   //define sequence group consisting of:
 INT:seq            //integer matcher for sequence number, has export_name named 'seq'
 [a-zA-Z0-9]*:uname //character class matcher, quantifier * (zero or more chars), export_name 'uname'
 IPV4ADDR:user_ip   //ip v4 address matcher , has export_name 'user_ip'
)(fs=',')           //configuration specifies field separator comma
EOL                 //line feed terminates record

NB! Note that there is no comma after the last field in the source data - since no fields follow it, no field separator is required either. And this is also how the sequence group treats fields when the separator is specified via the fs parameter. In other words - the first field is never expected to be preceded by a field separator, and the last field is never expected to be followed by one.

Should we encounter data where the separator alternates between several characters, we can specify the field separator as a pattern expression.

Example 18: comma and semicolon both used as separator:

1,alice;192.168.1.1
2,bob,10.6.24.18
3;mallory,192.168.1.3
(                   //define sequence group consisting of:
 INT:seq            //integer matcher for sequence number, has export_name named 'seq'
 [a-zA-Z0-9]*:uname //chargroup matcher, quantifier * (zero or more chars), export_name
 IPV4ADDR:user_ip   //ip v4 address matcher , has export_name 'user_ip'
)(fs={(',' | ';')}) //configuration specifies field separator comma OR semicolon
EOL                 //line feed terminates record

Both Example 17 and 18 evaluate to:

seq uname user_ip
1 alice 192.168.1.1
2 bob 10.6.24.18
3 mallory 192.168.1.3

Even more important than providing easily readable code, a sequence group allows manipulating matchers as a unit. This comes in handy when you have records with a variable number of fields, where the record separator is a sequence of certain types of elements.

Example 19. Consider data where a record consists of a free text message containing any printable ASCII characters, an ip-address and an integer. Fields are separated by a single space character. However, the message field is also allowed to contain spaces. Therefore a record is considered to end with the sequence of an ip-address, an integer and a line feed character:

hello 1.2.3.4 250
hello I am Dolly 1.2.3.4 492
hello I am Dolly at 1.2.3.4! 1.2.3.4 31
LD:message (' ' IPV4:ipAddr ' ' INT:i EOL)  /* joining ip-address and integer matchers into a sequence group
                                               allows the message parsed fully for each line */

It may seem that the message in all rows should be fully parsed even without enclosing the ip-address and the following integer in a sequence group. Well, let's walk through how that would work. The engine starts by trying to find a match for the first matcher in the pattern, LD:message. Recall how LD works: it matches any character except line feed, on condition that its next matcher evaluates to false at the same time. Therefore LD:message matches until it encounters a space character. At this point LD:message is considered complete and the engine tries the next matcher, ' ', on the space character. Again the match completes, the data pointer is advanced to point at the ip-address, and the next matcher IPV4:ipAddr is taken from the pattern. And on it goes until the first line gets fully matched.

The beginning of the second line looks a bit different. The engine matches LD:message successfully until the first space character. Now the data pointer points at 'I'. Time to try the IPV4:ipAddr matcher - but alas, there is no ip-address there, therefore the engine reverts back to the first matcher of the pattern (LD:message) again. The same happens at the second and third space. It is not until the fourth space that the whole pattern matches, with LD:message containing the last matched characters: 'Dolly'. As a result we get only the last word in the message field:

message ipAddr i
hello 1.2.3.4 250
Dolly 1.2.3.4 492
1.2.3.4! 1.2.3.4 31

Now, when we enclose IPV4:ipAddr and INT:i in a sequence group, LD looks for a match of the entire sequence (as opposed to just a single constant string matcher) when deciding whether to continue or complete. Effectively it won't stop at space characters until the sequence group is the only remaining part of the line. Which is what we wanted:

message ipAddr i
hello 1.2.3.4 250
hello I am Dolly 1.2.3.4 492
hello I am Dolly at 1.2.3.4! 1.2.3.4 31

Sequence groups are also needed when handling a series of missing fields: just make the sequence group optional by applying the Optional Modifier.

Example 20: Suppose we have the following TSV structured log records: timestamp, severity, username, ip-address, post parameters (which in turn contain fields separated by semicolons) and response code. We can see that in case of ERROR the post parameter field is missing completely.

14/Mar/2016:23:37:06 +0200	INFO	mary01	127.0.0.1	52d29ae72bb69;POST;100	200
14/Mar/2016:23:37:07 +0200	INFO	john	127.0.0.1	470f97a9bc34d;POST;86	404
14/Mar/2016:23:37:13 +0200	ERROR	albert	127.0.0.1	500
(
HTTPDATE:timestamp
UPPER:severity
[a-zA-Z0-9]*:username
IPV4:ipAddr
  (                     //sequence_group for parsing post parameter field subfields:
    [a-fA-F0-9]*:reqId
    UPPER:httpMethod
    INT:amount
  )(fs=';')            // semicolon separates subfields of post parameter
  ?                    // modifier '?' makes sequence group optional to handle missing field
INT:httpStatus
)(fs='\t')              //fs specifies TAB as field separator in record
EOL                     //record is terminated by LF

As a result the subfields of post parameter field are successfully parsed and assigned NULL values when missing:

timestamp severity username ipAddr reqId httpMethod amount httpStatus
2016-03-14 23:37:06 +0200 INFO mary01 127.0.0.1 52d29ae72bb69 POST 100 200
2016-03-14 23:37:07 +0200 INFO john 127.0.0.1 470f97a9bc34d POST 86 404
2016-03-14 23:37:13 +0200 ERROR albert 127.0.0.1 NULL NULL NULL 500

Alternatives Group

The alternatives group allows matching a data element against several different matchers. This comes in handy when data at the same position may be of different types. A good example is the Apache access log, where the first field in a record may be an ipv4 address, an ipv6 address or a hostname.

To create an alternatives group, matchers separated by the pipe character must be enclosed in parentheses:

'(' matcher '|' matcher '|' ... ')' [ ':'export_name ]

The alternatives group tries the matchers in the order they are defined (from left to right). It stops right away when a match is found - i.e it uses a so-called lazy match strategy. Therefore only one matcher of the alternatives will produce output for each record; the values of the others will be set to NULL.

Example 21. Suppose we have a CSV structured record with timestamp field, username field and field which can contain either ipv4 address or hostname:

14/Mar/2016:23:37:06 +0200;mary01;201.44.32.5
14/Mar/2016:23:37:07 +0200;johnny;example.com
14/Mar/2016:23:37:13 +0200;albert;121.224.67.140
(
 HTTPDATE:timestamp
 [a-zA-Z0-9]*:username
   (IPV4:ipAddr | [a-zA-Z0-9\.]+:host)	//alternative group of ip-address OR hostname
)(fs=';')
EOL
Result:

timestamp username ipAddr host
2016-03-14 23:37:06 +0200 mary01 201.44.32.5 NULL
2016-03-14 23:37:07 +0200 johnny NULL example.com
2016-03-14 23:37:13 +0200 albert 121.224.67.140 NULL

The alternatives group can be applied with a Quantifier or an Optional Modifier. It does not take any configuration parameters.

Array

Arrays allow parsing repeated sequences of a variable number of data elements:

ARRAY'{' matcher [ ... ] '}' quantifier [ ':'export_name ]

Array captures exported data elements in the array data type. You must assign an export name to ARRAY in order to make exported members visible to the query layer.

Array requires a Quantifier (i.e you must set limits on the repeated data elements).

Example 22. Parsing an array of numbers. The data has two fields, an ip-address and an array of numbers (separated by forward slashes). The fields are separated by a space:

1.2.3.4 101/102/103/104/105
2.2.2.2 201//203///207
3.3.3.3 //303/304//305/306/307//
IPV4:ip
' '
ARRAY{
 '/'?               //optional: the leading separator is missing when the first position holds an int value
 INT*:i             //int value may be missing too
}{3,10}:nodeId      //array is expected to contain at least 3 and max 10 integers
EOL;                //record is terminated by line feed

Result:

ip nodeId
1.2.3.4 [101, 102, 103, 104, 105]
2.2.2.2 [201, null, 203, null, null, 207]
3.3.3.3 [null, 303, 304, null, 305, 306, 307, null, null]

Example 2 demonstrates using an array for parsing a variable number of key-value pairs.

Array does not have any configuration parameters.

Enum

The Enum constructor allows parsing predefined strings and converting them into integer values. The strings and respective integer values are defined as a series of key-value pairs, separated by commas and enclosed in curly brackets:

ENUM '{' '<string>' '=' integer ... '}' [ configuration ] [ ':'export_name ]

Enum takes an optional configuration parameter allowing strings to be matched case insensitively (the default is case sensitive):

'(' 'cis=' {true | false} ')'

Enum does not take a quantifier.

Example 23. Suppose we have data with username, login result and comment fields, and we want to map login result to integer:

Alice;success;all good
Bob;Wrong password;attempts left 2
Oscar;tech error;
Mallory;;doodaloo
LD:username ';'
ENUM{'' = -3, 'success' = 0, 'Wrong password' = 1, 'tech error' = 2}(cis=true):result ';'
LD*:comment EOL;

Result:

username result comment
Alice 0 all good
Bob 1 attempts left 2
Oscar 2  
Mallory -3 doodaloo

Structure

Structure allows capturing any sequence of matchers in the tuple data type:

STRUCTURE '{' matcher ... '}' [ ':'export_name ]

You must assign an export name to STRUCTURE in order to make exported members visible to the query layer.

Structure does not take a quantifier or configuration.

Example 24. Capture exported matchers data in tuple:

14/Mar/2016:23:37:06 +0200	mary01	201.44.32.5
14/Mar/2016:23:37:07 +0200	johnny	example.com
14/Mar/2016:23:37:13 +0200	albert	121.224.67.140
STRUCTURE {
 (
  HTTPDATE:timestamp
  [a-zA-Z0-9]*:username
    (IPV4:ipAddr | [a-zA-Z0-9\.]+:host)	//alternative group of ip-address OR hostname
 )(fs='\t')
 EOL
}:dataStruct;

Result:

dataStruct
{timestamp=2016-03-14 23:37:06.000 +0200 username="mary01" ipAddr=201.44.32.5 host=NULL}
{timestamp=2016-03-14 23:37:07.000 +0200 username="johnny" ipAddr=NULL host="example.com"}
{timestamp=2016-03-14 23:37:13.000 +0200 username="albert" ipAddr=121.224.67.140 host=NULL}

Json Objects

JSON_OBJECT (or its alias JSON) parses Json objects according to RFC 8259. A Json object is a structure of name-value pairs enclosed in curly brackets. For parsing Json values not enclosed in an object see Json Arrays and Json Values.

There are several ways to control parsing elements from a Json object. The easiest is to use the JSON matcher without any parameters. It will enumerate all elements, transform them to SpectX primitive types according to their defined type in Json and return a VARIANT_OBJECT with the parsed elements.

Example 25: A Json object:

{
	"name":"John",
	"age":33,
	"weight":72.3,
	"isMarried":true,
	"children":["Mallory", "Mary"],
	"address":{"city":"New York", "street":"1'st str 3", "zip":23456}
}

Can be automatically parsed as follows:

JSON:person

The returned VARIANT_OBJECT contains parsed members with VARIANT type assigned according to their Json type (double-click on the resultset row to see the details):

person[name] = "John" (VARIANT:STRING)
person[age] = 33 (VARIANT:LONG)
person[weight] = 72.3 (VARIANT:DOUBLE)
person[isMarried] = true (VARIANT:BOOLEAN)
person[children] = ["Mallory","Mary"] (VARIANT:VARIANT_ARRAY)
person[address][city] = "New York" (VARIANT:STRING)
person[address][street] = "1'st str 3" (VARIANT:STRING)
person[address][zip] = 23456 (VARIANT:LONG)

For more precise control over parsing object members you can:

  • specify the members and desired data types you want to extract. All SpectX Data Types are supported.
  • specify a member which must be present in the object (i.e make it mandatory). This can be used to validate semantic rules on a Json object. By default, when a member is missing, its respective value in the output TUPLE structure is set to NULL. With a mandatory member missing, the whole JSON matcher fails and returns NULL.
  • allow parsing objects not following the Json specification. Unquoted json names and string values consisting of one word can be parsed.
  • parse large Json objects. By default the size of an object is limited to 32 kB (32768 bytes). Larger objects can be parsed by explicitly specifying a max size.

Json object members are specified as one or more matcher expressions, separated by commas and enclosed in curly braces:

{JSON | JSON_OBJECT} '{' [matcher][ + ][':' "member_name"][':'export_name] , ... '}'
    ['(' configuration ')' ]
    [ ':'obj_export_name ]

where:
    matcher: a matcher specifying type conversion.

    + : quantifier 'one or more'. Makes matcher_expr mandatory.

    member_name: string enclosed in double or single quotes, specifying the name of Json member.
                 Used when member name is not a single word.

    export_name: word specifying the name of the field in the resulting TUPLE structure. When member_name is not
                 specified, then it must match the member name.

    configuration: key-value pairs of the following configuration options:
        greedy="greedy_name" :  optional string parameter capturing those members not explicitly
                                defined. If not specified then undefined members are not exported. greedy_name is a
                                string, enclosed in double or single quotes, naming the TUPLE structure
                                which captures all members not defined explicitly by a matcher expression.
        strict={true | false} : optional Boolean parameter controlling Json syntax validation.
                                Default is true. When set to false, unquoted Json names and string values consisting
                                of one word are allowed.
        maxlen=bytes          : optional integer parameter specifying the maximum allowed size of a Json object in bytes.

    obj_export_name: word specifying the export name of the entire JSON object
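
Putting these options together, here is a minimal sketch (the member and export names are hypothetical, and combining several configuration options in one comma-separated list is assumed to follow the syntax above):

JSON{
 STRING+:"full name":name,   //mandatory member with a multi-word Json name, exported as 'name'
 INT:age                     //optional member, NULL when missing
}(greedy='rest', strict=false, maxlen=65536):person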

Parsing selected members with specified type

Example 26. Suppose we’re primarily interested in ip-address, port and client timestamp in the following object:

{
 "ip":"192.168.3.20",
 "port":443,
 "client time":"2018-07-25 13:18:57 -0600",
 "ready":true,
 "descr":"2'nd floor printer"
}

We can extract these members explicitly. Note the handling of the multi-word Json member name. All other members could be captured as well, should we need them:

JSON{
 IPADDR:"ip",                       //json string is transformed to IPADDR type
 INT:"port",                        //json numeric is transformed to INTEGER type
 TIMESTAMP('yyyy-MM-dd HH:mm:ss Z'):"client time":client_time   //handle multiple word member name
}(greedy='other members'):host      //members not explicitly extracted are captured in TUPLE named 'other members'
                                    //and resulting TUPLE structure is named as 'host'

The resulting TUPLE contains the explicitly parsed members with the specified primitive types, and the rest with VARIANT types (double-click on the resultset row to see the details):

host[ip] = 192.168.3.20 // --
host[port] = 443
host[client_time] = T('2018-07-25 22:18:57.000 +0300')
host[other members][ready] = true (VARIANT:BOOLEAN)
host[other members][descr] = "2'nd floor printer" (VARIANT:STRING)

Semantic validation of a Json object

Example 27. Suppose our application requires ‘name’ and ‘age’ members to be mandatory. The ‘biography’ member is optional and can be quite large, up to 50 Kb in size:

{
    "name":"John",
    "age":55,
    "biography":"Once upon a time there lived a man called John ..."
}
{"name":"Eve", "sex":"female"}

We can enforce these semantic rules by using + quantifier and maxlen configuration parameter:

JSON {
    STRING+:name,       //name is mandatory member
    INT+:age,           //age is mandatory member
    STRING:biography    //biography is an optional member; size may be up to 50 kB
}(maxlen=51000):data    //allow 51000 bytes size in total

Results in:

{name="john" age=55 biography="Once upon a time there lived a man called John ..."}  //first Json object is successfully parsed
NULL                               //second Json object parsing fails since semantically mandatory member 'age' is missing

Parsing non-standard Json objects

By default JSON treats Json object members strictly according to RFC 8259 syntax. However, under certain circumstances relaxing this validation may be a good idea: for instance, when investigating malicious behaviour or when an application produces invalid Json objects.

Example 28. Suppose an incoming HTTP request contains the following Json object with a hostname, department name and ip-address. Note the unquoted name and string values, and the ip-address presented as a UNICODE character string:

{
host:"example.com"
    ,
"department" : HR    ,
    ip : "\u0031\u0039\u0032.\u0031\u0036\u0038.\u0034\u0037.\u0030"
}

Parsing untrusted input can always include surprises, so we relax Json syntax validation using the strict configuration parameter:

JSON {
 STRING:host,
 STRING:department,
 IPADDR:ip
}(strict=false):data

The result is (double-click on the resultset row to see the details):

data[host] = 'example.com'
data[department] = 'HR'
data[ip] = 192.168.47.0 // --

Parsing simple arrays with specified type

Simple Json arrays, which consist of a single Json type and are part of a Json object, can be parsed using the following syntax:

matcher '[]' [ ... ]

where:
    matcher : a matcher specifying transformation to primitive type

    '[]' : one or more pairs of square brackets representing the number of array dimensions

The + quantifier, member_name and export_name operators behave the same way as described above, as the sketch below illustrates.
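
For instance, a minimal sketch (member and export names are hypothetical) of a mandatory integer array whose Json member name contains a space:

JSON{
 INT[]+:"item list":items   //mandatory array member "item list", exported as 'items'
}:data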

Example 30. We have two json objects with numeric array members:

{
 "items":[1,3,6,9]
}
{
 "items":[22,566,79]
}

We can extract array members and transform them to INTEGER type:

JSON{
 INT[]:items	//integer array 'items'
}:data
EOL             //json objects are separated by newline

Result:

"{items=[1, 3, 6, 9]}"
"{items=[22, 566, 79]}"

Note that you can also parse multidimensional arrays: just add as many square bracket pairs as your array has dimensions.

Example 31. A two dimensional json array:

{
 "items":[[1,3,6,9],[4,4,59,100],[32,17,999,8]]
}
JSON {
 INT[][]:items
}:data
EOL

Result (double-click on the resultset row to see the details):

data[items][0][0] = 1
data[items][0][1] = 3
data[items][0][2] = 6
data[items][0][3] = 9
data[items][1][0] = 4
data[items][1][1] = 4
data[items][1][2] = 59
data[items][1][3] = 100
data[items][2][0] = 32
data[items][2][1] = 17
data[items][2][2] = 999
data[items][2][3] = 8

Parsing Nested Objects

For parsing Json objects nested inside another object, just use the JSON (JSON_OBJECT) matcher as defined above. Note that for nested objects the matcher name (JSON or JSON_OBJECT) can be omitted for simplicity; a sketch of this shorthand follows the example below.

Example 32.

{
  "name":"Evelyn",
  "age":32,
  "pId":1,
  "callerId":{
     "nickname":"eve",
     "ip":"192.168.1.0"
   }
}
{
  "name":"Mallory",
  "age":22,
  "pId":5,
  "callerId": {
     "nickname":"mel",
     "ip":"192.168.10.32"
  }
}
JSON {
 STRING:name,
 INT:age,
 JSON {
    STRING:nickname,
    IPV4:ip
 }:callerId
}:data
EOL

Result:

data
{name='Evelyn' age=32 callerId={nickname='eve' ip=192.168.1.0}}
{name='Mallory' age=22 callerId={nickname='mel' ip=192.168.10.32}}
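
Using the shorthand noted above, the same nested object can be described without repeating the matcher name (a minimal sketch of the shorthand form):

JSON {
 STRING:name,
 INT:age,
 {
    STRING:nickname,
    IPV4:ip
 }:callerId
}:data
EOL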

Json Arrays

Json syntax allows constructing arrays in a variety of ways. They can range from very simple primitive type members to very complex nested object members. A Json array can contain members of different types, and arrays may also appear outside of a Json object. JSON_ARRAY is meant to handle all of them.

The easiest way is to use JSON_ARRAY without any parameters. It will enumerate all array elements, transform them to SpectX data types according to their declared type in Json, and return the parsed elements as a VARIANT_ARRAY object.

Example 29.

[5,null,2.2,"17.9","5\u0037\u0037\u0037\u0037\u00372"]
JSON_ARRAY{}:arr

The resulting VARIANT_ARRAY holds all parsed array elements (double-click on the resultset row to see the details). Note that every parsed element is transformed to a VARIANT type according to its Json type:

arr = [5,null,2.2,"17.9","5777772"] (VARIANT_ARRAY)

JSON_ARRAY also provides more precise control over the transformation of array members. With this you can:

  • return parsed elements as a regular ARRAY type (simpler to access in queries).
  • specify the desired primitive data type for conversion.
  • extract multidimensional arrays.
  • extract Json objects or structures from string members.

The conversion type of array members can be specified as single matcher or a matcher expression enclosed in curly braces, followed by optional configuration and export name:

JSON_ARRAY '{' [(matcher | matcher_expr)] '}' ['(' configuration ')' ] [ ':'array_export_name ]

where:
    matcher : matcher specifying conversion to primitive type.

    matcher_expr : matcher':'export_name ...
        A matcher expression which outputs two or more different primitive types. Note that each matcher must be
        specified with export name.

    configuration : key-value pair of the following configuration option:
        typed={true | false} :  optional Boolean parameter controlling the JSON_ARRAY output type: true sets the
                                output type to ARRAY, false sets the output type to VARIANT_ARRAY.

    array_export_name : name of output

Let’s have a look at examples.

Example 33: Setting the JSON_ARRAY output to ARRAY. When JSON_ARRAY is used without parameters, the output type is VARIANT_ARRAY. To change it to ARRAY we set the typed configuration to true. So, using the same data from Example 29 above:

[5,null,2.2,"17.9","5\u0037\u0037\u0037\u0037\u00372"]

and adding configuration to pattern:

JSON_ARRAY{}(typed=true):arr

we now get the result as an ARRAY (as opposed to the VARIANT_ARRAY above):

arr[0] = 5L // (VARIANT:LONG)
arr[1] = null
arr[2] = 2.2D // (VARIANT:DOUBLE)
arr[3] = '17.9' // (VARIANT:STRING)
arr[4] = '5777772' // (VARIANT:STRING)

Example 34. Converting an entire array to one primitive type.

[223423,-343.8e7,null,"3.14"]
JSON_ARRAY{
 DOUBLE       //we convert entire array to DOUBLE
}:arr         //output is named 'arr'

Here is the result of applying this pattern to the data above (double-click on the resultset row to see the details). Note that when a conversion is specified, the JSON_ARRAY output type is automatically set to ARRAY.

arr[0] = 223423.0D
arr[1] = -3.438E9D
arr[2] = null
arr[3] = 3.14D

Example 35. Handling complex multi-dimensional arrays. Suppose we have a two-dimensional array where the elements of the inner arrays are strings, each containing an ip-address and a count. The latter is expressed either as decimal or hex:

[
  ["10.0.0.1 512","10.0.0.2 FFFFF"],
  ["10.0.10.243 512","10.0.10.104 3", "10.0.10.186 94"]
]

With the following pattern we can extract ip-address and count structures:

JSON_ARRAY{
 JSON_ARRAY{(IPADDR:ip ' ' (INT:int | HEXLONG:hexint))}
}:a

Here’s the result (double-click on the resultset row to see the details):

a[0][0][ip] = 10.0.0.1 // --
a[0][0][int] = 512
a[0][0][hexint] = null
a[0][1][ip] = 10.0.0.2 // --
a[0][1][int] = null
a[0][1][hexint] = 1048575
a[1][0][ip] = 10.0.10.243 // --
a[1][0][int] = 512
a[1][0][hexint] = null
a[1][1][ip] = 10.0.10.104 // --
a[1][1][int] = 3
a[1][1][hexint] = null
a[1][2][ip] = 10.0.10.186 // --
a[1][2][int] = 94
a[1][2][hexint] = null

Example 36. Extracting Json objects from an array. Suppose we have a Json array of Simpsons cartoon characters:

[
  {
    "name":"Homer Simpson",
    "age":40,
    "cars":["pink sedan"],
    "married":true,
    "family":{"wife":"Marge Simpson", "children":["Bart","Lisa","Maggie"], "pets":{"dog":"Santa's Little Helper", "cat":"Snowball"}}
  },
  {
    "name":"Charles Montgomery Burns",
    "age":104,
    "cars":["936 Stutz Bearcat", "Bugatti Royale", "Ford Quadricycle"],
    "married":false,
    "family":{"assistant":"Waylon Smithers", "children":[], "pets":{"dog":"vicious guard dog 1", "dog":"vicious guard dog 2", "dog":"vicious guard dog 3"}}
  }
]

With the following pattern we can extract all attributes of each character:

JSON_ARRAY{
 JSON{
   STRING:name,
   INT:age,
   STRING[]:cars,
   BOOLEAN:married,
   JSON:family
 }
}:simpsons

Here’s the result (double-click on the resultset row to see the details):

_unmatched = null
simpsons[0][name] = 'Homer Simpson'
simpsons[0][age] = 40
simpsons[0][cars][0] = 'pink sedan'
simpsons[0][married] = true
simpsons[0][family][wife] = 'Marge Simpson' // (VARIANT:STRING)
simpsons[0][family][children] = [Bart,Lisa,Maggie] (VARIANT_ARRAY) // (VARIANT:VARIANT_ARRAY)
simpsons[0][family][pets][dog] = 'Santa\'s Little Helper' // (VARIANT:STRING)
simpsons[0][family][pets][cat] = 'Snowball' // (VARIANT:STRING)
simpsons[1][name] = 'Charles Montgomery Burns'
simpsons[1][age] = 104
simpsons[1][cars][0] = '936 Stutz Bearcat'
simpsons[1][cars][1] = 'Bugatti Royale'
simpsons[1][cars][2] = 'Ford Quadricycle'
simpsons[1][married] = false
simpsons[1][family][assistant] = 'Waylon Smithers' // (VARIANT:STRING)
simpsons[1][family][children] = [] (VARIANT_ARRAY) // (VARIANT:VARIANT_ARRAY)
simpsons[1][family][pets][dog] = 'vicious guard dog 1' // (VARIANT:STRING)
simpsons[1][family][pets][dog] = 'vicious guard dog 2' // (VARIANT:STRING)
simpsons[1][family][pets][dog] = 'vicious guard dog 3' // (VARIANT:STRING)

Json Values

Json arrays, strings, numbers, booleans and null are perfectly valid and allowed to appear without being enclosed in a Json object (see the JSON Grammar). JSON_VALUE is intended for parsing single Json values:

JSON_VALUE '{' [(matcher | matcher_expr)] '}' ['(' configuration ')' ] [ ':'val_export_name ]

where:
    matcher : matcher specifying conversion to primitive type.

    matcher_expr : matcher':'export_name ...
        A matcher expression which outputs two or more different primitive types. Note that each matcher must be
        specified with export name.

    configuration : key-value pair of the following configuration option:
        typed={true | false} :  optional Boolean parameter controlling the JSON_VALUE output type: true sets the
                                output to the converted SpectX native type (default), false sets the output type to VARIANT.
                                Effective only with explicit conversion; ignored when used without parameters.

    val_export_name : name of the output
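
The matcher_expr form can also decompose a single Json string value into several typed fields, analogous to Example 35 above. A minimal sketch (the input and export names are hypothetical), parsing the Json value "10.0.0.1:443":

"10.0.0.1:443"

JSON_VALUE{(IPADDR:ip ':' INT:port)}:endpoint   //splits the string value into fields 'ip' and 'port'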

Example 37. Here we have three Json values: number, boolean and string:

33true"::1"
JSON_VALUE{}:jv1                        //automatic type discovery and conversion
JSON_VALUE{BOOLEAN}(typed=false):jv2    //explicit type conversion. However the output is forced to VARIANT
JSON_VALUE{IPADDR}:jv3                  //explicit type conversion
EOL

Here’s the result (double-click on the resultset row to see the details):

jv1 = 33L // (VARIANT:LONG)
jv2 = true // (VARIANT:BOOLEAN)
jv3 = IPADDR('::1') // --

Key-Value Pairs

Parsing lists of unordered key-value pairs is not easy, especially when the key names vary. KVP provides simple yet powerful features to parse even the most complex key-value pair lists. It returns a VARIANT_OBJECT.

The formal syntax of using KVP is:

KVP '{' key_matcher_expr value_matcher_expr '}' [ '{'quantifier'}' ] [ ':'export_name ]

where:
    key_matcher_expr : a matcher expression returning a STRING value exported with the name 'key'

    value_matcher_expr : a matcher expression returning one or more typed values with export names prefixed 'value'

    quantifier : specifies min and max number of matched key-value pairs. Default min=1, max=128. Optional.

    export_name : name of output

KVP applies the specified pattern repeatedly until the maximum number of key-value pairs has been reached or a mismatch occurs.
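
The optional quantifier can override these defaults. A minimal sketch (assuming the pattern language's regular {min,max} quantifier form) accepting at most 16 pairs:

KVP{
  [! =]+:key
  '='
  [!\t\n]*:value
  '\t'?
}{1,16}:attributes   //match between 1 and 16 key-value pairs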

Example 38: Parsing Log Event Extended Format (LEEF) of IBM QRadar. Note unordered key-value pairs in different records:

Jan 18 11:07:53 192.168.1.1 LEEF:1.0|QRadar|QRM|1.0|NEW_PORT_DISCOVERD|src=172.5.6.67	dst=172.50.123.1	sev=5	cat=anomaly msg=there are spaces in this message
Jan 18 11:07:53 192.168.1.1 LEEF:1.0|QRadar|QRM|1.0|NEW_PORT_DISCOVERD|long_key=multi word value	dst=172.50.123.1	sev=5	cat=anomaly	msg=there are spaces in this message
/*https://www.ibm.com/support/knowledgecenter/en/SSMPHH_9.5.0/com.ibm.guardium95.doc/appendices/topics/leef_mapping.html*/

//record header:
TIMESTAMP('MMM dd HH:mm:ss', tz='GMT'):time
' ' IPADDR:server_host
' LEEF:' FLOAT:leefVer
'|' LD:vendor
'|' LD:product
'|' LD:version
'|' LD:eventId '|'

//key-value pairs as per IBM specification:
KVP{
  [! =]+:key     // Keys must not contain spaces or equal signs
  '='
  [!\t\n]*:value // Values must not contain tabs and new line characters
  '\t'?          // key-value pairs are separated by tab
}:attributes
EOL

Here’s how one of the results looks (double-click on the resultset row to see the details):

time = T('2019-01-18 13:07:53.000 +0200')
server_host = 192.168.1.1 // --
vendor = 'QRadar'
product = 'QRM'
version = '1.0'
eventId = 'NEW_PORT_DISCOVERD'
attributes[src] = '172.5.6.67' // (VARIANT:STRING)
attributes[dst] = '172.50.123.1' // (VARIANT:STRING)
attributes[sev] = '5' // (VARIANT:STRING)
attributes[cat] = 'anomaly msg=there are spaces in this message' // (VARIANT:STRING)

Example 39. We could modify the pattern slightly if we wanted to convert ip-addresses and severity levels to IPADDR and INTEGER:

/*https://www.ibm.com/support/knowledgecenter/en/SSMPHH_9.5.0/com.ibm.guardium95.doc/appendices/topics/leef_mapping.html*/

//record header:
TIMESTAMP('MMM dd HH:mm:ss', tz='GMT'):time
' ' IPADDR:server_host
' LEEF:' FLOAT:leefVer
'|' LD:vendor
'|' LD:product
'|' LD:version
'|' LD:eventId '|'

// key-value pairs:
KVP{
  [! =]+:key               // Keys must not contain spaces or equal signs
  '='
  (                        // use alternative group to match different types of value:
	IPADDR:valueIp |       // ip-address or
	INT:valueInt   |       // integer or
	[!\t\n]*:valueString   // string not containing tab or line feed (if any of before didn't match)
  )
  '\t'?                    // key-value pairs are separated by tab
}:attributes
EOL

Now the severity, src and dst fields have been converted appropriately (double-click on the resultset row to see the details):

time = T('2019-01-18 13:07:53.000 +0200')
server_host = 192.168.1.1 // --
vendor = 'QRadar'
product = 'QRM'
version = '1.0'
eventId = 'NEW_PORT_DISCOVERD'
attributes[src] = 172.5.6.67 // US // (VARIANT:IPADDR)
attributes[dst] = 172.50.123.1 // US // (VARIANT:IPADDR)
attributes[sev] = 5 // (VARIANT:INTEGER)
attributes[cat] = 'anomaly msg=there are spaces in this message' // (VARIANT:STRING)

Binary Data Formats

PCAP

PCAP extracts selected elements from network packet capture data (https://en.wikipedia.org/wiki/Pcap): capture time, the number of bytes observed and captured, source and destination MAC and IP addresses, PDU type of the link layer, network layer protocol id, tcp flags, and TTL.

Example 45:

@list    = LIST('file:///sample.pcap');
@stream  = PARSE(pattern:'PCAP', src:@list);

@stream
 .limit(1000)
;

Returns:

captureTime captureLen wireLen src dst type ipSrc ipDst ipProto tcpFlag ttl
2017-01-18 09:15:55 78 78 6c-40-08-98-54-7e 90-72-40-03-3a-73 8 192.168.1.2:59873 217.146.76.69:8389 6 0x40 64
2017-01-18 09:15:55 74 74 90-72-40-03-3a-73 6c-40-08-98-54-7e 8 217.146.76.69:8389 192.168.1.2:59873 6 0x0 60
2017-01-18 09:15:55 66 66 6c-40-08-98-54-7e 90-72-40-03-3a-73 8 192.168.1.2:59873 217.146.76.69:8389 6 0x40 64
2017-01-18 09:15:55 338 338 6c-40-08-98-54-7e 90-72-40-03-3a-73 8 192.168.1.2:59873 217.146.76.69:8389 6 0x40 64
2017-01-18 09:15:55 66 66 90-72-40-03-3a-73 6c-40-08-98-54-7e 8 217.146.76.69:8389 192.168.1.2:59873 6 0x0 60
2017-01-18 09:15:55 216 216 90-72-40-03-3a-73 6c-40-08-98-54-7e 8 217.146.76.69:8389 192.168.1.2:59873 6 0x0 60
2017-01-18 09:15:55 66 66 6c-40-08-98-54-7e 90-72-40-03-3a-73 8 192.168.1.2:59873 217.146.76.69:8389 6 0x40 64
2017-01-18 09:15:55 364 364 6c-40-08-98-54-7e 90-72-40-03-3a-73 8 192.168.1.2:59873 217.146.76.69:8389 6 0x40 64
2017-01-18 09:15:55 721 721 90-72-40-03-3a-73 6c-40-08-98-54-7e 8 217.146.76.69:8389 192.168.1.2:59873 6 0x0 60
2017-01-18 09:15:55 66 66 6c-40-08-98-54-7e 90-72-40-03-3a-73 8 192.168.1.2:59873 217.146.76.69:8389 6 0x40 64
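
The extracted fields can then be used like any other fields in queries. For instance, a sketch (assuming the query layer's filter and select verbs) keeping only TCP packets:

@list    = LIST('file:///sample.pcap');
@stream  = PARSE(pattern:'PCAP', src:@list);

@stream
 .filter(ipProto = 6)                //keep TCP traffic (IP protocol id 6)
 .select(captureTime, ipSrc, ipDst)
 .limit(100)
;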

Referencing Pattern Expressions

An expression can be assigned to a variable and referenced from other expressions. This allows building complex patterns that remain easily readable.

The syntax of declaring a pattern is similar to declarations of query statements:

'$'variable_name '=' matcher [ ... ] ';'

Note that declared expressions can be referenced only from the current pattern and are not visible outside of it.

Example 40:

14/Mar/2016:23:37:06 +0200,INFO,mary01,127.0.0.1,reqid=52d29ae72bb69;method=POST;amt=100,200
14/Mar/2016:23:37:07 +0200,INFO,john,127.0.0.1,reqid=470f97a9bc34d;method=POST;amt=86,404
14/Mar/2016:23:37:13 +0200,ERROR,albert,127.0.0.1,500
//declare a pattern to extract values from key-value pair structure separated by semicolon
$postParams =
(
  'reqid'   '=' [a-fA-F0-9]*:reqId
  ';method' '=' UPPER:method
  ';amt'    '=' INT:amount
);


//pattern to parse record
(
  HTTPDATE:timestamp
  UPPER:severity
  ALNUM:username
  IPV4:ipAddr
  ($postParams)?         //refer to declared pattern for parsing kvp structured field
  INT:httpStatus
)(fs=',')
EOL

Result:

timestamp severity username ipAddr reqId method amount httpStatus
2016-03-14 23:37:06 +0200 INFO mary01 127.0.0.1 52d29ae72bb69 POST 100 200
2016-03-14 23:37:07 +0200 INFO john 127.0.0.1 470f97a9bc34d POST 86 404
2016-03-14 23:37:13 +0200 ERROR albert 127.0.0.1 NULL NULL NULL 500

Exporting Rules

As already mentioned in Export Name, a pattern reference can also be assigned an export name. What happens to the exported members? The rules are simple:

  • if a pattern_reference is not exported, then all of its member exports are visible with their own export_names. Example 41:

    2016-01-03 11:30:24;300;foo
    2016-01-03 11:30:25;400;boo
    
    $hdr =                                      // declare a header
     TIMESTAMP('yyyy-MM-dd HH:mm:ss'):dateTime  // exporting dateTime
     ';'
     INT:response                               // and response
     ';'
    ;
    
    $hdr LD EOL;                                // reference the header in our pattern.
                                                // Note that the reference has no export_name
    

    Result:

    dateTime response
    2016-01-03 11:30:24.000 +0200 300
    2016-01-03 11:30:25.000 +0200 400
  • if a pattern reference is exported (i.e. has an assigned export_name), then all its exported members are visible in a structure named by the pattern reference’s export_name. Example 42:

    2016-01-03 11:30:24;300;foo
    2016-01-03 11:30:25;400;boo
    
    $hdr =                                      // declare a header
     TIMESTAMP('yyyy-MM-dd HH:mm:ss'):dateTime  // export dateTime
     ';'
     INT:response                               // export response
     ';'
    ;
    
    $hdr:header LD EOL;                         // and reference the header in our pattern.
                                                // Note that the reference is exported as 'header'
    

    Result:

    header
    {dateTime=2016-01-03 11:30:24.000 +0200 response=300}
    {dateTime=2016-01-03 11:30:25.000 +0200 response=400}
  • if no member was exported, then all matched data is exported as a string. Example 43:

    2016-01-03 11:30:24;300;foo
    2016-01-03 11:30:25;400;boo
    
    $hdr =                                      // declare a header with no exported member
     TIMESTAMP('yyyy-MM-dd HH:mm:ss')
     ';'
     INT
     ';'
    ;
    
    $hdr:header LD EOL;                         // and reference the header in our pattern
                                                // Note that the reference is exported as 'header'
    

    Result:

    header
    2016-01-03 11:30:24;300;
    2016-01-03 11:30:25;400;

Adding Metadata

Every now and then you may find it useful to add metadata to parsed records or data elements. For example, if the source data contains several different record types, you may want to include an explicit record type field, or the position of a data element in the input stream. To achieve this, the pattern language allows defining constant values or matched token positions in patterns.

To define a metadata element, place the keyword pos, a string or an integer value between angle brackets:

'<' { 'pos' | STRING | INTEGER } '>' [':'export_name]
where:
  • pos - outputs the starting position of the matching token in the input stream (see the sketch below)
  • STRING and INTEGER values are emitted to the resultset as is.
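
For instance, a minimal sketch (field names are hypothetical) recording the byte offset at which a username starts in a comma-separated record:

INT:seq
','
<pos>:uname_pos        //starting byte offset of the username in the input stream
[a-zA-Z0-9]*:uname
EOL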

Similarly to matchers, metadata elements won’t appear in the resultset unless assigned an export_name. Example 44:

14/Mar/2016:23:37:06 INFO [1372845]; tcp connection established from 192.168.75.20
14/Mar/2016:23:37:07 INFO [1372845]; reqType=GET uri="/?a=0&b=3&c=apply"
14/Mar/2016:23:37:07 INFO [1372845]; authenticating username: smartdude
14/Mar/2016:23:37:08 INFO [1372845]; authentication result: OK
14/Mar/2016:23:37:09 INFO [1372845]; posting response: uid=279956825 resp_code=200
14/Mar/2016:23:37:09 INFO [1372845]; closing connection
/* assign distinct record type for records to enable explicit selecting in queries */

// each record seems to start with a timestamp, loglevel and connection id; declare it as a header
$hdr =
TIMESTAMP('dd/MMM/yyyy:HH:mm:ss'):timestamp ' '
UPPER:logLevel
' [' INT:connId ']'
'; '
;

// declare connect record
$connect =
<'connect'>:recType     // metadata element defining a constant value 'connect', exported as recType.
$hdr                    // header
LD 'connection established from '
IPV4:ip_addr
EOL;

// pattern to match other records
$other =
<'other'>:recType       // metadata element defining constant value 'other', exported also as recType
$hdr                    // header
LD:line
EOL;

($connect | $other)     // our final pattern consist of connect or other records

Result:

recType timestamp logLevel connId ip_addr line
connect 2016-03-14 23:37:06.000 +0200 INFO 1372845 192.168.75.20 NULL
other 2016-03-14 23:37:07.000 +0200 INFO 1372845 NULL reqType=GET uri="/?a=0&b=3&c=apply"
other 2016-03-14 23:37:07.000 +0200 INFO 1372845 NULL authenticating username: smartdude
other 2016-03-14 23:37:08.000 +0200 INFO 1372845 NULL authentication result: OK
other 2016-03-14 23:37:09.000 +0200 INFO 1372845 NULL posting response: uid=279956825 resp_code=200
other 2016-03-14 23:37:09.000 +0200 INFO 1372845 NULL closing connection

Another (perhaps more realistic) example of determining matched token positions is given in Distributed Grep of Creditcard Numbers and SSN’s.