Input Data Parser

If you are new to the SpectX pattern matching language, the best way to get started is the Pattern Development Guide. It introduces the Pattern Developer and the general workflow of preparing a pattern, and gives many practical examples.

The focus of this manual is to give detailed information on pattern matching language elements.

A pattern consists of one or more matcher expressions, separated by whitespace, commas, or newlines.

A matcher expression can be any of the following:
Matcher expressions can be grouped:
  • sequence group - defines ordered sequence of matchers
  • alternatives group - defines a list of matchers to choose from
  • array - to parse repeated data elements
  • structure - to capture parsed data as composite type
  • enum group - allows matching strings to numeric values
  • JSON - allows parsing JSON structures
Matcher expressions have Modifiers:
  • Some matchers accept configuration that specifies their behavior. For instance, a timestamp matcher needs a definition of the expected format.
  • Most matchers and groupings accept a quantifier, which tells the engine how many times it should try to match.
  • All matchers and groupings can be declared optional, i.e. if the element in the expected position is missing, the engine outputs NULL to the resultset and continues with the next matcher in the expression.
  • All matchers and groupings can be assigned an export_name, the name of the field exposed to the query layer. The sole purpose of pattern matching is to make data elements available to the query engine. However, not all matched elements are needed in queries (field separators in tabulated files, for example), so export_name is how the end-user declares which data elements are exposed to queries and what the query fields are called. A matcher without export_name still does its job of matching the pattern, but it is not visible in queries.
  • All matchers and groupings can “look around” (backward or forward), mainly to enable decision making (conditional branching).
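
The optional and export_name modifiers described above can be illustrated outside SpectX with a small Python sketch. This is not the SpectX implementation: the Matcher class, the apply_pattern function and the use of regular expressions as stand-ins for matchers are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Matcher:
    """Hypothetical stand-in for a matcher expression with its modifiers."""
    regex: str                         # what must match at the current position
    export_name: Optional[str] = None  # None: matched, but hidden from queries
    optional: bool = False             # True: a miss yields None, not a failure


def apply_pattern(pattern, text):
    """Apply the matchers in order; return the exported fields or None."""
    pos, row = 0, {}
    for m in pattern:
        hit = re.compile(m.regex).match(text, pos)
        if hit is None:
            if not m.optional:
                return None              # a mandatory element is missing
            if m.export_name:
                row[m.export_name] = None  # optional element: output NULL
            continue                     # carry on with the next matcher
        pos = hit.end()
        if m.export_name:                # only named matchers are exported
            row[m.export_name] = hit.group()
    return row


pattern = [
    Matcher(r"\d+", "seq"),
    Matcher(r","),                                     # separator, not exported
    Matcher(r"[A-Za-z0-9]+", "uname", optional=True),
    Matcher(r","),
    Matcher(r"\d+\.\d+\.\d+\.\d+", "user_ip"),
]

full_row = apply_pattern(pattern, "1,alice,192.168.1.1")
missing_name = apply_pattern(pattern, "2,,10.6.24.18")
print(full_row)
print(missing_name)
```

In the second call the username is absent, so the sketch records None for uname and continues with the following separator, while the separators themselves never appear in the result.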

Example. Suppose we have a comma-separated record (terminated with the line feed character) with the following fields:

  • order number - integer
  • username - consisting of upper and lower case letters and numbers (but not a comma)
  • ipv4 address of the user
1,alice,192.168.1.1
2,bob,10.6.24.18
3,mallory,192.168.1.3

This structure can be described by the following pattern expression:

1  INT:seq
2  ','
3  LD:uname
4  ','
5  IPADDR:user_ip
6  EOL
where:
  • line 1: integer matcher for the order number, visible in queries as ‘seq’
  • lines 2 and 4: constant string matcher for the field separator, not visible in queries
  • line 3: line data matcher for the username, visible in queries as ‘uname’
  • line 5: IP address matcher, visible in queries as ‘user_ip’
  • line 6: end-of-line matcher for the line feed terminating the record, not visible in queries
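
For readers who know regular expressions, a rough Python analogue of this pattern may help. It is only an approximation under stated assumptions: the name RECORD is hypothetical, `[^,\n]*` stands in for the line data matcher, and the IP expression covers dotted IPv4 notation only. Named groups play the role of export_name, and the unnamed separators are matched but not captured.

```python
import re

# Approximate regular-expression analogue of the six-line pattern.
RECORD = re.compile(
    r"(?P<seq>\d+)"                          # line 1: integer, exported as seq
    r","                                     # line 2: separator, not exported
    r"(?P<uname>[^,\n]*)"                    # line 3: stand-in for line data
    r","                                     # line 4: separator, not exported
    r"(?P<user_ip>\d{1,3}(?:\.\d{1,3}){3})"  # line 5: IPv4 only (assumption)
    r"\n"                                    # line 6: end of line
)

data = "1,alice,192.168.1.1\n2,bob,10.6.24.18\n3,mallory,192.168.1.3\n"

# Each match yields only the named (exported) fields.
rows = [m.groupdict() for m in RECORD.finditer(data)]
for row in rows:
    print(row)
```

Unlike the pattern above, this analogue leaves every field as a string; converting seq to an integer would be an extra step.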

The SpectX pattern matching engine applies the pattern by trying its matchers in the order they were defined. In the example above it starts by trying to match INT:seq at the first byte of the input data, which happens to be ‘1’. As this is suitable for an integer, the engine moves on to the next byte and finds a comma. A comma cannot be part of an integer, so the INT:seq matcher is completed by converting ‘1’ to an integer, and the next matcher in the pattern is selected: ','.

The engine tries it at the current position and finds a match, so the data pointer moves on to the next byte (the first letter of ‘alice’). As the constant string matcher contains just one character, it is considered complete and the engine takes the next matcher in the pattern: LD:uname. This matcher consumes a variable number of bytes, so it keeps matching until the username ends. This happens at the second comma (just after ‘alice’); the engine considers LD:uname complete and takes the next matcher: ','. Again, it tries to match it against the byte at the current position and succeeds. The data pointer moves to the next byte, the beginning of ‘192.168.1.1’.

With ',' complete, the engine takes IPADDR:user_ip. Trying it from the current position, it finds a match and the data pointer moves forward 11 bytes, now pointing to a newline character. The engine matches it using the last matcher in the pattern: EOL.

Now the data pointer is advanced to the next byte, the pattern iterator is reset, and the cycle continues by trying the first matcher of the pattern again at the current position. This continues until the end of the input data.
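
The walk-through above can be sketched as a small Python loop with an explicit data pointer. This is an illustration only, not the SpectX engine: the helper names (match_int, match_const, match_until, match_ipv4, PATTERN, parse) are hypothetical, and match_until is a simplified stand-in for the username matcher that stops at a comma or newline.

```python
import ipaddress


def match_int(data, pos):
    """Consume digits starting at pos; return (new_pos, value) or None."""
    end = pos
    while end < len(data) and data[end].isdigit():
        end += 1
    return (end, int(data[pos:end])) if end > pos else None


def match_const(s):
    """Build a matcher for the constant string s."""
    def matcher(data, pos):
        return (pos + len(s), s) if data.startswith(s, pos) else None
    return matcher


def match_until(stop_chars):
    """Build a matcher that consumes bytes until one of stop_chars."""
    def matcher(data, pos):
        end = pos
        while end < len(data) and data[end] not in stop_chars:
            end += 1
        return (end, data[pos:end])
    return matcher


def match_ipv4(data, pos):
    """Consume digits and dots, then validate them as an IPv4 address."""
    end = pos
    while end < len(data) and (data[end].isdigit() or data[end] == "."):
        end += 1
    try:
        return (end, str(ipaddress.IPv4Address(data[pos:end])))
    except ValueError:
        return None


# (export name, matcher) pairs; None means "matched but not exported".
PATTERN = [
    ("seq", match_int),
    (None, match_const(",")),
    ("uname", match_until(",\n")),
    (None, match_const(",")),
    ("user_ip", match_ipv4),
    (None, match_const("\n")),
]


def parse(data):
    """Apply PATTERN repeatedly; this sketch stops at the first mismatch."""
    pos, rows = 0, []
    while pos < len(data):
        row, cur = {}, pos
        for name, matcher in PATTERN:
            result = matcher(data, cur)
            if result is None:
                return rows          # unmatched data is not handled here
            cur, value = result      # advance the data pointer
            if name is not None:
                row[name] = value
        rows.append(row)
        pos = cur                    # reset the pattern, continue after record
    return rows


rows = parse("1,alice,192.168.1.1\n2,bob,10.6.24.18\n3,mallory,192.168.1.3\n")
for row in rows:
    print(row)
```

Each matcher returns the new pointer position together with the converted value, which mirrors how the integer matcher in the walk-through is completed by converting ‘1’ to an integer.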

Should the engine encounter data for which it cannot find a match, it resets the pattern iterator, marks the current byte as unmatched, and moves on to the next byte. This continues until a match is found or there is no more data. Eventually, the following structured data is available for query:
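
The handling of unmatched data can be sketched in the same spirit. The Python below is an assumption-laden illustration rather than the engine's actual logic: RECORD is a hypothetical regular expression standing in for the whole pattern, scan is a hypothetical function name, and the input contains an invented garbage line to trigger the skipping.

```python
import re

# Whole-record stand-in for the pattern (bytes, IPv4 only; an assumption).
RECORD = re.compile(rb"(\d+),([^,\n]*),(\d{1,3}(?:\.\d{1,3}){3})\n")


def scan(data: bytes):
    """Return (rows, unmatched byte offsets) for the input data."""
    pos, rows, unmatched = 0, [], []
    while pos < len(data):
        m = RECORD.match(data, pos)   # try the pattern at the current byte
        if m:
            rows.append((int(m.group(1)),
                         m.group(2).decode(),
                         m.group(3).decode()))
            pos = m.end()             # continue right after the record
        else:
            unmatched.append(pos)     # mark this byte as unmatched
            pos += 1                  # and retry from the next byte
    return rows, unmatched


rows, unmatched = scan(b"1,alice,192.168.1.1\n??\n2,bob,10.6.24.18\n")
print(rows)
print(unmatched)
```

The three bytes of the garbage line are recorded as unmatched one at a time, after which the pattern matches again at the start of the next record.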

seq  uname    user_ip
1    alice    192.168.1.1
2    bob      10.6.24.18
3    mallory  192.168.1.3