Developing Patterns

Using Pattern Developer

The Pattern Developer tab is divided in two: the upper half contains the pattern editor, and the lower half contains three tabs:

  • Data Editor - holds the sample data to be parsed. This is an editor window, i.e. you can create the data, paste it from other windows, and change it as you like. The Data Browser also places the data from a file preview here.
  • Results - displays parsing results: exported fields and unmatched data.
  • Parse Preview - displays matched data in colors as you type in Data Editor.

Action bar has buttons for:

  • Parse - executes parsing.
  • Save - opens a dialog window for saving the pattern to a file. Pattern files have the .sxp extension.
  • Show Parser Tree - displays the logical structure of the pattern.
  • Matcher Performance - runs a matching performance test of the pattern. During the test the pattern elements are only matched, not parsed (i.e. exported), regardless of whether they have an export name assigned.
  • Parser Performance - runs a parsing performance test of the pattern. Here the exported pattern elements are also parsed, exactly as during query processing. The difference between matching and parsing performance shows the cost of exporting fields.
  • Prepare Query - opens a new query tab with a skeleton query containing the prepared pattern and the chosen source file. Note that the button is displayed only when the Data Editor was populated from the Data Browser.

Shortcut keys:

  • CTRL+E - executes parsing.
  • CTRL+S - saves the pattern file.

Normally the workflow starts with choosing a data file: open the Data Browser, choose the file you need, and press Prepare Pattern. This opens a new Pattern Developer tab and sends a limited sample from the file to the Data Editor. More precisely, the sample is the file preview, whose size is up to 16 kB by default and can be changed with the wgui.dataBrowser.preview_size property (see Configuration for details).

The pattern editor is also populated with a pattern autodetected from the sample data. Use it if you like, or delete it and start from scratch. Keep the SpectX Pattern Matching Language Reference Manual at hand and make use of autocomplete (CTRL+SPACE): it suggests matchers that could be used.

The pattern matching engine tries to match the pattern from the beginning of the data towards its end (on the screen this means from left to right). Pattern Developer does this as you type in pattern elements and colorizes the pattern matches in the Parse Preview tab.

NB! The pattern is a sequence of pattern elements (called matchers). Therefore, if the pattern consists of only one element, only matches of that element are colorized. If the pattern has two elements, matches of the sequence of those two elements are colorized, and so on. Unmatched data is left on a white background.

At any time you can execute Parse to make sure that all the data elements you intend to extract from the source data are parsed correctly and have the correct type assigned. The result set also shows unmatched data between the records. (Note that you may often see unmatched data at the last record. This is because the sample data is cut off in the middle of a record.)
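The cut-off effect mentioned above can be illustrated outside SpectX with a small Python sketch. It is only an analogy, not SpectX code: the helper names are hypothetical, records are assumed to be LF-terminated lines, and "16 kB" is assumed to mean 16 * 1024 bytes.

```python
PREVIEW_SIZE = 16 * 1024  # assumption: "16 kB" taken as 16 * 1024 bytes


def take_preview(data: bytes, size: int = PREVIEW_SIZE) -> bytes:
    """Return at most `size` leading bytes, as a fixed-size preview would."""
    return data[:size]


def split_complete_records(sample: bytes):
    """Split a sample into complete LF-terminated lines and a trailing partial line."""
    complete, _, tail = sample.rpartition(b"\n")
    records = complete.split(b"\n") if complete else []
    return records, tail


# Hypothetical data; a tiny preview size makes the cut-off easy to see.
data = b"first record\nsecond record\nthird record\n"
sample = take_preview(data, size=20)
records, tail = split_complete_records(sample)
print(records)  # [b'first record']
print(tail)     # b'second '
```

The non-empty tail corresponds to the partial last record that a pattern ending with a line terminator cannot match.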

When you are satisfied with the pattern, it is a good idea to Save it so that it can be reused later. Saved pattern files have the .sxp extension. The sample data is saved as well, in a file with the same name and the .sxp.data extension.

Then it is time to proceed to querying data with your newly developed pattern. The quickest way to do that is to press Prepare Query, which opens a new query tab and creates a simple query script. The script consists of a PARSE command (taking your pattern and the source file URI as arguments) followed by a SELECT pipe command that selects all extracted fields and also the unmatched data.

Now you must decide whether you want your pattern to match all the data or whether you do not care about unmatched bytes. In the latter case you can remove the _unmatched field from the SELECT pipe command and proceed with developing the query.

Finding Unmatched Corner Cases

However, if you need all the data covered by your pattern, you must try the pattern on more data than just the small sample in the Data Editor.

When executing the prepared query does not produce immediately visible unmatched data (it may be far down in the result set), it makes sense to check whether there are any unmatched bytes at all: switch to the Explanation tab and look for parser.unmatched.bytes. If this is zero, you can repeat the execution with other or additional files, or consider the pattern finished.

If the number indicates that there is still unmatched data somewhere, you can bring it up using the FILTER pipe command.

Example 1.

$pattern = PATTERN{
    IPV4:ip LD HTTPDATE:date LD:line EOL
};

PARSE(src:'sx:/user/examples/data/apache_access.log.sx.gz', pattern:$pattern)
 .select(_unmatched, *)
 .filter(_unmatched is not null);

Double-clicking a row in the result set displays the details of the unmatched data:

_unmatched[pos] = 16048L
_unmatched[len] = 30
_unmatched[data] = 'unmatched corner case example

Detecting Changes in Input Data

Changes in the structure of data can appear both intentionally (e.g. new data elements introduced or existing ones changed due to added business functionality) and unintentionally (because of programming errors or loss of data integrity in transport). They can easily reduce the reliability of the analysis, so it is important to identify whether changes have been introduced (see this whitepaper).

SpectX provides a simple way to determine the reliability of data by outputting the ratio of unmatched bytes encountered in the queried data. When the ratio is in the expected range, the data structures are also in the expected form, without intentional or unintentional changes. With a pattern that fully matches all the data, you can set the expected ratio of unmatched bytes to zero. Any query resulting in a higher ratio of unmatched bytes then indicates changes (or errors) in the source data. You can also set the ratio to a higher value to match your expectations.

Note that the ratio of unmatched data is calculated for the data used in that particular query. For queries involving data unaffected by changes, confidence in the analysis results is high. When a query does include changed data, the ratio of unmatched bytes gives the analyst a good basis for deciding how much the results of that query can be trusted.
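The arithmetic behind such a check is simple and can be sketched as follows. This is a minimal Python illustration, not SpectX functionality: the function names and the sample figures are hypothetical, and the ratio is assumed to be unmatched bytes divided by the total bytes read by the query.

```python
def unmatched_ratio(unmatched_lengths, total_bytes):
    """Return the share of bytes left unmatched (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return sum(unmatched_lengths) / total_bytes


def within_expected(ratio, expected_max=0.0):
    """True when the observed ratio does not exceed the expected ratio."""
    return ratio <= expected_max


# Hypothetical figures: one 30-byte unmatched chunk in 16384 bytes of data.
ratio = unmatched_ratio([30], 16384)
print(within_expected(ratio))        # False: the expected ratio is zero
print(within_expected(ratio, 0.01))  # True: up to 1 % unmatched is tolerated
```

With an expected ratio of zero, a single unmatched record is enough to flag the query result for review.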

Usually it makes sense to define a view for fully mapped data (see Using Views). Among other features, a view also provides the means to define the expected ratio of unmatched data.

Common Mistakes

Always pay attention when determining data element and record separators. It is easy to get them wrong. For instance, if your pattern specifies LF as the record separator (i.e. UNIX-style line termination), it will fail to match records generated on Windows, where lines end with CRLF. Line terminations can easily get changed when you use copy-paste to populate the Data Editor. It is therefore recommended to always use the Data Browser for that purpose.
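If in doubt about which terminators a file actually contains, counting them in the raw bytes settles the question. The following Python sketch is an illustration outside SpectX; the function name and the sample data are hypothetical.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF and bare CR terminators in raw bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF not preceded by CR
    cr = data.count(b"\r") - crlf   # CR not followed by LF
    return {"CRLF": crlf, "LF": lf, "CR": cr}


unix_sample = b"a;b\nc;d\n"
windows_sample = b"a;b\r\nc;d\r\n"
print(count_line_endings(unix_sample))     # {'CRLF': 0, 'LF': 2, 'CR': 0}
print(count_line_endings(windows_sample))  # {'CRLF': 2, 'LF': 0, 'CR': 0}
```

A file showing non-zero counts in more than one category has mixed terminators, which a pattern with a single fixed record separator will not fully match.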

When parsing timestamps, pay attention to how days, months and hours are written: are they one or two digits (i.e. padded with zero)? This is easy to misinterpret when you look at only a few examples with afternoon times or dates after the 9th of the month.
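A quick way to check this on a larger sample is to scan it for single-digit values. The Python sketch below is an illustration outside SpectX and assumes timestamps shaped like 'dd.MM.yy H:mm'; the function name and the third sample line are hypothetical. Note that seeing no single-digit values does not prove padding, it only means the sample contains no counterexample.

```python
import re

# Timestamps shaped like '31.10.16 15:13'; each numeric part is captured.
STAMP = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{2}) (\d{1,2}):(\d{2})\b")


def padding_report(lines):
    """Report, per field, whether any single-digit (unpadded) value was seen."""
    seen = {"day": False, "month": False, "hour": False}
    for line in lines:
        for day, month, _year, hour, _minute in STAMP.findall(line):
            seen["day"] |= len(day) == 1
            seen["month"] |= len(month) == 1
            seen["hour"] |= len(hour) == 1
    return seen


afternoon_only = [
    "aaaaaaaaa;31.10.16 15:13;141.76.45.35 [Germany];mail",
    "zxzxzx-33;21.12.16 18:26;94.186.122.214 [Latvia];www",
]
# Hypothetical extra record with a morning time and an early date.
with_morning = afternoon_only + ["someuser;1.2.17 9:05;10.0.0.1 [Nowhere];www"]

print(padding_report(afternoon_only))  # {'day': False, 'month': False, 'hour': False}
print(padding_report(with_morning))    # {'day': True, 'month': True, 'hour': True}
```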

Modifying Sequence Group matchers. Suppose we want to extract the username, timestamp, IP info (IP address and country) and type from the following CSV data:

Example 2.

aaaaaaaaa;31.10.16 15:13;141.76.45.35 [Germany];mail
zxzxzx-33;21.12.16 18:26;94.186.122.214 [Latvia];www

We could use the following pattern. CSV data is convenient to parse using a sequence group with a field separator:

(
 LD:user
 TIMESTAMP('dd.MM.yy H:mm'):date
 LD:ipInfo
 LD:type
)(fs=';')
EOL;

Results:

user       date                           ipInfo                   type
aaaaaaaaa  2016-10-31 17:13:00.000 +0200  141.76.45.35 [Germany]   mail
zxzxzx-33  2016-12-21 20:26:00.000 +0200  94.186.122.214 [Latvia]  www

Next we want to extract the IP address with the appropriate type from the ipInfo field (for instance, because we want to look up ASN info for it). If we simply put an IPV4 matcher and a space in front of LD:

Example 3.

(
 LD:user
 TIMESTAMP('dd.MM.yy H:mm'):date
 IPV4:ip ' ' LD       //ip-address + space + some stuff we do not care about
 LD:type
)(fs=';')
EOL;

then we can see it does not work: the Parse Preview window leaves all data uncolored, and executing Parse returns both rows as unmatched data.

The reason is that a sequence group expects a field separator after each matcher in the group. We replaced the LD:ipInfo matcher with three new matchers, so the sequence group expects field separators between each of them. To correct this, we group them in another sequence group, which makes them appear to the outer sequence group as one matcher. Note that since we are only interested in seeing the IP address in the query, we do not need to export the inner sequence group.

Example 4.

(
 LD:user
 TIMESTAMP('dd.MM.yy H:mm'):date
 (IPV4:ip ' ' LD)     //ip-address + space + some stuff we do not care about, as sequence group
 LD:type
)(fs=';')
EOL;

which produces the expected result:

user       date                           ip              type
aaaaaaaaa  2016-10-31 17:13:00.000 +0200  141.76.45.35    mail
zxzxzx-33  2016-12-21 20:26:00.000 +0200  94.186.122.214  www
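The field-separator rule behind this mistake can be mimicked in a few lines of Python. This is a loose analogy under stated assumptions, not how SpectX implements sequence groups: regular expressions stand in for the matchers, and the names GROUPED, UNGROUPED and parse_record are hypothetical.

```python
import re

IP = r"\d{1,3}(?:\.\d{1,3}){3}"

# Grouped variant: four matchers, one per ';'-separated field.
GROUPED = [
    re.compile(r"(?P<user>.+)"),
    re.compile(r"(?P<date>\d{2}\.\d{2}\.\d{2} \d{1,2}:\d{2})"),
    re.compile(rf"(?P<ip>{IP}) .*"),  # IP, space and the rest form ONE field
    re.compile(r"(?P<type>.+)"),
]

# Ungrouped variant: IP, space and the rest are three separate matchers,
# so six fields are expected where each record has only four.
UNGROUPED = GROUPED[:2] + [
    re.compile(rf"(?P<ip>{IP})"),
    re.compile(r" "),
    re.compile(r".*"),
] + GROUPED[3:]


def parse_record(line, matchers, fs=";"):
    """Apply one matcher per separated field; return exported names or None."""
    fields = line.split(fs)
    if len(fields) != len(matchers):
        return None  # a separator is expected where the data has none
    record = {}
    for matcher, field in zip(matchers, fields):
        match = matcher.fullmatch(field)
        if match is None:
            return None
        record.update(match.groupdict())
    return record


line = "aaaaaaaaa;31.10.16 15:13;141.76.45.35 [Germany];mail"
print(parse_record(line, UNGROUPED))  # None
print(parse_record(line, GROUPED))
```

The ungrouped variant fails before any field is examined, because the number of matchers no longer equals the number of separated fields. The grouped variant returns user, date, ip and type.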