Parsing Multiline Records

Multiline records are fairly common in logs. A typical example is an application error log that includes an exception with its stack trace. For instance, a log record could consist of a timestamp, a severity level, and a message that may span multiple lines:

2015.10.03 16:32:50     INFO    connecting to db ...
2015.10.03 16:32:51     ERROR -- SQLTimeSeries remote fetch failed
org.postgresql.util.PSQLException: Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
        at org.postgresql.core.ConnectionFactory.openConnection(
        at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(
        at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(
        at org.postgresql.jdbc3g.AbstractJdbc3gConnection.<init>(
        at org.postgresql.jdbc4.AbstractJdbc4Connection.<init>(
        at org.postgresql.jdbc4.Jdbc4Connection.<init>(
        at org.postgresql.Driver.makeConnection(
        at org.postgresql.Driver.connect(
        at java.sql.DriverManager.getConnection(
        at java.sql.DriverManager.getConnection(
        at org.logicalcobwebs.proxool.Prototyper.buildConnection(
        at org.logicalcobwebs.proxool.ConnectionPool.getConnection(
        at org.logicalcobwebs.proxool.ProxoolDriver.connect(
        at java.sql.DriverManager.getConnection(
        at java.sql.DriverManager.getConnection(
Caused by: Connection refused
        at Method)
        at org.postgresql.core.PGStream.<init>(
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(
        ... 23 more
2015.10.03 16:32:54     DEBUG   closing socket
2015.10.03 16:32:56     INFO    connecting to db ...

It turns out that parsing this seemingly complex message is very simple. We can use the wildcard matcher DATA, which captures any characters until the next non-wildcard matcher matches; in our case, that is the record separator.

From the sample, we can see that a record ends when the next line begins with a timestamp (i.e., the beginning of the next record). Hence, the matcher that follows DATA (signaling it to stop capturing) must be a sequence group: EOL followed by TIMESTAMP. The only problem is that the latter must not consume the timestamp, because it already belongs to the next record. This can be done by applying the look-ahead modifier, which performs the match but does not advance the current position.

But hang on, the last record in the file is not followed by the timestamp of a next one, is it? Indeed, the parser will encounter the EOF (end-of-file) marker instead, so we also need to cover this case in our stopping expression. That is easy with an Alternatives Group.

Here’s how it looks:

$hdr =                                  //record header:
TIMESTAMP('yyyy.MM.dd HH:mm:ss'):time   //timestamp
SPACE                                   //followed by one or more spaces
UPPER:level                             //followed by uppercase log level

$hdr                                    //each record begins with header
DATA:message                            //and is followed by message with one or more lines
(EOL (>>$hdr | EOF))                    //until we see the EOL followed by header or end of file
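
For readers more familiar with regular expressions, the same stopping rule can be sketched in Python. This is only an analogy, not the pattern language used above: the names HDR, HDR_PLAIN, RECORD and parse_records are made up for illustration, the lazy `.*?` stands in for DATA, and `(?=...)` plays the role of the look-ahead. As an extra assumption, the sketch also accepts a final record that has no trailing line break.

```python
import re

# Plain header pattern, reused inside the look-ahead (no named groups,
# because Python does not allow a group name to be defined twice).
HDR_PLAIN = r"\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}\s+[A-Z]+"

# Header with named captures, mirroring $hdr (time, level).
HDR = r"(?P<time>\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2})\s+(?P<level>[A-Z]+)"

# Header, then a lazy multiline message that stops at either
#   - an EOL followed by the next header (look-ahead, not consumed), or
#   - an optional final EOL at the end of the input.
RECORD = re.compile(
    HDR + r"(?P<message>.*?)(?:\r?\n(?=" + HDR_PLAIN + r")|\r?\n?\Z)",
    re.DOTALL,
)

def parse_records(text):
    """Return one dict per record; the message keeps its inner line breaks."""
    return [
        {
            "time": m.group("time"),
            "level": m.group("level"),
            "message": m.group("message").strip(),
        }
        for m in RECORD.finditer(text)
    ]
```

Because the header inside the look-ahead is not consumed, the next search starts exactly at the following record's timestamp, which is the same reason the look-ahead modifier is needed in the pattern above.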

See more examples of conditional matching here.

How Much Of Multiline Data Can I Capture?

You can capture strings up to 16 million characters long. Just specify your desired maximum length in the quantifier:

DATA{,256000}:message   // capture data up to 256000 characters
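
As a rough regex analogy (Python, not the pattern language itself), a bounded quantifier behaves like the sketch below. The helper name capture_capped is hypothetical. The text above does not say what the parser does when a message exceeds the limit; in this Python sketch the match simply fails.

```python
import re

def capture_capped(text, limit):
    """Capture the whole text as a message only if it fits within `limit` characters."""
    # `.{,N}` matches between 0 and N characters; DOTALL lets `.` span line breaks.
    pattern = re.compile(r"(?P<message>.{,%d})\Z" % limit, re.DOTALL)
    m = pattern.match(text)
    return m.group("message") if m else None
```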