Parsing Apache Access Log

The Apache access log is a great example of a server application log in which both the request and the response are stored in the same one-line record. It also has a few interesting fields. The first can contain either an IP address or a hostname (depending on whether hostname lookups have been configured). In the second and third fields a single hyphen signifies the absence of data (and the data itself is allowed to contain a hyphen too). Finally, the record can contain a variable number of fields depending on whether you’re looking at the Common or the Combined format. Let’s look at how we can write a pattern accommodating all these corner cases.

The first field may contain an IPv4 address, an IPv6 address or a hostname. A hostname can contain letters, numbers, hyphens, underscores and dots. So an alternatives group with IPADDR (which works for both IPv4 and IPv6) and a character group with the respective characters should do the job. Assigning each of them an export name also places them neatly in separate columns of the resultset:

(IPADDR:clientIp | [a-zA-Z0-9-_.]+:host)
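For readers more used to general-purpose code, the same "IP address or hostname" decision can be sketched in Python. This is only an illustration of the alternatives idea, not the pattern language itself; the function name parse_client and the returned tuple layout are assumptions made for this example.

```python
import ipaddress
import re

# Hostname characters as listed in the text: letters, digits, hyphen, underscore, dot.
HOST_RE = re.compile(r"[A-Za-z0-9_.-]+")


def parse_client(token):
    """Return (client_ip, host); exactly one of the two is not None."""
    try:
        # ip_address() accepts both IPv4 and IPv6 textual forms.
        return str(ipaddress.ip_address(token)), None
    except ValueError:
        pass  # not an IP address, fall through to the hostname alternative
    if HOST_RE.fullmatch(token):
        return None, token
    raise ValueError(f"not an IP address or hostname: {token!r}")


print(parse_client("192.168.1.10"))
print(parse_client("web-01.example.com"))
```

As in the pattern, the IP alternative is tried first, so a token that is a valid address never ends up in the host column.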

The second and third fields are identical from the parsing standpoint. Both can contain a string with almost unrestricted characters. However, since the fields are separated by a space, we can use the NSPACE keyword to match them. There is a corner case here too: a field containing a single hyphen signifies a lack of information. It would be really nice if we could replace that with NULL in the resultset. Well, the Alternatives Group provides exactly what we need: a Literal Expressions matcher for a single hyphen without an export name ('-') causes the parsing engine to recognize the hyphen but not to output it to the resultset. At the same time, the alternative NSPACE matcher in the group gets assigned the value NULL when a single hyphen is encountered. Ah, and let’s not forget to prepend each field with a matcher for the separator too:

' ' ('-' | NSPACE:ident)
' ' ('-' | NSPACE:auth)
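The "unnamed hyphen alternative yields NULL" trick has a close analogue in ordinary regular expressions: a named group that did not take part in the match is reported as None. The sketch below uses Python's re module as a stand-in; the regex and the function name parse_ident_auth are assumptions for illustration only.

```python
import re

# The bare "-" alternative carries no name, so when it matches,
# the named group in the other alternative stays unset (None).
IDENT_AUTH_RE = re.compile(r"(?:-|(?P<ident>\S+)) (?:-|(?P<auth>\S+)) ")


def parse_ident_auth(text):
    """Parse 'ident auth ' from the start of text; a lone hyphen becomes None."""
    m = IDENT_AUTH_RE.match(text)
    if m is None:
        raise ValueError(f"cannot parse ident/auth from {text!r}")
    return m.group("ident"), m.group("auth")


print(parse_ident_auth("- frank [10/Oct/2000:13:55:36 -0700]"))
print(parse_ident_auth("- - [10/Oct/2000:13:55:36 -0700]"))
```

A value that merely starts with a hyphen, such as "-x", is still captured in full: the hyphen alternative fails because no space follows it, and the engine falls back to the named alternative.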

The next field is the timestamp, which happens to be in the HTTPDATE form. If it is configured otherwise, just use TIMESTAMP with a suitable format to match it. For some reason, the Apache folks have decided to enclose the timestamp in square brackets too. Well, we can handle that easily:

' ' ('[' HTTPDATE:timestamp ']')
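To make the timestamp layout concrete, here is a rough Python equivalent: strip the square brackets, then parse the remainder with a strptime format. The format string is an assumption based on Apache's default log timestamp (for example 10/Oct/2000:13:55:36 -0700), and the %b month abbreviation is locale-dependent in Python, so the sketch assumes an English or C locale.

```python
import re
from datetime import datetime, timedelta, timezone

# Literal brackets around the timestamp, mirroring '[' HTTPDATE ']'.
TS_RE = re.compile(r"\[(?P<ts>[^\]]+)\]")


def parse_timestamp(field):
    """Parse a bracketed Apache-style timestamp into an aware datetime."""
    m = TS_RE.fullmatch(field)
    if m is None:
        raise ValueError(f"timestamp is not enclosed in brackets: {field!r}")
    # Assumed layout: day/month-abbreviation/year:hour:minute:second offset
    return datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")


ts = parse_timestamp("[10/Oct/2000:13:55:36 -0700]")
print(ts.isoformat())
```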

The next field is the request string enclosed in double quotes. We could use DQS to parse it as a whole string, but we also want to extract the request method, URI and HTTP version from it. First, we match the opening double quote with a Literal Expressions matcher, '"'. Next is the HTTP method, which is always an uppercase word, separated from the URI by a space. Let’s export it under the name ‘method’ in the resultset. Next is the URI, a string terminated by a space. Let’s use the DATA keyword matcher for this. Right after that comes the constant string ‘HTTP/’, so we might as well join the separating space and the constant together and use the Literal Expressions matcher ' HTTP/' to match both. We don’t really need it in the resultset, so no export name is assigned. After the forward slash comes the HTTP version number in floating-point form, so let’s use FLOAT for it. Conclude the whole thing with a Literal Expressions matcher for the trailing double quote, '"', and we’re done with this field:

' ' ('"' UPPER:method ' ' DATA:uri ' HTTP/' FLOAT:httpversion '"')
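The structure of the request field can also be sketched as a Python regular expression. This is an approximation for illustration, not a translation of the pattern language: the lazy `.+?` plays the role of DATA by stopping at the first place where ' HTTP/' and a version number can follow, and the function name parse_request is made up for the example.

```python
import re

REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+)'           # opening quote and uppercase method
    r' (?P<uri>.+?)'                 # URI: shortest run that lets the rest match
    r' HTTP/(?P<httpversion>\d+\.\d+)"'  # unnamed constant, then the version
)


def parse_request(field):
    """Split a quoted request string into (method, uri, http_version)."""
    m = REQUEST_RE.fullmatch(field)
    if m is None:
        raise ValueError(f"not a well-formed request string: {field!r}")
    return m.group("method"), m.group("uri"), float(m.group("httpversion"))


print(parse_request('"GET /apache_pb.gif HTTP/1.0"'))
```

Because ' HTTP/' is matched as one constant, a URI that itself contains a space is still captured whole as long as the version suffix follows it.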

The next field is the server response code, formally called httpStatus. It is a three-digit code, so INTEGER suits it well:

' ' INTEGER:httpStatus

The last field in the Apache Common format is the size of the response returned to the client. It can be quite big, so let’s use LONG for that one. However, when no content was returned to the client, the field contains a single hyphen. Seem familiar? Yes, let’s use the same approach as with the ident and auth fields above:

' ' (LONG:bytes | '-')
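The status and size fields can be illustrated together in Python. The sketch assumes a three-digit status code and treats a lone hyphen in the size position as a missing value; the regex and the name parse_status_bytes are illustrative only. Python integers are unbounded, so no separate "long" type is needed here.

```python
import re

# Leading space is the field separator; the "-" alternative is unnamed,
# so the bytes group stays None when no content was returned.
STATUS_BYTES_RE = re.compile(r" (?P<status>\d{3}) (?:(?P<bytes>\d+)|-)")


def parse_status_bytes(text):
    """Return (status, size) where size is None for a lone hyphen."""
    m = STATUS_BYTES_RE.match(text)
    if m is None:
        raise ValueError(f"cannot parse status/bytes from {text!r}")
    size = m.group("bytes")
    return int(m.group("status")), (int(size) if size is not None else None)


print(parse_status_bytes(" 200 2326"))
print(parse_status_bytes(" 304 -"))
```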

The Apache Combined format additionally contains the Referer and User-Agent fields. By using the optional modifier we can transparently parse both formats: records where these fields are absent (Apache Common format) and records where they are present (Apache Combined format). Both fields are double-quoted strings. In case only one of them is present, let’s assign its value to the referrer (we can’t really tell which one is actually missing when this happens). Making the sequence of a space and the user agent string optional achieves just that:

(' ' DQS:referrer (' ' DQS:agent)?)?
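The nested optional groups behave the same way in a plain regular expression, which makes the three possible outcomes easy to try out. The sketch below is a Python approximation; the DQS fragment (a double-quoted string with backslash escapes) and the name parse_tail are assumptions for illustration.

```python
import re

# Assumed shape of a double-quoted string: anything except quote/backslash,
# or a backslash followed by any character.
DQS = r'"(?:[^"\\]|\\.)*"'

# Outer group optional (Common format), inner group optional (one field only).
TAIL_RE = re.compile(rf'(?: (?P<referrer>{DQS})(?: (?P<agent>{DQS}))?)?')


def parse_tail(text):
    """Return (referrer, agent) with surrounding quotes removed, or None."""
    m = TAIL_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"unexpected trailing fields: {text!r}")

    def unquote(value):
        return value[1:-1] if value is not None else None

    return unquote(m.group("referrer")), unquote(m.group("agent"))


print(parse_tail(""))
print(parse_tail(' "http://www.example.com/start.html"'))
print(parse_tail(' "http://www.example.com/start.html" "Mozilla/4.08"'))
```

With a single quoted string present, the value lands in the referrer slot and the agent stays empty, which matches the choice described above.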

And here the record ends, so all we need is to match a newline:

EOL

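And we’re done! Here’s the complete pattern expression. Note that it names some fields differently from the walkthrough above (verb and response instead of method and httpStatus) and adds a fallback, invalidRequest, for malformed request strings: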

(IPADDR:clientIp | [! \n]+):host
' ' ('-' | NSPACE:ident)                          // Apache auth is vulnerable to the
' ' ('-' | (DATA{1,8096}:auth >>(' [' HTTPDATE))) // log poisoning attack via auth field
' ' '[' HTTPDATE:timestamp ']'
' ' (('\"' [A-Z-_]+:verb ' ' LD{0,8096}:uri ' HTTP/' FLOAT:httpversion '\"') | DQS:invalidRequest)
' ' INTEGER:response
' ' (LONG:bytes | '-')
(' ' DQS:referrer (' ' DQS:agent)?)?
EOL
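For comparison, the overall record layout can be approximated with a single Python regular expression. This is a rough sketch rather than an equivalent of the pattern above: it does not validate IP addresses, does not reproduce the lookahead guard on the auth field, and leaves the timestamp as text. The group names follow the final pattern, except that client stands in for both clientIp and host; LINE_RE and parse_line are hypothetical names.

```python
import re

LINE_RE = re.compile(
    r'(?P<client>\S+)'                      # IP address or hostname
    r' (?:-|(?P<ident>\S+))'                # lone hyphen -> None
    r' (?:-|(?P<auth>\S+))'
    r' \[(?P<timestamp>[^\]]+)\]'
    r' "(?:(?P<verb>[A-Z_-]+) (?P<uri>.*?) HTTP/(?P<httpversion>\d+\.\d+)'
    r'|(?P<invalidRequest>[^"]*))"'         # fallback for malformed requests
    r' (?P<response>\d{3})'
    r' (?:(?P<bytes>\d+)|-)'
    r'(?: "(?P<referrer>[^"]*)"(?: "(?P<agent>[^"]*)")?)?'  # Combined only
)


def parse_line(line):
    """Return a dict of fields for one log record, or None if it does not match."""
    m = LINE_RE.fullmatch(line.rstrip("\n"))
    return m.groupdict() if m else None


common = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
combined = common + ' "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'

for record in (common, combined):
    fields = parse_line(record)
    print(fields["verb"], fields["uri"], fields["response"], fields["agent"])
```

Both the Common and the Combined sample records go through the same expression; the only difference is whether referrer and agent come back as None.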