Parsing Apache Access Logs

The Apache access log is a great example of a server application log where both the request and the response are stored in the same one-line record. It also has a few interesting fields: the first can contain either an IP address or a hostname (depending on whether hostname lookups have been configured). In the second and third fields a single hyphen signifies the absence of data (although the data itself is allowed to contain a hyphen too). And the record can contain a variable number of fields, depending on whether you’re looking at the Common or the Combined format. Let’s have a look at how we can write a pattern accommodating all these corner cases.

The first field may contain an IPv4 address, an IPv6 address or a hostname. The latter can contain letters, numbers, hyphens, underscores and dots. So an alternatives group with matchers for IPv4, IPv6 and a character group with the respective characters should do the job. Assigning each of them an export_name also places them neatly in separate columns of the resultset:

(IPV4:clientIpv4 | IPV6:clientIpv6 | [a-zA-Z0-9-_.]+:host)
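To see the same three-way alternative outside the pattern language, here is a minimal Python sketch. The function name classify_client and the use of the standard ipaddress module are assumptions made for illustration; they are not part of the pattern language, and the sketch only mimics how each alternative fills its own column while the others stay empty.

```python
import ipaddress
import re

# Same character set as the [a-zA-Z0-9-_.]+ character group in the pattern.
HOSTNAME = re.compile(r"[A-Za-z0-9_.-]+")


def classify_client(field):
    """Place the first log field into exactly one of three columns."""
    columns = {"clientIpv4": None, "clientIpv6": None, "host": None}
    try:
        version = ipaddress.ip_address(field).version
    except ValueError:
        version = None  # not an IP literal, so try the hostname alternative

    if version == 4:
        columns["clientIpv4"] = field
    elif version == 6:
        columns["clientIpv6"] = field
    elif HOSTNAME.fullmatch(field):
        columns["host"] = field
    else:
        raise ValueError(f"unparseable client field: {field!r}")
    return columns
```

As in the pattern, the order of the alternatives matters: the IP checks run first, and the permissive hostname character set is only the fallback.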

The second and third fields are identical from a parsing standpoint. Both can contain a string with almost unrestricted characters. However, since the fields are separated by a space, we can use the NSPACE keyword for matching them. But there are corner cases here too: when a field contains a single hyphen character, it signifies the lack of information. It would be really nice if we could replace that with NULL in the resultset. Well, the Alternatives Group seems to provide exactly what we need: using a Constant String matcher for the single hyphen without an export_name ('-') causes the parsing engine to recognize it but not to output it to the resultset. At the same time the alternative NSPACE matcher in the group gets assigned the value NULL when a single hyphen is encountered. Ah, and let’s not forget to prepend the fields with a matcher for the separator too:

' ' ('-' | NSPACE:ident)
' ' ('-' | NSPACE:auth)
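The hyphen-to-NULL trick can be sketched with a Python regular expression as well. This is only an analogy under stated assumptions: the names IDENT_AUTH and parse_ident_auth are hypothetical, and Python's None stands in for NULL. An unnamed '-' alternative consumes the hyphen, so the named group next to it stays unset.

```python
import re

# Each field: a leading space, then either a bare hyphen (not captured)
# or a run of non-space characters (captured under a name).
IDENT_AUTH = re.compile(r" (?:-|(?P<ident>\S+)) (?:-|(?P<auth>\S+))")


def parse_ident_auth(text):
    """Return ident and auth, with None where the log holds a lone hyphen."""
    match = IDENT_AUTH.fullmatch(text)
    if match is None:
        raise ValueError(f"unexpected ident/auth fields: {text!r}")
    return match.groupdict()
```

A value that merely starts with a hyphen, such as "-x", is still captured: the lone-hyphen alternative fails because no separator follows it, and the second alternative takes over.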

The next field is the timestamp, which by default is in the HTTPDATE form. If it is configured otherwise, just use TIMESTAMP with a suitable sdf configuration to match the timestamp format. For some reason the Apache folks have decided to enclose the timestamp in square brackets as well. Well, we can handle that easily:

' ' ('[' HTTPDATE:timestamp ']')
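For comparison, the bracketed timestamp can be handled in Python roughly as follows. The helper name parse_bracketed_httpdate is hypothetical, and the strptime format string is an assumption about the default Apache layout (day/month-name/year:time zone-offset); a differently configured log would need a different format.

```python
from datetime import datetime

# Assumed default layout, e.g. [10/Oct/2000:13:55:36 -0700].
# Note: %b matches month names of the current locale (English by default).
HTTPDATE_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_bracketed_httpdate(field):
    """Strip the square brackets and parse the timestamp between them."""
    if not (field.startswith("[") and field.endswith("]")):
        raise ValueError(f"timestamp is not enclosed in brackets: {field!r}")
    return datetime.strptime(field[1:-1], HTTPDATE_FORMAT)
```

The result is a timezone-aware datetime, so the offset recorded in the log is preserved.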

The next field is the request string received from the client, given in double quotes. We could use DQS to parse it as a whole string, but we also want to extract the request method, URI and HTTP version from it. First we match the opening double quote with a '"' (Constant String) matcher. Next is the HTTP method, which is always an uppercase word separated from the URI by a space. Let’s export it under the name ‘method’ in the resultset. Next is the URI, a string terminated by a space. Let’s use the DATA keyword matcher for this. Right after it comes the constant string ‘HTTP/’, so we might as well join the space and ‘HTTP/’ together and use a single Constant String matcher ' HTTP/' to match both. We don’t really need it in the resultset, so no Export Name is assigned either. After the forward slash comes the HTTP version number in the form of a floating point number, hence let’s use FLOAT for it. Conclude the whole thing with a Constant String matcher for the trailing double quote '"' and we’re done with this field:

' ' ('"' UPPER:method ' ' DATA:uri ' HTTP/' FLOAT:httpversion '"')
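The same decomposition of the request string can be sketched in Python. The names REQUEST and parse_request are hypothetical, and the lazy ".*?" is only an approximation of the DATA matcher: it grows the uri until the literal " HTTP/" and a version number can follow.

```python
import re

# Opening quote, uppercase method, space, uri, the constant ' HTTP/',
# a floating point version number, closing quote.
REQUEST = re.compile(
    r'"(?P<method>[A-Z]+) (?P<uri>.*?) HTTP/(?P<httpversion>\d+\.\d+)"'
)


def parse_request(field):
    """Split a quoted request string; return None if it has another shape."""
    match = REQUEST.fullmatch(field)
    if match is None:
        return None
    parts = match.groupdict()
    parts["httpversion"] = float(parts["httpversion"])  # FLOAT in the pattern
    return parts
```

Returning None for strings of another shape is a design choice of this sketch; requests that are not well-formed HTTP do show up in real logs, so a caller has to decide what to do with them.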

The next field is the server response code, formally called httpStatus. It is a three-digit code, so INTEGER suits it well:

' ' INTEGER:httpStatus

The last field in the Apache Common format is the size of the response returned to the client. It can be quite big, so let’s use LONG for that one. However, when no content was returned to the client, the field contains a single hyphen. Seems familiar? Yes, let’s use the same approach as with the ident and auth fields above:

' ' (LONG:bytes | '-')

The Apache Combined format additionally contains the Referer and User Agent fields. By using the optional_modifier we can transparently support parsing both formats: when the fields are absent (Common format) and when they are present (Combined format). Both fields are double-quoted strings. Just in case only one of them is present, let’s put its value into referrer (we can’t really tell which one is actually missing when this happens). Making the sequence of space and user agent string optional achieves just that:

(' ' DQS:referrer (' ' DQS:agent)?)?
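The nested optional groups translate almost literally into a regular expression. In this Python sketch the names TAIL and parse_tail are hypothetical, and the quoted strings are simplified: unlike a full double-quoted-string matcher, "[^"]*" does not handle escaped quotes inside the value.

```python
import re

# Outer group: optional referrer. Inner group: optional agent, which can
# only appear after a referrer. Both groups are absent in Common format.
TAIL = re.compile(r'(?: "(?P<referrer>[^"]*)"(?: "(?P<agent>[^"]*)")?)?')


def parse_tail(tail):
    """Parse whatever follows the bytes field on the log line."""
    match = TAIL.fullmatch(tail)
    if match is None:
        raise ValueError(f"unexpected trailing fields: {tail!r}")
    return match.groupdict()
```

When only one quoted string is present it lands in referrer, which mirrors the decision described above.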

And here the record ends, so all we need is to match a newline:

EOL

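And we’re done! Here’s how the complete pattern expression looks. It is a slightly hardened variant of the steps above: it uses a single IPADDR matcher for the client address, limits the length of the auth and uri fields, and captures requests that do not fit the expected form as invalidRequest: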

(IPADDR:clientIp | [! \n]+):host
' ' ('-' | NSPACE:ident)                          // Apache auth is vulnerable to the
' ' ('-' | (DATA{1,8096}:auth >>(' [' HTTPDATE))) // log poisoning attack via auth field
' ' '[' HTTPDATE:timestamp ']'
' ' (('\"' [A-Z-_]+:verb ' ' LD{0,8096}:uri ' HTTP/' FLOAT:httpversion '\"') | DQS:invalidRequest)
' ' INTEGER:response
' ' (LONG:bytes | '-')
(' ' DQS:referrer (' ' DQS:agent)?)?
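
To tie the pieces together, here is a rough Python counterpart of the whole record as a single regular expression. Everything in it is an assumption made for illustration: the names LINE and parse_line, the simplified \S+ matchers for host and auth (the pattern above is stricter about auth), and the two sample lines used to try it out. It is a sketch of the idea, not a replacement for the pattern.

```python
import re

LINE = re.compile(
    r'(?P<host>\S+)'                       # IP address or hostname
    r' (?:-|(?P<ident>\S+))'               # lone hyphen -> None
    r' (?:-|(?P<auth>\S+))'
    r' \[(?P<timestamp>[^\]]+)\]'          # kept as text in this sketch
    r' "(?P<verb>[A-Z_-]+) (?P<uri>.*?) HTTP/(?P<httpversion>\d+\.\d+)"'
    r' (?P<response>\d{3})'
    r' (?:-|(?P<bytes>\d+))'
    r'(?: "(?P<referrer>[^"]*)"(?: "(?P<agent>[^"]*)")?)?'  # Combined only
)


def parse_line(line):
    """Parse one Common or Combined record; return None if it does not fit."""
    match = LINE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        return None
    row = match.groupdict()
    row["httpversion"] = float(row["httpversion"])
    row["response"] = int(row["response"])
    if row["bytes"] is not None:
        row["bytes"] = int(row["bytes"])
    return row


common = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326\n')
combined = common.rstrip("\n") + (
    ' "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"\n')

for sample in (common, combined):
    print(parse_line(sample))
```

Both sample lines go through the same expression: the Common line leaves referrer and agent empty, the Combined line fills them.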