Apache Access Log

The Apache access log in combined format is an example of a log with a single record type, where each record occupies one line. A record consists of a fixed number of fields in a fixed order, separated by single space characters:

  1. IP-address or hostname
  2. remote logname (from identd) or a -
  3. remote user or a -
  4. request time in HTTPDATE format, enclosed in square brackets
  5. double-quoted string with the first line of the request
  6. HTTP status code (a number)
  7. size of the response in bytes (a number) or a - (in case of 0 bytes returned)
  8. double-quoted string with request Referer header
  9. double-quoted string with request User-Agent header

Example:

185.130.5.146 - - [15/Mar/2016:09:06:31 +0200] "HEAD / HTTP/1.1" 200 224 "-" "curl/7.38.0"
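To make the field layout concrete, here is a minimal sketch in Python (not SpectX code) that splits the example line into its nine raw fields with a regular expression. The name COMBINED is illustrative only, and the expression assumes that the remote logname and remote user contain no spaces.

```python
import re

# One capture group per field, in the order listed above.
COMBINED = re.compile(
    r'(\S+) '                # 1. IP address or hostname
    r'(\S+) '                # 2. remote logname or -
    r'(\S+) '                # 3. remote user or - (assumed to contain no spaces)
    r'\[([^\]]+)\] '         # 4. request time inside square brackets
    r'"((?:[^"\\]|\\.)*)" '  # 5. first line of the request
    r'(\d{3}) '              # 6. HTTP status code
    r'(\d+|-) '              # 7. response size in bytes or -
    r'"((?:[^"\\]|\\.)*)" '  # 8. Referer header
    r'"((?:[^"\\]|\\.)*)"$'  # 9. User-Agent header
)

LINE = ('185.130.5.146 - - [15/Mar/2016:09:06:31 +0200] '
        '"HEAD / HTTP/1.1" 200 224 "-" "curl/7.38.0"')

fields = COMBINED.match(LINE).groups()
for number, value in enumerate(fields, start=1):
    print(number, value)
```

All values come back as plain strings here; turning them into typed fields is a separate step.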

Hint

You can find a sample log file by navigating in the Input Data Browser to s3s://spectx-docs/formats/log/apache/apache_access.log.sx.gz

Parse

The pattern for parsing such a record consists of matchers for each field, in the same order:

(IPADDR:client_ip | NSPACE:host)
' '
('-' | NSPACE:remote_logname)
' ' ('-' | NSPACE:remote_user)
' ' '[' HTTPDATE:req_time ']'
' ' DQS:request
' ' INTEGER:response
' ' ('-' | LONG:bytes)
' ' DQS:referrer
' ' DQS:user_agent
EOL

where each numbered item describes the corresponding line of the pattern:

  1. IPADDR and NSPACE (matching a sequence of non-space characters) are placed in an alternatives group. This creates two fields in the resultset: client_ip of type IPADDR and host of type STRING.
  2. The field separator. It does not create a field in the resultset, since it is not exported.
  3. - and NSPACE are placed in an alternatives group. This creates the STRING type field remote_logname in the resultset, which has the value NULL if the log field contains -.
  4. The field separator is followed by a pattern extracting the remote_user field in the same way as the previous field.
  5. The field separator is followed by '[' (matching the opening square bracket). HTTPDATE then extracts the timestamp to the req_time field, and ']' matches the closing square bracket.
  6. The field separator is followed by DQS, which extracts the content of the double-quoted string (the first line of the request) to the field request.
  7. The field separator is followed by INTEGER, which extracts the HTTP status code to the INTEGER type field response.
  8. The field separator is followed by - and LONG placed in an alternatives group. This creates the LONG type field bytes in the resultset, which has the value NULL if the log field contains -.
  9. The field separator is followed by DQS, which extracts the content of the double-quoted string (the Referer header) to the field referrer.
  10. The field separator is followed by DQS, which extracts the content of the double-quoted string (the User-Agent header) to the field user_agent.
  11. EOL matches the end of the line.

Parsing the example line above with this pattern results in:

client_ip      host  remote_logname  remote_user  req_time                       request          response  bytes  referrer  user_agent
185.130.5.146  NULL  NULL            NULL         2016-03-15 07:06:31.000 +0000  HEAD / HTTP/1.1  200       224    -         curl/7.38.0
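Two conversions are visible in the result above: the - placeholders of remote_logname and remote_user become NULL, and the request time is normalized to UTC. The following Python sketch (not SpectX code) reproduces both under simple assumptions. The helper names null_if_dash and parse_httpdate are hypothetical, and the %b month parsing assumes an English or C locale.

```python
from datetime import datetime, timezone


def null_if_dash(value):
    """Return None for the '-' placeholder, mirroring the ('-' | ...) alternatives."""
    return None if value == "-" else value


def parse_httpdate(value):
    """Parse a time such as '15/Mar/2016:09:06:31 +0200' and normalize it to UTC."""
    local = datetime.strptime(value, "%d/%b/%Y:%H:%M:%S %z")
    return local.astimezone(timezone.utc)


req_time = parse_httpdate("15/Mar/2016:09:06:31 +0200")
print(null_if_dash("-"))     # None
print(req_time.isoformat())  # 2016-03-15T07:06:31+00:00
```

The referrer keeps its literal - in the result because DQS extracts the quoted content as is; only fields written as an alternatives group with '-' yield NULL.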

The pattern above works only for logs in Apache combined format. A more practical pattern would also parse the Apache common format (combined differs from common only by the added referrer and user-agent fields). We might also want to extract the elements of the request field:

    (IPADDR:clientIp | [! \n]+):host
' ' ('-' | NSPACE:ident)
' ' ('-' | (DATA{1,8096}:auth >>(' [' HTTPDATE)))
' ' '[' HTTPDATE:timestamp ']'
' ' (('\"' [A-Z-_]+:verb ' ' LD{0,8096}:uri ' HTTP/' FLOAT:httpversion '\"') | DQS:invalidRequest)
' ' INTEGER:response
' ' (LONG:bytes | '-')
(' ' DQS:referrer (' ' DQS:agent)?)?
EOL
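The same idea can be sketched in Python (not SpectX code) to show how the optional trailing fields and the request fallback behave. The regular expressions below only approximate the pattern above. LOG_LINE, REQUEST and parse are illustrative names, and the common-format sample line is a made-up example.

```python
import re

# Referrer and agent are optional, so both common and combined lines match.
LOG_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<auth>.+?) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" '
    r'(?P<response>\d+) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>(?:[^"\\]|\\.)*)"(?: "(?P<agent>(?:[^"\\]|\\.)*)")?)?$'
)

# Elements of a well-formed request line: verb, URI and HTTP version.
REQUEST = re.compile(
    r'(?P<verb>[A-Z_-]+) (?P<uri>.*) HTTP/(?P<httpversion>\d+(?:\.\d+)?)$'
)


def parse(line):
    """Return a dict of fields, or None if the line does not match at all."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    record = m.groupdict()
    raw_request = record.pop("request")
    req = REQUEST.match(raw_request)
    if req:
        record.update(req.groupdict())
    else:
        # Fallback, analogous to DQS:invalidRequest in the pattern above.
        record["invalidRequest"] = raw_request
    return record


combined = parse('185.130.5.146 - - [15/Mar/2016:09:06:31 +0200] '
                 '"HEAD / HTTP/1.1" 200 224 "-" "curl/7.38.0"')
common = parse('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
               '"GET /apache_pb.gif HTTP/1.0" 200 2326')
bad = parse('10.0.0.1 - - [15/Mar/2016:09:06:31 +0200] "-" 400 226')

print(combined["verb"], combined["uri"], combined["agent"])
print(common["auth"], common["referrer"], common["agent"])
print(bad["invalidRequest"])
```

For the common-format line the referrer and agent values are None, and for the malformed request only invalidRequest is set.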

Query

Let’s find the top 5 countries from which requests are made:

LIST(src:'s3s://spectx-docs/formats/log/apache/apache_access.log.sx.gz')
| parse(pattern:FETCH('https://raw.githubusercontent.com/spectx/resources/master/examples/patterns/apache/apache.sxp'))
| select(CC(clientIp), cnt:count(*))
| group(@1)
| sort(cnt DESC)
| limit(5)
;

where line:

  1. Lists the source data file.
  2. Retrieves the content of the source data file and parses it according to the specified pattern.
  3. Computes the country code from the IP address (using the function CC) and an aggregated count.
  4. Groups the resultset by the first field in the stream.
  5. Sorts the resultset by the field cnt in descending order.
  6. Limits the resultset to 5 rows.

The query returns the following result (the cc value of the first row is empty):

cc   cnt
     147
CN   44
US   5
JP   2
DE   1
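The select, group, sort and limit steps of the query can be sketched in Python (not SpectX code) as follows. COUNTRY_BY_IP is a hypothetical stand-in for a real geolocation lookup such as the CC function, and the addresses and their country codes are invented for illustration.

```python
from collections import Counter

# Hypothetical lookup table; a real implementation would query GeoIP data.
COUNTRY_BY_IP = {
    "203.0.113.7": "CN",
    "198.51.100.9": "US",
    "192.0.2.44": "JP",
}


def top_countries(client_ips, limit=5):
    """Count requests per country code and return the most frequent ones."""
    # select + group: map each address to a country code (None if unknown) and count.
    counts = Counter(COUNTRY_BY_IP.get(ip) for ip in client_ips)
    # sort by count in descending order + limit.
    return counts.most_common(limit)


ips = ["203.0.113.7", "203.0.113.7", "198.51.100.9",
       "10.0.0.1", "10.0.0.1", "10.0.0.1", "192.0.2.44"]
print(top_countries(ips, limit=2))  # [(None, 3), ('CN', 2)]
```

Addresses without a known country are counted under None in this sketch.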

Hint

You can download the full code of the pattern and query at https://github.com/spectx/resources/tree/master/examples/patterns/apache