Six Quick SpectX Queries

Directly reading and analyzing raw log files from various sources like AWS, Azure, Hadoop or an on-prem log server makes SpectX a perfect tool for ad-hoc log forensics, looking for the unknown and exploring data to discover what’s there. Here are six quick queries to copy-paste into the query window and run it on compressed sample logs we’ve stored in S3.

Here’s what a raw line of our sample data looks like: it’s a straightforward access log record, containing a timestamp, client IP, bytes transmitted and some JSON capturing headers:

2019-01-01 00:36:46  65.132.59.34  /robots.txt  200  304  {"User_Agent":"Gigabot","Referer":" ","Accept":"*/*","Accept_Encoding":"gzip,deflate,br","Accept_Language":"en-US","Connection":"keep-alive"}

1. Query multiple (compressed) files within a stream

The sample data is divided to:

  1. log files containing records from 2019 and
  2. an archive directory with files going back to 2015.

All these files can be queried simultaneously by separating URIs with commas and closing them in square brackets.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
$pattern = <<<EOP
    TIMESTAMP:timestamp '\t'    // time, followed by TAB, name it timestamp
    IPADDR:c_ip         '\t'    // ip-address, followed by TAB, name it c_ip
    LD{0,20000}:uri     '\t'    // match everything until a TAB character, name it uri
    INT:response        '\t'    // integer, name it response (HTTP response code)
    LONG:bytes          '\t'    // long, name it bytes (number of bytes returned)
    JSON:headers        EOL     // JSON containing all request headers, followed by end-of-line (LF)
    ;
EOP;

LIST(['s3s://spectx-docs/logs/custom/tracking_access-2019.log.pi.gz',
      's3s://spectx-docs/logs/custom/archive/tracking_access-2015.log.pi.gz'])
| parse(pattern:$pattern)
;

Hint

#1: Replace ‘2015’ with a wildcard (asterisk) and include all the files in that directory to the query scope (like in the JSON query below).

Hint

#2: click on the ‘Follow’ button to cache the results of this query to your machine and continue analysing the results from there.

2. Unnest JSON. E.g. What are the top 10 unique keys?

Here’s a query for taking a closer look at the headers and counting the top 10 unique keys in JSON.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
$pattern = <<<EOP
    TIMESTAMP:timestamp '\t'    // time, followed by TAB, name it timestamp
    IPADDR:c_ip         '\t'    // ip-address, followed by TAB, name it c_ip
    LD{0,20000}:uri     '\t'    // match everything until a TAB character, name it uri
    INT:response        '\t'    // integer, name it response (HTTP response code)
    LONG:bytes          '\t'    // long, name it bytes (number of bytes returned)
    JSON:headers        EOL     // JSON containing all request headers, followed by end-of-line (LF)
    ;
EOP;

LIST(['s3s://spectx-docs/logs/custom/tracking_access-2019.log.pi.gz',
    's3s://spectx-docs/logs/custom/archive/tracking_access-*.log.pi.gz'])
| parse(pattern:$pattern)
| select(hdr:UNNEST(headers))      //expand all KVPs from JSON
| select(keys:hdr[key], count(*))  //extract only keys and count
| group(keys)                      //group to find all unique keys
| sort(count DESC)
| limit(10)
;

3. Query and parse data via http(s)

This query fetches the multiline records published by Torproject.org and parses them into a neat table of timestamps and TOR exit node IPs.

1
2
3
4
5
6
7
8
9
$tor_list_pattern = <<<EOP
    ('ExitNode ' LD:exitNode EOL
     'Published ' TIMESTAMP('yyyy-MM-dd HH:mm:ss'):published EOL
     'LastStatus ' TIMESTAMP('yyyy-MM-dd HH:mm:ss'):lastStatus EOL
    )?
    'ExitAddress ' IPV4:exitAddress ' ' TIMESTAMP('yyyy-MM-dd HH:mm:ss'):addrTime EOL
EOP;

LIST('https://check.torproject.org/exit-addresses') | parse(pattern:$tor_list_pattern);

4. Join logs from different sources

Did any requests in the logs originate from the TOR network? Let’s find out by joining an Apache access log file with current Tor exit nodes (see the query above):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
$apache_pattern = <<<EOP
    (IPADDR:clientIp | [! \n]+):host
    ' ' ('-' | NSPACE:ident)                          // Apache auth is vulnerable to the
    ' ' ('-' | (DATA{1,8096}:auth >>(' [' HTTPDATE))) // log poisoning attach via auth field
    ' ' '[' HTTPDATE:timestamp ']'
    ' ' (('\"' [A-Z-_]+:verb ' ' LD{0,8096}:uri ' HTTP/' FLOAT:httpversion '\"') | DQS:invalidRequest)
    ' ' INTEGER:response
    ' ' (LONG:bytes | '-')
    (' ' DQS:referrer (' ' DQS:agent)?)?
    EOL
EOP;

$tor_list_pattern = <<<EOP
    ('ExitNode ' LD:exitNode EOL
     'Published ' TIMESTAMP('yyyy-MM-dd HH:mm:ss'):published EOL
     'LastStatus ' TIMESTAMP('yyyy-MM-dd HH:mm:ss'):lastStatus EOL
    )?
    'ExitAddress ' IPV4:exitAddress ' ' TIMESTAMP('yyyy-MM-dd HH:mm:ss'):addrTime EOL
EOP;

@tor_list = LIST('https://check.torproject.org/exit-addresses') | parse(pattern:$tor_list_pattern);

LIST('s3s://spectx-docs/formats/log/apache/apache_access.log.sx.gz')
| parse(pattern:$apache_pattern)
| join(@tor_list on clientIp = exitAddress)
;

5. Do conditional aggregations

E.g. count the number of total requests and failed requests per unique IP:

1
2
3
4
5
6
LIST('s3s://spectx-docs/formats/log/apache/apache_access.log.sx.gz')
| parse(pattern:"IPADDR:clientIp LD '[' HTTPDATE:timestamp '] ' DQS ' ' INT:response LD EOL;")
| select(clientIp, ok:count(response = 200), total:count(*))
| group(clientIp)
| sort(total DESC)
;

6. Search for specific strings within certain time periods

This query first filters out a time period (from August to October 2019) and then another negative filter: give me all rows that do not contain ‘access.log’ in the URI.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
$pattern = <<<EOP
    TIMESTAMP:timestamp '\t'    // time, followed by TAB, name it timestamp
    IPADDR:c_ip         '\t'    // ip-address, followed by TAB, name it c_ip
    LD{0,20000}:uri     '\t'    // match everything until a TAB character, name it uri
    INT:response        '\t'    // integer, name it response (HTTP response code)
    LONG:bytes          '\t'    // long, name it bytes (number of bytes returned)
    JSON:headers        EOL     // JSON containing all request headers, followed by end-of-line (LF)
    ;
EOP;

LIST(['s3s://spectx-docs/logs/custom/tracking_access-2019.log.pi.gz'])
| parse(pattern:$pattern)
| filter(timestamp > T('2019-01-08 06:52:22.000 +0200') and timestamp < T('2019-01-10 06:52:22.000 +0200'))
| filter(uri NOT CONTAINS 'access.log')
| sort(timestamp)
;

Hint

right-click on a field in the query results to add a filter to that value.