Input Data Formats

Analyzing unstructured data is more difficult than text-based search because the data elements contained in the text are not directly machine-readable. Why? First, many logs do not declare the types of their data elements, and second, different logs organize them in different ways. Consequently, we need to extract and transform the data elements so that the query engine can execute queries and functions on them.

The difficulty lies in the wide variety of ways in which data elements in logs are structured. This is largely unregulated territory, with only a handful of widely known standards (such as BSD Syslog or the Apache access log). Quite often the logs of the same application differ depending on its configuration (for example, the Apache common vs. custom log format), the version of the application (see the discussion of Syslog Repeated Messages), or even the hardware the application runs on (iptables logs with one network interface vs. multiple interfaces).

Logs in machine-readable formats (such as JSON or XML) are of course much easier to read. However, even these formats allow many corner cases, which often result in reading errors and consequent data loss.
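To make this concrete, here is a minimal Python sketch (not part of SpectX; the input lines are made up for illustration) showing two realistic JSON corner cases, a truncated record and a trailing comma, and how a reader can count the resulting data loss instead of crashing or silently dropping records:

```python
import json

# Hypothetical JSON-lines input: the second record is truncated and the
# third has a trailing comma -- both are invalid JSON.
lines = [
    '{"event": "login", "status": 200}',
    '{"event": "login", "status": ',        # truncated record
    '{"event": "logout", "status": 200,}',  # trailing comma
]

records, errors = [], 0
for line in lines:
    try:
        records.append(json.loads(line))
    except json.JSONDecodeError:
        errors += 1  # count the loss instead of dropping it silently

print(len(records), errors)  # 1 readable record, 2 lost to corner cases
```

Robust readers must decide what to do with such lines; merely skipping them is exactly the silent data loss mentioned above.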

To sum it up: there is no such thing as a log with a structure that stays consistent across time and originators. Despite all this mess, there are still typical structural patterns that many real-life logs use. The chapters in this section walk you through these typical cases with example logs, explanations of how to parse them, and patterns ready for immediate use and adaptation:

  1. Single event on single line, based on Apache access log example.
  2. Single event on multiple lines, based on Java debug log example.
  3. Data elements in key-value pairs, based on FortiGate Traffic Log example.
  4. Data elements in array, based on Nginx Custom Access Log example.
  5. Multi-event log, based on OpenLDAP Access Log example.
  6. Multi-application log, based on the example of Linux Syslog.
  7. JSON formatted log, based on JSONified Windows Event Log example.
  8. Various cases of CSV formatted data.

Unrecognized Data

When you do exploratory data analytics (typically when investigating computer security incidents), you often encounter data unfamiliar to you. In many such cases, you do not need to extract all the data elements and cover all the corner cases to verify the hypothesis at hand. It is quite enough to extract only those relevant to the current query and move on with your task.

Example

Suppose we have the task of investigating a malfunction of a hypothetical login service. At this point, we do not yet know whether it is related to malicious activity or caused by some system malfunction. So the first thing we need to do is create a baseline of the request rate. This involves parsing an application log with a custom format:

2016-03-02 07:17:44    3.154.157.245  52c6d118b68fe  200    2.25   POST   /v1/login  payload:{"name":"levavo", "client_version":"1.1(9)", "p":"10100000000000005", "q":["ecf8427e","c0541c19","65a7eb6d"], "r":"null", "f":"TR"}

Most of the fields (separated by TAB characters) are completely unknown to us. However, for the task at hand we only need the date and time, IP address, and response fields (the first, second, and fourth, respectively). We can ignore all the rest:

$pt = <<<PPP
   TIMESTAMP:time
   '\t' IPADDR:ip
   '\t' LD          //ignore third field
   '\t' INT:response
   LD EOL           //ignore everything until the end of line
PPP;
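For readers who want to check the extraction logic outside SpectX, the same three-field pick can be sketched in plain Python (a hedged illustration only; the log line below is shortened, and fields beyond the fourth are ignored just as LD ignores them in the pattern above):

```python
from datetime import datetime

# Shortened stand-in for the TAB-separated log line shown earlier.
line = ('2016-03-02 07:17:44\t3.154.157.245\t52c6d118b68fe\t200\t2.25\t'
        'POST\t/v1/login\tpayload:{}')

# Split on TAB and keep only the first, second and fourth fields.
fields = line.split('\t')
time = datetime.strptime(fields[0], '%Y-%m-%d %H:%M:%S')
ip = fields[1]
response = int(fields[3])

print(time, ip, response)  # 2016-03-02 07:17:44 3.154.157.245 200
```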

Using the extracted time and response fields, we can immediately write a query producing the request rate:

| PARSE(pattern:$pt, src:'s3s://spectx-docs/formats/log/custom-app/login-service.log.sx.gz')
| filter(time > NOW()[-3 day])
| select(time[1 h], ok:count(response=200), rejected:count(response!=200)) | group(@1)

where line:

  1. retrieves the data file pointed to by the src URI and parses it using the pattern stored in the $pt variable.
  2. includes only records from the last three days.
  3. computes hourly counts of successful and rejected requests.
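The filter-and-bucket logic of steps 2 and 3 can be sketched in Python for readers who want to verify it outside SpectX (the timestamps and responses below are made up for illustration; the fixed "now" stands in for NOW()):

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical parsed records: (timestamp, response) pairs.
records = [
    (datetime(2016, 3, 2, 7, 17), 200),
    (datetime(2016, 3, 2, 7, 44), 200),
    (datetime(2016, 3, 2, 7, 59), 401),
    (datetime(2016, 3, 2, 8, 3), 200),
]

now = datetime(2016, 3, 2, 12, 0)  # stand-in for NOW()
ok, rejected = Counter(), Counter()
for time, response in records:
    if time <= now - timedelta(days=3):
        continue  # filter(time > NOW()[-3 day])
    hour = time.replace(minute=0, second=0, microsecond=0)  # time[1 h]
    if response == 200:
        ok[hour] += 1   # count(response=200)
    else:
        rejected[hour] += 1  # count(response!=200)

for hour in sorted(set(ok) | set(rejected)):
    print(hour, ok[hour], rejected[hour])
```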
[Figure: req_rate.png — hourly counts of successful vs. rejected requests]

Do you notice the increase in rejected requests just before the successful ones start to decrease? It looks like it is worth digging deeper in that direction.

Hint

You can find the sample log file by navigating with the Input Data Browser to s3s://spectx-docs/formats/log/custom-app/login-service.log.sx.gz

Download the patterns and queries from https://github.com/spectx/resources/tree/master/examples/patterns/custom-app/

See also

More examples and tutorials on parsing different data structures can be found in the Pattern Development Guide.