Using Views

Introduction

The main purpose of views in SpectX is to provide an abstraction layer for defining the virtual structure of data and its location(s). This has many benefits. Once we've captured data location and pattern somewhere else, our main analytics query scripts become easier to read. Also, different queries can re-use the same views, just like in relational databases.

But there’s more. Separating virtual structure and location from analytics is also a way to separate roles for data analytics and management. In an enterprise environment, the collection of logs is often the responsibility of a dedicated team. If they can manage data location behind the views, we can spare analysts from that task altogether.

Another example of role separation is allowing analysts to see only obfuscated fields (for example, replacing the last octet of an ip-address with X) or a restricted set of fields (for example, omitting credit card numbers). Or consider preventing system overload caused by carelessly launched queries on large amounts of data: a view can be initialized with a default period limitation.
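For instance, here is a minimal sketch of the restricted-fields case, reusing the Apache access log data from Example 1 below; the view simply never exposes the sensitive field:

$pattern = $[/user/examples/patterns/apache_access.sxp];

PARSE(pattern:$pattern, src:'sx:/user/examples/data/apache_access.log.sx.gz')
 .select(timestamp, uri, response)   //clientIp deliberately omitted from the view's output
;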

Last but not least, views provide a way for optimizing query processing.

So, what is a view? Technically, it is a tuple stream which can be referred to by a name. Perhaps surprisingly, this can be materialized as simply as a saved query script with an execution statement:

  • When a query is saved to a file, we can use its name as a reference.
  • Within the query script, you need to define the location(s) and virtual structure of the data - in most cases using PARSE and LIST commands.
  • Make sure the script ends with an execution statement that will produce an output tuple stream.

Example 1. A naive example of a view to Apache access log

//the pattern defines the virtual structure of our data

$pattern = $[/user/examples/patterns/apache_access.sxp];

//the PARSE command performs fetching, extracting and transforming the source data into a tuple stream

PARSE(pattern:$pattern, src:'sx:/user/examples/data/apache_access.log.sx.gz');

Querying this view then becomes a simple select statement:

@[/user/examples/views/my_webserver_access_logs.sx]
 .select(timestamp, clientIp, uri, response)
 .limit(10)
;

/* the same as the above in SQL syntax:

SELECT timestamp, clientIp, uri, response
FROM @[/user/examples/views/my_webserver_access_logs.sx]
LIMIT 10
;

*/

Input Arguments

A view can have input arguments (parameters). These must be declared in the init function:

INIT'(' param_name(':'value | '::'type) [, ...] ')'

where:
    param_name - name of the parameter. Must begin with a letter or an underscore (the latter is reserved for built-in system variables);
    subsequent characters can be letters, numbers or underscores.

    value - the initial value of the parameter.

    type - type of a parameter.

When a parameter is declared using a type, its initial value must be set by an input argument of the calling script, i.e. the parameter becomes a mandatory input argument.

When a parameter is declared using a value, it is an optional input argument: when not present in the input argument list, it is initialized to the declared value; otherwise it is set to the value specified by the input argument.

The argument values can be specified in the literal notation or using casting functions.
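For instance, a timestamp argument can be supplied either with the TIMESTAMP('2016-03-14 00:00:00 +0000') casting function or its shorthand T(...), while an ip-address can be written in literal notation such as ip:172.16.4.11 (see Example 4 below).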

Example 4. Let's create a view on an Apache access log to return records newer than a specified time and coming from a specified ip-address. We want the ip-address to always be specified when the view is called, while the starting time defaults to some point in the past:

init(
  ip::IPADDR,                                     //mandatory input argument (uninitialized)
  since:TIMESTAMP('2016-03-14 00:00:00 +0000')    //optional input argument (initialized to a default value)
);

$pattern = $[/shared/patterns/apache.sxp];
@list    = LIST('sx:/user/examples/data/apache_access.log.sx.gz');
@stream  = PARSE(pattern:$pattern, src:@list);

@stream
 .filter(clientIpv4 = $ip AND timestamp >= $since)
;

When calling this view, we must supply the ip-address and, optionally, also the beginning time:

SELECT * FROM @[/user/examples/doc/user_manual/views/example4.sx](ip:172.16.4.11, since:T('2016-03-15 21:00:00 +0200'));
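Or, the equivalent in the pipe syntax (a sketch):

@[/user/examples/doc/user_manual/views/example4.sx](ip:172.16.4.11, since:T('2016-03-15 21:00:00 +0200'))
 .select(*)
;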

Optimizing Queries

Time Partitioned Data

In real life, the volume of stored data can be large, but only on very rare occasions is it necessary to execute queries over the entire history. Hence most queries could be sped up if a full scan of all the source files could be avoided. SpectX can make use of the natural time-based partitioning of logs [1] to look only at the files relevant to the timeframe specified by the query's where or filter clauses.

There are two places where SpectX can find the partitioning time:

  1. the timestamp of a file's last modification. SpectX captures this during the LIST operation in the last_modified field.
  2. time embedded in the path or filename (usually done at log rotation). This may be captured by using time patterns to describe the format of the time string, which LIST will use to evaluate the path_time field, as illustrated below.
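For instance, a listing uri such as 'sx:/var/log/app/$yyyy$-$MM$-$dd$.log' (a hypothetical path) would make LIST evaluate the date embedded in each file name into the path_time field; Example 2 below uses exactly this mechanism.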

In order to activate partitioned access, you need to apply a filter (or WHERE) clause to the output of the LIST command with a comparison between a timestamp value and last_modified or path_time. As a result, the PARSE command will only include the files matching the specified criteria.

NB! The last_modified and path_time values are points in time referring to the partition period, which means that we need to use the same periods in the comparison expression too. However, the relation of those timestamps to the period of events can vary a lot. For instance, the last_modified timestamp can refer to the file rotation time. It is quite usual that daily rotation of logs is done in the early morning hours (say, 1 am), hence such a file will contain records from the previous day and also some from the current day. Note that not all of the previous day's events are there, only those since the last rotation. Hence the comparison expression needs a period starting two days earlier than the last_modified timestamp.
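As a sketch of this last_modified case (assuming $begin_time and $end_time input arguments like those in Example 2 below, and a hypothetical listing uri):

@list = LIST('sx:/user/examples/data/logs/production/auth/*/*.log')
 .filter(
    last_modified >= $begin_time            //the newest events in a file end around last_modified
    AND last_modified[-2 day] <= $end_time  //the oldest ones may start up to two days earlier
 );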

With daily rotation, the path_time is often set to the date that most of the events are from (i.e. the previous date). In most cases the time of day is omitted, hence path_time gets evaluated to 00:00:00 on that date. Therefore our comparison expression needs a period starting exactly from path_time and lasting until the next date (i.e. +24 hours).
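In filter terms (a sketch, again assuming $begin_time and $end_time input arguments; Example 2 below uses exactly this form):

.filter(path_time >= $begin_time AND path_time[+24 hour] <= $end_time)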

Of course, it must be noted that the relation of path_time to the actual period of events depends entirely on how the file and/or path naming is implemented. Hence the comparison expression must be defined by those who know this. Note that the comparison can be made transparent to other users of the view: they can pass their chosen period using the view's input arguments.

By initializing input arguments to sensible default values, you can prevent wasting system resources on queries launched by inexperienced users.

Example 2. Let's use the hypothetical authentication service logs in the SpectX resource tree /user/examples/data/logs/production/auth/ to demonstrate optimized access to the source data files. The logs are rotated at 1 am each day and the name of each file contains the previous date (that's where most of the events are from). Here's how it works:

  1. First, the begin_time and end_time variables get initialized in the init() function.
  2. Next, the LIST command gets executed with a uri containing a time pattern, returning the metadata of the actual files. Note that the time values from the path are parsed and transformed into the path_time field of the listing. The filter command takes the output of the listing and excludes all records that fall outside the filtering criteria.
  3. Next, the PARSE command is executed. It takes the filtered output of the @list stream as its input for the locations of the source data files. During execution, the files get retrieved from their locations and parsed according to the specified pattern (which we neatly reference from a separate file).
/*
 In the init() block we provide 2 timestamp parameters limiting the scope of the view.
 Make sure to assign sensible default values to them.
 In this example we would like to limit the default view to the last 360 days' worth of data.
*/
init(
  begin_time:now()[-360 day], // begin_time holds the default beginning of the data period
  end_time:now()              // and end_time the default end of the period
);

/* use LIST cmd to perform listing. */
@list = LIST(
  src:{
  	uri:'sx:/user/examples/data/logs/production/auth/$yyyy$/$MM$$dd$.srv*.*.log',
  	tz:'GMT'
  })
/* Applying the filter cmd to the output of LIST, we exclude the files irrelevant to the scope of our
   view (avoiding a full scan). Our logs are rotated at 1 am each day and the name of each file contains
   the previous date (that's where most of the events are). Hence a file contains events roughly from
   path_time until path_time + 24 hours.
*/
.filter(
   path_time >= $begin_time             // events in a file start no earlier than path_time 00:00:00
   AND path_time[+24 hour] <= $end_time // and end roughly by path_time + 24 hours
)
;

/* use PARSE cmd to apply virtual structure to data and produce stream of typed data */
PARSE(
  pattern:$[/user/examples/doc/user_manual/views/auth_log.sxp],
  src:@list
);

When executing this example the speedup is not noticeable, as there are only a few files. In real life, with hundreds of thousands or even millions of files, the gain is huge.

Example 3. Selecting from the view.

/*
 Select failed login events from the first ten days of January 2016
*/

@[/user/examples/doc/user_manual/views/example2.sx](begin_time:T('2016-01-01 00:00:00'), end_time:T('2016-01-10 00:00:00'))
 .filter(response != 200)
;
[1] Partitioned access. The good news is that logs are fundamentally related to time: log records almost always include the creation time. More importantly, records are usually stored in batches, created by the process known as file rotation. This common practice in log management results in data being partitioned by time, where each resulting file (or blob in cloud storage) represents a partition. Considering that time-based queries are predominant in log analytics, looking only at the relevant partitions when processing large volumes of data (hundreds of gigabytes, terabytes or more) is a very natural approach to take. The best part is that it comes with no extra effort: the partitioning is already there and ready to be used!

Custom Partitioned Data

Source data partitioning is not limited to time; it can be done using any relevant parameter. For instance, let's consider an example from the automotive industry, where various data is collected from autonomous driving cars on a regular basis: operating system events, sensor data, etc. It gets recorded in a number of differently named logs under directories identifying the car id and session:

└── logs/
    └── cars/
        ├── 0A0/
        │   ├── 2016-08-01-18-19-17_c38a054-a232-4620-2fdc7a8e8c44/
        │   │   ├── max_temperature_sensor.log
        │   │   ├── front_impact_sensor.log
        │   │   ├── shutdown.log
        │   │   └── ...
        │   ├── 2016-08-01-22-08-22_aeca3187-2bf3-bac2-67109856b26b/
        │   │   ├── ssh.log
        │   │   ├── error.log
        │   │   └── ...
        │   └── ...
        ├── 0C72/
        │   ├── 2017-04-30-05-49-04_133f89a7-d894-505e-1eb7e6367425/
        │   │   ├── 4g-link-stats.log
        │   │   ├── accel-sensor.log
        │   │   └── ...
        │   └── 2017-05-02-03-01-59_1645aab3-b1d8-e4c6-1a568b988413/
        │       └── ...
        └── ...

Suppose that the car id is not included in the records of any of the logs, i.e. the only place it appears is the path of the log file. When we need to analyze the acceleration of a few selected cars, we can just select the _uri field, extract the car id and apply a filter:

$pattern = $[/user/patterns/accel-sensor-log.sxp];

//list accel-sensor.log from all cars, all sessions
@list = LIST('s3://<bucket>/logs/cars/*/*/accel-sensor.log');

@src = PARSE(src:@list, pattern:$pattern);

@src
  .select(_uri, *)                            //select the hidden _uri field and all other fields
  .select(SUBSTR(_uri, 19, 24) as car_id, *)  //extract car_id from _uri (the offsets depend on the actual bucket name)
  .filter(car_id = '...')                     //filter by some particular car_id
;

However, when there is a large number of log files stored (many sessions per car over several years), listing all sessions of all cars may become very slow. We could considerably speed up the script above if the uri contained the desired car id directly. This is actually very easy to do by composing the uri dynamically, using a car_id supplied via the view's input argument. Using a time pattern, the query can be optimized even further:

init(
 car_id::STRING,          //always required: car_id must be supplied by the caller
 from:now()[-30 day]      //limit the default time scope to the last 30 days
);

$pattern = $[/user/patterns/accel-sensor-log.sxp];

/* compose uri dynamically by concatenating different parts: */
$getUri() =
 's3://<bucket>/logs/' | $car_id | '/$yyyy$-$MM$-$dd$-$HH$-$mm$-$ss$_*/accel-sensor.log';

@list = LIST(src:{
                  uri:$getUri(),     //use the getUri() function above to compose the actual uri
                  timezone: 'UTC'
                 }
)
.filter(path_time >= $from)          //use path_time, evaluated from the time pattern, to skip older sessions
;

PARSE(pattern:$pattern, src:@list);
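When calling this view, we must always supply the car id and may optionally override the starting time. Assuming the script above is saved as cars_accel.sx (a hypothetical path):

@[/user/examples/views/cars_accel.sx](car_id:'0C72', from:T('2017-04-01 00:00:00 +0000'))
 .limit(10)
;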