SQL Style

Note: Keywords are not case sensitive. In this document clauses such as SELECT are capitalized for illustration purposes.

SELECT

SELECT
    {  *
     | alias.*
     | '@'position
     | function
     | expression
     | literal
     [ AS alias ]
    }
    [, ... ]

The SELECT clause specifies comma separated list of literals, functions, names of fields of record stream and expressions to be computed. The list can contain wildcard symbol to select multiple fields. The fields can be referred also by position.

Each expression can be optionally assigned an alias by a space followed by identifier. Keyword AS must be added between expression and identifier. Aliases defined in a SELECT clause can be referenced in the GROUP BY, HAVING and ORDER BY clauses of the query. Unlike standard sql the aliases can be referenced also within SELECT and from WHERE clauses.

Example 1.

1
2
3
4
5
6
7
8
SELECT
  i                     // select integer value
  ,@2 as myLong         // select long value by position, assign alias
  ,i + myLong AS mySum  // add int and long.
FROM dual(100)          // generate 100 records using dual
WHERE myLong > 10       // notice reference by alias from WHERE and SELECT scope
ORDER BY mySum
LIMIT 10;

FROM

FROM clause specifies the record stream source to be queried:

FROM from_args [ AS alias ]

from_args:
    {   record_stream_create_cmd
      | '@'stream_ref
      | '@' '[' results_file ']'
      | '@' '[' script_file ']'
      | '(' subquery ')'
    }

where:

  • record_stream_create_cmd is one of the record stream create commands: LIST, PARSE, SOURCE, DUAL, GREP, VALUE
  • stream_ref is variable referring to record stream
  • results_file is the filename of stored record stream result, enclosed in square brackets. The filename may include absolute or relative path.
  • script_file is the filename of query script, enclosed in square brackets. The filename may include absolute or relative path.
  • subquery is another query enclosed in parenthesis
  • alias - you can assign the source an alias, which can be used a a reference within the query statement. Keyword AS is mandatory.

Note that stream_ref, results_file and script_file are references returning record stream and therefore must have type assigned.

Example 2. Selecting from record stream reference:

1
2
3
4
5
6
@intStr = SELECT i FROM dual(10);

@intStr
 .filter(i > 5)     //commands can be separated by dots or
 | limit(2)         //pipes
;

Example 3. Selecting from view:

1
2
SELECT * FROM @[/user/examples/views/my_webserver_access_logs.sx]
LIMIT 100;

Union

To combine the rows of two datasets into one resultset (sql UNION ALL) specify the desired record stream sources separated by commas to FROM clause

',' from_args [ ,... ]

Example 4.

1
2
3
4
5
6
7
8
@s1 = VALUES(i:1, d:2.2, t:NOW());  //create a stream using VALUES cmd

SELECT
    i, s, ip
FROM
    dual(20,10)
    ,@s1                            //union output of dual with stream @s1
;

JOIN

JOIN {'@'stream_ref | '@' '[' results_file ']' | '@' '[' script_file ']'} [AS alias] ON join_condition

join_condition:
field_from_left_side_data = field_from_right_side_data [ {AND | OR} boolean_expr ,... ]

JOIN operation combines the columns of record stream declared in FROM clause with the one in JOIN clause.

SpectX supports multiple JOIN operators in FROM clause. They are always executed pairwise starting from first pair of source datasets read from right to left, top to bottom. Subsequent JOIN operations use the results of the previous JOIN operation as the left input. Any columns in resultset from previous joins can be used in join_condition in the ON clauses of subsequent JOIN operator.

Note that assigning alias is optional:

  • when it is omitted then default aliases are assigned to left and right streams: left and right respectively.
  • fields with matching names will be renamed by prepending left_ or right_ to field name respectively to originating stream
  • fields with different names remain the same in resultset

join_condition is a boolean returning expression that specifies which rows in a join are considered to match. It must contain at least one expression specifying join key field_from left_side_data = field_from_right_side_data which may be extended by logical AND or OR clauses with any boolean returning expressions (including other join key expressions).

Example 5:

1
2
3
4
5
6
7
8
@ipInfo = SELECT ip, cc(ip + 19.54.33.154) FROM dual(0,100);

SELECT i, t, ip FROM dual(100)                //neither first or second streams in join have aliases defined
JOIN @ipInfo ON left.ip = right.ip            //hence they get assigned default aliases
                  AND left.t < now()[+20 ms]  //you can use additional join conditions too
                                              //note third joined stream having alias defined
JOIN @[/user/examples/doc/query_lang/query_reference/pipe_style/example4.sx] AS sumInfo ON left.i = sumInfo.i
;

WHERE

WHERE boolean_expr [ ,... ]

boolean_expr:
  {   expression_returning_boolean_value
    | boolean_expr AND [NOT] boolean_expr
    | boolean_expr OR [NOT] boolean_expr
    | NOT boolean_expr
  }

WHERE filters the records produced by FROM using expression(s) that evaluates to boolean. Multiple expressions can be combined using boolean AND and OR operators (which in turn can optionally enclosed in parenthesis).

Example 6.

1
2
3
SELECT ip, s
FROM dual(16909060,10000)
WHERE cc(ip)='US' OR cc(ip)='CA';

GROUP BY

GROUP BY { field_name | '@'position | alias | expression [ gsort_opt ] } [, ... ]

gsort_opt:
        {   SORT { field_name | '@'position | alias | expression } [ASC | DESC]
          | NOSORT
        }

The GROUP BY clause allows you to group rows that have the same values for a given field or set of fields so that you can compute aggregations of related fields. In other words: GROUP BY defines the scope of aggregation. Basically you have three options for defining scope:

  1. group by all non-aggregated fields (in select clause) will yield computing aggregations for all unique values of combinations of grouped fields
SELECT a, b, count(c) FROM @record_stream GROUP by a, b
  1. group by some of the non-aggregated fields (in select clause) will yield computing aggregations for all unique values of combinations of grouped fields. The output values of non-aggregated fields in resultset are undetermined (i.e the ordering of these fields are not determined).
SELECT a, b, count(c) FROM @record_stream GROUP by a
  1. when GROUP BY is omitted the resultset will contain one computed aggregation for the whole selected dataset (which may be restricted by WHERE clause). The output values of non-aggregated fields in resultset are undetermined.
SELECT count(*) FROM @record_stream

Example 7. Getting top 5 of ip-addresses generating access denied requests:

1
2
3
4
5
6
7
8
SELECT
    clientIp, count(*) AS cnt
FROM
    @[/user/examples/views/my_webserver_access_logs.sx]
WHERE response == 404
GROUP BY clientIp
ORDER BY cnt DESC
LIMIT 5

Example 8. Computing top 5 of ip-address user-agent pairs:

1
2
3
4
5
6
7
8
SELECT
    clientIp, agent, count(*) AS cnt
FROM
    @[/user/examples/views/my_webserver_access_logs.sx]
WHERE response != 404
GROUP BY @1, @2             // grouping fields referred by positions
ORDER BY cnt DESC
LIMIT 5;

Example 9. Omitting GROUP BY will yield to count(*) being performed over all selected dataset:

1
2
3
4
5
SELECT
    clientIp, agent, count(*) AS cnt
FROM
    @[/user/examples/views/my_webserver_access_logs.sx]
WHERE response != 404;

GROUP BY has its own private sorting option, allowing to control the behaviour of sorting which forms the basis of grouping. By default the dataset is sorted by group key(s) in its natural or increasing order. Then the aggregation function is applied. When group key changes, then the aggregate value for that key is emitted and aggregate function is reset. The process continues until the end of sorted dataset.

With group sort the sort key is applied independently of group key. While group key remains the same the dataset gets sorted by sort key instead. The aggregates are still computed looking at group key in sorted dataset: when it changes then computed aggregate value is emitted and aggrate function is reset. This allows you to compute aggregates on completely different set of groups which sometimes becomes very useful. Take for instance computing aggregates on noncontiguous ranges.

Example 10. Computing ipv4 address ranges of US requires applying MIN(), MAX() aggregate functions over ip addresses groups of ip country code. With default sorting we would get just the beginning of lowest and the end of highest of ranges in US:

1
2
3
4
5
6
7
8
SELECT
	ipCC:cc(ip)             //compute ipv4 country code
	,min(ip)                //begin of range
	,max(ip)                //end of range
FROM 
	dual(16909050, 86231040)//evaluate over a space of first 10 US ip ranges
GROUP BY ipCC               //compute min max for each ip country
HAVING ipCC='US';
ipCC min max
US 1.2.3.0 220.232.59.132

However, when we change group sort to NOSORT (i.e leaving dataset unsorted) we will get all 20051 ipv4 ranges of US, since the addresses in dataset are in increasing order. Assuming that ranges are contiguous the group key changes appear at the borders of ranges and we’ll get min, max computed exactly there:

Example 11.

1
2
3
4
5
6
7
8
SELECT
	ipCC:cc(ip)             //compute ipv4 country code
	,min(ip)                //begin of range
	,max(ip)                //end of range
FROM 
	dual(16909050, 86231040)//evaluate over a space of first 10 US ip ranges
GROUP BY ipCC NOSORT        //compute min max on unsorted dataset
HAVING ipCC='US';
ipCC min max
US 1.2.3.0 1.2.3.255
US 3.0.0.0 4.16.47.255
US 4.16.56.0 4.16.145.255
   

HAVING

HAVING boolean_expr

The HAVING clause filters also the records just like WHERE does except that it gets executed after all the expressions have been computed. It is useful for filtering by computed aggregates which are not visible in WHERE clause.

Example 12:

1
2
3
4
5
6
7
8
SELECT
    clientIp, agent, count(*) AS cnt
FROM
    @[/user/examples/views/my_webserver_access_logs.sx]
WHERE response != 404
GROUP BY clientIp, agent
HAVING cnt > 2
ORDER BY cnt DESC;

ORDER BY

ORDER BY { field_name | '@'position | alias | expression } [ ASC | DESC ] [ ,... ]

The ORDER BY clause sorts resultset of a query in ascending or descending order by one or more key columns. Key columns may be referred either by name, alias or expression. To sort by multiple keys separate them by commas and the list is processed from left to right. Use ASC (ascending) or DESC (descending) to specify the sort direction. ASC is the default. Each sort key can have its own direction assigned.

The ORDER BY is executed after all expressions in SELECT clause have been evaluated therefore all aliases and expressions can be referenced from there.

See examples above.

LIMIT

LIMIT [ offset, ] { rowcount | numeric_expression }

The LIMIT clause limits the rows in the resultset by number of rowcount rows. You can optionally also specify offset which causes specified number of rows from beginning of result set being omitted.

The LIMIT clause is the last to be evaluated. This means that the rows remaining in the resultset after limiting depends on the ordering. I.e to avoid non-deterministic rows in resultset use ORDER BY operator.

Note that LIMIT clause does not always reduce processing time, for instance computing aggregated values require still computing over whole requested dataset.

See examples above.

Saving resultset

To save the resultset you’ll need to pipe the output of sql style statement to save command in piped command style.

Example 13.

1
2
3
SELECT * FROM DUAL(10)
 .save('/user/saved_example.sxt')   //output from dual gets piped to save command
;

Grammar

SELECT {  * | alias.* | '@'position | function | expression | literal [ AS alias ] } [, ... ]
[ FROM from_args
  [ ',' ... ]
  [ JOIN join_args ]
  [ WHERE boolean_expr ]
  [ GROUP BY group_args ]
  [ HAVING boolean_expr ]
  [ ORDER BY { field_name | '@'position | alias | expression } [ ASC | DESC ] [ ,... ] ]
  [ LIMIT [ offset, ] { rowcount | numeric_expression } ]
];

from_args:
    { record_stream_create_cmd  | '@'stream_ref |
    '@' '[' results_file ']' | '@' '[' script_file ']' | '(' subquery ')'
    } [ AS alias ]

join_args:
    {'@'stream_ref | '@' '[' results_file ']' | '@' '[' script_file ']'} [AS alias] ON join_condition

    join_condition:
        field_from_left_side_data = field_from_right_side_data [ {AND | OR} boolean_expr ,... ]

group_args:
    { field_name | '@'position | alias | expression [ gsort_opt ] } [, ... ]

    gsort_opt:
            {   SORT { field_name | '@'position | alias | expression } [ASC | DESC]
              | NOSORT
            }

boolean_expr:
  {  expression_returning_boolean_value
   | boolean_expr AND [NOT] boolean_expr
   | boolean_expr OR [NOT] boolean_expr
   | NOT boolean_expr
  }