Data Access URI Syntax

The location of source data is described using URI syntax.

Example 1:

sa://mylogstore/archive/apache/access.log

where:

  • sa:// is the URI scheme for the Source Agent protocol
  • mylogstore is the URI authority part, referring to the DataStore named mylogstore
  • /archive/apache/access.log is the URI path

Example 2:

ssh://192.168.1.100/data/logs/apache/access.log

where:

  • ssh:// is the URI scheme for the SSHv2 protocol
  • 192.168.1.100 is the URI authority part, specifying the target IP address
  • /data/logs/apache/access.log is the URI path
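
As a rough illustration of how the two example URIs break down into scheme, authority and path, the following Python sketch uses the standard urllib.parse module. It is not how SpectX itself parses URIs; the describe helper is a hypothetical name that only mirrors the decomposition described above.

```python
from urllib.parse import urlsplit


def describe(uri):
    """Split a Data Access URI into scheme, authority and path."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,      # e.g. "sa" or "ssh"
        "authority": parts.netloc,   # DataStore name or host address
        "path": parts.path,          # location of the file or blob
    }


print(describe("sa://mylogstore/archive/apache/access.log"))
print(describe("ssh://192.168.1.100/data/logs/apache/access.log"))
```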

SpectX resolves URIs using a staged approach:

  1. Resolving the URI scheme. SpectX tries to find the protocol provider that implements the scheme specified in the URI. If no provider is found, an error is thrown.
  2. Resolving the authority part of the URI. First, SpectX tries to match the authority part to any of the defined DataStore names. If a match is found, it takes connection parameters from the DataStore and passes them to the protocol provider. If necessary parameters are not found, an error is thrown.
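
The two stages above can be sketched in Python as follows. This is only a conceptual model: the PROVIDERS and DATASTORES registries, the REQUIRED_PARAMS tuple and the resolve function are hypothetical, and the fallback of using an unmatched authority directly as the target host is an assumption based on Example 2, not documented behaviour.

```python
from urllib.parse import urlsplit

# Hypothetical registries; names, fields and values are illustrative only.
PROVIDERS = {"sa": "SourceAgentProvider", "ssh": "SshProvider"}
DATASTORES = {"mylogstore": {"host": "10.0.0.5", "port": 8389}}
REQUIRED_PARAMS = ("host",)


def resolve(uri):
    parts = urlsplit(uri)

    # Stage 1: the scheme selects a protocol provider.
    provider = PROVIDERS.get(parts.scheme)
    if provider is None:
        raise LookupError("no protocol provider for scheme %r" % parts.scheme)

    # Stage 2: the authority is first matched against DataStore names.
    params = DATASTORES.get(parts.netloc)
    if params is None:
        # Assumption: an unmatched authority is used as the target host.
        params = {"host": parts.netloc}

    missing = [key for key in REQUIRED_PARAMS if not params.get(key)]
    if missing:
        raise ValueError("missing connection parameters: %s" % missing)
    return provider, params, parts.path


print(resolve("sa://mylogstore/archive/apache/access.log"))
print(resolve("ssh://192.168.1.100/data/logs/apache/access.log"))
```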

GLOB Patterns

SpectX supports the GLOB pattern syntax for matching URIs to individual files/blobs (with the exception of the http and exec schemes). Any character that appears in a pattern matches itself, except the special pattern characters described below. The NUL character must not occur in a pattern. To match special pattern characters literally, they must be escaped with a preceding backslash.

Special pattern characters have the following meaning:

1. * (single star) - matches any string, including the null string.

Example 3: The following file pattern lists all files with the .log extension in the directory logs/auth/2015/ of the spectx-docs bucket in Amazon S3 storage:

LIST('s3s://spectx-docs/logs/auth/2015/*.log');

2. ** ... (multiple stars) - when placed between ‘/’ characters, matches directories and subdirectories up to a depth equal to the number of stars (i.e. a recursive match). When placed anywhere else (i.e. within a file name, preceded, followed or enclosed by any other characters), it is treated as a single star pattern.

Example 4: The following pattern searches up to 3 levels deep under the /logs/ directory of the spectx-docs bucket and retrieves all files ending with .srv04.v1.log:

LIST('s3s://spectx-docs/logs/***/*.srv04.v1.log');

3. ? (question mark) - matches any single character.

Example 5: The pattern s3s://spectx-docs/logs/auth/2016/010?.srv01.v1.log matches logs from server01 for the first days of January 2016 (file names beginning with 010 followed by any single character):

LIST('s3s://spectx-docs/logs/auth/2016/010?.srv01.v1.log');

4. [...] - matches any one of the enclosed characters. A pair of characters separated by a hyphen denotes a range; any character that falls between those two characters (inclusive) is matched. The UTF-8 collating sequence and character set are used. If the first character following [ is ! or ^, then any character not enclosed is matched. A - may be matched by including it as the first or last character in the set. A ] may be matched by including it as the first character in the set.

Example 6: The pattern s3s://spectx-docs/listing/glob/l2/[a-c]*.log matches files whose names begin with ‘a’, ‘b’ or ‘c’ and end with ‘.log’:

LIST('s3s://spectx-docs/listing/glob/l2/[a-c]*.log');
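
To get a feel for the *, ? and [...] patterns described so far, the following Python sketch applies comparable patterns to a few made-up file names using the standard fnmatch module. The file names and the select helper are hypothetical, and fnmatch is only an approximation of the SpectX matcher (for instance, it does not implement backslash escaping or the multiple-star depth rule).

```python
from fnmatch import fnmatchcase

# Made-up file names within a single directory.
NAMES = ["a1.log", "b-app.log", "d1.log", "c.txt"]


def select(pattern):
    """Return the names that match a case-sensitive glob pattern."""
    return [name for name in NAMES if fnmatchcase(name, pattern)]


print(select("*.log"))       # star: any string, including the empty one
print(select("?1.log"))      # question mark: exactly one character
print(select("[a-c]*.log"))  # range: first character is a, b or c
print(select("[!a-c]*"))     # negated set: first character is not a, b or c
```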

Within ‘[’ and ‘]’, character classes can be specified using the syntax [:class:]. Classes are defined in the POSIX standard:

POSIX Character Class   Description
[:alnum:]               Alphanumeric characters a-z; A-Z; 0-9
[:alpha:]               Alphabetic characters a-z; A-Z
[:ascii:]               All ASCII characters in the range 0x0 - 0x7F
[:blank:]               Space (0x20) and tab (0x9) characters
[:cntrl:]               Control characters in the ASCII range 0x1 - 0x1F; 0x7F
[:digit:]               Digits in the range 0-9
[:graph:]               Visible characters in the ASCII code range 0x21 - 0x7E
[:lower:]               Lowercase letters a-z
[:print:]               Printable characters in the ASCII code range 0x20 - 0x7E
[:punct:]               Punctuation and symbols !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[:space:]               All whitespace characters. In ASCII codes: 0x20; 0x9; 0xA; 0xB; 0xC; 0xD
[:upper:]               Uppercase letters A-Z
[:xdigit:]              Hexadecimal digits 0-9; a-f; A-F
[:word:]                Word characters: letters a-z; A-Z; numbers 0-9 and underscore (_)

Example 7: The pattern s3s://spectx-docs/listing/glob/l2/[a-c][[:digit:]].log matches files whose names begin with ‘a’, ‘b’ or ‘c’, followed by a single digit, and end with ‘.log’:

LIST('s3s://spectx-docs/listing/glob/l2/[a-c][[:digit:]].log');
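
One way to think about character classes is as shorthand for explicit ranges. The Python sketch below expands a handful of classes into ranges and then matches with the standard fnmatch module. The POSIX_CLASSES table and expand_classes helper are hypothetical and cover only a few classes; the substitution is purely textual, so it assumes the class names appear only inside brackets.

```python
from fnmatch import fnmatchcase

# Partial, illustrative mapping from class names to explicit ranges.
POSIX_CLASSES = {
    "[:digit:]": "0-9",
    "[:lower:]": "a-z",
    "[:upper:]": "A-Z",
    "[:alpha:]": "a-zA-Z",
    "[:alnum:]": "a-zA-Z0-9",
}


def expand_classes(pattern):
    """Replace [:class:] names with the ranges they stand for."""
    for name, chars in POSIX_CLASSES.items():
        pattern = pattern.replace(name, chars)
    return pattern


expanded = expand_classes("[a-c][[:digit:]].log")
print(expanded)
print(fnmatchcase("a1.log", expanded))   # letter a-c followed by one digit
print(fnmatchcase("a12.log", expanded))  # two digits do not match
```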

Note

SpectX does not support matching equivalence classes or collating symbols.

5. In the following description, a pattern-list is a list of one or more patterns separated by a ‘|’. Composite patterns can be formed using one or more of the following sub-patterns:

  • ?(pattern-list) Matches zero or one occurrence of the given patterns.
  • *(pattern-list) Matches zero or more occurrences of the given patterns.
  • +(pattern-list) Matches one or more occurrences of the given patterns.
  • @(pattern-list) Matches one of the given patterns.
  • !(pattern-list) Matches anything except one of the given patterns.

Example 8:

// LIST('s3s://spectx-docs/listing/glob/l1/a?(b|x)c');
// matches:
// abc      ac      axc

// LIST('s3s://spectx-docs/listing/glob/l1/a*(b|x)c');
// matches:
// abbc     abc     abxc    ac      axc

// LIST('s3s://spectx-docs/listing/glob/l1/a+(b|x)c');
// matches:
// abbc     abc     abxc    axc

// LIST('s3s://spectx-docs/listing/glob/l1/a@(b|x)c');
// matches:
// abc      axc

// LIST('s3s://spectx-docs/listing/glob/l1/!(a@(b|x)c)');
// matches:
// 20150625 abbc    abxc    ac      DCE

LIST('s3s://spectx-docs/listing/glob/l1/*([[:lower:]])');
// matches:
// abbc     abc     abxc    ac      axc
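
The first four composite sub-patterns map naturally onto regular-expression quantifiers. The Python sketch below translates them into a regex and applies the result to the file names listed in Example 8. It is a simplified model, not the SpectX implementation: composite_to_regex and matching are hypothetical helpers, pattern lists may contain only literal alternatives, nesting is not handled, and !(pattern-list) is left out.

```python
import re

# Regex quantifier for each composite prefix; @ means "exactly one".
QUANTIFIERS = {"?": "?", "*": "*", "+": "+", "@": ""}
GROUP = re.compile(r"([?*+@])\(([^()]*)\)")


def composite_to_regex(pattern):
    """Translate ?(..), *(..), +(..) and @(..) groups into a regex."""
    pieces, pos = [], 0
    for match in GROUP.finditer(pattern):
        # Literal text before the group matches itself.
        pieces.append(re.escape(pattern[pos:match.start()]))
        alternatives = "|".join(
            re.escape(alt) for alt in match.group(2).split("|")
        )
        quantifier = QUANTIFIERS[match.group(1)]
        pieces.append("(?:%s)%s" % (alternatives, quantifier))
        pos = match.end()
    pieces.append(re.escape(pattern[pos:]))
    return re.compile("".join(pieces))


# File names taken from the listing shown in Example 8.
NAMES = ["20150625", "abbc", "abc", "abxc", "ac", "axc", "DCE"]


def matching(pattern):
    regex = composite_to_regex(pattern)
    return [name for name in NAMES if regex.fullmatch(name)]


for pattern in ("a?(b|x)c", "a*(b|x)c", "a+(b|x)c", "a@(b|x)c"):
    print(pattern, matching(pattern))
```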

Time Patterns

SpectX supports URI expansion using the following time patterns:

Time pattern   Description
$yy$           2-digit year
$yyyy$         4-digit year
$M$            variable length month in year (1 - 12)
$MM$           2-digit month in year (01 - 12)
$d$            variable length day in month (1 - 31)
$dd$           2-digit day in month (01 - 31)
$H$            variable length hour of zero based 24 hour clock (0 - 23)
$HH$           2-digit hour of zero based 24 hour clock (00 - 23)
$m$            variable length minute in hour (0 - 59)
$mm$           2-digit minute in hour (00 - 59)
$s$            variable length second in minute (0 - 59)
$ss$           2-digit second in minute (00 - 59)
$S$            variable length millisecond in second (0 - 999)
$SSS$          3-digit millisecond in second (000 - 999)

The abbreviated 2-digit year is interpreted relative to a century: if the year value is less than 32, the date is assigned to the 21st century, otherwise to the 20th century. For example, the year 12 parses to 2012 and the year 72 parses to 1972.
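
The century rule can be written down in a few lines of Python. The function name expand_two_digit_year is hypothetical; the pivot value of 32 comes from the rule stated above.

```python
def expand_two_digit_year(yy):
    """Map a 2-digit year to a 4-digit year using a pivot of 32."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a 2-digit year, got %r" % yy)
    return 2000 + yy if yy < 32 else 1900 + yy


print(expand_two_digit_year(12))  # 2012
print(expand_two_digit_year(72))  # 1972
```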

2-, 3- and 4-digit tokens are treated as fixed-length matchers, accepting only the respective number of digits.

Example 9: When time patterns are used with the LIST command, the parsed time is returned in the path_time field:

LIST('s3s://spectx-docs/listing/glob/l1/$yyyy$$MM$$dd$') | select(path_time);

Here path_time evaluates to 2015-06-25 00:00:00.

Variable-length patterns must accept a variable number of digits, so the parser needs to know where each time unit ends. The only feasible way to achieve this is to separate time units with distinct markers (non-numeric characters). Variable-length time units placed consecutively, without non-numeric separators in between, cannot be parsed unambiguously.
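
The ambiguity is easy to demonstrate. The Python sketch below enumerates every way a run of digits could be read as a variable-length month followed directly by a variable-length day, as in a hypothetical $M$$d$ pattern with no separator. The function name is made up for illustration.

```python
def possible_month_day_splits(digits):
    """List every valid (month, day) reading of an unseparated digit run."""
    results = []
    for i in range(1, len(digits)):
        month, day = int(digits[:i]), int(digits[i:])
        if 1 <= month <= 12 and 1 <= day <= 31:
            results.append((month, day))
    return results


# "111" could mean January 11th or November 1st.
print(possible_month_day_splits("111"))
# "1231" has only one valid reading, but the parser cannot rely on that.
print(possible_month_day_splits("1231"))
```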

Example 10: The pattern s3s://spectx-docs/listing/glob/l3/$yyyy$/$M$/access_$d$-$H$.log matches the variable-length month, day and hour in the path elements. The LIST output field path_time contains the evaluated time pattern:

LIST('s3s://spectx-docs/listing/glob/l3/$yyyy$/$M$/access_$d$-$H$.log') | select(uri, path_time);
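
Conceptually, a time pattern can be turned into a regular expression in which fixed-length tokens accept an exact number of digits and variable-length tokens accept one or two. The Python sketch below does this for a subset of tokens and builds a timestamp from the captured values. It is an assumption-laden model rather than the SpectX parser: TOKENS, time_pattern_to_regex and path_time are hypothetical names, only year, month, day and hour tokens are covered, time zones are ignored, and the sample path 2016/3/access_7-15.log is invented.

```python
import re
from datetime import datetime

# Token -> (field name, digit matcher). Only a subset is modelled here.
TOKENS = {
    "yyyy": ("year", r"\d{4}"),
    "MM": ("month", r"\d{2}"),
    "M": ("month", r"\d{1,2}"),
    "dd": ("day", r"\d{2}"),
    "d": ("day", r"\d{1,2}"),
    "HH": ("hour", r"\d{2}"),
    "H": ("hour", r"\d{1,2}"),
}


def time_pattern_to_regex(pattern):
    # Splitting on $token$ yields literal, token, literal, token, ...
    parts = re.split(r"\$([A-Za-z]+)\$", pattern)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pieces.append(re.escape(part))
        else:
            name, digits = TOKENS[part]
            pieces.append("(?P<%s>%s)" % (name, digits))
    return re.compile("".join(pieces))


def path_time(pattern, path):
    match = time_pattern_to_regex(pattern).fullmatch(path)
    if match is None:
        return None
    fields = {key: int(value) for key, value in match.groupdict().items()}
    return datetime(
        fields["year"],
        fields.get("month", 1),
        fields.get("day", 1),
        fields.get("hour", 0),
    )


print(path_time("$yyyy$$MM$$dd$", "20150625"))
print(path_time("$yyyy$/$M$/access_$d$-$H$.log", "2016/3/access_7-15.log"))
```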

Note

The timezone specified in user properties is used to convert the parsed timestamp. Use the timezone parameter of the LIST command to specify a different one.

Content Handling

Before parsing, the content of a file may need to be decompressed, converted from a non-UTF-8 character set, or sliced.

Each such instruction is specified in the form of a URI:

scheme:/params

where:

  • scheme specifies the instruction
  • params contains instruction-specific parameters

The instruction URI is placed in the fragment of the Data Access URI.

Note

Multiple instructions can be placed in Data Access URI fragments.
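
To make the fragment layout concrete, the following Python sketch splits a Data Access URI into its location and a list of (scheme, params) instructions. The split_instructions helper is hypothetical and assumes that # never occurs inside the location or the parameters; it only illustrates the scheme:/params shape described above.

```python
def split_instructions(data_access_uri):
    """Return (location, [(scheme, params), ...]) for a Data Access URI."""
    location, *fragments = data_access_uri.split("#")
    instructions = []
    for fragment in fragments:
        scheme, separator, params = fragment.partition(":/")
        if not separator:
            raise ValueError("not an instruction URI: %r" % fragment)
        instructions.append((scheme, params))
    return location, instructions


uri = ("s3s://spectx-docs/formats/compressed/"
       "investigation_case.zip#zip:/auth.log.gz#gz:/")
print(split_instructions(uri))
```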

Decompression

SpectX supports decompression of files and archives. The instructions (algorithm, path within the archive) are specified in the URI fragment(s) as algorithm_name followed by :/ and, optionally, a path:

algorithm_name:/path

The LIST command passes the instructions in the content_ref field to the PARSE command, which handles decompression.

The following table lists the supported algorithms:

Algorithm name   Utility       Default file extension   SpectX recognized file extension
gz               gzip, pigz    .gz                      .gz
zz               pigz -z       .zz                      .zz
lz4              lz4           .lz4                     .lz4
xz               xz            .xz                      .xz
pigz             pigz -i       .gz                      .pi.gz
pizz             pigz -i -z    .zz                      .pi.zz
pbz2             bzip2         .bz2                     .bz2
pbz2             lbzip2        .bz2                     .bz2
concatbz2        pbzip2        .bz2                     .pi.bz2
concatgz         sxgzip        .sx.gz                   .sx.gz
deflate
deflate64
zip              zip           .zip                     .zip
sevenz           7z            .7z                      .7z
rar              rar           .rar                     .rar
tar              tar           .tar                     .tar
tgz              tar           .tar.gz; .tgz            .tar.gz; .tgz
none:/ [1]

[1] none:/ disables decompression, which is useful when you want to examine the compressed content of a file.

Note

Specifying decompression info in the Data Access URI fragment is needed only when extracting files from an archive (such as .zip), when overriding the decompression algorithm identified by the file extension, or when decompressing a file which has been compressed multiple times.
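
The extension-based default can be pictured as a lookup from the recognized extensions in the table to algorithm names. The Python sketch below is a guess at such a lookup, not the actual SpectX logic: the guess_algorithm helper is hypothetical, and preferring the longest matching extension (so that .pi.gz wins over .gz) is an assumption.

```python
# Recognized extension -> algorithm name, taken from the table above.
RECOGNIZED_EXTENSIONS = {
    ".gz": "gz", ".zz": "zz", ".lz4": "lz4", ".xz": "xz",
    ".pi.gz": "pigz", ".pi.zz": "pizz", ".bz2": "pbz2",
    ".pi.bz2": "concatbz2", ".sx.gz": "concatgz", ".zip": "zip",
    ".7z": "sevenz", ".rar": "rar", ".tar": "tar",
    ".tar.gz": "tgz", ".tgz": "tgz",
}


def guess_algorithm(file_name):
    """Return the algorithm for the longest matching extension, if any."""
    # Assumption: longer extensions take precedence over shorter ones.
    for extension in sorted(RECOGNIZED_EXTENSIONS, key=len, reverse=True):
        if file_name.endswith(extension):
            return RECOGNIZED_EXTENSIONS[extension]
    return None


print(guess_algorithm("auth.log.tar.gz"))
print(guess_algorithm("access.log.pi.gz"))
print(guess_algorithm("gzip_compressed_file"))
```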

Example 11: To extract a file from a tar.gz compressed archive we need to specify the decompression algorithm and the path within the archive:

LIST('s3s://spectx-docs/formats/compressed/auth.log.tar.gz#tgz:/auth/2015/0106.srv01.v1.log')
| parse(pattern:"LD:line (EOL|EOF)")

Note

You can specify multiple instructions for extraction/decompression by appending URI fragments one after another. SpectX applies the specified instructions from left to right (i.e. the leftmost is applied first).

Example 12: To extract a gz compressed file from a zip archive, we first specify the instruction for extracting it from the archive and then the instruction for decompressing the target file:

LIST('s3s://spectx-docs/formats/compressed/investigation_case.zip#zip:/auth.log.gz#gz:/')
| parse(pattern:"LD:line (EOL|EOF)")
| limit(1000)
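
The left-to-right ordering can be illustrated outside SpectX with Python's standard zipfile and gzip modules. The sketch below builds a small in-memory zip archive that contains a gzip-compressed member, then applies a zip instruction followed by a gz instruction. The apply_instructions and build_sample_archive helpers and the sample content are hypothetical; only the zip and gz algorithms are modelled.

```python
import gzip
import io
import zipfile


def apply_instructions(data, instructions):
    """Apply (scheme, path) instructions in order, leftmost first."""
    for scheme, path in instructions:
        if scheme == "zip":
            # Extract one member from the archive held in memory.
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                data = archive.read(path)
        elif scheme == "gz":
            data = gzip.decompress(data)
        else:
            raise ValueError("unsupported instruction: %r" % scheme)
    return data


def build_sample_archive():
    """Create a zip archive containing a gzip-compressed log file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        payload = gzip.compress(b"line one\nline two\n")
        archive.writestr("auth.log.gz", payload)
    return buffer.getvalue()


content = apply_instructions(
    build_sample_archive(),
    [("zip", "auth.log.gz"), ("gz", "")],
)
print(content)
```

Reversing the two instructions would fail, because the outer container is a zip archive rather than a gzip stream.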

Hint

You can extract multiple files from an archive using GLOB patterns:

Example 13:

LIST('s3s://spectx-docs/formats/compressed/investigation_case.zip#zip:/*****/*.txt')
| parse(pattern:"LD:line (EOL|EOF)")

Sometimes the file extension does not correspond to the compression actually used. This is not a problem: you can always override the compression type by specifying it in the Data Access URI fragment (in the argument to the LIST command).

Example 14: Force the use of the gz decompressor on a compressed file with no extension:

LIST('s3s://spectx-docs/formats/compressed/gzip_compressed_file#gz:/')
| parse(pattern:"LD:line EOL")

Charset Conversion

The SpectX parsing and query engines interpret text in UTF-8 encoding. When the content of a file is encoded in another character set, it must first be converted to UTF-8.

The source character set name is specified as:

charset:/name

where name specifies the name of the source character set. The supported names are defined in the Java StandardCharsets class.

Example 15: Convert UTF-16 big endian encoded content to UTF-8:

LIST('s3s://spectx-docs/formats/text/UTF-16BE.txt#charset:/UTF-16BE')
| parse(pattern:$[/user/patterns/selection-test/fully-matching/line.sxp])
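
The conversion itself amounts to decoding the bytes with the source character set and re-encoding them as UTF-8. The Python sketch below shows this for UTF-16BE; the to_utf8 helper and the sample text are hypothetical, and Python's codec registry stands in for the Java charset names.

```python
def to_utf8(raw, source_charset):
    """Decode bytes from the source charset and re-encode as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


# Sample text encoded as UTF-16 big endian (two bytes per character here).
raw = "naïve café".encode("UTF-16BE")
converted = to_utf8(raw, "UTF-16BE")
print(len(raw), len(converted))
print(converted.decode("utf-8"))
```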

Slice

SpectX supports partial retrieval of file content by specifying the slice length and offset in the Data Access URI fragment:

slice:/length@offset

Note

slice:/ must be applied in the first (leftmost) Data Access URI fragment.

The LIST command passes the offset and length (in the offset and length fields) to the PARSE command, which retrieves only the specified portion of the file.

Example 16: Extract the second line from an LDAP log:

LIST('s3s://spectx-docs/formats/log/ldap/2015-12-31_slapd_access-syslog.log#slice:/74@113')
| parse(pattern:"LD:line EOL")
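
The length@offset notation reads as "this many bytes, starting at this byte position". The Python sketch below parses such a parameter and reads the corresponding bytes from an in-memory stream. The parse_slice and read_slice helpers and the sample data are hypothetical, and the sketch assumes the offset is zero-based and counted in bytes.

```python
import io


def parse_slice(params):
    """Split 'length@offset' into two integers."""
    length, _, offset = params.partition("@")
    return int(length), int(offset)


def read_slice(stream, params):
    """Read `length` bytes starting at byte `offset` of the stream."""
    length, offset = parse_slice(params)
    stream.seek(offset)
    return stream.read(length)


# The first line is 11 bytes long, so the second line starts at offset 11.
sample = io.BytesIO(b"first line\nsecond line\nthird line\n")
print(read_slice(sample, "12@11"))
```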