SpectX URI Syntax

The location of source data is described using the URI syntax where:

  • the URI scheme contains a protocol specifier (see supported protocols here: Input Data)
  • the authority part can contain either a DataStore configuration name or protocol-specific target info (host, container, etc.)
  • the path contains the directory path and filename. Note that you may use GLOB Patterns and Time Patterns in the path to specify multiple files.

Example 1:

sa://mylogstore/archive/apache/access.log

where:

  • sa:// is the URI scheme for the Source Agent protocol
  • mylogstore is the URI authority part, referring to the DataStore named mylogstore
  • /archive/apache/access.log is the URI path

Example 2:

ssh://192.168.1.100/data/logs/apache/access.log

where:

  • ssh:// is the URI scheme for the SSHv2 protocol
  • 192.168.1.100 is the URI authority part, specifying the target IP address
  • /data/logs/apache/access.log is the URI path
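
The decomposition used in both examples can be illustrated with a minimal Python sketch. It relies on the standard library's generic URI splitter rather than SpectX's own parser, and the helper name split_source_uri is hypothetical.

```python
from urllib.parse import urlsplit

def split_source_uri(uri):
    """Split a source URI into (scheme, authority, path)."""
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path

for uri in ("sa://mylogstore/archive/apache/access.log",
            "ssh://192.168.1.100/data/logs/apache/access.log"):
    scheme, authority, path = split_source_uri(uri)
    print(f"scheme={scheme} authority={authority} path={path}")
```

Note that urlsplit treats ? as the start of a query string, so this sketch is only suitable for URIs without GLOB pattern characters.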

SpectX resolves URIs using a staged approach:

  1. Resolving the URI scheme. SpectX looks for the protocol provider that implements the scheme specified in the URI. If no provider is found, an error is thrown.
  2. Resolving the authority part of the URI. First, SpectX tries to match the authority part to any of the defined DataStore names. If a match is found, it takes the connection parameters from the DataStore and passes them to the protocol provider. If the necessary parameters are not found, an error is thrown.
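
The two stages can be sketched roughly as follows. This is an illustration only, not SpectX's implementation: the PROVIDERS and DATASTORES registries, their contents (including the host 10.0.0.5) and the resolve function are hypothetical, and the fallback to treating the authority as a host is an assumption based on the authority description above.

```python
from urllib.parse import urlsplit

# Hypothetical registries; names and values are illustrative only.
PROVIDERS = {"sa": "SourceAgentProvider", "ssh": "SshProvider"}
DATASTORES = {"mylogstore": {"host": "10.0.0.5"}}

def resolve(uri):
    """Resolve a URI in two stages: scheme first, then authority."""
    parts = urlsplit(uri)

    # Stage 1: find a protocol provider for the scheme.
    provider = PROVIDERS.get(parts.scheme)
    if provider is None:
        raise LookupError(f"no protocol provider for scheme {parts.scheme!r}")

    # Stage 2: try the authority as a DataStore name first.
    params = DATASTORES.get(parts.netloc)
    if params is None:
        # Assumption: otherwise the authority is protocol-specific target info.
        params = {"host": parts.netloc}
    if not params.get("host"):
        raise ValueError("necessary connection parameters not found")
    return provider, params, parts.path

print(resolve("sa://mylogstore/archive/apache/access.log"))
print(resolve("ssh://192.168.1.100/data/logs/apache/access.log"))
```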

GLOB Patterns

SpectX supports the GLOB pattern syntax for matching URIs to individual files/blobs. Any character that appears in a pattern matches itself, except for the special pattern characters described below. The NUL character must not occur in a pattern. To match a special pattern character literally, escape it with a preceding backslash.

Special pattern characters have the following meaning:

1. * (single star) - matches any string, including the null string.

Example 3: The following file pattern lists all files ending with .log in the directory logs/auth/2015/ of the spectx-docs bucket in Amazon S3 storage:

LIST('s3s://spectx-docs/logs/auth/2015/*.log');

2. ** ... (multiple stars) - when placed between ‘/’ characters, matches directories and subdirectories up to a depth equal to the number of stars (i.e. a recursive match). When placed anywhere else (i.e. within a file name, preceded, followed or enclosed by any other characters), it is treated as a single star pattern.

Example 4: The following pattern searches up to 3 levels deep under the /logs/ directory of the spectx-docs bucket and retrieves all files ending with .srv04.v1.log:

LIST('s3s://spectx-docs/logs/***/*.srv04.v1.log');
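
To make the depth rule concrete, here is a minimal Python sketch that translates star patterns into a regular expression. It is not SpectX's matcher. It assumes that a run of n stars between slashes matches between 1 and n directory levels, and that a single star never crosses a ‘/’. The function names are hypothetical.

```python
import re

def segment_to_regex(segment):
    """Translate one path segment; '*' and '?' never cross a '/'."""
    out = []
    for ch in segment:
        if ch == "*":
            out.append("[^/]*")
        elif ch == "?":
            out.append("[^/]")
        else:
            out.append(re.escape(ch))
    return "".join(out)

def glob_to_regex(pattern):
    """Translate a path pattern that may contain multi-star segments."""
    segments = pattern.split("/")
    out = []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        if len(seg) >= 2 and set(seg) == {"*"} and not last:
            # Assumption: n stars between slashes match 1..n directory levels.
            out.append("(?:[^/]+/){1,%d}" % len(seg))
        else:
            # Elsewhere, repeated stars behave like a single star.
            out.append(segment_to_regex(seg) + ("" if last else "/"))
    return re.compile("".join(out))

rx = glob_to_regex("logs/***/*.srv04.v1.log")
for path in ("logs/auth/2016/0101.srv04.v1.log",   # 2 levels deep
             "logs/a/b/c/0101.srv04.v1.log",       # 3 levels deep
             "logs/a/b/c/d/0101.srv04.v1.log"):    # 4 levels deep
    print(path, bool(rx.fullmatch(path)))
```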

3. ? (question mark) - matches any single character.

Example 5: The pattern s3s://spectx-docs/logs/auth/2016/010?.srv01.v1.log matches logs from server srv01 for the first nine days of January 2016 (0101 through 0109):

LIST('s3s://spectx-docs/logs/auth/2016/010?.srv01.v1.log');

4. [...] - matches any one of the enclosed characters. A pair of characters separated by a hyphen denotes a range; any character that falls between those two characters (inclusive) is matched. The UTF-8 collating sequence and character set are used. If the first character following [ is ! or ^, then any character not enclosed is matched. A hyphen (-) may be matched by including it as the first or last character in the set. A closing bracket (]) may be matched by including it as the first character in the set.

Example 6: The pattern s3s://spectx-docs/listing/glob/l2/[a-c]*.log will match and retrieve files whose name begins with the character ‘a’, ‘b’ or ‘c’ and ends with ‘.log’:

LIST('s3s://spectx-docs/listing/glob/l2/[a-c]*.log');
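
Bracket expressions behave similarly in many glob implementations. As a rough illustration, the sketch below uses Python's standard fnmatch module on a made-up list of file names. It is an approximation, not SpectX's matcher: fnmatch supports ranges and ! negation, but treats a leading ^ literally.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, for illustration only.
names = ["a1.log", "beta.log", "c.log", "delta.log", "a1.txt"]

# Names beginning with a, b or c and ending with .log.
matched = [n for n in names if fnmatchcase(n, "[a-c]*.log")]

# Negated set: names NOT beginning with a, b or c, ending with .log.
negated = [n for n in names if fnmatchcase(n, "[!a-c]*.log")]

print(matched)  # ['a1.log', 'beta.log', 'c.log']
print(negated)  # ['delta.log']
```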

Within ‘[’ and ‘]’, character classes can be specified using the syntax [:class:]. Classes are defined in the POSIX standard:

POSIX Character Class Description
[:alnum:] Alphanumeric characters a-z; A-Z; 0-9
[:alpha:] Alphabetic characters a-z; A-Z
[:ascii:] All ASCII characters in the range 0x0 - 0x7F
[:blank:] Space (0x20) and tab (0x9) characters
[:cntrl:] Control characters in the ASCII range 0x1 - 0x1F; 0x7F
[:digit:] Digits in the range 0-9
[:graph:] Visible characters in the ASCII code range 0x21 - 0x7E
[:lower:] Lowercase letters a-z
[:print:] Printable characters in the ASCII code range 0x20 - 0x7E
[:punct:] Punctuation and symbols !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[:space:] All whitespace characters. In ASCII codes: 0x20; 0x9; 0xA; 0xB; 0xC; 0xD
[:upper:] Uppercase letters A-Z
[:xdigit:] Hexadecimal digits 0-9; A-F; a-f
[:word:] Word characters: letters a-z; A-Z; digits 0-9 and underscore (_)

Example 7: The pattern s3s://spectx-docs/listing/glob/l2/[a-c][[:digit:]].log will match and retrieve files whose name begins with the character ‘a’, ‘b’ or ‘c’, followed by a single digit, and ends with ‘.log’:

LIST('s3s://spectx-docs/listing/glob/l2/[a-c][[:digit:]].log');
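
One way to picture how character classes expand is to translate them into ordinary regular-expression sets. The following Python sketch does this for a subset of the classes in the table. It handles only literals and bracket expressions, and the names POSIX_CLASSES and pattern_to_regex are hypothetical; it does not reflect how SpectX implements matching.

```python
import re

# Subset of the POSIX classes, written as regex character-set fragments.
POSIX_CLASSES = {
    "alnum": "a-zA-Z0-9", "alpha": "a-zA-Z", "digit": "0-9",
    "lower": "a-z", "upper": "A-Z", "xdigit": "0-9A-Fa-f",
    "word": "a-zA-Z0-9_",
}

# A bracket expression: '[' ... ']' where the body may hold [:class:] tokens.
BRACKET = re.compile(r"\[((?:\[:\w+:\]|[^\]])+)\]")

def pattern_to_regex(pattern):
    """Translate literals and bracket expressions only (no *, ? or composites)."""
    out, i = [], 0
    while i < len(pattern):
        m = BRACKET.match(pattern, i)
        if m:
            body = m.group(1)
            negate = body[0] in "!^"
            if negate:
                body = body[1:]
            # Expand each [:class:] token into its character ranges.
            body = re.sub(r"\[:(\w+):\]",
                          lambda c: POSIX_CLASSES[c.group(1)], body)
            out.append("[" + ("^" if negate else "") + body + "]")
            i = m.end()
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))

rx = pattern_to_regex("[a-c][[:digit:]].log")
for name in ("a1.log", "c9.log", "d1.log", "ab.log"):
    print(name, bool(rx.fullmatch(name)))
```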

Note

SpectX does not support matching equivalence classes or collating symbols.

5. In the following description, a pattern-list is a list of one or more patterns separated by a ‘|’. Composite patterns can be formed using one or more of the following sub-patterns:

  • ?(pattern-list) Matches zero or one occurrence of the given patterns.
  • *(pattern-list) Matches zero or more occurrences of the given patterns.
  • +(pattern-list) Matches one or more occurrences of the given patterns.
  • @(pattern-list) Matches one of the given patterns.
  • !(pattern-list) Matches anything except one of the given patterns.

Example 8:

// LIST('s3s://spectx-docs/listing/glob/l1/a?(b|x)c');
// matches:
// abc      ac      axc

// LIST('s3s://spectx-docs/listing/glob/l1/a*(b|x)c');
// matches:
// abbc     abc     abxc    ac      axc

// LIST('s3s://spectx-docs/listing/glob/l1/a+(b|x)c');
// matches:
// abbc     abc     abxc    axc

// LIST('s3s://spectx-docs/listing/glob/l1/a@(b|x)c');
// matches:
// abc      axc

// LIST('s3s://spectx-docs/listing/glob/l1/!(a@(b|x)c)');
// matches:
// 20150625 abbc    abxc    ac      DCE

LIST('s3s://spectx-docs/listing/glob/l1/*([[:lower:]])');
// matches:
// abbc     abc     abxc    ac      axc
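
The quantifier semantics of the composite sub-patterns map closely onto regular-expression operators. The Python sketch below checks the first five results of Example 8 under that mapping. The directory listing is inferred from the example output, the translator handles only non-nested ?( ), *( ), +( ) and @( ) groups, and the function name extglob_to_regex is hypothetical.

```python
import re

# Regex quantifier for each composite prefix character.
QUANTIFIER = {"?": "?", "*": "*", "+": "+", "@": ""}

def extglob_to_regex(pattern):
    """Translate literals and non-nested ?( ) *( ) +( ) @( ) groups."""
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch in QUANTIFIER and pattern[i + 1:i + 2] == "(":
            end = pattern.index(")", i)
            alternatives = pattern[i + 2:end].split("|")
            group = "|".join(re.escape(a) for a in alternatives)
            out.append("(?:" + group + ")" + QUANTIFIER[ch])
            i = end + 1
        else:
            out.append(re.escape(ch))
            i += 1
    return re.compile("".join(out))

# Assumed contents of the l1 directory, inferred from Example 8.
names = ["20150625", "abbc", "abc", "abxc", "ac", "axc", "DCE"]

def matches(pattern):
    rx = extglob_to_regex(pattern)
    return [n for n in names if rx.fullmatch(n)]

print(matches("a?(b|x)c"))  # ['abc', 'ac', 'axc']
print(matches("a*(b|x)c"))  # ['abbc', 'abc', 'abxc', 'ac', 'axc']
print(matches("a+(b|x)c"))  # ['abbc', 'abc', 'abxc', 'axc']
print(matches("a@(b|x)c"))  # ['abc', 'axc']

# !(pattern-list) is modeled here as the complement of the inner match.
inner = extglob_to_regex("a@(b|x)c")
print([n for n in names if not inner.fullmatch(n)])
# ['20150625', 'abbc', 'abxc', 'ac', 'DCE']
```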

Time Patterns

SpectX supports URI expansion using the following time patterns:

Time pattern Description
$yy$ 2-digit year
$yyyy$ 4-digit year
$M$ variable length month in year (1 - 12)
$MM$ 2-digit month in year (01 - 12)
$d$ variable length day in month (1 - 31)
$dd$ 2-digit day in month (01 - 31)
$H$ variable length hour of zero based 24 hour clock (0 - 23)
$HH$ 2-digit hour of zero based 24 hour clock (00 - 23)
$m$ variable length minute in hour (0 - 59)
$mm$ 2-digit minute in hour (00 - 59)
$s$ variable length second in minute (0 - 59)
$ss$ 2-digit second in minute (00 - 59)
$S$ variable length millisecond in second (0 - 999)
$SSS$ 3-digit millisecond in second (000 - 999)

The abbreviated 2-digit year is interpreted relative to a century. If the year value is less than 32, the date is adjusted to the 21st century, otherwise to the 20th century. For example, the year 12 parses to 2012 and the year 72 parses to 1972.
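
The century rule amounts to a simple pivot at 32. A minimal Python sketch (the function name expand_two_digit_year is hypothetical):

```python
def expand_two_digit_year(yy):
    """Map a 2-digit year to a 4-digit year using a pivot of 32."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a 2-digit year in the range 0-99")
    # Values below 32 fall in the 21st century, all others in the 20th.
    return 2000 + yy if yy < 32 else 1900 + yy

print(expand_two_digit_year(12))  # 2012
print(expand_two_digit_year(72))  # 1972
```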

2-, 3- and 4-digit tokens are treated as fixed-length matchers, accepting only the respective number of digits.

Example 9: The path_time field is evaluated when time patterns are used with the LIST command:

LIST('s3s://spectx-docs/listing/glob/l1/$yyyy$$MM$$dd$') | select(path_time);

evaluates to 2015-06-25 00:00:00
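
How fixed-length tokens could yield a timestamp can be sketched in Python by turning the time pattern into a regular expression with one named group per token. This is an illustration under assumptions, not SpectX's implementation: only a few fixed-length tokens are covered, time zones are ignored, missing fields default to the start of the period (and the year to 1970), and the names TOKENS, time_pattern_to_regex and path_time_of are hypothetical.

```python
import re
from datetime import datetime

# Fixed-length tokens only: token -> (datetime field, digit matcher).
TOKENS = {
    "yyyy": ("year", r"\d{4}"), "MM": ("month", r"\d{2}"),
    "dd": ("day", r"\d{2}"), "HH": ("hour", r"\d{2}"),
    "mm": ("minute", r"\d{2}"), "ss": ("second", r"\d{2}"),
}

def time_pattern_to_regex(pattern):
    """Replace each $token$ with a named group; escape everything else."""
    pieces = re.split(r"\$(\w+)\$", pattern)  # odd indexes are token names
    out = []
    for i, piece in enumerate(pieces):
        if i % 2 == 1:
            field, digits = TOKENS[piece]
            out.append(f"(?P<{field}>{digits})")
        else:
            out.append(re.escape(piece))
    return re.compile("".join(out))

def path_time_of(pattern, path):
    """Return the timestamp encoded in path, or None if it does not match."""
    m = time_pattern_to_regex(pattern).fullmatch(path)
    if m is None:
        return None
    f = {name: int(value) for name, value in m.groupdict().items()}
    # Assumption: fields absent from the pattern take their lowest value.
    return datetime(f.get("year", 1970), f.get("month", 1), f.get("day", 1),
                    f.get("hour", 0), f.get("minute", 0), f.get("second", 0))

print(path_time_of("$yyyy$$MM$$dd$", "20150625"))  # 2015-06-25 00:00:00
```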

Variable-length patterns accept a variable number of digits, so the parser needs to know where each time unit ends. The only feasible way to provide this is to separate time units with distinct markers (non-numeric characters). Variable-length time units placed consecutively, without non-numeric separators in between, cannot be parsed unambiguously.
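
The ambiguity is easy to demonstrate. The Python sketch below enumerates every valid way of splitting a digit string among consecutive variable-length tokens, assuming each token takes one or two digits within the ranges listed in the table above. The function name interpretations is hypothetical.

```python
from itertools import product

# Valid value range for each variable-length token.
RANGES = {"M": (1, 12), "d": (1, 31), "H": (0, 23)}

def interpretations(digits, tokens):
    """Return every valid split of digits among the tokens (1-2 digits each)."""
    results = []
    for lengths in product((1, 2), repeat=len(tokens)):
        if sum(lengths) != len(digits):
            continue
        values, pos = [], 0
        for token, n in zip(tokens, lengths):
            value = int(digits[pos:pos + n])
            pos += n
            low, high = RANGES[token]
            if not low <= value <= high:
                break
            values.append(value)
        else:
            # Reached only when every token value was within its range.
            results.append(tuple(values))
    return results

# "1123" parsed as $M$$d$$H$ without separators has three readings:
print(interpretations("1123", ["M", "d", "H"]))
# [(1, 1, 23), (1, 12, 3), (11, 2, 3)]
```

With separators, as in 11-2-3, each unit is delimited and only one reading remains.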

Example 10: The pattern s3s://spectx-docs/listing/glob/l3/$yyyy$/$M$/access_$d$-$H$.log matches the variable-length month, day and hour path elements. The LIST output field path_time contains the evaluated time pattern.

LIST('s3s://spectx-docs/listing/glob/l3/$yyyy$/$M$/access_$d$-$H$.log') | select(uri, path_time);

Note

The timezone specified in the user properties is used to convert the parsed timestamp. Use the timezone parameter of the LIST command to specify a different one.