SpectX URI Syntax
The location of source data is described using URI syntax, where:
- the URI scheme contains a protocol specifier (see Input Data for the supported protocols)
- the authority part contains either a DataStore configuration name or protocol-specific target info (host, container, etc.)
- the path contains the directory path and filename. Note that you may use GLOB Patterns and Time Patterns in the path to specify multiple files.
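Example 1: In the URI sa://mylogstore/archive/apache/access.log:
- sa:// is the URI scheme for the Source Agent protocol
- mylogstore is the URI authority part, referring to the DataStore named mylogstore
- /archive/apache/access.log is the URI path
Example 2: In the URI ssh://192.168.1.100/data/logs/apache/access.log:
- ssh:// is the URI scheme for the SSHv2 protocol
- 192.168.1.100 is the URI authority part, specifying the target IP address
- /data/logs/apache/access.log is the URI path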
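The three URI components described above can be illustrated with a minimal Python sketch. It uses the standard library's `urlsplit` rather than SpectX's own parser, and the full URIs are assembled from the components listed above. The helper name `describe` is hypothetical. Note that `urlsplit` treats `?` as a query separator, so this sketch only suits plain URIs without GLOB characters.

```python
from urllib.parse import urlsplit


def describe(uri: str) -> dict:
    """Split a URI into the scheme, authority and path parts discussed above."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,      # protocol specifier, e.g. "sa" or "ssh"
        "authority": parts.netloc,   # DataStore name or protocol-specific target
        "path": parts.path,          # directory path and filename
    }


print(describe("sa://mylogstore/archive/apache/access.log"))
print(describe("ssh://192.168.1.100/data/logs/apache/access.log"))
```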
SpectX resolves URIs using a staged approach:
- Resolving the URI scheme. SpectX tries to find the protocol provider that implements the scheme specified in the URI. If no provider is found, an error is thrown.
- Resolving the authority part of the URI. First, SpectX tries to match the authority part to any of the defined DataStore names. If a match is found, it takes connection parameters from the DataStore and passes them to the protocol provider. If necessary parameters are not found, an error is thrown.
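The staged resolution above could be sketched roughly as follows. This is an illustration only, not SpectX's implementation: the provider names, the DataStore contents, and the required `host` parameter are all hypothetical. The fallback for an authority that matches no DataStore (treating it as protocol-specific target info) is an assumption based on the description of the authority part.

```python
from urllib.parse import urlsplit

# Hypothetical registries, for illustration only.
PROVIDERS = {"sa": "SourceAgentProvider", "ssh": "SshProvider"}
DATASTORES = {"mylogstore": {"host": "10.0.0.5", "port": 8389}}


def resolve(uri: str):
    parts = urlsplit(uri)

    # Stage 1: find a protocol provider for the scheme, or fail.
    provider = PROVIDERS.get(parts.scheme)
    if provider is None:
        raise ValueError(f"no protocol provider for scheme {parts.scheme!r}")

    # Stage 2: try to match the authority against DataStore names first.
    params = DATASTORES.get(parts.netloc)
    if params is None:
        # Assumption: otherwise the authority is protocol-specific target info.
        params = {"host": parts.netloc}

    # Hypothetical check for a required connection parameter.
    if not params.get("host"):
        raise ValueError("necessary connection parameters not found")

    return provider, params, parts.path
```

For instance, `resolve("sa://mylogstore/archive/apache/access.log")` takes its connection parameters from the `mylogstore` entry, while an unknown scheme raises an error at the first stage.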
SpectX supports the GLOB pattern syntax for matching URIs to individual files/blobs. Any character that appears in a pattern matches itself, except the special pattern characters described below. The NUL character must not occur in a pattern. To match a special pattern character literally, escape it with a preceding backslash.
Special pattern characters have the following meaning:
* (single star) - matches any string, including the null string.
Example 3: The following file pattern lists all files in a directory in Amazon S3 storage:
** ... (multiple stars) - when placed between ‘/’ characters, matches directories and subdirectories up to a depth equal to the number of stars (i.e. a recurring match). When placed anywhere else (i.e. in file expansion, preceded, followed or enclosed by any other characters), it is treated as a single star pattern.
Example 4: The following pattern searches 3 levels deep under the /logs/ directory of the spectx-docs bucket and retrieves all files ending with
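One way to picture the multiple-star rule is to expand a run of n stars between slashes into single-star patterns of increasing depth. The sketch below does that under a stated assumption: "up to the depth of the number of stars" is read here as 1 to n directory levels, which the text does not spell out. The function name `expand_multistar` and the sample path are hypothetical, and only the first run of stars is handled.

```python
import re


def expand_multistar(pattern: str) -> list:
    """Expand the first '/**.../' run into single-star patterns of depth 1..n."""
    m = re.search(r"/(\*{2,})/", pattern)
    if m is None:
        return [pattern]  # no multi-star segment between slashes
    depth = len(m.group(1))
    head, tail = pattern[:m.start()], pattern[m.end():]
    # Assumption: depth n means 1 to n directory levels, each matched by '*'.
    return ["/".join([head] + ["*"] * n + [tail]) for n in range(1, depth + 1)]


for p in expand_multistar("/logs/***/*.log"):
    print(p)
```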
? (question mark) - matches any single character.
Example 5: The pattern s3s://spectx-docs/logs/auth/2016/010?.srv01.v1.log matches logs from server srv01 for the first nine days (01 to 09) of January 2016.
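The single-character behaviour of `?` can be tried out with Python's `fnmatch` module. This is an approximation, since `fnmatch` is a separate implementation and not SpectX's matcher, and the file names below are hypothetical.

```python
from fnmatch import fnmatchcase

pattern = "010?.srv01.v1.log"

# Hypothetical file names in the style of Example 5.
names = ["0101.srv01.v1.log", "0109.srv01.v1.log", "0110.srv01.v1.log"]

# '?' stands for exactly one character, so only 0100 to 0109 can match.
matched = [name for name in names if fnmatchcase(name, pattern)]
print(matched)
```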
[...] - matches any one of the enclosed characters. A pair of characters separated by a hyphen marks a range: any character that falls between those two characters (inclusive) is matched. The UTF8 collating sequence and character set are used. If the first character following [ is ^, then any character not enclosed in the range is matched. A - may be matched by including it as the first or last character in the set. A ] may be matched by including it as the first character in the set.
Example 6: The pattern s3s://spectx-docs/listing/glob/l2/[a-c]*.log will match and retrieve files whose name begins with characters ‘a’, ‘b’ or ‘c’ and ends with ‘.log’:
Within ‘[’ and ‘]’, character classes can be specified using the syntax [:class:]. Classes are defined in the POSIX standard:
|POSIX Character Class||Description|
|[:alnum:]||Alphanumeric characters a-z; A-Z; 0-9|
|[:alpha:]||Alphabetic characters a-z; A-Z|
|[:ascii:]||All ASCII characters in the range 0x0 - 0x7F|
|[:blank:]||Space (0x20) and tab (0x9) characters|
|[:cntrl:]||Control characters in the ASCII range|
|[:digit:]||Digits in the range 0-9|
|[:graph:]||Visible characters in the ASCII code range 0x21 - 0x7E|
|[:lower:]||Lowercase letters a-z|
|[:print:]||Printable characters in the ASCII code range 0x20 - 0x7E|
|[:punct:]||Punctuation and symbols|
|[:space:]||All whitespace characters. In ASCII codes: 0x20; 0x9; 0xA; 0xB; 0xC; 0xD|
|[:upper:]||Uppercase letters A-Z|
|[:xdigit:]||Digits in hexadecimal notation 0x0 - 0xF|
|[:word:]||Word characters: letters a-z; A-Z; numbers 0-9 and underscore (_)|
Example 7: The pattern s3s://spectx-docs/listing/glob/l2/[a-c][[:digit:]].log will match and retrieve files whose name begins with one of the characters ‘a’, ‘b’ or ‘c’, followed by a single digit, and ends with ‘.log’:
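Character ranges and classes inside brackets can be experimented with in Python by first rewriting each `[:class:]` into an explicit range and then matching with `fnmatch`. This is a sketch: Python's `fnmatch` has no character-class support of its own, the mapping covers only a few classes, and the helper `expand_classes` and the file names are hypothetical.

```python
import re
from fnmatch import fnmatchcase

# Explicit ranges for a subset of the standard POSIX class names.
POSIX_CLASSES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}


def expand_classes(pattern: str) -> str:
    """Replace each [:class:] with its explicit range, e.g. [[:digit:]] -> [0-9]."""
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_CLASSES[m.group(1)], pattern)


plain = expand_classes("[a-c][[:digit:]].log")
print(plain)

# Hypothetical file names: only a letter a-c followed by one digit matches.
names = ["a1.log", "c9.log", "a12.log", "d1.log", "b1.txt"]
print([n for n in names if fnmatchcase(n, plain)])
```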
SpectX does not support matching equivalence classes or collating symbols.
In the following description, a pattern-list is a list of one or more patterns separated by a ‘|’. Composite patterns can be formed using one or more of the following sub-patterns:
?(pattern-list) - matches zero or one occurrence of the given patterns.
*(pattern-list) - matches zero or more occurrences of the given patterns.
+(pattern-list) - matches one or more occurrences of the given patterns.
@(pattern-list) - matches one of the given patterns.
!(pattern-list) - matches anything except one of the given patterns.
// LIST('s3s://spectx-docs/listing/glob/l1/a?(b|x)c');
// matches:
// abc ac axc

// LIST('s3s://spectx-docs/listing/glob/l1/a*(b|x)c');
// matches:
// abbc abc abxc ac axc

// LIST('s3s://spectx-docs/listing/glob/l1/a+(b|x)c');
// matches:
// abbc abc abxc axc

// LIST('s3s://spectx-docs/listing/glob/l1/a@(b|x)c');
// matches:
// abc axc

// LIST('s3s://spectx-docs/listing/glob/l1/!(a@(b|x)c)');
// matches:
// 20150625 abbc abxc ac DCE

LIST('s3s://spectx-docs/listing/glob/l1/*([[:lower:]])');
// matches:
// abbc abc abxc ac axc
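To make the composite sub-patterns more concrete, the following Python sketch translates four of them (`?(...)`, `*(...)`, `+(...)` and `@(...)`) into regular expressions and checks them against the file names that appear in the listings above. It is not SpectX's matcher: the function names are hypothetical, nesting and `!(...)` are not handled, and every other character is treated as a literal.

```python
import re

# Regex quantifier that corresponds to each composite prefix.
QUANTIFIER = {"?": "?", "*": "*", "+": "+", "@": ""}


def composite_to_regex(pattern: str) -> str:
    """Translate a composite glob (no nesting, no '!') into a regex string."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch in QUANTIFIER and pattern[i + 1:i + 2] == "(":
            close = pattern.index(")", i + 2)
            alternatives = pattern[i + 2:close].split("|")
            body = "|".join(re.escape(a) for a in alternatives)
            out.append(f"(?:{body}){QUANTIFIER[ch]}")
            i = close + 1
        else:
            out.append(re.escape(ch))  # everything else is a literal here
            i += 1
    return "".join(out)


def matches(pattern: str, names: list) -> list:
    rx = re.compile(composite_to_regex(pattern))
    return [n for n in names if rx.fullmatch(n)]


# File names taken from the listings above.
NAMES = ["20150625", "abbc", "abc", "abxc", "ac", "axc", "DCE"]

for p in ("a?(b|x)c", "a*(b|x)c", "a+(b|x)c", "a@(b|x)c"):
    print(p, "->", matches(p, NAMES))
```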
SpectX supports URI expansion using the following time patterns:
|$M$||variable length month in year (1 - 12)|
|$MM$||2-digit month in year (01 - 12)|
|$d$||variable length day in month (1 - 31)|
|$dd$||2-digit day in month (01 - 31)|
|$H$||variable length hour of zero based 24 hour clock (0 - 23)|
|$HH$||2-digit hour of zero based 24 hour clock (00 - 23)|
|$m$||variable length minute in hour (0 - 59)|
|$mm$||2-digit minute in hour (00 - 59)|
|$s$||variable length second in minute (0 - 59)|
|$ss$||2-digit second in minute (00 - 59)|
|$S$||variable length millisecond in second (0 - 999)|
|$SSS$||3-digit millisecond in second (000 - 999)|
The abbreviated 2-digit year is interpreted relative to a century. If the year value is less than 32, the date is assigned to the 21st century, otherwise to the 20th century. For example, the year 12 parses to 2012 and the year 72 parses to 1972.
2, 3 and 4-digit tokens are treated as fixed-length matchers, accepting only the respective amount of digits.
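The 2-digit year rule amounts to a pivot at 32. A minimal Python sketch of that rule, with a hypothetical function name, could look like this:

```python
def expand_two_digit_year(yy: int) -> int:
    """Map a 2-digit year to a full year using the pivot at 32."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a 2-digit year")
    # Values below 32 fall in the 21st century, all others in the 20th.
    return 2000 + yy if yy < 32 else 1900 + yy


print(expand_two_digit_year(12))
print(expand_two_digit_year(72))
```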
Example 9: The path_time field is evaluated when time patterns are used with the LIST command:
LIST('s3s://spectx-docs/listing/glob/l1/$yyyy$$MM$$dd$') | select(path_time);
Variable-length patterns accept a variable number of digits, so the parser needs to know where each time unit ends. The only feasible way to provide this is to separate time units with distinct markers (non-numeric characters). Variable-length time units placed consecutively, without non-numeric separators in between, are impossible to parse correctly.
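The ambiguity can be demonstrated with a small Python sketch that enumerates every way a digit string could be split into a variable-length month followed by a variable-length day, as $M$$d$ would require. The function name `month_day_parses` is hypothetical and the sketch is independent of SpectX's parser.

```python
def month_day_parses(digits: str) -> list:
    """Return every (month, day) reading of digits for consecutive $M$$d$."""
    readings = []
    for cut in (1, 2):  # the month may take one or two digits
        month, day = digits[:cut], digits[cut:]
        if day and len(day) <= 2 and 1 <= int(month) <= 12 and 1 <= int(day) <= 31:
            readings.append((int(month), int(day)))
    return readings


# "111" is either January 11 or November 1: two valid readings.
print(month_day_parses("111"))
# "1225" has only one valid reading, December 25.
print(month_day_parses("1225"))
```

With a separator such as `1-11`, the boundary between the two units is explicit and only one reading remains.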
Example 10: The pattern s3s://spectx-docs/listing/glob/l3/$yyyy$/$M$/access_$d$-$H$.log will match the variable-length month, day and hour in the path elements. The LIST output field path_time contains the evaluated time pattern.
LIST('s3s://spectx-docs/listing/glob/l3/$yyyy$/$M$/access_$d$-$H$.log') | select(uri, path_time);
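How a time pattern might be turned into a timestamp can be sketched in Python by translating each token into a named regex group. This is an illustration, not SpectX's implementation: only a few tokens are covered, the function names are hypothetical, and the sample paths are shortened, made-up file names rather than real bucket contents.

```python
import re
from datetime import datetime

# Token -> (field name, digit matcher). Fixed-length tokens accept an exact
# number of digits, variable-length ones accept one or two.
TOKENS = {
    "yyyy": ("year", r"\d{4}"),
    "MM": ("month", r"\d{2}"),
    "M": ("month", r"\d{1,2}"),
    "dd": ("day", r"\d{2}"),
    "d": ("day", r"\d{1,2}"),
    "HH": ("hour", r"\d{2}"),
    "H": ("hour", r"\d{1,2}"),
}


def time_pattern_to_regex(pattern: str):
    # Splitting on $token$ leaves literals at even indices, tokens at odd ones.
    parts = re.split(r"\$(\w+)\$", pattern)
    out = []
    for idx, part in enumerate(parts):
        if idx % 2 == 0:
            out.append(re.escape(part))
        else:
            field, digits = TOKENS[part]
            out.append(f"(?P<{field}>{digits})")
    return re.compile("".join(out))


def path_time(pattern: str, path: str):
    """Return the datetime encoded in path, or None if it does not match."""
    m = time_pattern_to_regex(pattern).fullmatch(path)
    if m is None:
        return None
    f = {k: int(v) for k, v in m.groupdict().items()}
    return datetime(f["year"], f.get("month", 1), f.get("day", 1), f.get("hour", 0))


print(path_time("/l1/$yyyy$$MM$$dd$", "/l1/20150625"))
print(path_time("/l3/$yyyy$/$M$/access_$d$-$H$.log", "/l3/2016/3/access_7-14.log"))
```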