LIST

Listing input resources (files or blobs) is an essential step in almost every SpectX query. In addition to the name and exact path, it provides us with metadata such as the last modified time, file size, etc. All of them may be used directly in queries or query optimization (for instance partitioned access).

The LIST operation results in metadata of each evaluated URI:

  • uri - full URI of the evaluated file
  • file_name - part of the path after the last slash (‘/’) character
  • offset - file offset in bytes, used to slide start position of file content when parsing
  • length - file length in bytes
  • last_modified - time the file was last modified
  • path_time - timestamp evaluated from the time string in the URI using time patterns.
  • etag - do not modify, used internally for optimizations
  • ref - do not modify, used internally for referring to content of compressed archives
  • meta_props - do not modify, used internally for optimizations
  • is_blob - true if the listed resource was a file (blob), otherwise false

Note that listing properties of a directory does not return the list of files residing in that directory. Use GLOB patterns to list the content of a directory.

NB! Listing can become a very expensive and time-consuming operation when used carelessly! Typically, this situation occurs when listing files with a pattern. This causes the remote system to scan a huge number of files. See more details in the Listing Best Practices section.

Syntax

The listing operation is performed using the LIST expression:

LIST(uri_expr)
LIST(src:src_expr [, config_key:value ,... ])

where:

  • uri_expr is either:

  • src_expr is either:

    • single data access URI:

      LIST(src:'file:/path/to/file')
      
    • array of data access URI-s:

      LIST(src:['file:/path/to/file', 'file:/path/to/another/file'])
      
    • a tuple containing one mandatory uri field of type data access URI and possibly parameters with values:

      LIST(src:{ uri:'file:/path/to/dir', includeContainers:TRUE, includeBlobs:FALSE });
      
    • array of tuples, each containing one mandatory uri field of type data access URI and possibly parameters with values:

      LIST(src:[
        { uri:'file:/path/to/file' },
        { uri:'file:/path/to/$yyyy$/$MM$$dd$.file', tz:'GMT' }
      ]);
      

    The parameters are as follows:

    • includeContainers is optional BOOLEAN type named argument with FALSE as default value. Value of TRUE allows to list directory properties. The value overrides query.list.includeContainers
    • includeBlobs is an optional BOOLEAN type named argument with default value TRUE allowing to list file properties. The value overrides query.list.includeBlobs.
    • includeBlobsDisallowedInACL is an optional BOOLEAN type named argument with FALSE as default value. Value of TRUE allows listing file properties of blobs disallowed from reading by Read ACL defined for a datastore. The value overrides query.list.includeBlobsDisallowedInACL.
    • timezone (or tz) is an optional STRING type named argument, specifying timezone for evaluating time patterns in listed uri. The value overrides query.list.timezone.
    • locale is an optional STRING type named argument, specifying locale for evaluating time patterns in listed uri. The value overrides query.list.locale.
    • ignoreErrors is an optional BOOLEAN type named argument, when set to TRUE then the listing operation is not interrupted at IO-related errors such as ‘file not found’ or ‘access denied’. Defaults to FALSE. The value overrides query.list.ignoreErrors.
    • ignoreFullyDefinedUriNotFoundErrors is an optional BOOLEAN type named argument, when set to TRUE then the listing operation is not interrupted when listing a non-existent file with its exact URI. Defaults to FALSE. The value overrides query.list.ignoreFullyDefinedUriNotFoundErrors.
  • config_key:value - optional configuration parameter with its value.

Syntax examples

Example 1. Listing a single uri:

1
LIST('s3s://spectx-docs/logs/auth/2016/0101.srv00.v1.log');

Example 2. Listing multiple URIs (expressed as an array of named strings):

1
2
3
4
LIST(src:[
    's3s://spectx-docs/logs/auth/2015/*.log',
    's3s://spectx-docs/logs/auth/2016/*.log'
    ]);

Example 3. Listing single uri with parameters (uri and parameters expressed as tuple):

1
LIST( src:{uri:'sx:/user/examples/data/logs/production', includeContainers:TRUE, includeBlobs:FALSE});

Example 4. Listing multiple uris with parameters (expressed as an array of tuples):

1
2
3
4
LIST(src:[
    {uri:'s3s://spectx-docs/formats/log/apache/apache_access.log.sx.gz'},
    {uri:'s3s://spectx-docs/logs/auth/$yyyy$/$MM$$dd$.srv*.v1.log', tz:'GMT'}
    ]);

Example 5. Listing fully defined URI (a data access URI with no patterns used):

1
LIST('s3s://spectx-docs/logs/auth/2016/0101.srv00.v1.log');

returns:

uri file_name offset length last_modified path_time etag ref meta_props is_blob
s3s://spectx-docs/logs/auth/2016/0101.srv00.v1.log 0101.srv00.v1.log 0 56 2019-08-30 12:40:01.000 +0000 NULL b5da70b6506631b134801dffa9dd6996 NULL NULL true

Example 6. Listing with Glob Patterns. You can use SpectX GLOB patterns for listing multiple files:

1
LIST('s3s://spectx-docs/logs/auth/2015/*.srv03.v1.log');

returns:

uri file_name offset length last_modified path_time etag ref meta_props is_blob
s3s://spectx-docs/logs/auth/2015/1220.srv03.v1.log 1220.srv03.v1.log 0 56 2019-08-30 12:40:01.000 +0000 NULL 7689cf3b7a7935f4984f2b554214f7b6 NULL NULL true
s3s://spectx-docs/logs/auth/2015/1223.srv03.v1.log 1223.srv03.v1.log 0 54 2019-08-30 12:40:01.000 +0000 NULL 08ad02410a7e8a2eaeed9c81103dc626 NULL NULL true

Listing with Time Patterns

Let’s suppose you have stored your logs in a directory structure. The logs are in separate directories based on the the year they were created, with month and day incorporated in the file name (see this example below).

A query involving files only from a certain period would be much quicker to perform if we were able to use only files relevant to that period (as opposed to doing a full scan through all the files). In other words, the data is partitioned according to time and we can take advantage of this by using uri time patterns.

When you specify a time pattern in the URI, the listing process will parse the time value from the listed URI. The time value is then returned in the path_time field of the listing result set. The query layer can then use the result further to shorten the list of relevant files to retrieve for performing a query.

Example 7: listing authentication logs from Christmas 2015 holidays:

1
2
LIST(src:'s3s://spectx-docs/logs/auth/$yyyy$/$MM$$dd$.srv*.v1.log')
| filter(path_time > T('2015-12-24 00:00:00 GMT') and path_time < T('2015-12-26 23:59:59 GMT'));

results in:

uri file_name length last_modified path_time etag meta_props is_blob    
s3s://spectx-docs/logs/auth/2015/1225.srv01.v1.log 1225.srv01.v1.log 0 57 2019-08-30 12:40:01 +0000 2015-12-25 00:00:00 +0000 84c45c3bec6f014bb6d46a0946b47434 NULL NULL true
s3s://spectx-docs/logs/auth/2015/1226.srv01.v1.log 1226.srv01.v1.log 0 55 2019-08-30 12:40:01 +0000 2015-12-26 00:00:00 +0000 e2f7b738ce338c474376e990ed8b377b NULL NULL true

Error Handling

During listing operation error situations, such as ‘file not found’ or ‘access denied’, may occur. You can control the behavior of LIST in these situations with query.list.ignoreErrors configuration parameter. If it is set to FALSE then query processing will be stopped at errors. When set to TRUE then processing continues and the error message is written to Query log tab.

NB! Listing with glob or time patterns yielding no results is not an error since this is essentially a search operation that resulted in nothing found. Listing of non-existent file with its exact URI results in error, unless query.list.ignoreFullyDefinedUriNotFoundErrors is set to TRUE. Then such errors get ignored (no message gets written to Query log tab).

Ignoring errors may be useful at processing a large number of source files when individual failures do not have major effect on the result.

Values for these both query.list.ignoreErrors and query.list.ignoreFullyDefinedUriNotFoundErrors parameters set either in script’s INIT block or in system configuration affect all LIST commands in the script.

However, if you want these values to be applied for execution of a particular LIST command then you do it by supplying these as config_key arguments. Note that in this case, just like with INIT block, parameter names must be prefixed by an underscore and enclosed in single or double quotes, just as in the example below.

Example 8: Ignore potential errors during execution of a LIST command using configuration property:

1
LIST(src:'s3s://spectx-docs/logs/auth/***/*.log', '_query.parse.ignoreErrors':true);

Alternatively, such source parameters like ignoreErrors and ignoreignoreFullyDefinedUriNotFoundErrors could be used instead. But this requires to use a tuple with parameters in src_expr:

LIST(src:{uri:'sx:/user/examples/data/logs/***/*.log', ignoreErrors:true});

Listing Best Practices

Listing can become a very time consuming and expensive operation when it involves a very large number of files. To understand how this can happen, a quick overview of how listing works is in order.

SpectX supports several storage systems that behave differently when listing. This is especially true when we are talking about listing with a pattern. In a nutshell, the listing does two operations: i) composing a list of files of the specified path where the pattern is omitted and ii) filtering the list according to the pattern. Those of you familiar to the command line find it useful to use GLOB patterns to list files in file systems, since the ls and dir commands of most operating systems perform both operations together seamlessly. The SourceAgent (serving files from an on-premise server) also performs listing this way.

However, listing in cloud storage only does the first operation. This means listing becomes a 2-phase operation where an exhaustive list of files in a specified path is composed on the storage side and filtering is done on the SpectX side. Note that the list is transmitted to SpectX over the network, which may become expensive when the list is large.

Now, when the specified path where a pattern is omitted (or a path prefix, to use the cloud search API terminology) includes a large number of files, transmitting the list takes a significant amount of time. Sometimes, when millions of files are involved, it can take several minutes.

The best way to overcome these issues is to organize your data the way that matches the scope of your queries.

Organize Your Data

The main idea is to try to have the path in the listing operation as precise as possible - i.e the closer the path prefix is to the file part, the better (like the list of wildcarded files will be shorter). How to achieve that for most of the queries?

Here are some tips:

1. Identify the boundaries in your data that you seldom cross in your queries (such as the distinction between test and production data, different applications, etc). Store the data of different boundaries in separate directories. Usually, there are more than one of such boundaries present. Therefore, place the directories with more frequent crossing likelihood towards the end of the path.

The same logic applies to compose the name of the file: place the stable boundary identifiers first and the less stable ones after them.

Do not choose frequently changing dimensions as boundaries. E.g. it’s not a good idea to use the session-id as a boundary for organizing logs.

2. It is always a good idea to incorporate date (and time) information in the URI. But again, time is a dimension that tends to be crossed in queries. Place only the largest necessary time unit (year) into the path and have it preceded by more significant directories. As a rule of thumb, keep smaller time units as part of the file name.

3. Try to keep the number of files in a directory smaller than 1000. The cloud storage listing API’s tend to return the listing in 1000-line batches. Each file over that limit means an additional HTTP request-response roundtrip per directory. It may seem insignificant at first but when you think that listing operation is performed thousands of times, the cost finally adds up.

4. Keep rare files separate from regular files. When your query involves rare files, listing in cloud storage will likely produce the list, where irrelevant files (i.e regularly appearing) will vastly dominate the relevant ones. To avoid this, keep the rarities in a separate directory.

Example: let’s say we have several server applications (authentication, ordering, and balance info), each producing its own logs kept in different directories. Also, keeping production and test data separately is always a good idea. Data is further compartmentalized into separate directories by year. Month and day stay in the file name. Assuming that we have two servers running each application and the logs are rotated daily, we will have slightly over 700 files in each year directory. This also leaves us plenty of room for version changes.

If more servers are added, we might consider splitting the data into monthly directories.

└── logs/
    └── production/
        ├── auth/
        │   ├── 2015/
        │   │   ├── 1220.srv03.v1.log
        │   │   ├── 1221.srv04.v1.log
        │   │   ├── 1222.srv02.v1.log
        │   │   ├── 1223.srv03.v1.log
        │   │   └── ...
        │   └── 2016/
        ├── orders/
        │   ├── 2015/
        │   │   └── ...
        │   └── 2016/
        │       └── ...
        └── balance/
    └── test/
        ├── auth/
        ├── orders/
        └── balance/

Listing Snapshots

In some cases, changing the storage structure is expensive and infeasible. If you are stuck with a structure with millions of files in a single directory or have irregular event files together with regular event files, then your only option is to develop a custom solution that pushes storage changes into a single file. SpectX can use this file to create a listing for queries instead of performing the list operation using cloud storage listing API.

How does this accelerate the listing process? By taking advantage of SpectX’s resource caching and optimized reading. The listing file is fully transferred from storage to SpectX only with the first reading. From then on, only file changes are transferred. Selecting from the list and filtering is performed locally and very fast, meaning the efficiency improves vastly.

Example 9: suppose we have a script updating s3s://spectx-docs/listing/logs_dir.txt regularly of all the changes to our log storage at s3s://spectx-docs/logs/auth/.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
/* pattern for parsing aws s3 ls output :
2019-08-30 15:40:01         56 logs/auth/2015/1220.srv03.v1.log
*/
$listPattern = <<<END
    TIMESTAMP(tz='UTC'):last_modified BLANK
    LD:length BLANK
    LD:path
    EOL
END;

//parse our listing file stored at s3s://spectx-docs/listing/logs_dir.txt
@listing = PARSE(pattern:$listPattern, src:'s3s://spectx-docs/listing/logs_dir.txt')
| select("s3s://spectx-docs/" || path as uri)      // we only need uri, and we also need to add protocol and bucket names
;

//define pattern for parsing example auth log records
$authLogPattern = <<<END
(
  TIMESTAMP('yyyy-MM-dd HH:mm:ss Z'):dateTime
  IPADDR:ipAddr
  LD:userName
  INT:response
)(fs='\t')
EOL
END;

//parse auth logs using list of uri's we got from @listing
PARSE(pattern:$authLogPattern, src:@listing);

Configuration reference

You can assign specific values for configuration parameters in the script INIT block or in system configuration file - in which case all LIST commands in the script will be affected. However, if you want these values to be applied for execution of a particular LIST command then you do it by supplying these as config_key arguments. Note that in this case, just like with INIT block, parameter names must be prefixed by an underscore and enclosed in single or double quotes (refer to Example 8 in Error Handling section for an example).

Below follows a description of all possible query configuration parameters used by LIST.

query.list.includeContainers: An optional BOOLEAN parameter with FALSE as default value. The value of TRUE allows listing directory properties.

query.list.includeBlobs: An optional BOOLEAN parameter with default value TRUE allowing to list file properties.

query.list.includeBlobsDisallowedInACL: An optional BOOLEAN parameter with FALSE as default value. Value of TRUE allows to list file properties of blobs disallowed from reading by Read ACL defined for a datastore.

query.list.ignoreErrors: An optional BOOLEAN parameter, when set to TRUE then the listing operation is not interrupted at IO-related errors such as ‘file not found’ or ‘access denied’. Defaults to FALSE.

query.list.ignoreFullyDefinedUriNotFoundErrors: An optional BOOLEAN parameter, when set to TRUE then the listing operation is not interrupted when listing a non-existent file with its exact URI. Defaults to FALSE.

query.list.timezone (or query.list.tz): An optional STRING type named argument, specifying timezone for evaluating time patterns in listed uri. If not specified, then value of configuration parameter query.timezone is evaluated. If this is not specified, then system default value is used.

query.list.locale: An optional STRING type named argument, specifying locale for evaluating time patterns in listed uri. If not specified, then value of configuration parameter query.locale is evaluated. If this is not specified, then value of engine.locale is evaluated. If this is not specified, then the value “en_US” is used.

query.list.tp.multi.eval: An optional BOOLEAN parameter, when set to TRUE then the listing operation tries to evaluate URIs with time patterns into multiple independent URIs for listing optimization. The default value is TRUE.