PARSE

During parsing, SpectX fetches data from the specified source files and converts it into a stream of structured records.

Syntax:

PARSE(uri_expr)
PARSE(pattern_expr, uri_expr)
PARSE(pattern:pattern_expr, src:src_expr     [, rc:rc_expr] [, cache:onoff_expr] [, config_key:value ,... ])
PARSE(pattern:pattern_expr, src:@list_stream [, rc:rc_expr] [, cache:onoff_expr] [, config_key:value ,... ])

LIST(...) | PARSE()
LIST(...) | PARSE(pattern_expr)
LIST(...) | PARSE(pattern:pattern_expr       [, rc:rc_expr] [, cache:onoff_expr] [, config_key:value ,... ])

where:

  • uri_expr specifies the URI(s) of input files. It has the same format as the uri_expr argument of the LIST command

  • src_expr specifies the location of input files. It can either have the same format as the src_expr argument of the LIST command or contain a reference to the output of the LIST command.

  • pattern_expr is a STRING, a filename, or a reference to a variable specifying the pattern for extraction and transformation. A filename must be enclosed in square brackets and prefixed with the type identifier $. Optional; the default is LD{0,65535}:line (EOL|EOF).

  • config_key:value is an optional configuration parameter and its value.

  • rc_expr is a tuple containing the following parameters of a raw cursor:

    • name - name of the raw cursor (mandatory). Must not contain ‘/’ characters or start with ‘.’.
    • maxEntries - an integer value specifying the maximum number of files allowed.
    • entryExpirePeriod - a string value specifying the time period a record is kept in the cursor after it has disappeared from the listing.
    • startFromBeginningIfNew - a boolean value specifying how records are read from new files in the cursor.
    • dryRun - a boolean value telling whether the cursor is in read-only mode.
  • onoff_expr is a boolean value (true or false) for the optional cache argument, which, if specified, overrides the cache usage behaviour defined for the datastores of the source URI(s). This has effect only if cache is enabled in the system configuration. Then, true enables cache usage even if a datastore is not defined or its configuration disables caching; false disables cache usage.

The PARSE command fetches the content of all the files specified by a single uri_expr, by src_expr (possibly containing multiple URIs), or by the output of the LIST command, decompresses the data when needed, parses it using the pattern_expr and sends the resulting structured record stream to its output.

Example 1. Parsing data from a single URI.

PARSE(src:'s3s://spectx-docs/logs/auth/2015/12*.srv*.v1.log', pattern:'LD:line EOL');

Example 2. Parsing data from multiple URIs, expressed as an array of strings:

PARSE(src:[
        's3s://spectx-docs/logs/auth/2015/1231.srv01.v1.log',
        's3s://spectx-docs/logs/auth/2016/0101.srv00.v1.log'
      ],
      pattern:"LD:line EOL"
);

Example 3. Parsing data from a single URI with additional parameters (URI and parameters are expressed as a tuple):

PARSE(
    src:{uri:'s3s://spectx-docs/logs/auth/2015/*.log', tz:'GMT'},
    pattern:"LD:line EOL"
);

Example 4. Parsing data from multiple URIs with additional parameters (expressed as an array of tuples):

$line = "LD:line EOL";

PARSE(
    src:[
        {uri:'s3s://spectx-docs/formats/log/apache/apache_access.log.sx.gz'},
        {uri:'s3s://spectx-docs/logs/auth/$yyyy$/$MM$$dd$.*.log', tz:'EET'}
        ],
    pattern:$line       //pattern can also be referenced
);

The rc parameter can be used to specify raw cursor mode for the command execution (see Raw cursor below).

Optional configuration parameters can be used to control various aspects of parsing.
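
For instance, a minimal sketch (reusing the source from Example 1) that overrides the cache behaviour and supplies one configuration parameter for this call only — assuming cache is enabled in the system configuration, as described above:

PARSE(src:'s3s://spectx-docs/logs/auth/2015/12*.srv*.v1.log',
      pattern:'LD:line EOL',
      cache:false,                          //override datastore cache behaviour for this call only
      '_query.parse.ignoreErrors':true      //example configuration parameter (see Error Handling)
);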

The parameters specifying source files (uri_expr and src_expr) may be absent if the command reads the listing stream from its input pipe.

Note

The listing stream may be produced either by LIST or by any other command producing a stream with the same structure. See the Listing Snapshots example.

The default pattern (when PARSE is invoked without a pattern_expr) parses the source data as lines terminated by the LF (line-feed) character.
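
For example, a minimal sketch relying on the default pattern; each matched line is exported in the line field, per the default pattern shown above:

LIST('s3s://spectx-docs/logs/auth/2015/*.log')
| parse()        //no pattern_expr: each line becomes one record with a single field, line
;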

Of course, there are many more aspects to it: processing data in parallel without exhausting memory, accessing meta-information, error handling, and optimizing different layers to get the best performance. Read the following sections for details.

Processing explained

The processing (i.e. retrieval of raw data, decompression, and parsing) is performed by several independent tasks executed in parallel. The number of simultaneously running tasks is restricted by the value of the query.max_tasks query configuration parameter, which defaults to the number of available processing units multiplied by two.

The number of tasks is determined mainly by the size of the chunks the source data files are split into. Each task fetches its chunk from the specified source, performs decompression if needed, and extracts and transforms the data elements defined by the given pattern. All the parameters of the tasks (like the size of chunks, the amount of memory allowed, etc.) are configurable.
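
For instance, a hedged sketch of overriding the plaintext chunk size for a single command, using the underscore-prefixed config_key form described in the Configuration reference below (the value format is assumed here to be a byte count):

PARSE(src:'s3s://spectx-docs/logs/auth/*/*.log',
      pattern:"LD:line EOL",
      '_query.parse.chunkSize':16777216    //16MB chunks instead of the default 64MB (hypothetical value format)
);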

Normally, errors occurring during the retrieval of the data or decompression cause a task to stop its execution, resulting in termination of the whole query. It is possible, however, to make the parsing stage ignore errors and continue execution by setting the configuration parameter query.parse.ignoreErrors to TRUE.

Task creation process

The tasks are created and started by the SpectX engine based on the length and type properties of the file/blob.

The length is the size of a file/blob in the listing. The length is considered undefined if the field is missing from the listing or has a value equal to or less than 0.

The type refers to whether the file is compressed or not, and is determined by:

  1. content_ref field coming from a LIST stream, or
  2. the extension of the blob/file

If neither of these determines the compression type, the blob is considered to be plaintext.

Depending on these two properties:

  • Multiple sequential small blobs of the same compression type (plaintext/bzip2/deflate/deflate64/gzip/lz4/xz/zz) are processed as a batch.
  • When processing files from compressed archives (7z/rar/zip), their content must first be listed or enumerated. After that, separate tasks are executed for each selected blob in the archive.
  • Big plaintext blobs and compressed blobs supporting parallel decompression are split into virtual chunks that are processed by simultaneous tasks.
  • Compressed blobs which cannot be decompressed in parallel are processed in a single task (see single-blob processing).
  • Blobs with unknown or undefined length (compressed or not) are always processed in a single task (see unknown-length blob processing).

To conclude: the way a source data file/blob is processed depends on its length and type.

Task types

Batch processing

Batch tasks are created for blobs of the same compression type (including blobs with plaintext content). Only compressed archives (rar/zip/7zip) cannot be processed in batches.

Batch creation is disabled if the query explicitly disables fetching of the very first chunk of each specified blob, e.g. by specifying the condition _chunk_id > 0 in the query filter. In this case, each blob is processed by a single task.
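
For illustration, a sketch of such a query (assuming _chunk_id is available in filter expressions, as implied above):

PARSE(src:'s3s://spectx-docs/logs/auth/*/*.log', pattern:"LD:line EOL")
| filter(_chunk_id > 0)     //the first chunk of each blob is never fetched, so batching is disabled
;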

When composing a batch, PARSE observes the following restrictions:

  • The size of the batch (the sum of the lengths of all the blobs in a batch) must not exceed the value specified by the configuration parameter query.parse.batch.pt.maxSize for plaintext blobs (the default is 64MB) or query.parse.batch.<type>.maxSize for compressed ones (bz2, gz, lz4, xz, zz, deflate, deflate64; the default is 16MB).

    Note

    If a query limits its resultset by an explicit call to LIMIT() following PARSE, the maximum batch size is automatically adjusted to optimize the retrieval of data.

  • The count of blobs in a batch must not exceed the value specified by the configuration parameters query.parse.batch.pt.maxBlobCount and query.parse.batch.<type>.maxBlobCount for plaintext and compressed blobs respectively. The default value for both is 32.

Composition of a batch terminates whenever PARSE encounters a blob violating any of the rules above. If by then the batch consists of only one blob, batch processing is abandoned and the blob is processed in a single task.

The batch-processing task is the only one spanning multiple threads. For every blob in the batch, the task fetches the blob’s content in a separate thread, while decompression and parsing are performed in the main task. The value of the configuration parameter query.parse.batch.<type>.ptSize determines the size of the internal memory buffer for decompression. The maximum number of simultaneously running threads is restricted by the configuration parameter query.parse.batch.maxFetchThreadCount.

All the fetched chunks of data are parsed in the same order as the corresponding blobs were specified in the listing stream.
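
As with other configuration parameters (see the Configuration reference below), the batch limits can presumably be adjusted per command; a hedged sketch raising the plaintext blob count limit:

PARSE(src:'s3s://spectx-docs/logs/auth/*/*.log',
      pattern:"LD:line EOL",
      '_query.parse.batch.pt.maxBlobCount':64     //hypothetical value; the default is 32
);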

Single-blob processing

Plaintext blobs

The task fetches the content of the target blob (as one blob-sized chunk) and parses the data.

Compressed blobs

The task fetches the content of the target compressed blob in chunks of the size specified by the configuration parameter query.parse.<type>.readLen (<type> is one of bz2, gz, lz4, xz, zz, deflate, deflate64). The default value is specified by query.parse.chunkSizeCompressed. The task then decompresses the data into an internal memory buffer, performs virtual alignment and then parses it. The configuration parameter query.parse.<type>.ptSize determines the size of the internal memory buffer for decompression.

Parallel processing

The content of a blob is virtually divided into chunks that are processed in parallel. The size of the chunk is limited by query.parse.chunkSize for plaintext data and query.parse.chunkSizeCompressed for compressed data. The default values are 64MB and 16MB respectively. The actual amount of data fetched by a task is slightly bigger than the chunk size computed by division, as the task needs some overlap for alignment.

First, the task fetches the specified chunk of data. Second, decompression is performed in the case of bzip2, pbzip2, pigz, pizz, or sxgz compressed data. The size of the buffer for decompression is limited by query.parse.<type>.ptSize (<type> is one of bzip2, pbzip2, pigz, pizz, sxgz).

Then, virtual alignment and parsing take place.

Note that gzip blobs served by the Source Agent can be processed in parallel. SpectX uses decompression index metadata pre-computed by the Source Agent to determine the size of chunks for parallel decompression and subsequent alignment and parsing. The size of the internal buffer is computed at runtime by adding the values of query.parse.chunkSize and the minimum plaintext chunk size.

Unknown-length blob processing

Plaintext blobs

The task fetches the entire content of the target blob. To avoid overflowing the available memory, the processing is performed in consecutive loops of reading a chunk of source data, alignment, and parsing. The size of the chunk is limited by query.parse.ulen.pt.readLen.

Compressed blobs

The task fetches the entire content of the target blob. To avoid overflowing the available memory, the processing is performed in consecutive loops of reading a chunk of source data, decompression, alignment, and parsing. The size of the chunk is limited by the query.parse.ulen.<type>.readLen parameter. The value of query.parse.ulen.<type>.ptSize determines the size of the internal memory buffer for decompression.

<type>: bz2, gz, lz4, xz, zz, deflate, deflate64

Archive blobs

The processing of archive blobs (see the list of supported archives in Working With Archives) consists of two phases: archive listing and glob filtering, followed by decompression and parsing of the selected files.

Archive listing fetches the parts of the archive containing listing information and filters the pathname list with a glob pattern specified by the archive extraction instructions. The processing of each of the files in the list is then scheduled according to its length and type, as described above.

Zip archives provide seekable access to their entries, i.e. the parsing task can access the corresponding compressed stream directly. The archive entries (files) are processed by single-blob tasks, either plaintext or compressed (deflate/deflate64/xz/bzip2), depending on their compression type within the zip archive.

Rar and 7Zip do not provide seekable access, hence the enumeration of all the entries in these archives is needed. The processing of an entry is performed in consecutive loops of reading a chunk of the entry, decompression, alignment, and parsing. The value of query.parse.<type>.ptSize (<type> is rar or 7z respectively) determines the size of the internal memory buffer for decompression.

Data fetching

Retrieving source data from its storage location is performed using the unified SpectX data access layer.

When querying data using the sa:// or sas:// protocols (the Source Agent), SpectX always asks it to temporarily “lock” target blobs for the duration of the query (to protect against deletion of the entire file). Blob locking is provided by the Source Agent API, which locks files for a limited time.

It is implemented by holding the target file channel open until the lock expires or is explicitly requested to be destroyed. Once a lock is created, it gets assigned a random id, which may then be used to query the file content without additional authentication. All subsequent lock creation requests for the same file reuse the previously opened file channel. The content is guaranteed to remain available even if the file is deleted during the lock lifetime.

Note

Locking does not prevent file truncation.

Upon lock destruction, the server returns statistics on the real, CPU, and user time spent handling the IO operations on the given lock.

The lock expiry time is set by the Source Agent. If processing the locked blob content takes longer than the lock lifetime, the latter must be extended by the Source Agent API client. The SpectX engine uses the configuration parameter query.parse.maxLockAgeBeforeUpdate to determine when to issue a lock lifetime extension request. Note that if this period is set shorter than the time the Source Agent needs to perform the extension, the query execution may fail due to an attempt to use an expired lock on the target blob.

Locking of blobs that are unlikely to be modified (for instance, archived files) can be disabled using the configuration parameter query.parse.maxLockableBlobAge. The default behavior is to lock blobs regardless of their last modification time.

The locking can be disabled completely by setting the value of query.parse.lockingDisabled to TRUE.
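
For example, a hedged sketch that skips locking for a single PARSE call reading via a Source Agent (the sa:// URI below is hypothetical):

PARSE(src:'sa://agent.example.com/var/log/auth.log',
      pattern:"LD:line EOL",
      '_query.parse.lockingDisabled':true
);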

Raw cursor

Specifying a raw cursor makes PARSE return only the records added since the last read. For each input file to the PARSE command, the last read position (the offset) is recorded in the cursor (along with other metadata).

The cursor is stored in /user/cursors/ as a SpectX table with the name specified by the name parameter.

Hint

Right-click on the resource tree and choose “Refresh Tree” to see the cursor file.

The behavior at the first reading of a file is dictated by the startFromBeginningIfNew parameter. When set to true (the default), the file is read from the beginning to the end. When set to false, the offset is set to the end of the file and no records are returned; at the next reading, only the records after that position are returned.

When a file disappears from the listing, there is a “grace” period during which its record is still kept in the cursor. The period is configurable with the entryExpirePeriod parameter. It accepts a string consisting of a number followed by a time unit (ms, sec, min, hour, day, week). The default value is “25 hour”.

The maximum number of input files to the cursor can be limited with maxEntries. The default is 16000 and the maximum allowed value is 256000. When the number of files in the input listing exceeds this limit, the query fails.

Note

  • Only plaintext (i.e. uncompressed) blobs of known length are supported (unsupported blobs are reported to the Query Log).
  • Changing the pattern between executions of PARSE with the same raw cursor does not invalidate the cursor.
  • Cursor data gets saved only if PARSE finishes successfully.
  • Cursor parameters are not stored (they are read from the PARSE call parameters).
  • The cursor itself has no expiry. To reset a cursor, delete it from the resource tree.

Example: Parsing Linux syslog. The raw cursor is stored as /user/cursors/syslog.sxt.

LIST('file:/var/log/syslog')
| parse(pattern:$[/shared/patterns/syslog.sxp], rc:{name:"syslog"})
;
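
For illustration, a sketch of the same query with all raw cursor parameters spelled out (the values here are hypothetical):

LIST('file:/var/log/syslog')
| parse(pattern:$[/shared/patterns/syslog.sxp],
        rc:{name:"syslog",
            maxEntries:1000,                   //fail if the listing contains more files
            entryExpirePeriod:"2 day",         //keep entries of disappeared files for 2 days
            startFromBeginningIfNew:false,     //new files: record the end offset, return nothing yet
            dryRun:true                        //read-only mode: offsets are not updated
           })
;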

Generating Source Data Listing Yourself

The listing stream passed to the PARSE command (either via the src parameter or the input pipe) does not necessarily need to come from the LIST command. PARSE expects a stream with the following fields:

Name           Mandatory  Type       Description
uri            yes        STRING     Full URI of the evaluated file.
file_name      no         STRING     Part of the path after the last slash (‘/’) character.
length         no         LONG       File size in bytes. Specifying the value -1 or omitting the value causes the file to be processed in a single thread.
last_modified  no         TIMESTAMP  Time the file was last modified.
path_time      no         TIMESTAMP  Timestamp evaluated from the time string in the URI.
etag           no         STRING     Used internally for optimizations.
content_ref    no         STRING     Compression type of the blob content (used for decompression).
meta_props     no         STRING     Used internally for optimizations.
is_blob        no         BOOLEAN    True if the listed resource was a file or a blob, otherwise false.

Note

You can pass PARSE a listing stream containing only the uri field. However, it is highly recommended to also include the length field with the size of the file. (When the size of a file is unknown, it cannot be processed in parallel, which decreases the speed of the query.)

When would you need to generate a listing yourself? For instance, when a LIST operation is very slow - this happens when you have millions of files stored in an S3 bucket (see more about that here).

Another such case occurs when you need to retrieve a bunch of files from an HTTP server. A webserver or web application displays the content of a directory as an HTML page, which can be formatted very differently in each case. Hence there is no standardized way for the LIST command to ask an HTTP server to provide a listing of files using a glob pattern. In this case, we can simply parse the file names (or URLs) from the HTML page and feed them to the next PARSE command as a listing stream.

For example, https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports displays daily report files of COVID-19 infection cases by countries. There are way too many files there to compose the list manually or to update it each day. Hence, we parse the file names out from the HTML page:

 <td class="content"><LF>
      <span class="css-truncate css-truncate-target"><a class="js-navigation-open " title="01-30-2020.csv"<LF>
      id="7111eeebd745f0fec67aae9f01c83957-cdea5906c274de587002a18300773c7dc098535a"<LF>
      href="/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports/01-30-2020.csv">>LF>
      01-30-2020.csv</a></span><LF>
</td>
 1  $pattern = <<<EOP
 2    '<td class="content">' data '<span class="css-truncate css-truncate-target"><a class="js-navigation-open " '
 3    'title=' DQS:title DATA
 4    '</td>'
 5  EOP;
 6
 7  PARSE(src:'https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports',
 8       pattern:$pattern)
 9  | filter(title like '%.csv')
10  | select(uri:'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/' + title)
11  ;

where:

  • lines 7-8 retrieve the listing HTML page and parse out title (i.e. the file name), using the pattern defined in lines 1-5
  • line 9 includes in the resultset only the relevant .csv files (we don’t need readme.MD or .gitignore)
  • line 10 constructs the URL to the file’s raw content
uri
https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-23-2020.csv
https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-24-2020.csv
https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/03-21-2020.csv

This listing stream (containing only the mandatory uri field) can be directed into the next PARSE command, which will retrieve the content of the files and parse it according to the pattern you specify.
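
For instance, a hedged sketch that continues the query above (reusing the $pattern variable defined there) by piping the generated listing into a second PARSE; the CSV content is parsed here simply as lines, while a real query would use a proper CSV pattern:

PARSE(src:'https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports',
      pattern:$pattern)
| filter(title like '%.csv')
| select(uri:'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/' + title)
| parse(pattern:"LD:line EOL")     //fetches each listed file and parses it line by line
;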

Metadata About Data

Along with parsed values, SpectX also provides you with various metadata: output fields from the LIST command, the position of the record in a file, etc. These are provided in system fields when requested explicitly (i.e. they do not appear when selecting all fields using a wildcard).

Name Description
_uri Full URI of the evaluated file.
_file_name Part of the path after the last slash (‘/’) character.
_length File size in bytes.
_last_modified Time the file was last modified.
_path_time Timestamp evaluated from the time string in the URI using time patterns.
_etag Used internally for optimizations, do not modify.
_meta_props Used internally for optimizations, do not modify.
_is_blob True if the listed resource was a file or a blob, otherwise false.
_unmatched Contains unmatched data from last matched record until next matching record. See more at Finding Unmatched Corner Cases.
_raw_text Contains original text (converted using selected character set) of current matched or unmatched record.
_raw_bytes Contains original bytes of current matched or unmatched record.
_raw_offset Position in the file of current record [1]
_raw_length Size of current record in bytes.

Example 5.

$pattern = "TIMESTAMP('yyyy-MM-dd HH:mm:ss Z'):time '\t' IPV4:ip '\t' LD:user '\t' INT:response EOL";

PARSE(src:'s3s://spectx-docs/logs/auth/2015/12*.srv*.v1.log', pattern:$pattern)
| select(_file_name, _length, _last_modified, *)
;
_file_name _length _last_modified time ip user response
1220.srv03.v1.log 56 2018-02-13 13:31:27 2015-12-22 22:15:13 112.190.142.126 blackwash 404
1221.srv04.v1.log 60 2018-02-13 13:31:27 2015-12-22 15:34:02 146.211.3.199 arbitrariamente 404
1222.srv02.v1.log 58 2018-02-13 13:31:27 2015-12-27 12:04:38 102.119.101.169 pantaphobia 404
1223.srv03.v1.log 54 2018-02-13 13:31:27 2015-12-26 07:00:48 1.38.101.207 trochantin 200
1224.srv04.v1.log 58 2018-02-13 13:31:27 2015-12-28 10:29:43 25.115.199.170 dictyogenous 404
1225.srv01.v1.log 57 2018-02-13 13:31:27 2015-12-27 08:21:10 175.132.253.128 arrestasti 404
1226.srv01.v1.log 55 2018-02-13 13:31:27 2015-12-29 19:10:32 125.4.57.186 passibility 404
1227.srv00.v1.log 56 2018-02-13 13:31:27 2015-12-27 15:50:45 100.145.20.139 leptospira 200
1228.srv01.v1.log 53 2018-02-13 13:31:27 2015-12-29 04:37:32 86.212.190.219 wkproxy 404
1229.srv04.v1.log 55 2018-02-13 13:31:27 2015-12-30 14:09:14 85.13.62.121 nonscraping 200
1230.srv04.v1.log 53 2018-02-13 13:31:27 2016-01-03 13:52:15 0.221.101.11 manifesti 200
1231.srv01.v1.log 54 2018-02-13 13:31:27 2016-01-03 00:30:18 123.82.76.124 innovator 200
[1] Availability of position info depends on the compression algorithm of the source data file and whether it can be processed in parallel. As most compression algorithms do not include plaintext positional info in their compressed streams, processing them in parallel (which SpectX does by default) makes it impossible to calculate record positions. Hence, position info is available for uncompressed files and sxgzip-compressed files (both are always processed in parallel). When position info is required for other compressed files, their decompression mode must be forced to single thread. See details in Compressed data.

Special Cases of Parsing Timestamps

A timestamp describes a point in time. In addition to handling timezone conversion correctly when parsing time fields, there is another real-life case: the time field does not contain a year (this happens a lot with Linux system logs in the default configuration). Obviously, a timestamp cannot exist without year information, therefore we can either assign the year some default value or take the year from the file’s _last_modified or _path_time timestamps. Here’s how SpectX handles parsing timestamps without a year (NB! this is in order of priority):

  1. When _path_time is assigned (i.e. is not NULL), the year from _path_time is used to adjust the parsed timestamp.
  2. When _path_time is not available (i.e. is NULL), the year from _last_modified is used to adjust the parsed timestamp.
  3. When neither _path_time nor _last_modified is available, the current system time is used. Note that this should never happen, as LIST always evaluates _last_modified. The only occasion it may happen is when using some other input to generate the listing tupleset (for instance, Listing Snapshots).
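
For example, a minimal sketch of parsing syslog-style timestamps that carry no year (the format string is an assumption about the log layout; the missing year is filled in as described above):

LIST('file:/var/log/syslog')
| parse(pattern:"TIMESTAMP('MMM dd HH:mm:ss'):time ' ' LD:msg EOL")
;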

Error Handling

Parsing may encounter various errors, for instance, reading or decompression failures. You can control the behavior of PARSE in these situations with the query.parse.ignoreErrors configuration parameter. If it is set to FALSE, query processing stops at the first error. When set to TRUE, processing continues and the error message is written to the Query Log tab.

Ignoring errors may be useful when processing a large number of source files, where individual failures do not have a major effect on the result.

Example 6. Ignore potential errors during execution of a PARSE command:

PARSE(pattern:"LD:line EOL",
     src:'s3s://spectx-docs/logs/auth/*/*.log',
     '_query.parse.ignoreErrors':true
    );

Configuration reference

You can assign specific values to configuration parameters in the script’s INIT block or in the system configuration file - in which case all PARSE commands in the script are affected. However, if you want these values to apply to the execution of a particular PARSE command, supply them as config_key arguments. Note that in this case, just like in the INIT block, parameter names must be prefixed with an underscore and enclosed in single or double quotes (refer to Example 6 in the Error Handling section for an example).

Below follows a description of all possible query configuration parameters used by PARSE.

Error handling

query.parse.ignoreErrors: A boolean parameter; when set to TRUE, command execution is not interrupted by input data related errors, such as file content not being available, decompression failures, etc. Defaults to FALSE.

Pattern

query.parse.maxRecordSize: An integer specifying the maximum record length. By default, the value is derived from the input pattern.

query.parse.var.minChunkSize: twice the value of query.parse.maxRecordSize minus 1 byte.
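
For example, a hedged sketch raising the maximum record size for a source known to contain very long lines (the value is a hypothetical byte count):

PARSE(src:'s3s://spectx-docs/logs/auth/*/*.log',
      pattern:"LD:line EOL",
      '_query.parse.maxRecordSize':1048576     //allow records up to 1MB
);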

Locking

query.parse.lockingDisabled: A boolean parameter; when set to TRUE, the blob locking mechanism at target Source Agent servers is disabled. Defaults to FALSE.

query.parse.maxLockableBlobAge: A time period specifying the maximum age of a Source Agent blob that is subject to locking. The default value is infinity, which causes all blobs to be locked. Specifying any other period causes blobs “older” than this period to be skipped from locking.

query.parse.maxLockAgeBeforeUpdate: A time period specifying the age of a blob lock at which its lifetime must be extended at the target Source Agent server. The default value is 10s.

Plaintext blobs processing

query.parse.chunkSize: An integer specifying the size of the data chunks split between parallel PARSE tasks processing plaintext blobs. The default value is 64MB.

query.parse.ulen.pt.readLen: An integer specifying the size of the rolling buffer used for fetching blobs of unknown size. Defaults to the value of the configuration parameter query.parse.chunkSize (which itself defaults to 64MB).

Note

Plaintext blobs with unknown size are processed in a single thread

query.parse.batch.pt.maxSize: A long integer specifying the max size of a batch, which is calculated as the sum of sizes of plaintext blobs. The default value is 64MB.

query.parse.batch.pt.maxBlobCount: An integer specifying max count of plaintext blobs in batch. Default is 32.

Compressed blobs processing

query.parse.chunkSizeCompressed: An integer specifying the size of the data chunks split between parallel PARSE tasks processing compressed blobs. The default value is 16MB.

query.parse.<parallel_compression_type>.ptSize: An integer specifying the size of the rolling plaintext buffer used for decompression of chunks of blobs compressed with the corresponding utility (specified by <parallel_compression_type>: sxgz, bzip2, pbzip2, pigz, pizz, 7z, rar).

Defaults to the greater of the following two: the value of the configuration parameter query.parse.chunkSize (which defaults to 64MB) and the minimum plaintext chunk size.

query.parse.sxgz.blockSize: An integer value of the --blocksize (-b) parameter of the sxgzip compression utility. The value is in bytes (see sxgzip help for details). If you have used different block sizes, the largest one should be used.

query.parse.pigz.blockSize: An integer value of the --blocksize (-b) parameter of the pigz compression utility. The value is in KiB (see pigz help for details). If you have used different block sizes, the largest one should be used.

query.parse.pizz.blockSize: An integer value of the --blocksize (-b) parameter of the pigz compression utility when used with the -z parameter. The value is in KiB (see pigz help for details). If you have used different block sizes, the largest one should be used.

query.parse.bzip2.blockSize: An integer value of the block size parameter -1 … -9 of the bzip2 or lbzip2 utilities. Allowed values are 1 … 9, corresponding to 100 kB to 900 kB block sizes (see the respective utility help for details). If you have used different values, the largest value should be used.

query.parse.pbzip2.blockSize: An integer value of the -b (block size) parameter of the pbzip2 compression utility. The value is a positive integer representing a multiplier of the 100 kB block size (see pbzip2 help for details). If you have used different values, the largest value should be used.
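
For example, if the source files were compressed with bzip2 -9, the block size can presumably be passed per command like any other configuration parameter (a hedged sketch; the .bz2 source URI is hypothetical):

PARSE(src:'s3s://spectx-docs/logs/auth/2015/1231.srv01.v1.log.bz2',
      pattern:"LD:line EOL",
      '_query.parse.bzip2.blockSize':9
);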

Single-thread (non-parallel) processing of compressed blobs

Note

In the configuration key names below, <compression_type> is one of gz, bz2, lz4, xz, zz, deflate, deflate64.

query.parse.<compression_type>.readLen: An integer value specifying the size of the rolling buffer used for fetching the compressed content of blobs compressed with the corresponding method. The default is 16MB.

query.parse.<compression_type>.ptSize: An integer value specifying the size of the rolling plaintext buffer used for decompression of blobs compressed with the corresponding method. Defaults to the greater of the following two: the value of the configuration parameter query.parse.chunkSize (which defaults to 64MB) and the minimum plaintext chunk size.

Processing compressed blobs with undefined size

Note

Compressed blobs with undefined size are processed in a single thread (non-parallel).

<ulen_compression_type> below is one of gz, bz2, lz4, xz, zz, deflate, deflate64

query.parse.ulen.<ulen_compression_type>.readLen: An integer value specifying the size of the rolling buffer used for fetching the compressed content of blobs compressed with the corresponding method. Defaults to the value of the configuration parameter query.parse.chunkSizeCompressed (which itself defaults to 16MB).

query.parse.ulen.<ulen_compression_type>.ptSize: An integer value specifying the size of the rolling plaintext buffer used for decompression of blobs compressed with the corresponding method. Defaults to the greater of the following two: the value of the configuration parameter query.parse.chunkSize (which defaults to 64MB) and the minimum plaintext chunk size.

Archive processing

Note

<archive_compression_type> below is one of 7z or rar. Zip is absent because its content is processed member by member (see above).

query.parse.<archive_compression_type>.ptSize: An integer value specifying the size of the rolling plaintext buffer used for decompression of blobs in the archive. Defaults to the greater of the following two: the value of the configuration parameter query.parse.chunkSize (which defaults to 64MB) and the minimum plaintext chunk size.

Batch processing of compressed blobs

Note

<batch_compression_type> below is one of gz, bz2, lz4, xz, zz, deflate, deflate64

query.parse.batch.<batch_compression_type>.ptSize: An integer value specifying the size of the rolling plaintext buffer used for decompression of blobs compressed with the corresponding method. Defaults to the greater of the following two: 2MB and the minimum plaintext chunk size.

query.parse.batch.<batch_compression_type>.maxSize: An integer value specifying the maximum size of a compressed batch, which is calculated as the sum of the sizes of the blobs. The default is 16MB.

query.parse.batch.<batch_compression_type>.maxBlobCount: An integer value specifying the maximum count of blobs in a compressed batch of the given type. The default is 32.

Batch common

query.parse.batch.maxFetchThreadCount: An integer value specifying the maximum number of fetching threads started to get the content of the blobs in a batch. The default is 32.