Compressed Data

Compressing your data is useful in many ways. It saves storage space, although storage tends to get cheaper over time. More importantly, it improves query performance: reading bytes from hard disks is a well-known bottleneck, and compressed data leaves much less to read than plaintext.

Unfortunately, there are some limitations to analyzing compressed data directly. To scale, the system must process the data in parallel. As long as a file is small enough for its decompressed content to fit in the memory of a processing node (less than 2 GB), it can be handled in one run. Bigger files become a problem, however, since they are usually compressed as one continuous stream. Such streams cannot be decompressed in smaller pieces in parallel, and must therefore be processed in a single thread.

The good news is that many popular compression tools support creating compressed streams that can be decompressed in parallel. Some do it by default, others require a special command-line option. They achieve this by compressing data in blocks, which can be decompressed independently of each other, and hence in parallel.
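
For illustration, here is how block-based gzip compression works at the shell level (file names here are arbitrary). Concatenated gzip members form a single valid gzip stream, yet each member could in principle be handed to a separate thread:

$ echo "block one" | gzip >  blocks.gz     # first independently compressed member
$ echo "block two" | gzip >> blocks.gz     # second member, appended to the same file
$ zcat blocks.gz                           # any gzip tool reads it as one stream
block one
block two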

In case you have single-stream gzip-compressed files stored on on-premises servers, you can use SourceAgent to expose them to SpectX. Among other optimizations, SourceAgent allows processing single-stream gzip-compressed files in parallel: it computes the metadata necessary for parallel decompression in the background and stores it separately from the original files (which remain read-only). You must configure the directory where this metadata is stored.

Note that this solution applies only to gzip-compressed files stored on on-premises servers.

Supported Compression Formats

SpectX supports many different compression utilities, which can produce compressed streams with different algorithms and formats. Unfortunately, there is no simple way for SpectX to recognize all the combinations automatically, so it relies on file extensions. The following table maps compression utility output modes to the file extensions SpectX recognizes. These extensions should be assigned during compression.

Utility        | Parallel decompression      | Default file extension | SpectX-recognized file extension | SpectX compress_type
gzip           | No                          | .gz                    | .gz                              | gz
pigz           | No                          | .gz                    | .gz                              | gz
pigz -z        | No                          | .zz                    | .zz                              | zz
lz4            | No                          | .lz4                   | .lz4                             | lz4
xz             | No                          | .xz                    | .xz                              | xz
pigz -i        | Yes                         | .gz                    | .pi.gz                           | pigz
pigz -i -z     | Yes                         | .zz                    | .pi.zz                           | pizz
bzip2          | Yes                         | .bz2                   | .bz2                             | pbz2
lbzip2         | Yes                         | .bz2                   | .bz2                             | pbz2
pbzip2         | Yes                         | .bz2                   | .pi.bz2                          | concatbz2
sxgzip         | Yes, with position info [1] | .sx.gz                 | .sx.gz                           | concatgz
(uncompressed) | n/a                         |                        |                                  | none
[1] Files compressed with Sxgzip Compression Utilities include the plaintext offset and length of each compressed block. This allows providing the positions of parsed records and fields in queries.

Sometimes the file extension does not correspond to the compression actually used. This is not a problem: you can override the compression type of any file of your choice by appending a compress_type field to the output of the LIST command.

Example 1: enforce using the pigz -i decompressor instead of sxgzip:

LIST(src:'sx:/user/examples/data/*')
| filter(file_name = 'apache_access.log.sx.gz')
| select(*, compress_type:'pigz');

Example 2: enforce using the lz4 decompressor on files with the .log extension:

LIST(src:'sx:/user/examples/data/*.log')
| select(*, compress_type:'lz4');

A subsequent PARSE command will then use the overridden compress_type when selecting a decompressor, instead of choosing one based on the file extension. Supported values for the field are: concatbz2, concatgz, deflate, deflate64, bz2, gz, lz4, pbz2, pigz, pizz, rar, xz, zip, zz, 7z.
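
For instance, the listing from Example 2 can be piped straight into parsing, assuming the parse command consumes the listing stream as described above. This is only a sketch: the trivial pattern 'LD:line EOL' (one raw line per record) is a placeholder for whatever pattern your data actually requires:

LIST(src:'sx:/user/examples/data/*.log')
| select(*, compress_type:'lz4')      // force the lz4 decompressor
| parse(pattern:'LD:line EOL')        // decompression happens at parse time
| select(line);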

Tools for Compressing Data

1. We highly recommend compressing your data with Sxgzip Compression Utilities. It creates compressed gzip files, fully compatible with the GZIP standard, which SpectX can decompress in parallel. The input data is broken into blocks of a specified size (1 MB by default), which are compressed independently and stored as members. This format is often called concatenated gzip: the individual members are exactly like individual gzip files.

Sxgzip Compression Utilities also adds metadata about the plaintext: each member's header extra field contains the plaintext offset and length of the member. This allows providing the positions of parsed records and fields in queries. It comes at the price of a slightly lower compression ratio, but is very useful for locating original data records.

NB! The block size used at compression is important for parallel decompression: SpectX must know the maximum block size used. If you compress with a block size exceeding the default of 1 MB, it must be specified using the configuration parameter query.parse.sxgz.blockSize.

2. If you're not able to use Sxgzip Compression Utilities, we recommend the popular pigz utility, which provides the best compression/decompression speed and compression ratio. SpectX supports parallel decompression of gzip files created in pigz independent mode (version 2.3.4 or higher), where the input file is broken into chunks of a specified size and each chunk is compressed as an individual stream. The default chunk size is 128 kB, which is a little too small (resulting in a slightly lower compression ratio and longer decompression time). We recommend a block size of 1 MB, which can be specified using the -b option (see the pigz man page for details).

Also, we recommend using the .pi.gz file extension, so that SpectX can automatically choose the appropriate compression type (see the table above):

$ pigz -i -b 1000 -S .pi.gz <filename>
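
The result remains a standard gzip stream, so it can be sanity-checked with any gzip tool (access.log here is a hypothetical input file):

$ pigz -i -b 1000 -S .pi.gz access.log     # produces access.log.pi.gz
$ zcat access.log.pi.gz | head             # still readable by ordinary gzip tools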

NB! The block size used at compression is important for parallel decompression: SpectX must know the maximum block size used. If you compress with a block size exceeding the default of 1 MB, it must be specified using the configuration parameter query.parse.pigz.blockSize.

pigz does not retain plaintext offset and length information, so the positions of parsed fields are not available. This is also why we recommend it only as a second choice for compression.

NB! Independent block compression mode is supported by pigz starting from version 2.3.4. Please make sure you use an appropriate version!

3. If you're aiming for the best compression ratio, we recommend the lbzip2, pbzip2 or bzip2 utilities. While all of them perform almost equally in decompression ratio and speed, the original bzip2 is the slowest at compression (as it does not support multi-threaded compression).

All of them produce independently compressed blocks with a default size of 900 kB. If the compressed content uses a larger block size, it must be specified with the configuration parameter query.parse.bzip2.blockSize.
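
A minimal shell sketch (file names are hypothetical; the rename follows the extension table above, since pbzip2 itself writes a plain .bz2 suffix):

$ lbzip2 -9 access.log                 # multi-threaded; 900 kB blocks -> access.log.bz2
$ pbzip2 -b9 error.log                 # -b9 = 9 x 100 kB = 900 kB blocks -> error.log.bz2
$ mv error.log.bz2 error.log.pi.bz2    # extension SpectX maps to concatbz2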

Working With Archives

Archives, in contrast to compressed files, can contain multiple files, preserving their names and directory structure. Unfortunately, almost none of the compressed archive formats allow the content to be processed in parallel. Also, accessing an individual file in an archive requires enumerating all of its entries, which, depending on the archive type, may require seeking through the entire archive until the required entry is found.

SpectX supports accessing files in the archives of the following formats:

  • ZIP. SpectX supports decompression of entries in Zip archives with the following methods: deflate, deflate64, xz, bzip2, stored. The last two allow for parallel decompression.
  • 7z. The following compression methods are supported: copy, LZMA, LZMA2, BZIP2, deflate, deflate64. Processing of encrypted content is not supported at the moment.
  • RAR. Files in formats up to v4 are supported; RAR v5 files are not supported.

You’ll need to specify the files you want to extract using the archive_src parameter of the PARSE input stream. Without this parameter, the archive file is treated as a plaintext file, and its processing is performed by batch-processing, single-blob or parallel-processing tasks, depending on its size and the listing stream content, as described above.

The archive_src parameter specifies a glob pattern for locating target files by their pathname within the archive. Files within the archive are always processed as plaintext, i.e. processing of nested compressed files or archives is not supported.
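
A hedged usage sketch (the archive path, the entry glob and the trivial 'LD:line EOL' pattern are placeholders for illustration):

PARSE(src:'sx:/user/examples/data/logs.zip',   // hypothetical zip archive
      archive_src:'logs/*.log',                // glob matching entry pathnames inside the archive
      pattern:'LD:line EOL')                   // placeholder pattern: one raw line per record
| select(line);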