Working with Compressed Data

Compressing your data is useful in many ways. Obviously it saves storage space (even though storage keeps getting cheaper over time). More importantly, it improves query performance: reading bytes from hard disks is a known bottleneck, and compressed data requires far fewer bytes to be read than plaintext.

Unfortunately there are some limitations to analyzing compressed data directly. To scale, the system must process the data in parallel. As long as a file is small enough for its decompressed content to fit in the memory of a processing node (less than 2 GB), it can be handled in one pass. Bigger files become a problem, however: they are usually compressed as one continuous stream, which cannot be decompressed in smaller pieces in parallel and must therefore be processed in a single thread.

The good news is that many popular compression tools can create compressed streams that decompress in parallel. Some do this by default, others require a special command-line option. They achieve it by compressing the data in blocks, each of which can be decompressed independently of the others, and hence in parallel.
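
For example, gzip itself already treats a file as a sequence of independent members: concatenating separately compressed files yields a valid gzip file that decompresses to the concatenated plaintext. A quick illustration (file names are arbitrary):

$ echo "first block" | gzip > part1.gz      # member 1
$ echo "second block" | gzip > part2.gz     # member 2
$ cat part1.gz part2.gz > combined.gz       # a valid concatenated gzip file
$ gzip -dc combined.gz                      # decompresses as one stream
first block
second block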

The following table summarizes the compression utilities that SpectX supports:

Table 1. Compression utilities supported by SpectX

Utility      Parallel decompression supported
gzip         No
pigz         No
lz4          No
xz           No
pigz -i      Yes
pigz -i -z   Yes
bzip2        Yes
lbzip2       Yes
pbzip2       Yes
sxgzip       Yes + position metadata [1]
[1] The Sxgzip Compression Utility records the plaintext offset and length of each compressed block. This allows queries to report the positions of parsed records and fields.

Compressing Data

1. We highly recommend compressing your data with the Sxgzip Compression Utility. It creates compressed gzip files, fully compatible with the GZIP standard, which SpectX can decompress in parallel. The input data is broken into blocks of a specified size (the default is 1 MB), which are compressed independently and stored as members. This format is often called concatenated gzip: indeed, the individual members are exactly like individual gzip files.

The Sxgzip Compression Utility also adds metadata about the plaintext: the extra field of each member header contains the plaintext offset and length of that member. This allows queries to report the positions of parsed records and fields. It comes at the price of a slightly lower compression ratio but is very useful for locating original data records.

NB! The block size used at compression matters for parallel decompression: SpectX must know the maximum block size that was used. When using a block size exceeding the default value of 1 MB, it must be specified via the query.parse.sxgz.blockSize SpectX configuration entry.
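
For instance, if the data was compressed with 4 MB blocks, the configuration entry could look as follows (the key is the one named above; the properties-style syntax and the size notation are assumptions, so consult the SpectX configuration reference):

query.parse.sxgz.blockSize = 4MB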

2. In case you are not able to use the Sxgzip Compression Utility, we recommend the popular pigz utility, which provides the best combination of compression/decompression speed and compression ratio. SpectX supports parallel decompression of gzip files created in pigz independent mode (version 2.3.4 or higher), where the input file is broken up into chunks of a specified size, each compressed as an individual stream. The default chunk size is 128 kB, which is a little too small (resulting in a slightly lower compression ratio and longer decompression time). We recommend a block size of 1 MB, which can be specified with the -b option (see the pigz man page for details).

We also recommend using the .pi.gz file extension, so that SpectX can automatically select the appropriate compression type (see Table 2 below):

$ pigz -i -b 1000 -S .pi.gz <filename>

NB! The block size used at compression matters for parallel decompression: SpectX must know the maximum block size that was used. When using a block size exceeding the default value of 1 MB, it must be specified via the query.parse.pigz.blockSize SpectX configuration entry.
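
For example, to compress with 4 MB independent blocks (pigz takes -b in kilobytes) and declare that block size to SpectX (the configuration syntax is an assumption, as above):

$ pigz -i -b 4096 -S .pi.gz <filename>

query.parse.pigz.blockSize = 4MB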

pigz does not retain plaintext offset and length information, so the positions of parsed fields are not available. This is also why we recommend it only as a second choice for compression.

NB! Independent block compression mode is supported by pigz starting from version 2.3.4; please make sure you use an appropriate version!
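
You can check the installed pigz version with:

$ pigz --version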

3. If you are aiming for the best compression ratio, we recommend the lbzip2, pbzip2 or bzip2 utilities. While all of them perform almost equally in compression ratio and decompression speed, the original bzip2 is the slowest at compression (as it does not support multi-threaded compression).

All of them produce independently compressed blocks with a default size of 900 kB. When using a block size exceeding this, it must be specified via the query.parse.bzip2.blockSize SpectX configuration entry.
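
For example, pbzip2 can raise the block size via its -b option (in 100 kB steps); compressing with 1500 kB blocks and declaring this to SpectX might look as follows (the configuration value syntax is an assumption, as above):

$ pbzip2 -b15 <filename>          # 15 x 100 kB = 1500 kB blocks

query.parse.bzip2.blockSize = 1500kB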

Existing Compressed Data

In case you have single-stream gzip-compressed files stored on on-premise servers, you can use SourceAgent to expose them to SpectX. Among other optimizations, SourceAgent allows single-stream gzip-compressed files to be processed in parallel: it computes the metadata necessary for parallel decompression in the background, storing it separately from the original files (which remain read-only). You must configure the directory where this metadata is stored.

This solution is applicable only to gzip-compressed files stored on on-premise servers.

SpectX supports many different compression utilities, which can produce compressed streams with different algorithms and formats. Unfortunately there is no simple way for SpectX to recognize all the combinations automatically, so it relies on file extensions instead. The following table summarizes how compression utility output modes map to the file extensions SpectX recognizes. These extensions should be assigned during compression.

Table 2. Mapping of compression utility output modes to file extensions and compress_type

Utility      Default file extension   SpectX recognized file extension   SpectX compress_type
gzip         .gz                      .gz                                gz
pigz         .gz                      .gz                                gz
pigz -z      .zz                      .zz                                zz
lz4          .lz4                     .lz4                               lz4
xz           .xz                      .xz                                xz
pigz -i      .gz                      .pi.gz                             pigz
pigz -i -z   .zz                      .pi.zz                             pizz
bzip2        .bz2                     .bz2                               pbz2
lbzip2       .bz2                     .bz2                               pbz2
pbzip2       .bz2                     .pi.bz2                            concatbz2
sxgzip       .sx.gz                   .sx.gz                             concatgz
(none)       -                        -                                  none

Sometimes the file extension does not correspond to the compression actually used. This is not a problem: you can always override the compression type. In fact, you can assign a compression type arbitrarily to any file of your choice by appending a compress_type field to the output of the LIST command.

Example 1: force the use of the pigz -i decompressor instead of sxgzip:

LIST(src:'sx:/user/examples/data/*')
 .filter(file_name = 'apache_access.log.sx.gz')
 .select(*, compress_type:'pigz');

Example 2: force the use of the lz4 decompressor on files with the .log extension:

LIST(src:'sx:/user/examples/data/*.log')
 .select(*, compress_type:'lz4');

A subsequent PARSE command will then use the overridden compress_type to select the decompressor, instead of choosing it based on the file extension.
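
For illustration, a query along the lines of the sketch below would parse the files from Example 2 using the lz4 decompressor. This is a sketch only: it assumes that PARSE accepts a listing as its src argument and that a suitable pattern has been assigned to the $pattern variable beforehand.

PARSE(pattern:$pattern,
 src:LIST(src:'sx:/user/examples/data/*.log')
  .select(*, compress_type:'lz4'));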