Input Data Browser

Hint

To query local data quickly and efficiently, enter the full file path into the search bar and press Enter.

When a user clicks Input data, a new window appears. Here users can browse datastores or query data directly by entering a file path or URL into the search bar.

Accessing local files requires no configuration. Details on how to configure access to data in other locations, such as Amazon S3, can be found in input data.

File Functions

The data browser, where users can configure new datastores from any of the files they have access to.

Slide Bar

A slider lets users move to any position within the raw content of the file. The box to the right of the slider allows the position to be entered manually. Once the slider position has been changed, all the file function buttons operate on a virtual slice of the file defined by the new starting position and length. These values are also reflected in the URI of the file in the search bar.
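The idea of a virtual slice can be illustrated outside SpectX with ordinary file I/O. The following is a minimal sketch, not SpectX code: the helper name `read_slice` and its parameters are hypothetical, and it only assumes that a slice is fully described by a starting byte position and a length.

```python
import os
import tempfile


def read_slice(path, start, length):
    """Return up to `length` raw bytes of `path`, starting at byte `start`."""
    with open(path, "rb") as f:
        f.seek(start)          # jump to the starting position of the slice
        return f.read(length)  # may return fewer bytes near the end of file


# Usage: build a small throwaway file and read a slice from the middle of it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789abcdef")
    sample_path = tmp.name

middle = read_slice(sample_path, start=4, length=6)  # bytes at offsets 4..9
os.remove(sample_path)
```

Everything that follows (preview, pattern preparation, download) can then be thought of as operating on the returned bytes rather than on the whole file.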

Preview Button

The Preview button displays the content of the current file “slice”. What is shown depends on the compression type, which is detected automatically by analyzing the beginning of the slice. For uncompressed data, the preview shows the first 16KB of the raw content. For compressed data (gzip, bzip, lz4, etc.), it shows the first 16KB of decompressed data (see figure below). For compressed archives (zip, rar or 7z), it shows the list of archive member entries with some additional properties; each entry must be selected and opened individually to access its contents.
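Detection based on the beginning of a slice is commonly done by comparing the leading bytes with well-known format signatures. The sketch below shows that general technique using the Python standard library; it is an illustration only and is not taken from SpectX. The names `detect_compression` and `preview` are hypothetical, and only gzip and bzip2 are actually decompressed because those are the formats the standard library covers.

```python
import bz2
import gzip
import io

PREVIEW_BYTES = 16 * 1024  # the 16KB preview size described above

# Published leading signatures ("magic bytes") of each format.
MAGIC = [
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\x04\x22\x4d\x18", "lz4"),   # LZ4 frame format
    (b"PK\x03\x04", "zip"),         # non-empty ZIP archive
    (b"7z\xbc\xaf\x27\x1c", "7z"),
    (b"Rar!\x1a\x07", "rar"),
]


def detect_compression(head):
    """Return a format name for the leading bytes, or None if unrecognized."""
    for magic, name in MAGIC:
        if head.startswith(magic):
            return name
    return None


def preview(data):
    """Return the first 16KB of content, decompressing gzip/bzip2 input."""
    kind = detect_compression(data)
    if kind is None:
        return data[:PREVIEW_BYTES]
    if kind == "gzip":
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as f:
            return f.read(PREVIEW_BYTES)
    if kind == "bzip2":
        with bz2.BZ2File(io.BytesIO(data)) as f:
            return f.read(PREVIEW_BYTES)
    # lz4 and the archive formats need third-party libraries.
    raise NotImplementedError(f"{kind} preview is not covered by this sketch")
```

Only as much data as the preview needs is decompressed, which keeps the operation cheap even for large inputs.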

Prepare Pattern Button

The Prepare Pattern button prepares a pattern for extracting data from the file, based on inbuilt fuzzy logic. The resulting pattern can be customized as required, or a pattern can be written from scratch (best for raw or unstructured data).

Prepare Query Button

The Prepare Query button applies the best matching pattern from SpectX’s database and prepares the data to be queried (best for structured data in a commonly readable format).

Download Button

The Download button downloads the file data to local storage on the user’s device.

Hex

The “Hex” checkbox switches the preview between raw and hex modes. The encoding dropdown menu offers several encodings in which the viewed content can be represented.
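To make the difference between the two modes concrete, the sketch below renders the same bytes as a hex dump and as text decoded with a chosen encoding. It is a generic illustration, not SpectX's rendering: the function names and the dump layout (offset, hex bytes, printable characters) are assumptions made for the example.

```python
def hex_preview(data, width=16):
    """Render bytes as lines of: offset, hex values, printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as dots in the text column.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return "\n".join(lines)


def text_preview(data, encoding="utf-8"):
    """Decode bytes as text; undecodable bytes become U+FFFD."""
    return data.decode(encoding, errors="replace")


sample = b"caf\xe9 GET /index.html\n"
print(hex_preview(sample))
print(text_preview(sample, "latin-1"))
```

The same byte `0xe9` reads as "é" in Latin-1 but is invalid on its own in UTF-8, which is why choosing the right encoding matters when inspecting raw content.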

Compressed Data

For compressed data and compressed archive member entries, a Download Plaintext button is displayed. It downloads the decompressed content of the file or archive member to the local hard disk. In addition, a “Raw” checkbox controls whether the preview shows the content decompressed or as stored (“raw”).

See also

For more on supported compression formats, see Compressed data.

Search Bar Modes

The input data browser has four modes of displaying the file and directory hierarchies. The mode is automatically chosen depending on the content of the search bar:

  • if it is empty, the default view appears. It lists the defined datastores in two columns, Store Name and Type. The list can be filtered by typing characters to match against the name or type of the target datastore into the small filter boxes in the corresponding columns.
  • if it contains the URI of a directory, the directory content view is displayed. It shows all the directory items, their sizes (in bytes) and, if known, their last modification times. Note that the size of a directory is datastore type-specific and is not necessarily the sum of the sizes of its entries; in most cases it is the size of the directory’s meta information stored on disk in the datastore. By default, the view shows an alphabetically sorted list of directory names (preceded by a slash) followed by an alphabetically sorted list of files. The list can be re-sorted by clicking on one of the three columns and then choosing the sort direction.
  • if it contains a URI with glob patterns, the search result view is displayed. Compared with the directory content view, it adds a column for the type of each entry found (“file” or “dir”) and shows the entry’s path in the datastore instead of its name in the directory. Note that glob patterns can be specified only for entries in the current directory.
  • finally, if it contains the URI of a file, the file details view is displayed. In addition to the file name, size and last modification date, it provides a Content Available column, which contains true for files whose content SpectX can read, or otherwise a string giving the reason for content unavailability reported by the datastore.

The data browser with advanced options highlighted.
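The way the four views follow from the search bar content can be summarized as a small decision function. This is a sketch of the rules as listed above, not SpectX's implementation. Two things in it are assumptions: the set of characters treated as glob syntax (`*`, `?`, `[`), and the hypothetical `is_directory` callback, which stands in for asking the datastore whether a URI names a directory.

```python
from enum import Enum


class View(Enum):
    DATASTORES = "default view (list of datastores)"
    DIRECTORY = "directory content view"
    SEARCH = "search result view"
    FILE = "file details view"


GLOB_CHARS = set("*?[")  # assumed glob metacharacters


def choose_view(search_bar, is_directory):
    """Pick the view for the given search bar text."""
    text = search_bar.strip()
    if not text:
        return View.DATASTORES
    if any(ch in GLOB_CHARS for ch in text):
        return View.SEARCH
    if is_directory(text):
        return View.DIRECTORY
    return View.FILE


# Usage with a stand-in directory test: treat a trailing slash as a directory.
looks_like_dir = lambda uri: uri.endswith("/")
print(choose_view("", looks_like_dir).value)
print(choose_view("file:///var/log/", looks_like_dir).value)
print(choose_view("file:///var/log/*.log", looks_like_dir).value)
print(choose_view("file:///var/log/syslog", looks_like_dir).value)
```

The glob check comes before the directory check because a URI containing a pattern does not name a single entry, so it cannot be classified as a file or directory on its own.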

Selecting multiple data sources

Working with a single data file is often not enough. You can specify one or more files, from the same or different locations, directly in the PARSE or LIST commands.

Note

You may also include connection-specific parameters in URIs, for example:

Microsoft Azure blob store public (anonymous) access:
wasb://flightdelay@hditutorialdata.blob.core.windows.net/flightdelays.hql
    where "flightdelay" is the name of the container and
    "hditutorialdata.blob.core.windows.net" is the endPointUriSuffix

Amazon S3 store public (anonymous) access:
s3://us-east-1@big-data-benchmark/pavlo/text/tiny/crawl/part-00000
    where "us-east-1" is the region
    and "big-data-benchmark" is the name of the bucket
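Both example URIs put the connection-specific parameter in front of an `@` sign in the authority part. The sketch below splits them with Python's standard `urllib.parse` to show which piece is which. It is an illustration only: the helper `split_store_uri` and the neutral field name `prefix` are hypothetical, and nothing here reflects how SpectX parses URIs internally.

```python
from urllib.parse import urlsplit


def split_store_uri(uri):
    """Split '<scheme>://<prefix>@<host>/<path>' into its parts."""
    parts = urlsplit(uri)
    # The text before '@' is the connection-specific parameter
    # (container name for wasb, region for s3 in the examples above).
    prefix, _, host = parts.netloc.rpartition("@")
    return {
        "scheme": parts.scheme,
        "prefix": prefix,
        "host": host,
        "path": parts.path,
    }


azure = split_store_uri(
    "wasb://flightdelay@hditutorialdata.blob.core.windows.net/flightdelays.hql"
)
s3 = split_store_uri(
    "s3://us-east-1@big-data-benchmark/pavlo/text/tiny/crawl/part-00000"
)
print(azure["prefix"], azure["host"])  # container, endPointUriSuffix
print(s3["prefix"], s3["host"])        # region, bucket
```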

Warning

Do not specify credentials directly in the URI! Doing so causes the credentials to be written, unprotected, to SpectX audit logs.