Data Access Protocols

SpectX URI Syntax is used to define access to different storage locations. The following protocols are supported:

Source Agent

The Source Agent is a SpectX component providing secure, optimized, high-speed access to data stored on on-premises servers.

The sa:// and sas:// schemes refer to the Source Agent RESTful protocol. It is a client-server protocol where the SA server component is installed on the data server and the SA client is the SpectX engine. Both counterparts need configuration; we discuss the client side here (details on server configuration can be found in the Configuration section of the SourceAgent Admin Manual).

The sa:// scheme uses plaintext, whereas sas:// uses TLS on the transport layer.
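
For illustration only, and assuming the store name forms the authority part of the URI (the store name and path below are hypothetical), the same file could be reached over either scheme:

  sa://dc1_logs/var/log/nginx/access.log      (plaintext)
  sas://dc1_logs/var/log/nginx/access.log     (TLS)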

Configuration parameters:

  • Store name - unique name among all defined DataStores
  • Host - hostname or IP address (IPv4 or IPv6) of the SA server
  • Port - port the SA server is listening on (default: 8389)
  • Authentication key - shared secret between SA client and server. Optional.
  • Is cacheable - enables caching data by Processing Units
  • Hot cache period - limits time related data caching to the period specified

SSH

The ssh scheme (ssh://) uses SSHv2 to access files stored on on-premises hosts (more precisely, on any host reachable over ssh). It is relatively slow in terms of access speed but very useful for ad-hoc data analysis. For instance, during incident response a member of the security team might need logs from a live server which have not been collected on a regular basis. Live servers are often behind a firewall and configured to allow only the minimum necessary protocols, blocking publicly accessible protocols (such as http or https) as well as custom protocols (like the SpectX Source Agent). Getting a firewall opened for any of these is usually a time-consuming process. However, SSHv2 is commonly allowed in such firewalls, as it is used by the operations staff for server shell access, and it is far easier and faster to create a shell account for the analyst. Then the next problem arises: analyzing logs on that server might degrade the performance of its primary function. Where do we copy the logs?

Instead of copying, the logs can be accessed by SpectX in place via SSHv2. After an account has been created on the target host, SpectX can immediately analyse the logs that the user has access to.

There are a few requirements for the remote host configuration in order for SpectX to work over ssh:

  1. The sftp subsystem must be enabled in sshd (it usually is).
  2. The dd and md5 utilities must be available for execution. SpectX uses dd for getting file chunks, and md5 for calculating ETags to provide advanced caching support (see the sketch after this list). Normally both are available on Linux and Mac OS X out of the box.
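
To illustrate the mechanism, the sketch below shows roughly equivalent commands driven from Python with the subprocess module. This is a simplified illustration, not SpectX source code; in reality the commands are executed on the remote host over the SSH channel.

  import subprocess

  def read_chunk(path, offset, size, block=8192):
      # dd receives bs/skip/count parameters, the file on stdin, and the chunk
      # is read from stdout. For simplicity, offset and size are assumed to be
      # multiples of the block size.
      with open(path, "rb") as f:
          result = subprocess.run(
              ["dd", f"bs={block}", f"skip={offset // block}", f"count={size // block}"],
              stdin=f, capture_output=True, check=True)
      return result.stdout

  def etag(path):
      # Linux variant: md5sum | cut -c 1-32 with the file on stdin
      # (on OS X, md5 -q would be used instead).
      with open(path, "rb") as f:
          result = subprocess.run(["md5sum"], stdin=f, capture_output=True, check=True)
      return result.stdout.decode()[:32]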

Configuration parameters:

  • Store name - unique name among all defined DataStores
  • Host - target host hostname or IP address (IPv4 or IPv6)
  • Port - ssh listening port (default: 22)
  • Username - target host account username
  • Password - target host account password
  • Public key - public ssh key that can be copied and pasted to the ssh authorized_keys file
  • Root directory - top level directory at the target host where the logs can be read from. Must be read accessible for the user. Optional. When empty, / is assumed. If it starts with ~/, the user's home directory is assumed.
  • pathDd - filename with full path to the dd utility executable binary. If empty, command -v dd is executed to determine the location of the dd utility. dd is invoked with bs, skip and count parameters, with input provided on its stdin and output read from its stdout.
  • pathMd5 - filename with full path to the md5 utility executable binary. If empty, command -v md5sum (Linux) or command -v md5 (OS X) is executed to determine the location of the md5 utility. md5sum (Linux) is invoked as md5sum | cut -c 1-32, and md5 (OS X) is invoked as md5 -q. Both are given input on stdin; output is expected on stdout.
  • Is cacheable - enables caching data by Processing Units
  • Hot cache period - limits time related data caching to the period specified

SSH key pairs are generated by SpectX. Private keys are not accessible to SpectX users.

Local file system

The file scheme (file://) is intended for a standalone SpectX installation where data is accessed from the local file system. Just like the SSH scheme, it is meant for ad-hoc data analysis.

Configuration parameters:
  • Store name - unique name among all defined DataStores
  • Root directory - top level directory, starting point for the path. Mandatory. Must be read accessible for the user.

Microsoft Azure Blob Storage

The schemes (wasb:// and wasbs://) provide access to Microsoft Azure Blob Storage. The latter uses TLS to secure the underlying communication to MS Azure.

Configuration parameters:

  • Store name - unique name among all defined DataStores
  • Container - the name of the target Blob Storage container storing blobs
  • Account - the name of the MS Azure storage account. If not specified, anonymous access is assumed.
  • Access Key - storage account access key
  • Endpoint URI Suffix - endpoint URI suffix
  • Is cacheable - enables caching data by Processing Units
  • Hot cache period - limits time related data caching to the period specified

Amazon Simple Storage Service

The schemes (s3:// and s3s://) provide access to the Amazon Simple Storage Service (S3). The latter uses TLS to secure the underlying communication to S3.

Configuration parameters:

  • Store name - unique name among all defined DataStores
  • Bucket - name of the target Amazon S3 bucket storing blobs
  • Region - optional name of the region where the bucket is residing. Default: “us-west-2”
  • Access Key Id - access key ID for Amazon S3 API authentication. Mandatory.
  • Secret Access Key - secret access key for Amazon S3 API authentication. Mandatory.
  • Directory delimiter - directory separator in a file name. Optional. When empty, default “/” is assumed.
  • Is cacheable - enables caching data by Processing Units
  • Hot cache period - limits time related data caching to the period specified

Google Cloud Storage

The schemes (gcs:// and gcss://) provide access to Google Cloud Storage (GCS). The latter uses TLS to secure the underlying communication to GCS.

Configuration parameters:

  • Store name - unique name among all defined DataStores
  • Bucket - name of the target Google Cloud Storage bucket storing blobs
  • Private Key - Google service account private key. Optional, needed for accessing private buckets.
  • Directory delimiter - directory separator in a file name. Optional. When empty, default “/” is assumed.
  • Is cacheable - enables caching data by Processing Units
  • Hot cache period - limits time related data caching to the period specified

HTTP

The http:// scheme provides access to files served over HTTP.

Configuration parameters:

  • Store name - unique name among all defined DataStores
  • Host - hostname or IP address (IPv4 or IPv6) of the target host
  • Port - HTTP listening port of the target host (default: 80)
  • Is cacheable - enables caching data by Processing Units
  • Hot cache period - limits time related data caching to the period specified

HTTPS

The https:// scheme provides access to files served over HTTPS, using TLS to secure the underlying communication.

Configuration parameters:

  • Store name - unique name among all defined DataStores
  • Host - hostname or IP address (IPv4 or IPv6) of the target host
  • Port - HTTPS listening port of the target host (default: 443)
  • Basic Auth username - username for the HTTP Basic Authentication scheme
  • Basic Auth password - password for the HTTP Basic Authentication scheme
  • Is cacheable - enables caching data by Processing Units
  • Hot cache period - limits time related data caching to the period specified

HDFS

The scheme hdfs:// provides access to objects in the Hadoop Distributed File System.

Configuration parameters:

  • Store name - unique name among all defined DataStores
  • Host - NameNode metadata IPC service hostname or IPv4/IPv6 address
  • Port - NameNode listening port (default: 8020)
  • User principal name - Kerberos principal to be used for authentication at the NameNode. If empty, authentication is not attempted.
  • Password - password for the principal. If empty, a keytab file is used for authentication. The keytab file is located as follows (illustrated in the sketch after this list): first, an attempt is made to locate the krb5.conf file and read the keytab file location from its default_keytab_name parameter in the [libdefaults] section; if it is not specified there, the file {user.home}/krb5.keytab is used.
  • Root directory - top level directory at the NameNode where the logs can be read from. Must be read accessible for the user. Optional, default is /.
  • RPC calls protection - Hadoop protection value for secured SASL connections, i.e. the value set for the “hadoop.rpc.protection” parameter in the cluster configuration
  • Is cacheable - enables caching data by Processing Units
  • Hot cache period - limits time related data caching to the period specified
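
The keytab lookup order described under Password could be sketched roughly as follows. This is an illustration only, not SpectX source code; the krb5.conf location and its parsing are simplified.

  import os
  import re

  def locate_keytab(krb5_conf="/etc/krb5.conf"):
      # 1. Try the default_keytab_name parameter in the [libdefaults] section of krb5.conf.
      try:
          with open(krb5_conf) as f:
              conf = f.read()
          section = re.search(r"\[libdefaults\](.*?)(?=\n\[|\Z)", conf, re.S)
          if section:
              keytab = re.search(r"default_keytab_name\s*=\s*(\S+)", section.group(1))
              if keytab:
                  path = keytab.group(1)
                  return path[5:] if path.startswith("FILE:") else path
      except OSError:
          pass
      # 2. Fall back to {user.home}/krb5.keytab.
      return os.path.join(os.path.expanduser("~"), "krb5.keytab")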

If the target HDFS cluster requires authentication, the underlying implementation requires a proper Kerberos 5 client configuration to be set up on the SpectX host in the krb5.conf file. For the content of the krb5.conf file, refer to the MIT Kerberos documentation.
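
As an illustration, a minimal krb5.conf could look like the example below; the realm and host names are placeholders and must be replaced with values from the actual environment:

  [libdefaults]
      default_realm = EXAMPLE.COM
      default_keytab_name = FILE:/etc/krb5.keytab

  [realms]
      EXAMPLE.COM = {
          kdc = kdc.example.com
          admin_server = kdc.example.com
      }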

The Kerberos configuration is system-wide, so consult the SpectX administrators to find out whether one has already been set up on the machine. In particular, if SpectX is configured to use Kerberos authentication against Active Directory for querying users' group membership during log-on, some changes to the configuration may be needed to support multiple realms and to avoid errors caused by conflicting settings.

SX Resources

The sx:/ scheme provides access to the SpectX resource tree.

sx:/ is an authorityless scheme, therefore the URI contains just one forward slash after the scheme name.
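
For example, a resource in the tree could be referenced with a path-only URI such as the following (the path shown is hypothetical):

  sx:/path/to/resource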

There are no configuration parameters.