Data Access Protocols

SpectX URI Syntax is used to define access to different storage locations. The following protocols are supported:

Source Agent

The Source Agent is a SpectX component providing secure, optimized, high-speed access to data stored in on-premises servers.

The schemes (sa:// and sas://) refer to the Source Agent RESTful protocol. It is a client-server protocol where the SA server component is installed on the data server and the SA client is the SpectX engine. Both counterparts need configuration; here we discuss the client side (details on server configuration can be found in the SourceAgent Admin Manual, Configuration section).

The sa:// scheme uses plaintext, whereas sas:// uses TLS on the transport layer.

Configuration parameters:

Name                 Description
Store name           unique name among all defined DataStores
Host                 hostname or IP address (IPv4 or IPv6) of the SA server
Port                 port the SA server is listening on (default: 8389)
Authentication key   shared secret between the SA client and server. Optional.
Is cacheable         enables caching of data by Processing Units
Hot cache period     limits time-related data caching to the specified period

SSH

The ssh scheme (ssh://) uses SSHv2 to access files stored on on-premises hosts (more precisely, any host reachable over SSH). It is relatively slow in terms of access speed but very useful for ad-hoc data analysis. For instance, during incident response a member of the security team might need logs from a live server which have not been collected on a regular basis. Live servers are often behind a firewall configured to allow only the minimum necessary protocols, blocking both public-access protocols (such as HTTP or HTTPS) and custom protocols (like the SpectX Source Agent). Getting a firewall rule opened for any of these is usually a time-consuming process. However, SSHv2 is commonly allowed through such firewalls, as it is used by the operations staff for server shell access. It is far easier and faster to create a shell account for the analyst. Then the next problem arises: analyzing logs on that server might degrade the performance of its primary workload. Where do we copy the logs?

Instead, the logs can be accessed by SpectX via SSHv2. Once an account has been created on the target host, SpectX can immediately analyze any logs that the user has access to.

There are a few requirements for the remote host configuration in order for SpectX to work over ssh:

  1. The SFTP subsystem must be enabled in sshd (it usually is)
  2. The dd and md5 utilities must be available for execution. SpectX uses dd to fetch file chunks, and md5 to calculate ETags for advanced caching support. Both are normally available out of the box on Linux and Mac OS X.
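The invocation conventions described above can be illustrated locally. This is only a sketch: the sample file, block size, and offsets are made up, and the actual parameters SpectX sends over SSH depend on the requested file range.

```shell
# Create a sample "log" file to read from (hypothetical data).
printf 'line-1\nline-2\nline-3\n' > /tmp/sample.log

# Fetch a file chunk the way the text describes: dd is executed with
# bs, skip and count parameters, input arrives on stdin and the chunk
# is read from stdout (here: 6 bytes starting at byte offset 7).
chunk=$(dd bs=1 skip=7 count=6 2>/dev/null < /tmp/sample.log)
echo "chunk: $chunk"

# Compute an ETag-style checksum the way md5sum is invoked on Linux:
# input on stdin, the 32-character hex digest taken from stdout.
etag=$(printf '%s' "$chunk" | md5sum | cut -c 1-32)
echo "etag: $etag"
```

On OS X the last step would use `md5 -q` instead of `md5sum | cut -c 1-32`, as noted in the pathMd5 parameter description.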

Configuration parameters:

Name               Description
Store name         unique name among all defined DataStores
Host               target host hostname or IP address (IPv4 or IPv6)
Port               SSH listening port (default: 22)
Username           target host account username
Password           target host account password
Public key         public SSH key that can be copied and pasted into the SSH authorized_keys file
Root directory     top-level directory at the target host where the logs can be read from. Must be read-accessible for the user. Optional. When empty, / is assumed. If it starts with ~/, it is resolved relative to the user's home directory.
pathDd             filename with full path to the dd utility executable binary. If empty, command -v dd is executed in order to determine the location of the dd utility. dd is executed with the bs, skip and count parameters, with input provided on its stdin and output read from stdout.
pathMd5            filename with full path to the md5 utility executable binary. If empty, command -v md5sum (Linux) or command -v md5 (OS X) is executed in order to determine the location of the md5 utility. md5sum (Linux) is invoked as md5sum | cut -c 1-32; md5 (OS X) is invoked as md5 -q. Both are given input on stdin; output is expected on stdout.
Is cacheable       enables caching of data by Processing Units
Hot cache period   limits time-related data caching to the specified period

SSH key pairs are generated by SpectX. Private keys are not accessible to SpectX users.
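Installing the public key on the target host amounts to appending it to the account's authorized_keys file. A sketch, using a made-up placeholder key and a scratch directory so it is safe to run anywhere (on a real host the directory would be the analyst account's ~/.ssh):

```shell
# Hypothetical public key as displayed in the datastore configuration.
PUBKEY='ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAexample spectx-datastore'

# Append it to the account's authorized_keys with the permissions
# sshd expects (700 on the directory, 600 on the file).
SSH_DIR=$(mktemp -d)/.ssh
mkdir -p "$SSH_DIR"
chmod 700 "$SSH_DIR"
printf '%s\n' "$PUBKEY" >> "$SSH_DIR/authorized_keys"
chmod 600 "$SSH_DIR/authorized_keys"
```

sshd refuses keys in world- or group-writable files, so the chmod steps matter as much as the append itself.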

Local file system

The file scheme (file://) is intended for a standalone SpectX installation where data is accessed from the local file system. Just like the SSH scheme, it is meant for ad-hoc data analysis.

Configuration parameters:
  • store name - unique name among all defined DataStores
  • root directory - top-level directory, the starting point for the path. Mandatory. Must be read-accessible for the user.

Microsoft Azure Blob Storage

The schemes (wasb:// and wasbs://) provide access to Microsoft Azure Blob Storage. The latter uses TLS to secure the underlying communication to MS Azure.

Configuration parameters:

Name                 Description
Store name           unique name among all defined DataStores
Container            the name of the target Blob Storage container storing blobs
Account              the name of the MS Azure storage account. If not specified, anonymous access is assumed
Access Key           storage account access key
Endpoint URI Suffix  endpoint URI suffix
Is cacheable         enables caching of data by Processing Units
Hot cache period     limits time-related data caching to the specified period

Amazon Simple Storage Service

The schemes (s3:// and s3s://) provide access to the Amazon Simple Storage Service (S3). The latter uses TLS to secure the underlying communication to S3.

Configuration parameters:

Name                 Description
Store name           unique name among all defined DataStores
Bucket               name of the target Amazon S3 bucket storing blobs
Region               optional name of the region where the bucket resides. Default: "us-west-2"
Access Key Id        access key ID for Amazon S3 API authentication. Mandatory.
Secret Access Key    secret access key for Amazon S3 API authentication. Mandatory.
Directory delimiter  directory separator in a file name. Optional. When empty, the default "/" is assumed.
Is cacheable         enables caching of data by Processing Units
Hot cache period     limits time-related data caching to the specified period

Google Cloud Storage

The schemes (gcs:// and gcss://) provide access to Google Cloud Storage (GCS). The latter uses TLS to secure the underlying communication to GCS.

Configuration parameters:

Name                 Description
Store name           unique name among all defined DataStores
Bucket               name of the target Google Cloud Storage bucket storing blobs
Private Key          Google service account private key. Optional, needed for accessing private buckets.
Directory delimiter  directory separator in a file name. Optional. When empty, the default "/" is assumed.
Is cacheable         enables caching of data by Processing Units
Hot cache period     limits time-related data caching to the specified period

HTTP

The http:// scheme provides access to files published over HTTP.

Configuration parameters:

Name               Description
Store name         unique name among all defined DataStores
Host               hostname or IP address (IPv4 or IPv6) of the target host
Port               HTTP listening port (default: 80)
Is cacheable       enables caching of data by Processing Units
Hot cache period   limits time-related data caching to the specified period

HTTPS

The https:// scheme provides access to files published over HTTP, using TLS to secure the underlying communication.

Configuration parameters:

Name                 Description
Store name           unique name among all defined DataStores
Host                 hostname or IP address (IPv4 or IPv6) of the target host
Port                 HTTPS listening port (default: 443)
Basic Auth username  username for the HTTP Basic Authentication scheme
Basic Auth password  password for the HTTP Basic Authentication scheme
Is cacheable         enables caching of data by Processing Units
Hot cache period     limits time-related data caching to the specified period
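For reference, HTTP Basic Authentication (RFC 7617) sends the configured username and password as a single base64-encoded token in the Authorization header. A sketch with made-up credentials:

```shell
# Hypothetical credentials from the datastore configuration.
BA_USER='analyst'
BA_PASS='s3cret'

# Basic auth sends the header "Authorization: Basic base64(user:pass)".
TOKEN=$(printf '%s:%s' "$BA_USER" "$BA_PASS" | base64)
echo "Authorization: Basic $TOKEN"
```

Because base64 is an encoding, not encryption, Basic Auth credentials are only protected by the TLS layer of the https:// scheme.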

HDFS

The scheme hdfs:// provides access to objects in the Hadoop Distributed File System.

Configuration parameters:

Name                      Description
Store name                unique name among all defined DataStores
High Availability Cluster enables the HA cluster connectivity configuration mode (for setting the addresses of the 2 NameNodes in an HA cluster)
Host                      NameNode metadata IPC service hostname or IPv4/IPv6 address (first NameNode in HA cluster mode)
Port                      NameNode listening port (default: 8020)
Host2                     second NameNode metadata IPC service hostname or IPv4/IPv6 address (HA cluster mode)
Port2                     second NameNode listening port (default: 8020; HA cluster mode)
User principal name       Kerberos principal to be used for authentication at the NameNode. If empty, authentication is not attempted
Password                  password for the principal. If empty, a keytab file is used for authentication. The keytab file is located as follows: first, an attempt is made to locate the krb5.conf file and read the keytab location from the default_keytab_name parameter in its [libdefaults] section; if it is not specified there, the file {user.home}/krb5.keytab is used
Root directory            top-level directory at the NameNode where the logs can be read from. Must be read-accessible for the user. Optional, default is /
RPC calls protection      Hadoop protection value for secured SASL connections, i.e. the value set for the "hadoop.rpc.protection" parameter in the cluster configuration
Is cacheable              enables caching of data by Processing Units
Hot cache period          limits time-related data caching to the specified period

If the target HDFS cluster requires authentication, the underlying implementation requires a proper Kerberos 5 client configuration on the SpectX host. In environments with a single Kerberos authentication realm, it might be enough to specify the system properties java.security.krb5.realm and java.security.krb5.kdc for the default realm and KDC location. If a more advanced configuration is required, or access to secure HDFS clusters running multiple Kerberos authentication realms needs to be set up, the krb5.conf file must exist in the filesystem. It has to have the default_realm property defined in its [libdefaults] section. The default realm does not have to be one of those used by the target HDFS ecosystem. If it is not, or whenever the datastore targets any other non-default Kerberos realm, the [domain_realm] section of krb5.conf must contain a correct mapping from the realm's KDC hostname to the realm name. For the contents of the krb5.conf file, refer to the MIT documentation.
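A minimal krb5.conf illustrating the settings mentioned above. All realm names, hostnames, and paths here are hypothetical placeholders, not defaults:

```ini
[libdefaults]
    # Required: the default realm (need not be one used by the HDFS cluster).
    default_realm = EXAMPLE.COM
    # Optional: keytab used when the Password parameter is left empty.
    default_keytab_name = FILE:/etc/spectx/spectx.keytab

[realms]
    HADOOP.EXAMPLE.COM = {
        kdc = kdc.hadoop.example.com
    }

[domain_realm]
    # Maps the non-default realm's KDC hostname to its realm name.
    kdc.hadoop.example.com = HADOOP.EXAMPLE.COM
```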

The Kerberos configuration is system-wide, so consult the SpectX administrators about whether any configuration has already been set up on the machine. In particular, if SpectX is configured to use Integrated Windows Authentication, some changes to that configuration may be required to support multiple Kerberos realms and to avoid errors due to conflicting settings.

Contents of different keytab files can be merged into one keytab file by means of ktutil, a Kerberos keytab maintenance utility.
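An interactive ktutil session for merging keytabs might look like the following transcript (file paths are hypothetical; rkt reads a keytab into the buffer, wkt writes the buffer out):

```
$ ktutil
ktutil:  rkt /etc/krb5/realm-a.keytab
ktutil:  rkt /etc/krb5/realm-b.keytab
ktutil:  wkt /etc/spectx/spectx.keytab
ktutil:  quit
```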

SX Resources

The scheme sx:/ provides access to the SpectX resource tree.

sx:/ is an authorityless protocol; therefore there is just one forward slash in the scheme.

There are no configuration parameters.