SourceAgent

SourceAgent is a RESTful data access server. As part of the SpectX product family, it gives other SpectX services access to local data sources through a defined API. The API lets clients list and search for blobs (files) and containers (directories) and retrieve file chunks. It also supports authentication, advanced caching, file locking and statistics, and enables parallel processing of single-stream gzip-compressed data.

Download

SourceAgent is distributed as part of the SpectX installation package. A SpectX trial is available for download at https://www.spectx.com/#signup.

Installation

SourceAgent requires Oracle’s Java Runtime Environment (JRE) 1.8 to be installed on the system. It is available from the Oracle download site. Using OpenJDK is not recommended, as it results in reduced performance. Check your Java version first by running the following command, then install or upgrade if needed:

$ java -version

SourceAgent does not require any special permissions on the system and can therefore run with standard user privileges. Follow the installation steps below.

  1. The installation package is located in the tools/ subdirectory of the SpectX installation directory:

    ./
    └── spectx/
        ├── bin/
        ├── conf/
        ├── lib/
        └── tools/
            └── sa-v{version}.tar.gz
    
  2. Unpack the tarball and change to the sa directory it creates:

    $ tar -zxf sa-v{version}.tar.gz
    $ cd sa
    
  3. Run the server with the configuration test command:

    $ bin/sa.sh -t
    

    This should print the current configuration, list of webroots and enabled API endpoints.

At this point, SourceAgent is set up with the default configuration parameters. The following directories should have been created:

Name   Content description
bin    Executable scripts to start the server
conf   Configuration files, including sa.conf
docs   Sample config files and SourceAgent API documentation
lib    Server’s jar file sourceAgent.jar
logs   Server application log files

Startup scripts

The SourceAgent java server process can be started manually:

$ java -jar lib/sa.jar [ARGS]

It accepts the following arguments:

-h,--help            Show this help and exit
-V,--version         Show version information and exit
-v,--verbose         Make generic logging verbose, use multiple times to
                     increase verbosity level
-q,--quiet           Disable logging to standard output
-t,--testconfig      Test configuration and exit
-c,--config <path>   Path to readable config file. Mandatory unless one
                     of [-h,-V] is specified

The command-line option -l was removed in version 1.3.41 in favour of the log.dir configuration file option.

Alternatively, there are two executable scripts in the bin/ directory:

  • the run script bin/sa.sh, which runs the SourceAgent java process in a terminal
  • the init.d script bin/sa-init.d.sh, which manages the server daemon as a service.

Both scripts expect the JAVA_HOME environment variable to point to a valid JRE/JDK 1.8 home directory. If it is not set, both rely on the common script bin/sa.common.sh, which provides basic functionality for identifying the correct JAVA_HOME and in turn calls bin/sa.env.sh (created by copying bin/sa.env.sh.default if it does not exist). You can therefore specify JAVA_HOME in bin/sa.env.sh.

On first run (i.e. if it does not exist yet), the scripts create the configuration file conf/sa.conf by copying conf/sa.conf.default. You then need to edit the configuration file and assign values to the api.auth.key and sysinfo.auth.key keys.

The run script accepts the same parameters as the java server process mentioned above. The init.d script can be invoked with one of the following arguments:

  • start - to start server process
  • stop - to stop running server process
  • restart - to stop and start server process
  • status - to check if server process is running
  • configtest - to check if current configuration file syntax is valid. Prints current config content to stdout.

For both executable scripts, exit value 0 means successful operation. The following values indicate errors:

1    incorrect script arguments/settings
2    pid file not found, assuming server process is not running
3    server process failed starting
4    server process is dead but pid file exists
5    java executable cannot be found
6    pid file cannot be written
7    server process stopped but pid file cannot be deleted
8    server’s jar cannot be found or is not readable
9    server’s configuration file cannot be found or is not readable
10   log dir cannot be found or is not writable
127  error in processing file paths

Any other value indicates java server process failure.
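
When the scripts are driven from automation, the exit codes above can be translated back into messages. The following is a minimal Python sketch of that lookup; the function names describe_exit_code and run_and_describe are hypothetical helpers, not part of SourceAgent.

```python
import subprocess

# Exit codes as listed above for bin/sa.sh and bin/sa-init.d.sh.
EXIT_CODES = {
    0: "success",
    1: "incorrect script arguments/settings",
    2: "pid file not found, assuming server process is not running",
    3: "server process failed starting",
    4: "server process is dead but pid file exists",
    5: "java executable cannot be found",
    6: "pid file cannot be written",
    7: "server process stopped but pid file cannot be deleted",
    8: "server's jar cannot be found or is not readable",
    9: "server's configuration file cannot be found or is not readable",
    10: "log dir cannot be found or is not writable",
    127: "error in processing file paths",
}


def describe_exit_code(code):
    """Return the documented meaning of a script exit code."""
    # Any value not in the table indicates a java server process failure.
    return EXIT_CODES.get(code, "java server process failure")


def run_and_describe(command):
    """Run a command (list of arguments) and explain its exit code."""
    result = subprocess.run(command)
    return result.returncode, describe_exit_code(result.returncode)
```

For example, run_and_describe(["bin/sa-init.d.sh", "status"]) would return the exit code of the status check together with its description.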

Configuration

The server configuration file sa.conf uses the Java properties format. Values specified in the config file override the server’s default values for configuration options. Any change to the configuration file requires a server restart to take effect.

Values can include any number of constructions of the form “${PROP}”, where PROP is the name of a system property set with the -D command-line option of the Java virtual machine. When reading the configuration, SourceAgent substitutes these with the actual values of the properties.
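
The effect of this substitution can be illustrated with a short Python sketch. It is only a model of the behaviour described above, not the server's code; the function name expand is hypothetical, and leaving unknown properties untouched is an assumption, since the handling of undefined properties is not specified here.

```python
import re

# Matches constructions of the form ${PROP}.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def expand(value, system_properties):
    """Replace each ${PROP} in value with its entry in system_properties."""
    def substitute(match):
        name = match.group(1)
        # Assumption: a property that is not defined is left as written.
        return system_properties.get(name, match.group(0))

    return _PLACEHOLDER.sub(substitute, value)
```

With system_properties set to {"SA_HOME": "/opt/sa"} (an illustrative path), expand("${SA_HOME}/logs", ...) yields "/opt/sa/logs".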

Although the default example configuration should allow the server to start and run gracefully, at least two groups of keys must be reviewed and assigned correct values: those starting with roots. and gzipscan., both of which require valid directory path names.

The configuration consists of the following options:

  • roots.<container>.path - declares a named container for a local directory whose content will be served through the SourceAgent API, in the form roots.<container>.path=</path/to/directory>. If not specified, a default mapping is created with a container named “own” pointing to the path ../logs relative to the server’s jar file location (roots.own.path=${SA_HOME}/logs)
  • roots.<container>.onlyIncremental - boolean indicating mode for handling conditional requests for growing files in named container. If enabled, then conditional requests for file contents get not-modified responses if the file has grown since last request. Default is false
  • roots.<container>.polling.interval - time interval between polling loops for each subdirectory in the named container. Gets queried only if file system monitoring is enabled. If present and set to any value greater than 0, enables explicit polling. Default value is 0, meaning that implicit polling (provided by JDK/JRE) is enabled only on filesystems which do not support filesystem modification notification (such as INotify on Linux)
  • roots.<container>.polling.threadCount - maximum number of threads to be used for explicit polling. Gets queried only if file system monitoring is enabled. Is ignored if roots.<container>.polling.interval is not specified for the container or has value of 0. Default is 1
  • roots.<container>.polling.maxEventListSizePerDirectory - max number of unconsumed filesystem modification events to queue for each subdirectory in the named container before pending events start being dropped and overflow is signalled during explicit polling. Gets queried only if file system monitoring is enabled. Is ignored if roots.<container>.polling.interval is not specified for the container or has value of 0. Default is 0 (no limits).
  • host - string specifying either hostname or IP address (both IPv4 and IPv6 supported) for server to bind to, defaults to 127.0.0.1
  • port - integer specifying a port number to accept incoming connections on, defaults to 8389
  • gzipscan.dir - pathname to a writable directory where gzip indices are to be stored. If it does not exist, the server will try to create it, so make sure the parent directory is writable by the server. If the value is null, empty or absent then gzipScan is disabled (default).
  • gzipscan.blockSize - integer indicating min gzip block size in bytes for indexing. Must not be less than 65536. Default is 1000000. Gets queried only if gzipscan.dir is specified.
  • gzipscan.executors - integer specifying max number of threads gzipScan uses for scanning. Value of 0 means it uses twice as many threads as there are processors/cores available (the default). Gets queried only if gzipscan.dir is specified.
  • gzipscan.minAge - min time period for waiting until scanning a newly discovered gzip file. Time period is calculated starting from last modified timestamp. Default value is 1000 ms. Gets queried only if gzipscan.dir is specified.
  • gzipscan.minSize - min size of files (in bytes, numerical long) that are subject to scanning. Default is 16000000 bytes. Gets queried only if gzipscan.dir is specified.
  • tls.enabled - boolean specifying mode of accepting TLS connections. If value is set to false (the default) then TLS connections are not accepted, if set to true then only TLS connections are accepted.
  • tls.certChainFile - pathname to a readable file containing X.509 certificate chain in PEM format. Gets queried only if tls.enabled is set to true. If both this and tls.keyFile are not specified, the server uses self-signed TLS certificate.
  • tls.keyFile - pathname to a readable file containing PKCS#8 private key in PEM format. Gets queried only if tls.enabled is set to true. If both this and tls.certChainFile are not specified, the server uses self-signed TLS certificate.
  • tls.keyPassword - private key password. Gets queried only if tls.enabled is set to true and tls.keyFile is specified
  • acceptorsThreadCount - integer specifying the number of threads responsible for accepting incoming connections. The value of 0 means the server uses twice as many threads as there are processors/cores available per listening socket. The value of 1 is the default.
  • ioThreadCount - integer specifying the number of worker threads handling asynchronous IO operations. A worker thread handles the IO traffic of a connection once the acceptor has accepted it and registered it with the request processor. Value of 0 means the server uses twice as many threads as there are processors/cores available (the default).
  • handlersThreadCount - integer specifying a number of worker threads handling request processors logic. Value of 0 means the server uses as many threads as ioThreadCount eventually is (the default).
  • startUpScanThreadCount - integer specifying the number of threads the server uses for the initial scan of directories referenced in roots.<container>.path directives. Gets queried only if file system monitoring is enabled. Value of 0 means the server uses one separate thread for each directory (the default).
  • listingPriorityThreadCount - integer specifying a number of threads the server uses for on-demand listing of directories which have not yet been discovered by running standard scan. Gets queried only if file system monitoring is enabled. Negative value disables this feature, and the value of 0 (the default) indicates that the server must use as many threads as there are named containers defined using roots.<container>.path directives.
  • channelIdleTimeout - the time period each incoming connection is allowed to be idle before it gets forcibly closed. A value of 0 means the OS default. The default value is 30000.
  • readLimit and writeLimit - values specifying max allowed read and write throughput correspondingly, in bytes per second. The value of 0 means no corresponding throttling is enabled (the default)
  • sysinfo.enabled - boolean value controlling if system info reporting api is enabled. The default value is true
  • sysinfo.diskReadStatsUpdateInterval - the time interval between captures of disk read stats used by system info reporting API. Gets queried only if sysinfo.enabled is set to true. The default value is 10000ms
  • sysinfo.auth.disabled - boolean value instructing the server to disable authentication of requests to the SourceAgent’s system info reporting API. Authentication is enabled by default.
  • sysinfo.auth.key - string api key for authentication of requests to the system info reporting API (in x-auth-key header). Must be set if the authentication is enabled (if sysinfo.auth.disabled is set to false).
  • api.auth.disabled - boolean value instructing the server to disable authentication of requests to the SourceAgent’s API. Authentication is enabled by default.
  • api.auth.key - string api key for authentication of requests to the SourceAgent’s API (in x-auth-key header). Must be set if the authentication is enabled (if api.auth.disabled is set to false).
  • handleHttpOptions - boolean value instructing the server to enable automatic generation of processors for OPTIONS HTTP requests to all enabled API uris, and appending CORS headers to responses to these uris. The default value is false
  • lockTtl - long value indicating newly created lock lifetime in milliseconds for incoming file lock requests. Default value is 300000ms (5 min)
  • useUnpooledBuffers - boolean value instructing the server to disable Netty’s ByteBuf pool. Default value is false
  • backlog - integer maximum queue length for incoming connection indications (requests to connect). Default value is 1024.
  • writeSpinCount - maximum loop count for a write operation until channel.write() returns a non-zero value. Default value is 8.
  • dbFilePath - path to a writable database file to hold directory scanning results. If file with specified path does not exist it will be created. If not specified then database is not used, and all metadata requests are routed to file system
  • audit.requests - boolean specifying if request info should be logged as soon as request is read in. Default is false.
  • audit.access - boolean specifying if request and response log info should be logged before response has been sent out. Default is false.
  • audit.responses - boolean specifying if request and response log info should be logged after response has been completely flushed out. Default is true.
  • log.dir - path to existing writable directory to write server logs to. If not specified then filesystem logging is disabled
  • log.rotate - boolean parameter enabling automatic daily log rotation. Default value is true.
  • log.tz - time zone ID (as defined in IANA Time Zone Database) to be used for creating log file names (when log.rotate = true) and timestamps in log records. Default value is “UTC”.
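
Some of the rules above can be checked before restarting the server. The following Python sketch is a rough illustration under stated assumptions: parse_properties handles only simple key=value lines (not the full Java properties syntax, e.g. no line continuations or escapes), and both function names are hypothetical. It checks only the auth key requirements and the gzipscan.blockSize lower bound documented above.

```python
def parse_properties(text):
    """Parse simple key=value lines; lines starting with '#' or '!' are comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_config(props):
    """Return a list of problems found against the documented rules."""
    problems = []
    # Auth keys must be set unless authentication is explicitly disabled.
    for prefix in ("api", "sysinfo"):
        disabled = props.get(prefix + ".auth.disabled", "false").lower() == "true"
        if not disabled and not props.get(prefix + ".auth.key"):
            problems.append(prefix + ".auth.key must be set")
    # gzipscan.* options are only queried when gzipscan.dir is specified.
    if props.get("gzipscan.dir"):
        block_size = int(props.get("gzipscan.blockSize", "1000000"))
        if block_size < 65536:
            problems.append("gzipscan.blockSize must not be less than 65536")
    return problems
```

Calling check_config(parse_properties(open("conf/sa.conf").read())) returns an empty list when none of these rules is violated.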

System properties

The following system properties can be specified for the server’s java process:

  • com.spectx.zlib.path - path to the zlib library/dll. The server bundle comes with prebuilt zlib native libraries for OSX, Windows and Linux platforms. Should you need a different zlib library version, use this property to specify the fully-qualified pathname of its location in bin/sa.env.sh as follows:

    JAVA_OPTS="${JAVA_OPTS} -Dcom.spectx.zlib.path=/path/to/zlib.so"
    

Logging

The default configuration is set up to produce the following log types:

  • audit logging, which includes:

    • request - contains log records for incoming http requests, which are being logged as soon as request has been read.
    • access - contains log records for incoming http requests and corresponding responses, which are being logged before response has been sent out
    • response - contains log records for incoming http requests, corresponding responses and different timing information regarding request processing, resource consumption and delivery status, which are being written as soon as responses get fully written to socket
  • generic logging, which includes:

    • error - contains log records regarding errors and warnings, i.e. log records logged with level >= WARN
    • debug - contains debugging information, i.e. log records logged with any log level lesser or equal to active one.

The verbosity of generic logging is controlled by the -v command-line switch given to the server binary at startup. If it is not specified, the active log level is WARN; verbosity can only be increased, by repeating the -v switch. Specifying the switch once sets the active log level to INFO, twice to DEBUG, and three or more times to TRACE.
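
The mapping from the number of -v switches to the active log level can be written down as a small Python function. This is only a restatement of the rule above; the function name active_log_level is hypothetical.

```python
def active_log_level(verbose_count):
    """Map the number of -v switches to the active generic log level."""
    levels = ["WARN", "INFO", "DEBUG", "TRACE"]
    # Three or more switches all select TRACE; a negative count is treated as 0.
    return levels[min(max(verbose_count, 0), len(levels) - 1)]
```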

The audit logging is controlled by means of audit.* settings in the configuration file, and does not depend on -v command-line switch.

The -q switch turns logging to stdout off completely. To enable logging to files, you must specify a valid path to a logging directory in the configuration file using the log.dir option.

If no -q command-line argument is given to the server binary at startup, each log record is printed to stdout with a string specifying its type (in upper case) following the timestamp. If the log.dir setting is specified in the configuration file, daily-rotated log files for each log type are produced under the specified directory, each placed in a monthly-rotated directory, which in turn is located in a yearly-rotated parent directory. Schematically, the following layout is created in the log directory:

YYYY/MM/YYYY.MM.DD_{TYPE}.log

where {TYPE} represents the log type mentioned above (one of request, access, response, debug, error).
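
For scripts that need to locate a given day's log file, the layout above can be reproduced with a short Python sketch. The function name log_file_path is hypothetical, and the sketch assumes that the date passed in is already expressed in the time zone used for log file names and that {TYPE} appears in lower case, as in the list above.

```python
import datetime
import posixpath

LOG_TYPES = ("request", "access", "response", "debug", "error")


def log_file_path(log_dir, day, log_type):
    """Build the YYYY/MM/YYYY.MM.DD_{TYPE}.log path below log_dir."""
    if log_type not in LOG_TYPES:
        raise ValueError("unknown log type: " + log_type)
    file_name = day.strftime("%Y.%m.%d") + "_" + log_type + ".log"
    return posixpath.join(log_dir, day.strftime("%Y"), day.strftime("%m"), file_name)
```

For instance, log_file_path("/var/log/sa", datetime.date(2020, 3, 5), "response") gives "/var/log/sa/2020/03/2020.03.05_response.log" (the directory name is illustrative).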

If value for log.rotate parameter is set explicitly to false, the layout of the log directory will be flat, and names of produced log files will not contain timestamps. The rotation of log files then can be accomplished by means of external tools (e.g. logrotate) supporting copy-and-truncate log rotation scenarios.

Timestamps in log records and log file names are in time zone specified by log.tz in configuration.

Note that if the default log configuration is overridden by any external means, the -q command-line argument is no longer supported, nor are the configuration options log.dir, log.rotate and log.tz.

Audit logs

Audit logs contain newline-separated log records with tab-separated fields. Each log field is limited to 1000 characters.

request log:
  • timestamp in format YYYY-MM-dd HH:mm:ss.SSS Z
  • request id (internally assigned). It is non-empty only if one or more -v command-line args are provided
  • remote socket address (host:port)
  • client hostname/IP address, a value of a x-forwarded-for request header, or empty if the header is missing
  • user identity, a value of a x-user-identity request header, or empty if the header is missing
  • HTTP request method
  • request uri path
  • request content length
  • user agent, a value of a user-agent request header, or empty if the header is missing
access log:
  • timestamp in format YYYY-MM-dd HH:mm:ss.SSS Z
  • request id (internally assigned). It is non-empty only if one or more -v command-line args are provided
  • remote socket address (host:port)
  • client hostname/IP address, a value of a x-forwarded-for request header, or empty if the header is missing
  • user identity, a value of a x-user-identity request header, or empty if the header is missing
  • HTTP request method
  • request uri path
  • request content length
  • HTTP response code
  • response content length
  • response type (one of C (file chunk), T (chunked transfer), O (json object), V (void/empty), E (error))
  • user agent, a value of a user-agent request header, or empty if the header is missing
response log:
  • timestamp in format YYYY-MM-dd HH:mm:ss.SSS Z
  • request id (internally assigned). It is non-empty only if one or more -v command-line args are provided
  • remote socket address (host:port)
  • client hostname/IP address, a value of a x-forwarded-for request header, or empty if the header is missing
  • user identity, a value of a x-user-identity request header, or empty if the header is missing
  • HTTP request method
  • request uri path
  • request content length
  • HTTP response code
  • response content length
  • response delivery status (one of S (success), F (failure), C (cancelled), I (incomplete))
  • response type (one of C (file chunk), T (chunked transfer), O (json object), V (void/empty), E (error))
  • user agent, a value of a user-agent request header, or empty if the header is missing
  • task id, a value of a x-task-id request header, or empty if the header is missing
  • bytes read from disk during the request processing, long
  • total real time elapsed, in milliseconds
  • total cpu time elapsed, in milliseconds
  • total usr time elapsed, in milliseconds
  • disk read cpu time elapsed, in milliseconds
  • disk read usr time elapsed, in milliseconds

The last 5 entries are non-empty if the corresponding measurements were carried out for the given request.
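
As an illustration of the record layout, the following Python sketch splits one response log line into named fields. The field names in RESPONSE_FIELDS are labels chosen here for readability, not names defined by SourceAgent, and the function name parse_response_record is hypothetical; the order follows the response log field list above.

```python
# Labels for the 20 response log fields, in the order listed above.
RESPONSE_FIELDS = [
    "timestamp", "request_id", "remote_address", "client_address",
    "user_identity", "method", "uri_path", "request_length",
    "response_code", "response_length", "delivery_status", "response_type",
    "user_agent", "task_id", "disk_bytes_read", "real_time_ms",
    "cpu_time_ms", "usr_time_ms", "disk_read_cpu_ms", "disk_read_usr_ms",
]


def parse_response_record(line):
    """Split a tab-separated response log line into a dict of fields."""
    # Strip only the newline so that empty trailing fields are preserved.
    values = line.rstrip("\n").split("\t")
    if len(values) != len(RESPONSE_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(RESPONSE_FIELDS), len(values)))
    return dict(zip(RESPONSE_FIELDS, values))
```

Empty fields come back as empty strings, so a missing header or a measurement that was not carried out can be detected by testing for "".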

Generic logs

Debug and error logs contain newline-separated log records with the following tab-separated fields:

  • timestamp in format YYYY-MM-dd HH:mm:ss.SSS Z
  • log record’s log level indicator
  • thread name
  • logger name (java class name)
  • log message (may span to multiple lines)
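
Because a log message may span multiple lines, a reader of these files has to decide where each record starts. One possible approach, sketched below in Python, is to treat a line as a new record only if it begins with a timestamp. The sketch assumes that the Z part of the timestamp is rendered as a numeric offset such as +0000; the function name split_records is hypothetical.

```python
import re

# A record starts with "YYYY-MM-dd HH:mm:ss.SSS Z" followed by a tab.
# Assumption: Z is rendered as a numeric offset such as +0000.
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4}\t")


def split_records(lines):
    """Group lines into records and split each record into its five fields."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # Continuation of a multi-line log message.
            records[-1] += "\n" + line
    # Split on the first four tabs only; the message may itself contain tabs.
    return [record.split("\t", 4) for record in records]
```

Each returned item holds timestamp, log level, thread name, logger name and the complete message, in that order.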

File system monitoring

File system monitoring gets enabled only if at least one of the following holds:

For each configured container, server sets up separate monitor. The monitor is considered “full” if it is able to signal all required filesystem modification events: file/directory creation, deletion, content modification.

Server chooses file system monitor for each configured container as follows.

  • If no explicit polling is defined for the container (roots.<container>.polling.interval = 0 or undefined), the server uses the JRE’s monitoring engine, which either uses native INotify on Linux/Unix (a “full” monitor) or falls back to its own single-threaded polling mechanism (not “full”, as it fails to signal all events). On Linux/Unix, the JRE cannot distinguish between local and remotely mounted file systems (NFS) and mistakenly enables the INotify-based engine for remote drives, where it does not work. In this case, explicit polling must be configured by assigning values to the roots.<container>.polling.* keys.
  • If polling is requested explicitly (roots.<container>.polling.interval > 0), server instantiates its own polling monitor with defined (roots.<container>.polling.threadCount) number of dedicated threads. Such monitor is considered “full”.
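
The selection rules above can be summarised as a small decision function. This Python sketch only models the description; the function name choose_monitor and the returned labels are hypothetical, and native_notifications stands for whether the JRE would pick its INotify-based engine for the container.

```python
def choose_monitor(polling_interval, native_notifications):
    """Return (monitor kind, is_full) for one container."""
    if polling_interval > 0:
        # Explicit polling configured via roots.<container>.polling.interval.
        return ("explicit polling", True)
    if native_notifications:
        # JRE engine backed by native notifications such as INotify.
        return ("JRE native", True)
    # JRE's own single-threaded polling, which does not signal all events.
    return ("JRE polling", False)
```

Note that for an NFS mount on Linux the JRE still reports native notifications, so the second branch is taken even though the events never arrive; this is the case where explicit polling has to be configured.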

If gzipscan is enabled but no database is enabled, then server keeps paths of indexed gzips for each container in heap memory (only if monitor is not “full” monitor), and actual gzip indices data in gzipscan db. If database is enabled, then server utilizes it to keep metadata.

Tuning

Backlog queue

The backlog configuration parameter sets the queue length for incoming connection indications (requests to connect) on a server socket. If a connection indication arrives when the queue is full, the connection is refused. Depending on how your OS is configured, you might still hit a limit of around 128 even with a larger backlog value. This is probably because the corresponding kernel parameter has a lower value than the one you specified for the backlog. Try setting it to the same value:

OS X:

$ sysctl -w kern.ipc.somaxconn=1024

Linux:

$ sysctl net.core.somaxconn=1024
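
To see whether the kernel setting would cap the configured backlog on Linux, the two values can be compared. The Python sketch below reads the current value from /proc/sys/net/core/somaxconn when that file exists; the function names are hypothetical, and the comparison reflects the Linux behaviour of limiting a listen backlog to somaxconn.

```python
import os

SOMAXCONN_PATH = "/proc/sys/net/core/somaxconn"


def read_somaxconn(path=SOMAXCONN_PATH):
    """Return the kernel limit as an int, or None if the file is unavailable."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return int(handle.read().strip())


def effective_backlog(configured, somaxconn):
    """Return the backlog that actually applies given the kernel limit."""
    if somaxconn is None:
        return configured
    return min(configured, somaxconn)
```

For example, effective_backlog(1024, read_somaxconn()) returns a value lower than 1024 when the kernel parameter still needs to be raised.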

ioThreadCount

The ioThreadCount configuration parameter sets the max number of threads handling asynchronous IO operations on client connections. By default, it is set to twice the number of processors/cores available, which should be enough for normal operation. However, slow clients fetching big file chunks simultaneously, in conjunction with slow disk read speed, may cause other connections to lag. In that case, try increasing the value of this parameter.
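
When choosing a new value it helps to know what the defaults resolve to on a given host. The following Python sketch applies the documented defaults for ioThreadCount and handlersThreadCount (a value of 0 means twice the number of cores, and as many as ioThreadCount, respectively); the function name resolve_thread_counts is hypothetical.

```python
import os


def resolve_thread_counts(io_thread_count=0, handlers_thread_count=0, cores=None):
    """Return the effective (io, handlers) thread counts for the given settings."""
    if cores is None:
        cores = os.cpu_count() or 1
    # ioThreadCount = 0 means twice the number of processors/cores.
    io_threads = io_thread_count if io_thread_count > 0 else 2 * cores
    # handlersThreadCount = 0 means as many threads as ioThreadCount resolves to.
    handler_threads = handlers_thread_count if handlers_thread_count > 0 else io_threads
    return io_threads, handler_threads
```

On a 4-core host the defaults resolve to 8 IO threads and 8 handler threads, which gives a baseline to increase from.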

FD limits

File descriptors are operating system resources used to represent connections and open files, among other things. If your queries lock too many files simultaneously, or if the server has to serve a large number of connections, try increasing the FD limit in the shell that starts the server application (ulimit -n number).
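
The limits currently in effect for a shell can be inspected from a process started in it. The Python sketch below uses the standard resource module (available on Unix-like systems only) and reports the limits of the Python process itself, which inherits them from the shell; the function names are hypothetical.

```python
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def can_open(required, soft_limit):
    """Tell whether the soft limit allows the required number of descriptors."""
    return soft_limit == resource.RLIM_INFINITY or required <= soft_limit
```

Comparing the expected number of simultaneous connections and locked files with the soft limit gives a rough idea of whether ulimit -n needs to be raised.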

Write spin count

The writeSpinCount configuration parameter controls how the server writes a buffer’s data into the underlying socket. A write from a buffer to the underlying socket may not transfer all data in one try, and the parameter sets the maximum loop count for a write operation until the channel’s write() method returns a non-zero value. This is a trade-off: the longer an IO thread spends attempting to fully write a single buffer, the less time it has for other channels; if the buffer is not fully written, the IO thread must register for the write event and be notified when the underlying socket becomes writable. If many parallel channels are served by a limited number of IO threads, it may make more sense to let the IO thread switch from a temporarily non-writable channel to another one that is available for writing.
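
The loop this parameter bounds can be pictured with a small simulation. The Python sketch below is a conceptual model only, not the server's or Netty's code: FakeChannel is a hypothetical stand-in for a socket channel whose first few writes transfer nothing, and write_with_spin retries up to the spin count.

```python
class FakeChannel:
    """Hypothetical stand-in for a socket channel, for illustration only."""

    def __init__(self, blocked_calls):
        self.blocked_calls = blocked_calls  # number of writes that return 0
        self.calls = 0

    def write(self, data):
        self.calls += 1
        if self.calls <= self.blocked_calls:
            return 0  # socket not writable yet, nothing transferred
        return len(data)


def write_with_spin(channel, data, write_spin_count=8):
    """Retry write() up to write_spin_count times until it returns non-zero.

    Returns the number of bytes written, or 0 if every attempt wrote nothing,
    in which case the caller would wait for a write-readiness event instead.
    """
    for _ in range(write_spin_count):
        written = channel.write(data)
        if written > 0:
            return written
    return 0
```

A lower spin count makes the thread give up on a stalled channel sooner and move on to others; a higher one keeps it retrying the same channel for longer.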

INotify

Since the Linux kernel INotify API does not support recursive listening on a directory, SourceAgent adds an inotify watch on every subdirectory under the watched directory. This process takes time linear in the number of directories in the tree being recursively watched, and requires system resources, namely INotify watches, which are limited (by default to 8192 watches per user). If you observe the error “No space left on device” in the logs, the inotify watch limit may have been reached. The solution is to increase the limit by writing a higher value to /proc/sys/fs/inotify/max_user_watches.
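
Since one watch is needed per directory, the number of directories under a container root gives an estimate of the watches required. The Python sketch below counts them and reads the current limit from /proc/sys/fs/inotify/max_user_watches; the function names are hypothetical, and symbolic links to directories are not followed.

```python
import os

MAX_USER_WATCHES_PATH = "/proc/sys/fs/inotify/max_user_watches"


def count_directories(root):
    """Count root and all directories below it (one inotify watch each)."""
    # os.walk yields one entry per directory, starting with root itself.
    return sum(1 for _ in os.walk(root))


def read_watch_limit(path=MAX_USER_WATCHES_PATH):
    """Return the current inotify watch limit, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except OSError:
        return None
```

If the sum of count_directories() over all configured container paths approaches the value returned by read_watch_limit(), the limit should be raised before starting the server.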

NFS-mounted directories

When running on Linux, SourceAgent uses the kernel INotify API to get file system modification events. However, INotify does not work with directories on NFS shares. If you have such directories defined with roots.<container>.path directives, you need to force SourceAgent to use directory polling for them by specifying values for the roots.<container>.polling.* configuration parameters for each such directory.

Native libraries unavailability

If you are running SourceAgent on a host where the directory for temporary files (/tmp) is mounted with the noexec option, you may encounter issues with unavailability of the native Zlib decompression libraries. By default, the SourceAgent process extracts the required libraries to the system-provided temporary directory, but such security settings prevent it from loading the extracted libraries from that directory. This results in java.lang.UnsatisfiedLinkError with the message “failed to map segment from shared object: Operation not permitted”, which can be observed in error logs.

To resolve the issue, provide another temporary directory for storing and executing the bundled native libraries, and specify its location with the “jna.tmpdir” system property for SourceAgent in its bin/sa.env.sh script. Alternatively, you can specify full paths to the required libraries at an executable location, as described in the System properties section.