Configuration

This section details the configuration settings for SourceAgent. These settings can be changed by overriding the default values given during the first run, or by editing the SourceAgent’s configuration file later on.

Configuration items in SourceAgent’s configuration file conf/sa.conf follow the Java properties format. Values specified in the configuration file override the default values used by the server. Any change to the configuration requires a server restart to take effect.

Values can include any number of constructions of the form “${PROP}”, where PROP is the name of an environment variable or, if that variable is undefined, the name of a system property set using the -D command-line option of the Java virtual machine. When reading the configuration, SourceAgent substitutes these constructions with the actual values of the environment variables or system properties. These can be set in the environment variable definition script.

One of the properties actively used in the default configuration template is SA_HOME, which the startup script sets to the value of the environment variable of the same name.
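
As a brief illustration of the substitution, the fragment below uses ${SA_HOME} in two values. The first line is the default mapping documented for the container named “own”; the gzipscan subdirectory in the second line is a hypothetical location chosen only for this sketch.

```properties
# ${SA_HOME} is replaced with the value of the SA_HOME environment variable
# (or the system property of the same name if the variable is undefined)
roots.own.path=${SA_HOME}/logs

# Hypothetical index directory, resolved under SA_HOME
gzipscan.dir=${SA_HOME}/gzipscan
```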

NB! Do not rename the SourceAgent’s configuration file. SourceAgent searches for configuration file only by the name ${SA_HOME}/conf/sa.conf.

This file will not be modified or deleted during upgrades in the future.

Although the default configuration should allow the server to start and run gracefully, there are at least two groups of keys that must be reviewed and assigned correct values: those starting with roots. and those starting with gzipscan., both of which require valid directory path names.

The following objects are parts of configuration.

Containers

SourceAgent exposes contents of local directories configured as virtual root containers through its API.
Container names are unique, but the target local directories are not required to be. Normally, though, each root container points to a local directory that is the root of a separate, independent branch of the local file system hierarchy, i.e. none of the root container targets is located inside the file system branch of another container’s target.
Having the same physical directories or files in different root containers when GzipScan or the metadata database is enabled may lead to duplicated consumption of file system and processing resources, as explained in the File System Monitoring section. SourceAgent always follows symbolic links, and this feature can be used effectively to avoid such issues.

Below are the settings for configuring root containers exposed through the SourceAgent API.

  • roots.<container>.path - declares a named container for a local directory whose content will be served through the SourceAgent API, in the form roots.<container>.path=</path/to/directory>. If not specified, a default mapping is created with a container named “own” pointing to the path ../logs relative to the server’s jar file location (roots.own.path=${SA_HOME}/logs)
  • roots.<container>.onlyIncremental - boolean indicating mode for handling conditional requests for growing files in named container. If enabled, then conditional requests for file contents get not-modified responses if the file has grown since last request. Default is false
  • roots.<container>.polling.interval - time interval to be used between polling loops for each subdirectory in the named container. Gets queried only if file system monitoring is enabled. If present and set to any value greater than 0, explicit polling is enabled. Default value is 0, meaning that implicit polling (provided by the JDK/JRE) is enabled only on file systems that do not support filesystem modification notification (such as inotify on Linux)
  • roots.<container>.polling.signal - a signal name to use for triggering polling (instead of time interval or cron) of each subdirectory in the named container. Gets queried only if file system monitoring is enabled. If present then enables explicit polling.
  • roots.<container>.polling.cron - a cron-like time specification expression for triggering polling (instead of time interval or signals) of each subdirectory in the named container. Gets queried only if file system monitoring is enabled. If present then enables explicit polling.
    The expression line has six time and date fields. Polling is started when the second (0-59), minute (0-59), hour (0-23), and month of year (1-12) fields match the current time, and when at least one of the two day fields (day of month (1-31), or day of week (0(Sun)-6(Sat), 7->0(Sun))) matches the current time.
    A field may be an asterisk (*), which always stands for “first-last”.
    Ranges of numbers are allowed. Ranges are two numbers separated with a hyphen. The specified range is inclusive. For example, 8-11 for an “hours” entry specifies execution at hours 8, 9, 10 and 11.
    Lists are allowed. A list is a set of numbers (or ranges) separated by commas. Examples: “1,2,5,9”, “0-4,8-12”.
    Step values can be used in conjunction with ranges. Following a range with “/<number>” specifies skips of the number’s value through the range. For example, “0-23/2” can be used in the hours field to specify command execution every other hour. Steps are also permitted after an asterisk, so if you want to say “every two hours”, just use “*/2”.
  • roots.<container>.polling.threadCount - maximum number of threads to be used for explicit polling. Default is 1. Gets queried only if file system monitoring is enabled.
  • roots.<container>.polling.maxEventListSizePerDirectory - maximum number of unconsumed filesystem modification events to queue for each subdirectory in the named container during explicit polling; once the limit is reached, pending events are dropped and an overflow is signalled. Default is 0 (no limit). Gets queried only if file system monitoring is enabled.
  • roots.<container>.compress.files - a glob pattern specifying names of files to be compressed in transit. The pattern syntax is the same as one used for SpectX uris, with exception regarding multiple star notation: as the pattern targets file names (not paths), the multiple star sequence gets treated as a single star.
    As an example, consider glob pattern *.@(log|txt), which selects all files with names ending with either .log or .txt.
    Note that although compression may save bandwidth, it results in additional processing on both the SourceAgent side and its client’s side. It also prevents SourceAgent from using zero-copy operations when transferring files.
    Should you use this setting, avoid overly broad patterns, as they may result in compressing already compressed files (gzip or bzip2 files, for instance). This does not improve bandwidth usage; it wastes processing power and leads to longer query running times.
    By default this setting is not set, so all files from this container are served to a requester directly, without compression.
  • roots.<container>.compress.level - an integer compression level to be used for compressing files specified by roots.<container>.compress.files filter. 1 yields the fastest compression and 9 yields the best compression. The default compression level is 6.
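
To illustrate how these settings combine, here is a minimal sketch of one container definition. The container name app-logs and its path are hypothetical, and the order of the cron fields (second, minute, hour, day of month, month, day of week) is assumed to follow the usual cron convention; it is not confirmed by this section.

```properties
# Hypothetical container "app-logs" serving a local directory
roots.app-logs.path=/var/log/myapp

# Explicit polling at second 0 of every 5th minute; only consulted when
# file system monitoring is enabled. The field order is an assumption.
roots.app-logs.polling.cron=0 */5 * * * *
roots.app-logs.polling.threadCount=1

# Compress only plain-text files in transit, at the default level
roots.app-logs.compress.files=*.@(log|txt)
roots.app-logs.compress.level=6
```

The narrow glob keeps already compressed files, such as gzip archives, out of the compression filter.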

API Authentication

Access authentication to containers through API can either be explicitly disabled entirely (opening world-readable access to all configured containers), or enabled for particular containers, either using separate access keys for each container or using a master key allowing access to all containers, or using both key types simultaneously. Note that if the API authentication is enabled (the default behavior), it is an error to have configured containers with no explicit authentication settings specified; it is required then for each container to either disable authentication, or set either master key or container-specific key.
By API design, calls targeting locked files do not require key authentication, as these use randomly generated short-lived lock ids.

The following properties control the authentication setup.

  • api.auth.disabled - boolean instructing the server to disable authentication of requests to the SourceAgent’s API. Authentication is enabled by default. Note that if this option is set explicitly to true then all configured containers are accessible without authentication.
  • api.auth.key - string key (the master key) for authentication of requests to the SourceAgent’s API for all configured containers. Can be set if authentication is enabled (if api.auth.disabled is not specified or is set to false). Alternatively, container-specific keys may be defined for each container. By default the key is not set.
  • roots.<container>.auth.keys - comma-separated list of API authentication key names defined using api.auth.key.<name> properties. Access to the container gets granted only to requests with specified keys (along with ones with master key if it is set).
  • api.auth.key.<name> - Named API authentication key to be used in roots.<container>.auth.keys for configuring per-container authentication. Note that all authentication keys specified using this setting must be unique, and cannot be same as one specified as a value for the master key.
  • roots.<container>.auth.disabled - boolean instructing the server to disable authentication of requests to the named container.

For instance, the configuration below defines three containers for application logs of different environments, one for SourceAgent’s own logs, and one for public data. The master key grants access to every container, while the public container can be accessed without any key. With the per-container keys set, the logs of all environments can be accessed with the admins’ key, dev environment logs with both the developers’ and the testers’ keys, and test environment logs with the testers’ key:

...
    # App logs from different environments
    roots.prod-logs.path=/data/prod
    roots.dev-logs.path=/data/dev
    roots.test-logs.path=/data/test
    # SA's own logs
    roots.sa-logs.path=${SA_HOME}/logs
    # World-readable data
    roots.public.path=/data/public

    # per-container access list definitions, each uses list of key names (api.auth.key.<name>)
    roots.prod-logs.auth.keys=admins
    roots.dev-logs.auth.keys=developers,testers,admins
    roots.test-logs.auth.keys=testers,admins
    # disable authentication for public container
    roots.public.auth.disabled=true

    # per-container access keys definitions
    api.auth.key.admins=<admins key>
    api.auth.key.testers=<testers key>
    api.auth.key.developers=<developers key>

    # master key, allows access to all containers
    api.auth.key=<master key>
...

If the api.auth.key property were not specified, we would need to set the roots.sa-logs.auth.keys property to define an access list for the sa-logs container. Alternatively, we could set roots.sa-logs.auth.disabled to true to make the container world-readable.
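
As a sketch of these two alternatives for the sa-logs container when no master key is configured (reusing the admins key name from the example above is only an illustration):

```properties
# Option 1: grant access to sa-logs through a named key
# (assumes api.auth.key.admins is defined as in the example above)
roots.sa-logs.auth.keys=admins

# Option 2: make sa-logs world-readable instead (use one option, not both)
#roots.sa-logs.auth.disabled=true
```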

SysInfo API

By default, if not explicitly disabled, SourceAgent exposes a system info reporting API, which gives access to various runtime metrics. The following configuration parameters can be used to set it up.

  • sysinfo.enabled - boolean value controlling if system info reporting api is enabled. The default value is false.
  • sysinfo.auth.disabled - boolean value instructing the server to disable authentication of requests to the system info reporting API. The authentication is enabled by default. Note that if this option is set explicitly to true then the API is accessible without authentication.
  • sysinfo.auth.key - string API key for authentication of requests to the system info reporting API. Must be set if authentication is enabled (if sysinfo.auth.disabled is not specified or is set to false).
  • sysinfo.diskReadStatsUpdateInterval - the time interval between captures of disk read stats used by system info reporting API. Gets queried only if sysinfo.enabled is set to true. The default value is 10000ms.
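
A minimal sketch of a key-protected SysInfo API setup might look as follows. The key value is a placeholder, and the API is enabled explicitly here so that the fragment does not rely on the default.

```properties
# Enable the system info reporting API explicitly
sysinfo.enabled=true

# Authentication stays enabled (sysinfo.auth.disabled is not set),
# so a key must be provided; the value below is a placeholder
sysinfo.auth.key=<sysinfo key>
```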

GzipScan

Support for parallel processing of ordinary gzip files is provided by the GzipScan module. When enabled (it is not by default), the module scans gzip files in the source data directories of each configured container in the background and creates indices for zlib blocks of a certain length in the configured directory. The module enhances SourceAgent’s API by exposing endpoints that provide access to indexed gzip chunks for parallel processing.

The following properties control the module’s setup.

  • gzipscan.dir - pathname to a writable directory where gzip indices are to be stored. In case it does not exist the server will try to create it. Make sure the parent directory is writable to server. If the value is empty or the property is not specified then GzipScan is disabled (default).
  • gzipscan.blockSize - integer indicating min gzip block size in bytes for indexing. Must not be less than 65536. Default is 1000000. Gets queried only if gzipscan.dir is specified.
  • gzipscan.executors - integer specifying the maximum number of threads GzipScan uses for scanning. A value of 0 means it uses twice as many threads as there are processors/cores available (the default). Gets queried only if gzipscan.dir is specified.
  • gzipscan.minAge - minimum time to wait before scanning of a newly discovered gzip file starts. The period is counted from the file’s last modified timestamp. Default value is 1000 ms. Gets queried only if gzipscan.dir is specified.
  • gzipscan.minSize - minimum size (in bytes, a long integer) of files that are subject to scanning. Default is 16000000 bytes. Gets queried only if gzipscan.dir is specified.
  • gzipscan.exclusions - comma-separated list of valid container names. GzipScan in these containers will not be enabled. If not specified, the GzipScan will be enabled in all declared containers (if gzipscan.dir is specified).
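
For illustration, a GzipScan setup that keeps the documented defaults and skips one container could be sketched as below. The index directory and the container name public are assumptions made for this example.

```properties
# Setting gzipscan.dir is what enables GzipScan; the path is hypothetical
gzipscan.dir=${SA_HOME}/gzipscan

# Documented defaults, written out explicitly
gzipscan.blockSize=1000000
gzipscan.executors=0
gzipscan.minSize=16000000

# Do not index gzip files in the (hypothetical) container "public"
gzipscan.exclusions=public
```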

Logging

The default configuration is set up to produce the following log types:

  • audit logging, which includes:
    • request - contains log records for incoming http requests, which are being logged as soon as request has been read.
    • access - contains log records for incoming http requests and corresponding responses, which are being logged before response has been sent out
    • response - contains log records for incoming http requests, corresponding responses and different timing information regarding request processing, resource consumption and delivery status, which are being written as soon as responses get fully written to socket
  • generic logging, which includes:
    • error - contains log records regarding errors and warnings, i.e. log records logged with level >= WARN
    • debug - contains debugging information, i.e. log records logged with any log level less than or equal to the active one.

Record format

Audit logs

Audit logs contain newline-separated log records with tab-separated fields. The length of each log field is limited to 1000 characters.

request log:

  • timestamp in format yyyy-MM-dd HH:mm:ss.SSS Z
  • log_type field containing string value request. Is present only when log destination is set to stdout
  • request id (internally assigned). It is non-empty only if log verbosity is INFO or higher
  • remote socket address (host:port)
  • client hostname/IP address, a value of a x-forwarded-for request header, or empty if the header is missing
  • user identity, a value of a x-user-identity request header, or empty if the header is missing
  • count of characters in a value of a x-auth-key request header, or empty if the header is missing (since v1.4.36)
  • HTTP request method
  • request uri path
  • request content length
  • user agent, a value of a user-agent request header, or empty if the header is missing

access log:

  • timestamp in format yyyy-MM-dd HH:mm:ss.SSS Z
  • log_type field containing string value access. Is present only when log destination is set to stdout
  • request id (internally assigned). It is non-empty only if log verbosity is INFO or higher
  • remote socket address (host:port)
  • client hostname/IP address, a value of a x-forwarded-for request header, or empty if the header is missing
  • user identity, a value of a x-user-identity request header, or empty if the header is missing
  • name of API key used for request authentication, or empty if authentication has not happened (since v1.4.36)
  • HTTP request method
  • request uri path
  • request content length
  • HTTP response code
  • response content length
  • response type (one of C (file chunk), T (chunked transfer), O (json object), V (void/empty), E (error))
  • user agent, a value of a user-agent request header, or empty if the header is missing

response log:

  • timestamp in format yyyy-MM-dd HH:mm:ss.SSS Z
  • log_type field containing string value response. Is present only when log destination is set to stdout
  • request id (internally assigned). It is non-empty only if log verbosity is INFO or higher
  • remote socket address (host:port)
  • client hostname/IP address, a value of a x-forwarded-for request header, or empty if the header is missing
  • user identity, a value of a x-user-identity request header, or empty if the header is missing
  • name of API key used for request authentication, or empty if authentication has not happened (since v1.4.36)
  • HTTP request method
  • request uri path
  • request content length
  • HTTP response code
  • response content length
  • response delivery status (one of S (success), F (failure), C (cancelled), I (incomplete))
  • response type (one of C (file chunk), T (chunked transfer), O (json object), V (void/empty), E (error))
  • user agent, a value of a user-agent request header, or empty if the header is missing
  • task id, a value of a x-task-id request header, or empty if the header is missing
  • bytes read from disk during the request processing, long
  • total real time elapsed, in milliseconds
  • total cpu time elapsed, in milliseconds
  • total usr time elapsed, in milliseconds
  • disk read cpu time elapsed, in milliseconds
  • disk read usr time elapsed, in milliseconds

The last 5 entries are non-empty if corresponding measurements were carried out for the given request.

The audit logging is controlled by means of audit.* settings in the configuration file. It takes place with built-in verbosity which does not depend on the verbosity of generic logging.

Generic logs

Debug and error logs contain newline-separated log records with the following tab-separated fields:

  • timestamp in format yyyy-MM-dd HH:mm:ss.SSS Z
  • log_type, contains “debug” or “error” correspondingly, is present only when log destination is set to stdout
  • log record’s log level indicator
  • thread name
  • logger name (java class name)
  • log message (may span to multiple lines)

The verbosity

The verbosity of generic logging is controlled by the configuration setting log.level. However, the value set in the configuration can be overridden by specifying the command-line argument -v. Specifying the switch once sets the active log level to INFO, twice sets it to DEBUG, and three or more times sets it to TRACE. On Windows, these switches can be specified as command-line arguments for the SourceAgent startup script. On Linux, Arch Linux ARM and Mac OSX they can be specified as the value of the SA_LAUNCHER_ARGS environment variable in the environment variable definition script.

Destination

Unless the -q command-line switch is specified for the SourceAgent server Java process, the server prints all log messages to standard output. In this case the log records can be distinguished by additional log_type field inserted after the timestamp. The field can have one of these values: request, access, response, debug, error.

The logging directory path for standard output and error used by startup scripts can be set in the SourceAgent’s environment variable definition script as a value for SA_STD_LOG_DIR variable.

Note that the startup script on Linux, Arch Linux ARM and Mac OSX does specify the -q switch.

To enable logging to files, you must specify a valid path to the logging directory in the configuration file using the log.dir option.
If the log.dir setting is specified in the configuration file, daily-rotated log files for each log type are produced under the specified directory. Each file is placed in a monthly-rotated directory, which in turn is located in a yearly-rotated parent directory. Schematically, the following layout gets created in the log directory:

logs/
└── YYYY/
    ├── MM/
    │   ├── YYYY.MM.DD.access.log
    │   ├── YYYY.MM.DD.debug.log
    │   ├── YYYY.MM.DD.error.log
    │   ├── YYYY.MM.DD.request.log
    │   ├── YYYY.MM.DD.response.log
    │   └── ...
    └── ...

If value for log.rotate parameter is set explicitly to false, the layout of the log directory will be flat, and names of produced log files will not contain timestamps:

logs/
├── access.log
├── debug.log
├── error.log
├── request.log
└── response.log

The rotation of log files then can be accomplished by means of external tools (e.g. logrotate) supporting copy-and-truncate log rotation scenarios.
Timestamps in log records and log file names are in time zone specified by log.tz in configuration.
Note that if the default log configuration is overridden by any external means, the -q command-line argument is no longer supported, and neither are the configuration options log.dir and log.rotate.

Configuration parameters

Parameters for configuring logging are as follows.

  • audit.requests - boolean specifying if request info should be logged as soon as request is read in. Default is false.
  • audit.access - boolean specifying if request and response log info should be logged before response has been sent out. Default is false.
  • audit.responses - boolean specifying if request and response log info should be logged after response has been completely flushed out. Default is true.
  • log.dir - path to existing writable directory to write server logs to. If not specified then filesystem logging is disabled
  • log.rotate - boolean parameter enabling automatic daily log rotation. Default value is true.
  • log.tz - time zone ID (as defined in IANA Time Zone Database) to be used for creating log file names (when log.rotate = true) and timestamps in log records. Default value is “UTC”.
  • log.level - log level of debug logging. Possible values are: trace, debug, info, warn, error. The default is warn. Note that the specified log level can be overridden by -v command-line switches, as described in the verbosity section above.
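
As a sketch, the fragment below enables file logging with daily rotation and raises the generic log level to info, while keeping the documented audit defaults. The log directory is a hypothetical path and must already exist and be writable.

```properties
# Audit logging: documented defaults, written out explicitly
audit.requests=false
audit.access=false
audit.responses=true

# File logging under a hypothetical, pre-existing writable directory
log.dir=/var/log/sourceagent
log.rotate=true
log.tz=UTC

# More verbose than the default (warn); -v switches can still override it
log.level=info
```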

Server parameters

  • host - string specifying either hostname or IP address (both IPv4 and IPv6 supported) for server to bind to, defaults to 127.0.0.1
  • port - integer specifying a port number to accept incoming connections on, defaults to 8389
  • tls.enabled - boolean specifying mode of accepting TLS connections. If value is set to false (the default) then TLS connections are not accepted, if set to true then only TLS connections are accepted.
  • tls.certChainFile - pathname to a readable file containing X.509 certificate chain in PEM format. Gets queried only if tls.enabled is set to true. If both this and tls.keyFile are not specified, the server uses self-signed TLS certificate.
  • tls.keyFile - pathname to a readable file containing PKCS#8 private key in PEM format. Gets queried only if tls.enabled is set to true. If both this and tls.certChainFile are not specified, the server uses self-signed TLS certificate.
  • tls.keyPassword - private key password. Gets queried only if tls.enabled is set to true and tls.keyFile is specified
  • acceptorsThreadCount - integer specifying the number of threads responsible for accepting incoming connections. A value of 0 means the server uses twice as many threads as there are processors/cores available per listening socket. The default is 1.
  • ioThreadCount - integer specifying the number of worker threads handling asynchronous IO operations. A worker thread handles the IO traffic of a connection once the acceptor has accepted it and registered it with the request processor. A value of 0 means the server uses twice as many threads as there are processors/cores available (the default).
  • handlersThreadCount - integer specifying a number of worker threads handling request processors logic. Value of 0 means the server uses as many threads as ioThreadCount eventually is (the default).
  • startUpScanThreadCount - integer specifying the number of threads the server uses for the initial scan of directories referenced in roots.<container>.path directives. Gets queried only if file system monitoring is enabled. A value of 0 means the server uses one separate thread for each directory (the default).
  • listingPriorityThreadCount - integer specifying a number of threads the server uses for on-demand listing of directories which have not yet been discovered by running standard scan. Gets queried only if file system monitoring is enabled. Negative value disables this feature, and the value of 0 (the default) indicates that the server must use as many threads as there are named containers defined using roots.<container>.path directives.
  • channelIdleTimeout - the time period each incoming connection is allowed to be idle before it gets forcibly closed. A value of 0 means the OS default. The default value is 30000
  • readLimit and writeLimit - values specifying max allowed read and write throughput correspondingly, in bytes per second. The value of 0 means no corresponding throttling is enabled (the default)
  • handleHttpOptions - boolean value instructing the server to enable automatic generation of processors for OPTIONS HTTP requests to all enabled API endpoints, and appending CORS headers to responses to these uris. The default value is false
  • lockTtl - long value indicating newly created lock lifetime in milliseconds for incoming file lock requests. Default value is 300000ms (5 min)
  • backlog - integer maximum queue length for incoming connection indications (requests to connect). Default value is 1024.
  • writeSpinCount - maximum loop count for a write operation until channel.write() returns a non-zero value. Default value is 8.
  • db.file (dbFilePath in versions prior to 1.4.40) - path to a writable database file to hold directory scanning results. If file with specified path does not exist it will be created. If not specified then database is not used, and all metadata requests are routed to the file system; if gzipscan is enabled then metainfo regarding gzip files located on drives with not “full” monitoring is stored in the heap memory.
    The setting may be valuable on systems with containers pointed to directories with slow metadata querying interfaces (e.g. NFS-mounted directories).
  • db.exclusions - comma-separated list of valid container names. Metainfo on files and directories in these containers will not be saved to database, and all metadata requests will be routed to the file system.
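
A hedged sketch of a server section that accepts only TLS connections and keeps scan results in a database file is shown below. All paths and the excluded container name are hypothetical; the bind address 0.0.0.0 is chosen here to listen on all IPv4 interfaces instead of the default 127.0.0.1.

```properties
# Listen on all IPv4 interfaces (assumption for this example), default port
host=0.0.0.0
port=8389

# Accept only TLS connections, using an own certificate chain and key
# (hypothetical PEM file locations)
tls.enabled=true
tls.certChainFile=/etc/sourceagent/tls/chain.pem
tls.keyFile=/etc/sourceagent/tls/key.pem

# Keep directory scanning results in a database file (created if missing)
db.file=${SA_HOME}/db/sa.db
# Metadata for the (hypothetical) container "public" stays out of the database
db.exclusions=public
```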

File system monitoring

File system monitoring gets enabled for each configured container only if at least one of the following holds:

For each target container, the server sets up a separate monitor. A monitor is considered “full” if it is able to signal all required filesystem modification events: file/directory creation, deletion, and content modification.
The server chooses a file system monitor for each configured container as follows.

If GzipScan is enabled but no database is enabled, the server keeps the paths of indexed gzip files for each container in heap memory (only if the monitor is not a “full” monitor) and the actual gzip index data in the gzipscan db. If the database is enabled, the server uses it to keep metadata.