hdfs://

The scheme hdfs:// provides access to objects in the Hadoop Distributed File System (HDFS).

Using hdfs:// in SpectX

The implementation of hdfs:// protocol does not support host-less URI notations, and always requires either a valid HDFS name node hostname or IP address, or a datastore name to be specified as host in the URI.
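The host requirement can be checked before a URI is passed on. The following is a minimal Python sketch, not part of SpectX: the helper name has_host and the sample host and paths are hypothetical.

```python
from urllib.parse import urlsplit


def has_host(uri: str) -> bool:
    """Return True if an hdfs:// URI names a host (NameNode or datastore)."""
    parts = urlsplit(uri)
    # A host-less URI such as hdfs:///path has an empty network location.
    return parts.scheme == "hdfs" and bool(parts.hostname)


# Hypothetical URIs, for illustration only.
print(has_host("hdfs://namenode.example.com:8020/logs/app.log"))  # True
print(has_host("hdfs:///logs/app.log"))                            # False
```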

When accessing an HDFS cluster, the user principal to use can be specified in the user info part of the URI, e.g. hdfs://principal@cluster. If cluster refers to a defined datastore, the principal overrides the one defined for the datastore.

Note that the principal must be percent-encoded if it contains the ‘@’ symbol, e.g. hdfs://principal.name%40domain@cluster.
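The encoding step can be done with a standard URL-quoting routine. The following Python sketch is an illustration only: the helper name hdfs_uri is hypothetical, and the sample principal and host are the ones used in the text above.

```python
from urllib.parse import quote


def hdfs_uri(host: str, path: str = "", principal: str = "") -> str:
    """Build an hdfs:// URI, percent-encoding the principal for the user info part."""
    # safe="" makes quote() encode every reserved character, so '@' becomes %40.
    user_info = quote(principal, safe="") + "@" if principal else ""
    return f"hdfs://{user_info}{host}{path}"


print(hdfs_uri("cluster", principal="principal.name@domain"))
# hdfs://principal.name%40domain@cluster
```

Only the principal is encoded; the '@' that separates the user info part from the host stays literal.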

Datastore configuration

UI

Configuration parameters for hdfs:// datastore definition:

| Name | Description |
|---|---|
| Store name | unique name among all defined datastores |
| High Availability Cluster | enables HA cluster connectivity configuration mode (for setting the addresses of the 2 NameNodes in an HA cluster) |
| Host | NameNode metadata IPC service hostname or IPv4/IPv6 address (first NameNode in HA cluster mode) |
| Port | NameNode listening port, default is 8020 |
| Host2 | second NameNode metadata IPC service hostname or IPv4/IPv6 address (HA cluster mode) |
| Port2 | second NameNode listening port, default is 8020 (HA cluster mode) |
| User principal name | Kerberos principal to be used for authentication at the NameNode. If empty, authentication is not attempted |
| Password | password for the principal. If empty, a keytab file is used for authentication. The keytab file is located as follows: (1) an attempt is made to locate the krb5.conf file and read the keytab file location from the default_keytab_name parameter in its [libdefaults] section; (2) if the location is not specified in krb5.conf, the file {user.home}/krb5.keytab is used |
| Root directory | top-level directory at the NameNode where the logs can be read from. Must be readable by the user. Optional, default is / |
| RPC calls protection | Hadoop protection value for secured SASL connections, the value set for the “hadoop.rpc.protection” parameter in the cluster configuration |
| Is cacheable | enables caching of data by Processing Units |
| Hot cache period | limits time-related data caching to the period specified |
| Read ACL | specifies the blob read ACL |
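The keytab lookup order described for the Password parameter can be sketched as follows. This is a Python illustration of the two steps, not the SpectX implementation: the function names are hypothetical, the krb5.conf content is passed in as text rather than located on disk, and the home directory stands in for the Java {user.home} property.

```python
from typing import Optional


def keytab_from_krb5_conf(conf_text: str) -> Optional[str]:
    """Return default_keytab_name from the [libdefaults] section, if present."""
    section = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()  # entering a new section
            continue
        if section == "libdefaults" and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if key == "default_keytab_name":
                return value
    return None


def locate_keytab(conf_text: Optional[str], home: str) -> str:
    """Step 1: krb5.conf setting; step 2: fall back to <home>/krb5.keytab."""
    configured = keytab_from_krb5_conf(conf_text) if conf_text else None
    return configured if configured else f"{home}/krb5.keytab"


# Hypothetical configuration and paths, for illustration only.
conf = """
[libdefaults]
    default_realm = EXAMPLE.COM
    default_keytab_name = /etc/security/spectx.keytab
"""
print(locate_keytab(conf, "/home/spectx"))  # /etc/security/spectx.keytab
print(locate_keytab(None, "/home/spectx"))  # /home/spectx/krb5.keytab
```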

Filesystem

The datastore definition file is a JSON document with the following structure (optional parameters can be omitted):

{
  "type": "HDFS",
  "hdfsStore": {
    "host": "<host>",
    "port": <port>,
    "host2": "<host2>",
    "port2": <port2>,
    "principal": "<principal>",
    "passwd": "<passwd>",
    "keytabFilePath": "<keytabFilePath>",
    "rpcProtection": "<rpcProtection>",
    "rootDir": "<rootDir>",
    "isCacheable": <isCacheable>,
    "hotCachePeriod": "<hotCachePeriod>",
    "acl": {<rACL>}
  }
}

where

  • <host> NameNode metadata IPC service hostname or IPv4/IPv6 address (first NameNode in HA cluster mode). A string. Mandatory parameter
  • <port> NameNode listening port, default is 8020. An integer
  • <host2> Second NameNode metadata IPC service hostname or IPv4/IPv6 address (HA cluster mode). Optional. A string
  • <port2> Second NameNode listening port, default is 8020 (HA cluster mode). An integer
  • <principal> Kerberos principal to be used for authentication at the NameNode. Optional; if empty, authentication is not attempted. A string
  • <passwd> password for the principal. If empty, a keytab file is used for authentication. The keytab file is located as follows:
    • an attempt is made to locate the krb5.conf file and read the keytab file location from the default_keytab_name parameter in its [libdefaults] section
    • if the location is not specified in krb5.conf, the file {user.home}/krb5.keytab is used
  • <keytabFilePath> a path to a keytab file to use for authentication of given principal instead of a password. Optional. A string
  • <rpcProtection> Hadoop protection value for secured SASL connections, a value set for “hadoop.rpc.protection” parameter in cluster configuration (one of “authentication”, “integrity”, “privacy”). Optional, default is “authentication”. A string
  • <rootDir> top-level directory at the NameNode where the logs can be read from. Must be read accessible by the user. Optional, default is /. A string
  • <isCacheable> enables caching data by Processing Units. A boolean (“true” or “false”)
  • <hotCachePeriod> limits time-related data caching to the period specified. A time period
  • <rACL> is a definition of a blob read ACL for the datastore. A map.
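A definition in this format can be generated rather than written by hand. The following Python sketch builds one from the parameters listed above and leaves out the optional ones that are not set. The function name hdfs_datastore and all sample values (hostnames, principal, paths) are hypothetical. The passwd, isCacheable, hotCachePeriod and acl parameters are left out of the sketch, since their value formats are not detailed here.

```python
import json
from typing import Any, Dict, Optional

RPC_PROTECTION_VALUES = ("authentication", "integrity", "privacy")


def hdfs_datastore(host: str, port: int = 8020,
                   host2: Optional[str] = None, port2: int = 8020,
                   principal: Optional[str] = None,
                   keytab_file_path: Optional[str] = None,
                   rpc_protection: Optional[str] = None,
                   root_dir: Optional[str] = None) -> Dict[str, Any]:
    """Build an HDFS datastore definition, omitting unset optional parameters."""
    if rpc_protection is not None and rpc_protection not in RPC_PROTECTION_VALUES:
        raise ValueError(f"rpcProtection must be one of {RPC_PROTECTION_VALUES}")
    store: Dict[str, Any] = {"host": host, "port": port}
    if host2:  # second NameNode, HA cluster mode only
        store["host2"] = host2
        store["port2"] = port2
    optional = {"principal": principal, "keytabFilePath": keytab_file_path,
                "rpcProtection": rpc_protection, "rootDir": root_dir}
    store.update({k: v for k, v in optional.items() if v is not None})
    return {"type": "HDFS", "hdfsStore": store}


# Hypothetical HA cluster with keytab authentication, for illustration only.
definition = hdfs_datastore(
    "nn1.example.com", host2="nn2.example.com",
    principal="spectx@EXAMPLE.COM",
    keytab_file_path="/etc/security/spectx.keytab",
    rpc_protection="privacy", root_dir="/logs")
print(json.dumps(definition, indent=2))
```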

Points to consider

If the target HDFS cluster requires authentication, the underlying implementation requires a proper Kerberos 5 client configuration to be set up on the SpectX host. In environments with one Kerberos authentication realm, it might be enough to specify the system properties java.security.krb5.realm and java.security.krb5.kdc for the default realm and KDC location.

If a more advanced configuration is required, or access to secure HDFS clusters running multiple Kerberos authentication realms needs to be set up, the krb5.conf file must exist in the filesystem. It must have the default_realm property defined in its [libdefaults] section. The default realm does not necessarily have to be one of those used by the target HDFS ecosystem. If that is the case, or if the datastore targets any other non-default Kerberos realm, a correct mapping from the realm’s KDC hostname to the realm name must be defined in the [domain_realm] section of krb5.conf. For the content of the krb5.conf file, refer to the MIT Kerberos documentation.
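To make the required sections concrete, the following Python sketch renders a minimal krb5.conf with a default realm, one KDC per realm, and the KDC hostname to realm mapping in [domain_realm]. It is an illustration under assumptions: the function name render_krb5_conf, the realm names and the KDC hostnames are hypothetical, and a real deployment will usually need further settings described in the MIT Kerberos documentation.

```python
from typing import Dict


def render_krb5_conf(default_realm: str, kdcs: Dict[str, str]) -> str:
    """Render a minimal krb5.conf; kdcs maps realm name -> KDC hostname."""
    lines = ["[libdefaults]", f"    default_realm = {default_realm}", "", "[realms]"]
    for realm, kdc in kdcs.items():
        lines += [f"    {realm} = {{", f"        kdc = {kdc}", "    }"]
    lines += ["", "[domain_realm]"]
    for realm, kdc in kdcs.items():
        # Map each KDC hostname to its realm name.
        lines.append(f"    {kdc} = {realm}")
    return "\n".join(lines) + "\n"


# Hypothetical realms: a default corporate realm and a separate Hadoop realm.
print(render_krb5_conf("CORP.EXAMPLE.COM", {
    "CORP.EXAMPLE.COM": "kdc.corp.example.com",
    "HADOOP.EXAMPLE.ORG": "kdc.hadoop.example.org",
}))
```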

The Kerberos configuration is system-wide, so check with the SpectX administrators whether any configuration has already been set up on the machine. If SpectX is configured to use Integrated Windows Authentication, the configuration might need to be changed to support multiple Kerberos realms and to avoid errors caused by conflicting settings.

Contents of different keytab files can be merged into one keytab file using ktutil, a Kerberos keytab maintenance utility.