(SM-2302) Hadoop Storage Setup 1802

Prerequisites

Open ports

In a controlled network environment, it is common to have firewall rules in place. To enable communication between the SAP system and Hadoop, the following ports need to be reachable in the Hadoop cluster from the SAP system:

Port    Hadoop service
10000   Hive
11000   Oozie
14000   HttpFS
21050   Impala

These are the default port numbers of Hadoop services.

Hive parameters

Two configuration parameters of the Hive service need to be set in the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml:

hive.exec.dynamic.partition = true
hive.exec.dynamic.partition.mode = nonstrict

Example:
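A minimal illustration, assuming the standard hive-site.xml property format accepted by the Cloudera Manager Safety Valve field (verify the exact format against your Cloudera Manager version):

<property>
  <name>hive.exec.dynamic.partition</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>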

Hadoop user

SNP recommends creating a distinct user for every SAP system connected to the Hadoop cluster in order to isolate each system's data.
Usually, there is a central repository for Hadoop users (LDAP/AD), but the user can also be created locally.
Each of these users has to have its own dedicated user group.
If Hadoop Sentry is used: The user groups are used for the definition of Sentry access rules.
The recommended user name mirrors SAP's guidelines for user names (<sid>adm): <sid>hdp.
Create the user's Kerberos principal in the form <sid>hdp@<KERBEROS_REALM>.
The user's home directory on HDFS has to be manually created with appropriate permissions:

$ hdfs dfs -ls -d /user/dvqhdp
-rwxrwxr-x   3 dvqhdp supergroup           0 2017-02-24 11:05 /user/dvqhdp
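A minimal sketch of creating the home directory, assuming the example user dvqhdp and the ownership and permissions shown above; run it as an HDFS superuser (e.g. the hdfs user):

$ hdfs dfs -mkdir -p /user/dvqhdp
$ hdfs dfs -chown dvqhdp:supergroup /user/dvqhdp
$ hdfs dfs -chmod 775 /user/dvqhdp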

Regardless of whether the user comes from a central repository or is created locally, it has to be defined identically on every Hadoop cluster node.

If Kerberos is used: Create a Kerberos principal for the user. You need to run kadmin.local on the host where the Kerberos database is running:
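A minimal sketch, assuming the example principal dvqhdp and the realm HADOOP.LOCAL used elsewhere in this guide:

$ kadmin.local
kadmin.local:  addprinc -randkey dvqhdp@HADOOP.LOCAL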

Hive database

SNP recommends creating a dedicated database (schema) in Hive for each SAP system. The recommended database name is sap<sid> (sapdvq).

If Hadoop Sentry is used: Two Sentry rules have to be created that allow the <sid>hdp user all actions on the sap<sid> database and on its home directory in HDFS.
If HDFS ACL synchronization with Sentry permissions is enabled, the user's home directory has to be added to the Sentry Synchronization Path Prefixes parameter in the HDFS service configuration.
More on HDFS ACL synchronization can be found at https://www.cloudera.com/documentation/enterprise/latest/topics/sg_hdfs_sentry_sync.html
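A minimal sketch of the database and Sentry setup, assuming the example database sapdvq, the group dvqhdp, and Sentry managed through Hive SQL via beeline; the role name sapdvq_role, the Hive host, and the HDFS nameservice are illustrative placeholders:

$ beeline -u "jdbc:hive2://<hive_host>:10000/default;principal=hive/<hive_host>@HADOOP.LOCAL"
> CREATE DATABASE IF NOT EXISTS sapdvq;
> CREATE ROLE sapdvq_role;
> GRANT ALL ON DATABASE sapdvq TO ROLE sapdvq_role;
> GRANT ALL ON URI 'hdfs://<nameservice>/user/dvqhdp' TO ROLE sapdvq_role;
> GRANT ROLE sapdvq_role TO GROUP dvqhdp;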

OS prerequisites (On SAP host)

This group of requirements relates to the operating systems underlying the SAP system and all its application servers. SNP products (e.g. SNP Glue™, SNP Outboard™ Datatiering) have been developed and tested on SUSE Linux, but the design is not limited to this operating system as long as the requirements listed in this guide are met.

Java version

An up-to-date Java version (1.7 or higher is recommended) needs to be available to the SAP <sid>adm user. When Kerberos is used (which SNP recommends), a patched Java version (e.g. 1.8.0_111) is required, because the base release (1.8.0) is unable to communicate with Hadoop services using Kerberos authentication.

SAP Java Connector Library

The SAP Java Connector 3.0 library libsapjco3.so needs to be accessible. It is available for download from the SAP marketplace. It needs to be present in the SAP kernel directory or in another directory pointed to by the LD_LIBRARY_PATH environment variable of the <sid>adm user.
Example:
dvqadm 1> which libsapjco3.so
/usr/sap/DVQ/SYS/exe/uc/linuxx86_64/libsapjco3.so

OS directories

The Hadoop connector uses two directories dedicated to its configuration and log files:

# ls -ld /usr/sap/DVQ/dvd_conn /sapmnt/DVQ/global/security/dvd_conn
drwx------ 2 dvqadm sapsys 4096 --- /sapmnt/DVQ/global/security/dvd_conn
drwxr-xr-x 2 dvqadm sapsys 4096 --- /usr/sap/DVQ/dvd_conn

The first one (/sapmnt/<SID>/global/security/dvd_conn) is used to store Kerberos-related files and is shared among SAP application servers.

The second one (/usr/sap/<SID>/dvd_conn) stores drivers, configuration, and log files of the Java connector used by SNP Glue™. Ownership and permissions need to be set appropriately for the <sid>adm user.

Create these directories on each SAP application server to store the Storage Management related configuration and log files. The <sid>adm user needs permission to access those folders.
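A minimal sketch of creating the directories, assuming the example SID DVQ and the ownership and permissions shown above:

# mkdir -p /usr/sap/DVQ/dvd_conn /sapmnt/DVQ/global/security/dvd_conn
# chown dvqadm:sapsys /usr/sap/DVQ/dvd_conn /sapmnt/DVQ/global/security/dvd_conn
# chmod 700 /sapmnt/DVQ/global/security/dvd_conn
# chmod 755 /usr/sap/DVQ/dvd_conn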

JDBC Drivers

The JDBC protocol is used to connect to the Hadoop services (Hive and Impala). The JDBC drivers have to be manually stored on the operating system and must be accessible to the connector.

SNP recommends storing the drivers in a folder within the connector directory, organized in subfolders to avoid possible conflicts.
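An illustrative layout; the subfolder names are assumptions, not fixed names, and the JAR files come from the Cloudera driver packages (exact JAR names depend on the driver version):

$ ls /usr/sap/DVQ/dvd_conn/drivers/
hive_cloudera  impala_cloudera
$ ls /usr/sap/DVQ/dvd_conn/drivers/hive_cloudera
HiveJDBC41.jar  (plus the dependency JARs shipped with the driver)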

Kerberos keytab and configuration files

The Kerberos keytab of the <sid>hdp principal needs to be exported from the Kerberos database, copied into the operating system directory /sapmnt/DVQ/global/security/dvd_conn, and made available to the <sid>adm user:

/sapmnt/DVQ/global/security/dvd_conn # ls -l dvqhdp.keytab
-r-------- 1 dvqadm sapsys 59 Apr  5 15:12 dvqhdp.keytab
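A minimal sketch of exporting the keytab, assuming the example principal dvqhdp@HADOOP.LOCAL and SID DVQ; -norandkey keeps the existing keys so the principal's password is not invalidated:

# On the Kerberos server:
$ kadmin.local -q "xst -norandkey -k /tmp/dvqhdp.keytab dvqhdp@HADOOP.LOCAL"

# After copying the keytab to the SAP host:
$ chown dvqadm:sapsys /sapmnt/DVQ/global/security/dvd_conn/dvqhdp.keytab
$ chmod 400 /sapmnt/DVQ/global/security/dvd_conn/dvqhdp.keytab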


At the same location, the Kerberos configuration file needs to be created/copied and made readable for the user <sid>adm.
Here is a sample of the Kerberos configuration file:

/sapmnt/DVQ/global/security/dvd_conn # ls -l kerberos.cfg
-r-------- 1 dvqadm sapsys 393 Feb 22 16:28 kerberos.cfg

/sapmnt/DVQ/global/security/dvd_conn # cat kerberos.cfg
[libdefaults]
default_realm = HADOOP.LOCAL
dns_lookup_kdc = false
dns_lookup_realm = false
ticket_lifetime = 86400
renew_lifetime = 604800
forwardable = true
default_tgs_enctypes = rc4-hmac
default_tkt_enctypes = rc4-hmac
permitted_enctypes = rc4-hmac
udp_preference_limit = 1
kdc_timeout = 3000
[realms]
HADOOP.LOCAL = {
kdc = hadoop01.hadoop.local
admin_server = hadoop01.hadoop.local
}
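The keytab and configuration file can be verified outside of SAP with a kinit check run as the <sid>adm user, assuming the MIT Kerberos client tools are installed on the SAP host (KRB5_CONFIG points kinit to the file above):

$ export KRB5_CONFIG=/sapmnt/DVQ/global/security/dvd_conn/kerberos.cfg
$ kinit -kt /sapmnt/DVQ/global/security/dvd_conn/dvqhdp.keytab dvqhdp@HADOOP.LOCAL
$ klist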

SSL Certificates for Java

If your Hadoop cluster has full TLS encryption enabled, it is necessary to create a Java truststore and save it at the following path with correct ownership and permissions:

/sapmnt/<SID>/global/security/dvd_conn # ls -l jssecacerts
-r-------- 1 <SID>adm sapsys 59 Apr  5 15:12 jssecacerts

Server certificates of the hosts that SNP Glue™ communicates with need to be placed in this truststore. Alternatively, the complete jssecacerts truststore can be copied from any Hadoop node and placed at this path.
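A minimal sketch of importing a host certificate with keytool, assuming the example SID DVQ; the alias, certificate file, and truststore password are illustrative:

$ keytool -importcert -alias hadoop01 -file /tmp/hadoop01.pem \
    -keystore /sapmnt/DVQ/global/security/dvd_conn/jssecacerts -storepass <truststore_password>
$ chown dvqadm:sapsys /sapmnt/DVQ/global/security/dvd_conn/jssecacerts
$ chmod 400 /sapmnt/DVQ/global/security/dvd_conn/jssecacerts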

SAP prerequisites

HTTP service and ICM parameters

The HTTP service has to be active in the SAP system. It can be checked via transaction
SMICM > [Goto] > Services

There are two particularly important parameters affecting the HTTP communication of the SAP system:

  • icm/HTTP/client/keep_alive_timeout: Defines the HTTP communication timeout; it can be raised if HTTP communication fails with a timeout.
  • icm/HTTP/max_request_size_KB: Defines the maximum size of data that ICM will accept (default 100 MB).

SAP gateway access

External communication with the SAP system goes through the SAP gateway. If the SAP system parameter gw/acl_mode is enabled, two files (secinfo and reginfo) limit access.

In such a case, two programs need to be granted access, either explicitly or by a wildcard definition:

  • AUTH_CONN
  • JAVA_CONN

The program IDs are defined later in this guide within the step TCP/IP RFC creation. More information on the SAP gateway ACL topic can be found on the SAP website:

https://help.sap.com/saphelp_nw73/helpdata/en/e2/16d0427a2440fc8bfc25e786b8e11c/content.htm

Kerberos cookie encoding

The SAP system by default encodes certain characters in cookies. SAP Note 1160362 describes the behavior in detail. As the Kerberos cookie must not be modified in any way for the Kerberos server to accept it, this encoding needs to be disabled by setting the following parameter in each SAP application server instance profile:

ict/disable_cookie_urlencoding = 1

The parameter is static, so a restart of the application server is required for the change to take effect.

SAP RFC role and user

The Java connector uses a dedicated user in the SAP system for communication. In our reference configuration, we use the username Hadoop. This user needs to be created with the type Communications Data and with a role limiting its privileges to RFC communication.

Example of the RFC role:

SSL for SAP RFCs

To enable SSL communication for SAP RFCs, client certificates of target Hadoop nodes need to be added to the SAP certificate list in transaction STRUST.

If HttpFS is used, a client certificate of the HttpFS host is needed.

If WebHDFS is used, client certificates of all datanodes and namenodes are necessary.

To import necessary certificates:

  1. Use transaction STRUST
  2. In the left menu choose the certificate list that you want to add the certificate to (by double-clicking)
  3. In the right window area on the bottom left, click the import button
  4. In the tab FILE of the dialog window, point to the certificate file on your local file system (the certificate should be in .pem format)
  5. After you confirm the path and SAP is able to read/recognize the certificate, details will show in the corresponding fields
  6. To finish adding a certificate click on Add to Certificate List
  7. Click on Save (in the general menu or Ctrl+S)
  8. Restart the ICM to apply the changes: transaction SMICM > Administration > ICM > Exit Soft > Local/Global

Configuration

When all prerequisites are fulfilled, further configuration is done from within the SAP system.

RFC Destinations

There are three RFC connections that need to be created via transaction SM59.

HttpFS RFC

This RFC connection is used for communication with Hadoop's HttpFS service which mediates operations in HDFS.

The name and description of the destination can be chosen freely, but it is recommended to indicate its purpose with the keywords Hadoop and HttpFS. In our example, the RFC destination name also contains the Hadoop server hosting the HttpFS service, for the sake of clarity:

Entries explained:

  • Connection Type: G for HTTP connection to an external service.
  • Target host: FQDN of Hadoop server hosting HttpFS service.
  • Service No.: Port number on which HttpFS service is listening (default is 14000).
  • Path Prefix: This string consists of two parts.
    1. The /webhdfs/v1 part is mandatory.
    2. The /user/dvqhdp part defines the Hadoop user's root directory in HDFS, where flat files from the SAP system will be loaded.

If SSL is used: It is necessary to enable SSL and add a client certificate list to be used in the Logon & Security tab.
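The HttpFS endpoint behind this destination can be checked outside of SAP with curl, assuming the illustrative host hadoop01.hadoop.local, the default port 14000, and a valid Kerberos ticket obtained with kinit:

$ curl --negotiate -u : "http://hadoop01.hadoop.local:14000/webhdfs/v1/user/dvqhdp?op=GETFILESTATUS"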

Authentication RFC

This RFC connection is part of the authentication mechanism towards any Hadoop cluster in a kerberized environment. 

The RFC setup is very basic as parameters affecting authentication are defined elsewhere. It is recommended to use the generic RFC name HADOOP_AUTH_CONN:

Entries explained:

  • Connection Type: T for TCP/IP Connection
  • Activation Type: select Registered Server Program
  • Program ID: AUTH_CONN

It is important to enable the AUTH_CONN program registration in the SAP gateway (see SAP gateway access).

Java RFC

The Java RFC, as its name suggests, refers to the Java service that is used for communication with the Hadoop services. Again, the setup is basic, as the parameters of the Java connector are defined in separate tables:

Entries explained:

  • Connection Type: T for TCP/IP Connection.
  • Activation Type: Select Registered Server Program.
  • Program ID: JAVA_CONN.

It is important to enable the JAVA_CONN program registration in the SAP gateway (see SAP gateway access).

External commands setup

For correct access to the application server, the following external command needs to be defined in transaction SM69:

Entries explained:

  • Command Name: Must be set to ZCHMOD.
  • Operating System Command: Specifies UNIX command - chmod.
  • Parameters for Operating System Command: Additional parameters of the command - 755.
  • Additional Parameters Allowed: Determines whether additional parameters are allowed - must be checked.

Java connector setup

The Java connectors are configured using files that have to be defined in the SAP system.
The configuration is done in the following steps.

Logical file path definition

The first step is to map the logical path ZHADOOP_SECURITY to the OS path where the files are stored. The actual OS path was created in the section OS directories (/sapmnt/<SID>/global/security/dvd_conn).

Kerberos logical file definition

Before actually setting up the Hadoop storage, there are three files that are required for successful Kerberos authentication. They need to be defined as logical names to the SAP system via the FILE transaction.

When the logical path is defined, the file definition follows:

ZHADOOP_KRB_KEYTAB and ZHADOOP_KRB_CONFIG refer to the Kerberos keytab of the <sid>hdp user and the Kerberos configuration file defined in the section Kerberos keytab and configuration files, respectively. ZHADOOP_CDH_DRIVER refers to the custom Cloudera driver configuration file, which will be generated during the storage activation.

SSL logical file definition

If the Hadoop cluster resides in a secure environment that is accessible only with SSL authentication, the following logical file needs to be defined:

Drivers logical file definition

As described in the JDBC Drivers section, JDBC drivers for Hadoop services connection are stored on the operating systems underlying the SAP system. They also need to be defined as logical names to the SAP system via the FILE transaction.

In our example, we will be using the Hive and Impala JDBC drivers provided by Cloudera. The first step is to map the logical path ZJDBC_DRIVER_PATH to the OS path where the files are stored (in our case /usr/sap/<SID>/dvd_conn/drivers/).

Example:

When the logical path is defined, the definition of driver-specific folders follows:

ZJDBC_HIVE_CLOUDERA_JARS and ZJDBC_IMPALA_CLOUDERA_JARS refer to the folders in which Hive JDBC drivers and Impala JDBC drivers provided by Cloudera have been placed in the section JDBC Drivers.


Java connector configuration

There are two Java connectors that SNP Glue™ works with. One is used for Kerberos authentication, the other runs SQL queries against Hadoop services. They run as two autonomous Java processes, each with its own log file, and they are configured by setting parameters in three configuration tables. These tables are imported into the SAP system as part of the SNP Glue™ transports.

The tables that need to be configured manually are:
/DVD/HDP_CUS_C – contains information for connector authentication; it links the HttpFS RFC with the authentication files.

The table can be maintained via transaction SM30.

Sample entry:

Entries explained:

  • Destination: HttpFS RFC destination created in HttpFS RFC.
  • User Name: Hadoop user principal created in Hadoop user, group, and HDFS directory.
  • Auth. Method: Authentication method towards the Hadoop cluster.
  • Krb. Keytab: Logical file definition for Kerberos keytab file.
  • Krb. Config: Logical file definition for Kerberos configuration file.
  • Krb. Service RFC: Authentication RFC destination created in Authentication RFC.
  • SSL Keystore: Logical file definition for SSL Keystore.
  • SSL Password: Password for accessing SSL Keystore.


/DVD/JAVA_HDR – maps the RFC destination to the Java connector version.

SNP Glue™ currently uses two Java connectors: HADOOP_AUTH_CONN for the authentication process and HIVE_CDH for submitting queries to Hive.

Table entries can, if needed, be maintained via transaction.

Table example:


/DVD/JAVA_CONFIG – stores parameters for the Java connectors. The table needs to be populated with entries via transaction SE16.

Sample configuration:

Prerequisites:

Multiple entries in the /DVD/JAVA_CONFIG table define the location and filename where a specific file will be generated. These entries are marked (generated) in the detailed explanation below.

Note that, as a prerequisite, all the paths specified in the table must exist beforehand (folders only; the files themselves are generated).

Entries explained:

  • Client: ID of the SAP system source data client.
  • CONFIG_AS_PATH: Path and configuration file specifying authentication information towards the SAP system to Java connector (generated).
  • CONFIG_PATH: Path and configuration file specifying program ID and SAP gateway information to Java connector (generated).
  • CONN_COUNT: Number of connections registered at the gateway.
  • JAR_PATH: Path and filename of the Java connector JAR file. The JAR is generated in the specified folder under the specified filename; its content is part of the transport (generated).
  • JAVA_EXE: Path to Java binary used by <sid>adm user.
  • LOG_FILE: Path and filename of log file related to the Java connector (generated).
  • MAX_RAM_USED: Amount of memory dedicated to the Java connector.
  • USERNAME: RFC user defined in SAP RFC role and user.
  • PASSWORD: The password of the RFC user (defined above) in hashed form. Use instructions from the chapter Password hash generator to generate the hash.
  • PEAK_LIMIT: Maximum number of connections that can be created for a destination simultaneously.
  • PROG_ID: Registered program ID defined in Authentication RFC (version 32) and Hive RFC (version 34).
  • REP_DEST: Java connector client destination from which to obtain the repository.
  • WORK_THREAD_MAX: Number of threads that can be used by Java Connector.
  • WORK_THREAD_MIN: Number of threads always kept running by Java Connector.

Password hash generator

Use report /DVD/XOR_GEN for this purpose.



Storage Management setup

A generic software component, the Reuse Library, is used for the setup. The required component is Storage Management for Hadoop.

Storage Management facilitates transparent communication with different types of storage, including various types of databases and Hadoop: HDFS for flat files and Hive for structured data.

HDFS storage

In order to store data transparently, two types of Hadoop storage need to be defined in Storage Management:

  • HDFS storage facilitates the transfer of files to HDFS via Hadoop's HttpFS service.
  • Hive storage enables data replication between SAP tables and Hive tables.

The third type of storage is required for efficient querying of data located in Hive:

  • Impala storage connects to Impala agents to provide fast SQL execution by leveraging Impala's in-memory data caching.

HDFS storage creation is done via transaction:

/DVD/SM_SETUP > [Edit mode] > [New storage]

Entries explained:

  • Storage ID: Name of the storage.
  • Storage Type: Choose HADOOP for HDFS.
  • Description: Extended description of the storage for easier identification.
  • RFC Destination: HttpFS RFC destination defined in HttpFS RFC.

Finish the creation of the storage by confirming (F8). If the SAP system is able to authenticate against Hadoop Kerberos and get properties of the HDFS home directory (/user/<sid>hdp) from the HttpFS service, storage creation is considered successful.

Hive metastore storage

The Hive metastore storage is created in a very similar way to the HDFS storage, but with different values:



Entries explained:

  • Storage ID: Name of the storage.
  • Storage Type: Choose SM_TRS_HV2 for Hive.
  • Description: Extended description of the storage for easier identification.
  • Database: Hive database created in Hive database.
  • Hive host: Hadoop server hosting the Hive service.
  • Hive username: Hadoop user created in Hadoop user, group, and HDFS directory.
  • Impala host: Hadoop server hosting the Impala service.
  • Impala username: Hadoop user created in Hadoop user, group, and HDFS directory.
  • HTTP RFC Destination: HttpFS RFC destination defined in HttpFS RFC.
  • Java connector RFC: Hive RFC destination defined in Hive RFC.
  • Load Engine: Engine used for loading (writing) data, e.g. Hive or Impala.
  • Read Engine: Engine used for reading data, e.g. Hive or Impala.
  • Load driver classname: Classname of the driver used for loading (e.g. Cloudera Hive - com.cloudera.hive.jdbc41.HS2Driver).
  • Load driver path: Logical name of the Load driver file.
  • Read driver classname: Classname of the driver used for reading (e.g. Cloudera Impala - com.cloudera.impala.jdbc41.Driver).
  • Read driver path: Logical name of the Read driver file.
  • Use Kerberos checkbox: Checked in case the Hadoop cluster is Kerberized.
  • Hive service principal: Kerberos principal of the Hive service, must reflect the Hive host.
  • Kerberos config file path: Logical name of the Kerberos configuration file defined in Kerberos logical file definition.
  • Kerberos keytab path: Logical name of Kerberos principal keytab file defined in Kerberos logical file definition.
  • File Type: File format in which Hive will store table data on HDFS.
  • Compression: Not used.
  • HDFS Permissions: UNIX permissions for files created on HDFS.
  • Use the Cloudera drivers checkbox: Always checked if the Hadoop cluster is a Cloudera distribution.
  • Cloudera driver config path: Logical name of the Cloudera driver configuration file defined in Kerberos logical file definition.
  • Hive host: Hive high-availability host.
  • Hive port: Hive high-availability port.
  • HTTP RFC Destination: Hive high-availability RFC destination.
  • Skip trash: Checked if HDFS files should NOT be moved to trash after deleting them.
  • Impala port: Impala JDBC port.
  • Hive port: Hive JDBC port.
  • Hints for hive/impala: String that will be appended to the JDBC connection string (e.g. SYNC_DDL=TRUE).
  • Open cursor logic: Select which logic will be used for reading via the cursor.
  • Repetition for HDFS: Number of times HDFS request will be repeated in case of failure.
  • Use compression: Checked in case compression is used for the file created on HDFS.
  • Compression level: Level of compression (0-minimum, 9-maximum).
  • Use SSL: Checked if SSL authentication should be used.
  • SSL Keystore path: Logical name of the SSL Keystore file.
  • SSL Keystore password: Password to Keystore.

Finish the creation of the storage by confirming (F8). If the SAP system is able to authenticate against Hadoop Kerberos and receives the expected result of the SQL command use database, the creation of the storage is considered successful.

Troubleshooting

In case of any issues with the Java connector, the Java logs can be read from the application server with the report /DVD/SM_HIVE_DISPLAY_JAVA_LOG to determine the source of the issue.

Entries explained:

  • RFC Destination: Destination of RFC with Java service.
  • Num. of last lines to display: Number of lines from the end of the log to display.
  • Name of file with log: Filename of the Java log file (in the default location).
  • Is the log archived?: Check to display an archived log (date obligatory).
  • Date of the log: Date of the archived log to display.