(SM-1902) Hadoop Storage Setup

Prerequisites

Open ports

In a controlled network environment, firewall rules are commonly in place. To enable communication between SAP systems and Hadoop, the following ports on the Hadoop cluster must be reachable from the SAP system:

Port   Type  Hadoop service  Comment
10000  tcp   Hive
11000  tcp   Oozie           Optional feature for Datavard Glue
14000  tcp   HttpFS          Alternative is WebHDFS on port 50070
21050  tcp   Impala

These are default port numbers of Hadoop services.
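A quick way to verify reachability from the SAP host is a small shell check. This is a sketch, assuming bash with /dev/tcp support; 'hadoop01.hadoop.local' is a placeholder for your cluster node:

```shell
#!/usr/bin/env bash
# Check the default Hadoop service ports (plus Kerberos KDC port 88)
# from the SAP host. Replace HADOOP_HOST with a real cluster node.
check_port() {                     # check_port HOST PORT -> 0 if TCP connect succeeds
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

HADOOP_HOST="${HADOOP_HOST:-hadoop01.hadoop.local}"
for port in 10000 11000 14000 21050 88; do
  if check_port "$HADOOP_HOST" "$port"; then
    echo "port $port: reachable"
  else
    echo "port $port: NOT reachable"
  fi
done
```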

If Kerberos is enabled, the KDC (Key Distribution Center) must also be reachable from SAP on port 88 (tcp/udp).

DNS names

Proper DNS name translation should be configured between SAP and Hadoop for Kerberos communication.

Hive parameters

Two configuration parameters of the Hive service must be set in the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml:

hive.exec.dynamic.partition = true
hive.exec.dynamic.partition.mode = nonstrict
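After the safety-valve change and a Hive restart, the effective values can be verified over JDBC. A hedged example using beeline; the JDBC URL and hostname are placeholders for your HiveServer2:

```shell
# Example only: print the effective dynamic-partitioning settings.
# Requires a running HiveServer2; host and database are placeholders.
beeline -u "jdbc:hive2://hadoop01.hadoop.local:10000/default" \
        -e "SET hive.exec.dynamic.partition; SET hive.exec.dynamic.partition.mode;"
```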

Hadoop user

We recommend creating a distinct user for every SAP system connected to the Hadoop cluster in order to isolate each system's data. Hadoop users are usually kept in a central repository (LDAP/AD), but you can also create the user locally (on every Hadoop cluster node).

Each of these users should have its own dedicated user group.

  • If Hadoop Sentry is used: User groups are used for the definition of Sentry access rules.

The recommended user names mirror SAP's guideline for user names: <sid>adm => <sid>hdp.

Create the user's Kerberos principal in the form <sid>hdp@<KERBEROS_REALM>.

The user's home directory on HDFS must be created manually with appropriate permissions:

$ hdfs dfs -ls -d /user/dvqhdp
-rwxrwxr-x 3 dvqhdp supergroup 0 --- /user/dvqhdp


  • If Kerberos is used: Create a Kerberos principal for the user. Run kadmin.local on the host where the Kerberos database is running.
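For SID DVQ, the steps above might look as follows. This is a sketch, assuming local OS users (no central LDAP/AD), the realm HADOOP.LOCAL used later in this guide, and an 'hdfs' superuser account:

```shell
# Local OS user and group (repeat on every cluster node if no LDAP/AD):
groupadd dvqhdp
useradd -g dvqhdp dvqhdp

# Kerberos principal (run on the host where the Kerberos DB is running):
kadmin.local -q "addprinc -randkey dvqhdp@HADOOP.LOCAL"

# HDFS home directory with the permissions shown above (775):
sudo -u hdfs hdfs dfs -mkdir /user/dvqhdp
sudo -u hdfs hdfs dfs -chown dvqhdp /user/dvqhdp
sudo -u hdfs hdfs dfs -chmod 775 /user/dvqhdp
```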

Hive database

We recommend creating a dedicated database (schema) in Hive for each SAP system. The recommended database name is sap<sid> (e.g. sapdvq).

If Hadoop Sentry is used: Create two Sentry rules that enable all actions of the <sid>hdp user on the sap<sid> database and on the user's home directory in HDFS.
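As a sketch for SID DVQ, the database and Sentry rules could be created via beeline. The role name dvqhdp_role and the JDBC URL are assumptions, not fixed names:

```shell
# Create the Hive schema and, if Sentry is used, grant the dvqhdp group
# full access to it and to the user's HDFS home directory.
beeline -u "jdbc:hive2://hadoop01.hadoop.local:10000/default" -e "
  CREATE DATABASE IF NOT EXISTS sapdvq;
  CREATE ROLE dvqhdp_role;
  GRANT ALL ON DATABASE sapdvq TO ROLE dvqhdp_role;
  GRANT ALL ON URI 'hdfs:///user/dvqhdp' TO ROLE dvqhdp_role;
  GRANT ROLE dvqhdp_role TO GROUP dvqhdp;
"
```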

If HDFS ACL synchronization with Sentry permissions is enabled, add the user's directory to the Sentry Synchronization Path Prefixes parameter in the HDFS service configuration.

More on HDFS ACL synchronization topic can be found on Synchronizing HDFS ACLs and Sentry Permissions.

OS prerequisites (On SAP host)

This group of requirements relates to the operating system underlying the SAP system and all of its application servers. Datavard products (e.g. Datavard Glue, OutBoard DataTiering) have been developed and tested on SUSE Linux and Windows Server 2012, but by design they are not limited to a particular operating system, as long as the requirements listed in this guide are met.

Java version 

An up-to-date Java version (1.7 or higher is recommended) should be available to SAP's <sid>adm user.

If the Hadoop cluster has Kerberos enabled, a patched version of Java is required (e.g. Java 8u172).

SAP Java Connector library

The SAP Java Connector 3.0 library libsapjco3.so, which can be downloaded from the SAP marketplace, needs to be located on the SAP application server, in a directory referenced by the LD_LIBRARY_PATH environment variable of the <sid>adm user.

It is recommended to append a shared Datavard folder (/sapmnt/DVQ/global/security/dvd_conn) to the LD_LIBRARY_PATH variable of the <sid>adm user and to place libsapjco3.so in this folder.
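A minimal sketch of the corresponding profile line for <sid>adm (here dvqadm), to be added to the user's shell profile; the guard avoids a leading colon when LD_LIBRARY_PATH is empty:

```shell
# Append the shared Datavard folder to the library search path.
DVD_CONN=/sapmnt/DVQ/global/security/dvd_conn
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$DVD_CONN"
```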

$ echo $LD_LIBRARY_PATH
/usr/sap/HD4/SYS/exe/run:/usr/sap/HD4/SYS/exe/uc/linuxx86_64:/usr/sap/HD4/hdbclient:/home/hd4adm/IQ16_ODBC_CLIENT_Linux:/sapmnt/DVQ/global/security/dvd_conn

/sapmnt/DVQ/global/security/dvd_conn/ # ls -l libsapjco3.so
-rwxr-x--- 1 dvqadm sapsys 59 Apr  5 15:12 libsapjco3.so

OS directories

Hadoop connector uses two directories dedicated to its configuration and log files:

$ ls -ld /usr/sap/DVQ/dvd_conn /sapmnt/DVQ/global/security/dvd_conn
drwx------ 2 dvqadm sapsys 4096 --- /sapmnt/DVQ/global/security/dvd_conn
drwxr-xr-x 2 dvqadm sapsys 4096 --- /usr/sap/DVQ/dvd_conn

The first one (/sapmnt/<SID>/global/security/dvd_conn) is used to store drivers, Kerberos and SSL related files, and is shared among the SAP application servers.

The second one stores configuration and log files of the Java connector used by Datavard. Set its ownership and permissions appropriately for <sid>adm.

Create the directories on each SAP application server to store Storage Management related configuration and log files. The <sid>adm user needs permissions to access these folders.
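The directory creation can be sketched as follows for SID DVQ. The helper takes an optional root prefix so the layout can be tried outside '/'; in production run it without a prefix, as root, together with the chown shown in the comment:

```shell
# Create the two connector directories with the permissions from the
# listing above (700 for the shared security folder, 755 for the local one).
setup_dirs() {
  local root="${1:-}"
  mkdir -p "${root}/sapmnt/DVQ/global/security/dvd_conn" \
           "${root}/usr/sap/DVQ/dvd_conn"
  chmod 700 "${root}/sapmnt/DVQ/global/security/dvd_conn"
  chmod 755 "${root}/usr/sap/DVQ/dvd_conn"
  # as root: chown dvqadm:sapsys on both directories
}
```

Usage: `setup_dirs` (no argument) on each application server, followed by the chown.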

JDBC Drivers

The JDBC protocol is used to connect to the Hadoop services (Hive and Impala). The JDBC drivers have to be stored manually on the operating system and be accessible to the Datavard connector.

We recommend storing the drivers in a folder within the connector directory, organized in subfolders to avoid possible conflicts.

$ ls -ld /sapmnt/DVQ/global/security/dvd_conn/hive* /sapmnt/DVQ/global/security/dvd_conn/impala*
drwxr-xr-x 2 dvqadm sapsys 4096 --- /sapmnt/DVQ/global/security/dvd_conn/hive_apache
drwxr-xr-x 3 dvqadm sapsys 4096 --- /sapmnt/DVQ/global/security/dvd_conn/hive_cloudera
drwxr-xr-x 3 dvqadm sapsys 4096 --- /sapmnt/DVQ/global/security/dvd_conn/impala_cloudera


Kerberos keytab and configuration files

The Kerberos keytab of <sid>hdp principal should be exported from Kerberos database, copied into the operating system directory /sapmnt/DVQ/global/security/dvd_conn and made available to the <sid>adm user:

/sapmnt/DVQ/global/security/dvd_conn # ls -l dvqhdp.keytab
-r-------- 1 dvqadm sapsys 59 Apr  5 15:12 dvqhdp.keytab
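The export and copy can be sketched as follows; 'sapdvq01' is a placeholder for the SAP host, and the commands assume the KDC host from this guide's examples:

```shell
# On the KDC host: export the keytab for the dvqhdp principal.
kadmin.local -q "xst -k /tmp/dvqhdp.keytab dvqhdp@HADOOP.LOCAL"

# Copy it to the shared folder on the SAP host and restrict it to dvqadm.
scp /tmp/dvqhdp.keytab sapdvq01:/sapmnt/DVQ/global/security/dvd_conn/
ssh sapdvq01 "chown dvqadm:sapsys /sapmnt/DVQ/global/security/dvd_conn/dvqhdp.keytab && \
              chmod 400 /sapmnt/DVQ/global/security/dvd_conn/dvqhdp.keytab"
```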

At the same location, the Kerberos configuration file should be created or copied and made readable for the user <sid>adm.

Usually a suitable krb5.conf file can be found on Hadoop nodes. Here is a sample of Kerberos configuration file:

/sapmnt/DVQ/global/security/dvd_conn # ls -l kerberos.cfg
-r-------- 1 dvqadm sapsys 393 Feb 22 16:28 kerberos.cfg

/sapmnt/DVQ/global/security/dvd_conn # cat kerberos.cfg
[libdefaults]
default_realm = HADOOP.LOCAL
dns_lookup_kdc = false
dns_lookup_realm = false
ticket_lifetime = 86400
renew_lifetime = 604800
forwardable = true
default_tgs_enctypes = rc4-hmac
default_tkt_enctypes = rc4-hmac
permitted_enctypes = rc4-hmac
udp_preference_limit = 1
kdc_timeout = 3000
[realms]
HADOOP.LOCAL = {
kdc = hadoop01.hadoop.local
admin_server = hadoop01.hadoop.local
}
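An optional sanity check, run as dvqadm on the SAP host: the keytab and configuration file from this section should yield a ticket. This is a sketch, assuming MIT Kerberos client tools are installed:

```shell
# Point the Kerberos client at the config file from this section and
# authenticate with the exported keytab; klist then shows the ticket.
export KRB5_CONFIG=/sapmnt/DVQ/global/security/dvd_conn/kerberos.cfg
kinit -kt /sapmnt/DVQ/global/security/dvd_conn/dvqhdp.keytab dvqhdp@HADOOP.LOCAL
klist
```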

SSL Certificates for Java

If your Hadoop cluster has full TLS encryption enabled, it is necessary to create a Java truststore and save it at the following path with correct ownership and permissions:

/sapmnt/<SID>/global/security/dvd_conn # ls -l jssecacerts
-r-------- 1 <SID>adm sapsys 59 Apr  5 15:12 jssecacerts

Server certificates of the hosts that Storage Management communicates with via JDBC should be placed in this truststore; these are usually HiveServer2 and the Impala daemons. An alternative is to copy the complete jssecacerts truststore from any Hadoop node into this path.
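Building the truststore can be sketched with keytool; the alias names, certificate file names, and the store password 'changeit' are examples, not fixed values:

```shell
# Import the HiveServer2 and Impala host certificates into a new
# truststore in the shared Datavard folder, then restrict access.
cd /sapmnt/DVQ/global/security/dvd_conn
keytool -importcert -noprompt -keystore jssecacerts -storepass changeit \
        -alias hiveserver2 -file hadoop01_hiveserver2.pem
keytool -importcert -noprompt -keystore jssecacerts -storepass changeit \
        -alias impalad -file hadoop02_impalad.pem
chown dvqadm:sapsys jssecacerts && chmod 400 jssecacerts
```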

SAP prerequisites

Kerberos cookie encoding

By default, the SAP system encodes certain characters in cookies. The SAP note 1160362 describes the behavior in detail. As the Kerberos cookie must not be modified in any way for the Kerberos server to accept it, this encoding should be disabled by setting the following parameter in each SAP application server instance profile:

Value for the SAP Kernel versions lower than 7.53 patch level 5
ict/disable_cookie_urlencoding = 1

Incompatible Kernel version

SAP kernel 7.53 patch level 5 introduced a change in this parameter which causes Storage Management to malfunction. Therefore, Hadoop storage doesn't work on SAP kernel 7.53 patch levels 5-222. As of SAP kernel 7.53 patch level 223, it is possible to set this value to "2", which restores the desired behavior. The issue is described in SAP Note 2681175.

Value for the SAP Kernel versions higher than 7.53 patch level 222
ict/disable_cookie_urlencoding = 2

The parameter is dynamic starting from the version 7.53.

SAP RFC role and user

The Java connector uses a dedicated user in the SAP system for communication. In our reference configuration, we use the username 'hadoop'. This user should be created with the type 'Communications Data' and with authorizations limiting its privileges to basic RFC communication.

Authorization object required is S_RFC with these settings:

  • ACTVT = 16
  • RFC_NAME = SYST, RFC1, SDIFRUNTIME
  • RFC_TYPE = FUGR


Example of custom SAP role in PFCG transaction (Display Authorization Data):

SSL for SAP RFCs

To enable SSL communication for SAP RFCs, add client certificates of target Hadoop nodes to SAP certificate list in the transaction STRUST.

If HttpFS is used, client certificate of the HttpFS host is required.

If WebHDFS is used, client certificates of all datanodes and namenodes are necessary.

To import necessary certificates:

  1. Use the transaction STRUST.
  2. In the left menu choose certificate list that you want to add the certificate to (by double-clicking).
  3. In the right window area on left bottom click Import.
  4. In tab FILE of dialog window point to certificate file on your local file system (certificate should be in .pem format).
  5. After you confirm the path and SAP is able to recognize the certificate, details are displayed in the corresponding fields.
  6. To complete adding the certificate click Add to Certificate List.
  7. Click Save (in general menu or Ctrl+S).
  8. Transaction SMICM: top menu > Administration > ICM > Exit Soft > Local/Global.


Check the following parameters (typically they are running and set):

HTTP service and ICM parameters

The HTTP service must be active in the SAP system. It can be checked via the transaction SMICM > [Goto] > Services.

There are two important parameters affecting HTTP communication of SAP system:

  • icm/HTTP/client/keep_alive_timeout – HTTP communication timeout, can be raised in the case of HTTP communication failing in timeout.
  • icm/HTTP/max_request_size_KB – maximum size of data which ICM accepts (default 100 MB).

SAP gateway access

External communication with the SAP system goes through the SAP gateway. If the SAP system parameter gw/acl_mode is enabled, two files (secinfo and reginfo) limit the access.

In this case, two programs need to be granted access, either explicitly or by a wildcard definition:

  • AUTH_CONN
  • JAVA_CONN

The program IDs are defined later in this guide within the step 194153621. More information on the SAP gateway ACL topic can be found on the SAP web site Gateway Security Files secinfo and reginfo.

Configuration

When all prerequisites are fulfilled, further configuration is performed from the SAP system.

RFC Destinations

There are three RFC connections which should be created through the transaction SM59.

HttpFS RFC

This RFC connection is used for communication with Hadoop's HttpFS service which mediates operations in HDFS.

Two RFCs pointing to different HttpFS services can be created to ensure High Availability.

The name and description of the destination is optional, but it is recommended to designate its purpose with keywords 'Hadoop' and 'HttpFS'. In our example, the RFC destination also contains the Hadoop server hosting HttpFS service for the sake of clarity:

Entries explained:

  • Connection Type – G for HTTP connection to an external service
  • Target host – FQDN of Hadoop server hosting HttpFS service
  • Service No. – port number on which HttpFS service is listening (default is 14000)
  • Path Prefix – this string consists of two parts
    1. /webhdfs/v1 part is mandatory
    2. /user/dvqhdp part defines Hadoop user's 'root' directory in HDFS where flat files from SAP system are loaded

If SSL is used: Enable SSL and select the client certificate list to be used in the Logon & Security tab.
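Before creating the RFC, the HttpFS endpoint behind it can be probed directly. A sketch using curl, with the example host, port, path prefix, and user from this guide:

```shell
# GETFILESTATUS against the user's HDFS home directory via HttpFS;
# a JSON FileStatus response indicates the endpoint and path are valid.
curl -s "http://hadoop01.hadoop.local:14000/webhdfs/v1/user/dvqhdp?op=GETFILESTATUS&user.name=dvqhdp"
```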

Authentication RFC

This RFC connection is part of the authentication mechanism towards any Hadoop cluster in a kerberized environment. 

The RFC setup is very basic as parameters affecting authentication are defined elsewhere. It is recommended to use the generic RFC name 'HADOOP_AUTH_CONN':

Entries explained:

  • Connection Type – T for TCP/IP Connection
  • Activation Type – select Registered Server Program
  • Program ID – AUTH_CONN

It is important to enable the AUTH_CONN program registration in the SAP gateway (SAP Gateway Access Control Lists).

Java RFC

The Java RFC, as its name suggests, refers to the Java service which is used for communication with Hadoop services. Define the parameters of the Java connector in separate tables:

Entries explained:

  • Connection Type – T for TCP/IP Connection
  • Activation Type – select Registered Server Program
  • Program ID – JAVA_CONN

It is important to enable the JAVA_CONN program registration in the SAP gateway (SAP Gateway Access Control Lists).

Java connector setup

Java connectors are configured using files that you define in SAP system.
They are configured in the following steps:

Logical file path definition

The first step is to map logical path ZDVD_DIR to the OS path where the files are stored. The actual OS path is created in the section Datavard connector directories (/sapmnt/<SID>/global/security/dvd_conn).

Kerberos logical file definition

Before setting up the Hadoop storage itself, three files are required for successful Kerberos authentication. Define them as logical names in the SAP system via the FILE transaction.

After the logical path is defined, follows the file definition:


ZHADOOP_KRB_KEYTAB and ZHADOOP_KRB_CONFIG refer to the Kerberos keytab of the <sid>hdp user and the Kerberos configuration file defined in the section Kerberos keytab. ZHADOOP_CDH_JAAS refers to the custom Cloudera driver configuration file, which is generated during the storage activation.

SSL logical file definition

If Hadoop cluster resides in a safe environment which is accessible only with SSL authentication, define the following logical file:


Drivers logical file definition

As described in 194153621, JDBC drivers for the Hadoop service connection are stored on the operating system underlying the SAP system. Define them also as logical names in the SAP system via the FILE transaction.

In our example, we are using Hive and Impala JDBC Drivers provided by Cloudera. Definition of driver specific folders looks as follows:

ZHADOOP_HIVE_CLOUDERA_DRIVERS and ZHADOOP_IMPALA_CLOUDERA_DRIVERS refer to the folders in which Hive JDBC drivers and Impala JDBC drivers provided by Cloudera have been placed in the section JDBC Drivers.


Java connector configuration

There are two Java connectors which Storage Management works with: the connector HADOOP_AUTH_CONN is used for Kerberos authentication, and the connector HIVE_CDH runs SQL queries against Hadoop services. They run as two autonomous Java processes with their own log files, and they are configured by setting up parameters in three configuration tables, which are imported into the SAP system as part of the Datavard transports.

You should manually configure the following tables:
/DVD/HDP_CUS_C – contains information for connector authentication; it connects the HttpFS RFC with the authentication files.

If two HttpFS RFCs were created, an entry needs to be created in this table for both of them.

The table can be maintained through the transaction SM30.

Sample entry:

Entries explained:

  • Destination – HttpFS RFC destination created in HttpFS RFC
  • User Name – Hadoop user principal created in Hadoop user, group and HDFS directory
  • Auth. Method – authentication method towards the Hadoop cluster
  • Krb. Keytab – logical file definition for Kerberos keytab file
  • Krb. Config – logical file definition for Kerberos configuration file
  • Krb. Service RFC – authentication RFC destination created in Authentication RFC
  • SSL Keystore – logical file definition for SSL keystore
  • SSL Password – password for accessing SSL keystore


/DVD/JAVA_HDR – maps an RFC destination to the Datavard Java connector version.

If required, the table entries can be maintained through the SE16 transaction.

Table example:

/DVD/JAVA_CONFIG – stores parameters for the Datavard Java connectors. Fill in the table entries through the transaction SE16.

Sample configuration:

The Java service is started with a system command. You can adjust the name of this command in the table /DVD/JAVA_CONFIG with the parameter JAVA_START_CMD; the default name is ZDVD_START_JAVA. If the system command doesn't exist, it is created automatically. You can view system commands through the transaction SM69.

On Linux, another system command is required to set executable rights for the configuration files (chmod 755 <filename>). Its name can be adjusted with the parameter CHMOD_CMD; the default value is ZDVD_CHMOD.


Prerequisites:

Several entries in the /DVD/JAVA_CONFIG table define the file path and filename of generated files. These are marked as (generated) in the detailed explanation below.

Please note that, as a prerequisite, you must create all the paths (folders only) specified in the table beforehand.

Entries explained:

  • Client – ID of the SAP system source data client
  • CONFIG_AS_PATH – path and configuration file specifying authentication information towards the SAP system to Java connector (generated)
  • CONFIG_PATH – path and configuration file specifying program ID and SAP gateway information to Java connector (generated)
  • CONN_COUNT – number of connections registered at the gateway
  • JAR_PATH – path and filename of Datavard Java connector JAR-file. This jar is generated (content is part of transport) in the specified folder and filename. (to be generated)
  • JAVA_EXE – path to Java binary used by <sid>adm user
  • LOG_FILE – path and filename of log file related to the Java connector (generated)
  • MAX_RAM_USED – the amount of memory dedicated to the Java connector
  • USERNAME – RFC user defined in SAP RFC role and user
  • PASSWORD – password of an RFC user (defined above) in a hashed form. Use instructions from the part 194153621 to generate a hash.
  • PEAK_LIMIT – maximum number of connections that can be created for a destination simultaneously
  • PROG_ID – registered program ID defined in Authentication RFC (version 32) and Hive RFC (version 34)
  • REP_DEST – Java connector client destination from which to obtain the repository
  • WORK_THREAD_MAX – number of threads that can be used by Java Connector
  • WORK_THREAD_MIN – number of threads always kept running by Java Connector
  • JAVA_START_CMD - name of the system command, which starts Java
  • CHMOD_CMD - name of the system command for setting executable rights
Password hash generator

Use the report /DVD/XOR_GEN for this purpose.


Storage Management setup

A generic Datavard software component, the “Reuse Library”, is used for the setup. The required component is “Storage Management for Hadoop”.

Datavard Storage Management facilitates transparent communication with different types of storage, which includes various types of databases and Hadoop: HDFS for flat files and Hive for structured data.

HDFS storage

In order to transparently store data, you should define two types of Hadoop storages in Storage Management:

  • HDFS storage which facilitates transfer of files to HDFS through the Hadoop HttpFS service
  • Hive storage which enables data replication between SAP tables and Hive tables

The third type of storage is required for efficient querying of data located in Hive:

  • Impala storage which connects to Impala agents to provide fast SQL execution by leveraging Impala in-memory data caching

Create the HDFS storage through the transaction:

/DVD/SM_SETUP > [Edit mode] > [New storage]

Entries explained:

  • Storage ID – name of the storage
  • Storage Type – choose HADOOP for HDFS
  • Description – extended description of the storage for easier identification
  • RFC Destination – HttpFS RFC destination defined in HttpFS RFC

Complete the creation of the storage by confirming (F8). If the SAP system can authenticate against Hadoop Kerberos and get properties of HDFS home directory (/user/<sid>hdp) from HttpFS service, storage creation is considered successful.

Hive metastore storage

The Hive metastore storage is created in a similar way as the process of setting up the HDFS storage, but the values are different:

Entries explained:

  • Storage ID – name of the storage
  • Storage Type – choose SM_TRS_MS for Hive
  • Description – extended description of the storage for easier identification
  • Database – Hive database created in Hive database
  • Hive host – Hadoop server hosting the Hive service
  • Hive host for high availability – HA Hive host
  • Hive username – Hadoop user created in Hadoop user, group and HDFS directory
  • Hive password – password for the Hive user
  • Impala host – Hadoop server hosting the Impala service
  • Impala host for high availability – HA Impala host
  • Impala username – Hadoop user created in Hadoop user, group and HDFS directory
  • Staging location type – storage location for the data staging area (external CSV tables)
  • Staging location URL (non-default) – URL address for the data staging area (e.g. Azure Data Lake)
  • HTTP RFC Destination – HttpFS RFC destination defined in HttpFS RFC
  • HTTP RFC Destination (HA) – HttpFS RFC destination (High Availability)
  • Java connector RFC – Hive RFC destination defined in Hive RFC
  • Load Engine – engine used for loading (writing) data, e.g. Hive or Impala
  • Read Engine – engine used for reading data, e.g. Hive or Impala
  • Load driver classname – classname of the driver used for loading (e.g. Cloudera Hive: com.cloudera.hive.jdbc41.HS2Driver)
  • Load driver path – logical name of the load driver file
  • Read driver classname – classname of the driver used for reading (e.g. Cloudera Impala: com.cloudera.impala.jdbc41.Driver)
  • Read driver path – logical name of the read driver file
  • Use custom connection string – if checked, the custom connection string is used
  • Custom connection string – standard settings are ignored; the custom connection string is used instead
  • Use Kerberos – checked in case the Hadoop cluster is kerberized
  • Kerberos config file path – logical name of the Kerberos configuration file defined in Kerberos logical file definition
  • Hive service principal – Kerberos principal of the Hive service; must reflect the Hive host
  • Impala service principal – Kerberos principal of the Impala service; must reflect the Impala host
  • Kerberos keytab path – logical name of the Kerberos principal keytab file defined in Kerberos logical file definition
  • Kerberos user principal – user name (Kerberos principal) defined for the connection to Hadoop
  • File Type – file format in which Hive stores table data on HDFS
  • Compression codec – compression codec used for storing data on HDFS
  • HDFS Permissions – UNIX permissions for files created on HDFS
  • Use Cloudera drivers – always checked if the Hadoop cluster is a Cloudera distribution
  • Cloudera driver config path – logical name of the Cloudera driver configuration file defined in Kerberos logical file definition
  • Skip trash – checked if deleted HDFS files shouldn't be moved to trash
  • Impala port – Impala JDBC port
  • Hive port – Hive JDBC port
  • Hints for Hive/Impala – hints that can be specified for the JDBC connection, separated by ; (e.g. SYNC_DDL=TRUE;UseNativeQuery=1)
  • Open cursor logic – select which logic is used for reading via cursor
  • Number of repetitions for WebHDFS – number of times an HDFS request is repeated in the case of failure
  • Number of repetitions for JDBC – number of times JDBC requests are repeated in the case of failure
  • Use compression on transfer – checked in case compression is used for files created on HDFS
  • Compression level – level of compression (0 = minimum, 9 = maximum)
  • Use SSL – checked if SSL authentication should be used
  • SSL keystore path – logical name of the SSL keystore file
  • SSL keystore password – password to the keystore
  • Force file cursor reader (expert setting) – the cursor reader is used all the time when reading data from Hadoop
  • Use extended escaping (expert setting) – extended escaping is used all the time when writing data to Hadoop
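Several of these settings (host, port, database, SSL, service principal) are ultimately combined into a JDBC connection string, and the custom connection string option replaces exactly that. A hedged sketch of what such a Hive JDBC URL typically looks like; all values are the examples used in this guide, and the exact property names depend on the driver:

```shell
# Assemble an example Hive JDBC URL from the storage settings.
HIVE_HOST=hadoop01.hadoop.local
HIVE_PORT=10000
HIVE_DB=sapdvq
JDBC_URL="jdbc:hive2://${HIVE_HOST}:${HIVE_PORT}/${HIVE_DB};ssl=1;principal=hive/${HIVE_HOST}@HADOOP.LOCAL"
echo "$JDBC_URL"
```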

Finish the creation of the storage by confirming (F8). If the SAP system is able to authenticate against Hadoop Kerberos and receives the expected result of the SQL command 'use database', the creation of the storage is considered successful.

Troubleshooting

If there are any issues with the Java connector, you can find the Java logs on the application server to determine the source of the issue with the report /DVD/SM_HIVE_DISPLAY_JAVA_LOG.

Entries explained:

  • RFC Destination – destination of the RFC with the Java service
  • Num. of last lines to display – how many trailing lines of the log should be displayed
  • Name of file with log – filename of the Java log file (in the default destination)
  • Is log archived? – check if you need to display an archived log (the date is then obligatory)
  • Date of log – date of the archived log which should be displayed