Prerequisites

Open ports

In a controlled network environment, it is common to have firewall rules in place. In order to enable communication of SAP systems with Hadoop, the following port numbers need to be reachable in Hadoop cluster from the SAP system:

PortTypeHadoop serviceComment
10000tcpHive
11000tcpOozieOptional feature for Datavard Glue
14000tcpHttpFSAlternative is webHDFS on port 50070
21050tcpImpala

These are default port numbers of Hadoop services.

If Kerberos is enabled, also KDC (Key Distribution Center) needs to be reachable from SAP on port 88 (tcp/udp).

DNS names

Proper DNS name translation needs to be configured between SAP and Hadoop for Kerberos communication.

Hive parameters

There are two configuration parameters of the Hive service which need to be properly configured in Hive Service Advanced Configuration Snippet (Safety Valve) for Hive-site.xml

hive.exec.dynamic.partition = true
hive.exec.dynamic.partition.mode = nonstrict

Example:

Hadoop user

Datavard recommends to create distinct users for every SAP system connected to the Hadoop cluster in order to isolate each system's data. 
There is usually central repository for hadoop users (LDAP/AD), but you can also create the user locally (on every hadoop cluster node).

Each of these users has to have its own dedicated user group.
If Hadoop Sentry is used: User groups will be used for the definition of Sentry access rules.

The recommended user names are mirroring SAP's guideline for user names: <sid>adm => <sid>hdp.

Create the user's Kerberos principal in form of <sid>hdp@<KERBEROS_REALM>.
The user's home directory on HDFS has to be manually created with appropriate permissions:

$ hdfs dfs -ls -d /user/dvqhdp
-rwxrwxr-x 3 dvqhdp supergroup 0 --- /user/dvqhdp


If Kerberos is used: Create Kerberos principal for user. You need to run kadmin.local on host where Kerberos DB is running

Hive database

Datavard recommends to create a dedicated database (schema) in Hive for each SAP system. The recommended database name is sap<sid> (sapdvq).

If Hadoop Sentry is used: Two Sentry rules have to be created, which will enable all actions of <sid>hdp user on sap<sid> database and his home directory in HDFS.
Example: If HDFS ACL synchronization with Sentry permissions is enabled, the user's directory has to be added to the Sentry Synchronization Path Prefixes parameter in the HDFS service configuration.
More on HDFS ACL synchronization topic can be found on https://www.cloudera.com/documentation/enterprise/latest/topics/sg_hdfs_sentry_sync.html

OS prerequisites (On SAP host)

This group of requirements relates to the operating systems underlying SAP system with all its application servers. Datavard products (e.g. Datavard Glue, OutBoard DataTiering) has been developed and tested on the SUSE Linux environment, but by design is not limited by the choice of operating system, if requirements listed in this guide are met.

Java version

An up to date Java version (recommended is 1.7 or higher) needs to be available to the SAP's <sid>adm user.

If Hadoop cluster has Kerberos enabled, patched version of Java is required (e.g. Java 8u172).


SAP Java Connector library

SAP Java Connector 3.0 library libsapjco3.so needs to be accessible, which is available for download from the SAP marketplace. It needs to be present in the SAP kernel directory or in other directory pointed by LD_LIBRARY_PATH environment variable of <sid>adm user.

$ which libsapjco3.so
/usr/sap/DVQ/SYS/exe/uc/linuxx86_64/libsapjco3.so

OS directories

Hadoop connector uses two directories dedicated to its configuration and log files:

$ ls -ld /usr/sap/DVQ/dvd_conn /sapmnt/DVQ/global/security/dvd_conn
drwx------ 2 dvqadm sapsys 4096 --- /sapmnt/DVQ/global/security/dvd_conn
drwxr-xr-x 2 dvqadm sapsys 4096 --- /usr/sap/DVQ/dvd_conn

The first one (/sapmnt/<SID>/global/security/dvd_conn) is used to store Kerberos and SSL related files and is shared among SAP application servers.

The second will store drivers, configuration and log files of Java connector used by Datavard Glue. Ownership and permissions need to be set appropriately to <sid>adm

Create directories on each SAP application server to store Storage Management related configuration and log files. <sid>adm user needs permissions to access those folders.

JDBC Drivers

JDBC protocol is used to connect to Hadoop services (Hive and Impala). JDBC drivers have to be manually stored on the operating system and accessible to Datavard connector.

Datavard recommends to store the drivers in a folder within the connector directory, organized in subfolders to avoid possible conflicts.

$ ls -ld /usr/sap/DVQ/dvd_conn/drivers/hive* /usr/sap/DVQ/dvd_conn/drivers/impala*
drwxr-xr-x 2 dvqadm sapsys 4096 --- /usr/sap/DVQ/dvd_conn/drivers/hive_apache
drwxr-xr-x 3 dvqadm sapsys 4096 --- /usr/sap/DVQ/dvd_conn/drivers/hive_cloudera
drwxr-xr-x 3 dvqadm sapsys 4096 --- /usr/sap/DVQ/dvd_conn/drivers/impala_cloudera


Kerberos keytab and configuration files

The Kerberos keytab of <sid>hdp principal needs to be exported from Kerberos database, and copied into the operating system directory /sapmnt/DVQ/global/security/dvd_conn and made available to the <sid>adm user:

/sapmnt/DVQ/global/security/dvd_conn # ls -l dvqhdp.keytab
-r-------- 1 dvqadm sapsys 59 Apr  5 15:12 dvqhdp.keytab


At the same location, the Kerberos configuration file needs to be created/copied, and made readable for the user <sid>adm.
Here is a sample of Kerberos configuration file:

/sapmnt/DVQ/global/security/dvd_conn # ls -l kerberos.cfg
-r-------- 1 dvqadm sapsys 393 Feb 22 16:28 kerberos.cfg

/sapmnt/DVQ/global/security/dvd_conn # cat kerberos.cfg
[libdefaults]
default_realm = HADOOP.LOCAL
dns_lookup_kdc = false
dns_lookup_realm = false
ticket_lifetime = 86400
renew_lifetime = 604800
forwardable = true
default_tgs_enctypes = rc4-hmac
default_tkt_enctypes = rc4-hmac
permitted_enctypes = rc4-hmac
udp_preference_limit = 1
kdc_timeout = 3000
[realms]
HADOOP.LOCAL = {
kdc = hadoop01.hadoop.local
admin_server = hadoop01.hadoop.local
}

SSL Certificates for Java

If your Hadoop cluster has full TLS encryption enabled, it’s necessary to create Java truststore and save it on following path with correct ownership and permissions:

/sapmnt/<SID>/global/security/dvd_conn # ls -l jssecacerts
-r-------- 1 <SID>adm sapsys 59 Apr  5 15:12 jssecacerts

Server certificates of hosts, that Glue will communicate with, need to be placed in this truststore. Alternative option is to copy the complete jssecacerts truststore from any Hadoop node and place it in this path.

SAP prerequisities

Kerberos cookie encoding

The SAP system by default encodes certain characters in cookies. SAP note 1160362 describes the behaviour in detail. As the Kerberos cookie must not be anyhow modified for the Kerberos server to accept it, this encoding needs to be disabled by setting following parameter in each SAP application server instance profile:

ict/disable_cookie_urlencoding = 1

The parameter is static.

SAP RFC role and user

The Java connector uses a dedicated user in the SAP system for communication. In our reference configuration we use the username 'hadoop'. This user needs to be created with type 'Communications Data' and with an authorizations limiting his privileges to basic RFC communication.

Authorization object required is S_RFC with these settings:


Example of custom SAP role in PFCG transaction (Display Authorization Data):


SSL for SAP RFCs

To enable SSL communication for SAP RFCs, client certificates of target Hadoop nodes need to be added to SAP certificate list in transaction STRUST.

If HttpFS is used, client certificate of HttpFS host is needed.

If WebHDFS is used, client certificates of all datanodes are and namenodes are necessary.

To import necessary certificates:

  1. Use transaction STRUST
  2. In the left menu choose certificate list that you want to add certificate to (by double clicking)
  3. In the right window area on left bottom click import button
  4. In tab FILE of dialog window point to certificate file on your local file system (certificate should be in .pem format)
  5. After you confirm the path and SAP is able to read/recognize certificate, details will show in corresponding fields
  6. To finish adding certificate click on Add to Certificate List
  7. Click on Save (in general menu or Ctrl+S)
  8. Transaction SMICM: top menu > Administration > ICM > Exit Soft > Local/Global


Following parameters needs to be checked (typically it's running and set):

HTTP service and ICM parameters

The HTTP service has to be active in SAP system. It can be checked via transaction
SMICM > [Goto] > Services

There are two particularly important parameters affecting HTTP communication of SAP system:

SAP gateway access

External communication with SAP system goes through the SAP gateway. If SAP system parameter gw/acl_mode is enabled, there are two files (secinfo and reginfo) which limit the access.

In such case two programs need to have granted access either explicitly or by wildcard definition:

The program IDs are defined later in this guide within the step TCP/IP RFC creation. More information on the SAP gateway ACL topic can be found on the SAP web site:

https://help.sap.com/saphelp_nw73/helpdata/en/e2/16d0427a2440fc8bfc25e786b8e11c/content.htm


CONFIGURATION

When all prerequisites are fulfilled, further configuration is done from within the SAP system.

RFC Destinations

There are three RFC connections which need to be created via transaction SM59.

HttpFS RFC

This RFC connection is used for communication with Hadoop's HttpFS service which mediates operations in HDFS.

The name and description of the destination is optional, but it is recommended to designate its purpose with keywords 'Hadoop' and 'HttpFS'. In our example, the RFC destination also contains the Hadoop server hosting HttpFS service for the sake of clarity:

Entries explained:

If SSL is used: It’s necessary to enable SSL and add client certificate list to be used in Logon & Security tab.

Authentication RFC

This RFC connection is part of the authentication mechanism towards any Hadoop cluster in a kerberized environment. 

The RFC setup is very basic as parameters affecting authentication are defined elsewhere. It is recommended to use the generic RFC name 'HADOOP_AUTH_CONN':

Entries explained:

It is important to enable the AUTH_CONN program registration in the SAP gateway (SAP gateway access).

Java RFC

Java RFC by name refers to the Java service which is used for communication with Hadoop services. Again, the setup is basic, and parameters of the Java connector are defined in separate tables:

Entries explained:

It is important to enable the JAVA_CONN program registration in the SAP gateway (SAP gateway access).

Java connector setup

Java connectors are configured using files that has to to be defined in SAP system.
They are configured in the following steps.

Logical file path definition

The first step is to map logical path ZHADOOP_SECURITY to the OS path where the files are stored. The actual OS path is created in the section  Datavard connector directories (/sapmnt/<SID>/global/security/dvd_conn).

Kerberos logical file definition

Before actually setting up the Hadoop storage, there are three files which are required for successful Kerberos authentication. They need to be defined as logical names to the SAP system via the FILE transaction.

When the logical path is defined, file definition follows:

ZHADOOP_KRB_KEYTAB and ZHADOOP_KRB_CONFIG refer to Kerberos keytab of <sid>hdp user and Kerberos configuration file defined in section Kerberos keytab and configuration files respectively. ZHADOOP_CDH_DRIVER refers to the custom Cloudera driver configuration file, which will be generated during the storage activation.

SSL logical file definition

If Hadoop services cluster resides in safe enviroment which is accessible only with SSL authentication, the following logical file needs to be defined as follows:

Drivers logical file definition

As described in JDBC Drivers section, JDBC drivers for Hadoop services connection are stored on the operating systems underlying SAP system. They also need to be defined as logical names to the SAP system via the FILE transaction.

In our example we will be using Hive and Impala JDBC Drivers provided by Cloudera. The first step is to map logical path ZJDBC_DRIVER_PATH to the OS path where the files are stored (in our case /urs/sap/<SID>/dvd_conn/drivers/).

Example:

When the logical path is defined, definition of driver specific folders follows:

ZJDBC_HIVE_CLOUDERA_JARS and ZJDBC_IMPALA_CLOUDERA_JARS refer to the folders in which Hive JDBC drivers and Impala JDBC drivers provided by Cloudera have been placed in section JDBC Drivers.


Java connector configuration

There are two Java connectors which Datavard Glue works with. One is used for the Kerberos authentication, the other runs SQL queries against Hadoop services. They run as two autonomous Java processes with their own log file and they are configured by setting up parameters in three configuration tables, which are imported to the SAP system as part of the Datavard Glue transports.

The ones which need to be configured manually are:
/DVD/HDP_CUS_C – contains information for connector authentication, it connects HttpFS RFC with authentication files.

The table can be maintained via transaction SM30.

Sample entry:

Entries explained:


/DVD/JAVA_HDR
– maps a RFC destination to the Datavard Java connector version.

Datavard GLUE currently uses two Java connectors, HADOOP_AUTH_CONN for authentication process and HIVE_CDH for submitting of queries to Hive. 

Table entries can be, if needed, maintained via transaction .

Table example:


/DVD/JAVA_CONFIG – stores parameters for Datavard Java connectors. The table needs to be populated with entries via transaction SE16.

Sample configuration:

Prerequisites:

Multiple entries from /DVD/JAVA_CONFIG table define location and filename where a specific file will be generated. These will be marked as (generated) in detailed explanation bellow.

Please note, that as a prerequisite, all the paths specified in the table must be created beforehand (only folders).

Entries explained:

Password hash generator

Use report /DVD/XOR_GEN for this purpose.











Storage Management setup

A generic Datavard software component: “Reuse Library” is used for the setup. The required component is “Storage Management for Hadoop”.

Datavard Storage Management facilitates transparent communication with different types of storages, which includes various types of databases and Hadoop: HDFS for flat files and Hive for structured data.

HDFS storage

In order to transparently store data, two types of Hadoop storages are requeired to be defined in Storage Management:

The third type of storage is required for efficient querying of data located in Hive:

HDFS storage creation is done via transaction:

/DVD/SM_SETUP > [Edit mode] > [New storage]

Entries explained:

Finish the creation of the storage by confirming (F8). If SAP system is able to authenticate against Hadoop Kerberos and get properties of HDFS home directory (/user/<sid>hdp) from HttpFS service, storage creation is considered successful.

Hive metastore storage

The Hive metastore storage is created in a very similar way to the process of setting up the HDFS storage, but the values are different:

Entries explained:

Finish the creation of the storage by confirming (F8). If the SAP system is able to authenticate against Hadoop Kerberos and receives the expected result of the SQL command 'use database', the creation of the storage is considered successful.

Troubleshooting

In case of any issues with Java connector, Java logs can be read from an application server to determine source of issue with report /DVD/SM_HIVE_DISPLAY_JAVA_LOG.

Entries explained: