...
Warning
The Java service is started with a system command. You can adjust the name of this command in the table /DVD/JAVA_CONFIG with the parameter JAVA_START_CMD. The default name of the command is ZDVD_START_JAVA. In case the system command doesn't exist, it is created automatically. You can view the system commands through transaction SM69. On Linux, another system command, which sets executable rights for the configuration files (chmod 755 <filename>), is required. Its name can be adjusted with the parameter CHMOD_CMD; the default value is ZDVD_CHMOD.
Prerequisites
Open Ports
In a controlled network environment, it is common to have firewall rules in place. In order to enable communication of SAP systems with Hadoop, the following port numbers need to be reachable in the Hadoop cluster from the SAP system:
Port | Type | Hadoop service | Comment |
---|---|---|---|
10000 | tcp | Hive | |
11000 | tcp | Oozie | Optional feature for Datavard Glue |
14000 | tcp | HttpFS | Alternative is webHDFS on port 50070 |
21050 | tcp | Impala | |
These are the default port numbers of Hadoop services.
If Kerberos is enabled, the KDC (Key Distribution Center) also needs to be reachable from SAP on port 88 (tcp/udp).
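A quick way to verify reachability from the SAP host is a plain TCP check, for example with nc; the hostname below is illustrative:
Code Block
# run as <sid>adm on each SAP application server
nc -vz hadoop01.hadoop.local 10000   # Hive
nc -vz hadoop01.hadoop.local 14000   # HttpFS
nc -vz hadoop01.hadoop.local 21050   # Impala
nc -vz hadoop01.hadoop.local 88      # Kerberos KDC (if Kerberos is enabled)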
DNS names
Proper DNS name translation needs to be configured between SAP and Hadoop for Kerberos communication.
Hive parameters
Two configuration parameters of the Hive service need to be set in the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml:
Code Block
hive.exec.dynamic.partition = true
hive.exec.dynamic.partition.mode = nonstrict
Example:
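To verify that the values are effective, a hedged check with the Beeline client can be run from any host that reaches Hive; the host, port, and database are illustrative (on a Kerberized cluster the JDBC URL additionally needs the Hive service principal):
Code Block
beeline -u "jdbc:hive2://hadoop01.hadoop.local:10000/default" \
  -e "SET hive.exec.dynamic.partition; SET hive.exec.dynamic.partition.mode;"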
Hadoop user
We recommend creating distinct users for every SAP system connected to the Hadoop cluster in order to isolate each system's data.
There is usually a central repository for Hadoop users (LDAP/AD), but you can also create the user locally (on every Hadoop cluster node).
...
If Kerberos is used: Create a Kerberos principal for the user. You need to run kadmin.local on the host where the Kerberos database is running.
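A minimal sketch of the principal creation, assuming the example user dvqhdp and the realm HADOOP.LOCAL used elsewhere in this guide (both illustrative):
Code Block
# on the host running the Kerberos database, as root
kadmin.local -q "addprinc -randkey dvqhdp@HADOOP.LOCAL"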
Hive database
Datavard recommends creating a dedicated database (schema) in Hive for each SAP system. The recommended database name is sap<sid> (e.g. sapdvq).
If Hadoop Sentry is used: Two Sentry rules have to be created which grant the <sid>hdp user all actions on the sap<sid> database and on its home directory in HDFS (a hedged sketch follows below).
Example:
If HDFS ACL synchronization with Sentry permissions is enabled, the user's directory must be added to the Sentry Synchronization Path Prefixes parameter in the HDFS service configuration.
More on the HDFS ACL synchronization topic can be found at https://www.cloudera.com/documentation/enterprise/latest/topics/sg_hdfs_sentry_sync.html
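A minimal sketch of the two Sentry rules mentioned above, issued through Beeline by a user holding a Sentry admin role; the role and group names, the database sapdvq, the NameNode address, and the HDFS path are illustrative:
Code Block
beeline -u "jdbc:hive2://hadoop01.hadoop.local:10000/default" -e "
  CREATE ROLE sapdvq_role;
  GRANT ALL ON DATABASE sapdvq TO ROLE sapdvq_role;
  GRANT ALL ON URI 'hdfs://hadoop01.hadoop.local:8020/user/dvqhdp' TO ROLE sapdvq_role;
  GRANT ROLE sapdvq_role TO GROUP dvqhdp;"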
OS prerequisites (On SAP host)
This group of requirements relates to the operating systems underlying the SAP system and all its application servers. Datavard products (e.g. Datavard Glue, OutBoard DataTiering) have been developed and tested in a SUSE Linux environment, but by design they are not limited to a particular operating system, as long as the requirements listed in this guide are met.
Java version
An up-to-date Java version (1.7 or higher is recommended) needs to be available to the SAP <sid>adm user.
If the Hadoop cluster has Kerberos enabled, a patched version of Java is required (e.g. Java 8u172).
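A quick check that the expected Java version is picked up by the <sid>adm user (SID DVQ used as an example):
Code Block
# run as the <sid>adm user, e.g. dvqadm
which java
java -version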
SAP Java Connector library
The SAP Java Connector 3.0 library libsapjco3.so needs to be accessible; it is available for download from the SAP marketplace. It needs to be present in the SAP kernel directory or in another directory pointed to by the LD_LIBRARY_PATH environment variable of the <sid>adm user.
Code Block
$ which libsapjco3.so
/usr/sap/DVQ/SYS/exe/uc/linuxx86_64/libsapjco3.so
OS directories
The Hadoop connector uses two directories dedicated to its configuration and log files:
...
The first one (/sapmnt/<SID>/global/security/dvd_conn) is used to store Kerberos and SSL-related files and is shared among SAP application servers.
The second one will store the drivers, configuration, and log files of the Java connector used by Datavard Glue. Ownership and permissions need to be set appropriately for <sid>adm.
Create the directories on each SAP application server to store Storage Management related configuration and log files. The <sid>adm user needs permissions to access these folders.
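A hedged sketch of creating the two directories, assuming the SID DVQ and standard <sid>adm/sapsys ownership (the permission modes are illustrative):
Code Block
# shared directory for Kerberos and SSL-related files
mkdir -p /sapmnt/DVQ/global/security/dvd_conn
# per-application-server directory for drivers, configuration, and logs
mkdir -p /usr/sap/DVQ/dvd_conn
chown -R dvqadm:sapsys /sapmnt/DVQ/global/security/dvd_conn /usr/sap/DVQ/dvd_conn
chmod 700 /sapmnt/DVQ/global/security/dvd_conn
chmod 755 /usr/sap/DVQ/dvd_conn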
JDBC Drivers
The JDBC protocol is used to connect to Hadoop services (Hive and Impala). The JDBC drivers have to be stored manually on the operating system and be accessible to the Datavard connector.
...
Code Block
$ ls -ld /usr/sap/DVQ/dvd_conn/drivers/hive* /usr/sap/DVQ/dvd_conn/drivers/impala*
drwxr-xr-x 2 dvqadm sapsys 4096 --- /usr/sap/DVQ/dvd_conn/drivers/hive_apache
drwxr-xr-x 3 dvqadm sapsys 4096 --- /usr/sap/DVQ/dvd_conn/drivers/hive_cloudera
drwxr-xr-x 3 dvqadm sapsys 4096 --- /usr/sap/DVQ/dvd_conn/drivers/impala_cloudera
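For illustration, placing the Cloudera Hive JDBC driver could look like the following; the archive name is hypothetical and depends on the driver version downloaded from Cloudera:
Code Block
mkdir -p /usr/sap/DVQ/dvd_conn/drivers/hive_cloudera
unzip HiveJDBC41.zip -d /usr/sap/DVQ/dvd_conn/drivers/hive_cloudera/
chown -R dvqadm:sapsys /usr/sap/DVQ/dvd_conn/drivers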
Kerberos keytab and configuration files
The Kerberos keytab of the <sid>hdp principal needs to be exported from the Kerberos database, copied into the operating system directory /sapmnt/DVQ/global/security/dvd_conn, and made available to the <sid>adm user (a hedged export and copy sketch follows after the listings below):
...
Code Block
/sapmnt/DVQ/global/security/dvd_conn # ls -l kerberos.cfg
-r-------- 1 dvqadm sapsys 393 Feb 22 16:28 kerberos.cfg
/sapmnt/DVQ/global/security/dvd_conn # cat kerberos.cfg
[libdefaults]
default_realm = HADOOP.LOCAL
dns_lookup_kdc = false
dns_lookup_realm = false
ticket_lifetime = 86400
renew_lifetime = 604800
forwardable = true
default_tgs_enctypes = rc4-hmac
default_tkt_enctypes = rc4-hmac
permitted_enctypes = rc4-hmac
udp_preference_limit = 1
kdc_timeout = 3000
[realms]
HADOOP.LOCAL = {
kdc = hadoop01.hadoop.local
admin_server = hadoop01.hadoop.local
}
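A hedged example of exporting the keytab mentioned above and staging it for the <sid>adm user, assuming the principal dvqhdp@HADOOP.LOCAL, the SID DVQ, and an illustrative keytab filename and SAP host:
Code Block
# on the host running the Kerberos database: export the keytab
kadmin.local -q "xst -k /tmp/dvqhdp.keytab dvqhdp@HADOOP.LOCAL"
# copy it to the SAP host and restrict access to the <sid>adm user
scp /tmp/dvqhdp.keytab dvqadm@sapdvq01:/sapmnt/DVQ/global/security/dvd_conn/
ssh dvqadm@sapdvq01 "chmod 400 /sapmnt/DVQ/global/security/dvd_conn/dvqhdp.keytab"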
SSL Certificates for Java
If your Hadoop cluster has full TLS encryption enabled, it is necessary to create a Java truststore and save it at the following path with correct ownership and permissions:
Code Block
/sapmnt/<SID>/global/security/dvd_conn # ls -l jssecacerts
-r-------- 1 <SID>adm sapsys 59 Apr 5 15:12 jssecacerts
Server certificates of the hosts that Glue will communicate with need to be placed in this truststore. An alternative is to copy the complete jssecacerts truststore from any Hadoop node and place it in this path.
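A minimal sketch of importing one Hadoop host certificate into the truststore with keytool; the certificate file, alias, and store password are illustrative:
Code Block
cd /sapmnt/DVQ/global/security/dvd_conn
keytool -importcert -alias hadoop01 -file hadoop01.hadoop.local.crt \
  -keystore jssecacerts -storepass changeit
chmod 400 jssecacerts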
SAP prerequisites
Kerberos cookie encoding
By default, the SAP system encodes certain characters in cookies. SAP note 1160362 describes this behavior in detail. As the Kerberos cookie must not be modified in any way for the Kerberos server to accept it, this encoding needs to be disabled by setting the following parameter in each SAP application server instance profile:
Code Block
ict/disable_cookie_urlencoding = 1
Warning
SAP kernel 7.53 patch level 5 introduced a change in this parameter which causes Storage Management to malfunction. Hadoop storage therefore doesn't work on SAP kernel 7.53 patch levels 5-222. As of SAP kernel 7.53 patch level 223, it is possible to set this value to "2", which restores the desired behavior. The issue is described in SAP Note 2681175.
Code Block
ict/disable_cookie_urlencoding = 2
The parameter is dynamic in kernel version 7.53 and higher.
SAP RFC role and user
The Java connector uses a dedicated user in the SAP system for communication. In our reference configuration, we use the username 'hadoop'. This user needs to be created with the type 'Communications Data' and with authorizations limiting its privileges to basic RFC communication.
Authorization
The required authorization object is S_RFC with these settings:
...
Example of custom SAP role in PFCG transaction (Display Authorization Data):
SSL for SAP RFCs
To enable SSL communication for SAP RFCs, the client certificates of the target Hadoop nodes need to be added to the SAP certificate list in transaction STRUST.
...
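One hedged way to obtain a Hadoop node's server certificate for import into STRUST is via openssl; the host and port are illustrative:
Code Block
openssl s_client -connect hadoop01.hadoop.local:14000 -showcerts </dev/null \
  | openssl x509 -outform PEM > hadoop01.pem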
HTTP service and ICM parameters
The HTTP service has to be active in the SAP system. It can be checked via transaction:
SMICM > [Goto] > Services
...
The following ICM parameters need to be checked (typically the service is already running and the parameters are set):
- icm/HTTP/client/keep_alive_timeout – defines the HTTP communication timeout; it can be raised if HTTP communication fails with a timeout.
- icm/HTTP/max_request_size_KB – defines the maximum size of data ICM will accept (default 100 MB).
SAP gateway access
External communication with the SAP system goes through the SAP gateway. If the SAP system parameter gw/acl_mode is enabled, two files (secinfo and reginfo) limit the access.
In such a case, two programs need to be granted access, either explicitly or by a wildcard definition (a hedged reginfo example is shown below):
...
https://help.sap.com/saphelp_nw73/helpdata/en/e2/16d0427a2440fc8bfc25e786b8e11c/content.htm
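A hedged example of reginfo entries allowing the two registered programs used by the connector (AUTH_CONN and JAVA_CONN, introduced later in this guide); the host name is illustrative and the exact syntax depends on the gateway ACL version in use:
Code Block
#VERSION=2
P TP=AUTH_CONN HOST=sapdvq01.example.local ACCESS=* CANCEL=internal
P TP=JAVA_CONN HOST=sapdvq01.example.local ACCESS=* CANCEL=internal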
CONFIGURATION
When all prerequisites are fulfilled, further configuration is done from within the SAP system.
RFC Destinations
There are three RFC connections that need to be created via transaction SM59.
HttpFS RFC
This RFC connection is used for communication with Hadoop's HttpFS service which mediates operations in HDFS.
The name and description of the destination are optional, but it is recommended to designate its purpose with the keywords 'Hadoop' and 'HttpFS'. In our example, the RFC destination name also contains the Hadoop server hosting the HttpFS service for the sake of clarity:
...
If SSL is used: It’s necessary to enable SSL and add a client certificate list to be used in the Logon & Security tab.
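Connectivity to the HttpFS service behind this destination can be verified with a plain WebHDFS REST call; the host, port, and user below are illustrative (on a Kerberized cluster, use curl --negotiate -u : with a valid ticket instead of the user.name parameter):
Code Block
curl "http://hadoop01.hadoop.local:14000/webhdfs/v1/user/dvqhdp?op=GETFILESTATUS&user.name=dvqhdp"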
Authentication RFC
This RFC connection is part of the authentication mechanism towards any Hadoop cluster in a kerberized environment.
...
- Connection Type – T for TCP/IP Connection
- Activation Type – select Registered Server Program
- Program ID – AUTH_CONN
It is important to enable the AUTH_CONN program registration in the SAP gateway (see SAP gateway access).
Java RFC
The Java RFC refers to the Java service that is used for communication with Hadoop services. Again, the setup is basic, and the parameters of the Java connector are defined in separate tables:
...
- Connection Type – T for TCP/IP Connection
- Activation Type – select Registered Server Program
- Program ID – JAVA_CONN
It is important to enable the JAVA_CONN program registration in the SAP gateway (see SAP gateway access).
Java connector setup
The Java connectors are configured using files that have to be defined in the SAP system.
The configuration consists of the following steps.
Logical file path definition
The first step is to map the logical path ZHADOOP_SECURITY to the OS path where the files are stored. The actual OS path is the one created in the section Datavard connector directories (/sapmnt/<SID>/global/security/dvd_conn).
Kerberos logical file definition
Before setting up the Hadoop storage, three files are required for successful Kerberos authentication. They need to be defined as logical names in the SAP system via the FILE transaction.
...
ZHADOOP_KRB_KEYTAB and ZHADOOP_KRB_CONFIG refer to the Kerberos keytab of the <sid>hdp user and the Kerberos configuration file defined in the section Kerberos keytab and configuration files, respectively. ZHADOOP_CDH_DRIVER refers to the custom Cloudera driver configuration file, which will be generated during the storage activation.
SSL logical file definition
If the Hadoop services cluster resides in a secure environment that is accessible only with SSL authentication, the following logical file needs to be defined:
Drivers logical file definition
As described in the JDBC Drivers section, the JDBC drivers for connecting to Hadoop services are stored on the operating systems underlying the SAP system. They also need to be defined as logical names in the SAP system via the FILE transaction.
In our example, we will be using the Hive and Impala JDBC drivers provided by Cloudera. The first step is to map the logical path ZJDBC_DRIVER_PATH to the OS path where the files are stored (in our case /usr/sap/<SID>/dvd_conn/drivers/).
...
ZJDBC_HIVE_CLOUDERA_JARS and ZJDBC_IMPALA_CLOUDERA_JARS refer to the folders in which the Hive and Impala JDBC drivers provided by Cloudera were placed in the section JDBC Drivers.
Java connector configuration
There are two Java connectors that Datavard Glue works with. One is used for Kerberos authentication, the other runs SQL queries against Hadoop services. They run as two autonomous Java processes, each with its own log file, and they are configured by setting up parameters in three configuration tables, which are imported into the SAP system as part of the Datavard Glue transports. Illustrative example values are shown after the parameter list below.
...
- Client – ID of the SAP system source data client
- CONFIG_AS_PATH – path and configuration file specifying authentication information towards the SAP system to Java connector (generated)
- CONFIG_PATH – path and configuration file specifying program ID and SAP gateway information to Java connector (generated)
- CONN_COUNT – number of connections registered at the gateway
- JAR_PATH – path and filename of the Datavard Java connector JAR file. The JAR is generated in the specified folder with the specified filename (its content is part of the transport).
- JAVA_EXE – path to Java binary used by <sid>adm user
- LOG_FILE – path and filename of log file related to the Java connector (generated)
- MAX_RAM_USED – the amount of memory dedicated to the Java connector
- USERNAME – RFC user defined in SAP RFC role and user
- PASSWORD – password of the RFC user (defined above) in hashed form. Use the instructions from the chapter "Password hash generator" to generate the hash.
- PEAK_LIMIT – the maximum number of connections that can be created for a destination simultaneously
- PROG_ID – registered program ID defined in Authentication RFC (version 32) and Hive RFC (version 34)
- REP_DEST – Java connector client destination from which to obtain the repository
- WORK_THREAD_MAX – number of threads that can be used by Java Connector
- WORK_THREAD_MIN – number of threads always kept running by Java Connector
- JAVA_START_CMD – name of the system command, which starts Java
- CHMOD_CMD – name of the system command for setting executable rights
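For illustration only, a possible set of values for some of these parameters; the paths, client, connection count, and memory size are assumptions, not shipped defaults (PROG_ID and USERNAME follow the examples used in this guide):
Code Block
CLIENT         = 100
CONFIG_PATH    = /usr/sap/DVQ/dvd_conn/java_conn.cfg
CONN_COUNT     = 5
JAR_PATH       = /usr/sap/DVQ/dvd_conn/dvd_java_conn.jar
JAVA_EXE       = /usr/bin/java
LOG_FILE       = /usr/sap/DVQ/dvd_conn/java_conn.log
MAX_RAM_USED   = 2048
USERNAME       = hadoop
PROG_ID        = JAVA_CONN
JAVA_START_CMD = ZDVD_START_JAVA
CHMOD_CMD      = ZDVD_CHMOD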
Password hash generator
Use report /DVD/XOR_GEN for this purpose.
Storage Management setup
A generic Datavard software component, the Reuse Library, is used for the setup. The required component is Storage Management for Hadoop.
Datavard Storage Management facilitates transparent communication with different types of storage, including various databases and Hadoop: HDFS for flat files and Hive for structured data.
HDFS storage
In order to store data transparently, two types of Hadoop storage need to be defined in Storage Management:
...
A third type of storage is required for efficient querying of data located in Hive:
- Impala storage, which connects to Impala agents to provide fast SQL execution by leveraging Impala's in-memory data caching
...
Finish the creation of the storage by confirming (F8). If the SAP system is able to authenticate against Hadoop Kerberos and get the properties of the HDFS home directory (/user/<sid>hdp) from the HttpFS service, the storage creation is considered successful.
Hive metastore storage
The Hive metastore storage is created in a very similar way to the HDFS storage, but the values are different:
...
- Storage ID – name of the storage
- Storage Type – choose SM_TRS_HV2 for Hive
- Description – extended description of the storage for easier identification
- Database – Hive database created in Hive database
- Hive host – Hadoop server hosting the Hive service
- Hive username – Hadoop user created in Hadoop user, group, and HDFS directory
- Hive password - password for Hive user
- Impala host – Hadoop server hosting the Impala service
- Impala username – Hadoop user created in Hadoop user, group, and HDFS directory
- Staging location type - Storage location for data staging area (external CSV tables)
- Staging location url (non-default) - URL address for data staging area (e.g. Azure DataLake)
- HTTP RFC Destination – HttpFS RFC destination defined in HttpFS RFC
- HTTP RFC Destination (HA) - HttpFS RFC destination (High Availability)
- Java connector RFC – Hive RFC destination defined in Hive RFC
- Load Engine - Engine used for loading (writing) data, e.g. Hive or Impala
- Read Engine - Engine used for reading data, e.g. Hive or Impala
- Load driver classname - Classname of the driver used for loading (e.g. Cloudera Hive - com.cloudera.hive.jdbc41.HS2Driver)
- Load driver path - Logical name of the Load driver file
- Read driver classname - Classname of the driver used for reading (e.g. Cloudera Impala - com.cloudera.impala.jdbc41.Driver)
- Read driver path - Logical name of the Read driver path
- Use custom connection string checkbox - if checked, the custom connection string is used
- Custom connection string - standard settings are ignored and the custom connection string is used instead (an illustrative example is shown below, after this list)
- Use Kerberos checkbox – checked in case the Hadoop cluster is Kerberized
- Kerberos config file path – Logical name of Kerberos configuration file defined in Kerberos logical file definition
- Hive service principal – Kerberos principal of the Hive service, must reflect the Hive host
- Impala service principal - Kerberos principal of the Impala service, must reflect the Impala host
- Kerberos keytab path – Logical name of Kerberos principal keytab file defined in Kerberos logical file definition
- Kerberos user principal - User name (Kerberos principal) defined for connection to Hadoop
- File Type – file format in which Hive will store table data on HDFS
- Compression codec - Compression codec used for storing data on HDFS
- HDFS Permissions - UNIX permissions for files created on HDFS
- Use Cloudera drivers checkbox – always checked if the Hadoop cluster is a Cloudera distribution
- Cloudera driver config path – Logical name of the Cloudera driver configuration file defined in Kerberos logical file definition
- Skip trash - checked if HDFS files should NOT be moved to trash after deleting them
- Impala port - Impala JDBC port
- Hive port - Hive JDBC port
- Hints for hive/impala - Hints that can be specified for JDBC connection (e.g. SYNC_DDL=TRUE)
- Open cursor logic - Select which logic will be used for reading via cursor
- Repetition for HDFS - number of times HDFS request will be repeated in case of failure
- Use compression on transfer - checked in case compression is used for files created on HDFS
- Compression level - level of compression (0-minimum, 9-maximum)
- Use SSL - checked if SSL authentication should be used
- SSL Keystore path - logical name of the SSL Keystore file
- SSL Keystore password - password to Keystore
- Force file cursor reader checkbox (expert setting) - cursor reader is used all the time when reading data from Hadoop
- Use extended escaping checkbox (expert setting) - extended escaping is used all the time when writing data to Hadoop
Finish the creation of the storage by confirming (F8). If the SAP system is able to authenticate against Hadoop Kerberos and receives the expected result of the SQL command 'use database', the creation of the storage is considered successful.
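For reference only, a custom connection string for the Cloudera Hive JDBC driver with Kerberos and SSL enabled might look roughly like the following; the host, database, realm, and truststore path are illustrative, and the property names should be verified against the Cloudera JDBC driver documentation for the driver version in use:
Code Block
jdbc:hive2://hadoop01.hadoop.local:10000/sapdvq;AuthMech=1;KrbRealm=HADOOP.LOCAL;KrbHostFQDN=hadoop01.hadoop.local;KrbServiceName=hive;SSL=1;SSLTrustStore=/sapmnt/DVQ/global/security/dvd_conn/jssecacerts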
Troubleshooting
In case of any issues with the Java connector, the Java logs can be read from the application server to determine the source of the issue, using the report /DVD/SM_HIVE_DISPLAY_JAVA_LOG.
...