(SM-1903) Hadoop Storage Setup
Prerequisites
Open ports
In a controlled network environment, firewall rules are commonly in place. To enable communication between SAP systems and Hadoop, the following ports on the Hadoop cluster must be reachable from the SAP system:
Port | Type | Hadoop service | Comment |
---|---|---|---|
10000 | tcp | Hive | |
11000 | tcp | Oozie | Optional feature for Datavard Glue |
14000 | tcp | HttpFS | Alternative is WebHDFS on port 50070 |
21050 | tcp | Impala | |
These are default port numbers of Hadoop services.
If Kerberos is enabled, KDC (Key Distribution Center) should also be reachable on port 88 (tcp/udp) from each SAP application server.
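Port reachability can be checked from each SAP application server before the installation. A minimal sketch using bash's /dev/tcp pseudo-device (the host name is a placeholder; substitute your Hadoop hosts):

```shell
# Check whether a TCP port on a Hadoop host is reachable from this server.
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} reachable"
  else
    echo "${host}:${port} NOT reachable"
    return 1
  fi
}

# Example usage (hadoop01.hadoop.local is a placeholder host):
# check_port hadoop01.hadoop.local 10000   # Hive
# check_port hadoop01.hadoop.local 14000   # HttpFS
# check_port hadoop01.hadoop.local 21050   # Impala
```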
DNS names
Proper DNS name translation should be configured between SAP and Hadoop for Kerberos communication.
DNS resolution should be tested from the SAP host using the OS command 'nslookup <hadoop_host_FQDN>'.
Hive parameters
Two configuration parameters of the Hive service must be set in the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml:

hive.exec.dynamic.partition = true
hive.exec.dynamic.partition.mode = nonstrict
Example:
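In plain hive-site.xml notation, the Safety Valve snippet would look like this (a sketch using the standard Hive property syntax):

```xml
<property>
  <name>hive.exec.dynamic.partition</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
```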
Hadoop technical user
We recommend creating a distinct technical user for every SAP system connected to the Hadoop cluster in order to isolate each system's data.
There is usually a central repository for Hadoop users (LDAP/AD), but users can also be created locally (on every Hadoop cluster node).
The recommended naming convention mirrors the SAP <sid>adm users: <sid>hdp
Each Hadoop technical user should have its own dedicated group.
For illustration purposes, we use the Hadoop user dvqhdp (with the group dvqhdp) throughout the rest of this guide.
Create the user's home directory on HDFS with appropriate permissions:
```
$ hdfs dfs -ls -d /user/dvqhdp
-rwxrwxr-x 3 dvqhdp supergroup 0 --- /user/dvqhdp
```
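A sketch of the commands an HDFS administrator might use to create this directory (user and paths follow the dvqhdp example; run as the hdfs superuser or another HDFS administrator):

```shell
# Create the technical user's home directory and hand it over to the user
hdfs dfs -mkdir -p /user/dvqhdp
hdfs dfs -chown dvqhdp:dvqhdp /user/dvqhdp
hdfs dfs -chmod 775 /user/dvqhdp
```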
- If Kerberos is used: create a Kerberos principal in the form <sid>hdp@<KERBEROS_REALM>. Run kadmin.local on the host where the Kerberos database is running.
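A sketch of the principal and keytab creation in kadmin.local (realm and keytab path are examples following the dvqhdp user):

```shell
# On the KDC host, as root:
kadmin.local -q "addprinc -randkey dvqhdp@HADOOP.LOCAL"
kadmin.local -q "xst -k /tmp/dvqhdp.keytab dvqhdp@HADOOP.LOCAL"
# Verify the exported keytab:
klist -kt /tmp/dvqhdp.keytab
```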
Hive database
We recommend creating a dedicated database (schema) in Hive for each SAP system. The recommended database name is sap<sid> (e.g. sapdvq).
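A sketch of creating the database through beeline (the JDBC URL is an example; with Sentry or Ranger in place this typically has to be done by a privileged user):

```shell
beeline -u "jdbc:hive2://hive-host.example.com:10000/default" \
  -e "CREATE DATABASE IF NOT EXISTS sapdvq"
```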
Access privileges on Hadoop side
In a productive Hadoop cluster, depending on the Hadoop distribution, either the Sentry service or the Ranger service is responsible for managing users' privileges on Hadoop resources.
For proper functionality of Datavard Storage Management, the Hadoop technical user <sid>hdp needs access to at least the following two resources:
- User's HDFS home directory - typically /user/<sid>hdp
- Assigned Hive database - typically sap<sid>
To set up necessary policies in respective security service, follow the instructions below.
If Sentry is used
The Sentry service manages access to Hadoop resources through Sentry rules. Rules are created for a role, which can have a one-to-many relation with user groups (not with users directly).
We typically set up only a one-to-one role ↔ group relation. You need to set up two rules: one granting ALL actions on the HDFS directory (automatically translated to a URI) and one granting ALL actions on the Hive database.
Example:
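A sketch of the two Sentry rules issued through beeline, assuming the group dvqhdp and a role name of sapdvq_admin (role name and NameNode address are example values):

```shell
# $jdbc_string is a beeline JDBC URL for HiveServer2, as shown in the
# verification section below; the URI must carry the full HDFS scheme.
beeline -u "$jdbc_string" -e "CREATE ROLE sapdvq_admin"
beeline -u "$jdbc_string" -e "GRANT ROLE sapdvq_admin TO GROUP dvqhdp"
beeline -u "$jdbc_string" -e "GRANT ALL ON URI 'hdfs://namenode.example.com:8020/user/dvqhdp' TO ROLE sapdvq_admin"
beeline -u "$jdbc_string" -e "GRANT ALL ON DATABASE sapdvq TO ROLE sapdvq_admin"
```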
NOTE: If HDFS ACL synchronization with Sentry rules is enabled, add the user's directory to the "Sentry Synchronization Path Prefixes" parameter in the HDFS service configuration.
More on HDFS ACL synchronization topic can be found on Synchronizing HDFS ACLs and Sentry Permissions.
If Ranger is used
Similarly to Sentry, the Ranger service manages access to Hadoop resources using policies. Policies can grant privileges either to groups or to users directly.
Again, you need to create two policies: one granting the Hadoop technical user full access to its HDFS directory and the other granting access to the Hive database.
HDFS policy example:
Hive database policy example:
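Ranger policies are usually created in the Ranger Admin web UI, but they can also be scripted against the Ranger REST API. A hedged sketch of creating the HDFS policy (host, credentials, service and policy names are placeholders):

```shell
curl -u admin:password -H "Content-Type: application/json" \
  -X POST http://ranger-host.example.com:6080/service/public/v2/api/policy \
  -d '{
        "service": "cluster_hadoop",
        "name": "sap_dvqhdp_home",
        "resources": { "path": { "values": ["/user/dvqhdp"], "isRecursive": true } },
        "policyItems": [ {
          "users": ["dvqhdp"],
          "accesses": [ {"type": "read",    "isAllowed": true},
                        {"type": "write",   "isAllowed": true},
                        {"type": "execute", "isAllowed": true} ]
        } ]
      }'
```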
Verification of setup
To verify that the setup on the Hadoop side is valid, we recommend using the technical user to create a test table inside the sap<sid> Hive database.
Then load a test file into the user's home directory, verify that it is there, and delete it at the end.
If all the commands succeed, you can conclude that the setup is valid.
```
[root@skbtscck21 ~]# database=sapnsq   # set the database variable
[root@skbtscck21 ~]# jdbc_string="jdbc:hive2://skbtscck21.hadoop.local:10000/$database;principal=hive/skbtscck21.hadoop.local@HADOOP.LOCAL;ssl=true;sslTrustStore=/opt/certs/jks/skbtscck21-keystore.jks;trustStorePassword=123456aB"   # set the JDBC string
[root@skbtscck21 ~]# kinit -kt ~/nsqhdp.keytab nsqhdp@HADOOP.LOCAL   # authenticate as the technical user
[root@skbtscck21 ~]# klist   # verify that you have a valid ticket
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: nsqhdp@HADOOP.LOCAL

Valid starting       Expires              Service principal
04/25/2019 17:24:03  04/26/2019 17:24:03  krbtgt/HADOOP.LOCAL@HADOOP.LOCAL
        renew until 05/02/2019 17:24:03
[root@skbtscck21 ~]# beeline -u $jdbc_string -e "create table if not exists $database.xxx (a int)"   # create test table
INFO  : OK
[root@skbtscck21 ~]# beeline -u $jdbc_string -e "drop table $database.xxx"   # delete test table
INFO  : OK
[root@skbtscck21 ~]# beeline -u $jdbc_string -e "show current roles"   # only with Sentry - show roles of the user
INFO  : OK
+---------------+--+
|     role      |
+---------------+--+
| sapnsq_admin  |
+---------------+--+
[root@skbtscck21 ~]# beeline -u $jdbc_string -e "show grant role sapnsq_admin"   # only with Sentry - display privileges of the role
INFO  : OK
+--------------------------------------------------+--------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+
|                     database                     | table  | partition  | column  | principal_name  | principal_type  | privilege  | grant_option  |    grant_time     | grantor  |
+--------------------------------------------------+--------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+
| sapnsq                                           |        |            |         | sapnsq_admin    | ROLE            | all        | false         | 1490282925235000  | --       |
| hdfs://skbtscck21.hadoop.local:8020/user/nsqhdp  |        |            |         | sapnsq_admin    | ROLE            | all        | false         | 1490282925256000  | --       |
+--------------------------------------------------+--------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+
[root@skbtscck21 ~]# echo Success > file.txt   # create a test file
[root@skbtscck21 ~]# hadoop fs -copyFromLocal ./file.txt /user/nsqhdp/   # store the file on HDFS
[root@skbtscck21 ~]# hadoop fs -cat /user/nsqhdp/file.txt   # read the file
Success
[root@skbtscck21 ~]# hadoop fs -rm -skipTrash /user/nsqhdp/file.txt   # clean up
Deleted /user/nsqhdp/file.txt
```
OS prerequisites (On SAP host)
This group of requirements relates to the operating systems underlying the SAP system and all its application servers. Datavard products (e.g. Datavard Glue, OutBoard DataTiering) have been developed and tested on SUSE Linux and Windows Server 2012, but by design they are not limited to a particular operating system, as long as the requirements listed in this guide are met. Successful implementations have also been done on AIX and Solaris.
Java version
An up-to-date Java version should be available to the SAP <sid>adm user. We recommend the latest Java release.
If the Hadoop cluster has Kerberos enabled, a patched version of Java is required (e.g. Java 8u172).
NOTE: There is a limitation on the AIX operating system. Oracle JDK is not available for this platform, so the usual replacement is IBM Java.
This poses a certain limitation for the Datavard Java Connector, specifically in the scenario where the Hive connection is facilitated by the Zookeeper service. So far, authentication fails when running IBM Java with the Apache Hive driver connecting to Zookeeper. A solution for this scenario is under development.
OS directories
The Hadoop connector uses two directories to store configuration and log files.
Create them with appropriate permissions (read/write) and the usual SAP directory ownership (<sid>adm:sapsys).
dvd_conn directory in shared /sapmnt filesystem:
```
$ ls -ld /sapmnt/DVQ/global/security/dvd_conn
drwx------ 2 dvqadm sapsys 4096 --- /sapmnt/DVQ/global/security/dvd_conn
```
This directory contains drivers and Kerberos- and SSL-related files. It is shared among all SAP application servers.
dvd_conn directory on each SAP application server:
```
$ ls -ld /usr/sap/DVQ/DVEBMGS05/work/dvd_conn
drwxr-xr-x 7 dvqadm sapsys 4096 --- /usr/sap/DVQ/DVEBMGS05/work/dvd_conn
```
This one stores Datavard Java Connector libraries, configuration and log files.
In previous Storage Management installations, the directory resided directly in /usr/sap/<SID>/dvd_conn. It was moved to the work directory to be logically grouped with other SAP operational and log files.

```
$ ls -ld /usr/sap/DVQ/dvd_conn
drwxr-xr-x 2 dvqadm sapsys 4096 --- /usr/sap/DVQ/dvd_conn
```
SAP Java Connector library
The SAP Java Connector 3.0 library libsapjco3.so, which can be downloaded from the SAP marketplace, needs to be uploaded to the SAP application server. It should be located in a directory referenced by the LD_LIBRARY_PATH environment variable of the <sid>adm user.
We recommend appending the shared Datavard directory (/sapmnt/<SID>/global/security/dvd_conn) to the LD_LIBRARY_PATH variable of the <sid>adm user and placing libsapjco3.so in this directory.
Alternatively, you may copy it directly into the SAP kernel directory, which is included in LD_LIBRARY_PATH by default.
If you copy the SAP Java Connector library into the SAP kernel directory, you must ensure the library is preserved during each SAP kernel upgrade.
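The append can be done in the <sid>adm user's shell profile. A minimal sketch that adds the shared directory only if it is not already present (the DVQ path follows the example system; adjust <SID> for your landscape):

```shell
# Shared Datavard directory for the example DVQ system.
DVD_CONN=/sapmnt/DVQ/global/security/dvd_conn

# Append to LD_LIBRARY_PATH only if the directory is not already listed.
case ":${LD_LIBRARY_PATH}:" in
  *":${DVD_CONN}:"*) ;;  # already present, nothing to do
  *) export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${DVD_CONN}" ;;
esac
```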
```
$ echo $LD_LIBRARY_PATH
/usr/sap/DVQ/SYS/exe/run:/usr/sap/DVQ/SYS/exe/uc/linuxx86_64:/usr/sap/DVQ/hdbclient:/sapmnt/DVQ/global/security/dvd_conn

/sapmnt/DVQ/global/security/dvd_conn # ls -l libsapjco3.so
-rwxr-x--- 1 dvqadm sapsys 59 Apr 5 15:12 libsapjco3.so
```
JDBC Drivers
The JDBC protocol is used to connect to the Hadoop services (Hive and Impala). The JDBC drivers have to be stored manually on the operating system and be accessible to the Datavard connector.
We recommend storing the drivers in the shared dvd_conn directory, organized in sub-directories to avoid possible conflicts.
```
$ ls -ld /sapmnt/DVQ/global/security/dvd_conn/[hi]*
drwxr-x--- 2 dvqadm sapsys 4096 --- /sapmnt/DVQ/global/security/dvd_conn/hive
drwxr-x--- 3 dvqadm sapsys 4096 --- /sapmnt/DVQ/global/security/dvd_conn/impala
```
JDBC drivers can usually be downloaded from your Hadoop vendor's website.
There are multiple drivers able to facilitate communication between the Datavard Java Connector and the respective Hadoop service.
Based on our experience, the most reliable are the Simba drivers (adopted by the Cloudera, Hortonworks and MapR Hadoop distributions).
Kerberos keytab and configuration files
The Kerberos keytab of the <sid>hdp principal should be exported from the Kerberos database, copied into the operating system directory /sapmnt/DVQ/global/security/dvd_conn and made available to the <sid>adm user:

```
/sapmnt/DVQ/global/security/dvd_conn # ls -l DVQ.keytab
-r-------- 1 dvqadm sapsys 59 Apr 5 15:12 DVQ.keytab
```
At the same location, the Kerberos configuration file should be created (or copied) and made readable for the <sid>adm user.
A suitable krb5.conf file can usually be found on the Hadoop nodes. Here is a sample Kerberos configuration file:
```
/sapmnt/DVQ/global/security/dvd_conn # ls -l krb5.conf
-r-------- 1 dvqadm sapsys 393 Feb 22 16:28 krb5.conf
/sapmnt/DVQ/global/security/dvd_conn # cat krb5.conf
[libdefaults]
default_realm = HADOOP.LOCAL
dns_lookup_kdc = false
dns_lookup_realm = false
ticket_lifetime = 86400
renew_lifetime = 604800
forwardable = true
default_tgs_enctypes = rc4-hmac
default_tkt_enctypes = rc4-hmac
permitted_enctypes = rc4-hmac
udp_preference_limit = 1
kdc_timeout = 3000

[realms]
HADOOP.LOCAL = {
  kdc = hadoop01.hadoop.local
  admin_server = hadoop01.hadoop.local
}
```
SSL Certificates for Java
If your Hadoop cluster has full TLS encryption enabled, it is necessary to create a Java truststore and save it at the following path with correct ownership and permissions:
```
/sapmnt/<SID>/global/security/dvd_conn # ls -l jssecacerts
-r-------- 1 <SID>adm sapsys 59 Apr 5 15:12 jssecacerts
```
The server certificates of the hosts that Storage Management communicates with (usually HiveServer2, the Impala daemons and the HttpFS/WebHDFS service), or the certificate of the Certificate Authority used to sign them, should be placed in this truststore.
An alternative option is to copy the complete jssecacerts truststore from any Hadoop node and place it at this path.
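A sketch of building the truststore with the JDK keytool, importing one CA certificate (alias, certificate file and password are example values):

```shell
keytool -importcert -noprompt \
  -alias cluster_ca \
  -file /tmp/cluster_ca.pem \
  -keystore /sapmnt/DVQ/global/security/dvd_conn/jssecacerts \
  -storepass changeit
```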
SAP prerequisites
Kerberos cookie encoding
By default, the SAP system encodes certain characters in cookies. SAP note 1160362 describes the behavior in detail. As the Kerberos cookie must not be modified in any way for the Kerberos server to accept it, this encoding should be disabled by setting the following parameter in each SAP application server instance profile:
ict/disable_cookie_urlencoding = 1
Incompatible Kernel version
SAP kernel 7.53 patch level 5 introduced a change in this parameter which causes Storage Management to malfunction. Therefore, the Hadoop storage doesn't work on SAP kernel 7.53 patch levels 5-222. As of SAP kernel 7.53 patch level 223, it is possible to set the value to "2", which restores the desired behavior. The issue is described in SAP Note 2681175.
ict/disable_cookie_urlencoding = 2
The parameter is dynamic starting from version 7.53.
SAP RFC role and user
The Java connector uses a dedicated user in the SAP system for communication. In our reference configuration, we use the username 'hadoop'. This user should be created with the type 'Communications Data' and with authorizations limiting its privileges to basic RFC communication.
The required authorization object is S_RFC with these settings:
- ACTVT = 16
- RFC_NAME = SYST, RFC1, SDIFRUNTIME
- RFC_TYPE = FUGR
Example of custom SAP role in PFCG transaction (Display Authorization Data):
SSL for SAP RFCs
To enable SSL communication for SAP RFCs, add the client certificates of the target Hadoop nodes to the SAP certificate list in the transaction STRUST.
If HttpFS is used, the client certificate of the HttpFS host is required.
If WebHDFS is used, the client certificates of all datanodes and namenodes are necessary.
To import necessary certificates:
- Use the transaction STRUST.
- In the left menu, choose the certificate list you want to add the certificate to (by double-clicking).
- In the bottom-left part of the right window area, click Import.
- In the FILE tab of the dialog window, point to the certificate file on your local file system (the certificate should be in .pem format).
- After you confirm the path and SAP recognizes the certificate, its details are displayed in the corresponding fields.
- To complete adding the certificate, click Add to Certificate List.
- Click Save (in the general menu or Ctrl+S).
- Restart the ICM in the transaction SMICM: top menu > Administration > ICM > Exit Soft > Local/Global.
HTTP service and ICM parameters
The HTTP service must be active in the SAP system. It can be checked via the transaction SMICM > Goto > Services.
The following two parameters are important for the HTTP communication of SAP system:
- icm/HTTP/client/keep_alive_timeout – HTTP communication timeout; can be raised in case HTTP communication fails with a timeout.
- icm/HTTP/max_request_size_KB – maximum size of data which the ICM accepts (default 100 MB).
SAP gateway access
External communication with the SAP system goes through the SAP gateway. If the SAP system parameter gw/acl_mode is enabled, two files (secinfo and reginfo) limit the access.
In this case, two programs need to be granted access, either explicitly or by a wildcard definition:
- HADOOP_AUTH_CONN
- HADOOP_JAVA_CONN
The program IDs are defined later in this guide within the step 194153621. More information on the SAP gateway ACL topic can be found on the SAP website: Gateway Security Files secinfo and reginfo.
Configuration
When all prerequisites are fulfilled, further configuration is performed from the SAP system.
RFC Destinations
There are three RFC destinations which should be created through the transaction SM59.
HttpFS RFC
This RFC connection is used for communication with Hadoop's HttpFS service, which mediates operations in HDFS.
Two RFCs pointing to different HttpFS services can be created to ensure High Availability.
The name and description of the destination are optional, but it is recommended to designate its purpose with the keywords 'Hadoop' and 'HttpFS'. In our example, the RFC destination also contains the name of the Hadoop server hosting the HttpFS service for the sake of clarity:
Entries explained:
- Connection Type – G for HTTP connection to an external service
- Target host – FQDN of Hadoop server hosting HttpFS service
- Service No. – port number on which HttpFS service is listening (default is 14000)
- Path Prefix – this string consists of two parts
- /webhdfs/v1 part is mandatory
- /user/dvqhdp part defines Hadoop user's 'root' directory in HDFS where flat files from SAP system are loaded
If SSL is used: enable SSL and add the client certificate list to be used in the Logon & Security tab.
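The Path Prefix can be verified directly against HttpFS before creating the RFC. A sketch using curl and the WebHDFS REST API (host and user follow the examples above; on a kerberized cluster, pseudo authentication via user.name is typically replaced by `curl --negotiate -u :`):

```shell
# List the technical user's home directory through HttpFS
curl "http://hadoop01.hadoop.local:14000/webhdfs/v1/user/dvqhdp?op=LISTSTATUS&user.name=dvqhdp"
```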
Authentication RFC
This RFC connection is part of the authentication mechanism towards any Hadoop cluster in a kerberized environment.
The RFC setup is very basic as parameters affecting authentication are defined elsewhere. It is recommended to use the generic RFC name 'HADOOP_AUTH_CONN':
Entries explained:
- Connection Type – T for TCP/IP Connection
- Activation Type – select Registered Server Program
- Program ID – HADOOP_AUTH_CONN
It is important to enable the HADOOP_AUTH_CONN program registration in the SAP gateway (SAP Gateway Access Control Lists).
Java RFC
The Java RFC refers by name to the Java service which is used for communication with the Hadoop services. Define the parameters of the Java connector in separate tables:
Entries explained:
- Connection Type – T for TCP/IP Connection
- Activation Type – select Registered Server Program
- Program ID – HADOOP_JAVA_CONN
It is important to enable the HADOOP_JAVA_CONN program registration in the SAP gateway (SAP Gateway Access Control Lists).
Logical paths and logical filenames
Datavard ships default logical paths leading to files stored on the SAP application servers. These default values are usable for most SAP installations and in general don't need to be changed. If the default paths are used, the administrator needs to make sure that all security-related files (Kerberos keytab, configuration file, ...) stick to the default naming used in these paths.
In the shipped Datavard paths, the standard SAP logical paths DIR_HOME and DIR_GLOBAL are frequently used as variables. These paths usually refer to DIR_HOME = /usr/sap/<SID>/<INSTANCE>/ and DIR_GLOBAL = /sapmnt/<SID>/global/
The logical files and their values shipped by Datavard are the following:
- /DVD/DEF_KRB_KEYTAB = <P=DIR_GLOBAL>/security/dvd_conn/<SID>.keytab
- /DVD/DEF_KRB_CONFIG = <P=DIR_GLOBAL>/security/dvd_conn/krb5.config
- /DVD/DEF_SSL_TRUSTSTORE = <P=DIR_GLOBAL>/security/dvd_conn/jssecacerts
- /DVD/DEF_JCO_DIR = <P=DIR_HOME>/dvd_conn/
- /DVD/DEF_HIVE_DRIVER = <P=DIR_GLOBAL>/security/dvd_conn/hive
- /DVD/DEF_IMPALA_DRIVER = <P=DIR_GLOBAL>/security/dvd_conn/impala
If a customer wants to use a custom location or filename, a Z-copy of these logical paths and logical files with the customizations needs to be created, as any direct change to these paths would be overwritten during a Storage Management update.
Password hash generator
All passwords stored in Storage Management need to be hashed upfront using the report /DVD/XOR_GEN. This report can be executed using the transaction /DVD/RL_EXE.
JCO Management
There are two Java connectors which Storage Management works with. The connector HADOOP_AUTH_CONN is used for Kerberos authentication, and the connector HADOOP_JAVA_CONN runs SQL queries against the Hadoop services. These Java connectors run as two independent OS processes and can be controlled through the transaction /DVD/JCO_MNG. The current latest versions are 79 for HADOOP_JAVA_CONN (JDBC) and 78 for HADOOP_AUTH_CONN (Kerberos).
This transaction allows the user to link the Datavard Java connector to an existing SAP RFC created in the previous steps and to maintain the settings of the Datavard Java connector. After all required fields are filled in and saved, the user can click the Restart button to start the Java service. When the Java process is running, the user can control it across application servers and display logs.
Step-by-step:
- After selecting the JCO whose settings you want to adjust (double-click in the list of all JCOs), switch to Edit mode
- Instead of filling in all values manually, you can use the corresponding button or paste a previously copied configuration from the clipboard
- Adjust the values you want to change
Config
Client - the client of the SAP technical user used to register on the SAP gateway
RFC User - username of the SAP technical user used to register on the SAP gateway
Password - hashed password of the SAP technical user used to register on the SAP gateway; the hash can be created by typing the password in the field below and clicking the 'Hash' button
Program ID - ID used to register the Java application on the SAP gateway; the value is automatically pulled from the RFC that is used
Install directory - directory on the application server where the application and log files will be generated; can be a physical path or a logical path enclosed in <>, and the default path </DVD/DEF_JCO_DIR> can be used. After changing this value, you have to click the 'Generate paths' button
Java exe - the path to the Java executable on the SAP application server
Java vendor - name of the Java vendor, ORACLE or IBM
Dependencies
- select a library which you want to use with JCO
Advanced
- OS Command for starting java service - name of the OS command for starting the Java service (SM69)
- OS Command for setting access permissions - name of the OS command for setting execution rights (SM69)
- Max RAM used - the maximum amount of RAM used by the Java service
- Additional java starting arguments - additional arguments used to start the Java service
- Repository destination - Destination type
- Work thread MIN - Minimum number of work threads
- Work thread MAX - Maximum number of work threads
- Connection count - Number of connections
- Peak limit - Limit of connections at peak
- Log4j log level - level of messages collected in logs
- Log4j log deletion - how long the logs should be archived
- JAR path - the path where the JAR file will be created
- Server config path - the path where the server config will be created
- Destination config path - the path where the destination config will be created
- Log files path - a directory where the log files will be saved
- Log4j config path - the path where the log4j config file will be created
- Go to the 'General' tab and assign an RFC to the Java connector
- RFC Usage - defines an RFC that points to this Java application
- Save the settings
- Start the Java service by clicking the 'Restart' button
The Java service is started with a system command. You can adjust the name of this command in the Advanced tab. The default name of the command is ZDVD_START_JAVA. In case the system command doesn't exist, it is created automatically. You can view the system commands through the transaction SM69.
On Linux, another system command is required, which sets executable rights on the configuration files (chmod 755 <filename>). Its name can also be adjusted in the Advanced tab.
HttpFS authentication
The table /DVD/HDP_CUS_C contains the required authentication information for storages of type Hadoop. Entries in this table are tied to the HttpFS RFC created previously.
If two HttpFS RFCs were created to support HA scenarios, an entry needs to be created in this table for both of them.
The table can be maintained through the transaction /DVD/RL_TM.
Sample entry:
Entries explained:
- Destination – HttpFS RFC destination created in HttpFS RFC
- User Name – Hadoop user principal created in Hadoop user, group and HDFS directory
- Auth. Method – authentication method towards the Hadoop cluster
- Krb. Keytab – logical file definition for Kerberos keytab file, default /DVD/DEF_KRB_KEYTAB can be used
- Krb. Config – logical file definition for Kerberos configuration file, default /DVD/DEF_KRB_CONFIG can be used
- Krb. Service RFC – authentication RFC destination created in Authentication RFC
- SSL Keystore – logical file definition for SSL keystore, default /DVD/DEF_SSL_TRUSTSTORE can be used
- SSL Password – password for accessing the SSL keystore, hashed by the report /DVD/XOR_GEN
Storage Management setup
A generic Datavard software component, the Reuse Library, is used for the setup. The required component is 'Storage Management for Hadoop'.
Datavard Storage Management facilitates transparent communication with different types of storage, including various types of databases and Hadoop: HDFS for flat files and Hive for structured data.
HDFS storage
In order to store data transparently, you should define two types of Hadoop storage in Storage Management:
- HDFS storage, which facilitates the transfer of files to HDFS through the Hadoop HttpFS service
- Hive storage, which enables data replication between SAP tables and Hive tables
A third type of storage is required for efficient querying of data located in Hive:
- Impala storage, which connects to the Impala agents to provide fast SQL execution by leveraging Impala's in-memory data caching
Create the HDFS storage through the transaction:
/DVD/SM_SETUP > [Edit mode] > [New storage]
Entries explained:
- Storage ID – name of the storage
- Storage Type – choose HADOOP for HDFS
- Description – extended description of the storage for easier identification
- RFC Destination – HttpFS RFC destination defined in HttpFS RFC
Complete the creation of the storage by confirming (F8). If the SAP system can authenticate against Hadoop Kerberos and get the properties of the HDFS home directory (/user/<sid>hdp) from the HttpFS service, the storage creation is considered successful.
Hive metastore storage
The Hive metastore storage is created in a similar way to the HDFS storage, but the values differ:
Entries explained:
- Storage ID – name of the storage
- Storage Type – choose SM_TRS_MS for Hive
- Description – extended description of the storage for easier identification
- Database – Hive database created in Hive database
- Hive host – Hadoop server hosting the Hive service
- Hive host for high availability – HA Hive host
- Hive username – Hadoop user created in Hadoop user, group and HDFS directory
- Hive password – password for the Hive user
- Impala host – Hadoop server hosting the Impala service
- Impala host for high availability – HA Impala host
- Impala username – Hadoop user created in Hadoop user, group and HDFS directory
- Staging location type – storage location for the data staging area (external CSV tables)
- Staging location URL (non-default) – URL address for the data staging area (e.g. Azure Data Lake)
- HTTP RFC Destination – HttpFS RFC destination defined in HttpFS RFC
- HTTP RFC Destination (HA) – HttpFS RFC destination (High Availability)
- Java connector RFC – Hive RFC destination defined in Hive RFC
- Load Engine – engine used for loading (writing) data, e.g. Hive or Impala
- Read Engine – engine used for reading data, e.g. Hive or Impala
- Load driver classname – class name of the driver used for loading (e.g. Cloudera Hive – com.cloudera.hive.jdbc41.HS2Driver)
- Load driver path – logical name of the load driver file
- Read driver classname – class name of the driver used for reading (e.g. Cloudera Impala – com.cloudera.impala.jdbc41.Driver)
- Read driver path – logical name of the read driver file
- Use custom connection string – if checked, the custom connection string is used
- Custom connection string – standard settings are ignored and the custom connection string is used instead
- Use Kerberos – checked in case the Hadoop cluster is kerberized
- Kerberos config file path – logical name of the Kerberos configuration file defined in the Kerberos logical file definition
- Hive service principal – Kerberos principal of the Hive service, must reflect the Hive host
- Impala service principal – Kerberos principal of the Impala service, must reflect the Impala host
- Kerberos keytab path – logical name of the Kerberos principal keytab file defined in the Kerberos logical file definition
- Kerberos user principal – user name (Kerberos principal) defined for the connection to Hadoop
- File Type – file format in which Hive stores table data on HDFS
- Compression codec – compression codec used for storing data on HDFS
- HDFS Permissions – UNIX permissions for files created on HDFS
- Use Cloudera drivers – always checked if the Hadoop cluster is a Cloudera distribution
- Cloudera driver config path – misleading field name; this is in fact the JAAS configuration file path, a file containing the Java Authentication and Authorization Service login configuration. The file is generated automatically, so in the future it should simply be created in the same directory as the other Kerberos configuration files without the need to specify it explicitly
- Skip trash – checked if HDFS files shouldn't be moved to trash after deletion
- Impala port – Impala JDBC port
- Hive port – Hive JDBC port
- Hints for hive/impala – hints that can be specified for the JDBC connection, separated by ; (e.g. SYNC_DDL=TRUE;UseNativeQuery=1)
- Open cursor logic – selects which logic is used for reading via cursor
- Number of repetitions for WebHDFS – number of times an HDFS request is repeated in the case of failure
- Number of repetitions for JDBC – number of times JDBC requests are repeated in the case of failure
- Use compression on transfer – checked in case compression is used for files created on HDFS
- Compression level – level of compression (0 = minimum, 9 = maximum)
- Use SSL – checked if SSL authentication should be used
- SSL keystore path – logical name of the SSL keystore file
- SSL keystore password – password to the keystore
- Force file cursor reader (expert setting) – the cursor reader is used at all times when reading data from Hadoop
- Use extended escaping (expert setting) – extended escaping is used at all times when writing data to Hadoop
Finish the creation of the storage by confirming (F8). If the SAP system is able to authenticate against Hadoop Kerberos and receives the expected result of the SQL command 'use database', the creation of the storage is considered successful.
Extended explanation for Storage setup with Hadoop distribution other than Cloudera
The values of the following fields can differ depending on the Hadoop distribution used:
- Load driver classname - apart from Cloudera, there are other vendors offering JDBC drivers able to connect to the Hive service (or to other Hadoop services like Impala or Drill).
In the case of the Hortonworks distribution, the JDBC Hive driver can be found directly on a Hadoop node, typically located at /usr/hdp/X.X.X.X-XXXX/hive2/jdbc/hive-jdbc-X.X.X.X-XXXX-standalone.jar.
The contents of the jar file can be listed with the 'less' command:

```
[root@skbtshcc01 ~]# less /usr/hdp/2.6.4.0-91/hive2/jdbc/hive-jdbc-2.1.0.2.6.4.0-91-standalone.jar | grep HiveDriver
-rw----  2.0 fat     6960 bl defN 18-Jan-04 10:39 org/apache/hive/jdbc/HiveDriver.class
```

The driver class name can be derived from it as org.apache.hive.jdbc.HiveDriver.
Reference links:
- https://www.cloudera.com/downloads/connectors/hive/jdbc/2-6-5.html
- https://www.simba.com/resources/jdbc/
- https://www.progress.com/jdbc/apache-hadoop-hive
- http://repo.hortonworks.com/content/repositories/releases/org/apache/hive/hive-jdbc/

- Custom connection string - not used under the standard Cloudera setup, as the connection URL is composed from the other settings. In the case of a specific configuration on the Hadoop side, or for testing purposes, it is possible to use an explicitly stated connection URL here.
Example: a connection URL directed at the Zookeeper service (port 2181) running on three Hadoop hosts, which acts as a proxy redirecting the connection to the currently active Hive server in a High Availability setup:

jdbc:hive2://skbtshcc01.hadoop.local:2181,skbtshcc03.hadoop.local:2181,skbtshcc02.hadoop.local:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
Reference links:
- https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
- https://www.cloudera.com/documentation/other/connectors/hive-jdbc/latest/Cloudera-JDBC-Driver-for-Apache-Hive-Install-Guide.pdf
- https://docs.microsoft.com/bs-latn-ba/azure/hdinsight/hadoop/apache-hadoop-connect-hive-jdbc-driver?view=aspnetcore-2.1
- https://mapr.com/docs/51/Hive/ConnectingtoHiveServer2-U_29655382-d3e110.html

- File type - the format in which the files loaded to HDFS are stored; depends on the Hadoop distribution - in Cloudera it is typically PARQUET, in Hortonworks typically ORC.
- Compression codec - an optional setting; depends on the previously selected File Type and on whether the files stored in HDFS should be compressed. If the file type is PARQUET, the compression codec is typically SNAPPY; if the file type is ORC, it is typically ZLIB.
- Use Cloudera drivers - depending on this checkbox, the default connection URL is generated either with the JDBC Cloudera/Simba driver syntax (checkbox ticked) or with the JDBC Apache Hive driver syntax (checkbox unticked). If a custom connection string is used, this should be left unchecked.
Troubleshooting
In the case of any issues with the Java connector, you can find the Java logs on the application server to determine the source of the issue with the report /DVD/SM_HIVE_DISPLAY_JAVA_LOG.