(SM-2408) Hadoop Storage Setup
Hadoop prerequisites
Open ports
In a controlled network environment, it is common to have firewall rules in place. To enable communication of SAP systems with Hadoop, the following port numbers should be reachable in the Hadoop cluster from the SAP system:
Port | Type | Hadoop service | Comment |
---|---|---|---|
10000 | tcp | Hiveserver2 | |
10500 | tcp | Hiveserver2 LLAP | Hortonworks Hive2 LLAP |
14000 | tcp | HttpFS | HDFS service in the Cloudera distribution |
50070 | tcp | WebHDFS | Apache Hadoop HDFS service |
1022 | tcp | HDFS datanode | Needs to be open when WebHDFS is used |
2181 | tcp | Zookeeper | |
21050 | tcp | Impala | |
These are the default port numbers of Hadoop services.
If Kerberos is enabled, the KDC (Key Distribution Center) should also be reachable on port 88 (tcp/udp) from each SAP application server.
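The reachability of the TCP ports listed above can be checked from an SAP application server before any further setup. The following is a minimal sketch, not part of the product: it assumes that bash (for its /dev/tcp pseudo-device) and the coreutils `timeout` command are installed on the SAP host. The function names `check_port` and `report_ports` and the host name in the example call are illustrative. UDP (port 88) is not covered by this check.

```shell
# check_port HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection can be opened, non-zero otherwise.
# The connection attempt runs in a separate bash process because the
# /dev/tcp redirection is a bash feature (not available in sh or csh).
check_port() {
    local host="$1" port="$2" wait="${3:-3}"
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# report_ports HOST PORT...
# Prints one result line per port.
report_ports() {
    local host="$1" port
    shift
    for port in "$@"; do
        if check_port "$host" "$port"; then
            echo "$host:$port reachable"
        else
            echo "$host:$port NOT reachable"
        fi
    done
}

# Example call (commented out; host name is hypothetical, ports depend on
# the Hadoop services actually in use):
# report_ports hadoop01.hadoop.local 10000 14000 21050
```

Only the ports of the services that are really used need to be open, so the port list passed to `report_ports` would normally be a subset of the table.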
DNS names
Proper DNS name translation should be configured between SAP and Hadoop for Kerberos communication.
DNS resolution should be tested from the SAP host using the OS command nslookup <hadoop_host_FQDN>.
If IBM Java is used, the reverse lookup (IP address to hostname) must also succeed.
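Forward and reverse resolution can be checked in one step. The sketch below is an illustration only and assumes a Linux host where the `getent` and `awk` commands are available (on other operating systems, nslookup can be used as described above). The function name `check_dns` is hypothetical.

```shell
# check_dns FQDN
# Performs a forward lookup (name to IP) followed by a reverse lookup
# (IP to name) using the system resolver configuration.
# Returns 0 if both succeed, 1 if the forward lookup fails,
# 2 if the reverse lookup fails.
check_dns() {
    local fqdn="$1" ip name
    ip=$(getent hosts "$fqdn" | awk 'NR==1 {print $1}')
    if [ -z "$ip" ]; then
        echo "forward lookup failed for $fqdn" >&2
        return 1
    fi
    name=$(getent hosts "$ip" | awk 'NR==1 {print $2}')
    if [ -z "$name" ]; then
        echo "reverse lookup failed for $ip" >&2
        return 2
    fi
    echo "$fqdn -> $ip -> $name"
}

# Example call (commented out; replace with a real Hadoop host FQDN):
# check_dns hadoop01.hadoop.local
```

For Kerberos, the name returned by the reverse lookup should match the FQDN used in the service principals.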
Hive parameters
Two configuration parameters of the Hive service must be set in the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml:
    hive.exec.dynamic.partition = true
    hive.exec.dynamic.partition.mode = nonstrict
Example:
NOTE: If Simba JDBC drivers are used, these parameters can instead be set in the SAP system for the connector's session only, without a global impact on the cluster.
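For the cluster-wide variant, the two parameters are entered as standard Hadoop `<property>` elements. The small helper below is only a sketch that prints such elements so they can be pasted into the safety valve; the function name `hive_site_property` is illustrative.

```shell
# hive_site_property NAME VALUE
# Prints one property element in the hive-site.xml format.
hive_site_property() {
    printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# The two parameters required by this guide:
hive_site_property hive.exec.dynamic.partition true
hive_site_property hive.exec.dynamic.partition.mode nonstrict
```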
Hadoop technical user
We recommend creating distinct technical users for every SAP system connected to the Hadoop cluster to isolate the system's data.
There is usually a central repository for Hadoop users (LDAP/AD), but users can also be created locally (on every Hadoop cluster node).
The recommended naming convention reflects SAP <sid>adm users → <sid>hdp.
Each Hadoop technical user should have its own dedicated group. This matters when the Sentry service is used for authorization management, because Sentry assigns access roles to groups.
For illustration purposes, we will use Hadoop user dvqhdp (with group dvqhdp) in further text.
If Kerberos is used
Create Kerberos principal in the form of <sid>hdp@<KERBEROS_REALM>. This can be either a principal created in MIT Kerberos, FreeIPA, or an Active Directory user.
To export a Kerberos keytab from Active Directory, use the following command:
ktpass /princ dvqhdp@HADOOP.LOCAL /pass badpassword1 /ptype KRB5_NT_PRINCIPAL /out DVQ.keytab
HDFS landing zone
A landing zone, typically a home folder of the technical user, needs to be created on HDFS.
The directory needs to meet the following conditions:
- Technical user needs to be able to read and write to this directory and all subdirectories.
- Impala and Hive users need to be able to read and write to this directory and all subdirectories.
NOTE: If Kerberos is used, Impala and Hive runtime keytabs are stored by Cloudera Manager under /var/run/cloudera-scm-agent/process/, or under /etc/security/keytabs in the Hortonworks distribution.
Example landing zone creation:
    ## Create a home directory
    [root@skbtscck21 ~]# hadoop fs -mkdir -p /user/dvqhdp/.Trash
    ## Set ownership and permissions
    [root@skbtscck21 ~]# hadoop fs -chown -R dvqhdp:dvqhdp /user/dvqhdp
    [root@skbtscck21 ~]# hadoop fs -chmod -R 770 /user/dvqhdp
    ## Set ACL to grant access to Hive group (by default containing Hive and Impala user)
    [root@skbtscck21 ~]# hadoop fs -setfacl -m default:group:hive:rwx /user/dvqhdp
    ## Check the directory
    [root@skbtscck21 ~]# hadoop fs -ls -d /user/dvqhdp
    drwxrwx---+ - dvqhdp dvqhdp 0 2017-03-22 14:45 /user/dvqhdp
In hdfs-site.xml, HDFS permission checking and dfs.namenode.acls.enabled have to be set to true if HDFS POSIX-like permissions and HDFS ACLs are to be applied in the absence of a Sentry/Ranger policy.
HDFS ACL support is not enabled in the default configuration.
Hive database
We recommend creating a dedicated database (schema) in Hive for each SAP system. The recommended database name is sap<sid> (e.g.: sapdvq).
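The naming conventions used in this guide can be derived mechanically from the SAP system ID. The sketch below is an illustration with hypothetical function names. It prints the technical user and group, the HDFS home directory, the Hive database name, and a CREATE DATABASE statement that could be executed through beeline by a user with sufficient privileges.

```shell
# Convert an SAP system ID to lowercase, e.g. DVQ -> dvq.
lower_sid() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

# hadoop_names SID
# Prints the Hadoop-side names that follow the conventions of this guide.
hadoop_names() {
    local sid
    sid=$(lower_sid "$1")
    echo "user=${sid}hdp"
    echo "group=${sid}hdp"
    echo "home=/user/${sid}hdp"
    echo "database=sap${sid}"
}

# create_database_sql SID
# Prints the HiveQL statement that creates the dedicated database.
create_database_sql() {
    echo "CREATE DATABASE IF NOT EXISTS sap$(lower_sid "$1");"
}

hadoop_names DVQ
create_database_sql DVQ
```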
Access privileges on the Hadoop side
In a production Hadoop cluster, depending on the Hadoop distribution, either the Sentry service or the Ranger service is responsible for managing users' privileges on Hadoop resources.
For proper functionality of Storage Management, Hadoop technical user <sid>hdp needs to have access to the following two resources at least:
- User's HDFS home directory - typically /user/<sid>hdp
- Assigned Hive database - typically sap<sid>
To set up necessary policies in the respective security service, follow the instructions below.
If Sentry is used
The Sentry service manages access to Hadoop resources using Sentry rules. The rules are created for a role, which can have a one-to-many relation with user groups (not with users directly).
We typically set up only one-to-one role ↔ group relations. You need to set up two rules: one granting ALL actions on the HDFS directory (automatically translated to a URI) and one granting ALL actions on the Hive database.
Example:
NOTE: If HDFS ACL synchronization with Sentry rules is enabled, add the user's directory or any parent directory to the Sentry Synchronization Path Prefixes parameter in the HDFS service configuration.
More information on the HDFS ACL synchronization topic can be found in the chapter Synchronizing HDFS ACLs and Sentry Permissions.
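The two Sentry rules described above could be created with SQL statements executed through beeline by a Sentry administrator. The helper below only prints the statements and is a sketch under assumptions: the role name sap<sid>_admin follows the example output shown in the verification section of this guide, and the NameNode host in the example call is hypothetical.

```shell
# sentry_statements SID HOME_URI
# SID is the lowercase SAP system ID, HOME_URI the full HDFS URI of the
# technical user's home directory.
sentry_statements() {
    local sid="$1" uri="$2"
    local role="sap${sid}_admin"   # assumed role naming, any name works
    echo "CREATE ROLE ${role};"
    echo "GRANT ROLE ${role} TO GROUP ${sid}hdp;"
    echo "GRANT ALL ON DATABASE sap${sid} TO ROLE ${role};"
    echo "GRANT ALL ON URI '${uri}' TO ROLE ${role};"
}

# Example with the illustration user of this guide (hypothetical NameNode):
sentry_statements dvq hdfs://namenode.hadoop.local:8020/user/dvqhdp
```

Because roles are granted to groups, the second statement relies on the dedicated group of the technical user.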
If Ranger is used
Similar to Sentry, the Ranger service manages access to Hadoop resources using policies. The policies can grant privileges either to groups or to users directly.
Again, you need to create two policies: one granting the Hadoop technical user full access to its HDFS directory and one granting access to the Hive database.
HDFS policy example:
Hive database policy example:
Verification of setup
To verify the setup on the Hadoop side is valid, we recommend using the technical user to create a testing table inside the sap<sid> Hive database.
Load a test file to the user's home directory, verify whether it is there, and delete it in the end.
If all the commands are successful, you can conclude that the setup is valid.
    [root@skbtscck21 ~]# database=sapnsq   # set the database variable
    [root@skbtscck21 ~]# jdbc_string="jdbc:hive2://skbtscck21.hadoop.local:10000/$database;principal=hive/skbtscck21.hadoop.local@HADOOP.LOCAL;ssl=true;sslTrustStore=/opt/certs/jks/skbtscck21-keystore.jks;trustStorePassword=123456aB"   # set the JDBC string
    [root@skbtscck21 ~]# kinit -kt ~/nsqhdp.keytab nsqhdp@HADOOP.LOCAL   # authenticate as the technical user
    [root@skbtscck21 ~]# klist   # verify that you have a valid ticket
    Ticket cache: FILE:/tmp/krb5cc_0
    Default principal: nsqhdp@HADOOP.LOCAL

    Valid starting       Expires              Service principal
    04/25/2019 17:24:03  04/26/2019 17:24:03  krbtgt/HADOOP.LOCAL@HADOOP.LOCAL
            renew until 05/02/2019 17:24:03
    [root@skbtscck21 ~]# beeline -u "$jdbc_string" -e "create table if not exists $database.xxx (a int)"   # create test table
    INFO : OK
    [root@skbtscck21 ~]# beeline -u "$jdbc_string" -e "drop table $database.xxx"   # delete test table
    INFO : OK
    [root@skbtscck21 ~]# beeline -u "$jdbc_string" -e "show current roles"   # only with Sentry - show roles of the user
    INFO : OK
    +---------------+--+
    | role          |
    +---------------+--+
    | sapnsq_admin  |
    +---------------+--+
    [root@skbtscck21 ~]# beeline -u "$jdbc_string" -e "show grant role sapnsq_admin"   # only with Sentry - display privileges of the role; should look like the one below
    INFO : OK
    +--------------------------------------------------+--------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+
    | database                                         | table  | partition  | column  | principal_name  | principal_type  | privilege  | grant_option  | grant_time        | grantor  |
    +--------------------------------------------------+--------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+
    | sapnsq                                           |        |            |         | sapnsq_admin    | ROLE            | all        | false         | 1490282925235000  | --       |
    | hdfs://skbtscck21.hadoop.local:8020/user/nsqhdp  |        |            |         | sapnsq_admin    | ROLE            | all        | false         | 1490282925256000  | --       |
    +--------------------------------------------------+--------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+
    [root@skbtscck21 ~]# echo Success > file.txt   # create a test file
    [root@skbtscck21 ~]# hadoop fs -copyFromLocal ./file.txt /user/nsqhdp/   # store file on HDFS
    [root@skbtscck21 ~]# hadoop fs -cat /user/nsqhdp/file.txt   # read file
    Success
    [root@skbtscck21 ~]# hadoop fs -rm -skipTrash /user/nsqhdp/file.txt   # clean up
    Deleted /user/nsqhdp/file.txt
OS prerequisites (On SAP host)
This group of requirements relates to the operating systems underlying the SAP system with all its application servers. SNP products (e.g. SNP Glue™, SNP OutBoard™ Data Tiering) have been developed and tested on the SUSE Linux environment and Windows Server 2012.
By design, they are not limited by the choice of an operating system, provided the requirements listed in this guide are met. Successful implementations have also been done on AIX and Solaris.
OS directories
The Hadoop connector uses two directories to store configuration and log files.
Create them with appropriate permissions (read/write) and usual SAP directory ownership (<sid>adm:sapsys).
dvd_conn directory in shared /sapmnt filesystem:
    $ ls -ld /sapmnt/DVQ/global/security/dvd_conn
    drwx------ 2 dvqadm sapsys 4096 --- /sapmnt/DVQ/global/security/dvd_conn
This one contains drivers, Kerberos, and SSL-related files. It is shared among all SAP application servers.
dvd_conn directory in the work directory of each SAP application server:
    $ ls -ld /usr/sap/DVQ/DVEBMGS05/work/dvd_conn
    drwxr-xr-x 7 dvqadm sapsys 4096 --- /usr/sap/DVQ/DVEBMGS05/work/dvd_conn
This one stores Java Connector libraries, configuration, and log files. The folder can also be created automatically during Java connector setup in /DVD/JCO_MNG.
In previous Storage Management installations, the directory used to reside directly in /usr/sap/<SID>/dvd_conn. It was moved to the work directory to be logically grouped with other SAP-related operational and log files.
    $ ls -ld /usr/sap/DVQ/dvd_conn
    drwxr-xr-x 2 dvqadm sapsys 4096 --- /usr/sap/DVQ/dvd_conn
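The two current directories (shared and per-instance) can be created with a few commands. The function below is a minimal sketch with a hypothetical name; it assumes it is run as the <sid>adm user, so that ownership is already correct, and it uses the permission modes visible in the listings above.

```shell
# create_dvd_conn_dirs GLOBAL_DIR WORK_DIR
# GLOBAL_DIR corresponds to /sapmnt/<SID>/global, WORK_DIR to the work
# directory of one SAP application server instance.
create_dvd_conn_dirs() {
    local global_dir="$1" work_dir="$2"
    mkdir -p "$global_dir/security/dvd_conn" "$work_dir/dvd_conn" || return 1
    chmod 700 "$global_dir/security/dvd_conn"   # drwx------ (shared directory)
    chmod 755 "$work_dir/dvd_conn"              # drwxr-xr-x (work directory)
    # If run as root instead of <sid>adm, ownership must be set as well, e.g.:
    # chown dvqadm:sapsys "$global_dir/security/dvd_conn" "$work_dir/dvd_conn"
}

# Example call (commented out; paths of the DVQ example system):
# create_dvd_conn_dirs /sapmnt/DVQ/global /usr/sap/DVQ/DVEBMGS05/work
```

The function has to be called once per application server instance, because each instance has its own work directory.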
JDBC Drivers
JDBC protocol is used to connect to Hadoop services (Hive and Impala). JDBC drivers have to be manually stored on the operating system and be accessible to the connector.
We recommend storing the drivers in the shared dvd_conn directory, organized in sub-directories to avoid possible conflicts.
    $ ls -ld /sapmnt/DVQ/global/security/dvd_conn/[hi]*
    drwxr-x--- 2 dvqadm sapsys 4096 --- /sapmnt/DVQ/global/security/dvd_conn/hive
    drwxr-x--- 3 dvqadm sapsys 4096 --- /sapmnt/DVQ/global/security/dvd_conn/impala
There are multiple drivers able to facilitate communication between Java Connector and respective Hadoop services.
Based on our experience, the most reliable are Simba drivers (adopted by Cloudera).
Kerberos keytab and configuration files
The Kerberos keytab of <sid>hdp principal should be exported from the Kerberos database, copied into the operating system directory /sapmnt/<SID>/global/security/dvd_conn, and made available to the <sid>adm user:
    /sapmnt/DVQ/global/security/dvd_conn # ls -l DVQ.keytab
    -r-------- 1 dvqadm sapsys 59 Apr 5 15:12 DVQ.keytab
At the same location, the Kerberos configuration file should be created or copied and made readable for the user <sid>adm.
Usually a suitable krb5.conf file can be found on Hadoop nodes. Here is a sample of the Kerberos configuration file:
    /sapmnt/DVQ/global/security/dvd_conn # ls -l krb5.conf
    -r-------- 1 dvqadm sapsys 393 Feb 22 16:28 krb5.conf
    /sapmnt/DVQ/global/security/dvd_conn # cat krb5.conf
    [libdefaults]
    default_realm = HADOOP.LOCAL
    dns_lookup_kdc = false
    dns_lookup_realm = false
    ticket_lifetime = 86400
    renew_lifetime = 604800
    forwardable = true
    default_tgs_enctypes = rc4-hmac
    default_tkt_enctypes = rc4-hmac
    permitted_enctypes = rc4-hmac
    udp_preference_limit = 1
    kdc_timeout = 3000
    [realms]
    HADOOP.LOCAL = {
    kdc = hadoop01.hadoop.local
    admin_server = hadoop01.hadoop.local
    }
To be able to use the default SAP logical paths and names provided by SNP, make sure that you follow the naming prescribed in this section.
Files that need to follow the naming convention and location are krb5.conf, <SID>.keytab (SID in UPPERCASE), jssecacerts, and the directories hive and impala that contain the JDBC drivers.
Kerberos keytab verification
To verify that Kerberos config and keytab are valid, execute the following steps.
This is not part of the permanent configuration and is only listed here to make sure Kerberos authentication is working from the SAP side.
If the klist and kinit commands are not available on the SAP application server OS, they can be installed with the krb5-client software package.
    ## switch to <sid>adm
    vsks012:~ # su - dvqadm
    ## change to dvd_conn directory
    vsks012:dvqadm 54> cd /sapmnt/DVQ/global/security/dvd_conn/
    ## check principal in the keytab
    vsks012:dvqadm 55> klist -k DVQ.keytab
    Keytab name: FILE:DVQ.keytab
    KVNO Principal
    ---- --------------------------------------------------------------------------
       9 dvqhdp@DATA.DEV
    ## Set environment variable for kerberos config and login as the technical user. This is the expected result
    vsks012:dvqadm 56> setenv KRB5_CONFIG /sapmnt/DVQ/global/security/dvd_conn/krb5.conf && kinit -kt DVQ.keytab dvqhdp@DATA.DEV && klist
    Ticket cache: FILE:/tmp/krb5cc_1001
    Default principal: dvqhdp@DATA.DEV

    Valid starting     Expires            Service principal
    05/14/19 11:04:08  05/15/19 11:04:08  krbtgt/DATA.DEV@DATA.DEV
            renew until 05/21/19 11:04:08

    Kerberos 4 ticket cache: /tmp/tkt1001
    klist: You have no tickets cached
SSL Certificates for Java
If your Hadoop cluster has full SSL/TLS encryption enabled, it’s necessary to create a Java truststore and save it on the following path with correct ownership and permissions:
    /sapmnt/<SID>/global/security/dvd_conn # ls -l jssecacerts
    -r-------- 1 <SID>adm sapsys 59 Apr 5 15:12 jssecacerts
This truststore needs to contain the CA certificate with which Hadoop host certificates were signed. Another possibility is to store the respective hosts' self-signed certificates.
An alternative option is to copy the complete jssecacerts truststore from any Hadoop node and place it in this path.
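If the truststore is built by hand, the JDK `keytool` utility can import the CA certificate (or each self-signed host certificate) into the jssecacerts file. The wrapper below is a sketch with a hypothetical function name and default alias; it assumes that `keytool` from a JDK or JRE is on the PATH and that the certificate is available as a PEM or DER file.

```shell
# import_ca_cert CERT_FILE TRUSTSTORE STOREPASS [ALIAS]
# Adds one certificate as a trusted entry to a Java truststore. keytool
# creates the truststore file if it does not exist yet.
import_ca_cert() {
    local cert="$1" store="$2" pass="$3" alias_name="${4:-hadoop-ca}"
    keytool -importcert -noprompt \
        -alias "$alias_name" \
        -file "$cert" \
        -keystore "$store" \
        -storepass "$pass"
}

# Example call (commented out; file names and password are placeholders):
# import_ca_cert ca.pem /sapmnt/DVQ/global/security/dvd_conn/jssecacerts changeit
```

Each imported certificate needs its own alias, so the fourth argument should be varied when several self-signed host certificates are stored. Afterwards, ownership and permissions should be set as shown in the listing above.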
SAP prerequisites
Kerberos cookie encoding
By default, the SAP system encodes certain characters in cookies. SAP Note 1160362 describes the behavior in more detail. As the Kerberos cookie must not be modified in any way for the Kerberos server to accept it, this encoding should be disabled by setting the following parameter in each SAP application server's instance profile:
ict/disable_cookie_urlencoding = 1
Incompatible Kernel version
SAP kernel 7.53 patch level 5 introduced a change in this parameter, which caused Storage Management to malfunction. Therefore, Hadoop storage does not work on SAP kernel 7.53 patch levels 5 to 222. As of SAP kernel 7.53 patch level 223, the parameter can be set to 2, which restores the previous behavior. The issue is described in SAP Note 2681175.
ict/disable_cookie_urlencoding = 2
The parameter is dynamic in kernel version 7.53 and higher.
SSL for SAP RFCs
To enable SSL communication for SAP RFCs, add client certificates enabling communication with Hadoop nodes to the SAP certificate list in the transaction STRUST.
If HttpFS is used, a client certificate of the HttpFS host is required.
If WebHDFS is used, client certificates of all datanodes and namenodes are necessary.
To import necessary certificates:
- Use transaction STRUST.
- In the left menu choose the certificate list that you want to add the certificate to (by double-clicking).
- In the right window area in the bottom left, click Import.
- In tab FILE of the dialog window point to the certificate file on your local file system (the certificate should be in .pem format).
- After you confirm the path and SAP can recognize the certificate, details are displayed in the corresponding fields.
- To complete adding the certificate click Add to Certificate List.
- Click Save (in the general menu or Ctrl+S).
- Restart ICM in transaction SMICM: Menu > Administration > ICM > Exit Soft > Local/Global.
Check the following parameters (typically they are set and active):
HTTP service and ICM parameters
The HTTP service must be active in the SAP system. It can be checked via transaction SMICM > [Goto] > Services.
If the HttpFS/WebHDFS service is SSL-secured, the HTTPS service needs to be active as well.
The following two parameters affect the HTTP communication of the SAP system. By default, they do not need to be modified and are listed here for information only:
- icm/HTTP/client/keep_alive_timeout: HTTP communication timeout; can be raised if HTTP communication fails with a timeout.
- icm/HTTP/max_request_size_KB: Maximum size of data that ICM accepts (default 100 MB).
Java connector
The Java connector is a critical middleware component. Follow the steps in the guide (SM-2408) Java Connector Setup to set it up before you continue.
Configuration
When all prerequisites are fulfilled, further configuration is performed in the SAP system.
RFC Destination
An RFC destination needs to be created via transaction SM59.
HttpFS/WebHDFS RFC
Storage management communicates with HDFS using WebHDFS API. In the reference configuration, we will be using the HttpFS service, but this can be substituted with WebHDFS if necessary.
The only difference in the setup when WebHDFS is used is the port number and SSL certificates required in STRUST.
This RFC connection is used for communication with Hadoop's HttpFS service which mediates operations in HDFS.
Two RFCs pointing to different HttpFS services can be created to ensure High Availability.
The name and description of the destination are optional, but it is recommended to designate its purpose with keywords Hadoop and HttpFS. In our example, the RFC destination also contains the Hadoop server hosting the HttpFS service for the sake of clarity:
Entries explained:
- Connection Type: G for HTTP connection to an external service
- Target host: FQDN of Hadoop server hosting HttpFS service
- Service No.: Port number on which HttpFS service is listening (14000 for HttpFS, 50070 for WebHDFS)
- Path Prefix: This string consists of two parts:
  - /webhdfs/v1 part is mandatory
  - /user/dvqhdp part defines the Hadoop user's 'root' directory in HDFS where flat files from the SAP system are loaded
If SSL is used: Enable SSL on the Logon & Security tab and select the client certificate PSE to be used.
If the RFC destination connection test fails on "SSL handshake", there's a collection of helpful information published by SAP in the SAP Note 510007.
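The target host, service number, and path prefix of the RFC destination together form a WebHDFS REST URL, which can be useful for testing the service outside of SAP. The sketch below builds such a URL; the function name `webhdfs_url` and the host name are hypothetical, and GETFILESTATUS is used as a harmless read-only WebHDFS operation.

```shell
# webhdfs_url SCHEME HOST PORT HDFS_PATH [OPERATION]
# SCHEME is http or https, PORT is 14000 for HttpFS or 50070 for WebHDFS,
# HDFS_PATH is the user's directory, e.g. /user/dvqhdp.
webhdfs_url() {
    local scheme="$1" host="$2" port="$3" path="$4" op="${5:-GETFILESTATUS}"
    printf '%s://%s:%s/webhdfs/v1%s?op=%s\n' "$scheme" "$host" "$port" "$path" "$op"
}

webhdfs_url http httpfs01.hadoop.local 14000 /user/dvqhdp

# With a valid Kerberos ticket, the URL could be queried with curl using
# SPNEGO authentication (commented out, needs a reachable cluster):
# curl --negotiate -u : "$(webhdfs_url http httpfs01.hadoop.local 14000 /user/dvqhdp)"
```

If this request fails from the SAP host, the RFC connection test in SM59 is unlikely to succeed either, which helps to separate network or Kerberos problems from SAP-side configuration problems.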
Logical paths and logical filenames
SNP ships default logical paths leading to files stored on SAP application servers. These default values are usable for most SAP installations and in general, don't need to be changed. If these default paths are used, the administrator needs to make sure that all security-related files (Kerberos keytab, Kerberos config, jssecacerts) and drivers stick to the default naming used in these paths.
In the shipped SNP paths, the standard SAP logical paths DIR_HOME and DIR_GLOBAL are frequently used as variables. These usually resolve to DIR_HOME = /usr/sap/<SID>/<INSTANCE>/work and DIR_GLOBAL = /sapmnt/<SID>/global/.
Logical files and their values shipped by SNP are the following:
- /DVD/DEF_KRB_KEYTAB = <P=DIR_GLOBAL>/security/dvd_conn/<SID>.keytab
- /DVD/DEF_KRB_CONFIG = <P=DIR_GLOBAL>/security/dvd_conn/krb5.conf
- /DVD/DEF_SSL_TRUSTSTORE = <P=DIR_GLOBAL>/security/dvd_conn/jssecacerts
- /DVD/DEF_JCO_DIR = <P=DIR_HOME>/dvd_conn/
- /DVD/DEF_HIVE_DRIVER = <P=DIR_GLOBAL>/security/dvd_conn/hive
- /DVD/DEF_IMPALA_DRIVER = <P=DIR_GLOBAL>/security/dvd_conn/impala
If a customer wants to use a custom location or filename, a Z copy of these logical paths and logical files with customizations needs to be made, as any direct change to these paths would be overwritten during the Storage Management update.
Storage Management setup
Storage Management facilitates transparent communication with different storage types: HDFS for flat files and Hive for structured data.
To transparently store data, two types of Hadoop storage need to be defined in Storage Management (/DVD/SM_SETUP):
- HDFS storage which facilitates the transfer of files to HDFS through the Hadoop HttpFS service
- Hive storage which enables data replication between SAP tables and Hive tables
The third type of storage is required for efficient querying of data located in Hive:
- Impala storage connects to Impala agents to provide fast SQL execution by leveraging Impala in-memory data caching
HDFS storage type
Sample entry:
Entries explained:
- HTTP RFC Destination: HttpFS RFC destination defined in SM59
- HTTP RFC Destination (HA): Secondary HttpFS RFC destination defined in SM59 as a failover
- HDFS Repeat: Number of retries if request to HDFS is not successful
- HDFS Repeat Time (seconds): The delay between request attempts
- HDFS Permissions: Files will be created with these permissions (e.g. 770 for -rwxrwx---)
Authentication settings:
- Authentication method: Authentication method toward the Hadoop cluster
- Username: Hadoop user-principal
- JCO RFC destination: Authentication RFC destination created in Authentication RFC
- Krb. Config file path: Logical file definition for Kerberos configuration file, default /DVD/DEF_KRB_CONFIG can be used
- Krb. Keytab path: Logical file definition for Kerberos keytab file, default /DVD/DEF_KRB_KEYTAB can be used
- SSL Keystore: Logical file definition for SSL Keystore, default /DVD/DEF_SSL_TRUSTSTORE can be used
- SSL Password: Password for accessing SSL Keystore
Entries in this table are case-sensitive. Be especially careful with the principal name, as this can produce errors where the cause is usually hard to identify.
Complete the creation of the storage by confirming (F8). If the SAP system can authenticate against Hadoop Kerberos and get properties of the HDFS home directory (/user/<sid>hdp) from the HttpFS service, storage creation is considered successful.
Hive storage type
The Hive metastore storage is created in a similar way to the HDFS storage, but the values are different:
Entries explained:
General:
- Storage ID: Name of the storage
- Storage Type: Choose SM_TRS_MS for Hive
- Description: Extended description of the storage for easier identification
- Java connector RFC: Hive RFC destination defined in Hive RFC
Hadoop:
- HTTP RFC Destination: HttpFS RFC destination defined in HttpFS RFC
- HTTP RFC Destination (HA): HttpFS RFC destination (High Availability)
- Database: Hive database created in the Hive database
- File Type: File format in which Hive stores table data on HDFS
- Compression codec: Compression codec used for storing data on HDFS
- Load Engine: Engine used for loading (writing) data, e.g. Hive or Impala
- Read Engine: Engine used for reading data, e.g. Hive or Impala
Hive settings:
- Hive host: Hadoop server hosting the Hive service
- Hive host for high availability: HA Hive host
- Hive port: Hive JDBC port
Impala settings:
- Impala host: Hadoop server hosting the Impala service
- Impala host for high availability: HA Impala host
- Impala port: Impala JDBC port
Drivers:
- Load driver path: Logical name of the Load driver file
- Load driver class: Classname of the driver used for loading (e.g. Cloudera Hive - com.cloudera.hive.jdbc41.HS2Driver)
- Read driver path: Logical name of the Read driver path
- Read driver class: Classname of the driver used for reading (e.g. Cloudera Impala - com.cloudera.impala.jdbc41.Driver)
Security:
- Username: Hadoop user created in Hadoop user, group, and HDFS directory
- Kerberos config file path: Logical name of the Kerberos configuration file defined in the Kerberos logical file definition
- Kerberos keytab path: Logical name of the Kerberos principal keytab file defined in the Kerberos logical file definition
- Hive principal: Kerberos principal of the Hive service; must reflect the Hive host
- Impala principal: Kerberos principal of the Impala service; must reflect the Impala host
SSL Settings:
- SSL Enabled: Checked if SSL authentication should be used
- SSL Keystore path: Logical name of the SSL Keystore file
- SSL Keystore password: Password to Keystore
Advanced:
- Staging location type: Storage location for data staging area (external CSV tables)
- Staging location URL (non-default): URL address for the data staging area (e.g. Azure DataLake)
- Use custom connection string: If checked, use the custom connection string
- Custom connection string: Standard settings are ignored, the custom connection string is used instead
- HDFS repeat: Number of times HDFS request is repeated in the case of failure
- HDFS repeat time (seconds): Seconds between repetitions - if not filled (0), the default value is 3
- JDBC repeat: Number of times JDBC requests are repeated in the case of failure
- JDBC login timeout: Timeout for JDBC connect to database
- JDBC connection pool size: Number of connections in the JDBC connection pool
- Hints for hive/impala: Hints that can be specified for JDBC connection, separated by ; (e.g. SYNC_DDL=TRUE;UseNativeQuery=1)
- Database for TMP tables: Hive database where temporary tables are created
- Use compression on transfer: Checked in the case compression is used for files created on HDFS
- Compression level: Level of compression (0-minimum, 9-maximum)
- Force file cursor reader (expert setting): Cursor reader is used all the time when reading data from Hadoop
- Open cursor logic: Select which logic is used for reading via the cursor
- Skip trash: Checked if HDFS files shouldn't be moved to the trash after deleting them
- Use extended escaping (expert setting): Extended escaping is used all the time when writing data to Hadoop
Finish the creation of the storage by confirming (F8). If the SAP system can authenticate against Hadoop Kerberos and receives the expected result of the SQL command 'use database', the creation of the storage is considered successful.
Extended explanation for Storage setup with Hadoop distribution other than Cloudera
The following fields' values can differ depending on the Hadoop distribution used:
- Load driver classname: Apart from Cloudera, other vendors also offer JDBC drivers able to connect to the Hive service (or other Hadoop services like Impala or Drill).
In the case of the Hortonworks distribution, the JDBC Hive driver can be found directly on the Hadoop node among other JAR files at /usr/hdp/current/hadoop-client/client/. The contents of the JAR file can be listed with the 'less' command:
    [root@skbtshcc01 ~]# less /usr/hdp/2.6.4.0-91/hive2/jdbc/hive-jdbc-2.1.0.2.6.4.0-91-standalone.jar | grep HiveDriver
    -rw---- 2.0 fat 6960 bl defN 18-Jan-04 10:39 org/apache/hive/jdbc/HiveDriver.class
The driver class name can be derived from it as
    org.apache.hive.jdbc.HiveDriver
Based on experience, hive-jdbc-xxx-standalone.jar is not as "standalone" as the name would suggest.
Three other JAR files are needed in the same driver directory when the connection is established via Zookeeper (these may vary depending on the Hadoop version):
    hadoop-auth-X.X.X.X.X.X.X-XX.jar
    hadoop-common-X.X.X.X.X.X.X-XX.jar
    hadoop-yarn-registry-X.X.X.X.X.X.X-XX.jar
Reference links:
https://www.cloudera.com/downloads/connectors/hive/jdbc/2-6-5.html
https://www.simba.com/resources/jdbc/
https://www.progress.com/jdbc/apache-hadoop-hive
http://repo.hortonworks.com/content/repositories/releases/org/apache/hive/hive-jdbc/
- Custom connection string: Not used under the standard Cloudera setup, as the connection URL is composed from the other settings.
In case of a specific configuration on the Hadoop side or for testing purposes, it is possible to use an explicitly stated connection URL here.
Example: connection URL directed at the Zookeeper service (port 2181) running on three Hadoop hosts, acting as a proxy redirecting the connection to the currently active Hive server in a High Availability setup:
    jdbc:hive2://skbtshcc01.hadoop.local:2181,skbtshcc03.hadoop.local:2181,skbtshcc02.hadoop.local:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
Reference links:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
https://www.cloudera.com/documentation/other/connectors/hive-jdbc/latest/Cloudera-JDBC-Driver-for-Apache-Hive-Install-Guide.pdf
https://docs.microsoft.com/bs-latn-ba/azure/hdinsight/hadoop/apache-hadoop-connect-hive-jdbc-driver?view=aspnetcore-2.1
https://docs.datafabric.hpe.com/51/Hive/ConnectingtoHiveServer2-U_29655382-d3e110.html
- File type: Format in which the files loaded to HDFS are stored. It depends on the Hadoop distribution - in Cloudera and MapR it's typically PARQUET, in Hortonworks it is typically ORC.
- Compression codec: Optional setting. It depends on the previously selected File Type and whether the files stored in HDFS should be compressed or not. If the file type is PARQUET, the compression codec is typically SNAPPY. If the file type is ORC, the compression codec is typically ZLIB.
- Use Cloudera drivers: Depending on this checkbox, the default connection URL is generated either with JDBC Cloudera/Simba driver syntax (checkbox ticked) or with JDBC Apache Hive driver syntax (checkbox unticked).
If a custom connection string is used, this should be left unchecked.
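A custom connection string for ZooKeeper service discovery, as in the example above, is a comma-separated list of host:port pairs followed by the discovery parameters. The helper below is a sketch with a hypothetical function name that assembles such a URL for the Apache Hive JDBC driver syntax; the namespace hiveserver2 and port 2181 are the values from the example and may differ per cluster.

```shell
# hive_zk_url NAMESPACE PORT HOST...
# Builds a HiveServer2 JDBC URL that uses ZooKeeper service discovery.
hive_zk_url() {
    local namespace="$1" port="$2" quorum="" host
    shift 2
    for host in "$@"; do
        # Append "host:port", comma-separated after the first entry.
        quorum="${quorum:+$quorum,}$host:$port"
    done
    printf 'jdbc:hive2://%s/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=%s\n' \
        "$quorum" "$namespace"
}

hive_zk_url hiveserver2 2181 \
    skbtshcc01.hadoop.local skbtshcc03.hadoop.local skbtshcc02.hadoop.local
```

Further driver-specific parameters (for example Kerberos or SSL settings) would have to be appended according to the documentation of the JDBC driver in use.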