(SM-2411) Hadoop Storage Setup
Table of Contents:
- 1 Hadoop prerequisites
- 1.1 Open ports
- 1.2 DNS names
- 1.3 Hive parameters
- 1.4 Hadoop technical user
- 1.4.1 If Kerberos is used
- 1.5 HDFS landing zone
- 1.6 Hive database
- 1.7 Access privileges on the Hadoop side
- 1.7.1 If Sentry is used
- 1.7.2 If Ranger is used
- 1.8 Verification of setup
- 2 OS prerequisites (On SAP host)
- 3 SAP prerequisites
- 4 Java connector
- 5 Configuration
Hadoop prerequisites
Open ports
In a controlled network environment, firewall rules are usually in place. To enable communication between SAP systems and Hadoop, the following ports in the Hadoop cluster must be reachable from the SAP system:
| Port | Type | Hadoop service | Comment |
|---|---|---|---|
| 10000 | tcp | Hiveserver2 | |
| 10500 | tcp | Hiveserver2 LLAP | Hortonworks Hive2 LLAP |
| 14000 | tcp | HttpFS | HDFS service in the Cloudera distribution |
| 50070 | tcp | WebHDFS | Apache Hadoop HDFS service |
| 1022 | tcp | HDFS datanode | Needs to be open when WebHDFS is used |
| 2181 | tcp | Zookeeper | |
| 21050 | tcp | Impala | |
These are the default port numbers of Hadoop services.
If Kerberos is enabled, KDC (Key Distribution Center) should also be reachable on port 88 (tcp/udp) from each SAP application server.
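Reachability of the TCP ports listed in the table can be tested from the SAP host before any further setup. The following is a minimal sketch, not part of the product: it assumes bash with /dev/tcp support and the coreutils timeout command are available, and the names check_port, report_ports, and HADOOP_HOST are hypothetical. It tests TCP only, so UDP access to the KDC on port 88 has to be verified separately.

```shell
#!/usr/bin/env bash
# Default ports from the table above; trim the list to the services actually in use.
PORTS="10000 10500 14000 50070 1022 2181 21050"

# check_port HOST PORT: succeeds if a TCP connection opens within 3 seconds.
check_port() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# report_ports HOST PORT...: prints one "<port> open" or "<port> unreachable" line per port.
report_ports() {
  local host="$1" port
  shift
  for port in "$@"; do
    if check_port "$host" "$port"; then
      echo "$port open"
    else
      echo "$port unreachable"
    fi
  done
}

# Runs only when HADOOP_HOST (a hypothetical variable name) is set, e.g.
#   HADOOP_HOST=<hadoop_host_FQDN> bash check_ports.sh
if [ -n "${HADOOP_HOST:-}" ]; then
  report_ports "$HADOOP_HOST" $PORTS
fi
```

An "unreachable" result can mean either a firewall block or a service that is not running on that node, so it is worth confirming which Hadoop node hosts each service before raising a firewall request.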
DNS names
Proper DNS name translation should be configured between SAP and Hadoop for Kerberos communication.
DNS resolution should be tested from the SAP host using the OS command nslookup <hadoop_host_FQDN>.
If IBM Java is used, the reverse lookup (IP address to hostname) must also succeed.
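Forward and reverse resolution can be checked together with a small script. This is a sketch under assumptions: it uses getent hosts, which queries the operating system resolver (including /etc/hosts, depending on nsswitch.conf) rather than DNS alone as nslookup does, and the function names and the HADOOP_HOST variable are hypothetical.

```shell
#!/usr/bin/env bash
# forward_lookup NAME: prints the first address the OS resolver returns for NAME.
forward_lookup() {
  getent hosts "$1" | awk 'NR==1 {print $1}'
}

# reverse_lookup IP: prints the first host name the OS resolver returns for IP.
reverse_lookup() {
  getent hosts "$1" | awk 'NR==1 {print $2}'
}

# check_dns NAME: reports whether forward and reverse resolution both succeed.
check_dns() {
  local ip name
  ip="$(forward_lookup "$1")"
  if [ -z "$ip" ]; then
    echo "forward lookup failed for $1"
    return 1
  fi
  name="$(reverse_lookup "$ip")"
  if [ -z "$name" ]; then
    echo "reverse lookup failed for $ip"
    return 1
  fi
  echo "$1 -> $ip -> $name"
}

# Runs only when HADOOP_HOST (a hypothetical variable name) is set.
if [ -n "${HADOOP_HOST:-}" ]; then
  check_dns "$HADOOP_HOST"
fi
```

For Kerberos, the name printed at the end of the chain should match the FQDN that was looked up; a mismatch usually points to a stale PTR record or a hosts-file override.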
Hive parameters
Two configuration parameters of the Hive service must be set in the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml:
hive.exec.dynamic.partition = true
hive.exec.dynamic.partition.mode = nonstrict
NOTE: If Simba JDBC drivers are used, these parameters can be set in the SAP system for the session only, with no global impact on the cluster.
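The safety valve field takes standard hive-site.xml property markup. As a sketch, the two parameters can be rendered in that form and pasted into the field; the function name hive_site_snippet is hypothetical and only prints text, it does not change any cluster setting.

```shell
#!/usr/bin/env bash
# hive_site_snippet: prints the two parameters in hive-site.xml <property> form.
hive_site_snippet() {
  cat <<'EOF'
<property>
  <name>hive.exec.dynamic.partition</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
EOF
}

hive_site_snippet
```

After the snippet is saved in the cluster manager, the Hive service typically needs a restart for the new values to take effect.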
Hadoop technical user
We recommend creating distinct technical users for every SAP system connected to the Hadoop cluster to isolate the system's data.
There is usually a central repository for Hadoop users (LDAP/AD), but users can also be created locally (on every Hadoop cluster node).
The recommended naming convention reflects SAP <sid>adm users → <sid>hdp.
Each Hadoop technical user should have its own dedicated group. This matters when the Sentry service is used for authorization management, because Sentry assigns access roles to groups, not to individual users.
For illustration purposes, we will use Hadoop user dvqhdp (with group dvqhdp) in further text.
If Kerberos is used
Create a Kerberos principal in the form <sid>hdp@<KERBEROS_REALM>. This can be a principal created in MIT Kerberos or FreeIPA, or an Active Directory user.
To export a Kerberos keytab from Active Directory, use the following command:
ktpass /princ dvqhdp@HADOOP.LOCAL /pass badpassword1 /ptype KRB5_NT_PRINCIPAL /out DVQ.keytab
HDFS landing zone
A landing zone, typically a home folder of the technical user, needs to be created on HDFS.
The directory needs to meet the following conditions:
- The technical user needs to be able to read and write to this directory and all subdirectories.
- Impala and Hive users need to be able to read and write to this directory and all subdirectories.
NOTE: If Kerberos is used, Impala and Hive runtime keytabs are stored by Cloudera Manager under /var/run/cloudera-scm-agent/process/, or under /etc/security/keytabs in the Hortonworks distribution.
Example landing zone creation:
## Create a home directory
[root@skbtscck21 ~]# hadoop fs -mkdir -p /user/dvqhdp/.Trash
## Set ownership and permissions
[root@skbtscck21 ~]# hadoop fs -chown -R dvqhdp:dvqhdp /user/dvqhdp
[root@skbtscck21 ~]# hadoop fs -chmod -R 770 /user/dvqhdp
## Set ACL to grant access to Hive group (by default containing Hive and Impala user)
[root@skbtscck21 ~]# hadoop fs -setfacl -m default:group:hive:rwx /user/dvqhdp
## Check the directory
[root@skbtscck21 ~]# hadoop fs -ls -d /user/dvqhdp
drwxrwx---+ - dvqhdp dvqhdp 0 2017-03-22 14:45 /user/dvqhdp
HDFS parameters dfs.permissions.enabled and dfs.namenode.acls.enabled have to be set to true in hdfs-site.xml if HDFS POSIX-like permissions and HDFS ACLs are to be applied in the absence of Sentry/Ranger policy.
HDFS ACL support is not enabled in the default configuration.
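Whether the two hdfs-site.xml parameters are set can be checked with hdfs getconf. The following is a sketch with a hypothetical helper name, check_hdfs_flag. Note that hdfs getconf reads the configuration files on the node where it runs, so it is best run on a node whose client configuration is deployed by the cluster manager.

```shell
#!/usr/bin/env bash
# check_hdfs_flag KEY: prints "KEY=<value>" and succeeds only if the value is "true".
check_hdfs_flag() {
  local value
  value="$(hdfs getconf -confKey "$1" 2>/dev/null)"
  echo "$1=${value:-<unset>}"
  [ "$value" = "true" ]
}

# Runs only where the hdfs CLI is installed (for example, on a cluster gateway node).
if command -v hdfs >/dev/null 2>&1; then
  check_hdfs_flag dfs.permissions.enabled
  check_hdfs_flag dfs.namenode.acls.enabled
fi
```

If dfs.namenode.acls.enabled is not true, the hadoop fs -setfacl command from the landing zone example is rejected by the NameNode.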
Hive database
We recommend creating a dedicated database (schema) in Hive for each SAP system. The recommended database name is sap<sid> (e.g.: sapdvq).
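As an illustration of the naming convention, the statement that creates the database can be generated from the SAP system ID. This is a sketch: the function names are hypothetical, lower-casing the SID is an assumption based on the sapdvq example, and the beeline call in the comment uses a placeholder JDBC URL that has to be replaced with the cluster's own.

```shell
#!/usr/bin/env bash
# hive_db_name SID: prints the recommended database name, sap<sid>, in lower case.
hive_db_name() {
  printf 'sap%s\n' "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
}

# create_db_statement SID: prints the HiveQL statement that creates the database.
create_db_statement() {
  printf 'CREATE DATABASE IF NOT EXISTS %s;\n' "$(hive_db_name "$1")"
}

create_db_statement DVQ
# To execute it, pass the statement to a Hive client, for example:
#   beeline -u "<JDBC_URL>" -e "$(create_db_statement DVQ)"
```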
Access privileges on the Hadoop side
In a production Hadoop cluster, either the Sentry service or the Ranger service, depending on the Hadoop distribution, manages users' privileges on Hadoop resources.
For Storage Management to function properly, the Hadoop technical user <sid>hdp needs access to at least the following two resources:
- User's HDFS home directory - typically /user/<sid>hdp
- Assigned Hive database - typically sap<sid>
To set up necessary policies in the respective security service, follow the instructions below.
If Sentry is used
The Sentry service manages access to Hadoop resources using Sentry rules. Rules are created for a role, and a role can be assigned to one or more user groups (not to users directly).
We typically set up only one-to-one role ↔ group relations. You need to set up two rules: one granting ALL actions on the HDFS directory (automatically translated to a URI) and one granting ALL actions on the Hive database.
Example: