(SM-2411) Hadoop Storage Setup

Hadoop prerequisites

Open ports

In a controlled network environment, it is common to have firewall rules in place. To enable communication of SAP systems with Hadoop, the following port numbers should be reachable in the Hadoop cluster from the SAP system:

Port    Type   Hadoop service     Comment
10000   tcp    Hiveserver2
10500   tcp    Hiveserver2 LLAP   Hortonworks Hive2 LLAP
14000   tcp    HttpFS             HDFS service in the Cloudera distribution
50070   tcp    WebHDFS            Apache Hadoop HDFS service
1022    tcp    HDFS datanode      Needs to be open when WebHDFS is used
2181    tcp    Zookeeper
21050   tcp    Impala

These are the default port numbers of Hadoop services.

If Kerberos is enabled, the KDC (Key Distribution Center) should also be reachable on port 88 (tcp/udp) from each SAP application server.
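
Reachability of the default ports listed above can be spot-checked from an SAP application server before the connection is configured. The following is a minimal sketch, assuming bash and the GNU coreutils timeout command are available on the SAP host. The function names check_port and check_hadoop_ports are illustrative and not part of any SAP or Hadoop tooling, and the port list simply repeats the defaults from the table.

```shell
# Default ports from the table above, as port:service pairs
# (adjust if the cluster uses non-default ports).
HADOOP_PORTS="10000:Hiveserver2 10500:Hiveserver2-LLAP 14000:HttpFS 50070:WebHDFS 1022:HDFS-datanode 2181:Zookeeper 21050:Impala"

# check_port HOST PORT
# Succeeds if a TCP connection can be opened within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so no extra network tools are needed.
check_port() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# check_hadoop_ports HOST
# Prints one status line per port; returns non-zero if any port is unreachable.
check_hadoop_ports() {
  local host="$1" entry port service failed=0
  for entry in $HADOOP_PORTS; do
    port="${entry%%:*}"
    service="${entry#*:}"
    if check_port "$host" "$port"; then
      echo "OK     $port/tcp $service"
    else
      echo "FAILED $port/tcp $service"
      failed=1
    fi
  done
  return "$failed"
}
```

Call it as check_hadoop_ports <hadoop_host_FQDN>. A FAILED line for a service that is not deployed in the cluster (for example Hiveserver2 LLAP outside Hortonworks) is expected and can be ignored. The sketch covers TCP only, so the udp variant of the KDC port 88 has to be verified separately.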

DNS names

Proper DNS name translation should be configured between SAP and Hadoop for Kerberos communication.
DNS resolution should be tested from the SAP host using the OS command nslookup <hadoop_host_FQDN>.
If IBM Java is used, the reverse lookup (IP address to hostname) must also succeed.
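
The forward and reverse lookups can be combined into one check. This is a sketch under assumptions: it uses getent hosts, which follows the host's configured resolver order (including /etc/hosts), rather than nslookup, which queries DNS directly. The function name check_dns is illustrative, and the comparison is case-sensitive.

```shell
# check_dns FQDN
# Forward lookup (name -> IP) followed by reverse lookup (IP -> name).
# Succeeds only if the reverse lookup returns the original FQDN.
check_dns() {
  local fqdn="$1" ip name
  ip="$(getent hosts "$fqdn" | awk '{print $1; exit}')"
  if [ -z "$ip" ]; then
    echo "forward lookup failed for $fqdn"
    return 1
  fi
  name="$(getent hosts "$ip" | awk '{print $2; exit}')"
  if [ "$name" != "$fqdn" ]; then
    echo "reverse lookup of $ip returned '${name:-nothing}', expected $fqdn"
    return 1
  fi
  echo "$fqdn <-> $ip"
}
```

Run check_dns <hadoop_host_FQDN> on each SAP application server for every Hadoop host that the SAP system connects to.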

Hive parameters

Two configuration parameters of the Hive service must be set in the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml:

hive.exec.dynamic.partition = true
hive.exec.dynamic.partition.mode = nonstrict

NOTE: If Simba JDBC drivers are used, these parameters can instead be set in the SAP system for its own session only, with no global impact on the cluster.
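
Whether the two parameters are effective can be checked by asking HiveServer2 for their current values. The sketch below assumes the beeline client is installed on the host where it runs. The function name check_hive_dynpart is illustrative, and the JDBC URL is supplied by the caller (on a Kerberized cluster it also needs the Hive service principal).

```shell
# check_hive_dynpart JDBC_URL
# Queries the effective values of both parameters through beeline and
# succeeds only if they match the values required above.
check_hive_dynpart() {
  local url="$1" out
  out="$(beeline -u "$url" --silent=true --showHeader=false --outputformat=tsv2 \
          -e 'SET hive.exec.dynamic.partition; SET hive.exec.dynamic.partition.mode;')" || return 1
  printf '%s\n' "$out" | grep -qF 'hive.exec.dynamic.partition=true' &&
    printf '%s\n' "$out" | grep -qF 'hive.exec.dynamic.partition.mode=nonstrict'
}
```

For example, check_hive_dynpart "jdbc:hive2://<hiveserver2_host>:10000/default" returns success only when both values are in place.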

Hadoop technical user

We recommend creating distinct technical users for every SAP system connected to the Hadoop cluster to isolate the system's data.
There is usually a central repository for Hadoop users (LDAP/AD), but users can also be created locally (on every Hadoop cluster node).
The recommended naming convention mirrors the SAP <sid>adm users: <sid>hdp.

If the Sentry service is used for authorization management, each Hadoop technical user should have its own dedicated group, because Sentry assigns access roles to groups, not to individual users.

For illustration purposes, the rest of this document uses the Hadoop user dvqhdp (with group dvqhdp).

If Kerberos is used

Create a Kerberos principal in the form <sid>hdp@<KERBEROS_REALM>. This can be a principal created in MIT Kerberos or FreeIPA, or an Active Directory user.
To export a Kerberos keytab from Active Directory, use the following command:

ktpass /princ dvqhdp@HADOOP.LOCAL /pass badpassword1 /ptype KRB5_NT_PRINCIPAL /out DVQ.keytab
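
Before the keytab is handed over to the SAP system, it is worth confirming that it contains the expected principal and can obtain a ticket. The following sketch assumes the MIT Kerberos client tools klist and kinit are installed and that /etc/krb5.conf points to the correct realm. The function name verify_keytab is illustrative.

```shell
# verify_keytab KEYTAB PRINCIPAL
# 1. Checks that the keytab file is readable.
# 2. Checks that the keytab contains an entry for the principal.
# 3. Requests a ticket with the keytab and lists the resulting cache.
verify_keytab() {
  local keytab="$1" principal="$2"
  if [ ! -r "$keytab" ]; then
    echo "keytab $keytab is not readable" >&2
    return 1
  fi
  if ! klist -kt "$keytab" | grep -qF "$principal"; then
    echo "no entry for $principal in $keytab" >&2
    return 1
  fi
  kinit -kt "$keytab" "$principal" && klist
}
```

Example call: verify_keytab DVQ.keytab dvqhdp@HADOOP.LOCAL. Note that kinit replaces the contents of the calling user's default credential cache, so run the check under a user whose existing tickets do not matter.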

HDFS landing zone

A landing zone, typically a home folder of the technical user, needs to be created on HDFS. 

The directory needs to meet the following conditions:

  • The technical user needs to be able to read and write to this directory and all subdirectories.

  • Impala and Hive users need to be able to read and write to this directory and all subdirectories.

NOTE: If Kerberos is used, the Impala and Hive runtime keytabs are stored by Cloudera Manager under /var/run/cloudera-scm-agent/process/, or under /etc/security/keytabs in the Hortonworks distribution.

Example landing zone creation:

## Create a home directory
[root@skbtscck21 ~]# hadoop fs -mkdir -p /user/dvqhdp/.Trash

## Set ownership and permissions
[root@skbtscck21 ~]# hadoop fs -chown -R dvqhdp:dvqhdp /user/dvqhdp
[root@skbtscck21 ~]# hadoop fs -chmod -R 770 /user/dvqhdp

## Set ACL to grant access to Hive group (by default containing Hive and Impala user)
[root@skbtscck21 ~]# hadoop fs -setfacl -m default:group:hive:rwx /user/dvqhdp

## Check the directory
[root@skbtscck21 ~]# hadoop fs -ls -d /user/dvqhdp
drwxrwx---+ - dvqhdp dvqhdp 0 2017-03-22 14:45 /user/dvqhdp

HDFS parameters dfs.permissions.enabled and dfs.namenode.acls.enabled have to be set to true in hdfs-site.xml if HDFS POSIX-like permissions and HDFS ACLs are to be applied in the absence of Sentry/Ranger policy.
HDFS ACL support is not enabled in the default configuration.
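
The two hdfs-site.xml parameters can be read back with hdfs getconf. This is a minimal sketch with an illustrative function name, check_hdfs_acl_flags. It assumes the hdfs client is on the PATH. Note that hdfs getconf reads the configuration files available on the host where it runs, so it should be executed on a node that carries the cluster's deployed configuration.

```shell
# check_hdfs_acl_flags
# Prints the configured value of both parameters and returns non-zero
# unless both are set to "true".
check_hdfs_acl_flags() {
  local key value rc=0
  for key in dfs.permissions.enabled dfs.namenode.acls.enabled; do
    value="$(hdfs getconf -confKey "$key" 2>/dev/null)"
    echo "$key = ${value:-<unset>}"
    [ "$value" = "true" ] || rc=1
  done
  return "$rc"
}
```

Once both values are true, the ACL entries on the landing zone can be inspected with hadoop fs -getfacl /user/dvqhdp.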

 

Hive database

We recommend creating a dedicated database (schema) in Hive for each SAP system. The recommended database name is sap<sid> (e.g.: sapdvq).
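
The naming conventions recommended in this document all derive from the SAP system ID, which the following sketch makes explicit. Both function names (sap_hadoop_names, create_hive_db_sql) are illustrative helpers and not part of any product. The generated statement is plain HiveQL and still has to be executed by a user who is allowed to create databases.

```shell
# sap_hadoop_names SID
# Prints the names recommended in this document for a given SAP system ID.
sap_hadoop_names() {
  local sid
  sid="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  echo "user=${sid}hdp"
  echo "group=${sid}hdp"
  echo "home=/user/${sid}hdp"
  echo "database=sap${sid}"
}

# create_hive_db_sql SID
# Prints the HiveQL statement that creates the dedicated database.
create_hive_db_sql() {
  local sid
  sid="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  echo "CREATE DATABASE IF NOT EXISTS sap${sid};"
}
```

For the example system DVQ, the statement printed by create_hive_db_sql DVQ can be passed to beeline with the -e option.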

Access privileges on the Hadoop side

In a production Hadoop cluster, depending on the Hadoop distribution, either the Sentry service or the Ranger service manages users' privileges on Hadoop resources.
For Storage Management to work properly, the Hadoop technical user <sid>hdp needs access to at least the following two resources:

  1. User's HDFS home directory - typically /user/<sid>hdp

  2. Assigned Hive database - typically sap<sid>

To set up necessary policies in the respective security service, follow the instructions below.

If Sentry is used

The Sentry service manages access to Hadoop resources using Sentry rules. Rules are created for a role, and a role can be assigned to one or more user groups (never to users directly).
We typically set up one-to-one role ↔ group relations only. You need to set up two rules: one granting ALL actions on the HDFS directory (it is automatically translated to a URI) and one granting ALL actions on the Hive database.

Example: