(SM-2308) Troubleshooting

Purpose

Storage Management provides an option to check the storage configuration. If the check fails, the troubleshooting steps in this article can help identify the cause.

If you receive a specific error message, some examples can be found here.

Introduction

When a Glue or Outboard job fails, the first hint can usually be found in the job log. The job log can be accessed via t-code SM37. After filtering for the failed job, select it and click the Job log button.
The log usually shows at which stage the job failed.

This can be:

  1. Failed on data transfer (HADOOP storage type)
  2. Failed on commit - COPY_DATA_FROM_TMPTAB_TO_MSTAB (SM_TRS_MS storage type)
  3. other errors (no license, etc.)

Based on this information, the root cause can be narrowed down to a specific area.

Another quick check can be performed via t-code /DVD/SM_SETUP. In the transaction, select the affected storage and click the Check Storage button. Based on the output, follow the troubleshooting steps for the specific storage type.

HADOOP storage type

The HADOOP storage type is responsible for communication with the Hadoop distributed file system (HDFS). If the storage check fails for a storage of this type, follow these troubleshooting steps.

HTTP RFC destination

One of the core components is the SAP standard RFC. The RFC is a good start, as it can help us narrow down the troubleshooting area.

  1. To identify the RFC that is used, open transaction /DVD/SM_SETUP, and double-click the failed storage to display the details.
  2. Go to transaction SM59 and open the RFC destination (type HTTP Connections to External Server). Click the Connection Test button.
  3. The expected correct response is a pop-up asking for login information. This means that the connection to HttpFS or WebHDFS service was successful but authentication is required. 
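A comparable check can be run from the operating system of the SAP application server. The following is only a sketch under stated assumptions: `interpret_http_status` is a hypothetical helper, not part of Storage Management, and it assumes that the login pop-up in SM59 corresponds to an HTTP 401 response from HttpFS or WebHDFS. The host, port, and URL in the commented call are placeholders.

```shell
# Hypothetical helper: map an HTTP status code returned by HttpFS/WebHDFS
# to a rough diagnosis. The mapping is an assumption made for illustration.
interpret_http_status() {
    case "$1" in
        401)    echo "service reachable, authentication required (expected)" ;;
        2??)    echo "service reachable, no authentication requested" ;;
        000|"") echo "no HTTP response: check hostname, port, and SSL settings" ;;
        *)      echo "unexpected status $1: check the dev_icm log" ;;
    esac
}

# Example call (placeholders must be replaced before use):
#   code=$(curl -s -o /dev/null -w '%{http_code}' \
#          "http://<host>:<service_port>/webhdfs/v1/?op=GETFILESTATUS")
#   interpret_http_status "$code"
```

curl reports the code 000 when it receives no HTTP response at all, which is why that value is treated as a connectivity problem rather than a service answer.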

Possible issues

If the connection test is unsuccessful, these are some of the possible causes. SM59 usually returns very generic errors; for more details, check the dev_icm log in transaction ST11. Some examples can be found here.

Possible issue: HttpFS node hostname cannot be resolved
What to do: Hadoop host resolution can fail for many reasons. To be on the safe side, add the Hadoop IP ↔ hostname pairs to the /etc/hosts file of each SAP application server (on Windows: C:\Windows\System32\drivers\etc\hosts).

Possible issue: HttpFS/WebHDFS service port is no longer reachable
What to do: Check the availability of the Hadoop service from the SAP application server OS using telnet <host> <service_port>. A network team unaware that these ports need to stay open may have closed them for security hardening purposes, or the Hadoop cluster may simply be down.
If WebHDFS is used, the datanode service (port 1022) also needs to be reachable on all HDFS datanodes. There is an unresolved issue with specific SAP kernel versions and host architectures that causes redirects to fail (SAP ignores 307 redirects from WebHDFS to the datanode). In this case, the HttpFS service has to be used to avoid redirects.

Possible issue: HttpFS service has failed over to an alternate host
What to do: Check in the Hadoop cluster manager (Cloudera Manager / Ambari) that the service (HttpFS/WebHDFS) is still running on the host defined in the RFC destination.
An HA RFC destination can be defined in the storage configuration as a fallback in case the connection attempt through the primary RFC returns an error.

Possible issue: SSL on HttpFS is active, but the RFC is not set as SSL active
What to do: Change the settings in the Logon & Security tab in SM59 to SSL active.

Possible issue: HttpFS service is SSL secured and the RFC is set to use SSL, but certificates are missing in STRUST
What to do: Add the required certificates to STRUST. Details can be found in the Hadoop Storage setup documentation.

Possible issue: RFC is set to SSL active, but the HttpFS service is not SSL secured
What to do: Disable SSL for the RFC.

Possible issue: HTTP/HTTPS service is not active
What to do: Check the transaction SMICM → Goto → Services. Make sure that HTTP (HTTPS if SSL is used) has a port number filled in and is active.
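For the hostname resolution case, it can help to confirm what a hosts file actually contains for a given Hadoop node. The sketch below is illustrative only: `hosts_file_ip` is a hypothetical helper, and the hostnames used with it are placeholders. It prints the IP address that a hosts-format file maps to a hostname, ignoring comments and letter case.

```shell
# Hypothetical helper: print the IP address that a hosts-format file maps to
# the hostname given as $1. The file defaults to /etc/hosts ($2 overrides it).
hosts_file_ip() {
    awk -v h="$1" '
        { sub(/#.*/, "") }                       # strip comments
        {
            for (i = 2; i <= NF; i++)            # field 1 is the IP address
                if (tolower($i) == tolower(h)) { print $1; exit }
        }
    ' "${2:-/etc/hosts}"
}
```

An empty result means the file has no entry for that name. On Linux, `getent hosts <host>` additionally shows what the configured resolver order returns, which may differ from the file alone.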

Java connector issue

Storage type HADOOP uses a Java connector only for authentication with Kerberos. If the Hadoop cluster is not kerberized, this section is not relevant.
The following page contains detailed information on possible issues related to Java connector setup.

Kerberos authentication failure

A Kerberos authentication failure is visible in the Java log. To access the Java log, go to transaction /DVD/JCO_MNG, select the Java connector that is used, and click the [Logs] button. Errors are highlighted in red.
Some examples of error messages can be found here. Some Oracle JRE 8 error messages are described at https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/Troubleshooting.html

Possible issues

Possible issue: Incorrect logical paths in /DVD/HDP_CUS_C
What to do: Check table /DVD/HDP_CUS_C. Make sure that the logical paths are correct and point to the correct files on the OS level.

Possible issue: Wrong/expired keytab
What to do: A keytab can become invalid, either after a fixed period of time or after another copy of the keytab was exported from the KDC (the KVNO has increased).
If the file is present in the correct directory, with the correct format and permissions (/sapmnt/<SID>/global/security/dvd_conn/<sid>hdp.keytab), try a manual login with the keytab:
     export KRB5_CONFIG=<path>/krb5.conf && kinit -kt <keytab> <principal> && klist

Possible issue: Wrong principal (case sensitive)
What to do: Make sure that the principal name in /DVD/HDP_CUS_C is correct. It should have a format like user@EXAMPLE.COM. It is case-sensitive. To check the principal name inside the keytab:
     klist -k <path_to_keytab>

Possible issue: Wrong Kerberos config
What to do: Check the contents of the krb5.conf file in the /sapmnt/<SID>/global/security/dvd_conn/ directory and compare them with the krb5.conf valid for the Hadoop cluster. The contents have to match. Make sure that the file contains information about the principal's Kerberos realm and about the Hadoop cluster Kerberos realm.

Possible issue: Port to KDC is not open
What to do: Make sure that port 88 to the KDC is open. If cross-realm authentication is used, port 88 to the KDCs of both realms needs to be open.
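The principal and KVNO checks can be combined by parsing the output of klist -k. The sketch below is an illustration, not part of the product: `keytab_kvno` is a hypothetical helper that assumes the usual MIT Kerberos layout (a numeric KVNO in the first column and the principal in the last column) and that klist is called without the -e option, which would append the encryption type to each line.

```shell
# Hypothetical helper: read `klist -k <keytab>` output on stdin and print the
# highest KVNO stored for the principal given as $1. The comparison is exact
# and case-sensitive. Returns non-zero if the principal is not in the keytab.
keytab_kvno() {
    awk -v p="$1" '
        BEGIN { max = -1 }
        $1 ~ /^[0-9]+$/ && $NF == p { if ($1 + 0 > max) max = $1 + 0 }
        END { if (max >= 0) print max; else exit 1 }
    '
}

# Example call (placeholders must be replaced before use):
#   klist -k <path_to_keytab> | keytab_kvno <principal>
```

A non-zero exit status for a principal that is expected to be present often points to a difference in case or realm. A KVNO lower than the one currently issued by the KDC suggests that a newer keytab has been exported since.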

HDFS permissions

Another possibility is that the Hadoop technical user does not have the correct permissions on HDFS, or that the home/landing folder does not exist. The easiest way to check this is directly on the Hadoop cluster.

To get the path that is used:

  1. Go to SM59 and open the HttpFS/WebHDFS RFC
  2. The HDFS path is visible in the path prefix field, following the /webhdfs/v1/ prefix
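The second step can be sketched as a small helper that strips the WebHDFS prefix from the value of the path prefix field. `hdfs_path_from_prefix` is a hypothetical name, and the path used in the usage note is a placeholder, not a value from any real configuration.

```shell
# Hypothetical helper: derive the HDFS path from the SM59 path prefix by
# removing the leading /webhdfs/v1 part.
hdfs_path_from_prefix() {
    case "$1" in
        /webhdfs/v1/*) printf '/%s\n' "${1#/webhdfs/v1/}" ;;
        /webhdfs/v1)   printf '/\n' ;;
        *) echo "not a WebHDFS path prefix: $1" >&2; return 1 ;;
    esac
}
```

On a cluster node, the result can then be passed to `hdfs dfs -ls -d <path>` while logged in as the Hadoop technical user, which shows whether the folder exists along with its owner and permission bits.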

If HDFS is secured by Kerberos, the connection test in /DVD/SM_SETUP may sometimes show a login pop-up.

This is caused by an incorrect setting of the parameter ict/disable_cookie_urlencoding. It needs to be set to 1 (2 in newer SAP kernel versions).

SM_TRS_MS Storage type

SM_TRS_MS storage type incorporates communication with metastore services. This includes Hive, Impala, and Databricks.

Java connector

This storage type utilizes the Java connector (JCo). In transaction /DVD/SM_SETUP, you can find which RFC and JCo version is used by the storage.

Go to transaction /DVD/JCO_MNG and make sure that the connector is running. In case the connector is not running and does not start, follow the troubleshooting instructions mentioned in the previous section.

Kerberos issues

When the JCo is running, another possible cause is authentication. If the failure appears to be related to authentication, follow the instructions in the Kerberos authentication failure section above.

Other issues

Depending on the Hadoop distribution and the actual configuration, the Java connector may communicate with different Hadoop services.
Typically, however, there is always at least one service that facilitates the manipulation of files stored in HDFS (HttpFS/WebHDFS) and at least one service that provides a database representation of the data stored in HDFS (plus metadata) and accepts SQL queries.

Possible issue: The service has crashed/failed
What to do: While unlikely, it is still possible for a Hadoop service to fail. Check the reason in the cluster manager and restart the service.

Possible issue: The service is no longer reachable on the host specified in the HTTP RFC destination
What to do: If a service failover happens in a High Availability scenario and there is no load balancer/proxy (e.g. Zookeeper), the HTTP RFC needs to be reconfigured to point to the new Hadoop host where the service is running.
This can be circumvented by configuring the HA alternative host in the Storage Management setup.

Possible issue: The service is no longer reachable on its port
What to do: Make sure that the Hadoop service is running on the correct host.
Try telnet from the SAP application server to the respective host/port.
If the port is unreachable, there is a high probability that a change in the network/firewall settings has disabled the connectivity.

Other troubleshooting steps

  • Is Hadoop service reachable at designated ports? (Hive 10000, HttpFS 14000, WebHDFS 50070, Impala 21050) telnet <hadoop_host_FQDN> <port_number>.
  • Is the Java connector process running? Transaction: /DVD/JCO_MNG; Start/Restart; ps -ef | grep java.
  • Is Java connector registered on SAP Gateway? Transaction: SMGW > Connected clients.
  • Is the Hadoop HDFS storage check ok? /DVD/SM_SETUP.
  • Is RFC Destination working properly? SM59 > Connection test.
  • Is ICM running? Transaction: SMICM.
  • If Kerberos is used for authentication, are the necessary files in place? Kerberos config, user's keytab, JAAS config.
    • Is the Kerberos keytab version number still valid (i.e. is it possible that a new keytab was created with higher kvno, rendering the previous keytab invalid)?
  • Are Hadoop permissions correctly set? HDFS permissions, Ranger/Sentry rules for HDFS user directory, and Hive database.
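
The port check from the first item can be prepared for several services at once. The sketch below only prints the telnet commands to run; it takes the port numbers from the list above, which are defaults and may differ on a given cluster. `default_port` and `print_port_checks` are hypothetical helper names, and the hostname in the usage note is a placeholder.

```shell
# Default ports as listed in this article; a cluster may use different ones.
default_port() {
    case "$1" in
        hive)    echo 10000 ;;
        httpfs)  echo 14000 ;;
        webhdfs) echo 50070 ;;
        impala)  echo 21050 ;;
        *) echo "unknown service: $1" >&2; return 1 ;;
    esac
}

# Print one telnet command per requested service for the host given as $1.
# Unknown service names are reported on stderr and skipped.
print_port_checks() {
    host=$1
    shift
    for svc in "$@"; do
        port=$(default_port "$svc") || continue
        echo "telnet $host $port"
    done
}
```

For example, `print_port_checks hdp01.example.com hive httpfs` prints the two telnet commands for the Hive and HttpFS ports on that host.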