(SM-2302) Troubleshooting
Purpose
Storage Management provides the option to check the storage configuration. If the check is unsuccessful, this article lists the troubleshooting steps which may be helpful in the identification of the cause.
If you receive a specific error message, some examples can be found here.
Introduction
When a Glue or Outboard job fails, the first hint can usually be found in the job log. The job log can be accessed via t-code SM37. After filtering for the failed job, select it and click the Job log button.
The log usually shows at which stage the job failed.
This can be:
- Failed on data transfer (HADOOP storage type)
- Failed on commit - COPY_DATA_FROM_TMPTAB_TO_MSTAB (SM_TRS_MS storage type)
- Other errors (no license, etc.)
Based on this information, we can narrow down the root cause to a specific area.
Another quick check can be performed via t-code /DVD/SM_SETUP. In the transaction, select the affected storage and click the Check Storage button. Based on the output, follow the troubleshooting steps for the specific storage type.
HADOOP storage type
The HADOOP storage type is responsible for communication with the Hadoop Distributed File System (HDFS). If Check Storage fails on this storage, follow these troubleshooting steps.
HTTP RFC destination
One of the core components is the SAP standard RFC destination. The RFC is a good place to start, as it helps narrow down the troubleshooting area.
- To identify the RFC that is used, open transaction /DVD/SM_SETUP, and double-click the failed storage to display the details.
- Go to transaction SM59 and open the RFC (type HTTP Connections to External Server). Click the Connection Test button.
- The expected correct response is a pop-up asking for login information. This means that the connection to HttpFS or WebHDFS service was successful but authentication is required.
Possible issues
If the connection test is unsuccessful, the following are some of the possible causes. SM59 usually returns only generic errors; for more details, check the dev_icm log in transaction ST11. Some examples can be found here.
Possible issue | What to do |
---|---|
HttpFS node hostname cannot be resolved | There can be many reasons why Hadoop host name resolution fails. To be on the safe side, add the Hadoop IP-to-hostname pairs to the /etc/hosts file of each SAP application server (on Windows: C:\Windows\System32\drivers\etc\hosts). |
HttpFS/WebHDFS service port is no longer reachable | Check the availability of the Hadoop service from the SAP application server OS using telnet <host> <service_port>. If WebHDFS is used, the datanode service (port 1022) also needs to be reachable on all HDFS datanodes. There is an unresolved issue with specific SAP kernel versions and host architectures that causes redirects to fail (SAP ignores 307 redirects from WebHDFS to the datanode). In this case, use the HttpFS service to avoid redirects. |
HttpFS service has failed over to an alternate host | Check in the Hadoop cluster manager (Cloudera Manager / Ambari) that the service (HttpFS/WebHDFS) is still running on the host defined in the RFC destination. |
SSL on HttpFS is active, but RFC is not set as SSL active | Change settings in the Logon & Security tab in SM59 to be SSL active. |
HttpFS service is SSL secured, RFC is set to use SSL, but there are missing certificates in STRUST | Add required certificates to STRUST. Details can be found in the Hadoop Storage setup documentation. |
RFC is set to SSL active, but the HttpFS service is not SSL secured | Disable SSL for the RFC. |
HTTP/HTTPS service is not active | Check the transaction SMICM → Goto → Services. Make sure that HTTP (HTTPS if SSL is used) has a port number filled and is active. |
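
The host resolution and port checks from the first two rows of the table can also be run without telnet. The following is a minimal Python sketch, not part of Storage Management; the function names resolve_host and port_is_open are assumptions made for illustration. Run it on the SAP application server OS with the host and port taken from the RFC destination.

```python
import socket


def resolve_host(hostname):
    """Return the sorted IP addresses the OS resolver gives for hostname, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Resolution failed: neither DNS nor the hosts file knows this name.
        return []
    return sorted({info[4][0] for info in infos})


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False
```

An empty list from resolve_host points to the name resolution row. A resolved host with port_is_open returning False points to the port, failover, or firewall rows. If WebHDFS is used, the same port check can be repeated for each datanode.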
Java connector issue
Storage type HADOOP uses a Java connector only for authentication with Kerberos. If the Hadoop cluster is not kerberized, this section is not relevant.
The following page contains detailed information on possible issues related to Java connector setup.
Kerberos authentication failure
Kerberos authentication failures are reported in the Java log. To access the Java log, go to transaction /DVD/JCO_MNG, select the Java connector that is used, and click the [Logs] button. Errors are highlighted in red.
Some examples of error messages can be found here. Some Oracle JRE 8 error messages are explained at https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/Troubleshooting.html
Possible issues
Possible issue | What to do |
---|---|
Incorrect logical paths in /DVD/HDP_CUS_C | Check table /DVD/HDP_CUS_C. Make sure that logical paths are correct and point to the correct files on the OS level. |
Wrong/expired keytab | A keytab can become invalid, either after a fixed period of time or after another copy of the keytab was exported from the KDC (the KVNO has increased). If the file is present in the correct directory, with the correct format and permissions (/sapmnt/<SID>/global/security/dvd_conn/<sid>hdp.keytab), try a manual login with the keytab, for example: kinit -kt <path_to_keytab> <principal> |
Wrong principal (case sensitive) | Make sure that the principal name in /DVD/HDP_CUS_C is correct. It should have a format like user@EXAMPLE.COM and it is case-sensitive. To check the principal name inside the keytab: klist -k <path_to_keytab> |
Wrong Kerberos config | Check the contents of krb5.conf file in /sapmnt/<SID>/global/security/dvd_conn/ directory and compare them with krb5.conf valid for the Hadoop cluster. The contents have to match. Make sure that there is information about the principal's Kerberos realm and about the Hadoop cluster Kerberos realm. |
Port to KDC is not open | Make sure that port 88 to KDC is open. If cross-realm authentication is used, port 88 to KDCs of both realms needs to be open. |
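
The principal check from the table can be sketched in Python as follows. This is an illustrative sketch only: the function names, the sample keytab path, and the sample output are assumptions, and the parser expects the plain two-column layout printed by MIT Kerberos klist -k (KVNO followed by principal). Other klist variants or additional flags may print a different layout.

```python
import re

# Matches data rows of `klist -k` output: "<KVNO> <principal>".
_ENTRY = re.compile(r"^\s*(\d+)\s+(\S+)\s*$")


def parse_klist(output):
    """Return a list of (kvno, principal) tuples found in `klist -k` output."""
    entries = []
    for line in output.splitlines():
        match = _ENTRY.match(line)
        if match:
            entries.append((int(match.group(1)), match.group(2)))
    return entries


def check_principal(configured, entries):
    """Compare the configured principal with the keytab entries, case-sensitively."""
    principals = [principal for _, principal in entries]
    if configured in principals:
        return "match"
    if configured.lower() in (p.lower() for p in principals):
        return "case mismatch"
    return "not found"


# Hypothetical sample output; paste the real output of klist -k <path_to_keytab> here.
SAMPLE = """Keytab name: FILE:/tmp/sample.keytab
KVNO Principal
---- ------------------------------------------
   3 user@EXAMPLE.COM
"""

if __name__ == "__main__":
    entries = parse_klist(SAMPLE)
    print(entries)
    print(check_principal("user@example.com", entries))
```

A "case mismatch" result means that the principal exists in the keytab but is spelled with different letter case in /DVD/HDP_CUS_C. The parsed KVNO can be compared with the current key version on the KDC to detect an outdated keytab.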
HDFS permissions
Another possibility is that the Hadoop technical user does not have the correct permissions on HDFS, or that the home/landing folder does not exist. The easiest way to check this is directly on the Hadoop cluster.
To get the path that is used:
- Go to SM59 and open the HttpFS/WebHDFS RFC
- The HDFS path is visible in the path prefix field, following the /webhdfs/v1/ prefix
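The steps above can be sketched in Python: the helper below strips the /webhdfs/v1 prefix from the RFC path prefix and builds an hdfs dfs -ls -d command that shows the owner, group, and permissions of the folder when run on a Hadoop cluster node. The function names and the example path are hypothetical and serve only as an illustration.

```python
import shlex

WEBHDFS_PREFIX = "/webhdfs/v1"


def hdfs_path_from_prefix(path_prefix):
    """Return the HDFS path that follows /webhdfs/v1 in an RFC path prefix."""
    if not path_prefix.startswith(WEBHDFS_PREFIX):
        raise ValueError("path prefix does not start with " + WEBHDFS_PREFIX)
    remainder = path_prefix[len(WEBHDFS_PREFIX):]
    if remainder and not remainder.startswith("/"):
        raise ValueError("unexpected characters after " + WEBHDFS_PREFIX)
    # Normalize: exactly one leading slash, no trailing slash.
    return "/" + remainder.strip("/")


def ls_command(hdfs_path):
    """Build the HDFS CLI call that lists the directory entry itself, not its contents."""
    return ["hdfs", "dfs", "-ls", "-d", hdfs_path]


if __name__ == "__main__":
    # Hypothetical path prefix; copy the real value from the RFC destination in SM59.
    path = hdfs_path_from_prefix("/webhdfs/v1/user/sapuser/landing/")
    print(shlex.join(ls_command(path)))
```

If the command reports that the path does not exist, the home/landing folder is missing. Otherwise, compare the listed owner and permissions with the Hadoop technical user.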
Parameter ict/disable_cookie_urlencoding is incorrectly set
If HDFS is secured by Kerberos and you execute the connection test in /DVD/SM_SETUP, you may get a login pop-up.
This is caused by an incorrect value of the parameter ict/disable_cookie_urlencoding. It needs to be set to 1 (2 in newer SAP kernel versions).
SM_TRS_MS storage type
SM_TRS_MS storage type incorporates communication with metastore services. This includes Hive, Impala, and Databricks.
Java connector
This storage type utilizes the Java connector. In transaction /DVD/SM_SETUP you can find which RFC and JCo version are used by the storage.
Go to transaction /DVD/JCO_MNG and make sure that the connector is running. If the connector is not running and does not start, follow the troubleshooting instructions in the Java connector issue section above.
Kerberos issues
When the JCo is running, another possible issue is authentication. If the error in the JCo log seems to be connected to authentication, follow the instructions in the Kerberos authentication failure section above.
Other issues
Depending on the Hadoop distribution and the actual configuration, the Java connector may communicate with different Hadoop services.
Typically, there is at least one service that handles the files stored in HDFS (HttpFS/WebHDFS) and at least one service that presents the data stored in HDFS (plus metadata) as a database and accepts SQL queries.
Possible issue | What to do |
---|---|
The service has crashed/failed | While unlikely, it is still possible that a Hadoop service fails. Check the reason in the cluster manager and restart the service. |
The service is no longer reachable on the host specified in HTTP RFC Destination | If a service failover happens in a High Availability scenario and there is no load balancer/proxy (e.g. Zookeeper), the HTTP RFC needs to be reconfigured to point to the new Hadoop host where the service is running. This can be avoided by configuring the HA alternative host in the Storage Management setup. |
The service is no longer reachable on its port | Make sure that the Hadoop service is running on the correct host. |
Other troubleshooting steps
- Are the Hadoop services reachable on their designated ports? (Hive 10000, HttpFS 14000, WebHDFS 50070, Impala 21050) telnet <hadoop_host_FQDN> <port_number>.
- Is the Java connector process running? Transaction: /DVD/JCO_MNG; Start/Restart; ps -ef | grep java.
- Is the Java connector registered on the SAP Gateway? Transaction: SMGW > Connected clients.
- Is the Hadoop HDFS storage check OK? /DVD/SM_SETUP.
- Is the RFC destination working properly? SM59 > Connection test.
- Is ICM running? Transaction: SMICM.
- If Kerberos is used for authentication, are the necessary files in place? Kerberos config, user's keytab, JAAS config.
- Is the Kerberos keytab version number still valid (i.e. is it possible that a new keytab was created with higher kvno, rendering the previous keytab invalid)?
- Are Hadoop permissions correctly set? HDFS permissions, Ranger/Sentry rules for HDFS user directory, and Hive database.
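The check for the Kerberos files in the list above can be sketched in Python. This is an assumption-laden illustration: the function name check_files and the SID "ABC" are hypothetical, and the directory follows the /sapmnt/<SID>/global/security/dvd_conn/ location mentioned earlier. Run it as the OS user that runs the Java connector, since readability depends on the user.

```python
import os


def check_files(paths):
    """Map each path to 'ok', 'missing', or 'not readable' for the current OS user."""
    result = {}
    for path in paths:
        if not os.path.isfile(path):
            result[path] = "missing"
        elif not os.access(path, os.R_OK):
            result[path] = "not readable"
        else:
            result[path] = "ok"
    return result


if __name__ == "__main__":
    # Hypothetical SID "ABC"; replace it with the real system ID.
    base = "/sapmnt/ABC/global/security/dvd_conn"
    paths = [
        os.path.join(base, "krb5.conf"),
        os.path.join(base, "abchdp.keytab"),
        # Add the JAAS config path that is configured in /DVD/HDP_CUS_C.
    ]
    for path, status in check_files(paths).items():
        print(status, path)
```

Any file reported as missing or not readable should be compared with the logical paths maintained in /DVD/HDP_CUS_C.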