(SM 2008) Troubleshooting

Purpose

Datavard Storage Management provides the option to check the storage configuration. If the check is unsuccessful, this article lists troubleshooting steps that may help identify the cause.

If you receive a specific error message, some examples can be found here.

Introduction

When a Glue or Outboard job fails, the first hint can usually be found in the job log. The job log can be accessed in transaction SM37: after filtering for the failed job, select it and click the Job log button.

The log usually indicates at which stage the job failed.

This can be:

  1. Failure during data transfer (HADOOP storage type)
  2. Failure on commit - COPY_DATA_FROM_TMPTAB_TO_MSTAB (SM_TRS_MS storage type)
  3. Other errors (no license, ...)

Based on this information, we can narrow down the root cause to the specific area.


Another quick check can be performed in transaction /dvd/sm_setup. In the transaction, select a storage and click the Check Storage button. Based on the output, follow the troubleshooting steps for the specific storage type.

Hadoop storage type

The Hadoop storage type is responsible for communication with the Hadoop distributed file system (HDFS). If Check Storage fails for this storage, follow these troubleshooting steps.

HTTP RFC destination

One of the core components is a standard SAP HTTP RFC destination. The RFC is a good starting point, as testing it helps narrow down the troubleshooting area.

  1. To identify the RFC that is used, open transaction /dvd/sm_setup and double-click the failed storage to display its details.
  2. Go to transaction SM59 and open the RFC (type HTTP Connections to External Server). Click the Connection Test button.
  3. The expected response is a pop-up asking for login information. This means that the connection to the HttpFS or WebHDFS service was successful but authentication is required.
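The same connectivity check can be performed directly from the SAP application server OS, bypassing SAP's HTTP layer entirely. This is a sketch; the hostname and port below are placeholders for your actual HttpFS/WebHDFS endpoint:

```shell
# Query the HttpFS/WebHDFS REST API directly; host and port are illustrative.
# An HTTP 401 response mirrors the SM59 login pop-up: connectivity is fine,
# only authentication is missing. A timeout or connection refusal points
# to a network/firewall or service problem instead.
curl -i "http://hadoop-httpfs.example.com:14000/webhdfs/v1/?op=LISTSTATUS"
```

If curl succeeds here but the SM59 test fails, the problem is on the SAP side (ICM, SSL settings, proxy); if both fail, investigate the network and the Hadoop service first.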

Possible issues

If the connection test is unsuccessful, consider the possibilities below. SM59 usually returns very generic errors; for more details it is necessary to check the dev_icm log in transaction ST11. Some examples can be found here.

  • HttpFS node hostname cannot be resolved: There can be a multitude of reasons why Hadoop host resolution fails. To be on the safe side, Hadoop IP↔host pairs can be added to the /etc/hosts file of each SAP application server (on Windows: C:\Windows\System32\drivers\etc\hosts).
  • HttpFS/WebHDFS service port is no longer reachable: Check the availability of the Hadoop service from the SAP application server OS using 'telnet <host> <service_port>'. It is possible that a network team, unaware that these ports need to stay open, closed them for security hardening, or that the Hadoop cluster is simply down. If WebHDFS is used, the datanode service (port 1022) also needs to be reachable on all HDFS slave nodes. There is an unresolved issue with specific SAP kernel versions and host architectures that causes failures in redirects (SAP ignores the 307 redirect from WebHDFS to the datanode). In this case, the HttpFS service has to be used to avoid redirects.
  • HttpFS service has failed over to an alternate host: Check in the Hadoop cluster manager (Cloudera Manager / Ambari) that the service (HttpFS/WebHDFS) is still running on the host defined in the RFC destination. It is possible to define an HA RFC destination in the storage configuration to be used as a fallback if the primary RFC returns an error on a connection attempt.
  • SSL on HttpFS is active, but the RFC is not set as SSL-active: Change the settings on the Logon & Security tab in SM59 to SSL-active.
  • HttpFS service is SSL-secured and the RFC is set to use SSL, but certificates are missing in STRUST: Add the required certificates in STRUST. Details can be found in the Hadoop storage setup documentation.
  • RFC is set to SSL-active, but the HttpFS service is not SSL-secured: Disable SSL for the RFC.
  • HTTP/HTTPS service is not active: Check transaction SMICM → Goto → Services. Make sure that HTTP (HTTPS if SSL is used) has a port number filled in and is active.
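As a sketch of the host-resolution workaround mentioned above, the entries added to each application server's /etc/hosts file could look like this (IP addresses and hostnames are illustrative only):

```
# /etc/hosts on each SAP application server - illustrative values
10.0.0.11   hadoop-nn1.example.com      hadoop-nn1
10.0.0.12   hadoop-httpfs.example.com   hadoop-httpfs
```

Keep these entries in sync with the cluster; a stale entry after a failover causes the "failed over to an alternate host" symptom described above.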


Java connector issue

The HADOOP storage type uses the Java connector only for authentication with Kerberos. If the Hadoop cluster is not kerberized, this section is not relevant.

To check if Java connector is running:

  1. Open transaction /DVD/JCO_MNG.
  2. On the left side of the screen, select the connector that is being used. Usually, for HADOOP storage the Java connector is connected using the DATAVARD_JAVA_CONN RFC.
  3. The status of the connector for the specific application server can be seen on the right side of the screen.
  4. If the connector is not running, click the Restart button.

Possible issues

If the Java connector doesn't start, consider the issues below. Even when it doesn't start, some information can often be found in the logs; click the Logs button to check them.

  • Wrong settings: Double-check the setup on the Config and Dependencies tabs in /dvd/jco_mng.
  • Install directory doesn't exist: The latest versions of Storage Management create the installation directory on their own. If you have an older version (<1903), make sure the directory exists.
  • libsapjco3.so is not in $LD_LIBRARY_PATH: Make sure that libsapjco3.so is in the $LD_LIBRARY_PATH of the <SID>adm user. Keep in mind that $LD_LIBRARY_PATH is only updated after an application server restart.
  • RFC user does not exist or is locked: Create/unlock the user in transaction SU01.
  • RFC user has the wrong user type (not a 'Communication Data' user): Correct the user type in transaction SU01.
  • RFC user is missing the role enabling RFC communication: Correct the user's role in transaction PFCG.
  • Java runtime environment is lower than 1.7: The Datavard Java connector supports only releases >= 1.7. Update the JRE.
  • RFC and Java connector have different program IDs: Check the program ID on the Config tab in /dvd/jco_mng. Make sure that the RFC connecting to this Java connector uses the same program ID.
  • ACLs on the SAP gateway don't allow registration of the Java connector: Check the ACLs in transaction SMGW. Make sure that the program ID used in the setup is allowed to register on the gateway.
  • None of the above: Try to manually start the JVM in debug mode.

Examples:

Oracle Java

export WORKDIR=/usr/sap/<SID>/<instance_DIR>/work/dvd_conn/jco204
/usr/bin/java -Xmx2G \
-Djava.security.debug=gssloginconfig,configfile,configparser,logincontext \
-Dsun.security.jgss.debug=true \
-Dsun.security.krb5.debug=true \
-Dsun.security.spnego.debug=true \
-Dlog4j.configurationFile=${WORKDIR}/log4j.xml \
-jar ${WORKDIR}/dvd_auth_jco.jar \
-conf ${WORKDIR}/auth_conn.jcoserver \
-confDest ${WORKDIR}/auth_conn_as.jcoDestination 2>&1

IBM Java

/usr/bin/java -Xmx2G \
-Djava.security.debug=gssloginconfig,configfile,configparser,logincontext \
-Dcom.ibm.security.jgss.debug=all \
-Dcom.ibm.security.krb5.Krb5Debug=all \
-Djava.security.auth.login.config=/sapmnt/D06/global/security/dvd_conn/jaas.config \
-Dlog4j.configurationFile=/usr/sap/D06/DVEBMGS00/work/dvd_conn/log4j.xml \
-jar /usr/sap/D06/DVEBMGS00/work/dvd_conn/dvd_java_connector.jar \
-conf /usr/sap/D06/DVEBMGS00/work/dvd_conn/config.jcoServer \
-confDest /usr/sap/D06/DVEBMGS00/work/dvd_conn/config_as.jcoDestination \
-log /sapmnt/D06/global/security/dvd_conn/custom_java.log


Kerberos authentication failure

If this is the issue, it can be found in the Java log. To access the Java log, go to transaction /dvd/jco_mng, select the Java connector that is used and click the Logs button. Errors are highlighted in red.

Some examples of error messages can be found here. Oracle JRE 8 error messages are documented at https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/Troubleshooting.html

Possible issues

  • Incorrect logical paths in /dvd/hdp_cus_c: Check table /dvd/hdp_cus_c. Make sure that the logical paths are correct and point to the correct files at OS level.
  • Wrong/expired keytab: A keytab can expire, either after a fixed period of time or after another copy of the keytab was exported from the KDC (the KVNO has increased). If the file is present in the correct directory, with the correct format and permissions (/sapmnt/<SID>/global/security/dvd_conn/<sid>hdp.keytab), try a manual login with the keytab:
         export KRB5_CONFIG=<path> && kinit -kt <keytab_path> <principal> && klist
  • Wrong principal (case-sensitive): Make sure that the principal name in /dvd/hdp_cus_c is correct. It should have a format like user@EXAMPLE.COM and is case-sensitive. To check the principal name inside the keytab:
         klist -k <path_to_keytab>
  • Wrong Kerberos config: Check the contents of the krb5.conf file in the /sapmnt/<SID>/global/security/dvd_conn/ directory and compare them with the krb5.conf valid for the Hadoop cluster. The contents have to match. Make sure that there is information about the principal's Kerberos realm and about the Hadoop cluster's Kerberos realm.
  • Port to KDC is not open: Make sure that port 88 to the KDC is open. If cross-realm authentication is used, port 88 to the KDCs of both realms needs to be open.
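For comparison, a minimal krb5.conf can look like the sketch below. The realm EXAMPLE.COM and the KDC hostname are assumptions and must be replaced with the values valid for your Hadoop cluster:

```
# /sapmnt/<SID>/global/security/dvd_conn/krb5.conf - illustrative values
[libdefaults]
    default_realm = EXAMPLE.COM

[realms]
    EXAMPLE.COM = {
        kdc = kdc1.example.com:88
        admin_server = kdc1.example.com
    }

[domain_realm]
    .example.com = EXAMPLE.COM
```

In a cross-realm setup, both realms need an entry in the [realms] section, matching the advice on port 88 above.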

HDFS permissions

Another possibility is that the Hadoop technical user doesn't have the correct permissions on HDFS, or that the home/landing folder does not exist. The easiest way to check this is directly on the Hadoop cluster.

To get the path that is used:

  1. Go to SM59 and open the HttpFS/WebHDFS RFC
  2. The HDFS path is visible in the path prefix field, following the /webhdfs/v1/ prefix
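On the cluster, the folder and its permissions can then be inspected with the HDFS CLI. This is a sketch; the path /user/sapuser is a placeholder for the path prefix found in the RFC:

```shell
# List the contents and permissions of the landing folder; path is illustrative.
hdfs dfs -ls /user/sapuser

# Show the folder entry itself (owner, group, permission bits) rather than
# its contents, to verify the technical user can write into it.
hdfs dfs -ls -d /user/sapuser
```

If the folder is missing or owned by a different user, create it or adjust ownership/permissions according to your cluster's security policy (including any Ranger/Sentry rules).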

If HDFS is secured by Kerberos and you execute a connection test in /dvd/sm_setup, you may still get a login pop-up.

This is caused by an incorrectly set profile parameter ict/disable_cookie_urlencoding. It needs to be set to '1' ('2' in newer SAP kernel versions).
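In the instance profile, the setting is a single line like the illustrative excerpt below; whether '1' or '2' applies depends on your kernel version, as noted above:

```
# SAP instance profile - illustrative excerpt
ict/disable_cookie_urlencoding = 1
```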

SM_TRS_MS Storage type

The SM_TRS_MS storage type handles communication with "metastore" services. This includes Hive, Impala, and Databricks.

Java connector

This storage type utilizes the Datavard Java connector. In transaction /dvd/sm_setup you can find which RFC and JCo version are used by the storage.

Go to transaction /dvd/jco_mng and make sure that the connector is running. If the connector is not running and doesn't start, follow the troubleshooting instructions in the previous section.

Kerberos issues

When the JCo connector is running, the next possible point of failure is authentication. If the issue seems to be related to authentication, follow the instructions in the Kerberos authentication failure section above.

Other issues

Depending on the Hadoop distribution and actual configuration, there can be different Hadoop services the Datavard Java connector communicates with.
Typically there is at least one service that facilitates manipulation of files stored in HDFS (HttpFS/WebHDFS), and at least one service that emulates a database representation of the data stored in HDFS (plus metadata) and accepts SQL queries.

  • The service has crashed/failed: While unlikely, it is still possible for a Hadoop service to fail. The reason needs to be checked via the cluster manager and the service restarted.
  • The service is no longer reachable on the host specified in the HTTP RFC destination: If a service failover happens in a High Availability scenario and there is no load balancer/proxy (e.g. ZooKeeper), the HTTP RFC needs to be reconfigured to point to the new Hadoop host where the service is running. This can be circumvented by configuring an HA alternative host in the Storage Management setup.
  • The service is no longer reachable on its port: Make sure that the Hadoop service is running on the correct host and try telnet from the SAP application server to the respective host/port. If it is unreachable, there is a high probability that a change in network/firewall settings disabled the connectivity.


Other troubleshooting steps

  • Is the Hadoop service reachable on the designated ports (Hive 10000, HttpFS 14000, WebHDFS 50070, Impala 21050)? telnet <hadoop_host_FQDN> <port_number>
  • Is the Java connector process running? Transaction /DVD/JCO_MNG (Start/Restart); ps -ef | grep java
  • Is the Java connector registered on the SAP gateway? Transaction SMGW → Connected clients
  • Is the Hadoop HDFS storage check OK? /DVD/SM_SETUP
  • Is the RFC destination working properly? SM59 → Connection test
  • Is ICM running? Transaction SMICM
  • If Kerberos is used for authentication, are the necessary files in place? Kerberos config, user's keytab, JAAS config
    • Is the Kerberos keytab version number still valid (i.e. is it possible that a new keytab was created with a higher KVNO, rendering the previous keytab invalid)?
  • Are Hadoop permissions correctly set? HDFS permissions, Ranger/Sentry rules for the HDFS user directory and Hive database
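The port checks in the list above can also be scripted when telnet is not installed. This sketch relies on bash's built-in /dev/tcp feature; the host and port in the usage line are examples only:

```shell
# Return 0 if a TCP connection to host $1, port $2 succeeds within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so neither telnet nor nc is required.
check_port() {
  timeout 3 bash -c "cat < /dev/null > /dev/tcp/$1/$2" 2>/dev/null
}

# Example usage with illustrative host/port values:
check_port hadoop-hive.example.com 10000 && echo "reachable" || echo "unreachable"
```

Running this for each service port from every SAP application server quickly separates network problems from service-side failures.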