(SM-1911) Troubleshooting - DRAFT
Purpose
Datavard Storage Management provides the option to check the storage configuration. If the check is unsuccessful, this article lists troubleshooting steps that can help identify the cause.
If you receive a specific error message, some examples can be found here.
Introduction
When a Glue or Outboard job fails, the first hint can usually be found in the job log. The job log can be accessed in transaction SM37: filter for the failed job, select it, and click the Job log button.
The log usually tells you at which stage the job failed.
This can be:
- Failed on data transfer (HADOOP storage type)
- Failed on commit, e.g. COPY_DATA_FROM_TMPTAB_TO_MSTAB (SM_TRS_MS storage type)
- Other errors (no license, ...)
Based on this information, we can narrow down the root cause to a specific area.
Another quick check can be performed in transaction /dvd/sm_setup. In the transaction, select a storage and click the Check Storage button. Based on the output, follow the troubleshooting steps for the specific storage type.
Hadoop storage type
The Hadoop storage type is responsible for communication with the Hadoop Distributed File System (HDFS). If the storage check fails for this storage type, follow these troubleshooting steps.
HTTP RFC destination
One of the core components is a standard SAP HTTP RFC destination. The RFC is a good starting point, as it helps narrow down the troubleshooting area.
- To identify the RFC destination that is used, open transaction /dvd/sm_setup and double-click the failed storage to display its details.
- Go to transaction SM59 and open the RFC destination (type HTTP Connections to External Server). Click the Connection Test button.
- The expected response is a pop-up asking for login information. This means that the connection to the HttpFS or WebHDFS service was successful, but authentication is required. The service can also be verified independently of SAP, as shown below.
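To rule out the SAP side entirely, the HttpFS/WebHDFS service can be called directly from the application server OS via the WebHDFS REST API. A minimal sketch, assuming HttpFS on its default port 14000 (host, port, user, and authentication method are placeholders to adjust to your cluster):

```bash
# list the HDFS root via HttpFS (WebHDFS typically listens on port 50070 instead)
curl -i "http://<httpfs_host>:14000/webhdfs/v1/?op=LISTSTATUS&user.name=<hdfs_user>"

# on a Kerberized cluster, obtain a ticket first and let curl use SPNEGO
# (requires curl built with GSS-Negotiate support)
kinit <principal>
curl -i --negotiate -u : "http://<httpfs_host>:14000/webhdfs/v1/?op=LISTSTATUS"
```

A successful call returns HTTP 200 with a JSON list of file statuses; anything else points to connectivity, service, or authentication issues outside SAP.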
Possible issues
If the Connection Test is unsuccessful, these are some of the possibilities. SM59 usually returns very generic errors; for more details, check the dev_icm log in transaction ST11. Some examples can be found here.
Possible issue | What to do |
---|---|
HttpFS node hostname cannot be resolved | There can be many reasons why Hadoop host resolution fails. To be on the safe side, Hadoop IP↔hostname pairs can be added to the /etc/hosts file of each SAP application server (on Windows: C:\Windows\System32\drivers\etc\hosts). See also the sketch after this table. |
HttpFS/WebHDFS service port is no longer reachable | Check the availability of the Hadoop service from the SAP application server OS using 'telnet <host> <service_port>'. If WebHDFS is used, the datanode service (port 1022) also needs to be reachable on all HDFS slave nodes. There is an unresolved issue with specific SAP kernel versions and host architectures that causes failures in redirects (SAP ignores the 307 redirect from WebHDFS to the datanode); in this case the HttpFS service has to be used to avoid redirects. |
HttpFS service has failed over to an alternate host | Check in the Hadoop cluster manager (Cloudera Manager / Ambari) that the service (HttpFS/WebHDFS) is still running on the host defined in the RFC destination. |
SSL on HttpFS is active, but RFC is not set as SSL active | Change the settings on the Logon & Security tab in SM59 to SSL active. |
HttpFS service is SSL secured, RFC is set to use SSL, but certificates are missing in STRUST | Add the required certificates in STRUST. Details can be found in the Hadoop Storage setup documentation. |
RFC is set to SSL active, but HttpFS service is not SSL secured | Disable SSL for the RFC. |
HTTP/HTTPS service is not active | In transaction SMICM, choose Goto → Services. Make sure that HTTP (HTTPS if SSL is used) has a port number filled in and is active. |
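A quick way to separate host resolution problems from port reachability problems is to test both from the SAP application server OS. A rough sketch (hostnames, IP address, and ports are placeholders):

```bash
# does the HttpFS/WebHDFS host resolve on this application server?
getent hosts <httpfs_host>
nslookup <httpfs_host>

# is the service port reachable? (HttpFS 14000, WebHDFS 50070 by default)
telnet <httpfs_host> 14000

# if resolution fails, an /etc/hosts entry can be added, e.g.:
# 10.0.0.10   httpfs01.example.com   httpfs01
```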
Java connector issue
Storage type HADOOP uses the Java connector only for authentication with Kerberos. If the Hadoop cluster is not kerberized, this section is not relevant.
To check if Java connector is running:
- Open transaction /DVD/JCO_MNG
- On the left side of the screen, select the connector that is being used. Usually, for the HADOOP storage the Java connector is connected via the DATAVARD_JAVA_CONN RFC.
- The status of the connector for the specific application server is shown on the right side of the screen.
- Click the Restart button
Possible issues
If the Java connector doesn't start, these are the possible issues. Even when it doesn't start, useful information can often be found in the logs; click the Logs button to check them.
Possible issue | What to do |
---|---|
Wrong settings | Double-check the setup on the Config and Dependencies tabs in /dvd/jco_mng |
Install directory doesn't exist | The latest versions of Storage Management create the installation directory on their own, but if you have an older version (<1903), make sure the directory exists |
libsapjco3.so is not in $LD_LIBRARY_PATH | Make sure that libsapjco3.so is in the $LD_LIBRARY_PATH of the <SID>adm user. Keep in mind that $LD_LIBRARY_PATH is only updated after an application server restart (see the sketch after this table). |
RFC user does not exist or is locked | Create/unlock the user in transaction SU01 |
RFC user has wrong user type (not 'Communication Data' user) | Correct the user type in transaction SU01 |
RFC user has wrong role assigned (enabling RFC communication) | Correct the user role in transaction PFCG |
Java runtime environment is lower than 1.7 | The Datavard Java connector supports only releases >= 1.7. Update the JRE. |
RFC and Java connector have incorrect program ID | Check Program ID in Config tab in /dvd/jco_mng. Make sure that the RFC that is connecting to this Java connector uses the same program ID. |
ACLs on SAP gateway don't allow registration of Java connector | Check ACLs in SMGW transaction. Make sure that program ID used in the setup is allowed to register on the gateway. |
None of the above | Try to start the JVM manually in debug mode. Examples: Oracle Java: export WORKDIR=/usr/sap/<SID>/<instance_DIR>/work/dvd_conn/jco204; IBM Java: /usr/bin/java -Xmx2G |
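Several of the environment-related checks above can be verified directly at OS level as the <SID>adm user. A minimal sketch (paths and directory names are placeholders):

```bash
# run as <SID>adm on the affected application server
echo $LD_LIBRARY_PATH

# is the SAP JCo native library visible in one of those directories?
ls -l $(echo $LD_LIBRARY_PATH | tr ':' ' ') 2>/dev/null | grep sapjco3

# which Java runtime will be used, and is it at least 1.7?
java -version

# does the Java connector installation/work directory exist?
ls -ld /usr/sap/<SID>/<instance_DIR>/work/dvd_conn
```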
Kerberos authentication failure
If this is the issue, it can be found in the Java log. To access the log, go to transaction /dvd/jco_mng, select the Java connector that is used, and click the Logs button. Errors are highlighted in red.
Some examples of error messages can be found here. Oracle JRE 8 error messages are documented at https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/tutorials/Troubleshooting.html
Possible issues
Possible issue | What to do |
---|---|
Incorrect logical paths in /dvd/hdp_cus_c | Check table /dvd/hdp_cus_c. Make sure that the logical paths are correct and point to the correct files at OS level. |
Wrong/expired keytab | A keytab can become invalid, either after a fixed period of time or after another copy of the keytab was exported from the KDC (the KVNO has increased). If the file is present in the correct directory, with the correct format and permissions (/sapmnt/<SID>/global/security/dvd_conn/<sid>hdp.keytab), try a manual login with the keytab (see the sketch after this table). |
Wrong principal (case sensitive) | Make sure that the principal name in /dvd/hdp_cus_c is correct. It should have a format like user@EXAMPLE.COM and is case sensitive. To check the principal names inside the keytab: klist -k <path_to_keytab> |
Wrong Kerberos config | Check the contents of the krb5.conf file in the /sapmnt/<SID>/global/security/dvd_conn/ directory and compare it with the krb5.conf valid for the Hadoop cluster. The contents have to match. Make sure that it contains information about the principal's Kerberos realm and about the Hadoop cluster's Kerberos realm. |
Port to KDC is not open | Make sure that port 88 to the KDC is open. If cross-realm authentication is used, port 88 to the KDCs of both realms needs to be open. |
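A manual login with the keytab from the application server OS is usually the fastest way to rule out keytab, principal, and KDC problems. A sketch using the paths from the table above (the principal and KDC host are placeholders):

```bash
# list principals and key version numbers (KVNO) inside the keytab
klist -k -t /sapmnt/<SID>/global/security/dvd_conn/<sid>hdp.keytab

# try a manual login; note that the principal is case sensitive
kinit -kt /sapmnt/<SID>/global/security/dvd_conn/<sid>hdp.keytab user@EXAMPLE.COM
klist

# is port 88 to the KDC reachable?
telnet <kdc_host> 88
```

If kinit succeeds here but the Java connector still fails, the problem is more likely in krb5.conf, the JAAS configuration, or the paths maintained in /dvd/hdp_cus_c.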
HDFS permissions
Another possibility is that the Hadoop technical user doesn't have the correct permissions on HDFS, or that the home/landing folder doesn't exist. The easiest way to check this is directly on the Hadoop cluster.
To get the path that is used:
- Go to SM59 and open the HttpFS/WebHDFS RFC
- The HDFS path is visible in the Path Prefix field, following the /webhdfs/v1/ prefix; permissions on this path can then be checked directly on the cluster, as sketched below
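A quick permission check with the hdfs command-line client could look like this (a sketch; <hdfs_path> stands for the path taken from the RFC path prefix, and the test file name is hypothetical):

```bash
# run on a Hadoop (edge) node as the technical user configured in Storage Management
# does the directory exist, and who owns it?
hdfs dfs -ls -d /<hdfs_path>
hdfs dfs -ls /<hdfs_path>

# quick write/delete test to confirm permissions
hdfs dfs -touchz /<hdfs_path>/sm_permission_test
hdfs dfs -rm /<hdfs_path>/sm_permission_test
```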
Parameter ict/disable_cookie_urlencoding is incorrectly set
If HDFS is secured by Kerberos and you execute the Connection Test in /dvd/sm_setup, you may still get a login pop-up.
This is caused by an incorrectly set profile parameter ict/disable_cookie_urlencoding. It needs to be set to '1' ('2' in newer SAP kernel versions).
SM_TRS_MS Storage type
The SM_TRS_MS storage type handles communication with "metastore" services. This includes Hive, Impala, and Databricks.
Java connector
This storage type uses the Datavard Java connector. In transaction /dvd/sm_setup you can find out which RFC and JCO version is used by the storage.
Go to transaction /dvd/jco_mng and make sure that the connector is running. If the connector is not running and doesn't start, follow the troubleshooting instructions in the previous section.
Kerberos issues
When the JCO is running, the next possible issue is authentication. If the Java connector logs point to authentication problems, follow the instructions in the Kerberos authentication failure section above.
Other issues
Depending on the Hadoop distribution and its actual configuration, the Datavard Java connector may communicate with different Hadoop services.
Typically, there is at least one service that facilitates manipulation of files stored in HDFS (HttpFS/WebHDFS) and at least one service that provides a database representation of the data stored in HDFS (plus metadata) and accepts SQL queries.
Possible issue | What to do |
---|---|
The service has crashed/failed | While unlikely, it is still possible that a Hadoop service fails. The cause needs to be checked via the cluster manager and the service restarted. |
The service is no longer reachable on the host specified in the HTTP RFC destination | If a service failover happens in a High Availability scenario and there is no load balancer/proxy (e.g. ZooKeeper), the HTTP RFC destination needs to be reconfigured to point to the new Hadoop host where the service is running. This can be circumvented by configuring an HA alternative host in the Storage Management setup. |
The service is no longer reachable on its port | Make sure that the Hadoop service is running on the correct host and listening on the expected port (see the sketch after this table). |
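To take SAP out of the picture, the SQL-facing service can also be tested directly with a JDBC client such as beeline from a Hadoop edge node. A minimal sketch, assuming HiveServer2 on its default port 10000 (hosts, user, and the Kerberos service principal are placeholders that depend on your distribution):

```bash
# non-Kerberized cluster
beeline -u "jdbc:hive2://<hive_host>:10000/default" -n <user> -e "SHOW DATABASES;"

# Kerberized cluster: obtain a ticket first, then include the HiveServer2 service principal
kinit <principal>
beeline -u "jdbc:hive2://<hive_host>:10000/default;principal=hive/<hive_host>@EXAMPLE.COM" -e "SHOW DATABASES;"
```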
Other troubleshooting steps
- Is the Hadoop service reachable on the designated ports? (Hive 10000, HttpFS 14000, WebHDFS 50070, Impala 21050) telnet <hadoop_host_FQDN> <port_number> (see also the consolidated sketch after this list)
- Is Java connector process running? transaction: /DVD/JCO_MNG; Start/Restart; ps -ef | grep java
- Is Java connector registered on SAP Gateway? transaction: SMGW → Connected clients
- Is Hadoop HDFS storage check ok? /DVD/SM_SETUP
- Is RFC Destination working properly? SM59 → Connection test
- Is ICM running? transaction: SMICM
- If Kerberos is used for authentication, are the necessary files in place? (Kerberos config, user's keytab, JAAS config)
- Is the Kerberos keytab version number still valid? (i.e. is it possible that a new keytab was created with a higher KVNO, rendering the previous keytab invalid?)
- Are Hadoop permissions correctly set? HDFS permissions, Ranger/Sentry rules for HDFS user directory and Hive database
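Several of the OS-level checks from this list can be run in one quick pass from the SAP application server. A rough sketch (host, ports, and paths are placeholders, default ports shown):

```bash
# port reachability for the usual Hadoop service ports
for port in 10000 14000 50070 21050; do
  telnet <hadoop_host_FQDN> $port </dev/null
done

# is a Java connector JVM running on this application server?
ps -ef | grep java

# are the Kerberos-related files in place?
ls -l /sapmnt/<SID>/global/security/dvd_conn/
```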