(SM-2305) SAP-Hadoop Communication

Storage Management (SM) provides a connection to Hadoop. For Hadoop-specific scenarios, we communicate with:

  • HDFS 

  • Hive or Impala

Hadoop Distributed File System (HDFS)

HDFS is the data storage system used by Hadoop applications, and SNP uses it for data transfer.
Files can be sent directly to HDFS, or HDFS can serve as temporary storage when transferring data from SAP to the Hadoop engines.

Communication with HDFS

Communication with HDFS is provided through the WebHDFS REST API, which SM handles directly.
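As an illustration of what such REST calls look like, the sketch below builds WebHDFS URLs. The host name and path are hypothetical; the port is an assumption (9870 is the WebHDFS default on Hadoop 3.x, 50070 on 2.x), and the two-step upload behavior follows the public WebHDFS specification, not SNP's internal implementation.

```python
from urllib.parse import urlencode

def webhdfs_url(host: str, port: int, path: str, op: str, **params) -> str:
    """Build a WebHDFS REST URL for a file-system operation on `path`."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# Step 1 of a file upload: PUT to this URL; the NameNode answers with a
# 307 redirect to a DataNode, and step 2 PUTs the file body there.
create = webhdfs_url("namenode.example.com", 9870, "/staging/export.csv",
                     "CREATE", overwrite="true")
```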

Hive and Impala

Apache Hive is data warehouse software in the Hadoop ecosystem that facilitates reading, writing, and managing large data sets in distributed storage using an SQL-like query language.
Apache Impala is a massively parallel processing (MPP) SQL query engine for data stored in a Hadoop cluster. Impala is a scalable, parallel database technology that enables users to issue low-latency SQL queries against data stored in HDFS.

Both engines support an SQL-like query language for executing DDL and DML operations (these are described below in more detail).
When data is transferred from SAP to Hive or Impala, it is moved in the form of a .csv file to HDFS; the Hive or Impala engine then loads the transferred data.
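The load step above can be sketched with generic HiveQL statements, which Impala also accepts. The table and column names are hypothetical, and the statements illustrate the general pattern (CSV staged in HDFS, then loaded by the engine), not the exact SQL that SNP generates.

```python
def staging_ddl(table: str, columns: list[tuple[str, str]]) -> str:
    """CREATE a table whose row format matches the staged .csv file."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return (f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
            f"ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")

def load_dml(table: str, hdfs_path: str) -> str:
    """LOAD DATA moves the staged file from HDFS into the table."""
    return f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table}"
```

Note that `LOAD DATA INPATH` moves (rather than copies) the file within HDFS, which is why HDFS works well as temporary storage for the transfer.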

Detailed communication diagram

Communication with the Hive and Impala engines uses a Java connector implemented by SNP. This connector wraps SQL-like queries using the JDBC driver jars and forwards them to the engines.
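For orientation, the sketch below shows the shape of the JDBC URLs such a connector would target. The host name is hypothetical, and the ports are assumptions based on common defaults (HiveServer2 on 10000, Impala's JDBC endpoint on 21050); the URL schemes match the Apache Hive and Cloudera Impala JDBC drivers, not necessarily SNP's configuration.

```python
def jdbc_url(engine: str, host: str, port: int,
             database: str = "default") -> str:
    """Build a JDBC connection URL for the given engine.

    HiveServer2 uses the "hive2" scheme; the Cloudera Impala JDBC
    driver uses "impala".
    """
    scheme = {"hive": "hive2", "impala": "impala"}[engine]
    return f"jdbc:{scheme}://{host}:{port}/{database}"
```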