(SM-2405) SAP-Hadoop Communication
Storage Management (SM) provides a connection to Hadoop. For Hadoop-specific scenarios, we communicate with:
HDFS
Hive or Impala
Hadoop Distributed File System (HDFS)
HDFS is the data storage system used by Hadoop applications. SNP uses it for data transfer.
Files can be sent directly to HDFS, or HDFS can serve as temporary storage when data is transferred from SAP to Hadoop engines.
Communication with HDFS
Communication with HDFS goes through the WebHDFS REST API, which SM handles directly.
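To make the WebHDFS flow concrete, the following is a minimal sketch of a file upload, not SM's actual implementation. It assumes an unsecured cluster with simple `user.name` authentication; the function names, host, port 9870 (the default NameNode HTTP port in Hadoop 3), and path are illustrative assumptions. WebHDFS creates a file in two steps: the NameNode answers the first PUT with a redirect, and the file content is then sent to the DataNode named in that redirect.

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def webhdfs_url(host, port, hdfs_path, op, params=None):
    """Build a WebHDFS URL: http://host:port/webhdfs/v1/<path>?op=<OP>&..."""
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


def _put(url, body=None):
    """Send one PUT request and return (status, Location header or None)."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port)
    try:
        conn.request("PUT", f"{parts.path}?{parts.query}", body=body)
        resp = conn.getresponse()
        resp.read()  # drain the response so the connection closes cleanly
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def upload_file(host, port, hdfs_path, data, user):
    """Hypothetical helper: two-step WebHDFS CREATE (NameNode, then DataNode)."""
    # Step 1: PUT without a body; the NameNode replies 307 with a Location header.
    first = webhdfs_url(host, port, hdfs_path, "CREATE",
                        {"overwrite": "true", "user.name": user})
    status, location = _put(first)
    if status != 307 or not location:
        raise RuntimeError(f"expected a 307 redirect, got HTTP {status}")
    # Step 2: PUT the file content to the DataNode; 201 Created signals success.
    status, _ = _put(location, body=data)
    return status == 201
```

A call such as `upload_file("namenode", 9870, "/tmp/snp/data.csv", b"1000,EUR\n", "snp")` would need a reachable cluster; on a Kerberos-secured cluster the authentication would differ.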
Hive and Impala
Apache Hive is data warehouse software in the Hadoop ecosystem that facilitates reading, writing, and managing large data sets in distributed storage using an SQL-like query language.
Apache Impala is a massively parallel processing (MPP) SQL query engine for data stored in a Hadoop cluster. It is a scalable parallel database technology that lets users issue low-latency SQL queries against data stored in HDFS.
Both engines support an SQL-like query language for executing DDL and DML operations (described in more detail below).
When data is transferred from SAP to Hive or Impala, it is first moved to HDFS in the form of a .csv file. The Hive or Impala engine then loads the transferred data.
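The two stages of this transfer can be sketched as follows. This is an illustration under stated assumptions, not SNP's actual transfer logic: the helper names, sample rows, table name, and HDFS path are hypothetical, and `LOAD DATA INPATH` is only one way an engine can pick up a staged file (both Hive and Impala support the statement, but the source does not say which mechanism is used).

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize extracted rows into CSV text, ready to be staged in HDFS."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def load_statement(hdfs_path, table):
    """Build the statement that moves a staged HDFS file into a table.

    Assumption: the target table already exists and is defined as a
    comma-delimited text table matching the CSV layout.
    """
    return f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table}"


# Stage 1: rows extracted from SAP (sample values) become a .csv payload.
payload = rows_to_csv([("1000", "EUR"), ("2000", "USD")])

# Stage 2: after the payload is written to HDFS, the engine loads it.
statement = load_statement("/tmp/snp/data.csv", "sales")
```

Staging the file in HDFS first lets the engine ingest the data in bulk with a single statement instead of inserting rows one at a time.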
Detailed communication diagram
Communication with the Hive and Impala engines goes through a Java connector implemented by SNP. The connector wraps SQL-like queries using JDBC jars and forwards them to the engines.
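For orientation, the sketch below shows the shape of the JDBC connection URLs such a connector would typically open. It is an assumption-laden illustration, not the SNP connector's code: `jdbc_url` is a hypothetical helper, the ports are the usual defaults (10000 for HiveServer2, 21050 for Impala), and the `jdbc:impala://` scheme applies to Cloudera's Impala JDBC driver. Which driver a given installation uses is not stated in the source.

```python
def jdbc_url(engine, host, port=None, database="default"):
    """Return a JDBC URL for 'hive' or 'impala' (hypothetical helper)."""
    # (URL scheme, default port) per engine; both are common defaults, not
    # values confirmed for any specific installation.
    schemes = {"hive": ("hive2", 10000), "impala": ("impala", 21050)}
    if engine not in schemes:
        raise ValueError(f"unsupported engine: {engine}")
    scheme, default_port = schemes[engine]
    return f"jdbc:{scheme}://{host}:{port or default_port}/{database}"


hive_url = jdbc_url("hive", "hive-host")
impala_url = jdbc_url("impala", "impala-host", database="sales")
```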