(SM-1811) SAP-Hadoop communication

Connection to Hadoop is provided by the Datavard Storage Management component.

For Hadoop-specific scenarios, we communicate with:

  • HDFS 
  • Hive/Impala

HDFS

HDFS (Hadoop Distributed File System) is the data storage system used by Hadoop applications; Datavard also uses it for data transfer.

It is possible to send files directly to HDFS, or to use it as temporary storage when transferring data from SAP to the Hadoop engines (see the description of the DML sequence diagram below).

Communication with HDFS

Communication with HDFS is provided via the WebHDFS REST API, which is handled directly by Storage Management (SM).
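As an illustration, a minimal sketch of how a WebHDFS request URL is assembled. The host, port, and path are placeholders, not Datavard defaults (9870 is the conventional WebHDFS port in recent Hadoop releases):

```python
def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for an operation such as CREATE, OPEN, or MKDIRS."""
    query = "&".join([f"op={op}"] + [f"{k}={v}" for k, v in sorted(params.items())])
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# File creation is a two-step protocol: an initial PUT to this URL returns a
# 307 redirect to a datanode, and the file contents are then sent to the
# redirect location.
url = webhdfs_url("namenode.example.com", 9870, "/staging/data.csv",
                  "CREATE", overwrite="true")
```

The helper only constructs the URL; the actual HTTP calls (and authentication) are handled by SM and are omitted here.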

Hive/Impala

Apache Hive is data warehouse software in the Hadoop ecosystem that facilitates reading, writing, and managing large datasets in distributed storage using an SQL-like query language.

Apache Impala is a massively parallel processing (MPP) SQL query engine for data stored in a Hadoop cluster. Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries against data stored in HDFS.


Both engines support an SQL-like query language for executing DDL and DML operations (described below in more detail).

HDFS is also used for data transfer to the Hive/Impala engines: the data is first moved to HDFS as a .csv file, and the engine then loads the transferred data.
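The second step of this transfer can be sketched as building a `LOAD DATA` statement that makes the engine ingest the staged file. The path and table names below are illustrative; the exact statements Datavard issues may differ:

```python
def load_data_stmt(hdfs_csv_path, table, overwrite=False):
    """Build the HiveQL statement that ingests a CSV staged in HDFS.
    LOAD DATA INPATH moves the file from its staging location into the
    table's warehouse directory."""
    mode = "OVERWRITE INTO" if overwrite else "INTO"
    return f"LOAD DATA INPATH '{hdfs_csv_path}' {mode} TABLE {table}"

stmt = load_data_stmt("/staging/sap_export.csv", "sap_data")
```

Because `LOAD DATA INPATH` is a move rather than a copy, the staged .csv disappears from its temporary HDFS location once the engine has loaded it.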

Communication with Hive/Impala

When communicating with the Hive/Impala engines, a Java connector (implemented by Datavard) is used. This connector wraps the SQL-like queries using JDBC driver jars and forwards them to the engines themselves.
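The connector itself is Java, but the shape of the connection string it would hand to a JDBC driver can be sketched. The ports are the conventional defaults for HiveServer2 (10000) and Impala (21050), not values confirmed by this document; Impala can also be reached through the `hive2` JDBC scheme:

```python
def jdbc_url(engine, host, port=None, database="default"):
    """Assemble a Hive/Impala JDBC connection string.
    Both engines accept the hive2 scheme; they differ mainly in the
    default service port."""
    default_ports = {"hive": 10000, "impala": 21050}
    if engine not in default_ports:
        raise ValueError(f"unknown engine: {engine}")
    port = port if port is not None else default_ports[engine]
    return f"jdbc:hive2://{host}:{port}/{database}"
```

In the Java connector, such a string would be passed to `DriverManager.getConnection(...)` along with credentials; that part is omitted here since it depends on the cluster's security setup.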

Detailed Communication Diagram