(SM-1911) SAP-Hadoop Communication
The Datavard Storage Management (SM) component provides a connection to Hadoop. For Hadoop-specific scenarios, it communicates with:
- HDFS
- Hive or Impala
Hadoop Distributed File System (HDFS)
HDFS is the data storage system used by Hadoop applications, and Datavard uses it for data transfer.
Files can be sent directly to HDFS, or HDFS can serve as temporary storage when transferring data from SAP to the Hadoop engines.
Communication with HDFS
Communication with HDFS is handled directly by SM through the WebHDFS REST API.
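As a minimal sketch of what a WebHDFS request looks like, the snippet below builds the URL for a file-creation call. The host name, port, and HDFS path are placeholder assumptions, not Datavard defaults; the actual values come from the SM storage configuration.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real setup takes these from the storage configuration.
NAMENODE = "namenode.example.com"
PORT = 9870  # default WebHDFS port on Hadoop 3.x (50070 on Hadoop 2.x)

def webhdfs_url(path, op, **params):
    """Build a WebHDFS REST URL for the given HDFS path and operation."""
    query = urlencode({"op": op, **params})
    return f"http://{NAMENODE}:{PORT}/webhdfs/v1{path}?{query}"

# Step 1 of a WebHDFS upload: an HTTP PUT with op=CREATE is sent to the
# NameNode, which answers with a 307 redirect to a DataNode; the file
# content is then sent in a second PUT to that redirect URL.
url = webhdfs_url("/staging/transfer.csv", "CREATE", overwrite="true")
print(url)
```

Other operations (OPEN, DELETE, LISTSTATUS, and so on) follow the same URL pattern with a different `op` parameter.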
Hive and Impala
Apache Hive is data warehouse software in the Hadoop ecosystem that facilitates reading, writing, and managing large datasets in distributed storage using an SQL-like query language.
Apache Impala is a massively parallel processing (MPP) SQL query engine for data stored in a Hadoop cluster. It is a scalable parallel database technology that enables users to issue low-latency SQL queries on data stored in HDFS.
Both engines support an SQL-like query language for executing DDL and DML operations (these are described below in more detail).
When data is transferred from SAP to Hive or Impala, it is first moved to HDFS in the form of a .csv file; the Hive or Impala engine then loads the transferred data.
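The two-step transfer described above can be sketched as follows: serialize the SAP rows to CSV for staging on HDFS, then issue a load statement to the engine. The table name, HDFS path, and sample rows are illustrative assumptions; the real statements depend on the target table definition.

```python
import csv
import io

def to_csv(rows):
    """Serialize table rows into CSV text for staging on HDFS."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

def load_statement(hdfs_path, table):
    """Build the HiveQL statement that loads a staged CSV into a table.
    LOAD DATA INPATH moves the staged file from HDFS into the table's
    storage location, so the temporary copy is consumed by the load."""
    return f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table}"

# Hypothetical sample: three fields from an SAP table row.
payload = to_csv([["100", "MARA", "2019"]])
stmt = load_statement("/staging/transfer.csv", "sap_mara")
print(stmt)
```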
Communication with Hive or Impala
When communicating with the Hive or Impala engines, a Java connector implemented by Datavard is used. This connector wraps SQL-like queries using JDBC drivers (JARs) and forwards them to the engines.
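To illustrate what the connector's JDBC layer targets, the sketch below builds the standard JDBC connection strings for HiveServer2 and for the Cloudera Impala driver. The host, ports, and database name are placeholder assumptions; the subprotocol names (`hive2`, `impala`) and default ports (10000 for HiveServer2, 21050 for Impala JDBC) follow the usual driver conventions, not a Datavard-specific format.

```python
# Map each engine to its JDBC subprotocol: HiveServer2 uses "hive2",
# the Cloudera Impala JDBC driver uses "impala".
SUBPROTOCOLS = {"hive": "hive2", "impala": "impala"}

def jdbc_url(engine, host, port, database="default"):
    """Build the JDBC connection string a connector would pass to the driver."""
    return f"jdbc:{SUBPROTOCOLS[engine]}://{host}:{port}/{database}"

# Hypothetical gateway host; real values come from the cluster configuration.
print(jdbc_url("hive", "gateway.example.com", 10000))
print(jdbc_url("impala", "gateway.example.com", 21050))
```

The connector then submits each DDL or DML statement over the resulting connection, so from the engine's point of view the queries look like any other JDBC client session.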