(SP21) Hadoop storage architecture
SAP <-> Hadoop communication
The connection to Hadoop is provided by the Datavard component Storage Management.
Storage Management (SM) is part of the Reuse Library and contains implementations for binary and table storages. For both storage types, SM provides a facade, the Storage Manager.
For the Hadoop connection, Table Storage is used.
For Datavard-specific scenarios, we need to handle communication with:
- HDFS
- Hive/Impala
HDFS
HDFS (Hadoop Distributed File System) is the data storage system used by Hadoop applications; Datavard also uses it for data transfer.
Files can be sent directly to HDFS, but HDFS also serves as temporary storage when transferring data from SAP to the Hadoop engines (see the description of the DML sequence diagram below).
Communication with HDFS
Communication with HDFS is provided via the WebHDFS REST API, which is handled by SM directly.
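As an illustration, the following minimal Java sketch issues the two WebHDFS operations used later in the sequence diagrams: MKDIRS to create a directory, and the two-step CREATE to write a file (the NameNode answers with a redirect to a DataNode, which then receives the content). The host namenode.example.com, port 9870 (a Hadoop 3 default), user sapuser, and the /tmp/datavard paths are placeholder assumptions, not Datavard's actual configuration.

import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Minimal WebHDFS sketch; host, port, user and paths are hypothetical. */
public class WebHdfsSketch {

    private static final String BASE = "http://namenode.example.com:9870/webhdfs/v1";
    private static final String USER = "user.name=sapuser"; // pseudo-authentication parameter

    /** Create a directory: HTTP PUT .../<path>?op=MKDIRS */
    static void mkdirs(String path) throws IOException {
        URL url = new URL(BASE + path + "?op=MKDIRS&" + USER);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        System.out.println("MKDIRS " + path + " -> HTTP " + conn.getResponseCode());
        conn.disconnect();
    }

    /** Create a file: PUT .../<path>?op=CREATE, then PUT the content to the redirect target. */
    static void createFile(String path, byte[] content) throws IOException {
        // Step 1: ask the NameNode; it answers 307 with the DataNode URL in Location.
        URL url = new URL(BASE + path + "?op=CREATE&overwrite=true&" + USER);
        HttpURLConnection ask = (HttpURLConnection) url.openConnection();
        ask.setRequestMethod("PUT");
        ask.setInstanceFollowRedirects(false);
        System.out.println("CREATE redirect -> HTTP " + ask.getResponseCode());
        String dataNodeUrl = ask.getHeaderField("Location");
        ask.disconnect();

        // Step 2: write the file content to the DataNode URL.
        HttpURLConnection write = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
        write.setRequestMethod("PUT");
        write.setDoOutput(true);
        try (OutputStream out = write.getOutputStream()) {
            out.write(content);
        }
        System.out.println("CREATE " + path + " -> HTTP " + write.getResponseCode());
        write.disconnect();
    }

    public static void main(String[] args) throws IOException {
        mkdirs("/tmp/datavard/transfer");
        createFile("/tmp/datavard/transfer/data.csv",
                   "1,alpha\n2,beta\n".getBytes(StandardCharsets.UTF_8));
    }
}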
Hive/Impala
Apache Hive is data warehouse software in the Hadoop ecosystem that facilitates reading, writing, and managing large datasets in distributed storage using an SQL-like query language.
Apache Impala is a massively parallel processing (MPP) SQL query engine for data stored in a Hadoop cluster. Impala brings scalable parallel database technology, enabling users to issue low-latency SQL queries against data stored in HDFS.
Both engines support an SQL-like query language for executing DDL and DML operations (described in more detail below).
HDFS is also used for transferring data to the Hive/Impala engines: the data is first moved to HDFS as a .csv file, and the engine then loads the transferred data.
Communication with Hive/Impala
Communication with the Hive/Impala engines uses a Java connector implemented by Datavard. The connector wraps the SQL-like queries using JDBC drivers and forwards them to the engines themselves.
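The Datavard connector itself is not shown here; the following minimal sketch only illustrates the underlying JDBC mechanics of forwarding a query to HiveServer2. It assumes the hive-jdbc driver jar is on the classpath; the URL, credentials, and the sap_transfer table are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Minimal sketch: forwarding a query over JDBC to HiveServer2. */
public class HiveJdbcSketch {
    public static void main(String[] args) throws SQLException {
        // Impala exposes a HiveServer2-compatible endpoint as well (commonly port 21050).
        String url = "jdbc:hive2://hive.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "sapuser", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM sap_transfer")) {
            while (rs.next()) {
                System.out.println("rows: " + rs.getLong(1));
            }
        } // try-with-resources closes the result set, statement, and connection
    }
}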
DDL and DML operations
DDL (Data Definition Language) is the part of a query language used for defining data structures, especially database schemas. DDL statements include CREATE, DROP, ALTER, and TRUNCATE.
DML (Data Manipulation Language) is the part of a query language used for adding, deleting, and modifying data in a database. DML statements include SELECT, INSERT, UPDATE, and DELETE.
Two sequence diagrams illustrate how DDL and DML operations flow from a Datavard product through Storage Management to the Hadoop storage.
Objects shown in blue in the diagrams are Datavard components.
DDL
A DDL request sent from a Datavard product to Storage Management is processed by the Storage Manager. First, the Storage Manager sends an execution command to HDFS to create a directory. It then opens a JDBC connection and, once the connection is established, sends the DDL execution command. The JDBC connector forwards the request to the Hadoop storage engine and receives a response after execution. The JDBC connection is closed, and the response is returned through the Storage Manager to the Datavard product.
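A minimal Java sketch of this sequence, under the same placeholder assumptions as above (the HDFS step is summarized in a comment, and the sap_transfer table, its columns, and the paths are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

/** Minimal sketch of the DDL sequence; names and paths are hypothetical. */
public class DdlSequenceSketch {
    public static void main(String[] args) throws SQLException {
        // Step 1: create the backing directory in HDFS via WebHDFS
        // (PUT .../webhdfs/v1/user/sapuser/sap_transfer?op=MKDIRS, see the WebHDFS sketch above).

        // Step 2: open a JDBC connection and execute the DDL statement.
        String url = "jdbc:hive2://hive.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "sapuser", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS sap_transfer ("
                + " id INT, payload STRING)"
                + " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
                + " LOCATION '/user/sapuser/sap_transfer'");
        }
        // Step 3: the connection is closed (by try-with-resources) and the
        // response is passed back through the Storage Manager.
    }
}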
DML
For a DML request (e.g. a data load), the Datavard product first reads the data from a source, which can be SAP, external storage, a local/remote file, etc. The DML request is then sent to the Storage Manager together with the acquired data. The Storage Manager sends an execution command to HDFS to create a directory, followed by a request to create a file with the data (in .csv format). It then sends a request via JDBC to the Hadoop storage engine (Hive, Impala) to load the data contained in the file stored in HDFS. After execution, the Hadoop storage engine sends a response, which is returned to the Datavard product through the Storage Manager.
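A minimal sketch of the JDBC side of this data load, again with hypothetical names; LOAD DATA INPATH is the HiveQL statement that moves a staged HDFS file into the table's storage location:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

/** Minimal sketch of the DML (data load) sequence; names and paths are hypothetical. */
public class DmlSequenceSketch {
    public static void main(String[] args) throws SQLException {
        // Step 1: stage the extracted rows as a .csv file in HDFS via WebHDFS
        // (op=MKDIRS for the staging directory, op=CREATE for the file;
        // see the WebHDFS sketch above).

        // Step 2: ask the engine over JDBC to load the staged file into the table.
        String url = "jdbc:hive2://hive.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "sapuser", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("LOAD DATA INPATH '/tmp/datavard/transfer/data.csv'"
                       + " INTO TABLE sap_transfer");
        }
        // Step 3: the engine's response travels back through the Storage Manager
        // to the Datavard product.
    }
}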