Lake Data Warehouse Architecture for Big Data Solutions

Traditional Data Warehouse is a multidimensional repository of nonvolatile, subject-oriented, integrated, time-variant, and non-operational data gathered from multiple heterogeneous data sources. Traditional Data Warehouse architecture must be adapted to the new challenges imposed by the abundance of data and by current big data characteristics, including volume, value, variety, validity, volatility, visualization, variability, and venue. The new architecture also needs to address existing drawbacks, including availability, scalability, and, consequently, query performance. This paper introduces a novel Data Warehouse architecture, named Lake Data Warehouse Architecture, that gives the traditional Data Warehouse the capabilities to overcome these challenges. Lake Data Warehouse Architecture merges the traditional Data Warehouse architecture with big data technologies, such as the Hadoop framework and Apache Spark, providing a hybrid solution in a complementary way. The main advantage of the proposed architecture is that it combines the current features of traditional Data Warehouses with the big data features acquired by integrating the traditional Data Warehouse with the Hadoop and Spark ecosystems. Furthermore, it is tailored to handle a tremendous volume of data while maintaining availability, reliability, and scalability.

Keywords—Traditional data warehouse; big data; semi-structured data; unstructured data; novel data warehouse architecture; Hadoop; Spark


I. INTRODUCTION
A data warehouse (DW) has many benefits: it enhances business intelligence, data quality, and consistency, saves time, and supports historical data analysis and querying [1]. In the last two decades, data warehouses have played a prominent role in helping decision-makers. However, in the age of big data, with the massive increase in data volume and types, there is a great need for more adequate architectures and technologies.
Therefore, it became crucial to enhance the traditional DW to accommodate this evolution in the volume, variety, velocity, and veracity of big data across various fields [2], [3]. To achieve this, we propose a new DW architecture called Lake Data Warehouse Architecture. Lake Data Warehouse Architecture is a hybrid system that preserves the traditional DW features while adding features and capabilities that harness big data technologies and tools (Hadoop, Data Lake, Delta Lake, and Apache Spark) in a complementary way to support and enhance the existing architecture.
Our proposed contribution solves several issues that arise when integrating data from big data repositories, such as:
 Integrating the traditional DW technique, the Hadoop framework, and Apache Spark.
 Handling different data types from various sources: structured data (databases, spreadsheets), semi-structured data (XML and JSON files), and unstructured data (video, audio, images, emails, Word, PowerPoint, and PDF files).
 Capturing, storing, managing, and analyzing data volumes that cannot be handled by a traditional DW.
 Using recent technologies such as the Hadoop framework, Data Lake, Delta Lake, and Apache Spark to decrease the time spent analyzing data and to reduce storage costs.
 Supporting all users, especially data scientists, who need to perform in-depth data analysis.
The rest of the paper is organized as follows: Section II explains background and preliminaries for traditional Data Warehouses, their limitations and importance, and big data characteristics and types. Section III overviews work related to DW architectures. Section IV presents the proposed DW architecture, named Lake Data Warehouse Architecture. Section V describes a case study applying our contribution, the Lake DW Architecture. Finally, the conclusion is presented in Section VI.

II. BACKGROUND AND PRELIMINARIES

A. Traditional DW Architecture
A primary DW usually contains data marts (DMs). A data mart is a small DW that contains only a subset of the data obtained from the central DW; its content represents information from a specific domain of interest. Many DW servers are used to manage the data, and these servers present multidimensional views of the data to a variety of front-end tools.

B. The Limitations of Traditional DW Architecture
In [6], [7], and [8], the authors reviewed some limitations of the traditional DW, such as supporting only structured data while not supporting semi-structured or unstructured data. In addition, the traditional DW handles data volumes in the terabyte range and does not scale to the petabyte sizes that are now widely available. It is also costly, as it depends on proprietary hardware and software.
Moreover, the analytic queries that a traditional DW performs are expensive, which affects overall query performance and the accessing and processing of data. The decision-making process may also be affected if: 1) the correct data are not available at a suitable time, or 2) the growth of the business requires new data management methods beyond an adapted traditional DW architecture.

C. The Objectives of the Redeveloped Traditional DW
One of our main objectives is to overcome the limitations of traditional DW architecture built on outdated technologies [4]. We propose an overall architecture that supports the functionalities of the traditional DW with added abilities that include [6], [9]:
 Meeting new business requirements.
 Depending on lower-cost infrastructure.
 Handling heterogeneous data and new data structures and formats.
 Managing customer expectations.
 Meeting growth of the business.
 Using advances in new technology and its improvements.
 Handling product existence and status.
 Improving business efficiencies and competitive advantage.
 Decreasing operational and financial risks.
 Evaluating and forecasting trends and behaviors.

D. The Age of Big Data
Big data is a data volume available at different levels of complexity, generated at various velocities and with various levels of uncertainty; hence it cannot be handled using traditional approaches, technologies, or algorithms [6]. Today, big data is characterized by ten main characteristics, namely volume, variety, velocity, veracity, value, validity, variability, visualization, volatility, and venue [3], [10], [2], [11], [12], [13], as follows:
 Volume: the huge amount of data generated continuously on an hourly or daily basis from different sources; for example, terabytes are generated per hour by applications like YouTube, Instagram, and Google.
 Variety: the types of big data that are ingested from different data sources.
 Velocity: the speed at which data is produced; data may arrive as batches or as streams.
 Veracity: the quality of the data being handled to obtain valuable insights; quality problems include ambiguous, inconsistent, incomplete, anomalous, uncertain, and biased data.
 Value: represents the business value to be derived from big data.
 Validity: refers to the correctness of data used to extract outputs in the form of information.
 Variability: refers to the inconsistent data flow.
 Visualization: refers to the ability to analyze and visualize insights as an output of big data analysis.
 Volatility: refers to how long stored data remains valuable to the consumer.
 Venue: refers to the various platforms that receive numerous kinds of data from different sources.
In general, data are a set of qualitative values or quantitative variables, and big data can be categorized into three types [2], [9]:
 Structured data: The data has a defined structure or schema, organized either in the form of a relational database or in some other way that is easy to operate on. Examples include data stored in a relational database (in the form of rows and columns), in spreadsheets (such as CSV files), and cleansed data (that has been processed with extensive cleansing and filtering).
 Semi-structured data: The data is harder to retrieve, analyze, and store than structured data, and a big data software framework (such as Apache Hadoop) is required to perform these operations. Examples include XML, JSON, and BibTeX files.
 Unstructured data: The data is fully unorganized, difficult to handle, and requires advanced software and tools to access. Examples include video, audio, images, emails, Word, PowerPoint, and PDF files, webpages, location coordinates, and streaming data.

E. Hadoop Framework and Data Lake
Hadoop is an open-source software framework. It allows the execution of MapReduce processes for data processing. It provides massive storage for all data types and massively parallel processing, storing data and running applications on clusters to obtain better computation resources [14]. Hadoop has the following main components [15], [16], [17]:
 The Hadoop Distributed File System (HDFS) is a file system that manages storage and access to data spread across the different nodes of a Hadoop cluster.
 YARN is the Hadoop cluster resource manager, used to assign system resources to applications and to schedule jobs.
 MapReduce is a processing engine and programming framework used to process large-scale batch data in the Hadoop system.
 Hadoop Common is a set of services and utilities that provide underlying capabilities needed by the other parts of Hadoop.
A Data Lake is a data store that can collect any type of data: structured, semi-structured, and unstructured data are stored alongside one another regardless of structure, format, or type [18], [19]. It is a conceptual idea that is usually implemented with one or more technologies, such as Hadoop and NoSQL databases. When the Data Lake is queried, only the data relevant to the business need is transformed [20].
Creating a Data Lake typically depends on Hadoop technology, with Hadoop serving as the platform component of the data lake; this is the complementary relationship between the Data Lake and Hadoop [21], [22], [23]. A Data Lake is similar to a traditional DW in that both are repositories for data. However, there are apparent differences between them [7]: the Data Lake applies schema-on-read, whereas the DW applies schema-on-write. The scale of data in a Data Lake is enormous, whereas a DW holds merely large data volumes. Data Lake sources may be semi-structured and/or unstructured data, whereas DW data is mainly structured [24], [25], [26], [27].

F. Apache Spark and Delta Lake
Apache Spark is an open-source engine applied to big data analytics and distributed processing. It provides streaming libraries, SQL, graph analysis, and machine learning. It has two main components: Spark Streaming, which is used for managing real-time data, and the Spark engine, which directly processes the data chunks produced by Spark Streaming [28], [29], [30].
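To make the two components concrete, the following is a minimal PySpark sketch (illustrative only): the built-in "rate" source stands in for a real-time feed, and the brief run time is a choice made for the example.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import avg

    spark = SparkSession.builder.appName("spark-components-sketch").getOrCreate()

    # Spark Streaming side: a structured stream; the "rate" source generates
    # timestamped test rows, standing in for real-time data.
    stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # Spark engine side: the engine processes each data chunk (micro-batch)
    # produced by the stream, here computing a running average.
    running_avg = stream.agg(avg("value").alias("avg_value"))

    query = (running_avg.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination(10)  # run briefly for the sketch
    query.stop()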
Delta Lake is an open-source storage layer for Spark. It is an extra storage layer that brings reliability to data lakes built on the Hadoop Distributed File System (HDFS) and cloud storage [31]. Delta Lake provides a series of other features, including:
 Joining streaming and batch data processing.
 Giving a scalable metadata approach.
 Providing ACID transactions that guarantee consistency of the data stored inside the data lake by ensuring that only complete writes are committed.
 Time travel, which allows one to access and restore previous versions of the data.
 Schema evolution: as data evolves, Delta allows the schema of a Spark table to change.
 Enabling a Data Lake to update data without going through the entire Data Lake repository (several of these features are illustrated in the sketch after this list).
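The following minimal PySpark sketch illustrates these features. It assumes a Spark session configured with the open-source delta-spark package; the path /tmp/demo_delta and the toy data are hypothetical.

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = (SparkSession.builder.appName("delta-features-sketch")
             .config("spark.sql.extensions",
                     "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    path = "/tmp/demo_delta"
    df = spark.range(5).withColumnRenamed("id", "value")

    # ACID write: only complete writes are committed to the transaction log.
    df.write.format("delta").mode("overwrite").save(path)

    # Updating records without rewriting the whole repository.
    DeltaTable.forPath(spark, path).update(
        condition="value = 3", set={"value": "30"})

    # Time travel: read the table as of an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Schema evolution: append rows that carry a new column.
    df.withColumn("doubled", df["value"] * 2) \
      .write.format("delta").mode("append") \
      .option("mergeSchema", "true").save(path)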
III. RELATED WORKS

Several efforts have been conducted to adapt traditional DWs to new user requirements and to changes in the underlying data sources. Many approaches focus on DWs over relational databases; however, these are not appropriate for big data. In [32], some methodologies address the problem by developing the ETL (Extracting, Transforming, and Loading) process. In [33], the authors attempt to update the DW schema to reflect modifications that have already taken place. As mentioned in [33], [34], [5], [35], [36], the authors use temporal DWs and schema versions to update the DW structure by keeping more than one DW version. These works do not depict how users' needs impact evolution changes.
Few approaches have addressed big data evolution. The approach in [37] describes data schema specification and evolution processing, but it does not depend on a DW. The work presented in [38] explains a method for handling the growth of data sources in the integration area using a big data integration ontology. It can handle some changes in data sources, but it does not determine how to answer all requirements.
Several lines of research have concentrated on the use of DWs in big data analysis. The authors in [39] present an OLAP method for big data implemented with Hadoop. They focus on multidimensional analysis for big data, but they do not address big data evolution; the same applies to the research work proposed in [40] and [26]. Other researchers have studied the problem of big data evolution. In [41], the authors discuss a methodology for constructing a system for big data analysis; however, it is not applicable when the previously used data is not available.
The authors in [42] presented a DW approach for big data analysis implemented using MapReduce. This approach handles two types of changes: (1) schema versions in the metadata that control variations of the fact table, and (2) slowly changing dimensions. However, it does not process changes in big data sources that may affect the analysis process and results.

IV. THE PROPOSED LAKE DATA WAREHOUSE ARCHITECTURE
Our contribution aims to adapt the traditional DW to the new challenges by handling semi-structured and unstructured data, managing the growth of business requirements, and treating two main drawbacks, namely availability and system performance. Furthermore, it enhances query performance by providing the required data to users at any time.
In this section, we present a novel DW architecture that improves the traditional DW performance to deal with these challenges. Our contribution integrates the traditional DW architecture with big data technologies like Hadoop and Apache Spark.
In the age of big data, we need to improve the DW architecture to handle the new challenges big data imposes. The Hadoop framework and Apache Spark complement the traditional DW in handling these challenges; each has its advantages in different circumstances. In some cases, we need the Hadoop framework and Apache Spark to process unstructured or semi-structured (raw) data and large-volume datasets (big data). In other cases, we still depend on the traditional DW for consistent, high-quality (structured) data and for low-latency, interactive reports.
Fig. 2 presents the proposed Lake DW Architecture, in which Hadoop and Spark do not replace the traditional DW: big data changes the traditional DW architecture without replacing it. Our contribution depends on integrating the traditional DW technique, the Hadoop framework, and Apache Spark into a hybrid solution.
The large amount of big data generated every minute and every hour needs a data lake that can scale to handle this volume. Therefore, we use Hadoop as the data platform for the data lake, providing extensive scalability at an acceptable cost; the Data Lake complements the DW alongside the Hadoop framework. Our proposed architecture differentiates itself from all previous work: it is a hybrid environment that upgrades the traditional DW with a Hadoop-based Data Lake, extending the use and capabilities of the traditional DW as follows:
 Collecting or capturing data from structured data sources using the ETL architecture. The DW depends on the traditional ETL process, where data is extracted (E) from operational databases, then processed, cleaned, and transformed (T) before being loaded (L) into the DW, data marts, or virtual data marts. The DW is designed to handle and analyze read-heavy workloads and requires the data model to be defined before loading the data; this is called the schema-on-write approach, as presented in Fig. 2.
 Collecting or capturing data from semi-structured or unstructured data sources using the ELT architecture. Big data requires a different collection process because traditional ETL does not work well on semi-structured or unstructured data; big data calls for ELT. The raw data is stored in its original format, and preprocessing is deferred until a query or another application asks for the data. The Data Lake thus differs from the DW by processing data in ELT order and utilizing the schema-on-read approach, as shown in Fig. 2 and sketched after this list.
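A minimal PySpark sketch contrasting the two approaches; the paths, table name, and schema are illustrative assumptions, not the architecture's actual sources.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("etl-vs-elt-sketch").getOrCreate()

    # ETL / schema-on-write: the model is defined before the load (L) step,
    # after the data has been extracted (E) and transformed (T).
    dw_schema = StructType([
        StructField("customer_id", StringType(), False),
        StructField("amount", DoubleType(), True),
    ])
    cleaned = spark.read.schema(dw_schema).csv("/staging/sales.csv")
    cleaned.write.mode("append").saveAsTable("dw_sales_fact")

    # ELT / schema-on-read: raw files land in the lake in their original
    # format; a schema is applied only when a query asks for the data.
    raw = spark.read.json("/lake/raw/events/")  # schema inferred at read time
    raw.createOrReplaceTempView("raw_events")
    spark.sql("SELECT COUNT(*) FROM raw_events").show()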

A. Hadoop-Based Data Lake Architecture in the Cloud Environment
The architecture uses Hadoop as a staging area for the DW by adding a Hadoop-based Data Lake: a storage repository that complements the traditional DW, passes only the required data onward, and ingests unlimited amounts of raw data related to business objectives. As shown in Fig. 2, the Hadoop-based Data Lake architecture has the following layers:
1) Ingesting data layer: ingests raw data in native format, where the ingested data can be micro-batch, macro-batch, batch, real-time, or hybrid.
2) Landing data layer: the different data types (ingested in the previous layer) are stored in native format (without processing). In this layer, users can find the original versions of the data for their analysis, which aids subsequent handling.
3) Metadata layer: is responsible for making data easy to access and for extracting value from the Data Lake. It helps with identification, data versioning, entities and attributes, distributions, and quality.

4) Governance data layer: applies to the other layers. It is responsible for authentication, data access, data security, data quality, and the data life cycle, and it determines the responsibilities for governing the right to access and handle the data.

B. Delta Lake Architecture with Apache Spark in the Cloud Environment
Delta Lake technology is used with Apache Spark to implement our proposed model by creating a cloud data platform that solves Data Lake challenges such as: (1) low data quality, (2) reads and writes that are not guaranteed, (3) insufficient performance with growing data volumes, and (4) difficulty updating records [43]. As presented in Fig. 2, the Delta Lake architecture has two layers: 1) the atomic layer, a Silver Delta table built on object storage, and 2) the departmental layer, any number of Gold Delta tables also built on object storage. We employ Spark and store all data in Apache Parquet format, allowing Delta Lake to leverage the efficient compression native to Parquet.
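A minimal PySpark sketch of the two layers; the object-storage paths and column names are assumptions for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import avg

    spark = SparkSession.builder.appName("delta-layers-sketch").getOrCreate()

    # Atomic layer: a Silver Delta table built on object storage; Delta
    # stores the data as Parquet files plus a transaction log.
    raw = spark.read.json("/mnt/object-store/raw/health_tracker/")
    raw.write.format("delta").mode("overwrite") \
       .save("/mnt/object-store/silver/health_tracker")

    # Departmental layer: any number of Gold Delta tables with curated,
    # domain-specific aggregates, also on object storage.
    silver = spark.read.format("delta") \
                  .load("/mnt/object-store/silver/health_tracker")
    gold = silver.groupBy("p_device_id").agg(avg("heartrate").alias("avg_hr"))
    gold.write.format("delta").mode("overwrite") \
        .save("/mnt/object-store/gold/device_avg_heartrate")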
Apache Spark is used to read and process huge files and datasets; it provides a query engine capable of processing the data in such files, and some of the most significant Spark jobs in the world run on petabytes of data. Apache Parquet is a columnar file format whose optimizations make queries faster [44]. It is a more efficient file format than JSON, is well suited to data processing in the Hadoop ecosystem, and provides an efficient way to handle complex datasets.
The main differences of our contribution, the Lake DW Architecture, over the traditional DW are as follows:
 Handling different data types (structured, semi-structured, and unstructured data) from various sources.
 Extracting, storing, managing, and analyzing data volumes that cannot be handled by a traditional DW.
 Integrating the traditional DW technique, the Hadoop framework, and Apache Spark as a hybrid solution that uses ETL or ELT processes depending on the type of data source.
 Supporting different data types from various sources.
 Determining and analyzing data from the Data Lake and Delta Lake, which scale to extreme data volumes.
 Supporting all users, especially data scientists, who can perform in-depth analysis.

V. CASE STUDY
Our goal in this section is to experiment with our contribution to prove its effectiveness in dealing with big data and analyzing it for decision-makers. This case study shows how our proposed model can extract, integrate, and analyze big data that cannot be handled by a traditional DW. We use the Data Lake to collect and process Internet of Things (IoT) data, using a demo IoT sensor dataset for demonstration purposes. The data simulates heart rate measurements from health tracker devices: each file covers five users whose heart rate is measured each hour, 24 hours a day, every day. We store the datasets in the data lake as JSON files: the first file holds the readings recorded by the devices in January 2020 (health_tracker_data_2020_01.json), and the second file holds the readings recorded in February 2020 (health_tracker_data_2020_02.json).
We implemented the case study with the Data Lake, Delta Lake, and Apache Spark on Databricks, which provides an integrated platform for working with Apache Spark. When working with Delta Lake, Parquet files can be converted in place to Delta files; below, we convert the Parquet-based data lake table we create into a Delta table.
A. Configure Apache Spark
1) Creating the cluster: the ClustreHelthTracher cluster provides the computation resources used to run data science and data analytics workloads such as the ETL process, ad-hoc analytics, and streaming analytics.
2) Configuring Apache Spark: we perform a few configuration operations on the Apache Spark session to get optimal performance. These include creating a database to store the data and, in addition, configuring the number of shuffle partitions, as shown in the following script:
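A sketch of the two configuration steps; on Databricks the spark session object is provided, and the database name and partition count are assumptions.

    # Create a database to store the case-study data.
    spark.sql("CREATE DATABASE IF NOT EXISTS health_tracker")
    spark.sql("USE health_tracker")

    # Configure the number of shuffle partitions to suit the small demo
    # workload (the default of 200 is oversized for it).
    spark.sql("SET spark.sql.shuffle.partitions = 8")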

B. Importing data from a Data Lake into a Delta Lake by ETL of Apache Spark
We import data from our Data Lake and save it into the Delta Lake as Parquet files. We load the first month of records, kept in the health_tracker_data_2020_01.json file, using the ETL process of Apache Spark, as presented in the following Python script, which also shows the query used in Fig. 3 to visualize the sensor data over time.

Fig. 3. The sensor data over time.
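A sketch of the import step and of the query behind Fig. 3; the raw column names (device_id, heartrate, name, time) and the lake paths are assumptions based on the dataset description above.

    from pyspark.sql.functions import col, from_unixtime

    # Extract: read January's raw JSON records from the data lake.
    raw = spark.read.json("/lake/raw/health_tracker_data_2020_01.json")

    # Transform: derive a timestamp and a date, and name the partition column.
    processed = (raw
        .withColumn("time", from_unixtime("time").cast("timestamp"))
        .withColumn("dte", col("time").cast("date"))
        .withColumnRenamed("device_id", "p_device_id")
        .select("dte", "time", "heartrate", "name", "p_device_id"))

    # Load: write the records as Parquet files partitioned by device.
    (processed.write.mode("overwrite")
        .partitionBy("p_device_id")
        .parquet("/health_tracker/silver"))

    # Query used to visualize the sensor data over time (Fig. 3).
    spark.read.parquet("/health_tracker/silver") \
         .createOrReplaceTempView("health_tracker_plot")
    spark.sql("SELECT dte, p_device_id, heartrate FROM health_tracker_plot "
              "ORDER BY dte").show()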

C. Create a Parquet-based Data Lake Table
We convert the existing Parquet files into a Parquet-based data lake table that will then be converted to Delta tables. We write the files to the root location of the Databricks File System (DBFS) in our cloud object storage and create the table using the Create Table As Select (CTAS) Spark SQL pattern.
1) Convert the files to Delta files: we convert the Parquet files in place to Delta files. The conversion creates a Delta Lake transaction log that tracks the files; the directory then becomes a directory of Delta files, as shown in the following Spark SQL script.
2) Register the Delta table: we register the table in the Metastore. The Spark SQL command automatically infers the data schema by reading the footers of the Delta files. With Delta Lake, the Delta table is then ready to use: the transaction log stored with the Delta files contains all the metadata needed for an immediate query. Finally, we count the records in the health_tracker_silver table with a Spark SQL query, as sketched below:
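A sketch of the three steps; the DBFS path /health_tracker/silver and the partition-column type continue the assumptions of the previous sketch.

    # 1) Convert the Parquet files in place to Delta files; this writes the
    #    Delta Lake transaction log that tracks them.
    spark.sql("""
      CONVERT TO DELTA parquet.`/health_tracker/silver`
      PARTITIONED BY (p_device_id BIGINT)
    """)

    # 2) Register the Delta table in the Metastore; the schema is inferred
    #    from the footers of the Delta files.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS health_tracker_silver
      USING DELTA
      LOCATION '/health_tracker/silver'
    """)

    # 3) The transaction log already holds all needed metadata, so the
    #    table can be queried immediately.
    spark.sql("SELECT COUNT(*) FROM health_tracker_silver").show()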

F. Exploring Analysis Results
The health_tracker_user_analytics table can be used to define a dashboard for analyzing the results according to business requirements, as provided in Fig. 4, which presents aggregation results such as the maximum, minimum, and average values, produced by the Spark SQL script sketched below.

G. Create the Table (Batch Data Written to Delta Tables)
We create the table from batch data using the Create Table As Select (CTAS) Spark SQL pattern, writing the files to the root location of the Databricks File System (DBFS) in our cloud object storage, as shown in the following script:
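A sketch serving both scripts referenced above: the CTAS pattern writes batch aggregates from the Silver table into a Gold Delta table. The table names come from the text; the aggregate columns are assumptions based on the description of Fig. 4.

    spark.sql("""
      CREATE OR REPLACE TABLE health_tracker_user_analytics
      USING DELTA
      AS
      SELECT p_device_id,
             AVG(heartrate) AS avg_heartrate,
             MAX(heartrate) AS max_heartrate,
             MIN(heartrate) AS min_heartrate
      FROM health_tracker_silver
      GROUP BY p_device_id
    """)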

H. Appending Files to an Existing Delta Table
We can modify existing Delta tables by appending files to an existing directory of Delta files. We append the next month of records, kept in the health_tracker_data_2020_02 table, using the INSERT INTO Spark SQL command, as shown in the following script:
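A sketch of the append; it assumes February's records were registered as the table health_tracker_data_2020_02 with the same columns as the Silver table.

    spark.sql("""
      INSERT INTO health_tracker_silver
      SELECT dte, time, heartrate, name, p_device_id
      FROM health_tracker_data_2020_02
    """)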

I. Assessing the Missing Records
After the batch update of the health_tracker_silver table, we counted the number of records in the table and discovered that some records were missing by counting the number of records per device. In Table I, device 4 appears to be missing 95 records. We run a query to discover when the missing records occur by displaying the number of records per day; there are no records for device 4 for the last few days of the month, as shown in the following Spark SQL query:

    SELECT dte as Date, p_device_id as Device_Id, heartrate as Heart_Rate_Reading
    FROM health_tracker_silver
    WHERE p_device_id IN (1, 4) AND dte > "20-2-2020"

In Fig. 5, the absence of records from the last few days of the month shows a phenomenon that often occurs in a production data pipeline: late-arriving data. Delta Lake allows us to process data as it arrives and is prepared to handle late-arriving data.
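For completeness, a sketch (assuming the column names used above) of the per-device count referenced at the start of this subsection, which exposes the missing records:

    spark.sql("""
      SELECT p_device_id AS Device_Id, COUNT(*) AS Record_Count
      FROM health_tracker_silver
      GROUP BY p_device_id
      ORDER BY p_device_id
    """).show()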

K. Identify Broken Readings in the Table
During the initial load of data into the health_tracker_silver table, we noted that there are broken records in the data; specifically, several negative readings were present even though it is impossible to record a negative heart rate. To assess the extent of these broken readings, we first create a temporary view for the broken readings in the health_tracker_silver table, and then we sum the records in that view, as shown in the following Spark SQL script:
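A sketch of the two steps; the view name broken_readings is an assumption.

    # Create a temporary view of the broken (negative) readings per day.
    spark.sql("""
      CREATE OR REPLACE TEMPORARY VIEW broken_readings AS
      SELECT dte, COUNT(*) AS broken_readings_count
      FROM health_tracker_silver
      WHERE heartrate < 0
      GROUP BY dte
    """)

    # Sum the broken readings across the whole table.
    spark.sql("SELECT SUM(broken_readings_count) FROM broken_readings").show()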

VI. CONCLUSION
Many companies use DWs in different areas to help in the decision-making process; DWs enhance business intelligence, data quality, and consistency, save time, and store historical data. In the age of big data, data is ingested and stored at an unprecedented rate, and the architecture of the traditional DW cannot manage such large amounts of data. The new data types from current data sources are autonomous, heterogeneous, scalable, and distributed, which requires modern technology and a modern DW architecture.
In this paper, we proposed a novel DW architecture called Lake Data Warehouse Architecture. Lake Data Warehouse Architecture is a new DW structure that integrates the traditional DW technique, the Hadoop framework, and Apache Spark into a hybrid solution. It preserves the familiar traditional DW and employs big data technologies and tools (Hadoop, Apache Spark, Data Lake, and Delta Lake) in a complementary way to support and enhance the existing architecture. Furthermore, it improves scalability and reduces the development costs of the traditional DW architecture.
Lake Data Warehouse Architecture has many competencies over the traditional DW architecture. It solves several issues that face integrating data from big data repositories and handles different data types from various sources: structured data (databases, spreadsheets), semi-structured data (XML and JSON files), and unstructured data (video, audio, images, emails, Word, PowerPoint, and PDF files). Moreover, it captures, stores, manages, and analyzes data volumes that cannot be handled by a traditional DW. Furthermore, it uses recent, inexpensive technologies such as the Hadoop platform, Data Lake, Delta Lake, and Apache Spark to minimize the time spent analyzing data and to reduce storage costs.