MapReduce Programs Simplification using a Query Criteria API

The Hadoop Distributed File System (HDFS) is an organized, distributed collection of files. It is designed to store very large volumes of data and to retrieve and analyze them efficiently in a short amount of time. To retrieve and analyze data from HDFS, MapReduce jobs must be created either directly, using a programming language such as Java, or indirectly, using a high-level language such as HiveQL or Pig Latin. Creating MapReduce programs in a general-purpose programming language is widely regarded as a difficult task that requires considerable effort, both to write the programs and to maintain them. Writing MapReduce code by hand takes a lot of time, introduces bugs, harms readability, and impedes optimizations. Practitioners working in the field of big data try to avoid long and complex programs. They look for simpler alternatives such as graphical interfaces, short scripts like Pig Latin, or even SQL queries. This article proposes a MapReduce Query API inspired by Hibernate Criteria to simplify the code of MapReduce programs. The API offers a set of predefined methods for expressing restrictions, projections, logical conditions and so on. An implementation of the Word Count example using the Query Criteria API is illustrated in this paper.
Keywords—Hadoop; HDFS; MapReduce


I. INTRODUCTION
Big data analysis has become a priority for all companies and organizations that want to stay competitive. To accomplish this task, companies use frameworks such as the Hadoop ecosystem, which provides both storage and processing despite the huge volume of data. Hadoop [1], [2] mainly consists of a distributed file system, HDFS [3], and a distributed computation framework, MapReduce [4].
To analyze data stored in HDFS, several languages are used, depending on the user's profile and skills. Programming languages such as Java are used to create MapReduce programs directly [5]. High-level languages are also used, such as Pig Latin scripts or HiveQL queries [6]. In this domain, several research projects have attempted to simplify the code of MapReduce programs to make them readable and easily maintainable.
This article suggests using an API called MapReduce Criteria, inspired by the Hibernate Criteria API, to hide the code of the restrictions and projections applied to the data stored in HDFS. The number of lines in MapReduce programs is thereby reduced, which facilitates readability and maintenance.
This paper is organized as follows. Section 2 describes the Hadoop ecosystem. Section 3 presents the motivation for using MapReduce Criteria. Section 4 discusses Pig, Hive and Sqoop as related work. Section 5 briefly describes the Query Criteria API. The last section contains final conclusions and points to further work.

II. HADOOP ECOSYSTEM
The Hadoop ecosystem is a set of popular frameworks [7] that provide distributed processing over a huge amount of data. Hadoop is designed to solve the data storage problems caused by the large amount of data generated each second. The data managed by this framework is processed in parallel by exploiting thousands of machines. Fig. 1 shows a simple architecture of the Hadoop ecosystem. With the multitude of solutions that currently exist in the market, each profile must carefully choose the Big Data solution that aligns with its skills. Analysts will likely find that they can ramp up on Hadoop faster by using Hadoop data warehouses such as Hive [8], Impala [9] and HAWQ, which are now frequently deployed at customer sites. Developers who want better control of the data flow process, and those who come from a procedural language background, will choose to work in Pig Latin. Despite the diversity of existing solutions, they all use the same HDFS for cluster storage and the same MapReduce model for distributed processing (Fig. 2).

A. Hadoop Distributed File System (HDFS)
To meet the ever-changing volumes of data processed every day, the Hadoop Distributed File System (HDFS) is designed to be highly fault-tolerant and to be deployed on low-cost hardware. HDFS is based on a master/slave architecture. It consists of a master server (NameNode) and a number of slaves (DataNodes), usually one per node of the cluster [3]. The NameNode manages the namespace of the file system and also orchestrates access to the files by the clients. The DataNodes manage the storage attached to the nodes on which they run. A simple HDFS architecture is given in Fig. 3.

B. MapReduce
The basis of the MapReduce framework was defined by Dean and Ghemawat in their 2004 paper [5]. MapReduce orchestrates the processing of large data sets using parallel computing on a cluster. It manages all issues related to partitioning the input data, scheduling the program's execution and transferring data. Several research papers focus on the MapReduce model, applying it to business domains [10], [11], addressing algorithmic issues [12], [13], or looking for optimizations [14], [15]. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.
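The two-function model can be illustrated with a minimal, single-machine sketch in plain Java. It does not use the Hadoop API; the class name MapReduceModelSketch and its helper methods are assumptions introduced only for illustration, and the grouping step that a real cluster performs across machines is simulated here with an in-memory map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process imitation of the Map / group / Reduce flow.
final class MapReduceModelSketch {

    // Map: turn one input line into intermediate (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Grouping step: collect all intermediate values that share the same key.
    // In a real cluster this is done by the framework, not by user code.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: merge the values of one key into a single count.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Chains the three steps over a small in-memory input.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        groupByKey(intermediate).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, cluster=1, data=2, node=1}
        System.out.println(wordCount(List.of("big data big cluster", "data node")));
    }
}
```

Only map and reduce correspond to code that the user writes; the grouping step stands in for the work that the framework performs between the two phases.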

III. MOTIVATION
MapReduce programs are at the core of all Big Data systems. Unfortunately, MapReduce programs have been criticized for several disadvantages, including the large number of instructions, the lack of readability and the difficulty of maintenance.
In order to reduce the number of instructions and to improve the readability and maintenance of MapReduce programs, we propose to use the MapReduce Query API, which will hide all the instructions related to:
• Restrictions like equal, not equal, less than, more than, etc.
• Logical expressions like AND, OR, XOR, etc.
• Orders, ascending and descending.
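As a rough illustration of these three categories, the following plain-Java sketch expresses restrictions, logical expressions and orders as reusable objects. The method names (eq, ne, lt, gt, and, or, xor, asc, desc) are modeled on the style of Hibernate's Restrictions and Order helpers, but the class RestrictionsSketch and its signatures are hypothetical and are not the API proposed in this paper. Records are assumed to be maps from field names to integer values, and the evaluation is done in memory rather than on a cluster.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper class; names and signatures are assumptions for illustration.
final class RestrictionsSketch {

    // A criterion is a predicate over one record (field name -> integer value).
    interface Criterion extends Predicate<Map<String, Integer>> {}

    // Restrictions: equal, not equal, less than, greater than.
    static Criterion eq(String field, int value) { return row -> row.get(field) == value; }
    static Criterion ne(String field, int value) { return row -> row.get(field) != value; }
    static Criterion lt(String field, int value) { return row -> row.get(field) < value; }
    static Criterion gt(String field, int value) { return row -> row.get(field) > value; }

    // Logical expressions combining two criteria.
    static Criterion and(Criterion a, Criterion b) { return row -> a.test(row) && b.test(row); }
    static Criterion or(Criterion a, Criterion b) { return row -> a.test(row) || b.test(row); }
    static Criterion xor(Criterion a, Criterion b) { return row -> a.test(row) ^ b.test(row); }

    // Orders: ascending and descending on one field.
    static Comparator<Map<String, Integer>> asc(String field) {
        return (left, right) -> Integer.compare(left.get(field), right.get(field));
    }
    static Comparator<Map<String, Integer>> desc(String field) {
        return asc(field).reversed();
    }

    // Applies the restriction, then the order, to an in-memory list of records.
    static List<Map<String, Integer>> list(List<Map<String, Integer>> rows,
                                           Criterion where,
                                           Comparator<Map<String, Integer>> order) {
        return rows.stream().filter(where).sorted(order).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> rows = List.of(
                Map.of("id", 1, "likes", 5),
                Map.of("id", 2, "likes", 12),
                Map.of("id", 3, "likes", 8));
        Criterion where = and(gt("likes", 6), ne("id", 4));
        List<Integer> ids = list(rows, where, desc("likes")).stream()
                .map(row -> row.get("id"))
                .collect(Collectors.toList());
        System.out.println(ids); // prints [2, 3]
    }
}
```

The point of the sketch is that the filtering and sorting logic is written once, behind named methods, so the calling code states only which restriction and which order it needs.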
We propose to apply all the methods defined in Hibernate Criteria [16] to MapReduce programs. Among the major concerns of big data solutions today are the optimization of execution times and the simplification of creating MapReduce programs. The solutions mentioned in the next section try to hide the complexity of MapReduce programs by generating MapReduce plans automatically.

IV. RELATED WORK
Different tools and sub-projects have been created to simplify the task for users who are less experienced with programming languages. Many frameworks have been implemented to help users who struggle with Hadoop, especially when writing MapReduce tasks. Among these solutions are Pig, Hive and Sqoop, described briefly in the following paragraphs.

A. Pig
Pig is a procedural language platform used to develop Pig Latin scripts [17]. A script is a sequence of steps, much like in a programming language, each of which carries out a single high-level data transformation, e.g., filtering, grouping, or aggregation.

B. Hive
Hive is a data warehouse solution that allows users to write SQL-like queries (HiveQL) and translates them into physical plans of MapReduce jobs using the Thrift Server. Hive offers many external interfaces (Command Line, Web UI, JDBC...) for interacting with its database [18]-[20]. Since version 2.0, Hive also supports procedural SQL on Hadoop [21].

C. Sqoop
Sqoop has also been adopted by the Apache Foundation. It performs bulk data transfers between Hadoop and structured data stores such as relational databases. Sqoop hides the complexity of MapReduce programs from users [22], [23].

V. PROPOSED WORK
The MapReduce Query API, inspired by the Hibernate Criteria API, represents a query against a particular file stored in HDFS. The interface will provide the same powerful mechanism as the Hibernate Criteria API and will allow the programmatic creation of queries against HDFS (Fig. 5). It is an alternative way to manipulate objects generated from data stored in HDFS. Using the MapReduce Query API requires specifying the structure of the data to be loaded from HDFS. This is the equivalent of object-relational mapping in the Hibernate framework. Any program based on the MapReduce Query API will be automatically translated into MapReduce programs according to a previously defined plan.
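To make the idea more concrete, the sketch below shows, under stated assumptions, how a criteria object could carry a declared record structure, a list of restrictions and a grouping, and how it could be evaluated with a map-like step followed by a reduce-like step. The class CriteriaPlanSketch, its methods (add, groupCount, list) and the sample field names are hypothetical and do not reproduce the API or the translation plan of this paper. The evaluation runs in memory instead of generating real MapReduce jobs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical criteria object; every name here is an assumption for illustration.
final class CriteriaPlanSketch {
    private final String separator;   // field separator used in the stored file
    private final String[] fields;    // declared structure of each line (the "mapping")
    private final List<Predicate<Map<String, String>>> restrictions = new ArrayList<>();
    private String groupField;

    CriteriaPlanSketch(String separator, String... fields) {
        this.separator = separator;
        this.fields = fields;
    }

    // Adds a restriction that every selected record must satisfy.
    CriteriaPlanSketch add(Predicate<Map<String, String>> restriction) {
        restrictions.add(restriction);
        return this;
    }

    // Requests a count of records per distinct value of the given field.
    CriteriaPlanSketch groupCount(String field) {
        this.groupField = field;
        return this;
    }

    // Map-like step: parse one line with the declared structure, apply the
    // restrictions, and return the grouping key (or null if the line is skipped).
    private String mapLine(String line) {
        String[] values = line.split(Pattern.quote(separator), -1);
        if (values.length != fields.length) {
            return null; // malformed line
        }
        Map<String, String> row = new HashMap<>();
        for (int i = 0; i < fields.length; i++) {
            row.put(fields[i], values[i]);
        }
        for (Predicate<Map<String, String>> restriction : restrictions) {
            if (!restriction.test(row)) {
                return null;
            }
        }
        return row.get(groupField);
    }

    // Reduce-like step: sum one occurrence per emitted key.
    Map<String, Integer> list(List<String> lines) {
        if (groupField == null) {
            throw new IllegalStateException("groupCount(field) must be called first");
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String key = mapLine(line);
            if (key != null) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("alice;en;nice post", "bob;fr;merci", "alice;en;thanks");
        Map<String, Integer> result = new CriteriaPlanSketch(";", "author", "lang", "text")
                .add(row -> row.get("lang").equals("en"))
                .groupCount("author")
                .list(lines);
        System.out.println(result); // prints {alice=2}
    }
}
```

In this sketch the caller never writes a mapper or a reducer; it only declares the structure of the data and the query, which is the separation of concerns that the proposed API aims for.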

VI. IMPLEMENTATION
WordCount is a well-known application that counts the number of occurrences of each word in a given set of files. The input for this implementation is a file of comments as detailed below:

A. WordCount Example without MapReduce Criteria
To develop a simple MapReduce example in the current model, it is necessary to create at least three classes: a "Mapper" class as shown in Table I, a "Reduce" class as shown in Table II and a "Main" class as shown in Table III.

VII. CONCLUSION
MapReduce is a programming model created to perform distributed processing of large datasets stored in a distributed file system such as HDFS. MapReduce programs are known to be difficult to create, to read and to maintain. Therefore, it is necessary to simplify them using frameworks or APIs. This paper has suggested using an API called MapReduce Criteria in order to reduce the number of MapReduce instructions and to hide the Mapper and Reducer classes from developers. In our future work we will compare MapReduce programs that use the Query Criteria API with existing languages that also simplify the use of MapReduce, such as Pig Latin and Hive.

Fig. 1. Simple architecture of Hadoop ecosystem.
Along with the market trends and the diversity of profiles that intervene on the data, several additional technical components have emerged; components for people who
The Map function, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function. The Reduce function, also written by the user, accepts an intermediate key and a set of values for that key. It merges these values to form a possibly smaller set of values (Fig. 4).

Fig. 4. Map function showing values to form a possibly smaller set of values.