Aspect-Combining Functions for Modular MapReduce Solutions

MapReduce is a programming framework for modular Big Data computation that uses a map function to identify and target intermediate data in the mapping phase, and a reduce function to summarize the output of the map function and give a final result. Because the inputs of the reduce function depend on the output of the map function, MapReduce permits defining a combining function for local aggregation in the mapping phase, which decreases the communication traffic from map functions to reduce functions. Hadoop MapReduce solutions do not guarantee that the combining function is applied. Even though there exist proposals for guaranteeing the execution of the combining function, they break the modular nature of MapReduce solutions. Because Aspect-Oriented Programming (AOP) is a programming paradigm that aims at modular software production, this article proposes and applies the Aspect-Combining function, an AOP combining function, to obtain a modular MapReduce solution. The results of applying Aspect-Combining in Hadoop MapReduce experiments highlight improvements in computing performance and modularity and a guaranteed execution of the combining function, with an AOP framework such as AspectJ as a mandatory requirement.
Keywords: Combining; Hadoop; MapReduce; AOP; AspectJ; aspects


I. INTRODUCTION
MapReduce is a computation framework that aims to solve Big Data and Big Computation issues [1]-[4]. Hadoop is a MapReduce implementation [4], [5] with two main components: the Hadoop Distributed File System (HDFS) from the 'Infrastructural' point of view and MapReduce for the 'Programming' aspect. HDFS is a distributed and scalable file system designed to run on clusters of commodity hardware. HDFS follows the write-once, read-many approach to store huge files using streaming data access patterns, which enables high-throughput data access and simplifies data coherency issues [4], [5]. HDFS shields developers from the details of distribution, coordination, synchronization, faults and failures, and supervision tasks. Thus, developers can focus on two main computation functionalities: map and reduce.
Aspect-Oriented Programming (AOP) is a programming methodology for isolating the functionality and data of crosscutting concerns to obtain modular solutions [6]. The ideas of obliviousness and advisable classes appear in AOP. Wampler [7] demonstrates that AOP supports and refines Object-Oriented Design (OOD) principles such as the Single Responsibility Principle (SRP) and the Open-Closed Principle (OCP), mainly to highlight the practical benefits of AOP.
Even though MapReduce isolates programmers from the faults and issues of traditional distributed programming approaches and frameworks, it requires expressing solutions through its two main functions: map and reduce. Thus, these functions can include code outside their inner nature, which is a clear example of crosscutting concerns according to good modular programming and AOP principles [7].
Hadoop allows defining a combining function on the map output [5], [8], [9] to optimize the MapReduce framework through local aggregation in the map phase, that is, a function that aggregates data in the map phase before sending them to the reduce phase. Because the combining function is only an optimization, Hadoop does not guarantee how many times it will call a defined combining function [8]. Thus, to guarantee the execution of combining, [8] proposed the 'In-Mapper' Combining function, i.e., placing the combining behavior directly inside the map function. Nonetheless, this solution does not respect object-oriented modularity principles such as the SRP [7], [10]. Looking for a modular application of the MapReduce programming framework, this article proposes and exemplifies Aspect-Combining, an application of AOP to MapReduce for defining combining functions. Thus, the main contributions of this article are:
• Giving a review of performance and modularity issues of MapReduce combining solutions.
• Locating and justifying the presence of crosscutting concerns in current optimal combining solutions.
• Defining and testing Aspect-Combining functions on classic case studies for getting more modular and usually more efficient results.
• Establishing the bases for future works about the symbiosis of Big Data and AOP solutions.
This article is organized as follows: Section II gives a description of the MapReduce framework and its main components. That section also explains the primary structure and principles of traditional AOP-AspectJ solutions.

A. MapReduce
MapReduce is a programming model proposed by Google [1]-[3] for distributed computation on massive amounts of data (Big Data), that is, an execution framework for large-scale data processing on clusters of commodity servers. MapReduce has already enjoyed widespread adoption through Hadoop, an open-source implementation of MapReduce [5], [8].
MapReduce can refer to three concepts: 1) a programming model; 2) an execution framework that coordinates the execution of programs written in this programming style; 3) an implementation of 1) and 2), that is, of the programming model and its execution framework. Google's MapReduce implementation is proprietary [1]-[3], and Hadoop is an open-source counterpart [5], [8].
Hadoop uses the Hadoop Distributed File System (HDFS), a highly fault-tolerant distributed file system able to run on commodity hardware [4], [5], [8]. HDFS provides high-throughput access to application data and is suitable for applications with large data sets, such as the set of valid configurations in a Software Product Line (SPL) [9]. The basic idea of MapReduce is to partition a large problem into smaller, possibly independent sub-problems that different workers can run in parallel, that is, threads in a processor core, cores in a multi-core processor, multiple processors in a multi-processor machine, or many machines in a cluster [4], [8]. Fig. 1 shows the Hadoop functioning architecture. An iteration of a Hadoop solution (a.k.a. job) normally executes in four steps:
1) Slicing, to split the source data into multiple splits and deliver them to each map-worker or mapper.
2) Map, to process the data (each mapper processes one or more chunks of data and sends the results to the shufflers).
3) Shuffle, to organize the data.
4) Reduce, to compact the results and write them back to disk.
Thus, intermediate results from each worker are combined to yield the final output. When the aggregation performed by the reduce function is commutative and associative, MapReduce allows defining a combining function to decrease the amount of data shuffled between map-workers and reduce-workers [5], [8]. Combining functions work on the output of map functions; hence, the output of combining functions represents the input of reduce functions. The MapReduce [1]-[3] execution framework coordinates the functioning of map-workers and reduce-workers.
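To make the four steps concrete, the following is a minimal single-process sketch of a WordCount job. It is not Hadoop code: the class MiniWordCountJob and its methods are hypothetical names chosen for illustration, the input is assumed to be already split, and no distribution or fault tolerance is modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative single-process sketch; class and method names are hypothetical, not Hadoop API.
class MiniWordCountJob {

    // Step 2 (Map): emit one (word, 1) pair per token of an input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(Map.entry(token, 1));
            }
        }
        return out;
    }

    // Step 3 (Shuffle): group intermediate values by key, ordered by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Step 4 (Reduce): sum the grouped values of one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs Map, Shuffle, and Reduce over splits produced by Step 1 (assumed done by the caller).
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String split : splits) {
            intermediate.addAll(map(split));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(intermediate).forEach((key, values) -> result.put(key, reduce(values)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or", "not to be"))); // {be=2, not=1, or=1, to=2}
    }
}
```

In this sketch every token crosses the shuffle as its own record, which is the traffic that a combining function is meant to reduce.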
Hadoop solutions usually allow defining a set of dependent jobs, i.e., the output of one job is used as input for others, and so on. A set of key-value records (K_in, V_in) is the input of a map function, and a list(K_inter, V_inter) corresponds to its output, that is, the input of the combining function if one is defined, or otherwise the input of the shuffling process. As mentioned, the input of combining functions corresponds to the output of mappers, and the output of combining functions is the input of the shuffling process. The shuffling process orders and distributes data for the reduce functions, which get (K_inter, list(V_inter)) as input and produce an output (K_out, V_out) that can be the input of other map functions, and so on. Fig. 2 illustrates this process.
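The key-value types of one job can be written down as generic signatures. The sketch below is only a reading aid under our own assumptions: MapFn, ReduceFn, and TypedJobSketch are hypothetical names, not Hadoop classes, and the small lengthHistogram example (counting lines by their length) is invented to exercise the types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical type sketch of one job; these interfaces are illustrative, not Hadoop classes.
class TypedJobSketch {

    // (K_in, V_in) -> list(K_inter, V_inter)
    interface MapFn<KI, VI, KM, VM> {
        List<Map.Entry<KM, VM>> map(KI key, VI value);
    }

    // (K_inter, list(V_inter)) -> (K_out, V_out)
    interface ReduceFn<KM, VM, KO, VO> {
        Map.Entry<KO, VO> reduce(KM key, List<VM> values);
    }

    // Applies map to every input record, groups by intermediate key, then reduces each group.
    static <KI, VI, KM extends Comparable<KM>, VM, KO, VO> List<Map.Entry<KO, VO>> run(
            List<Map.Entry<KI, VI>> input,
            MapFn<KI, VI, KM, VM> mapFn,
            ReduceFn<KM, VM, KO, VO> reduceFn) {
        Map<KM, List<VM>> grouped = new TreeMap<>();
        for (Map.Entry<KI, VI> record : input) {
            for (Map.Entry<KM, VM> pair : mapFn.map(record.getKey(), record.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        List<Map.Entry<KO, VO>> output = new ArrayList<>();
        grouped.forEach((key, values) -> output.add(reduceFn.reduce(key, values)));
        return output;
    }

    // Example: records are (lineOffset, line); output is (lineLength, numberOfLines).
    static List<Map.Entry<Integer, Integer>> lengthHistogram(List<Map.Entry<Long, String>> lines) {
        MapFn<Long, String, Integer, Integer> mapFn =
                (offset, line) -> List.of(Map.entry(line.length(), 1));
        ReduceFn<Integer, Integer, Integer, Integer> reduceFn =
                (length, ones) -> Map.entry(length, ones.size());
        return run(lines, mapFn, reduceFn);
    }
}
```

A combining function would fit between the two interfaces with the same input and output key-value types as the intermediate pairs, which is why it can be skipped without changing the types seen by reduce.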

B. Aspect-Oriented Programming
Aspect-Oriented Programming (AOP) [6] permits modularizing, as aspects, the crosscutting concerns of base classes in Object-Oriented Programming (OOP). Aspects advise classes statically, in defined advisable modules, and dynamically, like events. AOP languages like AspectJ [6] define oblivious advisable classes and modularize crosscutting concerns as aspects, that is, orthogonal methods that are not part of the nature of the advisable classes.
AOP modularizes homogeneous crosscutting concerns well as aspects [6], [7], [11]-[13]. However, aspects do not reflect the structure of refined features or the cohesion of classes for modularizing class collaborations [14], [15]. Moreover, AOP languages like AspectJ [13], [16], [17] introduce implicit dependencies between aspects and advisable classes [18]-[21]. Hence, first, aspects do not respect the information hiding principle because oblivious classes can experience unexpected changes in behavior and properties, and second, changes to the signature of advisable behavior can produce spurious and ineffective aspects. Thus, aspects need to know structural details of the advisable behavior and classes, which is a major obstacle to independent development.
Next, this article describes the main AOP elements.

C. Join points and Pointcuts
A join point represents an event in the execution control flow of a program, that is, "a thing that happens" [13], [16]. Hence, in AOP [6], [11], a join point is a point of the program execution at which aspects advise advisable base modules. Examples of join points in AspectJ are method calls, method executions, object instantiations, constructor executions, field references, and handler executions. According to [6], [7], a pointcut is a rule that picks out join points and exposes data from their execution context. Possible components of a pointcut definition are call(method pattern), execution(method pattern), get(field pattern), set(field pattern), the time of advisable methods, and the objects associated with an advisable method, among others.
For the pointcut definition of a method in AOP languages like AspectJ, two important times exist: when the method is called (call time) and when the method is in execution (execution time). Furthermore, we can differentiate between the target and this objects at the join point, that is, the object whose method is in execution and the object that invokes that method on the target object, respectively. Thus, this and target are the same object for execution pointcuts, whereas for call pointcuts this is the object that requests the execution of the target method.

D. Inter-type Declarations and Advices
In essence, an inter-type declaration statically injects fields, properties, and methods into existing advisable classes in AOP [6].
Advice defines crosscutting behavior with respect to a pointcut. Three types of advice exist in traditional AOP [6]: before, after, and around, which determine how an advice instance runs at every picked-out join point, that is, how the code injection works over the join points. Thus, in AOP languages like AspectJ, advice instances run before their join points, after their join points, or in place of (or "around") their join points. Fig. 3 [22] details the AspectJ components and functioning structure, that is, aspects advise oblivious base modules and present implicit dependencies on them. Fig. 4 [11] illustrates a basic AspectJ example: an advisable class HelloWorld and an aspect with two advice instances that inject behavior into the advisable class before and after calling any void method of the class HelloWorld whose name starts with the word say.
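The effect of before and after advice on methods whose name starts with say can be approximated in plain Java. The sketch below is a stand-in that uses JDK dynamic proxies rather than AspectJ weaving, so it runs without the AspectJ compiler; the Greeter interface, the PlainGreeter class, and the shared LOG list are hypothetical names introduced only for this illustration.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Stand-in for AspectJ advice using JDK dynamic proxies; all names here are hypothetical.
class AdviceSketch {

    interface Greeter {
        void sayHello(String name);
        void other();
    }

    // Records what ran and in which order.
    static final List<String> LOG = new ArrayList<>();

    // The oblivious base class: it contains no reference to the extra behavior.
    static class PlainGreeter implements Greeter {
        public void sayHello(String name) { LOG.add("hello " + name); }
        public void other() { LOG.add("other"); }
    }

    // Wraps a Greeter so that methods whose name starts with "say" (the "pointcut")
    // run with extra behavior before and after the original body (the "advice").
    static Greeter advise(Greeter target) {
        InvocationHandler handler = (proxy, method, args) -> {
            boolean matches = method.getName().startsWith("say");
            if (matches) LOG.add("before " + method.getName());
            Object result = method.invoke(target, args); // comparable to proceed() in around advice
            if (matches) LOG.add("after " + method.getName());
            return result;
        };
        return (Greeter) Proxy.newProxyInstance(
                Greeter.class.getClassLoader(), new Class<?>[] {Greeter.class}, handler);
    }

    public static void main(String[] args) {
        Greeter greeter = advise(new PlainGreeter());
        greeter.sayHello("world");
        greeter.other();
        System.out.println(LOG); // [before sayHello, hello world, after sayHello, other]
    }
}
```

Unlike AspectJ, this approach requires the caller to obtain the wrapped object explicitly, so the base code is oblivious but its clients are not; that difference is one reason the article relies on AspectJ weaving.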
AspectJ mainly aims at modular solutions that respect modularity principles [16]-[18], [23]. This paper seeks modular MapReduce solutions through the use of AspectJ in Hadoop solutions.

III. GROUPING DATA LOCALLY IN MAPREDUCE
MapReduce computation in Hadoop does not require programmers to pay attention to parallel-programming issues such as synchronization and deadlock [7], [8]. Hadoop MapReduce solutions may involve large, data-intensive transfers from map-worker to reduce-worker instances. Since data transfer can be costly, combining functions used for local aggregation can considerably diminish the number of map output records with the same key in the map-workers. In practice, the primary map and reduce functions in Hadoop [9] write intermediate results to local disk before sending them over the network. Those I/O processes may imply high computing and hardware costs depending on network latency and disk space. Thus, using combining functions minimizes the amount of intermediate data transferred from map-workers to reduce-workers. It also decreases the number and size of key-value pairs to shuffle from map-workers to reduce-workers, which improves the algorithmic efficiency of MapReduce. For this reason, combining functions are named "mini-reducers". In general, the use of combining functions seems adequate because, in traditional MapReduce solutions, map functions only recognize intermediate key-value pairs and send them to the shuffling and sorting process. The output of that process corresponds to the input of reduce-worker instances.
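The reduction in shuffled records can be illustrated by counting them with and without a local "mini-reducer". The following sketch is a single-process approximation under our own assumptions: CombinerTrafficSketch and its methods are hypothetical names, each string stands for the split of one map-worker, and the combiner is assumed to run exactly once per map-worker, which Hadoop itself does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: names are hypothetical and nothing here is Hadoop API.
class CombinerTrafficSketch {

    // Map output of one map-worker: one (word, 1) record per token of its split.
    static List<Map.Entry<String, Integer>> mapOutput(String split) {
        List<Map.Entry<String, Integer>> records = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                records.add(Map.entry(token, 1));
            }
        }
        return records;
    }

    // Combining: a "mini-reducer" that sums the values of equal keys locally.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> r : records) {
            sums.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return new ArrayList<>(sums.entrySet());
    }

    // Counts the records that would leave the map-workers for the shuffle.
    static int shuffledRecords(List<String> splits, boolean useCombiner) {
        int total = 0;
        for (String split : splits) {
            List<Map.Entry<String, Integer>> out = mapOutput(split);
            total += (useCombiner ? combine(out) : out).size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> splits = List.of("a a b", "a b b b");
        System.out.println(shuffledRecords(splits, false)); // 7 records without combining
        System.out.println(shuffledRecords(splits, true));  // 4 records with combining
    }
}
```

The saving depends entirely on how often keys repeat within one map-worker; with no repeated words, both counts are equal.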

A. Combining and 'In-Mapper' Combining
Even though map and reduce functions seem algorithmically simple to design and implement, combining functions represent performance improvements for cases of high data traffic between map-workers and reduce-workers. Combining functions act like reduce functions [8] because they minimize the amount of intermediate data generated by each map-worker. For example, WordCount and Average are two traditional solutions that support the use of a combining function; in the first case, the combining function works like the reduce function. Nevertheless, the execution of combining functions is not always effective [5], [8]. 'In-Mapper' Combining functions [8] address precisely those issues. The input for the WordCount example corresponds to a set of words. Fig. 7 shows a new class Palabra for grouping values (local aggregation) in the WordCount example. The mapper function in Fig. 5 only identifies words, whereas the mapper function in Fig. 6 identifies words and locally aggregates the count of the already identified words in the map function. Fig. 8 shows the MapReduce solution for the Average example, which obtains the average score of each student from a list of student and grade pairs. Fig. 9 shows an input example for the Average example.
Even though the 'In-Mapper' Combining approach reduces information traffic from map-workers to reduce-workers [8], it adds more code and responsibilities to map functions. For example, the 'In-Mapper' Combining of Fig. 6 includes a HashMap definition, and its map function performs two actions in the loop, one to recognize each word and add it to the HashMap, and another to update the previous values of existing words; map outputs these values after identifying all words and their number of occurrences in the received input value. The number of sent and received messages of this solution decreases if there are repeated words in the input. Nevertheless, the map function grows in code and responsibilities, that is, the map function of the 'In-Mapper' Combining approach is definitely less modular than its original version.
Like traditional combining functions, the goal of Aspect-Combining is to locally aggregate data in map-worker instances to diminish the associated network traffic from the map-workers in the shuffling process. Therefore, taking into account the components and functioning of 'In-Mapper' Combining solutions such as that of Fig. 6, a class that contains the map function should also contain an attribute for local aggregation and methods for that process. Thus, in the WordCount example, it is necessary to know each identified word and the number of its previous occurrences to update that number. Hence, new attributes and methods for advisable classes are required through inter-type declarations in an AOP context. Likewise, in the Average case, for each identified student, it is necessary to sum their grades and also to count the number of their grades.
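The growth in responsibilities of the 'In-Mapper' Combining map function can be seen by placing it next to the traditional one. The sketch below is not the code of Fig. 5 or Fig. 6, which are not reproduced here; it is a plain-Java approximation in which the Emitter interface and the class and method names are hypothetical, and a LinkedHashMap replaces the HashMap only to keep the output order deterministic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative comparison; Emitter and both map methods are hypothetical, not Hadoop API.
class InMapperWordCountSketch {

    interface Emitter {
        void emit(String key, int value);
    }

    // Traditional map: its only responsibility is to identify words.
    static void plainMap(String value, Emitter out) {
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.emit(token, 1);
            }
        }
    }

    // 'In-Mapper' Combining map: identifies words AND keeps, updates, and flushes local counts.
    static void inMapperMap(String value, Emitter out) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : value.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            Integer previous = counts.get(token);
            if (previous == null) {
                counts.put(token, 1);            // action 1: register a new word
            } else {
                counts.put(token, previous + 1); // action 2: update an existing word
            }
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.emit(e.getKey(), e.getValue());  // emit once per distinct word
        }
    }

    public static void main(String[] args) {
        int[] emitted = new int[2];
        plainMap("to be or not to be", (k, v) -> emitted[0]++);
        inMapperMap("to be or not to be", (k, v) -> emitted[1]++);
        System.out.println(emitted[0] + " vs " + emitted[1]); // 6 vs 4
    }
}
```

The second method emits fewer records, but it now owns a data structure and a flush step that have nothing to do with identifying words, which is the crosscutting concern that Aspect-Combining moves out of the mapper.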
As Fig. 11 shows, three events exist for code injection in the advisable map method: before starting the execution of the map method, to initialize the attributes that group values; around the execution of the map method, to group or create an identified element for local aggregation; and after the map method finishes, to send the information to the next MapReduce step. Independently of the injection time of these events, pointcut rules are definable in AOP and AspectJ, as is the time for injecting the new behavior code, which is analogous to the definition of AOP advice.
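The three injection events can be imitated in plain Java to show how the mapper stays oblivious. This sketch is a hand-written stand-in and not the AspectJ code of the article: CombiningAspect, WordMapper, Context, and runAdvised are hypothetical names, the interception is done by passing a wrapping Context instead of by weaving, and the grouped field only plays the role that an inter-type declared attribute would have.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hand-written imitation of the before/around/after events; not AspectJ and not Hadoop API.
class AspectCombiningSketch {

    interface Context {
        void write(String key, int value);
    }

    // Oblivious base code: the mapper only identifies words and writes (word, 1).
    static class WordMapper {
        void map(String value, Context context) {
            for (String token : value.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    context.write(token, 1);
                }
            }
        }
    }

    // Plays the role of the aspect: it owns the aggregation state and the three behaviors.
    static class CombiningAspect {
        private Map<String, Integer> grouped; // state the mapper class never declares

        void runAdvised(WordMapper mapper, String value, Context real) {
            grouped = new LinkedHashMap<>();                                // "before" map: initialize
            mapper.map(value, (k, v) -> grouped.merge(k, v, Integer::sum)); // "around" each write: group
            grouped.forEach((k, v) -> real.write(k, v));                    // "after" map: send grouped values
        }
    }

    public static void main(String[] args) {
        StringBuilder sent = new StringBuilder();
        new CombiningAspect().runAdvised(new WordMapper(), "to be or not to be",
                (k, v) -> sent.append(k).append('=').append(v).append(' '));
        System.out.println(sent.toString().trim()); // to=2 be=2 or=1 not=1
    }
}
```

WordMapper has a single responsibility and compiles without any knowledge of CombiningAspect, which is the modularity property the approach aims for; with AspectJ, the explicit runAdvised call would be replaced by pointcuts on the map and write join points.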

B. Results
In this section we discuss the results we obtained and how the null hypothesis has been rejected, thus accepting the alternative hypothesis.
We performed four experiments on the WordCount example and three experiments on the Average example to check the validity of the Aspect-Combining approach for modular MapReduce solutions in Hadoop. Although these experiments did not run on a cluster of machines, and although the main practical improvement of 'In-Mapper' Combining and Aspect-Combining is a reduction of traffic between map-workers and reduce-workers, Aspect-Combining surprisingly obtains better modularity and better performance for big-input examples. Only for a single small file does the traditional WordCount solution without combining obtain the best time. In the WordCount example with two files, the 'In-Mapper' WordCount solution is the best and Aspect-Combining the second best. For all other cases, Aspect-Combining presents the best performance. Thus, in addition to the best modularity, Aspect-Combining obtains efficient computation results in a single-machine execution. This situation would be the same in cluster environments.

V. DISCUSSION
The null hypothesis establishes that Aspect-Combining improves neither the modularity, regarding the presence of crosscutting concerns, nor the execution time compared to the Combining and 'In-Mapper' Combining solutions tested on the WordCount and Average examples. To refute that hypothesis, we reviewed the modular code of the Aspect-Combining solutions, in which the map function has only one responsibility, and we analyzed the execution time of the experiments described in Table II and Table III, both for random files of different sizes. Given the observed results, we accepted the alternative hypothesis that cases exist in which Aspect-Combining performs faster than Combining and 'In-Mapper' Combining and does not present crosscutting concerns.
The SRP establishes that each module or class should have one and only one purpose and reason to change, since if a class has more than one responsibility, then the responsibilities become coupled [10]. According to [7], "The SRP is the OOD solution to the classic 'separation-of concerns' problem". Thus, Aspect-Combining permits simple map functions and efficient MapReduce solutions, even though, after the weaving process of AOP, the code of 'In-Mapper' Combining and the final woven code of Aspect-Combining should be equivalent.

VI. CONCLUSIONS
In this section, we present the lessons we learned while developing the Aspect-Combining solutions:
• Aspect-Combining presents a practical symbiosis between MapReduce and AOP. In particular, this article presented a Hadoop and AspectJ implementation of Aspect-Combining.
• Considering the primary functions of MapReduce and their focus, original combining functions are usually adequate to preserve the nature and simplicity of the map function. Nonetheless, this article pointed out their non-effectiveness and cost. 'In-Mapper' Combining therefore seems more practical, but it does not respect modularity principles. Hence, this article presented and showed in practice the benefits of Aspect-Combining for modular MapReduce solutions and, for big data input, possibly more efficient results than the Combining and 'In-Mapper' Combining Hadoop solutions.
• Although a class for the map-worker permits the production of modular solutions, the programmer is in charge of attending to the Initialize, Map, and Close methods, that is, the setup(..), map(..), and cleanup(..) methods in Hadoop, which does not permit independent development. The Aspect-Combining approach separates these functions as advice instances, so the map-worker focuses only on an oblivious map(..) method, while before(..), around(..), and after(..) advice instances operate similarly to the Initialize, Map, and Close methods of Fig. 18 [8].
As future work, this research group plans to further review AOP in MapReduce applications to determine the applicability of other practical AOP approaches, such as JPI [18], [20], [22], [24] and Ptolemy [17], to Hadoop [5], [8] and Giraph [25], and to compare their effectiveness and practical performance. Giraph also permits defining combining functions without a guarantee of their execution [25], and Aspect-Combining seems adequate to guarantee their execution.
Section III reviews the previous 'In-Mapper' Combining function and identifies crosscutting-concern issues to define Aspect-Combining functions. Section IV defines the hypotheses and variables to measure in the experiments, and presents the results of using Combining, 'In-Mapper' Combining, and the Aspect-Combining proposal on a few application examples to highlight the main practical pros and cons of the Aspect-Combining function. Section V discusses the validity of the established hypotheses. Section VI concludes and presents future research work.

Fig. 5 presents a traditional Hadoop MapReduce solution for the WordCount example, and Fig. 6 shows an 'In-Mapper' Combining function that locally aggregates data in the map phase and reduces the information traffic between map-worker and reduce-worker instances.
Fig. 10 illustrates the GradeCount class necessary for the local aggregation in the Average example. Note that, for the 'In-Mapper' Combining solution of the WordCount example, the map function produces the same output as in a traditional MapReduce solution, i.e., the reduce function remains the same. Nevertheless, as Lin and Dyer [8] illustrate, the map and reduce functions of 'In-Mapper' Combining for the Average example do not produce and receive the same values as the map and reduce functions of a traditional MapReduce solution of that example.
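The reason the Average example changes the values exchanged between map and reduce is that an average cannot be combined by averaging partial averages; a (sum, count) pair has to travel instead. The sketch below illustrates this under our own assumptions: SumCount is a hypothetical class that plays the role the text assigns to GradeCount (whose actual fields are not shown here), input lines are assumed to have the form "student grade", and the student names are invented.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only; SumCount and the method names are hypothetical, not the code of Fig. 10.
class AverageCombiningSketch {

    // Partial aggregate per student: total of grades and number of grades.
    static final class SumCount {
        final long sum;
        final long count;

        SumCount(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        SumCount plus(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }
    }

    // Map with local aggregation: emits one (student, (sum, count)) pair per student
    // instead of one (student, grade) pair per input line.
    static Map<String, SumCount> mapWithLocalAggregation(List<String> lines) {
        Map<String, SumCount> partial = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            partial.merge(parts[0], new SumCount(Long.parseLong(parts[1]), 1), SumCount::plus);
        }
        return partial;
    }

    // Reduce: merges the partial pairs of all map-workers, then divides once per student.
    static Map<String, Double> reduce(List<Map<String, SumCount>> mapOutputs) {
        Map<String, SumCount> total = new TreeMap<>();
        for (Map<String, SumCount> output : mapOutputs) {
            output.forEach((student, sc) -> total.merge(student, sc, SumCount::plus));
        }
        Map<String, Double> averages = new TreeMap<>();
        total.forEach((student, sc) -> averages.put(student, (double) sc.sum / sc.count));
        return averages;
    }

    public static void main(String[] args) {
        Map<String, SumCount> worker1 = mapWithLocalAggregation(List.of("ana 80", "ana 100", "luis 70"));
        Map<String, SumCount> worker2 = mapWithLocalAggregation(List.of("ana 60"));
        System.out.println(reduce(List.of(worker1, worker2))); // {ana=80.0, luis=70.0}
    }
}
```

In this run, averaging the two partial averages for ana (90 and 60) would give 75 rather than the correct 80, which is why the reduce function must receive sums and counts and therefore differs from the traditional one.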
Table I shows the hypotheses and the variables used when conducting experiments on the classic Combining, 'In-Mapper' Combining, and Aspect-Combining functions on the WordCount and Average examples. For each experiment of Table I, the null hypothesis establishes that Aspect-Combining neither performs faster nor is more modular than the classic Combining and 'In-Mapper' Combining solutions on the analyzed examples.

Fig. 12 and 13 present the definition of pointcut instances for the Aspect-Combining of the WordCount and Average examples, in this case, three pointcuts for each example: one to start collecting data, one for the calls to the collect and write methods inside the advised map methods, and one for the end of the map method execution.

Fig. 14 and 15 show the Aspect-Combining inter-type declarations for the WordCount and Average examples, which add and manipulate the required object collections, an ArrayList of Palabra instances and a HashMap of GradeCount instances, respectively. Finally, for Aspect-Combining in the WordCount and Average examples, Fig. 16 and 17 present the advice instances: before the execution of the map method, to initialize the attribute for local aggregation; around the process of grouping values; and after the execution of the map method, to effectively send the locally grouped values to the next MapReduce step. As a practical evaluation of functioning and results, Tables II and III present the results of the traditional, 'In-Mapper' Combining, and Aspect-Combining solutions for the WordCount and Average examples for comparison. We ran the experiments on a single Lenovo ThinkPad Edge E530 laptop with a 2.50 GHz Core i3 processor and 16 GB of RAM. For the WordCount examples, the input file Words is a text file of 168 bytes, and ebook is a file of 1.6 MB; for the Average examples, the input files were generated taking into account ten students and grades from 0 to 100.

Fig. 13. Pointcut definition for Aspect-Combining in the Average example.

Fig. 15. Inter-type declaration for Aspect-Combining in the Average example.

Fig. 18. A modular structure of Mapper class in MapReduce Solutions.

TABLE I. HYPOTHESES AND DESIGN OF EXPERIMENTS FOR WORDCOUNT AND AVERAGE MAPREDUCE EXAMPLES
Null Hypothesis (H0): Aspect-Combining solutions are neither faster nor more modular than Combining and 'In-Mapper' Combining for the WordCount case-study.
Alt. Hypothesis (H1): Cases exist in which Aspect-Combining solutions perform faster than Combining and 'In-Mapper' Combining and do not present crosscutting concerns for the WordCount case-study.
Files used as input: Randomly generated files of words of name and grade. File sizes range from 10 KB to 1 MB.
Blocking variables: In each experiment, we generated a set of files of increasing size.

TABLE III. AVERAGE SOLUTIONS - PRACTICAL EVALUATION