Time-Saving Approach for Optimal Mining of Association Rules

Data mining is the process of analyzing data to extract useful information that users can exploit. Association rule mining is a data mining technique used to detect correlations and to reveal relationships among individual data items in huge databases. These rules usually take the form "if X then Y", with X and Y as independent attributes. Association rules have become popular in several vital fields of activity such as insurance, medicine, banking and supermarkets. They are generated in huge numbers by algorithms known as Association Rules Mining algorithms. Generating such huge quantities of association rules can be time- and effort-consuming, hence the urgent need for an efficient and scalable approach that mines only the relevant and significant association rules. This paper proposes an innovative approach which mines the optimal rules from a large set of association rules in a distributive way, to improve efficiency and to decrease the running time.

Keywords: MDPREF Algorithm; Association Rules mining; Data partitioning; Optimization (profitability, efficiency and risks); Bagging


I. INTRODUCTION
Big data is an important research topic and it has attracted considerable attention. Huge numbers of data sets lie unused and redundant in the databases of companies, universities, etc. Discovering the unused and redundant information stored in these databases relies on an efficient KDD (Knowledge Discovery in Databases) process. The latter not only retrieves data or lets researchers find new information from data [1], but can also reveal the patterns and relationships among large amounts of data in a single or several data sets. The KDD process makes use of several techniques from statistics and artificial intelligence in a variety of activities. The main activities are as follows [2][3][4][5][6][7][8][9][10][11]: Association Rules; Clustering; Classification; Regression; and Prediction.

We are particularly interested in association rules mining, together with classification and clustering, two of the major data mining applications in which pattern mining is extensively used to transform raw data into a pattern-based description that is accepted and processed by classification and clustering algorithms. In this context, patterns which occur in data are simply considered as features that characterize data. Patterns describing the data are also called explanatory variables. Association Rules Mining, in turn, is one of the most common algorithm-based data mining techniques; it can be defined as the extraction or generation of interesting relationships and correlations among items in large amounts of data [10].

Although Association Rules, Clustering and Classification are the techniques extensively used in this paper, Regression and Prediction will be taken into account in our future work to reinforce the reliability and to improve the quality of results. To avoid possible confusion or misunderstanding, we provide below the definitions of the activities meant by both concepts:
• Regression, for a set of items, is the analysis of the relationships of dependence between the values of the attributes. A model is automatically produced that can predict attribute values for new items.
• Prediction, for a specific item and a corresponding model, is the ability to predict the value of a specific attribute. For example, in a predictive model for treatment schema, prediction is used to determine the next procedure in the sequence of treatment.
Lately, many algorithms have been suggested in the literature, for instance Close, Close+, Charm and Sky Rules, to help generate association rules, either by improving the pattern extraction process or by introducing other criteria and factors to determine which rule to keep and which one to discard [3]. However, these algorithms mainly target centralized computing systems and are evaluated on relatively small databases. Yet the number of generated association rules is huge, and modern databases are growing dramatically in size. Consequently, several parallel and distributed solutions have been proposed to tackle this issue. In addition, many distributed frameworks have been used to deal with the existing abundance of data. These distributed frameworks focus on the challenges of distributed system building and on simple programming models for data analysis. To solve these problems, we think that a data partitioning technique considering data characteristics should be applied. In this paper, we propose a scalable and distributive approach for large-scale frequent association rules mining. The proposed approach offers the possibility to apply any of the known association rules mining algorithms in a distributive way. In addition, it allows any of the known clustering or classification algorithms to be applied as partitioning techniques for the association rules set.

Besides the introduction, the paper is made up of four sections, each of which deals with a particular aspect: Section II deals with the necessary definitions; Section III describes the proposed approach of large-scale association rules mining; Section IV is concerned with the experiments we have carried out to put the proposed approach into practice. The last section concludes the paper and reveals our willingness to continue research for better results.

II. BACKGROUND AND PROBLEM DEFINITION
The cost measure of a partitioning technique is interpreted as follows: a large cost value indicates that the runtime values are far from the mean value, and a small cost value indicates that the runtime values are near the mean value. The smaller the value of the cost is, the more efficient the partitioning is.

A. Distributed machine learning and data mining techniques
Data mining and machine learning hold a vast scope for using the various aspects of Big Data technologies to scale existing algorithms and to solve some of the related challenges [4]. In the following, we present existing works on distributed machine learning and data mining techniques.

a) NIMBLE
NIMBLE [5] is a portable infrastructure that has been specifically designed to enable the implementation of parallel Machine Learning (ML) and Data Mining (DM) algorithms.
The NIMBLE approach allows composing parallel Machine Learning and Data Mining algorithms ("ML-DM algorithms") using reusable (serial and parallel) building blocks that can be efficiently executed using MapReduce and other parallel programming models. The programming abstractions of NIMBLE have been designed to parallelize ML-DM computations and to allow users to specify data-parallel, task-parallel and even pipelined computations.
The NIMBLE approach has been used to implement some popular data mining algorithms such as k-Means Clustering, Pattern Growth-based Frequent Itemset Mining, k-Nearest Neighbors, Random Decision Trees and the RBRP-based Outlier Detection algorithm.
As shown in Fig. 1, NIMBLE is divided into four distinct layers:
1) The user API layer, which provides the programming interface to the users. Within this layer, users are able to design tasks and Directed Acyclic Graphs (DAGs) of tasks to indicate dependencies between tasks. A task processes one or more datasets in parallel and produces one or more datasets as output.
2) The architecture independent layer, which acts as the middleware between the user-specified tasks/DAGs and the underlying architecture dependent layer. This layer is responsible for scheduling tasks and delivering the results to the users.
3) The architecture dependent layer, which consists of harnesses that allow NIMBLE to run portably on various platforms. Currently, NIMBLE only supports execution on the Hadoop framework.
4) The hardware layer, which consists of the cluster used.

On the one hand, DML is a system whose main goal is to simplify the usage or development of Machine Learning algorithms; it separates algorithms from data representation and execution plans. On the other hand, the DML language exposes arithmetical and linear algebra primitives on matrices that are natural for expressing a large class of Machine Learning algorithms.
2) The High-Level Operator Component (HOP): It analyzes all the operations within a statement block and chooses from multiple high-level execution plans. A plan is represented as a DAG of basic operations (called hops) over matrices and scalars.
3) The Low-Level Operator Component (LOP): It translates the high-level execution plans provided by the HOP component into low-level physical plans on MapReduce.
4) The runtime component: It executes the low-level plans obtained from the LOP component on Hadoop.

PARMA is built on top of MapReduce and the computations are performed twice, following the two processing steps of MapReduce. As shown in Fig. 3, the ellipses represent data, the squares represent calculations on that data and the arrows show the movement of data through the system.
PARMA creates multiple small random samples of the transactional dataset in Phase 1 ("Map 1") and runs a mining algorithm on the samples independently and in parallel in Phase 2 ("Reduce 1"). The output results from each sample, labeled with an "id" in Phase 3 ("Map 2"), are aggregated and filtered in Phase 4 ("Reduce 2") to provide a single collection as output, which is a global set of Frequent Itemsets and Association Rules. The final result of PARMA is an approximation of the exact solution, since it mines random subsets of the input dataset.

Fig. 3. An overview of the software architecture of PARMA

Table I presents the most popular data mining and machine learning techniques. For each technique, it lists the programming model, the implemented techniques and the programming language. We notice that the input and the output of the above presented approaches are user-defined.
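The four PARMA phases described above (sample, mine each sample, label, aggregate and filter) can be illustrated with a small single-process sketch. It is not PARMA's actual implementation: the brute-force miner `frequent_itemsets`, the vote threshold `min_votes` and the averaging of frequencies are all assumptions made for illustration, and the phases run sequentially here rather than on MapReduce.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force miner: {itemset: relative frequency} for itemsets up to max_size."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def sample_and_aggregate(transactions, n_samples, sample_size,
                         min_support, min_votes, seed=0):
    rng = random.Random(seed)
    # "Map 1": draw independent random samples (with replacement).
    samples = [rng.choices(transactions, k=sample_size) for _ in range(n_samples)]
    # "Reduce 1": mine every sample on its own.
    local_results = [frequent_itemsets(s, min_support) for s in samples]
    # "Map 2" / "Reduce 2": group by itemset, keep itemsets found in enough samples.
    votes = Counter()
    freq_sum = defaultdict(float)
    for result in local_results:
        for itemset, freq in result.items():
            votes[itemset] += 1
            freq_sum[itemset] += freq
    # Assumed aggregation rule: average the frequencies reported by the samples.
    return {s: freq_sum[s] / votes[s] for s in votes if votes[s] >= min_votes}
```

Because each sample is only a random subset, the returned frequencies are estimates, which matches the remark that the final result is an approximation of the exact solution.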

IV. OVERVIEW OF THE PROPOSED APPROACH
1) The point of departure is the Association Rules set (input), which is first distributed into J partitions (where $1 \le J \le k$); these are processed simultaneously by the MDPREF algorithm, which is itself distributed among the J partitions (see Fig. 4).
2) MDPREF-J is an instance of the MDPREF algorithm that we execute on its assigned data partition (the J-th partition) to generate the corresponding locally Most Dominant and Preferential association rules.
3) The Optimizing step uses the sets of locally Most Dominant and Preferential association rules as input and computes the profitability, efficiency and risks of each one of the partitions. Then, it outputs the set of the globally optimal Most Dominant and Preferential association rules, i.e., the Association Rules that are undominated, most preferable and most efficient in the whole association rules set (AR-Set).
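The three steps above can be sketched as a small pipeline. This is only an illustration under strong assumptions: each rule is reduced to a tuple of measures where larger is better, a plain dominance filter (`undominated`) stands in for both MDPREF and the Optimizing step, and user preferences, profitability, efficiency and risks are not modeled. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def dominates(a, b):
    """True if measure vector a is >= b on every measure and > b on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def undominated(rules):
    """Keep the rules (name -> measures) whose measures no other rule dominates."""
    return {name: m for name, m in rules.items()
            if not any(dominates(other, m) for other in rules.values())}


def mine_in_partitions(partitions):
    # Step 2: run the local filter on every partition concurrently.
    with ThreadPoolExecutor() as pool:
        local_results = list(pool.map(undominated, partitions))
    # Step 3: merge the local winners and filter once more over the merged set.
    merged = {}
    for result in local_results:
        merged.update(result)
    return undominated(merged)


partitions = [
    {"r1": (0.9, 0.8), "r2": (0.5, 0.4)},
    {"r3": (0.95, 0.7), "r4": (0.6, 0.6)},
]
globally_best = mine_in_partitions(partitions)
```

In this example `r1` and `r3` survive: each wins in its own partition, and neither dominates the other globally.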

B. Data partitioning
In the data partitioning step, several techniques can be used to divide the dataset into a number of partitions with respect to particular criteria, such as similarity or the nearest-neighbor criterion. These techniques may involve algorithms using different measures to partition data. It follows that the output of a technique applied to the input is a homogeneous set, and this homogeneity reflects the criterion which the technique uses. Therefore, if the output is a group of similar association rules, then the criterion used must be that of similarity. If the technique takes into consideration the distance between elements, it generates a set of equidistant elements with regard to a determined central point. In our case the input database, a set of Association Rules $\text{AR-Set} = \{AR_1, \dots, AR_n\}$, is partitioned into a user-specified number "k" of partitions. The output is a set of partitions $Part(\text{AR-Set}) = \{Part_1(\text{AR-Set}), \dots, Part_k(\text{AR-Set})\}$.
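As one illustration of distance-based partitioning into a user-specified number k of partitions, the sketch below runs a plain k-means over rule measure vectors. The representation of a rule as a tuple of measures, the seeding with the first k rules and the fixed iteration count are assumptions chosen to keep the example short and deterministic; it requires k to be no larger than the number of rules.

```python
def kmeans_partition(rules, k, iterations=10):
    """Split rules (name -> measure vector) into k partitions with plain k-means."""
    names = list(rules)
    # Assumption: the first k rules seed the centroids (deterministic, not optimal).
    centroids = [rules[n] for n in names[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assign every rule to its nearest centroid (squared Euclidean distance).
        for n in names:
            dists = [sum((a - b) ** 2 for a, b in zip(rules[n], c)) for c in centroids]
            assignment[n] = dists.index(min(dists))
        # Move each centroid to the mean of its members; empty clusters stay put.
        for j in range(k):
            members = [rules[n] for n in names if assignment[n] == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return [{n: rules[n] for n in names if assignment[n] == j} for j in range(k)]


ar_set = {"r1": (0.9, 0.9), "r2": (0.1, 0.1), "r3": (0.85, 0.95), "r4": (0.15, 0.05)}
parts = kmeans_partition(ar_set, k=2)
```

Here the two high-measure rules end up in one partition and the two low-measure rules in the other, so each partition is homogeneous with respect to the distance criterion.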
The proposed framework allows many partitioning techniques for the Association Rules set, such as k-Means, k-Medoids and Decision Tree, in addition to other partitioning techniques or meta-algorithms like Bagging and Boosting, whose objective is to improve predictions, classification and accuracy.

In the last step, the Optimizing step, we run an algorithm that determines the optimal set formed by the locally MDPREF Association Rules with regard to the minimization of "Risks" and to the maximization of the "profitability-efficiency" of Association Rules.

D. MDPREF Algorithm
MDPREF stands for the Most Dominant and Preferential rules. It is based on dominance, preference and user profile. Besides being threshold-free, MDPREF solves the subjectivity problem and keeps all measures so that no information is lost. Its main goal is to successfully discover, filter and prune association rules into subsets verifying a two-sided criterion: each rule in a subset must be both the most dominant and the most preferred by the user. To reach this objective, the algorithm takes the time factor into account while processing the following tasks [8]:
• Creates a referential rule (r_T) which dominates all the rules (having the maximum measurements);
• Computes the degree of similarity of all the rules, one by one, with the referential rule (r_T);
• …

The quality of rules is assessed by the two measures of dominance and preference, which are inherent in the algorithm. The first one is a statistical measure and the second one is subjective, related to users. Rules failing to meet these two measures are not mined.

V. EXPERIMENTS
This section presents an experimental study of the proposed approach on real datasets. First, it describes the datasets that have been used and the details of implementation. Then, it introduces a discussion of the results.

A. Experimental setup
The datasets used in the experimental study are presented in Table III. The proposed approach uses six datasets, Diabete, Flare, Iris, Monks, Nursery and Zoo, taken from the "UCI Machine Learning Repository". Each dataset deals with a particular domain, such as human health, animals or agriculture, and is characterized by an item count, a transaction count and the total count of association rules. The experiments were run on a machine with a 1.73 GHz processor and a memory capacity of 2 GB.
Fig. 5 illustrates the effect of the proposed partitioning method on the rate of lost association rules. We can easily see in Fig. 5 that the proposed partitioning method yields low loss rates, especially for low values of the tolerance rate. To study the scalability of the proposed approach and to show the impact of the number of machines used on the runtime of large-scale Association Rules mining, Fig. 6 presents the runtime of the proposed approach for each number of MDPREF(i) machines.
As illustrated in Fig. 6, the proposed approach scales with the number of machines. In fact, the execution time of the proposed approach is inversely proportional to the number of machines.

Definition 4 (Cost of a partitioning method) Let $R = \{Runtime_1(PM), \dots, Runtime_N(PM)\}$ be a set of runtime values. $Runtime_j(PM)$ represents the runtime of computing MDPREF rules in the partition j ($Part_j$) of the database. The operator E denotes the average or expected value of R. Let μ be the mean value of R: $\mu = E[R]$.
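The cost formula itself does not survive in this text. One reading consistent with the description (a cost that grows when runtimes are far from the mean μ and shrinks when they are near it) is the standard deviation of R, i.e. the square root of E[(R − μ)²]. The sketch below assumes that reading; `partitioning_cost` is a hypothetical name, and the runtime values are made up for illustration.

```python
from statistics import mean, pstdev


def partitioning_cost(runtimes):
    """Assumed cost: population standard deviation of the per-partition runtimes."""
    return pstdev(runtimes)


# Two hypothetical partitionings with the same mean runtime (10.0 time units).
balanced = [10.0, 10.0, 10.0, 10.0]
skewed = [2.0, 4.0, 4.0, 30.0]

cost_balanced = partitioning_cost(balanced)
cost_skewed = partitioning_cost(skewed)
```

Under this assumption the balanced partitioning has a lower cost than the skewed one, even though both have the same mean runtime, which is the behavior the definition asks for.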

Fig. 1. An overview of the software architecture of NIMBLE

b) SystemML
SystemML [6] is a system that enables the development of large-scale Machine Learning algorithms. It first expresses a Machine Learning algorithm in a higher-level language called Declarative Machine learning Language (DML). Then, it executes the algorithm in a MapReduce environment.

Fig. 2. An overview of the software architecture of SystemML

c) PARMA
PARMA [7] is a parallel randomized algorithm for mining Frequent Itemsets (FIs) and Association Rules (ARs).

Fig. 4. An overview of the software architecture of our approach

AR-Set. We define a partitioning of the database over k partitions by the following: $Part(\text{AR-Set}) = \{Part_1(\text{AR-Set}), \dots, Part_k(\text{AR-Set})\}$ such that:

C. Distributive Association Rules mining
The distributive ARM step mines a set of sub-sets of locally Most Dominant and Preferential association rules, named MDPREF Association Rules sub-sets. The input of this step is a partition of the AR-Set: $Part(\text{AR-Set}) = \{Part_1(\text{AR-Set}), \dots, Part_k(\text{AR-Set})\}$. The Distributive Association Rules mining step amounts to running the MDPREF algorithm on each partition $Part_j(\text{AR-Set})$ in parallel.


• Determines the dominant rule r* (which has a minimal "degree of similarity" with the referential rule r_T);
• Discards all the rules dominated by r*;
• If two rules are equivalent, resorts to the user's preferences to determine which one to keep;
• Keeps both if the decision maker is indifferent with regard to the equivalent rules; otherwise keeps the one satisfying more preferences;
• Drops all rules whose user preferences are already covered by those previously handled;
• Keeps rules covering user preferences other than those already covered by the rules previously selected [9].

ALGORITHM: "MDPREF" Algorithm
1.0 Input: Set of Rules + Set of Measures + Preference Set Ω (R, M, Pref)
2.0 Output: The Most Dominant and Preferential Rules MDPREF
3.0 Begin
…
while C ≠ Ø do
7.0 Create a referential rule r_T having a max of measure value
8.0 r* ← r ∈ C having a min(DegSim(r, r_T))
9.0 For (i = 1 to k = |C|) do
10.0 For (j = 1 to k') do
…
18.0 If r_i[m_j] ≥ r*[m_j]
19.0 …
… ) = max S «r_i the most preferred rule in S»
28.0 Z = Z ∪ Zbest(r_i)
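Since several steps of the listing are missing, the following is only a sketch of the dominance part of the loop (referential rule, closest rule r*, discarding dominated rules), not a reconstruction of MDPREF. The definition of `deg_sim` as a mean absolute gap to r_T is an assumption, larger measure values are assumed to be better, and the preference-based steps are left out entirely.

```python
def deg_sim(rule, reference):
    """Assumed DegSim: mean absolute gap to the referential rule (smaller = closer)."""
    return sum(abs(a - b) for a, b in zip(reference, rule)) / len(rule)


def dominates(a, b):
    """True if a is >= b on every measure and > b on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def most_dominant_rules(rules):
    """Repeatedly pick the rule closest to r_T and discard everything it dominates."""
    candidates = dict(rules)
    selected = {}
    while candidates:
        # Referential rule r_T: best value of every measure among the candidates.
        r_top = tuple(max(col) for col in zip(*candidates.values()))
        # Dominant rule r*: the candidate with minimal DegSim to r_T.
        star = min(candidates, key=lambda n: deg_sim(candidates[n], r_top))
        star_measures = candidates.pop(star)
        selected[star] = star_measures
        # Discard every remaining rule dominated by r*.
        candidates = {n: m for n, m in candidates.items()
                      if not dominates(star_measures, m)}
    return selected


rules = {"r1": (0.9, 0.8), "r2": (0.5, 0.4), "r3": (0.95, 0.7), "r4": (0.6, 0.6)}
kept = most_dominant_rules(rules)
```

In this example `r1` is picked first and eliminates `r2` and `r4`; `r3` is not dominated by `r1` and is kept in the next pass. Rules with identical measures are both retained, which corresponds to the indifferent case in the task list above.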

Fig. 5. Effect of partitioning method on the rate of lost Association Rules

Definition 2 (MDPREF rules) MDPREF is an algorithm whose name is short for the Most Dominant and Preferential rules. It is based on the notions of dominance, preference and user profile.

TABLE I. OVERVIEW OF DATA MINING AND MACHINE LEARNING TECHNIQUES

TABLE II
Let $\text{AR-Set} = \{AR_1, \dots, AR_n\}$ be a set of Association Rules. For $1 \le j \le k$, let

TABLE III. DESCRIPTION OF DATA SETS