Map Reduce Based REmoving Dependency on K and Initial Centroid Selection (MR-REDIC) Algorithm for Clustering of Mixed Data

In machine learning, clustering is a widely used task for finding the hidden structure of data. When handling massive amounts of data, traditional clustering algorithms degrade in performance because of the size of the data and the mixed types of its attributes. The Removal Dependency on K and Initial Centroid Selection (REDIC) algorithm is designed to handle mixed data with a frequency-based dissimilarity measure for categorical attributes. The selection of initial centroids and the prior decision on the number of clusters improve the efficiency of the REDIC algorithm. To deal with large-scale data, the REDIC algorithm is migrated to the Map Reduce paradigm, and the Map Reduce based REDIC (MR-REDIC) algorithm is proposed. The large amount of data is divided into small chunks, and a parallel approach is used to reduce the execution time of the algorithm. The proposed algorithm inherits the features of the REDIC algorithm for clustering the data. The algorithm is implemented in a Hadoop environment with three different configurations and evaluated using five benchmark datasets. Experimental results show that the Speed up value shifts gradually towards linear as the number of data nodes increases from one to four. The algorithm also achieves a nearly constant value for the Scale up parameter while maintaining its accuracy.

Keywords: Machine learning; clustering; similarity measurement; initial centroid selection; number of clusters; map reduce paradigm


I. INTRODUCTION
In the current era of machine learning, unsupervised algorithms that group data without prior knowledge are widely adopted for data segregation. Clustering algorithms have been applied for better results in web mining, text mining, image processing, stock prediction, signal processing, biology and other fields of science and engineering [1] [2]. Clustering aims to find the hidden structure of the underlying dataset without any prior information; no labels are associated with the data. As an outcome, clusters are formed with minimum within-cluster distance and maximum between-cluster distance. One representative of each group is chosen as the cluster centroid.
A number of strategies have been proposed over the last several years for solving clustering problems efficiently [3]. The two broad categories of clustering algorithms are Hierarchical Clustering and Partitional Clustering. Among partitional methods, K Means Clustering benefits from a simple mathematical formulation, ease of implementation and fast convergence [4], which widens the field of application of the algorithm.
However, K Means Clustering converges to a local optimum that depends on the initial clusters, and this may or may not coincide with the globally optimal solution. Its Euclidean distance measure is applicable to numerical datasets only, whereas real-world data is not restricted to numerical attributes; it contains categorical attributes as well. This is an additional obstacle to adopting the K Means Clustering algorithm.
For categorical attributes, the K Mode Clustering algorithm replaces the Euclidean distance measure with a simple matching dissimilarity measure and so provides an alternative solution. To extend the K Mode Clustering algorithm to mixed attributes, the K Prototype Clustering algorithm was proposed, which combines both distance measures [5] [6]. The K Prototype Clustering algorithm is one of the approaches for clustering mixed attributes. To enhance its performance, various improvements have been proposed over the last decades. This paper adds one more approach in the same direction.
Section 2 describes the REDIC K Prototype Clustering algorithm and the need to migrate it to the MapReduce paradigm. The MapReduce REDIC K Prototype Clustering algorithm is proposed in Section 3, and the experimental setup and results of the proposed algorithm are given in Section 4.

II. BACKGROUND KNOWLEDGE
When mixed data is clustered with the K Prototype Clustering algorithm, the numerical and categorical attributes are separated and processed with two different similarity measures: Euclidean distance for numerical attributes and Hamming distance for categorical attributes. The Hamming distance is 0 if two categorical values are the same and 1 if they differ. This binary reasoning may not give good results on many real-world datasets [7].
As an extension of the K Prototype Clustering algorithm, the K Center algorithm has been proposed and shown to give improved results by taking the frequency of attribute values into consideration [8].
The effectiveness of this algorithm depends on the initial centroids selected, and the algorithm also requires expert knowledge to choose the value of the parameter K [9]. The REDIC K Prototype Clustering algorithm proposed in [10] addresses precisely these issues and suggests an alternative strategy.

A. REDIC K Prototype Clustering
The similarity between two categorical attributes largely depends on the relative frequency of the common values of the particular attributes [11]. For this reason, a frequency-based method for measuring the similarity of categorical attributes should be adopted. The REmoval Dependency on K and Initial Centroid Selection (REDIC) K Prototype Clustering algorithm is proposed with a novel frequency-based similarity measure for categorical attributes [12]. Using the simple furthest point heuristic (Maxmin) for initialization decreases the clustering error of k-means from 15% to 6% on average [13]. This strategy is adopted in the REDIC algorithm when choosing the initial centroids. The REDIC K Prototype Clustering algorithm makes three contributions:
1) Frequency-based dissimilarity measurement for the categorical attributes.
2) Selection of the initial centroids by calculating the most significant attributes.
3) An incremental approach for deciding the number of clusters.
The preliminaries used in the algorithm are defined as follows.
Preliminary 1. Similarity between two Categorical Attributes, Cdist(c_i, c_j): Consider any two categorical instances c_i and c_j of n instances. Cdist(c_i, c_j) is expressed in terms of the frequency of occurrence FOC, where FOC(c_i) is defined as

$$\mathrm{FOC}(c_i) = \frac{\text{number of occurrences of } c_i}{\text{total number of instances } n} \qquad (2)$$

Preliminary 2. Similarity between two Instances, Dist(I_i, I_j) (3)
Preliminary 3. Initial Centroid Selection: The instances having the minimum or maximum row factor are considered as the initial centroids. Here, out of n attributes, attributes 1 to m are numerical and attributes m+1 to n are categorical.
Using these five preliminaries, the clusters are formed and refined. The algorithm is observed to be computationally simple and inexpensive, and its evaluation results are better than those of the K Prototype algorithm.
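As an illustration of the frequency of occurrence FOC(c_i) in Eq. (2), the following minimal Python sketch computes FOC for every distinct value of one categorical attribute. The function name `frequency_of_occurrence` and the sample values are assumptions made for this example; the sketch covers only Eq. (2), not the full Cdist measure of [12].

```python
from collections import Counter


def frequency_of_occurrence(values):
    """Return FOC(c) = (occurrences of c) / n for each distinct value c."""
    n = len(values)
    if n == 0:
        return {}
    counts = Counter(values)
    return {value: count / n for value, count in counts.items()}


# Illustrative values of a single categorical attribute (made up for this example).
colours = ["red", "blue", "red", "green", "red", "blue"]
foc = frequency_of_occurrence(colours)
print(foc)
```

Because every FOC value is a count divided by n, the values for one attribute always sum to 1.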
The REDIC algorithm is designed for small datasets; to deal with large datasets, an alternative must be formulated. Here the Map Reduce paradigm is chosen to speed up the REDIC algorithm.

B. Map Reduce Paradigm
Map Reduce is designed to process large-scale data [14]. Automatic parallelization and task assignment reduce the overhead of deploying an algorithm on the paradigm. Only two phases are essential to parallelize an algorithm: Map and Reduce. For both phases, the input and output are in the form of <key, value> pairs. The Map phase accepts its input as <key, value> pairs and produces an intermediate list, also in <key, value> format. Through the grouping and shuffling operations, this list is rearranged according to the intermediate key, giving an intermediate list of the form <key, (value1, value2, ..., valueN)>. The Reducer accepts the intermediate list as input and produces the final values according to the algorithm. In Figure 1, the functions of Map and Reduce are explained with a block of data.
The Map Reduce library splits the input file into a number of blocks. Copies of the input files are stored on the nodes. One special copy of the input files, with data and metadata, is maintained and referred to as the master. The master node allots tasks automatically, without user intervention. The remaining nodes are slave nodes, which perform the tasks assigned by the master node. A slave node that is assigned map work reads the input file and converts it into <key, value> pairs. The Mapper function performs the grouping and shuffling and creates the intermediate list. The Reducer function reads the intermediate list and sorts it so that each task is mapped to the appropriate reducer task. A slave node that is assigned reduce work iterates over the sorted data and assigns the appropriate output values to each intermediate key. After the algorithm completes successfully, the output is stored in the separate output files of the reducers.
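The flow of <key, value> pairs described above can be imitated in a few lines of single-process Python. This is only a sketch of the data flow (map, then group by key, then reduce); it does not use Hadoop, and the names `run_map_reduce`, `count_words_map` and `count_words_reduce` are hypothetical helpers introduced for the illustration.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Imitate the Map, shuffle/group and Reduce phases in one process."""
    # Map phase: every record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Shuffle/group phase: build <key, (value1, value2, ..., valueN)>.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: one call per intermediate key, in sorted key order.
    return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}


def count_words_map(line):
    """Emit (word, 1) for each word in a line of text."""
    return [(word, 1) for word in line.split()]


def count_words_reduce(key, values):
    """Sum the counts collected for one word."""
    return sum(values)


result = run_map_reduce(["a b a", "b c"], count_words_map, count_words_reduce)
print(result)
```

In a real Hadoop job the grouping and sorting are done by the framework across machines; the sketch keeps them in one function only to make the intermediate list visible.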

III. PROPOSED ALGORITHM: MAP REDUCE BASED REDIC K PROTOTYPE CLUSTERING
To handle large-scale categorical data, the Map Reduce based REDIC K Prototype Clustering algorithm is proposed. In the MR REDIC algorithm, the distance between two categorical instances is calculated using the frequency-based method, and the initial centers are selected by calculating the row factor of each instance. This removes the dependency on user-supplied values for K and the initial centroids. The map phase accepts as input a part of the mixed dataset stored on HDFS and calculates the row factor for each instance. The instances having the minimum or maximum row factor are selected as the initial centroids, which also decides the number of clusters. The distance from each centroid to each instance is calculated, and each instance is assigned to the cluster at minimum distance. Here the cluster index works as the key, and the instances associated with that cluster work as the value of that key.
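The map-phase steps above (pick the instances with the minimum or maximum row factor as centroids, then emit each instance under the index of its nearest centroid) might be sketched as follows. Several parts are assumptions for illustration only: the row factors are passed in as precomputed, made-up numbers because their formula is given in [12] and not here, and `stand_in_distance` is a K-Prototype-style placeholder (squared Euclidean on numerical attributes plus simple matching on categorical ones), not the frequency-based REDIC distance.

```python
def select_initial_centroids(instances, row_factors):
    """Pick the instances whose row factor equals the minimum or the maximum."""
    lowest, highest = min(row_factors), max(row_factors)
    return [inst for inst, rf in zip(instances, row_factors) if rf in (lowest, highest)]


def stand_in_distance(a, b, num_numeric):
    """Placeholder mixed distance; NOT the REDIC formula from [12]."""
    numeric = sum((x - y) ** 2 for x, y in zip(a[:num_numeric], b[:num_numeric]))
    categorical = sum(1 for x, y in zip(a[num_numeric:], b[num_numeric:]) if x != y)
    return numeric + categorical


def map_assign(instances, centroids, num_numeric):
    """Emit (cluster_index, instance) pairs using the nearest centroid."""
    pairs = []
    for instance in instances:
        distances = [stand_in_distance(instance, c, num_numeric) for c in centroids]
        pairs.append((distances.index(min(distances)), instance))
    return pairs


# Each instance: one numerical attribute followed by one categorical attribute.
instances = [(1.0, "a"), (1.2, "a"), (9.0, "b"), (8.5, "b")]
row_factors = [0.1, 0.4, 0.9, 0.7]  # hypothetical values, not computed per [12]

centroids = select_initial_centroids(instances, row_factors)
k = len(centroids)  # the number of clusters follows from the selection
pairs = map_assign(instances, centroids, num_numeric=1)
print(k, pairs)
```

The emitted pairs have the cluster index as key and the instance as value, which is the shape the reducer phase consumes.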
In the reducer phase, cluster refinement is done. The delta value is calculated for each cluster and compared with the distance between the cluster centroid and each instance. An instance remains in the cluster only if its distance is less than the delta value; otherwise it is considered a non-promising instance. Such instances are removed from the cluster, and a new cluster with its own centroid is created. This process continues until all instances are allotted to an appropriate cluster and the refinement produces no further changes. Here again the cluster index works as the key, and the instances associated with that cluster work as the value of that key.
In order to migrate the REDIC algorithm to the Map Reduce paradigm, the MR REDIC algorithm is proposed. The Job class initializes the job along with the directory paths for input and output. The input dataset is stored on HDFS, from where it is split and assigned to the Mapper class for further processing. The Job class of MR-REDIC constructs a global variable centers, initially a null array (C), which later stores the information about the centers of the clusters. The Job class distributes the task to the data nodes using built-in libraries, and the Reducer class refines the clusters and updates the value of the number-of-clusters parameter.
The global center array, the offset key and the sample value are given as input to the Mapper class. Initially, when this array is empty, the row value is calculated using the method proposed in [12] and stored in the set R. The smallest and largest values of R are extracted as min and max. The initial value of the number of clusters k is set to the count of min and max values. The initial centroids are stored in the variable CN_i; in the next step, the distance from every instance to every centroid, dist(I_i, C_k), is calculated using the formula defined in [12]. Each instance is assigned to the cluster whose centroid is at minimum distance. The intermediate <Key, Value> pair is maintained, where Key is the cluster index and Value holds the values of all attributes of the particular instance.
The Reducer class refines the clusters by reallocating the instances emitted by the Mapper class. For the refinement of clusters, the decision value parameter δ_k is calculated using the formula proposed in [12]. The values of dist(I_i, C_k) are compared with δ_k. If dist(I_i, C_k) > δ_k, then the instance I_i is not promising for cluster C_k. The instance I_i is removed from cluster C_k, a new cluster is formed, and the centroid values of both clusters are updated.
The final <Key, Value> pair is maintained, where Key is the updated cluster index and Value holds the values of all attributes of the particular instance. Finally, the result files of the Reducer class are stored in the output directory path set by the Job class.
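The refinement rule of the Reducer class (remove I_i from C_k when dist(I_i, C_k) > δ_k and open a new cluster for it) can be sketched as below. This is a hedged illustration rather than the authors' implementation: δ_k is passed in as a plain number because its formula is in [12], the distance function is supplied by the caller, each removed instance is assumed to seed its own new cluster, and the centroid update step is left out. The names `reduce_refine` and `absolute_difference` are hypothetical.

```python
def reduce_refine(members, centroid, delta, next_index, distance):
    """Split one cluster into kept instances and newly created clusters.

    Returns (kept, new_clusters, next_index, changed), where `changed`
    signals that another refinement pass is needed.
    """
    kept = []
    new_clusters = {}
    for instance in members:
        if distance(instance, centroid) > delta:
            # Non-promising instance: it leaves the cluster and seeds a new one.
            new_clusters[next_index] = [instance]
            next_index += 1
        else:
            kept.append(instance)
    changed = bool(new_clusters)
    return kept, new_clusters, next_index, changed


def absolute_difference(a, b):
    """Toy one-dimensional distance used only for this example."""
    return abs(a - b)


kept, new_clusters, next_index, changed = reduce_refine(
    members=[1.0, 1.5, 8.0],
    centroid=1.0,
    delta=2.0,       # assumed value; the real delta comes from [12]
    next_index=1,    # index to give to the next cluster that is created
    distance=absolute_difference,
)
print(kept, new_clusters, next_index, changed)
```

A driver would repeat the map and reduce steps while `changed` is true, which corresponds to continuing until the refinement produces no revision.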

IV. RESULT AND ANALYSIS
MR-REDIC is implemented on the Hadoop architecture in single-node and multi-node cluster environments. The datasets used for validation vary in size so that the execution time and speed up of the proposed algorithm can be measured. This section is divided into three parts: the experimental setup and configuration of nodes, the details of the datasets considered, and the experimental results.

A. Experiment Setup
The proposed algorithm was implemented on the Map Reduce paradigm using Hadoop 2.6.0 and Java version 1.7.0. The operating system used was Ubuntu 14.04. The experiments were carried out on different Hadoop cluster configurations.
Configuration 1: Name Node and Data Node on a single machine
Configuration 2: One Name Node and Two Data Nodes
Configuration 3: One Name Node and Four Data Nodes
The nodes in the Hadoop cluster are configured with an Intel i3 @ 3.64 GHz processor, 2 GB of RAM and a 500 GB hard disk per node, with a measured bandwidth for end-to-end TCP sockets of 100 MB/s.

B. Dataset Description
The datasets used in this experiment were downloaded from Kaggle and the UCI Repository. To evaluate execution time, datasets of different sizes with mixed attributes were chosen.
Poker Hand dataset: This dataset has multivariate characteristics, with 1025010 instances and 11 attributes. Each instance is an example of a hand consisting of five playing cards drawn from a deck of 52 cards [15].
Indian Census dataset: This dataset consists of 156 attributes of mixed type. It gives demographic information about the population of each district of India.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 2, 2020

Algorithm fragment (cluster refinement in the Reducer class):
    Remove I_i from cluster C_k;
    k = k + 1;
    Create cluster C_{k+1}, where I_i ∈ C_{k+1};
    update C_n;
    set globalF = True;
    end
    Key → cluster index;
    Value → the values of all attributes of the new cluster index;
    return <key, value>

Magic Gamma Telescope dataset: The imaging technique used by the telescope captures high-energy gamma particles through the radiation emitted by charged particles in an electromagnetic shower. Every event is described by various parameters, such as the major axis of the ellipse [mm], the minor axis of the ellipse [mm], the 10-log of the sum of the content of all pixels, etc. Here 10 different numerical attributes are considered. The instances are divided into two major categories: gammas (signal) and hadrons (background) [16].
House Sales in King County, USA [17]: This dataset contains information about the sale prices of houses in King County, USA, between May 2014 and May 2015. It contains the records of 21614 houses with 21 different properties such as area, longitude, latitude, etc.
Toy dataset: This toy dataset comprises 150000 instances with 6 attributes. The size of the dataset is 5 MB. Each instance is described by features such as age, gender, income, city, etc. [18]
A brief introduction to the datasets is given in Table I.

C. Evaluation Parameters and Results
In these experiments, three evaluation measurements are considered: Execution time, Speed up and Scale up.

1) Execution Time:
The execution time of the proposed MR REDIC algorithm is compared with that of the REDIC algorithm across 3 different configurations. In Configuration 1, the Data Node and Name Node are on a single machine; in Configuration 2, one Name Node and two Data Nodes are used; and in Configuration 3, one Name Node and four Data Nodes are used. The values in Table II show that the execution time decreases gradually from one configuration to the next. Figure 3 shows that the execution time decreases as the number of data nodes increases.
2) Speed up:
To measure the Speed up, the number of nodes is increased in every configuration while keeping the size of the dataset fixed. It is evaluated as

$$\mathrm{Speedup} = \frac{T_1}{T_n}$$

where T_1 is the execution time on 1 node and T_n is the execution time on n nodes [19]. Linear Speed up is the ideal case of the Map Reduce paradigm: for example, the speed up of the algorithm increases from 1 to n when the task is split from 1 machine to n machines. In practice this is difficult to achieve, because the communication overhead also grows as the number of nodes increases. To evaluate the Speed up parameter of the proposed algorithm, five different datasets and the 3 configurations are considered. From Figure 4 it is observed that the Speed up value drifts slightly from linear when going from 1 node to 2 nodes, but from 2 nodes to 4 nodes it is almost linear.
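The Speed up ratio T_1 / T_n is simple to compute once the execution times are known. The sketch below uses hypothetical timings chosen only to show the calculation; they are not the measurements reported in this paper, and the function name `speedup` is an assumption.

```python
def speedup(t_one_node, t_n_nodes):
    """Speed up = T_1 / T_n, both execution times in the same unit."""
    if t_one_node <= 0 or t_n_nodes <= 0:
        raise ValueError("execution times must be positive")
    return t_one_node / t_n_nodes


# Hypothetical execution times in seconds, keyed by number of data nodes.
# These are illustrative numbers, NOT results from the experiments.
timings = {1: 120.0, 2: 70.0, 4: 36.0}

speedups = {nodes: speedup(timings[1], t) for nodes, t in timings.items()}
for nodes, value in speedups.items():
    print(f"{nodes} node(s): speed up {value:.2f} (linear would be {nodes})")
```

Comparing each value with the node count n shows how far a configuration is from the linear ideal.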
3) Scale up:
To measure the Scale up, the size of the dataset is increased while keeping the configuration of Data Nodes and Name Node fixed. It is evaluated using a formula in which T_{Dn} is the execution time for a dataset of size D on n Data Nodes and T_{2Dn} is the execution time for a dataset of size 2D on n Data Nodes [19]. A constant value of Scale up for each dataset size is the ideal case of the Map Reduce paradigm. In practice this is difficult to achieve, as the execution time depends on the attribute values of the particular instances.
The scale up analysis is given in Table IV. To evaluate the Scale up parameter of the proposed algorithm, the Toy dataset is considered. Three versions of the dataset, with sizes of 1 MB, 2 MB and 4 MB, are considered. The execution times for the different versions of the dataset are recorded in Table III. From Figure 4.

V. CONCLUSION
In this paper, the MR-REDIC algorithm is proposed, in which the REDIC K-Prototype algorithm is migrated to the Map Reduce model to make it suitable for clustering large-scale mixed data. REDIC K-Prototype Clustering is efficient because of its simple calculations and because it eliminates the dependency on prior parameters. It has proved its efficiency on real-world mixed datasets, but it requires more time on large-scale data, so its parallel version using the Map Reduce paradigm is proposed here. Experiments were conducted with five benchmark datasets and three different configurations of the Map Reduce paradigm. In the Speed up analysis the algorithm shows a near-linear increase, and in the Scale up analysis it gives an almost constant value. In future, different similarity measures for numerical attributes can be integrated with the MR REDIC algorithm.