A Variant of Genetic Algorithm Based Categorical Data Clustering for Compact Clusters and an Experimental Study on Soybean Data for Local and Global Optimal Solutions

Almost all partitioning clustering algorithms get stuck at locally optimal solutions. Genetic algorithms (GAs) can be used to find globally optimal results. This work proposes and investigates a new variant of the GA-based k-Modes clustering algorithm for categorical data. A statistical analysis on a popular categorical dataset shows that, with user-specified cluster centres, the k-Modes algorithm remains stuck at a locally optimal solution even at high iteration counts, whereas the proposed algorithm overcomes this problem of local optima. To the best of our knowledge, such a comparison is reported here for the first time for categorical data. The results show that the proposed algorithm outperforms the conventional k-Modes algorithm in terms of optimal solutions and the within-cluster variation measure.

Keywords: Clustering; Categorical data; k-Modes; Genetic Algorithm


I. INTRODUCTION
There is a growing need for ways to extract knowledge from data [1]. Clustering is a descriptive task that partitions a dataset based on a predefined similarity measure [2]. Clustering techniques have been widely used in machine learning, pattern recognition, medicine and other fields. A number of clustering algorithms have been proposed for different requirements and kinds of data [3]. Partition-based clustering (k-Modes and its initialisation methods) [4], hierarchical clustering (HIERDENC) [5], model-based clustering (EAST algorithm) [6], density-based clustering [7], graph-based clustering, and grid-based clustering are some basic clustering approaches, each with its advantages and disadvantages.
Defining a distance measure between two categorical data objects is difficult; the greater the distance between clusters, the better separated they are [8]. One well-known clustering algorithm for large categorical datasets is k-Modes. The traditional way to treat categorical data is binary encoding, but this does not do justice to large differences between values: for example, the difference between "very low" and "very high" is treated the same as any other mismatch.
The major issue in partition clustering is the initialization of the cluster centres, since it directly influences the construction of the final clusters. This paper focuses on obtaining better, lowest-cost partitions of real-world categorical datasets using GA in less space and time.
GA was proposed by Holland [9] and can be applied to many optimization problems. Because the cluster centre initialization problem affects the proper clustering of data, GA has been used in the literature to convert locally optimal solutions into globally optimal ones in many GA-based clustering algorithms, for numeric as well as categorical data [10]. This paper calculates the Total Within Cluster Variation (TWCV), the time, and the conversion of local optima to global optima. Fig. 1 shows the various operators used in GA-based categorical data clustering found in the literature. This paper is organized as follows: Section II presents the literature review; Section III presents the background; Section IV presents the proposed method; Section V gives the experimental details, where the basic k-Modes algorithm is compared with the proposed algorithm; Section VI concludes the paper; and Section VII outlines future work.

III. BACKGROUND
In many categorical data clustering algorithms, the seeds or cluster centres are not known in advance; the k-Modes algorithm is a well-known and widely used clustering technique of this type. However, the major drawback of k-Modes is that it often gets stuck at local minima, and the result depends largely on the choice of the initial cluster centres.

A. K-Modes Algorithm
a) Dissimilarity measure
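Let A and B be two categorical objects described by m categorical attributes. The dissimilarity measure can be defined by the total mismatches of the corresponding attribute categories of the two objects [4]. Formally,

$$d(A, B) = \sum_{j=1}^{m} \delta(a_j, b_j), \qquad \delta(a_j, b_j) = \begin{cases} 0, & a_j = b_j \\ 1, & a_j \neq b_j \end{cases} \qquad (1)$$

$$d_{\chi^2}(A, B) = \sum_{j=1}^{m} \frac{n_{a_j} + n_{b_j}}{n_{a_j} n_{b_j}} \, \delta(a_j, b_j) \qquad (2)$$

where $n_{a_j}$ and $n_{b_j}$ are the numbers of objects in the dataset that have categories $a_j$ and $b_j$ for attribute j, and $d_{\chi^2}(A, B)$ is the Chi-square distance.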
This paper works on a dataset in which the frequencies of the categories are available, so the distance calculation of eq. (2) [4] is used.
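The two dissimilarity measures can be illustrated with a minimal Python sketch. It assumes that eq. (1) is the simple matching count and that eq. (2) is the frequency-weighted form from [4], in which each mismatch on attribute j contributes (n_aj + n_bj) / (n_aj n_bj). The function names and the toy records are hypothetical, and the frequency-weighted version assumes that every compared category occurs at least once in the dataset.

```python
from collections import Counter

def simple_matching(a, b):
    """Eq. (1): number of attributes on which objects a and b disagree."""
    return sum(x != y for x, y in zip(a, b))

def frequency_weighted(a, b, data):
    """Eq. (2) as assumed here: each mismatch weighted by (n_a + n_b) / (n_a * n_b)."""
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if x != y:
            # Category frequencies of attribute j over the whole dataset.
            counts = Counter(row[j] for row in data)
            n_a, n_b = counts[x], counts[y]
            total += (n_a + n_b) / (n_a * n_b)
    return total

# Toy records (hypothetical), in the style of the examples in this paper.
data = [
    ("Yellow", "Small", "Stretch", "Adult"),
    ("Yellow", "Small", "Dip", "Child"),
    ("Purple", "Large", "Dip", "Child"),
    ("Purple", "Large", "Stretch", "Adult"),
]

print(simple_matching(data[0], data[1]))           # mismatch count
print(frequency_weighted(data[0], data[1], data))  # frequency-weighted distance
```

Under the weighted form, a mismatch between two rare categories contributes more to the distance than a mismatch between two frequent ones.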
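Consider $X = \{X_1, X_2, \ldots, X_n\}$, a set of categorical objects described by the categorical attributes $C_1, C_2, \ldots, C_m$. A mode of X is a vector $M = [m_1, m_2, \ldots, m_m]$ that minimizes

$$D(X, M) = \sum_{i=1}^{n} d(X_i, M),$$

where M may or may not be an element of X [4].

b) Find a mode for a set
Let $n_{t_{k,j}}$ be the number of data objects having the k-th category $t_{k,j}$ in attribute $C_j$, and let $f_r(C_j = t_{k,j} \mid X) = n_{t_{k,j}} / n$ be the relative frequency of category $t_{k,j}$ in X. The function D(X, M) is minimized if and only if $f_r(C_j = m_j \mid X) \geq f_r(C_j = t_{k,j} \mid X)$ for $m_j \neq t_{k,j}$, for all j = 1, 2, ..., m [4].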
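The frequency-based mode computation can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments: the function name `find_mode` and the toy records are hypothetical, and ties between equally frequent categories are broken by first occurrence, which is an assumption.

```python
from collections import Counter

def find_mode(objects):
    """Return, for each attribute, the category with the highest frequency."""
    n_attrs = len(objects[0])
    return tuple(
        # most_common(1) returns the most frequent category of attribute j.
        Counter(obj[j] for obj in objects).most_common(1)[0][0]
        for j in range(n_attrs)
    )

cluster = [
    ("Yellow", "Small", "Dip", "Adult"),
    ("Yellow", "Large", "Stretch", "Adult"),
    ("Purple", "Small", "Stretch", "Child"),
]
mode = find_mode(cluster)
print(mode, mode in cluster)
```

In this toy cluster the resulting mode is not one of the three objects, which illustrates that M need not be an element of X.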

c) The k-Modes algorithm
When equations (1) and (2) are used as the dissimilarity measure for categorical objects, the cost function becomes

$$P(W, M) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{i,l} \, d(X_i, M_l),$$

where $w_{i,l} \in \{0, 1\}$ equals 1 when object $X_i$ is assigned to cluster l, and $M_l$ is the mode of cluster l [4].

An attempt is made in this paper to integrate the effectiveness of the k-Modes algorithm for partitioning data into a number of clusters with the capability of the genetic algorithm to bring it out of local minima. GAs are randomized search methods that efficiently provide near-optimal solutions of the fitness function in an optimization problem.
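The dependence of k-Modes on its initial centres can be seen in a small sketch. It assumes the simple matching dissimilarity of eq. (1) and takes the cost to be the sum of distances between each object and the mode of its cluster. The function names, the toy data, and the choice to keep the old mode for an empty cluster are all assumptions made for illustration.

```python
from collections import Counter

def mismatches(a, b):
    """Simple matching dissimilarity between two categorical objects."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(data, modes, max_iter=100):
    """Alternate assignment and mode update until the modes stop changing."""
    modes = [tuple(m) for m in modes]
    labels = []
    for _ in range(max_iter):
        # Assign each object to the index of its nearest mode.
        labels = [min(range(len(modes)), key=lambda l: mismatches(x, modes[l]))
                  for x in data]
        new_modes = []
        for l, old in enumerate(modes):
            members = [x for x, lab in zip(data, labels) if lab == l]
            if not members:
                new_modes.append(old)  # assumption: keep the old mode
                continue
            new_modes.append(tuple(
                Counter(col).most_common(1)[0][0] for col in zip(*members)))
        if new_modes == modes:
            break
        modes = new_modes
    cost = sum(mismatches(x, modes[lab]) for x, lab in zip(data, labels))
    return modes, labels, cost

data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("b", "w")]
_, _, good_cost = k_modes(data, [("a", "x"), ("b", "z")])
_, _, poor_cost = k_modes(data, [("a", "x"), ("a", "y")])
print(good_cost, poor_cost)
```

With the second initialisation the procedure converges to a partition of higher cost than with the first, which is the kind of locally optimal result that the GA is intended to escape.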
Local-search-based approaches such as k-Modes may get stuck at locally optimal solutions. GA-based clustering escapes from the local optimum, but it is slow and expensive to compute. Similarity-based approaches are not consistent across different inputs and can be context dependent. The main difference between k-Modes and the proposed GA-based algorithm is the assumption about the cluster centres on which the clustering is based. GA is an efficient algorithm for optimization problems in which candidate solutions are represented by chromosomes as string encodings and multiple solutions are maintained. GA selects the best-fit solutions in each generation.
As the string length of the chromosome increases, the search space of the GA grows, so the whole process becomes more time consuming. When the numbers of data points and attributes are very large, a chromosome whose size equals the number of data points multiplied by the assumed number of clusters is difficult to store and manage.
In this paper, a generalized mechanism for categorical datasets is presented to identify and ignore the worst cluster centres in a categorical dataset. The proposed work utilizes the robustness of the genetic algorithm (GA) to optimize the k-Modes clustering algorithm: the searching capability of GAs is used to determine the most appropriate cluster centres, and the expensive crossover operator is avoided by using a one-step k-Modes operator instead. The associated cost function is defined in terms of the distances between the cluster objects and the cluster centre. This paper represents chromosomes as strings (sequences of data values).
The objective of this work is to find k partitions that minimize the cost function and to find the optimal solution with the following GA operators: string representation, population size, selection operator, and the one-step k-Modes algorithm in place of the crossover operator. This paper shows how the conventional k-Modes clustering algorithm may stop at a locally optimal solution, whereas the proposed hybridized clustering algorithm facilitates global optimization of the underlying cost as the objective function, constructing an optimal partition of objects so that the within-cluster dispersion is minimized and the between-cluster separation is increased. Consequently, the updated cluster modes are (Yellow Small Stretch Adult) and (Purple Large Dip Child), obtained with the frequency-based method shown in equation (2); the cost, or within-cluster variation, is then calculated.

b) Fitness calculation
This paper defines the fitness function in terms of the sum of within-cluster variation: the larger the fitness, the denser the data within each cluster and the more separated each cluster is from the others. The details are described below.
Initially the clusters are formed randomly using the centres encoded in that particular chromosome; the cluster centres encoded in the chromosome are then replaced by the cluster centres (modes) of the respective clusters using the frequency method. Each point $x_i$, i = 1, 2, 3, ..., n, is therefore assigned to the cluster with mode $m_j$ such that

$$d(x_i, m_j) \leq d(x_i, m_p), \qquad p = 1, 2, 3, \ldots, k, \; p \neq j.$$

The frequency method shown in eq. (2) is applied to each attribute j, where $C_j$ is replaced by the new $C_i$.
The fitness function is defined as f = 1/P(W, M), i.e. the lower the cost, the fitter the chromosome.
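The fitness evaluation of one chromosome can be sketched as below, assuming the simple matching dissimilarity and a chromosome that has already been decoded into a list of modes. The function names are hypothetical. Keeping the encoded mode for an empty cluster and returning an infinite fitness when the cost is zero are both assumptions, since f = 1/P(W, M) is undefined in that case.

```python
from collections import Counter

def mismatches(a, b):
    """Simple matching dissimilarity between two categorical objects."""
    return sum(x != y for x, y in zip(a, b))

def fitness(modes, data):
    """Return (f, updated_modes) with f = 1 / P(W, M)."""
    # Assign every object to its nearest encoded mode.
    clusters = [[] for _ in modes]
    for x in data:
        nearest = min(range(len(modes)), key=lambda j: mismatches(x, modes[j]))
        clusters[nearest].append(x)
    # Replace each encoded mode by the frequency-based mode of its cluster.
    updated = [
        tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
        if members else tuple(mode)
        for mode, members in zip(modes, clusters)
    ]
    # Cost: total distance between each object and its updated mode.
    cost = sum(mismatches(x, m)
               for m, members in zip(updated, clusters) for x in members)
    # Assumption: a zero cost is treated as infinite fitness.
    return (1.0 / cost if cost else float("inf")), updated

data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("b", "w")]
f_good, _ = fitness([("a", "x"), ("b", "z")], data)
f_poor, _ = fitness([("a", "x"), ("a", "y")], data)
print(f_good, f_poor)
```

A chromosome whose centres split the toy data well receives the higher fitness, so ranking by f is equivalent to ranking by ascending cost.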

c) Selection
The fundamental selection method for GA-based clustering algorithms is roulette-wheel selection. In this paper, after the fitness calculation, the costs of all chromosomes in the population of the present generation are sorted. The highest-cost chromosome of the present generation is deleted if its cost is greater than the average cost of all chromosomes in the next iteration; otherwise, that chromosome is kept in the population.
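One possible reading of this selection rule is sketched below. It is an interpretation, not a confirmed implementation: the average is taken over the costs passed in, and the function name `select` and the chromosome labels are hypothetical.

```python
def select(population, costs):
    """Drop the highest-cost chromosome when its cost exceeds the mean cost."""
    # Sort chromosomes by ascending cost.
    ranked = sorted(zip(costs, population), key=lambda pair: pair[0])
    mean_cost = sum(costs) / len(costs)
    worst_cost, _ = ranked[-1]
    if worst_cost > mean_cost:
        ranked = ranked[:-1]  # delete the worst chromosome
    return [chrom for _, chrom in ranked]

survivors = select(["c1", "c2", "c3"], [10, 12, 20])
print(survivors)
```

When all chromosomes have the same cost, none exceeds the mean and the population is returned unchanged.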

d) Crossover process
Similar to the genetic k-Modes algorithm, this paper uses the one-step k-Modes algorithm as the crossover operator to exchange information between the two parent chromosomes and generate two offspring.

e) Termination criteria
The most popular termination criterion for GA-based clustering algorithms is to run the algorithm for a user-defined number of iterations. In the proposed algorithm, iteration stops for a particular chromosome if its fitness value remains constant, even before the user-specified iteration count is reached.
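The early-stopping rule can be sketched generically. The sketch assumes that one unchanged fitness value between consecutive generations counts as constant; the `step` callback, the function names and the toy update are hypothetical placeholders for one generation of the proposed algorithm.

```python
def run_generations(step, state, max_generations=100):
    """Apply `step` until the fitness stops changing or the cap is reached."""
    previous = None
    for generation in range(1, max_generations + 1):
        state, fit = step(state)
        # Stop early when the fitness equals that of the previous generation.
        if previous is not None and fit == previous:
            return state, fit, generation
        previous = fit
    return state, previous, max_generations

def toy_step(s):
    """Hypothetical generation: the fitness rises until it saturates at 3."""
    s = min(s + 1, 3)
    return s, float(s)

state, fit, generations_used = run_generations(toy_step, 0)
print(fit, generations_used)
```

A stricter variant could require the fitness to stay constant over several consecutive generations before stopping.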

f) Solution of the Empty cluster problem
Empty cluster formation is a well-known problem in clustering, and it becomes more serious when optimization techniques are used. This paper tries to remove the empty cluster issue using the algorithm given in Example 2.

B. Flowchart of proposed GA based clustering Algorithm
In the current implementation of the GA, this paper uses the standard k-Modes algorithm to create multiple partitions of the categorical data in search of the globally optimal solution.

V. EXPERIMENTAL RESULTS
In this work, the proposed GA-based clustering algorithm and the standard k-Modes method were coded in Python. The experiments were conducted on a laptop computer with a 2.89 GHz CPU and 8 GB RAM running the Windows 8.1 operating system. To test the effectiveness of the proposed algorithm, the Soybean dataset from UCI [16] was used. The TWCV is an intrinsic validity measure that calculates the sum of within-cluster variation over all clusters. A smaller TWCV means that the clusters are more compact. Therefore, to obtain compact and more separated clusters, this value should be minimized in the clustering task. If only computational efficiency is considered, the faster algorithm is better. The detailed analysis is given in the next sub-sections.
The shaded values in Tables II-VII are locally optimal for k-Modes and globally optimal for the proposed algorithm. The detailed clustering results of the k-Modes algorithm on the soybean data for different initializations and different k values are shown in Tables II-IV, which show that the values are stuck at local optima. The proposed clustering algorithm provides the optimal values in all runs for all k, as shown in Tables V-VII. The k-Modes algorithm also attains the same optimal value as the proposed algorithm in some runs, but the proportion of such runs is very small. Tables VIII, X, XII, XIV and XVI show the average cost over different initialisations at the 100th iteration for different k using k-Modes. Tables IX, XI, XIII, XV and XVII show the average cost over different population sizes at the 100th iteration for different k using the proposed algorithm. Fig. 3 shows that the cost gap between k-Modes and the proposed method increases, which indicates the more compact clustering of the proposed algorithm. Fig. 4 shows the time taken to cluster by k-Modes and by the proposed method; the time gap between the two is very small.

VI. CONCLUSION
Many clustering results are sensitive to the selection of the initial cluster centres and yield only locally optimal solutions. The determination of cluster centres in a dataset is attracting attention in many research areas. This paper introduced a new variant of GA-based clustering for categorical data, with an analysis of local and global optimality in comparison with k-Modes. Existing approaches are not the best methods in terms of time and space. The experiments show noticeable results in terms of cost, within-cluster variation, time and initialization of cluster centres.

VII. FUTURE WORK
This work gives better results for a small number of clusters using MATLAB [17]; it can be modified for a larger number of clusters. The proposed method can be compared with more recent algorithms on a larger number of real-world datasets. Discovering an algorithm that can perform clustering without knowing the number of clusters is also significant work in clustering analysis. Increasing the convergence speed is another important area of future research. Applying GA to datasets with a large number of attributes needs more time and space, so recent feature selection techniques [18] can also be applied.

Fig. 1. Operators used in GA-based clustering according to the literature

Example 1.
a) Encoding for categorical data: Matrices of varying size are used for chromosome representation in almost all GA-based categorical data clustering algorithms. In this paper, the chromosomes are encoded as strings of size N*k [14]. Suppose N=4 and k=2; then the string representation of a chromosome is (Yellow Small Stretch Adult Purple Large Dip Child), from the Lenses real-world dataset. It embeds the two clusters (Yellow Small Stretch Adult) and (Purple Large Dip Child). Each categorical value in the chromosome is an allele.
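The string encoding can be sketched as a pair of helper functions that flatten a list of cluster centres into one chromosome and split it back. The names `encode` and `decode` are hypothetical; the sketch only assumes that every cluster centre contributes the same number of alleles.

```python
def encode(modes):
    """Flatten the cluster centres into one string of alleles."""
    return [allele for mode in modes for allele in mode]

def decode(chromosome, n_attributes):
    """Split a flat chromosome back into centres of n_attributes alleles each."""
    if len(chromosome) % n_attributes:
        raise ValueError("chromosome length must be a multiple of n_attributes")
    return [tuple(chromosome[i:i + n_attributes])
            for i in range(0, len(chromosome), n_attributes)]

modes = [("Yellow", "Small", "Stretch", "Adult"),
         ("Purple", "Large", "Dip", "Child")]
chromosome = encode(modes)
print(chromosome)
print(decode(chromosome, 4))
```

A flat list keeps the chromosome length fixed at the product of the two dimensions, so the centres can be recovered without storing any additional structure.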

Example 2.
if (in any generation, for C_i, the intermediate clusters in the chromosome are found to be null or empty)
{
    iteration++
    if (found any empty cluster)
    {
        delete the chromosome & M = cost C_i
    }
    else go ahead
    else go to next iteration ()
}
Suppose N=4 and k=2, and the intermediate clusters are:
After first generation: (Yellow Small Stretch Adult) (Purple Large Dip Child)
After second generation: (Yellow Adult Stretch Child) (Purple Large Dip Adult)
......
After m-th generation: (Yellow Small Dip Child) ( )
After n-th generation: ( ) ( )    (4)
After using the above algorithm the updated clusters are: (Yellow Small Stretch Adult) (Purple Large Dip Child)
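The deletion step of this procedure can be sketched as a filter over the population. The sketch represents each chromosome as a list of clusters and an empty cluster as an empty tuple, which is an assumption; the function names are hypothetical, and the cost bookkeeping (M = cost C_i) is not modelled.

```python
def has_empty_cluster(clusters):
    """True when any intermediate cluster of a chromosome is empty."""
    return any(len(members) == 0 for members in clusters)

def remove_invalid(population):
    """Delete chromosomes that produced an empty cluster; keep the rest."""
    return [clusters for clusters in population
            if not has_empty_cluster(clusters)]

population = [
    [("Yellow", "Small", "Stretch", "Adult"), ("Purple", "Large", "Dip", "Child")],
    [("Yellow", "Small", "Dip", "Child"), ()],
    [(), ()],
]
valid = remove_invalid(population)
print(valid)
```

Only the chromosome whose clusters are all non-empty survives the filter in this toy population.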

Soybean dataset:
The dataset contains 47 instances and 35 attributes; of its 19 classes, only four are considered in practice. Of the 35 attributes, 14 take the same category for all instances, so only 21 attributes are used. The existing k-Modes algorithm was run for 100 iterations with different initialisations (that is, different seeds) and different values of k. The proposed GA-based clustering algorithm was executed 5 times on the soybean dataset, and k-Modes was executed approx. 10-10 times for each k. The proposed algorithm was run for up to 100 iterations. To evaluate algorithm efficiency, the computational time (in seconds) was measured.

Fig. 3. Comparison of the average cost obtained by the proposed algorithm and the k-Modes algorithm for different k after 100 iterations

TABLE I. COMPARISON OF ALGORITHMS

TABLE II. TWCV USING K-MODES ALGORITHM ON VARIOUS ITERATIONS WITH VARIOUS CLUSTER CENTRES WHEN K=2

TABLE III. TWCV USING K-MODES ALGORITHM ON VARIOUS ITERATIONS WITH VARIOUS CLUSTER CENTRES WHEN K=3

TABLE V. TWCV USING PROPOSED ALGORITHM ON VARIOUS ITERATIONS WITH VARIOUS POPULATION SIZES WHEN K=2

TABLE XIV. AVERAGE TWCV OBTAINED USING K-MODES ALGORITHM FOR DIFFERENT INITIALISATIONS AFTER 100 ITERATIONS WHEN K=5

TABLE XVII. AVERAGE TWCV OBTAINED USING PROPOSED ALGORITHM FOR DIFFERENT POPULATION SIZES AFTER 100 ITERATIONS WHEN K=6