Fisher Distance Based GA Clustering Taking Into Account Overlapped Space Among Probability Density Functions of Clusters in Feature Space

A Fisher distance based Genetic Algorithm (GA) clustering method that takes into account the overlapped space among the probability density functions of clusters in feature space is proposed. Experiments with simulation data in 2D and 3D feature spaces, generated by a random number generator, show that clustering performance depends on the overlapped space among the probability density functions of the clusters. A relation is also found between clustering performance and the GA parameters (crossover and mutation probabilities), as well as the number of features and the number of clusters.


INTRODUCTION
Genetic Algorithm (GA) clustering is widely used for image clustering. It achieves relatively good clustering performance with marginal computer resources. In particular, Fisher distance based GA clustering, which uses the Fisher distance as the fitness function of the GA, is well known [1]. The characteristics of Fisher distance based GA clustering, however, are not clear. For instance, the relation between clustering performance and the overlapped space among the probability density functions of clusters is unclear. The relations between clustering performance and the GA parameters (crossover and mutation probabilities), the number of features, and the number of clusters are also unclear [2].
This paper describes these characteristics through simulation studies with data generated by a random number generator under different parameters. The results of GA based clustering are also compared with those of Simulated Annealing based clustering [3].
The following section describes the theoretical background of the Fisher distance based GA clustering method, followed by experimental results with simulation data. Finally, conclusions and remarks are given together with some discussion.

A. Fisher distance based GA clustering
The Fisher distance between the probability density functions of two features is defined by equation (1) in terms of the means and variances of the two features. The most appropriate linear discrimination function for a multi-dimensional feature space is expressed by equation (2). The discrimination function is illustrated in Fig. 1. The line with an arrow (the linear discrimination border) in the orthogonal coordinate system of Fig. 1 is the discrimination function between the two classes (two clusters). The slanted coordinate axis carrying the probability density functions of the two classes shows the cross section of the one-dimensional probability density functions for the two classes.
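As a rough illustration of how such a distance behaves, the sketch below computes a Fisher-type separation between two one-dimensional samples. The exact form of equation (1) is not reproduced here, so the widely used Fisher criterion, the squared difference of means divided by the sum of variances, is assumed; the function name `fisher_distance` and the sample values are hypothetical.

```python
from statistics import fmean, pvariance

def fisher_distance(a, b):
    """Assumed Fisher criterion: (mean gap)^2 / (sum of the two variances)."""
    mu_a, mu_b = fmean(a), fmean(b)
    var_a, var_b = pvariance(a, mu_a), pvariance(b, mu_b)
    return (mu_a - mu_b) ** 2 / (var_a + var_b)

# Hypothetical samples: one well-separated pair and one heavily overlapping pair.
far = fisher_distance([0.0, 1.0, 2.0], [10.0, 11.0, 12.0])
near = fisher_distance([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
print(far, near)
```

Under this assumed definition, the well-separated pair yields a much larger value than the overlapping pair, which is the property a fitness function for clustering relies on.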

B. Problem definition on GA clustering
Most problems occur when the probability density functions overlap in the feature space, as shown in Fig. 2. In this case, the probability density functions of three clusters overlap. The overlapped space volume is the volume shared between the probability density functions of two different clusters, and f_o and f_p denote the functions that represent the required computer resources and the clustering performance, respectively. Through calculation with these functions, the crossover and mutation probabilities are optimized, and thus the optimum parameters of GA clustering can be determined.
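To make the notion of overlapped space concrete, the following sketch estimates the overlap between two one-dimensional Gaussian probability density functions. The paper's own definition is not recoverable from the text, so the overlap is assumed here to be the area under the pointwise minimum of the two densities; `overlap_1d` and its parameters are hypothetical names.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def overlap_1d(mu1, sigma1, mu2, sigma2, steps=20000):
    """Area under min(p1, p2), integrated by the midpoint rule over +/- 8 sigma."""
    lo = min(mu1 - 8 * sigma1, mu2 - 8 * sigma2)
    hi = max(mu1 + 8 * sigma1, mu2 + 8 * sigma2)
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += min(gaussian_pdf(x, mu1, sigma1), gaussian_pdf(x, mu2, sigma2))
    return total * dx

print(overlap_1d(0.0, 1.0, 2.0, 1.0))   # partially overlapping clusters
print(overlap_1d(0.0, 1.0, 20.0, 1.0))  # well-separated clusters
```

With this assumed definition the value ranges from 0 (disjoint clusters) to 1 (identical distributions); a two- or three-dimensional feature space would need a multivariate integral in place of the one-dimensional sum.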

B. Cluster performance evaluations
Clustering performance is evaluated with the aforementioned simulation dataset, together with the number of iterations required for convergence. The convergence condition is set at a residual error of 5%.
For the two-cluster case, as shown in Fig. 6 and Fig. 7, the most appropriate crossover and mutation probabilities depend on the overlapped space in the feature space, which is expressed in equation (7). The relation between the overlapped space volume and the crossover and mutation probabilities is shown in Fig. 8. For the three-cluster case, as shown in Fig. 9 and Fig. 10, the most appropriate crossover and mutation probabilities likewise depend on the overlapped space expressed in equation (7); the corresponding relation is shown in Fig. 11.
Table I compares GA clustering performance for the three-cluster case under different crossover and mutation probabilities, and also compares GA clustering with the Simulated Annealing (SA) method. The crossover and mutation probabilities are optimized empirically. As a result, a crossover probability of 0.2 and a mutation probability of 0.08 are the optimum parameters of GA clustering for this three-cluster case.
SA allows the global optimum to be reached, so the clustering performance of SA based clustering should be 100%. Because the allowable residual error is set at 5% as the convergence condition, however, the clustering performance of SA based clustering is 98%. On the other hand, the number of iterations for SA based clustering is 8783578, while that for GA based clustering is 164670 at the optimum GA parameters. SA based clustering therefore requires about 53 times more computation than GA based clustering. The difference in clustering performance between SA based clustering and GA based clustering is just 3%. GA based clustering is therefore much faster than SA based clustering while retaining acceptable clustering performance.
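The speed comparison can be checked with a few lines of arithmetic. The sketch below uses only the iteration counts and percentages quoted in the text; the GA performance figure is not stated directly and is inferred here, as an assumption, by subtracting the reported 3% difference from the SA result.

```python
sa_iterations = 8_783_578   # iterations reported for SA based clustering
ga_iterations = 164_670     # iterations reported for GA based clustering

ratio = sa_iterations / ga_iterations

sa_pcc = 98.0               # reported SA clustering performance (%)
ga_pcc = sa_pcc - 3.0       # inferred from the reported 3% difference (assumption)

print(f"SA needs {ratio:.1f} times the iterations of GA; implied GA performance {ga_pcc:.0f}%")
```

The ratio rounds down to the factor of 53 quoted in the text.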

III. CONCLUSION
A Fisher distance based Genetic Algorithm (GA) clustering method that takes into account the overlapped space among the probability density functions of clusters in feature space is proposed. Experiments with simulation data in 2D and 3D feature spaces, generated by a random number generator, show that clustering performance depends on the overlapped space among the probability density functions of the clusters. A relation is also found between clustering performance and the GA parameters (crossover and mutation probabilities), as well as the number of features and the number of clusters. The experimental results for the three-cluster case show that a crossover probability of 0.2 and a mutation probability of 0.08 are the optimum parameters of GA clustering. Although Simulated Annealing (SA) based clustering should be 100% accurate, its clustering performance is 98% because the allowable residual error is set at 5% as the convergence condition.
On the other hand, the number of iterations for SA based clustering is 8783578, while that for GA based clustering is 164670 at the optimum GA parameters. SA based clustering therefore requires about 53 times more computation than GA based clustering. The difference in clustering performance between SA based clustering and GA based clustering is just 3%. GA based clustering is therefore much faster than SA based clustering while retaining acceptable clustering performance.

Fig. 1. Illustrative view of the discrimination function in a two-dimensional feature space for two clusters.
The two variance terms are called the between-cluster variance and the within-cluster variance, respectively. Fisher distance based GA clustering finds the f(H) that minimizes the fitness of equation (4).
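The idea of a GA searching for a clustering that minimizes a fitness built from between-cluster and within-cluster variance can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the paper: the fitness is assumed to be the ratio of within-cluster to between-cluster variance, each chromosome is assumed to be a list of cluster labels, and truncation selection, elitism, and one-point crossover are choices made only for this example. The defaults `pc=0.2` and `pm=0.08` reuse the crossover and mutation probabilities reported as optimum for the three-cluster case.

```python
import random
from statistics import fmean

def fitness(points, labels, k):
    """Assumed fitness: within-cluster variance / between-cluster variance (lower is better)."""
    grand = fmean(points)
    within = between = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            return float("inf")  # penalize empty clusters
        mu = fmean(members)
        within += sum((p - mu) ** 2 for p in members)
        between += len(members) * (mu - grand) ** 2
    return within / between if between > 0 else float("inf")

def ga_cluster(points, k, pc=0.2, pm=0.08, pop_size=30, generations=200, seed=0):
    """Evolve label vectors; returns the best labeling found."""
    rng = random.Random(seed)
    n = len(points)
    pop = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    score = lambda ind: fitness(points, ind, k)
    for _ in range(generations):
        pop.sort(key=score)
        parents = pop[: pop_size // 2]       # truncation selection (assumption)
        children = [parents[0][:]]           # elitism: carry over the best individual
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = a[:]
            if rng.random() < pc:            # one-point crossover with probability pc
                cut = rng.randrange(1, n)
                child = a[:cut] + b[cut:]
            # each gene is redrawn at random with probability pm
            child = [rng.randrange(k) if rng.random() < pm else g for g in child]
            children.append(child)
        pop = children
    return min(pop, key=score)

# Hypothetical one-dimensional data with two obvious groups.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
labels = ga_cluster(points, k=2)
print(labels)
```

A label-vector encoding is simple to write but treats relabelled copies of the same partition as different chromosomes; encoding cluster centres instead is a common alternative.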

Fig. 2. Problem situations in GA clustering due to overlapping of the probability density functions of clusters in feature space.
If the following criterion equation is optimized, then c (crossover probability) and m (mutation probability) are optimized accordingly.
[Equations (7) and (8), defined in terms of the overlapped space volume in the feature space]

Fig. 3. An example of a generated simulation dataset for three clusters and two features (band 0 and 1) with 16 by 16 pixels of imagery data.
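A dataset of this shape can be generated with a seeded random number generator, as in the sketch below. The cluster means, the common standard deviation, and the rule that assigns pixels to clusters are hypothetical values chosen for illustration; the two bands are also drawn independently here, which is a simplification of a full variance-covariance matrix.

```python
import random

def make_simulation_image(means, sigma, width=16, height=16, seed=0):
    """Return (pixels, labels); each pixel is a (band0, band1) pair drawn from one cluster."""
    rng = random.Random(seed)
    pixels, labels = [], []
    for i in range(width * height):
        c = i % len(means)          # hypothetical assignment: cycle through the clusters
        m0, m1 = means[c]
        pixels.append((rng.gauss(m0, sigma), rng.gauss(m1, sigma)))
        labels.append(c)
    return pixels, labels

# Three clusters, two features, 16 by 16 pixels (means and sigma are assumptions).
means = [(50.0, 50.0), (100.0, 120.0), (150.0, 80.0)]
pixels, labels = make_simulation_image(means, sigma=10.0)
print(len(pixels), pixels[0], labels[:6])
```

Moving the means closer together or increasing `sigma` raises the overlap between clusters, which is the quantity the experiments vary.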

Fig. 4. Data distribution of the simulation dataset in the feature plane for two class cases.
Meanwhile, the variance-covariance matrices are set as follows for the 8 cases of the three-cluster case,

Fig. 6 (a) and (b) show the number of processing unit times and the Percent Correct Clustering (PCC), each as a function of the crossover and mutation probabilities, for the farthest two data distributions of the two-cluster cases, while Fig. 7 (a) and (b) show those for the closest two data distributions of the two-cluster cases.
Fig. 5. Data distribution of the simulation dataset in the feature plane for two class cases.
Fig. 7. PCC and the number of processing unit time for the closest data distribution of two cluster case of the simulation dataset.
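One plausible way to compute a PCC value is sketched below. The paper does not spell out the definition, so it is assumed here that PCC is the percentage of samples assigned to the correct cluster after the arbitrary cluster labels have been matched to the true classes in the most favorable way; the function name and the sample labels are hypothetical.

```python
from itertools import permutations

def percent_correct_clustering(truth, predicted, k):
    """Assumed PCC: best percentage of matches over all relabelings of the k clusters."""
    best = 0
    for perm in permutations(range(k)):
        hits = sum(1 for t, p in zip(truth, predicted) if perm[p] == t)
        best = max(best, hits)
    return 100.0 * best / len(truth)

truth = [0, 0, 1, 1, 2, 2]
perfect = percent_correct_clustering(truth, [2, 2, 0, 0, 1, 1], k=3)
one_error = percent_correct_clustering(truth, [2, 2, 0, 1, 1, 1], k=3)
print(perfect, one_error)
```

Trying every permutation is affordable for two or three clusters; for many clusters an assignment algorithm would be the usual replacement.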

Fig. 8. Relation between overlapped space volume and crossover and mutation probabilities for two cluster datasets.
Fig. 9 (a) and (b) show the number of processing unit times and the PCC, each as a function of the crossover and mutation probabilities, for the farthest three data distributions of the three-cluster cases.
Fig. 10. PCC and the number of processing unit time for the closest data distribution of two cluster case of the simulation dataset.

Fig. 11. Relation between overlapped space volume and crossover and mutation probabilities for three cluster datasets.

TABLE I.
COMPARISONS AMONG GA CLUSTERING PERFORMANCE FOR THREE CLUSTER CASE WITH THE DIFFERENT PARAMETERS OF CROSSOVER AND MUTATION PROBABILITIES AS WELL AS SIMULATED ANNEALING METHOD