A Novel Fuzzy Clustering Approach for Gene Classification

Automatic cluster detection is crucial for real-time gene expression data, where the quantity of missing values and the noise ratio are relatively high. In this paper, algorithms for dynamically determining the number of clusters and performing clustering are proposed without any pre- or post-clustering assumptions. The proposed fuzzy Meskat-Hasan (MH) clustering provides solutions for sophisticated datasets. MH clustering extracts the hidden information of unknown datasets. Based on the findings, it determines the number of clusters and performs seed-based clustering dynamically. The MH Extended K-Means clustering algorithm is a nonparametric extension of the traditional K-Means algorithm and provides a solution for automatic cluster detection, including runtime cluster selection. To ensure accuracy and optimum partitioning, seven validation techniques were used for cluster evaluation. Four well-known datasets were used for validation purposes. In the end, the MH clustering and MH Extended K-Means clustering algorithms were found to outperform traditional algorithms.

Keywords: Meskat-Hasan clustering (MH clustering); MH Extended K-Means clustering; K-Means; fuzzy clustering


I. INTRODUCTION
Clustering divides a dataset based on the attributes or characteristics of the data. The fundamental purpose of clustering is to categorize the data based on their distinguishable attributes. For partitioning clustering, there are a few established soft and hard versions of algorithms. Popular hard clustering algorithms are K-Means, K-Medians, K-Modes and Forgy's algorithm, and popular soft clustering algorithms are Fuzzy C-Means and Fuzzy K-Means (fuzzy clustering). Variants of Fuzzy C-Means, such as the Gath-Geva (GG) and Gustafson-Kessel (GK) clustering algorithms, are also used. Extended versions of the fuzzy clustering algorithm, like E-FCM and Extended GK clustering, exist. There are some other versions of the algorithm, like the Fuzzy K-NN algorithm and the Fuzzy Local Information C-Means clustering algorithm. GrFPCM selects features in the preprocessing step, while FPCM and Granular Computing are used for outlier detection and feature selection [1]. To overcome the high-dimensionality problem, an ant-based algorithm enhanced with FCM and a heap merging heuristic was used in the bioinformatics domain [2]. The gene ontology annotation based GO-FRC algorithm uses biological data for gene clustering, and this method may assign one gene to multiple clusters [3]. A PSO clustering method based on fuzzy point symmetry was used for gene expression classification [4]. FWCMR merges sub-clusters to form a final cluster and is implemented in a parallel and distributed environment [5]. WGFCM uses an entropy-based weight vector calculation to measure distance appropriately [6]. MCSOA, which is based on immune system behavior, uses a new fast convergence mechanism for optimum solutions where the number of clusters varies within a certain range [7]. The Dynamic Time Warping distance technique is useful in shape-based clustering when grouping time series GE data [8]. A fuzzy decision tree algorithm outperforms the classical decision tree algorithm in analyzing cancer GE data [9].
Also, there are techniques for determining the number of clusters, like FLAME clustering. These algorithms help to find the behavior of the dataset and reveal the underlying hidden pattern by grouping similar categories of data based on their characteristics, and most of them need a good initial guess of the number of clusters to perform clustering. Predicting the accurate number of clusters is a challenging task. Many algorithms have been developed, but they depend on some predefined or prior knowledge. For example, K-NN, K-Means and FCM all need a robust initial guess.
It has been found that this scenario was previously solved by applying some post-cluster analysis to predict and select the number of clusters, which requires time and cost. Therefore, a method to determine the number of clusters dynamically is required. We propose two new algorithms, the Meskat-Hasan clustering (MH clustering) algorithm and the MH Extended K-Means clustering algorithm, for automatic cluster detection, dynamic cluster selection and partitioning. Moreover, post-cluster enhancement is not required for them, so they minimize time and cost. The performance of these methods was validated using seven validation criteria, namely Separation Index (S), Partition Coefficient (PC), Dunn's Index (DI), Alternative Dunn Index (ADI), Classification Entropy (CE), Partition Index (SC) and Xie and Beni's Index (XB), and by comparing the results with existing clustering techniques like Fuzzy C-Means, Gath-Geva clustering (GG), the Gustafson-Kessel algorithm (GK), K-Means and K-Median. Four different datasets (Wisconsin Breast Cancer, Leukemia, Irises and Motor Cycle [10]) were used to evaluate the performance of the algorithms.
Finally, the proposed algorithms performed better than the existing algorithms from the literature.

II. LITERATURE REVIEWS
Patrik D'haeseleer [11] described the working principles of gene expression clustering and suggested using more than one clustering algorithm. According to Jain and Dubes [12], there is no single criterion that defines a good clustering algorithm. Clusters are of arbitrary sizes and shapes in a multidimensional pattern space, and clustering quality may be evaluated based on internal criteria or external criteria. James C. Bezdek, Robert Ehrlich and William Full [13] proposed a program for the Fuzzy C-Means (FCM) clustering algorithm to generate prototypes and fuzzy partitions for numerical datasets. This fuzzy partition is useful for suggesting definite substructure in the raw datasets. The idea that the similarity of a point is shared among the nearest clusters through a membership function, whose values range from zero to one, was given by Zadeh [14]. The FMLE algorithm requires good initial seed points because its exponential distance makes the algorithm converge to a local optimum within a rather narrow region. Except for this limitation, the FMLE algorithm is better than the Gustafson and Kessel algorithm [15]. D. E. Gustafson and W. C. Kessel [15] developed a fuzzy clustering algorithm using a fuzzy covariance matrix to support the argument that fuzzy covariance is a natural approach in fuzzy clustering. An expression for the interpretation of the membership functions was proposed by Ruspini [16]. This relationship denotes the similarity between samples and uses a fuzziness parameter, whose increasing value tends to indicate greater fuzziness of the clustering process. Jiye Liang, Xingwang Zhao, Deyu Li, Fuyuan Cao and Chuangyin Dang [17] proposed a clustering algorithm for special datasets, such as mixed datasets containing both numeric and categorical attributes.
They presented a mechanism to characterize the within-cluster and between-cluster entropies and to detect the worst cluster in a particular dataset. Kaiser [18] proposed the eigenvalues-greater-than-one rule, which is now a commonly used criterion for finding the number of factors. It states that the number of reliable factors is equal to the number of eigenvalues greater than one. As negative eigenvalues have negative reliability, the respective composite score should be reliable; in terms of internal consistency, it must have some positive reliability. Norman [19] concluded by suggesting that there will be more reliable components than those indicated by the eigenvalues-greater-than-unity rule. The convergence rate of evolutionary clustering methods is higher than that of partition clustering methods [20]. The comparative analysis in [21] paves the way for choosing the desirable clustering algorithm for a particular dataset.
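As a small illustration of the eigenvalues-greater-than-one rule, the sketch below counts how many eigenvalues exceed one. The function name and the example eigenvalues are hypothetical and are not taken from any dataset used in this paper.

```python
def kaiser_component_count(eigenvalues):
    """Count the eigenvalues that are strictly greater than one."""
    return sum(1 for value in eigenvalues if value > 1.0)


# Hypothetical eigenvalues of a correlation matrix with five variables.
example_eigenvalues = [2.6, 1.3, 0.6, 0.3, 0.2]
retained = kaiser_component_count(example_eigenvalues)
```

Under the rule, only the first two components of this made-up example would be retained.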

III. PROPOSED ALGORITHMS
Clustering is an unsupervised technique for categorizing data elements. Previously, it was not possible to predict the accurate number of clusters without conducting pre-cluster analysis. The proposed techniques have been developed based on principal component analysis. These techniques were then applied to two types of clustering algorithms (fuzzy and hard). Finally, the approaches were validated. The two proposed algorithms, based on fuzzy clustering and hard clustering respectively, are presented in the following subsections.

A. Meskat-Hasan Clustering (MH Clustering) Algorithm
MH clustering is a fuzzy approach for data clustering. It is an integrated package of all the tasks needed to perform clustering, including the determination of the number of clusters and the clustering itself. Most established clustering algorithms have some kind of dependency or need some data-related prediction to know the behavior or characteristics of the dataset. To overcome this, Meskat-Hasan clustering (MH clustering) dynamically determines the number of clusters. MH clustering has four steps, among them normalizing the dataset (Step-1) and determining the number of clusters at run time (Step-3).
In the generalized version of the algorithm, ε = 1 × 10⁻⁶ and m, which ranges from 1 to ∞, is the fuzziness parameter.
Steps 1 to 10 of the algorithm determine the desired number of clusters, and steps 11 to 13 perform the clustering.
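As a rough illustration of the fuzzy clustering stage only, the following sketch implements the standard Fuzzy C-Means iteration, not the MH algorithm itself. It assumes that the number of clusters c is already known, treats ε as a stopping tolerance on the change in memberships (an assumption about its role), and requires m > 1 because the membership formula is undefined at m = 1. The function name and its parameters are illustrative.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, eps=1e-6, max_iter=300, seed=0):
    """Standard Fuzzy C-Means (requires m > 1); returns (centers, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalized to sum to one.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([value / total for value in row])
    centers = []
    power = 2.0 / (m - 1.0)
    for _ in range(max_iter):
        # Center update: mean of the points weighted by membership^m.
        centers = []
        for j in range(c):
            weights = [u[i][j] ** m for i in range(n)]
            total = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ])
        # Membership update from the relative distances to all centers.
        new_u = []
        for p in points:
            dists = [max(math.dist(p, ctr), 1e-12) for ctr in centers]
            new_u.append([
                1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists
            ])
        change = max(
            abs(a - b) for ra, rb in zip(new_u, u) for a, b in zip(ra, rb)
        )
        u = new_u
        if change < eps:  # stop once memberships have stabilized
            break
    return centers, u


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
toy_centers, toy_memberships = fuzzy_c_means(toy_points, c=2)
```

Each row of the returned membership matrix sums to one, so a point can belong partly to several clusters, which is what distinguishes this stage from a hard assignment.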

B. MH Extended K-Means Clustering Algorithm
The MH Extended K-Means clustering algorithm is an extension of the K-Means algorithm. It applies the technique of determining the number of clusters dynamically together with the original K-Means algorithm, that is, it implements the technique in a hard clustering process. The main steps of this algorithm are as follows:
Step-1: Scale dataset
Step-2: Extract underlying structure
Step-3: Approaches to determine the cluster number dynamically
Step-4: Perform clustering by K-Means clustering procedure
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 8, 2020
MH Extended K-Means is designed to determine the number of clusters for a well-separated dataset. It combines dynamic determination of the number of clusters with the traditional K-Means algorithm. K-Means works for a well-separated dataset, but it needs a strong initial assumption about the number of clusters. MH Extended K-Means is designed to overcome this limitation. It accurately determines the number of clusters for a well-separated dataset.
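Steps 1 and 4 can be sketched as follows, under stated assumptions. Min-max scaling is only one possible choice for Step-1, the K-Means routine is the plain Lloyd iteration, and the cluster number k is passed in by hand because the procedure for Steps 2 and 3 is not reproduced here. All function names are illustrative.

```python
import math
import random


def min_max_scale(points):
    """Scale every feature to [0, 1] (one common choice for Step-1)."""
    dims = range(len(points[0]))
    lows = [min(p[d] for p in points) for d in dims]
    highs = [max(p[d] for p in points) for d in dims]
    return [
        [(p[d] - lows[d]) / (highs[d] - lows[d]) if highs[d] > lows[d] else 0.0
         for d in dims]
        for p in points
    ]


def k_means(points, k, max_iter=100, seed=0):
    """Plain Lloyd iteration for Step-4; returns (centers, labels)."""
    rng = random.Random(seed)
    # Initial centers are k distinct points drawn from the data.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for iteration in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if iteration > 0 and new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Hypothetical toy data; k=2 stands in for the output of Step-3.
raw_points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
scaled_points = min_max_scale(raw_points)
km_centers, km_labels = k_means(scaled_points, k=2)
```

Unlike the fuzzy case, every point receives exactly one label here, which is why this variant suits well-separated data.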

C. Predicting the Number of Clusters Based on σ Value
The non-parametric value σ is the limiting criterion indicating the percentage of variation of the dataset. The σ value is determined using principal component analysis: the cumulative sum of the variances of the main components of the dataset is the limiting value of σ. If the value of σ is high, the number of clusters becomes smaller, and vice versa. For example, in the Leukemia dataset, the sum of the three main principal components is 66.3%, indicating 66.3% of the variation of the total dataset. So for the Leukemia dataset we set σ = 66.3, and the algorithm concludes by setting 2 as the number of clusters, so two clusters hold 66.3% of the data. For the Wisconsin Breast Cancer (WBC) dataset, the outcome of the MH clustering algorithm is three clusters and the σ value is 86.7, so three clusters hold at least 86.7% of the data. For the Irises dataset, the number of clusters is two and the σ value is 92.46, so two clusters hold 92% of the data. For the Motor Cycle dataset, the number of clusters is three and the σ value is 85.3448, so three clusters contain 85% of the data. In the MH Extended K-Means clustering algorithm, the number of clusters for the WBC, Leukemia, Irises and Motor Cycle datasets is three, two, two and three, respectively. The MH Extended K-Means clustering algorithm brings 86%, 85%, 92% and 85% of the underlying data into consideration for dynamic determination of the cluster number.
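The cumulative-variance part of this idea can be sketched as below. The sketch only shows how many leading principal components are needed before their cumulative share of variance reaches a given σ; how the algorithms map this to a final cluster number is not shown. The function name and the component variances are hypothetical.

```python
def components_for_sigma(variances, sigma):
    """Return (count, cumulative %) of leading components reaching sigma %."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, variance in enumerate(ordered, start=1):
        cumulative += 100.0 * variance / total
        if cumulative >= sigma:
            return count, cumulative
    # sigma was never reached, so all components are needed
    return len(ordered), cumulative


# Hypothetical component variances (not taken from the paper's datasets).
example_variances = [5.0, 3.0, 1.0, 0.5, 0.5]
count, covered = components_for_sigma(example_variances, sigma=85.0)
```

With these made-up variances the cumulative shares are 50%, 80% and 90%, so three components are needed to reach σ = 85.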

IV. A COMPARATIVE STUDY AMONG CLUSTERING ALGORITHMS
The performance of the algorithms is compared based on average execution time versus the number of clusters. For a given number of clusters, the execution time of each algorithm is obtained, and the comparative study is as follows. In Fig. 1, the proposed MH clustering algorithm takes the lowest time for clustering and MH Extended K-Means clustering takes the highest time. The WBC dataset is not well separated, and for datasets of this nature the proposed fuzzy algorithm takes comparatively less time. In Fig. 2, the MH clustering algorithm takes the lowest time to perform clustering compared to the other fuzzy clustering algorithms. Hard clustering algorithms like K-Means and K-Medoids take comparatively longer than the MH Extended K-Means clustering algorithm. The Leukemia dataset is not well separated, and for datasets of this nature the proposed fuzzy algorithm takes comparatively much less time. In Fig. 3, the Meskat-Hasan (MH) clustering algorithm takes the lowest time. Other fuzzy clustering algorithms like FCM, GG and GK take almost the same time to execute. Besides, hard clustering algorithms like K-Means and K-Medoids take comparatively more time than MH Extended K-Means clustering. In Fig. 4, the Meskat-Hasan (MH) clustering algorithm takes the lowest time and MH Extended K-Means clustering takes the highest time to execute.
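One way such a measurement could be set up is sketched below: a small harness that averages wall-clock time over repeated runs for each number of clusters. This is an assumption about the measurement procedure, not a description of the experiments reported here, and the clustering function is a hypothetical stand-in.

```python
import time


def average_runtime(cluster_fn, data, k, repeats=5):
    """Mean wall-clock seconds of cluster_fn(data, k) over several runs."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        cluster_fn(data, k)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)


def placeholder_clustering(data, k):
    """Hypothetical stand-in: assigns points to k groups round-robin."""
    return [index % k for index in range(len(data))]


# Average execution time for each number of clusters from 2 to 5.
timings = {
    k: average_runtime(placeholder_clustering, list(range(1000)), k)
    for k in range(2, 6)
}
```

Replacing the placeholder with each real algorithm would give one time-versus-clusters curve per algorithm.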

B. MH Over MH Extended K-Means Clustering Algorithm
MH uses the fuzzy approach in cluster implementation, whereas MH Extended K-Means uses the hard clustering approach. MH Extended K-Means performs better when the input dataset is well separated, and MH performs better for a non-linearly separable dataset.

V. CLUSTER VALIDATION
Validation plays an important role in evaluating cluster performance. For validating a partition, the Separation index (S) takes the minimum separation distance, and the smallest value of S indicates a valid optimal partition. The partition coefficient (PC) measures the amount of overlap between clusters, and a high value indicates cluster accuracy. The Dunn Index (DI) is an internal evaluation criterion that identifies well-separated clusters. Higher DI and ADI values indicate better clustering results. CE measures the fuzziness of the cluster partition. Low values of CE and SC reflect good performance. The XB index measures the overall compactness of the clusters, and its smallest value indicates the optimum number of clusters. The validation results of MH clustering are organized in Table I. Table II reveals the performance of the MH Extended K-Means algorithm over the four datasets. As PC indicates the amount of overlap between cluster regions, it yields constant values for MH Extended K-Means, which is a hard clustering algorithm.
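To make the behavior of PC and CE concrete, the sketch below computes both from a membership matrix using their usual definitions, PC = (1/N) Σ u² and CE = −(1/N) Σ u ln u. It is assumed here that these standard forms are the ones used in the validation; the function names and example matrices are illustrative.

```python
import math


def partition_coefficient(u):
    """PC: mean over points of the sum of squared memberships."""
    return sum(value ** 2 for row in u for value in row) / len(u)


def classification_entropy(u):
    """CE: mean over points of -sum(u * ln u), skipping zero memberships."""
    total = sum(
        value * math.log(value) for row in u for value in row if value > 0
    )
    return -total / len(u)


# Hard partition: every membership is 0 or 1.
hard_memberships = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
# Maximally fuzzy partition for two clusters.
fuzzy_memberships = [[0.5, 0.5], [0.5, 0.5]]

pc_hard = partition_coefficient(hard_memberships)
ce_hard = classification_entropy(hard_memberships)
pc_fuzzy = partition_coefficient(fuzzy_memberships)
ce_fuzzy = classification_entropy(fuzzy_memberships)
```

For any hard partition, every row contributes exactly one to the PC sum, so PC is always 1 and CE is always 0 regardless of the data. This is consistent with the constant PC values observed for a hard clustering algorithm.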

A. Analyzing Total Outcomes
All outcomes of both the proposed and existing algorithms are compared based on the evaluation criteria for each dataset, as shown in Table III.
The MH clustering algorithm provides solutions for automatic cluster number detection and run-time cluster selection, and performs fuzzy clustering accordingly. It appropriately determines the number of clusters and produces proper partitioning. The MH clustering algorithm works well for non-linear datasets. The MH Extended K-Means algorithm performs hard clustering on the basis of dynamic cluster number determination. It works well for a clearly separable dataset. Across all results, both the MH and MH Extended K-Means clustering algorithms select the precise cluster number, produce optimum partitioning and perform clustering accordingly. The MH clustering algorithm takes comparatively less time to execute than the other algorithms. Based on the validation techniques, the performance of the MH clustering algorithm is better than that of the other established algorithms. The MH clustering algorithm meets the objective of automatic cluster number detection without using post-cluster analysis, performs clustering accurately and satisfies the time complexity requirement. Although the MH Extended K-Means clustering algorithm takes more time than the MH clustering algorithm, it performs better than the other hard clustering algorithms. Based on the validation results, the MH and MH Extended K-Means clustering algorithms are acceptable.

VII. CONCLUSION
The MH clustering algorithm and the MH Extended K-Means clustering algorithm were conceived to solve the problem of automatically detecting the exact number of clusters and performing clustering accordingly. MH and MH Extended K-Means were applied to both linearly and non-linearly partitioned datasets. The performance of the proposed algorithms was compared with other selected algorithms through cluster validation and performance evaluation. The comparison was based on execution time and validation indexes, and it provides an effective way of selecting an efficient clustering algorithm for a particular dataset. The MH Extended K-Means clustering algorithm performs better for linearly separable datasets, and the MH clustering algorithm performs better for non-linearly separable datasets. Both the MH and MH Extended K-Means clustering algorithms meet the need to determine the number of clusters dynamically and accurately, and they provide better and more efficient results than the selected clustering algorithms.
Here we worked on gene expression datasets, and the algorithms were tested on standalone systems. In the future, we want to work with real-time microarray gene expression datasets, implement the algorithms in a parallel and distributed system, and upgrade them accordingly, so that classification of microarray gene expression data can take computational benefit from cloud infrastructure.