Appraising Research Direction & Effectiveness of Existing Clustering Algorithm for Medical Data

Clustering algorithms have proven applicable and effective in solving practical problems across various sectors. However, the forms that data take have changed significantly over time. This paper briefly describes the different taxonomies of clustering algorithms and highlights the most frequently used techniques in order to gauge research popularity. We also discuss the existing direction of research and find that a significant number of open issues remain when it comes to clustering medical data. We find that existing techniques are largely symptomatic, addressing local problems in clustering, while problems associated with complex medical data are yet to be explored by researchers. As its contribution, this manuscript summarizes the effectiveness of existing clustering techniques for medical data.

Keywords: Medical Data; Clustering Algorithm; k-Means Clustering; Fuzzy; Classification


I. INTRODUCTION
Among the challenges of unsupervised learning, clustering remains one of the biggest to date [1] [2]. Clustering deals with discovering structure in a given set of raw data. In theory, clustering is the technique of organizing objects into groups whose members bear a certain similarity to each other. A good clustering technique identifies the internal grouping in a given set of raw data. The effectiveness of a clustering result is framed by the user, who also provides the convergence criterion. Clustering algorithms are applied in many areas, e.g. biology, city planning, libraries, marketing, and the study of natural calamities [3]. For a clustering algorithm to be robust, it should discover arbitrarily shaped clusters, be scalable, and handle high-dimensional data. It should offer good usability and interoperability and be insensitive to its inputs. Most importantly, it should be able to counter the adverse effects of noise and outliers. A clustering algorithm can also be called robust if it is able to manage a large and diverse number of attributes. Finally, it should demand little domain knowledge to evaluate input attributes [4] [5]. However, conventional clustering techniques also have certain pitfalls, for example:
1) Effectiveness depends heavily on spatial features (this is usually observed in distance-based clustering).
2) Existing clustering techniques cannot fulfil all clustering and classification demands.
3) Defining a specific distance measure in multi-dimensional spaces is quite challenging.
4) Existing techniques suffer on higher-dimensional data owing to greater time complexity.
Although the outcome of any clustering algorithm can be interpreted in multiple ways, it is hardly possible even to identify the correct number of outcomes for higher-dimensional data. Clustering algorithms are used in various fields, but applying clustering in medical science is highly challenging. The input to a clustering technique could be any form of medical data, and the purpose could be anything from segmentation to classification of a specific disease condition. Medical data can originate in diverse forms (signal, image, dataset, wavelet, etc.). Medical images are quite different from natural images because they are captured by specific data-capturing devices. Their formats therefore differ considerably, which calls for specific forms of medical image processing. They may also include a higher amount of noise and distortion, which can affect data quality. Hence, clustering medical data is one of the challenging problems in medical image processing.
In recent times, a significant amount of research has been carried out on introducing clustering techniques for various forms of data. However, with the evolution of complex medical data capturing devices and analysis, medical data inputs are no longer as simple as they were ten years ago. They must be analyzed carefully in order to support effective clustering. The prime aim of this paper is to discuss the effectiveness of existing clustering techniques for medical data. The discussion draws on standard research papers and their contributions towards solving clustering problems.
Section II gives a fundamental briefing on clustering techniques, followed by existing research trends in Section III. Section IV discusses recent clustering techniques and studies their effectiveness. Open research problems are discussed in Section V, while Section VI summarizes the work and its future direction.

www.ijacsa.thesai.org

II. ABOUT CLUSTERING TECHNIQUES
Clustering is a mechanism that groups data into logical groups of certain significance [6]. One of its prime beneficial characteristics is adaptability. The prime goal of a clustering algorithm is to transform a collection of data into more meaningful data, so that the data residing in the same group or cluster share a certain logic [7] [8]. The majority of clustering algorithms aim to reduce the distance between objects within the same cluster (intra-cluster distance) and increase the distance between different clusters (inter-cluster distance) [9]. Clustering is also termed data segmentation owing to its ability to differentiate objects, which also results in the identification of outliers. Clustering is used in various fields, e.g. machine learning, pattern recognition, false detection, and analysis of the business market. In the majority of analyses, the clustering tree is represented using a dendrogram. Clustering is also frequently described as a data mining technique that uses an unsupervised approach to group data. For a given set of unlabeled data, a clustering technique mainly explores the internal grouping.
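To make the intra-cluster and inter-cluster distances concrete, the following is a minimal sketch that measures both on two toy clusters. The function names, the choice of average pairwise distance for the intra-cluster measure, and centroid-to-centroid distance for the inter-cluster measure are assumptions made for illustration; the paper does not prescribe specific definitions.

```python
from itertools import combinations
from math import dist


def mean_intra_cluster_distance(cluster):
    """Average pairwise Euclidean distance between points of one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:  # a singleton cluster has no internal spread
        return 0.0
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(axis) / n for axis in zip(*cluster))


def inter_cluster_distance(a, b):
    """Euclidean distance between the centroids of two clusters."""
    return dist(centroid(a), centroid(b))


# Hypothetical toy data: two tight, well-separated groups.
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
cluster_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

intra = mean_intra_cluster_distance(cluster_a)
inter = inter_cluster_distance(cluster_a, cluster_b)
print(f"intra-cluster: {intra:.3f}, inter-cluster: {inter:.3f}")
```

A grouping is usually considered good when the first number is small relative to the second, which is the case for this toy data.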
In theory, there are five types of clustering techniques: i) hierarchical methods, ii) partitioning methods, iii) grid-based methods, iv) machine learning methods, and v) algorithms for high-dimensional data [10]. The frequently exercised clustering mechanisms are shown in Fig. 1. The different forms of clustering techniques are briefly discussed as follows:

A. Hierarchical clustering
A pre-determined ordering of clusters is formed either from top to bottom (divisive) or from bottom to top (agglomerative) in hierarchical clustering. It is normally represented by a dendrogram. Fig. 2 shows the two forms of hierarchical clustering. Agglomerative clustering starts with single-point groups and then iteratively combines the two or more most appropriate clusters. It computes the similarity coefficient for all pairwise patterns. With each pattern initially in its own class, it then combines clusters to form new clusters and computes the respective distances or similarity scores. This step is repeated until k clusters remain, where k can also be one. Similarly, divisive clustering starts with one cluster and then iteratively divides the most appropriate cluster. It starts dividing from the top cluster, which is split with the aid of a flat clustering algorithm [11]. This mechanism is repeated until singleton clusters are reached. Research papers use the terms AGNES (agglomerative nesting) and DIANA (divisive analysis) for the agglomerative and divisive hierarchical clustering algorithms respectively [12]. AGNES and DIANA are the opposite of each other. The beneficial attributes of such algorithms are i) simple implementation that offers good outcomes and ii) little pre-defined information required about the number of clusters. The limitations of such techniques are:
1) A merge or split cannot be undone; the algorithm cannot return to a prior state if required.
2) Computational resources increase with the number of data points.
3) Depending on the spatial factor selected for combining clusters, the algorithm has trouble splitting large clusters, is highly sensitive to outliers, and finds it hard to manage heterogeneity in clusters. In many cases, determining the precise number of clusters is very difficult.
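The agglomerative procedure described above can be sketched as follows. This is a minimal illustration, assuming single linkage (distance between the closest pair of points) as the merge criterion and a brute-force search for the closest pair of clusters; the function name and the toy points are hypothetical, and published AGNES implementations differ in linkage and efficiency.

```python
from math import dist


def agglomerative(points, k):
    """Merge the two closest clusters (single linkage) until k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[p] for p in points]  # start with one point per cluster
    while len(clusters) > k:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge j into i; this cannot be undone
        del clusters[j]                  # j > i, so index i stays valid
    return clusters


# Hypothetical toy data: two pairs of close points and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative(points, 3))
```

The nested search over all cluster pairs at every merge step illustrates limitation 2): the cost grows quickly with the number of data points.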

B. Partition Methods of Clustering
This technique partitions a database of objects into a specific number of clusters. Partitioning techniques use iterative optimization over k clusters. They are further classified into k-means and k-medoids approaches. k-means appears in most research work, as it is quite simple to incorporate into the majority of research problems. It is also one of the simplest algorithms for extracting the required number of clusters using centroids. Fig. 3 shows the conventional representation of the partitioning process. The technique places no restrictions on the types of parameters, which are governed by the location of a predetermined fraction of the coordinates within the cluster. Therefore, the grouping of nationalities by food habit shown in Fig. 3 can easily be done using k-means clustering.

C. Density Based Clustering
This is another frequently used clustering technique that relies on the density of the data points. It is aimed at discovering clusters of arbitrary shape, along with noise, in a distance-based dataset. A cluster is constructed only where the number of neighbors of a point is at least a minimum number of data points. Fig. 4 highlights a typical case of density-based clustering. It uses an iterative process to form a cluster. One of the prominent pitfalls of this technique is that it cannot group data well when the cluster densities in the dataset differ widely. The technique distinguishes three forms of objects: classified, non-classified, and noise. A respective cluster id is always used for every classified object as well as every noise object. However, no cluster id is used for non-classified objects. The example in Fig. 4 shows density-based clustering being used to separate unhealthy tissue or a lesion from healthy tissue. It could further explore sub-regions of different colors within the unhealthy tissue, which could in turn benefit association or classification operations. An advantage of density-based clustering is that the number of clusters need not be fixed a priori and clusters of arbitrary shape can be managed comfortably. However, it is inapplicable to heterogeneous densities, and its outcome depends heavily on the spatial measure used.
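The labelling scheme described above (classified, non-classified, noise) can be sketched with a small DBSCAN-style routine. DBSCAN is used here as a representative density-based algorithm; the paper does not name a specific one, so that choice, the parameter names `eps` and `min_pts`, and the toy points are assumptions made for illustration.

```python
from math import dist

NOISE = -1  # cluster id reserved for noise objects


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None marks a non-classified object

    def neighbours(i):
        # All points within eps of point i (the point itself included).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:       # too few neighbours: provisionally noise
            labels[i] = NOISE
            continue
        labels[i] = cluster_id         # i is a core point, start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:     # former noise becomes a border point
                labels[j] = cluster_id
            if labels[j] is not None:  # already classified, do not expand
                continue
            labels[j] = cluster_id
            nj = neighbours(j)
            if len(nj) >= min_pts:     # j is also a core point: keep growing
                queue.extend(nj)
        cluster_id += 1
    return labels


# Hypothetical toy data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Note that the number of clusters is never passed in; it emerges from `eps` and `min_pts`, and a single global `eps` is also why clusters of very different densities are handled poorly.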

D. K-Nearest-Neighbour (KNN)
The KNN algorithm is also known as a memory-based technique, as the samples required for training must be fed in beforehand and are processed at run time. The algorithm is used in mining operations. KNN can manage different forms of continuous parameters, and it works with similar capability over discrete properties. All parameters contribute to the distance computation, and as many of them as possible are considered. However, relationships between the parameters are not considered when computing the similarity metric. This is the prime cause of errors in the distance measure, which significantly affect classification accuracy. The benefits of the KNN algorithm are its simple implementation and fast training. Its drawbacks are its dependence on a large database, a slower validation process, and higher sensitivity to noise.
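The memory-based behaviour described above can be sketched in a few lines: training only stores the labelled samples, and all distance computation happens at query time. The function name, the majority-vote rule, and the labels "healthy" and "lesion" are hypothetical choices made for illustration.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Return the majority label among the k stored samples nearest to query.

    train is a list of (point, label) pairs that is simply kept in memory.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled samples.
train = [((0, 0), "healthy"), ((0, 1), "healthy"), ((1, 0), "healthy"),
         ((5, 5), "lesion"), ((5, 6), "lesion"), ((6, 5), "lesion")]

print(knn_predict(train, (0.5, 0.5)))
print(knn_predict(train, (5.5, 5.5)))
```

Every query sorts the whole training set by distance, which reflects the dependence on a large stored database and the slow run-time step noted above. All features enter the Euclidean distance with equal weight, so any relationship between them is ignored.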

E. K-Means Clustering
k-means clustering appears in the majority of clustering work. The technique is iterative in nature and classifies the given data into k disjoint clusters. Fig. 6 shows the technique of k-means clustering for a given set of original data. The effectiveness of k-means clustering is normally assessed using the squared error within a cluster. k-means forms compact clusters, but it does not consider the distance between clusters. From a theoretical viewpoint, the squared $\ell_2$ norm leads to higher sensitivity to large errors. This means that the formulation is less robust from a statistical viewpoint. The k-means algorithm is frequently used largely because of its simple implementation and computational efficiency. It also has very low memory utilization and is relatively easy to understand compared to other existing clustering techniques. For well-separated datasets, it offers higher precision and better cluster compactness than hierarchical clustering. However, it also has limitations: it does not resolve overlapping clusters, it depends heavily on pre-determined information, its initial clusters are selected randomly, it is applicable only when a mean value can be defined, and it copes poorly with outliers and noisy data.
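The iteration and the squared-error measure mentioned above can be sketched as follows. This is a minimal version of the standard assign-then-update loop, assuming the caller supplies the initial centroids (real implementations usually pick them at random, which is the random-selection limitation noted above); the function names and toy points are hypothetical.

```python
from math import dist


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(axis) / len(cl) for axis in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters


def sse(centroids, clusters):
    """Within-cluster sum of squared Euclidean distances."""
    return sum(dist(p, c) ** 2 for c, cl in zip(centroids, clusters) for p in cl)


# Hypothetical toy data and starting centroids.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centroids, clusters = kmeans(points, [(0, 0), (10, 0)])
print(centroids, sse(centroids, clusters))
```

Because each distance is squared in `sse`, a single far-away outlier contributes disproportionately to the objective, which is the sensitivity to large errors described above.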

F. Fuzzy C-Means
The use of fuzzy logic in clustering has been witnessed over the last decade. This form of algorithm uses a spatial attribute to assign each data point a membership value with respect to the center of each group. The nearer a data point is to the center of a cluster, the higher its membership in that cluster. Under the probabilistic approach, the memberships of each data point sum to 1. Fig. 7 shows the mechanism of clustering in this case. A benefit of fuzzy c-means clustering is that membership is assigned with respect to every cluster center. Moreover, the fuzzy c-means algorithm is highly applicable to datasets with overlapping clusters, where it works better than the conventional k-means algorithm. However, the technique also has limitations, e.g. the Euclidean distance weights unequal features inappropriately, and it involves more iterative steps. Dependence on pre-determined information is another pitfall of this algorithm.

Although there are other significant types of clustering algorithm, they have been less explored by the research community since 2010. Another significant trend is that the investigations were carried out on diversified forms of data, most of which are images. There is also less specialized work on clustering for the detection and diagnosis of complex medical conditions. Most recently, certain standard review papers, e.g. [14] [15] [16], have reviewed different research work carried out on clustering techniques. But nowhere do they show how strongly clustering techniques are used on medical data or any other form of complex data. With the increasing number of dynamic users, the formation, processing, and distribution of such data will be quite complex to handle. Even the frequently used k-means algorithm has seldom been applied to the complicated problems associated with medical images. On the other hand, a considerable amount of work has been carried out using Artificial Neural Networks. Hence, it can be said that most research work since 2010 has used the k-means clustering algorithm, followed by the probabilistic approach, the co-clustering approach, neural networks, and evolutionary techniques. Apart from these, other techniques have received less attention to date. Therefore, machine learning and partition-based clustering techniques are predominant in the existing system and can be regarded as the existing research trend. However, the existing survey papers do not address the predominant clustering techniques of recent times, and hence it is quite challenging to understand the effectiveness of existing clustering techniques. The next section discusses the existing research techniques, with brief highlights of the existing problems, the techniques adopted to solve them, and the associated advantages and limitations.

Al-Dmour and Al-Ani [17] combined unsupervised and semi-supervised classification approaches to perform automatic segmentation. The authors also used a median filter as well as fuzzy c-means attributes for clustering. A technique called subtractive clustering is used to minimize computational complexity. Adoption of fuzzy clustering was also seen in the work of Proietti et al. [18], which applies a kernel-based membership function. The study claims to extract unconstrained structure. Clustering also plays a significant role in maintaining the resolution of an image. Al-Qizwini et al. [19] used subspace similarity as well as manifold clustering. Applying subspace clustering assists in extracting low-rank clusters along with the use of Principal Component Analysis (PCA). Finally, training and testing are carried out on natural images, and the outcomes were assessed using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM). Although clustering techniques benefit from an abundance of data, they can encounter significant problems if the data is incomplete. One such investigation into clustering a given set of incomplete data was carried out by Li et al. [20]. The authors used the k-means clustering algorithm as well as a k-median method to perform clustering. The technique also applies minimax optimization to reduce complexity. The study outcome was assessed using the number of wrongly estimated values for different clustering mechanisms. Ahmad [21] applied a fuzzy clustering algorithm to breast cancer detection. The technique applied fuzzy c-means clustering together with an existing technique for computing the distance between two feature values. Clustering approaches were also studied with respect to the transfer function. Such research was carried out by Zhang et al. [22], where affinity propagation is studied over histograms of intensity gradient magnitude in order to generate the transfer function. The study showed that this clustering technique assists in achieving better accuracy in clustering outcomes and also reaches the convergence point faster on medical images.
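Since several of the studies above (e.g. [17], [21]) build on fuzzy c-means, a minimal sketch of that baseline may help when reading them. It uses the standard membership and center updates with fuzzifier m; the function names, the fixed iteration count, the caller-supplied initial centers, and the toy points are assumptions made for illustration, and the sketch is not a reconstruction of any cited author's method.

```python
from math import dist


def memberships(points, centers, m=2.0):
    """u[i][j] is the membership of point i in cluster j; each row sums to 1."""
    u = []
    for p in points:
        d = [dist(p, c) for c in centers]
        if 0.0 in d:  # point sits exactly on a center: full membership there
            zeros = d.count(0.0)
            u.append([1.0 / zeros if x == 0.0 else 0.0 for x in d])
            continue
        inv = [x ** (-2.0 / (m - 1.0)) for x in d]  # closer center, larger weight
        total = sum(inv)
        u.append([v / total for v in inv])
    return u


def update_centers(points, u, m=2.0):
    """Each center is the mean of all points weighted by membership ** m."""
    dims = len(points[0])
    centers = []
    for j in range(len(u[0])):
        w = [row[j] ** m for row in u]
        total = sum(w)
        centers.append(tuple(
            sum(wi * p[k] for wi, p in zip(w, points)) / total for k in range(dims)
        ))
    return centers


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    for _ in range(iterations):
        u = memberships(points, centers, m)
        centers = update_centers(points, u, m)
    return centers, memberships(points, centers, m)


# Hypothetical toy data and starting centers.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centers, u = fuzzy_c_means(points, [(1.0, 0.0), (9.0, 0.0)])
print(centers)
print([round(sum(row), 6) for row in u])
```

Unlike k-means, every point keeps a graded membership in every cluster, which is why the method suits overlapping groups; the repeated full updates also show why it needs more iterative work.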

El-Khamy et al. [23] presented a study that clusters brain images in order to identify a suspected mass. The technique uses the fuzzy c-means algorithm together with a conformed threshold to enhance clustering performance. The study outcome shows higher accuracy and lower processing time. Kitrungrotsakul et al. [24] used a clustering approach for segmentation that significantly reduces the graph scale in order to increase optimization speed. Shabanzadeh et al. [25] used biogeography-based optimization to cluster real-life datasets. The study outcome was found to perform better than existing clustering and optimization techniques. Haraty et al. [26] enhanced k-means clustering to extract diversified patterns from medical data. The algorithm also uses a greedy approach, and the outcomes were evaluated with respect to the number of items in the dataset, the F-measure, the coefficient of variance, etc. Hou and Lin [27] used subspace clustering for image retrieval. The technique uses low-rank representation and a matrix completion algorithm to perform automatic tag completion. Sparse subspace clustering was also used in the work of Wen et al. [28]. The technique additionally utilizes the total variation method and forms a non-convex optimization model. It is mainly used for recovering images as well as clustering images that have incomplete information. Zhan et al. [29] presented a clustering technique for medical images using graph theory. The technique uses a weighted representation of the medical image to form a complete graph that is then pruned. The study outcome was assessed using the F-score. Subspace clustering was also used in the work of Ziko et al. [30], where a visual descriptor was created. Supervised data is added during the clustering process, which further minimizes the errors in the results. Aghabozorgi et al. [31] used time-series data to formulate a hybrid clustering algorithm along with k-medoids. The study outcome is assessed using accuracy over the cardinality of the datasets. Harchaoul et al. [32] used the fuzzy c-means algorithm to overcome the problems of overlapping clustering. Schultz and Kindlmann [33] presented a technique for three-dimensional image analysis using spectral clustering. The technique was implemented on medical images. Boulemnadjel and Hachouf [34] presented a subspace clustering technique for medical images. Paul et al. [35] presented a simplified clustering technique that assists in the detection of specific diseases. The authors used constraint k-means and k-mode clustering to achieve this. Sulaiman and Isa [36] presented an image segmentation technique using fuzzy k-means clustering. The interesting point is its applicability to different forms of images.

More inclination towards recursive-based approach: Most studies in the existing system use recursive functions, which call for a larger number of iterative steps to reach convergence or meet the objective function. Existing studies have been evaluated only with respect to time complexity, and very few with respect to space complexity. Few studies consider using a non-recursive approach in the optimization process.
Although a certain amount of work has been carried out towards enhancing clustering techniques, it is evident that the majority of it has limitations (Table 1 of Section IV). Classification of disease conditions with faster response time and lower computational complexity is the critical demand on clustering techniques for medical images. Few analytical models have been designed using any of the existing clustering techniques to enhance classification performance. Moreover, multidimensional techniques could further improve disease classification when formulating novel clustering techniques. Such techniques could be used to cluster medical data for complex disease conditions. However, in the existing system, the term medical data mostly corresponds to images only. A closer look at the existing system also shows that various clustering techniques offer lower time complexity. However, there is no evidence that such claims still hold when the environment changes. This implies lower applicability in the physical world than in research work. A closer look at the existing system also shows that complex medical datasets are rarely adopted. Even with general medical data, multiple image modalities are rarely found. It was also observed that various clustering techniques presented in the past apply the k-means clustering algorithm to MRI images, which are normally large. In fact, the majority of such schemes are similar to one another. Most of these techniques provide non-intuitive classification outcomes. Such outcomes are never completely understood by the radiologist or attending physician in real-time practice. For better outcomes, it is necessary to infer the clinical outcomes using simple rules. Unfortunately, complex medical data, e.g. gene expression data, are normally collected in high-dimensional format. Such data not only have a higher value of veracity but also contain a greater extent of outliers and noise. Therefore, it is quite challenging to design and develop a technique that can deal with such complicated clustering issues.

VI. CONCLUSION & FUTURE WORK
Clustering is a good way to deal with the classification of large amounts of data by forming logical groups. This paper discusses the theoretical aspects of clustering, its applications, and its taxonomies. By reviewing the existing clustering schemes, we find that they use common databases and that no clustering algorithm represents disease heterogeneity. Moreover, existing algorithms are quite specific to the medical database used. However, to cope with the rising demands on clustering, algorithms need to start analyzing databases of complex disease conditions and to address disease heterogeneity. An algorithm should also work on multiple forms of complex datasets with nearly similar outcomes. Finally, the paper highlights the open research problems associated with clustering medical data. Our future work will be directed at finding a robust solution to these open research issues. Our first approach will be to develop a novel prioritizing scheme to select the best sub-cluster from complex medical data, followed by application of an enhanced fuzzy logic to the informative sub-cluster extracted from the complex medical data. A novel labeling technique will be formulated to assist in extracting normal as well as abnormal regions. The subsequent approach will be to formulate a framework for disease classification that addresses the problems associated with multi-tier clustering. A novel multi-modal scheme will be developed for extracting significant features from complex medical data. A study-specific optimal pattern selection strategy will be designed to obtain multiple patterns from the data. This step could be further enhanced by extracting a multi-modal regional feature representation for each subject from multiple pattern spaces. We will also develop a new technique of Subclass Clustering-Based Feature Selection that applies supervised learning to perform classification. Our final phase of the study will be to formulate clustering framework for

Fig. 7. Fuzzy c-Means Clustering

III. EXISTING RESEARCH TREND
This section discusses the existing research trends in clustering techniques. For this purpose, we survey research papers published from 2010 to date in IEEE Xplore. We find 14,730 conference papers and 2,090 journal papers associated with the problems and enhancement techniques of clustering. For an elaborated understanding, we use Fig. 8, which furnishes two types of information: i) a complete classification of clustering algorithms and ii) the total number of research papers specific to each type of clustering algorithm. It is widely known that clustering is particularly useful for pattern recognition, spatial data analysis, image processing, economic science, document classification, data mining, etc. [13]. The survey outcome shows that k-means (published journal papers: 399, published conference papers: 4226) is the most widely adopted technique in classification, followed by the probabilistic technique (published journal papers: 216, published conference papers: 804).

Fig. 8. Research Trend towards Clustering Algorithm

IV. EXISTING RESEARCH WORK
This section discusses the existing research techniques that have been used for enhancing clustering performance. The use of clustering techniques on medical data is mainly associated with improving image processing operations, e.g. segmentation. The work carried out by Al-Dmour and Al-Ani

[Fig. 1: Frequently Used Clustering, with branches Hierarchical Clustering, Partition Methods of Clustering, Density Based Clustering, K-Nearest-Neighbour (KNN), K-Means Clustering, and Fuzzy C-Means]

TABLE I. SUMMARY OF EXISTING CLUSTERING TECHNIQUE

V. OPEN RESEARCH ISSUES
This section discusses the open research issues identified after reviewing the standard clustering techniques as well as some of the significant research carried out in recent times.