Applications of Clustering Techniques in Data Mining: A Comparative Study

In modern scientific research, data analysis is a widely used tool across computer science, communication science, and biological science. Clustering plays a significant role in data analysis. Clustering, recognized as an essential problem of unsupervised learning, deals with segmenting the structure of data in an unknown region and is the basis for further understanding. Among the many clustering algorithms (more than 100 are known), the K-means clustering algorithm is commonly used because of its simplicity and rapid convergence. This paper explains the different applications, literature, challenges, methodologies, and considerations of clustering methods, together with related key objectives for implementing clustering with big data. It also presents one of the most common clustering techniques for identifying data patterns by performing an analysis of sample data.
Keywords: Clustering; data analysis; data mining; unsupervised learning; k-mean; algorithms


I. INTRODUCTION
Data mining is a recent interdisciplinary field of computational science. It is the process of discovering interesting information from large amounts of data stored in data warehouses, databases, or other information repositories, and of automatically discovering data patterns in massive databases [1], [2]. Data mining refers to the extraction or "mining" of valuable information from large data volumes [3], [4]. Nowadays, people come across massive amounts of information and store or represent it as datasets [4], [5]. Process discovery is the learning task of constructing process models from the event logs of information systems [6]. By performing data mining, interesting insights, observable behaviours, or high-level information can be extracted from the database and viewed or browsed from various angles. The knowledge discovered can be applied to process control, decision making, information management, and query handling. Decision-makers can use these methods to make clearer decisions about real-world problems. In data mining, many data clustering techniques are used to trace a particular data pattern [2]. Data mining methods are shown in Fig. 1.
Clustering techniques are useful meta-learning tools for analyzing the knowledge produced by modern applications. Clustering algorithms are used extensively not only for organizing and categorizing data but also for data modelling and data compression [7]. The purpose of clustering is to classify the data into groups according to similarities, traits, characteristics, and behaviours [8]. Cluster analysis is an essential activity in knowledge discovery and data mining. Clustering can be carried out in an unsupervised, semi-supervised, or supervised manner [2]. However, more than 100 clustering algorithms are known, and selecting among them to obtain better results is challenging.
PyClustering is an open-source library for data mining written in Python and C++, providing a wide variety of clustering methods and algorithms, including bio-inspired oscillatory networks. PyClustering focuses primarily on cluster analysis and aims to make it more user friendly and understandable. Many methods and algorithms are available in the C++ namespace "ccore::clst" and in the Python module "pyclustering.cluster." Some of the algorithms and their availability in the PyClustering module are listed in Table I [9].

A. Clustering in Data Mining
Data volumes continue to expand exponentially in various scientific and industrial sectors, and automated categorization techniques have become standard tools for data set exploration [10]. Automatic categorization techniques, traditionally called clustering, help to reveal a dataset's structure [9]. Clustering is a well-established unsupervised data mining method [11], and it deals with the discovery of structure in a collection of unlabeled data. The overall process followed when developing an unsupervised learning solution is summarized in Fig. 2. The main applications of unsupervised learning are:
• Simplifying datasets by aggregating variables with similar attributes.
• Detecting anomalies that do not fit any group.
• Segmenting datasets by some shared attributes.
Clustering results in a reduction of the dimensionality of the data set. The objective of a clustering algorithm is to identify the distinct groups within the data set [12]. There are different clustering approaches, such as hierarchical, partitional, grid-based, density-based, and model-based [13]. The performance of the various methods can differ depending on the type of data used for clustering and the volume of data available [14]. For example, document clustering has been investigated for use in many different areas of text mining and information retrieval [15]. There are several different quality metrics, and the relative ranking and performance of different clustering algorithms can vary considerably depending on which measure is used. Two kinds of measures of cluster "goodness" or quality are used. One type allows different cluster sets to be compared without external knowledge and is called an "internal quality measure." The other type is called an "external quality measure," which evaluates how well the clustering works by comparing the groups generated by the clustering technique to known classes. Fig. 3 shows a simple example of data clustering based on data similarity.
www.ijacsa.thesai.org
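The difference between the two kinds of quality measure can be illustrated with a minimal Python sketch. The choice of the silhouette coefficient as the internal measure and purity as the external measure is an assumption made for illustration, as are the function names and the toy data; the text above does not name specific measures.

```python
import math
from collections import Counter


def silhouette_score(points, labels):
    """Internal measure: mean silhouette, computed from data and labels only.

    Assumes at least two clusters are present.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def purity(labels, true_classes):
    """External measure: share of points that match their cluster's majority class."""
    by_cluster = {}
    for lab, cls in zip(labels, true_classes):
        by_cluster.setdefault(lab, Counter())[cls] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(labels)


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]                    # output of some clustering method
true_classes = ["a", "a", "a", "b", "b", "a"]  # external knowledge

internal = silhouette_score(points, labels)
external = purity(labels, true_classes)
```

The internal score needs only the points and their cluster labels, whereas the external score cannot be computed without the known classes.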

1) Types of clustering:
Clustering can generally be broken down into two subgroups:
• Hard Clustering: In hard clustering, each data point either belongs entirely to a cluster or does not belong to it at all.
  o For example, each customer is grouped into exactly one of 10 groups.
• Soft Clustering: In soft clustering, each data point is assigned a probability or likelihood of belonging to each cluster instead of being placed in a single cluster.
  o For example, each customer is assigned a probability of belonging to each of the 10 clusters.
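The distinction can be sketched in a few lines of Python. This is only an illustration: the cluster centres are hypothetical, and turning distances into soft memberships with a softmax over negative squared distances is one assumed choice among several.

```python
import math


def hard_assignment(point, centres):
    """Hard clustering: return the index of the single nearest centre."""
    distances = [math.dist(point, c) for c in centres]
    return distances.index(min(distances))


def soft_assignment(point, centres, temperature=1.0):
    """Soft clustering: return one membership weight per centre (weights sum to 1).

    Assumption: weights are a softmax over negative squared distances.
    """
    scores = [-(math.dist(point, c) ** 2) / temperature for c in centres]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical centres of two customer groups in a 2-D feature space.
centres = [(0.0, 0.0), (4.0, 0.0)]

hard = hard_assignment((1.0, 0.0), centres)   # one cluster index
soft = soft_assignment((1.0, 0.0), centres)   # one weight per cluster
tie = soft_assignment((2.0, 0.0), centres)    # point halfway between the centres
```

A point halfway between two centres receives equal soft memberships, while the hard assignment still has to pick one cluster for it.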
2) Clustering methodologies: Since clustering is subjective, it can be used to accomplish many objectives. Every methodology follows its own set of rules that define the "similarity" between data points. Cluster analysis is not an automated task but an iterative process of knowledge discovery, or multi-objective collaborative optimization, involving trial and error [16]. More than 100 clustering algorithms are known, but only a few of them are popularly used. Some of the clustering methodologies are listed in Table II. The best known and most widely used partitioning method is K-means [17]-[19], an unsupervised and iterative data mining approach [11]. The standard approach of partitioning techniques is to identify a cluster centre that represents each cluster. K-means clustering is a method of cluster analysis that partitions the data points into k clusters in which each observation belongs to the cluster with the nearest mean [7]. The most significant advantage of the K-means algorithm in data mining applications is its efficiency in clustering large data sets. K-means and its variants have a computation time that is linear in the number of records, but they are thought to discover inferior clusters [15].
The K-means algorithm is a basic iterative clustering algorithm. Given the number of classes K in the data set and an initial set of centroids, it represents each class by its centroid and uses distance as the metric, computing the distance from each object to every centroid. In the K-means partitioning algorithm, the centre of each cluster is the mean value of the objects within the cluster.

TABLE II. CLUSTERING METHODOLOGIES

Method: Distance-based method
Algorithms: Partitioning algorithms (K-means, K-medians, K-medoids); hierarchical algorithms (agglomerative, divisive)
Description: These algorithms run iteratively to find local optima and are easy to understand, but they do not scale well to large datasets.

Method: Grid-based algorithm
Description: Individual regions of the data space are formed into a grid-like structure. These methods use a single uniform grid mesh to separate the entire problem domain into cells. Each cell represents the data objects located within it using a collection of statistical attributes of those objects.

Method: Density-based method
Algorithms: Density-Based Spatial Clustering of Applications with Noise (DBSCAN); Ordering Points To Identify the Clustering Structure (OPTICS)
Description: These algorithms scan the data space for areas with different densities of data points. They isolate the different density regions and assign the data points within each region to the same cluster.

Method: Probabilistic and generative models
Algorithms: Expectation-Maximization algorithm
Description: These methods model the data as coming from a generative process. Often these models suffer from over-fitting. A prominent example is the Expectation-Maximization algorithm, which uses multivariate normal distributions.


II. BACKGROUND AND DISCUSSION OF CLUSTERING APPLICATIONS AND APPROACHES
Cluster analysis has many applications in different domains. For example, it has been popularly used as a preprocessing or intermediate step for other data mining tasks, such as generating a compact summary of data for classification, pattern discovery, hypothesis generation and testing, compression, reduction, and outlier detection. Clustering analysis can also be used in collaborative filtering, recommendation systems, customer segmentation, multimedia data analysis, biological data analysis, social network analysis, and dynamic trend detection. Some of the clustering techniques and approaches are discussed in Table III.
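As one illustration of the outlier-detection use mentioned above, a density-based method such as DBSCAN (listed in Table II) labels points in low-density regions as noise. The following is a minimal Python sketch of the usual DBSCAN procedure, not an optimized implementation; the parameter names eps and min_pts and the sample points are assumptions chosen for the example.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue  # already processed
            labels[j] = cluster_id
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point, keep expanding
        cluster_id += 1
    return labels


# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (4, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
outliers = [p for p, lab in zip(points, labels) if lab == NOISE]
```

With these settings the two dense groups form two clusters and the isolated point is reported as an outlier. The neighbour search here compares every pair of points, so the sketch is suitable only for small data sets.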

A. Requirement and Challenges
Despite recent efforts, clustering "mixed and categorical" data in the context of big data remains a challenge, due to the lack of an inherently meaningful similarity measure between categorical objects and the high computational complexity of current clustering techniques [18]. For cluster analysis, there are several items to be considered. Some of them are listed in Table IV.
Typically, there are multiple ways to use or apply clustering analysis:
• As a stand-alone tool to get insights into data distribution.
• As a preprocessing (or intermediate) step for other algorithms.
Some advantages and limitations of clustering techniques are listed in Table V.
According to [24], [25], parallel classification is a better approach for big data, but the complexity of its implementation remains a significant challenge. The MapReduce framework can be suitable for implementing parallel algorithms, but there is still no algorithm that handles all the challenges of big data. In [26], the authors proposed a novel Spark extreme learning machine (SELM) algorithm based on the Spark parallel framework to boost the speed and enhance the efficiency of the whole process. In the reported experimental results, SELM gives the highest speed and minimal error compared to the Parallel Extreme Learning Machine (PELM) and an improved Extreme Learning Machine (ELM*). Table VI presents the pros and cons of different clustering algorithms with real-world applications. For large datasets, it has been concluded that K-means clustering is more efficient in terms of time and space complexity and is order-independent, whereas hierarchical clustering is more versatile but has the disadvantage that the time and space complexity of a hierarchical agglomerative algorithm grow at least quadratically with the number of records.

Reference: [22] Zengyou He, Xiaofei Xu, Shengchun Deng, Bin Dong
Algorithms: K-mean, K-modes, K-Histogram
Objective: Compare different clustering algorithms to determine an efficient clustering algorithm for categorical datasets.
Findings: K-Histogram extends K-means to categorical domains by replacing the means of clusters with histograms. In general, K-Histogram is similar to the K-modes algorithm, but it is more stable and converges faster.

Objective: Discover an efficient clustering by comparing Divisive and Agglomerative Hierarchical Clustering with K-means.
Findings: To obtain high accuracy, Agglomerative Clustering with K-means is the practical choice. Divisive clustering with K-means also works efficiently where each cluster can be taken as fixed.

Objective: Solve clustering problems with a hybrid approach of the Genetic algorithm with K-means.
Findings: A hybrid approach of K-means with a Genetic algorithm efficiently addresses the problems of K-means, e.g., producing empty clusters depending on the initial centre vectors and converging to a non-optimal value.

Reference: [7] Manish
Findings: K-means is faster than all the other algorithms compared in that study. When using a huge dataset, K-means and EM give better results than hierarchical clustering.

Reference: [11] Karthikeyan B., Dipu Jo George, G. Manikandan, Tony Thomas
Algorithms: K-means, Agglomerative Hierarchical Clustering
Objective: Comparative research to determine the best-suited algorithm between K-means and Agglomerative Hierarchical Clustering.
Findings: K-means is best suited for larger datasets in terms of minimum execution time and rate of change in memory usage. It is also concluded that agglomerative clustering is best suited for smaller datasets due to its overall minimum consumption of memory.
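To make the bottom-up merging behind the agglomerative method in the comparisons above concrete, here is a minimal Python sketch. The use of single linkage (the distance between the closest pair of members) and the sample points are assumptions made for illustration; the compared studies' linkage choices are not stated here.

```python
import math


def agglomerative(points, k):
    """Merge the two closest clusters (single linkage) until k clusters remain.

    Returns a list of clusters, each a sorted list of point indices.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x].extend(clusters[y])  # merge cluster y into cluster x
        del clusters[y]                  # y > x, so index x stays valid
    return [sorted(c) for c in clusters]


# Hypothetical data: two close pairs and one distant point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerative(points, k=3)
```

Every round compares all pairs of remaining clusters, which illustrates why this family of methods becomes expensive as the number of records grows.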

Clustering technique: Data-mining clustering algorithms
Advantages: Implementation is simple.
Limitations: Compromises on user's privacy. Does not deal with a large amount of data.

Clustering technique: Dimension reduction
Advantages: It is very fast, reduces the dataset, and optimizes the cost of the treatment.
Limitations: It must be applied before the classification algorithm. It cannot provide an efficient result for high-dimensional datasets. It may lose some amount of data.

Clustering technique: Parallel classification
Advantages: It gives minimal execution time and is more scalable.
Limitations: Difficult to implement. It does not perform well for graphs, iterative and incremental processing, multiple inputs, etc.

A car manufacturing company wants to identify the purchase behaviours of its customers to see which product is selling best and how its customers proceed towards a purchase. It currently looks at each customer's details and, based on this information, decides which product's manufacturing should be increased and which customer behaviours can help the company improve sales of other products by starting a promotional campaign or increasing the availability of resources.
Today, a company can potentially have millions of customers. It is not possible to look at each customer's data individually and then make a decision; a manual process would take a huge amount of time. This is where K-means clustering offers a convenient way to analyze data automatically. The K-means clustering algorithm uses a fixed number of clusters [12], [28]. It starts by partitioning the data into the chosen number of clusters and then improves the partitions iteratively to find the patterns in the data. Let D = {D1, D2, …, Dn} be the set of data points and Y = {Y1, Y2, …, Yt} be the set of centres. This clustering technique is implemented and analyzed using the K-means clustering tool in WEKA. The steps of the K-means algorithm are listed at the end of this section. The data set used for the K-means clustering example focuses on a fictional car dealership. The dealership is starting a promotional campaign for slow-selling units, whereby it is trying to push resources to its valuable customers. Table VII shows the sample dataset used for the analysis.
In Table VII, every row shows the purchase behaviour of a customer. For example, some customers went to the dealership without visiting a showroom, did some computer searching, were mostly interested in the Toyota Harrier, and purchased it without financing. This kind of understanding of customer behaviour helps Toyota to manage its sales. K-means clustering allows the company to perform the analysis with little effort by finding patterns in a given dataset, as shown in Fig. 4 and Fig. 5. Fig. 4 shows that, in cluster 3, 100% of customers went to the dealership, 45% also went to the showroom, and 100% did computer searching. The majority of these customers (63%) showed interest in the Fortuner, 45% showed interest in the Harrier, and the least interest (9%) was in the Corolla. These customers, all of whom ended up financing and purchasing a product, consistently went to the dealership and did computer searching before buying an SUV.
Meanwhile, in cluster 4, only 32% of customers went to the dealership, whereas 100% went to the showroom and 24% also did computer searching. All of these customers (100%) were interested in the Corolla, 32% showed interest in the Fortuner, and the least interest (3%) was in the Harrier; 56% of them asked for financing details, and 82% ended up purchasing a product. These are customers looking for a small family car, i.e., the Corolla, who mostly approach the showrooms.

5. Recalculate the new cluster centre by $Y_i = \frac{1}{c_i}\sum_{j=1}^{c_i} D_j$, where $c_i$ represents the number of data points in the $i$-th cluster and the sum runs over the data points assigned to that cluster.
6. Once again, find the distance between each data point and the new cluster centres.
7. If no data points were reassigned, then stop; otherwise, go back to step 3.
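The iterative K-means procedure (assign every data point to its nearest centre, recompute each centre as the mean of its points, and stop when no point is reassigned) can be sketched in plain Python as follows. This is a minimal illustration and not the WEKA implementation used for the analysis: the binary rows are hypothetical and are not taken from Table VII, the column names are assumed, and the initial centres are supplied by the caller to keep the example deterministic.

```python
import math


def kmeans(points, initial_centres, max_iter=100):
    """Plain K-means; returns (centres, labels)."""
    centres = [list(c) for c in initial_centres]
    k = len(centres)
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: index of the nearest centre for every point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        if new_labels == labels:
            break  # no data point was reassigned
        labels = new_labels
        # Update step: each centre becomes the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster is empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres, labels


# Hypothetical binary customer records (1 = yes, 0 = no), assumed columns:
# dealership, showroom, computer_search, financing, purchase
customers = [
    (1, 0, 1, 1, 1),
    (1, 1, 1, 1, 1),
    (1, 0, 1, 1, 1),
    (1, 1, 1, 1, 1),
    (0, 1, 0, 0, 1),
    (0, 1, 0, 1, 1),
    (0, 1, 0, 0, 0),
    (0, 1, 1, 0, 1),
]
centres, labels = kmeans(customers, initial_centres=[customers[0], customers[4]])
```

Because the attributes are binary, each coordinate of a centre is the fraction of customers in that cluster with the attribute set. In this hypothetical run the first centre is [1.0, 0.5, 1.0, 1.0, 1.0], which reads as 100% of that cluster visiting the dealership and 50% visiting the showroom, the same kind of interpretation applied to the clusters in Fig. 4.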

IV. CONCLUSION
This paper describes the different algorithms and methodologies used to handle large and small sets of data. Clustering groups data based on their characteristics and similarities. As described in the preceding sections, many clustering techniques are used to partition data into a set of clusters. Algorithm selection should depend on the properties and the nature of the data collection because each algorithm has its pros and cons. This shows that no single algorithm can manage all clustering challenges. However, some algorithms provide a near-optimal solution when they are well suited to the challenges of the problem. To achieve high efficiency in terms of time and space, K-means would be the best choice for large and categorical data. Nevertheless, the time and memory complexity of clustering algorithms still needs to be reduced by improving them. A combined approach of the Genetic Algorithm with K-means can resolve almost all the issues of K-means. The Genetic K-means Algorithm (GKA) speeds up convergence to a global optimum, and GKA is reported to be faster than evolutionary algorithms.

V. FUTURE DIRECTIONS AND OPEN ISSUES
To date, data mining and knowledge discovery have been evolving into an essential technology for businesses and researchers in numerous domains. Although data mining is extremely powerful, it faces many difficulties in use. The problems may relate to performance, data, and the methods and techniques used. The data mining process becomes effective when the challenges or issues are identified accurately and resolved appropriately.
Some of the challenges and future directions are as follows:
• Efficiency and Scalability of Algorithms: Data mining algorithms must be efficient and scalable to extract information from huge amounts of data in the database. As a future direction, a parallel formulation of an Improved Rough k-means algorithm could be developed to enhance the efficiency of the algorithm.
• Privacy and Security: Data mining can lead to serious issues in terms of data security, privacy, and governance, for instance, when a retailer reveals its customers' purchasing details without their permission. As a future direction, a single cache system and DES (Data Encryption Standard) techniques need to be developed within clustering algorithms to improve the privacy and security of data in the cloud.
• Complex Data Types: The database may include complex data elements, objects with graphical data, temporal data, and spatial data. Mining these types of data is not practical on a single device.
• Performance: The performance of a data mining system depends on the efficiency of the algorithms and techniques used. If the algorithms and techniques are not well designed, the performance of the data mining process suffers.
Therefore, as a future direction, a new hybrid approach combining an Improved Rough k-means Algorithm and the Genetic Algorithm needs to be introduced to improve performance and handle complex data. A combination of partitioning and hierarchical clustering algorithms may also increase the accuracy of data analysis.