A Generic Methodology for Clustering to Maximise Inter-Cluster Inertia

This paper proposes a clustering methodology that yields results with a higher inter-cluster inertia and therefore a better clustering. The methodology builds on MC-DBSCAN, an algorithm that has already shown its efficiency in clustering exercises, and couples it with an iterative process that can self-adjust the weights of the pertinent criteria. At each iteration, this allows the objects of the two closest clusters to be reclassified and the precision of the clustering to be self-evaluated during the clustering process. Experiments on the well-known benchmarks 'Seismic', 'Landform-Identification' and 'Image Segmentation' compare the performance of the proposed methodology with that of other algorithms (K-means, EM, CURE and MC-DBSCAN). The experimental results show that the proposed solution produces clusterings of good quality.
Keywords—MC-DBSCAN; iterative process; inter-cluster inertia; unsupervised precision-recall metrics


I. INTRODUCTION
Nowadays, Data Mining [1] has established itself as one of the effective techniques for searching and retrieving information from very large databases. Like other traditional search operations, it aims to analyze a set of raw data in order to extract information that can be considered knowledge and thus become exploitable. More specifically, the data mining field supplies solutions to problems of description, estimation, prediction, association, segmentation, classification and clustering [2]-[4]. Among these, the state of the art shows that clustering and classification are the two most fundamental tasks in Data Mining.
Supervised classification always depends on a pre-established reference database. When a dataset has no reference classification, unsupervised classification techniques, called 'clustering', are used instead [5]. Clustering techniques [6] include partition-based methods [7], hierarchical methods [8], [9], grid-based methods [10], model-based methods [11] and density-based methods [12]. In that sense, Jain argues in his recent works that 'There is no best clustering algorithm' [13], [14]. Furthermore, practice shows that the performance of an algorithm depends on choosing the tool and adapting it to the constraints of the problem.
The present paper proposes a generic methodology, built around an iterative process, that improves the results of a clustering exercise. To that end, the density-based algorithm MC-DBSCAN [15] is used as the main clustering algorithm. This choice is justified by the performance MC-DBSCAN has shown on multi-criteria clustering problems. Specifically, in an earlier study, the MC-DBSCAN algorithm obtained Accuracy values [17], [18] of 93% and 34% on the 'Vehicle-Silhouettes' and 'Iris' databases, respectively [15]. Although accuracy levels are high, some more or less misclassified elements are detected.
For this purpose, the performance of the proposed solution is assessed against the clustering results obtained with the MC-DBSCAN [15], CURE [8], EM [19] and K-Means [20] algorithms, which respectively represent density-based, hierarchical, model-based and partitioning clustering.
The remainder of this paper is organized as follows. Section 2 describes the original MC-DBSCAN algorithm; Section 3 presents the proposed generic clustering methodology in detail; Section 4 explains the experimental results and discussions; Section 5 draws conclusions.
www.ijacsa.thesai.org

II. MC-DBSCAN ALGORITHM
MC-DBSCAN is an improved version of DBSCAN [16] designed to solve the multi-criteria problem in clustering. Multi-criteria data are defined on different scale types, with weights that vary according to the importance of each criterion. This capacity largely influenced the choice of algorithm for this work, since MC-DBSCAN offers the possibility of adjusting the weights of the pertinent criteria at each iteration.
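As an illustration of how weighted criteria on different scales might enter a distance computation, the following Python sketch rescales each criterion by its observed range and applies a per-criterion weight. It is not the MC-DBSCAN similarity itself, which is defined in [15]; the function names, the range-based rescaling and the example weights are assumptions made for this example.

```python
from math import sqrt


def criterion_ranges(objects):
    """Return max - min for every criterion (column) of the data set."""
    columns = list(zip(*objects))
    return [max(c) - min(c) for c in columns]


def weighted_distance(a, b, weights, ranges):
    """Weighted Euclidean distance between two multi-criteria objects.

    Each difference is divided by the range of its criterion so that
    criteria measured on different scales become comparable.
    This is an assumed form, not the similarity used by MC-DBSCAN.
    """
    total = 0.0
    for x, y, w, r in zip(a, b, weights, ranges):
        diff = (x - y) / r if r else 0.0  # a constant criterion adds nothing
        total += w * diff * diff
    return sqrt(total)


# Hypothetical data: two criteria on very different scales
objects = [(1.0, 200.0), (2.0, 220.0), (9.0, 900.0)]
ranges = criterion_ranges(objects)
weights = [0.7, 0.3]  # hypothetical weights, chosen to sum to 1

d_near = weighted_distance(objects[0], objects[1], weights, ranges)
d_far = weighted_distance(objects[0], objects[2], weights, ranges)
print(d_near, d_far)
```

Changing the entries of `weights` between iterations changes which objects fall within a given neighbourhood radius, which is the lever the proposed methodology relies on.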
The MC-DBSCAN algorithm is composed of the following steps:

[Step listing and table of functions and their meanings (Similarity, Min and Max) not recovered]

III. PROPOSED METHODOLOGY
The proposed solution is a generic methodology that can use other types of clustering algorithms. However, for the reasons given in the previous sections, MC-DBSCAN proves to be the appropriate tool. In substance, the methodology is a model that operates iteratively to achieve the clustering. The iterative process depends on the quality of the clusters obtained in the previous iteration. In other words, whether the process continues relies on an automatic comparison of the quality of the results of two consecutive iterations. The solution consists of three main steps.
In the first phase, the MC-DBSCAN algorithm is run with the default values of its input parameters to produce a preliminary classification. The clusters obtained serve as input data for the next stage, which is an iterative classification procedure.
The second phase analyses and assesses the results obtained in order to detect similarity between the clusters. The quality of the classification is assessed by computing the similarity between clusters: two clusters with a high similarity rate have, conversely, a lower inter-class inertia [21] value (1). This situation results from one of two possible scenarios: either the objects of the two closest classes should belong to the same class, or some objects were misclassified, ending up in one class when they belong to the other, and vice versa.
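The search for the two closest clusters can be sketched in Python as follows. The sketch assumes a Ward-type inter-class inertia, (p_A p_B / (p_A + p_B)) d²(G_A, G_B), with the weight of a cluster taken to be its number of objects; this form is an assumption chosen to match the weights and centers named in the paper, and the function names are hypothetical.

```python
from itertools import combinations


def centroid(points):
    """Center of gravity of a list of points (tuples of equal length)."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def inter_cluster_inertia(cluster_a, cluster_b):
    """Assumed Ward-type inertia between two clusters.

    p_A and p_B are taken as the cluster sizes, G_A and G_B as centroids.
    """
    p_a, p_b = len(cluster_a), len(cluster_b)
    g_a, g_b = centroid(cluster_a), centroid(cluster_b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(g_a, g_b))
    return (p_a * p_b) / (p_a + p_b) * sq_dist


def closest_pair(clusters):
    """Return the labels of the two clusters with the lowest inertia."""
    return min(
        combinations(sorted(clusters), 2),
        key=lambda pair: inter_cluster_inertia(clusters[pair[0]], clusters[pair[1]]),
    )


# Hypothetical clusters in two dimensions
clusters = {
    "A": [(0.0, 0.0), (0.0, 2.0)],
    "B": [(1.0, 1.0), (3.0, 1.0)],
    "C": [(10.0, 10.0), (12.0, 10.0)],
}
pair = closest_pair(clusters)
print(pair)
```

The pair returned is the candidate for reclassification in the next iteration, since a low inertia between two clusters signals high similarity.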
The proposed model overcomes these classification anomalies by identifying the pertinent criteria (2), those that amplify the similarity between the two classes, and by readjusting their weights for the following classification using the AHP method [22], [23].
The third phase evaluates two consecutive iterations. Concretely, it compares the quality of the results of the last two iterations. If the quality of the results of iteration (i) is better than that of iteration (i-1), the classification process continues in order to improve the precision of the classes; if not, the process restores the results of the previous iteration (i-1) and takes them as the final classification.
To assess the overall quality of the results, the state of the art offers several metrics, which can be grouped into two categories. The first category comprises methods that depend on the availability of a reference database. The second category comprises methods that do not use a reference database [24], namely inertia-based methods [21], Dunn [25], DB [26], Silhouette [27] and so on.
However, these methods are limited when evaluating the quality of the results in some clustering cases, as noted in the work of Kassab [28].
To overcome this limitation, Lamirel et al. [29]-[32] proposed improvements to the reference-based measures (Recall, Precision and F-measure) that make them suitable for unsupervised classification.
However, their approach was designed for clustering applied to text data. The present paper therefore also proposes improvements to the unsupervised Recall, Precision and F-measure so that they can be applied to all types of data.
The principle of this work is to measure the homogeneity of the classes by studying how the intervals of each criterion are distributed within these classes. Consequently, each class is characterized by a set of intervals for which the ratio between their weight inside the considered class and their weight in the whole partition is maximal.
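One possible reading of this interval-based characterization is sketched below in Python for a single criterion. The values are cut into equal-width intervals, and each interval is attributed to the class that holds the largest share of the objects falling in it. The equal-width binning, the number of intervals and all names are assumptions made for the example; the paper does not specify how the intervals are built.

```python
from collections import Counter, defaultdict


def interval_index(value, low, high, n_bins):
    """Map a value to one of n_bins equal-width intervals over [low, high]."""
    if high == low:
        return 0
    idx = int((value - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)  # the maximum value goes in the last interval


def characteristic_intervals(values, labels, n_bins=4):
    """Attribute each interval of one criterion to a class.

    Returns {interval index: (class label, share)}, where share is the
    number of objects of that class in the interval divided by the number
    of objects of the whole partition in the interval.
    """
    low, high = min(values), max(values)
    per_interval = defaultdict(Counter)
    for value, label in zip(values, labels):
        per_interval[interval_index(value, low, high, n_bins)][label] += 1
    owners = {}
    for idx, counts in per_interval.items():
        label, count = counts.most_common(1)[0]
        owners[idx] = (label, count / sum(counts.values()))
    return owners


# Hypothetical criterion values and cluster labels
values = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1, 1.9, 2.0]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
owners = characteristic_intervals(values, labels, n_bins=2)
print(owners)
```

In this toy case the lower interval is attributed to class A even though it contains one object of class B, so its share is below 1; such impure intervals are what lower the homogeneity of a class.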
The global values of unsupervised Recall (4), Precision (5) and F-measure (6) are calculated as follows (Table 2):
[Equations (4)-(6) not recovered]
The following chart summarizes the process of the proposed methodology (Fig. 2).
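The control flow of the three phases can be sketched in Python as a loop that keeps iterating while the quality improves and otherwise falls back to the previous result. The three callables stand for the clustering step, the weight readjustment and the quality measure; their names, signatures and the `max_iter` safeguard are hypothetical, and the toy functions below exist only to make the example run.

```python
def iterative_clustering(data, weights, cluster_fn, quality_fn, reweight_fn, max_iter=20):
    """Iterate while iteration i is better than iteration i-1.

    cluster_fn(data, weights)      -> clusters (phase 1, e.g. MC-DBSCAN)
    reweight_fn(clusters, weights) -> new weights (phase 2)
    quality_fn(clusters)           -> float, higher is better (phase 3)
    All three callables are hypothetical placeholders.
    """
    best_clusters = cluster_fn(data, weights)
    best_quality = quality_fn(best_clusters)
    for _ in range(max_iter):
        candidate_weights = reweight_fn(best_clusters, weights)
        candidate = cluster_fn(data, candidate_weights)
        quality = quality_fn(candidate)
        if quality <= best_quality:
            break  # no improvement: keep the result of the previous iteration
        best_clusters, best_quality, weights = candidate, quality, candidate_weights
    return best_clusters, weights, best_quality


# Toy stand-ins, purely illustrative
def toy_cluster(data, weights):
    threshold = weights[0]
    return {
        "low": [x for x in data if x < threshold],
        "high": [x for x in data if x >= threshold],
    }


def toy_quality(clusters):
    # Gap between the two groups: a larger gap means better separation
    if not clusters["low"] or not clusters["high"]:
        return 0.0
    return min(clusters["high"]) - max(clusters["low"])


def toy_reweight(clusters, weights):
    return [weights[0] + 1.0]


data = [1.0, 2.0, 3.0, 8.0, 9.0]
clusters, weights, quality = iterative_clustering(
    data, [2.5], toy_cluster, toy_quality, toy_reweight
)
print(clusters, weights, quality)
```

Keeping the best result in separate variables makes the rollback to iteration (i-1) implicit: a rejected candidate is simply discarded.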

IV. EXPERIMENTAL RESULTS AND DISCUSSION
A. Databases Used
The performance of the proposed generic methodology and that of the other algorithms, namely EM, CURE, K-means and MC-DBSCAN, is evaluated using the well-known reference databases 'Seismic', 'Landform-Identification' and 'Image Segmentation' (Table 3). The three databases come from the 'UCI Machine Learning Repository'.

B. Assessment Measures
To evaluate and compare the performance of the proposed methodology, we use the standard metrics computed from the confusion matrix: Precision, the number of objects correctly assigned divided by the total number of objects assigned; Recall, the number of objects correctly assigned divided by the total number of objects that should be assigned; and F-measure, the harmonic mean of precision and recall.
Precision rates a cluster by the proportion of its data that carry the properties specific to that cluster. Consequently, the more the data associated with a cluster share specific common properties, the more similar they are to each other, and the stronger the homogeneity within the clusters.
Recall measures the completeness of a cluster's contents with respect to the properties that are specific to it. The more a cluster has a set of specific properties that are exclusive to it, the more it differs from the other clusters, and the stronger the heterogeneity between clusters.
The F-measure, also called F-score, combines precision and recall as their harmonic mean.
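The standard definitions quoted above translate directly into a few lines of Python. The sketch below works from confusion-matrix counts for one class; the function name and the example counts are hypothetical.

```python
def precision_recall_f(true_positive, false_positive, false_negative):
    """Precision, recall and F-measure from confusion-matrix counts.

    precision = correctly assigned / all assigned
    recall    = correctly assigned / all that should be assigned
    F-measure = harmonic mean of precision and recall
    """
    assigned = true_positive + false_positive
    expected = true_positive + false_negative
    precision = true_positive / assigned if assigned else 0.0
    recall = true_positive / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 40 objects correctly assigned, 10 wrongly assigned,
# 20 that should have been assigned but were not
p, r, f = precision_recall_f(true_positive=40, false_positive=10, false_negative=20)
print(p, r, f)
```

Because the harmonic mean is pulled towards the smaller of its two arguments, a method cannot obtain a high F-measure by excelling on only one of the two factors.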

C. Results and Discussion
Table 4 below shows the results of the tests performed (Precision, Recall, F-measure) on the three test databases (a), (b) and (c). In these tests, the input parameters of the first three algorithms (EM, CURE and K-means) keep their default values, except the parameter giving the number of clusters, which is set according to the information supplied with the reference databases. The 'Precision' factor shows a marked contrast between the results of the proposed methodology and those of the other existing algorithms. The proposed methodology obtains values of 83.5%, 84% and 89% on the 'Image Segmentation', 'Landform-Identification' and 'Seismic' databases, respectively. On the other hand, the three other algorithms obtain values fluctuating between 23% and 83%, even though the number of clusters is predefined in these algorithms.
On the one hand, these results show that the precision levels of the clusters obtained exceed 80% (the value required for sufficient cluster homogeneity). This outcome indicates a high homogeneity within the clusters produced by the proposed methodology. On the other hand, the methodology improved on the results of the MC-DBSCAN algorithm used alone: cluster homogeneity improved by between 6% and 21%, depending on the test database.
Regarding the 'Recall' factor, the proposed methodology gives an average of 87% over the three test databases, against average values of 61%, 66%, 53% and 57% for the MC-DBSCAN, K-means, CURE and EM algorithms, respectively. Exceptionally, on the third database, 'Seismic', the K-means algorithm produced clusters whose sets of specific properties are exclusive to them: its 'Recall' reaches 100% (against 95% for the proposed methodology).
Moreover, the improvement provided by the proposed methodology is considerable: 26% compared with the result given by the MC-DBSCAN algorithm. This improvement stems from the iterative corrections, which allow items misclassified in previous iterations to be reclassified.
Overall, the harmonic mean of the 'Precision' and 'Recall' factors over the three test databases shows an improvement of 20%, 48%, 35% and 36% compared with the MC-DBSCAN, K-means, CURE and EM algorithms, respectively, which highlights the relevance of the proposed methodology.

V. CONCLUSION
Because evaluating the quality of a clustering is a recurring difficulty, many approaches are used to estimate the performance of clustering results. The state of the art puts forward evaluation approaches based on the judgment of an expert, the use of labeled data when available, the comparison with a reference classification, or the computation of various indices that generally rely on the relations between intra-cluster and inter-cluster distances. Even though these approaches give relatively satisfying results in some projects, they reveal their limits in certain clustering exercises. The proposed methodology appears to be an alternative solution that overcomes these limitations. It relies on an iterative process that improves the results of a clustering exercise by achieving a higher inter-cluster inertia. To that end, the MC-DBSCAN algorithm was used as the main clustering algorithm.
It should be noted that this methodology improved the inter-class inertia; nevertheless, to achieve a better precision of the clusters, it would be worthwhile to include a parallel evaluation that optimizes intra-cluster quality and yields better homogeneity.
In addition, beyond the MC-DBSCAN algorithm, the proposed methodology could also help improve the performance and precision of other multi-criteria decision-aid algorithms, as long as they offer the possibility of adjusting the weights of the criteria from one iteration to the next.
$p_A$ and $p_B$ are respectively the weights of the two clusters A and B; $G_A$ and $G_B$ are respectively the centers of the two clusters A and B.
$I(C_i)$ and $I(C_j)$ represent respectively the average distance between the elements and the center of class 'i' and of class 'j'. $I(C_i, C_j)$ represents the distance between the centers of the two classes 'i' and 'j'.

N
Number of appearance of Inti,j within the other classes.
of class C possessing the property Inti,j.
of the partition P possessing the property Inti,j.P : Set of proper classes.

TABLE II. PROPERTIES OF UNSUPERVISED RECALL AND PRECISION

TABLE III. DESCRIPTIONS OF DATASETS

TABLE IV. NUMERICAL RESULTS OF EM, CURE, K-MEANS, MC-DBSCAN AND THE PROPOSED METHODOLOGY