A Multi-Criteria Decision Method in the DBSCAN Algorithm for Better Clustering

This paper presents a solution based on unsupervised classification for multiple-criteria data analysis problems in which the characteristics and the number of clusters are not predefined, and in which the objects of the data sets are described by several criteria that can be contradictory, of different natures, and of varied weights. The work draws on two research tracks: unsupervised classification, which is one of the data mining techniques, and multi-criteria clustering, which belongs to the field of multiple-criteria decision-making. Experimental results on different data sets are presented to show that the clusters formed by an improved DBSCAN algorithm that incorporates a similarity model are dense and accurate.
Keywords: Data mining; Clustering; Density-based clustering; Multiple-criteria decision-making


I. INTRODUCTION
Many studies have shown that multiple-criteria analysis of data is an effective approach in classification for extracting information efficiently from large databases described by several criteria, which are sometimes of different natures [1], [2]. Several algorithms built on different principles have been used for this purpose. UTADIS [3], [4], [5] is the first and only method belonging to the unique-criterion synthesis approach; because it relies on utility functions, it applies only to cardinal data. Among the first assignment methods based on the outranking-relation approach, Trichotomic segmentation [6] and N-tomic (A Support System for Multicriteria Segmentation Problems) [7] had a limited number of categories and a fuzzy assignment. Electre-Tri [8], [9], [10], on the other hand, has a rather strong explanatory character and can handle any number of categories. There have been many developments since then [12], but always with fuzzy assignment, ordinal sorting, and a preorder structure. The filtering method based on fuzzy preference then introduced the fuzzy assignment approach and a binary preference relation. The most recent techniques, based on fuzzy indifference modeling, are PROAFTN [13], [14], [15] and TRINOMFC [16], which are nominal sorting methods that require no particular structure.
However, all of these methods rely on supervised learning as their basic principle. This tendency is confirmed by the studies of D' Henriet [16], Zopounidis [2], Belacel [17], and others, who survey the various multiple-criteria classification algorithms and place them in the family of multiple-criteria assignment based on supervised learning.
Despite the strengths of algorithms based on supervised classification, their contribution remains limited for problems in which the information or the experience in the domain is insufficient to predefine the clusters. To overcome this problem, some studies have begun to exploit unsupervised learning.
In this sense, F. Anuska [1] opens this line of research by raising the multiple-criteria clustering problem and proposes solution attempts based on:
• the reduction of the multiple-criteria clustering problem to a clustering problem with a single criterion obtained as a combination of the criteria;
• the application of clustering techniques to the groupings obtained by running single-criterion clustering algorithms on each criterion;
• the application of constrained clustering algorithms, in which one chosen criterion serves as the clustering criterion and all the others determine the constraints;
• the modification of a hierarchical algorithm so that it solves the problem directly.
However, the indirect solutions proposed by F. Anuska lead to NP-complete problems. Even the direct solutions based on a hierarchical clustering method would be limited, because hierarchical clustering algorithms are efficient only when the size of the data set does not exceed 100 objects [18], and they are suited to specific problems in domains that involve the separation or regrouping of objects, such as taxonomy in biology and the natural evolution of species [19].
Then Y. De Smet [21] and Rocha [20] used partition-based clustering algorithms such as K-means. The first improved the K-means algorithm [22] by integrating a preference structure (P, I, J), a triplet of binary relations in which P models strong preference, I indifference, and J incomparability. The second, more recent, proposed an approach for classifying a set of alternatives into a set of partially ordered categories by using the K-means method; these categories are then ranked according to their centroids by using an ordinal classification process such as ELECTRE [23]. K-means is well known and, with a large number of variables, may be computationally faster than other clustering methods (if K is small). However, partitioning methods require a fixed number of clusters, which can be difficult to predict. Moreover, this method is based on distance calculation, which requires a metric to be established [24].
Taking into consideration all the limits mentioned above, this paper proposes an unsupervised density-based clustering algorithm. The algorithm contributes to solving the clustering problem in a multidimensional way by using the DBSCAN algorithm [25] and integrating a similarity model inspired by multiple-criteria decision analysis [26], [27], [28], [29], [30], [31]. The density-based approach makes it possible to work on large databases without determining beforehand the nature and the number of clusters. Much work exists in this family of clustering, for example the algorithms DGLC [32], OPTICS [33], DENCLUE [34], WaveCluster [35], CLICKS [36], CURD [37], and DBSCAN [38]. The choice of DBSCAN is justified by the fact that, beyond supporting several types of data, including spatial data, it is particularly effective when clusters touch or when noise is present. It is also effective for detecting non-convex clusters [38], [39]. It should also be stressed that working with the unmodified version of DBSCAN leaves the result of this work usable by all other improved DBSCAN algorithms, such as OPTICS [33], DVBSCAN [40], VDBSCAN [41], DBCLASD [42], LDD-DBSCAN [43], NDCMD [44], and ST-DBSCAN [45].

II. PROPOSED APPROACH: INVOLVING THE MULTI-CRITERIA CONCEPT IN THE DBSCAN ALGORITHM
Data mining research and multiple-criteria decision-making each have specific and limited strengths. The hybrid algorithm (modified DBSCAN) therefore combines the strengths of both in solving clustering problems.

A. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN [25] (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering technique for discovering clusters of arbitrary shape as well as distinguishing noise. DBSCAN accepts a radius value Eps ($\varepsilon$), based on a user-defined distance measure, and a value MinPts for the minimal number of points that should occur within the Eps radius.
The following concepts and terms explain the DBSCAN algorithm as presented in [25]:
• Eps-neighborhood: The Eps-neighborhood of a point p is $N_{Eps}(p) = \{ q \in D \mid dist(p, q) \le Eps \}$, with D the database of n objects.
• Core object: A core object contains at least a minimum number MinPts of other objects within its Eps-neighborhood.
• Directly density-reachable: A point p is directly density-reachable from a point q if $p \in N_{Eps}(q)$ and q is a core point.
• Density-reachable: A point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points $p_1, \dots, p_n$, with $p_1 = q$ and $p_n = p$, such that $p_{i+1}$ is directly density-reachable from $p_i$ with respect to Eps and MinPts for $1 \le i < n$, $p_i \in D$.
• Connectivity: For all points p and q of a cluster, p is density-connected to q with respect to Eps and MinPts (that is, both are density-reachable from a common point).
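The definitions above translate fairly directly into code. The following Java sketch is a minimal, unoptimized DBSCAN under stated assumptions: points are `double[]` vectors, the distance is Euclidean, a point counts itself in its own Eps-neighborhood, and the names `DbscanSketch`, `neighborhood`, `cluster`, `NOISE`, and `UNVISITED` are illustrative choices rather than anything prescribed by [25].

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Minimal DBSCAN sketch; names and label conventions are illustrative assumptions. */
class DbscanSketch {
    static final int UNVISITED = -2;
    static final int NOISE = -1;

    /** Eps-neighborhood of point p: indices q with dist(p, q) <= eps (p itself included). */
    static List<Integer> neighborhood(double[][] data, int p, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int q = 0; q < data.length; q++) {
            if (euclidean(data[p], data[q]) <= eps) {
                result.add(q);
            }
        }
        return result;
    }

    /** Euclidean distance; assumed here, any user-defined distance could be substituted. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns one label per point: a cluster id (0, 1, ...) or NOISE. */
    static int[] cluster(double[][] data, double eps, int minPts) {
        int[] labels = new int[data.length];
        Arrays.fill(labels, UNVISITED);
        int clusterId = 0;
        for (int p = 0; p < data.length; p++) {
            if (labels[p] != UNVISITED) {
                continue;
            }
            List<Integer> seeds = neighborhood(data, p, eps);
            if (seeds.size() < minPts) {
                labels[p] = NOISE; // not a core point; may later become a border point
                continue;
            }
            labels[p] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int q = queue.poll();
                if (labels[q] == NOISE) {
                    labels[q] = clusterId; // border point reached from a core point
                }
                if (labels[q] != UNVISITED) {
                    continue;
                }
                labels[q] = clusterId;
                List<Integer> qNeighbors = neighborhood(data, q, eps);
                if (qNeighbors.size() >= minPts) {
                    queue.addAll(qNeighbors); // q is a core point: expand through it
                }
            }
            clusterId++;
        }
        return labels;
    }
}
```

Each neighborhood query scans the whole data set, so this sketch needs a quadratic number of distance computations; a spatial index would be required for large databases.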

B. The model of similarity and dissimilarity
The model of comparison used in our algorithm is composed of four stages, each calculating one of the following functions (for a first object alt1 and a second object alt2).

1) The function of similarity
In order to calculate the similarity (1) between two alternatives for each criterion "i" of the set of criteria, we use the following function: […], where MaxCr and MinCr are respectively the maximal value and the minimal value of criterion "i".
According to the result of the first function, the similarity of two alternatives "alt1" and "alt2" is read as follows:
• […], then "alt1" and "alt2" are similar on criterion "i";
• […], then "alt1" and "alt2" are not similar on criterion "i".
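As a hedged illustration of how a per-criterion similarity normalized by the maximal and minimal values of a criterion could be computed, the following Java sketch assumes a linear score in [-1, 1] and reads a strictly positive score as "similar". The linear formula, the zero threshold, and the names `CriterionSimilaritySketch`, `similarity`, and `similarOnCriterion` are assumptions made for this example; they are not function (1) of the model.

```java
/** Illustrative per-criterion similarity; the linear formula is an assumption. */
class CriterionSimilaritySketch {
    /**
     * For values inside [minCr, maxCr], returns a score in [-1, 1]:
     * 1 when the two values coincide, -1 when they lie at opposite ends of the range.
     */
    static double similarity(double v1, double v2, double minCr, double maxCr) {
        if (maxCr <= minCr) {
            throw new IllegalArgumentException("maxCr must be greater than minCr");
        }
        double normalizedGap = Math.abs(v1 - v2) / (maxCr - minCr);
        return 1.0 - 2.0 * normalizedGap;
    }

    /** Assumed reading: a strictly positive score means "similar on this criterion". */
    static boolean similarOnCriterion(double v1, double v2, double minCr, double maxCr) {
        return similarity(v1, v2, minCr, maxCr) > 0.0;
    }
}
```

With this assumed scoring, two values that differ by less than half the range of the criterion count as similar on that criterion.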
2) The function of the weighted similarity
In this stage, the importance of every criterion is introduced through the function of the weighted similarity (2): […]
• […], which implies that it is more sure than not that "alt1" is similar to "alt2";
• […], which implies that it is more sure that "alt1" is not similar to "alt2" than the opposite;
• […], in which case we are in doubt whether object "alt1" is similar to object "alt2" or not.
To reinforce the results and to limit doubt, the third stage calculates the strong dissimilarity between two alternatives.
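The three-way reading of the weighted similarity (similar, not similar, doubt) can be sketched in Java as follows. The sketch assumes that the weighted similarity is a weighted mean of per-criterion scores lying in [-1, 1], and that the three cases correspond to a positive, negative, and zero result. Both points, together with the names `WeightedSimilaritySketch` and `Verdict`, are assumptions for illustration and not function (2) of the model.

```java
/** Illustrative weighted similarity; the weighted-mean form is an assumption. */
class WeightedSimilaritySketch {
    enum Verdict { SIMILAR, NOT_SIMILAR, DOUBT }

    /** Weighted mean of per-criterion scores, each assumed to lie in [-1, 1]. */
    static double weightedSimilarity(double[] scores, double[] weights) {
        if (scores.length == 0 || scores.length != weights.length) {
            throw new IllegalArgumentException("scores and weights must be non-empty and aligned");
        }
        double total = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            total += weights[i] * scores[i];
            weightSum += weights[i];
        }
        if (weightSum <= 0.0) {
            throw new IllegalArgumentException("weights must sum to a positive value");
        }
        return total / weightSum;
    }

    /** Assumed sign convention: positive = similar, negative = not similar, zero = doubt. */
    static Verdict verdict(double weightedScore) {
        if (weightedScore > 0.0) {
            return Verdict.SIMILAR;
        }
        if (weightedScore < 0.0) {
            return Verdict.NOT_SIMILAR;
        }
        return Verdict.DOUBT;
    }
}
```

Changing the weights alone can flip the verdict for the same pair of alternatives, which is the effect that a weighting of criteria is meant to capture.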

3) The function of strong dissimilarities
This stage of the model calculates the strong dissimilarity (3) between two alternatives by using the following function: […]. […] implies that "alt1" and "alt2" are strongly dissimilar on criterion "i".
In certain cases, two alternatives can be similar on most criteria while a strong dissimilarity exists on the remaining criteria.
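A strong dissimilarity on a single criterion can be detected with a simple per-criterion test, sketched below in Java. The sketch assumes that a strong dissimilarity occurs when the gap between the two values, normalized by the range of the criterion, reaches a user-chosen threshold. The `vetoThreshold` parameter, the rule itself, and the class name `StrongDissimilaritySketch` are hypothetical and do not reproduce function (3).

```java
/** Illustrative strong-dissimilarity test; the threshold rule is an assumption. */
class StrongDissimilaritySketch {
    /** True when the normalized gap on one criterion reaches the veto threshold. */
    static boolean stronglyDissimilar(double v1, double v2,
                                      double minCr, double maxCr,
                                      double vetoThreshold) {
        if (maxCr <= minCr) {
            throw new IllegalArgumentException("maxCr must be greater than minCr");
        }
        double normalizedGap = Math.abs(v1 - v2) / (maxCr - minCr);
        return normalizedGap >= vetoThreshold;
    }

    /** True when at least one criterion shows a strong dissimilarity. */
    static boolean anyStrongDissimilarity(double[] alt1, double[] alt2,
                                          double[] minCr, double[] maxCr,
                                          double vetoThreshold) {
        for (int i = 0; i < alt1.length; i++) {
            if (stronglyDissimilar(alt1[i], alt2[i], minCr[i], maxCr[i], vetoThreshold)) {
                return true;
            }
        }
        return false;
    }
}
```

Such a test covers the case described above: a pair that agrees on most criteria is still flagged when one criterion separates the two alternatives sharply.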

4) The functions of overall similarities
The last stage of the model of comparison introduces an overall similarity (4). With the aid of the following functions, we can finalize this model of comparison: […]
The relative importance of each criterion in the comparison between two objects is not always equal and can influence the final result of a multi-criteria analysis. Therefore, a coefficient attached to every criterion, which reflects its importance in comparison with the other criteria, is an essential element of the algorithm; a weight is assigned to every criterion with: […]
The algorithmic approach can be structured into the following steps: […]
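As one hedged reading of how a similarity model could enter DBSCAN, the Java sketch below replaces the distance test of the neighborhood query with a pluggable similarity predicate, so that the neighborhood of an alternative contains the alternatives judged similar to it. The class `SimilarityNeighborhoodSketch`, its methods, and the use of `BiPredicate` are assumptions made for illustration; they are not the steps of MC-DBSCAN.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

/** Illustrative neighborhood query driven by a similarity predicate (an assumption). */
class SimilarityNeighborhoodSketch {
    /** Indices of the alternatives judged similar to alternative p (p itself included). */
    static List<Integer> neighborhood(double[][] alternatives, int p,
                                      BiPredicate<double[], double[]> similar) {
        List<Integer> result = new ArrayList<>();
        for (int q = 0; q < alternatives.length; q++) {
            if (q == p || similar.test(alternatives[p], alternatives[q])) {
                result.add(q);
            }
        }
        return result;
    }

    /** Core test in the DBSCAN sense: at least minPts alternatives in the neighborhood. */
    static boolean isCore(double[][] alternatives, int p, int minPts,
                          BiPredicate<double[], double[]> similar) {
        return neighborhood(alternatives, p, similar).size() >= minPts;
    }
}
```

Keeping the predicate as a parameter leaves the rest of the density-based expansion unchanged, whichever similarity model is plugged in.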

III. EXPERIMENTATION AND RESULTS
To test and assess the performance of our algorithm, we implemented both the DBSCAN and the MC-DBSCAN algorithms in Java.
For these tests to reflect the performance of each algorithm correctly, we compare the number of groups created by both algorithms and the percentage of non-classified objects while varying the parameters. The common parameters, namely the radius Eps ($\varepsilon$) that delimits the neighborhood and the minimum number of points MinPts that must be present in the Eps-neighborhood of an object, take the same values for both algorithms.
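The two quantities compared in these tests can be computed from a vector of cluster labels. The Java sketch below assumes that each object carries an integer label and that non-classified objects are marked with -1. This labeling convention and the names `ClusteringSummarySketch`, `clusterCount`, and `unclassifiedPercentage` are illustrative assumptions, not details of the implementation used in the experiments.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative summary of a clustering result; the -1 noise label is an assumption. */
class ClusteringSummarySketch {
    static final int NOISE = -1;

    /** Number of distinct clusters, ignoring non-classified objects. */
    static int clusterCount(int[] labels) {
        Set<Integer> ids = new HashSet<>();
        for (int label : labels) {
            if (label != NOISE) {
                ids.add(label);
            }
        }
        return ids.size();
    }

    /** Percentage of objects that were left non-classified. */
    static double unclassifiedPercentage(int[] labels) {
        if (labels.length == 0) {
            return 0.0;
        }
        int noise = 0;
        for (int label : labels) {
            if (label == NOISE) {
                noise++;
            }
        }
        return 100.0 * noise / labels.length;
    }
}
```

Running both measures on the label vectors produced with identical Eps and MinPts values gives the kind of side-by-side comparison described above.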
In the results table "Tab. 1", because Eps and MinPts are global parameters, DBSCAN classifies the objects into a single class: it is not able to consider several criteria simultaneously.
The results presented in "Fig. 1" and "Fig. 2" show that the classes obtained by the multi-criteria clustering algorithm are very similar to the groups proposed by experts and that the percentage of non-classified objects is very low.
The proposed algorithm also allows an experimental comparative study of the results obtained by varying the relative importance of the criteria involved in evaluating the similarity between two actions.
With the proposed algorithm, a change of weights may influence the final outcome of a multi-criteria analysis ("Fig. 3"), while the DBSCAN algorithm is indifferent to the relative importance of each criterion.
In a further test, we apply the MC-DBSCAN algorithm to the "Color Histogram" database, varying its size between 1,300 and 65,000 objects and changing the input parameters.
"Fig. 4" shows that even when the size of the database increases from 1,300 to 65,000 objects, the results remain stable, which indicates that the objects added by increasing the size affect the classes already created but do not lead to the creation of new classes.

IV. CONCLUSION
This work has produced a new clustering algorithm that contributes to solving the multiple-criteria clustering problem, with weights that express the relative importance of each criterion.
This new approach is based on clustering through an enhancement of the DBSCAN algorithm, which was merged with multiple-criteria decision-making.
However, it is necessary to highlight the need to further improve the performance of the algorithm, because MC-DBSCAN, like most clustering algorithms, requires the input parameters to be determined manually in advance.
Minimizing the human intervention needed to determine the input parameters should therefore give better results.
Parameter settings reported with the experimental results: Eps=0.2, MinPts=3; Eps=0.25, MinPts=9; Eps=0.6, MinPts=4.