A Complexity Survey on Density-based Spatial Clustering of Applications with Noise Clustering Algorithms

Abstract—Data clustering is an interesting field of unsupervised learning that has been extensively used and discussed in several research papers and scientific studies. It handles several issues related to data analysis by grouping similar entities into the same set. Up to now, many algorithms have been developed for clustering using several techniques, including centroid, density and dendrogram approaches. We count nowadays more than 100 distinct algorithms and many enhancements for each of them. Therefore, data scientists still struggle to find the best clustering method to use among this diversity of techniques. In this paper we present a survey of the DBSCAN algorithm and its enhancements with respect to time requirements. A detailed comparison of DBSCAN versions is also provided to help data scientists decide which version of DBSCAN best fits their needs.


I. INTRODUCTION
The fast development of the internet and the availability of cheap mobile devices, smart sensors and social network applications allow users to generate a huge amount of data continuously. This rapid increase in data volume makes several domains difficult to understand using human capabilities alone. However, many clustering algorithms have been developed to help data scientists analyse and understand data despite its volume. Nowadays, these algorithms play a crucial role in several sophisticated systems and applications, including recommender systems, medical applications, face recognition, environmental assessment and anomaly detection [1][2][3][4][5]. To better understand any phenomenon under investigation, clustering algorithms must extract correct and efficient statistics and trends, which is a very hard task, because results are often influenced by the nature of real-world data, which can be sparse, dense, spatial, high-dimensional or even noisy. Therefore, algorithms must handle all the complications raised by such data: supporting volume increases, improving scalability, processing high-dimensional spaces, dealing with arbitrarily shaped structures and detecting outliers. The quality of clustering is also strongly influenced by the choice of the initial parameters, such as the number of clusters or the density radius. Thus, algorithms must eliminate, optimize or even automatically detect the parameters to use in order to detect meaningful clusters. To deal with all the aforementioned difficulties in real cases, many clustering approaches have been proposed, including partitioning methods [6], hierarchical methods [7] and density-based methods [8].
In this paper, we are interested in density-based clustering, where clusters are defined by areas in which the density of the data points is high, separated from each other by areas of low density. We will focus especially on the DBSCAN algorithm [8], which can process spatial data efficiently and discard outliers properly. DBSCAN is a very simple and reliable technique; however, it suffers from many limitations, including its high complexity, its sensitivity to local density variation, its dependence on initial parameters and its scalability failures. Therefore, it has undergone several improvements designed to make it efficient and to mitigate its weaknesses as effectively as possible. For instance, L-DBSCAN [9] and FDBSCAN [10] enhance the time requirement and minimize the deviation of results, MR-DBSCAN [11] improves scalability and deals with heavily skewed data, and HDBSCAN [12] solves initial-parameter issues. We propose a comparison guide for the DBSCAN enhancements related to the complexity criterion, together with a repository of DBSCAN versions classified by time requirement. The rest of this paper is organized as follows: Section II gives a brief refresher on the clustering concept, Section III provides an in-depth description of DBSCAN, Section IV discusses the well-known DBSCAN improvements according to time complexity and compares the versions on the time criterion, and Section V concludes the paper.

II. CLUSTERING TECHNIQUES
This section contains a brief description of partitional, hierarchical and density-based clustering.
Clustering is the process of assigning each data object to a group based on distance computation or on the similarity between each pair of observations. It is considered a core step in many fields, including image processing, pattern recognition, statistical data analysis and other business applications. Clustering methods can be broadly divided into several types, including partitional, hierarchical and density-based clustering.

A. Partitional Clustering
Clustering has taken its roots from the partitioning method K-means [6], which organizes all observations into an already known number of groups (K). Each cluster is represented by its mean, called the centroid, and objects are assigned to the nearest cluster centroid. This method iterates many times over all observations to minimize the following objective function:

J = Σᵢ₌₁ᵏ Σ_{x ∈ Sᵢ} ||x − μᵢ||²

where S = {S₁, S₂, …, Sₖ} is the set of clusters partitioning the observations of the dataset, k is the number of clusters and μᵢ is the mean of the points in Sᵢ.
K-means is based on a very simple computation technique; however, it is sensitive to outliers and data shapes, and it assumes that clusters have roughly equal numbers of observations. In some cases, as shown in Fig. 1, it can lead to bad or even surprising results: Fig. 1(b), (d) and (f) show wrong clustering results.
The K-means method has relatively low time complexity and high computing efficiency, but it finds only compact, spherical shapes and is not suitable for non-convex data. Additionally, it needs prior knowledge of the number of clusters (K) and it selects the initial centroids randomly. Thus, many improvements have been proposed to overcome the aforementioned limitations, such as Partitioning Around Medoids (PAM) [13], Clustering Large Applications (CLARA) [14] and K-means for outlier detection [15]. Despite all these efforts, this type of clustering is not used when groups of data are expected to differ in size and shape, when the number of clusters is not known or when the data contains noise. Instead, hierarchical and density-based clustering are explored to discover arbitrarily shaped and meaningful clusters from large amounts of spatial data while preserving the spatial proximity of data objects. A minimal usage sketch is given below.
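To make the objective function above concrete, the following minimal sketch (our own illustration, not from the original papers; the dataset and parameter values are invented) runs K-means with scikit-learn, which exposes the value of the objective as inertia_:

    # Minimal K-means sketch (illustrative values; not from the paper).
    from sklearn.cluster import KMeans
    import numpy as np

    rng = np.random.default_rng(0)
    # Three well-separated 2-d blobs of 100 points each.
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.inertia_)      # value of the objective sum ||x - mu_i||^2
    print(km.labels_[:10])  # cluster assignments of the first ten points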

B. Hierarchical Clustering
Hierarchical techniques seek to organize data objects into a tree-structured representation called a dendrogram. They are based on the computation of a symmetric distance matrix and they use properly defined linkage criteria such as Ward's method [16], single linkage or complete linkage. Several algorithms were invented under this type of clustering, such as CURE [17] and BIRCH [18].
As shown in Fig. 2, hierarchical techniques can be agglomerative or divisive depending on whether the algorithm builds the tree from the bottom up or from the top down. Once the tree is built, hierarchical algorithms split it (divisive processing) or merge its nodes (agglomerative processing) in order to find clusters. These cut or merge decisions must be made properly so that the quality of the clusters is better. For instance, as illustrated in Fig. 2, the level of cutting defines the number of clusters to detect: the first level gives rise to two clusters while the second one creates four clusters. Hierarchical algorithms are easy to understand and to implement. However, they rarely provide accurate results for mixed data types, they work poorly on very large data sets, they involve many arbitrary decisions and, unfortunately, no adjustment can be performed once a merge or split decision has been executed. Many of the failure cases observed with partitioning and hierarchical clustering can be handled by density-based clustering, illustrated in the next section. A small agglomerative example follows.
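As a hedged illustration of the agglomerative workflow (our own example; the data and the choice of two clusters are invented), SciPy builds the merge tree with linkage and cuts it with fcluster:

    # Agglomerative clustering sketch with SciPy (illustrative only).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.default_rng(1).normal(size=(50, 2))
    Z = linkage(X, method='ward')                    # bottom-up merge tree (dendrogram)
    labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into two clusters
    print(labels)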

C. Density based Clustering
Density-based clustering refers to finding contiguous regions of high density within the dataset.
As shown in Fig. 3, these regions are separated by low-density regions, called sparse regions. The idea behind such algorithms is that clusters are represented by the detected dense regions, while data objects in the sparse regions are typically considered noise or outliers.
In the next sections of this paper, we will focus on the most popular and most cited density-based algorithm (cited over 19,430 times), called Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [8].

III. THE DBSCAN ALGORITHM

A. Definitions
DBSCAN has received several scientific awards, such as the test-of-time award from the leading data-mining conference KDD in 2014, for its good performance and its significant accuracy in clustering spatial data. The main purpose of DBSCAN is to detect arbitrarily shaped clusters within a large data set and to effectively distinguish noise. It measures the density at any object O by counting the number of objects falling in a hyper-sphere S(O, ε), where ε is a radius measured by the Euclidean distance. A region delimited by S(O, ε) is considered dense if the object O satisfies the following condition:

|N_ε(O)| ≥ MinPts

where N_ε(O) is the ε-neighbourhood of the object O and MinPts is the minimum number of points required to be present in the region to make the hyper-sphere S(O, ε) dense.
So, if objects share the same dense hyper-sphere S(O, ε) then they belong to the same cluster. As mentioned before, to decide whether a region is dense or sparse, this algorithm uses two parameters: a Euclidean distance threshold ε and a positive integer parameter MinPts. As described below, DBSCAN introduces several definitions to categorize data objects into core, border and noise objects [8].

DEFINITION 1: Core objects
An object O is considered a core object if the number of objects inside the hyper-sphere S(O, ε) is greater than the MinPts parameter value. In Fig. 4, the points B, G and M are core objects in cluster 1 and V is a core object in cluster 2.

DEFINITION 2: Border objects
An object is a border object if it belongs to the ε-neighbourhood of some core object while its own ε-neighbourhood contains fewer than MinPts objects. Thus, an object O is considered a border object if it belongs to a cluster without being a core object. In Fig. 4, A, C, D, E, I, J, H, F, K, L, N and O are border objects of cluster 1 and P, Q, R and S are border objects of cluster 2.

DEFINITION 3: Noise objects
If an object O is neither a core nor a border object, then it is considered noise, or an outlier. In Fig. 4, the points T, W, X, Y and Z are outliers.
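Definitions 1 to 3 translate directly into code. The following sketch (our own helper names; a brute-force distance matrix is assumed for clarity) labels every object of a small dataset as core, border or noise:

    # Classifying objects per Definitions 1-3 (illustrative sketch).
    import numpy as np

    def classify(data, eps, min_pts):
        # Pairwise Euclidean distances, then the eps-neighbourhood test.
        d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
        neigh = d <= eps                    # membership in S(O, eps)
        counts = neigh.sum(axis=1)          # |N_eps(O)|, O itself included
        core = counts > min_pts                      # Definition 1 ("greater than MinPts")
        border = ~core & (neigh & core).any(axis=1)  # Definition 2: near a core object
        noise = ~core & ~border                      # Definition 3
        return core, border, noise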

DEFINITION 4: Directly Density reachability and Density reachability
If an object O is a core object, then all objects within the ε-neighbourhood of O are called directly density-reachable from O. In Fig. 4, the border objects A, C, D and E are directly reachable from the core point B and the border objects P, Q, R and S are directly reachable from the core point V.
Two objects O1 and On are density-reachable if a chain of objects O1, O2, …, On is found within the dataset where each Oi+1 is directly density-reachable from Oi with respect to the initial parameters ε and MinPts. For instance, the chain from A to L shown in Fig. 4 makes a density link between the objects A and L; thus A and L are density-reachable.

DEFINITION 5: Maximality and Connectivity
Maximality: If a core object O belongs to a cluster, then all the objects density-reachable from O also belong to the same cluster.
Connectivity: if two objects O1 and O2 belong to the same cluster, then there is another object O in the same cluster such that both O1 and O2 are density-reachable from O. In Fig. 4, B and G are connected because they are density-reachable through the chain B → E → G. A small reachability sketch follows.
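Density-reachability can be checked mechanically as graph reachability that only expands through core objects. The helper below (our own illustration, reusing the neigh and core arrays from the classification sketch above) does this with a breadth-first search:

    # Density-reachability as BFS over core objects (illustrative sketch).
    import numpy as np
    from collections import deque

    def density_reachable(neigh, core, start, target):
        if not core[start]:
            return False               # a chain must start at a core object
        seen, queue = {start}, deque([start])
        while queue:
            i = queue.popleft()
            for j in np.flatnonzero(neigh[i]):
                if j == target:
                    return True        # target is directly reachable from i
                if core[j] and j not in seen:
                    seen.add(j)        # only core objects extend the chain
                    queue.append(j)
        return False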

B. DBSCAN Algorithm
Based on the previous definitions and the previously mentioned parameters ε and MinPts, we illustrate the DBSCAN algorithm in Table I.

TABLE I. THE DBSCAN ALGORITHM

for each object O of Data do
    if O is not marked as "seen" then
        Mark O as "seen"
        if card(Neighbors(ε,O,Data)) > MinPts then
            Create a new cluster identifier ClusterId
            Mark O and each object of Neighbors(ε,O,Data) with ClusterId
            Add each object of Neighbors(ε,O,Data) which is not marked as "seen" to queue(ClusterId)
            while queue(ClusterId) is not empty do
                Take an object P from queue(ClusterId) and mark it as "seen"
                if card(Neighbors(ε,P,Data)) > MinPts then
                    Mark each object of Neighbors(ε,P,Data) with cluster identifier ClusterId;
                    if any object of Neighbors(ε,P,Data) is marked "noise" then remove this mark
                    Add each object of Neighbors(ε,P,Data) which is not marked as "seen" to queue(ClusterId)
                end if
                Remove P from queue(ClusterId)
            end while
        else
            Mark O as "noise"
        end if
    end if
end for
Output all objects of Data along with their ClusterId or "noise" mark.
The previous algorithm describes DBSCAN, where Neighbors(ε,P,Data) is the subset of objects in Data that are present in the hyper-sphere of radius ε centred at P, S(P, ε), and card(Neighbors(ε,P,Data)) is the cardinality of the set Neighbors(ε,P,Data). Each object from Data is marked with a cluster identifier (ClusterId), which gives the cluster to which the object belongs, or it is marked as "noise", indicating that the object is a noisy one. To distinguish the objects which have been processed from those which have not, the mark "seen" is used. Note that a non-core object is initially marked as "noise"; it can later turn out to be a border point of a cluster, in which case the "noise" mark is deleted.
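For readers who prefer running code, here is a compact Python transcription of Table I (a sketch under our own naming conventions, dbscan, eps and min_pts, with a brute-force neighbourhood query that any spatial index could replace):

    # Compact DBSCAN sketch following Table I (illustrative, brute force).
    import numpy as np
    from collections import deque

    NOISE, UNSEEN = -1, 0          # label conventions of this sketch

    def neighbors(data, i, eps):
        """Indices of objects inside the hyper-sphere S(data[i], eps)."""
        return np.flatnonzero(np.linalg.norm(data - data[i], axis=1) <= eps)

    def dbscan(data, eps, min_pts):
        labels = np.full(len(data), UNSEEN)
        cluster_id = 0
        for i in range(len(data)):
            if labels[i] != UNSEEN:
                continue                        # already "seen"
            seeds = neighbors(data, i, eps)
            if len(seeds) < min_pts:            # core test (>= here; Table I uses a strict >)
                labels[i] = NOISE               # provisional "noise" mark
                continue
            cluster_id += 1                     # i is a core object: new cluster
            labels[seeds] = cluster_id
            queue = deque(seeds)
            while queue:                        # expand queue(ClusterId)
                j = queue.popleft()
                j_neigh = neighbors(data, j, eps)
                if len(j_neigh) >= min_pts:     # j is a core object too
                    for k in j_neigh:
                        if labels[k] == UNSEEN:
                            queue.append(k)
                        if labels[k] in (UNSEEN, NOISE):
                            labels[k] = cluster_id  # removes any "noise" mark
        return labels

Established implementations such as sklearn.cluster.DBSCAN follow the same logic, with indexed neighbourhood queries in place of the brute-force scan.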
According to the previous description, we can easily notice that DBSCAN does not require pre-determination of the number of clusters and needs only two parameters to decide when a region is considered dense or sparse. However, it still suffers from several limitations, including its high complexity, which can reach O(n²), its failure with local density variation, its handicap related to data scalability and its huge memory consumption. Many works have been proposed to bring a significant optimization of the DBSCAN algorithm and to overcome its major drawbacks. For instance, E. Schubert et al. [19] discussed the relationship between the indexability of the dataset and the quality of clusters. They proposed indicators of bad parameters to guide data scientists in choosing appropriate values of ε and MinPts; a common heuristic of this kind is sketched below.
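One widely used parameter heuristic (in the spirit of such guidance, though not necessarily the exact procedure of [19]) is the k-distance plot: sort every point's distance to its k-th nearest neighbour and pick ε near the elbow of the resulting curve.

    # k-distance heuristic for choosing eps (illustrative sketch).
    import numpy as np

    def k_distances(data, k):
        d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
        d.sort(axis=1)            # column 0 is each point's distance to itself
        return np.sort(d[:, k])   # sorted distances to the k-th nearest neighbour

    # Plot k_distances(data, min_pts - 1) and choose eps at the elbow.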
For time reduction, B. Borah and D. K. Bhattacharyya used a sampling-based method [20] and J. Gan and Y. Tao used approximation techniques [21]. Derya Birant and Alp Kut extended DBSCAN to cover non-spatial and spatial–temporal data [22]. Moreover, other improvements were released to cluster in high-dimensional spaces [23], to exploit parallel processing opportunities [24], [25] and to fix local density variation issues [26]. Knowing that complexity is a powerful criterion for judging the efficiency of an algorithm, we devote the rest of this paper to a review of some well-cited DBSCAN extensions which significantly affect the time requirement.

IV. DBSCAN COMPLEXITY ENHANCEMENTS
The complexity criterion is among the ultimate indicators of the efficiency of an algorithm. Thereby, we decided to cover in this section some well-cited DBSCAN papers, published between 2000 and 2019, aiming to enhance the time requirement of the original algorithm. In the rest of this paper, n will represent the number of samples in the dataset and d will refer to the number of features studied.

As mentioned in the first section, DBSCAN computes the empirical density for each dataset element and measures mutual distances over the entire set of observations. Hence it requires a large volume of memory and a huge amount of time to cluster large datasets. Thus, it is qualified by data scientists as a very expensive algorithm and is widely criticized for its quadratic time requirement. Originally, Ester et al. [8] claimed that DBSCAN terminates in O(n log n). However, the neighbourhood queries consume a large part of the running time: O(n²) distance computations are needed to measure distances between all objects, regardless of the initial parameters MinPts and ε. Fortunately, this time requirement can be reduced significantly to O(n log n) [27] if a suitable indexing structure such as the R*-tree [28] is used, whose height is logarithmic in n with base m, the number of entries in a page of the R*-tree. However, the use of the R*-tree is suitable only when the dimensionality of the data is low. Thereby, researchers are still trying to run DBSCAN in subquadratic time (i) by reducing the query time and (ii) by minimizing the number of queries needed. As a result of their efforts, many new methods appeared, including hybrid methods [9] [29] which use only some accurate objects as prototypes rather than all dataset objects. The approximation used by the hybrid methods can, in some cases, lead to clusters of bad quality. In the next paragraphs, we weigh the pros and cons of some well-cited DBSCAN time-reducing methods.

B. Borah et al. proposed IDBSCAN in 2004 to incorporate a sampling technique for searching the core object's neighbourhood. They used only outer objects as seeds and ignored non-representative objects, thereby omitting unnecessary queries by adding to the original algorithm an extra function based on the Marked Boundary Objects (MBO) technique [20]. This function adds a complexity of O(sd), where s is the neighbourhood size, and the overall complexity of IDBSCAN stays of the order of indexed DBSCAN, where m is the number of entries in an index page. Chen et al. [30] proposed exact and approximate algorithms running in subquadratic time for high-dimensional and for two-dimensional spaces.

P. Viswanath and Rajwala Pinkesh [9] proposed another fast hybrid density method called L-DBSCAN, based on the leaders clustering technique [31]. They derived some representative objects (leaders) at the coarser level and others at the finer level of the clustering process. The authors used the first category of leaders to reduce the time requirement and the second category to limit the deviation of the results. This hybrid scheme uses only a set of pairs {(ℓ, followers(ℓ)) | ℓ ∈ ℒ}, where ℓ is a leader and ℒ the set of leaders. According to the experiments reported by the authors, L-DBSCAN can run in O(nk), where k represents the number of derived leaders, which is much smaller than n. However, this technique can give rise to a big margin of error when a leader is not originally dense but is estimated dense according to ℒ*. This method reduces the computation time, but it requires two additional thresholds: τc and τf. P. Viswanath and V.
Suresh Babu [29] enhanced the density approximation of leaders in their technique called rough-DBSCAN by combining the leaders clustering method [31] with the rough set approach [32]. They added a mapping between every leader and the objects belonging to it (its followers), as shown in equation (1), and used lower and upper approximations, equations (2) and (3), to bound the exact neighbour count of a leader l:

ℒ* = {(followers(l), count(l)) | l ∈ ℒ}    (1)

lower(|N_ε(l)|) = Σ { count(l′) : ||l − l′|| ≤ ε − τ }    (2)
upper(|N_ε(l)|) = Σ { count(l′) : ||l − l′|| ≤ ε + τ }    (3)

where τ is the leaders distance threshold: by the triangle inequality, every follower of a leader l′ with ||l − l′|| ≤ ε − τ certainly lies inside S(l, ε), while only followers of leaders with ||l − l′|| ≤ ε + τ can possibly lie inside it.
Rough-DBSCAN needs only O(nk) time, where k is the number of leaders, while improving the clustering quality by minimizing the approximation error. The leaders mechanism shared by L-DBSCAN and rough-DBSCAN is sketched below.
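The following sketch (our own function names, with a single simplified threshold tau rather than the τc and τf of the papers) shows the one-pass leaders step and the coarse density estimate it enables:

    # Leaders clustering and coarse density estimate (illustrative sketch).
    import numpy as np

    def leaders(data, tau):
        """One pass: each object follows the first leader within tau,
        otherwise it becomes a new leader. Returns leaders and counts."""
        leads, counts = [], []
        for x in data:
            for i, l in enumerate(leads):
                if np.linalg.norm(x - l) <= tau:
                    counts[i] += 1          # x follows leader l
                    break
            else:
                leads.append(x)             # x becomes a new leader
                counts.append(1)
        return np.asarray(leads), np.asarray(counts)

    def approx_density(leads, counts, l, eps):
        """Estimate |N_eps(l)| from the k leaders instead of all n objects."""
        d = np.linalg.norm(leads - l, axis=1)
        return counts[d <= eps].sum()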
FDBSCAN [33] is another non-linear searching algorithm, proposed by B. Liu in 2007 to reduce redundant searching by using a fast merging algorithm. It sorts objects by their coordinates along one dimension and then selects only unlabelled objects outside a core object's neighbourhood in order to decrease region queries. Another interesting paper was proposed in the same year by Yi-Pu Wu et al. [34] to optimize the nearest-neighbour search process by using the Locality-Sensitive Hashing (LSH) technique. The authors used hash collisions to detect and represent similarities between two objects A and B from a dataset D: an LSH family is a set of hash functions ℋ whose collision probability captures object similarity, Pr_{h ∈ ℋ}[h(A) = h(B)] = S(A, B), where S is a similarity function over pairs of objects. This LSH technique brings a significant decrease in DBSCAN running time, which becomes O(n), while maintaining the quality of the detected clusters [35].
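As a hedged illustration of such a family (random-hyperplane hashing, a generic LSH scheme for cosine similarity, not necessarily the family used in [34]):

    # Random-hyperplane LSH sketch: near vectors collide far more often.
    import numpy as np

    rng = np.random.default_rng(0)

    def simhash(x, planes):
        """One bit per hyperplane: the side of the plane x falls on."""
        return tuple(bool(b) for b in (planes @ x) > 0)

    planes = rng.normal(size=(16, 4))     # 16 hash bits for 4-d vectors
    a = rng.normal(size=4)
    print(simhash(a, planes) == simhash(a + 0.01, planes))  # usually True
    print(simhash(a, planes) == simhash(-a, planes))        # always False

Candidate neighbours are then found by bucket lookup on the hash values instead of a full scan.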
Cheng-Fa Tsai and Chien-Tsung Wu, inspired by the fast merging method of FDBSCAN [33], proposed the GF-DBSCAN algorithm, which segments data into several grid cells and limits the neighbourhood searches to the cell scope instead of exploring the entire grid. Clusters are merged if they intersect and the overlapping objects include some core object. By introducing this grid approach and this merging process, GF-DBSCAN significantly minimizes the number of searches and increases the clustering accuracy [36]. Gunawan [27] demonstrated that DBSCAN's performance can be improved to O(n log n) in two-dimensional space by applying the following process in order: (i) partitioning the data using a grid of cells, (ii) determining all core points, (iii) merging density-connected core points into clusters and finally (iv) determining border points and noise. He used a hash table to discard cells without any points. However, this faster algorithm was experimented only in two-dimensional space. Therefore, J. Gan and Y. Tao extended Gunawan's thesis to higher dimensions and obtained an exact algorithm with a subquadratic expected running time whose exponent depends on the dimensionality d and on an arbitrarily small constant δ. J. Gan and Y. Tao also proposed a new algorithm called ρ-approximate DBSCAN, suitable for large datasets, which runs in O(n) expected time, where the parameter ρ specifies the accuracy of the approximation. They were inspired by Chen et al.'s paper [30], which had already discussed how to compute DBSCAN in subquadratic time. The grid idea common to these methods is sketched below.
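A hedged sketch of the shared grid trick (our own helper; in two dimensions a side length of ε/√2 makes the cell diagonal equal ε, the property Gunawan's 2D algorithm exploits):

    # Grid partition with cell side eps/sqrt(2) (illustrative 2-d sketch).
    import numpy as np
    from collections import defaultdict

    def build_grid(data, eps):
        side = eps / np.sqrt(2)      # any two points in one 2-d cell are <= eps apart
        grid = defaultdict(list)
        for idx, p in enumerate(data):
            grid[tuple((p // side).astype(int))].append(idx)
        return grid

    # A cell holding >= MinPts points then consists entirely of core points,
    # and neighbour searches only need to probe the cells within eps of a cell.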
GPU opportunities and parallelization strategies are also used by some algorithms, including the G-DBSCAN algorithm [37], to speed up the original algorithm. G-DBSCAN first constructs a data graph G(O, E), where the objects O are nodes connected by edges E if they are within a minimum proximity R (threshold parameter) of each other. Then it identifies clusters by using the breadth-first search (BFS) technique [38]. Thereby, a complexity of O(n + ne) is added by the BFS search, where ne is the number of edges. G-DBSCAN uses graphics processing unit (GPU) capabilities to achieve accelerations greater than 100×, but unfortunately it does not reduce the original complexity.
The DBSCAN neighbour search operation can also be optimized by using a graph-based index structure, as demonstrated by K. Mahesh Kumar and A. Rama Mohan Reddy [39]. Their idea is to prune outlier objects early in order to eliminate the unnecessary distance computations that noise may introduce. RNN-DBSCAN [40] uses reverse nearest neighbour counts and k-nearest-neighbour graph traversals to estimate observation density. It reduces the parameterization of DBSCAN to a single parameter (the choice of k nearest neighbours) and also improves the ability to handle large variations in cluster density (heterogeneous density). Mark de Berg et al. [41] presented another O(n log n) approximate algorithm for DBSCAN in two-dimensional space. They represented the data objects using a smaller box graph whose nodes are disjoint rectangular boxes with a diameter of at most ε and whose edges connect pairs of boxes within distance ε of each other. Then they derived another graph containing only core points, whose connected components are reported as clusters. Mark de Berg et al. improved the quality of the clusters by assigning border points to their nearest core point rather than to the first cluster that finds them.

V. CONCLUSION

DBSCAN is a powerful technique for data clustering; however, it still suffers from its huge time requirement, which can reach O(n²) in the worst case. This paper weighs the pros and cons of the well-known and well-cited DBSCAN variations with respect to the time requirement. We presented the current state of the art related to DBSCAN complexity and we also mentioned some techniques used to enhance the original version of the algorithm. According to the papers studied, researchers can use the leaders clustering method, graph-based methods, breadth-first search (BFS), the triangle inequality property or locality-sensitive hashing to bring new enhancements to this field. We noticed that the complexity of the DBSCAN variations varies between O(n) and O(n²). Another analysis of all these DBSCAN variations, based on experiments with real data, will be presented in our future work.

ACKNOWLEDGMENT
The authors would like to express their special gratitude to Mr. Labriji and Mr. Rachik for their able guidance and support in completing this manuscript. We would also like to extend our gratitude to the anonymous reviewers whose thoughtful comments and suggestions helped improve this manuscript.