Exploring K-Means with Internal Validity Indexes for Data Clustering in Traffic Management System

Traffic Management System (TMS) is used to improve traffic flow by integrating information from different data repositories and online sensors, detecting incidents, and taking actions on traffic routing. In general, two decision-making systems, weights updating and forecasting, are integrated inside the TMS. The models need numerous data sets to make appropriate decisions. To determine the dynamic road weights in TMS, four (4) environmental attributes that are directly or indirectly related to increased traffic congestion are considered: rainfall, temperature, wind, and humidity. In addition, peak hour is taken as an additional attribute. Usually, the data sets are classified intuitively. However, optimum classification of the data sets is vital to improve the decision accuracy of the TMS. The collected data sets have no class labels; thus, cluster-based unsupervised classification (partitioning, hierarchical, grid-based, density-based) can be used to find the optimum number of classes in each attribute and is expected to improve the performance of the TMS. The two most popular and frequently used approaches are hierarchical clustering and partition clustering. K-means is simple, easy to implement, and its clustering results are easy to interpret. It is also fast, because its time complexity is linear in the number of data points. Thus, in this paper we demonstrate the performance of partition k-means and hierarchical k-means, with their implementations using the Davies-Bouldin Index (DBI), Dunn Index (DI), and Silhouette Coefficient (SC) methods, to determine the optimal number of classes (features) inside each attribute of the TMS data sets. Subsequently, the optimal classes are validated using WSS (within sum of squares) errors and correlation methods. The validation results conclude that k-means with DI performs better on all attributes of the TMS data sets and provides more accurate optimum classification numbers.
Thereafter, the dynamic road weights for TMS are generated and classified using the combined k-means and DI method.

Keywords: Traffic Management System (TMS); Data Clustering; K-means; Hierarchical Clustering; Cluster Validation


I. INTRODUCTION
A new low-cost, flexible, maintainable, and secure internet-based traffic management system with real-time bi-directional communication was proposed and implemented in [1] [2] [3] [4] to monitor and ease the traffic situation. To determine the dynamic road weights in TMS, four (4) environmental attributes are considered: rainfall, temperature, wind, and humidity. Rainfall is one of the most influential weather attributes for road congestion in a metro city, as road segments are submerged by heavy rain, which slows traffic movement. The heat released from the engines and air-conditioners of vehicles stuck in traffic may raise the overall temperature of the area. Thus, the current temperature helps to classify the traffic congestion status of a particular road segment. Gusts of wind have a direct influence on road safety and slow vehicle movement. In addition, temperature, wind, and humidity directly help to predict future rainfall in a particular area. Peak hour is one of the most influential causes of traffic congestion in metro cities. Thus, these four (4) environmental attributes and peak hour have a direct or indirect relationship with traffic congestion and vehicle movement, which is why they are chosen as decision-making parameters.
The values of these attributes (features) are intelligently crawled by a search engine, with metadata indexing (title, description, keyword, etc.), directly from multiple data feeds (web site, RSS feeds, web service, etc.) of the web page in [5]. The crawled data are simplified (structured) and stored in a historic table. The number of attributes can be changed according to the system requirements. We collected more than two (2) years, or 750 days (1/12/2006 to 20/12/2008), of data for the five features from the web page in [5].
Initially, a decision tree (DT) [1] [2] [3] was used to classify road weights, and a weighted moving average was implemented to estimate or predict feature values in the DT-based system [28][29], which achieved 16.45% accuracy. However, the model data sets were classified intuitively. Cluster-based classification (K-means, Locality-Sensitive Hashing (LSH), etc.) can be used to find the optimum number of classes in each feature and can improve the performance of the TMS. With this hypothesis, we implement two unsupervised clustering techniques: partition k-means and hierarchical k-means. There are several methods (internal and external) that measure the similarity between two clusterings and are used to compare how well different data clustering algorithms perform on a data set. Only internal methods, namely the Davies-Bouldin Index (DBI), Dunn Index (DI), and Silhouette Coefficient (SC), are used to choose the optimum number of classes, as they do not require any external information. Subsequently, the optimal classes are cross-validated using two statistical measures: correlation and Within Sum of Squares (WSS) errors.
Results highlight that the Dunn Index (DI) performs better for both the partition k-means and hierarchical k-means algorithms by providing the minimum Sum of Squared Errors (SSE) for all environmental attributes. However, the optimum numbers of classes generated by the two algorithms differ for each environmental attribute. The two algorithms are compared by computing the correlation values on their optimal number of clusters for each attribute. The correlation values of the partition k-means algorithm are higher than those of the hierarchical k-means algorithm for all attributes. The validation results conclude that the combination of k-means with the Dunn Index performs better and provides more accurate optimum classification numbers on the environmental data set. Thereafter, the dynamic road weights for TMS are generated and classified with this combination.

II. RELATED WORKS
Integrating intelligent technologies into the transportation system, including intelligent and effective route planning to reduce travel time, reliable estimation of traffic congestion, and accident and/or hazard detection, can help to reduce both fuel consumption and the associated emission of greenhouse gases. However, this kind of Intelligent Transportation System (ITS) requires collecting and modeling a tremendous amount of continuous data from all road segments, in different time domains, for every day of the year, which is a complex task. In addition, analytical decision making for optimum route planning requires heavy data processing and centralized computation. Data mining techniques, especially clustering, are used to shape the unstructured data into a structured form and make decision making easier for ITS problems.
Traffic flow data is used in [31] to detect the traffic status and predict traffic patterns from a historical database. Two different data mining techniques, cluster analysis and classification analysis, are used in the historical data prediction model. Classified road features are used to estimate traffic flows in [32]. Functional Data Analysis (FDA) is used in [33] to analyze the daily traffic flow. A comparative study of different data mining techniques for classifying traffic congestion is done in [34]. It examines J48 Decision Tree, Artificial Neural Network, Support Vector Machine (SVM), PART, and K-Nearest Neighbor to classify future traffic status and concludes that the J48 Decision Tree algorithm has the best performance.
In our previous works, traffic management data attributes were processed with a DT (decision tree) [1] [2] [3] (Fig. 1) and a Neural Network (NN) [4]. The NN performs better than the DT. However, these works did not apply any recognized data mining or classification technique to the environmental data sets. Rather, they classified data according to intuitive guesses. Thus, the proposed TMS lacks an optimal data classification strategy.
There are many methods and techniques available to classify data sets. In [12], optimal cluster numbers are determined based on intra-cluster and inter-cluster distance measurements. The authors use the Davies-Bouldin index and Dunn's index to classify both synthetic and natural images. In [15], the authors discuss and compare various clustering methods to find the best one and fix the optimal number of clusters over three (3) structured data sets. They use three (3) clustering algorithms (hierarchical, k-means, PAM) and three (3) internal validation measures (connectivity, silhouette, and Dunn).
It is common to apply hierarchical or partition clustering to classification problems [16]. K-means is simple, easy to implement, and has linear time complexity. Thus, partition k-means and hierarchical k-means algorithms are used to classify the TMS data sets, and their optimum classification numbers are determined by three (3) cluster validity indexes: Davies-Bouldin Index (DBI), Dunn Index (DI), and Silhouette Coefficient (SC).

III. CLASSIFICATION TECHNIQUES
Many industrial problems are identified as classification problems, for example stock market prediction, weather forecasting, bankruptcy prediction, medical diagnosis, speech recognition, and character recognition, to name a few [6][7][8][9][10]. Learning techniques are typically divided into three broad categories: supervised, unsupervised, and reinforcement learning [11]. Supervised learning is used when the class labels of the data are known. Unsupervised learning (cluster analysis) is applicable to data sets with unknown class labels. Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its rewards. In this paper, the TMS data sets have no class labels and thus fall in the unsupervised learning category. This section describes the algorithms and methods that are used for clustering in this paper. Notations and their descriptions are listed in Table I.

A. Hierarchical Clustering
Hierarchical clustering constructs a hierarchy of clusters (dendrogram). A dendrogram captures the order in which clusters are merged (bottom-up view) or split (top-down view). There are two variants of hierarchical clustering (Fig. 2): i) the agglomerative hierarchical clustering algorithm (HAC), or AGNES (bottom-up approach), and ii) the divisive hierarchical clustering algorithm (HDC), or DIANA (top-down approach). In this paper, we implement divisive hierarchical clustering to classify the feature data, as it has a lower computational cost than AGNES. We stop iterating when the optimal number of clusters is reached.

1) Davies-Bouldin Index: The Davies-Bouldin (DB) index [20] [21] measures the average similarity between each cluster and its most similar one. A lower value of the DB index indicates that clusters are compact and well separated, which reflects better clustering. The goal of this index is to achieve minimum within-cluster variance and maximum between-cluster separation. It measures the similarity of clusters ($R_{ij}$) from the variance of a cluster ($S_i$) and the separation of clusters ($d_{ij}$), given by the distance between two cluster centers ($v_i$ and $v_j$). The formulae of the DB index are

$$DB = \frac{1}{k}\sum_{i=1}^{k} R_i, \qquad R_i = \max_{j \neq i} R_{ij}, \qquad R_{ij} = \frac{S_i + S_j}{d_{ij}}$$

where

$$S_i = \frac{1}{|C_i|}\sum_{x \in C_i} d(x, v_i), \qquad d_{ij} = d(v_i, v_j)$$

and $k$ is the number of clusters, $C_i$ is the $i$-th cluster, and $v_i$ is its center.

3) Silhouette Coefficient: The Silhouette Coefficient (SC) [22] [23] [24] shows how well the objects fit within their cluster. It measures the quality of the clustering on a scale between -1 and 1. A value near one (1) indicates that the point x is assigned to the right cluster. There are two terms: cohesion and separation. Cohesion is the intra-cluster distance, and separation is the distance between cluster centroids. A(x) is the average dissimilarity between x and all other points of its cluster. B(x) is the minimum dissimilarity between x and its nearest cluster. A value near -1 indicates that the point should be assigned to another cluster. The formulae of SC are

$$S(x) = \frac{B(x) - A(x)}{\max\{A(x), B(x)\}}, \qquad SC = \frac{1}{n}\sum_{x} S(x)$$
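The two indexes described above can be illustrated with a minimal sketch. It assumes one-dimensional feature values (one attribute at a time), absolute difference as the distance, the mean distance to the center as the cluster scatter $S_i$, and the average distance to the points of the nearest other cluster as B(x). The function names and the sample values are hypothetical and are not taken from the TMS data.

```python
def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of 1-D clusters (lower is better).

    Assumes at least two clusters with distinct centers.
    """
    centers = [sum(c) / len(c) for c in clusters]
    # Scatter S_i: mean absolute distance of the members to their center.
    scatter = [sum(abs(x - v) for x in c) / len(c)
               for c, v in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # R_i: similarity to the most similar other cluster.
        total += max((scatter[i] + scatter[j]) / abs(centers[i] - centers[j])
                     for j in range(k) if j != i)
    return total / k


def silhouette(clusters):
    """Mean silhouette coefficient for a list of 1-D clusters (in [-1, 1])."""
    scores = []
    for i, cluster in enumerate(clusters):
        for idx, x in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)  # common convention for singletons
                continue
            # A(x): average distance to the other points of the same cluster.
            a = sum(abs(x - y) for j, y in enumerate(cluster) if j != idx)
            a /= len(cluster) - 1
            # B(x): smallest average distance to the points of another cluster.
            b = min(sum(abs(x - y) for y in other) / len(other)
                    for j, other in enumerate(clusters) if j != i)
            denom = max(a, b)
            scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical values, for illustration only.
sample = [[1.0, 2.0], [5.0, 7.0]]
print(davies_bouldin(sample), silhouette(sample))
```

A lower Davies-Bouldin value and a silhouette value closer to 1 both point to compact, well-separated clusters, so the two indexes are read in opposite directions when candidate values of k are compared.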

V. CLUSTER VALIDATION METHODS
A. Correlation: An effective clustering algorithm needs a suitable measurement of similarity or dissimilarity. Correlation (Fig. 6) uses the similarity matrix and the incidence matrix (also called the occurrence matrix) to measure the correlation between the data and its clusters [25].
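A higher correlation value indicates that the points belonging to the same cluster are very close to each other, which reflects good clustering. The formula of correlation is

$$r = \frac{\sum_{i,j}(d_{ij}-\bar{d})(c_{ij}-\bar{c})}{\sqrt{\sum_{i,j}(d_{ij}-\bar{d})^{2}}\,\sqrt{\sum_{i,j}(c_{ij}-\bar{c})^{2}}}$$

where $d_{ij}$ and $c_{ij}$ are the entries of the distance matrix and the incidence matrix, and $\bar{d}$ and $\bar{c}$ are their means.

C. Incidence Matrix: An incidence matrix is a matrix that shows the relationship between two classes of objects. It is an n×n matrix, where n is the total number of data objects. If object x and object y belong to the same cluster, then $I_{xy} = 1$; if they belong to different clusters, then $I_{xy} = 0$.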
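As an illustration of this validation step, the following minimal sketch builds an incidence matrix from cluster labels and correlates it with a pairwise matrix. It assumes one-dimensional values, absolute difference as the distance, and a Pearson correlation taken over all n×n entries; the helper names and the sample data are hypothetical.

```python
from math import sqrt


def incidence_matrix(labels):
    """I[x][y] = 1 if objects x and y share a cluster, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]


def distance_matrix(values):
    """Pairwise absolute differences between 1-D values."""
    return [[abs(x - y) for y in values] for x in values]


def matrix_correlation(first, second):
    """Pearson correlation between the entries of two n x n matrices."""
    a = [v for row in first for v in row]
    b = [v for row in second for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant matrix")
    return cov / sqrt(var_a * var_b)


# Hypothetical values and labels, for illustration only.
values = [1.0, 2.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
r = matrix_correlation(distance_matrix(values), incidence_matrix(labels))
print(r)
```

Note that when raw distances are used, pairs in the same cluster have small distances and an incidence of 1, so a good clustering gives a correlation near -1. With a similarity matrix in place of distances the sign flips, and a value near +1 reflects good clustering.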

D. Manhattan Matrix:
Manhattan distance is the absolute distance between two points. Let the objects be $x = (x_1, \ldots, x_d)$ and $y = (y_1, \ldots, y_d)$; then the Manhattan distance between the two objects is

$$d(x, y) = \sum_{i=1}^{d} |x_i - y_i|$$

In this work, we use Manhattan distance as the distance measurement technique.
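A minimal sketch of the Manhattan distance and the pairwise matrix built from it is given below. The function names and the sample readings are hypothetical and only illustrate the computation.

```python
def manhattan(x, y):
    """Manhattan (L1) distance between two points of equal dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return sum(abs(a - b) for a, b in zip(x, y))


def manhattan_matrix(points):
    """n x n matrix of pairwise Manhattan distances."""
    return [[manhattan(p, q) for q in points] for p in points]


# Hypothetical (rainfall, temperature) readings, for illustration only.
readings = [(0.0, 30.0), (12.0, 26.0), (3.0, 29.0)]
matrix = manhattan_matrix(readings)
print(matrix[0][1])  # |0 - 12| + |30 - 26| = 16.0
```

The matrix is symmetric with zeros on the diagonal, so only the upper triangle needs to be stored when memory matters.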

VI. EXPERIMENTAL RESULTS
Based on the above algorithms and methods, the data are processed to determine the optimal classes in each feature and in the road weight, and to verify which algorithm performs better. The experiments in Tables II and III are generated from the 750 days (1/12/2006 to 20/12/2008) of data collected from [5] and presented in Fig. 8.

1) Results of Divisive Hierarchical Method: The sum of squared errors (SSE) of all features using divisive hierarchical clustering with the Davies-Bouldin index, Dunn index, and Silhouette index is presented in Table II. The shaded block (in Table II) indicates the minimum value of SSE. This table presents the optimal cluster size of each feature under the three methods and shows that the Dunn index minimizes the SSE values in all cases. Thus, we conclude that the Dunn index performs better for HDC in finding the optimal clusters. The optimal classes of each feature using HDC are: Rainfall (k=2), Temperature (k=2), Wind (k=3), Humidity (k=5), and Peak hour (k=4).

Table III reflects that the Dunn index provides the minimum value of SSE for all features. Thus, we conclude that the Dunn index performs better for the k-means algorithm in finding the optimal cluster numbers. The optimal classes of each feature using k-means are: Rainfall (k=3), Temperature (k=3), Wind (k=4), Humidity (k=6), and Peak hour (k=5).
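The SSE used to compare the indexes can be computed as in the following minimal sketch, which assumes one-dimensional feature values and the cluster mean as the reference point. The function name and the sample values are hypothetical.

```python
def sum_of_squared_errors(clusters):
    """Within-cluster sum of squared errors for a list of 1-D clusters."""
    total = 0.0
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        # Squared deviation of every member from its own cluster mean.
        total += sum((x - mean) ** 2 for x in cluster)
    return total


# Hypothetical values, for illustration only.
print(sum_of_squared_errors([[1.0, 3.0], [10.0]]))
```

Because SSE never increases when a cluster is split further, it is most informative when comparing the cluster numbers proposed by different indexes rather than as a stand-alone criterion for choosing k.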
3) Comparison of HDC and K-means: Hierarchical clustering and k-means clustering are compared by computing the correlation on their optimal cluster numbers in each feature. It is clear from Table IV that the correlations of k-means are higher than the correlations of HDC for all features. Thus, we conclude that k-means performs better than HDC.

4) Optimal Cluster of Road Weight: From the previous experiments, it is clear that k-means with the Dunn index performs better for all features. Thus, k-means with the Dunn index is chosen for the classification of the road weight. Table V shows the number of clusters of the road weight and the Dunn index value of each corresponding clustering. The table shows that the maximum value of the Dunn index is achieved at k=7. Thus, the optimal cluster size of the road weight is seven (7), and there should be seven (7) different classes for road weight updates. Table VI presents some sample experimental results of road weight updates.

VII. CONCLUSION AND FUTURE WORKS
In this section, we summarize our work. The feature data are collected from external feeds (web site, RSS feed, web service, etc.) for classification. We cluster the data using two approaches (partition k-means and hierarchical k-means) and find the optimal number of clusters for each feature using the Davies-Bouldin index, Dunn index, and Silhouette coefficient. Thereafter, we determine which algorithm is better for each feature and then find the optimal number of clusters of road weights, using the five (5) measured feature clusters as input. In future, the validity of the classes can also be measured by other probabilistic and statistical methods. The Dunn index has a high computational cost. The computational cost and the error of the cluster building procedure may be reduced using other statistical models. At present, we do not consider other environmental and road status characteristics such as accidents and road works. The Roads and Highways authorities in Bangladesh do not publish any road construction or maintenance status; these attributes will therefore be considered in our future research.
The online multi-feed capability allows the proposed model to be connected with different social media (Facebook, Twitter, etc.), to collect necessary information (mishaps, disaster situations), and to use analytical tools to make proper decisions. However, special consideration of internet security is required, as all of the information is available on the internet. Recently, deep learning (DL) techniques have also been used to solve unsupervised clustering problems. Deep learning is much more complex to interpret than k-means. In addition, deep learning works with multi-layer data representations and sometimes performs worse when the amount of data is limited. It also faces the overfitting problem. Thus, a comparative study of simple k-means and DL is required and will be carried out in the near future.
The proposed TMS is still in its construction phase and covers small road networks. A broader, city-level area will be considered in the near future. A GSM- and GPS-based microcontroller with different embedded sensors is under development. This device will help to collect real-time environmental data at any given instant.

Fig. 1. Decision Tree Using ID3 Algorithm

Paper [13] evaluates the performance of three clustering algorithms (hard k-means, single linkage, and simulated annealing) and determines the number of clusters using four methods: the Davies-Bouldin index, Dunn's index, the Calinski-Harabasz index, and index I. Paper [14] compares three clustering algorithms: agglomerative hierarchical clustering, bisecting k-means, and standard k-means. Results indicate that the bisecting k-means technique performs better than the other two.

Fig. 2. Hierarchical clustering structure

1) Divisive Hierarchical Clustering Algorithm: The Divisive Hierarchical Clustering Algorithm (HDC), or DIANA (DIvisive ANAlysis) [17], is a variant of hierarchical clustering. It starts from the top with all data in one cluster (Fig. 3) and then splits clusters using a flat clustering algorithm such as k-means.

Algorithm:
a. Initially, all items belong to one cluster $C_i$, with $i = 0$.
b. Split $C_i$ into sub-clusters $C_{i+1}$ and $C_{i+2}$.
c. Apply k-means on $C_{i+1}$ and $C_{i+2}$.
d. Increment the value of i.
e. Repeat steps b, c, and d until the desired cluster structure is obtained.

Node 0 contains the whole data set. $C_1 = 2$: input nodes 1-2. $C_2 = 3$: input nodes 2-4 (1 split into 2 subgroups, 3 and 4). $C_3 = 4$: input nodes 3-6 (1 split into 2 subgroups, 5 and 6). Continue until $C_{kmax}$ is reached, where $C_{kmax}$ is the maximum number of clusters.
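The divisive procedure above can be sketched as follows. This is a minimal illustration under several assumptions that the text does not state: values are one-dimensional, each split is a 2-means run whose centers start at the minimum and maximum value of the cluster, and the cluster chosen for the next split is the one with the largest SSE. All names and sample values are hypothetical.

```python
def two_means(values, max_iter=100):
    """Split 1-D values into two groups with k-means (k = 2).

    Centers start at the min and max value (an assumption made for
    determinism, not a choice stated in the paper).
    """
    c0, c1 = min(values), max(values)
    left, right = list(values), []
    for _ in range(max_iter):
        left = [v for v in values if abs(v - c0) <= abs(v - c1)]
        right = [v for v in values if abs(v - c0) > abs(v - c1)]
        if not right:  # all values identical, nothing to split
            break
        n0, n1 = sum(left) / len(left), sum(right) / len(right)
        if (n0, n1) == (c0, c1):  # centers stopped moving
            break
        c0, c1 = n0, n1
    return left, right


def cluster_sse(values):
    """Sum of squared deviations from the cluster mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def divisive_clustering(values, k_max):
    """Top-down clustering: split until k_max clusters exist."""
    clusters = [list(values)]  # step a: everything in one cluster
    while len(clusters) < k_max:
        splittable = [c for c in clusters if len(set(c)) > 1]
        if not splittable:
            break
        # Assumption: split the cluster with the largest SSE next.
        target = max(splittable, key=cluster_sse)
        left, right = two_means(target)  # steps b and c
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters


# Hypothetical values, for illustration only.
data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5, 20.0, 21.0]
print(divisive_clustering(data, 3))
```

In practice the loop would be run for every k up to the maximum, with a validity index evaluated at each level to decide where to stop.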

Fig. 4. Partitional clustering

1) K-means Clustering Algorithm: K-means clustering [18][19][27] aims to partition data into k clusters. K-means is the most popular non-hierarchical iterative clustering algorithm (Fig. 5). The basic idea of k-means is to start with an initial partition and assign data objects to clusters so that the squared error decreases.

Algorithm:
a. Randomly initialize k centers from the set of data points $X^d = \{x_1^d, x_2^d, x_3^d, \ldots, x_n^d\}$.
b. Assign each point to its nearest center using the Manhattan distance measure.

$$dist(x_i, v_j) = \sum_{m} |x_i^m - v_j^m| \qquad (1)$$

c. Compute the centroid of each cluster by averaging the data objects belonging to the cluster, and assign it as the new cluster center.

$$v_j = \frac{1}{|C_j|}\sum_{x_i \in C_j} x_i \qquad (2)$$

d. Re-assign all the data points to their new centers.
e. Repeat steps b, c, and d until the cluster centers no longer change.
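Steps a to e can be sketched as below. This is a minimal illustration, assuming points given as tuples, a seeded random choice of initial centers, and an empty cluster keeping its previous center; the function names, the `seed` and `max_iter` parameters, and the sample points are hypothetical.

```python
import random


def manhattan(x, y):
    """Manhattan distance between two points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(x, y))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means with Manhattan assignment and mean update."""
    rng = random.Random(seed)
    # Step a: pick k distinct data points as the initial centers.
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Steps b and d: assign every point to its nearest center.
        labels = [min(range(k), key=lambda j: manhattan(p, centers[j]))
                  for p in points]
        # Step c: recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                dim = len(members[0])
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        # Step e: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical 1-D readings wrapped as tuples, for illustration only.
pts = [(1.0,), (1.2,), (0.8,), (10.0,), (10.5,), (9.5,)]
labels, centers = kmeans(pts, 2)
print(labels, centers)
```

The result depends on the initial centers, so running the sketch with several seeds and keeping the partition with the lowest SSE is a common safeguard.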

Fig. 5. Flow chart of K-means Algorithm

2) Dunn Index: The value of the Dunn index (DI) [21] is expected to be large if the clusters of the data set are well separated. If the data set has compact and well-separated clusters, the distance between clusters is expected to be large and the diameter of the clusters is expected to be small. Clusters become compact and well separated by maximizing the inter-cluster distance while minimizing the intra-cluster distance. A large value of the Dunn index indicates compact and well-separated clusters. The formulae of the Dunn index are

$$DI = \frac{\min_{i \neq j} \delta(C_i, C_j)}{\max_{1 \le l \le k} \Delta(C_l)} \qquad (8)$$

where $\delta(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$ is the distance between two clusters and $\Delta(C_l) = \max_{x, y \in C_l} d(x, y)$ is the diameter of a cluster.

For the correlation between the data and its cluster, the distance matrix is $D = \{d_{11}, d_{22}, d_{33}, \ldots, d_{nn}\}$, the incidence matrix is $C = \{c_{11}, c_{22}, c_{33}, \ldots, c_{nn}\}$, $\bar{d}$ is the mean of the distance matrix, and $\bar{c}$ is the mean of the incidence matrix.

B. Distance Matrix: Also called the similarity matrix, it is an n×n two-dimensional matrix, where n is the number of elements in the data set, and d(x, y) is the distance or dissimilarity between objects x and y. Fig. 7 represents the distance matrix.

$$d(x, y) = |x - y| \qquad (15)$$
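The Dunn index can be computed directly from the pairwise distances d(x, y) = |x - y|, as in the following minimal sketch. It assumes one-dimensional values, the smallest point-to-point distance as the inter-cluster distance, and the largest point-to-point distance within a cluster as its diameter, which is one common variant of the index. The function name and the sample values are hypothetical.

```python
def dunn_index(clusters):
    """Dunn index for a list of 1-D clusters (higher is better).

    Requires at least two clusters.
    """
    if len(clusters) < 2:
        raise ValueError("Dunn index needs at least two clusters")
    # Smallest distance between points of two different clusters.
    min_separation = min(abs(x - y)
                         for i, first in enumerate(clusters)
                         for second in clusters[i + 1:]
                         for x in first for y in second)
    # Largest distance between two points of the same cluster.
    max_diameter = max(abs(x - y)
                       for cluster in clusters
                       for x in cluster for y in cluster)
    if max_diameter == 0:
        return float("inf")  # every cluster is a single repeated value
    return min_separation / max_diameter


# Hypothetical values, for illustration only.
print(dunn_index([[1.0, 2.0], [5.0, 7.0]]))  # 3.0 / 2.0 = 1.5
```

Both terms scan all point pairs, so the cost grows quadratically with the number of data points, which matches the high computational cost of the index noted in the conclusion.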

TABLE II. OPTIMAL NUMBER OF CLUSTERS AND VALUE OF SSE OF RAINFALL, TEMPERATURE, WIND, HUMIDITY AND PEAK HOUR USING HDC WITH DB, DUNN AND SC INDICES

TABLE III. OPTIMAL NUMBER OF CLUSTERS AND VALUE OF SSE OF RAINFALL, TEMPERATURE, WIND, HUMIDITY AND PEAK HOUR USING K-MEANS WITH DB, DUNN AND SC INDICES

2) Results of K-means Clustering Method: The sum of squared errors (SSE) [26] of all features using the k-means clustering algorithm with the Davies-Bouldin index, Dunn index, and Silhouette index is presented in Table III. The shaded block (in Table III) indicates the minimum value of SSE.

TABLE VI. SAMPLE ROAD WEIGHT CLUSTERING RESULT