Identify Discriminatory Factors of Traffic Accidental Fatal Subtypes using Machine Learning Techniques

Traffic accidents are one of the leading causes of mortality and long-term injury worldwide, and Bangladesh is no exception: vehicle accidents have become an everyday occurrence there. Bangladesh's largest highway, the Dhaka-Banglabandha National Highway, sees a significant number of accidents each year. In this work, we gathered accident data from the Dhaka-Banglabandha highway over an eight-year period and attempted to determine the subtypes present in this dataset. We then tested various classification algorithms to see which performed best at classifying accident subtypes. To describe the discriminatory factors among the subtypes, we also used an interpretable model. This experiment provides essential information on traffic accidents and so helps in the development of policies to reduce road traffic collisions on Bangladesh's Dhaka-Banglabandha National Highway.
Keywords: Traffic accident; clustering analysis; machine learning; feature selection; classification; discriminatory factors


I. INTRODUCTION
Traffic accidents have become one of the leading causes of loss of life and property. The likelihood of traffic accidents is increasing as the number of vehicles and roads increases. In 2020, a total of 4,891 vehicle accidents in Bangladesh killed 6,686 people and wounded 8,600 others [1], an average of about 18 deaths per day across the country. The Bangladesh Passengers Welfare Association (BPWA) disclosed these data in its yearly road accident monitoring report for 2020. According to the Accident Research Institute (ARI) of Bangladesh University of Engineering and Technology (BUET), 56,987 individuals have perished in 58,208 vehicle accidents in Bangladesh in the last two decades. Many researchers have examined road accident datasets and used various machine learning methods to predict the risk of an accident; some of their findings are summarized in the Literature Review section. All of the research efforts on these datasets are aimed at classifying the risk of a car accident. Organizations in many countries work to maintain road safety in order to reduce the threat of fatal road traffic accidents. Researchers have, moreover, used a variety of techniques, particularly statistical methods, to identify the causes of road traffic accidents from historical road traffic datasets. Using various data mining tools and techniques, data miners have investigated different parameters or variables behind traffic accidents besides driver behavior. Many researchers have spent a significant amount of time attempting to find the best-performing data mining procedure for road traffic accident datasets.
In this research, we collected traffic accident data for the Dhaka-Banglabandha highway (DBH) from the Accident Research Institute (ARI), BUET, covering 2007 to 2015, and we used only the fatal accident records. This research aims to identify fatal accident subtypes and the important discriminant features that will assist authorities in better understanding accident risks.
The rest of this paper is organized as follows. Section II is dedicated to related work. Section III discusses traffic accident data analysis and methodology.
[3]. In the same year, F. Francis used hierarchical clustering and K-means clustering to merge the spatially specified groupings into six clusters based on the similarity of their temporal patterns [4]. Dooti Roy et al. (2021) introduced a two-stage clustering technique, based on SOM followed by neural gas clustering, to build a data-driven taxonomy of bus crashes [5]. Rocio Suarez-del Fueyo et al. (2021) used unsupervised clustering methods to group severely injured, belted occupants according to biomechanical characteristics and accident severity [6]. The applicability of the k-prototypes clustering method to massive truck-involved crashes was investigated by Syed As-Sadeq Tahfim et al. (2021). To predict the severity of injuries in major truck incidents, four gradient-boosted decision tree techniques were applied to the dataset and to the individual clusters [7]. Filbert Francis et al. (2021) identified high-risk areas in Dar es Salaam for motorcycle-related injuries and discovered three distinct motorcycle injury hotspot clusters [8].
Mert Ersen et al. (2020) used the Kernel Density approach to carry out statistical analyses by accident type. The Kernel Density approach was found to produce better visual results than other spatial methods [9]. Seyed Mohsen Hosseinian et al. (2020) investigated the effect of different factors on the severity of urban traffic accidents in the Rasht metropolis by using frequency analysis of accident data [10]. Qiuru Cai (2020) used the Apriori algorithm to mine the rules that govern the relationship between risk factors and the causes of traffic accidents on urban roads [11]. In 2020, Yunduan Lin et al. used crowdsourced data to investigate techniques for predicting the complicated behavior of traffic flow evolution after traffic accidents. According to the results, NN outperforms the other models [12].
Sharaf AlKheder et al. (2020), on the other hand, used three data mining algorithms to conduct a thorough investigation of risk factors associated with the severity of traffic accidents. In comparison to previous models, the Bayesian network was more accurate in predicting the variables [13]. Yang Yong Zheng et al. (2020) identified the elements that influence traffic accidents in undersea tunnels and developed a prediction model for undersea tunnel traffic accidents [14]. Marjana Cubranic-Dobrodolac et al. (2020) suggested a model for assessing and making decisions about a driver's propensity for traffic accidents, based on an estimation of the driver's psychological attributes [15].
Based on single-vehicle crashes, Natalia Casado-Sanz et al. (2020) found the contributing factors to a fatal outcome. The factors most strongly associated with driver injury severity were identified using a Multinomial Logit model [16]. Human error was highlighted as a major contributory element in road traffic accidents by Asad Iqbal et al. (2020), and the Salt Range was classified as a black spot on account of vehicle brake failure [17]. In 2019, Minglei Song et al. suggested a road accident prediction model based on joint probability density feature extraction from big data [18]. Cheng Zhang et al. (2019) chose eight impact factors and found the Bayesian network to be the best model for predicting road accident black spots [19].
Sadiq Hussain et al. (2019) applied J48, Multi-layer Perceptron, and BayesNet classifiers to accident datasets. The Multi-layer Perceptron classifier performed well in the study, with an accuracy of 85.33 percent [20]. According to Juan Pineda-Jaramillo et al. (2019), road traffic collisions occur in all clusters, although zones surrounded by landscapes and parks have more run-overs than fallen residents [21].

A. Dataset Description
From 2007 to 2015, we collected 1,283 records of traffic incidents on the N5 National Highway (N5NH) from MAAP5 of the Accident Research Institute (ARI), BUET. Accident report forms have been distributed to various police stations in Bangladesh. There are two parts to the accident report form: the main form and the supplementary form. Each accident record is filled out on the main form, where the first 37 columns hold preliminary information on the severity of the traffic accident. Vehicle information is stored in columns 38-45, whereas driver information is stored in columns 46-52. Furthermore, columns 53-58 and 59-64 contain detailed information about passengers and pedestrians, respectively. Columns 65-67, on the other hand, are used to identify the causes of an accident.

B. Proposed Discriminatory Factors of Fatal Subtype Detection Model
Fig. 1 depicts an overview of our process for identifying discriminatory characteristics, which is briefly detailed step by step.
Step 1: Data Preprocessing and Analysis: Some attributes, such as the route number, have the same value in every record. Furthermore, some attributes, such as XY map, X coordinate, and Y coordinate, have no values, and 67 percent of the values are missing in the kilometer post and 100-meter attributes. As a result, we eliminated these attributes from the experimental data. We separated 1,002 fatal records from the 1,283 entries. Numeric and nominal values are mixed throughout the records, so all nominal values were converted to numeric values. The features dealing with vehicle details (columns 38-45), driver details (columns 46-52), passenger details (columns 53-58), and pedestrian details (columns 59-64) are removed from the empirical traffic accident data as unusable features. Table I illustrates these characteristics with a brief explanation. Report Number, FIR Number, and Thana are not deemed particularly important and are also removed from the empirical data. Next, we construct a heat map to detect correlated attributes (see Fig. 2). We see that the number of vehicles is correlated with the number of driver victims and the number of pedestrian victims. As a result, we eliminate these two attributes. The remaining attributes are used to obtain more accurate outcomes in this experiment.
Step 2: Employing Clustering Analysis: Clustering analysis is a technique for grouping cases into meaningful groups of similar cases based on their characteristics. Hierarchical clustering algorithms either recursively divide clusters into smaller sub-clusters (divisive mode) or merge them into larger super-clusters (agglomerative mode) [2]. The result is a hierarchical structure known as a dendrogram, in which the connection between each pair of clusters is determined by a measure of dissimilarity or similarity. We apply this strategy to generate fatal subtypes in the accident dataset. To assess the predictability of the proposed model, these subtypes are treated as separate class labels.
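The agglomerative form of the clustering in Step 2 can be sketched as follows. This is a minimal illustration only: the linkage criterion (average linkage), the distance (Euclidean), the function name agglomerative_clusters, and the toy records are all assumptions, since the settings used in the study are not stated in this section.

```python
from itertools import combinations
from math import dist


def agglomerative_clusters(points, n_clusters=2):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    # Start with every record in its own cluster (stored as lists of row indices).
    clusters = [[i] for i in range(len(points))]

    def average_linkage(a, b):
        # Mean Euclidean distance over all cross-cluster pairs of records.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster b into cluster a (a < b, so deleting b leaves a in place).
        clusters[a].extend(clusters[b])
        del clusters[b]

    # Convert the membership lists into one label per record.
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Synthetic, already-numeric records (hypothetical values, not the ARI data).
records = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.3)]
subtype_labels = agglomerative_clusters(records, n_clusters=2)
print(subtype_labels)
```

The returned labels play the role of the subtype class labels used by the later steps. The pairwise search is quadratic per merge, so for a dataset of about a thousand records a library routine would be the more practical choice.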
Step 3: Chi-Square Test for Feature Ranking: When two attributes are independent, the observed count is close to the expected count, resulting in a small Chi-Square value. The higher the Chi-Square value, the more dependent the attribute is on the response, and the more suitable it is for model training. In our research, we rank attributes in order to identify the set of most significant factors that yields the maximum accuracy. After identifying the subtypes, we applied the Chi-Square feature ranking technique to the accident dataset to discover this set of most significant attributes.
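The observed-versus-expected comparison behind Step 3 can be illustrated with a small sketch. It computes the Pearson statistic, the sum of (observed - expected)^2 / expected over all cells of the feature-by-class contingency table, and ranks features by it. The function names, feature names, and toy values are hypothetical, and the sketch ranks by the raw statistic rather than by P-Value, which is an assumption that only matches a P-Value ranking when the features have the same number of categories.

```python
from collections import Counter


def chi_square_statistic(feature_values, class_labels):
    """Pearson chi-square statistic for one categorical feature against the class label."""
    n = len(class_labels)
    observed = Counter(zip(feature_values, class_labels))
    feature_totals = Counter(feature_values)
    class_totals = Counter(class_labels)
    statistic = 0.0
    for value, value_count in feature_totals.items():
        for label, label_count in class_totals.items():
            # Expected cell count if the feature and the class were independent.
            expected = value_count * label_count / n
            statistic += (observed[(value, label)] - expected) ** 2 / expected
    return statistic


def rank_features(table, class_labels):
    """Return feature names sorted from most to least dependent on the class."""
    scores = {name: chi_square_statistic(column, class_labels) for name, column in table.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Synthetic example (hypothetical values, not the ARI data).
subtypes = [1, 1, 1, 1, 2, 2, 2, 2]
table = {
    "feature_b": ["x", "y", "x", "y", "x", "y", "x", "y"],  # unrelated to the subtype
    "feature_a": ["x", "x", "x", "x", "y", "y", "y", "y"],  # fully aligned with the subtype
}
print(rank_features(table, subtypes))
```

Taking the first k names from the ranked list gives the kind of top-k feature subset that the classifiers are later trained on.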
Step 4: Normalization: Normalization is a data preparation method used frequently in machine learning. Its major purpose is to bring the values of numeric columns in the dataset onto a common scale without losing information. In this paper, we use the MinMaxScaler class in Python to normalize the most significant attributes of the fatal subtype data, so that all attributes share a common range.
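The rescaling in Step 4 can be written out directly. With its default settings, scikit-learn's MinMaxScaler maps each column to the range [0, 1] using (x - min) / (max - min); the sketch below applies the same formula to one column in plain Python. The function name, the handling of constant columns, and the sample values are assumptions made for illustration.

```python
def min_max_scale(column):
    """Rescale one numeric column to the range [0, 1]."""
    low, high = min(column), max(column)
    if high == low:
        # A constant column has zero range; map it to 0.0 to avoid dividing by zero.
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]


# Hypothetical numerically encoded attribute values.
encoded = [1, 2, 3, 5]
scaled = min_max_scale(encoded)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

Because the transformation is linear within each column, the ordering of and relative spacing between values are preserved; only the scale changes.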
Step 5: Classification Approach: Classification is the task of finding a model that describes and distinguishes classes or concepts so that it can predict the class of objects whose label is unknown. On the normalized dataset, we use six machine learning classification algorithms to identify the observed subtypes: Decision Tree (DT), Naive Bayes (NB), K-Nearest Neighbor (KNN), Random Forest (RF), Multi-layer Perceptron (MLP), and Support Vector Machine (SVM). Previous studies of road accidents have used these classifiers extensively [12,19]. To find the best classifier, evaluation metrics (see Table II) such as Accuracy, F1-Score, and AUROC were used.
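For a two-class problem, the three evaluation metrics named in Step 5 can be computed as in the following sketch. The standard binary definitions are assumed here; which subtype is treated as the positive class, and the labels and scores in the example, are hypothetical choices rather than details taken from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auroc(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores for six records.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold the scores at 0.5
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), auroc(y_true, scores))
```

Accuracy and F1-score depend on the chosen decision threshold, whereas AUROC is computed from the scores alone and summarizes performance across all thresholds.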
Step 6: Exploring Discriminatory Factors Using LIME: Local Interpretable Model-Agnostic Explanations (LIME) is an algorithm that interprets a model by perturbing the input data sample and observing how the predictions change. LIME produces a set of explanations that show how each feature contributes to the prediction for a single sample, which is a type of local interpretability. We apply LIME to the best-performing classifier, trained on the dataset with all attributes, to find which characteristics contributed the most to categorizing the subtypes. As a result, we obtain the discriminatory variables for the classification of subtypes.
246 | Page www.ijacsa.thesai.org
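The perturb-and-observe idea described in Step 6 can be sketched without the LIME library itself. The code below is a simplified, LIME-style local surrogate, not the library's implementation: it assumes Gaussian perturbations, an exponential proximity kernel, and an unregularized weighted least-squares fit, and the names explain_locally, solve_linear_system, and black_box, as well as all parameter values, are hypothetical.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def explain_locally(predict, sample, n_perturbations=500, noise=0.3, kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear surrogate to `predict` around `sample`.

    Returns one coefficient per feature; a larger magnitude means a larger local influence.
    """
    rng = random.Random(seed)
    d = len(sample)
    rows, targets, weights = [], [], []
    for _ in range(n_perturbations):
        # Perturb the sample with Gaussian noise and query the black-box model.
        z = [v + rng.gauss(0.0, noise) for v in sample]
        rows.append([1.0] + z)  # the leading 1.0 is the intercept term
        targets.append(predict(z))
        # Exponential kernel: perturbations closer to the sample get more weight.
        weights.append(math.exp(-(math.dist(z, sample) ** 2) / kernel_width ** 2))
    # Weighted least squares via the normal equations (X^T W X) beta = X^T W y.
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(d + 1)]
            for i in range(d + 1)]
    xtwy = [sum(w * r[i] * y for r, y, w in zip(rows, targets, weights)) for i in range(d + 1)]
    return solve_linear_system(xtwx, xtwy)[1:]  # drop the intercept


def black_box(x):
    # Hypothetical stand-in for a trained classifier's predicted probability of one subtype.
    return 1.0 / (1.0 + math.exp(-(4.0 * x[0] - 1.0 * x[1] + 0.0 * x[2])))


coefficients = explain_locally(black_box, [0.2, 0.1, 0.5])
# Feature indices ordered from most to least locally influential.
ranking = sorted(range(3), key=lambda i: abs(coefficients[i]), reverse=True)
print(ranking)
```

The surrogate's coefficients describe the model only in the neighbourhood of the chosen sample, which is why such explanations are called local; a different sample can yield a different feature ordering.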

IV. RESULT AND DISCUSSION
In our work, we identified the clusters using hierarchical clustering on the dataset. Each cluster is defined as an observed subtype present in the accident dataset. After identifying the subtypes, we used the Chi-Square test to determine the most significant features. Then, on the selected features, we performed data normalization using Python's MinMaxScaler class. We then used various classification algorithms (i.e., DT, KNN, NB, RF, SVM, MLP) to classify the observed subtypes, evaluating each classifier with 10-fold cross-validation. Finally, we used LIME to interpret features for discriminatory factors. Jupyter Notebook version 6.1.4 was used for all of the experiments.
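The 10-fold cross-validation protocol mentioned above can be sketched as follows. The sketch assumes plain shuffled folds without stratification (whether the study stratified its folds is not stated), and it uses a 1-nearest-neighbour classifier purely as a small stand-in model; the function names and the synthetic data are hypothetical.

```python
import random
from math import dist


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_neighbor_predict(train_x, train_y, query):
    """1-nearest-neighbour classifier, used here only as a small stand-in model."""
    nearest = min(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    return train_y[nearest]


def cross_validated_accuracy(x, y, k=10, seed=0):
    """Mean accuracy over k folds; each fold is held out exactly once for testing."""
    fold_scores = []
    for held_out in k_fold_indices(len(x), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(x)) if i not in held_out_set]
        train_x, train_y = [x[i] for i in train], [y[i] for i in train]
        correct = sum(nearest_neighbor_predict(train_x, train_y, x[i]) == y[i] for i in held_out)
        fold_scores.append(correct / len(held_out))
    return sum(fold_scores) / k


# Synthetic, well-separated two-class data (hypothetical, not the ARI data).
x = [(i * 0.01, 0.0) for i in range(20)] + [(5.0 + i * 0.01, 1.0) for i in range(20)]
y = [1] * 20 + [2] * 20
print(cross_validated_accuracy(x, y, k=10))
```

Every record is used for testing exactly once and for training in the other nine rounds, so the averaged score uses all of the data while never evaluating a model on records it was trained on.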

A. Performance Comparison of Distinct Classifiers
In this study, hierarchical clustering yields two subtypes (subtype-1 and subtype-2) (see Fig. 3). The proportions of subtype-1 and subtype-2 are found to be the same. As a result, no data balancing was required. The Chi-Square test result is displayed in Table III. As can be seen in the table, the features are listed in ascending order of their P-Values; a feature with a lower P-Value is more important. We took different numbers of features (i.e., 5, 10, 15, 20, and 24) from this ranked feature list and applied the different classifiers to the resulting datasets. The experimental results for the classifiers (described in Section III.B) used to categorize the subtypes are shown in Tables IV, V, VI, VII, and VIII. To explain our findings, we used several evaluation metrics (Accuracy, F1-score, and AUROC). The performance analysis of all significant features is shown in Fig. 4. From the tables, we can see that RF outperforms all other classifiers in terms of accuracy, F1-score, and AUROC (see Tables IV to VIII) for every number of features. For the dataset with only the five most significant features, RF achieves 99.90% accuracy, 99.90% F1-score, and 100.00% AUROC. This is the highest score obtained in our study. For other sets of features, RF receives slightly different scores. Furthermore, all other classifiers, with the exception of KNN, achieve high results (i.e., above 90%). It is also worth noting that all classifiers performed best with the five most significant features, and their performance degraded as the number of features used increased.

B. Interpretation of Features for Discriminatory Factors
We applied LIME to the highest-performing RF classifier on the dataset (with all features) to determine which features contributed the most to correctly categorizing the subtypes. As a result, we obtain discriminatory factors for subtype classification. The features that contributed the most to identifying distinct subtypes are shown in Fig. 5; they differ significantly from the statistical result we obtained using the Chi-Square test for important features. The most crucial feature identified for subtype classification is 'Road Feature', as seen in Fig. 5. The relationship between road features and accident subtypes is shown in Table IX. The table shows that 'Road Feature'-General is the most prevalent and has about the same ratio in both subtypes. 'Road Feature'-Bridge is twice as common in subtype-1 as in subtype-2. Culverts and Speed Breakers are more common in subtype-1 and subtype-2, respectively. The second most significant attribute is 'Road Class'. The relationship between road classes and accident subtypes is shown in Table X. It is apparent that the most prevalent value is 'Road Class'-Natural, which has nearly the same ratio in both subtypes. 'Road Class'-Feeder is twice as common in subtype-1 as it is in subtype-2. 'No. of Vehicles' is the third most important feature. The relationship between 'No. of Vehicles' and the accident subtypes is shown in Table XI. The table shows that for subtype-1, 'No. of Vehicles' 1 is more prevalent, while for subtype-2, 'No. of Vehicles' 2 is more common. 'No. of Vehicles' 5 appears only in subtype-1. 'Weather' is the fourth most important feature. The relationship between weather and accident subtypes is shown in Table XII. The data show that 'Weather'-Clean/Fair has a higher risk of accidents. 'Weather'-Rain is exclusively related to subtype-1. As shown in Fig. 5, 'Traffic Control', 'Junction Type', and 'Divider' all play a role in subtype classification. Other features in the list have a negative relationship with classification into subtypes.

C. Related Studies and Implications
Many researchers have looked into road accident classification, and some of their findings are included in Section II. We found that all of the research efforts on these datasets were focused on classifying the risk of a traffic collision. However, to the best of our knowledge, no attempt has been made to identify the various subtypes of road accidents. We used clustering to determine the subtypes in this study, and the appropriate number of clusters for each dataset is justified. Then we used classifiers to find the best classification of subtypes using relevant feature sets. Finally, using an explainable AI technique, we showed the key features that contributed to the identification of subtypes. Identifying subtypes will assist authorities in better understanding accident risks. We discovered important factors that will assist them in identifying subtypes as well as accident risks.

V. CONCLUSION AND FUTURE WORK
Traffic accidents are viewed as a global issue that results in fatalities and serious injuries. The study of traffic accident data assists the traffic department in identifying the primary contributing factors of accidents and revealing the relationships between them, laying the groundwork for the development of risk control measures. Discriminatory factors may be factors that increase the likelihood of traffic accidents or factors that contribute to the severity of the resulting injuries. We compiled a list of 24 features that are thought to be linked to road accidents. The information was obtained from the Accident Research Institute (ARI) at BUET. To recognize the clusters, we used hierarchical clustering on the dataset. Each cluster represents an observed subtype found in the accident dataset. To categorize those observed subtypes, we applied six different classification methods to the datasets. Finally, to interpret the features as discriminatory factors, we used the LIME analysis technique. Future work will require a thorough examination of an updated, country-wide dataset of traffic accidents, the application of more classification and clustering algorithms, and the improvement of the discriminatory factor identification model through additional development and experiments.