A Platform for Extracting Driver Behavior from Vehicle Sensor Big Data

Traffic analysis of vehicles in densely populated areas and places of public gathering can provide useful insights into crowd behavior. Hajj is a spatio-temporally bound religious activity that is held annually and attended by more than 2 million people. More than 17,000 buses are used to transport pilgrims on fixed days to fixed locations, which poses great challenges for crowd management. Using Global Positioning System (GPS) and Automatic Vehicle Location (AVL) sensors attached to buses, a large amount of spatiotemporal vehicle data can be collected for traffic analysis. In this paper, we present a study in which driver behavior was extracted from an analysis of vehicle big data. We explain in detail how we collected the data, cleaned it, moved it to a big data repository, processed it and extracted information that helped us characterize driver behavior according to our definition of aggressiveness. We used data from 17,000 buses collected during Hajj 2018.
Keywords: GPS data; AVL sensors; Hajj; big data; traffic analysis


I. INTRODUCTION
Open-source traffic data for Saudi Arabia, provided by the Saudi Ministry of Interior, shows that rash driving is a major factor contributing to road accidents [1] in the Kingdom of Saudi Arabia. The data also shows that the largest number of incidents occurs in the Makkah region, whose population is far smaller than that of Riyadh, the most populous region. One possible reason for the high number of accidents in the Makkah region is that the annual Hajj pilgrimage takes place in this area, when more than 2 million people visit the region from all corners of the world. A large fleet of vehicles is needed to transport these people between cities and the holy places. To understand the behavior of drivers, we used the data of 17,000 buses collected in Hajj 2018.
Hajj is an annual pilgrimage of Muslims that takes place from the 8th to the 13th of Dhul-Hijjah, the 12th month [2] [3] of the lunar Islamic calendar. More than 2 million people from Saudi Arabia and around the world come to perform Hajj. International pilgrims arrive in the city of Makkah a few days earlier. On the 8th, they leave for Mina (bounded with a red dashed line in Fig. 1), a small dwelling of permanently installed tents near Makkah. On the morning of the 9th, they move to Arafat (bounded with a blue dashed line in Fig. 1), an open space about 17 km further south. After sunset, they come back and spend the night at Muzdalifah (bounded with a green dashed line in Fig. 1), completing their return trip to Mina by the morning of the 10th. From the 10th onwards, for a period of 3 days, with an optional 4th day, the pilgrims stone the three pillars called Jamarat. They also go to slaughterhouses and visit the Grand Mosque in Makkah to perform certain obligatory rituals [2]. All this movement is restricted with respect to time and space. A fleet of more than 17,000 buses is required to move the pilgrims across the holy places, collectively referred to as Mashaer. To collect traffic data from the large number of buses used by the pilgrims, the General Syndicate of Cars (Naqaba) has ordered the bus operators to attach Automatic Vehicle Location (AVL) sensors to their buses. Data collected from these AVL sensors has proven to be very useful in studying a number of traffic-related characteristics. We have developed a system that uses this data to present interactive visualizations of traffic activity at various hours of the day throughout the Hajj season.
This article expands on our previous research [5][6] by focusing on the analysis of sensor data to extract driver behavior. In this paper we present the results of a study conducted on pilgrim bus data. The main objective of this study is to understand drivers' aggressive behavior by capturing data stamped with spatial and temporal information using Automatic Vehicle Location (AVL) devices based on the Global Positioning System (GPS). The vehicles used in this study were pilgrim buses equipped with AVL sensors that provided location-based data. From this data, we extracted the speeds of various buses on different routes. We used this speed data, as well as other parameters extracted from the source data, to classify drivers and their driving skills according to our defined driver profiles. This paper is divided into six sections. Section 2 discusses the state of the art in the area of traffic data collection for information extraction using AVL and GPS technology. Section 3 explains the methodology for collecting data and extracting useful information. Section 4 presents an overview of the system architecture and explains the role of different components in the system. Section 5 discusses the results of applying analytics to the data. Section 6 concludes the paper with a summary of the whole study.

II. LITERATURE REVIEW
Our study uses GPS data collected from AVL devices to detect driver behavior. We detail below the state of the art in this regard.
Grengs et al. [7] explained the procedural challenges of collecting and storing data, designing databases, and manipulating and analyzing the enormous set of geocoded data captured from trips and tours in order to understand the driving characteristics of a single driver over a period of about a month. They studied 78 drivers by using an automobile and recorded their behavior on a day-to-day basis for about a month. They added the position coordinates and time stamp to each data set. Their results showed that the travel patterns were more complex when compared with traditional travel.
Necula et al. [8] used a Hidden Markov Model (HMM) method and a training process, and presented an interactive tool to study drivers' behaviors. The tool integrates past real data captured from various local drivers and analyzes the routes followed by every driver using time, distance, height and speed information. The tool also uses maximum likelihood to validate the next route segment for a network of roads.
Feng et al. [9] examined the merits of using an accelerometer with GPS data in a transportation study. They presented three approaches, first considering the accelerometer only, then GPS sensors only and lastly a combination of both accelerometer and GPS sensors. They used a Bayesian Belief Network model to study the three different transportation modes and found that the accelerometer can play a significant role in imputing the transportation mode. Using each device separately was helpful in terms of predictive accuracy. The combined use of both devices revealed the best performance.
Choi et al. [10] considered real driving scenarios and presented a model to detect distraction due to peripheral tasks. They used Hidden Markov Models (HMMs) and captured drivers' characteristics using a CAN-Bus (Controller Area Network) sensor. This provided them with a variety of information such as steering wheel angles, brake status and brake usage with respect to time, as well as braking behavior with the associated speed. They defined the drivers' behavior in terms of action and distraction based on the above-mentioned data.
Warwick et al. [11] presented their study on driver drowsiness and cited about 100,000 crashes a year on national highways. They emphasized the development of a smart system to detect drowsiness early in order to avoid accidents. They compared vehicle-based approaches to the causes of accidents with approaches based on drivers' physiology and found that the causes were mostly due to drivers' physiology. They proposed a driver drowsiness detection system that uses wireless wearables to measure a driver's physiological data. The sensory setup provides data that can be analyzed to find key parameters related to drivers' drowsiness and generates early alerts so that action can be taken in time.
Jasinski et al. [12] proposed a method that identifies aggressive driver behavior in real time and generates advance alerts for any dangerous behavior that may result in a severe accident. The method is composed of four stages: data collection, pre-processing, semantic enrichment and calculations to compute the aggressiveness of drivers. The Trajectory Aggressiveness Indicator (TAI) varies from 0 to 100, where 0 means no aggressiveness and 100 means extremely aggressive. The proposed approach also considers the environmental conditions in order to calculate a better estimate of the TAI.
Paefgen et al. [13] developed a method to measure the accident risk. The method utilizes GPS data collected from a large sample of traffic data from a telematics provider in northern Italy, where there were 1500 drivers with and without accidents over a period of two years. The GPS trajectories were analyzed to study the driver risk profiling problem and their findings in this regard were promising.
Khan et al. [14] presented a comprehensive survey on driving activities, the reasons for accidents, and systems that notify drivers in advance, for their safety and comfort, upon early detection of an accident. Based on their findings, they suggested that a well-designed DMAS (driving monitoring and assistance system) can address critical issues associated with drivers as well as the challenges associated with the related driving environment.
Stutts et al. [15] and Kan et al. [16] described the main causes of the majority (about 90%) of road accidents as distraction, fatigue and aggressive driving style. Distraction refers to eating or drinking, looking at people off the road, or engaging in small activities such as sharing food, texting or attending phone calls, and causes more than half of the accidents. Fatigue describes the physical condition of drivers, such as drowsiness, or overacting to show off driving skills. Aggressiveness relates to driving style, such as overreacting to overtaking cars or applying brakes in front of other vehicles. Their study suggested the use of DMAS, taking into account the factors associated with drivers and the driving environment.
228 | Page www.ijacsa.thesai.org
Jingqiu et al. [17] developed a hybrid model to study driving behavior and risk patterns. The model uses an Autoencoder and Self-organized Maps (AESOM) approach to extract driving behaviors. They made 4032 observations by collecting data through GPS sensors in Shenzhen, China, analyzed the speed and excessive acceleration, and concluded that the use of AESOM may improve the quality of driving.
Arumugam et al. [18] presented a comprehensive survey on driving behavior that addressed drivers' aggressive behavior and detailed multiple incidents in short-term and long-term driving activities. The purpose of the survey was to explore ways to minimize risk on the roads by considering the emotional factors that define drivers' behavior, and to provide information that insurance companies can use to profile drivers' behavior and define the best possible insurance premium package for risk prevention.
Improving the transportation system is one of the significant requirements for large gatherings where crowd safety is a major concern [4]. To this end, researchers have proposed several intelligent transportation systems, which in turn opened the door to research areas related to traffic data collection, data mining [19][20][21], processing and analysis [22]. GPS (Global Positioning System) sensors are valuable data sources that help track vehicles by reading their spatial information in real time [23].

III. METHODOLOGY
We have divided our process into 5 major categories: data management, computation, behavior definition, comparison and interpretation. Each category has been further divided as shown in Fig. 2. Data management includes data collection, data cleaning, data enrichment and data visualization. Computation consists of calculating distance, speed and acceleration. Definition entails defining speed ranges with respect to roads. Comparison checks the bus speed against the allowed range for each road. Interpretation includes identifying other parameters, driver characteristics and classifying drivers.

A. Data Management
Data management is a critical step that directly helps in improving the quality of extracted information for analytics. It includes cleaning, structuring, removing noise or fixing missing data and validation. In the following, we explain how we performed the above operations on our data.
1) Data collection: Developing any system for decision-making requires the collection of a good amount of historical data. Our data source (Naqaba) is the transport authority of the Ministry of Hajj and Umrah that collected data of 17,000+ buses using automatic vehicle location (AVL) service providers in Hajj 2018. Fig. 3 shows some facts about the collected data. The data collected was for pilgrim movement on different routes during Hajj, such as Jeddah Airport-Makkah, Makkah-Madinah, Madinah-Madinah Airport, Makkah-Mina, Makkah-Arafat, Arafat-Muzdalifah and Muzdalifah-Mina.
2) Data cleaning: We excluded noise from the data using a spatial boundary algorithm that removes locations outside a given boundary from the dataset. We also removed data entries where the distance between two consecutive points for the same bus is approximately zero and the location is not on a main road, assuming the bus to be parked at a pickup or drop-off point or at a parking location.
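The two cleaning rules above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the boundary is assumed to be a simple latitude/longitude bounding box, the 5 m movement tolerance is an assumed value, and the `on_main_road` field is a hypothetical flag that would come from a separate road lookup.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlam = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlam / 2) ** 2
    return 2 * 6371000.0 * asin(sqrt(h))


def clean_trace(points, box, min_move_m=5.0):
    """Filter one bus's time-ordered points.

    points: list of dicts with 'lat', 'lon' and 'on_main_road' (hypothetical flag).
    box: (south, west, north, east) bounding box in degrees (assumed boundary shape).
    """
    south, west, north, east = box
    kept = []
    for p in points:
        # Rule 1: drop locations outside the spatial boundary.
        if not (south <= p["lat"] <= north and west <= p["lon"] <= east):
            continue
        # Rule 2: drop near-zero movement away from main roads (bus assumed parked).
        if kept:
            last = kept[-1]
            moved = haversine_m(last["lat"], last["lon"], p["lat"], p["lon"])
            if moved < min_move_m and not p["on_main_road"]:
                continue
        kept.append(p)
    return kept
```

A stationary point on a main road is kept on purpose, since it may indicate congestion rather than parking.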
3) Data enrichment: We collected the GPS traces of the buses during Hajj. Generally, AVL sensor providers configure GPS devices to transmit data at intervals of 2 to 7 minutes. We found that the interval between two consecutive locations from the same bus is sometimes as long as 20-30 minutes, which means that some entries were lost, mostly because of connectivity issues. Fig. 4 shows the anomaly associated with a missing entry or a long distance between recorded points, using two locations from the same bus. The black line shows the path as per the raw data and the blue line shows the path with an extra data point after data enrichment, i.e., after adding a missing point. The enriched data reflects the actual distance and is beneficial when extracting knowledge for analytics.
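The idea of adding missing points can be illustrated with a short sketch. It assumes straight-line (linear) interpolation between the two recorded fixes and a 7-minute maximum reporting interval; the enrichment in the study may instead place the added point on the road network, as Fig. 4 suggests. The function name `fill_gaps` is hypothetical.

```python
from datetime import datetime, timedelta
from math import ceil


def fill_gaps(fixes, max_gap=timedelta(minutes=7)):
    """Insert interpolated points wherever two fixes are more than max_gap apart.

    fixes: time-ordered list of (timestamp, lat, lon) tuples for one bus.
    Linear interpolation is an assumption made for illustration only.
    """
    if not fixes:
        return []
    out = [fixes[0]]
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(fixes, fixes[1:]):
        gap = t1 - t0
        if gap > max_gap:
            # Split the gap into equal pieces no longer than max_gap.
            pieces = ceil(gap / max_gap)
            for k in range(1, pieces):
                f = k / pieces
                out.append((t0 + gap * f,
                            lat0 + (lat1 - lat0) * f,
                            lon0 + (lon1 - lon0) * f))
        out.append((t1, lat1, lon1))
    return out


# Illustrative fixes (not taken from the dataset): a 14-minute gap gets one extra point.
trace = [(datetime(2018, 8, 20, 10, 0), 21.0, 39.0),
         (datetime(2018, 8, 20, 10, 14), 21.2, 39.4)]
enriched = fill_gaps(trace)
```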

4) Data validation: Along with data enrichment, we also performed data quality checks to handle GPS errors. An error of only 10 meters can place the bus on the other side of the road, which leads to a large error in the calculated distance, and hence in speed and acceleration.

B. Computation
After completing data manipulation, we enriched the dataset by adding acceleration and distance information as additional data columns, using the following equations.
Calculate acceleration (a):
$$a = \frac{\Delta v}{\Delta t} = \frac{v_f - v_i}{t_f - t_i}$$
where $v_f$ = final velocity in m/s; $v_i$ = initial velocity in m/s; $\Delta v$ = difference in velocity in m/s; $t_f$ = ending time in seconds; $t_i$ = starting time in seconds; and $\Delta t$ = difference of time in seconds.
Calculate the distance (d) between adjacent points on the globe, as shown in Fig. 5, using the Haversine formula:
$$d = 2R \arcsin\left(\sqrt{\sin^2\left(\frac{\Phi_A - \Phi_B}{2}\right) + \cos\Phi_A \cos\Phi_B \sin^2\left(\frac{\lambda_A - \lambda_B}{2}\right)}\right)$$
where Φ = latitude, λ = longitude (both in radians), R = radius of the earth (R ≈ 6,371 km), A = ending point, B = initial point and d = the distance between the two points.
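The computation step can be sketched in a few lines. The function names and the choice of units (km for distance, km/h for speed, m/s² for acceleration) are assumptions made for this illustration, not details taken from the study.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat_a, lon_a, lat_b, lon_b):
    """Haversine distance in km between ending point A and initial point B (degrees)."""
    phi_a, phi_b = radians(lat_a), radians(lat_b)
    dphi = phi_a - phi_b
    dlam = radians(lon_a - lon_b)
    h = sin(dphi / 2) ** 2 + cos(phi_a) * cos(phi_b) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def speed_kmh(distance_km, dt_seconds):
    """Average speed over one interval between consecutive fixes."""
    return distance_km / (dt_seconds / 3600.0)


def acceleration_ms2(v_i_kmh, v_f_kmh, dt_seconds):
    """a = (v_f - v_i) / (t_f - t_i), with speeds converted from km/h to m/s."""
    return ((v_f_kmh - v_i_kmh) / 3.6) / dt_seconds
```

Because the devices report every few minutes, speeds derived this way are interval averages rather than instantaneous readings.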

C. Definition
We use the open-source OpenStreetMap (OSM) data and extract road-related characteristics for each road, such as the speed limit, to calculate the speed threshold for each road and highway as shown in Table I. The speed limit varies from 60 to 140 km/h depending on the road type and its proximity to populated areas. We allow the driver to exceed the speed limit by up to 10%.
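As a small illustration of the 10% allowance, the threshold for a road can be derived from its speed limit as below. The function names are hypothetical, and whether a speed exactly at the threshold counts as a violation is an assumption (here it does not).

```python
def speed_threshold_kmh(limit_kmh, tolerance=0.10):
    """Speed above which an observation counts as a violation."""
    return limit_kmh * (1.0 + tolerance)


def is_violation(speed_kmh, limit_kmh, tolerance=0.10):
    """True when the observed speed exceeds the limit plus the allowed tolerance."""
    return speed_kmh > speed_threshold_kmh(limit_kmh, tolerance)
```

For example, with a limit of 80 km/h the threshold becomes about 88 km/h, so an observation at 85 km/h would not be flagged while one at 90 km/h would.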

D. Comparison
Based on the street profiles extracted above, we match the bus speed data against our speed threshold for each road on the route to classify the vehicles according to speed. Spatial queries are run with the help of a spatial relational database, which stores the geometry of each road. A spatial query checks whether the location of the bus belongs to a given road segment or not.
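The membership check performed by the spatial query can be approximated in plain Python for illustration. This sketch is a stand-in for the spatial database, not the query used in the study: it assumes roads are stored as polylines of (lat, lon) vertices, uses a local equirectangular projection, and treats a point as belonging to a road when it lies within an assumed 15 m of one of its segments.

```python
from math import cos, hypot, radians

EARTH_RADIUS_M = 6371000.0


def to_local_xy(lat, lon, lat0, lon0):
    """Project (lat, lon) to metres relative to (lat0, lon0); valid for short distances."""
    x = radians(lon - lon0) * cos(radians(lat0)) * EARTH_RADIUS_M
    y = radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def distance_to_segment(p, a, b):
    """Shortest planar distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return hypot(px - ax, py - ay)
    # Clamp the projection of p onto the segment to its end points.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def match_road(lat, lon, roads, max_dist_m=15.0):
    """Return the name of the nearest road within max_dist_m, or None.

    roads: dict mapping a road name to a list of (lat, lon) vertices.
    """
    best_name, best_dist = None, max_dist_m
    for name, vertices in roads.items():
        pts = [to_local_xy(la, lo, lat, lon) for la, lo in vertices]
        for a, b in zip(pts, pts[1:]):
            d = distance_to_segment((0.0, 0.0), a, b)
            if d <= best_dist:
                best_name, best_dist = name, d
    return best_name
```

Once a point is matched to a road, its speed can be compared with that road's threshold.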

E. Interpretation
After separating the vehicles violating traffic rules from the others, we apply the spatio-temporal conditions mentioned previously to the data to classify the drivers as aggressive or non-aggressive.

IV. SYSTEM ARCHITECTURE
Fig. 6 shows the high-level view of the big data platform that we developed to analyze the bus data. We developed a data lake layer that consists of a Master data service in addition to an MS SQL service. The Master data service provides a visualization of all the relations in the data based on different parameters, such as Establishment, Office or bus number (every bus is assigned to an Office, which is under an Establishment related to a geographical area). The MS SQL database contains the original data we received from Naqaba.
Our big data layer is made up of a Cassandra cluster and a Big Data Aggregation service. We migrated the Location History data to the Cassandra cluster for cleaning and noise removal using the ETL engine. The benefit of using a Cassandra cluster is that it improves efficiency and scalability, since Cassandra is a distributed, wide-column NoSQL database management system.
We used Hadoop and Presto to set up the big data aggregation service. Presto is an efficient tool for distributed SQL analytical queries on data in the Hadoop Distributed File System (HDFS). Hadoop is highly beneficial for batch-based analytics, while Cassandra is well suited to time-based analytics. Redis is an open-source (BSD-licensed), in-memory data structure store used as a cache. It is good for caching a huge number of key-value pairs, and we have used the Redis cache to boost the performance of the system. We have devised a RESTful API that provides a list of endpoints to handle requests coming from the front-end. The front-end asks the API to fetch data from the Master data service, the big data aggregation service or the Redis cache, and the returned results are visualized on the screen.
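The role of the cache in front of the aggregation service can be illustrated with a cache-aside lookup. This is only a sketch of the general pattern under stated assumptions: a plain Python dict stands in for the Redis cache, and the class name, key format and `query_backend` callable are hypothetical rather than part of the platform's API.

```python
import json


class CachedBusApi:
    """Cache-aside lookup: try the key-value cache first, then the slower backend."""

    def __init__(self, cache, query_backend):
        self.cache = cache                  # dict standing in for the Redis cache
        self.query_backend = query_backend  # callable: bus_id -> list of rows

    def bus_history(self, bus_id):
        key = f"bus_history:{bus_id}"
        cached = self.cache.get(key)
        if cached is not None:
            # Cache hit: no query is sent to the aggregation service.
            return json.loads(cached)
        rows = self.query_backend(bus_id)
        # Store the serialized result so repeated requests are served from memory.
        self.cache[key] = json.dumps(rows)
        return rows
```

A production setup would also need an expiry or invalidation policy, which is omitted here.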
The API server provides the front-end with the data visualization and analysis service. It allows the user to display data based on multiple filters, including Establishment (Mo'assasa) name, Office (Maktab) number, company name, bus number and route. Fig. 7 compares the time taken to perform queries in MS SQL Server and on our big data platform. The orange line shows the time to fetch the records for a bus ID in MS SQL Server, while the grey line shows the time to fetch the records on our big data platform. It is clear from Fig. 8 that the time gained from moving to the big data platform is significant.

V. ANALYTICS AND DISCUSSION
Each bus was allotted to a single driver for the entire Hajj season. The vehicles were tracked by capturing their spatiotemporal information, and the collected data was analyzed against the speed limit of each road. Table II is a snapshot of the collected data along with information on a few violations.
First, we selected runs of consecutive observations that exceeded the speed limit threshold (80 km/h), together with their starting and ending timestamps. Then we calculated the duration of each violation, in minutes and seconds, from its starting and ending timestamps, as described in the table above. Fig. 9 summarizes the number of violations detected by the AVL sensors. We can observe that on most days the driver's behavior was aggressive, crossing the threshold speed several times.
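The selection of continuous violations can be sketched as a single pass over one bus's observations. The function name and the tuple layout are assumptions for illustration; the 80 km/h default mirrors the threshold quoted above.

```python
from datetime import datetime


def violation_episodes(observations, threshold_kmh=80.0):
    """Group consecutive over-threshold observations into episodes.

    observations: time-ordered list of (timestamp, speed_kmh) for one bus.
    Returns a list of (start, end, duration_seconds) tuples.
    """
    episodes = []
    start = end = None
    for ts, speed in observations:
        if speed > threshold_kmh:
            if start is None:
                start = ts          # first observation of a new episode
            end = ts                # extend the current episode
        elif start is not None:
            episodes.append((start, end, (end - start).total_seconds()))
            start = end = None
    if start is not None:           # trace ended while still above the threshold
        episodes.append((start, end, (end - start).total_seconds()))
    return episodes
```

An episode made of a single observation has a duration of zero seconds, which corresponds to a momentary violation.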
Further, we classified the violations based on severity, and collected information regarding the frequency, duration and severity of speed limit violations by a driver. Table III details these violations. We can see that the normal category of violations is common: 22 times the driver violated the speed limit for just a few seconds or so, and then reduced his speed below the threshold. As for the normal violation cases, on 8 occasions he violated the speed limit for about 10 minutes. Three times he violated it continuously for between 10 and 20 minutes, and twice he committed severe violations, that is, for more than 20 minutes. On the same scale, the system analyzed the mobility of several buses and found that the bus with ID 152 violated the speed limit 905 times in total, most of which fell in the once category, while 197 times the bus violated the limit for less than 10 minutes. No case of severe violation was recorded for bus 152. This analysis helps identify the worst cases, as shown in Fig. 11.
Among all the drivers, some performed best, committing no or only minor violations. Fig. 12 summarizes the best buses, whose drivers' behavior was in a satisfactory range. In the best case, the driver violated the speed limit for just a few seconds during the entire Hajj season. Fig. 13 shows the speed-based detection of aggressiveness. The geographical locations captured with timestamps were analyzed for one of the drivers, who was driving on the C-ring road in Makkah. Upon analysis, we identified both non-aggressive behavior (Fig. 13a: the green dots show that the driver's speed was within the threshold value and that he was not committing any violation) and aggressive behavior (Fig. 13b: the red dots show that the driver exceeded the threshold speed several times, showing his aggressive behavior). The data points show that the driver's behavior was aggressive for 52.55% of the collected data points, as visualized in the figure. The corresponding heading and acceleration information for both cases is also detailed.
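The severity breakdown described above can be sketched as a duration-based classifier. Only the "severe" label (more than 20 minutes) and the 10 and 20 minute boundaries come from the text; the other labels and the one-minute cut-off for momentary violations are placeholders assumed for this illustration.

```python
from collections import Counter


def classify_violation(duration_seconds):
    """Map a violation duration to a severity label (labels partly assumed)."""
    minutes = duration_seconds / 60.0
    if minutes > 20:
        return "severe"       # more than 20 minutes, as defined in the text
    if minutes >= 10:
        return "medium"       # between 10 and 20 minutes (placeholder label)
    if minutes >= 1:
        return "short"        # under 10 minutes (placeholder label)
    return "momentary"        # a few seconds or so (assumed cut-off of 1 minute)


def severity_profile(durations_seconds):
    """Count violations per severity label for one driver."""
    return Counter(classify_violation(d) for d in durations_seconds)


# Illustrative durations chosen to reproduce the counts reported for one driver.
durations = [5] * 22 + [300] * 8 + [900] * 3 + [1500] * 2
profile = severity_profile(durations)
```

A per-driver profile of this kind makes it straightforward to rank buses from worst to best.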

VI. CONCLUSION
In this paper we have presented a study to extract driver aggressiveness information from GPS data obtained through AVL sensors attached to a fleet of 17,000+ buses used to transport pilgrims from one place to another in the holy areas during the Hajj pilgrimage. We have explained the details of data preparation and pre-processing methodology we have adopted by moving the data to a big data platform for efficient query processing. We have also highlighted our procedure for extraction of information. One of the limitations of the experiment is that it was carried out in one particular city on drivers from outside the city. However, our technique is generic and can be used on any AVL based data in any part of the world.