Feature Engineering Algorithms for Traffic Dataset

As the global human population grows, traffic congestion in urban areas is worsening, which wastes time and fuel and, most importantly, increases the emission of pollutants. There is therefore a need to monitor and estimate traffic density. Automatic traffic management systems make it possible to record and monitor the movement of motor vehicles on a road segment. One challenge researchers face arises when historical traffic data are given as an incomplete annual average. The annual average daily traffic (AADT) is the average daily traffic volume on a roadway segment at a specific location over a year. An example of AADT data is that published by Road Traffic Volume Malaysia (RTVM), which is incomplete: the RTVM provides the average daily traffic volume and a single peak hour. Traffic is recorded for sixteen hours, but the only hourly figure given is for one hour, from 8.00 am to 9.00 am. Hence, the hourly traffic volume for the remaining hours must be estimated. Feature engineering can be used to overcome the issue of incomplete data. This paper proposes feature engineering algorithms that can efficiently estimate hourly traffic volume and, using queuing theory, generate features from the existing dataset for all traffic census stations in Malaysia. The proposed algorithms estimated the hourly traffic volume and generated features for three years at the Jalan Kepong census station, Kuala Lumpur, Malaysia. The algorithms were evaluated using Random Forest and Decision Tree models. The results show that our feature engineering algorithms improve the performance of the machine learning algorithms, except for the prediction of NO2 using Random Forest, which shows the highest MAE, MSE, and RMSE when traffic data were included for prediction. The algorithm is applied at one of the traffic census stations in Kuala Lumpur, and it can be used for the other stations in Malaysia. Additionally, the algorithm can be used for any annual average daily traffic data that include average hourly data.
Keywords—Feature engineering algorithm; queuing theory; Road Traffic Volume Malaysia (RTVM); machine learning algorithms


I. INTRODUCTION
As the global human population grows, traffic congestion in urban areas is worsening, which wastes time and fuel and, most importantly, increases the emission of pollutants. There is therefore a need to monitor and estimate traffic density. This need has led to the emergence of automatic traffic management systems for recording and monitoring the hourly and daily movement of motor vehicles. Several studies report that motor vehicles are a primary source of air pollution in urban areas worldwide [1]. Air pollution concentrations depend on traffic volume, vehicle speed, vehicle type, and many other factors. Researchers are looking for creative solutions such as smart cities and GIS systems to reduce traffic congestion and volume [2,3]. One study reveals that traffic volume has a significant impact on PM10, NOx, NO, and NO2 concentrations [4]. The study in [5] shows that an increase in the number of vehicles raises air pollution concentrations during the morning and evening peak hours. Traffic volume, traffic congestion, and low speed increase the level of PM and NOx emissions [6]. A study in Kuala Lumpur shows that air pollution concentration strongly depends on traffic volume, waiting time on the road, vehicle speed, and fuel consumption [7]. Vehicle speed, traffic composition, traffic volume, intensity, and acceleration also influence the concentration of air pollution [8].
One challenge researchers face, most of whom are conducting multidisciplinary studies, arises when the recorded traffic data are given as an incomplete annual average [9]. The annual average daily traffic (AADT) is the average daily traffic volume on a roadway segment at a specific location over a year. AADT data are collected using surveillance cameras to count and monitor passing vehicles on a 24-hour, 16-hour, 5-hour, or 1-hour basis. These data are used mostly in road transport studies, such as estimation of fuel consumption, roadway planning, emission prediction, traffic operation, travel behavior, accident prediction, and many more [10]. An example of AADT data is that published by Road Traffic Volume Malaysia (RTVM), which is incomplete. The RTVM provides the average daily traffic volume and a single peak hour. Traffic is recorded for sixteen hours, but the only hourly figure given is for one hour, from 8.00 am to 9.00 am, as shown in Fig. 1. The total daily traffic volume and the peak hour (hourly traffic volume from 8.00 to 9.00 am) are highlighted in red in the figure, and the volume by vehicle type is highlighted in blue. The entries marked not available (N/A) in the figure show that the remaining hourly traffic volumes and the corresponding volumes by vehicle type are missing. The hourly traffic for the remaining hours therefore needs to be estimated. In this study, feature engineering was applied to overcome the issue of incomplete hourly traffic volume.
Feature engineering is one of the most challenging and significant tasks in data science. It is the process of extracting and generating new features or variables from an existing dataset, which helps improve the performance of machine learning algorithms. It also helps in understanding the data more deeply and yields more valuable insights. Extracting and generating new variables is difficult, and processing the variables in a dataset before applying them in a model takes time and effort. Data scientists spend more than 80% of their time cleaning datasets [11].
The remainder of the paper is structured as follows. Section II presents the related work. Section III discusses the methodology in two parts: Section III-A presents the data and how they were collected, and Section III-B discusses the proposed feature engineering algorithms. Section IV presents the results in two parts: Section IV-A gives the results of the feature engineering algorithms, and Section IV-B covers the prediction of traffic emissions with and without the traffic dataset. Section V presents the discussion, and Section VI concludes the paper.

II. RELATED WORK
Traffic volume is one of the important variables contributing to the increase in air pollution produced by automobiles. Many studies have estimated hourly traffic volume. For example, [12] applied an extreme gradient boosting tree (XGBoost) and graph theory to estimate hourly traffic volume at locations without traffic sensors in Utah, United States of America (USA); the developed model was able to estimate hourly traffic volume. The study in [13] proposed a deep learning algorithm and an image processing method to estimate traffic volume, vehicle type, and vehicle speed from recorded traffic video; the model achieved 90% accuracy for traffic volume estimation. Hourly traffic volume was also estimated using an Artificial Neural Network (ANN) [14], and the applied ANN model was able to estimate hourly traffic volume.
Time spent on the road and vehicle speed are responsible for the variability and trend of air pollution. Several studies have estimated vehicle speed and time spent on the road [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]. These studies can be divided into three categories. First, most of the studies have speed parameters in their data, so they developed models to estimate vehicle speed at locations without traffic sensor stations; some researchers proposed algorithms to estimate the dataset's missing values, while others focused on estimating the speed to evaluate their models using the historical dataset. Second, some studies had recorded video of moving automobiles on the road and developed methods to estimate vehicle speed from it. Lastly, some studies installed sensor devices on the road to calculate and estimate vehicle speed.
In general, these studies used four types of traffic datasets: sensor device data, video-based data, image-based data, and vehicle data. Fig. 2 presents the four types of datasets used for the estimation of vehicle speed and time spent on the road. None of the presented studies estimates or generates vehicle speed and time spent on the road using an AADT dataset; to the best of our knowledge, no study has done so. The RTVM data have been used by many studies in Malaysia. Table V presents the research conducted in Malaysia using the RTVM dataset. These studies mostly used the data to explore the Level of Service (LOS), the number of registered vehicles, and traffic density. Studies using hourly traffic data are lacking because the hourly traffic volume dataset is unavailable or incomplete. In this paper, we propose feature engineering algorithms that can efficiently estimate hourly traffic volume and generate features from the existing traffic dataset (RTVM dataset). A queuing model is proposed to generate the vehicle speed and time-spent-on-the-road features.

III. METHODOLOGY

A. Data
There are 554 traffic monitoring stations across Malaysia. Traffic is recorded hourly on the state and federal roads in Malaysia by the State Public Works Department (JKR Negeri), and the data are organized by Road Traffic Volume Malaysia (RTVM). The RTVM published the first printed copy of the organized national traffic census in 1982. From early 1999, the data were available on Compact Disc (CD), and the online version has been available from 2014 to date. The traffic census is conducted twice a year, during March-April and September-October. Data collection is categorized into three types: type 0 data are recorded for 24 hours over 7 days, type 1 for 16 hours over 7 days, and type 2 for 16 hours in 1 day. The 16-hour recording runs from 6.00 am to 10.00 pm. The vehicles are divided into six classes, as presented in Table I. The RTVM divides carriageways into two types, single and dual. Table II describes the carriageway types, their codes, and how many lanes each road includes.
In this study, we propose a new algorithm for estimating hourly traffic data based on the AADT data provided by RTVM. The Kuala Lumpur traffic census area was chosen for this study; it has five stations, all on dual carriageways with six lanes. Table III presents the stations' IDs, locations, and the kilometer of each road. The WR101 station was chosen in this study, using the traffic data for 2014, 2015, and 2016. Table IV summarizes the hourly (one peak hour) and daily (16 hours) data based on the average estimation given by RTVM. We also use the air pollution dataset from the AQM stations, provided by the Department of Environment (DOE), Malaysia, and feature-engineered in our previous work [11,25,26].

B. Feature Engineering Algorithms
1) Feature Estimation: The feature engineering algorithm is proposed to estimate hourly traffic volume.
The algorithm performs two tasks: EstimationOfData and DataDistribution. EstimationOfData is performed by selecting the station, the peak hour, and the normal hour. Six peak hours (7.00 to 8.00, 8.00 to 9.00, 9.00 to 10.00, 10.00 to 11.00, 17.00 to 18.00, and 18.00 to 19.00) were selected, and their volumes were distributed randomly from the daily average traffic volume for six months. The remaining daily traffic volume was distributed randomly over the ten normal hours (11.00 to 12.00, 12.00 to 13.00, 13.00 to 14.00, 14.00 to 15.00, 15.00 to 16.00, 16.00 to 17.00, 19.00 to 20.00, 20.00 to 21.00, 21.00 to 22.00, and 22.00 to 23.00) for six months. In the notation of the algorithm, p is the peak hour, h is the hourly data given by RTVM, n is the normal hour, and d is the daily traffic volume. The volume of each vehicle type was obtained from the total daily traffic volume according to its percentage, where v is the vehicle type, d is the daily traffic volume, and c is the percentage of the vehicle type.
DataDistribution distributes the estimated traffic data obtained from EstimationOfData. Since the data are based on a six-month average, we create three years of data with hourly rows for sixteen hours, because the RTVM data are based on sixteen hours. We first distribute the peak-hour amount randomly over the six peak hours and then distribute the remaining amount randomly over the ten normal hours. Lastly, we distribute each amount among the vehicle types based on the percentages given in the RTVM data. The algorithm is presented in Fig. 4. The six peak hours were distributed randomly within a range bounded by minimum and maximum values, so that the values would not be identical. Similarly, the normal hours were distributed randomly within a range bounded by minimum and maximum values.
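The two tasks described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the algorithm of Fig. 4: the function names, the `peak_range` and `normal_range` bounds, the way the random weights are normalized, and all numbers are hypothetical. Only the six peak hours, the ten normal hours, and the percentage-based split by vehicle type are taken from the text.

```python
import random

# Start hour of each one-hour slot, as listed in the text.
PEAK_HOURS = [7, 8, 9, 10, 17, 18]
NORMAL_HOURS = [11, 12, 13, 14, 15, 16, 19, 20, 21, 22]


def estimate_hourly_volume(daily_volume, peak_hour_volume,
                           peak_range=(0.9, 1.1), normal_range=(0.8, 1.2),
                           rng=None):
    """Spread a 16-hour daily volume over six peak and ten normal hours.

    Assumption: each peak hour is drawn within a range around the single
    peak-hour value given by RTVM; the bounds are illustrative only.
    """
    rng = rng or random.Random()
    hourly = {}
    for hour in PEAK_HOURS:
        hourly[hour] = round(peak_hour_volume * rng.uniform(*peak_range))

    remaining = daily_volume - sum(hourly.values())
    if remaining < 0:
        raise ValueError("peak hours exceed the daily volume")

    # Assumption: normal hours share the remainder via random weights.
    weights = [rng.uniform(*normal_range) for _ in NORMAL_HOURS]
    total_weight = sum(weights)
    shares = [int(remaining * w / total_weight) for w in weights]
    shares[-1] += remaining - sum(shares)  # keep the daily total exact
    for hour, volume in zip(NORMAL_HOURS, shares):
        hourly[hour] = volume
    return hourly


def split_by_vehicle_type(volume, percentages):
    """Volume per vehicle type v, from volume d and percentage c."""
    return {v: volume * c / 100 for v, c in percentages.items()}


# Hypothetical inputs, not RTVM figures.
hourly = estimate_hourly_volume(60000, 5000, rng=random.Random(0))
by_type = split_by_vehicle_type(hourly[8], {"cars": 60.0, "motorcycles": 40.0})
```

Assigning the rounding remainder to the last normal hour keeps the sixteen hourly values consistent with the daily total, which is one simple way to preserve the published average.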
2) Feature Generation: In this study, queuing theory is applied to calculate the average vehicle speed and the time spent on the road. Queuing theory is the mathematical study of waiting time in a queue. A queuing system considers the arrival time, the number of servers, and the service time. Here, the arrival time is the arrival of motor vehicles on the particular road segment, the server is the installed camera that records the passing vehicles, and the service rate is the time after the vehicle leaves the point where the camera is installed. The queuing model structure is presented in Fig. 5, together with the notation used in the model.
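As a rough illustration of how a queuing model can yield these two features, the sketch below uses the standard single-server M/M/1 results, in which the mean time in the system is 1 / (mu - lambda) for arrival rate lambda and service rate mu. The choice of M/M/1, the function names, the segment length, and the rule that speed equals segment length divided by time in the system are all assumptions made here for illustration; they are not taken from the model in Fig. 5.

```python
def mm1_time_in_system(arrival_rate, service_rate):
    """Mean time a vehicle spends in an M/M/1 system (hours).

    arrival_rate and service_rate are in vehicles per hour.
    Assumption: a single server with Poisson arrivals and
    exponentially distributed service times.
    """
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)


def mm1_wait_in_queue(arrival_rate, service_rate):
    """Mean waiting time in the queue only, excluding service (hours)."""
    utilisation = arrival_rate / service_rate
    return utilisation / (service_rate - arrival_rate)


def average_speed(segment_length_km, arrival_rate, service_rate):
    """Hypothetical speed feature: segment length over time in system."""
    return segment_length_km / mm1_time_in_system(arrival_rate, service_rate)


# Hypothetical values, not RTVM data: 3000 vehicles/h arriving,
# 3600 vehicles/h served, on a 0.1 km segment.
time_spent_h = mm1_time_in_system(3000, 3600)
speed_kmh = average_speed(0.1, 3000, 3600)
```

With an estimated hourly volume as the arrival rate, each hourly row would receive its own time-spent and speed value, which is why the hourly estimation has to come first.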
Vehicle arrivals follow a Poisson distribution, and service times are exponentially distributed. The time spent (waiting time) on the road and the average speed of the motor vehicles are calculated from these quantities.

IV. RESULT

A. Feature Engineering Algorithms
Motor vehicles have a significant impact on traffic emission concentrations and human health; moreover, they cause accidents, traffic congestion, and fuel consumption. Automatic traffic management systems make it possible to record and monitor every vehicle passing on the road, and the recorded data are used primarily in transportation studies. In Malaysia, the RTVM provides annual average traffic data, giving an average hourly volume (for the peak hour only) and a daily traffic volume at specific stations, so the hourly traffic volume must be estimated. This study proposed feature engineering algorithms that efficiently estimate hourly traffic volume and generate features from the existing dataset for Malaysia's traffic census stations. The proposed algorithm estimated the hourly traffic volume for three years based on the yearly average provided by RTVM and distributed the volume among the vehicle types based on the percentages given in the data. Furthermore, queuing theory was used to generate the vehicles' average speed and time spent on the road. The output of the feature engineering algorithms (estimated and generated features) is shown in Fig. 6.

B. Prediction of Traffic Emission With and Without Traffic Dataset
Few studies generate and estimate features from the RTVM dataset. To justify the claim that feature engineering improves the performance of machine learning models, we used the Random Forest and Decision Tree machine learning algorithms to predict traffic emission concentrations using the estimated and generated features (traffic dataset). The additional air quality and meteorological variables from [11] were also used. The input and output variables are presented in Table VI. Mean Absolute Error (MAE), Mean Square Error (MSE), and Root Mean Square Error (RMSE) were used to evaluate the performance of the models. First, we predicted the levels of the CO, NO, NO2, and NOx pollutants using meteorological features but without traffic data. Then, we included the traffic dataset (estimated and generated features) for prediction. The results show that our feature engineering algorithms improve the accuracy of the machine learning models (Random Forest and Decision Tree) in predicting the level of traffic pollutants, except for NO2, which shows no improvement using the Random Forest model, as presented in Tables VII and VIII. We can also visualize the results in Figures 7, 8,
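For reference, the three evaluation metrics can be computed as in the short sketch below, which compares a prediction made without traffic features against one made with them. The observed and predicted values are invented for illustration and are not results from Tables VII and VIII; the function and variable names are likewise hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Square Error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical pollutant concentrations and two sets of predictions.
observed = [1.0, 2.0, 3.0, 4.0]
pred_without_traffic = [1.5, 2.5, 2.0, 5.0]
pred_with_traffic = [1.0, 2.5, 3.0, 4.5]

scores = {
    name: (mae(observed, pred), mse(observed, pred), rmse(observed, pred))
    for name, pred in [("without", pred_without_traffic),
                       ("with", pred_with_traffic)]
}
```

Lower values on all three metrics indicate a better fit, so the two phases are compared metric by metric for each pollutant and each model.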

V. DISCUSSION
Motor vehicles have become one of the primary concerns for the government and its agencies. Automobiles create many problems, such as accidents and the emission of pollutants into the atmosphere. The air pollution produced by motor vehicles has a significant impact on human health and the environment. Several studies have been conducted on the estimation and prediction of traffic emissions. The variability and increase of air pollution depend on traffic characteristics. One issue researchers face arises when traffic data are provided as annual average daily traffic (AADT). The RTVM provides an AADT dataset, and the data are incomplete, as shown in Fig. 1. We proposed feature engineering algorithms to estimate and generate the missing values in the RTVM dataset. Our algorithms were able to generate and estimate the missing features, as presented in Fig. 6. The algorithm is applied at one of the Kuala Lumpur traffic census stations, and it can be used for the other stations in Malaysia. Additionally, it can be used for any annual average daily traffic data that include average hourly data.
There are some limitations in this study. First, the hourly traffic volume is estimated because the recorded hourly traffic volume is insufficient, so the estimates may not match the exact hourly traffic volume. Second, the RTVM does not provide vehicle speed, so we calculate it on an average basis (the speed of the vehicle is assumed constant). Third, this study could not extract acceleration or deceleration. Fourth, some studies suggest that different fuel types produce different emissions, but fuel type is not considered in this study. Finally, only the Jalan Kepong traffic census station was studied; the remaining stations were not.

VI. CONCLUSION
Motor vehicles are the primary source of air pollution in metropolitan areas globally. Air pollution has a significant effect on human health, contributing to conditions such as asthma and cardiovascular and respiratory diseases. Motor vehicles also cause accidents and create congestion on road segments. For these reasons, the government and its agencies introduced automated traffic management systems to record the vehicles passing on the road. The recorded data have been used in various studies by researchers. Road Traffic Volume Malaysia provides incomplete traffic data. We proposed new feature engineering algorithms to overcome the issue of incomplete traffic data. The proposed algorithms estimated the hourly traffic volume and generated features for three years at Jalan Kepong, Kuala Lumpur, Malaysia. The algorithms were evaluated by predicting four traffic pollutants, CO, NO, NO2, and NOx, using Random Forest and Decision Tree models. The prediction was conducted in two phases: phase one is prediction without the traffic dataset (estimated and generated features), and phase two is prediction with the traffic dataset. The results show that our feature engineering algorithms improve the performance of the machine learning models, except for the prediction of NO2 using Random Forest, which shows the highest MAE, MSE, and RMSE when traffic data were included for prediction.