Fuel Consumption Prediction Model using Machine Learning

In this paper, we improve the accuracy of fuel consumption prediction with Machine Learning, with the goal of helping to minimize fuel consumption. This brings an economic benefit to businesses and satisfies the domain needs. We propose a machine learning model, based on the Support Vector Machine algorithm, to predict vehicle fuel consumption. Fuel consumption is estimated as a function of Mass Air Flow, Vehicle Speed, Revolutions Per Minute, and Throttle Position Sensor features. The proposed model is applied and tested on a vehicle's On-Board Diagnostics Dataset with 18 features. The model achieved an R-Squared value of 0.97, higher than other related work using the same Support Vector Machine regression algorithm. We conclude that the Support Vector Machine is effective for fuel consumption prediction. Our model can compete with other Machine Learning algorithms used for the same purpose, which gives manufacturers more choices for successful fuel consumption prediction models.
Keywords: Fuel consumption; machine learning; support vector machine; feature weight; feature selection; on-board diagnostic


I. INTRODUCTION
In this study, we are trying to enhance fuel consumption (FC) prediction using machine learning algorithms. We used a Support Vector Machine algorithm to predict fuel consumption. We measure fuel consumption based on a legacy Dataset containing On-Board Diagnostics (OBD) data. The aim is to achieve a good value for the R-Squared metric using the SVM.
OBD is the protocol responsible for scanning and reading the ECU in the vehicle. An OBD adapter can scan the ECU and send the FC data to a third-party device. OBD is considered part of the Internet of Things. It can be connected to remote datasets to store its data for important and urgent vehicle-related analysis based on Big Data, Deep Learning, and Machine Learning techniques. These analyses help with the instant diagnosis of vehicles and of other types of machines that use the same OBD protocol [1][2][3][4].
Fuel consumption is of essential interest to individuals, businesses, and the world at large. The price of fuel strongly influences the world economy. Therefore, changes in the price of fuel affect the economics of businesses.
Machine Learning is considered an application of Artificial Intelligence. Arthur Samuel defined Machine Learning as "the field of study that gives the computers the ability to learn without being explicitly programmed" [5].
One of the best-known Machine Learning algorithms is the Support Vector Machine (SVM). SVM predicts either a class label (classification) or a numeric value (regression) [6,7]. It has been used in several studies on fuel consumption prediction, which are closely related to our work.
We used SVM to build an ML model for fuel consumption prediction. Related research applied the SVM algorithm to predict FC from a small training dataset, and its results were poor: its model returned an R-Squared value of 0.004624. It relied only on the RPM_TPS-based equation, which is discussed later. In our research, we use both the RPM_TPS-based equation and the VS_MAF-based equation. No other literature has discussed the same problem with the SVM algorithm using both the RPM_TPS-based and VS_MAF-based equations. The RPM_TPS-based equation depends on the RPM and TPS parameters. The VS_MAF-based equation depends on the VS and MAF parameters. These two equations are considered the most important equations for measuring the fuel consumption rate when a complete FC Dataset exists. Our FC Dataset is considered a high-dimensional dataset.
Note that the proposed model and its experiments require an FC Dataset that contains the parameters used in the FC equations.
Before using SVM for the prediction of fuel consumption, Feature Weighting should be described. Feature Weighting is the ranking process of the importance of the features, as it depends on a voting approach for ranking the importance of the features in datasets [8].
Feature Weighting is followed by the Feature Selection step. Feature Selection is applied to the highly ranked features after the Feature Weighting step. Then, these highly ranked features are filtered and applied to the classifier [8].
Feature Selection can be applied to datasets using different algorithms. Random Forest and Decision Tree are the most famous algorithms used to rank the importance of the features and select the highly ranked features.
In the last decade, scholars talked about the importance of predicting the consumed fuel percentage depending on some of the sophisticated algorithms from both Data Mining (DM) and Machine Learning (ML). However, in an earlier time, scholars had discussed the prediction of fuel consumption with different algorithms, including Neural Networks (NN), Random Forest (RF), Gradient Boosting (GB), and Support Vector Machine (SVM) [9,10].
The prediction of fuel consumption value will become more precise when predicted with sophisticated ML techniques. The discussion of fuel consumption has been a trending topic when discussed from the view of ML in the last five years.
Many research papers discuss the most common methods for monitoring fuel consumption in vehicles. Fuel consumption scholars have focused on different methods for reducing fuel consumption.
In [11], the authors used ML models, namely Support Vector Machine (SVM) and Artificial Neural Network (ANN) models, to detect and measure levels of fuel consumption. They used 27 vehicles in their experiments and reported multiple attempts to improve accuracy across vehicles of the same age but of different segments, engine displacements, and transmission types. They achieved an accuracy of 83%.
In [12], the authors discussed the problem of predicting fuel consumption in fleets of vehicles using machine learning techniques. They used Random Forest, Gradient Boosting, and Neural Networks as machine learning models. Random Forest achieved the best result among the algorithms used. They relied on the Nash-Sutcliffe coefficient to measure the predictive power of each model. They also used Bias, Mean Absolute Error (MAE), and Root Mean-Squared Error (RMSE) as error statistics to evaluate their model's accuracy.
In [13], the authors had used a machine-learning algorithm to predict fuel consumption depending on a set of variables in a large-scale Dataset gathered by 153 drivers during a month depending on GPS and CAN (Controller Area Network) bus data, including speed of the vehicle and moved distance. They used regression methods for the machine learning methods: SVM, ANN, Linear Regression (LR), and Link Fuel Summation SVM model (LSSVM). Their study revealed that SVM had the best R-Squared value with 0.92 while ANN, LR, and LSSVM had R-Squared values of 0.86, 0.74, and 0.79. The training phase had affected the superiority of SVM over other models. However, SVM had generated the best fit results/accuracy. Also, it wasn't affected by cost functions as it provided a linear penalty to huge error rates where the ANN model minimizes the sum of squared errors.
In [14], the authors used the Boruta Algorithm (BA) and a Neural Network (NN) algorithm to measure fuel consumption for a large fleet of trucks on different road pavements. BA showed a good result in comparison with previous studies that used the same data, while the developed NN achieved an R² value of 0.88 on test data. NN appeared to be a suitable candidate for analyzing large datasets effectively and predicting the impact of the roughness and macrotexture of roads on truck fuel consumption.
In [15], the authors addressed the identification of driving styles. They used the K-means clustering algorithm to differentiate between driving styles, which are divided into three categories: normal, soft, and aggressive. They also used random forest, K-nearest neighbor, support vector machine, and neural network models. The overall accuracy of random forest was 95.39% for trucks under heavy load and 90.74% for trucks with no load. The aggressive driving style had the largest fuel consumption, 10% higher than the average driving style.
In [16], the authors had used Autonomie, which is a simulation tool, to simulate the process of fuel and vehicle power consumption. They proposed a Large-scale learning and prediction process (LSLPP) with machine learning models. LSLPP tests were successful as they could accelerate analysis processes and prediction of vehicle's fuel consumption.
In [17], the authors had used the Support Vector Machine (SVM) model as one of the ML prediction techniques with OBD-II to monitor and predict fuel consumption levels. The proposed model uses both TPS and RPM variables to measure the consumed level of fuel. Finally, their RMSE value was 2.43.
In [18], the authors used SVM, RF, and ANN algorithms for fuel consumption prediction. SVM and ANN achieved good results, but RF outperformed both of them. The coefficients of determination (R²) for SVM, RF, and ANN are 0.83, 0.87, and 0.85, respectively.

II. PROPOSED MODEL
The proposed model aims to predict fuel consumption using SVM. It consists of four phases: Data Preprocessing, Feature Weighting, Feature Selection, and the SVM Prediction Model, as shown in Fig. 1. The proposed prediction model has been applied to an FC Dataset with 8262 records. The Dataset includes 18 fields, as shown in Table I. The FC Dataset was gathered by 19 drivers using an OBD scanner in vehicles and was used in a previous dissertation on profiling automotive data in 2018 [19]. The data were collected with a vehicle model that is well known in Brazil, a 2015 Chevrolet S10 with a 2.5-liter flex-fuel engine rated at 206 hp. The Dataset was gathered on an urban road in the city of Natal (Brazil), over a distance of 18.8 kilometers and 34 minutes for each driver [20,21].

A. Pre-processing
Data pre-processing is the first step in the proposed prediction model. Converting the data into a more suitable form is essential to ensure that the Dataset is accurate and ready for further processing [22,23]. In our proposed model, pre-processing consists of two steps: filtering noisy data and handling missing values.
1) Filtering noisy data: In this important step, the noisy records in the FC Dataset are removed. For example, some cells contain symbols and characters in addition to the number, such as a Speed value stored as "50 km/h". Such characters and symbols affect the implementation and results of the prediction model.
2) Handling missing values: Most of the fields in our FC Dataset were filled with data. However, because the FC Dataset is high-dimensional, it was difficult to find the missing values by hand. We therefore automated this process to avoid exceptions while training the SVM algorithm, since missing values can cause serious problems for the prediction model. For example, the FC Dataset contains fields with NaN, Null, or Zero values, which negatively affect the R² value of the regression model and raise exceptions at runtime. There are several methods to handle missing values. One of them is mean imputation, which estimates each missing value by replacing it with the mean of that variable [24].
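The two pre-processing steps can be illustrated with a minimal sketch. It is not the code used in the experiments: the helper names `to_number` and `impute_mean` and the sample Speed column are hypothetical, and zeros are kept as valid readings here, which is an assumption.

```python
import math
import re

# Matches the first signed integer or decimal number inside a cell.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def to_number(cell):
    """Parse a raw cell into a float; return None when it is missing or unparseable."""
    if cell is None:
        return None
    if isinstance(cell, (int, float)):
        return None if math.isnan(cell) else float(cell)
    match = _NUMBER.search(str(cell))
    return float(match.group()) if match else None


def impute_mean(values):
    """Replace every None with the mean of the observed values (mean imputation)."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("column has no observed values to average")
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


# Hypothetical Speed column: unit suffixes, a null, and a textual NaN.
raw_speed = ["50 km/h", "60km/h", None, "NaN", 70]
clean_speed = impute_mean([to_number(cell) for cell in raw_speed])
print(clean_speed)  # [50.0, 60.0, 60.0, 60.0, 70.0]
```

Stripping the unit text first matters because the mean used for imputation must be computed only from cells that parse to a number.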

B. Feature Weighting
Feature weighting is an essential step in identifying the feature, or set of features, that most affects other specific features. In the proposed model, feature weighting is used to set weights for the FC Dataset features to identify which feature most affects the fuel consumption level. We used two models for weighting features in our Dataset: the Random Forest (RF) Algorithm and the Decision Tree (DT) Algorithm.
The Random Forest Algorithm is considered a good and reliable algorithm for feature ranking in both small and large datasets, because it can distinguish the relevant from the irrelevant attributes in the Dataset. It can handle both classification and regression problems by constructing multiple decision trees and combining their results, for example by averaging them. RF can handle high-dimensional datasets, as it can process many inputs and return results with high performance [25,26].
The Decision Tree Algorithm is also an important model for identifying the importance of the attributes in the Dataset, and it can be used for feature selection in higher- and lower-dimensional classification tasks. It describes the relations between data as a tree. A tree consists of nodes and leaves: each node has other nodes under it, and each leaf node holds a specific value that is meaningful to the algorithm. In classification, nodes represent a group to be classified, and each node subset represents a value that can be taken by the node [27,28].
Feature weighting is applied to all selected features of the FC Dataset to determine the most important features that affect fuel consumption. The feature weighting phase and the methodologies and algorithms used are discussed in a later section.
Feature weighting is applied to features in equations that are used for calculating fuel consumption. Fuel consumption can be calculated via two methods, the first is based on VS and MAF features, and the second is based on RPM and TPS features.
1) VS_MAF-based: VS_MAF is the first method used for calculating fuel consumption, according to (1).
Where f is fuel consumption value, VS is the vehicle speed parameter, which is measured in km/hour, and MAF refers to the value of Mass Air Flow in the engine, which is measured in g/s (gram per second).
Depending on (1), fuel consumption can be expressed in two metrics: Miles Per Gallon (MPG) and liters per 100 km (L/100 km).
To retrieve the fuel consumption value in US MPG, based on (2), the value of Speed is divided by MAF and then multiplied by the constant α = 7.718:
$$f_{\mathrm{MPG}} = \alpha \cdot \frac{VS}{MAF}, \qquad \alpha = 7.718 \tag{2}$$
Further, to retrieve the fuel consumption value in liters per 100 km, we multiply the fuel consumption value in US MPG by the value of the constant β [17].
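The MPG calculation described above can be sketched in a few lines. The function names are hypothetical. The value of β is not given in the text, so the conversion to L/100 km below uses the standard unit identity (L/100 km = 235.215 / US MPG) instead; this is an assumption and may differ from the constant used in [17].

```python
ALPHA = 7.718  # constant quoted in the text for US MPG (VS in km/h, MAF in g/s)
MPG_TO_L_PER_100KM = 235.215  # standard unit identity, NOT the paper's beta


def fuel_mpg(speed_kmh, maf_gs):
    """Fuel economy in US MPG from vehicle speed (km/h) and mass air flow (g/s)."""
    if maf_gs <= 0:
        raise ValueError("MAF must be positive")
    return ALPHA * speed_kmh / maf_gs


def mpg_to_l_per_100km(mpg):
    """Convert US MPG to liters per 100 km using the standard unit identity."""
    if mpg <= 0:
        raise ValueError("MPG must be positive")
    return MPG_TO_L_PER_100KM / mpg


mpg = fuel_mpg(speed_kmh=60.0, maf_gs=10.0)  # about 46.3 MPG
print(round(mpg, 3), round(mpg_to_l_per_100km(mpg), 3))
```

Guarding against a zero MAF reading is useful in practice, because such readings would otherwise cause a division by zero.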
2) RPM_TPS-based: RPM_TPS is the second method used for calculating fuel consumption, according to (3).
Our feature selection experiments were applied using both the VS_MAF-based equation and the RPM_TPS-based equation. The results from our Random Forest Algorithm indicate that RPM, SPEED, and MAF have the highest effects on fuel consumption levels. Results and analysis of applying RF to the FC Dataset are provided in more detail in the experiments, discussion, and results sections.

C. Feature Selection
After feature weighting, feature selection is applied to determine the highest-weighted features that affect the fuel consumption value. First, the weights generated for the FC features by the weighting models RF and DT are ranked; then the most important features are selected.
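The rank-then-select step can be sketched as follows. The function name `select_top_features` and the weights are illustrative placeholders only; they are not values from the FC Dataset.

```python
def select_top_features(weights, k):
    """Rank features by weight (highest first) and keep the k best names."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Placeholder weights for illustration; real weights come from the RF/DT models.
example_weights = {
    "THROTTLE_POS": 0.04,
    "SPEED": 0.35,
    "ENGINE_RPM": 0.20,
    "MAF": 0.41,
}

selected = select_top_features(example_weights, k=2)
print(selected)  # ['MAF', 'SPEED']
```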

D. The SVM Prediction Model
The proposed model is based on the Support Vector Machine (SVM). SVM is a machine learning algorithm that reads the input data and represents it as points in space, for example on the X and Y axes in a 2D view or on the X, Y, and Z axes in a 3D view. It then draws a boundary that splits the groups and classifies each point according to the side of the boundary on which it falls. The maximum-margin boundary that separates the classes is called the "hyperplane". The hyperplane is chosen to maximize the distance between itself and the nearest points of each group or class [18]. There are two types of hyperplane. The first is the optimal hyperplane, which is the linear function with the maximum margin between the vectors of two groups. The second is the soft margin hyperplane, which is used when the two classes of data are not linearly separable [6].

E. Experiments
In the experiment section, the details of the performed experiments are illustrated. Several experiments had been done using a historical FC Dataset. However, the proposed work for predicting fuel consumption using SVM with a regression model is considered the first experiment with this algorithm to be conducted specifically on this Dataset. The experiments are conducted using two equations. The first is based on MAF and VS features, and the second is based on RPM and TPS features.
The results of the experiments are evaluated using the coefficient of determination, R-Squared (R²), a statistical metric that represents the proportion of the variance in the dependent variable that is explained by the independent variables and thus evaluates the model's predictive ability [29].
After applying feature weighting and selection, we update both the VS_MAF-based equation and the RPM_TPS-based equation. These updates improve the R-Squared (R²) value of the proposed model compared with other studies.
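For reference, the coefficient of determination can be computed directly from its standard definition, R² = 1 − SS_res / SS_tot. The sketch below uses made-up actual and predicted values purely for illustration; the function name is hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when the actual values are constant")
    return 1.0 - ss_res / ss_tot


# Illustrative values only, not taken from the FC Dataset.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(actual, predicted), 2))  # 0.98
```

A value close to 1 means that the predictions account for almost all of the variation in the actual values, while a value near 0 means the model does little better than predicting the mean.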
1) Applying feature weighting: Feature Weight/importance is identified via different algorithms used to select the most important features in high-dimensional datasets. RF and DT are two important algorithms used to measure features and find the correlations in FC Dataset.
Random Forest (RF) is a machine learning algorithm commonly used both for prediction and for evaluating the importance of features.
Recursive Feature Elimination (RFE) was first used and proposed to enable the SVM model to evaluate the features/attributes importance and identify field ranking in datasets. The same methodology has been added to the RF algorithm to find the correlated features/fields in datasets with high dimensionality [30].
RF and DT algorithms are reliable enough to be considered for measuring the importance of the features in our FC Dataset. Using RF and DT for features weighting purposes during our observation leads to a better focus of the prediction purpose, after removing the unnecessary features from the experiment.
We imported both the RF and DT algorithms in the Spyder environment and ran them using the Python 3.9 programming language. Python has become an essential programming language for ML research. We used Python to print the weight results for the FC Dataset features. We drew figures using Matplotlib, a Python drawing and visualization library, to differentiate the features with high and low importance values [31].
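A minimal sketch of this kind of feature weighting is shown below, assuming scikit-learn is available. It is not the experiment's code: the data are synthetic (a target built from the MPG relation with α = 7.718 plus an irrelevant NOISE column), and all variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the FC Dataset (assumption, for illustration only).
rng = np.random.default_rng(0)
n = 500
speed = rng.uniform(10.0, 100.0, n)  # km/h
maf = rng.uniform(2.0, 40.0, n)      # g/s
noise = rng.normal(0.0, 1.0, n)      # feature unrelated to the target
X = np.column_stack([speed, maf, noise])
y = 7.718 * speed / maf              # target derived from SPEED and MAF only

names = ["SPEED", "MAF", "NOISE"]
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
dt = DecisionTreeRegressor(random_state=0).fit(X, y)

# feature_importances_ holds one non-negative weight per feature, summing to 1.
rf_weights = dict(zip(names, rf.feature_importances_))
dt_weights = dict(zip(names, dt.feature_importances_))

for label, weights in (("RF", rf_weights), ("DT", dt_weights)):
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    print(label, [(name, round(float(w), 4)) for name, w in ranked])
```

Because the weights of each model sum to 1, the RF and DT rankings can be compared side by side, as is done for the tables of feature importance in the experiments.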

a) Feature Selection experiment applied on 18 features using (RF):
We applied the Random Forest Algorithm to identify the importance (weight) score of the features in our Dataset. Table II shows the importance of our Dataset features, and Fig. 2 and Fig. 3 visualize it. After applying the feature selection algorithm according to the VS_MAF-based equation, the most important features affecting the fuel consumption level are MAF and SPEED, as Fig. 2 indicates. According to the RPM_TPS-based equation, ENGINE_RPM is the most important feature affecting fuel consumption among all features in the dataset, as Fig. 3 indicates.

b) Weighted VS_MAF-based equation:
According to the VS_MAF-based equation, fuel consumption calculation is based on MAF and VS features. Depending on RF and DT algorithms, Table III, Fig. 4, and Fig. 5 represent the feature importance results for both MAF and VS features.
Table III and Fig. 4 show that, using the RF algorithm, both the MAF and SPEED features affect the fuel consumption feature, with a feature weight of 0.50876 for MAF and 0.49124 for SPEED. Table III and Fig. 5 show that, using the DT algorithm, the feature weights are 0.50665 for MAF and 0.49335 for SPEED. This indicates that the importance values of the MAF and SPEED features change very little between the RF and DT algorithms.
These importance values also indicate that the MAF feature has a slightly greater impact on the fuel consumption value than the SPEED feature.
After calculating the feature weights for MAF and VS, the VS_MAF-based equation can be updated by multiplying each feature in the equation by its importance value from Table III.

c) Weighted RPM_TPS-based equation:
According to the RPM_TPS-based equation, fuel consumption calculation is based on the RPM and TPS features. Table IV, Fig. 6, and Fig. 7 give the feature importance results for both features using the RF and DT algorithms. Table IV and Fig. 6 indicate that, using the RF algorithm, the feature weight is 0.999952 for RPM and 0.000048 for TPS. Table IV and Fig. 7 indicate that, using the DT algorithm, the feature weight is 0.999969 for RPM and 0.000031 for TPS. The importance values of the RPM and TPS features therefore barely change between the RF and DT algorithms, and RPM is far more important than TPS.
As with the VS_MAF-based equation, the RPM_TPS-based equation calculates the fuel consumption rate using the following equation:
$$\mathrm{Fuel}(rpm, tps) = p_{00}x^{2} + p_{10}x + p_{01}xy \tag{5}$$
We can update (5) by multiplying RPM and TPS by their importance values, as generated by the RF algorithm in Table IV, to obtain:
$$\mathrm{Fuel}(rpm, tps) = p_{00}x^{2}\, rpm_i + p_{10}x\, rpm_i + p_{01}xy\, rpm_i\, tps_i \tag{6}$$
where $rpm_i = 0.999952$ and $tps_i = 0.000048$ according to the RF algorithm.
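The effect of the weighting in (6) can be sketched numerically. The sketch assumes that x and y stand for the RPM and TPS readings, which the weighting in (6) suggests but the text does not state. The coefficient values passed for p00, p10, and p01 are placeholders, since the fitted coefficients are not given here.

```python
RPM_WEIGHT = 0.999952  # RF importance of RPM reported in the text
TPS_WEIGHT = 0.000048  # RF importance of TPS reported in the text


def fuel_rpm_tps(x, y, p00, p10, p01):
    """Original RPM_TPS-based polynomial: p00*x^2 + p10*x + p01*x*y."""
    return p00 * x**2 + p10 * x + p01 * x * y


def fuel_rpm_tps_weighted(x, y, p00, p10, p01,
                          rpm_w=RPM_WEIGHT, tps_w=TPS_WEIGHT):
    """Weighted form: x-terms scaled by rpm_w, the x*y term by rpm_w*tps_w."""
    return (p00 * x**2 * rpm_w
            + p10 * x * rpm_w
            + p01 * x * y * rpm_w * tps_w)


# Placeholder coefficients and readings, for illustration only.
coeffs = dict(p00=1.0, p10=1.0, p01=1.0)
print(fuel_rpm_tps(2.0, 3.0, **coeffs))           # 12.0
print(fuel_rpm_tps_weighted(2.0, 3.0, **coeffs))  # slightly under 6.0
```

Because the TPS weight is close to zero, the weighted form almost removes the x·y term, while the RPM-only terms are nearly unchanged.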

2) Applying SVM on fuel consumption equations:
The SVM model is applied using both the original and the new-weighted fuel consumption equations, which calculate the fuel consumption values. We observed the difference in the R-Squared (R²) value for each experiment conducted. Table V shows a sample of the data from the VS_MAF-based experiment and the RPM_TPS-based experiment. Fig. 8 compares the actual and predicted fuel values for the VS_MAF-based equation when implemented using the SVM model, while Fig. 9 compares the actual and predicted fuel values for the RPM_TPS-based equation when implemented using the SVM model.
In Fig. 8, for the VS_MAF-based experiment, the actual fuel consumption data are largely similar to the predicted values. This is consistent with the model's R² value of 0.97, which indicates that the SVM model achieved high accuracy with the VS_MAF-based equation. Likewise, in Fig. 9, for the RPM_TPS-based experiment, the actual data are largely similar to the predicted fuel consumption values. This is consistent with the R² value of 0.96, which indicates that the SVM model also achieved high accuracy with the RPM_TPS-based equation. This result is better than that of [17], who obtained R² = 0.004624 using the same SVM predictor model with the RPM_TPS-based equation. Our goodness of fit measured by R² equals 0.96 both for their original RPM_TPS-based equation and for our new-weighted RPM_TPS-based equation, according to Table VI. Finally, the new-weighted VS_MAF-based equation yields an R² value of 0.97, while the new-weighted RPM_TPS-based equation yields an R² value of 0.96, according to Table VII.
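The general shape of such an experiment can be sketched with scikit-learn's `SVR`. This is a sketch under stated assumptions, not the configuration used in this study: the data are synthetic (generated from the MPG relation with α = 7.718 plus noise), and the RBF kernel, the values `C=100.0` and `epsilon=0.1`, and the 80/20 split are illustrative choices.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the VS_MAF-based experiment (assumption).
rng = np.random.default_rng(42)
n = 600
speed = rng.uniform(10.0, 100.0, n)  # km/h
maf = rng.uniform(5.0, 40.0, n)      # g/s
X = np.column_stack([speed, maf])
y = 7.718 * speed / maf + rng.normal(0.0, 0.5, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling matters for SVR because the RBF kernel is distance based.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0, epsilon=0.1))
model.fit(X_train, y_train)

r2 = r2_score(y_test, model.predict(X_test))
print(f"R-squared on held-out data: {r2:.3f}")
```

Evaluating R² on held-out rows, rather than on the training rows, gives a fairer picture of how the model would behave on unseen trips.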

IV. RESULTS
The R-Squared (R²) value is 0.96 for both the original VS_MAF-based equation and the original RPM_TPS-based equation. It is 0.97 for the new-weighted VS_MAF-based equation and 0.96 for the new-weighted RPM_TPS-based equation (Table VII). This study achieved better results than the work that applied the RPM_TPS-based equation using the SVM model [17], whose R² equals 0.004624.
Future research may investigate more correlated parameters in the FC Dataset in order to create more mathematical equations for measuring FC. The more correlated parameters are found, the more equations for calculating FC become available, and consequently the more experiments and observations can be carried out, leading to more accurate FC prediction models with ML. MAF, RPM, SPEED, ENGINE_LOAD, and ENGINE_RPM may have a greater correlation that can be used to create a new mathematical equation for measuring FC in our FC Dataset. Alternatively, a larger dataset can be used for FC prediction experiments. A further major enhancement would be to convert the proposed model into a running system that is integrated with Internet of Things components and devices in the vehicle and that predicts fuel consumption instantly at runtime with SVM.