Improving Imbalanced Data Classification in Auto Insurance by the Data Level Approaches

Predicting the frequency of insurance claims has become a significant challenge because insurance datasets are imbalanced: the number of claims that occur is usually far lower than the number that do not. As a result, classification models tend to have a limited ability to predict the occurrence of claims. In this paper, we use several data-level approaches to address the imbalanced data problem in the insurance industry. We developed 32 machine learning models for predicting the occurrence of insurance claims {(undersampling, oversampling, the combination of over- and undersampling (hybrid), and SMOTE) × (three decision tree models, three boosting models, and two bagging models) = 32}, and we compared the models' accuracies, sensitivities, and specificities to assess their prediction performance. The dataset contains 81628 car insurance claims, of which 5714 occurred and 75914 did not occur. According to the findings, the AdaBoost classifier with oversampling and with the hybrid method gave the most accurate predictions: a sensitivity of 92.94%, a specificity of 99.82%, and an accuracy of 99.4% with oversampling, and a sensitivity of 92.48%, a specificity of 99.63%, and an accuracy of 99.1% with the hybrid method. These results indicate that, when analyzing imbalanced data, the AdaBoost classifier with either oversampling or the hybrid method can generate more accurate models than the other boosting models, the decision tree models, and the bagging models.
Keywords: Machine learning; classification; insurance; imbalanced data problem; resampling methods


I. INTRODUCTION
The use of machine learning techniques and the transformation of the insurance market into a new level of digital applications are the insurance industry's current challenges. There are two types of insurance: life insurance and non-life insurance. Non-life insurance, specifically auto insurance, is the subject of this study.
A variety of variables influence automobile insurance pricing [1], and these factors affect the cost of a client's insurance policy. Credit history is one such factor; studies indicate that people with poor credit are more likely to file claims, commit fraud, or miss payments, putting the insurance company in a financial bind. Another factor is the client's location; studies have shown that densely populated areas with heavy traffic have a higher rate of accidents, resulting in a higher number of claims. This leads to a significant rise in the customer's insurance premium. However, it is unjust for a good client to pay more simply because of where they live. It also creates a problem for both parties: if the premium is raised, the client may be unable to afford it, and the insurance provider loses that client. This necessitates an appropriate method for evaluating the risk each client poses to insurers.
As a result, insurance rates should be adjusted based on a client's driving skill and other personal information, making car insurance more accessible to consumers. Insurance companies should set a customized premium for each customer, because this helps insurers adjust to any situation and manage any loss. It would be unreasonable to expect a client with a good driving record to pay the same premium as a client with a poor driving record. The model should therefore identify which clients are unlikely to file claims, so that their insurance costs can be lowered, and which clients are likely to file claims, so that their costs can be raised.
The imbalanced data problem is likely to arise in insurance data, since the number of claims that occur is usually significantly lower than the number that do not. One of the major problems with machine learning techniques is that they are affected by an unequal binary class distribution in the dataset. In other words, when the data is unbalanced, certain machine learning techniques will simply ignore the small class and allocate the majority of the unseen cases to the common class, resulting in high overall model accuracy but substantially reduced prediction performance for the small class. To solve this problem, we use resampling techniques, namely oversampling, undersampling, the combination of over- and undersampling (hybrid), and the synthetic minority oversampling technique (SMOTE), to improve classification performance on imbalanced data.
We used a large dataset provided by a large automotive company based in Egypt. In this study, we apply data-level approaches that could reduce the overfitting caused by data imbalance. We built 32 machine learning models for predicting the occurrence of auto insurance claims ((undersampling, oversampling, the hybrid of over- and undersampling, and SMOTE) × (three decision tree models, three boosting models, and two bagging models) = 32), and we compared the models' accuracies, sensitivities, and specificities to better understand their prediction performance.
The structure of this paper is as follows: Section II presents previous studies. Section III explains the data collection, machine learning models, and data-level approaches. Section IV compares the results of the thirty-two prediction models built. Section V presents the conclusions. Section VI presents future work.

II. RELATED WORK
Over the last decade, many researchers have used machine learning algorithms to forecast the occurrence of auto insurance claims. Although machine learning models are efficient at prediction, when the data is unbalanced they tend to ignore the small class and allocate the majority of cases to the common class, resulting in high overall model accuracy but substantially reduced performance on the small class. The following studies show a lack of resampling methods for solving the unbalanced data problem, except for [1], which used only the oversampling method.
In the study of machine learning approaches for auto insurance big data [1], the authors built eight classifiers to predict the occurrence of claims using big insurance data, including XGBoost, J48, RF, C5.0, CART, K-NN, logistic regression, and naïve Bayes algorithms, and they handled the heavily imbalanced data using oversampling. The RF model performed best among the eight models. In [2], two competing methods, XGBoost and logistic regression, were used to predict the frequency of motor insurance claims; the study shows that the XGBoost model is slightly better than logistic regression. Furthermore, a model for predicting insurance claims was developed in [3], where four classifiers were built, including XGBoost, J48, ANN, and naïve Bayes algorithms. The XGBoost model performed best among the four models. Another example of a similar and satisfactory solution to the same problem is the thesis "Research on Probability-based Learning Application on Car Insurance Data" [4], which used only a Bayesian network to classify a case as either a claim or no claim. The research in [5] also looks at data mining techniques for creating a predictive model for auto insurance claims; it compared three ML methods for predicting claims, and its findings showed that neural networks were the best predictor.
In summary, despite the relevance of the imbalanced data problem in the insurance industry, there is a lack of comprehensive comparison among the prominent resampling approaches as a strategy to deal with it. The purpose of this study is to investigate the impact of the unbalanced data problem on the performance of machine learning models. This paper solves the unbalanced data problem with several resampling approaches and compares them while utilizing various machine learning classifiers to fill in the gaps in the literature.
In comparison to prior studies, the novel contributions and key procedures of this study are the following:
• Applying and comparing several resampling algorithms, including the Random Over Sampler, Random Under Sampler, SMOTE, and the hybrid.
• Applying several machine learning models, such as three decision tree models, three boosting models, and two bagging models, to compare the performance of resampling methods.
• Measuring the effectiveness of implemented machine learning models utilizing various performance measures such as Accuracy, Sensitivity, and Specificity.
• Demonstrating how resampling affects the performance of classifiers.

A. Data Collection
Our dataset is a progressively maintained record that is updated over time to reflect each customer's current status, so the data provided is a snapshot taken on a particular date in which every record shows the customer's status as of that date. The dataset is updated when a customer's circumstances change, such as marital status or age. The sample auto insurance claims dataset was collected between 2014 and 2018.
The data used in this study is real-life data obtained from an Egyptian car insurance firm. The dataset contains 81628 claims, each of which is a car insurance claim. In total, 5714 claims occurred and 75914 did not, indicating that the data is heavily unbalanced. As mentioned above, the performance of classification algorithms is greatly affected by imbalanced data, so we apply four resampling techniques to address the imbalance. Each claim comprises 17 features. Table I lists the attributes of the data.

B. Data Preprocessing
Categorical variables are assigned numeric values. For example, for the gender of the insured, "Male" is coded as 1 and "Female" as 0. After this step, the data can be supplied to all of the machine learning models.
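The encoding step can be illustrated with a minimal sketch. It is written in Python for illustration only (the analyses in this study were run in R), and the function name encode_categorical, the sample records, and the marital status column are hypothetical; only the gender mapping (Male = 1, Female = 0) comes from the text.

```python
def encode_categorical(values, codes=None):
    """Replace each category label with an integer code.

    If no mapping is supplied, codes are assigned in order of first appearance.
    Returns the encoded list together with the mapping that was used.
    """
    if codes is None:
        codes = {}
        for value in values:
            codes.setdefault(value, len(codes))
    return [codes[value] for value in values], codes


# Mapping stated in the text: "Male" -> 1, "Female" -> 0.
gender_codes = {"Male": 1, "Female": 0}
genders = ["Male", "Female", "Female", "Male"]  # hypothetical records
encoded_gender, _ = encode_categorical(genders, gender_codes)
print(encoded_gender)

# Hypothetical column with no predefined mapping.
marital = ["single", "married", "single", "divorced"]
encoded_marital, marital_codes = encode_categorical(marital)
print(encoded_marital, marital_codes)
```

Integer codes of this kind impose an ordering on the categories, which tree-based models tolerate reasonably well; other model families may need a different encoding.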

C. Machine-Learning for Auto Insurance Claims Occurrence:
1) Decision trees: In more formal terms, a decision tree D is made up of two kinds of nodes:
• A leaf node that represents the response variable's given class/region.
• A decision node that defines a test on a single attribute (predictor variable) with one branch and subtree for each test outcome.
Using a recursive divide and conquer method, a decision tree can be used to classify an observation by beginning at the top decision node (called the root node) and going down through the other decision nodes until a leaf is encountered.
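The traversal described above can be sketched as follows. This is an illustrative Python sketch, not the implementation used in this study; the tree, the attribute names has_prior_claim and urban_area, and the class labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # class assigned to every observation that reaches this node


@dataclass
class Decision:
    attribute: str                   # single attribute tested at this node
    branches: Dict[object, "Node"]   # one subtree per test outcome


Node = Union[Leaf, Decision]


def classify(node, observation):
    """Start at the root and follow the matching branch until a leaf is reached."""
    while isinstance(node, Decision):
        outcome = observation[node.attribute]
        node = node.branches[outcome]
    return node.label


# Hypothetical two-level tree.
tree = Decision("has_prior_claim", {
    True: Leaf("claim"),
    False: Decision("urban_area", {
        True: Leaf("claim"),
        False: Leaf("no claim"),
    }),
})

print(classify(tree, {"has_prior_claim": False, "urban_area": True}))
```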
The following models are the most well-known methods for constructing decision trees [6,7,8].
2) Bagging trees: Bagging, or bootstrap aggregation, is an ensemble meta-algorithm. It increases the model's consistency and accuracy while also reducing overfitting. In classification, it aggregates the outputs of the individual models into a single output. Leo Breiman proposed bagging in 1996 [9] as a way to improve classification results.
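A minimal sketch of bootstrap aggregation is given here for illustration. It is written in Python and is not the bagging implementation used in this study; the base learner (a threshold halfway between the class means of a single numeric feature), the toy data, and all function names are assumptions.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_threshold(sample):
    """Toy base learner: split halfway between the two class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:  # the bootstrap sample contained a single class
        only = 1 if pos else 0
        return lambda x: only
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    cut = (mean_pos + mean_neg) / 2
    high = 1 if mean_pos >= mean_neg else 0  # label predicted above the cut
    return lambda x: high if x > cut else 1 - high


def bagging_fit(rows, n_models, seed):
    """Train one base learner per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_threshold(bootstrap_sample(rows, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


data = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]  # (feature, label)
models = bagging_fit(data, n_models=25, seed=0)
print(bagging_predict(models, 0.5), bagging_predict(models, 9.5))
```

An odd number of models is used so that a two-class vote cannot tie.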
The following are the two most common bagging algorithms:
3) Boosting: Boosting is an ensemble technique that creates several individual models sequentially. Each new model attempts to correct the errors of the previous group of models. Boosting, like bagging, can be used to improve any supervised machine learning algorithm. Boosting, on the other hand, is most effective when weak learners are used as submodels. As a result, boosting has historically been used on shallow decision trees. By shallow, we mean a decision tree with a limited number of levels of depth or a single split. The aim of boosting is to bring together a group of weak learners to form a strong ensemble learner [10,11,12].
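The sequential error-correcting idea can be sketched with a small AdaBoost-style example that uses single-split trees (stumps) as weak learners. This is an illustrative Python sketch under simplifying assumptions (one numeric feature, labels coded as -1/+1, toy data); it is not the AdaBoost implementation used in this study, and all names are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """Single-split weak learner: predict polarity above the threshold."""
    return polarity if x > threshold else -polarity


def fit_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best


def adaboost_fit(xs, ys, rounds):
    n = len(xs)
    weights = [1 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, t, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, polarity))
        # Misclassified points gain weight, so the next stump focuses on them.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]  # no single split separates these labels
ensemble = adaboost_fit(xs, ys, rounds=3)
print([adaboost_predict(ensemble, x) for x in xs])
```

On this toy data a single stump misclassifies at least two of the six points, while the weighted vote of three stumps classifies all six correctly.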
The following are the most common boosting algorithms:

D. A Data-level Approach and Imbalanced Data
The imbalanced data problem exists in many datasets; as a result, classifier models are biased against the minority class and are unable to predict it accurately [13]. In contrast, most machine learning models perform better when applied to balanced datasets [14,15,16,17].
Analysis of our dataset shows that it is extremely imbalanced: 93% (n=75914) of the auto insurance claims did not occur, and only 7% (n=5714) occurred. As a consequence, the imbalanced data problem must be addressed. Many techniques have been developed to resolve the problem of unbalanced data. One of the most successful approaches is sampling-based, using the Random Over Sampler [18], the Random Under Sampler [19], or SMOTE [20].
We use the ROSE package, whose ovun.sample function implements the more conventional remedies for class imbalance: oversampling the minority class, undersampling the majority class, or a combination of over- and undersampling. We also use the DMwR package to apply SMOTE as a resampling method.

1) Over-sampling technique: This technique increases the weight of the minority class. Oversampling is typically used more often than other methods.
a) Random over sampler: Random oversampling is a bootstrap-based technique that supports binary classification in the presence of unbalanced classes by generating synthetic examples from a conditional density estimate of the two classes [21]. It handles both continuous and categorical data by randomly replicating samples from the minority class [22]. As a result of this process, the dataset grows in size. The drawback is that a random oversampler generates no new samples, so the variety of samples remains constant. Because the sample size grows, oversampling takes longer to build a model, and it can cause overfitting because it duplicates samples from the minority class [23,24].
b) SMOTE: SMOTE is similar to random oversampling, but it does not replicate existing instances. Instead, it creates new instances by combining existing ones, which makes it possible to avoid overfitting to a certain degree. Specifically, SMOTE produces a new minority sample by combining a minority sample with one of its K nearest neighbours [25]. It is a statistical technique for increasing the number of minority samples in a dataset: the algorithm takes the features of a minority case and of its nearest neighbours, then produces new samples by combining the features of that case with those of its neighbours.
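The difference between the two oversampling techniques can be sketched in a few lines. This is an illustrative Python sketch, not the ROSE or DMwR code used in this study. It assumes numeric features only and Euclidean distance, and the function names and toy points are hypothetical.

```python
import random


def random_oversample(minority, target_size, rng):
    """Replicate randomly chosen minority rows until target_size rows exist."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra


def smote(minority, n_new, k, rng):
    """Create n_new synthetic points on segments joining minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # k nearest minority neighbours of the base point (squared distance).
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base))
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, mate)))
    return synthetic


rng = random.Random(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # toy minority rows

replicated = random_oversample(minority, target_size=10, rng=rng)
synthetic = smote(minority, n_new=20, k=2, rng=rng)
print(len(replicated), len(set(replicated)), len(synthetic))
```

Random oversampling only repeats the four original points, whereas the SMOTE-style points are new and lie between existing minority points.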
2) Random under sampler: Under-sampling is one of the simplest techniques for dealing with the problem of unbalanced data. It balances the majority and minority classes by randomly deleting examples from the majority class in the training dataset, which is referred to as random under-sampling [26]. By reducing the amount of data, under-sampling can save time when building a model, but it comes at the cost of losing information [27,28].
3) Hybrid methods: Over-sampling and under-sampling each have benefits and drawbacks. Combining the two strategies brings the strengths of both into a single method [26].
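Random under-sampling can be sketched as follows. This is an illustrative Python sketch rather than the ROSE code used in this study; the function name and the toy class sizes are assumptions.

```python
import random


def random_undersample(majority, target_size, rng):
    """Keep a random subset of the majority rows, drawn without replacement."""
    return rng.sample(majority, target_size)


rng = random.Random(1)
majority = list(range(93))       # toy majority rows
minority = list(range(93, 100))  # toy minority rows

kept = random_undersample(majority, len(minority), rng)
print(len(kept), len(minority))
```

The rows of the majority class that are not kept are discarded, which is the information loss mentioned above.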
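One way to combine the two strategies is sketched here: the majority class is shrunk without replacement and the minority class is grown with replacement until a chosen total size and class share are reached. This is an illustrative Python sketch under those assumptions, not the ROSE implementation used in this study; the function name, parameters, and toy data are hypothetical.

```python
import random


def hybrid_resample(majority, minority, total, minority_share, rng):
    """Under-sample the majority and over-sample the minority to fixed sizes."""
    n_min = round(total * minority_share)
    n_maj = total - n_min
    if n_maj > len(majority) or n_min < len(minority):
        raise ValueError("targets must shrink the majority and grow the minority")
    kept_majority = rng.sample(majority, n_maj)  # without replacement
    extra = [rng.choice(minority) for _ in range(n_min - len(minority))]
    return kept_majority, minority + extra       # original minority rows kept


rng = random.Random(2)
majority = [("no claim", i) for i in range(930)]  # toy rows, 93% of 1000
minority = [("claim", i) for i in range(70)]      # toy rows, 7% of 1000

new_majority, new_minority = hybrid_resample(
    majority, minority, total=1000, minority_share=0.5, rng=rng
)
print(len(new_majority), len(new_minority))
```

Compared with pure oversampling, fewer duplicates are needed; compared with pure undersampling, fewer majority rows are discarded.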

E. Development of Prediction Models and Prediction Performance Evaluation
This study built thirty-two prediction models ((undersampling, oversampling, the combination of over- and undersampling (hybrid), and SMOTE) × (three decision tree models, three boosting models, and two bagging models) = 32). The accuracy, sensitivity, and specificity of each model are used to compare the prediction performance of the established models.
This study randomly divided the data into a training dataset and a test dataset at a ratio of 7:3. The hyperparameters were tuned using 10-fold cross-validation performed only on the training dataset, to get the best performance for each machine learning model. The test dataset was used to evaluate the prediction performance. The data-level approaches must be applied only to the training set, while the test set remains unbalanced. We used R version 4.0.2 to conduct all analyses.
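The order of operations described above (split first, resample only the training part, cross-validate only on the training part) can be sketched as follows. This is an illustrative Python sketch and not the R code used in this study; the toy data (1000 rows with a 7% minority), the use of random oversampling as the resampling step, and all function names are assumptions.

```python
import random


def train_test_split(rows, train_fraction, rng):
    """Shuffle the rows and cut them into training and test parts."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def oversample_minority(rows, rng):
    """Duplicate random minority rows (label 1) until the classes are equal."""
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    if not minority:
        return list(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def k_folds(rows, k):
    """Yield (training, validation) pairs for k-fold cross-validation."""
    folds = [rows[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, validation


rng = random.Random(0)
rows = [(i, 1 if i < 70 else 0) for i in range(1000)]  # (row id, label)

train, test = train_test_split(rows, 0.7, rng)  # 7:3 split
balanced_train = oversample_minority(train, rng)  # the test set is left as is

for fold_train, fold_val in k_folds(train, 10):  # folds use training rows only
    pass  # a model would be tuned here

print(len(train), len(test), len(balanced_train))
```

Because the test rows are set aside before any resampling, the reported metrics reflect the original class distribution.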

F. Evaluation Methods
Evaluation methods are essential for comparing and selecting the best model because they assess the performance of classifiers [1]. Table II shows the evaluation methods used in this study, where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives.
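Since Table II is not reproduced in this text, the following sketch assumes the standard definitions: accuracy = (TP + TN) / (TP + FP + TN + FN), sensitivity = TP / (TP + FN), and specificity = TN / (TN + FP). It is an illustrative Python sketch, and the example classifier (one that labels every case as a non-occurred claim, applied to the class counts of the full dataset) is hypothetical.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # share of occurred claims that are detected
    specificity = tn / (tn + fp)  # share of non-occurred claims that are detected
    return accuracy, sensitivity, specificity


# Hypothetical classifier that predicts "non-occurred" for every case,
# evaluated on 5714 occurred and 75914 non-occurred claims.
accuracy, sensitivity, specificity = confusion_metrics(tp=0, fp=0, tn=75914, fn=5714)
print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
```

Such a classifier reaches an accuracy of about 93% while detecting no occurred claims at all, which is why sensitivity and specificity are reported alongside accuracy.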

IV. RESULTS AND DISCUSSION
To show the difference between the ability of the machine learning classifiers to predict the insurance claims occurrence before and after handling the unbalanced data problem, we compared all applied models on the unbalanced data and also on the balanced data created by different resampling techniques. We measure the performance of models on the testing data using accuracy, sensitivity, and specificity.

A. Comparing the Performance of the Built Machine Learning Models
Tables III, IV, and V show the accuracy, sensitivity, and specificity, respectively, of the thirty-two prediction models, as does Fig. 1.
Table III shows the accuracy of each machine learning technique on the unbalanced data and on the balanced datasets generated by the four resampling methods. Accuracy is a valuable metric only if the data is balanced; when the data is unbalanced, accuracy is of little value, because most machine learning techniques will simply ignore the small class and allocate most of the unseen cases to the majority class, resulting in high overall accuracy and high specificity while sensitivity is substantially reduced. The AdaBoost classifier with oversampling achieved an accuracy of 99.4%, the highest of all classifiers. The lowest accuracy, 74.84%, was obtained by C5.0 with under-sampling.
Table IV reports the sensitivity of the machine learning models. Sensitivity is the ability to predict the occurrence of claims. The sensitivity of all ML models on the unbalanced data is lower than on the balanced data created by the resampling methods, because occurred claims are the small class, at only 7% of our data. Before the unbalanced data problem is solved, machine learning techniques simply ignore the small class (occurred claims), which results in very low sensitivity on the unbalanced data. Sensitivity improves after the resampling methods are applied, which demonstrates their effectiveness in handling the unbalanced data problem in the insurance industry. The highest sensitivity, 92.94%, belongs to the AdaBoost classifier with oversampling, and the lowest, 0.46%, to the AdaBoost model on the unbalanced data.
Table V reports the specificity of the machine learning models. Specificity is the ability to predict non-occurred claims. The specificity of all models on the unbalanced data is higher than on the balanced data created by the resampling methods, because non-occurred claims are the majority class, at 93% of our data. Before the unbalanced data problem is solved, machine learning techniques allocate most unseen cases to the majority class (non-occurred claims), which results in very high specificity on the unbalanced data. However, our objective is to detect the minority class more accurately than the majority class, so we are more interested in sensitivity than in specificity. We therefore need to resample the dataset to force the algorithms to treat both classes with equal importance. The highest specificity, 99.93%, belongs to the AdaBoost classifier on the unbalanced data, and the lowest, 74.65%, to the C5.0 model with under-sampling.
Finally, from Tables III, IV, and V and Fig. 1, we can conclude that resampling methods are very effective for handling the unbalanced data problem in the insurance industry, because the best results are achieved after the data-level approaches are applied.
The best models are AdaBoost with the oversampling and hybrid methods, followed by the C5.0 model with the hybrid method and then the random forest model with the hybrid method. AdaBoost achieved a sensitivity of 92.94%, a specificity of 99.82%, and an accuracy of 99.4% with oversampling, and a sensitivity of 92.48%, a specificity of 99.63%, and an accuracy of 99.19% with the hybrid method. The C5.0 model with the hybrid method has a sensitivity of 88.78%, a specificity of 98.79%, and an accuracy of 98.22%. The random forest model with the hybrid method has a sensitivity of 84.28%, a specificity of 98.96%, and an accuracy of 98.06%.

V. CONCLUSION
This study built models for improving the classification of imbalanced data by using oversampling, under-sampling, the combination of over- and under-sampling (hybrid), and SMOTE as resampling approaches. In total, thirty-two models were built for predicting the occurrence of auto insurance claims ((under-sampling, oversampling, the combination of over- and under-sampling (hybrid), and SMOTE) × (three decision tree models, three boosting models, and two bagging models) = 32). According to the findings of this analysis, the AdaBoost model with the oversampling and hybrid methods generated more accurate models than the other boosting models, the decision tree models, and the bagging models, followed by the C5.0 model with the hybrid method and then the random forest model with the hybrid method.

VI. FUTURE WORK
Further research is required to compare accuracy on datasets from various fields in order to confirm the prediction performance of the AdaBoost classifier with resampling methods for the imbalanced data problem. Future work may also proceed in the following directions: using hybrid machine learning classifiers to improve performance, and using different feature selection approaches to enhance model results and gain a deeper understanding of the important features.