Comparing SMOTE Family Techniques in Predicting Insurance Premium Defaulting using Machine Learning Models

Abstract: Defaults on premium payments significantly reduce an insurance company's profitability, so predicting defaults in advance is very important for insurers. Thanks to technological advances, prediction in the insurance sector has become one of the most beneficial and important areas of study. However, datasets in this industry are imbalanced, which makes predicting insurance premium defaulting a difficult task. Moreover, no previous study in this industry applies and compares different SMOTE family approaches to address the issue of imbalanced data. This study therefore compares different SMOTE family approaches for solving the problem of unbalanced data: Synthetic Minority Oversampling Technique (SMOTE), Safe-level SMOTE (SLS), Relocating Safe-level SMOTE (RSLS), Density-based SMOTE (DBSMOTE), Borderline-SMOTE (BLSMOTE), Adaptive Synthetic Sampling (ADASYN), Adaptive Neighbor Synthetic (ANS), SMOTE-Tomek, and SMOTE-ENN. A variety of machine learning (ML) classifiers is applied to assess how well the SMOTE family addresses the imbalance problem. These classifiers include Logistic Regression (LR), CART, C4.5, C5.0, Support Vector Machine (SVM), Random Forest (RF), Bagged CART (BC), AdaBoost (ADA), Stochastic Gradient Boosting (SGB), XGBoost (XGB), Naïve Bayes (NB), k-Nearest Neighbors (K-NN), and Neural Networks (NN). The models are validated with random hold-out. The findings obtained using various assessment measures show that ML algorithms do not perform well with imbalanced data, indicating that the problem of imbalanced data must be addressed. In contrast, using balanced datasets created by SMOTE family techniques improves the performance of the classifiers. Moreover, the Friedman test, a statistical significance test, further confirms that the hybrid SMOTE family methods are better than the others, especially SMOTE-TOMEK, which performs better than the other resampling approaches. Among the ML algorithms, the SVM model produced the best results with SMOTE-TOMEK.


I. INTRODUCTION
In the era of the industrial revolution, all businesses seek digital transformation. One of the key elements of digital transformation is the ability to manage data. Data science and business analytics are the tools used to extract hidden insights from data. Because the amount of data is increasing exponentially, the systematic process of data science has been gaining popularity in recent times. The insurance industry is no exception; in fact, it is one of the key areas where data science is practiced at a large scale. Many insurance companies now employ ML techniques, which provide a more systematic way of obtaining accurate and representative outcomes than the traditional statistical approach.
One of the main challenges with ML approaches in classification is that they are influenced by the data set's unequal class distribution. In other words, when the data are uneven, many ML algorithms may simply disregard the minority class and assign the majority of the cases to the common class, resulting in high overall model accuracy while the prediction models' efficiency for the minority class is drastically diminished. Thus, this study applies a variety of SMOTE family techniques to deal with the imbalanced data problem and improve the performance of ML models in predicting the minority class. In our study, we develop 117 ML models for predicting insurance premium defaulting (9 SMOTE family methods × 13 ML models = 117 models).
The structure of this paper is as follows: Section II presents the previous studies. Section III explains the methodology, including data collection, data preparation, and the imbalanced data problem. Section IV explains model training and parameter optimization. Section V presents the evaluation methods. Section VI shows the results. Section VII shows the results of the statistical tests. Sections VIII and IX present the conclusion and the future work, respectively.

II. RELATED WORK
In [1], several data-level methodologies were employed to address the unbalanced data issue when predicting the occurrence of insurance claims. The AdaBoost model with oversampling and the hybrid technique produced the most accurate results. In [2], big insurance data were used to build eight ML algorithms for predicting the occurrence of claims, and the highly imbalanced data were handled with an over-sampling technique; the random forest classifier outperformed the other algorithms. Furthermore, [3] constructed a model for forecasting insurance claims using four classifiers, with the XGBoost model outperforming the others. The study in [4] predicted the frequency of vehicle insurance claims using two competing approaches, logistic regression and XGBoost, and found that the XGBoost model outperforms logistic regression. The study in [5] investigated data mining approaches for developing a predictive classifier for vehicle insurance claims and found that neural networks were the best predictor. The study in [6] aimed to give insurance companies an accurate way to forecast whether a customer's relationship with the company will be renewed; random forests were the top-performing algorithm. The study in [7] starts with data enrichment and works its way up to model development to predict customer churn, applying class weights to the prediction model because of the imbalance of the samples. Finally, [8] compares the results of different machine learning techniques for churn prediction; according to its results, Random Forest and ADA outperform all other methods.
www.ijacsa.thesai.org
The study of [9] shows that, after resampling techniques are used to solve the imbalanced data problem, the efficiency of all ML classifiers in predicting auto insurance fraud is enhanced. In addition, the stochastic gradient boosting classifier combined with the SMOTE-ENN resampling technique obtained the best result among all the models. The authors of [10] created a new approach for improving the accuracy of fraud prediction; to solve the unbalanced data problem, they re-balanced the data with the "Resample" method of Weka before training and testing. According to that study, Random Forest outperforms all other algorithms in terms of fraud prediction. The study in [11] predicts fraudulent claims and estimates insurance premium amounts for a range of customers depending on their personal and financial data; the results showed that Random Forest outperforms the other two algorithms on the insurance claim dataset. To deal with the unbalanced data distribution, [12] proposes a novel insurance fraud detection technique that constructs fraud detection models on data partitions derived from undersampling. The results show that DT outperforms the other algorithms.
To highlight the importance of our study and the gap it fills, Table I summarizes recent research that applies ML models to classification in the insurance industry. Table I demonstrates that there is an absence of application and detailed comparison of the common SMOTE family approaches for handling unbalanced data in the insurance industry. This research aims to examine the impact of SMOTE family techniques on the performance of machine learning models in the insurance industry. To fill the gaps in the previous studies, we therefore applied numerous SMOTE family approaches to the imbalanced data problem. Compared with earlier studies, our study's original contributions and key procedures are the following:
• Using feature scaling to standardize different data features.
• Implementing and comparing nine different SMOTE family techniques.
• Applying hold-out, a widely used validation procedure, to validate the models.
• Comparing the efficiency of the SMOTE family techniques using 13 different ML algorithms.
• Using various evaluation measures, such as accuracy, sensitivity, specificity, and AUC, to assess the performance of the developed models.
• Showing how the various SMOTE family strategies affect the performance of classifiers.
• Using the Friedman test to analyze the differences among the SMOTE family approaches and to indicate the best method.

III. METHODOLOGY
This study compares various SMOTE family approaches for handling the imbalanced data problem in order to discover the optimal methodology and classifier for forecasting insurance premium defaulting. The following methodology steps are used to attain the objectives of this paper:
• Data gathering.
• Data preparation.
• Implementing SMOTE family techniques to solve the issue of imbalanced data.
• Applying ML classification algorithms.
• Analyzing the outcomes.

A. Data Collection
This research used a dataset from an insurance company in Egypt, covering the years 2014 to 2020. The dataset contains a number of variables that can influence insurance premium defaulting. It includes information on 93520 clients with ten features: four categorical variables (area type, Accommodation, Marital status, Default or Not) and six continuous and discrete variables (Age Income, Number of Vehicles owned, number of Late payments, number of premiums paid, Premium amount, number of dependents of the insured), with no missing values or columns.

B. Data Preparation
One of the most crucial stages in ML is data preparation. This procedure turns raw data into an understandable format and eliminates errors that may exist in the dataset, making the data easier to manage [2]. The data preprocessing in this study can be summarized in the following two steps.
1) Feature scaling: Feature scaling is a method of normalizing the range of independent variables in a dataset. Many ML algorithms rely on the Euclidean distance between two data points; hence, without feature scaling, these algorithms may not perform properly [13]. In this study, the value ranges of most variables in our dataset differ, so we apply standardization as the feature scaling method to rescale the variables. As a consequence, every variable has a mean of zero and a standard deviation of one, the same location and scale as a standard normal distribution.
The data were scaled using the following formula:
$$z = \frac{x - \mu}{\sigma}$$
where $x$ is the original value, $\mu$ is the mean, and $\sigma$ is the standard deviation of the variable.
2) One-hot encoding for categorical features: In machine learning, one-hot encoding is the process of converting categorical data into a format that can be fed into ML algorithms, because most ML models work only with numerical inputs.
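The two preparation steps can be illustrated with a minimal sketch in plain Python. The column names and values below are hypothetical toy data, not records from the study's dataset, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a numeric column to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode a categorical column as 0/1 indicator vectors.

    Returns the sorted category names and one indicator row per input value.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical toy columns for illustration only.
income = [2000.0, 4000.0, 6000.0]
area_type = ["urban", "rural", "urban"]

scaled_income = standardize(income)
categories, encoded_area = one_hot(area_type)
```

After this step every numeric column is on a comparable scale and every categorical column is numeric, so distance-based methods such as k-NN and the SMOTE family can operate on the rows directly.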

C. Imbalanced Data Problem
Most ML classification algorithms operate best when each class has roughly the same number of instances. With unbalanced data, the majority class dominates the minority class; consequently, algorithms are biased toward the majority class, and their performance becomes unreliable [1,14,15]. Our dataset is severely unbalanced: the two categories of insurance premium defaulting are far from equal in size, with the non-defaulted class making up 90% of the observations and the defaulted class only 10%. Several techniques have been proposed to address the issue of imbalanced data, and the SMOTE family is one of the most effective. The SMOTE family is a collection of oversampling techniques that evolved from SMOTE.

1) Synthetic Minority Oversampling Technique (SMOTE): SMOTE is a statistical strategy that generates new instances to increase the number of minority samples in the dataset. For each minority-class sample, it finds that sample's nearest neighbours in feature space and generates new samples that blend the features of the sample with the features of its neighbours. The new cases are therefore not exact replicas of existing minority cases [16].
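The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation used in the study: base points are drawn at random rather than visited in turn, and the function names, the default `k`, and the toy points are all hypothetical.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point`, excluding `point` itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by linear interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbour = rng.choice(k_nearest(base, minority, k))
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority points in a two-feature space.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority_points, n_new=10, k=2, seed=1)
```

Every synthetic point lies on a segment between two existing minority points, which is why the new cases resemble, but do not duplicate, the originals.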
2) Adaptive Synthetic Sampling (ADASYN): ADASYN's core concept is to apply a weighted distribution to the minority class instances according to how difficult they are to learn, so that more artificial instances are generated for the minority instances that are harder to learn than for those that are simpler to learn. Consequently, this technique improves learning by reducing the bias introduced by the class imbalance and by adaptively shifting the classification decision boundary toward the difficult instances [17].
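The weighting idea can be made concrete with a short sketch. It assumes the usual reading of "harder to learn": a minority instance is harder when more of its k nearest neighbours belong to the majority class. The function name, the default `k`, and the toy points are hypothetical, and only the allocation step is shown, not the generation of the synthetic points.

```python
import math


def adasyn_allocation(minority, majority, n_new, k=5):
    """Return how many synthetic points to create around each minority point.

    The share given to a point is proportional to the fraction of majority
    instances among its k nearest neighbours.
    """
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    ratios = []
    for p in minority:
        neighbours = sorted(
            (q for q in labelled if q[0] is not p),
            key=lambda q: math.dist(p, q[0]),
        )[:k]
        n_major = sum(1 for _, label in neighbours if label == 0)
        ratios.append(n_major / len(neighbours) if neighbours else 0.0)
    total = sum(ratios)
    if total == 0:
        # No minority point has majority neighbours: nothing is "hard".
        return [0] * len(minority)
    # Rounding means the counts may not add up exactly to n_new.
    return [round(r / total * n_new) for r in ratios]


# Hypothetical toy data: two "easy" minority points and one "hard" one.
toy_minority = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
toy_majority = [(10.0, 11.0), (11.0, 10.0), (20.0, 20.0)]
allocation = adasyn_allocation(toy_minority, toy_majority, n_new=10, k=2)
```

In this toy example the isolated minority point surrounded by majority instances receives the whole budget, while the two points inside a pure minority region receive none.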
3) Borderline-SMOTE (BLSMOTE): BLSMOTE is a minority over-sampling technique based on SMOTE that over-samples only the minority examples near the borderline. The number of majority neighbours of each minority instance is used to split the minority instances into three groups: SAFE, DANGER, and NOISE. Only the DANGER instances are used to generate synthetic instances [18].
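The grouping step can be sketched as follows. The thresholds are an assumption based on the commonly cited rule: all neighbours in the majority class means NOISE, at least half means DANGER, and fewer than half means SAFE. The function name, the default `m`, and the toy points are hypothetical.

```python
import math


def borderline_groups(minority, majority, m=5):
    """Label each minority point as SAFE, DANGER, or NOISE."""
    labelled = [(p, "min") for p in minority] + [(p, "maj") for p in majority]
    groups = {}
    for p in minority:
        neighbours = sorted(
            (q for q in labelled if q[0] is not p),
            key=lambda q: math.dist(p, q[0]),
        )[:m]
        n_major = sum(1 for _, label in neighbours if label == "maj")
        if n_major == len(neighbours):
            groups[p] = "NOISE"    # surrounded only by majority points
        elif n_major >= len(neighbours) / 2:
            groups[p] = "DANGER"   # on the borderline between the classes
        else:
            groups[p] = "SAFE"     # inside a minority region
    return groups


# Hypothetical toy data: a minority cluster, one borderline point, one outlier.
bl_minority = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (5.0, 0.0), (10.0, 10.0)]
bl_majority = [(6.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
bl_groups = borderline_groups(bl_minority, bl_majority, m=2)
```

Only the points labelled DANGER would then be passed to a SMOTE-style interpolation step.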

4) Density-based SMOTE (DBSMOTE): DBSMOTE is an over-sampling approach based on density-based clustering; it is designed to over-sample an arbitrarily shaped cluster discovered by DBSCAN. DBSMOTE creates synthetic instances along the shortest path from each positive instance to the pseudo-centroid of a minority-class cluster. As a result, the synthetic dataset is dense around the core of each group of original positive cases [19].

5) Adaptive Neighbor Synthetic (ANS): One of SMOTE's drawbacks is that the number of nearest neighbours is a critical parameter that must be specified to synthesize instances. The Adaptive Neighbor Synthetic Minority Oversampling Technique (ANS) is an adaptive technique that avoids this drawback by dynamically adapting the number of neighbours used for oversampling around different minority regions [20].
6) Safe-level SMOTE (SLS): SMOTE synthesizes minority instances at random along a line connecting a minority instance and its chosen nearest neighbours, disregarding the surrounding majority instances. SLS instead carefully samples minority instances along the same line with different weight degrees, referred to as safe levels. The safe level is calculated from the number of minority instances among the nearest neighbours [21].
7) Relocating Safe-level SMOTE (RSLS): SLS creates synthetic minority instances in the vicinity of original instances while avoiding nearby majority instances. However, some synthetic instances can still end up close to majority instances, which may confuse some classifiers. Furthermore, SLS generates synthetic instances without using minority outcast instances; thus, some valuable information about the minority class may be lost. RSLS addresses these two flaws in SLS by merging two methods. The first is to check the synthetic instances and move them away from any nearby majority instances. The second is to use the 1-nearest neighbour strategy to deal with minority outcasts [22].
8) Hybrid techniques: The SMOTE family methods are over-sampling methods with their own sets of benefits and drawbacks. Combining over-sampling methods with under-sampling can help reap the benefits of both.
a) SMOTE-ENN: The SMOTE-ENN technique is one of the best-known techniques for improving outcomes. It combines SMOTE, an over-sampling technique, with Edited Nearest Neighbors (ENN), an under-sampling technique [23].
b) SMOTE-Tomek: The SMOTE-Tomek technique combines SMOTE, an over-sampling technique, with Tomek links, an under-sampling technique, to improve outcomes [23].
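The cleaning half of SMOTE-Tomek can be sketched in plain Python. A Tomek link is a pair of points from different classes that are each other's nearest neighbour. The sketch below assumes the variant that removes both members of every link, and it would be applied after an over-sampling step; the function names and toy data are hypothetical.

```python
import math


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links."""
    def nearest(i):
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    nn = [nearest(i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_tomek_links(points, labels):
    """Drop both members of every Tomek link (one common variant)."""
    drop = {idx for pair in tomek_links(points, labels) for idx in pair}
    kept = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: points 0 and 1 are close and of opposite classes.
tl_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 6.0), (9.0, 9.0)]
tl_labels = [0, 1, 0, 0, 1]
links = tomek_links(tl_points, tl_labels)
clean_points, clean_labels = remove_tomek_links(tl_points, tl_labels)
```

Removing such pairs clears overlapping examples near the class boundary, which is the under-sampling contribution of the hybrid method.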

IV. MODEL TRAINING AND PARAMETER OPTIMIZATION
A. Model Validation
Using a validation technique, the data were divided into training and testing subsets. Validating on held-out data helps detect overfitting and underfitting in machine learning models. This study used random hold-out, a simple and popular form of cross-validation.
Fig. 2 shows the hold-out scheme: the data are randomly split into a training set and a test set.
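A random hold-out split can be sketched as below. The 30% test fraction and the seed are assumptions made for illustration; the text does not state the proportions used in the study.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=42):
    """Randomly split rows into a training list and a test list."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


# Hypothetical dataset of 100 row identifiers.
all_rows = list(range(100))
train_rows, test_rows = holdout_split(all_rows, test_fraction=0.3)
```

Fixing the seed makes the split reproducible, so every classifier and resampling method can be compared on the same partition.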

B. Overfitting and Underfitting
In the case of underfitting, a model's training and validation scores are both low. Overfitting, in comparison, is characterized by high training scores combined with low validation scores. Model parameters must be optimized to avoid both situations. Grid search, a popular tuning technique, was used to optimize the parameters of the models. Table II shows the best values for the model parameters.

V. EVALUATION METHODS
The evaluation methods employed in this study are shown in Table III, where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives.

More specifically:
1) TP: the total number of clients correctly assigned to the default class.
2) FP: the total number of clients incorrectly assigned to the default class.
3) TN: the total number of clients correctly assigned to the non-default class.
4) FN: the total number of clients incorrectly assigned to the non-default class.
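The four counts translate into the evaluation measures as sketched below. The formulas are the standard definitions of accuracy, sensitivity, and specificity and are assumed to match Table III; the example counts are hypothetical and chosen to mimic a 90%/10% class split.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, and specificity from confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Sensitivity: share of actual defaulters that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Specificity: share of actual non-defaulters that were detected.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical classifier on 100 clients (10 defaulters) that labels
# almost everyone as non-default.
biased = classification_metrics(tp=1, fp=0, tn=90, fn=9)
```

In this example the accuracy is high even though only one of the ten defaulters is found, which illustrates why accuracy alone is misleading on imbalanced data.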
Besides the evaluation methods in Table III, we also used the AUC (area under the ROC curve), a general-purpose quality metric for classification models. An AUC of 1 indicates a perfect model, whereas an AUC of 0.5 indicates a random model.
Analyzing and comparing the performance of the classifiers is an important procedure. Although evaluation measures are straightforward to use, the results obtained from them may be misleading. As a result, determining the optimal model or technique from these measures alone is a difficult task. This problem is addressed using statistical significance tests [24]. A common statistical test for determining differences between the means of two or more related samples is the ANOVA test. The ANOVA's null hypothesis is that all resampling procedures are equivalent and that the observed differences are due to chance [25]. Three assumptions must be checked before applying the ANOVA test.
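The meaning of the AUC can be illustrated with its pairwise interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below is quadratic in the number of cases and is meant for illustration only; the scores are hypothetical.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical predicted default probabilities for four clients.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because the AUC depends only on how the cases are ranked, it is not inflated by a large majority class in the way accuracy is.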

1) All samples must follow the normal distribution.
2) The sample cases should be independent of one another.
3) There should be roughly equal variance among the methods (SMOTE family methods).
The Anderson-Darling normality test [25] is used in this study to determine whether the data are normally distributed. The null hypothesis of the Anderson-Darling test is that the data follow a normal distribution. We do not reject this null hypothesis if the p-value of the test is more than 0.05; otherwise, if the p-value ≤ 0.05, we reject it.
If one of the ANOVA's assumptions is violated, the Friedman test [26] is used instead of the ANOVA test to investigate differences among the methods. The Friedman test's null hypothesis is that all SMOTE family methods perform the same. We do not reject the null hypothesis if the p-value of the test is more than 0.05; otherwise, if the p-value ≤ 0.05, we reject it. Rejecting the null hypothesis means that at least one of the SMOTE family strategies performs differently from the others. For each SMOTE family approach, the accuracy, sensitivity, specificity, and AUC values are used to compare the ability of the different resampling techniques to tackle the problem of unbalanced data.
The Friedman test ranks the SMOTE family techniques separately for each classifier and then examines the rank values [27].
As a result, for each SMOTE family technique, the Friedman test generates a sum of ranks, which aids in determining which SMOTE family method is the most effective among the others.
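The ranking procedure can be sketched as follows, treating each classifier as a block and each resampling method as a treatment. The statistic shown is the standard Friedman chi-square without tie correction; the score matrix is hypothetical and far smaller than the 13 × 9 design of the study. For p-values, `scipy.stats.friedmanchisquare` could be used instead of this hand-written version.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def friedman(scores):
    """scores[b][t]: metric of classifier b under resampling method t.

    Returns the rank sum of each method and the Friedman statistic.
    """
    n = len(scores)      # number of classifiers (blocks)
    k = len(scores[0])   # number of resampling methods (treatments)
    rank_sums = [0.0] * k
    for row in scores:
        for t, r in enumerate(average_ranks(row)):
            rank_sums[t] += r
    statistic = (
        12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
        - 3 * n * (k + 1)
    )
    return rank_sums, statistic


# Hypothetical AUC values: 3 classifiers (rows) x 3 resampling methods.
toy_scores = [
    [0.70, 0.80, 0.90],
    [0.60, 0.75, 0.85],
    [0.65, 0.70, 0.95],
]
rank_sums, statistic = friedman(toy_scores)
```

With higher scores receiving higher ranks, the method with the largest rank sum is the one that most consistently performs best across classifiers.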

VI. RESULTS
Table IV shows the accuracy, sensitivity, specificity, and AUC of each ML classifier on the imbalanced dataset and on the balanced datasets generated by the SMOTE family methods. Using several assessment measures gives a better understanding of the models' performance. The most important outcome from Table IV is the substantial discrepancy between specificity and sensitivity on the unbalanced data. In the imbalanced-dataset column of Table IV, all accuracy results are greater than 90%, all sensitivity values are less than 18%, and all specificity results are greater than 96%, indicating that all classifiers are biased toward the majority class. This problem must be addressed because it leads to misleading results. After the SMOTE family techniques are used to balance the data, the ML models' ability to predict the minority class improves significantly. For example, with imbalanced data the SVM achieved a sensitivity of 3.2%, which increased to 83.84% with the SMOTE-TOMEK technique.

VII. RESULTS OF STATISTICAL TESTS
The ML algorithms perform differently on the balanced datasets created by the various SMOTE family techniques. As a result, identifying the SMOTE family approach that gets the best results from ML algorithms is difficult. We therefore use a statistical significance test to help decide on the optimal SMOTE family technique. Before running the ANOVA test, the normality assumption must be checked. Table V shows the results of the Anderson-Darling normality test on the accuracy, sensitivity, specificity, and AUC values. The p-values are less than 0.05; thus, the null hypothesis of normality is rejected, and the ANOVA test cannot be employed.
Because ANOVA's normality assumption is violated, we use the Friedman test instead of the ANOVA test to compare the resampling strategies in both datasets. The Friedman test results are shown in Table VI. Table VI shows that the p-value of the Friedman test for accuracy, sensitivity, specificity, and AUC is lower than 0.05. As a result, we reject the null hypothesis and conclude that at least one of the SMOTE family techniques performs differently from the other methods.
To summarize, this study aims to solve the imbalance problem with SMOTE family methods, and accuracy, sensitivity, specificity, and AUC are used to compare the models. Accuracy can be a useful measure when every class has the same number of samples. With an imbalanced set of samples, however, accuracy is uninformative because a model can score highly by predicting the majority class for every case. So, when it comes to selecting the best models, AUC takes precedence. From Figs. 3, 4, 5 and 6, we can note that ML models achieve the highest accuracy and the highest specificity with the original data. On the other hand, they achieve the lowest sensitivity and AUC with the original data, which shows that ML algorithms do not give accurate results on imbalanced datasets and cannot predict all the target classes. Therefore, solving the imbalanced data problem is necessary. On the balanced datasets produced by the SMOTE family, the specificity and accuracy results are not improved. This is logical because, on balanced data, most ML classifiers consider all classes, which leads to lower specificity and accuracy than on the imbalanced data, where classifiers favor the majority class and largely ignore the other class.
Moreover, the sensitivity and AUC results on the balanced datasets are significantly improved, especially with the hybrid SMOTE methods. Finally, based on the AUC comparison of the ML models, the SVM classifier with the SMOTE-TOMEK method achieved 80.5%, the highest among all models.

VIII. CONCLUSION
The findings show that algorithms are unable to make accurate predictions with unbalanced data. In contrast, the results demonstrate that the algorithms' performance improves when they use the balanced data obtained by the different SMOTE family techniques. The findings of the validation approach show that classifiers perform differently on the different balanced datasets, making it difficult to choose the appropriate resampling technique. The Friedman test was used to determine the optimal resampling approach. According to the AUC, the results of this test show that the hybrid resampling methods are better than the others, and in particular SMOTE-TOMEK performs better than the alternative resampling approaches. Moreover, among the ML algorithms, the SVM model produced the best results with SMOTE-TOMEK. Based on the results of this paper, we recommend using hybrid resampling strategies to solve the unbalanced data problem, as both SMOTE-TOMEK and SMOTE-ENN provided the best performance.

IX. FUTURE WORK
The study can be broadened to incorporate hybrid and deep learning algorithms. Other performance indicators might be used to assess performance; for example, the running time of the algorithms could be a useful indicator. The algorithms could also be evaluated on datasets from other sectors that suffer from unbalanced data, to confirm the efficiency of the hybrid resampling strategies in solving the imbalanced data problem.