Handling Class Imbalance in Credit Card Fraud using Resampling Methods

Credit card-based online payments have grown rapidly, compelling financial organisations to implement and continuously improve their fraud detection systems. However, credit card fraud datasets are heavily imbalanced, and different types of misclassification errors carry different costs, so it is essential to control these errors to a certain degree and to balance them against one another. Classification techniques are promising solutions for separating fraud from non-fraud transactions. Unfortunately, classification techniques do not perform well when the numbers of minority and majority cases differ hugely. Hence, in this study, three resampling methods, Random Under Sampling, Random Over Sampling and Synthetic Minority Oversampling Technique, were applied to a credit card dataset to overcome the rarity of fraud events. The three resampled datasets were then classified using classification techniques. Performance was measured by sensitivity, specificity, accuracy, precision, area under curve (AUC) and error rate. The findings showed that, with resampling, the models were more practicable, gave better performance and were statistically better.

Keywords: Credit card; imbalanced dataset; misclassification error; resampling methods; random undersampling; random oversampling; synthetic minority oversampling technique


I. INTRODUCTION
In past decades, as businesses migrated to online trading and money came to be managed electronically in an ever-growing cashless banking economy, credit cards gradually replaced cash owing to their suitability [1]. Credit cards have been the most popular mode of payment ever since. According to [2], credit card-based purchases can be categorised into two types: i) physical card purchases and ii) virtual card purchases. Most payments for online purchases are virtual card purchases, for which only a few pieces of information are needed, such as the card number, expiration date and secure code. Along with the increasing number of credit card users, the number of fraudulent transactions has constantly increased. The article in [3] stated that it is hard to find the identity and location of fraudsters since the evidence is hidden behind the internet. Merchants who face credit card fraudsters bear all the costs, including card issuer fees, charges and administrative charges [4]. Consequently, merchants must increase the price of goods, give more discounts or reduce incentives to offset the losses. Hence, an effective fraud detection system is vital to reduce the loss rate.
Before proceeding with the fraud detection system, it must be borne in mind that there has been an enormous increase in the amount of credit card data collected and processed by organisations. In real datasets, fraudulent transactions are normally very rare compared with non-fraudulent transactions [5,6,7]. With such a skewed dataset, the performance of the system tends to drop in terms of its accuracy. When a legitimate transaction is misclassified as fraudulent, customer service is affected and customers lose trust in the financial institution [8,9,10]. Maes (2002) listed the capabilities that a fraud detection system should have in order to perform well [11]. The system should be able to: i) handle skewed distributions, ii) handle noise, iii) avoid overlapping data, iv) adapt to new kinds of fraud, v) evaluate the classifier using good metrics, and vi) detect the behaviour of the frauds. Recent research in [12] stated that there are three challenges in constructing a fraud detection system: i) the data distribution evolves over time because of seasonality and new attack strategies, ii) fraudulent transactions represent only a very small fraction of all daily transactions, and iii) the fraud detection problem is intrinsically a sequential classification task.
In 2017, Haixian et al. stated that it is difficult to detect rare events due to their infrequency and casualness. Moreover, misclassifying rare events can result in heavy costs. In their review paper, they identified three main solutions to these challenges: resampling, cost-sensitive learning and ensemble methods [13]. The most popular are resampling methods, which rebalance the imbalanced dataset in order to alleviate the effect of the skewed class distribution on the learning process. The second is cost-sensitive learning, which can be incorporated at both the data level and the algorithmic level. Lastly, ensemble methods are used to improve on the performance of a single classifier. The review paper in [14] specified two approaches to solving imbalanced data problems: i) a solution at the data level, balancing the distribution of the majority and minority classes through under sampling, over sampling or a combination of both, and ii) a solution at the algorithm level, modifying the classifier or optimising the performance of the learning algorithm.
www.ijacsa.thesai.org
Thus, both review papers [13, 14] emphasised that no single method is absolutely more efficient in dealing with class imbalance. They offered some insights into commonly used methods in particular domains. Burez (2009) handled class imbalance in churn prediction by applying several methods: i) evaluation metrics, ii) cost-sensitive learning, iii) resampling methods and iv) boosting. He used ROC analysis for the evaluation metrics and a stochastic gradient boosting learner for boosting. For cost-sensitive learning he used random forest, and for resampling he used random under sampling [15]. The study in [16] proposed an efficient resampling method and obtained comparable classification results between random under sampling and random over sampling. The experiments were carried out using four large imbalanced Bioinformatics datasets. They recommended the 100%under(0.75)-over method for obtaining classification results comparable to the over sampling results. In 2002, [17] proposed Wrapper-based Random Oversampling (WRO) to handle the class imbalance problem. A wrapper is a pre-processing method that incorporates the classifier output to guide the pre-processing. The method randomly oversamples the minority class data while the classifier is optimised. They evaluated WRO on real datasets obtained from the UCI repository. WRO gave better results in most experiments compared to Synthetic Minority Over Sampling Technique (SMOTE) and random over sampling. The research in [18] investigated resampling methods specifically on data from Spotify users. They used the most common oversampling methods, random oversampling and SMOTE, and the most common under sampling method, random under sampling. Yan and Han (2018) proposed RE-sample and Cost-Sensitive Stacked Generalisation (RECSG), based on 2-layer learning models, to solve the imbalance problem in 18 benchmark datasets [19]. The experimental results and statistical tests showed that the RECSG approach improved classification performance.
Having reviewed the literature, this study focuses on resampling methods because of their simplicity and their compatibility with existing classification models for handling rare events in massive credit card datasets. No research was found on the association between credit card fraud and resampling methods. Therefore, the aim of this study is to investigate the classification models' ability to classify fraud and non-fraud transactions, and to examine whether different resampling methods can improve the performance of the models. The research methodology of the study is presented in Section 2. Section 3 then describes the experimental setup. The results and discussion are presented in Section 4. The study ends with concluding remarks and future work in Section 5.

II. RESEARCH METHODOLOGY
This section gives a brief description of the methodology of this study and discusses each of its steps. Fig. 1 displays the framework of the research methodology.

A. Data Collection
One of the biggest problems facing researchers in financial fraud detection is the lack of real-life data, owing to the sensitivity of the data and privacy issues [5]. Hence, a publicly available dataset was downloaded from [20] for use in this research. It has a total of 284,807 transactions made in September 2013 by European cardholders. The dataset contains only 492 fraud transactions, about 0.17% of the total, so it is highly imbalanced.
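The degree of imbalance follows directly from the two counts reported above. The short sketch below works through that arithmetic; the function name `imbalance_summary` is illustrative and not part of the study.

```python
# Class distribution reported in the text for the public dataset.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492


def imbalance_summary(total, minority):
    """Return (majority count, minority share, majority-to-minority ratio)."""
    majority = total - minority
    return majority, minority / total, majority / minority


majority, fraud_share, ratio = imbalance_summary(TOTAL_TRANSACTIONS, FRAUD_TRANSACTIONS)
print(f"non-fraud transactions: {majority}")
print(f"fraud share of all transactions: {fraud_share:.4%}")
print(f"non-fraud transactions per fraud transaction: {ratio:.1f}")
```

A classifier that labels every transaction as non-fraud would therefore be correct on all but a tiny fraction of the rows, which is why accuracy alone is a weak measure on this dataset.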

B. Resampling Methods
The three widely used resampling methods in this study are Random Under Sampling (RUS), Random Over Sampling (ROS) and SMOTE. For undersampling, RUS was chosen since it is considered both simple and effective. ROS and SMOTE were chosen as oversampling methods because of their wide usage. Furthermore, ROS is an intuitive way of balancing data, whereas SMOTE is more complex, creating synthetic samples using K-Nearest Neighbour (KNN). Table I summarises the differences between the three resampling methods.
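To make the difference between the three methods concrete, the following is a minimal pure-Python sketch, not the WEKA implementation used in this study. The toy data, the function names and the choice of k = 3 are assumptions made for illustration only.

```python
import math
import random


def random_under_sample(majority, minority, rng):
    """RUS: keep every minority row and draw an equally sized majority subset."""
    return rng.sample(majority, len(minority)), list(minority)


def random_over_sample(majority, minority, rng):
    """ROS: keep every majority row and duplicate minority rows until balanced."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


def smote(minority, n_synthetic, k, rng):
    """SMOTE: interpolate between a minority row and one of its k nearest
    minority neighbours (Euclidean distance)."""
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


rng = random.Random(42)
# Hypothetical toy data: 2-D feature tuples, 20 non-fraud rows and 4 fraud rows.
non_fraud = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
fraud = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(4)]

rus_major, rus_minor = random_under_sample(non_fraud, fraud, rng)
ros_major, ros_minor = random_over_sample(non_fraud, fraud, rng)
smote_minor = fraud + smote(fraud, len(non_fraud) - len(fraud), k=3, rng=rng)

print("RUS sizes:", len(rus_major), len(rus_minor))
print("ROS sizes:", len(ros_major), len(ros_minor))
print("SMOTE sizes:", len(non_fraud), len(smote_minor))
```

RUS discards majority rows, ROS only repeats existing minority rows, and SMOTE creates new minority rows that lie on line segments between existing ones. This is why SMOTE samples can drift into regions occupied by the majority class when the two classes overlap.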

C. Classification Techniques
The credit card dataset poses a binary classification task: each transaction is classified as either non-fraud (0) or fraud (1). After the data have been resampled, models need to be trained with classifiers to evaluate the resampling methods. Thus, in this study, four different classification techniques were explored: Naïve Bayes (NB), Linear Regression (LR), Random Forest (RF) and Multilayer Perceptron (MLP). A summary of the strengths and limitations of the classifiers used in this study is given in Table II.

Naïve Bayes
Limitation: Assumes that all attributes are independent of each other given the context of the class.

Linear Regression
Strength: Provides optimal results when the relationship between the independent and dependent variables is almost linear.
Limitation: Sensitive to outliers and limited to numeric values only.

Random Forest
Strength: Requires low computational power and is suitable for real-time operations. The procedure is easy to understand and implement.
Limitation: Overfits easily when the training set does not convey the underlying domain information. Retraining is needed whenever new types of cases appear.

Multilayer Perceptron
Strength: Suitable for binary classification problems.
Limitation: Requires retraining and high computational power. Unsuitable for real-time operations.

III. EXPERIMENTAL SETUP
This section describes the division of the data into training datasets and the performance measures used throughout this study. All the resampling techniques were implemented in the Java framework of WEKA 3.8 for comparative evaluation. The parameters of the classification techniques were left at their default settings. Although further fine-tuning of parameters to specific datasets can be beneficial, consideration of generally accepted settings is more typical in practice.

A. Data Division
The methodological approach taken by this study is motivated by the research in [15].

B. Performances Evaluation
In this study, performance evaluations were conducted to assess the classification methods under each resampling technique. Two fundamental errors may occur in the models: classifying a fraud falsely as a non-fraud, and classifying a non-fraud falsely as a fraud. These errors are more commonly known as false negative and false positive results, respectively. The other possible outcomes are correct classifications, namely true positive and true negative results. The relationship between these is presented in a confusion matrix in Table V. The performance of the four classifiers was compared in terms of Sensitivity, Specificity, Accuracy, F-Measure and Area Under Curve (AUC). These metrics are calculated using the confusion matrix as shown below. Table V was generated from four measures: True Positive (TP), the number of transactions classified as fraud that really are fraud; True Negative (TN), the number of transactions classified as non-fraud that really are non-fraud; False Positive (FP), the number of transactions incorrectly classified as fraud that are actually non-fraud; and False Negative (FN), the number of transactions incorrectly classified as non-fraud that are actually fraud.
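As a minimal sketch of how the measures follow from the four confusion-matrix counts, the function below uses the standard textbook definitions, which are assumed here rather than taken from this study. The example counts are hypothetical and are not results from the experiments.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Derive the evaluation measures from the four confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. there is at least one actual
    fraud, one actual non-fraud and one transaction predicted as fraud.
    """
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)  # share of actual frauds that were detected
    specificity = tn / (tn + fp)  # share of actual non-frauds that were cleared
    precision = tp / (tp + fp)    # share of fraud alerts that were real frauds
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts chosen only to exercise the formulas.
example = confusion_metrics(tp=90, tn=950, fp=50, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

In this sketch accuracy and error rate always sum to 1, so reporting both adds no information. Sensitivity and precision are the measures that respond to how well the minority class is handled.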

$\text{Specificity} = \dfrac{TN}{TN + FP}$ (2)

IV. RESULTS AND DISCUSSIONS
This section discusses the results obtained from the experiments. Table VI and Table VII display summaries of the comparison results for each classification technique under the three resampling methods at ratios 30:70 and 50:50, respectively. The results were compared in terms of sensitivity, specificity, accuracy, precision, F-measure, AUC and the time taken to build the model in seconds. All the classifiers performed well, with an accuracy of 0.90 or more. However, RF dominates with higher accuracy than the other classification techniques for both ratios of each resampling method.
In contrast, Table VII presents the results for ratio 50:50 for each resampling method. In RUS, LR and RF have the same specificity, 0.97866, but differ in sensitivity by nearly 0.0183. Although both techniques classify the same number of non-fraud transactions correctly, RF is still better than LR at classifying the fraud transactions. RF also shows its effectiveness through a higher F-Measure than the other classifiers. For ROS, RF has the highest accuracy, followed by LR, MLP and NB with 0.9969, 0.9895 and 0.9115, respectively. Both LR and RF have a sensitivity of 1, which means that LR and RF correctly classified 100% of the fraud transactions. RF also gives strong results under SMOTE. Its accuracy differs only slightly from that of LR and MLP, by 0.00608 and 0.00856, respectively. Hence, RF can correctly classify fraud and non-fraud transactions, since it has higher sensitivity and precision than the other classification methods. It is important to note that, although NB took only a fraction of a second to build the model, it has the lowest precision of all the classifiers. NB also has the lowest accuracy and sensitivity. Although MLP took the longest time to build the model, it does well in classifying fraud and non-fraud transactions, with up to 99% correctness.

V. CONCLUSIONS AND FUTURE WORK
This study set out to investigate the classification models' ability to classify fraud and non-fraud transactions, and to examine whether different resampling methods could improve the performance of the models. Of the four classifiers applied, RF showed robust performance across the three resampling methods, achieving higher accuracy than NB, LR and MLP. It would be interesting to compare the classification techniques used in this study with other techniques such as Support Vector Machines, Neural Network and Genetic Algorithm. Surprisingly, ROS was found to give convincing results compared with SMOTE. Although SMOTE is reported to be quite effective in the literature, this finding is most probably because some of the synthetic data produced by the oversampling process spread over both minority and majority data, as discussed in Section 2, which is the main limitation of SMOTE. A few researchers have modified SMOTE to create more effective methods for improving classification performance. These improved SMOTE variants could be compared with the current resampling methods on credit card datasets in the future. Hence, these results may help organisations to build better fraud detection systems that can handle skewed distributions and noise, and to evaluate classifiers using better metrics.
The research in [15] handled the imbalance problem in churn prediction by resampling the minority and majority classes at ratios of 10:90, 20:80, 30:70, 40:60, 50:50, 60:40, 70:30, 80:20 and 90:10 of churners to non-churners. Based on the results obtained by that research, and because of limited time, this study chose the ratios 30:70 and 50:50 (fraud:non-fraud) for dividing the training dataset. An overview of the dataset division, splitting and resampling can be seen in Fig. 2. Table III and Table IV give more details on the dataset division for this research.
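The class sizes implied by a target fraud:non-fraud ratio can be worked out with simple proportions. The sketch below assumes that undersampling keeps every fraud row and shrinks the non-fraud class, while oversampling (ROS or SMOTE) keeps every non-fraud row and grows the fraud class. The training counts are hypothetical, since the actual counts belong to Table III and Table IV.

```python
def resampled_counts(n_fraud, n_non_fraud, fraud_part, non_fraud_part, method):
    """Return (fraud rows, non-fraud rows) after resampling to the ratio
    fraud_part:non_fraud_part.

    method="under": keep all fraud rows, reduce the non-fraud rows.
    method="over":  keep all non-fraud rows, increase the fraud rows.
    """
    if method == "under":
        return n_fraud, round(n_fraud * non_fraud_part / fraud_part)
    if method == "over":
        return round(n_non_fraud * fraud_part / non_fraud_part), n_non_fraud
    raise ValueError(f"unknown method: {method!r}")


# Hypothetical training-set counts, for illustration only.
N_FRAUD, N_NON_FRAUD = 300, 200_000

for fraud_part, non_fraud_part in [(30, 70), (50, 50)]:
    for method in ("under", "over"):
        counts = resampled_counts(N_FRAUD, N_NON_FRAUD, fraud_part, non_fraud_part, method)
        print(f"{fraud_part}:{non_fraud_part} {method}-sampling -> fraud={counts[0]}, non-fraud={counts[1]}")
```

The sketch makes the trade-off visible: undersampling leaves a very small training set, whereas oversampling produces a large one, which is consistent with the longer model-building times reported for the oversampled data.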

TABLE I. COMPARISON OF RESAMPLING METHODS

TABLE III. DIVISION DATASET BY RATIO 30:70

TABLE V. CONFUSION MATRIX OF CREDIT CARD DATASET

Table VI provides the results for ratio 30:70 for the three resampling techniques. In RUS, MLP has higher sensitivity than the other classification techniques but slightly lower specificity than RF. The error rates for MLP and RF are 0.0319 and 0.0273, respectively. Thus, RF has a misclassification rate of approximately 2.7%, compared with approximately 3.2% for MLP. For ROS, LR and RF both reach an accuracy of 99%, correctly identifying the fraud and non-fraud transactions of the credit card dataset. However, RF is more precise than LR. On the other hand, LR took only 53.5 seconds to build the model while RF took about 343 seconds. MLP took the longest time to build the model, 896 seconds. Meanwhile, in SMOTE, RF has a higher precision than the other classification methods. This shows that RF correctly classified the non-fraud transactions most often, with a rate of 0.9999, followed by LR (0.9862), MLP (0.9837) and NB (0.9328).

TABLE VI. COMPARISON RESULTS OF CLASSIFICATION TECHNIQUES BY RATIO 30:70

TABLE VII. COMPARISON RESULTS OF CLASSIFICATION TECHNIQUES BY RATIO 50:50

Table VIII is quite revealing in several ways. First, unlike the other tables, Table VIII focuses on AUC and error rate. Second, RUS, ROS and SMOTE are compared with the original training dataset. The highest AUC and lowest error are printed in bold. For each case, test statistics were calculated to compare the model with the highest AUC and lowest error against the other models and to identify those that were significantly worse. The table shows that, although the original training set has the lowest error rate compared to the resampling methods in all four classifiers, it has the worst AUC. This illustrates that a low error rate alone does not make a statistically good model. For NB, the best AUC is obtained by RUS when the training set is set to 30% fraud and 70% non-fraud (AUC=0.9117). It differs by only 0.0002 from ROS when the training sets have the same ratio of fraud and non-fraud. SMOTE and the original training set do not differ significantly with respect to AUC. For LR, the AUC closest to 1 is that of ROS at ratio 30:70, with 0.9988, followed by ROS (50:50), SMOTE (50:50) and SMOTE (30:70). RUS at both ratios and the original training set differ largely in AUC from ROS with 30% fraud. Although the original training set has the lowest error rate, ROS (30:70) has the second lowest. ROS with 30% fraud and 70% non-fraud is statistically significantly better, and can therefore be a better resampling method for the LR technique.

TABLE VIII. MEAN AUC PERFORMANCE AND ERROR RATE FOR EACH RESAMPLING METHODS

From the table, ROS (50:50) also has the smallest error rate after the original training set. Both ratios in ROS are statistically significantly better for RF. For MLP, in terms of AUC, SMOTE gave better results at both ratios, followed by ROS and RUS. The same holds for error rate, where SMOTE also gave a smaller rate than ROS and RUS. Yet none of the resampling methods was significantly better in terms of error rate than the original training set.
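The observation that the original training set combines the lowest error rate with the worst AUC can be reproduced on a toy example. The sketch below computes AUC as the probability that a randomly chosen fraud row receives a higher score than a randomly chosen non-fraud row, a standard rank-based definition that is assumed here. All scores and labels are hypothetical.

```python
def auc(scores, labels):
    """AUC as the share of (fraud, non-fraud) pairs in which the fraud row
    has the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def error_rate(predictions, labels):
    """Share of rows whose predicted label differs from the true label."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical toy set: 2 fraud rows followed by 998 non-fraud rows.
labels = [1] * 2 + [0] * 998

# Model A: gives every row the same score and always predicts non-fraud.
scores_a = [0.0] * 1000
preds_a = [0] * 1000

# Model B: ranks both frauds highest but also flags 20 non-fraud rows.
scores_b = [0.9, 0.8] + [0.7] * 20 + [0.1] * 978
preds_b = [1 if s >= 0.5 else 0 for s in scores_b]

print("A: error", error_rate(preds_a, labels), "AUC", auc(scores_a, labels))
print("B: error", error_rate(preds_b, labels), "AUC", auc(scores_b, labels))
```

Model A has the lower error rate yet detects no fraud at all, while Model B ranks every fraud above every legitimate transaction at the cost of a few false alarms. On heavily imbalanced data, error rate therefore needs to be read together with AUC.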