A Comparison of Data Sampling Techniques for Credit Card Fraud Detection

Credit card fraud is a persistent problem that continues to constrain the financial sector, and its detrimental effects are felt across the entire financial market. Criminals are continuously on the lookout for ingenious methods of committing fraud and are a real threat to security. Therefore, there is a need for early detection of fraudulent activity so that financial institutions can preserve customer trust and safeguard their business. A major challenge in designing fraud detection systems is the class imbalance in the data, since genuine transactions far outnumber fraudulent transactions, which typically account for less than 1% of the total. This is an important area of study as the positive case (fraudulent case) is hard to distinguish, and it becomes even harder as more data flows in and the representation of such cases decreases further. This study trained four predictive models, Artificial Neural Network (ANN), Gradient Boosting Machine (GBM), Random Forest (RF) and Stacked Ensemble (SE), on datasets produced by different sampling methods. Random Under Sampling (RUS), Synthetic Minority Over-sampling Technique (SMOTE), Density-Based Synthetic Minority Over-Sampling Technique (DBSMOTE) and SMOTE combined with Edited Nearest Neighbour (SMOTEENN) were used for all models. The findings indicate promising results with SMOTE-based sampling techniques. The best recall score, 0.81, was obtained with the SMOTE sampling strategy by the Distributed Random Forest (DRF) classifier; the precision score for this classifier was 0.86. A Stacked Ensemble was trained on all the sampled datasets and was found to have the best average performance at 0.78. The Stacked Ensemble model has shown promise in the detection of fraudulent transactions across most of the sampling strategies.
Keywords: Data imbalance; credit card fraud; sampling techniques


I. INTRODUCTION
Transactions using credit cards have become an important aspect of our daily lives. The purchase of goods and services is no longer a chore that requires physical activity; rather, it is initiated with the touch of a button on a smartphone or personal computer. The authorization of transactions is rigorous and secure, although such conveniences come at the cost of the proof-of-identity checks that require personal identification documents, an authorized signature and physical presence. The basis of identity proof in such transactions is the information on the card along with the digital identification tied to the cardholder.
The conveniences of digital transactions make them a target for fraudsters, who employ sophisticated tactics for theft and illicit use. Credit card fraud is generally an unauthorized operation on an account by an individual who is not authorized to perform it. It also covers cases where a person transacts with a card without the explicit permission of the cardholder or card issuer [1]. The most common forms of credit card fraud are stolen or lost cards, fraudulent applications, counterfeit card fraud, non-receipt fraud, card-not-present (CNP) fraud and account takeovers [2]. According to the European Central Bank's figures on the composition of fraudulent credit card transactions for the year 2016 [3], 73% of the fraudulent transactions were a result of CNP, where payments are made via the internet or telephone.
Billions of dollars are lost worldwide every year due to fraudulent credit card transactions. According to the Nilson Report on global payment systems, losses due to credit card fraud amounted to $22.8 billion, a 4.4 percent increase over the year 2015. It was also highlighted that fraud in the United States accounts for 38.6% of global credit card fraud. The Nilson Report also projects that credit card fraud losses will grow by over $10 billion over the next three years [4].
With such illicit activities costing institutions and individuals huge amounts of money, tackling this issue has become a priority over the past decade, and various studies have been conducted to address it. Financial institutions are constantly upgrading their fraud detection systems. The Association of Certified Fraud Examiners (ACFE) suggests that proactive data analysis and continuous monitoring of real-time activity are key to minimizing and preventing fraudulent credit card transactions [2].
Financial institutions and credit card issuers collect and store a vast amount of transaction data. Every credit card transaction comprises key attributes such as the card identifier, transaction date, recipient and transaction amount, which are stored in databases. Fraud Detection Systems (FDS) implement various layers of validation to flag potential frauds using such datasets.
Although machine learning and predictive analytics might not identify exactly which type of fraud may occur, they have the potential to flag suspicious activities and identify potential frauds with the help of a model trained on historical data, combined with expert analysis. Such systems can equip institutions with proactive insights, enabling them to better cope with and mitigate fraudulent transactions.

Real-world implementations of FDS cannot reliably check all transactions, as they are constrained by the human labour required to validate the sheer number of alerts raised by the system. An FDS mainly relies on fraud investigators as a confirmatory layer, whereby flagged fraudulent activities (alerts) are verified and validated by the designated investigators.
The remaining transactions are labelled after a reaction-time window: those reported by the customer during this window are labelled as fraudulent, and the unreported transactions are labelled as genuine. To summarize, an FDS obtains labelled samples in two ways: immediate feedback samples (transactions with investigator feedback) and delayed samples (transactions whose fate is known only after a set reaction-time period). This is a crucial distinction to consider when implementing an accurate FDS, as not every transaction is immediately labelled as either fraud or genuine [5].
Fraudulent labels in the dataset can, therefore, be safely assumed to be verified and validated by the investigators. However, there are other challenges in designing accurate machine learning techniques for such data. Firstly, the data distribution is non-stationary, and fraudulent and genuine transactions can share similar profiles. Fraudsters often mimic the cardholder's spending behaviour, which makes the profiles of the fraudster and cardholder very similar in some cases and different in others. These changing dynamics between genuine and fraudster profiles, also known as concept drift, make it particularly challenging for machine learning algorithms to accurately predict fraudulent transactions [5]-[9]. Secondly, the skewness, or class imbalance, in these datasets poses a considerable challenge in building accurate machine learning models. This is the case for a variety of real-world applications where the observations of interest are a small fraction of the total cases. Credit card fraud detection has this distinctive characteristic, as the majority of transactions are genuine while the cases of concern (fraudulent activity) are very few. This is known as class imbalance, and it is significant because the positive class is often the rare class and predicting it becomes harder as the number of negative-class instances keeps increasing. Machine learning models typically work on the assumption of an equal class balance and equal cost of misclassification; therefore, adequate measures have to be taken to address class imbalance [5], [6], [10]-[12].
Detection of credit card fraud is classified as a cost-sensitive problem, where there is an associated cost for incorrectly classifying a genuine transaction as fraudulent and for incorrectly classifying a fraudulent transaction as genuine. When no fraud occurs, the financial institution incurs no associated administrative costs. However, failure to detect a fraud results in the loss of the particular transaction amount. Cost sensitivity is thus an important consideration to incorporate into the FDS, particularly in the development of models on class-imbalanced datasets [9].

A. Contributions
The research contributes both theoretically and practically. Its significance in both respects is summarized as follows.
This paper provides an overview of the most recent literature on credit card fraud detection strategies, focusing on the newest machine learning techniques while addressing the major challenges faced by traditional FDS. The research offers an up-to-date perspective on trends in the credit card fraud detection domain and on the model evaluation metrics that offer the best results, and it outlines the limitations of existing FDS. Researchers may find this paper a good starting point for research on implementing machine learning techniques for credit card fraud detection.
The practical contribution of this research is a sound and realistic model that addresses the classification problem in the domain of credit card fraud detection. The sampling strategies implemented in this paper enable researchers to promptly adopt the technique which best serves their research goal. Various sampling techniques are implemented to generate datasets and train different machine learning models, and the experimental results of the built models are summarized using a multitude of relevant model evaluation metrics.
The paper addresses key challenges faced in building machine learning models for FDS and experimentally demonstrates strategies to mitigate or minimize such challenges. It is therefore a valuable contribution to the financial sector, offering a predictive model able to accurately predict fraudulent credit card transactions.

II. RELATED WORKS
This section synthesizes the ideas in existing studies and covers key subject matters in the domain of credit card fraud detection. These subjects include machine learning techniques, sampling techniques, visual data analytics, feature engineering and model evaluation metrics.

A. Machine Learning Techniques
Studies on the use of predictive analytics for credit card fraud detection show that researchers have adopted various methods such as Artificial Neural Networks, k-Nearest Neighbour (kNN), Logistic Regression (LR), AdaBoost, Naïve Bayes (NB) and many more [6], [13]-[17].
The authors of [6] used NB, kNN and LR on the European cardholders dataset. This dataset contains anonymized transaction data of European credit card holders, collected over a period of two days, and contains 284,807 samples. The results of this study conclude that kNN produced the best results for accuracy, sensitivity and specificity, although the authors argue that this could potentially be caused by the generation of synthetic samples using the Synthetic Minority Oversampling Technique, which itself uses kNN.
One study proposed an improved sampling method to produce better performance, referred to as Moving to Adaptive Samples in Imbalanced (MASI) dataset. The study implemented Random Forest (RF), Support Vector Machines (SVM) and the C5.0 decision tree algorithm and concluded that SVM produced the best results [18].
Another study using the same dataset implemented LR, kNN, linear SVM, RBF SVM, decision trees, RF and NB algorithms [17]. Although it implemented models similar to those in [18] and [6], the sample size was only 350, which was a result of random under sampling. The highest sensitivity score in the study was achieved by SVM, at 94%.
Random Forest is implemented by the majority of researchers [6], [18], [19], with varying results. The authors of [20] experimented with a weight assignment approach for RF, using the out-of-bag error to compute the weights, while other researchers typically opted for various sampling techniques.
Deep learning techniques such as ANN, Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) were implemented in [21]. The LSTM and GRU outperformed the traditional ANNs; however, a shortcoming is that the training was not conducted to achieve optimal model stability. It was noted that "performance improved whenever network size was increased", and a recommendation was made for future work to identify an optimal stopping point.
Restricted Boltzmann Machines (RBM) are another deep learning topology implemented by past researchers [22]. The researchers used a novel approach in which an unsupervised machine learning technique (Stacked Auto Encoder) identified optimum weights, which were then applied to a supervised machine learning model. The RBM achieved an accuracy of 91.5%.

B. Issues of Class Imbalance
Numerous studies have shown different approaches to dealing with class imbalance when implementing accurate prediction models aimed at improving the detection rate of fraudulent transactions [6], [18], [23], [24].
The most common method in existing studies is to handle the problem at the data level, where the data is subjected to various sampling techniques. Random under sampling (RUS) removes majority class instances [10], [17], while random oversampling (ROS) adds minority class instances by replicating training samples of that class. More advanced oversampling methods were also used, such as the Synthetic Minority Oversampling Technique (SMOTE), which creates new synthetic instances of the minority class using kNN. Synthetic instances created with this technique have been shown to perform better than simple random oversampling or replication of instances [1], [25], [26]. Alternatives to SMOTE were implemented in [23] and [18], motivated by drawbacks of the SMOTE sampling technique such as potential loss of information and potential for model overfitting on the synthetic samples.
The authors of [18] proposed an improved sampling method, which they refer to as Moving to Adaptive Samples in Imbalanced (MASI) dataset, and obtained comparatively better performance than other sampling techniques such as RUS, ROS and SMOTE. While SMOTE resampling generates new instances and increases the data size before the classifier is applied, MASI adaptively creates synthetic samples based on the density distribution of the original data and up-samples the minority class by changing class labels. The researchers indicate that this reduces the bias of the classifier as it moves the samples in the minority class closer to the decision boundary.
As an alternative to tackling the imbalance at the data level, ensemble learning handles class imbalance at the algorithmic level. Ensemble methods typically include bagging and boosting, which primarily aim to lower variance by using multiple classifiers. In the bagging method, multiple weak classifiers are trained on different subsets of the majority and minority classes before a combined final classifier is built from all the weak classifiers. AdaBoost employs a similar strategy and can be applied to many classification problems; it eliminates the need to explore an optimum class balance ratio while alleviating the information loss that can be caused by RUS and the overfitting that can be caused by ROS and SMOTE [25], [27].
One study implemented a new oversampling strategy which combined k-means clustering with a genetic algorithm to oversample the minority class. The researchers propose this solution as an alternative to SMOTE and other sampling strategies, highlighting their potential for information loss and overfitting [23].

C. Feature Engineering
Fraudsters constantly change their behaviour and find new ways to commit fraud, which renders traditional expert rules ineffective. Machine learning methods are also prone to this type of problem; however, the adoption of new strategies can help to counter it. Feature engineering is a method which can be used extensively to counter this effect, whereby new features are created based on the cardholder's behaviour over time. These new features help the machine learning models to distinguish fraudulent patterns from normal cardholder behaviour [9], [28].
Feature engineering has been shown to be an important aspect of predictive analytics for the detection of credit card fraud. Financial institutions obtain and store large amounts of transaction-related data such as transaction amount, account holder details, time of transaction and more. While these data serve as good predictors in a classifier setting, they have the potential to be enriched with new information such as cardholder spending habits in a set time frame, or the average amount spent in different geographical areas or on different product and service types. For example, a cardholder can be profiled by their spending habits at home, but these may differ completely from their spending habits on a vacation in India. Such features could potentially uncover patterns and help with the concept drift problem, as cardholder and fraudster behaviour are distinguished with the help of new data dimensions [9], [23]. It is also noteworthy that single-transaction information is typically insufficient for this purpose; rather, aggregate measures that combine to form new features are ideal [9].

D. Evaluation Metrics
Evaluation metrics are important for understanding the performance of machine learning models. Detection of credit card fraud is classified as a cost-sensitive problem, where there is an associated cost for incorrectly classifying a genuine transaction as fraudulent and for incorrectly classifying a fraudulent transaction as genuine [9]. As such, the evaluation metric must be carefully chosen and must be relevant to the objective of the study and the available data.
Machine learning models work on the assumption of equal class distribution and equal cost of misclassification. The accuracy metric is not suitable for evaluating a model on datasets with class imbalance, as it favours the majority class: accuracy measures the proportion of correct predictions overall [10], [20].
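The effect can be illustrated with a minimal sketch that scores a hypothetical classifier labelling every transaction as genuine. The class counts are those of the dataset used in this study; the fraud count of 492 is not stated directly and is assumed here as the sum of the 344 training and 148 holdout frauds reported in Section IV.

```python
# Hypothetical baseline: predict "genuine" for every transaction.
TOTAL = 284_807   # transactions in the dataset (Section III)
FRAUD = 492       # assumption: 344 training + 148 holdout frauds (Section IV)

def all_genuine_baseline(total, fraud):
    """Return (accuracy, recall) of a classifier that never flags fraud."""
    true_negatives = total - fraud   # every genuine transaction is correct
    true_positives = 0               # no fraud is ever detected
    accuracy = (true_positives + true_negatives) / total
    recall = true_positives / fraud
    return accuracy, recall

accuracy, recall = all_genuine_baseline(TOTAL, FRAUD)
print(f"accuracy = {accuracy:.4f}, recall = {recall:.4f}")
# prints: accuracy = 0.9983, recall = 0.0000
```

Such a classifier is correct on more than 99.8% of transactions while detecting no fraud at all, which is why accuracy alone says little on this dataset.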
Area Under the Curve (AUC) measures the probability that the model or classifier will rank a random positive instance higher than a random negative instance. AUC is a metric many researchers have adopted [22], [26], [29]; it gives a good indication of the overall predictive performance of the model across various probability threshold settings and is well suited to class-imbalanced modelling.
Precision is the percentage of true positives among all positive predictions, while recall is the number of correctly predicted positives over the total number of actual positives, that is, the correctly predicted positive class plus the positives falsely predicted as negative. The F1 measure is the harmonic mean of recall (sensitivity) and precision. Of all the metrics used in this study, the most useful, giving a clear indication of the best classifier, was sensitivity, or recall.
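For reference, these metrics can be written in terms of confusion-matrix counts. The symbols TP (true positives), FP (false positives) and FN (false negatives) are notation introduced for this sketch and are not used elsewhere in the paper; the `align` environment assumes the amsmath package.

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Recall} &= \frac{TP}{TP + FN}, \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{align}
```

Here a positive is a fraudulent transaction, so FN counts frauds that pass undetected and FP counts genuine transactions flagged as fraud.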

III. METHODS AND TECHNIQUES
This section outlines the research methodology adopted to achieve the objectives of this research. It includes an overview of the key processes involved in the methodology, such as the dataset summary, sampling techniques and machine learning algorithms.
The dataset used for this study is secondary data consisting of transactions of European credit card holders, collected over a period of two days. It contains 284,807 observations with 31 variables, of which 28 are anonymized using principal component analysis [30].
The three non-anonymized variables are the transaction time, the amount and the class label (fraudulent or non-fraudulent transaction). The class label is '0' for a non-fraudulent transaction and '1' for a fraudulent transaction. The dataset is highly imbalanced, as fraud instances account for 0.172% of the observations. The dataset does not contain any missing values or outliers; therefore, no pre-processing of the dataset is required.

A. Sampling Techniques
A reliable FDS must detect as many frauds as possible while also reducing false flags, where genuine transactions are misclassified as fraudulent. The associated costs are much higher when a fraudulent transaction passes through the system undetected (False Negative). However, false flags raised for non-fraud transactions (False Positive) are also an important issue, as they hurt customer sentiment and add the cost of allocating investigative resources needlessly. Maximizing the recall score is thus very important, as a high recall score indicates a greater ability of the classifier to detect True Positives (frauds). The precision score is also important, as the FDS should avoid or minimize misclassifying genuine transactions as frauds. Therefore, various sampling strategies were adopted and four different classifiers implemented to determine the most effective sampling strategies and classifiers for the dataset.

1) Random Under Sampling (RUS): Random Under Sampling is one of the most commonly used sampling techniques, where the majority class is down-sampled to the same number of instances as the minority class by randomly removing instances of the majority class. The major problem with RUS is that it removes data at random, which can lead to the loss of important information captured by the removed instances.
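A minimal sketch of random under sampling, using only the Python standard library, is given below. The function name `random_under_sample`, its parameters and the toy data are hypothetical and serve only to illustrate the idea; this is not the implementation used in this study.

```python
import random

def random_under_sample(rows, labels, minority_label=1, seed=0):
    """Keep all minority rows and an equally sized random subset of the rest."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Randomly discard majority rows until both classes have the same size.
    kept_majority = rng.sample(majority, k=len(minority))
    kept = sorted(minority + kept_majority)
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Toy data (hypothetical): 3 frauds among 20 transactions.
labels = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
rows = [[float(i)] for i in range(len(labels))]
sampled_rows, sampled_labels = random_under_sample(rows, labels)
print(sampled_labels.count(1), sampled_labels.count(0))  # prints: 3 3
```

In the toy example 14 of the 17 genuine rows are discarded, which makes the information loss described above concrete.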
2) Synthetic Minority Oversampling Technique (SMOTE): SMOTE creates synthetic instances of the minority class. These data points are created by finding the nearest neighbours of each minority sample and creating new synthetic instances in the feature space until the minority class is balanced to the given ratio.
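The interpolation step can be sketched as follows. This is a simplified illustration, assuming Euclidean distance and a randomly chosen base point for each synthetic instance; the function `smote`, its parameters and the toy points are hypothetical, and the sketch is not the implementation used in this study.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class (hypothetical) in a two-dimensional feature space.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2)
print(len(new_points))  # prints: 6
```

Every synthetic point lies on a line segment between two existing minority points, so the new instances stay inside the region already occupied by the minority class.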
3) Density-Based Synthetic Minority Oversampling Technique (DBSMOTE): The DBSMOTE algorithm relies on a clustering algorithm called Density-Based Spatial Clustering of Applications with Noise (DBSCAN), which is widely used in data mining and machine learning applications. DBSCAN works by grouping together data points based on how closely they are packed, using a distance measure such as the Euclidean distance and a given minimum number of points, and operates on a bi-dimensional space. DBSMOTE essentially uses the DBSCAN clustering algorithm to form a cluster of the minority class, which is then used to up-sample the minority class. The minimum samples parameter specifies the number of data points required to form a dense region.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 6, 2020, www.ijacsa.thesai.org

4) Synthetic Minority Oversampling Technique with Edited Nearest Neighbor (SMOTEENN): SMOTEENN is another variant of SMOTE, which is basically a combination of SMOTE and Edited Nearest Neighbour (ENN). ENN is an effective method for removing noise from the dataset. For any given data point of either class, ENN removes the data point if its class differs from that of at least half of its k nearest neighbours.
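The ENN cleaning rule can be sketched as below, assuming Euclidean distance and k = 3. The function name, the threshold expression and the toy data are illustrative assumptions rather than the implementation used in this study.

```python
import math

def edited_nearest_neighbours(points, labels, k=3):
    """Drop every point whose label disagrees with at least half of its k nearest neighbours."""
    keep = []
    for i, (p, y) in enumerate(zip(points, labels)):
        # Indices of the k nearest other points, by Euclidean distance.
        others = [j for j in range(len(points)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(p, points[j]))[:k]
        disagree = sum(1 for j in nearest if labels[j] != y)
        if 2 * disagree < k:   # fewer than half disagree: keep the point
            keep.append(i)
    return [points[i] for i in keep], [labels[i] for i in keep]

# Toy data (hypothetical): one class-1 point sits inside the class-0 cluster.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
clean_points, clean_labels = edited_nearest_neighbours(points, labels)
print(clean_labels)  # prints: [0, 0, 0, 0, 1, 1, 1, 1]
```

Only the isolated class-1 point at (0.05, 0.05) is removed. Because the rule applies to both classes, the class counts after SMOTEENN need not be equal.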

B. Machine Learning Techniques
This section details and justifies the benchmark machine learning models proposed for the FDS, highlights their strengths, and lists the evaluation metrics to be used, focusing on the class-imbalanced nature of the dataset.
The models are evaluated primarily on the recall score, as it is more important for the FDS to accurately detect fraudulent transactions (increasing the true positive rate, TPR). Precision, although not as significant as recall, still has associated costs for an FDS, and thus the second metric to consider is the precision score.

1) Stacked ensemble: Stacked Ensemble models have shown promising improvements in classification accuracy when combined with a diverse set of classifiers. In a study by [31], a Stacked Ensemble was used on an imbalanced dataset and achieved the best performance among the models compared. Modern applications of machine learning quite often must deal with imbalanced classification, as is the case in this study. Current ensemble techniques offer modifications to the traditional ensemble models to allow for maximum performance on imbalanced learning. The Stacked Ensemble model allows for customization of parameters that are designed specifically to handle class imbalance [33]. The SE is a combined model of chosen base models and uses a General Linear Model (GLM) as the default meta learner to enhance model performance.
2) Gradient boosting machine: A Gradient Boosting Machine can be used for either regression or classification. It is an ensemble learning method based on the concept of boosting, where weak learners are built gradually so that prediction accuracy improves with each iteration. Unlike Random Forests, which use bagging and build trees independently of one another, boosting builds each tree based on the results of previously built trees. Although boosting improves accuracy, it is slower and less interpretable than other traditional models.
This study uses a gradient boosting machine to allow for a diverse set of classifiers, where four different categories of learning are considered, namely Bagging, Boosting, Deep Learning and Super Learning [33].
3) Random forest: Random Forest is essentially an ensemble model consisting of many decision trees, all of which are built from the same input dataset. The high prediction accuracy of random forests is due to the fact that a combined output is obtained from the outputs of all decision trees. Multiple training subsets are drawn from the dataset and a decision tree is constructed for each of these subsets. Each tree contributes a vote, and the majority of the votes determines the final class. This technique is known as random split and the trees are known as random trees.
For the purpose of this study, Random Forest is chosen to build a predictive classifier, as, according to [34], this model gives the best classification accuracy and offers high classification speed, interpretability of the resulting classifications, and ease of model parameter handling.
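The voting step of a random forest can be sketched in a few lines. The helper `majority_vote` and the example votes are hypothetical; note that some implementations average class probabilities across trees instead of counting hard votes.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by most trees for one transaction."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five trees: 1 = fraudulent, 0 = genuine.
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))  # prints: 1
```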

4) Artificial neural network: The Multilayer Perceptron (MLP) is a neural network trained by the backpropagation algorithm. An MLP is composed of three types of layers: an input layer, one or more hidden layers and an output layer. The architecture is densely connected, with every neuron in a layer connected to the neurons in the previous and next layers. Another feature of this network is that there is no activation function in the input layer, but every neuron in the hidden and output layers has an activation function.
The weights in the MLP are initialized randomly; the network then trains by computing the difference between the computed output and the actual output and iteratively adjusting the weights to minimize this residual.

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS
This section describes the model development stage and details the techniques implemented for this study. The sampling methods RUS, SMOTE, DBSMOTE and SMOTEENN are implemented, since the dataset is highly imbalanced with 0.17% positive instances. These methods are based on previous research in the domain and are selected to offer diversity in sampling algorithms, in an attempt to find out which sampling strategy works best for the given dataset. The aim is also to understand the strengths of the various classifiers and their ability to handle each sampling strategy.
The dataset was split into 70% for the training set and 30% for the holdout set using stratified random sampling. The holdout set contains 148 fraudulent transactions and is used to evaluate the performance of all models, to maintain consistency in scoring and model benchmarking. A separate holdout set is also the best strategy to avoid data leaks, which are problematic and occur frequently when cross-validation is used along with oversampling. The training set contains 199,020 non-fraud and 344 fraudulent transactions. This set is used both for oversampling with the SMOTE-based techniques and for under sampling with RUS.
Table II shows a summary of class counts after applying the sampling techniques to the original training dataset. In all cases the final class counts are equal, except for the SMOTEENN technique, where the counts are unequal due to the removal of noise using ENN.
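A stratified 70/30 split can be sketched as follows, using only the Python standard library. The function `stratified_split` is a hypothetical helper, the total of 492 frauds is assumed as the sum of the reported training and holdout counts, and rounding down within each class is an assumption that happens to reproduce the reported figures.

```python
import random

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split indices into train/holdout so each class keeps the same proportion."""
    rng = random.Random(seed)
    train, holdout = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(len(idx) * train_fraction)  # rounds down within each class
        train.extend(idx[:cut])
        holdout.extend(idx[cut:])
    return train, holdout

# Class counts: 492 frauds (assumed as 344 + 148); the rest are genuine.
labels = [1] * 492 + [0] * (284_807 - 492)
train, holdout = stratified_split(labels)
train_fraud = sum(labels[i] for i in train)
holdout_fraud = sum(labels[i] for i in holdout)
print(len(train) - train_fraud, train_fraud, holdout_fraud)
# prints: 199020 344 148
```

Resampling is then applied to the training indices only, so the holdout set keeps the original class distribution and no synthetic instance can leak into evaluation.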

A. Comparison of Sampling Techniques over Different Classifiers
The training dataset was used to produce four different sampled datasets which were used to train each classifier.

1) Artificial Neural Network (ANN):
The ANN architecture was set to 30 neurons in the input layer and 200 neurons in a single hidden layer, followed by two neurons in the output layer. The activation function used was the Rectified Linear Unit (ReLU). The model reached optimal performance at 61 epochs, with each epoch iterating over the training dataset. A dropout rate of 40% was used in the hidden layer, so that a share of the hidden neurons is dropped at random during training. The learning rate was set to 0.005.
The highest F1 score, 0.8116, was obtained on the unsampled dataset, which would be ideal where precision and recall are of equal importance, as the F1 measure is the harmonic mean of the two metrics. However, in the case of an FDS, recall is of much more importance than precision. RUS had the highest recall, 0.8311, but the lowest precision among all the sampled datasets. Although ANN with unsampled data produced the highest F1 score, its recall is lower than that of ANN with SMOTE, which had the second-highest recall at 0.7635 and a significantly better precision than RUS at 0.8370. Therefore, the better model for ANN is the one trained with SMOTE sampling.

2) Distributed Random Forest (DRF):
The number of trees was set to 43, with a maximum tree depth of 20. The low tree depth helps to lower model complexity and avoid overfitting. The min rows parameter, set to 5, specifies that each leaf must contain a minimum of 5 observations. The sample rate, which specifies the rate for row sampling, was set to 63%. The column sample rate was set to 0.8, so that 80% of the columns are used to construct each individual tree. Lowering the column sample rate helps to produce diverse trees, which generalize well.
The highest F1 score for the DRF was produced by the unsampled dataset, driven by the highest precision, 0.9286. Recall, which we consider the more important metric, is highest for SMOTE sampling at 0.8176, with a reasonable precision of 0.86. SMOTE sampling therefore gives the best model for DRF, considering the high recall score. It can also be noted that SMOTEENN sampling produced the second-best recall score for the DRF classifier. The SMOTEENN technique performs data reduction, or noise removal, using the Edited Nearest Neighbour technique, which removes any sample that is misclassified by its three nearest neighbours. This result suggests that the noise removal is not very effective here, as it produced a lower recall score than the original SMOTE sampling.
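Since all models are scored on the same holdout set of 148 fraudulent transactions, the recall values reported to four decimal places can be converted into counts of detected frauds. The short sketch below performs this arithmetic; the dictionary keys are labels chosen for illustration, and the counts are implied by the reported recalls rather than taken from the paper's tables.

```python
HOLDOUT_FRAUDS = 148  # fraudulent transactions in the holdout set (Section IV)

# Recall values reported in the text, to four decimal places.
reported_recall = {
    "ANN + RUS": 0.8311,
    "ANN + SMOTE": 0.7635,
    "DRF + SMOTE": 0.8176,
}

detected = {}
for name, recall in reported_recall.items():
    count = round(recall * HOLDOUT_FRAUDS)       # implied true positives
    # Sanity check: the count reproduces the reported recall after rounding.
    assert round(count / HOLDOUT_FRAUDS, 4) == recall
    detected[name] = count
    print(f"{name}: {count} of {HOLDOUT_FRAUDS} frauds detected")
# prints:
# ANN + RUS: 123 of 148 frauds detected
# ANN + SMOTE: 113 of 148 frauds detected
# DRF + SMOTE: 121 of 148 frauds detected
```

Read this way, the recall differences between sampling strategies correspond to a handful of transactions, which is worth keeping in mind when ranking models on a holdout set with so few positive cases.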

3) Gradient Boosting Machine (GBM):
The number of trees was set to 116 with a maximum tree depth of 15. This reduces model complexity and helps prevent the model from overfitting. The minimum number of rows to sample for the creation of each tree was set to 100 and the column sampling rate was set to 0.8, which means that 80% of the columns will be used for each tree.
The highest recall score, 0.81, was produced by two sampling methods, SMOTE and SMOTEENN. In a case like this, where two models produce similar recall scores, the F1 score can be used as the deciding factor, since it identifies the model with the better precision. Reducing false flags (False Positives) is an important aspect of an FDS, and thus the model with the highest recall and precision is preferred. Therefore, the best model for the GBM classifier is the one using SMOTE sampling, which resulted in a recall score of 0.81 and a precision score of 0.90. The model with the highest F1 score, DBSMOTE at 0.86, cannot be considered the best model because its recall score of 0.79 is lower than that of the previously mentioned models, although its precision is the highest at 0.94.
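The selection rule used here (recall first, F1 only as a tie-breaker) can be contrasted with a ranking by F1 alone. The sketch below uses only the precision and recall values reported above for two of the GBM models; the dictionary layout and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) reported above for two GBM models.
gbm_results = {
    "SMOTE": (0.90, 0.81),
    "DBSMOTE": (0.94, 0.79),
}

# Ranking by F1 alone rewards the very high precision of DBSMOTE.
best_by_f1 = max(gbm_results, key=lambda name: f1_score(*gbm_results[name]))

# Recall first; F1 only breaks ties between equal recall scores.
best_by_recall = max(
    gbm_results,
    key=lambda name: (gbm_results[name][1], f1_score(*gbm_results[name])),
)
print("Best by F1:", best_by_f1, "| best by recall:", best_by_recall)
```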

4) Stacked Ensemble (SE):
The Stacked Ensemble has very few parameters to define. The SE is a combined model of all trained models (30 models), using a Generalized Linear Model (GLM) as a meta learner to enhance the model performance. The meta learner folds parameter was set to 5 to create a 5-fold cross-validated model training with stratified sampling.
The Stacked Ensemble model is a Super Learner based on the combination of ANN, GBM and DRF. The Random Under Sampling (RUS) method scored the lowest for the key metric at 0.68 and also offered the lowest precision score. The highest observed recall score was for SMOTEENN, which combines SMOTE oversampling with noise removal using the Edited Nearest Neighbour (ENN) technique. Since this model also offers a reasonable precision score of 0.85, it can be considered the best sampling strategy for the Stacked Ensemble. The unsampled dataset offered the highest precision score of 0.94 as a result of less noise, since it is based on 100% of the original data and no synthetic samples were introduced.
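Stratified 5-fold partitioning, as mentioned above, keeps the class ratio roughly constant in every fold, which matters when fraud cases are rare. The following is a minimal sketch of such a partitioning on hypothetical labels; the function `stratified_folds` is an illustration and not the routine used by the modelling platform in the study.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample index to a fold so that every fold keeps roughly the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of one class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Hypothetical imbalanced labels: 990 genuine (0) and 10 fraudulent (1).
labels = [0] * 990 + [1] * 10
folds = stratified_folds(labels)
for fold in folds:
    print(len(fold), "samples,", sum(labels[i] for i in fold), "fraud")
```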

B. Summary
The results from all classifiers for each of the sampling methods employed were consolidated based on the performance metrics. The key metric for the domain of FDS is recall, which is of the highest priority, while also keeping False Positives (FP) to a minimum, i.e., achieving higher precision. To this end, the primary metric considered is recall, as it indicates the share of fraud cases detected (True Positives) and is maximized by minimizing False Negatives (fraudulent transactions classified as non-fraudulent).
The evaluation results were assessed from two perspectives: i) optimal sampling strategy, and ii) optimal classifier for the domain (see Table III). The key metric, recall, is considered as a first step for identifying the best sampling strategy. RUS has the highest observed recall score of 0.83 with the ANN classifier. However, this was not chosen as the best model since it offered a very low precision of 0.27. This means that while most of the fraudulent transactions are detected by the system, it also falsely flags many genuine transactions as fraudulent. Although a Fraud Detection System is mostly concerned with increasing True Positives, it must also be precise in this detection by reducing the number of False Positives.
The second-highest recall score was then considered, which was obtained with the SMOTE sampling strategy by the DRF classifier at 0.81. The precision score for this classifier is 0.86, which is significantly better than that of RUS with ANN. Therefore, SMOTE can be considered a better sampling strategy to adopt. SMOTE with the GBM classifier also offers a high recall, the third highest recorded at 0.81, while offering an even higher precision than SMOTE with DRF at 0.90. SMOTEENN sampling is another technique which offered promising results and performed consistently with all classifiers except for ANN. The recall scores for most of these models were at 0.81, while yielding a good precision score above 0.85 in all cases.
Assessing the average performance of a sampling strategy across various classifiers gives an indication of the best overall sampling strategy to adopt. In a domain with diverse classifiers such as FDS, the average performance of the sampling strategy is indicative of its generalizability, in terms of adapting well to other datasets in the field. Adopting no sampling strategy resulted in the worst average recall score, while the SMOTEENN sampling strategy offered the best average recall score at 0.79. SMOTE and DBSMOTE have the same average score of 0.78, although SMOTE produced the best classifier. The average score for SMOTE dropped considerably due to a very low recall of 0.76 with ANN.
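The practical difference between the two precision scores can be made concrete. Since precision = TP / (TP + FP), the number of false alarms raised per correctly detected fraud is (1 - precision) / precision. The sketch below applies this to the two precision values reported above; the function name is illustrative.

```python
def false_alarms_per_detected_fraud(precision):
    """Ratio FP / TP implied by a precision score, since precision = TP / (TP + FP)."""
    return (1 - precision) / precision


rus_ann = false_alarms_per_detected_fraud(0.27)    # RUS with ANN
smote_drf = false_alarms_per_detected_fraud(0.86)  # SMOTE with DRF
print(f"RUS + ANN:   {rus_ann:.2f} false alarms per detected fraud")
print(f"SMOTE + DRF: {smote_drf:.2f} false alarms per detected fraud")
```

Under these figures, RUS with ANN raises between two and three false alarms for every fraud it catches, whereas SMOTE with DRF raises well under one.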

V. CONCLUSION
Detection of credit card fraud is classified as a cost-sensitive problem, where there is an associated cost incurred for incorrectly classifying a genuine transaction as fraudulent and for incorrectly classifying a fraudulent transaction as genuine. When no fraud occurs, no administrative cost is incurred by the financial institution. However, failure to detect a fraud results in the loss of the particular transaction amount. It is thus important to incorporate cost into the FDS, particularly in the development of models on class-imbalanced datasets. There is an associated cost with False Positives, where genuine transactions are flagged as fraud. However, the cost associated with the inability to identify a fraudulent transaction can be immense in contrast. Therefore, the recall score was used as the key metric, as the target of the FDS is to maximize the True Positive Rate.
A base model was implemented using an unsampled dataset, followed by the implementation of four different sampling strategies (RUS, SMOTE, DBSMOTE, SMOTEENN). Four different classifiers, Distributed Random Forest (DRF), Artificial Neural Network (ANN), Gradient Boosting Machine (GBM) and a Super Learner (Stacked Ensemble, SE), were trained on each of the sampled datasets. Each classifier was evaluated based on the key evaluation metrics: F1 score, precision and recall.
The findings of this study indicate promising results with SMOTE-based sampling techniques. The best recall score obtained was with the SMOTE sampling strategy by the DRF classifier at 0.81. The precision score for this classifier was observed to be 0.86. Therefore, SMOTE can be considered a better sampling strategy to adopt.
The Stacked Ensemble was trained for all the sampled datasets and found to have the best average performance at 0.78, followed by the GBM classifier with the second-best average. ANN had the worst recall score, which may be due to the high level of noise generated by the synthetic samples. The Stacked Ensemble model has shown promise in the detection of fraudulent transactions across the majority of the sampling strategies.
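The cost structure described above can be sketched as a simple calculation. The sketch assumes that every flagged transaction, whether a True Positive or a False Positive, incurs the same fixed administrative cost, that a missed fraud costs the transaction amount, and that an unflagged genuine transaction costs nothing. The administrative cost value and the example transactions are hypothetical and were not part of the study.

```python
def total_cost(outcomes, admin_cost):
    """Sum the cost of a batch of (is_fraud, flagged, amount) transactions."""
    cost = 0.0
    for is_fraud, flagged, amount in outcomes:
        if flagged:
            # Assumption: every alert is investigated, whether TP or FP.
            cost += admin_cost
        elif is_fraud:
            # A missed fraud (FN) loses the transaction amount.
            cost += amount
        # Genuine and not flagged (TN): no cost.
    return cost


# Hypothetical batch of transactions: (is_fraud, flagged, amount).
batch = [
    (True, True, 200.0),    # TP: administrative cost only
    (True, False, 350.0),   # FN: the full amount is lost
    (False, True, 40.0),    # FP: administrative cost only
    (False, False, 60.0),   # TN: no cost
]
print(total_cost(batch, admin_cost=5.0))
```

Even in this small example the single missed fraud dominates the total, which is the reason recall is prioritized.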

A. Future Recommendation
Although the study was conducted to address the major problems in the domain of predicting fraudulent transactions, limitations of time and resources restricted it to a small number of sampling strategies. Several other sampling strategies may be considered as an avenue for further research to improve the classifier performance.
Although unsupervised machine learning was not covered within the scope of this study, it is still a promising area to be explored. This study may be further improved with the implementation of semi-supervised or unsupervised learning techniques such as one-class SVM, k-means clustering and Isolation Forests.
Research can also be expanded to identify optimum classification thresholds, that is, cut-off points that maximize the recall score while finding the right balance between precision and recall. This could also yield potentially good results.
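One simple form such a threshold search could take is sketched below: every distinct predicted score is tried as a cut-off, and the cut-off with the highest recall whose precision still meets a chosen floor is kept. The scores, labels, precision floor and function names are all hypothetical; this is an illustration of the idea, not a method evaluated in this study.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when every score >= threshold is flagged as fraud."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_threshold(scores, labels, min_precision):
    """Try each distinct score as a cut-off; keep the highest recall that meets the precision floor."""
    best = None
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall_at(threshold, scores, labels)
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best


# Hypothetical predicted fraud probabilities and true labels (1 = fraud).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

best = best_threshold(scores, labels, min_precision=0.8)
print("threshold, precision, recall:", best)
```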