An Efficient Domain-Adaptation Method using GAN for Fraud Detection

In this paper, an efficient domain-adaptation method is proposed for fraud detection. The proposed method employs the discriminative characteristics of feature maps and generative adversarial networks (GANs) to minimize the deviation that occurs when a common feature is shifted between two domains. To solve the class imbalance problem and increase the model’s detection accuracy, new data samples are generated by applying a GAN-based minority-class data augmentation method. We evaluate the classification performance of the proposed domain-adaptation model by comparing it against support vector machine (SVM) and convolutional neural network (CNN) models, using standard classification performance indicators. The experimental results indicate that the proposed model is applicable to both test datasets and requires less time for learning. Although the SVM offers better detection performance than the CNN and the proposed domain-adaptation model, its learning time exceeds those of the other two models as the dataset grows. Although the detection performance of the CNN-based model is similar to that of the proposed domain-adaptation model, its learning process is longer. In addition, although the GAN used to solve the class imbalance problem of the two datasets requires slightly more time than SMOTE (synthetic minority oversampling technique), it yields better classification performance and is effective for datasets featuring class imbalances.

Keywords—Fraud detection; domain adaptation; data augmentation; deep learning; GAN


I. INTRODUCTION
With the rapid development of information technology, the existing financial industry paradigm is changing. The new paradigm, following the evolution of smartphones and mobile technologies, is creating new forms of electronic financial services, increasing the number of non-face-to-face transactions (through the use of various devices and communications technologies), and simplifying and diversifying payment methods. However, alongside these developments, concerns over security incidents (e.g., cyber threats involving the leakage and hacking of financial and personal information) are also increasing, owing to the new approaches facilitated by the Internet, device diversity, transaction simplicity, and ease of data flow. Therefore, the performance of fraud detection systems (FDSs) must be improved to respond actively to these diversified and intelligent cyber threats. Machine- and deep-learning-based technologies, which learn from large quantities of data to improve prediction and classification accuracy, have recently been developed, and research incorporating these technologies to improve FDS performance has increased accordingly. However, the abnormal-transaction-detection methods of existing FDSs, which combine machine- and deep-learning techniques to identify abnormal transactions in large quantities of real-time data, are time-consuming and computationally expensive. Therefore, this study presents a faster-learning abnormal-transaction-detection model, obtained by training a model suitable for data across different domains and utilizing the common features and information found therein. The proposed model for detecting anomalies across different domains is constructed using a domain adaptation method [1]. Domain adaptation is a form of transfer learning [2], a machine-learning approach that utilizes pre-learned information from similar domains when a specific task or domain changes.
The datasets employed with the proposed domain-adaptation method are widely used in research on abnormal-transaction detection; specifically, they are benchmark datasets for credit card fraud [3] and financial transaction fraud [4]. However, because both datasets feature an unbalanced ratio between normal transactions and fraudulent or anomalous ones, the classes must be balanced to improve the machine learning performance and ensure smooth learning. Data augmentation can be used to increase the total number of data samples when a dataset is insufficient; here, it is applied to the minority class using a generative adversarial network (GAN) [5]. The augmented data are used as training/test data for the proposed domain-adaptation model, and the results are compared with those obtained using SMOTE (Synthetic Minority Oversampling Technique) [6], an oversampling method. In summary, a GAN and SMOTE are used to solve the class imbalance problem for the credit-card and financial-transaction fraud datasets; then, the domain-adaptation method is used to implement a model for detecting abnormal transactions in the two datasets; finally, the method's effectiveness is verified by comparing its classification performance against those of support vector machine (SVM) [7] and convolutional neural network (CNN) [8] based methods. The remainder of the paper is organized as follows: in Section II, the background and related research are described; in Section III, the model and datasets employed are described in detail; in Section IV, the experimental environment, learning method, and hyperparameters are described; in Section V, the classification performance of the model is compared and analyzed against those of the SVM- and CNN-based models; and in Section VI, the conclusions and limitations of the research are described, and future research directions are considered.

*Corresponding Author. 94 | P a g e www.ijacsa.thesai.org

II. RELATED WORKS
This section describes existing fraud detection methods, data augmentation approaches, and domain adaptation methods.

A. Fraud Detection
Abnormal transaction detection is a data mining approach used to detect transactions that differ from normal transaction patterns. The detection results are divided into two transaction classes: normal and abnormal. A variety of detection technologies are constantly being studied to minimize the risks posed to users by fraudulent transactions. Studies for abnormal transaction detection include the development of procedures for classification (a field of supervised learning), clustering (a field of unsupervised learning), deep learning and so on. In the existing research on classification-model-based abnormal transaction detection approaches, [9] proposed Very Fast Decision Tree, which can manage unbalanced data using decision trees; [10] employed a hidden Markov model (HMM) to learn a normal credit card transaction, and they classified transactions that were not accepted by the HMM as abnormal; [11] detected abnormal transactions using k-Nearest Neighbors, which offers reduced memory consumption compared to other machine learning methods. Furthermore, [12] proposed a model to detect abnormal transactions and money laundering, by applying an SVM. In addition, deep learning models have been applied to abnormal transaction detection using auto-encoders or GANs as a solution for data unbalancing [13,14]. In addition, a significant number of abnormal detection models have been proposed to increase the accurate detection rate of FDS. In our study, for the classification performance of fraud detection, the proposed domain-adaptation model was evaluated by comparing it with the SVM and CNN models, which are supervised learning-based analytical models.

B. Oversampling
Approaches to solving the data imbalance problem can be divided into four categories: sampling-based, cost-based, kernel-based, and active-learning-based methods [15]. The approach of changing the distribution between the majority and minority classes in unbalanced datasets is a sampling-based method; the distribution balance can be adjusted to reduce the number of data samples in the majority class (undersampling) or to increase the number in the minority class (oversampling). SMOTE is an oversampling method: it generates data between the minority class' data samples by connecting a straight line between them. Majority Weighted Minority Oversampling Technique [16] identifies minority class data and assigns weights according to the Euclidean distance between them and the nearest data samples in the majority class; then, a clustering approach generates data between the weighted minority class data in the same way as SMOTE. Meanwhile, the Random Oversampling Examples (ROSE) [17] method generates new minority data based on the existing kernel-density estimate; robROSE [18] is an oversampling method that overcomes the shortcomings of ROSE (which can deviate under the influence of outliers). Of the above methods, we used SMOTE to solve the class imbalance problem, because it is easier to implement and understand than other methods and offers excellent performance characteristics.

C. Data Augmentation
Data augmentation, which was first introduced in [19], is a popular method for processing image data; it perturbs the data with noise whilst preserving the information they contain. GANs are suitable models for performing data augmentation. A GAN consists of two artificial neural networks (ANNs) that learn by competing against each other: one is a generator that receives random noise as an input and processes it to resemble the distribution of the original data; the other is a discriminator that distinguishes the original data from those created by the generator. The generator seeks to make the data it produces as indistinguishable from the original data as possible, and the discriminator, in opposition to the generator, tries to classify the two types of data with the highest possible probability. As a result, data that pass through a network consisting of a generator and a discriminator are generated with a distribution similar to that of the original data. By varying the structures and purposes of GANs, researchers have successfully applied them to various fields; in particular, image-data-related research [5,20] has found considerable use for them, and models for increasing their performance and generating new image data have been proposed. Among them, deep convolutional GANs [21] provided guidelines for stable learning, and the Wasserstein GAN (WGAN) [22] improved stability by attributing unsuccessful learning to the limits of the Kullback-Leibler (KL) divergence and redefining the loss function. Most studies (e.g., [23,24,25]) have aimed to improve the network performance for image data. However, some studies have attempted to solve the data imbalance problem using GANs. In particular, the study in [25] applied numerical data, not image data, to a GAN. However, since GANs learn via the gradient descent method, learning problems can occur due to the loss functions [22].
Therefore, in this study, data augmentation was performed for the minority class data samples of each dataset, by applying the loss function of WGAN to alleviate the GAN's limitations and generate datasets more closely resembling the original data. Because the GAN-based minority class data-augmentation method is similar to the oversampling method, it is treated here as an oversampling technique rather than as data augmentation in the strict sense. Therefore, in this study, we use the terms "data oversampling" and "data augmentation" interchangeably.

D. Domain Adaptation
Transfer learning is a machine-learning method that utilizes pre-learned domain information from similar domains when a specific task or domain is changed. The domain in which the transfer learning model previously worked is referred to as the source domain, and the new one is referred to as the target domain. Depending on the presence or absence of labels in each domain, transfer learning is primarily divided into multi-task learning [26], in which the class exists only in the target domain; self-taught learning [27], in which the class exists in the source domain but no classes exist in the target domain; and domain adaptation [1], in which the class exists in both domains. In this study, we consider a domain adaptation model to detect anomalies between different domains. Regarding domain adaptation [28], several previous studies [29,30,31] have focused on minimizing the differences between the source and target domain feature-map distributions; most of these have used the maximum mean discrepancy [32] loss function. Deep Correlation Alignment [29] matches the mean and covariance of the two distributions. In [31], the addition of a fully connected layer to the domain adaptation model was proposed, and a method was derived to determine the resulting value of the binary label and approximate the uniform distribution via the domain confusion loss. ReverseGrad [30], a gradient-reversal algorithm, calculates the gradient in the reverse direction when deriving the loss function in the network; it has exhibited faster learning than comparable methods. In addition to [30], a study investigating methods of reconstructing images in the target domain was also presented in [31]. In [33], probabilities were used to learn the distribution between the two domains, and the distance between data within the same class across the two domains was expressed as a probability; learning was conducted to maximize this probability.
Adversarial Discriminative Domain Adaptation (ADDA) [34] applied the loss function used in the discriminator of a GAN to match the distributions between the two domains, thereby enabling more effective learning. This method has the advantage of being able to interact with other domain-adaptation models. In this study, considering these advantages, ADDA was applied to the abnormal transaction detection model to facilitate interactions between similar domains.

III. METHODOLOGY
This section describes the set of approaches used for fraud detection in FDSs. Section A describes the experimental datasets used in this study. Sections B and C describe the oversampling and data augmentation used to solve class-imbalance problems in learning. A GAN model was used to augment the minority class by creating new samples, and it was compared with SMOTE, which was used for data oversampling. Section D presents the proposed domain adaptation method, which is capable of evaluating classification performances on two datasets of similar domains. Fig. 1 shows the simplified overall structure of the model proposed in this study, and Fig. 2 illustrates the flow of this structure.

A. Dataset
The credit card dataset employed here consists of data collected by the Machine Learning Group [3] and Worldline. The dataset contains a total of 284,315 normal and 492 abnormal transaction data samples. Owing to security issues (e.g., financial and personal information leaks), the test was conducted using a total of 30 variables. Similarly, the financial transaction dataset is an artificial dataset (again owing to security issues) based on actual data. This dataset contains simulation results obtained through PaySim [4], using real financial transaction samples taken over a period of one month; it consists of a total of 11 variables and includes 6,354,407 normal and 8,213 abnormal transaction data samples. Unlike the credit card fraud dataset, this dataset was processed via min-max normalization before being used as input data in this work.
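The min-max normalization mentioned above can be illustrated with a minimal sketch. The function name `min_max_normalize` and the three-column toy transactions are assumptions made for illustration only; they are not taken from the PaySim data or from the authors' code.

```python
def min_max_normalize(rows):
    """Scale every column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column carries no information; map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical transactions: (amount, old balance, new balance).
transactions = [
    [100.0, 500.0, 400.0],
    [2500.0, 3000.0, 500.0],
    [50.0, 50.0, 0.0],
]
normalized = min_max_normalize(transactions)
```

After scaling, the smallest value in each column maps to 0 and the largest to 1, so variables with very different ranges contribute comparably during training.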

B. Data Oversampling
SMOTE oversamples the minority class data when class imbalances occur; in this study, it was adopted as the oversampling method because it delivers strong performance whilst also being theoretically simple and easy to implement. First, SMOTE takes a sample of the minority class and finds its k nearest neighbors within that class. Next, the difference between the current sample and one of these k neighbors is obtained, multiplied by a random value (between 0 and 1), and added to the original sample to generate a new data point. Each synthetic sample therefore lies on the line segment between a minority sample and one of its neighbors; the existing data are not modified. In this study, SMOTE was implemented using the imbalanced-learn Python library [35]. The oversampled data were tested with ratios of 0.3:1, 0.5:1, 0.7:1, and 1:1 between the minority and majority classes.
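The interpolation step described above can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not the imbalanced-learn code used in the study: the function name `smote`, its parameters, and the toy data are hypothetical, and the number of synthetic samples is chosen so that the minority class reaches `ratio` times the majority class size.

```python
import math
import random


def smote(minority, n_majority, ratio, k=5, seed=0):
    """Return synthetic minority samples so that the minority class
    (original + synthetic) reaches about ratio * n_majority samples."""
    rng = random.Random(seed)
    n_new = max(0, int(ratio * n_majority) - len(minority))
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random value in [0, 1)
        # New point on the segment between the base sample and the neighbour.
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy example: 4 minority samples, 20 majority samples, target ratio 0.5:1.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote(minority, n_majority=20, ratio=0.5, k=3)
print(len(minority) + len(synthetic))  # prints 10
```

Because every synthetic point is a convex combination of two existing minority samples, the generated data stay inside the region already occupied by the minority class.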

C. Data Augmentation using GANs
In existing GANs, several problems can arise when training via the gradient descent method [22]. First, if the discriminator makes an incorrect judgment, the generator does not receive accurate feedback and cannot learn properly. Second, if the discriminator makes a very accurate judgment, the gradient of the loss function quickly converges to 0, which significantly slows or disturbs learning. Because of these two problems, existing GANs are limited. WGANs compensate for these shortcomings: the KL divergence, which is used to define the loss function in existing GANs, is replaced by the Wasserstein distance (also referred to as the Earth mover's distance), an index that measures the distance between two probability distributions. Under the KL divergence, the value is 0 when the two distributions coincide, and it is infinite or constant when they do not overlap, giving an extreme value that provides no useful gradient. The Wasserstein distance can be readily applied in training because it remains finite and varies continuously regardless of whether the distributions overlap. WGANs therefore redefine the loss function using the Wasserstein distance, so that training proceeds smoothly and the generated data come to resemble the existing data as closely as possible. Accordingly, in this study, oversampling was performed by using the WGAN loss function within a general GAN model and inputting the minority class of the original data. The structure of the GAN-based data oversampling model is shown in Fig. 3. Although it has an identical structure to the general GAN, the potential problems of the existing GAN are addressed by applying the WGAN theory and loss function. For each epoch, random noise z is fed into the generator to generate fake data, and the fake data are merged with the abnormal transaction data (the minority class) from the original dataset.
The random noise is expressed as a vector of the size to be generated, and the combined data are input to the discriminator, which attempts to distinguish the original data from the fake data (generated by the generator) and classify them as either real (1) or fake (0). Using the discriminator's classification results, the generator applies its loss function to minimize the classification probability, whereas the discriminator seeks to maximize it. The loss functions are expressed as Eqs. (1) and (2).
Equations (1) and (2) are the loss functions applied to the discriminator and generator, respectively. Here, ω is the parameter of the discriminator, and ∇ω is the gradient with respect to ω. Likewise, θ is the parameter of the generator, and ∇θ is the gradient with respect to θ; x is the original data, z is the random noise, and G is the generator. These loss functions differ from those of existing GANs, and the purpose of the discriminator also differs. Instead of acting as a direct criterion for identifying the fake data generated by the generator, the discriminator learns a K-Lipschitz continuous function, which is used to calculate the Wasserstein distance. In this process, as the loss function decreases, the Wasserstein distance becomes smaller and the fake data generated by the generator approach the actual data distribution [22].
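The two loss expressions are not reproduced in this text. As a hedged reconstruction, and assuming they follow the standard WGAN updates of [22], the gradients for the discriminator and the generator can be written as below. The symbols f_ω (the discriminator, or critic, with parameters ω), G_θ (the generator with parameters θ), and m (the mini-batch size) are notation assumed here for illustration; the authors' exact form may differ.

```latex
\begin{align}
  g_\omega &\leftarrow \nabla_\omega \left[
      \frac{1}{m}\sum_{i=1}^{m} f_\omega\!\left(x^{(i)}\right)
    - \frac{1}{m}\sum_{i=1}^{m} f_\omega\!\left(G_\theta\!\left(z^{(i)}\right)\right)
  \right] \\
  g_\theta &\leftarrow -\nabla_\theta \,
      \frac{1}{m}\sum_{i=1}^{m} f_\omega\!\left(G_\theta\!\left(z^{(i)}\right)\right)
\end{align}
```

In this formulation, the discriminator parameters ω are updated by gradient ascent on the first expression (with the weights constrained so that f_ω stays K-Lipschitz), and the generator parameters θ are updated by gradient descent using the second.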
For oversampling, the WGAN loss function was applied in a GAN. Only the data in the minority classes were selected and input to the model; through the interaction of the generator and discriminator, the data generated from the random noise came to follow the distribution of the input data. Finally, when the probability of distinguishing between the input and generated data converged to 0.5, training was terminated, and the generated data were combined with the input data to resolve the original data imbalance. The proportions of generated data and random noise were determined by adjusting the ratio according to the quantity of original data. For the data oversampled through SMOTE, the amount of minority class data was determined according to the sampling strategy applied to the original data. If the sampling strategy was 1, the [minority class: majority class] ratio became [1:1]; if the sampling strategy was 0.5, it became [0.5:1]. Therefore, to generate GAN oversampling results similar to the data processed through SMOTE, the amount of random noise z was set to (0.3, 0.5, 0.7, 1) times the size of the majority class.
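The contrast between the KL divergence and the Wasserstein distance that motivates the WGAN loss in this section can be checked on a toy example. The sketch below is illustrative only: it assumes discrete histograms on unit-spaced bins, and the helper names (`kl_divergence`, `wasserstein_1d`, `point_mass`) are hypothetical rather than part of the authors' implementation.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


def wasserstein_1d(p, q):
    """Earth mover's distance between two histograms on unit-spaced bins."""
    distance, carried = 0.0, 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi  # mass that must be moved past this bin
        distance += abs(carried)
    return distance


def point_mass(position, n_bins=10):
    """Histogram with all probability mass in a single bin."""
    return [1.0 if i == position else 0.0 for i in range(n_bins)]


real = point_mass(0)
for shift in (0, 1, 5, 9):
    fake = point_mass(shift)
    print(shift, kl_divergence(real, fake), wasserstein_1d(real, fake))
```

As soon as the two point masses stop overlapping, the KL divergence jumps to infinity and stays there, whereas the Wasserstein distance equals the shift itself, so it still indicates how far the generated distribution is from the real one.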

D. Domain Adaptation for Fraud Detection
Detecting abnormal transactions is a time-consuming and expensive process when different models are used for two datasets of similar domains. Therefore, to develop a single model capable of detecting abnormal transactions from two datasets, we applied a domain-adaptation method which employs the discriminative characteristics of GANs, such as those used in ADDA [34]. Whereas ADDA was applied to image datasets, the proposed domain-adaptation method was applied to text datasets. In addition, in our study, the text datasets were augmented to avoid class imbalance problems. The domain-adaptation model used in this study was composed of source and target encoders that employed CNNs, as shown in Figs. 4 and 5, respectively. Each encoder consisted of a 1D convolution layer (Conv1d), max pooling, and a fully connected layer. The convolution layer was used because it can readily extract feature maps and does not require any further layer (e.g., recurrent neural networks) for time-independent datasets. In addition, a CNN was used because these networks outperform ANNs in terms of time and performance efficiency. Two convolutional layers and two max pooling layers were used to prevent unstable learning or overfitting from occurring when adjusting the hyperparameters to match the feature maps. The model first trained a source encoder and classifier using the credit card fraud dataset (source domain). The loss function applied to the source encoder is a classification loss, in which C is the classifier and the remaining terms are the source encoder, the credit card dataset, and the credit card dataset class. Next, the financial transaction fraud dataset (target domain) was input to the CNN-based target encoder. The learning proceeded by labeling the output of the target encoder as 1 and inputting it to the discriminator. In other words, when the discriminator receives the output of the target encoder, the learning proceeds in the direction in which the result value becomes 1.
In the target encoder's loss function, D is the discriminator, and the remaining terms are the target encoder and the financial transaction dataset. The discriminator learns the distribution by labeling the output value of the source encoder as 1 (real) and the output value of the target encoder as 0 (fake), to properly distinguish between normal and fraudulent data; it then applies its own loss function.
The entire learning process optimizes the loss functions described above, operating in a stepwise fashion. Based on the credit card fraud dataset (including the class information), the source encoder and classifier learn first, followed by the target encoder and discriminator. The source encoder is kept fixed whilst the target encoder and discriminator are being trained; thus, the learning of the target encoder and discriminator can proceed smoothly, without checking the state of the source encoder and classifier. Fig. 6 illustrates the overall structure of the domain-adaptation model introduced in this study; the components drawn with solid lines indicate a state in which learning is completed, and the components drawn with dotted lines indicate that learning is taking place. Thus, the entire test process is as follows. First, the source encoder and classifier are trained on the source domain, and the discriminator and target encoder are then trained from the source encoder and the target domain. Finally, the proposed domain-adaptation model terminates the process when the target and source encoders can completely derive the classification results of the target and source domains, respectively.
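The three loss functions referred to in this section are not reproduced in this text. As a hedged sketch, and assuming they follow the ADDA objectives of [34], they can be written as below. The symbols M_s and M_t (source and target encoders), X_s and Y_s (credit card data and its classes), X_t (financial transaction data), and K (number of classes) are notation assumed here for illustration; only C and D are named in the text.

```latex
\begin{align}
  \min_{M_s,\,C}\; \mathcal{L}_{\mathrm{cls}}
    &= -\,\mathbb{E}_{(x_s,\,y_s)\sim(X_s,\,Y_s)}
       \sum_{k=1}^{K} \mathbb{1}_{[k = y_s]} \log C\!\left(M_s(x_s)\right) \\
  \min_{M_t}\; \mathcal{L}_{\mathrm{adv}_M}
    &= -\,\mathbb{E}_{x_t\sim X_t} \log D\!\left(M_t(x_t)\right) \\
  \min_{D}\; \mathcal{L}_{\mathrm{adv}_D}
    &= -\,\mathbb{E}_{x_s\sim X_s} \log D\!\left(M_s(x_s)\right)
       -\,\mathbb{E}_{x_t\sim X_t} \log\!\left(1 - D\!\left(M_t(x_t)\right)\right)
\end{align}
```

These forms match the labeling described in the text: the target encoder is trained so that the discriminator outputs 1 for its features, while the discriminator is trained to output 1 for source-encoder features and 0 for target-encoder features.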

E. Evaluation
The test results were evaluated using the area-under-curve (AUC) score, a performance evaluation index for classification models. It is computed from the receiver operating characteristic (ROC) curve, a performance measure commonly used in binary classification and medical applications. Table I shows the confusion matrix; here, True (T)/False (F) indicates that the predicted value is the same as/differs from the actual value, and Positive (P)/Negative (N) indicates the predicted class. The ROC curve plots the true-positive rate (TPR) against the false-positive rate (FPR), and the AUC score is the area underneath this curve. The AUC score of a model whose predictions are 100% incorrect is 0.0, and that of a model whose predictions are 100% correct is 1.0; the performances of the models used in this study were evaluated accordingly.
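The relationship between the ROC curve and the AUC score can be made concrete with a short sketch. The function `roc_auc` below is an illustrative trapezoidal-rule implementation with hypothetical labels and scores; it is not the evaluation code used in the study and assumes that both classes are present in the input.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the trapezoidal rule.

    labels: 1 = fraud (positive), 0 = normal (negative).
    scores: higher means more likely to be fraud.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sweep the decision threshold from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    auc = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples sharing a score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
    return auc


# Hypothetical predictions for four transactions.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # prints 0.75
```

A scorer that ranks every fraudulent transaction above every normal one reaches 1.0, one that ranks them all below reaches 0.0, and a scorer that cannot separate the classes at all sits at 0.5.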

IV. EXPERIMENTS
To evaluate the classification performance of the proposed domain-adaptation model, an SVM and a CNN were employed as the comparison machine-learning and deep-learning methods, respectively. Among machine learning methods, the SVM has received particular attention for its excellent performance. It is a supervised learning model mainly used for pattern recognition and data analysis (in particular, classification and regression). Here, because both the credit card and financial transaction datasets have class labels, the SVM was used to detect abnormal transactions. The SVM uses a radial basis function kernel. After testing values from 1 to 10,000, the hyperparameter C was set to 1000, which was found to deliver the best trade-off between time and accuracy. The compositions of the source and target encoders in the proposed domain-adaptation model are shown in Figs. 4 and 5. The source encoder sets the filter, kernel size, strides, and activation function as shown in Fig. 4; the feature map (which undergoes max pooling after the CNN layer) passes through the fully connected layer. The output of the fully connected layer is passed to the classifier, to derive the classification result. The target encoder sets the number of strides to 2, to derive an output value with the same shape as the output value of the source encoder; the other parameters (i.e., filter, kernel size, and activation function) are set identically to those of the source encoder. In addition, to prevent overfitting, dropout was applied to the fully connected layer, with a ratio of 0.5.
The loss function of the classifier was calculated from the softmax cross-entropy, and the loss function of the discriminator was calculated using the sigmoid binary cross-entropy and optimized through the Adam optimizer (learning rate = 0.0001, beta 1 = 0.5, beta 2 = 0.99). The CNN model used the source encoder, target encoder, and classifier of the domain adaptation model. The credit card fraud data were used as the input data of the source encoder, and the financial transaction fraud dataset was used as the input data of the target encoder, to compare the classification results. The number of nodes of the hidden layer used in the GAN-based oversampling method was set to 128, the number of epochs was set to 20, and the Adam optimizer was set identically to that of the domain-adaptation model.
Table II compares the classification performance results of the SVM, CNN, and proposed domain-adaptation models. The experiment was conducted, and the classification performance results were averaged using only values above 0.8. The ratio in Table II expresses the ratio between the minority and majority classes when augmenting or oversampling a dataset; in other words, if the size of the majority class is taken as 1, a quantity of data equal to the ratio is generated to oversample the minority class. Table III shows the time taken for each model to receive data, train on it, and derive its classification results. Table IV shows the time taken to oversample each dataset with GAN and SMOTE, respectively. Fig. 7 compares the performances of the GAN- and SMOTE-based oversampling methods. The left-hand and right-hand graphs describe results for the credit card and financial transaction fraud datasets, respectively; the x-axis denotes the ratio mentioned in Table II. The AUC scores on the y-axis represent the averaged classification performances for all methods; the GAN-based oversampling method takes slightly longer than SMOTE to complete, but it exhibits superior performance (as shown in Fig. 7). The left-hand graph in Fig. 8 shows the average classification performance for the dataset to which the GAN-based oversampling method was applied. The right-hand graph shows the time-averaged values of the GAN-based oversampling method in Table III. In Fig. 8

VI. CONCLUSIONS
In this study, a domain-adaptation method, applicable to data in similar domains, was proposed. The model to which the proposed domain-adaptation method was applied has the advantage of minimizing domain shifts when the domains are similar, even if the dataset has changed. In the experiments, credit card and financial transaction fraud datasets were used to evaluate the model's performance. Both datasets had a class imbalance problem; thus, oversampling was conducted using GAN and SMOTE, and the resulting data were used as input data of the model. Moreover, the classification performance was compared against those of an SVM and a CNN to evaluate the model's performance. Although the proposed domain-adaptation model did not achieve better classification performance than the SVM or CNN, its performance was comparable thereto, while requiring a shorter learning time. Moreover, the GAN-based oversampling method, which was used to solve the class imbalance problem, outperformed SMOTE. Although the CNN showed similar classification performance to the domain-adaptation model, it required a longer learning time. The SVM had high classification performance; however, it required a comparatively longer learning time than the CNN when the dataset size was increased. As a result, the proposed domain-adaptation model was shown to be capable of simultaneously classifying two datasets with similar domains and shortening the learning time compared to the SVM and CNN. However, there are several limitations to this study, which should be addressed in the future: both datasets were constructed using CNN models, to smoothly reuse the feature maps; the classification performance was insufficient compared to that of the SVM; and various domain data and results were absent.
Therefore, in future research, structural changes will be made to the oversampling method proposed in this study, to make use of the various abnormal transaction data (including time-series data) and judge the performance of the model more objectively.