Adaptive Generation-based Approaches of Oversampling using Different Sets of Base and Nearest Neighbor’s Instances

Abstract: Standard classification algorithms often struggle to learn from imbalanced datasets. Although several approaches have been proposed to address this problem, methods that oversample the minority class remain more widely used than algorithmic modifications. Most oversampling variants derive from the Synthetic Minority Oversampling Technique (SMOTE), which generates synthetic minority samples at points in the feature space between two minority class instances. These variants produce different results mainly because of (1) the samples they use as the initial selection (base samples) and as nearest neighbors, and (2) how they handle minority noise. This paper therefore presents combinations of base and nearest-neighbor samples that have not been used before and examines their effect in comparison with the standard oversampling techniques. Six methods are proposed: three Only Danger Oversampling (ODO) techniques and three Danger Noise Oversampling (DNO) techniques. The ODO and DNO methods use different groups of samples as base samples and nearest neighbors. The three ODO methods do not consider minority noise, whereas the three DNO methods include minority noise in both the base and neighbor samples. The performance of the proposed methods is compared with that of several standard oversampling algorithms. Experimental results demonstrate a significant improvement in the recall metric.


I. INTRODUCTION
One of the most challenging machine learning problems for both academia and industry over the last couple of decades is learning from imbalanced data [1]. This problem arises in both binary and multiclass classification tasks when instances of one class, known as the majority class, occur more frequently than instances of the other classes, known as the minority classes [2]. This disproportion in the distribution of instances across classes biases the classifier significantly towards the majority class, which in turn results in the misclassification of instances of the other classes [3]. What makes the class imbalance problem more interesting is that the minority class is often the class of interest in most real-life application domains; thus, the cost of misclassifying the minority class is often higher than that of the majority class [4,5]. For instance, in a machine learning fraud detection system, legitimate transactions occur more often than fraudulent ones, but the cost of misclassifying a fraudulent transaction as legitimate is greater than the opposite. Therefore, approaches to the class imbalance problem aim to increase the accuracy and sensitivity of the classifier on the minority class.
Approaches to the class imbalance problem can broadly be grouped into two categories [6]. The first category entails creating or modifying algorithms to improve learning of the minority class samples. The second and most widely used category, data-level methods, resamples the data to balance the distribution across classes via oversampling, undersampling, or a hybrid of the two. This paper focuses on oversampling methods that generate synthetic data samples to augment the minority class. A leading oversampling method that serves as the basis for most recent oversampling methods is the Synthetic Minority Over-sampling Technique (SMOTE) algorithm [2]. SMOTE generates artificial samples along the line segment joining neighboring minority class samples.
SMOTE has also inspired several approaches to counter class imbalance, and it is a standard benchmark for learning from imbalanced data [7]. Several SMOTE-based techniques have been proposed in the literature, and they have been categorized according to the following properties: (1) initial selection of the instances to be oversampled (technically called base samples), (2) integration with undersampling as a step in the technique, (3) type of interpolation, (4) operation with dimensionality changes, (5) adaptive generation of synthetic examples, (6) possibility of relabeling, and (7) filtering of noisy generated instances.
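The interpolation that SMOTE performs can be illustrated with a short, library-free sketch. It is not the reference implementation from [2]; the function name `smote_sample`, the toy points, and the choice of Euclidean distance are assumptions made for illustration.

```python
import math
import random

def smote_sample(base, minority, k=5, rng=random):
    """Create one synthetic point between `base` and one of its k nearest minority neighbors."""
    # Candidate neighbors are all minority samples except the base sample itself.
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # random position along the segment, uniform in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))

# Hypothetical 2-D minority samples.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.9)]
rng = random.Random(0)
synthetic = [smote_sample(minority[0], minority, k=2, rng=rng) for _ in range(3)]
```

Every generated point lies on a segment that starts at the base sample and ends at one of its nearest minority neighbors, which is why the choice of base and neighbor groups matters so much for the variants discussed in this paper.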
Each SMOTE-based extension may differ in the aforementioned properties. However, many of them share three common aspects, among them initial selection and type of interpolation (most commonly 'range restricted'). The most common standard technique that uses initial selection and 'range restricted' interpolation is SMOTE_BORDERLINE [8]. This research therefore started by adopting the same initial selection of instances to be oversampled as SMOTE_BORDERLINE. The common standard technique that uses adaptive generation of synthetic examples is ADASYN [9], and this is also adopted in our proposed techniques. The minority class examples have been classified into three groups, namely safe, danger, and noise, according to their level of difficulty [8,10,11,12].
Similarly, the main difference between the three DNO methods is the criterion for choosing the nearest neighbors (NN). In DNO1, the NN group is the minority class; in DNO2, the NN group is the same as the base examples, which consist of the Danger and Noise examples; and in DNO3, the NN group consists of all classes (minority and majority). Table I shows how each proposed method differs from the standard techniques (SMOTE, Borderline1, Borderline2, and ADASYN). In addition, this study adds three aspects to clarify the methods: (1) the nearest neighbor group, (2) 'how to choose from NN group', and (3) 'noise considered?'. Hence, the major contributions of this study are the implementation of the proposed methods and a tabular overview (Table I) that details the differences between the methods: the initial selection (base samples), the NN groups, the method of NN selection, the type of interpolation, adaptive generation, and the treatment of minority noise ('noise considered?').
The proposed oversampling techniques were experimentally analyzed using four classification algorithms and evaluation metrics across 15 publicly available datasets from Machine Learning Repositories. The performances of the proposed methods are compared to SMOTE, Borderline SMOTE and ADASYN oversampling methods. In addition, statistical analysis was also carried out using Friedman aligned and Holm's tests.
The organization of this article is as follows. An overview of pertinent studies and oversampling methods is provided in Section II while the procedure of the proposed methods is listed in Section III followed by the experimental design in Section IV. The experimental results and conclusion are respectively presented in Sections V and VI.

II. RELATED WORK
Given that this study focuses on oversampling through synthetic data generation, which is a data-level approach, a short review of related studies is presented here. References [7,13,14] provide in-depth reviews of approaches to resolving class imbalance. The most basic form of oversampling is Random Oversampling, which randomly samples the minority class with replacement until it matches the size of the majority class. A major drawback of this approach is a high likelihood of overfitting, which results from repeatedly exposing the classifier to the same information.
An oversampling approach that sidesteps the challenges of basic random oversampling is SMOTE, which generates synthetic data along the line segment joining neighboring minority class samples. SMOTE generates synthetic samples for every minority sample, including minority noise, which also participates as nearest neighbors. However, when the separation between the majority and minority class clusters is not clear, noisy samples may be generated [2]. Borderline-SMOTE methods [8], on the other hand, aim to avoid producing noisy samples by detecting the boundary instances between the majority and minority classes, which are then used to identify informative minority class samples. Although neither SMOTE-Borderline1 nor SMOTE-Borderline2 generates any samples for minority noise, using that noise as nearest neighbors may generate new samples that lie near the noise or overlap with it. The study in [9] distributes the new synthetic samples according to the level of difficulty, so that the most difficult samples receive more new samples. As a result, however, minority noise receives a large portion of the new synthetic samples.
From the above, it is clear that the methods vary in how they handle the base and nearest-neighbor samples. Some give minority noise the advantage of being more heavily represented in the new samples, while others ignore it completely. However, other sample groups have not yet been explored; therefore, different groups of base and nearest-neighbor samples need to be investigated.

III. PROPOSED METHODS' PROCEDURE
Suppose that the whole training set is $X$, the minority class is $P$, and the majority class is $N$, with $P = \{p_1, p_2, \ldots, p_{pnum}\}$ and $N = \{n_1, n_2, \ldots, n_{nnum}\}$, where $pnum$ and $nnum$ are the numbers of minority and majority examples, respectively. The detailed procedure of ODO1 is explained in Fig. 1.
As mentioned above, ODO1, ODO2, and ODO3 differ in their NN groups. The difference between the ODO and DNO techniques is that the DNO methods add minority noise to both the base samples and the NN samples, as shown in Table I. Further, when the NN is from the majority class, the difference between the base example and its nearest negative example is multiplied by a random value between 0 and 0.5, as in SMOTE_Borderline2 [8].
www.ijacsa.thesai.org
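The rule for majority-class neighbors can be sketched as follows. This is an illustrative reading of the sentence above, not the authors' code: the function name `interpolate`, the flag `neighbor_is_majority`, and the use of the full range of 0 to 1 for minority neighbors (as in SMOTE) are assumptions.

```python
import random

def interpolate(base, neighbor, neighbor_is_majority, rng=random):
    """Place a synthetic point on the segment from `base` towards `neighbor`."""
    # Assumption: a majority neighbor limits the random factor to 0..0.5, so the
    # new point stays in the half of the segment closer to the minority base sample.
    upper = 0.5 if neighbor_is_majority else 1.0
    gap = rng.uniform(0.0, upper)
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]

rng = random.Random(42)
# Hypothetical base sample at the origin and a neighbor two units away.
near_majority = [interpolate([0.0, 0.0], [2.0, 0.0], True, rng) for _ in range(100)]
near_minority = [interpolate([0.0, 0.0], [2.0, 0.0], False, rng) for _ in range(100)]
```

With a majority neighbor the generated points never pass the midpoint of the segment, which limits how far synthetic minority samples reach into the majority region.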

IV. EXPERIMENTAL DESIGN
The performance of the proposed methods is evaluated on 15 benchmark imbalanced datasets with varying imbalance ratios (IR) from machine learning repositories (UCI, Kaggle, Keel, Datahub), which is common practice in class imbalance learning. Table II summarizes the 15 datasets. The proposed oversampling techniques were evaluated and compared with SMOTE, SMOTE_Borderline1, SMOTE_Borderline2, and ADASYN. Since accuracy has been shown in representative works to be an insufficient evaluation metric for imbalanced datasets, Recall and F1-measure are employed in this study. The four classifiers considered for evaluation are Decision Trees (DT) [15], Logistic Regression (LR) [16], Random Forest (RF) [17], and Support Vector Machine (SVM) [18].
For each combination of dataset, classifier and evaluation metric, an aligned ranking score is used to rank each oversampling method including the baseline. In addition to the 10 oversampling algorithms considered in this study, the performance of the classifiers on the original dataset without oversampling is also used as the baseline.
Thus, the best performing method has the largest ranking score, while the smallest ranking score indicates the worst performing method. Two statistical tests, the Friedman aligned ranks test and Holm's test, were also used to establish the significance of our findings. The Friedman aligned ranks test detects differences in outcomes across multiple measurements when the normality assumption may not hold, while Holm's test is a nonparametric version of the t-test used to establish whether a control method outperforms the comparative methods.
To evaluate the performance of the classifiers on each dataset and method, stratified k-fold cross-validation was applied with k = 5. Each oversampling method is applied only to the training portion of each fold, and the resulting model is tested on the corresponding test fold [19]. The reported results are the mean validation performance. Data leakage occurs when the data used to train a machine learning algorithm contains the information that the model is trying to predict [20]. Therefore, to prevent leakage, data preparation was performed within the cross-validation folds.
Fig. 1. Only Danger Oversampling (ODO1) algorithm:
Step 1. Extract $X_{min}$, the set of minority samples.
Step 2. Define $m_{min}$ and $m_{maj}$ as the number of minority class examples and the number of majority class examples, respectively, so that $m_{min} \le m_{maj}$ and $m_{min} + m_{maj} = |X|$.
Step 3. Calculate the degree of class imbalance: $d = m_{min} / m_{maj}$, where $d \in (0, 1]$.
Step 4. Calculate the number of synthetic data examples that need to be generated for the minority class: $G = (m_{maj} - m_{min}) \times \beta$, where $\beta \in [0, 1]$ is a parameter that specifies the desired balance level after generation of the synthetic data. $\beta = 1$ means a fully balanced dataset is created after the generation process.
Step 5. Determine the three minority groups (Noise, Danger, Safe).
Step 6. Find the K nearest neighbors (K = 5) of each example $x_i$ in the danger group, searching the whole training dataset $X$.
Step 7. Calculate the ratio $r_i = \Delta_i / K$, where $\Delta_i$ is the number of examples among the K nearest neighbors of $x_i$ that belong to the majority class; therefore $r_i \in [0, 1]$.
Step 8. Normalize $r_i$ according to $\hat{r}_i = r_i / \sum_j r_j$, so that $\hat{r}_i$ is a density distribution ($\sum_i \hat{r}_i = 1$).
Step 9. Calculate the number of synthetic data examples that need to be generated for each minority_danger example $x_i$: $g_i = \hat{r}_i \times G$, where $G$ is the total number of synthetic data examples that need to be generated for the minority_danger class.
Step 10. Determine the minority group without noise: $X_{min\_no\_noise} = X_{min} - \text{Noise}$.
Step 11. Find the K nearest neighbors (K = 5) of each example $x_i$ in the danger group, searching only $X_{min\_no\_noise}$. This step guarantees that no minority noise is used as a NN.
Step 12. For each minority_danger example $x_i$, generate $g_i$ synthetic data examples by repeating the following loop from 1 to $g_i$:
(i) Choose a minority data example $x_{zi}$ at random from the nearest neighbors of $x_i$.
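The ODO1 steps of Fig. 1 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The function name `odo1` and the toy data are hypothetical. The Noise/Danger/Safe rule is borrowed from Borderline-SMOTE [8] (noise when all K neighbors are majority, danger when at least half are). The ratio in Step 7 is taken as the ADASYN-style ratio of majority neighbors to K, the values g_i are rounded to integers, and the interpolation that follows sub-step (i) is assumed to be a SMOTE-style random point between the base sample and the chosen neighbor.

```python
import math
import random

def odo1(X, y, minority_label=1, k=5, beta=1.0, seed=0):
    """Return synthetic minority points generated around the 'danger' samples."""
    rng = random.Random(seed)
    min_idx = [i for i, lab in enumerate(y) if lab == minority_label]  # Step 1
    m_min, m_maj = len(min_idx), len(y) - len(min_idx)                 # Step 2
    G = (m_maj - m_min) * beta                                         # Step 4

    def knn(i, candidates):
        # Indices of the k candidates closest to X[i], excluding i itself.
        others = [j for j in candidates if j != i]
        return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]

    # Steps 5-6: count majority points among the k nearest neighbors in all of X.
    delta = {i: sum(y[j] != minority_label for j in knn(i, range(len(X))))
             for i in min_idx}
    # Assumed grouping rule (Borderline-SMOTE style); remaining samples are 'safe'.
    noise = {i for i in min_idx if delta[i] == k}
    danger = [i for i in min_idx if k / 2 <= delta[i] < k]
    if not danger:
        return []

    # Steps 7-9: assumed ratio r_i = delta_i / K, normalized, then scaled by G.
    r = {i: delta[i] / k for i in danger}
    total = sum(r.values())
    g = {i: round(r[i] / total * G) for i in danger}  # rounding is an assumption

    # Steps 10-11: neighbors are searched among minority samples without noise.
    clean_min = [i for i in min_idx if i not in noise]
    synthetic = []
    for i in danger:                                   # Step 12
        neighbors = knn(i, clean_min)
        for _ in range(g[i] if neighbors else 0):
            z = rng.choice(neighbors)                  # sub-step (i)
            gap = rng.random()                         # assumed SMOTE-style interpolation
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[z])))
    return synthetic

# Hypothetical 1-D data: 15 majority points, 8 minority points (one isolated at 0.5).
majority = [(float(v),) for v in range(-6, 6)] + [(5.7,), (6.2,), (6.4,)]
minority = [(0.5,), (5.4,), (6.6,), (7.0,), (7.3,), (7.6,), (7.9,), (8.2,)]
X = majority + minority
y = [0] * len(majority) + [1] * len(minority)
synthetic = odo1(X, y)
```

In this toy data the isolated minority point at 0.5 is treated as noise, so it receives no synthetic samples and is never used as a neighbor, while the two borderline points at 5.4 and 6.6 share the G = 7 new samples in proportion to how many majority neighbors they have.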
The hyperparameters of the classifiers were tuned on the original datasets with no oversampling (baseline), and the resulting optimal parameters were then used with every oversampling method so that all techniques are compared fairly. The hyperparameters of the oversampling algorithms were left at their default values, except for the number of nearest neighbors k, which was set to 5 in all oversampling techniques because the proposed methods are built on this number of nearest neighbors. The classifiers and standard oversampling algorithms were implemented using the Python modules Scikit-Learn [21] and Imbalanced-Learn [22].
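The fold-wise protocol described in this section, in which oversampling touches only the training portion of each stratified fold, can be illustrated with a library-free sketch. The study itself used Scikit-Learn and Imbalanced-Learn; everything below (`stratified_folds`, `cross_validate`, the stand-in oversampler `duplicate_minority`, the nearest-centroid classifier, and the toy data) is hypothetical and only shows where the oversampling call sits relative to the split.

```python
import random

def stratified_folds(y, k=5, seed=0):
    """Split sample indices into k folds that preserve the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def cross_validate(X, y, oversample, fit, score, k=5, seed=0):
    """Mean test score over k folds; oversampling is applied to training data only."""
    scores = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_res, y_res = oversample(X_train, y_train)  # the test fold is never resampled
        model = fit(X_res, y_res)
        scores.append(score(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return sum(scores) / len(scores)

def duplicate_minority(X, y):
    """Stand-in oversampler: repeat minority samples (label 1) until classes balance."""
    minority = [x for x, lab in zip(X, y) if lab == 1]
    extra = len(y) - 2 * len(minority)
    X_new = [minority[i % len(minority)] for i in range(extra)]
    return X + X_new, y + [1] * len(X_new)

def fit_centroids(X, y):
    """Stand-in classifier: remember the mean of each class (1-D features)."""
    return {lab: sum(x for x, l in zip(X, y) if l == lab) / y.count(lab) for lab in set(y)}

def recall(model, X, y):
    """Recall of label 1 for the nearest-centroid prediction."""
    pred = [min(model, key=lambda lab: abs(x - model[lab])) for x in X]
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, y))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, y))
    return tp / (tp + fn)

# Hypothetical data: 20 majority values (0..19) and 5 minority values (30..34).
X = [float(v) for v in range(20)] + [30.0, 31.0, 32.0, 33.0, 34.0]
y = [0] * 20 + [1] * 5
folds = stratified_folds(y)
mean_recall = cross_validate(X, y, duplicate_minority, fit_centroids, recall)
```

Because the split happens before resampling, no duplicated or synthetic copy of a test sample can end up in the training data, which is the leakage the protocol is meant to prevent.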

V. EXPERIMENTAL RESULT AND DISCUSSION
First, to further explain how the standard oversampling techniques and the proposed methods work, this research visualized how they generate new samples on a synthetic dataset, as shown in Fig. 2, in addition to the detailed description in Table I.
For an imbalanced classification problem, either precision or recall can be chosen as the metric to optimize. Maximizing precision reduces the number of false positive errors, while maximizing recall reduces the number of false negative errors. As a result, precision may be a better fit for classification problems where false positives are the main concern, whereas recall may be more appropriate when false negatives are more important [23]. With a dataset such as Breast Cancer, the concern is recall, so the aim is to reduce false negatives (FN) as much as possible; with a dataset such as spam mail, the focus is on precision, since false positives (FP) must be reduced the most. This study tries to improve recall without hurting precision too much.
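The relationship between these metrics and the error counts can be made concrete with a small worked example. The helper `precision_recall_f1` and the confusion counts below are hypothetical and are not taken from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts of the minority class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts before oversampling: many minority samples are missed (FN).
before = precision_recall_f1(tp=40, fp=10, fn=20)
# Hypothetical counts after oversampling: fewer FN, at the cost of more FP.
after = precision_recall_f1(tp=50, fp=25, fn=10)
```

In this made-up case recall rises from 40/60 to 50/60 while precision falls from 40/50 to 50/75, which is the kind of trade-off the study tries to keep small.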
For each combination of classifier and evaluation metric, the mean rankings of the oversampling approaches over the datasets are shown in Table III. The Friedman aligned test is used to statistically confirm the conclusion, and the results are shown in Table IV. The null hypothesis is rejected at a significance level of 0.05; that is, the oversampling methods do not perform equally in mean rankings for all evaluation metrics. Table V shows that our proposed method DNO3 is always ranked first or second with all classifiers when the metric is recall; therefore, DNO3 is used as the control method in Holm's test to determine whether its result is significant. The adjusted p-values are shown in Table VI. DNO3 ranked as the best method among all techniques in terms of recall, followed by DNO1. The only difference between DNO1 and DNO3 is the NN samples: DNO3 considers all classes in the NN group, whether minority or majority, while DNO1 considers only the minority class. This shows the importance of considering both the minority and majority classes in the nearest neighbors. Among the common standard techniques (SMOTE, SMOTE_BORDERLINE1 (BL1), SMOTE_BORDERLINE2 (BL2), and ADASYN), BL2 gives the best recall results. Comparing the structure of SMOTE_BORDERLINE2 with that of DNO3 shows the importance of including minority noise in the base samples, which SMOTE_BORDERLINE2 does not do, as well as of the weighted distribution of new samples used by DNO3, which creates more new samples for the most difficult samples, unlike SMOTE_BORDERLINE2.
From the above analysis, three factors can affect the detection of the minority class. First, the minority noise and danger samples should be considered in the initial selection (base samples). Second, the minority noise, the danger samples, and the majority samples should all be considered in the nearest-neighbor samples. Third, the new synthetic samples should follow a weighted distribution so that more difficult samples are given more new synthetic samples. These factors can help reduce the false negative (FN) examples, which in turn increases recall.
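The two statistical ingredients used above, aligned ranks and Holm-adjusted p-values, can be sketched as follows. This is an illustrative sketch and not the analysis code behind Tables III to VI: the function names, the score matrix, and the p-values are hypothetical, and ranks are oriented so that a larger rank means a better method, matching the convention stated in Section IV.

```python
def aligned_ranks(scores):
    """scores[d][m] is the metric of method m on dataset d.

    Returns the mean aligned rank of each method (larger means better).
    """
    aligned = []
    for row in scores:
        center = sum(row) / len(row)  # remove the dataset effect
        aligned += [(value - center, m) for m, value in enumerate(row)]
    aligned.sort(key=lambda t: t[0])  # all aligned observations are ranked jointly
    rank_sum = [0.0] * len(scores[0])
    i = 0
    while i < len(aligned):
        j = i
        while j < len(aligned) and aligned[j][0] == aligned[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # tied observations share the mean of ranks i+1..j
        for _, m in aligned[i:j]:
            rank_sum[m] += avg_rank
        i = j
    return [s / len(scores) for s in rank_sum]

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted

# Hypothetical recall scores: 2 datasets (rows) x 3 methods (columns).
ranks = aligned_ranks([[0.5, 0.75, 1.0], [0.25, 0.5, 0.375]])
# Hypothetical unadjusted p-values of a control method against 3 competitors.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

A comparison is declared significant at the 0.05 level when its adjusted p-value is below 0.05; in this made-up example only the first comparison would qualify.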

VI. CONCLUSION
The DNO techniques performed best in recall, and DNO3 in particular outperformed all standard techniques on the recall metric. This study shows the importance of considering minority noise and danger samples, whether as base samples or in the nearest-neighbor group. Furthermore, the majority class samples should be taken into account in the nearest-neighbor group. Finally, the weighted distribution (adaptive generation) of the new samples can help achieve better recall. Future work should consider not only the minority danger and minority noise groups, but also other groups of difficult minority samples, including the minority safe samples.