Novel Oversampling Algorithm for Handling Imbalanced Data Classification

Abstract: Imbalanced data applications currently attract considerable attention from researchers. Application areas include intrusion detection in security, fraud recognition in finance, disease diagnosis in medicine, pilferage detection in electricity, and many more. Imbalanced data applications are categorized into two types: binary and multiclass data imbalance. Unequal distribution of data among classes skews classification performance metrics towards the majority class and causes the minority class to be ignored, which increases the classification error rate. Random Forest Classification (RFC) is the technique best suited to imbalanced datasets. This paper proposes a novel oversampling rate calculation algorithm, Improvised Dynamic Binary-Multiclass Imbalanced Oversampling Rate (IDBMORate). Experimental analysis of IDBMORate on the Page-block (binary) dataset shows that the number of positive class instances increases from 559 to 1118, whereas the negative class remains unchanged at 4913 instances. For the multiclass dataset (Ecoli), IDBMORate produces a consistent result: instances of the minority classes (om, omL, imS, imL) are oversampled while the majority class instances remain unchanged. IDBMORate reduces the neglect of the minority class and oversamples its data without changing the size of the majority class. It thereby reduces the overall computation cost and improves classification performance.


I. INTRODUCTION
Many real-world applications today deal with imbalanced data. They span numerous domains, specifically medical diagnosis, text mining, tracking of financial transactions, telecommunication, and industrial and engineering applications [1,2,3]. These applications attract researchers to resolve the data imbalance challenge. For the rapid development of real-world applications, information management with imbalanced classification is a decisive task. The upcoming needs of this digitized world include technologies that can handle complex unevenness in the distribution of data samples. A variety of functional application areas need to reshape unbalanced, complex, and huge volumes of data by incorporating sampling techniques [4,5,6].
Data sampling methods are popular for addressing class inequality at the data level and generally improve classification results. Existing sampling approaches, however, perform inconsistently when applied to both binary and multiclass imbalanced data applications. Existing work on imbalanced data shows that current oversampling methods generate an excessive number of samples, which diverts classification accuracy towards the majority class [7,8]. Excessive sample generation also increases the computation cost. In present approaches, the oversampling process also alters the size of the majority class and ignores the minority class. Ignoring samples in the minority class leads to the loss of important information, and excessive data generation in the oversampling process leads to overfitting in the majority class. These challenges motivated this research work to derive a novel oversampling algorithm.
Imbalanced data classification biases performance towards the majority class in binary applications, or towards the majority classes in multiclass applications [9]. Traditional approaches tend to show reduced accuracy because of the massive amount of data biased towards the majority [10]. The proposed research work deals with a novel oversampling rate algorithm. In existing studies, the sampling methods that suit the binary imbalance category do not suit multiclass imbalance application domains. The proposed IDBMORate algorithm calculates an oversampling rate that applies dynamically to both binary and multiclass data imbalance and yields enhanced classification performance.
First, the proposed oversampling algorithm handles data oversampling dynamically, so that it applies to both categories. Second, it does not disturb the majority class and focuses only on oversampling the minority class. These two advantages indicate the strengths of the proposed algorithm in terms of lower computation time and enhanced classification performance. The main objective of the paper is to identify imbalanced application areas and study existing sampling techniques. The subsequent objective of this research study is to propose a novel oversampling algorithm that leads to performance improvement. Experimental analysis of the proposed IDBMORate on selected binary and multiclass datasets shows improved performance metrics.

A. Organization of the Paper
The research study in this paper is organized as follows. The next section gives a brief review of the related literature on binary and multiclass imbalanced application domains and the suitability of classifiers. The third section focuses on existing sampling approaches. The fourth section presents the proposed Improvised Dynamic Binary-Multiclass Imbalanced Oversampling Rate (IDBMORate) algorithm and its experimentation. Experimental analysis is carried out on both a binary dataset (Page-block) and a multiclass imbalanced dataset (Ecoli) to verify the dynamicity of the proposed algorithm. The subsequent section deals with the computational results of the proposed IDBMORate. The final section outlines the major advantages and dynamicity of the proposed research work.

B. Research Gap
Existing approaches require excess time and computation cost to generate new data samples for balancing the data. The proposed IDBMORate addresses this research gap by oversampling the minority class without disturbing the majority class, which improves classification performance.

II. IMBALANCED APPLICATION DOMAINS
This section focuses on imbalanced application domains and the suitability of classifiers for binary and multiclass imbalanced application domains [11,12]. It also highlights the issues that arise from data imbalance [13,14].

A. Imbalance Application and Suitability of Classifier
Classification with an Imbalanced Dataset (ID) deals with heterogeneous and other forms of imbalance in massive amounts of data. Fig. 1 depicts the compatibility flow of classifiers depending on the type of massive and streamed data. It shows that traditional classifiers are best suited to balanced datasets [15,16] and that the Random Forest Classifier is best suited to imbalanced data applications [24].

B. Binary and Multiclass Imbalanced Application Domains
Numerous imbalanced applications belong to one of two class types: binary imbalance or multiclass imbalance [17]. Table I lists selected applications with their related application domains and their categorization as binary, multiclass, or both. Binary classification techniques are the most progressive techniques for applications such as medical diagnosis and fault-finding activities in various business domains, whose results always fall into one of two categories of data [18,19,20]. Numerous approaches for the classification analysis of these binary and multiclass imbalanced data applications are discussed in the upcoming sections. Data imbalance approaches work at either the data level or the algorithmic level. At the data level, the approaches are categorized based on the nature of the data [23].

A. Existing Sampling Approaches
Sampling techniques are used to balance distorted data distributions. Fig. 2 depicts the categories of probability and nonprobability sampling techniques [24]. Both strategies have different sampling approaches to balance the dataset. Table II indicates the steps of the simple random sampling technique [22], [23].

Input: Imbalance data of sample size X provided
Step:1 Take input as an imbalanced data set.
Step:2 Distribution of dataset into x number of subsets with equal selection probability.

Table III indicates the steps of the stratified random sampling technique [24].

TABLE III. STRATIFIED RANDOM SAMPLING ALGORITHMIC STEPS

Input: Imbalance data of sample size X provided
Step:1 Take input as an imbalanced Data Set.
Step:3 From each stratum, select a random data sample x.
Step:4 Merge the samples x from each stratum into the overall data sample.

Sample Case:
Game X has a team of 600 girl participants and 400 boy participants. To draw a stratified random sample of 30, 18 girl participants must be selected from the 600 and 12 boy participants from the 400, in proportion to each group's share of the 1000 participants [25].
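The arithmetic of this sample case can be checked with a short sketch. It assumes proportional allocation, which is consistent with the 18 and 12 figures above; the function names `proportional_allocation` and `stratified_sample` are hypothetical and are not taken from the cited work.

```python
import random
from collections import Counter


def proportional_allocation(stratum_sizes, sample_size):
    """Return the number of samples to draw from each stratum.

    Each quota is proportional to the stratum size. Rounding is exact for
    the sample case here; in general the rounded quotas may not add up to
    sample_size exactly.
    """
    total = sum(stratum_sizes.values())
    return {
        name: round(sample_size * size / total)
        for name, size in stratum_sizes.items()
    }


def stratified_sample(records, key, sample_size, seed=0):
    """Draw a proportional stratified random sample from records."""
    rng = random.Random(seed)
    # Split the records into strata according to the chosen attribute.
    strata = {}
    for record in records:
        strata.setdefault(key(record), []).append(record)
    sizes = {name: len(members) for name, members in strata.items()}
    quota = proportional_allocation(sizes, sample_size)
    # Sample without replacement inside each stratum, then merge.
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, quota[name]))
    return sample


participants = [("girl", i) for i in range(600)] + [("boy", i) for i in range(400)]
sample = stratified_sample(participants, key=lambda rec: rec[0], sample_size=30)
counts = Counter(label for label, _ in sample)
print(counts["girl"], counts["boy"])  # 18 12
```

Which individuals are drawn depends on the seed, but the per-stratum counts are fixed by the allocation.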

Input: Imbalance data of sample size X provided
Step:1 Take input as an imbalanced data set.
Step:2 Stage I sampling is based on one data attribute as selection criteria for all data samples provided in the data set.
Step:3 Stage II sampling is based on another data attribute as selection criteria for all data samples.
Sample case: Compilation of region-wise voters list based on numerous attributes like city, gender, etc. [26,27].
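One possible reading of the two-stage steps above is sketched below, using the voter-list case. The interpretation is an assumption: stage I randomly keeps some groups of the first attribute (city), and stage II samples within each kept group by the second attribute (gender). The function `two_stage_sample`, its parameters, and the toy voter records are hypothetical.

```python
import random


def two_stage_sample(records, stage1_key, stage2_key, n_groups, per_subgroup, seed=0):
    """Two-stage sampling over a list of dict records (illustrative only)."""
    rng = random.Random(seed)

    # Stage I: group records by the first attribute and keep n_groups at random.
    groups = {}
    for record in records:
        groups.setdefault(record[stage1_key], []).append(record)
    chosen_groups = rng.sample(sorted(groups), n_groups)

    # Stage II: inside each kept group, split by the second attribute and
    # draw up to per_subgroup records from every subgroup.
    sample = []
    for group_name in chosen_groups:
        subgroups = {}
        for record in groups[group_name]:
            subgroups.setdefault(record[stage2_key], []).append(record)
        for subgroup_name in sorted(subgroups):
            members = subgroups[subgroup_name]
            sample.extend(rng.sample(members, min(per_subgroup, len(members))))
    return sample


# Toy voter list: 4 cities with 20 voters each, 10 per gender (made-up data).
cities = ["A", "B", "C", "D"]
voters = [
    {"id": i, "city": cities[i % 4], "gender": "F" if (i // 4) % 2 == 0 else "M"}
    for i in range(80)
]
voter_sample = two_stage_sample(voters, "city", "gender", n_groups=2, per_subgroup=3)
print(len(voter_sample))  # 12
```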

IV. PROPOSED ALGORITHM AND EXPERIMENTATION
This section describes the development of the proposed algorithm, Improvised Dynamic Binary-Multiclass Imbalanced Oversampling Rate (IDBMORate), which balances the imbalance ratio for both categories, that is, binary as well as multiclass data. The proposed algorithm aims to oversample the minority classes. IDBMORate targets data rescaling, selection of data, generation of additional data, and transformation of data. It takes a dynamic approach to oversampling rate calculation.
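The idea of oversampling only the minority class can be illustrated with a minimal sketch. This is not the IDBMORate rate calculation, which is not reproduced in this section. It assumes a fixed, hypothetical rate of 2 for the positive class, chosen only because it matches the reported Page-block change from 559 to 1118 instances, and it uses plain random duplication instead of any synthetic sample generation. The function name `oversample_minority` is also hypothetical.

```python
import random
from collections import Counter


def oversample_minority(labels, rates, seed=0):
    """Return sample indices after oversampling the classes listed in rates.

    rates maps a class label to a multiplicative rate (>= 1). Classes that
    are not listed, such as the majority class, are left untouched.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    selected = list(range(len(labels)))  # keep every original sample
    for label, rate in rates.items():
        members = by_class[label]
        extra = round(len(members) * (rate - 1))
        # Random duplication with replacement (assumption, not IDBMORate).
        selected.extend(rng.choices(members, k=extra))
    return selected


# Class sizes reported for the Page-block dataset in this paper.
labels = ["positive"] * 559 + ["negative"] * 4913
resampled = oversample_minority(labels, rates={"positive": 2.0})
counts = Counter(labels[i] for i in resampled)
print(counts["positive"], counts["negative"])  # 1118 4913

# Imbalance ratio (majority / minority) before and after oversampling.
print(round(4913 / 559, 2), round(4913 / 1118, 2))  # 8.79 4.39
```

The majority class size stays at 4913, so the imbalance ratio drops only because the minority class grows.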

B. Experimental Analysis of IDBMORate for Multiclass Datasets
The proposed algorithm also performs well on multiclass datasets. For performance evaluation on multiclass data, this research study uses the Ecoli dataset, which contains multiple classes. The total sample size of the Ecoli dataset is 336. Table IX shows the RFC classification result with the proposed oversampling rate algorithm, which is used to assess the effectiveness of the proposed algorithm.
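To make the degree of imbalance in Ecoli concrete, the sketch below computes a per-class imbalance ratio relative to the largest class. The class counts are an assumption: they are the figures commonly listed in the UCI repository description of the dataset, not values taken from this paper's tables, although they do add up to the 336 samples stated above. The function name `imbalance_ratios` is hypothetical.

```python
# Per-class sample counts (assumed from the UCI dataset description).
ecoli_counts = {
    "cp": 143, "im": 77, "pp": 52, "imU": 35,
    "om": 20, "omL": 5, "imS": 2, "imL": 2,
}


def imbalance_ratios(class_counts):
    """Return largest-class size divided by each class size."""
    largest = max(class_counts.values())
    return {name: largest / count for name, count in class_counts.items()}


ratios = imbalance_ratios(ecoli_counts)
print(sum(ecoli_counts.values()))  # 336
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:>4}: {ratio:.2f}")
```

Under these counts, the four classes named as minority classes in the abstract (om, omL, imS, imL) have the highest ratios.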

V. CONCLUSION AND FUTURE WORK
This research work addressed binary and multiclass imbalanced application domains, their associated problems, and approaches to resolve data imbalance dynamically. The proposed algorithm, Improvised Dynamic Binary-Multiclass Imbalanced Oversampling Rate (IDBMORate), balances the minority classes without affecting the majority class, which minimizes the cost of computation. Experimental analysis has been carried out on the Page-block and Ecoli datasets. IDBMORate overcomes the problem of excessive synthetic sample generation for the minority classes, which leads to improved classification accuracy with the Random Forest Classification model. Experimental analysis shows that IDBMORate outperforms the existing oversampling techniques in both binary and multiclass imbalanced real-life scenarios. Future work will focus on a hybrid sampling method to further improve performance, and on a more dynamic method that can work in a distributed environment.