HHO-SMOTe: Efficient Sampling Rate for Synthetic Minority Oversampling Technique Based on Harris Hawk Optimization

Abstract: Classifying imbalanced datasets presents a significant challenge in machine learning, especially with big data, where instances are unevenly distributed among classes and the resulting class imbalance degrades classifier performance. The Synthetic Minority Over-sampling Technique (SMOTE) is an effective oversampling method that addresses this by generating new instances for the under-represented minority class. However, SMOTE's efficiency relies on the sampling rate for minority class instances, making optimal sampling rates crucial for solving class imbalance. In this paper, we introduce HHO-SMOTe, a novel hybrid approach that combines the Harris Hawk optimization (HHO) search algorithm with SMOTE to enhance classification accuracy by determining optimal sample rates for each dataset. We conducted extensive experiments across diverse datasets to comprehensively evaluate our binary classification model. The results demonstrated our model's strong performance, with an AUC score exceeding 0.96, a G-mean score of 0.95, and an F1-score consistently exceeding 0.99. These findings establish our proposed approach as a strong contender among binary classification models.


I. INTRODUCTION
The applications of Machine Learning (ML) have seen a growing trend in classification domains involving data for automating processes. However, training presents difficulties because of the inherent nature of the algorithms, which typically learn from datasets with balanced distributions [1]. As a result, learning from datasets with uneven distributions can reduce the accuracy and dependability of the resulting model. This phenomenon is termed "imbalance" or "unbalance" [2].
In contemporary applications, addressing the challenges posed by imbalanced data has emerged as a notable issue. It is particularly evident in domains such as the detection of fraudulent telephone calls [3], text classification [4], and biomedical data analysis [5,6]. The classification of imbalanced data is a significant concern in machine learning and data mining [7]. In an imbalanced dataset, one class contains notably fewer training instances (the minority class) than the other (the majority class). On such datasets, conventional machine learning and classification algorithms tend to achieve very high accuracy on the majority class while attaining notably lower accuracy on the minority class [8]. The classifier's effectiveness therefore suffers when it comes to identifying samples from the minority class, which makes imbalanced datasets a substantial hurdle in classification research. Yet in numerous practical scenarios, the emphasis is placed on recognizing minority class samples rather than their majority counterparts [9].
In this paper, we emphasize the critical nature of class imbalance and its adverse consequences on the performance of traditional classifiers in real-world applications such as medical diagnosis, fraud detection, and anomaly detection. To overcome these problems, we present a hybrid binary classification method that integrates multiple algorithms, enhancing the overall robustness of the approach. The core of our methodology is the Harris Hawk optimization search algorithm, which calculates optimal sample rates for the minority class instances, resulting in improved representation within the dataset. By adapting the SMOTE technique with Harris Hawk search, we obtain more effective synthetic data generation, tailored to the specific characteristics of the imbalanced data.
SMOTE has emerged as a contender for effectively addressing the classification of imbalanced datasets [10]. The technique generates new instances for the underrepresented minority class, re-balancing the dataset by augmenting the presence of minority class data points. Algorithms built on the SMOTE framework adopt a uniform sampling rate for all instances. Unfortunately, this uniform approach leads to suboptimal performance. The limitation becomes particularly pronounced when the dataset presents varying degrees of difficulty across different instances of the minority class. Instances that are inherently harder to classify may benefit from a different sampling strategy than instances that are relatively easier to classify. This variation is not accounted for by a uniform sampling rate, resulting in missed opportunities to improve the overall performance of the classification model. There is therefore a need for more sophisticated techniques that can adaptively adjust the sampling rates based on the inherent complexities within the minority class instances. By doing so, the resulting classification model could achieve more accurate and refined outcomes, mitigating the limitations of the current SMOTE-based methodologies.
In this paper, we propose an algorithm that builds upon the SMOTE technique while incorporating HHO [11] to enhance the efficacy of imbalanced data classification. The integration of the HHO algorithm introduces a dynamic approach in which diverse sampling rates are generated for individual instances of the minority class. This process culminates in the identification of an optimal combination of these sampling rates, which is then integrated into the SMOTE algorithm. The search for these optimal sampling rates is guided by the HHO search strategy. Once the optimal rates are found, oversampling is carried out exclusively on the instances belonging to the minority class, with each instance oversampled at its corresponding optimal sampling rate.
The subsequent sections of this paper are structured as follows. Section II gives an overview of current methodologies for handling imbalanced datasets. Section III describes the SMOTE technique and the HHO algorithm in some detail. Section IV presents the design and functionality of our HHO-SMOTe algorithm. Section V examines the outcomes across diverse datasets and a variety of algorithms. Section VI concludes this paper.

II. RELATED WORK
Several research papers [2,12,13] have conducted a comprehensive examination of imbalanced datasets. These studies have not only reviewed the field but have also put forth various solutions aimed at addressing the challenge of imbalanced data. Their objective is to determine the approach that performs best in handling this issue. Ebenuwa et al. [12] introduced a feature selection approach for handling imbalanced datasets. They outlined the methodology and implementation steps, evaluating its effectiveness using machine learning algorithms such as decision trees, logistic regression, and support vector machines. Their study aimed to identify the algorithm most suitable for addressing imbalanced data challenges through this ensemble of classifiers. The approach proposed in [13] involves an oversampling technique that incorporates all minority samples during the classification process within the training data. The study evaluated this technique by comparing its performance against state-of-the-art ensemble learning methods, with the objective of assessing the ability of the oversampling technique to address imbalanced data scenarios.
Liu et al. [14] proposed the EasyEnsemble and BalanceCascade algorithms to address class imbalance issues more effectively than existing methods. Their research revealed that both algorithms outperformed established techniques in tackling class imbalance. Additionally, the authors in [15] devised the GASMOTE algorithm, which employs distinct sampling rates tailored to individual instances within the minority class and searches for the optimal combination of these sampling rates. Empirical evaluations on ten typical imbalanced datasets showed that GASMOTE improved on the SMOTE algorithm, which supports the precision of the GASMOTE algorithm.
Nnamoko and Korkontzelos [16] addressed diabetes prediction by devising an optimized variant of the SMOTE technique. Their algorithm uses the interquartile range technique to selectively oversample dispersed or extreme data prior to the application of SMOTE. This pre-processing step improves the distribution of training samples and, in turn, the efficacy of the diabetes prediction model. Liu et al. [17] addressed data balance in the context of spam detection. They proposed an algorithm termed Fuzzy-based OverSampling, which uses fuzzy logic principles to balance the data distribution in synthetic sampling. This methodology not only rectified the class imbalance but also made the distribution more representative of the real-world scenario. The enhancement resulted in higher precision across a diverse array of ensemble learning models employed for the spam detection task.
The authors in [18] enhanced the SL-SMOTE technique by incorporating an evolutionary optimization procedure to fine-tune its algorithmic parameters. This variant, labeled Evolutionary SL-SMOTE, attained strong performance when evaluated on seminal quality prediction using AdaBoost. Susan and Kumar [19] conducted a comprehensive survey of preprocessing techniques in machine learning applications. Their paper explores various sampling methodologies and examines how each of the surveyed works incorporated the suggested remedies. The survey concludes with a summary of the experimental protocols employed, including procedural details and a compilation of the documented outcomes.
A remaining issue is how to determine the proper sample rate of the minority instances involved in the synthesis, so that the generated minority instances do not decrease the learning efficiency of the classification process. To address this issue more effectively, in this paper we propose HHO-SMOTe, an improved variant of SMOTE based on a nature-inspired algorithm called HHO. HHO-SMOTe focuses on determining the appropriate minority instances that increase the accuracy of the classification algorithm.

III. PRELIMINARIES
SMOTE and the exploratory and exploitative stages of the Harris Hawk Optimization algorithm are covered in this section. We explain the procedures and steps used by each algorithm. In addition, we demonstrate how these phases have been used to develop a novel algorithm. Due to the integration of the two algorithms, our method can dynamically adapt to a variety of datasets and produce strong results with a high degree of efficiency.
Impact of the sample rate on dataset balance: The choice of the sample rate s influences how many synthetic instances are generated and how they are distributed between A and B. By adjusting s, you can fine-tune the balance of your dataset. A smaller s may be suitable if you want a moderate increase in the minority class, while a larger s will result in more substantial over-sampling. Addressing class imbalance in datasets using the SMOTE algorithm is a common strategy in machine learning, but selecting the appropriate sample rate is a challenging task. There are no universal guidelines for determining the ideal sample rate, as it hinges on factors such as dataset characteristics, machine learning algorithms, and problem-specific nuances. The primary goal of SMOTE is to balance the class distribution, which is vital for training fair and effective models. However, selecting the wrong sample rate can lead to overfitting, underfitting, or suboptimal model performance.
Researchers in [20][21][22][23] often use SMOTE approaches to balance their datasets before starting work on classification, feature selection, or clustering problems, without addressing the selection of the sample rate for the minority classes. One way to select the rate is grid search, which involves trying out a range of predefined sample rates and selecting the one that optimizes evaluation metrics such as precision, recall, F1-score, or AUC. Cross-validation enhances this process by providing a more robust assessment across multiple data subsets. An iterative refinement process, in which researchers gradually narrow down the optimal sample rate through experimentation and analysis, is common practice. Additionally, understanding the sensitivity of machine learning algorithms to different sample rates is crucial.
In summary, choosing the right sample rate in SMOTE is a nuanced decision that relies on empirical methods, domain expertise, and iterative exploration to strike the balance that suits the dataset and problem domain. We put forth a solution for determining the most accurate sample rate, which is applied when generating samples from the minority classes to achieve dataset balance. This solution leverages the HHO algorithm, a sophisticated optimization technique.

B. Harris Hawks Optimizer (HHO)
HHO was introduced by Ali Asghar Heidari in 2019 and has since garnered significant attention from the research community [11,24]. HHO draws inspiration from the hunting behavior of Harris hawks in nature, particularly their agile surprise pounce technique. Harris hawks, known for their remarkable intelligence, exhibit various chasing styles based on different scenarios and the behavior of their prey. HHO is widely recognized as an effective optimization algorithm, and it has been successfully applied to a variety of problems across different domains, including energy and power flow analysis, engineering, medical applications, network optimization, and image processing. The reviews in [25][26][27][28] survey the existing body of work related to HHO.
This section presents the modeling of both the exploratory and exploitative phases of the HHO methodology. The phases comprise three steps inspired by the natural behaviors of Harris hawks: prey exploration, surprise pouncing, and the diverse attack strategies employed. HHO is a population-based, gradient-free optimization approach, which makes it adaptable to a wide array of optimization problems, provided that they are appropriately formulated. Detailed explanations are provided in the subsequent subsections.

1) Exploration phase:
Hawks perch in specific locations and constantly monitor the surrounding environment to identify prey using two strategies, which are represented in Eq. (2). If $q < 0.5$, the hawks perch based on the position of the family members. If $q \geq 0.5$, the hawks perch at a random location within the population area.

$$X(t+1) = \begin{cases} X_{rand}(t) - r_1 \left| X_{rand}(t) - 2 r_2 X(t) \right|, & q \geq 0.5 \\ \left( X_{rabbit}(t) - X_m(t) \right) - r_3 \left( LB + r_4 (UB - LB) \right), & q < 0.5 \end{cases} \tag{2}$$

where $X(t+1)$ is the position vector of the hawks in the next iteration, $X_{rabbit}(t)$ is the position of the rabbit, $X(t)$ is the current position vector of the hawks, $r_1$, $r_2$, $r_3$, $r_4$, and $q$ are random numbers inside (0, 1), which are updated in each iteration, $LB$ and $UB$ are the lower and upper bounds of the variables, $X_{rand}(t)$ is a randomly selected hawk from the current population, and $X_m$ is the average position of the current population of hawks.
HHO uses a simple model to generate random locations inside the group's home range $(LB, UB)$. The first rule generates solutions based on a random location and other hawks. The second rule of Eq. (2) takes the difference between the best location found so far and the average position of the group, plus a randomly scaled component based on the range of the variables; $r_3$ is a scaling coefficient that further increases the randomness of the rule when $r_4$ takes values close to 1 and similar distribution patterns may occur. This is the simplest rule that can mimic the behaviors of hawks. The average position of the hawks is obtained using Eq. (3):

$$X_m(t) = \frac{1}{N} \sum_{i=1}^{N} X_i(t) \tag{3}$$

where $X_i(t)$ indicates the location of each hawk in iteration $t$ and $N$ denotes the total number of hawks.
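The exploration rule can be sketched in a few lines of Python. This is a minimal illustration of the standard HHO exploration update [11] and of the mean position in Eq. (3), not the authors' implementation. The function names `mean_position` and `explore`, the use of scalar bounds `lb` and `ub`, and the final clipping to the bounds are assumptions made for this example.

```python
import random


def mean_position(population):
    """Average position X_m(t) of all hawks, as in Eq. (3)."""
    n_hawks = len(population)
    dim = len(population[0])
    return [sum(hawk[d] for hawk in population) / n_hawks for d in range(dim)]


def explore(x, population, x_rabbit, lb, ub, rng=random):
    """One exploration update for hawk x, following the two rules of Eq. (2)."""
    q = rng.random()
    r1, r2, r3, r4 = (rng.random() for _ in range(4))
    if q >= 0.5:
        # Rule 1: perch relative to a randomly selected hawk X_rand.
        x_rand = rng.choice(population)
        new = [xr - r1 * abs(xr - 2 * r2 * xi) for xr, xi in zip(x_rand, x)]
    else:
        # Rule 2: perch relative to the rabbit and the mean position X_m.
        x_m = mean_position(population)
        new = [(xb - xm) - r3 * (lb + r4 * (ub - lb))
               for xb, xm in zip(x_rabbit, x_m)]
    # Assumption: keep the new position inside [lb, ub] by clipping.
    return [min(max(v, lb), ub) for v in new]
```

Passing a seeded `random.Random` instance as `rng` makes the update reproducible, which is convenient when comparing runs.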

2) Transition from exploration to exploitation:
HHO can transfer from exploration to exploitation and then change between different exploitative behaviors based on the escaping energy of the prey. The energy of the prey decreases considerably during the escaping behavior. To model this fact, the energy of the prey is modeled as:

$$E = 2 E_0 \left( 1 - \frac{t}{T} \right) \tag{4}$$

where $E$ indicates the escaping energy of the prey, $t$ is the current iteration, $T$ is the maximum number of iterations, and $E_0$ is the initial state of its energy. In HHO, $E_0$ randomly changes inside the interval (−1, 1) at each iteration. When the value of $E_0$ decreases from 0 to −1, the rabbit is physically flagging, whilst when the value of $E_0$ increases from 0 to 1, the rabbit is strengthening.
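As a small worked illustration of the escaping energy, the sketch below assumes the standard HHO form E = 2·E0·(1 − t/T) from [11], with E0 drawn uniformly from (−1, 1). The function name `escaping_energy` is hypothetical.

```python
import random


def escaping_energy(t, max_iter, rng=random):
    """Escaping energy E of the prey at iteration t (assumed form of Eq. (4))."""
    e0 = 2 * rng.random() - 1          # initial energy E0 in (-1, 1)
    return 2 * e0 * (1 - t / max_iter)  # magnitude shrinks linearly with t
```

Because |E| is bounded by 2(1 − t/T), the exploration condition |E| ≥ 1 can only be satisfied while t ≤ T/2. Under this form, the second half of the run is spent entirely in exploitation.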
3) Exploitation phase:
In this phase, the hawks attack the targeted prey, while the prey tries to escape the attack. Based on the attacking behavior of the hawks and the escaping behavior of the prey, four scenarios are described below, where $r$ is a random number inside (0, 1).

a) Soft Besiege: When $r \geq 0.5$ and $|E| \geq 0.5$, the rabbit still has enough energy and tries to escape by some random misleading jumps, but finally it cannot. During these attempts, the Harris' hawks encircle it softly to make the rabbit more exhausted and then perform the surprise pounce. This behavior is modeled by the following rules:

$$X(t+1) = \Delta X(t) - E \left| J X_{rabbit}(t) - X(t) \right| \tag{5}$$

$$\Delta X(t) = X_{rabbit}(t) - X(t) \tag{6}$$

where $\Delta X(t)$ is the difference between the position vector of the rabbit and the current location in iteration $t$, $r_5$ is a random number inside (0, 1), and $J = 2(1 - r_5)$ represents the random jump strength of the rabbit throughout the escaping procedure. The value of $J$ changes randomly in each iteration to simulate the nature of rabbit motions.

b) Hard Besiege: When $r \geq 0.5$ and $|E| < 0.5$, the prey is exhausted and has a low escaping energy. In addition, the Harris' hawks hardly encircle the intended prey to finally perform the surprise pounce. In this situation, the current positions are updated using:

$$X(t+1) = X_{rabbit}(t) - E \left| \Delta X(t) \right| \tag{7}$$

c) Soft Besiege with Progressive Rapid Dives: When still $|E| \geq 0.5$ but $r < 0.5$, the rabbit has enough energy to successfully escape, and a soft besiege is still constructed before the surprise pounce. This procedure is more intelligent than the previous case. The final strategy for updating the positions of the hawks in the soft besiege phase is:

$$X(t+1) = \begin{cases} Y, & \text{if } F(Y) < F(X(t)) \\ Z, & \text{if } F(Z) < F(X(t)) \end{cases} \tag{8}$$

where $F$ denotes the fitness function, written for minimization as in the original HHO formulation [11], and $Y$ and $Z$ are obtained using Eq. (9) and Eq. (10). $Y$ is the hawks' next move based on the following rule:

$$Y = X_{rabbit}(t) - E \left| J X_{rabbit}(t) - X(t) \right| \tag{9}$$

To mathematically model the escaping patterns of the prey and the leapfrog movements (as called in [22]), the Lévy flight (LF) concept is utilized in the HHO algorithm. In HHO, the hawks dive based on LF-based patterns using the following rule:

$$Z = Y + S \times LF(D) \tag{10}$$

where $D$ is the dimension of the problem, $S$ is a random vector of size $1 \times D$, and $LF$ is the Lévy flight function, which is calculated as follows:

$$LF(x) = 0.01 \times \frac{u \times \sigma}{|v|^{1/\beta}}, \qquad \sigma = \left( \frac{\Gamma(1+\beta) \times \sin\left(\frac{\pi \beta}{2}\right)}{\Gamma\left(\frac{1+\beta}{2}\right) \times \beta \times 2^{\frac{\beta-1}{2}}} \right)^{1/\beta} \tag{11}$$

where $u$, $v$ are random values inside (0, 1) and $\beta$ is a default constant set to 1.5.

d) Hard Besiege with Progressive Rapid Dives: When $|E| < 0.5$ and $r < 0.5$, the rabbit does not have enough energy to escape, and a hard besiege is constructed before the surprise pounce to catch and kill the prey. The situation on the prey side is similar to that in the soft besiege, but this time the hawks try to decrease the distance between their average location and the escaping prey. Therefore, the following rule is performed in the hard besiege condition:

$$X(t+1) = \begin{cases} Y, & \text{if } F(Y) < F(X(t)) \\ Z, & \text{if } F(Z) < F(X(t)) \end{cases} \tag{12}$$

where $Y$ and $Z$ are obtained using the rules in Eq. (13) and Eq. (14):

$$Y = X_{rabbit}(t) - E \left| J X_{rabbit}(t) - X_m(t) \right| \tag{13}$$

$$Z = Y + S \times LF(D) \tag{14}$$

IV. THE PROPOSED HHO-SMOTE ALGORITHM
In this section, the HHO-SMOTe approach is proposed for determining the efficient sample rate to be used in the SMOTE technique. The primary goal of HHO-SMOTe is to increase the classification accuracy on imbalanced datasets. We employed the HHO algorithm to find the optimum solution based on the KNN classification accuracy in order to obtain the best sampling rate of the synthetic minority class instances.
HHO-SMOTe is initialized by setting its control parameters, such as the population size N, the number of minority class instances n, and the maximum number of iterations. The algorithm then generates an initial population X of dimension N × n as the initial phase of the HHO-SMOTe approach.
Each solution x_i ∈ X represents a candidate sample rate for SMOTE and is assessed by the resulting classification accuracy on the dataset, where the best sample rate (solution) has the highest classification accuracy based on the KNN algorithm. A solution can be represented as a row of n values, each ranging between 0 and the maximum number of samples for each minority class instance. A value of 0 at a given position of x_i indicates that the corresponding minority class instance has a sample rate of 0 and will not be used in the generation of the synthetic data. Conversely, if the value is greater than 0, the corresponding minority class instance will be used in the generation of the synthetic data. For example, a solution x_i for generating synthetic data from 6 minority class instances can be represented as x_i = [1, 0, 2, 0, 3, 1]. This means that the synthetic data consists of one sample generated from the first minority class instance, no samples from the second, two samples from the third, and so on. The pseudocode of HHO-SMOTe is shown in Algorithm 1.
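To make the solution encoding concrete, the following sketch applies a per-instance rate vector such as [1, 0, 2, 0, 3, 1] to a small minority set. It is an illustrative approximation rather than the authors' code. It assumes the usual SMOTE interpolation between an instance and one of its k nearest minority neighbours, and the function name `smote_with_rates` and the parameter `k` are hypothetical.

```python
import math
import random


def smote_with_rates(minority, rates, k=5, rng=random):
    """Generate rates[i] synthetic samples from minority[i] by SMOTE-style interpolation."""
    synthetic = []
    for i, (x, rate) in enumerate(zip(minority, rates)):
        if rate <= 0:
            continue  # a rate of 0 means this instance is not used
        # k nearest minority neighbours of x, excluding x itself
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(x, p))
        neighbours = others[:k]
        if not neighbours:
            continue  # a single minority instance has nothing to interpolate with
        for _ in range(rate):
            nb = rng.choice(neighbours)
            gap = rng.random()  # interpolation factor in [0, 1)
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic
```

With the rate vector from the example above, the call returns seven synthetic points, one per unit of sample rate, each lying on a segment between two minority instances.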

Algorithm 1: Pseudocode of the HHO-SMOTe algorithm
Inputs: The population size N and the maximum number of iterations T
Outputs: The location of the rabbit and its fitness value
Initialize the random population X_i, i = 1, 2, ..., N
while (stopping condition is not met) do
    Generate synthetic data based on the current sample rate (solution) using the SMOTE algorithm, then calculate the fitness values of the hawks using the KNN algorithm
    Set X_rabbit as the location of the rabbit (highest accuracy)
    for (each hawk (X_i)) do
        Update the initial energy E_0 and jump strength J: E_0 = 2rand() - 1, J = 2(1 - rand())
        Update E using Eq. (4)
        if (|E| ≥ 1) then (Exploration phase)
            Update the location vector using Eq. (2)
        if (|E| < 1) then (Exploitation phase)
            if (r ≥ 0.5 and |E| ≥ 0.5) then (Soft besiege)
                Update the location vector using Eq. (5)
            else if (r ≥ 0.5 and |E| < 0.5) then (Hard besiege)
                Update the location vector using Eq. (7)
            else if (r < 0.5 and |E| ≥ 0.5) then (Soft besiege with progressive rapid dives)
                Update the location vector using Eq. (8)
            else if (r < 0.5 and |E| < 0.5) then (Hard besiege with progressive rapid dives)
                Update the location vector using Eq. (12)
Return X_rabbit
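The branching in Algorithm 1 and the Lévy flight step used by the rapid-dive rules can be sketched as follows. This is a minimal sketch under stated assumptions: `select_phase` and `levy_flight` are hypothetical names, and the Lévy step follows the commonly cited HHO form 0.01·u·σ/|v|^(1/β) with β = 1.5 and u, v drawn uniformly from (0, 1) as described in the text.

```python
import math
import random


def select_phase(energy, r):
    """Return the name of the update rule chosen by Algorithm 1 for given E and r."""
    e = abs(energy)
    if e >= 1:
        return "exploration"
    if r >= 0.5 and e >= 0.5:
        return "soft besiege"
    if r >= 0.5:
        return "hard besiege"
    if e >= 0.5:
        return "soft besiege with progressive rapid dives"
    return "hard besiege with progressive rapid dives"


def levy_flight(dim, beta=1.5, rng=random):
    """Levy flight step vector of length dim (assumed form of the LF function)."""
    sigma = (
        math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
        / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
    ) ** (1 / beta)
    steps = []
    for _ in range(dim):
        u = rng.random()
        v = 1.0 - rng.random()  # in (0, 1], avoids division by zero
        steps.append(0.01 * u * sigma / abs(v) ** (1 / beta))
    return steps
```

Keeping the phase selection separate from the position updates makes the four exploitation cases easy to check against the conditions on r and |E| listed in the pseudocode.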

A. Performance Evaluation Measures
Performance evaluation metrics are critical for evaluating classification performance and guiding classifier design. In this step, the confusion matrix was used to obtain the results of the proposed HHO-SMOTe approach and to compare all the SMOTE approaches used. The confusion matrix (Fig. 2) describes the performance of the classification models. True positive (TP): the observation is predicted positive and is actually positive. False positive (FP): the observation is predicted positive and is actually negative. True negative (TN): the observation is predicted negative and is actually negative. False negative (FN): the observation is predicted negative and is actually positive. From the confusion matrix, we can derive the following measures:

1) G-mean:
The geometric mean is the root of the product of class-wise sensitivity. This measure tries to maximize the accuracy on each of the classes while keeping these accuracies balanced. For binary classification, G-mean is the square root of the product of the sensitivity and specificity. For multi-class problems, it is a higher root of the product of the sensitivity for each class.

$$G\text{-}mean = \sqrt{Sensitivity \times Specificity} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}} \tag{15}$$
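As a quick numerical check of the binary G-mean, the sketch below computes it directly from confusion-matrix counts. The function name `g_mean` and the convention of returning 0 for an empty class are assumptions made for this example.

```python
import math


def g_mean(tp, fp, tn, fn):
    """Binary G-mean: square root of sensitivity times specificity."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    return math.sqrt(sensitivity * specificity)
```

For instance, a classifier that labels every sample as majority on a set with 100 majority and 10 minority samples reaches an accuracy of about 0.91, yet `g_mean(0, 0, 100, 10)` is 0, which is why G-mean is preferred over plain accuracy for imbalanced data.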
2) F1 score:
The F1 score, F score, or F measure is the harmonic mean of precision and sensitivity, so it gives importance to both factors:

$$F1 = \frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity}, \qquad Precision = \frac{TP}{TP + FP} \tag{16}$$

3) AUC:
The receiver operating characteristic (ROC) curve is the plot of the sensitivity against the FP rate for various threshold values. The area under curve (AUC) is the area under this ROC curve; it is used to measure the quality of a classification model. The larger the area, the better the performance. The ROC curve is a two-dimensional coordinate graph in which the X-axis represents the false positive rate (FPR) and the Y-axis represents the true positive rate (TPR). The AUC can be calculated as:

$$AUC = \int_{0}^{1} TPR \; d(FPR) \tag{17}$$

V. EXPERIMENTS AND EVALUATION
In this section, experiments on different datasets are reported. The following subsections present and analyze the results. The experiments were conducted on Google Colaboratory, which provides a free Jupyter notebook environment with GPU support for running machine learning experiments [29]. We used over 25 diverse datasets from different industries and with different attributes to evaluate the proposed technique. We maintained the original class distribution with five-fold cross-validation and conducted each experiment five times to obtain average metrics. Table I summarizes the dataset details, including the dataset name, the number of attributes, the number of samples in the minority class, the total number of records, the number of samples in the majority class, and the corresponding imbalance ratio.
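The evaluation protocol described above, five folds that each preserve the original class distribution, can be sketched with a small stratified split. This is an illustrative sketch only. The helper names `stratified_folds` and `imbalance_ratio` are hypothetical, and the imbalance ratio is assumed to follow the common convention of majority count divided by minority count.

```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into folds that keep the class proportions of labels."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin over the folds.
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


def imbalance_ratio(labels, minority=1):
    """Majority count divided by minority count (assumed convention)."""
    n_min = sum(1 for y in labels if y == minority)
    return (len(labels) - n_min) / n_min
```

Repeating the split with five different seeds and averaging the metrics over all folds mirrors the "five runs of five-fold cross-validation" setup. In such a scheme, oversampling would normally be applied to the training folds only, so that the held-out fold keeps the original distribution.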
Table II presents the outcomes of our experiments with 19 SMOTE variants and the proposed HHO-SMOTe approach, with the KNN algorithm applied to the datasets oversampled by each technique. The 19 methods are ADASYN [30], AND-SMOTE [31], ANS [32], Borderline-SMOTE1 [33], Borderline-SMOTE2 [33], distance-SMOTE [34], G-SMOTE [35], GASMOTE [15], Gaussian-SMOTE [36], KernelADASYN [37], kmeans-SMOTE [38], Random-SMOTE [39], Safe-Level-SMOTE [40], SDSMOTE [41], SMOTE [10], SOMO [42], SVM-balance [43], SYMPROD [44], and ASN-SMOTE [45]. We have highlighted in bold the best values achieved for the average G-mean, F1-score, and AUC within the KNN results. This highlighting shows that HHO-SMOTe consistently yields the best results across a diverse array of datasets. The classification performance comparisons for seven selected approaches applied to twelve datasets, presented in Figs. 3, 4, and 5, are obtained using data from Table II.
In Fig. 3, we assessed G-mean values across 12 data sources using seven SMOTE techniques. A higher G-mean indicates a model's proficiency in identifying both the positive and the negative class, a valuable property for imbalanced classification. ANS-SMOTE and GASMOTE ranked lower, while ADASYN, SMOTE, Random-SMOTE, and Borderline-SMOTE performed similarly. ADASYN had a slightly lower G-mean for "cleveland-0." HHO-SMOTe consistently excelled across the datasets, demonstrating its robustness in imbalanced classification tasks.
In Fig. 4, we compare classification results using F1-score values for the various SMOTE algorithms. The F1-score combines precision and recall, indicating a model's ability to balance false positives and false negatives. ANS-SMOTE and GASMOTE performed poorly compared to ADASYN, SMOTE, Random-SMOTE, and Borderline-SMOTE. In contrast, HHO-SMOTe consistently achieved near-perfect F1-scores (0.9 to 1) across datasets, showing its stability and reliability in diverse classification tasks.
In Fig. 5, we evaluate the same classification studies in terms of AUC (Area Under the Receiver Operating Characteristic Curve). AUC gauges a binary classification model's overall discrimination ability, considering true positive and false positive rates across different thresholds. The results show that ANS-SMOTE and GASMOTE underperformed compared to ADASYN, SMOTE, Random-SMOTE, and Borderline-SMOTE in AUC. In contrast, HHO-SMOTe consistently achieved high AUC values (typically 0.9 to 1), showing its adaptability across diverse datasets and confirming its effectiveness in classification tasks, especially when class separation is crucial.
This research also employs the well-known credit card fraud detection dataset [46]. The dataset was prepared by the ULB Machine Learning Group, which specializes in big data mining and fraud detection [47]. It covers credit card transactions made by European credit card clients within two days in September 2013 and contains 492 fraudulent transactions out of 284807 in total. All attributes except "Time" and "Amount" are numerical values resulting from a dimensionality reduction technique, principal component analysis (PCA), applied to the dataset. The "Amount" attribute is the cost of the transaction, and the "Time" attribute is the number of seconds elapsed between a transaction and the first transaction in the dataset. "Class" is the dependent variable, with a value of 1 for fraudulent and 0 for legitimate transactions.
In Fig. 6, we conducted an extensive comparison using the credit card fraud dataset, known for its vast transaction volume. The goal was to evaluate the stability and accuracy of our method under big data conditions, compared to other techniques. As depicted in the figure, HHO-SMOTe achieved the highest AUC score, 0.96, surpassing the other methods, which scored below 0.94. These methods ranked in descending order as Borderline-2, SMOTE, ADASYN, Borderline-1, ASN-SMOTE, GASMOTE, and Random-SMOTE. In terms of the F1-score, all algorithms consistently scored above 0.99, even reaching a perfect score of 1. Regarding the G-mean metric, HHO-SMOTe demonstrated its superiority with a score exceeding 0.95, while its counterparts fell short with scores below 0.94.

VI. CONCLUSION
In summary, the HHO-SMOTe approach represents a significant advancement in addressing the complexities of imbalanced datasets in classification tasks. By integrating various classifiers with the Harris Hawk search optimization algorithm and SMOTE, we have established a robust framework capable of producing precise and reliable predictions for imbalanced data scenarios. These results hold substantial implications for a wide range of real-world applications where improved classification accuracy and data balance correction play pivotal roles in informed decision-making. Furthermore, our research contributes to the field of imbalanced data handling by presenting a methodology that enhances the performance of classification models across diverse domains. This combination of state-of-the-art techniques has the potential to mitigate the challenges posed by skewed data distributions, ultimately enabling more accurate and trustworthy predictions.

A. SMOTE
SMOTE is commonly used when dealing with imbalanced datasets, where one class (the minority class) has significantly fewer examples than the other class (the majority class). In such cases, machine learning models may struggle to correctly classify the minority class because they tend to be biased towards the majority class. SMOTE helps address this imbalance by generating synthetic examples of the minority class to create a more balanced dataset for training.

Fig. 1. The principle of the SMOTE.
We can observe an example of an imbalanced dataset in Fig. 1(a). Here, the majority class is represented by circular shapes, which stand for the data's predominant occurrences, while the minority class is represented by triangular shapes, signifying the smaller number of data samples. Some examples from both classes lie in regions where they do not naturally belong, most notably the example marked with the red arrow. The SMOTE algorithm then creates synthetic samples, a crucial step in bolstering the minority class. The sampling rate specified for each category of occurrences serves as the basis for this process. The synthetic samples are presented as square shapes in Fig. 1(b). After applying the SMOTE technique, the disparity between the minority and majority classes is reduced. The SMOTE algorithm includes a sample rate parameter to control the extent of over-sampling. The sample rate determines how many synthetic examples are generated for

Fig. 2. Confusion matrix for the two-class classification problem.

TABLE II. RESULTS OBTAINED BY KNN ON DATASETS OVERSAMPLED BY DIFFERENT SMOTE TECHNIQUES