Hybrid Intrusion Detection System Based on Data Resampling and Deep Learning

Abstract: The growth of the internet has advanced information-sharing capabilities and vastly increased the importance of global network security. However, because new and inconspicuous abnormal behaviors are nearly impossible to detect in massive network access environments, modern intrusion detection systems suffer from high false-positive (FP) and false-negative (FN) rates. To overcome this, this paper proposes a hybrid deep learning model that mitigates the disadvantages of imbalanced attack sample data. First, it balances the data using random undersampling and the synthetic minority oversampling technique. Then, a convolutional neural network (CNN) extracts local and spatial features, and a transformer encoder extracts global and temporal features. This combination increases recognition accuracy at the algorithm level, which is crucial to reducing FPs and FNs. The model was subjected to multiclass classification testing on the NSL-KDD and CICIDS2017 benchmark datasets, and the results show that it has higher classification accuracy and lower FP rates than state-of-the-art intrusion detection models. Moreover, it significantly improves the detection rate of low-frequency attacks.


I. INTRODUCTION
The ubiquity of mobile handheld internet devices allows people to access digital information quickly and effortlessly, just about anywhere. The associated transference and storage of vast volumes of data over computer networks have created new and evolving opportunities for cybercriminals [1]. From 2019 to 2022, the cost of repairing cyberattack damage increased by USD 6 trillion, and the average detection time increased from 57.4 to 93.2 days [2]. Traditional cybersecurity methods (e.g., firewalls, user authentication, and data encryption) cannot handle the complex attacks that take place online. Intrusion detection systems (IDSs) are designed to detect a variety of anomalous patterns that serve as the attack signatures of new and known attacks [3], using advanced database systems with machine learning [4]. When an IDS reports potential malicious activities in an information system [5], it kicks off various analytical and alerting processes to confirm the nature of the attack and launch protection measures. IDSs generally operate in three phases: information collection, data analysis, and response. The fact that most advanced cyberattacks utilize new and unusual network and system penetration methods makes it nearly impossible to train machine learning models to recognize discrete new and seemingly inconspicuous threats. On the other hand, the models must still be trained with legacy class types from past attacks. Hence, IDS training datasets grow heavily imbalanced over time [6]. If the target class is rare (< 10%) in the training dataset, critical new and unusual network behaviors can be easily overlooked (as with human perception).
Modern machine-learning methods that handle unbalanced classes typically consider both data- and algorithm-level remedies. That is, training data and classification algorithms are modified separately so that their combination improves the detection and recognition accuracy of minority samples. Hence, advantageous tradeoffs can be gained. Unfortunately, even state-of-the-art IDS models continue to suffer high false-positive (FP) and false-negative (FN) rates in attack detection.
To improve the accuracy and robustness of machine-learning-based IDSs, this study makes the following contributions. (1) We apply a novel combination of data- and algorithm-level techniques to reduce the FP rate while improving the model's recall rate. (2) We provide reproducible results by applying our combined model to the NSL-KDD [7] and CICIDS2017 [8] benchmark intrusion detection datasets. (3) To improve data-level class balancing, we combine random undersampling (RUS) and synthetic minority oversampling to adjust the data distribution structure and improve minority class detection. (4) To improve algorithm-level class balancing, we apply a hybrid of a convolutional neural network (CNN) and a Transformer model to improve detection performance over contemporary models. Our model's performance is compared with that of state-of-the-art IDS models, demonstrating that our approach has clear advantages in terms of accuracy, FP rate, and recall.
The remainder of this paper proceeds as follows. Section II covers the extant research that motivates this work. Section III describes the proposed model and the related techniques and technologies applied. Section IV describes our experiments and presents the comparison results, configurational impacts, and implications of our findings. Finally, Section V presents the conclusions.
Fund Project: Natural Science Foundation of Fujian Province, China [grant no. 2021J01332]. www.ijacsa.thesai.org

II. RELATED WORKS
This study provides an IDS model that can more accurately identify malicious traffic and detect a wider variety of intrusion attacks than current models. Our model resolves sample imbalance problems at the data and algorithm levels based on the lessons learned from current studies, described briefly in the following sections.

A. Data-Level Mitigation Efforts
Among current data-level efforts to overcome the problems of training models with imbalanced datasets, data reconstruction prevails. Related strategies focus on preprocessing original datasets to provide appropriately weighted training sets for model learning and tailoring the model's feature classification methods to maximize learning and retention based on the task at hand.
Many researchers have applied oversampling [9][10][11][12][13][14], undersampling [15][16][17][18][19], and hybrid [20][21][22][23] preprocessing methods to restore balance to their training datasets. These methods are combined with feature classification methods to maximize benefits. For example, the synthetic minority oversampling technique (SMOTE) [24] is a widely used data reconstruction strategy that provides good data balancing and classification results while effectively avoiding overfitting. Noting that the SMOTE algorithm analyzes minority class samples and artificially synthesizes new ones based on the needed additions, Dablain et al. [25] provided a deep learning-based SMOTE method that applies a novel oversampling method to counter class imbalances and train new skew-insensitive classifiers. Joloudari et al. [26] proposed a CNN that uses SMOTE to achieve an accuracy of 99.08% on 24 imbalanced datasets, including KEEL, Breast Cancer, and Z-Alizadeh Sani sets.

B. Algorithm-Level Mitigation Efforts
Most current algorithm-level mitigation efforts aim to process input data algorithmically for better classification results. Modern techniques match the model's internal structure to the distribution characteristics of the original dataset as much as possible. For example, CNN-based autoencoders are extensively used for IDSs, resulting in high detection performance [27]. Yin et al. [28] proposed a recurrent neural network (RNN)-based IDS that provides impressive breakthroughs in accuracy. Vigneswaran et al. [29] used a deep neural network (DNN) to predict attacks directed at network IDSs (NIDSs). The well-known KDD-CUP99 [30] dataset was used for training and benchmarking, revealing that a DNN with three layers outperformed all other classical machine learning algorithms at the time. Xiao et al. [31] proposed an IDS that reduces the number of required CNN features for computational efficiency. The KDD-CUP99 dataset was again used, showing reduced FPs and improved speeds. Belarbi et al. [32] proposed a multiclass NIDS based on a deep belief network (DBN) using the CICIDS2017 dataset to train and evaluate performance. The experimental results demonstrated that DBNs can surpass traditional multilayer perceptron classification performance, significantly improving overall recall. In 2017, Vaswani et al. [33] proposed the transformer model, originally designed for language modeling and machine translation, where it achieved good results; this model has also been gradually applied to network IDSs. Wang et al. [34] proposed a robust unsupervised IDS (RUIDS) by introducing a masked context reconstruction module into a transformer-based, self-supervised learning scheme. Extensive experiments on four intrusion datasets were conducted to demonstrate the effectiveness and robustness of the RUIDS. Yang et al. [5] proposed an IDS based on an improved vision transformer, demonstrating superior results on the public NSL-KDD intrusion detection dataset via simulation experiments.

C. Hybrid Solutions
As noted, CNNs, RNNs, and DBNs are among the most common IDS solutions used to mitigate imbalanced data problems [36]. Hybrid models have recently become popular, based on their observed improvements to symbiotic and amplified model strength [37]. Indeed, research has shown that combined models consistently perform better than individual algorithms [38]. Table I lists representative hybrid IDS models and summarizes their basic algorithmic models, dataset properties, classification types, and accuracy results. This listing is explained in the subsequent narrative.
1) Focused neural network combinations: Zhang et al. [39] proposed an IDS model based on an improved genetic algorithm with a DBN, trained and evaluated using the NSL-KDD dataset, demonstrating effective improvements in intrusion recognition rates (> 99%). Wu et al. [40] proposed a hierarchical CNN + RNN model (i.e., LuNet) that effectively extracts spatial and temporal data features, providing higher detection accuracy and fewer FPs than peer methods. LuNet's verification accuracies on the NSL-KDD and UNSW-NB15 datasets were 99.24% and 97.40%, respectively. Souza et al. [41] proposed a hybrid binary classification model comprising a DNN with a k-nearest-neighbors (kNN) function. This method achieved higher accuracy than classical machine learning methods, with 99.77% on the NSL-KDD dataset and 99.85% on the CICIDS2017 dataset. Albahar et al. [42] proposed an approach that combines a regularization algorithm with an artificial neural network, achieving all-time-high true-positive (TP) and accuracy rates on the NSL-KDD, UNSW-NB15, and CIDDS-001 datasets (i.e., 98.53, 94.58, and 97.87%, respectively) using 10-fold cross-validation. Ahsan et al. [43] proposed a hybrid CNN with a long short-term memory (LSTM) network, achieving the highest known accuracy (at the time) of 99.70% on the NSL-KDD dataset. Banaamah et al. [44] adopted a CNN with an LSTM and a gated recurrent unit (GRU) model to improve internet-of-things (IoT) security. Using the Bot-IoT dataset, the proposed model achieved an accuracy of 99.8%, surpassing previously reported results. Kamalakkannan et al. [45] developed an improved CNN + LSTM model that learns spatial and temporal data characteristics, demonstrating 98% accuracy and a 98.14% average detection rate on the NSL-KDD dataset. Shivhare et al. [46] proposed a CNN + LSTM + SVM model to tackle multiclass tasks on the CICIDS2017 dataset, achieving an accuracy of 97.29%. Qazi et al. [47] proposed a deep-layered CNN + RNN model to detect and classify malicious traffic using the CICIDS-2018 dataset, achieving an average accuracy of 98.90%.
Recently, the use of transformers has provided new feature extraction methods. Transformers are deep neural networks wholly based on attention mechanisms that have shown great success in natural language processing (NLP). Their versatility allows them to be applied to other domains, such as image classification and cybersecurity. Xing et al. [48] sought to improve unknown attack learning and detection by extracting data features from different perspectives using CNN and transformer models. Xiang et al. [49] later proposed a transformer-based fusion deep learning architecture in which the transformer is used to adjust the ML-CNN-BiLSTM model to enhance its feature encoding ability. Ullah et al. [50] proposed an IDS using transformer-based transfer learning for imbalanced network traffic (IDS-INT). IDS-INT uses transformer-based transfer learning to learn feature interactions in network feature representations, even with imbalanced data. A hybrid CNN-LSTM model was then developed to detect attacks from deep features.
2) Focused data- and algorithm-level combinations: Yan et al. [51] proposed a novel combinatorial IDS model based on a deep RNN and a region-adaptive SMOTE technique. This model significantly improved the detection rate of low-frequency attacks and overall efficiency while improving unknown attack detection. Al et al. [52] proposed a hybrid CNN + LSTM model combined with SMOTE and the Tomek-Link sampling method (i.e., STL) to improve system performance. Cao et al. [36] designed a CNN + GRU model that extracts spatiotemporal features from network data traffic. This model combines adaptive synthetic sampling (ADASYN) and repeated edited nearest neighbors (RENN) to process positive and negative sample imbalances in the original dataset. This model resolves both low classification accuracy and imbalance problems.

D. Motivation for and Purpose of this Study
The literature discussed above shows that systems combining two or more algorithms can often obtain better detection capabilities than single algorithms. This comes with an increase in computational cost. Therefore, a reasonable choice of classification algorithms is key to achieving better detection results at a comparable computational cost.
The CNN model was selected as one of the classification algorithms in this paper because it can comprehensively map the data features, mine the relationships between features, and improve the accuracy of feature extraction. However, the CNN model focuses on local spatial features, whereas the traffic data studied in this paper also have time-series characteristics. Therefore, the ability to process sequence data was emphasized in selecting the second classification algorithm. RNN, GRU, LSTM, and Transformer are all sequential models in deep learning. Compared with RNN and LSTM, the Transformer model can obtain the relationships among all the information in the sequence through the self-attention mechanism, which better copes with the long-term dependency problem and yields higher accuracy. The model can also be operated in parallel, so computation is faster. For these reasons, the CNN and Transformer models were chosen as the algorithms for this paper's hybrid intrusion detection system.
In addition, previous studies have primarily focused on the overall detection rate of the system, but for typical unbalanced network traffic data, identifying a small number of attack samples is the key to detection and classification. Therefore, the difference between this paper and previous studies is that the system focuses more on the identification rate of minority classes without significantly affecting the overall detection rate. To achieve this goal, the system balances the sample sizes of the majority and minority classes at the data level through data resampling, so that the data suit common classifiers that pursue global accuracy.

III. PROPOSED MODEL
The model proposed in this study uses the NSL-KDD and CICIDS2017 datasets as the research targets. The original data were numericalized and standardized, and new training, validation, and testing sets were formed by random sampling. Majority class samples were randomly undersampled to alleviate the sample imbalance problem.
The focus of this model is the classification of imbalanced data, which is addressed at two levels. First, at the data level, a data reconstruction strategy is used to adjust the internal distribution structure of the data so that the imbalanced dataset tends toward a balanced state. This is done by randomly undersampling the majority class samples in the training set and oversampling the minority class samples with SMOTE to achieve balanced data.
Second, at the algorithm level, the model adjusts the traditional classification algorithm or optimizes and improves existing classification ideas as an adaptation technique to handle the inherent characteristics of imbalanced datasets, thereby improving the overall recognition ability of the model. Research has shown that combined models consistently perform better than individual algorithms [38]. As mentioned, we combined the classic CNN with a transformer self-attention module to achieve optimization by combining multiple classifiers that adapt to the internal distribution structure of imbalanced datasets. Hence, the detection rate of the model is improved.
This model accounts for both data- and algorithm-level aspects of the problem and utilizes their combined advantages to achieve superior recognition accuracy with minority class samples. Fig. 1 presents a schematic diagram of our proposed model.

A. Datasets
1) NSL-KDD Dataset: According to [5], NSL-KDD [53] and KDD-CUP99 [54] are the most widely used datasets in IDS research (ca. 2012-2022). The NSL-KDD dataset was generated in 2009 and is commonly used to train models for anomaly detection. It is a revised version of the classic KDD99 dataset but retains its structure. The new dataset consists of four subsets: KDDTest+, KDDTrain+, KDDTest-21, and KDDTrain+_20%, where the latter two are subsets of the first two, respectively.
In the NSL-KDD dataset, each sample record contains 41 attribute features and a classification identifier. Normal and abnormal network connections are marked with the classification identifier. The normal type is represented as "normal," and the abnormal records carry 39 attack identifiers. These identifiers are divided into four categories by type: denial of service (DoS), probe, remote-to-local (R2L), and user-to-root (U2R).
Our experiment uses the original data sources of KDDTrain+ (125,973 sample records) and KDDTest+ (22,544 sample records). Table II presents the sample size distributions of each attack type.
Table II shows that the NSL-KDD dataset is a typical imbalanced dataset. Notice the small proportion of Probe, R2L, and U2R attack-type samples, especially for U2R attacks.
2) CICIDS2017 Dataset: Although the NSL-KDD dataset is very popular in IDS studies, some researchers have pointed out that it is somewhat outdated. Emerging datasets include UNSW-NB15, CICIDS2017, Bot-IoT, and others. Among them, CICIDS2017 is the most popular. Therefore, we chose CICIDS2017 as our second benchmark to gauge performance differences.
The CICIDS2017 dataset was released in 2017 [55], providing normal data and the latest common attack types, similar to real-world data. It contains 2,830,743 network traffic samples, each containing 83 network traffic features. It also includes one benign and 14 attack categories, including the standard DoS, botnet, web, infiltration, FTP-Patator, and SSH-Patator types [56]. Among the 14 attack categories, tags with similar features and behaviors are merged to form five new categories. The distribution of the number of samples in the CICIDS2017 dataset is shown in Table III. The CICIDS2017 dataset is also imbalanced, with Bot and Web Attack class samples being particularly scarce.

B. Data Preprocessing
Data preprocessing is described here using the NSL-KDD dataset as an example; the CICIDS2017 dataset was processed in the same way.
1) Numericalization: The NSL-KDD dataset contains 41 attribute features (i.e., 38 numeric and three non-numeric types). Because the input of the model must be a numeric matrix, it was necessary to map data with symbolic features into numeric feature vectors. We used the LabelEncoder method of the preprocessing module in the sklearn library to convert the three non-numeric features (i.e., protocol_type, service, and flag) into numeric features.
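The numericalization step can be illustrated with a short dependency-free sketch. It is not the authors' code: it imitates the integer coding that sklearn's LabelEncoder produces (codes assigned in sorted order of the distinct symbols) using only the standard library, and the helper names `fit_label_codes` and `encode_column` and the toy records are hypothetical.

```python
def fit_label_codes(values):
    """Map each distinct symbol to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def encode_column(values, codes):
    """Replace every symbol in a column with its integer code."""
    return [codes[v] for v in values]


# Toy records: (protocol_type, service, flag); the values are illustrative.
records = [
    ("tcp", "http", "SF"),
    ("udp", "domain_u", "SF"),
    ("icmp", "ecr_i", "SF"),
    ("tcp", "ftp", "REJ"),
]

columns = list(zip(*records))                      # one tuple per feature column
encoders = [fit_label_codes(col) for col in columns]
encoded_columns = [encode_column(col, enc) for col, enc in zip(columns, encoders)]
encoded_records = list(zip(*encoded_columns))      # back to one tuple per record
print(encoded_records)
```

In practice the mapping would presumably be fitted on the training data and then reused for the validation and testing sets, so that the same symbol always receives the same code.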
2) Standardization: Unlike normalization, which is easily affected by outliers, standardization is relatively stable; thus, it is suitable for noisy big data scenarios. Therefore, standardization was used for data preprocessing. The original data were transformed so that each feature has a mean of zero and a standard deviation of one. The StandardScaler method of the preprocessing module in the sklearn library uses the standard z-score scaling formula, expressed in Eq. (1):

$$X' = \frac{X - \mathrm{mean}}{\sigma} \tag{1}$$

where $X'$ represents the converted data value, $X$ is the original data value, mean is the mean value of the column data, and $\sigma$ is the standard deviation of the column data.
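As a worked illustration of the z-score formula, the following sketch standardizes a single feature column with the standard library only. The column values and the function name `standardize_column` are hypothetical; the population standard deviation is used, which is also what StandardScaler computes, and a constant column is left unscaled to avoid division by zero.

```python
from statistics import fmean, pstdev


def standardize_column(column):
    """Return (scaled values, mean, std) using the population standard deviation."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:            # constant feature: avoid division by zero
        std = 1.0
    return [(x - mean) / std for x in column], mean, std


duration = [0.0, 2.0, 4.0, 6.0, 8.0]    # illustrative feature column
scaled, mean, std = standardize_column(duration)
print(mean, std, scaled)
```

The mean and standard deviation would normally be estimated on the training set and reused unchanged for the validation and testing sets.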

C. Dataset Partitioning
The KDDTrain+ and KDDTest+ subsets of the NSL-KDD dataset were used as the original data, and new training, validation, and testing sets were formed by random sampling. Table IV lists the number and proportion of samples in each set after division. The CICIDS2017 dataset was divided according to the same ratio, and the resulting counts are also listed in Table IV. To achieve good data balance, undersampling and oversampling were performed on the training set samples.

D. Data Balancing
1) Undersampling: Random undersampling (RUS) achieves data equalization by randomly removing a certain proportion of majority class instances from the dataset [23]. In the NSL-KDD dataset, Normal and DoS samples belong to the majority class and were undersampled using RUS. The BENIGN and DoS/DDoS samples of the CICIDS2017 dataset belong to the majority class and were likewise undersampled.
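A minimal sketch of random undersampling is given below, under stated assumptions. The per-class target counts, the toy class sizes, and the function name `random_undersample` are hypothetical, since the paper does not state the counts retained for each majority class.

```python
import random
from collections import Counter


def random_undersample(samples, labels, targets, seed=0):
    """Randomly keep at most targets[c] samples of each class c listed in targets."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    kept = []
    for label, indices in by_class.items():
        limit = targets.get(label)
        if limit is not None and len(indices) > limit:
            indices = rng.sample(indices, limit)   # drop the surplus at random
        kept.extend(indices)
    kept.sort()                                    # preserve the original order
    return [samples[i] for i in kept], [labels[i] for i in kept]


labels = ["Normal"] * 60 + ["DoS"] * 30 + ["Probe"] * 8 + ["U2R"] * 2
samples = list(range(len(labels)))                 # stand-ins for feature vectors
X_rus, y_rus = random_undersample(samples, labels, {"Normal": 20, "DoS": 20})
print(Counter(y_rus))
```

Only the classes named in `targets` are reduced; minority classes pass through untouched so that they can be oversampled afterwards.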
2) Oversampling: Oversampling rebalances a dataset by creating synthetic minority instances. SMOTE [22] is the best method [57] in our case, as it effectively compensates for the shortcomings of random oversampling and is superior to simple replication, which can easily cause model overfitting and weaken generalizability. SMOTE also has the advantages of a simple design and strong robustness. It uses interpolation between minority class samples and their nearest neighbors to generate new synthetic samples [58]. The SMOTE steps are as follows:
a) For each minority class sample $x_i$, its $K$ nearest neighbors within the minority class are found.
b) One of these neighbors, $y_j$, is selected at random.
c) A new sample is synthesized by interpolation according to Eq. (2):

$$x_{new} = x_i + \mathrm{rand}(0,1) \times (y_j - x_i) \tag{2}$$

where $x_i$ is an observation point in the minority class, $y_j$ is a randomly selected $K$-nearest neighbor, and $\mathrm{rand}(0,1)$ represents a random number generated between zero and one.
d) New samples are combined with the original data to form a new dataset.
In the NSL-KDD dataset, Probe, R2L, and U2R samples belong to the minority class and were oversampled with SMOTE to increase the number of class samples. For the CICIDS2017 dataset, the Bot, Brute Force, PortScan, and Web Attack samples belong to the minority class and were oversampled.
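The interpolation at the heart of SMOTE can be sketched in a few lines of standard-library Python. This is a simplified illustration rather than the implementation used in the paper: base samples are drawn at random instead of being visited in turn, and the function name `smote`, the value of `k`, and the toy points are assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating toward k-nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x_i = minority[i]
        # k nearest minority-class neighbors of x_i (excluding x_i itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x_i, p))[:k]
        y_j = rng.choice(neighbors)
        gap = rng.random()                         # rand(0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x_i, y_j)])
    return synthetic


u2r = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy minority samples
new_points = smote(u2r, n_new=6, k=2)
balanced_u2r = u2r + new_points                    # merge synthetic and original samples
print(len(balanced_u2r))
```

Because every synthetic point lies on the segment between a minority sample and one of its neighbors, the new data stay inside the region already occupied by the minority class.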
The training set samples were balanced at the data level via undersampling and oversampling.

E. Model Structure
1) CNN: CNNs are feedforward neural networks with convolution calculations and a deep structure that extract features accurately and efficiently [59]. The error function is obtained by calculating the difference between the actual and predicted values. Network parameters are adjusted through backpropagation until the model reaches an optimal solution [60]. This method has been widely used in several fields, such as NLP and computer vision.
A CNN generally comprises a convolution layer, an activation function, a pooling layer, and a fully connected layer [61], as shown in Fig. 2. The convolution layer extracts high-level features from the input data, and the pooling layer performs feature selection and information filtering on the feature maps output by the convolution layer, thereby reducing the amount of data processing.
2) Transformer: A transformer is a deep learning model [33] that is widely used for NLP and other sequential data processing tasks.
The transformer differs from traditional RNNs and CNNs in that it adopts a self-attention mechanism that allows the model to assign different weights to different elements when processing input sequences. It calculates the similarity score between elements and uses the scores to compute weighted averages over the elements. Notably, the transformer supports parallel computing, which allows it to handle long sequences without step-by-step iterations. The self-attention mechanism also allows the transformer to incorporate information from the entire sequence into its calculations, which leads to better modeling of long-range dependencies. CNNs are particularly adept at modeling fine-grained local features due to their convolutional operations and hierarchical structure. Nevertheless, their global modeling ability is weak, whereas the transformer excels at modeling global contextual information [62]. The proposed framework utilizes the complementary characteristics of the CNN and the transformer to extract local, spatial, and time-series features.
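The self-attention computation described above (similarity scores between elements, followed by weighted averages) can be sketched as follows. This is a deliberately reduced illustration: the queries, keys, and values are all taken to be the input itself, without the learned projection matrices and multiple heads of a full transformer encoder, and the function names and toy sequence are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    # Similarity score between every pair of sequence elements.
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x] for q in x]
    weights = [softmax(row) for row in scores]
    # Each output element is a weighted average of all input elements.
    output = [[sum(w * v[c] for w, v in zip(row, x)) for c in range(d)] for row in weights]
    return output, weights


sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 elements, 2 features each
context, attn = self_attention(sequence)
print(attn)
```

Every row of `attn` is computed independently of the others, which is why the operation parallelizes well and why each output element can draw on the whole sequence at once.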
3) Hybrid model: This article adopts a hybrid architecture that combines the CNN and the transformer, as illustrated in Fig. 3. After preprocessing and sample balancing, spatial features are extracted in one-dimensional (1D) convolutional and pooling layers. Then, the self-attention mechanism of the transformer processes the data, overcoming the RNN's short-term memory and the CNN's difficulty in learning long-range dependencies, and temporal and global features are extracted. Finally, using flattening and fully connected layers, the data are classified according to attack type. For the NSL-KDD dataset, the data were divided into five categories: one normal and four attack categories. The CICIDS2017 dataset was divided into six categories: one benign and five attack categories.
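The 1D convolution and pooling stage that opens the hybrid model can be illustrated with a dependency-free sketch. It is not the authors' Keras implementation: the filter weights, the input vector, and the function names `conv1d` and `max_pool1d` are hypothetical, and a real layer would learn many filters rather than use one fixed kernel.

```python
def conv1d(signal, kernel):
    """Valid 1D convolution (cross-correlation, as in deep learning libraries) with ReLU."""
    width = len(kernel)
    return [
        max(0.0, sum(signal[i + j] * kernel[j] for j in range(width)))
        for i in range(len(signal) - width + 1)
    ]


def max_pool1d(features, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(features[i:i + size]) for i in range(0, len(features) - size + 1, size)]


record = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 1.0]     # toy standardized feature vector
kernel = [1.0, 0.0, -1.0]                          # hypothetical filter weights
feature_map = conv1d(record, kernel)
pooled = max_pool1d(feature_map, size=2)
print(feature_map, pooled)
```

The pooled feature map is shorter than the input, which is the reduction in data volume that the pooling layer is meant to provide before the sequence is handed to the self-attention stage.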

IV. EXPERIMENTS

A. Evaluation Indicators
Commonly used evaluation indicators for classification problems are accuracy (ACC), precision (PRE), recall (i.e., TPR), false-positive rate (FPR), and F1-measure. It is necessary to adopt reasonable evaluation criteria for unbalanced data, including the F1-measure, G-mean, receiver operating characteristic (ROC) curve, and area under the ROC curve (AUC) values.
In the following equations, TP, TN, FP, and FN denote the numbers of true-positive, true-negative, false-positive, and false-negative samples, respectively. Accuracy is defined by Eq. (3), which reflects the percentage of correctly predicted samples among the total number of predicted samples:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

Precision is the ratio of correctly predicted positive samples to the total number of samples predicted as positive, as shown in Eq. (4):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$

Recall describes the ratio of the number of correctly predicted positive samples to the total number of actual positive samples, as formulated in Eq. (5):

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$

FPR is the number of false-positive samples divided by the total number of actual negative samples, as defined by Eq. (6):

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{6}$$

The F1-measure is a comprehensive assessment of precision and recall and represents the harmonic average between them, as defined by Eq. (7):

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{7}$$

The G-mean comprehensively considers both recall and precision; it is the geometric mean of sensitivity (i.e., hit rate or recall) and precision, so a high G-mean value indicates good performance on both. The G-mean is defined in Eq. (8):

$$G\text{-}\mathrm{mean} = \sqrt{\mathrm{Recall} \times \mathrm{Precision}} \tag{8}$$

The ROC curve plots FPR on the horizontal axis and TPR on the vertical axis. Each threshold corresponds to a point (FPR, TPR), and all points are connected as the threshold changes.
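The indicators above can be computed directly from confusion-matrix counts, as in the following sketch. The counts are made up for illustration, the function name `binary_metrics` is hypothetical, and the G-mean follows the text's description as the geometric mean of recall and precision; the code assumes no denominator is zero.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute the evaluation indicators from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)             # also the true-positive rate (TPR)
    fpr = fp / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * precision)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "fpr": fpr,
        "f1": f1,
        "g_mean": g_mean,
    }


m = binary_metrics(tp=80, tn=890, fp=10, fn=20)    # illustrative counts
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 0.97 while recall is only 0.80, which illustrates why accuracy alone can hide weak detection of a rare class.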
Although the ROC curve can comprehensively and intuitively express the performance of a classifier, it cannot provide a specific value. Therefore, it is usually evaluated using the area under the curve (AUC), as defined in Eq. (9):

$$\mathrm{AUC} = \int_0^1 \mathrm{TPR} \; d(\mathrm{FPR}) \tag{9}$$

AUC values range from zero to one; the larger the AUC, the better the classification performance.
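A small sketch shows how the ROC points and the AUC could be obtained from classifier scores by sweeping the threshold and applying the trapezoidal rule. The scores and labels are hypothetical, the function names are illustrative, and the code assumes distinct scores and at least one sample of each class.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lowering the threshold admits samples in order of decreasing score.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]    # hypothetical classifier scores
labels = [1, 1, 0, 1, 0, 0]                # 1 = attack, 0 = normal
roc = roc_points(scores, labels)
area = auc(roc)
print(roc, area)
```

The resulting area equals the fraction of (attack, normal) pairs in which the attack sample receives the higher score, which is 8 of 9 pairs in this toy example.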

B. Experimental Results
The experiment was conducted on a desktop with an Intel 3.10 GHz processor, 64 GB of memory, no GPU acceleration, and a 64-bit Windows 11 operating system. The programming tool was Keras 2.9.0, based on TensorFlow. The NSL-KDD and CICIDS2017 datasets were used to train the model shown in Fig. 1. For the NSL-KDD dataset, owing to the small amount of data, the batch size was set to 256, and the number of training epochs was set to 200. The CICIDS2017 dataset contains a considerable amount of data. To accelerate the convergence of the model, the batch size was set to 512, and 40 epochs of training were performed. Finally, the best-performing model parameters on the corresponding datasets were obtained. Subsequently, the model with the optimal parameters was tested on the testing set to obtain classification results, and the confusion matrices were constructed as shown in Fig. 4 and Fig. 5.
Multiclass classification experiments were conducted for different attack categories. The NSL-KDD dataset included Normal, DoS, Probe, U2R, and R2L classes. The CICIDS2017 dataset consisted of BENIGN and five attack classes: Bot, Brute Force, DoS/DDoS, PortScan, and Web Attack. The experimental results are presented in Tables V and VI, respectively. For majority class samples, the classification performance of the model was good. For the minority class samples, the model's classification performance decreased to some extent; however, the degree of decrease was not significant. The model does not sacrifice the classification performance of other categories to improve the classification accuracy of any specific category. Therefore, the overall classification performance of the model is well balanced.
The overall classification results of the model are presented in Table VII. Although the overall accuracy was not very high, the model did not sacrifice the classification performance of minority classes in exchange for higher overall accuracy. Therefore, the model showed little difference in classification performance between the majority and minority classes. Moreover, it tended to improve the recognition rate of minority classes (e.g., U2R and R2L in the NSL-KDD dataset and Bot and Web Attack in the CICIDS2017 dataset).
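Per-class results such as these are typically derived from the multiclass confusion matrix by treating each class in turn as the positive class (one-vs-rest). The sketch below illustrates the bookkeeping on a made-up three-class matrix; the numbers are not taken from Fig. 4 or Fig. 5, and the function name `per_class_counts` is hypothetical.

```python
def per_class_counts(confusion, class_index):
    """One-vs-rest TP, FP, FN, TN for one class; rows = actual, columns = predicted."""
    n = len(confusion)
    tp = confusion[class_index][class_index]
    fn = sum(confusion[class_index]) - tp
    fp = sum(confusion[r][class_index] for r in range(n)) - tp
    total = sum(sum(row) for row in confusion)
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


classes = ["Normal", "DoS", "U2R"]
confusion = [
    [90, 8, 2],     # actual Normal
    [5, 94, 1],     # actual DoS
    [3, 1, 6],      # actual U2R
]
for i, name in enumerate(classes):
    tp, fp, fn, tn = per_class_counts(confusion, i)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    print(f"{name}: recall={recall:.3f} precision={precision:.3f}")
```

In this toy matrix the rare class has far lower recall than the majority classes even though overall accuracy is high, which is the kind of gap that per-class reporting is meant to expose.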

C. Analysis and Discussion
1) Impact of model structure on results: In this section, the structure of the proposed model is discussed.We compared the classification effects of the model before and after data balancing and the single-network model with the hybrid model of both.The following conclusions were drawn from the NSL-KDD dataset, as listed in Table VIII.
The overall effect of the model after data balancing was better than that of the model without data balancing.Moreover, the impact of the hybrid model was better than that of the single-network model.
At the same time, data balancing improves the classification of minority classes. Fig. 6 and Fig. 7 compare the precision before and after data balancing for the minority classes U2R and R2L, respectively. The figures show that, for both the single-algorithm models and the hybrid model, the classification accuracy increased to varying degrees after data balancing. This confirms the necessity of data balancing operations.
Similar conclusions were drawn for the CICIDS2017 dataset. The hybrid model performed better than the single-network model, and data balancing improved the accuracy and precision indicators, as shown in Table IX. Fig. 8 and Fig. 9 compare the precision of the rare classes (Bot and Web Attack) in the CICIDS2017 dataset before and after data balancing. As with the NSL-KDD dataset, these figures show that data balancing improves the classification accuracy of minority classes.
2) Impact of sampling rate on results: The previous section showed that data balancing benefits minority class detection. In this section, we compare different sampling rates for rare classes to explore their impact. For the NSL-KDD dataset, we checked the U2R category. For the CICIDS2017 dataset, we checked the Bot and Web Attack categories due to their low representation. During model training, sampling rates of 100%, 300%, 500%, and 1,000% were applied to the given categories, and the best model obtained at each rate was evaluated on the testing set. The experimental results are presented in Tables X to XII. These results show that increasing the sampling rate significantly improved the recall rate, F-measure, and G-mean for rare categories. However, this had little impact on overall classification accuracy. Because rare classes make up only a small proportion of the original dataset, it was difficult for the model to learn to recognize them. Increasing the sampling rate is equivalent to giving the model more training opportunities for that category, thereby improving recall on the testing data. However, this improvement in the detection rate for minority classes comes at the cost of increased training time.
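The observation that rare-class recall and G-mean can rise sharply while overall accuracy barely moves follows from simple arithmetic. The sketch below illustrates it with hypothetical numbers (50 rare samples among 10,000 test samples) in a simplified two-class view (rare versus everything else). It assumes the common definition of G-mean as the square root of sensitivity times specificity; the function name and all values are illustrative and are not taken from Tables X to XII.

```python
import math

def rare_class_report(n_total, n_rare, rare_recall, other_recall):
    """Return (overall accuracy, G-mean) for a rare-vs-rest view.

    rare_recall  : fraction of rare samples detected (sensitivity)
    other_recall : fraction of non-rare samples classified correctly
                   (specificity in this simplified two-class view)
    """
    n_other = n_total - n_rare
    tp = rare_recall * n_rare      # rare samples correctly detected
    tn = other_recall * n_other    # other samples correctly classified
    accuracy = (tp + tn) / n_total
    g_mean = math.sqrt(rare_recall * other_recall)
    return accuracy, g_mean

# Hypothetical scenario: rare-class recall rises from 0.20 to 0.80.
before = rare_class_report(10_000, 50, rare_recall=0.20, other_recall=0.99)
after = rare_class_report(10_000, 50, rare_recall=0.80, other_recall=0.99)

print(f"accuracy: {before[0]:.4f} -> {after[0]:.4f}")
print(f"G-mean:   {before[1]:.4f} -> {after[1]:.4f}")
```

In this hypothetical case, 30 additional rare samples are detected, which changes overall accuracy by only 0.3 percentage points, whereas the G-mean doubles.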

D. Comparisons of Experimental Results
We compared the above experimental results with methods from the relevant literature to verify our model's effectiveness on multiclassification problems using unbalanced data. We first compared results on the NSL-KDD data, as the related literature is abundant.
First, the classification accuracy of multiple classifications was compared, as presented in Table XIII. The classification FPRs of multiple classifications were then compared, as shown in Table XV. Apart from a few R2L cases, the FPR of our model was the lowest of all. Finally, for imbalanced data classification problems, the F1 measure is often more important than other metrics. Table XVI presents the results of the multicategory F1-measure comparisons. Apart from the U2R category, the F1 measure of our model was the best.
On the CICIDS2017 dataset, our model also showed advantages in accuracy and recall, as shown in Table XVII. These comparative analyses show that our hybrid model, based on data balancing and two deep learning networks, has clear advantages and achieves excellent results in multiclassification problems with unbalanced data.
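For readers who wish to reproduce per-class comparisons of this kind, the following sketch shows one common way to derive recall, precision, FPR, and F1 for each class from a multiclass confusion matrix using a one-vs-rest view. The function name, the three-class label set, and the matrix values are hypothetical and chosen only for illustration; they are not the results reported in Tables XIII to XVI.

```python
def per_class_metrics(cm, labels):
    """One-vs-rest recall, precision, FPR, and F1 for each class.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    report = {}
    for k, label in enumerate(labels):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                 # class k predicted as another class
        fp = sum(row[k] for row in cm) - tp  # other classes predicted as class k
        tn = total - tp - fn - fp
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"recall": recall, "precision": precision,
                         "fpr": fpr, "f1": f1}
    return report

# Hypothetical confusion matrix with one rare class (U2R, 10 samples).
labels = ["Normal", "DoS", "U2R"]
cm = [[90, 8, 2],
      [5, 94, 1],
      [3, 1, 6]]
report = per_class_metrics(cm, labels)
for label in labels:
    print(label, {name: round(value, 3) for name, value in report[label].items()})
```

The rare class here has a low FPR simply because few samples are ever predicted as U2R, which is one reason the F1 measure is often the more informative indicator for imbalanced data.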

V. CONCLUSIONS AND FUTURE RECOMMENDATIONS
Network intrusion detection systems (NIDSs) play a vital role in network security by identifying, preventing, and countering network threats. Owing to the large amount of unbalanced data collected in network datasets, FPs and omissions significantly reduce the detection efficiency of extant IDSs. This paper proposed a deep learning model that combines data balancing with a CNN + Transformer hybrid network. The data balancing step improves the distribution of the original dataset via undersampling and oversampling techniques. This redistribution increases the likelihood that the trained model identifies minority classes, and the experimental results show that it effectively improves their detection rate. At the algorithm level, the hybrid model improves recognition by fusing spatial and temporal features, and the experimental results show that the proposed system, which combines these processes, identifies anomalies more efficiently and accurately than any single-network model.
For the classic NSL-KDD and modern CICIDS2017 datasets, our model was more effective in multiclassification applications and was superior to existing IDS models in terms of accuracy, FPR, F1-measure, and other indicators. Notably, on the CICIDS2017 dataset, our model was superior to existing models in terms of accuracy and recall.
Although the model proposed in this paper has advantages over existing systems, several other data balancing techniques, such as the edited nearest neighbor, Tomek-Links, SMOTEBoost, and ADASYN methods, should be tested. Other network variants, such as LSTM, GRU, and DBN, should also be tested. The objective is to improve the detection of data classifications based on innovative model structures so that network security professionals and scholars can obtain better IDS results, even in the face of scarce data.

a) The numbers of majority samples, $N_1$, and minority samples, $N_2$, are calculated.
b) Based on the set sampling ratio, $r$, we calculate the number of majority class samples needing deletion, $N_1 - N_2 \times r$.
c) We randomly select samples from the majority class, $S_{maj}$, to form the sample set $E$; remove sample set $E$ from $S_{maj}$; and generate a new dataset, $S_{newmaj} = S_{maj} - E$.
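The random undersampling steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function name, the fixed seed, and the rounding of the retained count to an integer are assumptions made for the example.

```python
import random

def random_undersample(majority, minority, ratio, seed=0):
    """Randomly drop majority samples until about len(minority) * ratio remain.

    majority, minority : lists of samples (any objects)
    ratio              : sampling ratio r (assumed majority-to-minority target)
    """
    n1, n2 = len(majority), len(minority)       # step a)
    n_delete = max(0, n1 - int(n2 * ratio))     # step b): N1 - N2 * r
    rng = random.Random(seed)
    # step c): the indices in `drop` form the sample set E
    drop = set(rng.sample(range(n1), n_delete))
    return [s for i, s in enumerate(majority) if i not in drop]

# Hypothetical data: 1,000 majority samples, 50 minority samples, r = 4.
majority = list(range(1000))
minority = list(range(50))
kept = random_undersample(majority, minority, ratio=4)
print(len(kept))
```

If the majority class is already smaller than the target size, the function returns it unchanged.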

a) For each sample $X$ in the minority class, k-NN is used to find its $K$ nearest neighbors among the minority class samples.
b) We determine the sampling rate, $N$, based on the sample imbalance ratio and randomly select $N$ samples from the $K$ nearest neighbors for random linear interpolation.
c) We construct a new minority class sample using Eq. (2).
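The oversampling steps above can be sketched as follows. Because Eq. (2) is not reproduced here, the sketch assumes the standard SMOTE interpolation, in which a new sample is the original sample plus a random fraction of the difference to one of its neighbors. The function name, the parameter names, and the choice to draw neighbors with replacement are illustrative assumptions.

```python
import math
import random

def smote(minority, n_per_sample, k=5, seed=0):
    """Generate n_per_sample synthetic points for each minority sample.

    minority     : list of feature vectors (lists of floats)
    n_per_sample : sampling rate N (synthetic points per original sample)
    k            : number of nearest neighbors K
    """
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # step a): K nearest minority-class neighbors of x, excluding x itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        if not neighbors:
            continue  # a single-sample class cannot be interpolated
        for _ in range(n_per_sample):
            nn = rng.choice(neighbors)  # step b): pick a random neighbor
            lam = rng.random()          # interpolation weight in [0, 1)
            # step c): assumed standard form x_new = x + lam * (nn - x)
            synthetic.append([a + lam * (b - a) for a, b in zip(x, nn)])
    return synthetic

# Hypothetical minority class: four points on the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_per_sample=3, k=2)
print(len(new_points))
```

Each synthetic point lies on the line segment between a minority sample and one of its neighbors, so the new samples stay inside the region already occupied by the minority class.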

Fig. 5. Confusion matrix of classification results of the CICIDS2017 dataset.

TABLE I. SUMMARY OF THE HYBRID INTRUSION DETECTION SYSTEM

TABLE II. DISTRIBUTION OF VARIOUS SAMPLES FROM THE NSL-KDD DATASET
Columns: Dataset; The number and proportion of various types of samples

TABLE IV. NUMBER AND PROPORTION OF DATASETS AFTER PARTITIONING

TABLE V. FIVE CLASSIFICATION RESULTS FOR THE NSL-KDD DATASET

TABLE VIII. COMPARISON OF THE RESULTS OF THE NSL-KDD DATASET UNDER DIFFERENT MODEL CONFIGURATIONS

TABLE IX. COMPARISON OF THE RESULTS OF THE CICIDS2017 DATASET UNDER DIFFERENT MODEL CONFIGURATIONS

Comparison of Web Attack class accuracy rate before and after data balancing.

TABLE XII. COMPARISON OF RESULTS FOR THE WEB ATTACK CATEGORY UNDER DIFFERENT SAMPLING RATES

Table XIII. Our model had the highest classification accuracy for all five categories, and there were no cases in which the accuracy of a specific category was

TABLE XIII. ACCURACY COMPARISONS OF FIVE CLASSIFICATIONS

Next, multi-classification recall rates were compared, and the results are listed in Table XIV. The table shows that the recall rates of the DOS and R2L categories were the highest compared with those reported in the relevant literature. The difference between the other three categories and the highest values in the literature was insignificant.

TABLE XIV. RECALL COMPARISONS OF FIVE CLASSIFICATIONS

TABLE XV. FPR COMPARISONS OF FIVE CLASSIFICATIONS

TABLE XVI. F1-MEASURE COMPARISON OF FIVE CLASSIFICATIONS

TABLE XVII. COMPARISON OF THE RESULTS OF THE CICIDS2017 DATASET