A Review on Feature Selection and Ensemble Techniques for Intrusion Detection System

Intrusion detection has drawn considerable interest as researchers endeavor to produce efficient models that offer high detection accuracy. Nevertheless, the challenge remains to develop a reliable and efficient Intrusion Detection System (IDS) capable of handling large amounts of data whose trends evolve in real time. The design of such a system depends on the detection methods used, particularly the feature selection techniques and machine learning algorithms. Motivated by this, this paper presents a review of feature selection and ensemble techniques used in anomaly-based IDS research. Dimensionality reduction methods are reviewed, followed by a categorization of feature selection techniques that illustrates their effect on training and detection. Selecting the most relevant features in the data has been proven to improve detection in terms of both accuracy and computational efficiency, hence its important role in the design of an anomaly-based IDS. We then analyze and discuss a variety of machine learning techniques for IDS with various detection models (single-classifier or ensemble-based) to illustrate their significance and success in the intrusion detection area. Beyond supervised and unsupervised learning methods, ensemble methods combine several base models to produce one optimal predictive model and improve the accuracy of the IDS. The review consequently focuses on ensemble techniques employed in anomaly-based IDS models and illustrates how their use improves the performance of these models. Finally, the paper discusses open issues in the area and offers research trends to be considered by researchers in designing efficient anomaly-based IDSs.

Keywords—Intrusion detection system (IDS); anomaly-based IDS; feature selection (FS); ensemble

in different methods of feature selection and ensemble detection for anomaly-based intrusion detection systems. Furthermore, future trends and open issues are also addressed.

D. Contributions
The contributions of this review are summarized below:
• Classification of detection methods in IDS. For each method, the feature selection technique used, classification type, evaluation tool, and dataset are given.
• Classification of machine learning techniques used in anomaly-based IDS.
• Identification of feature selection for anomaly-based IDS, as summarized in Tables II and III.
• Identification of ensemble classification for anomaly-based IDS.
• Presentation of future directions on the state-of-the-art anomaly-based feature selection and ensemble classification.
To achieve the mentioned contributions, the following research questions guide this analysis; the responses are given in the sections below:
RQ1. What are the detection methods utilized for IDS?
RQ2. Which evaluation tools are utilized to assess the effectiveness of the IDS?
RQ3. What are the datasets reported in the review to be used in anomaly-based IDS?
RQ4. Which feature selection methods are used for anomaly-based IDS?
RQ5. What are the machine learning algorithms used for detecting intrusions in anomaly-based detection?
RQ6. Which of the ensemble techniques included in the review are reported to be used in anomaly-based IDS?
The remainder of this paper is organized as follows: Section 2 provides an overview of detection methods in IDS. Section 3 presents the taxonomy of machine learning algorithms and methods employed in IDS, while Section 4 reviews and compares different feature selection techniques. Next, Section 5 covers ensemble classification algorithms and their methods employed in anomaly-based IDS. Section 6 discusses open issues and future trends for IDSs, and finally Section 7 concludes this survey.
II. DETECTION METHODS

Intrusion detection methods are classified into four groups based on the detection method used in the system: signature-based, anomaly-based, specification-based, and hybrid. In signature-based detection, the IDS identifies threats when system or network activity matches a threat pattern (called a signature) stored in the IDS local database, and an alert is raised. Signature-based IDSs are effective and efficient in identifying existing attacks, and their operation is simple to comprehend. However, this technique is not effective against zero-day attacks and new variants of previously identified attacks, because no associated signature yet exists for them [11]. In short, signature-based schemes offer very strong outcomes for popular, well-known threats, but they are unable to identify new, unseen attacks, even when these are designed as minimal variants of previously identified ones. Examples of signature-based IDSs are the Artificial Immune System (AIS) [12], the Collaborative Block Chained Signature-Based IDS (CBSigIDS) [13], and the IPFIX-based IDS (FIXIDS) [14].
Anomaly-based detection aims to model the system's ''ordinary'' pattern and produce an anomaly warning whenever the difference between an observed event and the normal pattern reaches a predetermined threshold. The key benefit of the anomaly-based detection method is its ability to recognize previously undiscovered attack incidents. Nevertheless, in anomaly-based systems the rate of false positives (FP), i.e., benign events wrongly classified as attacks, is typically higher than in the signature-based method, considering the possible inaccuracy of the modeled normal profile. Examples of anomaly-based IDSs are the Hybridized Feature Selection Approach (HFSA) [15], the Hybrid Anomaly Detection Model (HADM) [16], and the Unsupervised Heterogeneous Anomaly-Based IDS [17].
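To make the thresholding idea concrete, the following minimal sketch (with illustrative numbers, not taken from any cited system) profiles ''normal'' behavior by its mean and standard deviation and flags observations whose deviation exceeds a preset threshold:

```python
# Minimal sketch of threshold-based anomaly detection: model "normal"
# behaviour by the mean and standard deviation of a benign sample,
# then flag any observation whose z-score exceeds a preset threshold.
import math

def fit_normal_profile(samples):
    """Estimate the 'ordinary' pattern as (mean, std) of normal traffic."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, math.sqrt(var)

def is_anomaly(value, mean, std, threshold=3.0):
    """Raise an alert when the deviation from the profile exceeds the threshold."""
    if std == 0:
        return value != mean
    return abs(value - mean) / std > threshold

# Profile built from benign packet sizes (illustrative numbers).
mean, std = fit_normal_profile([500, 510, 495, 505, 498, 502])
print(is_anomaly(1500, mean, std))  # large deviation -> True (flagged)
print(is_anomaly(503, mean, std))   # within normal range -> False
```

The trade-off described above is visible here: lowering the threshold catches more attacks but flags more benign deviations, which is the source of the higher false-positive rate.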
For the specification-based detection method, a human expert manually constructs the desired template, which consists of a series of rules (specifications) that define the valid behavior of a device. If the parameters are sufficiently accurate, the template can identify unlawful patterns of behavior. In addition, the false positive rate is decreased, primarily because benign behaviors that were not previously observed are not flagged as intrusions in this type of system. Specifications can also be created using a formal tool; for example, with its sequences of states and transitions, the Finite State Machine (FSM) methodology appears suitable for modelling network protocols [18], [19]. Standard representation languages such as LOTOS, UML, and N-grammars can be considered for this purpose.
Hybrid detection aims to benefit from the strengths of each intrusion detection method, minimize their weaknesses, and build a strong scheme to detect intrusions. A common arrangement in hybrid detection is a primary signature-based detection system used in conjunction with an additional anomaly-based model. This integration of the two detection strategies in a ''Hybrid NIDS'' [20] aims to increase the final accuracy of signature-based intrusion detection models while eliminating the typically high false-positive level of network-based IDS (NIDS) approaches; hence, a hybrid approach is embraced by most existing platforms. Other examples of hybrid systems are the Signature-Based Anomaly Detection Scheme (SADS) [21], Artificial Bee Colony and Artificial Fish Swarm (ABC-AFS) [22], and the Hybrid Intrusion Detection Approach in Cloud Computing (HIDCC) [23]. Table I shows the types of detection methods utilized by researchers in IDS. RQ1, RQ2, and RQ3 are all answered in detail in the table, which specifies the detection method, the evaluation tool, the dataset used in the articles, and so on. From the table it is apparent that signature-based and specification-based detection methods did not utilize feature selection or ensemble classifiers to detect intrusions, in contrast to anomaly-based detection, which utilized both. As for evaluation tools and datasets, signature-based and specification-based models were deployed and validated using simulation and real data, while anomaly-based approaches were evaluated by experiments on standard IDS datasets. The NSL-KDD dataset is the most utilized dataset among the articles in this review.
III. MACHINE LEARNING IN ANOMALY-BASED IDS

Machine learning (ML) algorithms are classified into unsupervised learning and supervised learning, depending on the availability of a labeled training dataset. Fig. 1 illustrates the taxonomy of machine learning algorithms in anomaly-based IDS. Regarding RQ5, it has been noted that most studies focus on the following algorithms for IDS.
In supervised learning, the training function is provided with input and target output pairs, and a model is trained to predict the output at minimal expense. Supervised learning methods are classified by learning algorithm, framework, and objective function; support vector machines (SVM), decision trees, and artificial neural networks (ANN) are common categories.
For unsupervised learning there is no tag or label in the sample dataset. Unsupervised learning algorithms aim to capture the data's key features and form clusters of natural input patterns according to a particular cost function. Hierarchical clustering, K-means clustering, and self-organizing maps are the most common unsupervised learning methods. One challenge of unsupervised training is that it is hard to evaluate, because there is no supervising signal and therefore no labeled test data.

A. Supervised learning

1) Artificial Neural Network (ANN): ANN is one of the major machine learning algorithms and is widely utilized as the detection operator in IDSs in many studies. ANN is used to solve a variety of issues faced by other existing intrusion detection approaches and has been suggested as a substitute for the statistical analysis component of anomaly detection schemes. Initially, the ANN acquires its expertise by training on pre-selected, correctly labeled examples of the problem. The network's output is checked and its configuration optimized until the response on the training data reaches a sufficient level. Beyond the initial training phase, the neural network often gains expertise over time as it reviews problem-related data [24], [25]. A hypervisor-layer anomaly detection system called Hypervisor Detector, which employs a hybrid of the Fuzzy C-Means clustering algorithm and an Artificial Neural Network (FCM-ANN), was introduced to enhance detection accuracy [26]. The KDD Cup 99 sample dataset was used to test the reliability of the system on five attack categories. The model was good at finding normal and probe attacks, but did not yield good results for DOS (99.96-5.33), U2R (96.78-3.22) and R2L (93.73-6.27) attacks in terms of accuracy and false alarm rate.
A reasonable solution using ANN in hierarchical anomaly-based IDS is found in [27], which used neural Self Organization Map (SOM) networks to identify and distinguish normal packets from attack traffic. The proposed system was used to configure, train, and evaluate the SOM neural network for intrusion detection. Detection experiments evaluating the SOM's efficiency in detecting anomalous intrusions show that the SOM can distinguish attack packets from normal ones at 92.37% on the KDD Cup 99 dataset, while on NSL-KDD the detection rate is 75.49%.
The work in [28] tackles detection problems by presenting a simple ANN-based IDS, utilizing back-propagation and feed-forward algorithms together with various other optimization methods to minimize the total computing overhead while maintaining a high level of performance. Experiments on the NSL-KDD benchmark dataset showed that the proposed ANN achieved 98.86% accuracy and a 95.77% detection rate. An effective method to identify brute force attacks on the Secure Shell (SSH) was proposed by [29]: a brute force attack is carried out against a client-server SSH model deployed in a private cloud, and the server captures both attack-related and normal traffic. An ANN Multi-Layer Perceptron model then extracts indicative traffic characteristics and uses them to distinguish attack packets from normal ones. The results indicate that the suggested framework detects the attack successfully with high accuracy and a minimal false alarm rate.

2) Multi-layer Perceptron Neural Network (MLP):
MLP is a supervised learning classifier which utilizes the back-propagation algorithm in the learning phase to train the model. It can learn a non-linear approximation function for both regression and classification tasks; given a group of features and a target, it is distinguished from logistic regression by one or more non-linear layers, called hidden layers, between the inputs and outputs [44].
The MLP neurons are arranged in layers whose outputs always flow toward the output layer: a single layer is called a perceptron, and multiple layers form a multilayer perceptron [45]. Every neuron in one layer has direct connections to the neurons of the subsequent layer, and in many applications the units of these networks apply a sigmoid function as the activation function.
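A minimal sketch of the layered, sigmoid-activated forward pass just described (the weights and biases below are illustrative placeholders, not trained values):

```python
# Minimal sketch of an MLP forward pass: one hidden layer, sigmoid
# activations, fully connected layers. Weights here are illustrative,
# not the result of back-propagation training.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """Each neuron: sigmoid of the weighted sum of the previous layer."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    hidden = layer(x, hidden_w, hidden_b)   # hidden layer
    return layer(hidden, out_w, out_b)      # output layer

# 2 inputs -> 2 hidden neurons -> 1 output score in (0, 1)
score = mlp_forward([0.5, -1.0],
                    hidden_w=[[0.4, -0.6], [0.3, 0.8]], hidden_b=[0.1, -0.2],
                    out_w=[[1.2, -0.7]], out_b=[0.05])
print(score)  # a single probability-like value
```

Back-propagation training would then adjust the weights to reduce the error between this output and the target label.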
A wrapper-based feature selection was designed using the Discernibility Function as the search algorithm to construct feature subsets, with the MLP classifier used to evaluate the subsets. The C4.5 decision tree and the MLP classifier, both commonly utilized in IDS, were used to illustrate the improved classification rates. With this hybrid method, results on KDD Cup 99 show accuracy improvements of approximately 12% for U2R, 2% for Probe, and 1% for DOS classes [46].
To build an effective IDS, a hybrid of the multi-layer perceptron (MLP) and the Artificial Bee Colony (ABC) algorithm was designed. The MLP classifier was used to distinguish between attack and normal network traffic. Training and testing were conducted on the NSL-KDD dataset; experimental results show that the suggested solution gives a high detection rate of about 87.54% and an error rate of 0.124% [47].
3) K-nearest neighbor (KNN): The KNN algorithm is a non-parametric classification technique and a simple, straightforward machine learning algorithm. It is widely used, with many reported experiments in intrusion detection, pattern recognition, text categorization, and countless other areas [48].
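The KNN decision rule can be sketched in a few lines; the toy traffic records below are illustrative, not drawn from any cited dataset:

```python
# Minimal k-nearest-neighbour classifier sketch: label a new point by
# majority vote among the k closest training points (Euclidean distance).
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train is a list of ((features...), label) pairs."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy traffic records: (duration, bytes) -> class (illustrative data).
train = [((0.1, 200), "normal"), ((0.2, 220), "normal"),
         ((0.15, 210), "normal"), ((5.0, 9000), "attack"),
         ((4.8, 8800), "attack")]
print(knn_predict(train, (0.12, 205)))  # -> "normal"
print(knn_predict(train, (5.1, 9100)))  # -> "attack"
```

Being non-parametric, KNN needs no training phase, but every prediction scans the stored instances, which is why feature reduction matters for its runtime.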
A combination of the Learning Vector Quantization ANN and the KNN method for intrusion detection was suggested by [49]. The analysis was performed on the NSL-KDD dataset, and the proposed model achieved a detection rate of 97.2% (five classes) with a false alarm rate of approximately 1%.

4) Support Vector Machines (SVM): SVM is a machine learning algorithm that uses labeled instances (packets) to train a model that separates packets into different classes, by generating templates that determine which class a new instance belongs to [50], [51]. SVM's main objective is to discover an optimal linear hyperplane that maximizes the separation margin between groups. The SVM then trains the model across sections or portions of the data [52].
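The resulting decision rule can be sketched as follows; the weight vector and bias are hypothetical stand-ins for the values a trained SVM would produce:

```python
# Sketch of an SVM-style linear decision rule: training yields a weight
# vector w and bias b, and a sample is classified by which side of the
# hyperplane w.x + b = 0 it falls on (w and b here are illustrative).
def svm_decision(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "attack" if score >= 0 else "normal"

w, b = [0.9, -0.4], -0.2   # hypothetical learned parameters
print(svm_decision(w, b, [1.5, 0.3]))   # w.x + b = 1.03 -> "attack"
print(svm_decision(w, b, [0.1, 1.0]))   # w.x + b = -0.51 -> "normal"
```

Maximizing the margin amounts to choosing, among all (w, b) that separate the training classes, the pair keeping the hyperplane farthest from the closest samples (the support vectors).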
A hybrid intrusion detection design, KPCA-SVM with GA, was proposed [53]: KPCA is implemented in the N-KPCA-GA-SVM system to obtain the key data features for intrusion detection, and a multi-layer SVM classifier determines normal or attack behavior. The test was conducted on the KDD Cup 99 dataset and the detection rate was 96%. A BIRCH hierarchical clustering, SVM-based network intrusion detection framework [54] was proposed for data pre-processing: instead of the original large dataset, BIRCH hierarchical clustering provides the SVM with highly qualified, abstracted, and reduced data. The proposed solution achieved 95.72% overall accuracy with a 0.7% false positive rate, but the per-class accuracy was not satisfactory for every attack type (Probe = 97.55%, U2R = 19.73%, and R2L = 28.81%).
A new Combining Support Vectors with Ant Colony (CSVAC) algorithm was proposed to produce cluster classifiers for intrusion detection [55], using two existing machine learning techniques (SVM and CSOACN) to improve overall detection rate and speed. The method was applied and tested on the standard KDD Cup 99 benchmark, yielding a classification rate of 94.86% with a false negative ratio of about 1% and a false positive ratio of 6.01%.

5) Naive Bayes Network (NB):
Naive Bayes (NB) is a simple method for building classifiers that assign class labels to problem instances represented as feature vectors, where the class labels are drawn from a finite set. There is no single algorithm for training such classifiers but a family of algorithms based on a common principle. A Directed Acyclic Graph (DAG) usually describes the structure of an NB network, in which each node represents a process variable and each edge encodes one node's influence over another [56]. Comparing decision tree and Bayesian techniques, the decision tree's accuracy is much higher but the Bayesian network's processing time is lower [57]. Therefore, NB models are effective when the dataset is very large.
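The common principle behind this family of classifiers can be sketched with a minimal Gaussian Naive Bayes (toy data; each feature is modeled per class by an independent Gaussian, and the class with the highest log-posterior wins):

```python
# Minimal Gaussian Naive Bayes sketch: per class, model each feature
# with an independent Gaussian and predict the class with the highest
# log posterior. The traffic data below is illustrative.
import math
from collections import defaultdict

def fit_gnb(X, y):
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    stats = {}
    for cls, rows in by_class.items():
        prior = len(rows) / len(X)
        params = []
        for j in range(len(rows[0])):
            col = [r[j] for r in rows]
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9
            params.append((mean, var))
        stats[cls] = (prior, params)
    return stats

def predict_gnb(stats, x):
    def log_gauss(v, mean, var):
        return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)
    def posterior(cls):
        prior, params = stats[cls]
        return math.log(prior) + sum(log_gauss(v, m, s)
                                     for v, (m, s) in zip(x, params))
    return max(stats, key=posterior)

X = [[0.1, 200], [0.2, 210], [0.15, 190], [5.0, 9000], [4.5, 8500]]
y = ["normal", "normal", "normal", "attack", "attack"]
stats = fit_gnb(X, y)
print(predict_gnb(stats, [0.12, 205]))  # -> "normal"
print(predict_gnb(stats, [4.8, 8700]))  # -> "attack"
```

The "naive" independence assumption is what keeps both training and prediction cheap on very large datasets, as noted above.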
A Naive Bayes-based IDS that obtained better findings than a neural network IDS when tested on KDD Cup 99 was proposed by [58]; the average accuracy obtained using Naive Bayes was 91.52%. While basic in design, it can produce accurate results. A hybrid intrusion detection system based on Naive Bayes and a decision tree was proposed by [59]; the model was compared and tested on the benchmark KDD Cup 99 dataset, and the detection rate was 99.63%. A Fuzzy Intrusion Recognition Engine (FIRE) IDS, which uses simple data mining approaches to process network data packets and reveal essential anomaly detection indicators, was developed by [60]. These indicators were assessed for each observed value and then used to classify network attacks. An intrusion detection model using information gain for feature selection and the SimpleCart algorithm for detection was suggested by [61]: the features were first reduced to 33, then SimpleCart was used for detection; applied to the NSL-KDD dataset, the detection accuracy was 82.32% and the error rate 17.67%. A hybrid learning strategy was suggested by integrating Naive Bayes and a K-Means clustering classifier; the suggested solution was compared and tested on the benchmark KDD Cup 99 dataset. This combined learning methodology achieved rather low error rates, averaging less than 0.5%, while retaining accuracy and detection rates above 99%. The method accurately classifies all data except the U2R and R2L attacks; to overcome this limitation, it was recommended to consider an integrated intrusion detection program suited to identifying R2L and U2R threats [62], [63]. For SSH traffic, a combination of a Bayesian Network and a Genetic Algorithm was introduced to improve identification of brute force attacks [64]; the proposed method uses brute force attack data obtained in a client-server model.
Their findings show that the most effective features were chosen and the final result was better than the benchmark.

B. Unsupervised learning
Unsupervised anomaly detection (often recognized as outlier detection) employs clustering approaches to classify potentially malicious incidents in a dataset without previous knowledge. Clustering aims to divide a finite set of unlabeled data into a discrete collection of "natural" hidden data structures, rather than providing a precise characterization of unobserved incidents produced by the same probability distribution [65]. In other words, the goal of unsupervised algorithms is to divide the data into categories (clusters) with high internal similarity and high external dissimilarity, without previous knowledge.
All clustering approaches rest on the following hypotheses. First, the volume of normal instances in a dataset surpasses the volume of anomalies. Second, the anomalous packets themselves differ qualitatively from normal instances [66]. After cluster formation, scores are allocated to the resulting clusters; if a cluster's score reaches a threshold, pre-defined or automatically determined, a potential anomaly is declared. When clustering is utilized to identify attacks on the network, one therefore assumes that malicious traffic is rarer than normal packets and that normal packets are distinguishable from malicious ones in some way. In other words, the features that characterize the attacks well enough must be selected with respect to the detection process. The aim of clustering is to categorize network packets or flows without prior knowledge, based solely on their relationships. As a result, large clusters of normal packets are formed, while attack packets produce small clusters and cases not belonging to other groups. A static or dynamic threshold may be utilized to decide which clusters are deemed attacks, depending on the testing and algorithm adjustment used. The main benefit of clustering models is their capability to identify unseen threats without previous information, thereby eliminating the need for labeled traffic; their main disadvantage is a high false-positive rate.
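The small-cluster hypothesis above can be sketched with a simple leader (distance-threshold) clustering; the points, radius, and size cutoff below are illustrative:

```python
# Sketch of clustering-based anomaly detection: group unlabeled points
# with a simple leader (distance-threshold) clustering, then flag the
# members of small clusters as potential anomalies. All parameters
# and data are illustrative.
import math

def leader_cluster(points, radius):
    clusters = []  # each cluster: {"center": point, "members": [...]}
    for p in points:
        for c in clusters:
            if math.dist(p, c["center"]) <= radius:
                c["members"].append(p)
                break
        else:
            clusters.append({"center": p, "members": [p]})
    return clusters

def flag_anomalies(points, radius, min_fraction=0.2):
    clusters = leader_cluster(points, radius)
    cutoff = min_fraction * len(points)
    flagged = []
    for c in clusters:
        if len(c["members"]) < cutoff:   # small cluster -> suspicious
            flagged.extend(c["members"])
    return flagged

normal = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 0.95), (0.95, 1.05)]
outlier = [(9.0, 9.0)]
print(flag_anomalies(normal + outlier, radius=0.5))  # -> [(9.0, 9.0)]
```

Note how no labels are used anywhere: the single point far from the dense group ends up in a cluster of size one and is flagged, exactly as the size-threshold rule described above prescribes.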
The extraction or selection of features is among the most critical stages of unsupervised detection. The use of clustering techniques to identify a range of attacks by checking alarm records from heterogeneous databases was proposed in [17], instead of utilizing attack-specific attributes to match instances or the standard training-and-testing approach currently utilized in anomaly detection. Although the three clustering algorithms tested required little time to build clusters and forecast, the accuracy of the clusters produced by a given algorithm was not consistent across various logs and subsets. The obtained result indicates a route toward anomaly detectors that use raw activity logs obtained from heterogeneous databases on the monitored network and compare instances through alarm records to identify intrusions.

IV. FEATURE SELECTION TECHNIQUES
Feature Selection (FS) is a method for removing unnecessary and redundant features and choosing the most suitable feature subset, resulting in better classification of patterns belonging to the various attack classes. From the researchers' view, there are several reasons why feature selection needs to be performed:
1) A single selection strategy is not adequate to obtain consistency across multiple datasets, as network traffic activity is changing [67]-[69].
2) An appropriate subset for each attack type should be identified, since one general subset of features is insufficient to properly represent all the various attacks [69]-[71].
3) FS can significantly improve not only detection accuracy but also computational efficiency: a) irrelevant or redundant features can result in a poor detection rate and overfitting, so removing them can increase detection accuracy; and b) more features per data point cause higher computational cost and complexity, so removing irrelevant features increases computational efficiency [67], [69]-[74].
4) Finally, the R2L (Remote-to-Local) and U2R (User-to-Root) attack groups are known to be the most challenging to identify, since they are rare and can be mislabeled as normal packets. Studies and experiments have shown that FS can address this issue by defining a feature subset adapted to the behavior of each attack class [70], [71], [74].
FS methods are generally classified into filter, wrapper, and optimization-based methods. Table II illustrates the advantages and disadvantages of these feature selection methods, and Table III summarizes the reviewed feature selection techniques for anomaly-based IDS. RQ3 and RQ4 are answered in detail in the table, which specifies the feature selection method, the algorithm's origin, subset size, strengths, weaknesses, the dataset used in the articles, and so on.
According to the reviewed articles in Table III and the results in Fig. 2, optimization-based methods have been the most utilized for feature selection in recent years, and this class of methods has produced significant improvements in the number of selected features. Based on the review, the NSL-KDD dataset was most often used by researchers to prove their models. In addition, some research utilized different datasets, such as Kyoto2006+, ISCX 2012, UNSW-NB15, and CIC-IDS2017, to highlight the generality of their solutions.

5) Filter: Filter methods use information theory and various mathematical formulas for feature selection. Due to their simplicity, ranking methods are common and perform well in practical applications. Variables are rated according to an accepted ranking criterion, and a threshold is used to eliminate variables below it. Ranking methods are filtering approaches because less relevant variables are removed before classification. A fundamental characteristic of a distinctive feature is that it provides useful information about the different classes of data. This characteristic can be described as feature relevance [75], which defines a measure of the feature's efficacy in distinguishing among the classes. There are different ways to calculate a feature's relevance to the data or the outcome, and different publications [75]-[77] propose different interpretations and measurements of a variable's importance and relevance. One valuable description is: ''A feature can be regarded as irrelevant if it is conditionally independent of the class labels'' [78]. This stipulates that for a feature to be relevant, it must not be independent of the class labels; a feature that does not affect the class labels can be omitted. As noted above, feature correlation plays a key role in assessing specific features. The underlying distribution in practical uses is unknown, so relevance is measured by classifier accuracy. Because of this, an ideal feature subset may not be unique, since similar classifier accuracy may be reached using different feature sets. An improved feature selection algorithm has been proposed [79] to efficiently classify attack behaviors by measuring mutual information.
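The mutual-information criterion can be sketched directly from co-occurrence counts; the two toy features below are illustrative, one perfectly predictive of the label and one independent of it:

```python
# Sketch of a filter-style ranking criterion: mutual information
# between a discrete feature and the class label, estimated from
# co-occurrence counts (higher score = more relevant feature).
import math
from collections import Counter

def mutual_information(feature, labels):
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), nxy in joint.items():
        pxy = nxy / n
        mi += pxy * math.log2(pxy / ((px[x] / n) * (py[y] / n)))
    return mi

labels = ["attack", "attack", "normal", "normal"]
informative = [1, 1, 0, 0]     # perfectly predicts the label
uninformative = [1, 0, 1, 0]   # independent of the label
print(mutual_information(informative, labels))    # -> 1.0 bit
print(mutual_information(uninformative, labels))  # -> 0.0 bits
```

A filter method would compute such a score for every feature, rank them, and keep only those above a threshold, all before any classifier is trained.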
Correlation can also be extended to evaluate the efficiency of a feature subset: a subset is good if the correlation between the classification and the subset is significant while the correlation among the individual features within the subset is low. In addition, distance calculations can be utilized for feature selection [80]; widely used measures include the Euclidean distance, the Mahalanobis distance, and the standardized Euclidean distance.

6) Wrapper: Wrapper feature selection uses a machine learning model as a fitness function to determine the best feature subset across all subsets of features. This problem formulation allows generic optimization techniques to be used with the learner to rank feature subsets by their predictive quality. Therefore, in terms of the final predictive accuracy of a machine learning model, the wrapper method typically surpasses the filter approach. The wrapper technique was widely popularized by [75] and provides an easy but efficient way to tackle the feature selection problem. However, the wrapper method incurs more computation cost and needs more execution time than filter methods. A feature selection method using machine learning algorithms was proposed [81] for efficient intrusion detection, which blends distributed denial of service (DDoS) characteristic-based features (DCF) and consistency subset evaluation (CSE). To identify the most relevant features, the NSL-KDD dataset is utilized as the attack dataset, and the method is built on several feature selection approaches, along with consistency-based subset evaluation and DCF. Experimental results show that the proposed system has greater accuracy and efficiency than other approaches.
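A minimal wrapper sketch, using leave-one-out 1-NN accuracy as the fitness function and greedy forward search; the data is a toy construction in which feature 0 is informative and feature 1 is noise:

```python
# Sketch of wrapper feature selection: greedy forward search that adds,
# at each step, the feature whose inclusion most improves the accuracy
# of a simple 1-nearest-neighbour classifier (the "fitness" learner).
import math

def loo_accuracy(X, y, feats):
    """Leave-one-out 1-NN accuracy using only the chosen feature indices."""
    correct = 0
    for i in range(len(X)):
        best = min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist([X[i][f] for f in feats],
                                           [X[j][f] for f in feats]))
        correct += (y[best] == y[i])
    return correct / len(X)

def forward_select(X, y, n_features):
    selected = []
    remaining = set(range(len(X[0])))
    while remaining and len(selected) < n_features:
        best_f = max(remaining,
                     key=lambda f: loo_accuracy(X, y, selected + [f]))
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Feature 0 separates the classes; feature 1 is random noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 4.0], [5.0, 2.0], [5.1, 5.0], [5.2, 0.0]]
y = ["normal", "normal", "normal", "attack", "attack", "attack"]
print(forward_select(X, y, n_features=1))  # -> [0]
```

The cost disadvantage noted above is visible here: every candidate feature triggers a full re-evaluation of the classifier, whereas a filter method scores each feature once.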
7) Optimization-based methods: In classic wrapper and filter strategies, features are evaluated and subsets chosen independently. However, some features are not independent: they become truly effective only when they work together, so the classic strategies are not very successful in this respect. Metaheuristic-based methods have therefore been used to select features and classify with them, owing to their substantial capability to improve detection [82], [83]. Examples of optimization-based methods are Particle Swarm Optimization (PSO) [84]-[86], entropy of network features [87], the Genetic Algorithm [88], [89], ant colony optimization [34], [90], and Kernel Principal Component Analysis (KPCA) [91]. As the dataset dimension increases, the feature selection problem space rises significantly, leading to a large solution space with additional features. Furthermore, in a wide solution space, a huge proportion of duplicate or uncorrelated features generates several local optima.
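The metaheuristic search over binary feature masks can be sketched with a minimal genetic algorithm; the fitness function below is a toy stand-in (it simply rewards two arbitrarily designated "informative" features and penalizes subset size), not a real detection-accuracy measure:

```python
# Minimal genetic-algorithm sketch for feature-mask search: evolve
# binary masks, score each with a toy fitness rewarding the designated
# informative features (indices 0 and 3, illustrative) while penalizing
# mask size -- the usual accuracy-vs-subset-size trade-off.
import random

random.seed(42)
N_FEATURES = 8
INFORMATIVE = {0, 3}  # hypothetical "useful" features

def fitness(mask):
    hits = sum(1 for i in INFORMATIVE if mask[i])
    size_penalty = 0.05 * sum(mask)
    return hits - size_penalty

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(mask, rate=0.1):
    return [bit ^ (random.random() < rate) for bit in mask]

def genetic_search(pop_size=20, generations=40):
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

best = genetic_search()
print(best)  # the surviving mask should keep features 0 and 3
```

In a real optimization-based selector, the fitness would be the detection accuracy of a classifier trained on the masked features, which is exactly where the local-optima problem described above arises.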
A new anomaly-based detection model, the Hypergraph-based Genetic Algorithm (HG-GA), was proposed by [92]. The hypergraph's properties were used to generate the initial population in order to speed up the search for the optimum solution and avoid trapping in local minima. HG-GA utilized a weighted objective function to balance maximizing the detection rate against reducing false positives and the number of features. HG-GA SVM performance was assessed on the NSL-KDD dataset.
An Ant Colony Optimization (ACO) feature selection method was proposed [34] using K-Nearest Neighbor (KNN) for classification, with accuracy as the assessment function of the model. The studies were performed on the KDD Cup 99 dataset, giving 98.9% accuracy and a 2.59% false positive rate.
A learning model for the fast learning network (FLN) based on PSO was proposed by [93]: the FLN was trained using particle swarm optimization to pick its weights. For evaluation, the research utilized the KDD Cup 99 dataset to explore the performance of the PSO-FLN model, and the findings indicated that the model performed well in intrusion detection.
An enhancement of the Cuckoo Search Algorithm (CSA), named Mutation Cuckoo Fuzzy (MCF), was proposed by [94] as the feature selection method, with a multiverse-optimized ANN for classification in an anomaly-based IDS. For the feature selection phase, MCF, which integrates a mutation operator with cuckoo search and Fuzzy C-Means (FCM) clustering, was utilized; this increased the efficiency of cuckoo search in finding the optimal features. The proposed feature selection chooses 22 out of 41 features, and the well-known NSL-KDD dataset was used for evaluation to illustrate the effectiveness of the anomaly-based IDS.

A. Limitations of the Related Works
After analyzing the data collected from the literature on feature selection, some limitations and shortcomings of the works are identified:
1) The optimal detection methods or strategies for the various datasets have yet to be established.
2) There is no established feature subset that enables faster training with minimal computation while detecting intrusions with high accuracy and few false alarms.
V. ENSEMBLE

The idea of merging the results of a collection of learners into one is known as an ensemble [95]. To obtain reliable and more accurate predictions, an ensemble can integrate multiple learners, and a variety of techniques can be used to generate and combine them: different datasets can be used to train the same learning framework, or the same dataset can be used to train different frameworks [96]. The biggest issue in ensemble learning is choosing the algorithms that constitute the ensemble and the decision (fusion) function that combines their results. It is of course easy to add more algorithms to improve the fused result, but given the computational cost of each new algorithm, this needs careful consideration. Dietterich [95] offered three key justifications for using an ensemble-based system. First, the statistical justification relates to the absence of sufficient data to accurately identify the best hypothesis in the search space. Second, the computational justification is that most machine learning methods can get trapped in local optima when searching for the best solution. Finally, the representational justification addresses the failure of many machine learning methods to accurately represent the sought decision boundary. Creating an ensemble involves two main steps: creation and combination [97].
The creation step constructs a collection of base classifiers; the combination step decides how to fuse their outputs into one. Many well-known modern ML algorithms are built around the ensemble idea. The three most widely used ensemble models are bagging, boosting, and stacking [98]. These techniques combine several learning models into a single one so as to reduce variance (bagging), reduce bias (boosting), or improve predictions (stacking). Fig. 3 shows the general design of the ensemble methods.

A. Bagging
Bagging was among the first ensemble algorithms, and one of the simplest ways to achieve better efficacy [99]. Diversity is generated in bagging by using bootstrapped copies of the training data: different data subsets are randomly drawn, with replacement, from the full training dataset, and a separate classifier of the same type is trained on each portion. The classifiers are fused by a majority vote over their outputs; the decision of the ensemble for any instance is therefore the class chosen by the largest number of classifiers. Random Forests is a method derived from bagging [100]. This kind of classifier is created by training multiple decision trees while randomly varying parameters related to the training. As in bagging, bootstrapped copies of the training data can be among those parameters; but, unlike bagging, unique subsets of features can also be used, as in the random subspace method.
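The bootstrap-and-vote procedure described above can be sketched as follows. This is an illustrative example on synthetic data, not taken from any reviewed paper; the dataset and parameters are arbitrary.

```python
# Bagging sketch: each tree is trained on a bootstrap sample of the
# training data, and the ensemble predicts by majority vote. Random
# Forest additionally randomizes the feature subsets at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0)

bag_acc = cross_val_score(bag, X, y, cv=5).mean()
rf_acc = cross_val_score(rf, X, y, cv=5).mean()
print("bagging:", round(bag_acc, 3), "random forest:", round(rf_acc, 3))
```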
Another method derived from bagging is "pasting small votes." Unlike bagging, it was designed to run on huge datasets [101]: large datasets are divided into small portions called "bites," which are used to train different classifiers.
Pasting small votes led to two variants: the first, named Rvotes, produces the data subsets randomly; the second, named Ivotes, creates consecutive datasets taking the importance of the instances into account. Ivotes has been shown to deliver better results, in a manner similar to boosting, where each classifier nominates the instances most useful for the next ensemble component [109].
A new ensemble classification method [110] was proposed using bagged classifiers, with performance evaluated in terms of accuracy. A classifier ensemble is built using the Support Vector Machine (SVM) and Radial Basis Function (RBF) network as base classifiers. The effectiveness and advantages of the proposed approaches are demonstrated on the NSL-KDD dataset: the accuracy was 86.40% for bagged RBF and 93.92% for bagged SVM.

B. Boosting
In 1990, Schapire [111] demonstrated that a weak learner, an algorithm producing classifiers only moderately better than random guessing, can be converted into a strong learner that correctly classifies all but an arbitrarily small fraction of instances. Boosting creates a group of classifiers by resampling the data and combining their outputs by majority voting; the resampling is designed to provide the most informative training data for each successive classifier. In general, boosting creates three classifiers. The first is built from a random subset of the available training data. The second is trained on the most informative subset given the first: half of its instances were correctly classified by the first classifier and the other half were misclassified. Finally, the third classifier is trained on the instances on which the first two classifiers disagree. The outputs of the three classifiers are combined by majority vote.
A more general version of the original boosting algorithm, called "adaptive boosting" or "AdaBoost," was proposed in 1997 by Freund and Schapire [112]. Two algorithms of this family, AdaBoost.M1 and AdaBoost.R, are the most commonly used variants, as they are well suited to multiclass and regression problems, respectively. AdaBoost generates a set of hypotheses and combines them through weighted majority voting of the classes they predict. The hypotheses are built by training a weak classifier on instances drawn from a successively updated distribution over the training data. The distribution update ensures that instances misclassified by the previous classifier are more likely to reappear in the training data of the next one; thus, the training data of successive classifiers concentrate on instances that are increasingly difficult to classify.
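The distribution update described above can be sketched as follows. This is a didactic AdaBoost-style sketch using decision stumps, assuming labels in {-1, +1}; it is not the exact AdaBoost.M1 of [112], and all names and parameters are illustrative.

```python
# AdaBoost-style sketch: after each round, weights of misclassified
# instances are increased so the next weak learner focuses on them.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, rounds=25):
    """Boost decision stumps; y must be in {-1, +1}."""
    w = np.full(len(y), 1.0 / len(y))       # uniform initial distribution
    stumps, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()             # weighted training error
        if err >= 0.5:                       # weak-learner assumption violated
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)       # up-weight mistakes, down-weight hits
        w /= w.sum()                         # renormalize the distribution
        stumps.append(stump)
        alphas.append(alpha)
    # Final decision: weighted majority vote of the weak learners.
    return lambda Z: np.sign(sum(a * s.predict(Z) for a, s in zip(alphas, stumps)))

X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           random_state=0)
clf = adaboost(X, 2 * y - 1)                 # map labels {0,1} -> {-1,+1}
acc = (clf(X) == 2 * y - 1).mean()
print(round(acc, 3))
```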

C. Stacking
Many instances are likely to be misclassified because they fall near the decision boundary, and thus often end up on the wrong side of the boundary identified by a machine learning classifier. Conversely, instances that lie far from the boundary, on the correct side, are likely to be classified well. If a group of classifiers is run on a dataset from an unknown source, can we learn a relationship between the classifiers' outputs and the correct classes? The idea motivating Wolpert's stacked generalization is that the outputs of an ensemble of classifiers serve as inputs to a second-level meta-classifier, whose goal is to learn how the ensemble's outputs relate to the correct labels [113].
Stacking is the common term for stacked generalization [113], which seeks the ideal combination of a set of base learners. Stacking is a class of algorithms that trains a second-level "meta-learner" to find this combination. Unlike bagging and boosting, stacking aims to combine strong, diverse learners: boosting and bagging are typically used to construct homogeneous ensembles, while stacking can be used to build heterogeneous ones.
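The two-level structure described above can be sketched as follows. The choice of base learners and dataset here is purely illustrative, a minimal sketch in the spirit of [113] rather than any reviewed IDS model.

```python
# Stacked generalization sketch: a second-level logistic regression
# (the meta-learner) learns how to combine the out-of-fold outputs of
# heterogeneous base learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("svm", SVC(random_state=0)),
                ("tree", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(),   # second-level meta-learner
    cv=5)                                   # out-of-fold outputs train the meta-learner
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
print(round(acc, 3))
```

Note the `cv=5` setting: the meta-learner is trained on cross-validated base-learner outputs, which prevents it from simply memorizing the base learners' training-set behavior.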

D. Other Work
A new ensemble method [114] is the GR-based ANN-Bayesian Net approach, an ensemble of Bayesian Net with Gain Ratio (GR) feature selection and an ANN. The authors applied a variety of single classification methods, together with their proposed ensemble, on the NSL-KDD and KDD Cup 99 datasets to evaluate the model's robustness. With 29 selected features, detection accuracies of 97.78% and 99.38% were achieved on NSL-KDD and KDD Cup 99, respectively.
A hybrid approach combining the synthetic minority oversampling technique (SMOTE) and cluster center and nearest neighbor (CANN) was proposed [115]. Significant features were selected using the leave-one-out (LOO) approach. The research used the NSL-KDD dataset, and the results illustrate that the proposed approach increases the accuracy on the R2L and U2R attacks, compared to the benchmark paper, by 50% and 94%, respectively.
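The core idea of SMOTE can be sketched from scratch as follows. This is a simplified illustration of the interpolation step, assuming numeric features; it is not the implementation used in [115], and the function name and parameters are hypothetical.

```python
# SMOTE sketch: synthesize minority samples by interpolating between a
# minority instance and one of its minority-class nearest neighbours.
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from the minority class X_min."""
    rng = rng or np.random.default_rng(0)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # k nearest minority neighbours
        j = rng.choice(nn)
        gap = rng.random()                   # interpolation factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.random.default_rng(1).random((10, 2))   # toy minority class
synth = smote(X_min, 20)
print(synth.shape)                                 # (20, 2)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the minority region rather than being arbitrary noise, which is what makes the technique useful for rare classes such as R2L and U2R.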
A hybrid RBF-SVM ensemble classifier was proposed by [110], using the Support Vector Machine (SVM) and Radial Basis Function (RBF) network as base classifiers. The efficacy and advantages of the proposed model are presented on the NSL-KDD dataset, and the findings illustrate that the proposed RBF-SVM ensemble is superior to single-method approaches in terms of accuracy, achieving 98.46%.
An ensemble-based IDS model was designed using an integrated feature selection approach and an ensemble of ML classifiers comprising Bayesian Network, J48, and Naive Bayes [15]. In this model, the features are reduced from 41 to 12, and majority voting is used to combine the findings. The true positive rate (TP) of the proposed model is 98.0%, with a false positive rate (FP) of 0.021%.
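The majority-vote fusion of heterogeneous classifiers used in such models can be sketched as follows. This is a hedged illustration with sklearn stand-ins (a decision tree for J48, Gaussian Naive Bayes, and logistic regression substituted for the Bayesian Network, which sklearn does not provide); the dataset is synthetic and not the 12-feature subset of [15].

```python
# Hard majority voting over diverse base classifiers: each classifier
# casts one vote, and the class with the most votes wins.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

vote = VotingClassifier(
    [("tree", DecisionTreeClassifier(random_state=0)),  # stand-in for J48
     ("nb", GaussianNB()),
     ("lr", LogisticRegression(max_iter=1000))],        # stand-in for Bayesian Network
    voting="hard")                                      # plain majority vote
vote.fit(X_tr, y_tr)
acc = vote.score(X_te, y_te)
print(round(acc, 3))
```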
A hybrid classification approach was proposed to detect and forecast DDoS threats [116]. Using the KDD Cup 99 dataset as attack data, relevant features were chosen based on information gain. The experimental results revealed that each stage of a threat scenario is well separated, and that both DDoS threat precursors and the threat itself can be identified.
An adaptive ensemble learning model was proposed by [38], varying the training data ratio and constructing a MultiTree algorithm that deploys multiple decision trees. To increase detection efficiency, several base classifiers were chosen, including Random Forest, decision tree, deep neural network (DNN), and KNN, and an adaptive voting algorithm was developed. For validation, the NSL-KDD dataset was used: the MultiTree algorithm achieved 84.2% accuracy, while the final adaptive voting ensemble achieved 85.2%.
A model called SCDNN, which combines spectral clustering (SC) and DNN algorithms, was proposed by [117]. In this model, k subsets are created from the dataset based on sample similarity, using cluster centers as in SC. The distances between data points in the training and test sets are then computed based on feature similarity and fed into the DNN algorithm to detect intrusions. The NSL-KDD dataset was used as the evaluation benchmark, and the overall accuracy was 92.1%.
A framework with feature selection and an ensemble method [118] integrates correlation-based feature selection with the Bat algorithm (CFS-BA), and an ensemble of Random Forest (RF), C4.5, and Forest by Penalizing Attributes (Forest PA) as the detection model. The evaluation experiments used the CIC-IDS2017, AWID, and NSL-KDD datasets, and the results show that this framework achieves better accuracy than other works.
A hierarchical ensemble classifier with a knowledge-based method was proposed by [119]. To determine the specific attack class, it uses a class-specific weighted voting fusion technique to obtain a more accurate classification. The KDD Cup 99 dataset was used to validate the model. This IDS model is more complex during the learning phase and consumes more time than other works.
A hybrid IDS of a One-Class Support Vector Machine (OC-SVM) and a C5 decision tree classifier [42] was proposed to detect both known and unknown intrusions. The model was evaluated on the ADFA and NSL-KDD datasets, and the findings demonstrate that the hybrid scheme outperforms other models. An IDS ensemble model of a convolutional neural network, Random Forest, and gated recurrent unit (GRU) was proposed by [120], with the NSL-KDD dataset used to evaluate its performance. The detection accuracy was 76.61%, with lower training time and resource usage than other schemes.
An IDS model combining an ensemble (Random Forest, J48, and REPTree) with the CFS algorithm was suggested by [121]. Experiments on the KDD Cup 99 and NSL-KDD datasets show that the proposed ensemble achieves a 99.90% detection rate on KDD Cup 99 and 98.60% on NSL-KDD. However, this model could not handle imbalanced data.
A stacked ensemble classifier combining a gradient boosting machine, XGBoost, and Random Forest [39] was proposed and evaluated on CICIDS-2017, CSIC-2010v2, UNSW-NB15, and NSL-KDD. The results show that the proposed ensemble model is effective at detecting attacks on web applications. According to the reviewed articles presented in the table, different combinations of classifiers and algorithms have been used for ensemble detection; ensembles with diverse classifier types achieve significant improvements in detection accuracy and reduce false alarms for anomaly-based IDS.
Based on the review, the NSL-KDD dataset was most often used to show the efficacy and advantages of the proposed ensemble models. Some articles also used other datasets, such as AWID, ISCX 2012, UNSW-NB15, CIC-IDS2017, and CSIC-2010v2, to highlight the generality of their solutions.
Based on the analysis of the studied articles, Fig. 4 illustrates that the NSL-KDD dataset was most often used to highlight the effectiveness of the proposed anomaly-based IDS models, with the KDD Cup 99 dataset the second most used for evaluation.

1) Limitations of the ensemble classification:
After analyzing the data collected from the literature on ensembles, some limitations and shortcomings of the works are identified; to reach maximum diversity, with varied decision boundaries, the following should be considered:
a) Multiple datasets have to be used to prove the generality of an ensemble model.
b) To handle imbalanced-data issues in anomaly-based IDS, different types of classifiers have to be deployed in the ensemble machine; the selection of varied classifiers, and the fusion of their outcomes, strengthens the final result.

VI. DISCUSSION

Upon studying and reviewing the different IDS models, we found challenges that motivate research on machine learning for feature selection and ensemble techniques in IDS. In this paper, we discuss future trends in anomaly-based IDS, particularly in feature selection and ensemble techniques. Some of the critical topics in the existing research, viewed as future trends, are described below:
1) Anomaly-based IDS datasets have a crucial impact on the performance assessment of proposed approaches. To stay current, it is necessary to use updated datasets to show that a proposed solution works well against new attack types. Although KDD Cup 99 is an old dataset used by most researchers for benchmark comparisons, its attack packets and even its features date back 20 years. In addition, researchers can deploy their models on different anomaly-based IDS datasets to prove their generality in detecting different attacks.
2) Finding the appropriate feature selection scheme plays an important role in anomaly-based IDS. Proper selection of the feature subset helps the learning machine in the training phase to detect attacks in the testing phase. Optimization-based feature selection aims to acquire an optimal subset from among all features in different domains; the role of new optimization-based feature selection methods in the success of anomaly-based IDS must be considered.
3) Modern ensemble-based anomaly IDS techniques allow multiple combinations of models or algorithms to identify new, unseen cases. In the implementation, a variety of classification models are typically constructed using portions of the datasets, and the results of the various classifiers are then merged to form the final decision. Various schemes may be proposed for generating the classifiers and combining the ensembles.
The future trends and open issues discussed above should be considered by researchers in the field of anomaly-based IDS.
VII. CONCLUSIONS

An intrusion detection system is a prominent security mechanism designed to prevent intrusion, illicit entry, modification, or destruction by intruders. For an efficient intrusion detection process, vital components such as feature selection and the detection mechanism have to be considered when designing the model. This article reviews studies on feature selection and ensemble approaches used in anomaly-based intrusion detection systems. We discussed the main challenges in IDS, namely dimensionality reduction in anomaly-based IDS, which removes irrelevant attributes from the dataset, and how to build an appropriate feature subset selection so as to better detect intrusions and improve the performance metrics. The study then categorizes and discusses feature selection methods and presents their detection accuracy. Another important challenge in anomaly-based IDS lies in using suitable machine learning algorithms in the detection process. To illustrate their effectiveness in improving IDS performance, this paper reviews and categorizes various machine learning schemes and discusses their use in IDS, with emphasis on ensemble methods as an emerging trend in anomaly-based IDS. Based on our study of anomaly-based IDS and the assessment and comparison of feature selection and detection modules, we can summarize two points on how to boost the performance of anomaly-based IDS:
1) Optimization-based feature selection, with a good combination and well-tuned parameters, will select the proper feature subset for IDSs. This study makes clear that optimization-based methods perform strongly in designing the optimal feature set; moreover, if their parameters are adjusted well, feature selection can be significantly enhanced.
2) Ensemble detection with different types of classifiers can strengthen the detection phase and reduce the false alarm rate; when diversity is present, a fusion of the outcomes has a better chance of detecting correctly.
Finally, we presented some open issues and offered research trends in the area of anomaly-based IDS, including the datasets used and the role of optimization-based algorithms and ensemble methods. We expect that this review will furnish scientists with innovative ideas and serve as a springboard for better studies. We acknowledge that this article has some limitations due to the scope of the review:
1) This review focused on feature selection and ensemble detection for anomaly-based IDS.
2) This review does not focus on the performance parameters used in IDS.
3) This article does not study IDS datasets in depth, e.g., their features, attack types, etc.
Having listed the limitations of the paper, a deeper analysis of the following issues can be considered as future work:
1) Other detection methods for anomaly-based IDS, apart from the feature selection and ensemble detection methods discussed here, could be studied, in order to acquire a more holistic understanding of the research area.
2) Further studies could be performed on the performance parameters used in IDS, and on how to obtain the optimal set of parameters for better detection performance.
3) An in-depth study of IDS datasets could be carried out, covering their features, attack types, etc., to understand the patterns in their attributes that may affect detection performance.