Application of Machine Learning Approaches in Intrusion Detection System: a Survey

Abstract: Network security is one of the major concerns of the modern era. With the rapid development and massive usage of the Internet over the past decade, the vulnerabilities of network security have become an important issue. Intrusion detection systems are used to identify unauthorized access and unusual attacks on secured networks. Over the past years, many studies have been conducted on intrusion detection systems. To understand the current status of machine learning techniques applied to intrusion detection problems, this survey reviews 49 related studies published between 2009 and 2014, focusing on the architecture of single, hybrid and ensemble classifier designs. The survey also includes a statistical comparison of the classifier algorithms, the datasets used and other aspects of the experimental setups, as well as whether a feature selection step was considered.


I. INTRODUCTION
The Internet has become an essential tool and one of the best sources of information about the current world. It can be considered one of the major components of education and business. Therefore, data transmitted across the Internet must be secure, and Internet security is one of the major concerns nowadays. Because the Internet is threatened by various attacks, it is essential to design a system that protects the data as well as the users of those data. The intrusion detection system (IDS) was devised to fulfill that requirement. Network administrators adopt intrusion detection systems in order to prevent malicious attacks, and the IDS has therefore become an essential part of security management. An intrusion detection system detects and reports any intrusion attempts or misuse on the network. An IDS can detect and block malicious attacks on the network, maintain normal performance during a malicious outbreak, and perform an experienced security analysis.
Intrusion detection approaches can be classified into two categories. One is anomaly detection and the other is signature-based detection, also known as misuse detection [4,41]. Misuse detection identifies attacks in the form of a signature or pattern. Because misuse detection relies on known patterns, its main disadvantage is that it fails to identify attacks on the network or system that are not yet known. Anomaly detection, on the other hand, is used to detect unknown attacks. There are different ways to find anomalies, and various machine learning techniques have been introduced to identify them.
Over the years, many researchers and scholars have done significant work on the development of intrusion detection systems. This paper reviews the related studies on intrusion detection systems over the past six years, covering 49 papers in total from 2009 to 2014. It lists the proposed architectures of the classification techniques and the algorithms used. A statistical comparison is included to show the classifier designs, the chosen algorithms, the datasets used and whether a feature selection step was considered.
This paper is organized as follows: Section 2 provides an overview of the research topic, in which a number of techniques for intrusion detection are described. Section 3 presents a statistical overview of the articles over the years, covering the algorithms that were frequently used, the datasets used in each experiment and the consideration of a feature selection step. Section 4 contains the discussion and conclusion, together with some issues highlighted for future research on intrusion detection systems using machine learning approaches. www.ijarai.thesai.org

A. Machine Learning Approach
Machine learning is a branch of artificial intelligence that acquires knowledge from training data based on known facts. Arthur Samuel described machine learning in 1959 as the field of study that gives computers the ability to learn without being explicitly programmed. Machine learning mainly focuses on prediction. Machine learning techniques are classified into three broad categories: supervised learning, unsupervised learning, and reinforcement learning.

1) Supervised Learning
Supervised learning is commonly framed as classification. In supervised learning, data instances are labeled in the training phase. There are several supervised learning algorithms. Some of the most popular are Artificial Neural Network, Bayesian Statistics, Gaussian Process Regression, Lazy learning, Nearest Neighbor algorithm, Support Vector Machine, Hidden Markov Model, Bayesian Networks, Decision Trees (C4.5, ID3, CART, Random Forest), K-nearest neighbor, Ensemble classifiers (Bagging, Boosting), Linear classifiers (Logistic regression, Fisher linear discriminant, Naive Bayes classifier, Perceptron, SVM) and Quadratic classifiers.

2) Unsupervised Learning
In unsupervised learning, data instances are unlabeled. A prominent technique of this kind is clustering. Some of the common unsupervised learners are Cluster analysis (K-means clustering, Fuzzy clustering), Hierarchical clustering, Self-organizing map, Apriori algorithm, Eclat algorithm and Outlier detection (Local outlier factor).

3) Reinforcement Learning
Reinforcement learning refers to a computer interacting with an environment to achieve a certain goal. A reinforcement approach can ask a user (e.g., a domain expert) to label an instance, which may be drawn from a set of unlabeled instances.

B. Single Classifiers
A single machine learning algorithm or technique can be used to develop an intrusion detection system as a standalone, or single, classifier. This section discusses the machine learning techniques that were found to be frequently used as single classifiers in the 49 research papers studied.

1) Decision Tree
The task of a decision tree (DT) is to create a classifier that predicts the value of a target class for an unseen test instance, based on several already known instances. A decision tree classifies an unseen test instance through a sequence of decisions [11]. The decision tree is very popular as a single classifier because of its simplicity and ease of implementation [14]. Decision trees come in two types: (i) classification trees, with a range of symbolic class labels, and (ii) regression trees, with a range of numerically valued class labels [11].
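The "sequence of decisions" can be illustrated with a minimal sketch. The tree below is hand-built rather than learned, and the feature names (`failed_logins`, `duration`), thresholds and labels are hypothetical values chosen for illustration, not taken from any surveyed paper.

```python
# Hypothetical hand-built classification tree; feature names and thresholds
# are illustrative assumptions, not values from the surveyed studies.
TREE = {
    "feature": "failed_logins", "threshold": 3,
    "left": {"label": "normal"},           # failed_logins <= 3
    "right": {                             # failed_logins > 3
        "feature": "duration", "threshold": 10.0,
        "left": {"label": "attack"},       # duration <= 10.0
        "right": {"label": "normal"},      # duration > 10.0
    },
}

def classify(node, instance):
    """Walk from the root to a leaf, taking one decision per internal node."""
    while "label" not in node:
        go_left = instance[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]
```

In practice the tree structure would be induced from labeled training data by an algorithm such as C4.5 or CART; only the traversal step is shown here.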

2) Naive Bayes
Naive Bayes assumes that the attributes are conditionally independent given the class label, and on that basis estimates the class-conditional probability [15]. Naive Bayes often produces good classification results where the relations among attributes are simple. It requires only one scan of the training data, which makes the classification task considerably easier.
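Under the conditional independence assumption, the posterior probability of a class factorizes as shown below. This is the standard textbook formulation; the symbols $c$ (class label), $x_i$ (the $i$-th attribute value) and $n$ (number of attributes) are notation introduced here for illustration.

\[
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
\]

Each factor $P(x_i \mid c)$ can be estimated from frequency counts collected in a single pass over the training data, which is why one scan suffices.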

3) K-nearest neighbor
Various distance measures are used in K-nearest neighbor. K-nearest neighbor finds the k samples in the training data that are nearest to the test sample and then assigns to the test sample the most frequent class label among those training samples. K-nearest neighbor is known as one of the simplest nonparametric approaches for classifying samples [8]. It is an instance-based learner rather than an inductive one [35].
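The procedure described above can be sketched in a few lines. Euclidean distance is assumed here as one possible distance measure, and the function name `knn_predict` and the data layout are illustrative choices.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority label among its k nearest training samples.

    train: list of (feature_tuple, label) pairs; query: feature tuple.
    Euclidean distance is an assumption; other distance measures can be used.
    """
    # Keep the k training samples closest to the query.
    neighbors = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    # Assign the most frequent label among those neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

No model is built ahead of time: all work happens at query time against the stored instances, which is what makes the method instance-based.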

4) Artificial Neural Network
An Artificial Neural Network (ANN) is an information processing unit inspired by the functionality of the human brain [23]. Neural networks are typically organized in layers made up of a number of interconnected nodes, each of which contains an activation function. Patterns are presented to the network via the input layer, which communicates with one or more hidden layers where the actual processing is done through a system of weighted connections. The hidden layers then link to an output layer that produces the detection result.
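The flow from input layer through a hidden layer to the output layer can be sketched as a forward pass. This is a minimal illustration assuming a sigmoid activation and a single hidden layer; the function names and the weight layout (one row of weights per node) are assumptions made here, and training of the weights is not shown.

```python
import math

def sigmoid(z):
    """Activation function applied at each node (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per node, then activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer -> output layer."""
    hidden = layer(x, hidden_w, hidden_b)
    return layer(hidden, out_w, out_b)
```

An output close to 1 could be read as "attack" and close to 0 as "normal", although that interpretation depends on how the network is trained.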

5) Support Vector Machines
The support vector machine (SVM) was introduced in the mid-1990s [5]. The concept behind SVM for intrusion detection is to use the training data as a description of only the normal class of objects (the non-attack class in an intrusion detection system) and to treat everything else as an anomaly [51]. The classifier constructed by the support vector machine methodology partitions the input space into a finite region that contains the normal objects, while the rest of the space is assumed to contain the anomalies [9].

6) Fuzzy Logic
In two-valued logic, a truth value used for reasoning can only be absolutely false (0) or absolutely true (1); fuzzy logic relaxes this restriction [60]. In fuzzy logic, the degree of truth of a statement can take any value between 0 and 1, including 0 and 1 themselves [11].
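A degree of truth between 0 and 1 is commonly expressed through a membership function. The sketch below uses a triangular membership function, which is one common choice and an assumption here; the parameter names `a`, `b` and `c` are illustrative.

```python
def triangular(x, a, b, c):
    """Degree of membership in [0, 1] for a triangular fuzzy set.

    a and c are the feet of the triangle (membership 0), b is the peak
    (membership 1); assumes a < b < c.
    """
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fuzzy_and(mu1, mu2):
    """Fuzzy conjunction using the minimum operator (a common convention)."""
    return min(mu1, mu2)
```

For example, a hypothetical fuzzy set "moderate connection rate" with `a=0, b=5, c=10` gives a rate of 2.5 a membership degree of 0.5 rather than forcing it to be fully inside or outside the set.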

C. Hybrid Classifiers
A hybrid classifier combines more than one machine learning algorithm or technique in order to improve the performance of the intrusion detection system. For example, a clustering-based technique can first preprocess the training data to eliminate non-representative samples; the clustering results are then used as training samples for pattern recognition in order to design a classifier. Thus, either supervised or unsupervised learning approaches can form the first level of a hybrid classifier [11].
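A two-stage design of this kind can be sketched as follows. The specific pairing used here, k-means clustering followed by a nearest-centroid classifier, is an illustrative assumption and not the method of any particular surveyed paper; the function names and the first-k-points initialisation are likewise choices made for brevity.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means; centroids start at the first k points (arbitrary choice)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids

def build_hybrid(train, k_per_class=2):
    """Stage 1: compress each class into a few representative centroids."""
    by_label = {}
    for features, label in train:
        by_label.setdefault(label, []).append(features)
    return [(c, label) for label, pts in by_label.items()
            for c in kmeans(pts, min(k_per_class, len(pts)))]

def predict(model, query):
    """Stage 2: classify by the label of the nearest representative."""
    return min(model, key=lambda m: math.dist(m[0], query))[1]
```

The clustering stage reduces the training set to representative points, and the second stage sees only those representatives, which mirrors the preprocessing role described above.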

D. Ensemble Classifiers
Classifiers that perform only slightly better than a random classifier are known as weak learners. An ensemble classifier combines multiple weak learners with the aim of significantly improving classification performance [11]. Majority vote, bagging and boosting are some common strategies for combining weak learners [15]. Although the disadvantages of the component classifiers are known to accumulate in the ensemble, some combinations have produced very efficient performance. Researchers are therefore becoming increasingly interested in ensemble classifiers.
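Of the combination strategies mentioned, majority voting is the simplest to illustrate. In the sketch below, the three weak learners are hypothetical single-rule classifiers; their feature names and thresholds are assumptions made for the example.

```python
from collections import Counter

# Hypothetical weak learners: each applies a single threshold rule.
weak_learners = [
    lambda x: "attack" if x["failed_logins"] > 3 else "normal",
    lambda x: "attack" if x["duration"] < 1.0 else "normal",
    lambda x: "attack" if x["bytes_sent"] > 10_000 else "normal",
]

def majority_vote(classifiers, instance):
    """Return the label predicted by the largest number of classifiers."""
    votes = Counter(clf(instance) for clf in classifiers)
    return votes.most_common(1)[0][0]
```

Bagging and boosting differ mainly in how the component classifiers are trained and weighted; the final combination step can still be a (weighted) vote of this form.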

A. Distribution of Papers by Year of Publication
Intrusion detection methods can be divided into three categories, namely single, hybrid and ensemble [11]. Fig. 2 depicts the number of research papers using single, hybrid and ensemble classifiers in each year. Table II lists the proposed algorithms used in all the articles reviewed in this paper. Table IV shows the year-wise distribution of single classifiers with the results and citations of each article.

B. Classifier design
Support vector machine and artificial neural network are the most popular approaches among single learning algorithm classifiers. Although only 49 related papers were reviewed and the number of comparative samples is small, the comparison suggests that the support vector machine is by far the most commonly considered single classification technique. In contrast, fuzzy logic is rarely considered among single classifiers in the listed literature.

D. Ensemble classifiers
Multiple weak learners are combined in ensemble classifiers. Table III lists the articles that use ensemble classifiers in intrusion detection systems. The statistics show that AdaBoost is the most commonly used learning algorithm, along with majority voting. Table III also lists the detection rate of each classifier and the citations of each article over the time period.

F. Used Dataset in Researches
Datasets are associated with default tasks, e.g., classification, regression, function learning and clustering. The datasets reviewed in this paper are used for classification. As Fig. 4 depicts, by far the most commonly used dataset is KDD Cup 1999. This dataset contains 4,000,000 instances and 42 attributes. The number of papers using the KDD Cup 1999 dataset peaks in 2011, and in total 20 research papers mention KDD Cup 1999 as their dataset.
The Car Evaluation dataset [32] contains 1,728 instances with 6 attributes, all of categorical type. The Wisconsin Breast Cancer dataset [16] is multivariate; all 10 attributes are integers and it has 699 instances. The Glass dataset [32] is multivariate, with 214 instances and 10 real attributes. The Mushroom dataset [32] contains 22 categorical attributes and 8,124 instances. The Lymphography dataset [16] contains 18 categorical attributes and 148 instances. The Yeast dataset [24] has 8 real attributes and 1,484 instances. The Fisher-Iris dataset [25] contains 4 real attributes and 150 instances. The Bicup2006 and CO2 datasets [27] have 1,323 and 296 instances respectively. Public datasets such as DARPA 1998, DARPA 2000, Fisher-Iris and NSL-KDD are used in many related studies. The study also shows that a few private or non-public datasets were used over the time frame. The study nevertheless highlights public datasets such as KDD Cup 99, DARPA 1998 and DARPA 2000 as the standard datasets for intrusion detection systems. The DARPA dataset contains around 1.5 million traffic instances [36]. The NSL-KDD dataset was proposed by removing all redundant instances from KDD'99. NSL-KDD therefore allows a more accurate evaluation of different learning techniques than KDD'99 [19]. Some datasets were used only sporadically by the researchers. Table VI shows the year-wise distribution of these randomly used datasets.

G. Feature Selection
Feature selection is an important step for improving system performance. It is carried out before the training phase. Feature selection identifies the best features and eliminates the redundant and irrelevant ones. Table VII shows the year-wise distribution of studies that considered a feature selection step. It indicates that, out of 49 studies, 21 used a feature selection step in their proposed architecture. It also shows that the number of papers using feature selection peaks in 2012, when 7 of the 8 papers from that year used a feature selection step. In 2009, by contrast, the situation was the opposite. Although only 49 related papers were reviewed, the comparison shows that 21 experiments used feature selection whereas 28 did not. This suggests that feature selection was not yet a widely adopted step in intrusion detection. Tables VII and VIII give an overview of the year-wise distribution of feature selection in the related studies and the corresponding paper counts.

The use of different classifier techniques in intrusion detection systems is an emerging area of study in machine learning and artificial intelligence, and it has held the attention of researchers for a long time. This paper has identified 49 research papers on the application of different classifiers to intrusion detection published between 2009 and 2014. Although this survey cannot claim to be an in-depth study of those papers, it presents a reasonable perspective and a valid comparison of the work in this field over those years. The following issues could be useful for future research:
• Removal of redundant and irrelevant features before the training phase is a key factor for system performance. Consideration of feature selection will play a vital role in classification techniques in future work.
• Many algorithms are available for feature selection. Trying different feature selection algorithms and working with the best one will help the classification techniques and will also increase the consideration of a feature selection step in intrusion detection.
• The single or baseline classifiers used in performance measurement can be replaced by hybrid or ensemble classifiers.
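The removal of redundant and irrelevant features mentioned in the first issue can be illustrated with a very simple filter. The sketch below treats a constant column as irrelevant and an exact duplicate of an earlier column as redundant; these two criteria and the function name `select_features` are simplifying assumptions, and the surveyed studies use a variety of more sophisticated selection algorithms.

```python
def select_features(rows):
    """Return indices of columns that are neither constant nor exact duplicates.

    rows: list of equal-length tuples, one tuple per training instance.
    """
    columns = list(zip(*rows))
    kept, seen = [], set()
    for idx, col in enumerate(columns):
        if len(set(col)) <= 1:   # irrelevant: the column carries no information
            continue
        if col in seen:          # redundant: identical to an earlier kept column
            continue
        seen.add(col)
        kept.append(idx)
    return kept
```

The classifier would then be trained only on the columns whose indices are returned.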
This survey comprises 49 research papers published between 2009 and 2014. It covers 8 papers from each of the years 2009, 2010 and 2012. The highest number of papers, 11, comes from 2011. There are 10 papers from 2013 and 4 papers from 2014. Fig. 1 depicts the percentage distribution of papers by year of publication.
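The percentage distribution follows directly from the per-year counts stated above. The short calculation below derives the shares from those counts; the variable names are illustrative, and the rounding to one decimal place is a choice made here.

```python
# Paper counts per publication year, as stated in the text.
papers_per_year = {2009: 8, 2010: 8, 2011: 11, 2012: 8, 2013: 10, 2014: 4}

total = sum(papers_per_year.values())

# Share of each year as a percentage of all reviewed papers.
percentages = {year: round(100 * n / total, 1)
               for year, n in papers_per_year.items()}
```

The counts sum to the 49 papers reviewed, which serves as a consistency check on the figures given in the text.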

Fig. 2. Year wise distribution of research papers for the types of classifier design

Fig. 3. Distribution of Single classifiers over the Years

Fig. 4. Distribution of popular datasets over the years
Table V depicts the year-wise distribution of hybrid classifiers with the results and citations of each article. Hybrid classifiers have become established in mainstream intrusion detection research because of the accuracy they have achieved in recent times. The statistics show that hybrid classifiers have the highest number of articles in 2011. The table also shows the algorithms used in each article and their performance in intrusion detection.

TABLE II. ALGORITHMS USED IN SINGLE TYPE OF CLASSIFIER DESIGNED BASED RESEARCH PAPERS

TABLE III. YEAR WISE DISTRIBUTION OF ENSEMBLE CLASSIFIERS REGARDING RESULTS AND CITATION OF EACH ARTICLE
a. Not cited yet.

TABLE IV. YEAR WISE DISTRIBUTION OF SINGLE CLASSIFIERS REGARDING RESULTS AND CITATION OF EACH ARTICLE
b. Not cited yet.

TABLE V. A DETAILED INFORMATION ON RESEARCH PAPERS DESIGNED WITH HYBRID CLASSIFIER

TABLE VI. YEAR-WISE DISTRIBUTION OF RANDOMLY USED DATASET

TABLE VII. YEAR-WISE DISTRIBUTION OF FEATURE SELECTION CONSIDERED

TABLE VIII. DISTRIBUTION OF RESEARCH PAPERS CONSIDERING THE FEATURE SELECTION STEP