Intrusion Detection Techniques in Wireless Sensor Network Using Data Mining Algorithms: Comparative Evaluation Based on Attacks Detection

Abstract: A wireless sensor network (WSN) consists of sensor nodes. Deployed in open areas and characterized by constrained resources, WSNs suffer from several attacks, intrusions and security vulnerabilities. An intrusion detection system (IDS) is one of the essential security mechanisms against attacks in WSN. In this paper we present a comparative evaluation of the best-performing detection techniques in IDS for WSNs; the analysis and comparison of the approaches are presented technically, followed by a brief discussion. Attacks in WSN are also presented and classified according to several criteria. To implement and measure the performance of the detection techniques, we prepare our dataset, based on KDD'99, in five steps. After normalizing the dataset, we determine a normal class and 4 types of attacks, and use the most relevant attributes for the classification process. We propose applying CfsSubsetEval with the BestFirst approach as an attribute selection algorithm for removing redundant attributes. The experimental results show that the random forest method provides a high detection rate and reduces the false alarm rate. Finally, a set of principles is concluded, which have to be satisfied in future research on implementing IDS in WSNs. To help researchers in the selection of IDS for WSNs, several recommendations are provided together with future directions for this research.


I. INTRODUCTION
Wireless sensor networks (WSNs) are composed of several sensors deployed in an area to collect data and forward it for analysis. They have become an increasingly interesting field of research for solving challenging real-world problems such as environmental monitoring [1], military applications, geographical sensing, traffic control, and home automation. Sensor nodes are severely restricted in resources, including memory, energy, computing, communication and bandwidth [2]. These resource restrictions make WSNs vulnerable to various security threats and make their security an essential issue. Key management and authentication have been used to protect WSNs from different attacks; encryption and authentication form the first line of defense for protecting a WSN [3]. But cryptography based on secret key management is not enough to protect the WSN, because even in the presence of this first line of defense, several attacks may extract sensitive information and use it for malicious purposes. Detection-based approaches have therefore been proposed to protect WSNs from intrusions and attacks as a second line of defense, for when the cryptographic techniques fail [4]. An intrusion detection system (IDS) observes and analyzes the events generated in the network to identify as many security problems as possible. IDSs are used to monitor the network to detect anything unusual [5]. This concept was originally proposed by Anderson [6]. There are two principal approaches to intrusion detection. Misuse detection is based on rules: these rules look for signatures in the network and in system operations in order to catch known attacks [7] [8]. Anomaly detection [9] is based on the normal behavior of a system: it compares normal activities against observed events to identify significant deviations.

The main scope of this paper is to show that the random forest technique is an efficient anomaly detection technique for IDS in WSN, through a comparative evaluation of the most recent and best-performing anomaly detection techniques used in IDS for WSN. In Section 2 we present a classification of existing attacks in WSN by several criteria. Section 3 introduces a survey of IDS in WSN, and analyzes four recent anomaly detection techniques used in IDS for WSN (K-means, Naïve Bayes, SVM, Random Forest), showing their principles, advantages and drawbacks. The simulation environment and results are presented in Section 4: we simulated these techniques on the KDD dataset using the Weka tool, and the results are based on confusion matrices, detection rate, execution time and memory consumption. At the end of the paper a conclusion is given, and a set of recommendations is suggested to boost the performance of intrusion detection in WSN in future research.

www.ijacsa.thesai.org

II. ATTACKS CLASSIFICATION IN WSN
An attack is a set of techniques used to cause damage to a network by exploiting its flaws. Attacks can be classified in several ways; the most common classifications are grouped into the following categories:

A. According to the origin or source of attacks:
Two categories are distinguished: internal and external attacks. An external attack is triggered by a node that does not belong to the network or does not have access permission. The aim of this attack is to cause congestion in the network, spread incorrect routing information, or completely shut down the network. An internal attack is carried out by a malicious internal node. Defense strategies generally aim to protect the network against external attacks. However, internal attacks are the most serious threats that can disturb the WSN [10] [11].

B. Based on the nature of attacks:
We can distinguish between passive and active attacks. A passive attack is limited to listening to and analyzing the exchanged traffic. This kind of attack is easy to carry out and difficult to detect, because the attacker does not modify the exchanged information. The aim of the attacker can be to identify the significant nodes in the network (cluster head nodes), or to obtain confidential information by analyzing routing information. In an active attack, the attacker tries to modify or remove the messages transmitted on the network, inject his own traffic, or replay old messages in order to disturb the operation of the network [12].

C. Classification by attack techniques:
The spoofed, altered, or replayed routing information attack and the sinkhole attack need a probing step before the attack starts; therefore we classify them as probe attacks. Selective forwarding, jamming and tampering, which use illegitimate data forwarding to carry out the attack, are known as DoS attacks. Hello floods caused by internal attackers are classified as U2R attacks. Sybil, wormhole, hello flood, and acknowledgment spoofing attacks exploit weaknesses of the system and are therefore classified as R2L attacks. In the table below we present the main types of attacks, sorted by the four principal attack classes and by their assignment to the relevant layers of the protocol stack. For each attack, a list of proposed defense mechanisms is presented [13][14]:

ATTACKS, PROTOCOL LAYER AND DEFENSE MECHANISM

III. RELATED WORK
It has become clear that a satisfactory level of security in WSN cannot be achieved by cryptographic techniques alone, as these techniques fall prey to insider attacks. The attacker can compromise a number of nodes and retrieve their cryptographic material [15]. In order to counter this threat, additional techniques such as intrusion detection systems (IDS) have to be deployed. Any kind of unauthorized or unapproved activity is called an intrusion. An IDS is a collection of resources, methods and tools that help identify, evaluate, and report intrusions [16]. WSNs have led researchers to develop strategies for providing stable networking and communications, and for securing these strategies with limited resources.
In [17], a hierarchical framework for intrusion detection as well as data processing is proposed. Throughout the experiments on the proposed framework, the authors stressed the significance of one-hop clustering. They believed that their hierarchical framework was useful for securing industrial applications of WSNs with regard to two lines of defense. Krontiris et al. [18] proposed a distributed IDS for WSNs based on collaborative neighborhood watching. In a simulation environment, the authors evaluated the effectiveness of their IDS scheme against blackhole and selective forwarding attacks. The authors of [19] provided an IDS for WSNs that was based on the detection of packet-level receive power anomalies. The detection scheme was focused on

IDS REQUIREMENT FOR WSN
There are two basic approaches in IDS according to the detection technique used [21]:

Misuse detection:
The misuse detection technique compares the observed behavior with known attack patterns (signatures). Action patterns that may pose a security threat have to be defined and stored in the system. The advantage of this technique is that it can accurately and efficiently detect instances of known attacks, but it lacks the ability to detect unknown types of attack.

Anomaly detection:
The detection is based on monitoring changes in behavior, rather than searching for known attack signatures. Before an anomaly-detection-based system is deployed, it usually must be taught to recognize normal system activity (usually by automated training). The system then watches for activities that differ from the learned behavior by a statistically significant amount. The main disadvantage of this type of system is its high false positive rate. The system also assumes that there are no intruders during the learning phase.
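As a minimal illustration of "learning normal activity and flagging statistically significant deviations", the sketch below learns the mean and standard deviation of a single measured quantity and flags values whose z-score exceeds a threshold. The function names, the sample readings and the threshold of 3 standard deviations are assumptions made for illustration; they are not taken from any of the surveyed systems.

```python
from statistics import mean, stdev


def learn_profile(training):
    """Learn a normal profile (mean, standard deviation) from attack-free data."""
    return mean(training), stdev(training)


def is_anomalous(value, profile, z_threshold=3.0):
    """Flag a value that deviates from the profile by more than z_threshold sigmas."""
    mu, sigma = profile
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Hypothetical readings collected during the (assumed intruder-free) learning phase.
profile = learn_profile([20.1, 19.8, 20.3, 20.0, 19.9, 20.2])
print(is_anomalous(20.4, profile), is_anomalous(35.0, profile))
```

A low threshold catches more attacks but raises the false positive rate mentioned above; a high threshold does the opposite.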
An anomaly may be caused by security threats, by faulty sensor nodes in the network, or by unusual phenomena in the monitored zone [22]. Isolated node failures can bring down the whole network, which is detrimental to the reliability of the WSN. Studies presenting the latest progress in anomaly detection for WSN are still lacking. Our paper is intended to act as a guideline for selecting efficient and appropriate anomaly detection techniques, based not only on analyzing, comparing, and evaluating particular approaches, but also on the results of simulation, which show the classification rate, confusion matrix, memory consumption, and model-building time of every approach.

IV. STUDY, ANALYSIS AND EVALUATION OF ANOMALY DETECTION TECHNIQUES IN WSN
A. Clustering approach
With the K-means clustering algorithm, Rajasegarar et al. [20] designed a distributed detection scheme. Each common sensor node locally collects the input dataset to work out a normal profile. Then the cluster head collects all local normal profiles to carry out the data processing procedure, in which a global normal profile is produced. After receiving the global normal profile, each common sensor node initiates the analysis and decision procedure to perform detection. In order to fit distance-based clustering, the input dataset is normalized at each common sensor node with a preprocessing procedure.

Given a dataset $X = \{x_k\}$, $k = 1 \dots m$, where $x_{kj}$ denotes the jth attribute of $x_k$, it is transformed to

$$u_{kj} = \frac{x_{kj} - \mu_j}{\sigma_j}$$

where $\mu_j$ and $\sigma_j$ stand for the mean and standard deviation of the jth attribute in $X$. Subsequently $u_{kj}$ is normalized in the interval [0,1], according to

$$\bar{u}_{kj} = \frac{u_{kj} - \min_k u_{kj}}{\max_k u_{kj} - \min_k u_{kj}}.$$

Given a common sensor node $s_i$ collecting a dataset $X_i$, $s_i$ sends the local normal profile $\left(\sum_{k=1}^{m} x_k,\ \sum_{k=1}^{m} x_k^2,\ m\right)$ to the cluster head, where $m$ stands for $|X_i|$. After the global normal profile $(\mu_G, \sigma_G)$ is computed, the cluster head sends it back to the common sensor nodes. After receiving the global normal profile, each common sensor node initiates detection locally, using a fixed-width clustering algorithm. If the Euclidean distance between a data point and its closest cluster centroid is larger than a user-specified radius $\omega$, a new cluster is created with this data point as centroid. To reduce the number of resulting clusters, a cluster merging process is then conducted by measuring the inter-cluster distances [35]. The clusters $c_1$ and $c_2$ merge if their inter-cluster distance $d(c_1, c_2)$ is less than $\omega$. Finally, the average inter-cluster distance of the K nearest neighbor (KNN) clusters is applied to identify anomalous clusters. Let $ICD_i$ be the average inter-cluster distance (KNN) of cluster $i$, and let $AVG(ICD)$ and $SD(ICD)$ be the mean and standard deviation of all inter-cluster distances respectively. If $ICD_i > AVG(ICD) + SD(ICD)$, cluster $i$ is viewed as anomalous [35].
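The local detection stage described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of [20]: the data are assumed to be already normalized to [0,1], each centroid is kept as the running mean of its members, the merging step is omitted, and the population standard deviation of the per-cluster ICD values is used for SD(ICD). The function names and sample points are hypothetical.

```python
import math
from statistics import mean, pstdev


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fixed_width_clusters(points, width):
    """Assign each point to its closest cluster, or open a new cluster
    centred on the point when the closest centroid is farther than `width`."""
    centroids, sizes = [], []
    for p in points:
        nearest, best = None, None
        for i, c in enumerate(centroids):
            d = euclidean(p, c)
            if best is None or d < best:
                nearest, best = i, d
        if nearest is None or best > width:
            centroids.append(tuple(p))
            sizes.append(1)
        else:
            # Assumption: the centroid is the running mean of its members.
            sizes[nearest] += 1
            n = sizes[nearest]
            centroids[nearest] = tuple(
                c + (x - c) / n for c, x in zip(centroids[nearest], p)
            )
    return centroids


def anomalous_clusters(centroids, k):
    """Return indices i with ICD_i > AVG(ICD) + SD(ICD), where ICD_i is the
    mean distance from centroid i to its k nearest neighbouring centroids."""
    if len(centroids) < 2:
        return []
    icd = []
    for i, c in enumerate(centroids):
        dists = sorted(euclidean(c, o) for j, o in enumerate(centroids) if j != i)
        icd.append(mean(dists[:k]))
    threshold = mean(icd) + pstdev(icd)
    return [i for i, value in enumerate(icd) if value > threshold]


# Hypothetical normalized measurements: four dense groups and one outlier.
points = [(0.10, 0.10), (0.12, 0.11), (0.30, 0.10), (0.31, 0.12),
          (0.10, 0.30), (0.11, 0.31), (0.30, 0.30), (0.95, 0.95)]
centroids = fixed_width_clusters(points, width=0.1)
print(anomalous_clusters(centroids, k=2))
```

The cluster formed by the isolated point lies far from its nearest neighbours, so its ICD exceeds the threshold while the four dense clusters stay below it.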

Constraints and challenges of WSN
Requirement of IDS
• There is no trusted authority; decisions have to be concluded in a collaborative manner.
• Not introduce new weaknesses to the system,
• Need little system resources and should not degrade overall system performance by introducing overheads,
• Run continuously and remain transparent to the system and the users,
• Use standards to be cooperative and open,
• Be reliable and minimize false positives and false negatives in the detection phase.

B. Support Vector Machine Classifier
Support Vector Machines (SVMs) are supervised learning algorithms [24], which have been applied increasingly to anomaly detection in the last decade. One of the primary benefits of SVMs is that they learn very effectively from high-dimensional data [25]. In WSN, SVM is used to investigate spatial and temporal correlations of data for detecting suspect behavior of a node. Many researchers have tried to find methods for applying SVM classification to large data sets. Sequential Minimal Optimization (SMO) is a fast method to train SVMs [26], which breaks the large Quadratic Programming (QP) problem into a series of smallest possible QP problems. In [27] Kim et al. applied SVMs to host-based anomaly detection of masquerades. The one-class quarter-sphere SVM, as a representative SVM algorithm, is also suited to distributed anomaly detection [28]. First, the local quarter-sphere is computed at each common sensor node. Second, the cluster head collects these locally computed radii to work out a global radius. Detection is then launched at each common sensor node with the global normal profile.

C. Naïve Bayes Classifier
The naive Bayes classifier is often used in WSN because of its simplicity, elegance, and robustness. A large number of modifications have been introduced by the statistical, data mining, machine learning, and pattern recognition communities in an attempt to make it more flexible. A novel approach was proposed in [29] to identify faulty sensor nodes using a Naïve Bayes classifier. The proposed Naïve Bayes framework was deployed for performing WSN faulty node detection. A new attribute, the end-to-end transmission time of each packet arriving at the sink, is analyzed using the Naïve Bayes classifier to determine the network status. This technique does not involve any additional protocol or extra resource consumption on the sensor node; it suggests a list of suspicious faulty nodes to the user [29]. In the same context, an IDS based on mobile agents and using a naïve Bayesian classifier is presented in [23]. The principle of the naive Bayesian classifier is as follows. Let $m$ be the number of classes $C_1, C_2, \dots, C_m$. Each class $C_t$ is described by an $n$-dimensional vector $d_{c_t} = \{d_{c_t 1}, d_{c_t 2}, \dots, d_{c_t n}\}$, where $\sum_{i=1}^{n} d_{c_t i} = 1$. Let $K$ be the total number of scenes of network operation, $S = \{s_1, s_2, \dots, s_K\}$. The probability of a scene $s_k$ given class $C_t$ is a product over the data that appear in the scene:

$$P(s_k \mid C_t) \propto \prod_{i=1}^{n} d_{c_t i}^{\,N_{ik}}$$

where $N_{ik}$ is the number of occurrences of data $i$ in scene $s_k$.
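The product rule above can be illustrated with a small count-based (multinomial) naive Bayes sketch. It is not the implementation of [23] or [29]: the Laplace smoothing parameter `alpha`, the function names, and the toy "scenes" (lists of discretized observations drawn from a known vocabulary) are assumptions made for illustration.

```python
import math
from collections import Counter


def train(scenes, labels, vocabulary, alpha=1.0):
    """Estimate class priors and per-class item probabilities d_{ct,i}.
    Every item in every scene is assumed to belong to `vocabulary`."""
    priors, probs = {}, {}
    for c in sorted(set(labels)):
        counts = Counter()
        n_scenes = 0
        for scene, label in zip(scenes, labels):
            if label == c:
                counts.update(scene)
                n_scenes += 1
        # Laplace smoothing (an assumption) keeps every probability non-zero.
        total = sum(counts.values()) + alpha * len(vocabulary)
        priors[c] = n_scenes / len(scenes)
        probs[c] = {v: (counts[v] + alpha) / total for v in vocabulary}
    return priors, probs


def classify(scene, priors, probs):
    """Pick the class maximizing log P(C) + sum_i N_i * log d_{c,i}."""
    counts = Counter(scene)

    def log_posterior(c):
        return math.log(priors[c]) + sum(
            n * math.log(probs[c][v]) for v, n in counts.items()
        )

    return max(priors, key=log_posterior)


# Hypothetical scenes: discretized packet transmission times seen at the sink.
vocabulary = ["fast", "slow", "lost"]
scenes = [["fast", "fast", "slow"], ["fast", "fast", "fast"],
          ["slow", "lost", "lost"], ["lost", "slow", "slow"]]
labels = ["normal", "normal", "faulty", "faulty"]
priors, probs = train(scenes, labels, vocabulary)
print(classify(["fast", "fast"], priors, probs))
print(classify(["lost", "slow"], priors, probs))
```

Working in log space turns the product over data items into a sum, which avoids numerical underflow for long scenes.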

D. Random Forest Classifier:
Random forests are an ensemble learning method for classification that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees. A random tree, in turn, is a decision tree constructed with randomly selected attributes [30]. Each tree is constructed using the following algorithm:
Step1: Let the number of training cases be N, and the number of variables in the classifier be M.
Step2: We are told the number m of input variables to be used to determine the decision at a node of the tree; m should be much less than M.
Step3: Choose a training set for this tree by choosing N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the rest of the cases to estimate the error of the tree, by predicting their classes.
Step4: For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
Step5: Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
A novel data mining approach based on random forests was proposed in [31] to characterize and classify a large-scale physical environment. The proposed data mining formulation allows better performance in terms of the tradeoff between energy efficiency and accuracy. Compared to a single decision tree algorithm, RF runs efficiently on large datasets with better performance. In [30], Random Forests (RF) is used as a classifier for the proposed intrusion detection framework. RF gives better performance in designing an IDS that is efficient and effective for network intrusion detection. The advantages and inconveniences of the studied techniques are summarized in Table IV.
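Steps 1 to 5 can be sketched in plain Python as below. This is a simplified illustration, not the Weka implementation used in the experiments: it assumes numeric attributes, non-tuple class labels, Gini impurity as the split criterion, and it omits the out-of-bag error estimate of Step 3. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Best (score, feature, threshold) among the candidate features."""
    best = None
    for f in features:
        # Skipping the largest value keeps both sides of the split non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def grow_tree(rows, labels, m, rng):
    """Grow an unpruned tree (Step 5); a leaf is a label, a node is a tuple."""
    if len(set(labels)) == 1:
        return labels[0]
    features = rng.sample(range(len(rows[0])), m)  # Step 4: m random variables
    split = best_split(rows, labels, features)
    if split is None:  # the sampled variables are constant in this node
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], m, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], m, rng))


def predict_tree(node, row):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def train_forest(rows, labels, n_trees, m, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # Step 3: bootstrap sample
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, rng))
    return forest


def predict_forest(forest, row):
    """Output the mode of the classes predicted by the individual trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


# Hypothetical two-attribute records (N = 12, M = 2, m = 1).
rows = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.05, 0.1), (0.2, 0.2), (0.1, 0.05),
        (0.8, 0.9), (0.9, 0.8), (0.85, 0.85), (0.95, 0.9), (0.8, 0.8), (0.9, 0.95)]
labels = ["normal"] * 6 + ["attack"] * 6
forest = train_forest(rows, labels, n_trees=25, m=1)
print(predict_forest(forest, (0.0, 0.0)), predict_forest(forest, (1.0, 1.0)))
```

Because each tree sees a different bootstrap sample and a different random subset of variables, individual trees may err, but the majority vote is far more stable than any single tree.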

V. EXPERIMENT RESULTS
A series of experiments was conducted to simulate and evaluate each approach, in order to identify the most efficient detection technique for IDS in WSN. We used several critical evaluation metrics: confusion matrix, general classification rate, time to build the model, and memory consumption. We prepared our data set, based on the standard KDDCup'99 intrusion detection dataset [32], in the following five steps, using the Weka tool:
Step1: In this step we structured all records in the Attribute-Relation File Format (ARFF), which is an input file format used by the machine learning tool WEKA [33].
Step2: In this step we grouped all types of attacks into four principal categories, as shown in table [2].
Step3: The main aim of this step is to define the number of records treated for each class, as presented in the table below. We used 70% of the records in the training stage and 30% in the test stage for each class.
Step4: In general, a feature is good if it is relevant to the class concept but not redundant with any of the other features. Attribute reduction is the process of choosing a subset of the original attributes so that the feature space is optimally reduced. In our experiment, the Weka tool is used for attribute reduction. CfsSubsetEval with the BestFirst approach is applied to the training data to obtain the relevant features for the classification process. Each subset was analyzed using correlation analysis to identify important features. The best-known correlation measure is the linear correlation coefficient. For a pair of variables (x, y), the linear correlation coefficient r(x, y) is given by

$$r(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$

The main principle of the CfsSubsetEval method is to evaluate the worth of a subset of attributes by considering the individual predictive ability of each attribute as well as the degree of redundancy between them. It generates subsets of features that are highly correlated with the class while having low intercorrelation [34]. The selected attributes are presented in Table VI.
Step5: In this step we implemented each technique on our dataset, using the Weka tool. The results obtained are reported in terms of confusion matrix, detection rate, execution time and memory consumption.
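The idea behind correlation-based feature subset evaluation in Step 4 can be sketched with the linear correlation coefficient and the standard CFS merit heuristic, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation of a subset of k features. This is an illustration only: Weka's CfsSubsetEval may use a different correlation measure internally, the BestFirst search is not shown, and the function names and toy columns are hypothetical.

```python
import math


def pearson(x, y):
    """Linear correlation coefficient r(x, y); 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def cfs_merit(columns, target):
    """CFS merit of a feature subset: rewards correlation with the class,
    penalizes correlation among the features themselves."""
    k = len(columns)
    r_cf = sum(abs(pearson(c, target)) for c in columns) / k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = (sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Hypothetical columns: `a` tracks the class, `noise` does not.
target = [0, 0, 1, 1, 0, 1]
a = [0.1, 0.2, 0.9, 0.8, 0.1, 0.9]
noise = [0.5, 0.1, 0.4, 0.2, 0.3, 0.3]
print(cfs_merit([a], target), cfs_merit([a, a], target), cfs_merit([a, noise], target))
```

In this toy case, adding an exact duplicate of `a` leaves the merit unchanged and adding the uncorrelated `noise` column lowers it, which is why a subset search guided by this score drops redundant and irrelevant attributes.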

A. Confusion Matrix:
In order to assess these techniques we take the confusion matrix, illustrated below:

B. Classification Rate
The purpose of classification is to minimize the probability of error. Detection algorithms are usually evaluated using the detection rate. A simple way to perform intrusion detection is to use a classifier to determine whether the observed traffic data is normal or an attack. We present the classification rate in two ways: global records classification and general rate classification.
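The sketch below shows how the classification rate, the per-class detection rate and the false alarm rate can be derived from a confusion matrix whose rows are actual classes and whose columns are predicted classes. The matrix values are hypothetical and are not the results of this study; the function names are likewise illustrative.

```python
def overall_classification_rate(matrix):
    """Fraction of all records that lie on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def detection_rates(matrix, classes):
    """Per-class rate: correctly classified records of a class / records of that class."""
    return {c: matrix[i][i] / sum(matrix[i]) for i, c in enumerate(classes)}


def false_alarm_rate(matrix, normal_index=0):
    """Fraction of normal records that were classified as some attack."""
    row = matrix[normal_index]
    return (sum(row) - row[normal_index]) / sum(row)


# Hypothetical confusion matrix (rows: actual, columns: predicted).
classes = ["normal", "DoS", "Probe"]
matrix = [[90, 6, 4],
          [5, 95, 0],
          [10, 0, 40]]
print(overall_classification_rate(matrix))
print(detection_rates(matrix, classes))
print(false_alarm_rate(matrix))
```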

Global records classification:
The table below presents, for each technique, the global number of correctly and incorrectly classified records:

According to the results, the SVM method is the most complex, O((NM)^2), which explains its high memory consumption (38.444 KB) and its long model-building time. The memory consumption of these techniques is compared with the properties of sensor nodes that can be used in the deployment of a wireless sensor network; we chose MICA2 and TelosB. MICA2 is equipped with a processor running at 7.37 MHz, 4 KB of RAM, 128 KB of flash memory and a radio transmitter at 433 MHz. TelosB is equipped with an 8 MHz processor, 10 KB of RAM, 48 KB of program memory, and 1024 KB of flash storage.
In figure 5, we compare the memory consumption of the studied techniques with the capacity of the sensor nodes. The time to build each approach is presented in figure 6. According to the results, the memory is sufficient to run each approach on a MICA2 or TelosB node. However, in order to increase the lifetime of the node, and taking into consideration the main aim of these techniques, which is detecting the different attacks (classification rate), we can say that the Random Forest technique is the most efficient technique for detecting intrusions in wireless sensor networks, with the highest classification rate (99.9544%), a reasonable memory requirement (11.62 KB), and a reasonable building time (78.67 s). Indeed, according to confusion matrix, classification rate, complexity, building time and memory consumption, the techniques can be ranked from the most to the least performant as follows: Random Forest, SVM, Naïve Bayes and K-means. Classification based on suitable feature selection is

VI. CONCLUSION
The key challenge in developing intrusion detection systems for WSN is to identify attacks with high accuracy while satisfying the required constraints, in order to prolong the lifetime of the entire network. This aim can be attained in several ways. Firstly, by paying much more attention to the efficiency and ability of the techniques used for attack detection. Secondly, by reconstructing the detection mechanism in a distributed manner, to reduce the communication overhead. This paper has compared and evaluated the newest anomaly-based intrusion detection techniques used in wireless sensor networks, in order to identify the most efficient technique for IDS in WSN. According to the results, it is highly recommended to use data mining techniques to effectively detect intrusions and attacks in WSN. The choice of an efficient IDS is a compromise between the technique employed and the performance metrics. However, many issues are still open and need further research efforts, such as hierarchical clustering patterns, using machine learning for the resource management problem of wireless sensor networks, developing a classifier that is well trained with network patterns, and selecting and preprocessing an appropriate dataset. In addition, smart strategies such as compressing the input dataset, narrowing the attribute set and simplifying the analysis and decision procedure could help IDS satisfy the constraints of WSN without losing security and reliability.

Fig. 4. Instance Classification

As shown in the figure above, random forest has the highest number of correctly classified instances and the lowest number of incorrectly classified instances, whereas the complete opposite is observed for the K-means technique.

General rate classification:
The following figures represent the classification rate of each class: the normal class is shown in blue, the DoS class in red, the U2R class in sky blue, the R2L class in green, and the Probe class in pink. A better classification is obtained if the represented classes are well separated. According to the results, we deduce that the Random Forest classifier is more effective and efficient than the other approaches, with a classification rate of 99.9544%. The complexity variables are given below (N: number of instances, M: number of attributes, C: number of classes, V: attribute values).

Fig. 6. Memory Consumption

Classification based on suitable feature selection is one of the main factors that affect the performance of IDS, especially in WSN.

TABLE II .
The detection scheme in [19] focused on transceiver behaviors and packet arrival rates of the neighboring nodes of a particular node. In [20], a distributed cluster-based anomaly detection algorithm was proposed. The authors minimized the communication overhead by clustering the sensor measurements and merging clusters before sending a description of the clusters to the other nodes. They implemented their proposed model in a real-world project and demonstrated that their scheme achieves accuracy comparable to that of centralized schemes, with a significant reduction in communication overhead. The table below presents a brief list of constraints and the corresponding requirements of IDS in WSN:

TABLE IV. ADVANTAGES AND INCONVENIENCES OF STUDIED TECHNIQUES

TABLE VI. MOST RELEVANT ATTRIBUTES

TABLE VII. CONFUSION MATRIX APPROACHES

86%); however, 6106 instances are classified into the normal class, 55 as U2R attacks, 58 as R2L attacks and 151 as Probe attacks. Naïve Bayes is able to classify 39189 DoS attacks out of 41749 real DoS attacks (93.87%), while 150 instances are classified as normal and 5 as Probe. SMO classified 41735 DoS attacks out of 41749 (99.96%), and 15 instances into the normal class. Finally, random forest classified 41745 DoS attacks out of 41749 real DoS attacks (99.99%), and 2 instances as Probe attacks.

TABLE VIII. INSTANCES CLASSIFICATION