Detecting Unauthorized Network Intrusion based on Network Traffic using Behavior Analysis Techniques

Network intrusion detection is an essential problem today because cyber-attacks are increasing in both number and severity. Intrusion techniques often use various methods to evade anomaly detection and surveillance systems. This paper proposes using behavior analysis techniques together with machine learning and deep learning algorithms to detect network intrusions. The practical and scientific contribution of the paper is twofold: (1) for feature selection and extraction, instead of using the typical abnormal behaviors of specific attacks, this study uses statistical behaviors that are easy to calculate and extract while preserving the effectiveness of the method; (2) for detection, this study uses the Random Forest (RF) classification algorithm together with the Multilayer Perceptron (MLP) and Convolutional Neural Network (CNN) deep learning models. The experimental results in Section IV support the proposed approach. Based on these results, this study provides network surveillance systems with a number of abnormal behaviors that can serve as a basis for detecting network intrusions.

Keywords: Network intrusion detection; abnormal behaviors; IDS 2018 dataset; deep learning and machine learning


I. INTRODUCTION
Unauthorized intrusion is a dangerous form of attack that has been growing rapidly in both the number of recorded attacks and the extent of the damage caused to organizations and enterprises. Therefore, early detection of and warning about cyber-attack campaigns is essential. Currently, there are two main methods for detecting network intrusions: the signature-based method, which relies on rulesets, and the anomaly-based method, which analyzes data and statistics to find abnormal characteristics in the network [1], [2], [3]. The signature-based method can detect network intrusions quickly and accurately, but it cannot detect new attack techniques [1]. The anomaly-based method can detect not only known attacks but also other abnormal behaviors; however, it requires complex computation and processing, and its accuracy is not high. The anomaly-based method usually relies on two main techniques to separate abnormal from normal behavior: machine learning and deep learning [1], [2]. For intrusion detection based on machine learning or deep learning, the most important factor is therefore how normal and abnormal behavior are identified.
The studies [4,5] focused on extracting abnormal characteristics and behaviors based on specific attack techniques. However, we noticed that such an approach can detect attacks quickly and accurately on specific datasets but has difficulty detecting cyber-attack techniques on other datasets. Therefore, this paper proposes a new network intrusion detection method that applies machine learning and deep learning algorithms (RF, MLP, and CNN) to the behaviors of network traffic. Accordingly, we do not look for ways to analyze abnormal behavior in network data; we only compute statistics on the behavior of network traffic and then use machine learning and deep learning algorithms for analysis and evaluation.
With this approach, this study removes many of the steps otherwise needed to find and extract the abnormal behavior of network intrusion techniques. For the experimental dataset, PCAP files from the IDS 2018 dataset [6] are selected and used. The study [7] listed and analyzed a number of datasets typically used for detecting cyber-attacks, such as DARPA/KDD Cup99, CAIDA, NSL-KDD, ISCX 2012, UNSW-NB15, and IDS 2018. Among them, the IDS 2018 dataset is built and developed in accordance with real network systems. Therefore, this study uses the IDS 2018 dataset to conduct experiments on cyber-attack detection methods.

II. RELATED WORKS
In the study [8], Vikash Kumar et al. proposed a cyber-attack classification method using UNSW-NB15 and rulesets. Nour Moustafa et al. [9] proposed a Geometric Area Analysis Technique for cyber-attack detection using Trapezoidal Area Estimation. That study used the UNSW-NB15 and NSL-KDD datasets to evaluate the effectiveness of the proposed method, and its experimental results showed the superiority of the UNSW-NB15 dataset over the NSL-KDD dataset.
In addition, the study [10] presented a scalable framework for building an effective and lightweight anomaly detection system based on two well-known datasets, the NSL-KDD and UNSW-NB15.
Sikha Bagui et al. proposed in their study [11] a method to detect cyber-attacks based on the Naïve Bayes and Decision Tree (J48) machine learning algorithms. The team [11] used these two algorithms in turn for classifying components of cyber-attacks in the UNSW-NB15 dataset.
The study [12] proposed a cyber-attack detection model using the stacking technique. In their model, the training process uses several machine learning algorithms including K-nearest Neighbor (KNN), Decision Tree (DT) and Logistic [...]
The study [13] evaluated the efficiency of 8 machine learning algorithms (2-layers and 3-layers) for network intrusion detection.
The study [14] presented a DDoS attack detection method using a comprehensive simulation technique of DDoS attacks.
In the study [15], Cho et al. addressed two tasks: detecting cyber-attacks using machine learning algorithms and optimizing features using algorithms such as IG and PCA. Experimental results showed that the team's proposals performed relatively well. However, because feature optimization algorithms have long computation times and high complexity, a large computing system is required. In addition, Cho et al. [16,17,24,25] proposed a method to detect cyber-attacks based on network traffic using machine learning and deep learning algorithms.
In the study [18], Zhao et al. proposed a botnet detection method based on analyzing abnormal behaviors of traffic and flow. Besides, the approach to detect botnet and cyber-attack using the CTU 13 dataset was proposed by Chowdhury et al. [19]. In addition, Ahmed [20] proposed using the ANN deep learning algorithm to classify abnormal connections.

III. NETWORK INTRUSION DETECTION METHOD USING BEHAVIOR ANALYSIS TECHNIQUES
In practice, detecting unauthorized network intrusion with behavior analysis techniques requires two main tasks: (i) defining abnormal behavior, which is the task of selecting and extracting features; and (ii) classifying behaviors, which uses machine learning or deep learning algorithms to classify the behaviors constructed in task (i). The following subsections describe both tasks in detail.

A. Selecting and Extracting Features
This paper uses the CICFlowMeter tool [21] to process network traffic. This tool extracts 76 features from network traffic [16,17]. These features were presented in detail in the studies [17,24].
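To make the idea of easily computed statistical features concrete, the following minimal sketch summarizes a single flow from a list of packets. It is only an illustration: the function name `flow_statistics`, the input format of `(timestamp, length)` pairs, and the feature names are assumptions made here, and they do not reproduce the actual 76 features or the output format of the tool cited above.

```python
from statistics import mean, pstdev


def flow_statistics(packets):
    """Summarize one flow given (timestamp_seconds, length_bytes) tuples.

    The feature names are hypothetical and only illustrate the kind of
    statistical behavior described in the text.
    """
    if not packets:
        raise ValueError("a flow needs at least one packet")
    packets = sorted(packets)                      # order by timestamp
    times = [t for t, _ in packets]
    lengths = [n for _, n in packets]
    # Inter-arrival times between consecutive packets (empty for 1 packet).
    gaps = [b - a for a, b in zip(times, times[1:])]
    duration = times[-1] - times[0]
    total_bytes = sum(lengths)
    return {
        "packet_count": len(packets),
        "flow_duration": duration,
        "total_bytes": total_bytes,
        "pkt_len_mean": mean(lengths),
        "pkt_len_std": pstdev(lengths),
        "iat_mean": mean(gaps) if gaps else 0.0,
        "bytes_per_second": total_bytes / duration if duration > 0 else 0.0,
    }


# Made-up packets for illustration only.
example_flow = [(0.0, 60), (0.5, 1500), (1.0, 1500), (2.0, 60)]
features = flow_statistics(example_flow)
```

Each flow becomes one fixed-length record of such statistics, which is the form of input a classifier can consume without any attack-specific feature engineering.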

B. Detection Method
As mentioned above, to classify intrusion behavior in network traffic, this paper uses machine learning and deep learning algorithms, namely Random Forest, CNN, and MLP. These algorithms are widely studied and applied to many different recognition problems.
Among these, Random Forest is a supervised machine learning algorithm researched and developed by [22]. The studies [1,16] have shown that this algorithm is among the best classification algorithms because it has a simple operating principle, is easy to compute and implement, and in particular has low computation and classification time. The study [22] presented the operating principle and the mathematical model of this algorithm in detail. This paper uses the Random Forest algorithm with standard parameters. We only change the number of trees in order to find the best model for the experimental dataset.
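The experiment of varying only the number of trees can be sketched as follows. This is a deliberately simplified stand-in and not the implementation used in the paper: each "tree" is a one-split decision stump trained on a bootstrap sample with one randomly chosen feature, and the forest predicts by majority vote. The data are synthetic two-feature records generated by the hypothetical helper `make_toy_flows`, not IDS 2018 records.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(X, y, feature):
    """Fit a one-split tree on one feature by minimising training errors."""
    values = sorted({row[feature] for row in X})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    best = None
    for threshold in candidates:
        left = [lab for row, lab in zip(X, y) if row[feature] <= threshold]
        right = [lab for row, lab in zip(X, y) if row[feature] > threshold]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return (feature,) + best[1:]


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def train_forest(X, y, n_trees, rng):
    """Bagging: each stump sees a bootstrap sample and one random feature."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        feature = rng.randrange(len(X[0]))
        forest.append(train_stump([X[i] for i in idx],
                                  [y[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    return majority([predict_stump(stump, row) for stump in forest])


def make_toy_flows(n_per_class, rng):
    """Synthetic records: label 0 (benign) is low, label 1 (attack) is high."""
    X, y = [], []
    for label, (lo, hi) in enumerate([(0.0, 1.0), (2.0, 3.0)]):
        for _ in range(n_per_class):
            X.append([rng.uniform(lo, hi), rng.uniform(lo, hi)])
            y.append(label)
    return X, y


rng = random.Random(42)
X_train, y_train = make_toy_flows(40, rng)
X_test, y_test = make_toy_flows(20, rng)

# Sweep the same tree counts as in the experiments: 10, 50, and 100.
results = {}
for n_trees in (10, 50, 100):
    forest = train_forest(X_train, y_train, n_trees, rng)
    hits = sum(predict_forest(forest, row) == lab
               for row, lab in zip(X_test, y_test))
    results[n_trees] = hits / len(y_test)
```

A full Random Forest grows deep trees and samples a feature subset at every split; the stump version keeps only the two ingredients that matter for the sweep, bootstrap sampling and voting.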
Regarding the MLP network, the study [23] presented in detail the architecture of an MLP network, which is built by simulating the way neurons work in the human brain. MLP networks usually have 3 or more layers: 1 input layer, 1 output layer, and one or more hidden layers. In addition, the efficiency of an MLP network depends on its activation function. In this paper, we vary the activation function to evaluate the effectiveness and suitability of each choice for the network intrusion detection task.
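The role of the activation function can be illustrated with a minimal forward pass through a small MLP. This is a sketch under stated assumptions: the weights are hand-picked for illustration rather than learned, the output unit is assumed to be a single logistic unit for a 2-label problem, and the names `dense` and `mlp_forward` are hypothetical helpers introduced here.

```python
import math

# Activation functions compared in the experiments, plus tanh for reference.
ACTIVATIONS = {
    "identity": lambda z: z,
    "logistic": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "tanh": math.tanh,
    "relu": lambda z: max(0.0, z),
}


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), unit by unit."""
    act = ACTIVATIONS[activation]
    return [act(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def mlp_forward(x, layers, activation):
    """Apply the hidden layers with `activation`, then a logistic output."""
    *hidden, output = layers
    for weights, biases in hidden:
        x = dense(x, weights, biases, activation)
    weights, biases = output
    return dense(x, weights, biases, "logistic")[0]


# Hand-picked weights: one hidden layer with 2 units, one output unit.
layers = [
    ([[1.0, 1.0], [-1.0, 1.0]], [0.0, 0.0]),   # hidden layer
    ([[1.0, 1.0]], [0.0]),                     # output layer
]
x = [1.0, -2.0]

# Same input and weights, different hidden activation.
scores = {name: mlp_forward(x, layers, name) for name in ACTIVATIONS}
```

With identical weights, the hidden pre-activations here are both negative, so ReLU zeroes them while the identity function passes them through. This is one way the choice of activation alone can change the output score.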
Finally, a CNN is defined as a set of basic layers: convolution layers, each followed by a nonlinear layer, and fully connected layers. The detailed structure of a CNN, as well as the terms stride, padding, and MaxPooling, are presented in the paper [23]. In this paper, the ReLU activation function is used.
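The terms stride, padding, and MaxPooling can be illustrated with a minimal one-dimensional example in plain Python. This is a sketch for intuition only: it uses a single hand-picked kernel of size 3 on a made-up input vector, whereas a trained network learns many kernels, and the function names are hypothetical. As in most deep learning frameworks, the "convolution" slides the kernel without flipping it.

```python
def conv1d(signal, kernel, stride=1, padding=0):
    """Slide `kernel` over `signal` with zero padding on both sides."""
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    k = len(kernel)
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(padded) - k + 1, stride)]


def relu(values):
    """Nonlinear layer: negative responses are set to zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Keep the maximum of each non-overlapping window of `size` values."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


# Made-up feature vector and kernel, for illustration only.
signal = [3.0, 1.0, 2.0, 5.0, 4.0, 0.0]
kernel = [-1.0, 0.0, 1.0]

feature_map = conv1d(signal, kernel)          # convolution layer
activated = relu(feature_map)                 # nonlinear layer
pooled = max_pool1d(activated, size=2)        # MaxPooling
```

Padding controls the output length (with `padding=1` and a kernel of size 3 the output has the same length as the input), while a larger stride skips positions and shortens the output.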

A. Experimental Dataset and Scenarios
The experimental dataset is extracted from the IDS 2018 dataset and covers three types of attacks: Bot (Botnet), DoS, and HTTP-attacks. It is divided into 2 sub-datasets with a total of 762,000 records. The first sub-dataset has two labels: 0 (Benign - clean) and 1 (Bot - malicious); the second sub-dataset has three labels: 0 (Benign - clean), 1 (DoS - malicious), and 2 (HTTP-attacks - malicious). We use 70% of this dataset for training and the remaining 30% for testing. In addition, to assess the effectiveness of the proposed method, we tune the parameters of each algorithm to find the best model and architecture.
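A 70/30 split of this kind can be sketched as follows. The text does not state how the split was drawn, so the seeded random shuffle here is an assumption, and `train_test_split` is a hypothetical helper written for this example (it is not a library call). For the 762,000 records in total, a 70/30 split corresponds to 533,400 training and 228,600 test records.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of `records` and cut it into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)     # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder records of the form (record_id, label); not real data.
records = [(i, i % 2) for i in range(1000)]
train, test = train_test_split(records)
```

Fixing the seed keeps the same records in the test set across runs, so that results obtained with different algorithm parameters remain comparable.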

B. Measures to Evaluate the Results of the Algorithm
The following measures are used in this paper to evaluate the accuracy of the models:
• Accuracy: the ratio between the number of correctly classified samples and the total number of samples. Accuracy is calculated by the following formula:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.

1) Experimental results with Random Forest: Table I lists the experimental results of network intrusion detection using the Random Forest algorithm with 10, 50, and 100 trees on the 2-labels dataset. Table I shows that the algorithm has the highest Accuracy and Precision (99.996%) when the number of decision trees is 50. Moreover, when the number of decision trees is changed from 10 to 100, the accuracy of the algorithm does not change much. This shows that, on a dataset with a balanced ratio of normal and abnormal records, the Random Forest algorithm gives good and stable detection results. Fig. 1 presents the confusion matrix when the number of decision trees is 50. Fig. 1 shows that both normal and abnormal records are predicted with high accuracy.
Table II lists the experimental results with the 3-labels dataset. As with the 2-labels dataset, the scores obtained with the 3-labels dataset were high (all over 99%). The Random Forest algorithm gave the best classification results with 100 trees. Comparing the results in Table I and Table II shows that the Random Forest algorithm was more effective on all measures when using the 2-labels dataset. The confusion matrix with 100 trees is shown in Fig. 2.

2) Experimental results with MLP
a) 2-classes dataset: From the results shown in Table III, the MLP model gave very different results depending on the activation function and the number of layers. In particular, with 2 layers, the MLP model gave the best result with ReLU activation. However, when the number of layers was increased to 4, the MLP model gave the best results with Logistic activation. Nevertheless, in terms of accurately detecting intrusion techniques, the MLP model with ReLU activation still gave a clearly better result (reaching 100%). Fig. 3 shows the confusion matrix when using the ReLU activation function. Fig. 3 shows that the MLP model predicted with very high accuracy, with only 32 incorrectly classified records. This result indicates that the MLP model is well suited to the purposes and requirements of the task.
b) 3-classes dataset: The results in Table IV show that when the number of classes to be classified increased to 3, the F1-score decreased greatly. The average F1-score when using 4 hidden units is higher than when using 2 hidden units. However, the highest F1-score was achieved when using 2 hidden units with the Identity activation function. The result of using 4 hidden units with the ReLU activation function was exceptionally low at 16.67%. Fig. 4 depicts the confusion matrix.

3) Experimental results with CNN: The CNN network consists of an input layer, hidden layers, and an output layer with corresponding parameters. After many experiments, we found that processing the data with convolution layers with parameters {filter = 32, 39, 64; filter size = 5; batch size = 32} gave the best results. Learning rates of 0.01, 0.001, and 0.0001 were also tested to select the best value; of these, a learning rate of 0.0001 gave the best results. Table V describes the network models that were selected and tested.
Based on the parameters in Table V, the models were trained for 50 epochs, and all convolution layers used the ReLU activation function. From Table VI, it can be seen that the CNN model with 1D-CNN achieved very good performance in terms of accuracy, precision, recall, and F1-score. The 1D-CNN 2-layers model had the highest performance of the 3 models and did not need all 50 epochs to produce high results. Fig. 5 presents the accuracy of the training and testing process of the 1D-CNN 2-layers model. It shows that this model reached an accuracy of approximately 100% after only 23 epochs and maintained it until the end of training. This model detected most attacks (only 8 attack records were not detected). For normal network traffic, there was just 1 false positive record.
For the 3-labels dataset, the 1D-CNN 3-layers model had the best performance of the 3 models. The accuracy of the training and testing process of the 1D-CNN 3-layers model is shown in the figure below. This model reached an accuracy of approximately 100% after 50 epochs. Fig. 6 depicts the results of the CNN model with 1D-CNN 3-layers.
Table VIII shows the overall comparison of the RF, MLP, and CNN classification algorithms on the 2-classes and 3-classes datasets. With both the 2-labels and 3-labels datasets, the CNN, Random Forest, and MLP algorithms all gave classification results that did not differ greatly on the evaluation metrics. However, in the case of 2 classes, the Random Forest algorithm with 50 trees scored about 0.01% higher than CNN (1D-CNN 2-layers). In the case of 3 classes, the CNN (1D-CNN 3-layers) algorithm gave better classification results than Random Forest with 100 trees (0.0189% higher). This difference is small in percentage terms. However, on a dataset of this size, it corresponds to a considerable number of records and has a great impact on the prediction.
Therefore, depending on the problem, the detection model can be built with either the Random Forest or the CNN algorithm. The data show that, with a large amount of data, the numbers of incorrectly predicted records of the two algorithms differ considerably. We therefore recommend using CNN rather than the Random Forest or MLP algorithms, even though the network's architecture must be defined, including the number of layers, the decision function, etc.
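The measures reported throughout this section (accuracy, precision, recall, and F1-score) can all be derived from a confusion matrix, as the following minimal sketch shows. The counts are made up for illustration and are not taken from any table or figure of this paper. The matrix layout (rows are true labels, columns are predicted labels) and the unweighted macro average of the per-class F1-scores are assumptions, since the averaging scheme is not stated in the text.

```python
def accuracy(cm):
    """Correctly classified samples divided by all samples."""
    correct = sum(cm[k][k] for k in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def class_scores(cm, k):
    """Precision, recall, and F1-score for class k (rows = true labels)."""
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                       # class k predicted as other
    fp = sum(row[k] for row in cm) - tp        # other predicted as class k
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(cm):
    """Unweighted mean of the per-class F1-scores (assumed averaging)."""
    return sum(class_scores(cm, k)[2] for k in range(len(cm))) / len(cm)


# Made-up 2-label confusion matrix: row 0 = benign, row 1 = malicious.
cm = [[90, 10],
      [5, 95]]

acc = accuracy(cm)
precision_1, recall_1, f1_1 = class_scores(cm, 1)
overall_f1 = macro_f1(cm)
```

The same functions work unchanged for the 3-labels case, because each class is scored one-versus-rest from its own row and column of the matrix.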

V. CONCLUSION
Unauthorized network intrusion techniques will continue to evolve in order to bypass the surveillance of attack detection systems. This requires intrusion detection systems to be constantly updated with the abnormal signs and behaviors of network attacks. In this paper, by analyzing the behaviors of network traffic, we have succeeded in distinguishing attack behaviors from normal behaviors in network data. The scientific and practical significance of the paper lies in the classification and feature extraction. In our research, we did not extract the typical features of specific cyber-attacks. Instead, we enumerated the components and characteristics of network traffic as fully as possible and then used machine learning and deep learning algorithms for classification. With this approach, we have greatly reduced the time cost of finding and extracting features of network attacks. In addition, the experimental results support the approach proposed in this paper: analyzing the behavior of network traffic with machine learning and deep learning techniques not only helps to accurately detect network intrusion techniques but also reduces the time spent on finding and extracting features. Furthermore, the experimental results of the Random Forest, CNN, and MLP algorithms with different parameters show that the 2-label dataset gave better results than the 3-label dataset. This suggests that the better the models and data are standardized, the more accurate the classification is, and that the labels of individual network intrusion techniques need not be finely distinguished in the dataset. In the future, we will research and use other analysis methods to improve the efficiency of the detection method on this dataset.
In particular, because our behavior analysis technique extracts statistical features of network traffic, these features express correlations not only in terms of data but also in terms of time. Therefore, algorithms and analysis methods that highlight the time factor in behavior are needed.