1D-CNN based Model for Classification and Analysis of Network Attacks

With the advancement of technology and the surge in network devices, more and more devices are being connected to networks. This puts more data and information on the network and makes network security of paramount importance. Malicious traffic must be detected, and machine learning, or more precisely deep learning (DL), is an emerging approach that can improve detection. In this paper, attacks are detected by classifying traffic into normal and attack data using 1D-CNN, a variant of the convolutional neural network (CNN). The CICIDS2017 dataset, which contains 14 attack types spread across 8 different files, is used to evaluate model performance with indicators such as recall, precision, and F1-score. Separate 1D-CNN based DL models were built on the individual sub-datasets as well as on the combined dataset. The model is also evaluated by comparing it with an artificial neural network (ANN) model. Experimental results demonstrate that the proposed model performs better and shows great capability in detecting network attacks, as the majority of the class labels achieved excellent scores on each of the evaluation indicators used.

Keywords: 1D-CNN; CICIDS2017; network attacks; deep learning


I. INTRODUCTION
The Internet has become a major part of today's society, with people using the services of the WWW for most of their day-to-day activities. People use the Internet for both personal and professional purposes, and the majority of tasks include sharing data, accessing information, sending files, connecting with friends or colleagues through social media and, most importantly, e-commerce activities that involve saving passwords and credit/debit card information. Not only individuals but organizations too depend heavily on the Internet and its services. Network attacks in the form of malicious traffic result in loss of data and privacy violations for individuals; monetary, financial, and political impact on big organizations; and interrupted business for all their shareholders [1]. With the ongoing Covid-19 pandemic, working from home is becoming the new normal. This leaves personal devices vulnerable, as their protection mechanisms are less sophisticated than those of an organization's resources. The pandemic could also lead to more diverse attack mechanisms, which means cyber-security will remain an area of critical importance in the times to come.
There are many approaches to providing cyber-security, ranging from authentication and encryption to firewalls and intrusion detection systems (IDS). Because an IDS monitors network traffic, analyzes its behavior, and identifies attacks from the network flow, it has proven to be a better alternative to the other approaches [2]. Detection of cyberattacks is a classification problem in which each traffic record is categorized as benign or as one of several types of attack. Traditional machine learning (ML) techniques, also known as shallow learning, have been used for intrusion detection by classifying network traffic [3]. As real-world data grows over time and becomes high-dimensional, the performance of ML techniques drops because of their over-dependence on features selected by human experts. DL, with its complex architecture, overcomes this limitation by automatically learning features from massive amounts of data. In this paper, we propose a 1D-CNN as a DL technique for effective feature representation and for categorizing traffic into normal and different attack types. 1D-CNN and 2D-CNN are almost identical in architecture, as the core process in both is the convolution operation.
Convolution is a mathematical operation on two signals: the input signal or data, and a second signal known as the kernel or filter. The operation between the input and the kernel consists of element-wise multiplication followed by summation, which yields a single scalar value. Convolution can be 1-d, 2-d, or multidimensional depending on the problem at hand, but the traditional CNNs [4] and the popular ones in use employ 2-d convolution, which became the de facto standard for most applications in image processing and other deep learning tasks [5] [6] [7]. A CNN consists of an input layer, convolution layers in the initial stages, and MLP or fully connected layers in the final stages of the model preceding the output layer. Other optional but commonly used layers include the sub-sampling (pooling) layer and dropout (a regularization technique). The general convolution operation is shown in equation 1, where X is the input vector, F is the filter or kernel, and * is the convolution operation. Both X and F are 1-dimensional in 1D-CNN, and their dimensions vary with the type of CNN used.
Generally, the architecture of 1D-CNN and 2D-CNN is the same; the main difference is that the former uses a 1d array or tensor and the latter a 2d matrix or tensor. This means that in 1D-CNN both the input data and the kernel used for convolution are 1d arrays, and the kernel moves over the input in one direction. These minor but strategic changes give 1D-CNN certain advantages over 2D-CNN: 1) reduced computational complexity due to the use of 1D tensors instead of 2D tensors, and 2) suitability for low-cost applications while remaining usable for complex problems [8].
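As an illustration of the 1-d operation described above, the following minimal sketch slides a kernel over an input vector and, at each position, multiplies element-wise and sums. It assumes the CNN-style sliding dot product (no kernel flip, no padding, stride 1); the function name `conv1d` and the sample values are illustrative and do not come from the paper.

```python
def conv1d(signal, kernel):
    """Valid-mode 1-d sliding dot product, as used in CNN layers.

    At each position the kernel is multiplied element-wise with the
    overlapping window of the signal and the products are summed.
    """
    n, k = len(signal), len(kernel)
    if k == 0 or k > n:
        raise ValueError("kernel must be non-empty and no longer than the signal")
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(n - k + 1)
    ]


x = [1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative input vector X
f = [1.0, 0.0, -1.0]           # illustrative kernel F
y = conv1d(x, f)               # one scalar per kernel position
print(y)
```

Each output element is a single scalar, which matches the "element-wise multiplication followed by summation" description; a strict mathematical convolution would additionally reverse the kernel before sliding it.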
These advantages of 1D-CNN, and its better compatibility with certain problem domains, have led to many areas where it has been or can be applied, such as:
• Natural Language Processing (NLP), where it is most extensively used and is quite helpful in extracting subsequences from sequences of words [9].
• Human activity recognition tasks, which involve time series of sensor data [10].
• Analysis of signal data over a fixed-length period, for example, an audio recording or real-time motor fault detection [11].
• Data in tabular form.
The focus of this study is network traffic data stored in tabular form, where each record is an individual row with a one-dimensional shape. Applying 2D-CNN to this type of data requires converting each 1d row into a 2d matrix before convolution between input and kernel can be performed. Applying 2D-CNN to images is justifiable since images are already 2d, but for tabular data it takes the additional effort of converting each 1d input row into a 2d matrix, which might also involve padding. This overhead is avoided by using 1D-CNN, since the only notable difference between the two is the shape of the input data and kernel: a 1d array (or tensor) in the former and a 2d matrix (or tensor) in the latter.
The rest of the manuscript is organized as follows: Section 2 discusses the related work in the field of intrusion detection. Section 3 explains the methodology, which comprises three sub-sections: 1) dataset description, 2) model architecture, and 3) model evaluation. Section 4 presents the results and analysis, while Section 5 concludes the paper.
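The conversion overhead mentioned above can be made concrete with a small sketch. It assumes a record with 78 features (the number of input features reported later in the paper) and zero padding up to the next square size; the helper name `row_to_square` and the choice of a square layout are illustrative assumptions, not the procedure of any cited work.

```python
import math


def row_to_square(row, pad_value=0.0):
    """Pad a 1-d record and fold it into the smallest square matrix that fits."""
    side = math.isqrt(len(row))
    if side * side < len(row):
        side += 1
    padded = list(row) + [pad_value] * (side * side - len(row))
    return [padded[r * side:(r + 1) * side] for r in range(side)]


row = [float(i) for i in range(78)]  # one flow record with 78 feature values
matrix = row_to_square(row)          # 9 x 9 matrix; the last 3 cells are padding
as_1d_input = [row]                  # shape (1, 78): what a 1D-CNN can consume directly
```

For 78 features the 2d route needs a 9*9 matrix with three padded cells, whereas the 1d route only wraps the row as a single step.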

II. RELATED WORK
Research on intrusion detection has been going on for many decades, yet a lot of work remains to be done and many issues must still be examined. Several data mining/ML techniques, both supervised and unsupervised, have been applied to the identification of malicious traffic [12] [13]. More recently, DL techniques have been used for the detection of cyber-attacks and have achieved significant results. Our literature review therefore revolves mostly around the DL techniques used (especially CNN) and the CICIDS2017 dataset, which is utilized in the proposed work.
Detection and mitigation of common DDoS attacks using DDoS detectors for network traffic monitoring have been carried out with ANN structures designed separately for different protocols [14]. In [15], the authors use the NSL-KDD and Kyoto datasets to implement their work, which contains two important concepts: the online sequential extreme learning machine, which is the methodology used for classification, and traffic profiling, which makes up the preprocessing part. A DL based intelligent framework has been implemented using long short-term memory (LSTM) to mitigate DDoS attacks in a fog environment [16]. ISCX and CTU-13 were the datasets considered, along with the attack launching tool Hping3, for model evaluation. In [17], the applicability of the restricted Boltzmann machine (RBM) to differentiating between normal and abnormal NetFlow traffic has been demonstrated on the ISCX dataset. A hybrid approach has been adopted in the form of a Double-Layered Hybrid Approach (DLHA), where the first layer uses naïve Bayes (NB) to detect DoS and probe attacks while the second layer adopts SVM for detecting the remaining attacks in the NSL-KDD dataset [18]. In [19], the authors proposed a model based on a 5-layer autoencoder (AE) for the detection of network anomalies. Their work also includes data preprocessing for removing outliers and reconstruction error for effective network traffic classification.
In the detection of network attacks using CNN, the majority of academic research has used 2D-CNN, in which input data in linear form is transformed into matrix form. In [20] and [21], the proposed approach revised the established LeNet-5 model for classification of attacks in the KDD99 dataset, and the input data is converted into 32*32 matrices for input to the model. A DNN based IDS was built with 4 hidden layers and evaluated using the NSL-KDD dataset [22]. Dimensionality reduction using principal component analysis (PCA) and AE has been performed on the KDD99 dataset before applying CNN for classification [23]. The input shape of 1*122 is transformed into 1*121 and 1*100 before being converted to 11*11 and 10*10 matrix shapes, respectively.
Shallow and deep learning have been combined through the random forest (RF) and non-symmetric deep autoencoders (NDAE) [24]. The authors used the NDAE technique for unsupervised learning of features, and for classification they implemented a model constructed from a combination of stacked NDAEs and the RF algorithm. Separate architectures or models were built in the form of CNN, RNN, and different variants of AE [25]. The NSL-KDD dataset was used and each record was converted into a 32*32 2d form. Long short-term memory (LSTM) is the variant of RNN used, while Sparse, Denoising, Contractive, and Convolutional are the variants of AE used in the experiments. In [26], the authors utilized a 1D-CNN based model for intrusion detection and evaluated it using the NSL-KDD dataset. They compared the performance of their proposed model with different ML/DL techniques such as J48, NB, RF, MLP, and RNN. In [27], the authors proposed BAT, a traffic anomaly detection model for effective feature representation and network classification. The BAT model is a combination of a bidirectional LSTM and an attention mechanism.
The CICIDS2017 dataset has also been used for intrusion detection in the literature. The author of the thesis in [28] integrated the open-source anomaly-based IDS Zeek (Bro), which uses scripts for feature extraction, and developed a model using algorithms such as RF, DT, and KNN on the CICIDS2017 dataset. An ML-based hybrid model comprising DT and RF in a stacked manner was recommended for classifying attacks in the CICIDS2017 and NSL-KDD datasets [29]. The author of [30] used the Fisher score as the feature selection method and analyzed supervised learning techniques such as DT, KNN, and SVM for detecting DDoS attacks in the CICIDS2017 dataset. The experimental results showed good detection rates for DT and KNN but mediocre classification results for models built using SVM. In [31], the authors applied and compared 10 common ML/DL techniques for detecting web attacks: ANN, DT, KNN, SVM, CNN, NB, RF, k-means, expectation maximization, and SOM. Their results showed that NB, KNN, and DT outperformed the other models. Table I summarizes the key existing studies on the detection of network attacks using ML or DL. From the literature review, it can be observed that:
• The majority of academic research uses the KDD99 and NSL-KDD datasets despite criticism from researchers that they are outdated [32].
• The applicability and deployment of DL in detecting network anomalies is still in its infancy.
• When implementing CNN, the preferred choice is 2D-CNN although 1D-CNN has better applicability.

III. PROPOSED METHODOLOGY
The proposed 1D-CNN model for classification of attacks consists of four steps:
Step 1: Data preprocessing - This step involves methods to make the data suitable for model training.
Step 2: Model training - This step includes specifying the architecture of a model and then training the model.
Step 3: Testing - The model is tested on unseen data separated from the training dataset.
Step 4: Evaluation - The model is evaluated using the multiple metrics mentioned.
These steps form the basis for the overall process shown in Fig. 1. First, the dataset is split into 80:20 train/test samples, and preprocessing is then applied to both. A model with a basic initial architecture is built and optimized, and the training samples are then used to train the optimized model. The final model is tested on the test dataset with the help of various evaluation metrics.
These stages of the proposed 1D-CNN model, along with a description of the dataset used in the process, are elaborated in the following sub-sections.
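The 80:20 split at the start of this process can be sketched as follows. This is a minimal illustration that assumes a simple shuffled split with a fixed seed; the paper does not state whether the split was stratified or which seed was used, and the function name `train_test_split_rows` is hypothetical.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and hold out a fraction of the rows for testing."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [rows[i] for i in range(len(rows)) if i not in test_idx]
    test = [rows[i] for i in range(len(rows)) if i in test_idx]
    return train, test


records = [[i, i * 2] for i in range(10)]  # ten made-up records
train, test = train_test_split_rows(records)
```

With heavily imbalanced labels, a stratified split that preserves class proportions in both parts may be preferable; that variant is not shown here.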

A. Dataset Description
As already mentioned, the CICIDS2017 dataset, created by the Canadian Institute for Cybersecurity, consists of data spread across eight files in both pcap and csv format [33]. It contains two directories with 8 files each: GeneratedLabelledFlows has 85 features (including the label) per record in each file, while MachineLearningCSV, which is mostly used for ML/DL tasks and is the focus of this study, has 79 features. These features were extracted using CICFlowMeter, a network flow generator, and most of them are time-based statistical features [34]. The csv files are the result of extracting flow-based features from the pcap files using an analyzer. The time-related features in the data files used in our experiments are further classified as: Iat: the inter-arrival time between packets sent in the backward, forward, or either direction; Psec: packets or bytes per second; Active/idle: the time a flow was active/idle before going idle/active; Other: features such as duration, flag count, etc.
As evident from Table II, there are 8 files, of which one contains only benign data while the other 7 contain both benign and attack data. File1 contains two types of brute force attacks used for login attempts, and file2 includes application layer based DoS attacks launched using different tools such as GoldenEye, Hulk, slowhttptest, and slowloris. Furthermore, file3 contains web related attacks such as SQL injection, brute force, and XSS, while file4 incorporates infiltration attack records. Lastly, file5, file6, and file7 include records of Bot, PortScan, and DDoS attacks, respectively.
1) Data preprocessing:
Data preprocessing involves techniques for preparing or transforming values before the data is fed to the model for training. It consists of the following steps:
a) Handling of missing data: There are two approaches to handling missing data: either drop the rows containing the missing value, or fill the cell with a new value. As the dataset contains a large number of missing values, the former approach is impractical, so the latter approach of filling these values is chosen. There are four options for the new value, ranging from a constant such as zero to the mean, mode, or median of the respective attribute. Any of them could work, but we carried out a pre-experiment on a small portion of the dataset before the main experiments to find the best replacement value.
b) Feature scaling: On reviewing the dataset, one finds huge disparities between the values of different columns: attributes such as the SYN and PSH flag counts have a small range of values, while attributes such as duration and total length have values of large magnitude. To scale these values we use standardization, which works on continuous numeric features and ensures that the data in a column has zero mean and unit variance. This gives each feature equal weight and lets gradient descent converge quickly during model training. The formula for standardization is given in equation 2:
newval = (val - mean_val) / sd (2)
where val is the actual value, and mean_val and sd are the mean and standard deviation of the respective attribute.
c) One hot encoding: The last column/attribute, which represents the class label in the training dataset, is one hot encoded to make it compatible with the 1D-CNN model, which expects the target vector in that form during training. This results in additional columns for the output vector, equal in number to the class labels (attack and normal labels).
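The three preprocessing steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions: zero is used as the fill value only as an example (the paper selects the value through a pre-experiment and does not report it here), the population standard deviation is used for sd, and the function names and sample values are hypothetical.

```python
import math


def fill_missing(column, fill_value=0.0):
    """Replace None/NaN cells with a chosen value (zero is only an example)."""
    return [
        fill_value if v is None or (isinstance(v, float) and math.isnan(v)) else v
        for v in column
    ]


def standardize(column):
    """Apply newval = (val - mean_val) / sd to every value of one attribute."""
    mean_val = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean_val) ** 2 for v in column) / len(column))
    if sd == 0:
        return [0.0] * len(column)  # constant column: nothing to scale
    return [(v - mean_val) / sd for v in column]


def one_hot(labels):
    """Return the sorted class names and one indicator vector per label."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    vectors = [[1 if index[lab] == j else 0 for j in range(len(classes))] for lab in labels]
    return classes, vectors


duration = [2.0, None, 4.0, 6.0]  # made-up attribute with one missing cell
filled = fill_missing(duration)
scaled = standardize(filled)      # zero mean, unit variance
classes, targets = one_hot(["BENIGN", "DDoS", "BENIGN", "PortScan"])
```

In practice the mean and standard deviation would typically be estimated on the training split and reused for the test split, so that both are scaled consistently.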

B. Model Architecture
The general architecture used in the experimental setup is shown in Fig. 2. Although we deal with different files, the architecture of the separate models is the same apart from minor changes. It consists of an input layer sequentially connected to 2 or 3 CNN layers interleaved with dropout, followed by a flatten layer that connects to a fully connected (FC) or dense layer and finally to the output layer. The input shape provided to the first Conv layer is (1*C), where 1 is the number of steps (one row at a time) and C is the number of features. The Conv layer maps the input to a high-dimensional space; its output, of dimension 1*f1, is the feature map produced by the f1 filters, which learn network information from the input data. An activation function is then applied to this output; ReLU, the one most commonly used with Conv layers, is used for this purpose. Dropout is then used to minimize the interaction of feature detectors by randomly switching off some connections in the network, thereby preventing model overfitting [35]. Dropout does not decrease the number of parameters in the model; it only prevents some of them from participating in the weight update process. The Softmax activation function is combined with an FC layer to output the classification results. The mathematical formulae for the ReLU and Softmax activation functions are given in equations 3 and 4, where x and xi are the input values and f(x) and Softmax(xi) are the output values passed to the next layer.
f(x) = max(0, x) (3)
Softmax(xi) = exp(xi) / Σj exp(xj) (4)
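The two activation functions named above can be written out in a few lines. This sketch uses their standard definitions, ReLU as max(0, x) and Softmax as normalized exponentials; subtracting the maximum before exponentiating is a common numerical-stability choice added here and is not something the paper describes.

```python
import math


def relu(x):
    """ReLU: pass positive inputs through, clamp negative inputs to zero."""
    return max(0.0, x)


def softmax(values):
    """Softmax: exponentiate each input and normalize so the outputs sum to 1."""
    m = max(values)  # shift by the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # made-up scores for three classes
```

The Softmax output can be read as class probabilities, with the largest input receiving the largest probability.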

1) Parameters and Hyper-parameters:
Another important aspect of model architecture is the distinction between parameters, which are learned through training, and hyper-parameters, which are chosen manually. In building the 1D-CNN model, the hyper-parameters range from general ones such as batch size and number of iterations to model-specific ones such as the number of layers, number of filters, kernel size, initial learning rate, loss function, optimizer, and activation function. The total number of parameters depends on certain hyper-parameters, such as the number of layers, the number of filters or nodes in a layer, and the filter size, which may vary from model to model. The general architecture of the proposed model is: "Conv1(f1,k1)-Dr1(r1)-Conv2(f2,k2)-Dr2(r2)-...-Convn(fn,kn)-Drn(rn)-FC1(nd1)-FC2(nd2)".
Here Conv, Dr, and FC are the convolutional, dropout, and fully connected layers, respectively. fi and ki refer to the number of filters and the kernel size in the ith convolutional layer, whereas ri signifies the dropout rate. The nodes in the FC layers are nd1 and nd2, with the latter being the nodes in the output layer and equal to the number of classes. Once the hyper-parameters are selected, the number of trainable parameters in each layer can be calculated as:
• Conv layer: (number of input channels * kernel size * number of filters) + number of filters
• FC layer: (number of input nodes * number of nodes) + number of nodes
Thus, the total number of parameters in a particular model architecture is equal to the sum of the parameters in all its layers. Note that the use of dropout is optional and has no effect on the number of parameters. Consider, for instance, a model with configuration "Conv(80,1)-Dr(0.2)-Conv(50,1)-Dr(0.2)-FC(50)-FC(2)". The total number of trainable parameters is calculated as: (78*80*1+80) + (80*50*1+50) + (50*50+50) + (50*2+2) = 13022 trainable parameters.
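The worked example above can be checked with a short script. This sketch assumes 78 input features fed as a single step, so that the flattened output of the last Conv layer has as many values as that layer has filters; the function names are illustrative, and dropout layers are omitted because they contribute no parameters.

```python
def conv1d_params(in_channels, filters, kernel_size):
    """Weights (in_channels * kernel_size * filters) plus one bias per filter."""
    return in_channels * kernel_size * filters + filters


def dense_params(in_nodes, nodes):
    """Weights (in_nodes * nodes) plus one bias per node."""
    return in_nodes * nodes + nodes


def count_params(n_features, conv_layers, fc_nodes, steps=1):
    """Sum trainable parameters for Conv(f, k) layers followed by FC layers."""
    total = 0
    channels = n_features
    for filters, kernel_size in conv_layers:
        total += conv1d_params(channels, filters, kernel_size)
        channels = filters
    in_nodes = steps * channels  # size after flattening (one step assumed)
    for nodes in fc_nodes:
        total += dense_params(in_nodes, nodes)
        in_nodes = nodes
    return total


# Conv(80,1)-Dr(0.2)-Conv(50,1)-Dr(0.2)-FC(50)-FC(2) on 78 input features
total = count_params(78, [(80, 1), (50, 1)], [50, 2])
print(total)
```

The same helper can be reused for any configuration string of the form given in the text by listing its (filters, kernel size) pairs and FC node counts.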

C. Model Evaluation
As our work involves classification of multiple classes, a multi-class confusion matrix is used to display the correctly and incorrectly classified instances; its constituents are TP (true positive), TN (true negative), FP (false positive), and FN (false negative). Using these, evaluation indicators such as Precision (Pr), Recall (Rc), and F1_score (F1_sc) can be derived and used to evaluate the model.
For classifying attack data, Pr or PPV (positive predictive value) specifies how many attack predictions actually belong to the attack data.

PPV = TP / (TP + FP) (5)
Rc or TPR (true positive rate) specifies the ratio of correctly predicted attack instances to the actual attack instances.

TPR = TP / (TP + FN) (6)

Both PPV and TPR are useful in their own way: the former tells how many of the attack predictions are relevant, and the latter tells how many of the relevant records are predicted. Instead of choosing one over the other, a single metric, F1_score, calculates the harmonic mean of both.

F1_sc = 2 * PPV * TPR / (PPV + TPR) (7)
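The per-class indicators described above can be derived from a multi-class confusion matrix as in the following sketch. It assumes rows are actual classes and columns are predicted classes; the matrix values and class names are made up for illustration and are not results from the paper.

```python
def per_class_metrics(confusion, labels):
    """Compute PPV, TPR and F1 for each class of a square confusion matrix.

    confusion[i][j] counts records of actual class i predicted as class j.
    """
    n = len(labels)
    results = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                        # rest of row i
        fp = sum(confusion[r][i] for r in range(n)) - tp   # rest of column i
        ppv = tp / (tp + fp) if tp + fp else 0.0
        tpr = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
        results[label] = {"PPV": ppv, "TPR": tpr, "F1": f1}
    return results


labels = ["Benign", "DDoS", "Bot"]
confusion = [
    [90, 0, 10],  # actual Benign
    [0, 50, 0],   # actual DDoS
    [5, 0, 5],    # actual Bot
]
metrics = per_class_metrics(confusion, labels)
```

In this toy matrix the minority class has a much lower F1 than the others even though most records overall are classified correctly, which is why per-class results are more informative than a single overall score on imbalanced data.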

A. Experimental Setup and Model Configuration
Experiments are conducted on the Google Colab platform using the Python language, and Keras is the framework used for building the 1D-CNN model, with TensorFlow as the backend. Other important libraries used are pandas and NumPy for loading/storing the dataset, and sklearn for preprocessing tasks, model evaluation, and calculating results.
The different models built and evaluated may have distinct architecture configurations, resulting in different numbers of parameters and different hyper-parameter values. The number of epochs and the batch size are not the same for every model, but some hyper-parameter values are identical for all the models implemented in the experiments; they are shown in Table III. Table IV shows the configuration parameters of each model built during experimentation, with its complete model architecture.

B. Results
The overall experimental process is divided into two phases:
• Phase1: Separate models built on individual files of the dataset.
• Phase2: Model built on the combined dataset except file0. Some labels are also combined and renamed to make the dataset more balanced.

1) Phase1:
During the first phase of experiments, the individual files of the dataset are used to build different models, which means we have separate models for different types of attacks: model1 is built on file1, model2 on file2, and so on. This is helpful if one wishes to detect a specific type of attack. For instance, to detect DDoS attacks the model built using file7 is useful, and likewise for identifying bots the model created using file5 is selected. Processing the files separately also benefits attacks with few instances, as they have higher prevalence in their respective files than in the combined dataset. It should be emphasized that file0 is not used in the experimental process as it contains only benign traffic, which means seven models were trained and evaluated. Each model classifies normal traffic and the corresponding attacks in its file and is tested on the 20% test data of the respective classes. Table V shows the detailed per-class results of each model; overall metrics are not displayed because the huge imbalance in the dataset would always result in better overall model performance. From the detailed analysis we can observe that attacks like XSS, SQL Injection, and Bot have not performed as well as the other attacks.
2) Phase2:
For the second phase of experiments, the combined dataset is used for classification, and all files except file0 are taken into account. As the other files also contain benign records, which leads to a large number of normal records in the combined dataset, including the records of file0 would make the data even more imbalanced. The dataset is therefore built from seven files and contains 2,300,825 records overall. A model built on it can classify all 14 attacks in the dataset. As evident from Table VI, records containing benign traffic constitute 75.76% of all instances in the concatenated dataset, while attacks other than DoS/DDoS and PortScan have low prevalence. The combined dataset suffers from class imbalance, with some labels such as Heartbleed and XSS having very few records, which often results in a low detection rate for these labels [36]. We ran an experiment to build a model that classifies all 15 classes (14 attacks) in the dataset. The results are shown in Table VII, where it can easily be observed that attacks with few testing records, owing to their low prevalence, are not classified properly. Attacks with sufficient training instances performed satisfactorily, but among the minority labels some attacks have zero correctly classified instances while others have low detection accuracy. To address the class imbalance, relabeling is done by merging minority class labels into one class label, which proves to be a good measure for improving model performance. The merging is not random but strategic, combining similar categories of attacks. For example, SQL injection, XSS, and web attack-brute force are all types of web attacks, so they are merged and given a new label (web attacks). Full details of the new attack labels, along with their percentage of occurrence, are shown in Table VI. After relabeling, the dataset contains 7 classes including 6 attack labels, and a model based on 1D-CNN is trained and then evaluated.
The results are shown in Table VIII and Table IX, with the former displaying the confusion matrix over all labels and the latter giving the detailed metrics for all class labels. The confusion matrix in Table VIII clearly shows the number of correct classifications and the number of misclassifications in which a particular class label is predicted as another label. The same can be seen in Table IX, as a high number of true positives was achieved for all class labels except the Bot and Web attacks labels. The overall performance of the model is better, as more than 99.6% has been achieved in PPV, TPR, and F1_sc. Bot and Web attacks are the two labels with a poor detection rate, resulting in low values of TPR and F1_sc.
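The relabeling step described for phase 2 amounts to a label mapping applied before training. The sketch below covers only the web attack merge that the text gives as an example; the exact label strings are illustrative placeholders, and the remaining merges listed in Table VI are not reproduced.

```python
from collections import Counter

# Hypothetical label spellings; only the web attack merge from the text is shown.
MERGE_MAP = {
    "SQL Injection": "Web attacks",
    "XSS": "Web attacks",
    "Web attack-Brute Force": "Web attacks",
}


def relabel(labels, merge_map=MERGE_MAP):
    """Replace each minority label with its merged label; keep the rest as-is."""
    return [merge_map.get(label, label) for label in labels]


original = ["BENIGN", "XSS", "SQL Injection", "DDoS", "Web attack-Brute Force", "BENIGN"]
merged = relabel(original)
counts = Counter(merged)  # class frequencies after merging
```

Because labels absent from the mapping pass through unchanged, further merges can be added to the dictionary without touching the function.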

3) Experiment with deep neural network:
To compare and further validate our proposed model, a DNN based on an artificial neural network has also been used. The experimental setup is identical to that of the 1D-CNN, i.e., the same preprocessing steps and evaluation metrics. The DNN comprises a) an input layer with 78 nodes; b) 3 hidden layers with 60, 50, and 20 nodes, respectively; and c) an output layer with 8 nodes (as in phase2). Dropout with a value of 0.1 is also used between hidden layers to prevent overfitting. The results are depicted in Tables X and XI.
• Analyzing the results for Bot in phase2 (Table VIII), one can see similarities between Bot and Benign traffic: all FN and FP for the Bot label belong to the Benign label, which indicates that Bot is not classified as any other attack by the model and no other attack is classified as Bot. This signifies a resemblance between the two, as the distinction between bot and normal behavior is blurred.
• As for web attacks, their comparatively lower performance could be attributed to the small number of training instances in the dataset, as they account for less than 0.1% of the total instances. Alternatively, these attacks may not have a specific pattern and could be better detected using payload content.
• Our proposed 1D-CNN model has also outperformed the model built using DNN (Table XI).

V. CONCLUSION
In this paper, we proposed a novel way of identifying attacks in the dataset using 1D-CNN as a classification approach. The proposed 1D-CNN model performed better, with the fewest misclassifications. Experiments were conducted with models trained and evaluated on individual files of the dataset as well as on a combined dataset, which was further relabeled to handle class imbalance. Satisfactory performance was recorded in both cases for the majority of labels, with more than 99% achieved on each of the evaluation indicators used. Some attacks with low prevalence, such as bot and web attacks, have a comparatively lower detection rate. Experiments using DNN have also been carried out for comparison and further validation of the proposed model.
As for future work, other DL algorithms need to be explored for training the model, and a study of hyper-parameter optimization should be carried out to find the optimal model configuration. Moreover, other datasets with the latest attack types and real-world traffic should be investigated for the detection of cyber-attacks. More records of bots and web related attacks also need to be added, as more data is needed for training and for improving their detection accuracy.