A Deep Transfer Learning Approach to Enhance Network Intrusion Detection Capabilities for Cyber Security

—Cyberattacks are on the rise, making technology companies increasingly prone to data theft. Recent research has focused on constructing cognitive models for traffic anomaly detection in communication networks. Many of these experiments rely on data packets recorded by tools such as Wireshark, which yield high-dimensional datasets of benign and malicious packets. Recent work has mostly focused on developing machine learning and deep learning systems to detect attack data packets in a network. Moreover, current machine learning algorithms are trained to detect only known threats; with the growth of novel cyberattacks and zero-day attacks, they are unable to detect unknown attacks. This research focuses on detecting rare attacks using transfer learning from a dataset of known attacks. Deep learning outperforms explicit statistical modelling approaches by at least 21% on the dataset used. A preliminary survey of candidate deep learning architectures was performed before testing for transferability, and a Convolutional Neural Network (CNN) architecture that is 99.65% accurate in classifying attack data packets is proposed. The proposed CNN was trained on a known attack and then tested on unknown attacks to assess transferability. For the model to extract sufficient information for transfer, the training samples must carry enough information; however, only 20% of the dataset represents attack data. Several strategies, such as synthetic dataset-based training and bootstrapped dataset training, have been developed to overcome small training sets. A subset of training attacks is determined to optimise learning potential, and this study identifies training-testing attack pairings with good learning transferability. The most robust and stable relationships are found in DoS attack training-testing pairings. This study also presents hypotheses for model generalisation.
The dataset features and attack characteristics were analysed using the Recursive Feature Elimination (RFE) algorithm to validate the results.


I. INTRODUCTION
In transfer learning, a neural network uses knowledge learned from previous training to improve generalization on another, similar task. The primary goal of this study is to leverage learning transferability to select attacks from the training dataset that cause the model to generalize to other attacks, which can be extended to rare attacks.
In computer networks that employ machine learning algorithms for network security, it is difficult to provide guarantees about the kind of attacks that the network can expect to see and is susceptible to, especially with the rise in the number of novel attacks. Current machine learning algorithms are usually trained to detect a set of known attacks or learn from attacks performed in the past. Therefore, it is hard to predict the range of attacks against which the employed machine learning algorithm is robust. In the literature, there is proof of concept for both classical ML and DL approaches for anomaly detection [1].
Transferability studies can help us understand the range of attacks against which the system is actually robust, based on the attacks the algorithm has been trained to detect and how those attacks enable the model to generalize to other attacks. This is especially useful when the trained model generalizes to unknown attacks without being trained on them explicitly, because it widens the range of attacks the network may be secure against. Studying these attack correlations indicates which attacks the model can detect and with what accuracy on unknown attacks. This helps us evaluate what the model can predict, whether the observed correlations are consistent, and what kind of protection the algorithm provides at what computational cost. Transferability studies could show that the model scales to other attacks it has not been trained on explicitly, assuring detection for an even broader range of attacks than the ones it has seen. For groups of similar or correlated attacks, transfer learning enables us to find representative attacks from each group. This lets us train the model with a smaller number of attacks yet achieve the same level of security as if it had been trained with all the attacks in the group.
The growth of machine learning applications has also seen deep learning models deployed on hardware chips. The advantage of transferability is that we can identify smaller training sets while still making the model generalize well to a broader range of attacks. This decreases training times, making training computationally efficient with lower memory requirements. This is important especially when the algorithm is deployed on a resource-constrained device while performing on par with a model trained on the full training set.
Increasing research in this domain has led to a growing number of network traffic datasets. Some commonly used datasets are CAIDA 2007, DARPA 98, KDD 99, CSE-CIC-IDS 2018, and UNSW-NB15, to name a few. This work uses the CICIDS 2017 traffic dataset for training the proposed model. Traffic monitoring software like Wireshark has been used to log and monitor network traffic packets when creating these datasets. While these datasets provide a complete description of network packets, they have a high-dimensional feature space. In a preliminary study, a model has been developed to extract meaningful information from high-dimensional training data with a limited number of attack data packets. After identifying a suitable network, the work focuses on attack correlations: why training the model with one attack scales to another, and whether these correlations are symmetric or asymmetric. The work's contributions are described in three parts: first, developing a DNN architecture; second, testing the proposed architecture for transferability; and third, proposing hypotheses for attack correlations.
Developing a DNN architecture: A comparative study has been carried out between a Hidden Markov Model, as a statistical modelling-based approach, and a candidate Convolutional LSTM Deep Neural Network (CLDNN) architecture; it shows that the deep learning approach scales better for the dataset used. Further deep learning models have then been explored for the classification task. The candidate DNN architectures include a Convolutional LSTM Deep Neural Network (CLDNN), a Convolutional Neural Network with BGRU recurrent layers, a One Class Neural Network (OCNN), and a Convolutional Neural Network (CNN). The performance of these models has been evaluated when trained and tested on all attacks in the dataset, and their learning transferability has been evaluated as well. All the above architectures achieved overall accuracies above 90%; however, this dataset is highly unbalanced, with the majority of the dataset consisting of benign data packets. The models detect benign packets with high accuracy, thereby inflating their overall accuracy. Therefore, a major criterion for architecture selection is not the overall accuracy but the percentage of attack data packets classified correctly. Attack transferability has been studied with several of these architectures; however, none of the candidate models other than the proposed architecture showed very strong correlations between training-testing attack pairs. The proposed architecture is the model that shows the maximum attack data classification accuracy for the multi-class classification problem. It is able to perform deep feature extraction using flow-based features not only for adequately sampled attacks but also for severely underrepresented ones.
Testing the proposed architecture for transferability: Attack transferability has been tested with the proposed model architecture by training the model with one kind of attack from the training dataset and testing it on another attack. The entire dataset consists of 20% of attack data that are further divided into fourteen categories of attacks. Therefore, the representation of each attack class is low. For learning transferability, the model requires a larger number of attack data packets. Two techniques have been used to increase the number of attack data packets to address this problem: 1) Using SMOTE generated data and 2) Using a bootstrapped dataset. A Synthetic Minority Oversampling Technique (SMOTE) was used for the first method to generate more attack data packets in each attack class synthetically. For the second method, the original attack data is resampled, shuffled and added to the training dataset to match the number of benign data packets.
Furthermore, the possibility is explored that the model may exhibit higher testing accuracies for a particular attack if it is trained with a subset of other attacks in the dataset. An attack boosting algorithm is employed to select a subset of attacks from the training dataset that maximizes the classification accuracy when tested on a particular attack. Hypotheses for correlated attacks: The study of attack correlations reveals training-testing attack pairs that exhibit strong correlations with each other. A hypothesis has been proposed for these observed attack correlations, and the attack characteristics and important features have been studied to validate it. The Recursive Feature Elimination (RFE) algorithm is used to identify the dominant features for learning, which not only reduces the dimensionality of the training data but also reduces training times. This study makes inferences about the properties and number of features selected for a pair of correlated attacks. The result is an identification of usable attacks in this dataset for transfer learning and an explanation of why they scale to certain attacks.

II. BACKGROUND AND RELATED WORK
The dataset used is the Canadian Institute for Cybersecurity Intrusion Detection Systems dataset, CICIDS 2017. The dataset consists of 8 separate CSV files containing data corresponding to attacks simulated in 8 sessions. The entire dataset covers 14 types of attacks and benign traffic, described by 78 flow features; 80% of the entries are benign data and 20% are attack data. The 14 simulated attack types are shown below in Table I. In a Botnet attack, an attacker gains control of several machines and servers connected to the internet and uses these servers to carry out cyberattacks against a target. Because multiple attacking machines are synchronized, the attack data volumes are huge, enabling a strong attack against the target machine. Botnets are often used to carry out DDoS attacks. Denial of Service (DoS) attacks aim to make a server unresponsive to the requests of legitimate users. They usually work by sending enormous volumes of traffic, flooding the server and exhausting its resource pool by requesting resources or sending obfuscated data that make the server unavailable to real requests.
Distributed Denial of Service (DDoS) attacks are DoS attacks conducted by many servers at once against a victim server. Because these servers have different IP addresses, it is difficult to track down a single IP address for attacker identification, which makes the attack hard to mitigate. In flooding attacks, the volume of data generated to flood the victim server is large, making the DoS attack strong and harder to terminate [2]. DoS GoldenEye: GoldenEye is an attack tool that identifies vulnerabilities in a target server. It exploits the victim server's capacity to form multiple HTTP connections, using up all available connections on that server; the attack can also be operated in a distributed fashion. DoS GoldenEye uses Keep-Alive headers and Cache-Control options to keep connections alive, preventing the target server from shutting them down and making the server unresponsive to non-attack requests.
DoS HULK is the HTTP Unbearable Load King attack, a flooding attack that floods target servers with a large volume of HTTP requests, requesting data or resources or sending malformed HTTP packets. This floods the target machine with HTTP data packets; the load is unbearable for the server, making it unreachable.
DoS Slowhttptest and DoS Slowloris are attack tools for carrying out Slow HTTP attacks. Slow HTTP attacks form multiple connections with the target server and try to keep those connections open. Some slow HTTP attacks keep a connection open by declaring a large amount of data to be sent and then sending the data at very large intervals, almost equal to the timeout period. As a result, the server cannot close the connection, and the connection does not time out. This keeps all available connections alive and keeps the server from responding to legitimate requests. Slow HTTP attacks can be carried out as Slow Header attacks, where the packet header arrives at large intervals, or Slow Body attacks, where the body of the data arrives very slowly.
Brute Force attacks aim to obtain authentication keys by trying all possible combinations of the key. File Transfer Protocol (FTP) is a network protocol that can be used to transfer files; users connect to the server through an FTP client using username and password authentication. Brute Force FTP attacks target this username and password. Brute Force SSH attacks obtain valid login credentials for SSH access to a server by trying all possible combinations.
Heartbleed could be classified as a protocol-based attack: it exploits a packet header field of the transport layer security protocol to cause a machine to dump out its memory, including confidential and protected data. It does so without leaving traces on the target machine, making it hard to detect. In an infiltration attack, the attacker gains access to a protected network or system and finds vulnerabilities in the machines or devices connected to the network. After identifying vulnerabilities, the attacker attacks the machine or device to steal private information.
In a Port Scan attack, the attacker scans the ports of the target server, sending requests to a range of ports. The attacker finds an active port on the server and exploits a known vulnerability of the service running on it to attack the target server.
Web Attacks identify weaknesses in web applications to gain access to private or protected data. These weaknesses could be exploited using data entry or using injection attacks where malicious scripts are injected into data entries of otherwise harmless websites or by using brute force. For example, for a web application that uses an underlying SQL database and has an option for user inputs, malicious inputs could be given that cause the database to dump out confidential information. DDoS attacks are also a kind of web attack.
It can be observed from Table I that the CICIDS 2017 dataset is highly unbalanced, with 80% of the dataset consisting of benign data and 20% consisting of attack data packets. The attack data is further divided unequally into fourteen types of attacks. Because of the unequal division, certain attack classes have a very sparse representation, for example, just 11 or 36 data packets in the entire dataset. This makes it difficult to generate realistic synthetic data to augment the number of data packets and limits model learning even when using a bootstrapped dataset for these classes of attacks. Therefore, when the proposed architecture is tested for attack transferability, those classes of attacks that cannot be used for transferability of learning are eliminated.
In the literature, different machine learning techniques deal with intrusion detection using flow characteristics. Alkasassbeh et al. explore MLP, Naive-Bayes, and Random Forest classification algorithms for Distributed Denial of Service attack detection and show that MLP achieves the highest accuracy [3]. Lopez et al. present a study of machine learning techniques for traffic anomaly detection, showing that a Random Forest-based decision classifier is the best model for anomaly detection and that a Dense Neural Network is a good classifier for some types of DDoS attacks, with methods to boost the number of attack samples of underrepresented attack types [1]. Vinayakumar et al. carry out a comprehensive study of DNNs and machine learning classifiers that learn abstract, high-dimensional data representations using the KDDCup99 dataset and test their models on other datasets, such as NSL-KDD, UNSW-NB15, Kyoto, WSN-DS, and CICIDS 2017. They also propose a hybrid DNN framework that can be deployed in real time to monitor network traffic and events to detect possible network attacks [4]. Sharafaldin et al. generate a dataset consisting of benign data and seven types of attack data and evaluate the performance of machine learning algorithms to identify the best subset of features for certain types of attacks [5]. Ferrag et al. analyze RNNs, DNNs, restricted Boltzmann machines, Deep Belief Networks, CNNs, Deep Boltzmann machines and deep autoencoders for traffic data classification using the CSE-CIC-IDS2018 and Bot-IoT datasets [6].
To detect previously unseen attacks, researchers have increasingly adopted TL-based IDSs. Zhao et al. [7] proposed an algorithm using the NSL-KDD dataset for binary classification that mapped the source and target domains into the latent space K. In their method, the R2L attack was detected by learning from DoS data in the source domain. The same authors improved their TL approach in subsequent work by adding clustering [8], finding that 500 data points are enough to train on the NSL-KDD dataset effectively. Taghiyarrenani et al. [9] showed that transfer learning can work even when the source and target data differ. They were able to distinguish normal traffic from abnormal traffic, and their results were better when the source domain had many labels and the target domain had few.
Wu et al. [10] also used a CNN to transfer knowledge learned from one dataset to another, with UNSW-NB15 as the source data and NSL-KDD as the target data. As the datasets differ, their method uses two CNNs; a fully connected layer then makes the final classification for the NSL-KDD data. Singla et al. [11] found that transfer learning can help when there is not enough training data to learn new attacks. Their approach is similar to that of [10]: they used binary classification to detect a specific attack in the UNSW-NB15 dataset, using the remaining attacks from the same dataset as the source domain. Compared with a standard DNN, their transfer learning-based approach performed better when training data was scarce.
In 2020, Masum et al. [12] applied computer vision algorithms to intrusion detection. As image processing needs two-dimensional data, they converted the intrusion dataset into two-dimensional form and then used VGG-16 [13] to detect intrusions. They achieved about 95% accuracy, which may not be sufficient for real-world use, since other algorithms have exceeded 95% accuracy in the past; a likely reason is that the features produced by VGG-16 are better suited to image processing than to intrusion detection. Dhillon et al. [14] have shown how TL can be used in real time. They built a CNN-LSTM algorithm and then transferred their model to a new domain, using data unavailable at training time to decide whether traffic was an attack. Their work assumes that the feature space and label space are the same in both domains, and TL allowed them to speed up their research.
Vinayakumar, Alazab et al. [15] evaluated the performance of CNNs for intrusion detection, investigating CNNs both directly and in combination with RNN, LSTM, and GRU layers. The CNN and the CNN-recurrent combinations were found to perform nearly identically. Although they obtained good results, the computational power required by their approach was excessive.

A. Hidden Markov Models: A Statistical Modeling based Approach
An HMM is a Markov Model in which the process being modelled is a Markov Process. Markov models are used to model processes that change stochastically. This study assumes that the phenomenon being modelled is a first-order Markov Process, where the next state depends only on the current state. This is called the Markovian Property. If q_t represents the state of the model at time instant t, then the conditional probability of its next state is

P(q_{t+1} = S_j | q_t = S_i, q_{t-1} = S_k, ...) = P(q_{t+1} = S_j | q_t = S_i)

where S_i is the current state, S_j the next state, and S_k a past state.
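As a small illustration of the Markovian property, the state distribution of a chain can be propagated one step at a time; propagating two single steps agrees with applying the two-step transition matrix, so history beyond the current state is irrelevant. The 2-state transition matrix below is illustrative, not a parameter from the paper's experiments.

```python
import numpy as np

# Illustrative first-order Markov chain: A[i, j] = P(q_{t+1} = S_j | q_t = S_i).
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

pi0 = np.array([1.0, 0.0])   # start in state S_0 with certainty

# Under the Markovian property, the distribution one step ahead depends
# only on the current distribution: pi_{t+1} = pi_t @ A.
pi1 = pi0 @ A
pi2 = pi1 @ A

# Equivalently, pi_2 = pi_0 @ A^2: no dependence on states before q_t.
assert np.allclose(pi2, pi0 @ np.linalg.matrix_power(A, 2))
```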
In the HMM problem, the observations are the features corresponding to a data sample. Forty flow features from the dataset are taken for each data sample and fed to each HMM as 40 observations. Each CSV file in the dataset is treated as a separate dataset. The performance of HMMs has been evaluated on 2-class and 4-class classification problems, using 2 HMMs and 4 HMMs respectively for training. T observations at a time are taken from the testing dataset, and each model is tested with these T observations. The model that generates the highest probability P(O|λ) indicates that the set of observations O belongs to that HMM and hence to that class.
P(O|λ) is computed by summing over hidden state sequences. Let Q = q_1 q_2 ... q_T be a sequence of hidden states; then

P(O|λ) = Σ_Q P(O|Q, λ) P(Q|λ) = Σ_{q_1,...,q_T} π_{q_1} b_{q_1}(O_1) ∏_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(O_t)

where the sum runs over all possible hidden state sequences, π is the initial state distribution, a_{ij} are the transition probabilities and b_j(O_t) are the emission probabilities. For a general system with N states and T observations, computing P(O|λ) this way has complexity of order O(2T · N^T). The forward-backward algorithm computes the same quantity with complexity O(N^2 T).
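The two ways of computing P(O|λ) can be sketched on a toy HMM: brute-force enumeration over all hidden state sequences (the O(2T · N^T) approach) and the forward recursion (the O(N^2 T) approach). All parameter values below are illustrative, not taken from the paper's experiments.

```python
import numpy as np
from itertools import product

# Toy HMM parameters (illustrative only).
A  = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition probabilities a[i, j]
B  = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission probabilities b[state, symbol]
pi = np.array([0.5, 0.5])                 # initial state distribution
O  = [0, 1, 0]                            # observation sequence
N, T = A.shape[0], len(O)

# Brute force: sum P(O, Q | lambda) over every hidden path Q (N^T paths).
p_brute = 0.0
for Q in product(range(N), repeat=T):
    p = pi[Q[0]] * B[Q[0], O[0]]
    for t in range(1, T):
        p *= A[Q[t - 1], Q[t]] * B[Q[t], O[t]]
    p_brute += p

# Forward algorithm: alpha[i] = P(O_1..O_t, q_t = S_i | lambda), updated in O(N^2) per step.
alpha = pi * B[:, O[0]]
for t in range(1, T):
    alpha = (alpha @ A) * B[:, O[t]]
p_forward = alpha.sum()

assert np.isclose(p_brute, p_forward)
```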

B. Neural Network Architectures
Deep Neural Network architectures have been tested to identify the best architecture for studying attack correlations. The first two architectures tested are a CLDNN and a CNN with BGRU layers. These networks belong to a class of neural networks called CRNNs, i.e. CNNs with recurrent layers. Convolutional layers have been used because they capture spatial features from the dataset, while gated recurrent layers capture temporal patterns in the sequence. Bidirectional GRUs capture temporal patterns not only forward in time but also backward in time. In addition, the performance of a One Class Neural Network (OCNN) has been evaluated, as it is well suited to minority data detection in large datasets. Finally, the performance of a Convolutional Neural Network without recurrent layers has been evaluated.

1) CLDNN:
The Convolutional LSTM Deep Neural Network architecture has 6 hidden layers. The first layer is a convolutional layer with 256 feature maps, a (1, 3) convolution kernel and 20% dropout. The second hidden layer is a convolutional layer with 256 feature maps and a (2, 3) convolution kernel. The third layer is a convolutional layer with 80 feature maps and a (1, 3) kernel with 20% dropout. The fourth hidden layer is a convolutional layer with 80 feature maps and a (1, 3) kernel. The fifth hidden layer is an LSTM layer with 50 cells. The sixth hidden layer is a fully-connected layer with 128 neurons. The CLDNN architecture is shown in Table II. All hidden layers use the Rectified Linear Unit (ReLU) activation, given by f(x) = max(0, x).

This architecture was used to evaluate performance on i) the multi-class classification problem, where the model is trained and tested on all classes of data, and ii) the two-class classification problem, where the model is trained with one class of data and tested on all other classes. For the multi-class classification problem, the output layer has 15 output classes; for the two-class classification problem, it has two. In both implementations, the output layer has softmax activation, given by

σ(z)_j = e^{z_j} / Σ_{k=1}^{J} e^{z_k}

where J is the total number of classes. All 78 features are used for training. For the two-class classification problem, the model is trained on one kind of attack and its performance is evaluated when tested on all the other attacks; in this case, all benign data is labelled zero and all attack data is labelled one. For the multi-class classification problem, the model is trained with all 15 classes of data. The total dataset is divided into a fifty-fifty split: 50% of the data is used for training and 50% for testing. The loss function used is Categorical Cross-Entropy, and the optimizer is Adam. The batch size is 1024.
This model is trained for 50 epochs. The Categorical Cross-Entropy loss function is given by

L = -(1/N) Σ_{i=1}^{N} log p_i

where p_i is the predicted probability that the i-th data point belongs to its true class t_i and N is the batch size.
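The softmax activation and categorical cross-entropy loss above can be sketched in a few lines of numpy; the logits below are illustrative values, not outputs of the paper's model.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability; rows sum to 1.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(true_labels, probs):
    # Mean negative log-probability assigned to each sample's true class.
    n = len(true_labels)
    return -np.mean(np.log(probs[np.arange(n), true_labels]))

# Two samples, three classes (illustrative logits).
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.5, 0.3]])
p = softmax(logits)
loss = categorical_cross_entropy(np.array([0, 1]), p)
```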
2) CNN with BGRU layers and CNN with stacked BGRU layers: Gated Recurrent Units differ from traditional RNNs without gates in that they control the amount of previous information carried over to the next iteration using hidden states. GRUs usually have a reset gate and an update gate, as opposed to the forget, input and output gates of LSTMs; this makes training models with GRUs faster than with LSTMs. BGRUs have connections that go backwards in time, enabling them to capture temporal patterns both backwards and forwards in time. The CNN with Bidirectional Gated Recurrent Units and the CNN with Stacked Bidirectional Gated Recurrent Units have the same architecture as the CLDNN, except that BGRU and stacked-BGRU layers replace the LSTM layer. Each BGRU layer has 50 cells; in the stacked-BGRU approach, there are two consecutive BGRU layers with 50 cells each. Each model is trained for 150 epochs with a batch size of 1024. The loss function used is mean squared error, and the optimizer used is RMSProp. Table III and Table IV show the architectures of these two DNNs.
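A single GRU step can be sketched in numpy to make the two-gate structure concrete. The weight shapes and random values below are illustrative, not the trained BGRU layers from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h)              # update gate: how much old state to keep
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate: how much old state to expose
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate new state
    return (1 - z) * h + z * h_tilde          # blend old and candidate state

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W = [rng.normal(size=(d_h, d_in)) for _ in range(3)]  # input weights for z, r, candidate
U = [rng.normal(size=(d_h, d_h)) for _ in range(3)]   # recurrent weights for z, r, candidate

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):          # run a short input sequence
    h = gru_step(x, h, W[0], U[0], W[1], U[1], W[2], U[2])
```

A bidirectional layer would run a second copy of this cell over the reversed sequence and concatenate the two hidden states at each step.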
3) One Class Neural Networks: One Class Neural Networks (OCNN) are used for anomaly detection in complex datasets that require highly non-linear decision boundaries. The loss function of OCNNs is derived from that of OC-SVMs. In an OC-SVM, for input data X with N samples, ϕ(·) is a mapping from the input space to a feature space. All input data points are labelled one, and the only negative point is the origin; a hyperplane separates the origin from the mapped points ϕ(X_n). The hyperplane in the feature space is given by f(X_n) = w^T ϕ(X_n) − r, where r is the hyperplane bias and ν is the parameter controlling the trade-off between maximizing the distance from the origin and the number of data points falsely classified as positive. An OCNN is a feedforward neural network with one hidden layer and one output node. Here, w is the weight vector from the hidden layer to the output and V is the weight matrix from the input to the hidden layer; the hidden layer has a linear or sigmoidal activation g(·). The optimization problem for OCNNs is

min_{w,V,r} (1/2)||w||^2 + (1/2)||V||^2 + (1/(νN)) Σ_{n=1}^{N} max(0, r − w^T g(V X_n)) − r

and (w, V) are updated using normal backpropagation. The model is trained using important features extracted by an autoencoder instead of the raw data; the OCNN algorithm follows the implementation of Chalapathy et al. [16]. The performance metrics are the True Positive Rate (TPR), False Positive Rate (FPR) and ROC curve. The architecture is shown in Table V. The model is trained on 70% of the data and tested on 30%, with 10% of the training data used as validation data, for 100 epochs.

C. Attack Correlations
Base Architecture Trained on Original Dataset: This section studies the transferability of learning using the proposed base architecture, training the CNN architecture with one attack and testing on all the other attacks. A two-class classification problem has been used to implement this: all attack data is labelled as one class and all benign data as the other. The model is trained on benign data and the attack data corresponding to the attack whose transferability is to be studied. As seen in Table I, attacks like Attack 8 (HeartBleed), Attack 9 (Infiltration) and Attack 13 (Web Attack: SQL Injection) are severely underrepresented in the dataset. Synthetic and bootstrapped datasets are discussed to address this problem; in this section, the models are trained with only the original dataset. In the confusion matrices in the results section, Attacks 8, 9 and 13 are not included, as there are not enough original samples in the dataset for transferability of learning. The model is trained for 20 epochs for each attack, the loss function used is Binary CrossEntropy, and the optimizer used is Adam.
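The train-on-one-attack, test-on-another protocol can be sketched with stand-ins: a minimal numpy logistic regression replaces the proposed CNN, and synthetic Gaussian blobs replace CICIDS 2017 flows, so the numbers produced say nothing about the paper's actual results. The point is only the shape of the experiment: fit on benign data plus one attack class, then measure the detection rate on a different, unseen attack.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_class(center, n=200, d=5):
    # Synthetic stand-in for flow-feature vectors of one traffic class.
    return rng.normal(loc=center, scale=1.0, size=(n, d))

benign   = make_class(0.0)
attack_a = make_class(3.0)   # training attack
attack_b = make_class(2.5)   # unseen testing attack, correlated with attack_a

def train_logreg(X, y, lr=0.1, epochs=200):
    # Plain gradient-descent logistic regression (stand-in for the CNN).
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

# Train on benign (label 0) + one attack (label 1) only.
X = np.vstack([benign, attack_a])
y = np.r_[np.zeros(len(benign)), np.ones(len(attack_a))]
w, b = train_logreg(X, y)

# Transferability: detection rate on the attack the model never saw.
detection_rate = ((attack_b @ w + b) > 0).mean()
```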
Base Architecture Trained on Synthetic and Bootstrapped Dataset: The CICIDS 2017 dataset is unbalanced, with 80% of the data samples being benign data packets and 20% of the data samples belonging to attack data. The total attack data is further divided into fourteen classes of attacks; thus, each class is severely underrepresented in the dataset, which is not ideal for learning. The main goal of this section is to test the model for strong and consistent attack correlations when it is trained with a greater number of samples from the minority attack classes.

Synthetic Minority Oversampling Technique (SMOTE):
More attack data has been generated per class using the Synthetic Minority Oversampling Technique (SMOTE). SMOTE selects examples that are close to each other in the feature space. The technique picks one point from the minority attack class and uses the K nearest neighbours algorithm to find the K nearest data points around the chosen point; K is chosen based on how much the minority classes must be oversampled, and 5 nearest neighbours have been used in this implementation. It then randomly selects one of the K neighbours and creates a synthetic data point on the line joining the two selected points. Using this technique, specific regions in the feature space are identified that can generate more samples belonging to a certain class of data. Enough attack packets are generated this way to match the number of benign data packets in the dataset, and the process is repeated for each attack. The model trained with this dataset is tested on: 1) SMOTE-generated benign data and real attack data, to evaluate accuracy when tested on a synthetically generated benign dataset, and 2) real benign data and real attack data, as this scenario best emulates a real-life scenario.
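A minimal SMOTE sketch (not the library implementation one would use in practice) follows the steps above: pick a minority point, find its k = 5 nearest minority neighbours, pick one of them at random, and interpolate along the segment joining the two points.

```python
import numpy as np

def smote(X_minority, n_synthetic, k=5, rng=None):
    """Generate n_synthetic points by interpolating between minority samples."""
    rng = rng or np.random.default_rng(0)
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))
        x = X_minority[i]
        # Distances to all minority points; skip index 0 (the point itself).
        d = np.linalg.norm(X_minority - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()                     # position along the segment
        out.append(x + gap * (X_minority[j] - x))
    return np.array(out)

rng = np.random.default_rng(2)
attack = rng.normal(size=(30, 4))              # illustrative minority-class samples
synthetic = smote(attack, n_synthetic=100, rng=rng)
```

Because each synthetic point is a convex combination of two real points, the generated samples stay inside the region occupied by the original minority class.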
Bootstrapped Dataset: Another technique used to create a balanced training dataset is, for every attack the model is trained with, to resample the attack data with replacement until it matches the number of benign data packets. This training dataset is the bootstrapped dataset, and the process is used to replicate the attack data corresponding to every attack. The base architecture is trained with this enhanced dataset and tested on real attack data; during training, it is validated on 20% of the attack data. For training with both the SMOTE dataset and the bootstrapped dataset, the model is trained for 20 epochs per training attack, the loss function used is Binary CrossEntropy, and the optimizer is Adam.
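The bootstrapping step can be sketched as resampling the attack rows with replacement until they match the benign count, then shuffling the combined set; the data below is synthetic and only illustrates the mechanics.

```python
import numpy as np

def bootstrap_balance(benign, attack, rng=None):
    """Resample attack rows with replacement to the benign count, then shuffle."""
    rng = rng or np.random.default_rng(0)
    idx = rng.integers(len(attack), size=len(benign))   # sample with replacement
    boosted = attack[idx]
    X = np.vstack([benign, boosted])
    y = np.r_[np.zeros(len(benign)), np.ones(len(boosted))]
    perm = rng.permutation(len(y))                      # shuffle the combined set
    return X[perm], y[perm]

benign = np.random.default_rng(3).normal(size=(1000, 4))
attack = np.random.default_rng(4).normal(loc=3.0, size=(40, 4))  # sparse attack class
X, y = bootstrap_balance(benign, attack)
```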
Attack Boosting Algorithm: The previous sections explore transferability when the model is trained with a single attack and tested on other attacks. This section aims to identify a subset of training attacks that can be used to train the model so that it scales to other attacks. The possibility of the model scaling to a given test attack has been explored when it is trained with a combination of training attacks (that do not include the testing attack), with the aim of observing consistent and strong off-diagonal correlations and thereby identifying useful training attacks. To do this, the attack boosting algorithm outlined in Algorithm 1 has been used. 50% of the set is used only as training data for selecting the training attacks, and model performance is evaluated on the remaining 50%, which serves as validation data. The testing data consists of the attack for which a subset of training attacks is to be selected, along with benign data. The loss function is binary cross-entropy, and the optimizer is Adam.
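Algorithm 1 itself is not reproduced in this section; the greedy forward-selection loop below is one plausible sketch of an attack-boosting procedure consistent with the description above. The function, its arguments, and the `evaluate` callback are hypothetical.

```python
import numpy as np

def greedy_attack_boosting(train_sets, evaluate, max_attacks=3):
    """Hedged sketch of an attack-boosting loop (not the paper's Algorithm 1).

    train_sets : dict mapping attack name -> training data for that attack
    evaluate   : callable(list_of_attack_names) -> validation score
    Greedily adds the attack whose inclusion most improves the validation
    score; stops when no remaining attack improves it.
    """
    selected, best = [], -np.inf
    remaining = set(train_sets)
    while remaining and len(selected) < max_attacks:
        scores = {a: evaluate(selected + [a]) for a in remaining}
        cand = max(scores, key=scores.get)
        if scores[cand] <= best:
            break                      # no candidate improves performance
        selected.append(cand)
        best = scores[cand]
        remaining.discard(cand)
    return selected, best
```

In the paper's setting, `evaluate` would train the base architecture on the chosen attacks and return its accuracy on the held-out 50% validation split.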

D. Hypothesis for Attack Correlations
In this section, we propose hypotheses for the correlations observed in Fig. 5. In doing so, we use the Recursive Feature Elimination (RFE) algorithm [17] to identify the important and dominant features for model learning. The hypotheses are based on the number of important features selected by the algorithm as well as the properties of the attacks and the features selected.
Recursive Feature Elimination is a feature selection algorithm used to identify the important features in a given feature set. RFE wraps a core ML algorithm that scores feature importance; this core model does not have to be the same as the classifier used for the classification task at hand and can be, for example, a Decision Tree or a Random Forest. RFE provides a subset of the entire feature set that is important for learning, which is especially valuable for datasets with many features, as it can reduce training times significantly. RFE works by recursively training the core model and eliminating the least important features at each step until the desired number of features remains. RFE also ranks the features in order of importance, so the top n features can be selected.
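As a concrete illustration, RFE with a Decision Tree core (the setup described later in this section) can be run with scikit-learn. The data below is a synthetic stand-in, not the CICIDS 2017 dataset.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 200 samples, 10 features; only the first two
# features carry the class signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# RFE recursively fits the tree and drops the least important
# feature(s) until n_features_to_select remain.
selector = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=4)
selector.fit(X, y)
selected = np.flatnonzero(selector.support_)  # indices of kept features
ranking = selector.ranking_                   # rank 1 = selected feature
```

`support_` gives the boolean mask of retained features, and `ranking_` orders the eliminated ones, which is how the top-n features mentioned above are read off.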
The CICIDS 2017 dataset has 78 features, a relatively large number. The main goal is to identify a smaller subset of features that is important for the learning and scalability of the proposed model and to speed up training. Two applications of the RFE algorithm have been made. For the first, a single attack is selected and the RFE algorithm identifies the features the classifier uses to separate it from benign data; this process is repeated for every attack in the dataset to identify the features common between attacks and to validate the proposed hypotheses. The second application takes two correlated attacks (as observed in Fig. 5) and uses the RFE algorithm to identify the number of features the classifier uses to differentiate both attacks from benign data. The Decision Tree Classifier is used as the core model for RFE, and the training data for both applications contain the attack and benign data.

A. Comparison between HMM Performance and CLDNN Performance
To observe whether the statistical-modelling-based approach scales better than the deep learning approach, the performance of the HMM is compared with that of a candidate CLDNN model (Table VI). The CLDNN model achieves an overall accuracy of 99.69% for 2 classes of data and 99.57% for 4 classes, whereas the HMM model is only 78.7% accurate for 2 classes and 50.8% accurate for 4 classes. This could be because not all the attacks in the dataset are multi-stage attacks that correspond to the hidden states of the HMM, and the performance of the HMM deteriorates as the number of classes increases. From Table VI, it is concluded that the deep learning approach shows more promise for the multi-class classification problem. Therefore, more deep learning architectures are explored in the following sections to identify the best architecture for studying attack transferability patterns. The next section gives the architecture details of the candidate CLDNN model used.

B. Result of the Neural Network Architectures
1) Result of the CLDNN Architecture on the Multi-class Classification Problem: An attack accuracy metric is used to evaluate the performance of a model in addition to the overall accuracy metric [18]. This metric is needed because the number of benign packets is much larger than the number of attack packets, so gross misclassification of attack data does not diminish the overall accuracy considerably. A more informative measure is the percentage of correctly classified attack data. Therefore, attack accuracy is defined as the ratio of the number of correctly classified attack packets in a class to the total number of attack packets in that class; more generally, this is the true positive rate. Fig. 1 shows the confusion matrix for this CLDNN architecture. The labels on the Y-axis indicate the true labels of the testing data, and the labels on the X-axis indicate the predicted labels. A well-performing model shows high accuracy along the diagonal of the confusion matrix, indicating that it correctly classifies the attacks it is trained on. The proposed CLDNN architecture achieves an overall accuracy of 98.53%, and for most classes it classifies the attacks correctly. However, although the model is trained on data from Attacks 8 (HeartBleed), 9 (Infiltration), and 13 (Web Attacks: SQL Injection), it is not able to classify them with good attack accuracy. This is mainly because Attacks 8, 9 and 13 are significantly underrepresented in the dataset. It could also be that there are few dominant features that influence the learning of the model, or that the important features do not have a large variance in their distribution, which makes it hard for the model to learn patterns that differentiate the data classes.
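The attack accuracy defined above is simply the per-class true positive rate and can be read directly off the confusion matrix. The helper below is an illustrative sketch, not code from the paper.

```python
import numpy as np

def attack_accuracy(cm):
    """Per-class attack accuracy: correctly classified packets of a
    class divided by all packets of that class (row-wise TPR).

    cm[i, j] = count of samples with true label i predicted as label j.
    """
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)
```

For example, a class with 36 packets of which 29 are classified correctly has an attack accuracy of 29/36 ≈ 80.6%, regardless of how many benign packets dominate the overall accuracy.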
2) Result of the CLDNN Architecture on the Two-class Classification Problem: Fig. 2 is the confusion matrix for the two-class classification problem. Here, the labels on the Y-axis indicate the attack with which the model has been trained, and the labels on the X-axis represent the attack class the model is tested on. A model trained on one attack is tested on all other attacks for transferability, so more off-diagonal correlations are preferred for a model that exhibits attack transferability. High classification accuracies on the diagonal represent the model's performance when trained and tested on the same attack; to observe transferability in learning, however, high accuracies should appear when the model is tested on an attack class it has not been trained on. Training this CLDNN model with one attack does not correlate with any other attack: most of the correlations observed are on the diagonal of the confusion matrix, meaning the model only performs well when tested on the training attack. The model exhibits limited transferability; for example, when trained with Attack 12, it can identify attacks in class 14 with 91.87% accuracy, but it exhibits no other strong off-diagonal correlations. Therefore, it is concluded from these results that transferability of learning cannot be studied using this architecture.

3) Results on CNN with BGRU Layers and CNN with Stacked BGRU Layers: These results present a comparative study of the performance of the CLDNN, the CNN with a BGRU layer and the CNN with stacked BGRU layers for select attacks on the two-class classification problem. The performance of a CNN with a BGRU layer of 10 cells has also been evaluated, with the attack accuracy computed for each test attack; all entries are attack accuracies in percentages. Table 4.4 shows that training the model with Attack 4 (DoS: HULK) has the potential to scale to Attack 2 (DDoS) and Attack 3 (DoS: GoldenEye), but this model only detects those attacks with 31%-47% accuracy. This comparative study concludes that the architectures explored so far do not exhibit very strong attack transferability. The next section explores another deep learning architecture, a One-Class Neural Network, and evaluates its performance for anomaly detection.

C. Results of One-Class Neural Network
The implementation follows Chalapathy et al. [16], [19]. The raw data is passed through an autoencoder to extract important features and then through a one-layer neural network that optimizes the objective given by equation 1. For the CICIDS dataset, the model AUROCs were computed for different seed values.
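For reference, the OC-NN objective of Chalapathy et al. (the text's equation 1) has the following standard form, with output weights $w$, hidden-layer weights $V$, hidden activation $g$, margin $r$ and slack parameter $\nu$ (reproduced from the cited work; notation may differ slightly from the paper's):

```latex
\min_{w,\, V,\, r}\;
  \frac{1}{2}\lVert w \rVert_2^2
  + \frac{1}{2}\lVert V \rVert_F^2
  + \frac{1}{\nu}\cdot\frac{1}{N}\sum_{n=1}^{N}
      \max\!\bigl(0,\; r - \langle w,\, g(V x_n) \rangle\bigr)
  - r
```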
From Table XII, it can be observed that the model performs as well as random guessing in 30% of the runs, marginally better than random guessing in 60% of the runs, and worse than random guessing in 10% of the runs; on average, it achieves an AUROC of 0.55. It is therefore concluded from the AUROC values that this model cannot detect anomalies with high accuracy and is not suitable for the two-class classification task at hand. The performance of a Convolutional Neural Network is evaluated in the next section.
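The "as good as random guessing" baseline corresponds to AUROC ≈ 0.5. A quick sanity check with scikit-learn (synthetic labels and scores, not the OC-NN outputs) makes this concrete:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)   # synthetic binary labels

# Uninformative scores: AUROC concentrates around 0.5.
auc_random = roc_auc_score(y_true, rng.random(1000))

# Informative scores: label plus Gaussian noise pushes AUROC toward 1.
auc_informative = roc_auc_score(y_true, y_true + 0.3 * rng.normal(size=1000))
```

An average AUROC of 0.55 therefore sits barely above the uninformative baseline, which is why the OC-NN is rejected for this task.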

D. Result of CNN Architecture on Multi-class Classification Problem
The overall accuracy of the CNN is 99.65%, whereas the CLDNN achieved 98.53%. In the confusion matrix, the diagonal elements represent the percentage of data in each class that is correctly classified. The model classifies Attack 8 (HeartBleed) and Attack 9 (Infiltration), which have only 11 and 36 data packets in the dataset, with 80% and 81.82% accuracy respectively. Thus, this model can learn the important features from a high-dimensional feature set. It is concluded that this architecture performs best and is the most suitable for studying attack transferability. Because it has no recurrent layers, it also trains faster while increasing classification accuracy by 1.12%. This is our proposed base architecture.
E. Results of Attack Correlations
1) Results of Base Architecture Trained on Original Dataset: Fig. 4 represents the overall accuracy of the base architecture, and Fig. 5 is the confusion matrix when the base architecture is tested on the two-class classification task. The diagonal elements are placeholders, as they indicate the performance of the model when tested on the training attack. From Fig. 3, it has been observed that the model performs with high accuracy when trained and tested on the same attack. The observed correlations are shown in Table XIII.
2) Results of Base Architecture Trained on Synthetic and Bootstrapped Dataset: Fig. 6, Fig. 7 and Fig. 8 are the confusion matrices of the base architecture when trained on the SMOTE and bootstrapped datasets and tested on SMOTE and real attack data. Compared with Fig. 5, most of the attack pairs that showed a correlation when the model was trained with the original training data also show correlations when the model is trained with synthetic and bootstrapped data. As in the previous section, the diagonal elements in the confusion matrices are placeholders, since they represent accuracies when training and testing on the same attack, which, as already observed in Fig. 3, scale well with accuracies above 90%.

3) Results of Attack Boosting Algorithm: As observed from Table XIV, this approach does not outperform the model trained with a single attack. From Table XIV, it can also be observed that the attacks selected for Attack 1 cannot correctly classify Attack 1, and the attack selected for testing Attack 2 also achieves a low positive rate. From Fig. 5, training the model with attack data corresponding to only Attack 3 or Attack 4 yields true positive accuracies of 80.53% and 93.87% when tested on Attack 2. The attacks selected for Attack 3 cannot detect attack data packets; however, training the model with Attack 5 or Attack 6 yields true positive accuracies of 85.01% and 90.84% when tested on Attack 3 (as seen in Fig. 5).
Comparing the positive rates from Tables XIV and XV with Fig. 5, it is concluded that training with one kind of attack scales better than training with a subset of attacks. Therefore, this algorithm has not been employed to study attack correlations further.
This study of attack correlations concludes that the correlations observed in Fig. 5 are the only strong and consistent correlations that exist. The model exhibits the same correlations even when trained with a larger number of samples from the minority attack classes using the synthetic and bootstrapped datasets: Fig. 6, Fig. 7 and Fig. 8 indicate the same correlations as Fig. 5 when the model is trained with the augmented datasets. The next section discusses these observed correlations in light of the proposed hypotheses. Table XVI shows the number of features selected for select attack pairs. For attack pairs like (3,2), (3,6) and (3,4) that are correlated, as seen in Fig. 5, Fig. 6, Fig. 7 and Fig. 8, the algorithm selects only 4, 6 and 6 features out of the 78 features in the dataset. For an uncorrelated pair like (3,7), the algorithm selects 24 features, considerably more than for correlated pairs; similarly, for (3,1), which is also uncorrelated, 21 features are selected. This forms the basis for the hypotheses. Some of the features selected for correlated attacks are listed in Table XVII, and some of these features have been used to strengthen the hypotheses.

F. Results of Hypothesis for Attack Correlations
Hypothesis 3: Fig. 5 shows that training the model with DoS: GoldenEye (Attack 3) causes it to scale for DDoS (Attack 2) and DoS: HULK (Attack 4); however, this observed correlation does not apply the other way around. Similarly, training the model with DoS: Slowloris (Attack 6) causes it to scale for DoS: HULK (Attack 4) but not vice versa. This hypothesis focuses on these asymmetric comparisons. To explain them, it is hypothesized that correlated attacks have few dominant features selected by the algorithm, which may explain the correlations between the attacks, whereas uncorrelated attacks have many features selected by the RFE algorithm, indicating that there are few dominant features important for learning, which could explain the low correlations. This is supported by the results in Table XVI: attack pairs (3,2), (3,4) and (3,6), each of which is correlated, have 6, 6 and 9 features selected by the algorithm, whereas for uncorrelated pairs like (3,7) and (3,1) the RFE algorithm selects 24 and 21 features, considerably more.
In addition to these hypotheses, some features from the selected feature list in Table XVII provide validation. For example, besides DoS: GoldenEye and DDoS having only a few features in their reduced feature set, the features selected (as indicated in Table XVII) are Subflow Fwd Packets, Subflow Fwd Bytes and Subflow Bwd Packets. Subflow Fwd Packets is the average number of packets in a subflow in the forward direction, Subflow Fwd Bytes is the average number of bytes in a subflow in the forward direction, and Subflow Bwd Packets is the average number of packets in a subflow in the backward direction. Subflows in the forward and backward directions are usually associated with distributed systems, which points to these two being distributed attacks. The DoS: GoldenEye attack tool can also be used as a distributed attack by specifying the number of workers while simulating the attack. This provides further explanation for the generalization.
Similarly, to study the correlation between DoS: GoldenEye, DoS: Slowhttptest, DoS: Slowloris and DoS: HULK, the RFE algorithm chooses the features Inter-Arrival Time, Bytes Sent in the Forward Direction, and Bytes Sent in the Backward Direction in the Initial Window, which is coherent with the hypotheses for symmetric comparisons.

V. CONCLUSION
The deep learning approach generalizes well for this dataset in comparison to explicit statistical-modelling-based methods. After evaluating the performance of several candidate DNN architectures, namely CLDNN, CNN with BGRU, OCNN and CNN, a CNN architecture with the highest accuracy on the multi-class classification problem has been proposed (Fig. 3). This architecture has then been used to study attack transferability, with its performance shown in Fig. 5. Training attacks for which the model generalizes well to certain testing attacks have been identified. The problem of underrepresented attacks in the dataset has been addressed using synthetic and bootstrapped dataset-based training methods: Fig. 6, Fig. 7 and Fig. 8 show the same attack correlations as Fig. 5 for the model trained on the SMOTE and bootstrapped datasets. This confirms the identification of training-testing attack pairs that exhibit attack transferability.
Furthermore, this study proposes hypotheses for the observed correlations, and the Recursive Feature Elimination algorithm has been used to strengthen them. The relationship between the number of features chosen and the training-testing attack pairs can be observed in Table XVI, and Table XVII lists the features selected by the algorithm for selected attacks, coherent with the proposed hypotheses.
This work validates the hypotheses with features from the RFE algorithm. Future work could validate these hypotheses using transferability in learning on different datasets, for example DARPA 2009 IDD, KDDCup99, NSL-KDD, UNSW-NB15 and WSN-DS, and could also analyze trends in specific or selected features when studying attack correlations. It is interesting that Fig. 5 shows correlations only among attacks that are different types of DoS attacks; no correlations were observed for other kinds of attacks, for example Attack 1 (Botnets), Attack 7 (Brute Force: FTP-Patator), Attack 8 (Heartbleed), Attack 9 (Infiltration) and Attack 13 (Web Attack: SQL Injection). Future research can focus on identifying attributes that could help the model scale for these attacks and on testing transferability of learning for them. Future work could also address attack detection when more than one attack is performed simultaneously.
ACKNOWLEDGMENT
This work was supported by the Research Center of Computer Science and Engineering, PES Institute of Technology & Management, Shivamogga. The research centre is recognized by Visvesvaraya Technological University (VTU). The authors are grateful to all of those with whom they have had the pleasure to work during this research. Each member of the Research Committee provided extensive personal and professional guidance and taught the authors a great deal about scientific research and life in general.