Anomaly Detection using Network Metadata

Abstract: The proliferation of network functions has increased the importance of classifying network traffic to defend against cyber-attacks. An efficient classifier requires a model trained automatically on a large amount of representative data. Automatic categorization therefore relies on training techniques that assign classes to data objects based on the labeled activities supplied during learning. Predefined classes allow new items to be recognized. However, the analysis and categorization of data activity in intrusion detection systems are vulnerable to a wide range of threats. New methods of analysis must therefore be developed to establish an appropriate approach for monitoring circulating traffic. The major goal of this research is to develop and verify a heterogeneous traffic classifier that can classify collected network metadata. In this study, a new model based on machine learning techniques is proposed to increase prediction accuracy. Prior to the analysis stage, the gathered traffic is subjected to preprocessing. This paper aims to provide the mathematical validation of a novel machine learning classifier for heterogeneous traffic and anomaly detection.


I. INTRODUCTION
Network forensics commonly refers to sniffing, recording, acquiring, and analyzing network traffic and event logs to investigate a network security incident. It enables the investigator to study network traffic and records to identify and locate the attacking system. The number of computers, smartphones, tablets, and other network-connected devices continues to grow. As the frequency of attacks against networked systems increases, so does the importance of network forensics. Most previous studies confront two fundamental issues in extracting external and internal data, which make traffic flow prediction difficult. First, currently available solutions do not fully exploit the distinct roles of short-term nearby and long-term periodic temporal patterns. Second, for the extrinsic task, current work has primarily used hand-crafted fusion algorithms to incorporate external inputs, and these still face challenges with generalization [1].
The examination of a traffic incident is divided into two stages. The first stage, the appearance check, determines the Bloom filter (time period) that contains an excerpt. The second stage, termed "flow determination," finds the flows that carried the excerpt by combining the excerpt blocks with the flows found through the Bloom filter. HBF [2] addressed key difficulties in this process, such as ensuring that blocks are aligned and consecutively placed.
Cybercrime is a constant danger to computer networks. No security mechanism can guarantee complete safety. Even the most advanced network security measures are unable to identify and prevent all assaults, particularly those that are new and unknown. In certain circumstances, preventing cybercrime is impossible. Suppose that confidential information about a company is leaked over its network. How can security specialists track down cybercriminals? Let us consider the following scenario: an organization's internal network has been infected by a worm, and the organization's Intrusion Detection and Prevention System (IDS/IPS) was unable to identify and block the worm's dissemination. How can you track down the person who spreads the virus or the afflicted systems? As a result, in addition to preventative security systems, tools and methodologies for investigating cybercrime after it has occurred are required. This is the function of network forensics and the tools that it provides [2].
Recording and storing raw network traffic is the most basic method of network research. Traffic recording makes it feasible to examine any networking event that occurs. It is possible to scan through the recorded traffic for the leaked information or the worm's signature to determine where it originated and where it ended up. "Attribution" is the term for this operation. The most difficult challenge with this system is the exceedingly costly storing of large amounts of data [3]. In addition, the invasion of privacy is a concern with traffic recording. By monitoring network traffic, it is possible to gain access to the personal information of users. As a result of the increasing difficulty in providing both privacy and network forensics, new Internet designs and protocols have been proposed [4]. However, implementing such modifications would be prohibitively expensive, making them impractical in practice.
In the field of traffic categorization, three groups of methodologies exist: port-based, payload-based, and machine learning-based methods [5]. Port-based identification is straightforward and depends on mapping applications to well-known port numbers. Unfortunately, port-based classification has become inaccurate because many applications now use dynamic port numbers. Payload-based approaches require analyzing the payload of each packet. Privacy regulations and encryption, however, may prevent traffic payloads from being accessed. In addition, deep packet inspection (DPI) is expensive in terms of both computation and signature maintenance [6].

II. MOTIVATION
Machine learning-based solutions have the potential to overcome some of the restrictions associated with port- and payload-based systems. More precisely, machine learning approaches can classify Internet traffic based on application-neutral traffic statistics, such as the variables that describe how long it takes to send and receive messages. Furthermore, they have the potential to reduce computing costs while also making it easier to identify encrypted traffic.
There are two main applications for network forensics. The first, which focuses on network security, is keeping an eye out for unusual traffic patterns and spotting breaches. On a hacked system, an attacker may be able to delete all log files. Consequently, network-based evidence may be the sole evidence accessible for forensic investigation in this situation. Law enforcement can also take advantage of network forensics by interpreting human communication represented through emails or other forms of electronic correspondence and reassembling transmitted information, looking for keywords, and so on [7].
Today's world is evolving at a rapid pace, and the internet is critical for faster communication between people and machines, faster transactions, and faster completion of tasks. However, the internet is also a major target of cybercrime, and online transactions are the main draw for attackers. To identify the perpetrators of cybercrime and their methods of attack, we need a forensic approach based on network metadata. Network forensics is a sub-field of digital forensics research that deals with computer networks. Collecting network traces from the victim system for examination is common practice in network forensics, whether the crime is discovered while in progress or after it has been committed. The evidence gathered can be used to bring the perpetrator to justice in a criminal court of law. While digital forensics involves the examination of static data, network forensics involves the examination of volatile and dynamic data [8].

III. RELATED WORK
The study [9] establishes a network intrusion criminal system based on the switching scheme (NIFSTC) that can detect criminal activity in networked environments and identify digital evidence automatically. The advantage of NIFSTC is that it does not require a standard forensic network to be built, so in practice it achieves better detection performance than traditional approaches. On the KDD Cup 1999 dataset, NIFSTC achieves the highest true positive (TP) rate and the lowest false positive (FP) rate among the most recent network forensic methodologies.
The authors of [10] introduced SPIE (Source Path Isolation Engine), which computes packet digests (i.e., hashes) from the header and the first eight bytes of the payload. These digests are retained in a Bloom filter for a brief period of time. If a third-party device, such as an IDS or a firewall, identifies suspicious activity, SPIE can be used to trace the source of a packet.
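The digest-and-Bloom-filter idea can be illustrated with a minimal sketch. It is not the SPIE implementation: the filter size, the number of hash functions, the use of SHA-256, and the helper name `digest_input` are all assumptions made for illustration, and the whole header is hashed here for simplicity.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative, not tuned for real traffic."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with an index byte.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def digest_input(header: bytes, payload: bytes) -> bytes:
    """Hypothetical helper: header plus the first eight payload bytes."""
    return header + payload[:8]

# Hypothetical packet bytes, for illustration only.
packet = digest_input(b"\x45\x00\x00\x3c", b"GET /index.html HTTP/1.1")
window = BloomFilter()
seen_before = packet in window   # empty filter: nothing has been recorded yet
window.add(packet)
seen_after = packet in window    # recorded packets are always reported as seen
```

Because only a few bits per packet are stored rather than the packet itself, a filter of this kind trades a small false-positive rate for a large reduction in storage.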
In the research [11], the focus is on the security risks of the botnet through which DDoS attacks, worms and spam attacks are implemented. For network security forensic investigation, the researchers recommended the design and implementation of a cloud-based security center. Also, cloud storage is used to store the acquired traffic data, which is then processed utilizing cloud computing.
A tool that explores the network forensic architecture, called the NetFo (Network Forensic) analysis tool, is proposed in [12]. It captures packets using WinPcap and can also be used as a monitoring and management tool. NetFo can discover session information, keywords, bookmarks, hostnames, IP addresses, and other information.
As explained in [13], due to many requirements that were not addressed in this design space, developing a forensic network architecture is a complex task.
The authors [14] present a real-life case study in which they reconstruct a crime scene in relation to a victim's previous Facebook session using digital evidence collected and analysed via access to a desktop computer's RAM, with a focus on some distinct chains that could be used to reconstruct a previous Facebook session.
Huaxin et al. [15] developed a framework for extracting four types of characteristics from real-world Wi-Fi data and applied supervised machine learning approaches to estimate user demographics. The study was based on Wi-Fi traffic information from 28,158 users over a five-month period. According to the test results, the best accuracy in predicting gender and education level is 82% and 78%, respectively. Even when only encrypted HTTPS traffic is used, users' demographics can be predicted with a precision of 69% and 76%.
Being forensically prepared increases the degree of security in both cloud and on-premises computing. As a result, research in the fields of cloud and network security may also apply to IoT-centered forensic investigations. After all, both traditional computer networks and Internet of Things (IoT) networks are vulnerable to security flaws. Because IoT systems interact with the physical environment more frequently than traditional systems, they are susceptible to a greater number of physical and digital dangers. For this reason, the work introduced in [16] was dedicated to securing the IoT domain.
The authors in [17] provided an overview of forensic advancements related to the IoT as well as the remaining hurdles. They focused on the taxonomy and criteria of IoT forensics. However, they did not discuss historical and current frameworks or the standardization and certification difficulties in IoT forensics.

IV. THE PROPOSED MODEL
ML models have become widespread in recent years because they can capture a variety of complex relationships and reach good solutions through training. They can discover nonlinear relationships and complex functions among independent and dependent variables by processing and classifying data. An ML technique comprises algorithms that build such models from data. In this paper, five ML classifiers are used and compared in terms of accuracy. Fig. 1 depicts the overall layout of the proposed machine learning framework for network anomaly detection and the phases that the model goes through. In the first phase, the dataset is analyzed and split into training and testing sets; the attribute vectors are split in a 70:30 ratio. Next, in the pre-processing phase, the dataset is cleaned, features containing categorical data are normalized, and records containing incorrect data are removed. Then, in the feature selection phase, the features are ranked by weight and the most important features for identifying attacks are chosen. Next, in the tuning phase, the parameters of the chosen classifier are tuned and optimized using a grid search. Finally, the optimized classifier is trained and tested on the corresponding sets and is then used to predict new traffic records.
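The tuning phase can be illustrated with a minimal grid search sketch. The toy threshold "classifier", the parameter names `threshold` and `invert`, and the validation values are hypothetical and stand in for a real classifier and its hyperparameters; the paper does not list the grids it searched.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical validation set: (feature value, label), where 1 means attack.
validation = [(0.1, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.7, 1), (0.9, 1)]

def validation_accuracy(params):
    """Toy classifier: predict attack when the feature exceeds the threshold."""
    correct = 0
    for value, label in validation:
        predicted = int(value > params["threshold"])
        if params["invert"]:
            predicted = 1 - predicted
        correct += int(predicted == label)
    return correct / len(validation)

grid = {"threshold": [0.3, 0.5, 0.8], "invert": [False, True]}
best_params, best_score = grid_search(grid, validation_accuracy)
```

In this toy case the search selects `threshold=0.5` with `invert=False`, which separates the six validation points perfectly. In practice the score would usually come from cross-validation on the training set rather than from a single validation list.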

A. Dataset
Data are the most valuable asset for developing an efficient intrusion detection system. CICIDS2017 [18] is a recent intrusion detection evaluation dataset created by the Canadian Institute for Cybersecurity at the University of New Brunswick. It was constructed using a network traffic flow analyzer and was captured over a duration of 5 days, yielding 83 features and 15 classes [19]. One of these classes represents normal network traffic (labeled Benign), while the other 14 represent anomalous traffic (labeled Attacks). The names and numbers of these classes are shown in Table I. Compared to older and traditional datasets, such as KDD-99 [20], DARPA 98/99 [21] and ISCX2012 [22], the CICIDS2017 dataset has the following advantages:
• It covers current attack trends.
• It represents real-world data.
• It includes attacks based on many protocols, such as HTTP, HTTPS, FTP, SSH, and email protocols.
For these reasons, the CICIDS2017 dataset is selected.

B. Data Pre-Processing
As explained in the previous section, the CICIDS2017 dataset contains 3119345 flow records with 83 features and 15 class labels (one for normal traffic and 14 for attacks). Before the dataset can be used for training, it must be cleaned and normalized.
Like most datasets, CICIDS2017 contains some undesirable elements that must be removed. Because the network traffic was collected using the CICFlowMeter tool, some flag features have a constant value (0 or 1), such as "Bwd URG Flags" and "Bwd PSH Flags". These features were removed because they have no impact on model results and removing them decreases the memory footprint of the dataset. The next step in the preprocessing phase is removing records that have a missing class label, missing information, or invalid values such as "NaN" or "Inf". After examining these records, 288602 of the 3119345 records were removed, leaving 2830743 records.
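The two cleaning steps can be sketched as follows, assuming each record is a dictionary of numeric feature values plus a label. The function names, the `"Label"` key, and the sample rows are hypothetical; the authors' actual tooling is not described.

```python
import math

def drop_constant_features(rows, label_key="Label"):
    """Remove features whose value is identical in every row."""
    names = [key for key in rows[0] if key != label_key]
    constant = [name for name in names if len({row[name] for row in rows}) == 1]
    cleaned = [{k: v for k, v in row.items() if k not in constant} for row in rows]
    return cleaned, constant

def drop_invalid_records(rows, label_key="Label"):
    """Remove rows with a missing label or a missing, NaN, or infinite feature value."""
    kept = []
    for row in rows:
        if not row.get(label_key):
            continue
        values = [v for k, v in row.items() if k != label_key]
        if any(v is None or not math.isfinite(v) for v in values):
            continue
        kept.append(row)
    return kept

# Hypothetical sample rows, for illustration only.
rows = [
    {"Flow Duration": 10.0, "Bwd URG Flags": 0.0, "Label": "BENIGN"},
    {"Flow Duration": float("nan"), "Bwd URG Flags": 0.0, "Label": "DDoS"},
    {"Flow Duration": 25.0, "Bwd URG Flags": 0.0, "Label": ""},
    {"Flow Duration": float("inf"), "Bwd URG Flags": 0.0, "Label": "DDoS"},
    {"Flow Duration": 40.0, "Bwd URG Flags": 0.0, "Label": "DDoS"},
]
without_constants, removed_features = drop_constant_features(rows)
clean_rows = drop_invalid_records(without_constants)
```

Here the constant flag column is dropped first and the three invalid rows afterwards, which mirrors the order described in the text.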
If the dataset used to train a classifier or detector suffers from high class imbalance, the classifier becomes biased towards the majority class. As a result, the classifier shows lower accuracy and a higher false alarm rate. Unfortunately, the CICIDS2017 dataset is prone to high class imbalance, as shown in Table I. To mitigate this problem, the normal traffic records were down-sampled. In addition, to improve the prevalence ratio and reduce class imbalance, a few minority classes were merged, such as the Web Attack classes. The resulting dataset was partitioned into 70% for training (1571510 records) and 30% for testing (471453 records).
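A minimal sketch of the three balancing steps (merging minority classes, down-sampling the majority class, and a 70:30 split) is given below. The class names and counts are toy values, not the real CICIDS2017 distribution, and splitting per class (stratification) is an assumption; the text does not state how the split was drawn.

```python
import random
from collections import Counter, defaultdict

def merge_classes(labels, mapping):
    """Relabel classes according to mapping; unlisted labels are unchanged."""
    return [mapping.get(label, label) for label in labels]

def downsample_indices(labels, majority, keep, seed=0):
    """Keep every minority row and a random sample of `keep` majority rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(labels) if label == majority]
    kept_majority = set(rng.sample(majority_idx, min(keep, len(majority_idx))))
    return [i for i, label in enumerate(labels)
            if label != majority or i in kept_majority]

def stratified_split(labels, test_ratio=0.3, seed=0):
    """Split indices so each class sends about test_ratio of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical toy labels.
labels = (["BENIGN"] * 100 + ["DDoS"] * 20
          + ["Web Attack XSS"] * 10 + ["Web Attack Brute Force"] * 10)
labels = merge_classes(labels, {"Web Attack XSS": "Web Attack",
                                "Web Attack Brute Force": "Web Attack"})
kept = downsample_indices(labels, majority="BENIGN", keep=40)
balanced = [labels[i] for i in kept]
class_counts = Counter(balanced)
train_idx, test_idx = stratified_split(balanced, test_ratio=0.3)
```

With these toy counts the balanced set has 80 records, of which 56 go to training and 24 to testing.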

C. Feature Selection
The goal of feature selection is to identify and remove unneeded, irrelevant, and redundant features from the dataset. This helps reduce the complexity of the predictive model without compromising its accuracy, and it identifies the most important features for detecting attacks in the dataset. First, using a correlation test, some features are removed from the dataset to reduce its size and improve performance. The correlation matrix is a powerful tool for summarizing a large dataset and for identifying and visualizing patterns in the data. Each cell in the matrix contains the correlation coefficient between two features. A signed correlation coefficient ranges from -1 to 1, so the scale from 0 to 1 used here corresponds to its absolute value. A value approaching 1 indicates a strong linear relationship, meaning that the two features carry largely the same information and one of them is redundant. A value approaching 0 indicates that the two features are not linearly related, so neither can be replaced by the other.
By analyzing the correlation matrix, we found a strong correlation between the following feature pairs: (Fwd IAT Std, Bwd IAT Mean), (Bwd Header Length, Fwd Header Length) and (Bwd Header Length, Subflow Fwd Packets). Therefore, the redundant features in these pairs were removed.
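The correlation-based pruning can be sketched as follows, assuming the Pearson coefficient and a cut-off of 0.9. Both are assumptions, since the text names neither the coefficient nor the threshold, and the feature values are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.9):
    """Greedily keep a feature only if |r| with every kept feature is <= threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[other])) <= threshold for other in kept):
            kept.append(name)
    return kept

# Hypothetical feature values.
columns = {
    "Fwd Header Length": [20, 40, 60, 80, 100],
    "Bwd Header Length": [22, 41, 59, 82, 98],
    "Flow Duration": [5, 1, 4, 2, 3],
}
kept_features = drop_correlated(columns)
r_fwd_duration = pearson(columns["Fwd Header Length"], columns["Flow Duration"])
```

In this example "Bwd Header Length" is dropped because it is almost perfectly correlated with "Fwd Header Length", while "Flow Duration" (r = -0.3 with the forward header length) is kept. Which member of a correlated pair survives depends on column order in this greedy variant.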
After removing correlated features, a large number of features still remain. Feature selection methods are therefore needed to determine the importance of each feature for detecting anomalous traffic. Several feature selection methods exist in the literature, such as Fisher score, t-score, chi-squared tests, random forest, and regression. Using these five methods, each feature is given a weight that indicates how useful it is. The weights are then compared and sorted. Fig. 3 shows the 10 most important features, which are used for training and testing in the proposed model.
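As an illustration of one of these weighting methods, the sketch below computes a Fisher score per feature, taken here as the between-class scatter of the feature divided by its within-class scatter, and sorts the features by it. The feature values and labels are hypothetical, and the way the paper combines the five rankings is not specified, so only a single method is shown.

```python
from collections import defaultdict

def fisher_score(values, labels):
    """Between-class scatter divided by within-class scatter for one feature."""
    overall_mean = sum(values) / len(values)
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    between = within = 0.0
    for group in groups.values():
        group_mean = sum(group) / len(group)
        between += len(group) * (group_mean - overall_mean) ** 2
        within += sum((value - group_mean) ** 2 for value in group)
    return between / within if within > 0 else float("inf")

# Hypothetical data: 0 = benign, 1 = attack.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "Flow Duration": [1, 2, 3, 7, 8, 9],  # separates the two classes well
    "Fwd IAT Std": [1, 5, 9, 2, 5, 8],    # same mean in both classes
}
scores = {name: fisher_score(values, labels) for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

A higher score means the class means are far apart relative to the spread within each class, so the feature is more useful for separating attack traffic from benign traffic.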

V. RESULT AND DISCUSSION
In this section, the results will be presented and discussed based on the proposed machine learning techniques.

A. Data Classification Methods
Machine learning is a subfield of artificial intelligence, and automatic classification [23], [24] is one of its central topics. To handle classification problems, automatic learning employs a variety of methods that group comparable data items into homogeneous classes. Supervised learning is adopted to train the decision rule and build a classifier. ML can be used to create a predictive model that detects unknown attacks in network traffic. However, one important problem in ML is identifying and selecting the most relevant features from which to build a model, based on training data, for a particular classification task [24], [25], [26].
Classification is a natural choice for predicting discrete, known outcomes. A classification technique is a set of exact rules for categorizing objects based on the quantitative and qualitative factors that characterize them. Data categorization is performed for a variety of goals, the most prevalent of which is to assist with data security challenges, particularly anomaly detection [27], [28], [29].
In this work, we adopted five classifiers to categorize the network traffic: Random Forest, Logistic Regression, Decision Tree, SVM, and k-nearest neighbors. The findings were then compared using performance metrics and classification reports. During optimization of each classifier, the training and testing process is repeated, with the classifier's parameters adjusted until the intended behavior is achieved.

B. Performance Measures
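To evaluate the performance of the suggested classification methods for anomaly detection, we adopted the following measures: accuracy, recall, precision, and F1 Score. The confusion matrix is used to summarize the predictive performance of each classifier on the test data. In the following, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
1) Accuracy is the ratio of correctly classified records to all classified records.
$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
2) Recall is the ability of the proposed model to detect the attacks. It is calculated as the ratio of the number of detected attacks to the number of actual attacks.
$\mathrm{Recall} = \frac{TP}{TP + FN}$
3) Precision is the ratio of correctly predicted positives to all predicted positives.
$\mathrm{Precision} = \frac{TP}{TP + FP}$
4) F1 Score is the harmonic mean of the recall and precision values.
$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$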
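The four measures can be computed directly from confusion-matrix counts, as in the following sketch. The counts are invented for illustration and are not results from this study; the function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision, and F1 from binary confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

# Hypothetical counts: 100 actual attacks, 100 benign flows.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```

For these counts, accuracy is 0.85, recall is 0.80, precision is 80/90 (about 0.889), and F1 is 160/190 (about 0.842). For the multi-class setting, the same counts can be taken per class in a one-vs-rest fashion and then averaged.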

C. Experimental Results
Table II and Fig. 5 show the results of applying five different machine learning techniques for classifying different types of attacks. Fig. 5 shows the confusion matrix for all algorithms. Table II shows the accuracy, precision, recall, F1 Score for each algorithm.
From Table II, we can see that the best algorithms are Random Forest and Decision Tree, because they achieve high accuracy and precision rates. The worst algorithm is k-nearest neighbors, because it has lower accuracy and precision rates than the others.

VI. CONCLUSION AND FUTURE WORK
An intrusion detection system is an important protection tool for detecting complex network attacks. In this work, we have developed a new model for network intrusion (anomaly) detection based on machine learning algorithms. The proposed model consists of six phases: dataset analysis, pre-processing, feature selection, parameter tuning, training, and testing. Using the proposed model, five machine learning algorithms have been investigated for network anomaly detection: k-nearest neighbors, logistic regression, SVM, decision tree, and random forest. The performance of these ML algorithms has been evaluated on the basis of their accuracy, recall, precision, and F1 score. The CICIDS2017 dataset, which consists of seven different types of attacks, has been used for training and testing. According to the results, the random forest algorithm performs better than the other ML algorithms: it achieved the highest accuracy and precision rates, 98.63% and 99.80%, respectively. Compared to related work, the performance of the proposed model is better, for three reasons: (1) the dataset was carefully cleaned by removing noise and outlier data and by addressing class imbalance; (2) the proposed feature selection technique removed correlated and irrelevant features from the dataset; and (3) the parameters of the chosen classifier were tuned and optimized using grid search. As future work, we will investigate other machine learning and deep learning algorithms for network anomaly detection.