Detecting C&C Server in the APT Attack based on Network Traffic using Machine Learning

An APT (Advanced Persistent Threat) attack is a dangerous form of attack with clear intentions and targets. APT attackers use a variety of sophisticated methods and technologies to obtain confidential and sensitive information from their targets. Detecting APT attacks still faces many challenges because each attack is designed specifically for its target, which makes detection based on experience or predefined rules difficult. Many different methods have been researched and applied to detect early signs of APT attacks in an organization. One method currently attracting great interest is analyzing connections to detect the control server (C&C Server) in an APT attack campaign. This method has great practical significance: if the connection from the malware to the control server is detected early, the attack campaign can be stopped quickly. In this paper, we propose a method to detect the C&C Server based on network traffic analysis using machine learning.

Keywords: Advanced Persistent Threat (APT); abnormal behavior; network traffic; machine learning; APT detection; Control Server (C&C Server)


I. INTRODUCTION
The publications [1,2] present the characteristics, procedures, and life cycle of APT attacks. These characteristics show that an APT attack has specific, clear targets and goals. Any organization, individual, business, or governmental agency can become a victim of this attack. In the past, APT attack groups often operated for personal purposes. However, most APT attack groups are now financially supported by governments or financial institutions in pursuit of political motives. Due to this change, APT attack groups are increasingly equipped not only with modern tools for attacking, hiding, and covering tracks, but also with teams of elite hackers. In [3], the authors present some characteristics of the attack scenario that make APT attacks much harder to detect than other threats, such as advanced attack tools, a lack of public data, and the use of standard encryption protocols. These points show both the danger of APT attacks and the difficulty of detecting them. In fact, many APT attacks have taken place over the years, stealing large amounts of data without the victims' knowledge.
However, although APT attacks are advanced and sophisticated and use completely new attack techniques, they can all be divided into four main stages [3,4,5]: spying (collecting information), attacking and escalating privileges, stealing information, and covering tracks. All four stages are equally important, and they support each other throughout the attack campaign.
In the attacking and escalating privileges stage and the stealing information stage, all actions are carried out through commands sent from the C&C Server to the malware installed on the target machine. Therefore, if the system can detect abnormalities in these connections, it can quickly and accurately detect signs of an APT attack. To detect connections to the C&C Server, existing studies often focus on monitoring anomalies in network traffic or rely on a previously built list of C&C Servers. However, APT malware often bypasses these traditional approaches easily. Therefore, to improve the detection of abnormal connections to a C&C Server, we propose in this paper a method to analyze abnormal behavior in network traffic based on machine learning techniques. First, the network traffic data is analyzed and behaviors are extracted for each domain name or IP address; then these behaviors are assembled into a feature vector. Finally, we use a machine learning algorithm to classify the vectors in order to detect abnormal connections to the C&C Server. The scientific contributions of our paper are the proposal of several abnormal features of the C&C Server based on network traffic and the use of the Random Forest algorithm to detect abnormal connections.
The paper is organized as follows. Section II reviews some recent works in the literature on C&C Server detection. The proposed C&C Server detection system using machine learning is presented in Section III, which also describes the new features for the C&C Server detection process in detail. Experimental results and discussions are provided in Section IV. The paper is concluded in Section V.

A. Several Methods of Detecting APT Based on Abnormal Connections
In [3], the authors proposed a method for APT attack detection based on the analysis of abnormal flow behaviors in network traffic. This method includes the extraction, normalization, and analysis of abnormal values of three groups of flow signals: numbytes, numflows, and numdst. Andrew Vance et al. [6] used non-signature-based, flow-based traffic measurements and applied a statistical method to detect APT attacks. Weina Niu et al. [7] introduced a method for APT attack detection based on mobile DNS logs using four sets of features: DNS request and answer-based features, domain-based features, time-based features, and Whois-based features. With the selected feature sets, the authors used a number of machine learning methods such as Global Abnormal Forest and k-Nearest Neighbor to detect APT malware. G. Zhao et al. [8] used five sets of features (domain-based features, time-based features, Whois-based features, DNS answer-based features, and active probing features) together with the J48 decision tree algorithm to detect APT malware command and control domains (C&C Domains). In [9], the authors used three sets of features (domain name lexical features, ranking features, and DNS query features) together with the Random Forest machine learning algorithm to detect APT domains.

B. Detecting APT Attack using Big Data Technology
The publication [10] listed a number of APT attack detection tools based on the analysis and correlation of events, such as Splunk, LogRapse, and IBM QRadar. Jisang Kim et al. [11] proposed a method to detect APT attacks by collecting and processing several data sources: collected network packets, tracked email logs, tracked privilege escalation logs (Syslog), a call-back domain blacklist, the internal DNS server, and SSL ports. However, the authors did not present the technical solutions or the big data processing technology used. Sung-Hwan Ahn et al. [12] also proposed applying big data technology to APT attack detection. The authors proposed a big data analysis system architecture consisting of the following stages: collecting data from firewalls together with log, behavior, and status information (date, time, inbound/outbound packets, daemon logs, user behavior, process information, etc.) from anti-virus software, databases, network devices, and systems; preprocessing the data; analyzing the data; and issuing warnings for signs of APT attacks. However, the authors did not present the big data solution or technology that supports their proposed model. In [13], the authors proposed an APT attack detection model on a big data platform with two main processes: Behavior Rule Generation and Abnormal Behavior Detection. This model uses the Hadoop MapReduce framework.

C. Some Commercial Software for Detecting APT Attacks
The document [14] introduced a number of commercial products and technologies that support APT attack monitoring and detection, including Symantec, Forcepoint, McAfee, Kaspersky Lab, Fortinet, Cisco, Palo Alto Networks, and FireEye.
McAfee Advanced Threat Defense is designed to detect APT malware and zero-day vulnerabilities by combining static analysis with dynamic analysis through sandboxing techniques. The analysis results are provided to the system so that it can detect and alert across the network, down to the terminals. However, the disadvantage of this solution is its inability to analyze email attachments.
Kaspersky Anti Targeted Attack Platform (KATAP) is a solution that combines machine learning algorithms with sandbox technology to process threat information collected from inside systems and terminals in order to detect signs of malware (including known, unknown, and APT malware) at any stage of the APT attack life cycle. However, the disadvantages of KATAP are that it does not provide monitoring and troubleshooting functions after an APT attack campaign and that it is weak at preventing data leakage.
FireEye's APT attack prevention solution is a set of products that analyze data from multiple sources such as Web, Email, File, Central Management, and Malware Analysis. All suspicious files, attachments, and URLs are automatically scanned and monitored through rule sets, and all suspicious items are then transferred to the sandbox environment to be executed.
Cisco's Advanced Malware Protection (AMP) is a solution that detects APT attacks while they are spreading or hiding in the system, across all three phases: before, during, and after an APT attack. In the before phase, AMP uses worldwide threat information gathered from Cisco's Collective Security Intelligence, the Talos Security Intelligence and Research Group, and AMP Threat Grid to prevent known malware attacks.
In the during phase, AMP uses information obtained from known attacks (signatures), combined with AMP Threat Grid technology and its ability to analyze malware automatically, to identify and block suspicious, dangerous files that are trying to gain access to the network. In the after phase, AMP does not only check and monitor at the time of the attack but continues to monitor and analyze all operations and data paths (even those previously considered "clean") and looks for signs of dangerous behavior. When it detects a file containing malicious code, AMP provides visual information about the malicious code's activity in the network and on each terminal. AMP also allows quick response and troubleshooting through a simple web interface. However, AMP does not provide a function to prevent data leakage.
WildFire is Palo Alto Networks' APT attack detection solution, providing full visibility of all traffic, including APT threats from Web traffic, email protocols (SMTP, IMAP, POP), and FTP (regardless of whether it is encrypted or not). The weaknesses of WildFire are that it focuses only on monitoring the network layer, pays little attention to the application layer, and focuses only on detecting and preventing attacks without providing troubleshooting.

A. Model Overview
Fig. 1 presents the proposed C&C server detection system using machine learning. The model consists of the following components:
-Network Traffic: The network data to be checked can be captured directly in real time from the network card or read from a pcap file.
-Extract features: In this paper, we use the Bro IDS tool to break network traffic down into its components. Bro IDS is a network security monitoring tool with fast processing speed. It detects intrusions based on rule sets and helps to separate features from network traffic at high speed.
-Training: After the necessary information is extracted from the Bro IDS log files, the features are saved in a CSV file. The Random Forest algorithm is then used to classify connections based on those features.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 5, 2020

B. Select and Extract Features
Table I lists the features of network traffic we use. All features marked "*" in Table I are newly extracted and selected in this research. The features in Table I are defined as follows:
-Anomaly Port and Protocol: To find abnormal ports, that is, ports that do not run the service expected by Internet standards, the first step is to extract the strange IP addresses that the server queries in DNS packets. Next, we find the connections that the server makes to those IP addresses. From those records, we extract the protocols and service ports and check whether they match; if they do not, the protocol and port are considered abnormal. We define the feature this way because a server that offers a service on a specific port always listens for requests and returns responses through that port. For example, web services using the HTTP protocol have a default port of 80, so the web server always listens for requests and responds via port 80.
-Number of three way handshakes: is the number of completed TCP three-way handshakes.
-Number of connection teardowns: is the number of failed connections. Table II shows some cases that occur when a connection fails.
-Number of complete conversations: is the number of successful connections.
-Anomaly Data: The first step is to identify strange IP addresses through DNS records. Then, we take the value of the tcp.len field of all records whose destination IP is a strange IP address. From these values, we find the maximum packet size and calculate the average packet size.
-Number of packets per time: is the number of packets in a period such as a second, minute, hour, day, week, month, or year.
-Number of bytes per time: is the total size in bytes of the packets in a period such as a second, minute, hour, day, week, month, or year.
-Percentage of TCP SYN packets: is the percentage of TCP packets that have the SYN flag set.
-Percentage of TCP SYN ACK packets: is the percentage of TCP packets that have the SYN and ACK flags set.
-Percentage of TCP ACK PUSH packets: is the percentage of TCP packets that have the ACK and PUSH flags set.
-Command and File System: The commands that the C&C server sends to malware are always command-line commands, so if the traffic contains system command lines transmitted from the external Internet to any machine in the system, they very likely come from a C&C server. In addition, in recent attacks such as Sofacy Group's Parallel Attacks [15], carried out by APT28 in February 2018, the malware saves the data it collects to a file in the %APPDATA% folder and transfers it directly to the C&C Server. Similarly, in other attacks the malware hides information in directories such as %TEMP% and sends it directly to the C&C server. Thus, network traffic that contains both command lines and system files is a strong sign of an APT attack.
-Data to computer in LAN: In recent attacks, the first computer on the LAN that is compromised is used to gather data from the other computers on the LAN, and the data from that machine is then sent to the C&C Server via VPN. A typical example is the APT15 attack on the US Navy on June 14, 2018 [16].
-Tor network: To encrypt the operations and commands of malware once it has gained access to the victim machine, APT organizations often use the Tor network, which also helps keep C&C Server addresses from being detected.
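As an illustration of how several of the features above could be computed, the following minimal Python sketch builds a feature dictionary from packets exchanged with one strange IP address. It is not the implementation used in this paper: the record layout (`ts`, `dst_port`, `service`, `length`, `flags`, `payload`), the service-to-port table, the payload patterns, and all function names are assumptions introduced only for this example.

```python
import re
from collections import defaultdict

# Assumed service-to-port table (illustrative only, far from complete).
STANDARD_PORTS = {"http": {80}, "ssl": {443}, "dns": {53}, "ssh": {22}, "smtp": {25}, "ftp": {21}}

# Hypothetical patterns for system commands and system folders in payloads.
SYSTEM_PATTERN = re.compile(r"cmd\.exe|powershell|%APPDATA%|%TEMP%", re.IGNORECASE)


def is_anomalous_port(service, port):
    """True when a known service is observed on a port outside its standard set."""
    return service in STANDARD_PORTS and port not in STANDARD_PORTS[service]


def flag_percentage(packets, flags):
    """Percentage of packets whose TCP flag set equals `flags`."""
    if not packets:
        return 0.0
    hits = sum(1 for p in packets if p["flags"] == flags)
    return 100.0 * hits / len(packets)


def per_window(packets, window_seconds):
    """Map each time-window index to [packet count, byte count]."""
    windows = defaultdict(lambda: [0, 0])
    for p in packets:
        bucket = windows[int(p["ts"] // window_seconds)]
        bucket[0] += 1
        bucket[1] += p["length"]
    return dict(windows)


def extract_features(packets, window_seconds=60):
    """Build a feature dictionary for the packets exchanged with one strange IP."""
    lengths = [p["length"] for p in packets]
    # Averages are taken over the windows that contain at least one packet.
    n_windows = max(len(per_window(packets, window_seconds)), 1)
    return {
        "anomaly_port": any(is_anomalous_port(p["service"], p["dst_port"]) for p in packets),
        "max_len": max(lengths, default=0),
        "avg_len": sum(lengths) / len(lengths) if lengths else 0.0,
        "packets_per_window": len(packets) / n_windows,
        "bytes_per_window": sum(lengths) / n_windows,
        "pct_syn": flag_percentage(packets, {"SYN"}),
        "pct_syn_ack": flag_percentage(packets, {"SYN", "ACK"}),
        "pct_ack_push": flag_percentage(packets, {"ACK", "PSH"}),
        "system_artifacts": any(SYSTEM_PATTERN.search(p["payload"]) for p in packets),
    }


# Hypothetical packets, invented for this example only.
sample = [
    {"ts": 0.0, "dst_port": 4444, "service": "http", "length": 0,
     "flags": {"SYN"}, "payload": ""},
    {"ts": 1.0, "dst_port": 4444, "service": "http", "length": 0,
     "flags": {"SYN", "ACK"}, "payload": ""},
    {"ts": 65.0, "dst_port": 4444, "service": "http", "length": 120,
     "flags": {"ACK", "PSH"}, "payload": "cmd.exe /c dir %APPDATA%"},
    {"ts": 70.0, "dst_port": 4444, "service": "http", "length": 80,
     "flags": {"ACK", "PSH"}, "payload": "ok"},
]
features = extract_features(sample)
```

In this sketch each dictionary would become one row of the CSV file used for training. The boolean and numeric encodings are one possible choice; the paper does not state how the features are encoded.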

C. C&C Server Detection Method
To detect connections from within the network to the C&C server, in this paper, we use the Random Forest algorithm. Random Forest is an ensemble classification method [17]. The algorithm relies on an ensemble of classifiers, normally Decision Trees, to make the final prediction. Its theoretical foundation is based on Jensen's inequality [18]. Applied to classification problems, Jensen's inequality shows that a combination of many models may produce a lower error rate than each individual model. The studies [19,20] have shown that the Random Forest algorithm has many advantages over other machine learning algorithms. In this paper, we use the Random Forest algorithm with 10 decision trees to classify and test connections.
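The claim that combining models can lower the error rate can be illustrated with a small calculation. The sketch below is not a Random Forest and not the paper's method: it computes the error of a simple majority vote under the idealized assumption that the classifiers are independent and share the same error rate. The function name and the 0.3 error rate are hypothetical.

```python
from math import comb


def majority_vote_error(n_models, error_rate):
    """Error rate of a majority vote over n independent binary classifiers.

    A tie (possible when n_models is even) is counted as a coin flip.
    """
    total = 0.0
    for wrong in range(n_models + 1):
        # Binomial probability that exactly `wrong` classifiers are mistaken.
        prob = comb(n_models, wrong) * error_rate**wrong * (1 - error_rate) ** (n_models - wrong)
        if 2 * wrong > n_models:
            total += prob          # the majority is wrong
        elif 2 * wrong == n_models:
            total += 0.5 * prob    # tie, broken at random
    return total


single = 0.3  # hypothetical error rate of one decision tree
ensemble = majority_vote_error(10, single)
```

With these assumed numbers, the vote over 10 classifiers has an error rate of about 0.10, compared with 0.30 for a single classifier. In a real Random Forest the trees are correlated, so the improvement is smaller than this idealized figure.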

A. Dataset and Experiment Environments
In this paper, we collected 61 network traffic files of APT attacks from [21][22][23][24][25][26].
The experimental results in Tables IV and V show that the C&C server training and detection model gives good results. This indicates that the Random Forest classification algorithm and the features selected and extracted in this paper are effective. In particular, the C&C server detection experiment was completely accurate, which shows that training produced a very good detection model. Therefore, the experimental results show that the features we selected and proposed to represent abnormal connection behaviors capture the difference between normal connections and APT connections. This is very important because most APT attacks are difficult to detect without a system that links events together.

V. CONCLUSION AND FUTURE DIRECTION
The APT attack has been, and will remain, a dangerous attack and a challenge to information security systems. In this paper, using the Random Forest machine learning algorithm and the unusual behavioral features of network traffic, we successfully detected C&C servers and raised alerts early. The results of this research can be used in intrusion detection and prevention systems to look for abnormal signs in the network. In the future, we will improve the network traffic features in order to detect signs of an APT attack when the attack uses encryption techniques to transmit information.