A Framework for Detecting Botnet Command and Control Communication over an Encrypted Channel

Abstract: Botnets employ advanced evasion techniques to avoid detection. One such technique is hiding command and control communication in an encrypted channel such as SSL or TLS. This paper presents a Botnet Analysis and Detection System (BADS) framework for detecting Botnets. The BADS framework serves as a guideline for devising the methodology, which we divide into six phases: i. data collection, customization, and conversion, ii. feature extraction and feature selection, iii. Botnet prediction and classification, iv. Botnet detection, v. attack notification, and vi. testing and evaluation. We use machine learning algorithms for Botnet prediction and classification. We also identify several challenges in implementing this work. This research aims to detect Botnets over an encrypted channel with high accuracy and fast detection time, and to provide autonomous management for the network manager.


I. INTRODUCTION
Botnets have become a significant concern in the computer industry. As users go online in their daily lives, they face a high risk of becoming victims of a Botnet attack. Botnets have developed many capabilities, and most of them are used for attacks such as DDoS, spamming, malware spreading, and large-scale computer compromise. Launching a massive DDoS attack is one of the main capabilities of a Botnet. For instance, one notorious DDoS attack occurred in 2000, when a cybercriminal targeted Yahoo!, Fifa.com, Amazon.com, Dell, Inc., E*TRADE, eBay, and CNN. Another main capability of Botnets is spamming. Authors such as Solomon and Evron (2006) [1] highlight Botnet spamming as a significant concern because the large volume of spam distributed consumes many network resources. Their concern is supported by McAfee Avert Labs [2], which stated that more than 70 percent of spam email is caused by Botnets.
Other than DDoS and spamming, Botnets vigorously compromise many computers and try to build a vast network of infected machines through their command and control (C&C) communication. Thus, the impact of an attack is enormous. Botnets are becoming an increasingly widely used method among cybercriminals for many purposes, such as gaining recognition from other hackers, financial gain, and many other nefarious activities, all of which affect users in general. Botnets also render antivirus tools ineffective, and bots are able to modify registry entries so that they remain active even when the infected machine is booted in safe mode. Some Botnets even respond vigorously if they notice efforts to detect their presence.
Considering such capabilities deployed in many Botnets, the effects of a Botnet attack are huge. Botnets pose a high risk to national security and threaten the security of many organizations, both public and private, in particular by causing severe disruption and high usage of network resources. Cleaning a detected system is very difficult because the volume of network traffic created by bots is massive, making it impossible to update an infected machine (Thomas & Jyoti, 2007) [3]. For this reason, even governments have to spend much money to prevent Botnet attacks. CyberSecurity Malaysia [4] provides statistics on Botnet attacks in Malaysia, as depicted in Fig. 1. Table I shows Botnet attacks in comparison to other attacks in Malaysia for six consecutive years. Botnets employ many evasion techniques to avoid detection and stay in the network. One of these techniques is exploiting encrypted channels such as SSL and TLS to hide C&C activities. Zhang (2017) [5] refers to a Botnet that uses encryption as an evasion technique as an advanced Botnet. Burghouwt (2015) [6] describes the encryption of C&C traffic as the most crucial Botnet evasion technique. Botnets depend on this technique for several reasons, for instance the increasing number of services and applications that use an encrypted channel to secure communications and content.

Malaysia Botnet Drone 2012-2017
Although the CyberSecurity Malaysia reports in Fig. 1 and Table I do not directly state that the Botnet attacks were carried out over Secure Sockets Layer (SSL) or any other encrypted channel, we consider another report by Gebhart of the Electronic Frontier Foundation (EFF) [9], which stated that by 2017 more than 50% of Internet traffic was protected by HTTPS. The effort to encrypt traffic has been pursued enthusiastically since 2010. Botmasters are therefore taking advantage of this situation to hide their operations and evade detection. A Botnet by itself creates a massive impact, and implementing advanced evasion techniques such as masquerading in the SSL channel amplifies that impact. These scenarios show the severity of encrypted Botnet attacks and motivate the development of solutions to detect Botnets over an encrypted channel, even though other network attacks are also encrypted in SSL/TLS-enabled protocols.
Additionally, according to Nicholson (2015) [7] and Finley (2017) [10], more than half of all traffic is encrypted, mostly by SSL/TLS. Gooley (2017) [11] states that by the end of 2016, 80% of traffic across Google properties was encrypted, and 54% of the threats blocked by Zscaler were hidden inside SSL traffic. The use of SSL for distributing malicious content is therefore rising too. Recent Botnet strains exploit this situation and use the SSL/TLS channel for their command and control communication. SSL/TLS protects legitimate content but simultaneously provides Botnets with hiding spots, making encrypted channels beneficial to both legitimate users and attackers. Gooley identifies the three most active types of malicious content as Dridex, Vawtrak, and Gootkit; all of them are Botnet variants commonly associated with stealing user credentials. Moreover,  report that detailed C&C traffic analysis shows at least ten prevalent malware families avoided well-known C&C carriers and preferred encrypted channels, one of which is Zeus. Consequently, the rise of SSL-encrypted traffic increases Botnet attacks through the SSL/TLS channel.
Despite the advancement and fast evolution of Botnet technology, research on Botnet detection is still in its infancy because existing studies remain limited in scope and do not generally cover recent research and development (Silva et al. 2012) [13]. This indicates that there is still room for improvement in Botnet detection, particularly for encrypted Botnets. Many Botnet detection techniques are based on payload analysis, and these techniques are unfortunately ineffective against encrypted C&C channels (Shanti & Seenivasan, 2015) [14].   [15] prove that Botnet detection techniques that rely on payload analysis can be foiled by encryption. Payload-based analysis requires decryption, which raises a privacy issue. Zhao et al. (2013) [16] state that several challenges in Botnet detection remain unaddressed, such as designing detectors that can cope with new forms of Botnet; they therefore proposed the use of machine learning techniques, which have been shown to increase detection accuracy even for dynamic forms of Botnet. Furthermore, Botnet detection approaches for encrypted traffic are not well established, suffering, for instance, from limited signatures, a limited number of extracted features, a limited range of detected Botnets, and insufficient alarm mechanisms (Zhao et al., 2013 [16]; Bortolameotti, 2014 [17]; Larinkoski, 2016 [18]).
The purpose of this study is to propose an approach for detecting Botnets in an encrypted channel. The solution is devised to close the gaps in encrypted Botnet detection, especially in detection systems that are based on payload analysis. This study will benefit system administrators, as the detection system assists them in monitoring and protecting system security. This research explores the potential of machine learning techniques, which are expected to produce a detection system that is highly accurate, fast, and autonomous. The autonomous feature means the system needs minimal supervision and is capable of self-learning. The findings will also benefit researchers in this area, as they open up the exploration of machine learning techniques for developing an effective and efficient Botnet detection system.
The implementation of machine learning in Botnet detection is a compelling way to overcome the limitations in Botnet detection (Cha & Kim (2017) [19] [25]); however, there are still gaps to be filled in research on detecting encrypted Botnets, such as the inadequacy of the detection features used. To achieve a high detection rate and fast detection, the system requires techniques that offer high precision and fast pattern recognition. Autonomy in the detection system requires decision support, situation awareness, and knowledge management. Therefore, this research looks at machine learning techniques that could fulfill these requirements. In general, previous research has shown that machine learning can address issues such as accuracy (Salvador et al. 2009 [26]; Al-Hammadi 2010 [27]; Bilge et al. 2011 [20]; Guntuku et al. 2013 [22]; Ritu & Kaushal 2015 [25]) and real-time detection (Salvador et al. 2009 [26]; Guntuku et al. 2013 [22]) in Botnet detection.
www.ijacsa.thesai.org
We organize the remainder of this paper as follows. Section II reviews related work on encrypted Botnet C&C detection. Section III proposes the framework for detecting Botnets over an encrypted channel; this framework is used to devise the method, which is divided into several phases. Section IV highlights the challenges in implementing the proposed solution. Finally, Section V draws some concluding remarks.

II. RELATED WORK
Many researchers have relied on payload-based analysis (deep packet inspection) to detect encrypted Botnet C&C. For example,   [15] develop high-entropy detectors and analyze packets against a determined threshold. They state that encrypted Botnet traffic produces high entropy and can therefore be detected by these detectors. The challenge of this approach is how to differentiate the entropy produced by an encrypted Botnet from that of other high-entropy traffic, for instance media, executables, and compressed files. Tyagi et al. (2015) [28] also implement deep packet inspection (DPI) in their approach and propose N-gram-based HTTP bot traffic detection. The proposed technique detects both encrypted and regular Botnets. It is based on the fact that the C&C responds with similar communication patterns, with only slight modifications to the HTTP GET request made by a bot. The communication patterns do not vary unless the bot is updated; therefore, this technique works properly only if the bot is not updated.
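The entropy-threshold idea can be illustrated with a minimal Python sketch. It computes the Shannon entropy of a payload in bits per byte and compares it with a threshold. The function names and the threshold of 7.0 are illustrative assumptions and are not taken from [15]; random bytes stand in for an encrypted payload.

```python
import math
import os
from collections import Counter


def shannon_entropy(payload: bytes) -> float:
    """Return the Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not payload:
        return 0.0
    total = len(payload)
    counts = Counter(payload)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical threshold; a real detector would calibrate it on labelled traffic.
ENTROPY_THRESHOLD = 7.0


def is_high_entropy(payload: bytes, threshold: float = ENTROPY_THRESHOLD) -> bool:
    """Flag payloads whose byte distribution is close to uniform."""
    return shannon_entropy(payload) >= threshold


# Plain HTTP text uses few distinct byte values, so its entropy is low.
plain_text = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n" * 50
# Random bytes stand in for an encrypted payload (assumption for illustration).
random_like = os.urandom(4096)

print("plain text :", round(shannon_entropy(plain_text), 2), is_high_entropy(plain_text))
print("random-like:", round(shannon_entropy(random_like), 2), is_high_entropy(random_like))
```

Compressed or media payloads would also score close to 8 bits per byte under this measure, which is why an entropy threshold alone cannot separate encrypted C&C traffic from other high-entropy content.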
Other work on deep packet inspection is by Sherry et al. (2015) [29], who propose BlindBox to perform deep packet inspection directly on encrypted traffic. They demonstrate that BlindBox enables applications such as IDS, exfiltration detection, and parental filtering, and supports real rule sets from both open-source and industrial DPI systems. They also implement BlindBox and show that it is practical for settings with long-lived HTTPS connections. However, this approach is not specifically designed to detect Botnets over an encrypted channel; Botnet detection is only mentioned as one of its possible usage scenarios. Therefore, this approach might not work well for Botnet detection. In contrast, Burghouwt (2015) [6] uses causal analysis of traffic flows to detect covert Botnets, for example Botnets that hide in an encrypted channel. This approach detects a covert Botnet by identifying the direct causal relationship between network flows and prior events. However, the technique needs user events in addition to network traffic, which adds deployment complexity. Another researcher who used this method is Zhang (2017) [5].
Instead of deep packet inspection, some researchers use decryption techniques to detect encrypted Botnet C&C, for example,   [12]. They propose PROVEX, a system that automatically derives probabilistic vectorized signatures. PROVEX learns characteristic values for fields in the C&C protocol by evaluating byte probabilities in the C&C input traces used for training. In this way, it identifies the syntax of C&C messages purely from network traffic, without the need to specify C&C protocol semantics manually. Even though PROVEX can detect all the studied malware families, its payload-based analysis depends on the studied signatures, which limits detection to known bots only. Furthermore, its brute-force-like decryption technique raises a privacy issue.
Some researchers claim that their general Botnet detection approaches can also detect encrypted Botnet C&C, on the grounds that the approaches do not analyze payload content and are independent of Botnet structure. Shanti & Seenivasan (2015) [14] propose a detection methodology that distinguishes bot hosts from normal hosts by analyzing traffic flow characteristics based on time intervals instead of payload inspection. They use Decision Tree and Naïve Bayes classification; the decision tree gave the better true positive rate of 86.69%. Kirubavathi & Anitha (2016) [39] propose an approach to detect Botnets irrespective of their structure. They apply several machine learning algorithms (MLAs), of which Naïve Bayes has the highest detection rate at 99.14%. Zhao et al. (2013) [16] and Bortolameotti (2014) [17] use Decision Tree in their approaches, and both report very high detection rates of 98.5% and 99.96%, with very low false positive rates of 0.01% and 0%, respectively. Dietrich et al. (2013) [33] develop CoCoSpot, which uses average-linkage hierarchical clustering; 50% of Botnet families were detected at a rate of 95.6%. Buriya et al. (2015) [36] use Naïve Bayes and achieve 98.84% accuracy. Most of the MLAs discussed in this paper thus report a very high detection rate.
Although some techniques provide a high detection rate, they also suffer from a comparatively high false positive rate. For example, Richer (2017) [40] proposes an approach using a Support Vector Machine that achieves a 100% detection rate; however, its false positive rate is more than 15%. The work by Shanti & Seenivasan (2015) [14] also has a very high false positive rate of more than 21%. Above all, Warmer (2011) had the highest false positive rate, 44.3%, using Naïve Bayes.
Al-Hammadi (2010) [27] presents a host-based behavioral approach for detecting Botnets that correlates different activities generated by bots by monitoring function calls within a specified time window. Al-Hammadi uses the Dendritic Cell Algorithm, which is inspired by the immune system. The evaluation shows that correlating different activities generated by IRC/P2P bots within a specified period achieves high detection accuracy (100%). In addition, using an intelligent correlation algorithm not only indicates whether an anomaly is present but also exposes the source of the anomaly.
One of the most prominent MLAs for Botnet detection is the Neural Network and its variant, the Self-Organizing Map (SOM). SOM is an unsupervised Neural Network that has been widely used in intrusion detection. Unfortunately, few works discuss SOM for Botnet detection; most address intrusion detection instead. Nevertheless, SOM is a promising approach, especially for developing an autonomous Botnet detection system based on the collection of flow statistics using a Neural Network. The results obtained show that such a system is feasible and efficient, since it provides high detection rates with low computational overhead.
Many existing approaches to intrusion detection use some form of rule-based analysis, of which the Expert System is the most common. Most existing behavior-based approaches are not able to detect and predict Botnets as they change their structure and pattern. Roshna & Ewards (2013) [23] present the Adaptive Neuro-Fuzzy Inference System (ANFIS), a technique that trains the system for future prediction. However, the limitation of this work is the restricted set of fuzzy rules and fuzzy sets used for comparison. The proposed work should therefore overcome this limitation by increasing the number of rules generated using the Botnet features and information gain.  [44] intends to identify bot-relevant domain names and IP addresses by inspecting network traces. The algorithm involves traffic reduction, feature selection, and pattern recognition. Fuzziness in pattern recognition helps to detect bots that are hidden or camouflaged. Performance evaluation based on real traces shows that the proposed system can reduce the input raw packet traces by more than 70% while achieving a high detection rate (about 95%) and a low false positive rate (0-3.08%). Furthermore, the proposed FPRF algorithm is resource-efficient and can identify inactive Botnets to indicate potentially vulnerable hosts. BotDigger, proposed by Al-Duwairi & Al-Ebbini (2010) [45], uses fuzzy logic to define logical rules that are mainly based on statistical facts and essential features that identify Botnet activities. The key advantage of the architecture designed in that research is that it allows the integration of a wide range of traffic specifications.
The machine learning approaches above mostly use detection rate, accuracy, or false positive rate as the metrics for measuring detection performance. However, other vital properties are real-time operation and autonomy. Even an accurate detector is of little use without fast or real-time detection. Some researchers focus on developing a real-time Botnet detection system, for example, Salvador et al. (2009) [26]. Autonomy mainly concerns self-learning and self-managing properties. Chandhankhede (2013) [21] proposes a new autonomous model for Botnet detection using the K-means algorithm, one of the most straightforward unsupervised learning algorithms for solving the well-known clustering problem. According to Khattak et al. (2014) [46], the degree of automation can be classified as manual, semi-automated, and automated. Semi-automated Botnet detection requires very little human intervention, and most of the detection is performed in an automated fashion. Fully automated Botnet detection, however, should require no human intervention after initial development. Khattak et al. also agree that, ideally, any detection method should be as generic and automated as possible.

III. PROPOSED BOTNET DETECTION FRAMEWORK
To achieve an accurate, fast, and autonomous Botnet detection system, as stated in Section I, we propose a conceptual framework as a guideline for devising a methodology to detect Botnets over an encrypted channel. Fig. 2 shows the Botnet Analysis and Detection System (BADS) framework, which consists of three main components, namely the Network Analysis System (NAS), the IDS, and the Alarm System (AS). Through BADS, we divide the process into six phases, as depicted in Fig. 3, which also shows the expected results of each phase.

Phase 1: Data Collection, Customization and Conversion
This study requires a dataset of encrypted Botnet traffic; however, such a dataset is difficult to obtain from a secured network. Therefore, we convert public Botnet datasets into a customizable encrypted Botnet dataset using BotTalker, developed by   [47]. The public datasets used are ISOT, the Malware Capture Facility Project (MCFP), and Network Information Management and Security (NIMS). Some datasets are in pcap format; we therefore convert them into CSV format using Wireshark and then into ARFF using Weka.
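To make the target format of the last conversion step concrete, the following Python sketch turns a small CSV export into an ARFF document. In this work the conversion is done with Weka itself; the sketch is only a stand-in that shows the structure of the result. The column names, the relation name, and the sample rows are hypothetical, and all non-class columns are assumed to be numeric.

```python
import csv
import io


def csv_to_arff(csv_text: str, relation: str, class_column: str) -> str:
    """Convert CSV text with a header row into a minimal ARFF document.

    Every column except class_column is declared numeric; the class column
    becomes a nominal attribute listing the labels found in the data.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    if not rows:
        raise ValueError("CSV input contains no data rows")
    columns = list(reader.fieldnames)
    nominal = ",".join(sorted({row[class_column] for row in rows}))

    lines = [f"@relation {relation}", ""]
    for name in columns:
        if name == class_column:
            lines.append(f"@attribute {name} {{{nominal}}}")
        else:
            lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(row[name] for name in columns))
    return "\n".join(lines) + "\n"


# Hypothetical flow records, as they might look after a CSV export.
sample = (
    "duration,packets,bytes,label\n"
    "0.52,12,3400,botnet\n"
    "3.10,48,51200,benign\n"
)
print(csv_to_arff(sample, "encrypted_flows", "label"))
```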

Phase 2: Feature Extraction and Selection
For encrypted Botnets, detection by inspecting the payload is a tedious process, especially when it involves extensive traffic data. Furthermore, this method requires decryption of the data, which raises a privacy issue. For this research, an approach that does not inspect the payload is therefore necessary. Encrypted Botnet traffic itself exhibits features that can be used for detection, and, most importantly, using them should not require any decryption. Botnet features are extracted through a feature extraction process, followed, if necessary, by feature selection. Feature selection reduces the number of features, keeping those most relevant to detection. Feature selection is crucial because it is a determining factor in detection accuracy, as shown by Buriya et al. (2015) [36] and Kirubavathi & Anitha (2016) [39].
Tranalyzer is used to extract the Botnet features. According to the literature, Tranalyzer captures more features than other feature extractors, and features extracted using Tranalyzer provide better accuracy (Jianguo et al., 2016) [37]. For feature selection, we use Information Gain Attribute Evaluation in Weka together with the Ranker algorithm, and select the most relevant features according to the ranking produced.
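The ranking principle behind information gain can be sketched in a few lines of Python. The sketch computes IG = H(class) - H(class | feature) for discrete features and sorts the features by gain. It is not the Weka implementation; the feature names and toy values are hypothetical, and the features are assumed to be discretized already.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """IG(class; feature) = H(class) - H(class | feature) for a discrete feature."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(table, labels):
    """Return (feature, gain) pairs sorted from most to least informative."""
    gains = {name: information_gain(column, labels) for name, column in table.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Toy, already-discretized flow features (hypothetical names and values).
labels = ["botnet", "botnet", "botnet", "benign", "benign", "benign"]
table = {
    "periodic_interval": ["yes", "yes", "yes", "no", "no", "no"],
    "packet_size_bin": ["small", "small", "large", "small", "large", "large"],
    "dst_port_443": ["yes", "yes", "yes", "yes", "yes", "yes"],
}
for name, gain in rank_features(table, labels):
    print(f"{name}: {gain:.3f}")
```

In this toy data the feature that separates the classes perfectly ranks first, while the constant feature carries no information and ranks last, which is the behavior a ranking-based selection relies on.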

Phase 3: Botnet Prediction and Classification
We use the Weka Classify module to perform Botnet prediction and classification. The classification generates decision rules, and these rules are used for Botnet detection.
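As an illustration of what a generated decision rule can look like, the following Python sketch learns a single threshold rule on one numeric feature (a decision stump). It is a simplified stand-in and not the Weka classifier used in this work; the feature name `mean_interarrival_s` and the toy values are hypothetical.

```python
def learn_threshold_rule(values, labels, positive="botnet"):
    """Pick the threshold on one numeric feature that minimizes training errors.

    Returns (threshold, direction, errors). Direction ">" means "predict
    positive when value > threshold"; "<=" means "predict positive when
    value <= threshold".
    """
    best = None
    for threshold in sorted(set(values)):
        for direction in (">", "<="):
            errors = 0
            for value, label in zip(values, labels):
                if direction == ">":
                    predicted_positive = value > threshold
                else:
                    predicted_positive = value <= threshold
                if predicted_positive != (label == positive):
                    errors += 1
            if best is None or errors < best[2]:
                best = (threshold, direction, errors)
    return best


# Hypothetical feature: mean time between flows of a host, in seconds.
mean_interarrival_s = [0.5, 0.6, 0.55, 2.0, 3.5, 4.1]
labels = ["botnet", "botnet", "botnet", "benign", "benign", "benign"]

threshold, direction, errors = learn_threshold_rule(mean_interarrival_s, labels)
print(f"IF mean_interarrival_s {direction} {threshold} THEN botnet ELSE benign "
      f"(training errors: {errors})")
```

A rule in this IF/THEN form is the kind of output that would later have to be translated into the rule structure of the detection engine.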

Phase 4: Botnet Detection
The IDS component consists of a Snort-based Botnet detection mechanism, as shown in Fig. 4. Sensors sniff the packets from the network. The packet decoder takes packets from different types of network interfaces and prepares them for preprocessing. The preprocessor arranges or modifies data packets before the detection engine operates on them. It also normalizes protocol headers, detects anomalies, and performs packet reassembly and TCP stream reassembly. The detection engine is an essential part of the IDS: it detects Botnet intrusion activity in a packet. Two detectors are used, a misuse detector and an anomaly detector. The detection engine employs fuzzy inference rules for this purpose. The rules are read into internal data structures or chains, where they are matched against all packets. If a packet matches any rule, appropriate action is taken; otherwise, the packet is dropped. The idea behind implementing fuzzy rules and fuzzy inference in the detection engine is to determine the severity level of a Botnet attack.
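How fuzzy inference can grade severity is sketched below in Python. The two inputs (the share of flows matching bot rules and the number of suspected hosts), the membership breakpoints, and the three rules are all assumptions made for illustration; the actual rule base of the detection engine is not specified here. Rules are combined with min for AND and max for OR, and the severity label with the highest rule strength is returned.

```python
def rising(x, start, end):
    """Membership degree growing linearly from 0 at start to 1 at end."""
    if x <= start:
        return 0.0
    if x >= end:
        return 1.0
    return (x - start) / (end - start)


def falling(x, start, end):
    """Membership degree falling linearly from 1 at start to 0 at end."""
    return 1.0 - rising(x, start, end)


def severity(match_rate, suspected_hosts):
    """Grade an attack as low, medium, or high from two hypothetical inputs.

    match_rate: share of observed flows matching bot rules (0.0 to 1.0).
    suspected_hosts: number of internal hosts showing bot-like flows.
    """
    rate_low = falling(match_rate, 0.1, 0.4)
    rate_high = rising(match_rate, 0.3, 0.7)
    hosts_few = falling(suspected_hosts, 2, 10)
    hosts_many = rising(suspected_hosts, 5, 20)

    strengths = {
        # IF rate is low AND hosts are few THEN severity is low
        "low": min(rate_low, hosts_few),
        # IF exactly one of the two indicators is elevated THEN severity is medium
        "medium": max(min(rate_low, hosts_many), min(rate_high, hosts_few)),
        # IF rate is high AND hosts are many THEN severity is high
        "high": min(rate_high, hosts_many),
    }
    level = max(strengths, key=strengths.get)
    return level, strengths


for rate, hosts in [(0.05, 1), (0.9, 1), (0.9, 30)]:
    level, strengths = severity(rate, hosts)
    print(f"match_rate={rate}, suspected_hosts={hosts} -> {level} {strengths}")
```

Returning the rule strengths alongside the label leaves room for a downstream component to apply defuzzification or its own thresholds instead of the simple maximum used here.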

Phase 5: Attack Notification
Intrusion notification assists the network manager in managing the system. Four sub-components serve that purpose, namely intruder tracing, alarm strategy, protection strategy, and report generation. Implementing these components in the system should make it possible to alert the network manager, notify them of the severity of attacks, suggest a protection strategy, and generate a report. The autonomous mechanism enables the Botnet detection system to work effectively with minimum human intervention.

Phase 6: Testing and Evaluation
An OpenStack test bed is set up using multiple virtual machines. Several virtual machines are used to carry out intrusion attempts, and one virtual machine runs the proposed Botnet detection system. This phase tests the effectiveness of the proposed approach using several parameters, for instance accuracy and false positive rate. For the evaluation, a comparison study is then performed between the proposed approach and other existing Botnet detection systems to measure the efficiency of the approach.
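The evaluation parameters mentioned above follow directly from the confusion matrix. The Python sketch below computes accuracy, detection rate (true positive rate), and false positive rate from actual and predicted labels; the label names and the toy predictions are hypothetical.

```python
def confusion_counts(actual, predicted, positive="botnet"):
    """Count true/false positives and negatives for a binary detection task."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive and a != positive:
            fp += 1
        elif p != positive and a != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def evaluate(actual, predicted, positive="botnet"):
    """Return accuracy, detection rate (TPR), and false positive rate (FPR)."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Toy test run: 4 botnet flows and 6 benign flows (hypothetical data).
actual = ["botnet"] * 4 + ["benign"] * 6
predicted = ["botnet", "botnet", "botnet", "benign"] + ["benign"] * 5 + ["botnet"]

for metric, value in evaluate(actual, predicted).items():
    print(f"{metric}: {value:.3f}")
```

Reporting the false positive rate next to the detection rate matters because, as the related work shows, a detector can reach a very high detection rate while still raising many false alarms.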
BADS assists the network manager in monitoring the security of the system. The idea of BADS is to minimize human intervention in network monitoring and to take appropriate action based on the severity of the Botnet attack. We therefore endeavor to propose an autonomous Botnet detection system.

IV. CHALLENGES OF IMPLEMENTATION
There are several challenges in implementing BADS. The first is data preprocessing, which comprises dataset collection, dataset customization, and conversion. For this work, we also wanted to use the Botnet dataset from IMPACT Cyber Trust. However, the dataset is only available to US-based researchers and those in approved locations, and our location is not among the approved locations. Another challenge lies in data customization, as the reference material for BotTalker is quite limited. We also have to perform data conversion for all the datasets except the NIMS dataset. Overall, data preprocessing is tedious and requires much work.
Since we use various tools in our work, we expect conflicts because each tool produces different types of output. For example, the rules retrieved from the classification in Weka must be mapped to the rule structure in Snort. Furthermore, we want to employ fuzzy rules in Snort so that the detection system can determine the severity of Botnet attacks and provide appropriate solutions based on the attack level. This feature will help the network manager to monitor the network and will provide automation. We believe fuzzy inference rules can provide the required solution. Currently, we are still looking for solutions to this issue.

V. CONCLUSION AND FUTURE WORK
Botnets evolve, and new Botnet strains have developed advanced evasion techniques. These include the ability to exploit encrypted channels such as SSL/TLS for command and control communication, and the use of social media to spread malware, send spam, and obtain credential information (social bots). These evasion techniques enable a Botnet to cover its operation, evade detection, and stay on the system as long as possible. However, existing detection techniques are not well established and have limitations in detecting Botnets, especially Botnets over an encrypted channel. Having an effective and efficient Botnet detection system is essential. This research endeavors to find a solution that enhances Botnet detection over an encrypted channel by using machine learning, which is a promising approach for detecting Botnets, especially over an encrypted channel. We therefore proposed the BADS framework and devised a methodology based on it. The framework consists of three main components: the Network Analysis System (NAS), the IDS, and the Alarm System (AS). Besides the main components, testing and evaluation processes are also included in the framework. The methodology devised from the framework is divided into six phases. Overall, the contribution of this paper is three-fold:
1) The framework of the Botnet Analysis and Detection System (BADS).
2) The methodology for devising the techniques for Botnet detection.
3) The design of a fuzzy inference Snort-based Botnet detection system.