Capsule Network for Cyberthreat Detection

In cybersecurity, the analysis of social network data has become an essential research area because such data provide real-time updates about real-world events. Studies have shown that Twitter can contain information about security threats before some specialized sites do. Thus, classifying tweets as security-related or not security-related can help provide early warnings of such attacks. In this study, the use of a capsule network (CapsNet), a recent deep learning algorithm, is investigated for the first time in the field of security attack detection using Twitter. The aim was to increase the accuracy of tweet classification by using CapsNet rather than a convolutional neural network (CNN). To achieve the research objective, the original implementation of CapsNet with dynamic routing was adapted for text analysis using a tweet data set. A random search technique was used to tune the model's hyperparameters. The experimental results showed that CapsNet exceeded the baseline CNN on the same data set, with an accuracy of 92.21% and an F1 score of 92.2%; in addition, word2vec embedding performed better than random initialization.
Keywords: Capsule network; dynamic routing; deep learning; Twitter; text analysis; attack detection


I. INTRODUCTION
Security monitoring and attack detection are essential parts of any organization's management for protection against cyber-attacks. These attacks can cause service disruption, asset damage, data breaches, or data loss. To avoid such dangerous effects, a number of official security data sources are available, including the National Vulnerability Database (NVD) [1], which contains a security analysis of discovered vulnerabilities, and the ExploitDB [2], which provides a user-friendly interface for all discovered exploits targeting known vulnerabilities. These traditional data sources provide trusted security information, but at the cost of a delay in reporting it [3]. Moreover, not all reported vulnerabilities will be exploited in the real world, and some have a higher probability of being exploited and thus need to be patched first [4].
For system administrators, the time between the detection of a cyberattack plan and the actual occurrence is critical. They need up-to-date information about current or imminent attacks to analyze them, study their impact, and be aware of new attack types and hacking tools in real time [5]. One of the new solutions for this problem is utilizing social network data to extract real-time notifications about the security situation of the organization or software and hardware used in its infrastructure.
As one of the most popular social networks, Twitter is considered a rich source of information about different security threats. This claim is supported by studies showing that Twitter contained information about security threats before some specialized sites [6]-[8]. This observation attracted researchers to analyze Twitter data and extract knowledge to be used in the detection and prediction of security attacks. The objectives when using Twitter data in the security field vary, from vulnerability and exploit detection [4], [9], [10] to attack detection by linking a sentiment score to a specific target with real security events [11]-[13], and trying to determine the threshold of tweet sentiment that predicts the probability of the attack occurring [14].
Text classification using different machine learning (ML), neural network (NN), and deep learning (DL) algorithms has been widely investigated for detecting cyber-attacks using Twitter data. One of the most advanced techniques for this purpose is the convolutional neural network (CNN) [15]. As one of the DL algorithms, a CNN overcomes the traditional ML technique limitation by providing automation for the learning process [16]. However, the CNN comes with limitations that are mainly related to the use of the pooling layer [17], which will be described in detail in Section II.
In 2017, Geoffrey Hinton, often called the godfather of DL, proposed the capsule network (CapsNet), which was first examined using the Modified National Institute of Standards and Technology (MNIST) data set [18]. CapsNet outperforms its predecessor, the CNN, in many image classification tasks [19], but it is still in its early stages for text classification [20]. This study aimed to use Twitter to examine CapsNet's capability for providing accurate classifications of security tweets with the goal of cyberthreat detection. The CapsNet was implemented by building an NN model to classify tweets as security-related or not security-related. The CapsNet model was then evaluated in terms of classification accuracy and F1 score, and its performance in tweet classification for the security field was compared against a CNN baseline model.
The rest of this paper is organized as follows: In Section II, an overview of CapsNet's improvements in comparison to CNNs is given. Section III covers the main recent work done in the field using Twitter for cyberthreat detection. This is followed by Section IV, which describes the implementation of the proposed model. In Section V, the details of the experiments conducted are described, and Section VI includes the results and discussion. Finally, the paper's conclusion and future work are discussed in Section VII.

II. BACKGROUND
Until recently, CNNs achieved state-of-the-art results for many natural language processing (NLP) tasks [21]. However, CNNs have limitations and drawbacks, notably those related to pooling. Pooling, one of the building blocks of CNNs, is used to reduce the complexity and the number of parameters in the CNN while preserving the main features [20]. This makes CNNs particularly efficient with classification tasks, but it causes a loss of valuable information such as the precise location of an object or the relationships between the object's parts [18]. Fig. ?? shows the way that the CNN works. Even when the parts of a face are not arranged correctly, the CNN will classify the image as a face regardless of the location and relationships between the parts [19]. For better modeling of spatial relationships among parts, CapsNet was proposed [22].
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 6, 2020
The architecture of the CapsNet overcomes the CNN's drawbacks through several properties [18]. First, the basic unit in the CapsNet is the capsule (vector), where each one is a set of neurons representing an object or an object part. CapsNet transforms vector inputs into vector outputs; thus, it can learn more complex transformations than CNNs, which operate on scalars. The output of a capsule is an activity vector, whose length represents the probability of the existence of the object and whose coordinates (dimensions) encode the object's attributes (pose information), which preserves the spatial relationship between features. Second, CapsNet uses a routing-by-agreement technique to replace the routing by max pooling used in the CNN. In simple terms, instead of extracting the most important features by using max pooling and ignoring the less important ones, propagation between the layers is based on routing-by-agreement. This means that the output of each capsule is forwarded to the next layer's capsules with different weights that are based on the agreements between the capsules.
In NLP, CapsNet has a greater ability to efficiently learn the spatial relationships between words, such as the local order of words and their semantic representations [22]. Many researchers have investigated the use of CapsNet for NLP tasks like sentiment analysis [23], [24], fake news detection [25], stock performance prediction using Twitter [26], implicit emotion detection [27], and offensive posts on social media [28]. The results of these studies showed that CapsNet outperformed the CNN in text classification, which was part of the motivation for the present study.

III. RELATED WORK
In this section, some studies that used Twitter data for the detection of security attacks are reviewed. For each study, the specific problem addressed, the analysis techniques used, and the results obtained are summarized to give an overview of the research already conducted in the field of interest.
In order to discover Twitter discussions about emerging attacks against a specific target, the authors of [29] proposed an approach to security event detection that learned with positive and unlabeled data based on user-provided expectations. Expectation regularization (ER) was used to find the ratio between positive and negative examples in the training process. The study's security events included denial of service (DoS) attacks, data breaches, and account hijacking, represented in the form of (Entity, Date) as training examples. Two sets of manually extracted features were used to find new events. Using the logistic regression (LR) classifier, the proposed solution was able to detect new events automatically in real time for each predefined category.
Simple discrete features may be limited in representing subtle semantic differences between true event mentions and false cases with similar word patterns. To overcome this limitation, the researchers in [16], building on [29], modified the method to be more semantically based by using a long short-term memory (LSTM) based neural embedding model that learns tweet-level features automatically. This change improved the detection accuracy compared to the previous method because of the NN's ability to represent deep semantic information, which is more difficult to capture through discrete features.
As an end-to-end solution for cyberthreat detection, SYNAPSE [30] provided real-time extraction of security events from Twitter with high-level abstraction. A data set of more than 195,000 tweets was collected from security-related accounts and filtered by keywords related to the monitored infrastructure. The statistical method called term frequency-inverse document frequency (TF-IDF) was used to extract the tweets' features. A support vector machine (SVM) algorithm was used for feature learning, and it achieved a minimum true positive rate (TPR) and true negative rate (TNR) of 90% in classifying tweets. For more informative extractions, the model proposed in [30] included stream tweet clustering using a dynamic clustering algorithm and summarization of each cluster with the exemplar tweet. The model was able to detect important actionable threats by verifying them with threats reported in the Common Vulnerability Scoring System (CVSS).
With the same objectives as the previous work [30], the authors in [31] proposed an event detection model with joint phases that performed the filtering, clustering, and summarization with shared tweet representation. Features were extracted using skip-gram and LSTM to obtain vector representations, and a multi-layer perceptron (MLP) classifier was used for tweet classification identifying security-related tweets. The tweets were clustered in groups, and each cluster was summarized with the most informative tweet provided. All these phases were conducted jointly based on features extracted at the beginning. The collaborative event detection and summarization model was more effective than solutions that used discrete or neural models for new event detection, clustering, and summarization.
The authors in [32] proposed a model that consists of three steps: data collection and pre-processing, feature extraction, and class prediction. They collected two balanced sets of tweets. The first set, which represented the positive class, contained tweets retrieved from cybersecurity accounts, while the negative set was tweets retrieved from non-security specialized accounts, such as health, news, and magazines. Next, for feature extraction, they used TF-IDF. In the class prediction step, the binary naïve Bayes (NB) classifier was trained using a 10-fold validation approach, which resulted in an average accuracy of 77.90% for tweet classification into security-related or not.
The authors in [33] proposed a DL classification model based on domain-specific and contextual embeddings to extract features from raw tweets. These features are convolved using a meta-encoder and then combined to be sent to the CNN, LSTM, and contextual encoder for feature learning in parallel. The resultant feature maps were concatenated with contextual embeddings in a fusion layer. A softmax classifier was used for the final prediction for each tweet as security-related or not. Compared to a set of ML and NN baseline models, the proposed model performed better with accuracy of 82%, precision of 79%, 72% recall, and an F1 score equal to 76%.
The authors in [34] used a CNN to classify tweets containing security keywords as security-related or not. All tweets were retrieved from security-specialized Twitter accounts that mentioned three predefined organizations or their assets. Results were obtained using GloVe and word2vec embeddings as well as a random initialization trained on the classification task. The embedded tweets were fed into three CNN layers in parallel to be convolved with a different number of filters and filter sizes. The researchers also suggested a named entity recognition (NER) step to extract the main entities in the tweet using bidirectional LSTM (BiLSTM). The results confirmed that the CNN model performed better than traditional ML techniques. The classification achieved 94% recall and 91% TNR, while the NER achieved a 92% F1 score in identifying the appropriate entities.
Recent studies reviewed in this section used a CNN, which opened the door for more investigation and encouraged more studies to improve the accuracy of detecting potential security attacks. The present study aimed to implement the new CapsNet algorithm for the first time in the field of attack detection using Twitter.

IV. IMPLEMENTATION
Before describing the model implementation, a general representation of the tweet classification model for cyberthreat detection is illustrated. As shown in Fig. 2, it contains three layers: an input layer, a classification algorithm, and an output layer. The input layer holds the tweets to be classified, which pass to the classification algorithm in the second layer, the CapsNet in this work. The final goal of this architecture is to predict the label of each tweet, which is the task of the output layer: labeling each tweet as security-related or not security-related.
The CapsNet model proposed for classification is the main contribution of this work. The input of the model is a tokenized tweet of n words, and the output is the predicted class of this tweet. The same principles used in [18] for MNIST handwritten digit data set classification were followed and adapted to be compatible with the tweet data set. The architecture of the model is shown in Fig. 3 and described in the following sections.

A. Embedding Layer
This layer acts as a link between the input layer and the NN because the NN cannot process textual input directly. If n indicates the number of words in a tweet, then each tweet is represented by an array of length n. Each word must therefore be converted into a numeric representation using a word embedding model, which maps each word to its corresponding d-dimensional vector, where d is the embedding dimension. Accordingly, the embedding layer converts the tokenized tweet from an n-dimensional vector into an n × d matrix of floating-point values to be sent to the next layer.
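The mapping from n tokens to an n × d matrix can be illustrated with a minimal sketch. The helper names `build_embedding_table` and `embed_tweet`, the random initialization range, and the zero vector for unknown words are assumptions made for illustration, not details taken from this work.

```python
import random

def build_embedding_table(vocab, d, seed=0):
    """Hypothetical stand-in for a trained embedding: one random d-dim vector per word."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-0.05, 0.05) for _ in range(d)] for word in sorted(vocab)}

def embed_tweet(tokens, table, d):
    """Map n tokens to an n x d matrix; unknown words map to the zero vector (assumption)."""
    zero = [0.0] * d
    return [list(table.get(tok, zero)) for tok in tokens]

tokens = ["new", "exploit", "targets", "windows"]   # n = 4
table = build_embedding_table(set(tokens), d=300)    # d = 300
matrix = embed_tweet(tokens, table, d=300)
print(len(matrix), len(matrix[0]))                   # rows = n, columns = d
```

With a pre-trained model such as word2vec, the table would be filled from the pre-trained vectors instead of random values.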

B. Convolutional Layer
The first convolutional layer is a regular convolutional layer to which the embedded tweet is fed before passing to the primary capsule layer. It convolves the embedding matrix with a set of f filters, each with a kernel of size k × d. This means it processes k words at a time, which results in a tensor of size f × (n − k + 1).
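As a rough illustration of the output size, the following sketch slides f kernels of size k × d over an n × d matrix with unit stride and no padding (an assumption consistent with the n − k + 1 term). The function name `conv1d_valid` and the toy values are hypothetical; bias and activation are omitted.

```python
def conv1d_valid(embedded, kernels):
    """Slide each k x d kernel over an n x d matrix; returns (n - k + 1) rows of f values."""
    n = len(embedded)
    k = len(kernels[0])
    feature_map = []
    for start in range(n - k + 1):
        window = embedded[start:start + k]          # k consecutive word vectors
        row = []
        for kernel in kernels:
            total = 0.0
            for kernel_row, word_vec in zip(kernel, window):
                total += sum(w * x for w, x in zip(kernel_row, word_vec))
            row.append(total)
        feature_map.append(row)
    return feature_map

# Toy example: n = 5 words, d = 2, f = 2 filters, k = 3.
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
ones = [[1.0, 1.0]] * 3
zeros = [[0.0, 0.0]] * 3
feature_map = conv1d_valid(embedded, [ones, zeros])
print(len(feature_map), len(feature_map[0]))        # n - k + 1 positions, f filters
```

The result holds f × (n − k + 1) values in total, here stored position by position.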

C. Primary Capsule Layer
The primary capsule layer is fully connected to the next layer and consists of three transformations performed sequentially:
• Second convolutional layer: Similar to the previous convolutional layer, it performs a convolution operation on its input with a kernel size k, which shortens an input of length m to m − k + 1, that is, by k − 1 positions (assuming unit stride and no padding).
• Reshape: As mentioned previously, one of the significant contributions of CapsNet is that it deals with vectors. A scalar is a quantity with magnitude only, while a vector is a quantity with both magnitude and direction. This layer reshapes the input feature maps (scalar values) into an output vector map of the desired dimensions, yielding a set of vectors (capsules) instead of scalars.
• Squashing function: This ensures that the lengths of all vectors, which represent probabilities, are between 0 and 1 while preserving the orientations of the vectors (features detected), according to the following equation [18]:
$$ v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|} $$
where $v_j$ is the vector output of capsule $j$ and $s_j$ is its total input.
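A minimal sketch of the squashing non-linearity from [18] follows. The small constant `eps` is an added assumption to avoid division by zero for an all-zero input and is not part of the original formulation.

```python
import math

def squash(s, eps=1e-9):
    """Scale vector s so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)
    return [scale * x for x in s]

v = squash([3.0, 4.0])                       # input length 5
length = math.sqrt(sum(x * x for x in v))    # close to 25 / 26
print(v, length)
```

Long input vectors are mapped to a length just below 1, and short ones are shrunk toward 0, so the length can be read as a probability.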

D. Capsule Layer
The novel routing algorithm sits between the primary capsule layer and the capsule layer. The goal of routing-by-agreement is to send the output of the lower-level capsule (output of the primary capsule) with high weights to the capsule in the next layer (capsule layer) that it agrees with. To do that, it calculates the predicted output of the next layer by learning routing coefficients in multiple routing rounds. In other words, it strengthens routing weights where predictions made by primary capsules match secondary capsule outputs, based on the routing algorithm proposed in [18]. Routing-by-agreement can be seen as a middle ground between the max pooling usually used in CNNs, which loses some information, and fully connected layers: it reduces the noise forwarded to the next layer while keeping all the desired information for accurate classification.
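The routing rounds can be sketched in plain Python, following the dynamic routing procedure of [18]. The input `u_hat[i][j]` is assumed to already hold the prediction vector of lower capsule i for upper capsule j (the learned transformation matrices are omitted), and the function names and toy values are hypothetical.

```python
import math

def squash(s, eps=1e-9):
    norm_sq = sum(x * x for x in s)
    return [norm_sq / (1.0 + norm_sq) / (math.sqrt(norm_sq) + eps) * x for x in s]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, iterations=3):
    """Return upper-capsule outputs v and the coupling coefficients c of the last round."""
    num_in, num_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * num_out for _ in range(num_in)]      # routing logits start at zero
    v, c = [], []
    for _ in range(iterations):
        c = [softmax(row) for row in b]               # each lower capsule splits its output
        v = []
        for j in range(num_out):
            s_j = [sum(c[i][j] * u_hat[i][j][t] for i in range(num_in)) for t in range(dim)]
            v.append(squash(s_j))
        for i in range(num_in):                       # reward agreement (dot product)
            for j in range(num_out):
                b[i][j] += sum(a * z for a, z in zip(u_hat[i][j], v[j]))
    return v, c

# Two lower capsules agree about upper capsule 0 and disagree about upper capsule 1.
u_hat = [[[1.0, 0.0], [1.0, 0.0]],
         [[1.0, 0.0], [-1.0, 0.0]]]
v, c = dynamic_routing(u_hat, iterations=3)
print(c[0], [math.sqrt(sum(x * x for x in vec)) for vec in v])
```

In this toy case the coupling coefficients shift toward upper capsule 0, where the predictions agree, while the conflicting predictions for capsule 1 cancel out.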

E. Flatten, Dense, and Dropout Layers
The output of the previous layer (capsule layer) is a two-dimensional array (matrix), and the next layer is a dense layer that expects a one-dimensional array. The flatten layer is responsible for transforming the two-dimensional matrix of features into a vector by stacking the rows next to each other so that it can be fed into a fully connected layer for prediction. Then, instead of using the decoder proposed in [18], dropout is used as a regularization method against overfitting [35]; it randomly drops a percentage of the neurons in the flatten layer [36].

F. Output Layer
Since the problem addressed in this work is a binary classification, the final layer of this architecture is a dense layer that predicts the class of each input tweet. Several activation functions can be used for this layer, such as softmax or sigmoid, to label the tweet as security-related (positive) or not security-related (negative).

V. EXPERIMENTS
In this section, the hardware and software configurations used in the experiments are reviewed. In addition, the data set that was used, the embedding layer specifications, the baseline CNN model that was proposed for comparison purposes, and the optimization process that was conducted to tune the models' hyperparameters are described.

A. Tools
The experiments were run on Google Colab [37], a free cloud-based service, with a Tesla P4 GPU and 25 GB of RAM. The code was implemented in Python 3.6.9 with Keras 2.2.4 [38], using TensorFlow 1.15.2 [39] as a backend.

B. Data Set
The data set that satisfied the model requirements was the one created in [34]. It contains tweets that were retrieved from predefined security-related Twitter accounts and that mention the monitored infrastructures, denoted as A, B, and C, or their assets. The use of specialized Twitter accounts avoided retrieving tweets that contain a desired keyword without a security context, such as the words "apple, windows, network, virus, worm, root." The data set was already divided into three sets: training, validation, and testing. Two sets of security specialist accounts, denoted as S1 and S2, were used. The training and validation sets contained the tweets that were retrieved from the S1 accounts, while the testing set was compiled from the S1 and S2 Twitter accounts. The goal of using different sets of Twitter accounts was to give insights into the models' performance not only on unseen tweets but also on tweets retrieved from a different set of Twitter accounts. Another property of the data set was the time interval of the tweets: the validation and testing sets were retrieved from time intervals following that of the training set. This means that the obtained results would simulate the real deployment of the model. The collected tweets were then filtered based on a set of keywords describing the selected organizations and labeled as security-related or not.

C. Data Set Retrieval and Statistics
Because of Twitter's policy, which prevents publishing tweets in plain text, the data set was only available in the form of (tweet ID, label). Thus, a Twitter developer account was created to retrieve the text of the tweets from their IDs using Python and the Tweepy library. At the time of retrieval, some tweets could not be recovered because they had been deleted by the user or the user had been suspended. The data set was adapted to serve the work objectives as follows: the tweets from the three infrastructures were merged and duplicates were deleted, since division by organization was out of the scope of this study. In addition, to work with a balanced data set, the validation and testing tweets were merged, and 300 tweets from each class were specified as a validation set and the remaining as testing tweets, while keeping the classes balanced. The resultant data set statistics are shown in Table I.

D. Data Set Pre-Processing
The raw tweets were cleaned in a pre-processing step that followed the approach used by [34] for the same data set. In detail, each tweet was converted into lowercase, and special characters were removed, except for "." and "-", which were replaced with the words "dot" and "hyphen", respectively. These symbols were needed because they can appear in software versions and common vulnerabilities and exposures (CVE) numbers. Then, all numbers were converted into their textual representations to be analyzed as text, which resulted in tokenized tweets that were the input for the embedding layer, as described in Section IV-A.
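The cleaning steps can be sketched as follows under one reading of the description: "." and "-" are kept by replacing them with the words "dot" and "hyphen", and numbers are spelled out digit by digit. Both choices, the restriction to ASCII letters, and the function name `preprocess` are assumptions made for illustration; the exact rules of [34] may differ.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def preprocess(tweet):
    """Lowercase, strip special characters, spell out '.', '-' and digits, then tokenize."""
    text = tweet.lower()
    # Keep only ASCII letters, digits, whitespace, dots, and hyphens (assumption).
    text = re.sub(r"[^a-z0-9\s.\-]", " ", text)
    text = text.replace(".", " dot ").replace("-", " hyphen ")
    # Digit-by-digit conversion is an assumption about "textual representation".
    text = re.sub(r"[0-9]", lambda m: " " + DIGIT_WORDS[int(m.group())] + " ", text)
    return text.split()

print(preprocess("CVE-2017-0144!"))
print(preprocess("Update to v1.2 NOW"))
```

Under this reading, version strings and CVE identifiers stay distinguishable after cleaning because their separators survive as tokens.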

E. Word Embedding
The embedding layer received the tokenized tweets produced by the data pre-processing and converted each word into a high-dimensional vector. Word embedding is a widely used NLP technique for feature extraction that represents the semantic meaning of words [40]. In this work, two ways of initializing the embedding matrix were examined: using Keras embedding [38] and word2vec pre-trained word embeddings [41], both with 300 dimensions.

F. Baseline Convolutional Neural Network (CNN) Model
To choose an appropriate architecture for the baseline model used for comparison, the CNN model used as a baseline in the CapsNet-MNIST proposal [18] was adapted to suit the text data set. In [18], the CNN and CapsNet models were not architecturally similar, but the authors designed them with similar computational costs, which served the work's objectives. Similarly, the baseline CNN model of this work was built with three convolutional layers, a flatten layer, a dense layer, and a dropout layer, followed by a final dense layer for prediction.

G. Hyperparameter Tuning
Hyperparameters are the model's parameters that are not learned during training. These parameters should be set carefully because they affect how the model learns from the data. Manually selected values may be less than optimal, and the solution to that problem is hyperparameter tuning. This step was performed because one of the work objectives was finding the values that would lead to the best model performance. A random search was used for optimization [42]. For a fair comparison, 100 combinations were tested for each model, the CNN and the CapsNet, each trained for up to 200 epochs with early stopping after five epochs. Table II lists all the layers that were included in the search process, the values examined for each hyperparameter, and, in the final column, the optimal values based on the validation set results. Similarly, Table III shows the hyperparameter tuning specification for the proposed CapsNet model. To reduce the total number of combinations, hyperparameters such as the optimizers or the activation functions were not included because they were fixed in both models, and the default values were kept for batch size and learning rate.
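The random search procedure can be sketched as below. The search space values, the function names `random_search` and `toy_evaluate`, and the scoring function are hypothetical placeholders; they are not the values of Table II or Table III, and in practice `evaluate` would train a model and return its validation score.

```python
import random

def random_search(space, evaluate, n_trials=100, seed=0):
    """Sample n_trials random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Hypothetical search space, for illustration only.
space = {
    "filters": [64, 128, 256],
    "kernel_size": [2, 3, 4, 5],
    "dropout": [0.2, 0.3, 0.5],
}

def toy_evaluate(config):
    # Placeholder for "train the model and return its validation accuracy".
    return 1.0 - abs(config["dropout"] - 0.3) - 0.001 * abs(config["filters"] - 128)

best_config, best_score = random_search(space, toy_evaluate, n_trials=100)
print(best_config, best_score)
```

Using the same number of trials for both models keeps the tuning budget equal, which is the basis for the fair comparison described above.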

VI. RESULTS AND DISCUSSION
This study was conducted to investigate the use of CapsNet for cyberthreat detection by classifying tweets as security-related or not security-related. The model was built on two hypotheses: that CapsNet could classify security tweets more accurately than a CNN, and that routing-by-agreement is more efficient than pooling. To test these hypotheses, the final architectures generated from the hyperparameter tuning process were trained to minimize the validation loss using the binary cross-entropy loss function and were evaluated using the classification accuracy and F1 score. Classification accuracy is the ratio of correct predictions (positive and negative) to the total number of samples and is computed as in [43]:
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
while the F1 score is calculated as:
$$ F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
where $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives, $\text{Precision} = TP / (TP + FP)$, and $\text{Recall} = TP / (TP + FN)$.
For the first hypothesis, the CapsNet model was compared with a CNN model that did not include a pooling layer. The second hypothesis was tested by comparing the CapsNet with a CNN model that had a pooling layer, to show the efficiency of routing over pooling. As can be seen in Table IV, the proposed model achieved competitive results compared with the strong baseline models. In addition, the accuracy and F1-score comparisons are presented in Fig. 4 and Fig. 5, respectively. In all three models, using the pre-trained word2vec embedding gave better results than random initialization. In general, the CapsNet models' results were better than those of the CNN, followed by the CNN with a pooling layer. The CapsNet model with word2vec achieved the best results, with 92.21% accuracy and a 92.20% F1 score, while the worst result belonged to the CNN model with a pooling layer, at 91.01% accuracy and an F1 score of 90.84%. Comparing the CNN models to each other suggests that using a pooling layer in text classification tasks is not a wise choice, at least in the context of this work.
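Both metrics can be computed from the confusion-matrix counts, as in the following minimal sketch. The function names and the toy labels are illustrative assumptions and are unrelated to the reported results.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = security-related)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, tn, fp, fn), f1_score(tp, fp, fn))
```

On a balanced data set such as the one used here, accuracy and F1 tend to be close, which matches the pattern in the reported numbers.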

VII. CONCLUSION
Securing software, hardware, data, and services has become a crucial part of any organization's management due to the increasing number of emerging attacks that threaten its security. In this study, the novel DL algorithm CapsNet was utilized along with Twitter data to provide accurate classification of tweets for the purposes of cyberthreat detection. A random search was used for hyperparameter tuning with both random and word2vec embeddings. The CapsNet model was built using the hyperparameters found by the random search and was then compared with two CNN architectures: a CNN baseline model without a pooling layer and a CNN baseline model that included a pooling layer. The models were evaluated using accuracy and F1 score, and several observations were drawn from the results. First, routing-by-agreement was more efficient than pooling. Second, in all three models, the pre-trained word2vec embedding achieved better results than random embedding. Third, the proposed CapsNet model outperformed its strong competitor, the CNN, with 92.21% accuracy and a 92.2% F1 score.
Because there is always room for improvement, we plan to compare the obtained results with a recurrent neural network (RNN) model, given that CapsNet introduces improvements on its architecture. In addition, we aim to examine replacing the CNN layers in the CapsNet with an RNN to take advantage of processing tweets word by word rather than the whole tweet at once. We also aim to test more embeddings, such as GloVe, fastText, BERT, and ELMo.