Relational Deep Learning Detection with Multi-Sequence Representation for Insider Threats

Abstract: Insider threats are typically challenging to detect because security protocols struggle to recognize the anomalous behavior of privileged users in the network. Intuitively, an insider threat detection model analyzes audit data, which represents trusted users' activity streams, to recognize malicious behavior. However, audit data is high dimensional: it presents n dependent streams of activities, which makes feature extraction complex. In this context, the dependent streams represent user activities, where each activity is an ordered set of real variables that pertain to a specific occurrence, such as log-in records. As a result, multiple actions can be represented simultaneously, with one or more values recorded at each timestamp. Moreover, the relations between dependent streams are typically neglected when detecting anomalous behavior, although relation learning is commonly used to recognize occurrence patterns in streaming data. Thus, the latent relations are expected to provide insight for the accurate detection of anomalous behavior concerning insider threats. This study introduces a novel model that detects insider threats by representing audit data as multivariate time series and explicitly learning the inter-relations between activity streams using a Recurrent Neural Network (RNN). The model learns the latent relationships to extract features for modeling the behavior profile, so that anomalous behavior can be detected accurately. The evaluation on the CERT dataset shows that the proposed model outperforms the comparator approaches to insider threat detection, with an AUC of 0.99.


I. INTRODUCTION
With the rapid advancement and growth of networking technology, cyber threats have become a significant issue for numerous companies and organizations worldwide [1]. A cyber threat is mainly realized by breaching network security, and the primary means of doing so with malicious intent is malware [2]. In this context, malware breaches the secured network through an external malicious component such as a rootkit, Trojan horse, virus, or worm [3]. Thus, the usual way to secure the network against such external threats is a perimeter defense, e.g., firewalls, antivirus software, and intrusion prevention/detection systems.
However, a cyber attack can also be triggered from an internal source in the network; this is known as an insider threat [4]. In a typical insider threat, a legitimate user conducts harmful work on the network, such as leaking, altering, or disrupting sensitive data. Thus, a malicious insider threat is typically realized as an abnormal action, or behavior, in the network flows that is performed by a legitimate user [5]. Fig. ?? shows a conceptual illustration of the internal and external attacks that can affect such a network in cyberspace. It can be observed that the external attack is visible to the perimeter defense protocols and can therefore be prevented or detected. On the other hand, the insider attack is commonly deceptive, as the perimeter defense can hardly detect attacks conceived from inside the network. Therefore, insider attacks pose a critical challenge in the cyber security domain, and practical solutions to detect insider threats remain in demand among a large number of organizations [6].
To detect insider threats, the typical solution relies on systems that analyze the user's behavior to discover anomalies in the network [7]. The idea is to observe the user's daily activities and tasks, which yield frequent network usage patterns. These regular activities reveal patterns from which the typical behavior of a legitimate user can be mapped. The user's behavior is commonly analyzed using Machine Learning (ML)-based approaches (see, for instance, [8], [9], [10], [11]). ML approaches such as the Hidden Markov Model (HMM) and the Support Vector Machine (SVM) have been used to detect insider threats by modeling the behavioral profile from audit data (daily activities) such as log events. Typically, these ML approaches, which are known as shallow learning methods [12], require careful feature engineering to model the behavior profile accurately. The reason is that audit data is composed of a large volume of unstructured, high-dimensional, and sparse instances, which makes feature extraction a non-trivial task. Traditional approaches have modeled the user's behavior from aggregated data consisting of the user's activities within a single day. However, missing features can lead to unpredictable behavior and unbalanced detection alarms; for example, they can increase the number of false alarms in the system. Deep Learning (DL), a subset of ML, has been employed to address this drawback in insider threat detection [13], [4], [14], [15]. DL has the advantage that features can be learned directly from unstructured data [16]. Moreover, features can be extracted sequentially, which reduces the loss of temporal features, something that is not possible with shallow learning methods.
Nevertheless, existing insider threat detection approaches neglect the temporal representation of multivariate sequences when modeling the user's behavior, and this limitation hinders the development of an accurate detection model. Intuitively, user activities are recorded over time as sequences of dependent variables, i.e., at each timestamp, one or more variables, such as user id and log-in/off time, are recorded simultaneously. Therefore, incorporating more of these dependent variables can increase the chance of modeling the user's behavior well. It is worth mentioning that increasing the data volume improves the learning accuracy as per Bonferroni's principle [17]. This, in turn, motivates structuring the entire audit data as temporal sequences, i.e., a multivariate time series representation, which is expected to capture all possible behavior patterns of the user. Moreover, although audit data consists of multi-sequential actions, the relations between these sequences are not well considered in previous studies. Thus, the conjecture is that the sequences related to one user exhibit strong relations that describe unique user-related patterns, which can be used for insider threat detection.
This study proposes a novel model that uses DL to learn the user's behavior for insider threat detection from a multivariate representation of audit data. The model represents the user activities as a set of dependent sequences in the temporal domain, from which the hidden relations are extracted and learned. In short, each activity is represented as an ordered set of real values that refers to some event, such as log-in records. Several activities can thus be represented simultaneously, such that one or more values are recorded at each timestamp; i.e., the user activities are represented as multivariate time series streams in which n values are recorded at each time tick. We then use a Recurrent Neural Network (RNN) to learn the latent temporal relationships between sequences and map the hidden patterns. Thereafter, the model extracts features, and the recognized behavior is classified as normal or anomalous, where the latter indicates a possible insider threat.
The contributions of the proposed work can be summarized as follows:
i) The study presents a novel method to model user behavior in a multivariate time series structure. More specifically, the user behavior is represented using all sequences of events that denote the user activities. The intuition is that time series frequencies present accurate, readable user behavior patterns because they incorporate exclusive features, not only aggregated ones.
ii) The developed model learns and extracts the relations between the multivariate sequences to find patterns in the training data. The existing sequences hold latent relations that can be extracted for accurate behavior modeling, leading to the accurate prediction of anomalous behavior.
iii) The proposed model applies deep learning to extract features from the inter-correlated streams that represent audit data of user activities. This matters because user activities are stochastic and unstructured, and deep learning is well suited to extracting features from unstructured data.
The remainder of this paper is structured as follows. Section II presents an overview of the related work. Section III introduces the proposed model. Section IV presents the evaluation and results of the proposed model. Section V provides conclusions and future directions in which this study can be extended.

II. RELATED WORK
ML has been increasingly used for cyber threat detection throughout the previous decade [18]. Generally speaking, ML has been shown to be advantageous in identifying and classifying anomalous occurrences in network streams [19], [20]. The insider threat is a well-known type of cyber attack [21] that has received considerable attention from ML approaches. An insider attack detection model depends heavily on the user's daily activities, from which the behavior is recognized. Typically, the obtained data is unstructured and complex due to the diversity of the user's activities on the network. Thus, modeling the user's behavior is a relatively intricate task. In the literature, most ML approaches for insider threat detection are data-driven, in that the user activities are aggregated into representative features from which the behavior profile is modeled. In this context, insider threat detection is based on different examples of data instances, and the detection model is treated as an anomaly detection problem. Accordingly, various ML methods have been developed for insider threat detection. For example, SVM is used as a one-class detection method for insider attacks [22]. The study presented in [9] proposed an HMM to map the typical user behavior based on weekly activities. The insider attack is detected by computing a deviation score between sequences; a low probability score could indicate a probable attack. In [8], a set of supervised and unsupervised ML approaches was evaluated for insider attack detection. The study used the Self-Organizing Map (SOM), HMM, and Decision Tree to model malicious behavior for anomaly detection. The features were extracted in two categories, numerical and sequential. Whichever category was employed, the features were aggregated into a set of weekly representative instances. Thus, the detection model was complex due to the need for extensive feature engineering.
DL methods have been proposed to reduce the need for manual feature extraction and to handle the large volume of features that the detection model must learn from insider attack data. A comprehensive survey of the state of the art of DL for insider threat detection is given in [5]. A number of DL architectures have been employed, such as the deep feed-forward neural network [23], [24], recurrent neural network (RNN) [14], [25], convolutional neural network [26], and graph neural network [13], [27]. Due to the complexity of the data structure, the majority of DL approaches have focused on representing subsequences, such as one-day activities, as the detection granularity. In practice, each session is a subsequence that denotes a series of activities, e.g., "login" and "log-off" events. Whenever a subsequence contains malicious activity, it is designated as a malicious subsequence in which a possible attack could occur. Consequently, detecting abnormal actions is difficult because only limited information (features) can be leveraged. Moreover, the extraction of relations between sequences is ignored, even though latent relations can bring insight for better modeling of the user's behavior.
This study addresses the above-mentioned drawbacks by representing all activities (sequences) as multivariate time series streams, where the relations between streams are also considered when building the behavior profile. The study also leverages the RNN for effective feature extraction from temporal/sequential data. Recall that the RNN has shown effective feature learning on sequential data for anomaly detection [28], [29], [30].

III. PROPOSED MODEL
This section elaborates on the proposed insider threat detection model. Fig. ?? illustrates an overview of the model flow. As can be seen from the figure, the model operates in several consecutive steps: i) user activity representation, ii) sequence activity embedding, iii) latent relations learning, iv) feature learning, and v) anomaly detection. The following subsections give further detail on each step.

A. User Activity Representation
The primary step in the proposed model is to represent the user's activities as multivariate time series. As noted in the introduction, the user's activities can be structured as a series of temporal activities recorded at each time tick. Several data points (characteristic values) are recorded at each timestamp, such as log-in/off information, the user's id, and HTTP data.
More formally, a single activity $A_1$ consists of an ordered sequence of $n$ data points such that $A_1 = \{p_1^1, p_1^2, \ldots, p_1^n\}$, where each point $p$ maps to an encoded real value of an event. The entire set of activities is expressed as a whole series $S$ that represents $m$ activities, i.e., $S = (A_1, A_2, \ldots, A_m) \in \mathbb{R}^{m \times n}$, where $n \in \mathbb{N}$ is the length of the time series, i.e., the number of data points in each single activity. Intuitively, $S$ is structured as a matrix consisting of all of the user's activities, whose samples are denoted as $p$.
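As a minimal sketch of this representation, the following Python snippet assembles a matrix $S$ with one row per activity stream and one column per time tick. The toy event list, the stream names, and the choice of zero for ticks without a recorded value are illustrative assumptions, not details taken from the dataset or from the proposed model.

```python
import numpy as np

# Hypothetical toy log: (time tick, activity name, encoded real value).
events = [
    (0, "logon", 1.0), (0, "http", 3.0),
    (1, "device", 1.0), (1, "http", 5.0),
    (2, "logon", 0.0), (2, "http", 2.0),
]

def build_series(events, activities, n_ticks):
    """Return S with one row per activity and one column per time tick."""
    row = {name: i for i, name in enumerate(activities)}
    # m x n matrix; entries stay zero where nothing was logged (an assumption).
    S = np.zeros((len(activities), n_ticks))
    for tick, name, value in events:
        S[row[name], tick] = value
    return S

activities = ["logon", "device", "http"]          # m = 3 activity streams
S = build_series(events, activities, n_ticks=3)   # n = 3 time ticks
```

Each row of `S` plays the role of one activity $A_i$, so several values are available at every time tick.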

B. Sequence Activity Embedding
The represented activity sequences have a wide range of features that can be related in diverse ways. For example, if we consider the log-in and user id sequences of two different users, the sequences that belong to the same user are likely to be strongly related. Thus, the idea is to portray each sequence in a flexible fashion that captures, in a multidimensional manner, the various characteristics that underpin its behavior. To this end, each activity sequence $A_i$ is encoded as an embedding vector $v_i \in \mathbb{R}^d$, for $i \in \{1, \ldots, m\}$. Note that the embedding vectors are randomly initialized and then trained with the remainder of the model. Moreover, sequences with comparable embedding values should have a strong inclination to be connected, since similar embeddings indicate similar activities.
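The embedding step can be pictured as a small lookup table with one randomly initialized row per activity stream. The sketch below is only illustrative: the stream names, the dimension `d = 8`, and the normal initialization with scale 0.1 are assumptions, and the training that would later update these vectors is not shown.

```python
import numpy as np

activities = ["logon", "logoff", "device", "http"]  # hypothetical stream names
d = 8                                                # hypothetical embedding size
rng = np.random.default_rng(0)                       # fixed seed for repeatability

# Embedding table: row i holds v_i in R^d, randomly initialized.
V = rng.normal(loc=0.0, scale=0.1, size=(len(activities), d))
index = {name: i for i, name in enumerate(activities)}

def embed(name):
    """Look up the embedding vector of one activity stream."""
    return V[index[name]]

v_http = embed("http")
```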

C. Latent Relations Learning
The subsequent step is to learn the relations between the embedded vectors. A natural way to do so is to use a directed graph architecture. Given embedded vectors $V = \{v_1, \ldots, v_{|V|}\}$, they are mapped to a graph structure $V \to (\mathcal{N}, \xi)$ with nodes and edges, where nodes denote embedded vectors, such that $v_i \in \mathcal{N}$ for $i \in \{1, \ldots, |\mathcal{N}|\}$, and edges represent pairs $\varepsilon_i \in \xi$ with $\varepsilon_i = (v, v') \in \mathcal{N} \times \mathcal{N}$. In this study, we implement a directed graph representation for latent relations learning, with directed edges $v \to v'$ between nodes, because the dependency patterns between vectors do not have to be symmetric. Thus, an edge between two vectors represents a relative dependency relationship in which the first vector is used to model the behavior of the second vector. The node vector of a given node $v$ is denoted by $h_v \in \mathbb{R}^d$. Each node is given a label $\ell_v \in \{1, \ldots, L_{\mathcal{N}}\}$ that indicates the activity type, e.g., the log-in activities, and each edge $\varepsilon_i$ is likewise given a label $\ell_{\varepsilon} \in \{1, \ldots, L_{\xi}\}$.
The dependency between vectors is represented as a set of candidate relations $R_i$ for each vector, such that $R_i \subseteq \{1, \ldots, m\} \setminus \{v_i\}$. The dependencies related to $v_i$ are selected by searching for the most similar candidates. To this end, we compute the similarity $\theta$ between the node of edge $\varepsilon_i$ and the embeddings of its candidate relations $j \in R_i$ using the dot product. Equation 1 gives the similarity measure $\theta$.
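The candidate selection described above can be sketched as follows, assuming a plain dot product $\theta_{ij} = v_i \cdot v_j$ and a hypothetical parameter `k` for the number of most similar candidates kept per node. The text does not state how many candidates are retained, so `k` and the helper name `learn_directed_edges` are illustrative only.

```python
import numpy as np

def learn_directed_edges(V, k):
    """For each node i, keep the k candidates j != i with the largest dot product.

    Returns theta (m x m similarity matrix) and a list of directed edges (j, i),
    read as "stream j is used to model the behavior of stream i".
    """
    m = V.shape[0]
    theta = V @ V.T                      # theta[i, j] = v_i . v_j
    masked = theta.copy()
    np.fill_diagonal(masked, -np.inf)    # a node is never its own candidate
    edges = []
    for i in range(m):
        top = np.argsort(masked[i])[::-1][:k]   # k most similar candidates
        edges.extend((int(j), i) for j in top)
    return theta, edges

# Toy embeddings: streams 0 and 1 are alike, stream 2 is different.
V = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
theta, edges = learn_directed_edges(V, k=1)
```

The resulting edge list need not be symmetric: here stream 1 is selected to model stream 2, while stream 2 is not selected to model stream 1.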

D. Feature Learning
Having represented the relations between embedded vectors (graph nodes), the next step is to extract and learn features. The idea is to establish an abstracted feature space using an RNN to fuse a node's information with that of its neighbors. In the proposed model, feature extraction includes the embedding vector $v_i$, which describes the different behaviors of the different vector types. Feature extraction is accomplished using Long Short-Term Memory (LSTM), a well-known RNN architecture. The main benefit of adopting an LSTM unit is that the cell state averages activities over time, which helps to avoid vanishing gradients and to better capture long-term relationships in time series.
At each time tick, an individual entry node $v$ maps to a hidden layer $h^{(t)}$. Each hidden unit has a memory cell $c^{(t)}$ to capture long-term dependencies. The intuition is that $c^{(t)}$ serves to remember the effect of the prior input layer. In the proposed model, the mapping function uses three non-linear gates to manage access to the cell $c^{(t)}$: the remember vector $v_r$, the save vector $v_s$, and the focus vector $v_f$. Equations (2), (3), and (4) express the gate vectors $v_r$, $v_s$, and $v_f$, respectively, Equation (5) the memory cell control $c^{(t)}$, and Equation (6) the mapping function $h^{(t)}$.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 5, 2022
where $[h^{(t-1)}; v^{(t)}] \in \mathbb{R}^d$ is the sum of the prior hidden state $h^{(t-1)}$ and the current vector $v^{(t)}$ along with a bias term, $\odot$ is the element-wise multiplication, and $\sigma$ is the non-linear Rectified Linear Unit (ReLU) activation function [31]. Here, the resulting output is an aggregated representation $Z_i$ of the hidden layers at time $t$ for an input node $v_i$,
where $v_i^{(t)}$ is the input for a given node at time $t$, with ReLU activation $\sigma$.
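For readers unfamiliar with gated recurrent updates, the following NumPy sketch shows one step of a textbook LSTM cell. It is not the exact formulation of the proposed model: it assumes that the remember, save, and focus vectors correspond to the usual forget, input, and output gates, it concatenates $h^{(t-1)}$ and $v^{(t)}$, and it uses the standard sigmoid and tanh non-linearities rather than the ReLU activation mentioned in the text. All sizes and weights are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(v_t, h_prev, c_prev, W, b):
    """One textbook LSTM step for a single node.

    W has shape (4 * d_h, d_h + d_in) and b has shape (4 * d_h,); they stack
    the parameters of the three gates and of the candidate memory.
    """
    d_h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, v_t]) + b
    v_r = sigmoid(z[0 * d_h:1 * d_h])        # remember gate (assumed: forget gate)
    v_s = sigmoid(z[1 * d_h:2 * d_h])        # save gate (assumed: input gate)
    v_f = sigmoid(z[2 * d_h:3 * d_h])        # focus gate (assumed: output gate)
    c_tilde = np.tanh(z[3 * d_h:4 * d_h])    # candidate memory content
    c_t = v_r * c_prev + v_s * c_tilde       # element-wise memory update
    h_t = v_f * np.tanh(c_t)                 # hidden state exposed to later layers
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h, T = 3, 4, 5                       # hypothetical sizes
W = rng.normal(scale=0.1, size=(4 * d_h, d_h + d_in))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(T):                           # run over a short random sequence
    h, c = lstm_step(rng.normal(size=d_in), h, c, W, b)
```

Because the memory update is additive and gated, information from early time ticks can persist in `c` across many steps, which is the property the text relies on for long-term dependencies.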

E. Anomaly Detection
Once the relations are learned as described in the previous subsection, the next step is to determine how anomalous behavior can be detected. The idea is to measure how an unseen behavior deviates from the learned relations for each user. Thus, the proposed model calculates a similarity score $\Phi$ between the observed behavior of the user, which results from the learned relations, and the new abstracted stream $\hat{Z}$. To ensure a robust calculation of the similarity score, we normalize the score for each input node using the median $u$ of the difference between the 1st and 3rd quartiles of the distribution $\alpha$.
Recall that the inter-quartile range is an effective measure of a distribution's spread for anomaly detection in stream data [32]. The anomaly score $Anom_{sc}$ is then the maximum value of $\phi$ computed at time $t$. Hence, the stream is classified as either normal or anomalous activity at some fixed threshold. The user can configure the threshold value; in the evaluation of this study, the value is set to the maximum over the validation data. Thus, the stream is labeled as an anomaly whenever its score exceeds this maximum.
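The scoring and thresholding logic can be sketched as follows. This is an illustration under stated assumptions rather than the model's exact formula: the per-node score is taken to be a non-negative deviation between observed and expected behavior, the normalization is assumed to be (score minus median) divided by the inter-quartile range, and the validation and test arrays are synthetic.

```python
import numpy as np

def robust_normalize(dev, ref):
    """Normalize per-node deviations by the median and inter-quartile range of ref.

    dev, ref: arrays of shape (T, m) with one deviation per time tick and node.
    """
    med = np.median(ref, axis=0)
    q1, q3 = np.percentile(ref, [25, 75], axis=0)
    iqr = np.maximum(q3 - q1, 1e-8)          # guard against a zero spread
    return (dev - med) / iqr

def anomaly_scores(dev, ref):
    """Anomaly score per time tick: the maximum normalized deviation over nodes."""
    return robust_normalize(dev, ref).max(axis=1)

rng = np.random.default_rng(0)
val_dev = np.abs(rng.normal(size=(200, 3)))   # synthetic validation deviations
test_dev = np.abs(rng.normal(size=(50, 3)))   # synthetic test deviations
test_dev[10, 1] = 25.0                        # inject one obvious outlier

# Threshold set to the maximum anomaly score seen on the validation data.
threshold = anomaly_scores(val_dev, val_dev).max()
flags = anomaly_scores(test_dev, val_dev) > threshold
```

Taking the maximum over nodes means that a single strongly deviating stream is enough to flag a time tick, while the median and inter-quartile range keep one noisy stream from dominating the scale.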

IV. PERFORMANCE EVALUATION
This section evaluates the proposed model to determine its efficacy. The main objective of the evaluation is to show the effectiveness of relation-based learning with LSTM in detecting insider anomalies in network flows. Moreover, the evaluation examines the model's performance under different LSTM parameter settings to determine how they affect the detection accuracy. Finally, the evaluation shows how well the proposed model performs compared with baseline methods for detecting insider attacks.
A. Evaluation Settings
1) Dataset: The evaluation experiments were conducted on the CERT dataset [33], a publicly released insider threat dataset. It consists of activity data for more than 1k users, with 32 million events (log lines) generated over 502 days. Around 7k log lines among the recorded activities represent anomalous actions; these logs were manually placed into the data records by specialists. Each log line stores data pertaining to log-in, log-off, device, and HTTP activity. In the experiment, each user action is parsed into a vector that includes the id, date, user, computer, and activity type, and these vectors form the multivariate time series sequences. The dataset was split into training (70%) and testing (30%) sets for all conducted experiments.
where $a_i$ is the actual value and $p_i$ is the predicted value.
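The metrics reported in the results (MSE, Pre, Rec, and F1) can be computed as in the sketch below. It assumes the standard definitions, with MSE as the mean of $(a_i - p_i)^2$ and precision, recall, and F1 derived from true positives, false positives, and false negatives; the function names and the toy labels are illustrative.

```python
import numpy as np

def mse(actual, predicted):
    """Mean squared error between actual values a_i and predicted values p_i."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean((a - p) ** 2))

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = anomaly, 0 = normal)."""
    y_true, y_pred = np.asarray(y_true, bool), np.asarray(y_pred, bool)
    tp = int(np.sum(y_true & y_pred))     # anomalies correctly flagged
    fp = int(np.sum(~y_true & y_pred))    # normal items wrongly flagged
    fn = int(np.sum(y_true & ~y_pred))    # anomalies that were missed
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f1

# Toy example: three true anomalies, two of which are detected, no false alarms.
pre, rec, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 0, 0])
```

In this toy case precision is perfect while recall is two thirds, which shows how a detector can raise few false alarms and still miss part of the anomalous activity.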
3) Experiment Settings: The experiments were conducted using TensorFlow (https://www.tensorflow.org/). A number of LSTM parameters, including the epochs, batch-size, and hidden-layers, were tested while learning relations and extracting features. Further detail concerning the choice criterion is given in the following section. Recall that batch-size refers to the size of the embedding vector $v_i$ of an activity stream as proposed in the model. The model is trained using the Adam optimizer with a learning rate of 0.01.

B. Results
The model has been evaluated under different LSTM parameter settings to determine the best performance of relation learning and feature extraction. To this end, we vary the epochs, batch-size, and hidden-layer size. For batch-size, we conduct a grid search over {32, 64, 128, 256}. For hidden-layer, we conduct a grid search over {8, 16, 32, 64, 128}. Each grid search is run with epochs = 50 and epochs = 100. Fig. 3 illustrates the performance of the proposed model in terms of MSE for the different parameter settings. The figure shows that epochs = 100 generally produces better performance than epochs = 50 across the different batch-size and hidden-layer values. The best MSE results are recorded with batch-size = 256. Larger batch-size values yield better results, although we found that the running time increases accordingly. Efficiency in terms of time complexity is beyond the scope of this study, although it is worth considering in other settings such as online feature extraction. Moreover, the model yields better results as hidden-layer gets larger, with the best at hidden-layer = 128. Fig. 4 shows the ROC curves of the proposed model with respect to different hidden-layer sizes. The best result is recorded with AUC = 0.99 at hidden-layer = h = 256.
The proposed model has also been evaluated in terms of F1, precision (Pre), and recall (Rec) against baseline methods. This evaluation compares the detection of anomalous behavior using relation-based feature learning of temporal streams, as in the proposed model, with detection using aggregated features based on a statistical abstraction of the audit data. We adopt the following baselines: SVM, HMM, and a shallow Neural Network (NN). Note that the latter uses a feedforward structure with one layer, batch-size = 265, and hidden-layer = 128. Table I shows the results obtained for all methods; for clarity, we denote the proposed method as Rel-RNN. It can be seen from the table that Rel-RNN records the best results, with 0.80, 99.12, and 67.12 for F1, Pre, and Rec, respectively. To further illustrate the results, Fig. ?? shows the ROC curves of the proposed model and the baseline methods. It can be observed that Rel-RNN obtains the best AUC value, 0.99.
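The parameter sweep described in this subsection can be organized as a simple exhaustive grid search. In the sketch below the three value lists are the ones stated in the text, while `dummy_evaluate` is a hypothetical stand-in for training the model and measuring its validation MSE; it produces no real results and only makes the loop runnable.

```python
from itertools import product

batch_sizes = [32, 64, 128, 256]
hidden_sizes = [8, 16, 32, 64, 128]
epoch_options = [50, 100]

def grid_search(evaluate):
    """Return (best_config, best_mse) over the full grid; lower MSE is better."""
    best_cfg, best_mse = None, float("inf")
    for epochs, batch, hidden in product(epoch_options, batch_sizes, hidden_sizes):
        cfg = {"epochs": epochs, "batch_size": batch, "hidden_size": hidden}
        score = evaluate(cfg)
        if score < best_mse:
            best_cfg, best_mse = cfg, score
    return best_cfg, best_mse

# Hypothetical placeholder for "train the model and return its validation MSE".
# It simply returns smaller values for larger settings so the loop has
# something to minimize; it does not reproduce any measured result.
def dummy_evaluate(cfg):
    return 1.0 / (cfg["epochs"] * cfg["batch_size"] * cfg["hidden_size"])

best_cfg, best_mse = grid_search(dummy_evaluate)
```

With two epoch settings, four batch sizes, and five hidden-layer sizes, the full grid contains 40 configurations, so the cost of the sweep grows multiplicatively with every added option.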

V. CONCLUSION
This study has proposed a novel model for insider threat detection. The proposed model structures the audit data, which represents the daily activities, as a multivariate time series that covers broader characteristics for better learning of user behavior. Thus, the temporal sequence of exclusive events is considered rather than an abstract set of features. The represented sequences are fed into an RNN model to learn hidden relations for feature extraction. The relations between representative features are learned to identify latent patterns in the sequences for recognizing malicious behavior. To maintain the consecutive temporal lags between the set of features, an LSTM has been used, which also avoids the vanishing gradient problem. The evaluation on the CERT dataset has shown that the proposed model outperforms the comparator baselines for insider threat prediction. In the future, the plan is to incorporate spatio-temporal dependencies to determine whether they affect the modeling of latent relations when profiling the user's behavior. This is desirable when users are authorized to access the network remotely from different places, as was observed during the COVID-19 pandemic, when most companies and organizations allowed employees to access networks from distant locations.