Univariate and Multivariate Gaussian Models for Anomaly Detection in Multi Tenant Distributed Systems

Abstract: Because they share memory, configuration, and network access, networked distributed systems have always been susceptible to cyber intrusions. Co-tenants on the same server give attackers the chance to monitor the activity of other users and to launch an attack when those users' security is at risk. Building completely secure network topologies that are immune to risks and attacks has traditionally been the goal, but the open-ended nature of such systems makes a 100 percent safe architecture hard to achieve. Regardless of the type of attack, however, the system parameters and infrastructure state under which the attack is instantiated leave a footprint that can be detected. With the increased use of machine learning algorithms and data-gathering tools, abnormal behavior and subsequent attack possibilities can now be modeled from network parameter values. This work proposes a Gaussian model to estimate the likelihood of an attack from selected system parameters. A univariate and a multivariate Gaussian model are fitted on the training dataset, and various threshold values are used to predict whether a data point is an inlier or an outlier. Accuracies are examined for the various threshold values. Class imbalance is an important challenge in anomaly detection; because the models in this work are fitted on training data only, class imbalance is not a problem here. The data-driven results show that combining machine learning with Gaussian-based models might be a useful tool for analyzing network intrusions. While further steps are being taken to improve the security of digital spaces, machine learning algorithms can be used to examine abnormal behavior that would otherwise go unchecked.


I. INTRODUCTION
Cloud computing is one of today's most in-demand technologies. It offers a practically unlimited pool of IT resources and high computing speed, but public clouds raise serious security problems in multi-tenant environments. Because of these security concerns, most government and commercial organizations do not migrate their sensitive and private data to the public cloud and instead make do with the limited resources and performance of their existing infrastructure. Finding a way to protect private space on public clouds would address these problems.
Multi-tenant distributed systems (MTDS) let multiple clients use the same services and cohabit on the same network. As a result, each client can observe the activity of the others. By becoming a client of such a system and exploiting this visibility, attackers can launch attacks against one or more other tenants of the system [1]. To prevent damage to any entity in the system, such an attack must be detected promptly [2]. Although cyber security protections have improved, intrusions in distributed multi-tenant systems remain an active research problem. The MTDS service provider does not inquire about a tenant's motives when the tenant requests co-allocation. This gives tenants with malicious intent a chance to observe and collect confidential information about the target tenants [4]. Because the attacking tenant has access to sensitive information, it can prepare an attack with a greater likelihood of success [3]. Several attempts have already been made to identify intrusions in distributed applications using a variety of techniques [5], [6]. Earlier work emphasized statistical techniques that compute specific function values; more recently, approaches such as deep learning, including artificial neural networks, have been investigated.
Rule-based engines have been used to identify attacks, but they frequently fail to spot newly discovered threats. Transfer learning may help in this situation, but there is no guarantee that the variables of the source task and the target task are identical, which has been a significant obstacle to its application [7]. This work proposes a Gaussian-based classifier for identifying potential intrusions in a multi-tenant distributed system. A threshold value is defined to separate inliers from outliers, and the accuracy obtained with different threshold values is examined. The authors are thankful to Patil and Ingale [8] for providing the dataset.
Section II presents a literature survey of research on network attack detection. It explores machine learning algorithms used to detect network attacks and to improve cyber security. Section III describes the experimentation performed to create and collect the dataset. Because a network attack is not a continuous or regularly occurring event, only a small number of attacks were performed to create the dataset, which therefore consists mostly of non-attack instances and very few attack instances. Section III also includes statistical and graphical representations of the collected dataset. Section IV explains the creation of the univariate and multivariate Gaussian models for anomaly detection and analyzes the performance of each model. Section V concludes the paper.

II. RELATED WORK
Network attack detection has historically relied heavily on signature-based detection. This approach analyzes an attack's "signature," or distinctive characteristics, to predict potential future threats [9]. Hilker et al. [10] suggested methods to discover the best attack signatures. Han et al. [11] advocated crafting network traffic using several attributes. A significant problem with this technique is that the system cannot identify new, previously unseen attacks, because nothing is known about them. In addition, each new effort to derive signatures requires human labor and time.
There have also been initiatives to employ machine learning algorithms in this field. Supervised learning algorithms have traditionally been used to identify network attacks [12]. For attack detection, Zseby et al. favoured feature selection followed by mapping [13]. Rafique et al. [14] used evolutionary algorithms to evaluate the effectiveness of malware classification. It should be highlighted that attacks are extremely rare, so a model may get away with predicting every data point as a non-attack and still show good accuracy while missing every attack, making the entire process exceedingly costly.
Prior strategies likewise emphasized boosting techniques and feature reduction in transfer learning. Dai et al. [15] introduced TrAdaBoost, which reweights the data from the positive and negative classes so that the uncommon examples that indicate attacks carry more weight in the outcome. Pan et al. used transfer component analysis (TCA) to project the features of the domains closer to one another in a common space [16]. Shi et al. [17] created HeMap, a technique that projects features using linear transformations. Patil and Ingale [8] tackled the class imbalance problem and used an ensemble-based meta-classifier to detect anomalies.
Model-based methods have also been used to detect attacks. This strategy falls under the category of transfer learning and makes the crucial assumption that the source task and the target task share at least some parameters or model priors. Bekerman demonstrated how transfer learning may help increase the resilience of malware detection in previously unseen situations [17].
A noteworthy observation across these prior methods is that the stark class imbalance seen in network attacks was hardly discussed. Because of this imbalance, measures other than overall accuracy should also be reported to shed light on the results that were produced. This work fits a Gaussian model on the training dataset. The advantage of this method is that class imbalance does not hinder it.
The research community continues to contribute to improving cyber security and the security of multi-tenant distributed systems. Despite all these efforts, attackers are still able to place a compromised or anomalous virtual machine alongside a target virtual machine, which increases the probability of a successful attack on the target. Detecting new types of attack made possible by co-residence, co-location, and co-tenancy of an attacker virtual machine with a target virtual machine remains a challenge for researchers. In this work, univariate and multivariate Gaussian models are created to detect network attacks, and the performance of each model is analyzed.

A. Dataset Collection
The dataset was collected by Patil and Ingale [8] using Netdata, a program for real-time performance monitoring that creates system logs. The logs were collected across 28 files, which this work combines into a single dataset for easier handling. The dataset consists of 4986 inlier instances and 60 outlier instances with 63 columns. All column names are listed in Table I.
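To make the effect of these class counts concrete, the following minimal Python sketch computes overall and per-class accuracy for a trivial classifier that labels every instance as an inlier. The function name and the convention that label 1 marks an outlier are assumptions made for illustration; they are not taken from the dataset itself.

```python
# Class counts reported for the combined dataset.
N_INLIERS = 4986
N_OUTLIERS = 60


def per_class_accuracy(true_labels, predicted_labels):
    """Return (overall, inlier, outlier) accuracy.

    Assumed convention for this sketch: 0 = inlier, 1 = outlier.
    """
    pairs = list(zip(true_labels, predicted_labels))
    inlier_hits = [t == p for t, p in pairs if t == 0]
    outlier_hits = [t == p for t, p in pairs if t == 1]
    overall = sum(t == p for t, p in pairs) / len(pairs)
    inlier_acc = sum(inlier_hits) / len(inlier_hits)
    outlier_acc = sum(outlier_hits) / len(outlier_hits)
    return overall, inlier_acc, outlier_acc


labels = [0] * N_INLIERS + [1] * N_OUTLIERS
always_inlier = [0] * len(labels)  # hypothetical trivial classifier

overall, inlier_acc, outlier_acc = per_class_accuracy(labels, always_inlier)
print(f"overall={overall:.4f} inlier={inlier_acc:.4f} outlier={outlier_acc:.4f}")
```

With 4986 inliers out of 5046 instances, the trivial classifier reaches an overall accuracy of roughly 98.8 percent while detecting no outliers at all, which is why inlier and outlier accuracy are worth reporting separately.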

B. Dataset Preparation
The column "anomaly score" is dropped because it is generated by the software, and the "label" column is separated from the remaining dataset. Columns whose standard deviation is less than 0.3 are also dropped, while the original dataset is stored as well. This leaves 36 columns in the reduced dataset. Some of the important columns, except anomaly score, are plotted as categorical plots starting from Fig. 1. The dataset is then standardized, because PCA is to be performed on it. PCA is applied while keeping 98% of the variance. After PCA, the original dataset has 38 columns, and the dataset from which columns with a standard deviation of less than 0.3 were removed has 18 columns. The first two components of the new dataset are plotted on a 2D axis as shown in Fig. 13. PCA is also performed on the original dataset. As shown in Fig. 14, the first three components of the new dataset are plotted on 3D axes, where a separation between inliers and outliers is clearly visible.
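The column filtering and standardization steps described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the toy rows and column names are hypothetical, the population standard deviation is assumed, and the PCA step is left out because it would normally be delegated to a numerical library.

```python
import statistics

STD_CUTOFF = 0.3  # cutoff quoted in the text


def drop_low_variance_columns(rows, names, cutoff=STD_CUTOFF):
    """Keep only the columns whose standard deviation is at least `cutoff`."""
    columns = list(zip(*rows))
    keep = [j for j, col in enumerate(columns) if statistics.pstdev(col) >= cutoff]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_rows, [names[j] for j in keep]


def standardize(rows):
    """Rescale every column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Guard against constant columns so the division below is always defined.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical toy data: three metrics observed at four time steps.
names = ["cpu_user", "ram_used", "net_packets"]
rows = [
    [1.0, 5.0, 10.0],
    [2.0, 5.1, 20.0],
    [3.0, 4.9, 30.0],
    [4.0, 5.0, 40.0],
]

kept_rows, kept_names = drop_low_variance_columns(rows, names)
standardized = standardize(kept_rows)
print(kept_names)
```

In this toy example the nearly constant "ram_used" column falls below the 0.3 cutoff and is removed, and the two remaining columns are rescaled before any projection step would be applied.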

A. Univariate Gaussian Model
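In statistics, the Gaussian distribution is a continuous probability distribution for a real-valued random variable. Its probability density function is given by Eq. (1).
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \qquad (1)$$
where f(x) is the probability density function, µ is the mean, and σ is the standard deviation.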
This work calculates the mean and standard deviation of each column of both datasets and fits a Gaussian distribution to every column. The final probability is calculated as the product of the probabilities of all columns. Negative logarithms of the probabilities are plotted as histograms in Fig. 15 to 19. Fig. 15 shows the probabilities of the train inliers. Fig. 16 shows the probabilities of the test set. Fig. 17 shows the probabilities of the test set with columns having a standard deviation of less than 0.3 removed. Fig. 18 shows the probabilities of the cross-validation set. Fig. 19 shows the probabilities of the cross-validation set with columns having a standard deviation of less than 0.3 removed. A threshold probability value is then set to classify the test and cross-validation sets. Different thresholds are set and the resulting accuracy is observed. Tables IV and V show the accuracies for the dataset from which columns with a standard deviation of less than 0.3 were removed.
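The per-column fitting and thresholding procedure can be illustrated with a short Python sketch. Taking the product of the column densities is equivalent to summing their negative logarithms, so for a row x with column means µ_j and standard deviations σ_j the score is the sum over j of log(σ_j √(2π)) + (x_j − µ_j)² / (2σ_j²). The function names, toy rows, labels, and threshold values below are assumptions chosen for illustration, and every column is assumed to have a non-zero standard deviation.

```python
import math
import statistics


def fit_univariate(train_rows):
    """Estimate one (mean, std) pair per column from the training rows."""
    columns = list(zip(*train_rows))
    return [(statistics.fmean(c), statistics.pstdev(c)) for c in columns]


def negative_log_density(row, params):
    """Sum of per-column Gaussian negative log densities (product in log space)."""
    total = 0.0
    for x, (mu, sigma) in zip(row, params):
        total += math.log(sigma * math.sqrt(2.0 * math.pi))
        total += (x - mu) ** 2 / (2.0 * sigma ** 2)
    return total


def is_outlier(row, params, threshold):
    """Flag a row whose negative log density exceeds the threshold."""
    return negative_log_density(row, params) > threshold


def accuracy(rows, labels, params, threshold):
    """Fraction of rows classified correctly (assumed labels: 1 = outlier)."""
    hits = sum(is_outlier(r, params, threshold) == bool(y) for r, y in zip(rows, labels))
    return hits / len(rows)


# Hypothetical toy data: the model is fitted on inlier rows only.
train_rows = [[-1.0, 9.0], [0.0, 10.0], [1.0, 11.0], [0.0, 10.0]]
test_rows = [[0.2, 10.1], [5.0, 10.0], [-0.5, 9.5]]
test_labels = [0, 1, 0]

params = fit_univariate(train_rows)
results = {t: accuracy(test_rows, test_labels, params, t) for t in (1.0, 5.0, 50.0)}
for threshold, acc in results.items():
    print(f"threshold={threshold:5.1f} accuracy={acc:.3f}")
```

In this toy setting a threshold that is too low flags every row as an outlier, while one that is too high flags none, which mirrors the reason for comparing accuracies across several threshold values.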

B. Multivariate Gaussian Model
In probability theory and statistics, the multivariate normal distribution, also called the multivariate Gaussian distribution or joint normal distribution, is a generalization of the one-dimensional normal distribution to higher dimensions. It models the joint probability in one step instead of calculating individual probabilities and multiplying them. The multivariate Gaussian distribution is given by Eq. (2).
$$f(x) = \frac{1}{\sqrt{(2\pi)^d\,|\Sigma|}} \exp\left(-\frac{1}{2}(x-\mu)\,\Sigma^{-1}(x-\mu)^T\right) \qquad (2)$$
where µ is the length-d row vector of the means of all columns, Σ is the covariance matrix of shape d × d, and d is the number of features.
The mean vectors and covariance matrices of both datasets are calculated to fit a multivariate Gaussian distribution. A threshold value is set to classify each data point as an inlier or an outlier, and accuracies are calculated for various threshold values. Negative logarithms of the probabilities are plotted as histograms in Fig. 20 to 24. Fig. 20 shows the probabilities of the train inliers. Fig. 21 shows the probabilities of the test inliers and test outliers. Fig. 22 shows the probabilities of the test inliers and test outliers with columns having a standard deviation of less than 0.3 removed. Fig. 23 shows the probabilities of the cross-validation inliers and outliers. Fig. 24 shows the probabilities of the cross-validation inliers and outliers with columns having a standard deviation of less than 0.3 removed. Tables VIII and IX show the accuracies for the dataset from which columns with a standard deviation of less than 0.3 were removed.
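A minimal Python sketch of the multivariate procedure is given below. It estimates the mean vector and covariance matrix, factors the covariance with a Cholesky decomposition, and evaluates the negative log density d/2 · log(2π) + 1/2 · log|Σ| + 1/2 · (x − µ) Σ⁻¹ (x − µ)ᵀ. The Cholesky route, the function names, the toy rows, and the threshold are assumptions made for illustration, and the covariance matrix is assumed to be positive definite; a numerical library would normally be used in practice.

```python
import math


def mean_vector(rows):
    """Column-wise mean of the training rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows, mu):
    """Population covariance matrix of shape d x d."""
    n, d = len(rows), len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [x - m for x, m in zip(row, mu)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / n
    return cov


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive-definite a."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def negative_log_density(row, mu, low):
    """Negative log of the multivariate Gaussian density at `row`."""
    d = len(mu)
    diff = [x - m for x, m in zip(row, mu)]
    # Forward substitution solves L z = diff; z.z is then the Mahalanobis term.
    z = []
    for i in range(d):
        s = sum(low[i][k] * z[k] for k in range(i))
        z.append((diff[i] - s) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(d))
    return 0.5 * (d * math.log(2.0 * math.pi) + log_det + sum(v * v for v in z))


# Hypothetical toy data: two strongly correlated features.
train_rows = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.2], [3.0, 2.9], [4.0, 4.1], [5.0, 4.8]]
mu = mean_vector(train_rows)
low = cholesky(covariance_matrix(train_rows, mu))

THRESHOLD = 10.0  # assumed value for this sketch
score_on_trend = negative_log_density([2.5, 2.5], mu, low)
score_off_trend = negative_log_density([4.0, 1.0], mu, low)
print(score_on_trend > THRESHOLD, score_off_trend > THRESHOLD)
```

The second test point has feature values that are individually within the training range but violates the correlation between the two features, so it receives a much higher score than the first. This is the kind of case where modeling the joint distribution differs from multiplying independent per-column probabilities.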

V. CONCLUSION
This work shows that univariate and multivariate Gaussian models for anomaly detection were successfully created. Data imbalance is not an issue here because the models are fitted on the training set and a threshold is used to predict inliers and outliers. The proposed method, which estimates the likelihood of an attack from selected system parameters, fits a univariate and a multivariate Gaussian model on the training dataset and examines the relationship between threshold values and accuracies. For the univariate Gaussian model, accuracy across the different threshold values ranges up to 99.175 percent for train accuracy, up to 98.6 percent for test inlier accuracy, and up to 100 percent for test outlier accuracy. For the multivariate Gaussian model, accuracy across the different threshold values ranges up to 99.45 percent for train accuracy, up to 100 percent for test inlier accuracy, and up to 100 percent for test outlier accuracy with validation.
Future work will explore deep learning techniques such as autoencoders. As more and more data is gathered, machine learning opens up many opportunities for cybersecurity practitioners to explore, particularly with the data they already own. In an age of escalating conflict on the internet, timely automated identification of threats or suspicious conduct can prevent a number of mistakes from occurring.