Ensembling of Attention-based Recurrent Units for Detection and Mitigation of Multiple Attacks in Cloud

Abstract: In recent years, the number of threats to network security has increased exponentially along with the number of Internet users, which poses a serious threat to cloud storage applications. Detecting and defending against multiple threats is currently a hot topic in industry and is considered a challenging research problem in academia. Many methodologies and algorithms have been devised to predict different attacks. Still, most of these methods cannot simultaneously achieve high prediction performance and a low false alarm rate. In this scenario, Deep Learning (DL) algorithms are an appropriate and intelligent means of categorizing multiple attacks. However, most existing DL techniques are computationally inefficient, which may degrade their performance in predicting both normal and attack data. To overcome this problem, this paper proposes a hybrid combination of attention maps with deep recurrent networks to mitigate multiple attacks with low computational overhead. First, a pre-processing step scales the inputs to a specified range. The input data are then fed into the Attention Enabled Gated Recurrent Networks (AEGRN), which remove redundant features and select the optimal features that aid better classification. To further speed up the response, deep feed-forward layers are proposed to replace traditional deep neural networks. Several performance metrics, including accuracy, precision, recall, specificity, and F1-score, are examined in thorough experiments on multiple datasets, namely NSL-KDD, UNSW-NB15, and CIDDS-001. The proposed method is compared with existing DL models to demonstrate its superiority. The experiments show that the proposed framework surpasses the other DL models and achieves the best prediction accuracy with little computational overhead.


I. INTRODUCTION
Internet-based communication is used to manage large industries and has transformed monitoring and interaction methods. Its scope of services extends to the medical industry, banking, education, government departments, the military, and recreation. At the same time, network development gives hackers and intruders opportunities to find illegal ways to break into an organization.
Multiple attacks capable of denying services to legitimate customers are among the main risks to IP networks, and numerous researchers have focused their attention on them. Maintaining the security and protection of the various websites operating on the Internet is therefore a primary requirement of a secured network [1][2]. Owing to qualities such as quick access and primitive means of attack detection, these attacks have expanded significantly.
It can be difficult to distinguish malicious from legitimate network data because intruders behave unpredictably [3][4][5]. In an Internet context, applications run anytime and anywhere and interact remotely with a variety of devices or appliances, which makes it easier for bad actors to reach those devices. Disruption of devices or services is likely to be the first stage of many attacks because such attacks are easy to understand and simple to execute, demand little technical knowledge from the attacker, and are supported by a variety of platforms and applications for attack orchestration [6][7][8].
These attacks can be single-source attacks launched from just one host or multi-source attacks that send attack packets to the target from numerous hosts. Attack toolkits have also been developed and are easily accessible online today [9][10], and intruders can exploit these tools to mount attacks with the least effort. As a result, many studies in recent years have used a range of algorithms to develop defensive systems against cloud attacks. However, these traditional systems suffer from various problems, such as high memory, bandwidth, and processing requirements. It is therefore vital to design Intrusion Detection Systems (IDS) that counteract these security shortcomings and prevent network attacks in the first place. With the exponential increase in traffic, information is steadily moved across separate networks, and IDS must be improved to predict intrusions in such large data environments.
www.ijacsa.thesai.org
Over the last ten years, IDS have used deep learning and machine learning techniques to classify observed data using known characteristics or attributes learned from training datasets. ML and DL-based techniques, which have some limitations, evaluate the properties of network traffic packets and set a reasonable threshold for separating attacks from genuine traffic. Examples include statistical recognition methods [15], neural networks [11], support vector machines [12], nearest neighbor [13], and clustering [14]. These studies show that considerable work has been done to address this difficulty by proposing specific remedies for emerging network attacks. Since these intelligent IDS have only recently been introduced, a number of issues remain to be resolved. Some of the current issues facing studies on attack detection systems are:
1) The majority of currently used techniques concentrate on identifying a single attack with a low false alarm rate, but they typically fall short of a high detection rate.
2) Knowing the many characteristics of attacks is important, but identifying the ones that actually help detect attacks is even more crucial.
However, because of redundant information and excessive computational expense, certain existing techniques commonly have high false positive rates. A well-organized network attack detection method remains a promising research subject because earlier methods also fall short of sufficient accuracy.
Considering these problems, this article proposes a novel integration of attention layers with a gated recurrent neural network to achieve a high classification ratio in mitigating network attacks with less computational complexity. The paper's main contributions are as follows:
1) Self-attention maps are introduced into the Gated Recurrent Neural Network (RNN) to achieve better feature selection, which in turn supports a better detection ratio.
2) A data pre-processing technique is employed to increase the speed of attack detection.
3) Feed-forward learning layers are introduced in place of conventional neural networks to achieve faster training with lower detection error.
The manuscript is organised as follows: Section II provides details on the background and related works. Section III presents the description of the datasets, the data pre-processing, and the proposed approach. Section IV provides further details on the experimental findings. Section V provides a conclusion and future enhancements.

II. BACKGROUND AND RELATED WORKS
Abirami et al. (2022) demonstrated how Deep Reinforcement Learning (DRL) might be used in a cloud network to offload tasks while also recognizing generalized attackers. Identity-based linear classification techniques are used in the virtual machine attack categorization channels. The proposed system supports methods for remote information analysis. Reinforcement learning has the potential to reduce risks to data secrecy and improve cooperation. The sole drawback of this system is its prolonged computation time [16].
In 2022, Tao et al. developed a Continuous Duelling Deep Q-Learning (C-DDQN) technique for protecting the cloud. The proposed dynamic field adaptive system and its improvement are the fundamental ideas of this approach. Its convergence and learning capabilities are better than those obtained without transfer learning methods. However, this framework's primary problem is its rising energy consumption [17].
In 2021, Hizal et al. combined recurrent and convolutional neural networks to create a DL method for threat detection in cloud security. With this method, traffic that is identified as malicious or forbidden is not forwarded to the cloud server. The proposed method is 99.86% accurate for classification into five classes. However, this framework's primary drawback is its higher connectivity cost [18].
In 2020, Karri et al. proposed a three-stage abnormality detection framework that used DL for intrusion attack detection. The system uses CNN, GANomaly, and K-means clustering algorithms. Both the effectiveness of the network and automated intrusion detection were greatly improved. The main advantage of this framework is that it reduces the level of computation without reducing cost [19].
In 2022, Wang et al. presented a stacked contractive autoencoder (SCAE) system. A Support Vector Machine serves as the framework's core. Working on the unfiltered network data, this structure automatically learns better and more reliable low-dimensional features. This approach significantly reduces the analytical complexity and leads to improved detection efficiency. Its drawback, however, is that it cannot be used in real-time settings [20].
PredictDeep was introduced in 2020 by Elsayed et al. as an approach for anomaly prediction in big data environments. Graph Convolutional Networks (GCNs) form the basis of the system. This solution produced better results for the fast discovery and forecasting of security incidents and was able to cope with the multifaceted nature of clouds. The problem with this technique, though, is that it does not recognize and classify anomalies into a range of classes according to the changes in system function that they cause [21].
Nguyen et al. (2021) examined the difficulties associated with computation offloading and cybersecurity in a multi-user mobile edge-cloud computing framework that uses blockchain. To boost offloading security, the framework provides an effective blockchain-powered authorization mechanism that can protect cloud servers from incorrect offloading practices. To this end, the framework develops a complex DRL method called a double-dueling Q-network. This framework lowers latency, energy consumption, and smart contract fees. However, the approach has the drawback that efficiency degrades as the amount of information increases [22].
Kimmel et al. (2021) examined RNN-based DL approaches for their efficacy in identifying malware in the cloud. The focus of the framework was on LSTMs and bidirectional RNNs. Such frameworks progressively learn malware behaviors from the course of operation, fine-grained activities, and system statistics such as CPU, memory, and disk usage. This architecture achieves high detection rates but cannot maintain the same level of performance when dealing with diverse data [23].
Loukas et al. (2018) introduced a recurrent NN with a deep multilayer perceptron architecture that is capable of understanding the temporal context of several attacks. A computational framework was developed to determine whether computation offloading is favorable, using detection latency as the criterion, given the networking operational parameters and the processing needs of the DL framework. When the processing requirements are more demanding and the network is more reliable, offloading lowers the detection delay to a greater extent. The biggest problem with this structure, though, is the additional communication complexity [24].
By fusing a Convolutional NN with Grey Wolf Optimization (GWO), Garg et al. (2019) created a composite data mining method for identifying network anomalies. The GWO and CNN learning procedures were improved to enhance the framework's capabilities for initial population generation, exploration, exploitation, and dropout functionality. This structure works better in terms of precision, false alarms, and detection rate. Its disadvantage, however, is that it increases computational complexity [25]. Table I provides an overview of several relevant studies.

III. PROPOSED ARCHITECTURE
As shown in Fig. 1, the proposed hybrid network's framework is made up of three sub-modules. In the first module, multiple datasets are pre-processed and input to the proposed network. The second module consists of the proposed GRU-SA-FF framework, in which an attention layer is integrated to remove redundant and non-optimal temporal features. In the third module, these features are fed into fully connected deep feed-forward networks based on Extreme Learning Machines (ELM) for classification of the multiple attacks.

A. Materials and Methods
Three distinct datasets, namely CIDDS-001 [27], UNSW-NB15 [28], and NSL-KDD [29], are employed in this investigation. We chose the CIDDS-001 and UNSW-NB15 datasets because they are the most recently produced datasets and include real data traffic, which makes them useful for designing accurate IDSs that track and find novel forms of denial-of-service attacks in cloud networks.
The recently released CIDDS-001 dataset makes it possible to build an anomaly-based IDS. In all, the collection contains around 32 million flows, covering both normal and attack traffic. The dataset is composed of 12 features and two label attributes. Random sampling is applied to draw 80,000 normal and 20,000 DoS attack instances from the relational database of server traffic data, totaling 100,000 instances. The extracted sample is used for cross-fold validation and hold-out testing of the classifiers.
The UNSW-NB15 dataset, a recent contribution to the public domain, was also used for testing. The dataset has 49 features and 1 class attribute. A subset of the dataset provides the training and test sets, UNSW NB15 Train and UNSW NB15 Test. There are 175,341 instances in the train set and 82,332 in the test set. The train set contains 56,000 instances of normal traffic and 119,341 instances of attack traffic, and the test set contains 37,000 instances of normal traffic and 45,332 instances of attack traffic. Hold-out validation makes use of both the whole train set and the test set, while cross-fold validation uses only the test set.
The NSL-KDD dataset is also used for classifier validation. The dataset has 41 features and 1 class attribute. The KDDTrain+ (training) and KDDTest+ (testing) sets of NSL-KDD are used in this study. The KDDTrain+ set has a total of 25,192 instances, made up of 13,499 attack traffic instances and 11,743 normal traffic instances. The KDDTest+ set has a total of 22,544 instances, including 12,833 instances of normal traffic and 9,711 instances of attack traffic. Hold-out and cross-fold validation of the classifiers are performed on each dataset separately. These predefined sets were chosen to avoid randomly selecting instances from the entire NSL-KDD dataset.
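The hold-out and cross-fold validation described above can be sketched as index splits. This is a minimal illustration rather than the evaluation code used in the study: the function names, the 20% test fraction, the 10 folds, and the fixed seed are all assumptions, since the text does not state them.

```python
import random


def hold_out_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices once and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


# Toy usage on 100 hypothetical instances.
train_idx, test_idx = hold_out_split(100, test_fraction=0.2)
folds = list(k_fold_splits(100, k=10))
```

For datasets that ship with predefined train and test files, such as UNSW-NB15 and NSL-KDD, the hold-out step would use those files directly instead of a random split.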

B. Data Pre-processing
The input data are first analysed and then passed through a normalization step, which converts the bulk of the numerical attributes to a specified numeric range. Min-Max normalisation, a linear transformation, is used to accomplish this. After the pre-processing step, new pre-processed datasets are formed from the original raw datasets. These pre-processed data are passed to the feature extraction module.
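The Min-Max step can be illustrated with a short sketch. It assumes a target range of [0, 1] and a list-of-rows layout, neither of which is specified in the text; the function name and the handling of constant columns are likewise illustrative choices.

```python
def min_max_normalize(rows, lo=0.0, hi=1.0):
    """Rescale each numeric column of `rows` linearly into [lo, hi]."""
    if not rows:
        return []
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        new_row = []
        for value, cmin, cmax in zip(row, mins, maxs):
            span = cmax - cmin
            if span == 0:
                # Constant column: map to the lower bound to avoid dividing by zero.
                new_row.append(lo)
            else:
                new_row.append(lo + (value - cmin) * (hi - lo) / span)
        scaled.append(new_row)
    return scaled


# Toy usage: three records with three numeric attributes each.
sample = [[0.0, 200.0, 5.0], [5.0, 400.0, 5.0], [10.0, 300.0, 5.0]]
normalized = min_max_normalize(sample)
```

In practice the column minima and maxima would be computed on the training data only and reused for the test data, so that both are mapped with the same linear transformation.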

C. Feature Extraction using Self-Attention Gated Recurrent Networks
This section covers the operation of gated recurrent units, self-attention, and their combination into self-attention gated recurrent units.
1) Gated recurrent units - An overview: The GRU is one of the most interesting variants of the LSTM; its architecture is depicted in Fig. 2. Following the concept set out by Chung et al. [30], the forget gate and the input gate are combined into a single update gate. The network supports both long-term sequences and memories. Compared with the LSTM network, its complexity is drastically reduced. Chung et al. developed the following equations to describe the behaviour of the GRU.
The general GRU characteristic equations use the following notation: $x_t$ is the input feature at the present state, $y_t$ is the output state, $h_t$ is the output of the unit at the current instant, $z_t$ and $r_t$ are the update and reset gates, $W(t)$ are the weights, and $B(t)$ are the bias weights at the present instant.
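As a point of reference, the widely used GRU formulation from the literature can be written with the symbols defined above. This is a sketch of the standard form, not a reproduction of the paper's own numbered equations; the per-gate weight matrices $W_z, W_r, W_h$ and $U_z, U_r, U_h$, the biases $b_z, b_r, b_h$, the sigmoid $\sigma$, and the candidate state $\tilde{h}_t$ are notational assumptions.

```latex
\begin{align}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{align}
```

Here the update gate $z_t$ decides how much of the previous state is overwritten, and the reset gate $r_t$ decides how much of the previous state enters the candidate state. Under this sketch the output state $y_t$ can be read as $h_t$.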
2) Self-attention maps: The attention map was proposed in 2014 to pick out the relevant words in a sequence-to-sequence structure. In most contemporary works, attention layers are used to filter out redundant features and so support accurate classification. The self-attention process, also referred to as intra-attention, generates three vectors, Q, K, and V, for each input pattern. The output sequences are then created by transforming the input patterns from all of the layers. In its simplest form, it is a technique that maps a query to a set of key-value pairs using scaled dot-product operations. The dot product for self-attention is computed from these Q, K, and V vectors.
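The mapping from queries to key-value pairs can be sketched in a few lines of plain Python. The sketch assumes the standard scaled dot-product form, softmax(QK^T / sqrt(d_k))V, and, for brevity, uses the input itself as Q, K, and V instead of learned projections; the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


# Toy usage: three time steps with two features each (identity projections assumed).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X, X, X)
```

Each row of `weights` sums to one, so every output vector is a convex combination of the value vectors, which is how less relevant time steps are down-weighted.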

D. Proposed Feature Extraction
BiGRU networks, which combine a forward and a backward GRU, are built to gather meaningful information from the many dataset streams. Eq. (9) describes the precise properties of the BiGRU network. To classify data, the BiGRU network collects spatiotemporal features that incorporate a variety of different pieces of information. However, the more varied information in these features may affect the training time, which makes up the overhead in the classification layer. Self-attention layers inserted between the BiGRU network and the classification layer help to reduce the resulting classification cost. Eq. (6) is used to compute the attention features from the features of the BiGRU network, which are then given to the feed-forward classification stage via the softmax layer.
Combining Eq. (7) and Eq. (8) gives the integrated Self-Attention (SA) BiGRU feature extraction.
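One common way to write a BiGRU followed by self-attention is sketched below. It is offered as an illustration under assumptions and does not reproduce the paper's numbered equations: the concatenation of forward and backward states, the sequence length $T$, the projection matrices $W_Q, W_K, W_V$, and the key dimension $d_k$ are all assumed notation.

```latex
\begin{align}
\overrightarrow{h}_t &= \mathrm{GRU}\left(x_t, \overrightarrow{h}_{t-1}\right), \qquad
\overleftarrow{h}_t = \mathrm{GRU}\left(x_t, \overleftarrow{h}_{t+1}\right) \\
h_t &= \left[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\right], \qquad
H = \left[h_1, \dots, h_T\right]^{\top} \\
Q &= H W_Q, \qquad K = H W_K, \qquad V = H W_V \\
Y &= \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\end{align}
```

In this reading, $Y$ holds the attention-weighted BiGRU features that are passed on to the classification layers.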

E. Feed Forward Classification Layers
These features are passed to the fully connected feed-forward network, which carries out the final classification. The layers are fully connected using the ELM principle. An ELM is a particular class of neural network that uses only one hidden layer and operates on the principle of auto-tuning. In terms of dependability, speed, and computational burden, ELM has performed better than other learning models such as Support Vector Machines (SVM), Bayesian Classifier (BC), K-Nearest Neighbour (KNN), and even Random Forest.
Because this neural network has just one hidden layer, that layer may not need to be tuned. Small training error and improved approximation are the ELM's main benefits. ELM uses non-zero activation functions and automatically tuned weights and biases. The ELM's detailed operating mechanism is covered in [26]. Following the attention maps, the ELM's input feature maps are formed from Y, the features from the self-attention BiGRU network, and the ELM's output function and overall training are defined on these inputs. Finally, softmax activation layers are applied to the above feed-forward layers to achieve the best accuracy.
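For orientation, the textbook single-hidden-layer ELM can be written as follows. This is the standard formulation from the ELM literature and is only assumed to correspond to the variant used here; the hidden weights $W$ and biases $b$ (randomly assigned and left fixed), the activation $g$, the output weights $\beta$, the target matrix $T$, and the pseudoinverse $H^{\dagger}$ are assumed notation, while $Y$ denotes the features from the self-attention BiGRU network.

```latex
\begin{align}
H &= g\left(Y W + \mathbf{1} b^{\top}\right) \\
f(Y) &= H \beta \\
\beta &= H^{\dagger} T
\end{align}
```

Because $\beta$ is obtained in closed form through the Moore-Penrose pseudoinverse, no iterative back-propagation is needed for this layer, which is the usual explanation for the fast training attributed to ELM.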

IV. EXPERIMENTATION DETAILS
The entire algorithm was developed on an Intel workstation with a 3.2 GHz i7 CPU (with an NVIDIA GPU) and 16 GB of RAM. The proposed baseline architecture was built using Keras with TensorFlow as the backend.

A. Performance Metrics
The experiment validates the proposed design together with the deep feed-forward networks that classify the data into normal and malicious classes. Metrics including accuracy, precision, specificity, recall, and F1-score are used to gauge the effectiveness of the proposed design. The formulae for the metrics used to assess the proposed architecture are shown in Table II. Additionally, Table III shows the experimental hyperparameters that were used to train the proposed network.
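The metrics named above can be computed from the four confusion-matrix counts. The sketch below uses the standard textbook definitions, which are assumed to match the formulae in Table II; the function name and the example counts are purely illustrative and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard binary classification metrics from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Toy usage with made-up counts (attack = positive class).
metrics = classification_metrics(tp=90, tn=880, fp=20, fn=10)
```

With these made-up counts the accuracy is 0.97 and the recall is 0.90, which illustrates why accuracy alone can look strong on imbalanced traffic while recall and precision tell a fuller story.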

B. Results and Discussion
The experimentation is carried out against baseline architectures configured with the same parameters as the proposed framework. Specifically, the existing architectures were one-dimensional Long Short-Term Memory [30], Gated Recurrent Units [31], Optimized LSTM [23], and BiGRU [32]. The technique was validated and a comparison study was performed using four different datasets. Tables IV, V, VI, and VII report the performance metrics of the proposed algorithm for categorizing several attacks on the various datasets. Table IV presents the outcomes of the proposed and existing frameworks when tested on the CIDDS-001 dataset. Tables V, VI, and VII present the outcomes of the proposed and existing models when tested on the UNSW-NB15, NSL-KDD+ (Train), and NSL-KDD+ (Test) datasets, respectively. Tables IV to VII show that the proposed GRU-SA-FF model demonstrated the best performance in detecting the numerous attacks. The integration of self-attention maps provided the best results in comparison with the other DL techniques. Additionally, the validation performance of the proposed model (see Fig. 3) is assessed using the various datasets, and the RMSE (root mean square error) between training and testing data is found to be 0.001.
Model building times (MBT) for the various classifiers are shown in Table VIII for the four datasets under hold-out evaluation. MBT is estimated because it is essential to consider how long a system needs to train before it succeeds at spotting various threats. MBT therefore helps to achieve a good trade-off between computational complexity and classifier accuracy. According to Table VIII, the proposed model's average MBT when trained on the different datasets is 0.22 s, compared with 0.36 s, 0.41 s, and 0.48 s for Op-LSTM, GRU, and LSTM, respectively. According to the evaluation, the proposed framework needs only 0.22 seconds and excels at countering several threats.
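A model building time of this kind can be measured by timing the training call. The helper below is a minimal sketch: `model_building_time` and `toy_build` are hypothetical names, and the toy function only stands in for a real training routine such as a Keras `fit` call.

```python
import time


def model_building_time(build_fn, *args, **kwargs):
    """Run `build_fn` and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    model = build_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return model, elapsed


def toy_build(n):
    # Hypothetical stand-in for model training; replace with the real fit call.
    return sum(i * i for i in range(n))


model, mbt_seconds = model_building_time(toy_build, 10000)
```

Averaging the measured time over the datasets gives a single MBT figure per classifier that can be compared across models.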

V. CONCLUSION AND FUTURE ENHANCEMENT
This work investigates the integration of self-attention maps with GRU for securing the cloud against multiple attacks. The paper proposes a self-attention network combined with the BiGRU to select the optimal features that aid the classification layers. Additionally, feed-forward layers that work on the ELM principle are used to achieve better classification with a reduced computational burden and greater speed. Precision, specificity, sensitivity, false alarm rate, and area under the receiver operating characteristic curve are used to assess the performance of the proposed model. All of the classifiers are benchmarked on the CIDDS-001, UNSW-NB15, and NSL-KDD datasets. The results demonstrate that the proposed approach outperforms the other DL models, with a superior detection ratio and very little overhead. As future work, the proposed model needs to be validated on real-time datasets, and its deployment in resource-constrained cloud environments should be explored.

Fig. 1. Proposed architecture for the GRU-SA-FF based classification of multiple attacks.

TABLE I. LIST OF RELATED WORKS FROM LITERATURE

TABLE II. ALGEBRAIC EQUATIONS FOR THE CALCULATION OF PERFORMANCE METRICS

TABLE III. HYPER PARAMETERS USED IN THE NETWORK'S TRAINING

TABLE VI. NSL-KDD+(TRAIN) DATASETS PERFORMANCE INDICATORS OF THE SEVERAL ALGORITHMS

TABLE VIII. MBT IN SUPPORT OF DIFFERENT ALGORITHMS USING DIFFERENT DATASETS