Improving Intrusion Detection System using Artificial Neural Network

Network communication is increasingly susceptible to different forms of attacks because of its expanded usage, accessibility, and complexity in most areas, which imposes greater security risks. One method to halt attacks is to identify irregularities in the data transmitted and processed during communication. Detection of anomalies is a vital process for securing a system. To this end, machine learning plays a key role in identifying abnormalities and intrusions in communication over a network. Regularization is one of the major aspects of training machine learning models and plays a primary role in several successful Artificial Neural Network (ANN) models. In this paper, a new regularization technique is integrated with an ANN for classifying and detecting irregularities in network communication efficiently. The purpose of regularization is to discourage learning an overly flexible or complex model, so that the machine learning model generalizes well enough to perform accurately on unseen data. For training and testing purposes, the NSL-KDD, CIDDS-001 (External and Internal Server Data), and UNSW-NB15 datasets were utilized. Through extensive experiments, the proposed regularizer reaches a higher True Positive Rate (TPR) and precision compared with the L1 and L2 norm regularization algorithms. Thus, it is concluded that the proposed regularizer demonstrates a strong intrusion detection ability.
Keywords—New regularizer; anomaly detection; NSL-KDD dataset; CIDDS-001 dataset; UNSW-NB15


I. INTRODUCTION
Nowadays, threats to and attacks on network communication are growing because it is widely used in every field. To prevent such attacks, it is crucial to classify network communication as normal or suspicious. This task is generally known as anomaly detection and deals with unlikely events in network communication. The standard approach to detecting an anomaly is to compute an accurate mathematical model of normal data. Every newly received instance is compared with the model of normality and, accordingly, an anomaly score is computed. The score describes the deviation of the new instance from the average data instance; if the deviation is relatively high, the instance is considered suspicious, classified as anomalous, and processed accordingly [1] [2] [3].
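The scoring procedure described above can be sketched in a few lines. This is a minimal illustration, not the method used in this paper: it assumes that the model of normality is simply the per-feature mean and standard deviation of normal data, that the anomaly score is the largest absolute z-score, and that the threshold of 3.0 is a hypothetical choice.

```python
import numpy as np

def fit_normal_model(normal_data):
    """Estimate a simple model of normality: per-feature mean and standard deviation."""
    mean = normal_data.mean(axis=0)
    std = normal_data.std(axis=0) + 1e-12  # small constant avoids division by zero
    return mean, std

def anomaly_score(instance, mean, std):
    """Largest absolute z-score over the features of one instance."""
    return float(np.max(np.abs((instance - mean) / std)))

def is_anomalous(instance, mean, std, threshold=3.0):
    """Flag the instance when its deviation from the average instance is high.

    The threshold value is a hypothetical choice for illustration.
    """
    return anomaly_score(instance, mean, std) > threshold

# Toy "normal" traffic with two features per record.
normal = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 11.0]])
mean, std = fit_normal_model(normal)
```

An instance close to the mean receives a score near zero, while an instance far from it in any feature exceeds the threshold.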
In machine learning, we are generally looking for the best-fitting model among many candidate models in a large solution space. Similarly, in the context of ANN, the solution space is defined as the space of all approximated or precise functions that a network can represent. Network depth and activation functions determine the size of this solution space. Even one hidden layer with an activation function makes the space of functions very large, and this space grows exponentially as the depth of the network increases; hence, finding the best-fitting solution becomes a difficult task.
Multiple optimizers can be used to minimize the loss function, of which Stochastic Gradient Descent (SGD) is very common. Using SGD as an optimizer, one seeks a solution by moving in the opposite direction of the gradient of the loss function. Due to the complexity and richness of the solution space, this method of learning might overfit the model: it gives good results on training data while significantly degrading the generalization error, that is, the performance on unseen data [4]. To solve this issue, the concept of regularization was introduced in machine learning to limit the complexity of the learning model. Different regularization algorithms are used to avoid overfitting [4]. For example, in iterative learning, the most common regularization algorithm is early stopping, and, in neural networks, a commonly used regularization algorithm is dropout. Generally, in statistics and machine learning, the regularization term is used in combination with the loss or error function. This method is beneficial as it incorporates the model complexity into the function to be minimized. Such methods are used in many algorithms that are formulated as optimization problems, such as Support Vector Machines (SVMs) [5].
However, the existing regularization algorithms come with drawbacks due to the nature of the regularizers. In a challenging setting, where the number of features is greater than the number of samples and the features are correlated, the existing regularization algorithms either do not promote sparsity or perform poorly because of the absence of relevant information.
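As a compact illustration of the two ideas above, a regularized objective and one SGD update can be written as follows. The symbols $\eta$ (learning rate), $R$ (regularization term), $\ell$ (per-sample loss), and $B$ (mini-batch) are notation introduced here for illustration and are not taken from the paper.

```latex
J(w) = \frac{1}{|B|} \sum_{(x, y) \in B} \ell\big(f(x; w), y\big) + \lambda R(w),
\qquad
w \leftarrow w - \eta \, \nabla_w J(w)
```

The penalty weight $\lambda$ trades off data fit against model complexity: with $\lambda = 0$ the update reduces to plain SGD on the loss.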
The purpose of this paper is to implement a new regularization algorithm to search for the optimal solution in a large solution space by taking into consideration the relationship between weight matrix entries. Hence, the limit of space is increased and can be controlled by squeezing and expanding this space based on the penalty term λ. Consequently, it provides the ability to find the least complex learning model. To differentiate between normal and various malicious connections, we plan to examine the algorithm from a multiclass classification perspective.
In this paper, we introduced new regularization design considerations and a general outline of an intrusion detection technique based on using the standard deviation to decay the weight matrices in order to get the regularization term. Compared with well-known regularization techniques, we embedded the proposed regularizer with ANN model for classification tasks and employed NSL-KDD, CIDDS-001 (External and Internal Server Data), and UNSW-NB15 datasets with separate testing and training sets to evaluate the efficiency in detecting anomalies.
The main contributions of this paper are summarized as follows: 1) We present the design and implementation of an ANN intrusion detection system based on a new regularizer. 2) We study the performance of the model with different regularization parameters impacting accuracy.
The outline of the paper is as follows: We provide the related works in Section II. Then, we give background and formalization in Section III. Next, we present the used datasets in Section IV. In the same section, we also study anomaly detection and the new regularization technique. In Section V, results and discussion are presented. The limitations of the proposed regularization technique are provided in Section VI. Finally, we conclude our study in Section VII.

II. RELATED WORKS
The first IDS, or anomaly detection system, was introduced by Dr. Dorothy Denning at SRI International, and it is still an actively and heavily researched topic due to its broad applications in network communication ([6], [7]). Supervised learning techniques are popular methods for solving such problems. These techniques give more satisfactory results when statistical and regression techniques are incorporated [21]. A novel intrusion system and a multilevel hybrid classifier were proposed in [8]. The proposed system combines unsupervised Bayesian clustering with supervised tree classifiers to detect intrusions. Based on the Modular Multiple Classifier System (MCS), the authors in [9] proposed an unlabeled network anomaly IDS, where every module was created to model network services or a specific group of similar protocols. Moreover, they conducted experimental studies on the KDD Cup 1999 dataset, which revealed that the proposed anomaly IDS was able to accomplish a high attack detection rate along with a low false alarm rate. In [10], the authors developed an intrusion detection system based on the AdaBoost algorithm. Within this algorithm, decision rules were provided for the continuous and categorical features, and decision stumps were used as weak classifiers. The combination of the weak classifiers for the continuous and categorical features into a strong classifier allowed handling the relation between these features without the need for any forced conversions. According to their experimental analysis, the authors reported that the algorithm had low error rates and low computational complexity. Data mining techniques were utilized in [11], where the authors devised a novel framework for intrusion detection. For building classifiers, the authors proposed a classification algorithm that uses fuzzy association rules. However, the outcomes regarding unseen attacks were not promising.
In [12], the authors used a supervised learning classifier system for intrusion detection. To learn signatures for network intrusion detection, they presented a biologically inspired computational approach that can learn adaptively and dynamically. To support the future development of intrusion detection systems, the authors in [13] presented a reference for comparing the efficiency of different machine learning techniques, including SVM and tree classification. Moreover, the authors proposed a method to compute the mean value by sampling different ratios of the normal data for every measurement, which resulted in a better accuracy rate when observing real-world data. A novel machine learning algorithm was proposed in [14], namely, the Boosted Subspace Probabilistic Neural Network (BSPNN), which combined a semiparametric and an adaptive boosting approach to attain a better trade-off between generality and accuracy. The method showed prominent improvements with respect to detection accuracy, with comparatively low computational complexity and negligible false alarms. A new approach for intrusion detection was proposed in [15]. This approach is based on ANN and fuzzy clustering (FC-ANN). To evaluate the proposal, the authors conducted an experiment using the KDD Cup 1999 dataset. Experimental results demonstrated that FC-ANN enhanced the detection stability and the detection precision. For the prediction of anomalies, a random-effects logistic regression model was proposed in [16].
Imbalanced class distribution is an inevitable problem in real network traffic due to the large size of traffic and the low frequency of certain types of anomalies. The authors in [17] used sampling approaches to combat imbalanced class distributions for network intrusion detection. They performed flow-based classification on a network flow dataset, CIDDS-001. The system was able to detect attacks with up to 99.99% accuracy.
In [18], the statistical and complexity analysis of the CIDDS-001 dataset is considered. The authors utilized the k-nearest neighbor classifier on CIDDS-001 to build an IDS. Their system achieved an overall accuracy of 99.6% with 2-NN and a minimum accuracy of 99.3% with 5-NN. Using the same dataset, the authors in [19] conducted an analytical study to assess the performance of KNN and k-means clustering algorithms when classifying traffic. Both algorithms achieved over 99% accuracy. In [20], the authors proposed an effective anomaly-based intrusion detection system using a gradient boosted machine (GBM). Three different datasets, NSL-KDD, UNSW-NB15, and a GPRS dataset, were utilized with either tenfold cross-validation or the hold-out method. In [21], the authors proposed an improved IDS based on hybrid feature selection and two-level classifier ensembles. Two intrusion datasets (NSL-KDD and UNSW-NB15) were employed to evaluate the performance. Based on the statistics and significance tests, on the NSL-KDD dataset, the proposed classifier shows 85.8% accuracy, 86.8% sensitivity, and 88.0% detection rate. By taking advantage of the multiple classification abilities of neural networks and fuzzy logic, the authors in [22] developed a novel model for the intrusion detection system. A new learning algorithm was proposed in [23] for adaptive intrusion detection using naïve Bayesian and boosting classifiers. The authors conducted an experiment using the KDD Cup 1999 dataset, which showed that the proposed algorithm offered higher detection rates with a remarkable reduction in the number of false positives for multiple types of network intrusion. A GA combined with KNN for feature weighting and selection was proposed in [24]. The proposed model was applied to the KDD Cup 1999 dataset for identifying DDoS/DoS attacks. The results showed an accuracy of 78% for unknown attacks and 97.24% for known attacks.
Based on the Pittsburgh, iterative rule learning (IRL), and Michigan approaches, the authors in [25] proposed three different types of genetic fuzzy systems for intrusion detection. A novel feature representation approach was proposed in [26]. This approach is called the cluster center and nearest-neighbor approach (CANN), in which the distance between data and its nearest neighbor and data sample and its cluster center were measured and summed. The authors conducted the experiments using the KDD Cup 1999 dataset, showing that the CANN classifier performed similarly or slightly better than SVM and k-NN.
Two dimensionality reduction techniques, namely, PCA and fuzzy PCA, were used and compared in [27], where the authors classified the test samples of connections into the attack or normal category by applying the KNN algorithm. In addition, they conducted experiments using the KDD Cup 1999 dataset. The results showed that fuzzy PCA performed better than PCA in detecting the DoS and U2R attacks. In [28], the authors proposed a deep learning approach using recurrent neural networks (RNN-IDS). The experimental results demonstrated that RNN-IDS was well suited for building a classification model with relatively high accuracy, and its performance was also superior to conventional machine learning classification techniques in multiclass and binary classification. The authors in [29] built an anomaly detection system using a backpropagation algorithm optimized by the Conjugate Gradient (CG) algorithm. Then, they analyzed the use of CG optimization (Polak-Ribiere, Fletcher Reeves, Powell Beale). Based on their experimental results, the average accuracy was 93.2% for the two classes "intrusion" and "normal". The application of LSTM to RNN for IDS modeling was proposed in [30]. The experiment examined how the hyperparameter values, such as the learning rate, changed the performance; the size of the hidden layer had a significant impact on the performance. According to their experiments, the average detection rate was 98.8%. The authors in [31] proposed a learning model, namely, PSO-FLN, a fast learning network (FLN) based on particle swarm optimization (PSO). A deep learning model was proposed in [32]. The model is based on the DBN and the stacked nonsymmetric deep autoencoder (NDAE). They used the KDD Cup 1999 and NSL-KDD datasets to evaluate their model, which accurately detected the Probe and DoS attacks. Nevertheless, R2L attacks were barely identified, while no detection of the U2R attacks was recorded. The precision value was found to be 99.99%, with an overall accuracy of 97.85%.

III. BACKGROUND AND FORMALIZATION
The most critical issue in machine learning is developing a generalized training model that will perform accurately on training data and at least will provide almost the same results on unseen data.
Many algorithms are used whose primary goal is to decrease classification error on unseen data, possibly at the cost of increased training error. In other words, any modification intended to reduce the model's generalization error, rather than its training error, is known as regularization.
Many techniques have been used to improve the generalization performance of the learning model [33]. Some add constraints to the machine learning model, such as putting constraint on model parameters values, and others add further statistical terms in the objective function that are known as a soft constraint on model parameters [34].
Developing a more effective regularization algorithm is a crucial task in the field of machine learning; hence, it is the main focus of research in this field. In a statistical model of learning algorithms, such constraints and penalties are used to encode prior knowledge. On occasion, these penalties and constraints are designed to promote generalization by expressing generic preferences for a simple classification model. However, it is necessary to incorporate such penalties and constraints to make an undetermined problem determined.
As explained above, there are multiple strategies to incorporate regularization in machine learning algorithms [35], [36], [37]. Among these methods, the L1- and L2-norm are the most common regularization methods. L2 regularization is also referred to as Tikhonov regularization, and, in statistics, as ridge regression. It is combined with the cost function as a complexity term.
L2 regularization is the squared Euclidean norm of all feature weights of the hidden layer, and, in the case of multiple hidden layers, it is the sum of all such squared norms, including the output layer of the neural network [38].
A regularization parameter, λ, is multiplied with the regularization term in order to penalize and control the magnitude of the weights. Due to this regularization, the model ends up with much smaller weights in each layer. Similarly, L1 regularization produces many zeros in the weight matrix and makes it sparse, hence controlling the complexity of the model. Both L1 and L2 regularizations have a well-defined probabilistic interpretation, which is similar to adding a Gaussian prior over the distribution of the weight matrix W in the case of L2 and a Laplacian prior in the case of L1 [39]. Several tried and tested regularization methods exist for both neural networks and other machine learning algorithms (random forests, SVM, etc.) [5], [37], [40], [41]. For succinctness, we will focus purely on methods used on ANNs. For simplicity, we can split the methods into two categories: sparsity-based regularization and all other methods.
The sparsity-based methods to be considered are L1 and L2 norms.
Both methods take a sum over the absolute value and square, respectively.
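The two penalties described above can be computed directly from the layer weights. The following NumPy sketch is illustrative only; the function names and the toy weight matrices are assumptions made here, and the penalties are summed over all layers as described in the text.

```python
import numpy as np

def l1_penalty(weights, lam):
    """lam times the sum of absolute weight values over all layers."""
    return lam * sum(float(np.sum(np.abs(w))) for w in weights)

def l2_penalty(weights, lam):
    """lam times the sum of squared weight values over all layers."""
    return lam * sum(float(np.sum(np.square(w))) for w in weights)

# Toy weight matrices for a two-layer network (hypothetical values).
weights = [np.array([[1.0, -2.0], [0.0, 3.0]]), np.array([[0.5], [-0.5]])]
l1_value = l1_penalty(weights, lam=0.1)
l2_value = l2_penalty(weights, lam=0.1)
```

Because the L2 penalty grows quadratically, it punishes a single large weight much more than the L1 penalty does, while the L1 penalty keeps pushing small weights all the way to zero.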
There is a great amount of previous work comparing L1 and L2 along with other regularization methods in a variety of problem domains [42].
On the other hand, we can also apply methods such as early stopping. This would reduce the number of parameters the network learns; thus, this is considered a form of regularization. The goal of early stopping along with other forms of regularization is to reduce generalization error or increase generalization accuracy while allowing training error to increase.
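A minimal sketch of a patience-based early stopping rule is given below. The patience mechanism and the function name are assumptions made for illustration; the text above does not specify which stopping criterion is meant.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # The loss kept improving often enough: train to the last epoch.
    return len(val_losses) - 1

# Hypothetical validation losses: improvement stalls after the third epoch.
stop_at = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9], patience=3)
```

In practice, the weights from the best epoch (here, the third one) would be restored after stopping.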
The most seminal regularization method and one of the more significant breakthroughs in machine learning is dropout [43].
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 6, 2020
Dropout is an intuitively brilliant discovery that randomly deactivates and removes a portion of neurons according to a chosen probability, pushing the neural network to acquire more stable and robust features in conjunction with various random subsets of the other neurons. Depending on the problem's context, it can be used in combination with sparsity regularizers to good effect (see Fig. 1).
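The random deactivation of neurons can be illustrated as follows. This sketch uses the "inverted dropout" variant, in which surviving activations are rescaled during training; this is an assumption made here for illustration and is not necessarily the exact formulation of [43].

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero each unit with probability `rate` and rescale the survivors.

    Rescaling by 1 / (1 - rate) keeps the expected activation unchanged,
    so nothing needs to be done at inference time.
    """
    if not training or rate == 0.0:
        return activations
    keep = 1.0 - rate
    mask = rng.random(activations.shape) < keep
    return activations * mask / keep

rng = np.random.default_rng(0)
hidden = np.ones((4, 5))  # toy activations of one hidden layer
dropped = dropout(hidden, rate=0.5, rng=rng)
```

With a rate of 0.5, every entry of `dropped` is either 0 (deactivated) or 2 (kept and rescaled).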

A. Anomaly Detection
Anomaly detection can be framed in many ways. Outlier detection, for instance, can often fall under this umbrella. Here, let us define an anomaly as something that significantly differs from the rest of the data or otherwise grossly misfits the distribution of data. We trained and tested in a supervised context (classification) using a feedforward network to test differences across other regularization techniques and our new regularization in our problem domain.

B. Data
Due to the nature of the problem, the following three datasets were chosen to carry out the analysis based on the proposed regularization and the L1 and L2 norm regularizations. The competition task was to build a network intrusion detector, that is, a predictive model capable of distinguishing between normal connections and attack connections; here, it is used to analyze the performance of the proposed regularizer against existing regularizers.
Table I.
2) UNSW-NB15 dataset: The UNSW-NB15 dataset was created with the IXIA PerfectStorm tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS), with the purpose of generating a hybrid of real modern normal activities and synthetic contemporary attack behaviors. It contains nine different types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. The whole dataset contains 2,540,044 records and is available to download as one file or split into several different CSV files. There is also one list of event files which contains information about the number of events categorized by attack category and attack subcategory for all 2.5M records. From that dataset, the training and the test datasets are produced, wherein the training dataset contains 175,341 records and the test dataset contains 82,332 records [44].
3) CIDDS-001 Dataset: Another dataset we used for our experiment is the CIDDS-001 dataset [45]. This is a labeled flow-based dataset used for intrusion detection system. The following attributes from the dataset are used for training the model: Src IP, Src Port, Dest IP, Dest Port, Proto, Duration, Bytes, Packets, Flags.
There are two types of server through which this data is collected (OpenStack and external server). Data from both servers contain the aforementioned attributes; the only difference is in the attack categories.
Data from the OpenStack server contains the following three categories: normal, victim, and attacker, while data from the external server contains the following five categories: normal, victim, attacker, unknown, and suspicious.

C. New Regularization Technique
In the machine learning field, the commonly applied regularization techniques are the L1-norm and the L2-norm. During optimization, these regularizers consider the complexity of the weights to induce the network towards a more general mapping. The L1-norm imposes the sum of the absolute values as a penalty, while the L2-norm imposes the sum of the squared values as a penalty. The purpose of this article is to introduce a new regularization that employs the standard deviation of the weight matrix and multiplies it by λ to form the regularization term. Consequently, the regularizer adds the standard deviation of the weights to the loss function.
After studying the L1 and L2 regularizers, we found one
The contour of the new regularizer is displayed, highlighting the efficacy and potency of the new regularizer. Fig. 2 represents the feasible region of the L1, L2, and new regularization techniques. The contours of each regularizer represent different loss values. The behavior of the L2-norm is circular and incorporates L1, while the new regularization acts like a parabola and takes values beyond the L2-norm limit. This helps in the sense that it increases the limit of values (space) that can be adopted, and this space can be expanded based on the penalty term λ. The formalization is as follows (see equation 1), with ω denoting the standard deviation of weight matrix w_i.
During the training process, λ denotes the regularization parameter that sets a penalty to restrict weights from selecting high values. In other words, the loss function in our case becomes the one given in equation 2. Therefore, if the weight values of all layers are large, the weight values of the selected λ will be large. Thus, the weight values cannot be equal, as they will have more freedom to search in a large space. Consequently, our regularization technique is more effective compared to the L1 and L2 regularization techniques.
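The verbal description above can be written out as follows. This is one reading of the prose and not the equations as printed by the authors: it assumes that the standard deviation is taken over all $n_i$ entries of each weight matrix $w_i$ (population form), that the per-layer terms are summed over the $k$ layers, and that $\mathcal{L}_{CE}$ denotes the cross-entropy loss; the symbols $n_i$, $k$, $\bar{w}_i$, and $R$ are introduced here for illustration.

```latex
\omega_i = \sqrt{\frac{1}{n_i} \sum_{j=1}^{n_i} \left( w_{i,j} - \bar{w}_i \right)^2},
\qquad
R(W) = \lambda \sum_{i=1}^{k} \omega_i,
\qquad
\mathcal{L}_{\text{total}} = \mathcal{L}_{CE} + R(W)
```

Under this reading, the penalty is zero whenever all entries of a weight matrix are equal, regardless of their magnitude, which is the main difference from the L1 and L2 penalties.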
The model was trained using the Nesterov ADAM optimizer, with tanh activation functions. The model was trained over 100 epochs with a batch size of 32. The labeled data were classified with a feedforward network.

D. Artificial Neural Network (ANN) Based IDS with New Regularization Method
In this section, we present the diagrammatic representation of data preprocessing and training of the ANN model that employs our new regularizer, as shown in Fig. 3. There are generally four steps involved in this process (Fig. 3), as explained below.
1) Data Preprocessing: An Artificial Neural Network uses only numerical data for training and testing. So, the initial step is to transform nominal and textual data into numerical data. To do this, the following steps were performed:
• All the nominal and textual attributes were converted by using one-hot encoding (nominal to binary conversion in Weka). Converting attributes to one-hot encoding increases the number of attributes. Therefore, the number of units in the ANN is adjusted according to the number of attributes.
• Each category of attack types was converted by one-hot encoding.
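The one-hot conversion described in the two steps above can be sketched with NumPy as follows. The paper reports using Weka for this conversion; the function below and the toy protocol column are only an illustrative stand-in.

```python
import numpy as np

def one_hot(column):
    """Map a column of nominal values to a binary indicator matrix.

    Returns the sorted category names and one row per record, with a
    single 1 in the position of that record's category.
    """
    categories, inverse = np.unique(column, return_inverse=True)
    return categories, np.eye(len(categories), dtype=int)[inverse]

# Toy nominal attribute (hypothetical values).
protocol = np.array(["tcp", "udp", "icmp", "tcp"])
categories, encoded = one_hot(protocol)
```

A single nominal attribute with three distinct values becomes three binary attributes, which is why the number of input units has to be adjusted after encoding.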
2) Data Scaling: After data preprocessing, each dataset contains attributes of numerical values and one-hot encoded values. The numerical values were normalized according to the formulation given in equation 3.
For i = 1, ..., n where n represents the number of records, and x represents a specific column in the dataset. Next, duplicate records were removed from the dataset to restrict classifiers from giving biased results.
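The scaling and deduplication steps can be sketched as follows. Equation 3 is not reproduced in the text, so the exact normalization is not known here; the sketch assumes min-max scaling of each numeric column to [0, 1], which is a common choice and only an assumption.

```python
import numpy as np

def min_max_scale(x):
    """Rescale one numeric column to [0, 1] (assumed form of the normalization).

    A constant column has no spread, so it is mapped to zeros.
    """
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x)
    return (x - x.min()) / span

def drop_duplicate_rows(records):
    """Remove exact duplicate records (note: np.unique also sorts the rows)."""
    return np.unique(records, axis=0)

duration = min_max_scale([2.0, 4.0, 10.0])
records = np.array([[0.0, 1.0], [0.5, 0.0], [0.0, 1.0]])
unique_records = drop_duplicate_rows(records)
```

To avoid leaking information, the minimum and maximum would normally be computed on the training set and reused for the test set.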
3) Training the ANN model: After the data preprocessing and data scaling phases, our next task is to implement the ANN model. Python was chosen as the implementation language and the Keras framework was employed for the ANN. The ANN model incorporates the new regularizer to test our method. Following the mathematical description in equation 1, the proposed regularizer is implemented as a function. The kernel regularizer of each layer was assigned this new function rather than one of the built-in regularizers. The ANN predictive model includes two hidden layers of five and three units, respectively. The last layer is composed of two units according to the class values. tanh is the activation function employed in each layer, except for the last layer, where the softmax activation function is utilized. In the first two layers, the weight matrix was initialized with a Gaussian random distribution. Due to the large number of neurons in each layer, we show only the reduced version of the model, as depicted in Fig. 4. The first layer consists of 122 neurons, the first hidden layer 10 neurons, and the second layer 100 neurons.
Likewise, the third hidden layer has 50 neurons. In the fourth hidden layer, the input size is reduced to 10. Hence, we added 10 neurons to it. Further, these neurons are connected to 3 neurons in the fifth hidden layer, which is further connected to 5 output neurons, one for each of the five specific categories. In each layer, a kernel matrix is initialized with a uniform distribution and the tanh activation function is used, except in the last layer, which has the softmax activation function. Further, for binary classification, the last layer has 2 output neurons. Finally, the model is compiled with the Adam optimizer with a default learning rate and other parameters.
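The following NumPy sketch illustrates the arithmetic of a standard-deviation regularizer combined with a tanh/softmax feedforward pass. It is not the authors' Keras code: the layer sizes are a reduced, illustrative choice, the regularizer form (λ times the sum of per-layer weight standard deviations) is our reading of the description, and all function names are hypothetical. In Keras, a callable that maps a weight tensor to a scalar can be passed as `kernel_regularizer`, which is the mechanism the text refers to.

```python
import numpy as np

def std_regularizer(weight_matrices, lam):
    """lam times the sum of per-layer weight standard deviations (assumed form)."""
    return lam * sum(float(np.std(w)) for w in weight_matrices)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(x, weights, biases):
    """tanh in every hidden layer, softmax in the output layer."""
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ w + b)
    return softmax(h @ weights[-1] + biases[-1])

def categorical_cross_entropy(targets, probs):
    """Mean over samples of -sum_i t_i * log(p_i)."""
    return float(-np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1)))

rng = np.random.default_rng(0)
sizes = [122, 10, 3, 5]  # illustrative sizes: input, two hidden layers, 5 classes
weights = [rng.uniform(-0.05, 0.05, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.random((4, 122))           # four toy records with 122 attributes
targets = np.eye(5)[[0, 1, 2, 3]]  # one-hot class labels
probs = forward(x, weights, biases)
loss = categorical_cross_entropy(targets, probs) + std_regularizer(weights, lam=0.01)
```

The total loss is the data term plus the penalty, so the optimizer is rewarded for reducing the spread of the weights in each layer as well as the classification error.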

4) HyperParameters Adjustment:
After each 100-epoch run of the ANN model, precision and loss values were evaluated and the hyperparameters were adjusted accordingly. Activation functions and kernel initializer distributions were determined after several iterations and after examining the depth of the ANN model. The regularization parameter λ was also adjusted according to the desired outputs. The number of layers and the other hyperparameters remained the same for each regularizer. The λ parameter was updated repeatedly and then fixed to the value resulting in the highest accuracy for the corresponding regularizer.
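The selection of λ described above amounts to a search over candidate values. The sketch below shows the idea with a plain grid; the candidate values and the `toy_validation_accuracy` function are hypothetical stand-ins for a full training and evaluation run.

```python
def select_lambda(candidates, evaluate):
    """Return (best_lambda, best_score) over the candidate values.

    `evaluate` maps a lambda value to a validation score (higher is better).
    """
    best_lam, best_score = None, float("-inf")
    for lam in candidates:
        score = evaluate(lam)
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam, best_score

def toy_validation_accuracy(lam):
    """Hypothetical stand-in for training the ANN and measuring accuracy."""
    return 1.0 - (lam - 0.01) ** 2

best_lam, best_score = select_lambda([0.0001, 0.001, 0.01, 0.1], toy_validation_accuracy)
```

Each call to `evaluate` corresponds to one complete training run, so the cost of the search grows linearly with the number of candidates.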

V. RESULTS AND DISCUSSION
To produce results based on the proposed method, we implemented our model for multiclass classification (normal and four different attack categories (5-class)). In addition, we applied the 10-fold cross-validation on each dataset. All simulations were carried out on a server having 32 GB RAM, GeForce GTX 1080 GPU of 8 GB GDDR5X memory, and 2560 NVIDIA CUDA cores. We compared results for multiclass problems in each case and demonstrated our results. For each dataset, the corresponding attack categories were considered as classes and the ANN with new, L1, and L2 regularizations is trained by using 10-fold cross-validation. For every attack class in each dataset, the performance measures described in equations 5-9 were computed and presented. In the following sections, results for each dataset based on our new regularization are compared with other regularization algorithms. In each type of classification, the proposed regularization demonstrates a good performance and is superior to L1 and L2 regularizations. Furthermore, other hyperparameters and results on each classification category are discussed in detail.
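The 10-fold cross-validation protocol mentioned above can be sketched as an index generator. This version shuffles the samples and does not stratify by class; whether the experiments used stratified folds is not stated, so that choice is an assumption.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Toy example with 25 samples; each sample is used for validation exactly once.
splits = list(k_fold_indices(n_samples=25, k=10))
```

For strongly imbalanced datasets, stratified folds that preserve the class proportions in every fold are often preferable.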

A. Evaluation Protocols
For multiclass classification, the loss function used is categorical cross entropy, as given in equation 4.
$$CE = -\sum_{i=1}^{C} t_i \log\big(f(s_i)\big) \tag{4}$$
where C is the number of classes, t_i is the target for the ith class, and f(s_i) is the ith output after activation function f. To evaluate our model, training and validation accuracy are reported for the data partitions as explained in Results and Discussion. Apart from accuracy, other performance measures, that is, TPR (also known as Recall), False Positive Rate (FPR), Precision (Pre), and the F1 measure, are calculated. Accuracy and these four measures are given in equations 5, 6, 7, 8, and 9, respectively.
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$
$$TPR = \frac{TP}{TP + FN} \tag{6}$$
$$FPR = \frac{FP}{FP + TN} \tag{7}$$
$$Pre = \frac{TP}{TP + FP} \tag{8}$$
$$F1 = \frac{2 \times TPR \times Pre}{TPR + Pre} \tag{9}$$
where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
Tables II, III, and IV. The NSL-KDD dataset is an imbalanced dataset; therefore, the individual performance measures for each class are significantly affected. For example, the R2L attack type has a total of 224 samples, and the performance is lower for this category. In such a situation, the classifier is biased towards more frequent samples, for example, the normal category, which has 9711 samples.
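The per-class measures named above (TPR, FPR, precision, and F1) can be computed from a multiclass confusion matrix in a one-vs-rest fashion. The sketch below is illustrative; the function name and the toy confusion matrix are assumptions, and rows are taken to be true classes while columns are predicted classes.

```python
import numpy as np

def per_class_metrics(confusion):
    """Compute TPR, FPR, precision, and F1 for every class.

    confusion[i, j] is the number of samples of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    results = []
    for c in range(confusion.shape[0]):
        tp = confusion[c, c]
        fn = confusion[c].sum() - tp       # class c samples predicted as another class
        fp = confusion[:, c].sum() - tp    # other classes predicted as class c
        tn = total - tp - fn - fp
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        pre = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * tpr * pre / (tpr + pre) if tpr + pre else 0.0
        results.append({"TPR": tpr, "FPR": fpr, "Pre": pre, "F1": f1})
    return results

# Toy two-class confusion matrix (hypothetical counts).
metrics = per_class_metrics([[8, 2], [1, 9]])
```

Because each class is scored separately, a rare class such as R2L can show a low TPR even when the overall accuracy is high.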
New regularization: NSL-KDD dataset has four different attack categories. For each attack category, different performance measures were computed (see Table II). Experimental results for TPR are also demonstrated in Fig. 5.

L1-norm regularization: For the sake of comparison, we also used L1-norm regularization for the 5 classes of the NSL-KDD dataset. Finally, we computed the performance measures (see Table III).
L2-norm regularization: Further, we trained the classifier by using L2-norm regularization. The results for the NSL-KDD dataset are shown in Table IV.
2) UNSW-NB15 dataset: In addition, we provide a comparison of different performance measures for the UNSW-NB15 dataset using the new, L1-norm, and L2-norm regularizers. We trained the models with these different regularizations using the 10 classes of the UNSW-NB15 dataset on 175,341 samples given in an explicit training set (UNSW NB15 Train). Similarly, the models were tested on 82,332 samples (UNSW NB15 Test), and then we computed the results (see Tables V, VI, and VII).
New regularization: We embedded the proposed regularization in the ANN model and then tested it using the 10 different categories of the UNSW-NB15 dataset, as shown in Table V. TPR results can also be viewed in Fig. 6.
3) CIDDS-001 dataset: Here, we carried out several experiments on the CIDDS-001 dataset using different regularizations. The CIDDS-001 dataset has two parts:

1) External Server Dataset
The model is trained on the entire dataset using 10-fold cross-validation, except for a set of 339030 samples which were kept separate for testing purposes. Apart from normal samples, there are 4 different attack categories in this dataset: victim, attacker, unknown, and suspicious.

2) OpenStack Dataset
The same approach is applied to this dataset. The ANN model is trained on the entire dataset using 10-fold cross-validation, except for a set of 2789002 samples. There are only 2 attack categories: victim and attacker.
We carried out our experiments on both datasets and the following performance measures were observed for each regularization.
New regularization: The experimental results on the two types of dataset using the new regularization are given in Tables VIII and IX. TPR results for each category are also shown in Figs. 7 and 8, respectively.
L1-norm regularization: The results for each of the two types of the dataset are given in Tables X and XI.
L2-norm regularization: Tables XII and XIII show the results using L2-norm regularization for each of the two types of the dataset. In this case, the testing time was the same for all regularizations, namely 76.6 seconds. Based on the analysis of the above results, it can be concluded that the proposed regularization outperformed the other regularizations.
• OpenStack Server Dataset: For this dataset, the average TPR computed for the proposed regularizer was 97.86%, while the FPR was 1.03%. For the L1 and L2 regularizers, the average TPR is 96.13% and 97.6%, and the average FPR is 1.9% and 1.26%, respectively. Training took 1728.4, 1720.0, and 1723.2 seconds for the proposed, L1, and L2 regularizations, respectively. Testing took 97.7, 100.2, and 92.8 seconds, respectively. Hence, from the results above, we can conclude that the performance of the proposed regularization is slightly higher than that of the other regularizers.

VI. LIMITATION OF NEW REGULARIZATION
Several researchers have employed multiple regularization algorithms, the most common ones being lasso regularization and ridge regression. However, some disadvantages are inherent in the regularization framework. For example, in a challenging setting, where the number of instances is very low and the dimensionality is very high, it is impractical to utilize these regularizations. Likewise, our regularization algorithm has multiple limitations, as follows:
• It cannot be employed for selecting or reducing features.
• It is challenging to choose a suitable value of λ, due to the fact that it is a continuous value. In addition, the process of picking a suitable value from multiple attempts will be computationally costly and time consuming.

VII. CONCLUSION
The field of ANN regularizers is one that is still ripe for new research and innovation. From attempts at adaptive weight decay to new techniques altogether, many innovations in improving generalization through reducing model complexity are possible. In this paper, we proposed a new regularization technique for anomaly detection based on the standard deviation of the weight matrix. Based on the analysis of our experimental results, our proposed regularization algorithm makes the ANN capable of identifying relevant patterns in data and classifying them efficiently. Moreover, the proposed regularizer outperformed the existing regularization algorithms when incorporated with ANN. The overall average accuracy achieved on the NSL-KDD, UNSW-NB15, and CIDDS-001 datasets using 10-fold cross-validation is 98.53%, 94.58%, and 97.87%, respectively.