Ensemble Methods to Detect XSS Attacks

Machine learning techniques are gaining popularity and giving better results in detecting Web application attacks. Cross-site scripting (XSS) is an injection attack that is widespread in web applications. Existing solutions such as filter-based approaches, dynamic analysis, and static analysis are not effective at detecting unknown XSS attacks, whereas machine learning methods can detect them. Existing research on detecting XSS attacks with machine learning has issues such as reliance on single base classifiers, small datasets, and unbalanced datasets. In this paper, supervised ensemble learning techniques are trained on a large, labeled, and balanced dataset to detect XSS attacks. The ensemble methods used in this research are random forest classification, AdaBoost, bagging with SVM, Extra-Trees, gradient boosting, and histogram-based gradient boosting. We analyzed and compared the performance of the ensemble learning algorithms by using the confusion matrix.
Keywords—Cross-site scripting; machine learning; ensemble learning; random forest; bagging; boosting


I. INTRODUCTION
Machine learning algorithms are useful in detecting unknown and new XSS attacks in Web applications. Ensemble methods combine different base models, and ensemble learning models can give better results than individual base models [1]. In an XSS attack, the attacker can steal the victim's session cookie and sensitive data, install keyloggers in the browser, and damage the reputation of a trusted website.
A common problem in existing XSS prevention techniques is their inability to detect unknown or new XSS attacks [2]. Highly effective XSS detection models can be built by using ensemble learning techniques. AdaBoost, bagging, Extra-Trees, gradient boosting, random forest, and histogram-based gradient boosting are ensemble methods that use base models such as decision trees.
Cross-site scripting attacks are categorized into three types: persistent (stored), non-persistent (reflected), and DOM-based. Many existing solutions focus on preventing only one type of XSS attack, and only a few address all types [3]. The proposed ensemble learning models can detect all types of attacks when properly deployed on both the server side and the client side.
Ensemble methods combine multiple models to achieve a better prediction rate. Usually, ensemble learning uses the same base learning algorithm for every member. A limitation of ensemble methods is that they require more computation than a single model. In ensemble learning, base models are combined in three ways.
Bagging: In bagging (bootstrap aggregation), weak learners are trained on small samples drawn from the dataset, and the predictions of all learners are averaged. Bagging decreases variance.
Boosting: Boosting is an iterative method in which sample weights are adjusted based on the previous round of classification. Boosting decreases bias.
Stacking: In stacking, the output of one model is given as input to another model. Stacking decreases variance or bias, depending on the models used.
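The three ways of combining base models can be illustrated with a minimal scikit-learn sketch. The synthetic data, the choice of base learners, and all hyperparameters below are assumptions made for illustration and are not taken from this paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 600 samples with 20 numeric features.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

combiners = {
    # Bagging: every tree is fitted on a bootstrap sample and the votes are combined.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=25, random_state=0
    ),
    # Boosting: depth-1 trees are fitted one after another on reweighted samples.
    "boosting": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0
    ),
    # Stacking: a final model is trained on the outputs of the base models.
    "stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(random_state=0)),
            ("svm", SVC(random_state=0)),
        ],
        final_estimator=LogisticRegression(),
    ),
}

scores = {}
for name, model in combiners.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

The base estimator is passed positionally because the keyword for it was renamed between scikit-learn releases.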
The purpose of this paper is to investigate and compare the prediction accuracy of machine learning ensemble methods in detecting Cross-site scripting attacks in Web Applications.
The paper is organized as follows: Section 2 contains related work. Section 3 describes how we prepared the XSS data for training and testing. Section 4 covers the implementation of the ensemble learning models. Section 5 analyzes the performance of the proposed ensemble models. Sections 6 and 7 contain the conclusion and future work.

II. RELATED WORK
Rodriguez et al. [4] analyzed 67 documents related to XSS attacks. According to their research, most studies use browser tools or web page analysis methods to prevent XSS attacks, and very few apply machine learning algorithms. They found that the most common issues in existing studies are detection of only one type of XSS attack, little attack data, restriction to a single programming environment such as PHP, reuse of the same data across studies, methods that do not scale, high false-positive rates, methods that work in only one browser, and few methods that use artificial intelligence.
S. Gupta and B. B. Gupta [5] did a study on defense mechanisms of XSS attacks, and they stated that safe input handling is one of the essential techniques to mitigate XSS attacks. A good XSS defensive technique needs to differentiate malicious code and legitimate JavaScript code automatically.
Hydara et al. [6] studied 115 research papers on XSS attacks. Based on their study, non-persistent XSS attacks are the most common, and there is a need for solutions that remove XSS vulnerabilities from the source code.
Shanmugasundaram et al. [7] stated that developers lack knowledge of how to implement existing XSS solutions in their web applications. The study by Aliga et al. [8] showed that most XSS prevention solutions are client-side, are unable to detect new XSS attacks, and lack self-learning capabilities. They reviewed 15 XSS prevention techniques, of which only two have self-learning capabilities.
Nunan et al. [9] used supervised ML methods such as Naive Bayes and SVM to detect XSS attacks. Their data set contains 216054 samples, of which 15366 are XSS attacks. They evaluated the algorithms based on accuracy, detection rate, and false alarm rate. Their results show that SVM outperformed Naive Bayes. For classifying XSS attacks, they selected features such as code obfuscation, the number of domains, URL length, duplicated special characters, and URL schemes.
Mereani and Howe [10] developed Random Forest, kNN, and SVM models to detect malicious XSS code, using labeled data for training. They trained on 2000 samples and tested on 13000. In their experiments, they reached an accuracy of up to 99.75%. They extracted a total of 59 features in two categories: structural features, which cover the set of special characters found in malicious JavaScript, and behavioural features, which cover the functions and commands used in malicious JavaScript code.
Rathore et al. [11] developed an ML method to detect XSS attacks on social networking services (SNSs). They extracted webpage features, URL features, and SNS features from web pages and used this data to train models. The features include the domains in a URL, URL length, iframe count, external link count, and malicious JavaScript code in SNS webpages. They built a test dataset from 1000 SNS pages and evaluated different classifiers, achieving 97.2% accuracy.
Akaishi and Uda [12] used a combination of classifiers to detect XSS attacks. Their data set contains 10000 balanced samples, with the attack data in URL format. They divided each attack sentence into words and used the co-occurrence and frequency of words in their classification. They used a word2vec-based model to transform words into vectors and used those vectors in classification algorithms. According to them, CNN and SVM are the best filters for real-world problems.
Mokbal et al. [13] proposed a multilayer perceptron-based model to detect XSS attacks. Their model achieved an accuracy of 99.32% in detecting attacks. Their dataset contains 138569 samples, of which 38569 are attack samples. They extracted URL-based, HTML-based, and JavaScript-based features from the content and used these features to train the proposed model. The features include URL length, special characters in the URL, HTML tags, and JavaScript events.
Wang, Cai, and Wei [14] proposed a deep learning-based framework to detect malicious JavaScript. The framework combines logistic regression, a deep learning method, and sparse random projection. They extracted features from JavaScript code by using stacked denoising autoencoders (SdA). These features were used to train SVM or logistic regression models, and the classification of malicious code was done by logistic regression. Their labeled dataset contains 14783 malicious JavaScript samples and 12320 benign samples. Their model achieved 94.9% accuracy.

III. DATA COLLECTION AND DATA PREPROCESSING
For this research, we collected thousands of XSS attack vectors by using popular XSS tools such as XSStrike and XSSER [15] and from other sources. The dataset contains 154626 unique labeled samples. Half of the dataset (77313 samples) consists of XSS attack vectors, and the other half (77313) consists of safe vectors. XSS attack vectors and safe vectors are limited to 128 characters, and longer sequences are split into 128-character chunks. Fig. 1 shows the safe vector generator used to produce the safe vector samples. The randomly generated safe vectors range in length from 40 to 126 characters and are of three types: strings with only uppercase or lowercase letters, strings with letters and digits, and strings with letters, digits, and special characters.
The number of safe vectors generated depends on the number of XSS attack vectors, so that the dataset stays balanced between XSS and safe samples. This balanced dataset is used to train and test the models. The examples below show sample XSS attack vectors.
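A minimal sketch of such a safe vector generator and of the 128-character chunking is given here. It is not the generator of Fig. 1: the function names, the exact character sets (in particular mixed-case letters for the first type and `string.punctuation` for the special characters), and the fixed seed are assumptions made for illustration.

```python
import random
import string

MAX_LEN = 128  # fixed vector length stated in the text

# The three safe-vector types; the exact character sets are assumptions.
ALPHABETS = {
    "letters": string.ascii_letters,
    "alnum": string.ascii_letters + string.digits,
    "alnum_special": string.ascii_letters + string.digits + string.punctuation,
}


def generate_safe_vector(kind, rng):
    """Return one random benign string of length 40..126 from the chosen alphabet."""
    length = rng.randint(40, 126)  # both bounds are inclusive
    return "".join(rng.choice(ALPHABETS[kind]) for _ in range(length))


def split_into_chunks(text, size=MAX_LEN):
    """Split a long sequence into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


rng = random.Random(42)  # seeded only so that the sketch is reproducible
safe_vectors = [
    generate_safe_vector(kind, rng) for kind in ALPHABETS for _ in range(3)
]
chunks = split_into_chunks("A" * 300)  # a 300-character stand-in sequence
print(len(safe_vectors), [len(chunk) for chunk in chunks])
```

In practice the number of generated strings would be set to the number of collected attack vectors so that the two classes stay balanced.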
1. <img src="http://www.example.org/theerrornoimg.file" onerror=alert(" hi, here XSS Problem");>
2. <script\x20type="text/javascript">javascript:alert (19);</sc ript>
To prepare input for the models, we converted the character sequences of XSS attacks and safe vectors into Unicode integer format; Fig. 2 shows sample data in Unicode format. The dataset is standardized by using sklearn's [16] StandardScaler function, since standardizing a dataset can improve the performance and accuracy of machine learning algorithms. Fig. 3 shows sample data after standardization, without the output column. The process of preparing the dataset for model training is shown in Fig. 4.

IV. IMPLEMENTATION OF ENSEMBLE METHODS
In this research, supervised ensemble machine learning methods are used to detect XSS attacks. The ensemble learning methods [17] used are random forest classification, AdaBoost, bagging with SVM, Extra-Trees, gradient boosting, and histogram-based gradient boosting. These ensemble classification methods are effective in detecting XSS attacks compared to base models.
The dataset contains 154626 unique, balanced samples: 77313 safe vectors and 77313 XSS attacks. The samples are divided in an 8:2 ratio into training (123700) and testing (30926) sets. The confusion matrix [20] is used to calculate the performance metrics of each model.
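The metrics named in Section V (accuracy, precision, recall, and F-measure) follow from the four counts of a binary confusion matrix. A minimal sketch using the standard definitions is given here; the function names and the ten toy labels are hypothetical, and treating the XSS class as the positive class is an assumption.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, where `positive` marks the XSS class (assumption)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Standard metric definitions; empty denominators yield 0.0."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# 8:2 split of the 154626 samples, rounding the training share down.
TOTAL = 154626
n_train = TOTAL * 8 // 10
n_test = TOTAL - n_train

# Ten toy labels (hypothetical), 1 = XSS attack and 0 = safe vector.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
result = metrics(*counts)
print(n_train, n_test, counts, result)
```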

A. Random Forest Classifier (RF)
A random forest classifier contains a collection of decision trees, each fitted on a subset of the dataset. Based on the outputs of all decision trees, the random forest classifier decides the final class of an input object. Table I shows the confusion matrix of the random forest model, and Table II shows
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 5, 2020

B. AdaBoost Classifier (AB)
Boosting algorithms are used to reach high accuracy. AdaBoost (Adaptive Boosting) is a popular ensemble boosting algorithm that works on decision trees. AdaBoost combines multiple low-performing classifiers to obtain a high-performing classifier. In every AdaBoost iteration, the training data are reweighted based on the accuracy of the previous round of training. The confusion matrix of the AdaBoost classifier model is shown in Table III, and Table IV shows

C. Bagging Classifier with SVM (BC)
Bootstrap aggregating (bagging) is an ensemble method in machine learning. SVM is used as the base classifier for the bagging model. In bagging, the base classifiers are trained on randomly selected subsets of the original dataset, and the final prediction depends on the predictions of the individual base classifiers. The confusion matrix of the bagging model is shown in Table V, and Table VI shows

D. Extra-Trees Classifier (ET)
The Extra-Trees (Extremely Randomized Trees) method is an ensemble method similar to the random forest classifier. In the Extra-Trees classifier, the decision trees in the forest are constructed randomly and trained on subsets of the data. The final prediction depends on the predictions of all decision trees. The confusion matrix values of the Extra-Trees classifier model are shown in Table VII, and Table VIII shows

E. Gradient Boosting Classifier (GB)
The gradient boosting classifier is an ensemble boosting algorithm in which a weak classifier is turned into a strong classifier. In the gradient boosting classifier, decision trees are the base classifiers, and a loss function is optimized as each new tree is added. Table IX shows the confusion matrix of the gradient boosting model, and Table X shows

V. RESULTS AND DISCUSSION
This research evaluated the XSS detection rate of ensemble learning techniques. AdaBoost, bagging with SVM, Extra-Trees, gradient boosting, random forest classification, and histogram-based gradient boosting models were trained on a large labeled dataset, and their performance was evaluated based on accuracy, recall, precision, and F-measure. Table XIII compares the performance metrics of all models, Table XIV compares their cross-validation scores, and Fig. 6 shows the mean cross-validation score of each model. The results show that all ensemble methods performed well, each reaching an accuracy of more than 98%.
Of all the ensemble machine learning algorithms tested, the histogram-based gradient boosting classification model performed best, with the highest accuracy of 0.9989.

VI. CONCLUSION
We developed and analyzed supervised ensemble machine learning methods to detect XSS attacks in Web applications. Ensemble learning techniques combine a collection of base classifiers, and these ensemble methods perform better than single classifiers. Existing solutions that detect XSS attacks with machine learning methods have issues such as reliance on single base classifiers, small datasets, and unbalanced datasets. We trained and evaluated the proposed models on a large balanced dataset, and in this research we detect XSS attacks in data submitted by the user. In this work, we evaluated the performance of random forest classification, AdaBoost, bagging with SVM, Extra-Trees, gradient boosting, and histogram-based gradient boosting models in detecting XSS attacks and safe vectors. We compared the performance of the models by using the confusion matrix metrics. The results show that all ensemble learning models performed exceptionally well in detecting XSS attacks and safe vectors. We reached the highest accuracy of 0.9989 with the histogram-based gradient boosting classification model.

VII. FUTURE WORK
In future, the work can be extended to detect other Web application attacks such as SQL injection. The models can also be tested by integrating them into real-world applications to detect attacks.