Malicious URL Detection based on Machine Learning

www.ijacsa.thesai.org


I. INTRODUCTION
A Uniform Resource Locator (URL) is used to refer to resources on the Internet. In [1], Sahoo et al. described the characteristics and the two basic components of a URL: the protocol identifier, which indicates which protocol to use, and the resource name, which specifies the IP address or the domain name where the resource is located. Each URL therefore has a specific structure and format. Attackers often change one or more components of this structure to deceive users and spread their malicious URLs. Malicious URLs are links that adversely affect users. These URLs lead users to resources or pages on which attackers can execute code on users' computers, redirect users to unwanted, malicious, or phishing sites, or trigger malware downloads. Malicious URLs can also be hidden in download links that are deemed safe and can spread quickly through file and message sharing in shared networks. Attack techniques that use malicious URLs include [2,3,4]: Drive-by Download, Phishing and Social Engineering, and Spam.
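As a minimal illustration of the two URL components described above, the following sketch splits a URL into its protocol identifier and resource name with Python's standard library. The function name `split_url`, the dictionary keys, and the example URLs are assumptions made for illustration only.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split a URL into its protocol identifier and resource name parts."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,        # protocol identifier, e.g. "http"
        "resource_name": parts.netloc,   # domain name or IP address (with port, if any)
        "host": parts.hostname,          # resource name without the port
        "path": parts.path,
        "query": parts.query,
    }


# Hypothetical example URL, not taken from the dataset.
example = split_url("http://www.example.com/docs/index.html?lang=en")
```

The later feature extraction steps operate on exactly these parts, which is why a reliable split is a useful first step.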
According to the statistics presented in [5], in 2019 attacks that spread malicious URLs ranked first among the 10 most common attack techniques. In particular, according to these statistics, the three main URL spreading techniques, namely malicious URLs, botnet URLs, and phishing URLs, increased in both number of attacks and danger level.
The increase in the distribution of malicious URLs over consecutive years makes clear the need to study and apply techniques or methods to detect and prevent them.
There are currently two main approaches to detecting malicious URLs: detection based on signatures or sets of rules, and detection based on behavior analysis techniques [1,2]. Detection based on a set of signatures or rules can identify known malicious URLs quickly and accurately. However, this method cannot detect new malicious URLs that are not covered by the predefined signatures or rules. Detection based on behavior analysis adopts machine learning or deep learning algorithms to classify URLs based on their behaviors. In this paper, machine learning algorithms are utilized to classify URLs based on their attributes. The paper also includes a new URL attribute extraction method.
In our research, machine learning algorithms are used to classify URLs based on the features and behaviors of URLs. The features are extracted from static and dynamic behaviors of URLs and are new to the literature. These newly proposed features are the main contribution of the research. Machine learning algorithms are one part of the whole malicious URL detection system. Two supervised machine learning algorithms are used: Support Vector Machine (SVM) and Random Forest (RF).
The paper is organized as follows. Section II reviews some recent works in the literature on malicious URL detection. The proposed malicious URL detection system using machine learning is presented in Section III. In this section, the new features for the URL detection process are also described in detail. Experimental results and discussions are provided in Section IV. The paper is concluded in Section V.

A. Signature based Malicious URL Detection
Studies on malicious URL detection using signature sets have long been carried out and applied [6,7,8]. Most of these studies use lists of known malicious URLs. Whenever a new URL is accessed, a database query is executed. If the URL is blacklisted, it is considered malicious and a warning is generated; otherwise, the URL is considered safe. The main disadvantage of this approach is that it is very difficult to detect new malicious URLs that are not in the given list.
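The lookup described above can be sketched as follows. This is a simplified illustration, not the method of any cited system: the blacklist entries, the `normalize` helper, and the matching rule (full host plus path, or host alone) are all assumptions.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist entries; a real system would query a database.
BLACKLIST = {"malicious.example", "phish.example/login"}


def normalize(url: str) -> str:
    """Reduce a URL to lower-case host plus path, without a trailing slash."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    return host + parts.path.rstrip("/")


def check_url(url: str, blacklist=BLACKLIST) -> str:
    """Return "malicious" if the URL or its host is blacklisted, else "safe"."""
    key = normalize(url)
    host = key.split("/", 1)[0]
    return "malicious" if key in blacklist or host in blacklist else "safe"
```

A call such as `check_url("https://new-malicious.example/")` returns "safe" simply because the host is not yet listed, which is the weakness of the signature-based approach noted above.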

B. Machine Learning based Malicious URL Detection
Three types of machine learning algorithms can be applied to malicious URL detection: supervised learning, unsupervised learning, and semi-supervised learning. The detection methods are based on URL behaviors.
In [1], a number of malicious URL detection systems based on machine learning algorithms have been investigated. Those machine learning algorithms include SVM, Logistic Regression, Naïve Bayes, Decision Trees, Ensembles, Online Learning, etc. In this paper, two algorithms, RF and SVM, are used. The accuracy of these two algorithms with different parameter setups is presented in the experimental results.
The behaviors and characteristics of URLs can be divided into two main groups: static and dynamic. In [9,10,11], the authors presented methods for analyzing and extracting static behaviors of URLs, including Lexical, Content, Host, and Popularity-based features. The machine learning algorithms used in these studies are Online Learning algorithms and SVM. Malicious URL detection using dynamic actions of URLs is presented in [12,13]. In this paper, URL attributes are extracted based on both static and dynamic behaviors. The attribute groups investigated include the character and semantic group; the abnormal group in websites and the host-based group; and the correlated group.

C. Malicious URL Detection Tools
• URL Void: URL Void is a URL checking program that uses multiple engines and domain blacklists, for example Google SafeBrowsing, Norton SafeWeb and MyWOT. The advantage of URL Void is its compatibility with many different browsers and its support for many other testing services. Its main disadvantage is that the malicious URL detection process relies heavily on a given set of signatures.
• UnMask Parasites: Unmask Parasites is a URL testing tool that downloads the provided links and parses the Hypertext Markup Language (HTML) code, especially external links, iframes and JavaScript. The advantage of this tool is that it can detect iframes quickly and accurately. However, it is only useful if the user already suspects that something strange is happening on their site.
• Dr.Web Anti-Virus Link Checker: Dr.Web Anti-Virus Link Checker is an add-on for Chrome, Firefox, Opera, and IE that automatically finds and scans malicious content in download links and in links on social networking sites such as Facebook, Vk.com, and Google+.
• Comodo Site Inspector: This is a malware and security hole detection tool. It helps users check URLs and enables webmasters to set up daily checks by downloading all the specified sites and running them in a sandboxed browser environment.
• Some other tools: Besides the typical tools mentioned above, there are other URL checking tools, such as UnShorten.it, VirusTotal, Norton Safe Web, SiteAdvisor (by McAfee), Sucuri, Browser Defender, Online Link Scan, and Google Safe Browsing Diagnostic.
The analysis and evaluation of the malicious URL detection tools presented above show that the majority of current tools are signature-based URL detection systems. Therefore, the effectiveness of these tools is limited.

III. MALICIOUS URL DETECTION USING MACHINE LEARNING
A. The Model
Fig. 1 presents the proposed malicious URL detection system using machine learning. The model contains two stages: training and detection.
• Training stage: To detect malicious URLs, it is necessary to collect both malicious URLs and clean URLs. All the malicious and clean URLs are then correctly labeled and passed to attribute extraction. These attributes are the basis for determining which URLs are clean and which are malicious; they are presented in detail in this paper. Finally, the dataset is divided into two subsets: training data used for training the machine learning algorithms, and testing data used for the testing process. If the classification performance of the machine learning model is good (high classification accuracy), the model is used in the detection stage.
• Detection stage: Detection is performed on each input URL. First, the URL goes through the attribute extraction process. Next, the extracted attributes are input to the classifier, which decides whether the URL is clean or malicious.

B. URL Attribute Extraction and Selection
In [1], the authors listed the main attribute groups for malicious URL detection as follows.
Lexical features: these features include URL length, main domain length, maximum domain token length, average path length, and average token length in the domain.
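The lexical features listed above can be computed directly from the URL string. The sketch below is one possible reading of them, under stated assumptions: the "main domain" is taken to be the full host name, domain tokens are split on dots, and the path average length is taken as the mean length of path tokens split on common delimiters. The function and key names are illustrative and are not the feature names used in Tables I to III.

```python
import re
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Compute simple lexical features from a URL string (no network access)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    domain_tokens = [t for t in host.split(".") if t]
    # Assumption: path tokens are separated by these delimiter characters.
    path_tokens = [t for t in re.split(r"[/\-_.?=&]", parts.path) if t]

    def mean_len(tokens):
        return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0

    return {
        "url_length": len(url),
        "domain_length": len(host),
        "max_domain_token_length": max((len(t) for t in domain_tokens), default=0),
        "avg_domain_token_length": mean_len(domain_tokens),
        "avg_path_token_length": mean_len(path_tokens),
    }


# Hypothetical example URL.
features = lexical_features("http://login.example.com/secure/update.php")
```

Because these features need only string operations, they are cheap to compute for every input URL in the detection stage.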
Host-based features: these features are extracted from the host characteristics of the URLs. These attributes indicate the location and the identity of malicious servers, as well as the degree to which several host-based features contribute to the URL's malicious level.
Content-based features: these features are acquired when a whole web page is downloaded. Extracting them is a heavy workload, since a lot of information needs to be extracted, and there may be security concerns about accessing the URL. However, with more information available about a particular site, a better prediction model can be expected. The content-based features of a website are extracted primarily from its HTML content and its use of JavaScript.
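As a rough sketch of content-based extraction, the code below counts iframes, script tags, and external links in HTML that has already been downloaded. It assumes the page was fetched safely elsewhere (for example in a sandbox); the class name, the choice of counted elements, and the sample page are illustrative assumptions rather than the feature set of this paper.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class ContentFeatureParser(HTMLParser):
    """Count iframes, scripts and links pointing to other hosts."""

    def __init__(self, page_host: str):
        super().__init__()
        self.page_host = page_host
        self.counts = {"iframes": 0, "scripts": 0, "external_links": 0}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "iframe":
            self.counts["iframes"] += 1
        elif tag == "script":
            self.counts["scripts"] += 1
        elif tag == "a":
            # Relative links have no host and are treated as internal.
            host = urlsplit(attrs.get("href") or "").hostname
            if host and host != self.page_host:
                self.counts["external_links"] += 1


def content_features(html: str, page_host: str) -> dict:
    parser = ContentFeatureParser(page_host)
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical page content.
sample = """<html><body>
<iframe src="http://other.example/frame"></iframe>
<script>var x = 1;</script>
<a href="http://other.example/a">out</a>
<a href="/local">in</a>
</body></html>"""
counts = content_features(sample, "www.example.com")
```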
The above are the three main attribute groups commonly used by researchers to detect malicious URLs. However, each study makes its own decision on suitable attributes and characteristics for its particular experimental dataset. In this paper, the use of all three attribute groups is recommended. In addition, new attributes and characteristics of the URL are proposed in each attribute group to improve the ability to detect malicious URLs. The new attributes for malicious URL detection in this research are listed in Tables I, II, and III. All attributes marked "*" in Tables I, II, and III are newly extracted and selected in this research. Moreover, in previous research, authors tend to use feature extraction and selection methods based on a group of predefined features. However, those recommended features are specialized and not widely used. As a result, it is usually difficult to implement those features in other works and to re-evaluate their detection performance. In this work, we try to combine basic features to formulate new ones.

C. Machine Learning Algorithm Selection
The application of machine learning algorithms to malicious URL detection has been widely studied [1]. In this paper, two commonly used supervised machine learning algorithms, RF and SVM [14,15], are used.
In this research, machine learning algorithms are the last piece of the puzzle that completes our proposed malicious URL detection system. These algorithms are suitable for exploiting the new features selected for malicious URL detection. The machine learning algorithms are already well investigated in the literature. In this work, SVM and RF are selected as examples to illustrate the good performance of the whole detection system, and are not our main focus. Readers are encouraged to implement other algorithms such as Naïve Bayes, decision trees, k-nearest neighbors, neural networks, etc.
To explore the effectiveness of these two algorithms, different parameter settings are used.
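Trying RF and SVM under different parameter settings can be sketched with scikit-learn as below. This is an illustration only: the paper reports using Spark libraries, the synthetic data stands in for the extracted URL feature vectors, and all parameter values other than the 100-tree forest mentioned in the results are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for URL feature vectors and their labels (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Candidate models with illustrative parameter settings.
candidates = {
    "RF (10 trees)": RandomForestClassifier(n_estimators=10, random_state=0),
    "RF (100 trees)": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM (C=1)": SVC(C=1.0, kernel="rbf"),
    "SVM (C=10)": SVC(C=10.0, kernel="rbf"),
}

# Fit each candidate and record its accuracy on the held-out data.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
```

Comparing the entries of `scores` side by side mirrors the kind of per-setting comparison reported in the experimental results.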

A. Dataset and Experiment Environments
1) Experiment dataset:
The experimental dataset for the malicious URL detection model includes 470,000 URLs collected from [16,17,18,19], of which about 70,000 URLs are malicious and 400,000 URLs are safe. All these URLs are checked with the VirusTotal tool to verify their labels. The complete dataset is stored in CSV format. Each URL sample has a label, "bad" for malicious and "good" for safe. Details of the data sources are as follows:
• Phishtank [16]: Phishtank is a service website dedicated to sharing phishing URLs. Suspicious URLs can be sent to Phishtank for verification. The data in Phishtank is updated hourly.
• URLhaus [17]: URLhaus is a project from abuse.ch that aims at sharing malicious URLs used for malicious software distribution.
• Alexa [18]: Alexa is a database that ranks websites according to their usefulness.
• Malicious_n_Non-Malicious URL [19]: a data source with more than 400,000 labeled URLs. In this database, 82% of all URLs are safe, while the remaining 18% are malicious.
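Loading a CSV file with "good" and "bad" labels, as described above, might look like the following sketch. The column names `url` and `label` and the sample rows are assumptions; the actual layout of the dataset file is not specified here.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the column names are assumptions.
SAMPLE_CSV = """url,label
http://login.example.com/update.php,bad
https://www.example.org/,good
https://docs.example.net/index.html,good
"""


def load_dataset(handle):
    """Read (url, label) pairs and reject labels other than good/bad."""
    rows = []
    for row in csv.DictReader(handle):
        label = row["label"].strip().lower()
        if label not in {"good", "bad"}:
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((row["url"].strip(), label))
    return rows


dataset = load_dataset(io.StringIO(SAMPLE_CSV))
label_counts = Counter(label for _, label in dataset)
```

Checking `label_counts` right after loading is a quick way to see the class balance, which matters for the accuracy comparison between subsets.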
2) Experimental setup: The dataset of both safe and malicious URLs mentioned above is divided into 2 subsets. About 80% of the dataset, 470,000 URLs (400,000 safe URLs, 70,000 malicious URLs), is used for training, and about 20% of the dataset, about 10,000 URLs (5,000 malicious URLs, 5,000 safe URLs), is used for testing. The experiment is repeated many times with both the SVM and RF algorithms. Different parameter settings are used in different runs.
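A train/test division of labeled URLs can be sketched with the standard library as follows. The split here is stratified by label and uses a 20% test fraction; both choices, along with the function name and the toy samples, are assumptions, since the text does not state how individual URLs were assigned to each subset.

```python
import random


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (url, label) pairs so each label keeps the same test fraction."""
    rng = random.Random(seed)
    by_label = {}
    for url, label in samples:
        by_label.setdefault(label, []).append((url, label))

    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical toy data: 80 safe and 20 malicious URLs.
samples = [(f"http://site{i}.example/", "good") for i in range(80)]
samples += [(f"http://bad{i}.example/", "bad") for i in range(20)]
train, test = stratified_split(samples)
```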
3) Experiment dataset (1)
where: TP (true positive) is the number of malicious URLs correctly labeled; FN (false negative) is the number of malicious URLs misclassified as safe; TN (true negative) is the number of safe URLs correctly labeled; FP (false positive) is the number of safe URLs misclassified as malicious.
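The four counts defined above are the usual inputs to standard classification metrics. The sketch below computes accuracy, precision, recall, and F1 from them using their textbook definitions; which of these metrics the paper's own numbered equations define is an assumption, and the counts in the example are hypothetical rather than results from the experiments.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Textbook metrics from confusion counts (malicious = positive class)."""
    total = tp + fn + tn + fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fn=10, tn=80, fp=20)
```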
Confusion matrix: is a two-way table (Table IV) (5)
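A two-way confusion table can be tallied from actual and predicted labels as in the sketch below. It assumes the "bad"/"good" label strings used for the dataset, with "bad" as the positive class; the function name and the toy label lists are illustrative.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="bad", negative="good"):
    """Count how many samples of each actual label received each predicted label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = Counter(zip(actual, predicted))
    return {
        "TP": pairs[(positive, positive)],
        "FN": pairs[(positive, negative)],
        "FP": pairs[(negative, positive)],
        "TN": pairs[(negative, negative)],
    }


# Hypothetical labels for six URLs.
actual = ["bad", "bad", "bad", "good", "good", "good"]
predicted = ["bad", "bad", "good", "good", "bad", "good"]
table = confusion_counts(actual, predicted)
```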

• Training performance
To evaluate the training performance of the machine learning algorithms, the two data subsets are used individually. The subsets differ in size and in the distribution of data labels, which may result in different training performance. The results are presented in Table V.
Experimental results show that RF with 100 trees gives the best predictive result. In return, the training time of RF is slightly longer than that of SVM, but the testing time is not much different. The accuracy on the second dataset is lower due to the imbalance between safe and malicious URLs in the data. As expected, the RF algorithm, with its fast speed and high accuracy, is very suitable for this classification problem. Besides, in our research, when the machine learning algorithms are combined with Spark libraries, the training and testing time can be reduced significantly. SparkML is a library package that provides and supports many machine learning algorithms such as SVM, RF, Naïve Bayes, Regression, Clustering, and Collaborative Filtering. It is a suitable tool for applying machine learning algorithms with fast and accurate processing on large datasets.

V. CONCLUSIONS
In this paper, a method for malicious URL detection using machine learning is presented. The empirical results in Tables V and VI show the effectiveness of the proposed extracted attributes. In this study, we neither use special attributes nor seek to create huge datasets to improve the accuracy of the system, as many other traditional publications do. Instead, easy-to-calculate attributes are combined with big data processing technologies to balance two factors: the processing time and the accuracy of the system. The results of this research can be applied and implemented in information security technologies and information security systems. The results of this article have been used to build a free tool [20] to detect malicious URLs on web browsers.

TABLE III. LIST OF URL FEATURES IN CORRELATED FEATURE GROUP
representing how many samples are classified into which label accordingly.


Testing results: In this paper, an additional small testing dataset, with 107 safe URLs and 118 malicious URLs, is used to evaluate the performance of the best machine learning algorithm discussed above, RF (100). The results are presented in Table VI.

TABLE V. TRAINING PERFORMANCE OF MALICIOUS URL DETECTION SYSTEM