Machine Learning-Based Phishing Attack Detection

This paper explores machine learning techniques and evaluates their performance when trained on datasets of features that can differentiate a phishing website from a safe one. The ability to tell these sites apart is vital in modern-day internet use. As more and more of our resources shift online, a single vulnerability or leak of sensitive information could bring down everything in a connected network. The objective of this research is to highlight the best technique for identifying one of the most commonly occurring cyberattacks, thus allowing faster identification and blacklisting of such sites and leading to a safer and more secure web surfing experience for everyone. To achieve this, we describe each of the techniques we examine in detail and use different evaluation techniques to portray their performance visually. After comparing all of these techniques against each other, we conclude that the Random Forest Classifier works best for phishing website detection.

Keywords: Phishing attack; phishing attack detection; phishing website detection; machine learning; random forest classifier


I. INTRODUCTION
Phishing attacks are among the most common forms of attack in the digital world today. Any method of communication can be used to trick an individual into leaking confidential data in a fake environment. This data can later be used to harm the victim alone or even an entire business, depending on the attacker's intent and the type of data leaked.
Phishing attacks, while dangerous, can be avoided by creating awareness and developing the habit of staying alert while surfing the internet, clicking links only after verifying that their source is trustworthy. There are also tools, such as browser extensions, that notify users when they have entered their credentials on a fake site and may have had those credentials transferred to a user with malicious intent. Other tools allow networks to lock down everything and permit access only to whitelisted sites, providing extra security at the cost of some convenience on the user side [1].
In a related study, five main reasons have been stated behind users falling into traps of phishing attack schemes:
• Lack of knowledge about URLs.
• Lack of knowledge about trusted websites.
• Lack of visibility of full web addresses due to redirection or hidden URLs.
• Lack of time for analyzing URLs, and accidental entries of some web pages.
• Lack of capability of telling phishing web pages apart from legitimate ones.
One example of such an attack is the 2016 Bangladesh Bank cyber heist. Hackers issued thirty-five fraudulent instructions via the SWIFT network to illegally transfer almost US$ 1 billion from the Federal Reserve Bank of New York account belonging to Bangladesh Bank. Five of these thirty-five instructions succeeded in transferring US$ 101 million, with US$ 20 million traced to Sri Lanka and US$ 81 million traced to the Philippines. Fortunately, the Federal Reserve Bank of New York was able to block the remaining thirty transactions. Without this block, another US$ 850 million would have been lost. The block was possible thanks to a misspelled instruction that raised suspicions among the authorities. The money transferred to Sri Lanka was all recovered, but of the US$ 81 million transferred to the Philippines, only US$ 18 million was recovered. Most of the money transferred to the Philippines was collected into four personal accounts [2].
The attack is suspected to have used the Dridex malware, which specializes in stealing bank credentials by means of macros embedded in Word or Excel documents. Windows users can fall victim to such an attack if they open an email attachment in Word or Excel that contains such a macro. Once activated on opening the document, the macro downloads Dridex, which then infects the computer and sets the stage for a banking theft. A knowledgeable and alert employee, or software that aids in detecting such an attack, would have helped immensely in this event [3].
Machine learning algorithms are widely used to detect hidden patterns in datasets. The most common algorithms are K-nearest neighbor, decision trees, random forest, and support vector machine [4]. In addition, a belief rule-based expert system can mine rules from a dataset [5] [6].
In this paper, we focus on training machine learning models that can tell phishing web pages apart from real web pages. We analyze each of these models and state our findings to give others a clear understanding of how the models perform when trained for this purpose. Data preprocessing is crucial for the models to work as they did in our case, and it is an essential part of the procedure. Papers from other researchers contributed immensely to our research, and we hope our paper will do the same by providing a collection of our findings regarding phishing detection using machine learning.
The remainder of the paper is organized as follows. Section II reviews the literature, and Section III presents the proposed methodology. The empirical results of the proposed approach are explained in Section IV, followed by Section V, where a conclusion and further research scope are discussed.

II. LITERATURE REVIEW
A. Types of Phishing Attacks
1) Algorithm-Based phishing: Attackers access sensitive information from a website's database by employing different algorithms. V. Shreeram, M. Suban, P. Shanthi, and K. Manjula proposed an anti-phishing detection method that detects phishing hyperlinks with the help of a rule-based system formulated from a genetic algorithm (GA). A phishing link is detected if it matches the ruleset created by the GA, which is stored in a database [7].
2) Deceptive phishing: This technique involves supplying clients with malicious links via emails and redirecting them to malicious websites where they are likely to enter sensitive information. Huajun Huang, Junshan Tan, and Lingxi Liu give a thorough overview of deceptive phishing attacks and different anti-phishing techniques. They present the different methods used by phishers and the advantages and disadvantages of the different countermeasures [8].
3) URL phishing: Hackers can inject hidden links that redirect to malicious pages into a URL, where one may not expect to find them. Mohammed Nazim Feroz and Susan Mengel propose a method to detect URL phishing with URL ranking. They classify URLs by their lexical and host-based features, and categorize and rank them using online URL reputation services [9].
4) Hosts file poisoning: Replacing hostnames in the host records can override the usual process in which DNS servers retrieve the actual IP addresses from beyond the network. This technique can poison the records so that valid URLs meant to lead to secure sites lead to malicious pages instead, due to compromised IP associations in the server. Saeed Abu-Nimeh and Suku Nair propose a new attack that can bypass security toolbars and phishing filters by using DNS poisoning. They use spoofed DNS cache entries to create fake results and successfully attack four renowned security toolbars and the phishing filters of three popular browsers without being detected [10].

5) Content injection phishing:
Data collection is achieved in this technique by injecting malicious sections into a real website. Jussi-Pekka Erkkil presents the different methods by which phishing techniques can trick a person and lists several strategies that can detect phishing. The paper proposes that companies adopt effective protocols to keep their security features up to date [11].
6) Clone phishing: Duplicating an already sent email and attaching a malicious link to it can allow for a successful attack on an unsuspecting user. Ahmad Alamgir Khan proposed a new method in which websites use a one-time password and a user-machine identification system to combat phishing attacks. Web servers send a one-time password to a user by SMS or email and create an encrypted token for the device after the user inputs the password [12].

1) Blacklist filter:
Blacklists can be maintained to block recorded unwanted sites from reaching the client's machine. These filters can be applied in different security measures such as DNS servers, firewalls, and email servers. A blacklist filter maintains a list of elements, such as IP addresses, domains, and IP netblocks, that are commonly used by phishers. Adam Oest, Yeganeh Safaei, Adam Doupé, Gail-Joon Ahn, Brad Wardman, and Kevin Tyers use a scalable framework to test the effectiveness of browser blacklist filters. Their study concluded that most blacklist filters in mobile browsers failed to combat phishing attacks and are more vulnerable [13]. Mohsen Sharifi and Seyed Hossein Siadati propose a method that creates a blacklist generator and keeps timely track of phishing website blacklists. Their technique yields an accuracy of 91% and 100% in detecting real pages and phishing websites, respectively [14].
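The blacklist lookup described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the blocked domains, the netblock, and the function name `is_blacklisted` are hypothetical and are not taken from any of the cited systems.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical blacklist entries, invented for this sketch.
BLOCKED_DOMAINS = {"login-verify.example", "secure-update.example"}
BLOCKED_NETBLOCKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_blacklisted(url: str) -> bool:
    """Return True if the URL's host matches a blocked domain, IP, or netblock."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return False
    try:
        # Host is a literal IP address: test it against every blocked netblock.
        address = ipaddress.ip_address(host)
    except ValueError:
        # Host is a name: block exact matches and subdomains of blocked domains.
        return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)
    return any(address in net for net in BLOCKED_NETBLOCKS)
```

A filter of this kind only catches hosts that have already been recorded, which is consistent with the need for timely blacklist updates noted in [14].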
2) Whitelist filter: Unlike a blacklist, a whitelist filter allows recorded website URLs, schemes, or domains through to the client machine and blocks all other, unrecorded sites. A whitelist thus maintains a list of all legitimate websites. A. Belabed, E. Aïmeur, and A. Chikh propose a method that combines the whitelist approach with machine learning. A support vector classifier is used to further filter the websites that are not blocked by the whitelist filter [15]. Linfeng Li, Marko Helenius, and Eleni Berki conducted tests that compared the effectiveness of blacklist and whitelist anti-phishing toolbars. Their study did not find a significant difference in performance between the two toolbars but encourages toolbars to be more instructive in helping users identify phishing websites [16].
3) Pattern matching filter: This filter checks whether individual tokens or sequences of data are contained within a given list of data by using a pattern matching technique. Rahamathunnisa Usuff, N. Manikandan, U.S. Kumaran, and C. Niveditha propose a method that uses pattern matching to detect phishing websites. A database of blacklist and whitelist entries, containing malicious URL patterns and original URL patterns, is matched against the URL requested by the user [17].
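As a rough sketch of pattern matching against whitelist and blacklist entries, the example below checks a requested URL against two small lists of regular expressions. The patterns, the order of the checks, and the name `classify_url` are assumptions made for illustration and do not reproduce the database used in [17].

```python
import re

# Hypothetical URL patterns, invented for this sketch.
WHITELIST_PATTERNS = [re.compile(r"^https://([a-z0-9-]+\.)*bank\.example/")]
BLACKLIST_PATTERNS = [
    re.compile(r"^https?://\d{1,3}(\.\d{1,3}){3}[:/]"),  # raw IP address as host
    re.compile(r"^https?://[^/]*@"),                      # '@' before the first slash
]


def classify_url(url: str) -> str:
    """Return 'legitimate', 'phishing', or 'unknown' by pattern lookup."""
    if any(p.search(url) for p in WHITELIST_PATTERNS):
        return "legitimate"
    if any(p.search(url) for p in BLACKLIST_PATTERNS):
        return "phishing"
    return "unknown"
```

URLs that match neither list fall through as "unknown", which is the case a trained classifier would then have to handle.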

1) Malicious domain detection:
Machine learning models are being trained to optimize their ability to detect phishing pages, one of the most common forms of cyberattack. Nitay Hason, Amit Dvir, and Chen Hajaj propose a robust feature selection mechanism that creates better malicious domain detection models. All of the data are collected from 5000 legitimate URLs and 1350 harmful URLs. The models created are robust to different malicious abnormalities and show the effectiveness of models trained on features [18]. Hossein Shirazi, Bruhadeshwar Bezawada, and Indrakshi Ray show concern about the large number of training features and types of datasets used and suggest that the domain name is a much better and more useful basis for detecting phishing websites. Their learning model detects unknown live phishing URLs with an accuracy of 99.7% [19]. Krzysztof Lasota and Adam Kozakiewicz propose a study that shows the similarity of different malicious domain name creations. The main task for detecting malicious behaviors was to detect similarity based on sets of domain names, URL names, and hostnames [20].
2) Email spam filtering: Emails are screened through various scoring techniques based on thousands of rules that predict their probability of being actual spam. If the evaluated probability is beyond the acceptable range, the email is blocked by the spam filter. Phishers use spam emails to direct a client to their malicious webpage and steal data. Andronicus A. Akinyelu and Aderemi O. Adewumi study the effectiveness of the random forest classifier in developing a phishing email classifier, extracting pertinent phishing email features from a dataset of 2000 phishing and ham emails. The proposed machine learning model shows a classification accuracy of 99.7% with low false positives and negatives [21]. Tushaar Gangavaraapu, C.D. Jaidhar, and Bhabesh Chanduka focus on proper ways of extracting features from spam email content and behavior-based features, on the features necessary for detecting spam emails, and on the selection of an important feature set. Their proposed machine learning model, based on their selected features, yields a constant accuracy of 99% on spam emails [22].
Table I illustrates the advantages and limitations of existing phishing detection research. From Table I, we observed that most of the studies consider a small number of features and small datasets. In this research, we try to overcome the limitations observed in Table I.

Table I. Advantages and limitations of existing phishing detection research

Limitations:
* Does not do well with a random dataset without applying a supervised resample filter.

[24] Proposes a machine learning-based method that can detect whether a web page exhibits phishing attacks.
Advantages:
* Proposed method is based on an easy-to-acquire feature vector that does not require additional computation.
Limitations:
* Only uses 10 features for detection.
* Limited dataset of 1353 instances.

[25] Uses feature selection to identify important features that categorize phishing and legitimate websites.
Advantages:
* Feature selection highly improves the accuracy score after implementation.
* Use of feature selection reduces computational time.
Limitations:
* 14 features.
* Limited dataset (200 legitimate URLs and 1400 phishing URLs).
* May not work properly with datasets containing equal numbers of legitimate and phishing URLs.

[26] Builds a system using machine learning that can classify websites using URLs.
Advantages:
* Can be used to build a rule-based system with associative rules to classify URLs.
Limitations:
* 9 features for each URL.
* All features are discrete.
* Limited dataset (1353 URLs).

[27] Proposes a learning-based aggregation analysis mechanism to decide page layout similarity, which is used to detect phishing pages.
Advantages:
* Automatically trains classifiers to determine web page similarity from CSS layout features, which does not require human expertise.
* Method is lightweight as it only takes one class of features, CSS structure.
Limitations:
* Limited by the size of the dataset and distribution of samples.

[28] Uses a new attribute called the "domain top page similarity" to improve the efficiency of a machine learning-based phishing detection model.
Advantages:
* Increases F-measure and reduces the error rate.
* Proves that with better features, the detection rate is much higher and can be implemented in future works.
Limitations:
* The model is highly dependent on the accuracy of the features.

[29] Proposes a real-time anti-phishing system that uses seven classification algorithms and natural language processing (NLP)-based features.

III. PROPOSED METHODOLOGY
In this section, we explain our proposed data-driven phishing website detection system. The dataset is obtained from the online repository of Mendeley. Parallel coordinates, Pearson and Shapiro ranking, and principal component analysis are used for feature extraction. We use KNN, decision trees, random forest, SVM, and logistic regression to detect phishing websites.

A. Dataset
The phishing webpage dataset contains 48 features and is obtained from the online repository of Mendeley. The total number of websites is 10,000, of which 5000 are phishing and 5000 are legitimate websites. The class label 0 indicates a phishing website and 1 a legitimate website.

B. Feature Extraction and uses
For feature extraction, we used parallel coordinates, Pearson and Shapiro ranking, and principal component analysis (PCA). We used parallel coordinates to visualize and analyze our dataset and PCA to reduce its dimensionality. We have explained our features in Table II, Table III, and Table IV. In Table IV, a set of 8 correlation features is shown with data types and descriptions.
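As an illustration of the Pearson ranking step only, the sketch below scores each feature by the absolute value of its Pearson correlation with the class label and sorts the features accordingly. Ranking against the label is an assumption, since the paper does not state how the ranking was configured. The column names and toy values are invented and are not taken from the Mendeley dataset, and Shapiro ranking, parallel coordinates, and PCA are not covered.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / sqrt(var_x * var_y)


def rank_features(columns, labels):
    """Sort (name, correlation) pairs by |correlation with the label|, strongest first."""
    scores = {name: pearson(values, labels) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy data, invented for illustration (hypothetical feature names).
toy_columns = {
    "num_dots": [1, 2, 5, 6, 1, 7],
    "url_length": [20, 45, 30, 25, 40, 35],
    "has_ip": [0, 0, 1, 1, 0, 1],
}
toy_labels = [1, 1, 0, 0, 1, 0]  # 0 = phishing, 1 = legitimate, as in the paper
ranking = rank_features(toy_columns, toy_labels)
```

Features near the top of such a ranking are the strongest linear predictors of the label, and features near the bottom are candidates for removal before training.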

C. Classifiers
We deploy KNN, decision tree, random forest, extra trees, SVM, and logistic regression in our system.

1) K-Nearest Neighbors (KNN):
We calculated distances using the Euclidean method from equation (1), and our KNN model is based on equation (2). Our dataset has 48 features and a class label, where 0 indicates a phishing website and 1 indicates a legitimate website. Given an unknown sample, KNN first measures the Euclidean distance between the unknown sample and its neighbors, using the features of the samples in the dataset. The number of neighbors it checks is the value of K, which is chosen by setting "n_neighbors". The majority class among the closest neighbors is then assigned to the unknown sample.

2) Random forest:
We used Gini importance to calculate a node's importance for each decision tree. This is based on the assumption that the tree is binary, so each node has at most two children. For the elimination of branches in the tree, we used equation (3). For calculating the importance of each feature on a decision tree, we used equation (4). These values can then be normalized to between 0 and 1 by equation (5). Finally, the importance of a feature across the forest is obtained by summing its normalized importance over all trees and dividing by the total number of trees, as in equation (6):

$$RFfi_i = \frac{\sum_{j \in \text{all trees}} normfi_{ij}}{T} \qquad (6)$$

where $normfi_{ij}$ is the normalized importance of feature $i$ in tree $j$ and $T$ is the total number of trees.
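The Gini importance steps described above can be sketched as follows. The sketch assumes the common impurity-decrease formulation, in which a node's importance is its weighted Gini impurity minus that of its two children. The function names and the toy class counts are invented for illustration and are not the authors' code.

```python
def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def node_importance(parent, left, right, n_total):
    """Weighted impurity decrease of one binary split.

    parent, left and right are per-class count lists; n_total is the
    number of samples that reached the root of the tree.
    """
    def weighted(counts):
        return sum(counts) / n_total * gini(counts)

    return weighted(parent) - weighted(left) - weighted(right)


def tree_feature_importance(splits, n_total):
    """Normalized importance per feature for one tree.

    splits is a list of (feature_name, parent, left, right) tuples.
    """
    raw = {}
    for feature, parent, left, right in splits:
        raw[feature] = raw.get(feature, 0.0) + node_importance(parent, left, right, n_total)
    total = sum(raw.values())
    if total == 0:
        return {feature: 0.0 for feature in raw}
    return {feature: value / total for feature, value in raw.items()}


def forest_feature_importance(trees):
    """Average the normalized importances over all trees."""
    features = {feature for tree in trees for feature in tree}
    return {f: sum(tree.get(f, 0.0) for tree in trees) / len(trees) for f in features}


# Toy trees with counts given as [phishing, legitimate]; values are invented.
tree_a = tree_feature_importance(
    [("has_ip", [5, 5], [4, 1], [1, 4]), ("num_dots", [4, 1], [4, 0], [0, 1])],
    n_total=10,
)
tree_b = tree_feature_importance([("has_ip", [5, 5], [5, 0], [0, 5])], n_total=10)
forest = forest_feature_importance([tree_a, tree_b])
```

Because each tree's importances are normalized before averaging, the forest-level values also sum to 1 in this toy case.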
A random forest classifier consists of a large number of decision trees that work as an ensemble. First, it creates a bootstrap dataset of size "N" by randomly sampling from our dataset. The random forest then uses this bootstrap sample to build a tree. For example, if our training data were [a, b, c, d, e, f], we might give one of our trees the list [a, b, b, c, f, f]. Note that both lists are the same size, and that "b" and "f" are repeated in the bootstrap dataset because we sample with replacement. After drawing the bootstrap sample, the algorithm builds a tree, starting with the choice of a root node. Random forest differs from a single decision tree in that it uses a method called feature randomness: when choosing a root node, the random forest only allows a tree to choose from a random subset of the features. The Gini impurity is measured for the features in this subset, the feature with the lowest score is used as the root node, and the subsequent nodes are chosen in the same way. After creating the trees, the random forest classifier is ready to make predictions. It takes an unknown sample from our test dataset and runs it through all of the trees. Each tree gives a class prediction, and the class with the most votes becomes the class of the unknown sample. One of the main reasons the random forest classifier does well with large datasets is that it maintains diversity between models by using bootstrap aggregation and feature randomness.
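The three ingredients described above (bootstrap sampling with replacement, a random feature subset per split, and majority voting) can be sketched in a few lines. Tree construction itself is omitted. The seed, the feature names, the subset size, and the votes are invented for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def random_feature_subset(features, k, rng):
    """Pick k distinct candidate features for one split (feature randomness)."""
    return rng.sample(features, k)


def majority_vote(predictions):
    """Return the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
training_data = ["a", "b", "c", "d", "e", "f"]
sample = bootstrap_sample(training_data, rng)

feature_names = [f"feature_{i}" for i in range(48)]  # hypothetical names
candidates = random_feature_subset(feature_names, 7, rng)  # subset size is an assumption

# Hypothetical votes from five trees for one unknown website (0 = phishing).
votes = [0, 1, 0, 0, 1]
predicted_class = majority_vote(votes)
```

The bootstrap sample has the same size as the training data but may repeat some items and omit others, which is what gives each tree a different view of the data.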

3) Support vector machines:
We used equation (7) to calculate the loss function for our support vector machine, and equation (8) to calculate gradients. With SVM, we plot each data item as a point in n-dimensional space (where n is the number of features, 48 in our dataset), with the value of each feature being the value of a specific coordinate. SVM then finds a hyperplane, or decision boundary, that properly differentiates between the classes. The optimal hyperplane is the one with equal and maximum distance to the nearest data points of the two classes, which are called support vectors. SVM is easy to apply when the data points can be divided by a straight line, but such datasets are rare in the real world. This is where the kernel trick of SVM comes in. One of the reasons SVM works well with our large dataset is that it can work in infinite dimensions. The kernel does not actually generate the infinite dimensions. Instead, it operates on the lower-dimensional data as if they were in the higher-dimensional space. The kernel is very useful here because it can turn a non-separable problem into a separable one by adding more dimensions, and the number of dimensions depends on the number of features each sample has. Some of the kernels we found compelling are the linear kernel, the polynomial kernel, and the radial basis function (RBF) kernel.
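As an illustration of a loss function and gradient for a linear SVM, the sketch below uses the regularized hinge loss and its subgradient, which is one standard choice and is assumed here rather than taken from equations (7) and (8). Labels must be -1 and +1 in this formulation, so the paper's 0/1 labels would need to be mapped first. The toy data, learning rate, and regularization strength are invented.

```python
def decision(w, b, x):
    """Signed score w.x + b of one sample."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def hinge_loss(w, b, X, y, lam):
    """Regularized average hinge loss; labels in y must be -1 or +1."""
    data_term = sum(max(0.0, 1.0 - yi * decision(w, b, xi)) for xi, yi in zip(X, y)) / len(X)
    return 0.5 * lam * sum(wj * wj for wj in w) + data_term


def hinge_gradient(w, b, X, y, lam):
    """Subgradient of hinge_loss with respect to w and b."""
    grad_w = [lam * wj for wj in w]
    grad_b = 0.0
    n = len(X)
    for xi, yi in zip(X, y):
        if yi * decision(w, b, xi) < 1.0:  # only margin violators contribute
            for j, xj in enumerate(xi):
                grad_w[j] -= yi * xj / n
            grad_b -= yi / n
    return grad_w, grad_b


def train(X, y, lam=0.01, lr=0.1, epochs=200):
    """Plain subgradient descent from a zero initialization."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = hinge_gradient(w, b, X, y, lam)
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy, linearly separable data (invented): +1 = legitimate, -1 = phishing.
X_toy = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
y_toy = [1, 1, -1, -1]
w_fit, b_fit = train(X_toy, y_toy)
```

This covers only the linear case. With the polynomial or RBF kernels mentioned above, the same margin idea applies, but the optimization is usually carried out in terms of kernel evaluations between samples.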

4) Logistic regression:
Logistic regression is based on linear regression, where a line is fitted to a given dataset.
The conditional probability function we used gives the probability of the binary output variable Y as a function of X. Any unknown parameters in the function are estimated by maximum likelihood. The conditional probability is calculated using equation (9).
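A minimal sketch of this idea, assuming the usual logistic (sigmoid) form for the conditional probability P(Y = 1 | X) and estimating the parameters by gradient ascent on the average log-likelihood. The one-feature toy data, learning rate, and epoch count are invented for illustration.

```python
from math import exp


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def predict_proba(weights, bias, x):
    """P(Y = 1 | X = x) under the logistic model."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit(X, y, lr=0.1, epochs=2000):
    """Estimate parameters by gradient ascent on the average log-likelihood."""
    weights, bias = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        # The gradient of the log-likelihood is sum((y - p) * x) over all samples.
        errors = [yi - predict_proba(weights, bias, xi) for xi, yi in zip(X, y)]
        for j in range(len(weights)):
            weights[j] += lr * sum(e * xi[j] for e, xi in zip(errors, X)) / n
        bias += lr * sum(errors) / n
    return weights, bias


# Toy data with one feature (invented): 0 = phishing, 1 = legitimate.
X_toy = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y_toy = [0, 0, 0, 1, 1, 1]
weights, bias = fit(X_toy, y_toy)
```

A sample is then assigned to class 1 when the predicted probability exceeds a chosen discrimination threshold, commonly 0.5.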

A. ROC Curve
Now let us look at the ROC curves of the different models. Fig. 1 shows the ROC curve of the support vector machine. The X-axis indicates the false positive rate, and the Y-axis indicates the true positive rate. The AUC value for this curve is 0.97.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 9, 2020, www.ijacsa.thesai.org
Fig. 9 shows the ROC curve of the extra trees classifier. The X-axis indicates the false positive rate, and the Y-axis indicates the true positive rate. Here the AUC of class 0 (phishing website) is 1.00, and that of class 1 (real website) is 1.00. The AUC of the macro and micro averages of the ROC curve is also 1.00. This is the best ROC curve so far: the curve rises steeply into the top-left corner.
Fig. 10 shows the ROC curve of the random forest classifier. The X-axis indicates the false positive rate, and the Y-axis indicates the true positive rate. Here the AUC of class 0 (phishing website) is 1.00, and that of class 1 (real website) is 1.00. The AUC of the macro and micro averages of the ROC curve is also 1.00. This is the same as for the extra trees classifier, and the curve again rises steeply into the top-left corner. Hence, it can be said that the extra trees classifier and the random forest classifier have the best ROC curves.
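To make the axes and the AUC value concrete, the sketch below builds an ROC curve from predicted scores by sweeping the decision threshold and then integrates it with the trapezoidal rule. It treats class 1 as the positive class, which is an assumption, and the labels and scores are invented. The per-class and micro/macro averaged curves reported above are not reproduced.

```python
def roc_curve(labels, scores):
    """Return (fpr, tpr) lists obtained by sweeping the threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2 for i in range(1, len(fpr)))


# Toy labels and classifier scores (invented).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_fpr, toy_tpr = roc_curve(toy_labels, toy_scores)
toy_auc = auc(toy_fpr, toy_tpr)
```

An AUC of 1.00 corresponds to the case where every positive sample is scored above every negative sample, so the curve reaches a true positive rate of 1 before any false positive occurs.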

B. Discrimination Threshold
Let us look at the discrimination thresholds of our models. In each threshold plot, the X-axis shows the discrimination threshold and the Y-axis shows the score.
Fig. 11 shows the threshold plot for the support vector machine. The discrimination threshold is 0.03, at which the precision, recall, and F1 score are approximately 0.89.
Fig. 12 shows the threshold plot for the non-uniform support vector machine. The discrimination threshold is 0.00, at which the precision, recall, and F1 score are approximately 0.86.
Fig. 13 shows the threshold plot for the linear support vector machine. The discrimination threshold is 0.05, at which the precision, recall, and F1 score are approximately 0.9.
Fig. 14 shows the threshold plot for KNN. The discrimination threshold is 0.50, at which the precision, recall, and F1 score are approximately 0.82 to 0.89.
Fig. 15 shows the threshold plot for logistic regression. The discrimination threshold is 0.46, at which the precision, recall, and F1 score are approximately 0.85 to 0.9.
Fig. 16 shows the threshold plot for stochastic gradient descent (SGD). The discrimination threshold is 0.00, at which the precision, recall, and F1 score are approximately 0.8 to 0.9.
Fig. 17 shows the threshold plot for logistic regressionCV. The discrimination threshold is 0.58, at which the precision, recall, and F1 score are approximately 0.95.
Fig. 18 shows the threshold plot for the Bagging Classifier. The discrimination threshold is 0.56, at which the precision, recall, and F1 score are approximately 0.98.
Fig. 19 shows the threshold plot for the random forest classifier. The discrimination threshold is 0.48, at which the precision, recall, and F1 score are approximately 0.99.
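A threshold plot of this kind can be approximated by computing precision, recall, and F1 at each candidate threshold. The sketch below does this for toy data and selects the threshold that maximizes F1. That selection rule is an assumption, since the paper does not state how the reported thresholds were chosen, and the labels and probabilities are invented.

```python
def scores_at_threshold(labels, probabilities, threshold):
    """Precision, recall and F1 when predicting class 1 for probability >= threshold."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for y, yhat in zip(labels, predicted) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, predicted) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, predicted) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(labels, probabilities, steps=100):
    """Return the first threshold on a regular grid that maximizes F1."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: scores_at_threshold(labels, probabilities, t)[2])


# Toy labels and predicted probabilities of class 1 (invented).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_probabilities = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
precision, recall, f1 = scores_at_threshold(toy_labels, toy_probabilities, 0.5)
threshold = best_threshold(toy_labels, toy_probabilities)
```

Raising the threshold generally trades recall for precision, which is why the three curves in a threshold plot cross and why the chosen threshold differs from model to model.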

V. CONCLUSION
Our work analyzes different machine learning techniques applied to a dataset of website features and corresponding details that may prove useful for detecting a possible phishing website. This document aims to provide its readers with a conclusive analysis of these methods and to verify our observations regarding the random forest classifier's optimal performance. The F1 score for the random forest is 0.99, which indicates that both the false positive and false negative rates are at a satisfactory level. The graphs and details we have included aim to help others carry out further experimentation that builds on our work. We also intend to continue this work with further modifications to the dataset and by applying other machine learning techniques with modified parameters, in the hope of improving the world's defenses against cyber attackers. The internet is both fantastic and dangerous, and our work's main objective is to help minimize the danger by addressing a pervasive security issue of the modern world. In this paper, we applied basic machine learning algorithms. In the future, we will deploy deep learning techniques such as the multilayer perceptron and artificial neural networks to improve the performance of the detection system.