A Feature Engineering Framework to Detect Phishing Websites using URL Analysis

Phishing is among the most common and dangerous cyberattacks on the internet: attackers access the personal information of internet users through "phishing websites". The major vehicle of this attack is the URL. The attacker creates a near-replica of an original URL with a difference so small that it is generally not revealed without keen observation. By pipelining various machine learning algorithms, the proposed model aims to recognize the features that matter for classifying a URL, using a recursive feature elimination process. In this work, a dataset of URL records with 112 features, including one target value, has been collected, and a machine-learning-based model is proposed to identify the significant features used to classify a URL. The wrapper method recursive feature elimination compares different bagging and boosting approaches; bootstrap aggregation, boosting, and stacking ensemble algorithms are used for feature selection. The proposed work has five sections: the pre-processing phase, finding the relations between the features of the dataset, automatic selection of the number of features using the Extra Tree Classifier, comparison of the various ensemble algorithms, and finally generation of the best features for URL analysis. This paper designs a meta learner with the XGBoost classifier as base classifier and achieves an accuracy of 93%. Out of 112 features, the model performed an extensive comparative study on feature selection and identified 29 core features by performing URL analysis.

Keywords—Recursive feature elimination; principal component analysis; standard scaler transformation; eXtreme gradient boosting classifier; correlation matrix


I. INTRODUCTION
The digital world suffers heavily from cyber-security attacks. A phishing attack can be handled based on the source code, the URL, or an image of the page. This research designs the model based on URL features, which are further classified into four sub-categories, as represented in Fig. 1.
With the growth of e-commerce applications, cybercrime is also increasing rapidly [1]. To address this issue, researchers are focusing on the detection of phishing websites using machine learning and deep learning techniques. The dataset contains 112 attributes, but not all of them may be important; this research tries to find the attributes that best determine whether a site is a phishing website. Some website accesses are marked as unauthorized by Google, but it is difficult for the Google search engine to identify every unauthorized site. To guard against such sites, the model compares every component of the URL to mark it as a "phish website" [7]. In the proposed research, feature selection for phishing-website detection is solved using ensemble algorithms. Ensemble algorithms are popular for their robust results; they design meta learners by combining weak learners trained on the same dataset. The three popular ensemble techniques are as follows.

1) Bootstrap aggregation algorithms:
Bootstrap aggregation, also known as "bagging", executes weak classifiers concurrently (in parallel). In this method, a randomized subset is sampled from the entire dataset for each classifier. Traditional algorithms suffer from high variance; to solve this problem, a component known as the "estimator" is added, which creates a random sub-sample based on the classifier passed.
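As a minimal sketch of bootstrap aggregation, assuming scikit-learn and a synthetic dataset standing in for the paper's 112-feature phishing data:

```python
# Hedged sketch of bagging: weak learners are fitted in parallel on random
# bootstrap subsamples and their votes are aggregated to reduce variance.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the phishing URL dataset (an assumption here).
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Default base estimator is a decision tree; n_jobs=-1 fits the
# n_estimators weak learners concurrently, as the text describes.
bag = BaggingClassifier(n_estimators=10, n_jobs=-1, random_state=42)
bag.fit(X_tr, y_tr)
print(bag.score(X_te, y_te))
```

The estimator passed to `BaggingClassifier` controls which weak learner each random sub-sample is fitted with.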
2) Boosting algorithms: In this type of algorithm, classifiers of the same type are combined sequentially to generate a new model. In general, boosting corrects the labels that one or more earlier models predicted incorrectly. It also addresses high variance by averaging the predictions of the various classifiers, and it plays a vital role in reducing bias.
3) Stacking algorithms: Stacking is more advanced than bagging and boosting because it combines heterogeneous rather than homogeneous classifiers. It designs a model that combines a meta learner with base learners: after every iteration, the meta model is applied to the output generated by the previous iteration.

II. RELATED WORK
c) Third-party-based features: These features are extracted from search engines and assisting services; among them, Alexa obtained the highest information gain. Information gain assumes that every feature depends on the class label and is independent of all other features. Each feature is assigned a rank, and features ranked below the threshold value are ignored.
In the next step, the model splits the data into training and cross-validation sets. The model designed a multi-layer feed-forward network with five DNN layers and tuned hyper-parameters; the weight values are initialized randomly. A recurrent neural network (LSTM) is designed to find the relations between the extracted features. LSTM is good at handling the vanishing-gradient problem; to avoid long-term dependency issues, it uses three gates, each with its own equation. A CNN with eight layers is designed, trained with back-propagation, to classify the URL as suspicious or not.
In [3], Paulius Vaitkevicius et al. conducted a comparative study of various machine learning algorithms and designed a unified ranking model for the detection of phishing websites. During the training phase, the model implemented cross-validation with hyper-parameter tuning, and it mitigated memorization of the data by decreasing the number of weak learners. At the same time, overfitting and underfitting issues were handled by monitoring high bias and high variance. Welch's t-test compares the accuracies produced by two classifiers and computes their statistical difference. Based on the accuracies produced, each classifier is assigned a unique value; finally, the model gives a unique ranking depending on the related work and the libraries used.
In [9], Jitendra Kumar et al. proposed five classifiers for detecting phishing websites; among these, the random forest and decision tree classifiers gave almost the same accuracy. The authors used regular expressions to extract the components of the URL and focused mainly on three important groups of features: URL-based, page-based, and domain-based features. The data were randomly split to form the training and testing datasets.

In [4], Ammar Odeh et al. designed a multi-layer perceptron using URL features. On the extracted features, selection is performed through a combination of ranking and single-attribute evaluation. A subset of these features is then generated and passed as input to the neural network. The model is designed with fixed hyper-parameter values and obtained an accuracy of 93.7%.
In [5], Yazan A. et al. proposed AI meta learners combined with the base algorithm known as the "Extra Tree Classifier". The first meta learner is "ABET": the process runs for 100 iterations, after which a normal distribution is fitted. After training the classifier, it generates a hypothesis and computes the error rate; at every iteration, it updates the weight values and checks whether the 0.5 threshold is satisfied. Over all generated hypotheses, it computes the argmax function. The second meta learner is "RoFET", in which the dataset is randomly divided into five subsets of six features each. It creates a new dataset using bootstrap induction; the newly constructed dataset generates coefficients with the help of a sparse rotation matrix and produces class confidences to determine the class label of each record. The third meta learner is "BET": 150 iterations are performed over the training dataset, and a new dataset is generated by an inducer for the base algorithm; the argmax concept is applied to find the most frequently predicted class label of each record. The fourth meta classifier is "LBET": all weights are initialized to 1/n, where n is the number of records in the dataset, and the probability estimators are initialized to 0.5. For all 100 iterations, it calculates weight-based probability estimators, fits a least-squares regression function to the weights, and updates the weight values; the summation over all classifiers is used to predict the class label. Of these four meta classifiers, LBET performed best with 97.5% accuracy.

In [6], Waleed Ali et al. proposed a PSO-based feature-weighting approach that encodes the features in a particle. The positions and velocities are generated randomly to calculate the fitness function, whose major goal is to update the local and global best values by checking the threshold values regularly.
After the termination condition is reached, the optimal weights of all features are generated, and the important features are selected for the classification process based on these weights. The selected features are passed as input to five traditional machine learning algorithms and one neural network algorithm. The model compared all six algorithms in terms of all evaluation metrics and found that the back-propagation neural network achieved the highest accuracy.
In [10], Suleiman Y. Yerima et al. designed 2-layer and 3-layer CNNs to detect phishing websites. The 2-layer architecture contains one convolution layer and one max-pooling layer; the flattened data is passed to eight units of neurons activated by the ReLU function. To demonstrate efficiency, the authors constructed the network with different numbers of neurons and achieved 96.6% accuracy with 64 neurons. The 3-layer architecture adds another max-pooling layer and achieved 97.1% accuracy with 64 neurons. This model clearly showed that accuracy increases with the number of layers and neurons.

III. METHODOLOGY
In this research, the major focus is reducing the number of features needed to classify a URL as a phishing website or not. The work is organized into five sections: Section 1 describes the pre-processing; Section 2 describes finding the relations between the features of the dataset using correlation and principal component analysis (PCA); Section 3 describes the automatic selection of the number of features using the Extra Tree Classifier; Section 4 compares the bagging and boosting algorithms on the number of features generated, using a pipeline mechanism; and Section 5 generates the best features based on the best algorithm from Section 4.
Feature engineering helps the model construct the explanatory attributes that play a vital role in training the model. The major goal of feature selection is to reduce the computation time of complex models. The feature engineering process is illustrated in Fig. 2.

A. Pre-Processing
The process of cleaning and transforming the data is known as "pre-processing". A dataset with missing and inconsistent values leads to poor performance of the model. The dataset used in this research contains 112 features (Table VII in Appendix A), all of which are numerical. A wrangling process was used to deal with missing and inconsistent values. To reduce computational time, the data should be normally distributed; this is achieved by applying the standard scaler mechanism.

The equation is

z = (x_i − μ) / σ                                        (1)

where x_i represents the attribute (column) at index i, μ represents the mean, and σ represents the standard deviation. After applying the standard scaler mechanism, the dataset is divided into 80% training data and 20% testing data for further processing of the model.
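A minimal sketch of this pre-processing step, assuming scikit-learn; the random matrix stands in for the 112-feature dataset:

```python
# Standard scaling z = (x - mean) / std, then the paper's 80/20 split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                 # stand-in for the feature matrix
y = rng.randint(0, 2, 100)           # stand-in for the target column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)   # 80% train / 20% test

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)      # fit on the training data only
X_test = scaler.transform(X_test)            # reuse the training mean/std

print(np.allclose(X_train.mean(axis=0), 0))  # columns are now centered
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into training.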

B. Dependency between the Features
Correlation and PCA functions are implemented to find the relations among the features of the dataset. The correlation coefficient r takes values in the range [−1, 1]: −1 indicates a negative relation, 0 no relation, and 1 a positive relation. In this research, the Pearson method is implemented, which tries to draw a best-fit line between the features. The Pearson coefficient is calculated as in equation (2):

r = [np Σ(A_i A_j) − (ΣA_i)(ΣA_j)] / √{[np ΣA_i² − (ΣA_i)²] [np ΣA_j² − (ΣA_j)²]}      (2)

where np represents the number of pairs and A_i, A_j represent the attribute values at the corresponding indexes.
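A small numerical sketch of the pairwise Pearson coefficient of equation (2), using NumPy and made-up attribute values:

```python
# Pearson coefficient computed term-by-term as in Eq. (2), then checked
# against NumPy's built-in correlation matrix.
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical attribute A_i
b = np.array([2.0, 4.1, 6.0, 8.2, 10.0])  # hypothetical attribute A_j

n = len(a)                                 # "np" in Eq. (2): number of pairs
r = (n * np.sum(a * b) - np.sum(a) * np.sum(b)) / np.sqrt(
    (n * np.sum(a**2) - np.sum(a)**2) * (n * np.sum(b**2) - np.sum(b)**2))

assert np.isclose(r, np.corrcoef(a, b)[0, 1])  # agrees with np.corrcoef
print(round(r, 4))
```

In practice the full 112x112 correlation matrix would be computed in one call, e.g. with `np.corrcoef` on the feature matrix.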
Principal component analysis (PCA) creates a linear relationship among the attributes. The explained variance serves as the rank in PCA, iterating from 1 to the n components specified in the algorithm. It is a two-step process: first the covariance matrix is calculated, then the eigenvalues and eigenvectors of the covariance matrix are computed. After applying PCA to our dataset (attributes whose coefficient value PC is less than 0.95, and self-denoting correlation values, are ignored), the result is observed in Table II. This research set the covariance threshold to 0.90 and obtained 20 important features; the results are tabulated in Table III.
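A minimal sketch of variance-thresholded PCA, assuming scikit-learn: passing a float as `n_components` keeps the smallest number of components whose cumulative explained variance reaches that threshold. The data here is synthetic.

```python
# PCA keeping enough components to explain at least 95% of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(200, 30)                # stand-in for the phishing feature matrix

pca = PCA(n_components=0.95)         # float threshold, not a component count
X_reduced = pca.fit_transform(X)

# Number of retained components and the variance they explain together.
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```

The same call with a threshold of 0.90 would reproduce the covariance threshold used in this section.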

C. Automatic Selection of Number of Features
In this research, an ensemble technique known as the "Extra Tree Classifier" is implemented. This model constructs unpruned decision trees and uses majority voting to predict the class labels. Unlike decision trees and random forests, it fits each tree to the whole dataset rather than to bootstrap samples. The algorithm has three hyper-parameters, and changing them may lead to different evaluations: k determines the attribute selection, nmin determines the output-noise average, and m determines the variance level. To evaluate the model, repeated stratified cross-validation is performed. Repeated cross-validation is chosen because, in plain k-fold cross-validation, the random distribution of the data can make the output values vary between executions. To reduce this noise, repeated cross-validation performs the same task multiple times and averages over all executions. The number of models generated and executed is shown in equation (3):

models = n_splits × n_repeats                                (3)
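A minimal sketch of this evaluation, assuming scikit-learn and a synthetic dataset: repeated stratified k-fold fits n_splits × n_repeats models (equation (3)) and the scores are averaged.

```python
# Extra Trees evaluated with repeated stratified cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=7)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=7)
scores = cross_val_score(ExtraTreesClassifier(random_state=7), X, y, cv=cv)

print(len(scores), scores.mean())   # 5 splits x 3 repeats = 15 models
```

Averaging over the 15 runs damps the fold-to-fold noise that a single k-fold split would show.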

D. Comparative Study
Bagging and boosting algorithms are identified as powerful ensemble algorithms. Four ensemble algorithms are pipelined to study their performance on the number of features generated by the Extra Tree Classifier: three are boosting algorithms and the fourth is a bagging algorithm with a meta base estimator. Each algorithm is illustrated below.

1) Bagging with meta base estimator:
This model fits randomly generated subsets to the base classifiers. The predicted values of all models built on the subsets are averaged, and the outcome is decided by majority voting. The random subsets are generated based on the estimator specified in the model; in this research, the estimator considered is logistic regression, which is well suited to drawing relationships between nominal, interval, and ordinal data. The base classifier considered is the decision tree classifier, which at every node takes a number of conditions into consideration when splitting the data. Using the concept of divide and conquer, the decision tree recursively partitions the data, with the splitting guided by the information gain parameter. Information gain is computed by subtracting the weighted sum of the impurities of the two child nodes from the impurity of the parent node; node impurity describes the homogeneity of the labels at that node.
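A hedged sketch of this setup, assuming scikit-learn: decision trees as base classifiers, with splits driven by information gain (`criterion="entropy"`). The dataset is synthetic, and the base estimator is passed positionally so the sketch works across scikit-learn versions.

```python
# Bagging with an explicit decision-tree base estimator.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# criterion="entropy" makes each tree split by information gain,
# as described in the text above.
bag = BaggingClassifier(DecisionTreeClassifier(criterion="entropy"),
                        n_estimators=20, random_state=0)
bag.fit(X, y)
print(bag.score(X, y))
```

Each of the 20 trees sees its own bootstrap subset; the ensemble prediction is the majority vote.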
2) Adaptive boosting (AdaBoost) classifier: This algorithm trains models sequentially, each correcting the errors generated by the previous models. Initially, every record is assigned the same weight (generally 1/n, where n is the number of records in the dataset). A decision tree is constructed and used to classify the records; the predicted labels generated by the decision tree are compared against the actual class labels of the training dataset, identifying the correctly and incorrectly classified records. The summed weight of the misclassified records is taken as the error rate, and these errors are corrected by increasing the weights of the incorrect decisions and decreasing the weights of the correct ones. This process is repeated until correct predictions are made.
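A minimal sketch of AdaBoost, assuming scikit-learn and synthetic data: shallow trees are fitted sequentially, and each round re-weights the records that the previous round misclassified.

```python
# AdaBoost: sequential weak learners with record re-weighting.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=1)

ada = AdaBoostClassifier(n_estimators=50, random_state=1)
ada.fit(X, y)

# estimator_weights_ records how much each weak learner contributes
# to the final weighted vote.
print(ada.score(X, y), len(ada.estimator_weights_))
```

The per-learner weights fall out of the error rate of each round, mirroring the weight-update loop described above.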
3) Gradient boosting (GBM) classifier: It is similar to the AdaBoost classifier, but it passes target values to the next model in the sequence. The target value depends on the rate of variation among the prediction models: if the error rate is high, the target value is set high; otherwise it is set low. The error value is determined by the gradient of the loss function, which in this algorithm is the mean squared error. The objective of the algorithm is to minimize this loss, bringing the error rate as close to zero as possible.
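A minimal sketch of gradient boosting, assuming scikit-learn: each new tree is fitted to the gradient of the loss (the residual errors) of the ensemble built so far.

```python
# Gradient boosting: sequential trees fitted to the residual errors.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=2)

gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=2)
gbm.fit(X, y)
print(gbm.score(X, y))
```

The `learning_rate` shrinks each tree's correction, trading more estimators for a lower final error.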

4) eXtreme Gradient BOOSTing (XGBOOST) Classifier:
It is an extended version of the gradient boosting algorithm. The major goals of this algorithm are to increase the computational power of the system and to optimize the performance of the model. The trees are constructed concurrently in this model, and this parallelism improves CPU utilization.

Algorithm 1: Comparison of ensemble algorithms
Step 1: Load the dataset, D
Step 2: Create a two-dimensional matrix ID ← all columns except the target column
Step 3: Create a one-dimensional matrix DD ← the target column
Step 4: ID_train, DD_train, ID_test, DD_test ← split the dataset
Step 5: Update ID_train and DD_train with the standard scaler values and fit the transformation
Step 6: alg = get_alg()
Step 7: Create two empty lists, res and name

Algorithm 2: Recursive feature elimination
Step 1: Load the dataset, D
Step 2: Create a two-dimensional matrix ID ← all columns except the target column
Step 3: Create a one-dimensional matrix DD ← the target column
Step 4: ID_train, DD_train, ID_test, DD_test ← split the dataset
Step 5: Update ID_train and DD_train with the standard scaler values and fit the transformation
Step 6: Call the RFE() method with XGBoostClassifier as the estimator
Step 7: Fit the training data to the RFE() method
Step 8: for i ← 0 to len(D):
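The steps above can be sketched as follows, assuming scikit-learn. The paper's estimator is the XGBoost classifier (from the xgboost package); `GradientBoostingClassifier` is used here as a drop-in stand-in so the sketch is self-contained, and the dataset is synthetic.

```python
# Recursive feature elimination driven by a boosting estimator.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=3)

scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)          # Step 5: standard scaler on train
X_te = scaler.transform(X_te)

rfe = RFE(estimator=GradientBoostingClassifier(random_state=3),
          n_features_to_select=5)          # prune down to the strongest features
rfe.fit(X_tr, y_tr)                        # Step 7: fit the training data

# support_ marks the kept features; every kept feature has ranking_ == 1.
print(rfe.support_.sum(), rfe.ranking_.min())
```

With the xgboost package installed, `XGBClassifier()` can be substituted for the stand-in estimator without other changes.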

IV. EXPERIMENTAL RESULTS
In this work, the algorithms prove good enough to find the significant features for the classification process. Feature selection is based on the accuracy score, and the score for every feature is tabulated in Table IV. When the accuracy is calculated over all 111 input features, the highest accuracy, 94%, is observed when the number of features is 29; these features are tabulated in table new. The accuracy scores are plotted in Fig. 3, where the x-axis represents the number of features and the y-axis represents the accuracy. On the selected number of features, four different algorithms are passed to recursive feature elimination; the accuracies are tabulated in Table V, and the comparative study is plotted in Fig. 4.

V. CONCLUSION
The proposed feature engineering model trained the dataset with an 80%/20% split. The model evaluated the correlation matrix values and the principal component analysis values, but these values do not clearly specify the relations between the features. Comparing the PCA and filter values with wrapper methods, we found more clarity between the features in terms of both exploration and performance when using wrapper methods. The proposed algorithm, built around eXtreme Gradient Boosting, finds the important features among all 112 existing features. The major reason for selecting the XGB classification algorithm is that, when the meta classifiers (the bagging algorithm and the AdaBoost (Ada), XGBoost (XGB), and Gradient Boost (GBM) classifiers) are applied, the XGB algorithm achieves an accuracy of 93%. The ensemble-based recursive feature elimination mechanism constructs a subset by eliminating the weak features; RFE identifies the significant features with a minimum count, so a good classifier can be designed. Recursive feature elimination is developed by combining a meta classifier with a base classifier and a decision tree as the bagging algorithm. The most significant features are highlighted under the support column on the basis of their rank values; all features with true values in the support column are considered significant, and these significant features have an accuracy value of 1.000. The bagging algorithm with logistic regression obtained 89.8% accuracy, whereas the boosting algorithms achieved more than 90%; above all, XGBoost gave 93.0% accuracy in retrieving the features to detect phishing websites.