The Effect of Feature Selection on Phish Website Detection: An Empirical Study on Robust Feature Subset Selection for Effective Classification

Abstract: Recently, limited anti-phishing campaigns have given phishers more opportunities to bypass detection through advanced deceptions. Moreover, the failure to devise appropriate classification techniques that effectively identify these deceptions has degraded the detection of phishing websites. Consequently, exploiting features that are as new, few, predictive, and effective as possible has emerged as a key challenge in keeping detection resilient. Some prior works investigated and applied selected feature selection methods to develop their own classification techniques. However, no study has established which feature selection method best enhances classification performance. Hence, this study empirically examines these methods and their effects on classification performance. Furthermore, it recommends criteria for assessing their outcomes and offers a contribution to the problem at hand. Hybrid features, low- and high-dimensional datasets, different feature selection methods, and classification models were examined in this study. The findings show notably improved detection precision with low latency, as well as noteworthy gains in robustness and prediction susceptibility. Although selecting an ideal feature subset is a challenging task, the findings of this study provide the most advantageous feature subset obtainable for robust selection and effective classification in the phishing detection domain.


I. INTRODUCTION
Phishers impersonate the trustworthy websites of financial organizations in online transactions. Many efforts have been made to overcome phishing attacks through numerous phishing detection approaches. Nevertheless, phishing has caused enormous monetary losses in cyberspace over the past years, which has motivated researchers to seek effective phishing detection techniques that protect users' digital identities [1][2][3]. In general, phishing detection techniques fall into several categories according to their deployment scenarios. In the literature, Islam & Abawajy [4] roughly categorized them into non-classification and classification techniques. Specifically, whitelists of well-known trustworthy URLs, blacklists of valid phish URLs, heuristics, and information flow techniques were categorized as non-classification techniques. In contrast, classification techniques comprise those that rely on machine learning classifiers and data-mining-based scenarios. They differ in classification accuracy, classification error rates, and demands on external resources [1][2][3][4][5]. However, they commonly deploy features, such as hybrid features, as the key factor in the classification task. The classification task mostly relies on extracting a set of features from the tested instances (i.e. emails and websites) and deploying them to distinguish phish instances from legitimate ones [1][2][3][4][5]. Thus, classification techniques have outperformed their competitors by intuitively detecting phishing that exploits the web, thereby protecting clients [3, 6]. Moreover, they can automatically extract features from webpage content, website URLs, and hosting information, and classify their phishness [7, 8]. Besides, the use of hybrid features supports the generality of classification techniques across phishing variations, and such techniques have reported higher detection accuracy than their competitors [4, 6, 9]. However, constraints such as the high dimensionality of the feature set, the hybridity of features, their irrelevance to the corresponding classes (i.e. phish and legitimate), their dependency on each other, their redundancy in the examined feature space, and the heterogeneity of their values (i.e. discrete and continuous values) might degrade detection accuracy. In addition, they might increase false detection errors and computational costs. They would then limit the overall effectiveness of classification techniques in real-world use, along with their scalability to enormous web data and evolving phish exploits [5, 9].
Hence, to cope with the aforesaid issues, researchers have refined their classification models via feature selection methods, which play an important role in data analysis during the classification task. Such methods typically refine the extracted set of features into a minimal and effective subset for the classification task. Besides, they eliminate the least representative features, i.e. those with the lowest discrimination on the tested data. However, these methods yield different feature selection outputs. Meanwhile, existing research, specifically in phishing website detection, has neglected the direct comparison of such differences. In their evaluations, prior works underlined the differences with respect to detection accuracy and overall performance [4][5][6][7][8][9][10][11][12]. They rarely quantified feature selection methods in terms of (i) the measure of feature prediction susceptibility that they utilized, (ii) their scalability under feature sets of different dimensions, (iii) the goodness of their output in the presence of different classification models, (iv) the stability of their output against evolving data and phishing variations, and (v) the similarity between the outputs of multiple feature selection methods.
www.ijacsa.thesai.org
In addition, this study highlights the causality between the aforesaid issues and the optimal choice of feature subset. It quantifies the quality of the selected feature subset that yields the best detection accuracy with the lowest possible error rate. Moreover, this contribution is extended by testing the selected feature subset across multiple classification models. Apart from that, this study promotes its contribution by handling a proposed set of hybrid features. Hence, it is hoped that the proposed features, the characterized literature, the highlighted issues, and the empirical tests offer a global picture of phishing detection assisted by feature selection. Moreover, they could be regarded as baselines for future works to choose appropriate feature selection methods for their classification models.
In this context, this study characterizes the prior works and critically appraises them with respect to their frontiers in feature selection, as presented in Section II. Then, Section III recommends certain criteria and describes their relevant terminology to assess both the resilience and the effectiveness of selected feature subsets. Section IV practically appraises the feature selection methods and tests their outcomes against the recommended criteria. Based on the stated findings, Section V concludes the present work and gives an outlook on future implications.

A. Feature Selection Methods
All feature selection methods aim at reducing the dimensionality of the feature space and enhancing the compactness of the features. In data processing, specifically in data mining and machine learning approaches, a large number of features may cause problems of high dimensionality, irrelevance, and redundancy [13]. Therefore, data pre-processing is needed to reduce the dimensionality and to obtain the most representative features that can effectively predict instances over a given dataset [13, 14]. Feature selection is mainly considered a data pre-processing technique that chooses a minimum subset of m features from an original set of n features. Accordingly, the selection involves a search procedure for feature subset generation and an evaluation criterion for iterative feature selection [13, 14]. The search procedure often discards or adds one feature based on its evaluation outcome, whereas the evaluation criterion compares that feature with the previously selected one with regard to its information, dependency, consistency, distance, or transformation. However, feature selection methods differ in the specifics and parameters that can be tuned for both the search procedure and the evaluation criterion [13, 14]. Table I lists four feature selection methods that have been adopted for phishing detection in the reviewed literature, characterized by search procedure as well as evaluation specifics and criteria.

Information Gain (IG)
Filter Information "Where S, S_V, V, and a are the collection of instances, the subset of instances having value V of a, a relevant value, and an attribute, respectively."

Correlation Based Feature Selection (CFS)
Filter Consistency "Where, is said to be relevant if there exists some and c for which ( ) ."
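The Information Gain entry above can be illustrated with a minimal sketch. It assumes the standard definition, Gain(S, a) = Entropy(S) - sum over values V of a of (|S_V| / |S|) * Entropy(S_V), which matches the symbols defined in the table but is not reproduced in the surviving text. The feature and label values are hypothetical toy data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels):
    """Gain(S, a): entropy of S minus the weighted entropy of each S_V."""
    total = len(labels)
    gain = entropy(labels)
    for v in set(values):
        # S_V: the instances whose attribute a takes the value V
        subset = [lab for val, lab in zip(values, labels) if val == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain


# Hypothetical toy data: four webpages, one Boolean feature each.
labels = ["phish", "phish", "legitimate", "legitimate"]
separating_feature = [1, 1, 0, 0]    # perfectly splits the classes
uninformative_feature = [1, 0, 1, 0]  # tells nothing about the class

gain_separating = information_gain(separating_feature, labels)
gain_uninformative = information_gain(uninformative_feature, labels)
```

A filter method based on IG would rank all features by this score and keep the top-ranked ones; how many to keep is a tunable parameter that the sketch leaves open.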

B. Related Works
At present, a vast literature is available on the merits and demerits of phishing detection campaigns. Towards devising anti-phishing solutions for the specific problem at hand (i.e. phishing websites), many proposals have been introduced and experiments conducted using different machine-learning-based approaches, with or without feature extraction and feature selection. For instance, Likarish et al. [15] developed a Bayesian filter to identify phish websites based on tokens retrieved from the HTML document by constructing its DOM (Document Object Model) with the aid of a DOM parser. Then, researchers at Google Inc., Whittaker, Ryner & Nazif [16], worked on upgrading Google's phishing blacklist by integrating it with a classifier. In addition, another anti-phishing technique was developed by Bergholz et al. [17] for phish email filtering by analyzing several extracted body, external, and model-based features of the examined emails. The developed technique involved two training phases: one for the model-based features and the other for the rest of the features. Later, CANTINA+ was proposed by Xiang, Hong, Rose, and Cranor [18] with three classifiers and ten features derived from the URLs and contents of webpages, as well as some online features, for highly accurate phishing detection. Meanwhile, Zhang, Liu, Chow, and Liu [19] introduced a linear classifier, Naïve Bayes (NB), that exploits eight textual and visual features of suspected websites for phishness prediction. The classifier returned a normalized number reflecting the likelihood of the suspect website being phish or legitimate. Likewise, a Support Vector Machine (SVM) classifier was developed by He et al. [8] to predict the phishness of an examined webpage by exploiting webpage identity and some textual features. The textual features were extracted using a well-known information retrieval method and deployed in the classification process. In contrast, a phish webpage detector was proposed by Li, Xiao, Feng, and Zhao [20] based on visual features and DOM objects of the webpage content; it was trained and tested over datasets using a Transductive Support Vector Machine (TSVM), a semi-supervised classifier. Furthermore, Kordestani and Shajari [21] applied three classifiers, namely Naïve Bayes (NB), Support Vector Machine (SVM), and Random Forest (RF), on a randomly selected dataset to predict phishness in suspected websites using URL and online features. Then, Gowtham and Krishnamurthi [22] extracted fifteen features, which were used to train a Support Vector Machine (SVM) classifier alongside a whitelist through two modules. The first module checked the identity features of the examined website against a pre-defined whitelist of legitimate ones, whereas the second module predicted the phishness of the examined webpage based on its login form features via the SVM classifier.
However, the application of the aforesaid proposals encountered tradeoffs related to the processing of large and realistic datasets, the extraction of hybrid features, the analysis of their heterogeneity, increasing storage requirements and processing time, as well as some costly misclassifications. Moreover, it is worth mentioning that the final decisions of phishing detection rely on features that are predictive of phishing susceptibility. More precisely, phishing detection in the presence of predictive features should yield minimal amounts of both misclassified valid phish instances and lost valid legitimate instances. Thus, researchers were motivated to adopt feature selection methods, such as those briefly described in Table II, to cope with the aforesaid
factors. In the literature, Pan and Ding [23] proposed a phishing detector based on a Support Vector Machine (SVM) classifier and on textual and Document Object Model (DOM) features extracted from the examined webpages. Their detector had two major components: an information retrieval strategy to extract textual features and the Chi-squared (χ²) criterion to select the most effective features. Then, Ma, Ofoghi, Watters, and Brown [24] experimentally analyzed seven webpages and pages to rank the features with the aid of a filter-based feature selection method, Information Gain (IG), for phish website classification, and deployed two classifiers whose classification accuracy varied with the selected features. On top of that, Khonji, Jones, and Iraqi [25] enhanced classification performance by selecting the most effective subset of the 47 most commonly used features. Both filter-based and wrapper-based feature selection methods, such as Information Gain (IG), Correlation Based Feature Selection (CFS), and Wrapper Feature Based Selection (WFS), were used with machine learning classifiers to predict phish emails. The classification results differed according to the employed feature selection method and the number of selected features. On the other hand, Basnet, Sung, and Liu [26] analyzed a high-dimensional feature space of 177 features extracted from both the content and URLs of websites to select the best feature subset. Several subsets were obtained by applying Wrapper Feature Based Selection (WFS) and Correlation Based Feature Selection (CFS). They were trained over a dataset with the aid of Logistic Regression (LR) classifiers. Nevertheless, the methods varied in the features they selected as most contributing, which caused variation in the classifiers' detection accuracies. Later, Zhang, Jiang, and Kim [27] developed an automatic detection approach for Chinese e-business websites by incorporating unique features extracted from the URL and contents of a website. Alongside, Hamid and Abawajy [28] proposed a multi-tier detector for phish email filtering with the aid of AdaBoost and SMO classifiers in an ensemble design. Moreover, they used Information Gain (IG) and a clustering strategy to quantify the best predictive features of phish emails and tested the outcomes over three large-scale datasets. However, large dataset size, imbalanced datasets, redundancy, the limit of cluster size, and error rates emerged as the key issues in their work.

C. Shortages
In order to offer a global view on feature selection in the phishing detection domain, Table II characterizes the previous works with respect to their deployed feature selection methods and their limitations. As depicted in Table II, the surveyed works often deployed sub-optimal feature subsets for phishing detection due to some limitations. Such limitations include the dependency of feature selection outcomes on a given dataset, different feature selection outcomes across different classification models, heterogeneity of feature values, and feature selection methods that do not scale to more challenging datasets [23][24][25][26][27][28]. Furthermore, most of the dedicated efforts focused on discarding the irrelevant features rather than the redundant ones during feature selection [23][24][25][26][27][28]. However, since redundant features are mutually dependent on other features belonging to the same target class, they might distort the classification task and degrade its accuracy by producing high error rates [29, 30]. Consequently, Table III underlines some striking issues, namely non-scalability, heterogeneity, non-robustness, irrelevance, and redundancy, that must be considered to deal with the limits of feature selection [29][30][31].

Non-scalable Feature Subset [29]
The deployed features rarely raise the classification accuracy to the best case as possible under different selection scenarios and over different datasets.

Redundant Features [29, 30]
High-dimensional data contain a substantial number of irrelevant features, which require a computationally costly selection strategy to reduce. Such a strategy can make the classifier inefficient. Irrelevant features, in turn, may include redundant and non-redundant features, which call for a robust feature selection strategy capable of handling their redundancy.

Irrelevant Features [29, 30]
Large-scale, realistic datasets, such as those involved in anti-phishing techniques, may contain a high fraction of irrelevant features. Because of the exponential growth of more sophisticated and deceptive phishing features, the resulting irrelevant features severely degrade the classifier's performance.

Feature Values Heterogeneity [31]
Websites are inconsistent datasets with various hybrid features that take different kinds of values: discrete, categorical, and continuous. For any collected dataset, the extracted hybrid feature space is heterogeneous in values and huge in size. That is, in the presence of any extracted or selected subset of features, the machine learning classifier should be able to categorize them for both training and testing purposes with a minimum loss of feature values.

Non-robust Feature Subset [31]
When applying feature selection for knowledge discovery, robustness of the feature selection result is a desirable characteristic, especially if subsequent analyses or validations of selected feature subsets are costly.Modification of the dataset can be considered at different levels: perturbation at the instance level (e.g. by removing or adding samples), at the feature level (e.g. by adding noise to features), or a combination of both.

III. ASSESSMENT MEASURES
For the problems at hand (Table III), the outcomes of a selected feature subset must be quantified in terms of scalability, goodness, stability, and similarity over multiple datasets [29][30][31][32][33]. In addition, assessing the prediction susceptibility of the outcomes against phishing over different datasets is noteworthy for obtaining the most advantageous features [34]. Thus, specific measures adopted by prior researchers in different fields are recommended in this work (Table IV) to test and assess the outcomes of feature selection methods [31][32][33][34][35][36][37]. Such measures can serve as comparison baselines for any further study on feature selection effects.

Phishness Ratio
A phishness ratio restates the prediction susceptibility of the selected feature set to phishing for each instance in the dataset. The probability $P(t_i)$ of estimated phishness for a feature $t_i$ is computed across all instances in the dataset. Then, the instance's phishness is computed by averaging the probabilities of all its features.
$$P(t_i) = \frac{N_P(t_i)}{N_P(t_i) + N_L(t_i)}, \qquad \mathrm{Phishness}(S) = \frac{1}{n} \sum_{i=1}^{n} P(t_i)$$
"Where S is the examined webpage, Phishness(S) is the prediction of phishing susceptibility, $t_i$ is a feature in S, $N_P(t_i)$ is the number of occurrences of $t_i$ in phish instances, $N_L(t_i)$ is the number of occurrences of $t_i$ in legitimate instances, and n is the number of features in S."

Minimal Redundancy [35,36]
It eliminates duplicate features, i.e. those replicated by another feature in the dataset.
"Where R(S) is the set of highest mutually exclusive features selected between xi and xj."

Maximal Relevance [35,36]
It selects the features that are most relevant to the target class and most strongly affect the classification output.

Minimal Redundancy Maximal Relevance (mRMR)
This criterion selects a compact feature subset composed simultaneously of the most relevant and least redundant features from the original set.
"Where D and R indicate the dependency between a feature xi and its class, and the highest relevance between features xi and xj in the same feature set."

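The Phishness Ratio described in this section can be sketched in a few lines. This is a minimal illustration, assuming the ratio-and-average form implied by the definitions (the per-feature probability is the share of a feature's occurrences that fall in phish instances, and the instance score is the mean over its features). The feature names and occurrence counts are hypothetical.

```python
def feature_phishness(phish_count, legit_count):
    """P(t_i): share of the feature's occurrences seen in phish instances."""
    total = phish_count + legit_count
    return phish_count / total if total else 0.0


def phishness(features, phish_counts, legit_counts):
    """Phishness(S): mean of P(t_i) over the n features present in webpage S."""
    if not features:
        return 0.0
    probabilities = [
        feature_phishness(phish_counts.get(t, 0), legit_counts.get(t, 0))
        for t in features
    ]
    return sum(probabilities) / len(probabilities)


# Hypothetical occurrence counts collected over a labelled dataset.
phish_counts = {"ip_in_url": 90, "has_login_form": 60, "uses_https": 20}
legit_counts = {"ip_in_url": 10, "has_login_form": 40, "uses_https": 80}

# A webpage S exhibiting two of the features: (0.9 + 0.6) / 2
score = phishness(["ip_in_url", "has_login_form"], phish_counts, legit_counts)
```

A score close to 1 indicates that the selected features of S occur mostly in phish instances, while a score close to 0 indicates the opposite.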
IV. EMPIRICAL TEST AND DISCUSSION
Based on the recommended measures presented in Section III.C, the empirical test was conducted not only to state the variations of the feature selection methods in Prediction Susceptibility, Goodness, Stability, Similarity, and Scalability, but also to assess the outputs of the criterion that simultaneously discards redundant and irrelevant features (mRMR). To the best of our knowledge, this type of empirical test with the aid of the recommended criteria is scarcely reported in the phishing detection literature despite its significance for feature selection. Hence, an empirical test was implemented on a specific test-bed that was set up to extract a large number of hybrid features. Then, the effectiveness of the best chosen feature subset was compared across different classification models. The test-bed, the results, and the discussion are presented in the following subsections.

A. Test-Bed and Features
A wide range of aggregated phish and legitimate webpages was considered as the test-bed for this study. Most of them are reported in public archives such as PhishTank, CastleCops, and Alexa. Both PhishTank and CastleCops are phishing data archives that volunteers frequently update with valid, live phish webpages, while the Alexa archive is publicly used to retrieve valid legitimate webpages. We chose these archives because they were commonly used by prior researchers in the phishing detection literature [15][16][17][18][19][20][21][22][23][24][25][26][27][28]. Fig. 1 illustrates the test-bed in terms of dimension, the number of phish webpages, and the number of legitimate webpages. In Fig. 1, the test-bed consists of three datasets: Dataset1, Dataset2, and Dataset3. Dataset1 is composed of 1000 webpages, Dataset2 of 5000 webpages, and Dataset3 of 10000 webpages. The multi-dimensional test-bed helped to empirically assess the outcomes of the reviewed feature selection methods and to identify the most suitable one for phish website detection. Indeed, the webpage content and URL can be used to characterize each instance in the datasets so that it can be assigned to a specific class, either phish or legitimate.
Consequently, the characterized datasets with their features and corresponding classes helped to generate the required feature space. Fig. 2(a) illustrates the structure of the generated feature space in terms of class label, feature index, the feature itself, and its value. Furthermore, Fig. 2(b) shows a part of the database schema to provide a global view of how the raw data could be generated. A set of web development tools, such as Firebug, Jsoup, and Import.io, was helpful in implementing this task. Besides, several publicly used tools, such as KNIME and WEKA (the Waikato Environment for Knowledge Analysis), were employed for feature selection implementation and tests. In Fig. 2(a), the $j$-th webpage is characterized as a vector of features $W_j$. Then, all feature vectors extracted from the $m$-dimensional set of webpages are combined in a feature matrix $M$ such that $M = \{W_1, W_2, \ldots, W_m\}$, where $m$ indicates the number of feature vectors included in $M$. Each entry vector $W_j$ in $M$ consists of its features' indexes and their corresponding values, along with its class label as the first column, i.e. $W_j = \{C_j, (f_{1j}, v_{1j}), (f_{2j}, v_{2j}), \ldots, (f_{nj}, v_{nj})\}$, where $n$ is the number of features, $f_{ij}$ is the index of the $i$-th feature of the $j$-th feature vector $W_j$, and $v_{ij}$ is its value, with $1 \le i \le n$ and $1 \le j \le m$. $C_j$ is the class label such that $C_j \in \{\text{phish}, \text{legitimate}\}$, which indicates the membership of $W_j$ in the phish class or in the legitimate class based on its corresponding features [38,39], as portrayed in Fig. 2(b). Further, Boolean-valued features are mapped to either 0 or 1, and continuous features are represented as numeric quantities. Appendix I lists the original set of features extracted from all webpages in the test-bed. In total, 58 features were included in the original feature set: 48 features were extracted from specific parts, tags, and scripts in the webpage source code, and ten features were extracted from indicators in the webpage URLs. This high-dimensional feature set is later refined into a subset of selected features using several feature selection methods, as presented in the next subsection.
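The feature-space layout described above (class label first, then index and value pairs, with Boolean features mapped to 0 or 1) can be sketched as follows. The helper names and the three sample feature values are hypothetical; the actual 58 features are those listed in Appendix I.

```python
def to_numeric(value):
    """Map Boolean features to 0/1; represent other features as numbers."""
    if isinstance(value, bool):
        return int(value)
    return float(value)


def build_feature_vector(label, raw_features):
    """Return W_j as [C_j, (1, v_1), ..., (n, v_n)], class label first."""
    pairs = [(index, to_numeric(value))
             for index, value in enumerate(raw_features, start=1)]
    return [label] + pairs


# Hypothetical raw extractions for two webpages (three features each).
pages = [
    ("phish", [True, 3, 0.82]),
    ("legitimate", [False, 1, 0.10]),
]

# Feature matrix M: one feature vector per webpage.
M = [build_feature_vector(label, features) for label, features in pages]
```

Keeping the index next to each value mirrors the sparse layout of Fig. 2(a); a dense numeric array would serve equally well once every webpage has all n features.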

B. Comparison Across Feature Selection Methods
In this section, the details and discussion of the first empirical test and the related findings are presented. The test was conducted on four feature selection algorithms (FSAs), namely CFS, WFS, χ², and IG, which had been adopted in the surveyed works. Besides, the mRMR feature selection method was also included in the comparison to determine whether it could be recommended as an alternative FSA for the problems at hand (i.e. feature redundancy and irrelevance). Unlike its competitors mentioned in Table I, mRMR discards redundant and irrelevant features in parallel and yields a compact subset of the most relevant and least redundant features. Both the test and the comparison were carried out on three datasets with different sizes and collections of phish and legitimate instances, as presented in Fig. 3 and Fig. 4.
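To make the idea of discarding redundant and irrelevant features in parallel concrete, here is a minimal greedy sketch of mRMR. It assumes the commonly used difference form (relevance to the class minus mean redundancy with already selected features) with mutual information over discrete features as both measures; the study itself relied on existing tools, so this is an illustration rather than the implementation used. Feature names and values are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def mrmr(features, labels, k):
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    relevance = {name: mutual_information(values, labels)
                 for name, values in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Hypothetical toy data: eight webpages, 1 = phish, 0 = legitimate.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "ip_in_url":      [1, 1, 1, 0, 0, 0, 0, 0],  # strongly relevant
    "ip_in_url_copy": [1, 1, 1, 0, 0, 0, 0, 0],  # exact duplicate
    "at_symbol":      [1, 1, 0, 1, 1, 0, 0, 0],  # weaker but complementary
}
subset = mrmr(features, labels, k=2)
```

On this toy data a pure relevance ranking would keep both copies of the duplicated feature, whereas the redundancy penalty makes the second pick the weaker but non-duplicated one.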
From Fig. 3 and Fig. 4, the overall results are very encouraging towards deploying all the selected hybrid features as predictive ones for phishing websites. The only difference is the variation in their compactness under different FSAs. The findings of this test are summarized as follows:
• In Fig. 3(a), the prediction susceptibilities were evaluated and compared using the Phishness Ratio measure (Table IV).
• It shows that the feature subset chosen by FSA 1 (i.e. mRMR) could successfully raise the prediction score from the typical case to the best one over datasets of different sizes. This, in turn, restates that mRMR can be considered the most scalable FSA among those tested because it preserved its prediction rate as close to the best case as possible.
• Fig. 4(a) shows the goodness of the selected subsets over the three datasets. FSA 1 (i.e. mRMR) clearly preserves the best goodness (i.e. quality) among the FSAs despite the volume variations of the test-bed, whereas FSAs 4 and 5 (i.e. χ² and IG) have the worst quality. This implies the significance of reducing the feature set's dimensionality and removing both redundant and noisy features to define the best feature subset. Indeed, such a feature subset helps the classification model perform well over all datasets. More interestingly, such a feature subset is needed to effectively detect phishing websites in realistic applications.
• ... stable over all datasets than their competitors. Further, it emphasizes the significance of the interdependencies between the features in the same chosen feature subset. Features chosen on the basis of their interdependencies can compose a stable subset under different detection scenarios and datasets. In contrast, subsets chosen according to the top ranking of their constituents, as with FSA 5 (i.e. IG), may vary in their discriminating power against vast datasets and different detection approaches.
• In the context of the overall similarity of the outputs (Fig. 4(c)), it can be observed that the FSAs' outputs are notably dissimilar over all the datasets. The reported similarity scores are lower than 0.3, which indicates that the selected subsets overlap only partially and are complementary to each other. Interestingly, such dissimilarity implies that a feature subset composed of hybrid and diversely predictive features could be a promising avenue for improving classification performance. Moreover, FSAs that produce dissimilar feature subsets can be effectively integrated and exploited for a specific phishing detection approach. Despite this, it is clearly observed that the optimal feature subset chosen by one FSA may be a sub-optimal choice with regard to another FSA. Hence, both the likeness and the difference of FSA outputs are crucial issues in machine-learning-based detection approaches.
• Based on the overall results, we obtained useful insight into the crucial importance of the feature selection method for the problem domain at hand. This, in turn, enables us to improve the detection performance by using as few, predictive, and robust features as possible. In general, the aforesaid test and its overall findings highlight the significance of the selected feature subset in terms of prediction susceptibility, scalability, goodness, and stability. In particular, the feature subset chosen by FSA 1 (i.e. mRMR) always has the best scores from the aforesaid perspectives. FSAs 2 and 3 (i.e. CFS and WFS) reveal the second and third best cases, whereas FSAs 4 and 5 (i.e. χ² and IG) yield the worst cases across all the aforesaid perspectives.
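The similarity comparison between FSA outputs discussed above can be illustrated with a small sketch. The surviving text does not state which similarity index was used, so the Jaccard index is assumed here purely for illustration; the feature identifiers and subsets are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_similarity(subsets):
    """Average Jaccard index over every pair of selected feature subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical subsets selected by three FSAs on the same dataset.
outputs = {
    "mRMR": {"f1", "f2", "f3", "f4"},
    "CFS": {"f4", "f5", "f6", "f7"},
    "IG": {"f1", "f8", "f9", "f10"},
}
average_similarity = mean_pairwise_similarity(list(outputs.values()))
```

The same index applied to the subsets one FSA selects on different datasets would give a stability score, which is one common way to quantify robustness.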
In summary, this empirical test restates that several selection methods reach quite similar peaks of prediction susceptibility and robustness. Therefore, they can be considered baseline methods for feature selection in phishing website detection. More importantly, if the feature selection method is carefully chosen, i.e. on the basis of its prediction susceptibility and robustness, the performance of the classification model could be highly improved with low latency and few errors. However, there is still no exact answer as to the perfect FSA among the tested ones unless they are assessed in terms of detection accuracy, specificity, and sensitivity across several classification models and different datasets. This issue is considered in the next subsection.

C. Comparison Across Classification Models
Herewith, we turn to examine how the aforesaid selected feature subsets can shift the detection accuracy, specificity, and sensitivity of the classification model to the best possible rates. This is determined through two comparisons. First, the outputs obtained from the previously tested FSAs are compared on detection accuracy, sensitivity, and specificity over training and testing datasets dedicated to this purpose. To accomplish the performance test and obtain findings for comparison, a specific machine learning classifier, namely C4.5, was applied, as can be seen in Fig. 5. Meanwhile, several supportive metrics are deployed for the performance evaluation, as presented in Table V.
To qualify the discriminating behavior, four machine learning classifiers are involved in the second comparison.Those classifiers are described with their related calculations in Table VI.Such classifiers are chosen because of their wide use in the literature of phishing detection.Consequently, this comparison highlights how the best selective feature subset could classify phishing websites not only across different datasets (i.e.training and testing datasets) but also across different classification models as illustrated in Fig. 6.
Both comparisons are applied over two datasets, a training dataset and a testing dataset, that were generated from a collection of phishing and legitimate webpages specifically aggregated for this purpose. The datasets are generated by extracting the feature space from the aggregated webpages (i.e. data preprocessing) and dividing it into a training dataset (70% of the main dataset) and a testing dataset (30% of the main dataset).
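The 70/30 division described above can be sketched as follows. The text does not say whether the split was random, stratified, or chronological, so a seeded random shuffle is assumed here; the function name, the seed, and the placeholder instances are hypothetical.

```python
import random


def split_dataset(instances, train_fraction=0.7, seed=42):
    """Shuffle the instances and split them into training and testing sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Placeholder main dataset of 1000 feature vectors (integers stand in for them).
main_dataset = list(range(1000))
training_set, testing_set = split_dataset(main_dataset)
```

For imbalanced data, a stratified variant that applies the same split within each class keeps the phish-to-legitimate proportion equal in both parts.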

FP
False Positive refers to the rate of wrongly classified legitimate instance s as phishing

TN
True Negative refers to the rate of correctly identified legitimate instances.

FN
False Negative refers to the rate of phishing instances wrongly labeled as legitimate.

Specificity
The percentage of correctly predicted negative (legitimate) instances (TNs).

Sensitivity
It refers to the percentage of correctly predicted positive instances (TPs).

Accuracy
It indicates the overall rate of correctly detected phishing and legitimate instances (the rate of correct predictions).

Where: $N_{P \to P}$, $N_{L \to P}$, $N_{P \to L}$, and $N_{L \to L}$ denote the number of correctly labeled phishing instances, the number of legitimate instances wrongly labeled as phishing, the number of phishing instances incorrectly recognized as legitimate, and the number of legitimate instances correctly identified as legitimate, respectively [1,3].

Regarding Tables V and VI as well as the statistics plotted in Fig. 5 and Fig. 6, the following standpoints are inferred:

• The significant differences between classification models assisted by the tested FSAs (Fig. 5) point to the major or minor contribution that each feature selection method can provide. Variations in accuracy, sensitivity, and specificity demonstrate that not all the tested feature selection methods yield promising outcomes for phish website detection. This is because of (i) variations in the specifics and evaluation criteria of the FSAs themselves, (ii) the chosen features themselves, owing to their varied prediction susceptibilities and robustness, (iii) the dependency of detection performance on the deployed classification model, (iv) the type of exploited features (i.e. the webpage's URL and/or the webpage's content), and (v) the dimension of the selected feature subset (i.e. the number of features included in the selected subset).

• Consequently, the different outcomes of the performance test (Fig. 5) show that a given classification model may be noticeably influenced by the training and testing datasets, by the suitability of the machine learning classifier, and by the chosen feature selection method. This implies that the diversity and preprocessing of the collected dataset likely influence the overall classification performance, because the dataset may contain imbalanced data. More precisely, imbalanced data means that the features corresponding to the phishing and legitimate classes are unevenly represented across the collected test-bed. The collected test-bed varies in dataset size and consists of dozens of labelled and unlabeled instances with a variety of features (i.e. hybridity) and heterogeneous feature values. Therefore, k-fold validation and chronological assessment must be carried out to cope with such diversity.

• The classification performance is likely to be influenced by the size of the feature set (Fig. 6(a)). For instance, the 58 extracted hybrid features may include irrelevant, redundant and noisy data; therefore, eliminating the worst features and selecting the best ones (i.e. the most representative ones) are important factors for well-performing classification, as can be seen in Fig. 6(b).

• Also, the feature set's dimensionality is an important factor for the classification performance (Fig. 6(a) and Fig. 6(b)). The more features are processed, the more computational cost is incurred. Moreover, the feature set's dimensionality interacts with the dataset's dimensionality.

• The feature subset chosen by mRMR improves the overall performance of the classification models. Classification models assisted by mRMR outperform the baseline models in terms of classification accuracy and error rates (Fig. 6(a) and Fig. 6(b)).
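The accuracy, sensitivity and specificity figures compared above can be derived from the four confusion counts defined with Table V. The sketch below is illustrative only: the function name `detection_metrics`, its parameter names and the example counts are hypothetical, phishing is treated as the positive class, and specificity is taken in its standard sense (the share of legitimate instances correctly identified as legitimate).

```python
def detection_metrics(n_pp, n_lp, n_pl, n_ll):
    """Compute accuracy, sensitivity and specificity from confusion counts.

    n_pp: phishing instances correctly labeled as phishing (TP)
    n_lp: legitimate instances wrongly labeled as phishing (FP)
    n_pl: phishing instances wrongly labeled as legitimate (FN)
    n_ll: legitimate instances correctly labeled as legitimate (TN)

    Assumes each class contains at least one instance.
    """
    total = n_pp + n_lp + n_pl + n_ll
    return {
        "accuracy": (n_pp + n_ll) / total,
        "sensitivity": n_pp / (n_pp + n_pl),  # true positive rate
        "specificity": n_ll / (n_ll + n_lp),  # true negative rate
    }


# Made-up counts for illustration; they are not results from the paper.
metrics = detection_metrics(n_pp=90, n_lp=5, n_pl=10, n_ll=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting sensitivity and specificity alongside accuracy matters when the test-bed is imbalanced, because accuracy alone can stay high while one class is poorly detected.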

V. CONCLUSIONS AND FUTURE WORK
With the aim of selecting a minimal and effective feature subset for a well-performing phish website detection technique, this paper critically and practically appraised the use of feature selection in classification-based techniques. This appraisal covered techniques assisted by machine learning classifiers and feature selection methods, as well as a review of prior works and their related issues. Further, empirical tests were conducted over 58 new hybrid features, five different datasets and five different classification models. Promoting measures were introduced to assess the outcomes of the applied feature selection methods and then to identify the most suitable one among them for the problem at hand. A deeper understanding of their effects was obtained, together with significant gains in the prediction susceptibility, scalability, goodness, stability and similarity of their outcomes. Moreover, feature selection outcomes were compared on how notably they can improve the overall classification performance towards finding an optimal anti-phishing solution.
As a result, the findings showed that some feature selection methods significantly outperformed their competitors by exhibiting better robustness, prediction, and performance. Other methods fell between the best and the worst cases with respect to the aforesaid quantified factors. This was caused by the variations in dataset sizes and their constituent instances, the compactness of the chosen features and the features themselves, the evaluation criteria of the selected methods, and the discriminating behavior of the applied classifiers on training and testing instances. Moreover, the empirical tests showed that an appropriately chosen set of features outperformed the original set of extracted features and/or the individual features themselves, with the least latency. However, the most powerful selection method (i.e. mRMR) failed to provide an ideal subset of features; it could only produce a feature subset that was as minimal and effective as possible. Nonetheless, mRMR could deal with both redundant and irrelevant features at once. It is also worth mentioning that no single feature selection method in this study suited all the classification models. Hence, future work will quantify feature selection outcomes with respect to processing time and misclassification costs. In addition, more classification models will be involved in a remedial framework for feature selection towards rational phish website detection.

Fig. 1. Description of collected datasets in terms of the total number of instances, legitimate websites and phish websites

Fig. 3. Illustration of empirical test across five feature selection methods. FSAs 1, 2, 3, 4, and 5 refer to mRMR, CBF, WFS, χ², and IG respectively.

• Fig. 3(b) portrays the outcomes of the scalability comparison. It shows that the feature subset chosen by FSA 1 (i.e. mRMR) could successfully raise the prediction score from the typical case to the best one over datasets of different sizes. This, in turn, reaffirms that mRMR can be considered the most scalable FSA among those tested because it could keep its prediction rate as close to the best case as possible.

Fig. 4. Illustration of empirical test across five feature selection methods, where FSAs 1, 2, 3, 4, and 5 refer to mRMR, CBF, WFS, χ², and IG respectively.

• Fig. 4(b) outlines how the feature subsets chosen by FSAs 1 and 2 (i.e. mRMR and CBF) are notably more stable over all datasets than their competitors. Further, it emphasizes the significance of the interdependencies between the features in the same chosen feature subset. Features chosen on the basis of their interdependencies can compose a stable subset under different detection scenarios and datasets. In contrast, subsets chosen according to the top ranking of their individual constituents, as with FSA 5 (i.e. IG), may vary in their discriminating power when faced with a vast dataset or a different detection approach.
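The contrast drawn above between selecting features by their interdependencies and selecting them by individual ranking can be illustrated with a small sketch. It assumes the common "difference" form of mRMR (relevance to the class minus mean mutual information with the features already selected) on discrete features; the exact mRMR and IG variants used in the study are not given in this section, and the feature names and toy data below are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def top_k_by_relevance(features, labels, k):
    """Rank each feature on its own by relevance to the class (IG-style)."""
    relevance = {f: mutual_information(col, labels) for f, col in features.items()}
    return sorted(features, key=lambda f: -relevance[f])[:k]


def mrmr_select(features, labels, k):
    """Greedy mRMR: maximize relevance minus mean redundancy."""
    relevance = {f: mutual_information(col, labels) for f, col in features.items()}
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        def score(f):
            if not selected:
                return relevance[f]
            redundancy = sum(mutual_information(features[f], features[s])
                             for s in selected) / len(selected)
            return relevance[f] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: 1 = phishing, 0 = legitimate.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "ip_in_url":      [0, 0, 0, 1, 1, 1, 1, 1],
    "ip_in_url_copy": [0, 0, 0, 1, 1, 1, 1, 1],  # exact duplicate (redundant)
    "has_login_form": [0, 0, 1, 0, 1, 1, 0, 1],  # weaker but complementary
}
print(top_k_by_relevance(features, labels, 2))
print(mrmr_select(features, labels, 2))
```

On this toy data, individual ranking keeps both copies of the same feature, whereas the mRMR-style criterion discards the duplicate in favour of the weaker but non-redundant feature, which is the behavior the discussion attributes to interdependency-aware selection.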

TABLE II. RELATED WORKS WITH LIMITED FEATURE SELECTION METHODS

• Dissimilarity of selection outputs