A Rank Aggregation Algorithm for Ensemble of Multiple Feature Selection Techniques in Credit Risk Evaluation

In credit risk evaluation, the accuracy of a classifier is critical for correctly identifying high-risk loan applicants. Feature selection is one way of improving the accuracy of a classifier: it provides the classifier with the important and relevant features for model development. This study uses an ensemble of multiple feature ranking techniques for feature selection on credit data. It uses five individual rank based feature selection methods and proposes a novel rank aggregation algorithm for combining the ranks produced by the individual methods of the ensemble. The algorithm uses the rank order along with the rank score of the features in the ranked list of each feature selection method. The ensemble selects the relevant features using the 80%, 60%, 40% and 20% thresholds from the top of the aggregated ranked list for building the C4.5, MLP, C4.5 based Bagging and MLP based Bagging models. The performance of the models using the ensemble of multiple feature selection techniques was observed to be better than that of the five individual rank based feature selection methods. The average performance of all the models was best for the ensemble of feature selection techniques at the 60% threshold. Also, the bagging based models outperformed the individual models most significantly at the 60% threshold. This increase in performance is all the more notable because the number of features was reduced by 40% for building the highest performing models, which reduces the data dimensionality and hence the overall data size used for model building. The ensemble of feature selection techniques with the novel aggregation algorithm provided more accurate models which are simpler, faster and easier to interpret.
Keywords: Classification; Credit Risk; Feature Selection; Ensemble; Rank Aggregation; Bagging


I. INTRODUCTION
Data is growing in both the number of records and the number of dimensions. This presents challenges to the machine learning community, which is working on new methods and techniques to speed up data exploration, analysis, and validation tasks. One way of handling this problem is to use an effective sampling methodology to choose a subset of samples that describes the dataset as a whole. This results in a reduced dataset with fewer instances. Another way is to use an appropriate dimensionality reduction or feature selection method to reduce the dimensions of the dataset.
In classification, a vital machine learning problem, the accuracy of the classifier plays an important role. The accuracy depends on many factors, such as whether a single, hybrid or ensemble method is used for modelling; the base models used for the ensemble; the learning algorithm used for model training; the feature selection method used for selecting the relevant features; the sampling technique used for sampling the data; the evaluation method used for testing the model; and many more.
Feature selection is an important pre-processing step in machine learning and pattern recognition problems. It has been an active area of research for the past three decades [1]. Feature selection increases the performance of classification models by eliminating redundant and irrelevant features and thus reducing the dimensionality of datasets [2]. This study uses the feature selection approach to enhance the accuracy of credit risk evaluation models.

A. Credit Risk Evaluation
Quantifying credit risk is a typical bank decision problem of classification, in which new loan applicants are to be classified accurately into either a creditworthy or a non-creditworthy category based on a historical dataset of loan applicants. This historical dataset is used for training the classifier, and the new loan applicant's data is tested on the trained classifier. The class labels, i.e. creditworthy or non-creditworthy, are automatically assigned to the new applicants' records during the testing phase. The credit dataset contains features mainly describing the financial status, demographic details and personal profile of the applicant. Some features of the dataset may provide more significant information for classifying a new loan applicant than others. Some features are not required at all, while others contain redundant or irrelevant information and do not provide any additional information during model development. They do not contribute to the accuracy of the model and sometimes even decrease it, while also slowing down the classifier learning process. A large feature set can produce a more complex model whose interpretation also becomes cumbersome. It can also cause the classifier to overfit the training data [3].

B. Feature Selection for Credit Evaluation
In credit risk evaluation, the accuracy of the classifier is crucial. Even a small increase in model accuracy may result in a large profit for the bank. To improve on the performance of single models, the literature has proposed hybrid and ensemble based models. In credit risk evaluation, many of the ensemble based and hybrid models are developed using feature selection methods during the initial stage [4]. Feature selection is crucial for selecting significant and appropriate features for model development. If the number of features is large, more computation is required, and the accuracy and interpretability of the classification model decrease [5], [6]. A large number of features in credit evaluation also implies a large number of questions for the loan applicants, which is time-consuming and confusing. According to [7], exploring a large number of features leads to the identification of a relevant subset of features for building the credit model.
The relevance of the features needs to be identified before model development so that undesired, redundant and irrelevant features are not used as input to the model. Supervised feature selection determines relevant features by their relation to the corresponding class labels and discards irrelevant and redundant features. The subset of features identified as important helps in reducing the size of the hypothesis space and allows the algorithms to operate faster and more effectively [8]. This smaller feature subset helps in building simplified models, reducing the time and space complexity of the algorithms and hence improving accuracy with well-interpreted results.
The purpose of this paper is to enhance the classification accuracy of credit risk evaluation models. This study uses an ensemble of multiple feature selection techniques for ranking and selecting the significant features.

II. FEATURE SELECTION CRITERIA FOR FILTER BASED FEATURE SELECTION
The filter approach to feature selection works independently of the learning (induction) algorithm (Fig. 1). It operates as a pre-processing step, selecting the important features and presenting them to the learning algorithm as input. The filter approach makes use of the complete training data for its operation. It ranks the features in accordance with their importance with respect to the class. A threshold then has to be defined for selecting the number of most important features from the ranking. Several feature ranking methods [9] are available in the literature, among them correlation based methods, mutual information based methods, and methods based on decision trees and on the distance between probability distributions. These feature ranking methods are based on predefined measures such as dependency measures, information measures, distance measures [10] [11], independent component analysis [12], class separability measures [13], or variable ranking [14].

A. Dependency measures
As discussed in [15] and [2], dependency measures or correlation measures quantify the ability to predict the value of one variable from the value of another. Pearson's correlation coefficient (PCC) is very useful for feature selection [16] [17], as it quantifies the relationship of a feature with its corresponding class label and with other features in the dataset. As per [18], PCC for continuous features is a simple measure but can be effective in a wide variety of feature selection methods.
The features and the class are treated in a uniform manner, and the feature-class correlation and feature-feature inter-correlations are calculated according to the following equation:
$$CC(f_j, c) = \frac{E\left[(f_j - \bar{f_j})(c - \bar{c})\right]}{\sigma_{f_j}\,\sigma_c}$$
where $\bar{f_j}$ and $\sigma_{f_j}$ are the mean and standard deviation of the $j$-th feature, and $\bar{c}$ and $\sigma_c$ are the mean and standard deviation of the vector $c$ of class labels. The ranking values are the absolute values of CC:
$$|CC(f_j, c)|$$
This ranking has a low complexity of the order of O(mn) and is very simple to implement for numerical variables.
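The correlation ranking above can be illustrated with a minimal pure-Python sketch. The function names, the toy feature names (`duration`, `age`) and the toy values are hypothetical and are not taken from the study's data; the class labels are assumed to be coded as 0/1.

```python
import math

def pearson_cc(feature, labels):
    """Pearson correlation between one numeric feature and numeric class labels."""
    m = len(feature)
    mean_f = sum(feature) / m
    mean_c = sum(labels) / m
    cov = sum((f - mean_f) * (c - mean_c) for f, c in zip(feature, labels))
    var_f = sum((f - mean_f) ** 2 for f in feature)
    var_c = sum((c - mean_c) ** 2 for c in labels)
    if var_f == 0 or var_c == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_f * var_c)

def rank_by_abs_cc(columns, labels):
    """Sort features by |CC(f_j, c)| in decreasing order."""
    scores = {name: abs(pearson_cc(col, labels)) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (hypothetical): six applicants, two numeric features, 0/1 class labels.
columns = {
    "duration": [6, 48, 12, 42, 24, 36],
    "age": [67, 22, 49, 45, 53, 35],
}
labels = [0, 1, 0, 1, 0, 1]
ranking = rank_by_abs_cc(columns, labels)
```

Each score needs one pass over the m instances, so ranking n features costs O(mn), in line with the complexity stated above.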
For nominal or categorical variables, the popular feature selection method is Pearson's chi-squared (χ²) test. Numerical variables can also be converted into nominal or categorical types for applying the χ² test. First, a contingency table is built from the raw data. Then, the independence between each variable and the target variable is measured using the contingency table. The statistic χ² is defined by:
$$\chi^2 = \sum_{i=1}^{c} \frac{(O_i - E_i)^2}{E_i}$$
where $O_i$ is the observed frequency, $E_i$ is the expected theoretical frequency asserted by the hypothesis of independence, and $c$ is the number of cells in the contingency table.
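As a rough illustration of the χ² statistic, the sketch below builds the contingency table with a counter and sums over its cells. The helper name `chi_squared` and the toy `housing` and `risk` values are hypothetical.

```python
from collections import Counter

def chi_squared(feature, labels):
    """Chi-squared statistic of a categorical feature against the class labels."""
    m = len(feature)
    observed = Counter(zip(feature, labels))  # cells of the contingency table
    feature_totals = Counter(feature)         # row totals
    class_totals = Counter(labels)            # column totals
    stat = 0.0
    for fv in feature_totals:
        for cv in class_totals:
            # Expected count under the hypothesis of independence.
            expected = feature_totals[fv] * class_totals[cv] / m
            stat += (observed[(fv, cv)] - expected) ** 2 / expected
    return stat

# Toy data (hypothetical): housing status versus credit class.
housing = ["own", "own", "rent", "rent", "own", "rent", "own", "rent"]
risk = ["good", "good", "bad", "bad", "good", "bad", "good", "good"]
stat = chi_squared(housing, risk)
```

A larger statistic indicates a stronger departure from independence, so features would be ranked in decreasing order of this value.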
Correlation-based feature selection is also the basis for symmetrical uncertainty (SU). It is a symmetric measure and can be used to measure feature-feature correlation. The value of symmetrical uncertainty ranges between 0 and 1. A value of 1 indicates that one variable (either X or Y) completely predicts the other [19]. A value of 0 indicates that the two variables are completely independent.
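The text does not give a formula for SU. The sketch below assumes the commonly used definition SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)), with entropies in bits, and checks the two boundary cases described above. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    m = len(values)
    return -sum((n / m) * math.log2(n / m) for n in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """Assumed definition: SU = 2 * I(X;Y) / (H(X) + H(Y))."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant
    h_xy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy    # I(X;Y)
    return 2.0 * mutual_info / (h_x + h_y)

su_identical = symmetrical_uncertainty(["a", "b", "a", "b"], ["a", "b", "a", "b"])
su_independent = symmetrical_uncertainty(["a", "a", "b", "b"], ["p", "q", "p", "q"])
```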

B. Information Measures
Information theory has proved to be very successful in solving many problems [20]. It provides a theoretical framework for measuring the relation between the classes and one or more features. Mutual Information (MI) is a filter-based feature selection metric used to find the relevance of features. It works on the principle of the information shared by two features. Using MI [20], the relevance of a feature subset to the output vector C can be quantified. Formally, the MI is defined as follows:
$$MI(x, y) = \sum_{i}\sum_{j} p(x(i), y(j)) \log \frac{p(x(i), y(j))}{p(x(i))\, p(y(j))}$$
where MI is zero when x and y are statistically independent, i.e., p(x(i), y(j)) = p(x(i))·p(y(j)).
Information Gain (IG) and Gain Ratio (GR) are feature ranking methods based on information measures. IG is the reduction in entropy of the class variable when the value of the independent variable is known. The IG of an attribute X with respect to the class variable Y is given by:
$$IG(Y, X) = H(Y) - H(Y|X)$$
where H(Y) is the entropy of Y and H(Y|X) is the uncertainty about Y for a given X.
The information gain measure is biased towards tests with many outcomes. C4.5 therefore uses Gain Ratio (GR), an extension of IG, to overcome this bias:
$$GainRatio(A) = \frac{Gain(A)}{SplitInfo(A)}$$
where Gain(A) is the information gained by branching on A and SplitInfo(A) is the information obtained by splitting the dataset into the 'n' distinct values of the attribute A. The attribute with the maximum GainRatio is selected for splitting.
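A small sketch may help to show why Gain Ratio corrects the bias of Information Gain. It assumes base-2 logarithms and takes SplitInfo(A) to be the entropy of the attribute's own value distribution. The function names and the toy attributes are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    m = len(values)
    return -sum((n / m) * math.log2(n / m) for n in Counter(values).values())

def conditional_entropy(y, x):
    """H(Y|X): entropy of y within each group of equal x, weighted by group size."""
    groups = defaultdict(list)
    for xv, yv in zip(x, y):
        groups[xv].append(yv)
    m = len(y)
    return sum(len(g) / m * entropy(g) for g in groups.values())

def information_gain(y, x):
    return entropy(y) - conditional_entropy(y, x)

def gain_ratio(y, x):
    split_info = entropy(x)  # information from splitting on the values of x
    return information_gain(y, x) / split_info if split_info > 0 else 0.0

y = ["good", "good", "bad", "bad"]
binary_attr = ["a", "a", "b", "b"]  # two values, perfectly predictive
id_like_attr = [1, 2, 3, 4]         # one value per record, also "perfect"

ig_binary, ig_id = information_gain(y, binary_attr), information_gain(y, id_like_attr)
gr_binary, gr_id = gain_ratio(y, binary_attr), gain_ratio(y, id_like_attr)
```

Both toy attributes obtain the same information gain, but the identifier-like attribute with many distinct values is penalised by its larger split information and receives the lower gain ratio.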

C. Distance Measures
Distance measures, also known as separability, divergence, or discrimination measures, study the difference between the two class-conditional probabilities in a binary context [15] [22]. In other words, a feature X_j is chosen over another feature X_j' if it induces a greater difference between the two class-conditional probabilities than X_j'. If the difference is zero, the two features are indistinguishable. Relief is one of the best-known feature selection methods based on distance measures. The Relief algorithm was proposed in [23]. It is a multivariate method which is sensitive to feature interactions [24]. It estimates the relevance of features according to how well their values distinguish between instances of the same and of different classes that are near each other. It performs well on small-sample datasets with a large number of features. Its computational complexity is O(mn), which is linear, in comparison to other multivariate methods that often have quadratic complexity in the number of features.
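The idea behind Relief can be sketched as follows. This is a simplified, deterministic variant written for illustration only: it visits every instance instead of drawing a random sample, handles numeric features and two classes, and assumes each class has at least two instances. It is not the WEKA ReliefAttributeEval implementation used in the study, and all names and toy values are hypothetical.

```python
def relief_weights(rows, labels):
    """Simplified Relief: reward features that differ on the nearest miss
    and penalise features that differ on the nearest hit."""
    m, n = len(rows), len(rows[0])
    lo = [min(r[j] for r in rows) for j in range(n)]
    hi = [max(r[j] for r in rows) for j in range(n)]
    span = [(hi[j] - lo[j]) or 1.0 for j in range(n)]  # avoid division by zero

    def diff(j, a, b):
        # Per-feature difference, scaled to [0, 1].
        return abs(a[j] - b[j]) / span[j]

    def distance(a, b):
        return sum(diff(j, a, b) for j in range(n))

    weights = [0.0] * n
    for i in range(m):
        hits = [k for k in range(m) if k != i and labels[k] == labels[i]]
        misses = [k for k in range(m) if labels[k] != labels[i]]
        hit = min(hits, key=lambda k: distance(rows[i], rows[k]))
        miss = min(misses, key=lambda k: distance(rows[i], rows[k]))
        for j in range(n):
            weights[j] += (diff(j, rows[i], rows[miss]) - diff(j, rows[i], rows[hit])) / m
    return weights

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
rows = [[0.10, 0.9], [0.20, 0.1], [0.15, 0.5],
        [0.90, 0.8], [0.80, 0.2], [0.85, 0.45]]
labels = [0, 0, 0, 1, 1, 1]
weights = relief_weights(rows, labels)
```

Features with larger weights would be ranked higher.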

D. Feature Ranking
Feature ranking uses the filter based measures discussed above to compute a scoring function from the values $(x_i^j, y_i)$. A high score is taken to indicate a valuable feature, and the features are sorted in decreasing order of the scoring function [25]. It is computationally efficient, since it requires only the computation of d scores and their sorting. It is statistically robust against overfitting because, although it introduces bias, it may have considerably less variance [26]. Feature ranking can therefore be preferable to other feature selection methods.

III. BACKGROUND
In general, each of the feature ranking criteria for filter based feature selection discussed above has some limitation in its performance. Distance based measures such as Relief are good at capturing the relevance of features to the target variable but do not capture the redundancy among the features. A dependency measure such as PCC is not able to capture correlations that are not linear [2]. The dependency measures and information measures suffer from time complexity issues, since they have to evaluate all possible subsets. Therefore they are not practical for dealing with high dimensional data.
Due to these limitations of the filter based methods, it is difficult to find the best criterion for a particular problem. According to [27], this problem is called the selection trouble. The best approach is to independently apply a combination of the available methods and evaluate the results.
Aggregating the ranked lists from individual rankers into a single better ranking is called rank aggregation. Rank aggregation is an ensemble based feature selection method, and it is considered an important emerging tool for combining information with the purpose of achieving higher accuracy.

IV. ENSEMBLE METHOD FOR FEATURE SELECTION
An ensemble of classifiers is a set of base classifiers that are individually trained. For classifying new instances, the decisions of these classifiers are combined using weighted or unweighted majority voting [28] [29]. According to [30], an ensemble model can outperform the single base models when weak or unstable models are combined. Given the advantage of ensemble based classifiers over individual ones, the ensemble concept can also be applied for performance enhancement in the feature selection process.
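The voting step can be sketched in a few lines. The helper `majority_vote` and the example weights are hypothetical; leaving out the weights gives the unweighted case.

```python
from collections import defaultdict

def majority_vote(predictions, weights=None):
    """Combine base-classifier decisions by (optionally weighted) majority voting."""
    if weights is None:
        weights = [1.0] * len(predictions)  # unweighted voting
    tally = defaultdict(float)
    for label, weight in zip(predictions, weights):
        tally[label] += weight
    return max(tally, key=tally.get)

# Three hypothetical base classifiers vote on one applicant.
unweighted = majority_vote(["good", "bad", "good"])
weighted = majority_vote(["good", "bad", "good"], weights=[0.2, 0.9, 0.3])
```

With equal weights the two "good" votes win, while the weighted variant lets a single highly weighted classifier override them.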

A. Ensemble of a Single Feature Ranking Technique
An ensemble of a single feature ranking technique uses Bagging (Bootstrap Aggregation) or some other algorithm to generate various bags of data. The feature ranking is computed for each bag, and the ensemble is formed by combining the individual bag rankings by weighted voting, using linear aggregation [31].
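A minimal sketch of this scheme is given below, under several assumptions: the ranker is a hypothetical stand-in (the absolute gap between class means), every bag receives the same weight, and the linear aggregation is a plain sum of rank positions. None of these choices is taken from [31].

```python
import random

def score_mean_gap(rows, labels, j):
    """Hypothetical stand-in ranker: absolute gap between the class means of feature j."""
    pos = [r[j] for r, y in zip(rows, labels) if y == 1]
    neg = [r[j] for r, y in zip(rows, labels) if y == 0]
    if not pos or not neg:
        return 0.0  # the bag contains a single class
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

def bagged_ranking(rows, labels, n_bags=10, seed=0):
    """Rank features on bootstrap bags and aggregate the per-bag rank positions."""
    rng = random.Random(seed)
    m, n = len(rows), len(rows[0])
    rank_sum = [0] * n
    for _ in range(n_bags):
        idx = [rng.randrange(m) for _ in range(m)]  # sample with replacement
        bag_rows = [rows[i] for i in idx]
        bag_labels = [labels[i] for i in idx]
        scores = [score_mean_gap(bag_rows, bag_labels, j) for j in range(n)]
        order = sorted(range(n), key=lambda j: scores[j], reverse=True)
        for position, j in enumerate(order, start=1):
            rank_sum[j] += position  # equal weight for every bag (assumption)
    # A smaller summed position means a better overall rank.
    return sorted(range(n), key=lambda j: rank_sum[j])

# Toy data (hypothetical): feature 0 is informative, feature 1 is nearly constant.
rows = [[0.10, 0.50], [0.20, 0.52], [0.15, 0.48],
        [0.90, 0.50], [0.80, 0.51], [0.85, 0.49]]
labels = [0, 0, 0, 1, 1, 1]
final_order = bagged_ranking(rows, labels)
```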

B. Ensemble of Multiple Feature Ranking Technique
In this method, multiple feature ranking techniques are used for ranking the features in order of their relevance for building an ensemble. The same training data is used by all the ranking methods, and the results of these methods, i.e. the ranking lists, are combined in a certain way to obtain a final ranked list of the features. Thus, multiple feature ranking lists are turned into a single feature ranking list in the following two steps: first, a set of different ranking lists is created using the corresponding rankers, and second, these ranking lists are combined using the rank ordering of features [32]. Suppose a dataset 'D' has 'I' instances and 'k' features. During the first step, a set of n ranking lists {F1, F2, F3…Fn} is obtained (one for each of the n feature selection methods used).
In the second step, a rank aggregation method is used for combining the ranks of individual features from the n ranking lists obtained in the first step. Let $f_i^j$ be the rank of feature i from ranking list j; then the set of rankings of feature i is given by:
$$\{f_i^1, f_i^2, \ldots, f_i^n\}$$
The new rank obtained by feature i using the combination method C is:
$$C(f_i^1, f_i^2, \ldots, f_i^n)$$

C. Rank Aggregation
Different combination or rank aggregation methods are used for creating an aggregated feature ranking list from the individual feature ranking lists in ensembles of multiple feature selection techniques. Recently, there have been studies applying the ensemble concept to the process of feature selection [33]. The results of this technique are more stable and accurate, as the different ranking methods explore different important qualities of the data. A combination of these qualities in one ranking scheme can be expected to outperform each individual ranking method.
Research in the field of feature selection has proposed rank aggregation methods such as the sum, mean, median, highest rank or lowest rank aggregation, as well as more complex ones [33]. Moreover, research is ongoing on giving more weight to top-ranking features or on combining well-known aggregation methods in search of the best list, which is an optimization problem.
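The simple aggregation methods named above differ only in the function applied to the ranks of each feature. The sketch below assumes that rank 1 is the best rank and that every list ranks the same features; the ranker names, feature names and rank values are hypothetical.

```python
from statistics import mean, median

# Hypothetical ranks (1 = best) of three features under three rankers.
rank_lists = {
    "chi_square": {"f1": 1, "f2": 2, "f3": 3},
    "info_gain": {"f1": 2, "f2": 1, "f3": 3},
    "relief": {"f1": 1, "f2": 3, "f3": 2},
}

def aggregate(rank_lists, combine):
    """Apply a combination function to the ranks of every feature, then sort ascending."""
    features = list(next(iter(rank_lists.values())))
    combined = {f: combine([ranks[f] for ranks in rank_lists.values()]) for f in features}
    return sorted(combined, key=combined.get)

by_sum = aggregate(rank_lists, sum)
by_mean = aggregate(rank_lists, mean)
by_median = aggregate(rank_lists, median)
by_highest = aggregate(rank_lists, min)  # best rank achieved in any list
by_lowest = aggregate(rank_lists, max)   # worst rank received in any list
```

Ties, such as the one between f1 and f2 under the highest-rank rule, are broken here by the original feature order, which is one of several possible conventions.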

V. METHODOLOGY
In this paper, an ensemble of multiple feature selection methods has been used for selecting the important features for the classifier. For combining the ranks of the individual feature selection methods, the ensemble uses a fusion based rank aggregation method. For the FS ensemble, five individual filter based FS methods were chosen, based on different measures of feature ranking. These were the Chi-Square and Symmetrical Uncertainty FS methods, based on dependency measures; the Information Gain and Gain Ratio FS methods, based on information theory measures; and the Relief FS method, based on distance measures.
In the first step, the five filter-based feature selection methods were used for ranking the features by their importance. The result of the first step is five ranked lists from the five individual feature selection methods.
The individual feature selection methods used are the Chi-Square, Information Gain, Gain Ratio, ReliefAttributeEval and SymmetricalUncertaintyAttributeEval from the WEKA software environment for knowledge analysis [34]. The study conducts experiments for ranking features using each feature selection method.
The second step proposes a new fusion based Rank Aggregation Algorithm for an ensemble of multiple feature selection techniques. The algorithm, described in Fig. 2, makes use of both the rank score and the rank order of each feature in the ranked lists for rank aggregation. It operates as follows: first, the k individual feature selection methods rank the n features in descending order of their importance. Hence, each feature selection method generates a ranked list giving the rank score (the value of a feature in the ranked list) and a sequence number m of each feature in the descending ordered ranked list.
In the second step, most rank aggregation methods combine the rank scores of the multiple feature selection techniques in a certain way, such as taking the sum, mean, median, or the highest or lowest rank score. But the rank score alone cannot capture the importance of a feature in the ranked list. The position of the feature in the ranked list is also crucial for judging its importance. The proposed novel aggregation algorithm considers both the rank scores and the rank orders of the features. This aggregation gives more weight to features which not only have higher rank scores but also have higher rank orders in the ranked list. Equation (1) computes the rank order of a feature having sequence number 'm' in a ranked list of 'n' features. Therefore, for a feature having sequence number 1 in a ranked list of 20 features, the aggregation finds the rank order of this feature as 20 by using (1).
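The following sketch illustrates one way such an aggregation could work, under two loudly stated assumptions. First, Equation (1) is assumed to be rank order = n - m + 1, which reproduces the worked example above (m = 1, n = 20 gives 20). Second, the rule for combining rank score and rank order is not recoverable from the text, so the sketch assumes the product of the two, summed over all ranked lists. The function names, ranker names, feature names and scores are hypothetical, and this should not be read as the authors' exact algorithm.

```python
def rank_order(m, n):
    """ASSUMED form of Equation (1): rank order of sequence number m among n features."""
    return n - m + 1

def ensemble_rank(score_lists):
    """Aggregate several {feature: rank score} lists into one ensemble ranked list."""
    features = list(next(iter(score_lists.values())))
    n = len(features)
    ensemble = {f: 0.0 for f in features}
    for scores in score_lists.values():
        # Sort this ranker's list in descending order of rank score.
        ordered = sorted(scores, key=scores.get, reverse=True)
        for m, f in enumerate(ordered, start=1):
            # ASSUMPTION: rank score times rank order, summed over the lists.
            ensemble[f] += scores[f] * rank_order(m, n)
    return sorted(ensemble.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical rank scores from two rankers over three features.
score_lists = {
    "chi_square": {"checking": 0.9, "duration": 0.5, "phone": 0.1},
    "gain_ratio": {"checking": 0.6, "duration": 0.7, "phone": 0.2},
}
ensemble_list = ensemble_rank(score_lists)
```

Because different rankers produce scores on different scales, a practical version would probably normalise each list before combining; the sketch leaves the scores as they are.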

A. Data Used
The dataset chosen for this experiment is the German credit dataset from the UCI repository [35]. It is a credit dataset with 1000 loan applicants' records and 20 predictor variables. There is one class variable with two classes, Good and Bad. Most of the features are qualitative, and a few are numerical.

B. Feature Selection
For ranking the features in order of their importance, the experiments consider the ensemble of multiple feature ranking techniques and the five individual rank based feature selection methods. The feature selection methods chosen are those that perform better on qualitative data, since the data is mostly qualitative. The novel rank aggregation algorithm uses the rank scores and rank orders of the individual rank based feature selection methods. The threshold values of 80%, 60%, 40% and 20%, i.e. 16, 12, 8 and 4 features, are used for selecting features from the top of the sorted ranked lists. In this way, only the highly ranked features identified as important and relevant by the individual and ensemble feature selection methods are selected for building the classification models.
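The mapping from threshold percentages to feature counts can be checked with a short sketch. The helper `top_fraction` and the placeholder feature names are hypothetical.

```python
def top_fraction(ranked_features, fraction):
    """Keep the given fraction of features from the top of a sorted ranked list."""
    k = round(len(ranked_features) * fraction)
    return ranked_features[:k]

# Placeholder names for the 20 predictor variables, already sorted by rank.
ranked = [f"f{i}" for i in range(1, 21)]
sizes = [len(top_fraction(ranked, t)) for t in (0.8, 0.6, 0.4, 0.2)]
```

For the 20 predictor variables of the German dataset, the four thresholds give 16, 12, 8 and 4 features.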
The performance of the classifiers is compared to find out the best threshold, best model and the best feature selection method which yielded the highest ROC value.The best threshold value indicates that the features selected using it are the most important ones which best described the dataset.
The best model is the one whose average classification performance across all the feature selection methods is the highest.The best feature selection method is the one which yields best average performance across all models built over the features selected by it.

C. Classifier
For testing the impact of the new rank aggregation algorithm on the accuracy of classifiers, the features selected from the aggregated ranked list are taken as inputs to the classifiers. Individual and ensemble based classifiers are used for model building and performance assessment. The individual classifiers used are C4.5 and the MLP, while Bagging is used as the ensemble classifier.

Algorithm (Fig. 2), continued:
..., n
Initialize Ensemble Rank List E = φ
Suppose F_1, F_2, ..., F_k be the feature selection techniques used for the ensemble
For each F_i, i = 1, 2, ..., k
    Calculate rank score of each feature and construct ranked lists R_i, i = 1, 2, ..., k
    Sort each R_i in descending order of rank scores
    Give a sequence number m = 1, 2, ..., n to all the features in each R_i, starting from the top
ENDFOR
For each feature f_j, j = 1, 2, ..., n
    For each sorted ranked list R_i, i = 1, 2, ..., k
        For sequence no. m = 1, 2, ..., n
Sort the Ensemble rank list E using ensemble rank scores in descending order
Output: A sorted ensemble ranked list E containing features and their corresponding ensemble rank scores.

... and complexity of the system, since the focus of the study is the enhancement of classification accuracy of the credit risk evaluation models using the proposed rank aggregation method. For data sampling using bootstrapping, 20 iterations are used, as the classifier did not show any increase in performance with more iterations. More iterations would instead have slowed down the classification process by increasing the number of data samples and hence the time.

D. Accuracy Assessment
The Area Under the Receiver Operating Characteristic (ROC) curve, popularly known as AUC, is used for accuracy assessment. The ROC curve is a graph of True Positive Rate (TPR) versus False Positive Rate (FPR). The models are built using 70% training and 30% test partitions. A random sample of 70% of the dataset is drawn for training the classifier, and the remaining 30% of the data is used for testing it. The correctly classified instances were determined on the test data. A ROC graph was plotted using TPR against FPR for assessing the accuracy.
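As an illustration of this evaluation protocol, the sketch below draws a random 70/30 partition and computes the AUC from classifier scores. It uses the fact that the area under the ROC curve equals the probability that a randomly chosen positive instance is scored above a randomly chosen negative one, with ties counted as one half. The function names, the seed and the toy scores are hypothetical, and the classifier itself is omitted.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Random partition of the records into training and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# 1000 record identifiers, matching the size of the German dataset.
train, test = train_test_split(list(range(1000)))
auc_perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
auc_partial = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```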

VII. RESULTS AND DISCUSSION
Four classifier models (C4.5, MLP, C4.5 based Bagging and MLP based Bagging) were built on the German credit dataset using different numbers of features selected by each FS method. Each model was generated at four different threshold percentages (80%, 60%, 40% and 20%), i.e. with 16, 12, 8 and 4 features selected from the sorted ranked lists of the five individual feature selection methods and the ensemble of multiple FS methods. The performance of the classifiers has been observed using the ROC measure, which is considered a true measure of accuracy. For comparison, each model has also been built using all the features. The average performance of the six FS methods using four different thresholds across the four classifier models is given in Table I. The performance of each FS based ranking method is recorded for the four models at all thresholds. For each FS method and threshold, the performances of all the models built on the selected features are averaged. Similarly, the average performance of every FS method, including the ensemble FS method, has been calculated across all models for the different thresholds. The comparative performance of these FS methods is depicted in Fig. 3.
The graph of Table I shows that the performance of the ensemble of multiple FS methods is higher than that of all individual FS methods for the thresholds of 80% and 60%, while the Chi-square and Information Gain FS methods perform better than the others at the 40% threshold. The symmetrical uncertainty method outperforms the others at the 20% threshold. The graph clearly shows that, for the 40% and 20% thresholds (i.e. a small number of features), the performance of all the FS methods is substantially lower than for the 80% and 60% thresholds.
The graph also shows that the performance of the ensemble of FS methods is highest at the 60% threshold, followed by the 80% threshold. Also, the performance of all FS methods, including the ensemble of multiple FS techniques, declines drastically below the 60% threshold. The individual model performance at the different thresholds using the ensemble FS method is given in Table II. Across all thresholds, the performance of the bagging models with C4.5 and MLP as the base classifiers is much better than that of the individual C4.5 and MLP models. Moreover, the average model performance is best for the bagging model with MLP as the base classifier. It can also be observed that the average performance of all the models is best at the 60% threshold. The graph of the average performance of the individual models in Fig. 4 shows that, at the 60% threshold, the performance of bagging based on the MLP classifier is the highest, followed by bagging based on the C4.5 classifier. The individual C4.5 and MLP models performed best at the 80% threshold, and the individual C4.5 model performed the worst of all models at every threshold.

VIII. CONCLUSION
In credit risk evaluation, the accuracy of a classifier is crucial. Even a small increase in model accuracy may result in a large profit for the bank. For accuracy enhancement, this study uses an ensemble of multiple feature selection techniques for ranking and selecting the important features. A novel rank aggregation algorithm has been proposed that uses the rank scores and rank orders of the individual rank based feature selection methods. The FS ensemble uses the novel rank aggregation algorithm for ranking the features in order of their importance and relevance. The ranked lists of the 5 FS methods and the 1 ensemble based FS method were used to select the top 16, 12, 8 and 4 features. The ensemble based FS method attained its best performance at the threshold of the top 12 features, with an average ROC value of .772, followed by the threshold of 16 features with an average ROC value of .769, while the average ROC value for the dataset without FS is .754. Moreover, these ROC values for the ensemble method are higher than those of all the individual FS methods used. Comparing the ROC values shows that, using the ensemble based FS method at the 60% threshold, the average performance of the four models increased by a ROC of .018.
The results also showed that the bagging based models outperformed the individual models using the ensemble of FS methods at all thresholds. At the 60% threshold, the performance of Bagging using MLP as the base classifier is the highest with a ROC of .809, followed by Bagging using C4.5 as the base classifier with a ROC of .787, while the individual MLP and C4.5 models achieved ROC values of .765 and .727 respectively at the same threshold. By using Bagging, there is an average performance enhancement of .044 and .060 respectively for the individual MLP and C4.5 models across all thresholds. One more inference drawn from the results is that the average performance of the Bagging model with MLP as the base classifier is the best across all thresholds with a ROC of .794, followed by .775 for the Bagging model with C4.5 as the base classifier.
Therefore, the study concluded that an ensemble of multiple feature selection techniques with the proposed novel rank aggregation algorithm yields a significant enhancement in the performance of credit risk evaluation models. The accuracy of the models is enhanced with the selection of the top 80% and 60% of features from the ranked list of the ensemble. However, the accuracy of the models declined with the selection of the top 40% and 20% of features. This may be attributed to the rejection of many relevant features required for building an accurate model. By using the ensemble of multiple feature selection techniques, the bagging based models outperformed the individual models at all thresholds, but most significantly at the 60% threshold. This increase in performance is all the more notable because the number of features is reduced by 40% for building the highest performing models, which indicates a substantial reduction in the size of each instance and hence in the overall data size. Removing irrelevant features simplifies the model building task and reduces the time and space complexity of running the models. A simpler and faster model would help bankers make a quick and precise overall assessment of the risk involved in granting a loan to a customer. Moreover, the irrelevant features with very low ranks, which do not contribute to the model building process, are identified. These features can be omitted by the banks from the loan application forms, making the forms simpler and faster for the applicants to fill in and for the banks to verify.
Future studies can focus on testing the novel rank aggregation algorithm on other high dimensional credit datasets collected from the real world. The algorithm may prove to be more useful for such data with a large number of attributes by selecting only a small number of relevant attributes contributing to the accuracy and simplicity of the model. Even a small enhancement in the accuracy of credit risk evaluation models is very beneficial, as the financial risk associated with credit defaulters gets assessed accurately and on time.

Fig. 1. The Filter based method of Feature Selection

Large values of MI indicate a high correlation between the two features, and zero indicates that the two features are uncorrelated. Many feature selection methods based on MI have been proposed, such as [20] [21].

Fig. 2. A Novel Rank Aggregation Algorithm for Ensemble of FS

The ensemble based bagging technique is used since bootstrapping with replacement in bagging creates diversity within the data used by the classifier, and hence affects the performance of the classifier. The base classifiers used for bagging are C4.5 and the MLP. These classifiers are considered acceptable to use at the cost of time ...

Algorithm: A Novel Rank Aggregation Algorithm for Ensemble of Multiple Feature Selection Techniques
Input: Dataset m*n containing m instances and n features f_j, where j = 1, 2,

TABLE II. INDIVIDUAL MODEL PERFORMANCE ON DIFFERENT THRESHOLDS USING THE ENSEMBLE BASED ON FS METHOD

Fig. 4. Average performance of Individual Models using Ensemble based FS method