A Better Comparison Summary of Credit Scoring Classification

The aim of credit scoring is to classify a customer's credit as defaulter or non-defaulter. Credit risk analysis becomes more effective with further boosting and smoothing of model parameters. The objective of this paper is to explore credit score classification models with and without an imputation technique. Without imputation, data availability is low because records with missing values are dropped from the large dataset. With imputation, on the other hand, the classification accuracy of the linear ANN method is better than that of the other models. The comparison of models with boosting and smoothing shows that error rate is a better metric than the area under curve (AUC) ratio. It is concluded that the artificial neural network (ANN) is a better alternative than decision tree and logistic regression when data availability in the dataset is high.
Keywords: credit score data mining; classification; artificial neural network; imputation


I. INTRODUCTION
In 2006, Taiwan faced a credit flow crisis. To expand in the market, banks had over-issued cash and credit regardless of applicants' ability to repay or their history of overconsumption. This situation badly damaged the relationship between consumers and the finance sector. In a well-developed financial system, crisis management sits downstream and risk management upstream. The purpose of risk management is to enforce checks on consumers' ability to repay bills, and thus to reduce damage and the uncertainty of consumers' credit repayment.
A lender commonly makes two types of decisions: first, whether to grant credit to a new applicant, and second, how to deal with existing applicants, including whether to increase their credit limits [1]. A scoring model is a better alternative to the traditional model, but it is not perfect: sometimes a bad application receives a high score and is accepted, so a better model is needed. Various methods have been tested in the past and in this literature, and they group into parameterized and non-parameterized methods. In this direction, [2]-[3] highlighted the importance of the artificial neural network (ANN) and the support vector machine (SVM) for credit scoring as better alternatives to conventional approaches. From an analysis point of view, this work does not focus on ensemble methods or hybrid models because of their design complexity, although ensemble models have been admired despite that complexity and used for multiple tasks [4].
Logistic regression, which this paper treats as the non-parameterized model, is less suitable for classification than the parameterized models, namely decision trees [5]-[7] and artificial neural networks [8]-[10]. The focus here is to elaborate the importance of model parameters under smoothing and boosting, combined with dataset imputation. This research further examines the existing approaches to find the better smoothing and boosting model with respect to the imputation technique.

II. RELATED WORK
The work in [11] reported that the area ratio (AUC) is a better accuracy metric than error rate when comparing decision tree, logistic regression and ANN. It concluded that ANN performs best on the area ratio metric despite its higher computational cost. Our approach analyzes the existing work in more depth to show that boosting and smoothing of models work well for these core approaches, especially for ANN, which is the better determiner of metric accuracy. We differ from the existing approach because it used the same dataset without treating missing values through an imputation technique. Our work continues the existing work to show that a neural network on a similar dataset performs better when imputation is applied on top of smoothing and boosting of the models.

A. Decision Tree
Decision trees allow the creation of a tree-based classification model. They can graphically illustrate the alternative choices that can be made and enable the decision maker to identify the best option in a given circumstance. Common algorithms for decision tree induction include ID3, C4.5, CART, CHAID and QUEST [12]. The author of [13] states that the decision rules should maximize a divergence measure of the difference in default risk between the two subsets. The splitting is repeated until no group can be split into two subgroups that are statistically different. According to [14], there are three major tasks of a classification tree: 1) how to partition the data at each step; 2) when to stop partitioning; and 3) how to predict the value of y for each x in a partition.

B. Logistic Regression
Logistic regression analysis is a multivariate technique that estimates the probability that an event occurs or not by predicting a binary dependent outcome from a set of independent variables. The logit model is a widely used parametric statistical model for a binary dependent variable. The logit model for credit scoring has been presented and compared with other models, including conventional ones [15].
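For reference, the standard textbook form of the logit model described above can be written as follows. The symbols are assumptions introduced here for illustration and do not come from the paper: x_1, ..., x_n are the independent variables, y = 1 denotes a defaulter, and \beta_0, ..., \beta_n are the fitted coefficients.

```latex
P(y = 1 \mid x) = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{i=1}^{n} \beta_i x_i\right)}},
\qquad
\log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \beta_0 + \sum_{i=1}^{n} \beta_i x_i
```

The second expression shows why the model is called linear in the log-odds: each coefficient shifts the log-odds of default by a fixed amount per unit change in its variable.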

C. Artificial Neural Network (ANN)
ANN is an information processing model resembling the connection structure of synapses. It consists of many nodes (also called neurons or units) connected by links. The feed-forward neural network with back propagation (BP) is widely used for credit scoring; its neurons receive signals from the previous layer and pass their output to the next layer without feedback.
The authors of [16] compared neural networks and linear scoring models in the credit union environment, and the results indicated that the neural network classified bad loans more accurately than the LR model. However, ANN needs many training samples and a long learning time. The study in [17] found that ANN has a higher accuracy rate than logistic regression and discriminant analysis.
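The feed-forward behavior described above, where each layer receives signals from the previous layer and passes its output forward without feedback, can be sketched in a few lines. This is a minimal illustration, not the configuration used in the cited studies: the layer sizes, the sigmoid activation, and every weight value are hypothetical, and training by back propagation is omitted.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer; weights[j] holds the incoming weights of unit j."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs


def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input -> one hidden layer -> single output unit, with no feedback links."""
    hidden = dense(x, hidden_w, hidden_b, activation=sigmoid)
    return dense(hidden, out_w, out_b, activation=sigmoid)[0]


# Hypothetical example: 3 inputs, 2 hidden units, 1 output unit.
x = [0.5, -1.0, 2.0]
hidden_w = [[0.1, 0.2, -0.1], [-0.3, 0.4, 0.2]]
hidden_b = [0.0, 0.1]
out_w = [[0.7, -0.5]]
out_b = [0.05]

p_default = forward(x, hidden_w, hidden_b, out_w, out_b)
print(f"score for the positive class: {p_default:.3f}")
```

Passing `activation=None` to `dense` turns a layer into a purely linear one, which is the only difference between a linear and a non-linear unit in this sketch.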

IV. PROPOSED SOLUTION AND DATASET
This section discusses the boosting and smoothing criteria for each model and its parameters. The purpose of parameterizing the approach is to evaluate the different conventional models and improve classification accuracy. The dataset was taken from the UCI website; it concerns the 2006 credit flow crisis in Taiwan, for which evidence of repayment outcomes is required. All dependent and independent variables are given in Table 1. The models are fed only the strongly and moderately correlated variables identified by the data insight analysis; variables that are not sufficiently correlated are pruned from the dataset after the correlation test. The categorical variables sex, education, marital status and age and the continuous variable Limit_BAL have weak correlation with the other independent variables and are therefore not included as input to the models. The final set of 19 independent variables from Table 1 is given as input to the models.
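The correlation-based pruning described above can be illustrated with a short sketch. It rests on assumptions that the text does not state: the screening is done here with the absolute Pearson correlation between each variable and the target, and the threshold of 0.3, the column names and the toy values are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def select_features(columns, target, threshold):
    """Keep the columns whose absolute correlation with the target meets the threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Hypothetical toy data: 1 = defaulter, 0 = non-defaulter.
target = [0, 0, 1, 1]
columns = {
    "pay_status": [0, 0, 2, 2],  # moves with the target
    "age": [30, 40, 40, 30],     # unrelated to the target
}

selected = select_features(columns, target, threshold=0.3)
print(selected)
```

The same `pearson` helper could be applied between pairs of independent variables if the pruning is meant to be based on inter-variable correlation instead.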

A. Artificial Neural Network (ANN)
The work in [18] presented credit scoring by integrating back propagation neural networks with a linear method. The linear model (1) and non-linear model (2) are defined as follows:

To find the best model, the number of hidden nodes was increased gradually; a setting of 5 hidden nodes followed by 2 hidden layers gave better accuracy. All three models take the same 19 input parameters: ANN-l is the default single-layer model, ANN-H is trained with the minimum number of hidden layers, and ANN-L is trained as a linear model without any activation function. The results are shown in Table 2, where ANN-L performed slightly better than ANN-l and ANN-H on both accuracy metrics, error rate and area under curve (AUC).
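To make the idea of a linear model without any activation function concrete, the sketch below trains a single linear unit by batch gradient descent and thresholds its output to obtain a class. It is an illustrative assumption, not the paper's ANN-L implementation: it uses one input instead of 19, and the learning rate, epoch count, 0.5 threshold and toy data are hypothetical.

```python
def train_linear_unit(xs, ys, lr=0.1, epochs=300):
    """Fit y ~ w * x + b by batch gradient descent on half the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def classify(x, w, b, threshold=0.5):
    """Turn the raw linear output into a class label (1 = defaulter)."""
    return 1 if w * x + b >= threshold else 0


# Hypothetical one-feature toy data: larger x means higher default risk.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = train_linear_unit(xs, ys)
predictions = [classify(x, w, b) for x in xs]
print(w, b, predictions)
```

Because the output is not squashed by an activation function, the unit behaves like a least-squares linear scorer, and the class decision comes only from the threshold applied afterwards.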

B. Decision Tree (DT)
The author of [19] states that the decision rules should maximize a divergence measure of the difference in default risk between the two subsets. The splitting is repeated until no group can be split into two subgroups that are statistically different. According to [20], there are three major tasks of a classification tree: (i) how to partition the data at each step, (ii) when to stop partitioning and (iii) how to predict the value of y for each x in a partition. The purpose of the decision tree [21] is to find the optimal sub-tree that separates bad and good credit based on overall accuracy and error rate. This paper evaluated results using C4.5 classification, which builds the classification tree on the principles of entropy (1) and information gain (2). Each split of the tree is chosen for purity, as measured by entropy and information gain. The results were evaluated and compared among four model configurations, as shown in Table 3. Boosting the decision tree model at 10% gave better accuracy than all other configurations, namely boosting at 100%, pruning and the default DT model.
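The entropy and information gain measures mentioned above can be computed as in the following sketch. It uses the standard Shannon definitions; the toy labels and the candidate split are hypothetical. C4.5 additionally normalizes the gain by the split information (gain ratio), a step left out here for brevity.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(labels, groups):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    total = len(labels)
    weighted = sum(len(group) / total * entropy(group) for group in groups)
    return entropy(labels) - weighted


# Hypothetical node: 1 = defaulter, 0 = non-defaulter.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]  # one candidate split of the parent

gain = information_gain(parent, [left, right])
print(f"parent entropy: {entropy(parent):.4f}, gain: {gain:.4f}")
```

A tree builder would evaluate this gain for every candidate split and keep the one with the highest value, stopping when no split improves purity.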

C. Logistic Regression
The work in [22] was the first published paper to investigate logistic regression (LR) alongside discriminant analysis applied to credit scoring. Its results show LR exhibiting higher accuracy rates; however, neither method was found to be sufficiently good to be cost effective for the problem studied. LR was also applied by [23] to a commercial loan evaluation process (exploring several models using random effects for bank branches).

 
As Table 4 shows, logistic regression scores considerably better on the error rate metric than on AUC for bad and good credit. The area under curve (AUC) value is 77.91, whereas the test accuracy under the error rate metric is 81.22.
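The two metrics compared above measure different things, which the following sketch makes explicit. Error rate is the share of misclassified cases at a fixed decision threshold (accuracy is one minus this value), while AUC is the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The labels, scores and the 0.5 threshold are hypothetical and are not taken from Table 4.

```python
def error_rate(y_true, y_pred):
    """Fraction of cases whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def auc(y_true, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical model output: 1 = defaulter, 0 = non-defaulter.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

err = error_rate(y_true, y_pred)
area = auc(y_true, scores)
print(f"error rate: {err:.2f}, accuracy: {1 - err:.2f}, AUC: {area:.4f}")
```

The pairwise form of AUC is quadratic in the number of cases, which is fine for an illustration; a rank-based computation would be preferable on a dataset of realistic size.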

V. RESULT DISCUSSION
In a statistical analysis, researchers drop the variables that have incomplete data, either consciously or by default. As an alternative to complete-case analysis, researchers may fill in a plausible value for the missing observations, such as the mean of the observed cases on that variable [24]. This research, however, focuses on a nearest-distance-based imputation technique. Besides k-mean, many statisticians have recently advocated methods based on distributional models for the data (such as maximum likelihood and multiple imputation). More has been published in the statistical literature on missing data [25]-[27].
The authors of [28] propose a new approach to clustering that divides the data features into observed features, which are known for all objects, and constraining features, which contain missing values. They generate a set of constraints based on the known values for the constraining features. We found a high percentage of missing values in our dataset; we therefore implemented a similar k-mean clustering technique for imputation to reduce the missing percentage in the dataset and gain accuracy. We evaluated the results with the best models chosen after boosting and smoothing. Fig. 1 shows that, with k-mean imputation, the ANN-Linear model outperformed all other models on the error rate metric. Logistic regression performed slightly below ANN-Linear but better than DT on both the error rate and AUC metrics. Error rate was the better metric for the accuracy gain of the models in all comparisons. We also evaluated all models without imputation, that is, on data that contains more empty values.
The no-imputation results are shown in Fig. 2, which covers the training dataset; it reveals that DT was superior to ANN. On the test dataset, however, ANN-Linear performed better than DT without imputation; Table 5 gives the test dataset comparison between the models.
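One way to realize cluster-based imputation of the kind described above is sketched below. It is an assumption about the general technique, not the authors' implementation: clusters are fitted on the complete rows only, each incomplete row is assigned to the nearest centroid using its observed features, and the missing entries are filled from that centroid. The choice of k = 2, the fixed initialization and the toy rows are hypothetical.

```python
import math


def distance(row, centroid):
    """Euclidean distance over the features that are observed (not None) in `row`."""
    return math.sqrt(
        sum((v - c) ** 2 for v, c in zip(row, centroid) if v is not None)
    )


def kmeans(rows, k, iterations=20):
    """Plain k-means on complete rows; the first k rows serve as initial centroids."""
    centroids = [list(r) for r in rows[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for r in rows:
            nearest = min(range(k), key=lambda j: distance(r, centroids[j]))
            clusters[nearest].append(r)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def impute(rows, k):
    """Fill each None with the matching coordinate of the nearest centroid."""
    complete = [r for r in rows if None not in r]
    centroids = kmeans(complete, k)
    filled = []
    for r in rows:
        nearest = min(centroids, key=lambda c: distance(r, c))
        filled.append([nearest[i] if v is None else v for i, v in enumerate(r)])
    return filled


# Hypothetical two-feature records; None marks a missing value.
rows = [
    [1.0, 1.0], [9.0, 9.0], [1.2, 0.8], [8.8, 9.2],
    [1.1, None], [None, 9.1],
]

filled = impute(rows, k=2)
for original, completed in zip(rows, filled):
    print(original, "->", completed)
```

Rows that are already complete pass through unchanged, so no record has to be dropped from the dataset.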

VI. CONCLUSION
This paper examines the major classification techniques in data mining and compares their classification performance. A k-mean imputation method, adapted to avoid data loss, is presented for this dataset for the first time and compared with no imputation. Error rate is more sensitive than AUC and is therefore the more appropriate criterion for measuring the classification accuracy of the models. The linear artificial neural network model classifies more accurately than the other models, both with and without imputation. The artificial neural network model also shows the best performance on the no-imputation test dataset, but it performed second to last on the training dataset. It is more accurate when more data is available: on the imputation-based dataset its accuracy is better than that of all other models. From the perspective of risk control, estimating client risk without imputation is more meaningful than with imputation. It is also concluded that the artificial neural network model is more reliable for credit scoring to identify bad and good clients. In future, big scoring and its impact can be tested with a larger dataset using an ANN ensemble or hybrid approach to cover multiple tasks: 1) feature selection; 2) classification. Conventional credit scoring techniques have a narrow scope and are not a perfect model, because they analyze only customer payment history and cannot assess customer character, nature and credibility with the help of external sources such as social media.

Fig. 1. The graph shows the comparisons between models with Error Rate Metric and AUC with imputation.

TABLE IV: Regression Classification

TABLE V
in piece fully available for analysis into depth of model, such as in the case of the K-mean imputation technique. K-mean fills the missing values from the nearest neighbor and thus increases data availability, which increases the classification accuracy of ANN. Without imputation, however, the dataset lacks volume because of its missing values; DT therefore performed better than the other models on the training dataset, but on the test dataset ANN still performed better despite the low volume of data.