Evaluating the Impact of GINI Index and Information Gain on Classification Using Decision Tree Classifier Algorithm

A decision tree is a supervised machine learning algorithm suitable for solving classification and regression problems. Decision trees are built recursively by applying a split condition at each node that divides the training records into subsets whose output variable belongs to the same class. The process starts at the root node of the decision tree and proceeds by applying split conditions at each non-leaf node, producing increasingly homogenous subsets. In practice, however, perfectly homogenous subsets are rarely achievable. Therefore, the goal at each node is to identify an attribute and a split condition on that attribute that minimize the mixing of class labels, resulting in nearly pure subsets. Several splitting indices have been proposed to evaluate the goodness of a split, the most common being the GINI index and information gain. The aim of this study is to conduct an empirical comparison of the GINI index and information gain. Classification models are built with the decision tree classifier algorithm by applying the GINI index and information gain individually. The classification accuracy of the models is estimated using several metrics: confusion matrix, overall accuracy, per-class accuracy, recall and precision. The results of the study show that, regardless of whether the dataset is balanced or imbalanced, the classification models built with the two splitting indices give practically the same accuracy. In other words, the choice of splitting index has no significant impact on the performance of the decision tree classifier algorithm.

Keywords: Supervised learning; classification; decision tree; information gain; GINI index


I. INTRODUCTION
Machine learning problems can be broadly classified into two categories, viz. supervised learning and unsupervised learning, as shown in Fig. 1. In supervised learning, the training data is labeled: each observation in the dataset has both descriptive variables (i.e., independent variables or decision variables) and a labeled outcome variable. Labels can be either categories or continuous values [1]. The labeled dataset is used to train a model that maps the input variables to the output variable, with the aim of accurately predicting the output for future inputs.
Unlike supervised learning, in unsupervised learning the data is not labeled. The training data has descriptive variables only and no outcome variable, so the model has to discover patterns and interesting structures in the data that are not known beforehand [2].
Classification is a supervised learning problem in which the objective is to analyse labeled training data and develop a model that can predict the class of future observations. The decision tree algorithm is commonly used for classification tasks. Decision trees classify data into a finite number of classes based on the values of the input variables, and are most appropriate for categorical data [3].
A decision tree is a simple flowchart that selects the class label of an output variable using the values of one or more input variables. The classification process starts at the root node of the decision tree and proceeds recursively until it reaches a leaf node carrying a class label. At each node a split condition is applied to decide whether the record should continue towards the left or the right subtree [4]. Ideally, the split condition applied at each node results in homogenous subsets, i.e., subsets whose records all have the same class label. With real-world data, however, perfectly homogenous subsets are rarely achievable; some mixing almost always remains. Therefore, while building the decision tree, the goal at each node is to select the split condition that best divides the dataset into homogenous subsets. For this purpose the "goodness of split" criterion was introduced, which is derived from the notion of impurity [5]. Impurity is measured mathematically for each candidate split condition, and the split condition with the lowest impurity value is chosen.
Several indices have been proposed to measure the impurity of a split condition, viz. the GINI index, information gain, gain ratio and misclassification rate. This paper empirically examines the effect of the GINI index and information gain on the classification task. Classification accuracy is measured to check the suitability of the models for making good predictions.
The rest of the paper is organised as follows: Section II introduces the theoretical notions of information gain and the GINI index. Section III reviews the literature. Sections IV and V give the details of the data and the experimental procedure used to compare information gain and the GINI index on imbalanced and balanced datasets, along with the results obtained, and Section VI summarizes the results of the study.

613 | Page www.ijacsa.thesai.org

II. THEORETICAL NOTATION
This section briefly discusses the theoretical notions of information gain and the GINI index. Raileanu and Stoffel [6] presented a theoretical comparison of the GINI index and information gain.
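Let $L$ be a learning sample of labeled records, $L = \{(x_1, c_1), (x_2, c_2), \ldots, (x_n, c_n)\}$, where each $x_k$ is a measurement vector and each $c_k$ is its class label, taken from $j$ distinct classes. A measurement vector can be viewed as a vector of input variables, and split conditions are based on one of these variables. If $p_i$ is the probability that an arbitrary tuple belongs to class $i$, $p_i$ can be estimated as the relative frequency of that class in $L$:

$$p_i = \frac{|L^{(i)}|}{|L|}, \quad i = 1, \ldots, j$$

where $L^{(i)}$ is the set of tuples in $L$ that belong to class $i$.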

A. Entropy
Information gain is based on entropy. Entropy measures the extent of impurity or randomness in a dataset [7]. If all the observations in a dataset belong to one class, the dataset is homogenous, there is no impurity or randomness, and its entropy is 0. Entropy is defined as the negative sum, over all class labels, of the probability of each label times the log probability of that same label:

$$\mathrm{Entropy}(L) = -\sum_{i=1}^{j} p_i \log_2 p_i$$

For a dataset with only one class label, $p_i$ is 1 and $\log_2 p_i$ is 0. Hence the entropy of a homogenous dataset is zero [8]. The higher the entropy, the higher the uncertainty, impurity or mixing [9].
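As a minimal illustration of the entropy definition, the following Python sketch computes the entropy of a list of class labels. Python is used here only for illustration (the study itself used R); the function name `entropy` and the toy labels are assumptions and do not come from the experiment.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of a sequence of class labels."""
    n = len(labels)
    # p_i is the relative frequency of class i; classes that do not occur
    # never appear in the Counter, so log2(0) is never evaluated.
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


# Made-up labels, for illustration only.
pure = ["no"] * 8                   # a homogenous subset
mixed = ["yes"] * 4 + ["no"] * 4    # an evenly mixed two-class subset

print(entropy(pure))   # 0.0
print(entropy(mixed))  # 1.0
```

A homogenous subset has entropy 0, and an evenly mixed two-class subset reaches the two-class maximum of 1.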

B. Information Gain
Information gain is the difference between the entropy of the class and the conditional entropy of the class given the selected feature. It measures the usefulness of a feature $f$ for classification [10], i.e., the difference in entropy from before to after splitting the set $L$ on $f$. In other words, it measures the reduction in uncertainty obtained by splitting the set on a feature. A higher information gain means that the feature $f$ is more useful for classification, so the feature with the highest information gain is selected for the split. Assuming that there are $V$ different values for a feature $f$, and that $L_v$ denotes the subset of $L$ with $f = v$ and $|L_v|$ its size, the information gain after splitting $L$ on $f$ is measured as [8]

$$\mathrm{Gain}(L, f) = \mathrm{Entropy}(L) - \sum_{v=1}^{V} \frac{|L_v|}{|L|}\,\mathrm{Entropy}(L_v)$$
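The information gain computation can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the function names and the toy feature columns (`perfect`, `useless`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Base-2 entropy of a sequence of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of L minus the size-weighted entropy of the subsets L_v."""
    n = len(labels)
    subsets = defaultdict(list)
    for v, c in zip(feature_values, labels):
        subsets[v].append(c)          # L_v: labels of the records with f = v
    conditional = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - conditional


# Made-up data, for illustration only.
labels = ["yes", "yes", "no", "no", "no", "no"]
perfect = ["a", "a", "b", "b", "b", "b"]   # separates the classes completely
useless = ["a", "b", "a", "b", "a", "b"]   # each subset mirrors the parent mix

print(information_gain(perfect, labels))   # approximately 0.918
print(information_gain(useless, labels))   # 0.0 up to rounding
```

A feature that separates the classes completely recovers the full entropy of the parent set, while a feature whose subsets have the same class mix as the parent yields no gain.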
C. GINI Index
The GINI index measures the purity of the class distribution after splitting along a particular attribute. The best split increases the purity of the sets resulting from the split. If $L$ is a dataset with $j$ different class labels, GINI is defined [3] as

$$\mathrm{GINI}(L) = 1 - \sum_{i=1}^{j} p_i^2$$

where $p_i$ is the relative frequency of class $i$ in $L$. If the dataset is split on attribute $A$ into two subsets $L_1$ and $L_2$ with sizes $N_1$ and $N_2$ respectively, GINI is calculated as

$$\mathrm{GINI}_A(L) = \frac{N_1}{N}\,\mathrm{GINI}(L_1) + \frac{N_2}{N}\,\mathrm{GINI}(L_2), \quad N = N_1 + N_2$$
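The GINI index of a node and of a binary split can be illustrated with a short Python sketch. The function names `gini` and `gini_split` and the toy labels are assumptions introduced for illustration only.

```python
from collections import Counter


def gini(labels):
    """GINI index of a set of class labels: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted GINI index of a binary split into subsets L1 and L2."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Made-up labels, for illustration only.
parent = ["yes"] * 5 + ["no"] * 5
print(gini(parent))                                    # 0.5
print(gini_split(["yes"] * 5, ["no"] * 5))             # 0.0
print(gini_split(["yes"] * 4 + ["no"],
                 ["no"] * 4 + ["yes"]))                # approximately 0.32
```

The split with the lower weighted GINI value leaves purer subsets and would be preferred.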

III. LITERATURE REVIEW
This section briefly presents empirical studies that compared the performance of decision tree algorithms using different impurity metrics for feature selection at non-leaf nodes. The aim is to find out, from past studies, whether the choice of feature selection metric has any impact on the accuracy of the model.
Mingers [11] tested different feature selection measures empirically and reported that the choice of measure affects the size of the tree but not its accuracy; the accuracy remained the same even when attributes were selected randomly. Patil [12] studied two decision tree based classification algorithms, C5.0 and CART. C5.0 uses information gain and CART uses the GINI index to select the features for split conditions. The study compared the two algorithms on the task of classifying whether a customer qualifies for a membership card. It found that C5.0 gives a higher classification accuracy (99.6%) than CART (94.8%).
Another study empirically compared different feature selection measures and proposed a variant of the GINI index that uses GINI index ratios for feature selection [13]. It compared the classification accuracy of the modified GINI index with that of ID3, C4.5 and the standard GINI index. The results show that ID3 and C4.5, which are based on information gain, have lower classification and prediction accuracy than the GINI index and the modified GINI index, and that the modified GINI index obtains the highest accuracy among all the algorithms compared. Adhatrao et al. [14] present experiments comparing the performance of two decision tree algorithms, ID3 and C4.5, in predicting the performance of first-year engineering students based on the performance of earlier students who are now in their second year. The results show that both algorithms give the same accuracy. Hssina et al. [15] compared the decision tree algorithms ID3, C4.5, C5 and CART and reported that C4.5 achieved the highest classification accuracy. C4.5 uses information gain to evaluate the goodness of a split.
The studies discussed above give varied results on the performance of information gain and the GINI index. Moreover, these empirical studies compared models built using different tree-based algorithms. These algorithms differ in splitting attribute selection, number of splits (binary/ternary), order of splitting attributes (splitting on the same attribute only once or multiple times), stopping criteria and pruning technique (pre/post) [14]. All these factors contribute to the performance of the models built using these algorithms.
The present study differs in that it focuses only on the impact of the GINI index and information gain on classification. Therefore, unlike the other studies, it develops classification models using a single algorithm, the decision tree classifier, to which the GINI index and information gain are applied individually. This neutralizes the impact of all other factors on the models.

IV. EXPERIMENTAL SETUP
This section gives the details of data and experimental procedure.

A. Dataset Description
The experiment is conducted using real data provided by the UCI Machine Learning Repository [16]. The data was collected by a Portuguese banking institution through phone calls to customers. The dataset is relatively large, with 41187 rows and 21 columns. One input variable, "duration", is discarded, as it is highly multi-valued and should be avoided for good prediction. Details of the remaining variables are given in Table I. The classification goal is to predict whether a customer will subscribe to a term deposit (y) based on the remaining 19 input variables. The dataset is clean; it doesn't have null values. Term deposit (y) is the outcome variable with two class labels (yes or no), so this is a binary classification problem. When developing a decision tree, the goal at each node is to identify the attribute and the split condition on that attribute that best divide the training set into pure subsets at that node [17].
Given a dataset with input variables and an outcome variable with a class label, the decision tree algorithm recursively divides the training set until each division contains examples of the same class label. If all the observations of a division belong to one class, it is a homogenous subset; if they belong to multiple classes, it is impure or heterogeneous [18]. To evaluate the goodness of a split, two splitting indices, the GINI index and information gain, are used. Each is applied in turn with the decision tree classifier algorithm, and a model is developed for each.
The dataset is split into two parts, training and test. The general practice is to divide the dataset in an 80:20 ratio, with 80% training data and 20% test data (unseen data). Using the decision tree classifier algorithm, a classification model is built recursively from the training data, dividing the data until each division is pure (homogenous class), and its prediction accuracy is then tested on the unseen test data. In this experiment, the classification model is trained to predict whether customers would subscribe to a term deposit (yes or no) using the 19 input variables.
The k-fold cross-validation method minimizes the bias associated with the random sampling of training and hold-out data samples when comparing the predictive accuracy of two or more methods [3]. In our experiment, the classification model is trained and tested 10 times: the training set is split into 10 mutually exclusive subsets of equal size and, each time, the model is trained on 9 of the subsets and tested on the remaining one. The overall accuracy is simply the average of the 10 individual accuracies obtained.
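The 10-fold procedure can be sketched in Python as follows. This is a simplified illustration, not the code used in the study: `k_fold_indices`, `cross_validate` and the majority-class baseline scorer are hypothetical names, and the labels are made up. When the number of records is not divisible by k, the fold sizes in this sketch differ by at most one.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [j for f in (folds[:i] + folds[i + 1:]) for j in f]
        yield train, test


def cross_validate(n, fit_and_score, k=10):
    """Average the k accuracies returned by fit_and_score(train_idx, test_idx)."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)


# Made-up imbalanced labels and a trivial stand-in model (illustration only):
# always predict the majority class of the training fold.
labels = [0] * 90 + [1] * 10


def majority_baseline(train, test):
    train_labels = [labels[i] for i in train]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test) / len(test)


print(cross_validate(len(labels), majority_baseline))  # approximately 0.9
```

Any model can be plugged in through the `fit_and_score` callback; the baseline is used only to keep the sketch self-contained.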

B. Decision Tree Classifier
Many algorithms have been proposed for creating decision trees. In this experiment, the decision tree classifier, a supervised learning algorithm, is used. It is based on CART and can be used for creating both classification and regression trees [19]. rpart is an R package that implements many of the ideas found in the CART model. Different splitting criteria can be applied when splitting the nodes of the tree using the rpart function [20]. The classification models built by applying information gain and the GINI index are shown in Fig. 2 and Fig. 3, respectively.
Both splitting measures select the same feature, "Number of employees", with the same split condition at the root node. "Number of employees", which is a numeric attribute, is selected with the split condition nr.employees >= 5088.
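How a split condition on a numeric attribute might be chosen under either criterion can be sketched in Python. This is a simplified illustration and not rpart's actual search procedure: the function `best_threshold` and the toy values are hypothetical. Minimizing the weighted entropy of the two subsets is equivalent to maximizing information gain, because the entropy of the parent node is the same for every candidate split.

```python
import math
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels, impurity):
    """Return the threshold t whose split (value >= t) has the lowest weighted impurity."""
    n = len(labels)
    best_t, best_score = None, float("inf")
    for t in sorted(set(values))[1:]:   # skip the minimum: it would leave one side empty
        left = [c for v, c in zip(values, labels) if v >= t]
        right = [c for v, c in zip(values, labels) if v < t]
        score = len(left) / n * impurity(left) + len(right) / n * impurity(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t


# Made-up numeric attribute and labels, for illustration only.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
labels = ["yes", "yes", "yes", "no", "yes", "no", "no", "no", "no", "no"]

print(best_threshold(values, labels, gini))      # 6
print(best_threshold(values, labels, entropy))   # 6
```

On this toy data both criteria pick the same threshold; the two measures are not guaranteed to agree on every dataset.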

C. Performance Evaluation Metrics
Classification is a technique in which a model is developed using a labeled dataset, i.e., each record in the training dataset has a class label associated with it. The model is later used to predict the class labels of new, unseen data. The predictive accuracy of a classification model is its ability to correctly predict the class label of unseen data. The common metrics for measuring the accuracy of classification models are the confusion matrix, overall accuracy, per-class accuracy, recall and precision [3] [21]. The confusion matrix is created first; all the other metrics are then easily calculated from it.

 Confusion matrix
The confusion matrix gives a detailed view of performance, with a breakdown of correct and incorrect predictions for each class. Performance is measured by comparing the predicted outcome values with the actual values. The information is tabulated in the form of a confusion matrix, as shown in Table II.
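The way the other metrics follow from the four cells of a binary confusion matrix can be sketched in Python. The function name and the counts below are made up for illustration and are not the values reported in the tables. Recall is the accuracy on the positive class and specificity is the accuracy on the negative class, which is how per-class accuracy is read in this sketch.

```python
def classification_metrics(tp, fn, fp, tn):
    """Derive common metrics from the cells of a binary confusion matrix."""
    total = tp + fn + fp + tn
    recall = tp / (tp + fn)         # positive-class accuracy (sensitivity)
    specificity = tn / (tn + fp)    # negative-class accuracy
    precision = tp / (tp + fp)
    return {
        "overall_accuracy": (tp + tn) / total,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Made-up counts for an imbalanced test set (illustration only).
metrics = classification_metrics(tp=95, fn=5, fp=15, tn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the overall accuracy is about 0.83 even though only a quarter of the negative class is classified correctly, which is why per-class figures are reported alongside overall accuracy.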

D. Performance Evaluation on the Test Set
The test set has a total of 8237 observations. The confusion matrices of the decision tree classifier with information gain and the GINI index are shown in Table III and Table IV. The positive (majority) class is represented as 0 and the negative (minority) class as 1. Accuracy, recall, precision and F1 score values are shown in Table V. Table V shows quite clearly that there is no significant difference between the classification accuracy obtained with the two feature selection measures. Overall accuracy as well as per-class accuracy values remain approximately the same. The other observations are in line with the literature, which says that classifiers trained on low-dimensional, imbalanced data classify most of the samples into the majority class [23]. Therefore, it is deceptively simple to achieve high overall accuracy, although it is difficult to classify the data reliably. This is evident from the results obtained, where the majority class accuracy is very high (98.3%) compared with the minority class accuracy (approx. 22%). With an imbalanced dataset, even when the minority class accuracy is very low, the overall accuracy is high because of the high true positive count, as in our case. Hence, the kappa statistic, which takes chance agreement into account, is also measured.

 Kappa Coefficient:
The kappa coefficient is an interesting alternative for measuring the accuracy of classifier models. It is particularly useful when the datasets are imbalanced [24]. It is used to quantify the reproducibility of a discrete variable.
Cohen's kappa (κ) coefficient was originally introduced to measure the level of inter-observer agreement, with values interpreted on a scale from 0 to 1 [25] (negative values, which indicate agreement worse than chance, are possible but rare). If κ is 0, the agreement between observed and expected is only by chance; if it is 1, the agreement is perfect. A κ value between 0 and 0.2 indicates slight agreement, 0.2 to 0.4 fair agreement, 0.4 to 0.6 moderate agreement and 0.6 to 0.8 substantial agreement [26]. The kappa (κ) statistic takes chance agreement into account and is defined as

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance.
The kappa coefficient is used to evaluate the accuracy of models by measuring the agreement between predicted values and true values. Using the confusion matrices in Table III and Table IV, the kappa values for the classifiers are obtained as follows.
Kappa value of the classifier model based on information gain: κ = 0.284. A kappa value of 0.28 indicates that the observed agreement is 28% of the way between chance and perfect agreement.
Kappa value of the classifier model based on the GINI index: κ = 0.293. A kappa value of 0.29 indicates that the observed agreement is 29% of the way between chance and perfect agreement.
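For illustration, Cohen's kappa can be computed from a confusion matrix with a few lines of Python. The function name `cohen_kappa` and the matrix below are assumptions; the counts are made up and are not those of Table III or Table IV.

```python
def cohen_kappa(matrix):
    """Cohen's kappa; matrix[i][j] counts records of actual class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    k = len(matrix)
    # Observed agreement: share of records on the diagonal.
    p_o = sum(matrix[i][i] for i in range(k)) / total
    # Chance agreement: product of actual and predicted class shares, summed over classes.
    p_e = sum(
        (sum(matrix[i]) / total) * (sum(row[i] for row in matrix) / total)
        for i in range(k)
    )
    return (p_o - p_e) / (1 - p_e)


# Made-up confusion matrix (rows: actual, columns: predicted), illustration only.
example = [[95, 5],
           [15, 5]]
print(round(cohen_kappa(example), 3))  # 0.25
```

In this made-up example the overall accuracy is about 0.83, yet kappa is only 0.25, because most of the agreement is already expected by chance when one class dominates.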
The results show that both classifier models obtained nearly equal results. In other words, the classification accuracy of decision trees is not sensitive to the choice of feature selection measure.
The high overall accuracy (approx. 89%) and very low minority class accuracy (22%) show that the data is not classified reliably. This could be because the dataset used in the experiment is highly imbalanced, with 29231 positive (majority) samples and 3719 negative (minority) samples. The next section describes methods for balancing the dataset and discusses the results of the experiment conducted after balancing.

V. BALANCING THE DATASET
Imbalanced datasets have an imbalanced class distribution, whereby more observations belong to one class than to the other. Classification algorithms suffer from imbalanced datasets, which lead to bias and poor generalization. In real-world applications, the minority class is sometimes of most interest, and classifying it correctly should be given high importance, allowing a small error rate in the classification of the majority class, since the cost of misclassifying majority instances could be relatively low [27].
For a binary classification problem, if $S$ is the training data and $y$ is the response variable, [28] defines the imbalanced classification problem as follows: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, where $y_i \in \{-1, 1\}$ are the data labels; $S^+ = \{(x, y) \in S : y = 1\}$ is the set of positive, or minority, instances; and $S^- = \{(x, y) \in S : y = -1\}$ is the set of negative, or majority, instances. (In this definition the positive class is the minority, whereas in Section IV the positive class denotes the majority.)
If $|S^+| \ll |S^-|$ in the training data, the performance of the classification algorithm will be poor, and the misclassification rate will be high, especially for the minority class. Therefore, to improve performance, resampling methods are applied to the training dataset to generate a new set $E$ of synthetic minority class instances, transforming the training dataset into $S = (S^+ \cup E) \cup S^-$.

A. Resampling
The dataset used for the study is imbalanced, with 29231 positive samples and 3719 negative samples. In such situations it is difficult to classify the data reliably, although it is simple to attain high accuracy, so balancing the dataset is essential for reliable classification. The class distribution can be balanced by randomly oversampling minority class observations, by randomly undersampling majority class observations, or by combining both in a systematic manner [29]. Random oversampling tends to overfit the classifiers, while undersampling suffers from the loss of useful observations. Another heuristic method based on oversampling, SMOTE (Synthetic Minority Oversampling Technique), is widely used; it reduces overfitting to a certain extent and performs better than random oversampling. SMOTE generates synthetic observations of the minority class [27] [23].
The data must be split into training and test sets before any of the resampling techniques is applied, to avoid overfitting and poor generalization. After resampling, the training set has a nearly equal number of observations for each class. The number of observations after applying the resampling methods to the training set is given in Table VI.
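The three resampling ideas can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure or package used in the study: all function names are hypothetical, the data points are made up, and `smote_like` only reproduces the core interpolation idea of SMOTE for numeric features (a synthetic point is placed on the segment between a minority point and one of its k nearest minority neighbours).

```python
import random


def random_oversample(minority, target_size, rng):
    """Duplicate randomly chosen minority records until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra


def random_undersample(majority, target_size, rng):
    """Keep a random subset of the majority records."""
    return rng.sample(majority, target_size)


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic minority points by interpolating towards near neighbours."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()   # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Made-up two-feature records, for illustration only.
rng = random.Random(0)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(float(i), float(i % 5)) for i in range(20)]

print(len(random_oversample(minority, len(majority), rng)))   # 20
print(len(random_undersample(majority, len(minority), rng)))  # 4
print(len(minority) + len(smote_like(minority, 16, 2, rng)))  # 20
```

Oversampling only repeats existing minority records, whereas the SMOTE-style points are new values lying between existing minority records, which is why it is less prone to overfitting exact duplicates.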

B. Results: Performance Evaluation after Resampling
After balancing the dataset with the resampling techniques, the experiment described in Section IV is repeated and accuracy is measured. The confusion matrix obtained after applying the resampling techniques is shown in Table VII. Tables VIII and IX summarize the results obtained by the classification models after applying the different resampling techniques. The results show that balancing the dataset has decreased the majority class accuracy but improved the minority class accuracy, by increasing the count of true negatives. As discussed earlier, it is relatively simple to achieve high overall accuracy with imbalanced datasets, but classifying the data reliably is difficult. Thus, after balancing the dataset, the objective of classifying data reliably is better met, as the minority class accuracy has improved. Further analysis of the results shows that SMOTE achieved the highest overall accuracy among all the resampling methods. Also, with the SMOTE technique the kappa value is 0.39 (39%). This suggests that SMOTE is a relatively more reliable technique for balancing the dataset than the other three methods studied.

VI. CONCLUSIONS
The empirical results reported in this paper show that information gain and the GINI index produce practically the same accuracy for classification problems. The experiment was conducted both before and after the dataset was balanced. The results indicate that there is no significant difference in the performance of models using the GINI index and information gain, either before or after balancing. The results are in line with Mingers [11], who stated that splitting indices have no impact on accuracy. In summary, the results obtained in this paper show that the classification accuracy of decision trees, for both balanced and imbalanced datasets, is not sensitive to the choice of the feature selection metrics that were studied.
Another interesting observation is that balancing the dataset lowered the majority class accuracy, with a decrease in the count of true positives, and improved the minority class accuracy, with an increase in the true negative count. In other words, sensitivity decreased and specificity improved after the dataset was balanced. Although overall accuracy decreased, there is a clear and significant rise in the minority class accuracy. This indicates that classification accuracy is sensitive to the number of positive and negative samples in the dataset, i.e., to whether the data is balanced or imbalanced.