Impacts of Unbalanced Test Data on the Evaluation of Classification Methods

The performance of a classifier in a supervised machine learning problem is commonly evaluated using the accuracy, precision, recall, and F1-score. These parameters evaluate classifiers well when the numbers of positive-label and negative-label samples in the testing set are balanced or nearly balanced. However, they may mis-evaluate classifiers when the positive and negative samples in the testing set are unbalanced. This paper proposes an update to these parameters that takes into account the unbalanced factor, which represents the ratio of positive to negative samples in the testing set. The updated parameters are then evaluated experimentally against the traditional parameters.

Keywords—Supervised machine learning evaluation; accuracy; F1-score; unbalanced factor


I. INTRODUCTION
The problem of classification (of texts, images, voice, etc.) is well known in the machine learning community. One of the popular approaches is supervised machine learning, which has two main phases. First, in the training phase, a set of samples already classified with a label, called the training set, is used to extract common features of the samples sharing the same label. This work is done by a classifier. Second, in the testing phase, when a new sample s arrives, the label assigned to s is decided by the classifier trained in the training phase.
The performance of the classifier is commonly evaluated using the accuracy, precision, recall, and F1-score, which are calculated based on the definition of Salton et al. [7]:

Accuracy = (TP + TN) / (TP + FP + FN + TN) * 100% (1)

Precision = TP / (TP + FP) * 100% (2)

Recall = TP / (TP + FN) * 100% (3)

F1-score = 2 * Precision * Recall / (Precision + Recall) (4)

where TP is the number of true positives; FP is the number of false positives; FN is the number of false negatives; TN is the number of true negatives.
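For concreteness, the four traditional parameters can be sketched as a small Python helper (a minimal illustration; the function name is ours, and division-by-zero guards are omitted for brevity):

```python
def classical_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1-score as defined by Salton et al. [7]."""
    accuracy = (tp + tn) / (tp + fp + fn + tn) * 100
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```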
These parameters evaluate classifiers well when the numbers of positive-label and negative-label samples in the testing set are balanced or nearly balanced.
However, these parameters may mis-evaluate classifiers when the positive and negative samples in the testing set are unbalanced. For instance, consider a positive-major testing set in which 90% of the samples have the positive label and 10% have the negative label, and a very simple classifier that always returns TRUE for any testing sample. In that case, with x the number of samples in the testing set, we have TP = 0.9x, FP = 0.1x, FN = 0, and TN = 0, so:

Accuracy = 0.9x / x * 100% = 90.00%

Precision = 90.00%, Recall = 100.00%, F1-score ≈ 94.74%
With accuracy about 90.00% and F1-score about 94.74%, any evaluator could conclude that this is a good classifier. Yet the classifier is a trivial one: it always returns TRUE for any sample. Intuitively, these parameters lose their objectivity in this case.
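The paradox above can be checked numerically (a minimal sketch; x = 1000 is an arbitrary choice of testing-set size):

```python
x = 1000                                   # arbitrary size of the testing set
tp, fp, fn, tn = 0.9 * x, 0.1 * x, 0, 0    # "always TRUE" classifier on a 90/10 split

accuracy = (tp + tn) / (tp + fp + fn + tn) * 100
precision = tp / (tp + fp) * 100
recall = tp / (tp + fn) * 100
f1 = 2 * precision * recall / (precision + recall)
print(round(accuracy, 2), round(f1, 2))    # → 90.0 94.74
```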
In order to avoid such mis-evaluation in the case of unbalanced testing data, this paper proposes an update to these parameters that takes into account the unbalanced factor, which represents the ratio of positive to negative samples in the testing set. The updated parameters are then evaluated experimentally against the traditional parameters. The paper is organised as follows: Section II presents our proposal of the unbalanced factor in the output parameters. Section III presents our experiments to evaluate the proposed update. Finally, Section IV concludes the paper.

II. PROPOSAL
We use the basic concepts based on the definition of Salton et al. [7]:
• Number of true positives (TP): the number of samples which have the considered label and which are also assigned to that label in the results.
• Number of false positives (FP): the number of samples which do NOT have the considered label but which are assigned to that label in the results.
• Number of false negatives (FN): the number of samples which have the considered label but which are NOT assigned to that label in the results.
• Number of true negatives (TN): the number of samples which do NOT have the considered label and which are NOT assigned to that label in the results.
We take into account the unbalanced factor, defined as the ratio between the number of positive samples and the number of negative samples in the testing set:

α = (number of positive samples in the testing set) / (number of negative samples in the testing set) (5)

This unbalanced factor of the testing set is then applied in the output parameters by updating the concepts of accuracy, precision, recall, and F1-score as follows:

Accuracy = (TP + α * TN) / (TP + α * FP + FN + α * TN) * 100% (6)

Precision = TP / (TP + α * FP) * 100% (7)

Recall = TP / (TP + FN) * 100% (8)

F1-score = 2 * Precision * Recall / (Precision + Recall) (9)

Intuitively, these updated parameters reduce to the traditional output parameters when the unbalanced factor equals 1, i.e. when the testing set is balanced or nearly balanced.
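The updated parameters can be sketched as a Python helper (the function name is ours; setting alpha = 1 recovers the traditional definitions):

```python
def adjusted_metrics(tp, fp, fn, tn, alpha):
    """Output parameters weighted by the unbalanced factor alpha = n_positive / n_negative."""
    accuracy = (tp + alpha * tn) / (tp + alpha * fp + fn + alpha * tn) * 100
    precision = tp / (tp + alpha * fp) * 100
    recall = tp / (tp + fn) * 100
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```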
Let us return to the paradox example of Section I: a very simple classifier which always returns TRUE for any testing sample, on a positive-major testing set in which 90% of the samples have the positive label and 10% the negative label. With x the number of samples in the testing set, the unbalanced factor is α = 0.9x / 0.1x = 9, and taking it into account we have:

• Accuracy = (0.9x + 9 * 0) / (0.9x + 9 * 0.1x + 0 + 9 * 0) * 100% = 50.00%

• Precision = 0.9x / (0.9x + 9 * 0.1x) * 100% = 50.00%

• Recall = 100.00%, F1-score ≈ 66.67%
With accuracy about 50.00% and F1-score about 66.67%, any evaluator could conclude that this is a below-average classifier. This matches the nature of the classifier, a trivial one that always returns TRUE for any sample. Intuitively, the updated parameters help us avoid mis-evaluating a simple classifier on an unbalanced testing set.
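Re-running the numeric check with the unbalanced factor confirms these values (a minimal sketch; x = 1000 is an arbitrary testing-set size):

```python
x = 1000                                   # arbitrary size of the testing set
tp, fp, fn, tn = 0.9 * x, 0.1 * x, 0, 0    # "always TRUE" classifier on a 90/10 split
alpha = (tp + fn) / (fp + tn)              # unbalanced factor of the testing set: 9.0

accuracy = (tp + alpha * tn) / (tp + alpha * fp + fn + alpha * tn) * 100
precision = tp / (tp + alpha * fp) * 100
recall = tp / (tp + fn) * 100
f1 = 2 * precision * recall / (precision + recall)
print(round(accuracy, 2), round(f1, 2))    # → 50.0 66.67
```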

III. EVALUATION
This section presents an experiment to evaluate the proposed output parameters on balanced and unbalanced testing sets.

A. Dataset
This experiment evaluates the proposed parameters on the 20 Newsgroups dataset [4]. This dataset contains about 20,000 texts, divided into 20 subjects. The longest text has more than 20,000 words, the shortest about 75 words, and the average length is about 370 words. The dataset is widely used in the machine learning and information retrieval domains, in particular for the problem of text classification. The distribution of texts over the 20 class labels is presented in Table I.

B. Scenario
The main scenario of this experiment is defined as follows: • Using the same training set.
• Using the same classifier. In this experiment, we use the Multinomial Naive Bayes (MNB) classifier [3], which extends the Naive Bayes model with a multinomial event model over word counts. It has already proved its good performance in text classification, as presented in several recent works [5], [6].
• Testing with different sets: balanced testing set, and unbalanced testing set (YES major, and NO major).
• This scenario is repeated ten times, and the output parameters are then compared with and without the unbalanced factor.

1) Building of training set:
The training set is built for each label, based on the one-vs-all method [1], according to the following scenario: • For each label, randomly select 500 texts whose label is the considered label, and 500 other texts whose label is different.
• Divide this set into ten subsets (one for each of the ten runs): each subset has about 100 texts, of which 50 have the considered label and the remaining 50 have another label.
• For each text in each training subset, remove all stopwords.
• Split the remaining word sequence into 1-grams, 2-grams, and 3-grams. The combination of 1-grams to 3-grams was shown to be the best case for the 20 Newsgroups dataset in the work of Nguyen [5], which is why we use this combination in the experiment.
• Transform it into a vector of TF-IDF [7] values.
• Train the classifier with Multinomial Naive Bayes (MNB) [3].

2) Building of testing set: The three testing sets are also built for each label, according to the following scenario:
• Unbalanced testing set with a ratio of 20:80 (NO major, called the 20:80 testing set): randomly select 200 texts whose label is the considered label, and 800 other texts whose label is different. Divide this set into ten subsets (one for each of the ten runs): each subset has about 100 texts, of which 20 have the considered label and the remaining 80 have another label.
• Balanced testing set with a ratio of 50:50 (YES/NO balanced, called the 50:50 testing set): randomly select 500 texts whose label is the considered label, and 500 other texts whose label is different. Divide this set into ten subsets: each subset has about 100 texts, of which 50 have the considered label and the remaining 50 have another label.
• Unbalanced testing set with a ratio of 80:20 (YES major, called the 80:20 testing set): randomly select 800 texts whose label is the considered label, and 200 other texts whose label is different. Divide this set into ten subsets: each subset has about 100 texts, of which 80 have the considered label and the remaining 20 have another label.
• For each text in each testing subset, remove all stopwords.
• Transform it into a vector of TF-IDF value.
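The preprocessing steps above (stopword removal, 1- to 3-gram splitting, TF-IDF weighting) can be sketched in plain Python. This is a simplified illustration: the stopword list here is a stand-in for a real one, the TF-IDF variant is the basic form of [7], and the MNB training step is not reproduced.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "is", "of", "and"}   # illustrative stand-in stopword list

def ngrams(text, n_max=3):
    """Split a text into word 1-grams up to n_max-grams, after stopword removal."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams

def tfidf(docs):
    """TF-IDF vectors (one dict per text) over a list of texts."""
    bags = [Counter(ngrams(d)) for d in docs]
    n_docs = len(docs)
    df = Counter(term for bag in bags for term in bag)   # document frequencies
    return [{t: (c / sum(bag.values())) * math.log(n_docs / df[t])
             for t, c in bag.items()} for bag in bags]
```

A term present in every text (such as "sat" below) gets weight 0, while discriminative terms keep a positive weight, which is the property the classifier relies on.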

C. Output Parameters
We consider the output parameters in two cases: without unbalanced factor (classical), and with unbalanced factor (new proposed).
1) Output parameters without unbalanced factor: In this case, we use the traditional output parameters of accuracy and F1-score as defined by Salton et al. [7] (formulas 1 and 4).
2) Output parameters with unbalanced factor: In this case, we take into account the unbalanced factor α of the testing set. Therefore, we use the output parameters defined in Section II: accuracy (formula 6) and F1-score (formula 9).

D. Results
The results obtained with the output parameters without and with the unbalanced factor are presented in Tables II and III, respectively. These results indicate that the variation of accuracy and F1-score without the unbalanced factor is much higher than with it. For instance, for the label comp.graphics (the 2nd row in Tables II and III): the accuracy varies from 83.83% to 89.58% and 95.35% on the 20:80, 50:50, and 80:20 testing sets, respectively, if the unbalanced factor is not taken into account. Meanwhile, if the unbalanced factor is taken into account, the accuracy becomes much more stable, with values of 90.01%, 89.58%, and 90.30% on the 20:80, 50:50, and 80:20 testing sets, respectively.
The same holds for the F1-score: it varies from 68.26% to 90.33% and 97.16% on the 20:80, 50:50, and 80:20 testing sets, respectively, without the unbalanced factor, while with the unbalanced factor it becomes more stable, with values of 90.88%, 90.33%, and 91.08%, respectively. This behaviour appears for almost all topics of the considered dataset.

Consequently, the average accuracy and F1-score over all 20 topics are more stable with the unbalanced factor than without it (the last row in Tables II and III). For accuracy, the value varies from 88.35% to 93.07% and 96.08% without the unbalanced factor, while with it the value changes only slightly, from 92.60% to 93.07% and 93.47% on the 20:80, 50:50, and 80:20 testing sets, respectively. For F1-score, the value varies from 76.11% to 93.43% and 97.55% without the unbalanced factor, while with it the value changes only slightly, from 93.17% to 93.43% and 93.85%, respectively.
To examine the difference between the two cases in detail, we compared the results of the ten runs for each output parameter. For accuracy (Fig. 1), the values without the unbalanced factor differ significantly across the 20:80, 50:50, and 80:20 testing sets (Fig. 1(a)), whereas with the unbalanced factor there is no significant difference (Fig. 1(b)): the value is stable at about 93%. The same holds for the F1-score (Fig. 2): the values without the unbalanced factor differ significantly across the three testing sets (Fig. 2(a)), whereas with the unbalanced factor there is no significant difference (Fig. 2(b)): the value is stable within 93-94%.
In summary, the experimental results indicate that the unbalanced factor makes the accuracy and F-score of a classification method more stable. In other words, it makes these values more independent of the unbalanced ratio of labels in the testing set.

IV. CONCLUSION
This paper proposed an update to the output parameters used in the evaluation of supervised machine learning methods (accuracy, precision, recall, F1-score), taking into account the unbalanced factor, which represents the ratio of positive to negative samples in the testing set. The updated parameters were then evaluated experimentally against the traditional parameters. The experimental results indicate that the updated parameters evaluate a classifier with a stable value in spite of changes in the unbalanced ratio between positive and negative samples in the testing set.

Fig. 1. Variation of Accuracy in the three testing sets, without and with the unbalanced factor.
Fig. 2. Variation of F1-score in the three testing sets, without and with the unbalanced factor.

TABLE II .
COMPARISON OF ACCURACY AND F1-SCORE (%) WITHOUT THE UNBALANCED FACTOR ON THREE TESTING SETS

TABLE III .
COMPARISON OF ACCURACY AND F1-SCORE (%) WITH THE UNBALANCED FACTOR ON THREE TESTING SETS