Classification Model Evaluation Metrics

The purpose of this paper was to confirm the basic assumption that classification models are suitable for solving the problem of classifying a data set. We selected four representative models, BayesNet, NaiveBayes, MultilayerPerceptron, and J48, and applied them to a four-class classification of a specific set of hepatitis C virus data for Egyptian patients. We conducted the study using the classification models in the WEKA software, developed at the University of Waikato, New Zealand. The results were poor: none of the four envisaged classes was determined reliably. We describe all 16 metrics used to evaluate classification models and list their characteristics, their mutual differences, and the parameter that each metric evaluates. We present, in concise tabular form, the comparative values of each metric for each classification model, the detailed class accuracy with a table of best and worst metric values, the confusion matrices for all four classification models, and a table of type I and type II errors for all four classification models. In addition to the 16 metrics described, we list seven other metrics, which we did not use because we did not have the opportunity to show their application on the selected data set. The metrics rated the selected classification models, which are standard and considered reliable, negatively. This led to the conclusion that the data in the selected data set should be pre-processed before it can be reliably classified by the classification models.

Keywords: Classification model; classification models; evaluate classification models; worst metric values; four-class classification; metric classification; reliable classified classification models; detailed class accuracy

Subject areas: Artificial intelligence and machine learning; software engineering


I. INTRODUCTION
A specific set of data on the hepatitis C virus, consisting of 1385 instances described by 29 attributes, was considered [12]. The goal is to classify these instances into four classes, which represent hepatitis diseases: class a - Portal fibrosis, class b - Little sepsis, class c - A lot of sepsis, and class d - Cirrhosis [6]. This paper challenges this classification. Sources in the literature suggest that classification into five classes would be better: class a - liver inflammation, class b - fibrosis, class c - cirrhosis, class d - end-stage liver disease (ESLD), and class e - cancer [15].
The initial assumption is that the standard, generally accepted classification models BayesNet, NaiveBayes, MultilayerPerceptron, and J48 are suitable for such a classification. These models are available in the WEKA software and, as such, were applied to the selected data set. The results were unsatisfactory: the available instances were classified very poorly.
This prompted us to consider why this is so. The four models were chosen at random. In this introduction, we give their generally accepted definitions.
A Bayesian network is defined as a system of event probabilities, represented as nodes in a directed acyclic graph, in which the probability of an event can be calculated from the probabilities of its predecessors in the graph. The nodes in the network are variables. They can be concrete values, random values, latent values, or hypotheses, and each is characterized by a probability distribution. Probability is a quantity that represents a state of knowledge or a state of belief. In the Bayesian view, a probability is assigned to a hypothesis. In the frequentist view, the hypothesis is tested without assigning it a probability. The result of Bayesian analysis is Bayesian inference, which updates the prior probability assigned to the hypothesis as more evidence and information are obtained [3], [16].
Naive Bayesian classifiers are based on the naive assumption that the features are mutually independent. In this way, each feature distribution can be estimated independently as a one-dimensional distribution. This alleviates the problems arising from the "curse of dimensionality", that is, the difficulties caused by the large number of variables that can be collected from a single sample. An example is the need for data sets whose size grows exponentially with the number of features [3], [14], [16], [18].
A multilayer perceptron is defined as a system composed of a series of elements (nodes, or "neurons") organized into layers. The layers process information by reacting dynamically to external inputs. The input layer has one neuron for each component of the input data and communicates with the hidden layers of the network. The entire processing of the input data takes place in the hidden layers. The input data are weighted by appropriate coefficients. Each neuron accepts the weighted inputs, calculates their sum, and processes it with an activation function. The result is passed on in a "forward" process. The last hidden layer is connected to the output layer, which has one neuron for each possible output [3], [14], [16], [18].
J48 is a machine learning model based on the decision tree. It is the WEKA implementation of the C4.5 algorithm, a successor of the ID3 algorithm (Iterative Dichotomiser 3). The decision tree presents and analyzes decision-making situations in which one decision follows from another. This facilitates understanding of selection problems, assessment of the available decision alternatives, and coverage of uncertain events that affect the outcomes and alternatives of the decision [3], [14], [16], [18].
599 | Page www.ijacsa.thesai.org
The first idea was to consider the metrics used to evaluate the classification models. The 16 metrics used by the WEKA software were reviewed, described, and explained [4]. In addition to these, the following metrics exist: false discovery rate [21], log loss [22], Brier score [23], cumulative gain chart [24], lift curve [25], and Kolmogorov-Smirnov test [26]. These metrics were not considered because they are not included in the WEKA software that was used, so they could not be applied to rate the classification models on the selected data set.
The research made a significant contribution to the interpretation of the 16 mentioned metrics, elements, and parameters that each of them uses to evaluate the classification models.
A significant contribution is also the question: why did the metrics negatively evaluate the classification models used on the selected data set?
As a result of this research, other questions arose. Is the number of attributes per instance of the observed data set too large? How many attributes are needed (optimal) and what are those attributes? Is it necessary to pre-process the data of the observed set? What are the techniques for pre-processing data in a set?
Incidentally, the question arose as to whether the four classes used to classify the instances of the observed set were correctly determined.

II. METRICS
1) Correctly classified instances are the sum of true positives (TP) and true negatives (TN).
2) Incorrectly classified instances are the sum of false positives (FP) and false negatives (FN).
3) Kappa statistic. Cohen's kappa coefficient ($\kappa$) measures how closely the instances classified by the machine learning model match the data labeled as ground truth, while controlling for the accuracy of a random classifier, measured by the expected accuracy. For a classifier that guesses uniformly at random, the expected accuracy is 1/k, where k is the number of classes in the data set. In binary classification k = 2, so the random accuracy is 50%.
The coefficient is calculated by the formula:
$$\kappa = \frac{p_0 - p_e}{1 - p_e}$$
where $p_0$ is the total accuracy of the model and $p_e$ is the random accuracy (the accuracy of the model expected by chance).
In binary classification, $p_e = p_{e1} + p_{e2}$, where $p_{e1}$ is the probability that the predictions agree by chance with the actual values of class 1 ("good"), and $p_{e2}$ is the probability that the predictions agree by chance with the actual values of class 2 ("bad"). The assumption is that the two classifiers (the model prediction and the actual class value) are independent. In this case, each of the probabilities $p_{e1}$ and $p_{e2}$ is calculated by multiplying the proportion of instances that actually belong to the class by the proportion of instances predicted as that class [2], [20].
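The calculation of the kappa coefficient from the total accuracy p0 and the random accuracy pe can be illustrated with a minimal sketch. It assumes a confusion matrix stored as a list of rows, with rows as actual classes and columns as predicted classes; the function name `cohen_kappa` and the example counts are hypothetical and are not taken from the study.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    Assumed layout (not from the paper): rows are actual classes,
    columns are predicted classes.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # p0: observed accuracy, the share of instances on the main diagonal.
    p0 = sum(confusion[c][c] for c in range(k)) / n
    # pe: chance agreement, summed over classes as
    # (share of instances in the class) * (share of instances predicted as the class).
    pe = 0.0
    for c in range(k):
        actual_share = sum(confusion[c]) / n
        predicted_share = sum(row[c] for row in confusion) / n
        pe += actual_share * predicted_share
    return (p0 - pe) / (1.0 - pe)


# Hypothetical binary example with classes "good" and "bad" (100 instances).
example = [[40, 10],
           [20, 30]]
print(round(cohen_kappa(example), 3))
```

In this example p0 = 0.7 and pe = 0.5, so the coefficient is 0.4. The sketch does not handle the degenerate case pe = 1, in which the formula is undefined.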

4) Mean absolute error (MAE) is the mean of the absolute values of the individual prediction errors over all instances in the test set. Each prediction error is the difference between the actual value and the predicted value for the instance.
The mean absolute error $E_i$ of an individual model i is calculated by the formula:
$$E_i = \frac{1}{n}\sum_{j=1}^{n}\left|P_{(ij)} - T_j\right|$$
where $P_{(ij)}$ is the value predicted by the individual model i for record j (of n records), and $T_j$ is the target value for record j. For a perfect prediction, $P_{(ij)} = T_j$ and $E_i = 0$. Thus, the index $E_i$ ranges from 0 to infinity, and 0 corresponds to the ideal [14], [28].
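As an illustration of the definition above, the following minimal sketch averages the absolute differences between predicted and target values. The function name and the sample values are hypothetical; how WEKA derives the predicted values for nominal classes is not covered here.

```python
def mean_absolute_error(predicted, target):
    """Mean of |P_j - T_j| over all n records."""
    if len(predicted) != len(target) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# Hypothetical predictions and targets for three records.
predicted = [0.9, 0.2, 0.4]
target = [1.0, 0.0, 1.0]
print(round(mean_absolute_error(predicted, target), 4))
```

The individual errors here are 0.1, 0.2, and 0.6, so the mean absolute error is 0.3. A perfect prediction gives 0.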

5) Root mean squared error (RMSE) is the square root of the mean of the squared prediction errors. Taking the square root reduces the error to the same dimensions as the predicted quantity.
The root mean squared error $E_i$ of an individual model i is calculated by the formula:
$$E_i = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(P_{(ij)} - T_j\right)^2}$$
where $P_{(ij)}$ is the value predicted by the individual model i for record j (of n records), and $T_j$ is the target value for record j. For a perfect prediction, $P_{(ij)} = T_j$ and $E_i = 0$. Thus, the index $E_i$ ranges from 0 to infinity, and 0 corresponds to the ideal [27].
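A minimal sketch of the root mean squared error follows, under the same assumptions as the definition above: P(ij) and Tj are given as plain lists of numbers. The function name and the sample values are hypothetical.

```python
import math


def root_mean_squared_error(predicted, target):
    """Square root of the mean of (P_j - T_j)^2 over all n records."""
    if len(predicted) != len(target) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(predicted, target))
    return math.sqrt(squared / len(target))


# Hypothetical predictions and targets for three records.
predicted = [0.9, 0.2, 0.4]
target = [1.0, 0.0, 1.0]
print(round(root_mean_squared_error(predicted, target), 4))
```

Because the errors are squared before averaging, a single large error (0.6 in this example) weighs more heavily than it does in the mean absolute error.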

6) Relative absolute error (RAE) is the total absolute error normalized by dividing it by the total absolute error of a simple predictor (the ZeroR classifier). The relative absolute error $E_i$ of an individual model i is evaluated by the equation:
$$E_i = \frac{\sum_{j=1}^{n}\left|P_{(ij)} - T_j\right|}{\sum_{j=1}^{n}\left|T_j - \bar{T}\right|}$$
where $P_{(ij)}$ is the value predicted by the individual model i for record j (of n records), $T_j$ is the target value for record j, and $\bar{T}$ is given by the formula:
$$\bar{T} = \frac{1}{n}\sum_{j=1}^{n} T_j$$
For a perfect prediction, the numerator is 0 and $E_i = 0$. Thus, the index $E_i$ ranges from 0 to infinity, and 0 corresponds to the ideal. A good prediction model produces a ratio close to zero. A poor model (one that is worse than the naive model) produces a ratio greater than one, that is, above 100% when expressed as a percentage [27].
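The normalization by a simple predictor can be illustrated with a minimal sketch. It assumes that the simple predictor always outputs the mean of the target values; the function name and the sample values are hypothetical.

```python
def relative_absolute_error(predicted, target):
    """Total absolute error divided by the total absolute error of a
    predictor that always outputs the mean target value (assumed baseline)."""
    if len(predicted) != len(target) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_target = sum(target) / len(target)
    model_error = sum(abs(p - t) for p, t in zip(predicted, target))
    baseline_error = sum(abs(t - mean_target) for t in target)
    return model_error / baseline_error


# Hypothetical predictions and targets for four records.
target = [1.0, 0.0, 1.0, 0.0]
predicted = [0.9, 0.2, 0.4, 0.1]
print(round(relative_absolute_error(predicted, target), 4))        # model
print(round(relative_absolute_error([0.5] * 4, target), 4))        # baseline itself
```

The model in this example has half the total absolute error of the baseline, so its ratio is 0.5 (50%). The baseline compared with itself gives exactly 1 (100%). The sketch does not handle the case in which all target values are equal and the denominator is zero.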

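7) Root relative squared error (RRSE) is relative to what the error would be if a simple predictor had been used. The relative squared error is the total squared error divided by the total squared error of the simple predictor. Taking the square root reduces the error to the same dimensions as the predicted quantity. The root relative squared error $E_i$ of an individual model i is calculated by the formula:
$$E_i = \sqrt{\frac{\sum_{j=1}^{n}\left(P_{(ij)} - T_j\right)^2}{\sum_{j=1}^{n}\left(T_j - \bar{T}\right)^2}}$$
where $P_{(ij)}$ is the value predicted by the individual model i for record j (of n records), $T_j$ is the target value for record j, and $\bar{T}$ is the mean of the target values. For a perfect prediction, the numerator is equal to 0 and $E_i = 0$. The index $E_i$ ranges from 0 to infinity, and 0 corresponds to the ideal [28].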
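A minimal sketch of the root relative squared error follows. As with the relative absolute error, it assumes that the simple predictor always outputs the mean of the target values; the function name and the sample values are hypothetical.

```python
import math


def root_relative_squared_error(predicted, target):
    """Square root of (total squared error of the model divided by the
    total squared error of a mean-value predictor, the assumed baseline)."""
    if len(predicted) != len(target) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_target = sum(target) / len(target)
    model_error = sum((p - t) ** 2 for p, t in zip(predicted, target))
    baseline_error = sum((t - mean_target) ** 2 for t in target)
    return math.sqrt(model_error / baseline_error)


# Hypothetical predictions and targets for four records.
target = [1.0, 0.0, 1.0, 0.0]
predicted = [0.9, 0.2, 0.4, 0.1]
print(round(root_relative_squared_error(predicted, target), 4))
```

A value below 1 means that the model has a smaller squared error than the baseline, and a value above 1 means that it is worse than the baseline.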

8) Confusion matrix for a binary classifier (Fig. 1). Actual values are marked True (1) and False (0), and predicted values are marked Positive (1) and Negative (0). The measures of the capabilities of classification models are derived from the counts TP, TN, FP, and FN in the confusion matrix [10].

[Fig. 1. Confusion matrix for a binary classifier: actual class, True (1) and False (0), against predicted class.]

Confusion matrix for four-class classification (Fig. 3). Four-class classification is the problem of classifying instances (examples) into four classes: class A, class B, class C, and class D [13], [17].

14) F-Measure, or F-score, is a measure of the accuracy of the test. It is calculated from precision and recall by the formula [11], [19]:
$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
15) Matthews Correlation Coefficient (MCC) is the correlation between the predicted classes and the ground truth. It is calculated from the values in the confusion matrix.
MCC is generally considered a balanced measure, which can be used even if the classes are of very different sizes [7], [11], [19].
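To show how these measures follow from the four counts of a binary confusion matrix, here is a minimal sketch. It uses the standard binary definitions of precision, recall, F-measure, and MCC; the function name, the convention of returning 0.0 when a denominator is zero, and the example counts are assumptions made for illustration.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F-measure, and MCC from binary confusion-matrix counts.

    Undefined ratios (zero denominator) are returned as 0.0, which is an
    assumption of this sketch rather than a rule stated in the paper.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "mcc": mcc}


# Hypothetical counts: 100 instances, 50 actual positives.
for name, value in binary_metrics(tp=40, fp=20, fn=10, tn=30).items():
    print(name, round(value, 4))
```

MCC uses all four cells of the matrix, including TN, whereas the F-measure ignores TN. This is one reason why MCC is regarded as the more balanced of the two when class sizes differ.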

16) ROC Area (Receiver Operating Characteristic Area). The ROC curve is a graph that visualizes the trade-off between True Positive Rate and False Positive Rate (Fig. 9). For each threshold, we calculate the True Positive Rate and the False Positive Rate and plot them on one graph. The higher the True Positive Rate and the lower the False Positive Rate for each threshold, the better. Better classifiers have curves that lie closer to the top-left corner. The area below the ROC curve is called the ROC AUC score, a number that expresses how good the ROC curve is [11].
The ROC AUC score shows how good the model is at ranking predictions. It indicates the probability that a randomly selected positive instance is ranked higher than a randomly selected negative instance [7], [19].
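The ranking interpretation can be sketched directly: count, over all pairs of one positive and one negative instance, how often the positive instance receives the higher score. The function name, the convention of counting ties as one half, and the sample scores are assumptions made for illustration; the sketch compares all pairs and is intended for small examples only.

```python
def roc_auc_by_ranking(scores, labels):
    """Probability that a random positive instance is scored higher than a
    random negative instance (ties count as 0.5, an assumed convention)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: three positive and two negative instances.
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
print(round(roc_auc_by_ranking(scores, labels), 4))
```

In this example five of the six positive-negative pairs are ordered correctly, so the score is 5/6. A value of 0.5 corresponds to random ranking, and 1.0 to a perfect ranking.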

17) PRC Area (Precision-Recall Curve Area) is a single number that describes the capabilities of the model. The PR AUC score is the average of the precision scores calculated for each recall threshold in [0.0, 1.0]. The PRC curve is obtained by combining Positive Predictive Value and True Positive Rate (Fig. 10). For each threshold, Positive Predictive Value and True Positive Rate are calculated and the corresponding point of the graph is plotted. Ideally, the algorithm has both high precision and high sensitivity. These two metrics are not independent, so a compromise is made between them. A good PRC curve has a higher AUC. Research has shown that PRC graphs are more informative than ROC graphs when evaluating binary classifiers on unbalanced sets [5], [9], [19].

Summary of the accuracy of the four representative classifiers expressed by general metrics. The metrics are listed in the rows of the table, and their values for each classifier in the columns. The total number of instances, which is the same for each classifier, is listed separately.
The detailed accuracy of each of the four representative classifiers, for each predicted class, is expressed by the values of eight different metrics. The metrics are in the columns of the table, and the names of the classifiers are in the rows, separately for each class. For each class, the weighted value of each of the eight metrics is shown. This value is the average that results from multiplying each component by a factor that reflects its significance.

Comparative table of the four confusion matrices for all four representative classifiers. The rows give the number for each class, and the columns the actual value of the class.
Comparative table of Type I and Type II error values for each class and each representative classifier. The types of errors are in the rows, and their values in the columns.

IV. DISCUSSION
The average value of correctly classified instances is 25.24%, and incorrectly classified instances 73.68% (Table 1).
Landis and Koch proposed the following standards for the kappa coefficient: ≤0 = poor, 0.01-0.20 = slight, 0.21-0.40 = fair, 0.41-0.60 = moderate, 0.61-0.80 = substantial, and 0.81-1.00 = almost perfect [29]. In line with this proposal, BayesNet and NaiveBayes have a poor kappa coefficient, and MultilayerPerceptron and J48 have a slight one. We conclude that the values of the kappa coefficients show that the instances classified by the machine learning models do not match the data marked as ground truth.
The MAE values, 0.3763 for BayesNet, 0.3748 for NaiveBayes, 0.3718 for MultilayerPerceptron, and 0.3751 for J48, are closer to the lower limit (ideal) than to the upper limit (worst). We therefore consider them acceptable.
Analysis of the detailed accuracy of the classes (Table 1 and Table 2) shows very significant results. Based on the tables of best and worst metric values for detailed class accuracy, we conclude:
1) TP Rate has extremely poor values, close to the worst, for all rated models and all classes. The exception is NaiveBayes, which has the best value, 1.000, for class 4, but the same NaiveBayes has the worst value of TP Rate, 0.000, for classes 1, 2, and 3. MultilayerPerceptron showed a relatively good TP Rate, 0.577, for class 2. The weighted values of TP Rate are consequently poor.
2) FP Rate for NaiveBayes has the optimal value of 0.000 for classes 1, 2, and 3, as opposed to class 4, for which it has the maximum value of 1.000. For BayesNet, MultilayerPerceptron, and J48, as well as for the weighted values of all four models and all four classes, the FP Rate is extremely poor.
3) Precision has values below a satisfactory level for all four models.
4) Recall has the same values as TP Rate. This raises the question of why the two are displayed in separate columns.
5) The F-Measure has values below a satisfactory level for all rated models and all four classes.
6) MCC showed unsatisfactory values, which are at the level of random prediction, for all evaluated models and all classes.
7) The ROC Area showed values for all models and all classes that are on the verge of bad.
8) The value of the PRC Area, for all models and all classes, is at a level that can be considered the worst.
The metrics of detailed assessment by class show unequivocally that the evaluated models, applied in the way presented, are not satisfactory (Table III). This means that new research is needed to answer the question: why do the metrics of detailed accuracy give poor estimates of the models used?
A comparative analysis of the confusion matrices for all four classification models and all four classes shows that the predictions of true positive results (TP) are not good enough (Table 4). Type I and type II errors are relatively high. The goal of modeling is to reduce these errors to minimum values.
Separate consideration of type I and type II errors for the four applied models shows that NaiveBayes has a type I error value equal to 0 for class d, and type II errors for classes a, b, and c (Table 5). These data cast further doubt on the use of this model. For the other three models, the type I and type II errors are, on average, 2.5 times larger than the number of correctly predicted instances.
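How type I and type II errors can be read off a four-class confusion matrix, class by class, is illustrated by the following minimal sketch. It assumes that rows are actual classes and columns are predicted classes, that a type I error for a class is a false positive, and that a type II error is a false negative. The function name and all counts are hypothetical and do not reproduce the values in Table 4 or Table 5.

```python
def per_class_errors(confusion, labels):
    """TP, type I (false positive), and type II (false negative) counts per class.

    Assumed layout (not from the paper): rows are actual classes,
    columns are predicted classes.
    """
    k = len(confusion)
    result = {}
    for c in range(k):
        tp = confusion[c][c]
        # Type II: instances of class c that were assigned to another class.
        type_ii = sum(confusion[c]) - tp
        # Type I: instances of other classes that were assigned to class c.
        type_i = sum(confusion[r][c] for r in range(k)) - tp
        result[labels[c]] = {"TP": tp, "type_I": type_i, "type_II": type_ii}
    return result


# Hypothetical four-class confusion matrix (42 instances).
confusion = [
    [5, 3, 1, 1],  # actual class a
    [2, 6, 2, 2],  # actual class b
    [1, 4, 4, 2],  # actual class c
    [0, 2, 3, 4],  # actual class d
]
for label, counts in per_class_errors(confusion, ["a", "b", "c", "d"]).items():
    print(label, counts)
```

Every misclassified instance is counted once as a type II error for its actual class and once as a type I error for its predicted class, so the two totals over all classes are equal.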

V. CONCLUSIONS
In this paper, we have considered in detail the 16 metrics for the evaluation of classification models that exist in the WEKA software, version 3.4.1, developed at the University of Waikato, New Zealand. The consideration is in line with the initial assumption of the paper that classification models are suitable for solving the classification problem applied to a specific set of hepatitis C virus data for Egyptian patients.
In addition to the above 16 metrics, we found in the literature that there are other metrics: false discovery rate, log loss, Brier score, cumulative gain chart, lift curve, Kolmogorov-Smirnov plot, and Kolmogorov-Smirnov statistic. We did not describe them because we were unable to demonstrate their application to the data set we selected. These metrics remain to be presented in a later review paper.
All the metrics considered evaluated the classification models that we used negatively. This raised doubts, because these models are generally accepted as standard and reliable. Why do the metrics rate them negatively on the selected data set? Is the number of attributes in the selected data set too large? How many attributes are needed, and what are those attributes? Is it necessary to pre-process the data of the selected set?
The special significance of this paper is that it highlights the multitude of metrics used to evaluate each classification model. It emphasizes the diversity of these metrics and the parameters they measure to better understand the model and its features.
New questions and problems that arose from this paper are: What are the techniques for pre-processing data in a data set, and how should discretization, cleaning, reduction, and discussion of the data be performed on the specific hepatitis C virus data set for Egyptian patients?
We suggest that the classification be performed in five classes, as provided in the latest professional literature: class a - inflammation of the liver, class b - fibrosis, class c - cirrhosis, class d - end-stage liver disease (ESLD), and class e - cancer.