Performance Comparison of the Kernels of Support Vector Machine Algorithm for Diabetes Mellitus Classification

Abstract—Diabetes Mellitus is a disease in which the body cannot use insulin properly, making it one of the major health problems in various countries. Diabetes Mellitus can be fatal, cause other diseases, and even lead to death. It is therefore essential to predict the disease so it can be detected and treated early. The SVM algorithm is used to classify Diabetes Mellitus. This study aimed to compare the accuracy, precision, recall, and F1-score values of the SVM algorithm with various kernels and data preprocessing. Data preprocessing included data splitting, normalization, and data oversampling. This research contributes to addressing health problems related to Diabetes Mellitus and can serve as a source of accurate information. The results show that the highest accuracy, 80%, and the highest precision, 65%, were both obtained from the polynomial kernel, while the highest recall, 79%, and the highest F1-score, 70%, were both obtained from the RBF kernel.


I. INTRODUCTION
Diabetes Mellitus is a disease in which blood sugar levels are overly high because the body cannot use insulin properly. Diabetes Mellitus has become a severe health problem in various countries, including Indonesia [1]. The International Diabetes Federation (IDF) reported that in 2021 the number of people with Diabetes Mellitus in Indonesia reached 19.5 million, up from 10.7 million in 2019. This means there was an increase of nearly 9 million cases in just two years, during the COVID-19 pandemic. With this near-doubling, Indonesia ranks fifth in the world. This upward trend is not limited to Indonesia but occurs worldwide: according to IDF data, at least 1 in 10 people, or 537 million people, live with Diabetes Mellitus. If not treated promptly, Diabetes Mellitus can be fatal, cause other diseases, and even lead to death. It is therefore essential to predict the disease so it can be detected quickly and treated immediately.
Disease prediction has been carried out in various scientific fields, one of which is computer science. The development of information and communication technology can be used to improve systems that help detect Diabetes Mellitus [2]. Data mining is part of the Knowledge Discovery in Databases (KDD) process and can classify, predict, and extract a great deal of information from large data sets [3]. Classification is an important stage in data mining; it is carried out by examining the variables of existing data groups and aims to predict the class of an object that was not previously known [4].

II. LITERATURE REVIEW
Previous research applying the K-Nearest Neighbour classification model to a diabetes patient dataset reported a highest accuracy of 39% [5]. Another study, implementing the C4.5 decision tree algorithm for diabetes prediction, produced a prediction model with a highest accuracy of 70.32% [6]. The shortcoming of these previous studies is that the prediction models' accuracy is still below 80%, so accuracy performance needs to be improved. Research [7], which compared the classification accuracy, recall, and precision of the C4.5, Random Forest, Support Vector Machine (SVM), and Naïve Bayes algorithms, found that C4.5 obtained an accuracy of 86.67%, Random Forest 83.33%, SVM 95%, and Naïve Bayes 86.67%. The algorithm with the highest accuracy is SVM. Therefore, this study applies the SVM algorithm to the classification of Diabetes Mellitus.
The SVM algorithm was chosen because it is reliable in processing large amounts of data by optimizing hyperplanes in high-dimensional space that maximizes margins between data [8]. The kernel in SVM is used to determine kernel parameters and produce the best accuracy in the classification process. Linear kernels are used when a hyperplane can easily separate classified data. At the same time, non-linear kernels are used when the data is separated using curved lines or a plane in space with high dimensions [9].
This study aims to compare the performance metrics, i.e., the accuracy, precision, recall, and F1-score values, of the SVM algorithm with various kernels and data preprocessing in the classification of Diabetes Mellitus. The SVM algorithm is evaluated to determine which kernel produces the best performance metrics. This work contributes to addressing health problems related to Diabetes Mellitus and can serve as a source of accurate information. The SVM algorithm is expected to show better performance than previous studies.

III. METHODOLOGY

A. Data Collection
The first stage in this study is the collection of the Diabetes Mellitus dataset. The dataset used is the Pima Indian diabetes dataset obtained from the UCI Machine Learning Repository. Its variables and attributes facilitate the data mining research process. The Pima Indian diabetes dataset consists of 768 records and 9 attributes. The variables and features used are shown in Table I.

X2 (Glucose): glucose/blood sugar level. A normal blood sugar level is below 120 mg/dL, while the sugar level of diabetics is above 120 mg/dL. The data range in the dataset is 0-199 mg/dL.

X3 (Blood Pressure): blood pressure in mmHg. The data range in the dataset is 0-112 mmHg.

X4 (Skin Thickness): skin fold thickness with a data range of 0-99 mm. The norm is about 12.5 mm.

X5 (Insulin): insulin level in the blood with a data range of 0-846 µU/ml.

B. Data Preprocessing

1) Data splitting: The next stage is data splitting, which separates training data from testing data. Training data is used to create models that are applied to testing data [10]; testing data cannot be used for the training process, so the model is evaluated on new data [11]. Training and testing data are determined randomly so that the proportion between categories remains balanced [12]. In this study, the data was split into 80% training data and 20% testing data.
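The 80/20 split described above can be sketched with scikit-learn's train_test_split. The split ratio and the random, proportion-preserving shuffling are from the paper; the placeholder data, random seed, and use of the `stratify` option are assumptions made here for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features/labels shaped like the Pima dataset: 768 rows, 8 attributes.
rng = np.random.default_rng(0)
X = rng.random((768, 8))
y = rng.integers(0, 2, size=768)

# 80% training / 20% testing, shuffled randomly; stratify keeps the
# class proportions balanced between the two partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # 614 training rows, 154 testing rows
```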
2) Data normalization: Normalization aims to bring the data in the dataset into the same range of values [13]. This study used the min-max and z-score normalization methods.

a) Min-Max normalization: Min-max normalization handles non-uniform data by rescaling values into the range 0-1 [14]. It was chosen because it keeps the data balanced between before and after normalization [15]. Min-max normalization is presented in (1):

x' = (x - x_min) / (x_max - x_min)  (1)

where x' is the min-max normalized value, x is the value to be normalized, x_min is the lowest value of the overall data, and x_max is the highest value of the overall data.

b) Z-Score normalization: Z-score normalization compares a value against the average of the data distribution in terms of the standard deviation [16]. It was chosen because it is a suitable method for balancing the data scale [17]. The z-score is computed with (2):

z = (x - μ) / σ  (2)

where z is the z-score value, x is the value to be normalized, μ is the average of the whole data, and σ is the standard deviation.
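Both normalization formulas can be sketched in a few lines of numpy. This is a minimal illustration of equations (1) and (2), not the paper's exact preprocessing code; the sample values are made up:

```python
import numpy as np

def min_max_normalize(x):
    """Eq. (1): x' = (x - x_min) / (x_max - x_min), rescaling each column to [0, 1]."""
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def z_score_normalize(x):
    """Eq. (2): z = (x - mean) / std, centering each column to mean 0, std 1."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Example on an insulin-like column with a wide value range.
insulin = np.array([[0.0], [100.0], [300.0], [846.0]])
print(min_max_normalize(insulin).ravel())  # values now lie in [0, 1]
print(z_score_normalize(insulin).mean())   # mean is ~0 after z-score
```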

3) SMOTE (Synthetic Minority Over-sampling Technique): The SMOTE method handles dataset class imbalance by replicating minority-class data until it is equivalent to the majority class [18]. The diabetes dataset used in this study has 268 positive-class and 500 negative-class records, so there is an imbalance between the positive and negative classes. Therefore, the SMOTE method was used in this study to balance the positive and negative classes. The SMOTE formula is given in (3):

x_new = x_i + (x̂_i - x_i) × δ  (3)

where x_new is the resulting new class data, x_i is the i-th minority-class sample, x̂_i is the nearest minority-class neighbour of x_i, and δ is a random number between 0 and 1.
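The interpolation at the heart of the SMOTE formula can be sketched in numpy. This shows only the synthesis step of equation (3) for a single neighbour; the full SMOTE algorithm also performs a k-nearest-neighbour search, and the variable names and values here are illustrative:

```python
import numpy as np

def smote_sample(x_i, x_neighbor, rng):
    """Eq. (3): x_new = x_i + (x_hat - x_i) * delta, with delta random in [0, 1].

    The synthetic point lies on the line segment between a minority-class
    sample and one of its minority-class neighbours.
    """
    delta = rng.random()
    return x_i + (x_neighbor - x_i) * delta

rng = np.random.default_rng(7)
x_i = np.array([2.0, 3.0])
x_hat = np.array([4.0, 5.0])
x_new = smote_sample(x_i, x_hat, rng)
print(x_new)  # a point between [2, 3] and [4, 5]
```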
C. Data Processing

1) Support Vector Machine (SVM): SVM is a good algorithm for data classification [19] whose principle is to find the best hyperplane that separates two data classes [20]. The best hyperplane is determined by measuring the hyperplane margin and finding its maximum: the margin is the distance between the hyperplane and the nearest point of each class, and this closest point is called the support vector [21]. SVM can be described as follows: given data x_i ∈ R^d, i = 1, …, n, consisting of n samples with d attributes and two classes y_i ∈ {+1, -1}, suppose the two classes can be perfectly separated by a d-dimensional hyperplane defined by (4):

w · x + b = 0  (4)

Samples belonging to class +1 satisfy w · x_i + b ≥ +1 (5), and samples belonging to class -1 satisfy w · x_i + b ≤ -1 (6). The maximum margin is obtained by maximizing the distance between the hyperplane and its closest point, the support vector, which equals 1/||w|| [22]. This is formulated as a Quadratic Programming (QP) problem by finding the minimum point of (7)

min (1/2)||w||²  (7)

subject to the constraint

y_i(w · x_i + b) ≥ 1  (8)

where y_i is the target class of the i-th sample, x_i is the i-th input data, w is the weight vector, and b is the relative position (bias) of the hyperplane.
2) SVM kernels: To work with high-dimensional data, a kernel can transform the input space into a feature space [23]. Kernel functions commonly used in SVM are the Linear [24], Radial Basis Function (RBF), and Polynomial kernels [25]. The parameters of the kernel functions are used in the testing process [26]. There is no definite conclusion about the best kernel; therefore, this study compares four kernel functions, namely linear, RBF, polynomial, and sigmoid.
a) Linear kernel: The linear kernel was chosen because it is the simplest kernel and is used when the data is linearly separable. It is defined in (9):

K(x_i, x_j) = x_i · x_j  (9)
b) Polynomial kernel: The polynomial kernel was chosen because it can be used when the data is not linearly separable and is suitable for classification problems on training data that has been normalized. It is defined in (10):

K(x_i, x_j) = (γ x_i · x_j + r)^d  (10)

c) Radial Basis Function (RBF) kernel: The RBF kernel is used when the data is not linearly separable; it was chosen because it performs well with suitable parameters and yields a small training error. It is defined in (11):

K(x_i, x_j) = exp(-γ ||x_i - x_j||²)  (11)

d) Sigmoid kernel: The sigmoid kernel was chosen because it resembles the two-layer perceptron model of a neural network, where it acts as the activation function for neurons. It is defined in (12):

K(x_i, x_j) = tanh(γ x_i · x_j + r)  (12)
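The four kernel functions can be written directly in numpy. This is a minimal illustration of the standard formulas; the default values of gamma, r, and d are placeholders, not the paper's tuned parameters:

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y                                   # linear: dot product

def polynomial_kernel(x, y, gamma=1.0, r=0.0, d=2):
    return (gamma * (x @ y) + r) ** d              # polynomial of degree d

def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))   # RBF: squared distance

def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    return np.tanh(gamma * (x @ y) + r)            # sigmoid: tanh activation

x = np.array([1.0, 2.0])
y = np.array([0.5, 1.0])
print(linear_kernel(x, y))        # 2.5
print(rbf_kernel(x, y, gamma=1))  # exp(-1.25)
```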

3) Evaluation:
In evaluating the performance of each SVM kernel, we use several performance metrics: accuracy, precision, recall, and F1-score. Before computing the performance metric scores, we build a confusion matrix, an evaluation method that compares the predicted classification with the actual classification [27]. A confusion matrix contains four values: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). From these values, the accuracy, precision, recall, and F1-score can be computed.
Accuracy is the ratio of correctly predicted values over all data [28]:

Accuracy = (TP + TN) / (TP + TN + FP + FN)  (13)

Precision is the number of correctly classified positive predictions divided by all data classified as positive [28]:

Precision = TP / (TP + FP)  (14)

Recall compares the correctly predicted positive values with all actual positive values [29]:

Recall = TP / (TP + FN)  (15)

The F1-score is the harmonic mean of precision and recall [29]:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)  (16)
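The four metrics follow directly from the confusion matrix counts. The sketch below uses hypothetical counts for illustration, not the paper's experimental results:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for a 154-record test set.
acc, prec, rec, f1 = confusion_metrics(tp=40, tn=80, fp=20, fn=14)
print(round(acc, 2), round(prec, 2), round(rec, 2), round(f1, 2))
```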

IV. RESULTS AND DISCUSSIONS
This section presents the research results and their discussion.

A. Data Preprocessing
The dataset used is the Pima Indian diabetes dataset, which consists of 768 records and nine attributes. The initial stage of this study is the collection and processing of the dataset. Data preprocessing was divided into three steps. The first step is data splitting, where the Diabetes Mellitus dataset is divided into training data and testing data. The second step is data normalization, to bring the data into the same range. The third step is oversampling, to balance the dataset classes using the SMOTE method. Data processing in this study uses the Python programming language in the Google Colab environment.

1) Data splitting results:
After obtaining the dataset, the next step is to divide it into training data and testing data. The Diabetes Mellitus dataset totals 768 records consisting of eight variables and one target/class. The dataset is divided into 80% training data, totaling 614 records, and 20% testing data, totaling 154 records. The diagnosis of Diabetes Mellitus is divided into two classes: non-diabetics, denoted by 0, and diabetics, denoted by 1. The dataset contains 268 diabetic and 500 non-diabetic records.
2) Data normalization results: The normalization methods used are min-max and z-score. Fig. 1 compares two variables in the dataset, pregnancies and insulin, whose values have a fairly wide range. For example, the insulin variable ranges from 0 to over 200, which is considered unbalanced. The min-max normalization method rescales values into the range 0-1. Fig. 2 shows the results after min-max normalization, where the range of the insulin variable becomes smaller, from 0 to 1. In addition to min-max, data normalization is also carried out using the z-score method, which uses the mean and standard deviation of each attribute's values. Fig. 3 shows the results after z-score normalization.

3) Oversampling results:
In the dataset there is a difference between the number of positive- and negative-class records, so class balancing is needed. Class balancing is done by oversampling with the SMOTE method and is applied to the training data only. Oversampling is carried out after data splitting so that replicated records do not appear in both the training and testing data [30]. As can be seen in Fig. 4, before oversampling there were 221 positive-class and 393 negative-class records. After oversampling, the positive and negative classes both number 393 records, so the classes become balanced.

B. Data Processing and Evaluation
This study compared the performance of the SVM algorithm's kernels for the classification of Diabetes Mellitus: the linear, polynomial, RBF, and sigmoid kernels. Evaluation is carried out using the confusion matrix method to calculate the accuracy, precision, recall, and F1-score while optimizing the best parameters for each kernel. Each SVM kernel has specific parameters; the cost parameter (C) is used by all kernels. The gamma (γ) parameter determines the degree of proximity between two points, making it easier to find hyperplanes consistent with the data; it is used by the polynomial, RBF, and sigmoid kernels. The degree (d) parameter maps data from the input space to a higher-dimensional feature space; only the polynomial kernel uses this parameter [31]. The best parameters for each kernel are determined by trial and error. Table II evaluates the classification models of the various SVM kernels before any data preprocessing; all parameter values in each kernel use the default (auto) parameters from Python. The highest accuracy, 77%, is obtained from the polynomial and RBF kernels. The highest precision, 69%, is obtained from the RBF kernel; the highest recall, 57%, from the linear kernel; and the highest F1-score, 61%, from the linear and polynomial kernels. Table III evaluates the classification models of the various SVM kernels after preprocessing with min-max normalization and SMOTE oversampling, while Table IV evaluates them after preprocessing with z-score normalization and SMOTE oversampling. Tables III and IV show that the highest accuracy, 80%, is obtained with z-score normalization and SMOTE oversampling using the polynomial kernel.
The polynomial kernel's parameter values C=1, γ=0.1, d=1.5 were obtained through trial and error to produce margin optimization values that maximize the hyperplane by mapping the data into higher dimensions. The highest precision, 65%, is also obtained from the polynomial kernel, showing that the higher the accuracy, the higher the precision. The highest recall, 79%, is obtained from the RBF kernel, shown in Table III; the RBF kernel uses the parameter values C=2.5, γ=1.5. The highest F1-score, 70%, is also obtained from the RBF kernel, shown in Table III. These values of the parameters C, γ, and d are the most optimal for obtaining the maximum accuracy; if they are increased or decreased, the accuracy decreases.
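The trial-and-error parameter search can be sketched as a small grid search over C, gamma, and degree. The placeholder data and grid values are illustrative, not the paper's dataset or exact search space; scikit-learn's SVC is one possible implementation, and note that its `degree` parameter accepts integers only, so the paper's d=1.5 cannot be reproduced verbatim with it:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Placeholder data standing in for the preprocessed Pima dataset.
rng = np.random.default_rng(0)
X = rng.random((200, 8))
y = (X[:, 1] + X[:, 4] > 1.0).astype(int)  # synthetic, roughly learnable target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative grid around the C and gamma magnitudes reported in the paper.
grid = GridSearchCV(
    SVC(kernel="poly"),
    param_grid={"C": [1, 2.5], "gamma": [0.1, 1.5], "degree": [1, 2]},
    cv=3,
)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.score(X_test, y_test), 2))
```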

V. CONCLUSION AND FUTURE WORKS
This research achieved a highest accuracy of up to 80%, obtained from the polynomial kernel, so the shortcoming of previous research has been resolved in this study. Optimizing the choice of kernel in the SVM algorithm is proven to maximize performance. Hence, it can be concluded that the SVM algorithm performs better in classifying Diabetes Mellitus.
This study found that the SVM kernel producing the highest accuracy is the polynomial kernel. The accuracy results produced in this study can be used as an accurate and beneficial recommendation for addressing health problems related to Diabetes Mellitus.
Further research can use other datasets containing more data, and different novel kernels may be designed to obtain better accuracy. In addition, the results of this study can be used to build web-based or mobile applications to detect Diabetes Mellitus.