Classification Model for Diabetes Mellitus Diagnosis based on K-Means Clustering Algorithm Optimized with Bat Algorithm

Abstract: Diabetes mellitus is a disease characterized by abnormal glucose homeostasis, resulting in elevated blood sugar. According to data from the International Diabetes Federation (IDF), Indonesia ranks 7th among the 10 countries with the highest number of diabetes mellitus patients in the world. The prevalence of diabetes mellitus in Indonesia reached 11.3 percent, or 10.7 million sufferers, in 2019. Prevention, risk analysis, and early diagnosis are necessary to reduce the impact of diabetes mellitus and its complications. Clustering is one of the methods that can be used to diagnose diabetes mellitus and analyze its risk. The K-means Clustering Algorithm is the most commonly used clustering algorithm because it is easy to implement and run, fast to compute, and easy to adapt. However, it often gets stuck at local optima. This problem can be addressed by combining the K-means Clustering Algorithm with a global optimization algorithm. Such algorithms can find the global optimum among many local optima, do not require derivatives, are robust, and are easy to implement. The Bat Algorithm (BA) is a global optimization method in the swarm intelligence class. BA automatically zooms in on promising solutions, and this zooming is accompanied by a shift from exploration to intensive local exploitation. This article therefore proposes a classification model for diagnosing diabetes mellitus based on the K-means clustering algorithm optimized with BA. The experimental results show that K-means clustering optimized by BA performs better than standard K-means clustering on all evaluation metrics, but its computational time is higher.


I. INTRODUCTION
Diabetes mellitus is a disease characterized by abnormal glucose homeostasis, resulting in elevated blood sugar. According to data from the International Diabetes Federation (IDF), Indonesia ranks 7th among the 10 countries with the highest number of diabetes mellitus patients in the world. The prevalence of diabetes mellitus in Indonesia reached 11.3 percent, or 10.7 million sufferers, in 2019 [1]. Diabetes mellitus causes various complications such as cardiovascular disease, atherosclerotic disease, peripheral neuropathy, diabetic retinopathy, severe foot infections, kidney failure, and sexual dysfunction [2,3].
Early diagnosis of diabetes mellitus is necessary to reduce the impact of the disease and its complications. Clustering algorithms have been used to diagnose diabetes mellitus and analyze its risk [4][5][6]. In general, clustering serves four purposes: data reduction, hypothesis formation, hypothesis testing, and prediction based on groups [7]. Clustering algorithms can recognize patterns in data automatically, so they can analyze collected data without labels [8].
The K-means Clustering Algorithm is the most commonly used clustering algorithm because it is easy to implement and run, fast to compute, and easy to adapt [9]. It has been used in various applications, including diagnosis of diabetes mellitus [5], segmentation of diseases in plant leaves [10], heart disease prediction and classification [11,12], and prediction of diabetes mellitus [13]. However, the method has a drawback: random centroid initialization can cause the algorithm to get stuck at a local optimum [14], which degrades the clustering result. Therefore, a robust centroid initialization is needed to obtain a good clustering result.
This problem of the K-means Clustering Algorithm can be overcome by combining it with a global optimization algorithm, e.g., a swarm intelligence algorithm. Such algorithms can find the global optimum among many local optima, do not require derivatives, are robust, and are easy to implement [15]. Anam et al. used a swarm intelligence-based algorithm (Particle Swarm Optimization) to segment disease in tomato leaves [16]. One of the swarm intelligence methods is the Bat Algorithm (BA), which has a faster convergence rate than the Genetic Algorithm and Particle Swarm Optimization [17]. This is because BA automatically zooms in on promising solutions, and this zooming is accompanied by a shift from exploration to intensive local exploitation. BA has also been used in many applications, for example the travelling salesman problem [18,19], resource scheduling [20,21], customer churn [22,23], brain tumor recognition [24,25], estimating the state of health of lithium-ion batteries [26], detection of myocardial infarction [27], and feature selection [28,29].
Based on this background, this article proposes a method for diagnosing diabetes mellitus based on the K-means clustering algorithm optimized by BA. The purpose of this research is to develop a rapid method for diagnosing diabetes mellitus using the K-means and BA algorithms. In this article, the K-means algorithm is improved with BA to overcome its tendency to get stuck at local optima. This research supports the prevention and reduction of the impact of diabetes mellitus through rapid and inexpensive diagnosis using information technology (machine learning).

II. PROPOSED METHOD
This section describes the dataset used, the research stages, the stages of the proposed method, and the evaluation tools used.

A. Data Set
The dataset used in this study was taken from the web at https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset. It originates from the Behavioral Risk Factor Surveillance System (BRFSS), a health-related telephone survey conducted annually in the United States.
The dataset consists of several predictor variables, both medical and non-medical, and one target variable (whether or not the respondent has diabetes mellitus). A description of the attributes of the dataset can be seen in Table I. Class 1 means people with diabetes mellitus or prediabetes, while class 0 means people without diabetes mellitus. This dataset is used both for training and for evaluating the diabetes mellitus prediction model.

B. Classification Model for Diabetes Mellitus Diagnosis Based on K-Means Clustering Algorithm and Bat Algorithm
This section describes the steps of the Classification Model for Diabetes Mellitus Diagnosis Based on K-Means Clustering Algorithm and Bat Algorithm, which can be seen in Fig. 1. The method has several steps: input the dataset and method parameters, preprocess the dataset, split the data (training data and testing data), build the classification model, and evaluate it.

C. Parameters Setting
Before the Classification Model for Diabetes Mellitus Diagnosis Based on the K-Means Clustering Algorithm and Bat Algorithm is used, several parameters must be set. These parameters are taken from [17].

D. Data Preprocessing
Data preprocessing is an initial step in data mining that converts raw data into a form that is more efficient and suited to the data mining model to be used. Raw data taken from various sources often contain errors, missing values, and inconsistencies, so they need to be cleaned and formatted for the data mining results to be precise and accurate. In addition, raw data need to be transformed from their original form into data that is ready to be mined. Data transformation facilitates the extraction of new knowledge from the data. One data transformation technique is normalization, the process of scaling the attribute values so that they lie in a certain range. This study uses the Min-Max Normalization Method, which applies a linear transformation to the original data so that all attributes are mapped to a common range and become comparable in scale.
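As an illustration, min-max normalization maps each attribute value x to (x - min) / (max - min), where min and max are taken over that attribute. The following is a minimal Python sketch under stated assumptions: the function name `min_max_normalize`, the handling of constant columns, and the toy records are hypothetical and are not taken from the study.

```python
def min_max_normalize(rows):
    """Scale each column of `rows` (a list of equal-length numeric lists) to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        scaled = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # Assumption: a constant column is mapped to 0.0 to avoid dividing by zero.
            scaled.append((value - low) / span if span > 0 else 0.0)
        normalized.append(scaled)
    return normalized


# Hypothetical toy records with three attributes on very different scales.
sample = [[20.0, 1, 0], [30.0, 7, 1], [40.0, 13, 1]]
normalized_sample = min_max_normalize(sample)
```

After this step every attribute lies between 0 and 1, so no single attribute dominates the distance computations used by K-means.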

E. Data Splitting
After preprocessing, the dataset is divided into two parts, 80% for training and 20% for testing. The training data are used to train the model and obtain a clustering model. The testing data are used to test and evaluate the model, simulating its use in the real world. The testing data must never be used during model training, so that the validation of the model remains unbiased.
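The 80/20 split can be sketched in plain Python as follows. This is an illustrative sketch only: the function name `train_test_split`, the random shuffling, and the fixed seed are assumptions, since the text does not state how the records were assigned to each part.

```python
import random


def train_test_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle the indices once, then cut them into testing and training parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, y_train, x_test, y_test


# Hypothetical toy data: 10 records with one feature each.
toy_rows = [[i / 10] for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
x_train, y_train, x_test, y_test = train_test_split(toy_rows, toy_labels)
```

Because the testing indices are removed before training starts, no testing record can leak into the training phase.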

F. Developing the Classification Model for Diabetes Mellitus Diagnosis Based on K-Means Clustering Algorithm and Bat Algorithm
The next step is to build the Classification Model for Diabetes Mellitus Diagnosis Based on K-Means Clustering Algorithm and Bat Algorithm. The diagnostic model is built on a clustering method that combines the K-Means Clustering Algorithm and the Bat Algorithm. First, a K-Means Clustering algorithm optimized with the Bat Algorithm is constructed. The position of a bat in the Bat Algorithm represents the cluster centers (centroids), and the function optimized by the Bat Algorithm is the objective function of K-means Clustering. The algorithm is then implemented in the Python programming language. Next, the training data, the parameters of the Bat Algorithm, and the number of clusters are given as input. Only the predictor variables of the training data are used in the clustering model training process; the response variable is used later, when the clustering model is evaluated after the training phase is complete. Algorithm 1 states the clustering method based on the K-means Clustering Algorithm and the Bat Algorithm. Algorithm 2 is used to associate cluster centers with data classes, while Algorithm 3 is used for the testing process.
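The encoding described above, in which a bat position holds the centroids and the fitness is the K-means objective, can be written compactly. The block below is a sketch of the usual formulation and is not claimed to reproduce the article's own equations; the symbols n (number of training samples), d (number of predictor variables), X_p (the p-th training sample), and the use of the squared Euclidean distance are assumptions.

```latex
% Bat i stores K centroids of dimension d as one flat vector
x_i = \left( c_{1}^{(i)},\; c_{2}^{(i)},\; \ldots,\; c_{K}^{(i)} \right) \in \mathbb{R}^{K d},
\qquad c_{j}^{(i)} \in \mathbb{R}^{d}

% K-means objective evaluated for bat i on the n training samples
f(x_i) = \sum_{p=1}^{n} \; \min_{1 \le j \le K} \left\lVert X_{p} - c_{j}^{(i)} \right\rVert^{2}
```

Under this formulation, a lower value of f means that the training samples lie closer to their nearest centroid, so the Bat Algorithm searches for the position that minimizes f.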

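Algorithm 1 Clustering Method Based on the K-means Clustering Algorithm and the Bat Algorithm
Input:
The training data ($X_{train}$) of size $n \times d$
The number of clusters ($K$)
The parameters of the Bat Algorithm
Output:
Best ($x_*$), the best solution produced
a. Initialize the bat positions $x_i$ and velocities $v_i$, ($i = 1, 2, \ldots, N$). Each $x_i$ represents a candidate for the centroids (cluster centers). For example, the matrix $C_i$ is the $i$-th centroid candidate represented in equation (1).
b. For each iteration $t$ and each bat $i$:
i. Generate a new solution by updating the frequency, velocity, and position of the bat using Equations (2) to (4).
ii. if ($rand > r_i$) then
Generate a local solution around the best solution using Equation (5), where $\varepsilon$ is a random number drawn from the Gaussian distribution $N(0, 1)$, $\langle A^t \rangle$ is the average loudness of all bats at time $t$, and $\sigma$ is the scale factor, which can be set to a constant for simplicity.
end if
iii. Evaluate the fitness using the objective function of K-means clustering stated in equation (6), where $c_j$ represents centroid $j$ of the $K$ centroids. The centroids are obtained by reshaping $x_i$ of size $1 \times Kd$ into the matrix $C_i$ of size $K \times d$.
iv. if ($rand < A_i$ and $f(x_i) < f(x_*)$) then
Update the current solution using one of the solutions from step (i) or (ii)
end if
v. Increase $r_i$ and reduce $A_i$ by using Equations (7) and (8).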
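To make the interaction between the Bat Algorithm and the K-means objective concrete, the following Python sketch implements one plausible reading of Algorithm 1 using the standard Bat Algorithm update rules. It is not the authors' implementation. The function names, the number of bats, the frequency range, `alpha`, `gamma`, `sigma`, the initial pulse rate of 0.5, the initialization from randomly chosen samples, and the choice to update loudness and pulse rate only when a candidate is accepted are all assumptions made for illustration. The defaults for the maximum number of iterations (1000) and the convergence window (100) follow the stopping conditions stated in Section III.

```python
import math
import random


def kmeans_objective(flat, data, k):
    """Sum of squared distances from each sample to its nearest centroid."""
    d = len(data[0])
    centroids = [flat[j * d:(j + 1) * d] for j in range(k)]  # reshape 1 x (k*d) -> k x d
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids)
        for row in data
    )


def bat_kmeans(data, k, n_bats=10, max_iter=1000, patience=100,
               f_min=0.0, f_max=2.0, alpha=0.9, gamma=0.9, sigma=0.1, seed=0):
    """Search for k centroids with a basic Bat Algorithm (hypothetical parameter values)."""
    rng = random.Random(seed)
    d = len(data[0])
    # Assumption: each bat starts from k randomly chosen samples, flattened into one vector.
    pos = [[float(v) for row in rng.sample(data, k) for v in row] for _ in range(n_bats)]
    vel = [[0.0] * (k * d) for _ in range(n_bats)]
    loud = [1.0] * n_bats    # loudness A_i
    rate0 = [0.5] * n_bats   # limiting pulse rate r_i^0 (assumed value)
    rate = [0.0] * n_bats    # current pulse rate r_i
    fits = [kmeans_objective(p, data, k) for p in pos]
    best_fit = min(fits)
    best = pos[fits.index(best_fit)][:]
    stall = 0
    for t in range(1, max_iter + 1):
        improved = False
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # (i) global move: frequency, then velocity, then position
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (x - b) * freq for v, x, b in zip(vel[i], pos[i], best)]
            cand = [x + v for x, v in zip(pos[i], vel[i])]
            # (ii) local random walk around the best solution
            if rng.random() > rate[i]:
                cand = [b + sigma * rng.gauss(0.0, 1.0) * mean_loud for b in best]
            # (iii) evaluate the candidate with the K-means objective
            cand_fit = kmeans_objective(cand, data, k)
            # (iv) accept only if it beats the best solution, gated by loudness
            if rng.random() < loud[i] and cand_fit < best_fit:
                pos[i] = cand
                best, best_fit, improved = cand[:], cand_fit, True
                # (v) reduce loudness and increase pulse rate (here: on acceptance only)
                loud[i] *= alpha
                rate[i] = rate0[i] * (1.0 - math.exp(-gamma * t))
        stall = 0 if improved else stall + 1
        if stall >= patience:  # no improvement of the global best for `patience` iterations
            break
    return [best[j * d:(j + 1) * d] for j in range(k)], best_fit


# Hypothetical toy data: two well-separated groups in 2-D.
toy = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
centroids, best_fit = bat_kmeans(toy, k=2)
```

The returned `centroids` list has one row per cluster, which is the form needed by the class-association and testing steps that follow.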

Algorithm 2 Association of Cluster Centers to Data Classes
Input:
The training data ($X_{train}$) of size $n \times d$
The centroids
$y_{train}$ (class label of each training sample)
Output: Accuracy, Recall, Precision, F1 Score.
1. Determine which centroid represents each target class based on the majority label in each cluster.
2. Calculate the predicted labels $y_{pred}$ based on the cluster centroids.
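The two steps of this algorithm can be sketched in Python as follows. The helper names `nearest_centroid` and `map_clusters_to_classes`, the use of squared Euclidean distance, the treatment of empty clusters, and the toy data are assumptions for illustration.

```python
from collections import Counter


def nearest_centroid(row, centroids):
    """Index of the centroid with the smallest squared Euclidean distance to `row`."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
    return dists.index(min(dists))


def map_clusters_to_classes(x_train, y_train, centroids):
    """Assign each cluster the majority class label of its training members."""
    votes = {j: Counter() for j in range(len(centroids))}
    for row, label in zip(x_train, y_train):
        votes[nearest_centroid(row, centroids)][label] += 1
    # Assumption: an empty cluster gets no class (None); callers must handle that case.
    return {j: (c.most_common(1)[0][0] if c else None) for j, c in votes.items()}


# Hypothetical toy example with two centroids in one dimension.
toy_centroids = [[0.1], [0.9]]
toy_x = [[0.0], [0.2], [0.1], [0.8], [1.0], [0.7]]
toy_y = [0, 0, 1, 1, 1, 0]
cluster_to_class = map_clusters_to_classes(toy_x, toy_y, toy_centroids)
y_pred = [cluster_to_class[nearest_centroid(row, toy_centroids)] for row in toy_x]
```

In the toy example each cluster contains one sample of the minority label, so the predicted labels differ from the true labels for exactly those two samples.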

Algorithm 3 Testing Algorithm of the Classification Model for Diabetes Mellitus Diagnosis Based on the K-Means Clustering Method and Bat Algorithm
Input:
The testing data
The centroids
$y_{testing}$ (class label of each testing sample)
Output: Accuracy, Recall, Precision, F1 Score.
1. Calculate the predicted labels $y_{pred}$ based on the cluster centroids.
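The prediction step for testing data can be sketched as below. The function name `predict`, the centroid values, and the cluster-to-class mapping are hypothetical stand-ins for the outputs of the training and class-association stages.

```python
def predict(rows, centroids, cluster_to_class):
    """Label each row with the class associated with its nearest centroid."""
    labels = []
    for row in rows:
        dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
        labels.append(cluster_to_class[dists.index(min(dists))])
    return labels


# Hypothetical centroids and mapping, standing in for the outputs of Algorithms 1 and 2.
centroids = [[0.2, 0.1], [0.8, 0.9]]
cluster_to_class = {0: 0, 1: 1}
x_test = [[0.1, 0.2], [0.9, 0.7], [0.4, 0.3]]
y_pred = predict(x_test, centroids, cluster_to_class)
```

The testing labels themselves are not used for prediction; they are only compared with `y_pred` afterwards to compute the evaluation metrics.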

G. Evaluation Metrics
The performance of the proposed method is evaluated using accuracy, recall, precision, and F1 Score, and is compared with that of the previous method, namely the standard K-means Clustering method. If the proposed method scores higher than the standard method, the optimization can be said to improve performance. The evaluation metrics used are:
1) Classification Rate / Accuracy, calculated using equation (9),
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (9)$$
where TP is the number of diabetics detected as diabetic, TN is the number of healthy persons detected as healthy, FN is the number of diabetics detected as healthy, and FP is the number of healthy persons detected as diabetic. Accuracy measures the ratio of correct predictions to the total number of instances evaluated.
2) Recall, calculated using equation (10). Recall measures the fraction of positive patterns that are correctly classified.
$$\text{Recall} = \frac{TP}{TP + FN} \qquad (10)$$
3) Precision, calculated using equation (11). Precision measures the fraction of patterns predicted as positive that are actually positive.
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (11)$$
4) F1 Score, calculated using equation (12). F1 Score is the harmonic mean of precision and recall.
$$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (12)$$
After the evaluation metrics are calculated, the experimental results are analyzed to obtain conclusions.
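The four metrics can be computed directly from the confusion-matrix counts. The sketch below assumes that class 1 (diabetes mellitus or prediabetes) is the positive class; the function name `classification_metrics`, the zero-denominator convention, and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, recall, precision, and F1 score for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(numerator, denominator):
        # Assumption: a metric with a zero denominator is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "recall": recall,
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(toy_true, toy_pred)
```

With these toy labels all four metrics equal 0.75, which is a convenient check that the counts are wired to the right formulas.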

III. RESULTS AND DISCUSSIONS
The features/variables in the dataset have different scales, which prevents the algorithm from working well. The dataset is therefore pre-processed with min-max normalization, which maps every feature to the same range, between 0 and 1. The normalized data are then divided into two parts: 80% of the data is used for training and the remaining 20% for testing. Once the data are in the form required by the model, they are given as input to the algorithm.
www.ijacsa.thesai.org
The evaluation tools used to measure the quality of each algorithm are the objective function ($f_{min}$), accuracy, precision, recall, F1 score, and the computational time. Accuracy, precision, recall, and F1 score are calculated for both training data and testing data. The best result is one in which the objective function is minimal, the accuracy, precision, recall, and F1 score are maximal, and the computation time is not too long.
The first algorithm to run is the standard K-means Algorithm. The number of clusters is set to 2, which corresponds to the number of target classes. The iterations are carried out until one of the stopping conditions is reached. The parameters of the Bat Algorithm are initialized as described in sub-section II.C. These algorithms use two stopping conditions: a maximum number of iterations and a convergence condition. The maximum number of iterations is 1000. The algorithm is considered to have converged if the global best does not improve for 100 iterations. The experiments in this study were repeated 25 times, because the Bat Algorithm and the K-means Algorithm used in the Classification Model for Diabetes Mellitus Diagnosis Based on the K-Means Clustering Algorithm and the Bat Algorithm involve random numbers in obtaining the optimum value of the objective function. The average and standard deviation of each evaluation tool are then calculated. The standard deviation measures the spread of the recall, accuracy, precision, F1 score, and objective function values, while the average measures their central tendency.
The evaluation results of the Classification Model for Diabetes Mellitus Diagnosis based on the K-means Clustering Algorithm and the Bat Algorithm are not much different between training and testing data. This shows that the proposed method performs well and that neither overfitting nor underfitting occurs.
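The repeated-run protocol described in this section (25 runs, followed by the average and standard deviation of each evaluation tool) can be sketched as follows. The function `summarize_runs`, the use of the run index as the random seed, and the choice of the sample standard deviation are assumptions; `placeholder_run` returns made-up numbers that are not experimental results.

```python
import statistics


def summarize_runs(run_once, n_runs=25):
    """Call `run_once(seed)` for seeds 0..n_runs-1 and return (mean, standard deviation)."""
    scores = [run_once(seed) for seed in range(n_runs)]
    # Assumption: sample standard deviation; use statistics.pstdev for the population form.
    return statistics.mean(scores), statistics.stdev(scores)


def placeholder_run(seed):
    # Stand-in for one full training-and-evaluation run; the value is artificial.
    return float(seed % 3)


mean_score, std_score = summarize_runs(placeholder_run)
```

In practice `run_once` would train the model with the given seed and return one metric, for example the testing accuracy, so that the same helper can summarize each evaluation tool in turn.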

IV. CONCLUSIONS
Based on the experimental results and their analysis, several conclusions were obtained. The Model for Diabetes Mellitus Diagnosis based on the K-means Clustering Algorithm and the Bat Algorithm was developed from the K-Means Clustering Algorithm by adding the Bat Algorithm as an optimizer to determine the cluster centroids. The experimental results revealed that the number of bats affects the method's convergence speed and processing time. They also show that the model is able to diagnose diabetes mellitus quite well: the accuracy is around 72.4% and the F1 score is 71.4% for training data, and the accuracy is around 71.55% and the F1 score is 71.18% for testing data. The evaluation results show that the model performs better than the standard K-means in terms of accuracy, precision, recall, and F1 score, but its computational time is considerably higher.