Prediction of Diabetic Obese Patients using Fuzzy KNN Classifier based on Expectation Maximization, PCA, and SMOTE Algorithms

Diabetes is a long-term disease. Inadequate control of blood sugar levels in diabetic patients can lead to serious complications such as kidney and heart disease. Obesity is widely regarded as a major risk factor for type 2 diabetes. In this research, a model is proposed to predict diabetic obese patients. It uses the Expectation Maximization, PCA, and SMOTE algorithms in the preprocessing and feature extraction phases and a Fuzzy KNN classifier in the prediction phase. The model was applied to a real dataset, and the accuracy of the prediction results reflects the positive effect of the preprocessing techniques. The accuracy of the proposed model is 95.97%, which outperforms another model applied to the same dataset.
Keywords—KNN classifier; SMOTE; PCA; diabetic obese patients


I. INTRODUCTION
Data Mining (DM) and Machine Learning (ML) techniques are a common way of making use of the large amounts of knowledge-based data now available. Machine learning is essential in the realm of medical diagnostics, and data mining is an important goal in science and medical research, which generate massive amounts of data owing to the societal effect of serious diseases. As a result, machine learning and data mining approaches are of great importance for clinical administration, diagnosis, and treatment. As part of this work, we examined the recent literature on ML and DM methodologies for many diseases, especially chronic diabetes. Diagnosis in the healthcare sector is an ideal subject for ML algorithms [1]. Many diseases may be identified using pattern recognition on large amounts of data. To be useful in the field, an algorithm for medical diagnostics should be trainable on a small number of tests and must tolerate noisy and incomplete datasets. Much research on machine learning in healthcare has been undertaken, and healthcare ML has become a top goal for many academics. Different data mining approaches for hidden pattern recognition can be used to gain insights. The primary role of medical science is to prevent or help treat diseases. Diabetes mellitus is a chronic illness marked by hyperglycemia, and it can lead to a slew of complications [2]. Mortality rates have been higher in recent years, and according to World Health Organization (WHO) forecasts, the world's diabetic population is anticipated to reach 642 million by 2040 [3], suggesting that one out of every ten people will suffer from diabetes in the future. There are three types of diabetes [4]: gestational diabetes, Type-1 diabetes mellitus, and Type-2 diabetes mellitus.
Type-2 diabetes mellitus patients are frequently diagnosed with fatty liver disease, which may be either nonalcoholic or alcoholic (NAFLD or AFLD) [5]. Type-2 diabetes mellitus has been postulated as a primary cause of the development of NAFLD, or nonalcoholic steatohepatitis, which likely reflects the rapid weight gain and insulin resistance seen in Type-2 diabetes mellitus. Obesity and diabetes, both multifactorial and complex illnesses, have become major public health issues across the world [6]. Many of these conditions, on the other hand, may be prevented. Obesity is a significant and growing health concern; some refer to it as the New World Syndrome [7]. The co-occurrence of obesity and fatty liver in persons with Type-2 diabetes has long been documented, as the two are strongly associated [8]. It is often viewed as an incidental finding with little to no therapeutic value. Sedentary lifestyles or poor dietary habits result in weight gain, which may also increase the chance of developing metabolic syndrome over time. Early detection is the starting point for a healthy life and for avoiding serious health consequences, which motivates the proposed method for predicting patients who suffer from diabetes and are affected by obesity and NAFLD. Diabetes mellitus and its consequences, in particular, must be prevented and managed in low- and middle-income countries. The rest of this paper is arranged as follows. Section II outlines the related work. Section III details the proposed model as well as the dataset used. Section IV presents the obtained results, followed by the conclusion and future work in Section V.

II. RELATED WORK
In [9], Kumar reviewed various data mining approaches in the medical sector to highlight data mining applications based on the nature of the information. To predict Parkinson's disease, Support Vector Machines and Artificial Neural Networks (ANN) were used, resulting in 95 percent accuracy. In addition, employing an ANN to diagnose breast cancer improved the detection rate to 98.8 percent. Basma Boukenze et al. in [10] assessed the performance of DM techniques in the medical health sector using multiple learning techniques.
233 | P a g e www.ijacsa.thesai.org
The simulation results indicated that the decision tree (DT) performed better than the other learning techniques in forecasting chronic kidney failure. Furthermore, M. Abdullah and S. Al-Asmari in [11] applied the same DM approaches to determine the type of anemia that patients suffer from; DT achieved an accuracy of 93.75 percent. In [12], Kumari and Chitra used only the support vector machine algorithm, implemented in Matlab version 2010a, to classify diabetes, identifying diabetic patients with 78 percent accuracy. Developing DT and DM classification approaches assists medical practitioners in making better medical judgments and detecting diseases in a timely manner [13]. El-Halees and Shurrab in [14] developed a model that discriminates between individuals with blood tumors and individuals with normal blood using multiple association rules and ANN, with 79.45 percent accuracy. In addition, various studies have been conducted to predict diabetes in different circumstances. The authors of [15] used a regression-based DM approach to introduce predictive analysis of diabetes therapy. Oracle Data Miner was used as the mining software to forecast diabetic treatment methods. For the experimental investigation, the support vector machine technique was applied.
They conclude that pharmacological therapy for patients under the age of 18 can be postponed to minimize negative effects. The authors of [16] categorized diabetes mellitus risk. First, four classification models were investigated: DT, Logistic Regression, ANN, and Naive Bayes. Then, to improve the robustness of these models, Bagging and Boosting strategies were examined. According to the findings, the Random Forest (RF) algorithm performs best in disease risk classification. The authors of [17] suggested an early diabetes prediction model and discovered a high correlation between diabetes, glucose level, and body mass index (BMI), which was extracted using the Apriori technique. Diabetes was predicted using RF, ANN, and K-means approaches. The ANN approach achieved the highest accuracy, 75.7 percent. For the prediction of diabetes, the authors of [18] employed KNN and the Naïve Bayes approach. Their method was implemented as an expert software program in which users submit patient data as input and the program determines whether or not the patient is diabetic. The authors of [19] propose an attribute selection technique based on firefly and cuckoo search for the PIMA Indian diabetes database from the University of California Irvine (UCI), with the goal of greater accuracy and lower training overhead. They also state that the proposed structure promises to be more accurate than the usual technique. The authors of [20] applied an ML model to forecast the occurrence of Type-2 diabetes mellitus, using information from the present year ( ). For this investigation, electronic health records were collected at a private medical facility from 2013 to 2018. Key characteristics for the prediction model were initially selected using chi-squared tests, ANOVA tests, and recursive variable reduction approaches.
Based on these variables, they used random forest, logistic regression, XGBoost, SVM, and ensemble ML methods to predict the outcome as diabetic, non-diabetic, or pre-diabetic. The model performed well in anticipating the occurrence of Type-2 diabetes in the Korean population. The authors of [21] applied two machine learning techniques, SVM and ANN, in a two-phase classification to predict diabetes mellitus. They used a real dataset from Al-Kasr Al-Aini Hospital in Egypt. In the first phase, they used SVM to predict patients with fatty liver disease with an accuracy of 95%. In the second phase, they used ANN to predict diabetic patients based on the output of phase 1 and another 8 attributes.
III. PROPOSED SOLUTION AND DATASET
The dataset for this problem was collected manually, as described in Section III-E, and it had many issues such as missing values and class imbalance, so a preprocessing phase was applied to the dataset. The algorithms used in the proposed model are described in this section.

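A. Expectation Maximization Algorithm for Estimating the Missing Values
Dempster et al. [22] demonstrated in 1977 that the Expectation Maximization (EM) algorithm can be applied when the joint distribution of the missing data $Y$ and the observed data $X$ is explicit. For all $\theta \in \Theta$, let the probability density function of $(X, Y)$ be $f(X, Y; \theta)$. The EM algorithm aims to find the estimate of $\theta$ that maximizes the observed-data log-likelihood:
$$\ell(\theta; X) = \log f(X; \theta) = \log \int f(X, y; \theta)\, dy \tag{1}$$
In general, because this quantity cannot be maximized explicitly, the EM method calculates the MLE by iteratively maximizing the expected complete-data log-likelihood in (2):
$$\ell_C(\theta; X, Y) = \log f(X, Y; \theta) \tag{2}$$
Begin with an initial value $\theta^{(0)}$ and let $\theta^{(t)}$ be the estimate of $\theta$ at the $t$-th iteration. The next EM iteration then consists of two steps:
E step: Compute the expected complete-data log-likelihood given the observed data and the current estimate: $Q(\theta \mid \theta^{(t)}) = E\left[\ell_C(\theta; X, Y) \mid X; \theta^{(t)}\right]$.
M step: Define $\theta^{(t+1)}$ by maximizing the Q function: $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$.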
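To make the E and M steps concrete, the following is a minimal sketch of EM-based imputation for a single variable with missing entries. It assumes a bivariate Gaussian model for one fully observed column and one partially observed column; the paper does not state which distribution was used, so this model, the function name `em_impute`, and its parameters are illustrative assumptions only.

```python
def em_impute(xs, ys, iters=50):
    """Fill None entries of ys, assuming (x, y) is bivariate normal.

    Hypothetical helper: xs is fully observed, ys may contain None.
    Assumes xs is not constant and ys has at least one observed value.
    """
    n = len(xs)
    observed = [y for y in ys if y is not None]
    # x is fully observed, so its mean and variance are fixed.
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    # Initial guess theta^(0): moments of the observed y, zero covariance.
    mu_y = sum(observed) / len(observed)
    s_yy = sum((y - mu_y) ** 2 for y in observed) / len(observed)
    s_xy = 0.0
    for _ in range(iters):
        # E step: expected y and y^2 for missing entries under theta^(t).
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy
        ey, ey2 = [], []
        for x, y in zip(xs, ys):
            if y is None:
                m = mu_y + beta * (x - mu_x)
                ey.append(m)
                ey2.append(m * m + resid_var)
            else:
                ey.append(y)
                ey2.append(y * y)
        # M step: closed-form maximizer of Q for the Gaussian model.
        mu_y = sum(ey) / n
        s_yy = sum(ey2) / n - mu_y ** 2
        s_xy = sum(x * e for x, e in zip(xs, ey)) / n - mu_x * mu_y
    beta = s_xy / s_xx
    # Replace each missing value with its conditional mean under the final estimate.
    return [y if y is not None else mu_y + beta * (x - mu_x)
            for x, y in zip(xs, ys)]


filled = em_impute([1, 2, 3, 4, 5, 6], [2, 4, 6, 8, None, 12])
print(filled)
```

In this toy input the observed pairs lie on a straight line, so the imputed entry settles on the value that the line predicts. A dataset with 30 variables would need the multivariate version of the same updates.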

B. Feature Reduction using Principal Component Analysis Algorithm
Principal Component Analysis (PCA) is a statistical feature extraction approach that transforms a set of possibly correlated observations into a set of linearly uncorrelated variables known as principal components. PCA may be used to reduce the feature dimensions [23]. Because the number of retained eigenvectors is smaller than the number of columns, the dimension of the projected output data is smaller than the dimension of the input data. The method of PCA utilized in the feature reduction step is as follows.
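As an illustration of the projection idea, here is a minimal sketch of PCA using only the Python standard library. It is not the authors' implementation: the eigenvectors are found by power iteration with deflation (a swapped-in technique chosen to avoid external libraries), and the function names are hypothetical. The sketch assumes the top `k` eigenvalues are distinct and nonzero and that the start vector is not orthogonal to the leading eigenvector.

```python
import math


def covariance_matrix(rows):
    """Return the mean-centered rows and the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return centered, cov


def top_eigenpair(cov, iters=500):
    """Approximate the largest eigenvalue and its eigenvector by power iteration."""
    d = len(cov)
    v = [float(j + 1) for j in range(d)]  # arbitrary non-uniform start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the eigenvalue estimate.
    lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    return lam, v


def pca(rows, k):
    """Project rows onto the first k principal components."""
    centered, cov = covariance_matrix(rows)
    d = len(cov)
    components = []
    for _ in range(k):
        lam, v = top_eigenpair(cov)
        components.append(v)
        # Deflation: subtract the found direction so the next pass finds the next one.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return [[sum(c[j] * v[j] for j in range(d)) for v in components]
            for c in centered]


rows = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
print(pca(rows, 1))
```

The two input columns are perfectly correlated, so a single component carries all of the variance and the output has one column instead of two.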

C. Handling the Un-Balanced Data using SMOTE Algorithm
Chawla et al. presented the Synthetic Minority Oversampling Technique (SMOTE) in 2002 [24]. In contrast to random oversampling, the SMOTE method oversamples the minority class by producing synthetic samples rather than by sampling with replacement. The SMOTE method generates synthetic instances based on similarities between existing minority cases in feature space rather than data space [24,25]. These synthetic instances are constructed along the line segments that connect a minority sample to some or all of its K nearest neighbors (KNN) in the minority class. Neighbors are picked at random from the KNN depending on the amount of oversampling required. Algorithm 2 presents the SMOTE algorithm used for handling the unbalanced dataset.

D. Fuzzy KNN Classifier
Keller et al. introduced the fuzzy KNN classifier [26], which assigns class memberships to each sample as a function of the sample's distance from its K nearest training samples. Because it is simple and provides information on the certainty of the classification result, the fuzzy KNN classifier is a popular choice for applications. According to Keller et al., the major benefit of utilizing the FKNN model may not be the reduction in error rate. More crucially, the model provides a level of confidence that may be combined with a "refuse-to-decide" option. Objects in overlapping classes can thus be detected and treated separately, as in Algorithm 3.
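The membership computation can be sketched as follows. This is a minimal illustration in the spirit of Keller et al.'s rule, where each of the K nearest neighbors contributes its own class memberships weighted by inverse distance raised to 2/(m-1). The function name `fuzzy_knn`, the fuzzifier default `m=2.0`, and the small `eps` added to avoid division by zero are assumptions for illustration, not values taken from the paper.

```python
import math


def fuzzy_knn(train_x, train_u, query, k=3, m=2.0, eps=1e-12):
    """Return (membership vector, target class index) for one query sample.

    train_u holds one membership vector (soft label) per training sample.
    Requires m > 1; eps is a hypothetical guard for zero distances.
    """
    # Keep the k training samples closest to the query.
    neighbors = sorted(zip(train_x, train_u),
                       key=lambda pair: math.dist(pair[0], query))[:k]
    # Inverse-distance weights controlled by the fuzzifier m.
    weights = [1.0 / (math.dist(x, query) ** (2.0 / (m - 1.0)) + eps)
               for x, _ in neighbors]
    total = sum(weights)
    n_classes = len(train_u[0])
    memberships = [
        sum(w * u[c] for w, (_, u) in zip(weights, neighbors)) / total
        for c in range(n_classes)
    ]
    # The target class is the one with the maximum membership.
    target = max(range(n_classes), key=memberships.__getitem__)
    return memberships, target


train_x = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
train_u = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
memberships, target = fuzzy_knn(train_x, train_u, (0.2, 0.3), k=3)
print(memberships, target)
```

A "refuse-to-decide" rule could be layered on top by rejecting queries whose largest membership falls below a chosen threshold; the threshold itself would be a design choice that the paper does not specify.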

Algorithm 2 "SMOTE"
The input:
  X: original set of training samples
  N: percentage of oversampling
  K: number of nearest neighbors
The output: the oversampled training set
n ← # observations
m ← # attributes
nmin ← # minority observations
if N < 100 then
  Stop: warning "N should be at least 100"
end if
N ← int(N/100)
S(n * N)×m ← empty array for synthetic samples
for i ← 1 → nmin do
  Discover the KNN of sample i, then save the indexes in nn
  newindex ← 1
  while N ≠ 0 do
    Kc ← random number between 1 and K
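The interpolation step that the pseudocode leads up to can be sketched in a few lines of Python. This is a minimal, standard-library illustration of the general SMOTE idea from [24], not the authors' implementation: the function name `smote`, the `seed` parameter, and the use of Euclidean distance are assumptions. It expects at least two minority samples.

```python
import math
import random


def smote(minority, n_percent, k, seed=0):
    """Generate synthetic minority samples by interpolating toward neighbors.

    minority: list of minority-class feature vectors.
    n_percent: oversampling percentage (N in Algorithm 2), at least 100.
    k: number of nearest neighbors considered for each sample.
    """
    if n_percent < 100:
        raise ValueError("N should be at least 100")
    rng = random.Random(seed)
    per_sample = n_percent // 100  # synthetic samples created per minority sample
    synthetic = []
    for i, x in enumerate(minority):
        # Indexes of the k nearest minority neighbors of x, excluding x itself.
        others = [j for j in range(len(minority)) if j != i]
        nn = sorted(others, key=lambda j: math.dist(x, minority[j]))[:k]
        for _ in range(per_sample):
            neighbor = minority[rng.choice(nn)]
            gap = rng.random()  # random position on the segment from x to neighbor
            synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbor)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_percent=200, k=2)
print(len(new_samples))
```

Because every synthetic point lies on a segment between two existing minority samples, the new samples stay inside the region already occupied by the minority class.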

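Algorithm 3 "Fuzzy KNN"
The input: training samples with soft labels, an input sample
The output: a membership vector
• The target class is the class with the maximum membership value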

E. Dataset Description
The dataset used in this study was obtained from Cairo University, Faculty of Medicine, Al-Kasr Al-Aini Hospital [21]. The dataset contains 30 variables: Gender, Age, Alcohol consumption, Smoking, Schistosomiasis, steroids, History of hypertension, Oral contraceptive pill, Waist circumference, Body Mass Index, Hemoglobin test (HGB), Liver disease, Primed lymphocyte test, Basic Insulation Level, Aspartate Aminotransferase (AST), Alanine Aminotransferase (ALT), White blood cells (WBCs), Albumin level in blood (ALB), Protein C test, Alkaline phosphatase (ALP), Gamma-Glutamyl Transferase (GGT), Total cholesterol, Triglycerides test (TGs), High-density lipoprotein (HDL), Low-density lipoprotein (LDL), International Normalized Ratio (INR), Spleen size, Fasting blood sugar, History of diabetes, and Hemoglobin A1c (HBA1C). The dataset was preprocessed in several phases, as described below. The algorithms used in the data preprocessing phase are the expectation maximization algorithm, which estimates missing values; the PCA algorithm, which is used in the feature reduction phase; and the SMOTE algorithm, which generates new samples in the minority class to overcome the unbalanced data issue that affects the evaluation measures.
Fig. 1 shows the basic steps of the proposed model for the prediction of diabetic obese patients. First, we read the dataset and apply a data preprocessing phase to it. The first step in the data preprocessing is estimating the missing values with the EM algorithm. The next step is applying the PCA algorithm to reduce the features in the dataset. The basic steps of PCA are calculating the covariance matrix, calculating the eigenvalues, sorting the attributes in descending order, normalizing the values, and calculating the weight value for each attribute. The third step is handling the unbalanced data using the SMOTE algorithm described in Section III-C, which generates new samples in the minority class.
The last step in the proposed model is classifying the new samples using the fuzzy KNN classifier.

IV. RESULTS
A. Evaluation Method
Evaluating the performance of algorithms using the precision and recall criteria is very valuable. Precision is the proportion of the model's positive predictions that are correct, and recall is the proportion of actual positive cases that the model accurately identifies:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
where the true positives ($TP$) are the positive cases that were predicted positive, the false negatives ($FN$) are the positive cases that were predicted negative, and the false positives ($FP$) are the negative cases that were predicted positive.
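As a small worked example of these two measures, the sketch below counts the confusion-matrix cells from paired label lists and applies the standard definitions. The function name `precision_recall` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (assumed convention: report 0.0).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Three actual positives, two predicted correctly; one negative predicted positive.
precision, recall = precision_recall([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(precision, recall)
```

Here TP = 2, FP = 1, and FN = 1, so both precision and recall equal 2/3.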

B. Results
In this part, we report the findings obtained when the fuzzy KNN classifier was used with the proposed model on the dataset described above, and when the fuzzy KNN classifier was applied to the raw data. Table I shows the output of the proposed model applied to the dataset after preprocessing, compared to the same classifier without data preprocessing. Table I and Fig. 2 show that the data preprocessing steps (estimating the missing values, feature reduction, and handling the unbalanced data) improved all measurement values produced by the classifier.
The obtained results were compared to the results obtained in [21], which used the same dataset. That work proposed a two-phase classifier for predicting potential diabetic obese patients, as mentioned in the related work section. Table II shows the basic differences between the proposed model and the model in [21]. Table III shows the comparison between the results of the proposed model and the results in [21]. From Tables II and III, we can observe that the algorithms and techniques used in the proposed model to prepare the data before training and testing affected the data positively, especially the steps of estimating missing values and handling unbalanced data. The proposed classifier also achieves a promising classification accuracy compared to the results reported in [21].
V. CONCLUSION AND FUTURE WORK
In this research, a model for the prediction of diabetic obese patients was proposed. The model is based on the Expectation Maximization, PCA, and SMOTE algorithms in the data preparation and preprocessing phase, and the fuzzy KNN classifier is used in the prediction phase. The dataset used in this research was obtained from Cairo University, Faculty of Medicine, Al-Kasr Al-Aini Hospital. The algorithms used in the preprocessing improved the clarity and effectiveness of the dataset, which is reflected in the prediction phase as shown in the results. The prediction accuracy reached 95.97% in the proposed model, and this result outperforms a corresponding model applied to the same dataset, mentioned in the related work. Possible future improvements to the preprocessing phase include adopting another feature selection algorithm and other algorithms for handling imbalanced data and estimating the missing values. In addition, an ensemble model built on more than one classifier can be used to enhance the precision value.