Preprocessing Handling to Enhance Detection of Type 2 Diabetes Mellitus based on Random Forest

Diabetes is a non-communicable disease, a category that accounts for about 70% of deaths worldwide. The majority of diabetes cases, 90-95%, are type 2 diabetes, which is caused by an unhealthy lifestyle. Type 2 diabetes can be detected earlier through an examination that covers diabetes-related parameters. However, such datasets do not always contain complete information, the distribution between positive and negative classes is mostly imbalanced, and some parameters have low importance to the decision class. To overcome these problems, preprocessing is needed to improve detection precision and recall. This paper proposes a dataset preprocessing approach and applies it to diabetes prediction. The approach consists of four processes: missing value handling, imbalanced data handling, feature importance, and data augmentation. It uses the median for missing values, random oversampling for imbalanced data, the Gini score in the random forest for feature importance, and the posterior distribution for data augmentation. Random forest and logistic regression are used as classification algorithms. The experimental results show that applying the proposed method with random forest increased precision by 20% and recall by 24% compared to random forest without the proposed method.

Keywords: Diabetes mellitus; data preprocessing; data augmentation; random forest; classification


I. INTRODUCTION
According to 2016 WHO data, non-communicable diseases such as diabetes cause about 70% of total deaths in the world, and 90-95% of diabetes cases are type 2 diabetes, which is largely preventable because it is caused by an unhealthy lifestyle [1]. Diabetes mellitus is a chronic metabolic disorder that occurs when the pancreas does not produce enough insulin or the body cannot use insulin effectively [2]. In Indonesia, according to Basic Health Research (RisKesDas) in 2018 [2], the share of people with diabetes rose gradually from 2013 to 2018, reaching 6.9% of the Indonesian population. Of those with diabetes, 69.6% were undiagnosed and 30.4% were diagnosed. In 2013, by comparison, 5.7% of the population was diabetic, of whom 73.7% were undiagnosed and 26.3% were diagnosed. Diabetes mellitus is a dangerous disease because it can lead to various complications, such as heart disease, kidney failure, stroke, and even paralysis and death [2].
The prevalence of diabetes mellitus (DM), based on a doctor's diagnosis in the population aged ≥ 15 years, increased to 2% according to the Basic Health Research (RisKesDas) 2018 report [2]. The largest groups of DM sufferers are in the age ranges of 55-64 years and 65-74 years [2]. In 2018, the percentage of DM sufferers was higher among females (1.8%) than among males (1.2%) [2]. By domicile, the percentage was higher in urban areas (1.9%) than in rural areas (1.0%) [2]. The number of DM cases in Indonesia is estimated to reach 21.3 million in 2030 [2]. The RisKesDas diabetes data [2] suggest that undiagnosed patients could be detected earlier. Diabetes detection can be performed by a doctor based on blood sugar and insulin levels, or conducted automatically based on individual medical checkup data.
The contributions of this paper address shortcomings of previous research on diabetes prediction. The study in [8] discusses missing value handling using the median in general, and feature selection using an importance index and a permutation importance index. The study in [10] discusses the problem of imbalanced data using general random oversampling. The study in [23] discusses data augmentation techniques for the problem of imbalanced data using a Gaussian distribution.
This paper makes four contributions. First, outlier values are replaced using the median computed for every six rows. Second, imbalanced data are handled with an oversampling technique, namely Random Oversampling, combining three imbalanced features. Third, features are selected using the feature importance technique in the random forest model with the Gini index value. Fourth, data are augmented using the posterior distribution technique, where the latent data (Z) are the Karya Medika data. A comparison of the contributions of this study with several other studies can be seen in Table I. This paper aims to improve the precision and recall outcomes in diabetes prediction using data preprocessing.

II. LITERATURE REVIEW
In previous studies, the classification and prediction of DM with the Pima Indian data have been carried out using several machine learning methods. However, only a few studies discussed preprocessing of the Pima Indian dataset. The problem of missing values is discussed in a limited number of papers [8,13,14,15,17]. The problems of imbalanced data [10,11,17] and of feature selection [5,9,10,14] have been discussed too. Several methods have been used in data preprocessing. For missing values, these include the median [8], Interquartile Range [13,14], mean [15], and Naive Bayes [17]. For imbalanced data, there are Synthetic Minority Over-sampling [10,11], Random Oversampling [10], and Adaptive Synthetic Sampling [17]. For feature selection, there are Principal Component Analysis [5,9], Maximum Relevance and Minimum Redundancy [5], Fisher Discriminant Ratio [9], Analysis of Variance [9], Information Gain [10], and forward-backward [14] models.
According to several prior studies on diabetes prediction, important factors that contribute to classification accuracy are imbalanced data, the presence or absence of missing values, and features that affect the results [4,7,11,13-17,19-22]. In addition, [23] explains that data augmentation can improve the accuracy of diabetes prediction. Data augmentation is an algorithm used to augment the observed data Y with a quantity Z, referred to as latent data [24]. In the Pima Indian dataset, imbalanced data occurs in the class label. Imbalanced data is a problem related to the performance of learning algorithms when some classes are underrepresented and the skew of the class distribution is severe [25]. The missing value problem concerns null values in a variable, which must be replaced [9]. The maximum acceptable proportion of missing values is reported variously as 5-10% and as 50% [26]. Feature selection is an important problem in machine learning since it identifies the most informative features [9].

III. METHODS OF RESEARCH
To improve the precision and recall outcomes in DM prediction analysis, this research proposes data preprocessing for the binary classification of type 2 DM. Fig. 1 shows the proposed system diagram, whilst Fig. 2 shows the proposed system in more detail.

A. Dataset
This study used two different diabetes datasets, namely Pima Indian and Karya Medika. Kumar et al. provide a description of the Pima Indian dataset [14].
For data augmentation, other data with the same characteristics as the Pima Indian data were used. This research used DM datasets from Karya Medika collected from January to April 2020. This dataset was taken from individual samples of Indonesians from the Slawi region, Central Java, with a sample size of 630 and nine features including the class label. The Karya Medika dataset also has preprocessing problems. Table II shows the Karya Medika dataset, where the body mass index (BMI) value can be obtained using formula (1). The BMI formula was used during the data augmentation process to produce a new feature called BMI.
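Formula (1) is not reproduced in this text. Assuming it is the standard definition of body mass index, with weight in kilograms and height in metres (both units are assumptions, since the Karya Medika feature units are not stated here), it would read:

```latex
\mathrm{BMI} = \frac{\text{weight (kg)}}{\left(\text{height (m)}\right)^{2}}
```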
Table III presents a baseline of the two datasets, which is used as a comparison. The characteristics shared by the two datasets are glucose level, diastolic blood pressure, BMI, age, and class type.

B. Outliers Identification and Replacement
In this process, each dataset is examined to identify whether any feature contains a null value, represented as NaN, {}, or 0. After these outliers have been determined, the replacement value is calculated. The null value is then replaced using a statistical method or a machine learning model, a process referred to as missing value imputation. In this study, the imputation process uses the median. The median is chosen because it takes only the middle value of the data and is not affected by the other values. Because the imputation is carried out every six rows, the median is computed over an even number of data points, that is, as the mean of the two middle values.
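The imputation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes "every six rows" means consecutive blocks of six rows, that the block median is taken over the non-missing values in the block, and that a block with no observed values falls back to the median of the whole column. The function names and the sample glucose values are hypothetical.

```python
from statistics import median


def is_missing(value):
    """Treat None, NaN, and 0 as missing, following the placeholders named in the text."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is the only float not equal to itself
        return True
    return value == 0


def impute_by_block_median(column, block_size=6):
    """Replace missing entries with the median of their block of `block_size` rows.

    Assumption: blocks are consecutive, and the median uses only the
    observed (non-missing) values of the block.
    """
    observed_all = [v for v in column if not is_missing(v)]
    fallback = median(observed_all) if observed_all else None
    result = list(column)
    for start in range(0, len(column), block_size):
        block = column[start:start + block_size]
        observed = [v for v in block if not is_missing(v)]
        fill = median(observed) if observed else fallback
        for offset, value in enumerate(block):
            if is_missing(value):
                result[start + offset] = fill
    return result


# Hypothetical glucose readings; 0 marks a missing measurement.
glucose = [148, 85, 0, 89, 137, 116, 78, 0, 197, 125, 110, 168]
imputed = impute_by_block_median(glucose)
print(imputed)
```

A zero is a legitimate value for some features, so which columns are passed to such a routine is a separate decision.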

C. Data Balancing
In this step, this research uses the random oversampling (ROS) method, which oversamples the minority class to increase its share of the data. ROS was chosen because the datasets used are imbalanced, with an underrepresented minority class, which makes oversampling a suitable method. Applying a re-sampling strategy during preprocessing to obtain a more balanced data distribution is an effective solution to the imbalance problem [27]. The ROS method randomly duplicates samples from the minority class and adds them to the training dataset [27]. The process also considers the imbalanced ratio, which is computed from the sizes of the two classes using formula (2) [12]:

$$\text{Imbalanced Ratio} = \frac{\text{instance minority}}{\text{instance majority}} \tag{2}$$

where instance minority is the number of samples in the smaller class and instance majority is the number of samples in the larger class. Following [12], the imbalanced ratio is therefore the minority count divided by the majority count. Table VI shows the imbalanced ratio in the two datasets of this study. The imbalanced ratio ranges from 0 to 1; a value close to 1 means that the classes are nearly balanced.
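A minimal sketch of the imbalanced ratio and of random oversampling is given below, using only the Python standard library. The function names, the seed parameter, and the toy data are illustrative assumptions; the sketch duplicates minority rows until the classes are equal in size and does not model the combination of three imbalanced features mentioned earlier.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Minority class count divided by majority class count (between 0 and 1)."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match the majority size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # sampling with replacement
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data: six negative samples (0) and three positive samples (1).
rows = list(range(9))
labels = [0] * 6 + [1] * 3
print(imbalance_ratio(labels))

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels), imbalance_ratio(balanced_labels))
```

Because the duplicated rows are exact copies, oversampling is normally applied to the training portion only, so that copies of a sample do not also appear in the test data.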

D. Feature Selection
Three feature selection techniques are commonly used: univariate selection, feature importance, and correlation matrix with heat maps [28]. This research applies the feature importance technique to solve predictive analysis problems [29]. This technique assigns each feature a score that indicates whether its association with the class label is high or low.
In this paper, the feature importance technique uses the random forest model. Therefore, the calculation uses the Gini function of the random forest model, as shown in equation (3):

$$\text{Gini} = 1 - \sum_{i=1}^{c} P_i^{2} \tag{3}$$

where c is the number of values in the target attribute (the number of classification classes) and P_i is the proportion of samples belonging to class i. Here c is two, namely diabetes and no diabetes.
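As an illustration of the Gini function in equation (3), the sketch below computes the Gini impurity of a set of class labels and the impurity decrease produced by one candidate split. In a random forest, Gini-based feature importance is typically obtained by accumulating such decreases over all splits that use a feature; that aggregation is not shown, and the function names and toy labels are assumptions made for this example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted


# Toy node: 3 diabetic (1) and 5 non-diabetic (0) samples, split into two children.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]
print(gini_impurity(parent))
print(gini_decrease(parent, left, right))
```

A split on an informative feature yields purer children and therefore a larger decrease, which is why features used in such splits receive higher importance scores.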

E. Data Augmentation
In addition to oversampling for the class balance problem, this study proposes data augmentation. Data augmentation addresses the lack of varied samples in the Pima Indian dataset by adding data from the Karya Medika dataset. The data augmentation process provides a way to improve inference based on the posterior distribution [24], shown in formula (4):

$$P(\vartheta \mid Y) = \int P(\vartheta \mid Y, Z)\, P(Z \mid Y)\, dZ \tag{4}$$

where P(ϑ|Y) denotes the posterior density of the parameter ϑ given the observed Pima Indian data Y, P(Z|Y) denotes the predictive density of the Karya Medika data Z given the Pima Indian data, and P(ϑ|Y,Z) denotes the conditional density of ϑ given the augmented data X=(Y,Z), namely the augmented posterior [24]. This study augments the Pima Indian data with the Karya Medika data to produce augmented data (X) containing the features that the two datasets have in common. To quantify the change from the original data to the augmented data, this study calculates the relative difference using equation (5):

$$RD = \frac{RA - RO}{RO} \times 100\% \tag{5}$$

where Result Augmentation (RA) is the result after the Pima Indian dataset has been augmented with the Karya Medika dataset, and Result Original (RO) is the result before the dataset augmentation process is carried out.
Relative Difference (RD) is a measure that shows the percentage increase when the enlarged dataset is used compared to the original data [23]. All instances (100%) of the Karya Medika dataset are used for augmentation.
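The relative difference can be computed with a few lines of code. The sketch below assumes that equation (5) is the usual percentage change of the augmented result relative to the original result; the function name and the example scores are hypothetical and are not results from this study.

```python
def relative_difference(result_augmented, result_original):
    """Percentage change of the augmented result (RA) relative to the original result (RO)."""
    if result_original == 0:
        raise ValueError("result_original must be non-zero")
    return (result_augmented - result_original) / result_original * 100.0


# Hypothetical scores: recall of 0.60 on the original data and 0.75 after augmentation.
rd = relative_difference(0.75, 0.60)
print(f"{rd:.1f}%")
```

A positive value indicates that the metric improved after augmentation, and a negative value indicates that it declined.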

F. Classification
In this process, the data are classified using supervised machine learning methods, namely random forest (RF) and logistic regression (LR), and evaluated by precision and recall. The dataset is first split into training and test sets with a ratio of 75:25. Both models have been widely applied with success in various disciplines for classification and regression purposes [30]. The random forest uses entropy as its splitting criterion, as shown in equation (6), where c and P_i are as described above:

$$\text{Entropy} = -\sum_{i=1}^{c} P_i \log_2 P_i \tag{6}$$
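The 75:25 split and the evaluation metrics can be sketched without any external library, as below. The shuffling, the seed, the function names, and the toy predictions are assumptions for illustration; the classifier itself is omitted, and the diabetic class is taken to be label 1.

```python
import random


def split_75_25(rows, labels, seed=0):
    """Shuffle the sample indices and split them 75:25 into training and test sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * 0.75)
    train_idx, test_idx = indices[:cut], indices[cut:]
    train = ([rows[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([rows[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels and predictions (not results from this study).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(precision_recall_f1(y_true, y_pred))

(train_rows, train_labels), (test_rows, test_labels) = split_75_25(list(range(8)), y_true)
print(len(train_rows), len(test_rows))
```

Precision is the share of predicted diabetic cases that are truly diabetic, while recall is the share of truly diabetic cases that the classifier finds; the F1 score is their harmonic mean.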

IV. RESULT AND DISCUSSION
This section discusses and analyzes the results of the proposed method. Three experiments were conducted separately. In each, the preprocessing algorithm was applied and classification was then conducted: first on the Pima Indian dataset, second on the Karya Medika dataset, and third on the augmented dataset.
As shown in Table VII, on the Pima Indian dataset the proposed preprocessing improved the results over the original data for both the RF and LR classification methods. On the Karya Medika dataset, the proposed preprocessing improved the results with the RF classification method but not consistently with LR: with the oversampling process of three features, LR experienced a decrease in precision of 7% and in F1 score of 1% compared to the original data. Overall, with the proposed preprocessing method, RF classification is superior to LR. This is because LR classification performs better when the number of noise variables is less than or equal to the number of explanatory variables. Therefore, to improve the LR classification results, the importance of each variable used must be taken into account.

The augmented dataset results show that the Karya Medika data improved the DM prediction results on the Pima Indian dataset. However, the Pima Indian data did not improve the DM prediction results on the Karya Medika dataset. The F1 score showed that the results for the minority class increased after the imbalanced data method was applied. The precision and recall results show the importance of preprocessing the dataset in advance to improve the prediction of diabetes mellitus.
The most important preprocessing steps for improving diabetes detection results are missing value handling and class balancing. Missing values are a built-in problem of the data: entries recorded as 0, NaN, or {} must be replaced with an estimated value, and if they are not handled, an error occurs during classification. Class balancing, in turn, has a great influence on the diabetes detection results because the diabetic class tends to have fewer samples than the non-diabetic class. This is evident from the significant increase in the classification results after class balancing.

V. CONCLUSION
Based on the results of the implementation and analysis, it can be concluded that the proposed preprocessing improves the precision and recall of the random forest classification model. The results indicate that classification using a random forest is superior to logistic regression. The proposed preprocessing method can also be applied to augmented data built from two different datasets, provided the data characteristics are taken into account. The most important preprocessing steps for improving diabetes detection results are missing value handling and class balancing. Data augmentation can also improve the precision and recall obtained on each original dataset. This study found that the data quality of the Karya Medika dataset is better than that of the Pima Indian dataset.
Further work should add other parameters to the data, such as insulin levels, history of diseases suffered, family history of diabetes, and other parameters related to diabetes. In addition, further studies can use other medical data, such as patient data on cancer, heart disease, stroke, and others, or use other combinations of machine learning models in the preprocessing or classification process.