Diagnosis of Diabetes by Applying Data Mining Classification Techniques: Comparison of Three Data Mining Algorithms

Health care data are often huge, complex and heterogeneous because they contain different variable types as well as missing values. Extracting knowledge from such data has become a necessity. Data mining can be used to extract this knowledge by constructing models from health care data such as diabetic patient data sets. In this research, three data mining algorithms, namely Self-Organizing Map (SOM), C4.5 and RandomForest, are applied to adult population data from the Ministry of National Guard Health Affairs (MNGHA), Saudi Arabia, to predict diabetic patients using 18 risk factors. RandomForest achieved the best performance among the three classifiers.

Keywords: Diabetes; Data mining; Self-Organizing Map; Decision tree; Classification


I. INTRODUCTION
Saudi Arabia is facing a financial challenge due to the prevalence of diabetes. The Ministry of Health (MOH) in Saudi Arabia and the Institute for Health Metrics and Evaluation (IHME) collaborated on an assessment of the burden of diabetes, based on its direct cost as recorded in an integrated health information system, in 2014 [1]. Based on the established system, the current estimated cost of diabetes is 17 billion Riyals (US $4.5 billion), and it is expected to rise to 27 billion Riyals (US $7.2 billion) if undiagnosed people are documented. Moreover, if people with prediabetes develop diabetes, the cost will increase to 43 billion Riyals (US $11.43 billion). The cost includes medications, visits and lab tests, and it varies with the patient's stage. The high cost of treating diabetes, together with the expected growth in diabetes, will confront Saudi Arabia with financial and health challenges in the near future. Prevention, monitoring and control are the most effective actions to face such a health care challenge.
Data mining techniques help health care researchers extract knowledge from large and complex health data. With the evolution of information technology, data mining has become a valuable asset in diabetes research, leading to improved health care delivery, better decision support and enhanced disease management [2]. Data mining techniques include pattern recognition, clustering, classification and association.
Diabetes is one of the main topics of medical research due to the long duration of the disease and its huge cost to health care providers. Early detection of diabetes ultimately reduces the cost of treating diabetic patients [3][4][5][6][7][8], but it is a challenging task. For early detection, researchers can take advantage of patients' health care data, converting raw data into meaningful information and extracting hidden knowledge by applying data mining techniques such as decision trees or SOM to construct an intelligent predictive model. The SOM, also known as the Kohonen map, is a machine-learning tool used to analyze heterogeneous data; it supports both supervised and unsupervised learning [9][10][11]. A SOM maps high-dimensional data onto a more meaningful representation by identifying similarities. In this research article, decision trees, namely C4.5 and RandomForest, are compared with SOM to build a classification model that predicts diabetic patients using retrospective data collected from hospital database systems.
The data sets are extracted from the hospital information management system of the Ministry of National Guard Health Affairs (MNGHA), Saudi Arabia. The National Guard Health hospitals provide optimum health care to their employees, dependents, other eligible patients and private patients. The data sets are collected from four hospitals in the three largest regions in Saudi Arabia in terms of population. The hospitals are: i) King Abdulaziz Medical City (SANG) in Riyadh, Central Region; ii) King Abdulaziz Medical City in Jeddah, Western Region; iii) Imam Abdulrahman Al Faisal Hospital in Dammam, Eastern Region; and iv) King Abdulaziz Hospital in Alahsa, Eastern Region. The contribution of this study is the use of data mining techniques to construct an intelligent predictive model from real health care data, extracted from hospital information systems, using 18 risk factors.
The rest of the paper is organized as follows. The literature review is given in Section II. The methodology is presented in Section III. Results and discussion are given in Section IV. Finally, conclusions and future work are presented in Section V.

II. LITERATURE REVIEW
In the literature, SOM has been applied to health care data. Mäkinen et al. [12] used the SOM algorithm to detect associations between certain risk factors and complications. They used SOM as an unsupervised method to cluster biochemical profiles. A 7 x 10 grid of hexagonal map units with a Gaussian neighborhood function was used to present similarities and differences between variables. Tirunagari et al. [13] applied SOM to cluster heterogeneous diabetes data. They were able to reduce the dimensionality of the data and demonstrate the similarities between patients by placing them in groups using the U-matrix. As a result, the profiles of patients who need self-care management were clearly grouped and easily identified.
In another study, Tirunagari [14] used SOM to recognize self-care behavior based on survey data collected from Type I diabetic patients. The visualization improved the understanding of patterns in various behaviors and helped detect patients who need to adjust their lifestyle. Zarkogianni et al. [15] proposed a personalized hybrid model combining Compartmental Models (CMs) and a Self-Organizing Map. The model helped patients with Type I Diabetes Mellitus to predict their metabolic behavior. Luboschik et al. [5] used SOM as part of an early detection system to predict neuropathy complications in diabetic patients. Using the computational and visual methods of SOM, they were able to identify characteristics of patients with diabetic neuropathy.
Other data mining algorithms have been applied to classify diabetic patients. Farran et al. [16] used non-laboratory attributes to classify diabetes by applying four data mining models: logistic regression, k-nearest neighbors (k-NN), multifactor dimensionality reduction and Support Vector Machines (SVM). They achieved an accuracy of 85% for diabetic patients. Barakat et al. [17] applied SVM to data collected from a national survey in the Sultanate of Oman that investigated the prevalence of diabetes mellitus. They achieved a sensitivity of 93%, and 94% for both accuracy and specificity.
Moreover, Ganji et al. [18] used FCS-ANTMINER on a public diabetes data set (the Pima Indians Diabetes data set [19]). They obtained an accuracy of 84%. Huang et al. [20] employed three data mining algorithms, Naive Bayes, IB1 and C4.5, to predict diabetes on data gathered from the Ulster Community and Hospitals Trust (UCHT) between 2000 and 2004. They were able to achieve an accuracy of 98%. Furthermore, Al Jarullah [21] employed the C4.5 data mining algorithm on the Pima Indians Diabetes data set [19]. He achieved an overall accuracy of 78%.
The literature review shows that data mining algorithms have been used to predict diabetes using public or private data. However, the data sets are either small (fewer than 10,000 records) or collected from one region (mostly one hospital). In this research study, the data sets are collected from four large hospitals in Saudi Arabia. The model extracted from the data could assist in improving the health care plans delivered to diabetic patients.

III. METHODOLOGY
To achieve the study objectives, the study method consists of several phases: data collection and attribute selection, data mining algorithms, and evaluation criteria.

A. Data sets and Attributes Selection
In this work, the data sets are collected from the Ministry of National Guard Health Affairs (MNGHA) databases in the three most populated regions in Saudi Arabia. The databases hold all patient visit information, such as laboratory results and medications. These regions are the Central Region (Riyadh city), the Western Region (Jeddah city) and the Eastern Region (Alahsa and Dammam cities). The latest Saudi census showed that more than 66% of the country's total population lives in these three regions, and the largest cities in these regions are: i) Riyadh (the capital and the largest city in the Central Region); ii) Jeddah (the largest city in the Western Region); and iii) Dammam and Alahsa (the two largest cities in the Eastern Region) [22].
The data sets consist of 66,325 diabetic and non-diabetic instances. The study used data from the hospital information system in MNGHA from 2013 to 2015. Hospital databases are highly prone to inconsistent, noisy and missing values because the data come from heterogeneous sources. Several checks are followed throughout the data extraction process by the information systems in MNGHA to ensure the accuracy of the data. In addition, the data sets went through manual inspection to ensure that the data are consistent and accurate.
All adult patients who have diabetes are included, while pediatric diabetic patients are excluded. The data used for the study did not include identifying information, in order to protect patient privacy.
Detailed information about demographic variables is summarized in Table 1. Furthermore, the data set is divided into training and test data sets as follows:
• Data from 2013 to 2014 represent the training set, which is used to construct and train the model.
• Data from 2015 represent the test set, which is used to test the model and estimate the accuracy rate.
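The year-based split described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the field name `year` and the sample values are hypothetical and are not taken from the MNGHA data.

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, object]


def split_by_year(
    records: List[Record],
    train_years: Sequence[int] = (2013, 2014),
    test_years: Sequence[int] = (2015,),
) -> Tuple[List[Record], List[Record]]:
    """Split records into a training and a test set by visit year."""
    train = [r for r in records if r["year"] in train_years]
    test = [r for r in records if r["year"] in test_years]
    return train, test


# Made-up records for illustration only (hypothetical fields and values).
records = [
    {"year": 2013, "age": 54, "diabetic": True},
    {"year": 2014, "age": 31, "diabetic": False},
    {"year": 2015, "age": 47, "diabetic": True},
    {"year": 2015, "age": 62, "diabetic": False},
]

train_set, test_set = split_by_year(records)
print(len(train_set), "training records,", len(test_set), "test records")
```

Splitting by time rather than at random keeps every test record later than every training record, so the test set behaves like genuinely unseen data.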
The data sets consist of a total of 18 attributes. The attributes include gender, age and region as demographic variables; patient measurements such as BMI and blood pressure; and 11 lab tests. The data sets contain 36,811 males (55.50%) and 29,514 females (44.50%), all of them 14 years old or older. More than half of the patients (64.47%) have diabetes: male diabetic patients represent 36.34% of all patients, while female diabetic patients represent 28.13% of all patients, as shown in Table 1.

B. Data Mining Algorithms
R software [23] is used to run the SOM algorithm in order to predict diabetic patients. The kohonen package in R implements SOM as both an unsupervised and a supervised algorithm. The bdk and xyf functions are the supervised SOM functions in the package. The output returned by both functions is used for prediction in this study.
Since SOM has a number of parameters, selecting appropriate values for them, such as the type of SOM, the network size and the training algorithm, is important. The parameters have a direct impact on classification performance as well as computational time [9]. The parameter values are summarized in Table 3. The Weka [24] data mining tool, on the other hand, is used to run the C4.5 and RandomForest decision trees with the default parameters.
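To make the role of parameters such as network size and training schedule concrete, here is a minimal pure-Python sketch of one simple way a SOM can be used as a classifier: train the map without labels, label each unit by majority vote of the training samples it wins, and predict through the best matching unit. This is not the xyf or bdk algorithm of the kohonen package; the grid size, learning rate, neighborhood radius and toy data are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def best_matching_unit(codebook, x):
    """Index of the unit whose weight vector is closest to x."""
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], x))


def train_som(data, rows=2, cols=2, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a small rectangular SOM with a Gaussian neighborhood."""
    rng = random.Random(seed)
    dim = len(data[0])
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialise each unit's weights from a randomly chosen sample.
    codebook = [list(rng.choice(data)) for _ in coords]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 1e-3)  # shrinking neighborhood
        for x in rng.sample(data, len(data)):       # shuffled pass over data
            bmu = best_matching_unit(codebook, x)
            for i, w in enumerate(codebook):
                grid_d2 = sq_dist(coords[i], coords[bmu])
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return codebook


def label_units(codebook, data, labels):
    """Give each unit the majority label of the samples mapped to it."""
    votes = {}
    for x, y in zip(data, labels):
        votes.setdefault(best_matching_unit(codebook, x), Counter())[y] += 1
    return {unit: c.most_common(1)[0][0] for unit, c in votes.items()}


def predict(codebook, unit_labels, x, default=None):
    """Predict the label of x from its best matching unit."""
    return unit_labels.get(best_matching_unit(codebook, x), default)


# Toy, already-scaled feature vectors (hypothetical, not study data).
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y = ["non-diabetic", "non-diabetic", "diabetic", "diabetic"]

som = train_som(X, rows=2, cols=2, epochs=50, seed=0)
unit_labels = label_units(som, X, y)
print(predict(som, unit_labels, [0.85, 0.85]))
```

In practice the attributes would need to be scaled to comparable ranges before training, since the best matching unit is chosen by Euclidean distance.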

C. Evaluation Criteria
To select the best-performing data mining algorithm for predicting diabetic patients, two standard metrics are applied: Recall and Precision. Recall, Eq. 1, is the proportion of diabetic instances that are correctly classified, which is what such a system needs. It is calculated using:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{1}$$

Precision, Eq. 2, is the proportion of instances classified as diabetic that are actually diabetic. It is calculated using:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

True Positive (TP) refers to diabetic patients who are classified as diabetic, whereas False Negative (FN) refers to diabetic patients who are classified as non-diabetic. False Positive (FP), on the other hand, refers to non-diabetic patients who are classified as diabetic. The best learning algorithm is selected based on classifier performance in terms of high Recall and Precision.
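The two metrics can be computed directly from the TP, FN and FP counts. The sketch below assumes string labels with "diabetic" as the positive class; the label names and the toy predictions are hypothetical and are not results from the study.

```python
from typing import Sequence, Tuple


def confusion_counts(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str = "diabetic"
) -> Tuple[int, int, int]:
    """Return (TP, FN, FP) for the given positive class."""
    tp = fn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1   # diabetic classified as diabetic
        elif truth == positive:
            fn += 1   # diabetic classified as non-diabetic
        elif pred == positive:
            fp += 1   # non-diabetic classified as diabetic
    return tp, fn, fp


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN); defined as 0.0 when there are no positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP); 0.0 when nothing is predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy labels for illustration only.
y_true = ["diabetic", "diabetic", "diabetic", "non-diabetic", "non-diabetic"]
y_pred = ["diabetic", "diabetic", "non-diabetic", "diabetic", "non-diabetic"]

tp, fn, fp = confusion_counts(y_true, y_pred)
print(f"Recall={recall(tp, fn):.3f} Precision={precision(tp, fp):.3f}")
```

True negatives do not appear in either formula, which is why only three of the four confusion-matrix cells are counted here.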

IV. RESULTS AND DISCUSSION
In Table 4, two measurements are reported for each algorithm to assess how well each model performs and to compare the algorithms with each other. C4.5 and RandomForest achieved Recall and Precision over 90% on the training data set, while SOM (bdk and xyf) achieved Recall and Precision over 79% on the training data set.
To choose the best algorithm according to the evaluation criteria, all algorithms are evaluated on an unseen data set (the test data set). The algorithm/model that achieves the highest Recall and Precision is considered the best one. RandomForest achieves the highest Recall and Precision on the test data set, as indicated in Table 4.
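One way to read the selection rule is that the chosen model must be at least as good as every other model on both metrics. The sketch below encodes that reading as an assumption; the model names and scores are placeholders and are not the values reported in Table 4.

```python
from typing import Dict, Optional, Tuple

Scores = Dict[str, Tuple[float, float]]  # model name -> (recall, precision)


def best_by_recall_and_precision(scores: Scores) -> Optional[str]:
    """Return the model that matches or beats all others on both metrics.

    Returns None when no single model is best on both, in which case a
    tie-breaking rule would be needed (the text does not specify one).
    """
    for name, (rec, prec) in scores.items():
        if all(rec >= r and prec >= p for r, p in scores.values()):
            return name
    return None


# Placeholder test-set scores (hypothetical, for illustration only).
clear_winner = {
    "model_a": (0.80, 0.78),
    "model_b": (0.91, 0.90),
    "model_c": (0.88, 0.86),
}
no_winner = {
    "model_a": (0.91, 0.85),
    "model_b": (0.88, 0.93),
}

print(best_by_recall_and_precision(clear_winner))
print(best_by_recall_and_precision(no_winner))
```

Returning `None` for the mixed case makes explicit that comparing on two metrics does not always yield a single best model.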

TABLE IV. RESULT OF THE CLASSIFIERS
SOM could not outperform the decision trees because it constructs its model from only the first SOM grid layer. The multi-layer classification capability of SOM could improve the performance. However, the multi-layer capability is not available in R software [23].
In this study, SOM and decision tree techniques are applied to predict diabetic patients using 18 risk factors (attributes). The most common risk factors among the models constructed by the algorithms are the following: i) gender; ii) age; iii) blood pressure; iv) Body Mass Index (BMI); v)
The knowledge extracted from the MNGHA samples (patient records) can be generalized to the wider diabetic population in Saudi Arabia, since the data sets (samples) are collected from the most populated regions in Saudi Arabia, where more than 66% of the country's total population lives.

V. CONCLUSION
Models constructed by data mining algorithms could help support decision making in different fields, including health care. In this research, real health care data sets containing 18 attributes have been collected from MNGHA databases. Furthermore, three data mining algorithms, namely SOM (bdk and xyf), C4.5 and RandomForest, have been evaluated to construct data mining models that predict diabetic patients using real health care data sets.
The results show that the constructed data mining model could assist health care providers in making better clinical decisions when identifying diabetic patients. Additionally, the model could be further developed for patient protection. In the future, the results can be used to create a control plan for diabetes, because diabetic patients are normally not identified until a later stage of the disease or the development of complications.

TABLE I
Lab test data are described statistically and summarized in Table 2 in order to provide a better understanding of the lab test data, which are considered as attributes in the study.