Validation of Semantic Discretization based Indian Weighted Diabetes Risk Score (IWDRS)

Abstract: The objective of this research study is to validate the Indian Weighted Diabetes Risk Score (IWDRS). The IWDRS is derived by applying the novel concept of semantic discretization, based on data mining techniques. A total of 311 adult participants (age > 18 years), who had been tested for diabetes using a biochemical test in a pathology laboratory according to World Health Organization (WHO) guidelines, were selected for this study. These subjects were not included in deriving the IWDRS tool. IWDRS is calculated for all 311 subjects. Prediction parameters, such as sensitivity and specificity, are evaluated along with other performance parameters to find an optimal cut-off score for IWDRS. The IWDRS tool is validated and found to be highly sensitive in identifying diabetes-positive cases, while being almost equally specific in identifying diabetes-negative cases. The results of IWDRS are compared with those of two similar studies conducted for the Indian population and are found to be better. At the optimal cut-off score IWDRS >= 294, the prediction accuracy is 82.32%, while the sensitivity and specificity are 82.22% and 82.44%, respectively.


I. INTRODUCTION
Undetected diabetes and prediabetes are major concerns for East Asian countries, including India [1]. In such a scenario, Diabetes Risk Score (DRS) tools can prove effective in detecting undiagnosed diabetes and prediabetes cases. DRS tools are simple, easy-to-use computational tools that calculate an individual's risk of diabetes based on a set of risk factors.
The rest of the paper is organized as follows. Section 2 presents the literature review, which is followed by a discussion of the Indian Weighted Diabetes Risk Score (IWDRS) in Section 3. Section 4 presents an outline of the research design. Details of the experiments and results are discussed in Sections 5 and 6, respectively. The conclusion of the research study is given in Section 7.

II. LITERATURE SURVEY
Various DRS tools have been reported in the literature [2]-[14]. Typically, a DRS tool uses a questionnaire to collect data from the target population. These data are used to build a mathematical model for predicting the risk score of an individual. A mass diabetes screening test can then be organized to detect undiagnosed and prediabetic persons, in which only those persons who score high on the DRS are pathologically tested for high blood sugar. In developing countries like India, where lack of awareness, lack of pathological testing facilities, shortage of medical funds, and late diagnosis are major problems, DRS tools can be used as a cost-effective solution.
Several DRS tools have been developed and validated for different ethnic groups. A DRS tool developed for a particular ethnic group may not generalize and may not produce similar results when applied to another ethnic group [15]. Therefore, separate DRS tools need to be developed and validated for each ethnic group, society, and country.
Logistic regression and Cox regression models are used for deriving such risk scores, in which the β coefficients of the risk factors are computed [10], [11], [14]. However, the procedure for building such logistic regression models is not fixed, and the resulting models may not be reproducible. Gary et al. [16] have observed that different investigators working with the same data set produced different risk models. Anderson et al. [17] have argued that diagnostic algorithm tools developed using logistic regression models are not perfect and are prone to misuse.
To overcome the limitations of logistic regression models, Chandrakar and Saini have proposed a new methodology for deriving a risk score and applied it to derive IWDRS [18]. IWDRS is derived from data collected using a comprehensive questionnaire consisting of more than 60 risk factors [19], [20]. These risk factors are discretized using a novel concept of semantic discretization [21]. Each risk factor is then assigned an appropriate weight using machine learning techniques, and the corresponding risk score is calculated. One study on the Pima Indian Diabetes Dataset shows that classification accuracy increases significantly when the dataset is semantically discretized before being passed to the classifier [21]. In the present study, the researchers validate the proposed IWDRS.

III. INDIAN WEIGHTED DIABETES RISK SCORE
IWDRS is developed for the Indian population, considering demographic, socioeconomic, family, and personal indicators. It includes parameters like age, family history of diabetes, blood pressure and high cholesterol, personal history of blood pressure and high cholesterol, BMI, waist circumference, diet quality, stress, physical activity, and life quality. Various types of stress, such as work stress, financial stress, family or social stress, and health-related stress, are considered along with their perceived intensity. Life quality measures how the subject perceives the quality of his/her life, which includes qualitative indicators like happiness, love, and hope. Responses for these parameters are recorded at three different points in time. The responses are categorized into three categories (low, moderate, and high) based on rules derived using machine learning techniques. Table 1 shows the Indian Weighted Risk Score assigned to each parameter in each category.
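As a rough illustration of how a weighted score of this kind can be computed, the sketch below sums one weight per parameter according to the category recorded for that parameter. The parameter names, the weights, and the function `weighted_risk_score` are all hypothetical placeholders; the actual IWDRS weights are those given in Table 1, and the categories here are assumed to be risk categories.

```python
# Hypothetical placeholder weights for three parameters only.
# The real IWDRS weights are listed in Table 1 and are NOT reproduced here.
HYPOTHETICAL_WEIGHTS = {
    "age": {"low": 10, "moderate": 20, "high": 30},
    "bmi": {"low": 12, "moderate": 24, "high": 36},
    "physical_activity": {"low": 8, "moderate": 16, "high": 24},
}


def weighted_risk_score(categories, weights=HYPOTHETICAL_WEIGHTS):
    """Sum the weight of the category recorded for each parameter."""
    missing = set(weights) - set(categories)
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    return sum(weights[name][categories[name]] for name in weights)


# One subject's (assumed) risk category per parameter.
example = {"age": "high", "bmi": "moderate", "physical_activity": "low"}
print(weighted_risk_score(example))  # 30 + 24 + 8 = 62
```

A lookup table keeps the scoring step transparent: each contribution to the total can be traced back to a single parameter and category.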

V. EXPERIMENTS
Data are collected using the same questionnaire that was used to collect data for deriving IWDRS [18]. Data are collected from 311 adult subjects of both genders, aged more than 18 years. Their diabetes status is confirmed with a biochemical test. Of the 311 subjects, 180 were diabetic. IWDRS is calculated for each subject.
The minimum and maximum possible scores are 142 and 464, respectively. Taking 142 as the base score, the IWDRS range 142-464 is divided into 10 equal intervals, which gives 11 cut-off scores: 142, 175, 207, 239, 271, 303, 336, 368, 400, 432 and 464. Prediction parameters are calculated for each of these cut-off scores. Tables 2 and 3 present the sensitivity, specificity, and accuracy of predicting diabetes for different IWDRS cut-off values. From Tables 2 and 3, the highest prediction accuracy is 82.32%, obtained at both IWDRS >= 294 and IWDRS >= 300. The sensitivity is 82.22% and 80.56%, and the specificity is 82.44% and 84.73%, respectively. Although the prediction accuracy is the same for both cut-off scores, at IWDRS >= 300 the sensitivity is lower than the specificity, meaning that it identifies diabetes-negative persons more accurately than diabetes-positive persons, whereas our interest is in identifying diabetic persons more accurately. We therefore choose IWDRS >= 294 as the optimal cut-off score.
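The cut-off scores listed above are consistent with dividing the range 142 to 464 into ten equal steps and rounding each boundary up. The reported percentages are consistent with the confusion-matrix counts used in the sketch below. Both the rounding rule and the counts are inferred from the figures in the text (180 diabetic and 131 non-diabetic subjects); they are not stated in the paper, and the function names are illustrative.

```python
def cutoff_grid(lo=142, hi=464, steps=10):
    """Boundaries of `steps` equal intervals on [lo, hi], each rounded up.

    Integer arithmetic is used so the rounding is not affected by
    floating-point error.
    """
    span = hi - lo
    return [lo + (span * i + steps - 1) // steps for i in range(steps + 1)]


def screening_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as percentages."""
    sensitivity = 100 * tp / (tp + fn)
    specificity = 100 * tn / (tn + fp)
    accuracy = 100 * (tp + tn) / (tp + fn + tn + fp)
    return tuple(round(value, 2) for value in (sensitivity, specificity, accuracy))


print(cutoff_grid())
# [142, 175, 207, 239, 271, 303, 336, 368, 400, 432, 464]

# Counts inferred from the reported percentages (an assumption).
at_294 = screening_metrics(tp=148, fn=32, tn=108, fp=23)
at_300 = screening_metrics(tp=145, fn=35, tn=111, fp=20)
print(at_294)  # (82.22, 82.44, 82.32)
print(at_300)  # (80.56, 84.73, 82.32)
```

Under these inferred counts, both cut-offs classify 256 of the 311 subjects correctly, which is why the accuracy is identical while sensitivity and specificity trade off against each other.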

VI. RESULT ANALYSIS
Our study results are comparable and consistent with other studies reported in the scientific literature. The experimental results of the validation of IWDRS are shown in Table 3.
Two similar studies are found for the Indian population. Mohan et al. [10] have developed a simplified Indian Diabetes Risk Score using a logistic regression model. Four parameters are used for developing the risk model, namely, 1) age; 2) obesity; 3) physical activity; and 4) history of diabetes in the family. Ramachandran et al. [14] have also developed a DRS for the Asian Indian population living in India using a logistic regression model with five parameters. Apart from 1) age; 2) physical activity; and 3) history of diabetes in the family, they used 4) BMI and 5) waist circumference as risk factors. Initially, gender and monthly income were also considered as diabetes risk factors, but they were not taken into account while developing the model. Table 4 compares the prediction statistics of these two risk score tools with those of IWDRS.
With a prediction accuracy of 82.32%, IWDRS can prove to be a useful, inexpensive, yet effective tool for a two-phase mass screening test for diabetes, especially in developing and underdeveloped countries like India, where undiagnosed or late-diagnosed diabetes is a major problem. In the first phase of the mass screening test, IWDRS can be calculated for all subjects using an easy-to-answer questionnaire. In the second phase, only those subjects who score at or above the optimal cut-off value for IWDRS are given the induced plasma glucose tolerance test using biochemical methods in a pathology laboratory, as per WHO guidelines. This two-phase approach will reduce the cost of mass screening drastically in comparison with single-phase mass screening using pathology tests only. By conducting a pathological test for only 55% of the population, we can detect 82% of the diabetic persons present in the population. In other words, for any given budget for a diabetes mass detection program, we can identify 20% more diabetic persons if we use the IWDRS tool in the first phase of two-phase diabetes screening.
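The 55% and 82% figures can be checked with a short calculation on the validation sample. The counts below are inferred from the reported sensitivity and specificity at IWDRS >= 294 with 180 diabetic and 131 non-diabetic subjects; they are an assumption rather than values quoted from the paper's tables, and `two_phase_yield` is an illustrative name.

```python
def two_phase_yield(tp, fn, tn, fp):
    """Summarize a two-phase screen: who is referred and who is detected."""
    total = tp + fn + tn + fp
    referred = tp + fp  # subjects at or above the cut-off go on to the lab test
    return {
        "referred_pct": round(100 * referred / total, 2),
        "detected_pct": round(100 * tp / (tp + fn), 2),
        "lab_tests_saved": total - referred,
    }


# Counts at IWDRS >= 294, inferred from the reported percentages (an assumption).
result = two_phase_yield(tp=148, fn=32, tn=108, fp=23)
print(result)
# {'referred_pct': 54.98, 'detected_pct': 82.22, 'lab_tests_saved': 140}
```

On this sample, roughly 55% of subjects would be referred for the laboratory test and about 82% of the diabetic subjects would be among them. The share referred depends on the prevalence of diabetes in the screened population, so it may differ in other settings.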