Developing a Mathematical Model to Detect Diabetes Using Multigene Genetic Programming

Abstract: Diabetes Mellitus is one of the deadliest diseases and is growing at a rapid rate in developing countries. It is a major contributor to the mortality rate and is the sixth leading cause of death worldwide, so early detection of the disease is highly recommended. This paper attempts to improve the detection of diabetes by developing a mathematical model, based on a set of attributes collected from patients, using the Multigene Symbolic Regression Genetic Programming technique. Genetic Programming (GP) has shown significant advantages in evolving nonlinear models that can be used for prediction. The developed GP model was evaluated on the Pima Indian data set and showed high capability and accuracy in the detection and diagnosis of diabetes.


I. INTRODUCTION
Diabetes is one of the best-known diseases that cause death. Based on measured statistics, it is the sixth leading cause of death worldwide. It was estimated that the world loses about 116 billion per year directly in medical care costs, and a further 580 billion indirectly (death and loss of work because of the disease). Statistics show that diabetes is responsible for high death rates in developing countries. Early detection of the disease is therefore highly recommended, and it is essential to find a way to predict it early. A model that is highly accurate, less complex and efficient is urgently needed.
Diabetes Mellitus is caused by the failure of the body to produce the right amount of insulin to stabilize the amount of sugar in the body [1]. Most patients who suffer from this type of failure are advised to take insulin injections. This is called type I diabetes. In type II diabetes, the patient's body resists insulin. Patients of this type are advised to follow a dietary program and to exercise in order to lose weight, in addition to taking oral medication. Heart disease is nevertheless likely to strike these patients in the long run [2].
Gestational diabetes can occur temporarily during pregnancy. It is due to hormonal changes and usually begins in the fifth or sixth month of pregnancy (between the 24th and 28th weeks). Gestational diabetes usually resolves once the baby is born. However, 25-50% of women with gestational diabetes will eventually develop diabetes later in life, especially those who require insulin during pregnancy and those who are overweight after delivery.

A. General Diabetes Statistics
Due to the wide spread of type II diabetes in the USA, a survey was conducted by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) in collaboration with the American Diabetes Association [3]. It found that 17.9 million people have been diagnosed, while 5.7 million are unaware that they have the disease, giving an estimated total of 23.6 million people in America with diabetes. Table I shows some statistics on gestational diabetes in the Middle East and Northern Africa [4]. Some reported statistics on affected people include:
• US women aged 20 years and older: 11.5 million, which represents 10.2% of women in the USA.
• US men aged 20 years and older: 12 million, which represents 11.2% of men in the USA.
• Adults over 60 years: 12.2 million.
• African Americans aged 20 years and older: 3.7 million (14.7% of all African Americans aged 20 years and older).
• Caucasian Americans aged 20 years and older: 14.9 million (9.8% of all Caucasian Americans aged 20 years and older).
(IJARAI) International Journal of Advanced Research in Artificial Intelligence, www.ijarai.thesai.org
The objective of this work is to explore the advantages of Multigene Symbolic Regression GP in classifying the presence or absence of diabetes based on data of a varied nature collected from patients [5]. The proposed model predicts the class of the patient from eight measured attributes: the number of times pregnant, the result of an oral glucose tolerance test, diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), 2-h serum insulin (micro U/ml), body mass index, diabetes pedigree function, and age (years).
The paper is structured as follows. In Section II, we provide a literature review of the basic research work on diabetes based on soft computing techniques such as Artificial Neural Networks. In Section III, the basic process of GP is described. The extension to the Multigene Symbolic Regression GP approach is provided in Section IV. The results are presented in Section VI, including the inputs and output of the model, the experimental setup and the developed mathematical GP model. Finally, we present the conclusion and future work.

II. LITERATURE REVIEW
An accurate predictor for diabetes is highly needed, and it should also be highly automated, requiring little human intervention. A diabetes predictor should meet the following specifications: efficient modeling, applicability, accuracy and trustworthiness. It should also be compatible with various diagnostic techniques.
Many prediction techniques are used, but the Multi-layer Perceptron (MLP) is the most common [6]-[8]. This type of Artificial Neural Network (ANN) consists of fully connected layers. In the training phase, the learning algorithm examines the inputs. During the testing phase, it examines the outputs and the parts of the data that were not examined during training.
Anthropometrical body surface scanning data were used to construct a classification model for type II diabetes in [9]. The model applies four data mining approaches: Artificial Neural Network, Decision Tree, Logistic Regression and Rough Sets. It is meant to select the appropriate decision tree for classifying diabetic diseases. In [10] the authors used a classification tree for classification and regression with a binary target. They introduced ten attributes, including age, sex, emergency department visits, comorbidity index, dyslipidemia, hypertension, cardiovascular disease, retinopathy and end-stage renal disease. A cascade learning system based on generalized discriminant analysis was introduced in [11]. The system was also linked with the least squares support vector machine in order to classify diabetes diseases, and it was assessed using classification accuracy, k-fold cross-validation and the confusion matrix.
A feature selection method to discover the key attributes affecting diabetic diseases was introduced in [12]. It then introduced three complementary classification techniques, including Naive Bayes and C4.5. In [13] the authors developed and upgraded Linear Discriminant Analysis (LDA) and integrated it into an automatic diagnosis system. All these models function primarily in the area of classification, and each is meant to be accurate and to perform well.
Fuzzy approaches have recently become well-known approaches for improving classification models. Fuzzy Neural Networks (FNNs) and artificial neural networks have recently been integrated into a hybrid classification model that helps in diagnosing and classifying the state of diabetic diseases. This model was presented by [14]. A multi-objective genetic programming approach was proposed by [15] to develop Pareto optimal decision trees for diabetes classification. In [16], GP was used to generate new features by combining the existing diabetes features.

III. GENETIC PROGRAMMING
GP works on a population of individuals, each of which represents a potential solution to a problem. GP was introduced by J. Koza in 1992 at Stanford. A flow chart of the GP evolutionary process is shown in Figure 1. In order to solve a problem, it is necessary to specify the following [17]:
• The terminal set: A set of input variables or constants.
• The function set: A set of domain specific functions used in conjunction with the terminal set to construct potential solutions to a given problem. For symbolic regression this could consist of a set of basic mathematical functions, while Boolean and conditional operators could be included for classification problems.
• The fitness function: Fitness is a numeric value assigned to each member of a population to provide a measure of the appropriateness of a solution to the problem in question.
• The termination criterion: This is generally a predefined number of generations or an error tolerance on the fitness.
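As a rough illustration of these four ingredients, the following Python sketch defines a small terminal set, a function set, an RMS-error fitness and a termination test. It is not the GPTIPS implementation: the nested-tuple tree encoding, the chosen constants, the protected division, the target expression and all function names are assumptions made only for this example.

```python
import math
import random

# Terminal set: input variables plus a few constants (illustrative choice).
TERMINALS = ["a", "b", 1.0, 2.0, 3.0]

# Function set: symbol -> implementation. Division is "protected" so that a
# zero denominator returns 1.0 instead of aborting the run (an assumption).
FUNCTIONS = {
    "+": lambda u, v: u + v,
    "-": lambda u, v: u - v,
    "*": lambda u, v: u * v,
    "/": lambda u, v: u / v if v != 0 else 1.0,
}


def random_tree(depth, rng):
    """Grow a random expression tree as nested tuples (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(list(FUNCTIONS))
    return (op, random_tree(depth - 1, rng), random_tree(depth - 1, rng))


def evaluate(tree, env):
    """Evaluate a tree for one observation; env maps variable names to values."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, env), evaluate(right, env))
    return env[tree] if isinstance(tree, str) else tree


def fitness(tree, samples):
    """Root mean square error over (env, target) pairs; lower is better."""
    squared = [(evaluate(tree, env) - y) ** 2 for env, y in samples]
    return math.sqrt(sum(squared) / len(squared))


def should_stop(generation, best_fitness, max_generations=100, tolerance=1e-6):
    """Termination criterion: generation budget used up or error small enough."""
    return generation >= max_generations or best_fitness <= tolerance


# Usage: score two candidate trees against an arbitrary target y = a*b + 1.
samples = [({"a": a, "b": b}, a * b + 1.0) for a in range(4) for b in range(4)]
target_tree = ("+", ("*", "a", "b"), 1.0)
print(fitness(target_tree, samples))      # exact match, so the error is zero
print(fitness(("+", "a", "b"), samples))  # a worse candidate has a larger error
```

A full GP run would wrap these pieces in a loop that selects, recombines and mutates trees until `should_stop` returns true.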
In order to further illustrate the coding procedure and the genetic operators used for GP, a symbolic regression example will be used. Consider the problem of predicting the numeric value of an output variable, $y$, from two input variables $a$ and $b$. One possible symbolic representation for $y$ in terms of $a$ and $b$ would be:
$$y = \frac{a - b}{3}$$
Figure 2 demonstrates how this expression may be represented as a tree structure. With this tree representation, the genetic operators of crossover and mutation must be posed in a fashion that allows the syntax of the resulting expressions to be preserved. Figure 3 shows a valid crossover operation, where the two parent expressions, Parent 1 ($y_1$) and Parent 2 ($y_2$), are given in Equations 1 and 2, and the two offspring, Offspring 1 ($y_3$) and Offspring 2 ($y_4$), are given in Equations 3 and 4.

IV. MULTIGENE SYMBOLIC REGRESSION GP
Typically, symbolic regression is performed by using GP to evolve a population of trees, each of which encodes a mathematical equation that predicts an $N \times 1$ vector of outputs $y$ using a corresponding $N \times M$ matrix of inputs $X$, where $N$ is the number of observations of the response variable and $M$ is the number of input (predictor) variables [17]. In contrast, in Multigene symbolic regression each symbolic model (and each member of the GP population) is a weighted linear combination of the outputs from a number of GP trees, where each tree may be considered to be a gene [19]. For example, the Multigene model shown in Figure 4 predicts an output variable using the input variables $x_1$, $x_2$, $x_3$. This model structure contains non-linear terms (e.g. the hyperbolic tangent) but is linear in the parameters with respect to the coefficients $\alpha_0$, $\alpha_1$, $\alpha_2$.
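To make the "linear in the parameters" structure concrete, here is a minimal Python sketch of a Multigene prediction. The two genes are hypothetical stand-ins (they are not the genes of Figure 4), and the weight values are arbitrary; only the overall form, a bias plus a weighted sum of gene outputs, is taken from the text.

```python
import math


def gene_1(x1, x2, x3):
    # Hypothetical gene containing a non-linear term (hyperbolic tangent).
    return x1 + math.tanh(x2 * x3)


def gene_2(x1, x2, x3):
    # Hypothetical gene built from a low-order polynomial term.
    return x3 * x3 - x2


def multigene_predict(alphas, genes, x):
    """Return alpha_0 + sum_i alpha_i * gene_i(x) for one observation x."""
    bias, weights = alphas[0], alphas[1:]
    if len(weights) != len(genes):
        raise ValueError("need one weight per gene plus a bias term")
    return bias + sum(w * g(*x) for w, g in zip(weights, genes))


genes = [gene_1, gene_2]
x = (1.0, 0.0, 2.0)
y_a = multigene_predict([0.5, 2.0, -1.0], genes, x)
y_b = multigene_predict([1.0, 4.0, -2.0], genes, x)
# The model is linear in the coefficients: doubling every alpha doubles y.
print(y_a, y_b)
```

Because the genes are fixed functions of the inputs, finding the best coefficients for a given set of genes is an ordinary linear regression problem, even though each gene may itself be non-linear in $x_1$, $x_2$, $x_3$.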
In practice, the user specifies the maximum number of genes $G_{max}$ and the maximum tree depth $D_{max}$, and can therefore exert control over the model complexity. In particular, we have found that enforcing stringent tree depth restrictions (i.e. maximum depths of 4 or 5 nodes) often allows the evolution of relatively compact models that are linear combinations of low-order non-linear transformations of the input variables. For each model, the linear coefficients are estimated from the training data using ordinary least squares techniques.
Hence, Multigene GP combines the power of classical linear regression with the ability to capture non-linear behavior without the need to pre-specify the structure of the non-linear model. In [20] it was shown that Multigene symbolic regression can be more accurate and computationally efficient than the standard GP approach for symbolic regression.
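The least squares step can be sketched as follows. This is a plain-Python illustration that solves the normal equations by Gaussian elimination; that choice is an assumption made to keep the example self-contained, not a description of how GPTIPS computes the coefficients, and a QR- or SVD-based solver would usually be preferred numerically. The function names and the toy data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: gene outputs are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_gene_weights(gene_outputs, targets):
    """Least-squares weights [alpha_0, alpha_1, ...] via the normal equations.

    gene_outputs[k][i] is the output of gene i on training observation k.
    """
    design = [[1.0] + list(row) for row in gene_outputs]  # bias column first
    p = len(design[0])
    gram = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(p)]
    return solve_linear_system(gram, rhs)


# Usage: toy targets generated exactly as 1 + 2*g1 - 3*g2.
gene_outputs = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0), (3.0, 5.0), (4.0, 2.0)]
targets = [1.0 + 2.0 * g1 - 3.0 * g2 for g1, g2 in gene_outputs]
alphas = fit_gene_weights(gene_outputs, targets)
print(alphas)  # close to [1.0, 2.0, -3.0] up to rounding error
```

In an evolutionary run this fit would be repeated for every individual, since each individual carries its own set of genes and therefore its own design matrix.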
As an example of two-point high-level crossover, suppose the first parent individual contains the genes $(G_1\ G_2\ G_3)$ and the second contains the genes $(G_4\ G_5\ G_6\ G_7)$, where $G_{max}$ equals 5. Two randomly selected crossover points are created for each individual. The genes enclosed by the crossover points are denoted by [ ].
The genes enclosed by the crossover points are then exchanged, resulting in two new individuals. Two-point high-level crossover allows the acquisition of new genes for both individuals, but it also allows genes to be removed. If an exchange of genes results in an individual containing more genes than $G_{max}$, then genes are randomly selected and deleted until the individual contains $G_{max}$ genes.
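The exchange and pruning steps described above can be sketched in Python as follows. Individuals are represented simply as lists of gene labels, and the function names and the cut points used in the usage example are hypothetical choices for illustration; the sketch is not taken from the GPTIPS source.

```python
import random


def exchange_segments(parent_1, parent_2, cut_1, cut_2):
    """Swap the gene segments parent_1[i1:j1] and parent_2[i2:j2]."""
    (i1, j1), (i2, j2) = cut_1, cut_2
    child_1 = parent_1[:i1] + parent_2[i2:j2] + parent_1[j1:]
    child_2 = parent_2[:i2] + parent_1[i1:j1] + parent_2[j2:]
    return child_1, child_2


def prune(genes, g_max, rng):
    """Delete randomly selected genes until at most g_max remain."""
    genes = list(genes)
    while len(genes) > g_max:
        del genes[rng.randrange(len(genes))]
    return genes


def high_level_crossover(parent_1, parent_2, g_max, rng):
    """Two-point high-level crossover with random cut points and pruning."""
    def cut_points(n):
        # Two distinct cut positions, so the enclosed segment is non-empty.
        i, j = sorted(rng.sample(range(n + 1), 2))
        return i, j

    child_1, child_2 = exchange_segments(
        parent_1, parent_2, cut_points(len(parent_1)), cut_points(len(parent_2))
    )
    return prune(child_1, g_max, rng), prune(child_2, g_max, rng)


p1 = ["G1", "G2", "G3"]
p2 = ["G4", "G5", "G6", "G7"]

# Fixed (hypothetical) cut points: (G1 [G2] G3) and (G4 [G5 G6 G7]).
c1, c2 = exchange_segments(p1, p2, (1, 2), (1, 4))
print(c1, c2)  # ['G1', 'G5', 'G6', 'G7', 'G3'] ['G4', 'G2']

# Random cut points, with pruning to G_max = 5.
r1, r2 = high_level_crossover(p1, p2, g_max=5, rng=random.Random(1))
print(r1, r2)
```

With the fixed cut points the first child grows from three genes to five while the second shrinks from four to two, which shows how this operator both acquires and removes genes.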
The user can set the relative probabilities of each of these recombination processes. These processes are grouped into categories called events. The user can then specify the probability of crossover events, direct reproduction events and mutation events; these must sum to one. The user can also specify the probabilities of event subtypes, e.g. the probability of a two-point high-level crossover taking place once a crossover event has been selected, or the probability of a subtree mutation once a mutation event has been selected. The GPTIPS Matlab toolbox provides default values for each of these probabilities, so the user does not need to set them explicitly [21]. An example of a Multigene model is shown in Figure 4. The presented model can be expressed mathematically as given in Equation 5.

V. PERFORMANCE CRITERION
A number of evaluation criteria were computed to evaluate the performance of the developed models. The Root Mean Square (RMS) error was used as the fitness function for genetic programming. The RMS can be described by Equation 6.
Other performance criteria were used to evaluate the goodness of the developed GP model. The set of criteria used is given as follows:
• Sensitivity (Sens): $Sens = \frac{TP}{TP + FN}$
• Specificity (Spec): $Spec = \frac{TN}{TN + FP}$
• Positive Predicted Value (PPV): $PPV = \frac{TP}{TP + FP}$
• Negative Predicted Value (NPV): $NPV = \frac{TN}{TN + FN}$
• Accuracy (Acc): $Acc = \frac{TP + TN}{TP + FP + TN + FN}$
Given that:
• True Positive (TP): Sick people correctly diagnosed as sick.
• False Positive (FP): Healthy people incorrectly identified as sick.
• True Negative (TN): Healthy people correctly identified as healthy.
• False Negative (FN): Sick people incorrectly identified as healthy.
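The criteria of this section can be computed as in the following Python sketch. It uses the standard definitions of the RMS error and of the confusion-matrix ratios; the function names, the 0/1 label encoding (1 = sick, 0 = healthy) and the toy labels are assumptions for illustration and are not results from the paper.

```python
import math


def rms_error(y_true, y_pred):
    """Root mean square error between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = sick, 0 = healthy)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Safe division: undefined ratios are reported as NaN."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp, fp, tn, fn):
    return {
        "sens": ratio(tp, tp + fn),
        "spec": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "acc": ratio(tp + tn, tp + fp + tn + fn),
    }


# Toy labels, invented for this example only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts)   # (3, 2, 4, 1)
print(metrics)
print(rms_error(y_true, y_pred))
```

Since a symbolic regression model produces a continuous output, some rule (for example a threshold) is needed to turn it into a 0/1 class before these counts can be taken; the source does not state which rule was used.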

A. Model Inputs and Output
The Pima Indians are a homogeneous group living in America, and they are known for being the group most affected by type II diabetes. They are therefore the subject of intense study in type II diabetes. The Pima Indians diabetes data can be retrieved from the UCI Machine Learning Repository web site [5]. The data consist of eight input variables and one output (0, 1). The inputs and output of the GP mathematical model are presented in Table II. We used 500 samples as a training set and 100 samples as a testing set. The data set was normalized according to Equation 12.
$x_{max}$ and $x_{min}$ are the maximum and minimum values of the array $x$, respectively, and $x_{new}$ is the newly computed value based on the value of $x_{old}$.
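Equation 12 is not reproduced in this text. Given the symbols $x_{max}$, $x_{min}$, $x_{old}$ and $x_{new}$, a common reading is min-max scaling to the range [0, 1], and the Python sketch below assumes that form, $x_{new} = (x_{old} - x_{min}) / (x_{max} - x_{min})$. The function name and the handling of a constant column are illustrative choices.

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1] using its own minimum and maximum.

    Assumed form: x_new = (x_old - x_min) / (x_max - x_min).
    """
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    span = x_max - x_min
    return [(x_old - x_min) / span for x_old in values]


# Usage on one hypothetical attribute column.
column = [50.0, 75.0, 100.0]
print(min_max_normalize(column))  # [0.0, 0.5, 1.0]
```

Each of the eight input attributes would be scaled separately in this way, so that attributes measured in very different units contribute on a comparable scale.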

B. Experimental Setup
In this research, we adopted the GPTIPS toolbox [21] to develop our results. In GPTIPS, the initial population is constructed by creating individuals that contain randomly generated GP trees, with between 1 and $G_{max}$ genes. During the run, genes are acquired and deleted using a tree crossover operator called two-point high-level crossover. This allows the exchange of genes between individuals, and it is used in addition to the standard GP recombination operators. Some parameters have to be defined by the user at the beginning of the evolutionary process. They include the population size, the probability of crossover, the mutation probability and the type of selection mechanism. The user also has to set the maximum number of genes $G_{max}$ that a model is allowed to have. The maximum tree depth $D_{max}$ allows us to change the complexity of the evolved models. Restricting the tree depth helps in evolving simple models, but it may also reduce the performance of the evolved model. Crossover was performed with the two-point high-level crossover operator. Once the two parent individuals have been selected, two gene crossover points are selected within each parent. The genes enclosed by the crossover points are then swapped between the parents to form two new offspring.

C. Developed Mathematical GP Model
The data set described earlier was loaded, and the Multigene GP was then applied using the GPTIPS tool. The parameters of the algorithm were tuned as listed in Table III. Figure 5 shows the convergence of GP over 100 generations. The best generated Multigene GP model for diabetes prediction is given in Table IV. It can clearly be seen that the final model is a simple and compact mathematical model that is easy to evaluate. The performance measurements for the model were computed and are summarized in Table V. We also explored the idea of using a subset of the features to develop the GP model. Thus, we considered the features $x_3$, $x_6$ and $x_8$ to develop the output class $y$ of diabetic type. For this case, we ran GP with a population size of 30 and 100 generations, with the same tuning parameters (tree depth, maximum number of genes, probability of crossover and probability of mutation). The performance measurements for the developed GP model were computed and are summarized in Table VI. Figure 6 shows the convergence of GP in the case with fewer features. The developed GP model is presented in Table VII.

Fig. 6. Convergence of the GP evolutionary process

The developed model was able to classify patient type. The classification accuracy obtained with Multigene GP is high with respect to sensitivity, specificity, accuracy, positive predicted value and negative predicted value. These evaluation criteria indicate that Multigene GP is beneficial for diabetic patient classification. The knowledge gained is comprehensible and can enhance the decision-making process of the physician. We plan to expand this research to detect the most significant attributes that indicate diabetes.

Prior knowledge of the problem domain helps in designing a function set, which could speed up the evolutionary process for model development. The function set adopted to develop the GP model is given as: $F = \{+, -, \times\}$

TABLE II. INPUTS AND OUTPUT FOR DIABETIC PREDICTION MODEL

TABLE V. PERFORMANCE OF THE GP MODEL WITH $x_1, \ldots, x_8$ AS INPUTS

TABLE VI. PERFORMANCE OF THE GP MODEL WITH $x_3$, $x_6$ AND $x_8$