Novel Data Oriented Structure Learning Approach for the Diabetes Analysis

Diabetes mellitus is a serious disease and an ever-growing epidemic; it therefore represents a worldwide public health crisis. Several classification techniques have recently been employed for diabetes diagnosis; however, only a few studies have been dedicated to facilitating its analysis through knowledge representation based on probabilistic modelling. The Bayesian Network (BN), a probabilistic graphical model, is considered one of the most effective classification techniques and is widely employed in domains such as risk analysis, medicine, bioinformatics and security. It provides an effective formalism for reasoning under uncertainty. Constructing a BN model involves two learning phases: structure learning and parameter learning. Structure learning, the first phase, is known to be an NP-hard problem. Accordingly, several methods have been introduced, among which score-based algorithms are considered some of the most powerful. In this paper, we introduce a novel algorithm that combines graph theory and information theory. The proposed algorithm, called the GIT algorithm, detects parents and children for BN structure learning. We evaluate it on reference networks and demonstrate its efficiency in terms of accuracy. Furthermore, we apply the algorithm to a real-world case, detecting the dependencies that are useful for diabetes analysis.
Keywords: Classification; Bayesian Network; structure learning; score oriented approach; diabetes analysis


I. INTRODUCTION
In this century, diabetes mellitus represents a serious health problem [1] [2]. The International Diabetes Federation (IDF) projects that 642 million adults will be diabetic by 2040 and that the next two decades will see an increase of 10.4%. An estimated 49.7% of all affected people are undiagnosed, with the highest values found in Africa (70%), South-East Asia (60%) and the Western Pacific (54%) [3]. Given the importance of diagnosis, correct and rapid analysis for diabetes detection using an intelligent technique has become a crucial necessity [1] [4] [5].
The BN is a classification technique based on graphical representation and probabilistic reasoning. It is regarded as a consistent formalism for modelling complex systems [6] and belongs to the most extensively used category of probabilistic graphical models [7]. Because of its strong reasoning abilities and its graphical representation, the BN has been applied effectively in several research areas such as image processing [8] [11], risk analysis [9], medical diagnosis [10] and bioinformatics [12]. Constructing a BN model consists of two learning stages: structure learning and parameter learning. The structure learning phase specifies the set of dependencies between the random variables. It produces a directed acyclic graph (DAG) whose nodes and edges represent the dependency relations between parent and child nodes, while the parameter learning stage quantifies these dependencies.
The aim of the first learning phase is to generate the optimal structure, which is an NP-hard problem because the search space is intractable [13]. Two main families of methods have been proposed to solve it: data-oriented methods and methods based on expert knowledge [14], the latter being time consuming. In this paper, we propose a score-driven algorithm, called the GIT algorithm, which is based on Information Theory (IT), precisely the Mutual Information (MI) exchanged between nodes, and on Graph Theory (GT). The proposed BN was tested on the exploration of a medical database for diabetes diagnosis from the Zulfy hospital of Saudi Arabia [5].
Section II introduces the fundamental concepts needed to describe our proposal. Section III presents the proposed GIT algorithm. Section IV gives an illustrative example. Section V reports the experimental results and their evaluation on well-known benchmarks. Section VI presents the application of the proposed data-oriented method in the medical field, precisely for diabetes diagnosis. The last section concludes the paper and summarizes its main ideas.

II. PRELIMINARIES
Since our purpose is to propose a novel BN structure learning algorithm based on the IT and the GT, we have to introduce the interrelated notions before highlighting the proposed idea.

A. Structure Learning of the BN
The Bayesian network is a Directed Acyclic Graph (DAG) that represents the distribution of conditional probabilities over a set of variables. The DAG is composed of a set of nodes (random variables) and edges representing the dependencies between the nodes. The BN is defined as a couple (G, P), where G is the directed graph and P designates the probability distribution. The graph itself can be denoted as a couple (N, E), in which N is the set of nodes (the random variables) and E designates the edges between the nodes. A variable can be discrete or continuous. As indicated in the following equation, the joint probability distribution is computed as the product of the local conditional probabilities:
$$P(N_1, \dots, N_n) = \prod_{i=1}^{n} P(N_i \mid P_a(N_i))$$
where $N_i$ is node $i$ and $P_a(N_i)$ represents its parents.
The construction of the BN model consists of two learning phases: structure learning and parameter learning. Learning the BN structure yields the graphical representation of the qualitative knowledge, which is the focus of this paper. The structure learning phase aims to represent explicitly the causal relationships among the random variables (that is, to answer the "what if?" question) [15].
Over the last two decades, several BN structure-learning algorithms have been proposed; they can be categorized into score-oriented, conditional-independence-based and hybrid approaches. (1) Conditional-independence-based methods perform a qualitative study of the dependencies between variables, and the generated skeleton represents these dependency relationships. Well-known algorithms of this approach are the PC algorithm (named after its authors, Peter Spirtes and Clark Glymour) and the IC (Inductive Causation) algorithm. (2) The score-oriented approach relies on a score metric, defined as a measure of fit between the data and the graph, and aims to find the graph that maximizes this score. In the last decade, many studies have shown that score-based methods are widely used, for example Amirkhani et al. [16] (2016), Tabar et al. [17] (2018) and Benmohamed et al. [18] (2019), [19] (2020). Accordingly, the present study introduces a novel score-based algorithm, as explained in the following section.
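As an illustration of this factorization, the following minimal sketch multiplies local conditional probabilities along a toy two-node network. The network (Rain, WetGrass), its probability values and the function name `joint_probability` are hypothetical and are not taken from the paper.

```python
# Hypothetical toy network: Rain -> WetGrass (not from the paper).
parents = {"Rain": [], "WetGrass": ["Rain"]}

# cpt[node][parent_values][value] = P(node = value | parents = parent_values)
cpt = {
    "Rain": {(): {True: 0.2, False: 0.8}},
    "WetGrass": {
        (True,): {True: 0.9, False: 0.1},
        (False,): {True: 0.1, False: 0.9},
    },
}

def joint_probability(assignment, parents, cpt):
    """Product over all nodes of P(node | its parents) for a full assignment."""
    p = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[pa] for pa in parents[node])
        p *= cpt[node][parent_values][value]
    return p

# P(Rain=True, WetGrass=True) = P(Rain=True) * P(WetGrass=True | Rain=True)
p_both = joint_probability({"Rain": True, "WetGrass": True}, parents, cpt)

# Sanity check: the joint distribution sums to 1 over all assignments.
total = sum(
    joint_probability({"Rain": r, "WetGrass": w}, parents, cpt)
    for r in (True, False)
    for w in (True, False)
)
```

Structure learning decides the content of `parents`; parameter learning fills in `cpt`.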

III. PROPOSED SCORE-ORIENTED ALGORITHM
The proposed GIT method (Fig. 1) is divided into two main phases. The first phase extracts the dependencies between the random variables of the dataset in order to create an undirected acyclic graph. To obtain this graph, we have to avoid cyclic structures and weak dependencies between the nodes. In the second phase, the extracted graph is used to determine the lists of parent and child nodes. The resulting directed acyclic graph is important not only for correct parameter learning and instance classification; it also provides a graphical representation of useful knowledge that allows a better analysis of the dataset.
To extract the undirected acyclic graph, we use information theory (IT), precisely the mutual information (MI), in order to eliminate cyclic structures. Based on graph theory (GT), we start by removing the cycle formed by every three nodes of the graph, since the BN is acyclic. In our algorithm, called the IT and GT based structure learning algorithm (GIT), the MI is calculated to identify the weak dependencies so that erroneous edges can be eliminated. Between two variables A and B, the MI, noted I(A,B), is calculated as follows:
$$I(A, B) = H(A) - H(A \mid B)$$
where H(A) is the entropy of A and H(A|B) represents the conditional entropy of A given B. The entropy of the variable A is defined by:
$$H(A) = -\sum_{a} P(a) \log P(a)$$
Mathematically, H(A|B) is defined by the following equation:
$$H(A \mid B) = -\sum_{a, b} P(a, b) \log P(a \mid b)$$
396 | Page www.ijacsa.thesai.org
For learning the BN structure, the algorithm relies on the mutual information MI, which is widely used in the literature for structure learning. The proposed sub-algorithm for undirected graph extraction (Algorithm 1) includes two phases: detection of the dependencies between nodes and elimination of cyclic structures, as described in the rest of this section. As shown in Algorithm 1, for each pair (X, Y) we check whether a third node Z exists that forms a cycle. In graph theory, a path forms a cycle if its start node $n_i$ is also its final node (with the number of nodes in the path greater than or equal to 1). To avoid the circuit formed by the three nodes X, Y and Z, we eliminate the weak dependency. If the condition in Algorithm 1 is not met, the three nodes form a cycle and the edge between the original pair is eliminated, as shown in Fig. 2.
The obtained graph is then used to determine the orientation of each edge: the lists of parents and children are created to produce the final graph, which is acyclic and oriented.
These tasks are carried out by Algorithm 2, which consists of two phases for parent and child detection. In [20], a dependence criterion OE is used to decide which of two dependent nodes X and Y is the parent and which is the child. OE determines the orientation of the edge from Y to X (node Y is the parent), and $|X|$ denotes the number of possible values of the variable X. In addition, we combine this criterion with a condition based on the value of the Maximum Mutual Information (MMI) for the edge orientation. To determine the parent and the child, we propose an improvement of the OE metric. One condition on the improved metric identifies node X as the parent; a second condition must be satisfied for the edge to be oriented as $(Y \rightarrow X)$. These steps are detailed in Algorithm 2.
Vol. 12, No. 3, 2021
In this section, we described our method for learning the BN structure through a data-oriented approach. The dependencies learned by the first algorithm are oriented by the second algorithm to generate the final directed acyclic graph. The main idea of the structure learning is based on mutual information and graph theory. The following section presents the experimental results.
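The first phase can be sketched as follows, using the standard definitions $I(A,B) = H(A) - H(A \mid B)$, $H(A) = -\sum_a P(a)\log P(a)$ and $H(A \mid B) = -\sum_{a,b} P(a,b)\log P(a \mid b)$ estimated from sample frequencies. The function names, the toy data and the triangle-breaking rule (drop the edge with the lowest MI in every fully connected triple) are assumptions made for illustration; this is one plausible reading of the idea behind Algorithm 1, not the authors' exact procedure, and it only handles cycles of length three.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(xs):
    """Empirical entropy H(X) in bits."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """Empirical H(X|Y) = -sum_{x,y} P(x,y) log2 P(x|y)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_y = Counter(ys)
    return -sum((c / n) * log2(c / count_y[y]) for (_, y), c in joint.items())

def mutual_information(xs, ys):
    """I(X,Y) = H(X) - H(X|Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)

def break_triangles(nodes, mi):
    """Assumed rule: in every fully connected triple, drop the weakest edge.

    mi maps frozenset({u, v}) to the mutual information of that pair.
    """
    edges = set(mi)
    for a, b, c in combinations(nodes, 3):
        triangle = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if all(e in edges for e in triangle):
            edges.remove(min(triangle, key=lambda e: mi[e]))
    return edges

# Toy samples (hypothetical): B copies A, C is independent of A.
A = [0, 0, 1, 1]
B = [0, 0, 1, 1]
C = [0, 1, 0, 1]
mi_ab = mutual_information(A, B)
mi_ac = mutual_information(A, C)

# Toy MI values (hypothetical) for a triangle X-Y-Z.
toy_mi = {
    frozenset(("X", "Y")): 0.5,
    frozenset(("Y", "Z")): 0.4,
    frozenset(("X", "Z")): 0.1,
}
kept = break_triangles(["X", "Y", "Z"], toy_mi)
```

In the toy triangle, the X-Z edge carries the least information and is the one removed, leaving the chain X-Y-Z.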

A. Used Datasets
To test the proposed GIT algorithm on widely used benchmark networks, we first present the datasets and then the performance measures. The algorithm is tested on three well-known datasets: ASIA, CANCER and ALARM. It is executed on an Intel i5-5300U with 8 GB of memory (64-bit system). Table I describes the datasets. The next subsection describes the metrics used to evaluate the GIT algorithm's performance and the results obtained.

B. Used Metrics and Experimental Results
The evaluation of the proposed algorithm, which is based mainly on the MI and the GT, demonstrates its efficiency in solving the structure learning problem. To present the results, we use the metrics shown in Table II. The experimental results, shown in Fig. 3, Fig. 4 and Fig. 5, are described in terms of the difference between the original structure and the learned one (terms: CE, AE, DE, RE, SD).
To verify the effectiveness of the GIT algorithm, it was executed on the datasets shown in Table I for 1000, 2000, 3000, 5000 and 10000 cases. Fig. 3, Fig. 4 and Fig. 5 show the experimental results based on the structural difference between the obtained skeleton and the original one. Fig. 4 shows the results for the ASIA network: seven correct edges, one reversed edge, and no deleted or added edges. For this network, the results are sensitive to the number of cases as it changes from 1000 to 10000. For the CANCER network, we obtain four correct edges for 1000, 2000 and 3000 cases, with one added edge. For the ALARM network, our method correctly detects thirty-four edges, wrongly deletes four edges, reverses one edge and incorrectly adds eleven edges for ALARM-1000, ALARM-2000 and ALARM-3000. For the remaining cases, the GIT algorithm produces one reversed edge, four deleted edges and twelve added edges; accordingly, we obtain seventeen erroneous edges in comparison with the original ALARM network. Furthermore, we use the accuracy metric (Acc) [21] to summarize the performance of the GIT algorithm. Fig. 6 gives the accuracy values for 1000, 2000, 3000, 5000 and 10000 cases for the CANCER, ASIA and ALARM networks. As shown in Fig. 6, the proposed method produces its best accuracy values for the ASIA network: 0.875 for ASIA-1000, ASIA-2000 and ASIA-3000, and 0.778 for ASIA-5000 and ASIA-10000. For the CANCER network, the accuracy is 0.8 for 1000, 2000 and 3000 cases and 0.6 for CANCER-5000 and CANCER-10000. For ALARM-1000, ALARM-2000, ALARM-3000, ALARM-5000 and ALARM-10000, all values are greater than 0.65 and decrease as the number of samples increases. These results demonstrate the reliability of the proposed algorithm for learning the structure of the CANCER, ASIA and ALARM networks.
In addition, to demonstrate the efficiency of the GIT algorithm, we compare its results with those of other methods for solving the BN structure-learning problem.
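The structural-difference terms can be computed by comparing two sets of directed edges. The sketch below assumes the usual meanings of the abbreviations (CE: correct edges, RE: reversed edges, AE: added edges, DE: deleted edges, SD: structural difference taken as RE + AE + DE); the function name and the toy graphs are hypothetical, and the exact definitions used in Table II may differ.

```python
def compare_structures(true_edges, learned_edges):
    """Count CE, RE, AE, DE and SD between two DAGs given as (parent, child) pairs."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    # Correct edges: present in both graphs with the same orientation.
    correct = true_edges & learned_edges
    # Reversed edges: learned with the opposite orientation to the original.
    reversed_ = {(u, v) for (u, v) in learned_edges
                 if (v, u) in true_edges and (u, v) not in true_edges}
    # Added edges: learned edges with no counterpart in the original graph.
    added = learned_edges - correct - reversed_
    # Deleted edges: original edges missing from the learned graph in both orientations.
    deleted = {(u, v) for (u, v) in true_edges
               if (u, v) not in learned_edges and (v, u) not in learned_edges}
    return {
        "CE": len(correct),
        "RE": len(reversed_),
        "AE": len(added),
        "DE": len(deleted),
        "SD": len(reversed_) + len(added) + len(deleted),
    }

# Hypothetical toy graphs.
original = [("A", "B"), ("B", "C"), ("C", "D")]
learned = [("A", "B"), ("C", "B"), ("A", "D")]
diff = compare_structures(original, learned)
```

Here A→B is correct, C→B is a reversal of B→C, A→D is added and C→D is deleted.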
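As a worked check of the accuracy values, the sketch below assumes that Acc is the share of correct edges among all counted edges, Acc = CE / (CE + RE + AE + DE). This form is an assumption, since the definition from [21] is not reproduced here; it does, however, reproduce the reported 0.875 for ASIA and 0.8 for CANCER from the edge counts given above.

```python
def accuracy(ce, re=0, ae=0, de=0):
    """Assumed accuracy: correct edges over correct plus erroneous edges."""
    return ce / (ce + re + ae + de)

# Edge counts reported in the text for 1000, 2000 and 3000 cases.
acc_asia = accuracy(ce=7, re=1)                   # 7 correct, 1 reversed
acc_cancer = accuracy(ce=4, ae=1)                 # 4 correct, 1 added
acc_alarm = accuracy(ce=34, re=1, ae=11, de=4)    # 34 correct, 16 erroneous
```

Under the same assumption, the ALARM counts give 0.68, consistent with the statement that all ALARM values exceed 0.65.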

V. EXPERIMENTAL RESULTS
The results produced by the GIT algorithm are presented in Table III and Table IV, together with those of the NDPSO-BN, Ko and Kim, Tabar et al. and Ai [24] algorithms, for the ASIA and ALARM networks with 1000, 2000, 3000, 5000 and 10000 cases. To highlight the results, the bold value represents the best value and the starred value the second best.
Table III presents the experimental results describing the difference between the original topology and the structure learned by, respectively, our algorithm and the NDPSO-BN, Ko and Kim, Tabar et al. and Ai algorithms. Our GIT algorithm achieves the best or second-best results for ASIA-1000, ASIA-2000 and ASIA-3000. Compared with the NDPSO-BN method, our algorithm produces the same number of erroneous edges, namely one. The GIT algorithm detects all eight edges, one of them with an incorrect orientation, whereas NDPSO-BN cannot determine that edge for ASIA-1000 and ASIA-3000. The other methods give between three and six erroneous edges. Table IV compares the experimental results of the five algorithms for ALARM-1000, ALARM-2000, ALARM-3000, ALARM-5000 and ALARM-10000. A first observation is that the results of the proposed algorithm are sensitive to the size of the dataset. We obtain the lowest or second-lowest number of erroneous RE and AE for ALARM-1000, ALARM-2000, ALARM-5000 and ALARM-10000 (Table IV). The proposed GIT algorithm does not recover all the correct edges: it yields thirty-four or thirty-three CE and eleven or twelve DE. For the ALARM network with 1000 and 2000 cases, our method produces the second-greatest number of CE. These results show that the GIT algorithm learns BN topologies with some sensitivity to the number of cases. In the next section, we use the proposed BN to model the features of a diabetes dataset and the extracted causal relationships, which are the main factor in supporting diabetes diagnosis.

VI. APPLICATION OF GIT ALGORITHM FOR THE DIABETES DIAGNOSTICS
Data classification techniques play an important role in the medical field. These intelligent techniques help physicians analyse and explore large datasets. In our case study, as explained above, nearly half of diabetes patients are not diagnosed. Therefore, based on the probabilistic graphical model, we aim to represent the cause-effect relationships between the characteristic features of the diabetes dataset. Table V describes the features and the samples used (for training and test) in the diabetes dataset.
The proposed GIT algorithm is implemented using the FULLBNT project in Matlab.
Applying our proposal to the training diabetes dataset (368 samples) generates the structure shown in Fig. 7, in which the class node A1Cg is the parent node of all the given variables. These extracted dependency relationships mean that being diabetic directly or indirectly affects the rest of the features used. In addition, the calculated K2 score of the generated graph is -99543.25.
The directed acyclic graph reported in Fig. 7 describes the dependency relationships between the variables of the diabetes dataset. Specifically, the class variable (A1Cg) has a direct connection with the CIMT, PT SPT, Ba and AgeG variables. The age, A1C and DiP nodes are connected to the A1Cg variable through the CIMT node. The rest of the variables are connected to the main variable through nodes 1 and 6. If these connections are broken, the physician cannot correctly analyse the cause-and-effect relationships. The efficiency of our GIT algorithm still has to be validated and improved on the basis of expert knowledge. Completing the BN model with the conditional probability tables would make it possible to demonstrate the value of our proposal, in particular through the GIT algorithm's ability to improve classification results.
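For readers unfamiliar with the K2 score, the following minimal sketch computes the logarithm of the standard K2 metric of Cooper and Herskovits with uniform priors, log K2 = Σ_i Σ_j [ log((r_i − 1)! / (N_ij + r_i − 1)!) + Σ_k log(N_ijk!) ]. The function name, the data layout and the toy data are hypothetical; how FULLBNT computes the reported score is not stated in the text and may differ in detail.

```python
from collections import Counter
from math import lgamma, log

def log_k2_score(data, parents, arities):
    """Log K2 score of a structure.

    data: list of dicts mapping variable name to its observed state.
    parents: dict mapping each node to the list of its parents.
    arities: dict mapping each node to its number of states r_i.
    """
    score = 0.0
    for node, pa in parents.items():
        r = arities[node]
        # N_ij: count of each observed parent configuration j.
        n_ij = Counter(tuple(row[p] for p in pa) for row in data)
        # N_ijk: count of each (parent configuration j, node state k) pair.
        n_ijk = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        for n in n_ij.values():
            score += lgamma(r) - lgamma(n + r)   # log (r-1)! / (N_ij + r - 1)!
        for n in n_ijk.values():
            score += lgamma(n + 1)               # log N_ijk!
    return score

# One binary variable observed twice (states 0 and 1): score = log(1/6).
single = log_k2_score([{"X": 0}, {"X": 1}], {"X": []}, {"X": 2})

# Toy data (hypothetical) in which Y always equals X.
rows = [{"X": 0, "Y": 0}] * 4 + [{"X": 1, "Y": 1}] * 4
arity = {"X": 2, "Y": 2}
with_edge = log_k2_score(rows, {"X": [], "Y": ["X"]}, arity)
without_edge = log_k2_score(rows, {"X": [], "Y": []}, arity)
```

Unobserved parent configurations contribute zero to the sum, so they can be skipped. On the toy data the structure with the edge X→Y scores higher than the empty structure, which is the behaviour a score-based search exploits.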

VII. CONCLUSIONS
In this paper, we proposed a novel algorithm based on Information Theory and Graph Theory to solve the Bayesian network structure learning problem. Testing the GIT algorithm on well-known networks produced notable results, which were compared with those of four existing algorithms.
The results show the efficiency of our method in learning the structure of a small network (ASIA, with 8 nodes).
For this network, our GIT algorithm is superior in terms of both correct edge detection and the number of erroneous edges learned. For the ALARM network, our method generated acceptable results and, compared with the other algorithms, produced a smaller number of reversed and added edges.
To summarize, our proposal represents an effective data-driven method for BN structure learning. In future work, we will complete the BN model construction to show the effect of the learned skeleton on data classification. Moreover, we will concentrate on validation of the model by experts and further examine its efficiency for diabetes diagnosis.