The threshold EM algorithm for parameter learning in Bayesian networks with incomplete data

Bayesian networks (BN) are used in a wide range of applications, but parameter learning remains a difficulty. In real applications, training data are often incomplete or some nodes are hidden. Several parameter learning algorithms have been proposed to deal with this problem, foremost among them the EM, Gibbs sampling and RBE algorithms. In order to limit the search space and escape from the local maxima reached by the EM algorithm, this paper presents a parameter learning algorithm that combines the EM and RBE algorithms. The algorithm incorporates the range of each parameter into the EM algorithm. This range is computed by the first step of the RBE algorithm and is used to regularize each parameter of the Bayesian network after the maximization step of the EM algorithm. The threshold EM algorithm is applied to brain tumor diagnosis and shows some advantages and disadvantages over the EM algorithm.


I. INTRODUCTION
Machine learning is now considered one of the essential tools for making decisions and solving problems that involve uncertainty. It automates methods that help an expert reach an effective decision in several areas. It relies on artificial intelligence, which combines the concepts of learning, reasoning and problem solving. In recent years, Bayesian networks have become important tools for modeling uncertain knowledge. They are used in various applications such as information retrieval [6,14], data fusion [5], bioinformatics [11], classification [12,13] and medical diagnostics [2].
Bayesian networks are graphical models that apply these concepts by modeling a given problem as a causal structure: a graph indicating the independence relations between the different actors of the problem, together with quantitative information in the form of conditional probability tables. Clear semantics and comprehensibility by humans are the major advantages of using Bayesian networks for modeling applications. They also make a causal interpretation of the learned models possible.
Learning in Bayesian networks is divided into two types. The first is to learn the parameters when the structure is known. The second is to learn the structure and the parameters at the same time. In this paper, we assume that the structure is known. Parameter learning in this case falls into two categories. If the training data are complete, the problem is solved by a statistical approach or a Bayesian approach. In real applications, complete training data are difficult to obtain for various reasons. When data are incomplete, two classical approaches are usually used to determine the parameters of a Bayesian network: the EM algorithm [1] and Gibbs sampling [3].
Other methods have been proposed to deal with the disadvantages of these classical approaches. The most robust is the RBE algorithm [8]. In order to regularize the learning problem, some modifications are needed to reduce the search space and help escape from local maxima.
These problems in parameter learning motivate us to modify the existing parameter learning algorithm for the case where the network structure is known and the data are incomplete.

II. LEARNING BAYESIAN NETWORK PARAMETERS
A Bayesian network is defined by a set of variables $\chi = \{X_1, X_2, \ldots, X_n\}$ that represent the actors of the problem and a set of edges that encode the conditional independence relations between these variables. If there is an arc from $X_i$ to $X_j$, then $X_i$ is called a parent of $X_j$; the set of parents of $X_j$ is denoted $pa(X_j)$. Each node is conditionally independent of its non-descendants given its parents. The joint distribution of all nodes is described as:

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid pa(X_i))$$

Each node is described by a conditional probability table, which we denote by the vector $\theta$. The entire vector is composed of a set of parameter values $\theta_{i,j,k}$, defined by:

$$\theta_{i,j,k} = P(X_i = x_k \mid pa(X_i) = x_j)$$

where $i = 1 \ldots n$ ranges over all variables, $k = 1 \ldots r_i$ ranges over all possible states of $X_i$ and $j = 1 \ldots q_i$ ranges over all possible parent configurations of node $X_i$. The process of learning parameters in Bayesian networks is discussed in many papers. The goal of parameter learning is to find the most probable $\theta$ that explains the data.
Equations (3) and (4) do not apply when the training data are incomplete.
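As a minimal illustration of the factorization described above, the following Python sketch multiplies the conditional probability table entries of each node to obtain the joint probability of a full assignment. The two-node network A -> B, its table values and the names `parents`, `cpt` and `joint_probability` are hypothetical and are not taken from the paper.

```python
from itertools import product

# Hypothetical two-node network A -> B with binary states 0/1 (not from the paper).
parents = {"A": (), "B": ("A",)}

# cpt[node][parent_configuration][state] = P(node = state | parents = configuration)
cpt = {
    "A": {(): [0.7, 0.3]},
    "B": {(0,): [0.9, 0.1], (1,): [0.2, 0.8]},
}


def joint_probability(assignment):
    """Product of P(X_i | pa(X_i)) over all nodes for a full assignment."""
    p = 1.0
    for node, node_parents in parents.items():
        config = tuple(assignment[q] for q in node_parents)
        p *= cpt[node][config][assignment[node]]
    return p


# Summing over every full assignment should give 1 for a valid network.
total = sum(
    joint_probability({"A": a, "B": b}) for a, b in product([0, 1], repeat=2)
)
```

Each inner list of `cpt` plays the role of the parameters $\theta_{i,j,k}$ for one node and one parent configuration, so it must sum to 1.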

A. Learning parameter with complete data
In the case where all variables are observed, the simplest and most widely used method is statistical estimation. It estimates the probability of an event by the frequency of occurrence of that event in the database. This approach, called maximum likelihood (ML), gives:

$$\hat{\theta}^{ML}_{i,j,k} = \frac{N_{i,j,k}}{\sum_{k'=1}^{r_i} N_{i,j,k'}}$$

where $N_{i,j,k}$ is the number of events in the database for which the variable $X_i$ is in state $x_k$ and its parents are in configuration $x_j$. The principle of Bayesian estimation is somewhat different: it seeks the most probable parameters given that the data were observed. It uses a Dirichlet distribution as the prior over the parameters, written as:

$$P(\theta) \propto \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{i,j,k}^{\alpha_{i,j,k} - 1}$$

where $\alpha_{i,j,k}$ are the parameters of the Dirichlet distribution associated with the prior distribution. The maximum a posteriori (MAP) approach gives:

$$\hat{\theta}^{MAP}_{i,j,k} = \frac{N_{i,j,k} + \alpha_{i,j,k} - 1}{\sum_{k'=1}^{r_i} \left( N_{i,j,k'} + \alpha_{i,j,k'} - 1 \right)}$$
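The two estimators can be sketched in a few lines of Python for a single node and a single parent configuration. This is only an illustration: the counts and Dirichlet parameters are made up, the function names are hypothetical, and the MAP form used is the standard Dirichlet one, (N + α − 1) normalized over the states, which is assumed here rather than taken from the paper.

```python
def ml_estimate(counts):
    """Maximum likelihood: relative frequency of each state."""
    total = sum(counts)
    return [n / total for n in counts]


def map_estimate(counts, alphas):
    """MAP with a Dirichlet prior, assuming the standard (N + alpha - 1) form."""
    adjusted = [n + a - 1 for n, a in zip(counts, alphas)]
    total = sum(adjusted)
    return [v / total for v in adjusted]


# Hypothetical counts N_{i,j,k} for one node i and one parent configuration j.
counts = [3, 0, 1]
alphas = [2, 2, 2]

theta_ml = ml_estimate(counts)            # the unseen state gets probability 0
theta_map = map_estimate(counts, alphas)  # the prior keeps every state above 0
```

With these counts the ML estimate assigns probability zero to the state that never occurs, whereas the MAP estimate does not. Both functions divide by a total that must be non-zero, so a parent configuration that never appears in the data needs separate handling.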

B. Learning parameter with incomplete data
In most applications, databases are incomplete. Some variables are observed only partially or never. The classical approaches are the EM, Gibbs sampling and RBE algorithms. These algorithms are approximate, except RBE, which determines a lower bound and an upper bound for each parameter in the Bayesian network.
The most commonly used method of parameter estimation with incomplete data is based on the iterative Expectation-Maximization (EM) algorithm proposed by Dempster [1] and applied to Bayesian networks in [7].
The EM algorithm proceeds as follows: repeat the expectation and maximization steps until convergence.
Each iteration ensures that the likelihood function does not decrease, and the algorithm eventually converges to a local maximum. However, when several nodes have a large number of missing values, learning with EM converges quickly to a local maximum. In the first step, the algorithm starts from arbitrary quantities and uses them to compute the expected values of the missing data. The second step takes these expected values and maximizes the likelihood with respect to the unknown parameters.
The results of the second step are used as the starting quantities in the next expectation step. The algorithm stops when the difference between successive estimates is smaller than a fixed threshold or when the number of iterations exceeds a fixed maximum.
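The loop described above can be sketched in Python for a deliberately small case. The sketch assumes a hypothetical two-node network A -> B with binary states, where A may be missing (`None`) and B is always observed. The function name `em`, the data and the starting points are illustrative and are not the authors' implementation.

```python
def em(records, theta_a, theta_b, tol=1e-6, max_iter=100):
    """EM for a hypothetical network A -> B with binary states.

    records: list of (a, b) with a in {0, 1, None} and b in {0, 1}.
    theta_a[a] = P(A=a); theta_b[a][b] = P(B=b | A=a).
    Assumes every state of A keeps a non-zero expected count.
    """
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # E-step: expected counts, using P(A | B) for records where A is missing.
        n_a = [0.0, 0.0]
        n_ab = [[0.0, 0.0], [0.0, 0.0]]
        for a, b in records:
            if a is None:
                weights = [theta_a[v] * theta_b[v][b] for v in (0, 1)]
                z = sum(weights)
                posterior = [w / z for w in weights]
            else:
                posterior = [1.0 if v == a else 0.0 for v in (0, 1)]
            for v in (0, 1):
                n_a[v] += posterior[v]
                n_ab[v][b] += posterior[v]
        # M-step: maximum likelihood on the expected counts.
        new_a = [n / len(records) for n in n_a]
        new_b = [[n_ab[v][b] / n_a[v] for b in (0, 1)] for v in (0, 1)]
        # Stopping rule: largest change between successive estimates.
        change = max(
            max(abs(x - y) for x, y in zip(new_a, theta_a)),
            max(abs(new_b[v][b] - theta_b[v][b]) for v in (0, 1) for b in (0, 1)),
        )
        theta_a, theta_b = new_a, new_b
        if change < tol:
            break
    return theta_a, theta_b, iteration


# Hypothetical data: two records have a missing value for A.
records = [(0, 0), (0, 0), (1, 1), (None, 1), (None, 0), (1, 1)]
start_a = [0.5, 0.5]
start_b = [[0.6, 0.4], [0.3, 0.7]]
learned_a, learned_b, n_iter = em(records, start_a, start_b)
```

The result depends on `start_a` and `start_b`, which is the sensitivity to starting points that the text attributes to EM.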
Algorithm: EM algorithm
... use the estimated data to apply the learning procedure (for example, maximum likelihood).

The second algorithm is Gibbs sampling [3], as described by Heckerman. Gibbs sampling is a general method for probabilistic inference. It can be applied to all types of graphical models, whether the arcs are directed or not and whether the variables are discrete or continuous. Gibbs sampling is a special case of MCMC (Markov Chain Monte Carlo), a family of methods that generate a chain of samples by accepting or rejecting candidate points. In other words, Gibbs sampling completes the sample by inferring the missing data from the available information. For parameter learning, Gibbs sampling converges slowly or fails to reach a solution if the number of hidden variables is very large.
The third algorithm is the Robust Bayesian Estimator (RBE) [8]. It is composed of two steps, Bound and Collapse [10]. The first step computes a lower bound and an upper bound for each parameter of the Bayesian network. The second step uses a convex combination to determine the value of $\theta_{i,j,k}$. RBE is a procedure that runs through all the observations recorded in the data D and bounds the conditional probability of a variable $X_i$. The procedure begins by identifying the following virtual frequencies:
• $n(X_i = x_k \mid ?)$: the number of observations in which the variable $X_i$ takes the value $x_k$ and the value of $pa(X_i)$ is not completely observed.
• $n(? \mid pa(X_i) = x_j)$: the number of observations in which the parents $pa(X_i)$ take the value $x_j$ and the value of $X_i$ is missing.
• $n(? \mid ?)$: the number of observations in which the values of both $X_i$ and $pa(X_i)$ are unknown, so that the value of $pa(X_i)$ can be completed as $x_j$.
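The three virtual frequencies listed above can be counted directly from the records. The following Python sketch assumes a node with a single parent, so that "not completely observed" simply means the parent value is `None`. The function name and the data are hypothetical.

```python
def virtual_frequencies(records, x_k, x_j):
    """Count the three virtual frequencies for one (state, parent value) pair.

    records: list of (child, parent) pairs; None marks a missing value.
    Assumes a single parent, so a partially observed parent set cannot occur.
    """
    # n(X_i = x_k | ?): child observed as x_k, parent missing.
    n_child_only = sum(1 for c, p in records if c == x_k and p is None)
    # n(? | pa(X_i) = x_j): parent observed as x_j, child missing.
    n_parent_only = sum(1 for c, p in records if c is None and p == x_j)
    # n(? | ?): both values missing.
    n_both_missing = sum(1 for c, p in records if c is None and p is None)
    return n_child_only, n_parent_only, n_both_missing


# Hypothetical records for a binary child with a binary parent.
records = [(1, 0), (1, None), (None, 0), (None, None), (0, None), (None, 1)]
frequencies = virtual_frequencies(records, x_k=1, x_j=0)
```

Every record counted here could be completed so that it supports $X_i = x_k$ and $pa(X_i) = x_j$, which is why these counts drive the gap between the lower and upper bounds.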
These frequencies are used to compute two quantities for the database D: the minimum and the maximum number of observations that may have the characteristics $X_i = x_k$ and $pa(X_i) = x_j$.
The virtual frequencies defined above can be zero, in which case the distribution is a Dirichlet distribution with parameters $\alpha_{i,j,k}$. A detailed example in [8] shows how conditional probabilities are calculated by determining the minimum and maximum bounds of the interval. This phase of determining $min_{i,j,k}$ and $max_{i,j,k}$ depends only on the frequencies of the observed data in the database and on the virtual frequencies calculated by completing the records. The major advantage of this method is that it is independent of the distribution of the missing data and does not try to infer it.
To find the best parameters with this method, a second phase is necessary. It estimates the parameters using a convex combination of the distributions calculated for each node. This convex combination can be determined either from external knowledge about the missing data or by a dynamic estimate based on the valid information in the database. A description of this phase is given in [10].

III. THE THRESHOLD EM ALGORITHM FOR PARAMETER LEARNING IN BAYESIAN NETWORK WITH INCOMPLETE DATA
The set of parameters obtained for a Bayesian network with the EM algorithm is approximate. In addition, the bound step of the RBE algorithm gives a lower bound and an upper bound for each parameter in the network, defined by:

$$min_{i,j,k} \le \theta_{i,j,k} \le max_{i,j,k} \qquad (12)$$

Our work consists of optimizing the Bayesian network parameters with the EM algorithm while checking them against the bound step of the RBE algorithm. To do so, the threshold EM algorithm verifies the constraint in equation (12) after the two steps of the EM algorithm. Let $\theta_{i,j,k}(t)$ be the maximized parameter after the execution of the two steps of the EM algorithm. The threshold EM algorithm is composed of three steps. The first two steps are the same as in the EM algorithm. The third step regularizes $\theta_{i,j,k}(t)$ using the constraint in equation (12). The main actions in this step are:
i) If $\theta_{i,j,k}(t) < min_{i,j,k}$, then $\theta_{i,j,k}(t)$ is set to $min_{i,j,k}$.
ii) If $\theta_{i,j,k}(t) > max_{i,j,k}$, then $\theta_{i,j,k}(t)$ is set to $max_{i,j,k}$.
iii) If $min_{i,j,k} \le \theta_{i,j,k}(t) \le max_{i,j,k}$, then $\theta_{i,j,k}(t)$ is kept as it is.
These changes can violate the probability constraint defined in equation (13):

$$\sum_{k} \theta_{i,j,k} = 1 \qquad (13)$$

A normalization step is therefore necessary to satisfy equation (13). This step is described by equation (14).
The newly calculated parameters are used as input to the next iteration of the threshold algorithm. This is repeated until convergence. The stopping criteria are the same as for the EM algorithm. The third step forces the solution to lie between the bounds calculated by the bound step of the RBE algorithm. In the worst case, the solution moves in the direction that reduces the violations of the constraint in equation (12). We are now ready to present the threshold EM algorithm for parameter learning in Bayesian networks with missing data, as summarized in Table 1.
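The regularization and normalization steps can be sketched in Python as follows. The sketch assumes that an out-of-range parameter is moved to the bound it violates and that normalization divides each value by the sum over the states. Both are assumptions made for illustration, since equation (14) is not reproduced here, and the numbers are made up.

```python
def threshold_step(theta, lower, upper):
    """Clip each parameter to [lower, upper], then renormalize to sum to 1.

    Assumption: violations are moved to the nearest bound, and normalization
    divides by the sum over the states of the node.
    """
    clipped = [min(max(t, lo), hi) for t, lo, hi in zip(theta, lower, upper)]
    total = sum(clipped)
    return [c / total for c in clipped]


# Hypothetical output of the maximization step for one node and one parent
# configuration, with hypothetical bounds from the RBE bound step.
theta = [0.70, 0.05, 0.25]
lower = [0.30, 0.10, 0.20]
upper = [0.60, 0.30, 0.50]

regularized = threshold_step(theta, lower, upper)
```

In this example the first parameter still exceeds its upper bound slightly after normalization, but by less than before the step, which matches the remark that in the worst case the solution moves toward reducing the violations of the constraint.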

Repeat until convergence:
Step 1: Expectation step, to compute the conditional expectation of the log-likelihood function.
Step 2: Maximization step, to find the parameter $\theta_{i,j,k}(t)$.
Step 3: Regularization step, to enforce the constraint in equation (12); if $min_{i,j,k} \le \theta_{i,j,k}(t) \le max_{i,j,k}$, then $\theta_{i,j,k}(t)$ is kept as it is.
Step 4: Normalization step based on equation (14).

We describe in Table 2 an example of one iteration of the threshold algorithm. We see that the new parameters calculated in one step reduce the violations of the constraint in equation (12).

In this section, we compare our algorithm to the EM algorithm. We apply this work to brain tumor diagnosis. We use the Bayes Net Toolbox (BNT) by Murphy to test our algorithm. The Bayesian network shown in Fig. 1 is used in these experiments. 72 instances were collected from real diagnoses; not all the variables are instantiated.
The dataset used to learn the Bayesian network parameters is composed of 72 instances of each node, taken from real cases collected by a specialist in brain tumor diagnosis. All the nodes are discrete and take between 2 and 8 values. The percentage of missing data in this dataset is 37.16%. The majority of the missing data are in the intermediate nodes of the Bayesian network. The missing data are caused by the quality of the MRI images or by the doctor omitting some details from the report.

Tables II and III.

We show in Figure 3 the comparison between the EM algorithm and the threshold EM algorithm (TH_EM) in terms of the log-likelihood function, which is defined as follows:

$$\log L(\theta) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{i,j,k} \log \theta_{i,j,k}$$

where:
$n$ is the number of nodes;
$q_i$ is the number of parent configurations of node $i$;
$r_i$ is the number of states of node $i$;
$N_{i,j,k}$ is the number of cases where node $i$ is in state $k$ and its parents are in configuration $j$;
$\theta_{i,j,k}$ is the parameter value where node $i$ is in state $k$ and its parents are in configuration $j$.

This test is applied with the same starting points for the two algorithms. We see that our algorithm converges more quickly than the EM algorithm. This result is observed in 70% of cases when we change the starting points of the two algorithms (Figure 4). In addition, the probability distribution in each node is modified: each probability lies between the two bounds calculated by the first step of the RBE algorithm, or the error rate becomes smaller. One advantage of our algorithm is the absence of zero probabilities in each probability distribution.

The convex combination of the two bounds calculated in the first step of the RBE algorithm requires external information to obtain the parameters of any Bayesian network. This task becomes difficult when the structure is complex.
Our proposed method removes the need for this information when computing the conditional probability tables of our Bayesian network.
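The log-likelihood used for the comparison in Figure 3 can be evaluated from the counts and the parameters with a short Python function. This is a sketch under assumptions: the natural logarithm is used, a term with a zero count contributes nothing, and the nested-list layout and the function name are hypothetical.

```python
import math


def log_likelihood(counts, theta):
    """Sum of N_{i,j,k} * log(theta_{i,j,k}) over nodes, configurations, states.

    counts[i][j][k] and theta[i][j][k] share the same nested layout.
    Assumptions: natural logarithm; terms with a zero count are skipped.
    """
    total = 0.0
    for n_i, t_i in zip(counts, theta):
        for n_ij, t_ij in zip(n_i, t_i):
            for n, t in zip(n_ij, t_ij):
                if n == 0:
                    continue
                if t == 0.0:
                    # An observed case with zero probability makes the data impossible.
                    return float("-inf")
                total += n * math.log(t)
    return total


# Hypothetical single-node example with one (empty) parent configuration.
counts = [[[3, 1]]]
theta = [[[0.75, 0.25]]]
value = log_likelihood(counts, theta)
```

The `-inf` branch shows why the absence of zero probabilities matters: a single zero entry paired with a non-zero count drives the whole log-likelihood to minus infinity.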

V. CONCLUSION
In real applications, training data for Bayesian networks are often incomplete or some nodes are hidden. Several parameter learning algorithms have been proposed, foremost among them the EM, Gibbs sampling and RBE algorithms. In order to limit the search space and escape from the local maxima reached by the EM algorithm, this paper presents a parameter learning algorithm that combines the EM and RBE algorithms. The algorithm incorporates the range of each parameter into the EM algorithm. The threshold EM algorithm is applied to brain tumor diagnosis and shows some advantages and disadvantages over the EM algorithm.