Medical Image Retrieval Based On the Parallelization of the Cluster Sampling Algorithm

In this paper we develop parallel cluster sampling algorithms and show that a multi-chain version is embarrassingly parallel and can be used efficiently for medical image retrieval among other applications.


Introduction
Signal retrieval from noisy measurements (observations) involves solving an inverse problem. Inverse problems are essential in many fields such as image reconstruction or retrieval, tomography, weather prediction, and other predictions based on spacetime models. The solution of inverse problems in case of space-time models usually employs a data assimilation (DA) [6,2,4] methodology. DA refers to the process of fusing information about a physical system obtained from different sources in order to produces more accurate conclusions about the physical system of concern.
Two approaches are widely employed to solve an inverse problem. The first approach is a variational approach that involves solving an optimization problem with a regularized solution. The second approach is the statistical formulation of the DA problem which incorporates a prior distribution that encapsulates the knowledge about the system produced by the model, prior to the incorporation of any other source of information. Given the prior information, a likelihood function, the posterior is formulated as best estimate of the truth.
Markov Chain Monte-Carlo (MCMC) is one of the most powerful simulation techniques for sampling a high-dimensional probability distribution, given only it's shape function, without the intrinsic need to the associated scaling factor. HMC sampling filter [3] is an accelerated Markov chain Monte-Carlo (MCMC) algorithm for solving the non-Gaussian sequential DA "filtering problem". This algorithm works by sampling the posterior distribution to produce description of the system state along with associated uncertainty. Specifically, these algorithms follow a Hamiltonian Monte-Carlo (HMC) approach to sample the posterior.
Cluster sampling filters (C HMC, and MC-C HMC) [1] are developed as extension of the Hamiltonian Monte-Carlo (HMC) sampling filter presented in [3] where the true (unknown) prior distribution is approximate using a Gaussian mixture model (GMM).
Given the current computational power, it is natural to try to run Monte-Carlo simulations in parallel. However, Markov chains in general have to satisfy the socalled the "Markovian" property which makes the chain generation an inherently sequential problem. This restriction is mainly posed by the transition density function used to generate a proposal state given the current state of the chain.
Two approaches have gained wide popularity to parallelize MCMC samplers. The parallel-chain approach proceeds by running several chains in parallel from different initial states. The main disadvantage of this approach is that the burn-in stage has to be carried out by all chains independently, which limits the efficiency gained by running the chains on different processors. Another difficulty with this approach is the aggregation of samples generated on different processors such that the combined ensemble correctly represents the mass of the target distribution. The second approach is to parallelize a single chain.
The parallel chain approach turns out to be surprisingly effective in practice. Moreover, if sufficient information about the geometry of the target distribution is available, we can guide the parallel chains to sample effectively from the target distribution.
The accuracy of C HMC filters to handle nonlinearity in both model dynamics, and observational mapping operator, puts it on the right direction of applicability to practical problems. The cost of serial C HMC is nearly similar to the cost of the original HMC sampling filter, however the MC-C HMC algorithm is naturally parallelizable.
Following a Bayesian approach, C HMC algorithm can be easily modified and applied for image retrieval given noisy image and a probabilistic representation of prior knowledge. This can be very useful in settings where several medical snapshots are collected for the same object, e.g. a tumor, of different resolution or uncertainty levels.
Mathematical regularization is amongst the most popular methods for image reconstruction from noisy sources [10]. Among the regularization methods, the Tikhonov scheme is most popular due to the Gaussianity assumption about data noise, and the easiness to incorporate prior information. Disbite simplicity, the perfomace of this approach is highly influenced by the choice of the regularization parameter.
Widely used methodologies for solving the Bayesian image retrieval problem include the algorithms discussed in [9,5]. In [5], the authors investigate statistical image reconstruction (SIR) with regularization based on the Markov random field (MRF) model.
While, regularization approach is popular, it is sensitive to the choice of the regularization parameter.
Our main interest here is to develop highly accurate parallel Bayesian sampling algorithms that can be efficiently used for solving large-scale inverse problems, and show that they are suitable for a wide spectrum of applications including medical image retrieval.
In this work, we develop parallel cluster sampling algorithms, and show that a multi-chain version is embarrassingly parallel, and can be used efficiently for medical image retrieval amongst other applications. The approach discussed in this work does not require regularization, and is designed to work in both Gaussian, and non-Gaussian case, where the computationl expense is minimized via parallelization. Specifically, in this paper, we focus on describing the complexity analysis of a specific scenario where the MC-C HMC is parallelized by running several chains in parallel to sample the posterior distribution. The algorithm proceeds by running several Markov chains in parallel such that the number of chains is specified by the the number of components in the mixture model. We will focus on the case where an ensemble of states, generated from an unknown prior distribution is available, and the likelihood function relating observations to target states is either a linear or a nonlinear map. The prior distribution is approximated using a Gaussian mixture distribution which parameters are approximated based on the given prior ensemble by running an expectation maximization (EM) algorithm.
In Section 2 we review the general iterative and Bayesian frameworks for inverse problems and image reconstruction. Section 3 formulates the problem, and reviews the C HMC filter formulation. In Section 4 we discuss opportunities for parallelization of C HMC. Section 5 presents a detailed complexity analysis of the proposed parallel version of C HMC filter. Numerical results are presented in Section 6. Conclusions and future works are drawn in Section 7 2 Iterative and Bayesian Image Reconstruction As mentioned in Section 1, one of the most popular iterative reconstruction algorithms is Regularization-based algorithms. For the sake of completeness, we review the Tikhonov regularization approach [10,8] next, then we present the Bayesian formulation.
The Tikhonov regularization approach involves solving the following optimization problem: where α is the regularization parameter, and C is the regularization matrix and it can be chosen in many clever ways. Here H is an observation operator that maps the model space to the observation space. If the target state is directly observed then H = I, where I is the identity operator. The weighted norm in Equation (1) is described as follows: The traditional approach to regularization is the variational formulation in which equation (1) is minimized w.r.t x. Usually, derivative-based iterative minimizaiton algorithms are employed to solve the problem described by (1). The derivative of the objective function T (H(x)) w.r.t the parameter ]x is given by: where [∂H] * is the adjoint of the derivative, e.g. the Jacobian-transpose, of the observation operator H. In the case of a linear observaiton operator this is simply the transpose of the observation operator.
In the statistical approach we infer the underlying state x based on a formulation constructed using Bayesian theory, where the goal is to represent the state as a random variable which distribution is of interest. Assume P b (x) is a prior probability density function (PDF) representing prior knowladge about the state x. Assume also that P(y|x) is the data likelihood function that describes the observational error distribution. Using Bayes' theorem, the probability of the state x given the collected meauserments is characterized by the posterior distribution: A common practice is to assume that the prior distribution is, generally speaking, a multivariate normal (Gaussian) distribution centered around a forecasted or a first-guess state x b , and have a predefined covariance structure C, i.e. x ∼ N (x b , C).
For Gaussian priors and consequently Gaussian posteriors, the variational approach corresponds to finding the maximum aposterior estimate (MAP) of the posterior PDF. The MAP is the maximizer of the posterior PDF, or equivalently, the minimizer of its negative logarithm − log (P(x|y)). Following Tikhonov regularization approach (1), and assuming Gaussian noise, the likelihood function reads: Without loss of generality, if we assume x ∼ N(0, C), the prior would be on the form where B is the precision matrix, i.e. the inverse of the covariance B = C −1 . The MAP estimator in this formulation is the minimizer of This shows the equivalence between Tikhonov regularization approach with the Bayesian formulation in the Gaussian linear settings.
In the Bayesian approach, once the posteiror is constructed, a sampling mechanism is usually employed to estimate all the desired statistics of the posterior PDF, such as the posteiror mean E(x|y) that can be used as a reliable estimate of the state given the data. Moreover, the generated ensemble can be used to estimate the posteiror covariance that can be used as a proxy prior error covariance for future applications of the inverse problem. Sampling the posterior PDF is usually carried out following a Monte-Carlo approach. The most powerful Monte-Carlo sampling methodology is the general family Markov-Chain Monte Carlo (MCMC) samplers. Sampling high dimensional distribution however is a very expensive process, and requires parallel efficient implementation to be considered practical. As explained in the next Section, MCMC is not limited to Gaussian or linear settings, and can be very efficient if implemented in parallel.

Cluster Sampling Filter
Let x ∈ R nvar is a discretized approximation of the true state of the model, for example the entensities of an image pixels.
The prior distribution P b (x) encapsulates the knowledge about the system state before additional information is incorporated. The likelihood function P(y|x) quantifies the deviation of the prediction of model observations from the collected measurements y ∈ R n obs , where n obs ≤ n var .
From Bayes' theorem, the posterior distribution P a (x) reads: where P b (x) is the prior distribution, P(y|x) is the likelihood function. P(y) acts as a scaling factor and is ignored in in the MCMC context. Assuming the prior distribution is approximated by a Gaussian Mixture distribution, the prior takes the form: where the weight τ i quantifies the probability that an ensemble member x[e] belongs to the i th component, and (µ i , Σ i ) are the mean and the covariance matrix associated with the i th component of the mixture model. Here x ∈ R nvar , where n var the dimension of the target state space.
Assuming the observation errors are characterized by a Gaussian distribution N (0, R), the likelihood reads: where R ∈ R m×m is the observation error covariance matrix, and H : R nvar → R m is the observation operator that maps the state space to the observation space. Here y ∈ R m is the observation vector. From Equations (8), (9), (10), the posterior takes the form: This mixture distribution may not correspond to a Gaussian mixture in general if the observation operator is a nonlinear map. The negative-logarithm (negative-log) of the posterior distribution kernel (11) is required by the HMC sampling algorithm. Specifically, the posterior negative-log is viewed as the potential energy function in the extended Hamiltonian phase space. The posterior negative-log is given by: The derivative of the posterior negative-log reads: In this work, we sample the posterior distribution (11) following a parallel chains approach given only ensemble of states generated from the prior distribution.

Parallelization of C HMC Filter
In this section, we present a brief review of the MCMC sampling algorithm, and discuss opportunities of running C HMC filter on parallel architecture. We discuss the parallelism of C HMC filter even if the Hamiltonian system is replaced with a Gaussian proposal kernel build around the forecast.

Markov Chain Monte-Carlo (MCMC)
MCMC is a sampling scheme capable of producing ensembles from an arbitrary distribution given it's shape function, without the need for the associated scaling factor. The choice of the proposal kernel has the greatest influence on the performance of the sampler. Here, we chose two proposal kernels; a) a Gaussian density function centered around the current state of the chain, b) Hamiltonian Monte Carlo (HMC). The standard MCMC sampler is described in Algorithm 1.
Input : An initial state for the chain (x 0 ), and the proposal kernel q Output: An ensemble of states from the posterior distribution ∝ π(x) 1 Initialize the chain to the state x 0 ; 2 Initialize k = 0 ; 3 while No sufficient samples are collected do 4 Given the current state x k , use q propose a state x ; 5 Calculate the acceptance probability (MH): Sample a uniform random number u k ∼ U(0, 1) ; Reject the proposal: The standard MCMC algorithm1 generally suffers from random walk behaviour, slow convergence to the target density, low acceptance rate, and slow space exploration. Moreover, the generated samples are highly correlated when the vanilla MCMC algorithm is used. Many of these problems can be addressed by using Hybrid Monte Carlo (HMC). HMC uses a Hamiltonian system which plays the role of the proposal density.

The multi-chain MCMC algorithm (MC-MCMC)
Although the traditional MCMC algorithm is inherently serial, there are several modifications that can be made to allow for parallelization. In our approach, instead of constructing a single long Markov chain to produces n ens samples, we generate several shorter Markov chains and divide the ensemble size n ens over these chains.
The constructed chains can run in parallel to sample different regions of the target distribution independently.
The parallel (MC-MCMC) sampler starts by running an Expectation Maximization step to build a Gaussian Mixture Model (GMM) approximation of the prior distribution.
Once the GMM is constructed on the root processor, the GMM information is broadcasted to all the working nodes. Of course, if the number of processors is exactly the same as the number of components, each node is assigned one chain. If the number of processors is less than the number of components/chains, we can assign several chains to each processor, e.g. based on the local ensemble sizes or simply in a round-robin fashion. Caution has to be exercised to maintain a balanced load. Once all chains have generated their assigned local samples, the ensembles are gathered to the root processor and returned as output. Of course parallel output can be considered to reduce the communication overhead. The steps of the proposed MC-MCMC scheme is detailed in Algorithm 2. While parallelization of HMC itself can be considered for further increase in the performance, it has been avoided for clarity and to simplify the idea. We elected to use the Master-Slave parallel pattern, where a master core sends the information required for creating the individual chains.

Complexity Analysis
In this section we provide a detailed theoretical discussion of the computational cost of the proposed parallel algorithm. Since sampling from a Gaussian distribution is essential for the two flavors tested here, we start with discussion the computational cost of sampling from a Gaussian distribution

Cost of sampling a Gaussian N (µ, Σ)
A scalar normal distribution can be sampled using many accurate algorithms such as the Mersenne Twister, Box-Muller transform, Marsaglia polar method, and Ziggurat algorithm. The least expensive is Ziggurat algorithm, where a typical value produced only requires the generation of one random floating-point value and one random table index, followed by one table lookup, one multiply operation and one comparison. The cost of generating an n var −dimensional standard normal random vector x ∈ R nvar is O(n var ). To generate a multivariate normal (MVN) random vec-tor Ziggurat algorithm from a general Gaussian distribution N (µ, Σ) the following steps are required: 1. Factorization of the covariance matrix Σ, to generate Σ 1 2 , e.g. using Cholesky decomposition 2. Draw a standard normal random vector y ∈ N (0 I), 3. Scaling: x = Σ 1 2 y + µ If the covariance matrix Σ is diagonal, the factorization costs O(n var ), while the cost of Cholesky decomposition in general is O(n 3 var ). If the covariance matrix Σ is diagonal, the cost of the scaling step is O(n var ), otherwise the scaling cost is O(n 2 var ). To find the total cost of MCMC sampling algorithm, we need to evaluate the total number of steps in the Markov chain. If the ensemble size is n ens , we need to construct a chain of length b s + m s × n ens , where b s are burn-in steps carried out to achieve convergence to the stationary distribution, and m s are mixing steps introduced to improve the sampler mixing and reduce the correlation between selected ensembles Total cost of MCMC sampling: The total cost of a serial MCMC with with a Gaussian proposal density reads:

Cost of MC-MCMC sampling
The serial complexity T s of MC-MCMC is summarized as follows: Parallel cost: The parallel complexity can be studied clearly under a simplified (ideal) assumption, where each chain is to sample nens nc ensemble points. In this case, since we have n c chains the parallel cost (discarding the communication cost) of MC-MCMC is given by: If we discard the burn-in stage, i.e. set b s = 0, the speedup, and the parallel efficiency simplify to: , and, It follows that both speedup, and parallel efficiency are independent from the state space dimension n var .
Communication overhead: Assuming serial GMM run on the root node, the cost of broadcasting GMM information to all nodes is the cost of broadcasting, the means, the covariance and/or precision matrices, and the weights.
Assuming a linear communication model [7], and assuming that t s , and t w are the startup, and the per-word transfer times respectively, the cost of broadcasting GMM information to p nodes is given by: diagonal, or spherical GMM covariances t s + t w ( n 2 var n c + n var + 1) n c log (p) ; tied GMM covariances t s + t w (n 2 var + n var + 1) n c log (p) ; full GMM covariances (19) After sampling in parallel, the collected ensembles are gathered on the root node, at a cost (t s + t w (n ens × n var )) log (p) . Consequently, the total communication cost reads: Total parallel cost: It follows immediately from the discussion above, that the total parallel cost of MC-MCMC sampling algorithm simplifies to: Total overhead: The total overhead function (T o = P T p − T s ) of MC-MCMC reads: 2ts + tw nens nc + 2 nvar + 1 nc p log (p) + O ((nc − 1) bs nvar) ; diagonal, or spherical covariances of GMM, and proposal density 2ts + tw nens+nvar nc + 1 nvar + 1 nc p log (p) + O (nc − 1) bs n 2 var ; tied covariances of GMM 2ts + tw ( nens×nvar nc + n 2 var + nvar + 1) nc p log (p) + O (nc − 1) bs n 2 var ; otherwise 2ts + tw nens nc + 2 nvar + 1 nc p log (p) ; diagonal, or spherical GMM covariances 2ts + tw nens+nvar nc + 1 nvar + 1 nc p log (p) ; tied GMM covariances 2ts + tw ( nens×nvar nc + n 2 var + nvar + 1) nc p log (p) ; full GMM covariances Isoefficiency: Assuming the burn-in stage is discarded, i.e. set b s = 0. With k t w n ens n var p log (p) ; diagonal, or spherical GMM covariances, and HMC, or Gaussian proposal with diagonal covariance k t w (n ens + n var ) n var p log (p) ; tied covariances of GMM, and HMC, or Gaussian proposal k t w (n ens + n var n c ) n var p log (p) ; full covariances of GMM, and HMC, or Gaussian proposal (23)

Numerical Experiments and Performance
In this section we present numerical experiments to assess the complexity analysis provided in Section 5. As mentioned above, the speedup, and parallel efficiency are independent of problem dimensionality. We discuss the computational cost and the performance using one dimensional examples.
To execute the Markov chains in parallel, we used the MPI4Py package, which provides bindings of the Message Passing Interface (MPI) standard for the Python programming language, allowing any Pythonprogram to exploit multiple processors.
we present numerical experiments to assess the complexity analysis provided in Section 5. Specifically, we show numerical results for the parallel C HMC flavors discussed in this paper using one dimensional synthetic example, and a two dimensional image retrieval experiment.

One-dimensional example:
Following the strategy described in [1], we start with a synthetic prior ensemble generated from a GMM with n c = 5. A GMM approximation of the true prior probability distribution is constructed using the EM algorithm . The model selection criterion used here is AIC. The parameters of the true GMM prior are: Assume the observation errors follow Gaussian distribution with zero mean, and variance 2.2. Assuming a synthetic observation y = −1.0, the observation likelihood function given by: The generated GMM approximation of the prior has n c = 7 and the following parameters: (26) Figure 1 shows the results of sampling a one dimensional posterior (11) with seven components. Since we are mainly interested in asymptotic behaviour of the sampler, and to avoid the effect of sampling error, the ensemble size here is set to 1000. Despite the good mass distribution resulting from the use of a Gaussian density with the serial MCMC sampler 1(a), the acceptance rate is ≈ 45% resulting in a large amount of wasted calculations. On the other hand, the parallel MC-MCMC with Gaussian proposal kernel 1(b) improves the acceptance rate to ≈ 81%. The acceptance rate is increased due to the local adjustment of the sampler hyperparameters based on the local ensemble under the corresponding prior component in the mixture. While the Gaussian-based MCMC sampler represents an acceptable mass distribution, it suffers from random walk behaviour leading to the demonstrated low acceptance rate. One the other hand HMC sampling results in general in high acceptance rate. Unfortunately, as explained by the results in Figure 1(c) is unable to sample all probability modes. The acceptance rate in the serial C HMC sampler here is 96%. By running C HMC sampling methodology in parallel, the acceptance rate drops to 94% (which is still very high). However, the mass distribution is much better than the serial case. This is supported by results in Figure 1(d) compared to Figure 1(c). Figure 2 shows the CPU time and speedup results of the clustering sampling algorithms with Gaussian and HMC proposal mechanisms. As suggested by the analysis in Section 5, the CPU time becomes flat once the number of processors reaches 7, the number of components in the mixture.
The parallel efficiency results are shown in Figure 3. The numerical results shown here suggest that, running the clustering filters in parallel not only results in computational saving, but also can potentially increase the sampling accuracy. While the discussed parallelization of the sampler reduces the computational time, the numerical results suggest that more parallelization effort is needed in order to achieve higher efficiency. In the current settings, the chains are assigned to processes in a round-robin fashion. The performance of the sampler can be greatly enhanced if the a smart scheduler is used such that the parallel chains are assigned to processors based on the sample size per chain.

Two-dimensional example
Here we test the the paralle C HMC sampling algorithm as a tool for statistical medical image retrieval. We employ a non-linear Gaussian convolution filter as a forward operator H. The convolution filter is applied to a two-dimensional image resulting in a blurred image, then Gaussian noise is added to collect synthetic measured/observed image. Here, the vector x represents intensities of the image pixels arranged in a column vector. The observation nosie level is set to be 9% of the average intensity of the original image. This formulation clearly results in a nonlinear inverse problem that can be challanging for traditional approaches such as Tikhonov regularization. For , the Jacobian of the convoluted image with respect to the intensities of the image pixels is found to be a Toeplitz matrix. The goal here is to retrieve the original image given the noisy measurement, and a sample drawn from the probability distribution from which the blurred image is drawn. This is is a relevant problem description in many cases, where several low resolution or blurrd images are taken along with the collected measurement. Figure 4 shows the original (true) image, the blurred image constructed by the convolution filter, and the noisy image, i.e. the blurred image with additive noise.  To create a synthetic non-Gaussian sample, we sampled 50 images from a Gaussian distribution centered around the blurred image with variances equal to 8% of the average entensity of the original image. To create a synthetic prior sample, we have selected n ens = 30 uniformly random distributed images from the generated Gaussin sample. This procedure is guaranteed to result in a non-Gaussian prior, and is powerful to test the GMM prior assumption.
The C HMC sampling algorithm with both Gaussian and Hamiltonian kernels performed similarily. The acceptance rates however were quite different. Specifically, the rejection rate in the case of Gaussian proposal was 56% resulting in wasted computations, while HMC rejection rate was 7%.
The initial state of each chain is chosen as the mean of the corresponding component in the prior mixture, and the parameters of the Hamiltonian system are tuned empirically to give acceptable acceptance rate. 30 sample members are collected from the posterior distribution.
The mean and median of posterior samples collectecd using the parallel C HMC sampling algorithm are shown in Figure 5 (panels 5(b), and 5(c)). For the sake of comparison to one of the most popular and widely used approaches, we show the results obtained using Tikhonov regularization approach with regularization parameter optimally tuned following an L-curve approach. The Tikhonov-regularized solution is shown in panel 5(a) of Figure 5.  By comparing results in Figure 5, to the blurred (prior) image 4(b) and the noisy image 4(c), one can see that the posterior samples produce statistics those are closer to the original image. Moreover, the retrieved results using the non-Gaussian C HMC (serial or parallel) algorithm is much better than the solution obtined by the traditional Tikhonov regularization approach. The relative error of the mean obtained using the parallel C HMC sampling algorithm is 0.01663, while the relative error of the regularized solution is 0.021447. The relative error is defined as x−x true x true , where x is the retrieved image, and x true is the true image. The proposed parallel algorithms achieve an improvement of 29% over the optimally tuned Tikhonov-based solution. These results show the capability and accuracy of the C HMC sampling algorithm in non-Gaussian settings.
While we have shown the blurred image in Figure 4, it is of utmost importance to highlight the fact that the formulation and the sampler is unaware of this blurred image. This is mainly due to the fact that we have uniformly sampled the Gaussian sample centered around this blurred image. To further enhance the retrieved image, one can follow the inverse problem solution by deblurring filter which is out of the scope of this work.

Conclusions and Future Work
In this work, we have developed parallel cluster sampling algorithms for solving Bayesian inverse problems. Specifically, we have proposed parallel sampling algorithms based on the cluster sampling filter (C HMC) developed mainly for non-Gaussian data assimilation. The proposed algorithms can be efficiently used for solving various large-scale problems including medical image retrieval from noisy observations.
We have introduced a detailed complexity analysis of the proposed parallel clustering sampling (C HMC) algorithms with mixture model represenation of the prior information. Generally speaking, aside from parallelization, the parallel versions of the algorithm result in higher acceptance rates. Specifically, the parallel C HMC increases the acceptance rate of the sampler from 44% to 93% with Gaussian proposal kernel, leading to massive saving of computations. The proposed sampling algorithms achieve an improvement of 29% over the optimally-tuned Tikhonovbased solution for image retrieval. The algorithm can run significantly faster than the serial sampler in ideal settings. However, the algorithm can be slower than the serial sampler if too many outliers exist where some chains are assigned much smaller ensemble size than the others. The C HMC sampling algorithm, in addition to desired parallelization features, has proved powerful in the context of Bayesian image retrieval.
In future work, we will investigate the possibility of parallelizing other components of the sampling algorithm such as the likelihood function and the proposal mechanisms. For example a parallel version of EM can be considered to construct the GMM approximation to the prior distribution. In the case of HMC, the symplectic integrator can be parallelized. Also, matrix-vector products can be parallelized. Methods for parallelizing a single chain can be considered for a second level of parallelization. While our implementation is not a direct parallelization of the MCMC algorithm, it still provides acceptable sampling with better performance.