Parallel Implementation of Bias Field Correction Fuzzy C-Means Algorithm for Image Segmentation

—Image segmentation in the medical field is one of the most important phases to diseases diagnosis. The bias field estimation algorithm is the most interesting techniques to correct the in-homogeneity intensity artifact on the image. However, the use of such technique requires a powerful processing and quite expensive for big size as medical images. Hence the idea of parallelism becomes increasingly required. Several researchers have followed this path mainly in the bioinformatics field where they have suggested different algorithms implementations. In this paper, a novel Single Instruction Multiple Data (SIMD) architecture for bias field estimation and image segmentation algorithm is proposed. In order to accelerate compute-intensive portions of the sequential implementation, we have implemented this algorithm on three different graphics processing units (GPU) cards named GT740m, GTX760 and GTX580 respectively, using Compute Unified Device Architecture (CUDA) software programming tool. Numerical obtained results for the computation speed up, allowed us to conclude on the suitable GPU architecture for this kind of applications and closest ones.


INTRODUCTION
In the medical image area, segmentation of anatomical structures is a key step for medical applications such as diagnostics, planning and act operation.Medical images contain a lot of information, and often few of structures are of interest.Segmentation allows visualization of the structures of interest and removing unnecessary information.It also enables structure analysis such as calculating the volume of a tumor, and performing feature-based image-to-patient as well as image-to-image registration, which is an important part of image guided surgery.
In magnetic resonance imaging, the in-homogeneities of intensities, called bias field, are caused by non-uniformities in the Radio Frequency RF field during the acquisition.The result is a shading effect where the pixel or voxel intensities of same tissue class vary slowly over the image domain.This shading can cause severe errors when attempting to segment corrupted images using intensity-based pixel classification methods.It has been shown that this shading is well modeled by the product of the original image and a smooth, very slowly varying multiplier field [1,2].
In recent literature the authors in [3] proposed modified classical fuzzy c-mean algorithm to be able to get a handle on the intensity in-homogeneities and noisy image effectively.Authors in [4] have demonstrated that the combination of the iterative nonparametric non uniformity normalization and FCM correction method brightens the signal intensity of fatty tissues and that separates the histogram peaks between the fibro glandular and fatty tissues to permit an accurate segmentation between them.In the work of Huan Jun Ding and al [5] they have investigated the feasibility of volumetric breast density quantification with two computer assisted image segmentation methods on medical magnetic resonance imaging (MRI) scans of 40 postmortem breasts.
However, the use of their method is expensive in terms of time execution.In fact, most image segmentation methods proposed in the literature are computationally expensive, especially when run on large medical datasets, and requires powerful hardware to meet desired speed processing.
It also requires several techniques and algorithmic calculation models, which may be sequential or parallel using elementary processors, cellular automata, or neural networks.www.ijacsa.thesai.org In [6] Jaber Juntu and al have discussed two approaches to remedy the problem of the bias field corruption.The one can be used as a preprocessing step where the corrupted MRI image is re-established by dividing it by an estimated bias field signal using a surface fitting approach.The other approach explains how to edit the fuzzy c-means algorithm so that it can be used to segment an MRI image degraded by a bias field signal.The authors in [7] have proposed a fast spatially constrained kernel clustering algorithm in order to segment medical magnetic resonance imaging (MRI) brain images and correct the intensity in-homogeneities.VOVK Uroš and al [8] gave a review of different methods for correction of intensity in-homogeneity in MRI, they classified the methods according to the in-homogeneity correction strategy and different, qualitative and quantitative, evaluation approaches.Earlier, authors in [9] suggested a method that has been applied to 3T and 7T, they have achieved desirable results, and also their method was robust according to initialization.According to all this studies and developed algorithms for image processing, the reach of GPUs has made a revolution in the scientific community by using the power of parallel calculations that are well suited to this type of card.
Graphic processing units (GPUs) were originally created for rendering graphics.Recently, GPUs has emerged as coprocessing units for Central Processing Units (CPU) and has become popular for general-purpose high performance computation (GP-GPU) which is mainly attractive and used by many researchers [10][11][12], to accelerate various digital signal processing applications, including medical image processing [13][14][15].GPUs are composed of hundreds of processing cores, highly decoupled, able to achieve immense power of parallel computing.To take advantage of these multi-core architectures, these applications must be parallelized.
Among the images segmentation algorithms with intensity in-homogeneities correction on GPU architecture, authors in [16] proposed an extended mask-based version of the level set method with bias field, recently presented by Li et al. [17].They develop CUDA implementations for the original full domain and the extended mask-based versions, and compare the methods in terms of speed, efficiency, and performance.The GPU implementation of their version allows a speed up of around 50−100 times for instance, for 512×512×128 slices.
Other researchers have suggested different implementation categories of FCM [18] on GPU in order to accelerate the running time.Anderson & al [19] suggested a GPU solution for the Fuzzy C-Means.They have used OpenGL and Cg to achieve approximately two orders of magnitude computational speed-up for some clustering profiles using an NVIDIA8800GPU card.Then they generalized the system for the use of non-Euclidean metrics [20].On the other and they tried to provide computational intelligence researchers the skills necessary to exploit the low cost and high performance of GPUs with a minimum learning cost [21].
In other works authors implemented a parallel version of FCM on GPU using openGL and Cg language [22].They reached about 2× speedup over the sequential implementation.
Rowinska and Goclawski [23] have focused on accelerating the clustering of the FCM algorithm on GPU using CUDA.They compared their implementation with its C++ sequential implementation, as well as a MATLAB version.In their papers FCM clustering was applied to segment polyurethane foam images.The NVIDIA GeForce GTX560 card was used to perform the parallel experiments, while Intel Core i3 processor was used for the sequential implementation.As a result, they showed that their parallel methods have achieved about 10×speed up over the sequential implementation and it was 50 to100× faster than the MATLAB version.
In [24] authors proposed a parallel implementation of brFCM which is a faster version of the standard FCM, they tested their algorithm on two GPU cards, Tesla M2070 and Tesla K20m, where the implementation provides about 2.24x speedup.The brFCM implementation achieves a speed up of 23.42x compared to the traditional FCM.Lately in [25] the authors have introduced a modified FCM algorithm that improved the calculations of the membership matrix and the centers update.Their upgrade version was implemented using CUDA on GPU hardware to rise: the execution time, the visual, and the segmentation efficiency.The experiments used different images of different sizes.They used GTX260 and Intel Core 2 Duo to run the GPU and CPU experiments.The authors achieved at least 10x improvements over the sequential FCM version.Shalom et al. [26] proposed an implementation to improve the computational time of FCM on big data sets using GPU.The practical works were developed on a multidimensional yeast gene expression data set.The researchers stored the distance and membership matrices in the texture memory.The CPU carried out the initial step of the algorithm.Subsequently, the GPU held the most running time parts of the algorithm which are the iterative tasks such as distance calculations, membership calculations, and new cluster centers computations.They developed the experiments on two different GPU cards, GeForce 8500GT and GeForce 8800GTX.The comparison between the parallel and the serial implementations proves an up to 140x speedup on 8800GTX according to the CPU implementation.Moreover, the GPU implementation showed an up to 73x speedup on 8500GT.
In [27] the researchers implemented and analyzed a parallel dynamic functional connectivity (DFC).algorithm in GPU using two approaches, the first is thread-based and the second is block-based, moreover they have also parallelize the DFC using openMP on fMRI data, they have reached a speedup ranging from18.5x to 157x on GPU and 7.7x with openMP respectively.
In this paper, our contribution is mainly intended to parallelize the BCFCM algorithm [28] on massively parallel architecture.Our algorithm leads to a better accuracy on segmentation than FCM.But, it is more time consuming.In order to remedy this issue we have exploited the performance of NVIDIA graphic card (GPU) to implement a version that accelerates successfully the BCFCM.We will try later, to detail our method that gives more promising results.
The rest of this paper is organized as follows.In Section II, we summarized the fine-grained parallel model used (GPU of NVidia) and its software development environment (CUDA).www.ijacsa.thesai.orgSection III presents a review of the sequential version of the clustering algorithm BCFCM proposed in [28].Section IV, deals with our parallel implementation for the original and extended algorithms as well as the details of the parallelization.In Section V, we present our findings, compare the results and performance speed-ups.Section VI, concludes the paper and gives some perspectives for this work.

II. PARALLEL ARCHITECTURE MODEL
In this section we will give a summary presentation of the computational model and the software development environment choosing for the implementation of the algorithm.It is related to GPU massively parallel architecture and it's SDK CUDA.

A. NVidia GPU Architecture
Modern GPUs are massively parallel processors that support a large number of elementary processors.They are particularly well suited for exploring the calculations on many data types that have high arithmetic intensity.Currently, GPU architecture is composed of an evolving set of processing units flows (SM, SMX Streaming Multiprocessor or: nextgeneration Streaming Multiprocessor).Such a multiprocessor contains a number of cores (scalar processor), a multithreaded instruction unit, number of registers, local memory and shared memory (Fig. 1).The number of processor cores and multiprocessor depends on the architecture and model of the GPU.The fundamental characteristic of the architecture of the GPU is that it has a massively parallel architecture that supports a large number of threads destined to end finegrained calculation.Each parallelization scheme exploits the ability of mass GPU computing and provides a good load balancing, because each thread executes the same amount of computation.

B. CUDA: Compute Unified Device Achitecture
CUDA is a software development environment based on the C language for GPUs from NVIDIA unveiled in 2007 [29].It's constituted by a parallel programming model and a set of dedicated instruction.The CUDA codes are compiled using the NVCC compiler [30].It allows the programmer to define C functions, called kernels, which are executed in parallel by multiple CUDA threads (instantiation of a kernel).The programmer organizes these threads into a hierarchy of grids of thread blocks (Fig. 1).A block of threads is a set of concurrent threads that can cooperate with each other through barrier synchronization and shared access.During execution, the threads can access data at different levels of hierarchy: registers, shared memory and global memory.The global memory is accessible by all threads, but its access time is about 500 times slower than the access time to the shared memory and registers.
The treatment of elementary processes (threads) on the GPU is not independent.Indeed, the threads are executed in groups called Warps, where in a warp given (32 threads), all threads execute the same instruction (SIMD).

III. BACKGROUND: A BIAS FIELD CORRECTION (BC) FUZZY C-MEANS (FCM) ALGORITHM (BCFCM)
The standard FCM [18] objective function for partitioning (1) The degree of membership of data j x in the cluster v i , v i : The prototypes (or center) of the cluster i, N: The total number of pixels in the MRI image p: A weighting exponent parameter on each fuzzy membership value, it determines the amount of fuzziness of the resulting classification according to : The observed MRI signal is modeled as a product of the true signal generated by the underlying anatomy, and a spatially varying factor called the gain field.
Where X k and Y k are the true and observed intensities at the k th pixel, respectively, G k is the gain field at the k th voxel.The application of a logarithmic transformation to the intensities allows the artifact to be modeled as an additive bias field; Where x k and y k are the true and observed log-transformed intensities at the k th pixel, respectively, and k  is the bias field at the k th pixel.www.ijacsa.thesai.orgAhmed et al [28] proposed a modification to (1) by introducing a term that allow the labeling of a pixel (Voxel) to be influenced by the labels in its immediate neighborhood.The modified objective function is given by: Where: N k : Set of neighbour's pixels that exist in a window around x k . And The cluster prototype (centroid) updating is done by the expression: The estimated bias field is given by the expression: Algorithm 1 explains the main steps of this algorithm.4: repeat 5: Update the membership value U using Eq. ( 6) 6: Update the cluster center vector V using Eq. ( 9) 7: Update the bias field estimated matrix  using Eq. ( 10)

IV. PARALLEL FUZZY C-MEANS ALGORITHM FOR BIAS FIELD ESTIMATION AND SEGMENTATION (PBCFCM)
The main objective of the algorithm is to obtain simultaneously a bias field correction and image segmentation.However the expensive computation of this algorithm in its sequential version leads us to exploit the (SIMD) GPU architecture which is the adequate model for this kind of algorithms.Fig.2, illustrates the strategy used to distribute the data, at coarse-grained (Blocks) and fine-grained (threads) levels.
The main idea is to split the image over the GPU so that each pixel presented by one thread can execute its own instructions independently.For this algorithm we start with initializing the centroids vector and the variables, then allocate and transfer data from CPU to GPU before the loop iteration.Note that we have used two main kernels, one to compute the membership function and the second to compute the estimated bias field.
Once in the loop, we called the first kernel that computes both operations, the membership function and the expressions for updating centroids.Next step we update the cluster centers vector in CPU, then we transfer the new computed vector to the GPU in order to compute the new estimated bias field, and finally verify the criteria termination based on cluster variation.
The major problem of execution time in our case, is the large data transfer time that eat up the memory.While for small image size, data transfer between CPU and GPU is negligible and can even sometimes be preferable.
In this paper, we present a bias field correction fuzzy cmeans implementation which exploits the shared memory for data and constant memory for centroids in addition of registers.While, the part assigned to be executed in CPU is the division operation for the results updating of cluster and www.ijacsa.thesai.org the criteria termination test.Those two operations are assigned to CPU since they do not require processing power.The data transfer cost of cluster centers to constant memory may nearly be ignored compared to the iteration cost.Note that the allocation on device and data transfer are done only one time before starting the loop iteration, to avoid the latency caused by transferring data back and forth from the CPU and GPU.
The Parallel bias field Fuzzy c-means classification algorithm is achieved using the following stages:

A. Experiment setup
Algorithms on host (serial and parallel) were implemented using Microsoft VC++ program and CUDA 7.5 libraries.Sequential BCFCM algorithm is computed to obtain reference runtimes in C and compiled within Microsoft Visual Studio 2013 (Debug mode).Speed-up results were carried out on many devices: Intel(R) Core(TM) i7-4770 8 cores, 3.5GHz (CPU1), Intel(R) Core(TM) i7-4770 8 cores, 2.4 GHz (CPU2), GeForce GT740m, GTX760 and GTX580 GPUs.An example of specifications for tested processor and GPU equipment are summarized in Table 1 and the operating system was 64-bits Windows 7.

B. Implementation and Validation
To highlight the effectiveness of the proposed method, we extensively experiment both versions, sequential and parallel, of the clustering algorithm on different clinical brain images from online training database BRATS2012 [31] that offer a set of pathological cerebral MRI data.Before this implementation and in order to measure the performance of the proposed implementation in term of time execution, we have used the well-known Lena sample image and segment it into its optimal number of clusters which equals to 7 [32] on CPU and on GPU for different sizes that varies from 1024 to 6553600 pixels.This first assessment stage is done to identify perfectly the behavior of our implantation program on different computational GPU cards.In this section, the parameters set to validate and test the performances of our implementation for Lena image are as follow: C=7, p=2, Nr=8 and α =0.85 (α represents the neighbors effect as mentioned in [28] for low-SNR images).
To evaluate our implementation, we compared its performance to a sequential equivalent.For this purpose, runtime for serial execution was measured on the host processor that was obtained from single-core execution and referenced (TABLE.I).The GPU based computing duration for the same experiment parameters is compared with singlecore timing.In GPU performance evaluation, the allocation on device and data transfer is done only one time before the loop iteration to avoid the latency caused by transferring data back and forth from the CPU and GPU.We calculate the time of the application after the file I/O, in order to show the speedup effect more clearly.The speedup results are normalized to the baseline which is the serial implementation.Speed up of the proposed parallel algorithm (PBCFCM) is related to the image size.The larger the size is, the more we get a better speed-up.This rule is not obvious because sometimes a low acceleration can be obtained.Furthermore and according to Fig. 3, GPU cards do not have a linear acceleration, we can see that there is almost saturation, but there is always a slight speedup increase as shown in the following Fig. 3. -From 1024 pixels to 50526 pixels, the behavior is linear for the three devices, -From 50526 up to 2359296 pixels the behavior is almost parabolic for GT 740m and GTX 760 while for the GTX 580, the parabolic behavior ends only at the value 3686400 pixels.
Beyond these experimental critical values, we note that, the speed-up tends slowly to the saturation state.Also note that the maximum speed-up is achieved for the GTX 580 that is almost 5x faster than GT 740M and 2.5x faster than GTX 760.
The speed-ups reached for this figure are 52x, 20x and 13x for GTX 580, GTX 760 and GT 740 respectively for image size up to about to 7 Mega pixels.
According to our study, it seems clear that the performance speed-up depends on the strength of the GPU and the CPU configuration.Fig. 4 confirms the influence of the x64 configuration.Indeed treatment on images to this resolution is better than on x32 configuration (Fig. 3) for the same parallel implementation of the algorithm and different GPU.
Performance improvement is manifested on the three curves of the speed-up to an x64 configuration.The maximum values of speed-up obtained are of the order of 63x, 41x and 19x for GTX 580, GTX 760 and GT 740 respectively for image size up to about to 7 Mega pixels.In most cases and for large image sizes the suitable card having more performance in CUDA language is GTX 580 proved in this study.
To illustrate the impact of the resolution on the same graph, Fig. 5 shows the implementation speed up results of the parallel algorithm PBCFCM over sequential algorithm BCFCM, executed on three GPU cards and confronted the two types of configurations, namely win32 and x64.Indeed, on the graph of Fig. 6, the impact of a very high resolution is mainly shown on the curves where the resolution is x64 for the three types of GPU cards.On the other hand, for the same types of cards and even implementation of algorithm relating to x32 resolution, the speed-up is significantly lower than for x64 resolution.This feature is noticed on the three curves of the graph of figure 6 with x32 resolution.In order evaluate the performance of the proposed GPU-BCFCM, we present in Fig. 6 the variation of the execution time for iteration versus the picture size for the three types of the tested devices.We can clearly see that the variation of execution time per iteration is straight line and strongly related on the performance of the used card.According to our measurements, significant slopes were notified in a decreasing order for: GT740M, GTX 760 and GTX 580 respectively.We found: S GT740M =2.8 10 -7 S GTX760 =1.5 10 -7 S GTX580 =6. 1 10 -8  All these slops are defined the ratio: www.ijacsa.thesai.org We note that the performance test of these cards corresponds to the lower slope.This is given for GTX 580 (S GTX580 ) which has the fastest execution time.
We conclude that GTX 580 still better choose due to its performances compared to GTX 760 or GT 740M related to the slopes of experimental straight lines we introduced in this meaningful study.
The interest of the study that we conducted is clearly manifested for large image sizes.This is validated experimentally for different tests and different GPU cards.This study shows another aspect on performance analysis in terms of data storage effect on the performance of each card (data exchange inside memory card).
Another aspect of GPU performance can be sorted out by inspecting their computation power by testing them by a large amount of computation for a fixed data size.To do so, we extend the measurement for a fixed image size of 1024x1024, and run the FCM algorithm for a number of clusters ranging from 2 to 14 as in Fig. 7 that shows the variation of speed-up according to number of clusters for different GPU devices.The results investigated in this paper are very relevant as long as they validate the highest importance of GPU processing optimization.This means that how to find optimal material for a given image processing requiring a given amount of memory and powerful ability of computation.
The speed varies almost linearly according to the number of cluster, for GT 740M, GTX 760 and GTX 580, respectively.
In fact, the values minima and maxima are ranging between 2.07x up to 14.64x for GT 740M, and between 3.31x up to 29.63x and finally for GTX 580 the speedup is ranging from 4.01x up to 52.6x.
According to these numerical results and parameters performance defined, we conclude that, GTX 580 is faster and powerful than GTX 760 which is also faster than GT740M.
Referring to Fig. 8 and Fig. 9, the image used for this measurement is 6.5536 million pixels of size.The figure below represents time execution in second versus the used devices categories.
In fact, for serial execution we got as result "62.6 s" which is the highest value comparing to GTX 580 of value 1.2s, GTX760 of value 3.1s and GT740M of value 5.6s respectively (Fig. 8).However, after the results expressed experimentally, it is clear that the parallel implementation is very benefic for images ranging from medium-sized to large.On the other hand, Fig. 9 shows the speed-up reached in our experimental measurements, for GTX580, GTX760 and GT740M, the values obtained are 52.16x,20.16x and 11.19x, respectively.

C. Application example
To illustrate the effectiveness of the proposed PBCFCM algorithm we present in fig. 10 an application example on T2weighted MRI image which contains melanoma tumor.This image was segmented into 5 clusters (Background, LCR, GM, WM and Tumor).www.ijacsa.thesai.orgBoth the parallel and the sequential algorithms was also applied to different MRI datasets image corrupted by bias field artefact and demonstrated its high performance in terms of correction and speed up according to the sequential version.
BCFCM is an algorithm that takes more time to estimate, correct and segment MRI images compared to fuzzy c-mean because it includes additional procedures that make segmentation and correction of intensity in-homogeneity based on neighborhoods.variants have high computational complexities that limit their applicability.The large image sizes processing needed by the medical applications motivates the researchers to increase the computation of the enhanced algorithms and take the advantages of the modern GPU of high processing ability.In this paper we presented a parallel implementation of bias field algorithm PBCFCM which is an enhanced version of fuzzy C-means that correct the in-homogeneity intensity and segment the image simultaneously.In fact, many parameters and performance indices are introduced according to the realized measurements on different devices.Our results are translated into meaningful graphs, showing the impact of architecture on the card and the spot to which it was intended.The speed-up depends on the strength of the GPU, on the image size and on the number of clusters.Larger is the processing amount, the speed-up is higher.
For GTX 580 the speed-up increases until 52x versus the image size and also versus the number of clusters that is limited in this study to C=14.The obtained curves which are straight lines show that the speed-up is proportional to the number of clusters.For GTX 760 the speed-up increase until 20x versus image size and 29x versus the number of clusters, while for GT740M it reached 11x versus image size and 14x versus the number of clusters.
We can conclude from this study that whatever the performance of the graphic cards, it is necessary to take into account the amount of data to be processed before elaborating any parallel approach.The material performance test procedure remains obvious to fit the physical parallel architecture to the computational model algorithm and data inquiry.
As perspective to this work, we will establish an analytical model for the variation of the speed up versus the data size at first.This model will be improved by introducing the number of clusters especially for MRI classification domain.This study will be specifically launched for the three GPU cards used in our experiments.

Fig. 1 .
Fig. 1.NVIDIA GPU Grid, Block and Thread hierarchy in CUDA N r : Cardinal of N k  : Neighbours effectThe new membership function is then given by:

Algorithm 1 : 1 : 2 . 3 :
Bias field Correction fuzzy C-Means Algorithm (BCFCM) Set the parameters C, p, N r and  .Choose the stop criteria: Error, Initialize the centroïds vector V and estimated bias field  .

Algorithm 2 : 2 : 3 : 4 : 5 :Stage 6 : 7 : 8 : 9 :
Parallel bias field correction Fuzzy C-Means algorithm (PBCFCM).Stage 1: -Initialize the class centers Vi, (i = 1,...,C).This is carried out by selecting C points in the gray level scale [0,…,255].-Read the input data file.-Initialize the main variables of the algorithm in the CPU.Stage Copy the pixels data, the initial clusters and the estimated bias field to the GPU.Stage Allocate the memory of the membership matrix on the GPU.Stage Compute the membership values based on cluster center for each data member.Stage Compute expression needed for updating clusters on GPU.Update the clusters centers on the CPU.Stage Transfer the new cluster centers to the GPU.Stage Compute the new estimated bias field according to the new clusters on GPU.Stage Check the difference between the current clusters value and its previous one.Test (If the stopping condition is reached) then exit; else return to Stage 4. Stage 10: Output the final results.

Fig. 3 .
Fig. 3. Evolution of speed-up versus image size on different GPU devices for 32 bit configuration

Fig. 4 .
Fig. 4. Evolution of speed-up versus image size on different GPU devices for 64 bit configuration

Fig. 5 .
Fig. 5. Speed-up comparison for win32 and win x64 behavior on three GPU devices

Fig. 6 .
Fig. 6.The variation of execution time per iteration versus image size on different GPU devices

Fig. 7 .
Fig. 7. Speed-up variation versus clusters number for Lena image size of 1024x1024

Fig. 8 .
Fig. 8.Comparison of execution time for serial and parallel implementation on three devices

Fig. 9 .
Fig. 9. Speedups of the parallel implementations over the serial one for Lena image size of.6.5536 million pixels

Figure 8 (
Figure 8 (b) shows the estimated Bias field, this was obtained by scaling the bias-field values from one to 255. Figure 8 (c) represents the corrected image after removing the bias field of figure 8(b).Figure 8 (d) shows the segmented image.

Fig. 10 .
Fig. 10.Example of application.a) Test image, b) Bias-field estimations using PBCFCM algorithm: this was obtained by scaling the bias-field values from one to 255, c) Corrected image, d) 5 clusters segmented image VI.CONCLUSION AND PERSPECTIVES

TABLE I .
HARDWARE SPECIFICATION OF A CPU AND ONE OF THE THREE GPUS (CORE I7, GT 740M) USED IN EXPERIMENTS