Distributed GPU-Based K-Means Algorithm for Data-Intensive Applications: Large-Sized Image Segmentation Case

K-means is a compute-intensive iterative algorithm, which makes it cumbersome to use in complex scenarios, particularly in data-intensive applications. To reduce the K-means running time for data-intensive applications such as large-sized image segmentation, we use a distributed multi-agent system accelerated by GPUs. In this K-means version, the input image data are divided into subsets that can be processed independently on GPUs. On each GPU, we offload the data assignment and K-centroids recalculation steps of the K-means algorithm for massively parallel processing. We implemented this K-means version on Nvidia GPUs with the Compute Unified Device Architecture (CUDA). The distributed multi-agent system was written with the Java Agent Development framework (JADE).
Keywords—Distributed computing; GPU computing; K-means; image segmentation


I. INTRODUCTION
In our decade, a huge amount of data must be processed continuously by computers to meet the needs of end users in many business areas. A simple search on Google returns many official statistics showing how large the data processed by companies such as Google and Facebook has become in image processing, web semantics, data storage, profiling and other scientific fields. For example, Facebook stores 300 petabytes and processes 600 terabytes per day; it serves 1 billion users per month, and 300 million photos are uploaded per day. Google stores even more: 15 exabytes. It processes 100 petabytes per day, indexes 60 trillion pages and performs 2.3 million searches per second. In brief, the data to be processed in many application areas are larger than ever and keep growing.
In this paper, we focus on image processing and its applications. Understanding images and extracting information from them so that it can be used for other tasks is important, for example in cancer detection in Magnetic Resonance Imaging (MRI). Such analysis and extraction of useful information are provided by image processing techniques such as image segmentation [6], [7], which is a clustering problem. K-means is an unsupervised learning algorithm that solves the clustering problem. It is an iterative algorithm in which each iteration consists of two steps: the assignment of data objects and the recalculation of the K centroids.
Nonetheless, there are two important factors to consider in image segmentation. The first is the number of images to be processed in a given use case. The second is image quality, which has improved considerably during the last few years: the number of pixels that make up an image has been multiplied by 200, from 720 x 480 pixels to 9600 x 7200 pixels. This has resulted in much better image definition (more visible detail) and more nuances in colors and shades.
Thus, during the last decade, image processing techniques have become costly in computing time on monolithic computers because of the huge number of pixels. This need has naturally led image processing researchers toward more powerful computers and new High-Performance Computing (HPC) strategies based on parallel and distributed approaches, such as 2D or 3D reconfigurable meshes [10], FPGAs and, more recently, GPUs [8], [9] and Hadoop.
In GPU computing, the most important advance is the Nvidia CUDA (Compute Unified Device Architecture) platform. The Nvidia TITAN X is the fastest GPU at the time of this writing. It has 3584 shader units, also called CUDA cores or elementary processors, a base clock of 1417 MHz that can be boosted to 1531 MHz, and 12 GB of GDDR5X memory with 480 GB/s of memory bandwidth. For more computational power, four TITAN X GPUs can be interconnected with Nvidia's Scalable Link Interface (4-way SLI), giving a combined 14336 CUDA cores and 48 GB of memory which, together with an Intel Core i7 5960X CPU, can provide an interesting speedup not only for image processing but also for many other application domains. Unfortunately, in some cases a multi-GPU system is not sufficient to obtain high enough performance for certain scientific or engineering applications. This is the case when an application has to process a large amount of data and perform complex tasks, for instance in medical imaging, where large-sized cerebral MRI images are analyzed with image processing techniques such as the K-means clustering algorithm. In addition, scalability is not guaranteed and strongly depends on the evolution of the GPU and CPU hardware offered by Nvidia, AMD and Intel. Thus, a multi-GPU system on a single node is constrained by hardware limitations. In other words, the computing and data communication capabilities of the processing environment become the dominating bottleneck.
www.ijacsa.thesai.org
To overcome these limitations, we studied distributed programming libraries with the objective of combining the GPU computing and distributed computing paradigms.
In the distributed computing paradigm, there is a set of distributed programming libraries and standards, for instance MPI (Message Passing Interface) [13], [14], OpenMP (Open Multi-Processing) [15], [16], HPX [19] and Hadoop [20]. The idea of distributed computing is to combine machines, typically commodity hardware, to parallelize tasks; the libraries and standards cited above have been used in this way in several scientific domains [11], [12], [17], [18]. However, a distributed system is limited by the computing power (number of processors) and the data storage capacity of each machine. Scaling such a system is slow and expensive: to increase its computing power, new machines must be connected. For example, to add 384 processors to the system, we must connect 48 eight-processor machines.
Additionally, in the distributed computing paradigm, researchers generally agree that the challenge is to find a library or framework that offers both ease of programming in a high-level language (without memory management or other low-level routines) and the best exploitation of hardware performance. Unfortunately, these two goals conflict. Some researchers obtained the best performance by using low-level libraries known to be error-prone, such as MPI [21], [24] or OpenMP [22], [23]. Other researchers [25], [26] used libraries and frameworks with a high-level programming language, which ensures simplicity of programming and portability of the code but brings a loss of performance and prevents efficient access to the CPU and GPU because of the high-level abstraction of the hardware.
To tackle these problems, we use a distributed Multi-Agent System (MAS) on GPU-accelerated nodes to accelerate large-sized image segmentation with the K-means algorithm. The MAS, distributed over connected nodes, divides the data into subsets and dispatches them to the GPU-accelerated compute nodes. Each subset is processed separately on a node using GPUs. In this version, we used CUDA C/C++ to write the K-means kernel code executed on the GPUs, while the multi-agent system was programmed with the JADE platform, which is based on Java. This paper presents the role of the MAS in distributing data and tasks between remote GPUs across interconnected nodes during K-means execution, and reports the experimental results.
Poteras et al. [27] focused on optimizing the data assignment step of the K-means algorithm. Before the data assignment step of each iteration, they add a procedure that determines which data objects could be affected by a move of the centroids. Thus, they no longer need to visit all the data objects to determine their membership, but only a small list of them.
Fang et al. [4] propose a GPU-based implementation of K-means. This version copies all the data to texture memory, which uses a cache mechanism. It then uses constant memory to store the K centroids, which is also more efficient than using global memory. Each thread is responsible for finding the nearest centroid of one data point; each block has 256 threads, and the grid has n/256 blocks.
The workflow of [4] is straightforward. First, each thread calculates the distance from its data point to every centroid and finds the minimum distance and the corresponding centroid. Second, each block calculates a temporary centroid set based on a subset of data points, with each thread calculating one dimension of the temporary centroid. Third, the temporary centroid sets are copied from the GPU to the CPU, where the final new centroid set is calculated.
In [4], each data point is assigned to one thread, and the cache mechanism is used to obtain high read efficiency. However, the efficiency could be further improved by using other memory mechanisms such as registers and shared memory.
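The launch configuration reported for [4] (256 threads per block, n/256 blocks) can be sketched on the host side as follows. This is a minimal illustration in plain C++, not code from [4]: the helper name `blocksFor` is hypothetical, and rounding the block count up, so that a final partial block covers the leftover points when n is not a multiple of 256, is an assumption, since the text only states n/256.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of thread blocks needed so that each of the
// n data points gets its own thread, with threadsPerBlock threads per block.
// Rounding up is an assumption; the text only states "n/256 blocks".
std::size_t blocksFor(std::size_t n, std::size_t threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

For a 720 x 480 image (345600 pixels), `blocksFor(345600, 256)` gives 1350 blocks; with 257 points it gives 2 blocks, the second one mostly idle.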
Che et al. [5] present another optimized GPU-based K-means implementation on a single node. They store all input data in global memory and load the K centroids into shared memory. Each block has 128 threads, and the grid has n/128 blocks. The main characteristic of [5] is the design of a bitmap. The workflow of [5] is as follows. First, each thread calculates the distance from one data point to every centroid and sets the corresponding bit to true in the bit array, which stores the nearest centroid for each data point. Second, each thread is responsible for one centroid: it finds all the corresponding data points in the bitmap and takes their mean as the new centroid. The main problem of [5] is the poor utilization of GPU memory, since it accesses most of the data (the input data points) directly from global memory.
Mao et al. [2] present a distributed implementation of K-means using Hadoop. This work deals with a data-intensive clustering application. A virtual Hadoop cluster based on cloud computing with CloudStack was established in order to implement the distributed K-means clustering algorithm following the MapReduce pattern. The initial centroid selection and the number of iterations were optimized. The initial centroid selection was improved using the furthest-first (FF) algorithm to select the next farthest point. To improve the iteration time, the result of the previous iteration is used for the centroid points in the Map calculation of the next iteration.
Baydoun et al. [3] propose two improved versions of Kernel K-means, on CPU and on GPU. The CPU version is based on OpenMP, Cilk Plus and BLAS libraries. The GPU version is based on Nvidia CUDA. These versions of Kernel K-means use the kernelization approach [1] to divide the given data into a set of clusters, using an approach mainly based on K-means.

III. IMAGE SEGMENTATION USING THE K-MEANS CLUSTERING ALGORITHM
K-means is a clustering algorithm that classifies the input data points S of n attribute vectors into c classes (clusters $C_i$, $i = 1 \dots c$) based on their inherent distances from each other. The algorithm assumes that the data features form a vector space and tries to find natural clustering among them. The points are clustered around class centers (centroids), which are obtained by minimizing the objective function:
$$J = \sum_{i=1}^{c} \sum_{x_j \in C_i} d^2(x_j, v_i)$$
where $v_i$ is the centroid of the $i$th class, and $d(x_j, v_i)$ is the distance between the $i$th center and the $j$th data point of S. We use the Euclidean distance to define the objective function as follows:
$$J = \sum_{i=1}^{c} \sum_{x_j \in C_i} \lVert x_j - v_i \rVert^2$$
As described in MacQueen's paper [32], an initial clustering $C_i$ ($i = 1 \dots c$) is created by choosing $c$ random centroids from the set of n data points S. This is known as centroid initialization. Next, an assignment step is executed in which each data point $x_j \in S$ ($j = 1 \dots n$) is assigned to the cluster $C_i$ for which $d(x_j, v_i)$ is minimal. Each centroid $v_i$ is then recalculated as the mean of all data points $x_j \in C_i$. The assignment and centroid recalculation steps are executed repeatedly until the centroids $v_i$ no longer change. This algorithm is known to converge to a local minimum that depends on the initial centroids. In our application, the K-means clustering algorithm is used for image segmentation. Thus, the flow chart of the algorithm in Fig. 1 takes a 2-dimensional image as input data, each point (pixel) of this image having an intensity.
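As an illustration of the objective function described above, the following C++ sketch evaluates it for scalar pixel intensities. It assumes the usual sum-of-squared-distances form; the function name `objective` and the flat `labels` array (one cluster index per pixel) are hypothetical choices for this example, not taken from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of squared distances between each pixel intensity and the centroid of
// the cluster it is assigned to. labels[p] is the cluster index of pixel p.
// The squared form and the data layout are assumptions of this sketch.
double objective(const std::vector<float>& intensities,
                 const std::vector<int>& labels,
                 const std::vector<float>& centroids) {
    assert(intensities.size() == labels.size());
    double total = 0.0;
    for (std::size_t p = 0; p < intensities.size(); ++p) {
        assert(labels[p] >= 0 &&
               static_cast<std::size_t>(labels[p]) < centroids.size());
        const double diff =
            static_cast<double>(intensities[p]) - centroids[labels[p]];
        total += diff * diff;
    }
    return total;
}
```

A lower value means the pixels lie closer to their centroids; each K-means iteration never increases this quantity, which is why the algorithm settles in a local minimum.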
The K-means algorithm can be implemented directly on a CPU using several "for" loops nested in one "while" loop. In the "for" loops, the distances between each pixel and the centroids are calculated. The recalculation of the new K centroids for the next iteration of the "while" loop is also done in a "for" loop. The "while" loop continues as long as the centroid intensities differ between the previous iteration and the current one. If the centroids do not change, the loop exits and the algorithm stops.
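The loop structure described above can be sketched as follows. This is a minimal sequential illustration for scalar pixel intensities, not the authors' code: the function name `kmeansCpu`, the `maxIterations` safety cap and the rule of leaving an empty cluster's centroid unchanged are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sequential K-means on pixel intensities. `centroids` holds the initial
// centroids on entry and the final ones on return; the result is the cluster
// label of each pixel. maxIterations is a safety cap added for this sketch.
std::vector<int> kmeansCpu(const std::vector<float>& pixels,
                           std::vector<float>& centroids,
                           int maxIterations = 100) {
    assert(!centroids.empty());
    const std::size_t k = centroids.size();
    std::vector<int> labels(pixels.size(), 0);
    bool changed = true;
    int iteration = 0;
    while (changed && iteration < maxIterations) {
        // "for" loop: assign each pixel to its nearest centroid.
        for (std::size_t p = 0; p < pixels.size(); ++p) {
            int best = 0;
            for (std::size_t c = 1; c < k; ++c) {
                if (std::fabs(pixels[p] - centroids[c]) <
                    std::fabs(pixels[p] - centroids[best])) {
                    best = static_cast<int>(c);
                }
            }
            labels[p] = best;
        }
        // "for" loop: recompute each centroid as the mean of its pixels.
        std::vector<double> sums(k, 0.0);
        std::vector<std::size_t> counts(k, 0);
        for (std::size_t p = 0; p < pixels.size(); ++p) {
            sums[labels[p]] += pixels[p];
            ++counts[labels[p]];
        }
        changed = false;
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;  // assumption: keep an empty cluster's centroid
            const float updated = static_cast<float>(sums[c] / counts[c]);
            if (updated != centroids[c]) {
                centroids[c] = updated;
                changed = true;
            }
        }
        ++iteration;
    }
    return labels;
}
```

The inner assignment loop is the part that dominates the running time for large images, since it visits every pixel once per iteration and per centroid.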
In brief, K-means starts from a set of centroids, compares them with the data points based on their intensities and characteristics, and computes the distances. The data points that are similar to a centroid are assigned to the cluster having that centroid. New centroids are then calculated, and the K clusters are formed by finding the data points nearest to each centroid.

A. Runtime Environment
The data distribution is based on agent interactions within a MAS deployed on multiple nodes. The MAS was implemented with JADE [33] in accordance with the standards of the Foundation for Intelligent Physical Agents (FIPA). The interactions between agents rely on asynchronous communication mechanisms in accordance with the Agent Communication Language (ACL).
Each running instance of the JADE runtime environment is called a container, as it can contain several agents. The JADE platform is the set of active containers distributed over the nodes. JADE agents are identified by a unique name and, provided they know each other's names, can communicate transparently regardless of their actual location, whether in the same container or in different containers of the same platform. As shown in Fig. 2, the runtime environment consists of two types of containers. The first is the main container, a JADE main container that must always be active in a platform and must be the first container launched; all other (non-main) containers register with it as soon as they start. A main container holds two special agents, which are started automatically when it is launched. The first is the AMS (Agent Management System), which provides the naming service (i.e., it ensures that each agent in the platform has a unique name) and represents the authority in the platform (for instance, agents can be created or killed on remote containers by sending a request to the AMS). The second is the DF (Directory Facilitator), which provides a Yellow Pages service through which an agent can find other agents providing the services it requires to reach its goals. Additionally, the main container holds dispatcher agents and a main agent.
The second type of container is the compute container, a normal ("non-main") JADE container. Each compute container registers with the main container as soon as it starts and must be told where to find its main container (host and port). Each compute container holds worker agents and one team leader agent.

B. Workflow
In this section, we show how K-means is applied to a large-sized image within the MAS, and how agents interact with each other across nodes to achieve efficient task and data communication. Fig. 3 illustrates the steps and interactions established within the multi-agent system during the application of K-means to a large-sized image.
At the beginning, in the main container, the main agent chooses K data points randomly as initial centroids. It then divides the large-sized image data into subsets of image data. The stream of subsets is then sent to the dispatcher agents, whose role is to compress the data subsets and dispatch them to the compute containers.
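The division of the image into subsets can be sketched as follows. This is only an illustration under stated assumptions: the image is taken to be a flat array of pixel intensities, the subsets are consecutive fixed-size chunks, and the function name `splitImage` and the `subsetSize` parameter are hypothetical; the paper does not state how the subset size is chosen.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Splits a flat array of pixel intensities into consecutive subsets of at
// most subsetSize pixels; the last subset may be shorter.
// Chunking by a fixed size is an assumption of this sketch.
std::vector<std::vector<float>> splitImage(const std::vector<float>& pixels,
                                           std::size_t subsetSize) {
    assert(subsetSize > 0);
    std::vector<std::vector<float>> subsets;
    for (std::size_t start = 0; start < pixels.size(); start += subsetSize) {
        const std::size_t end = std::min(start + subsetSize, pixels.size());
        const auto first = pixels.begin() + static_cast<std::ptrdiff_t>(start);
        const auto last = pixels.begin() + static_cast<std::ptrdiff_t>(end);
        subsets.emplace_back(first, last);
    }
    return subsets;
}
```

Because the data assignment of one pixel does not depend on any other pixel, the subsets can be processed in any order and on any node, provided every worker uses the same centroids.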
In the compute containers, team leader agents listen for queries and data subsets sent by dispatcher agents. The team leader agent delegates each data subset it receives to a worker agent. Each worker agent decompresses the received image data subset and performs the data assignment step of K-means using independent GPU computing units, i.e., the Streaming Multiprocessors (SMs). For each image data subset of the stream to be processed, the SMs of the GPUs have their own queue, which is used to collaborate with a worker agent. Each worker agent then returns the membership matrix containing the membership label of each pixel of the processed image data subset.
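The data assignment that a worker agent runs on one subset can be sketched sequentially as follows. This is plain C++ standing in for the GPU kernel, which is not reproduced in the text: the function name `assignSubset`, the scalar intensities and the use of one integer label per pixel (instead of a full membership matrix) are assumptions of this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Labels each pixel of one subset with the index of its nearest centroid.
// Every iteration of the outer loop is independent of the others, which is
// what allows a GPU to hand one pixel to each thread.
std::vector<int> assignSubset(const std::vector<float>& subset,
                              const std::vector<float>& centroids) {
    assert(!centroids.empty());
    std::vector<int> labels(subset.size(), 0);
    for (std::size_t p = 0; p < subset.size(); ++p) {
        float bestDistance = std::fabs(subset[p] - centroids[0]);
        for (std::size_t c = 1; c < centroids.size(); ++c) {
            const float distance = std::fabs(subset[p] - centroids[c]);
            if (distance < bestDistance) {  // ties keep the lower index
                bestDistance = distance;
                labels[p] = static_cast<int>(c);
            }
        }
    }
    return labels;
}
```

Returning compact labels rather than a dense c x n matrix keeps the reply sent back to the main agent small, which matters when the subsets travel over the network.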
The main agent then performs the data rearrangement, which consists of calculating the sum of the pixel intensities of each cluster and the number of elements of each cluster in order to calculate the new centroids.
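The data rearrangement described above amounts to one pass over the labelled pixels. The following C++ sketch shows one way to do it; the `ClusterStats` type and the function name `rearrange` are hypothetical, and the input is assumed to be the pixel intensities together with one cluster label per pixel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-cluster accumulators produced by the rearrangement step.
struct ClusterStats {
    std::vector<double> sumIntensities;  // sum of pixel intensities per cluster
    std::vector<std::size_t> cardinals;  // number of pixels per cluster
};

// Accumulates, for each of the k clusters, the sum of the intensities of its
// pixels and the number of pixels assigned to it.
ClusterStats rearrange(const std::vector<float>& pixels,
                       const std::vector<int>& labels,
                       std::size_t k) {
    assert(pixels.size() == labels.size());
    ClusterStats stats{std::vector<double>(k, 0.0),
                       std::vector<std::size_t>(k, 0)};
    for (std::size_t p = 0; p < pixels.size(); ++p) {
        const std::size_t c = static_cast<std::size_t>(labels[p]);
        assert(labels[p] >= 0 && c < k);
        stats.sumIntensities[c] += pixels[p];
        ++stats.cardinals[c];
    }
    return stats;
}
```

Since sums and counts are additive, the results for separate subsets can simply be added together before the centroids are recalculated.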
In summary, the purpose of these interactions is to send subsets of image data to the worker agents, which then perform the data assignment step of K-means using the SMs of the GPUs, as shown in Algorithm 1. The initialization routine, the data rearrangement (described by Algorithm 2) and the convergence test are performed by the main agent on the CPU. How the K-centroids recalculation is run then depends on K, the number of clusters declared during the initialization routine. If K is less than 100, the main agent performs the K-centroids recalculation step itself; otherwise it delegates the recalculation to a team leader agent for parallel execution on the GPU (Algorithm 3) in collaboration with a worker agent.
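The choice between CPU and GPU recalculation reduces to a single comparison against the threshold of 100 clusters given in the text. A minimal sketch follows; the enum `RecalcTarget`, the constant `kGpuThreshold` and the function name are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Where the K-centroids recalculation step runs.
enum class RecalcTarget { MainAgentCpu, TeamLeaderGpu };

// Threshold stated in the text: below 100 clusters, the main agent
// recalculates the centroids itself on the CPU.
constexpr std::size_t kGpuThreshold = 100;

RecalcTarget chooseRecalcTarget(std::size_t k) {
    assert(k > 0);
    return k < kGpuThreshold ? RecalcTarget::MainAgentCpu
                             : RecalcTarget::TeamLeaderGpu;
}
```

With K = 99 the step stays on the main agent, while K = 100 is already delegated to a team leader agent.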

C. K-Means Execution Steps
Beyond the data communication and synchronization among agents across the MAS, each agent is responsible for specific K-means steps, summarized as follows:
• Centroids initialization: The main agent selects K points randomly as the initial cluster centroids.
• Data assignment: Each worker agent performs this step on the received subsets of image data in collaboration with an SM of a GPU. This step consists of calculating the distance between each point and the centroids, and clustering the points using these distances. Each point of the data subset is delegated to a processor of the GPU: $u_{ij} = 1$ if $d(x_j, v_i) = \min_{k} d(x_j, v_k)$, and $u_{ij} = 0$ otherwise.
• Data rearrangement: The main agent rearranges all data and calculates sumIntensities[i] and Cardinals[i], where i = 1…c, which are used to calculate the new centroids: $\text{sumIntensities}[i] = \sum_{j=1}^{n} u_{ij} x_j$ and $\text{Cardinals}[i] = \sum_{j=1}^{n} u_{ij}$.
• K-centroids recalculation: This step is performed sequentially by the main agent if K is less than 100; otherwise the main agent delegates it to a team leader in a compute container so that it is performed on GPUs in collaboration with a worker agent. The worker agent uses the GPU to recalculate the new centroid of each cluster. Every thread block in the GPU is responsible for one new centroid: $v_i = \text{sumIntensities}[i] / \text{Cardinals}[i]$.
In summary, the data assignment and K-centroids recalculation steps are performed in parallel on the SMs of the GPUs. The main agent is responsible for centroids initialization, data rearrangement and controlling the iteration process.
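The recalculation and convergence test that close each iteration can be sketched as follows, given the per-cluster intensity sums and cardinals produced by the data rearrangement. This is a sequential illustration only: the function names are hypothetical, each loop iteration plays the role that the text gives to one thread block, and keeping the previous centroid for an empty cluster is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// New centroid of each cluster = sum of its pixel intensities divided by the
// number of its pixels. An empty cluster keeps its previous centroid
// (assumption of this sketch).
std::vector<float> recalcCentroids(const std::vector<double>& sumIntensities,
                                   const std::vector<std::size_t>& cardinals,
                                   const std::vector<float>& previous) {
    assert(sumIntensities.size() == cardinals.size());
    assert(previous.size() == cardinals.size());
    std::vector<float> updated(previous);
    for (std::size_t c = 0; c < cardinals.size(); ++c) {
        if (cardinals[c] > 0) {
            updated[c] = static_cast<float>(
                sumIntensities[c] / static_cast<double>(cardinals[c]));
        }
    }
    return updated;
}

// The iteration stops when no centroid has changed.
bool converged(const std::vector<float>& previous,
               const std::vector<float>& updated) {
    return previous == updated;
}
```

Only K sums and K counts are needed here, so this step is cheap for small K, which is consistent with running it on the CPU below the threshold of 100 clusters.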

V. EXPERIMENTAL RESULTS
In this section, we use a set of large-sized images to compare the total processing time of our proposed GPU-based K-means, distributed over multiple compute nodes using a multi-agent system, with the total processing time of the same set of images using GPU-based K-means performed on the GPUs of a single node.
All experiments were conducted on 4 nodes equipped with an Intel Core i7-3610QM CPU at 2.30 GHz (8 CPUs), 8 GB of main memory and a GeForce GTX 660M GPU with an 835 MHz engine clock, 2048 MB of GDDR5 device RAM and 384 processors organized into 3 streaming multiprocessors. Additionally, we used 4 external GeForce GTX 750 Ti GPUs connected through PCIe using PE4C V2.1 connectors. This environment was assembled in our laboratory for testing purposes. The GTX 750 Ti has a 1020 MHz engine clock, 2048 MB of GDDR5 device RAM and 640 processors organized into 5 streaming multiprocessors. All GPUs used in this study use single-precision floating-point arithmetic.
The results were obtained using sets of large-sized images. All Euclidean distance calculations were done in single precision. The performance of our K-means version depends on the actual data and task communication between agents across nodes. To observe the influence of data size on the total running time, large-sized images with thousands of intensity points were used, as shown in Table I. The test scenarios were carried out on the five large-sized images with three different hardware configurations. The first scenario used 2 compute nodes with 4 GPUs.
The speedup of our GPU-based K-means over the CPU-based K-means ranges from 4 to 5 in the first scenario, from 8 to 9.5 in the second, and from 12 to 20 in the third. This performance improvement comes from the highly parallel computing ability of the GPU using CUDA, the data rearrangement, and the division of the problem into lightweight subproblems. In addition, on a CUDA GPU the stream processors are all identical and are not split into pixel and vertex units, so they can all run at the same time without idle time.

VI. DISCUSSION
Combining agent-based distributed computing with GPU computing has the advantage of making data and task communication among compute nodes easy to encode. In addition, the agent communication language (ACL) used, which follows the FIPA specifications, makes communication transparent and structures the exchanges between compute nodes. Specifically, using the JADE platform (or a similar platform) with Nvidia GPUs in compute-intensive and data-intensive applications, such as K-means applied to image segmentation, allows host code to be written in a high-level programming language such as Java with a JNI wrapper, and device code in CUDA C/C++.
In our work, we focus on solving the problem of segmenting large-sized images by combining two powerful computing paradigms. In contrast, [4], [5] focused on memory management and thread-block communication within a single GPU; for data-intensive applications, it is difficult to guarantee the effectiveness of such approaches.
Despite the great performance of [2], specifically for massive image processing, it is possible to speed up the compute nodes with GPUs to obtain more computing power.
However, to implement a K-means version with Hadoop and GPUs, the programmer needs to understand the low-level communication routines and the storage mechanism of the Hadoop framework in order to write a scalable algorithm, which in this case would be a mixture of Hadoop and CUDA code.
Furthermore, the overhead of data and task synchronization between agents across nodes is limited by the efficiency of the network on which the MAS is deployed; for instance, on a standard Local Area Network (LAN) at 54 Mbit/s or 100 Mbit/s, the latency can be quite high. It can be reduced by using a different physical medium for data communication, such as the Fiber Distributed Data Interface (FDDI) at 1 Gbit/s, IEEE 802.3 Gigabit Ethernet at 10 Gbit/s (e.g., the 1000Base-LX or 1000Base-SX series), or the well-suited InfiniBand bus, which was used in this research. Many studies address communication latency and process synchronization; however, they are out of the scope of this research.
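To give an order of magnitude for the effect of link speed, the following C++ sketch computes an idealized transfer time. It is a back-of-the-envelope model, not a measurement from this work: the function name `transferSeconds` is hypothetical, and latency, protocol overhead and the compression applied by the dispatcher agents are all ignored.

```cpp
#include <cassert>

// Ideal time in seconds to move `bytes` of data over a link of
// `megabitsPerSecond`, ignoring latency, protocol overhead and compression
// (all assumptions of this sketch).
double transferSeconds(double bytes, double megabitsPerSecond) {
    assert(megabitsPerSecond > 0.0);
    return (bytes * 8.0) / (megabitsPerSecond * 1e6);
}
```

Assuming one byte per pixel, a 9600 x 7200 image occupies 69120000 bytes, so `transferSeconds(69120000.0, 100.0)` is about 5.5 seconds on a 100 Mbit/s link and ten times less on a 1 Gbit/s link, per iteration if the data had to be resent each time.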

VII. CONCLUSION
This K-means version was implemented using agent-based distributed computing and GPU computing to solve the problem of hardware limitations, specifically the number of elementary processors and the storage capacity. Also, instead of using low-level libraries like MPI or OpenMP, which can be error-prone in complex cases such as massive image processing, we used an agent-based programming paradigm with the JADE framework, which is based on the FIPA specifications, to overcome the difficulty of communication and synchronization between nodes in the distributed system.
Our implementation of K-means allowed us to confirm that this type of model, based on two HPC paradigms (distributed and GPU computing), can be used to solve problems of hardware limitations, and it opens the possibility of designing more scalable HPC models.
[Fig. 1. Flow chart of the K-means algorithm. Steps: start; input data; centroids initialization (initialize the centroids $v_i$, $i = 1 \dots c$, with c random point intensities); data assignment (cluster the points based on the distance of their intensities from the centroid intensities, giving a binary membership matrix $U = (u_{ij})$ with $i = 1 \dots c$, $j = 1 \dots n$, where n is the total number of points in S); centroid recalculation (compute the new centroid of each cluster); convergence test (if not converged, repeat from the data assignment; otherwise end).]

TABLE I. THE IMAGE DATA USED FOR THE TESTS
Algorithm 3: K centroids recalculation step (device code)
The second scenario used 4 compute nodes with 6 GPUs, and the third used 4 compute nodes with 8 GPUs. The measurements taken were the total processing times, which include data transfer. The total processing time of the GPU-based K-means on a single node is denoted GKtt, and that of our distributed GPU-based K-means is denoted DGKtt. The results obtained are presented in Table II. The initial class centers are chosen as:(