A Load Balancing Policy for Heterogeneous Computational Grids

Computational grids have the potential computing power to solve large-scale scientific computing applications. To improve the global throughput of these applications, the workload has to be evenly distributed among the available computational resources in the grid environment. This paper addresses the problem of scheduling and load balancing in heterogeneous computational grids. We propose a two-level load balancing policy for the multi-cluster grid environment, where computational resources are dispersed in different administrative domains or clusters located in different local area networks. The proposed load balancing policy takes into account the heterogeneity of the computational resources. It distributes the system workload based on the capacity of the processing elements, which minimizes the overall mean job response time and maximizes the system utilization and throughput at the steady state. An analytical model is developed to evaluate the performance of the proposed load balancing policy. The results obtained analytically are validated by simulating the model using the Arena simulation package. The results show that the overall mean job response time obtained by simulation is very close to that obtained analytically. Also, the simulation results show that the proposed load balancing policy outperforms the random and uniform distribution load balancing policies in terms of mean job response time. The improvement ratio increases as the system workload increases, and the maximum improvement ratio obtained is about 72% in the range of system parameter values examined.

Keywords: grid computing; resource management; load balancing; performance evaluation; queuing theory; simulation models.


I. INTRODUCTION
The rapid development in computing resources has enhanced the performance of computers and reduced their costs. This availability of low-cost, powerful computers, coupled with the advances and popularity of the Internet and high-speed networks, has shifted the computing environment from traditional distributed systems and clusters to Grid computing environments. Grid computing has emerged as an attractive computing paradigm [1,2]. The Computing Grid, one kind of grid environment, aims to solve massive computation problems. It can be defined as a hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to geographically widely distributed computational resources. These resources may belong to various individuals and institutions and are used to solve large-scale scientific applications. Such applications may involve nano-materials, massive data, DNA research and simulated meteorology systems.
Basically, grid resources are physically distributed workstations or servers that are gathered to work as an integrated processing system. The primary motivation of a grid computing system is to give clients and programs universal and continuous access to an enormous set of high-performance computational resources [1][2][3][4]. Computational grids offer many types of services. These services are provided by the servers in the grid computing system. The servers are generally heterogeneous, as they may differ in CPU computing power, storage size, etc. [4].
As a consequence of unequal task arrival rates and differences in computing capacities and capabilities, the computers in one grid site may be heavily loaded while others in a different grid site may be lightly loaded or even idle. It is therefore necessary to shift some jobs from the heavily loaded computers to lightly loaded ones in order to employ the resources efficiently and consequently minimize the average job response time. This load shifting process is known as load balancing (LB) [4,5,6].
In general, LB policies can be categorized as centralized or distributed. In centralized policies, the system has only one LB decision maker, which has a global view of the system load information. In such policies, the system's incoming jobs are automatically forwarded to the decision maker, which balances the load among the different processing nodes in order to improve the system's average response time. These strategies are favorable if the communication cost is negligible or unimportant, as in shared-memory multiprocessor systems. Various scholars claim that centralized policies are not scalable because the decision maker may fail as the number of processing nodes in the system increases [6][7][8][9][16].
In the distributed (decentralized) LB policies, on the contrary, all computers (nodes) in the system participate in making the load distribution decisions. As a result, the load redistribution decisions are not centralized in one node. Therefore, various scholars consider distributed LB strategies better than centralized ones from the scalability and fault tolerance points of view. At the same time, it is very costly to let every computer in a distributed system collect the state information of the entire system. As a consequence, in distributed load distribution strategies, every computing node receives its incoming tasks and then decides whether to shift part of its load based on the partial or complete information it has about the overall system's load distribution [17][18][19]. This policy appears to be closely related to the individually optimal policy, in that each job (or its user) optimizes its own expected mean response time independently of the others [4][5][6][7][8][9][10].
Although the problem of balancing loads in conventional distributed environments has been studied extensively [6][7][8][9][10][11][12][13][14], new challenges in Grid computing still make it an interesting topic, and many research projects address this problem.
In this paper, we present a distributed LB policy for the grid computing environment. The proposed policy aims to improve grid resource utilization and hence maximize throughput. We study the proposed model in its steady state. In this state, the total number of jobs admitted to the computational grid is sufficiently large and the incoming job rate cannot exceed the entire processing capacity of the system [15]. As in [15], the steady-state mode helps us derive optimality for the proposed LB policy. The suggested LB strategy addresses the class of problems consisting of computation-intensive, entirely independent jobs with no communication between them. An analytical model based on queuing theory is presented. We are interested in computing the overall mean job response time. The results obtained analytically are validated by simulating the model using the Arena simulation package.
The remainder of this paper is structured as follows: Section II reviews major and recent related work. Section III presents the architecture of the suggested computational grid model. Section IV introduces the proposed grid LB policy. Section V discusses the analytical queuing model. In Section VI, we evaluate the performance of the proposed LB policy. Lastly, Section VII concludes this paper.

II. RELATED WORK AND MOTIVATIONS
LB has been studied extensively in the conventional distributed systems literature for more than two decades. Various policies and algorithms have been suggested, analyzed and implemented in a number of studies [6][7][8][9][10][11][12][13][14]. Achieving LB is more challenging in Grid systems than in conventional distributed computing systems because of the heterogeneity and the complicated dynamic nature of Grid systems. The problem of LB in a grid architecture is addressed by assigning loads in a grid without neglecting the communication overhead of collecting the load information. The load index is considered as a decision factor for scheduling jobs within a cluster and among clusters.
Many papers have been published recently to address the problem of LB in Grid computing environments. Some of the proposed computational grid LB policies are modifications or extensions of conventional distributed systems LB policies.
In [23], a decentralized model for a heterogeneous grid has been proposed as a collection of clusters. In [1], the authors employed a tree structure to represent a computational grid model. Their suggested model considers the heterogeneity of the system's computational nodes, but it is entirely independent of any real grid structure. However, they did not offer any job assignment algorithm. Their resource control strategy relies on the periodic gathering of node information by a manager node. Such a strategy suffers from massive communication overhead. Moreover, the manager node may represent a single point of failure for the system. The authors in [24] suggested using a ring topology to organize the managers of computational grids. These managers are in charge of controlling a dynamic set of computing nodes (computers or processors). Workload balancing decisions in their model rely on the real load of the computing nodes in the system. In [21], the authors proposed a hierarchical structure for grid managers, rather than a ring topology, to improve the scalability of the grid computing system. They also proposed a job allocation policy which automatically regulates the job flow rate directed to a given grid manager.
In this paper, we propose a decentralized LB policy that caters for the following distinctive features of practical computational grid systems:
• Large scale. A grid can involve a huge set of advanced computational nodes located in various distributed sites, where centralized systems cannot cope with the enormous communication overhead and the remote administration of distant stations.
• Heterogeneous grid sites. Different sites may have different hardware specifications, operating systems and processing speeds.
• Effects of considerable transfer delay. The communication overhead involved in capturing the load information of sites before making a dispatching decision can be a major issue that negates the advantages of job migration. We should not ignore the considerable dynamic transfer delay in disseminating load updates over the Internet.

III. GRID COMPUTING SERVICE STRUCTURE
The studied computational grid model is a large-scale service model that relies on a geographical hierarchical decomposition. Every user submits computing jobs and their hardware requirements to the Grid Computing Service (GCS). The GCS replies to the user by sending the results when it finishes executing the jobs. In the GCS, jobs pass through four phases, which can be summarized as follows:

A. Task submission phase
Grid clients can submit their jobs via any web browser. This simplifies the job submission procedure and makes the system reachable to all users.

B. Task allocation phase
Once the GCS receives a job, it looks for the available resources (computers or processors) and allocates the suitable resources to the task.

C. Task execution phase
Once the needed resources are allocated to the task, it is scheduled for execution on that computing site.

D. Results collection phase
The GCS sends the user the task's results as soon as the execution is completed.
A three-level top-down view of the considered grid computing model is shown in Fig. 1 and can be explained as follows:

• Level 0: Local Grid Manager (LGM)
Any LGM manages a pool of Site Managers (SMs) in its geographical area. The role of the LGM is to collect information about the active resources managed by its corresponding SMs. LGMs are also involved in the task allocation and LB process in the grid. New SMs can join the GCS by sending a join request to register themselves at the nearest parent LGM.

• Level 1: Site Manager (SM)
Every SM is in charge of controlling a set of computing nodes that is configured dynamically (i.e., any computing node can join or leave the system as desired). A computing node that newly joins the site should register itself with the SM. The role of the SM is to collect information about the active processing elements in its pool. The collected information mainly includes CPU speed and other hardware specifications. Also, every SM is responsible for allocating the incoming jobs to the processing elements in its pool according to a specified LB algorithm.

• Level 2: Processing Elements (PE)
Any private or public PC or workstation can join the grid system by registering with any SM and offering its computing resources to the grid users. When a computing element joins the grid, it starts the GCS system, which reports to the SM some information about its resources, such as CPU speed.
Within this hierarchy, adding or removing SMs or PEs is an easy process, which ensures the scalability of the suggested computational grid model.
The LGMs represent the entry points of computing jobs in the proposed grid computing model. Any LGM works like a web server for the grid model. Any client can submit jobs to the associated LGM using a web browser. According to the available LB information, the LGM passes the arrived jobs to the appropriate SM. The SM in turn distributes these computing jobs, according to the available site LB information, to a chosen processing element for execution. LGMs all over the world may be interconnected using a high-speed network, as shown in Fig. 1.
As explained earlier, the information about any processing element joining or leaving the grid system is collected at the associated SM, which in turn transmits it to its parent LGM. This means that communication is needed only when a processing element joins or leaves its site. All of the collected information is used in balancing the system workload among the processing elements so as to utilize the entire system's resources efficiently and minimize the response time of users' jobs. This policy minimizes the communication overhead involved in capturing system information before making an LB decision, which improves the system performance.
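The join/leave reporting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `LocalGridManager` and `SiteManager`, their methods, and the PE names are hypothetical, and the LGM is assumed to keep a single aggregate capacity figure per site. The point it shows is that the parent LGM receives one message per membership change and nothing in between.

```python
from dataclasses import dataclass, field


@dataclass
class LocalGridManager:
    """Hypothetical LGM: stores one aggregate capacity figure per site."""
    site_capacity: dict = field(default_factory=dict)  # site name -> j/s
    updates_received: int = 0

    def report(self, site_name: str, capacity: float) -> None:
        # Called by a site manager whenever its pool of PEs changes.
        self.site_capacity[site_name] = capacity
        self.updates_received += 1

    def total_capacity(self) -> float:
        return sum(self.site_capacity.values())


@dataclass
class SiteManager:
    """Hypothetical SM: tracks its PEs and reports to its parent LGM."""
    name: str
    parent: LocalGridManager
    pe_capacity: dict = field(default_factory=dict)  # PE name -> j/s

    def _notify_parent(self) -> None:
        self.parent.report(self.name, sum(self.pe_capacity.values()))

    def join(self, pe_name: str, capacity: float) -> None:
        self.pe_capacity[pe_name] = capacity
        self._notify_parent()

    def leave(self, pe_name: str) -> None:
        del self.pe_capacity[pe_name]  # KeyError if the PE never registered
        self._notify_parent()


lgm = LocalGridManager()
site = SiteManager("site-1", lgm)
for pe_name, capacity in [("pe-a", 90), ("pe-b", 200), ("pe-c", 150)]:
    site.join(pe_name, capacity)
site.leave("pe-a")
print(lgm.total_capacity(), lgm.updates_received)  # 350 4
```

Three joins and one leave produce exactly four messages to the LGM, regardless of how many jobs are processed in the meantime.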

IV. GRID LOAD BALANCING POLICY
We propose a two-level LB policy for the multi-cluster grid environment, where clusters are located in different local area networks. The proposed LB policy takes into account the heterogeneity of the computational resources. It balances the system's load according to the capacity of the computing nodes. We assume that the jobs admitted to the grid system are computation-intensive and entirely independent, with no inter-process communication between them.
To formalize the LB policy, we define the following parameters for the grid computing service model:

Site Processing Capacity (SPC_i):
Number of jobs that can be executed by the i-th site per second. Hence, SPC_i can be calculated by summing the PECs of all the PEs managed by the i-th site.

Local grid manager Processing Capacity (LPC):
Number of jobs that can be executed under the responsibility of the LGM per second. The LPC can be calculated by summing the SPCs of all the sites managed by that LGM.

The proposed LB policy is a multi-level one, as can be seen from Fig. 2. This policy is explained at each level of the grid architecture as follows:

A. Local Grid Manager Load Balancing Level
Consider a Local Grid Manager (LGM) which is responsible for a group of site managers (SMs). As mentioned earlier, the LGM maintains information about all of its SMs in terms of their processing capacities (SPCs). The total processing capacity of an LGM is the LPC, which is the sum of the SPCs of all the sites managed by that LGM. Based on the processing capacity of every site, the LGM scheduler distributes the workload among the members of its group of sites (SMs). Let N denote the number of jobs that arrive at an LGM in the steady state. Hence, the i-th site workload ($S_iWL$), which is the number of jobs to be allocated to the i-th site manager, is obtained as follows:

$$S_iWL = N \times \frac{SPC_i}{LPC} \qquad (1)$$
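The site shares above describe steady-state totals. One way to realize them job by job is weighted random routing, in which each arriving job is sent to site i with probability SPC_i / LPC. The sketch below assumes that dispatch rule; the function name `pick_site`, the seed, and the number of draws are hypothetical, and the capacities are borrowed from the numerical example later in this section.

```python
import random
from collections import Counter


def pick_site(site_capacities, rng):
    """Return a site index chosen with probability SPC_i / LPC."""
    indices = range(len(site_capacities))
    return rng.choices(indices, weights=site_capacities, k=1)[0]


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
spcs = [440, 260, 320, 580, 400]  # j/s, from the numerical example
draws = 20000
counts = Counter(pick_site(spcs, rng) for _ in range(draws))

# Empirical share of jobs per site versus the target share SPC_i / LPC.
shares = [counts[i] / draws for i in range(len(spcs))]
targets = [c / sum(spcs) for c in spcs]
for i, (share, target) in enumerate(zip(shares, targets), start=1):
    print(f"site {i}: routed {share:.3f}, target {target:.3f}")
```

Over many jobs the empirical shares approach the target fractions, so faster sites receive proportionally more work without the LGM tracking queue lengths.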

B. Site Manager Load Balancing Level
As explained earlier, every SM manages a dynamic pool of processing elements (workstations or processors). Hence, it has information about the PECs of all the processing elements in its pool. The total site processing capacity SPC is obtained by summing the PECs of all the processing elements in that site. Let M be the number of jobs that arrive at an SM in the steady state. The SM scheduler uses an LB policy similar to that used by the LGM scheduler: the site workload is distributed among the site's processing elements in proportion to their processing capacity. Using this policy, the throughput of every processing element is maximized and its resource utilization is improved. Hence, the i-th PE workload ($PE_iWL$), which is the number of jobs to be allocated to the i-th PE, is obtained as follows:

$$PE_iWL = M \times \frac{PEC_i}{SPC} \qquad (2)$$

Example: Let N = 1500 j/s (jobs/second) arrive at an LGM with five SMs having the following processing capacities: SPC_1 = 440 j/s, SPC_2 = 260 j/s, SPC_3 = 320 j/s, SPC_4 = 580 j/s, and SPC_5 = 400 j/s. Hence, LPC = 440 + 260 + 320 + 580 + 400 = 2000 j/s. So, the workload for every site is computed according to equation 1 as follows:

$$S_1WL = 1500 \times \frac{440}{2000} = 330 \text{ j/s}, \quad S_2WL = 1500 \times \frac{260}{2000} = 195 \text{ j/s}, \quad S_3WL = 1500 \times \frac{320}{2000} = 240 \text{ j/s},$$
$$S_4WL = 1500 \times \frac{580}{2000} = 435 \text{ j/s}, \quad S_5WL = 1500 \times \frac{400}{2000} = 300 \text{ j/s}.$$

Then the workload of every site is allocated to the processing elements managed by that site based on equation 2. As an example, suppose that the fifth site contains three PEs having processing capacities of 90 j/s, 200 j/s, and 150 j/s respectively. Hence, the SPC = 90 + 200 + 150 = 440 j/s. Remember that this site's workload equals 300 j/s as computed previously. So, the workload for every PE is computed according to equation 2 as follows:

$$PE_1WL = 300 \times \frac{90}{440} \approx 61.36 \text{ j/s}, \quad PE_2WL = 300 \times \frac{200}{440} \approx 136.36 \text{ j/s}, \quad PE_3WL = 300 \times \frac{150}{440} \approx 102.27 \text{ j/s}.$$

From this simple numerical example, one can see that the proposed LB policy allocates more workload to the faster PEs, which improves the system utilization and maximizes the system throughput.
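The numerical example can be reproduced with a short Python sketch. It is a minimal illustration rather than the author's implementation: the function name `proportional_split` is hypothetical, and the capacities and arrival figures are simply the ones quoted in the example. The same function serves both levels, because equations 1 and 2 have the same form.

```python
def proportional_split(total_jobs, capacities):
    """Split a workload among units in proportion to their capacities.

    Applies workload_i = total_jobs * capacity_i / sum(capacities), the form
    shared by the site-level and the PE-level allocation rules.
    """
    total_capacity = sum(capacities)
    if total_capacity <= 0:
        raise ValueError("total capacity must be positive")
    return [total_jobs * c / total_capacity for c in capacities]


# LGM level: N = 1500 j/s over the five site capacities of the example.
site_loads = proportional_split(1500, [440, 260, 320, 580, 400])
print(site_loads)  # [330.0, 195.0, 240.0, 435.0, 300.0]

# SM level: the 300 j/s site workload over the three PE capacities quoted.
pe_loads = proportional_split(300, [90, 200, 150])
print([round(load, 2) for load in pe_loads])  # [61.36, 136.36, 102.27]
```

Because each share is a fixed fraction of the incoming workload, the shares always sum back to the workload being split, whatever capacities are supplied.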

V. ANALYTICAL MODEL
To compute the mean job response time analytically, we consider one LGM section as a simplified grid model. In this model, we concentrate on the time spent by a job in the processing elements. Consider the following system parameters:
• $\lambda$ is the external job arrival rate from grid clients to the LGM.
• $\lambda_i$ is the job flow rate from the LGM to the i-th SM, which is managed by that LGM.
• $\lambda_{ij}$ is the job flow rate from the i-th SM to the j-th PE managed by that SM.
• $\mu$ is the LGM processing capacity.
• $\mu_i$ is the processing capacity of the i-th SM.
• $\mu_{ij}$ is the processing capacity of the j-th PE, which is managed by the i-th SM.
• $\rho = \lambda/\mu$ is the system traffic intensity. For the system to be stable, $\rho$ must be less than 1.
• $\rho_{ij} = \lambda_{ij}/\mu_{ij}$ is the traffic intensity of the j-th PE, which is managed by the i-th SM.
We assume that jobs arrive from clients at the LGM according to a time-invariant Poisson process. Jobs arrive at the LGM sequentially, with inter-arrival times that are independent and identically exponentially distributed with arrival rate $\lambda$ j/s. Simultaneous arrivals are excluded. Every PE in the dynamic site pool is modeled as an M/M/1 queue.
Jobs that arrive at the LGM are automatically distributed among the sites managed by that LGM with routing probability $PrS_i = SPC_i / LPC$ according to the LBP, where $i$ is the site number. Hence,

$$\lambda_i = \lambda \times PrS_i = \lambda \times \frac{SPC_i}{LPC}.$$

The arrivals at site $i$ are in turn automatically distributed among the PEs managed by that site with routing probability $PrE_{ij} = PEC_{ij} / SPC_i$ based on the LBP, where $j$ is the PE number and $i$ is the site number. Hence,

$$\lambda_{ij} = \lambda_i \times PrE_{ij} = \lambda_i \times \frac{PEC_{ij}}{SPC_i}.$$

Since the arrivals to the LGM are assumed to follow a Poisson process, the arrivals to the PEs also follow a Poisson process. We also assume that the service times at the j-th PE in the i-th SM are exponentially distributed with fixed service rate $\mu_{ij}$ j/s. Note that $\mu_{ij}$ represents the PE's processing capacity (PEC) in our LB policy. The service discipline is First Come First Served. This grid queueing model is illustrated in Fig. 2. The state transition diagram of the j-th PE in the i-th site manager is shown in Fig. 3.

As mentioned earlier, we are interested in studying the system at the steady state, that is, when the traffic intensity is less than one, i.e., $\rho < 1$. To compute the expected mean job response time, Little's formula is used. Let $E[T_g]$ denote the mean time spent by a job in the grid, $\lambda$ the arrival rate, and $E[N_g]$ the mean number of jobs in the system. Hence, by Little's formula, the mean time spent by a job in the grid is given by equation 3 as follows:

$$E[N_g] = \lambda \times E[T_g] \qquad (3)$$

$E[N_g]$ can be computed by summing the mean number of jobs in every PE at all the grid sites. So,

$$E[N_g] = \sum_{i=1}^{m} \sum_{j=1}^{n} E[N_{PE}^{ij}],$$

where $i = 1, 2, \ldots, m$, $m$ is the number of site managers managed by an LGM, $j = 1, 2, \ldots, n$, $n$ is the number of processing elements managed by an SM, and $E[N_{PE}^{ij}]$ is the mean number of jobs in processing element number $j$ at site number $i$. Since every PE is modeled as an M/M/1 queue, $E[N_{PE}^{ij}] = \rho_{ij} / (1 - \rho_{ij})$, where $\rho_{ij} = \lambda_{ij} / \mu_{ij}$ and $\mu_{ij} = PEC_{ij}$ for PE number $j$ at site number $i$. From equation 3, the expected mean job response time is given by:

$$E[T_g] = \frac{1}{\lambda} \times E[N_g] = \frac{1}{\lambda} \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\rho_{ij}}{1 - \rho_{ij}}.$$
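The derivation can be checked numerically with a small Python sketch. It is a hedged illustration, not the paper's code: the function names are hypothetical, and the individual PE capacities below are invented so that three sites with 4, 3 and 5 PEs add up to a total capacity of 1700 j/s. One consequence of the formulas above is worth noting: with capacity-proportional routing and µ = LPC, every PE sees the same traffic intensity ρ = λ/µ, so for K PEs the sum collapses to E[T_g] = K / (µ − λ).

```python
def proportional_flows(lam, sites):
    """Per-PE arrival rates lambda_ij under capacity-proportional routing."""
    lpc = sum(sum(site) for site in sites)
    flows = []
    for site in sites:
        spc = sum(site)
        lam_i = lam * spc / lpc            # lambda_i = lambda * SPC_i / LPC
        flows.append([lam_i * pec / spc for pec in site])
    return flows


def mean_response_time(lam, sites, flows):
    """E[T_g] = (1/lambda) * sum of rho_ij / (1 - rho_ij) over all PEs."""
    jobs_in_system = 0.0
    for site, site_flows in zip(sites, flows):
        for pec, lam_ij in zip(site, site_flows):
            rho = lam_ij / pec             # mu_ij = PEC_ij
            if rho >= 1:
                raise ValueError("unstable PE: traffic intensity >= 1")
            jobs_in_system += rho / (1 - rho)  # M/M/1 mean number in system
    return jobs_in_system / lam            # Little's formula


# Hypothetical PE capacities in j/s (assumption): they sum to 1700.
sites = [[100, 150, 200, 150], [100, 200, 200], [100, 100, 150, 150, 100]]
lam = 850.0  # hypothetical arrival rate, giving rho = 0.5

response = mean_response_time(lam, sites, proportional_flows(lam, sites))
closed_form = 12 / (1700 - lam)  # K / (mu - lambda) with K = 12 PEs
print(response, closed_form)
```

The two printed values agree, which is a quick sanity check that the routing probabilities and the M/M/1 term were combined consistently.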

A. Experimental Environment
The simulation was carried out using the discrete-event system simulator Arena [25]. This simulator allows the modeling and simulation of entities in grid computing systems (users, applications, resources and resource load balancers) for the design and evaluation of LB algorithms.
To gauge the performance of the grid computing system under the proposed LB policy, a simulation model is built using the Arena simulator. This simulation model consists of one LGM which manages a number of SMs, each of which in turn manages a number of PEs (workstations or processors). All simulations are performed on a PC (Core 2 processor, 2.73 GHz, 1 GB RAM) running Windows XP.

B. Simulation Results and Analysis
We assume that external jobs arrive at the LGM sequentially and that their inter-arrival times are independent and exponentially distributed with mean 1/λ, where λ is the arrival rate in j/s. Simultaneous arrivals are not allowed. We also assume that the service times of LGMs follow the exponential distribution with mean 1/µ, where µ is the service rate in j/s.
The performance of the grid computing system under the proposed LB policy is compared with that of two other policies, namely the Random distribution LB policy and the Uniform distribution LB policy.
In the Uniform distribution LB policy, the job flow rate (routing probability) from the LGM to its SMs is fixed to the value 1/m, where m is the number of SMs in the grid computing service model. Also, the job flow rate (routing probability) from any SM to its PEs is fixed to the value 1/n, where n is the number of PEs which are managed by that site. In the Random distribution LB policy, a resource for job execution is selected randomly without considering any performance metrics of that resource or of the system. This policy is explained in [26]. In the proposed LB policy, however, all the jobs arriving from clients at the LGMs are distributed among the SMs based on their processing capacity, to improve utilization and thereby minimize the mean job response time.
The grid system built in our simulation experiment has 1 LGM and 3 SMs having 4, 3, and 5 PEs respectively. We fixed the total grid system processing capacity µ = LPC = 1700 j/s. First, the mean job response time under the proposed LB policy is computed analytically and by simulation, as shown in Table 1.
From that table, we can see that the response times obtained by simulation approximate those obtained analytically. The obtained simulation results satisfy a 95% confidence level. Also, from Table 1, we can notice that the proposed LB policy is asymptotically optimal because its saturation point (λ/µ) ≈ 1 is very close to the saturation level of the grid computing model.
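The kind of agreement between simulation and analysis reported above can be illustrated on a small scale without Arena. The sketch below is a stand-in, not a reproduction of the Arena model: it simulates a single M/M/1 processing element in plain Python using the Lindley recursion for waiting times and compares the estimate with the M/M/1 mean response time 1/(µ − λ). The function name, the rates (λ = 100 j/s, µ = 200 j/s), the seed and the job count are all hypothetical.

```python
import random


def simulate_mm1_response_time(lam, mu, n_jobs, seed=0):
    """Estimate the mean response time of an M/M/1 FCFS queue.

    Uses the Lindley recursion W[k+1] = max(0, W[k] + S[k] - A[k+1]), where
    W is the queueing delay, S the service time and A the inter-arrival time.
    """
    rng = random.Random(seed)
    wait = 0.0            # the first job finds an empty queue
    total_response = 0.0
    for _ in range(n_jobs):
        service = rng.expovariate(mu)
        total_response += wait + service
        interarrival = rng.expovariate(lam)
        wait = max(0.0, wait + service - interarrival)
    return total_response / n_jobs


lam, mu = 100.0, 200.0                     # hypothetical rates in j/s
estimate = simulate_mm1_response_time(lam, mu, n_jobs=200_000, seed=1)
exact = 1.0 / (mu - lam)                   # M/M/1 mean response time
print(f"simulated {estimate:.5f} s, analytic {exact:.5f} s")
```

For a full grid, the same routine could be run once per PE with its own λ_ij and µ_ij and the results averaged with weights λ_ij/λ, since Poisson arrivals split by fixed routing probabilities give independent Poisson streams.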
Using the same grid model parameter settings as in our simulation experiment, the performance of the proposed LB policy is compared with that of the Uniform distribution and Random distribution policies, as shown in Fig. 4. From that figure, we can see that the proposed LBP outperforms the Random distribution and Uniform distribution LBPs in terms of system mean job response time. It is also noticed that the system mean response time obtained by the Uniform LBP lies between those of the proposed and Random distribution LBPs.
To evaluate how much improvement is obtained in the system mean job response time as a result of applying the proposed LBP, we computed the improvement ratio $(T_U - T_P)/T_U$, where $T_U$ is the system mean job response time under the Uniform distribution LBP and $T_P$ is the system mean job response time under the proposed LBP; see Fig. 5. From that figure, we can see that the improvement ratio increases as the system workload increases and reaches about 72% in the range of parameter values examined. This result was anticipated, since the proposed LBP balances the system's load according to the capacity of the computing nodes, which maximizes the utilization of system resources and, as a result, minimizes the system mean job response time. In contrast, the Random distribution policy distributes the system workload randomly over the system's PEs without taking any performance metric into account, which may lead to an unbalanced workload distribution, poor resource utilization and hence degraded system performance. This situation appears clearly as the system workload increases. Also, the Uniform distribution policy distributes the system workload equally over the PEs without taking their processing capacity or any workload information into account, which repeats the same situation as the Random distribution LBP. To be fair, we must say that, according to the obtained simulation results, the performance of the Uniform distribution LBP is much better than that of the Random distribution LBP.
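The improvement ratio can also be explored analytically with the M/M/1 model. The Python sketch below is an illustration under loudly stated assumptions: the individual PE capacities are invented (three sites with 4, 3 and 5 PEs totalling 1700 j/s), the arrival rate is hypothetical, the function names are not from the paper, and the uniform policy is taken to route with probability 1/(number of sites) and then 1/(number of PEs in the site). Its output is therefore not comparable with the figures reported above.

```python
def improvement_ratio(t_uniform, t_proposed):
    """(T_U - T_P) / T_U: relative reduction in mean job response time."""
    return (t_uniform - t_proposed) / t_uniform


def mean_response_time(lam, sites, policy):
    """Analytic E[T_g] for M/M/1 PEs under a given routing policy."""
    lpc = sum(sum(site) for site in sites)
    jobs_in_system = 0.0
    for site in sites:
        spc = sum(site)
        for pec in site:
            if policy == "proportional":
                flow = lam * (spc / lpc) * (pec / spc)
            elif policy == "uniform":
                flow = lam / len(sites) / len(site)
            else:
                raise ValueError(f"unknown policy: {policy}")
            rho = flow / pec
            if rho >= 1:
                raise ValueError("unstable PE: traffic intensity >= 1")
            jobs_in_system += rho / (1 - rho)
    return jobs_in_system / lam


# Hypothetical PE capacities in j/s (assumption): they sum to 1700.
sites = [[100, 150, 200, 150], [100, 200, 200], [100, 100, 150, 150, 100]]
lam = 850.0  # hypothetical arrival rate

t_p = mean_response_time(lam, sites, "proportional")
t_u = mean_response_time(lam, sites, "uniform")
ratio = improvement_ratio(t_u, t_p)
print(f"T_P = {t_p:.5f} s, T_U = {t_u:.5f} s, improvement = {ratio:.1%}")
```

With these invented capacities, uniform routing pushes the slowest PEs close to saturation while faster ones sit underused, which is the mechanism the discussion above attributes the gap to.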

VII. CONCLUSION
This paper addresses the load balancing problem for the computational grid environment. We propose a two-level load balancing policy for the multi-cluster grid environment, where clusters are located in different local area networks. The proposed load balancing strategy reflects the heterogeneity of the computing nodes. It balances the system's load according to the capacity of the computing nodes. Consequently, the system's overall job response time is minimized and its utilization is maximized.
An analytical model is developed to compute the expected mean job response time in the grid system. To evaluate the performance of the proposed load balancing policy and validate the analytic results, a simulation model is built using the Arena simulator. The results show that the overall mean job response time obtained analytically is very close to that obtained by simulation. Also, the results show that the proposed load balancing policy outperforms the Random and Uniform distribution load balancing policies in terms of mean job response time. The improvement ratio increases as the system workload increases, and the maximum improvement ratio obtained is about 72% in the range of system parameter values examined.

Figure 1. Grid Computing Model Structure
• Level 0: Local Grid Manager (LGM)
Site i arrivals will also automatically be distributed on the PEs managed by that site with a routing probability $PrE_{ij} = PEC_{ij} / SPC_i$.

Figure 3. A state transition diagram of the j-th PE in the i-th site manager.
number of SMs in the grid computing service model. Also the job flow rate (routing probability) from any SM to its PEs is fixed to the value of PEs which are managed by that site.

Figure 5. System mean job response time improvement ratio

1. Job: Every job is represented by a job Id, the number of job instructions (NJI), and a job size in bytes (JS).

2. Processing Element Capacity (PEC_ij): Number of jobs that can be executed by the j-th PE at full load in the i-th site per second. The PEC can be calculated using the PE's CPU speed and assuming an Average Number of Job Instructions (ANJI).

TABLE 1: COMPARISON BETWEEN ANALYTIC AND SIMULATION MEAN TASK RESPONSE TIMES USING THE PROPOSED LBP