A GA-Based Replica Placement Mechanism for Data Grid

Data Grid is an infrastructure that manages huge amounts of data files and provides intensive computational resources across geographically distributed collaborations. To increase resource availability and to ease resource sharing in such an environment, there is a need for replication services. Data replication is one of the methods used to improve the performance of data access in distributed systems by placing multiple copies of data files at the distributed sites. Replica placement is the process of identifying where to place copies of replicated data files in a Grid system. Choosing the best location is not an easy task. Current works find the best location based on the number of requests and the read cost of a certain file. As a result, a large amount of bandwidth is consumed and the computational time increases. The authors propose a GA-Based Replica Placement Mechanism (GARPM) that finds the best locations to store replicas based on four criteria, namely, 1) Read Cost, 2) Storage Cost, 3) Sites' Workload, and 4) Replication Site.
Keywords: Data Grid; Data replication; distributed systems; Replica placement mechanism; GA-Based Replica Placement Mechanism


I. INTRODUCTION
Data Grid [1,2] is an infrastructure that deals with huge amounts of data to enable grid applications to share data files in a coordinated manner. Such an approach is seen to provide fast, reliable and transparent data access. Nevertheless, it is a challenging problem in a grid environment because the volume of data to be shared is large while storage space and network bandwidth are limited. Furthermore, the resources involved are heterogeneous, as they belong to different administrative domains in a distributed environment. It is infeasible for all users to access a single instance of data (e.g. a data file) from one single organization (e.g. a site), since this would increase data access latency. Furthermore, one single organization may not be able to handle such a huge volume of data by itself. Motivated by these considerations, a common strategy used in data grids as well as in distributed systems is replication. Replication helps provide efficient access without large bandwidth consumption and access latency [3][4][5][6][7][8][9]. The replication technique is one of the major factors affecting the performance of data grids [10]. Creating replicas can reroute client requests to certain replica sites and offer a higher access speed [11].
Replication is also bounded by two factors: the size of the storage available at different sites within the Data Grid and the bandwidth between these sites [12]. Furthermore, the files in a Data Grid are mostly large [13,14], so replication to every site is infeasible. Therefore, deciding on the optimal locations to host certain popular files is needed in order to reduce the bandwidth consumption of the network. In this paper, a GA-Based Replica Placement Mechanism (GARPM) is proposed, by which files can be placed at grid sites in an optimal or near-optimal manner. The authors present an adaptive genetic algorithm that solves the replica placement problem in a data grid. The proposed mechanism is considered a long-term optimization technique that brings two direct improvements to the performance of a data grid. One is to optimize data access, which leads to shorter execution time, by considering the read cost of files. The other is to optimize the network bandwidth, which can avoid network congestion caused by suddenly and frequently requested data, by considering the workload of grid sites and the distribution of current replicas.
GARPM addresses the problems of current replication mechanisms, which can be summarized in two points. First, a large amount of network bandwidth is consumed as a result of poor utilization of the network by the existing systems [11,15-22]. Second, this poor utilization of network bandwidth increases the job execution time [17,23-27]. The proposed work is expected to minimize network bandwidth consumption and reduce job execution time. The rest of this paper is structured as follows. Section 2 provides a brief description of existing work on replica placement mechanisms. Section 3 gives the details of the proposed replication mechanism, and Section 4 provides a numerical example that explains how the proposed mechanism works. Finally, Section 5 concludes the paper.

II. RELATED WORKS
There are many studies in the literature that concern replica placement issues. Chin-Min Wan et al. [19] proposed a replica placement scheme that tries to overcome the bottleneck caused by an increasing number of downlinks occurring at the same time. The proposed strategy chooses the best site to host the replica according to an evaluation based on the number of user requests and the transmission cost.
The purpose of the strategy is to replicate the file to a site that provides the minimum average transmission cost. Transmission cost is defined to be inversely proportional to bandwidth, and the site that provides the minimum average transmission cost is selected. Following the bandwidth aspect, [28] proposed a dynamic replication strategy called Bandwidth Hierarchy based Replication (BHR) to reduce access time by avoiding network congestion. BHR reduces the time taken to access and transfer the file by placing a replica at a high-bandwidth location. However, such an approach considers only the transmission cost and does not guarantee to minimize the overall cost.
www.ijacsa.thesai.org
A load balancing replication strategy has been proposed by [21], where the most frequently accessed file is placed close to the users and the replica placement decision is made based on the access load and the storage load of the candidate replica servers and their sibling nodes. In relation to this, [29] discussed various replication strategies, namely MinimizeExpectedUtil, MaximizeTimeDiffUtil, MinimizeMaxRisk, and MinimizeMaxAvgRisk, which consider the utility and risk indexes and make the replica placement decision by optimizing the average response time. They concluded that considering both the current network state and the file requests is better than considering the file requests alone.
Meanwhile, the work on dynamic replication by [22] resulted in the Popularity Based Replica Placement (PBRP) algorithm for hierarchical Data Grids. The idea behind PBRP is to place replicas as close as possible to those clients that frequently request data files. Further work by [30] presented a dynamic replica placement scheme for multi-tier Data Grids that categorizes files by access frequency into two groups: 1) Most Frequent Files (MFF), which are replicated and placed at the parent node of their respective best clients, where the best client for a file is the client that generates the maximum number of requests for that file, and 2) Least Frequent Files (LFF), which are placed one tier below the root of the Data Grid along the path to their best client. In [31], a dynamic placement algorithm was proposed that takes into account the dynamicity of sites in the Data Grid, since a site can leave the grid at any time and possibly join again later. Two parameters were investigated: the number of requests for each file by each site, and the utility of each site, which involves the number of times the site did not answer a file request due to its absence from the grid.
On the other hand, the authors in [23] suggested a model that provides a function for evaluating the placement of a replica. The objective of this function is to maximize the difference between the replication benefit and the replication cost (storage cost and transfer time). The benefit is the reduction in transfer time to the potential users, the storage cost is the cost of storage at the remote site, and the transfer time is the duration of the transfer from the current location to the new location. Yet site workload is not considered, so the system is not guaranteed to perform well as the number of running jobs increases.
Ruay-Shing et al. [17] proposed a dynamic replication mechanism that replicates a popular file to a suitable site according to the access frequencies of each requested file. Access frequency is an essential parameter that should be taken into account when determining replica placement. However, other important parameters such as overall cost (i.e. storage cost and read cost), distance and availability should not be neglected; otherwise, the overall system performance is degraded.

III. REPLICA PLACEMENT STRATEGY
In previous work [32], the authors proposed a replica creation model that evaluates files based on the exponential and dependency level of files in the grid system. Each file in the system is evaluated and given a File Value (FV). The main goal of that work [32] was to identify the files that need to be replicated (also known as popular files); details can be found in [32]. In this work, we aim to identify the sites that are best suited to host the newly created replicas. We therefore assume that the popular files have already been determined, and we use their values in this work. The GA-Based Replica Placement Mechanism (GARPM) finds sites at which to place the newly created replicas such that the total Read Cost (RC) is minimized, where RC is defined as the cost of transferring a data file from the underlying site to the remote sites [26]. The best locations are the sites that provide the best service to all other sites and users in the grid system. From the users' perspective, the best sites are located as close as possible to the sites that are most likely to request the underlying replicas. This improves geographical locality, which holds that files requested by a site are likely to be requested by nearby sites [33]. From the sites' perspective, however, the best sites are located as far as possible from the replication sites, which never request the underlying replicas. Hence, choosing the best sites depends on four parameters: 1) Storage Cost, 2) Read Cost, 3) Sites' Workload, and 4) Replication Sites.

1) Storage Cost (SC):
SC is the cost of storing a file at a certain site [23-26,34]. The storage cost might reflect the size of the file, the throughput of the site, or the fact that a copy of the file is residing at a specific site. In this context, the storage cost is the storage space used to store data and is computed by equation (1) [33], where Free Space is the current available space of the underlying storage site.
2) Read Cost (RC): RC is the cost of transferring a data file from the underlying site to the remote sites [26] and is computed by equation (2). The quantities involved are the total number of sites in the grid, the number of sites that request the replica from the underlying site, the file value with respect to a specific site s_i, and FTT. The file value with respect to s_i is computed by equation (3) from the number of requests for the file from s_i. FTT is the data transmission time; it depends on the size of the file and the current network bandwidth of the link between the two underlying sites, and is computed by equation (4) [26].
3) Sites' Workload: The workload of a site is defined as the number of requests that can be satisfied by the underlying site [24,35]. A candidate site should not exceed the specific amount of workload that is assigned to it.
4) Replication Sites: A replication site is a site that hosts a replica of the underlying file. Replication sites influence the choice of candidate sites. A candidate site should be located as far as possible from the replication sites for two main reasons: 1) a replication site never requests a replica that is already stored on it, and 2) the load needs to be distributed.
The proposed strategy, namely GARPM, combines the four parameters together in order to make the decision on the placement of replicas, according to the following steps:

1) Calculate the storage cost of the popular file by applying equation 1;
2) Calculate the transfer time of the popular file by applying equation 4;
3) Identify the sites that can be excluded from being candidate sites to hold the replicas. These are sites with any of the following characteristics: a) they already store the replica in their storage elements (replication sites), b) they have already exceeded their maximum workload, or c) they have a direct connection to a replication site;
4) Calculate the RC of each candidate site by applying equation 2;
5) At this point, we are given the number of copies of the popular file to be created and a set of candidate sites with their associated read costs. The goal is then to find the best sites to host that number of copies so as to optimize the total read cost.
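Step 3 above can be illustrated with a minimal sketch. The function name `candidate_sites`, the dictionary-based workload bookkeeping, and the representation of links as unordered pairs are assumptions made for illustration only; the paper does not prescribe any data structures.

```python
def candidate_sites(sites, replica_sites, workload, max_workload, links):
    """Return the sites that remain eligible to host a new replica.

    sites         -- iterable of site identifiers
    replica_sites -- set of sites that already hold the file
    workload      -- dict: site -> current number of requests served
    max_workload  -- dict: site -> workload limit assigned to the site
    links         -- set of frozenset({a, b}) for directly connected sites
    All of these structures are hypothetical, chosen for the sketch.
    """
    candidates = []
    for s in sites:
        if s in replica_sites:
            continue  # (a) the site already stores the replica
        if workload[s] >= max_workload[s]:
            continue  # (b) limit reached; ">=" is an assumption of this sketch
        if any(frozenset((s, r)) in links for r in replica_sites):
            continue  # (c) directly connected to a replication site
        candidates.append(s)
    return candidates
```

For example, with five sites where A holds the replica, B is at its workload limit, and C is linked directly to A, only D and E would remain as candidates.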

IV. GA-BASED ALGORITHM
Genetic algorithms (GA) are an evolutionary optimization approach and an alternative to traditional optimization methods [36]. The effectiveness or quality of a GA (for a particular problem) can be judged by its performance against other known techniques in terms of the solutions found and the time and resources used to find them [37]. Moreover, GA has shown itself to be extremely effective in problems ranging from optimization to machine learning [38]. An important advantage of GAs is that they search for the optimal solution by examining only the overall evaluation of a solution; they require no problem-specific information for their search, i.e., it is a blind search [39].
In general, the GA search strategy consists of the following steps. GA begins with an initial population of candidate solutions, each represented by a chromosome; solutions from one population are then used to form the next population. When GA is applied to the replica placement problem, the algorithm works as follows. First, we start with a random initial population of n chromosomes. Each chromosome s_i of this population consists of n binary bits (sites).
Therefore, each bit (site) of a chromosome can be either included (s_i = 1) or excluded (s_i = 0) from being a candidate to host one replica. The number of bits in each chromosome has to be the same as the number of sites in the grid system, as each bit represents one site. Moreover, the number of ones in each chromosome must equal the number of copies that are created of the popular file. An example of a possible initial population is as follows.
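The encoding described above can be sketched as follows. This is a minimal illustration, assuming that every site is eligible; the function names, the `seed` parameter, and the use of Python lists are hypothetical and not taken from the paper.

```python
import random


def random_chromosome(num_sites, num_copies, rng):
    """Build a binary list with exactly num_copies ones at random positions."""
    chosen = set(rng.sample(range(num_sites), num_copies))
    return [1 if i in chosen else 0 for i in range(num_sites)]


def initial_population(pop_size, num_sites, num_copies, seed=None):
    """Generate pop_size random chromosomes, one bit per site."""
    rng = random.Random(seed)  # a seeded generator makes runs repeatable
    return [random_chromosome(num_sites, num_copies, rng)
            for _ in range(pop_size)]


if __name__ == "__main__":
    for chromosome in initial_population(4, 10, 6, seed=1):
        print(chromosome)
```

Sampling the positions of the ones directly, rather than drawing each bit independently, guarantees that every chromosome satisfies the constraint on the number of copies without any repair step.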
After the initial population is generated randomly, the fitness value of each chromosome is evaluated using an objective function, or cost function. In our case, the cost function is the Overall Cost (OC) of sites, so the objective is to minimize the total OC: the lower the total OC, the fitter the solution represented by that chromosome.
The value of the fitness function is given by the following equation:
$$\mathrm{OC} = \sum_{i=1}^{N} \left( \mathrm{SC}(s_i) + \mathrm{RC}(s_i) \right) \qquad (5)$$
where N is the total number of sites.
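A minimal sketch of the cost evaluation is given below. It assumes that the overall cost of a chromosome is the sum of the storage cost and the read cost of the sites it selects (bits equal to 1), and that these per-site costs have already been computed; both the function name `overall_cost` and the restriction to selected sites are assumptions of this sketch rather than details confirmed by the text.

```python
def overall_cost(chromosome, storage_cost, read_cost):
    """Sum SC + RC over the sites whose bit is 1 in the chromosome.

    storage_cost and read_cost are per-site lists aligned with the
    chromosome bits (hypothetical precomputed values).
    """
    return sum(sc + rc
               for bit, sc, rc in zip(chromosome, storage_cost, read_cost)
               if bit == 1)
```

Under this reading, a chromosome that selects cheaper sites receives a lower overall cost and is therefore considered fitter.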
Having calculated the fitness values of the population, the next generation can be determined. Chromosomes are selected for reproduction, and fitter chromosomes are more likely to be selected. For selection, Roulette Wheel selection is used, where the fitness level is used to associate a probability of selection with each chromosome. The roulette wheel selection scheme can be implemented as follows:
• Evaluate the fitness, fitness(k_i), of each chromosome in the population.
• Compute the probability of selection, P_i, of each member of the population: $P_i = \mathrm{fitness}(k_i) / \sum_{j=1}^{n} \mathrm{fitness}(k_j)$, where n is the population size.
• Calculate the cumulative probability, q_i, for each chromosome: $q_i = \sum_{j=1}^{i} P_j$.
• Generate a random number, r ∈ (0, 1].
• If r < q_1, select the first chromosome, x_1; otherwise select the chromosome x_i such that q_{i−1} < r ≤ q_i.
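The roulette wheel steps listed above can be sketched as follows. The function names are hypothetical, and the optional argument `r` is added only so that the selection can be reproduced with a fixed random number. The sketch uses the fitness values exactly as supplied; because the objective here is to minimize OC, the values passed in would presumably be a transformation of the cost in which larger means fitter (for example its reciprocal), which is an assumption and not something the text specifies.

```python
import random
from itertools import accumulate


def selection_probabilities(fitness):
    """P_i = fitness_i / sum of all fitness values."""
    total = sum(fitness)
    return [f / total for f in fitness]


def roulette_select(fitness, r=None):
    """Return the index i of the chromosome with q_(i-1) < r <= q_i."""
    if r is None:
        # random.random() lies in [0, 1), so 1 - random() lies in (0, 1]
        r = 1.0 - random.random()
    cumulative = list(accumulate(selection_probabilities(fitness)))
    for i, q in enumerate(cumulative):
        if r <= q:
            return i
    # Guard against floating-point round-off in the last cumulative value
    return len(fitness) - 1
```

With fitness values 1, 1 and 2, the cumulative probabilities are 0.25, 0.5 and 1.0, so a draw of r = 0.3 selects the second chromosome and a draw of r = 0.9 selects the third.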
Having selected the parents for reproduction, crossover is performed by taking parts of two chromosomes to create new chromosomes. The crossover process is illustrated by the example in Figure 1. Suppose that there are two parents, P1 and P2. To create the children, say Ch1 and Ch2, the following steps are applied:
• Go through P1 from the left side and take the first half of the ones (6/2 = 3 when six copies are created), then write them down in the same positions in Ch1.
• Go through P2 from the right side and take the remaining ones (6 − 6/2 = 3 in the same case), then write them down in the same positions in Ch1.
Mutation is performed by slightly modifying a chromosome. In this case, it can be achieved by randomly picking an attribute of a chromosome and inverting it. Figure 2 lists an example in which bits (sites) number two and five of a chromosome are mutated and converted from 0 to 1 and from 1 to 0, respectively.

Fig. 2. Example of mutation process
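The crossover and mutation operators described above can be sketched as follows. Two details are assumptions of this sketch and are not stated in the text: when a one taken from the second parent falls on a position that is already set in the child, the scan skips it and continues, and mutation flips one 0-bit and one 1-bit together so that the number of ones (replica copies) is preserved, as in the Figure 2 example. Both parents are assumed to contain the same number of ones.

```python
import random


def crossover(p1, p2):
    """Build one child from the left ones of p1 and the right ones of p2."""
    total_ones = sum(p1)
    half = total_ones // 2
    child = [0] * len(p1)

    taken = 0
    for i, bit in enumerate(p1):            # scan p1 from the left
        if taken == half:
            break
        if bit == 1:
            child[i] = 1
            taken += 1

    needed = total_ones - half
    for i in range(len(p2) - 1, -1, -1):    # scan p2 from the right
        if needed == 0:
            break
        if p2[i] == 1 and child[i] == 0:    # assumption: skip occupied positions
            child[i] = 1
            needed -= 1
    return child


def mutate(chromosome, rng=random):
    """Flip one 0-bit to 1 and one 1-bit to 0, keeping the count of ones."""
    zeros = [i for i, b in enumerate(chromosome) if b == 0]
    ones = [i for i, b in enumerate(chromosome) if b == 1]
    child = chromosome[:]
    if zeros and ones:                      # nothing to swap otherwise
        child[rng.choice(zeros)] = 1
        child[rng.choice(ones)] = 0
    return child
```

Calling `crossover(p1, p2)` gives the first child, and swapping the arguments, `crossover(p2, p1)`, is one plausible way to obtain the second.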
Once parents have been selected and children chromosomes created via crossover and an occasional mutation, the newly created children are inserted into the population, and the selection, crossover, and mutation process begins again until some stopping criterion is met. Three criteria are used as stopping conditions: (1) the total number of iterations reaches a predefined number, (2) the fittest chromosome of each generation has not changed much, that is, the difference is less than $10^{-3}$ over a predefined number of generations, or (3) all chromosomes have the same fitness value, i.e., the algorithm has converged. The algorithm described above is summarized in a figure.
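The three stopping conditions can be sketched as a single check evaluated once per generation. The function name `should_stop`, the `window` parameter standing for the "predefined number" of generations, and the bookkeeping of one best-fitness value per generation are assumptions made for this illustration.

```python
def should_stop(best_history, current_fitness, max_iterations,
                window=10, tolerance=1e-3):
    """Decide whether the evolution should stop.

    best_history    -- best fitness of each generation so far
    current_fitness -- fitness values of the current population
    """
    # (1) the predefined number of iterations has been reached
    if len(best_history) >= max_iterations:
        return True
    # (2) the best fitness varied by less than `tolerance`
    #     over the last `window` generations
    if len(best_history) > window:
        recent = best_history[-(window + 1):]
        if max(recent) - min(recent) < tolerance:
            return True
    # (3) every chromosome has the same fitness value (convergence)
    if len(set(current_fitness)) == 1:
        return True
    return False
```

Comparing the spread of the recent best values, rather than only the last two generations, is one possible reading of "has not changed much"; other readings are equally compatible with the text.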

V. CONCLUSION AND FUTURE WORK
This study describes the replica placement service as a part of replication management in a Data Grid. The GA-Based Replica Placement Mechanism (GARPM) finds the best sites at which to place the newly created replicas. From the users' perspective, the best sites are located as close as possible to the sites that are most likely to request the underlying replicas, which improves geographical locality, since files requested by a site are likely to be requested by nearby sites [33]. From the sites' perspective, however, the best sites are the ones located farthest from the replication sites, which never request the underlying replicas. The proposed strategy can make good decisions on which replicas each site should store, such that both users' satisfaction and resources' satisfaction are met.
As future work, we intend to implement the presented replication mechanism in a grid environment, for example by using OptorSim, a grid simulator. Furthermore, the strategy can be tested on a larger number of sites and on different topologies.

1. Write down the first 6/2 left ones from the first parent in the same position
2. Write down the first 6 - (6/2) right ones from the second parent in the