Parallelization of 2-D IADE-DY Scheme on Geranium Cadcam Cluster for Heat Equation

A parallel implementation of the Iterative Alternating Direction Explicit method of D'Yakonov (IADE-DY) for solving the 2-D heat equation on the Geranium Cadcam cluster (GCC), a distributed system running the Message Passing Interface (MPI), is presented. The scheduling of n tridiagonal systems of equations under this method is used to show improvements in speedup, effectiveness, and efficiency. The Master/Worker paradigm and the Single Program Multiple Data (SPMD) model are employed to manage the whole computation, based on domain decomposition. Completing the execution may require task recovery and a favourable configuration. Numerical validation of the parallelization through simulation demonstrates the effectiveness of the proposed method on the cluster system. It was found that the rate of convergence decreases as the number of processors increases. The results suggest that the 2-D IADE-DY scheme is a good approach to solving such problems, particularly when simulated with a larger number of processors.


I. INTRODUCTION
Software programmers developing parallel applications face a number of challenges in the area of parallel computing. According to [18] there are theoretical challenges such as task decomposition, dependence analysis, and task scheduling, and practical challenges such as portability, synchronization, and debugging. An alternative and cost-effective means of achieving comparable performance is distributed computing, using a system of processors loosely connected through a local area network [3]. For a global computational task, relevant data need to be passed from processor to processor through a message passing mechanism [7,11,28,22]. There is a growing demand for computational speed, and computations must be completed within a reasonable time period by applying multiple processors to a single problem; hence the demand for faster processors has been growing rapidly, which can only be met by the use of parallel computers for grand challenge problems [19,30] and [4].
There are a number of important unresolved questions concerning multiprocessor computers, among them: should they consist of a few rather powerful processors, many much less powerful processors, or something in between? According to [16] there is a natural expectation that multiprocessors with a few powerful processors will have an MIMD architecture, and that the others will have an SIMD architecture. Parallelization of the heat equation has been proposed by [3], and recent developments have included a number of different applications [5,2]. Another issue is communication among the processors: how is the memory connected to the processors, and how are the processors connected to each other? The model proposed in this paper overlaps communication and computation to avoid unnecessary synchronization; hence the method yields significant speedup through the use of non-blocking communication.
While the theoretical properties of the 2-D IADE-DY algorithm employing the master/worker paradigm and the SPMD model are promising, achieving good performance in practice can be challenging. According to [2], this is due to a fundamental tradeoff between reducing the time required for an inherently sequential part of the algorithm and increasing the number of iterations required to converge. Previous analyses of the IADE scheme in the literature did not consider efficient parallelization and scheduling of tasks to improve scalability. Sequential numerical methods for solving time-dependent problems have been explored extensively [25,30].
A number of software tools have been developed for parallel implementation; MPI [19] is chosen since it has a large user group. The objective of our parallel focus is to improve performance. To this end, parallelizing code has traditionally been paired with general code optimizations for performance, especially in scientific and engineering areas [18].
The main contribution of this paper is a detailed study of parallelization using the 2-D IADE-DY algorithm, employing the master/worker paradigm and the SPMD model to overlap communication with computation on the GCC cluster system running MPI, resulting in significantly improved speedup, effectiveness, and efficiency across varying mesh sizes. The Master/Worker paradigm and SPMD model manage the whole computation based on domain decomposition. Completing the execution may require task recovery and a favourable configuration. Our results demonstrate two properties that make this approach attractive on the GCC platform: overlap of communication and computation, and the ability to use arbitrarily varying mesh sizes. The distribution done on the GCC reduces the memory pressure on the master while preserving parallel efficiency. To obtain sufficiently accurate results for the numerical prediction of the scalable parallel implementation of the AGE, IADE and ADI algorithms, fine discretization of the domain would be necessary. Due to the limitations in both processing power and memory on sequential architectures, and the dimension of a full-scale utility, only coarse grids are possible. A considerable enhancement may be achieved if a domain decomposition method is used to allow locally refined meshes. The paper is organized as follows: Section II reviews previous related work; Section III introduces the model for the 2-D heat equation and the 2-D IADE-DY scheme; Sections IV and V present the performance analysis and the numerical experiments; finally, a conclusion is given in Section VI.

II. PREVIOUS WORK
Parallelization of partial differential equations (PDEs) by time decomposition was first proposed by [24], motivated by parallel real-time solutions. Recent improvements have included a number of different applications [5], and [2] emphasizes the scheduling of tasks in the Parareal algorithm. The importance of loop parallelization and loop scheduling has been studied extensively [1]. This work is distinct in that it promotes flexibility while applying standard parallel concepts. Several approaches to solving the heat equation have been carried out in [6,25,26,27] and [13,29,32]. We have applied the 2-D IADE-DY scheme by simulation to schedule the n tridiagonal systems of equations, using the above method to show improvements in speedup, effectiveness, and efficiency. References [10] and [12] report speedup and efficiency; compared with our results generated on the GCC, the GCC results show better conformity to linearity for speedup and closeness to unity for efficiency than [10] as applied to the simple method using MPI. In [20], the unconditional stability of the alternating difference schemes is similar to that of our scheme. Our implementation, compared with [26] and [27,6], is a way of proving stability and convergence on the GCC cluster system. We also note the various improvements in speedup, effectiveness, and efficiency analysed in [33] using the overlapping domain decomposition method. However, [32] proposed a generalized speedup formula as the ratio of parallel to sequential speed. Regarding performance strategies, thorough studies of speedup models and their advantages in [30,9,28] show the same conformity as our implementation, but here we were able to achieve unity conformity in the message passing mechanism.

III. THE MODEL PROBLEM
The problem of interest is the heat equation in two dimensions. We assume that heat spreads within the field according to the dynamics described in [27,31] and the Alternating Group Explicit method [13]:

∂U/∂t = ∂²U/∂x² + ∂²U/∂y², (x, y, t) ∈ R × (0, T],  (3.1)

with the initial condition

U(x, y, 0) = f(x, y), (x, y) ∈ R,  (3.1a)

and with U(x, y, t) specified on the boundary of R by

U(x, y, t) = g(x, y, t), (x, y, t) ∈ ∂R × (0, T],  (3.1b)

where for simplicity we assume that the region R of the xy-plane is a rectangle. A uniform mesh is placed on R, with m and n chosen so that Δx = Δy = h; consequently the mesh ratio is defined by λ = Δt/h². At each time level, the solution of (3.1) uses a backward-difference approximation.
where δx² and δy² are the usual central difference operators in the x and y coordinates respectively.

B. IADE-DY
The matrices A and B are tridiagonal, of size (m×m) and (n×n) respectively. Hence, at each of the (k + ½) and (k + 1) time levels, these matrices can be decomposed into the sum of constituent matrices. Taking p as an iteration index, and for a fixed acceleration parameter r > 0, this yields the two-stage IADE-DY scheme, applied first at the (k + ½) level and then at the (k + 1) level.

IV. PERFORMANCE ANALYSIS AND PARALLEL ALGORITHM
All experiments were performed on the GCC, comprising 8 nodes with a Gigabit Ethernet interconnect. Each node consists of dual-core processors (3.0 GHz) with 16 GB of RAM. The implementation was written in C with MPI. A parallel platform designed to run numerical applications has to be efficient [8]. The platform carries out many computations on a large set of varying mesh sizes, and its evaluation has to be large enough for benchmarking. Performance concerns not only the cost of the functions of the schemes, but also resource accesses and the placement of code on computing resources [8]. Data placement is declared at the beginning of the computation and does not accept any perturbation thereafter. The 2-D IADE-DY scheme is tested extensively on the GCC cluster system. The objective is to evaluate the overhead it introduces and its ability to exploit the inherent parallelism of an iterative computation, as stated in [18]. Scalability across varying numbers of processors and mesh sizes is observed.
To obtain any speedup we need convergence in fewer than N iterations. The closer the coarse propagator is to the fine propagator, the faster the convergence will be; but if they are too similar, the sequential part of the algorithm will significantly degrade the speedup. A simple speedup analysis according to [2] expresses the speedup in terms of r, the ratio of the time taken by the coarse propagator to that of the fine propagator over the same time interval, and K, the number of iterations required for convergence, with communication overhead ignored.
The efficiency is then the speedup divided by the number of processors.
Full efficiency can be achieved if the algorithm converges in one iteration. To make r smaller, the coarse propagator must be made less accurate, through a larger time step or a coarser spatial grid, which in turn requires more iterations to converge [17]. As treated in [14,15], the algorithm for the scheme is performed on a distributed memory system of p processors, assuming that each processor initially stores n = N/p objects distributed over the entire physical domain. In the first iteration of the algorithm, the domain is decomposed into two sub-domains so that the difference between the sums of the weights of the sub-domains is as small as possible. The same process is then applied to the two sub-domains in parallel, and is repeated recursively for log p iterations. In other words, during iteration i, 1 ≤ i ≤ log p, the p processors are grouped into 2^(i-1) groups of p/2^(i-1) processors each. At the beginning of the iteration, the problem domain is already partitioned into 2^(i-1) sub-domains, and the objects in each sub-domain are stored in a single group of processors. At the end of the iteration, each processor group is divided into two groups and the corresponding sub-domain is divided into two sub-domains, with the objects in one sub-domain residing in one half of the processors and the objects in the other sub-domain residing in the other half. Data parallelism originated the SPMD model [23]. Thus, the finite difference approximation can be treated as an SPMD problem: essentially the same computation must be performed on multiple data sets. The multiple data sets are different parts of the overall grid, each sent to a different compute node (processor). The main issues that arise in parallelizing a finite difference grid are determining how best to partition the grid among processors, and how to pass information about grid boundaries from node to node.
Domain decomposition is used to distribute data between the different processors; to minimize idle time, static load balancing distributes the data so that each processor gets almost the same number of computational points. The partitioning and load balancing are done in a pre-processing stage in which separate grid files are generated for each processor, along with other necessary information about the partitioning. Thus there is no need to allocate extra storage or scatter the grid data when the parallel program is executed. At the end of the parallel computation, each process writes its output to a separate file suitable for verification.

V. NUMERICAL EXPERIMENTS AND DISCUSSION
The algorithm was tested on the 2-D heat equation; its application is demonstrated on meshes of 100x100, 200x200 and 300x300 respectively, and Tables 1-3 show the corresponding performance timings for the 2-D parabolic equation (3.1). In the experiments designed to test the effectiveness of our approach, we observed that as the mesh size increases the execution time increases as well, with a proportionate decrease in time as the number of processors increases, for all three mesh sizes in Tables 1-3. T_w is the time for the worker, T_m is the master time, T_sd is the worker domain decomposition time for worker allocation, S_par is the speedup, and E_par is the efficiency. This shows that although increasing the number of processors leads to a decrease in execution time, a point is reached at which adding processors no longer has much impact on total execution time: the time spent in data exchange becomes significant compared to the time spent in computation, and the parallel efficiency goes down. Hence, as the number of processors increases, balancing the number of computational cells per processor becomes a difficult task due to significant load imbalance. For certain numbers of processors and mesh sizes, execution time suddenly increases; this is due to the uneven distribution of computational cells when a large number of processors is used, while in other cases execution time changes very little, reflecting the influence of the domain decomposition on parallel performance. For the larger mesh sizes, up to a certain number of processors, the speedup improvement is near-linear; beyond that, performance begins to degrade because of increasing communication overheads.
The problem size is scaled up following the memory-bounded constraint. This behaviour is well within expectations, since the implicit replacement has a very low computation overhead as implemented on the three problems. However, the jumps in communication time that are relatively larger than the others are mainly caused by the architecture of the communication between the processors, that is, by the underlying machine architecture rather than the algorithm. This rate of performance decrease is fair for parallel computing, especially for experiments conducted in non-dedicated environments, and shows that the proposed algorithm scales well. Our experiments show reliability by conforming to convergence, and by how memory is distributed to access the main data. This step is made possible by the Master/Worker computation process.

VI. CONCLUSION
We have explained in this paper how the GCC makes the parallelization of the 2-D IADE-DY scheme a good approach to solving problems, particularly when simulated with more processors. The objective was to present a paradigm-adapted architecture for distributed computation, since such designs depend on empirical concerns (data and code). The algorithm presented shows significant improvement when implemented on the stated numbers of processors. In addition to its ease of use compared with other common approaches, the results show negligible overhead with effective load scheduling, which produces the expected inherent speedups. It was also confirmed that domain decomposition and the use of SPMD are important, and that these are easy to apply on our parallel GCC platform. The performance of the 2-D IADE-DY scheme with the parallel paradigm is in many cases superior. As the number of processors increases, the bottleneck of parallel computation appears and the global reduction consumes a large part of the time, so the improvement becomes less significant.
IADE-DY and DS-MF
By fractional splitting, each time step in the double sweep methods is split into two steps of size Δt/2. The horizontal sweep advances from t_k to t_{k+1/2} by using a difference approximation that is implicit in only the x-direction; specifically, past values in the y-direction along the grid line x = x_i are used to yield the intermediate value u_{i,j,k+1/2}. At the (k + 1/2) level, the solution is then obtained by using an approximation implicit in only the y-direction, using past values in the x-direction along the grid line y = y_j, to yield the final value u_{i,j,k+1}.