On the Parallel Design and Analysis for 3-D ADI Telegraph Problem with MPI

In this paper we describe the solution of the 3-D Telegraph Equation (3-DTEL) by the Alternating Direction Implicit (ADI) method on the Geranium Cadcam Cluster (GCC) using the Message Passing Interface (MPI) parallel software. The algorithm is presented using the Single Program Multiple Data (SPMD) technique. The implementation is discussed by means of parallel design and analysis with a Domain Decomposition (DD) strategy. The 3-DTEL with the ADI scheme is implemented on the GCC cluster with the objective of evaluating the overhead it introduces and its ability to exploit the inherent parallelism of the computation. Results of the parallel experiments are presented. The speedup and efficiency from the experiments on different block sizes agree with the theoretical analysis.


I. INTRODUCTION
Parallel computing has greatly motivated research on the parallel design and analysis of the 3-DTEL on parallel cluster systems. Cluster applications have more processor cores to manage, and they exploit the computational capacity of high-end machines, providing an effective and efficient means of parallelism even as the challenge of effective resource management grows. High-capacity computing platforms are expensive and are characterized by long-running, high processor-count jobs. The performance of message-passing programs depends on the parallel target machine and on the parallel programming model applied to achieve parallelism. In cluster machines with a large number of processing units, scalability becomes an important issue. Many programs from scientific computing have a large potential for parallelism that is best exploited in a programming model for mixed task and data parallelism, where the parallelism can be structured in the form of concurrent multi-processor tasks [21].
Developing parallel applications has its own challenges. With reference to [11], there are theoretical challenges such as task decomposition, dependence analysis, and task scheduling, and there are practical challenges such as portability, synchronization, and debugging. However, there exists an alternative and cost-effective way of achieving performance: a loosely connected system of processors on a local area network [3]. For a global task shared with other processors, relevant data needs to be passed from processor to processor through a message-passing mechanism [20,15], since there is a growing demand for computational speed and computations must be completed within a reasonable time. A multi-processor task can be implemented on a subset of processors; one advantage is that, for many message-passing machines, communication costs are affected by the number of participating processors.
The design and analysis of finite difference DD for the 2-D heat equation has been discussed in [23], and the parallelization of the 3-DTEL on a parallel virtual machine with DD [8] shows effective load scheduling over various mesh sizes, which produces the expected inherent speedups. Parallel algorithms have been implemented for the finite difference method by [12], and [21,13] use the discrete eigenfunctions method with the AGE method on the telegraph equation problem.
The theoretical properties of the 3-D ADI algorithm with the parallel design approach employing the SPMD model with DD are promising, but achieving good performance in practice, as was done by [7], can be challenging. There is a tradeoff between reducing the time required for an inherently sequential part of an algorithm and increasing the number of iterations required to converge [2]. Previous work on the 3-D ADI scheme did not consider the parallel design approach to parallelism or improvements in scalability. Writing SPMD programs with standard message-passing software such as MPI [13] requires the explicit administration of processors with a large user group. In this paper, we present support for the implementation of parallel design and analysis using the DD strategy. Our programming style allows the application programmer to specify the program organization in clear and readable program code.
We present a detailed study of parallel design and analysis for the 3-DTEL, solved by the ADI method on the GCC cluster with MPI. The SPMD model is employed with DD to overlap communication with computation, which resulted in significantly improved speedup, effectiveness, and efficiency across varying mesh sizes compared to [7].
Our results demonstrate the overlap of communication with computation, and the ability to use arbitrary distributions of varying mesh sizes on the GCC to reduce memory pressure while preserving parallel efficiency. In addition, an advantage of our platform is that it provides a specification mechanism through a static distribution, together with an execution implementation.
123 | Page www.ijacsa.thesai.org

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 introduces the model for the 3-DTEL and the 3-D ADI scheme. Section 4 introduces the parallel design and analysis. Section 5 introduces the results of several experiments, which illustrate and evaluate the parallelization possible with our platform. Section 6 gives the conclusion.

II. RELATED WORK
The work in [16] achieved configuration of MPI-based message-passing programs, and various other platforms for the application of the telegraph and heat equations have been presented in [7,8]. An application-aware job scheduler that dynamically controls resource allocation among concurrently executing jobs was described by [22]. A framework called 'Gridway' for adaptive execution of applications in Grids was described by [14]. Parallelization by time decomposition was first proposed by [18], motivated by the goal of parallel real-time solutions, and loop parallelism and loop scheduling have been extensively studied [1]. The ADI method for Partial Differential Equations (PDE) proposed by [19] has been widely used for solving the algebraic systems that result from finite difference analysis of PDE in several scientific and engineering applications. Parallel implementations of the 2-D telegraph problem on cluster systems have been presented in [10,12].
In [12], the unconditional stability of the alternating difference schemes is similar to that of our scheme, and the study shows that unconditional stability is useful for speedup and efficiency. Our implementation on the GCC platform has several aspects that differentiate it from the above. GCC is designed for applications running on distributed memory clusters, and can calculate partition sizes dynamically and statically based on the run-time performance of the application. We use an efficient, stable algorithm that maps data using message passing over the GCC cluster. We evaluated our system using experimental speedup and efficiency results for system utilization. Our approach is best suited to applications where data and computations are uniformly distributed across processors.

III. THE MODEL PROBLEM
We consider the second-order telegraph equation in 3-D, equation (3.1) [equation and coefficient definitions not recovered]. Let Δx, Δy, Δz and Δt be the grid spacings in the x, y, z and t directions, where m is a positive integer. We can solve (3.1) by extending the 1-D simple implicit finite difference method [21] for the telegraph equation to the 3-D telegraph equation above, which gives the scheme (3.2) [equation not recovered]. Although this simple implicit scheme is unconditionally stable, its computational time is extremely large.

A. ADI Method on 3-DTEL
We derive the ADI method for the 3-DTEL from the simple implicit finite difference method by using a general ADI procedure [6] extended to (3.1). The ADI method is a well-known method for solving PDE. Its main feature is that the sweep directions alternate. In contrast to the standard finite-difference formulation, which needs only one iteration to advance from the nth to the (n + 1)th time step, the ADI formulation requires multilevel intermediate steps to advance from the nth to the (n + 1)th time step. Equation (3.2) can be rewritten as (3.3) [equation not recovered], in terms of the operators I and A_m and the constants C_0 and C_1 [definitions not recovered], with v obtained by the extrapolation method.
Then, splitting (3.3) by using an ADI procedure as in [17], we get a set of recursion relations [equations not recovered] that involve intermediate solutions and the desired solution. With A_2 and A_3 on the left side of (3.14) and (3.16), we get the 3-D ADI algorithm as in Table 1.
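As an illustration of what one directional sweep involves, the following is a minimal sketch, assuming (as is typical for ADI schemes built on central differences) that each sub-iteration reduces to independent tridiagonal systems along one grid line. The function name `thomas_solve` and the coefficient values are hypothetical; they are not the coefficients of (3.14) or (3.16), which are not reproduced here.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system by the Thomas algorithm.

    Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored. No pivoting is done, so the
    matrix is assumed diagonally dominant (an assumption of this sketch).
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination along the line.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Hypothetical 3-node line with stencil (-1, 2, -1) whose solution is [1, 2, 3].
line_solution = thomas_solve([0.0, -1.0, -1.0], [2.0, 2.0, 2.0],
                             [-1.0, -1.0, 0.0], [0.0, 0.0, 4.0])
```

In an x-sweep, a solver of this kind would be called once per (y, z) grid line, and similarly for the y- and z-sweeps, which is what makes the lines within a sweep independent of one another.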

A. The Parallel Platform
The Geranium Cadcam Cluster consists of 32 Intel Pentium dual-core processors at 1.73 GHz with 0.99 GB RAM. Communication is through Fast Ethernet at 100 Mbit/s. The cluster runs Linux and is located at the University of Malaya. The cluster has high memory bandwidth, with message passing supported by MPI [13]. The program is written in C and accesses MPI by calling MPI library routines. The platform carries out computations on a varying set of mesh sizes. Performance on the platform concerns resource assessment and code placement on computing resources [5]. The 3-DTEL with the ADI scheme is implemented on the GCC cluster with the objective of evaluating the overhead it introduces and its ability to exploit the inherent parallelism of the computation. We observed the scalability across varying numbers of processors and mesh sizes. To obtain a speedup, we need convergence in fewer than N iterations.

B. Domain Decomposition
The parallelization of the computations is implemented by means of a grid partitioning technique. The computing domain is decomposed into many blocks with reasonable geometries. Along the block interfaces, auxiliary control volumes containing the corresponding boundary values of the neighboring block are introduced, so that the grids of neighboring blocks overlap at the boundary. When the domain is split, each block is given an ID number by a "master" task, which assigns these sub-domains to "slave" tasks running on individual processors. To couple the sub-domains' calculations, the boundary data of neighboring blocks have to be exchanged after each iteration. The calculations in the sub-domains use the old values at the sub-domains' boundaries as boundary conditions. This may affect the convergence rate; however, because the algorithm is implicit, the block strategy can preserve nearly the same accuracy as the sequential program.
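The overlap at block interfaces can be sketched in one dimension as follows. This is a serial illustration only, assuming one auxiliary (ghost) cell per interface and a grid length divisible by the number of blocks; the names `split_with_ghosts` and `exchange_ghosts` are hypothetical, and in the actual program the copies would be MPI messages between neighboring processors.

```python
def split_with_ghosts(values, nblocks):
    """Split a 1-D grid into equal blocks, each padded with one ghost cell per side.

    Assumes len(values) is divisible by nblocks. Ghost cells start as None.
    """
    n = len(values) // nblocks
    blocks = []
    for b in range(nblocks):
        interior = list(values[b * n:(b + 1) * n])
        blocks.append([None] + interior + [None])
    return blocks


def exchange_ghosts(blocks):
    """Copy boundary values of neighboring blocks into each other's ghost cells."""
    for b in range(len(blocks) - 1):
        left, right = blocks[b], blocks[b + 1]
        left[-1] = right[1]   # first interior value of the right neighbor
        right[0] = left[-2]   # last interior value of the left neighbor
    return blocks


blocks = exchange_ghosts(split_with_ghosts([0, 1, 2, 3, 4, 5], 2))
# blocks is now [[None, 0, 1, 2, 3], [2, 3, 4, 5, None]]
```

The outermost ghost cells stay empty here because they correspond to the physical boundary, where the boundary conditions of the problem would be applied instead.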
DD is used to distribute data between different processors, and static load balancing is used to maintain the same number of computational points on each processor. The partitioning and load balancing are done in the pre-processing stage, so no extra storage is required when the parallel program is executed. SPMD originated from data parallelism [17]; thus, the finite difference approximation used in this paper can be treated as an SPMD problem. The same computation is performed on multiple data sets, where the multiple data are different parts of the overall grid.
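A static distribution of grid points can be sketched as below. This is one common way to compute it in an SPMD program, where every processor runs the same function with its own rank; the function name `block_range` and the choice to give the remainder to the lowest ranks are assumptions of this sketch, not details taken from the implementation.

```python
def block_range(n_nodes, n_procs, rank):
    """Return the half-open index range [start, end) owned by `rank`.

    Nodes are split as evenly as possible; when n_nodes is not divisible
    by n_procs, the first (n_nodes % n_procs) ranks get one extra node.
    """
    base, extra = divmod(n_nodes, n_procs)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return start, start + size


# Every rank evaluates the same expression with its own rank number.
ranges = [block_range(70, 4, rank) for rank in range(4)]
# ranges == [(0, 18), (18, 36), (36, 53), (53, 70)]
```

With 70 nodes along one direction and 4 processors, the block sizes differ by at most one node, which is the sense in which the load is balanced statically before the computation starts.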

C. Parallel ADI with MPI
We focus on computational domain partitions in implementing the parallel 3-DTEL ADI scheme on the GCC platform. The dimensions need to be divided into sub-domains, and there is no unique way of partitioning the domain of computation. Striking a balance between the implementation of the algorithm and communication efficiency is paramount. The partitioning considered is one in which the orientation of the slices changes with the sweeps, according to [4].
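Because the partitioning is not unique, candidate partitionings can be compared by a simple communication proxy. The sketch below, under the assumption that communication cost grows with the total area of the internal cut planes, enumerates the ways of arranging p processors as a px x py x pz grid and picks the arrangement with the smallest interface area. The function names are hypothetical, and this is not the slice-reorienting scheme of [4].

```python
def interface_area(shape, procs):
    """Total area (in grid faces) of the internal cut planes.

    shape = (nx, ny, nz) grid points; procs = (px, py, pz) processor grid.
    A cut normal to x has area ny*nz, and there are px - 1 such cuts.
    """
    nx, ny, nz = shape
    px, py, pz = procs
    return (px - 1) * ny * nz + (py - 1) * nx * nz + (pz - 1) * nx * ny


def best_processor_grid(shape, p):
    """Return the (px, py, pz) with px*py*pz == p that minimizes interface area."""
    candidates = [(px, py, p // (px * py))
                  for px in range(1, p + 1) if p % px == 0
                  for py in range(1, p // px + 1) if (p // px) % py == 0]
    # Ties are broken by the tuple itself so the result is deterministic.
    return min(candidates, key=lambda procs: (interface_area(shape, procs), procs))


grid_choice = best_processor_grid((70, 70, 6), 4)
# grid_choice == (2, 2, 1), with interface_area == 840
```

For the thin 70x70x6 mesh, cutting along z is expensive because each such cut exposes a 70x70 face, so the proxy favors splitting only in x and y.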

After the x-sweeps, the orientation changes to the y or the z direction. In this process each processor owns three data domains, one for each direction. Implementing the parallel algorithm for solving (3.1) is based on an indication of the sweeping direction for each sub-domain. The sweeping direction of each sub-domain must be opposite to that of its neighbors. For example, we must use the left-to-right direction for odd sub-domains and the right-to-left direction for even sub-domains. The start node of each sub-domain is updated with (3.14) and (3.16). Each processor of the parallel machine works only on its specific portion of the grid, and when a processor needs information from its nearest neighbor, a message is passed through the MPI message-passing library. For the best parallel performance, one would like to have optimal load balancing and as little communication between processors as possible. Considering load balancing first, one would like each processor to do exactly the same amount of work, so that no processor is idle. For the finite difference code, the basic computational element is usually the node, so it makes sense to partition the grid such that each processor gets an equal number of nodes to work on. The second criterion is that the amount of communication between processors be made as small as possible. To minimize communication, the program must divide the domain in a way that minimizes the length of the touching faces of the different sub-domains. The number of processors that one processor has to communicate with also contributes to additional communication time, because of the latency penalty for starting each new message. At the first step, we divide the spatial computational domain to

D. Load Balancing
With static load balancing, the computation time of parallel subtasks should be relatively uniform across processors; otherwise, some processors will be idle waiting for others to finish their subtasks. Therefore, the domain decomposition should be reasonably uniform. Better load balancing is achieved with the pool-of-tasks strategy, which is often used in master-slave programming [2]: the master task keeps track of idle slaves in the distributed pool and sends the next task to the first available idle slave. With this strategy, the processors are kept busy until there are no further tasks in the pool. If the tasks vary in complexity, the most complex tasks are sent out first, to the most powerful processors. With this strategy, the number of sub-domains should be relatively large compared to the number of processors.
Otherwise, the slave solving the last block sent will force the others to wait for the completion of its task; this is especially true if this processor happens to be the least powerful in the distributed system. The block size should not be too small either, since the overlap of nodes at the interfaces of the sub-domains becomes significant. This results in a doubling of the computations of some variables on the interfacial nodes, leading to reduced efficiency. Increasing the number of blocks also lengthens the execution time of the master program, which likewise reduces efficiency.
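The pool-of-tasks strategy can be simulated with a short sketch. It assumes identical slaves and known task costs, both simplifications made for illustration; the function name `pool_of_tasks` and the cost values are hypothetical. The master hands the most expensive remaining task to whichever slave becomes idle first.

```python
import heapq


def pool_of_tasks(task_costs, n_workers):
    """Simulate a master handing tasks to the first idle worker.

    Returns (makespan, assignment), where assignment maps each worker
    index to the list of task costs it processed, in order.
    """
    # Heap of (time at which the worker becomes idle, worker index).
    idle_at = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(idle_at)
    assignment = {w: [] for w in range(n_workers)}
    # Most complex tasks are sent out first.
    for cost in sorted(task_costs, reverse=True):
        t, w = heapq.heappop(idle_at)
        assignment[w].append(cost)
        heapq.heappush(idle_at, (t + cost, w))
    makespan = max(t for t, _ in idle_at)
    return makespan, assignment


many_small = pool_of_tasks([5, 3, 3, 2, 2, 1], 2)   # total work 16
few_large = pool_of_tasks([4, 4, 4], 2)             # total work 12
```

In the first case the two workers finish together at time 8, the ideal value for 16 units of work. In the second case the makespan is also 8 although only 12 units of work exist, because one worker sits idle while the other solves the last block; this is the situation that a larger number of sub-domains per processor helps to avoid.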

E. Speedup and Efficiency
A simple speedup analysis with reference to [2] produces a speedup expression [equation not recovered], where r is the ratio of the time taken by coarse propagation to that taken by fine propagation over the same time interval, K is the number of iterations required for convergence, and communication overhead is ignored. In the limit as r tends to 0, the efficiency will be 1/K.
The algorithm for the scheme is performed on a distributed memory system of p processors and assumes that each processor initially stores n = N/p objects distributed over the entire physical domain. In the first iteration of the algorithm, the domain is decomposed into two sub-domains so that the difference between the sums of the weights of the sub-domains is as small as possible. The same process is then applied to the two sub-domains in parallel, and the process is repeated recursively for log p iterations. In other words, at the beginning of iteration i, the problem domain is already partitioned into 2^(i-1) sub-domains, and the objects in each sub-domain are stored in a single group of processors. At the end of the iteration, each processor group is divided into two groups, and the corresponding sub-domain is divided into two sub-domains, with the objects in one sub-domain residing in one half of the processors and the objects in the other sub-domain residing in the other half of the processors.
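The recursive decomposition described above can be sketched serially as follows. The sketch assumes that the objects are already ordered along one coordinate, that p is a power of two, and that every sub-domain holds at least two objects when it is split; the function names are hypothetical, and the real algorithm performs the splits of one iteration in parallel across processor groups.

```python
def bisect(weights):
    """Return the split index k minimizing |sum(weights[:k]) - sum(weights[k:])|."""
    total = sum(weights)
    best_k, best_diff, left = 1, None, 0
    for k in range(1, len(weights)):
        left += weights[k - 1]
        diff = abs(total - 2 * left)
        if best_diff is None or diff < best_diff:
            best_k, best_diff = k, diff
    return best_k


def recursive_bisection(weights, p):
    """Partition `weights` into p contiguous sub-domains (p a power of two).

    Iteration i starts with 2**(i-1) sub-domains and ends with 2**i,
    so log2(p) iterations are performed in total.
    """
    parts = [list(weights)]
    while len(parts) < p:
        next_parts = []
        for part in parts:
            k = bisect(part)
            next_parts.extend([part[:k], part[k:]])
        parts = next_parts
    return parts


uniform = recursive_bisection([1] * 8, 4)
weighted = recursive_bisection([4, 1, 1, 2, 2, 2], 2)
```

With uniform weights the result is four sub-domains of two objects each; with the non-uniform weights the single split falls after the third object, giving two sub-domains of total weight 6.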

V. RESULTS AND DISCUSSION
Consider the Telegraph Equation of the form: [equation not recovered]. The results in the Tables show that the parallel efficiency increases with increasing grid size for a given block number, and decreases with increasing block number for a given grid size. Increasing the number of processors decreases the execution time, but a point is reached beyond which additional processors have little impact on the total execution time. Moreover, as the number of processors increases, balancing the number of computational cells per processor becomes difficult, leading to significant load imbalance. The increase in execution time for certain mesh sizes is due to the uneven distribution of the computational cells, and the execution time changes only slightly due to the influence of DD on performance in parallel computation.
The total CPU time is composed of three parts: the CPU time for the master task, T_M, the average slave CPU time for data communication, T_SD, and the average slave CPU time for computation, T_SC, so that T = T_M + T_SD + T_SC.
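The bookkeeping behind the timing tables can be sketched as below. The total is the sum of the three parts just described; the speedup and efficiency use the standard definitions (one-processor time divided by parallel time, and speedup divided by the number of processors), which are assumed here rather than quoted from the paper. The function names and the numbers in the example are hypothetical and are not measurements from the GCC cluster.

```python
def total_time(t_master, t_slave_data, t_slave_comp):
    """Total CPU time as the sum of the master, slave-data and slave-compute parts."""
    return t_master + t_slave_data + t_slave_comp


def speedup_and_efficiency(t_one_proc, t_parallel, n_procs):
    """Standard parallel speedup S = T1 / Tp and efficiency E = S / p."""
    s = t_one_proc / t_parallel
    return s, s / n_procs


# Hypothetical timings in seconds, chosen only to make the arithmetic easy to follow.
t_par = total_time(1.0, 4.0, 20.0)                    # 25.0 s on 8 processors
s_par, e_par = speedup_and_efficiency(100.0, t_par, 8)
# s_par == 4.0 and e_par == 0.5
```

An efficiency of 0.5 in this made-up case would mean that half of the aggregate processor time goes to master work, communication and idling rather than to useful computation.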

B. Numerical Efficiency
The numerical efficiency includes the DD efficiency and the convergence rate behavior. The DD efficiency includes the increase in floating point operations induced by grid overlap at interfaces and the CPU time variation generated by DD techniques. In Table 5, we list the total CPU time distribution over various grid sizes and block numbers running on only one processor. From this table, the DD efficiency can be calculated; the result is shown in Fig. 3. Note that the DD efficiency can be greater than one, even with one processor. Fig. 3 also shows that the optimum number of sub-domains, which maximizes the DD efficiency E_DD, increases with the grid size. The convergence rate behavior, defined as the ratio of the iteration number for the best sequential CPU time on one processor to the iteration number for the parallel CPU time on n processors, describes the increase in the number of iterations required by the parallel method to achieve a specified accuracy, compared to the serial method. This increase is caused mainly by the deterioration in the rate of convergence with increasing numbers of processors and sub-domains. Because the best serial algorithm is not generally known, we take the existing parallel program running on one processor in its place. The question is then how the decomposition strategy affects the convergence rate. The results are summarized in Table 6 and Fig. 4, and Table 7 and Fig. 5.
It can be seen that the convergence rate decreases with increasing block number and increasing number of processors for a given grid size. The larger the grid size, the higher the convergence rate.

VI. CONCLUSION
The results presented in this paper show the study of the parallel design and analysis for the 3-DTEL ADI scheme with MPI. The objective is to present a design for the GCC for distributed computation, since such designs depend on empirical concerns. The system allows a parallel collection of overlapping communication to avoid unnecessary synchronization and to benefit parallel convergence. In addition to the ease of use of our platform, comparison with other approaches shows negligible overhead, with effective load scheduling over various mesh sizes, which produces the expected inherent speedups. It was also confirmed that flexible scheduling for the overlapping communication is important, and that this is easy with the SPMD model, as seen from the Tables and Figures. The computational results obtained clearly show the benefits of parallelization. The DD greatly influences the performance of the 3-DTEL ADI scheme on parallel computers. On the basis of the current parallelization strategy, more sophisticated models can be attacked efficiently. We are also interested in improving our algorithms and testing implementations on additional architectures.
the non-blocking message passing for this communication stage to reduce computing time by allowing work to be done while communication is in progress.

Efficiency
The speedup and efficiency obtained for mesh sizes from 70x70x6 to 210x210x6, for various numbers of sub-domains and for B = 50, are listed in Tables 2-4. In the Tables we also list the wall (elapsed) time for the master task, T_W (this is necessarily greater than the maximum wall time returned by the slaves), and the master CPU time, T_M, all in seconds. The speedup and efficiency versus the number of processors are shown in Fig. 1 and Fig. 2, respectively, with the block number B as a parameter.

TABLE II.
THE WALL TIME TW, THE MASTER TIME TM, THE SLAVE DATA TIME TSD, THE SLAVE COMPUTATIONAL TIME TSC, THE TOTAL TIME T, THE PARALLEL SPEED-UP SPAR AND THE EFFICIENCY EPAR FOR A MESH OF 70X70X6, WITH B = 50 BLOCKS AND NITER = 100.

TABLE III.
THE WALL TIME TW, THE MASTER TIME TM, THE SLAVE DATA TIME TSD, THE SLAVE COMPUTATIONAL TIME TSC, THE TOTAL TIME T, THE PARALLEL SPEED-UP SPAR AND THE EFFICIENCY EPAR FOR A MESH OF 120X120X6, WITH B = 50 BLOCKS AND NITER = 100.

TABLE IV.
THE WALL TIME TW, THE MASTER TIME TM, THE SLAVE DATA TIME TSD, THE SLAVE COMPUTATIONAL TIME TSC, THE TOTAL TIME T, THE PARALLEL SPEED-UP SPAR AND THE EFFICIENCY EPAR FOR A MESH OF 210X210X6, WITH B = 50 BLOCKS AND NITER = 100.

TABLE VII.
THE NUMBER OF ITERATIONS TO ACHIEVE A GIVEN TOLERANCE OF 10^-2 FOR A GRID OF 120X120X6