A Secured Large Heterogeneous HPC Cluster System using Massive Parallel Programming Model with Accelerated GPUs

High Performance Computing (HPC) architectures are expected to deliver the first ExaFlops computer. Such an Exascale computing system will be able to perform ExaFlops-scale calculations every second, a thousand-fold increase over current Petascale systems. Current technologies face several challenges in moving toward such an extreme computing system. It is anticipated that billion-way parallelism will have to be exploited to build a secured Exascale-level system that provides massive performance under predefined limitations such as the number of processing cores and power consumption. However, key strategic elements are still required to develop a secured, energy-efficient ExaFlops-level system. This study proposes a non-blocking, overlapping and GPU-computation-based tri-hybrid model (OpenMP, CUDA and MPI) that provides massive parallelism through different granularity levels. We implemented three different message passing strategies and performed the experiments on the Aziz Fujitsu PRIMERGY CX400 supercomputer. A comprehensive experimental study was conducted to validate the performance and energy efficiency of our model. The experimental investigation shows that the Efficient Parallel Computing (EPC) model could be considered an initial and leading model for achieving massive performance through an efficient scheme for Exascale computing systems.

Keywords: High Performance Computing (HPC); MPI; OpenMP; CUDA; Supercomputing Systems


I. INTRODUCTION
For the last three decades, High Performance Computing (HPC) has played a fundamental role in scientific endeavour, with vendors striving to improve system performance by dramatically increasing on-chip parallelism. According to the Top-500 supercomputers list, system performance has improved by 10x roughly every 3.6 years [1]. In 2012, the Titan Cray XK7 supercomputer achieved 18 PFLOPs at a power consumption of 8.3 MW [2]. Continuing the drive to enhance system performance for solving complex problems, Tianhe-II, the current supercomputer built by NUDT, delivers 55.2 PFLOPs at a power consumption of 17 MW [3]. The computational demand of solving complex problems has motivated the development of a new class of supercomputer [4]. This Exascale computing system will be able to perform 10^18 floating-point operations per second, a thousand-fold increase over current Petascale systems. It is expected that an Exascale computing system will comprise a very large number of heterogeneous compute nodes connected by complex networks [5]. The essential issue for HPC systems is that such an extreme (Exascale) computing system does not exist yet; everything about Exascale so far is prediction and expectation. To improve system throughput, the trend has shifted from the traditional approach of doubling clock speeds to doubling the number of cores, threads or other parallelizing mechanisms [4]. It is predicted that an Exascale computing system will comprise millions of cores across heterogeneous devices, including CPUs and GPUs.

A. Exascale Computing Limitations and Challenges
With the technology and programming approaches used in existing Petascale computing systems, power consumption is about 25 to 60 MW when using 30 M cores. The power demand of an Exascale computing system built this way would exceed 130 MW [6]. The United States Department of Energy defined some primary constraints: power consumption of roughly 20-30 MW, development cost (DC) of up to 200 million US dollars, delivery time (DT) by 2020, and about 100 million cores [7]. Developing the targeted Exascale supercomputer within these constraints is a tremendous challenge for vendors and development communities.
On the path to such a massively powerful computing system, several challenges still block development toward emerging HPC systems. Some primary Exascale computing challenges discussed in [7] are presented in Table I. For the 21st century, these critical challenges are the key to creating innovative solutions for Exascale computing systems. A dramatic reformulation at both hardware and software levels, covering programming models, energy-efficient strategies, debugging tools and redesigned algorithms, is required to achieve computation at ExaFlops [8]. Over the last few years, development toward Exascale computing systems has been rapid, and many new approaches have been proposed to address the listed challenges.

B. Software Technology Navigation
In the current study, our contribution relates to challenges 1, 2, and 5 from Table I: improving system performance through efficient and massive parallelism under minimum power consumption. From software perspectives, still it has not
The rest of the paper is organized as follows. Section II, related work, describes the existing state-of-the-art approaches at single, dual, and tri levels. Section III gives a comprehensive overview of the proposed EPC model, its features and components. Section IV presents the experimental platform and applications used to evaluate the EPC model. Finally, Section V concludes and summarizes the results.

Exascale Algorithms
New algorithms should be proposed to manage massive parallelism and advance programming.

Discovery and Design Algorithms
Discovery should be facilitated by mathematical models.

Resilience & Correctness
Faults and verification challenges should be addressed.

Scientific Productivity
Scientific productivity needs to be improved through novel software tools.

Power Consumption Management
Power consumed by the system and its management

II. RELATED WORK
Moving toward High Performance Computing (HPC), emerging hardware and software technologies for Petascale computing systems were examined in [11]. Leading up to Petascale computing, many hardware-side techniques were studied, such as conventional technology, Processing-In-Memory (PIM) designs, digital superconductor technologies, Computational Fluid Dynamics (CFD), special-purpose hardware, Web-based Petascale computing, molecular nanotechnology and intelligent planetary spacecraft [12]. Data-parallel programming languages and techniques for Petascale systems were proposed [13]. These models were able to achieve parallelism at both coarse-grain and fine-grain levels using traditional homogeneous systems on multicore CPU devices [14].
At the end of the last decade, to bring scalability to systems, the technology trend changed from traditional homogeneous to heterogeneous cluster systems, where many-core devices were introduced, such as the General Purpose Graphics Processing Unit (GPGPU) and Graphics Processing Unit (GPU) by NVIDIA [15] and MIC (Many Integrated Core) by Intel [16]. These accelerator devices are based on Single Instruction Multiple Data (SIMD) from Flynn's classification. Alongside these devices, many parallel programming models have been proposed, such as CUDA, OpenACC, and OpenCL. It is anticipated that these parallel programming models could be promising for achieving the massive parallelism required by future Exascale computing systems [17]. In any case, to use such powerful devices and models, a key element of the strategy is the co-design of applications, architectures and programming environments at both hardware and software levels.
Regarding development toward HPC Exascale computing systems, China has progressed rapidly and introduced the Tianhe II HPC system in 2014 [18]. It later introduced the upgraded version, named Tianhe III [19]. Similarly, DEEP (Dynamical Exascale Entry Platform), started by the European Union in 2011 [20], began work toward a new HPC Exascale computing system. The SERT project, funded by NAG, took the initiative to introduce the first Exascale computing system in 2020 [21,22]. In Japan, RIKEN [23] claimed it would present the first Exascale computing system at the start of 2020. Further, the Indian Government also began Exascale computing development in 2018 and claimed it would introduce a system in 2022 [24].

III. PRELIMINARIES
MPI offers many different schemes for programming a cluster system. Traditionally, two prevalent methods, MPI blocking (synchronous) and non-blocking (asynchronous), are used to distribute data over a cluster system [25,26,27]. In legacy systems, all processing was performed by CPU cores using the MPI blocking method. Consequently, processing over CPU cores was very costly with respect to energy consumption and processing efficiency. Therefore, new energy-efficient SIMD (single instruction multiple data) based devices (GPUs, MIC) containing thousands of cores were introduced. These cores compute data in parallel and consequently reduce processing time and power consumption. Because of this parallel computation, data processing over GPU cores is very fast and requires rapid data input. MPI non-blocking is therefore an appropriate approach to fully utilize these powerful devices and achieve maximum performance. In the current study, we discuss three fundamental MPI non-blocking schemes, as follows:

A. (S1)-MPI Non-Blocking, no Overlapping Computation
In the first strategy (S1), MPI non-blocking with no overlapping, computation does not overlap with communication during data processing [28]. This scheme behaves much like a blocking mechanism, where all resources are reserved until the whole processing is completed. One disadvantage of this scheme is that many resources stay reserved even though they have finished their assigned tasks. Although MPI communication is capable of overlapping with CUDA, we avoided overlapping in this implementation. While exchanging data from multiple arrays, MPI scatters and gathers data for one edge while the memory copy operation is proceeding for other components.

B. (S2)-MPI Non-Blocking, Overlapping Computation
The second implemented data distribution strategy (S2) was MPI non-blocking with overlapping computation, where the CUDA copy operation was overlapped with MPI communication. In this strategy, the CUDA kernel was decomposed into three portions, where the top and bottom edges were separated from the middle. In this way, the kernel started with the edges that were to be computed, rather than starting the exchange over the entire domain. Following the non-blocking MPI mechanism, the first portion started its copy operation from device to host. Immediately after the copy to host completed, the middle portion of the domain started computation. Similarly, the last part of the exchange operation started as soon as the middle portion completed its computation. This implementation strategy can be made more effective by improving the overlap of the middle portion's computation.
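The three-way decomposition described for S2 can be illustrated with a minimal C++ sketch. It is not the implementation used in the experiments: the names `RowRange`, `DomainSplit` and `split_domain`, and the `halo` parameter (the assumed edge thickness in rows), are hypothetical and serve only to show how a local domain could be cut into top edge, middle and bottom edge.

```cpp
#include <cassert>
#include <cstddef>

// Half-open row range [begin, end) of a 2-D domain slab.
struct RowRange {
    std::size_t begin;
    std::size_t end;
    std::size_t size() const { return end - begin; }
};

// The three portions of a local domain, as described for S2.
struct DomainSplit {
    RowRange top;     // edge rows exchanged with the upper neighbour
    RowRange middle;  // interior rows that do not take part in the exchange
    RowRange bottom;  // edge rows exchanged with the lower neighbour
};

// Hypothetical helper: `halo` is the assumed edge thickness in rows.
inline DomainSplit split_domain(std::size_t rows, std::size_t halo) {
    assert(halo > 0 && rows >= 2 * halo);
    return DomainSplit{
        RowRange{0, halo},
        RowRange{halo, rows - halo},
        RowRange{rows - halo, rows}
    };
}
```

For a local slab of 1000 rows and a one-row edge, `split_domain(1000, 1)` yields rows [0, 1), [1, 999) and [999, 1000). The two edge ranges are the parts that would be copied and exchanged, while the middle range is the part whose computation can overlap with that exchange.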

C. (S3)-MPI Non-Blocking, Overlapping & GPU Computation
The final implementation (S3) was MPI non-blocking with the highest amount of overlapping, which is anticipated to be the best performing strategy for large scale cluster systems [29]. Using the asynchronous method, CUDA streams were enabled and computation started from the middle portion, which allowed massive overlapping of computation, MPI communication and memory operations. Importantly, only very small changes are needed inside the CUDA kernels to perform the computations. To optimize the GPU threads, a flag, along with the grid size and number of blocks, is passed to the kernels to indicate the specific portion to compute.

IV. EFFICIENT PARALLEL COMPUTING MODEL
We present the proposed EPC model, implemented in C++. Based on the predicted Exascale computing system, the EPC model is organized into three different computing environments: cluster system, compute node, and GPU computing. Each environment contains a separate layer of parallelism, as presented in Fig. 2.
The programmer interacts with the EPC model through an application written in C++. Before entering the parallelism zone, the programmer statically analyzes the data to determine which statements can be parallelized. Once the data is analyzed and ready for parallel computation, it enters the parallel computing zones described in the following sections.

A. Inter-Node Computation Layer
The first level of parallelism in the model is achieved through inter-node communication. Based on these parameters, the programmer analyzes the data and distributes it over the connected system nodes using a standardized, SPMD-based Message Passing Interface (MPI) library [30]. MPI blocking (synchronous) and non-blocking (asynchronous) are the two prevalent mechanisms used to transfer and gather data over the processors. Blocking mechanisms are used when strong synchronization is required because of data dependencies. In this case, the resources are held using predefined MPI wait statements until processing is finished. In our parallel computing model, data only needs to be distributed over the processors, which provides coarse-grain parallelism at this level; we therefore chose the third MPI non-blocking strategy discussed in the previous section, "non-blocking, overlapping with GPU computation". In this strategy, once data has been transferred asynchronously over the connected nodes, it enters the second level of parallelism, described in the following section.

B. Intra-Node Computation Layer
The proposed model provides the second level of parallelism at intra-node computation. At this level, the data distributed across MPI processes is further shared among CPU threads for parallel processing. At this stage, OpenMP pragmas are used to parallelize blocks of code for either fine-grain or coarse-grain computation. OpenMP threads run on the system-specified threads over the CPU cores and complete the execution. Recent OpenMP versions allow multiple OpenMP pragmas to be applied to multiple blocks nested within a single block, which makes fine-grain parallelism achievable in the code.
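As a minimal illustration of this layer, the following C++ sketch parallelizes a loop over the chunk of data owned by one MPI process with a single OpenMP pragma. The function name `axpy` and the choice of a scale-and-add loop are assumptions made for the example and are not taken from the EPC implementation.

```cpp
#include <cassert>
#include <vector>

// Scale-and-add (y = y + a * x) over the data chunk owned by one process.
// With OpenMP enabled (e.g. -fopenmp) the iterations are shared among CPU
// threads; without it the pragma is ignored and the loop runs serially.
inline void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        y[i] += a * x[i];  // iterations are independent, so no synchronization is needed
    }
}
```

Because each iteration touches only its own element, the loop gives the same result whether it runs on one thread or many, which is the property the programmer has to establish in the static analysis step before adding the pragma.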

C. GPU Computation Layer Acceleration
The last level of parallelism in our proposed model is GPU computation. In this layer, all processing is performed on GPU accelerator devices. First, the data is transferred from the CPU to the GPU, where it is distributed over GPU warps. In NVIDIA's GPU structure each warp consists of 32 threads, while the number of warps varies from one GPU architecture to another. Once the data has been transferred to the GPU, the GPU kernel divides the tasks among multiple GPU warps and performs all the operations in parallel. Different accelerator devices, such as NVIDIA GPUs and AMD GPUs, can be used for GPU computation. For the current study, to maintain maximum support for C++, we selected NVIDIA GPUs and implemented the model accordingly.
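The relation between problem size, thread blocks and warps can be sketched with a few lines of host-side C++. This is only an arithmetic illustration under the assumption of a 32-thread warp (as on NVIDIA GPUs); the names `kWarpSize`, `warps_for` and `blocks_for` are hypothetical and the actual launch configuration used in the experiments is not stated in the text.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;  // threads per warp on NVIDIA GPUs

// Number of warps needed to cover `threads` threads (rounded up).
constexpr std::size_t warps_for(std::size_t threads) {
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Number of thread blocks needed so that every element gets one thread.
inline std::size_t blocks_for(std::size_t elements, std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (elements + threads_per_block - 1) / threads_per_block;
}
```

For example, a block of 256 threads is executed as 8 warps, and covering 1000 elements with 256-thread blocks takes 4 blocks, the last of which is only partly used.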
In the past, low overlap between CPU and GPU wasted resources, because GPU threads remained idle until processing by other kernels was finished. This inefficiency was usually found in the MPI non-blocking non-overlapping and non-blocking low-overlapping strategies, which consequently wasted resources and decreased system efficiency.
Although MPI non-blocking was implemented in the existing design, as shown in Fig. 3(a), the kernel waiting state and separate progress decreased efficiency. In each broadcast, the Isend() function passed through three states: kernel initialization, kernel waiting, and start of data sending. During Isend(), the kernel stream remained reserved across these states. Only once the first kernel stream completed did the next one start processing. In this way, each stream wasted resources during its waiting state. Conversely, in our proposed design, we organized these three states for every Isend() broadcast so that kernels overlap and are initialized immediately one after another. Therefore, all kernel streams now overlap, and each can start processing as soon as it receives data. The minor remaining waiting state is negligible because the data sending process can start as soon as the previous stage completes. Fig. 3(b) shows the clear benefit of the proposed design, which minimizes processing delay and increases efficiency through higher overlapping.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 5, 2020, www.ijacsa.thesai.org

A. Platform
To perform the experiments, we used the Aziz supercomputer, ranked 360th among the top supercomputers in 2015 and located at the High Performance Computing Centre (HPCC), King Abdulaziz University. Aziz contains Xeon CPU processors along with GPU devices [31]. In total, Aziz comprises 11904 cores, including both CPU and GPU cores. Regarding memory, Aziz is configured with 96 GB nodes and 256 GB nodes, where each node has a 2.4 GHz, 12-core processor and runs the CentOS 6.4 operating system. All nodes are connected through InfiniBand to make communication more efficient. In terms of overall capability, Aziz achieves about 211 TFlops Linpack performance and about 228 TFlops theoretical peak [32].

B. Performance Measurement
The primary factor in High Performance Computing systems is performance [33]. Conventionally, the performance of a computer system is expressed in Flops, by relating the peak performance to the number of Flops achieved when executing the targeted application, as described in equation 1. If F_p denotes the peak floating-point rate and F_m the number of Flops measured for the targeted application, then F_c can be determined as:

$$F_c = \frac{F_m}{F_p} \qquad (1)$$
Using the peak performance of Aziz, we quantified the performance by executing the targeted HPC applications on the different datasets described in the following sections.
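Assuming that the performance metric is the ratio of the measured Flops of the application to the peak Flops of the machine, it can be computed as in the following C++ sketch. The function name `fraction_of_peak` is hypothetical, and the ratio form is an assumption based on the definitions of F_p and F_m given above.

```cpp
#include <cassert>

// Fraction of peak performance: measured flop rate divided by peak flop rate.
// Both arguments must use the same unit (e.g. TFlops).
inline double fraction_of_peak(double measured_flops, double peak_flops) {
    assert(peak_flops > 0.0);
    return measured_flops / peak_flops;
}
```

For instance, an application sustaining 50 TFlops on a machine with a 200 TFlops peak would reach a fraction of 0.25, i.e. 25% of peak.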

C. Power Measurement
The second most important metric is power consumption, which is the primary challenge for current and emerging HPC systems. In the current study, we consider power consumption from different perspectives. Conventionally, power consumption can be categorized in two ways: the power consumed at system level without running a specific application, and the power consumed while computing a specific application [30]. The two categories are specified by equations 2 and 3.
In equation 2, the power consumed by the system is the sum of the power consumed by the configured GPUs, CPUs and the mainboard.
Similarly, equation 3 describes the power consumed by the system while running a specific application, which is again the sum of the power consumed by the configured GPUs, CPUs and the mainboard.
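One possible way to write the two sums just described is given below. The symbols are assumptions introduced for illustration and are not the authors' notation: N_GPU and N_CPU are the numbers of configured GPUs and CPUs, and the superscript "app" marks the power drawn while the application is running.

```latex
\begin{align}
P_{\text{system}} &= \sum_{i=1}^{N_{\text{GPU}}} P_{\text{GPU},i}
                   + \sum_{j=1}^{N_{\text{CPU}}} P_{\text{CPU},j}
                   + P_{\text{mainboard}} \\
P_{\text{system}}^{\text{app}} &= \sum_{i=1}^{N_{\text{GPU}}} P_{\text{GPU},i}^{\text{app}}
                   + \sum_{j=1}^{N_{\text{CPU}}} P_{\text{CPU},j}^{\text{app}}
                   + P_{\text{mainboard}}^{\text{app}}
\end{align}
```

Both expressions have the same structure; they differ only in whether the component powers are measured on the idle system or under application load.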

VI. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we present the results of the experiments, in which we implemented various numerical algorithms. In the first implementation, we ran the dense matrix multiplication (DMM) application with multiple datasets through the EPC model. The basic matrix multiplication method used in our implementation is given in equation (6), where each element of the product matrix is defined as the sum:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} \qquad (6)$$
To further investigate efficiency, we ran the DMM implementation in the proposed tri-level hybrid model with all the MPI strategies (S1, S2 and S3) discussed in Section III.
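For reference, the standard definition of the matrix product, in which each element of the result is the sum of products of a row of the first matrix and a column of the second, can be written as a short serial C++ routine. This is a plain baseline sketch and not the tri-level EPC implementation; the function name `matmul` and the row-major storage layout are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix product C = A * B, with A of size n x m
// and B of size m x p. Returns C of size n x p.
inline std::vector<double> matmul(const std::vector<double>& a,
                                  const std::vector<double>& b,
                                  std::size_t n, std::size_t m, std::size_t p) {
    assert(a.size() == n * m && b.size() == m * p);
    std::vector<double> c(n * p, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < m; ++k) {
            const double aik = a[i * m + k];
            // Accumulate the contribution of A(i,k) to the whole row i of C.
            for (std::size_t j = 0; j < p; ++j) {
                c[i * p + j] += aik * b[k * p + j];
            }
        }
    }
    return c;
}
```

The rows of the result are independent of each other, which is what allows the outer loop to be distributed across nodes, threads and GPU warps in a hybrid model.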
As the matrix multiplication dataset size increased, 'S3' gradually increased efficiency, showed the best performance compared to 'S1' and 'S2', and achieved 68% of peak performance in TFlops. In contrast, 'S1' and 'S2' only attained efficiency in the range of 700-800 GFlops, which was the initial throughput of the 'S3' implementation. With large datasets, we observed that the system efficiency of 'S1' declined, which was ultimately caused by excessive waiting during data distribution, as shown in Fig. 4.
Along with performance, we quantified energy efficiency, which is considered the primary metric for current and emerging HPC technologies. As with the performance results, 'S3' increased energy efficiency across all dataset sizes and achieved 8.2 GFlops/W, as shown in Fig. 5.
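The energy efficiency figures reported in GFlops/W combine a measured throughput with a measured power draw. The following C++ sketch shows the arithmetic under the assumption that total power is the sum over GPUs, CPUs and the mainboard, as stated in the power measurement section; the type `PowerReadings` and both function names are hypothetical.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical per-component power readings, all in watts.
struct PowerReadings {
    std::vector<double> gpu_watts;
    std::vector<double> cpu_watts;
    double mainboard_watts;
};

// Total power: sum over all GPUs, all CPUs and the mainboard.
inline double total_power_watts(const PowerReadings& r) {
    const double gpus = std::accumulate(r.gpu_watts.begin(), r.gpu_watts.end(), 0.0);
    const double cpus = std::accumulate(r.cpu_watts.begin(), r.cpu_watts.end(), 0.0);
    return gpus + cpus + r.mainboard_watts;
}

// Energy efficiency in GFlops per watt.
inline double gflops_per_watt(double gflops, double watts) {
    assert(watts > 0.0);
    return gflops / watts;
}
```

With two GPUs at 200 W each, one CPU at 100 W and a 50 W mainboard, the total is 550 W, and a throughput of 1100 GFlops would then correspond to 2 GFlops/W. These numbers are illustrative only and are not measurements from the experiments.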
Further, we implemented a 2-D Laplace application using the Jacobi iterative method, with which we ran all models. In general, partial differential equations are classified in a way similar to conic sections, but here we discuss only the elliptic equation with U_xx(x, y) + U_yy(x, y) [22]. The specific elliptic equation used in the current experiments, the "2-D Laplace equation" [23], is presented in equation 7:

$$U_{xx}(x, y) + U_{yy}(x, y) = 0 \qquad (7)$$
We implemented the 2-D Laplace Jacobi iterative method in the proposed EPC model using all strategies. The mesh size was increased over the range 1000-8000. Fig. 6 and 7 show the results of the 2-D Laplace method for both metrics (performance and energy efficiency). An efficiency for 'S1' similar to that in matrix multiplication was found for the 2-D Laplace solver, in the range of 390-700 GFlops. Although efficiency increased gradually with 'S1', we cannot rely on it because of its poor throughput.
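The Jacobi iteration for the 2-D Laplace equation replaces every interior mesh point by the average of its four neighbours and repeats this sweep until the values settle. A minimal serial C++ sketch of one sweep is given below; it is a baseline illustration and not the EPC implementation, and the function name `jacobi_sweep` and the row-major square grid are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One Jacobi sweep on an n x n row-major grid. Each interior point becomes
// the average of its four neighbours in the old grid; boundary values are
// copied unchanged (fixed boundary conditions).
inline std::vector<double> jacobi_sweep(const std::vector<double>& u, std::size_t n) {
    assert(n >= 3 && u.size() == n * n);
    std::vector<double> next(u);
    for (std::size_t i = 1; i + 1 < n; ++i) {
        for (std::size_t j = 1; j + 1 < n; ++j) {
            next[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j] +
                                      u[i * n + j - 1] + u[i * n + j + 1]);
        }
    }
    return next;
}
```

Every interior point reads only values from the previous sweep, so all points can be updated in parallel. Only the rows at the edges of a node's sub-domain depend on data from neighbouring nodes, which is why the edge and middle portions can be handled separately in the overlapping strategies.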
We also evaluated energy efficiency for the 2-D Laplace equation method (see Fig. 6). 'S3' provided the best energy efficiency compared to the other strategies. 'S2' was also notable, achieving energy efficiency of up to 8.3 GFlops/W, whereas 'S1' wasted a lot of energy throughout the computation and achieved 7.4 GFlops/W at the maximum mesh size.

VII. CONCLUSION
Emerging HPC systems are expected to produce the first ExaFlops computer, comprising a huge number of heterogeneous compute nodes connected by complex networks, within the next half decade. This Exascale computing system will be able to compute one ExaFlops every second, a thousand-fold increase over current Petascale systems. In the current study, we discussed the major constraints for Exascale systems and the prospective challenges for current technologies. The proposed model is a novel, secure and efficient parallel programming approach that is a tri-level hybrid of MPI, OpenMP and CUDA. In MPI, we implemented different strategies (S1, S2 and S3) under the non-blocking mechanism. To evaluate efficiency, the proposed model was implemented with all these strategies in two benchmark HPC applications, DMM and the two-dimensional Laplace equation. In both applications, we found that the 'S3' strategy (non-blocking, overlapping and GPU computation) delivered the best performance and energy efficiency compared to S1 and S2. Therefore, the proposed hybrid model with the 'S3' MPI strategy can be considered a promising model for achieving the performance and energy efficiency required by Exascale systems. In future work, this model needs to be executed on a large cluster system that meets the minimum requirements for Exascale system configurations.