Analysis of Resource Utilization on GPU

The problems arising from massive data storage and data analysis can be handled by recent technologies such as cloud computing and parallel computing. MapReduce, MPI, CUDA, OpenMP, and OpenCL are some of the widely available tools and techniques that use a multithreading approach. However, it is a challenging task to use these technologies effectively to handle compute-intensive problems in fields like life science, environment, fluid dynamics, image processing, etc. In this paper, we have used many-core platforms with graphics processing units (GPU) to implement sequence alignment, one of the most important and fundamental problems in the field of bioinformatics. Dynamic and concurrent kernel features offered by the graphics card are used to speed up the performance. With these features, we achieved a speed up of around 120X and 55X, respectively. We have coupled the well-known tiling technique with these features and observed a performance improvement of up to 4X (with concurrent kernels) and 2X (with dynamic kernels), as compared to non-tiling execution. The paper also analyses resource parameters and GPU occupancy, and proposes their relationship with the design parameters for the chosen algorithm. These observations have been quantified and the relationship between the parameters is presented. The results of the study can be extended further to study similar algorithms in this area.
Keywords: Dynamic kernel; GPU; Multithreading; occupancy; parallel computing


I. INTRODUCTION
Graphics hardware along with multi-core systems has emerged as a new combination for applications that have computationally demanding tasks to perform. Conventional graphics processors are now being used in various application domains, including general-purpose processing. Compute Unified Device Architecture (CUDA) provides tools to exploit resources on graphics processing units (GPU). With the help of these tools, it has become possible to handle compute-intensive applications by invoking hundreds of parallel threads performing the task. However, in order to achieve performance improvement, it is essential to understand the architecture of the hardware and its limitations. Algorithms need to be restructured according to the underlying hardware in order to achieve speed up.
The main aim of this paper is to study and analyse the huge computational power offered by graphics processors and utilize it to enhance the performance of the well-known problem of pair-wise sequence alignment. The paper discusses the parallelization of the sequence alignment problem on many-core platforms. The algorithm deals with finding the similarities between two or more biological sequences [DNA/protein]. The functional and structural relationships between two or more biological sequences can be found by sequence alignment methods like local and global alignment.
The similarity index can be used to explore the evolutionary relationship between the sequences. The Needleman-Wunsch [NW] [1] algorithm for global alignment and the Smith-Waterman [SW] [2] algorithm for local alignment are two widely used approaches based on the dynamic programming [DP] method. The algorithm generates a "score matrix" to track the similarities between two sequences. Every element of the matrix has three-fold data dependencies, in the north, west and northwest directions. As the size of the database increases, the searching time increases rapidly. Hence, the other approach is to use heuristic methods, such as FASTA and BLAST. Heuristic methods are faster than the DP approach, but do not always guarantee the correctness of results. The dynamic programming method is preferred over the heuristic approach for generating accurate results. With the availability of huge and ever increasing datasets, a serial CPU implementation of any method takes a very long time to produce the results, even on faster machines. Hence, over the past few years, the focus has been on parallel implementation of the problem. With the availability of highly parallel programming platforms, like many-core and multi-core machines, it has become possible to use them effectively to accelerate the performance of data parallel applications.
Due to the large volume of data and heavy data dependencies in the alignment problem, it is very difficult to map it directly onto a parallel platform. Hence, for parallel implementation, it is necessary to resolve these dependencies and then utilize the power of the thousands of cores supported by graphics cards (GPUs).
In this paper, we present a method for generating the score matrix for the pair-wise local sequence alignment problem using the tiling technique. This method is coupled with features like dynamic and concurrent kernel execution supported by the GPU card. The paper also presents the relationship of various design parameters with the resource parameters for improving the performance. The approach can easily be applied to algorithms like global sequence alignment and multiple sequence alignment.

II. RELATED WORK
Various strategies have been proposed in the literature to apply parallel computing methodology to the sequence alignment problem. The basic biological information about any species is represented in the form of sequences like DNA and protein. The sequence of an unknown species, or the sequence under investigation, is compared with the known sequences from a standard sequence repository. The result of the comparison shows the analogy or the differences between them. For the pair-wise sequence alignment method, two strategies are mainly used by researchers. The complexity of the alignment algorithm grows with the number of sequences and the length of each sequence (e.g. O(nm) for 2 sequences of lengths n and m). With the availability of huge data for analysis, it is really challenging for researchers to process the data and return the results within a reasonable time period, so that biologists can interpret the results quickly and carry out further analysis. With a sequential algorithm, it takes many hours or even days to produce correct results, especially for a large number of longer sequences. Hence, researchers have used accelerators to speed up the compute-intensive part of the algorithm. Because of the heavy data dependency, divergent code flow, and non-coalesced memory access, it is very difficult to parallelize the sequence alignment algorithm and map it directly onto the processing platform. However, researchers have implemented the algorithm using various strategies and hardware accelerators.
Field Programmable Gate Arrays (FPGA) and GPUs are the commonly used hardware accelerators for improving the execution time. A performance study of three applications on an FPGA and a GPU is presented in [5]. The authors have studied Gaussian Elimination, Data Encryption, and the Needleman-Wunsch algorithm. Factors like overall hardware features, application performance, programmability, and overhead are considered for mapping applications onto various accelerators.
A space-efficient global sequence alignment algorithm is presented by Scott Lloyd and Quinn O'Snell [6]. The authors presented the performance improvement in forward scan and trace back in hardware, without memory and I/O limitations. Parallel implementation of the sequence alignment problem was also studied for a clustering system [7] using the message passing interface [MPI] technique. The authors have discussed major models like the pipeline model and the anti-diagonal model for parallel implementation of the dynamic programming algorithm. Gotoh [8] has proposed an improved version of the SW algorithm with an affine penalty function. The algorithm proposed by Khajeh-Saeed, Poole, and Perot [9] enhances the parallelism by reconstructing the recurrence relations for multiple GPUs. An implementation of the SW algorithm on the NVIDIA GPU platform is presented by Lukas Ligowski and Witold Rudnicki [10]. The paper presents the performance improvement obtained by efficient use of shared memory on the graphics card. H. Khaled, R. El Gohary, N. L. Badr, et al. [11] have also presented a GPU implementation of the pair-wise DNA sequence alignment problem. This implementation assigns different nucleotide weights and then merges the subsequences of match on the GPU. The authors have obtained the optimal local alignment according to predefined rules. Pair-wise sequence alignment for very long sequences was done in [12]. The authors have developed a single GPU implementation of the problem and have presented two algorithms, BlockedAntidiagonal and StripedScore. The SW algorithm for protein databases, using SIMD instructions of CPU and GPU, is implemented in [13]. The paper presents the CUDASW++ 3.0 algorithm, which uses SSE-based vector execution units as accelerators. Yongchao Liu and Bertil Schmidt [14] have presented the GSWABE algorithm for the pair-wise sequence alignment problem for short DNA sequences. They have implemented a general tile-based approach for global, semi-global and local alignment algorithms on a Kepler-based Tesla K40 GPU. The same problem is also implemented for long DNA sequences on Xeon Phi coprocessors by [15]. The authors have explored naive, tiled and distributed approaches on the emerging platform.
Parallelization of similar problems, like approximate string matching on GPU [16] and finding edit distance for large sets of string pairs using the MapReduce technique [17] and on GPUs [18], has been done for performance improvement. Multiple sequence alignment [MSA] is one of the most widely used and computationally complex problems in the domain of computational biology. Algorithms for MSA must produce the highest score from the entire set of sequences, which makes it one of the complex optimization problems. Hence, heuristic methods are preferred over accurate methods. Jurate Daugelaite, Aisling O'Driscoll, and Roy D. Sleator [19] have summarized various MSA algorithms in distributed and cloud environments. High performance computing techniques have been used for MSA tools in [20]. The authors have developed the MTA-TCoffee tool. Optimal alignment of three sequences is presented by Junjie Li, Sanjay Ranka, and Sartaj Sahni [21]. The authors have also implemented a variant of global alignment, called syntenic alignment, in their paper [22]. Paper [23] presents a combination of the G-MSA and T-Coffee algorithms for improving the performance of MSA on GPU. A comparison and analysis of various high performance computing architectures in the fields of bioinformatics, computational biology and systems biology is presented in [24]. Global sequence alignment on a multi-core platform using GPU is discussed by Siriwardena and Ranasinghe [25].
This paper presents a GPU implementation of the pair-wise sequence alignment algorithm (SW) as a case study to map the resource requirement of the algorithm to the available resources. The main features of our work are as follows:
• The pair-wise SW algorithm is implemented on a CPU + single GPU platform. Multiple GPU implementations are presented in [9]. Allocation of the strings and the score matrix, deciding the block (tile) size and the number of blocks and threads, and launching concurrent kernels are done on the CPU side. The generation of the score matrix, use of registers, invoking a large number of threads, and launching the child kernel are done on the GPU side.
• The performance improvement using the memory hierarchies of the graphics card (like global memory, shared memory, constant memory, texture memory) has been discussed by [10] [11]. In this work, however, GPU resources like cores, threads, warps, blocks, and registers are studied.
• The focus of our implementation is to use GPU resources effectively and to explore features like multiple kernel execution supported by Kepler-based NVIDIA CUDA cards (K5200, K6000). These features were not considered by previous studies [11][12][13][14]. The paper [15] has implemented the problem on a Xeon Phi coprocessor, and not on a GPU.
• Our study mainly focuses on the use of resources like computing cores, registers per thread, shared memory per thread, and thread block size. These parameters contribute towards GPU occupancy. The large number of cores available on the graphics card can be utilized very effectively by exploring features like dynamic kernels and concurrent kernels, thereby increasing the GPU occupancy.
• The paper mainly concentrates on parallelization of the score matrix generation part, which is the major compute-intensive portion of the SW algorithm. The generation of the aligned sequence (without gaps) is a backtracking process, carried out on the CPU side.
• The implementation consists of splitting the score matrix into horizontal strips and then into blocks or tiles. The tile size is decided by considering GPU resources. Every tile is then processed by the anti-diagonal parallelization method using the concurrent or dynamic kernel method. In contrast, the approach used in [12] is a vertically striped SW algorithm that considers the parameters of the global and shared memory of the GPU itself.
• Features like dynamic parallelism and the use of multiple, concurrent kernels using streams, supported by NVIDIA graphics cards, have been explored.
The rest of the paper is organized as follows: Section 3 describes the architecture of the graphics card. A description of the algorithm is presented in Section 4. Score matrix generation using various approaches is described in Section 5. Section 6 presents the implementation of the algorithm and the comparative performance improvement. The conclusion is presented in Section 7.

III. GPU ARCHITECTURE
GPUs have a large number of processing elements, called streaming multiprocessors (SMs), to host thousands of threads and blocks of threads. Higher throughput is achieved by concurrently executing this large number of threads. This is thread level parallelism (TLP). The implementation has been done on multi-core machines with NVIDIA Quadro K5200 and K6000 graphics cards. These are professional class GPU cards for integrating high performance computing applications. CUDA C is the programming language supported for accessing the GPU cards. The cards connect to the host processor via a PCIe 3.0 bus. It is a programming challenge to effectively manage the data traffic between the host (CPU) and the device (GPU). If this data traffic is handled properly, it leads to performance improvement through proper utilization of memory bandwidth.
The other issue in executing the algorithm is to judiciously manage the memory traffic between the streaming multiprocessors and the various memory components on the card. Both cards have the Kepler microarchitecture, which supports dynamic parallelism. With this feature, a CUDA kernel can create a child kernel (as shown in Fig. 1) that can perform a new independent, parallel task and can create and use new streams and events without CPU involvement. The Kepler architecture supports an L1 cache per SM with a unified memory request path for loads and stores. The memory model is shown in Fig. 2. The detailed technical specification of the cards used is shown in Table 1. A multi-core system with 16 cores (Intel Xeon E5-2698 processor, 2.3 GHz clock frequency) with a GPU card was used for the implementation.

IV. DESCRIPTION OF ALGORITHM
The global alignment method is used to catch the regions of high similarity between two sequences. But it may not be possible to find the regions of high local similarity during overall optimal global alignment.
Hence, local alignment is used to effectively tap the regions of high local similarity. There are certain issues to be considered while aligning two sequences for similarity quotient.
• The lengths of the sequences may not be equal. There may be small matching regions in the sequences.
• Whether to allow partial matches or not (i.e. some amino acid pairs can replace one another).
• There may be cases of insertions, deletions, or substitutions from the common ancestral sequence. This may lead to variable length regions, mutations, or gaps in the new alignment.
The recurrence relation for an element of the score matrix, when both of its indices are strictly positive, is given in Fig. 3, where α, β denote the gap penalties. Fig. 4 shows the data dependency.

V. SCORE MATRIX GENERATION
This section describes the parallel approach for the alignment problem, the CUDA kernels for generating the score matrix, and the algorithm parameters.

A. Many Core Implementation on GPU
A CUDA-enabled GPU card with compute capability 3.5 or higher supports features like dynamic parallelism and concurrent kernels. Dynamic parallelism is expressed by invoking nested kernels. Fig. 5 shows the algorithm for dynamic parallelism. Here, "gpuBC" is the parent kernel that creates and calls the child kernel "fillmatrix". The parent kernel creates a grid of blocks whose size depends on T, the number of threads per block. The total number of blocks in each direction depends on "N", the length of the query string. The child kernel "fillmatrix" generates the entries in the score matrix (C) in the diagonal parallelization manner. There is an implicit synchronization between a child and a parent grid. The main program on the host allocates and initializes the score matrix C on the host, copies it to the device and calls the parent kernel. The parent kernel calls the child kernel on the device.
Concurrent kernel execution can be invoked by using an independent "stream" for every host thread. Fig. 6 shows the algorithm for this approach. For example, generation of the score matrix can be split into four parts. Due to the diagonal dependency, these four parts can be wrapped into three independent streams, as shown in Fig. 7. These streams can be executed in the following order: stream1 executes kernel1, stream2 executes kernel2 and kernel3, and stream3 executes kernel4. The execution sequence is shown in Fig. 8. cudaStreamCreate(&stream[i]) creates the three streams for kernel 1, kernels 2 and 3, and kernel 4, respectively. Streams are synchronized using cudaStreamSynchronize(). The grid pattern (number of blocks, number of threads per block) is specified as an argument to each kernel.
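The diagonal parallelization described above can be illustrated with a small host-side sketch. It is written in Python rather than CUDA C so that it runs anywhere, and it is not the paper's kernel: the function name `fill_by_antidiagonals`, the linear gap penalty, and the match/mismatch scores are all assumptions made for illustration. The point it shows is that every cell on one anti-diagonal depends only on cells from the two previous anti-diagonals, so the inner loop is the part a GPU kernel could execute with one thread per cell.

```python
def fill_by_antidiagonals(s1, s2, match=2, mismatch=-1, gap=1):
    """Fill a local-alignment score matrix one anti-diagonal at a time.

    Scoring values are illustrative assumptions, not the paper's settings.
    """
    n, m = len(s1), len(s2)
    C = [[0] * (m + 1) for _ in range(n + 1)]   # row 0 and column 0 stay zero
    widths = []                                  # independent cells per diagonal
    for d in range(2, n + m + 1):                # d = i + j is constant on a diagonal
        i_lo, i_hi = max(1, d - m), min(n, d - 1)
        widths.append(i_hi - i_lo + 1)
        for i in range(i_lo, i_hi + 1):          # these cells are mutually independent
            j = d - i
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            C[i][j] = max(0,
                          C[i - 1][j - 1] + sub,  # northwest neighbour
                          C[i - 1][j] - gap,      # north neighbour
                          C[i][j - 1] - gap)      # west neighbour
    return C, widths


C, widths = fill_by_antidiagonals("ACGT", "ACGT")
```

The `widths` list makes the available parallelism visible: it grows up to the length of the shorter string and then shrinks again, which is why the number of threads needed is largest on the main anti-diagonal.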

B. Tiling Approach
For strings of very large sizes (especially string lengths that generate a score matrix larger than the global memory of the card), the score matrix on the host side is divided into suitable chunks (or tiles). It is essential to calculate the proper tile size and the effective addresses for all subsequent threads, using the Block ID and Thread ID model of the CUDA environment. For example, given the tile size, the element size "e", and the memory size "m", equation 1 should be satisfied in order to accommodate the entire tile in the global memory of the GPU card.
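A rough way to pick a tile size from the memory constraint can be sketched as follows. Since the equation itself is not reproduced in the text, the sketch assumes a square tile of side t and the constraint t · t · e ≤ m; the helper names `max_tile_side` and `tiles_needed`, the 8 GiB memory figure, and the 4-byte element size are hypothetical values chosen for illustration.

```python
import math


def max_tile_side(mem_bytes, elem_bytes):
    """Largest t such that a t x t tile of elem_bytes-sized cells fits in mem_bytes."""
    return math.isqrt(mem_bytes // elem_bytes)


def tiles_needed(n, m, t):
    """Number of t x t tiles needed to cover an n x m score matrix."""
    return -(-n // t) * -(-m // t)   # ceiling division in both directions


# Hypothetical figures: 8 GiB of global memory, 4-byte integer scores.
t = max_tile_side(8 * 1024**3, 4)
count = tiles_needed(100_000, 100_000, t)
```

In practice the usable memory is smaller than the full capacity, because the input strings and any other buffers must also reside on the device, so the value returned here is an upper bound rather than a recommended setting.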

C. Resource Requirement & GPU Occupancy
Occupancy is a function of the GPU card parameters and the resource requirement of the algorithm. Hence, potential limitations for occupancy are resources like registers, memory and the number of streaming multiprocessors (SM) required by the algorithm. Resources would be fully utilized only when the condition in equation (2) holds. For the pair-wise sequence alignment problem, maximum occupancy would be experienced when the condition in equation (3) holds, where N is the length of the string and C_g is the total number of GPU cores on the device.
Occupancy can be determined by considering device parameters as well as certain design parameters. These parameters are shown in Table 2.
• Register usage-The number of registers needed per thread limits the register usage. Occupancy can be decided by the thread ratio, computed from the active threads per block.
• Shared Memory usage-Occupancy can also be decided by considering the shared memory usage.
• Thread Block Size-Block size is a design criterion, which decides how many SMs can be utilized depending upon the number of active blocks used by each kernel. One warp consists of 32 threads.
Every resource parameter contributes to the GPU occupancy. Occupancy may not be a measure of performance, but low-occupancy code reflects underutilization of the enormous resources offered by the execution platform.
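How registers, shared memory and block size each cap occupancy can be sketched with a simplified calculator. This is not the paper's formula: it uses the common definition of occupancy as active warps per SM divided by the maximum warps per SM, it ignores register and shared memory allocation granularity, and the default limits (64 warps, 16 blocks, 65536 registers and 48 KB of shared memory per SM) are assumed values for a Kepler-class device. The function name `occupancy` is hypothetical.

```python
WARP_SIZE = 32


def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_warps=64, max_blocks=16,
              regs_per_sm=65536, smem_per_sm=49152):
    """Simplified theoretical occupancy of one SM (assumed device limits)."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)   # ceiling division
    # Each resource allows a certain number of resident blocks per SM.
    limits = [max_blocks, max_warps // warps_per_block]
    if regs_per_thread > 0:
        regs_per_block = regs_per_thread * warps_per_block * WARP_SIZE
        limits.append(regs_per_sm // regs_per_block)
    if smem_per_block > 0:
        limits.append(smem_per_sm // smem_per_block)
    active_blocks = min(limits)                             # tightest limit wins
    return active_blocks * warps_per_block / max_warps


full = occupancy(256, 32, 0)            # nothing is limiting
reg_limited = occupancy(256, 64, 0)     # doubled register use halves occupancy
small_blocks = occupancy(64, 32, 0)     # block-count limit dominates
smem_limited = occupancy(256, 32, 24576)
```

Under these assumptions the four cases evaluate to 1.0, 0.5, 0.5 and 0.25, which mirrors the qualitative statement that more registers or shared memory per thread, or a small thread block size, lowers occupancy.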

Number of GPU Cores-Consider the tile size and the length of the tile diagonal. For the diagonal parallelization method, the number of threads required per block is maximum at the main diagonal. For 100% occupancy, all the cores should be utilized, which gives the condition for maximum utilization of GPU cores in equation (7).
Memory Size-It is required that the tile should be accommodated into the memory completely, as expressed in equation (8), where "s" is the size of an element.
Combining equations (7) and (8) gives the limits on the tile size. Table 3 shows the corresponding values for the GPU cards K5200 and K6000.

D. Data Transfer Issues
The time required to transfer data from host memory to device memory depends upon the bandwidth of the PCIe bus. On the host side, memory may be allocated as pinned memory or non-pinned (pageable) memory. It is observed that the peak bandwidth between the various device memories is much higher than the peak bandwidth between host and device memory. Thus, the data transfer time between host and device is a major contributor to the overall execution time. Higher bandwidth is possible between the host and the device when transfer overheads are minimal and data transfer is overlapped with kernel execution and other data transfers.

VI. RESULTS AND DISCUSSION
A. Many Core Implementation
Experiments were carried out for parallel implementation of the SW algorithm on many-core systems. Parallelization was done using the following approaches:
a. Using only dynamic kernel.
b. Using only concurrent kernel.
c. Using tiling technique, coupled with the above two methods.
For approach "a", dynamic parallelism was tested: the parent kernel on the device launches the child kernel. For "b", multiple kernels wrapped in different streams were launched from the host. For approach "c", the tiling method was used. The entire score matrix was split into horizontal strips and then into tiles of a size that could be accommodated in the global memory of the device. Processing of each tile was carried out using the anti-diagonal method of parallelization. In this method, both features (a and b above) were tested.
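The strip-and-tile traversal can be sketched on the host side as follows. This Python sketch is an illustration under assumptions, not the paper's CUDA implementation: the function name `fill_tiled`, the square tile side `t`, the linear gap penalty and the scoring values are all hypothetical. Tiles are visited strip by strip, left to right, and the cells inside each tile are visited one anti-diagonal at a time, which is the loop a kernel would parallelize.

```python
def fill_tiled(s1, s2, t, match=2, mismatch=-1, gap=1):
    """Fill a local-alignment score matrix tile by tile (illustrative scoring)."""
    n, m = len(s1), len(s2)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for r0 in range(1, n + 1, t):                 # horizontal strips, top to bottom
        r1 = min(r0 + t, n + 1)                   # strip covers rows r0 .. r1-1
        for c0 in range(1, m + 1, t):             # tiles in the strip, left to right
            c1 = min(c0 + t, m + 1)               # tile covers columns c0 .. c1-1
            # Anti-diagonals inside the tile: d = i + j.
            for d in range(r0 + c0, (r1 - 1) + (c1 - 1) + 1):
                i_lo = max(r0, d - (c1 - 1))
                i_hi = min(r1 - 1, d - c0)
                for i in range(i_lo, i_hi + 1):   # independent cells of one diagonal
                    j = d - i
                    sub = match if s1[i - 1] == s2[j - 1] else mismatch
                    C[i][j] = max(0,
                                  C[i - 1][j - 1] + sub,
                                  C[i - 1][j] - gap,
                                  C[i][j - 1] - gap)
    return C


C = fill_tiled("ACGTACGT", "ACGTACGT", 3)
```

Because the tiles to the north, west and northwest of a tile are always completed before it starts, the tiled traversal produces the same matrix as a plain row-by-row fill, whatever tile side is chosen.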
The implementation was compared against a serial CPU-based implementation on the same platform. Speed up was calculated with respect to the time taken to execute the serial version of the algorithm on the CPU:
$$\text{Speed up} = \frac{T_{\text{serial CPU}}}{T_{\text{GPU}}} \tag{10}$$
Speed up of about 120X and 55X was observed using the dynamic kernel and concurrent kernel features, respectively. Initially, the speed up achieved by both approaches is comparable. As the string size increases, the size of the score matrix and the searching time also increase. The speed up saturates for higher string sizes, when the bandwidth is fully utilized. The tiling technique outperforms the above two approaches for larger string sizes. Speed up of about 240X is observed with the use of the combined (tiling + dynamic and concurrent kernel) technique. Fig. 11 shows the results. Nearly the same speed up is observed when the tiling method is used with either the concurrent or the dynamic kernel approach. The comparative speed up with and without the tiling technique was calculated for both approaches (dynamic and concurrent kernel):
$$\text{Speed up} = \frac{T_{\text{without tiling}}}{T_{\text{with tiling}}} \tag{11}$$
Fig. 12 shows the speed up when the tiling technique is coupled with the concurrent and dynamic kernel features. With this method, a speed up of 4.2X (for tiling + concurrent kernel over only concurrent kernel) and 2X (for tiling + dynamic kernel over only dynamic kernel) is achieved. The main focus was on the score matrix generation part of the algorithm, since it is the major contributor to the execution time. The serial execution of the trace-back part of the algorithm was not considered. Therefore, it would be inappropriate to compare these results directly with previously published results.
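The reported figures can be cross-checked with a little arithmetic. The sketch below assumes that the speed up over the serial CPU run and the gain from tiling were measured on the same inputs, so that the two ratios multiply; the function name `combined_speedup` is hypothetical, and the numbers are the rounded values quoted in the text.

```python
def combined_speedup(kernel_speedup, tiling_gain):
    """Speed up over the serial CPU run once tiling is added to a kernel approach.

    (T_serial / T_kernel) * (T_kernel / T_tiled) = T_serial / T_tiled
    """
    return kernel_speedup * tiling_gain


dynamic_total = combined_speedup(120, 2.0)      # dynamic kernel + tiling
concurrent_total = combined_speedup(55, 4.2)    # concurrent kernel + tiling
```

With the rounded inputs this gives about 240 for the dynamic kernel and about 231 for the concurrent kernel, which is consistent with the statement that nearly the same speed up is observed when tiling is combined with either approach.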

B. Resource Utilization and GPU Occupancy
GPU occupancy defines how efficiently the algorithm utilizes the resources provided by the underlying hardware. Occupancy will be lower if the kernel needs more registers or more shared memory per thread, or if the thread block size is small. For large data sets, occupancy is more than 100%. The tile size limits given in Table 3 have been verified and the results are shown in Table 4. It is observed that there is about a 50% reduction in execution time when the tile size limits are followed. For all experiments, the thread block size is a minimum of 256 and a maximum of 1024 threads per block. If the block size is smaller, the number of blocks required for the given data size would be much larger, and occupancy would be lower.

C. Issues in Data Transfer
Aspects like allocating memory using cudaMalloc() or cudaHostAlloc(), use of pinned or non-pinned memory allocation, and use of constant memory for read-only data were explored. Host memory for transfers can be allocated using the non-pinned (pageable) or the pinned allocation method. Pinned transfers are faster than non-pinned transfers for smaller data sizes (for string sizes from 16KB up to 44KB), as shown in Table 5. But too much allocation of pinned memory degrades the performance. Hence, for large string sizes, pageable, i.e. non-pinned, memory allocation is preferred. The constant memory of the GPU card can be used to store all read-only data of the algorithm. A request for constant memory for the entire warp is split into two requests, one for each half warp. When all the threads in a half warp access the same memory location, reading from the cached constant memory location is as fast as reading from the registers. Accesses by the threads in a half warp to different addresses are serialized, so the performance improvement is obtained when the threads read the same address. Table 6 shows the improvement in execution time while using constant memory for non-pinned allocation. Use of pinned memory and constant memory contributes towards the performance improvement only for limited data sizes. Due to the limited size of constant memory (64KB), dynamic memory allocation is required even for storing constant data.

VII. CONCLUSION
The main focus of our study was to explore the features of graphics cards and map the resource requirement of the algorithm under consideration to the available resources. Experiments were performed with the compute-intensive part of the pair-wise SW algorithm, i.e. score matrix generation. Hence, our results are not directly comparable to previous results. Heavily data-dependent applications can be parallelized on the GPU platform by coupling the traditional tiling technique with features like concurrent and dynamic kernel execution. Speed up of up to 120X and 55X was observed while using the dynamic and concurrent kernel features, respectively. A further improvement, to a speed up of about 240X, was possible by using the tiling method. Tile size was decided by considering the relationship between various device and algorithm parameters. This led to a speed up of about 2X relative to using only the dynamic kernel and about 4.2X relative to using only the concurrent kernel approach. The utilization of GPU resources was tested with respect to register usage, shared memory usage and thread block size. It is observed that, for higher occupancy, it is necessary to do more work per thread, use more registers per thread in order to access slower shared memory. The relationship between the tile size and the available resources on the device, for better resource utilization and performance improvement, is presented. We plan to extend our work by incorporating memory and compiler optimization issues in parallelizing dynamic programming based algorithms on GPU. The proposed strategy can also be extended to global sequence alignment and multiple sequence alignment problems.
Consider strings S1 and S2 (over the alphabet {A,C,G,T}) of lengths n and m, respectively. Then the dynamic programming approach solves the local alignment problem in O(nm) time. The score matrix is created, which is used to generate the similarity index between the two strings. The recurrence relation establishes a recursive relationship between each element and other elements of the score matrix. The base conditions set the first row and the first column of the score matrix to zero.
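As a minimal sketch of the recurrence and base conditions described above, the following Python function fills a Smith-Waterman score matrix serially. The linear gap penalty and the match and mismatch scores are illustrative assumptions (the values of the gap penalties used in the paper are not given in the text), and the function name `sw_score_matrix` is hypothetical.

```python
def sw_score_matrix(s1, s2, match=2, mismatch=-1, gap=1):
    """Fill the Smith-Waterman score matrix for s1 (rows) and s2 (columns).

    Scoring values are illustrative assumptions, not the paper's settings.
    """
    n, m = len(s1), len(s2)
    # Base conditions: the first row and the first column stay zero.
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            C[i][j] = max(
                0,                      # a local alignment never scores below zero
                C[i - 1][j - 1] + sub,  # northwest: match or substitution
                C[i - 1][j] - gap,      # north: gap in s2
                C[i][j - 1] - gap,      # west: gap in s1
            )
    return C


C = sw_score_matrix("ACGT", "ACGT")
best = max(max(row) for row in C)       # similarity score of the best local alignment
```

The two nested loops visit every element once, which is where the O(nm) running time comes from, and the three neighbours read in the `max` are the north, west and northwest dependencies of each element.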

Fig. 12. Speed up When Tiling is used with Respect to Non-Tiling Technique.


Algorithms that are based on dynamic programming methodology give accurate results but take a long time to produce the output. For example, Needleman and Wunsch [NW] [1] and Smith and Waterman [SW] [2] proposed the algorithms for global and local sequence alignment, respectively.

TABLE III. TILE SIZE LIMITS FOR GPU CARDS

TABLE IV. TESTING TILE SIZE LIMITS

TABLE V. PINNED AND NON-PINNED MEMORY