Solving Dynamic Programming Problem by Pipeline Implementation on GPU

In this paper, we show the effectiveness of a pipeline implementation of Dynamic Programming (DP) on GPU. As an example, we explain how to solve a matrix-chain multiplication (MCM) problem by DP on GPU. This problem can be sequentially solved in $O(n^3)$ steps by DP, where $n$ is the number of matrices, because its solution table is of size $n \times n$ and each element of the table can be computed in $O(n)$ steps. A typical speedup strategy is to parallelize the $O(n)$-step computation of each element, which can be easily achieved by parallel prefix computation, i.e., an $O(\log n)$-step computation with $n$ threads in a tournament fashion. By such a standard parallelizing method, we can solve the MCM problem in $O(n^2 \log n)$ steps with $n$ threads. In our approach, we solve the MCM problem on GPU in a pipeline fashion, i.e., we use GPU cores to support pipeline stages so that many elements of the solution table are partially computed in parallel at one time. Our implementation determines one output value per computational step with $n$ threads in a pipeline fashion and constructs the whole solution table in $O(n^2)$ steps with $n$ threads.


I. INTRODUCTION
In this paper, we show the effectiveness of a pipeline implementation of Dynamic Programming (DP) on GPU. As an example, we explain how to solve a matrix-chain multiplication (MCM) problem [1] by DP on GPU. This problem can be sequentially solved in $O(n^3)$ steps by DP, where $n$ is the number of matrices, because its solution table is of size $n \times n$ and each element of the table can be computed in $O(n)$ steps. A typical speedup strategy is to parallelize the $O(n)$-step computation of each element, which can be easily achieved by parallel prefix computation, i.e., an $O(\log n)$-step computation with $n$ threads in a tournament fashion. By such a standard parallelizing method, we can solve the MCM problem in $O(n^2 \log n)$ steps with $n$ threads.
Speeding up DP programs using GPU has been well studied (e.g., [2], [3]); prior work mainly focuses on optimizing the order of data accesses, proposing novel techniques that avoid memory access conflicts. In this study, we instead adopt a pipeline technique and implement the DP program on GPU in a pipeline fashion. The pipeline computation technique [4] can be used in situations where we perform several operations $\{OP_1, OP_2, \ldots, OP_n\}$ in sequence and some steps of each $OP_{i+1}$ can be carried out before operation $OP_i$ is finished. In parallel algorithms, it is often possible to overlap those steps and improve the total execution time.
In our approach, we solve the MCM problem on GPU in a pipeline fashion, i.e., we use GPU cores to support pipeline stages so that many elements of the solution table are partially computed in parallel at one time. Our implementation determines one output value per computational step with $n$ threads in a pipeline fashion and constructs the whole solution table in $O(n^2)$ steps with $n$ threads. This paper is an extended version of our conference paper [5].
The rest of this paper is organized as follows. Section II introduces problem definitions and base algorithms. Section III explains our pipeline implementations for DP on GPU and offers some experimental results. Section IV explains how to apply the pipeline implementation technique to the MCM problem, and finally Section V offers concluding remarks.

II. PRELIMINARIES
In this section, we introduce some preliminary definitions and base algorithms. We first define a simplified DP problem to be solved on GPU, and then explain our GPU implementations of programs.

A. Simplified DP Problem
In this study, we implement a typical DP program on GPU. To simplify the exposition, we focus on programs that solve a simplified DP problem defined as follows. An array $ST$ of $n$ elements, a set $A = \{a_1, a_2, \ldots, a_k\}$ of $k$ integers representing offset numbers, and a semi-group binary operator $\otimes$ over integers are given. Every element of set $A$ satisfies the following inequality:

$$n > a_1 > a_2 > \cdots > a_k \geq 1.$$

Then, a simplified DP problem (S-DP problem) is to fill all the elements of array $ST$ in such a way that each $ST[i]$ is computed by the following equation:

$$ST[i] = ST[i-a_1] \otimes ST[i-a_2] \otimes \cdots \otimes ST[i-a_k], \quad (1)$$

where $a_1 \leq i < n$ and $ST[0], ST[1], \ldots, ST[a_1-1]$ are preset with initial values.

B. Conventional Approach to S-DP Problem
To begin with, we show a straightforward sequential algorithm that solves the S-DP problem. Fig. 1 shows the algorithm.
Fig. 1. A sequential algorithm for S-DP problem

Next, we consider parallelizing the algorithm for the S-DP problem. The straightforward approach is to parallelize the inner loop using GPU cores. We can easily write a multi-thread program that executes the inner loop-body, $ST[i] = ST[i] \otimes ST[i-a_j]$, for each $j$ in parallel using $k-1$ threads at one time. Such an implementation, however, does not improve the time cost, because every thread accesses the same $ST[i]$ and thus memory access conflicts occur. As a result, those memory conflicts are resolved at runtime by the serializing mechanism of the GPU, and consequently the whole time cost stays at $O(nk)$ steps, the same time cost as that of the sequential implementation.
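The sequential algorithm of Fig. 1 (whose pseudocode is not reproduced here) can be sketched as follows; this is an illustrative Python rendering, assuming min for $\otimes$ as in the experiments of Section III-B:

```python
def sdp_sequential(st, offsets, op):
    """Fill ST[a1..n-1] by ST[i] = ST[i-a1] (x) ST[i-a2] (x) ... (x) ST[i-ak].

    st      -- list whose first offsets[0] entries hold the initial values
    offsets -- offset numbers a1 > a2 > ... > ak >= 1
    op      -- a semi-group binary operator over integers
    """
    n = len(st)
    a1 = offsets[0]
    for i in range(a1, n):           # outer loop: n - a1 iterations
        st[i] = st[i - a1]           # seed with the first operand
        for a in offsets[1:]:        # inner loop: k - 1 combinations
            st[i] = op(st[i], st[i - a])
    return st

# Offsets (5, 3, 1) as in the example of Fig. 3, with min for (x):
st = [9, 4, 7, 2, 8] + [0] * 5       # ST[0..4] preset, ST[5..9] to be filled
sdp_sequential(st, [5, 3, 1], min)   # -> [9, 4, 7, 2, 8, 7, 2, 2, 2, 2]
```

The two nested loops make the $O(nk)$ step count of the sequential version explicit.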
To avoid the memory access conflicts, we can use a well-known standard parallel prefix computation algorithm [6], [7], in which the computations of $\otimes$ over $k$ values are executed in a tournament fashion. Since the parallel prefix computation runs in $O(\log k)$ steps for $k$ values, the entire time cost can obviously be improved to $O(n \log k)$ steps by using $k$ threads.
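The tournament-fashion combining of $k$ values can be illustrated by the following Python sketch, which simulates the $O(\log k)$ rounds sequentially (on a GPU, the pairwise $\otimes$ operations of each round run in parallel):

```python
def tournament_reduce(values, op):
    """Combine k values by op in ceil(log2 k) rounds, tournament style.

    On a GPU each round's pairwise combinations run in parallel;
    this sequential simulation only mirrors the round structure.
    """
    vals = list(values)
    while len(vals) > 1:
        nxt = []
        for j in range(0, len(vals) - 1, 2):   # one parallel round
            nxt.append(op(vals[j], vals[j + 1]))
        if len(vals) % 2 == 1:                 # an odd leftover advances
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

# min over 8 values takes 3 rounds instead of 7 sequential steps:
tournament_reduce([9, 4, 7, 2, 8, 6, 5, 3], min)   # -> 2
```

Note that in each later round only half of the threads remain active, which is the source of the work-time suboptimality discussed next.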
Although we can successfully reduce the time cost from $O(nk)$ to $O(n \log k)$ by using the parallel prefix computation, this approach is not work-time optimal because many threads are idle during the tournament-fashion computation. In the next section we propose another parallel implementation strategy and show that the time cost can be improved further.

III. PIPELINE IMPLEMENTATION ON GPU
In this section, we explain our proposed parallel implementations for the S-DP problem on GPU. Our program runs in a pipeline fashion.

A. Pipeline Implementation for S-DP Problem
In our implementation, we use a group of $k$ threads to establish a $k$-stage pipeline, and this thread group treats $k$ consecutive elements at one time in parallel. Fig. 2 describes our pipeline algorithm for the S-DP problem. The index variable $i$ of the outer loop stands for the head position of the $k$-thread group. The inner loop controls each thread's behaviour in such a way that the $j$-th thread executes the computation for $ST[i-j+1]$ using the value stored in $ST[i-j+1-a_j]$.
A Pipeline Algorithm for S-DP Problem
for $i = a_1$ to $n + k - 2$ do
  for $j = 1$ to $k$ do in parallel
    Thread $j$ executes the following operation if $a_1 \leq i_j < n$, where $i_j = i - j + 1$:
      $ST[i_j] = ST[i_j] \otimes ST[i_j - a_j]$
Fig. 2. A pipeline algorithm for S-DP problem

An execution example is shown in Fig. 3, where $k = 3$, $a_1 = 5$, $a_2 = 3$, and $a_3 = 1$ hold and the initial values are already stored in $ST[0], ST[1], \ldots, ST[4]$. In Step 1, the head position $i$ of the thread group is 5. In this step only one thread is activated, and it executes $ST[5] \leftarrow ST[0]$. In Step 2, the head position is incremented to 6, and two threads are activated. The first thread treats $ST[6]$ and the second thread works on $ST[5]$. In Step 3, the head position becomes 7, and now all $k = 3$ threads actively execute operations for $ST[7]$, $ST[6]$, and $ST[5]$, respectively. It should be noted that at the end of Step 3 the content of $ST[5]$ is completely determined, while those of $ST[7]$ and $ST[6]$ are partially computed and not yet determined. From Step 3, all $k = 3$ threads are active until Step $n - a_1$, when the head position $i$ of the thread group reaches $n - 1$; after that step the number of active threads decreases by one at each step. As can be seen, there is no memory access conflict in this example.
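The behaviour of the algorithm in Fig. 2 can be checked with a small step-by-step simulation. The Python sketch below (an illustration, not the paper's CUDA implementation) replays each outer-loop step, counts how many threads read each element per step, and reproduces the example of Fig. 3:

```python
def sdp_pipeline_sim(st, offsets, op):
    """Simulate the k-stage pipeline of Fig. 2 on array st (in place).

    Thread j (0-based here, 1-based in Fig. 2) treats element i - j at
    outer step i and combines it with st[i - j - offsets[j]].
    Returns (st, max_conflicts), where max_conflicts is the largest
    number of threads reading one element within a single step.
    """
    n, k = len(st), len(offsets)
    a1 = offsets[0]
    max_conflicts = 0
    for i in range(a1, n + k - 1):        # head position of the thread group
        reads = {}
        for j in range(k):                # the j-th pipeline stage
            ij = i - j                    # element treated by thread j
            if a1 <= ij < n:
                src = ij - offsets[j]
                reads[src] = reads.get(src, 0) + 1
                # first stage seeds the element, later stages accumulate;
                # ascending j reproduces the lockstep semantics, since any
                # same-step writer of st[src] has index j + offsets[j] > j
                st[ij] = st[src] if j == 0 else op(st[ij], st[src])
        if reads:
            max_conflicts = max(max_conflicts, max(reads.values()))
    return st, max_conflicts

# Offsets (5, 3, 1) as in Fig. 3: conflict-free, ST[5] finished at Step 3.
table, conflicts = sdp_pipeline_sim([9, 4, 7, 2, 8] + [0] * 5, [5, 3, 1], min)
# table -> [9, 4, 7, 2, 8, 7, 2, 2, 2, 2], conflicts -> 1
```

With consecutive offsets such as (4, 3, 2, 1), the same simulation reports four threads reading one element in a step, matching the worst case of Fig. 4 discussed below.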
As for the time complexity of our pipeline implementation, from a theoretical viewpoint it takes only $O(n)$ steps, because the outer loop takes $n + k - a_1 - 1 = O(n)$ cycles and the inner loop requires $O(1)$ time if there is no adjacent offset pair $(a_m, a_{m+1})$ such that $a_m = a_{m+1} + 1$.
From a practical viewpoint, however, the inner loop may take more time steps because of memory access conflicts. In the worst case, when consecutive offset numbers are given, the operands $ST[i_j - a_j]$ on the right-hand side of the assignment statement coincide at the same element of array $ST$, and hence the worst memory access conflicts occur. In such a case, all threads in the inner loop are serialized and the loop takes time proportional to $k$. See Fig. 4 for such a worst-case example, in which all four threads try to access $ST[i-4]$ at the same time in the inner loop.
Let $seq = (a_p, a_{p+1}, \ldots, a_q)$ be one of the longest subsequences of the given offset numbers $(a_1, a_2, \ldots, a_k)$ satisfying $a_r = a_{r+1} + 1$ for all $p \leq r < q$. Then it is easy to check that in the inner loop every thread $r$ ($p \leq r \leq q$) accesses the same element of array $ST$; as a result, those conflicts are serialized at run time by the GPU's serializing mechanism, and the memory access time becomes $(q - p + 1)$ times slower than in the conflict-free case. For such a case, in [5], we proposed a 2-by-2 pipeline implementation technique in which each thread invoked in the inner loop executes two computations for each element of array $ST$. The details can be found in [5].
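The serialization factor $(q - p + 1)$ can be illustrated with a small Python helper (a hypothetical utility for exposition, not part of the paper's implementation) that finds the longest run of offsets with $a_r = a_{r+1} + 1$:

```python
def worst_serialization(offsets):
    """Length of the longest run with a_r = a_{r+1} + 1 in the offset list.

    Threads in such a run all read the same ST element in one step
    (thread j+1 reads i - j - a_{j+1} = i - j + 1 - a_j when
    a_j = a_{j+1} + 1), so their accesses are serialized at run time.
    """
    best = run = 1
    for r in range(len(offsets) - 1):
        run = run + 1 if offsets[r] == offsets[r + 1] + 1 else 1
        best = max(best, run)
    return best

worst_serialization([5, 3, 1])      # -> 1: conflict-free, as in Fig. 3
worst_serialization([4, 3, 2, 1])   # -> 4: all threads collide, as in Fig. 4
```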

B. Experimental Results
Before going to the next section, we show the performance of our pipeline implementation on GPU. We use a computer with a 3.40 GHz Intel Xeon CPU E3-1245 v3 and an NVIDIA GeForce GTX TITAN Black. The OS is Ubuntu 17.10, and we use g++ 7.2.0 for compiling C++ programs and CUDA 9.2 [8]. The experimental results are shown in TABLE I.
TABLE I shows the average execution time for each implementation on GPU. In the table, SEQUENTIAL, NAIVE-PARALLEL, and PIPELINE respectively stand for the sequential implementation, the naive multi-thread implementation, and our pipeline implementation (for a general case). Here, we use the min operation for $\otimes$. The average is computed over 100 executions for each setting. Comparing the sequential implementation with the parallel ones, the parallel implementations are much faster, even the naive one. Although there is no difference in time between NAIVE-PARALLEL and PIPELINE for $n \leq 2^{17}$, PIPELINE is faster than NAIVE-PARALLEL for $n \geq 2^{18}$.

IV. SOLVING MCM PROBLEM
In this section, we explain how to apply our pipeline implementation technique to more general DP problems. As an example, we deal with the matrix chain multiplication problem (MCM problem) [1].

A. Outline of Pipeline Implementation for MCM Problem
It is well known that the MCM problem can be efficiently solved by DP with a two-dimensional solution table of triangular shape, where each element is computed along the diagonal direction. See Fig. 5 for an example; in the figure, each number represents the order in which the elements are computed. The content of each element is computed by a binary operator $\downarrow$, which returns the smaller operand, and a binary function $f(*, *)$, as in Fig. 6. The detailed definition of the MCM problem can be found in [1]. In the figure, the element marked 13 is computed from the elements marked 1, 6, 10, 11, 8, and 4. If we write $ST[x]$ for the element marked $x$, the computation is expressed as $ST[13] = f(ST[1], ST[11]) \downarrow f(ST[6], ST[8]) \downarrow f(ST[10], ST[4])$. Since the elements of the two-dimensional solution table are computed in a total (linear) order, we can line them up into a linear array according to that order. Once the solution table is transformed into a linear array, we can apply the pipeline technique to the MCM problem as well. An execution example is shown in Fig. 7.
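For reference, the underlying DP from [1], with min playing the role of $\downarrow$ and the cost combination playing the role of $f$, can be sketched as a plain sequential Python version; here dims is a hypothetical input list in which matrix $i$ has shape dims[i-1] $\times$ dims[i]:

```python
def mcm_cost(dims):
    """Classic DP for matrix-chain multiplication [1].

    dims -- list of n+1 dimensions; matrix i is dims[i-1] x dims[i].
    Fills the triangular table diagonal by diagonal: cell (i, j) holds
    the minimum multiplication cost of the chain A_i ... A_j, computed as
    min over s of cost(i, s) + cost(s+1, j) + dims[i-1]*dims[s]*dims[j].
    """
    n = len(dims) - 1
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    for d in range(1, n):                  # diagonal d: chains of d+1 matrices
        for i in range(1, n - d + 1):
            j = i + d
            cost[i][j] = min(
                cost[i][s] + cost[s + 1][j] + dims[i - 1] * dims[s] * dims[j]
                for s in range(i, j)
            )
    return cost[1][n]

mcm_cost([10, 30, 5, 60])   # -> 4500: (A1 A2) A3 beats A1 (A2 A3)
```

The diagonal-by-diagonal loop order above is exactly the total order along which the pipeline implementation linearizes the table.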

B. Detailed Pipeline Implementation for MCM Problem
Here, we explain in detail how to solve the MCM problem by the pipeline implementation.
Firstly, we assume that the original two-dimensional solution table of triangular shape is already mapped appropriately to a one-dimensional solution table $ST$. That is, we line up all the elements of the two-dimensional table into a linear array $ST$ according to the total (linear) order in which the elements are computed along the diagonal direction, as in Figs. 5 and 7. Since the number of input matrices is $n$, the number of elements in the solution table is $n(n+1)/2$. The first $n$ elements of array $ST$ are preset with initial values; they correspond to the elements located on the main diagonal of the original two-dimensional table.
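One concrete way to realize this mapping (an illustrative sketch; the paper does not fix an explicit index formula) is to rank the cells diagonal by diagonal. Assuming the table of Fig. 5 has $n = 5$ matrices, the first cell of diagonal 3 then receives rank 13, matching the element marked 13:

```python
def linear_index(d, i, n):
    """1-based rank in ST of the cell at diagonal d (d = 0 is the preset
    main diagonal), position i (0-based) within that diagonal.

    Diagonal d has n - d cells, so diagonals 0..d-1 contribute
    n + (n-1) + ... + (n-d+1) = d*n - d*(d-1)//2 earlier cells.
    """
    return d * n - d * (d - 1) // 2 + i + 1

# n = 5 matrices: starting ranks of the five diagonals.
[linear_index(d, 0, 5) for d in range(5)]    # -> [1, 6, 10, 13, 15]
```

Under this ranking the cell marked 11 in Fig. 5 is the second cell of diagonal 2, i.e., linear_index(2, 1, 5), consistent with the computation of the element marked 13 described above.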
To solve the MCM problem thus defined by the pipeline implementation on GPU, we have to modify the pipeline algorithm designed for the S-DP problem, because the offset numbers for each $ST[i]$ may differ from element to element. Thus, for the MCM problem, we propose a modified pipeline algorithm in Fig. 8. In the MCM-pipeline algorithm, each element of $ST$ is computed by equation (2), and $ST[1], ST[2], \ldots, ST[n]$ are preset with initial values. In substeps 1, 2, and 3, the computation of $f(l_{(i,j)}, r_{(i,j)})$ is executed, and in substep 4 the obtained value is used for the computation by $\downarrow$ and the result is stored in $ST[i]$.

C. Conflict-Free Memory Access of MCM Algorithm
In this subsection, we prove that no memory access conflict occurs during the execution of the MCM-pipeline algorithm described in Fig. 8.
Fig. 8. A pipeline algorithm for MCM problem. In the algorithm, $v_l$, $v_r$, and $v_S$ are local variables in a thread.

To begin with, we prove Lemma 1 (stated below).

Proof of Lemma 1: Assume that threads $p$ and $q$ try to read the same element of $ST$ in substep 1 and that $p < q$ holds. Let $(row_p, col_p)$ (resp. $(row_q, col_q)$) be the pair of row and column indexes of the element in the original two-dimensional solution table of triangular shape of the MCM problem for which thread $p$ (resp. $q$) is now computing. Since threads $p$ and $q$ try to read the same element of $ST$, and in substep 1 they read the value for the left argument of the function $f$, the relation $row_p = row_q$ must hold. Then, since threads $p$ and $q$ respectively read the $p$-th and $q$-th elements from the left in the same row of the original two-dimensional solution table, the relation $p = q$ must hold if the two threads read the same element of $ST$, which contradicts the assumption $p < q$.
Next, we prove Lemma 2 (stated below), which can be proved in a similar way to Lemma 1.

Proof of Lemma 2: Assume that threads $p$ and $q$ try to read the same element of $ST$ in substep 2 and that $p < q$ holds. Let $(row_p, col_p)$ (resp. $(row_q, col_q)$) be the pair of row and column indexes of the element in the original two-dimensional solution table of triangular shape of the MCM problem for which thread $p$ (resp. $q$) is now computing. Since threads $p$ and $q$ try to read the same element of $ST$, and in substep 2 they read the value for the right argument of the function $f$, the relation $col_p = col_q$ must hold. On the other hand, thread $p$ reads the $p$-th element below row $row_p$ of the original two-dimensional solution table, and thread $q$ reads the $q$-th element below row $row_q$; that is, they read rows $(row_p + p)$ and $(row_q + q)$ respectively, so the relation $row_p + p = row_q + q$ must hold if the two threads read the same element of $ST$. However, since $p < q$ and $col_p = col_q$ hold, the relation $row_p < row_q$ must hold from the way the original two-dimensional solution table is mapped to the linear array $ST$. This leads to $row_p + p < row_q + q$, which contradicts the assumption that threads $p$ and $q$ read the same element of array $ST$.
In substep 3, each thread simply executes a computation using local variables. In substep 4, it is obvious that each thread accesses a distinct element of $ST$. Therefore, by Lemma 1 and Lemma 2, we obtain the following theorem.
Theorem 1: No memory access conflict occurs during the execution of the MCM-pipeline algorithm described in Fig. 8.
As for the time complexity of the MCM-pipeline implementation, from a theoretical viewpoint it takes only $O(n^2)$ steps with $(n-1)$ threads, because the outer loop takes $O(n^2)$ cycles and the inner loop requires $O(1)$ time (Theorem 1).

Fig. 5. A solution table example for a matrix chain multiplication problem.

Fig. 6. An example of computing an element of the solution table. Here, we write $ST[x]$ for the element marked $x$.

Lemma 1: In substep 1 of an execution of the inner loop-body of the MCM algorithm, each thread has access to a distinct element of the array $ST$.

Fig. 7. An execution example of the pipeline implementation for the matrix chain multiplication problem.

Lemma 2: In substep 2 of an execution of the inner loop-body of the MCM algorithm, each thread has access to a distinct element of the array $ST$.