k-Integer-Merging on Shared Memory

The k integer-merging problem is to merge k sorted arrays A0, A1, ..., Ak-1 into a new sorted array that contains all elements of Ai, ∀ i. We propose a new parallel algorithm based on the exclusive read exclusive write (EREW) shared memory model. The algorithm runs in O(log n) time using n/log n processors. The algorithm performs linear work, O(n), and has optimal cost. Furthermore, the total work done by the algorithm is less than that of the best-known previous parallel algorithms for the k merging problem.

Keywords—Merging; parallel algorithm; shared memory; optimality; linear work


I. INTRODUCTION
The problem of merging has many applications in computer science and is used as a subroutine for solving many problems such as sorting [1], database management systems [2], information retrieval [3], memory management, scheduling [1], and tree reconstruction [4] [5]. Most of these applications base their solutions on the merging problem. For example, an optimal algorithm for sorting an array A of size n can proceed as follows: (1) partition the array A into two subarrays of equal size, A1 and A2; (2) sort each subarray recursively; (3) merge the two sorted subarrays. The merging problem is defined as follows [10]. Given two sorted arrays A = (a0, a1, ..., an-1) and B = (b0, b1, ..., bm-1), the merge of the two sorted arrays is a new sorted array C = (c0, c1, ..., cn+m-1) such that: 1) ci ∈ A or ci ∈ B, ∀ 0 ≤ i ≤ n+m−1; and 2) ai and bj appear exactly once in C, ∀ 0 ≤ i < n and 0 ≤ j < m.
On the other side, some applications of computer science such as external sorting and information retrieval systems require merging k sorted arrays of different lengths. In such a case, the problem is known as the k merging problem. For example, the external sorting problem (sorting a file of large data) can be solved by the following steps [1]: (1) divide the file into small blocks that fit into main memory; (2) apply a fast sorting algorithm to each block; (3) merge the sorted blocks into bigger sorted blocks, until the file is sorted.
The merging problem of k sorted arrays is defined as follows. Given k sorted arrays A0, A1, ..., Ak-1 of lengths n0, n1, ..., nk-1, respectively, with n = n0 + n1 + ... + nk-1, the k merging problem is to produce a new sorted array of n elements that contains all elements of Ai, ∀ 0 ≤ i < k.
In sequential computation, the problem of merging two sorted arrays is solved in linear time, O(n), where n ≥ m [1] [7], while the problem of k merging requires Ω(n log k) time, where 2 ≤ k ≤ n [17].
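The linear-time sequential merge of two sorted arrays is the classical two-pointer scan; a minimal Python sketch (the function name `merge_two` is ours, not from the cited references):

```python
def merge_two(a, b):
    """Merge two sorted arrays in O(n + m) time with a two-pointer scan."""
    c = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            c.append(a[i]); i += 1
        else:
            c.append(b[j]); j += 1
    c.extend(a[i:])  # append whichever tail remains
    c.extend(b[j:])
    return c
```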
In parallel computation, different algorithms solve the merging problem based on different strategies and parallel computational models. The two main types of parallel models are shared memory and interconnection networks. Our paper focuses only on the shared memory type, the Parallel Random Access Machine (PRAM). A PRAM consists of p identical processors that operate synchronously and communicate through a large shared memory. There are three main PRAM models based on memory access conflicts in shared memory. (1) Exclusive Read Exclusive Write (EREW): no two processors may read from or write to the same memory location simultaneously. (2) Concurrent Read Exclusive Write (CREW): concurrent reads of a memory location are allowed, but writes must be exclusive. (3) Concurrent Read Concurrent Write (CRCW): both concurrent reads and concurrent writes to a memory location are allowed.
In case the elements of the two arrays are taken from the integer domain [1, m], the problem of merging is called integer merging. The algorithm in [17] reduced the running time to constant by considering some properties of the input elements. These properties are: (1) each element has a constant number of repetitions, and (2) the difference between two successive elements is bounded by a constant. In the case of CREW PRAM, Berkman and Vishkin [6] proposed an algorithm that runs in O(log log log n) time using n/log log log n processors, when m = n. They also proposed an O(α(n))-time algorithm using n/α(n) processors, where α(n) is the inverse of Ackermann's function and m = n. Furthermore, the author in [18] proposed a constant-time deterministic algorithm, O(1), for merging on CREW. The proposed algorithm is optimal when the values of the input elements are less than or equal to the size of the input and the number of processors is equal to the size of the input.
For the k merging problem, PRAM is the main model used. In [19], the algorithm is based on repeated pairwise merging of the k sorted arrays. The algorithm does not work optimally and runs in O(log k × log n) time. In [20], the author proposed an algorithm based on CREW PRAM. The algorithm runs in O(log n) parallel time using (n log k)/log n processors with total work O(n log k). The algorithm bases its solution on a pipelining strategy and performs optimal work. In [8], the authors proposed two optimal parallel algorithms on PRAM with work O(n log k); both are based on a sampling scheme. The first algorithm runs in O(log n) time under EREW, while the second runs in O(log log n + log k) time under CREW. Recently, the authors in [24] proposed a lazy-merge algorithm with running time O(k log(n/k) + merge(n/p)), where k and merge(n/p) are the number of segments and the time needed by the underlying in-place merging algorithm to merge n/p elements, respectively. Also, the authors in [21] presented two parallel algorithms for k integer merging when no repetitions occur in the elements. The running times of the two algorithms are O(log n) and O(1) under EREW and CREW PRAM, respectively.
This paper studies the k integer merging problem on PRAM and shows that it can be solved in total work O(n), even when the number of repetitions of elements is extreme. The proposed algorithm uses n/log n processors of type EREW PRAM and runs in O(log n) time.
The paper is organized as follows. In Section 2, we give the foundations and subroutines needed for the proposed algorithm. The proposed algorithm for k integer merging is explained in Section 3. In Section 4, we give the complexity analysis of the proposed algorithm. In Section 5, we show how the proposed algorithm works by tracing it on an example. Finally, in Section 6, we conclude our work.

II. PRELIMINARIES
In this section, we give the fundamental definitions and subroutines related to k integer merging problem.
Definition 1 [10] [22]: Given a problem Q of size n. The cost of a parallel algorithm for Q is equal to the product of the number of processors used and the running time of the parallel algorithm.
Definition 2 [10] [22]: Given a problem Q of size n. The cost of a parallel algorithm for Q is optimal if it matches the time complexity of the best-known sequential algorithm for Q.
Definition 3 [8][10] [22]: The work of a parallel algorithm is the total number of operations that the processors perform.
One subroutine we use is the parallel prefix sum: given an array (a0, a1, ..., an-1), compute all sums si = a0 + a1 + ... + ai, ∀ 0 ≤ i < n. The prefix sums can be computed in O(log n) time using n/log n EREW processors. The technique used to solve the prefix sum is called the binary tree strategy, and it can be used to solve many related problems [9][23].
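The binary tree strategy for prefix sums can be sketched as follows. Each round of the two sweeps below consists of independent updates that an EREW PRAM could execute in parallel, giving O(log n) rounds; here the rounds are simulated sequentially. This is an illustrative sketch, not code from the cited references:

```python
import math

def prefix_sums_tree(a):
    """Inclusive prefix sums via the binary-tree (two-sweep) strategy."""
    n = len(a)
    m = 1 << max(1, math.ceil(math.log2(n)))  # pad to a power of two
    s = a + [0] * (m - n)
    # up-sweep: partial sums move up the implicit binary tree
    d = 1
    while d < m:
        for i in range(d * 2 - 1, m, d * 2):
            s[i] += s[i - d]
        d *= 2
    # down-sweep: distribute the partial sums back down the tree
    d = m // 2
    while d >= 1:
        for i in range(d * 3 - 1, m, d * 2):
            s[i] += s[i - d]
        d //= 2
    return s[:n]
```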

III. NEW PARALLEL ALGORITHM
In this section, we show that the k integer merging problem on the exclusive read exclusive write shared memory model can be solved in total work O(n) instead of O(n log k), which is the total work of the best-known algorithm for k merging on the same shared memory model. Without loss of generality, assume that the elements aij of the k sorted arrays are taken from the integer domain [0, n−1], ∀ 0 ≤ i < k, 0 ≤ j < ni, 2 ≤ k ≤ n, and n = ∑i=0k−1 ni. The elements of the k sorted arrays are uniformly distributed over the integer domain. We also assume that the number of processors used to design the parallel algorithm is n/log n.
The main idea behind the proposed algorithm is to partition the k sorted arrays into n/log n independent lists. Then, the proposed algorithm assigns each list to a processor to merge it sequentially. The algorithm consists of the following stages.

A. Stage 1: Partitioning
The partitioning stage is the first stage in merging the k sorted arrays of integer elements. The goal of this stage is to divide the elements of the k sorted arrays into n/log n lists. The lists have the following properties.
P1: The lists are independent, meaning that the elements of list number i are different from the elements of list number j, ∀ i ≠ j.
P2: The lists are relatively ordered, meaning that the elements of list number i are less than the elements of list number j, ∀ i < j.
P3: The difference between the elements in a list is bounded by a small integer range called the bounded range, BR.
To verify the three properties of the n/log n lists, we define the value of BR by the following equation: BR = n/(n/log n) = log n, so that the n/log n lists together cover the integer domain [0, n−1]. To construct the independent lists, we use two phases to partition the k sorted arrays. These two phases are called local and global partitioning.

B. Local Partitioning Phase
The main objective of the local partitioning phase is to partition each sorted array Ai into many subarrays based on the values of the elements of Ai. The elements of the ith subarray belong to the range [(i−1)·BR, i·BR − 1]. The input of the local phase is k sorted arrays A0, A1, …, Ak-1 of lengths n0, n1, …, nk-1, respectively. By the end of the local partitioning phase, we construct a list, ALi, of li elements for each sorted array Ai, ∀ 0 ≤ i < k. The list contains the boundary indices of each partition in the sorted array Ai. Each element ALi[j] of the list consists of four fields: aNo, pNo, start, and end. The field aNo represents the array number, while the field pNo represents the partition number. The fields start and end represent the start and end indices of partition number pNo in the sorted array Ai, respectively.
To construct the elements of ALi, we have two cases. The first case is when the size of each sorted array is approximately equal to BR. The second case is when the sizes of the sorted arrays are different.
In the case that the size of each sorted array Ai is approximately equal to BR, we execute Subroutine 1 to construct the local partition. In the subroutine, we compute the fields of each element of the list ALi by the following steps.
Initially, the number of elements in the list ALi is equal to 0, and the first element of the array Ai, ai0, determines the first partition belonging to the list ALi; see lines 1-2 in Subroutine 1.
The first three fields of ALi[0] are as follows:
· aNo is the array number, i.
· pNo is the partition number, determined by using the Div operator to return the quotient of the division ai0 Div BR.
· start is the index of the first element, 0.
The fourth field, end, will be determined later, when the algorithm finds the start of a new partition. Then, the algorithm scans the elements of the array Ai from the second to the last element to determine the end of the current partition and the start of a new partition (see line 3 in Subroutine 1). These boundaries can be determined by testing whether the quotients of the two successive elements ai j-1 and ai j on BR are different. When the quotients differ, the new partition number, pNo, is equal to ai j Div BR. In this case, the element ai j is the first element of a new partition and the index j is its start index, while the element ai j-1 is the last element of the current partition and the index j-1 is its end index.
In case the sizes of the k sorted arrays are different, we can compute ALi, ∀ i, using two steps. In the first step, we take each array of size greater than or equal to BR and do the following: 1) Determine the number of processors required for the array Ai, which is equal to npi = ⌊ni / BR⌋. 2) Each processor pij does the same process as in Subroutine 1 on the jth block of Ai, producing a sublist ALij, ∀ 0 ≤ j < npi.

3) Combine all the sublists, ALij, into the list ALi by using the binary tree paradigm.
After finishing all arrays of sizes greater than or equal to BR, we execute the second step: we assign one processor to each of the remaining arrays and perform the same process as in Subroutine 1.

Subroutine 1:
Each processor does the following:
1. li ← 0
2. ALi[0] ← (i, ai0 Div BR, 0, ·)
3. for j ← 1 to ni − 1 do
4.   if ai j Div BR ≠ ai j-1 Div BR then
5.     ALi[li].end ← j − 1
6.     li ← li + 1
7.     ALi[li] ← (i, ai j Div BR, j, ·)
8. ALi[li].end ← ni − 1
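Subroutine 1 can be sketched in Python as follows, a sequential simulation of what one processor does; the dictionary field names mirror the paper's aNo, pNo, start, and end:

```python
def local_partition(i, A_i, BR):
    """Build the boundary list AL_i for sorted array A_i (Subroutine 1 sketch).

    Each entry records aNo (array number), pNo (partition number, value // BR),
    and start/end (indices bounding that partition inside A_i).
    """
    AL = [{"aNo": i, "pNo": A_i[0] // BR, "start": 0, "end": None}]
    for j in range(1, len(A_i)):
        if A_i[j] // BR != A_i[j - 1] // BR:   # quotient changed: new partition
            AL[-1]["end"] = j - 1              # close the current partition
            AL.append({"aNo": i, "pNo": A_i[j] // BR, "start": j, "end": None})
    AL[-1]["end"] = len(A_i) - 1               # close the last partition
    return AL
```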

C. Global Partitioning Phase
The main objective of the global partitioning phase is to partition the k sorted arrays into n/log n lists. Each list satisfies the three properties mentioned in Stage 1.
The input of the global phase is a collection of k lists AL0, AL1, …, ALk-1 of lengths l0, l1, …, lk-1, respectively. By the end of this phase, we have an array, AP, of n/log n elements. The element AP[i] consists of three fields. The first two fields, start and end, represent the start and end indices of the elements of the k sorted arrays that belong to partition number i. The third field, no, represents the number of elements in partition i over all k sorted arrays. We can construct the array AP as follows: 1) Combine the k lists AL0, AL1, …, ALk-1 into a single list AL of length l = l0 + l1 + … + lk-1. 2) Sort the elements of AL on the field pNo by using a parallel integer sorting algorithm. 3) Initially, compute the start and the end of the first and the last partitions, respectively; then determine the remaining partition boundaries and the counts no from the sorted list AL.
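A sequential sketch of the AP construction, under the simplifying assumption that AP[i] stores the list of (aNo, start, end) segments of partition i directly, rather than the start/end indices into the sorted combined list AL that the paper uses:

```python
def global_partition(AL_lists, num_partitions):
    """Group the boundary entries of all k lists by partition number.

    AP[i] collects every segment whose pNo equals i and accumulates the
    total element count `no` of partition i. (Sequential sketch; the
    paper performs this grouping via a parallel integer sort of AL.)
    """
    AP = [{"segments": [], "no": 0} for _ in range(num_partitions)]
    for AL in AL_lists:
        for e in AL:
            p = AP[e["pNo"]]
            p["segments"].append((e["aNo"], e["start"], e["end"]))
            p["no"] += e["end"] - e["start"] + 1   # segment length
    return AP
```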

D. Stage 2: Merging
The main objective of the merging stage is to merge the elements of each partition. In other words, the goal is to merge the sorted subarrays that belong to the ith partition, AP[i].
To merge the sorted subarrays that belong to the ith partition, we have two cases based on the number of elements in each partition. In the first case, the size of each partition is approximately equal to BR, while in the second case the sizes of the partitions are different.
In the first case, we do the merging by using Subroutine 3, which uses the idea of the counting sort algorithm [25]. We use an array CAi of length BR to merge the subarrays that belong to partition number i, ∀ 0 ≤ i < n/log n. Each element of this array consists of two fields. The first field, val, represents the value of the element, while the second field, count, represents the number of repetitions of the value val. The first step of Subroutine 3 initializes the two fields of the jth element of the array with i·log n + j and 0, respectively, as in lines 1-3 of Subroutine 3. In the second step, we compute the number of repetitions of each element by traversing the elements of the partition AP[i], as in lines 4-7 of Subroutine 3. In the third step, we reallocate the elements of the auxiliary array CAi to the output array.
In case the sizes of the partitions are different, we can perform the merging by the same method that is described in the local partitioning phase.

Subroutine 3:
Processor pi does the following:
1. for j ← 0 to BR − 1 do
2.   CAi[j].val ← i · BR + j
3.   CAi[j].count ← 0
4. for each subarray (aNo, start, end) of the partition AP[i] do
5.   for j ← start to end do
6.     x ← the jth element of array AaNo
7.     CAi[x − i · BR].count ← CAi[x − i · BR].count + 1
8. for j ← 0 to BR − 1 do
9.   write CAi[j].val to the output CAi[j].count times
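Subroutine 3 can be sketched in Python as follows; here `segments` is the list of (aNo, start, end) triples of partition p, an assumed representation of AP[p]:

```python
def merge_partition(p, segments, arrays, BR):
    """Merge the subarrays of partition p by counting (Subroutine 3 sketch).

    Every element of partition p lies in [p*BR, (p+1)*BR - 1], so a count
    array CA of length BR suffices; repetitions are handled by `count`.
    """
    CA = [{"val": p * BR + j, "count": 0} for j in range(BR)]  # step 1: init
    for (aNo, start, end) in segments:                          # step 2: count
        for j in range(start, end + 1):
            CA[arrays[aNo][j] - p * BR]["count"] += 1
    out = []                                                    # step 3: write out
    for e in CA:
        out.extend([e["val"]] * e["count"])
    return out
```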
IV. COMPLEXITY ANALYSIS
In this section, we analyze the proposed parallel algorithm for the k integer merging problem according to the following criteria: running time, total work, optimality, and storage.
To compute the running time of the proposed parallel algorithm, recall that the algorithm consists of three main stages: local partitioning, global partitioning, and merging.
The running time of the local partitioning stage can be computed as follows. In the case that the size of each Ai is approximately equal to BR, each processor pi executes a sequential loop over an array of length O(BR).

Therefore, the running time of this step is O(BR)=O(log n).
In the case that the sizes of the Ai are different, the running time can be computed as follows. Determining the number of processors required for Ai takes constant time. The running times of step 2 (execution of Subroutine 1) and step 3 (combining all the sublists) are O(BR) and O(log n), respectively. The overall time for the local partitioning phase is O(log n).
The running time of the global partitioning stage can be computed as follows. The running time of applying the parallel integer sort algorithm on AL is bounded by O(log n), because the maximum length of the list AL is n. The running time of substep 2.1 is constant. The running time of substep 2.2 is O(log l) = O(log n), because l ≤ n. The running time of substep 2.3 is O(log(n/log n)). Therefore, the overall running time of the global partitioning stage is O(log n).

The running time of the merging stage can be computed as follows. In the case that the size of each partition is approximately equal to BR, the running time of Subroutine 3 is O(BR).
In case the sizes of the partitions are different, the running time can be computed in a similar way and is equal to O(log n). The overall running time of the algorithm is O(log n).
It can be seen from the previous calculations that the total work done by each processor pi is O(log n), ∀ 0 ≤ i < n/log n. Hence, the algorithm has a total work of O(n).
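The cost calculation can be written out explicitly, with p the number of processors and T(n) the parallel running time:

```latex
\mathrm{cost} = p \cdot T(n) = \frac{n}{\log n} \cdot O(\log n) = O(n)
```

Since the total work is bounded above by the cost, the work is O(n) as well.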
Therefore, the proposed algorithm has optimal work and cost. Also, the storage required by the proposed algorithm is O(n).
Finally, it is clear that no step of the algorithm requires concurrent reads or writes. Therefore, the proposed algorithm runs on the exclusive read exclusive write shared memory model.

V. EXAMPLE
Assume that we have six sorted arrays of total length equal to 32, as in Fig. 1.
After the local partitioning stage, the algorithm executes the global partitioning stage by sorting the elements of all the lists AL0, AL1, AL2, AL3, AL4, and AL5 to obtain a sorted list AL of l = 17 elements, as in Fig. 3.
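The whole scheme can also be simulated end to end. The sketch below fuses the partitioning phases into a single counting pass per value (each value x belongs to partition x Div BR); the choice BR = round(log2 n) is ours for illustration:

```python
import math

def k_merge(arrays):
    """End-to-end sequential simulation of the partition-then-count scheme.

    Assumes all values lie in the integer domain [0, n-1], where n is the
    total number of elements, matching the paper's assumption.
    """
    n = sum(len(a) for a in arrays)
    BR = max(1, round(math.log2(n)))   # bounded range, BR = O(log n)
    nparts = -(-n // BR)               # ceil(n / BR) partitions cover [0, n-1]
    counts = [[0] * BR for _ in range(nparts)]
    for a in arrays:                   # partitioning stages, fused into one pass
        for x in a:                    # value x falls in partition x // BR
            counts[x // BR][x % BR] += 1
    out = []
    for p in range(nparts):            # merging stage: counting within each range
        for j in range(BR):
            out.extend([p * BR + j] * counts[p][j])
    return out
```

For six sorted arrays of total length 32 as in the example, BR = 5 and the domain [0, 31] splits into seven bounded ranges, each merged independently by counting.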

VI. CONCLUSION
The paper addresses the problem of merging when the number of input sorted arrays is k, 2 ≤ k ≤ n. The output of the merging is a new sorted array that contains all elements of the input. Our main contribution is solving the k integer merging problem under the exclusive read exclusive write shared memory model. The proposed algorithm runs in O(log n) time using n/log n processors. Additionally, the total work done by the proposed algorithm is O(n), which is less than that of the best-known k merging parallel algorithms, which perform O(n log k) work.