Empirical Analysis Measuring the Performance of Multi-threading in Parallel Merge Sort

Sorting is one of the most common problems in computer science, and various sorting algorithms have been invented for specific requirements. As these requirements and capabilities grow, sequential processing becomes inefficient. Therefore, algorithms are being enhanced to run in parallel to achieve better performance. The performance of a parallel algorithm differs depending on the degree of multithreading. This study determines the optimal number of threads to use in parallel merge sort. Furthermore, it provides a comparative analysis of various degrees of multithreading. The implementation in this empirical experiment takes a group of devices with various specifications. For each device, it takes fixed-size data sets and executes merge sort with the sequential and parallel algorithms. For each device, the lowest average runtime is used to measure the efficiency of the experiment. Across all experiments, single-threaded execution is more efficient when the data size is less than 10^5, since it accounted for 53% of the lowest runtimes compared with the multithreaded executions. The overall average of the experiments shows that either four or eight threads, with 72% and 28% respectively, are most efficient when data sizes exceed 10^5.

Keywords: Parallel merge sort; sort; multithread; degree of multithreading


I. INTRODUCTION
Merge sort is a divide-and-conquer algorithm that was invented by John von Neumann in 1945. It is an efficient, general-purpose, comparison-based sorting algorithm [1]. Most implementations produce a stable sort, which means that the implementation preserves the input order of equal elements in the sorted output. A detailed description and analysis of bottom-up merge sort appeared in a report by Goldstine and von Neumann as early as 1948 [2]. A divide-and-conquer algorithm recursively breaks a problem down into subproblems until they are simple enough to solve directly, then combines the solutions of the subproblems until the original problem is solved. For sorting n objects (a list of array elements), merge sort is an efficient algorithm with an average and worst-case performance of O(n log n) [2].
If the running time of merge sort for a list of length n is T(n), then the recurrence T(n) = 2T(n/2) + n follows from the definition of the algorithm (apply the algorithm to two lists of half the size of the original list and add the n steps taken to merge the resulting two lists). In the worst case, the number of comparisons merge sort makes is equal to or slightly smaller than (n⌈log n⌉ − 2^⌈log n⌉ + 1), which is between (n log n − n + 1) and (n log n + n + O(log n)) [3]. In the section below, pseudocode for merge sort is given, followed by an example in Fig. 1 using a simple data set of {38, 27, 43, 3, 9, 82, 10} [4]. Fig. 1 illustrates how the algorithm divides the items down to single elements and then combines them recursively. Because the sub-lists are processed independently, the algorithm lends itself to parallel execution. Parallel merge sort reduces the running time to O((n log n)/t), where t is the number of threads, by dividing the data into equal portions and assigning each portion to a specific thread. The complexity can be reduced to as low as O(n), but this varies according to the number of threads used [5].
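The recurrence above can be read directly off a top-down implementation. The following is a minimal sketch in Java, not the paper's Algorithm 1; the class and method names are illustrative assumptions. It sorts the example data set from Fig. 1.

```java
import java.util.Arrays;

// Illustrative sketch only; names are assumptions, not taken from the paper.
class MergeSortSketch {

    // Recursively sorts arr[lo..hi) using a scratch buffer of the same length.
    static void mergeSort(int[] arr, int[] tmp, int lo, int hi) {
        if (hi - lo < 2) {
            return; // zero or one element is already sorted
        }
        int mid = lo + (hi - lo) / 2;
        mergeSort(arr, tmp, lo, mid);  // first T(n/2) term
        mergeSort(arr, tmp, mid, hi);  // second T(n/2) term
        merge(arr, tmp, lo, mid, hi);  // the "+ n" term: linear-time merge
    }

    // Merges the sorted runs arr[lo..mid) and arr[mid..hi) back into arr.
    static void merge(int[] arr, int[] tmp, int lo, int mid, int hi) {
        System.arraycopy(arr, lo, tmp, lo, hi - lo);
        int i = lo, j = mid;
        for (int k = lo; k < hi; k++) {
            // Taking from the left run on ties keeps the sort stable.
            if (j >= hi || (i < mid && tmp[i] <= tmp[j])) {
                arr[k] = tmp[i++];
            } else {
                arr[k] = tmp[j++];
            }
        }
    }

    // Convenience entry point that allocates the scratch buffer.
    static void mergeSort(int[] arr) {
        mergeSort(arr, new int[arr.length], 0, arr.length);
    }

    public static void main(String[] args) {
        int[] data = {38, 27, 43, 3, 9, 82, 10};
        mergeSort(data);
        System.out.println(Arrays.toString(data)); // prints [3, 9, 10, 27, 38, 43, 82]
    }
}
```

The two recursive calls operate on disjoint halves of the array, which is the property that makes it possible to hand each half to a separate thread.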
Merge sort is suitable when the data structure is a linked list because a linked list is a sequential-access structure. Using a linked list hinders the performance of other algorithms such as quicksort and heapsort [6,7]. Moreover, parallel merge sort is frequently used in various domains, including sorting NoSQL databases [8], high-performance computing environments [9], and massively parallel architectures [10,11]. When it comes to executing algorithms in parallel, most studies report performance results on several processors [12][13][14][15]. These results mainly rely on the specifications of the device and the behavior of the execution in terms of multithreading. The question that led to this research is: what is the suitable degree of multithreading for parallel merge sort? This study conducts an empirical experiment and highlights two factors that influence multithreading performance: first, the number of cores, and second, the data size at which single-threaded performance degrades and multithreading becomes necessary.
The contribution of this paper is to determine the optimal number of threads to use in parallel merge sort. Furthermore, it provides a comparative analysis of various degrees of multithreading. Each data size is examined with a fixed set of thread counts: one thread (sequential), and two, four, eight, and sixteen threads (parallel).
Section 2 reviews related studies to see how parallel merge sort has been implemented and what results were obtained. Section 3 explains how the experiment was conducted. The results are presented in Section 4 and interpreted in the discussion. Finally, Section 5 presents the conclusion of this study.

II. RELATED WORK
Several papers have studied parallel merge sort; their main findings are summarized below.
Jeon [13] improved parallel merge sort by distributing approximately equal numbers of keys to all processors throughout the merging phases. Using histogram information, keys can be divided equally regardless of their distribution. The evaluation showed improved speedup when parallel merge sort was applied on two different parallel machines, a Cray T3E and a Pentium III PC cluster, with a maximum data size of 4 × 10^6.
In [14], the algorithm was tested on loosely coupled parallel machines and its performance was observed. The computational time of the algorithm was found to vary logarithmically as the number of processors varies.
Uyar [5] applied parallel merge sort using multiple threads, similar to this experiment. That work states that two threads can perform one merge operation simultaneously. One thread generates the first half of the sorted values, starting from the minimums of the two sorted subsets. The other thread generates the second half of the sorted values, starting from the maximums of the two sorted subsets. The work also compared this approach with double merging using four threads, implemented in Java. The comparison focused on array sizes from 10 million up to 50 million. In this study, the array size ranges from 5000 up to 50 million in order to detect when parallel execution becomes more efficient than sequential execution.
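The two-thread merge described above can be sketched as follows. This is an illustration of the idea as summarized here, not the code from [5]; the class name, method names, and tie-breaking rule are assumptions.

```java
import java.util.Arrays;

// Illustrative sketch only; not the implementation from [5].
class TwoEndedMergeSketch {

    // Merges sorted arrays a and b into a new sorted array using two threads.
    static int[] merge(int[] a, int[] b) {
        int n = a.length + b.length;
        int half = n / 2;
        int[] out = new int[n];

        // Thread 1 fills out[0..half) starting from the minimums.
        Thread low = new Thread(() -> {
            int i = 0, j = 0;
            for (int k = 0; k < half; k++) {
                if (j >= b.length || (i < a.length && a[i] <= b[j])) {
                    out[k] = a[i++];
                } else {
                    out[k] = b[j++];
                }
            }
        });

        // Thread 2 fills out[half..n) starting from the maximums.
        Thread high = new Thread(() -> {
            int i = a.length - 1, j = b.length - 1;
            for (int k = n - 1; k >= half; k--) {
                if (j < 0 || (i >= 0 && a[i] > b[j])) {
                    out[k] = a[i--];
                } else {
                    out[k] = b[j--];
                }
            }
        });

        low.start();
        high.start();
        join(low);
        join(high);
        return out;
    }

    // Waits for a thread, converting interruption into an unchecked exception.
    static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] merged = merge(new int[]{3, 27, 38, 43}, new int[]{9, 10, 82});
        System.out.println(Arrays.toString(merged)); // prints [3, 9, 10, 27, 38, 43, 82]
    }
}
```

The two threads write to disjoint halves of the output array, so no locking is needed; the calls to `join` ensure both halves are complete before the result is returned.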
A study compared three parallel sorting algorithms (odd-even transposition sort, parallel rank sort, and parallel merge sort) on 2, 4, 6, 8, 10, and 12 processors using 10,000 integers [15]. The results showed that parallel merge sort was the fastest, yet the study used only one input size, and the outcome may differ as the data size increases.
These previous studies show that merge sort can be parallelized in several ways, giving better results than sequential execution as the array size increases [5], [13][14][15]. Yet, these studies were concerned with enhancing the performance of merge sort without comparing degrees of multithreading. Only [5] compared different array sizes, and only with up to four threads on a specific range of sizes, from 10^6 to 5 × 10^6. This study evaluates parallel merge sort with four different degrees of multithreading over a broader range of array sizes, from 10^5 to 10^7, as explained in Section 3, while maintaining the integrity of the specifications.

A. Requirements
This experiment was implemented in Java SE 8. It was conducted on five devices to ensure diversity in the implementation environment and to verify that the results do not depend on the specifications of a particular device. The specifications of the devices used in this experiment are shown in Table I.

B. Implementation
This experiment takes a specific data set and executes it in two approaches: 1) Sequential (one thread), 2) Parallel (two, four, eight, and sixteen threads). The source code is available on https://github.com/muhyidean/ParallelMergeSort.git.
The implementation in this experiment takes a data set and applies merge sort with the sequential and parallel algorithms. For sequential execution, it runs Algorithm 1. For parallel execution, it runs Algorithm 2 based on the following:
1) Data formation: The array sizes for the data sets are 5 × 10^3, 10^4, 5 × 10^4, 10^5, ... up to 10^7. For each array size, ten different random data sets are generated and used in both execution approaches. Each data set is placed in a separate array and executed with each approach. The average runtime of the ten executions for each array size is recorded in milliseconds.
2) Partition process: The partitioning falls into five categories: one sequential and four degrees of multithreading (2, 4, 8, and 16 threads). The original data set is considered the first partition, so it is executed directly (sequentially). Then the same data set is split in half, making two partitions, each of which is assigned to a thread to run in parallel. The process continues in the same way for the other thread counts, so that the data set is partitioned for two, four, eight, and sixteen threads.
3) Thread management: The parallel merge sort implementation divides the array into as many sub-arrays as there are threads. The threads sort their assigned sub-arrays independently. Two consecutive sorted sub-arrays are then combined by one thread, so that each merging thread merges two sorted arrays. After each round of merging, the number of arrays is halved. During the last iteration, two sorted arrays are merged to produce the final sorted array. This implementation did not use any third-party libraries or frameworks; it was implemented with the Java thread package in the JDK (Java Development Kit). Fig. 2 illustrates the partitioning process and the merging mechanism. Each elliptical shape is a thread; the shapes labeled D represent partitions of the original array that are sorted by merge sort. The shapes labeled M merge the results from the previous threads until the whole array is merged. For clarity, sixteen threads are not shown in Fig. 2 because that case follows the same partitioning.
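The partition and thread-management steps above can be sketched in Java as follows. This is a hedged illustration using only `java.lang.Thread`, not the code from the linked repository; the class and method names, and the restriction to power-of-two thread counts, are assumptions made for the sketch.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch only; not the authors' repository code.
class ParallelMergeSortSketch {

    // Sorts data using `threads` sorting threads (assumed to be a power of two).
    static void parallelSort(int[] data, int threads) {
        if (threads < 1 || (threads & (threads - 1)) != 0) {
            throw new IllegalArgumentException("threads must be a power of two");
        }
        int n = data.length;
        int[] tmp = new int[n];
        int[] bounds = new int[threads + 1]; // partition p is [bounds[p], bounds[p+1])
        for (int p = 0; p <= threads; p++) {
            bounds[p] = (int) ((long) n * p / threads);
        }
        // Sorting phase (the D nodes): each thread merge-sorts its own partition.
        Thread[] sorters = new Thread[threads];
        for (int p = 0; p < threads; p++) {
            final int lo = bounds[p], hi = bounds[p + 1];
            sorters[p] = new Thread(() -> mergeSort(data, tmp, lo, hi));
        }
        runAll(sorters);
        // Merging phase (the M nodes): each round halves the number of sorted runs.
        for (int width = 1; width < threads; width *= 2) {
            int pairs = threads / (2 * width);
            Thread[] mergers = new Thread[pairs];
            for (int m = 0; m < pairs; m++) {
                final int lo = bounds[2 * width * m];
                final int mid = bounds[2 * width * m + width];
                final int hi = bounds[2 * width * (m + 1)];
                mergers[m] = new Thread(() -> merge(data, tmp, lo, mid, hi));
            }
            runAll(mergers);
        }
    }

    // Starts every thread, then waits for all of them to finish.
    static void runAll(Thread[] ts) {
        for (Thread t : ts) t.start();
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    // Sequential top-down merge sort of arr[lo..hi).
    static void mergeSort(int[] arr, int[] tmp, int lo, int hi) {
        if (hi - lo < 2) return;
        int mid = lo + (hi - lo) / 2;
        mergeSort(arr, tmp, lo, mid);
        mergeSort(arr, tmp, mid, hi);
        merge(arr, tmp, lo, mid, hi);
    }

    // Merges sorted runs arr[lo..mid) and arr[mid..hi) using tmp[lo..hi) as scratch.
    static void merge(int[] arr, int[] tmp, int lo, int mid, int hi) {
        System.arraycopy(arr, lo, tmp, lo, hi - lo);
        int i = lo, j = mid;
        for (int k = lo; k < hi; k++) {
            if (j >= hi || (i < mid && tmp[i] <= tmp[j])) arr[k] = tmp[i++];
            else arr[k] = tmp[j++];
        }
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(100_000, 0, 1_000_000).toArray();
        int[] expected = data.clone();
        Arrays.sort(expected);
        parallelSort(data, 4);
        System.out.println(Arrays.equals(data, expected)); // prints true
    }
}
```

Every thread in a given phase works on a disjoint index range of both the data array and the scratch buffer, so the threads need no synchronization beyond the joins between phases. With t sorting threads, the merging phase uses t/2, t/4, ..., 1 merging threads in successive rounds.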

C. Data Analysis
Tables III to VII show the average runtime for different array sizes on each device. They also show how each device performs under the different execution approaches (sequential and parallel). Each average execution time is calculated by running the algorithm ten times and taking the mean of the measured times.

Excerpt of the pseudocode for the parallel algorithm:

    // Define threads to execute merge sort for each array
    threads t1(mergesort(arr1)), t2(mergesort(arr2)), ..., tx(mergesort(arrx))
    // Assign random integers to the main arrays, giving each the same set of random values
    for i ← 0 to val do
        n ← random value in range of (1 − x)

IV. RESULTS AND DISCUSSION
This section highlights the main findings of the empirical experiment. To measure the efficiency of the experiment, the lowest average execution time (ms) is taken for each data size on each device.
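The measurement rule above (average over repeated runs, then take the lowest average per data size) can be sketched as a small harness. Everything here is an assumption for illustration: the `Sorter` interface is a hypothetical hook, and the placeholder sorter in `main` ignores its thread count and simply calls `Arrays.sort`. The sketch performs no JIT warm-up, so absolute timings should be treated with caution.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch only; not the authors' measurement code.
class RuntimeAverageSketch {

    // Hypothetical hook: any sort that accepts a thread count can be plugged in.
    interface Sorter {
        void sort(int[] data, int threads);
    }

    // Average wall-clock time in milliseconds over `runs` random data sets.
    // Assumes size >= 1 and runs >= 1.
    static double averageMillis(Sorter sorter, int size, int threads, int runs, long seed) {
        // The same seed yields the same data sets for every thread count.
        Random random = new Random(seed);
        long totalNanos = 0;
        for (int r = 0; r < runs; r++) {
            int[] data = random.ints(size, 1, size + 1).toArray();
            long start = System.nanoTime();
            sorter.sort(data, threads);
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / runs;
    }

    // Returns the thread count with the lowest average runtime for this size.
    static int bestThreadCount(Sorter sorter, int size, int[] threadCounts, int runs, long seed) {
        int best = threadCounts[0];
        double bestAverage = Double.MAX_VALUE;
        for (int t : threadCounts) {
            double average = averageMillis(sorter, size, t, runs, seed);
            if (average < bestAverage) {
                bestAverage = average;
                best = t;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Placeholder sorter: replace with a real multithreaded sort to get meaningful results.
        Sorter placeholder = (data, threads) -> Arrays.sort(data);
        int[] threadCounts = {1, 2, 4, 8, 16};
        int best = bestThreadCount(placeholder, 50_000, threadCounts, 10, 42L);
        System.out.println("Lowest average runtime at " + best + " thread(s)");
    }
}
```

Reusing one seed per data size means every degree of multithreading is timed on identical inputs, so differences in the averages reflect the thread count rather than the data.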

A. Results
Tables III to VII show the average of 10 executions for each degree of multithreading. Each column is a different data size, starting from 5 × 10^3 up to 10^7. The rows show the performance of each number of threads for a specific data size; for example, (Th-1) is one thread, (Th-2) is two threads, and so on. As shown in Tables III to VII, for data size 5 × 10^3, all devices perform most efficiently in terms of runtime in single-threaded execution. For the sizes 10^4 and 5 × 10^4, the most efficient configuration varies from one to eight threads depending on the number of cores in the device. With data sizes of 10^5 and larger, each device performs better with a certain number of threads, depending on its number of cores. Fig. 3 to 7 illustrate the performance according to the data size and the number of threads used. Multithreading is clearly more efficient as the data size increases. The appropriate number of threads generally becomes visible when the data size exceeds 10^5.

B. Findings
There were two main findings from these results. First, multithreading does not always have the most efficient runtime as it depends on the data size. Second, even when the data size increases, a specific number of threads will determine the optimized performance based on the device specifications. In other words, implementing as many threads as possible will not lead to higher runtime performance.
Tables VIII and IX highlight these findings, one for data sizes below 10^5 and the other for data sizes above 10^5. Table VIII shows the overall average for each device with data sizes below 10^5. For example, on Device 1, the sequential runtime performance was the most efficient. Taking the overall average, single-threaded execution was more efficient, since it accounted for 53% of the lowest runtimes compared with the multithreaded executions. Table IX shows the overall average for each device with data sizes above 10^5. As shown in Table IX, multithreaded execution with either four or eight threads provided the best performance, with 72% and 28% respectively. Fig. 8 and 9 visualize which thread counts performed better in the overall average for different data sizes. A higher percentage indicates that using a specific number of threads is more efficient for a particular data size.
Based on the experiment results, all devices that have four cores achieved efficient runtime performance with four threads. Moreover, all devices with eight cores achieved efficient runtime performance with eight threads. Evidently, the selection of the number of threads is mainly determined by the number of the cores.

C. Discussion
The main question of this study is: what is the optimal number of threads for parallel merge sort, considering two main factors, data size and number of cores? The results show that having as many threads as possible does not lead to the best runtime performance. To achieve the best runtime performance, the number of cores present is crucial in determining the optimal number of threads. This is due to how the operating system executes multiple threads: threads in excess of the available cores must share those cores, which adds scheduling overhead without adding parallelism. Correspondingly, the data size determines whether multiple threads are required. For small data sets, the use of multiple threads is unnecessary since one thread can perform more efficiently.
The conclusion is that if the data size is under 10^5, single-threaded execution will be more efficient. In contrast, multiple threads will perform better for data sizes that exceed 10^5. In addition, the number of threads spawned should not exceed the number of cores (excluding merging threads).
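This rule of thumb can be written as a small selection function. The sketch below is one possible reading, not part of the experiment: the class and method names are hypothetical, treating a size of exactly 10^5 as parallel is an assumption, and rounding down to a power of two is an assumption motivated by the thread counts tested (1, 2, 4, 8, 16).

```java
// Illustrative sketch only; one possible encoding of the rule of thumb.
class ThreadCountHeuristicSketch {

    // Threshold taken from the experiment: below this size one thread was most efficient.
    static final int PARALLEL_THRESHOLD = 100_000;

    // Suggests a number of sorting threads for a given data size and core count.
    static int chooseThreads(int dataSize, int cores) {
        if (dataSize < PARALLEL_THRESHOLD || cores < 2) {
            return 1; // small inputs: thread overhead outweighs the benefit
        }
        // Largest power of two that does not exceed the core count (assumption).
        return Integer.highestOneBit(cores);
    }

    public static void main(String[] args) {
        // availableProcessors() reports logical processors, which may differ from physical cores.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("5,000 elements: " + chooseThreads(5_000, cores) + " thread(s)");
        System.out.println("1,000,000 elements: " + chooseThreads(1_000_000, cores) + " thread(s)");
    }
}
```

For example, `chooseThreads(1_000_000, 4)` returns 4 and `chooseThreads(1_000_000, 8)` returns 8, matching the pattern observed on the four-core and eight-core devices.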

V. CONCLUSION
This study conducts an empirical experiment to determine the optimal number of threads to use in parallel merge sort. Several factors are discussed in this study to answer this question. First is the number of cores that impact multithreading performance. Second is the given data size that requires the use of multiple cores.
The implementation in this experiment takes a group of devices with various specifications. For each device, it takes fixed-size data sets and applies merge sort with the sequential and parallel algorithms. For each device, the lowest average execution time (ms) is used to measure the efficiency of the experiment. Taking the average over all experiments, single-threaded execution is more efficient when the data size is less than 10^5, since it accounted for 53% of the lowest runtimes. For data sizes exceeding 10^5, multithreaded execution performs better. The overall average of the experiments shows that either four or eight threads are most efficient, with 72% and 28% respectively. There were two main findings from these results. First, multithreading does not always have the most efficient runtime, as it depends on the data size. Second, even when the data size increases, a specific number of threads will determine the optimized performance based on the device specifications. In other words, implementing as many threads as possible will not lead to higher runtime performance.
The conclusion is that if the data size is under 10^5, single-threaded execution will be more efficient. In contrast, multiple threads will perform better for data sizes that exceed 10^5. In addition, the number of threads spawned should not exceed the number of cores (excluding merging threads).