A Performance Study of Some Sophisticated Partitioning Algorithms

Partitioning is a central component of Quicksort, an intriguing sorting algorithm that is part of the C, C++ and Java libraries, and the performance of Quicksort ultimately depends on it. Several elegant partitioning algorithms exist, and a profound understanding of this prior work may be needed to choose among them. In this paper we undertake a careful study of these algorithms on modern machines with the help of state-of-the-art performance analyzers, and choose the best partitioning algorithm on the basis of some crucial performance indicators.
Keywords: Quicksort; Hoare Partition; Lomuto Partition; AQTime.


I. INTRODUCTION
Partitioning is undoubtedly a core part of Quicksort, and the performance of Quicksort ultimately depends on it. Quicksort is a leading and widely used sorting algorithm; for instance, the C, C++ and Java libraries use Quicksort as their sorting routine. Partitioning is a key component of both Quicksort and the selection algorithm. Several partitioning algorithms accomplish the task, but only a few deserve special attention: Hoare, Lomuto, Modified Lomuto and Modified Hoare. This paper carries out an in-depth study of these selected partitioning algorithms. The important question is which partitioning algorithm is superior, so that the sorting routine can call it, and this study attempts to answer that question. In the past, scientists studied and compared these algorithms; the comparisons, however, were theoretical and were made on old architectures. An algorithm that is effective on old architectures may not be effective on modern machines, and a study valid on old architectures may not remain valid on modern ones. Moreover, earlier researchers did not have advanced performance analyzers to study cache misses and page faults. Consequently they relied on cache simulations, and their results may therefore be inaccurate. Hence it is beneficial to compare the algorithms on contemporary architectures using state-of-the-art performance analyzers.
It has not escaped our notice that state-of-the-art machines are multicore, and that an algorithm must be multicore ready to be effective [13]. The future lies in parallel and multithreaded algorithms, but even these need sequential algorithms at the lower level. The basic question is which sequential sorting algorithm to call at that level: calling a slow sequential algorithm there will neutralize the advantage that parallel sorting gains from multiple cores. The choice of sequential sorting algorithm at the lower level is therefore of paramount importance. The literature suggests that Quicksort offers the most effective answer, at least today. If Quicksort is the lower-level sequential sorting algorithm, the very next question is which partitioning algorithm to choose. This study addresses that question.
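To make the role of the partitioning routine concrete, the following minimal sketch shows a sequential Quicksort that receives its partitioning routine as a parameter. It is an illustration only and not the code used in this study: the names quicksort and lomuto_partition, and the convention that a partition routine returns the final pivot index, are assumptions introduced here, and the textbook Lomuto scheme serves merely as a default.

```python
from typing import Callable, List

# Assumed convention (not from the paper): a partition routine rearranges
# a[lo..hi] in place and returns the final index p of the pivot, such that
# every element of a[lo..p-1] is <= a[p] and every element of a[p+1..hi] is >= a[p].
Partition = Callable[[List[int], int, int], int]


def lomuto_partition(a: List[int], lo: int, hi: int) -> int:
    """Textbook Lomuto partition with the last element as pivot."""
    pivot = a[hi]
    i = lo  # a[lo..i-1] holds the elements smaller than the pivot
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]  # move the pivot to its final position
    return i


def quicksort(a: List[int], partition: Partition = lomuto_partition,
              lo: int = 0, hi: int = None) -> None:
    """Sort a[lo..hi] in place; the partition routine is pluggable."""
    if hi is None:
        hi = len(a) - 1
    while lo < hi:
        p = partition(a, lo, hi)
        # Recurse on the smaller side and loop on the larger one,
        # which keeps the recursion depth logarithmic.
        if p - lo < hi - p:
            quicksort(a, partition, lo, p - 1)
            lo = p + 1
        else:
            quicksort(a, partition, p + 1, hi)
            hi = p - 1
```

Any routine that follows the same return convention can be passed as the partition argument, so the sorting routine itself does not change when the partitioning algorithm is swapped.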
The central idea of the paper is to study the performance of the selected partitioning algorithms on contemporary machines. A fair test of an algorithm's performance is its execution time; the drawback of this approach, however, is that it provides no intuition as to why the execution time was good or bad. The reasons may include a high instruction count, a high cache miss count or a high branch misprediction count. A high page fault count also affects performance. Earlier researchers studied the impact of these factors using cache simulation and similar techniques. Fortunately, researchers today have performance analysis software that not only captures execution time effectively but also acquires accurate data about cache misses, branch mispredictions and page faults.

II. LITERATURE REVIEW
In the past, researchers did not enjoy the luxury of the sophisticated profilers available now. Instead they relied heavily on theoretical models and cache simulations. The majority of algorithm researchers compare algorithmic performance on the basis of a unit cost model. The RAM model is the most commonly used unit cost model; in it, every basic operation has unit cost. The advantage of the unit cost model is that it is simple and easy to use, and it produces results that are easily comparable. However, this model does not reflect the memory hierarchy present in modern machines. Main memory has grown slower relative to processor cycle times, and consequently the cache miss penalty has grown significantly [12]. Good overall performance therefore cannot be achieved without keeping the cache miss count as low as possible. Since the RAM model does not count cache misses, it is no longer a useful model.
Algorithm researchers in the sorting area usually count only particular expensive operations. Analysis of sorting and searching algorithms, for instance, counts only the number of comparisons and swaps. There was sound logic behind counting only the comparison operation, which was expensive in the past: it simplified the analysis while retaining accuracy, since the bulk of the cost was captured. This is no longer true, because shifts in technology render the "expensive operations" inexpensive and vice versa. The same has happened to the comparison operation, which is no longer expensive; indeed, it is no more expensive than an addition or a copy. Hence this study favours a practical approach and is not biased towards a single performance indicator. The idea is to take a fairly objective view and aim for good overall performance rather than concentrate on a single performance indicator.
www.ijacsa.thesai.org
The literature reveals that every partitioning algorithm incurs (n-1) comparisons, where n is the total number of elements in the array [1,2,3,4,5,6,7,8,9]. Partitioning algorithms differ in their swap count or data transfer operation count. In the Hoare and Modified Hoare partition algorithms, the swap count / data transfer count adapts to the input. In the worst case, the swap count / data transfer count is approximately (n/2) for the Hoare and Modified Hoare algorithms, whereas it is approximately (n) for Lomuto and Modified Lomuto [14].
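The difference in swap counts can be illustrated with a small sketch that instruments the textbook Lomuto and Hoare schemes with a swap counter. The function names and the pivot choices (last element for Lomuto, first element for Hoare) are assumptions made for this illustration; the Modified Lomuto and Modified Hoare variants are not reproduced here, and the code is not the implementation measured in this study.

```python
from typing import List, Tuple


def lomuto_swaps(a: List[int]) -> Tuple[int, int]:
    """Textbook Lomuto partition of a non-empty list (pivot = last element).

    Returns (final pivot index, number of swaps performed).
    """
    pivot = a[-1]
    i = 0
    swaps = 0
    for j in range(len(a) - 1):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]  # counted even when i == j
            swaps += 1
            i += 1
    a[i], a[-1] = a[-1], a[i]  # final swap places the pivot
    swaps += 1
    return i, swaps


def hoare_swaps(a: List[int]) -> Tuple[int, int]:
    """Textbook Hoare partition of a non-empty list (pivot = first element).

    Returns (split index j, number of swaps performed); afterwards every
    element of a[0..j] is <= every element of a[j+1..].
    """
    pivot = a[0]
    i, j = -1, len(a)
    swaps = 0
    while True:
        i += 1
        while a[i] < pivot:
            i += 1
        j -= 1
        while a[j] > pivot:
            j -= 1
        if i >= j:
            return j, swaps
        a[i], a[j] = a[j], a[i]
        swaps += 1


# Six ascending keys: every key except the pivot is smaller than it.
_, lomuto_count = lomuto_swaps([1, 2, 3, 4, 5, 6])
# Six equal keys: both scans stop at every element.
_, hoare_count = hoare_swaps([3, 3, 3, 3, 3, 3])
print(lomuto_count, hoare_count)  # prints "6 3"
```

In this sketch each Hoare swap moves both scan indices at least one position towards each other, so the count cannot exceed n/2, whereas the Lomuto scheme may swap once for every element.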

III. PERFORMANCE STUDY ON MODERN ARCHITECTURES
This paper studies the performance of the Hoare, Lomuto, Modified Hoare and Modified Lomuto partition algorithms on contemporary computers. To this end, the algorithms were tested on pseudorandom numbers using state-of-the-art machines. Experiments were performed on a state-of-the-art COMPAQ PC running the Windows Ultimate operating system. The following tables and figure present the average-case statistics generated by the tests on three important performance indicators: elapsed time, CPU cache misses and branch mispredictions. AQTime software was instrumental in gathering reliable profiling data. Elapsed time in the tables is given in milliseconds. The tables and Figure 1 show the results for random input on these three indicators. Since the page fault count was 0 for every algorithm, it is not shown explicitly in the tables. The zero page fault count is due to a large main memory size, which was not feasible earlier.
Modified Hoare partition outperforms the other algorithms in almost all entries in the tables. Modified Lomuto finishes second and is not far behind. It is followed by Hoare partition, which in turn is followed by Lomuto, the last to complete. It is easy to see that, among the studied algorithms, the one with the better cache miss count is usually the first to complete the partitioning. The Lomuto and Hoare algorithms are slow because of their higher instruction count, poor cache miss count and fairly high branch misprediction count.
The interesting question that emerges is why Modified Hoare and Modified Lomuto have a lower cache miss count while the others have a higher one. The intuitive reason is that the instruction cache miss count is likely to go down as the overall instruction count and code size go down. If the data cache miss count can also be kept in check, then the overall cache miss count will be low. This seems to be what happened with the Modified Hoare and Modified Lomuto partitioning algorithms.
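The elapsed-time part of such an experiment can be sketched as follows. This is a hedged illustration, not the AQTime setup used in the study: the helper name time_partition_ms, the seed, the number of trials and the use of a textbook Lomuto partition as the routine under test are all assumptions introduced here. A wall-clock timer of this kind reports only elapsed time; cache miss and branch misprediction counts require a hardware-counter profiler.

```python
import random
import time
from typing import Callable, List


def lomuto_partition(a: List[int], lo: int, hi: int) -> int:
    """Textbook Lomuto partition, used here only as the routine under test."""
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i


def time_partition_ms(partition: Callable[[List[int], int, int], int],
                      n: int, trials: int = 5, seed: int = 12345) -> float:
    """Average elapsed time, in milliseconds, of one partition call
    on n pseudorandom integers (n must be at least 1)."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    total = 0.0
    for _ in range(trials):
        data = [rng.randrange(n) for _ in range(n)]  # fresh input per trial
        start = time.perf_counter()
        partition(data, 0, n - 1)
        total += time.perf_counter() - start
    return 1000.0 * total / trials


if __name__ == "__main__":
    for size in (10_000, 100_000):
        print(size, round(time_partition_ms(lomuto_partition, size), 3), "ms")
```

Generating the input outside the timed region keeps the cost of the pseudorandom number generator out of the measurement, and averaging over several trials reduces the influence of a single unlucky pivot.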

TABLE I: STATISTICS OF LOMUTO PARTITION

TABLE II: STATISTICS OF MODIFIED LOMUTO PARTITION