Fast Sorting Algorithms using AVX-512 on Intel Knights Landing

This paper describes fast sorting techniques using the recent AVX-512 instruction set. Our implementations benefit from the latest possibilities offered by AVX-512 to vectorize a two-part hybrid algorithm: we sort the small arrays using a branch-free Bitonic variant, and we provide a vectorized partitioning kernel which is the main component of the well-known Quicksort. Our algorithm sorts in-place and is straightforward to implement thanks to the new instructions. In the process, we also show how an algorithm can be adapted to and implemented with AVX-512. We report a performance study on the Intel KNL where our approach is faster than the GNU C++ sort algorithm for any size, in both integer and double floating-point arithmetic, by a factor of 4 on average.


INTRODUCTION
Sorting is a fundamental problem in computer science, and efficient implementations are critical for various applications such as database servers [1] or image rendering engines [2]. Moreover, sorting libraries are massively used in software development to reduce the complexity of specific algorithms. For example, once an array is sorted, it is possible to use binary search and find any item in logarithmic time. Therefore, having efficient sorting implementations on new architectures can improve the performance of a wide range of applications.
From one CPU generation to the next, improvements and changes are made at various levels. Some modifications are hidden from the programmer and may improve existing codes without any update; this includes the low-level modules (speculation, out-of-order execution, etc.) and the CPU's clock frequency. On the other hand, some new features and improvements must be used explicitly. In this category, the two dominant improvements are the vectorization units and multi-core parallelization. Thus, to achieve high performance on modern CPUs, it is indispensable to vectorize a code, that is, to take advantage of its single instruction on multiple data (SIMD) capacity. In fact, while the difference between a scalar code and its vectorized equivalent was "only" a factor of 2 in the year 2000 (SSE technology, double precision), the difference is now up to a factor of 4 on most CPUs (AVX) and up to 8 on the KNL (AVX-512). Some algorithms or compute-intensive kernels are straightforward to vectorize and can even be auto-vectorized by the compilers. However, data-processing algorithms are generally not in this category, because the available SIMD operations are not always well designed for this purpose.
The Intel Knights Landing (KNL) processor [3] did not follow the same path of improvement as previous CPUs: the clock frequency has been reduced, but the SIMD-vector size and the number of cores in the chip have been increased impressively (in addition to a new memory hierarchy). The KNL supports the AVX-512 instruction set [4]: the Intel AVX-512 foundational instructions (AVX-512F), the conflict detection instructions (AVX-512CD), the exponential and reciprocal instructions (AVX-512ER), and the prefetch instructions (AVX-512PF). AVX-512 allows working on SIMD-vectors of double the size compared to the previous AVX(2) set, and it comes with various new operations. Moreover, the next generation of Intel CPUs will support AVX-512 too. Consequently, the development of new algorithms targeting the Intel KNL should be beneficial to the future Intel Xeon Skylake CPU as well.
In the current paper, we look at different strategies to develop an efficient sorting algorithm on the Intel KNL using AVX-512. The contributions of this study are the following:
• studying sorting algorithms on the Intel KNL;
• proposing a new partitioning algorithm using AVX-512;
• defining a new Bitonic sort variant to sort tiny arrays using AVX-512;
• implementing a new Quicksort variant using AVX-512.
All in all, we show how we can obtain a fast and vectorized sorting algorithm. The rest of the paper is organized as follows. Section 2 provides the background related to vectorization and sorting. Then, we describe our proposal to sort tiny arrays in Section 3, and our partitioning and Quicksort variant in Section 4. Finally, in Section 5, we provide a performance study of our method against the GNU C++ standard sort (STL) implementation.

BACKGROUND
In this section, we provide the prerequisites for our sort implementation. We first introduce the Quicksort and the Bitonic sort. Then we describe the objectives of vectorization, and we finish with an overview of related work.

Quicksort Overview
We briefly describe the well-known Quicksort (QS) algorithm to make the document self-contained; readers comfortable with it can skip this section. QS was originally proposed in [5] and uses a divide-and-conquer strategy: it recursively partitions the input array until it ends with partitions of one value. The partitioning puts the values lower than a pivot at the beginning of the array and the values greater at the end, with a linear complexity. QS has a worst-case complexity of O(n^2) but an average complexity of O(n log n) in practice. The complexity is tied to the choice of the pivot when partitioning: it must be close to the median to ensure a low complexity. However, its simplicity of implementation and its speed in practice have made it a very popular sorting algorithm. Figure 1 shows the sort of an array using QS. We provide in Algorithm 1 a possible implementation of a scalar QS (here the term scalar refers to a single value, as opposed to a SIMD-vector). The functions described in the current study are available at https://gitlab.mpcdf.mpg.de/bbramas/avx-512-sort. This repository includes a clean header-only library (branch master) and a test file that generates the performance study of the current manuscript (branch paper). The code is under MIT license.
In this implementation, the choice of the pivot is done naively by selecting the value in the middle before partitioning. This type of selection can result in very unbalanced partitions, which is why more advanced heuristics have been proposed; for example, the pivot can be selected as the median of several values. The worst-case complexity of QS makes it no longer suitable as the standard C++ sort. In fact, an average complexity of O(n log n) was required until the year 2003 [6], but it is now a worst-case requirement [7] that a pure QS implementation cannot guarantee. Consequently, the current implementation is a 3-part hybrid sorting algorithm [8], i.e. it relies on 3 different algorithms: it uses introsort [9] (a hybrid of quicksort and heap sort) to a maximum depth of 2 × log2(n) to obtain small partitions that are then sorted using an insertion sort.
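The scalar QS described above can be sketched in a few lines of C++. This is a minimal illustration with the naive middle-element pivot, not the paper's exact Algorithm 1; the function names are ours.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Scalar Quicksort sketch: Hoare-style partitioning around the middle
// element, then recursion on both sides. Sorts a[lo..hi] inclusive.
inline void qs_sort(std::vector<int>& a, long lo, long hi) {
    if (lo >= hi) return;
    const int pivot = a[(lo + hi) / 2];  // naive pivot choice
    long i = lo, j = hi;
    while (i <= j) {
        while (a[i] < pivot) ++i;        // find a left value to move right
        while (a[j] > pivot) --j;        // find a right value to move left
        if (i <= j) std::swap(a[i++], a[j--]);
    }
    qs_sort(a, lo, j);                   // left partition
    qs_sort(a, i, hi);                   // right partition
}
```

A pivot far from the median degenerates the recursion depth, which is exactly the behavior that motivates the median heuristics and the introsort fallback discussed above.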

Bitonic Sorting Network
In computer science, a sorting network is an abstract description of how to sort the values of a fixed-size array: how the values are compared and exchanged. A sorting network can be represented graphically, with each input value as a horizontal line and each compare-and-exchange unit as a vertical connection between those lines. Various sorting networks exist in the literature, but we concentrate our description on the Bitonic sort from [10]. This network is easy to implement and has an algorithmic complexity of O(n log(n)^2). Moreover, it has shown good performance on parallel computers [11] and GPUs [12]. Figure 2a shows a Bitonic sorting network to process 16 values. A sorting network can be seen as a time-line where input values are transferred from left to right and exchanged if needed at each vertical bar. We show such an execution in Figure 2b. Two main strategies are possible when implementing a sorting network: either it can be implemented by hard-coding the connections between the lines (which can be seen as a direct mapping of the picture), or with a flexible algorithm that performs the same operations by using a formula/rule to decide when to exchange values. In Algorithm 2, we show a flexible implementation of a Bitonic sort which mimics the network presented in Figure 2a but without hard-coded exchanges.
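The flexible, rule-based strategy can be sketched as follows. This is the generic power-of-two Bitonic sort (direction chosen by an index rule), not the paper's exact Algorithm 2; the function name is ours.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Rule-based Bitonic sort for power-of-two sizes. The outer loop grows
// sorted runs of size k; the middle loop halves the compare-exchange
// distance j; the bit (i & k) decides the direction of each exchange.
inline void bitonic_sort(std::vector<int>& a) {
    const std::size_t n = a.size();  // n must be a power of two
    for (std::size_t k = 2; k <= n; k *= 2)
        for (std::size_t j = k / 2; j > 0; j /= 2)
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;  // compare-exchange pair
                if (partner > i) {
                    const bool ascending = ((i & k) == 0);
                    if ((ascending && a[i] > a[partner]) ||
                        (!ascending && a[i] < a[partner]))
                        std::swap(a[i], a[partner]);
                }
            }
}
```

Unrolling the three loops for a fixed n yields exactly the hard-coded network of Figure 2a.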

Vectorization
The term vectorization refers to a CPU's ability to apply a single operation/instruction to a vector of values instead of only a single one [13]. It is common to refer to this concept by the Flynn's taxonomy term SIMD, for single instruction on multiple data. By adding SIMD instructions/registers to CPUs, it has been possible to increase the peak performance of single cores despite the stagnation of the clock frequency since the mid-2000s. In the rest of the paper, we use the term SIMD-vector for the data type managed by the CPU; it has no relation to an expandable vector data structure such as std::vector. The size of a SIMD-vector is variable and depends on both the instruction set and the type of the SIMD-vector elements, and is given by the size of the registers in the chip. SIMD-vector extensions to the X86 instruction set, for example, are SSE [14], AVX [15], and AVX-512 [4], supporting SIMD-vectors of size 128, 256, and 512 bits, respectively. An AVX-512 SIMD-vector is able to store 8 double-precision floating-point values, for example.

Algorithm 2: Bitonic sort
  Input: array, an array to sort; length, the size of array (must be a power of 2).
  Output: array, sorted.
  function bitonic_sort(array, length)
      for s = 2; s <= length; s = s * 2 do
          for i = 0; i < length; i = i + s do
              bitonic_core(sub_array(array, i), s)

Figure 3 illustrates the difference between a scalar summation and a SIMD-vector summation. Throughout this document, we use the intrinsic function extension instead of assembly language to write vectorized code on top of the AVX-512 instruction set.
Intrinsics are small functions intended to be replaced by the compiler with a single assembly instruction. Using intrinsics allows us to use a high-level programming language (C++ in our case) while being able to tell the compiler to use particular instructions.

AVX-512
AVX-512 is a recent instruction set introduced with the Intel KNL CPU in 2016. It offers operations that have no equivalent in previous extensions of the X86 instruction set. For example, several new instructions use a mask (integer) to select values or to apply an operation conditionally on some values of a SIMD-vector. In the following, we describe some of the instructions that we use in our implementations.

B. Bramas
Load/set/store. As in previous instruction sets, AVX-512 has instructions to load a contiguous block of values from main memory into a SIMD-vector (load), to fill a SIMD-vector with a given value (set), and to move a SIMD-vector back into memory (store).
Store some. A new operation allows saving only some values from a SIMD-vector into memory (vpcompressd/vcompresspd); the selected values are saved contiguously. This is a major improvement, because without this instruction several operations are needed to obtain the same result. For example, to save some values from a SIMD-vector v at address p in memory, one possibility is to load the current values from p into a SIMD-vector v', permute the values in v to move the values to store to the beginning, merge v and v', and finally save the resulting vector. The pseudo-code in Figure 4 describes the result obtained with this store-some instruction.

Vector permutation. Permuting the values inside a vector has been possible since AVX/AVX2 using permutevar8x32 (instruction vperm(d,ps)). This instruction re-orders the values inside a SIMD-vector using a second integer vector that contains the permutation indexes. AVX-512 also provides this instruction for its own SIMD-vectors, and we summarize its behavior in Figure 5.

Comparison. In AVX-512, the value returned by a test/comparison (vpcmpd/vcmppd) is a mask (integer) and not a SIMD-vector of integers as it was in SSE/AVX. Therefore, it is easy to modify and work directly on the mask with the arithmetic and binary operations for scalar integers. The behavior of the comparison is shown in Figure 6, where the mask is filled with the bits of the comparison results. AVX-512 provides several instructions that use this type of mask, like the conditional selection for instance.
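The semantics of the store-some (compress) operation can be emulated in scalar C++ as below. This is only an illustration of the behavior, not the AVX-512 implementation; the function name is ours.

```cpp
#include <cassert>
#include <cstddef>

// Scalar emulation of an AVX-512 compress-store: values of src whose
// mask bit is set are written contiguously to dst; returns how many
// values were stored (the hardware instruction does this in one step).
inline std::size_t compress_store(const int* src, unsigned mask,
                                  std::size_t n, int* dst) {
    std::size_t out = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (mask & (1u << i))
            dst[out++] = src[i];
    return out;
}
```

Without the dedicated instruction, the load/permute/merge/store sequence described above is needed to achieve the same effect.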
Conditional selection. Among the mask-based instructions, the mask move (vmovdqa32/vmovapd) selects values between two vectors using a mask. The behavior is shown in Figure 7: a value is taken from the second vector where the mask is false and from the first vector otherwise. Achieving the same result in previous instruction sets required several operations rather than one dedicated instruction.
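The full mask workflow (comparison producing an integer mask, then a mask-driven selection) can be emulated in scalar C++ as follows. This is a behavioral sketch only; the helper names are ours.

```cpp
#include <cassert>
#include <cstddef>

// Emulates an AVX-512 less-than comparison: each lane that satisfies
// v[i] < pivot sets bit i of the returned integer mask.
inline unsigned cmp_lt_mask(const int* v, int pivot, std::size_t n) {
    unsigned mask = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (v[i] < pivot)
            mask |= (1u << i);
    return mask;
}

// Emulates the mask move: where the mask bit is set, the value is taken
// from the first vector a, otherwise from the second vector b.
inline void mask_blend(unsigned mask, const int* a, const int* b,
                       std::size_t n, int* out) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = (mask & (1u << i)) ? a[i] : b[i];
}
```

Because the mask is a plain integer, it can also be counted, shifted, or combined with scalar bitwise operations between the two steps.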

Vectorized Sorting Algorithms
The literature on sorting and vectorized sorting implementations is immense.
Therefore, we only cite here some of the studies that we consider most connected to our work.
The sorting technique from [16] tries to remove branches and improve the branch prediction of a scalar sort, and the authors show a speedup by a factor of 2 against the STL (the implementation of the STL at that time was different). This study illustrates an early strategy of adapting sorting algorithms to a given hardware, and shows the need for low-level optimizations due to the limited instructions available at that time.
In [17], the authors propose a parallel sorting on top of combosort vectorized with the VMX instruction set of the IBM architecture. Unaligned memory accesses are avoided, and the L2 cache is efficiently managed by using an out-of-core/blocking scheme. The authors show a speedup by a factor of around 3 against the GNU C++ STL.
In [18], the authors use a sorting network for small-sized arrays, similar to our approach. However, instead of dividing the main array into sorted partitions (partitions of increasing content) and applying a small efficient sort on each of those partitions, the authors do the opposite. Therefore, after multiple small sorts, the algorithm finishes with a complicated merge scheme using extra memory to globally sort all the partitions. A very similar approach was later proposed in [19].
The recent work in [20] targets AVX2. The authors use a Quicksort variant with a vectorized partitioning and an insertion sort once the partitions are small enough (as the STL does). The partition method relies on a look-up table mapping the comparison result of a SIMD-vector against the pivot to the move/permutation that must be applied to the vector. The authors demonstrate a speedup by a factor of 4 against the STL. However, this method is not suitable on the KNL because the look-up table would need 256 values (leading to locality problems). This issue, as well as the use of extra memory, can be solved with the new instructions of the KNL. As a side remark, the authors do not compare their proposal to the standard C++ partition function, even though it is the only part of their algorithm that is vectorized.

SORTING AVX-512 SIMD-VECTORS
In this section, we describe techniques to sort a small number of SIMD-vectors. As traditional sorting implementations usually rely on insertion sort to process small arrays, we propose here fully vectorized kernels that are later used in our final sorting implementation.

Naive Bubble Sort
Sorting one SIMD-vector
Using vectorization, we are able to work on all the values of a SIMD-vector without a loop statement. Therefore, while many basic sorting algorithms require two loops (an outer loop that reduces the number of values left to sort, and an inner loop that selects/moves the value to store at the end), vectorization allows us to use only one. In Algorithm 3, we show how we implement a vectorized bubble sort on a single SIMD-vector. The idea is to exchange neighboring values (even/odd, then odd/even) such that after a known number of iterations the full SIMD-vector is sorted. The number of iterations corresponds to the length of the SIMD-vector, VEC_SIZE, divided by two (resulting in a total of VEC_SIZE exchange stages). We provide in Appendix 1 the C++ implementation of this algorithm, which is straightforward using a combination of tests, permutations, and selections. Figure 8 shows a diagram of the algorithm. A sorting network equivalent to the diagram is obtained by simply unrolling the loop. In specific cases (when the data are likely to be already sorted), it might be beneficial to interrupt the loop early, if no values are exchanged, by testing the comparison masks.
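The exchange pattern can be sketched in scalar C++ on an array standing in for the lanes of one SIMD-vector. This is a behavioral sketch (the vectorized version performs each even/odd stage with one permute, one comparison, and one selection); the names are ours.

```cpp
#include <array>
#include <cassert>
#include <utility>

constexpr int VEC_SIZE = 8;  // lanes of an AVX-512 vector of 64-bit values

// Odd-even (bubble) exchanges on the lanes: VEC_SIZE/2 iterations, each
// doing an even/odd then an odd/even compare-exchange stage, i.e.
// VEC_SIZE stages in total, which is enough to sort any input.
inline void bubble_sort_lanes(std::array<long, VEC_SIZE>& v) {
    for (int it = 0; it < VEC_SIZE / 2; ++it) {
        for (int i = 0; i + 1 < VEC_SIZE; i += 2)   // even/odd stage
            if (v[i] > v[i + 1]) std::swap(v[i], v[i + 1]);
        for (int i = 1; i + 1 < VEC_SIZE; i += 2)   // odd/even stage
            if (v[i] > v[i + 1]) std::swap(v[i], v[i + 1]);
    }
}
```

Unrolling the outer loop gives the sorting network of Figure 8.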

Sorting several SIMD-vectors
The same strategy can be used for more than one vector by propagating the values along all the vectors. We provide in Algorithm 4 the bubble sort for two SIMD-vectors, where we exchange the border values between both vectors. The diagram of this algorithm is shown in Figure 9. This approach cannot be extended indefinitely, because the number of SIMD registers is limited and because it has a quadratic complexity, like the original bubble sort. However, for a small number of values, this method uses fewer instructions than its scalar equivalent, and it can sort several vectors stored in registers without accessing memory.

Sorting several SIMD-vectors with Bitonic
Algorithm 7 sorts two SIMD-vectors using the Bitonic comparison order. As shown in Figure 2a, we first have to apply the Bitonic sort to each vector individually before exchanging values between both vectors. The same principle can be applied to more vectors, using the fact that to sort V vectors we re-use the function that sorts V/2 vectors, and so on.
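The exchange-then-merge step can be sketched in scalar C++ on an array whose two halves stand in for two already-sorted SIMD-vectors. This is an illustration of the Bitonic comparison order, not the paper's exact Algorithm 7; the names are ours.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Sorts a bitonic sequence in v[lo, lo+len) by recursive half-cleaning.
template <std::size_t N>
void bitonic_merge(std::array<int, N>& v, std::size_t lo, std::size_t len) {
    if (len < 2) return;
    const std::size_t half = len / 2;
    for (std::size_t i = lo; i < lo + half; ++i)
        if (v[i] > v[i + half]) std::swap(v[i], v[i + half]);
    bitonic_merge(v, lo, half);
    bitonic_merge(v, lo + half, half);
}

// Assumes both halves of v are already sorted ascending. Comparing the
// first half against the reversed second half puts the minima in the
// first half and the maxima in the second; a bitonic merge of each half
// then yields one fully sorted sequence.
template <std::size_t N>
void sort_two_halves(std::array<int, N>& v) {
    const std::size_t h = N / 2;
    for (std::size_t i = 0; i < h; ++i)
        if (v[i] > v[N - 1 - i]) std::swap(v[i], v[N - 1 - i]);
    bitonic_merge(v, 0, h);
    bitonic_merge(v, h, h);
}
```

Doubling this scheme (sort V/2 vectors, then exchange and merge) extends it to V vectors, as described above.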

Sorting Small Arrays
We describe in Sections 3.1 and 3.2 how we can sort several SIMD-vectors. However, we intend to sort arrays whose sizes are not necessarily multiples of the SIMD-vector length. Therefore, we provide a function that loads an array into SIMD-vectors, pads the last vector with extra values, and calls the appropriate vectorized sort. A possible implementation is shown in Algorithm 7, where we rely on a switch statement to select the existing function that matches the size of the array to sort.
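The padding idea can be sketched as follows. Here std::sort stands in for the vectorized kernel, and the buffer size and names are ours; the point is only that padding with the type's maximum value makes the extra entries sink to the end, so copying back the first n values discards them.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>

constexpr std::size_t PAD_VEC = 8;  // stand-in for one SIMD-vector length

// Sorts data[0, n) with n <= PAD_VEC by padding to a full "vector".
inline void sort_padded(int* data, std::size_t n) {
    std::array<int, PAD_VEC> buf;
    buf.fill(std::numeric_limits<int>::max());     // padding values
    std::copy(data, data + n, buf.begin());
    std::sort(buf.begin(), buf.end());             // stand-in for the kernel
    std::copy(buf.begin(), buf.begin() + n, data); // padding stays behind
}
```

This is why the cost of the small sort depends on the number of SIMD-vectors rather than the exact number of values, as observed later in the performance study.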

Partitioning with AVX-512
The store-some instruction from AVX-512 (described in Section 2.2.1) is a key operation of our partitioning function: it allows us to store the values from a SIMD-vector lower/greater than a pivot directly in memory. The main idea of our algorithm is to load values from the array and directly store the results using two iterators, one on the left for the values lower than the pivot and one on the right for the values greater than it. To avoid overwriting values that have not yet been processed, our algorithm starts by loading two SIMD-vectors (one from each of the array's extremities). Consequently, our implementation works in-place and only needs three SIMD-vectors.
We show our method in Algorithm 8. This description also includes, as side comments, some possible optimizations for the case where the array is likely to be already partitioned (A), or to reduce the data displacement of the values (B). We provide the AVX-512 implementation of this function in Appendix 3.
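The blocked in-place scheme can be sketched in scalar C++, with small buffers standing in for SIMD-vectors and a loop standing in for the compress-store. This mirrors the idea (buffer one block from each extremity first, then always read from the side with less free room so the stores never overwrite unread data), but it is our illustrative sketch, not the paper's Algorithm 8; VS and the names are ours.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Partitions a around pivot in-place; returns the boundary index.
inline std::size_t block_partition(std::vector<int>& a, int pivot) {
    const std::size_t VS = 4;  // stand-in for VEC_SIZE
    const std::size_t n = a.size();
    if (n < 2 * VS) {          // too small for the blocked scheme
        auto it = std::partition(a.begin(), a.end(),
                                 [&](int x) { return x <= pivot; });
        return static_cast<std::size_t>(it - a.begin());
    }
    // buffer one block from each extremity to make room for the stores
    std::vector<int> first_block(a.begin(), a.begin() + VS);
    std::vector<int> last_block(a.end() - VS, a.end());
    std::size_t left = VS, right = n - VS;  // unread range [left, right)
    std::size_t left_w = 0, right_w = n;    // write cursors
    auto flush = [&](const std::vector<int>& blk) {
        for (int x : blk)                    // compress-store, low side
            if (x <= pivot) a[left_w++] = x;
        for (int x : blk)                    // compress-store, high side
            if (x > pivot) a[--right_w] = x;
    };
    while (left < right) {
        const std::size_t take = std::min(VS, right - left);
        std::vector<int> blk;
        // read from the side with the smaller free room, so that after
        // the read both sides have enough space for the stores
        if (left - left_w <= right_w - right) {
            blk.assign(a.begin() + left, a.begin() + left + take);
            left += take;
        } else {
            blk.assign(a.begin() + right - take, a.begin() + right);
            right -= take;
        }
        flush(blk);
    }
    flush(first_block);  // finally, flush the two buffered blocks
    flush(last_block);
    return left_w;
}
```

In the vectorized version, each flush is a comparison producing a mask followed by two store-some instructions, which is why only three SIMD-vectors are needed.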

Quicksort Variant
In Section 4.1 we introduced a vectorized partitioning algorithm, and in Section 3.3 we described a vectorized sort function for small arrays. As a result, we transform the original QS (Algorithm 1) into Algorithm 9, where we replace the scalar partitioning by our SIMD-partitioning method and sort the small partitions using our Bitonic-based sort. The worst-case complexity of this SIMD-QS is O(n^2), like the sequential QS, and is tied to the choice of the pivot.
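The 2-part hybrid structure can be sketched as follows, with std::partition and std::sort standing in for the SIMD-partitioning and the Bitonic-based small sort; the threshold, the three-way split, and the names are ours, not the paper's exact Algorithm 9.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t SMALL_LIMIT = 16;  // stand-in for 16 SIMD-vector lengths

// Hybrid Quicksort on the range [lo, hi): recursive partitioning down to
// SMALL_LIMIT, then a small-array kernel finishes each partition.
inline void hybrid_qs(std::vector<int>& a, std::size_t lo, std::size_t hi) {
    if (hi - lo <= SMALL_LIMIT) {
        std::sort(a.begin() + lo, a.begin() + hi);  // small-array kernel
        return;
    }
    const int x = a[lo], y = a[(lo + hi) / 2], z = a[hi - 1];
    const int pivot =                                // median of 3
        std::max(std::min(x, y), std::min(std::max(x, y), z));
    // three-way split: the pivot-equal band is nonempty, which
    // guarantees progress even with many repeated values
    auto mid1 = std::partition(a.begin() + lo, a.begin() + hi,
                               [&](int v) { return v < pivot; });
    auto mid2 = std::partition(mid1, a.begin() + hi,
                               [&](int v) { return v == pivot; });
    hybrid_qs(a, lo, static_cast<std::size_t>(mid1 - a.begin()));
    hybrid_qs(a, static_cast<std::size_t>(mid2 - a.begin()), hi);
}
```

Unlike the STL's introsort, this sketch keeps the QS worst case of O(n^2): a consistently bad pivot degenerates the recursion, which matches the complexity statement above.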

PERFORMANCE STUDY

Configuration
We assess our method on an Intel Xeon Phi 7210 (KNL) at 1.30 GHz. The cluster memory mode has been set to Quadrant/Flat, and we bind the process and the allocation with numactl --physcpubind=0 --membind=1. We use GCC 6.2.0, but we provide the results of the same tests with Intel compiler 17.0.1 in Appendix A.1 (both compilers deliver the same performance). The test file used for the presented benchmark is available online; it includes the different sorts presented in this study plus some additional strategies and tests. Our SIMD-QS uses a median-of-3 pivot selection (similar to the STL sort function).

Fixed Size
In Figure 10, we compare the bubble-sort and bitonic-based kernels when sorting a few SIMD-vectors. For the same number of values, the AVX-512-bitonic kernel appears slightly faster at sorting integer values than floating-point ones. This is not surprising, since integers are smaller and comparing integers is faster than comparing floating-point numbers. For the rest of the study, we leave the bubble sort aside because it is slower than the bitonic-based sort.
In Figure 11, we compare the time to sort up to 16 SIMD-vectors. The plot shows that the AVX-512-bitonic sort maintains its speedup against the STL. However, our implementation uses hard-coded permutation indices, so we cannot extend the size of the arrays to sort indefinitely. Moreover, the arrays we sort in this figure have sizes that are multiples of the SIMD-vector length, whereas we intend to sort arrays of any size. Figure 12 shows the execution time to sort arrays of size 1 to 16 × VEC_SIZE, including sizes that are not multiples of the SIMD-vector length. We remind the reader that we use the method from Section 3.3 to pad the last SIMD-vector with extra values. We see that AVX-512-bitonic still delivers a good speedup against the STL. We can observe that the execution time increases every VEC_SIZE values, because the cost of sorting is tied not to the number of values but to the number of SIMD-vectors to sort. The execution time is obtained as the average of 10^4 sorts for each size. The speedup of the AVX-512-bitonic against the STL is shown above the AVX-512-bitonic line.

Figure 13 shows the execution time to partition using std::partition, our AVX-512-partition, and a basic partition implementation. Our AVX-512-partition is faster for any size, whereas the two other methods appear quite similar in terms of performance. This speedup was not easy to predict: while the vectorized implementation executes fewer instructions than its scalar equivalent, the partition operation is memory-bound.

AVX-512-QS Sort
The execution time of our SIMD-QS against the STL is shown in Figure 14. The speedup of our implementation is still significant, but not as high as for the partitioning.
In addition, the difference decreases as the size of the array increases, showing the gain the STL obtains from its 3-part sort with a complexity guarantee. However, our implementation still provides a significant speedup, even when sorting more than 8 GB of values.

CONCLUSIONS
In this paper, we have described new sorting strategies using the AVX-512 instruction set on the Intel KNL. The AVX-512 instruction set provides unique SIMD operations, helpful to vectorize and to adapt algorithms more naturally than with SSE or AVX(2). We have introduced two approaches to sort small arrays: one based on the bubble sort and another on the Bitonic sorting network. Both techniques are fully vectorized and provide a significant speedup against the STL, even for sorting arrays whose sizes are not multiples of the SIMD-vector length. We have introduced an AVX-512 partitioning function, which is fully vectorized in its steady state and partitions in-place. Finally, our main sorting algorithm is a 2-part sort which partitions the input array until the partitions are small enough (less than 16 SIMD-vector lengths) to be sorted by the Bitonic-based function. We have studied the performance of each layer at different granularities on an Intel KNL. Our implementations show superior performance against the STL for partitioning and for sorting arrays of any size. In the future, we intend to design a parallel implementation of our AVX-512-QS to use the 64 cores that the KNL provides, and we expect the recursive partitioning to be naturally parallelized with a task-based scheme on top of OpenMP. Additionally, we will evaluate our implementations on the next Intel Xeon Skylake CPU.