Efficient GPU Implementation of Multiple-Precision Addition based on Residue Arithmetic

In this work, the residue number system (RNS) is applied to the efficient addition of multiple-precision integers on graphics processing units (GPUs) that support the Compute Unified Device Architecture (CUDA) platform. The RNS allows calculations with the digits of a multiple-precision number to be performed in an element-wise fashion, without the overhead of communication between them, which is especially useful for massively parallel architectures such as the GPU architecture. The paper discusses two multiple-precision integer addition algorithms. The first algorithm relies on if-else statements to test the signs of the operands. In turn, the second algorithm uses radix-complement RNS arithmetic to handle negative numbers. While the first algorithm is more straightforward, the second one avoids branch divergence among threads that concurrently compute different elements of a multiple-precision array. As a result, the second algorithm shows significantly better performance than the first. Both algorithms running on an NVIDIA RTX 2080 Ti GPU are faster than the multi-core GNU MP implementation running on an Intel Xeon 4100 processor.

Keywords—Multiple-precision algorithm; integer arithmetic; residue number system; GPU; CUDA


I. INTRODUCTION
Multiple-precision integer arithmetic, which provides operations with numbers that consist of more than 32 or 64 bits, is an important and often indispensable method for solving scientific and engineering problems that are difficult to solve using the standard numerical precision. The most notable application of multiple-precision integer arithmetic is cryptography, where the level of security depends on the length of the keys [1], [2]. Multiple precision is also required in computer algebra (symbolic computation) systems, which operate with mathematical expressions instead of fixed-precision integer and floating-point numbers [3]. The intermediate data produced during a computation may be very large, and multiple-precision arithmetic is required to prevent overflow. Another problem requiring computations with very large integers, and of practical interest in polymer physics, is counting Hamiltonian cycles on two- and three-dimensional lattices, triangular grid graphs, and other structures [4]. Multiple-precision arithmetic is becoming increasingly in demand as the scale of computations grows.
There are several approaches to implementing multiple-precision arithmetic. One of them is to use software libraries that emulate operations with large numbers using standard fixed-precision operations. Some of the well-known libraries for central processors (CPUs) include the GNU MP Bignum Library (GMP) [5], the Library for doing Number Theory (NTL) [6], and the Fast Library for Number Theory (FLINT) [7]. There are also works devoted to the implementation of integer arithmetic operations with arbitrary/multiple precision on GPUs [8], [9], [10], [11], [12].
A higher level of arithmetic precision is also supported in a number of programming languages, e.g., Python (the built-in int type), Ruby (the built-in Bignum type), Perl (Math::BigInt), Java (the BigInteger class), Haskell (the Integer datatype), and C# (BigInteger). Another active approach is to develop hardware accelerators that support integer and floating-point computations with multiple precision [13], [14], [15].
Previous research in [8], [9], [13], and [15] uses the traditional way of representing multiple-precision numbers, according to which a number is represented as an array of weighted digits in some base, and the digits themselves are machine-precision numbers [16]. The need for carry propagation under this number representation is one of the main bottlenecks of efficient multiple-precision algorithms. This paper deals with another type of multiple-precision arithmetic, which is based on the residue number system (RNS) [17], [18]. In the RNS, a number is represented by its residues relative to a set of moduli. The moduli are mutually independent, so multiple-precision integer operations such as addition, subtraction, and multiplication are replaced by groups of reduced-precision operations with residues, performed in an element-wise fashion and without the overhead of manipulating carry information between the residues.
Recently, a new software library has been developed for efficient residue number system computations on CPU and GPU architectures. The library is called GRNS and is freely available for download at https://github.com/kisupov/grns. GRNS is designed for arbitrary moduli sets with large dynamic ranges that far exceed the usual word length of computers, up to several thousand bits. In addition to a number of optimized non-modular RNS operations such as magnitude comparison and division, GRNS implements multiple-precision integer arithmetic. This paper considers two multiple-precision addition algorithms implemented in GRNS. Along with multiplication, addition and subtraction are key operations for many computational algorithms, e.g., the fast Fourier transform. Multiple-precision addition is usually considered faster and easier than multiplication. However, in the case of the RNS, signed addition is more difficult than multiplication, as it requires determining the sign of the result, which is a time-consuming operation in the RNS.
Both of our multiple-precision addition algorithms use an interval floating-point evaluation technique for efficient RNS sign determination [19]. However, the first algorithm relies on if-else statements to test the signs of the operands, while the second one uses the radix complement RNS notation for negative numbers. It is shown that the second algorithm is better suited for implementation on massively parallel GPU architectures than the first algorithm.
The rest of this paper is organized as follows. Section II provides the background on RNS arithmetic. Section III describes the RNS-based format of multiple-precision integer numbers. Multiple-precision addition algorithms are presented in Section IV. Performance comparison results are given in Section V, and Section VI concludes the paper.

II. BACKGROUND ON RNS ARITHMETIC
An RNS is specified by a set of n pairwise prime moduli {m_0, m_1, ..., m_{n−1}}. The dynamic range of the RNS is M = m_0 · m_1 · · · m_{n−1}. The mapping of an integer X into the RNS is defined to be the n-tuple (x_0, x_1, ..., x_{n−1}), where x_i = |X|_{m_i} is the smallest non-negative remainder when X is divided by m_i, that is, x_i = X mod m_i. Within the RNS there is a unique representation of all integers in the range from 0 to M − 1. Namely, the Chinese Remainder Theorem (CRT) states that [18]

X = | Σ_{i=0}^{n−1} M_i · |x_i · M_i^{−1}|_{m_i} |_M,    (1)

where M_i = M/m_i and M_i^{−1} is the multiplicative inverse of M_i modulo m_i. Since the RNS moduli are independent of each other, arithmetic operations such as addition, subtraction, and multiplication can be computed efficiently. If X, Y, and Z have RNS representations given by (x_0, x_1, ..., x_{n−1}), (y_0, y_1, ..., y_{n−1}), and (z_0, z_1, ..., z_{n−1}), then, denoting by • any of the operations +, −, or ×, the RNS version of Z = X • Y satisfies

z_i = |x_i • y_i|_{m_i},  i = 0, 1, ..., n − 1,    (2)

provided that Z ∈ [0, M − 1]. Thus the ith RNS digit, namely z_i, is defined in terms of |x_i • y_i|_{m_i} only. That is, no carry information need be communicated between residue digits, and the overhead of manipulating carry information in more traditional, weighted number systems can be avoided [20].
The disadvantage of the RNS is the high complexity of estimating the magnitude of a number, which is required to perform number comparison, sign calculation, overflow checking, division, and some other operations. The classic technique for performing these operations is based on the CRT formula (1) and consists of computing the binary representations of numbers and subsequently analyzing them. However, for large dynamic ranges (e.g., a few thousand bits) this technique becomes slow. Other methods for evaluating the magnitude of residue numbers are based on the mixed-radix conversion (MRC) process [21]. However, these methods are also often ineffective, since they require many arithmetic operations with residues or the use of unacceptably large lookup tables.
An alternative method for implementing time-consuming operations in the RNS is based on computing a floating-point interval evaluation of the fractional representation of an RNS number [19]. This method is designed to be fast on modern general-purpose computing platforms that support efficient finite-precision floating-point arithmetic operations, such as IEEE 754 operations. For a given RNS number X, the interval evaluation I(X/M) = [X/M, X/M] consists of two finite-precision floating-point bounds (the lower and upper bounds, respectively) that enclose the exact value of X/M. Thus, I(X/M) provides information about the range of the fractional representation (also called the relative value) of an RNS number. This information may not be sufficient to restore the binary representation, but it can be efficiently used to perform other difficult operations in the RNS, e.g., magnitude comparison, sign detection, and division.
The most important benefit of this method is that computation of I(X/M) requires only standard arithmetic operations, and no residue-to-binary conversion is required. For a given RNS representation (x_0, x_1, ..., x_{n−1}), the calculation of the bounds of I(X/M) is performed on average in linear and logarithmic time for the sequential and parallel cases, respectively. Furthermore, arithmetic operations are defined directly on interval evaluations (interval formulas (3)): the bounds of the result are computed bound by bound, with lower bounds obtained using floating-point operations rounded downwards and upper bounds using floating-point operations rounded upwards. In these interval formulas, the following notation is used:
• +▽, −▽, ·▽ and ÷▽ stand for the floating-point operations of addition, subtraction, multiplication, and division, performed with rounding downwards;
• +△, −△, ·△ and ÷△ stand for the floating-point operations of addition, subtraction, multiplication, and division, performed with rounding upwards;
• V is the greatest floating-point number that is less than or equal to 1/M;
• W is the least floating-point number that is greater than or equal to 1/M.
Interval formulas (3) are useful in that they do not limit the possible values of the result interval to the range [0, 1). This allows for easy overflow detection and sign identification despite the cyclical (modulo M) nature of RNS arithmetic.

Using interval evaluations, new algorithms have been proposed in [19] to efficiently implement several difficult RNS operations, such as number comparison and general division.

III. NUMBER REPRESENTATION
The format for multiple-precision integers is shown in Fig. 1. A multiple-precision integer x consists of a sign s, a significand X composed of n significand digits (x_0 to x_{n−1}), and an interval floating-point evaluation of the significand I(X/M) = [X/M, X/M]. The sign is interpreted in the same way as in two's complement representation: the sign is equal to zero when x is positive and one when it is negative. The significand expresses the absolute value of x and is represented in the RNS with the moduli set {m_0, m_1, ..., m_{n−1}}. The significand digits (residues) are represented as ordinary two's complement integers. The size of the moduli set, n, specifies the number of digits in the significand. If the product of all RNS moduli is M, then the precision of x is equal to log_2 M bits. Thus, changing the size of the moduli set allows one to achieve arbitrary precision.
The following notation is used to denote a multiple-precision integer in the described number format:

x = ⟨s, X, I(X/M)⟩ = ⟨s, (x_0, x_1, ..., x_{n−1}), [X/M, X/M]⟩.    (4)

The value of a multiple-precision integer of the form (4) can be computed using the CRT formula:

x = (−1)^s · | Σ_{i=0}^{n−1} M_i · |x_i · M_i^{−1}|_{m_i} |_M.    (5)

The interval evaluation is included in the number representation as additional information; that is, X/M and X/M are stored in system memory along with the other fields of the multiple-precision integer. This provides efficient comparison, sign computation, and overflow detection, allowing one to calculate I(X/M) in O(1) time using formulas (3). Recently, this approach has been successfully used in the context of multiple-precision floating-point arithmetic based on the residue number system [22].
In order to be able to use interval evaluations for virtually any (arbitrarily large) value of M without worrying about underflow, X/M and X/M are represented as binary floating-point numbers with an extended exponent range, that is, they have the form f · 2^i, where f is a regular (IEEE 754) floating-point number and i is a two's complement integer. This extended-range representation is not intended to improve the level of numerical precision or accuracy, but it does ensure that there is no overflow or underflow when dealing with extremely large or small values.

IV. MULTIPLE-PRECISION INTEGER ADDITION
In this section, two algorithms for signed multiple-precision integer addition are presented. A naive implementation is presented first and then an improved one.
Step-by-step examples are also provided for both implementations.
That is, the lower bound of I(X/M ) is denoted by B[X, 0], while the upper one is denoted by B[X, 1]. Using this notation, 1]]. This notation is useful in that it allows one to dynamically specify the bound to be accessed. This notation is used in the rest of the present paper. if we want to obtain its valid representation in the RNS. Otherwise, the result will be reduced modulo M , and this event is classified as an integer overflow. The GRNS library implements efficient overflow detection using the floating-point interval evaluations, but that is beyond the scope of this paper. Use mixed-radix conversion to compare the magnitude of X and Y . If X ≥ Y , subtract Y from X and take s z ← s x ; otherwise, subtract X from Y and take s z ← s y ; In any case, I(Z/M ) should be recalculated. C. Algorithm 1 (Naive Implementation) 1) Description: Algorithm 1 takes two multiple-precision integers x and y represented as x = s x , X, I(X/M ) and y = s y , X, I(X/M ) , and outputs the sum z = x + y represented as z = s z , Z, I(Z/M ) . This algorithm analyzes the signs of the numbers, and if they are the same, then RNS addition of the significands is performed; otherwise, RNS subtraction is performed. The sign of the result is computed by comparing the magnitude of X = (x 0 , x 1 , . . . , x n−1 ) and Y = (y 0 , y 1 , . . . , y n−1 ) using the floating-point interval evaluations.  Table I.   TABLE I. EXAMPLE OF ALGORITHM 1 Step no. 3) Drawback: The main disadvantage of Algorithm 1 is that checking the signs of the operands via conditionals (ifelse statements) results in branch divergence among threads that concurrently compute different elements of a multipleprecision array. This may be normal for modern multi-core processors with good branch prediction accuracy, but this is a problem for SIMT (single instruction, multiple threads) architectures such as GPUs, where many threads run in lockstep.

B. Note on Overflow Detection
For example, a CUDA-compliant GPU consists of an array of streaming multiprocessors (SMs), each of which contains multiple streaming processors. Although each SM can run one or more different instructions, conditionals can greatly decrease performance inside an SM, as each branch of each conditional must be evaluated. Long code paths in a conditional can cause a 2-fold slowdown for each conditional within a warp (a group of 32 threads) and a 2 N slowdown for N nested conditionals. A maximum 32-time slowdown can occur when each thread in a warp executes a separate condition [23].
This bottleneck is illustrated in Fig. 2, which contains a flowchart of Algorithm 1. In the figure, 14 threads concurrently compute 14 multiple-precision additions on a system that follows the SIMT execution model. The right side of the figure shows threads running at once. D. Algorithm 2 (Improved Implementation) 1) Description: Algorithm 2 shows how to avoid conditional expressions when adding multiple-precision signed integers. This algorithm is a simplified version of the multipleprecision RNS-based floating-point addition algorithm that was originally proposed in [22]. The main idea is to use the radixcomplement representation of a negative number in the RNS. Recall that the precomputed constant V used in this algorithm is the greatest finite precision floating-point number that is less than or equal to 1/M . Use mixed-radix conversion to compare X and Y : • If X > Y , then assign s z ← s x .
• If X < Y , then assign s z ← s y .
• If X = Y , then assign s z ← 0.   We note that the if statement at step 14 of Algorithm 2 does not cause branch divergence, since there is no the corresponding else statement here. Table II, Algorithm 2 is used to compute  the sum of the numbers from the previous example.   TABLE II. EXAMPLE OF ALGORITHM 2 Step no. Calculations

V. PERFORMANCE COMPARISON RESULTS
This section gives comparative results for the presented multiple-precision integer addition algorithms. In the experiments, we used a GeForce RTX 2080 Ti graphics card that has 11 GB of GDDR6 memory, 4352 CUDA cores, and Compute Capability 7.5. This GPU was installed in a machine with an Intel Xeon 4100/8.25M S2066 OEM processor running Ubuntu 18.04.5 LTS; CUDA 10.2 and NVIDIA driver 450.51.06 were used. The source code was compiled using the nvcc compiler with the -O3 and -Xcompiler=-fopenmp options.

A. Methodology
The parameters of the experiments are shown in Table III. Each dataset was composed of two multiple-precision integer arrays of the same length, and the performance was evaluated for element-by-element addition of the arrays. The performance was measured as the number of multiple-precision arithmetic operations (additions) per second. For comparison purposes, the performance of the GNU MP library was also measured on 4 CPU cores. In the experiments, we considered only the computation time, so the measurements include neither the data transfer time nor the time of converting data into internal multiple-precision representations. For each precision p, a corresponding set of RNS moduli was generated such that

2^p ≤ M,    (8)

where M is the product of all the moduli in the set. Table IV shows the relationship between the precisions and the moduli sets used in the experiments. The moduli sets were generated using Algorithm 3. This algorithm takes as input the smallest odd modulus m_0 and the size of the desired moduli set n, and produces an increasing sequence of n − 1 consecutive odd integers m_1, m_2, ..., m_{n−1} that are coprime to each other and also coprime to m_0. The value of m_0 is selected by trial and error until condition (8) is satisfied. The tool used for generating the moduli sets is freely available at https://github.com/kisupov/rns-moduli-generator.
For the CUDA implementations of the presented multiple-precision addition algorithms, 32 threads per thread block were used, and the total number of blocks was calculated from N, nThreads, and a coefficient K, where N is the size of the dataset (1,000,000) and nThreads = 32. The GNU MP implementation was parallelized using OpenMP.

B. Results
In the first experiment, the input arrays were filled with pseudo-random non-negative integers ranging from 0 to (M − 1)/2, where M is the product of all the RNS moduli. The performance results are shown in Fig. 4.
In the second experiment, the input arrays were filled with pseudo-random non-positive integers ranging from (1 − M )/2 to 0. The results are reported in Fig. 5.
Finally, in the third experiment, the input arrays were filled with pseudo-random positive and negative integers ranging from (1 − M)/2 to (M − 1)/2. Fig. 6 demonstrates the performance results obtained in this setting.

C. Discussion
For Dataset-1 (Fig. 4), Algorithm 1 has nearly the same performance as Algorithm 2. This is because in Algorithm 1, all parallel threads follow steps 2–7 and there are no divergent execution paths. The results show that the developed CUDA functions are up to 65× faster than the parallel CPU implementation using GNU MP.
In the case of Dataset-2 (Fig. 5), the performance of Algorithm 1 remains the same as for Dataset-1, since there is still no branch divergence (all parallel threads follow steps 2–7). In turn, the need to restore negative results reduces the performance of Algorithm 2 by an average of 1.1× compared to Dataset-1; this performance degradation does not seem significant.
When using Dataset-3 (Fig. 6), branch divergence leads to an average 1.9-fold decrease in the performance of Algorithm 1 compared to Dataset-1 and Dataset-2. With 512-bit precision, the performance of Algorithm 1 is reduced by almost 3× compared to Dataset-1. In turn, the performance of Algorithm 2 is reduced by at most a factor of 1.2 compared to Dataset-1. The net result is that when the operands have different signs, Algorithm 2 outperforms Algorithm 1 by up to 3×.
A limitation of the proposed CUDA implementations is that the execution time grows linearly with the precision. This happens for the following reasons:
1) Each multiple-precision addition is performed by a single thread, that is, the digits of multiple-precision numbers are calculated sequentially.
2) As the precision increases, the stride between elements in the input arrays increases accordingly, and the effective GPU memory bandwidth decreases.
It should be noted that it is possible to compute all the digits (residues) of multiple-precision significands in parallel across different RNS moduli without worrying about carry propagation. This parallel arithmetic property of the RNS is employed in [22] to implement GPU-accelerated multiple-precision linear algebra kernels. Furthermore, we note that if all the digits of a multiple-precision number are computed in parallel, then the structure-of-arrays (SoA) layout with a sequential addressing scheme will provide coalesced access to the global GPU memory. Implementing digit-parallel multiple-precision integer addition is a direction for future work.

VI. CONCLUSION
In this paper, we have considered two multiple-precision integer addition algorithms for graphics processing units. The algorithms are based on the representation of large integers in the residue number system.
The first algorithm uses conditional operators to check the signs of the operands. However, in this case, threads that concurrently compute different elements of a multiple-precision array take divergent execution paths, which leads to an increase in the total computation time. To overcome this disadvantage, the second algorithm uses the radix-complement representation of a negative number in the RNS.
Experiments have shown that when the signs of the operands are different, the second algorithm outperforms the first one by far. In turn, both algorithms running on an NVIDIA RTX 2080 Ti GPU have been shown to be faster than the multi-core GNU MP implementation on an Intel Xeon 4100 processor.
The presented implementation is part of GRNS, a library for efficient computations in the residue number system using CUDA-enabled GPUs. In the future, we plan to implement digit-parallel versions of the multiple-precision integer operations to take full advantage of the internal RNS parallelism. Furthermore, we will focus on extending the GRNS functionality and implementing real-world multiple-precision applications using this library.