Fermat Factorization using a Multi-Core System

Factoring a composite odd integer into its prime factors is one of the security problems for some public-key cryptosystems such as the Rivest-Shamir-Adleman cryptosystem. Many strategies have been proposed to solve the factorization problem in a fast running time. However, the main drawback of the algorithms used in such strategies is the high computational time needed to find the prime factors. Therefore, in this study, we focus on one of the factorization algorithms that is used when the two prime factors are of the same size, namely, the Fermat factorization (FF) algorithm. We investigate the performance of the FF method using three parameters: (1) the number of bits of the composite odd integer, (2) the size of the difference between the two prime factors, and (3) the number of threads used. The results of our experiments, in which we used different parameter values, indicate that the running time of the parallel FF algorithm is faster than that of the sequential FF algorithm. The maximum speedup achieved by the parallel FF algorithm is 6.7 times that of the sequential FF algorithm using 12 cores. Moreover, the parallel FF algorithm has near-linear scalability.

Keywords—Integer factorization; Fermat factorization; parallel algorithm; multi-core


I. INTRODUCTION
The extensive use of digital systems has led to an increased need for information security. The main tool used to ensure the security of information is cryptography. In order to provide information security services, a set of cryptographic strategies is needed to convert plaintext into ciphertext. A set of such strategies is known as a cryptosystem. There are two main types of modern cryptosystems: (1) public-key (asymmetric) cryptosystems such as the ElGamal digital signature scheme, the Rivest-Shamir-Adleman (RSA) cryptosystem, the Diffie-Hellman scheme, and the digital signature algorithm [1], and (2) private-key (symmetric) cryptosystems such as the advanced encryption standard algorithm [2].
The RSA cryptosystem is one of the important cryptosystems whose security is based on the integer factorization problem, which is defined as follows: Given a positive integer n, the aim of factorization is to find two positive integers (also known as factors) p1 and p2 such that n equals the product of p1 and p2, with p1, p2 > 1. In this case, n is called a composite integer. On the other hand, if n cannot be factored, then n is called a prime number. Thus, we can represent any positive integer as a unique product of prime factors.
In the RSA cryptosystem, the key is constructed by selecting two prime numbers p1 and p2 such that each of them is large and they are of approximately equal size. The modulus for the key is defined as n = p1 p2. Then an encryption exponent e is chosen that is relatively prime to φ(n) = (p1 − 1)(p2 − 1). Finally, the decryption exponent d is defined as d ≡ e^(−1) (mod φ(n)).
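The key construction above can be sketched in a few lines of Python. This is a toy illustration only: the primes and message below are hypothetical small values (real RSA keys use primes of hundreds of bits), and `pow(e, -1, phi)` requires Python 3.8+.

```python
# Toy RSA key construction following the description above.
p1, p2 = 4079, 2069          # two illustrative primes of the same size (12 bits)
n = p1 * p2                  # modulus n = p1 * p2
phi = (p1 - 1) * (p2 - 1)    # Euler's totient phi(n) = (p1 - 1)(p2 - 1)
e = 65537                    # encryption exponent, relatively prime to phi(n)
d = pow(e, -1, phi)          # decryption exponent d = e^(-1) mod phi(n)

msg = 1234567                # a hypothetical plaintext, msg < n
cipher = pow(msg, e, n)      # encryption: c = m^e mod n
plain = pow(cipher, d, n)    # decryption: m = c^d mod n
assert plain == msg
```

The roundtrip works because d inverts e modulo φ(n), so m^(ed) ≡ m (mod n).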
The main challenge of factorization is the amount of time that is consumed to arrive at a solution, especially when the size of the prime factors is large. Also, there exists no deterministic polynomial algorithm to factor a composite number into two prime numbers.

A. High-Performance Computing
One of the strategies that can be utilized to reduce the high computational time needed by factorization methods is high-performance computing (HPC). The main objective of using HPC is to design a parallel algorithm whose running time, Tp, is almost equal to Tseq/p, where Tseq is the execution time of the problem using one processor and p is the number of processors used in the HPC system. However, achieving this objective is not easy for several reasons, such as the difficulty of dividing the problem into equal-sized subproblems, the communication overhead between processors, and the dependencies among some steps of the solution.
The effectiveness of a parallel algorithm can be measured using the speedup criterion. The speedup of a parallel algorithm is the ratio of the running time of the problem using one processor to the running time of the problem using p processors, and is denoted by Sp = Tseq/Tp. The main goal of designing a parallel algorithm is to achieve linear speedup. Another important criterion for a parallel algorithm is scalability, which represents the parallel system's capacity to increase speedup in proportion to the number of processors.
Many hardware and software platforms have been introduced to measure parallel algorithms practically. Examples of parallel hardware are the cluster, multi-core, graphics processing unit (GPU) and cloud. There are also many different parallel programming languages or libraries such as open multi-processing (openMP), the message passing interface (MPI), and compute unified device architecture (CUDA).
B. Factorization Algorithms
Factorization algorithms are commonly divided into two groups: general-purpose and special-purpose algorithms. The time complexity of the algorithms that belong to the general-purpose group is almost independent of the size of the factor found and depends mainly on the size of n. Examples of methods that belong to this group are Lehman's method, Shanks' square form factorization method, the continued fraction method, the multiple polynomial quadratic sieve, and the number field sieve. In the case of the algorithms that belong to the special-purpose group, the time complexity mainly depends on the size of the factor found. Examples of methods that belong to this group are the trial division, Fermat, Pollard rho, and Lenstra elliptic curve methods.
In this study, we focus on the Fermat factorization (FF) algorithm, which is an efficient method when the difference between the two factors is small. Many research studies have attempted to enhance this method from the sequential computation viewpoint [11,12,13,14]. However, from the parallel computation perspective, to our knowledge there is only one published paper on implementing the FF algorithm on a GPU, namely, the NVIDIA GeForce GT 630 [15]. Moreover, the experiments conducted in that study to parallelize the FF algorithm on the GPU were based on small input sizes of less than 60 bits.

C. Study Outline
In this study, we show how to utilize HPC to speed up the computation of the FF method. We use a multi-core platform that executes 12 threads concurrently to reduce the execution time of the FF algorithm. Also, we study the effect of using HPC when we increase the difference between the two primes, even when the two primes are of the same size. The results show that the proposed parallel FF algorithm improves the execution time and that the maximum speedup achieved by parallelization is 6.7 times that of the sequential FF algorithm. Moreover, the proposed parallel FF algorithm shows near-linear scalability.
The rest of this paper is arranged as follows. In Section 2, we provide an overview of the FF algorithm, including the mathematical concept and pseudocode algorithm, as well as a complexity analysis and example. In Section 3, we introduce our proposed strategy for parallelizing the FF algorithm. Then, in Section 4 we present and discuss the results of our experimental evaluation according to execution time, speed-up and scalability. Finally, in Section 5, we present the conclusion of this work.

II. THE FF ALGORITHM
In this section, we first briefly introduce the mathematical concept on which the FF algorithm is based. Second, we present the idea underpinning the FF algorithm as well as its pseudocode. Third, we provide a complexity analysis of the FF algorithm. Finally, we provide an illustrative example to show the effect of the difference between the two primes on the performance of the FF algorithm.

A. Mathematical Concept
Assume that n is an odd integer of the form n = p1 p2, where p1 > p2 > 0. Then the integer n can be expressed as the difference of two squares x1^2 and x2^2, i.e., n = x1^2 − x2^2.
We can easily prove this statement by setting x1 and x2 as follows:
x1 = (p1 + p2)/2 and x2 = (p1 − p2)/2,
which are both integers because p1 and p2 are odd. Then,
x1^2 − x2^2 = ((p1 + p2)^2 − (p1 − p2)^2)/4 = (4 p1 p2)/4 = p1 p2 = n.
Also, n = x1^2 − x2^2 can be rewritten as follows:
n = (x1 + x2)(x1 − x2).
If the two values (x1 + x2) and (x1 − x2) are not equal to 1, then the two values are factors of n.
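The identity above can be verified numerically. The sketch below uses the 12-bit prime pair (4079, 2069) that appears later in the paper's example:

```python
# Numeric check of the difference-of-squares identity behind Fermat's method.
p1, p2 = 4079, 2069
n = p1 * p2
x1 = (p1 + p2) // 2   # x1 = (p1 + p2)/2, an integer since p1 and p2 are odd
x2 = (p1 - p2) // 2   # x2 = (p1 - p2)/2
assert x1 * x1 - x2 * x2 == n          # n = x1^2 - x2^2
assert (x1 + x2) * (x1 - x2) == n      # n = (x1 + x2)(x1 - x2)
assert (x1 + x2, x1 - x2) == (p1, p2)  # the two factors are recovered
```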

B. The Algorithm
The main idea of the algorithm is to search for two values x1 and x2 such that n = x1^2 − x2^2. We can rewrite the relation between n, x1, and x2 as x2^2 = x1^2 − n. So, if we know the value of x1, we can find the value of x2. Since the value of x2^2 is a positive integer, this means that x1^2 > n. So, the initial value of x1 is ⌊√n⌋ + 1.
The idea of the FF algorithm is to test iteratively, increasing by a value of 1, all values of x1 beginning with ⌊√n⌋ + 1 until we detect a value of x1 that satisfies the condition that x1^2 − n is a perfect square. In this case, the two factors are (x1 + x2) and (x1 − x2). The complete pseudocode of the FF algorithm is shown in Algorithm 1. The algorithm consists of three main steps. The first step is to compute the square root of n to determine the start value of x1. The second step is an iterative step that increases the value of x1 by 1 until the value x1^2 − n is a perfect square. At this point, the two factors are determined in the third step.
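The three steps above can be sketched directly in Python. This is a minimal sequential implementation for illustration, not the paper's own code; `math.isqrt` requires Python 3.8+.

```python
import math

def fermat_factor(n):
    """Factor an odd composite n by Fermat's method (Algorithm 1 sketch)."""
    # Step 1: start x1 just above the square root of n.
    x1 = math.isqrt(n) + 1
    # Step 2: increase x1 by 1 until x1^2 - n is a perfect square.
    while True:
        y = x1 * x1 - n
        x2 = math.isqrt(y)
        if x2 * x2 == y:
            break
        x1 += 1
    # Step 3: the two factors are (x1 + x2) and (x1 - x2).
    return x1 + x2, x1 - x2

print(fermat_factor(8439451))  # the paper's example n = 4079 * 2069
```

On n = 8439451 the loop succeeds at x1 = 3074, where x1^2 − n = 1005^2, giving the factors 4079 and 2069.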
C. Complexity Analysis
The best case of the FF algorithm occurs when the difference between the two prime factors is small and the final value of x1 is only slightly greater than √n. In this case, the number of iterations in the second step is small.
The worst case of the FF algorithm can be calculated as follows. Assume that the minimum value of p2 is 3, which maximizes the difference p1 − p2. This implies that p1 = n/3. If p2 = 3, for large n, then the final value of x1 is (p1 + p2)/2 = (n + 9)/6. Therefore, the number of iterations in the second step is O(n) in the worst case.
In general, the performance of the FF algorithm depends on the difference between the two prime factors, and can be characterized by the following rule: in the case of |p1 − p2| = O(n^(1/4)), the FF solution can be found easily [16]. Table I shows that the main step of the FF algorithm, i.e., Step 2, is affected by the difference between the prime factors even when the two factors are of the same size. The table consists of seven columns. The first three columns are related to the numbers to be factored and their factorizations: n, p1, and p2. The two prime factors have sizes of 12 bits each, but they have different values. The fourth column, m, represents the number of bits of the difference between the two factors, ∆. The relation between m and ∆ is 2^m ≤ ∆ < 2^(m+1). The fifth and sixth columns represent the square root of n and the trial values of x1, respectively. The last column represents the number of iterations in the second step of the FF method.

D. Example
For all the values of n, the number of bits is k = 24, and the number of bits of each factor is k/2 = 12. In the first row, the number of bits of the difference between the two factors is m = 6. The number of bits of the difference between the two factors is increased by 1 in each subsequent row. It is clear from Table I that when the difference between the two factors increases, the number of iterations in the main step (Step 2) of the FF algorithm also increases.
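The effect shown in Table I can be reproduced with a short computation: for pairs of 12-bit primes, the Step-2 iteration count grows with the difference p1 − p2. The first two prime pairs below are our own hypothetical examples; the last pair is the one used in the paper.

```python
import math

def fermat_iterations(p1, p2):
    """Number of Step-2 iterations Fermat's method needs for n = p1 * p2."""
    n = p1 * p2
    start = math.isqrt(n) + 1      # initial trial value of x1
    x1 = (p1 + p2) // 2            # the x1 at which the method succeeds
    return x1 - start + 1

for p1, p2 in [(3109, 3079), (3739, 2749), (4079, 2069)]:
    delta = p1 - p2
    m = delta.bit_length() - 1     # number of bits of the difference
    print(p1, p2, delta, m, fermat_iterations(p1, p2))
```

For these three pairs the differences are 30, 990, and 2010 (m = 4, 9, and 10), and the iteration counts are 1, 38, and 169, respectively, matching the trend the table describes.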

III. PARALLEL FF ALGORITHM
In this section, we present the mechanism that is used to parallelize the FF method. The FF algorithm can be considered a search over the range from ⌊√n⌋ + 1 to (n + 9)/6. Therefore, the proposed approach to parallelizing the FF method is based on assigning the first t integers to the t threads, such that each thread, tj, takes one integer. This means that the integers ⌊√n⌋ + 1, ⌊√n⌋ + 2, …, ⌊√n⌋ + t are assigned to threads t1, t2, …, tt, respectively. If the target is not found by any thread, then the second t integers, ⌊√n⌋ + t + 1, ⌊√n⌋ + t + 2, …, ⌊√n⌋ + 2t, are assigned to threads t1, t2, …, tt, respectively. This process continues dynamically until a thread finds a value of x1 satisfying the condition that x1^2 − n is a perfect square.
In general, the assignment of the integer x1 to thread tj is given by the following formula: x1 = ⌊√n⌋ + (i − 1)t + j, where i represents the i-th group of t integers, i ≥ 1, and 1 ≤ j ≤ t.
All the steps of this parallelization method are given in Algorithm 2. The first step of the algorithm is a sequential step that is used to (1) determine the value of the square root ⌊√n⌋ that is used by all threads, and (2) initialize the shared variable found to false. The second step is a parallel step that is executed by all threads, where each thread tj, 1 ≤ j ≤ t, has two local variables, x1 and x2. This step consists of three substeps, 2.1, 2.2, and 2.3. Substep 2.1 assigns initial values to i (the iteration number) and x1. Substep 2.2 updates the values of x1 and x2 while the value of x2^2 is still not a perfect square and no other thread has found the solution. Finally, in Substep 2.3, the thread that has found the solution, i.e., a value x1^2 − n that is a perfect square, changes the value of found from false to true and then calculates the two factors p1 and p2.
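The interleaved assignment of Algorithm 2 can be sketched with Python threads. Note that this is an illustration of the work assignment and the shared found flag, not of real parallel speedup: because of CPython's global interpreter lock, Python threads do not run CPU-bound work concurrently, whereas the paper's implementation uses native multi-core threads.

```python
import math
import threading

def parallel_fermat(n, t=4):
    """Sketch of Algorithm 2: t threads test interleaved trial values of x1."""
    root = math.isqrt(n)          # computed once, shared by all threads
    found = threading.Event()     # the shared variable "found"
    result = []

    def worker(j):                # thread t_j, 1 <= j <= t
        i = 1                     # group (round) number
        while not found.is_set():
            x1 = root + (i - 1) * t + j   # the assignment formula
            y = x1 * x1 - n               # y > 0 since x1 > sqrt(n)
            x2 = math.isqrt(y)
            if x2 * x2 == y:              # x1^2 - n is a perfect square
                found.set()               # signal all threads to stop
                result.append((x1 + x2, x1 - x2))
            i += 1

    threads = [threading.Thread(target=worker, args=(j,)) for j in range(1, t + 1)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return result[0]

print(parallel_fermat(8439451, t=8))  # -> (4079, 2069)
```

For n = 8439451 and t = 8, thread t1 finds the solution at x1 = 3074 in its 22nd round, consistent with the comparison given later in the text.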
In order to improve the performance of Algorithm 2, we applied the following modifications. First, in order to reduce the cost of reading values shared between all threads, for each shared value we used a local variable instead, except at the beginning of the execution of each thread. Also, for the shared value found, we used a shared array Ok of t elements of Boolean type, and we changed the second condition in the While-loop in Substep 2.2 to Ok[j]. Second, we implemented a modification to control writing to a shared variable. This occurs when a thread tj has found the solution. In this case, thread tj is responsible for changing all the values of Ok inside a critical region. The complete steps of the modified algorithm are shown in Algorithm 3. Note: there is another approach that can be used to parallelize the range search for the FF algorithm. This approach is based on dividing the search range into t subranges, where t is the number of threads. Each thread tj, 1 ≤ j ≤ t, searches a subrange, Rj, which is defined as follows.
Rj = [⌊√n⌋ + 1 + (j − 1)s, ⌊√n⌋ + j s], where s is the size of each subrange. Thread tj starts the search with x1 = ⌊√n⌋ + 1 + (j − 1)s and tries to find a value of x1 satisfying the condition that x1^2 − n is a perfect square. If thread tj finds the target, it changes the shared variable found from false to true. This means that all the other threads stop searching once one of the threads changes the variable found to true.
In general, this approach is not efficient when the two factors are of the same size. For example, referring to Table I, consider n = 4079 × 2069 = 8439451, and let the number of threads t = 8. The range of the search is [2906, 1406576], and therefore the size of the search subrange for each thread is approximately 175822. The first thread will therefore find the solution after 169 iterations. In contrast, by using Algorithm 2, the solution can be found after just 22 iterations.
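The comparison above is easy to reproduce by counting the trials performed by the winning thread under each assignment strategy:

```python
import math

# Iterations of the first (winning) thread: block splitting vs. the
# interleaved assignment of Algorithm 2, for the paper's example.
n, t = 4079 * 2069, 8
start = math.isqrt(n) + 1          # 2906, start of the search range
end = (n + 9) // 6                 # upper end of the search range
solution = (4079 + 2069) // 2      # x1 = 3074, where the search succeeds

# Block splitting: the solution lies in the first thread's subrange,
# so that thread performs every trial up to the solution by itself.
block_iters = solution - start + 1

# Interleaved assignment: trial offsets are dealt round-robin across the
# t threads, so the winning thread reaches the solution in ~1/t of the trials.
offset = solution - start                 # 168 trials precede the solution
interleaved_iters = offset // t + 1       # round in which it is found

print(block_iters, interleaved_iters)     # -> 169 22
```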

IV. EXPERIMENTAL EVALUATIONS
In this section, we present the procedures and results of our evaluation of the impact of the proposed parallel approach on the FF method according to the following three parameters: (1) the number of bits of the composite odd integer, (2) the size of the difference between the two prime factors, and (3) the number of threads used. To this end, the section comprises two subsections. The first provides the configurations of the platform and the data used in the experiments. The second provides the measurement and analysis of the running times and the scalability of the proposed parallel method.

A. Platform and Data Setting
The platform settings for the experiments are based on the configurations shown in Table II. The experiments on all the studied algorithms are based on three parameters. The first two parameters are related to the generation of two prime numbers, p1 and p2, of the same size to construct a composite odd number n = p1 p2. The first parameter is the number of bits of the integer n, which is k. This means that the number of bits of each prime factor, p1 and p2, is k/2. The second parameter is the difference between the two prime factors, ∆ = |p1 − p2|, with 2^m ≤ ∆ < 2^(m+1), where m < k/2 − 1. This means that when a prime factor p1 of size k/2 bits is generated, the second prime factor p2, also of size k/2 bits, is generated such that the difference between them is ∆ and 2^m ≤ ∆ < 2^(m+1), for a certain value of m. The settings of these two parameters are shown in Table III. The maximum value of m is k/2 − 2 in order to ensure that the two prime factors are of the same size. The minimum value of m is k/2 − 15, because this value is near to k/4 for the studied cases. Also, if m is less than k/2 − 15, for the studied cases, the running time of the algorithms tends toward zero. The third parameter is the number of cores, t, used in the experiments; the values of t are 4, 8, and 12.
In the experiments, we initially fix the value of k, say k = 80, and then generate two prime numbers, each of size k/2, such that the difference between them is ∆, say with m = k/2 − 5. We repeat the same process to generate 25 different instances, Di, 1 ≤ i ≤ 25, for the same values of k and ∆. After that, we run Algorithm A on Di using a fixed number of cores, t. The running time for Algorithm A using t cores is therefore the average of the running times of Algorithm A on the 25 instances. In the case of k = 100, we run the sequential FF algorithm only one time, because its running time is very large (see Table IV). Also, in this case, we run the parallel FF algorithm using t threads on five instances only.
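The data generation described above can be sketched as follows. This is a hypothetical construction of our own (the paper does not specify its generator): a Miller-Rabin probabilistic primality test plus a search for a prime p1 inside the window [p2 + 2^m, p2 + 2^(m+1)).

```python
import random

def is_probable_prime(n, rounds=20):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:
        d, r = d // 2, r + 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def generate_pair(k, m):
    """Return primes p1 > p2 of k/2 bits each with 2^m <= p1 - p2 < 2^(m+1)."""
    half = k // 2
    while True:
        # Pick an odd k/2-bit p2 low enough that p2 + Delta still has k/2 bits.
        p2 = random.randrange(2 ** (half - 1) | 1, 2 ** half - 2 ** (m + 1), 2)
        if not is_probable_prime(p2):
            continue
        # Search the Delta window upward for a prime p1 (odd candidates only).
        for p1 in range(p2 + 2 ** m, p2 + 2 ** (m + 1), 2):
            if is_probable_prime(p1):
                return p1, p2

p1, p2 = generate_pair(48, 9)   # e.g. k = 48 bits, m = 9
print(p1, p2, p1 - p2)
```

If no prime falls inside the ∆ window, the outer loop simply retries with a new p2; for the window sizes used here, a prime is found almost immediately.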
In general, the running time of Algorithm A is computed using the three parameters as follows: for each fixed value of k, ∆, and t, we measure the running time of Algorithm A by executing it on 25 different instances and then computing the average of these running times in seconds.
In addition, for a fixed value of k, we have 12 values for the running time of Algorithm A, each of which is the average time over 25 instances. These 12 values come from all the combinations of the four values of m (k/2 − 15, k/2 − 10, k/2 − 5, and k/2 − 2) and the three values of t (4, 8, and 12).

B. Discussion of the Results
Based on the platform and data settings described in the previous subsection, the running times of Algorithm 1 (the sequential FF algorithm) and Algorithm 3 (the parallel FF algorithm) are shown in Table IV. The table consists of six columns. The first column represents the number of bits, k, of the composite odd integer n, while the second column represents the number of bits, m, of the difference between the factors. The third column represents the running time of the sequential FF algorithm, Algorithm 1. The fourth to sixth columns represent the running times of the parallel FF algorithm, Algorithm 3, using 4, 8, and 12 threads, respectively. From the analysis of the running times of the two algorithms, 1 and 3, using the different factors shown in Table IV, several observations can be made. First, in respect of the sequential FF algorithm, Algorithm 1: 1) The running time of the sequential FF algorithm increases with an increased difference between the two prime factors. This means that, for a fixed value of k, the running time of the FF algorithm when m = m1 is less than when m = m2, where m1 < m2. For example, when k = 80, m1 = 35, and m2 = 40, the running times of the sequential FF algorithm are 0.05 and 35.7 seconds, respectively.
2) For a fixed value of m and two different values of k, k1 and k2, the difference in the running time of the FF algorithm between k1 and k2 is significant.
3) The minimum and maximum running times of the sequential FF algorithm occur when the value of m is at its minimum, near k/4, and at its maximum, k/2 − 2, respectively.
Second, in respect of the running time of the parallel FF algorithm, Algorithm 3: 1) The running time of the parallel FF algorithm decreases with an increase in the number of threads. This means that, for fixed values of k and m, the running time of the parallel FF algorithm using t threads is less than the running time for the same instance using t′ threads, where t > t′. As an example, for k = 80 and t = 4, 8, and 12, the running times of the parallel FF algorithm are 10.7, 8.8, and 6.4 seconds, respectively.
2) The running time of the parallel FF algorithm is less than the running time of the sequential FF algorithm for any number of threads, t ≥ 4, except when the running time of the sequential FF algorithm is near zero. In this case, when k = 70 and m = 20, the parallelization approach is not efficient in terms of running time because the search range is very small.
3) For fixed values of k and m, the running time of the parallel FF algorithm differs from one instance to another. This is because the range of ∆ is large for a large value of m. As an example, Fig. 1 shows the running times of the parallel FF algorithm on 25 different instances using four threads for the case of k = 80 and m = 35.
4) The improvement of the parallel FF algorithm over the sequential FF algorithm using t threads is greater than the improvement using t′ threads, where t > t′; see Fig. 2. For example, in the case of k = 90 and m = 40, the improvement of the parallel FF algorithm using four threads is 68.8%, whereas the improvement increases to 76.6% using eight threads.
Third, we also measured the speedup of the parallel FF algorithm from two viewpoints: (1) fixed values of k and m, and (2) fixed values of k and t. Fig. 3 shows the speedup values with fixed k and m and varied values of t, from which it can be observed that the speedup of the parallel FF algorithm increases with increased t. This is true for every k and m studied except when k = 70 and m = 20, because the running time of the sequential FF algorithm at these values is near zero. For example, when k = 90 and m = 40, the speedup values of the parallel FF algorithm using t = 4, 8, and 12 are 3.2, 4.3, and 6.8, respectively. In addition, in general, the speedup value is approximately equal to half the number of threads.

Fig. 4 shows the speedup values with fixed k and t and varied values of m, from which it can be observed that the speedup of the parallel FF algorithm increases slightly with increased m. This means that, for a fixed problem size and number of threads, the speedup value of the parallel FF algorithm increases, even if only slightly, with an increase in the difference between the two prime factors. For example, when k = 80 and t = 12, the speedup values of the parallel FF algorithm are 1, 5.1, 5.6, and 6.5 for m = 25, 30, 35, and 38, respectively.
In general, the maximum speedup achieved by the parallel FF algorithm was 6.7 times that of the sequential FF algorithm. Moreover, the parallel FF algorithm had near-linear scalability.
Fourth, Fig. 5 shows the efficiency of the parallel FF algorithm in the case of k = 100 and different values of t. The maximum efficiency value was achieved when the number of threads equaled four.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 4, 2020

V. CONCLUSION
In this study, we addressed one of the challenging problems related to cryptography, namely, integer factorization. The goal of integer factorization is to factor a composite number into two prime factors. The FF algorithm is one of the factorization algorithms that is used when the two factors are of the same size. We investigated the effect of using a multi-core system on the performance of the FF method based on three parameters: (1) the number of bits of the composite positive integer, (2) the size of the difference between the two prime factors, and (3) the number of threads used. The experimental results showed that the running time of the parallel FF algorithm was faster than that of the sequential FF algorithm. The maximum speedup achieved by the parallel algorithm was 6.7 times that of the sequential FF algorithm. Moreover, the parallel FF algorithm had near-linear scalability.
There are still some interesting open questions related to the FF algorithm, such as (1) how to use GPUs to parallelize the FF algorithm, (2) how to reduce the running time of the FF algorithm when the difference between the two prime factors is large, and (3) how to use the FF algorithm in the internet of things [17].