JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.16, NO.5, OCTOBER, 2016
http://dx.doi.org/10.5573/JSTS.2016.16.5.564
ISSN(Print) 1598-1657 ISSN(Online) 2233-4866

Design and Analysis of Efficient Parallel Hardware Prime Generators

Dong Kyue Kim1, Piljoo Choi1, Mun-Kyu Lee2, and Heejin Park3

Abstract—We present an efficient hardware prime generator that generates a prime p by combining trial division and the Fermat test in parallel. Since the execution time of this parallel combination is greatly influenced by the number k of the smallest odd primes used in the trial division, it is important to determine the optimal k to create the fastest parallel combination. We present a probabilistic analysis to determine the optimal k and to estimate the expected running time of the parallel combination. Our analysis is conducted in two stages. First, we roughly narrow the range of the optimal k by using the expected values of the random variables used in the analysis. Second, we precisely determine the optimal k by using the exact probability distributions of the random variables. Our experiments show that the optimal k and the expected running time determined by our analysis are precise and accurate. Furthermore, we generalize our analysis and propose a guideline for a designer of a hardware prime generator to determine the optimal k by simply calculating the ratio of M to D, where M and D are the measured running times of a modular multiplication and an integer division, respectively.

Index Terms—Performance analysis, digital integrated circuits, prime number, public key cryptosystem, information security

Manuscript received Oct. 22, 2015; accepted May 19, 2016
1 Dept. of Electronic Engineering, Hanyang University, 222 Wangsimni-ro, Seongdong-gu, Seoul 04763, Korea
2 Dept. of Computer and Information Engineering, Inha University, 100 Inha-ro, Nam-gu, Incheon 22212, Korea
3 Dept.
of Computer Science and Engineering, Hanyang University, 222 Wangsimni-ro, Seongdong-gu, Seoul 04763, Korea
E-mail: [email protected]

I. INTRODUCTION

Large primes are the basis of modern cryptography and are widely used in standard public key cryptosystems such as RSA [1], ElGamal [2], the Digital Signature Standard (DSS) [3], and elliptic curve cryptosystems (ECC) [4, 5]. To maintain the security of these cryptosystems against advancing attacks on the underlying number-theoretic problems, such as integer factorization and discrete logarithm, ever larger primes are required. For example, while early RSA used 512-bit keys, the recommended key length has gradually increased to 1024 and 2048 bits, which require 512- and 1024-bit prime numbers, respectively. As a result, the key-pair generation procedure of RSA requires heavy computation, most of which is dedicated to the generation of random prime numbers.

One obvious way to avoid heavy computation in a security product is to use pre-generated key pairs. That is, after a private key and its corresponding public key are generated in a separate prime generator external to the product, the private key may be injected into the product during manufacturing. However, this approach is very insecure because private keys outside the product may be exposed to unauthorized attackers. Another limitation of this approach is that the keys cannot be changed even when compromised. Due to these limitations, there are many applications that require a prime generator within the security product itself. Among them is the well-known Trusted Platform Module (TPM) [6], a hardware module fixed to a platform that protects user data and verifies the current state of local or remote platforms. It contains a key generator as an essential element for public-key encryption and digital signatures.
In addition, the Trusted Execution Environment (TEE) [7], in particular its most prominent implementation on mobile devices, ARM TrustZone [8], executes all security-related functions with its own key isolated from the outside world. Finally, recent advances in Internet of Things (IoT) technologies require every IoT device to embed stand-alone security features, including functionality for secure key generation. In these environments, a hardware prime generator embedded in the device is indispensable. Moreover, many advanced applications such as proxy certificates [9] require frequent generation of fresh key pairs.

In general, prime generation is an iterative procedure consisting of the following two steps.

1) Generate an odd random integer, r.
2) Determine if r is prime. If so, return r. Otherwise, go to step 1).

Since the primality test in step 2) is very time-consuming, researchers have performed extensive research to improve its speed. Primality test algorithms are divided into two categories: deterministic and probabilistic. A deterministic primality test certifies that a random number is a prime with probability 1; in other words, if a random number passes the test, it is certainly a prime. Trial division [10], a popular deterministic primality test, divides a random number r by primes up to √r. Other deterministic primality tests include Pocklington's test [11], its elliptic curve analogue [12], the Jacobi sum test [13], Maurer's algorithm [14], Shawe-Taylor's algorithm [15], and the AKS test [16]. A probabilistic primality test certifies that a random number is a prime with a probability very close to 1, such as 1 − 1/2^s for a large s. Probabilistic primality tests include the Fermat test [10], the Miller-Rabin test [4, 17], the Solovay-Strassen test [18], the Frobenius-Grantham test [19], and the Lehmann test [20].
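The two-step loop in steps 1) and 2) above can be sketched in a few lines of Python. This is a minimal software sketch, not the paper's hardware design; repeated Fermat tests (described in Section II) stand in as a placeholder primality check, and the function names are our illustrative choices:

```python
import random

def is_probable_prime(r, trials=8):
    """Placeholder for step 2): repeated Fermat tests (see Section II)."""
    if r < 4:
        return r in (2, 3)
    for _ in range(trials):
        a = random.randrange(2, r)        # 2 <= a <= r - 1
        if pow(a, r - 1, r) != 1:
            return False                  # definitely composite
    return True                           # probably prime

def generate_prime(n):
    """Steps 1) and 2): draw odd n-bit candidates until one passes the test."""
    while True:
        r = random.getrandbits(n) | (1 << (n - 1)) | 1   # odd, exactly n bits
        if is_probable_prime(r):
            return r
```

For example, `generate_prime(512)` returns a 512-bit probable prime. Speeding up exactly this loop, by filtering candidates with trial division before the expensive exponentiations, is the subject of this paper.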
To enhance the speed of prime generation, it is common practice in real applications to use two or more primality tests together. The most popular combination is trial division followed by a probabilistic primality test such as the Fermat test, as in OpenSSL [21]. In this combination, trial division divides a random number by the k smallest primes. A candidate number that is not divisible by any of the k primes is then subjected to either the Fermat or the Miller-Rabin test. Since k, the number of primes used in the trial division stage, affects the overall performance, identifying the optimal k is a crucial factor in designing the combined algorithm. For software prime generators, this problem was analyzed in [14] for primes and in [22, 23] for safe primes. Given the times required for the main operations of trial division and the Fermat test, i.e., integer division and modular exponentiation, respectively, the running time of the sequential combination can be estimated as a function of k, and the optimal k that minimizes the running time can be determined.

In this paper, we examine the different situation of a hardware prime generator. Trial division and Fermat tests implemented in hardware not only show much better performance than their software counterparts, but they can also run in parallel. These advantages may significantly enhance the overall performance of the prime generator. Since p is generated by hardware, the value k is hardwired in the circuit, and thus the optimal k must be determined at the design stage of the hardware prime generator. However, the method to compute the expected running time and the optimal k must be completely different from that given for the sequential case [14, 22, 23]. We provide a solution to this problem and propose a guideline for developing hardware prime generators.
First, we present a parallel prime generation algorithm combining trial division and the Fermat test. Next, we propose a framework to identify the optimal value of k, i.e., the number of primes used in trial division, and analyze the minimum expected running time of a parallel prime generator. Our analysis is conducted in two stages. In the preliminary analysis stage, we determine a rough range (such as the order of magnitude) that includes the optimal k. For this purpose, we consider only the expected values of the random variables that contribute to the running time. The result of this stage is used to narrow the search space for the second stage, i.e., the main analysis stage. In this second stage, we consider the exact probability distributions of the random variables for a more precise analysis. We also verify the validity and accuracy of our analysis framework through an FPGA implementation of 512- and 1024-bit prime generators. According to our analysis and the experimental results, the main factors that determine the overall performance of prime generation are (i) the bit-length of a target prime number, denoted by n, (ii) the number of primes used in the trial division, denoted by k, and (iii) the ratio of the expected running time of a modular exponentiation to that of an integer division, denoted by ρ. As our final contribution, we provide recommended values of k according to ρ for n = 512 and n = 1024.

The remainder of the paper is organized as follows. Section II provides preliminary information including primality tests and modular exponentiation methods. Section III presents the explicit parallel algorithm for prime generation. Section IV provides the theoretical analysis to determine the optimal k and the fastest running time for prime generation.
In Section V, we present the experimental results for an FPGA implementation of 512- and 1024-bit prime generators and compare them with the estimates of Section IV. Section VI generalizes the results to cases with various ratios ρ. Section VII concludes the paper with some remarks. The proofs of lemmas and theorems that are not given in Section IV are provided in the appendix.

II. PRELIMINARIES

1. The Probability of a Random Number being Prime

It is well known that the probability of a random number r being prime is 1/ln r [24, p.11]. If r is odd, the probability doubles to 2/ln r. Thus, the probability of an n-bit odd random number being prime is 2/ln 2^n, which is 1/(0.3466n). For example, the probability of a 512-bit random odd number being prime is 1/(0.3466 · 512) ≈ 1/177.46, and that of a 1024-bit random odd number is 1/(0.3466 · 1024) ≈ 1/354.92.

2. Trial Division

In a trial division for a random number r, r is divided by primes up to √r. If r is not divisible by any of these primes, it is prime; otherwise, it is composite. The trial division may be modified to divide r by only the k smallest odd primes. We denote the k smallest odd primes by p1 < p2 < … < pk throughout this paper. This modified trial division cannot be used to certify that r is prime, but it suffices to show that r is composite if r has a small factor not larger than pk. Algorithm 1 presents our trial division procedure. When another algorithm calls Algorithm 1 as its subroutine, this subroutine call is denoted as TD(r, k).

Algorithm 1 Trial Division
Input: r, k
Output: PRIME or COMPOSITE
1: for j = 1 to k do
2:   if r mod pj = 0 then
3:     return COMPOSITE
4:   end if
5: end for
6: return PRIME

3. Fermat Test

The Fermat test for a random number r determines whether or not r is composite using Fermat's theorem [10, p.967]. The precise operation is shown in Algorithm 2.

Algorithm 2 Fermat Test [10, p.967]
Input: integer r
Output: PRIME or COMPOSITE
1: Generate a random number a such that 2 ≤ a ≤ r − 1
2: if a^(r−1) mod r ≠ 1 then
3:   return COMPOSITE
4: end if
5: return PRIME

If a^(r−1) mod r ≠ 1, r is definitely composite according to Fermat's theorem, which is addressed by lines 2 to 4 in Algorithm 2. Otherwise, r is a probable prime, which means that line 5 in Algorithm 2 can produce errors. However, if we repeat the test for several random a's, the probability that Algorithm 2 produces an error is reduced. When another algorithm calls Algorithm 2 as its subroutine, this subroutine call is denoted as FT(r).

4. Miller-Rabin Test

We cannot entirely eliminate errors by simply repeating Fermat tests with different base numbers, because there exist composite integers, called Carmichael numbers, that pass the Fermat test for every base a with gcd(a, r) = 1. The Miller-Rabin test, a modification of the Fermat test, overcomes this problem by additionally determining whether there is a nontrivial square root (NSR) of 1 mod r, as follows:

Algorithm 3 Miller-Rabin Test [10, p.969]
Input: integer r
Output: PRIME or COMPOSITE
1: Find integers l and q with l > 0 and q odd, so that r − 1 = 2^l · q
2: Generate a random number a such that 2 < a < r − 1
3: x0 ← a^q mod r
4: for i = 1 to l do
5:   xi ← x_{i−1}^2 mod r
6:   if xi = 1 and x_{i−1} ≠ 1 and x_{i−1} ≠ r − 1 then
7:     return COMPOSITE // An NSR is found
8:   end if
9: end for
10: if xl ≠ 1 then
11:   return COMPOSITE // Fermat test
12: end if
13: return PRIME

Similar to the Fermat test, if r passes the Miller-Rabin test, it has a very high probability of being prime. Damgård et al. [25] showed that 6 iterations of the Miller-Rabin test with different values of a are sufficient to certify that a 512-bit number r is prime, and 3 iterations are sufficient for a 1024-bit number r.
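As a software reference, Algorithms 1–3 translate directly into Python. The sketch below mirrors the three tests; the small-prime helper and the PRIME/COMPOSITE strings are our illustrative choices, not part of the paper's hardware:

```python
import random

def small_odd_primes(k):
    """Helper (not part of Algorithms 1-3): the k smallest odd primes p1 < ... < pk."""
    primes, cand = [], 3
    while len(primes) < k:
        if all(cand % p != 0 for p in primes if p * p <= cand):
            primes.append(cand)
        cand += 2
    return primes

def trial_division(r, small_primes):
    """Algorithm 1, TD(r, k): divide r by p1, ..., pk."""
    for p in small_primes:
        if r % p == 0:
            return "COMPOSITE"
    return "PRIME"

def fermat_test(r):
    """Algorithm 2, FT(r): one Fermat test with a random base."""
    a = random.randrange(2, r)            # 2 <= a <= r - 1
    if pow(a, r - 1, r) != 1:
        return "COMPOSITE"
    return "PRIME"

def miller_rabin(r):
    """Algorithm 3: Fermat condition plus nontrivial-square-root check."""
    l, q = 0, r - 1
    while q % 2 == 0:                     # r - 1 = 2^l * q with q odd
        l += 1
        q //= 2
    a = random.randrange(3, r - 1)        # 2 < a < r - 1
    x = pow(a, q, r)                      # x0
    for _ in range(l):
        x_prev, x = x, (x * x) % r        # xi = x_{i-1}^2 mod r
        if x == 1 and x_prev != 1 and x_prev != r - 1:
            return "COMPOSITE"            # an NSR of 1 mod r is found
    if x != 1:
        return "COMPOSITE"                # Fermat condition fails
    return "PRIME"
```

For instance, `trial_division(91, small_odd_primes(5))` reports COMPOSITE because 91 = 7 · 13 has a factor among the five smallest odd primes.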
However, it should be noted that there are very few Carmichael numbers, and most composite numbers are detected by the original Fermat test, which does not consider NSRs. For example, if r is a 512-bit number and 2^(r−1) ≡ 1 (mod r), the probability that r is composite is less than 10^(−20) due to Pomerance [26]. Therefore, the Fermat test is used instead of the Miller-Rabin test in many real-life programs such as OpenSSL [21]. In the present paper, we do not consider the Miller-Rabin test and instead use only the Fermat test, which greatly simplifies our procedure.

5. Computation Method of Modular Exponentiation

The dominant operation in a Fermat test is modular exponentiation, i.e., the computation of a^(r−1) mod r. There are many well-known algorithms for modular exponentiation; we use the simplest and most widely used method, the binary method [27], in conjunction with the Montgomery number system [28]. In the following algorithm, MontMult(x, y, l, N) represents a Montgomery multiplication, i.e., x × y × (2^l)^(−1) mod N.

Algorithm 4 Binary Montgomery Modular Exponentiation
Input: a, b, N, l
Output: a^b mod N
1: c ← a × 2^l mod N // Conversion of a to the Montgomery system
2: Let <b_{n−1} b_{n−2} … b_0> be the binary representation of b, where b_{n−1}, the most significant bit, is 1
3: d ← c
4: for i = n−2 downto 0 do
5:   d ← MontMult(d, d, l, N) // squaring
6:   if bi = 1 then
7:     d ← MontMult(d, c, l, N) // multiplication
8:   end if
9: end for
10: return MontMult(d, 1, l, N) // Conversion back to the standard number system

Although lines 1 and 10 incur some overhead to convert numbers between the standard and Montgomery number systems, it is negligible. Therefore, the dominant part of Algorithm 4 is the main loop (lines 4 to 9). In each iteration of this loop, two Montgomery multiplications are required if bi = 1; otherwise, only one Montgomery multiplication is required.
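For illustration, Algorithm 4 can be modeled arithmetically in Python and instrumented to count the MontMult calls in the main loop. This models MontMult(x, y, l, N) as x·y·(2^l)^(−1) mod N via the standard Montgomery reduction (not the hardware datapath), requires N odd and N < 2^l, and uses Python 3.8+'s modular inverse `pow(N, -1, R)`:

```python
def mont_mult(x, y, l, N):
    """Arithmetic model of MontMult(x, y, l, N) = x*y*(2^l)^(-1) mod N (N odd, N < 2^l)."""
    R = 1 << l
    n_prime = (-pow(N, -1, R)) % R        # N' = -N^(-1) mod 2^l
    t = x * y
    m = (t * n_prime) % R
    u = (t + m * N) >> l                  # t + m*N is exactly divisible by 2^l
    return u - N if u >= N else u

def mont_exp(a, b, N, l):
    """Algorithm 4 with a counter for MontMult calls in the main loop."""
    count = 0
    c = (a << l) % N                      # line 1: convert a to Montgomery form
    d = c                                 # line 3
    for i in range(b.bit_length() - 2, -1, -1):   # lines 4-9: bits b_{n-2}..b_0
        d = mont_mult(d, d, l, N)         # squaring
        count += 1
        if (b >> i) & 1:
            d = mont_mult(d, c, l, N)     # multiplication
            count += 1
    return mont_mult(d, 1, l, N), count   # line 10: convert back
```

For an n-bit exponent b with Hamming weight HW(b), the counter comes out to (n − 1) + (HW(b) − 1), in agreement with the two cases just described.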
Therefore, the number of multiplications in the main loop is (n − 1) + (HW(b) − 1), where HW(b) is the Hamming weight of b, i.e., the number of 1's in its binary representation. When we apply Algorithm 4 to compute a^(r−1) mod r in the Fermat test, r − 1 is guaranteed to be an n-bit even number, which implies that the least significant bit of the exponent is 0. Considering that each of the other bits of r − 1 has a 1/2 probability of being 1, we can estimate that the main loop requires an average of (n − 1) + (n − 2)/2 multiplication operations.

III. PARALLEL PRIME GENERATION: PARALLEL COMBINATION OF TRIAL DIVISIONS AND FERMAT TESTS

In this section, we present a procedure that generates a random prime number by performing trial divisions and Fermat tests in parallel. Our procedure is shown in Algorithm 5.

Algorithm 5 Parallel Prime Generation
Input: n, k, m // k: number of primes used for trial division, m: number of iterations of Fermat tests
Output: n-bit prime number
1: Generate a random n-bit odd integer r
2: // Initial trial division: Procedure T0
3: loop
4:   if TD(r, k) = COMPOSITE then
5:     r ← r + 2
6:   else
7:     break // r passes the initial trial division test
8:   end if
9: end loop
10: r'1 ← r
11: i ← 0
12: fstate ← DONE
13: tstate ← DONE
14: loop
15:   loop
16:     if fstate = DONE and tstate = DONE then
17:       break
18:     end if
19:   end loop // Synchronization of Ti and Fi
20:   i ← i + 1
21:   // Procedure Ti and Procedure Fi are performed in parallel
22:   procedure Ti do
23:     tstate ← BUSY
24:     r ← r'i + 2
25:     loop
26:       if TD(r, k) = COMPOSITE then
27:         r ← r + 2
28:       else
29:         break // r passes the trial division test
30:       end if
31:     end loop
32:     r'i+1 ← r // r is reserved for the next period of Fermat tests
33:     tstate ← DONE
34:   end procedure
35:   procedure Fi do
36:     fstate ← BUSY
37:     for j = 1 to m do
38:       fresult ← FT(r'i)
39:       if fresult = COMPOSITE then
40:         break
41:       end if
42:     end for
43:     if fresult = PRIME then
44:       return r'i
45:     end if
46:     fstate ← DONE
47:   end procedure
48: end loop

Fig. 1 shows the overall structure of Algorithm 5. Beginning with a random odd integer r generated in line 1 of Algorithm 5, we perform trial division tests (lines 3 to 9), TD(r, k), TD(r + 2, k), TD(r + 4, k), …, until we find a candidate number, r'1, that is not divisible by any prime up to pk. We denote this initial trial division period as procedure T0. Then, using this candidate number, we conduct a Fermat test denoted by procedure F1, which corresponds to lines 35 to 47 for iteration i = 1. While the Fermat test is being performed, we also conduct additional trial division tests beginning with r'1 + 2 (lines 22 to 34 with iteration i = 1). To be precise, TD(r'1 + 2, k), TD(r'1 + 4, k), TD(r'1 + 6, k), … are performed until we find another candidate number, r'2, that is not divisible by any prime up to pk. We denote this trial division period as procedure T1. Note that procedures T1 and F1 are executed in parallel. The subsequent parallel procedures are executed in a similar manner: Ti and Fi are executed in parallel during the ith iteration of the main loop (lines 14 to 48) of Algorithm 5. Even though Ti and Fi are executed in parallel, their completion times may not coincide. For example, a Fermat test may be completed earlier, as in iteration 1 in Fig. 1, or a trial division period may be completed earlier, as in iteration 2.
Fig. 1. Example execution of Algorithm 5 with up to two iterations of the main loop.

In the former case, procedure Fi+1 cannot start before procedure Ti has been completed, because a new candidate number for the Fermat test is not yet available. To resolve this issue and simplify the hardware architecture, we synchronize the starting times of Ti and Fi by waiting until both procedures Ti−1 and Fi−1 are completed (lines 15 to 19). Finally, Algorithm 5 terminates when procedure Fi fails to find any evidence that r'i is composite within m iterations of the Fermat test. In this case, we conclude that r'i is prime with very high probability.

IV. ANALYSIS ON EXPECTED RUNNING TIME OF PARALLEL PRIME GENERATION

In this section, we analyze the expected running time of Algorithm 5 according to k, the number of odd primes used for trial division, and n, the bit length of the prime number to be generated. Then, we estimate the optimal value of k that minimizes the running time for a given n. As observed in the previous section, the running time of iteration i (i ≥ 1) of the main loop of Algorithm 5 is max(T(Ti), T(Fi)), where T(Ti) and T(Fi) denote the running times of Ti and Fi, respectively. Therefore, a precise analysis to find the optimal value of k must consider the distributions of T(Ti) and T(Fi) together. However, for a more effective analysis, we adopt a two-stage approach. In the first stage, we find the expected values of T(Ti) and T(Fi) independently and assume that every Ti and Fi consumes exactly these times. This assumption is used to determine a rough range (order of magnitude) of k that should include the optimal value. In the second stage, we conduct a precise analysis considering the distributions of T(Ti) and T(Fi): we establish a probability distribution on the running time of Algorithm 5 and find the k that minimizes it.
We search for the optimal k in the search space identified in the first stage. This two-stage approach reduces the overall time to identify the optimal value of k.

1. Stage 1: Determining Candidate Range for k

In the first stage, we consider only the expected values of T(Ti) and T(Fi), ignoring their exact distributions. The analysis used in this stage is basically the same approach as that given by Maurer [14]. We begin by analyzing the expected value of T(Ti). Because procedure Ti is composed of many calls to the sub-procedure TD(r, k), we first estimate the expected running time of a single TD(r, k). Let Tn,k(TD) be the running time of TD(r, k), i.e., a call to Algorithm 1, where n is the bit-length of r. Lemma 1 computes its expected value, E[Tn,k(TD)]. Lemma 2 determines the expected value of T(Ti) using E[Tn,k(TD)] and the expected number of trial divisions in Ti. The proofs of lemmas and theorems that are not given in this section are provided in the appendix.

Lemma 1: E[Tn,k(TD)] = (1 + Σ_{j=2}^{k} Π_{i=1}^{j−1} (1 − 1/p_i)) · div_n, where p1 < p2 < … < pk are the k smallest odd primes and div_n is the time required to divide an n-bit integer by a single-word odd prime.

Lemma 2: E[T(Ti)] = (1 + Σ_{j=2}^{k} Π_{i=1}^{j−1} (1 − 1/p_i)) / (Π_{j=1}^{k} (1 − 1/p_j)) · div_n.

As the next step, we estimate E[T(Fi)], the expected running time of the ith Fermat test Fi with an n-bit random number, as follows.

Lemma 3: E[T(Fi)] = exp_n for 1 ≤ i < x, and E[T(Fx)] = m · exp_n, where exp_n is the expected running time of a modular exponentiation and x is the number of calls to Fi in Algorithm 5.

Now, we are ready to analyze the expected running time of Algorithm 5, ETn(k). Because the required time of one iteration of the main loop in Algorithm 5 is dominated by max(E[T(Ti)], E[T(Fi)]), ETn(k) can be represented as follows by Lemma 3:

ETn(k) = E[T(Ti)] + (E[x] + m − 1) · exp_n   if E[T(Ti)] ≤ exp_n,   (1)
ETn(k) = E[x] · E[T(Ti)] + m · exp_n   if E[T(Ti)] ≥ exp_n.   (2)

For further analysis, we express the expected value of x as a function of k and n in the following lemma.

Lemma 4: E[x] = 0.3466 n · Π_{j=1}^{k} (1 − 1/p_j).

Now, we can express ETn(k) as a function of k, div_n, and exp_n as follows:

Theorem 1: Let k0 be the integer satisfying
(1 + Σ_{j=2}^{k0} Π_{i=1}^{j−1} (1 − 1/p_i)) / (Π_{j=1}^{k0} (1 − 1/p_j)) = exp_n / div_n.
Then, by Lemma 2, E[T(Ti)] ≤ exp_n for k ≤ k0 and E[T(Ti)] ≥ exp_n for k ≥ k0, so (1) applies up to k0 and (2) beyond it, and k0 serves as the Stage-1 candidate for the optimal k.

Now, our task is to find the k that minimizes ETn(k). Note that ETn(k) depends on multiple parameters, such as n, div_n, exp_n, and m, as well as k. Table 1 shows the candidates of optimal k for two typical parameter settings. For the values of div_n and exp_n, we used the values measured on the circuit explained in Section V.

Table 1. Candidates of optimal k for n = 512, 1024.
n    | m | exp_n (µs) | div_n (µs) | exp_n/div_n | Candidate k
512  | 6 | 5008.00    | 0.813      | 6159.9      | 5408
1024 | 3 | 19992.90   | 1.613      | 12394.9     | 11022

Figs. 2 and 3 show the behavior of ETn(k) for the above two settings. To help readers understand this behavior according to changes in k, we generated two graphs for each setting. For n = 512, the values of ET512(k) (0 < k ≤ 50000) are plotted with a step size of 1000, and then a magnified graph is plotted for 0 < k ≤ 10000 with a step size of 400. For n = 1024, we first plotted the values of ET1024(k) (0 < k ≤ 100000) with a step size of 2000 and then plotted a magnified graph for 0 < k ≤ 20000 with a step size of 400. As a result of the above analysis, we identified the candidates of optimal k, i.e., k = 5408 and k = 11022. However, it should be noted that these values are not the actual optimal values; they indicate that the optimal k will be found near them.
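The Stage-1 computation can be reproduced numerically. The Python sketch below evaluates Lemmas 1, 2, and 4 and Eqs. (1)/(2) for the n = 512 setting of Table 1 and scans k for the minimum; the sieve helper, the scan range, and the variable names are our choices, not the paper's:

```python
def odd_primes(count, limit=200000):
    """Helper: the first `count` odd primes via a sieve of Eratosthenes."""
    sieve = bytearray([1]) * limit
    sieve[0] = sieve[1] = 0
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(range(i * i, limit, i)))
    primes = [i for i in range(3, limit) if sieve[i]]
    assert len(primes) >= count, "increase limit"
    return primes[:count]

def stage1_candidate(n, m, exp_n, div_n, k_max=8000):
    """Evaluate Lemmas 1, 2, 4 and Eqs. (1)/(2) for k = 1..k_max; return the k
    minimizing ET_n(k) and the minimum (in the time units of exp_n and div_n)."""
    ps = odd_primes(k_max)
    best_k, best_t = None, float("inf")
    e_td = 1.0    # 1 + sum_{j=2}^{k} prod_{i=1}^{j-1} (1 - 1/p_i), in divisions (Lemma 1)
    prod = 1.0    # prod_{j=1}^{k} (1 - 1/p_j)
    for k in range(1, k_max + 1):
        if k >= 2:
            e_td += prod                  # add prod over i = 1..k-1
        prod *= 1.0 - 1.0 / ps[k - 1]
        e_ti = e_td * div_n / prod        # Lemma 2
        e_x = 0.3466 * n * prod           # Lemma 4
        if e_ti <= exp_n:
            et = e_ti + (e_x + m - 1) * exp_n    # Eq. (1)
        else:
            et = e_x * e_ti + m * exp_n          # Eq. (2)
        if et < best_t:
            best_t, best_k = et, k
    return best_k, best_t
```

With the Table 1 parameters for n = 512 (m = 6, exp_n = 5008.00 µs, div_n = 0.813 µs), the scan should land near the paper's candidate k = 5408, with an expected running time on the order of a hundred milliseconds.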
Therefore, they will be used to set the search space for the later precise analysis. For example, the search space for the first parameter setting (n = 512) will be chosen as (2000, 7000), which includes 5408, in Section IV-2.

Fig. 2. Distribution of the expected running time of Algorithm 5 for n = 512.
Fig. 3. Distribution of the expected running time of Algorithm 5 for n = 1024.

2. Stage 2: Determining Optimal k

Now, we are ready to conduct a precise analysis of the expected running time of Algorithm 5 using the exact distributions of T(Ti) and T(Fi). We first examine the distribution of T(Ti). Let r_i (1 ≤ i ≤ w) be the ith n-bit random number to be tested by trial divisions, and let r'_i (1 ≤ i ≤ x) be the ith random number that passes the trial division, i.e., T_{i−1}. Note that r'_x is r_w and is the only n-bit random number that passes both the trial division and the Fermat test. Let f denote the function such that r'_i = r_{f(i)}. For example, f(x) = w because r'_x = r_w. By convention, we define f(0) = 0. Because each T(Ti) has the same distribution, we consider T(T0) without loss of generality. Fig. 1 illustrates that T0 is composed of calls to TD(r_1, k), TD(r_2, k), …, TD(r_{f(1)}, k), and each call to TD(r_j, k) (1 ≤ j ≤ f(1)) consists of the computations r_j mod p_1, r_j mod p_2, r_j mod p_3, …, r_j mod p_{N_j} for some 1 ≤ N_j ≤ k. Let a random variable X denote the total number of divisions in T0. Then, X = N_1 + N_2 + … + N_{f(1)}. Because the time for a single division is fixed at div_n, the distribution of T(T0) is directly obtained from the distribution of X. The following theorem describes the probability distribution of X, where p = Π_{j=1}^{k} (1 − 1/p_j) is the probability that a random odd number passes all k trial divisions.

Theorem 2:
Pr{X = k + v} = Pr{f(1) = 2, X = k + v} + ⋯ + Pr{f(1) = v + 1, X = k + v}
= p × Σ_{w=2}^{v+1} Σ_{(n_1, …, n_{w−1}) s.t. n_1+⋯+n_{w−1} = v} (1/p_{n_1}) Π_{j=1}^{n_1−1} (1 − 1/p_j) × ⋯ × (1/p_{n_{w−1}}) Π_{j=1}^{n_{w−1}−1} (1 − 1/p_j)   (3)
for v ≥ 1, and
Pr{X = k} = p.   (4)

We remark that Σ_{v=0}^{∞} Pr{X = k + v} = 1. For example, we can compute Pr{X = k + 3} using
Pr{X = k + 3} = Pr{f(1) = 2, X = k + 3} + Pr{f(1) = 3, X = k + 3} + Pr{f(1) = 4, X = k + 3}
= p × (1/p_3)(1 − 1/p_1)(1 − 1/p_2)
+ p × [(1/p_1)(1/p_2)(1 − 1/p_1) + (1/p_2)(1 − 1/p_1)(1/p_1)]
+ p × (1/p_1)(1/p_1)(1/p_1).

Next, we examine the distribution of T(Fi). As explained in Section II-5,
T(Fi) ≈ mul_n × ((n − 1) + (HW(r'_i − 1) − 1)),   (5)
except for the last stage F_x, where mul_n is the time for a single Montgomery multiplication. Therefore, the distribution of T(Fi) for 1 ≤ i < x is directly obtained from the distribution of HW(r'_i − 1), which follows a binomial distribution:
Pr{HW(r'_i − 1) = v} = C(n − 2, v − 1) / 2^{n−2}   (6)
for 1 ≤ v ≤ n − 1, where C(·, ·) denotes a binomial coefficient. Similarly, the distribution for the last stage is obtained from
T(F_x) ≈ m × mul_n × ((n − 1) + (HW(r'_x − 1) − 1)).

Using the above analyses, we can obtain the probability distributions of T(Ti) and T(Fi). To be precise, using (3) and (4), we have Pr{T(Ti) = k × div_n} = p and
Pr{T(Ti) = (k + v) × div_n} = p × Σ_{w=2}^{v+1} Σ_{(n_1, …, n_{w−1}) s.t. n_1+⋯+n_{w−1} = v} (1/p_{n_1}) Π_{j=1}^{n_1−1} (1 − 1/p_j) × ⋯ × (1/p_{n_{w−1}}) Π_{j=1}^{n_{w−1}−1} (1 − 1/p_j)
for v ≥ 1. In addition, using (5) and (6), we have
Pr{T(Fi) = mul_n × ((n − 1) + (v − 1))} = C(n − 2, v − 1) / 2^{n−2}
for 1 ≤ v ≤ n − 1.

Fig. 4 shows these distributions for k = 5408, which was the candidate optimal value for n = 512 in Section IV-1. Although the distributions are discrete, we plot them using continuous curves for reader convenience.

Fig. 4. Probability distributions of T(Ti) and T(Fi) (1 ≤ i < x) for n = 512 and k = 5408.

The distribution of T(Fi) is symmetric, but that of T(Ti) is extremely skewed, although its average value is approximately the same as the expected value obtained in Section IV-1, i.e., exp_n = 5008.00. Therefore, the tendency of the running time of Algorithm 5 according to k may differ slightly from that estimated in Section IV-1, which is verified in our subsequent analyses.

To analyze the behavior of Algorithm 5 more precisely, we constructed the probability distributions of T(Ti) and T(Fi) for various k and estimated the distribution of the running time of Algorithm 5. Let Tn(k) be the running time of Algorithm 5. By the description of Algorithm 5, we see that
Tn(k) = T(T0) + Σ_{i=1}^{x−1} max(T(Ti), T(Fi)) + T(F_x),
where x is the number of candidate numbers that pass the trial division. For each k, we conducted a Monte Carlo simulation in which ten million random samples of Tn(k) are drawn following the distributions of T(Ti) and T(Fi), and the average value of Tn(k) is computed. Then, we identified the k value that minimizes this average. Because this procedure was very time-consuming, we conducted the analysis as follows. We first examined a wide range of k containing the estimated optimum determined in Section IV-1, with a step size of 100; that is, we tested k = 2000, 2100, …, 7000. For this simulation, we used a C program compiled with Microsoft Visual Studio 2010 on a machine with an Intel Core 2 Duo CPU and 4 GB RAM. The result is plotted in Fig. 5.

Fig. 5. Distribution of the expected running time of Algorithm 5 for n = 512.

On the basis of the result shown in Fig. 5, we narrowed the candidate range and tested k = 3500, 3501, …, 3800 with a step size of 1. According to the simulation result, the minimum time was 124.72 ms when k = 3586. We also conducted another Monte Carlo simulation with ten million samples for n = 1024. We first searched the range 4000 ≤ k ≤ 14000 with a step size of 200 and then 7800 ≤ k ≤ 8200 with a step size of 1. The result of the former is shown in Fig. 6. According to the simulation, the minimum expected running time of Algorithm 5 for n = 1024 was 768.66 ms when k = 7812.

Fig. 6. Distribution of the expected running time of Algorithm 5 for n = 1024.

V. IMPLEMENTATION AND EXPERIMENTAL RESULTS

We implemented our parallel prime generator (Algorithm 5) on an FPGA chip and measured its performance. This experiment was conducted independently for 512- and 1024-bit primes. For each of these two settings, the running times of Algorithm 5 were measured for various choices of k, and the optimal k that minimizes the running time was identified. These k values found in the actual measurements are compared with the values estimated in Section IV.

1. Implementation of Parallel Prime Generator on FPGA Chip

A Xilinx Virtex-4 FPGA chip was used for our experiments. Distinct hardware prime generators were implemented for 512- and 1024-bit prime generation.

Fig. 7. Block diagram of hardware prime generator.

A common block diagram of the prime generator, which is an implementation of Algorithm 5, is given in Fig. 7. The top module includes a trial division module for procedure Ti and a modular exponentiation module that performs Fermat tests for procedure Fi. The top module receives a 512- or 1024-bit odd random number R as a base number for generating a prime. It also takes as another input k, the number of small primes to be used in trial division. We embedded a sufficient number of small primes in the prime generator and let the generator use the k smallest ones for trial division according to the input k. In this way, we avoided separately synthesizing FPGAs for distinct values of k. However, we remark that in a final product, it is reasonable to fix the optimal value of k, say kopt, according to the analysis; k should not be an input to the prime generator.
In other words, the prime generator should contain exactly kopt small primes for compact hardware.

In the ith parallel stage, the modular exponentiation module performs a Fermat test procedure Fi using the output from the trial division procedure Ti–1 in the previous stage. In parallel with Fi, the trial division module performs trial division procedure Ti on R with the k smallest odd primes. If R is divisible by some prime pj (1 ≤ j ≤ k), R cannot be a prime. Therefore, the top module updates R by adding 2 to the current value and activates a new trial division test. If R is not divisible by any prime pj (1 ≤ j ≤ k), it becomes a candidate for a large prime. The top module then delivers R to the modular exponentiation module for the execution of a Fermat test procedure Fi+1, and the next parallel stage begins. At the same time, the top module also updates R by adding 2 to the current value. The updated R is used in a new trial division procedure Ti+1. The two procedures Fi+1 and Ti+1 are executed in parallel. For more details about the execution of prime generation, see Algorithm 5 and Fig. 1.

Fig. 8. Block diagram of the trial division module.

Fig. 9. Block diagram of the modular exponentiation module.

Note that the trial division processes are continuously repeated, updating R using the rule R → R + 2. The only exceptions are the following two cases. The first is the case in which a new R that is not divisible by the k smallest odd primes has been discovered, but the modular exponentiation module is still busy performing a Fermat test. In this case, the trial division module is stalled until the modular exponentiation module finishes the work in progress and becomes available to receive the new R. The second case is when the current Fermat test finds a prime.
In the latter case, all operations of the prime generator, including trial division processes, are terminated.

We now explain the operation of the sub-modules in more detail. Fig. 8 shows the data path of the trial division module. The path of k is not included in Fig. 8 since k is used only in the control logic. The trial division module divides R by the small primes supplied by the prime lists module. At first, the input R is stored in the register R, and the register Remain is initialized with the MSBs of the register R. Then, the prime in the register Divisor is subtracted from the value in the register Remain. The result of the subtraction is stored back in the register Remain after a 1-bit left shift. During the left shift, one bit selected from the unused part of the register R is inserted on the right. A division is completed by iterating these subtract-and-shift steps. If the register Remain contains only zeros after a division, i.e., R is divisible by the prime in the register Divisor, the trial division module stops and concludes that R is composite. Otherwise, another division with a new prime is performed in the same way. If R is not divisible by any of the k primes, the trial division module stops and concludes that R is possibly a prime. To reduce the consumed area, the prime lists module maintains only the differences between adjacent primes. Using the values from the prime lists module, the next prime is calculated and stored in a register Next_divisor. Then, the value is left-aligned before it is handed over to the register Divisor.

Fig. 9 shows the data path of the modular exponentiation module. Setting N ← R, the modular exponentiation module repeatedly computes A^(N–1) (mod N) with different values of A, which is initially set to 2. If A^(N–1) (mod N) = 1, the exponentiation module repeats the test with a doubled A.
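The base-doubling Fermat loop just described can be modeled in software as follows. This is a minimal sketch: Python's built-in three-argument pow stands in for the hardware's Montgomery-based exponentiation, and the repetition count m is passed as a parameter.

```python
def fermat_probable_prime(N, m):
    """Model of the modular exponentiation module's loop: repeat the Fermat
    test A^(N-1) mod N == 1 with bases A = 2, 4, 8, ... (doubled each round).
    Returns False as soon as one test fails, i.e., N is certainly composite."""
    A = 2
    for _ in range(m):
        if pow(A, N - 1, N) != 1:
            return False   # composite for sure
        A *= 2             # repeat the test with a doubled base
    return True            # N is probably prime
```

As in the hardware, a single failed test rejects the candidate immediately, so composites almost always cost just one exponentiation.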
If the equation A^(N–1) (mod N) = 1 holds in every one of a predefined number of repetitions (6 and 3 for the 512- and 1024-bit candidates, respectively), it is concluded that N is possibly a prime.

For efficient computation of a modular exponentiation using Montgomery reductions, the modular exponentiation module includes a Montgomery modular multiplication module, which performs two operations: the pre-computation operation, A' = A × 2^n (mod N), and Montgomery modular multiplications. Pre-computation is performed once at the beginning, and its result A' is stored in registers A and D. Then, Montgomery modular squarings and multiplications are iterated, and a conversion back to the standard number system is performed at the end. Depending on the operation, one of register A, register D, and the constant 1 is selected as the input B of the Montgomery multiplication module.

Fig. 10. Data path of Montgomery modular multiplication module.

The data path of the Montgomery modular multiplication module is shown in Fig. 10. In Fig. 10, (a) marks the paths used only for pre-computation, while (b) marks those used only for Montgomery modular multiplication. During the pre-computation process, the multiplicand A is stored in the sum register and shifted to the left at every round. If the shifted value is longer than n bits, either N or 2N is subtracted from each carry save adder (CSA). After repeating this process, A' = A × 2^n (mod N) is obtained. During a Montgomery multiplication, the first CSA adds a partial product, Ai × B. If the least significant bit of the first CSA result is 1, either N or –N is selected as the input of the second CSA. The result of the second CSA is shifted to the right and stored in the registers. For both processes, the 64-bit adder transforms the values in the carry and sum registers into a single number.

2.
Experimental Results

Using the prime generator implemented on the FPGA, we measured the time required for prime generation for various values of k. To be precise, we considered k = 500, 1000, 1500, …, 7000 for n = 512. To find a more accurate value of the optimal k, we used a ten-times smaller step size around the optimal value. That is, in the interval [3000, 4000], we tested k = 3000, 3050, 3100, …, 4000. For n = 1024, we set the initial step size to 1000 and considered k = 1000, 2000, 3000, …, 14000. Around the optimal k, we tested k = 7000, 7100, …, 9000 using a step size of 100. For each of these k values, we generated 100,000 random primes and computed the average time for prime generation. We also compared the measured values with the expected running times estimated by the analysis in Section IV. The results are shown in Figs. 11 and 12.

Fig. 11. Estimated running times of parallel prime generation and measured values using an FPGA chip (n = 512).

The upper part of Fig. 11 shows a rough estimation of the time required to generate a 512-bit prime based on the analysis in Section IV-1. As in Fig. 2, the lower part of Fig. 11 magnifies the region that may include the optimal k and plots the estimated values based on the analysis in Section IV-2 as well as the measured values using the FPGA implementation. Fig. 12 illustrates the same comparison for 1024-bit primes.

According to Figs. 11 and 12, the data measured using our FPGA implementation coincide with those estimated in Section IV-2, the results of the analysis using the probability distributions of T(Ti) and T(Fi). This result implies that the analysis methodology in Section IV-2 is very accurate and may be used to determine the optimal k for a given parameter setting. The data from the FPGA implementation also coincide with the range estimated through the preliminary analysis using the expected values, E[T(Ti)] and E[T(Fi)], in Section IV-1, except for the k values around the optimal points.
For example, in the 512-bit case, the three curves overlap with each other for k ≤ 3000 and k ≥ 6000. In the 1024-bit case, these curves overlap for k ≤ 6000 and k ≥ 12000.

Fig. 12. Estimated running times of parallel prime generation and measured values using an FPGA chip (n = 1024).

Table 2. Optimal values of k in various settings (*: the values obtained from the analyses in Section IV).

Note that, for those k values that are not close to the optimal value, the overall execution time of the parallel prime generation is dominated by only one operation among the trial division and the Fermat test. That is, trial divisions are dominant for large k, but the Fermat tests dominate for small values of k. In the interval of k in which only one operation is dominant, the analysis in Section IV-1 using only the expected execution times of T(Ti) and T(Fi) is already quite accurate. It should also be noted that, even in the interval where the curves do not converge on each other, the relative error is not very large. For example, in the 1024-bit case, the largest difference in the execution times is 789.6 ms (for the FPGA) – 742.9 ms (for the analysis in Section IV) = 46.7 ms, which is only 5.9% of the measured value. In the 512-bit case, the largest difference occurred when k = 5500, with a difference of 128.8 ms – 123.1 ms = 5.7 ms, only 4.4% of the measured value.

The above analysis results provide us with a guideline to determine an optimal k for a given parameter setting.
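As a concrete illustration of this guideline, one can measure mul_n and div_n, form ρ′ = mul_n/div_n, and read the optimal k off the near-linear relationship reported in Table 2 and Fig. 13. The sketch below interpolates the (ρ′, optimal k) pairs tabulated for n = 512; the function name and the interpolation step are our illustration, not part of the paper's hardware.

```python
# (rho', optimal k) pairs for n = 512, taken from Table 2.
TABLE2_N512 = [(1, 480), (2, 900), (3, 1380), (4, 1860), (8, 3586),
               (12, 5720), (16, 7160), (20, 9200), (24, 10720),
               (28, 12540), (32, 13800)]

def estimate_optimal_k(rho, table=TABLE2_N512):
    """Estimate the optimal k for a measured ratio rho' = mul_n / div_n by
    piecewise-linear interpolation over the tabulated pairs (Fig. 13 shows
    the relationship is nearly linear), clamping outside the table."""
    if rho <= table[0][0]:
        return table[0][1]
    for (r0, k0), (r1, k1) in zip(table, table[1:]):
        if rho <= r1:
            return round(k0 + (k1 - k0) * (rho - r0) / (r1 - r0))
    return table[-1][1]
```

For a ratio that appears in the table, the lookup returns the tabulated value directly; between table points it interpolates, which is reasonable given the near-linear trend in Fig. 13.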
n = 512  (m = 6, step size to search for k: 20)
  ρ′ = mul_n/div_n :    1     2     3     4      8     12     16     20     24     28     32
  Optimal k        :  480   900  1380  1860  3586*  5720   7160   9200  10720  12540  13800

n = 1024 (m = 3, step size to search for k: 50)
  ρ′ = mul_n/div_n :    1     2     3     4      8     12     16     20     24     28     32
  Optimal k        : 1000  1950  2850  3800  7812* 11350  15800  18950  22550  27850  30850

That is, it is reasonable to first decide a rough range (e.g., an order of magnitude) of the optimal k through the simpler initial analysis in Section IV-1 in order to narrow the search space. This analysis provides graphs such as those shown in the upper parts of Figs. 11 and 12. Next, we determine the exact value of the optimal k using the more accurate analysis explained in Section IV-2, which consumes more time than the analysis in Section IV-1.

VI. GENERALIZATION

According to the analyses and experiments in Sections IV and V, the optimal value of k depends only on the bit length n and the ratio ρ = exp_n / div_n, where exp_n and div_n represent the expected times for exponentiation and for division of an n-bit integer by a single-word odd prime, respectively. Therefore, to reduce the time required for tedious analyses, it is helpful to compute the recommended k for various combinations of n and ρ before its actual use. Since exp_n is computed as exp_n = mul_n × {(n − 1) + (n − 2)/2} as in Section IV-1, we may also analyze the behavior of the optimal k according to two parameters, n and ρ′ = mul_n / div_n, instead of n and ρ = exp_n / div_n. Table 2 shows the results of this pre-computation for various combinations of n and ρ′. Note that the analyses in Section IV-2 correspond to the cases in which ρ′ ≈ 8 for both n = 512 and n = 1024.

Fig. 13. Behavior of the optimal k according to ρ′.

Fig. 13 shows the relation between ρ′ and the corresponding optimal k for n = 512 and n = 1024. Interestingly, a nearly linear relationship appears in both cases.

VII.
CONCLUSION

We presented the first probabilistic analysis of the expected running time of the parallel combination of trial division and the Fermat test for generating a large prime. The expected running time was computed using only k, div_n, and exp_n, where k is the number of small odd primes used in the trial division, div_n is the time required for dividing an n-bit number by a word-sized prime, and exp_n is the time required for performing a modular exponentiation on an n-bit number. We presented a general framework to identify the optimal k that minimizes this expected running time. This framework is composed of two stages. In the first stage, a rough range for k was found through an analysis considering the expected running times of the trial division and Fermat tests independently. After narrowing the search space for k in the first stage, we identified the exact optimal value of k through a more precise analysis considering the combined probability distributions of the timings for a trial division and a Fermat test. This two-stage approach reduced the overall time to determine the optimal value of k. We also implemented 512- and 1024-bit prime generators on an FPGA chip and verified that the experimental results correspond well to the values predicted by the above two-stage framework. Motivated by this encouraging result, we finally provided recommended values of k for different parameter settings, which we believe will help other researchers reduce the time needed to determine the desired value of k for their own parameter settings.

The advantages of our prime generator over conventional hardware prime generators are twofold. First, our hardware prime generator is a complete standalone prime generator that does not need any additional hardware or software. This is because our hardware prime generator includes two primality tests, while conventional prime generators include only one primality test.
Since practical prime generation requires two or more primality tests, as mentioned in the Introduction, conventional prime generators require additional hardware or software for another primality test to be usable in practice. Second, system developers do not need to be concerned with the optimal combination of the primality tests, because this optimization was already performed when our hardware prime generator was designed and developed. With conventional prime generators, in contrast, system developers must spend considerable time and effort to achieve a sophisticated optimal combination between the prime generator and the other hardware/software module for the additional primality test.

ACKNOWLEDGMENTS

This work was supported by the research fund of the Signal Intelligence Research Center supervised by the Defense Acquisition Program Administration and the Agency for Defense Development of Korea, and by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2016-H8501-16-1008) supervised by the IITP (Institute for Information & communications Technology Promotion).

REFERENCES

[1] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Communications of the ACM, vol. 21, pp. 120–126, 1978.
[2] T. ElGamal, “A public key cryptosystem and a signature scheme based on discrete logarithms,” in Advances in Cryptology, 1985, pp. 10–18.
[3] FIPS PUB 186-2, “Digital Signature Standard (DSS),” National Institute of Standards and Technology (NIST), 2000.
[4] V. Miller, “Use of elliptic curves in cryptography,” in Advances in Cryptology—CRYPTO’85 Proceedings, 1986, pp. 417–426.
[5] N. Koblitz, “Elliptic curve cryptosystems,” Mathematics of Computation, vol. 48, pp. 203–209, 1987.
[6] Trusted Platform Module, http://www.trustedcomputinggroup.org/developers/trusted platform module.
[7] Trusted Execution Environment, http://www.globalplatform.org/mediaguidetee.asp.
[8] TrustZone, http://www.arm.com/products/processors/technologies/trustzone/index.php.
[9] S. Tuecke, V. Welch, D. Engert, L. Pearlman, and M. Thompson, “Internet X.509 public key infrastructure (PKI) proxy certificate profile,” RFC 3820 (Proposed Standard), 2004.
[10] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd ed., Cambridge, MA: MIT Press, 2009.
[11] H. C. Pocklington, “The determination of the prime or composite nature of large numbers by Fermat’s theorem,” in Proceedings of the Cambridge Philosophical Society, 1914, pp. 29–30.
[12] A. O. L. Atkin and F. Morain, “Elliptic curves and primality proving,” Mathematics of Computation, vol. 61, pp. 29–68, 1993.
[13] W. Bosma and M.-P. van der Hulst, “Faster primality testing,” in Advances in Cryptology—EUROCRYPT’89, 1990, pp. 652–656.
[14] U. M. Maurer, “Fast generation of prime numbers and secure public-key cryptographic parameters,” Journal of Cryptology, vol. 8, pp. 123–155, 1995.
[15] J. Shawe-Taylor, “Generating strong primes,” Electronics Letters, vol. 22, pp. 875–877, 1986.
[16] M. Agrawal, N. Kayal, and N. Saxena, “PRIMES is in P,” Annals of Mathematics, vol. 160, no. 2, pp. 781–793, 2004.
[17] M. O. Rabin, “Probabilistic algorithm for primality testing,” Journal of Number Theory, vol. 12, pp. 128–138, 1980.
[18] R. Solovay and V. Strassen, “A fast Monte-Carlo test for primality,” SIAM Journal on Computing, vol. 6, pp. 84–85, 1977.
[19] J. Grantham, “A probable prime test with high confidence,” Journal of Number Theory, vol. 72, pp. 32–47, 1998.
[20] D. J. Lehmann, “On primality tests,” SIAM Journal on Computing, vol. 11, pp. 374–375, 1982.
[21] OpenSSL, http://www.openssl.org.
[22] H. Park, S. K. Park, K.-R. Kwon, and D. K.
Kim, “Probabilistic analysis on finding optimal combinations of primality tests in real applications,” in Information Security Practice and Experience, Springer, 2005, pp. 74–84.
[23] H. Park and D. K. Kim, “Probabilistic analysis on the optimal combination of trial division and probabilistic primality tests for safe prime generation,” IEICE Transactions on Information and Systems, vol. 94, pp. 1210–1215, 2011.
[24] N. Koblitz, A Course in Number Theory and Cryptography, Berlin, Germany: Springer-Verlag, 1994.
[25] I. Damgaard, P. Landrock, and C. Pomerance, “Average case error estimates for the strong probable prime test,” Mathematics of Computation, vol. 61, pp. 177–194, 1993.
[26] C. Pomerance, “On the distribution of pseudoprimes,” Mathematics of Computation, pp. 587–593, 1981.
[27] D. E. Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Reading, MA: Addison-Wesley Professional, 1997.
[28] P. L. Montgomery, “Modular multiplication without trial division,” Mathematics of Computation, vol. 44, pp. 519–521, 1985.
[29] S. R. Dusse and B. S. Kaliski Jr., “A cryptographic library for the Motorola DSP56000,” in Advances in Cryptology—EUROCRYPT’90, 1991, pp. 230–244.
[30] T. Blum and C. Paar, “Montgomery modular exponentiation on reconfigurable hardware,” in Proceedings of the 14th IEEE Symposium on Computer Arithmetic, 1999, pp. 70–77.
[31] D. M. Gordon, “A survey of fast exponentiation methods,” Journal of Algorithms, vol. 27, pp. 129–146, 1998.
[32] C. K. Koc, “High-speed RSA implementation,” Technical Report, RSA Laboratories, 1994.
[33] G. L. Miller, “Riemann’s hypothesis and tests for primality,” Journal of Computer and System Sciences, vol. 13, pp. 300–317, 1976.

APPENDIX. PROOF OF LEMMAS AND THEOREMS

Proof of Lemma 1. Let T(TD(r, k)) denote the running time of trial division on a specific n-bit random number r with the k smallest odd primes.
Let $D_j$ be the event of dividing r by the jth smallest odd prime $p_j$ for $1 \le j \le k$ in Algorithm 1, and let $\Pr\{D_j\}$ be the probability of $D_j$. Let div(r, j) be the time required to divide r by $p_j$. Then the expected value of T(TD(r, k)) can be represented as follows:

$E[T(TD(r, k))] = \Pr\{D_1\} \cdot div(r, 1) + \cdots + \Pr\{D_k\} \cdot div(r, k) = \sum_{j=1}^{k} \Pr\{D_j\} \cdot div(r, j).$   (7)

Since we divide r by $p_j$ in trial division if and only if r is not divisible by any prime up to $p_{j-1}$, $\Pr\{D_j\}$ is as follows:

$\Pr\{D_j\} = \left(1 - \frac{1}{p_1}\right)\left(1 - \frac{1}{p_2}\right) \cdots \left(1 - \frac{1}{p_{j-1}}\right) = \prod_{i=1}^{j-1}\left(1 - \frac{1}{p_i}\right)$   (8)

for $j \ge 2$, and $\Pr\{D_1\} = 1$. Using (7) and (8), we get the following equation:

$E[T(TD(r, k))] = div(r, 1) + \sum_{j=2}^{k} \prod_{i=1}^{j-1}\left(1 - \frac{1}{p_i}\right) \cdot div(r, j).$   (9)

Now, we generalize E[T(TD(r, k))] to a value independent of a specific r. It is commonly accepted that div(r1, j) ≈ div(r2, j) for any two n-bit random numbers r1 and r2. Furthermore, div(r1, i) ≈ div(r1, j) for 1 ≤ i, j ≤ k because the odd primes used in the trial division are so small that each of them can be stored in a single word. Thus, we may use div_n instead of div(r, j). Then, equation (9) can be rewritten as follows:

$E[T(TD(r, k))] = \left(1 + \sum_{j=2}^{k} \prod_{i=1}^{j-1}\left(1 - \frac{1}{p_i}\right)\right) div_n.$   (10)

Since the right-hand side of (10) does not depend on the value of r but only on the bit length of r, which is n, E[T(TD(r, k))] can be replaced with a more general notation, $E[T_{n,k}(TD)]$, which proves the lemma.

Proof of Lemma 2. Let $r_i$ ($1 \le i \le w$) be the ith n-bit random number to be tested by trial divisions in Algorithm 5, and let $r_i'$ ($1 \le i \le x$) be the ith random number that passes the trial division. Note that $r_x'$ is $r_w$ and is the only n-bit random number that passes both trial division and the Fermat test. As already defined, let f denote the function such that $r_i' = r_{f(i)}$. For example, $f(x) = w$ because $r_x' = r_w$. By convention, we define $f(0) = 0$. Then, procedure $T_i$ contains all trial divisions on $r_j$'s for $f(i) < j \le f(i+1)$. Because the individual trial division time for each $r_j$ is already computed in Lemma 1, we estimate the expected value of $T(T_i)$ by computing the expected value of $f(i+1) - f(i)$, which is the number of trial divisions in $T_i$. Because the running time of the trial division on an n-bit random number and the value of $f(i) - f(i-1)$ are independent, $E[T(T_i)]$ can be represented as $E[T(T_i)] = E[T_{n,k}(TD)] \times E[f(i+1) - f(i)]$. Because the probability of executing a Fermat test is $(1 - 1/p_1)(1 - 1/p_2) \cdots (1 - 1/p_k)$, we obtain

$E[f(i+1) - f(i)] = 1 \Big/ \prod_{j=1}^{k}\left(1 - \frac{1}{p_j}\right).$

Then, by Lemma 1, we get

$E[T(T_i)] = \left(1 + \sum_{j=2}^{k} \prod_{i=1}^{j-1}\left(1 - \frac{1}{p_i}\right)\right) div_n \Big/ \prod_{j=1}^{k}\left(1 - \frac{1}{p_j}\right).$

Proof of Lemma 3. We first consider the expected running time of FT(r), i.e., the time to compute $a^{r-1} \bmod r$ for an n-bit random number r and a single-word integer a. Note that the running time of a modular exponentiation (Algorithm 4) is the product of the number of Montgomery multiplications and the running time of each Montgomery multiplication. Because the complexity of a Montgomery multiplication, mul_n, is determined by the bit length n of the modulus r, we estimate the expected running time of a modular exponentiation as $exp_n = mul_n \times \{(n-1) + (n-2)/2\}$, according to the estimation of the expected number of multiplications in Section II-5. We are interested in the number of calls to FT(r) in procedure $F_i$. We consider two separate cases. First, if r is prime, then FT(r) is called m times.
On the other hand, if r is composite, it is very likely that the number of calls is 1, because the probability that a composite number passes even a single call of FT(r) is very low, as described in Section II. Consequently, $E[T(F_i)] = exp_n$ if r is composite, and $E[T(F_i)] = m \cdot exp_n$ if r is prime. Therefore, the lemma holds.

Proof of Lemma 4. The probability that a random odd number is a prime is 1/(0.3466n), as explained in Section II. Thus, the probability that each $r_i'$ is a prime is $1 \big/ \left(0.3466\,n \cdot \prod_{j=1}^{k}(1 - 1/p_j)\right)$ for $1 \le i \le x$, because $r_i'$ already passed a trial division test using the k smallest odd primes. Since the expected number of primes after the E[x]th iteration is 1, we determine that $E[x] \big/ \left(0.3466\,n \cdot \prod_{j=1}^{k}(1 - 1/p_j)\right) = 1$ holds, which proves the lemma.

Proof of Theorem 1. The body of (2) is easily derived from (1) by replacing $E[T(T_i)]$ with $\left(1 + \sum_{j=2}^{k} \prod_{i=1}^{j-1}(1 - 1/p_i)\right) div_n \big/ \prod_{j=1}^{k}(1 - 1/p_j)$ in accordance with Lemma 2, and E[x] with $0.3466\,n \cdot \prod_{j=1}^{k}(1 - 1/p_j)$ according to Lemma 4. The reason that the conditions $E[T(T_i)] \le exp_n$ and $E[T(T_i)] \ge exp_n$ in (1) can be replaced by $k \le k_0$ and $k \ge k_0$, respectively, is described below. First, $E[T(T_i)]$ is monotonically increasing as k increases because $\sum_{j=2}^{k} \prod_{i=1}^{j-1}(1 - 1/p_i)$ in the numerator increases monotonically, and the denominator, $\prod_{j=1}^{k}(1 - 1/p_j)$, decreases monotonically. Therefore, $E[T(T_i)] \le exp_n$ if and only if $k \le k_0$, and $E[T(T_i)] \ge exp_n$ if and only if $k \ge k_0$.

Proof of Theorem 2. We analyze the probability distribution of X by classifying the cases according to the total number of calls to TD in $T_0$, i.e., f(1).

[Case 1: f(1) = 1, i.e., $X = N_1$] In this case, $r_1$ passes the trial division test and is labeled $r_1'$. Because $r_1$ should not be divisible by any $p_j$ for $1 \le j \le k$, the probability that f(1) = 1 is $p = \prod_{j=1}^{k}(1 - 1/p_j)$. Because this is the only case in which X = k, we obtain $\Pr\{X = k\} = \Pr\{N_1 = k,\ p_k \nmid r_1\} = p$.

[Case 2: f(1) = 2, i.e., $X = N_1 + N_2$] In this case, $r_1$ does not pass the trial division test, but $r_2$ does, so it is obvious that $N_2 = k$. However, there are k possible cases for $N_1$, i.e., $N_1 = 1, 2, \ldots, k$. Their probabilities are $\Pr\{N_1 = v,\ p_v \mid r_1\} = (1/p_v) \times \prod_{j=1}^{v-1}(1 - 1/p_j)$ for $v = 1, 2, \ldots, k$, and we verify that $\sum_{v=1}^{k} \Pr\{N_1 = v,\ p_v \mid r_1\} = 1 - p = 1 - \Pr\{N_1 = k,\ p_k \nmid r_1\}$, where $p_v \mid r_1$ means that $p_v$ divides $r_1$, i.e., $r_1 \bmod p_v = 0$. Then, we obtain

$\Pr\{f(1) = 2,\ X = k + v\} = \Pr\{N_1 = v,\ p_v \mid r_1\} \times \Pr\{N_2 = k,\ p_k \nmid r_2\} = p \times \frac{1}{p_v} \prod_{j=1}^{v-1}\left(1 - \frac{1}{p_j}\right).$

[Case 3: f(1) = w (w ≥ 3), i.e., $X = N_1 + N_2 + \cdots + N_w$] This general case implies that $r_1$ through $r_{w-1}$ do not pass the trial division test, but $r_w$ does, so $N_w = k$. Now, in order to compute the probability $\Pr\{f(1) = w,\ X = k + v\}$, we have to examine all possible combinations of $N_1, N_2, \ldots, N_{w-1}$ that make the sum $N_1 + N_2 + \cdots + N_{w-1} = v$. As a result, we obtain

$\Pr\{f(1) = w,\ X = k + v\} = \sum_{\substack{(n_1, \ldots, n_{w-1})\ \mathrm{s.t.}\\ n_1 + \cdots + n_{w-1} = v}} \Pr\{N_1 = n_1,\ p_{n_1} \mid r_1\} \times \cdots \times \Pr\{N_{w-1} = n_{w-1},\ p_{n_{w-1}} \mid r_{w-1}\} \times \Pr\{N_w = k,\ p_k \nmid r_w\}$
$= p \times \sum_{\substack{(n_1, \ldots, n_{w-1})\ \mathrm{s.t.}\\ n_1 + \cdots + n_{w-1} = v}} \frac{1}{p_{n_1}} \prod_{j=1}^{n_1-1}\left(1 - \frac{1}{p_j}\right) \times \cdots \times \frac{1}{p_{n_{w-1}}} \prod_{j=1}^{n_{w-1}-1}\left(1 - \frac{1}{p_j}\right).$

In fact, this generalized equation also holds for w = 2. Merging all cases, we obtain (3) and (4).

Dong Kyue Kim received the B.S., M.S., and Ph.D. degrees in Computer Engineering from Seoul National University in 1992, 1994, and 1999, respectively. From 1999 to 2005, he was an assistant professor in the Division of Computer Science and Engineering at Pusan National University. He is currently a full professor in the Department of Electronic Engineering at Hanyang University, Korea. His research interests are in the areas of security SoC (System on Chip), crypto-coprocessors, and information security.
Piljoo Choi received the B.S. and M.S. degrees in Electronic Engineering from Hanyang University in 2010 and 2012, respectively. He is currently a Ph.D. candidate in the Department of Electronic Engineering at Hanyang University, Korea. His research interests are in the areas of security SoC (System on Chip), crypto-coprocessors, and information security.

Mun-Kyu Lee received the B.S. and M.S. degrees in Computer Engineering from Seoul National University in 1996 and 1998, respectively, and the Ph.D. degree in Electrical Engineering and Computer Science from Seoul National University in 2003. From 2003 to 2005, he was a senior engineer at the Electronics and Telecommunications Research Institute, Korea. He is currently a professor in the Department of Computer and Information Engineering at Inha University, Korea. His research interests are in the areas of cryptographic algorithms, information security, and theory of computation.

Heejin Park received the B.S., M.S., and Ph.D. degrees in Computer Engineering from Seoul National University in 1994, 1996, and 2001, respectively. From 2001 to 2002, he worked as a post-doctoral researcher in the Department of Computer Engineering at Seoul National University. In 2003, he was a research professor at Ewha Womans University. He is currently a professor in the Department of Computer Science and Engineering at Hanyang University, Korea. His research interests are in the areas of cryptography, information security, and computer algorithms.