JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.16, NO.5, OCTOBER, 2016
http://dx.doi.org/10.5573/JSTS.2016.16.5.564
ISSN(Print) 1598-1657
ISSN(Online) 2233-4866
Design and Analysis of Efficient Parallel Hardware
Prime Generators
Dong Kyue Kim1, Piljoo Choi1, Mun-Kyu Lee2, and Heejin Park3
Abstract—We present an efficient hardware prime
generator that generates a prime p by combining trial
division and Fermat test in parallel. Since the
execution time of this parallel combination is greatly
influenced by the number k of the smallest odd
primes used in the trial division, it is important to
determine the optimal k to create the fastest parallel
combination. We present probabilistic analysis to
determine the optimal k and to estimate the expected
running time for the parallel combination. Our
analysis is conducted in two stages. First, we roughly
narrow the range of optimal k by using the expected
values for the random variables used in the analysis.
Second, we precisely determine the optimal k by using
the exact probability distribution of the random
variables. Our experiments show that the optimal k
and the expected running time determined by our
analysis are precise and accurate. Furthermore, we
generalize our analysis and propose a guideline for a
designer of a hardware prime generator to determine
the optimal k by simply calculating the ratio of M to D,
where M and D are the measured running times of a
modular multiplication and an integer division,
respectively.
Index Terms—Performance analysis, digital integrated
circuits, prime number, public key cryptosystem,
information security
Manuscript received Oct. 22, 2015; accepted May 19, 2016
1 Dept. of Electronic Engineering, Hanyang University, 222 Wangsimni-ro, Seongdong-gu, Seoul 04763, Korea
2 Dept. of Computer and Information Engineering, Inha University, 100 Inha-ro, Nam-gu, Incheon 22212, Korea
3 Dept. of Computer Science and Engineering, Hanyang University, 222 Wangsimni-ro, Seongdong-gu, Seoul 04763, Korea
E-mail : [email protected]
I. INTRODUCTION
Large primes are the basis of modern cryptography
and are widely used in standard public key cryptosystems
such as RSA [1], ElGamal [2], Digital Signature
Standard (DSS) [3], and elliptic curve cryptosystems
(ECC) [4, 5]. In order to maintain the security of these
cryptosystems against advanced attacks to the underlying
number theoretic problems such as integer factorization
and discrete logarithm, even larger primes are required.
For example, while the early RSA used 512-bit keys, the
recommended key length was gradually increased to
1024 and 2048 bits, which require 512- and 1024-bit
prime numbers, respectively. As a result, the key-pair
generation procedure of RSA requires heavy
computation, most of which is dedicated to the
generation of random prime numbers.
One obvious way to avoid heavy computation in a
security product is to use pre-generated key pairs. That is,
after generation of a private key and its corresponding
public key in a separate prime generator external to the
product, the private key may be injected into the product
during manufacturing. However, this approach is very
insecure because private keys outside the product may be
exposed to unauthorized attackers. Another limitation of
this approach is that it cannot change key values even
when compromised. Due to these limitations, there are
many applications that require a prime generator within
the security product itself. Among them is the well-known Trusted Platform Module (TPM) [6], which is a
hardware module fixed to a platform that protects user
data and verifies the current state of local or remote
platforms. It has a key generator as an essential element
for public-key encryption and digital signature. In
addition, the Trusted Execution Environment (TEE) [7],
in particular, its most promising implementation in
mobile devices, ARM TrustZone [8], executes all
security-related functions with its own key isolated from
the outside world. Finally, recent advances in the Internet
of Things (IoT) technologies require every IoT device to
embed stand-alone security features, which include
functionalities for secure key generation. In these
environments, a hardware prime generator embedded in
the device is inevitable. Moreover, many advanced
applications such as proxy certificates [9] require
frequent generation of fresh key pairs.
In general, prime generation is an iterative procedure
consisting of the following two steps.
1) Generate an odd random integer, r.
2) Determine if r is prime. If so, return r. Otherwise,
go to step 1).
Since the primality test in step 2) is very time-consuming, researchers have performed extensive
research to improve the speed of the primality test. The
primality test algorithms are divided into two categories:
deterministic and probabilistic. A deterministic primality
test certifies that a random number is a prime with
probability 1. In other words, if a random number passes
the primality test, it is certainly a prime. Trial division
[10], which is a popular deterministic primality test,
divides a random number r by primes up to √r. Other
deterministic primality tests include Pocklington's test
[11], its elliptic curve analogue [12], the Jacobi sum test
[13], Maurer’s algorithm [14], Shawe-Taylor’s algorithm
[15], and the AKS test [16]. A probabilistic primality test
certifies that a random number is a prime with a
probability very close to 1, such as 1 − 1/2^s for a large s.
Probabilistic primality tests include the Fermat test [10],
the Miller-Rabin test [4, 17], the Solovay-Strassen test
[18], the Frobenius-Grantham test [19], and the Lehmann
test [20].
To enhance the speed of prime generation, it is a
common practice in real applications to simultaneously
use two or more primality tests. The most popular
combination is to combine trial division and a
probabilistic primality test such as the Fermat test, as in
OpenSSL [21]. In this combination, the trial division
divides a random number by k smallest primes. Then, the
candidate number, which is not divisible by any of the k
primes, is subjected to either the Fermat or Miller-Rabin
test. Since k, the number of primes used in the trial
division stage, affects the overall performance,
identifying the optimal k is a crucial factor in designing
the combined algorithm. For software prime generators,
this problem was analyzed in [14] for primes and in [22,
23] for safe primes. Given the times required for the
main operations of trial division and the Fermat test, i.e.,
integer division and modular exponentiation, respectively,
the running time of the sequential combination of trial
division and the Fermat test can be estimated according
to the value of k, and the optimal k that minimizes the
running time can be determined.
In this paper, we focus on examining different
situations for a hardware prime generator. The trial
division and the Fermat tests implemented in hardware
not only show much better performance than their
software counterparts, but they can also run in parallel.
These advantages may significantly enhance the overall
performance of the prime generator. Since p is generated
by hardware, the value k is hardwired in the circuit and
thus the optimal k should be determined at the design
stage of the hardware prime generator. However, the
method to compute the expected running time and the
optimal k should be completely different from what was
given for the sequential case [14, 22, 23].
We provide the solution to this problem and propose a
guideline to develop hardware prime generators. First,
we present a parallel prime generation algorithm
combining trial division and the Fermat test. Next, we
propose a framework to identify the optimal value of k,
i.e., the number of primes used in trial division, and
analyze the minimum expected running time of a parallel
prime generator. Our analysis is conducted in two stages.
In the preliminary analysis stage, we determine a rough
range (such as the order of magnitude) that includes the
optimal k. For this purpose, we only consider the
expected values for the random variables that contribute
to the running time. The result of this stage is used to
narrow the search space for the second stage, i.e., the
main analysis stage. In this second stage, we consider the
exact probability distributions of random variables for a
more precise analysis. We also verify the validity and
accuracy of our analysis framework through an FPGA
implementation of 512- and 1024-bit prime generators.
According to our discovery and the experimental results,
the main factors that determine the overall performance
of prime generation are (i) the bit-length of a target prime
number, denoted by n, (ii) the number of primes used in
the trial division, denoted by k, and (iii) the ratio of the
expected running time of a modular exponentiation to
that of an integer division, denoted by ρ . As our final
contribution, we provide the recommended values of k
according to ρ when n = 512 and n = 1024.
The remainder of the paper is organized as follows.
Section II provides some preliminary information
including primality tests and modular exponentiation
methods. Section III presents the explicit parallel
algorithm for prime generation. Section IV provides the
theoretical analysis to determine the optimal k and the
fastest running time for prime generation. In Section V,
we present the experimental results for an FPGA
implementation of 512- and 1024-bit prime generators
and compare them with the estimation in Section IV.
Section VI generalizes the results to the cases with
various ratios ρ . Section VII concludes the paper with
some remarks. The proofs of lemmas and theorems that
are not given in Section IV are provided in the appendix.
II. PRELIMINARIES

1. The Probability of a Random Number being Prime

It is well known that the probability of a random number r being prime is 1/ln r [24, p.11]. If r is odd, the probability is doubled to 2/ln r. Thus, the probability of an n-bit odd random number being prime is 2/ln 2^n, which is 1/(0.3466n). For example, the probability of a 512-bit random odd number being prime is 1/(0.3466 · 512) ≈ 1/177.46, and that of a 1024-bit random odd number is 1/(0.3466 · 1024) ≈ 1/354.92.

2. Trial Division

In a trial division for a random number r, r is divided by primes up to √r. If r is not divisible by any such prime, it is prime. Otherwise, it is a composite number. The trial division may be modified to divide r by only the k smallest odd primes. We denote the k smallest odd primes by p1 < p2 < … < pk throughout this paper. This modified trial division cannot be used to certify that r is prime, but it is sufficient to show that r is composite if r has a small factor that is not larger than pk. Algorithm 1 presents our trial division procedure. When another algorithm calls Algorithm 1 as its subroutine, this subroutine call will be denoted as TD(r, k).

Algorithm 1 Trial Division
Input: r, k
Output: PRIME or COMPOSITE
1: for j = 1 to k do
2:   if r mod pj = 0 then
3:     return COMPOSITE
4:   end if
5: end for
6: return PRIME
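For readers who want to experiment with these parameters in software, the following minimal Python sketch mirrors TD(r, k). The helper names and the naive prime-table generation are our own illustration, not part of the hardware design described later.

def first_odd_primes(k):
    """Return the k smallest odd primes p1 < p2 < ... < pk (naive; use a sieve for large k)."""
    primes, c = [], 3
    while len(primes) < k:
        if all(c % p != 0 for p in primes):
            primes.append(c)
        c += 2
    return primes

def trial_division(r, primes):
    """TD(r, k): COMPOSITE if any of the k smallest odd primes divides r."""
    for p in primes:
        if r % p == 0:
            return "COMPOSITE"
    return "PRIME"   # more precisely: r is not divisible by p1, ..., pk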
3. Fermat Test

The Fermat test for a random number r determines whether or not r is composite using Fermat's theorem [10, p.967]. The precise operation is shown in Algorithm 2.

Algorithm 2 Fermat Test [10, p.967]
Input: integer r
Output: PRIME or COMPOSITE
1: Generate a random number a such that 2 ≤ a ≤ r − 1
2: if a^{r−1} mod r ≠ 1 then
3:   return COMPOSITE
4: end if
5: return PRIME
If a^{r−1} mod r ≠ 1, then r is definitely composite according to Fermat's theorem, which is addressed by lines 2 to 4 in Algorithm 2. Otherwise, r is a probable prime number, which means that line 5 in Algorithm 2 can produce errors. However, if we repeat the test for several random a's, the probability that Algorithm 2 produces an error is reduced. When another algorithm calls Algorithm 2 as its subroutine, this subroutine call will be denoted as FT(r).
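A corresponding software sketch of FT(r) is given below. It uses Python's built-in modular exponentiation in place of the hardware datapath described later, and the optional rounds parameter is our own convenience for repeating the test with different bases.

import random

def fermat_test(r, rounds=1):
    """FT(r): COMPOSITE if some base a witnesses a^(r-1) mod r != 1; otherwise probable prime."""
    for _ in range(rounds):
        a = random.randrange(2, r)        # 2 <= a <= r - 1, as in Algorithm 2
        if pow(a, r - 1, r) != 1:
            return "COMPOSITE"
    return "PRIME"

Running trial_division and fermat_test one after the other reproduces the kind of sequential software combination mentioned in the Introduction for OpenSSL; the parallel hardware combination is the subject of Section III.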
4. Miller-Rabin Test
We cannot entirely eliminate all errors by simply
repeating Fermat tests with different base numbers
because there exist composite integers called Carmichael
numbers that always pass the Fermat test for any base a
with gcd(a, r) = 1. The Miller-Rabin test, which is a
modification of the Fermat test, can overcome this
problem by additionally determining if there is a
nontrivial square root (NSR) of 1 mod r as follows:
Algorithm 3 Miller-Rabin Test [10, p.969]
Input: integer r
Output: PRIME or COMPOSITE
1: Find integers l and q with l > 0 and q odd, so that r − 1 = 2^l · q
2: Generate a random number a such that 2 < a < r − 1
3: x0 ← a^q mod r
4: for i = 1 to l do
5:   xi ← x_{i−1}^2 mod r
6:   if xi = 1 and x_{i−1} ≠ 1 and x_{i−1} ≠ r − 1 then
7:     return COMPOSITE // An NSR is found
8:   end if
9: end for
10: if xl ≠ 1 then
11:   return COMPOSITE // Fermat test
12: end if
13: return PRIME
Similar to the Fermat test, if r passes the Miller-Rabin
test, it has a very high probability of being prime.
Damgard et al. [25] showed that 6 iterations of the
Miller-Rabin test with different values of a are sufficient
to certify that a 512-bit number r is prime, and 3
iterations of the Miller-Rabin test are sufficient to certify
that a 1024-bit number r is prime. However, it should be
noted that there are very few Carmichael numbers, and
most composite numbers are covered by the original
Fermat test, which does not consider NSRs. For example,
if r is a 512-bit number and 2^{r−1} ≡ 1 (mod r), the
probability that r is composite is less than 10^{−20} due to
Pomerance [26]. Therefore, the Fermat test is used
instead of the Miller-Rabin test in many real-life
programs such as OpenSSL [21]. In the present paper, we
do not consider the Miller-Rabin test and instead use
only the Fermat test, which greatly simplifies our
procedure.
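Although our generator relies on the Fermat test only, a compact Python rendering of Algorithm 3 is included below for comparison. It follows the pseudocode above; the rounds parameter (6 for 512-bit and 3 for 1024-bit candidates, following Damgard et al. [25]) is left to the caller in this sketch.

import random

def miller_rabin(r, rounds=1):
    """Miller-Rabin test for an odd integer r > 3 (Algorithm 3, repeated `rounds` times)."""
    l, q = 0, r - 1
    while q % 2 == 0:                    # write r - 1 = 2^l * q with q odd
        l += 1
        q //= 2
    for _ in range(rounds):
        a = random.randrange(2, r - 1)
        x = pow(a, q, r)                 # x0 = a^q mod r
        for _ in range(l):
            x_prev, x = x, (x * x) % r   # xi = x_{i-1}^2 mod r
            if x == 1 and x_prev != 1 and x_prev != r - 1:
                return "COMPOSITE"       # a nontrivial square root of 1 was found
        if x != 1:
            return "COMPOSITE"           # fails the Fermat condition
    return "PRIME"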
5. Computation Method of Modular Exponentiation
The dominant operation in a Fermat test is modular
exponentiation, i.e., the computation of a^{r−1} mod r. There are many well-known algorithms for modular exponentiation; we use one of the simplest and most widely used, the binary method [27], in conjunction with the Montgomery number system [28]. In the following algorithm, MontMult(x, y, l, N) represents a Montgomery multiplication, i.e., x × y × (2^l)^{−1} mod N.
Algorithm 4 Binary Montgomery Modular Exponentiation
Input: a, b, N, l
Output: a^b mod N
1: c ← a × 2^l mod N // Conversion of a to the Montgomery system
2: Let <b_{n−1} b_{n−2} … b_0> be the binary representation of b, where b_{n−1}, the most significant bit, is 1
3: d ← c
4: for i = n−2 to 0 do
5:   d ← MontMult(d, d, l, N) // squaring
6:   if bi = 1 then
7:     d ← MontMult(d, c, l, N) // multiplication
8:   end if
9: end for
10: return MontMult(d, 1, l, N) // Conversion back to the standard integral number system
Although there is some overhead in lines 1 and 10 to
convert numbers from the standard to Montgomery
number system and vice versa, it is negligible. Therefore,
the dominant part in Algorithm 4 is the main loop (lines
4 to 9). In each iteration of this loop, two Montgomery
multiplications are required if bi = 1. Otherwise, only one
Montgomery multiplication is required. Therefore, the
number of multiplications in the main loop is (n – 1) +
(HW(b) – 1), where HW(b) is the Hamming weight of b,
i.e., the number of 1’s in the binary representation of b.
When we apply Algorithm 4 to compute a^{r−1} mod r in
the Fermat test, it is guaranteed that r − 1 is an n-bit even
number, which implies that the least significant bit in the
exponent is 0. Considering that each of the other bits in r
− 1 has a 1/2 probability of being 1, we can estimate
that the main loop requires an average of (n – 1) + (n –
2)/2 multiplication operations.
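As a software-level illustration of Algorithm 4, the Python sketch below implements MontMult arithmetically (R = 2^l and the precomputed constant n′ = −N^{−1} mod R are our own notation) followed by the binary exponentiation loop; it does not model the CSA-based datapath of the actual hardware described in Section V.

def mont_mult(x, y, N, R, n_prime):
    """MontMult: x * y * R^{-1} mod N, for odd N, R = 2^l > N, and x, y < N."""
    t = x * y
    m = (t * n_prime) % R
    u = (t + m * N) // R                 # t + m*N is exactly divisible by R
    return u - N if u >= N else u

def mont_exp(a, b, N):
    """Binary Montgomery modular exponentiation: a^b mod N (b >= 1, N odd)."""
    l = N.bit_length()
    R = 1 << l
    n_prime = (-pow(N, -1, R)) % R       # -N^{-1} mod R (Python 3.8+)
    c = (a * R) % N                      # line 1: convert a to the Montgomery domain
    d = c
    for i in range(b.bit_length() - 2, -1, -1):     # lines 4-9: MSB of b already consumed
        d = mont_mult(d, d, N, R, n_prime)          # squaring
        if (b >> i) & 1:
            d = mont_mult(d, c, N, R, n_prime)      # multiplication
    return mont_mult(d, 1, N, R, n_prime)           # line 10: convert back

# Sanity check: 561 = 3*11*17 is a Carmichael number, so 7^560 mod 561 = 1.
assert mont_exp(7, 560, 561) == pow(7, 560, 561) == 1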
III. PARALLEL PRIME GENERATION:
PARALLEL COMBINATION OF TRIAL
DIVISIONS AND FERMAT TESTS
In this section, we present a procedure to generate a
random prime number that performs trial divisions and
Fermat tests in parallel. Our procedure is shown in
Algorithm 5.
Algorithm 5 Parallel Prime Generation
Input: n, k, m // k: number of primes used for trial division, m: number of iterations of Fermat tests
Output: n-bit prime number
1: Generate a random n-bit odd integer r
2: // Initial trial division: Procedure T0
3: loop
4:   if TD(r, k) = COMPOSITE then
5:     r ← r + 2
6:   else
7:     break // r passes the initial trial division test
8:   end if
9: end loop
10: r′_1 ← r
11: i ← 0
12: fstate ← DONE
13: tstate ← DONE
14: loop
15:   loop
16:     if fstate = DONE and tstate = DONE then
17:       break
18:     end if
19:   end loop // Synchronization of Ti and Fi
20:   i ← i + 1
21:   // Procedure Ti and Procedure Fi are performed in parallel
22:   procedure Ti do
23:     tstate ← BUSY
24:     r ← r′_i + 2
25:     loop
26:       if TD(r, k) = COMPOSITE then
27:         r ← r + 2
28:       else
29:         break // r passes the trial division test
30:       end if
31:     end loop
32:     r′_{i+1} ← r // r is reserved for the next period of Fermat tests
33:     tstate ← DONE
34:   end procedure
35:   procedure Fi do
36:     fstate ← BUSY
37:     for j = 1 to m do
38:       fresult ← FT(r′_i)
39:       if fresult = COMPOSITE then
40:         break
41:       end if
42:     end for
43:     if fresult = PRIME then
44:       return r′_i
45:     end if
46:     fstate ← DONE
47:   end procedure
48: end loop
Fig. 1 shows the overall structure of Algorithm 5.
Beginning with a random odd integer r generated by line
1 in Algorithm 5, we perform trial division tests (lines 3
to 9), TD(r, k), TD(r + 2, k), TD(r + 4, k), … until we
find a candidate number, r1′ , which is not divisible by
any prime up to pk. We denote this initial trial division
period as procedure T0. Then, using this candidate
number, we conduct a Fermat test denoted by procedure
F1, which corresponds to lines 35 to 47 for iteration i = 1.
While the Fermat test is being performed, we also
conduct additional trial division tests beginning with
r1′ + 2 (lines 22 to 34 with iteration i = 1). To be precise,
TD( r1′ + 2 , k), TD( r1′ + 4 , k), TD( r1′ + 6 , k), … are
performed until we find another candidate number, r2' ,
that is not divisible by any prime up to pk. We denote this
trial division period as procedure T1. Note that
procedures T1 and F1 are executed in parallel. The
subsequent parallel procedures are executed in a similar
manner. That is, Ti and Fi are executed in parallel during
the ith iteration of the main loop (lines 14 to 48) of
Algorithm 5. Even though Ti and Fi are executed in
parallel, their completion times may not coincide with
each other. For example, a Fermat test may be completed
earlier as in iteration 1 in Fig. 1, or a trial division period
may be completed earlier as in iteration 2.

Fig. 1. Example execution of Algorithm 5 with up to two iterations of the main loop.

In the former case, procedure Fi+1 cannot start before procedure Ti has
been completed because a new candidate number for the
Fermat test is not available yet. To resolve this issue and
simplify the hardware architecture, we synchronize the
starting times of Ti and Fi by waiting until both
procedures Ti–1 and Fi –1 are completed (lines 15 to 19).
Finally, Algorithm 5 terminates when procedure Fi failed
to find any evidence that ri′ is composite within m
iterations of the Fermat test. In this case, we conclude
that ri′ is prime with very high probability.
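To make the control flow of Algorithm 5 concrete, the sketch below emulates the Ti/Fi pipeline with a two-worker thread pool in Python, reusing the first_odd_primes, trial_division and fermat_test sketches from Section II. Unlike the FPGA, it only mimics the scheduling; CPython threads do not provide a real parallel speed-up for this kind of computation.

import random
from concurrent.futures import ThreadPoolExecutor

def next_candidate(r, primes):
    """Procedure Ti: advance r by 2 until it passes the trial division."""
    while trial_division(r, primes) == "COMPOSITE":
        r += 2
    return r

def parallel_prime_generation(n, k, m):
    """Algorithm 5: overlap the Fermat tests (Fi) with the search for the next candidate (Ti)."""
    primes = first_odd_primes(k)
    r = random.getrandbits(n) | 1 | (1 << (n - 1))   # random n-bit odd integer
    candidate = next_candidate(r, primes)            # procedure T0
    with ThreadPoolExecutor(max_workers=2) as pool:
        while True:
            fi = pool.submit(fermat_test, candidate, m)    # procedure Fi
            nxt = next_candidate(candidate + 2, primes)    # procedure Ti, run concurrently
            if fi.result() == "PRIME":                     # synchronization of Ti and Fi
                return candidate
            candidate = nxt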
IV. ANALYSIS ON EXPECTED RUNNING TIME
OF PARALLEL PRIME GENERATION
In this section, we analyze the expected running time
of Algorithm 5 according to k, the number of odd primes
used for trial division, and n, the bit length of the prime
number to be generated. Then, we estimate the optimal
value of k that minimizes the running time for given n.
As observed in the previous section, the running time of
iteration i ( i ≥ 1) of the main loop of Algorithm 5 is
max(T(Ti), T(Fi)), where T(Ti) and T(Fi) denote the
running times of Ti and Fi, respectively. Therefore, for a
precise analysis to find the optimal value of k, we must
consider the distributions of T(Ti) and T(Fi) together.
However, for a more effective analysis, we adopt a two-stage approach. In the first stage, we find the expected
values of T(Ti) and T(Fi) independently, and assume that
all Ti and Fi consume these times, respectively. This
assumption is used to determine a rough range (order of
magnitude) of k that should include the optimal value. In
the second stage, we conduct a precise analysis
considering the distributions of T(Ti) and T(Fi). We
establish a probability distribution on the running time of
Algorithm 5, and find k that minimizes it. We search for
the optimal k in the search space identified in the first
stage. This two-stage approach reduces the overall time
to identify the optimal value of k.
1. Stage 1: Determining Candidate Range for k
In the first stage, we consider only the expected values
of T(Ti) and T(Fi), ignoring their exact distributions. The
analysis used in the first stage is basically the same
approach as that given by Maurer [14].
We begin by analyzing the expected value of T(Ti).
Because procedure Ti is composed of many calls to the
sub-procedure TD(r, k), we first estimate the expected
running time of a single TD(r, k). Let Tn,k(TD) be the
running time of TD(r, k), i.e., a call to Algorithm 1,
where n is the bit-length of r. Lemma 1 computes its
expected value, E[Tn,k(TD)]. Lemma 2 determines the
expected value of T(Ti) using E[Tn,k(TD)] and the
expected number of trial divisions in Ti. The proofs of
lemmas and theorems that are not given in this section
are provided in the appendix.
Lemma 1: E[Tn,k(TD)] = ( 1 + Σ_{j=2}^{k} ∏_{i=1}^{j−1} (1 − 1/pi) ) · divn,
where p1 < p2 < … < pk are the k smallest odd primes and divn is the time required to divide an n-bit integer by a single-word odd prime.

Lemma 2: E[T(Ti)] = ( 1 + Σ_{j=2}^{k} ∏_{i=1}^{j−1} (1 − 1/pi) ) · divn / ∏_{j=1}^{k} (1 − 1/pj),
where p1 < p2 < … < pk are the k smallest odd primes and divn is the time required to divide an n-bit integer by a single-word odd prime.

As the next step, we estimate E[T(Fi)], the expected running time of the ith Fermat test Fi with an n-bit random number as follows.

Lemma 3: E[T(Fi)] = expn for 1 ≤ i < x, and E[T(Fi)] = m · expn for i = x,
where expn is the expected running time for a modular exponentiation and x is the number of calls to Fi in Algorithm 5.

Now, we are ready to analyze the expected running time of Algorithm 5, ETn(k). Because the required time of one iteration of the main loop in Algorithm 5 is dominated by max(E[T(Ti)], E[T(Fi)]), ETn(k) can be represented as follows by Lemma 3.

ETn(k) = E[T(Ti)] + ( E(x) + m − 1 ) · expn,   if E[T(Ti)] ≤ expn,
ETn(k) = E(x) · E[T(Ti)] + m · expn,           if E[T(Ti)] ≥ expn.    (1)

For further analysis, we express the expected value of x as a function of k and n in the following lemma.

Lemma 4: E[x] = 0.3466n · ∏_{j=1}^{k} (1 − 1/pj).

Now, we can express ETn(k) as a function of k, divn, and expn as follows:

Theorem 1: Let k0 be the integer satisfying
( 1 + Σ_{j=2}^{k0} ∏_{i=1}^{j−1} (1 − 1/pi) ) / ∏_{j=1}^{k0} (1 − 1/pj) = expn / divn. Then,

ETn(k) = ( 1 + Σ_{j=2}^{k} ∏_{i=1}^{j−1} (1 − 1/pi) ) · divn / ∏_{j=1}^{k} (1 − 1/pj) + ( 0.3466n · ∏_{j=1}^{k} (1 − 1/pj) + m − 1 ) · expn,   if k ≤ k0,
ETn(k) = 0.3466n · ( 1 + Σ_{j=2}^{k} ∏_{i=1}^{j−1} (1 − 1/pi) ) · divn + m · expn,   if k ≥ k0.    (2)

Now, our task is to find k that minimizes (2). Note that (2) depends on multiple parameters such as n, divn, expn, and m, as well as k. Table 1 shows the candidates of optimal k with two typical parameter settings. For the values of divn and expn, we used the measured values from a circuit explained in Section V. Fig. 2 and 3 show the behavior of ETn(k) for the above two settings. To help readers understand this behavior according to the changes in k, we generated two graphs for each setting. For n = 512, the values of ET512(k) (0 < k ≤ 50000) are plotted with a step size of 1000, and then a magnified graph is plotted for 0 < k ≤ 10000 with a step size of 400. For n = 1024, we first plotted the values of ET1024(k) (0 < k ≤ 100000) with a step size of 2000 and then plotted a magnified graph for 0 < k ≤ 20000 with a step size of 400.

Table 1. Candidates of optimal k for n = 512, 1024
n      m    expn (µsec)   divn (µsec)   expn / divn   Candidate k
512    6    5008.00       0.813         6159.9        5408
1024   3    19992.90      1.613         12394.9       11022

As a result of the above analysis, we identified the candidates of optimal k, i.e., k = 5408 and k = 11022. However, it should be noted that these values are not the actual optimal values, but they indicate that the optimal k will be found near these values. Therefore, they will be used to set the search space for the later precise analysis. For example, the search space for the first parameter setting (n = 512) will be chosen to be (2000, 7000) that includes 5408 in Section IV-2.
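The Stage 1 estimate is cheap to reproduce in software. The short Python sketch below evaluates ETn(k) from (1)/(2) over a range of k using the measured divn and expn from Table 1 and reports the minimizing k; the function name, the scan range, and the reuse of first_odd_primes from Section II are our own choices (a sieve is advisable for k in the tens of thousands).

def candidate_k(n, m, div_n, exp_n, k_max=20000):
    """Scan k = 1..k_max and return (k, ET_n(k)) minimizing the Stage 1 estimate."""
    primes = first_odd_primes(k_max)
    best = None
    q = 1.0      # q = prod over the first k-1 primes of (1 - 1/p_j), before the update below
    s = 1.0      # s = 1 + sum_{j=2..k} prod_{i=1..j-1} (1 - 1/p_i)
    for k in range(1, k_max + 1):
        if k >= 2:
            s += q                        # add the product over the first k-1 primes
        q *= 1.0 - 1.0 / primes[k - 1]    # now q = product over the first k primes
        e_tt = s * div_n / q              # E[T(Ti)], Lemma 2
        e_x = 0.3466 * n * q              # E[x], Lemma 4
        if e_tt <= exp_n:                 # Eq. (1), first case
            et = e_tt + (e_x + m - 1) * exp_n
        else:                             # Eq. (1), second case
            et = e_x * e_tt + m * exp_n
        if best is None or et < best[1]:
            best = (k, et)
    return best

# With the Table 1 measurements (times in microseconds), candidate_k(512, 6, 0.813, 5008.00)
# should land near the candidate k reported there.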
2. Stage 2: Determining Optimal k
Now, we are ready to conduct a precise analysis of the
expected running time for Algorithm 5 using the exact
distributions of T(Ti) and T(Fi). We first examine the
distribution of T(Ti). Let ri (1 ≤ i ≤ w) be the ith n-bit random number to be tested by trial divisions and let r′i (1 ≤ i ≤ x) be the ith random number that passes the trial division, i.e., Ti−1. Note that r′x is rw and is the only n-bit random number that passes both trial division and the Fermat test. Let f denote the function such that r′i = r_{f(i)}. For example, f(x) = w because r′x = rw. By convention, we define f(0) = 0. Because each T(Ti) has the same distribution, we consider T(T0) without loss of generality. Fig. 1 illustrates that T0 is composed of calls to TD(r1, k), TD(r2, k), …, TD(r_{f(1)}, k), and each call to TD(rj, k) (1 ≤ j ≤ f(1)) consists of the computations rj mod p1, rj mod p2, rj mod p3, …, rj mod p_{N_j} for some 1 ≤ Nj ≤ k. Let a random variable, X, denote the total number of divisions in T0. Then, X = N1 + N2 + … + N_{f(1)}. Because the time for a single division is fixed as divn, the distribution of T(T0) is directly obtained from the distribution of X. The following theorem describes the probability distribution of X.

Fig. 2. Distribution of the expected running time of Algorithm 5 for n = 512.

Fig. 3. Distribution of the expected running time of Algorithm 5 for n = 1024.

Theorem 2: Let p = ∏_{j=1}^{k} (1 − 1/pj). Then

Pr{X = k + v} = Pr{f(1) = 2, X = k + v} + ⋯ + Pr{f(1) = v + 1, X = k + v}
 = p × Σ_{w=2}^{v+1} Σ_{(n_1, …, n_{w−1}) s.t. n_1 + ⋯ + n_{w−1} = v} [ (1/p_{n_1}) ∏_{j=1}^{n_1−1} (1 − 1/pj) ] × ⋯ × [ (1/p_{n_{w−1}}) ∏_{j=1}^{n_{w−1}−1} (1 − 1/pj) ],    (3)

for v ≥ 1, and

Pr{X = k} = p.    (4)

We remark that Σ_{v=0}^{∞} Pr{X = k + v} = 1. For example, we can compute Pr{X = k + 3} using

Pr{X = k + 3} = Pr{f(1) = 2, X = k + 3} + Pr{f(1) = 3, X = k + 3} + Pr{f(1) = 4, X = k + 3}
 = p × (1/p3)(1 − 1/p1)(1 − 1/p2)
 + p × [ (1/p1) · (1/p2)(1 − 1/p1) + (1/p2)(1 − 1/p1) · (1/p1) ]
 + p × (1/p1)(1/p1)(1/p1).
Next, we examine the distribution of T(Fi). As explained in Section II-5,

T(Fi) ≈ muln × { (n − 1) + (HW(r′i − 1) − 1) },    (5)

except for the last stage Fx. Therefore, the distribution of T(Fi) for 1 ≤ i < x is directly obtained from the distribution of HW(r′i − 1), which follows a binomial distribution,

Pr{ HW(r′i − 1) = v } = C(n−2, v−1) / 2^{n−2}    (6)

for 1 ≤ v ≤ n − 1. Similarly, the distribution for the last stage is obtained as

T(Fx) ≈ m × muln × { (n − 1) + (HW(r′x − 1) − 1) }.

Using the above analyses, we may obtain the probability distributions of T(Ti) and T(Fi). To be precise, using (3) and (4), we have Pr{ T(Ti) = k × divn } = p, and

Pr{ T(Ti) = (k + v) × divn } = p × Σ_{w=2}^{v+1} Σ_{(n_1, …, n_{w−1}) s.t. n_1 + ⋯ + n_{w−1} = v} [ (1/p_{n_1}) ∏_{j=1}^{n_1−1} (1 − 1/pj) ] × ⋯ × [ (1/p_{n_{w−1}}) ∏_{j=1}^{n_{w−1}−1} (1 − 1/pj) ]

for v ≥ 1. In addition, using (5) and (6), we have

Pr{ T(Fi) = muln × { (n − 1) + (v − 1) } } = C(n−2, v−1) / 2^{n−2},

for 1 ≤ v ≤ n − 1. Fig. 4 shows these distributions for k = 5408, which was the candidate optimal value for n = 512 in Section IV-1. Although the distributions are discrete, we plot them using continuous curves for reader convenience. The distribution of T(Fi) is symmetric, but that of T(Ti) is extremely skewed, while its average value is approximately the same as the expected value obtained in Section IV-1, i.e., expn = 5008.00. Therefore, the tendency of the running time of Algorithm 5 according to k may be slightly different from that expected in Section IV-1, which is verified in our subsequent analyses.

Fig. 4. Probability distributions of T(Ti) and T(Fi) (1 ≤ i < x) for n = 512 and k = 5408.

To more precisely analyze the behavior of Algorithm 5, we constructed the probability distributions of T(Ti) and T(Fi) for various k and estimated the distribution of the running time of Algorithm 5. Let Tn(k) be the running time of Algorithm 5. By the description of Algorithm 5, we see that

Tn(k) = T(T0) + Σ_{i=1}^{x−1} max( T(Ti), T(Fi) ) + T(Fx),

where x is the number of candidate numbers that pass the trial division. For each k, we conducted a Monte Carlo simulation in which ten million random samples of Tn(k) are selected following the distributions of T(Ti) and T(Fi), and the average value of Tn(k) is computed. Then, we identified the k value that minimizes this average value. Because this procedure was very time-consuming, we conducted the analysis as follows. We first examined a wide range of k containing the estimated optimum value determined in Section IV-1, setting the step size to 100. That is, we tested k = 2000, 2100, …, 7000. For this simulation, we used a C program compiled by Microsoft Visual Studio 2010 on a machine with an Intel Core 2 Duo CPU and 4 GB RAM. The result is plotted in Fig. 5.
Fig. 5. Distribution of the expected running time of Algorithm 5 for n = 512.

On the basis of the result shown in Fig. 5, we narrowed the candidate range and tested k = 3500, 3501, …, 3800 with a step size of 1. According to the simulation result, the minimum time was 124.72 ms when k = 3586. We also conducted another Monte Carlo simulation with ten million samples for n = 1024. We first searched the range 4000 ≤ k ≤ 14000 with a step size of 200 and then 7800 ≤ k ≤ 8200 with a step size of 1. The result of the former is shown in Fig. 6. According to the simulation, the minimum expected running time of Algorithm 5 for n = 1024 was 768.66 ms when k = 7812.

Fig. 6. Distribution of the expected running time of Algorithm 5 for n = 1024.
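The Monte Carlo estimate can be reproduced in software, although far more slowly than with the C program used here. The sketch below samples T(Ti) by simulating the division counts directly instead of evaluating the closed form (3), and it treats divisibility by each small prime and the primality of each candidate as independent events, the same modeling assumptions as in the analysis; prime_prob is the conditional probability 1/(0.3466 n ∏(1 − 1/pj)) from the proof of Lemma 4, and muln ≈ expn / ((n − 1) + (n − 2)/2).

import random

def sample_T_Ti(primes, div_n):
    """Sample T(Ti): total divisions until a candidate passes the trial division."""
    divisions = 0
    while True:
        for p in primes:
            divisions += 1
            if random.random() < 1.0 / p:      # candidate divisible by p: try the next candidate
                break
        else:
            return divisions * div_n           # candidate passed: the period Ti ends

def sample_T_Fi(n, mul_n):
    """Sample T(Fi) for i < x from the Hamming-weight distribution (5), (6)."""
    hw = 1 + sum(random.getrandbits(1) for _ in range(n - 2))    # HW(r'_i - 1)
    return mul_n * ((n - 1) + (hw - 1))

def average_Tn(n, primes, div_n, mul_n, m, prime_prob, samples=10_000):
    """Monte Carlo estimate of E[Tn(k)] = E[T(T0) + sum_i max(T(Ti), T(Fi)) + T(Fx)]."""
    total = 0.0
    for _ in range(samples):
        t = sample_T_Ti(primes, div_n)                       # T(T0)
        while random.random() >= prime_prob:                 # candidate turned out composite
            t += max(sample_T_Ti(primes, div_n), sample_T_Fi(n, mul_n))
        t += m * sample_T_Fi(n, mul_n)                       # final stage Fx: m exponentiations
        total += t
    return total / samples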
V. IMPLEMENTATION AND EXPERIMENTAL
RESULTS
We implemented our parallel prime generator (Algorithm 5) on an FPGA chip and measured its performance. This experiment was conducted independently for 512- and 1024-bit primes. For each of these two settings, the running times of Algorithm 5 were measured for various choices of k, and the optimal k that minimizes the running time was identified. These k values found in the actual measurements are compared with the estimated values in Section IV.

1. Implementation of Parallel Prime Generator on FPGA Chip

Fig. 7. Block diagram of hardware prime generator.

A Xilinx Virtex-4 FPGA chip was used for our
experiments. Distinct hardware prime generators were
implemented for 512- and 1024-bit prime generation,
respectively. A common block diagram of the prime
generator, which is an implementation of Algorithm 5, is
given in Fig. 7. The top module includes a trial division
module for procedure Ti, and a modular exponentiation
module to perform Fermat tests for procedure Fi.
The top module receives a 512- or 1024-bit odd
random number R as a base number for generating a
prime. It also takes as another input k, the number of
small primes that will be used in trial division. We
embedded a sufficient number of small primes in the prime generator and let the generator use the k smallest ones for trial division according to the input k. In this way, we avoided synthesizing a separate FPGA configuration for each distinct k. However, we remark that in a final product
for a hardware prime generator, it is reasonable that the
optimal value of k, say kopt, is fixed according to an
analysis and should not be an input to the prime
generator. In other words, the prime generator should
contain exactly kopt small primes for compact hardware.
In the ith parallel stage, the modular exponentiation
module performs a Fermat test procedure Fi using the
output from the trial division procedure Ti–1 in the
previous stage. In parallel with Fi, the trial division
module performs trial division procedure Ti on R with the
k smallest odd primes. If R is divisible by some prime
p j (1 ≤ j ≤ k ) , R cannot be a prime. Therefore, the top
Fig. 8. Block diagram of the trial division module.

Fig. 9. Block diagram of the modular exponentiation module.
module updates R by adding 2 to the current value and
activates a new trial division test. If R is not divisible by
any prime p j (1 ≤ j ≤ k ) , it becomes a candidate for a
large prime. The top module then delivers R to the
modular exponentiation module for the execution of a
Fermat test procedure Fi+1, and the next parallel stage
begins. At the same time, the top module also updates R
by adding 2 to the current value. The updated R is used in
a new trial division procedure Ti+1. The two procedures
Fi+1 and Ti+1 are executed in parallel. For more details
about the execution of prime generation, see Algorithm 5
and Fig. 1. Note that the trial division processes are
continuously repeated, updating R using the rule R → R
+ 2. The only exceptions are the following two cases.
The first is the case in which a new R that is not divisible
by the k smallest odd primes has been discovered, but the
modular exponentiation module is still busy performing a
Fermat test. In this case, the trial division module is
stalled until the modular exponentiation module finishes
the work in progress and becomes available to receive
the new R. The second case is when the current Fermat
test finds a prime. In the latter case, all operations of the
prime generator, including trial division processes, are
terminated.
We now explain the operation of the sub-modules in
more detail. Fig. 8 shows the data path of the trial
division module. The path of k is not included in Fig. 8
since k is used only in control logics. The trial division
module divides R with small primes from the prime lists
module. At first, the input R is stored in the register R,
and the register Remain is initialized with MSBs of the
register R. Then, the prime in the register Divisor is subtracted from the value in the register Remain. The result of the subtraction is stored in the register Remain after a 1-bit left shift. During the left shift, one bit selected from
the unused part of the register R is inserted to the right.
By iterating these subtraction-and-shifts, divisions are
done. If the register Remain contains only zeros after a
division, i.e., R is divisible by the prime in the register
Divisor, the trial division module stops and concludes
that R is composite. Otherwise, another division with a
new prime is performed in the same way. If R is not
divisible by any of the k primes, the trial division module
stops and concludes that R is possibly a prime.
To reduce the consumed area, the prime lists module
maintains only the differences between adjacent primes.
Using the values from the prime lists module, the next
prime is calculated and stored in a register Next_divisor.
Then, the value is left-aligned before handing it over to
the register Divisor.
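The gap-based prime table can be modeled in a few lines of Python; the helper names are ours, first_odd_primes is the sketch from Section II, and the one-byte observation is simply a property of prime gaps in this range rather than a statement about the actual register widths.

def prime_gaps(primes):
    """Differences between adjacent odd primes, as stored by the prime lists module."""
    return [primes[0]] + [b - a for a, b in zip(primes, primes[1:])]

def primes_from_gaps(gaps):
    """Reconstruct the odd primes by accumulating the stored differences."""
    out, acc = [], 0
    for g in gaps:
        acc += g
        out.append(acc)
    return out

primes = first_odd_primes(1000)
assert primes_from_gaps(prime_gaps(primes)) == primes
assert max(prime_gaps(primes)[1:]) < 256     # each stored difference fits in one byte here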
Fig. 9 shows the data path of the modular
exponentiation module. Setting N ← R, the modular
exponentiation module repeatedly computes A^{N−1} (mod N) with different values of A, which is initially set to 2. If A^{N−1} (mod N) = 1, the exponentiation module repeats the test with a doubled A. If the equation A^{N−1} (mod N) = 1
always holds in a predefined number of repetitions (6 and
3 for the 512- and 1024-bit candidates, respectively), it is
concluded that N is possibly a prime.
For efficient computation of a modular exponentiation
using Montgomery reductions, the modular exponentiation module includes a Montgomery modular multiplication module, which performs two operations: the pre-computation operation, A′ = A × 2^n (mod N), and
Montgomery modular multiplications. Pre-computation
is performed once at first, and its result A' is stored in
registers A and D. Then, Montgomery modular squarings
and multiplications are iterated, and a conversion to the
standard number system is performed at last. Depending
Fig. 10. Data path of Montgomery modular multiplication
module.
on the operations, one of register A, register D and
constant 1 is selected as the input B of the Montgomery
multiplication module.
The data path of Montgomery modular multiplication
module is shown in Fig. 10. In Fig. 10, (a) shows the paths used only for pre-computation, while (b) shows those used only for Montgomery modular multiplication. During the pre-computation process, the multiplicand A is stored in the
sum register and shifted to the left at every round. If the
shifted value is longer than k bits, either N or 2N is
subtracted from each carry save adder (CSA). After
repeating this process, A′ = A × 2^n (mod N) is obtained.
During a Montgomery multiplication, the first CSA
performs an addition operation of a partial product, Ai ×
B. If the least significant bit of the first CSA result is 1,
either N or –N is selected as the input of the second CSA.
The result of the second CSA is shifted to the right and is
stored in the registers. For both processes, the 64-bit
adder transforms the values in the carry and sum registers
into a single number.
2. Experimental Results
Using the prime generator implemented on the FPGA,
we measured the time required for prime generation for
various values of k. To be precise, we considered k = 500,
1000, 1500, …, 7000 for n = 512. To find a more
accurate value of the optimal k, we used a 10-times
smaller step size around the optimal value. That is, in the
interval [3000,4000], we tested k = 3000, 3050, 3100, …,
4000. For n = 1024, we set the initial step size to 1000
and considered k = 1000, 2000, 3000, …, 14000. Around
the optimal k, we tested k = 7000, 7100, …, 9000 using a
step size of 100. For each of these k values, we generated
100,000 random primes and computed the average time
Fig. 11. Estimated running times of parallel prime generation
and measured values using an FPGA chip (n = 512).
for prime generation. We also compared the measured
values with the expected running times estimated by the
analysis in Section IV. The results are shown in Fig. 11
and 12. The upper part in Fig. 11 shows a rough
estimation of the time required to generate a 512-bit
prime based on the analysis in Section IV-1. As in Fig. 2,
the lower part of Fig. 11 magnifies the region that may
include the optimal k and plots the estimated values
based on the analysis in Section IV-2 as well as the
measured values using the FPGA implementation. Fig.
12 illustrates the same task for 1024-bit primes.
According to Fig. 11 and 12, the data measured using
our FPGA implementation coincide with those estimated
in Section IV-2, the results of the analysis using the
probability distributions of T(Ti) and T(Fi). This result
implies that the analysis methodology in Section IV-2 is
very accurate and may be used to determine the optimal k
for a given parameter setting.
The data from the FPGA implementation also coincide
with the range estimated through the preliminary analysis
using the expected values, E[T(Ti)] and E[T(Fi)], in
Section IV-1, except for the k values around the optimal
points. For example, in the 512-bit case, for k ≤ 3000
and k ≥ 6000, the three curves overlap with each other. In the 1024-bit case, these curves overlap for k ≤ 6000 and k ≥ 12000. Note that, for those k values that are not close to the optimal value, the overall execution time of the parallel prime generation is dominated by only one operation among the trial division and the Fermat test. That is, trial divisions are dominant for large k, but the Fermat tests dominate for small values of k. In the interval of k in which only one operation is dominant, the analysis in Section IV-1 using only the expected execution times of T(Ti) and T(Fi) is already quite accurate. It should also be noted that, even in the interval where the curves do not converge on each other, the relative error is not very large. For example, in the 1024-bit case, the difference in the execution times is 789.6 ms (for the FPGA) − 742.9 ms (for the analysis in Section IV) = 46.7 ms, which is only 5.9 % of the measured value. In the 512-bit case, the largest difference occurred when k = 5500, with a difference of 128.8 ms − 123.1 ms = 5.7 ms, only 4.4 % of the measured value.

Fig. 12. Estimated running times of parallel prime generation and measured values using an FPGA chip (n = 1024).

Table 2. Optimal values of k in various settings (*: the values obtained from the analyses in Section IV)
n      m   Step size to search for k   ρ′ = muln / divn   Optimal k
512    6   20                          1                  480
                                       2                  900
                                       3                  1380
                                       4                  1860
                                       8                  3586*
                                       12                 5720
                                       16                 7160
                                       20                 9200
                                       24                 10720
                                       28                 12540
                                       32                 13800
1024   3   50                          1                  1000
                                       2                  1950
                                       3                  2850
                                       4                  3800
                                       8                  7812*
                                       12                 11350
                                       16                 15800
                                       20                 18950
                                       24                 22550
                                       28                 27850
                                       32                 30850

The above analysis results provide us with a guideline to determine an optimal k for a given parameter setting.
That is, it is reasonable to first decide a rough range (e.g.,
an order of magnitude) of the optimal k through the
simpler initial analysis in Section IV-1 in order to narrow
the search space. This analysis provides graphs, such as
those shown in the upper part of Fig. 11 and 12. Next, we
determine the exact value of optimal k using the more
accurate analysis explained in Section IV-2, which
consumes more time than the analysis in Section IV-1.
VI. GENERALIZATION
According to the analyses and experiments in Sections
IV and V, the optimal value of k only depends on the bit
length n and the ratio ρ = expn / divn , where expn and
divn represent the expected times for exponentiation and
for division of an n-bit long integer by a single-word odd
prime, respectively. Therefore, to reduce the time
required for tedious analyses, it is helpful to compute the
recommended k for various combinations of n and ρ
before its actual use. Since expn is computed as
expn = muln × {( n − 1) + ( n − 2 ) / 2} as in Section IV-1,
we may also analyze the behavior of the optimal k
according to two parameters, n and ρ ' = muln / divn ,
instead of n and ρ = expn / divn . Table 2 shows the
results of this pre-computation for various combinations
of n and ρ′. Note that the analyses in Section IV-2 correspond to the cases in which ρ′ ≈ 8 for both n = 512 and n = 1024.

Fig. 13. Behavior of the optimal k according to ρ′.
Fig. 13 shows the relation between ρ ' and the
corresponding optimal k for n = 512 and n = 1024.
Interestingly, a nearly linear relationship is produced in
both cases.
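As a convenience for designers, the near-linear trend in Fig. 13 can be exploited by interpolating in Table 2. The following Python snippet is only an illustration of that guideline; it uses the Table 2 entries verbatim and a simple linear interpolation of our own choosing.

# Table 2: optimal k versus rho' = mul_n / div_n (values taken from Section VI)
TABLE2 = {
    512:  {1: 480, 2: 900, 3: 1380, 4: 1860, 8: 3586, 12: 5720, 16: 7160,
           20: 9200, 24: 10720, 28: 12540, 32: 13800},
    1024: {1: 1000, 2: 1950, 3: 2850, 4: 3800, 8: 7812, 12: 11350, 16: 15800,
           20: 18950, 24: 22550, 28: 27850, 32: 30850},
}

def recommended_k(n, rho_prime):
    """Linearly interpolate the optimal k for a measured ratio rho' = mul_n / div_n."""
    table = sorted(TABLE2[n].items())
    if rho_prime <= table[0][0]:
        return table[0][1]
    for (r0, k0), (r1, k1) in zip(table, table[1:]):
        if rho_prime <= r1:
            return round(k0 + (k1 - k0) * (rho_prime - r0) / (r1 - r0))
    return table[-1][1]

print(recommended_k(512, 8))    # 3586, the value analyzed in Section IV-2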
VII. CONCLUSION
We presented the first probabilistic analysis on the
expected running time of the parallel combination of trial
division and the Fermat test for generating a large prime.
The expected running time was computed using only k,
divn, and expn, where k is the number of small odd primes
used in the trial division, divn is the time required for
dividing an n-bit number by a word-sized prime, and
expn is the time required for performing a modular
exponentiation for an n-bit number. We presented a
general framework to identify the optimal k that
minimizes this expected running time. This framework is
composed of two stages. In the first stage, a rough range
for k was found through an analysis considering the
expected running times of the trial division and Fermat
tests independently. After narrowing the search space for
k in the first stage, we identified the exact optimal value
577
of k through a more precise analysis considering the
combined probability distributions of the timings for a
trial division and a Fermat test. This two-stage approach
reduced the overall time to determine the optimal value
of k. We also implemented 512- and 1024-bit prime
generators on an FPGA chip and verified that the
experimental results correspond well to the analyzed
values through the above two-stage framework.
Motivated by this encouraging result, we finally provided
the recommended values of k for different parameter
settings, which we believe will help other researchers
reduce time in determining the desired value of k for
their own parameter settings.
The advantages of our prime generator over
conventional hardware prime generators are twofold.
First, our hardware prime generator is a complete stand-alone prime generator that does not need any additional hardware or software. This is because our hardware prime generator includes two primality tests, while conventional prime generators include only one. Since practical prime generation requires two or more primality tests, as mentioned in the Introduction, conventional prime generators require additional hardware or software for another primality test before they can be used in practice. Second, system developers do not need to be concerned with the optimal combination of the primality tests, because this optimization was already performed when our hardware prime generator was designed and developed. With conventional prime generators, in contrast, system developers must spend considerable time and effort to achieve a well-tuned combination of the prime generator and the additional hardware/software module for the other primality test.
ACKNOWLEDGMENTS
This work was supported by the research fund of
Signal Intelligence Research Center supervised by
Defense Acquisition Program Administration and Agency
for Defense Development of Korea and by the MSIP
(Ministry of Science, ICT and Future Planning), Korea,
under the ITRC (Information Technology Research
Center) support program (IITP-2016-H8501-16-1008)
supervised by the IITP (Institute for Information &
communications Technology Promotion).
REFERENCES
[1] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Communications of the ACM, vol. 21, pp. 120–126, 1978.
[2] T. ElGamal, “A public key cryptosystem and a signature scheme based on discrete logarithms,” in Advances in Cryptology, 1985, pp. 10–18.
[3] FIPS PUB 186-2, “Digital Signature Standard (DSS),” National Institute of Standards and Technology (NIST), 2000.
[4] V. Miller, “Use of elliptic curves in cryptography,” in Advances in Cryptology—CRYPTO’85 Proceedings, 1986, pp. 417–426.
[5] N. Koblitz, “Elliptic curve cryptosystems,” Mathematics of Computation, vol. 48, pp. 203–209, 1987.
[6] Trusted Platform Module, http://www.trustedcomputinggroup.org/developers/trustedplatformmodule.
[7] Trusted Execution Environment, http://www.globalplatform.org/mediaguidetee.asp.
[8] TrustZone, http://www.arm.com/products/processors/technologies/trustzone/index.php.
[9] S. Tueke, V. Welch, D. Engert, L. Pearlman, and M. Thompson, “Internet X.509 public key infrastructure (PKI) proxy certificate profile,” RFC 3280 (Proposed Standard), 2004.
[10] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd ed., Cambridge: MIT Press, 2009.
[11] H. C. Pocklington, “The determination of the prime or composite nature of large numbers by Fermat’s theorem,” in Proceedings of the Cambridge Philosophical Society, 1914, pp. 29–30.
[12] A. O. L. Atkin and F. Morain, “Elliptic curves and primality proving,” Mathematics of Computation, vol. 61, pp. 29–68, 1993.
[13] W. Bosma and M.-P. van der Hulst, “Faster primality testing,” in Advances in Cryptology—EUROCRYPT’89, 1990, pp. 652–656.
[14] U. M. Maurer, “Fast generation of prime numbers and secure public-key cryptographic parameters,” Journal of Cryptology, vol. 8, pp. 123–155, 1995.
[15] J. Shawe-Taylor, “Generating strong primes,” Electronics Letters, vol. 22, pp. 875–877, 1986.
[16] M. Agrawal, N. Kayal, and N. Saxena, “PRIMES is
in P,” Annals of mathematics, vol. 160, no. 2, pp.
781–793, 2004.
[17] M. O. Rabin, “Probabilistic Algorithm for
Primality Testing,” Journal of Number Theory, vol.
12, pp. 128–138, 1980.
[18] R. Solovay and V. Strassen, “A fast Monte-Carlo
test for primality,” SIAM Journal on Computing,
vol. 6, pp. 84–85, 1977.
[19] J. Grantham, “A probable prime test with high
confidence,” Journal of Number Theory, vol. 72,
pp. 32–47, 1998.
[20] D. J. Lehmann, “On primality tests,” SIAM Journal
on Computing, vol. 11, pp. 374–375, 1982.
[21] OpenSSL, http://www.openssl.org.
[22] H. Park, S. K. Park, K.-R. Kwon, and D. K. Kim,
“Probabilistic Analysis on Finding Optimal
Combinations of Primality Tests in Real
Applications,” in Information Security Practice and
Experience, ed: Springer, 2005, pp. 74–84.
[23] P. Heejin and D. K. Kim, “Probabilistic Analysis
on the Optimal Combination of Trial Division and
Probabilistic Primality Tests for Safe Prime
Generation,” IEICE transactions on information
and systems, vol. 94, pp. 1210–1215, 2011.
[24] N. Koblitz, A course in number theory and
cryptography, Berlin, Germany: Springer-Verlag,
1994.
[25] I. Damgaard, P. Landrock, and C. Pomerance,
“Average case error estimates for the strong
probable prime test,” Mathematics of Computation,
vol. 61, pp. 177–194, 1993.
[26] C. Pomerance, “On the Distribution of Pseudoprimes,” Mathematics of Computation, pp. 587–
593, 1981.
[27] D. E. Knuth, “The Art of Computer Programming,
volume 2: Seminumerical Algorithms,” Reading:
Addison-Wesley Professional, vol. 192, 1997.
[28] P. L. Montgomery, “Modular Multiplication
without Trial Division,” Math of Computation, vol.
44, pp. 519–521, 1985.
[29] S. R. Dusse and B.S. Kaliski Jr, “A cryptographic
library for the Motorola DSP56000,” in Advances
in Cryptology—EUROCRYPT’90, 1991, pp. 230–
244.
[30] T. Blum and C. Paar, “Montgomery modular
exponentiation on reconfigurable hardware,” in
Computer Arithmetic, 1999. Proceedings. 14th
IEEE Symposium on, 1999, pp. 70–77.
[31] D. M. Gordon, “A survey of fast exponentiation
methods,” Journal of Algorithms, vol. 27, pp. 129–
146, 1998.
[32] C. K. Koc, “High-Speed RSA Implementation,”
Technical Report, RSA Laboratories, 1994.
[33] G. L. Miller, “Riemann’s Hypothesis and Tests for
Primality,” Journal of Computer Systems Science,
vol. 13, pp. 300–317, 1976.
APPENDIX. PROOF OF LEMMAS AND
THEOREMS
Proof of Lemma 1. Let T(TD(r, k)) denote the running time of trial division on a specific n-bit random number r with the k smallest odd primes. Let Dj be the event of dividing r by the jth smallest odd prime pj for 1 ≤ j ≤ k in Algorithm 1, and let Pr{Dj} be the probability of Dj. Let div(r, j) be the time required to divide r by pj. Then the expected value of T(TD(r, k)) can be represented as follows.

E[T(TD(r, k))] = Pr{D1} · div(r, 1) + ⋯ + Pr{Dk} · div(r, k) = Σ_{j=1}^{k} Pr{Dj} · div(r, j)    (7)

Since we divide r by pj in trial division if and only if r is not divisible by any prime up to p_{j−1}, Pr{Dj} is as follows:

Pr{Dj} = (1 − 1/p1)(1 − 1/p2) ⋯ (1 − 1/p_{j−1}) = ∏_{i=1}^{j−1} (1 − 1/pi),    (8)

for j ≥ 2, and Pr{D1} = 1. Using (7) and (8), we get the following equation:

E[T(TD(r, k))] = div(r, 1) + Σ_{j=2}^{k} ( ∏_{i=1}^{j−1} (1 − 1/pi) · div(r, j) ).    (9)

Now, we generalize E[T(TD(r, k))] to a value independent of a specific r. It is commonly accepted that div(r1, j) ≈ div(r2, j) for any two n-bit random numbers r1 and r2. Furthermore, div(r1, i) ≈ div(r1, j) for 1 ≤ i, j ≤ k because the odd primes used in the trial division are so small that each of them can be stored in a single word. Thus, we may use divn instead of div(r, j). Then, equation (9) can be rewritten as follows.

E[T(TD(r, k))] = ( 1 + Σ_{j=2}^{k} ∏_{i=1}^{j−1} (1 − 1/pi) ) · divn    (10)

Since the right-hand side of (10) is not dependent on the value of r but is dependent on the bit length of r, which is n, E[T(TD(r, k))] can be replaced with the more general notation E[Tn,k(TD)], which proves the lemma.

Proof of Lemma 2. Let ri (1 ≤ i ≤ w) be the ith n-bit random number to be tested by trial divisions in Algorithm 5, and let r′i (1 ≤ i ≤ x) be the ith random number that passes the trial division. Note that r′x is rw and is the only n-bit random number that passes both trial division and the Fermat test. As already defined, let f denote the function such that r′i = r_{f(i)}. For example, f(x) = w because r′x = rw. By convention, we define f(0) = 0. Then, procedure Ti contains all trial divisions on rj's for f(i) < j ≤ f(i+1). Because the individual trial division time for each rj is already computed in Lemma 1, we estimate the expected value of T(Ti) by computing the expected value of f(i+1) − f(i), which is the number of trial divisions in Ti. Because the running time of the trial division on an n-bit random number and the value of f(i) − f(i−1) are independent, E[T(Ti)] can be represented as E[T(Ti)] = E[Tn,k(TD)] × E[f(i+1) − f(i)]. Because the probability of executing a Fermat test is (1 − 1/p1)(1 − 1/p2) ⋯ (1 − 1/pk), we obtain E[f(i+1) − f(i)] = 1 / ∏_{j=1}^{k} (1 − 1/pj). Then, by Lemma 1, we get

E[T(Ti)] = ( 1 + Σ_{j=2}^{k} ∏_{i=1}^{j−1} (1 − 1/pi) ) · divn / ∏_{j=1}^{k} (1 − 1/pj).
Proof of Lemma 3. We first consider the expected
running time of FT(r), i.e., the time to compute a^{r−1} mod
r for an n-bit random number r and a single-word integer
a. Note that the running time of a modular
exponentiation (Algorithm 4) is the product of the
number of Montgomery multiplications and the running
time of each Montgomery multiplication. Because the
complexity of a Montgomery multiplication, muln, is
determined by the bit length n of the modulus r, we
estimate the expected running time for a modular
exponentiation as expn = muln × {(n – 1) + (n – 2)/2},
according to the estimation of the expected number of
multiplications in Section II-5. We are interested in the
number of calls to FT(r) in procedure Fi. We consider
two separate cases. First, if r is prime, then FT(r) is
called m times. On the other hand, if r is composite, it is
very likely that the number of calls is 1 because the
probability that a composite number passes even a single
call of FT(r) is very low, as described in Section II.
Consequently, E[T(Fi)] = expn if r is composite, and
E[T(Fi)] = m · expn if r is prime. Therefore, the lemma
holds.
Proof of Lemma 4. The probability that a random odd number is a prime is 1/(0.3466n), as explained in Section II. Thus, the probability that each r′i is a prime is 1 / ( 0.3466n · ∏_{j=1}^{k} (1 − 1/pj) ) for 1 ≤ i ≤ x, because r′i already passed a trial division test using the k smallest odd primes. Since the expected number of primes after the E[x]th iteration is 1, we determine that E[x] / ( 0.3466n · ∏_{j=1}^{k} (1 − 1/pj) ) = 1 holds, which proves the lemma.

Proof of Theorem 1. The body of (2) is easily derived from (1) by replacing E[T(Ti)] with ( 1 + Σ_{j=2}^{k} ∏_{i=1}^{j−1} (1 − 1/pi) ) · divn / ∏_{j=1}^{k} (1 − 1/pj) in accordance with Lemma 2, and E(x) with 0.3466n · ∏_{j=1}^{k} (1 − 1/pj) according to Lemma 4. The reason that the conditions E[T(Ti)] ≤ expn and E[T(Ti)] ≥ expn in (1) can be replaced by k ≤ k0 and k ≥ k0, respectively, is described below. First, E[T(Ti)] is monotonically increasing as k increases because Σ_{j=2}^{k} ∏_{i=1}^{j−1} (1 − 1/pi) in the numerator increases monotonically, and the denominator, ∏_{j=1}^{k} (1 − 1/pj), decreases monotonically. Therefore, E[T(Ti)] ≤ expn if and only if k ≤ k0, and E[T(Ti)] ≥ expn if and only if k ≥ k0.

Proof of Theorem 2. We analyze the probability distribution of X by classifying the cases according to the total number of calls to TD in T0, i.e., f(1).

[Case 1: f(1) = 1, i.e., X = N1] In this case, r1 passes the trial division test and is labeled as r′1. Because r1 should not be divisible by any pj for 1 ≤ j ≤ k, the probability that f(1) = 1 is p = ∏_{j=1}^{k} (1 − 1/pj). Because this is the only case in which X = k, we obtain Pr{X = k} = Pr{N1 = k, pk ∤ r1} = p.

[Case 2: f(1) = 2, i.e., X = N1 + N2] In this case, r1 does not pass the trial division test, but r2 does, so it is obvious that N2 = k. However, there are k possible cases for N1, i.e., N1 = 1, 2, …, k. Their probabilities satisfy Pr{N1 = v, pv | r1} = (1/pv) × ∏_{j=1}^{v−1} (1 − 1/pj) for v = 1, 2, …, k, and we verify that Σ_{v=1}^{k} Pr{N1 = v, pv | r1} = 1 − p = 1 − Pr{N1 = k, pk ∤ r1}, where pv | r1 means that pv divides r1, i.e., r1 mod pv = 0. Then, we obtain

Pr{f(1) = 2, X = k + v} = Pr{N1 = v, pv | r1} × Pr{N2 = k, pk ∤ r2} = p × (1/pv) ∏_{j=1}^{v−1} (1 − 1/pj).

[Case 3: f(1) = w (w ≥ 3), i.e., X = N1 + N2 + ⋯ + Nw] This general case implies that r1 through r_{w−1} do not pass the trial division test, but rw does, so Nw = k. Now, in order to compute the probability Pr{f(1) = w, X = k + v}, we have to examine all possible combinations of N1, N2, …, N_{w−1} that make the sum N1 + N2 + ⋯ + N_{w−1} = v. As a result, we obtain

Pr{f(1) = w, X = k + v} = Σ_{(n_1, …, n_{w−1}) s.t. n_1 + ⋯ + n_{w−1} = v} Pr{N1 = n_1, p_{n_1} | r1} × ⋯ × Pr{N_{w−1} = n_{w−1}, p_{n_{w−1}} | r_{w−1}} × Pr{Nw = k, pk ∤ rw}
 = p × Σ_{(n_1, …, n_{w−1}) s.t. n_1 + ⋯ + n_{w−1} = v} [ (1/p_{n_1}) ∏_{j=1}^{n_1−1} (1 − 1/pj) ] × ⋯ × [ (1/p_{n_{w−1}}) ∏_{j=1}^{n_{w−1}−1} (1 − 1/pj) ].

In fact, this generalized equation also holds for w = 2. Merging all cases, we obtain (3) and (4).
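As a numerical sanity check on Theorem 2, the small experiment below compares the empirical Pr{X = k} with p = ∏(1 − 1/pj) for a modest k. It reuses the first_odd_primes sketch from Section II and a direct simulation of procedure T0; the agreement is approximate because divisibility probabilities of an actual random n-bit odd number are only close to, not exactly, 1/p.

import random
from math import prod

def divisions_in_T0(n, primes):
    """Count the divisions performed in procedure T0 for a fresh random odd n-bit start."""
    r = random.getrandbits(n) | 1 | (1 << (n - 1))
    count = 0
    while True:
        for p in primes:
            count += 1
            if r % p == 0:
                break
        else:
            return count          # r passed the trial division
        r += 2

k, n, trials = 25, 512, 100_000
primes = first_odd_primes(k)
p = prod(1 - 1.0 / q for q in primes)
hits = sum(divisions_in_T0(n, primes) == k for _ in range(trials))
print(f"empirical Pr[X = k] = {hits / trials:.4f}   vs   p = {p:.4f}   (Theorem 2, Eq. (4))")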
Dong Kyue Kim received the B.S.,
M.S. and Ph.D. degrees in Computer
Engineering from Seoul National
University in 1992, 1994, and 1999,
respectively. From 1999 to 2005, he
was an assistant professor in the
Division of Computer Science and
Engineering at Pusan National University. He is
currently a full professor in the Department of Electronic
Engineering at Hanyang University, Korea. His research
interests are in the areas of security SoC (System on
Chip), crypto-coprocessors, and information security.
Piljoo Choi received the B.S. and
M.S. degrees in Electronic Engineering from Hanyang University in
2010 and 2012, respectively. He is
currently a Ph.D. candidate in the
Department of Electronic Engineering at Hanyang University, Korea.
His research interests are in the areas of security SoC
(System on Chip), crypto-coprocessors, and information
security.
Mun-Kyu Lee received the B.S. and
M.S. degrees in Computer Engineering from Seoul National
University in 1996 and 1998,
respectively, and the Ph.D. degree in
Electrical Engineering and Computer
Science from Seoul National
University in 2003. From 2003 to 2005, he was a senior
engineer at Electronics and Telecommunications
Research Institute, Korea. He is currently a professor in
the Department of Computer and Information
Engineering at Inha University, Korea. His research
interests are in the areas of cryptographic algorithms,
information security and theory of computation.
Heejin Park received the B.S., M.S.
and Ph.D. degrees in Computer
Engineering from Seoul National
University in 1994, 1996, and 2001,
respectively. From 2001 to 2002, he
worked as a post-doctoral researcher
for the Department of Computer
engineering at Seoul National University. In 2003, he
was a research professor at Ewha Womans
University. He is currently a professor in the Department
of Computer Science and Engineering at Hanyang
University, Korea. His research interests are in the areas
of cryptography, information security, and computer
algorithm.