Koblitz - CSE IIT Kgp - Indian Institute of Technology Kharagpur

Accelerations of Scalar Multiplication
Advanced Techniques
Debdeep Mukhopadhyay
Chester Rebeiro
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur
23-27 May 2011
Anurag Labs, DRDO
Non-Adjacency Form (NAF)
NAF(29)=(1,0,0,-1,0,1), since 29=32-4+1
Binary(29)=(1,1,1,0,1), since 29=16+8+4+1
Pros:
NAF has no two consecutive nonzero digits (hence called non-adjacent).
Average density of nonzero digits in NAF is 1/3.
It reduces the number of point additions in ECC scalar multiplication.
Cons:
The NAF can be one digit longer than the binary representation.
Algorithm for NAF generation
k=29.
k0=2-(29%4)=1, k=29-1=28, k=14
k1=0 (note that it can never be nonzero, because k is even at this step), k=7
k2=2-(7%4)=-1, k=4
k3=0, k=2
k4=0, k=1
k5=2-(1%4)=1, k=0 (algorithm terminates)
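The digit rule used in this example can be sketched in Python. This is a minimal sketch, assuming the rule k_i = 2 - (k mod 4) for odd k and k_i = 0 for even k; the function name `naf` is illustrative and does not come from the slides.

```python
def naf(k: int) -> list[int]:
    """Return the NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k % 2 == 1:
            d = 2 - (k % 4)   # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d            # k is now divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits


print(naf(29)[::-1])  # most significant digit first: [1, 0, 0, -1, 0, 1]
```

The digits are produced from the least significant end, so the list is reversed before printing to match the notation NAF(29)=(1,0,0,-1,0,1).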
Why Non-adjacent?



When k is odd, it can be either 4p+1 or 4p+3 (p is an integer).
Case 1: k=4p+1
◦ ki=1, k=2p (even) => next NAF bit is 0
Case 2: k=4p+3
◦ ki=-1, k=2p+2 (even) => next NAF bit is 0
Scalar Multiplication with NAF
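Expected run time = (m/3) A + m D
Normal (binary) run time = (m/2) A + m D
Here A is a point addition, D is a point doubling and m is the bit length of the scalar.
Note that the number of doublings is unchanged. Later we will see a method that removes doublings altogether.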
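A sketch of how NAF digits drive a scalar multiplication and how the A and D counts arise. It assumes the standard left-to-right order. The group is passed in as callbacks (`add`, `neg`, `dbl`, `zero` are hypothetical names), and plain integers under addition stand in for curve points so that the result is easy to check.

```python
def naf(k):
    """NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        d = 2 - (k % 4) if k % 2 else 0
        k -= d
        digits.append(d)
        k //= 2
    return digits


def naf_scalar_mult(k, P, add, neg, dbl, zero):
    """Left-to-right NAF method. Returns (kP, number of additions, number of doublings)."""
    Q, n_add, n_dbl = zero, 0, 0
    for d in reversed(naf(k)):
        Q = dbl(Q)
        n_dbl += 1
        if d == 1:
            Q = add(Q, P)
            n_add += 1
        elif d == -1:
            Q = add(Q, neg(P))   # subtraction costs the same as an addition
            n_add += 1
    return Q, n_add, n_dbl


# Stand-in group (an assumption for illustration): integers under addition, P = 1.
def int_add(a, b): return a + b
def int_neg(a): return -a
def int_dbl(a): return 2 * a

print(naf_scalar_mult(29, 1, int_add, int_neg, int_dbl, 0))  # (29, 3, 6)
```

For k=29 the NAF needs 3 additions, against 4 for the binary form (1,1,1,0,1), at the cost of one extra digit.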
Width w-NAF
k=29, w=3
w-NAF digits = (1,0,0,0,0,-3)
29 = 1·32 - 3
Pros:
Density of non-zero terms =1/(w+1)
Cons:
Pre-computation required, this means storage in hardware
The length is the same as that of the ordinary NAF.
Algorithm for width w-NAF generation

$u \equiv k \pmod{2^w}$, with $-2^{w-1} \le u < 2^{w-1}$

k=29, w=3
◦ k0=-3, k=16
◦ k1=0, k=8
◦ k2=0, k=4
◦ k3=0, k=2
◦ k4=0, k=1
◦ k5=1, k=0 (algorithm terminates)
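The steps of this example can be sketched as follows. This is a minimal sketch, assuming the signed residue u is taken in the range -2^(w-1) <= u < 2^(w-1) whenever k is odd; the name `wnaf` is illustrative.

```python
def wnaf(k: int, w: int) -> list[int]:
    """Width-w NAF digits of k >= 0, least significant digit first (w >= 2)."""
    digits = []
    while k > 0:
        if k % 2 == 1:
            u = k % (1 << w)            # residue in [0, 2^w)
            if u >= (1 << (w - 1)):     # move it to the signed range
                u -= (1 << w)
            k -= u                      # k is now divisible by 2^w
        else:
            u = 0
        digits.append(u)
        k //= 2
    return digits


print(wnaf(29, 3)[::-1])  # most significant digit first: [1, 0, 0, 0, 0, -3]
```

With w=2 the same function yields the ordinary NAF.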
Scalar Multiplication with width w-NAF
Pre-computation: 1 D + ($2^{w-2}$ - 1) A
Expected run time = m/(w+1) A + m D
Normal (binary) run time = (m/2) A + m D
Hence an architecture based on width-w NAF incurs an initial pre-computation phase and must store the pre-computed points.
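The pre-computation cost and the main-loop cost can be made concrete with a small sketch. It assumes a left-to-right loop and a table of the odd multiples P, 3P, ..., (2^(w-1)-1)P; the callback names and the use of integers as a stand-in group are assumptions for illustration only.

```python
def wnaf(k, w):
    """Width-w NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        u = 0
        if k % 2 == 1:
            u = k % (1 << w)
            if u >= (1 << (w - 1)):
                u -= (1 << w)
            k -= u
        digits.append(u)
        k //= 2
    return digits


def wnaf_scalar_mult(k, P, w, add, neg, dbl, zero):
    """Returns (kP, pre-computation additions, main-loop additions, main-loop doublings)."""
    # Pre-computation: one doubling (2P) plus 2^(w-2) - 1 additions.
    two_p = dbl(P)
    table, pre_add = {1: P}, 0
    for i in range(3, 1 << (w - 1), 2):
        table[i] = add(table[i - 2], two_p)
        pre_add += 1
    Q, n_add, n_dbl = zero, 0, 0
    for u in reversed(wnaf(k, w)):
        Q = dbl(Q)
        n_dbl += 1
        if u > 0:
            Q = add(Q, table[u])
            n_add += 1
        elif u < 0:
            Q = add(Q, neg(table[-u]))
            n_add += 1
    return Q, pre_add, n_add, n_dbl


# Stand-in group (an assumption for illustration): integers under addition, P = 1.
def int_add(a, b): return a + b
def int_neg(a): return -a
def int_dbl(a): return 2 * a

print(wnaf_scalar_mult(29, 1, 3, int_add, int_neg, int_dbl, 0))  # (29, 1, 2, 6)
```

For k=29 and w=3 the main loop needs only 2 additions, but one extra addition and one doubling are spent on the table, which is why the gain only shows for long scalars.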
Koblitz Curves
The previous methods did not reduce the number of doubling operations.
Koblitz introduced a set of curves on which scalar multiplication does not require any doubling. The curves are named after him.
• Koblitz curves are a special class of elliptic curves, defined over GF($2^m$) by
$E_a: y^2 + xy = x^3 + ax^2 + 1$, where the elliptic curve parameter $a \in \{0, 1\}$.
• Koblitz curves are computationally efficient compared to random curves, as the Frobenius map can be used to accelerate scalar multiplication.
Choice of the curve

Choice of the curve depends on a, which can be either 0 or 1.

As we have seen, the points of an elliptic curve form a group.
◦ The group should be chosen so that the ECDLP is difficult.
◦ The number of elements in the elliptic curve group is called the order of the group.
◦ For security, the order of the group should be very nearly prime (the product of a large prime and a small integer cofactor),
 ▪ as otherwise the group has large proper subgroups, whose orders divide the group order, and this makes the curve cryptographically weak.
◦ The field elements belong to GF($2^m$).
 ▪ For every d | m, $E_a$(GF($2^d$)) is a subgroup of $E_a$(GF($2^m$)).
 ▪ If m is prime, then d=1 or d=m. Thus the only proper subgroup of this kind is $E_0$(GF(2)) or $E_1$(GF(2)), depending on a.
 ▪ It can be easily checked that:
 ▪ $E_0$(GF(2)) = {O, (0,1), (1,0), (1,1)}
 ▪ $E_1$(GF(2)) = {O, (0,1)}
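The two small groups over GF(2) can be checked by brute force. This is a minimal sketch, assuming the Koblitz curve equation y^2 + xy = x^3 + a x^2 + 1; the function name is illustrative, and the string "O" stands for the point at infinity.

```python
def points_over_gf2(a: int) -> list:
    """Points of y^2 + xy = x^3 + a*x^2 + 1 over GF(2), including the point at infinity 'O'."""
    pts = ["O"]
    for x in (0, 1):
        for y in (0, 1):
            lhs = (y * y + x * y) % 2
            rhs = (x ** 3 + a * x * x + 1) % 2
            if lhs == rhs:
                pts.append((x, y))
    return pts


for a in (0, 1):
    print(a, points_over_gf2(a))
# 0 ['O', (0, 1), (1, 0), (1, 1)]
# 1 ['O', (0, 1)]
```

The group orders 4 (for a=0) and 2 (for a=1) are the small cofactors of the full group order over GF(2^m).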
An Interesting Property
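• The curve satisfies: $(x^4, y^4) + 2(x, y) = \mu (x^2, y^2)$, where $\mu = (-1)^{1-a}$.
• Define the Frobenius map as: $\tau(x, y) = (x^2, y^2)$, with $\tau(O) = O$.
• The Frobenius map follows the relation $\tau^2 + 2 = \mu\tau$, where $\tau = (\mu + \sqrt{-7})/2$.
• For a point P on the Koblitz curve, we can use this property of the Frobenius map to compute 2P: $2P = \mu\tau(P) - \tau^2(P)$.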
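The relation (x^4, y^4) + 2(x, y) = µ(x^2, y^2) can be checked exhaustively on a toy field. The sketch below assumes GF(2^5) with the reduction polynomial x^5 + x^2 + 1 and the standard affine group law for y^2 + xy = x^3 + a x^2 + 1; the field size and all function names are choices made for this illustration, not taken from the slides.

```python
M, POLY = 5, 0b100101            # GF(2^5), reduction polynomial x^5 + x^2 + 1

def fmul(a, b):
    """Field multiplication: shift-and-add with reduction modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def finv(a):
    """Field inverse as a^(2^M - 2), by square-and-multiply."""
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = fmul(r, a)
        a = fmul(a, a)
        e >>= 1
    return r

def neg(P):
    return None if P is None else (P[0], P[0] ^ P[1])   # -(x, y) = (x, x + y)

def add(P, Q, a):
    """Affine point addition; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 != y2 or x1 == 0:
            return None                                   # Q = -P
        lam = x1 ^ fmul(y1, finv(x1))                     # doubling
        x3 = fmul(lam, lam) ^ lam ^ a
        return (x3, fmul(x1, x1) ^ fmul(lam ^ 1, x3))
    lam = fmul(y1 ^ y2, finv(x1 ^ x2))
    x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ a
    return (x3, fmul(lam, x1 ^ x3) ^ x3 ^ y1)

def tau(P):
    return (fmul(P[0], P[0]), fmul(P[1], P[1]))           # Frobenius map

def on_curve(x, y, a):
    x2 = fmul(x, x)
    return fmul(y, y) ^ fmul(x, y) == fmul(x2, x) ^ fmul(a, x2) ^ 1

def frobenius_relation_holds(a):
    """Check tau^2(P) + 2P == mu * tau(P) for every affine point P of E_a."""
    mu = 1 if a == 1 else -1
    pts = [(x, y) for x in range(1 << M) for y in range(1 << M) if on_curve(x, y, a)]
    for P in pts:
        lhs = add(tau(tau(P)), add(P, P, a), a)
        rhs = tau(P) if mu == 1 else neg(tau(P))
        if lhs != rhs:
            return False
    return len(pts) > 0

print(frobenius_relation_holds(0), frobenius_relation_holds(1))
```

The check uses two field squarings per application of the Frobenius map, which is the reason the map is so cheap in hardware compared with a point doubling.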
τ-adic NAF

The scalar k can be represented as a polynomial $k = \sum_{i=0}^{l-1} u_i \tau^i$ with $u_i \in \{0, \pm 1\}$, where τ is the indeterminate.
◦ This sum is analogous to the binary expansion.
◦ The scalar is said to belong to the ring Z[τ].
◦ It can be proved that the τ-adic NAF representation is unique.
Divisibility by τ
In order to generate this NAF, we divide the element by τ, like we divided by 2 in
the binary NAF.
As it is a NAF, the remainder is generated such that the next NAF digit is zero.
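The division step can be written out explicitly. The derivation below is a sketch that assumes the element is written as r0 + r1·τ with integers r0, r1 (these names are introduced here for illustration) and uses the curve relation in the form τ² + 2 = µτ. The digit rule in the last line is the commonly used one for the τ-adic NAF.

```latex
\begin{align*}
\tau^2 + 2 = \mu\tau
  &\;\Longrightarrow\; 2 = \tau(\mu - \tau)
  \;\Longrightarrow\; \frac{2}{\tau} = \mu - \tau, \\
\frac{r_0 + r_1\tau}{\tau}
  &= r_1 + \frac{r_0}{2}\cdot\frac{2}{\tau}
   = \Bigl(r_1 + \mu\,\frac{r_0}{2}\Bigr) - \frac{r_0}{2}\,\tau
   \qquad (r_0 \text{ even}), \\
u &= \begin{cases}
      0, & r_0 \text{ even}, \\
      2 - \bigl((r_0 - 2r_1) \bmod 4\bigr), & r_0 \text{ odd}.
    \end{cases}
\end{align*}
```

So an element is divisible by τ exactly when r0 is even, and the division itself needs only a halving, a sign change and one addition.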
Algorithm for τ-adic NAF generation
k=29, a=0 (so µ=-1).
The τ-adic NAF is (-1,0,-1,0,0,-1,0,0,0,-1,0,1) => $29 = 1 - \tau^2 - \tau^6 - \tau^9 - \tau^{11}$
$29P = P - \tau^2(P) - \tau^6(P) - \tau^9(P) - \tau^{11}(P)$
$29P = (x, y) - (x^4, y^4) - (x^{64}, y^{64}) - (x^{512}, y^{512}) - (x^{2048}, y^{2048})$
Thus the scalar multiplication avoids doubling entirely; instead it performs cheap squaring operations.
Note that the length is about twice that of the binary expansion (here 12 digits against 5 bits), hence a reduction is necessary.
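A τ-adic NAF can be generated and checked with a few lines of Python. This is a minimal sketch, assuming the usual digit rule u = 2 - ((r0 - 2·r1) mod 4) for odd r0 and the relation τ² = µτ - 2; the names `tnaf` and `evaluate` are illustrative, and the scalar is not reduced before expansion.

```python
def tnaf(k: int, mu: int) -> list[int]:
    """tau-adic NAF digits of the integer k, least significant first; mu = (-1)**(1 - a)."""
    r0, r1 = k, 0                       # current element is r0 + r1*tau
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 % 2 == 1:
            u = 2 - ((r0 - 2 * r1) % 4)   # +1 or -1, chosen so the next digit is 0
            r0 -= u
        else:
            u = 0
        digits.append(u)
        half = r0 // 2
        r0, r1 = r1 + mu * half, -half    # divide by tau
    return digits


def evaluate(digits, mu):
    """Horner evaluation of sum(u_i * tau^i); returns (c0, c1), meaning c0 + c1*tau."""
    c0, c1 = 0, 0
    for u in reversed(digits):
        # (c0 + c1*tau) * tau = -2*c1 + (c0 + mu*c1)*tau, because tau^2 = mu*tau - 2
        c0, c1 = -2 * c1 + u, c0 + mu * c1
    return c0, c1


d = tnaf(29, -1)
print(d[::-1])          # [-1, 0, -1, 0, 0, -1, 0, 0, 0, -1, 0, 1]
print(evaluate(d, -1))  # (29, 0)
```

The second print confirms that the digit string evaluates back to the integer 29 with no τ component.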
Reduction of the scalar


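$\tau^m(P) = P$, because $x^{2^m} = x$ for every x in GF($2^m$) [from Fermat's Little Theorem]
$(\tau^m - 1)(P) = O$
◦ Hence, if $\gamma \equiv k \pmod{\tau^m - 1}$, then $\gamma(P) = k(P)$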
Reduction of Scalar
• Solinas presented an efficient algorithm for reduction of a scalar. The algorithm involves integer multiplication and is thus costly for hardware implementations.
• An alternative approach known as Lazy Reduction was proposed by Brumley and Järvinen. It uses the observation that division by τ is cheap.
• The algorithm uses multiplication and division by 2 and integer additions.
 ▪ Implementation is simple and the area requirement is small.
• The algorithm takes m clock cycles.
 ▪ So, Lazy.
Scalar Multiplication with τ-adic NAF
Expected Run time = m/3 A
Normal Run time = m/2 A + mD
Summary
• Basic steps of scalar multiplication on Koblitz curves:
 ▪ Reduction of the scalar.
 ▪ τNAF generation from the reduced scalar.
 ▪ Point addition for nonzero τNAF digits. Point addition is performed in the Lopez-Dahab projective co-ordinate system.
 ▪ Point squaring for every τNAF digit.
 ▪ Final field inversion to transform the scalar multiplication result from the projective co-ordinate system into the affine co-ordinate system.
• Our Koblitz curve scalar multiplier uses the simple τNAF method for scalar multiplication.
Top Level Architecture
Architecture for Reduction of Scalar
• An arrangement of adders and shift circuits performs the reduction of the scalar. Here u is the LSB of c0. Registers store the intermediate values. The control unit generates the control signals for the multiplexers and the write-enable signals for the storage registers. Initially, the storage register for c0 contains the value of the scalar.
T-NAF Generation from Reduced Scalar
Reduced scalar: r0 = b0 + c0, r1 = b1 + c1
Each digit can be found by observing the last two bits of c0 and c1.
T-NAF digits are generated after the reduction of the scalar. As the algorithm performs only integer additions and subtractions, the adders of the reduction circuit can be reused to generate the T-NAF digits.
Architecture for Reduction & T-NAF Generation
• The left portion of the circuit is used to generate τNAF digits. The τNAF generation and reduction hardware share the adders and registers. During reduction, the control signal M is set to 0. After the reduction is over, τNAF generation starts and M is changed to 1.
Choice of Scalar Multiplication Algorithm
• There are two scalar multiplication algorithms in the literature:
 ◦ Process the scalar starting from the MSB (Left-to-Right).
 ◦ Process the scalar starting from the LSB (Right-to-Left).
• The Left-to-Right algorithm first computes the entire τNAF of the reduced scalar and then starts processing the τNAF from the MSB.
• So, it waits for the entire τNAF generation, and this takes nearly m clock cycles in GF($2^m$).
• Additionally, at every iteration, Q is squared. So, while a point addition is in progress, we cannot perform point squaring in parallel.
 ▪ But squaring is cheap in hardware, and the algorithm does not use this advantage of parallel processing.
Fast Scalar Multiplication Algorithm
• The Right-to-Left algorithm for scalar multiplication processes the τNAF digits from the LSB: for every nonzero digit u it updates Q ← Q + uP, and for every digit it updates P ← τ(P).
• The scalar multiplication does not wait for the entire τNAF of the scalar. As soon as the LSB, i.e. the first τNAF digit, is generated, the scalar multiplication starts.
• Additionally, point addition updates only Q, and point squaring is independent of Q.
• So, we can use the fact that point squaring is cheap in hardware and can perform it in parallel with point addition.
 ▪ So, we select this Right-to-Left algorithm for scalar multiplication.
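The Right-to-Left processing can be sketched in software on a toy Koblitz curve. The sketch assumes GF(2^5) with reduction polynomial x^5 + x^2 + 1, affine arithmetic instead of Lopez-Dahab co-ordinates, and no scalar reduction; all names are hypothetical and the hardware parallelism is not modelled. Each τNAF digit is consumed as soon as it is generated.

```python
M, POLY = 5, 0b100101            # toy field GF(2^5), x^5 + x^2 + 1 (assumption)

def fmul(a, b):
    """Field multiplication: shift-and-add with reduction modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def finv(a):
    """Field inverse as a^(2^M - 2)."""
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = fmul(r, a)
        a = fmul(a, a)
        e >>= 1
    return r

def neg(P):
    return None if P is None else (P[0], P[0] ^ P[1])

def add(P, Q, a):
    """Affine addition on y^2 + xy = x^3 + a*x^2 + 1; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 != y2 or x1 == 0:
            return None
        lam = x1 ^ fmul(y1, finv(x1))
        x3 = fmul(lam, lam) ^ lam ^ a
        return (x3, fmul(x1, x1) ^ fmul(lam ^ 1, x3))
    lam = fmul(y1 ^ y2, finv(x1 ^ x2))
    x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ a
    return (x3, fmul(lam, x1 ^ x3) ^ x3 ^ y1)

def tau(P):
    return (fmul(P[0], P[0]), fmul(P[1], P[1]))

def tnaf_mult_rtl(k, P, a):
    """Right-to-left tau-NAF scalar multiplication: returns kP."""
    mu = 1 if a == 1 else -1
    Q, r0, r1 = None, k, 0
    while r0 != 0 or r1 != 0:
        if r0 % 2 == 1:
            u = 2 - ((r0 - 2 * r1) % 4)              # next tau-NAF digit, +1 or -1
            r0 -= u
            Q = add(Q, P if u == 1 else neg(P), a)   # Q <- Q + u*P
        half = r0 // 2
        r0, r1 = r1 + mu * half, -half               # divide the scalar by tau
        P = tau(P)                                   # P <- tau(P), independent of Q
    return Q

def naive_mult(k, P, a):
    Q = None
    for _ in range(k):
        Q = add(Q, P, a)
    return Q

def first_point(a):
    """First affine point with x != 0 found by exhaustive search."""
    for x in range(1, 1 << M):
        for y in range(1 << M):
            x2 = fmul(x, x)
            if fmul(y, y) ^ fmul(x, y) == fmul(x2, x) ^ fmul(a, x2) ^ 1:
                return (x, y)

P0 = first_point(0)
print(tnaf_mult_rtl(29, P0, 0) == naive_mult(29, P0, 0))
```

The update of P never reads Q, which is the property that lets the hardware run the squarer alongside the point addition unit.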
Point Addition Unit
• The point addition unit performs point addition in the Lopez-Dahab co-ordinate system and takes 8 clock cycles. Initially, the three registers (RA1, RB1, RC1) are initialized with the base point (Px, Py, 1). After every point addition, the result Q ← Q+P is stored in the registers (RA1, RB1, RC1). In the figure, P = (RA2, RB2). A field multiplication is performed in every clock cycle, and the multiplier is of Hybrid Karatsuba type. Control signals drive the multiplexers and the write-enable signals of the storage registers.
Point Squaring Unit
• During scalar multiplication, point squaring is performed in every clock cycle. The base point is updated as P(x, y) ← P($x^2$, $y^2$). Point squarings are performed using dedicated squarer circuits, as squarers are cheap.
• The scalar multiplication algorithm shows that point squaring is independent of point addition.
• A nonzero τNAF digit is followed by at least one zero digit (NAF property), and often by several. So, during point addition, we can continue point squaring in parallel until another nonzero τNAF digit appears.
Point Squaring Unit
 ▪ The τNAF digits are generated from the LSB side. Consider a portion of the entire τNAF, <. . . . . .1 0 0 0 0 0 1 . . . . .>.
 ▪ For the first 1, a point addition is required, and this point addition takes 8 clock cycles.
 ▪ The algorithm shows that, for a nonzero τNAF digit u, the point addition uses the present value of P.
 ▪ With purely sequential processing, after performing the point addition for the first 1, the algorithm performs 6 point squarings for the sequence <0 0 0 0 0 1>. This requires 6 clock cycles.
 ▪ As P is independent of Q, we can perform the 6 point squarings in parallel with the point addition (which takes 8 clock cycles), thus saving 6 clock cycles.
 ▪ When the next nonzero digit appears in <. . 1 0 0 0 0 0 1 . . >, we must stop this parallel processing of zeros, as the last updated value of P will be required during the next point addition.
Architecture for Point Squaring Unit
• The point P(x, y) is in affine co-ordinates, and two dedicated squarers are used for squaring the x and y co-ordinates.
• Initially the registers are assigned the base point. When the scalar multiplication starts, point squaring is performed for every digit and the registers are updated.
• A write enable signal en protects the content of the registers from unnecessary squarings, especially in the case (another nonzero digit) mentioned on the previous slide.
Architecture for Inversion
• Scalar multiplication, when done in the Lopez-Dahab co-ordinate system, requires a final inversion after processing the entire scalar.
• For ECC, Itoh-Tsujii inversion is efficient.
• In a field GF($2^m$), the inverse of an element a is $a^{-1} = a^{2^m - 2}$.
• Using the quad operation (raising to the fourth power), we can compute the inverse. The example is for the field GF($2^{233}$).
[Table: inversion steps for GF($2^{233}$)]
• This requires multiplications and repeated quad operations. We can implement this using a multiplier and quad circuits.
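The structure of Itoh-Tsujii inversion can be sketched in software. The sketch below uses plain squarings rather than the quad circuits of the slides (one quad equals two squarings), a binary addition chain on m-1, and the trinomial x^233 + x^74 + 1 as reduction polynomial; the function names and the bit-serial multiplier are assumptions made for illustration and say nothing about the hardware multiplier.

```python
M = 233
POLY = (1 << 233) | (1 << 74) | 1       # x^233 + x^74 + 1

def fmul(a, b):
    """Multiply in GF(2^M): shift-and-add with reduction modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def fsqr_n(a, n):
    """Square a repeatedly, n times: returns a^(2^n)."""
    for _ in range(n):
        a = fmul(a, a)
    return a

def itoh_tsujii_inverse(a):
    """a^(-1) = (a^(2^(M-1) - 1))^2, built with an addition chain on M - 1."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    beta, k = a, 1                            # invariant: beta = a^(2^k - 1)
    for bit in bin(M - 1)[3:]:                # bits of M-1 after the leading 1
        beta = fmul(fsqr_n(beta, k), beta)    # exponent 2^(2k) - 1
        k *= 2
        if bit == "1":
            beta = fmul(fmul(beta, beta), a)  # exponent 2^(k+1) - 1
            k += 1
    return fmul(beta, beta)                   # a^(2^M - 2)

a = 0x1234567890ABCDEF
print(fmul(a, itoh_tsujii_inverse(a)) == 1)
```

The loop spends only a handful of multiplications; almost all of the work is repeated squaring, which is why a fast cascade of squaring or quad circuits pays off.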
Architecture for Inversion
• The figure shows the basic block diagram of the inversion unit. The multiplier is actually a part of the point addition unit; it is shared between the point addition unit and the inversion unit.
• It can be seen from the previous slide that there are repeated quad operations. For example, step 7 requires 14 successive quad operations. With a single quad circuit, this exponentiation takes 14 clock cycles. To reduce the number of clock cycles, we use a cascade of several quad circuits. This cascade of quad circuits is called a Quadblock.
Architecture for Quadblock …
• Here is an example of a Quadblock which contains 11 cascaded quad circuits. So, an element a can be raised to at most $a^{4^{11}}$ in one pass.
• A multiplexer is used to select intermediate results, i.e. the output of any quad circuit in the cascade.
• To raise an element to a power that needs more quad operations than there are cascaded quad circuits, the Quadblock must be applied repeatedly. So, the number of clock cycles depends on the number of quad circuits. For example, an exponentiation that needs between 12 and 22 quad operations takes two clock cycles.
• The number of clock cycles reduces if we increase the number of quad circuits, but the delay increases. So the design must balance the delay against the number of quad circuits.
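The clock-cycle count of a Quadblock follows a simple ceiling rule. This is a sketch under the assumption that one pass through the Quadblock takes one clock cycle and that the multiplexer can tap the output of any quad circuit; the function name is illustrative.

```python
import math

def quadblock_cycles(n_quads: int, cascade: int) -> int:
    """Clock cycles needed for n_quads successive quad operations
    with a block of `cascade` cascaded quad circuits."""
    return math.ceil(n_quads / cascade)


print(quadblock_cycles(14, 1))   # 14 cycles with a single quad circuit
print(quadblock_cycles(14, 11))  # 2 cycles with 11 cascaded quad circuits
```

A longer cascade lowers this count but lengthens the combinational path, which is the delay trade-off described above.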
Experimental Performance
• Experimentation was performed on a Xilinx Virtex V FPGA for GF($2^{283}$).
• The scalar multiplier on a random curve in the field GF($2^{283}$) has an area of around 40,000 LUTs, a frequency of 37 MHz and a computation time of 63 microseconds.
• The Koblitz curve scalar multiplier (in the first stage of implementation), which uses τNAF in GF($2^{283}$), has an area of 41,300 LUTs, a frequency of 31 MHz and an average computation time of 35 microseconds.
• It can be seen that the Koblitz curve crypto processor takes almost half the computation time of the random curve crypto processor (35 against 63 microseconds).
Further Acceleration
• We have found a novel technique to reduce the number of point additions during scalar multiplication, using a new representation of the scalar.
• For any scalar, we have found that this representation is close to half as long as the one used before.
• However, there is an overhead of a small amount of pre-computation and an increased area.
• On a Virtex IV FPGA, scalar multiplication using the new representation for the field GF($2^{283}$) saves 35% computation time compared to the earlier method.
Thank You