Elliptic Curve Cryptography 5th - CSE IIT Kgp

IMPLEMENTATION OF
FINITE FIELD INVERSION
Debdeep Mukhopadhyay
Chester Rebeiro
Dept. of Computer Science and Engineering
Indian Institute of Technology Kharagpur
INDIA
Finite Field Inverse
23-27 May 2011
Anurag Labs, DRDO
Itoh-Tsujii Method for Binary Fields
The Steps
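As a rough illustration of the steps, the sketch below follows the textbook Itoh-Tsujii formulation: a^(-1) = (a^(2^(m-1) - 1))^2, where a^(2^(m-1) - 1) is built up along an addition chain for m-1. The helper names (gf_mul, gf_pow2k, ita_inverse), the choice of a binary addition chain and the small test field are assumptions made for illustration, not taken from the slides.

```python
def gf_mul(a, b, poly, m):
    """Multiply a and b in GF(2^m); poly includes the x^m term."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:          # degree reached m: reduce
            a ^= poly
    return r


def gf_pow2k(a, k, poly, m):
    """Return a^(2^k) by k repeated squarings."""
    for _ in range(k):
        a = gf_mul(a, a, poly, m)
    return a


def ita_inverse(a, poly, m):
    """Inverse of a nonzero element of GF(2^m); m >= 2 is assumed."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    beta, k = a, 1                # invariant: beta == a^(2^k - 1)
    # Binary addition chain for m-1: scan its bits below the leading 1.
    for bit in bin(m - 1)[3:]:
        beta = gf_mul(gf_pow2k(beta, k, poly, m), beta, poly, m)   # k -> 2k
        k *= 2
        if bit == "1":
            beta = gf_mul(gf_pow2k(beta, 1, poly, m), a, poly, m)  # k -> k+1
            k += 1
    return gf_mul(beta, beta, poly, m)   # final squaring


# GF(2^4) with x^4 + x + 1: every nonzero element times its inverse is 1.
POLY, M = 0b10011, 4
assert all(gf_mul(a, ita_inverse(a, POLY, M), POLY, M) == 1 for a in range(1, 16))
```

The multiplications happen only at the addition-chain steps; everything else is repeated squaring, which is why the cost of the squarer dominates the design discussion that follows.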
How do we do a Squaring
Consider (again) the field GF(2^4), with
irreducible polynomial x^4 + x + 1.
What is (x^3 + x^2 + 1)^2 in this field?
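The question can be checked with a short sketch. Elements are stored as integers whose bit i is the coefficient of x^i; the function name gf_square is an illustrative choice. Squaring over GF(2) only spreads the bits (x^i goes to x^(2i)), after which the result is reduced modulo x^4 + x + 1.

```python
M = 4
POLY = 0b10011  # x^4 + x + 1


def gf_square(a, poly=POLY, m=M):
    # Step 1: spread the bits, coefficient of x^i moves to x^(2i).
    s = 0
    for i in range(m):
        if (a >> i) & 1:
            s |= 1 << (2 * i)
    # Step 2: reduce modulo the irreducible polynomial, highest degree first.
    for i in range(2 * m - 2, m - 1, -1):
        if (s >> i) & 1:
            s ^= poly << (i - m)
    return s


a = 0b1101                  # x^3 + x^2 + 1
print(bin(gf_square(a)))    # 0b1110, i.e. x^3 + x^2 + x
```

By hand: (x^3 + x^2 + 1)^2 = x^6 + x^4 + 1, and with x^4 = x + 1 and x^6 = x^3 + x^2 this reduces to x^3 + x^2 + x.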

Squaring
Squaring can be represented in the form of a
matrix multiplication: T·a
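A minimal sketch of this view, assuming the field GF(2^4) with x^4 + x + 1 from the previous example: column j of T is the square of the basis element x^j, and T·a is evaluated over GF(2). The names squaring_matrix and mat_vec, and the row-bitmask storage, are illustrative assumptions.

```python
M = 4
POLY = 0b10011  # x^4 + x + 1


def gf_square(a):
    """Square a in GF(2^4): spread the bits, then reduce."""
    s = 0
    for i in range(M):
        if (a >> i) & 1:
            s |= 1 << (2 * i)
    for i in range(2 * M - 2, M - 1, -1):
        if (s >> i) & 1:
            s ^= POLY << (i - M)
    return s


def squaring_matrix():
    # Column j of T is the square of the basis element x^j.
    cols = [gf_square(1 << j) for j in range(M)]
    # Store T as M row bitmasks: bit j of rows[i] is the entry T[i][j].
    rows = [0] * M
    for j, col in enumerate(cols):
        for i in range(M):
            if (col >> i) & 1:
                rows[i] |= 1 << j
    return rows


def mat_vec(rows, a):
    # Output bit i is the GF(2) dot product (parity of AND) of row i with a.
    out = 0
    for i, row in enumerate(rows):
        if bin(row & a).count("1") & 1:
            out |= 1 << i
    return out


T = squaring_matrix()
# T*a agrees with direct squaring for every field element.
assert all(mat_vec(T, a) == gf_square(a) for a in range(1 << M))
```

Each row of T lists the input bits that are XORed to form one output bit, which is what a hardware squarer implements.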
Quad Operation
• A quad operation can be done by two squaring operations.
• A quad operation can be written in the form T^2·a
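A small sketch of the two statements, assuming GF(2^4) with x^4 + x + 1. The squaring matrix T is hard-coded here as row bitmasks (row i lists the input bits XORed into output bit i); it was derived by hand for this polynomial and, like the helper names, is an illustrative assumption.

```python
M = 4
# Squaring matrix for x^4 + x + 1: d0 = a0+a2, d1 = a2, d2 = a1+a3, d3 = a3.
T = [0b0101, 0b0100, 0b1010, 0b1000]


def mat_vec(rows, a):
    # Output bit i is the parity of (row i AND a).
    out = 0
    for i, row in enumerate(rows):
        if bin(row & a).count("1") & 1:
            out |= 1 << i
    return out


def mat_mul(A, B):
    # Row i of A*B is the XOR of the rows of B selected by the bits of A's row i.
    out = []
    for row in A:
        acc = 0
        for j in range(M):
            if (row >> j) & 1:
                acc ^= B[j]
        out.append(acc)
    return out


T2 = mat_mul(T, T)  # quad matrix

# One pass through T2 equals two passes through T.
assert all(mat_vec(T2, a) == mat_vec(T, mat_vec(T, a)) for a in range(1 << M))
```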

Advantage of using Quad
Operations

Quad circuits have better LUT utilization
compared to Squarer circuits
Generalization of the Itoh-Tsujii
Algorithm
Theorem 1
Theorem 2
Quad Itoh-Tsujii Inversion
Algorithm
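A hedged sketch of a quad-based inversion, assuming an odd field degree m. It relies only on the identity a^(-1) = (a^(4^((m-1)/2) - 1))^2 and on building alpha_k = a^(4^k - 1) with alpha_(k+j) = (alpha_k)^(4^j) · alpha_j. The function names, the binary addition chain and the test fields are assumptions for illustration and need not match the algorithm listing on the slide.

```python
def gf_mul(a, b, poly, m):
    """Multiply a and b in GF(2^m); poly includes the x^m term."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return r


def quad(a, times, poly, m):
    """Apply the quad operation (a -> a^4) the given number of times."""
    for _ in range(2 * times):
        a = gf_mul(a, a, poly, m)
    return a


def quad_ita_inverse(a, poly, m):
    """Inverse of nonzero a in GF(2^m) for odd m >= 3 (assumed)."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    if m < 3 or m % 2 == 0:
        raise ValueError("this sketch assumes odd m >= 3")
    alpha1 = gf_mul(gf_mul(a, a, poly, m), a, poly, m)   # a^3 = a^(4^1 - 1)
    alpha, k = alpha1, 1          # invariant: alpha == a^(4^k - 1)
    # Binary addition chain for (m-1)/2: scan bits below the leading 1.
    for bit in bin((m - 1) // 2)[3:]:
        alpha = gf_mul(quad(alpha, k, poly, m), alpha, poly, m)       # k -> 2k
        k *= 2
        if bit == "1":
            alpha = gf_mul(quad(alpha, 1, poly, m), alpha1, poly, m)  # k -> k+1
            k += 1
    return gf_mul(alpha, alpha, poly, m)   # final squaring


# GF(2^5) with x^5 + x^2 + 1.
POLY, M = 0b100101, 5
assert all(gf_mul(a, quad_ita_inverse(a, POLY, M), POLY, M) == 1
           for a in range(1, 1 << M))
```

Compared with a squarer-based chain, the chain here runs over (m-1)/2 rather than m-1, at the cost of first computing a^3.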
A Circuit for Inversion
• At every clock cycle, either the multiplier or the quadblock is active.
• The output of the multiplier is stored in the mout register.

Finding the Inverse
Finding the Inverse Step 2
Finding the Inverse Step 2
Control Signals for the Inverse
Performance Charts
Higher Powered Itoh-Tsujii
• We have seen that quad circuits utilize LUTs better than squarer circuits.
• LUT size is also increasing as silicon technology scales down.
• We have seen the 4-LUT become the 6-LUT, and now the 8-LUT.
• This motivates investigating power circuits of higher order than quad circuits.
Revisiting the Theorems
2^n Itoh-Tsujii Inversion
These are the overheads
Higher Powered
Overhead in 2^n Itoh-Tsujii
• Computation of […].
• Using addition chain for […], […] can be computed in […] clock cycles, where […] is the length of the addition chain for […].
• Computation of […], for […].
• Using addition chain for […], that contains […], […] can be computed during […] computation, because […].
2^n Itoh-Tsujii Design
Building the Optimal Design
For a given field and a given FPGA, how do we decide
the optimal design?
Configurable Parameters
• Addition chain.
• Power circuit used in power block.
• Number of cascaded power
circuits in the power block.
• These have an effect on
– Number of clock cycles.
– Critical path delay.
Estimating AREA required on an
FPGA
• A k-input LUT (k-LUT) can implement any function of at most k input variables.
• Total number of k-LUTs to implement a function of […] variables can be
expressed as
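One way to make such a count concrete, as a sketch under a stated assumption: if the function can be decomposed into a tree of k-input nodes (a wide XOR is the typical case in field arithmetic), each LUT merges k signals into one and so removes k-1 signals, giving ceil((x-1)/(k-1)) LUTs for x variables. The function name lut and this closed form are assumptions for illustration and may differ in presentation from the expression on the slide.

```python
def lut(x, k):
    """Estimated number of k-LUTs for a tree-decomposable function of x variables."""
    if x <= 1:
        return 0                      # a single signal needs no logic
    return -(-(x - 1) // (k - 1))     # ceil((x - 1) / (k - 1))


# A few values for 4-LUTs and 6-LUTs.
for x in (1, 4, 5, 7, 8):
    print(x, lut(x, 4), lut(x, 6))
```

For example, a 7-input XOR needs two 4-LUTs (one for four inputs, one for its output plus the remaining three) but two 6-LUTs as well, while a 6-input XOR needs two 4-LUTs and only one 6-LUT.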
Estimating Delay of a Design in an
FPGA
• Delay in FPGAs comprises LUT delay and routing delay.
• For this ITA architecture, we have found experimentally that the total delay is
proportional to the number of LUTs in the critical path.
• We denote the number of LUTs in a delay path as maxlutpath.
• For k-LUTs, the maxlutpath of a function of […] variables is
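As a sketch of such a depth estimate, assume the function is mapped to a balanced tree of k-LUTs; the number of LUT levels is then ceil(log_k(x)) for x variables. The function below computes this with integer arithmetic to avoid floating-point rounding. Its name follows the maxlutpath term used above, but the closed form is an assumption for illustration.

```python
def maxlutpath(x, k):
    """Levels of k-LUTs in a balanced tree over x inputs: ceil(log_k(x))."""
    depth, reach = 0, 1
    while reach < x:      # 'reach' is the number of inputs 'depth' levels can cover
        reach *= k
        depth += 1
    return depth


for x in (1, 4, 5, 16, 17):
    print(x, maxlutpath(x, 4), maxlutpath(x, 6))
```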
Recap : Karatsuba Multiplier
Hybrid Karatsuba Multiplier for
GF(2^233)

Note that the school book multiplier has
replaced the general Karatsuba Multiplier
School Book Multiplier
Estimating LUT Requirement for
Hybrid Karatsuba Multiplier
• The field multiplier is a hybrid Karatsuba multiplier.
• An m-bit hybrid Karatsuba multiplier consists of two ⌈m/2⌉-bit and one
⌊m/2⌋-bit multipliers. This happens in a recursive manner.
• At the threshold ([…]) level, the School-Book multiplier is invoked.
• Total area of the m-bit hybrid Karatsuba multiplier is given by
• Total area for the School-Book multiplier is
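The recursive structure described above can be sketched in software as a carry-less (GF(2)[x]) multiplication. Polynomials are stored as integers; the operands are split into ceil(m/2) low bits and floor(m/2) high bits, and the school-book method takes over at or below a threshold. The function names and the default threshold of 8 are hypothetical choices for illustration; the sketch does not model LUT area, and modular reduction is left out.

```python
def schoolbook(a, b):
    """Carry-less shift-and-XOR product of two GF(2) polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def hybrid_karatsuba(a, b, m, threshold=8):
    """Carry-less product of two polynomials of at most m bits each."""
    assert threshold >= 1
    if m <= threshold:
        return schoolbook(a, b)           # threshold level reached
    lo = (m + 1) // 2                     # ceil(m/2) low bits
    hi = m - lo                           # floor(m/2) high bits
    mask = (1 << lo) - 1
    a_l, a_h = a & mask, a >> lo
    b_l, b_h = b & mask, b >> lo
    p_l = hybrid_karatsuba(a_l, b_l, lo, threshold)               # ceil(m/2)-bit
    p_h = hybrid_karatsuba(a_h, b_h, hi, threshold)               # floor(m/2)-bit
    p_m = hybrid_karatsuba(a_l ^ a_h, b_l ^ b_h, lo, threshold)   # ceil(m/2)-bit
    # Over GF(2): a_h*b_l + a_l*b_h = p_m + p_l + p_h.
    return (p_h << (2 * lo)) ^ ((p_m ^ p_l ^ p_h) << lo) ^ p_l


# Two 233-bit operands: the recursive product matches the school-book product.
a = (1 << 232) | 0x1234567890ABCDEF
b = (1 << 233) - 1
assert hybrid_karatsuba(a, b, 233) == schoolbook(a, b)
```

Raising the threshold trades Karatsuba levels for larger school-book multipliers at the leaves, which is the design choice a hybrid multiplier tunes.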
Estimating Delay of Hybrid
Karatsuba Multiplier
• The hybrid Karatsuba multiplier is decomposed into smaller multipliers arranged as
a tree. The height of the tree is
• Each level of the Simple Karatsuba tree introduces one LUT delay.
• At the threshold ([…]) level, the School-Book multiplier delay is added.
• Delay of the School-Book multiplier is
• Delay of the entire multiplier in LUTs is given by
Estimating Area & Delay for
Modular Reduction
• For fields generated by trinomials, the area of modular reduction […] is almost
equal to the field size, and the delay […] is one LUT, considering LUT size […].
• For fields generated by pentanomials,
– […] and […] 2 LUT for […].
– […] and […] 2 LUT for […].
Area & Delay Estimates for 2^n
Circuit
• The output of a 2^n circuit, which raises an input […], can be expressed as […],
where […] is […] binary field matrix and […].
• LUT requirement per output bit […] is
• Total LUT requirement for the 2^n circuit is
• LUT delay per output bit […] is
• Since all bits are in parallel, delay of 2^n circuit is
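A small worked sketch of this kind of estimate, for GF(2^4) with x^4 + x + 1 and 4-LUTs. Each output bit of a power circuit is the XOR of the input bits selected by one row of its matrix, so the per-bit cost depends on the row weight. The sketch assumes the estimates ceil((x-1)/(k-1)) LUTs and ceil(log_k(x)) LUT levels for an x-input XOR; these formulas, the hard-coded matrices and the function names are assumptions for illustration.

```python
def lut(x, k):
    """Assumed LUT count for an x-input XOR built from k-LUTs."""
    return 0 if x <= 1 else -(-(x - 1) // (k - 1))


def maxlutpath(x, k):
    """Assumed LUT depth for an x-input XOR: ceil(log_k(x))."""
    depth, reach = 0, 1
    while reach < x:
        reach *= k
        depth += 1
    return depth


def power_circuit_estimate(rows, k):
    """Return (total LUTs, LUT delay) for a circuit given as row bitmasks."""
    weights = [bin(r).count("1") for r in rows]     # inputs feeding each output bit
    area = sum(lut(w, k) for w in weights)          # output bits are independent
    delay = max(maxlutpath(w, k) for w in weights)  # all bits computed in parallel
    return area, delay


# Matrices for x^4 + x + 1, derived by hand: T squares, T2 = T*T is the quad.
T = [0b0101, 0b0100, 0b1010, 0b1000]
T2 = [0b1111, 0b1010, 0b1100, 0b1000]

print(power_circuit_estimate(T, 4))    # (2, 1): one squarer
print(power_circuit_estimate(T2, 4))   # (3, 1): one quad circuit
```

Under these assumptions two cascaded squarers cost 4 LUTs and 2 LUT levels, while a single quad circuit costs 3 LUTs and 1 level, a toy-sized instance of the better LUT utilization of quad circuits.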
Area & Delay Estimates for
Multiplexer
• For a 2^s : 1 MUX, there are s selection lines and thus the output is a function
of 2^s + s variables.
• For a MUX in GF(2^m), each of the 2^s input lines is of width m bits.
• Total LUT requirement is
• Total LUT delay of the MUX is
• When […] number of inputs to MUX […], the above gives a close upper bound.
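A sketch of how a multiplexer estimate could be computed from the variable count 2^s + s, treating each of the m output bits as an independent function of that many variables. The per-function estimates ceil((x-1)/(k-1)) LUTs and ceil(log_k(x)) LUT levels are assumptions carried into this example, as is the function name mux_estimate.

```python
def lut(x, k):
    """Assumed LUT count for a function of x variables on k-LUTs."""
    return 0 if x <= 1 else -(-(x - 1) // (k - 1))


def maxlutpath(x, k):
    """Assumed LUT depth for a function of x variables: ceil(log_k(x))."""
    depth, reach = 0, 1
    while reach < x:
        reach *= k
        depth += 1
    return depth


def mux_estimate(s, m, k):
    """Return (total LUTs, LUT delay) for a 2^s : 1 MUX on m-bit lines."""
    variables = 2 ** s + s            # data inputs plus selection lines, per bit
    return m * lut(variables, k), maxlutpath(variables, k)


print(mux_estimate(1, 233, 4))   # 2:1 MUX, 3 variables per bit
print(mux_estimate(2, 233, 4))   # 4:1 MUX, 6 variables per bit
print(mux_estimate(2, 233, 6))   # the same MUX on 6-LUTs
```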
Area & Delay of PowerBlock
• Let the Powerblock contain u_s cascaded 2^n circuits.
• The […] has […] selection lines, where […].
• LUT requirement for […] is
• Total LUT requirement for the Powerblock is
• Delay of […] is
• Total LUT delay of the Powerblock in
Area & Delay for the Entire
Architecture
• LUT estimate for the entire architecture is
• There are two parallel delay paths.
– LUT delay of first path is
– LUT delay of second path is
– LUT delay of entire architecture is
Optimal Number of Cascades
• For a given field GF(2^m) and k-LUT based FPGA, the Powerblock can be
configured with different power circuits (2^n) and cascades (u_s).
• Increase in […] reduces clock cycles, but increases delay of the Powerblock.
• […] is fixed, but […] depends on […] and […].
• […] is minimum when […]
• Minimum delay of the ITA architecture is thus
Power Circuit Selection to achieve
Minimum Clock Cycles
• Number of clock cycles for the inversion can be approximated as
• Number of clock cycles for […] increases linearly with n.
• The term […] reduces with increase in n.
• When n is small, the reduction in […] is significant for an increase in n.
• But, for large values of n, the increase in […] dominates over the
decrease in […].
• So, […] increases with increase in n for large values of n.
Tuning Design for Optimality
• The performance metric is
• Minimization of […] without increasing […] gives the best performance.
Area […] remains almost the same.
• The following steps are performed to achieve optimal performance
• The optimal architecture is given by
Validation of Theoretical Estimates
• Our estimation model uses maxlutpath to find LUT delay.
• Routing delay is difficult to model in FPGAs.
• To get overall delay, we have used experimental results for a reference
ITA architecture.
• Total delay of the reference architecture is the […].
• Let the LUT delay of the reference architecture be […].
• Total delay of any other ITA architecture in the same field is
approximately […].
• Here […] is a constant that depends on the FPGA technology.
• In 4-LUT based and 6-LUT based Xilinx FPGAs, […] has values 0.2 and 0.1
respectively.
Validation on 4-input LUT FPGAs
Validation on 6-input LUT FPGAs
Experimental Results
Comparison Charts