
Cryptographic Algorithms and their Implementations

Discussion of how to map different algorithms to our architecture:
 Public-Key Algorithms (Modular Exponentiation)
 Rijndael
 Serpent
 Others (MARS, RC6, Twofish, etc.)
Modular Exponentiation

Square and Multiply Algorithm for Modular Exponentiation
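
For reference, a minimal Python sketch of right-to-left binary (square-and-multiply) exponentiation; the helper name and the plain modular reductions are ours, and it is exactly these reductions that the Montgomery-based flow on the following slides replaces.

def square_and_multiply(P, E, M):
    # right-to-left binary exponentiation: returns P^E mod M
    result = 1
    Z = P % M                      # Z holds P^(2^i) at iteration i
    while E:
        if E & 1:                  # multiply the accumulator when bit i of E is set
            result = (result * Z) % M
        Z = (Z * Z) % M            # square for the next bit
        E >>= 1
    return result

assert square_and_multiply(7, 13, 11) == pow(7, 13, 11)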
Modular Exponentiation

Montgomery Modular Multiplication
Modular Exponentiation

Several approaches to implementing Modular Multiplication:
 Redundant-representation based (e.g. carry-save)
 Residue Number System based
 Systolic-array based

Word-based implementations preferable, due to similarity with the symmetric-key algorithms
 Rules out systolic arrays
Modular Exponentiation

The most popular and fastest were carry-save-representation-based implementations.

The carry-save-based implementations were also word-oriented.

We selected the fastest, simplest implementation:
 It is extremely beneficial to have simplicity and homogeneity in the algorithms when designing a custom reconfigurable fabric.
 Performance when implemented on Xilinx Virtex FPGAs: almost 5 Mb/s! (the highest reported that we could find)
Modular Exponentiation
Five-to-two Multiplier Modular Exponentiation (P, E, M)
1. K = 2^(2k) mod M … computed externally
2. P1_0, P2_0 = 5to2_MontMult(K, 0, 1, 0, M); Z1_0, Z2_0 = 5to2_MontMult(K, 0, P, 0, M)
3. FOR i = 0 to n-1 DO
4.    Z1_(i+1), Z2_(i+1) = 5to2_MontMult(Z1_i, Z2_i, Z1_i, Z2_i, M)
      IF e_i = 1 THEN
         P1_(i+1), P2_(i+1) = 5to2_MontMult(P1_i, P2_i, Z1_i, Z2_i, M)
      ELSE
         P1_(i+1), P2_(i+1) = P1_i, P2_i
5. ENDFOR
6. P1_n, P2_n = 5to2_MontMult(1, 0, P1_n, P2_n, M)
7. P = P1_n + P2_n
8. RETURN P
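
A behavioral Python model of this flow, with 5to2_MontMult replaced by the plain modular formula A·B·2^(-m) mod M and each carry-save pair collapsed to a single integer. The function names, the choice m = k, and the use of pow(2, -m, M) (Python 3.8+, M must be odd) are modelling assumptions; the actual hardware iteration count may differ.

def montmult_model(a, b, M, m):
    # stand-in for 5to2_MontMult: returns a*b*2^(-m) mod M (fully reduced)
    return (a * b * pow(2, -m, M)) % M

def montgomery_exponentiation(P, E, M):
    m = M.bit_length()                        # modelling assumption: m = k
    K = pow(2, 2 * m, M)                      # step 1: computed externally
    P_acc = montmult_model(K, 1, M, m)        # step 2: 1 and P converted to
    Z = montmult_model(K, P, M, m)            #         Montgomery form
    for i in range(E.bit_length()):           # steps 3-5: scan exponent bits, LSB first
        if (E >> i) & 1:
            P_acc = montmult_model(P_acc, Z, M, m)
        Z = montmult_model(Z, Z, M, m)
    return montmult_model(P_acc, 1, M, m)     # steps 6-8: convert back, return P

assert montgomery_exponentiation(7, 13, 11) == pow(7, 13, 11)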
Modular Exponentiation
Five-to-two CSA Montgomery Multiplication (A1, A2, B1, B2, M)
1. S1_0, S2_0 = 0, 0
2. FOR i = 0 to m-1 DO
3.    q_i = [(S1_i + S2_i) + A_i*(B1 + B2)] mod 2
4.    S1_(i+1), S2_(i+1) = CSR{[(S1_i + S2_i) + A_i*(B1 + B2) + q_i*M] div 2}
5. ENDFOR
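
A bit-level Python model of this loop, with the carry-save pairs (S1, S2), (A1, A2), (B1, B2) collapsed into single integers; it checks the recurrence only, not the redundant representation or the CSR step. montmult_5to2_model is our name, and pow(2, -m, M) needs Python 3.8+ and an odd M.

def montmult_5to2_model(A1, A2, B1, B2, M, m):
    # returns S with S congruent to (A1+A2)*(B1+B2)*2^(-m) mod M (not fully reduced)
    A, B = A1 + A2, B1 + B2
    S = 0                                   # stands in for S1_i + S2_i
    for i in range(m):
        a_i = (A >> i) & 1
        q_i = (S + a_i * B) & 1             # step 3: parity decides whether M is added
        S = (S + a_i * B + q_i * M) >> 1    # step 4: the sum is even, so the div 2 is exact
    return S

# quick check against the mathematical definition
M, m = 0xF123456789ABCDEF, 64
A, B = 0x0123456789ABCDEF, 0x0FEDCBA987654321
assert montmult_5to2_model(A, 0, B, 0, M, m) % M == (A * B * pow(2, -m, M)) % M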
Modular Exponentiation

Their Implementation of MM
[Figure: the 1024-bit Montgomery multiplier datapath. Three chained 1024-bit CSAs compress S1[i], S2[i], Ai·B1, Ai·B2, and qi·M, with the Ai·B1, Ai·B2, and qi·M partial products read from memory; Ai is generated serially from two 1024-bit shift registers through a full adder (FA) and a flip-flop (FF); the outputs S1[i] and S2[i] are held in 1024-bit registers.]
Modular Exponentiation

Implementing MM on our design
[Figure: the same datapath sliced into 64-bit columns. Each 1024-bit operand (Ai·B1, Ai·B2, qi·M, S1[i], S2[i]) is split into sixteen 64-bit words (<1023:960>, <959:896>, …, <63:0>); each 64-bit slice has its own chain of three 64-bit CSAs fed from memory, with 64-bit registers holding the corresponding slices of S1[i] and S2[i].]
Modular Exponentiation



 Each of the 64-bit CSA blocks maps to a single basic-block
 Outputs of the last basic-block are registered
 qi is generated by the random-logic block of the second basic-block
    Broadcast to all groups
 Ai is generated in a similar manner, utilizing two more basic-blocks: two 64-bit shift registers, a full adder (FA), and a flip-flop (FF) produce Ai
    Also broadcast to all groups
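
The reason the wide CSAs slice cleanly into 64-bit basic-blocks is that a 3:2 carry-save compressor is purely bitwise: only the left-shifted carry word crosses a slice boundary, and only by a single bit. A small Python illustration (csa and csa_sliced are our names, not part of the design):

import random

def csa(x, y, z):
    # 3:2 carry-save compressor: s + c == x + y + z
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c

def csa_sliced(x, y, z, width=1024, slice_bits=64):
    # same compression computed slice by slice; each slice's top carry bit
    # simply lands in the neighbouring slice's position
    s, c = 0, 0
    for lo in range(0, width, slice_bits):
        m = ((1 << slice_bits) - 1) << lo
        si, ci = csa(x & m, y & m, z & m)
        s |= si
        c |= ci
    return s, c

x, y, z = (random.getrandbits(1024) for _ in range(3))
assert csa(x, y, z) == csa_sliced(x, y, z)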
Modular Exponentiation

Efficient and scalable mapping to our design
 1024-bit RSA will need to use 16 groups, while 2048-bit will use 32, and 4096-bit will use 64 groups

Primary concern: the clock rate may be limited by the bit-broadcasts of qi and Ai
 Potential impediment to scalability
 We are exploring methods for pipelining these broadcasts as well, to reduce the cycle time and improve scalability.
Rijndael

Primary operations:
 Sub-Bytes
 Shift-Rows
 Mix-Columns
 Add-Round-Key
Rijndael

Representation of Data: a 128-bit state.
[Figure: the 128-bit state viewed as 16 bytes (8 bits each), organized as four 32-bit words.]
Rijndael

Add-Round-Key
 Simple 128-bit XOR operation: uses 1 basic-block

Sub-Bytes
 Simple operation: byte-wise table lookup from the S-box
 Each S-box is 2 kbits
 16 parallel S-boxes required!
 No basic-blocks required, but ALL memory-blocks required!

Shift-Rows
 Simple operation: 4 x 32-bit permutations
 Uses only 1 basic-block
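
For reference, a minimal Python sketch of the two single-basic-block steps, assuming the standard AES column-major byte order (byte index = 4*column + row); add_round_key and shift_rows are our helper names, not architecture primitives.

def add_round_key(state, round_key):
    # plain 128-bit XOR, byte by byte (state and round_key are 16-byte lists)
    return [s ^ k for s, k in zip(state, round_key)]

def shift_rows(state):
    # fixed permutation: row r of the 4x4 state is rotated left by r byte positions
    return [state[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]

# the permutation only moves bytes; it never changes their values
assert sorted(shift_rows(list(range(16)))) == list(range(16))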
Rijndael

Mix-Columns
 Somewhat complicated: it can be implemented using table lookups, but we are out of memory!
 Alternative implementation:
Rijndael

Mix-Columns
 The operation may be expressed in terms of the “xtime()” function
 The Mix-Columns implementation requires an “xtime()” operation on each byte, followed by 4 XOR operations
Rijndael

Mix-Columns
 In order to implement “xtime()” efficiently, we modified it as follows
 In this form, only 2 basic-blocks are needed to apply “xtime()” to all 16 bytes
 A single basic-block takes the 128-bit data as input and generates the “xtime()” mask (0 0 0 0 x7 x7 0 x7) for each of the 16 bytes at the permute unit
 Another basic-block then performs the XOR operation, followed by a left shift (substituting the LSB with x7) at the permute unit
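
A small Python check of the equivalence behind this modification, where x7 denotes the most significant bit of the input byte (xtime_standard and xtime_modified are our names): XORing in the mask 0 0 0 0 x7 x7 0 x7 and then shifting left with x7 substituted into the LSB reproduces the textbook xtime().

def xtime_standard(b):
    # textbook xtime(): multiply by x in GF(2^8), reducing by 0x1B when the MSB is set
    return ((b << 1) ^ (0x1B if b & 0x80 else 0)) & 0xFF

def xtime_modified(b):
    # the modified form: build the mask from x7, XOR it in, then shift left
    # and substitute x7 into the LSB (both halves map onto permute-unit steps)
    x7 = (b >> 7) & 1
    mask = 0x0D * x7                  # 0x0D = binary 00001101, i.e. 0 0 0 0 x7 x7 0 x7
    return (((b ^ mask) << 1) & 0xFF) | x7

assert all(xtime_standard(b) == xtime_modified(b) for b in range(256))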
Rijndael

Mix-Columns
 After generating the output of the “xtime()” function, 4 x 128-bit XOR operations need to be performed
 4 basic-blocks will be used (a per-column sketch follows the figure below)
 Note that the Mix-Columns operation is carried out in parallel on all 4 columns
[Figure: mapping of the Mix-Columns datapath onto a basic-block: 32/64-bit inputs A, B, C, D feed the 64-bit carry-save adders, the 4-bit random logic, and the permute unit; the xtime masks for all bytes are formed here, the outputs O1 and O2 are captured in the 64-bit registers, and the XOR operation completes the step.]
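
A per-column Python sketch of Mix-Columns in this xtime()-plus-XOR form (mix_single_column and xtime are our helper names); at the 128-bit level, the same XORs are what the 4 basic-blocks perform across all 4 columns at once.

def xtime(b):
    return ((b << 1) ^ (0x1B if b & 0x80 else 0)) & 0xFF

def mix_single_column(a):
    # a = [a0, a1, a2, a3]: one 32-bit column as four bytes; every output byte is
    # built from byte-wise xtime() results plus XORs, matching the decomposition above
    return [
        xtime(a[0]) ^ xtime(a[1]) ^ a[1] ^ a[2] ^ a[3],
        a[0] ^ xtime(a[1]) ^ xtime(a[2]) ^ a[2] ^ a[3],
        a[0] ^ a[1] ^ xtime(a[2]) ^ xtime(a[3]) ^ a[3],
        xtime(a[0]) ^ a[0] ^ a[1] ^ a[2] ^ xtime(a[3]),
    ]

# single-column test vector from the AES specification
assert mix_single_column([0xDB, 0x13, 0x53, 0x45]) == [0x8E, 0x4D, 0xA1, 0xBC]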
Rijndael

Implementation summary
 Only 8 basic-blocks required:
    2 (1 each) for Add-Round-Key and Shift-Rows
    6 for Mix-Columns (2 for xtime(), 4 for the XOR operations)
 16 memory-blocks required!
    All memory blocks are used up in a single round!
 Inefficient implementation, due to the memory-intensive implementation of Rijndael
    Only 10% of the logic used, versus 100% memory usage.
Rijndael

Potential Solutions
 Add lots of memory!
    At least 10 times more
    Issues with memory placement
 Consider memory-less implementations of Sub-Bytes
    Requires GF() constant multiplication and inverse affine transforms
    Currently under study as the more efficient and practical option.
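
To make the memory-less option concrete, a behavioral Python sketch of Sub-Bytes as GF(2^8) inversion followed by the affine transform. The brute-force inversion below (a^254 via repeated gf_mul) is purely a software illustration; a hardware version would use composite-field arithmetic, which this sketch does not model.

def gf_mul(a, b):
    # multiplication in GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def gf_inv(a):
    # a^254 = a^(-1) in GF(2^8); 0 maps to 0, as the S-box requires
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF

def sub_byte(x):
    # memory-less Sub-Bytes: inversion followed by the affine transform
    b = gf_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63

assert sub_byte(0x00) == 0x63 and sub_byte(0x01) == 0x7C   # first two AES S-box entries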
Serpent

Substitution-permutation cipher comprised of:
 Key mixing,
 S-box substitution, and
 Linear transformation

S-boxes: 4 x 4 bit
 32 copies required each round
 16 x 4 x 32 = 2048 bits per round
Serpent

The Linear Transformation step consists of:
 8 fixed permute operations, and
 8 XOR operations

All operands are 32 bits wide
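
For concreteness, a Python sketch of Serpent's linear transformation as given in the published Serpent specification (the rotation and shift constants are reproduced from memory here and worth re-checking against the spec); it matches the counts above: 8 fixed rotates/shifts and 8 XOR operations, all on 32-bit words.

MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK32

def serpent_linear_transform(x0, x1, x2, x3):
    # 8 fixed permutes (rotates/shifts) and 8 XORs on 32-bit operands
    x0 = rotl32(x0, 13)
    x2 = rotl32(x2, 3)
    x1 ^= x0 ^ x2
    x3 ^= x2 ^ ((x0 << 3) & MASK32)
    x1 = rotl32(x1, 1)
    x3 = rotl32(x3, 7)
    x0 ^= x1 ^ x3
    x2 ^= x3 ^ ((x1 << 7) & MASK32)
    x0 = rotl32(x0, 5)
    x2 = rotl32(x2, 22)
    return x0, x1, x2, x3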
Serpent

Serpent is an ideal match for our architecture:
 The 8 x 32-bit fixed shifts and rotates can be easily implemented by the permute units of 2 basic-blocks.
 An additional 2 basic-blocks are required to implement the 8 x 32-bit XOR operations.
 The 128-bit key-mixing stage of each round requires 1 more basic-block.

Total of 5 basic-blocks and 2 kbits of memory required per round.

Each round fits perfectly in a single group of our architecture!

16 of Serpent’s 32 rounds may be unrolled in our architecture.
Other Algorithms

DES
 Implementation of a single round is trivial: a single group may implement multiple rounds!

Twofish
 Complex structure; it requires more time to define an implementation on our architecture.
 However, all of its basic operations are directly supported.

RC6 and MARS
 Involve complicated operations requiring special-purpose logic:
    Data-dependent rotations
    Multiplication modulo 2^32
Other Algorithms

RC6 and MARS
 This special-purpose logic was not incorporated because:
    These algorithms are more suitable for software implementation than for hardware
    Lack of support for, and popularity of, these algorithms
    Adding special-purpose logic would incur overhead beyond its own area, as additional supporting interconnect must be provided.

Comparison with Related Work

Although we cannot provide results based on empirical evaluation, we can present a logical framework for comparing individual features.

Through deductive reasoning, we identify what possible advantages one approach may have over the other, assuming all other factors are normalized.
Comparison with Related Work

Comparison with FPGA-based implementations
 Area efficiency
    Use of basic gates instead of LUTs
    Basic-blocks with limited flexibility, thus fewer configuration bits
    Basic units (full adders) combined into clusters of 64 and programmed as a single entity: further savings in configuration memory elements
 Performance
    Use of basic gates instead of LUTs
    Simpler interconnect, with fewer routing switches
    Hierarchical organization: no long wires (except for bit-broadcast)
    Far smaller configuration data required: faster reconfiguration time
Comparison with Related Work

Comparison with FPGA-based implementations
 Potential pitfalls
    The design dedicates a considerable amount of area to inter-block interconnect.
    Until the actual area can be quantified, we are unsure of our area-efficiency estimates.
    We need to identify the most suitable performance/area tradeoff.

Comparison with Related Work

Comparison with the COBRA Architecture
 Uses multiple copies of special-purpose logic blocks, coupled with an extremely simple interconnect.
Comparison with Related Work

Comparison with the COBRA Architecture
 Low logic utilization: we have more generic blocks
 Fixed-latency operations
 Intermediate values registered only at RCE boundaries.
Programming Methodology

Reconfigurable Computing devices suffer from the following two critical issues:
 Lack of a comprehensive programming model
 Lack of hardware virtualization

The first issue refers to the difficulty of programming RC architectures such as FPGAs.

The second issue deals with the exposure of hardware resource limitations to the programmer.
Programming Methodology

How COBRA deals with these issues
 It is essentially a special-purpose programmable architecture rather than a configurable one
 VLIW-like instructions alleviate some of the programming-model issues
 They also resolve the virtualization aspect.
Programming Methodology

The programming methodology and the impact of the issues mentioned can be seen in terms of a spectrum:

Microprocessor  →  COBRA [3]  →  Our Approach  →  FPGAs
Programming Methodology

The programming-model issue is less severe for us because:
 Simple, highly specialized architecture

Hardware virtualization is still a concern.
Programming Methodology

Programming model:
 Provide basic primitives that are supported by our architecture.
 Programming is accomplished by expressing an algorithm using these primitives and interconnecting them over the 32-bit interconnect.
 Mapping such a description onto our design should be a trivial software task.
 Due to the special-purpose nature of the architecture, the primitives are limited in number, and thus programming should be an easy task.
Programming Methodology
Programming Primitives:
 32-bit carry-save adder
 32-bit XOR
 32-bit AND
 32-bit OR
 32-, 64-, and 128-bit ripple-carry adder
 32-, 64-, and 128-bit fixed shifts
 32-bit rotates and random permutes
 64-bit and 128-bit limited permutes (TBD)
 ANDing a 32-bit value with a single bit
 128-bit shift register
 Random bit-logic implementation, since each block is also capable of implementing:
    a single 4-input function
    two 3-input functions
    four 2-input functions
 4 global bit-broadcast lines
 32-bit point-to-point interconnect
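
A purely hypothetical sketch of what a program in this model could look like: the algorithm is written as a list of primitive operations connected by named 32-bit signals, and a trivial interpreter evaluates it. The primitive names, the tuple format, and the run() helper are placeholders for illustration only.

# hypothetical textual program: (primitive, inputs, output signal)
program = [
    ("XOR32",  ("a", "key0"), "t0"),     # 32-bit XOR primitive
    ("ROTL32", ("t0", 13),    "t1"),     # 32-bit fixed rotate primitive
    ("XOR32",  ("t1", "key1"), "out"),
]

def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

PRIMITIVES = {
    "XOR32":  lambda env, a, b: env[a] ^ env[b],
    "ROTL32": lambda env, a, n: rotl32(env[a], n),
}

def run(program, inputs):
    env = dict(inputs)
    for op, args, dst in program:
        env[dst] = PRIMITIVES[op](env, *args)
    return env

env = run(program, {"a": 0x01234567, "key0": 0x89ABCDEF, "key1": 0x0F0F0F0F})
print(hex(env["out"]))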
Conclusion: Work in Progress

The following areas of the design are still under consideration and not completely defined yet:
 Configurable memory-block architecture
 VLSI design to evaluate performance metrics and fine-tune the logical design
    i.e. if it is found to be too slow: reduce the number of switches, use longer wires, minimize the amount of interconnect to that which is necessary, etc.

Furthermore, the iterative process of evaluating more symmetric-key algorithms and refining the architecture is still in progress.