Slides

Parallel Analysis of the Rijndael
Block Cipher
Philip Brisk
Adam Kaplan Majid Sarrafzadeh
Embedded & Reconfigurable Systems Lab
Computer Science Department
IASTED-PDCS November, 2003
Outline
• Introduction
• Background Material
• Analysis of the Rijndael Cipher
• Concluding Remarks
Parallel Models of Computation
and Cryptography
• Achieving optimal performance of
cryptographic algorithms is imperative!
• Goal: Understand how to accelerate
performance by studying cryptography
under parallel models of computation.
What can we Learn from Parallel
Models of Computation?
• Identification of performance bottlenecks.
• How to design efficient cryptographic
hardware.
• Techniques to improve future algorithms.
Outline
• Introduction
• Background Material
– Cost Model
– Prefix Sum Computation
• Analysis of the Rijndael Cipher
• Concluding Remarks
Cost Model
• n : problem size
• t(n) : number of steps
• p(n) = N > 1 : number of processors
• c(n) : cost, $c(n) = t(n) \times p(n)$
• s(n) : speedup, $s(n) = \dfrac{t(n)\big|_{p(n)=1}}{t(n)\big|_{p(n)=N}}$
Cost Optimality
• Cost ≡ the number of steps executed
collectively by all processors.
• An algorithm is cost-optimal on a parallel
model of computation if:
$c(n)\big|_{p(n)=N} = t(n)\big|_{p(n)=1}$
Prefix Sum Computation
• P – a set of N processors: {P1, …, PN}
• Processor Pi holds a value ai.
• For each processor Pi, compute the sum Si:
$S_i = \sum_{k=1}^{i} a_k$
Algorithm (with $S_0 = 0$):
for i = 1 to N
  S_i = a_i + S_{i-1}
• Addition can be generalized to any binary associative
operation.
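The sequential algorithm above can be sketched in Python as follows. This is a minimal illustration, not code from the talk: the function name `prefix_scan` and the `op` parameter are assumptions, introduced to show how addition can be swapped for any binary associative operation such as XOR.

```python
from operator import add, xor

def prefix_scan(values, op=add):
    """Sequential inclusive scan: S_i = a_1 op a_2 op ... op a_i."""
    sums = []
    for a in values:
        # S_i = S_{i-1} op a_i; the first element starts the chain.
        sums.append(a if not sums else op(sums[-1], a))
    return sums

print(prefix_scan([3, 6, 1, 4]))       # [3, 9, 10, 14]
print(prefix_scan([3, 6, 1, 4], xor))  # [3, 5, 4, 0]
```

With one processor this takes N steps, which is the sequential baseline t(n) used in the cost model.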
Prefix Sum Computation
• Meijer and Akl [1987] described a solution
using a binary tree of processors.
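A tree-based prefix sum can be simulated sequentially as sketched below. This is a generic up-sweep/down-sweep scan over a balanced binary tree, offered as an assumption about the general shape of such algorithms; it is not claimed to reproduce Meijer and Akl's exact procedure. The name `tree_scan` and the step counter are hypothetical, and the input length is assumed to be a power of two.

```python
from operator import add

def tree_scan(values, op=add):
    """Inclusive scan over a balanced binary tree.

    Returns (prefix_sums, parallel_steps), where each tree level counts
    as one step because its nodes are independent of one another.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two length"
    tree = [None] * (2 * n)          # tree[1] is the root, leaves at n..2n-1
    tree[n:] = values
    steps = 0
    level = n // 2
    while level >= 1:                # up-sweep: each node sums its subtree
        for node in range(level, 2 * level):
            tree[node] = op(tree[2 * node], tree[2 * node + 1])
        steps += 1
        level //= 2
    prefix = [None] * (2 * n)        # sum of everything left of a subtree
    level = 1
    while level < n:                 # down-sweep: push left-sums to children
        for node in range(level, 2 * level):
            left, right = 2 * node, 2 * node + 1
            prefix[left] = prefix[node]
            prefix[right] = (tree[left] if prefix[node] is None
                             else op(prefix[node], tree[left]))
        steps += 1
        level *= 2
    sums = [v if prefix[n + i] is None else op(prefix[n + i], v)
            for i, v in enumerate(values)]
    return sums, steps + 1           # final leaf update is one more step

print(tree_scan([3, 6, 1, 4]))       # ([3, 9, 10, 14], 5)
```

The step count grows as 2 log2(N) + 1 rather than N, which is the logarithmic behaviour the tree construction is meant to deliver.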
[Figure: animation of a binary tree of processors computing the prefix sums of the leaf values 3, 6, 1, 4; the final leaf values are 3, 9, 10, 14.]
A Cost-Optimal Prefix Sum
• To achieve cost optimality:
$t(n) = O\left(\frac{n}{N} + \log N\right)$
$p(n) = N = O\left(\frac{n}{\log n}\right)$
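A short worked check of why this processor count is cost-optimal, assuming the usual blocked scheme in which each processor first scans its own block of n/N values and a tree then combines the N block totals (the blocking scheme is an assumption, not stated on the slide):

```latex
\begin{align*}
N &= \frac{n}{\log n}
  \;\Rightarrow\; \frac{n}{N} = \log n, \qquad \log N \le \log n \\
t(n) &= O\!\left(\frac{n}{N} + \log N\right) = O(\log n) \\
c(n) &= t(n) \times p(n)
      = O(\log n) \times O\!\left(\frac{n}{\log n}\right) = O(n)
\end{align*}
```

The cost O(n) matches the sequential running time of the prefix sum, which is the cost-optimality condition from the cost model.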
The Rijndael Cipher
• The cipher iterates in a series of rounds.
– Each round requires a Key
• Using the same key every round is not secure.
• Providing a sequence of keys as an input is
unreasonable.
• A key schedule uses the original key to compute a
new key for each round.
The Rijndael Cipher
Key Schedule
– Key Expansion
• Expands the original key analogously to prefix-sum computation.
– Round Key Selection
• Divides the expanded key between the rounds of the cipher.
Round Transformation
– 4 sub-transformations applied during each round:
• ByteSub
• ShiftRow
• MixColumn
• AddRoundKey
The Rijndael Cipher: Parameters
• Nb – Block Length (number of 32-bit columns in the state)
• Nk – Key Length (number of 32-bit words in the key)
• Nr – Number of Rounds
• The key and state are represented as
2-dimensional arrays of bytes.
Representation of the State
• The state is represented by a 4 x Nb array
of bytes (Nb = 4, 6, or 8), shown here for Nb = 4:
a0,0 a0,1 a0,2 a0,3
a1,0 a1,1 a1,2 a1,3
a2,0 a2,1 a2,2 a2,3
a3,0 a3,1 a3,2 a3,3
(4 rows, Nb columns)
The ByteSub Transformation
• Apply an S-Box to every byte in the state: b_{i,j} = S-BOX(a_{i,j}).
• The S-Box is an 8-bit lookup table.
[Figure: each byte a_{i,j} of the state passes through the S-BOX to produce byte b_{i,j} of the new state.]
• The S-Box includes the following affine transformation over GF(2):
$$
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
+
\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}
$$
• 1 processor → t(n) = O(Nb)
• 4 x Nb processors → t(n) = O(1)
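The ByteSub step can be sketched as follows. The slides only show the lookup table and the affine matrix; this sketch additionally relies on the Rijndael specification, where the input to the affine map is the multiplicative inverse of the byte in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. All function names here are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def gf_inv(a):
    """Multiplicative inverse via a^254; 0 maps to 0 by convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def affine(x):
    """Row i of the matrix selects bits i, i+4, i+5, i+6, i+7 (mod 8) of x."""
    y = 0
    for i in range(8):
        bit = (x >> i) ^ (0x63 >> i)
        for shift in (4, 5, 6, 7):
            bit ^= x >> ((i + shift) % 8)
        y |= (bit & 1) << i
    return y

SBOX = [affine(gf_inv(a)) for a in range(256)]

def byte_sub(state):
    """Each byte is looked up independently of every other byte."""
    return [[SBOX[b] for b in row] for row in state]

print(hex(SBOX[0x00]), hex(SBOX[0x53]))  # 0x63 0xed
```

Because no output byte depends on any other input byte, the 4 x Nb lookups can be assigned to 4 x Nb processors and finish in a single step.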
The Shift-Row Transformation
• Shift each row of the state by a constant (shown for Nb = 4: row i is rotated left by i bytes).
State (before)             State (after)
a0,0 a0,1 a0,2 a0,3   →    b0,0 b0,1 b0,2 b0,3
a1,0 a1,1 a1,2 a1,3   →    b1,1 b1,2 b1,3 b1,0
a2,0 a2,1 a2,2 a2,3   →    b2,2 b2,3 b2,0 b2,1
a3,0 a3,1 a3,2 a3,3   →    b3,3 b3,0 b3,1 b3,2
• 1 processor → t(n) = O(Nb)
• 4 x Nb processors → t(n) = O(1)
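A minimal sketch of ShiftRow, written so that every output byte is computed from exactly one input byte. The function name and the `offsets` parameter are assumptions; the default offsets (0, 1, 2, 3) are the ones drawn in the figure for Nb = 4, and the Rijndael specification uses (0, 1, 3, 4) when Nb = 8.

```python
def shift_row(state, offsets=(0, 1, 2, 3)):
    """Rotate row r of a 4 x Nb state left by offsets[r] positions."""
    nb = len(state[0])
    # Output byte (r, c) reads input byte (r, (c + offset) mod Nb),
    # so all 4 x Nb bytes can be moved independently in one step.
    return [[state[r][(c + offsets[r]) % nb] for c in range(nb)]
            for r in range(4)]

labels = [[f"a{r},{c}" for c in range(4)] for r in range(4)]
for row in shift_row(labels):
    print(" ".join(row))
```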
The Mix-Column Transformation
• Apply to each column in the state.
• Each column (a_{0,j}, a_{1,j}, a_{2,j}, a_{3,j}) is multiplied by a 4x4 byte matrix:
$$
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}
=
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix}
$$
• 1 processor → t(n) = O(Nb)
• O(Nb) processors → t(n) = O(1)
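The column multiplication can be sketched as below. The matrix entries are the ones on the slide; the arithmetic is assumed to be in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1, as in the Rijndael specification. The helper names `xtime`, `mix_single_column`, and `mix_column` are illustrative.

```python
def xtime(b):
    """Multiply a byte by 02 in GF(2^8)."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b

def mul3(b):
    """Multiply a byte by 03, i.e. (02 * b) XOR b."""
    return xtime(b) ^ b

def mix_single_column(col):
    """Apply the 4x4 matrix [02 03 01 01; 01 02 03 01; ...] to one column."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]

def mix_column(state):
    """Columns are independent, so one processor per column suffices."""
    mixed = [mix_single_column(col) for col in zip(*state)]
    return [list(row) for row in zip(*mixed)]

print([hex(b) for b in mix_single_column([0xDB, 0x13, 0x53, 0x45])])
# ['0x8e', '0x4d', '0xa1', '0xbc']
```

Each column costs a constant number of operations, which is why O(Nb) processors are enough for t(n) = O(1) here, rather than the 4 x Nb used by the byte-wise steps.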
The Add-Round-Key
Transformation
• XOR each state byte with the corresponding key byte: b_{i,j} = a_{i,j} XOR k_{i,j}.
[Figure: state byte a_{i,j} and key byte k_{i,j} are combined by XOR to produce byte b_{i,j} of the new state.]
• 1 processor → t(n) = O(Nb)
• 4 x Nb processors → t(n) = O(1)
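AddRoundKey is the simplest of the four steps; a minimal sketch, assuming the state and the round key are both stored as 4 x Nb lists of byte values (the function name is hypothetical):

```python
def add_round_key(state, round_key):
    """XOR every state byte with the key byte at the same position."""
    return [[s ^ k for s, k in zip(state_row, key_row)]
            for state_row, key_row in zip(state, round_key)]

state = [[0x32, 0x88], [0x43, 0x5A]]
key = [[0x2B, 0x28], [0x7E, 0xAE]]
mixed = add_round_key(state, key)
print(add_round_key(mixed, key) == state)  # True: XOR is its own inverse
```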
The Round Transformation
For i = 1 to Nr – 1
  State ← ByteSub(State)
  State ← ShiftRow(State)
  State ← MixColumn(State)
  State ← AddRoundKey(State, Key)
Final Round:
  State ← ByteSub(State)
  State ← ShiftRow(State)
  State ← AddRoundKey(State, Key)
The Round Transformation
• Sequential Model
p(n) = 1
t(n) = O(Nb x Nr)
• Fully Parallel Model
p(n) = O(Nb)
t(n) = O(Nr)
s(n) = O(Nb)
c(n) = O(Nb x Nr)
We have achieved
cost-optimality!
Key Expansion Algorithm
For j = 0 to Nk – 1
  W[j] = (Key[4j], Key[4j+1], Key[4j+2], Key[4j+3])
For j = Nk to Nb x (Nr+1) – 1
  temp = W[j-1]
  if (j % Nk == 0)
    temp = SubByte(RotByte(temp)) XOR Rcon[j/Nk]
  else if (Nk > 6 && j % Nk == 4)
    temp = SubByte(temp)
  W[j] = W[j-Nk] XOR temp
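A runnable sketch of the key expansion, using 0-based word indices as in the Rijndael specification. Several details are assumptions drawn from that specification rather than from the slides: words are packed big-endian into 32-bit integers, Nr = max(Nk, Nb) + 6, and the S-Box is built from the field inverse followed by the affine map. All helper names are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def _rotl8(x, k):
    return ((x << k) | (x >> (8 - k))) & 0xFF

def _sbox_entry(a):
    # Field inverse by exhaustive search (0 maps to 0), then the affine map.
    inv = next((b for b in range(1, 256) if gf_mul(a, b) == 1), 0)
    return inv ^ _rotl8(inv, 1) ^ _rotl8(inv, 2) ^ _rotl8(inv, 3) ^ _rotl8(inv, 4) ^ 0x63

SBOX = [_sbox_entry(a) for a in range(256)]

def sub_word(w):
    """SubByte applied to each of the four bytes of a word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")

def rot_word(w):
    """RotByte: cyclic left rotation of the word by one byte."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF

def expand_key(key, nb=4):
    """Return the expanded key W as a list of Nb x (Nr + 1) words."""
    nk = len(key) // 4
    nr = max(nk, nb) + 6
    w = [int.from_bytes(key[4 * j:4 * j + 4], "big") for j in range(nk)]
    rcon = 1                                   # Rcon[1] = 01
    for j in range(nk, nb * (nr + 1)):
        temp = w[j - 1]
        if j % nk == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 2)             # next round constant
        elif nk > 6 and j % nk == 4:
            temp = sub_word(temp)
        w.append(w[j - nk] ^ temp)
    return w

words = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(len(words), f"{words[4]:08x}", f"{words[43]:08x}")
```

The key in the usage line is the 128-bit example key from FIPS-197 Appendix A.1, which gives a convenient way to check the first and last expanded words.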
Key Expansion Algorithm on a
Uniprocessor (Sequential) Machine
Basic Algorithm Structure:
For j = 0 to Nk – 1                  (Nk iterations)
  {…}
For j = Nk to Nb x (Nr+1) – 1        (Nb x (Nr + 1) – Nk iterations)
  {…}
Total: Nb x (Nr + 1) iterations
1 processor → t(n) = O(Nb x Nr)
Key Expansion Algorithm on a
Parallel Machine
• The loop-carried dependence appears to render
the algorithm impossible to parallelize…
• … Observe that XOR is a binary associative
operation.
• This algorithm is simply a variant of Prefix Sum
with XOR instead of +.
For j = Nk to Nb x (Nr+1) – 1
  temp = W[j-1]
  …
  W[j] = W[j-Nk] XOR temp
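The prefix-sum structure can be made concrete for the words that do not pass through SubByte. The sketch below covers only that case and assumes Nk ≤ 6: given the Nk words of the previous group and the first word of the new group (already computed with SubByte, RotByte, and Rcon), the remaining words of the new group are an XOR prefix scan. The function names are hypothetical.

```python
from itertools import accumulate
from operator import xor

def next_group_by_scan(prev_group, first_new):
    """Remaining words of a group as an XOR prefix scan.

    prev_group: the Nk words W[j-Nk .. j-1] of the previous group.
    first_new:  W[j] for j % Nk == 0, computed beforehand.
    """
    return list(accumulate([first_new] + prev_group[1:], xor))

def next_group_by_recurrence(prev_group, first_new):
    """Same words via the loop-carried form W[j] = W[j-Nk] XOR W[j-1]."""
    group = [first_new]
    for k in range(1, len(prev_group)):
        group.append(prev_group[k] ^ group[-1])
    return group

prev = [0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C]
first = 0xA0FAFE17
print([f"{w:08x}" for w in next_group_by_scan(prev, first)])
```

Since the scan form uses only an associative operator, it can be handed to any parallel prefix algorithm in place of the sequential `accumulate`.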
Key Expansion Algorithm
• To compute the prefix sum cost-optimally:
$p(n) = O\left(\frac{Nb \times Nr}{\log Nb + \log Nr}\right)$
$t(n) = O(\log Nb + \log Nr)$
Round Key Selection
• Words W[Nb x i] through W[Nb x (i+1) – 1]
are chosen as the round key for round i.
W[0..Nb-1]
W[Nb..2Nb-1]
…
W[NbNr..Nb(Nr+1)-1]
• Can be interleaved with the Key Expansion
phase with no additional overhead.
Key Schedule
• Sequential Algorithm
pn   1
t n  ONb  Nr 
• Parallel (Prefix-Sum) Algorithm


Nb  Nr

pn   O
 log Nb  log Nr 
t n  Olog Nb  log Nr 
26/34
IASTED-PDCS November, 2003
The Rijndael Cipher:
Sequential Model
• Key Schedule: p(n) = 1, t(n) = O(Nb x Nr)
• Round Transformation: p(n) = 1, t(n) = O(Nb x Nr)
• Overall: p(n) = 1, t(n) = O(Nb x Nr)
The Rijndael Cipher:
Parallel Model
• Key Schedule
$p(n) = O\left(\frac{Nb \times Nr}{\log Nb + \log Nr}\right)$
$t(n) = O(\log Nb + \log Nr)$
• Round Transformation
p(n) = O(Nb)
t(n) = O(Nr)
The Rijndael Cipher:
Parallel Model
Altogether
$p(n) = O\left(\frac{Nb \times Nr}{\log Nb + \log Nr}\right)$
$t(n) = O(Nr)$
$s(n) = O(Nb)$
$c(n) = O\left(\frac{Nb \times Nr^2}{\log Nb + \log Nr}\right) \neq O(Nb \times Nr)$
This model does NOT yield a cost-optimal solution!
Achieving Cost Optimality with a
Parallel Model of Computation
• Reduce the number of processors from $p(n) = O(Nb)$ to $p(n) = O\left(\frac{Nb}{\log Nb}\right)$.
• The Round Transformation requires time $t(n) = O(\log Nb)$ per round.
• The Key Schedule requires time $t(n) = O(Nr \times \log Nb)$.
Achieving Cost Optimality
• Final Results:
$p(n) = O\left(\frac{Nb}{\log Nb}\right)$
$t(n) = O(Nr \log Nb)$
• Speedup and Cost:
$s(n) = O\left(\frac{Nb}{\log Nb}\right)$
$c(n) = O(Nb \times Nr)$
Summary of Results
• Fastest Model
– $p(n) = O\left(\frac{Nb \times Nr}{\log Nb + \log Nr}\right)$
– $t(n) = O(Nr)$
– $s(n) = O(Nb)$
– $c(n) = O\left(\frac{Nb \times Nr^2}{\log Nb + \log Nr}\right)$
• Cost-Optimal Model
– $p(n) = O\left(\frac{Nb}{\log Nb}\right)$
– $t(n) = O(Nr \log Nb)$
– $s(n) = O\left(\frac{Nb}{\log Nb}\right)$
– $c(n) = O(Nb \times Nr)$
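To get a feel for the two models, the bounds can be evaluated for concrete parameters. The sketch below is purely illustrative: it drops the constants hidden by the O-notation and assumes base-2 logarithms, so the numbers show only how the expressions relate, not measured performance. The function names are hypothetical.

```python
from math import isclose, log2

def fastest_model(nb, nr):
    """Plug Nb, Nr into the fastest-model bounds, ignoring constants."""
    p = nb * nr / (log2(nb) + log2(nr))
    t = nr
    return {"p": p, "t": t, "s": nb * nr / t, "c": p * t}

def cost_optimal_model(nb, nr):
    """Plug Nb, Nr into the cost-optimal bounds, ignoring constants."""
    p = nb / log2(nb)
    t = nr * log2(nb)
    return {"p": p, "t": t, "s": nb * nr / t, "c": p * t}

nb, nr = 4, 10   # a 128-bit block with 10 rounds
fast, optimal = fastest_model(nb, nr), cost_optimal_model(nb, nr)
print(fast)
print(optimal)
# The cost-optimal model's cost equals the sequential step count Nb x Nr.
print(isclose(optimal["c"], nb * nr), fast["c"] > nb * nr)
```

The comparison mirrors the summary: the fastest model finishes in fewer steps but spends more total work, while the cost-optimal model matches the sequential work at the price of a log Nb slowdown.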
Concluding Remarks
• First theoretical study of the parallelism
inherent in the Rijndael AES.
• The fastest parallel model was not cost-optimal;
some acceleration was sacrificed in
order to achieve cost-optimality.