Dynamic High-Performance Multi-Mode
Architectures for AES Encryption
Eric Swankoski
Naval Research Lab
Vijay Narayanan
Penn State University
Swankoski
1
MAPLD 2005 / B103
Background & Motivation
• Bandwidth and throughput capabilities of modern optical networks are skyrocketing
• Protecting transmitted data is becoming more and more critical
• Current encryption architectures generally can't keep up with high-speed environments
• Single-event upset (SEU) effects are rarely, if ever, considered
Plan of Attack: FPGA Encryption
• Algorithm: Advanced Encryption Standard (AES)
  – Supports multiple key lengths
  – Supports multiple encryption modes
  – Supports multiple levels of pipelining
• Target Architecture: Xilinx FPGAs
  – Can be adapted to ASIC devices
  – Virtex-II, Virtex-4
• Target Performance: 60+ gigabits per second
  – Requires both inner-round and outer-round pipelining
The AES Algorithm
• 10 Rounds of Encryption for 128-bit operands
• Four basic operations:
  – SubBytes: 8-bit substitution (16 parallel operations per round)
  – ShiftRows: Byte reordering and rotation (4 parallel operations per round)
  – MixColumns: Polynomial multiplication (4 parallel operations per round)
  – AddRoundKey: Simple 128-bit XOR
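The three non-substitution operations listed above can be sketched in a few lines of software. This is only an illustrative model, not the FPGA datapath from the talk: the function names and the column-major 16-byte state layout are assumptions made for this example, and SubBytes is left out because byte substitution is treated on its own later.

```python
def xtime(b):
    """Multiply byte b by 2 in GF(2^8), reducing by the AES polynomial 0x11B."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def shift_rows(state):
    """Rotate row r left by r bytes.

    Assumed layout: state is 16 bytes, column-major, so the byte at
    row r, column c lives at index r + 4*c.
    """
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


def mix_column(col):
    """Multiply one 4-byte column by the fixed AES polynomial (2, 3, 1, 1)."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def mix_columns(state):
    """Apply mix_column to each of the four columns (independent of each other)."""
    out = []
    for c in range(4):
        out += mix_column(state[4 * c:4 * c + 4])
    return out


def add_round_key(state, round_key):
    """XOR the 128-bit state with the 128-bit round key, byte by byte."""
    return [s ^ k for s, k in zip(state, round_key)]
```

Each column in `mix_columns` and each row in `shift_rows` is processed independently, which is the parallelism the slide counts as four operations per round.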
Optimizing for Performance
• Exploit all possible parallelism
• Alternative byte substitution methods
  – 1 cycle for a lookup-based substitution
  – 5 cycles for a mathematical transformation
• Utilize pipelining
  – Outer-Round: 1 cycle per round
  – Inner-Round:
      • 4 cycles per round (lookup-based byte substitution)
      • 8 cycles per round (pipelined byte substitution)
Combinatorial Byte Substitution
• Actual mathematical transformation
• Conventional implementation cannot be pipelined
  – Simple (atomic) 8x8 lookup table
• Smaller than lookup table
• Faster than lookup table
  – Utilizes five-stage pipeline
• All internal operands are four bits wide
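To make the contrast between a lookup table and a mathematical transformation concrete, the sketch below computes the AES S-box directly: a multiplicative inverse in GF(2^8) followed by the standard affine transform. It is a software reference only. It works on full 8-bit values and does not reproduce the four-bit (composite-field) decomposition or the five-stage pipeline described on the slide; the function names are chosen for this example.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse via a^254 (square-and-multiply); 0 maps to 0."""
    result, power, exponent = 1, a, 254
    while exponent:
        if exponent & 1:
            result = gf_mul(result, power)
        power = gf_mul(power, power)
        exponent >>= 1
    return result


def rotl8(b, n):
    """Rotate an 8-bit value left by n bits."""
    return ((b << n) | (b >> (8 - n))) & 0xFF


def sbox(byte):
    """AES SubBytes for one byte: field inverse, then the affine transform."""
    inv = gf_inv(byte)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


# The lookup-based alternative simply stores all 256 results (an 8x8 table).
SBOX_TABLE = [sbox(b) for b in range(256)]
```

In hardware the trade-off is between storing `SBOX_TABLE` as a ROM and building the arithmetic as logic that can be split into pipeline stages.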
Encryption Round Diagram
• Atomic S-Box:
  – 40 Pipeline Stages
• Combinatorial S-Box:
  – 76 Pipeline Stages
  – Needs a constant stream to be effective
• Parallel Key Scheduling
  – No performance penalty
• Offline Key Scheduling
  – Precomputed keys can be stored in registers
Counter (CTR) Mode
• Effectively converts AES into a stream cipher
• High security, similar to CBC
• Supports inner-round and outer-round pipelining
• No error propagation: errors are completely isolated
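The stream-cipher behaviour and the error isolation of CTR mode can be illustrated with a small model. Everything here is an assumption made for the sketch: `toy_block_function` is a SHA-256-based stand-in and is NOT AES, the 8-byte nonce plus 8-byte counter layout is one possible choice, and an upset is modelled simply as a single flipped bit in one ciphertext block.

```python
import hashlib

BLOCK = 16  # bytes per block (128 bits, as in AES)


def toy_block_function(key, block):
    """Stand-in for the AES forward cipher (NOT AES; for illustration only)."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def ctr_crypt(key, nonce, data):
    """CTR mode: encrypt successive counter values and XOR them with the data.

    The same function encrypts and decrypts. Each keystream block depends
    only on the counter, so blocks can be processed independently.
    """
    out = bytearray()
    for n in range(0, len(data), BLOCK):
        counter_block = nonce + (n // BLOCK).to_bytes(8, "big")
        keystream = toy_block_function(key, counter_block)
        out += bytes(d ^ k for d, k in zip(data[n:n + BLOCK], keystream))
    return bytes(out)


key, nonce = b"k" * 16, b"n" * 8
plaintext = bytes(range(64))            # four 16-byte blocks
ciphertext = ctr_crypt(key, nonce, plaintext)

# Model an upset: flip one bit in block 1 of the ciphertext.
damaged = bytearray(ciphertext)
damaged[BLOCK] ^= 0x01
recovered = ctr_crypt(key, nonce, bytes(damaged))

# Indices of the blocks that differ from the original plaintext.
bad_blocks = [i for i in range(4)
              if recovered[i * BLOCK:(i + 1) * BLOCK] != plaintext[i * BLOCK:(i + 1) * BLOCK]]
```

Because no block feeds into another, the damage stays confined to the block where it occurred, and the independent counter blocks are what allow the cipher pipeline to stay full.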
Cipher Block Chaining (CBC) Mode
• Most secure: no patterns are observed
• Cannot be pipelined
• 100% downstream corruption resulting from data loss or single-event upsets (SEUs) during encryption
  – Errors are isolated during decryption
Electronic Codebook (ECB) Mode
• Supports full pipelining
• No error propagation: errors are completely isolated
• Least secure: identical input gives identical output
  – Patterns observable in video and image data
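The "identical input gives identical output" weakness is easy to reproduce in a toy model. In the sketch below, `toy_block_function` is a SHA-256-based stand-in and NOT AES, and the flat run of identical bytes is a hypothetical stand-in for a uniform region of an image; CTR mode is included only for contrast.

```python
import hashlib

BLOCK = 16  # bytes per block


def toy_block_function(key, block):
    """Stand-in for the AES forward cipher (NOT AES; for illustration only)."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def split_blocks(data):
    """Cut data into consecutive BLOCK-sized pieces."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def ecb_encrypt(key, data):
    """ECB: every block is encrypted independently with the same key."""
    return [toy_block_function(key, b) for b in split_blocks(data)]


def ctr_encrypt(key, nonce, data):
    """CTR: every block is XORed with a keystream derived from its counter."""
    out = []
    for n, b in enumerate(split_blocks(data)):
        keystream = toy_block_function(key, nonce + n.to_bytes(8, "big"))
        out.append(bytes(x ^ y for x, y in zip(b, keystream)))
    return out


key = b"k" * 16
flat_region = b"\xAA" * BLOCK * 4       # four identical plaintext blocks

ecb_ct = ecb_encrypt(key, flat_region)
ctr_ct = ctr_encrypt(key, b"\x00" * 8, flat_region)

distinct_ecb = len(set(ecb_ct))         # identical blocks collapse to one value
distinct_ctr = len(set(ctr_ct))         # every block looks different
```

The repeated ciphertext blocks under ECB are what make outlines visible in encrypted image and video data.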
Staggered CBC Mode
Pipelined with Output Feedback
• Each encrypted block n depends on itself and the block (n – x), where x is the latency of the pipeline
• Maintains security while mitigating some error propagation problems
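The dependency described above can be written as C[n] = E(P[n] XOR C[n - x]), which amounts to x interleaved CBC chains. The sketch below models only that encryption-side dependency. It is not the author's hardware: `toy_block_function` is a SHA-256-based stand-in and NOT AES, and supplying one initialization vector per chain (the `ivs` list) is an assumption made for this example, since the slide does not say how the first x blocks are seeded.

```python
import hashlib

BLOCK = 16  # bytes per block


def toy_block_function(key, block):
    """Stand-in for the AES forward cipher (NOT AES; for illustration only)."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def scbc_encrypt(key, ivs, blocks):
    """Staggered CBC sketch: block n is chained to ciphertext block n - x.

    x = len(ivs) plays the role of the pipeline latency. With x == 1 this
    reduces to ordinary CBC chaining.
    """
    x = len(ivs)
    out = []
    for n, plain in enumerate(blocks):
        previous = out[n - x] if n >= x else ivs[n]
        out.append(toy_block_function(key, xor_bytes(plain, previous)))
    return out


key = b"k" * 16
ivs = [b"\x00" * BLOCK, b"\x01" * BLOCK]         # x = 2 interleaved chains
blocks = [bytes([i]) * BLOCK for i in range(6)]

reference = scbc_encrypt(key, ivs, blocks)

# Change plaintext block 0 and see which ciphertext blocks are affected.
altered = [b"\xFF" * BLOCK] + blocks[1:]
changed = [i for i, (a, b) in enumerate(zip(reference, scbc_encrypt(key, ivs, altered)))
           if a != b]
```

With x = 2 a change in block 0 reaches only blocks 0, 2 and 4, so a disturbance stays within one chain instead of spreading to every later block, while x blocks can be in flight in the pipeline at once.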
More Challenges
• Error-Tolerant Encryption
• Maintaining High Security
• Maintaining High Performance
Error-Tolerant Encryption
• Are errors acceptable?
  – Possibly, but better to assume not
• How do the multiple modes of encryption deal with upsets?
• Is there a benefit to triple modular redundancy (TMR)?
  – Is it what we expect?
Error-Tolerant Encryption
• CTR and ECB encryption isolate errors
  – Transmission integrity largely preserved even without SEU mitigation
• TMR can ensure 100% transmission integrity
  – TMR REQUIRED for CBC encryption
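The core of TMR is a bitwise majority vote over three copies of the same result. The sketch below shows that voter on 128-bit values held as Python integers; the function name and the sample values are chosen for this example, and the slides do not describe how the voter is actually built.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


good = 0x00112233445566778899AABBCCDDEEFF   # hypothetical correct 128-bit block

# One of the three copies suffers a single-bit upset.
upset = good ^ (1 << 77)
voted_single = majority_vote(good, upset, good)

# Two copies hit in the same bit position defeat the voter.
voted_double = majority_vote(upset, upset, good)
```

A single upset in any one copy is masked, which is why triplication restores the integrity that CBC would otherwise lose downstream; the scheme only fails when two copies are wrong in the same bit.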
Error-Tolerant Encryption
• Image 1: Error-Free Plaintext Image
  – Before Encryption / After Decryption
  – CTR, ECB, or CBC with mitigation
• Image 2: Decrypted Plaintext Image
  – One corrupted block
  – CTR or ECB without mitigation
• Image 3: Decrypted Plaintext Image
  – One block corrupted during encryption
  – CBC without mitigation
Maintaining High Security
• How do the multiple modes of encryption affect security?
• Is physical protection of the key necessary?
  – Depends on the environment
• How is throughput affected by increased security?
  – Hopefully, not at all…
Maintaining High Security
• ECB-encrypted image has observable patterns
• CTR/CBC/SCBC encryption looks like random noise
Maintaining High Security
• Physical Key Protection
  – Not required in aerospace applications
• Power Analysis / Soft Attacks
  – Countermeasures not mode specific
• Throughput Effects
  – ECB & CTR far outperform CBC
  – Why is CBC an official mode?
System-Level Diagram
• Supports ECB, CTR, CBC, and SCBC modes
• Supports two types of TMR
  – System: triplicates all control, key hardware, and mode logic
  – Encryption: triplicates only encryption and key scheduling hardware
Performance Results – Virtex-4
Byte Substitution   Key Scheduling   Area    Frequency   Throughput (CTR, ECB, SCBC)   Throughput (CBC)
ROM                 Online           3588    339.5 MHz   43.5 Gbps                     1.088 Gbps
ROM                 Offline          2827    446.8 MHz   57.2 Gbps                     1.430 Gbps
Combinatorial       Online           13651   519.2 MHz   66.5 Gbps                     700.0 Mbps
Combinatorial       Offline          10912   519.2 MHz   66.5 Gbps                     700.0 Mbps

• Key Scheduling
  – Offline uses precomputed and stored keys (compile or design time)
  – Online uses dynamically computed keys (run time)
• Significant performance improvement for combinatorial byte substitution in pipelined mode
• Virtex-II Pro performs better with ROM implementation (56.42 & 60.35 Gbps)
• Better CBC performance achieved through other architectures
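The pipelined throughput figures in the table appear consistent with one 128-bit block completing per clock cycle, and the CBC figures for the ROM rows appear consistent with one block per 40-cycle pass through the pipeline (the atomic S-box pipeline depth quoted earlier). Both relationships are inferences from the numbers, not statements from the slides, and the sketch below simply checks the arithmetic. The CBC latency for the combinatorial rows is not stated, so it is not modelled.

```python
BLOCK_BITS = 128  # AES block size in bits


def pipelined_gbps(clock_mhz):
    """Throughput if one block completes every clock cycle (CTR, ECB, SCBC)."""
    return BLOCK_BITS * clock_mhz / 1000.0


def feedback_gbps(clock_mhz, latency_cycles):
    """Throughput if a new block can only start once the previous one finishes (CBC)."""
    return pipelined_gbps(clock_mhz) / latency_cycles


# (byte substitution, key scheduling, clock in MHz) taken from the table above.
rows = [
    ("ROM", "Online", 339.5),
    ("ROM", "Offline", 446.8),
    ("Combinatorial", "Online", 519.2),
    ("Combinatorial", "Offline", 519.2),
]

for substitution, scheduling, mhz in rows:
    print(f"{substitution:14s}{scheduling:9s}{pipelined_gbps(mhz):6.1f} Gbps pipelined")

# Assumed 40-cycle latency for the ROM (atomic S-box) designs.
rom_cbc = [feedback_gbps(mhz, 40) for _, _, mhz in rows[:2]]
```

Under these assumptions the gap between the two throughput columns is simply the pipeline depth: a feedback mode keeps only one of the pipeline stages busy at a time.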
Lessons Learned
• Don’t try to over-optimize FPGA code
  – Returns diminish quickly
  – Sometimes less is more
• Know your synthesis tool
  – Now why did it do THAT?
• Check your system’s memory
  – RAM does fail at inopportune times…
      • ESPECIALLY if it has a lifetime warranty
Lessons Learned
• Over-optimization
  – In a highly pipelined FPGA design, routing plays a MAJOR role in the clock frequency
      • 70%-80% of the total delay
  – What would work in an ASIC (or in theory, or on paper…) might actually make things worse
  – Manual floorplanning and place-and-route (P&R) might help, but usually provide minimal (if any) improvement
  – Moral? Try reducing the pipeline depth as well as increasing it; it just might help!