https://www.cosic.esat.kuleuven.be/ecrypt/AESday/slides/AES-DAY-Gueneysu.pdf

Implementing AES on a
Bunch of Processors
ECRYPT AES Day – Bruges, Belgium
Tim Güneysu
Hardware Security Group
Horst Görtz Institute for IT-Security
10/23/2012
Outline
• Introduction
• Processor Platforms
• Tricks, Tweaks and Codes
• Benchmarks and Results
• Conclusions
The AES Crib Sheet
AES Implementation: General Representation
• Two different representations
of AES in original proposal
– 8-bit standard implementation
– 32-bit T-Table implementation, e.g.,
$$
\begin{pmatrix} E_0 \\ E_1 \\ E_2 \\ E_3 \end{pmatrix}
= T_0(A_0) \oplus T_1(A_5) \oplus T_2(A_{10}) \oplus T_3(A_{15}) \oplus
\begin{pmatrix} k_{0,j} \\ k_{1,j} \\ k_{2,j} \\ k_{3,j} \end{pmatrix}
$$
¼ round: 4 Table Look-Ups (TLU) + 4 x 32-bit XOR
1 round: 4 x 4 = 16 TLU + XOR
AES: 160 TLU + XOR per block encryption
Memory: 4 T-Boxes, 1kB each
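The four 1 kB tables can be derived from the AES S-box. The following is a minimal sketch, not the code behind the slides: it assumes the common big-endian packing (2·S, S, S, 3·S) for the first table and byte rotations for the other three. The helper names `gf_mul`, `aes_sbox`, `rotr32` and `build_t_tables` are made up for illustration, and the S-box is computed by brute force for clarity rather than speed.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
        b >>= 1;
    }
    return r;
}

/* AES S-box: field inverse (brute-force search) followed by the affine map. */
static uint8_t aes_sbox(uint8_t x)
{
    uint8_t inv = 0;
    for (unsigned c = 1; x != 0 && c < 256; c++) {
        if (gf_mul(x, (uint8_t)c) == 1) {
            inv = (uint8_t)c;
            break;
        }
    }
    uint8_t s = inv;
    for (int i = 1; i <= 4; i++)
        s ^= (uint8_t)((inv << i) | (inv >> (8 - i)));   /* rotate left by i */
    return (uint8_t)(s ^ 0x63);
}

static uint32_t rotr32(uint32_t w, unsigned n)
{
    assert(n > 0 && n < 32);
    return (w >> n) | (w << (32 - n));
}

/* Fill four 256-entry tables of 32-bit words (4 x 1 kB). */
static void build_t_tables(uint32_t t[4][256])
{
    for (unsigned x = 0; x < 256; x++) {
        uint8_t s = aes_sbox((uint8_t)x);
        uint32_t w = ((uint32_t)gf_mul(s, 2) << 24) | ((uint32_t)s << 16)
                   | ((uint32_t)s << 8) | gf_mul(s, 3);
        t[0][x] = w;              /* (2S, S, S, 3S) */
        t[1][x] = rotr32(w, 8);   /* (3S, 2S, S, S) */
        t[2][x] = rotr32(w, 16);  /* (S, 3S, 2S, S) */
        t[3][x] = rotr32(w, 24);  /* (S, S, 3S, 2S) */
    }
}
```

With this packing, entry 0 of the first table is 0xc66363a5, since S(0) = 0x63.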
AES Implementation : Choice of Processor
• A dedicated AES processor in HW
is not always the preferred option
– AES is often just a supplementary
function to a software application
– HW development is too costly
or necessary skills are not available
• But when doing AES in software:
which processor is the best?
AES Implementation : Parameters
• Key size of AES
– 128, 192, 256 bit
• Applied mode of operation
– ECB, CBC, GCM, CTR,…
• Blocks concurrently processed
– Single block (limited data transfers)
– Multiple blocks (overhead reduction, bitslicing)
• Round key computation
– Precomputed (when processing bulk data)
– On-the-fly (when changing keys frequently)
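The two round-key strategies can be compared in a small sketch for AES-128. It is an illustration under assumptions, not the code of any implementation discussed here: all function names are hypothetical, words are packed big-endian, and the S-box is computed by brute force to keep the block self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply in GF(2^8) modulo the AES polynomial. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
        b >>= 1;
    }
    return r;
}

/* AES S-box via brute-force inverse and affine map (slow, for clarity only). */
static uint8_t aes_sbox(uint8_t x)
{
    uint8_t inv = 0;
    for (unsigned c = 1; x != 0 && c < 256; c++)
        if (gf_mul(x, (uint8_t)c) == 1) { inv = (uint8_t)c; break; }
    uint8_t s = inv;
    for (int i = 1; i <= 4; i++)
        s ^= (uint8_t)((inv << i) | (inv >> (8 - i)));
    return (uint8_t)(s ^ 0x63);
}

/* Apply the S-box to each byte of a 32-bit word. */
static uint32_t sub_word(uint32_t w)
{
    return ((uint32_t)aes_sbox((uint8_t)(w >> 24)) << 24)
         | ((uint32_t)aes_sbox((uint8_t)(w >> 16)) << 16)
         | ((uint32_t)aes_sbox((uint8_t)(w >> 8)) << 8)
         | aes_sbox((uint8_t)w);
}

/* Precomputed: expand the key once into all 44 round-key words. */
static void aes128_expand_key(const uint32_t key[4], uint32_t rk[44])
{
    uint8_t rcon = 0x01;
    for (int i = 0; i < 4; i++)
        rk[i] = key[i];
    for (int i = 4; i < 44; i++) {
        uint32_t t = rk[i - 1];
        if (i % 4 == 0) {
            t = sub_word((t << 8) | (t >> 24)) ^ ((uint32_t)rcon << 24);
            rcon = gf_mul(rcon, 2);
        }
        rk[i] = rk[i - 4] ^ t;
    }
}

/* On-the-fly: turn round key r into round key r+1 in place (16 bytes of state).
 * Takes the round constant for this step and returns the next one. */
static uint8_t aes128_next_round_key(uint32_t rk[4], uint8_t rcon)
{
    uint32_t t = rk[3];
    rk[0] ^= sub_word((t << 8) | (t >> 24)) ^ ((uint32_t)rcon << 24);
    rk[1] ^= rk[0];
    rk[2] ^= rk[1];
    rk[3] ^= rk[2];
    return gf_mul(rcon, 2);
}
```

The precomputed variant costs 176 bytes of round-key storage per key, while the on-the-fly variant redoes four S-box lookups per round for every block.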
Processors and Platforms
• Native bit sizes of General-Purpose Processors (GPP)
4-Bit, e.g., MARC-4, NEC uPD75X in pocket calculators/washing machines
8-Bit, e.g., Atmel ATMegaXX, Intel 8051 in (many) embedded systems
16-Bit, e.g., TI MSP430, DEC PDP-11 in (fewer) embedded systems
32-Bit, e.g., ARM, TriCore in smart phones and automobiles
64-Bit, e.g., Intel i3/5/7, AMD A-Series in PCs and workstations
128-Bit, e.g., IBM Cell in PS3 (actually, only 128-bit SIMD on SPEs)
• Myth or Fact: AES is always most efficient on native
8-bit and 32-bit processors!?
Processor Architectures
• General processor design
RISC vs. CISC (Reduced/Complex Instruction Set Computer)
Single-Instruction Multiple Data (SIMD) operation
Super-scalar devices processing more than one instruction per cycle
• Processor interface to memory
Von-Neumann vs. Harvard: shared memory for data and program?
Cache for data and/or program? (→ Cache attacks!)
Static/dynamic external or built-in RAM?
• Additional processor extensions
Multimedia/integer co-processor
Special/native Instruction Set Extensions (ISE)
Other Processor Architectures
• Streaming processors, such as GPUs
– Multi-processors run hundreds of concurrent threads
– High memory bandwidth, but high latency to global memory
• Digital Signal Processors (DSP)
– Support fast combined arithmetic instructions
– Are improved arithmetic instructions useful for AES?
• Other array/tile-based processors
– Synchronous/asynchronous processing cores
– Processor-based systolic array cores (Tilera, GreenArrays)
AES Software Optimization
• General requirements for secure implementation in software
– Disable (or control) cache to prevent cache attacks
– Avoid conditional branches to counter timing attacks
• Common tweaks to achieve high performance
– Make particular use of specialized instructions
– Unroll rounds and loops to reduce instruction cycle count
– Optimize register allocation
– Precompute and store values in tables (e.g., T-Tables, round keys and constants)
• Common tweaks to minimize code size
– Reuse code by functions to minimize instruction count
– Limit amount of precomputed and stored values
• Common tweaks for low energy consumption
– Reduce number of costly load and store operations to memory
– General approach often similar to the optimization for high-performance
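As an illustration of the "avoid conditional branches" requirement, the sketch below shows the GF(2^8) doubling step used in MixColumns, once with a data-dependent branch and once branch-free. The function names are hypothetical. Whether a compiler keeps the second version free of branches depends on the target and optimization settings, so the generated code should be inspected.

```c
#include <assert.h>
#include <stdint.h>

/* Branching version: the reduction is taken or skipped depending on the
 * top bit of the (possibly secret) input. */
static uint8_t xtime_branch(uint8_t x)
{
    if (x & 0x80)
        return (uint8_t)((x << 1) ^ 0x1B);
    return (uint8_t)(x << 1);
}

/* Branch-free version: build an all-zeros or all-ones mask from the top bit
 * and use it to select the reduction constant. */
static uint8_t xtime_ct(uint8_t x)
{
    uint8_t mask = (uint8_t)(0u - (unsigned)(x >> 7));   /* 0x00 or 0xFF */
    return (uint8_t)((x << 1) ^ (mask & 0x1B));
}
```

Both functions return the same value for every input; only the control flow differs.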
Coding Intermezzo: Have you ever tried to
implement AES on a Commodore C64?
AES-256 in ACME Assembler [Extract of source at http://www.robos.org/prog]
encrypt
        ldx #$07
.addfirst
        lda aesblock+0,x        ; 4
        eor expkey+0,x          ; 8
        sta tmpblock+0,x        ; 13
        lda aesblock+8,x        ; 17
        eor expkey+8,x          ; 21
        sta tmpblock+8,x        ; 26
        dex                     ; 28
        bpl .addfirst           ; 31

        ldy #$10
.round
        lda expkey+$00,y        ; 4
        ldx tmpblock+4*0+0      ; 7
        eor ssm0,x              ; 11
        ldx tmpblock+4*1+1      ; 14
        eor ssm3,x              ; 18
        ldx tmpblock+4*2+2      ; 21
        eor ssm2,x              ; 25
        ldx tmpblock+4*3+3      ; 28
        eor ssm1,x              ; 32
        sta aesblock+$00        ; 36

        lda expkey+$01,y
        ldx tmpblock+4*0+0
        eor ssm1,x
        ldx tmpblock+4*1+1
        eor ssm0,x
        ldx tmpblock+4*2+2
        eor ssm3,x
        ldx tmpblock+4*3+3
        eor ssm2,x
        sta aesblock+$01

        lda expkey+$02,y
        ldx tmpblock+4*0+0
        eor ssm2,x
        ldx tmpblock+4*1+1
        eor ssm1,x
        ldx tmpblock+4*2+2
        eor ssm0,x
        ldx tmpblock+4*3+3
        eor ssm3,x
        sta aesblock+$02

        lda expkey+$03,y
        ldx tmpblock+4*0+0
        eor ssm3,x
        ldx tmpblock+4*1+1
        eor ssm2,x
        ldx tmpblock+4*2+2
        eor ssm1,x
        ldx tmpblock+4*3+3
        eor ssm0,x
        sta aesblock+$03

        lda expkey+$04,y
        ldx tmpblock+4*1+0
        eor ssm0,x
        ldx tmpblock+4*2+1
        eor ssm3,x
        ldx tmpblock+4*3+2
        eor ssm2,x
        ldx tmpblock+4*0+3
        eor ssm1,x
        sta aesblock+$04

        lda expkey+$05,y
        ldx tmpblock+4*1+0
        eor ssm1,x
        ldx tmpblock+4*2+1
        eor ssm0,x
        ldx tmpblock+4*3+2
        eor ssm3,x
        ldx tmpblock+4*0+3
        eor ssm2,x
        sta aesblock+$05
Commodore C64
8-bit CPU with 64 KB RAM
Real Coding: Sample T-Table AES in C
• AES has 720 instructions (INS)
– 208 loads
– 4 stores
– 508 integer instructions
    • 160 shifts
    • 176 masks (+16 for last round)
    • 168 XORs
    • 4 overhead for CTR mode
• Per round, 4 instances of code snippet required
• High-performance AES for processors ≥ 32 bit with interleaved T-tables

Code snippet (reference code by Brian Gladman; only ¼ round shown):

    /* Read round keys */
    z0 = roundkeys[i * 4 + 0];
    z1 = roundkeys[i * 4 + 1];
    z2 = roundkeys[i * 4 + 2];
    z3 = roundkeys[i * 4 + 3];

    /* Extract and mask input bytes */
    p00 = (uint32) y0 >> 20;
    p01 = (uint32) y0 >> 12;
    p02 = (uint32) y0 >> 4;
    p03 = (uint32) y0 << 4;
    p00 &= 0xff0;
    p01 &= 0xff0;
    p02 &= 0xff0;
    p03 &= 0xff0;

    /* Perform TLU */
    p00 = *(uint32 *) (table0 + p00);
    p01 = *(uint32 *) (table1 + p01);
    p02 = *(uint32 *) (table2 + p02);
    p03 = *(uint32 *) (table3 + p03);

    /* Add TLU to round keys */
    z0 ^= p00;
    z3 ^= p01;
    z2 ^= p02;
    z1 ^= p03;
    …

Interleaved memory layout (32-bit entries):

    Offset (byte)   table0    table1    table2    table3
     0              Table 0   Table 1   Table 2   Table 3
    16              Table 0   Table 1   Table 2   Table 3
    32              Table 0   Table 1   Table 2   Table 3
    48              Table 0   Table 1   Table 2   Table 3

Access j-th table entry of table i via table<i>+16j
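The interleaved layout can be sketched as follows. This is an illustrative model only: the buffer `interleaved` and the helpers `interleave_tables`, `table_base`, `tlu` and `extract_offsets` are hypothetical names, and `memcpy` is used for the loads to stay within strict-aliasing rules instead of the pointer casts shown in the snippet.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 256 rows of 16 bytes; row j holds entry j of tables 0..3 side by side. */
static uint8_t interleaved[256 * 16];

/* Copy four separate 256-entry tables into the interleaved buffer. */
static void interleave_tables(uint32_t t[4][256])
{
    for (unsigned j = 0; j < 256; j++)
        for (unsigned i = 0; i < 4; i++)
            memcpy(interleaved + 16 * j + 4 * i, &t[i][j], sizeof(uint32_t));
}

/* Base address of table i, i.e. the first entry of that table. */
static const uint8_t *table_base(unsigned i)
{
    assert(i < 4);
    return interleaved + 4 * i;
}

/* Table look-up with a byte offset that is already a multiple of 16. */
static uint32_t tlu(const uint8_t *table, uint32_t byte_offset)
{
    uint32_t v;
    assert(byte_offset <= 0xff0 && (byte_offset & 0xf) == 0);
    memcpy(&v, table + byte_offset, sizeof v);
    return v;
}

/* Shift-and-mask extraction: each offset is 16 times one byte of y0. */
static void extract_offsets(uint32_t y0, uint32_t p[4])
{
    p[0] = (y0 >> 20) & 0xff0;   /* most significant byte  */
    p[1] = (y0 >> 12) & 0xff0;
    p[2] = (y0 >> 4) & 0xff0;
    p[3] = (y0 << 4) & 0xff0;    /* least significant byte */
}
```

Because every row is 16 bytes wide, the mask 0xff0 yields a ready-to-use byte offset and no separate multiplication of the index is needed.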
Optimizing AES for High-Performance
• Special instruction: Combined Shift-and-Mask
– On PPC, rlwinm is available as single instruction

Extract and mask input bytes with separate shift and mask:
    p00 = (uint32) y0 >> 20;
    p01 = (uint32) y0 >> 12;
    p02 = (uint32) y0 >> 4;
    p03 = (uint32) y0 << 4;
    p00 &= 0xff0;
    p01 &= 0xff0;
    p02 &= 0xff0;
    p03 &= 0xff0;

Combined shift-and-mask:
    p00 = (uint32) y0 >> 20 & 0xff0;
    p01 = (uint32) y0 >> 12 & 0xff0;
    p02 = (uint32) y0 >> 4 & 0xff0;
    p03 = (uint32) y0 << 4 & 0xff0;

– Saves 160 instructions for separate masking [BS08]
– AES on PPC now has 560 instructions (720 − 160)
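To see why one instruction suffices, the sketch below models the rotate-and-mask behaviour of rlwinm in C. It is a simplified model under assumptions: the function `rlwinm` here is a hypothetical helper, it only handles masks with MB ≤ ME (no wrap-around), and it uses the PowerPC convention in which bit 0 is the most significant bit.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* C model of "rotate left word immediate then AND with mask":
 * rotate rs left by sh, then keep bits mb..me (bit 0 = MSB). */
static uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    assert(mb <= me && me <= 31);
    uint32_t mask = (0xFFFFFFFFu >> mb) & (0xFFFFFFFFu << (31 - me));
    return rotl32(rs, sh) & mask;
}

/* The four extractions of the quarter round, one modelled instruction each.
 * Bits 20..27 in PowerPC numbering correspond to the mask 0xff0. */
static void extract_offsets_rlwinm(uint32_t y0, uint32_t p[4])
{
    p[0] = rlwinm(y0, 12, 20, 27);   /* same as (y0 >> 20) & 0xff0 */
    p[1] = rlwinm(y0, 20, 20, 27);   /* same as (y0 >> 12) & 0xff0 */
    p[2] = rlwinm(y0, 28, 20, 27);   /* same as (y0 >> 4)  & 0xff0 */
    p[3] = rlwinm(y0, 4, 20, 27);    /* same as (y0 << 4)  & 0xff0 */
}
```

A rotation can stand in for the shift because the mask discards every bit that the rotation wraps around.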
Optimizing AES for High-Performance [cont.]
• Special instruction: Scaled Index Loads
– On x86, shift and load instructions can be combined

Extract and mask input bytes, then perform TLU:
    p03 = (uint32) y0 << 4
    …
    p03 &= 0xff0
    …
    p03 = *(uint32 *) (table3 + p03)

Mask first and do shifted TLU:
    p03 = y0 & 0xff
    p03 = *(uint32 *) (table3 + (p03 << 4))

– Saves 80 instructions for separate shifting top and bottom bytes [BS08]
– AES on x86 has 640 instructions (not to be combined with previous method!!)
Optimizing AES for High-Performance [cont.]
• Availability of 64-bit Registers
– On AMD64 and UltraSparcV9, use padded values in 64-bit registers
0xc66363a5 → 0x0c60063006300a50
– Padding implicitly includes the shift by 4 bit (aka multiplication by 16)
– Padding is applied consistently through entire AES
– Saves 80 instructions (no need to mask top bytes anymore) [BS08]
– AES now has 640 instructions (again, not to be combined)
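The padding can be sketched in a few lines of C. The helper names `pad_word` and `padded_offset` are hypothetical. Each byte of the 32-bit word is placed in its own 16-bit lane, already shifted left by 4, so a single shift and the mask 0xff0 yield the table offset for any byte position.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the four bytes of w over four 16-bit lanes, each pre-shifted by 4. */
static uint64_t pad_word(uint32_t w)
{
    uint64_t out = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint64_t b = (w >> (8 * i)) & 0xff;
        out |= b << (16 * i + 4);
    }
    return out;
}

/* Table offset (16 * byte value) for lane i of a padded word. */
static uint32_t padded_offset(uint64_t p, unsigned i)
{
    assert(i < 4);
    return (uint32_t)((p >> (16 * i)) & 0xff0);
}
```

Since padding only moves bits, it commutes with XOR: the padded form of a ^ b equals the XOR of the padded forms, which is what allows the padded representation to be kept throughout all rounds.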
Optimizing AES for High-Performance [cont.]
• Other ways to optimize the T-Table AES in software…
– Special instruction: Combined Load-XOR (x86/AMD64) → saves 168 instructions
– Special instruction: second byte extraction instruction (x86) → saves 40 instructions
– Special instruction: two-bytes loads → saves 4 instructions
– Byte extraction via loads → trades 160-320 integer instructions against 200 loads/stores
– Round key caching in extra registers → saves about 44 instructions
– Utilize SSE processor extensions instead of plain CPU ALU
– …
• Results for common ≥32-bit processors (encrypting 4kByte of data)
– IBM PPC G4 7410: 459 instructions → 14.57 cycles/byte [BS08]
– Intel Pentium 4 f12: 414 instructions → 14.13 cycles/byte [BS08]
– Intel Core 2 Quad Q6600 with SSE3: ≈ 278 instructions* → 9.32 cycles/byte [KS09]
– Sun UltraSparc III: 505 instructions → 12.06 cycles/byte [BS08]
– Intel Core i7 920: ≈ 278 instructions* → 6.92 cycles/byte [KS09]
*estimated since not provided in the original work
Coding Intermezzo: Have you ever
programmed AES-128 in Whitespace?
Whitespace is a programming language developed by Edwin Brady and Chris Morris.
The Whitespace interpreter ignores any non-whitespace characters.
Only spaces, tabs and linefeeds have a meaning.
AES on IBM's Cell Processor
• Hybrid Processor Architecture
– PPC Processor (main/control unit)
– 8 Synergistic Processing Elements (SPE) as work horse with 128-bit registers
– Fast ring interconnect between PPC and SPE units
– SPEs support efficient byte extraction and manipulation instructions for their 128-bit SIMD registers (e.g., shuffle, select)
– 16-fold byte-sliced implementation per 128-bit register [BOS09]
– Single AES encryption in ≈283 instructions (1752+2764 INS per 16 streams) → 11.7 cycles/byte
AES on NVIDIA's GTX 295 GPU
• Streaming Processor Architecture
– 2x240 streaming processor units
– Memory design including local, shared, texture and (large) global memory
– Cache for some memories
– Runs a large number of concurrent, synchronized threads
– AES implementation using CUDA language (or OpenCL as alternative)
– 32-bit integer instructions and T-Tables stored in shared memory
– Benchmarking is not precise due to uncontrolled scheduling: throughput up to 59.6 GB/s reported → 0.17 cycles/byte [BOS09]
AES on Embedded Systems
• TI TMS320-C6201: 16/32-bit DSP @200MHz
– Parallelized T-Table implementation on four pairs of ALUs
– Encryption in 228 cycles → 14.25 cycles/byte [WWGP00]
• AVR ATMega: 8-bit RISC microcontroller @8MHz
– 8-bit AES implementation
– Speed-up by 1 cycle per TLU by placing S-box in RAM (not Flash)
– Fast encryption takes 2,153 cycles → 134.56 cycles/byte [BOS09]
• MARC4: 4-bit RISC microcontroller @1MHz
– 8-bit AES implementation with 2 registers per entry
– "Fast" encryption takes 23,828 cycles → 1,489 cycles/byte [KP12]
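As a worked check, the cycles/byte figures on this slide follow from dividing the cycle count for one 128-bit block by its 16 bytes:

```latex
\[
\text{cycles/byte} = \frac{\text{cycles per block}}{16}, \qquad
\frac{228}{16} = 14.25, \quad
\frac{2153}{16} \approx 134.56, \quad
\frac{23828}{16} \approx 1489
\]
```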
Coding Intermezzo: Programming AES
in Colors (aka ColorForth)
MixColumns Layer in ColorForth
AES on Embedded Systems (cont.)
• GreenArrays GA144 Tile processor
– Asynchronous 144 core device (nodes)
– Each F18A core has 18-bit ALU
– 128 words of memory + 20 words stack
– Up to 4 instructions per word
• AES implementation in ColorForth, spread over 17 of 144 nodes
• Asynchronous processor operation disables cycle count metric
• Absolute time per 128-bit encryption: 38 µ[email protected] supply voltage at 0.9µJ
AES-128 Platform Ranking on High-Performance
AES performance when encrypting large packets (4 kByte)
1) 32-bit: GTX 295 GPU [BOS09]: 0.17 cycles/byte
2) 64-bit (with 128-bit SSE3): Core i7 920 [KS09]: 6.92 cycles/byte
3) 128-bit: IBM Cell SPE [BOS09]: 11.7 cycles/byte
4) 8-bit: AVR ATMega [BOS09]: 134.56 cycles/byte
5) 16-bit: TI C5420 [TI]: 219 cycles/byte
6) 4-bit: MARC4 [KP12]: 1,489 cycles/byte
X) 18-bit: GreenArrays GA144: 38µ[email protected]
Important Remark: Beware of distortions, e.g., due to little interest in
platform (4) or backward applied metrics for platform (1)
AES-128 Platform Ranking on Green Cryptography
Energy required to encrypt a single 128-bit AES block
1) 18-bit: GreenArrays GA144: 0.63µ[email protected]
2) 32-bit: GTX 295 GPU (TDP 289W): 0.67µJ* [max [email protected]]
3) 128-bit: IBM Cell (TDP 110W): 0.91µJ* [max [email protected]]
4) 64-bit: Core i7 920 (TDP 130W): 1.35µJ* [max [email protected]]
5) 4-bit: MARC4: 8.58µJ@1MHz
6) 8-bit: AVR ATMega: ≈10 µJ@8MHz**
7) 16-bit: TI C5420 (TDP 266mW): 47 µJ* [max load@20MHz]
*Extrapolated from TDP with all cores running AES encryption at 100% utilization
**Extrapolated using an averaged AVR power model based on given cycle count
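The slides do not spell out the extrapolation formula. A plausible reading, stated here as an assumption, is energy = power × time per block, with the time derived from cycles/byte and clock frequency. The hypothetical helper below implements that reading. For the TI C5420 figures given above (266 mW, 219 cycles/byte, 20 MHz) it yields about 46.6 µJ, consistent with the rounded 47 µJ in the ranking.

```c
#include <assert.h>
#include <stdint.h>

/* Energy in joules to encrypt one 16-byte AES block, assuming the device
 * draws power_watt for the whole duration of the encryption. */
static double energy_per_block_joule(double power_watt, double clock_hz,
                                     double cycles_per_byte)
{
    assert(clock_hz > 0.0);
    double seconds_per_block = (cycles_per_byte * 16.0) / clock_hz;
    return power_watt * seconds_per_block;
}
```

For example, `energy_per_block_joule(0.266, 20e6, 219.0)` gives 3504 cycles at 20 MHz, i.e. 175.2 µs at 266 mW. This single-stream model does not account for several cores working on different blocks in parallel.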
AES Benchmarking: More Results
Source for further (symmetric) crypto benchmarks:
eBACS: ECRYPT Benchmarking of Cryptographic Systems:
http://bench.cr.yp.to/primitives-stream.html
Contains latest benchmarks for (currently) 27 different processors running AES-128/192/256
Conclusions
• For (nearly) any processing device, you can find an
AES implementation in (nearly) any language
• What I couldn't find (yet)
– AES in Brainfuck programming language (sorry - no intermezzo slide!)
– AES on 2-bit or 256-bit processors
• Processors that natively support the operand sizes of AES (8/32-bit) are still
at the top of the list (Fact!)
• Processor extensions (such as AES-NI or SSEx) greatly support AES
encryption in software (see Ryad's talk in the afternoon!)
Bibliography
• [WWGP00] Thomas Wollinger, M. Wang, Jorge Guajardo Merchan, Christof Paar: How Well Are High-End DSPs Suited for the AES Algorithms? AES Algorithms on the TMS320C6x DSP. The Third Advanced Encryption Standard (AES3) Candidate Conference, New York, USA, April 13-14, 2000.
• [BS08] Daniel J. Bernstein, Peter Schwabe: New AES Software Speed Records. INDOCRYPT 2008: 322-336
• [BOS09] Joppe W. Bos, Dag Arne Osvik, Deian Stefan: Fast Implementations of AES on Various Platforms. IACR Cryptology ePrint Archive 2009: 501 (2009)
• [KS09] Emilia Käsper, Peter Schwabe: Faster and Timing-Attack Resistant AES-GCM. CHES 2009: 1-17
• [KP12] Tino Kaufmann, Axel Poschmann: Enabling Standardized Cryptography on Ultra-Constrained 4-bit Microcontrollers. IEEE RFID 2012: 32-39