Implementing AES on a Bunch of Processors
ECRYPT AES Day – Bruges, Belgium
Tim Güneysu
Hardware Security Group, Horst Görtz Institute for IT-Security
10/23/2012

Outline
• Introduction
• Processor Platforms
• Tricks, Tweaks and Codes
• Benchmarks and Results
• Conclusions

The AES Crib Sheet

AES Implementation: General Representation
• Two different representations of AES in original proposal
  – 8-bit standard implementation
  – 32-bit T-Table implementation, e.g.,
    $(E_0, E_1, E_2, E_3)^T = T_0(A_0) \oplus T_1(A_5) \oplus T_2(A_{10}) \oplus T_3(A_{15}) \oplus (k_{0,j}, k_{1,j}, k_{2,j}, k_{3,j})^T$
• ¼ round: 4 Table Look-Ups (TLU) + 4 x 32-bit XOR
• 1 round: 4 x 4 = 16 TLU + XOR
• AES: 160 TLU + XOR per block encryption
• Memory: 4 T-Boxes, 1 kB each

AES Implementation: Choice of Processor
• A dedicated AES processor in HW is not always the preferred option
  – AES is often just a supplementary function to a software application
  – HW development is too costly or necessary skills are not available
• But when doing AES in software: which processor is the best?
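The ¼ round from the crib sheet above (4 table look-ups XORed with one round-key word) can be sketched in C as follows. This is a minimal illustration, not the code behind any benchmark in this talk: the names build_tables, quarter_round, sbox and T are hypothetical, the tables are generated at start-up rather than stored as constants, and a big-endian packing of each table entry (02·S, S, S, 03·S) is assumed, so that T0 of input 0 is 0xc66363a5.

```c
#include <stdint.h>

static uint8_t  sbox[256];
static uint32_t T[4][256];          /* four T-tables, 1 kB each */

/* Multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* Generic GF(2^8) multiplication (shift-and-add). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return r;
}

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

static uint32_t rotr32(uint32_t x, int n)
{
    return (x >> n) | (x << (32 - n));
}

/* Build the S-box and the four T-tables (hypothetical helper). */
void build_tables(void)
{
    for (int a = 0; a < 256; a++) {
        /* Inverse in GF(2^8) as a^254; the input 0 maps to 0. */
        uint8_t inv = 1;
        for (int i = 0; i < 254; i++)
            inv = gf_mul(inv, (uint8_t)a);

        /* AES affine transformation. */
        uint8_t s = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                              rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
        sbox[a] = s;

        /* T0 packs (02*S, S, S, 03*S); T1..T3 are byte rotations of T0. */
        uint32_t t0 = ((uint32_t)xtime(s) << 24) | ((uint32_t)s << 16) |
                      ((uint32_t)s << 8) | (uint32_t)(xtime(s) ^ s);
        T[0][a] = t0;
        T[1][a] = rotr32(t0, 8);
        T[2][a] = rotr32(t0, 16);
        T[3][a] = rotr32(t0, 24);
    }
}

/* One quarter round: output column j of the state a[0..15] (column-major),
 * i.e. 4 table look-ups XORed together with the round-key word k. */
uint32_t quarter_round(const uint8_t a[16], uint32_t k, int j)
{
    return T[0][a[(4 * j +  0) % 16]] ^
           T[1][a[(4 * j +  5) % 16]] ^
           T[2][a[(4 * j + 10) % 16]] ^
           T[3][a[(4 * j + 15) % 16]] ^ k;
}
```

Calling quarter_round for j = 0..3 gives one full round (16 TLU); the final AES round omits MixColumns and therefore needs a separate treatment that this sketch leaves out.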
AES Implementation: Parameters
• Key size of AES
  – 128, 192, 256 bit
• Applied mode of operation
  – ECB, CBC, GCM, CTR, …
• Blocks concurrently processed
  – Single block (limited data transfers)
  – Multiple blocks (overhead reduction, bitslicing)
• Round key computation
  – Precomputed (when processing bulk data)
  – On-the-fly (when changing keys frequently)

Outline
• Introduction
• Processor Platforms
• Tricks, Tweaks and Codes
• Benchmarks and Results
• Conclusions

Processors and Platforms
• Native bit sizes of General-Purpose Processors (GPP)
  – 4-bit, e.g., MARC-4, NEC uPD75X in pocket calculators/washing machines
  – 8-bit, e.g., Atmel ATMegaXX, Intel 8051 in (many) embedded systems
  – 16-bit, e.g., TI MSP430, DEC PDP-11 in (fewer) embedded systems
  – 32-bit, e.g., ARM, TriCore in smart phones and automobiles
  – 64-bit, e.g., Intel i3/5/7, AMD A-Series in PCs and workstations
  – 128-bit, e.g., IBM Cell in PS3 (actually, only 128-bit SIMD on SPEs)
• Myth or Fact: AES is always most efficient on native 8-bit and 32-bit processors!?

Processor Architectures
• General processor design
  – RISC vs. CISC (Reduced/Complex Instruction Set Computer)
  – Single-Instruction Multiple-Data (SIMD) operation
  – Super-scalar devices processing more than one instruction per cycle
• Processor interface to memory
  – Von Neumann vs. Harvard: shared memory for data and program?
  – Cache for data and/or program? (Cache attacks!)
  – Static/dynamic external or built-in RAM?
• Additional processor extensions
  – Multimedia/integer co-processor
  – Special/native Instruction Set Extensions (ISE)

Other Processor Architectures
• Streaming processors, such as GPUs
  – Multi-processors run hundreds of concurrent threads
  – High memory bandwidth, but high latency to global memory
• Digital Signal Processors (DSP)
  – Support fast combined arithmetic instructions
  – Are improved arithmetic instructions useful for AES?
• Other array/tile-based processors
  – Synchronous/asynchronous processing cores
  – Processor-based systolic array cores (Tilera, GreenArrays)

Outline
• Introduction
• Processor Platforms
• Tricks, Tweaks and Codes
• Benchmarks and Results
• Conclusions

AES Software Optimization
• General requirements for secure implementation in software
  – Disable (or control) cache to prevent cache attacks
  – Avoid conditional branches to counter timing attacks
• Common tweaks to achieve high performance
  – Make particular use of specialized instructions
  – Unroll rounds and loops to reduce instruction cycle count
  – Optimize register allocation
  – Precompute and store values in tables (e.g., T-Tables, round keys and constants)
• Common tweaks to minimize code size
  – Reuse code by functions to minimize instruction count
  – Limit amount of precomputed and stored values
• Common tweaks for low energy consumption
  – Reduce number of costly load and store operations to memory
  – General approach often similar to the optimization for high performance

Coding Intermezzo: Have you ever tried to implement AES on a Commodore C64?
AES-256 in ACME Assembler [Extract of source at http://www.robos.org/prog]
(Comments give the cumulative cycle count within each block.)

    encrypt
            ldx #$07
    .addfirst
            lda aesblock+0,x        ; 4
            eor expkey+0,x          ; 8
            sta tmpblock+0,x        ; 13
            lda aesblock+8,x        ; 17
            eor expkey+8,x          ; 21
            sta tmpblock+8,x        ; 26
            dex                     ; 28
            bpl .addfirst           ; 31

            ldy #$10
    .round
            lda expkey+$00,y        ; 4
            ldx tmpblock+4*0+0      ; 7
            eor ssm0,x              ; 11
            ldx tmpblock+4*1+1      ; 14
            eor ssm3,x              ; 18
            ldx tmpblock+4*2+2      ; 21
            eor ssm2,x              ; 25
            ldx tmpblock+4*3+3      ; 28
            eor ssm1,x              ; 32
            sta aesblock+$00        ; 36

            lda expkey+$01,y
            ldx tmpblock+4*0+0
            eor ssm1,x
            ldx tmpblock+4*1+1
            eor ssm0,x
            ldx tmpblock+4*2+2
            eor ssm3,x
            ldx tmpblock+4*3+3
            eor ssm2,x
            sta aesblock+$01

            lda expkey+$02,y
            ldx tmpblock+4*0+0
            eor ssm2,x
            ldx tmpblock+4*1+1
            eor ssm1,x
            ldx tmpblock+4*2+2
            eor ssm0,x
            ldx tmpblock+4*3+3
            eor ssm3,x
            sta aesblock+$02

            lda expkey+$03,y
            ldx tmpblock+4*0+0
            eor ssm3,x
            ldx tmpblock+4*1+1
            eor ssm2,x
            ldx tmpblock+4*2+2
            eor ssm1,x
            ldx tmpblock+4*3+3
            eor ssm0,x
            sta aesblock+$03

            lda expkey+$04,y
            ldx tmpblock+4*1+0
            eor ssm0,x
            ldx tmpblock+4*2+1
            eor ssm3,x
            ldx tmpblock+4*3+2
            eor ssm2,x
            ldx tmpblock+4*0+3
            eor ssm1,x
            sta aesblock+$04

            lda expkey+$05,y
            ldx tmpblock+4*1+0
            eor ssm1,x
            ldx tmpblock+4*2+1
            eor ssm0,x
            ldx tmpblock+4*3+2
            eor ssm3,x
            ldx tmpblock+4*0+3
            eor ssm2,x
            sta aesblock+$05

Commodore C64: 8-bit CPU with 64 KB RAM

Real Coding: Sample T-Table AES in C
• High-performance AES for processors ≥ 32 bit with interleaved T-tables (reference code by Brian Gladman)
• Per round, 4 instances of code snippet required
• AES has 720 instructions (INS)
  – 208 loads
  – 4 stores
  – 508 integer instructions
    • 160 shifts
    • 176 masks (+16 for last rnd)
    • 168 XORs
    • 4 overhead for CTR mode
• Steps of the code snippet (only ¼ round shown): read round keys, extract and mask input bytes, perform TLU, add TLU to round keys

    /* Read round keys */
    z0 = roundkeys[i * 4 + 0];
    z1 = roundkeys[i * 4 + 1];
    z2 = roundkeys[i * 4 + 2];
    z3 = roundkeys[i * 4 + 3];

    /* Extract and mask input bytes */
    p00 = (uint32) y0 >> 20;
    p01 = (uint32) y0 >> 12;
    p02 = (uint32) y0 >> 4;
    p03 = (uint32) y0 << 4;
    p00 &= 0xff0;
    p01 &= 0xff0;
    p02 &= 0xff0;
    p03 &= 0xff0;

    /* Table look-ups (TLU) */
    p00 = *(uint32 *) (table0 + p00);
    p01 = *(uint32 *) (table1 + p01);
    p02 = *(uint32 *) (table2 + p02);
    p03 = *(uint32 *) (table3 + p03);

    /* XOR the look-up results into the round-key words */
    z0 ^= p00;
    z3 ^= p01;
    z2 ^= p02;
    z1 ^= p03;
    … (only ¼ round shown)

Interleaved Memory Layout (32-bit entries)

    Offset (byte)   Contents
     0              Table 0 | Table 1 | Table 2 | Table 3
    16              Table 0 | Table 1 | Table 2 | Table 3
    32              Table 0 | Table 1 | Table 2 | Table 3
    48              Table 0 | Table 1 | Table 2 | Table 3

Access j-th table entry of table i via table<i>+16j

Optimizing AES for High-Performance
• Special instruction: Combined Shift-and-Mask
  – On PPC, rlwinm is available as single instruction
  – Extract and mask input bytes, before:

    p00 = (uint32) y0 >> 20;
    p01 = (uint32) y0 >> 12;
    p02 = (uint32) y0 >> 4;
    p03 = (uint32) y0 << 4;
    p00 &= 0xff0;
    p01 &= 0xff0;
    p02 &= 0xff0;
    p03 &= 0xff0;

  – After, with one combined instruction per line:

    p00 = (uint32) y0 >> 20 & 0xff0;
    p01 = (uint32) y0 >> 12 & 0xff0;
    p02 = (uint32) y0 >> 4 & 0xff0;
    p03 = (uint32) y0 << 4 & 0xff0;

  – Saves 160 instructions for separate masking [BS08]
  – AES on PPC has now 540 instructions

Optimizing AES for High-Performance [cont.]
• Special instruction: Scaled Index Loads
  – Before (extract and mask input bytes, then perform TLU):

    p03 = (uint32) y0 << 4;
    …
    p03 &= 0xff0;
    …
    p03 = *(uint32 *) (table3 + p03);

  – After (mask first and do shifted TLU):

    p03 = y0 & 0xff;
    p03 = *(uint32 *) (table3 + (p03 << 4));

  – On x86, shift and load instructions can be combined
  – Saves 80 instructions for separate shifting of top and bottom bytes [BS08]
  – AES on x86 has 640 instructions (not to be combined with previous method!!)

Optimizing AES for High-Performance [cont.]
• Availability of 64-bit Registers
  – On AMD64 and UltraSparcV9, use padded values in 64-bit registers: 0xc66363a5 → 0x0c60063006300a50
  – Padding implicitly includes the shift by 4 bit (aka multiplication by 16)
  – Padding is applied consistently through entire AES
  – Saves 80 instructions (no need to mask top bytes anymore) [BS08]
  – AES now has 640 instructions (again, not to be combined)

Optimizing AES for High-Performance [cont.]
• Other ways to optimize the T-Table AES in software…
  – Special instruction: Combined Load-XOR (x86/AMD64) saves 168 instructions
  – Special instruction: second byte extraction instruction (x86) saves 40 instructions
  – Special instruction: two-byte loads save 4 instructions
  – Byte extraction via loads trades 160-320 integer instructions against 200 loads/stores
  – Round key caching in extra registers saves about 44 instructions
  – Utilize SSE processor extensions instead of plain CPU ALU
  – …
• Results for common ≥32-bit processors (encrypting 4 kByte of data)
  – IBM PPC G4 7410: 459 instructions, 14.57 cycles/byte [BS08]
  – Intel Pentium 4 f12: 414 instructions, 14.13 cycles/byte [BS08]
  – Intel Core 2 Quad Q6600 with SSE3: ≈ 278 instructions*, 9.32 cycles/byte [KS09]
  – Sun UltraSparc III: 505 instructions, 12.06 cycles/byte [BS08]
  – Intel Core i7 920: ≈ 278 instructions*, 6.92 cycles/byte [KS09]
  *estimated since not provided in the original work

Coding Intermezzo: Have you ever programmed AES-128 in Whitespace?
Whitespace is a programming language developed by Edwin Brady and Chris Morris. The Whitespace interpreter ignores any non-whitespace characters. Only spaces, tabs and linefeeds have a meaning.
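The "byte extraction via loads" trade-off listed among the other T-Table optimizations above can be illustrated with a small C sketch. It is only an assumption about how such code might look, not an excerpt from [BS08]: the function names extract_shift_mask and extract_via_loads are hypothetical, and the endianness probe exists solely so that both variants return the bytes in the same order on any host.

```c
#include <stdint.h>
#include <string.h>

/* Variant A: extract the four table indices with shifts and masks
 * (integer instructions only). idx[0] is the most significant byte. */
void extract_shift_mask(uint32_t y, uint8_t idx[4])
{
    idx[0] = (uint8_t)((y >> 24) & 0xff);
    idx[1] = (uint8_t)((y >> 16) & 0xff);
    idx[2] = (uint8_t)((y >> 8) & 0xff);
    idx[3] = (uint8_t)(y & 0xff);
}

/* Variant B: store the word to memory once, then read single bytes back.
 * This trades shift/mask instructions for extra loads and stores. */
void extract_via_loads(uint32_t y, uint8_t idx[4])
{
    uint8_t mem[4];
    const uint32_t probe = 1;
    uint8_t first;
    int little_endian;

    memcpy(mem, &y, sizeof mem);        /* one 32-bit store */
    memcpy(&first, &probe, 1);          /* detect host byte order */
    little_endian = (first == 1);

    for (int i = 0; i < 4; i++)         /* four byte loads */
        idx[i] = little_endian ? mem[3 - i] : mem[i];
}
```

Whether variant B pays off depends on the load/store throughput of the target processor; an optimizing compiler may also turn both variants into the same machine code, so hand-written assembly is usually needed to control the trade-off.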
AES on IBM's Cell Processor
• Hybrid Processor Architecture
  – PPC Processor (main/control unit)
  – 8 Synergistic Processing Elements (SPE) as work horse with 128-bit registers
  – Fast ring interconnect between PPC and SPE units
  – SPEs support efficient byte extraction and manipulation instructions for their 128-bit SIMD registers (e.g., shuffle, select)
  – 16-fold byte-sliced implementation per 128-bit register [BOS09]
  – Single AES encryption in ≈283 instructions (1752+2764 INS per 16 streams): 11.7 cycles/byte

AES on NVIDIA's GTX 295 GPU
• Streaming Processor Architecture
  – 2x240 streaming processor units
  – Memory design including local, shared, texture and (large) global memory
  – Cache for some memories
  – Runs a large number of concurrent, synchronized threads
  – AES implementation using CUDA language (or OpenCL as alternative)
  – 32-bit integer instructions and T-Tables stored in shared memory
  – Benchmarking is not precise due to uncontrolled scheduling: throughput up to 59.6 GB/s reported, 0.17 cycles/byte [BOS09]

AES on Embedded Systems
• TI TMS320-C6201: 16/32-bit DSP @200MHz
  – Parallelized T-Table implementation on four pairs of ALUs
  – Encryption in 228 cycles: 14.25 cycles/byte [WWGP00]
• AVR ATMega: 8-bit RISC microcontroller @8MHz
  – 8-bit AES implementation
  – Speed-up by 1 cycle per TLU by placing S-box in RAM (not Flash)
  – Fast encryption takes 2,153 cycles: 134.56 cycles/byte [BOS09]
• MARC4: 4-bit RISC microcontroller @1MHz
  – 8-bit AES implementation with 2 registers per entry
  – "Fast" encryption takes 23,828 cycles: 1,489 cycles/byte [KP12]

Coding Intermezzo: Programming AES in Colors (aka ColorForth)
[Figure: MixColumn Layer in ColorForth]

AES on Embedded Systems (cont.)
• GreenArrays GA144 Tile processor
  – Asynchronous 144 core device (nodes)
  – Each F18A core has 18-bit ALU
  – 128 words of memory + 20 words stack
  – Up to 4 instructions per word
• AES implementation in ColorForth, spread over 17 of 144 nodes
• Asynchronous processor operation rules out the cycle count metric
• Absolute time per 128-bit encryption: 38 µ[email protected] supply voltage at 0.9µJ

Outline
• Introduction
• Processor Platforms
• Tricks, Tweaks and Codes
• Benchmarks and Results
• Conclusions

AES-128 Platform Ranking on High-Performance
AES performance when encrypting large packets (4 kB)
1) 32-bit: GTX 295 GPU [BOS09]: 0.17 cycles/byte
2) 64-bit (with 128-bit SSE3): Core i7 920 [KS09]: 6.92 cycles/byte
3) 128-bit: IBM Cell SPE [BOS09]: 11.7 cycles/byte
4) 8-bit: AVR ATMega [BOS09]: 134.56 cycles/byte
5) 16-bit: TI C5420 [TI]: 219 cycles/byte
6) 4-bit: MARC4 [KP12]: 1,489 cycles/byte
X) 18-bit: GreenArrays GA144: 38µ[email protected]
Important Remark: Beware of distortions, e.g., due to little interest in platform (4) or backward applied metrics for platform (1)

AES-128 Platform Ranking on Green Cryptography
Energy required to encrypt a single 128-bit AES block
1) 18-bit: GreenArrays GA144: 0.63µ[email protected]
2) 32-bit: GTX 295 GPU (TDP 289W): 0.67µJ* [max [email protected]]
3) 128-bit: IBM Cell (TDP 110W): 0.91µJ* [max [email protected]]
4) 64-bit: Core i7 920 (TDP 130W): 1.35µJ* [max [email protected]]
5) 4-bit: MARC4: 8.58µJ@1MHz
6) 8-bit: AVR ATMega: ≈10 µJ@8MHz**
7) 16-bit: TI C5420 (TDP 266mW): 47 µJ* [max load@20MHz]
*Extrapolated from TDP with all cores running AES encryption at 100% utilization
**Extrapolated using an averaged AVR power model based on given cycle count

AES Benchmarking: More Results
• Source for further (symmetric) crypto benchmarks: eBACS: ECRYPT Benchmarking of Cryptographic Systems: http://bench.cr.yp.to/primitives-stream.html
• Contains latest benchmarks for (currently) 27 different processors running AES-128/192/256

Outline
• Introduction
• Processor Platforms
• Tricks, Tweaks and Codes
• Benchmarks and Results
• Conclusions

Conclusions
• You can find an AES implementation for (nearly) any processing device in (nearly) any language
• What I couldn't find (yet)
  – AES in Brainfuck programming language (sorry, no intermezzo slide!)
  – AES on 2-bit or 256-bit processors
• Processors that natively support the operand sizes of AES (8/32-bit) are still at the top of the list (Fact!)
• Processor extensions (such as AES-NI or SSEx) greatly support AES encryption in software (see Ryad's talk in the afternoon!)

Questions?

Bibliography
• [WWGP00] Thomas Wollinger, M. Wang, Jorge Guajardo Merchan, Christof Paar: How Well Are High-End DSPs Suited for the AES Algorithms? AES Algorithms on the TMS320C6x DSP. The Third Advanced Encryption Standard (AES3) Candidate Conference, New York, USA, April 13-14, 2000.
• [BS08] Daniel J. Bernstein, Peter Schwabe: New AES Software Speed Records. INDOCRYPT 2008: 322-336
• [BOS09] Joppe W. Bos, Dag Arne Osvik, Deian Stefan: Fast Implementations of AES on Various Platforms. IACR Cryptology ePrint Archive 2009: 501 (2009)
• [KS09] Emilia Käsper, Peter Schwabe: Faster and Timing-Attack Resistant AES-GCM. CHES 2009: 1-17
• [KP12] Tino Kaufmann, Axel Poschmann: Enabling Standardized Cryptography on Ultra-Constrained 4-bit Microcontrollers. IEEE RFID 2012: 32-39