Hardware Implementations of the Round-Two SHA-3 Candidates: Comparison on a Common Ground

Stefan Tillich¹ ², Martin Feldhofer¹, Mario Kirschbaum¹, Thomas Plos¹, Jörn-Marc Schmidt¹, Alexander Szekely¹

¹ Graz University of Technology, Institute for Applied Information Processing and Communications, Inffeldgasse 16a, A-8010 Graz, Austria
² University of Bristol, Computer Science Department, Merchant Venturers Building, Woodland Road, BS8 1UB, Bristol, UK
{Stefan.Tillich, Martin.Feldhofer, Mario.Kirschbaum, Thomas Plos, Joern-Marc.Schmidt, Alexander.Szekely}@iaik.tugraz.at, [email protected]

Abstract

Hash functions are a core part of many protocols that are in daily use. Following recent results that raised concerns regarding the security of the current hash standards, the National Institute of Standards and Technology (NIST) announced a competition to find a new Secure Hash Algorithm (SHA), the SHA-3. An important criterion for the new standard is not only its security, but also the performance and the costs of its implementations. This paper evaluates all 14 candidates that are currently in the second round of the SHA-3 competition. We provide a common framework that allows a fair comparison of the hardware implementations of all SHA-3 candidates. We optimized the hardware modules towards maximum throughput and give concrete numbers of our implementations for a 0.18 µm standard-cell technology.

Keywords: Hash function, NIST, SHA-3, uniform interface, high throughput, adaptive binary-search algorithm

1 Introduction

Security-related applications like e-banking or e-government have become increasingly important on the Internet over the last years. For realizing such applications, hash functions play an important role. Hash functions are cryptographic primitives that map an input message of arbitrary length to a fixed-size hash value, the so-called message digest. This is illustrated in Figure 1.
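This arbitrary-length-in, fixed-size-out behavior can be seen directly with an existing standardized hash. The sketch below uses SHA-256 from Python's standard hashlib as a stand-in, since the SHA-3 candidates themselves are not part of the standard library:

```python
import hashlib

# Two inputs of very different lengths yield digests of the same fixed size.
d1 = hashlib.sha256(b"abc").digest()
d2 = hashlib.sha256(b"a" * 1_000_000).digest()

print(len(d1), len(d2))  # 32 32 -- the digest size is fixed
print(hashlib.sha256(b"abc").hexdigest())
# ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```

The second print shows the well-known FIPS test vector for the message "abc".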
In order to provide cryptographic security, a hash function has to fulfill two important properties: pre-image resistance and collision resistance. A hash function is called pre-image resistant if it is computationally infeasible to find an input message to a given message digest. A hash function is said to be collision resistant if it is computationally infeasible to find two different input messages that map to the same message digest (note that since the input space is larger than the image space, two such inputs naturally exist). These properties enable, for example, the signing of a small message digest instead of a large message. An adversary cannot find another message that matches the signed digest within reasonable time.

Figure 1. Basic working principle of a hash function: a one-way hash function maps a message to a message digest.

Since hash functions provide an essential base for various applications, they are a constant target for attacks. No collision has been found so far for the current standard of hash functions that specifies the Secure Hash Algorithms (SHA). Nevertheless, recent results reduced the security margin of the widely used SHA-1. This also raised concerns about the similarly structured algorithms of the SHA-2 family. In order to find a secure replacement, the National Institute of Standards and Technology (NIST) announced the SHA-3 competition in 2008. Security experts all over the world were encouraged to submit proposals for a new hash function. The candidates are evaluated in three rounds, with each round dismissing some proposals. Of the 64 initial submissions, 51 entered the first round. Currently, the second round is in progress, in which 14 different proposals are discussed. By the end of this year, five round-two candidates will be selected for the third round. At the end of 2012, a successor of SHA-2 is to be announced by NIST as the new standard, SHA-3.
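The observation that collisions must exist, yet are infeasible to find at full digest size, can be made concrete with a toy experiment: truncating a digest to a few bytes makes a birthday-style collision search succeed almost instantly, whereas the same search on the full 256-bit digest is computationally out of reach. This is an illustrative Python sketch, not part of the paper's evaluation:

```python
import hashlib

def find_truncated_collision(prefix_bytes: int = 3):
    """Find two distinct messages whose SHA-256 digests share the first
    `prefix_bytes` bytes.  For 3 bytes (24 bits) a birthday search needs
    only on the order of 2**12 hashes, while a collision on the full
    32-byte digest would require on the order of 2**128 hashes."""
    seen = {}  # truncated digest -> message that produced it
    i = 0
    while True:
        msg = str(i).encode()
        tag = hashlib.sha256(msg).digest()[:prefix_bytes]
        if tag in seen and seen[tag] != msg:
            return seen[tag], msg
        seen[tag] = msg
        i += 1

m1, m2 = find_truncated_collision()
print(m1 != m2)                                                   # True
print(hashlib.sha256(m1).digest()[:3] == hashlib.sha256(m2).digest()[:3])  # True
```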
While cryptographic security is a mandatory requirement for the future SHA-3, the performance and the costs of its implementations are essential criteria as well. For hardware implementations, the performance can be measured as throughput or latency, i.e. the number of bytes processed in a given time period or the time required for a single block. A measure for the costs can be the required silicon area or the power consumption of an implementation. A common ground for all these performance and cost figures is important for a meaningful comparison between the candidates. Existing publications that deal with a dedicated implementation of a single candidate are missing this common ground: hardware modules are designed towards different goals (e.g. low area, maximum throughput), various cell libraries are used, and some implementations consider message preprocessing operations (e.g. message padding) while others do not. This is why we decided to implement all 14 current round-two candidates in a common framework, allowing a fair comparison between them. Our hardware modules are optimized for maximum throughput and are synthesized for a 0.18 µm standard-cell technology.

The remainder of the paper is organized as follows. Section 2 gives a brief description of the 14 remaining SHA-3 candidates; details on how we implemented them are provided in Section 3. The practical results are presented in Section 4 before conclusions are drawn in Section 5.

2 Description of the SHA-3 Candidates

Fourteen candidates have entered the second round of the SHA-3 competition: BLAKE [1], Blue Midnight Wish (BMW) [12], CubeHash [3], ECHO [2], Fugue [13], Grøstl [11], Hamsi [14], JH [17], Keccak [4], Luffa [8], Shabal [6], SHAvite-3 [5], SIMD [15], and Skein [10].
Besides cryptographic security, the candidates have to fulfill other formal requirements specified by NIST, such as support for message-digest sizes of 224, 256, 384, and 512 bits, or a maximum message length of at least 2⁶⁴ − 1 bits. The candidates meet the variable digest size either by using a single flexible algorithm or by using several slightly modified versions of the same algorithm. CubeHash is an example of a flexible algorithm that allows generating message digests of variable length up to 512 bits. BLAKE, on the contrary, has one algorithm version for message digests up to 256 bits and one for message digests from 257 to 512 bits.

The design concept of the algorithms varies strongly from candidate to candidate. ECHO, Fugue, Grøstl, and SHAvite-3, for example, are based on building blocks of the Advanced Encryption Standard (AES) like the matrix multiplication and the substitution box (S-box). Hamsi and JH apply so-called linear transformations. The candidates BLAKE, Blue Midnight Wish, CubeHash, Keccak, and Skein are based on rather simple logical operations like addition/subtraction, XOR, or shift. SIMD has a more complex structure that uses a Number-Theoretic Transform, which can be realized via Fast Fourier Transforms (FFT). Also the size of the algorithm's internal state varies: ECHO, for example, operates on a large internal state of 2048 bits, while the internal state of BLAKE has a size of only 512 bits for the version with message digests up to 256 bits and 1024 bits for the version with message digests from 257 to 512 bits. Moreover, all candidates use quite similar message-padding schemes. Padding makes the length of an arbitrary input message a multiple of the block length that can be processed by the hash algorithm. For more details about the SHA-3 candidates, please refer to the references provided above. An overview of the hardware implementations of the candidates is given in the following section.
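Padding can be illustrated with the classic Merkle–Damgård scheme used by SHA-2: append a single 1 bit, then zero bits, then the 64-bit message length, so that the result is a whole number of blocks. This is only an illustration; each candidate specifies its own (broadly similar) variant:

```python
def md_pad(message: bytes, block_len: int = 64) -> bytes:
    """Pad `message` to a multiple of `block_len` bytes (SHA-2 style).

    Appends a 0x80 byte (a single 1 bit), then zero bytes, then the
    original bit length as a 64-bit big-endian integer.
    """
    bit_len = len(message) * 8
    padded = message + b"\x80"
    # Fill with zeros, leaving 8 bytes at the end of the final block
    # for the length field.
    padded += b"\x00" * ((-len(padded) - 8) % block_len)
    padded += bit_len.to_bytes(8, "big")
    return padded

print(len(md_pad(b"abc")))       # 64  -- one full block
print(len(md_pad(b"a" * 56)))    # 128 -- the length field forces a second block
```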
3 Implementation of the SHA-3 Candidates

For implementing the SHA-3 candidates in hardware, we have concentrated on the variants that produce a 256-bit message digest. Some of our hardware modules can additionally be reconfigured statically or dynamically to produce digests of a different size. Extra functionality like salting or keyed hashing modes is not supported. The hardware modules expect to receive padded messages as input (i.e. a number of full message blocks). The resulting benefits are a simplified design and a uniform interface which does not introduce communication overhead. Since our primary optimization goal is maximizing the throughput for long messages, padding performed outside the hardware module has no detrimental effect on the peak throughput. Apart from padding, the hardware modules are fully self-contained and require no additional components such as external memory. Besides the SHA-3 candidates, we have also implemented a SHA-2 hardware module with a straightforward approach. This module serves as a point of reference. A description of the uniform interface and of the optimization techniques applied during implementation of the hardware modules is provided hereafter.

3.1 Uniform Interface

The uniform interface has been kept generic and is shared by all our SHA-3 hardware modules. As depicted in Figure 2, the uniform interface connects the hardware modules with their corresponding test bench and consists of: clock signal, reset signal, load signal, finalize signal, broad data-input port, and broad data-output port. The data-input port holds a complete message block (n bits) which is loaded into the hardware module by asserting the load signal. After loading the last message block, the test bench asserts the finalize signal.

Figure 2. Uniform interface that connects test bench and SHA-3 hardware module.
This indicates the end of a message, and the hardware module provides the complete message digest (m bits) at the output port. The test bench uses the official Known Answer Test (KAT) vectors, which contain input messages of variable length with the corresponding message digests, to verify the correct functionality of the hardware modules. Appropriate padding of the messages is done in the test bench.

3.2 Optimization Techniques

The primary optimization goal of our SHA-3 hardware modules was maximum peak throughput. In order to achieve this, various well-known optimization techniques were applied and combined with each other. The optimization techniques are mainly: parallelization, loop unrolling, pipelining, and alternative implementation techniques.

Parallelization. The most obvious approach to increase the throughput is the parallel instantiation of hardware blocks. Our implementation of SIMD, for example, applies 16 parallel FFT operations and 16 parallel modular multipliers to compute a Number-Theoretic Transform. Parallelization trades chip area for speed and was used in all our algorithm implementations to a certain extent. However, parallelization also has its limitations: inherent data dependencies between intermediate results can prevent a speedup through parallelization. In the case of our Grøstl implementation, parallelizing all hardware blocks has not led to the fastest implementation due to the presence of such data dependencies.

Loop unrolling. One method to speed up algorithms with a round-based design (where a so-called round function is iterated several times) is loop unrolling. This technique is a kind of parallelization which duplicates the round function multiple times. In that way, the number of required clock cycles is divided by the number of unrolled rounds. The achievable speedup depends on several factors such as the additional control complexity and the degree of parallelization within the unrolled loops.
We used loop unrolling to enhance the throughput of our implementations of Blue Midnight Wish, CubeHash, and Skein.

Pipelining. The opposite approach to loop unrolling is pipelining. Here, the round function (or any combinational logic block) is divided into several stages which are separated by pipeline registers. Every pipeline stage processes data from a different round iteration at the same time, providing a better utilization of the hardware. Moreover, adding pipeline registers can reduce the critical path of a design, which results in a higher maximum clock frequency and in turn an increased throughput. This technique is only applicable if the input data of the pipeline stages are independent of each other. Pipelining was used, for example, for the implementations of BLAKE, Luffa, and Grøstl.

Alternative implementation techniques. Another optimization technique addresses the actual implementation of special functional blocks like multi-input adders or substitution networks. Such blocks often allow various implementations which result in equivalent functional behavior, but lead to different overall throughput. For example, the addition of three numbers, as required in the implementation of BLAKE, can be realized via two consecutive standard adders or by using a special adder structure (e.g. a carry-save adder) that is significantly faster. Another example is the AES-based substitution box used in the SHA-3 candidates ECHO, Fugue, Grøstl, and SHAvite-3. The substitution box can be represented as a look-up table or calculated in hardware using algebraic properties [16]. Depending on the SHA-3 candidate being implemented, one or the other approach resulted in higher throughput.

4 Practical Results

The SHA-3 hardware modules have been implemented either in VHDL or in Verilog. Except for the SIMD module (where we do not expect the integration of the round-two patches to significantly influence the performance), all second-round tweaks have been integrated in the modules.
Tweaks are minor patches which the authors of the SHA-3 candidates were allowed to apply after the first round of the SHA-3 competition. As mentioned in Section 3.1, the correct functionality of the implemented modules has been verified against the official KAT vectors by means of simulations with Cadence NC Sim [7]. Our throughput evaluation assumes that the message blocks are delivered to the hardware module at a speed that allows it to operate under full utilization. The optimization target for synthesis was maximum peak throughput, which corresponds to the throughput for long messages. Note that for shorter messages, the throughput might change due to more or less costly initialization operations and output transformations.

The maximum peak throughput of each hardware module depends on the following three factors: (1) the processed message block size, (2) the latency, i.e. the number of clock cycles required to process one message block of n bits, and (3) the maximum clock frequency at which the hardware module can be operated. According to Equation 1, the throughput of each hardware module is calculated from these three factors:

    throughput = (block size · frequency) / latency    (1)

Synthesis as well as place & route of all implementations is based on the UMC 0.18 µm standard-cell library from Faraday [9] and has been conducted with Cadence PKSShell (v05.16) and Cadence First Encounter (v05.20), respectively [7]. Since the primary aim was maximum throughput, synthesis has been performed with high optimization effort towards maximizing the speed of the hardware modules. In some cases the throughput of an implementation variant increased only marginally at the cost of a considerable growth in area (e.g. +4 % throughput at +65 % area). In such cases we stuck to the marginally slower but significantly smaller implementation variant.

In a first phase of our evaluation we performed multiple synthesis runs for all variants. In each run, the target for the critical-path delay has been adapted by means of an adaptive binary-search algorithm with fine tuning. Figure 3 shows the state diagram of the binary-search algorithm. We started with an upper critical-path delay (upper_delay) of 100 ns, which equals a clock frequency of 10 MHz. The lower bound (lower_delay) was initially set to 0 ns. As long as the synthesis runs are successful, i.e. the run finished within a certain amount of time (two hours for the present work) and the synthesized design reaches the set target delay under worst-case conditions (a maximum negative slack of 50 ps has been allowed), the algorithm sets a new value for upper_delay and reduces the critical-path delay (target_delay) according to Equation 2:

    target_delay = (upper_delay + lower_delay) / 2    (2)

If the run was not successful, a new value for lower_delay is set before executing Equation 2. The fine-tuning phase is enabled if the difference between upper_delay and lower_delay is less than 1 ns. In this case upper_delay, which is the delay of the last successful run, is reduced by 100 ps and becomes the new value for target_delay. The search for the lowest critical-path delay is finished as soon as the first synthesis run during the fine-tuning phase fails.

Figure 3. State diagram of the adaptive binary-search algorithm with fine tuning.

In a second phase, in order to obtain results close to the real world, the variants with the best performance (i.e. highest throughput with a reasonable demand of area) were subjected to place & route. Global settings for each place & route run guaranteed fair comparison conditions for the SHA-3 candidates. Table 1 summarizes our results after place & route. It contains the block size of the hash algorithm (block), the number of clock cycles required for the processing of one block (latency), and the area in thousand gate equivalents (kGEs; for FSA0A_C, 1 GE equals 9.37 sqmils, i.e. the size of an ND2 cell). The reported clock frequency is the maximum value under typical conditions (operating temperature 25 °C, core supply voltage 1.8 V). For the throughput (TP) column, the peak throughput at the stated clock frequency is computed according to Equation 1.

Table 1. Results after place & route for the best implementation variants using the UMC 0.18 µm FSA0A_C standard-cell library.

Implementation | Block [bit] | Latency [cycles] | Area [kGEs] | Clk freq. [MHz] | TP [Gbit/s]
BLAKE          |         512 |               22 |        38.8 |          144.15 |       3.355
BMW            |         512 |                1 |       160.9 |           15.12 |       7.741
CubeHash       |         256 |                8 |        56.6 |          111.06 |       3.554
ECHO           |       1,536 |               97 |       128.0 |          121.97 |       1.931
Fugue          |          32 |                2 |        48.4 |          161.19 |       2.579
Grøstl         |         512 |               22 |        53.7 |          202.47 |       4.712
Hamsi          |          32 |                1 |        59.9 |          119.77 |       3.833
JH             |         512 |               39 |        51.2 |          259.54 |       3.407
Keccak         |       1,088 |               25 |        56.7 |          267.09 |      11.624
Luffa          |         256 |                9 |        45.3 |          336.02 |       9.558
Shabal         |         512 |               50 |        55.1 |          216.83 |       2.220
SHAvite-3      |         512 |               37 |        59.8 |          159.80 |       2.211
SIMD           |         512 |               36 |        95.7 |           58.33 |       0.830
Skein          |         256 |               10 |        47.7 |           64.75 |       1.658
SHA-2          |         512 |               66 |        19.5 |          211.37 |       1.640

In terms of throughput, the Keccak implementation outperforms all other modules by a considerable margin. The Luffa module is second fastest and more compact. Grøstl, Hamsi, JH, and CubeHash are the next-best implementations and all have similar area requirements. The BMW module achieves similar throughput, but at considerably higher hardware cost.
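Both quantitative building blocks of this evaluation are easy to sanity-check in software. The sketch below reproduces Equation 1 for one row of Table 1 and implements the adaptive binary search with fine tuning from Figure 3; the `synthesize` callback is a hypothetical stand-in for a real synthesis run:

```python
def throughput_gbps(block_bits: int, freq_mhz: float, latency_cycles: int) -> float:
    """Peak throughput according to Equation 1, in Gbit/s."""
    return block_bits * freq_mhz / latency_cycles / 1000.0

def find_min_delay(synthesize, upper_delay: float = 100.0,
                   lower_delay: float = 0.0) -> float:
    """Adaptive binary search with fine tuning (Figure 3), delays in ns.

    `synthesize(target)` stands in for a full synthesis run and must
    return True if the design meets the target critical-path delay.
    """
    # Binary-search phase: halve the interval while it is at least 1 ns wide.
    while upper_delay - lower_delay >= 1.0:
        target = (upper_delay + lower_delay) / 2   # Equation 2
        if synthesize(target):
            upper_delay = target                   # success: lower the upper bound
        else:
            lower_delay = target                   # failure: raise the lower bound
    # Fine-tuning phase: step down from the last successful delay in 100 ps steps
    # until the first failing run.
    while synthesize(upper_delay - 0.1):
        upper_delay -= 0.1
    return upper_delay

# Keccak row of Table 1: 1088-bit blocks, 25 cycles, 267.09 MHz.
print(round(throughput_gbps(1088, 267.09, 25), 3))    # 11.624

# Hypothetical design whose true minimum critical-path delay is 6.23 ns.
print(find_min_delay(lambda target: target >= 6.23))  # 6.25
```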
Looking at the implementations of Fugue and BLAKE shows that they are a bit slower, but also smaller. The Shabal and SHAvite-3 modules are slower and bigger, achieving similar performance. A slightly lower throughput is reached by the ECHO module, which however requires more area. The Skein module follows with a moderate size. Our implementation of SIMD is the slowest in the field. The straightforward SHA-2 implementation has the smallest area and achieves a throughput which is rather at the low end of the spectrum. Figure 4 shows a graphical representation of area in relation to highest throughput for all our implementations.

Figure 4. Maximum peak throughput vs. area of the high-speed hardware implementations of the SHA-3 candidates (after place & route).

5 Conclusions

In this work we presented unified hardware implementations of all 14 round-two candidates of the SHA-3 competition. Our hardware modules were implemented and evaluated within a common framework and are aimed towards maximum throughput. We applied various optimization techniques like, for example, loop unrolling or pipelining. Each implementation uses the same interface, the same optimization heuristic during synthesis (adaptive binary-search algorithm with fine-tuning steps), and the same standard-cell technology. Utilizing this common ground allows a fair comparison of the maximum achievable throughput of all candidates.

Acknowledgements. The work described in this paper has been supported by the European Commission through the ICT programme under contract ICT-2007-216676 ECRYPT II. The information in this document is provided as is, and no guarantee or warranty is given or implied that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability.

References

[1] Jean-Philippe Aumasson, Luca Henzen, Willi Meier, and Raphael C.-W. Phan. SHA-3 proposal BLAKE, version 1.3. Available online at http://131002.net/blake/blake.pdf, 2008.
[2] Ryad Benadjila, Olivier Billet, Henri Gilbert, Gilles Macario-Rat, Thomas Peyrin, Matt Robshaw, and Yannick Seurin. SHA-3 Proposal: ECHO. Available online at http://crypto.rd.francetelecom.com/echo/doc/echo_description_1-5.pdf, February 2009.
[3] Daniel J. Bernstein. CubeHash specification (2.B.1). Available online at http://cubehash.cr.yp.to/submission/spec.pdf, October 2008.
[4] Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche. KECCAK specifications, Version 2. Available online at http://keccak.noekeon.org/Keccak-specifications-2.pdf, September 2009.
[5] Eli Biham and Orr Dunkelman. The SHAvite-3 Hash Function (version from February 1, 2009). Available online at http://www.cs.technion.ac.il/~orrd/SHAvite-3/Spec.01.02.09.pdf, February 2009.
[6] Emmanuel Bresson, Anne Canteaut, Benoît Chevallier-Mames, Christophe Clavier, Thomas Fuhr, Aline Gouget, Thomas Icart, Jean-François Misarsky, María Naya-Plasencia, Pascal Paillier, Thomas Pornin, Jean-René Reinhard, Céline Thuillet, and Marion Videau. Shabal, a Submission to NIST's Cryptographic Hash Algorithm Competition. Available online at http://www.shabal.com/wp-content/plugins/download-monitor/download.php?id=Shabal.pdf, October 2008.
[7] Cadence Design Systems. The Cadence Design Systems Website. http://www.cadence.com/.
[8] Christophe De Cannière, Hisayoshi Sato, and Dai Watanabe. Hash Function Luffa, Specification Ver. 2.0. Available online at http://www.sdl.hitachi.co.jp/crypto/luffa/Luffa_v2_Specification_20090915.pdf, September 2009.
[9] Faraday Technology Corporation. Faraday FSA0A_C 0.18 µm ASIC Standard Cell Library, 2004. Details available online at http://www.faraday-tech.com.
[10] Niels Ferguson, Stefan Lucks, Bruce Schneier, Doug Whiting, Mihir Bellare, Tadayoshi Kohno, Jon Callas, and Jesse Walker. The Skein Hash Function Family. Available online at http://www.skein-hash.info/sites/default/files/skein1.1.pdf, November 2008.
[11] Praveen Gauravaram, Lars R. Knudsen, Krystian Matusiewicz, Florian Mendel, Christian Rechberger, Martin Schläffer, and Søren S. Thomsen. Grøstl – a SHA-3 candidate. Available online at http://www.groestl.info/Groestl.pdf, October 2008.
[12] Danilo Gligoroski and Vlastimil Klima. Cryptographic Hash Function BLUE MIDNIGHT WISH. Available online at http://people.item.ntnu.no/~danilog/Hash/BMW-SecondRound/Supporting_Documentation/BlueMidnightWishDocumentation.pdf, September 2009.
[13] Shai Halevi, William E. Hall, and Charanjit S. Jutla. The Hash Function "Fugue". Available online at http://domino.research.ibm.com/comm/research_projects.nsf/pages/fugue.index.html/$FILE/fugue_09.pdf, September 2009.
[14] Özgül Küçük. The Hash Function Hamsi, version from September 14, 2009. Available online at http://www.cosic.esat.kuleuven.be/publications/article-1203.pdf, September 2009.
[15] Gaëtan Leurent, Charles Bouillaguet, and Pierre-Alain Fouque. SIMD Is a Message Digest. Updated version: 2009-01-15, 2009.
[16] Stefan Tillich, Martin Feldhofer, and Johann Großschädl. Area, Delay, and Power Characteristics of Standard-Cell Implementations of the AES S-Box. In Stamatis Vassiliadis, Stephan Wong, and Timo Hämäläinen, editors, 6th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2006, Samos, Greece, July 17-20, 2006, Proceedings, volume 4017 of Lecture Notes in Computer Science, pages 457–466. Springer, July 2006.
[17] Hongjun Wu. SHA-3 proposal JH, version from January 15, 2009. Available online at http://icsd.i2r.a-star.edu.sg/staff/hongjun/jh/index.html, 2008.