Hardware Implementations of the Round-Two SHA-3 Candidates:
Comparison on a Common Ground
Stefan Tillich 1,2, Martin Feldhofer 1, Mario Kirschbaum 1, Thomas Plos 1, Jörn-Marc Schmidt 1, Alexander Szekely 1

1 Graz University of Technology, Institute for Applied Information Processing and Communications, Inffeldgasse 16a, A–8010 Graz, Austria
2 University of Bristol, Computer Science Department, Merchant Venturers Building, Woodland Road, BS8 1UB, Bristol, UK

{Stefan.Tillich, Martin.Feldhofer, Mario.Kirschbaum, Thomas Plos, Joern-Marc.Schmidt, Alexander.Szekely}@iaik.tugraz.at, [email protected]
Abstract
Hash functions are a core part of many protocols that are
in daily use. Following recent results that raised concerns
regarding the security of the current hash standards, the
National Institute of Standards and Technology (NIST)
announced a competition to find a new Secure Hash
Algorithm (SHA), the SHA-3. An important criterion
for the new standard is not only its security, but also
the performance and the costs of its implementations.
This paper evaluates all 14 candidates that are currently
in the second round of the SHA-3 competition. We
provide a common framework that allows a fair
comparison of the hardware implementations of all
SHA-3 candidates. We optimized the hardware modules
towards maximum throughput and give concrete numbers
of our implementations for a 0.18 µm standard-cell
technology.
Keywords: Hash function, NIST, SHA-3, uniform interface, high throughput, adaptive binary-search algorithm

1 Introduction
Security-related applications like e-banking or
e-government have become increasingly important
in the Internet over the last years. For realizing such
applications, hash functions play an important role. Hash
functions are cryptographic primitives that map an input
message of arbitrary length to a fixed-size hash value, the
so-called message digest. This is illustrated in Figure 1.
Figure 1. Basic working principle of a hash function.

In order to provide cryptographic security, a hash function has to fulfill two important properties: pre-image resistance and collision resistance. A hash function is called pre-image resistant if it is computationally infeasible to find an input message to a given message digest. A hash function is said to be collision resistant if it is computationally infeasible to find two different input messages that map to the same message digest1.
These properties enable for example the signing of a
small message digest instead of a large message. An
adversary cannot find another message that matches the
signed digest within reasonable time.
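These properties can be illustrated with any deployed hash function. The sketch below uses SHA-256 from Python's hashlib as a stand-in, since the SHA-3 candidates themselves are not part of the standard library:

```python
import hashlib

# Inputs of arbitrary length map to a digest of fixed size.
short_msg = b"abc"
long_msg = b"x" * 1_000_000

d1 = hashlib.sha256(short_msg).digest()
d2 = hashlib.sha256(long_msg).digest()
assert len(d1) == len(d2) == 32          # 256 bits = 32 bytes

# A one-bit change in the input yields an unrelated digest; finding
# two inputs with the same digest (a collision) is computationally
# infeasible for a secure hash function.
d3 = hashlib.sha256(b"abd").digest()
assert d1 != d3
```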
Since hash functions provide an essential basis for various applications, they are a constant target for attacks. No
collision has been found so far for the current standard of
hash functions that specifies the Secure Hash Algorithms
(SHA). Nevertheless, recent results reduced the security
margin of the widely used SHA-1. This also raised
concerns about the similarly-structured algorithms of the
SHA-2 family. In order to find a secure replacement, the
National Institute of Standards and Technology (NIST)
announced the SHA-3 competition in 2008. Security
experts all over the world were encouraged to submit
proposals for a new hash function. The candidates are
evaluated in three rounds, with each round dismissing
some proposals. From the 64 initial submissions, 51
1 Note that since the input space is larger than the image space such
two inputs naturally exist.
entered the first round. Currently, the second round is
in progress. In this round 14 different proposals are discussed. By the end of this year, five round-two candidates
will be selected for the third round. At the end of 2012, a
successor of SHA-2 is to be announced by NIST
as the new standard, SHA-3.
While cryptographic security is a mandatory requirement for the future SHA-3, the performance and the
costs of its implementations are essential criteria as well.
For hardware implementations, the performance can be
measured in throughput or latency, i.e. the number of
bytes processed in a given time period or the time required
for a single block. A measure for the costs can be the
required silicon area or the power consumption of an
implementation.
A common ground for all these performance and cost
figures is important for a meaningful comparison between
the candidates. Existing publications that deal with a dedicated implementation of a single candidate are missing
this common ground. This is because hardware modules are designed towards different goals (e.g. low area,
maximum throughput), various cell libraries are used,
and some consider message preprocessing operations (e.g.
message padding) within the implementation and others
do not. This is why we decided to implement all 14 current
round-two candidates in a common framework, allowing a
fair comparison of them. Our hardware modules are optimized for maximizing the throughput and are synthesized
for a 0.18 µm standard-cell technology.
The remainder of this paper is organized as follows. Section 2
gives a brief description of the 14 remaining SHA-3
candidates; details on how we implemented them are
provided in Section 3. The practical results are presented
in Section 4 before conclusions are drawn in Section 5.
2 Description of the SHA-3 Candidates
The following 14 candidates have entered the second
round of the SHA-3 competition: BLAKE [1],
Blue Midnight Wish (BMW) [12], CubeHash [3],
ECHO [2], Fugue [13], Grøstl [11], Hamsi [14], JH [17],
Keccak [4], Luffa [8], Shabal [6], SHAvite-3 [5],
SIMD [15], and Skein [10]. Besides cryptographic security, the candidates have to fulfill other formal requirements specified by NIST, such as support for
message-digest sizes of 224, 256, 384, and 512 bits, or
a maximum message length of at least 2^64 − 1 bits. The
candidates meet the variable digest size by either using
a single flexible algorithm or by using several slightly
modified versions of the same algorithm. CubeHash is
an example of a flexible algorithm that can generate
message digests of variable length up to 512 bits. BLAKE,
on the contrary, has one algorithm version for message
digests up to 256 bits and one for message digests from
257 to 512 bits.
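The mandated digest sizes mirror what the SHA-2 family already provides. Using hashlib's SHA-2 variants as stand-ins for the candidates, the size requirement looks as follows:

```python
import hashlib

# NIST requires support for message-digest sizes of 224, 256, 384,
# and 512 bits (demonstrated here with SHA-2, not the candidates).
for algo, bits in [("sha224", 224), ("sha256", 256),
                   ("sha384", 384), ("sha512", 512)]:
    digest = hashlib.new(algo, b"message").digest()
    assert len(digest) * 8 == bits
```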
The design concept of the algorithms strongly varies
from candidate to candidate. ECHO, Fugue, Grøstl, and
SHAvite-3, for example, are based on building blocks
of the Advanced Encryption Standard (AES) like the
matrix multiplication and the substitution box (S-box).
Hamsi and JH apply so-called linear transformations.
The candidates BLAKE, Blue Midnight Wish, CubeHash, Keccak, and Skein are based on rather simple
logical operations like addition/subtraction, XOR, or shift.
SIMD has a more complex structure that uses a Number-Theoretic Transform, which can be realized via Fast Fourier
Transforms (FFTs). Also the size of the algorithm’s
internal state varies. ECHO for example operates on
a large internal state of 2048 bits. The internal state
of BLAKE is smaller and has a size of only 512 bits
for the version with message digests up to 256 bits and
1024 bits for the version with message digests from 257
to 512 bits. Moreover, all candidates use quite similar
message-padding schemes. Padding makes the length of
an arbitrary input message a multiple of the block length
that can be processed by the hash algorithm.
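As a generic illustration (the exact padding rules differ from candidate to candidate), a Merkle-Damgård-style padding routine appends a marker byte, zero-fills, and encodes the message length so that the result is a whole number of blocks:

```python
def pad_message(msg: bytes, block_bytes: int = 64) -> bytes:
    """Generic MD-style padding: 0x80 marker byte, zero fill, and the
    message length in bits as a 64-bit big-endian value at the end."""
    bit_len = len(msg) * 8
    padded = msg + b"\x80"
    # Zero-fill so the 8 length bytes exactly complete the last block.
    while (len(padded) + 8) % block_bytes != 0:
        padded += b"\x00"
    return padded + bit_len.to_bytes(8, "big")

padded = pad_message(b"abc")      # 3-byte message, 64-byte block size
assert len(padded) % 64 == 0      # now a whole number of blocks
assert len(padded) == 64
```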
For more details about the SHA-3 candidates, refer to
the references provided above. An overview of
the hardware implementations of the candidates is given
in the following section.
3 Implementation of the SHA-3 Candidates
For implementing the SHA-3 candidates in hardware, we
have concentrated on the variants that produce a 256-bit
message digest. Some of our hardware modules can
additionally be reconfigured statically or dynamically to
produce digests of a different size. Extra functionality like
salting or keyed hashing modes is not supported. The
hardware modules expect to receive padded messages as
input (i.e. a number of full message blocks). Resulting
benefits are a simplified design and a uniform interface which does not introduce communication overheads.
Since our primary optimization goal is maximizing the
throughput for long messages, padding performed outside
the hardware module has no detrimental effect on the peak
throughput. Apart from padding, the hardware modules
are fully self-contained and require no additional components like for example external memory. Besides the
SHA-3 candidates, we have also implemented a SHA-2
hardware module with a straightforward approach. This
module serves as a point of reference. A description of
the uniform interface and the optimization techniques
applied during implementation of the hardware modules
are provided hereafter.
3.1 Uniform Interface
The uniform interface has been kept generic and is
shared by all our SHA-3 hardware modules. As depicted
in Figure 2, the uniform interface connects the hardware
modules with their corresponding test bench and consists
of: clock signal, reset signal, load signal, finalize signal,
broad data-input port, and broad data-output port. The
data-input port holds a complete message block (n bits)
which is loaded into the hardware module by asserting
the load signal. After loading the last message block, the
test bench asserts the finalize signal. This indicates the
end of a message and the hardware module provides the
complete message digest (m bits) at the output port. The
test bench uses the official Known Answer Test (KAT)
vectors that contain input messages of variable length
with the corresponding message digests to verify the correct functionality of the hardware modules. Appropriate
padding of the messages is done in the test bench.

Figure 2. Uniform interface that connects test bench and SHA-3 hardware module.
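The load/finalize handshake can be mimicked by a behavioural software model. The sketch below is our own illustration (the class and method names are hypothetical, not taken from the actual VHDL/Verilog code), with SHA-256 standing in for a candidate's compression logic:

```python
import hashlib

class HashModule:
    """Behavioural model of the uniform interface: the test bench
    pushes full (pre-padded) n-bit blocks via load(), then asserts
    finalize to read out the m-bit digest. SHA-256 is a stand-in."""

    def __init__(self):
        self._core = hashlib.sha256()

    def load(self, block: bytes):
        # One complete message block per load-signal assertion.
        self._core.update(block)

    def finalize(self) -> bytes:
        # Finalize signal: module outputs the complete digest.
        return self._core.digest()

module = HashModule()
module.load(b"\x00" * 64)       # first padded 512-bit block
module.load(b"\x00" * 64)       # second block
digest = module.finalize()
assert digest == hashlib.sha256(b"\x00" * 128).digest()
```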
3.2 Optimization Techniques
The primary optimization goal of our SHA-3 hardware
modules was maximum peak throughput. In order to
achieve this requirement, various well-known optimization techniques were applied and combined with each
other. The optimization techniques are mainly: parallelization, loop unrolling, pipelining, and alternative
implementation techniques.
Parallelization The most obvious approach to increase
the throughput is the parallel instantiation of hardware
blocks. Our implementation of SIMD, for example,
applies 16 parallel FFT operations and 16 parallel modular
multipliers to compute a Number-Theoretic Transform.
Parallelization trades chip area for speed and was used
in all our algorithm implementations to a certain extent.
However, parallelization also has its limitations. Inherent data dependencies between intermediate results can prevent a
speedup. In the case of our implementation of Grøstl, parallelizing all hardware blocks did
not lead to the fastest implementation due to the presence
of such data dependencies.
Loop unrolling One method to speed up algorithms
with a round-based design (where a so-called round function is
iterated several times) is loop unrolling. This technique is
a form of parallelization that instantiates the round
function multiple times. In that way, the number of
required clock cycles is divided by the number of unrolled
rounds. The achievable speedup depends on several
factors such as additional control complexity and the
degree of parallelization within the unrolled loops. We
used loop unrolling to enhance the throughput of our
implementations of Blue Midnight Wish, CubeHash, and
Skein.
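The cycle-count effect of loop unrolling is simple arithmetic; a sketch with illustrative numbers (not taken from our implementations):

```python
import math

def cycles_per_block(rounds: int, unroll: int) -> int:
    """Clock cycles per block when `unroll` copies of the round
    function are evaluated within a single clock cycle."""
    return math.ceil(rounds / unroll)

# Hypothetical round-based algorithm with 16 rounds per block.
assert cycles_per_block(16, 1) == 16   # baseline: one round per cycle
assert cycles_per_block(16, 2) == 8    # 2x unrolled
assert cycles_per_block(16, 4) == 4    # 4x unrolled
# Note: the combinational path also grows with the unroll factor,
# so the clock frequency drops and the net speedup is smaller than
# the cycle-count reduction alone suggests.
```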
Pipelining The opposite approach to loop unrolling is
pipelining. Here, the round function (or any combinational logic block) is divided into several stages which
are separated by pipeline registers. Every pipeline stage
processes data from different round iterations at the same
time, providing a better utilization of the hardware. Moreover, adding pipeline registers can reduce the critical path
of a design which results in a higher maximum clock
frequency and further an increased throughput. This technique is only applicable if the input data of the pipeline
are independent of each other. Pipelining was used, for
example, for the implementation of BLAKE, Luffa, and
Grøstl.
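Under idealized assumptions (balanced stages, independent message blocks keeping the pipeline full), the throughput gain of pipelining can be sketched numerically; all figures below are made up for illustration:

```python
# Hypothetical timing figures, for illustration only.
block_bits = 512
cycles_per_block = 10                  # one round per cycle, 10 rounds
round_delay_ns = 8.0                   # combinational delay of one round

# Unpipelined: the clock period is limited by the full round function.
f_plain_mhz = 1e3 / round_delay_ns           # 125 MHz
tp_plain = block_bits / cycles_per_block * f_plain_mhz / 1e3   # Gbit/s

# Two balanced pipeline stages halve the critical path, doubling the
# clock. With two independent message blocks interleaved, the pipeline
# stays full, so the amortized cycle count per block is unchanged.
f_pipe_mhz = 1e3 / (round_delay_ns / 2)      # 250 MHz
tp_pipe = block_bits / cycles_per_block * f_pipe_mhz / 1e3

assert abs(tp_plain - 6.4) < 1e-9            # Gbit/s
assert abs(tp_pipe - 2 * tp_plain) < 1e-9    # ideally doubled
```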
Alternative implementation techniques Another optimization technique addresses the actual implementation
of special functional blocks like multi-input adders or
substitution networks. Such blocks often allow various
ways of implementation which result in equivalent functional behavior, but lead to different overall throughput.
For example, the addition of three numbers as required
in the implementation of BLAKE can be realized via
two consecutive standard adders or by using a special
adder-structure (e.g. a carry-save adder) that is significantly faster. Another example is the AES-based
substitution box used in the SHA-3 candidates ECHO,
Fugue, Grøstl, and SHAvite-3. The substitution box can
be represented as look-up table or calculated in hardware
using algebraic properties [16]. Depending on the SHA-3
candidate that has been implemented, either the one or the
other approach of the substitution box resulted in higher
throughput.
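A carry-save stage for the three-operand addition can be sketched at the word level; the 32-bit width and the modular wrap-around are our assumptions for illustration:

```python
MASK = (1 << 32) - 1   # assume 32-bit words with wrap-around

def carry_save_add3(a: int, b: int, c: int) -> int:
    """Three-operand addition: a carry-save stage compresses the three
    inputs into a sum word and a carry word using one level of full
    adders; a single carry-propagate adder then combines the two."""
    s = a ^ b ^ c                               # bitwise sum
    carry = ((a & b) | (a & c) | (b & c)) << 1  # majority, shifted left
    return (s + (carry & MASK)) & MASK          # final modular addition

# The result matches ordinary modular addition of three operands.
for a, b, c in [(1, 2, 3), (0xFFFFFFFF, 0xFFFFFFFF, 1),
                (12345, 67890, 424242)]:
    assert carry_save_add3(a, b, c) == (a + b + c) & MASK
```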
4 Practical Results
The SHA-3 hardware modules have been implemented either in VHDL or in Verilog. Except for the SIMD
module2, all second-round tweaks have been
integrated into the modules. Tweaks are minor patches
which the authors of the SHA-3 candidates were allowed
to apply after the first round of the SHA-3 competition.
As mentioned in Section 3.1, the correct functionality
of the implemented modules has been verified against
the official KAT vectors by means of simulations with
Cadence NC Sim [7].
Our throughput evaluation assumes that the message
blocks are delivered to the hardware module at a speed
that allows it to operate under full utilization. The
optimization target for synthesis was maximum peak
throughput, which corresponds to the throughput for long
messages. Note that for shorter messages, the throughput
might change due to more or less costly initialization
operations and output transformations. The maximum
peak throughput of each hardware module depends on the
following three factors: (1) the processed message block
size, (2) the latency, i.e. the number of required clock
cycles to process one message block of n bits, and (3) the
maximum clock frequency at which the hardware module
can be operated. According to Equation 1, the throughput
of each hardware module is calculated from these three
factors:

  throughput = (block size / latency) · frequency        (1)

2 We don’t expect the integration of the round-two patches to significantly influence the performance of the SIMD implementation.

Synthesis as well as place & route of all implementations
is based on the UMC 0.18 µm standard-cell library from
Faraday [9] and has been conducted with Cadence PKS-Shell (v05.16) and Cadence First Encounter (v05.20),
respectively [7]. Since the primary aim was maximum
throughput, synthesis has been performed with high optimization effort towards maximizing the speed of the
hardware modules. In some cases the throughput of an implementation variant increased only marginally at the cost
of a considerable growth in area (e.g. +4 % throughput
at +65 % area). In such cases we stuck to the marginally
slower but significantly smaller implementation variant.

In a first phase of our evaluation we performed multiple
synthesis runs for all variants. In each run, the target
for the critical-path delay was adapted by means
of an adaptive binary-search algorithm with fine tuning.
Figure 3 shows the state diagram of the binary-search
algorithm. We started with an upper critical-path delay
(upper_delay) of 100 ns, which equals a clock frequency
of 10 MHz. The lower bound (lower_delay) was initially
set to 0 ns. As long as the synthesis runs are successful,
i.e. the run finishes within a certain amount of time3
and the synthesized design reaches the set target delay
under worst-case conditions4, the algorithm sets a new
value for upper_delay and reduces the critical-path delay
(target_delay) according to Equation 2:

  target_delay = (upper_delay + lower_delay) / 2         (2)

If the run was not successful, a new value for
lower_delay is set before executing Equation 2. The
fine-tuning phase is enabled if the difference between
upper_delay and lower_delay is less than 1 ns. In this
case upper_delay, which is the delay of the last successful
run, is reduced by 100 ps and becomes the new value for
target_delay. The search for the lowest critical-path delay
is finished as soon as the first synthesis run during the
fine-tuning phase fails.

3 For the present work, the limit has been set to two hours.
4 A maximal negative slack of 50 ps has been allowed.

Figure 3. State diagram of the adaptive binary-search algorithm with fine tuning.

In a second phase, in order to obtain results close to
the real world, the variants with the best performance
(i.e. highest throughput with a reasonable demand of area)
were subjected to place & route. Global settings for each
place & route run guaranteed fair comparison conditions
for the SHA-3 candidates.

Table 1 summarizes our results after place & route. It
contains the block size of the hash algorithm (block) and
the number of clock cycles required for the processing of
one block (latency). The area is given in terms of thousand
gate equivalents (kGEs)5. The reported clock frequency
is the maximum value under typical conditions6. For the
throughput (TP) column, the peak throughput at the stated
clock frequency is computed according to Equation 1.
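The adaptive binary-search strategy with fine tuning described above can be modelled in software. The sketch below is our reconstruction of the procedure; the predicate `synthesis_ok` stands in for a complete synthesis run:

```python
def find_min_delay(synthesis_ok, fine_step=0.1, coarse_limit=1.0):
    """Adaptive binary search for the lowest achievable critical-path
    delay (in ns), followed by a fine-tuning phase in `fine_step` (ns)
    decrements. `synthesis_ok(delay)` models one full synthesis run."""
    upper, lower = 100.0, 0.0          # 100 ns start = 10 MHz
    # Coarse phase: bisect until the interval is below 1 ns.
    while upper - lower > coarse_limit:
        target = (upper + lower) / 2   # Equation 2
        if synthesis_ok(target):
            upper = target             # success: tighten from above
        else:
            lower = target             # failure: raise the lower bound
    # Fine tuning: step down in 100 ps increments until a run fails.
    target = upper - fine_step
    while synthesis_ok(target):
        upper = target
        target -= fine_step
    return upper                       # delay of the last successful run

# Dummy "synthesis": assume the design closes timing down to 7.25 ns.
best = find_min_delay(lambda d: d >= 7.25)
assert 7.25 <= best < 7.35
```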
Table 1. Results after place & route for the best implementation variants using the UMC 0.18 µm FSA0A_C standard-cell library.

Implementation  Block   Latency   Area    Clk freq.  TP
                [bit]   [cycles]  [kGEs]  [MHz]      [Gbit/s]
BLAKE             512        22    38.8     144.15     3.355
BMW               512         1   160.9      15.12     7.741
CubeHash          256         8    56.6     111.06     3.554
ECHO            1,536        97   128.0     121.97     1.931
Fugue              32         2    48.4     161.19     2.579
Grøstl            512        22    53.7     202.47     4.712
Hamsi              32         1    59.9     119.77     3.833
JH                512        39    51.2     259.54     3.407
Keccak          1,088        25    56.7     267.09    11.624
Luffa             256         9    45.3     336.02     9.558
Shabal            512        50    55.1     216.83     2.220
SHAvite-3         512        37    59.8     159.80     2.211
SIMD              512        36    95.7      58.33     0.830
Skein             256        10    47.7      64.75     1.658
SHA-2             512        66    19.5     211.37     1.640

5 For FSA0A_C, 1 GE equals 9.37 sqmils (i.e. the size of an ND2 cell).
6 Operating temperature 25 °C, core supply voltage 1.8 V.

In terms of throughput, the Keccak implementation outperforms all other modules by a considerable margin. The
Luffa module is second fastest and more compact. Grøstl,
Hamsi, JH, and CubeHash are the next-best implementations and have all similar area requirements. The BMW
module achieves similar throughput, but at considerably
higher hardware cost. Looking at the implementations
of Fugue and BLAKE shows that they are a bit slower,
but also smaller. The Shabal and SHAvite-3 modules
are slower and bigger, achieving similar performance.
A slightly lower throughput is reached by the ECHO
module, which however requires more area. The Skein
module follows with a moderate size. Our implementation
of SIMD is the slowest in the field. The straightforward
SHA-2 implementation has the smallest area and achieves
a throughput which is rather at the low end of the spectrum. Figure 4 shows a graphical representation of area in
relation to highest throughput of all our implementations.
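The throughput column of Table 1 follows directly from Equation 1 and can be reproduced numerically, e.g. for the Keccak and Luffa modules:

```python
def throughput_gbit(block_bits, latency_cycles, freq_mhz):
    """Equation 1: peak throughput in Gbit/s."""
    return block_bits / latency_cycles * freq_mhz * 1e6 / 1e9

# Values taken from Table 1 (after place & route).
keccak = throughput_gbit(1088, 25, 267.09)
luffa = throughput_gbit(256, 9, 336.02)
assert round(keccak, 3) == 11.624
assert abs(luffa - 9.558) < 0.01
```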
Figure 4. Maximum peak throughput vs. area of the high-speed hardware implementations of the SHA-3 candidates (after place & route).

5 Conclusions

In this work we presented unified hardware implementations of all 14 round-two candidates of the SHA-3
competition. Our hardware modules were implemented
and evaluated within a common framework and are aimed
towards maximum throughput. We applied various optimization techniques such as loop unrolling and
pipelining. Each implementation uses the same interface,
the same optimization heuristic during synthesis (adaptive
binary-search algorithm with fine-tuning steps), and the
same standard-cell technology. Utilizing this common
ground allows a fair comparison of the maximum achievable throughput of all candidates.

Acknowledgements.

The work described in this paper has been supported by
the European Commission through the ICT programme
under contract ICT-2007-216676 ECRYPT II. The information in this document is provided as is, and no guarantee or warranty is given or implied that the information is
fit for any particular purpose. The user thereof uses the
information at its sole risk and liability.

References

[1] Jean-Philippe Aumasson, Luca Henzen, Willi Meier, and Raphael C.-W. Phan. SHA-3 proposal BLAKE, version 1.3. Available online at http://131002.net/blake/blake.pdf, 2008.

[2] Ryad Benadjila, Olivier Billet, Henri Gilbert, Gilles Macario-Rat, Thomas Peyrin, Matt Robshaw, and Yannick Seurin. SHA-3 Proposal: ECHO. Available online at http://crypto.rd.francetelecom.com/echo/doc/echo_description_1-5.pdf, February 2009.

[3] Daniel J. Bernstein. CubeHash specification (2.B.1). Available online at http://cubehash.cr.yp.to/submission/spec.pdf, October 2008.

[4] Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche. KECCAK specifications, Version 2 – September 10, 2009. Available online at http://keccak.noekeon.org/Keccak-specifications-2.pdf, September 2009.
[5] Eli Biham and Orr Dunkelman. The SHAvite-3 Hash Function (version from February 1, 2009). Available online at http://www.cs.technion.ac.il/~orrd/SHAvite-3/Spec.01.02.09.pdf, February 2009.

[6] Emmanuel Bresson, Anne Canteaut, Benoît Chevallier-Mames, Christophe Clavier, Thomas Fuhr, Aline Gouget, Thomas Icart, Jean-François Misarsky, María Naya-Plasencia, Pascal Paillier, Thomas Pornin, Jean-René Reinhard, Céline Thuillet, and Marion Videau. Shabal, a Submission to NIST’s Cryptographic Hash Algorithm Competition. Available online at http://www.shabal.com/wp-content/plugins/download-monitor/download.php?id=Shabal.pdf, October 2008.

[7] Cadence Design Systems. The Cadence Design Systems Website. http://www.cadence.com/.

[8] Christophe De Cannière, Hisayoshi Sato, and Dai Watanabe. Hash Function Luffa, Specification Ver. 2.0. Available online at http://www.sdl.hitachi.co.jp/crypto/luffa/Luffa_v2_Specification_20090915.pdf, September 2009.

[9] Faraday Technology Corporation. Faraday FSA0A_C 0.18 µm ASIC Standard Cell Library, 2004. Details available online at http://www.faraday-tech.com.

[10] Niels Ferguson, Stefan Lucks, Bruce Schneier, Doug Whiting, Mihir Bellare, Tadayoshi Kohno, Jon Callas, and Jesse Walker. The Skein Hash Function Family. Available online at http://www.skein-hash.info/sites/default/files/skein1.1.pdf, November 2008.

[11] Praveen Gauravaram, Lars R. Knudsen, Krystian Matusiewicz, Florian Mendel, Christian Rechberger, Martin Schläffer, and Søren S. Thomsen. Grøstl – a SHA-3 candidate. Available online at http://www.groestl.info/Groestl.pdf, October 2008.

[12] Danilo Gligoroski and Vlastimil Klima. Cryptographic Hash Function BLUE MIDNIGHT WISH. Available online at http://people.item.ntnu.no/~danilog/Hash/BMW-SecondRound/Supporting_Documentation/BlueMidnightWishDocumentation.pdf, September 2009.

[13] Shai Halevi, William E. Hall, and Charanjit S. Jutla. The Hash Function “Fugue”. Available online at http://domino.research.ibm.com/comm/research_projects.nsf/pages/fugue.index.html/$FILE/fugue_09.pdf, September 2009.

[14] Özgül Küçük. The Hash Function Hamsi, version from September 14, 2009. Available online at http://www.cosic.esat.kuleuven.be/publications/article-1203.pdf, September 2009.

[15] Gaëtan Leurent, Charles Bouillaguet, and Pierre-Alain Fouque. SIMD Is a Message Digest. Updated version: 2009-01-15, 2009.

[16] Stefan Tillich, Martin Feldhofer, and Johann Großschädl. Area, Delay, and Power Characteristics of Standard-Cell Implementations of the AES S-Box. In Stamatis Vassiliadis, Stephan Wong, and Timo Hämäläinen, editors, 6th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2006, Samos, Greece, July 17-20, 2006, Proceedings, volume 4017 of Lecture Notes in Computer Science, pages 457–466. Springer, July 2006.

[17] Hongjun Wu. SHA-3 proposal JH, version January 15, 2009. Available online at http://icsd.i2r.a-star.edu.sg/staff/hongjun/jh/index.html, 2008.