FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES
FRANCHETTI Franz (AUT), KALTENBERGER Florian (AUT), UEBERHUBER Christoph W. (AUT)
Abstract. FFTs are among the most important algorithms in science and engineering. To utilize current hardware architectures, special techniques are needed. This paper introduces newly developed radix-2^n FFT kernels that efficiently take advantage of fused multiply-add (FMA) instructions. On a processor that provides FMA instructions, the new radix-2^n kernels reduce the number of required twiddle factors from 2^n − 1 to n compared with conventional radix-2^n FFT kernels. This reduction is accomplished by a "hidden" computation of the twiddle factors: instead of accessing a twiddle factor array, the additional arithmetic operations are absorbed into FMA instructions. The new FFT kernels are fully compatible with conventional FFT kernels and can therefore easily be incorporated into existing FFT software, such as Fftw or Spiral.
1 Introduction
In the computation of an FFT, memory accesses dominate the runtime. There are numerous FFT algorithms with identical arithmetic complexity whose memory access patterns differ significantly; accordingly, their runtime behavior differs tremendously.
The state of the art in FFT software is represented by codes like Spiral (Moura et al. [13]) and Fftw (Frigo and Johnson [4]), which automatically adapt themselves to given memory and hardware features in the most efficient way (Fabianek et al. [2]).
In this paper the memory access bottleneck is attacked by utilizing the parallelism inherent in FMA instructions to reduce the number of memory accesses. This efficiency enhancement is accomplished by a "hidden" computation of the twiddle factors: instead of accessing a twiddle factor array, the additional arithmetic operations are folded into FMA instructions, thereby also increasing the utilization of FMA instructions.
The goal of earlier work (Linzer and Feig [10, 11], Goedecker [6], Karner et al. [9]) was to reduce the arithmetic complexity of FFT kernels; no attempt was made to reduce the number of memory accesses. The newly developed FFT kernels introduced in this paper require only n twiddle factors instead of the 2^n − 1 twiddle factors required by conventional radix-2^n kernels.
In contrast to existing FMA optimized FFT kernels, the newly developed FFT kernels are fully compatible with conventional FFT kernels, since there is no need to scale the twiddle factors, and can therefore easily be incorporated into existing FFT software. This feature was tested by installing the new kernels in Fftw, currently the most sophisticated and fastest FFT package (Karner et al. [8]).
2 Fused Multiply-Add Instructions
Fused multiply-add (FMA) instructions perform the ternary operation ±a ± b × c in
the same amount of time as needed for a single floating-point addition or multiplication
operation.
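For illustration, the ternary operation can be expressed in C via the C99 fma() library function, which computes a*b + c with a single rounding. The wrapper names below are ours, introduced only to spell out the four sign variants; on FMA hardware a compiler maps such expressions onto single instructions (e.g., fmadd, fmsub, fnmadd, fnmsub on PowerPC-style FPUs).

    #include <math.h>   /* C99 fma(a, b, c): a*b + c with one rounding */

    /* The four sign variants of +-a +- b*c from the definition above. */
    static inline double madd (double a, double b, double c) { return fma( b, c, a); } /*  a + b*c */
    static inline double msub (double a, double b, double c) { return fma(-b, c, a); } /*  a - b*c */
    static inline double nmadd(double a, double b, double c) { return fma( b, c, -a); } /* -a + b*c */
    static inline double nmsub(double a, double b, double c) { return -fma(b, c, a); }  /* -a - b*c */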
Let πfma denote the FMA arithmetic complexity, i. e., the number of floating-point
operations (additions, multiplications, and multiply-add operations) needed to perform
a specific numerical task on a processor equipped with FMA instructions.
It is useful to introduce the term FMA utilization to characterize the degree to which
an algorithm takes advantage of multiply-add instructions. FMA utilization is given by
$$F := \frac{\pi_R - \pi_{\text{fma}}}{\pi_{\text{fma}}} \cdot 100\ [\%], \qquad (1)$$

where πR denotes the real arithmetic complexity. F is the percentage of floating-point operations performed by multiply-add instructions. For example, the conventional radix-2 kernel of Table 1 performs πR = 10 real floating-point operations in πfma = 8 instructions (two of them FMAs), giving F = 25 %.
Some compilers are able to extract FMA operations from given source code and produce FMA optimized code. However, compilers cannot resolve all performance issues, because not all of the necessary information is available at compile time. Therefore, the underlying algorithm has to be dealt with on a more abstract level.
The main result of this section is that certain matrix computations occurring in FFT algorithms can be optimized with respect to FMA utilization. The rules introduced in the following can be incorporated into self-adapting numerical digital signal processing software like Spiral. Algorithms for matrix-vector multiplication A b, where A has a special structure, can be carried out using FMA instructions only (F = 100 %), i. e., such computations are FMA optimized. The following lemmas characterize some special FMA optimized computations (Franchetti et al. [3]).
Lemma 2.1 For the complex matrices
$$A \in \left\{ \begin{pmatrix} \pm 1 & \pm c \\ 0 & 1 \end{pmatrix},\ \begin{pmatrix} 0 & 1 \\ \pm 1 & \pm c \end{pmatrix},\ \begin{pmatrix} \pm c & \pm 1 \\ 0 & 1 \end{pmatrix},\ \begin{pmatrix} 0 & 1 \\ \pm c & \pm 1 \end{pmatrix} \right\}, \qquad (2)$$
the calculation of A b for b ∈ C^2 requires only two FMAs (πfma = 2, F = 100 %) if c is a trivial twiddle factor (i. e., c ∈ R or c ∈ iR). Otherwise the computation of A b requires four FMAs (πfma = 4, F = 100 %).
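As a concrete sketch of Lemma 2.1, the following C routine applies A = [1 c; 0 1] with a trivial (here: real) twiddle factor c to b ∈ C^2 in exactly two FMA instructions; for non-trivial c both the real and imaginary parts of c enter, doubling the count to four. The type and function names are ours, for illustration only.

    #include <math.h>

    typedef struct { double re, im; } cplx;

    /* y = A*b for A = [1 c; 0 1], c real (trivial twiddle factor):
       y0 = b0 + c*b1 costs one FMA per component, y1 = b1 is free. */
    static void apply_A(double c, const cplx b[2], cplx y[2])
    {
        y[0].re = fma(c, b[1].re, b[0].re);  /* Re(b0) + c*Re(b1) */
        y[0].im = fma(c, b[1].im, b[0].im);  /* Im(b0) + c*Im(b1) */
        y[1]    = b[1];                      /* second row (0 1): copy */
    }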
Corollary 2.2 If A, A1, A2, ..., Ar can be FMA optimized, then matrices of the form $\bigoplus_{i=1}^{r} A_i$, $I_m \otimes A \otimes I_n$, and the matrix conjugate $L_m^{mn} A L_n^{mn}$, where $L_n^{mn}$ and $L_m^{mn} = (L_n^{mn})^{-1}$ are stride permutation matrices (Van Loan [7]), can also be FMA optimized.
Multiplications by non-trivial twiddle factors can also be FMA optimized.

Lemma 2.3 A complex multiplication by ω ∈ C with |ω| = 1 requires only three FMA instructions (πfma = 3, F = 100 %).
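The following C sketch illustrates the three-instruction count of Lemma 2.3 for one concrete unit-modulus factor, c1 = (1 − i)/√2, which appears in the matrix T1 of the radix-8 kernel below: one shared product and two FMAs, all executable on the FMA unit. This is a specialization for this particular ω; the scheme for a general unit-modulus ω is given in Franchetti et al. [3]. Names are ours, for illustration only.

    #include <math.h>

    typedef struct { double re, im; } cplx;

    /* y = c1*x with c1 = (1 - i)/sqrt(2), i.e.
       Re(y) = (Re x + Im x)/sqrt(2), Im(y) = (Im x - Re x)/sqrt(2). */
    static cplx scale_by_c1(cplx x)
    {
        const double k = 0.7071067811865476;  /* 1/sqrt(2) */
        double t = k * x.im;                  /* shared product */
        cplx y;
        y.re = fma( k, x.re, t);              /* k*Re(x) + k*Im(x) */
        y.im = fma(-k, x.re, t);              /* k*Im(x) - k*Re(x) */
        return y;
    }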
3 FMA Optimized FFT Kernels
The discrete Fourier transform y ∈ C^N of an input vector x ∈ C^N is defined by the matrix-vector product y := F_N x, where
$$F_N = \bigl[\omega_N^{jk}\bigr]_{j,k=0,1,\ldots,N-1}, \qquad \omega_N = e^{2\pi i/N},$$
is the Fourier transform matrix.
The basis of the fast Fourier transform (FFT) is the Cooley-Tukey radix-p splitting. For N = pq ≥ 2 it holds that
$$F_N = (F_p \otimes I_q)\, T_q^{pq}\, (I_p \otimes F_q)\, L_p^{pq}, \qquad (3)$$
where
$$T_q^{pq} = \bigoplus_{i=0}^{p-1} \bigoplus_{j=0}^{q-1} \omega_{pq}^{ij} = \bigoplus_{i=0}^{p-1} \operatorname{diag}\bigl(1, \omega_{pq}^{i}, \ldots, \omega_{pq}^{(q-1)i}\bigr) \qquad (4)$$
is the twiddle factor matrix and L_p^{pq} is the stride-by-p permutation matrix (Van Loan [7]).
In Fftw the computation of an FFT is performed by so-called codelets, i. e., small
pieces of highly optimized machine generated code.
There are two types of codelets: twiddle codelets are used to perform multiplications by twiddle factors and to compute in-place FFTs, whereas no-twiddle codelets are used to compute out-of-place FFTs. (i) Twiddle codelets of size p compute
$$y := (F_p \otimes I_q)\, T_q^{pq}\, y, \qquad (5)$$
and (ii) no-twiddle codelets of size q perform an out-of-place computation of y := F_q x.
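To make the twiddle codelet concrete, here is a naive C sketch of the radix-2 case of (5), written out conventionally: every butterfly loads its twiddle factor ω^j from a precomputed array, which is exactly the memory traffic the kernels of this paper reduce. The names are ours, not Fftw's actual codelet interface.

    typedef struct { double re, im; } cplx;

    /* Conventional radix-2 twiddle codelet, y := (F2 (x) Iq) T_q^{2q} y,
       with w[j] = omega^j loaded from a precomputed twiddle array. */
    static void twiddle2_conventional(cplx *y, const cplx *w, int q)
    {
        for (int j = 0; j < q; ++j) {
            cplx a = y[j], b = y[j + q], t;
            t.re = w[j].re * b.re - w[j].im * b.im;   /* t = w[j]*b */
            t.im = w[j].re * b.im + w[j].im * b.re;
            y[j].re     = a.re + t.re;  y[j].im     = a.im + t.im;
            y[j + q].re = a.re - t.re;  y[j + q].im = a.im - t.im;
        }
    }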
If N = N1 N2 · · · Nn, then the splitting (3) can be applied recursively and the fast Fourier transform algorithm (with O(N ln(N)) complexity) is obtained. Each factorization of N leads to an FFT program with a different memory access pattern and hence a different runtime behavior. Fftw tries to find the optimum algorithm for a given machine by applying, for instance, dynamic programming (Fabianek et al. [2]).
A no-twiddle codelet of size q can be FMA optimized with existing compiler techniques [5, 9]. A twiddle codelet, on the other hand, cannot be optimized at compile time, since the twiddle factors can change from call to call. However, with some formula manipulation and the tools developed in the previous section it is possible to achieve full FMA utilization even in twiddle codelets. A twiddle codelet can be written as
$$(F_p \otimes I_q)\, T_q^{pq} = L_p^{pq}\, (I_q \otimes F_p)\, \underbrace{L_q^{pq}\, L_p^{pq}}_{=\,I_{pq}}\, T_p^{pq}\, L_q^{pq} = L_p^{pq} \Bigl( \bigoplus_{j=0}^{q-1} F_p\, \operatorname{diag}\bigl(1, \omega_{pq}^{j}, \ldots, \omega_{pq}^{(p-1)j}\bigr) \Bigr) L_q^{pq}.$$
A factor of the form
$$F_p\, \operatorname{diag}\bigl(1, \omega^{j}, \ldots, \omega^{(p-1)j}\bigr) \qquad (6)$$
is called an elementary radix-p FFT kernel. In the following, the kernel (6) will be FMA optimized; then, by applying Corollary 2.2, the twiddle codelet (5) can be FMA optimized as well.
Usually the twiddle factors are pre-computed, stored in an array, and accessed during the FFT computation. In the following, a method is presented to reduce the number of memory accesses by a hidden on-line computation of the required twiddle factors. First, the method is illustrated by means of a radix-8 kernel computation.
FMA Optimized Radix-8 Kernel. The radix-8 kernel computation is
$$y := F_8\, \operatorname{diag}(1, \omega, \omega^2, \omega^3, \omega^4, \omega^5, \omega^6, \omega^7)\, x. \qquad (7)$$
Using a radix-2 decimation-in-frequency (DIF) factorization, (7) can be written as
$$y := R_8\, (I_4 \otimes F_2)\, D_3\, T_2\, (I_2 \otimes F_2 \otimes I_2)\, D_2\, T_1\, (F_2 \otimes I_4)\, D_1\, x, \qquad (8)$$
$$\begin{aligned}
D_1 &= \operatorname{diag}(1, 1, 1, 1, \omega^4, \omega^4, \omega^4, \omega^4), \\
D_2 &= \operatorname{diag}(1, 1, \omega^2, \omega^2, 1, 1, \omega^2, \omega^2), \\
D_3 &= \operatorname{diag}(1, \omega, 1, \omega, 1, \omega, 1, \omega), \\
T_1 &= \operatorname{diag}(1, 1, 1, 1, 1, c_1, -i, c_2), \\
T_2 &= \operatorname{diag}(1, 1, 1, -i, 1, 1, 1, -i), \\
c_1 &= (1 - i)/\sqrt{2}, \quad c_2 = -(1 + i)/\sqrt{2}.
\end{aligned} \qquad (9)$$
R_8 is the bit-reversal permutation matrix (Van Loan [7]).
Arithmetic Complexity. The heart of the FMA optimized algorithm is the factorization of the initial twiddle factor scaling matrix diag(1, ω, ω^2, ω^3, ω^4, ω^5, ω^6, ω^7) into D_1 D_2 D_3. D_1, D_2, and D_3 are distributed over the three stages of the radix-8 kernel computation (8). To fully utilize FMA instructions for computing the first stage, T_1 (F_2 ⊗ I_4) D_1 x, the following factorization is applied (Meyer and Schwarz [12]):
$$(F_2 \otimes I_4)\, D_1 = (F_2 \otimes I_4)\bigl(\operatorname{diag}(1, \omega^4) \otimes I_4\bigr) = \left( \begin{pmatrix} 2 & -1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 1 & -\omega^4 \end{pmatrix} \right) \otimes I_4. \qquad (10)$$
Now, by Lemma 2.1 and Corollary 2.2, the matrix-vector product (F_2 ⊗ I_4) D_1 x requires 24 FMA operations (πfma = 24, F = 100 %).
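The following C sketch (names ours, not from the paper) carries out this first stage via factorization (10). For each of the four strided pairs, it first applies [1 0; 1 −ω^4] and then [2 −1; 0 1], so the scaling by ω^4 is computed "hidden" inside the FMAs and ω^4 is the only constant loaded: six real FMAs per pair, 24 in total, matching the count above.

    #include <math.h>

    typedef struct { double re, im; } cplx;

    /* First radix-8 stage, y := (F2 (x) I4) D1 x, via factorization (10):
       u      = x[k] - w4*x[k+4]   (4 real FMAs, w4 = omega^4 non-trivial)
       y[k]   = 2*x[k] - u         (2 real FMAs) == x[k] + w4*x[k+4]
       y[k+4] = u                                == x[k] - w4*x[k+4]   */
    static void stage1_radix8(cplx w4, const cplx x[8], cplx y[8])
    {
        for (int k = 0; k < 4; ++k) {
            cplx a = x[k], b = x[k + 4], u;
            u.re = fma(-w4.re, b.re, fma( w4.im, b.im, a.re));
            u.im = fma(-w4.re, b.im, fma(-w4.im, b.re, a.im));
            y[k].re = fma(2.0, a.re, -u.re);
            y[k].im = fma(2.0, a.im, -u.im);
            y[k + 4] = u;
        }
    }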
T_1 contains the two non-trivial twiddle factors c_1 and c_2. Therefore, by Lemma 2.3, scaling by T_1 can be carried out with 6 FMA instructions (πfma = 6, F = 100 %). Factorizations similar to (10) for stages two and three lead to an arithmetic complexity of 78 (= 3 × 24 + 6) FMA instructions (πfma = 78, F = 100 %) for a radix-8 FFT kernel. From (9) it can be seen that only three twiddle factors (ω^4, ω^2, and ω) are required.
FMA Optimized Radix-2^n FFT Kernels. Using a radix-2 DIF factorization, the radix-2^n kernel can be written as
$$F_{2^n}\, \operatorname{diag}\bigl(1, \omega, \ldots, \omega^{2^n-1}\bigr) = R_{2^n} \prod_{i=n}^{1} \bigl(I_{2^{i-1}} \otimes T_{2^{n-i}}^{2^{n-i+1}}\bigr) \bigl(I_{2^{i-1}} \otimes F_2 \otimes I_{2^{n-i}}\bigr)\, D_i, \qquad (11)$$
$$T_{2^{n-i}}^{2^{n-i+1}} = I_{2^{n-i}} \oplus \operatorname{diag}\bigl(1, \omega_{2^{n-i+1}}, \ldots, \omega_{2^{n-i+1}}^{2^{n-i}-1}\bigr), \qquad (12)$$
$$D_i = I_{2^{i-1}} \otimes \operatorname{diag}\bigl(1, \omega^{2^{n-i}}\bigr) \otimes I_{2^{n-i}}. \qquad (13)$$
R_{2^n} is the bit-reversal permutation matrix (Van Loan [7]). (13) shows that only n twiddle factors (ω^{2^{n-1}}, ..., ω) are required in the radix-2^n kernel. Applying factorization (10) and Lemma 2.3 leads to the arithmetic complexity
$$\pi_{\text{fma}} = 4.5 \cdot n \cdot 2^n - 9 \cdot 2^{n-1} + 6. \qquad (14)$$
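As a quick sanity check, formula (14) reproduces the πfma column of Table 1 for the FMA optimized kernels. The short C program below (ours, for verification only) evaluates it for n = 1, ..., 5:

    #include <stdio.h>

    /* Evaluates pi_fma(n) = 4.5*n*2^n - 9*2^(n-1) + 6 from (14). */
    int main(void)
    {
        for (int n = 1; n <= 5; ++n) {
            long p = 1L << n;                            /* 2^n */
            double pi = 4.5 * n * p - 9.0 * (p / 2) + 6.0;
            printf("radix-%-2ld  pi_fma = %.0f\n", p, pi); /* 6, 24, 78, 222, 582 */
        }
        return 0;
    }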
             conventional kernels               FMA optimized kernels
  Radix    TF   Loads   πfma   FMA util.     TF   Loads   πfma   FMA util.
    2       1      6      8      25 %         1      6      6     100 %
    4       3     14     28      21 %         2     12     24     100 %
    8       7     30     84      17 %         3     22     78     100 %
   16      15     62    220      17 %         4     40    222     100 %
   32      31    126    548      17 %         5     74    582     100 %

Table 1: Operation counts (number TF of required twiddle factors, number of load operations, arithmetic complexity πfma, and FMA utilization F) for radix-2, -4, -8, -16, and -32 kernels.
4 Runtime Experiments
Numerical experiments were performed on one processor of an SGI PowerChallenge XL R10000. In a first series of experiments, FFT algorithms using the newly developed FMA optimized radix-8 FFT kernel were compared with FFT algorithms using a conventional radix-8 kernel in a triple-loop Cooley-Tukey framework. In the experiments, cycles and L2 data cache misses were counted using the on-chip performance monitor counter of the MIPS R10000 processor (Auer et al. [1]). Runtimes have been derived from the cycle counts.
The superscalar architecture of the R10000 processor is able to execute up to four instructions per cycle. Since the R10000 can execute one load or store and one multiply-add instruction per cycle, the compiler can generate an instruction schedule that overlaps memory references with floating-point operations. This is the reason why the lower instruction count of the FMA optimized radix-8 kernel does not result in a significant speed-up for lengths N ≤ 2^9.
The speed-up in execution time compared with a conventional radix-8 kernel can be seen in Fig. 1. For N ≤ 2^15, the speed-up is due to the lower number of primary data cache misses. For N > 2^15, when secondary data cache misses make memory accesses very costly, the benefits of the new radix-8 kernel, i. e., fewer memory accesses leading to fewer secondary data cache misses, result in a substantial speed-up in execution time.
[Figure 1 appears here.]

Figure 1: Speed-up in minimum execution time and normalized number M_2/(N log_2 N) of L2 cache misses of the FMA optimized radix-8 FFT algorithm compared to a conventional radix-8 FFT algorithm on one processor of an SGI Power Challenge XL.
An advantage of the FFT kernels presented in this paper is their compatibility with conventional kernels, which makes it easy to incorporate them into existing FFT software. This feature has been demonstrated by incorporating the multiply-add optimized radix-8 kernel into Fftw (Karner et al. [8]). The installation of the FMA optimized radix-8 kernel in Fftw did not cause any difficulty. For transform lengths N ≥ 2^20, a speed-up of more than 30 % was achieved.
5 Conclusion
In summary, the advantages of the multiply-add optimized FFT kernels presented in this paper are their low number of required load operations, their high efficiency on performance oriented computer systems, and their striking simplicity. Current work is the integration of the method presented above into Fftw's kernel generator genfft (Frigo, Kral [5]) to generate FMA optimized radix-16, -32, and -64 kernels.
6 References
1. Auer, M., et al. Performance Evaluation of FFT Algorithms Using Performance
Counters. Tech. Rep. Aurora TR1998-20, Vienna University of Technology, 1998.
2. Fabianek, C., et al. Survey of Self-adapting FFT Software. Tech. Rep. Aurora
TR2002-01, Vienna University of Technology, 2002.
3. Franchetti, F., et al. FMA Optimized FFT Codelets. Tech. Rep. Aurora TR1998-28, Vienna University of Technology, 2002.
4. Frigo, M. and Johnson, S. G. FFTW: An Adaptive Software Architecture for the
FFT. In: Proc. ICASSP, pp. 1381–1384, 1998.
5. Frigo, M. and Kral, S. The Advanced FFT Program Generator GENFFT. Tech.
Rep. Aurora TR2001-03, Vienna University of Technology, 2001.
6. Goedecker, S. Fast Radix 2, 3, 4, and 5 Kernels for Fast Fourier Transformations
on Computers with Overlapping Multiply-Add Instructions. SIAM J. Sci. Comput.,
18, pp. 1605–1611, 1997.
7. Golub, G. H. and Van Loan, C. F. Matrix Computations. Johns Hopkins University
Press, Baltimore, 2nd edn., 1989.
8. Karner, H., et al. Accelerating FFTW by Multiply-Add Optimization. Tech. Rep.
Aurora TR1999-13, Vienna University of Technology, 1999.
9. Karner, H., et al. Multiply-Add Optimized FFT Kernels. Math. Models and Methods
in Appl. Sci., 11, pp. 105–117, 2001.
10. Linzer, E. N. and Feig, E. Implementation of Efficient FFT Algorithms on Fused
Multiply-Add Architectures. IEEE Trans. Signal Processing, 41, pp. 93–107, 1993.
11. Linzer, E. N. and Feig, E. Modified FFTs for Fused Multiply-Add Architectures.
Math. Comp., 60, pp. 347–361, 1993.
12. Meyer, R. and Schwarz, K. FFT Implementation on DSP Chips – Theory and
Practice. In: Proc. ICASSP, pp. 1503–1506, 1990.
13. Pueschel, M., et al. Spiral: A Generator for Platform-Adapted Libraries of Signal
Processing Algorithms. Journal of High Performance Computing and Applications,
submitted.
Acknowledgement
This work is the result of a cooperation with our colleague and friend Herbert Karner
who passed away in October 2001. The project was supported by the Special Research
Program SFB F011 “AURORA” of the Austrian Science Fund FWF.
Current Address
Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, Wiedner Hauptstrasse 8-10, A-1040 Wien, Austria, Tel.: +43-1-58801-11512,
Email: {franz,kalti,christof}@aurora.anum.tuwien.ac.at