Analysis of Twiddle Factor Memory Complexity of Radix-2^i Pipelined FFTs
Fahad Qureshi and Oscar Gustafsson
Department of Electrical Engineering, Linköping University
SE-581 83 Linköping, Sweden
E-mail: {fahadq, oscarg}@isy.liu.se
Abstract—In this work, we analyze different approaches to store the twiddle factor coefficients for the different stages of pipelined Fast Fourier Transforms (FFTs). The analysis is based on complexity comparisons of the different approaches when implemented on Field-Programmable Gate Arrays (FPGAs) and ASICs for different radix-2^i algorithms. The objective of this work is to investigate the best possible combination for storing the twiddle factor coefficients for each stage of the pipelined FFT.
I. INTRODUCTION
Computation of the discrete Fourier transform (DFT) and
inverse DFT is used in e.g. orthogonal frequency-division multiplexing (OFDM) communication systems and spectrometers.
An N-point DFT can be expressed as

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk},  k = 0, 1, ..., N-1,   (1)

where W_N = e^{-j 2\pi / N} is the twiddle factor, the N:th primitive root of unity with its exponent evaluated modulo N, n is the time index, and k is the frequency index. Various methods
for efficiently computing (1) have been the subject of a large
body of published literature. They are commonly referred to as
fast Fourier transform (FFT) algorithms. Also, many different
architectures to efficiently map the FFT algorithm to hardware
have been proposed [1].
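To make the notation in (1) concrete, a direct O(N^2) evaluation of the DFT with explicitly tabulated twiddle factors can be sketched as below. This is only an illustrative Python model of (1), not part of any of the referenced implementations, and the function name is ours:

import cmath

def dft_direct(x):
    """Direct evaluation of (1): X[k] = sum_n x[n] * W_N^(n*k)."""
    N = len(x)
    # Twiddle factors W_N^m = exp(-j*2*pi*m/N); the exponent is taken modulo N.
    W = [cmath.exp(-2j * cmath.pi * m / N) for m in range(N)]
    return [sum(x[n] * W[(n * k) % N] for n in range(N)) for k in range(N)]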
A commonly used architecture for transforms of length N = b^r is the pipelined FFT. The pipeline architecture
is characterized by continuous processing of input data. In
addition, the pipeline architecture is highly regular, making
it straightforward to automatically generate FFTs of various
lengths.
Figure 1 outlines a radix-2^i single-path delay feedback (SDF) decimation in frequency (DIF) pipelined FFT architecture for length N. This architecture is generic, while the required range of each complex twiddle factor multiplier is outlined in Table I for varying values of i.
For the twiddle factor multipliers with small ranges, special methods have been proposed. In particular, for a W4 multiplier the possible coefficients are {±1, ±j} and, hence, the multiplication can be realized simply by optionally interchanging the real and imaginary parts and possibly negating them (or replacing the addition with a subtraction in the subsequent stage). For larger
ranges (W8, W16, and W32), approaches have been proposed in [4], [6]–[8].
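As a simple illustration of the trivial W4 case (our own sketch, not the circuits of [4], [6]–[8]): multiplying by W_4^k = e^{-j*pi*k/2} only requires swapping and negating the real and imaginary parts, so no general complex multiplier is needed.

def w4_multiply(re, im, k):
    """Multiply (re + j*im) by W_4^k, k in {0, 1, 2, 3}, without a complex multiplier."""
    k %= 4
    if k == 0:            # * 1
        return re, im
    if k == 1:            # * (-j): swap parts, negate the new imaginary part
        return im, -re
    if k == 2:            # * (-1): negate both parts
        return -re, -im
    return -im, re        # k == 3, * (+j): swap parts, negate the new real part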
In this work, we instead focus on using standard complex multipliers. The twiddle factors are then calculated in advance, stored in memories, and retrieved for multiplication whenever necessary. The size of the twiddle factor memory for each stage depends on several factors: the arithmetic precision, the number of FFT points, and the stage number. For a long FFT, the look-up tables are usually large in comparison with the butterfly and the complex multiplier. In [9], [10], methods are proposed to reduce the size of the memories by utilizing the octave symmetry of the twiddle factors, hence only storing values for angles 0 ≤ α ≤ π/4. The memory then has at most N/8 + 1 words. However, the results in [9], [10]
are given for complete FFTs using the same architecture for all memories and only for radix-2^2. In this work, we show that octave symmetry is not always useful due to the overhead of the multiplexers and negations. Furthermore, we investigate the effect of wordlength scaling, as previous work has shown that the occupied cell area when synthesizing look-up tables does not grow linearly with the number of bits in the look-up table [11]. It is noted that one could use dedicated memory structures on FPGAs, but depending on the available resources and the size of the memories this may not be suitable. A cost model for using dedicated memory structures is proposed in [12].
In the next section, the different architectures to implement the twiddle factor memories are explained. In Section III, we analyze and compare the implementation results of those architectures. Finally, some conclusions are presented.
II. ARCHITECTURES FOR TWIDDLE FACTOR MEMORIES
The twiddle factor memory should provide the real and imaginary parts of the twiddle factor.
TABLE I
MULTIPLICATION AT DIFFERENT STAGES FOR DIFFERENT ARCHITECTURES.

                            Stage number
Radix        1        2          3          4          5
2            W_N      W_{N/2}    W_{N/4}    W_{N/8}    W_{N/16}
2^2 [2]      W_4      W_N        W_4        W_{N/4}    W_4
2^3 [3]      W_4      W_8        W_N        W_4        W_8
2^4 [4]      W_4      W_8        W_16       W_N        W_4
2^5 [5]      W_4      W_8        W_16       W_32       W_N
Fig. 1. The radix-2^i single-path delay feedback (SDF) decimation in frequency (DIF) pipeline FFT architecture with twiddle factor stages.
Fig. 2. Block diagram of the single look-up table twiddle factor memory.

Fig. 3. Block diagram of the twiddle factor memory with address mapping.

Fig. 4. Block diagram of the twiddle factor memory with address mapping and symmetry.
Typically, in an SDF pipelined FFT architecture, a counter is used to keep track of which row of the FFT is computed in each clock cycle. Hence, we will here assume that the mapping should be from the row number to the real and imaginary parts of the twiddle factor.
A. Single Look-up Table
The simplest approach, shown in Fig. 2, is to just use a large look-up table to store the twiddle factors. For a WN multiplier, N words need to be stored. Hence, for large N one can expect this method to have a higher complexity compared to the reduced schemes. On the other hand, it has no overhead. It should also be noted that this scheme possibly stores the same twiddle factor in several positions, since the mapping is from row to twiddle factor and, for radix-2^i algorithms with i ≥ 2, some twiddle factors appear more than once.
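A behavioural sketch of such a table follows (our own Python model of the stored contents, not the VHDL used in Section III; for simplicity it assumes the row address equals the twiddle factor exponent, as in the plain radix-2 case, and uses the 16-bit two's complement scaling described in Section III):

import cmath

def single_lut(N, wordlength=16):
    """Full table: one quantized (real, imaginary) word pair per address 0..N-1."""
    scale = 2 ** (wordlength - 1) - 1          # largest positive two's complement value
    table = []
    for k in range(N):
        w = cmath.exp(-2j * cmath.pi * k / N)  # W_N^k
        table.append((round(w.real * scale), round(w.imag * scale)))
    return table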
B. Twiddle Factor Memory with Address Mapping
A possible simplification is to use an address mapping
circuit that maps the row to the corresponding angle (k in
(1)) and use a memory storing the required elements only
once. For the general case, we will need to store many, but
not all, values, still using N possible words even though many
can be set to “don’t care”. Because of this one can expect the
resources used for the look-up table to be reduced compared to
the previous approach, given that the synthesis tool can benefit
from it. The structure is shown in Fig. 3.
Fig. 5. Block diagram of the address mapping unit.
C. Twiddle Factor Memory with Address Mapping and Symmetry
Another modification, proposed in [9], [10], is to use the well-known octave symmetry to only store twiddle factors for 0 ≤ α ≤ π/4. The additional cost is an address mapping circuit, as discussed in the previous section, as well as multiplexers to interchange the real and imaginary parts and possible negations. The main benefit is that only N/8 + 1 words are required to be stored. The resulting structure is shown in Fig. 4.
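A behavioural sketch of the octave symmetry is given below (our own Python model; in [9], [10] the interchange and negation are of course realized with multiplexers and negators rather than in software). Only the cosine and sine values for angles 0 ≤ α ≤ π/4, i.e. N/8 + 1 word pairs, are stored, and any W_N^k is reconstructed from them:

import math

def octant_table(N):
    """Stored contents: (cos, sin) for angles 0..pi/4 only, N/8 + 1 word pairs."""
    return [(math.cos(2 * math.pi * m / N), math.sin(2 * math.pi * m / N))
            for m in range(N // 8 + 1)]

def twiddle(table, N, k):
    """Reconstruct W_N^k = cos(t) - j*sin(t), t = 2*pi*k/N, from the octant table."""
    octant, r = divmod(k % N, N // 8)
    m = N // 8 - r if octant % 2 else r                            # mirror the index in odd octants
    c, s = table[m]
    re_mag, im_mag = (s, c) if octant in (1, 2, 5, 6) else (c, s)  # interchange real/imaginary
    re = re_mag if octant in (0, 1, 6, 7) else -re_mag             # sign of cos(t)
    im = im_mag if octant >= 4 else -im_mag                        # sign of -sin(t)
    return complex(re, im)

For example, with N = 64 only 9 word pairs are stored instead of 64, at the price of the index mirroring, the interchange, and the two sign selections.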
D. Address Mapping
The address mapping for a radix-2^i FFT is done as shown in Fig. 5. Here, the total length of the FFT is 2^L points and the resolution of the twiddle factor multiplier is W_{2^k}. It is worth noting that the address mapping for a given WN multiplier is independent of L. Clearly, i will affect the complexity of the address mapping circuitry.
III. ANALYSIS AND RESULTS
We have analyzed the complexity of twiddle factor memories having resolution ≥ 64 with the different architectures, considering radix-2^i algorithms with different values of i. The architectures of the twiddle factor memories have been coded in VHDL and synthesized using three different synthesis tools: Mentor Graphics Precision targeting an Altera Stratix-IV FPGA, Xilinx ISE targeting a Virtex-4 FPGA, and Synopsys Design Compiler targeting 0.35 μm CMOS standard cells. The twiddle factors are represented using 16 bits each for the real and imaginary parts, stored in two's complement representation. The resulting complexity for each stage is illustrated in Figs. 6, 7, and 8 for the Altera Stratix-IV FPGA, the Virtex-4 FPGA, and the 0.35 μm CMOS ASIC, respectively.

Fig. 6. Radix-2^i SDF pipelined FFT twiddle factor memory complexity using Mentor Graphics Precision targeting an Altera Stratix-IV FPGA.

Fig. 7. Radix-2^i SDF pipelined FFT twiddle factor memory complexity using Xilinx ISE targeting a Virtex-4 FPGA.

Fig. 8. Radix-2^i SDF pipelined FFT twiddle factor memory complexity using 0.35 μm CMOS standard cells.
Figures 6, 7, and 8 show that the twiddle factor memory with address mapping and symmetry is the most advantageous architecture for large ranges. However, for small ranges, the simple look-up table approach is the most beneficial. The point where address mapping and symmetry becomes more beneficial than the simple look-up table moves towards higher twiddle factor resolutions as the value of i increases.

In the FPGA designs, the memory with address mapping is not a beneficial choice because the synthesis tool does not utilize the "don't care" conditions. In the ASIC designs it falls between the other two, although it is never the best. To illustrate the effect of the wordlength, we synthesized a W1024 twiddle factor memory using wordlengths varying from 10 to 18 bits targeting a Xilinx Virtex-4 FPGA. The results, shown in Fig. 9, exhibit the expected linear behaviour. However, the offset, corresponding to the circuitry that is independent of the wordlength, such as the address generation, differs between the approaches. Hence, for resolutions that gave similar complexity in Figs. 6, 7, and 8, the best architecture would have to be re-evaluated based on the used wordlength.

Fig. 9. W1024 twiddle factor memory complexity for different wordlengths using Xilinx ISE targeting a Virtex-4 FPGA.

Figure 10 shows the complexity when using the best twiddle factor memory architecture for each radix-2^i algorithm in the different technologies. It can be seen that the complexity for the same twiddle factor resolution increases as the value of i increases in the radix-2^i algorithms.

Fig. 10. Best architecture of the twiddle factor memory for different twiddle factors.
Table II shows the twiddle factor memories having resolution ≥ 64 for an 8192-point SDF pipelined FFT with the different radix-2^i algorithms. The complexity of each complex twiddle factor memory, using the best architecture for each of the three technologies, is shown in Tables III, IV, and V, respectively. The values in italics correspond to memories where only a look-up table is used. This justifies the initial assumption that the same architecture is not beneficial for all twiddle factor memories. The total complexity of the twiddle factor memories is reduced as the value of i is increased, except for the Xilinx results.
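The entries of Table II follow directly from the pattern in Table I; a small sketch of that bookkeeping is given below (our reading of the pattern, not code from the implementations). It lists only the resolutions of the non-trivial twiddle factor memories, ignoring the small fixed W4, ..., W2^i multipliers handled by the methods in [4], [6]–[8]:

def twiddle_memory_resolutions(N, i, min_resolution=64):
    """Resolutions of the W_M twiddle factor memories (M >= min_resolution)
    for an N-point radix-2^i SDF pipelined FFT, following the pattern of Table I."""
    resolutions, M = [], N
    while M >= min_resolution:
        resolutions.append(M)   # full-resolution multiplier ending a group of i stages
        M //= 2 ** i            # remaining transform length after that group
    return resolutions

# e.g. twiddle_memory_resolutions(8192, 2) -> [8192, 2048, 512, 128], the radix-2^2 row of Table II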
TABLE II
TWIDDLE FACTOR MEMORIES OF THE 8192-POINT SDF PIPELINED FFT WITH DIFFERENT ALGORITHMS.

                        Memory
Algorithm      1          2          3          4
2^2 [2]      W_8192     W_2048     W_512      W_128
2^3 [3]      W_8192     W_1024     W_128      -
2^4 [4]      W_8192     W_512      -          -
2^5 [5]      W_8192     W_256      -          -
TABLE III
TWIDDLE FACTOR MEMORY COMPLEXITY OF 8192-FFT SDF PIPELINED WITH DIFFERENT ALGORITHMS (ALTERA).

                    Memory complexity
Algorithm      1        2        3        4       Total
2^2 [2]      2650      729      240       95      3714
2^3 [3]      2835      581       96       -       3512
2^4 [4]      3002      339       -        -       3341
2^5 [5]      3123      157       -        -       3280
TABLE IV
TWIDDLE FACTOR MEMORY COMPLEXITY OF 8192-FFT SDF PIPELINED WITH DIFFERENT ALGORITHMS (XILINX).

                    Memory complexity
Algorithm      1        2        3        4       Total
2^2 [2]      1592      735      383      201      2911
2^3 [3]      1653      556      228       -       2437
2^4 [4]      1791      550       -        -       2341
2^5 [5]      1863      527       -        -       2390

TABLE V
TWIDDLE FACTOR MEMORY COMPLEXITY OF 8192-FFT SDF PIPELINED WITH DIFFERENT ALGORITHMS (ASIC).

                         Memory complexity
Algorithm       1           2          3          4         Total
2^2 [2]      246500.8    89471.2    39967.2    21294.0    397233.2
2^3 [3]      260059.8    66739.4    25771.2       -       352570.4
2^4 [4]      283501.4    58167.2       -          -       341668.6
2^5 [5]      300829.6    27318.2       -          -       328147.8
IV. CONCLUSIONS
In this paper, we have analyzed the complexity of twiddle factor memories for pipelined FFTs considering different architectures. The analysis is based on complexity comparisons of different radix-2^i algorithms when implemented either on FPGAs (field-programmable gate arrays) or with standard cells. The results show that a plain look-up table is advantageous for low-resolution memories, while for larger-resolution twiddle factor memories, utilizing octave symmetry and an address generator is advantageous. The break-point resolution below which the plain look-up table approach is advantageous increases with increasing i.
REFERENCES
[1] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.
[2] S. He and M. Torkelson, “A new approach to pipeline FFT processor,”
in Proc. IEEE Parallel Processing Symp., 1996, pp. 766–770.
[3] S. He and M. Torkelson, “Designing pipeline FFT processor for OFDM (de)modulation,” in Proc. IEEE URSI Int. Symp. Sig. Elect., 1998, pp. 257–262.
[4] J.-E. Oh and M.-S. Lim, “New radix-2 to the 4th power pipeline FFT
processor,” IEICE Trans. Electron., vol. E88-C, no. 8, pp. 694–697, Aug.
2005.
[5] A. Cortes, I. Velez, and J. F. Sevillano, “Radix r^k FFTs: matricial representation and SDC/SDF pipeline implementation,” IEEE Trans. Signal Process., vol. 57, no. 7, pp. 2824–2839, Jul. 2009.
[6] Y.-E. Kim, K.-J. Cho, and J.-G. Chung, “Low power small area modified
Booth multiplier design for predetermined coefficients,” IEICE Trans.
Fund., vol. E90-A, no. 3, pp. 694–697, Mar. 2007.
[7] W. Han, T. Arslan, A. T. Erdogan and M. Hasan, “High-performance
low-power FFT cores,” ETRI Journal, vol. 30, no. 3, pp. 451–460, June
2008.
[8] F. Qureshi and O. Gustafsson, “Low-complexity reconfigurable complex
constant multiplication for FFTs,” in Proc. IEEE Int. Symp. Circuits
Syst., Taipei, Taiwan, May 24–27, 2009.
[9] H. Cho, M. Kim, D. Kim, and J. Kim, “R2^2 SDF FFT implementation with coefficient memory reduction scheme,” in Proc. Vehicular Technology Conf., 2006.
[10] M. Hasan and T. Arslan, “Scheme for reducing size of coefficient memory in FFT processor,” Electronics Letters, vol. 38, no. 4, pp. 163–164, Feb. 2002.
[11] O. Gustafsson and K. Johansson, “An empirical study on standard cell
synthesis of elementary function look-up tables,” in Proc. Asilomar Conf.
Signals Syst. Comp., Pacific Grove, CA, Oct. 26–29, 2008.
[12] P. A. Milder, M. Ahmad, J. C. Hoe, and M. Püschel, “Fast and accurate resource estimation of automatically generated custom DFT IP cores,” in Proc. FPGA, 2006, pp. 211–220.