Linköping University Post Print

Analysis of Twiddle Factor Memory Complexity of Radix-2^i Pipelined FFTs

Fahad Qureshi and Oscar Gustafsson

N.B.: When citing this work, cite the original article.

©2010 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Fahad Qureshi and Oscar Gustafsson, Analysis of Twiddle Factor Memory Complexity of Radix-2^i Pipelined FFTs, 2009, 43rd Asilomar Conference on Signals, Systems, and Computers, 217-220.

Postprint available at: Linköping University Electronic Press
http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-65855

Analysis of Twiddle Factor Memory Complexity of Radix-2^i Pipelined FFTs

Fahad Qureshi and Oscar Gustafsson
Department of Electrical Engineering, Linköping University
SE-581 83 Linköping, Sweden
E-mail: {fahadq, oscarg}@isy.liu.se

Abstract—In this work, we analyze different approaches to storing the twiddle factor coefficients for the different stages of pipelined Fast Fourier Transforms (FFTs). The analysis is based on complexity comparisons of different algorithms when implemented on Field-Programmable Gate Arrays (FPGAs) and ASICs for different radix-2^i algorithms. The objective of this work is to investigate the best possible combination for storing the twiddle factor coefficients for each stage of the pipelined FFT.

I. INTRODUCTION

Computation of the discrete Fourier transform (DFT) and inverse DFT is used in, e.g., orthogonal frequency-division multiplexing (OFDM) communication systems and spectrometers. An N-point DFT can be expressed as

    X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk},   k = 0, 1, ..., N-1,    (1)

where W_N = e^{-j2\pi/N} is the twiddle factor, the N:th primitive root of unity, with its exponent evaluated modulo N, n is the time index, and k is the frequency index.

Various methods for efficiently computing (1) have been the subject of a large body of published literature. They are commonly referred to as fast Fourier transform (FFT) algorithms. Also, many different architectures for efficiently mapping the FFT algorithm to hardware have been proposed [1]. A commonly used architecture for transforms of length N = b^r is the pipelined FFT. The pipeline architecture is characterized by continuous processing of input data. In addition, the pipeline architecture is highly regular, making it straightforward to automatically generate FFTs of various lengths.

Figure 1 outlines the architecture of a radix-2^i single-path delay feedback (SDF) decimation in frequency (DIF) pipelined FFT of length N. This architecture is generic, while the required range of each complex twiddle factor multiplier is outlined in Table I for varying values of i. For the twiddle factor multipliers with small ranges, special methods have been proposed. Especially, one can note that for a W_4 multiplier the possible coefficients are {±1, ±j} and, hence, the multiplication can be realized by optionally interchanging the real and imaginary parts and possibly negating (or replacing the addition with a subtraction in the subsequent stage). For larger ranges (W_8, W_16, and W_32), approaches have been proposed in [4], [6]–[8]. In this work we instead focus on using standard complex multipliers.
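For reference, the twiddle factor definition in (1) and the trivial W_4 multiplier can be modelled in a few lines of Python. This is an illustrative sketch, not part of the original paper; the function names are our own, and the W_4 case simply applies the interchange-and-negate realization described above.

```python
import cmath

def twiddle(N, k):
    # W_N^k = exp(-j*2*pi*k/N); the exponent is evaluated modulo N.
    return cmath.exp(-2j * cmath.pi * (k % N) / N)

def w4_multiply(x, k):
    # A W_4 multiplier only needs the coefficients {1, -j, -1, +j}, so it can
    # be realized by interchanging the real and imaginary parts and negating,
    # without a general complex multiplier.
    k %= 4
    if k == 0:
        return x                          # * 1
    if k == 1:
        return complex(x.imag, -x.real)   # * (-j)
    if k == 2:
        return complex(-x.real, -x.imag)  # * (-1)
    return complex(-x.imag, x.real)       # * (+j)

if __name__ == "__main__":
    x = 3 + 4j
    for k in range(4):
        assert abs(w4_multiply(x, k) - x * twiddle(4, k)) < 1e-12
```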
However, the twiddle factors are calculated in advance, stored in memories, and retrieved for multiplication whenever necessary. The size of the twiddle factor memory for each stage depends on several factors: the arithmetic precision, the number of FFT points, and the stage number. Usually, for a long FFT, the look-up tables are large in comparison with the butterfly and the complex multiplier. In [9], [10], methods are proposed to reduce the size of the memories by utilizing the octave symmetry of the twiddle factors, hence only storing values for angles in the range 0 ≤ α ≤ π/4. The memory then has at most N/8 + 1 words. However, the results in [9], [10] are given for complete FFTs using the same architecture for all memories and only for radix-2^2. In this work we show that octave symmetry is not always useful due to the overhead of multiplexers and negations. Furthermore, we investigate the effect of wordlength scaling, as previous work has shown that the occupied cell area when synthesizing look-up tables does not grow linearly with the number of bits in the look-up table [11]. It is noted that one could use dedicated memory structures on the FPGAs, but depending on the available resources and the size of the memories this may not be suitable. For the use of dedicated memory structures a cost model is proposed in [12].

In the next section, the different architectures for implementing the twiddle factor memories are explained. In Section III, we analyze and compare the implementation results of those architectures. Finally, some conclusions are presented.

TABLE I
MULTIPLICATION AT DIFFERENT STAGES FOR DIFFERENT ARCHITECTURES

                            Stage number
  Radix        1       2        3        4        5
  2            W_N     W_{N/2}  W_{N/4}  W_{N/8}  W_{N/16}
  2^2 [2]      W_4     W_N      W_4      W_{N/4}  W_4
  2^3 [3]      W_4     W_8      W_N      W_4      W_8
  2^4 [4]      W_4     W_8      W_16     W_N      W_4
  2^5 [5]      W_4     W_8      W_16     W_32     W_N

Fig. 1. The radix-2^i single-path delay feedback (SDF) decimation in frequency (DIF) pipelined FFT architecture with twiddle factor stages.

II. ARCHITECTURES FOR TWIDDLE FACTOR MEMORIES

The twiddle factor memory should provide the real and imaginary parts of the twiddle factor. Typically, in an SDF pipelined FFT architecture, a counter is used to keep track of which row of the FFT is computed in each clock cycle. Hence, we will here assume that the mapping should be from row number to the real and imaginary parts of the twiddle factor.

A. Single Look-up Table

The simplest approach, as shown in Fig. 2, is to just use a large look-up table to store the twiddle factors. For a W_N multiplier, N words need to be stored. Hence, for large N one could expect this method to have a higher complexity compared to the reduced schemes. On the other hand, it lacks any overhead. It should also be noted that this scheme possibly stores the same twiddle factor in several positions, as the mapping is from row to twiddle factor and, for radix-2^i algorithms, some twiddle factors appear more than once when i ≥ 2.

Fig. 2. Block diagram of the single look-up table twiddle factor memory.
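A minimal software model of the single look-up table of Fig. 2 is sketched below. It is our own illustration, not code from the paper: the row counter value is used directly as the table address, the exponent is taken equal to the address for simplicity, and the coefficients are quantized by scaling with 2^15 - 1, an assumed rounding that matches the 16-bit representation used later in Section III.

```python
import cmath

def single_lut(N, wordlength=16):
    """Full table for a W_N multiplier: N words each for the real and the
    imaginary part, quantized to a two's-complement-style fraction."""
    scale = 2 ** (wordlength - 1) - 1
    table = []
    for k in range(N):
        w = cmath.exp(-2j * cmath.pi * k / N)
        table.append((round(w.real * scale), round(w.imag * scale)))
    return table

# The address comes straight from the stage's row counter, so no mapping
# logic is needed. In a real radix-2^i stage the exponent sequence follows
# the algorithm, so the same coefficient may be stored at several addresses.
lut = single_lut(64)
re, im = lut[5]   # quantized real and imaginary parts of W_64^5
```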
B. Twiddle Factor Memory with Address Mapping

A possible simplification is to use an address mapping circuit that maps the row number to the corresponding angle (k in (1)) and a memory storing each required coefficient only once. In the general case, many, but not all, values must be stored; the memory still uses N possible words, although many of them can be set to "don't care". Because of this, one can expect the resources used for the look-up table to be reduced compared to the previous approach, provided that the synthesis tool can benefit from the don't-care conditions. The structure is shown in Fig. 3.

Fig. 3. Block diagram of the twiddle factor memory with address mapping.

C. Twiddle Factor Memory with Address Mapping and Symmetry

Another modification, proposed in [9], [10], is to use the well-known octave symmetry to only store twiddle factors for 0 ≤ α ≤ π/4. The additional cost is an address mapping circuit, as discussed in the previous section, as well as multiplexers to interchange the real and imaginary parts and possible negations. The main benefit is that only N/8 + 1 words need to be stored. The resulting structure is shown in Fig. 4.

Fig. 4. Block diagram of the twiddle factor memory with address mapping and symmetry.

D. Address Mapping

The address mapping for a radix-2^i FFT is done as shown in Fig. 5. Here, the total length of the FFT is 2^L points and the resolution of the twiddle factor multiplier is W_{2^k}. It is worth noting that the address mapping for a given W_N multiplier is independent of L. Clearly, i will affect the complexity of the address mapping circuitry.

Fig. 5. Block diagram of the address mapping unit.
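The address mapping and symmetry scheme of Fig. 4 can be sketched in software as follows. The octant table (N/8 + 1 words for 0 ≤ α ≤ π/4) and the per-octant swap/negate selections are a standard derivation from W_N^k = e^{-j2πk/N}; the Python names and the table layout are our own illustration, not the exact circuit of [9], [10].

```python
import math

def build_octant_table(N):
    # Stored coefficients for 0 <= alpha <= pi/4: N/8 + 1 (cos, sin) pairs.
    return [(math.cos(2 * math.pi * m / N), math.sin(2 * math.pi * m / N))
            for m in range(N // 8 + 1)]

# Per octant: (swap cos/sin?, sign of real part, sign of imaginary part),
# derived from W_N^k = cos(2*pi*k/N) - j*sin(2*pi*k/N).
_OCTANT = [
    (False, +1, -1),  # 0:  c - j*s
    (True,  +1, -1),  # 1:  s - j*c
    (True,  -1, -1),  # 2: -s - j*c
    (False, -1, -1),  # 3: -c - j*s
    (False, -1, +1),  # 4: -c + j*s
    (True,  -1, +1),  # 5: -s + j*c
    (True,  +1, +1),  # 6:  s + j*c
    (False, +1, +1),  # 7:  c + j*s
]

def twiddle_from_octant_table(table, N, k):
    k %= N
    octant = 8 * k // N
    # Even octants count up from the octant boundary, odd octants count down,
    # so the reduced index m always lies in 0..N/8.
    m = k - octant * N // 8 if octant % 2 == 0 else (octant + 1) * N // 8 - k
    swap, sign_re, sign_im = _OCTANT[octant]
    c, s = table[m]
    re, im = (s, c) if swap else (c, s)
    return complex(sign_re * re, sign_im * im)

if __name__ == "__main__":
    N = 64
    table = build_octant_table(N)   # only N/8 + 1 = 9 stored words
    for k in range(N):
        exact = complex(math.cos(2 * math.pi * k / N),
                        -math.sin(2 * math.pi * k / N))
        assert abs(twiddle_from_octant_table(table, N, k) - exact) < 1e-12
```

The swap flag corresponds to the multiplexers that interchange the real and imaginary parts, and the two signs correspond to the optional negations mentioned above.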
III. ANALYSIS AND RESULTS

We have analyzed the complexity of the twiddle factor memories having resolution ≥ 64 with the different architectures, considering radix-2^i algorithms with different values of i. The architectures of the twiddle factor memories have been coded in VHDL. These architectures were synthesized using three different synthesis tools: Mentor Graphics Precision targeting an Altera Stratix-IV FPGA, Xilinx ISE targeting a Virtex-4 FPGA, and Synopsys Design Compiler targeting 0.35 μm CMOS standard cells. The twiddle factors are represented using 16 bits each for the real and imaginary parts. The two's complement representation of the numbers is used in the twiddle factor memory. The resulting complexity for each stage is illustrated in Figs. 6, 7, and 8 for the different technologies, Altera Stratix-IV FPGA, Virtex-4 FPGA, and 0.35 μm CMOS ASIC, respectively.

Fig. 6. Radix-2^i SDF pipelined FFT twiddle factor memory complexity (LUTs versus twiddle factor, W_64 to W_8192, for the three architectures: memory, memory with AG, and memory with AG and symmetry) using Mentor Graphics Precision targeting an Altera Stratix-IV FPGA.

Fig. 7. Radix-2^i SDF pipelined FFT twiddle factor memory complexity (4-input LUTs versus twiddle factor) using Xilinx ISE targeting a Virtex-4 FPGA.

Fig. 8. Radix-2^i SDF pipelined FFT twiddle factor memory complexity (cell area versus twiddle factor) using 0.35 μm CMOS standard cells.

Figures 6, 7, and 8 show that the twiddle factor memory with address mapping and symmetry is the most advantageous architecture for large ranges. However, for small ranges, the simple look-up table approach is most beneficial. The point where address mapping and symmetry becomes more beneficial than the simple look-up table moves towards higher twiddle factor resolutions as the value of i increases. In the FPGA designs, the memory with address mapping alone is not a beneficial choice because the synthesis tool does not utilize the "don't care" conditions. In the ASIC designs it falls between the other two, although it is never the best.

To illustrate the impact of the wordlength, we synthesized a W_1024 twiddle factor memory using wordlengths varying from 10 to 18 bits targeting a Xilinx Virtex-4 FPGA. The results are shown in Fig. 9 and exhibit the expected linear behaviour. However, the offset, corresponding to the wordlength-independent circuitry such as the address generation, differs between the approaches. Hence, for resolutions that gave similar complexity in Figs. 6, 7, and 8, one would have to re-evaluate the best architecture based on the used wordlength.

Fig. 9. W_1024 twiddle factor memory complexity for different wordlengths (4-input LUTs versus wordlength) using Xilinx ISE targeting a Virtex-4 FPGA.

Figure 10 shows the complexity when using the best twiddle factor memory architecture for each radix-2^i algorithm in the different technologies. It can be seen that the complexity for the same twiddle factor increases as the value of i increases in the radix-2^i algorithms.

Fig. 10. Best architecture of the twiddle factor memory for different twiddle factors (top: Altera LUTs, middle: Xilinx 4-input LUTs, bottom: ASIC cell area).

Table II shows the twiddle factor memories of an 8192-point FFT single-path delay feedback pipelined architecture having resolution ≥ 64 for the different radix-2^i algorithms. The complexity of each complex twiddle factor memory, using the best architecture in each of the three technologies, is shown in Tables III, IV, and V, respectively. The values in italics correspond to memories where only a plain look-up table is used. This justifies the initial assumption that the same architecture is not beneficial for all twiddle factor memories. The total complexity of the twiddle factor memory is reduced as the value of i is increased, except for the Xilinx results.

TABLE II
TWIDDLE FACTOR MEMORIES OF AN 8192-POINT SDF PIPELINED FFT WITH DIFFERENT ALGORITHMS

                          Memory
  Algorithm    1         2         3        4
  2^2 [2]      W_8192    W_2048    W_512    W_128
  2^3 [3]      W_8192    W_1024    W_128    -
  2^4 [4]      W_8192    W_512     -        -
  2^5 [5]      W_8192    W_256     -        -

TABLE III
TWIDDLE FACTOR MEMORY COMPLEXITY OF AN 8192-POINT SDF PIPELINED FFT WITH DIFFERENT ALGORITHMS (ALTERA)

                     Memory complexity
  Algorithm    1       2      3      4      Total
  2^2 [2]      2650    729    240    95     3714
  2^3 [3]      2835    581    96     -      3512
  2^4 [4]      3002    339    -      -      3341
  2^5 [5]      3123    157    -      -      3280

TABLE IV
TWIDDLE FACTOR MEMORY COMPLEXITY OF AN 8192-POINT SDF PIPELINED FFT WITH DIFFERENT ALGORITHMS (XILINX)

                     Memory complexity
  Algorithm    1       2      3      4      Total
  2^2 [2]      1592    735    383    201    2911
  2^3 [3]      1653    556    228    -      2437
  2^4 [4]      1791    550    -      -      2341
  2^5 [5]      1863    527    -      -      2390

TABLE V
TWIDDLE FACTOR MEMORY COMPLEXITY OF AN 8192-POINT SDF PIPELINED FFT WITH DIFFERENT ALGORITHMS (ASIC)

                           Memory complexity
  Algorithm    1           2          3          4          Total
  2^2 [2]      246500.8    89471.2    39967.2    21294.0    397233.2
  2^3 [3]      260059.8    66739.4    25771.2    -          352570.4
  2^4 [4]      283501.4    58167.2    -          -          341668.6
  2^5 [5]      300829.6    27318.2    -          -          328147.8
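The memory lists in Table II follow from the stage pattern of Table I: in a radix-2^i SDF pipeline, a full-range W_N' multiplier appears after every group of i stages, with the resolution divided by 2^i from one group to the next, and the smaller rotators (W_4 to W_32) are handled separately. The sketch below is our own illustration consistent with Tables I and II; the function names and the handling of the ≥ 64 threshold are assumptions, not part of the paper.

```python
def large_twiddle_memories(N, i, min_resolution=64):
    """Resolutions of the full-range W_N' multipliers in a radix-2^i SDF
    pipeline of length N, keeping only those with resolution >= 64."""
    mems = []
    res = N
    while res >= min_resolution:
        mems.append(res)
        res //= 2 ** i
    return mems

def word_counts(resolution):
    # Words needed by the single look-up table and the octave-symmetry scheme.
    return {"single LUT": resolution, "symmetry": resolution // 8 + 1}

if __name__ == "__main__":
    N = 8192
    for i in (2, 3, 4, 5):
        print(f"radix-2^{i}:", [f"W_{r}" for r in large_twiddle_memories(N, i)])
    # radix-2^2: ['W_8192', 'W_2048', 'W_512', 'W_128']   (cf. Table II)
```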
IV. CONCLUSIONS

In this paper, we have analyzed the complexity of twiddle factor memories for pipelined FFTs considering different architectures. The analysis is based on complexity comparisons of different radix-2^i algorithms when implemented either on FPGAs (field-programmable gate arrays) or with standard cells. The results show that a plain look-up table is advantageous for low-resolution memories, while for larger-resolution twiddle factor memories, utilizing octave symmetry and an address generator is advantageous. The break-point up to which the plain look-up table approach is advantageous increases with increasing i.

REFERENCES

[1] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.
[2] S. He and M. Torkelson, "A new approach to pipeline FFT processor," in Proc. IEEE Parallel Processing Symp., 1996, pp. 766–770.
[3] S. He and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation," in Proc. IEEE URSI Int. Symp. Signals Syst. Electron., 1998, pp. 257–262.
[4] J.-E. Oh and M.-S. Lim, "New radix-2 to the 4th power pipeline FFT processor," IEICE Trans. Electron., vol. E88-C, no. 8, pp. 694–697, Aug. 2005.
[5] A. Cortes, I. Velez, and J. F. Sevillano, "Radix r^k FFTs: matricial representation and SDC/SDF pipeline implementation," IEEE Trans. Signal Processing, vol. 57, no. 7, pp. 2824–2839, Jul. 2009.
[6] Y.-E. Kim, K.-J. Cho, and J.-G. Chung, "Low power small area modified Booth multiplier design for predetermined coefficients," IEICE Trans. Fund., vol. E90-A, no. 3, pp. 694–697, Mar. 2007.
[7] W. Han, T. Arslan, A. T. Erdogan, and M. Hasan, "High-performance low-power FFT cores," ETRI Journal, vol. 30, no. 3, pp. 451–460, June 2008.
[8] F. Qureshi and O. Gustafsson, "Low-complexity reconfigurable complex constant multiplication for FFTs," in Proc. IEEE Int. Symp. Circuits Syst., Taipei, Taiwan, May 24–27, 2009.
[9] H. Cho, M. Kim, D. Kim, and J. Kim, "R2^2 SDF FFT implementation with coefficient memory reduction scheme," in Proc. Vehicular Technology Conf., 2006.
[10] M. Hasan and T. Arslan, "Scheme for reducing size of coefficient memory in FFT processor," Electronics Letters, vol. 38, no. 4, pp. 163–164, Feb. 2002.
[11] O. Gustafsson and K. Johansson, "An empirical study on standard cell synthesis of elementary function look-up tables," in Proc. Asilomar Conf. Signals Syst. Comput., Pacific Grove, CA, Oct. 26–29, 2008.
[12] P. A. Milder, M. Ahmad, J. C. Hoe, and M. Püschel, "Fast and accurate resource estimation of automatically generated custom DFT IP cores," in Proc. FPGA, 2006, pp. 211–220.