[email protected] http://www.es.isy.liu.se/ FFT PROCESSOR, CONT. Department of Electrical Engineering Linköping University DIRECT MAPPING TECHNIQUE VLSI Design Logic Design Architecture Design Resource Assignment Resource Allocation Scheduling Partitionig into co-operating processes Lars Wanhammar DSP Integrated Circuits Department of Electrical Engineering Linköping University [email protected] http://www.es.isy.liu.se/ We will first perform a crude design, then subsequently improve it until a satisfactory solution is reached. We begin by partitioning the algorithm, followed by scheduling, resource allocation, resource assignment, and finally, synthesis of a matching architecture. This means that we will synthesize the architecture with its hardware modules step by step, starting from the algorithm. In this chapter we will perform the main steps in the system design phase of the FFT processor. Lars Wanhammar DSP Integrated Circuits MANUFACTURING AND TESTING INTERGRATED CIRCUITS DESIGN SYSTEM DESIGN SYSTEM SPECIFIKATION 3 DSP Integrated Circuits Department of Electrical Engineering Linköping University [email protected] http://www.es.isy.liu.se/ Lars Wanhammar DSP Integrated Circuits FFT/IFFT Data FFT/IFFT Data Department of Electrical Engineering Linköping University Generally, it is useful to introduce a memory process at this level explicitly. The input process is responsible for reading data into the FFT processor from the outside world. input, FFT, output, and memory process. At the top level we partition the FFT processor into four communicating processes: Finished Output FFT Input Start Read/Write Real/Imag Address Data [email protected] http://www.es.isy.liu.se/ Read/Write Real/Imag Address Data Read/Write Real/Imag Address Data Read/Write Real/Imag Address Data Memory First Design Iteration In a top-down design approach we start by identifying the major tasks (processes) to be executed. Lars Wanhammar Model #3 Model #2 Model #1 Golden Model Test Bench We advocate that a sequence of models with increasing degrees of details and functionality i built in order to better understand the “problem”; explore the design space; obtain a specification PROBLEM UNDERSTANDING Stimuli DSP Algorithm 1 ≤ 0.5 ms DSP SYSTEM DESIGN STRATEGY Compare 4 2 Department of Electrical Engineering Linköping University [email protected] http://www.es.isy.liu.se/ MHz complex words/s Lars Wanhammar DSP Integrated Circuits Department of Electrical Engineering Linköping University [email protected] http://www.es.isy.liu.se/ However, we choose to use two logical memories since this will simplify the implementation of the memories and increase the memory bandwidth, which is a critical factor. In principle, it is possible to use only one logical memory. N ( 2 + 2 ) ---- log 2( N ) 2 f memory = ------------------------------------------- = 55 ⋅ 10 6 t FFT For each butterfly operation we must read and write two complex values. Hence, the data rate will be We can also estimate the required data rate for the memory process. N W d ⋅ ---- log 2( N ) 2 24 ⋅ 5120 f CLPE = --------------------------------------- = -------------------------------------- = 165.2 N PE ⋅ t FFT 2 ⋅ 0.372 ⋅ 10 – 3 Thus, we need only two butterfly PEs to meet the required throughput. The minimum required clock frequency for the PEs is Lars Wanhammar DSP Integrated Circuits The partitioning should be done in steps where each step is sufficiently small as to make the transformation obvious, thereby avoiding design errors. Partitioning should be performed such that the processes at the lowest level in the hierarchy can be easily mapped onto hardware resources. These processes are then partitioned into a hierarchy of simpler processes. Output process: Writes data from the memory to the outside world, performs unscrambling of the data array in the last stage of the FFT. Interchanges the real and imaginary parts if needed. FFT process: The FFT process only computes the FFT since the IFFT is computed by interchanging the real and imaginary parts of the data in the input and output processes. Memory process: The input sequence and intermediate results are stored by the memory process. Input process: Interchanges the real and imaginary parts if needed. 7 5 Department of Electrical Engineering Linköping University [email protected] http://www.es.isy.liu.se/ ≈ 1.5 MHz Lars Wanhammar DSP Integrated Circuits Department of Electrical Engineering Linköping University [email protected] http://www.es.isy.liu.se/ 1 1 1 ------------------------------- = ---------------------------------------------------------------- = ------------------------------ = 2232 t I /O + t FTT – 3 – 3 0.128 ⋅ 10 + 0.32 ⋅ 10 0.488 ⋅ 10 – 3 The throughput becomes 24 ⋅ 5120 f CLPE = ----------------------------------- = 192 2 ⋅ 0.32 ⋅ 10 – 3 and the minimum required clock frequency for the PEs is N ( 2 + 2 ) ---- log 2( N ) 2 t FFT = ------------------------------------------- = 0.32 ms 2 ⋅ 32 ⋅ 10 6 FFTs/s We therefore decide to use a memory clock frequency of 32 MHz, i.e. twice the I/O data rate. The time required for the FFT then becomes This will make it possible to use the same clock frequency for the memories throughout the input, FFT, and output processes. Also, it is desirable that the memory clock frequency is a multiple of the I/O frequency. Lars Wanhammar DSP Integrated Circuits N W d ⋅ ---- log 2( N ) 24 ⋅ 5120 2 N PEb = -------------------------------------------- = --------------------------------------------------------t FFT f CLPEmax 0.372 ⋅ 10 – 3 ⋅ 220 ⋅ 10 6 A bit-serial butterfly PE can be implemented using Wd = 24 clock cycles and we assume that the usable clock frequency, fCLPEmax, is at least 220 MHz. The time available for the FFT is tFFT = 0.372 ms. The number of butterfly PEs is N ---- log ( N ) = 5120 2 2 It is possible to overlap their execution and speed up the FFT processor slightly, but the complexity of the design would increase significantly. The number of butterfly operations for each FFT is The partitioning assumes that the input, FFT, and output processes are executed sequentially. Resource Allocation Next we determine the resources required. 8 6 Department of Electrical Engineering Linköping University for Stage = 1 to M for ??? for ?? Butterfly_1 Butterfly_2 next ?? next ?? next Stage [email protected] http://www.es.isy.liu.se/ Stage 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Data index Department of Electrical Engineering Linköping University DSP Integrated Circuits doit Input Data DSP Integrated Circuits Department of Electrical Engineering Linköping University [email protected] http://www.es.isy.liu.se/ Lars Wanhammar Memory Write doit Output Data Department of Electrical Engineering Linköping University Finished next for i in 0 to N–1 loop next 3 W4 W0 W4 W0 W4 W0 W4 W0 [email protected] http://www.es.isy.liu.se/ doit [email protected] http://www.es.isy.liu.se/ Memory Butterfly Memory Butterfly Adresses W_Process next doit Ns := Ns/2; Ns := N; for m in 0 to (N/4–1) loop for Stage in 1 to M loop next for i in N–1 loop Start Memory Read Lars Wanhammar 2 W6 W4 W2 W0 W6 W4 W2 W0 4 W0 W0 W0 W0 W0 W0 W0 W0 Unscramble The behavior of this process depends on the chosen index generation scheme. We also specify the behavior of the input and output processes in more detail by partitioning them into the processes Memory, Memory_Read, and Memory_Write. First alternative: execute butterflies p and p+N/4 concurrently. Second alternative: execute p and p+Ns/2 concurrently in the first stage while p and p+Ns are executed concurrently in the following stages. 1 W7 W6 W5 W4 W3 W2 W1 W0 In this step we collect index computations into a process, Addresses. Third Design Iteration Lars Wanhammar DSP Integrated Circuits Thus, the twiddle factor for one of the butterfly PEs can be obtained from the twiddle factor of the corresponding butterfly PE by simply changing the sign and interchanging the real and imaginary parts. W p + N /4 = W N /4 W p = – jW p The relation between the coefficients for the butterflies in rows p and p + N/4 in the first stage of the FFT is Digit_Reverse 11 9 Having two concurrent butterflies means that we must generate two indices k in parallel: k1 and k2. First Alternative We will investigate two different alternative choices of order for the computations of the butterflies in the inner loop. for Stage = 1 to M for q = 1 to (N div(2*Ns) for n = 1 to Ns Butterfly next n next q next Stage Fast Fourier Transform – Sande-Tukey Lars Wanhammar DSP Integrated Circuits The coefficients (i.e., the twiddle factors) in the two parts are the same. There are M = log2(N) stages with N/2 butterfly operations each. Let us therefore first consider the signal-flow graph for a 16-point Sande– Tukey’s FFT where the butterfly operation has been chosen as the basic operation. We must also compute two coefficients Wp concurrently. There are many possible ways to do this. We must modify the original program so that two butterfly operations can be executed concurrently. The following design iteration therefore aims to modify the original sequential algorithm into a parallel description that can be mapped efficiently onto the hardware resources. Second Design Iteration Note that both the original FFT program and the description using communicating processes are sequential descriptions. 12 10
© Copyright 2026 Paperzz