dsp system design strategy problem understanding fft

[email protected]
http://www.es.isy.liu.se/
FFT PROCESSOR, CONT.
Department of Electrical Engineering
Linköping University
DIRECT MAPPING TECHNIQUE
VLSI Design
Logic Design
Architecture Design
Resource Assignment
Resource Allocation
Scheduling
Partitionig into
co-operating processes
Lars Wanhammar
DSP Integrated Circuits
Department of Electrical Engineering
Linköping University
[email protected]
http://www.es.isy.liu.se/
We will first perform a crude design, then subsequently improve it until a
satisfactory solution is reached.
We begin by partitioning the algorithm, followed by scheduling, resource
allocation, resource assignment, and finally, synthesis of a matching
architecture.
This means that we will synthesize the architecture with its hardware
modules step by step, starting from the algorithm.
In this chapter we will perform the main steps in the system design phase
of the FFT processor.
Lars Wanhammar
DSP Integrated Circuits
MANUFACTURING AND
TESTING
INTERGRATED CIRCUITS
DESIGN
SYSTEM DESIGN
SYSTEM SPECIFIKATION
3
DSP Integrated Circuits
Department of Electrical Engineering
Linköping University
[email protected]
http://www.es.isy.liu.se/
Lars Wanhammar
DSP Integrated Circuits
FFT/IFFT
Data
FFT/IFFT
Data
Department of Electrical Engineering
Linköping University
Generally, it is useful to
introduce a memory process at this level explicitly. The input process is
responsible for reading
data into the FFT processor from the outside
world.
input, FFT, output, and
memory process.
At the top level we partition the FFT processor
into four communicating
processes:
Finished
Output
FFT
Input
Start
Read/Write
Real/Imag
Address
Data
[email protected]
http://www.es.isy.liu.se/
Read/Write
Real/Imag
Address
Data
Read/Write
Real/Imag
Address
Data
Read/Write
Real/Imag
Address
Data
Memory
First Design Iteration
In a top-down design approach we start by identifying the major tasks
(processes) to be executed.
Lars Wanhammar
Model #3
Model #2
Model #1
Golden Model
Test Bench
We advocate that a sequence of models with increasing degrees of details
and functionality i built in order to
better understand the “problem”;
explore the design space;
obtain a specification
PROBLEM UNDERSTANDING
Stimuli
DSP Algorithm
1
≤ 0.5 ms
DSP SYSTEM DESIGN STRATEGY
Compare
4
2
Department of Electrical Engineering
Linköping University
[email protected]
http://www.es.isy.liu.se/
MHz
complex words/s
Lars Wanhammar
DSP Integrated Circuits
Department of Electrical Engineering
Linköping University
[email protected]
http://www.es.isy.liu.se/
However, we choose to use two logical memories since this will simplify
the implementation of the memories and increase the memory bandwidth,
which is a critical factor.
In principle, it is possible to use only one logical memory.
N
( 2 + 2 ) ---- log 2( N )
2
f memory = ------------------------------------------- = 55 ⋅ 10 6
t FFT
For each butterfly operation we must read and write two complex values.
Hence, the data rate will be
We can also estimate the required data rate for the memory process.
N
W d ⋅ ---- log 2( N )
2
24 ⋅ 5120
f CLPE = --------------------------------------- = -------------------------------------- = 165.2
N PE ⋅ t FFT
2 ⋅ 0.372 ⋅ 10 – 3
Thus, we need only two butterfly PEs to meet the required throughput.
The minimum required clock frequency for the PEs is
Lars Wanhammar
DSP Integrated Circuits
The partitioning should be done in steps where each step is sufficiently
small as to make the transformation obvious, thereby avoiding design
errors.
Partitioning should be performed such that the processes at the lowest
level in the hierarchy can be easily mapped onto hardware resources.
These processes are then partitioned into a hierarchy of simpler processes.
Output process: Writes data from the memory to the outside world, performs unscrambling of the data array in the last stage of the FFT. Interchanges the real and imaginary parts if needed.
FFT process: The FFT process only computes the FFT since the IFFT is
computed by interchanging the real and imaginary parts of the data in the
input and output processes.
Memory process: The input sequence and intermediate results are stored
by the memory process.
Input process: Interchanges the real and imaginary parts if needed.
7
5
Department of Electrical Engineering
Linköping University
[email protected]
http://www.es.isy.liu.se/
≈ 1.5
MHz
Lars Wanhammar
DSP Integrated Circuits
Department of Electrical Engineering
Linköping University
[email protected]
http://www.es.isy.liu.se/
1
1
1
------------------------------- = ---------------------------------------------------------------- = ------------------------------ = 2232
t I /O + t FTT
–
3
–
3
0.128 ⋅ 10 + 0.32 ⋅ 10
0.488 ⋅ 10 – 3
The throughput becomes
24 ⋅ 5120
f CLPE = ----------------------------------- = 192
2 ⋅ 0.32 ⋅ 10 – 3
and the minimum required clock frequency for the PEs is
N
( 2 + 2 ) ---- log 2( N )
2
t FFT = ------------------------------------------- = 0.32 ms
2 ⋅ 32 ⋅ 10 6
FFTs/s
We therefore decide to use a memory clock frequency of 32 MHz, i.e.
twice the I/O data rate. The time required for the FFT then becomes
This will make it possible to use the same clock frequency for the memories throughout the input, FFT, and output processes.
Also, it is desirable that the memory clock frequency is a multiple of the
I/O frequency.
Lars Wanhammar
DSP Integrated Circuits
N
W d ⋅ ---- log 2( N )
24 ⋅ 5120
2
N PEb = -------------------------------------------- = --------------------------------------------------------t FFT f CLPEmax
0.372 ⋅ 10 – 3 ⋅ 220 ⋅ 10 6
A bit-serial butterfly PE can be implemented using Wd = 24 clock cycles
and we assume that the usable clock frequency, fCLPEmax, is at least 220
MHz. The time available for the FFT is tFFT = 0.372 ms. The number of
butterfly PEs is
N
---- log ( N ) = 5120
2
2
It is possible to overlap their execution and speed up the FFT processor
slightly, but the complexity of the design would increase significantly.
The number of butterfly operations for each FFT is
The partitioning assumes that the input, FFT, and output processes are
executed sequentially.
Resource Allocation
Next we determine the resources required.
8
6
Department of Electrical Engineering
Linköping University
for Stage = 1 to M
for ???
for ??
Butterfly_1
Butterfly_2
next ??
next ??
next Stage
[email protected]
http://www.es.isy.liu.se/
Stage
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
Data index
Department of Electrical Engineering
Linköping University
DSP Integrated Circuits
doit
Input Data
DSP Integrated Circuits
Department of Electrical Engineering
Linköping University
[email protected]
http://www.es.isy.liu.se/
Lars Wanhammar
Memory Write
doit
Output Data
Department of Electrical Engineering
Linköping University
Finished
next
for i in 0 to N–1 loop
next
3
W4
W0
W4
W0
W4
W0
W4
W0
[email protected]
http://www.es.isy.liu.se/
doit
[email protected]
http://www.es.isy.liu.se/
Memory
Butterfly
Memory
Butterfly
Adresses
W_Process
next
doit Ns := Ns/2;
Ns := N;
for m in 0 to (N/4–1) loop
for Stage in 1 to M loop
next
for i in N–1 loop
Start
Memory Read
Lars Wanhammar
2
W6
W4
W2
W0
W6
W4
W2
W0
4
W0
W0
W0
W0
W0
W0
W0
W0
Unscramble
The behavior of this process depends on the chosen index generation scheme. We
also specify the behavior of the input and output processes in more detail by partitioning them into the processes Memory, Memory_Read, and Memory_Write.
First alternative: execute butterflies p and p+N/4 concurrently.
Second alternative: execute p and p+Ns/2 concurrently in the first stage
while p and p+Ns are executed concurrently in the following stages.
1
W7
W6
W5
W4
W3
W2
W1
W0
In this step we collect index computations into a process, Addresses.
Third Design Iteration
Lars Wanhammar
DSP Integrated Circuits
Thus, the twiddle factor for
one of the butterfly PEs can
be obtained from the twiddle
factor of the corresponding
butterfly PE by simply
changing the sign and interchanging the real and imaginary parts.
W p + N /4 = W N /4 W p = – jW p
The relation between the
coefficients for the butterflies in rows p and p + N/4 in
the first stage of the FFT is
Digit_Reverse
11
9
Having two concurrent butterflies means that we must generate two indices k in parallel: k1 and k2.
First Alternative
We will investigate two different alternative choices of order for the computations of the butterflies in the inner loop.
for Stage = 1 to M
for q = 1 to (N div(2*Ns)
for n = 1 to Ns
Butterfly
next n
next q
next Stage
Fast Fourier Transform – Sande-Tukey
Lars Wanhammar
DSP Integrated Circuits
The coefficients (i.e., the twiddle factors) in the two parts are the same.
There are M = log2(N) stages with N/2 butterfly operations each.
Let us therefore first consider the signal-flow graph for a 16-point Sande–
Tukey’s FFT where the butterfly operation has been chosen as the basic
operation.
We must also compute two coefficients Wp concurrently. There are many
possible ways to do this.
We must modify the original program so that two butterfly operations can
be executed concurrently.
The following design iteration therefore aims to modify the original
sequential algorithm into a parallel description that can be mapped efficiently onto the hardware resources.
Second Design Iteration
Note that both the original FFT program and the description using communicating processes are sequential descriptions.
12
10