FPGA'12

Algorithm and Architecture Optimizations for Large Size
Two Dimensional Discrete Fourier Transform
Berkin Akin, Peter Milder, Franz Franchetti and James Hoe
Carnegie Mellon University, Pittsburgh PA USA
{bakin, pam, franzf, jhoe}@ece.cmu.edu
Overview
Large size 2D Fast Fourier Transform
• Used in image processing and scientific computing: solving PDEs, cosmology, CT, SAR
• Typical datasets are large and high precision!
  - e.g. 2K-by-2K double precision 2D-FFT: input dataset is 64 MB, # of operations is ~461.4 Mflop (see the worked numbers below)
• Does not fit on-chip; stored in off-chip DRAM
• Effective bandwidth orchestration is required for:
  1. Performance
  2. Bandwidth Efficiency
  3. Power Efficiency

Memory access pattern and achieved bandwidth
• 2D-FFT has large strided DRAM access patterns
• Does not exploit DRAM row-buffer locality
• Large strides result in small packets of transferred data -> low memory bandwidth utilization!
• Memory bandwidth becomes the bottleneck for achieving high performance

1024-by-1024 double precision 2D-FFT on Intel Core i7 960 (CPU), Nvidia GTX 480 (GPU), and Altera DE4 (FPGA)
[Bar charts: 1 raw performance (Gflop/s); 2 performance to bandwidth ratio (Gflop/GB); 3 performance to power consumption ratio (Gflop/J). Ratios are normalized to the CPU.]
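As a sanity check on the 2K-by-2K numbers above: each double-precision complex element takes 16 bytes, and the operation count follows from the common 5·N·log2(N) flop estimate for a complex FFT of N points (the use of this estimate is our assumption; the poster does not state how the count was derived):

$$ 2048^2 \times 16\,\mathrm{B} = 64\,\mathrm{MB}, \qquad 5 \cdot 2048^2 \cdot \log_2\!\bigl(2048^2\bigr) = 5 \cdot 4194304 \cdot 22 \approx 461.4\,\mathrm{Mflop}. $$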
Background
2D-FFT algorithms
• DFT is a matrix-vector multiplication; an FFT algorithm is a factorization of the DFT matrix into structured sparse factors: tensor products, twiddle factors, and permutations
• 2D-FFT operates on 2D data, e.g. images
• Row-column algorithm: DFT_{n×n} = (DFT_n ⊗ I_n)(I_n ⊗ DFT_n), i.e. Stage 1 applies 1D FFTs along one dimension and Stage 2 along the other
  - Row-wise and column-wise accesses! (see the C sketch below)

DRAM operation
• Row-wise access exploits row-buffer locality: assuming the target rank and bank are active, a row-buffer HIT needs only a column decode before data out
• Column-wise access results in row-buffer MISSes: each access needs a precharge, row decode, and row activate before the column decode and data out -> extra latency!
• Large strides result in small packets of transferred data
• Need to make use of every DRAM row touched to maximize bandwidth
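To make the access-pattern problem concrete, here is a minimal C sketch of the textbook row-column 2D-DFT on a row-major array (illustrative only, not the poster's implementation; it uses a naive O(n^2) 1D DFT so it stays self-contained where a real design would use an FFT):

```c
/* Textbook row-column 2D-DFT sketch. Naive O(n^2) 1D DFT keeps the example
 * self-contained; a real design would use an FFT. */
#include <complex.h>
#include <stdlib.h>

#define N 256                 /* illustrative size; the poster targets up to 2048 */
#define TWO_PI 6.28318530717958647692

/* Naive 1D DFT of n points read from 'in' at the given element stride. */
static void dft1d_strided(const double complex *in, double complex *out,
                          int n, int stride)
{
    for (int k = 0; k < n; k++) {
        double complex acc = 0;
        for (int j = 0; j < n; j++)
            acc += in[j * stride] * cexp(-I * TWO_PI * j * k / n);
        out[k] = acc;
    }
}

/* Row-column algorithm on a row-major N x N array:
 *   Stage 1: 1D DFT of every row    -> unit-stride, row-buffer friendly
 *   Stage 2: 1D DFT of every column -> stride-N, a new DRAM row per element */
static void dft2d_row_column(double complex *x)
{
    double complex tmp[N];
    for (int r = 0; r < N; r++) {                 /* Stage 1: rows */
        dft1d_strided(&x[r * N], tmp, N, 1);
        for (int k = 0; k < N; k++) x[r * N + k] = tmp[k];
    }
    for (int c = 0; c < N; c++) {                 /* Stage 2: columns */
        dft1d_strided(&x[c], tmp, N, N);          /* large-stride reads */
        for (int k = 0; k < N; k++) x[k * N + c] = tmp[k];
    }
}

int main(void)
{
    double complex *x = calloc((size_t)N * N, sizeof *x);
    x[0] = 1.0;                       /* impulse input: 2D DFT should be all ones */
    dft2d_row_column(x);
    free(x);
    return 0;
}
```

Stage 1 streams through memory at unit stride, while Stage 2 jumps by N elements per access, which is exactly the row-buffer-miss pattern described above.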
Solution: Algorithm and Architecture
Restructured algorithm
• Linear data mapping in DRAM causes row- and column-wise accesses
• Instead, use a 2D tiled data mapping where each tile is mapped to one DRAM row
• Restructure the algorithm given the 2D tiled data mapping (see the address-mapping sketch after this list)
• Data is then accessed as tiles, not row- and column-wise
  - Row-buffer misses are minimized!
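The tiled mapping can be pictured with a small address-mapping sketch in C (the tile size, array size, and function names here are illustrative assumptions, not the poster's actual address generators):

```c
/* Illustrative 2D tiled data mapping. Linear (row-major) mapping puts element
 * (r, c) of an N x N array at address r*N + c, so a column walk touches a new
 * DRAM row at every step. The tiled mapping lays out each T x T tile
 * contiguously so that one tile occupies one DRAM row. */
#include <stdint.h>

#define N 2048   /* problem size (elements per dimension)                          */
#define T 32     /* tile edge; chosen so T*T elements match a DRAM row (assumed)   */

/* Conventional row-major address of element (r, c). */
static inline uint64_t addr_linear(uint64_t r, uint64_t c)
{
    return r * N + c;
}

/* Tiled address: the tile index selects the DRAM row, the in-tile offset
 * selects the position within that row. */
static inline uint64_t addr_tiled(uint64_t r, uint64_t c)
{
    uint64_t tile   = (r / T) * (N / T) + (c / T);  /* which tile -> DRAM row      */
    uint64_t offset = (r % T) * T + (c % T);        /* where inside the tile       */
    return tile * (T * T) + offset;
}
```

With addr_tiled, a T-element stretch of either a row or a column stays inside a single tile, i.e. a single DRAM row, instead of touching a new DRAM row per element.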
From algorithm to hardware
• Hardware-aware manipulation of the algorithm yields a streaming datapath:
  read tiles from DRAM (DRAM read address generation) -> FIFO -> on-chip permutation (local memory + address generation) -> streaming FFT core -> on-chip permutation (local memory + address generation) -> write tiles into DRAM (DRAM write address generation); each tile maps to one DRAM row
• Matching throughput to memory bandwidth (see the balance sketch below):
  - Achieved via fine-grain control over datapath parallelism (streaming width)
  - Results in a balanced design
• Ensuring continuous dataflow:
  - Buffers are used to smooth the flow of data
• Double-buffering:
  - Overlapped computation and I/O
  - All modules kept busy
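A back-of-the-envelope balance check shows what matching throughput to memory bandwidth means in practice; the sustained-bandwidth figure below is an assumption for illustration, not a measured value from the poster:

```c
/* Balance check (illustration; the bandwidth number is an assumption).
 * A streaming FFT core accepting w complex double-precision words per cycle
 * at frequency f consumes w * 16 B * f bytes/s; the design is balanced when
 * that rate matches the sustained DRAM bandwidth. */
#include <stdio.h>

int main(void)
{
    const double freq_hz    = 200e6;  /* FPGA clock from the platform table        */
    const double word_bytes = 16.0;   /* one double-precision complex word         */
    const double dram_gbs   = 6.0;    /* assumed sustained DRAM bandwidth (GB/s)   */

    /* Streaming width (words/cycle) at which compute throughput equals bandwidth. */
    double width = dram_gbs * 1e9 / (word_bytes * freq_hz);
    printf("balanced streaming width: %.2f words/cycle\n", width);
    return 0;
}
```

The streaming width is then chosen near this value so that neither the datapath nor the DRAM interface sits idle.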
Evaluation
Target application:
- Double precision complex 2D-FFT
- Data sizes up to 2,048-by-2,048
[Result plots: raw performance, bandwidth efficiency, and power* efficiency across the data sizes and platforms below.]
* Measured power consumption including DRAMs
Target platforms:
                            Core i7 960   GTX 480    Stratix IV (DE4) EP4SGX530
DRAM Type                   DDR3          GDDR5      DDR2
# of Memory Channels        3             6          2
Memory BW (GB/s)            25.6          177.4      12
On-chip Memory (MB)         8             1.69       2.53
Proc. Freq. (MHz)           3,200         1,401      200
# of Cores                  4             480        N/A
Technology Node (nm)        45            40         40
Application Infrastructure  Spiral        CUDA 4.0   Spiral/Verilog
The authors acknowledge the support of the C2S2 Focus Center, one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation entity.