Data-path Synthesis of
VLIW Video Signal
Processor
Zhao Wu and Wayne Wolf
Dept. of Electrical Engineering,
Princeton University
Outline
• Introduction
• Architectural paradigm
• Trace-driven simulation
• Performance estimation
• Conclusions
Introduction
• Why programmable VSP?
– intense computation
– complex and diverse video applications
– increased development cost
– time-to-market pressure
• Why VLIW?
– easy to implement in hardware
– high speed
– high degree of ILP available in video applications
Architecture Paradigm
[Block diagram: two rows of four clusters communicate through a crossbar and I/O ports; each row shares an instruction cache, and each cluster contains a local memory, a register file, functional units, and a memory unit.]
Architectural Parameters
• Register file
– number of registers
• Functional unit
– number and type of functional units
• Interconnect
– number of clusters
– interconnect mechanism
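These parameters define one point in the design space. As a minimal sketch, a design point could be encoded as an immutable record whose field order mirrors the (regs, mem, ALU, shift, mult) tuples used later in the slides; the class name and fields are hypothetical, not the paper's code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DesignPoint:
    """One data-path configuration, e.g. (1024, 16, 16, 8, 8)."""
    regs: int   # registers in the register file
    mem: int    # memory units
    alu: int    # ALUs
    shift: int  # shifters
    mult: int   # multipliers

baseline = DesignPoint(1024, 16, 16, 8, 8)
```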
Impact on MPEG-2 Encoder
[Bar chart: instructions issued per cycle (0-35) for the configurations below, broken down into mult, shifter, ALU, and mem operations, comparing unpipelined and pipelined multipliers.]
Reg #     256  256  512  512   1K   1K   1K   1K   1K   1K   1K   1K   1K   2K
Mem #       8   24    8   24    8    8   16   16   16   16   24   24   24   32
ALU #      16   32   16   32   16   32   24   32   32   32   16   24   32   32
Shift #     8   16    8   16    8    8    8    8    8   16   16   16   16   32
Mult #      8   24    8   24   24   24   16    8   24   24   16   16   24   32
Area      148  210  151  213  205  208  185  163  211  215  191  193  218  265
(mm^2)
Trace-Driven Scheduling
[Tool flow: the binary program prog is instrumented with pixie into prog.pixie, whose execution produces a dynamic trace; prog is also disassembled into prog.asm. The scheduler combines the dynamic trace, the disassembled program, and a resource description to produce results & statistics.]
Block Diagram of the Scheduler
[Block diagram: an assembly code parser reads the disassembled program and feeds a dependency analyzer, which also consumes the program trace. A resource manager built from the resource description comprises a register manager with a register scoreboard, a memory manager with a memory scoreboard, and a functional-unit manager with reservation stations. The VLIW scheduler draws on all of these to emit a scheduling record and the results & statistics.]
Features of the Scheduler
• (Relatively) fast
– Instrumentation rather than interpretation
– linear to trace length
• Moderate memory requirement
– Pipelining saves storage
• Large scheduling window
– up to 10^9 instructions
– simulates both a VLIW compiler & a VLIW processor
• Realistic model
– limited resources
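The scheduler's core step, placing each trace operation in the earliest cycle where its operands are ready and a functional unit is free, can be sketched as follows. This is a hypothetical simplification, not the paper's implementation: single-cycle latencies, FU counts as the only resource limit, and no register or memory scoreboards.

```python
def schedule_trace(trace, fu_counts):
    """Greedy VLIW list scheduling over a dynamic trace.
    trace: list of (op_id, fu_type, deps) in trace order.
    fu_counts: dict mapping fu_type -> number of units (each >= 1).
    Returns the total cycle count of the schedule."""
    issue = {}   # op_id -> cycle in which the op issues
    used = []    # used[c][fu] = slots of fu already taken in cycle c
    for op_id, fu, deps in trace:
        # earliest cycle after all producers (single-cycle latency)
        c = max((issue[d] + 1 for d in deps), default=0)
        while True:
            while c >= len(used):
                used.append({})
            if used[c].get(fu, 0) < fu_counts.get(fu, 0):
                used[c][fu] = used[c].get(fu, 0) + 1
                issue[op_id] = c
                break
            c += 1   # this long-instruction word is full for fu
    return len(used)
```

Because each op is placed as early as dependences and slots allow, runtime is roughly linear in the trace length, matching the scheduler's stated behavior.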
Performance Estimation
• Why do we need performance estimation?
– trace-driven simulation too slow (trace too long)
– design space too big
• How do we estimate?
– start from full-length trace simulation results
– increase resource: lower bound on cycle count
– decrease resource: upper bound on cycle count
[Diagram: the target design is bracketed by simulating a bigger design (lower bound on cycle count) and a smaller design (upper bound).]
IPC Histogram of ALU
[Two histograms of time (%) vs. ALU instructions issued per cycle:
on (1024, 16, 16, 8, 8), issue widths span 0-16 with average IPC_ALU = 11.47;
on (1024, 16, 32, 8, 8), issue widths span 0-32 with average IPC_ALU = 13.24.]
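Assuming the histogram is stored as a mapping from issue width to the count (or fraction) of cycles, the average IPC figures quoted above follow directly; this helper is an illustration, not part of the tool:

```python
def average_ipc(hist):
    """hist[k] = number (or fraction) of cycles issuing exactly k ops.
    Returns the mean number of ops issued per cycle."""
    total = sum(hist.values())
    return sum(k * cycles for k, cycles in hist.items()) / total
```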
Increase and Decrease Resources
[Diagram: cycle i issues instructions 1-24. With an increased resource all 24 issue in a single cycle; with a decreased resource cycle i is split into cycle i' (instructions 1-16) and cycle i'+1 (instructions 17-24), and the other instructions are retimed around the split.]
Decrease resource
[Histogram: instruction count vs. instructions/cycle (0-16); cycles issuing more ops than the reduced limit must be split.]
• Split cycles that issue more FU ops than the new limit, then retime
– 16→8+8, 15→8+7, 14→8+6, 13→8+5, 12→8+4, …
• Why this is an upper bound on cycle count
– the leftover 7, 6, 5, 4, … ops could still be combined with cycles issuing 1, 2, 3, 4, …
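Under the splitting rule above, the upper bound can be computed directly from the per-FU issue histogram. A sketch, where `hist` and `m` are assumed data layouts rather than the paper's code:

```python
import math

def upper_bound_cycles(hist, m):
    """Upper bound on cycle count after shrinking a FU type to m units.
    hist[k] = number of cycles that issued exactly k ops on that FU.
    Each cycle issuing k > m ops is split into ceil(k/m) cycles; the
    leftover sub-cycles are never merged, hence an upper bound."""
    return sum(cycles * math.ceil(max(k, 1) / m)
               for k, cycles in hist.items())
```

For example, shrinking to 8 units turns a 16-op cycle into 2 cycles while cycles of 8 or fewer ops (including idle ones) still cost 1 cycle each.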
Increase resource
[Histogram: instruction count vs. instructions/cycle (0-8); the bar at the old 8-op limit is removed once the resource is enlarged.]
• T_new = T_old − T_8
– 8+8→16, 8+7→15, 8+6→14, 8+5→13, 8+4→12, …
• Why this is a lower bound on cycle count
– sometimes cycles can't merge (e.g. when increasing from 8 to 12 units)
– sometimes there is no extra parallelism
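The removal rule T_new = T_old − T_8 generalizes to any old limit n. A sketch under the optimistic assumption that every saturated cycle merges into a neighbour (function name and histogram layout are illustrative):

```python
def lower_bound_cycles(hist, n):
    """Lower bound on cycle count after enlarging a FU beyond its old
    size n.  hist[k] = number of cycles issuing exactly k ops; cycles
    saturated at exactly n ops are assumed to merge away entirely."""
    total = sum(hist.values())
    return total - hist.get(n, 0)
```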
Change More Than One Resource
• Have to take into account resource inter-correlation
– dep_{res1,res2,n}: # of cycles in which at least one res1-instruction depends on n res2-instructions
• Combine the per-FU bounds into one semi-bound:

  T_m = (1 / |FU|) · Σ_{fu ∈ FU} T_{fu,m}

• Increase resource (m > n):

  T_{fu,m} = T_n · (1 − hist_{fu,n}) + Σ_{i ≠ fu} Σ_{j=n}^{m} dep_{fu,i,j}

• Decrease resource (m < n):

  T_{fu,m} = retime(hist_fu, T_n) + Σ_{i ≠ fu} Σ_{j=m}^{n} dep_{fu,i,j}
Results
Data-path architecture                    Cost        Performance
# of   # of  # of  # of   # of     Area     Simulated  Estimated
regs   mem   ALU   shfts  muls     (mm^2)   speedup    speedup
 256     8    16     8     8       149.6      1.00       1.00
 256    16    16     8     8       151.2      1.03       1.08
 512     8    16     8     8       152.6      1.37       1.37
 512    16    16     8     8       154.2      1.48       1.50
 512    16    24     8     8       157.4      1.54       1.52
1024     8    16     8     8       158.2      1.54       1.54
1024    16    16     8     8       159.8      1.82       1.92
1024    16    24     8     8       163.0      2.07       2.12
1024    16    32     8     8       166.2      2.12       2.19
1024    16    32    16     8       170.2      2.14       2.26
2048    16    24     8     8       175.0      2.30       2.35
2048    16    32     8     8       178.2      2.39       2.45
Conclusions
• Trace-driven simulation
– quantitative evaluation of an architecture
– too slow to be applied for every possible design
• Performance estimation
– based on simulated results
– automated procedure
– accurate enough