Part 2: Synchronous Elastic Systems
Jordi Cortadella and Mike Kishinevsky
Universitat Politecnica de Catalunya
Barcelona, Spain
Intel
Strategic CAD Labs
Hillsboro, USA
28th Int. Conf. on Application and Theory of Petri Nets
and Other Models of Concurrency
Siedlce, Poland, June 25, 2007
Synchronous elastic systems are also called
– Latency-tolerant systems or
– Latency-insensitive systems
We use the term “synchronous elastic” since it is
better linked to asynchronous elastic design
Agenda of Part 2
I. Basics of elastic systems
II. Early evaluation and performance analysis
III. Optimization of elastic systems and their correctness
I. What and Why
Intuition
How to design elastic systems
Converting a synchronous system to elastic
Micro-architectural opportunities
Marked Graph models
Performance evaluation
Synchronous Stream of Data
[figure: a stream of data tokens (… 7 4 1 …), one token per clock cycle]
Synchronous Elastic Stream
[figure: the same stream of tokens with bubbles (no data) interleaved; each clock cycle carries either a token or a bubble]
Synchronous Circuit
[figure: an adder with latency = 0 over two synchronous streams 7 4 1 and 1 0 2, producing 8 4 3]
Synchronous Elastic Circuit
[figure: the same adder over elastic streams; bubbles may separate the tokens and the latency can vary, but the output token sequence 8 4 3 is unchanged]
Ordinary Synchronous System
[figure: blocks A, B, C, D connected through registers]
Changing latencies changes behavior
Synchronous Elastic
(characteristic property)
[figure: the same blocks A, B, C, D connected through elastic buffers]
Changing latencies does NOT change behavior
= time elasticity
Why
Scalable
Modular (Plug & Play)
Better energy-delay trade-offs
(design for the typical case instead of the worst case)
New micro-architectural opportunities in digital design
Not asynchronous: reuse existing design experience, CAD tools and flows
Example of elastic behavior
[animation, several slides: an ALU consumes tokens 1…6 from two input channels through a join and produces results. When one operand is not yet valid (“Not valid”), the join asserts Stop on the other channel (“Stop !”); the lazy ALU stalls, back-pressure propagates upstream, and computation resumes in order once the missing token arrives. No tokens are lost or reordered.]
How to design elastic systems
We show one example implementation:
SELF = Synchronous Elastic Flow
Others are possible
Reminder:
Memory elements. Transparent latches
Active high (H): En = 0 (opaque): Q = prev(Q); En = 1 (transparent): Q = D
Active low (L): En = 1 (opaque): Q = prev(Q); En = 0 (transparent): Q = D
Reminder:
Memory elements. Flip-flop
[figure: a flip-flop (FF) is a master-slave pair of an active-low (L) latch followed by an active-high (H) latch, both driven by CLK]
Reminder: Clock cycle = two phases
[figure: under the 0-delay abstraction, a flip-flop between x and z is equivalent to an L latch followed by an H latch with intermediate signal y]
Flip-flop: z(i) = x(i − 1)
L + H latch pair: z(i) = y(i − 0.5) = x(i − 1)
Elastic channel protocol
[figure: Sender and Receiver connected by Data, Valid (forward) and Stop (backward) wires]
Each clock cycle the channel is in one of three states:
– Idle: not Valid
– Retry: Valid * Stop
– Transfer: Valid * not Stop
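The three channel states can be decoded directly from the Valid/Stop pair. A minimal sketch (function and variable names are ours, not from the slides):

```python
# Decode the per-cycle state of an elastic channel from Valid/Stop.
def channel_state(valid: bool, stop: bool) -> str:
    """Classify one clock cycle of an elastic channel."""
    if not valid:
        return "Idle"        # sender has nothing to offer
    return "Retry" if stop else "Transfer"

# Replaying a short (Valid, Stop) trace: data moves only on Transfers.
trace = [(1, 0), (1, 1), (1, 0), (0, 0), (1, 0)]
states = [channel_state(v == 1, s == 1) for v, s in trace]
transfers = states.count("Transfer")
```

Note that on a Retry cycle the sender must keep the same data word on the channel until the transfer completes.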
Elastic channel protocol
Example trace (one column per cycle; * = no data):
Data:  * D D * C C C B * A
Valid: 0 1 1 0 1 1 1 1 0 1
Stop:  0 0 1 0 0 1 1 0 0 0
Cycles with Valid = 1, Stop = 0 are Transfers; Valid = 1, Stop = 1 are Retries; Valid = 0 are Idle.
Elastic buffer keeps data while Stop is in flight
[figure: write/read waveforms W1R1, W2R1, W1R2, W2R2]
Cannot be done with single-edge flops without double pumping
Can use latches inside: master-slave
Communication channel
[figure: sender and receiver connected by a Data wire]
Long wires: slow transmission
Pipelined communication
[figure: registers inserted on the Data wire]
What if the sender does not always send valid data?
The Valid bit
[figure: a Valid bit pipelined alongside Data]
What if the receiver is not always ready ?
The Stop bit
[animation, several slides: a Stop wire runs backwards from receiver to sender. The receiver asserts Stop = 1; the stall propagates backwards stage by stage (back-pressure) until the sender stops; when Stop returns to 0, transfers resume]
Back-pressure
A naive implementation turns Stop into a long combinational path from receiver to sender
Cyclic structures
[figure: Data/Valid/Stop wires forming a combinational cycle]
One can build circuits with combinational cycles (constructive cycles, Berry),
but synthesis and timing tools do not like them
Example: pipelined linear communication chain
with transparent latches
[figure: sender → H, L, H, L latches → receiver; each latch stage adds ½ cycle]
Master and slave latches with independent control
Shorthand notation
(clock lines not shown)
[figure: a latch symbol stands for a latch with data input D, output Q and gated enable En derived from clk]
SELF (linear communication)
[figure: the datapath latches are paired with a control chain between sender and receiver: each stage keeps a Valid bit (V), exchanges a Stop bit (S) with its neighbor, and generates the En signals of the datapath latches. Initially all V bits are 1: the pipeline is full of tokens]
SELF
[animation, many slides: a bubble (Valid = 0) injected by the sender propagates forward through the control chain while tokens keep flowing; later the receiver asserts Stop = 1, back-pressure propagates backwards one stage per cycle and tokens accumulate in the latches without being lost; when Stop returns to 0, the pipeline drains and resumes full-throughput operation]
Basic VS block
[figure: the VS block takes Vi-1 and Si (plus its own state) and produces Vi, Si-1 and the latch enable Eni]
VS block + data-path latch = elastic HALF-buffer
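The handshake behavior of a chain of such buffer stages can be sketched behaviorally. The model below is an idealization of a capacity-1 elastic stage (the Valid/Stop semantics, not the SELF gate netlist); all names are ours:

```python
# Behavioral sketch of a linear elastic pipeline of capacity-1 stages.
class Stage:
    def __init__(self):
        self.valid = False    # V bit: stage currently holds a token
        self.data = None

def step(stages, in_valid, in_data, sink_stop):
    """One clock cycle. Returns (input_accepted, token_emitted)."""
    n = len(stages)
    # can_accept[i]: stage i can latch a new token this cycle, i.e. it
    # is empty or its own token moves forward (no back-pressure).
    can_accept = [False] * (n + 1)
    can_accept[n] = not sink_stop                  # the receiver
    for i in range(n - 1, -1, -1):
        can_accept[i] = (not stages[i].valid) or can_accept[i + 1]
    out = None
    if n and stages[-1].valid and can_accept[n]:   # emit to receiver
        out, stages[-1].valid = stages[-1].data, False
    for i in range(n - 2, -1, -1):                 # shift tokens forward
        if stages[i].valid and can_accept[i + 1]:
            stages[i + 1].valid = True
            stages[i + 1].data = stages[i].data
            stages[i].valid = False
    accepted = in_valid and can_accept[0]
    if accepted:
        stages[0].valid, stages[0].data = True, in_data
    return accepted, out

# Usage: 5 tokens through 3 stages; the receiver stalls on cycles 3-5.
stages = [Stage() for _ in range(3)]
to_send, received = list(range(5)), []
for cyc in range(20):
    acc, out = step(stages, bool(to_send),
                    to_send[0] if to_send else None, cyc in (3, 4, 5))
    if acc:
        to_send.pop(0)
    if out is not None:
        received.append(out)
```

During the stall cycles nothing moves and nothing is overwritten: the tokens wait inside the stages, which is exactly the "elastic buffer keeps data while Stop is in flight" property.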
Join
[figure: gate-level join combining channels (V1, S1) and (V2, S2) into one output channel (V, S), built with VS blocks]
(Lazy) Fork
[figure: gate-level fork of channel (V, S) into branches (V1, S1) and (V2, S2)]
Eager Fork
[figure: each branch has a flop remembering whether it has already taken the token; a branch keeps its Vi asserted until its own transfer completes, and S is released only when all branches have taken the token]
Eager fork (another implementation)
[figure: eager fork assembled from VS blocks]
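One common combinational formulation of the lazy join and lazy fork control (a sketch under our reading of the figures, not the exact SELF netlist; the eager fork additionally needs the per-branch state shown above):

```python
# Lazy join/fork control equations for two channels.
def lazy_join(v1, v2, s):
    """Join channels 1 and 2 into one output. Returns (v, s1, s2)."""
    v = v1 and v2        # output token exists only when both arrived
    s1 = s or not v2     # stop ch1 if the output stalls or ch2 is late
    s2 = s or not v1
    return v, s1, s2

def lazy_fork(v, s1, s2):
    """Fork one channel into branches 1 and 2. Returns (v1, v2, s)."""
    v1 = v and not s2    # a branch transfers only if the other accepts
    v2 = v and not s1
    s = s1 or s2         # stall the producer if any branch stalls
    return v1, v2, s
```

The fork is "lazy" because it is all-or-nothing: if one branch stalls, no branch transfers that cycle; composing such forks and joins is what creates the combinational Valid/Stop cycles mentioned earlier.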
Variable Latency Units (to be changed)
[figure: a unit taking 0 - k cycles, with go/done/clear handshake wrapped by V/S elastic control]
Elasticization
Synchronous → Elastic
[figures, several slides: a synchronous pipeline (PC, IF/ID, ID/EX, EX/MEM, MEM/WB, driven by CLK) is elasticized by replacing each pipeline register with an elastic buffer and adding an elastic control layer of FORK and JOIN controllers carrying the V and S bits; the control layer generates the gated clocks for the datapath]
Micro-architectural opportunities
Circuit vs. μarchitectural cycles
Variable-latency cache hits
[figure: L1-cache, 1-cycle hit; L2-cache, 2-way associative, 32KB, 2-cycle hit; 12-cycle miss]
suggested by Joel Emer for an ASIM experiment
Variable-latency cache hits
[figure: L1-cache, 1 cycle; L2-cache, pseudo-associative, 32KB, {1-2}-cycle hit; 12-cycle miss]
Sequential access: if hit in the first access L = 1, if not L = 2
Trade-off: faster, or larger, or less power cache
Variable-latency cache hits
[figure: same with a 64KB pseudo-associative L2-cache, {2-3}-cycle hit]
Sequential access: if hit in the first access L = 1, if not L = 2
Trade-off: faster, or larger, or less power cache
Variable-latency RF with less ports
[figure: ALU with a by-pass network (a cache of recent computations) and a Regfile with 8 read ports, 1-cycle read]
LD  R2, A(R5)
ADD R1, R2, R3   -- 1 by-pass, 1 read port
MUL R4, R1, R6   -- 1 by-pass, 1 read port
CMP R4, #100     -- 1 by-pass
4-way superscalar or 4 threads assumed, 1 way shown
Variable-latency RF with less ports
[figure: the same datapath with only 4 read ports and a 1-2 cycle read]
Variable-latency ALUs
In most ADD/SUB operations, only a few LSBs are used. The rest are merely for sign extension.
Use the idea of telescopic units:
– 1-cycle addition for 16 bits and sign extension
– 2 or more cycles for 64-bit additions (rare case)
– maybe there is no need for CLA adders …
– or do all additions with 16-bit adders only
Statistics of operand sizes
[histogram: number of additions vs. bits of adder used; benchmark “Patricia” from MediaBench]
12 bits of an adder do 95% of additions
Variable Latency Adder
[figure: an LSB adder produces the result in one cycle; a “long” detection signal in the control triggers the MSB adder and a second cycle]
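The telescopic-adder idea can be sketched functionally: finish in one cycle when the operands and the sum fit in the narrow datapath with sign extension, and take a second cycle otherwise. Bit widths and names here are illustrative, not from the slides:

```python
# Functional sketch of a telescopic (variable-latency) adder.
LSB_BITS = 16

def fits(v, bits=LSB_BITS):
    """True if v is representable as a signed `bits`-bit value."""
    return -(1 << (bits - 1)) <= v < (1 << (bits - 1))

def telescopic_add(a, b):
    """Return (sum, latency_in_cycles)."""
    short = fits(a) and fits(b) and fits(a + b)
    return a + b, (1 if short else 2)
```

Per the operand-size statistics above, the short 1-cycle case dominates, so the average latency stays close to 1.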
Power-delay [preliminary]
[figure omitted]
Pre-compute and tag size information
[figure: a tag bit (0 = short data, 1 = long data) is pre-computed]
… and select the functional unit according to the size of the data
Partitioned register file
[figure: registers split into LSBs, MSBs and a sign / short-long tag; short values occupy only the LSB partition]
or heterogeneous register files, register caches
Reminder:
Petri Nets and Marked Graphs
Petri nets
A Petri net consists of:
– Places {P}
– Transitions {T}
– Arcs
– Tokens
Petri nets. Token game
Enabling Rule:
• A transition is enabled if all its input places are marked
• An enabled transition can fire at any time
Firing Rule:
• One token is removed from every input place
• One token is added to every output place
• Change of marking is atomic
[figure, two slides: transitions t1…t4 before and after a firing]
Timed Petri nets
Assign a delay (d) to every transition
– An enabled transition fires d time units after enabling
[animation with d = 1: snapshots of the marking at t = 0, t = 1, t = 2]
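The timed token game with unit delays can be captured in a few lines. A sketch (ours, not from the slides) using synchronous semantics — every cycle, all transitions with fully marked presets fire at once:

```python
# Minimal simulator for a timed marked graph with unit delays.
def simulate(pre, post, m0, cycles, watch):
    """pre[t]/post[t]: lists of input/output places of transition t.
    Returns the measured throughput (firings/cycle) of `watch`."""
    m = dict(m0)
    fired = 0
    for _ in range(cycles):
        enabled = [t for t in pre if all(m[p] > 0 for p in pre[t])]
        for t in enabled:              # fire all enabled transitions
            for p in pre[t]:
                m[p] -= 1
            for p in post[t]:
                m[p] += 1
        fired += watch in enabled
    return fired / cycles

# Usage: an 8-stage elastic ring, modeled with complementary arcs
# (forward place f_i holds stage i's data token, backward place b_i
# its free slot); 4 tokens and 4 bubbles give throughput 1/2.
n, k = 8, 4
pre  = {t: [f"f{t}", f"b{(t + 1) % n}"] for t in range(n)}
post = {t: [f"f{(t + 1) % n}", f"b{t}"] for t in range(n)}
m0 = {f"f{i}": int(i < k) for i in range(n)}
m0.update({f"b{i}": int(i >= k) for i in range(n)})
th = simulate(pre, post, m0, 1000, watch=0)
```

The measured value agrees with the minimum mean-weight cycle of the graph (both the forward and the backward cycle carry 4 tokens over 8 places here).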
Marked Graph models of elastic systems
Modelling elastic control with Petri nets
[figure: an elastic pipeline as a Petri net; forward places hold data-tokens, backward places hold bubbles]
Modelling elastic control with Petri nets
[figure: hiding the internal transitions of the elastic buffers merges places: 2 data-tokens and a bubble]
Modelling elastic control with Marked Graphs
[figure: the same control as a marked graph]
Forward arcs (Valid or Request)
Backward arcs (Stop or Acknowledgement)
Elastic control with Timed Marked Graphs.
Continuous time = asynchronous
[figure: transition delays d = 250ps and d = 151ps; delays in time units]
Elastic control with Timed Marked Graphs.
Discrete time = synchronous elastic
[figure: d = 1 for both transitions; latencies in clock cycles]
Elastic control with Timed Marked Graphs.
Discrete time. Multi-cycle operation
[figure: d = 2 and d = 1]
Elastic control with Timed Marked Graphs.
Discrete time. Variable latency operation
[figure: d ∈ {1, 2} and d = 1]
e.g. discrete probabilistic distribution:
average latency 0.8 * 1 + 0.2 * 2 = 1.2
Modeling forks and joins
[figure: a fork/join structure with d = 1 transitions]
Elastic Marked Graphs
An Elastic Marked Graph (EMG) is a Timed MG such that for any arc a there exists a complementary arc a’ satisfying: •a = a’• and •a’ = a•
The initial number of tokens on a and a’, M0(a) + M0(a’), equals the capacity of the corresponding elastic buffer
Similar forms of “pipelined” Petri Nets and Marked Graphs have previously been used for modeling pipelining in HW and SW (e.g. Patil 1974; Tsirlin, Rosenblum 1982)
Performance analysis on Marked Graphs
Performance
Th = operations / cycle
[figures: three cycles of the same marked graph, with Th = 3/7, Th = 3/5 and Th = 2/5]
Th = min ( 0.43, 0.6, 0.4 )
Minimum mean-weight cycle (Karp 1978)
Many efficient algorithms (some reviewed in Dasdan, Gupta 1998)
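With unit transition delays, the throughput is the minimum over all cycles of (tokens on the cycle) / (arcs on the cycle), i.e. the minimum mean edge weight when each arc is weighted by its token count. Karp's O(|V|·|E|) algorithm computes this; a sketch (assuming a strongly connected graph with nodes 0..n-1):

```python
# Karp's minimum mean cycle algorithm.
def min_mean_cycle(n, edges):
    """edges: list of (u, v, w). Returns the minimum mean weight
    over all cycles of a strongly connected digraph."""
    INF = float("inf")
    # d[k][v] = minimum weight of a k-edge walk from node 0 to v
    d = [[INF] * n for _ in range(n + 1)]
    d[0][0] = 0.0
    for k in range(1, n + 1):
        for u, v, w in edges:
            if d[k - 1][u] + w < d[k][v]:
                d[k][v] = d[k - 1][u] + w
    best = INF
    for v in range(n):
        if d[n][v] < INF:
            worst = max((d[n][v] - d[k][v]) / (n - k)
                        for k in range(n) if d[k][v] < INF)
            best = min(best, worst)
    return best

# Usage: two cycles sharing node 0, with weights = token counts:
# cycle 0-1-2 carries 2 tokens over 3 arcs (Th = 2/3),
# cycle 0-3 carries 1 token over 2 arcs (Th = 1/2).
edges = [(0, 1, 1), (1, 2, 1), (2, 0, 0), (0, 3, 1), (3, 0, 0)]
th = min_mean_cycle(4, edges)
```

Here `th` is 1/2, the throughput of the critical (slowest) cycle.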
II. Early evaluation
Dual Marked Graphs
Implementing early evaluation
Performance analysis
Early evaluation
Naïve solution: introduce choice places
– issue tokens at a choice node only into one (some) relevant path
– problem: tokens can arrive at merge nodes out-of-order; a later token can overtake an earlier one
Solution: change the enabling rule
– early evaluation
– issue negative tokens to input places without tokens, i.e. keep the same firing rule
– add symmetric sub-channels with negative tokens
– negative tokens kill positive tokens when they meet
Two related problems:
early evaluation and exceptions (how to kill a data-token)
Examples of early evaluation
Goal: improve system performance and power
MULTIPLIER: c = a * b
if a = 0 then c := 0 -- don’t wait for b
MULTIPLEXOR: c = a if s = T, else b
if s = T then c := a -- don’t wait for b
else c := b -- don’t wait for a
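At the data level, both examples reduce to firing as soon as the result is determined. A sketch (names ours; `None` stands for a token that has not arrived yet):

```python
# Early evaluation: produce a result before all operands arrive.
def early_mult(a, b):
    """Multiplier: if a = 0 the result is known without waiting for b."""
    if a == 0:
        return 0                 # don't wait for b
    if a is None or b is None:
        return None              # not enabled yet
    return a * b

def early_mux(s, a, b):
    """Multiplexor: fire as soon as the select and the selected
    operand have arrived; the other operand is not needed."""
    if s is None:
        return None
    return a if s else b         # None until that operand arrives
```

In the circuit, the operand that was skipped must still be cancelled later, which is exactly the job of the anti-tokens introduced below.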
Example: next-PC calculation
[animation: a multiplexor selects between PC+4 and the branch target address; when no branch is taken, the PC+4 token fires the mux early and an anti-token is sent into the branch-target input]
Anti-token !
Related work
Petri nets
– Extensions to model OR causality [Kishinevsky et al. 1994, Yakovlev et al. 1996]
Asynchronous systems
– Reese et al. 2002: early evaluation
– Brej 2003: early evaluation with anti-tokens
– Ampalam & Singh 2006: preemption using anti-tokens
Dual Marked Graphs
Dual Marked Graph
Marking: arcs (places) → Z
Some nodes are labeled as early-enabling
Enabling rules for a node:
– Positive enabling: M(a) > 0 for every input arc
– Early enabling (for early-enabling nodes): M(a) > 0 for some input arcs
– Negative enabling: M(a) < 0 for every output arc
Firing rule: the same as in regular MGs
Dual Marked Graphs
Early enabling is only defined for nodes labeled as early-enabling. It models computations that can start before all the incoming data is available
Early enabling can be associated with an external guard that depends on data variables (e.g., the select signal of a multiplexor)
In a DMG the actual enabling guards are abstracted away
Anti-token generation: when an early-enabled node fires, it generates anti-tokens in the predecessor arcs that had no tokens
Anti-token propagation counterflow: when a negative-enabled node fires, it propagates the anti-tokens from the successor to the predecessor arcs
Dual Marked Graph model
[figure: an early-enabled node fires while some input tokens are missing; the empty input arcs receive marking −1 (anti-tokens)]
Properties of DMGs
Firing invariant: Let node n be simultaneously positive (or
early) and negative enabled in marking M. Let M1 be the
result of firing n from M due to positive (or early) enabling.
Let M2 be the result of firing n from M due to negative
enabling. Then, M1 = M2.
Token preservation. Let c be a cycle of a strongly
connected DMG. For every reachable marking M,
M(c) = M0(c).
Liveness. A strongly connected DMG is live iff for every
cycle c: M(c) > 0.
Repetitive behavior. In a SC DMG: a firing sequence s from
M leads to the same marking iff every node fires in s the
same number of times.
DMGs have properties similar to regular MGs
Passive anti-token
Passive DMG = version of DMG without negative enabling
Negative tokens can only be generated by early enabling, but cannot propagate
Let D be a DMG and Dp be the corresponding passive DMG. If the environment (consumers) never generates negative tokens, then
throughput (D) = throughput (Dp)
– If the capacity of the input places of early-enabling transitions is unlimited, then active anti-tokens do not improve performance
– Active anti-tokens reduce activity in the data-path (good for power reduction)
Implementing early enabling
How to implement anti-tokens ?
[figures, three slides: positive tokens flow forward through a Valid+/Stop+ handshake; negative tokens flow backwards through a symmetric Valid–/Stop– handshake]
Controller for elastic buffer
[figure: L and H latch control with the V/S handshake and En generation for the datapath latches]
Dual controller for elastic buffer
[figure: the same controller duplicated for the (V+, S+) and (V-, S-) sub-channels]
Dual fork/join and early join
[figures: dual fork/join; join with early evaluation]
Example
Evaluation                     Throughput
No early evaluation            0.277
Passive anti-tokens M2 → W     0.280
Passive anti-tokens F3 → W     0.387
Active anti-tokens             0.400
DLX processor model with slow bypass
[figure: Fetch → Decode → Execute → Memory → Write-back with a slow Bypass path; probabilities a and b annotate the choice arcs]
Th = operations / cycle
System performance: Th = 0.5
Applying early evaluation on “Execution” and “Write-back”:
Th = 0.7 (a = 0.3; b = 0.3)
Conclusions
Early evaluation can increase performance
beyond the min cycle ratio
The duality between tokens and anti-tokens
suggests a clean and effective implementation
Performance analysis with early evaluation
(joint work with Jorge Júlvez)
Reminder: Performance analysis of Marked graphs
Th = operations / cycle = number of firings per time unit
The throughput is given by the minimum mean-weight cycle
[figure: cycles A, B, C with Th(A) = 3/7, Th(B) = 3/5, Th(C) = 2/5]
Th = min(Th(A), Th(B), Th(C)) = 2/5 = 0.4
Efficient algorithms: (Karp 1978), (Dasdan, Gupta 1998)
Marked graphs. Performance analysis
The throughput can also be computed by means of linear programming
[figure: transitions t1, t2, t3 with places p1, p2]
th ≤ min( m̄p1, m̄p2 )
Average marking:
m̄p = lim_{τ→∞} (1/τ) ∫ from 0 to τ of mp(σ) dσ
Throughput:
th ≤ min over p of m̄p
Marked graphs. Performance analysis
The throughput can also be computed by means of linear programming
[figure: transitions a, b, c, d with places p1 … p5]
max th subject to:
reachability:
mp1 = 1 + tb – ta
mp2 = 0 + ta – tb
mp3 = 1 + td – ta
mp4 = 0 + ta – tc
mp5 = 1 + tc – td
th constraints:
th ≤ mp2 // transition b
th ≤ mp4 // transition c
th ≤ mp5 // transition d
th ≤ min(mp1, mp3) // transition a
Th = 0.5
[Campos, Chiola, Silva 1991]
GMG = Multi-guarded Dual Marked Graph
Refinement of passive DMGs
Every node has a set of guards
Every guard is a set of input arcs (places)
A simple transition has one guard with all input places
Example:
[figure: transitions t1 … t4 with places p1, p2, p3]
G(t4) = {{t1,t3}, {t2,t3}}
Multi-guarded Dual Marked Graph
Execution of transitions:
– At the initial marking, and each time t fires, one of the guards gi from G(t) is non-deterministically selected
Selection is persistent (cannot change between firings of t)
Accurate non-deterministic abstraction of the early evaluation conditions (e.g. multiplexor select signals)
– t is enabled when, for every place p in the selected gi: M(p) > 0
– an enabled t can fire (regular firing rule)
Single-server semantics: no multiple instances of the same transition can fire simultaneously
– Abstraction for systems that communicate through FIFO channels
Timed GMG
Every transition is assigned a non-negative delay d(t); d(t) = 1 unless specified otherwise
Every guard g of every guarded transition is assigned a strictly positive probability p(g) such that for every t:
Σ_{g ∈ G(t)} p(g) = 1
Guard selection for every transition is non-deterministic, but respects the probabilities in infinite executions
Probabilities are assumed to be independent (generous abstraction)
Firing of transition t takes d(t) time units, from the time it becomes enabled until the firing is completed
Early evaluation
[figure: a GMG with guard probabilities a / 1-a and b / 1-b; its three simple cycles have throughputs 0.43, 0.60 and 0.40]
Throughput as a function of a (columns) and b (rows):
b\a  0.0  0.2  0.4  0.6  0.8  1.0
0.0  0.40 0.40 0.40 0.40 0.40 0.40
0.2  0.42 0.42 0.42 0.42 0.43 0.43
0.4  0.43 0.44 0.44 0.45 0.45 0.45
0.6  0.43 0.44 0.45 0.47 0.48 0.49
0.8  0.43 0.44 0.46 0.48 0.51 0.54
1.0  0.43 0.44 0.46 0.49 0.54 0.60
Timed GMG
{Places, Transitions, d, Prob}
Throughput?
Marked graphs with early evaluation = stochastic dynamic system
Alternatives to compute the throughput:
– Simulation
– Markov chain
– Linear programming
Throughput
Th = lim_{τ→∞} s(τ) / τ
τ – time, s(τ) – firing vector
This limit exists for every timed GMG
It is the same for all transitions!
Markov chains
[figure: the reachable markings of the example (places p1 … p5) form three states: S1 = (1 0 1 0 1), from which a and d fire; S2 = (0 1 1 1 0), from which b, c and (with probability 1−α) a fire; S3 = (0 1 0 1 1), from which b, c, d fire]
Solve the Markov chain:
S2 = S1; S3 = (1-α)S2; S1 + S2 + S3 = 1;
Th = S1 + (1-α)S2
Th = (2 - α) / (3 - α)
State explosion problem!
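The closed form Th = (2 − α)/(3 − α) can be cross-checked by playing the token game directly. A sketch of a simulator for this particular GMG (topology taken from the LP example: cycles a-b with places p1, p2 and a-c-d with places p3, p4, p5; transition a is early-enabled with guard {p1} with probability α, guard {p3} otherwise; early firing still consumes from all inputs, possibly creating an anti-token):

```python
import random

# Monte-Carlo estimate of the throughput of transition 'a'.
def throughput(alpha, cycles, seed=1):
    random.seed(seed)
    m = {"p1": 1, "p2": 0, "p3": 1, "p4": 0, "p5": 1}
    guard = "p1" if random.random() < alpha else "p3"
    fired_a = 0
    for _ in range(cycles):
        fire = []                       # enabling from the old marking
        if m[guard] > 0:
            fire.append("a")
        if m["p2"] > 0: fire.append("b")
        if m["p4"] > 0: fire.append("c")
        if m["p5"] > 0: fire.append("d")
        if "a" in fire:                 # early firing: consume from all
            m["p1"] -= 1; m["p3"] -= 1  # inputs (may go to -1)
            m["p2"] += 1; m["p4"] += 1
            fired_a += 1
            # guard selection is persistent: redraw only after firing
            guard = "p1" if random.random() < alpha else "p3"
        if "b" in fire:
            m["p2"] -= 1; m["p1"] += 1
        if "c" in fire:
            m["p4"] -= 1; m["p5"] += 1
        if "d" in fire:
            m["p5"] -= 1; m["p3"] += 1
    return fired_a / cycles
```

At α = 1 the short cycle dominates (Th = 1/2), at α = 0 the long one (Th = 2/3), and intermediate values match (2 − α)/(3 − α).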
Linear programming formulation
Average marking: m̄p
max th subject to:
reachability: m̄ = m0 + t̄in – t̄out
th constraints:
d(t) · th ≤ min(m̄p1, m̄p2) for every “classical” transition t with input places p1, p2
d(t) · th = a·m̄p1 + (1-a)·m̄p2 for every “early” transition t with guard probabilities a and 1-a
Linear programming. Example
[figure: transitions a, b, c, d with places p1 … p5; a is early with probabilities a / 1-a]
max th subject to:
mp1 = 1 + tb – ta
mp2 = 0 + ta – tb
mp3 = 1 + td – ta
mp4 = 0 + ta – tc
mp5 = 1 + tc – td
th ≤ mp2
th ≤ mp4
th ≤ mp5
th = a·mp1 + (1-a)·mp3
Th = (2 - α) / (3 - α)
Averaging cycle throughputs or cycle times does not work
[figure: the same example; the a-b cycle has throughput 1/2, the a-c-d cycle 2/3]
Averaging the throughput of individual cycles:
Th’ = α · 1/2 + (1 - α) · 2/3 = (4 - α) / 6
Averaging the effective cycle times of individual cycles:
1/Th” = 2α + (1 - α) · 3/2 = (3 + α) / 2, so Th” = 2 / (3 + α)
The exact value is Th = (2 - α) / (3 - α)
Example
[figure: initial marking vs. average (steady state) marking]
Linear programming
In general the LP yields an upper bound on the throughput
Particular cases of exact throughput:
• No early joins (i.e. MGs)
• All joins are early-evaluation joins
Throughput estimation
[figure: an early transition t1 with guard probabilities 0.75 and 0.25 feeding t2]
Throughput estimation:
1. Up = throughput obtained with LP
2. Low = throughput without early enabling
3. Th = (Up + Low)/2
Results
Circuits from MCNC; 2-input gates; a latch for each gate; 25% muxes; 75% tokens, 25% bubbles; random select probability.

Name    Nodes  Edges  MG=Low  Real   Up     ΔTh   Err
s27     22     32     0.333   0.333  0.333  0 %   0 %
s208    12     15     0.500   0.571  0.594  14 %  4 %
s298    5434   10040  0.091   0.120  0.129  32 %  8 %
s349    73     114    0.333   0.333  0.333  0 %   0 %
s382    28     46     0.250   0.284  0.294  14 %  4 %
s386    121    204    0.400   0.400  0.400  0 %   0 %
s400    30     50     0.400   0.438  0.470  10 %  1 %
s444    34     58     0.200   0.261  0.287  31 %  7 %
s510    367    671    0.167   0.167  0.167  0 %   0 %
s526    46     67     0.333   0.333  0.333  0 %   0 %
s641    89     138    0.333   0.393  0.432  18 %  3 %
s713    104    167    0.250   0.333  0.333  33 %  12 %
s820    424    738    0.143   0.201  0.230  41 %  7 %
s832    474    819    0.286   0.310  0.342  8 %   1 %
s953    156    259    0.286   0.295  0.333  3 %   5 %
s1423   396    711    0.100   0.184  0.189  84 %  21 %
s1488   564    1003   0.188   0.236  0.271  26 %  3 %
s1494   564    1000   0.154   0.222  0.277  44 %  3 %
s5378   736    1320   0.235   0.250  0.250  6 %   3 %
s9234   867    1658   0.200   0.219  0.248  10 %  2 %
Summary
Early evaluation to improve system throughput
– Evaluate expressions as soon as possible
– Generate anti-tokens to erase “don’t care” bits
Analytical model to estimate the throughput
– Useful for architectural exploration: which muxes must have early evaluation? where do we put our by-passes?
– Faster than simulation
– Simulation can be used at later design stages
III. Optimization
– Slack matching and buffer sizing
– Retiming and recycling
– Clustering controllers
Correctness
– Theory of elastic machines
– Formal verification
Buffer sizing
(joint work with Dmitry Bufistov)
How many bubbles do we need?
[animation, several slides: tokens circulating in a ring with too few bubbles; with a single bubble, rotating all tokens takes O(n²) cycles, while adding enough bubbles brings it down to O(n) cycles]
Throughput of an n-stage ring
tokens + bubbles = n
[plot: throughput vs. number of tokens; it is 0 with no tokens, grows to a maximum of 1/2 around n/2 tokens, and drops back to 0 as the ring fills (no space)]
What flexibility do we have ?
The number of tokens on a loop cannot easily be changed (it is inherent to the computation)
Bubbles can always be added (as many as necessary), but they may decrease throughput
Buffer sizes can always be increased (provided the forward latency of the buffer does not change)
Tokens determine the maximum achievable throughput (assuming infinite buffer sizes)
Optimization techniques
Buffer sizing: select the optimal capacity of elastic buffers without increasing the forward latency for propagating data-tokens
Slack matching: insert additional empty elastic buffers
– increases buffer capacity
– but typically increases forward latency as well
– also called recycling in the context of synchronous elastic (latency-tolerant) designs
Buffer optimization
[figure: a marked graph whose cycles have ratios 6/7, 5/5 = 1 and 5/8]
Th = 5 / 8 = 0.625
Buffer optimization
[figure: the same graph with forward (Valid) and backward (Stop) arcs shown]
Buffer optimization
[figure: inserting an empty buffer creates new critical cycles: Th = 3/5 = 0.6 or Th = 4/7 = 0.57, depending on where it is inserted]
Why ? Traffic jams in short branches of fork/join structures
Solution 1: slack matching/recycling
Make non-critical branches longer (be careful, forward latency increases)
Solution 2: increase buffer size
FIFO: n master + 1 slave latches
Slack matching vs. buffer capacity
Not equivalent (slack matching cannot always achieve the forward latency, while buffer capacity can)
Slack matching is a well-studied problem in asynchronous design
Slack matching = increase buffer capacity + split
Buffer optimization
[figure: Th = 5/8 = 0.625 vs. Th = 3/5 = 0.6 vs. Th = 4/7 = 0.57]
Increase buffer capacity = put a token on the backward edge
Buffer sizing
Find the minimum possible increase in buffer sizes such that the throughput equals the throughput of a system with infinite-size buffers
Combinatorial problem
We found an exact ILP formulation, but … ILP is exponential
Can we do better (polynomial time) ?
Buffer sizing is NP-complete
NP-hardness: reduction of “minimum set of edges that cuts all cycles in a digraph” to buffer sizing
NP: checking the validity of a solution can be done in polynomial time (e.g. Karp’s algorithm)
Therefore, no polynomial algorithm exists, unless P = NP
LP performance model
(only forward (“Valid”) edges)
max th subject to:
m̄(p) = m0(p) + t̄in – t̄out, for every place p
d(t) · th ≤ min_{p ∈ •t} m̄(p), for every transition t
This gives the maximum achievable throughput (infinite-size buffers)
ILP model for buffer sizing
th = throughput with infinite buffers; Δm0(p) = extra capacity
min Σ Δm0(p) subject to:
m̄(p) = m0(p) + Δm0(p) + t̄in – t̄out
d(t) · th ≤ min_{p ∈ •t} m̄(p)
Δm0(p) integer; Δm0(p) = 0 for forward edges
Table of results

Circuit  |V|   |E|    Th    Max Th  ΔTok  CPU (sec)  Mem (Mb)
s1423    484   942    0.33  0.33    0     <1         81
s1488    321   1662   0.5   0.5     0     <1         95
s1494    341   1775   0.5   0.5     0     1          108
s208     36    100    0.5   1       26    1          1
s27      31    78     0.5   0.75    18    <1         1
s298     823   7154   0.5   0.5     0     5          946
s349     139   241    0.5   0.6     3     <1         7
s386     86    339    0.5   0.5     0     <1         7
s400     119   273    0.33  0.33    0     <1         7
s444     132   298    0.33  0.33    0     <1         7
s510     149   571    0.5   0.5     0     <1         8
s526     145   382    0.33  0.33    0     <1         16
s5378    1138  2484   0.42  0.55    30    4708       500
s641     182   298    0.5   0.67    6     <1         11
s713     208   350    0.42  0.5     1     <1         11
s820     183   919    0.5   0.5     0     <1         15
s832     191   972    0.5   0.5     0     <1         31
s9234    1023  1992   0.25  0.25    0     <1         34
s953     373   704    0.45  0.64    10*   >21600     350
s38417   8315  16440  0.25  0.33    -     -          >2Gb
* - Non-optimal integral solution with time limit 120 seconds
Retiming and Recycling
(joint work with Dmitry Bufistov and Sachin Sapatnekar)
Retiming
Retiming: moving registers across combinational blocks or (equivalently) moving combinational blocks across registers
– forward retiming
– backward retiming
[figure: node n retimed backward / node n retimed forward]
Retiming in elastic systems
– all registers participating in the retiming move should be labeled with the same number of data-tokens
– use of negative tokens can remove the above constraint (not discussed here)
Recycling
Recycling: insert (or remove) empty elastic buffers (empty registers for short) on any edge
– possible only in elastic systems
We will ignore initialization and consider only steady-state behavior
– Initialization to an equivalent state is almost always possible, but may require extra logic
Effective Cycle Time
Cycle time: c = max {path delay between registers}
Throughput: Q = min {tokens/cycle} (formally defined before)
Effective cycle time: C = c / Q
[figure: a graph with node delays 3, 9, 10, 9, 4, 8, 6; c = 12, Q = 4/5, C = 15]
R&R graph (RRG)
[figure: the same graph annotated with three kinds of elements: a combinational block with a delay (e.g. 10 units), a register (EB) with one data token, and an empty register (EB with no data tokens)]
R&R is more powerful than retiming
[figure: on the example, min-delay retiming achieves C = 16 while min-delay R&R achieves C = 15]
Analogy between circuit retiming and reachability in MGs
Retiming graph of a circuit = MG:
combinational block = node; connection = edge; register = token
firing rules = backward retiming rules: each time a node is retimed, registers are removed from the input edges and added to the output edges
MGs: a live marking M of an SCMG is reachable iff M(f) = M0(f) for every cycle f.
Retiming interpretation:
⇒ a valid retiming preserves the number of registers on each cycle
⇐ if an assignment of registers has the same number of registers on each cycle as the initial circuit, then the assignment is a valid retiming
Analogy between circuit retiming and reachability in MGs
A non-negative marking M is reachable iff the marking equation holds:
M = M0 + A·s
Retiming interpretation:
M0 – initial assignment of registers to edges
M – assignment after retiming
A – retiming matrix
s – retiming vector
Rename M to R: R = R0 + A·s
ILP formulation (A is totally unimodular: polynomial problem)
Example of marking equation
[figure: a four-node graph n1 … n4 with edges e12, e13, e23, e24, e34, e41; the marking equation M = M0 + A·s is written out component-wise, one row per edge: M(e) = M0(e) + s(u) − s(v) for e = (u, v)]
Valid R&R solutions
Any integer solution for R and R’ of
R ≥ R’ = R0 + A·s
is a valid R&R solution
R’ – the retiming subset (registers with data-tokens)
R – the R&R solution (registers with data-tokens or bubbles)
(R − R’) – registers with bubbles (recycling)
Combinational path constraints
tin(e) ≥ tout(e’) + d(u), for e = (u, v) and every e’ = (w, u)
tout(e) ≥ tin(e) − τ’·R(e)
tout(e) ≥ 0
tin(e) ≤ τ
τ – desired cycle time
τ’ – original cycle time or any other constant > τ
d(u) – delay of node u
R(e) – number of registers on edge e = (u, v)
Register delays can be taken into account
Throughput constraints
R ≥ (R0 + A·s) / Q
Let R be a valid R&R register assignment. There is a non-negative real vector s that fulfils the above inequality iff Q(R) ≥ Q
ILP formulation for R&R
RR(τ, Q):
  R = R0 + A·s1 ≥ 0
  R ≥ (R0 + A·s2) / Q
  Path_constr(R, τ)
  R, s1 integer
Given a cycle time τ and a throughput Q, R is a valid R&R register assignment of an RRG (N, E, R0, d) with τ(R) ≤ τ and Q(R) ≥ Q iff there exists a feasible solution of the above ILP
Min period R&R
Given an RRG and a throughput Q > 0, find a
register assignment R that minimizes the cycle
time  and has throughput Q(R) > Q .
min 
MIN _ PER(Q)  
subject to RR( , Q)
220
Max throughput R&R
Given an RRG and a cycle period  , find a
register assignment R with (R) <  that
maximizes the throughput Q(R).
max Q
MAX _ THR _ NL ( )  
subject to RR( , Q)
This problem is not linear (and not convex): Q is a variable of
the model and the second constraint of RR(, Q) is not linear.
Use binary search on different Q
221
Size of the interval for binary search
• Binary search explores [QL,QU]
 QL has feasible R&R solution, QU  does not
• What is the size of the interval not to miss an optimal solution?
| QL  QU | 1/(| R0 |  | E |)
|R0|  number of initial registers
|E|  number of edges
222
2
Min effective cycle time R&R

Given an RRG, find a register assignment Rmin with a minimal effective cycle time C(Rmin).

MIN_ECYC_STEP(Q):
  R1 := MIN_PER(Q)
  R2 := MAX_THR(τ(R1))
  return R2

MIN_ECYC(RRG):
  C := rt;  Q := dmax/rt;  δ := 1 / (|R0| + |E|)²
  while Q < 1
    R := MIN_ECYC_STEP(Q)
    if C(R) < C then C := C(R)
    Q := Q(R) + δ
  return C
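The sweep above can be sketched as follows, with MIN_PER/MAX_THR mocked by a hypothetical `step` function returning (cycle time, throughput) pairs; the real algorithm solves the corresponding ILPs instead:

```python
def min_effective_cycle(step, q0, delta):
    """Sweep throughput targets upward; keep the best tau/Q seen."""
    best = float("inf")
    q = q0
    while q < 1:
        tau, thr = step(q)                # mock of MIN_PER(q) then MAX_THR(tau)
        best = min(best, tau / thr)       # effective cycle time C(R) = tau(R)/Q(R)
        q = thr + delta                   # MAX_THR guarantees thr >= q, so q grows
    return best

# Hypothetical design with two Pareto points (tau, Q) = (10, 1/2) and (14, 4/5):
def step(q):
    if q <= 0.5:
        return (10.0, 0.5)
    if q <= 0.8:
        return (14.0, 0.8)
    return (20.0, 1.0)                    # beyond Q = 4/5 only a degenerate point

c = min_effective_cycle(step, q0=0.4, delta=1 / 64)   # best is 14 / 0.8 = 17.5
```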
Search for min effective cycle

Initialization: Q(Rmin) ≥ dmax/rt, where dmax is the maximum delay of a node and rt is the cycle period obtained by min-delay retiming.

[Plot: cycle time τ versus throughput Q. Starting from the retiming solution at Q0 = dmax/rt, the search alternates MIN_PER (MP) and MAX_THR (MT) steps toward Q = 1.]

This search does not miss any solution with a better effective cycle time.
Results

R&R can provide a better effective cycle time than regular retiming when the circuit has
– long cycles and
– unbalanced delays
Useful for micro-architectural problems, not for well-balanced circuit graphs.
Clustering controllers
(joint work with Josep Carmona
and Jorge Júlvez)
Merging nodes in elastic MG

[Example: an elastic marked graph with transitions a–h and throughput 2/5. Merging f and b preserves the throughput (still 2/5); additionally merging g and c degrades it to 1/3.]
Sharing controllers and elastic buffers

[Example: elastic buffers R1–R5, each with its own controller C1–C5. After clustering, R2 and R3 are shared as a single buffer R23 with a merged controller C23.]
Mergeable transitions

Definition: Transitions ti and tj of an EMG G are called mergeable if Th(G) = Th(G′), where G′ is obtained by merging ti and tj in G.

Idea: Merge transitions with the same critical average marking at their input arcs.
– If the transitions are not critical, then explore the slack at the non-critical input arcs: check whether the same throughput can be achieved with the critical average marking.
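Merging matters because it can tighten the critical cycle. As a reminder of how throughput is read off an elastic marked graph, here is a minimal brute-force sketch over a hypothetical graph (unit-delay transitions; this is not the clustering algorithm itself):

```python
from fractions import Fraction
from itertools import permutations

def throughput(edges):
    """Min over simple cycles of tokens(c) / |c| (one time unit per transition).
    edges: dict (u, v) -> tokens on arc u -> v."""
    nodes = sorted({u for u, _ in edges} | {v for _, v in edges})
    best = None
    # Brute-force cycle enumeration; fine for tiny illustrative graphs.
    for r in range(2, len(nodes) + 1):
        for cyc in permutations(nodes, r):
            arcs = list(zip(cyc, cyc[1:] + cyc[:1]))
            if all(a in edges for a in arcs):
                ratio = Fraction(sum(edges[a] for a in arcs), len(arcs))
                best = ratio if best is None or ratio < best else best
    return best

# Hypothetical EMG: a 3-cycle and a 2-cycle, each carrying one token.
g = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 0, ("b", "a"): 0}
# throughput(g) == 1/3, limited by the 3-cycle (1 token over 3 transitions).
```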
Correctness and Verification
Correctness (short story)

We developed a theory of elastic machines.
Verifying the correctness of any elastic implementation = checking conformance with the definition of an elastic machine.
All SELF controllers are verified for conformance.
Elasticization is correct-by-construction.
Correctness (long story) =
theory of Elastic Machines
(joint work with Sava Krstic and
John O’Leary)
Systems
Operations
Machines = abstract circuits
When is a network a machine?
Sequential and combinational dependency
Detecting combinational loops
[I,O]-Elastic machine
Liveness
Elastic machine
Elastic networks
Elastic feedback
Elastic network theorem
Inserting empty buffers
Verification of elastic systems
(joint work with Syed Suhaib and
Sava Krstic)
Implementation of Elastic Module

[Diagram: producers 1…M and consumers 1…N connected through an elastic module (Join, Computation, Fork).]
1. Verify properties of the elastic machine
2. Verify data transfer / buffer properties
Problem

Transfer counters (tct) have an infinite domain.
Model checking requires finite-domain counters.
Finite domain is sufficient

Any implementation has a finite sequential depth between any input channel and any output channel.
Model the tct variables as integers modulo (k+1), in a finite range [0,k] sufficient to cover the maximal sequential depth.
Reset a tct when it reaches k, and restart from 0.
How do you compute k?
Synchronous Slack

Capacity: C(i,j) is defined as the maximum number of data storage elements between channels i and j, where i ∈ I and j ∈ O.

For an [I,O]-system S, its synchronous slack
  σ = min{i,j} {k : G(k ≥ max |tct_i − tct_j|)}
  σ ≤ max{i,j} C(i,j)
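Because counter differences are bounded by the slack, finite counters lose no information. A minimal sketch of the underlying modular arithmetic, assuming a modulus m > 2σ (the shifted-count scheme on the following slides is a refinement of this idea, not this exact code):

```python
def recover_diff(a_mod, b_mod, m):
    """Recover tct_i - tct_j from counts kept modulo m, assuming the true
    difference is bounded by the slack: |tct_i - tct_j| <= sigma and m > 2*sigma."""
    d = (a_mod - b_mod) % m
    return d if d <= m // 2 else d - m

# Hypothetical run: true counts 105 and 103, slack sigma <= 5, so m = 11 suffices.
m = 11
d1 = recover_diff(105 % m, 103 % m, m)   # recovers +2
d2 = recover_diff(103 % m, 105 % m, m)   # recovers -2
```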
Modeling of Counters

Counters:
α – Modulo Count
β – Shifted Count

[Animation, cycles 1–7: an elastic module with input channels in1, in2 and output channels out1, out2, with synchronous slack = 5. Each channel's modulo counter cycles through 0–4 as transfers occur; a shifted counter of 0 points to the hungriest channels.]
Validating Data Correctness

[Testbench diagram: an environment (Sender + Feeder) provides nondeterministic data to both a synchronous module f and its elastic counterpart f (Join + EB + Fork), with ND bubbles inserted on the elastic side. An Eq-comparator checks that the valid output values of the two modules match (Receiver 1, Receiver 2); channels carry Valid, Stop and Data signals, and buffers model back-pressure (buffer full).]
Use uninterpreted functions

Symbolic terms and uninterpreted functions
– Proposed by Burch and Dill '94
We employ a similar procedure
– Encode all possible terms
– Model combinational logic as a single function
Consider, e.g., a two-input uninterpreted function.
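The only axiom an uninterpreted function carries is functional consistency: equal arguments give equal results. A minimal sketch of such an encoding (illustrative Python; the actual models are written for SPIN/NuSMV):

```python
class UninterpretedFunction:
    """Burch-Dill style: results are symbolic terms, and the only rule
    enforced is functional consistency (equal arguments, equal results)."""
    def __init__(self, name):
        self.name = name
        self.table = {}                   # memo: argument tuple -> term

    def __call__(self, *args):
        if args not in self.table:        # fresh term for unseen arguments
            self.table[args] = f"{self.name}({', '.join(map(str, args))})"
        return self.table[args]

f = UninterpretedFunction("f")
t1 = f("x", "y")
t2 = f("x", "y")                          # same inputs: the same term
t3 = f("y", "x")                          # swapped inputs: a distinct term
```

Comparing the synchronous and elastic sides then reduces to comparing terms rather than concrete data values.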
Model checking
SPIN Model Checker [Bell Labs]
NuSMV Model Checker [IRST and CMU]
Summary

SELF gives a low-cost implementation of elastic machines
Functionality is preserved when latencies change
New micro-architectural opportunities
A compositional theory proves correctness
Early evaluation: a mechanism for performance and power optimization
Retiming and recycling, buffer optimization, and other optimization opportunities
To read more on this work: see the list of references
Research directions

How to specify elastic machines
– Asynchronous specification (e.g., CSP): a discretized-asynchrony view
– Elastic synchronous specification (extend Esterel, Lustre, PBS with controlled asynchrony)
Compilers
Improve bounds on analytical performance analysis for early evaluation
Formal methods for micro-architectural optimization
R&R and buffer optimization for systems with early evaluation
More on optimization for elastic machines
Some related work

Async
– Rings (T. Williams, J. Sparso)
– Caltech CHP and slack elasticity (A. Martin, S. Burns, R. Manohar et al.)
– Micropipelines (I. Sutherland)
– Many others
Latency-insensitive design
– L. Carloni and a few follow-ups (large overhead)
– C. Svensson (Linköping U.): wire pipelining
Interlock pipelines
– H. Jacobson et al.
Desynchronization
– J. Cortadella et al.
– V. Varshavsky
Performance analysis
– S. Burns
– H. Hulgaard
– C. Nielsen / M. Kishinevsky, etc.
Synchronous implementation of CSP
– J. O'Leary et al.
– A. Peeters et al.
Telescopic units – Benini et al.
See the list of references