Synthesis of synchronous elastic architectures
Jordi Cortadella (Universitat Politècnica de Catalunya)
Mike Kishinevsky (Intel Corp.)
Bill Grundmann (Intel Corp.)
Network of Computing Units
Out
In
B3
B1
B2
Latency-insensitive (elastic) system
Out
In
B3
B1
B2
Every block makes a step only when all its inputs are valid
Why
Scalable
Modular (Plug & Play)
Tolerance to variable latency
– Communication
– Computation
Not asynchronous
– Use existing design paradigms
– CAD tools
Outline
The cost of elasticity
SELF: an elastic protocol
– Basic implementation (linear pipelines)
– General netlists (forks and joins)
– Formal models and verification
Synthesis of elastic architectures
Related work
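What’s the cost of elasticity?
[Figure: an elastic block, consisting of a core on the Data path clocked by a gated clock, and a control block that takes CLK and exchanges Valid and Stop signals on both sides.]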
Communication channel
sender
receiver
Data
Data
Long wires: slow transmission
Pipelined communication
sender
Data
receiver
Data
What if the sender does not always send valid data?
The Valid bit
sender
receiver
Data
Data
Valid
Valid
What if the receiver is not always ready?
The Stop bit
sender
receiver
Data
Data
Valid
Valid
Stop
Stop
Stop values shown on the channel in successive animation frames:
0 0 0 0 0
0 0 0 1 1
0 0 1 1 1
1 1 1 1 1 (back-pressure)
0 0 0 0 1 (long combinational path)
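The Valid and Stop bits can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the function names `step` and `run`, the three-stage depth, and the choice to model Stop as one global signal seen by every stage in the same cycle (the long combinational path mentioned on the slide) are illustrative and not taken from the slides.

```python
from typing import List, Optional, Tuple

def step(stages: List[Optional[str]], new_item: Optional[str],
         stop: bool) -> Tuple[List[Optional[str]], Optional[str]]:
    """Advance a pipelined channel by one clock cycle.

    stages[0] is next to the sender, stages[-1] is next to the receiver.
    None models a slot with Valid = 0. When stop is asserted, every stage
    holds its content and nothing is delivered (assumed global stall).
    Returns (next_stages, delivered_item).
    """
    if stop:
        return list(stages), None
    delivered = stages[-1]
    return [new_item] + list(stages[:-1]), delivered

def run(items: List[Optional[str]], stops: List[bool], depth: int = 3) -> List[str]:
    """Send items (None = the sender has nothing valid) under a stop pattern."""
    stages: List[Optional[str]] = [None] * depth
    pending = list(items)
    received = []
    for stop in stops:
        new_item = None
        if not stop and pending:
            new_item = pending.pop(0)      # the sender advances only when not stopped
        stages, delivered = step(stages, new_item, stop)
        if delivered is not None:          # Valid = 1 at the receiver
            received.append(delivered)
    return received

result = run(["a", None, "b"],
             [False, False, True, True, False, False, False, False])
print(result)  # ['a', 'b']
```

Stalled cycles and invalid slots change only the timing: the receiver still sees the valid items in order.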
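Carloni’s relay stations (double storage)
[Figure: a sender and a receiver, each a pearl wrapped in a shell, connected through relay stations, with main and auxiliary (aux) registers along the channel and Valid (V) / Stop (S) handshakes between stages.]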
• Handshakes with short wires
• Double storage required
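The need for double storage can be seen in a behavioural sketch of a relay station. This is an illustrative model, not Carloni's actual circuit: the class name `RelayStation`, the method names, and the `simulate` helper are assumptions. The key point it tries to capture is that both handshake outputs depend only on registered state, so an item already in flight when Stop arrives needs a second slot.

```python
from typing import List, Optional, Tuple

class RelayStation:
    """Two-slot buffer: 'main' drives the output, 'aux' catches one extra item."""

    def __init__(self) -> None:
        self.main: Optional[str] = None
        self.aux: Optional[str] = None

    def outputs(self) -> Tuple[bool, Optional[str], bool]:
        """(valid_out, data_out, stop_to_upstream), all functions of state only."""
        return self.main is not None, self.main, self.aux is not None

    def clock(self, valid_in: bool, data_in: Optional[str], stop_in: bool) -> None:
        valid_out, _, stop_out = self.outputs()
        if valid_out and not stop_in:      # downstream takes the item in main
            self.main, self.aux = self.aux, None
        if valid_in and not stop_out:      # upstream hands over a new item
            if self.main is None:
                self.main = data_in
            else:
                self.aux = data_in         # double storage absorbs the in-flight item

def simulate(items: List[str], receiver_stops: List[bool], n_stations: int = 2) -> List[str]:
    chain = [RelayStation() for _ in range(n_stations)]
    pending, received = list(items), []
    for stop_rx in receiver_stops:
        outs = [rs.outputs() for rs in chain]          # sample registered outputs
        v_in, d_in = bool(pending), (pending[0] if pending else None)
        if v_in and not outs[0][2]:
            pending.pop(0)                             # transfer into the first station
        for i, rs in enumerate(chain):
            vin, din = (v_in, d_in) if i == 0 else (outs[i - 1][0], outs[i - 1][1])
            stop_next = outs[i + 1][2] if i + 1 < len(chain) else stop_rx
            rs.clock(vin, din, stop_next)
        if outs[-1][0] and not stop_rx:
            received.append(outs[-1][1])
    return received

stalls = [False, True, True, False, False, True] + [False] * 8
print(simulate(list("abcde"), stalls))  # ['a', 'b', 'c', 'd', 'e']
```

Because Stop is registered at every station, each handshake spans only one stage, at the price of the extra `aux` slot.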
Proposal: an elastic protocol
SELF (Synchronous ELastic Flow)
Simple and provably correct
Data-path with no overhead in:
– Area
– Latency
– Energy
Negligible control overhead
Fine-grain elasticity
Flip-flops vs. latches
sender
receiver
FF
FF
1 cycle
Flip-flops vs. latches
sender
receiver
H L
H L
1 cycle
Flip-flops already have a double storage capability, but …
Holding two different tokens in the master and slave latches is not allowed in conventional FF-based design!
Let’s make the master/slave latches independent
sender
receiver
H
L
H
L
½ cycle ½ cycle
Only half of the latches (H or L) can move tokens
Elastic buffer keeps data while stop is in flight
W1R1
Cannot be done with single-edge flops without double pumping
Use latches inside the master-slave (MS) flip-flop
W2R1
W1R2
W2R2
Carloni’s relay station belongs to this class
SELF (linear communication)
sender
receiver
Data
Data
En
En
Valid
En
En
V
V
V
V
1
1
1
1
Stop
Valid
Stop
S
S
S
S
The protocol
Sender
Receiver
Data
Valid
Stop
Idle cycle: Valid = 0
Transfer cycle: Valid = 1 ∧ Stop = 0
Retry cycle: Valid = 1 ∧ Stop = 1
Persistency: G [ V ∧ S ∧ (Data = D) ⇒ Next (V ∧ Data = D) ]
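The three cycle types and the persistency rule (a valid item that is stopped must be offered again, unchanged, in the next cycle) can be checked on a finite trace. The sketch below is illustrative: the names `classify` and `persistent` and the tuple encoding of a cycle are assumptions, and a finite trace can only approximate the temporal operator G.

```python
from typing import List, Optional, Tuple

Cycle = Tuple[Optional[str], int, int]   # (data, valid, stop), assumed encoding

def classify(valid: int, stop: int) -> str:
    """Name the cycle type from the Valid and Stop bits."""
    if not valid:
        return "idle"
    return "retry" if stop else "transfer"

def persistent(trace: List[Cycle]) -> bool:
    """Check that every retry cycle is followed by a cycle that is still
    valid and carries the same data (finite-trace reading of the rule)."""
    for (d, v, s), (d_next, v_next, _) in zip(trace, trace[1:]):
        if v and s and not (v_next and d_next == d):
            return False
    return True

good = [("A", 1, 1), ("A", 1, 0), (None, 0, 0)]
bad = [("A", 1, 1), ("B", 1, 0)]
print(persistent(good), persistent(bad))  # True False
```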
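Example trace, one column per cycle (a retry must precede its transfer, so the earliest cycle is the rightmost column):
Data:  * D D * C C C B * A
Valid: 0 1 1 0 1 1 1 1 0 1
Stop:  0 0 1 0 0 1 1 0 0 0
Legend: Valid = 1, Stop = 0 is a transfer; Valid = 1, Stop = 1 is a retry; Valid = 0 is idle (* marks irrelevant data).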
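The example trace can be decoded mechanically. The sketch below assumes that time runs from right to left in the slide (the reading under which each retry precedes its transfer); the variable and function names are illustrative.

```python
data = "* D D * C C C B * A".split()
valid = [int(b) for b in "0 1 1 0 1 1 1 1 0 1".split()]
stop = [int(b) for b in "0 0 1 0 0 1 1 0 0 0".split()]

def in_time_order(data, valid, stop):
    """Reverse the columns so that the first cycle comes first (assumed ordering)."""
    return list(zip(data, valid, stop))[::-1]

def transferred(cycles):
    """Items the receiver actually accepts: Valid = 1 and Stop = 0."""
    return [d for d, v, s in cycles if v and not s]

def count_retries(cycles):
    """Cycles in which a valid item was stopped: Valid = 1 and Stop = 1."""
    return sum(1 for _, v, s in cycles if v and s)

cycles = in_time_order(data, valid, stop)
print(transferred(cycles))    # ['A', 'B', 'C', 'D']
print(count_retries(cycles))  # 3
```

Each item is transferred exactly once, however many retry cycles precede the transfer.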
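Elastic Half Buffer (EHB)
[Figure: a data latch with enable En_i, driven by an EHB control cell that connects the handshake pair V_{i-1}, S_{i-1} to the pair V_i, S_i.]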
Join
+
V1
EHB
EHB
S1
V2
S2
V
S EHB
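One way to write the join control as Boolean equations is sketched below. These equations are an assumption for illustration (the slide only shows the schematic): the output is valid when both inputs are valid, and a valid input is stopped unless the joint transfer happens in this cycle.

```python
from typing import Tuple

def join(v1: bool, v2: bool, s: bool) -> Tuple[bool, bool, bool]:
    """Two-input join controller (illustrative equations).

    v1, v2: Valid of the two input channels
    s:      Stop coming back from the output channel
    Returns (v, s1, s2): output Valid and the Stops sent to the inputs.
    """
    v = v1 and v2                  # output is valid only when all inputs are
    transfer = v and not s         # the joined item moves forward this cycle
    s1 = v1 and not transfer       # a waiting valid input must retry
    s2 = v2 and not transfer
    return v, s1, s2

print(join(True, False, False))  # (False, True, False)
print(join(True, True, False))   # (True, False, False)
```

In the first call, input 1 is held in a retry cycle until input 2 becomes valid; in the second, both inputs transfer together.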
Lazy Fork
V
S
V1
S1
V2
S2
Eager Fork
S1
^
V1
V
V2
S
^
S2
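The difference between the two forks can be sketched behaviourally. Both models are assumptions for illustration and are not read off the schematics: the lazy fork forwards an item only when both branches can accept it in the same cycle, while the eager fork lets each branch take the item as soon as it is ready and remembers, in one state bit per branch, which branches are still pending.

```python
from typing import Tuple

def lazy_fork(v: bool, s1: bool, s2: bool) -> Tuple[bool, bool, bool]:
    """Returns (v1, v2, s). A branch sees Valid only if the other branch is ready."""
    v1 = v and not s2
    v2 = v and not s1
    s = s1 or s2                   # stop the input unless both branches accept
    return v1, v2, s

class EagerFork:
    """Each branch may accept the current item independently (illustrative model)."""

    def __init__(self) -> None:
        self.pending = [True, True]    # branch i has not yet received the current item

    def step(self, v: bool, s1: bool, s2: bool) -> Tuple[bool, bool, bool]:
        stops = (s1, s2)
        v_out = [v and p for p in self.pending]
        # A branch is done if it took the item earlier or takes it in this cycle.
        done = [(not p) or (v and not s) for p, s in zip(self.pending, stops)]
        s = v and not all(done)        # hold the input until every branch is done
        if v:
            self.pending = [True, True] if all(done) else [not d for d in done]
        return v_out[0], v_out[1], s

fork = EagerFork()
print(fork.step(True, False, True))   # (True, True, True): branch 1 takes the item
print(fork.step(True, False, True))   # (False, True, True): only branch 2 is offered it
print(fork.step(True, True, False))   # (False, True, False): branch 2 takes it, input released
```

In the same situation the lazy fork would deliver nothing to branch 1 until branch 2 is also ready.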
Elastic combinational paths
Wire
EB
Fork
Join
EB
Join / Fork
EB
Wire
EB
Enable signal to data latches
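Elastic buffer: formal model
[Figure: a buffer holding items i, i+1, …, i+k between Din and Dout, with read pointer rd and write pointer wr, and handshake signals Vin, Sin, Vout, Sout.]
Buffer [ 0..∞ ]
Initial state: rd = wr = 0
Invariant: wr ≥ rd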
Liveness properties (finite unbounded latencies)
• Finite forward latency: G (rd ≠ wr ⇒ F Vout)
• Finite backward latency: G (¬Sout ⇒ F ¬Sin)
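The abstract model can be mimicked with an unbounded list and two pointers. This is a sketch under assumptions: the class and parameter names are illustrative, and the free choices of the abstract model (when to assert Sin, when to offer data) are exposed as the hypothetical arguments `s_in` and `want_out` rather than fixed by any implementation.

```python
from typing import List, Optional, Tuple

class AbstractElasticBuffer:
    """Unbounded array with a read pointer rd and a write pointer wr (rd <= wr)."""

    def __init__(self) -> None:
        self.buffer: List[str] = []    # every item ever written
        self.rd = 0
        self.wr = 0

    def step(self, v_in: bool, d_in: Optional[str], s_out: bool,
             s_in: bool = False, want_out: bool = True) -> Tuple[bool, Optional[str]]:
        """One clock cycle; returns (v_out, d_out)."""
        v_out = want_out and self.rd < self.wr     # only stored data may be offered
        d_out = self.buffer[self.rd] if v_out else None
        if v_out and not s_out:                    # output transfer
            self.rd += 1
        if v_in and not s_in:                      # input transfer
            self.buffer.append(d_in)
            self.wr += 1
        assert self.rd <= self.wr                  # invariant from the slide
        return v_out, d_out

buf = AbstractElasticBuffer()
inputs = [(True, "a"), (True, "b"), (False, None), (True, "c")] + [(False, None)] * 3
stops = [False, True, True, False, False, False, False]
out = []
for (v_in, d_in), s_out in zip(inputs, stops):
    v_out, d_out = buf.step(v_in, d_in, s_out)
    if v_out and not s_out:
        out.append(d_out)
print(out)  # ['a', 'b', 'c']
```

Items leave in the order in which they were written, and the buffer ends empty (rd = wr) once the receiver stops stalling.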
Formal verification
[Figure: the abstract model (buffer with pointers rd and wr) shown alongside the implementation, both with ports Din, Vin, Sin, Dout, Vout, Sout.]
The abstract FSM model is appropriate for compositional verification
Verification of implementations with model checking (1-bit abstractions of the datapath)
– LTL specs + NuSMV
– Buffer is a refinement of the spec
– In-order data transmission
– Correct synchronization of fork/join structures
– Absence of deadlocks
Synchronous:
D: a b c d e f g h i j k …
Elastic:
D: a a b b b c d e e f g g h i i i j k …
En: 1 0 1 0 0 1 1 1 0 1 1 0 1 1 0 0 1 1 …
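The equivalence can be checked directly on the two streams: sampling the elastic data stream only in the cycles where En = 1 should give back the synchronous stream. The function name `observe` is illustrative.

```python
sync = "a b c d e f g h i j k".split()
elastic = "a a b b b c d e e f g g h i i i j k".split()
en = [int(b) for b in "1 0 1 0 0 1 1 1 0 1 1 0 1 1 0 0 1 1".split()]

def observe(data, enable):
    """Keep only the values seen in cycles where the enable is 1."""
    return [d for d, e in zip(data, enable) if e]

print(observe(elastic, en) == sync)  # True
```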
Elasticization
Synchronous
[Figure: a synchronous pipeline with registers PC, IF/ID, ID/EX, EX/MEM and MEM/WB, all driven by CLK.]
Elastic
[Figure: the same pipeline with an elastic control layer of JOIN and FORK controllers that exchange Valid (V) and Stop (S) signals; the control layer generates the gated clocks for the registers.]
Variable-latency units
[Figure: a unit that takes [0 - k] cycles, with go and done signals and Valid/Stop (V, S) handshakes at its input and output.]
Telescopic units:
– 1 cycle for fast operations
– 2 cycles for slow operations
Examples:
– Short / long additions (carry propagation)
– A × 0, A / 1
– Dynamic changes in latency
(fast if cold, slow if hot)
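A telescopic unit can be sketched as a function that returns its result together with the number of cycles it took. The fast-case detector below (an operand equal to 0 or 1) is a hypothetical choice in the spirit of the A × 0 example, and the function names are illustrative.

```python
from typing import Iterable, Tuple

def telescopic_multiply(a: int, b: int) -> Tuple[int, int]:
    """Return (product, latency in cycles): 1 cycle for easy operands, else 2."""
    fast = a in (0, 1) or b in (0, 1)      # hypothetical "easy operand" detector
    return a * b, (1 if fast else 2)

def total_cycles(pairs: Iterable[Tuple[int, int]]) -> int:
    """Cycles needed to process a stream of operand pairs one after another."""
    return sum(telescopic_multiply(a, b)[1] for a, b in pairs)

print(telescopic_multiply(7, 0))                  # (0, 1)
print(telescopic_multiply(7, 3))                  # (21, 2)
print(total_cycles([(7, 0), (7, 3), (1, 9)]))     # 4
```

In an elastic system the extra cycle is absorbed by the handshake: the unit simply withholds Valid at its output until the slow result is ready.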
Microarchitectural exploration
Bubble insertion + Variable-latency units
– May improve performance: more bubbles, but a shorter cycle time
– Reduce power: units designed for the most frequent input data
Exploration at fine granularity
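The performance trade-off is a matter of simple arithmetic: a bubble lowers the number of tokens delivered per cycle, but the extra stage may shorten the cycle. The numbers below are purely hypothetical and only show how the two effects combine.

```python
def throughput(tokens_per_cycle: float, cycle_time_ns: float) -> float:
    """Tokens delivered per nanosecond."""
    return tokens_per_cycle / cycle_time_ns

# Hypothetical figures, not measurements from the slides.
baseline = throughput(1.0, 1.0)       # no bubbles, 1.0 ns cycle
with_bubble = throughput(0.8, 0.7)    # 4 tokens in a 5-stage loop, 0.7 ns cycle

print(with_bubble > baseline)  # True
```

With these assumed values the bubble pays off; if the cycle time only dropped to 0.9 ns, it would not.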
Some related work
Asynchronous design
– Micropipelines (Sutherland)
– Rings (Williams, Sparsø)
– CHP and slack-elasticity (Martin, Burns, Manohar et al.)
Latency-insensitive design
– Carloni and a few follow-ups (large overhead)
– Wire pipelining: Svensson, Nookala, Casu, …
Interlock pipelines (H. Jacobson et al.)
De-synchronization
– J. Cortadella et al.
– V. Varshavsky
Synchronous implementations of CSP
– J. O’Leary et al.
– A. Peeters et al.
Summary
SELF: a specific protocol and implementation for elastic systems with very small overhead buffering
Compositional theory proving correctness (Krstic et al., FMCAD’06)
Library of controllers has been designed and their correctness verified
Elasticization CAD in progress
New micro-architectural opportunities based on bubbles and variable-latency units