Asynchronous Logic: A Computer Systems Perspective

Asynchronous Logic: A Computer Systems
Perspective
Rajit Manohar
Computer Systems Lab, Cornell
http://vlsi.cornell.edu/
Approaches to computation
Discrete Time
Continuous Time
Discrete Value
Digital, synchronous
logic
Digital,
asynchronous
logic
Continuous Value
Switched-capacitor
analog
General analog
computation
clock
receiver
sender
data
signal value
Asynchronous logic
time
sampling window
data
req
ack
ack
data 1
data 0
acknowledge
receiver
data 1
sender
receiver
sender
data 0
X
X
X
X
data
request
acknowledge
Explicit versus implicit synchronization
A
B
A1
B1
C
C1
A
B
A1
B1
C
C1
C2
A2
B2
A2
C2
B2
C3
A3
B3
A3
C4
B4
A4
Synchronous
computation
B4
C4
A4
Asynchronous
computation
time
time
B3
C3
Differences at a glance
Asynchronous
Synchronous
Continuous time computation.
Discrete time computation.
Observed delay is the average over
possible circuit paths.
Observed delay is the maximum over
possible circuit paths.
Local switching from data-driven
computation. Can lead to low-power
operation.
Global activity from clock-driven
computation. Low power requires
careful clock gating.
Throughput determined by device
speed. (Margins needed for some
circuit families.)
Throughput determined by device
speed and additional operating
margins.
Data must be encoded, requiring
additional wires for signals.
Unencoded data acceptable due to
global clock network.
A little bit of (biased) history…
Binary addition
(von Neumann)
1940
Switching
theory (Miller)
Illiac II (UIUC/Muller)
1950
Macro modular systems
(Clarke, Molnar, Stucki,…)
Atlas, MU5 (Manchester)
1960
1970
Flow tables
(Huffman)
Trace theory (Snepscheut)
HiAER (Cauwenberghs)
Syntactic compilation
(Martin, Rem)
1980
ARM with
caches (Furber)
1990
Micro pipelines
(Sutherland)
System Timing (Seitz, in
Mead/Conway)
Asynchronous
CPU (Martin)
General hazard-free synthesis
AER (Mead lab)
Pipelined FPGAs
(Manohar)
FACETS
(Heidelberg)
2000
17FO4
MIPS CPU
(Martin)
2010
UltraSparc IIIi
mem controller
(Sun)
Commercial
80C51 (Phillips)
TrueNorth
(IBM/Cornell)
10G Ethernet
switch
(Fulcrum)
Event-driven
CPU (Manohar)
Sensory systems (retina, cochlea, etc.)
SpiNNaker
(Furber)
Neurogrid (Boahen)
I. Exploiting the gap between average and max delay
1
1
0
0
0
1
0
1
1
0
0
+
0
1
1
0
1
0
1
0
0
0
1
1
0
• Slow part: computing carries
❖
… but only in the worst-case!
Theorem [von Neumann, 1946]. The average-case latency for a ripple
carry binary adder is O(log N) for i.i.d. inputs
A.W. Burks, H.H. Goldstein, J. von Neumann. Preliminary discussion
of the logical design of an electronic computing instrument. (1946)
I. Exploiting the gap between average and max delay
• Standard adder tree
❖
O(log N) latency
• Hybrid adder tree
❖
Tree adder + ripple adder
Theorem [Winograd, 1965]. The worst-case latency for binary
addition is Ω(log N)
Theorem. The average-case latency of the hybrid asynchronous adder
is O(log log N) for i.i.d. inputs
Theorem. The average-case latency of the hybrid asynchronous adder
is optimal for any input distribution
S.O. Winograd. On the time required to perform addition. JACM (1965)
R. Manohar and J. Tierno. Asynchronous parallel prefix
computation. IEEE Transactions on Computers (1998)
II. Exploiting data-driven power management
• Low power FIR filter
❖
External, synchronous I/O
❖
Monitor occupancy of FIFOs
❖
Speed up/slow down using
adaptive power supply
DC/DC
converter
Vdd
• Break-point add/multiply
❖
❖
Detect when top half of
operand is required
State
FIR filter
Dynamically switch between
full width and half width arithmetic
L. Nielsen and J. Sparso. A Low-power Asynchronous
Datapath for a FIR filter bank. Proc. ASYNC (1996)
0 1
0 1 0 1
II. Exploiting data-driven power management
0 0 0 ··· 0 0 0 1
0 1
0 1 0 1 + 0 0 0 ··· 0 0 0 1
❖ Compress leading 0’s and 1’s0 0 0 · · · 0 0 1 0
• Width-adaptive numbers
SPEC: 124.m88ksim
Number: 1.4 ⇥ 1014 ⇤ 247
0 1
+
0 1
Average:08.4
0 1 0
90
70
60
50
40
Average: 10.9
Average: 8.8
30
20
no compaction
simple compaction
full compaction
10
0
5
10
15
20
Operand Widths
SPEC: 129.compress
(%)
100
25
70
60
50
40
30
Average: 10.5
Averag
20
no co
simple co
full co
10
0
30
Average: 10.0
80
receiver
Cumulative Distribution Function (%)
80
0
sender
Cumulative Distribution Function (%)
90
SPEC: 126.gcc
100
Number:
5
29
⇤
10
R. Manohar. Width-Adaptive Data Word
100
Architectures. Proc. ARVLSI (2001)
(%)
100
128−bit datapath
receiver
v/s
0 1
+
0 1
0 1 0
sender
0 0 0 ··· 0 0 0 1
+ 0 0 0 ··· 0 0 0 1
0 0 0 ··· 0 0 1 0
8−bits per VDW block
5
2
15
Operand Widths
SPEC: 132.ijpeg
20
25
III. Exploiting elasticity of pipelines
f
g
h
Asynchronous computation is*
robust to changes in circuitlevel pipelining!
f
g
h
*R. Manohar, A. J. Martin. Slack Elasticity in Concurrent
Computing. Proc. Mathematics of Program Construction (1998)
III. Exploiting elasticity of pipelines
• Asynchronous FIR filter
•
Data arrives at variable rates
•
Data rates are predictable
• Fixed amount of time for async
computation
1
2
3
• Variable number of clock cycles
for the fixed time budget
M. Singh, J.A. Tierno, A. Rylyakov, S. Rylov, S.M. Nowick. An
adaptively pipelined mixed synchronous-asynchronous digital
FIR filter chip operating at 1.3 GHz. Proc. ASYNC (2002)
9
III. Exploiting elasticity of pipelines
• FPGA throughput low due to overhead of
programmable interconnect
• ~3x improvement in peak throughput by
transparent interconnect pipelining
LB
CB
LB
CB
CB
SB
CB
SB
LB
CB
LB
CB
CB
SB
CB
SB
J. Teifel and R. Manohar. Highly pipelined asynchronous
FPGAs. Proc. FPGA (2004)
IV. Exploiting timing robustness
Asynchronous FPGA Test Data
Throughput (MHz)
1200
1000
Process: TSMC 0.18um
Nominal Vdd: 1.8V
77K
12K
800
294K
600
400
400K
200
Xilinx Virtex
0
0.5
1
1.5
Voltage (V)
2
2.5
D. Fang, J. Teifel, and R. Manohar. A High-Performance
Asynchronous FPGA: Test Results. Proc. FCCM (2005)
IV. Exploiting timing robustness
• “GasP” logic family
6-10 FO4
w
•
Self-reset
Track
reset
• No margins on control
signals
• Standard datapath
• Latest 40nm chip operates
>6 GHz
RA(i)
Single-track
control wire
RA(i+1)
Track
set
Wait for
L set,
R reset
Datapath
driver
w
I. Sutherland, S. Fairbanks. GasP: a minimal
FIFO control. Proc. ASYNC (2001)
V. Exploiting low noise and low EMI
time domain
frequency domain
0
100
200
300
400
MHz
Synchronous 80C51
(Phillips)
0
100
200
300
400
MHz
Asynchronous 80C51
(Phillips)
J. Kessels, T. Kramer, G. den Besten, A. Peeters, V. Timm. Applying
Asynchronous Circuits in Contactless Smart Cards. Proc. ASYNC (2000)
VI. Exploiting continuous-time information processing
• Continuous-time
digital signal processing
• Automatic adaptation
to input bandwidth
Nyquist sampling
Level-crossing sampling
• No aliasing from sampling
process
B. Schell, Y. Tsividis. A clockless adc/dsp/dac system with activitydependent power dissipation and no aliasing. Proc. ISSCC (2008)
VI. Exploiting continuous-time information processing
• Address-event representation in neuromorphic systems
❖
“Time represents itself”
❖
Spike timing and coincidence for information processing
• Asynchronous logic
❖
Temporal resolution is decoupled from throughput
❖
Automatic power management
• Projects
❖
Silicon retinas, cochleas, …
❖
More recent: FACETS, HiAER, Neurogrid, neuroP, SpiNNaker, TrueNorth
A note on oscillations in asynchronous logic
• Well-known result in asynchronous circuit theory
❖
Signal transitions: look at the ith occurrence of transition t
❖
time(t, i) = the actual time of this event
0
time
(t, i) = f (t) + p⇤ ⇥ i
❖
• Then:
|time(t, i)
time0 (t, i)| < B
❖
Bounded, independent of i, and t
❖
If the circuit is strongly connected, this result holds
• More recently: stronger periodicity under more general conditions
J. Magott. Performance evaluation of concurrent
systems using petri nets. IPL (1984)
S.M. Burns, A.J. Martin. Performance Analysis and
Optimization of Asynchronous Circuits. Proc. ARVLSI (1990)
W. Hua and R. Manohar. Exact timing analysis for concurrent
systems. To appear, work-in-progress session, DAC (2016)
Asynchronous logic in TrueNorth
• Spikes trigger activity…
❖
… so they should be asynchronous
… so add a periodic event to the system (the clock)
Token Control
❖
Scheduler
• Neurons leak periodically…
At the external timing “tick”:
SRAM
Spikes travel through asynchronous network
• Receiver buffers spikes until it is time to
deliver them
Digital, asynchronous logic
Router
• If the neuron spikes, send output spike to
destinations, with specified delay
Neuron
• Update each neuron, delivering spikes plus
any local state update
Asynchronous logic in TrueNorth
• Spike communication network
❖
Links are a shared resource
❖
Links might be contended
• Spike delivery times can differ based on
traffic
❖
Different runs of the same network might
produce different spike patterns
• Instead, we use a spike scheduler
❖
Guarantees deterministic computation
❖
… and hence, repeatability
Summary
• Asynchronous logic has a long history
• Some key properties exploited in projects
❖
Gap between average and maximum delay for a computation
❖
Automatic data-driven power management
❖
Elastic pipelining
❖
Robustness to timing uncertainty/variation
❖
Reduced EMI and noise
❖
Continuous-time information representation
Acknowledgments
• The “async” community
• Students:
❖
Ned Bingham, Wenmian Hua, Sean Ogden, Praful Purohit, Tayyar Rzayev, Nitish
Srivastava
❖
John Teifel, Clinton Kelly, Virantha Ekanayake, Song Peng, David Biermann,
David Fang, Chris LaFrieda, Filipp Akopyan, Basit Riaz Sheikh, Benjamin Tang,
Nabil Imam, Sandra Jackson, Carlos Otero, Benjamin Hill, Stephen Longfield,
Jonathan Tse, Robert Karmazin
• Collaborators:
❖
General: David Albonesi, Sunil Bhave, Francois Guimbretiere, Amit Lal, Yoram
Moses, Ivan Sutherland, Yannis Tsividis
❖
Neuromorphic: John Arthur, Kwabena Boahen, Tobi Delbruck, Chris Eliasmith,
Giacomo Indiveri, Paul Merolla, Dharmendra Modha
• Support: AFRL, DARPA, IARPA, NSF, ONR, IBM, Lockheed, Qualcomm,
Samsung, SRC