Introduction to basic concepts on asynchronous circuit design

Bridging the gap between asynchronous design and designers
Peter A. Beerel (Fulcrum Microsystems, Calabasas Hills, CA, USA)
Jordi Cortadella (Universitat Politècnica de Catalunya, Barcelona, Spain)
Alex Kondratyev (Cadence Berkeley Labs, Berkeley, CA, USA)
Outline
1. Basic concepts on asynchronous circuit design
Tea Break
2. Logic synthesis from concurrent specifications
3. Synchronization of complex systems
Lunch
4. Design automation for asynchronous circuits
Tea Break
5. Industrial experiences
Basic concepts on asynchronous circuit design
Outline
What is an asynchronous circuit?
Asynchronous communication
Asynchronous design styles (Micropipelines)
Asynchronous logic building blocks
Control specification and implementation
Delay models and classes of async circuits
Channel-based design
Why asynchronous circuits?
Synchronous circuit
[Figure: pipeline of registers (R) alternating with combinational logic blocks (CL); all registers share the clock CLK]
Implicit (global) synchronization between blocks
Clock period > Max Delay (CL + R)
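The clock constraint can be illustrated with a small sketch. The stage delays and register overhead are made-up numbers, and `min_clock_period` is a hypothetical helper rather than part of any tool; it only shows that the slowest stage sets the pace for every stage.

```python
def min_clock_period(cl_delays, register_delay):
    """Lower bound on the clock period: the slowest stage (CL + R)."""
    return max(cl + register_delay for cl in cl_delays)

# Hypothetical worst-case delays (ns) of three combinational blocks.
stages = [2.0, 3.5, 1.2]
bound = min_clock_period(stages, register_delay=0.5)
print(bound)  # the clock period must exceed this value
```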
Asynchronous circuit
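[Figure: pipeline of registers (R) alternating with combinational logic blocks (CL); neighbouring stages are linked by Req and Ack wires]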
Explicit (local) synchronization:
Req / Ack handshakes
Motivation for asynchronous design
Asynchronous design is often unavoidable:
  • Asynchronous interfaces, arbiters, etc.
Modern clocking is multi-phase and distributed, and virtually "asynchronous" (cf. GALS, next slide):
  • Mesochronous (clock travels together with data)
  • Local (possibly stretchable) clock generation
Robust asynchronous design flows are emerging
(e.g. VLSI programming from Philips, Balsa from
Univ. of Manchester, NCL from Theseus Logic, ...)
Globally Async Locally Sync (GALS)
[Figure: a clocked domain (registers R, combinational logic CL, local CLK) inside an async-to-sync wrapper that exchanges Req1..Req4 / Ack1..Ack4 handshakes with the asynchronous world]
Key Design Differences
Synchronous logic design:
  • Proceeds without taking timing correctness (hazards, signal ack-ing, etc.) into account
  • Combinational logic and memory latches (registers) are built separately
  • Static timing analysis of CL is sufficient to determine the Max Delay (clock period)
  • Fixed set-up and hold conditions for latches
Key Design Differences
Asynchronous logic design:
  • Must ensure hazard-freedom, signal ack-ing, local timing constraints
  • Combinational logic and memory latches (registers) are often mixed in "complex gates"
  • Dynamic timing analysis of logic is needed to determine relative delays between paths
  • To avoid complex issues, circuits may be built as delay-insensitive and/or speed-independent (as discussed later)
Verification and Testing Differences
Synchronous logic verification and testing:
  • Only functional correctness is verified and tested
  • Testing can be done with standard ATE and at low speed (but high speed may be required for DSM)
Asynchronous logic verification and testing:
  • In addition to functional correctness, the temporal aspect is crucial: e.g. causality and order, deadlock-freedom
  • Testing must cover faults in complex gates (logic + memory) and must proceed at the normal operation rate
  • Delay fault testing may be needed
Synchronous communication
[Figure: data waveform sampled on clock edges, carrying the bit sequence 1 1 0 0 1 0]
Clock edges determine the time instants at which data must be sampled
Data wires may glitch between clock edges (set-up/hold times must be satisfied)
Data are transmitted at a fixed rate (clock frequency)
Dual rail
[Figure: dual-rail waveform of a bit sequence]
Two wires per bit, each either L (low) or H (high)
  • "LL" = "spacer", "LH" = "0", "HL" = "1"
n-bit data communication requires 2n wires
Each bit is self-timed
Other delay-insensitive codes exist (e.g. k-of-n), as does event-based signalling (choice criteria: pin and power efficiency)
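A minimal sketch of the dual-rail code listed on this slide. It assumes that the first wire of each pair is the "true" rail, so that "HL" encodes 1 and "LH" encodes 0; the function names are hypothetical.

```python
SPACER = (0, 0)  # "LL": no data on the wire pair

def encode_bit(bit):
    """Return (true_rail, false_rail): 1 -> HL, 0 -> LH."""
    return (1, 0) if bit else (0, 1)

def decode_bit(rails):
    """Return 0 or 1 for a valid codeword, None for the spacer."""
    if rails == SPACER:
        return None
    if rails == (1, 1):
        raise ValueError("HH is not a legal dual-rail codeword")
    return 1 if rails == (1, 0) else 0

def encode_word(bits):
    """n data bits occupy n wire pairs, i.e. 2n wires."""
    return [encode_bit(b) for b in bits]

word = encode_word([1, 0, 1])
print(word)
print([decode_bit(r) for r in word])
```

Because each pair announces its own validity, the receiver needs no separate timing reference to know when a bit has arrived.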
Bundled data
[Figure: bundled-data waveform carrying the bit sequence 1 1 0 0 1 0, accompanied by a validity signal]
Validity signal
  • Similar to an aperiodic local clock
n-bit data communication requires n+1 wires
Data wires may glitch when not valid
Signaling protocols
  • level sensitive (latch)
  • transition sensitive (register): 2-phase / 4-phase
Example: memory read cycle
[Figure: timing diagram of a memory read cycle: signal A marks a valid address on the Address bus, signal D marks valid data on the Data bus]
Transition signaling, 4-phase
Example: memory read cycle
[Figure: timing diagram of a memory read cycle: signal A marks a valid address on the Address bus, signal D marks valid data on the Data bus]
Transition signaling, 2-phase
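The difference between the two protocols can be sketched as event traces. This assumes that A (address valid) plays the role of the request and D (data valid) the role of the acknowledge; the helper names are hypothetical.

```python
def four_phase(transfers):
    """Return-to-zero signalling: four events per read cycle."""
    trace = []
    for _ in range(transfers):
        trace += ["A+", "D+", "A-", "D-"]
    return trace

def two_phase(transfers):
    """Every transition carries meaning: two events per read cycle."""
    trace, level = [], 0
    for _ in range(transfers):
        level ^= 1                      # the wires simply toggle
        sign = "+" if level else "-"
        trace += ["A" + sign, "D" + sign]
    return trace

print(four_phase(2))  # 8 events for two reads
print(two_phase(2))   # 4 events for two reads
```

The 2-phase variant halves the number of transitions per transfer, at the cost of logic that must react to both rising and falling edges.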
Asynchronous modules
[Figure: a DATA PATH block (Data IN, Data OUT, start, done) driven by a CONTROL block with handshake ports req in / ack in and req out / ack out]
Signaling protocol:
reqin+ start+ [computation] done+ reqout+ ackout+ ackin+
reqin- start- [reset] done- reqout- ackout- ackin-
(more concurrency is also possible)
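The protocol can be written down as a trace and checked mechanically. The sketch below reads the two truncated events in the listing as start- and ackin- (an assumption), and `is_consistent` is a hypothetical helper that only verifies that each signal alternates between rising and falling.

```python
CYCLE = ["reqin+", "start+", "done+", "reqout+", "ackout+", "ackin+",
         "reqin-", "start-", "done-", "reqout-", "ackout-", "ackin-"]

def is_consistent(trace):
    """True if every signal alternates + and -, starting from level 0."""
    level = {}
    for event in trace:
        name, rising = event[:-1], event[-1] == "+"
        if level.get(name, False) == rising:
            return False  # two rises or two falls in a row
        level[name] = rising
    return True

print(is_consistent(CYCLE * 3))           # repeated cycles stay consistent
print(is_consistent(["reqin+", "reqin+"]))  # a double rise is rejected
```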
Asynchronous latches: C element
[Figure: C-element symbol (inputs A, B; output Z) and its static logic implementation between Vdd and Gnd [van Berkel 91]]
A  B  Z+
0  0  0
0  1  Z
1  0  Z
1  1  1
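The C-element's next-state behaviour (output follows the inputs when they agree, holds otherwise) can be sketched as a pure function; `c_element` is a hypothetical name.

```python
def c_element(a, b, z):
    """Next output Z+ given inputs a, b and the current output z."""
    if a == b:
        return a   # inputs agree: the output follows them
    return z       # inputs disagree: hold the previous value

z = 0
for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
    z = c_element(a, b, z)
    print(a, b, z)
```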
C-element: Other implementations
[Figure: two transistor-level C-element implementations between Vdd and Gnd, with inputs A, B and output Z: dynamic, and quasi-static (with a weak inverter)]
Dual-rail logic
[Figure: dual-rail AND gate with inputs A.t, A.f, B.t, B.f and outputs C.t, C.f]
Valid behavior for monotonic environment
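One common way to realise a dual-rail AND is sketched below; it is not necessarily the circuit drawn on the slide. Each operand is assumed to be a (t, f) pair, and the names are hypothetical.

```python
TRUE, FALSE, SPACER = (1, 0), (0, 1), (0, 0)

def dual_rail_and(a, b):
    """a and b are (t, f) pairs; returns (C.t, C.f)."""
    a_t, a_f = a
    b_t, b_f = b
    c_t = a_t & b_t   # true only when both inputs are true
    c_f = a_f | b_f   # false as soon as one input is false
    return (c_t, c_f)

print(dual_rail_and(TRUE, TRUE))      # valid "1"
print(dual_rail_and(TRUE, FALSE))     # valid "0"
print(dual_rail_and(SPACER, SPACER))  # spacer in, spacer out
```

With inputs that only move monotonically between spacer and a valid codeword, the output rails never glitch, which is the sense in which the behavior is valid for a monotonic environment.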
Completion detection
[Figure: the outputs of a dual-rail logic block feed a completion detection tree of C-elements (C) that produces done]
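A behavioural sketch of completion detection, under the assumption that each output bit is a (t, f) pair and that the tree can be modelled as one n-input C-element over the per-bit validity signals. The function names are hypothetical.

```python
def bit_valid(rails):
    """1 once either rail of the pair has risen."""
    t, f = rails
    return t | f

def completion(word, done):
    """Rise when all bits are valid, fall when all are spacers, else hold."""
    valid = [bit_valid(r) for r in word]
    if all(valid):
        return 1
    if not any(valid):
        return 0
    return done

done = 0
for word in [[(1, 0), (0, 0)], [(1, 0), (0, 1)], [(0, 0), (0, 1)], [(0, 0), (0, 0)]]:
    done = completion(word, done)
    print(word, done)
```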
Differential cascode voltage switch logic
[Figure: DCVSL 3-input AND/NAND gate: an N-type transistor network with dual-rail inputs A, B, C (A.t/A.f, B.t/B.f, C.t/C.f), controlled by start, producing Z.t, Z.f and done]
Examples of dual-rail design
Asynchronous dual-rail ripple-carry adder (A. Martin, 1991)
  • Critical delay is proportional to log N on average (N = number of bits)
  • 32-bit adder delay (1.6 µm MOSIS CMOS): 11 ns versus 40 ns for synchronous
  • Async cell transistor count = 34 versus synchronous = 28
More recent success stories (modularity and
automatic synthesis) of dual-rail logic from
Null-Convention Logic (Theseus Logic)
Bundled-data logic blocks
[Figure: a single-rail logic block in parallel with a delay element that turns start into done]
Conventional logic + matched delay
Micropipelines (Sutherland 89)
Micropipeline (2-phase) control blocks
[Figure: symbols of the control blocks: Join (C), Merge, Select (in, sel, outt, outf), Toggle (in, out0, out1), Call, and Request-Grant-Done (RGD) Arbiter]
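In 2-phase signalling an event is any level change on a wire, so some of these blocks reduce to familiar gates: Merge behaves like an XOR and Join like a Muller C-element. The sketch below models three of the blocks on wire levels; the class names are hypothetical, and steering the first Toggle event to out0 is an assumption.

```python
def merge(a, b):
    """Merge as XOR: a level change on either input changes the output."""
    return a ^ b

class Join:
    """Join as a C-element: the output changes only after both inputs have."""
    def __init__(self):
        self.out = 0
    def step(self, a, b):
        if a == b:
            self.out = a
        return self.out

class Toggle:
    """Steers successive input events alternately to out0 and out1."""
    def __init__(self):
        self.prev_in, self.out, self.next = 0, [0, 0], 0
    def step(self, level):
        if level != self.prev_in:      # an event arrived on the input
            self.prev_in = level
            self.out[self.next] ^= 1   # pass the event to one output
            self.next ^= 1             # the other output is next
        return tuple(self.out)

t = Toggle()
print([t.step(v) for v in (1, 0, 1)])
```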
Micropipelines (Sutherland 89)
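[Figure: micropipeline with latches (L) and logic blocks in the data path, controlled by a chain of C-elements (C) and delay elements; handshake signals Rin, Ain, Rout, Aout]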
Data-path / Control
[Figure: data path of latches (L) alternating with logic blocks, and a separate CONTROL block with handshake signals Rin, Aout, Rout, Ain]
Control specification
[Figure: signal transition graph over A+, B+, A-, B- with the corresponding waveforms of A and B; A is an input, B is an output]
Control specification
[Figure: signal transition graph over A+, B-, A-, B+ and the corresponding circuit with signals A and B]
Control specification
[Figure: signal transition graph over A+, B+, C+, A-, B-, C- and the corresponding circuit with signals A, B and C]
Control specification
[Figure: a second signal transition graph over A+, B+, C+, A-, B-, C- and the corresponding circuit with signals A, B and C]
Control specification
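[Figure: FIFO controller (cntrl) with ports Ri, Ao, Ro, Ai; its signal transition graph over Ri+, Ro+, Ao+, Ai+, Ri-, Ro-, Ao-, Ai-; and an implementation using two C-elements]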
A simple filter: specification
[Figure: filter block with input channel (IN, Rin, Ain) and output channel (OUT, Rout, Aout)]
y := 0;
loop
    x := READ (IN);
    WRITE (OUT, (x+y)/2);
    y := x;
end loop
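The specification translates almost line for line into a Python generator. This is only a functional sketch: the handshakes on (Rin, Ain) and (Rout, Aout) are abstracted into iteration, the name `averaging_filter` is hypothetical, and true division is assumed for `/2`.

```python
def averaging_filter(samples):
    """Yield (x + y) / 2 for each input x, where y is the previous input."""
    y = 0
    for x in samples:        # x := READ(IN)
        yield (x + y) / 2    # WRITE(OUT, (x + y) / 2)
        y = x                # y := x

print(list(averaging_filter([2, 4, 8])))
```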
A simple filter: block diagram
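[Figure: block diagram of the filter: latches x and y, adder +, and a control block, with data ports IN and OUT and handshake pairs (Rin, Ain), (Rx, Ax), (Ry, Ay), (Ra, Aa), (Rout, Aout)]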
• x and y are level-sensitive latches (transparent when R=1)
• + is a bundled-data adder (matched delay between Ra and Aa)
• Rin indicates the validity of IN
• After Ain+ the environment is allowed to change IN
• (Rout,Aout) control a level-sensitive latch at the output
A simple filter: control spec.
[Figure: filter block diagram with the control specification given as a signal transition graph over the rising and falling transitions of Rin, Ain, Rx, Ax, Ry, Ay, Ra, Aa, Rout, Aout]
A simple filter: control impl.
[Figure: control implementation using a C-element (C) that connects Rin, Ain, Rx, Ax, Ry, Ay, Ra, Aa, Rout, Aout, shown next to the signal transition graph over the rising and falling transitions of these signals]
Taking delays into account
[Figure: circuit over signals x, y, z with inverted signals x' and z', and its specification over x+, x-, y+, y-, z+, z-]
Delay assumptions:
• Environment: 3 time units
• Gates: 1 time unit

events:  x+   x'-  y+   z+   z'-  x-   x'+  z-   z'+  y-
time:    3    4    5    6    7    9    10   12   13   14
Taking delays into account
[Figure: circuit over signals x, y, z with inverted signals x' and z'; one element is marked "very slow"]
Delay assumptions: unbounded delays

events:  x+   x'-  y+   z+   x-   x'+  y-   failure!
time:    3    4    5    6    9    10   11
Gate vs wire delay models
Gate delay model: delays in gates, no delays in wires
Wire delay model: delays in gates and wires
Delay models for async. circuits
Bounded delays (BD): realistic for gates and wires.
  • Technology mapping is easy, verification is difficult
Speed independent (SI): unbounded (pessimistic) delays for gates and "negligible" (optimistic) delays for wires.
  • Technology mapping is more difficult, verification is easy
Delay insensitive (DI): unbounded (pessimistic) delays for gates and wires.
  • DI class (built out of basic gates) is almost empty
Quasi-delay insensitive (QDI): delay insensitive except for critical wire forks (isochronic forks).
  • In practice it is the same as speed independent
[Figure: nested circuit classes: DI inside SI ≈ QDI, inside BD]
Channel-Based Design
[Figure: a synchronous system, whose blocks share a clock, next to an asynchronous system, whose blocks are connected by asynchronous channels]
Synchronization and communication between blocks
implemented with handshaking using asynchronous
channels by sending/receiving “data tokens”
Channel Design – Single Rail
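[Figure: sender and receiver connected by Req, Ack and Data wires; timing diagram of the handshake with transitions numbered 1 to 4 and the interval in which data is stable]
4-phase bundled-data channel
Features
  • One request wire
  • One wire per data bit
  • One acknowledgment wire
  • Has timing assumptions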
Channel Design: Dual Rail & 1-of-N
Dual Rail
  • Two wires per data bit
  • One acknowledgment wire
  • Advantage: supports delay-insensitive design

DataT  DataF  Logical Value
0      0      Reset
0      1      0
1      0      1
1      1      Invalid

1-of-N
  • Generalization of dual-rail
[Figure: sender and receiver connected by 1-of-N Data wires and an Ack wire; timing diagram of the handshake with transitions numbered 1 to 4]
4-phase 1-of-N channel
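The feature lists of the channel styles imply different wire counts for the same payload. The sketch below tallies them under two assumptions: each channel has a single acknowledgment wire, and a 1-of-N channel uses groups of N wires with N a power of two, so each group carries log2(N) bits. The function names are hypothetical.

```python
import math

def bundled_data_wires(n_bits):
    """One wire per data bit, plus request and acknowledge."""
    return n_bits + 2

def dual_rail_wires(n_bits):
    """Two rails per data bit, plus acknowledge."""
    return 2 * n_bits + 1

def one_of_n_wires(n_bits, n):
    """Groups of n wires (n a power of two), plus acknowledge."""
    bits_per_group = int(math.log2(n))
    groups = math.ceil(n_bits / bits_per_group)
    return groups * n + 1

for style, wires in [("bundled data", bundled_data_wires(8)),
                     ("dual rail", dual_rail_wires(8)),
                     ("1-of-4", one_of_n_wires(8, 4))]:
    print(style, wires)
```

For an 8-bit payload, dual rail and 1-of-4 need the same number of wires, but 1-of-4 raises one wire per two bits instead of one wire per bit, so it makes fewer transitions per transfer.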
Anatomy of a Channel-Based Asynchronous Design
  • Architecture is typically a multi-level hierarchy of communicating blocks
  • Yields a hierarchical netlist of cells, where at each level blocks communicate along channels
[Figure: example hierarchy: an ASIC containing Main FSM, Memory, Register Bank, Multiplier, Adder, Subtract/Divider, Adder/Mult. and registers Reg A, Reg B, Reg C, connected by channels; the adder is decomposed into leaf cells FA_{N-1}, FA_{N-2}, FA_{N-3}, ..., FA_0 fed by B_{N-1}, B_{N-2}, B_{N-3}, ...]
Asynchronous Cells
[Figure: cell F with input channels on one side and output channels on the other]
Definition
  • Smallest element that communicates with its neighbors along asynchronous channels
Functionality
  • Reads a subset of input channels
  • Computes F and writes to a subset of output channels
Linear Pipelines
  • Only one input and one output channel
[Figure: linear pipeline cell F with one input channel and one output channel]
Cells for Non-Linear Pipelines
• Non-Linear Pipelines
  • Joins and Forks
  • Conditional Joins: Read only some of the input channels
  • Conditional Splits: Write only to some of the output channels
[Figure: four cells F: Join, Fork, Conditional Join, Conditional Split]
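These cell types can be sketched at the token level by modelling each channel as a queue. This abstracts away the handshakes entirely; the function names are hypothetical, and steering the conditional split with a separate control channel is an assumption.

```python
from collections import deque

def join(in_a, in_b, out, f):
    """Read one token from each input channel and write f(a, b)."""
    out.append(f(in_a.popleft(), in_b.popleft()))

def fork(inp, out_a, out_b):
    """Copy one input token to both output channels."""
    token = inp.popleft()
    out_a.append(token)
    out_b.append(token)

def conditional_split(ctrl, inp, out_a, out_b):
    """Write the input token to only one output, chosen by a control token."""
    token = inp.popleft()
    (out_b if ctrl.popleft() else out_a).append(token)

a, b, total = deque([1]), deque([2]), deque()
join(a, b, total, lambda x, y: x + y)
print(list(total))
```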
Template-Based Leaf-Cell Design
• Each pipeline style (QDI, timed…) has a different blueprint
• Create a library using a blueprint to implement the lowest level
communicating blocks
[Figure: blueprint for a QDI N-input M-output pipeline stage built from a function block F, LCD and RCD blocks, and C-elements (C); examples shown: a 2-input 1-output pipeline stage and a 1-input 2-output pipeline stage]
Template-Based Leaf-Cell Design
• Pros
  • Enables fine-grain 2-D pipelining, yielding high performance
  • Simplifies logic synthesis by enabling simple control circuit generation and re-use of typical datapath synthesis
  • Leaf cells can be laid out and verified to create a leaf-cell library, localizing timing assumptions
• Cons
  • Unified template may not be optimal in all cases
  • In particular, less effective for non-pipelined architectures with more complicated control
Motivation (designer’s view)
Modularity for system-on-chip design
  • Plug-and-play interconnectivity
Average-case performance
  • No worst-case delay synchronization
Many interfaces are asynchronous
  • Buses, networks, ...
Motivation (technology aspects)
Low power
  • Automatic clock gating
Electromagnetic compatibility
  • No peak currents around clock edges
Security
  • No "electromagnetic difference" between logical "0" and "1" in dual-rail code
Robustness
  • High immunity to technology and environment variations (temperature, power supply, ...)
Dissuasion
Concurrent models for specification
  • CSP, Petri nets, ...: no more FSMs
Difficult to design
  • Hazards, synchronization
Complex timing analysis
  • Difficult to estimate performance
Difficult to test
  • No way to stop the clock
But ... some success stories
Philips
AMULET microprocessors
Sharp
Intel (RAPPID)
Start-up companies:
  • Theseus Logic, Fulcrum Microsystems, Self-Timed Solutions
Recent blurb: "It's Time for Clockless Chips", by Claire Tristram (MIT Technology Review, v. 104, no. 8, October 2001: http://www.technologyreview.com/magazine/oct01/tristram.asp)
….