Regular Silicon Structures a.k.a VLSI Building Blocks

CS250 VLSI Systems Design
Fall 2009
John Wawrzynek, Krste Asanović, with John Lazzaro
Lecture 11, Regular Structures
CS250, UC Berkeley Fall ‘09
Introduction
‣ We've experienced synthesis and standard cell place and route.
‣ Is that all there is? We can implement any digital system with only primitive logic gates and flip-flops.
‣ If so, chip implementations would be pretty inefficient (and boring to do!)
‣ Key questions:
  ‣ Where can special circuit- and layout-generators provide advantage, and how much?
‣ Examples with a clear advantage: RAM blocks
‣ Examples where it is not so clear: cross-bar switches, datapaths, ROMs, multipliers
‣ We’ll start with on-chip RAM
Verilog RAM Specification
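//
// Single-Port RAM with Synchronous Read
//
module v_rams_07 (clk, we, a, di, do);
    input         clk;
    input         we;       // write enable
    input  [5:0]  a;        // address: selects one of 64 words
    input  [15:0] di;       // write data
    output [15:0] do;       // read data

    reg [15:0] ram [63:0];  // 64 x 16 storage array
    reg [5:0]  read_a;      // registered read address

    always @(posedge clk) begin
        if (we)
            ram[a] <= di;
        read_a <= a;        // registering the address makes the read synchronous
    end

    assign do = ram[read_a];
endmodule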
What do the synthesis tools do with this?
Memory-Block Basics
M X N memory:
Depth = M, Width = N.
M words of memory, each word N bits wide.
The address input is log2(M) bits wide.
VLSI tool flows include parameterized RAM generators. The user specifies width, depth, and (sometimes) aspect ratio, and gets simulation and timing models plus layout.
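To make the M X N parameterization concrete, here is a minimal behavioral sketch of a memory whose depth and width are parameters and whose address width is derived as log2(M). The module name `ram_mxn` and its port names are hypothetical and do not come from any particular RAM generator; a real generator also produces timing models and layout, which this sketch does not attempt. It assumes Verilog-2005 for `$clog2` and a depth of at least 2.

```verilog
// Hypothetical behavioral stand-in for a generated M x N RAM block.
// Names and parameters are illustrative assumptions only.
module ram_mxn #(
    parameter DEPTH  = 64,               // M: number of words
    parameter WIDTH  = 16,               // N: bits per word
    parameter ADDR_W = $clog2(DEPTH)     // log2(M) address bits
) (
    input                clk,
    input                we,
    input  [ADDR_W-1:0]  a,
    input  [WIDTH-1:0]   di,
    output [WIDTH-1:0]   dout
);
    reg [WIDTH-1:0]  ram [0:DEPTH-1];
    reg [ADDR_W-1:0] read_a;

    always @(posedge clk) begin
        if (we)
            ram[a] <= di;
        read_a <= a;                     // synchronous read, as in v_rams_07
    end

    assign dout = ram[read_a];
endmodule
```

Instantiating it as `ram_mxn #(.DEPTH(64), .WIDTH(16))` gives the same 64 x 16 shape as the earlier single-port example.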
Internal Memory Organization
2-D array of bit cells. Each cell stores one bit of data.
Special circuit tricks are used for the cell array to improve storage density.
‣ RAM/ROM naming convention:
  ‣ examples: 32 X 8, "32 by 8" => 32 8-bit words
  ‣ 1M X 1, "1 meg by 1" => 1M 1-bit words
Address Decoding
• The function of the address decoder is to generate a one-hot code word from the address.
• The output is used for row selection.
• Many different circuits exist for this function. A simple one is shown in the figure.
[Figure: simple address decoder with input address and outputs sel_row0, sel_row1.]
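As an illustration of the one-hot behavior described above, the following sketch builds one select line per row, each asserted only when the address equals that row's index. The module name `row_decoder` and the `ADDR_W` parameter are hypothetical; this shows only the function and is not necessarily the circuit drawn on the slide.

```verilog
// Hypothetical one-hot row decoder: exactly one sel_row bit is high.
module row_decoder #(
    parameter ADDR_W = 2                       // assumed address width
) (
    input  [ADDR_W-1:0]      address,
    output [(1<<ADDR_W)-1:0] sel_row
);
    genvar i;
    generate
        for (i = 0; i < (1<<ADDR_W); i = i + 1) begin : row
            // Each row line is an AND of the address bits
            // (true or complemented) that spell out index i.
            assign sel_row[i] = (address == i);
        end
    endgenerate
endmodule
```

With `ADDR_W = 2`, an address of 2 drives `sel_row` to `4'b0100`.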
Memory Block Internals
For a read operation, the memory is functionally equivalent to a 2-D array of flip-flops with a tristate output on each, enabled by the row-select lines (sel_row0, sel_row1, ...).
For a write operation, the functional equivalent also includes a means to change the stored state value.
These circuits are just functional abstractions of the actual circuits used.
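The functional abstraction can be written down directly. The sketch below models one column of a two-row memory: each bit cell is a flip-flop whose output is tristated onto a shared column wire, and whose state changes only when its row is selected and write is enabled. The names `bit_cell` and `column_2x1` are hypothetical, and like the slide's circuits this is a functional model only, not how a real RAM array is built.

```verilog
// Hypothetical functional model of one bit cell.
module bit_cell (
    input  clk,
    input  sel_row,   // one-hot row select from the address decoder
    input  we,        // write enable
    input  din,       // write data for this column
    output bit_out    // shared column wire
);
    reg q;

    // Write: the selected row takes the new value.
    always @(posedge clk)
        if (sel_row && we)
            q <= din;

    // Read: only the selected row drives the column wire.
    assign bit_out = sel_row ? q : 1'bz;
endmodule

// Two rows sharing one column wire.
module column_2x1 (
    input        clk,
    input  [1:0] sel_row,
    input        we,
    input        din,
    output       dout
);
    bit_cell row0 (.clk(clk), .sel_row(sel_row[0]), .we(we), .din(din), .bit_out(dout));
    bit_cell row1 (.clk(clk), .sel_row(sel_row[1]), .we(we), .din(din), .bit_out(dout));
endmodule
```

Because the row selects are one-hot, at most one cell drives `dout` at a time; with no row selected the column floats (`z`).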
Storing computational state as charge
State is coded as the amount of energy stored by a capacitor.
State is read by sensing the amount of energy.
[Figure: charged capacitor plates (+++ / ---), with 1.5V labeled.]
Problems: noise changes Q (up or down), parasitics leak or source Q. Fortunately, Q cannot change instantaneously, but that only gets us in the ballpark.
Static Memory Circuits
Dynamic Memory: Circuit remembers for a fraction of a second.
Static Memory: Circuit remembers as long as the power is on.
Non-volatile Memory: Circuit remembers for many years, even if power is off.
Idea: Store each bit with its complement
[Figure: cell storing y and its complement y’ (one at Gnd, the other at Vdd), with lines x and x’ and a “Row” select line.]
Why? We can use the redundant representation to compensate for noise and leakage.
Case #1: y = Gnd, y’ = Vdd ...
[Figure: cell with lines x and x’, the “Row” line, and currents Isd and Ids labeled; y is at Gnd and y’ is at Vdd.]
Case #2: y = Vdd, y’ = Gnd ...
[Figure: cell with lines x and x’, the “Row” line, and currents Isd and Ids labeled; y is at Vdd and y’ is at Gnd.]
Combine both cases to complete circuit
[Figure: “cross-coupled inverters” storing y and y’ between lines x and x’; the levels Gnd, Vth, and Vdd and the noise on each stored node are marked.]
SRAM Challenge #1: It’s so big!
SRAM area is 6X-10X DRAM area, same generation ...
Cell has both transistor types.
Vdd AND Gnd.
Lots of contacts, transistors, two bit lines ...
Capacitors are usually “parasitic” capacitance of wires and transistors.
Recall: Positive edge-triggered flip-flop
A flip-flop “samples” right before the edge, and then “holds” value.
[Figure: flip-flop with input D and output Q, built from a sampling circuit followed by a holds-value circuit, clocked by clk and clk’.]
16 Transistors: Makes an SRAM look compact!
What do we get for the 10 extra transistors? Clocked logic semantics.
Sensing: When clock is low
clk = 0, clk’ = 1
[Figure: the flip-flop circuit (sampling circuit, then holds-value circuit) annotated “Will capture new value on posedge.” and “Outputs last value captured.”]
Capture: When clock goes high
clk = 1, clk’ = 0
[Figure: the flip-flop circuit (sampling circuit, then holds-value circuit) annotated “Remembers value just captured.” and “Outputs value just captured.”]
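The two phases (sensing while the clock is low, capture when it goes high) can be mimicked with two level-sensitive latches in series. This is a behavioral sketch under that assumption; the module name `ms_flipflop` and the signal names `master` and `slave` are hypothetical, and it says nothing about the 16-transistor circuit itself.

```verilog
// Hypothetical behavioral model: a positive edge-triggered flip-flop
// built from a sampling latch followed by a holding latch.
module ms_flipflop (
    input  clk,
    input  d,
    output q
);
    reg master;   // sampling stage
    reg slave;    // holding stage

    // clk = 0: the sampling stage is transparent and tracks d.
    // clk = 1: it remembers the value just captured.
    always @*
        if (!clk) master <= d;

    // clk = 1: the holding stage passes on the value just captured.
    // clk = 0: it keeps outputting the last value captured.
    always @*
        if (clk) slave <= master;

    assign q = slave;
endmodule
```

Seen from outside, `q` changes only on the rising clock edge and takes the value `d` had just before it, which is the clocked logic semantics mentioned earlier.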
Challenge #2: Writing is a “fight”
When the word line goes high, the bitlines “fight” with the cell inverters to “flip the bit” -- must win quickly!
Solution: tune W/L of cell & driver transistors
[Figure: on one side the initial state is Vdd and the bitline drives Gnd; on the other side the initial state is Gnd and the bitline drives Vdd.]
Challenge #3: Preserving state on read
When the word line goes high on a read, the cell inverters must drive the large bitline capacitance quickly, to preserve the state held on the small cell capacitances.
[Figure: cell state Vdd on one side and Gnd on the other, each connected to a bitline that acts as a big capacitor.]
SRAM Operation Summary
Most common is the 6-transistor (6T) cell array.
Word line: selects this cell, and all others in a row.
Write operation: column bit lines are driven differentially (0 on one, 1 on the other). The driven value overwrites the cell state.
Read operation: column bit lines are “precharged”, then released. The cell pulls down one bit line or the other. A “Sense Amplifier” circuit quickly amplifies the difference between the bit lines (saves time & energy).
[Figure: array of 6T cells, each connected to a word line and a pair of bit lines.]
Multi-ported Memory
‣ Motivation:
  ‣ Consider CPU core register file:
    – 1 read or write per cycle limits processor performance.
    – Complicates pipelining. Difficult for different instructions to simultaneously read or write regfile.
    – Common arrangement in pipelined CPUs is 2 read ports and 1 write port.
  ‣ I/O data buffering:
    • dual-porting allows both sides to simultaneously access memory at full bandwidth.
[Figure: dual-port memory with ports Aa, Dina, WEa, Douta and Ab, Dinb, WEb, Doutb.]
[Figure: a data buffer sits between the CPU and a disk or network interface.]
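The 2-read, 1-write arrangement mentioned for pipelined CPUs can be sketched behaviorally as follows. The module name `regfile_2r1w`, the port names, and the choice of combinational (unclocked) read ports are all assumptions made for illustration; real register files differ in read timing and in how they handle a read and write to the same address.

```verilog
// Hypothetical register file with 2 read ports and 1 write port.
module regfile_2r1w #(
    parameter WIDTH  = 32,   // assumed word width
    parameter ADDR_W = 5     // assumed: 32 registers
) (
    input                clk,
    input                we,
    input  [ADDR_W-1:0]  waddr,
    input  [WIDTH-1:0]   wdata,
    input  [ADDR_W-1:0]  raddr_a,
    input  [ADDR_W-1:0]  raddr_b,
    output [WIDTH-1:0]   rdata_a,
    output [WIDTH-1:0]   rdata_b
);
    reg [WIDTH-1:0] regs [0:(1<<ADDR_W)-1];

    // Single write port.
    always @(posedge clk)
        if (we)
            regs[waddr] <= wdata;

    // Two independent read ports: both can be used in the same cycle.
    assign rdata_a = regs[raddr_a];
    assign rdata_b = regs[raddr_b];
endmodule
```

An instruction can then read two source registers while an earlier instruction writes its result, all in one cycle.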
Dual-ported Memory Internals
‣ Add decoder, another set of read/write logic, bit lines, word lines:
[Figure: cell array with two decoders (deca, decb) on the address ports and two sets of r/w logic on the data ports.]
• Example cell: SRAM
[Figure: SRAM cell with two word lines (WL1, WL2) and bit lines labeled b1 and b2 on each side.]
• Repeat everything but the cross-coupled inverters.
• This scheme extends up to a couple more ports, then need to add additional transistors.
Cascading Memory-Blocks
How to make larger memory blocks out of smaller ones.
Increasing the width. Example: given 1Kx8, want 1Kx16
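One way to widen a memory, sketched here under the assumption of a simple synchronous-read 1Kx8 block, is to place two blocks side by side: both see the same address and write enable, and each stores one byte of the 16-bit word. The module names `ram_1kx8` and `ram_1kx16` are hypothetical, and the 1Kx8 model is a behavioral stand-in for a generated block.

```verilog
// Hypothetical behavioral stand-in for a 1K x 8 RAM block.
module ram_1kx8 (
    input        clk,
    input        we,
    input  [9:0] a,
    input  [7:0] di,
    output [7:0] dout
);
    reg [7:0] ram [0:1023];
    reg [9:0] read_a;

    always @(posedge clk) begin
        if (we) ram[a] <= di;
        read_a <= a;
    end

    assign dout = ram[read_a];
endmodule

// 1K x 16 built from two 1K x 8 blocks sharing address and control.
module ram_1kx16 (
    input         clk,
    input         we,
    input  [9:0]  a,
    input  [15:0] di,
    output [15:0] dout
);
    ram_1kx8 lo (.clk(clk), .we(we), .a(a), .di(di[7:0]),  .dout(dout[7:0]));
    ram_1kx8 hi (.clk(clk), .we(we), .a(a), .di(di[15:8]), .dout(dout[15:8]));
endmodule
```

No extra logic is needed for width cascading; the cost is only the wiring of the shared address and control lines.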
Cascading Memory-Blocks
How to make larger memory blocks out of smaller ones.
Increasing the depth. Example: given 1Kx8, want 2Kx8
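One way to deepen a memory, again sketched under the assumption of a synchronous-read 1Kx8 block, is to use the extra address bit as a bank select: it gates the write enable of each block and chooses which block's output is returned. The module names `ram_1kx8` and `ram_2kx8` are hypothetical. Because the read is synchronous in this model, the select bit is delayed by one cycle so that it lines up with the read data.

```verilog
// Hypothetical behavioral stand-in for a 1K x 8 RAM block.
module ram_1kx8 (
    input        clk,
    input        we,
    input  [9:0] a,
    input  [7:0] di,
    output [7:0] dout
);
    reg [7:0] ram [0:1023];
    reg [9:0] read_a;

    always @(posedge clk) begin
        if (we) ram[a] <= di;
        read_a <= a;
    end

    assign dout = ram[read_a];
endmodule

// 2K x 8 built from two 1K x 8 blocks; a[10] selects the bank.
module ram_2kx8 (
    input         clk,
    input         we,
    input  [10:0] a,
    input  [7:0]  di,
    output [7:0]  dout
);
    wire [7:0] dout0, dout1;
    reg        sel_q;   // a[10], delayed to match the synchronous read

    // Only the addressed bank is written.
    ram_1kx8 bank0 (.clk(clk), .we(we & ~a[10]), .a(a[9:0]), .di(di), .dout(dout0));
    ram_1kx8 bank1 (.clk(clk), .we(we &  a[10]), .a(a[9:0]), .di(di), .dout(dout1));

    always @(posedge clk)
        sel_q <= a[10];

    // Output multiplexer picks the bank that was addressed.
    assign dout = sel_q ? dout1 : dout0;
endmodule
```

Unlike width cascading, depth cascading adds decode logic on the write enables and a multiplexer on the read path.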
Other Regular Structures
‣ In Transparencies
DPC DataPath Compiler
Custom Performance with ASIC Effort
‣ 3X Faster than ASIC
‣ 40% Smaller than ASIC
‣ 10X Less Effort than Full Custom

The Tool for High Performance Designs
DPC is the tool used by designers needing high performance chips. They want the performance of full custom design, but with a much shorter design cycle. Datapaths designed with DPC are 3X (three times) faster and 40% smaller than synthesis and place and route. At the same time, it takes 10X (ten times) less effort than full custom design.
In deep sub-micron design, wire length is the dominant factor affecting critical path timing. Cell placement becomes a critical step in chip performance as well as power consumption. With traditional tools, designers are at the mercy of automatic placement tools. The DataPath Compiler (DPC) lets the designer control placement with immediate timing feedback. Multiple what-if experiments can be performed. Using a graphical display that back annotates timing to the schematic, you can easily identify timing problems and rapidly iterate through potential solutions, yielding faster results. DPC is so fast it can place, and then time, a 50K gate datapath in 2-3 minutes.

Useful Identification of Critical Paths
DPC predicts wire lengths early in the design cycle. The resulting timing iterations are both fast and accurate, allowing the designer to quickly iterate to their performance goal. The critical paths are displayed directly on the schematic at all levels of the design hierarchy. In addition, the actual delays of the paths are annotated onto the wires in both the schematic and placement view.
With DPC, you first enter the schematics into the SUE design manager. DPC then uses the schematic as a seed for placement. Once DPC has the placement, it is able to estimate the wiring delays and send this info to a static timing analyzer. The results of static timing analysis are then read back into SUE. The critical path is highlighted in both the schematic and placement view. Additionally, the delay and slope at each node are displayed.

DPC for Critical Path Optimization
In datapath designs, some simple directives by the designer can produce speed-optimized layouts. These directives are easily given and modified in DPC. The placement of components on the schematic directs relative placement in the placement file. Cells can also be hard placed at specific row or column locations and empty space can be indicated. DPC automatically generates the row and column placement and predicted wire lengths. Wire predictions can be used to drive the DPC timing analyzer as well as external timing analyzers including Pearl, PrimeTime and PathMill.
DPC reads the output from static timing analysis and displays the critical paths directly on the schematic. The placement can be modified to optimize critical paths, or extra drivers can be added to the critical path, all in the schematic. You then run through the placement and timing iteration again. This iteration continues until the timing criteria are satisfied. The iteration loop is fast and visual. When you are satisfied with the design performance, the placement file (DEF file) is passed to a routing tool. The routed result can then be read into the MAX Layout Editor to view, and edit if necessary.

DPC Features:
‣ Automatically route, generate parasitics, run timing analysis, and display critical-path timing directly on schematics.
‣ DPC includes its own timing analyzer, or you can use integrated static timing analysis tools such as Pearl, PathMill and PrimeTime.
‣ Fast - can do a 50K gate data path in a few minutes.
‣ Use standard cells or custom datapath cells.
‣ Write out DEF placement information and Verilog netlist for integration with routing tools.
‣ Available on LINUX platforms.

[Figure: DPC Design Flow. Blocks: SUE, DPC Placement & Parasitic Estimation, Timing Analysis, Router, Parasitic Extraction, GDSII.]
[Figures 1-a, 2-a, 2-b. Captions:]
The example above shows the placement generated for our sample 8-bit ALU. A critical path is highlighted in red and yellow on both the schematic and placement views. Timing for other nets is indicated in the menu and new nets can be selected and highlighted.
DPC reduces the time required for placement and timing analysis from days to minutes.
The DEF placement file is sent to a router. The resulting GDSII file can then be read into MAX (Micro Magic’s layout tool).

Micro Magic, Inc., Sunnyvale, CA USA. Phone: 408.414.7647. www.micromagic.com
Copyright 1995-2006, Micro Magic, Inc. All rights reserved.
Regular Structures
‣ In principle, standard cell libraries are sufficient for implementation of any logic circuit. In practice, great for “random logic”, but what about other functions?
‣ With logic synthesis and standard cell place and route as good as it is, is there still a place for special regular structure layout generators?
‣ Exploiting regularity allows us to build special “generators”, which often leads to improved area, energy, and performance.
‣ We looked at RAM, ROM, PLA, shifters
‣ Are there others?
Random Notes
‣ Multiplication: another regular structure example
‣ How do we (or should we) exploit “regular structures” in our design flow?
  ‣ Special predesigned blocks
    ‣ ex: “large” SRAM block in library for instantiation
  ‣ Special layout generators with special leaf cells
    ‣ ex: PLA generators. SRAM/ROM generators.
  ‣ Special layout generators using standard cells
    ‣ Datapath compilers
‣ Is there always a clear win?
  ‣ ex: ROM table might be smaller and faster implemented as logic equations in standard cells (with place and route)
  ‣ Clear advantage for SRAM, others?
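A small, deliberately chosen example of the ROM-versus-logic point: the 8-entry table below is written the way a ROM would be specified, but its contents happen to be the binary-to-Gray-code mapping, so a synthesis tool is free to reduce it to a couple of XOR gates (`dout = a ^ (a >> 1)`). The module name `rom_8x4` and the table contents are hypothetical illustration values, and whether logic beats a ROM generator for a real table depends on its size and contents.

```verilog
// Hypothetical 8 x 4 lookup table written as a case statement.
// The contents are the binary-to-Gray mapping, chosen so that the
// table has an obvious small logic-equation form.
module rom_8x4 (
    input      [2:0] a,
    output reg [3:0] dout
);
    always @* begin
        case (a)
            3'd0: dout = 4'h0;
            3'd1: dout = 4'h1;
            3'd2: dout = 4'h3;
            3'd3: dout = 4'h2;
            3'd4: dout = 4'h6;
            3'd5: dout = 4'h7;
            3'd6: dout = 4'h5;
            3'd7: dout = 4'h4;
        endcase
    end
endmodule
```

A table with no such structure (for example, random constants) gives the synthesis tool much less to exploit, which is where a dedicated ROM generator may come out ahead.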