Challenges and Solutions for Nanometer SOC Designs

Prof. Jason Cong
University of California, Los Angeles
Email: [email protected]
URL: http://cadlab.cs.ucla.edu/~cong
Magma Design Automation
http://www.magma-da.com
Outline
• Nanometer SOC challenges
• Opportunities and possible solutions
• Concluding remarks
Challenges to Nanometer SOC Designs
• Rapid increase in design complexity and widening
gap of design productivity
• Rapid increase in IC development cost in nanometer
technologies
ITRS’2003
Year of production                                      2004    2006    2008    2010    2012    2015    2018
MPU/ASIC ½ pitch (nm)                                     90      70      57      45      35      25      18
Functions per chip (million transistors)                 553     878   1,393   2,212   3,511   7,022  14,045
Chip size at production (mm²)                            310     310     310     310     310     310     310
Maximum power for high-performance with heatsink (W)     158     180     200     218     240     270     300
On-chip local clock (MHz)                              4,171   6,783  10,972  15,079  20,065  33,403  53,207
Maximum wiring level                                      14      15      16      16      16      17      18
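As a rough check on the growth rate behind these ITRS figures, the sketch below estimates how often the transistor count doubles. It assumes steady exponential growth between the first and last table entries; the function name `doubling_time_years` is illustrative and not part of the talk.

```python
import math

# Functions per chip (million transistors) by production year, copied from the ITRS 2003 data.
functions_per_chip = {2004: 553, 2006: 878, 2008: 1393, 2010: 2212,
                      2012: 3511, 2015: 7022, 2018: 14045}


def doubling_time_years(series):
    """Years needed for the quantity to double, assuming steady exponential growth."""
    first_year, last_year = min(series), max(series)
    growth = series[last_year] / series[first_year]
    return (last_year - first_year) * math.log(2) / math.log(growth)


transistor_doubling = doubling_time_years(functions_per_chip)
print(f"Transistor count doubles about every {transistor_doubling:.1f} years")
```

With the chip size held at 310 mm², this growth comes from shrinking feature size alone.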
“Double Exponential” Growth of Design Complexity
• C1: complexity due to exponential increase of chip
capacity
– More devices
– More power
– Heterogeneous integration, ……
• C2: complexity due to exponential decrease of feature
size
– Interconnect limitations
– Crosstalk and P/G noise
– Leakage power
– EMI, ……
• Design Complexity ∝ C1 x C2
The Cost of Next Generation Product
• Engineering cost – 60% up
• Manufacturing cost – 40% up
• NRE/mask cost – 100% up
• Respin cost – 78% up
• Total product cost: $30M ~ $50M @ 90nm
[Figure: total product cost ($M, 10 to 50) versus process (0.18um, 0.15um, 0.13um, 90nm) for a wireless chip case and a networking chip case]
Source: IBS Inc.
Rapid Increase in Manufacturing Cost
Process (um)             2.0   …   0.8    0.6    0.35   0.25   0.18   0.13    0.10
Single mask cost ($K)    1.5       1.5    2.5    4.5    7.5    12     40      60
# of masks               12        12     12     16     20     26     30      34
Mask set cost ($K)       18        18     30     72     150    312    1,000   2,000
[Figure: cost per mask ($K) and total cost for mask set ($M) versus process (250nm, 180nm, 130nm, 100nm)]
Source: EETimes
Outline
• Nanometer SOC challenges
• Opportunities and possible solutions
• Concluding remarks
Opportunities and Possible Solutions
• Dealing with rapid increase of design complexity and
widening gap of design productivity
– More scalable optimization engines
– Better solutions to various scaling related problems
– Higher degree of design automation
– Design reuse
• Dealing with rapid increase in nanometer manufacturing cost
– Design for manufacturing and yield
– Alternative silicon implementation platforms
– Silicon reuse
Example: Optimality Study of Circuit Placement
• Want to understand how much room for improvement remains
in circuit placement
– Construction of Placement Examples with Known Optimal or Upper Bound (PEKO/PEKU)
– Match the characteristics of the real problems
– First quantitative evaluation of the optimality of the circuit placement problem
– Coverage in three EE Times articles, and more than 150 downloads from our website, http://cadlab.cs.ucla.edu/~pubbench
– Used in every placement since its publication
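The idea of a placement example with a known optimum can be illustrated with a toy construction. The sketch below is not the PEKO generator itself. It assumes unit cells on a grid joined only by 2-pin nets between neighbours, so the constructing placement is optimal by design. All function names are hypothetical.

```python
import random


def build_grid_instance(rows, cols):
    """Unit cells on a rows x cols grid, joined by 2-pin nets between neighbours."""
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    nets = []
    for r, c in cells:
        if c + 1 < cols:
            nets.append(((r, c), (r, c + 1)))
        if r + 1 < rows:
            nets.append(((r, c), (r + 1, c)))
    return cells, nets


def wirelength(nets, location):
    """Half-perimeter wirelength; for 2-pin nets this is the Manhattan distance."""
    total = 0
    for a, b in nets:
        (ra, ca), (rb, cb) = location[a], location[b]
        total += abs(ra - rb) + abs(ca - cb)
    return total


def quality_ratio(nets, location, optimal):
    """Wirelength of a placement expressed as a multiple of the known optimum."""
    return wirelength(nets, location) / optimal


cells, nets = build_grid_instance(8, 8)
# Every net has length 1, the minimum for two distinct non-overlapping unit cells,
# so the constructing placement is optimal and its wirelength equals the net count.
optimal = len(nets)
identity = {cell: cell for cell in cells}

# A random legal placement: each cell is assigned a distinct grid slot.
rng = random.Random(0)
slots = cells[:]
rng.shuffle(slots)
shuffled = dict(zip(cells, slots))

identity_ratio = quality_ratio(nets, identity, optimal)
shuffled_ratio = quality_ratio(nets, shuffled, optimal)
print(identity_ratio, shuffled_ratio)
```

Any placer can then be scored by the same ratio, which is the "multiple of optimal" measure used in the results that follow.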
Studied Four State-of-the-Art Placers
• Capo [A. Caldwell et al, 2000]
– Based on multilevel partitioner
– Aims to enhance the routability
• Dragon [M. Wang et al, 2000]
– Uses hMetis for initial partition
– SA with bin-based swapping
• mPL [T. Chan et al, 2000]
– Multilevel placer using NLP on the coarsest level
– Goto based relaxation
• Qplace [Cadence Inc.]
– Leading edge industrial placer
– Component of Silicon Ensemble
Experimental Results on PEKO (2002)
[Figure: two plots versus #cells (0 to 250,000) for Dragon, Capo, mPL, and Qplace: multiple of optimal (0.00 to 3.00) and runtime in seconds (0 to 45,000)]
• Existing algorithms are 66-153% away from the optimal on PEKO
– There is significant room for improvement in placement algorithms!
• ROI can be huge – 30% wirelength reduction is equivalent to
– Move from aluminum to copper, or
– One process generation shrink
A Scalable Paradigm: Multi-Level Framework
[Figure: levels versus problem sizes, with coarsening followed by uncoarsening and refinement (optimization)]
– First used to solve partial differential equations (multi-grid method)
– Successfully applied to circuit partitioning (hMetis [Karypis et al, 1997], MLPart [Caldwell, et al. 1999]): best partitioners for cut-size minimization
– First applied to circuit placement [Chan et al, ICCAD’00]: 10x speed-up over GordianL
Our Multilevel Placement Framework
[Figure: multilevel cycle. The initial fine-grain problem is aggregated through intermediate levels; intermediate levels are relaxed with the Generalized Force Directed (GFD) algorithm and interpolated back; the final fine-grain problem receives thorough GFD and detailed placement.]
Optimization at Each Level:
Generalized Force Directed Method
• Force directed method by Kraftwerk [Eisenmann and Johannes 98]
– Minimize quadratic wirelength: solve $A x_0 = b$
– Compute forces ($f_k$) acting on cells based on the current density;
iteratively solve $A x_{k+1} = f_k$
• Our generalized force directed method
– Minimize log-sum-exp wirelength [Naylor 01; Kahng and Wang 04]
subject to even bin density constraints
– Use Poisson operator with Neumann boundary condition as a smoother
for density constraints
– Apply Uzawa algorithm to solve the constrained minimization
formulation => iteratively solve $\nabla w(x_{k+1}) = \lambda \cdot f(x_k)$
and update $\lambda$ based on $x_k$
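The slide names the log-sum-exp wirelength without writing it out. The sketch below uses the commonly cited one-dimensional form as an assumption. The smoothing parameter `gamma` and the function names are illustrative, not taken from mPL. It shows that the smooth value always sits at or above the exact max-minus-min span and tightens as `gamma` shrinks.

```python
import math


def hpwl_1d(coords):
    """Exact one-dimensional half-perimeter wirelength of a net: max - min."""
    return max(coords) - min(coords)


def lse_wirelength_1d(coords, gamma):
    """Smooth log-sum-exp approximation of max(coords) - min(coords)."""
    hi, lo = max(coords), min(coords)
    # Shift by the extremes before exponentiating to avoid overflow.
    smooth_max = hi + gamma * math.log(sum(math.exp((x - hi) / gamma) for x in coords))
    smooth_min = lo - gamma * math.log(sum(math.exp((lo - x) / gamma) for x in coords))
    return smooth_max - smooth_min


pins = [1.0, 4.0, 9.0, 2.5]
for gamma in (4.0, 1.0, 0.1):
    print(gamma, hpwl_1d(pins), lse_wirelength_1d(pins, gamma))
```

Because the approximation is differentiable everywhere, it can be minimized with gradient-based solvers, which the exact max-minus-min span does not allow directly.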
mPL5 vs Other Placers
Wirelength Comparison
[Figure: scaled wirelength (0.90 to 1.25) versus # cells (0 to 250,000) for Capo 8.8, Dragon 3.01, FastPlace, Fengshui 2.6, mPL5-fast, and mPL5]
– mPL5 has 10% better WL than Capo with similar runtime
– mPL5 has 1% better WL than Dragon and runs 9 times faster
– mPL5 has 4% better WL than Fengshui and runs 2 times faster
– mPL5 has 11% better WL than FastPlace but runs 8 times slower
– mPL5-fast has 5% better WL than FastPlace but runs 3 times slower
mPL5 vs Other Placers on PEKO Examples
[Figure: quality ratio (1.00 to 3.00) versus #cells (12506, 27220, 45639, 68685, 83709, 182980) for Capo 8.8, Dragon 3.01, Fengshui 2.6, and mPL5]
Opportunities and Possible Solutions
• Dealing with rapid increase of design complexity and widening gap of
design productivity
– More scalable optimization engines
– Better solutions to various scaling related problems
– Higher degree of design automation
– Design reuse
• Dealing with rapid increase in nanometer manufacturing cost
– Design for manufacturing
– Alternative silicon implementation platforms
– Silicon reuse
Scaling-Related Problems
For example:
• Interconnect bottleneck
• Noise sensitivity
• Power and thermal limitations
• …
Possible Solutions for Interconnect Bottleneck
• Handling the unpredictability of interconnect delays
– Gain-based synthesis
– Physical synthesis
• Coping with long interconnect delays
– Multi-cycle on-chip communication with aggressive
pipelining/retiming over global interconnects
– Latency-insensitive designs
– Globally asynchronous, locally synchronous (GALS) designs
– “Network-in-a-chip”
• …
Interconnect Bottleneck in Nanometer Designs
• Single-cycle full chip communication is no longer possible
• Not supported by the current CAD toolset
• ITRS’01 70nm Tech
• 5.63 GHz across-chip clock
• 800 mm² (28.3 mm x 28.3 mm)
• IPEM BIWS estimations
– Buffer size: 100x
– Driver/receiver size: 100x
• On semi-global layer (tier 3):
– Can travel up to 11.4 mm in one cycle
– Need 5 clock cycles from corner to corner
[Figure: chip map with regions reachable in 1 to 5 cycles; axis ticks at 0, 11.4, 22.8, and 28.3 mm]
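The five-cycle figure follows from the quoted die size and per-cycle reach. A minimal sketch of that arithmetic is given below. It assumes wires run horizontally and vertically, so a corner-to-corner route spans two die edges; the constant and function names are illustrative.

```python
import math

CHIP_EDGE_MM = 28.3          # 800 mm^2 die, 28.3 mm x 28.3 mm (from the slide)
REACH_PER_CYCLE_MM = 11.4    # distance covered in one clock cycle on tier 3 (from the slide)


def cycles_needed(distance_mm, reach_mm=REACH_PER_CYCLE_MM):
    """Clock cycles for a signal to cover distance_mm, at least one."""
    return max(1, math.ceil(distance_mm / reach_mm))


# Assumption: Manhattan routing, so corner to corner is two die edges long.
corner_to_corner_mm = 2 * CHIP_EDGE_MM
corner_cycles = cycles_needed(corner_to_corner_mm)
print(corner_cycles)
```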
One Solution: Regular Distributed Register
Architecture
[Figure: regular array of islands (dimension labels Hi and Wi), each containing a local computational cluster (LCC), a register file, and an FSM, connected by global interconnect; the cluster shown contains ADD, MUL, and MUX units; the register file is split into 1 cycle, 2 cycle, …, k cycle banks]
• Cluster with area constraint
• Use register banks: registers in each island are partitioned into k banks for 1 cycle,
2 cycle, … k cycle interconnect communication in each island
• Highly regular
MCAS: Architectural Synthesis for Multi-Cycle
Communication Using RDR Architecture
[Flow diagram: MCAS (Multi-Cycle Architectural Synthesis)]
C program
→ CDFG generation (produces CDFG)
→ Resource allocation & functional unit binding (produces ICG)
→ Scheduling-driven placement (produces locations)
→ Placement-driven rescheduling & rebinding
→ Register and port binding
→ Datapath & FSM generation
Outputs: RTL VHDL, floorplan constraints, multi-cycle path constraints
MCAS flow vs. Synopsys Behavioral Compiler
(on Virtex-II) [Cong, et al, T-CAD’04]
Design  Flow         Cycles  Reg  ALU  MULT  fmax (MHz)  LUTs  Latency (ns)  MCAS vs. BC
pr      Synopsys BC      25   28    5     8       95.87   877        260.78
pr      MCAS             27   34    6     2       86.07  1477        313.69      120.29%
wang    Synopsys BC      29   36    7     8       63.02  1143        460.17
wang    MCAS             14   35    5     8      140.31  1523         99.78       21.68%
mcm     Synopsys BC      43  142   23     7       51.09  3256        841.60
mcm     MCAS             34   35    6     3       53.59  2561        634.44       75.39%
honda   Synopsys BC      29   44    8    14       52.13  2112        556.31
honda   MCAS             23   42    6     8       71.95  2606        319.65       57.46%
• Synopsys Behavioral Compiler setting: default (optimizing latency)
• Average latency ratio of MCAS vs. BC: 69%
[Figure: two bar charts comparing Synopsys BC and MCAS on pr, wang, mcm, and honda: latency (0 to 900) and resource (0 to 3500)]
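The latency column can be cross-checked from the cycle counts and clock frequencies. The sketch below assumes latency is the cycle count times the clock period at fmax. The slide does not state this formula, but the reported numbers agree with it to within rounding.

```python
# (cycles, fmax in MHz) per design and flow, copied from the comparison data.
results = {
    "pr":    {"BC": (25, 95.87), "MCAS": (27, 86.07)},
    "wang":  {"BC": (29, 63.02), "MCAS": (14, 140.31)},
    "mcm":   {"BC": (43, 51.09), "MCAS": (34, 53.59)},
    "honda": {"BC": (29, 52.13), "MCAS": (23, 71.95)},
}


def latency_ns(cycles, fmax_mhz):
    """Latency = cycle count x clock period; 1000 / MHz gives the period in ns."""
    return cycles * 1000.0 / fmax_mhz


ratios = {}
for design, flows in results.items():
    bc = latency_ns(*flows["BC"])
    mcas = latency_ns(*flows["MCAS"])
    ratios[design] = mcas / bc
    print(f"{design:6s} BC {bc:7.2f} ns  MCAS {mcas:7.2f} ns  ratio {ratios[design]:.2%}")

average_ratio = sum(ratios.values()) / len(ratios)
print(f"average MCAS/BC latency ratio: {average_ratio:.0%}")
```

Note that MCAS is slower than BC on pr (ratio above 100%), so the 69% average is driven mainly by wang and honda.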
Possible Solutions for Noise Control
• Integrated modeling, analysis, and synthesis
capabilities
– Progressive modeling throughout the synthesis process
• Consistent through different stages
• Increasing accuracy as more physical information is available
• Need both avoidance (planning) and fixing (postprocessing) capabilities
– Guided by modeling and analysis
• Unified tool (single binary) with integrated synthesis,
physical design, and analysis capabilities
Full Chip Analysis
• Full chip crosstalk analysis is feasible.
– Fast full chip extraction for parasitics
– Accurate gate models, such as current source based gate models
– Efficient and stable reduced order model algorithms for
interconnects
– Integrated analyzers ensure efficiency at each stage and allow
faster convergence.
• Unified data model allows the knowledge of the design to reduce
problem size.
– Consider the performance of crosstalk analysis in the
implementation flow
• Multithreading and distributed processing techniques can be
employed to scale down the runtime.
Crosstalk Avoidance
• Crosstalk avoidance is necessary to ensure design closure.
– Full chip crosstalk analysis can be performed efficiently at the
P&R level.
– Preventive techniques like slew balancing, buffering/sizing, wire
sizing and spacing, and crosstalk-immune routing are effective in
controlling crosstalk but have side effects/costs.
– With embedded crosstalk analysis ability, the implementation tool can
select the best preventive solutions on the fly while minimizing the
cost.
Crosstalk Fixing
• Unified data model allows effective and efficient fixing.
– Avoidance techniques should be adopted to limit the difficulty of
fixing.
– Surgical optimization techniques (buffering, sizing, ripup &
reroute) can be applied to the critical paths with the guidance of
a sign-off crosstalk analyzer
– Incremental crosstalk analysis ability (including incremental
extraction and incremental timing analysis) is critical for
efficiency
Opportunities and Possible Solutions
• Dealing with rapid increase of design complexity and
widening gap of design productivity
– More scalable optimization engines
– Better solutions to various scaling related problems
– Higher degree of design automation
– Design reuse
• Dealing with rapid increase in nanometer manufacturing cost
– Design for manufacturing
– Alternative silicon implementation platforms
– Silicon reuse
Example: Unified Implementation Flow from Magma
• Architecture
– Unified data model – patented
• Methodology
– FixedTiming approach –
patented – eliminates iterations
between synthesis and P&R
• Open system
– Tcl-based API enables easy
access to design information
• Single executable – multiple product packages
Raising the Design Abstraction Level
• Electronic system-level (ESL) design automation for
the next productivity boost
• Previous failure of behavioral synthesis
– Lack of a compelling reason
– Lack of a solid RTL foundation
– Lack of consideration of physical reality
• The need to readdress behavioral synthesis
– Rapid increase in design complexity
– Availability of robust RTL-to-GDSII flows
– Behavioral synthesis with physical reality
Opportunities and Possible Solutions
• Dealing with rapid increase of design complexity and
widening gap of design productivity
– More scalable optimization engines
– Better solutions to various scaling related problems
– Higher degree of design automation
– Design reuse
• Dealing with rapid increase in nanometer manufacturing cost
– Design for manufacturing
– Alternative silicon implementation platforms
– Silicon reuse
Four Levels of Design Reuse
• Cell reuse
• IP reuse
• Architecture reuse
• Silicon reuse – use of programmable technologies
Dr. Claasen, CTO of Philips, Keynote Speech at DAC’2000
Platform-Based Design
• Principles of platform-based design:
Meet-in-the-middle
– Top-down
• Define a set of abstraction layers
• From specifications at a given level,
select a solution in terms of components
of the following layer and propagate
constraints
– Bottom-up
• Platform components (e.g., microcontroller, RTOS, communication
primitives) at a given level are
abstracted to a higher level by their
functionality and a set of parameters
that help guide the solution selection
process.
• The selection process is equivalent to a
covering problem if a common semantic
domain is used.
[Figure: platform-based design diagram with labels Application Space, Application Instance, API Platform, Specification, Arch. Platform, Design-Space Exploration, Platform Instance, and Architectural Space]
Source: GSRC
Opportunities and Possible Solutions
• Dealing with rapid increase of design complexity and
widening gap of design productivity
– More scalable optimization engines
– Better solutions to various scaling related problems
– Higher degree of design automation
– Design reuse
• Dealing with rapid increase in nanometer manufacturing cost
– Design for manufacturing
– Alternative silicon implementation platforms
– Silicon reuse
Design For Manufacturing
• Resolution enhancement techniques (RET)
– OPC, PSM, …
• Design for yield
– Complex routing rules for nanometer designs
– Use of regular structures
– …
• Support of ECO (engineering change order)
– Accommodate changes in metal layers only
– Or in selected metal layers only
Opportunities and Possible Solutions
• Dealing with rapid increase of design complexity and
widening gap of design productivity
– More scalable optimization engines
– Better solutions to various scaling related problems
– Higher degree of design automation
– Design reuse
• Dealing with rapid increase in nanometer manufacturing cost
– Design for manufacturing
– Alternative silicon implementation platforms
– Silicon reuse
Alternative Silicon Implementation: Structured ASICs
• What are structured ASICs?
– A hybrid of hard-coded functions, like
memory and microprocessors, and
customizable logic gates
– Use a subset of metal/via layers for
customization
– Aimed at providing complexity and
performance characteristics close to
standard-cell ASICs with the flexibility
and fast turnaround time of FPGAs, as
well as lower non-recurring engineering
fees
Examples from the Industry: NEC ISSP (Instant
Silicon Solution Platform)
• 90 nm CMOS process, 7 layers
(2 customized routing layers)
• 6.5 M gates (number of usable
ISSP gates), 11.5 Mb (RAM
capacity)
• Up to 500 MHz operating
frequency
• SerDes: 10G Serial Interface
• Embedded macros: 2 port
SRAM, APLL, DLL, SPI4.2
(dynamic), 10 G Ethernet MAC,
UTOPIA, DDR controller, PCI
controller, UART, etc.
• Embedded self test
Examples from the Industry: LSI RapidChip
• Four metal layers for configuration
• Based on the R Cell technology
fabric; an R Cell is a 5 transistor
element configured by metal
• Up to 5.8M available ASIC gates
• Up to 1.9 Mbit on-chip SRAM
• 1 and 2 port configurable memory
• Configurable IOs
• Support of CoreWare® IP libraries
• Multiple SerDes at up to 4.25 Gb/s
• Support of 212.5MHz ARM966
Cost Comparison of Different Silicon
Implementations
                             FPGA            Structured ASIC   Cell-Based ASIC
Total Design Cost            ~$165K          ~$500K            ~$5.5M (Typical)
Vendor NRE                   None            ~$100K - $200K    $1M - $3M
Cost of Tools                ~$30K           ~$120K - $250K    > $300K
#Engineers                   1 to 2          2 to 3            5 to 7
Price per Chip               $200 to $1K     ~$30 - $150       ~$30
Total Unit Cost, Qty (1K)    ~$1000 ('03)    $500 - $650       $55K
--- Qty (5K)                 ~$220 (4Q'04)   $100 - $150       $1.1K
--- Qty (500K)               ~$40 (4Q'04)    > $21             $11 - $20
Source: SemiView, 2003
What’s Required for Structured ASICs
• Need to support architecture design
– Architecture solution space is large
• Different complex cell and logic cluster structures
• Ratio between combinational and sequential cells
• Flat or Hierarchical? Number of levels of logic/routing hierarchies?
• Hybrid resources? Block RAM, DSP block? Ratio and floorplan
• Channel width, distribution, different interconnect designs
– Need quantitative analysis
• Understand the impact of each architecture decision on performance, area,
density, and routability
• Need efficient EDA tool support
– Optimized for the given architecture
Magma’s Support of Full Spectrum of IC
Platforms
[Figure: Magma tools (Blast Create, Blast Create - SA, Blast FPGA, Blast Fusion, Blast Fusion - SA, ArchEvaluator) positioned across platform types (Standard Cell, Programmable ASIC, FPGA) along axes of design cost / NRE and performance / density]
Opportunities and Possible Solutions
• Dealing with rapid increase of design complexity and
widening gap of design productivity
– More scalable optimization engines
– Better solutions to various scaling related problems
– Higher degree of design automation
– Design reuse
• Dealing with rapid increase in nanometer manufacturing cost
– Design for manufacturing
– Alternative silicon implementation platforms
– Silicon reuse
Silicon Implementation Platforms with
Embedded Programmable Technologies
• General-Purpose Platforms: processor, memory, and programmable logic (Altera, Xilinx, …)
• Application-Specific Platforms: processor, memory, ASIC, and programmable logic (IBM, LSI Logic, Motorola, TI, …)
Examples of Programmable Platforms
• High Programmable Platforms
– Xilinx Virtex II Pro, Altera Stratix, etc.
– Provides reconfigurable processor + embedded memories + programmable logic
[Figure: Xilinx Virtex II Pro block diagram with four PowerPC 405 cores in programmable logic and Rocket I/O transceivers at top and bottom]
Xilinx Virtex II Pro
• Up to 4 IBM PowerPC in FPGA fabric
• Up to 24 embedded Rocket I/O transceivers
• Up to 556 18x18 multipliers
• Over 10 Mb embedded block RAM
• Up to 125,136 logic elements (LEs)
Altera Stratix
• Nios embedded processor
• High-bandwidth I/O & High-Speed Interfaces
• Up to 176 embedded multipliers & up to 22 high performance DSP blocks
• Up to 7 Mb embedded memory
• Up to 79,040 logic elements (LEs)
Application-Specific Instruction Set Processors
(ASIPs)
• ASIPs provide tradeoffs between efficiency and flexibility
– A general purpose processor + specific hardware resource
– Base instruction set + customized instructions
– Specific hardware resource implements the customized instructions
– Either runtime reconfigurable or pre-synthesized
– More popular recently
• Altera Nios, Tensilica Xtensa, Improv Jazz, ARC Cores, IFX Carmel 20xx
[Figure: ASIP datapath with register file, LD/ST, ALU, memory, MUL, FU, user-defined execution units, and user-defined register file; legend: base ISA feature, optional functions, user-defined functions]
Proposed ASIP Compilation Flow
[Cong, et al, FPGA’04]
[Flow diagram with stages: C; SUIF / CDFG generator; CDFG; Pattern Generation / Pattern Selection; ASIP constraints; Pattern library; Application Mapping; Mapped CDFG; Instruction Implementation / ASIP synthesis; Implementation]
• Pattern generation
– Enumerate all of the patterns through cut enumeration
• Pattern selection
– Select a subset of patterns to maximize the potential speedup
while satisfying the area constraint.
– Formulated as a 0-1 knapsack problem
• Application Mapping
– Map subject graph G(V, E) to extended instruction set so that
the total execution time of G is minimized
– Formulated as a min-area cell library based technology
mapping problem
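The pattern selection step is described as a 0-1 knapsack problem. The sketch below shows a standard dynamic-programming solution for that formulation. It is not the authors' implementation, and the pattern names, areas, and gains are hypothetical values chosen for illustration.

```python
def select_patterns(patterns, area_budget):
    """0-1 knapsack by dynamic programming over integer area.

    patterns: list of (name, area, gain) with integer area.
    Returns (best total gain, list of chosen pattern names).
    """
    # best[a] = (total gain, chosen names) achievable with area at most a
    best = [(0.0, [])] * (area_budget + 1)
    for name, area, gain in patterns:
        # Iterate area downward so each pattern is used at most once.
        for a in range(area_budget, area - 1, -1):
            candidate_gain = best[a - area][0] + gain
            if candidate_gain > best[a][0]:
                best[a] = (candidate_gain, best[a - area][1] + [name])
    return best[area_budget]


# Hypothetical candidate patterns: (name, area in logic elements, cycles saved).
candidates = [("mac", 60, 120.0), ("butterfly", 100, 150.0),
              ("shift_add", 30, 50.0), ("sat_add", 40, 45.0)]
gain, chosen = select_patterns(candidates, area_budget=130)
print(gain, chosen)
```

Here the largest single pattern is not selected: three smaller patterns fill the same area budget with a higher combined gain, which is the kind of tradeoff the knapsack formulation captures.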
Experimental Result on Altera Nios
           Extended        Speedup              Resource Overhead
Benchmark  Instruction #   Nios   Estimation    LE    LE %    Memory   Memory %   DSP Block
fft_br     9               3.28   2.65          408   6.06%   65,536   9.79%      16
iir        7               3.18   3.73          255   3.79%   4,736    0.71%      40
fir        2               2.40   2.14          51    0.76%   1,024    0.15%      8
pr         2               1.57   1.75          71    1.05%   0        0.00%      14
dir        2               3.28   3.02          54    0.80%   0        0.00%      16
mcm        4               4.75   3.22          186   2.76%   0        0.00%      56
Average                    3.08   2.75          -     2.54%   -        1.77%      -
Outline
• Nanometer SOC challenges
• Opportunities and possible solutions
• Concluding remarks
Concluding Remarks
• Design complexity and manufacturing cost are the two
biggest challenges to nanometer SOC designs
• Many opportunities for innovation
– Dealing with rapid increase of design complexity and widening
gap of design productivity
• More scalable optimization engines
• Better solutions to various scaling related problems
• Higher degree of design automation
• Design reuse
– Dealing with rapid increase in nanometer manufacturing cost
• Design for manufacturing
• Alternative silicon implementation platforms
• Silicon reuse
• Need collaborative efforts
– Academia and industry
– International collaboration
Acknowledgements
• National Science Foundation (US), MARCO/GSRC
• Semiconductor Research Corporation (SRC)
• Many graduate student researchers from UCLA