Challenges and Solutions for Nanometer SOC Designs
Prof. Jason Cong
University of California, Los Angeles
Email: [email protected]
URL: http://cadlab.cs.ucla.edu/~cong
Magma Design Automation
http://www.magma-da.com

Outline
• Nanometer SOC challenges
• Opportunities and possible solutions
• Concluding remarks

Challenges to Nanometer SOC Designs
• Rapid increase in design complexity and widening gap of design productivity
• Rapid increase in IC development cost in nanometer technologies

ITRS’2003
Year of production                                  2004   2006   2008    2010    2012    2015    2018
MPU/ASIC ½ pitch (nm)                               90     70     57      45      35      25      18
Functions per chip (million transistors)            553    878    1,393   2,212   3,511   7,022   14,045
Chip size at production (mm²)                       310    310    310     310     310     310     310
Maximum power, high-performance with heatsink (W)   158    180    200     218     240     270     300
On-chip local clock (MHz)                           4,171  6,783  10,972  15,079  20,065  33,403  53,207
Maximum wiring level                                14     15     16      16      16      17      18

“Double Exponential” Growth of Design Complexity
• C1: complexity due to exponential increase of chip capacity
  – More devices
  – More power
  – Heterogeneous integration, …
• C2: complexity due to exponential decrease of feature size
  – Interconnect limitations
  – Crosstalk and P/G noise
  – Leakage power
  – EMI, …
• Design complexity ∝ C1 × C2

The Cost of Next Generation Product
• Product cost components:
  – Engineering cost: 60% up
  – Manufacturing cost: 40% up
  – NRE/mask cost: 100% up
  – Respin cost: 78% up
• Total product cost: $30M ~ $50M @ 90nm
[Figure: Total Product Cost ($M) by process node, wireless chip case and networking chip case]
Source: IBS Inc.
[Figure x-axis, process node: 0.18um, 0.15um, 0.13um, 90nm]

Rapid Increase in Manufacturing Cost
Process (um)            2.0   …   0.8   0.6   0.35   0.25   0.18   0.13    0.10
Single mask cost ($K)   1.5   …   1.5   2.5   4.5    7.5    12     40      60
# of masks              12    …   12    12    16     20     26     30      34
Mask set cost ($K)      18    …   18    30    72     150    312    1,000   2,000
[Figure: Cost/Mask ($K) and Total Cost for Mask Set ($M) at 250nm, 180nm, 130nm, and 100nm]
Source: EETimes

Outline
• Nanometer SOC challenges
• Opportunities and possible solutions
• Concluding remarks

Opportunities and Possible Solutions
• Dealing with rapid increase of design complexity and widening gap of design productivity
  – More scalable optimization engines
  – Better solutions to various scaling related problems
  – Higher degree of design automation
  – Design reuse
• Dealing with rapid increase in nanometer manufacturing cost
  – Design for manufacturing and yield
  – Alternative silicon implementation platforms
  – Silicon reuse

Example: Optimality Study of Circuit Placement
• How much room for improvement is there in circuit placement?
• Construction of Placement Examples with Known Optimal or Upper Bound (PEKO/PEKU)
  – Match the characteristics of real problems
  – First quantitative evaluation of the optimality of circuit placement
  – Covered in three EE Times articles, with more than 150 downloads from our website, http://cadlab.cs.ucla.edu/~pubbench
  – Used in every placement since its publication

Studied Four State-of-the-Art Placers
• Capo [A. Caldwell et al, 2000]
  – Based on multilevel partitioner
  – Aims to enhance the routability
• Dragon [M. Wang et al, 2000]
  – Uses hMetis for initial partition
  – SA with bin-based swapping
• mPL [T. Chan et al, 2000]
  – Multilevel placer using NLP on the coarsest level
  – Goto based relaxation
• Qplace [Cadence Inc.]
  – Leading edge industrial placer
  – Component of Silicon Ensemble

Experimental Results on PEKO (2002)
[Figures: wirelength as a multiple of optimal vs. #cells, and runtime (s) vs. #cells, for Dragon, Capo, mPL, and Qplace]
• Existing algorithms are 66-153% away from the optimal on PEKO
  – There is significant room for improvement in placement algorithms!
• ROI can be huge
  – A 30% wirelength reduction is equivalent to
    • A move from aluminum to copper, or
    • One process generation shrink

A Scalable Paradigm: Multi-Level Framework
[Figure: levels vs. problem sizes; coarsening followed by uncoarsening & refinement (optimization)]
• First used to solve partial differential equations (multi-grid method)
• Successfully applied to circuit partitioning (hMetis [Karypis et al, 1997], MLPart [Caldwell, et al. 1999]): best partitioners for cut-size minimization
• First applied to circuit placement [Chan et al, ICCAD’00]: 10x speed-up over GordianL

Our Multilevel Placement Framework
[Figure: the initial fine-grain problem is aggregated through intermediate levels, then interpolated back to the final fine-grain problem, which receives thorough GFD and detailed placement]
[Figure labels: Intermediate Level Relaxation (GFD); GFD = Generalized Force Directed Algorithm]

Optimization at Each Level: Generalized Force Directed Method
• Force directed method by Kraftwerk [Eisenmann and Johannes 98]
  – Minimize quadratic wirelength: solve $A x_0 = b$
  – Compute forces ($f_k$) acting on cells based on the current density; iteratively solve $A x_{k+1} = f_k$
• Our generalized force directed method
  – Minimize log-sum-exp wirelength [Naylor 01; Kahng and Wang 04] subject to even bin density constraints
  – Use Poisson operator with Neumann boundary condition as a smoother for density constraints
  – Apply Uzawa algorithm to solve the constrained minimization formulation => iteratively solve $\nabla w(x_{k+1}) = \lambda \cdot f(x_k)$, and update $\lambda$ based on $x_k$

mPL5 vs Other Placers
[Figure: Wirelength Comparison; scaled wirelength vs. # cells for Capo 8.8, Dragon 3.01, FastPlace, Fengshui 2.6, mPL5-fast, and mPL5]
• mPL5 has 10% better WL than Capo with similar runtime
• mPL5 has 1% better WL than Dragon and runs 9 times faster
• mPL5 has 4% better WL than Fengshui and runs 2 times faster
• mPL5 has 11% better WL than FastPlace but runs 8 times slower
• mPL5-fast has 5% better WL than FastPlace but runs 3 times slower

mPL5 vs Other Placers on PEKO Examples
[Figure: quality ratio vs. #cells (12506, 27220, 45639, 68685, 83709, 182980) for Capo 8.8, Dragon 3.01, Fengshui 2.6, and mPL5]

Opportunities and Possible Solutions
• Dealing with rapid increase of design complexity and widening gap of design productivity
  – More scalable optimization engines
  – Better solutions to various scaling related problems
  – Higher degree of design automation
  – Design reuse
• Dealing with rapid increase in nanometer manufacturing cost
  – Design for manufacturing
  – Alternative silicon implementation platforms
  – Silicon reuse

Scaling-Related Problems
For example:
• Interconnect bottleneck
• Noise sensitivity
• Power and thermal limitations
• …

Possible Solutions for Interconnect Bottleneck
• Handling the unpredictability of interconnect delays
  – Gain-based synthesis
  – Physical synthesis
• Coping with long interconnect delays
  – Multi-cycle on-chip communication with aggressive pipelining/retiming over global interconnects
  – Latency-insensitive designs
  – Globally asynchronous, locally synchronous designs
  – “Network-in-a-chip”
• …

Interconnect Bottleneck in Nanometer Designs
• Single-cycle full chip communication is no longer possible
  – Not supported by the current CAD toolset
• ITRS’01 70nm technology
  – 5.63 GHz across-chip clock
  – 800 mm² (28.3mm x 28.3mm)
  – IPEM BIWS estimations
  – Buffer size: 100x
  – Driver/receiver size: 100x
• On semi-global layer (tier 3):
  – Can travel up to 11.4 mm in one cycle
  – Need 5 clock cycles from corner to corner
[Figure: chip map of regions reachable in 1 to 5 cycles; axis marks at 0, 11.4, 22.8, and 28.3 mm]

One Solution: Regular Distributed Register Architecture
[Figure labels: islands (height Hi, width Wi) connected by global interconnect; cluster with area constraint; Local Computational Cluster (LCC) with ADD, MUL, MUX; FSM; register file with banks for 1-cycle, 2-cycle, …, k-cycle communication]
• Use register banks:
  – Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication
  – Highly regular

MCAS: Architectural Synthesis for Multi-Cycle Communication Using RDR Architecture
[Flow: MCAS (Multi-Cycle Architectural Synthesis)]
C program
→ CDFG generation (produces CDFG)
→ Resource allocation & functional unit binding (produces ICG)
→ Scheduling-driven placement (produces locations)
→ Placement-driven rescheduling & rebinding
→ Register and port binding
→ Datapath & FSM generation
Outputs: RTL VHDL, floorplan constraints, multi-cycle path constraints

MCAS flow vs. Synopsys Behavioral Compiler (on Virtex-II) [Cong, et al, T-CAD’04]
Design  Flow         Cycles  Reg  ALU  MULT  fmax (MHz)  LUTs  Latency (ns)  MCAS vs. BC
pr      Synopsys BC  25      28   5    8     95.87       877   260.78
pr      MCAS         27      34   6    2     86.07       1477  313.69        120.29%
wang    Synopsys BC  29      36   7    8     63.02       1143  460.17
wang    MCAS         14      35   5    8     140.31      1523  99.78         21.68%
mcm     Synopsys BC  43      142  23   7     51.09       3256  841.60
mcm     MCAS         34      35   6    3     53.59       2561  634.44        75.39%
honda   Synopsys BC  29      44   8    14    52.13       2112  556.31
honda   MCAS         23      42   6    8     71.95       2606  319.65        57.46%
• Synopsys Behavioral Compiler setting: default (optimizing latency)
• Average latency ratio of MCAS vs. BC: 69%
[Figures: latency and resource bar charts for pr, wang, mcm, and honda; Synopsys BC vs. MCAS]

Possible Solutions for Noise Control
• Integrated modeling, analysis, and synthesis capabilities
  – Progressive modeling throughout the synthesis process
    • Consistent through different stages
    • Increasing accuracy as more physical information is available
• Need both avoidance (planning) and fixing (postprocessing) capabilities
  – Guided by modeling and analysis
• Unified tool (single binary) with integrated synthesis, physical design, and analysis capabilities

Full Chip Analysis
• Full chip crosstalk analysis is feasible.
  – Fast full chip extraction for parasitics
  – Accurate gate models, such as current source based gate models
  – Efficient and stable reduced order model algorithms for interconnects
  – Integrated analyzers ensure efficiency at each stage and allow faster convergence.
• Unified data model allows the knowledge of the design to reduce problem size.
  – Consider the performance of crosstalk analysis in the implementation flow
• Multithreading and distributed processing techniques can be employed to scale down the runtime.

Crosstalk Avoidance
• Crosstalk avoidance is necessary to ensure design closure.
  – Full chip crosstalk analysis can be performed efficiently at the P&R level.
  – Preventive techniques like slew balancing, buffering/sizing, wire sizing and spacing, and crosstalk immune routing are effective in controlling crosstalk but have side effects/costs.
  – With embedded crosstalk analysis ability, the implementation tool can select the best preventive solutions on the fly while minimizing the cost.

Crosstalk Fixing
• Unified data model allows effective and efficient fixing.
  – Avoidance techniques should be adopted to limit the difficulty of fixing.
  – Surgical optimization techniques (buffering, sizing, rip-up & reroute) can be applied to the critical paths with the guidance of a sign-off crosstalk analyzer
  – Incremental crosstalk analysis ability (including incremental extraction and incremental timing analysis) is critical for efficiency

Opportunities and Possible Solutions
• Dealing with rapid increase of design complexity and widening gap of design productivity
  – More scalable optimization engines
  – Better solutions to various scaling related problems
  – Higher degree of design automation
  – Design reuse
• Dealing with rapid increase in nanometer manufacturing cost
  – Design for manufacturing
  – Alternative silicon implementation platforms
  – Silicon reuse

Example: Unified Implementation Flow from Magma
• Architecture
  – Unified data model (patented)
• Methodology
  – FixedTiming approach (patented)
  – Eliminates iterations between synthesis and P&R
• Open system
  – Tcl-based API enables easy access to design information
• Single executable, multiple product packages

Raising the Design Abstraction Level
• Electronic system-level (ESL) design automation for the next productivity boost
• Previous failure of behavioral synthesis
  – Lack of a compelling reason
  – Lack of a solid RTL foundation
  – Lack of consideration of physical reality
• The need to revisit behavioral synthesis
  – Rapid increase in design complexity
  – Availability of robust RTL-to-GDSII flows
  – Behavioral synthesis with physical reality

Opportunities and Possible Solutions
• Dealing with rapid increase of design complexity and widening gap of design productivity
  – More scalable optimization engines
  – Better solutions to various scaling related problems
  – Higher degree of design automation
  – Design reuse
• Dealing with rapid increase in nanometer manufacturing cost
  – Design for manufacturing
  – Alternative silicon implementation platforms
  – Silicon reuse

Four Levels of Design Reuse
• Cell reuse
• IP reuse
• Architecture reuse
• Silicon reuse: use of programmable technologies
Source: Dr. Claasen, CTO of Philips, Keynote Speech at DAC’2000

Platform-Based Design
[Figure labels: Application Space, Application Instance, API Platform, Platform Specification, Arch. Platform, Platform Design-Space Exploration, Platform Instance, Architectural Space]
• Principles of platform-based design: meet-in-the-middle
  – Top-down
    • Define a set of abstraction layers
    • From specifications at a given level, select a solution in terms of components of the following layer and propagate constraints
  – Bottom-up
    • Platform components (e.g., microcontroller, RTOS, communication primitives) at a given level are abstracted to a higher level by their functionality and a set of parameters that help guide the solution selection process
    • The selection process is equivalent to a covering problem if a common semantic domain is used
Source: GSRC

Opportunities and Possible Solutions
• Dealing with rapid increase of design complexity and widening gap of design productivity
  – More scalable optimization engines
  – Better solutions to various scaling related problems
  – Higher degree of design automation
  – Design reuse
• Dealing with rapid increase in nanometer manufacturing cost
  – Design for manufacturing
  – Alternative silicon implementation platforms
  – Silicon reuse

Design For Manufacturing
• Resolution enhancement techniques (RET)
  – OPC, PSM, …
• Design for yield
  – Complex routing rules for nanometer designs
  – Use of regular structures
  – …
• Support of ECO (engineering change order)
  – Accommodate changes in metal layers only
  – Or in selected metal layers only

Opportunities and Possible Solutions
• Dealing with rapid increase of design complexity and widening gap of design productivity
  – More scalable optimization engines
  – Better solutions to various scaling related problems
  – Higher degree of design automation
  – Design reuse
• Dealing with rapid increase in nanometer manufacturing cost
  – Design for manufacturing
  – Alternative silicon implementation platforms
  – Silicon reuse

Alternative Silicon Implementation: Structured ASICs
• What are structured ASICs?
  – A hybrid of hard-coded functions, like memory and microprocessors, and customizable logic gates
  – Use a subset of metal/via layers for customization
  – Aimed at providing complexity and performance characteristics close to standard-cell ASICs with the flexibility and fast turnaround time of FPGAs, as well as lower non-recurring engineering fees

Examples from the Industry: NEC ISSP (Instant Silicon Solution Platform)
• 90 nm CMOS process, 7 layers (2 customized routing layers)
• 6.5 M gates (number of usable ISSP gates), 11.5 Mb (RAM capacity)
• Up to 500 MHz operating frequency
• SerDes: 10G serial interface
• Embedded macros: 2 port SRAM, APLL, DLL, SPI4.2 (dynamic), 10 G Ethernet MAC, UTOPIA, DDR controller, PCI controller, UART, etc.
• Embedded self test

Examples from the Industry: LSI RapidChip
• Four metal layers for configuration
• Based on the R Cell technology fabric; an R Cell is a 5 transistor element configured by metal
• Up to 5.8M available ASIC gates
• Up to 1.9Mbit on-chip SRAM
• 1 and 2 port configurable memory
• Configurable IOs
• Support of CoreWare® IP libraries
• Multiple SerDes at up to 4.25 Gb/s
• Support of 212.5MHz ARM966

Cost Comparison of Different Silicon Implementations
                              FPGA            Structured ASIC   Cell-Based ASIC
Total design cost (typical)   ~$165K          ~$500K            ~$5.5M
Vendor NRE                    None            ~$100K - $200K    $1M - $3M
Cost of tools                 ~$30K           ~$120K - $250K    > $300K
# Engineers                   1 to 2          2 to 3            5 to 7
Price per chip                $200 to $1K     ~$30 - $150       ~$30
Total unit cost, qty 1K       ~$1000 ('03)    $500 - $650       $55K
Total unit cost, qty 5K       ~$220 (4Q'04)   $100 - $150       $1.1K
Total unit cost, qty 500K     ~$40 (4Q'04)    > $21             $11 - $20
Source: SemiView, 2003

What’s Required for Structured ASICs
• Need to support architecture design
  – Architecture solution space is large
    • Different complex cell and logic cluster structures
    • Ratio between combinational and sequential cells
    • Flat or hierarchical? Number of levels of logic/routing hierarchies?
    • Hybrid resources? Block RAM, DSP block? Ratio and floorplan
    • Channel width, distribution, different interconnect designs
  – Need quantitative analysis
    • Understand the impact of each architecture decision on performance, area, density, and routability
• Need efficient EDA tool support
  – Optimized for the given architecture

Magma’s Support of Full Spectrum of IC Platforms
[Figure labels: Blast Create, Blast Create - SA, ArchEvaluator, Blast FPGA, Blast Fusion, Blast Fusion - SA; platforms: Standard Cell, Programmable ASIC, FPGA; axes: Design cost / NRE, Performance / Density]

Opportunities and Possible Solutions
• Dealing with rapid increase of design complexity and widening gap of design productivity
  – More scalable optimization engines
  – Better solutions to various scaling related problems
  – Higher degree of design automation
  – Design reuse
• Dealing with rapid increase in nanometer manufacturing cost
  – Design for manufacturing
  – Alternative silicon implementation platforms
  – Silicon reuse

Silicon Implementation Platforms with Embedded Programmable Technologies
• General-purpose platforms: processor, memory, programmable logic (Altera, Xilinx, …)
• Application-specific platforms: processor, memory, ASIC, programmable logic (IBM, LSI Logic, Motorola, TI, …)

Examples of Programmable Platforms
• Highly programmable platforms
  – Xilinx Virtex II Pro, Altera Stratix, etc.
  – Provide reconfigurable processor + embedded memories + programmable logic
[Figure labels: PowerPC 405 cores, programmable logic, Rocket I/O transceivers]
• Xilinx Virtex II Pro
  – Up to 4 IBM PowerPC in FPGA fabric
  – Up to 24 embedded Rocket I/O transceivers
  – Up to 556 18*18 multipliers
  – Over 10 Mb embedded block RAM
  – Up to 125,136 logic elements (LEs)
• Altera Stratix
  – Nios embedded processor
  – High-bandwidth I/O & high-speed interfaces
  – Up to 176 embedded multipliers & up to 22 high performance DSP blocks
  – Up to 7 Mb embedded memory
  – Up to 79,040 logic elements (LEs)

Application-Specific Instruction Set Processors (ASIPs)
• ASIPs provide tradeoffs between efficiency and flexibility
  – A general purpose processor + specific hardware resource
  – Base instruction set + customized instructions
  – Specific hardware resource implements the customized instructions
  – Either runtime reconfigurable or pre-synthesized
  – More popular recently
    • Altera Nios, Tensilica Xtensa, Improv Jazz, ARC Cores, IFX Carmel 20xx
[Figure labels: Register File, LD/ST, ALU, Memory, MUL, FU, User-Defined Execution Units, User-Defined Register File; legend: Base ISA Feature, Optional Functions, User-Defined Functions]

Proposed ASIP Compilation Flow [Cong, et al, FPGA’04]
[Flow: C → SUIF / CDFG generator → CDFG → Pattern Generation / Pattern Selection → Pattern library → Application Mapping → Mapped CDFG → Instruction Implementation / ASIP synthesis → Implementation; ASIP constraints are an additional input]
• Pattern generation
  – Enumerate all of the patterns through cut enumeration
• Pattern selection
  – Select a subset of patterns to maximize the potential speedup while satisfying the area constraint
  – Formulated as a 0-1 knapsack problem
• Application mapping
  – Map subject graph G(V, E) to extended instruction set so that the total execution time of G is minimized
  – Formulated as a min-area cell library based technology mapping problem

Experimental Result on Altera Nios
(Resource overhead columns: LE, Memory, DSP block)
Benchmark  Extended instruction #  Speedup (Nios)  Speedup (estimation)  LE   LE %   Memory  Memory %  DSP block
fft_br     9                       3.28            2.65                  408  6.06%  65,536  9.79%     16
iir        7                       3.18            3.73                  255  3.79%  4,736   0.71%     40
fir        2                       2.40            2.14                  51   0.76%  1,024   0.15%     8
pr         2                       1.57            1.75                  71   1.05%  0       0.00%     14
dir        2                       3.28            3.02                  54   0.80%  0       0.00%     16
mcm        4                       4.75            3.22                  186  2.76%  0       0.00%     56
Average    -                       3.08            2.75                  -    2.54%  -       1.77%     -

Outline
• Nanometer SOC challenges
• Opportunities and possible solutions
• Concluding remarks

Concluding Remarks
• Design complexity and manufacturing cost are the two biggest challenges to nanometer SOC designs
• Many opportunities for innovation
  – Dealing with rapid increase of design complexity and widening gap of design productivity
    • More scalable optimization engines
    • Better solutions to various scaling related problems
    • Higher degree of design automation
    • Design reuse
  – Dealing with rapid increase in nanometer manufacturing cost
    • Design for manufacturing
    • Alternative silicon implementation platforms
    • Silicon reuse
• Need collaborative efforts
  – Academia and industry
  – International collaboration

Acknowledgements
• National Science Foundation (US), MARCO/GSRC
• Semiconductor Research Corporation (SRC)
• Many graduate student researchers from UCLA
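
Note on pattern selection: the ASIP compilation flow states that pattern selection is formulated as a 0-1 knapsack problem. The sketch below illustrates that formulation only; it is not the implementation from [Cong, et al, FPGA’04]. It assumes each candidate pattern has an independent, additive speedup gain and an integer area cost, and the names `select_patterns`, `gains`, `areas`, and `area_budget` are hypothetical. Real patterns may overlap in the CDFG, which this simplified model ignores.

```python
from typing import List, Tuple


def select_patterns(
    gains: List[float], areas: List[int], area_budget: int
) -> Tuple[float, List[int]]:
    """Pick a subset of patterns maximizing total gain within an area budget.

    Classic 0-1 knapsack solved by dynamic programming over integer area
    units. Assumption (not from the slides): gains are independent and
    additive, and areas are non-negative integers.
    """
    n = len(gains)
    # best[i][a] = best total gain using only the first i patterns
    # with total area at most a.
    best = [[0.0] * (area_budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        gain, area = gains[i - 1], areas[i - 1]
        for a in range(area_budget + 1):
            best[i][a] = best[i - 1][a]  # option 1: skip pattern i-1
            if area <= a and best[i - 1][a - area] + gain > best[i][a]:
                best[i][a] = best[i - 1][a - area] + gain  # option 2: take it

    # Trace back through the table to recover which patterns were taken.
    chosen: List[int] = []
    a = area_budget
    for i in range(n, 0, -1):
        if best[i][a] != best[i - 1][a]:
            chosen.append(i - 1)
            a -= areas[i - 1]
    chosen.reverse()
    return best[n][area_budget], chosen


if __name__ == "__main__":
    # Made-up numbers for illustration only: three candidate patterns.
    total_gain, picked = select_patterns([60, 100, 120], [10, 20, 30], 50)
    print(total_gain, picked)
```

With the made-up inputs above, the second and third patterns together use the full budget of 50 area units, which beats any combination that includes the first pattern. The table has (number of patterns + 1) × (area budget + 1) entries, so the approach is practical only when the area budget is expressed in reasonably coarse integer units.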