CS250 VLSI Systems Design
Fall 2009
John Wawrzynek, Krste Asanović, with John Lazzaro

Regular Silicon Structures
a.k.a. VLSI Building Blocks

Lecture 11, Regular Structures
CS250, UC Berkeley Fall ‘09

Introduction
‣ We've experienced synthesis and standard cell place and route. Is that all there is?
‣ We can implement any digital system with only primitive logic gates and flip-flops.
‣ If we did, chip implementations would be pretty inefficient (and boring to do!)
‣ Key questions:
  ‣ Where can special circuit- and layout-generators provide advantage, and how much?
  ‣ Example with a clear advantage: RAM blocks
  ‣ Examples where it is not so clear: cross-bar switches, datapaths, ROMs, multipliers
‣ We’ll start with on-chip RAM

Verilog RAM Specification

    //
    // Single-Port RAM with Synchronous Read
    //
    module v_rams_07 (clk, we, a, di, do);
      input         clk;
      input         we;           // write enable
      input  [5:0]  a;            // address: 64 words
      input  [15:0] di;           // data in
      output [15:0] do;           // data out
      reg    [15:0] ram [63:0];   // 64 x 16 storage array
      reg    [5:0]  read_a;       // registered read address

      always @(posedge clk) begin
        if (we)
          ram[a] <= di;
        read_a <= a;
      end

      assign do = ram[read_a];
    endmodule

What do the synthesis tools do with this?

Memory-Block Basics
[Figure: memory block with a log2(M)-bit address input]
M X N memory: Depth = M, Width = N. M words of memory, each word N bits wide.
VLSI tool flows include parameterized RAM generators. User specifies width, depth, (sometimes) aspect ratio; gets simulation & timing models, layout.

Internal Memory Organization
2-D array of bit cells. Each cell stores one bit of data.
Special circuit tricks are used for the cell array to improve storage density.
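The M X N description from Memory-Block Basics can be made explicit by parameterizing the RAM specification shown earlier. The following is only a sketch: the module name `ram_sp`, the port names, and the parameters `DEPTH`, `WIDTH` and `ADDR` are illustrative assumptions, not part of any RAM generator's interface, and `$clog2` requires a Verilog-2005 (or later) tool.

```verilog
// Sketch: parameterized M x N single-port RAM with synchronous read.
// All names here are hypothetical; DEPTH should be at least 2.
module ram_sp #(
  parameter DEPTH = 64,              // M: number of words
  parameter WIDTH = 16,              // N: bits per word
  parameter ADDR  = $clog2(DEPTH)    // address bits = ceil(log2(M))
) (
  input  wire             clk,
  input  wire             we,        // write enable
  input  wire [ADDR-1:0]  addr,
  input  wire [WIDTH-1:0] din,
  output wire [WIDTH-1:0] dout
);
  reg [WIDTH-1:0] mem [0:DEPTH-1];   // M words, each N bits wide
  reg [ADDR-1:0]  read_addr;         // registered read address

  always @(posedge clk) begin
    if (we)
      mem[addr] <= din;
    read_addr <= addr;               // registering the address makes the read synchronous
  end

  assign dout = mem[read_addr];
endmodule
```

With the default parameters this describes the same 64 X 16 memory as `v_rams_07`; an instantiation such as `ram_sp #(.DEPTH(32), .WIDTH(8)) u_ram (...)` would describe a 32 X 8 memory. Whether a given flow maps this onto a generated RAM macro or onto flip-flops depends on the tools and library.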
‣ RAM/ROM naming convention:
  ‣ examples:
    32 X 8, "32 by 8" => 32 8-bit words
    1M X 1, "1 meg by 1" => 1M 1-bit words

Address Decoding
[Figure: address decoder with an address input and row-select outputs sel_row0, sel_row1]
• The function of the address decoder is to generate a one-hot code word from the address.
• The output is used for row selection.
• Many different circuits exist for this function. A simple one is shown in the figure.

Memory Block Internals
For read operation, functionally the memory is equivalent to a 2-D array of flip-flops with tristate outputs on each:
[Figure: array of flip-flops with tristate outputs, rows enabled by sel_row0 and sel_row1]
For write operation, the functional equivalent includes a means to change the state value.
These circuits are just functional abstractions of the actual circuits used.

Storing computational state as charge
State is coded as the amount of energy stored by a capacitor.
[Figure: charged capacitors (+++ / ---), 1.5V]
State is read by sensing the amount of energy.
Problems: noise changes Q (up or down), parasitics leak or source Q.
Fortunately, Q cannot change instantaneously, but that only gets us in the ballpark.

Static Memory Circuits
Dynamic Memory: Circuit remembers for a fraction of a second.
Static Memory: Circuit remembers as long as the power is on.
Non-volatile Memory: Circuit remembers for many years, even if power is off.

Idea: Store each bit with its complement
[Figure: nodes x and x̄ on a “Row”; nodes y and ȳ held at Gnd/Vdd or Vdd/Gnd]
Why? We can use the redundant representation to compensate for noise and leakage.

Case #1: y = Gnd, ȳ = Vdd ...
[Figure: Case #1 circuit; nodes x, x̄, y, ȳ on a “Row”, Gnd and Vdd, currents Isd and Ids]

Case #2: y = Vdd, ȳ = Gnd ...
[Figure: Case #2 circuit; nodes x, x̄, y, ȳ on a “Row”, Gnd and Vdd, currents Isd and Ids]

Combine both cases to complete circuit
[Figure: “Cross-coupled inverters” with nodes x, x̄, y, ȳ; labels Gnd, Vdd, Vth, noise]

SRAM Challenge #1: It’s so big!
SRAM area is 6X-10X DRAM area, same generation ...
Cell has both transistor types
Capacitors are usually “parasitic” capacitance of wires and transistors.
Vdd AND Gnd
Lots of contacts, transistors, two bit lines ...

Recall: Positive edge-triggered flip-flop
[Figure: D flip-flop (D, Q) built from a sampling circuit followed by a circuit that holds the value]
A flip-flop “samples” right before the edge, and then “holds” value.
16 Transistors: Makes an SRAM look compact!
What do we get for the 10 extra transistors? Clocked logic semantics.

Sensing: When clock is low
[Figure: the same flip-flop circuit (sampling circuit, holding circuit), shown with the clock low]
clk = 0, clk’ = 1
Will capture new value on posedge.
Outputs last value captured.

Capture: When clock goes high
[Figure: the same flip-flop circuit (sampling circuit, holding circuit), shown with the clock high]
clk = 1, clk’ = 0
Remembers value just captured.
Outputs value just captured.

Challenge #2: Writing is a “fight”
When word line goes high, bitlines “fight” with cell inverters to “flip the bit” -- must win quickly!
Solution: tune W/L of cell & driver transistors
[Figure: two write cases; initial state Vdd with bitline driving Gnd, and initial state Gnd with bitline driving Vdd]

Challenge #3: Preserving state on read
When word line goes high on read, cell inverters must drive large bitline capacitance quickly, to preserve state on its small cell capacitances.
[Figure: cell state Vdd on one side and Gnd on the other; each bitline is a big capacitor]

SRAM Operation Summary
[Figure: cell array with word lines and pairs of bit lines]
Most common is 6-transistor (6T) cell array.
Word line selects this cell, and all others in a row.
Write operation: column bit lines are driven differentially (0 on one, 1 on the other). Values overwrite cell state.
Read operation: column bit lines are “precharged”, then released. Cell pulls down one bit line or the other.
“Sense Amplifier” circuit quickly amplifies difference between bit lines (saves time & energy).

Multi-ported Memory
‣ Motivation:
  ‣ Consider CPU core register file:
    ‣ 1 read or write per cycle limits processor performance.
    ‣ Complicates pipelining. Difficult for different instructions to simultaneously read or write regfile.
    ‣ Common arrangement in pipelined CPUs is 2 read ports and 1 write port.
  ‣ I/O data buffering:
    [Figure: disk or network interface, data buffer, CPU]
    ‣ Dual-porting allows both sides to simultaneously access memory at full bandwidth.
[Figure: dual-port memory with inputs Aa, Dina, WEa, Ab, Dinb, WEb and outputs Douta, Doutb]
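The "2 read ports and 1 write port" arrangement mentioned for pipelined CPUs can be sketched behaviorally in Verilog. This is a functional abstraction only, under stated assumptions: the module name `regfile_2r1w`, the port names, the default sizes and the choice of combinational (asynchronous) reads are all illustrative, and `$clog2` requires a Verilog-2005 (or later) tool.

```verilog
// Sketch: register file with 2 read ports and 1 write port.
// Names and sizes are hypothetical; reads are combinational in this model.
module regfile_2r1w #(
  parameter WIDTH = 32,              // bits per register
  parameter DEPTH = 32,              // number of registers
  parameter ADDR  = $clog2(DEPTH)    // address bits per port
) (
  input  wire             clk,
  input  wire             we,        // write enable for the single write port
  input  wire [ADDR-1:0]  waddr,
  input  wire [WIDTH-1:0] wdata,
  input  wire [ADDR-1:0]  raddr_a,   // read port A address
  input  wire [ADDR-1:0]  raddr_b,   // read port B address
  output wire [WIDTH-1:0] rdata_a,
  output wire [WIDTH-1:0] rdata_b
);
  reg [WIDTH-1:0] regs [0:DEPTH-1];

  // One write per cycle, on the rising clock edge.
  always @(posedge clk)
    if (we)
      regs[waddr] <= wdata;

  // Two independent reads in the same cycle. A read of the address being
  // written returns the old value until the clock edge updates the array.
  assign rdata_a = regs[raddr_a];
  assign rdata_b = regs[raddr_b];
endmodule
```

In this sketch an instruction can read two source operands while another writes its result in the same cycle, which is the access pattern the slide describes. How the ports are physically realized (extra word lines and bit lines per cell) is left to the memory implementation.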
Dual-ported Memory Internals
‣ Add decoder, another set of read/write logic, bit lines, word lines:
[Figure: cell array with decoders deca and decb on the address ports and two sets of r/w logic on the data ports]
• Example cell: SRAM
[Figure: dual-ported SRAM cell with word lines WL1, WL2 and bit line pairs b1, b2]
• Repeat everything but cross-coupled inverters.
• This scheme extends up to a couple more ports, then need to add additional transistors.

Cascading Memory-Blocks
How to make larger memory blocks out of smaller ones.
Increasing the width. Example: given 1Kx8, want 1Kx16
[Figure: 1Kx16 memory built from 1Kx8 blocks]

Cascading Memory-Blocks
How to make larger memory blocks out of smaller ones.
Increasing the depth. Example: given 1Kx8, want 2Kx8
[Figure: 2Kx8 memory built from 1Kx8 blocks]

Other Regular Structures
‣ In Transparencies

DPC DataPath Compiler
• Custom Performance with ASIC Effort
• 3X Faster than ASIC
• 40% Smaller than ASIC
• 10X Less Effort than Full Custom

DPC reads the output from static timing analysis and displays the critical paths directly on the schematic. The placement can be modified to optimize critical paths, or extra drivers can be added to the critical path, all in the schematic. You then run through the placement and timing iteration again. This iteration continues until the timing criteria are satisfied. The iteration loop is fast and visual. When you are satisfied with the design performance, the placement file (DEF file) is passed to a routing tool. The routed result can then be read into the MAX Layout Editor to view, and edit if necessary.

The critical paths are displayed directly on the schematic at all levels of the design hierarchy. In addition, the actual delays of the paths are annotated onto the wires in both the schematic and placement view.

The Tool for High Performance Designs
DPC is the tool used by designers needing high performance chips. They want the performance of full custom design, but with a much shorter design cycle. Datapaths designed with DPC are 3X (three times) faster and 40% smaller than synthesis and place and route. At the same time, it takes 10X (ten times) less effort than full custom design.

In deep sub-micron design, wire length is the dominant factor affecting critical path timing. Cell placement becomes a critical step in chip performance as well as power consumption. With traditional tools, designers are at the mercy of automatic placement tools. The DataPath Compiler (DPC) lets the designer control placement with immediate timing feedback. Multiple what-if experiments can be performed. Using a graphical display that back annotates timing to the schematic, you can easily identify timing problems and rapidly iterate through potential solutions, yielding faster results. DPC is so fast it can place, and then time, a 50K gate datapath in 2-3 minutes.

Useful Identification of Critical Paths
DPC predicts wire lengths early in the design cycle.
The resulting timing iterations are both fast and accurate, allowing the designer to quickly iterate to their performance goal.

With DPC, you first enter the schematics into the SUE design manager. DPC then uses the schematic as a seed for placement. Once DPC has the placement, it is able to estimate the wiring delays and send this info to a static timing analyzer. The results of static timing analysis are then read back into SUE. The critical path is highlighted in both the schematic and placement view. Additionally, the delay and slope at each node are displayed.

The example above shows the placement generated for our sample 8-bit ALU. A critical path is highlighted in red and yellow on both the schematic and placement views. Timing for other nets is indicated in the menu and new nets can be selected and highlighted.

DPC for Critical Path Optimization
In datapath designs, some simple directives by the designer can produce speed-optimized layouts. These directives are easily given and modified in DPC. The placement of components on the schematic directs relative placement in the placement file.

DPC Features:
• Automatically route, generate parasitics, run timing analysis, and display critical-path timing directly on schematics.
• DPC includes its own timing analyzer, or you can use integrated static timing analysis tools such as Pearl, PathMill and PrimeTime.
• Fast - can do a 50K gate data path in a few minutes.
• Use standard cells or custom datapath cells.
• Write out DEF placement information and Verilog netlist for integration with routing tools.
• Available on LINUX platforms.

[Figure: flow diagram with blocks SUE, DPC Placement & Parasitic Estimation, Timing Analysis, Router, Parasitic Extraction, GDSII]

Cells can also be hard placed at specific row or column locations and empty space can be indicated. DPC automatically generates the row and column placement and predicted wire lengths.
Wire predictions can be used to drive the DPC timing analyzer as well as external timing analyzers including Pearl, PrimeTime and PathMill.

DPC reduces the time required for placement and timing analysis from days to minutes.

The DEF placement file is sent to a router. The resulting GDSII file can then be read into MAX (Micro Magic’s layout tool).

[Figures 1-a, 2-a, 2-b; caption text: “DPC Design Flow”]

Micro Magic, Inc., Sunnyvale, CA USA. Phone: 408.414.7647. www.micromagic.com
Copyright 1995-2006, Micro Magic, Inc. All rights reserved.

Regular Structures
‣ In principle, standard cell libraries are sufficient for implementation of any logic circuit.
‣ In practice, great for “random logic”, but what about other functions?
‣ With logic synthesis and standard cell place and route as good as it is, is there still a place for special regular structure layout generators?
‣ Exploiting regularity allows us to build special “generators”, which often leads to improved area, energy, and performance.
‣ We looked at RAM, ROM, PLA, shifters
‣ Are there others?

Random Notes
‣ Multiplication: another regular structure example
‣ How do we (or should we) exploit “regular structures” in our design flow?
  ‣ Special predesigned blocks
    ‣ ex: “large” SRAM block in library for instantiation
  ‣ Special layout generators with special leaf cells
    ‣ ex: PLA generators. SRAM/ROM generators.
  ‣ Special layout generators using standard cells
    ‣ Datapath compilers
‣ Is there always a clear win?
  ‣ ex: ROM table might be smaller and faster implemented as logic equations in standard cells (with place and route)
  ‣ Clear advantage for SRAM, others?
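As a small illustration of the ROM-as-logic-equations point, a lookup table can be written as a combinational `case` statement and handed to synthesis and place and route instead of a ROM generator. This is a sketch under stated assumptions: the module name `rom_squares`, its ports, and the table contents (squares of a 3-bit input) are made up for the example.

```verilog
// Sketch: an 8 x 6 ROM described as combinational logic.
// The table (n squared for n = 0..7) is an illustrative choice only.
module rom_squares (
  input  wire [2:0] addr,
  output reg  [5:0] data
);
  always @(*) begin
    case (addr)
      3'd0:    data = 6'd0;
      3'd1:    data = 6'd1;
      3'd2:    data = 6'd4;
      3'd3:    data = 6'd9;
      3'd4:    data = 6'd16;
      3'd5:    data = 6'd25;
      3'd6:    data = 6'd36;
      3'd7:    data = 6'd49;
      default: data = 6'd0;   // only reached for unknown (x/z) addresses in simulation
    endcase
  end
endmodule
```

Because the contents are fixed, a synthesis tool is free to minimize the table into a handful of gates. Whether that beats a generated ROM depends on the table size and how much structure the contents have, which is the open question the slide raises.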