Pipelining Review and Its Limitations Yuri Baida [email protected] [email protected] October 16, 2010 MDSP Project | Intel Lab Moscow Institute of Physics and Technology Agenda • Review – Instruction set architecture – Basic tools for computer architects • Amdahl's law • Pipelining – Ideal pipeline – Cost-performance trade-off – Dependencies – Hazards, interlocks & stalls – Forwarding – Limits of simple pipeline MDSP Project | Intel Lab Moscow Institute of Physics and Technology 2 Review MDSP Project | Intel Lab Moscow Institute of Physics and Technology 3 Instruction Set Architecture • ISA is the hardware/software interface – Defines set of programmer visible state – Defines instruction format (bit encoding) and instruction semantics – Examples: MIPS, x86, IBM 360, JVM • Many possible implementations of one ISA – 360 implementations: model 30 (c. 1964), z10 (c. 2008) – x86 implementations: 8086, 286, 386, 486, Pentium, Pentium Pro, Pentium 4, Core 2, Core i7, AMD Athlon, Transmeta Crusoe – MIPS implementations: R2000, R4000, R10000, R18K, … – JVM: HotSpot, PicoJava, ARM Jazelle, ... MDSP Project | Intel Lab Moscow Institute of Physics and Technology 4 Basic Tools for Computer Architects • • • • • • • Amdahl's law Pipelining Out-of-order execution Critical path Speculation Locality Important metrics – Performance – Power/energy – Cost – Complexity MDSP Project | Intel Lab Moscow Institute of Physics and Technology 5 Amdahl's Law MDSP Project | Intel Lab Moscow Institute of Physics and Technology 6 Amdahl’s Law: Definition • Speedup = Timewithout enhancement / Timewith enhancement • Suppose an enhancement speeds up a fraction f of a task by a factor of S 1‒f 1‒f f f/S • We should concentrate efforts on improving frequently occurring events or frequently used mechanisms MDSP Project | Intel Lab Moscow Institute of Physics and Technology 7 Amdahl’s Law: Example • New processor 10 times faster • Input-output is a bottleneck – 60% of time we wait • Let’s calculate • Its human nature to be attracted by 10× faster – Keeping in perspective its just 1.6× faster MDSP Project | Intel Lab Moscow Institute of Physics and Technology 8 Pipelining MDSP Project | Intel Lab Moscow Institute of Physics and Technology 9 A Generic 4-stage Processor Pipeline Fetch Unit gets the next instruction from the cache. Decode Fetch Floating Multimedia Point Unit Unit Decode Unit determines type of instruction. Instruction and data sent to Execution Unit. Write L2 Cache Write Unit stores result. Integer Unit Instruction Fetch Decode Execute Instruction Fetch Decode Read Write Execute Memory Write MDSP Project | Intel Lab Moscow Institute of Physics and Technology 10 Pipeline: Steady State Cycle 1 2 3 Instr1 Fetch Decode Read Fetch Decode Read Fetch Decode Read Fetch Decode Read Fetch Decode Read Fetch Decode Instr2 Instr3 Instr4 Instr5 4 5 Execute Memory 6 8 9 Write Execute Memory Instr6 7 Write Execute Memory Write Execute Memory Write Execute Memory Read Execute • Latency — elapsed time from start to completion of a particular task • Throughput — how many tasks can be completed per unit of time • Pipelining only improves throughput – Each job still takes 4 cycles to complete • Real life analogy: Henry Ford’s automobile assembly line MDSP Project | Intel Lab Moscow Institute of Physics and Technology 11 Pipelining Illustrated L TP ~ 1/n n Gate Delay L n/2 Gate Delay L L n/3 Gate Delay n/3 Gate Delay L n/2 Gate Delay L TP ~ 2/n n/3 Gate Delay TP ~ 3/n MDSP Project | Intel Lab Moscow Institute of Physics and Technology 12 Pipelining Performance Model • Starting from an unpipelined version with propagation delay T and TP = 1/T Unpipelined k-stage pipelined T/k S T where • S — delay through latch and overhead … T/k S MDSP Project | Intel Lab Moscow Institute of Physics and Technology 13 Hardware Cost Model • Starting from an unpipelined version with hardware cost G Unpipelined k-stage pipelined G/k where • L — cost of adding each latch L G … G/k L MDSP Project | Intel Lab Moscow Institute of Physics and Technology 14 Cost/Performance Trade-off [Peter M. Kogge, 1981] MDSP Project | Intel Lab Moscow Institute of Physics and Technology 15 Pipelining Idealism • Uniform suboperations – The operation to be pipelined can be evenly partitioned into uniformlatency suboperations • Repetition of identical operations – The same operations are to be performed repeatedly on a large number of different inputs • Repetition of independent operations – All the repetitions of the same operation are mutually independent – Good example: automobile assembly line MDSP Project | Intel Lab Moscow Institute of Physics and Technology 16 Pipelining Reality • Uniform suboperations... NOT! ⇒ Balance pipeline stages – Stage quantization to yield balanced stages – Minimize internal fragmentation (some waiting stages) • Identical operations... NOT! ⇒ Unify instruction types – Coalescing instruction types into one “multi-function” pipe – Minimize external fragmentation (some idling stages) • Independent operations... NOT! ⇒ Resolve dependencies – Inter-instruction dependency detection and resolution – Minimize performance lose MDSP Project | Intel Lab Moscow Institute of Physics and Technology 17 Pipelining Reality • Uniform suboperations... NOT! ⇒ Balance pipeline stages – Stage quantization to yield balanced stages – Minimize internal fragmentation (some waiting stages) • Identical operations... NOT! ⇒ Unify instruction types – Coalescing instruction types into one “multi-function” pipe – Minimize external fragmentation (some idling stages) • Independent operations... NOT! ⇒ Resolve dependencies – Inter-instruction dependency detection and resolution – Minimize performance lose MDSP Project | Intel Lab Moscow Institute of Physics and Technology 18 Pipelining Reality • Uniform suboperations... NOT! ⇒ Balance pipeline stages – Stage quantization to yield balanced stages – Minimize internal fragmentation (some waiting stages) • Identical operations... NOT! ⇒ Unify instruction types – Coalescing instruction types into one “multi-function” pipe – Minimize external fragmentation (some idling stages) • Independent operations... NOT! ⇒ Resolve dependencies – Inter-instruction dependency detection and resolution – Minimize performance lose MDSP Project | Intel Lab Moscow Institute of Physics and Technology 19 Pipelining Reality • Uniform suboperations... NOT! ⇒ Balance pipeline stages – Stage quantization to yield balanced stages – Minimize internal fragmentation (some waiting stages) • Identical operations... NOT! ⇒ Unify instruction types – Coalescing instruction types into one “multi-function” pipe – Minimize external fragmentation (some idling stages) • Independent operations... NOT! ⇒ Resolve dependencies – Inter-instruction dependency detection and resolution – Minimize performance lose MDSP Project | Intel Lab Moscow Institute of Physics and Technology 20 Pipelining Reality • Uniform suboperations... NOT! ⇒ Balance pipeline stages – Stage quantization to yield balanced stages – Minimize internal fragmentation (some waiting stages) • Identical operations... NOT! ⇒ Unify instruction types – Coalescing instruction types into one “multi-function” pipe – Minimize external fragmentation (some idling stages) • Independent operations... NOT! ⇒ Resolve dependencies – Inter-instruction dependency detection and resolution – Minimize performance lose MDSP Project | Intel Lab Moscow Institute of Physics and Technology 21 Pipelining Reality • Uniform suboperations... NOT! ⇒ Balance pipeline stages – Stage quantization to yield balanced stages – Minimize internal fragmentation (some waiting stages) • Identical operations... NOT! ⇒ Unify instruction types – Coalescing instruction types into one “multi-function” pipe – Minimize external fragmentation (some idling stages) • Independent operations... NOT! ⇒ Resolve dependencies – Inter-instruction dependency detection and resolution – Minimize performance lose MDSP Project | Intel Lab Moscow Institute of Physics and Technology 22 Hazards, Interlocks, and Stalls • Pipeline hazards – Potential violations of program dependences – Must ensure program dependences are not violated • Hazard resolution – Static Method: performed at compiled time in software – Dynamic Method: performed at run time using hardware – Stall – Flush – Forward • Pipeline interlock – Hardware mechanisms for dynamic hazard resolution – Must detect and enforce dependences at run time MDSP Project | Intel Lab Moscow Institute of Physics and Technology 23 Dependencies & Pipeline Hazards • Data dependence (register or memory) – True dependence (RAW) – Instruction must wait for all required input operands – Anti-dependence (WAR) – Later write must not clobber a still-pending earlier read – Output dependence (WAW) – Earlier write must not clobber an already-finished later write • Control dependence – A “data dependency” on the instruction pointer – Conditional branches cause uncertainty to instruction sequencing • Resource conflicts – Two instructions need the same device MDSP Project | Intel Lab Moscow Institute of Physics and Technology 24 Example: Quick Sort for MIPS # for (;(j<high)&&(array[j]<array[low]);++j); # $10 = j; $9 = high; $6 = array; $8 = low MDSP Project | Intel Lab Moscow Institute of Physics and Technology 25 Example: Quick Sort for MIPS # for (;(j<high)&&(array[j]<array[low]);++j); # $10 = j; $9 = high; $6 = array; $8 = low bge mul addu lw mul addu lw bge $10, $15, $24, $25, $13, $14, $15, $25, $9, $10, $6, 0($24) $8, $6, 0($14) $15, L2 4 $15 addu … $10, $10, 1 addu $11, $11, -1 4 $13 L2 L1: L2: MDSP Project | Intel Lab Moscow Institute of Physics and Technology 26 Example: Quick Sort for MIPS # for (;(j<high)&&(array[j]<array[low]);++j); # $10 = j; $9 = high; $6 = array; $8 = low bge mul addu lw mul addu lw bge $10, $15, $24, $25, $13, $14, $15, $25, $9, $10, $6, 0($24) $8, $6, 0($14) $15, L2 4 $15 addu … $10, $10, 1 addu $11, $11, -1 4 $13 L2 L1: L2: MDSP Project | Intel Lab Moscow Institute of Physics and Technology 27 Example: Quick Sort for MIPS # for (;(j<high)&&(array[j]<array[low]);++j); # $10 = j; $9 = high; $6 = array; $8 = low bge mul addu lw mul addu lw bge $10, $15, $24, $25, $13, $14, $15, $25, $9, $10, $6, 0($24) $8, $6, 0($14) $15, L2 4 $15 addu … $10, $10, 1 addu $11, $11, -1 4 $13 L2 L1: L2: MDSP Project | Intel Lab Moscow Institute of Physics and Technology 28 Example: Quick Sort for MIPS # for (;(j<high)&&(array[j]<array[low]);++j); # $10 = j; $9 = high; $6 = array; $8 = low bge mul addu lw mul addu lw bge $10, $15, $24, $25, $13, $14, $15, $25, $9, $10, $6, 0($24) $8, $6, 0($14) $15, L2 4 $15 addu … $10, $10, 1 addu $11, $11, -1 4 $13 L2 L1: L2: MDSP Project | Intel Lab Moscow Institute of Physics and Technology 29 Example: Quick Sort for MIPS # for (;(j<high)&&(array[j]<array[low]);++j); # $10 = j; $9 = high; $6 = array; $8 = low bge mul addu lw mul addu lw bge $10, $15, $24, $25, $13, $14, $15, $25, $9, $10, $6, 0($24) $8, $6, 0($14) $15, L2 4 $15 addu … $10, $10, 1 addu $11, $11, -1 4 $13 L2 L1: L2: MDSP Project | Intel Lab Moscow Institute of Physics and Technology 30 Pipeline: Data Hazards Cycle 1 2 3 Instr1 Fetch Decode Read Fetch Decode Read Fetch Decode Read Fetch Decode Read Fetch Decode Read Fetch Decode Instr2 Instr3 Instr4 4 5 Execute Memory Instr5 Instr6 6 7 8 9 Write Execute Memory Write Execute Memory Write Execute Memory Write Execute Memory Read Execute • Instr2: _ → rk • Instr3: rk → _ • How long should we stall for? MDSP Project | Intel Lab Moscow Institute of Physics and Technology 31 Pipeline: Stall on Data Hazard Cycle 1 2 3 Instr1 Fetch Decode Read Fetch Decode Read Fetch Decode Fetch Instr2 Instr3 Instr4 Instr5 4 5 Execute Memory 6 7 8 9 Stalled Read Execute Stalled Decode Read Stalled Fetch Decode Write Execute Memory Write Instr6 • • • • • Instr2: _ → rk Bubble Bubble Bubble Instr3: rk → _ • Make the younger instruction wait until the hazard has passed: – Stop all up-stream stages – Drain all down-stream stages MDSP Project | Intel Lab Moscow Institute of Physics and Technology 32 Pipeline: Forwarding Cycle 1 2 3 Instr1 Fetch Decode Read Fetch Decode Read Fetch Decode Read Fetch Decode Read Fetch Decode Read Fetch Decode Instr2 Instr3 Instr4 Instr5 Instr6 4 5 Execute Memory 6 7 8 9 Write Execute Memory Write Execute Memory Write Execute Memory Write Execute Memory Read Execute MDSP Project | Intel Lab Moscow Institute of Physics and Technology 33 Limitations of Simple Pipelined Processors (aka Scalar Processors) • Upper bound on scalar pipeline throughput – Limited by IPC = 1 • Inefficiencies of very deep pipelines – Clocking overheads – Longer hazards and stalls • Performance lost due to in-order pipeline – Unnecessary stalls • Inefficient unification Into single pipeline – Long latency for each instruction – Hazards and associated stalls MDSP Project | Intel Lab Moscow Institute of Physics and Technology 34 Limitations of Deeper Pipelines Unpipelined k-stage pipelined T/k S (Code size) (CPI) (Cycle) T … T/k ? Eventually limited by S S MDSP Project | Intel Lab Moscow Institute of Physics and Technology 35 Put it All Together: Limits to Deeper Pipelines Source: Ed Grochowski, 1997 MDSP Project | Intel Lab Moscow Institute of Physics and Technology 36 Acknowledgements • These slides contain material developed and copyright by: – Grant McFarland (Intel) – Christos Kozyrakis (Stanford University) – Arvind (MIT) – Joel Emer (Intel/MIT) MDSP Project | Intel Lab Moscow Institute of Physics and Technology 37
© Copyright 2025 Paperzz