Modern Computer Architecture
Instructor: Prof. Zhang Gang, School of Computer Science, Tianjin University
Contact: [email protected]   Homework submission: [email protected]
2013

Exploiting ILP Using Multiple Issue and Static Scheduling

Multiple-Issue Processors Come in Three Major Flavors
• Statically scheduled superscalar processors
  – issue varying numbers of instructions per clock
  – use in-order execution
• Dynamically scheduled superscalar processors
  – issue varying numbers of instructions per clock
  – use out-of-order execution
• VLIW (very long instruction word) processors
  – issue a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet

The Basic VLIW Approach
• VLIWs use multiple, independent functional units
• A VLIW either packages the multiple operations into one very long instruction, or requires that the instructions in the issue packet satisfy a fixed set of constraints
• There is no fundamental difference between the two approaches

Case Study: A VLIW Processor
• A VLIW processor with instructions that contain five operations
  – one integer operation (or a branch)
  – two floating-point operations
  – two memory references
• An instruction length of between 80 and 120 bits
  – 16 to 24 bits per field => 5 × 16 = 80 bits to 5 × 24 = 120 bits wide

Recall: Unrolled Loop that Minimizes Stalls for Scalar

 1 Loop: L.D    F0,0(R1)
 2       L.D    F6,-8(R1)
 3       L.D    F10,-16(R1)
 4       L.D    F14,-24(R1)
 5       ADD.D  F4,F0,F2
 6       ADD.D  F8,F6,F2
 7       ADD.D  F12,F10,F2
 8       ADD.D  F16,F14,F2
 9       S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,#32
13       BNEZ   R1,LOOP
14       S.D    8(R1),F16    ; 8 - 32 = -24

Latencies: L.D to ADD.D: 1 cycle; ADD.D to S.D: 2 cycles.
14 clock cycles, or 3.5 per iteration.

Loop Unrolling in VLIW

Clock  Memory reference 1  Memory reference 2  FP operation 1    FP operation 2    Int. op/branch
1      L.D F0,0(R1)        L.D F6,-8(R1)
2      L.D F10,-16(R1)     L.D F14,-24(R1)
3      L.D F18,-32(R1)     L.D F22,-40(R1)     ADD.D F4,F0,F2    ADD.D F8,F6,F2
4      L.D F26,-48(R1)                         ADD.D F12,F10,F2  ADD.D F16,F14,F2
5                                              ADD.D F20,F18,F2  ADD.D F24,F22,F2
6      S.D 0(R1),F4        S.D -8(R1),F8       ADD.D F28,F26,F2
7      S.D -16(R1),F12     S.D -24(R1),F16
8      S.D -32(R1),F20     S.D -40(R1),F24                                         DSUBUI R1,R1,#48
9      S.D -0(R1),F28                                                              BNEZ R1,LOOP

• Unrolled 7 times to avoid delays
• 7 iterations in 9 clocks, or 1.3 clocks per iteration (1.8X)
• Average: 2.5 ops per clock, 50% efficiency
• Note: VLIW needs more registers (15 vs. 6 in the superscalar version)

Problems with 1st-Generation VLIW
• Increase in code size
  – generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
  – whenever VLIW instructions are not full, the unused functional-unit slots translate to wasted bits in the instruction encoding
• Lock-step operation; no hazard-detection hardware
  – a stall in any functional-unit pipeline caused the entire processor to stall, since all functional units had to be kept synchronized
  – the compiler can predict functional-unit latencies, but cache misses are hard to predict
• Binary code compatibility
  – pure VLIW => different numbers of functional units and different unit latencies require different versions of the code

Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"
• IA-64: an instruction set architecture
• 128 64-bit integer registers + 128 82-bit floating-point registers
  – not separate register files per functional unit as in old VLIW designs
• Hardware checks dependencies
• Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?
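The throughput figures quoted for the 7-times-unrolled VLIW schedule can be checked with a short calculation. This is a minimal sketch; the operation count (23) and the 5-slot machine width come from the schedule and the case-study processor above:

```python
# Throughput of the 7-times-unrolled VLIW schedule:
# 5 issue slots per instruction word (2 mem, 2 FP, 1 int/branch).
slots_per_clock = 5
clocks = 9                       # the schedule occupies 9 clock cycles
iterations = 7                   # loop body unrolled 7 times
ops = 7 * 3 + 2                  # 7 x (L.D, ADD.D, S.D) + DSUBUI + BNEZ = 23

clocks_per_iter = clocks / iterations          # ~1.3 clocks per iteration
ops_per_clock = ops / clocks                   # ~2.6 (the slide quotes 2.5)
efficiency = ops / (clocks * slots_per_clock)  # ~51% of slots used (~50%)

print(round(clocks_per_iter, 1), round(efficiency, 2))
```

The ~50% slot utilization is exactly the "wasted bits" problem listed next: nearly half of the encoded operation fields are no-ops.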
Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"
• Itanium™ was the first implementation (2001)
  – highly parallel and deeply pipelined hardware at 800 MHz
  – 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process
• Itanium 2™ is the name of the second implementation (2005)
  – 6-wide, 8-stage pipeline at 1666 MHz on a 0.13 µm process
  – caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3

Increasing Instruction Fetch Bandwidth
• Predict the next instruction address and send it out before decoding the instruction
• The PC of the fetched branch is sent to the BTB
• When a match is found, the predicted PC is returned
• If the branch is predicted taken, instruction fetch continues at the predicted PC

Example
• On a 2-issue processor:

Loop: LW     R2,0(R1)     ; R2 = array element
      DADDIU R2,R2,#1     ; increment R2
      SW     0(R1),R2     ; store result
      DADDIU R1,R1,#4     ; increment pointer
      BNE    R2,R3,Loop   ; branch if not last element

• Assume
  – separate integer functional units for effective-address calculation, ALU operations, and branch-condition evaluation
  – up to two instructions of any type can commit per clock

Without speculation, the control dependency is the main performance limitation.
With speculation, execution of different iterations can overlap.

Branch Target Buffer (BTB)
• To reduce the branch penalty, we must know whether the as-yet-undecoded instruction is a branch and, if so, what the next PC should be.
  – If we do, the branch penalty can be zero.
• Branch-target buffer (branch-target cache)
  – a branch-prediction cache that stores the predicted address of the next instruction after a branch

Return Address Predictor
• Indirect jumps
  – the destination address varies at run time
  – examples: case statements, procedure returns
• A procedure return can be predicted with a branch-target buffer, but the accuracy can be low. Why?
  – the procedure may be called from multiple sites
  – the calls from one site are not clustered in time, e.g.,
nested recursion

Return Address Predictor
• How can we overcome this problem?
• A small buffer of return addresses that acts as a stack
  – caches the most recent return addresses
  – call: push the return address onto the stack
  – return: pop an address off the stack and predict it as the new PC
[Figure: misprediction frequency vs. number of return-address buffer entries (0, 1, 2, 4, 8, 16) for the benchmarks go, m88ksim, cc1, compress, xlisp, ijpeg, perl, and vortex; mispredictions fall from as high as 70% with no buffer toward 0% with 16 entries]

Integrated Instruction Fetch Units
• Multiple-issue processors demand multiple instructions per clock
• How can this demand be met?
• An integrated instruction fetch unit is one approach; it combines
  – integrated branch prediction
  – instruction prefetch
  – instruction memory access and buffering

Integrated Instruction Fetch Units
• Integrated branch prediction
  – the branch predictor is part of the instruction fetch unit and constantly predicts branches
• Instruction prefetch
  – the fetch unit prefetches to deliver multiple instructions per clock, integrating prefetch with branch prediction
• Instruction memory access and buffering
  – fetching multiple instructions per cycle
    • may require accessing multiple cache blocks (prefetch to hide the cost of crossing cache blocks)
    • provides buffering, acting as an on-demand unit that supplies instructions to the issue stage as needed, in the quantity needed

Value Prediction
• Taxonomy of speculative execution

Why Can We Do Value Prediction?
• Several recent studies have shown that there is significant result redundancy in programs, i.e., many instructions perform the same computation and hence produce the same result over and over again.
• These studies have found that, for several benchmarks, more than 75% of the dynamic instructions produce the same result as before.
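The redundancy statistic above is what makes even the simplest "last value" predictor useful. A minimal sketch of such a Value Prediction Table, indexed by instruction PC (the PC and the value stream here are illustrative, not from any benchmark):

```python
# Last-value predictor sketch: a Value Prediction Table (VPT) indexed by
# instruction PC that predicts "same result as last time".
class VPT:
    def __init__(self):
        self.last = {}                # PC -> last observed result

    def predict(self, pc):
        return self.last.get(pc)      # None = no prediction available yet

    def update(self, pc, result):
        correct = self.last.get(pc) == result
        self.last[pc] = result        # remember the latest result
        return correct                # did the earlier prediction hold?

# An instruction that produces the same value on every execution is
# predicted correctly every time after its first execution.
vpt = VPT()
hits = 0
for _ in range(100):
    if vpt.predict(0x200) == 7:
        hits += 1
    vpt.update(0x200, 7)
print(hits)  # 99: every execution after the first is predicted
```

A real VPT would be a finite, tagged hardware table rather than an unbounded dictionary, and would add confidence counters to avoid speculating on unstable values.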
Value Prediction
• Attempts to predict the value produced by an instruction
  – e.g., a load of a value that changes infrequently
• Value prediction is useful only if it significantly increases ILP
  – research has focused on loads; results have been so-so, and no processor uses value prediction
• A related topic is address-aliasing prediction
  – RAW for a load and a store, or WAW for two stores
  – address-alias prediction is both more stable and simpler, since it need not actually predict the address values, only whether such values conflict
  – it has been used by a few processors

Pipeline with VP
• Predictions are obtained from a hardware table called the Value Prediction Table (VPT).
• The predicted values are used as inputs by instructions, which can then execute earlier than they could if they had to wait for their inputs to become available in the traditional way.
• When the correct values become available (after executing an instruction), the speculated values are verified:
  – if a speculation is found to be wrong, the instructions that executed with the wrong inputs are re-executed
  – if the speculation is found to be correct, nothing special needs to be done

Pipeline with VP
• Consider the flow of a dependent chain of instructions (I, J, and K) through two different pipelines: (i) a base pipeline (without VP or IR); (ii) a pipeline with VP.
• We assume the instructions I, J, and K are fetched, decoded, and renamed together.
• In the base pipeline the instructions execute sequentially, since they are data dependent, requiring three cycles to execute;
  – the chain is committed by cycle 6.
• In the pipeline with VP, the dependence between the instructions is broken by predicting the outputs of I and J (equivalently, the inputs of J and K). This enables the three instructions to execute simultaneously;
  – the chain is committed in cycle 4.

Assignment 4
• Exercise 3.2
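The base-vs-VP timing of the I → J → K chain discussed above can be reproduced with a toy cycle count. The stage timing here (rename finished by cycle 2, 1-cycle execution, commit one cycle after the last execute) is an assumption chosen to match the slide's numbers, not a statement about any real pipeline:

```python
# Toy cycle count for a 3-instruction dependent chain I -> J -> K.
def commit_cycle(value_prediction):
    exec_start = 3   # assumed: fetch/decode/rename occupy cycles 1-2
    if value_prediction:
        # VP predicts the outputs of I and J, so J and K need not wait:
        # all three execute in the same cycle.
        finish = {op: exec_start for op in "IJK"}
    else:
        # Base pipeline: each instruction waits for its predecessor.
        finish = {op: exec_start + i for i, op in enumerate("IJK")}
    # Assumed: the chain commits the cycle after its last execution.
    return max(finish.values()) + 1

print(commit_cycle(False))  # 6: base pipeline commits the chain by cycle 6
print(commit_cycle(True))   # 4: with VP the chain commits in cycle 4
```

The two printed values match the slide's cycle-6 and cycle-4 commit points; the saving is exactly the serialization removed by breaking the data dependences.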