Memory Dependence Prediction
Andreas Moshovos ([email protected], www.ece.nwu.edu/~moshovos)
Electrical and Computer Engineering Department, Northwestern University
Presented at the ECE Dept., Purdue University, Sept. 9, 1999

A. Moshovos, Memory Dependence Prediction. Permission to use these slides is granted provided that a reference to their origin is included.

Slide: Exploiting Program Behavior
Building faster/better machines:
1. Faster circuits -> faster execution
2. Be "smarter" about program execution: exploit idiosyncrasies in program behavior
Example? Caches exploit idiosyncrasies in the access pattern.
Idea: use small/fast storage for recently accessed data.
What to do next? One possibility: identify and exploit other idiosyncrasies in typical program behavior.
This might be the right time: both the technology and the existing knowledge are in place.

Slide: Memory Dependences Are Quite Regular
• Identified a new form of regularity: memory dependence locality.
  Example:
      for i: a[i] = a[i - 1] + 1
  Over time, the same (loadPC, storePC) pairs recur.
• Two questions:
  1. Does this load/store have a dependence?
  2. Which dependence does this load/store have?
• This regularity is an opportunity. How to exploit it? We need techniques.

Slide: Memory Dependence Prediction
• Guess — how? Past behavior is a good indicator of future behavior:
  1. Observe dependences as loads and stores execute.
  2. Predict them the next time the same instructions are seen.
• Goal: the basis for three micro-architectural techniques:
  1. Exploit load/store parallelism
  2. Reduce memory latency
  3. Memory bandwidth amplification
• Memory is a major and ever-increasing bottleneck.
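The observe-then-predict idea on the slide above can be sketched in software. The sketch below is illustrative only — the class name, fields, and PCs are invented, not the hardware structures from the talk. It records (loadPC, storePC) pairs as they are observed and predicts that a pair seen before will recur, which is exactly the locality the `a[i] = a[i - 1] + 1` loop exhibits.

```python
# Hypothetical sketch of dependence-history prediction. All names
# (DependenceHistory, last_writer, pairs) are invented for illustration.

class DependenceHistory:
    def __init__(self):
        self.last_writer = {}   # data address -> PC of most recent store
        self.pairs = set()      # observed (loadPC, storePC) dependences

    def observe_store(self, store_pc, addr):
        self.last_writer[addr] = store_pc

    def observe_load(self, load_pc, addr):
        # 1. Observe: a load that reads a recently stored address
        #    establishes a (loadPC, storePC) dependence.
        if addr in self.last_writer:
            self.pairs.add((load_pc, self.last_writer[addr]))

    def predict(self, load_pc, store_pc):
        # 2. Predict: past behavior as an indicator of future behavior.
        return (load_pc, store_pc) in self.pairs

# The loop "for i: a[i] = a[i - 1] + 1" pairs the same static load
# with the same static store every iteration (PCs 0x10 and 0x40 are
# made-up instruction addresses):
h = DependenceHistory()
for i in range(1, 5):
    h.observe_load(0x10, i - 1)   # load  a[i - 1]
    h.observe_store(0x40, i)      # store a[i]
assert h.predict(0x10, 0x40)      # dependence is now predictable
```

After one iteration establishes the pair, every later dynamic instance of the load can be predicted dependent on the same static store — the regularity the talk exploits.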
Slide: Roadmap
• Exploiting Load/Store Parallelism
  - Two directions for improving memory performance
  - Load/store parallelism
  - Speculation/synchronization
  - Performance
• Speculative Memory Cloaking and Bypassing
• Transient Value Cache
• Other Applications
• Future Directions: Slice Prediction & Slice-Processors

Slide: In Search of Higher Performance #1
[Figure: timeline of an instruction sequence — the load's address is sent to memory, memory responds with the value, and the dependent computation starts; the goal is to minimize this interval.]
1. Higher performance: make memory respond faster.

Slide: #1. Making Memory Respond Faster
Ideally, memory is both large and fast. Can we have it? There is a technology-cost trade-off.
Solution: a memory hierarchy — L1, then L2 (slower, larger, cheaper), then main memory.
OK, we did the best we could, but...
...memory is still not that fast,
...and, relative to the processor, it is getting slower.
Is this the end?

Slide: In Search of Higher Performance #2
[Figure: timeline contrasting the original sequence with a modified one in which the load is hoisted above its old position, so its request is sent earlier and the computation starts sooner; the goal is to maximize this lead time.]
2. Higher performance: send the load request as far in advance as possible.
But will the program still run correctly?
A. Can we ever move loads up?
B. How do we do it?

Slide: A. Can We Ever Move Loads Up?
Instruction sequence — two cases:
    A:   r1 = r2 + 10
         load ..., [r1 + 5]    (dependence: the only valid order is A, then load)
vs.
    A:   r1 = r2 + 10
         load ..., [r3 + 5]    (parallelism: either order is a valid execution)
A and the load can execute in any order if the load does not use a value produced by A.

Slide: B. How To Move Loads Up?
Instruction-level parallel processors:
Step #1: grab a chunk of instructions (the instruction window).
Step #2: find the dependences by inspecting names.
Step #3: execute, using parallelism to tolerate memory latency.

Slide: Moving Loads Past Stores
[Figure: timeline — the store's address calculation is slow; when the load's address is ready the question "dependence?" cannot yet be answered, so the load executes only after the store's address is known.]
The use of addresses hinders parallelism, which limits our effort to tolerate memory latency.

Slide: Naive Memory Dependence Speculation
• Don't give up — be optimistic and guess that no dependences exist.
• This is the state of the art in modern processors.
[Figure: timeline — the load executes as soon as its address is ready; if a dependence on an earlier store did exist, the load re-executes and a penalty is paid.]
Need to balance: gain vs. penalty.

Slide: Dependence Speculation and Performance
[Figure: program order "store A, load B, C, D" executed three ways — no speculation (load B waits for the store), speculation with no actual dependence (load B executes early, freeing C and D to follow), and speculation with an actual dependence (B and its dependents execute early but must be redone).]
Speculation may affect performance either way — balance gain vs. penalty.
Penalty: (a) work thrown away, (b) opportunity cost.

Slide: How About Future Systems?
• Memory will be (relatively) slower, so loads must be moved even further up.
• Guessing naively: the penalty becomes significant.
• Ideal behavior for loads:
  - No dependence: execute at will.
  - Dependence: synchronize with the store.
• Future systems: we wish the dependences were known.

Slide: Speculation/Synchronization
Learn from mistakes:
1. Start with naive speculation.
2. Remember mispeculations and predict that, next time, the same will happen.
   - Q1: Which loads should wait? A1: Loads with dependences.
   - Q2: How long? A2: Until the appropriate store — synchronize with it.
This is the best-performing scheme.

Slide: How It Works
Two structures:
• MDPT — Memory Dependence Prediction Table: (LDPC, STPC, PRED) entries; predicts which loads have dependences.
• MDST — Memory Dependence Synchronization Table: (LDPC, STPC, full/empty) entries; enforces the synchronization.
[Figure: a store (ST) and two loads (LD) interact with the tables in three steps.]
① On a mispeculation, allocate an MDPT entry for the (load, store) pair.
② When the load asks "may I execute?", the MDPT predicts a dependence, answers "no — wait", and an MDST entry is allocated.
③ When the store executes it signals the MDST entry, and the load resumes.

Slide: Memory Dependence Speculation/Synchronization
A. Dependence predicted:
  1. The load is predicted dependent; a synchronization bit is allocated and the load waits on it.
  2. The matching store is predicted.
  3. The store executes and signals; the load then executes.
B. No dependence predicted:
  1. The load executes immediately.
  2. The store later verifies the prediction.
Correct prediction: loads wait only as long as necessary.
Incorrect prediction: same cost as naive speculation, or an unnecessary delay.
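A minimal software model of the MDPT/MDST interplay described above. It is a sketch under simplifying assumptions — entries are keyed by full PCs, one load waits per entry, and the dictionary-based lookup stands in for the hardware tables; the class and method names are invented for illustration.

```python
# Illustrative model: MDPT predicts "this load depends on that store";
# MDST's full/empty bit synchronizes the pair. Names are invented.

class MDPT:
    """Memory Dependence Prediction Table: loadPC -> predicted storePC."""
    def __init__(self):
        self.entries = {}

    def record_mispeculation(self, load_pc, store_pc):
        self.entries[load_pc] = store_pc      # learn from the mistake

    def predict(self, load_pc):
        return self.entries.get(load_pc)      # None => speculate freely

class MDST:
    """Memory Dependence Synchronization Table: full/empty bits."""
    def __init__(self):
        self.full = {}

    def load_may_execute(self, load_pc, store_pc):
        # The load waits until the predicted store sets the bit.
        return self.full.pop((load_pc, store_pc), False)

    def store_signals(self, store_pc, waiting_load_pc):
        self.full[(waiting_load_pc, store_pc)] = True

# ① a mispeculation between load 0x10 and store 0x40 was observed:
mdpt, mdst = MDPT(), MDST()
mdpt.record_mispeculation(0x10, 0x40)

# ② next dynamic instance: the load is predicted dependent and waits
store_pc = mdpt.predict(0x10)
assert store_pc == 0x40
assert not mdst.load_may_execute(0x10, store_pc)   # "No! Wait"

# ③ the store executes and signals; the load resumes
mdst.store_signals(0x40, 0x10)
assert mdst.load_may_execute(0x10, store_pc)       # "Resume"
```

A load with no MDPT entry gets `None` from `predict` and executes at will — the "no dependence" class on the slide.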
Slide: Speedup over Naive Speculation
[Chart: per-benchmark speedup (0-80%, higher is better) of speculation/synchronization over naive speculation, compared against an ORACLE; configuration: CENTRALIZED, 2 stores/load, 256 entries, SYNONYMS, 4K-entry MDPT, 64-entry MDST.]

Slide: Evaluation — Superscalar
Two questions:
1. Can loads inspect store addresses before obtaining a value? This requires an address-based load/store scheduler (ABS).
2. Is speculation used?
• Speculation is a win (naive over no speculation):
  - No ABS: ~29% (int), ~113% (fp) — but mispeculations are an issue.
  - ABS, 0-cycle scheduling latency: 4.6% (int), 5.3% (fp) — virtually no mispeculations.
• Next:
  1. Compare "oracle + no ABS" with "naive + ABS".
  2. Show that speculation/synchronization with no ABS comes close to the oracle.
Speculation/synchronization as an alternative to ABS.

Slide: "Oracle + no ABS" vs. "Naive + ABS"
[Chart: per-benchmark speedup (-20% to 40%, higher is better) for ORACLE and for NAIVE with 0-, 1-, and 2-cycle scheduling latencies; base case is "ABS + no speculation".]
"Oracle + no ABS" is close to "ABS + naive": there is potential for speculation/synchronization.

Slide: "Oracle + no ABS" over Spec/Sync
[Chart: per-benchmark difference (0-3%, lower is better) between the oracle and speculation/synchronization.]
On average, speculation/synchronization is within 0.93% (int) and 1.001% (fp) of the oracle.
Slide: Summary
• Improving processor/memory performance: move loads up in time.
• Addresses hinder parallelism.
• Memory dependence speculation/synchronization: predict dependences and move loads up in time, classifying loads into two categories:
  1. No dependence: execute at will.
  2. Dependence: synchronize with a specific store.

Slide: Related Work
• In the context of dynamically scheduled processors:
  1. J. H. Hesson, J. LeBlanc, S. J. Ciavaglia, IBM patent, '97: store barrier.
  2. G. Chrysos, J. Emer (DEC), superscalar environment, ISCA '98.
  This work: UW tech. report, March '96; ISCA '97 (A. Moshovos, S. Breach, T. N. Vijaykumar, G. Sohi).
• In the context of statically scheduled processors:
  1. Static disambiguation.
  2. Speculation in software; dynamic compilation.
  3. Software/hardware hybrids.

Slide: Roadmap
• Exploiting Load/Store Parallelism
• Speculative Memory Cloaking and Bypassing
  - Memory as a communication mechanism
  - Speculative memory cloaking
  - Other uses of memory
  - Evaluation
• Transient Value Cache
• Other Applications
• Future Directions: Operation Prediction & Slice-Processors

Slide: Memory as a Communication Mechanism
[Figure: a store calculates an address and writes a value to memory; a later load calculates the same address and reads the value back.]
Memory can be an inter-operation communication mechanism.
The communication is specified implicitly, via addresses. Is this the best way in terms of performance?

Slide: Memory Communication Specification — Limitations
The implicit, address-based specification delays the communication:
1. Calculate the address.
2. Establish the dependence.
[Figure: timeline — under the implicit scheme the load must wait for both its own and the store's address before the value moves; an explicit store-load direct link has no such delays.]

Slide: Predicting Memory Dependences
1. Build a dependence history with a Dependence Detection Table:
   - Stores record (store PC, address).
   - Loads match (load PC, address) against it, yielding (store PC, load PC) pairs.
2. Use the history to predict forthcoming dependences:
   - Assign synonyms to detected dependences.
   - Use the synonym to locate the value.

Slide: Speculative Memory Cloaking
Dynamically and transparently convert implicit communication into explicit communication.
• Dependence prediction creates direct store-load or load-load links.
• The result is speculative and has to be eventually verified.
[Figure: the dependence predictor maps the store PC and load PC to a common synonym; ① the store deposits its value into a synonym-file entry with a full/empty bit, ② the store also writes the traditional memory hierarchy, ③ the load obtains a speculative value through the synonym, ④ the load's address-based access verifies it.]

Slide: Speculative Memory Bypassing
Observe: the store and load are used just to pass a value from a DEF to a USE.
[Figure: the chain DEF R1 -> store R1 -> load R2 -> USE R2 is collapsed so that USE R2 reads the DEF's result tag directly.]
• Extends over multiple store-load dependences.
• DEF and USE must co-exist in the instruction window.
Takes the store-load pair (or the loads) off the communication path.

Slide: Speculative Memory Cloaking/Bypassing
Address-based memory: storage vs. interface. Ask:
1. What is memory used for?
2. How do addresses impact that use?
[Figure: inter-operation communication (DEF RX -> STORE RX -> LOAD RY -> USE RY) is served by cloaking and bypassing; data sharing (LOAD RZ -> USE RZ, LOAD RY -> USE RY) is served by cloaking.]
Dynamically create direct links between producers and consumers.

Slide: Cloaking Coverage
[Chart: per-benchmark fraction of dependences correctly predicted (20-100%), broken down into data, stack, and heap accesses.]
Most dependences are correctly predicted.

Slide: Cloaking — Mispeculation Rates
[Chart: per-benchmark mispeculation rates (0-6% int, 0-1% fp), broken down into data, stack, and heap accesses.]
• Relatively low mispeculation rates.
• Causes: unstable dependences — e.g., recursive functions.

Slide: Performance — Selective Invalidation
[Chart: per-benchmark speedup (0-14%) with selective invalidation vs. an oracle.]

Slide: Summary
• Address-based memory introduces unnecessary delays.
• Used memory dependence prediction to:
  1. Improve store-load and load-load communication: address-based -> dependence-based.
  2. Take the store-load pair off the communication path.
Treat the program as a specification of what to do, not also of how to do it.
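The detection-then-synonym mechanism of cloaking can be sketched as follows. This is an illustrative model under stated assumptions: the class and field names are invented, one synonym serves one store-load pair, and verification is reduced to comparing the speculative value against the value the memory hierarchy later returns.

```python
# Minimal cloaking sketch: a detected (storePC, loadPC) dependence is
# assigned a synonym; the store deposits its value in a synonym file
# and the load picks it up without calculating or disambiguating an
# address. The value is speculative and must be verified. All names
# here are invented for illustration.

class Cloaking:
    def __init__(self):
        self.synonyms = {}       # PC -> synonym id
        self.next_syn = 0
        self.syn_file = {}       # synonym id -> deposited value

    def detect(self, store_pc, load_pc):
        # Dependence detection assigns one synonym to both PCs.
        syn = self.next_syn
        self.next_syn += 1
        self.synonyms[store_pc] = syn
        self.synonyms[load_pc] = syn

    def store(self, pc, value):
        syn = self.synonyms.get(pc)
        if syn is not None:
            self.syn_file[syn] = value   # explicit store->load link

    def load(self, pc):
        # Returns a speculative value, or None (fall back to memory).
        return self.syn_file.get(self.synonyms.get(pc))

c = Cloaking()
c.detect(0x40, 0x10)             # history says store 0x40 feeds load 0x10
c.store(0x40, 123)               # ① deposit value under the synonym
speculative = c.load(0x10)       # ③ load gets the value with no address
actual = 123                     # ④ value the address-based access returns
assert speculative == actual     # verification succeeds
```

Bypassing goes one step further: instead of delivering the value at the load, the load's consumers are linked straight to the store's producer, so the sketch above would hand out the producer's result tag rather than a value.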
Slide: Increasing Memory Bandwidth
• Parallelism demands more bandwidth, i.e., more L1 ports.
• A very small cache (L0) in front of L1 is easier to multi-port.
• But it increases the latency for all accesses that miss in it.

Slide: Transient Value Cache
• Supports multiple simultaneous load/store requests.
• Observe:
  A. A RAW or RAR dependence with a recent store or load often exists — a small cache can service these, but the latency of the other loads would increase.
  B. Many recent stores are killed (overwritten before they need to reach L1).
  C. Both A and B — the dependence status — are predictable.
• Use dependence prediction per access:
  - Predicted dependence: access the TVC in series and avoid the L1 access.
  - Predicted no dependence: access the TVC and L1 in parallel and avoid the extra TVC latency.
D-cache bandwidth/port requirements are reduced, and few loads observe a latency increase.

Slide: TVC vs. L0 — Hit Rates
[Chart: per-benchmark hit rates (0-100%) for the TVC and an L0 cache.]

Slide: TVC vs. L0 — "Miss" Rates
[Chart: per-benchmark rates of accesses that pay added latency — up to ~70% for L0 vs. up to ~8% for the TVC.]

Slide: Other Applications
• Transient Value Cache: bandwidth amplification.
• Prefetching of recursive data structures: load-to-load-EAC dependences.
• Virtual function call precomputation: a surgical extension of the previous technique.
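The serial-vs-parallel access decision above can be modeled in a few lines. This is a rough sketch: the latencies, the prediction input, and the function name are made-up parameters for illustration, not figures from the talk.

```python
# Rough model of the TVC access policy: dependence prediction picks,
# per load, whether to probe the tiny TVC first (serial: saves an L1
# port on a hit) or go straight to L1 (parallel: no added latency on
# a predicted miss). TVC_LAT and L1_LAT are invented latencies.

TVC_LAT, L1_LAT = 1, 3

def access(addr, tvc, l1, predicted_recent_dependence):
    """Return (value, latency, used_l1_port)."""
    if predicted_recent_dependence and addr in tvc:
        return tvc[addr], TVC_LAT, False            # serviced by TVC alone
    if predicted_recent_dependence:
        return l1[addr], TVC_LAT + L1_LAT, True     # serial probe missed
    return l1[addr], L1_LAT, True                   # parallel: straight to L1

l1 = {0x100: 7, 0x200: 9}
tvc = {0x100: 7}                                    # value of a recent store

v, lat, port = access(0x100, tvc, l1, True)
assert (v, lat, port) == (7, 1, False)              # an L1 port is saved
v, lat, port = access(0x200, tvc, l1, False)
assert (v, lat, port) == (9, 3, True)               # no latency penalty
```

The point of the slide falls out of the model: accurate dependence prediction keeps the expensive middle case (serial probe that misses) rare, so the TVC cuts port pressure without the blanket latency penalty of an ordinary L0.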
Slide: Current Research
• Cloaking for multiprocessors: prediction in the coherence protocol.
• Slice prediction & slice processors: use the program to predict its own actions.
  - Identify performance-critical computation slices.
  - Run them speculatively in advance.
  - Annotate the program with predictability information.
• Embedded-processor optimizations.
• Multi-media application evolution.
• Self-micro-reconfigurable processors:
  - Run-time extraction of convertible slices.
  - Interfaces with out-of-order execution.

Slide: Future Directions
• Prediction is becoming prevalent: branch prediction, value prediction, dependence prediction, coherence prediction.
• Thus far: perform the computation as specified, but as fast as possible.
• Moving toward: a predict-and-verify model of computation.
  - Q1: What should a processor look like then?
  - Q2: How do we predict accurately?

Slide: Current Predictors
• Treat the program as a black box; observe its outcomes.
• Guess: predictors of this kind will soon be very accurate — say 90% or more.
• How about the remaining 10%? By Amdahl's law, it may come to dominate performance.
• Slice prediction: execute SCOUT threads in parallel.
  1. Predict the performance-critical computation slice.
  2. Execute it as far in advance as possible.

Slide: Current Predictors (cont.)
• Treat the program as a black box: build an approximate representation of the program's function f() from observed "inputs" and "outputs", e.g., (PC1, taken), (PC2, not-taken), (PC1, taken).
• Guess: predictors of this kind will soon be very accurate — say 90% or more.
• How about the remaining 10%? By Amdahl's law, it may come to dominate performance.

Slide: Slice Processors
• But the program itself is a representation of f()!
• Use the program to predict its own actions.
Example:
    while (list) {
        if (list->data == KEY) ...
        else ...
        list = list->next;
    }
[Figure: two slices run ahead of the main thread over time — foo0(list) computes "list->data == KEY" to predict the then/else branch, and foo1(list) computes "list = list->next" to predict the next traversal step.]