
Memory Dependence Prediction
Andreas Moshovos
[email protected]
www.ece.nwu.edu/~moshovos
Electrical And Computer Engineering Department
Northwestern University
ECE Dept., Purdue University, Sept. 9, 1999
Exploiting Program Behavior
Building faster/better machines:
1. Faster Circuits -> Faster Execution
2. Be “smarter” about program execution:
Exploit Idiosyncrasies in Program Behavior
Example? Caches
Idiosyncrasies in access pattern
Idea: Use small/fast storage for recently accessed data.
What to do next? One Possibility is:
Identify/Exploit Other Idiosyncrasies in Typical Program
Behavior
This might be the right time! Technology and Existing Knowledge.
A. Moshovos
Memory Dependence Prediction
Permission to use these slides is granted provided that a reference to their origin is included.
2
Memory Dependences are Quite Regular
• Identified a new form of regularity:
for i
a[i] = a[i - 1] + 1
Memory Dependence Locality
1. Does a load/store have a dependence?
2. Which dependence does a load/store have?
[Diagram: store-to-load dependence edges recurring over time, each identified by its (loadPC, storePC) pair]
How to exploit? Need Techniques
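The recurring (loadPC, storePC) pair in the a[i] = a[i - 1] + 1 loop above can be demonstrated with a small sketch (the trace format and PCs are hypothetical; this is an illustration, not the talk's methodology):

```python
# Sketch: measuring memory dependence locality on a toy trace.
# Each entry is (pc, op, addr); the format and PCs are made up.
trace = []
for i in range(1, 100):                    # for i: a[i] = a[i - 1] + 1
    trace.append((0x10, "load", i - 1))    # load  a[i-1]
    trace.append((0x14, "store", i))       # store a[i]

last_writer = {}   # address -> PC of the store that last wrote it
last_dep = {}      # load PC -> store PC of its previous dependence
hits = total = 0
for pc, op, addr in trace:
    if op == "store":
        last_writer[addr] = pc
    else:
        dep = last_writer.get(addr)
        if dep is not None:                # this load has a dependence
            total += 1
            if last_dep.get(pc) == dep:    # same (loadPC, storePC) pair as last time?
                hits += 1
            last_dep[pc] = dep
print(f"dependence locality: {hits}/{total}")   # prints: dependence locality: 97/98
```

Every dynamic instance of the load depends on the same static store, so after the first observation the (loadPC, storePC) pair repeats for the rest of the run.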
Memory Dependence Prediction
GOAL: guess whether a load/store has a dependence, and which dependence it has.
How? Past Behavior -> Good Indicator of Future Behavior:
1. Observe store-load dependences as they occur.
2. Predict that the same dependences will recur.
Basis for Three Micro-Architectural Techniques:
1. Exploit Load/Store Parallelism
2. Reduce Memory Latency
3. Memory Bandwidth Amplification
Memory is a major and ever-increasing bottleneck.
Roadmap
• Exploiting Load/Store Parallelism
- Two directions for improving memory performance
- Load/Store Parallelism
- Speculation/Synchronization
- Performance
• Speculative Memory Cloaking and Bypassing
• Transient Value Cache
• Other Applications
• Future Directions: Slice Prediction & Slice-Processors
In Search of Higher Performance #1
[Diagram: timeline of a load in an instruction sequence: the load is sent to memory, memory returns the value, and the dependent computation starts. Goal: minimize memory response time.]
1. Higher Performance: Memory Responds Faster
#1. Making Memory Respond Faster
Ideally, memory would be both large and fast. We cannot have both: technology/cost trade-off.
Solution: Memory Hierarchy
[Diagram: L1 -> L2 -> Main Memory; lower levels are slower, larger, and cheaper]
OK, we did the best we could, but ...
...memory is still not that fast
...and it is getting slower
Is this the end?
In Search of Higher Performance #2
[Diagram: original vs. modified instruction sequence on a timeline. The load is hoisted above its old position so the request is sent earlier; goal: maximize the overlap between sending the load and the computation that needs its value.]
2. Higher Performance:
Send Load Request As Far In Advance As Possible
But, will the program still run correctly?
A. Can we ever move loads up?
B. How do we do it?
A. Can We Ever Move Loads Up?
Instruction sequence: an instruction A followed by a load.
Dependence:  r1 = r2 + 10, then load ..., [r1 + 5]: the only valid execution order is A before the load.
Parallelism: r1 = r2 + 10, then load ..., [r3 + 5]: A and the load may execute in either order.
A and load can execute in any order
if load does not use a value produced by A
B. How To Move Loads Up?
Instruction Level Parallel Processors:
Step #1: grab a chunk of instructions into the Instruction Window.
Step #2: find the dependences (inspect names).
Step #3: execute, overlapping independent loads with the memory latency.
[Diagram: instruction window with dependence edges among the loads]
Use Parallelism to Tolerate Memory Latency
Moving Loads Past Stores
[Timeline: the store's address calculation completes long after the load's. Until the store address is known, whether the load depends on the store is unknown ("???"), so the load cannot execute earlier.]
Use Of Addresses Hinders Parallelism
➔ Limits Us in Our Effort to Tolerate Memory Latency
Naive Memory Dependence Speculation
• Don’t give up, be optimistic, guess no dependences exist
• State-of-the-art in modern processors
[Timeline: the load's dependence on an earlier store is unknown. Guess "no": the load executes as soon as its address is ready. When the store address is computed, the load re-executes if a dependence did exist; the squashed work is the penalty.]
Need to Balance: Gain vs. Penalty
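That balance can be put in back-of-envelope form (the cycle numbers below are illustrative, not measurements from the talk):

```python
# Sketch: gain vs. penalty trade-off for naive dependence speculation.
def naive_benefit(misp_rate, latency_hidden, squash_penalty):
    """Expected cycles saved per speculated load: the latency hidden
    when the guess is right, minus the squash cost when it is wrong."""
    return (1 - misp_rate) * latency_hidden - misp_rate * squash_penalty

print(naive_benefit(0.05, 10, 20))    # positive: rare, cheap squashes -> speculate
print(naive_benefit(0.05, 10, 300))   # negative: with a large penalty, naive guessing loses
```

As memory latency and window depth grow, the squash penalty grows with them, which is why the naive guess stops being good enough.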
Dependence Speculation and Performance
[Diagram: schedules of A, B (store), C, D (load) in program order. With no dependence, speculation lets the load issue into a free slot early and shortens the schedule. With a dependence, a mispeculated load must re-execute and the schedule grows.]
Speculation may affect performance either way.
Balance: Gain vs. Penalty
Penalty: (a) work thrown away, (b) opportunity cost
How About Future Systems?
[Diagram: common-case load timing, today vs. soon]
• Memory will be slower
➔ Need to move loads further up
Guessing Naively:
Penalty becomes significant
Loads — Ideal Behavior:
No Dependence: execute at will
Dependence: Synchronize w/ store
Future Systems: Wish Dependences Were Known
Speculation/Synchronization
Learn from mistakes:
1. Start with Naive Speculation
2. Remember mispeculations and predict that next time the same will happen.
- Q1: Which loads should wait? A1: Loads with dependences.
- Q2: How long? A2: Until the appropriate store signals (synchronize).
This is the best-performing scheme.
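A minimal software sketch of the scheme (the MDPT/MDST names and the full/empty synchronization bit follow the slides; the Python modeling, PCs, and one-entry-per-load simplification are mine):

```python
# Sketch: learn from mispeculations (MDPT), then synchronize loads (MDST).
class MDPT:
    """Memory Dependence Prediction Table: remembers (loadPC, storePC)
    pairs that caused a mispeculation."""
    def __init__(self):
        self.pairs = {}                      # load PC -> store PC
    def record_mispeculation(self, load_pc, store_pc):
        self.pairs[load_pc] = store_pc
    def predict(self, load_pc):
        """Which store should this load wait on, if any?"""
        return self.pairs.get(load_pc)

class MDST:
    """Memory Dependence Synchronization Table: full/empty bits that
    waiting loads test and matching stores set."""
    def __init__(self):
        self.full = {}                       # store PC -> signalled?
    def may_issue(self, store_pc):
        return self.full.get(store_pc, False)
    def store_signals(self, store_pc):
        self.full[store_pc] = True

mdpt, mdst = MDPT(), MDST()
mdpt.record_mispeculation(0x40, 0x30)        # 1. learned from a squash
spc = mdpt.predict(0x40)                     # 2. next time: must synchronize
assert spc == 0x30 and not mdst.may_issue(spc)   # load waits on the sync bit
mdst.store_signals(0x30)                     # 3. the store executes and signals
assert mdst.may_issue(spc)                   # load resumes
```

Loads with no MDPT entry keep speculating naively, so the common no-dependence case pays no cost.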
How it works
Two PC-indexed structures:
- MDPT (Memory Dependence Prediction Table): entries of (LDPC, STPC, prediction bits); predicts which loads have dependences.
- MDST (Memory Dependence Synchronization Table): entries of (LDPC, STPC, full/empty bit); enforces synchronization.
Operation: (1) a mispeculation allocates an MDPT entry for the (load, store) pair; (2) when the pair is next encountered, an MDST entry is allocated and the load waits on its full/empty bit instead of executing; (3) when the store executes it signals the entry, and the load resumes.
Memory Dependence Speculation/Synchronization
Timeline:
A. Predict Dependence
1. The load is predicted: a synchronization bit is allocated.
2. The store is predicted: the load's address is ready, but the load waits on the sync bit.
3. The store executes and signals: the load executes.
B. Predict No Dependence
The load executes as soon as its address is ready; the store later verifies the prediction.
Correct Prediction: loads wait only as long as necessary.
Incorrect: same as naive speculation, or a needless delay.
Speedup over Naive Speculation
[Chart: speedup of speculation/synchronization over naive speculation, 0% to 80%, across the SPEC95 benchmarks (099, 124, 126, 129, 130, 132, 134, 147 integer; 101, 102, 103, 104, 107, 110, 125, 141, 145, 146 floating-point), with an ORACLE bar for comparison. Configurations: CENTRALIZED, 2 STORES/LOAD, 256 entries; SYNONYMS with a 4K-entry MDPT and a 64-entry MDST.]
Evaluation - Superscalar
1. Can loads inspect Store addresses before obtaining a value?
Requires an address-based load/store scheduler (ABS)
2. Is speculation used?
• Speculation a win (naive over no-speculation):
No ABS:
~29% int, ~113% fp
mispeculations are an issue
ABS:
4.6% int, 5.3% fp (0-cycle scheduling latency)
virtually no mispeculations
• Next:
1. Compare “Oracle + no ABS” with “naive + ABS”
2. Show that Speculation/Synchronization on "no ABS" comes close to Oracle
Speculation/Synchronization as an Alternative to ABS
“Oracle + no ABS” vs. “Naive + ABS”
[Chart: speedup over the base case "ABS + no-speculation", roughly -20% to 40%, across the SPEC95 benchmarks, for ORACLE and for NAIVE with 0-, 1-, and 2-cycle scheduling latency.]
Base case: “ABS + no-speculation”
“Oracle + no ABS” close to “ABS + naive”
Potential for Spec./Sync.
“Oracle + no ABS” over Spec/Sync
[Chart: slowdown of Speculation/Synchronization relative to “Oracle + no ABS”, 0% to 3%, across the SPEC95 benchmarks.]
Speculation/Synchronization is on average within 0.93% (int) and 1.001% (fp) of the oracle.
Summary
• Improving Processor/Memory Performance:
move loads up in time
• Addresses Hinder Parallelism
• Memory Dependence Speculation/Synchronization
Predict dependences and move loads up in time
Classify loads in two categories:
1. No Dependence: Execute at will
2. Dependence: Synchronize with specific store
Related Work
• In the context of Dynamically Scheduled Processors
1. J. H. Hesson, J. LeBlanc, S. J. Ciavaglia, IBM Patent,‘97
Store Barrier
2. G. Chrysos, J. Emer (DEC), Superscalar Env., ISCA ‘98
This work: UW Tech. Report, March ‘96, ISCA ‘97
A. Moshovos, S. Breach, T. N. Vijaykumar, G. Sohi
• In the context of Statically Scheduled Processors
1. Static Disambiguation
2. Speculation in Software
• Dynamic Compilation
3. Software/Hardware Hybrids
Roadmap
• Exploiting Load/Store Parallelism
• Speculative Memory Cloaking and Bypassing
- Memory as a communication mechanism
- Speculative Memory Cloaking
- Other uses of Memory
- Evaluation
• Transient Value Cache
• Other Applications
• Future Directions: Slice Prediction & Slice-Processors
Memory as a Communication Mechanism
[Diagram: a store computes an address and deposits its value in memory; a load computes an address and reads the value back. Memory links the two operations through matching addresses.]
Memory can be an inter-operation communication mechanism
Communication Specification: implicit via Addresses
Is this the best way in terms of performance?
Memory Communication Specification - Limitations
Implicit (via addresses): before a load can obtain the store's value it must (1) calculate its address and (2) establish the dependence by comparing against preceding store addresses; both delays sit on the communication path.
Explicit: a direct store-to-load link delivers the value with no such delays.
[Timeline diagram contrasting the implicit, address-based path with an explicit store-load link]
Predicting Memory Dependences
1. Build dependence history with a Dependence Detection Table:
Stores record (store PC, address); loads look up their address ➔ a (store PC, load PC) pair.
2. Use the history to predict forthcoming dependences:
assign a synonym to each detected dependence; use the synonym to locate the value.
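The detection-plus-synonym step can be sketched as follows (the table names follow the slides; the dictionary modeling, PCs, and addresses are illustrative):

```python
# Sketch: dependence detection and synonym assignment for cloaking.
ddt = {}           # Dependence Detection Table: address -> last store PC
synonyms = {}      # (store PC, load PC) -> synonym id
synonym_file = {}  # synonym id -> speculative value
next_syn = 0

def on_store(store_pc, addr):
    ddt[addr] = store_pc            # stores record (store PC, address)

def on_load(load_pc, addr):
    """Loads look up the address, yielding a (store PC, load PC) pair."""
    global next_syn
    store_pc = ddt.get(addr)
    if store_pc is None:
        return None                 # no dependence detected
    pair = (store_pc, load_pc)
    if pair not in synonyms:        # assign a synonym to the dependence
        synonyms[pair] = next_syn
        next_syn += 1
    return synonyms[pair]

on_store(0x30, addr=0x1000)
syn = on_load(0x40, addr=0x1000)    # dependence (0x30, 0x40) detected
# Next occurrence: the store deposits its value under the synonym and the
# dependent load obtains it without computing or comparing addresses.
synonym_file[syn] = 42
assert synonym_file[synonyms[(0x30, 0x40)]] == 42
```

The synonym plays the role of a rename tag: communication becomes dependence-based rather than address-based, and is verified later through the normal memory hierarchy.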
Speculative Memory Cloaking
Dynamically & Transparently convert implicit into explicit
• Dependence prediction ➔ direct store-load or load-load links
• Speculative and has to be eventually verified
Dependence Prediction maps the load PC and store PC to a common synonym.
[Diagram: (1) the store deposits its value into a Synonym File entry (value + full/empty bit), in parallel with the traditional memory hierarchy; (2, 3) the dependent load obtains the speculative value through the synonym, without waiting for its address; (4) the prediction is later verified against the memory hierarchy.]
Speculative Memory Bypassing
Observe: Store and Load are used to just pass values
[Diagram: DEF R1 -> store R1 -> load R2 -> USE R2. With bypassing, USE R2 is renamed to read DEF R1's result tag (TAG1) directly, so the store and load drop out of the value's path.]
• Extends over multiple store-load dependences
• DEF and USE must co-exist in the instruction window
Takes load-store or loads off the communication path
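A sketch of this renaming view of bypassing (register names and tags are illustrative; real hardware rewires rename-table entries, not Python dictionaries):

```python
# Sketch: speculative memory bypassing as rename-table rewiring.
rename = {}                       # architectural register -> producer tag

def write_reg(reg, tag):
    rename[reg] = tag             # DEF: the register now carries this tag

def bypass(store_src, load_dst):
    """Predicted store->load dependence: give the load's destination
    the store's source tag, taking memory off the value's path."""
    rename[load_dst] = rename[store_src]

write_reg("R1", "TAG1")           # DEF R1 produces TAG1
bypass("R1", "R2")                # store R1 / load R2 predicted dependent
assert rename["R2"] == "TAG1"     # USE R2 reads DEF R1's result directly
```

Chaining the same rewiring across several predicted store-load pairs is what lets bypassing extend over multiple dependences, as noted above.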
Speculative Memory Cloaking/Bypassing
Address-based Memory as: Storage vs. Interface
Ask:
1. What is memory used for?
2. How addresses impact the action?
[Diagram: memory as storage vs. as an interface. Inter-operation communication (DEF RX -> STORE RX -> LOAD RY -> USE RY) is served by Cloaking and Bypassing; data sharing (multiple loads, LOAD RY and LOAD RZ, of the same location) is served by Cloaking across loads. Registers replace addresses as the naming mechanism.]
Dynamically Create Direct Links Between Producers/Consumers
Cloaking Coverage
[Chart: cloaking coverage, 0% to 100%, across the SPEC95 benchmarks, broken down into DATA, STACK, and HEAP accesses.]
Most Dependences Correctly Predicted
Cloaking - Mispeculation Rates
[Chart: cloaking mispeculation rates across the SPEC95 benchmarks, broken down into DATA, STACK, and HEAP; roughly 0% to 6% for the integer codes and 0% to 1% for the floating-point codes.]
• Relatively low mispeculation rates
• Causes:
Unstable Dependences - Recursive Functions
Performance - Selective Invalidation
[Chart: performance improvement, 0% to 14%, across the SPEC95 benchmarks, comparing Selective invalidation with Oracle verification.]
Summary
• Address-based memory ➔ unnecessary delays
• Used Memory Dependence Prediction to:
1. Improve Store-Load and Load-Load Communication
address-based ➔ dependence-based
2. Take Store-Load off the Communication path
Treat the program as a specification of what to do,
not also of how to do it.
Increasing Memory Bandwidth
• Parallelism ➔ more bandwidth ➔ more L1 ports
[Diagram: CPU with a single-ported L1 vs. CPU with a multi-ported L1]
• A very small cache is easier to multi-port
[Diagram: CPU -> small L0 -> L1]
• But, it increases the latency for all accesses that miss in it
Transient Value Cache
• Support for multiple, simultaneous load/store requests.
Observe:
A. A RAW or RAR dependence with a recent store or load often exists.
B. Many recent stores are killed.
+ A small cache can service these accesses.
- But latency for all other loads would increase.
C. A & B, i.e., the dependence status, is predictable.
Policy: if a dependence is predicted, access the TVC in series and avoid the L1 access; if not, access the TVC and L1 in parallel and avoid the TVC latency.
L1 DCache Bandwidth/Port Requirements are Reduced
Few Loads Observe a Latency Increase
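The series/parallel policy can be sketched with a toy timing model (the latencies and interface are hypothetical, chosen only to show the trade-off):

```python
# Sketch: TVC access policy driven by dependence prediction.
TVC_LAT, L1_LAT = 1, 3            # illustrative cycle counts

def load_latency(dependence_predicted, tvc_hit):
    if dependence_predicted:
        # Series: try the small TVC first; on a hit, no L1 port is used.
        return TVC_LAT if tvc_hit else TVC_LAT + L1_LAT
    # Parallel: L1 is accessed immediately, so no TVC latency is added.
    return L1_LAT

assert load_latency(True, tvc_hit=True) == 1    # recent RAW/RAR: fast, saves an L1 port
assert load_latency(False, tvc_hit=False) == 3  # other loads see no slowdown
```

Only loads whose dependence was mispredicted (series access, TVC miss) pay the extra TVC latency, which is why few loads observe an increase.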
TVC vs. L0 - Hit Rates
[Chart: TVC vs. L0 hit rates, 0% to 100%, across the SPEC95 benchmarks.]
TVC vs. L0 - “Miss” Rates
[Chart: TVC vs. L0 "miss" rates across the SPEC95 benchmarks; roughly 0% to 70% for the integer codes and 0% to 8% for the floating-point codes.]
Other Applications
• Transient Value Cache:
Bandwidth amplification
• Prefetching of recursive data structures
Load-to-Load-EAC dependences
• Virtual Function Call Precomputation
A surgical extension of the previous technique
Current Research
• Cloaking for Multiprocessors
- Prediction in coherence protocol
• Slice Prediction & Slice Processors
- Use the Program to Predict its actions
Identify Performance critical computation slices
Run them speculatively in advance
Annotate program with predictability information
• Embedded Processor Optimizations
• Multi-media application evolution
• Self-micro-reconfigurable processors
- Run-time extraction of convertible slices
- Interfaces with OOO
Future Directions
• Prediction is becoming prevalent
- Branch Prediction
- Value Prediction
- Dependence Prediction
- Coherence Prediction
Thus Far:
Perform Computation as Specified but as fast as possible
Moving toward:
A predict-and-verify model of computation
Q1: What should a processor look like then?
Q2: How to predict accurately?
Current Predictors
• Treat Program as a black box
• Observe outcomes
• Guess: soon predictors of this kind will be very accurate
say 90% or more
• How about the remaining 10%?
Amdahl’s law -> may dominate performance
• Slice prediction: execute SCOUT threads in parallel
1. Predict performance critical computation slice
2. Execute as far in advance as possible
Current Predictors
• Treat Program as a black box
• Build an approx. representation of program’s function
[Diagram: the predictor approximates the program's function f(): “INPUT” such as (PC1, taken), (PC2, not-taken) ➔ “OUTPUT” such as (PC1, taken)]
• Guess: soon predictors of this kind will be very accurate
say 90% or more
• How about the remaining 10%?
Amdahl’s law -> may dominate performance
Slice Processors
• But program itself is a representation of f()!
• USE THE PROGRAM TO PREDICT ITS ACTIONS
while (list)
    if (list->data == KEY)
        foo0(list)
    else
        foo1(list)
    list = list->next
[Diagram: the branch's slice, "list = list->next; list->data == KEY", is predicted and run ahead in time to precompute each iteration's then/else outcome.]