Milly Watt Project

Outline for Today
• Objective
– Physical Page Placement matters
• Power-aware memory
• Superpages
• Announcements
– Deadline extended (wrong kernel was our fault)
Memory System Power Consumption
[Pie charts: laptop power budget (9 Watt processor) vs. handheld power budget (1 Watt processor), each split into processor, memory, and other]
• Laptop: memory is a small percentage of the total power budget
• Handheld: with a low-power processor, memory is more important
Opportunity: Power-Aware DRAM
• Multiple power states
– Fast access, high power
– Low power, slow access
• New take on the memory hierarchy
• How to exploit the opportunity?
[Figure: Rambus RDRAM power states, with latencies to return to Active –
Active (read/write transactions): 300 mW
Standby: 180 mW, +6 ns
Nap: 30 mW, +60 ns
Power Down: 3 mW, +6000 ns]
RDRAM as a Memory Hierarchy
• Each chip can be independently put into the appropriate power mode
• Number of chips at each “level” of the hierarchy can vary dynamically
[Figure: chips in Active, Active, and Nap states]
• Policy choices
– initial page placement in an “appropriate” chip
– dynamic movement of a page from one chip to another
– transitioning of the power state of the chip containing a page
RAMBUS RDRAM Main Memory Design
[Figure: CPU/$ fetching a cache block from chip 0 (Active); chip 1 is in Standby, chips 2–3 in Power Down]
• A single RDRAM chip provides high bandwidth per access
– Novel signaling scheme transfers multiple bits on one wire
– Many internal banks: many requests to one chip
• Energy implication: activate only one chip to perform an access at the same high bandwidth as a conventional design
Conventional Main Memory Design
[Figure: CPU/$ fetching parts of a cache block from chips 0–3, all Active]
• Multiple DRAM chips provide high bandwidth per access
– Wide bus to processor
– Few internal banks
• Energy implication: must activate all those chips to perform an access at high bandwidth
Exploiting the Opportunity
Interaction between the power state model and access locality
• How to manage the power state transitions?
– Memory controller policies
– Quantify benefits of power states
• What role does software have?
– Energy impact of allocation of data/text to memory
Power-Aware DRAM Main Memory Design
[Figure: CPU/$ with OS-level page mapping and allocation (software control) above per-chip ctrl logic (hardware control); chips 0 to n-1 in Active, Standby, and Power Down states]
• Properties of PA-DRAM allow us to access and control each chip individually
• 2 dimensions to affect energy policy: HW controller / OS
• Energy strategy:
– Cluster accesses to already powered-up chips
– Interaction between power state transitions and data locality
Power State Transitioning
[Figure: power vs. time – after completion of the last request in a run, a gap precedes the next requests; power drops from p_high through the h->l transition to p_low, then returns through the l->h transition to p_high]
Ideal case (assume we want no added latency): transitioning down saves energy when

(t_h->l + t_l->h + t_benefit) * p_high > t_h->l * p_h->l + t_l->h * p_l->h + t_benefit * p_low

The transition terms on the right are constant; t_benefit is the time spent at p_low.
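To make the tradeoff concrete, here is a rough worked example in Python (a sketch: it assumes the chip draws roughly p_high during resynchronization, ignores the cost of the cheap downward transition, and uses the power/latency numbers from the RDRAM power-state slide):

```python
# Break-even gap length for dropping to a low-power state: staying Active
# for a gap g costs g * p_high; dropping costs roughly g * p_low plus a
# resynchronization interval at ~p_high. Transitioning wins when
#   g > t_resync * p_high / (p_high - p_low).

P_HIGH = 300e-3  # Active power in watts (300 mW)

STATES = {                # state: (power in W, resync latency in s)
    "standby":    (180e-3, 6e-9),
    "nap":        (30e-3, 60e-9),
    "power down": (3e-3, 6000e-9),
}

for name, (p_low, t_resync) in STATES.items():
    breakeven = t_resync * P_HIGH / (P_HIGH - p_low)
    print(f"{name:>10}: worthwhile for gaps over ~{breakeven * 1e9:.0f} ns")
```

Under these assumptions, gaps of only tens of nanoseconds already justify Nap, while Power Down needs gaps of roughly 6 µs.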
Power State Transitioning
[Figure: same timeline – the chip drops to p_low right after the last request completes and transitions back up only when the next request arrives]
On-demand case: adds the latency of the transition back up to the next access.
Power State Transitioning
[Figure: same timeline with a threshold interval – the chip stays at p_high for a threshold period after the last request before dropping to p_low]
Threshold-based case: delays the transition down, trading some energy to avoid transitioning on short gaps.
Page Allocation Policies
Virtual to Physical Page Mapping
• Random Allocation – baseline policy
– Pages spread across chips
• Sequential First-Touch Allocation (see the sketch below)
– Consolidate pages into a minimal number of chips
– One shot
• Frequency-based Allocation
– First-touch not always best
– Allow (limited) movement after first-touch
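A minimal sketch of the two simplest policies (hypothetical page/chip sizes; frequency-based allocation would additionally migrate hot pages later):

```python
import random

PAGES_PER_CHIP = 1024  # hypothetical capacity; one chip holds many pages

class RandomAllocator:
    """Baseline: spread pages uniformly across all chips."""
    def __init__(self, num_chips):
        self.num_chips = num_chips
    def place(self, vpage):
        return random.randrange(self.num_chips)  # chip index for this page

class SequentialFirstTouch:
    """Consolidate pages into the minimal number of chips, one shot:
    each newly touched page goes to the lowest-numbered chip with room."""
    def __init__(self, num_chips):
        self.used = [0] * num_chips
    def place(self, vpage):
        for chip, count in enumerate(self.used):
            if count < PAGES_PER_CHIP:
                self.used[chip] += 1
                return chip
        raise MemoryError("out of physical memory")
```

With first-touch, only the low-numbered chips need to stay powered up; the rest can sit in a deep low-power state.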
Power-Aware Virtual Memory
Based On Context Switches
Huang, Pillai, Shin, “Design and Implementation of Power-Aware Virtual Memory”, USENIX 03.
• Power state transitions under SW control (not HW controller)
• Treated explicitly as a memory hierarchy: a process’s active set of nodes is kept in a higher power state
• Size of the active node set is kept small by grouping a process’s pages in nodes together – its “energy footprint”
– Page mapping – viewed as a NUMA layer for implementation
– Active set of pages, ai, put on preferred nodes, ri
• At context switch time, hide the latency of transitioning (see the sketch below)
– Transition the union of the active sets of the next-to-run and likely next-after-that processes from nap to standby (pre-charging)
– Overlap transitions with other context switch overhead
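A minimal sketch of the context-switch hook (hypothetical names and stubs; the real implementation lives in the Linux scheduler path):

```python
ALL_NODES = set(range(16))  # 16 RDRAM nodes, as on the paper's platform

def set_power_state(node, state):
    # Stub: on real hardware this programs the memory controller.
    print(f"node {node} -> {state}")

class Proc:
    def __init__(self, active_nodes):
        self.active_nodes = set(active_nodes)  # the "energy footprint"

def on_context_switch(next_proc, likely_after):
    """Pre-charge the union of the active sets of the next-to-run and
    likely next-after-that processes; let every other node nap."""
    wanted = next_proc.active_nodes | likely_after.active_nodes
    for node in wanted:
        set_power_state(node, "standby")  # resync overlaps switch overhead
    for node in ALL_NODES - wanted:
        set_power_state(node, "nap")

on_context_switch(Proc({0, 1}), Proc({1, 2}))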
Power-Aware DRAM Main Memory Design
[Figure: CPU/$ with OS-level page mapping and allocation (software control) above per-chip ctrl logic; chips 0 to n-1 in Active, Standby, and Nap states]
• Properties of PA-DRAM allow us to access and control each chip individually
• 2 dimensions to affect energy policy: HW controller / OS
• Energy strategy:
– Cluster accesses to preferred memory nodes per process
– OS-triggered power state transitions on context switch
Rambus RDRAM
[Figure: Rambus RDRAM power states (PAVM numbers), with latencies to return to Active –
Active (read/write transactions): 313 mW
Standby: 225 mW, +3 ns
Nap: 11 mW, +225 ns
Power Down: 7 mW, +22510 ns
Entering Nap or Power Down: +20 ns]
RDRAM Active Components

          Refresh   Clock   Row decoder   Col decoder
Active       X        X          X             X
Standby      X        X          X
Nap          X        X
Pwrdn        X
Determining Active Nodes
• A node is active iff at least one page from the node is mapped into process i’s address space
• A table of per-process, per-node page counts is maintained whenever a page is mapped or unmapped in the kernel:

          n0    n1   …   n15
p0       108     2   …    17
…
pn       193   240   …  4322

• Alternatives rejected due to overhead:
– Extra page faults
– Page table scans
• Overhead is only one incr/decr per mapping/unmapping op (see the sketch below)
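A minimal sketch of that table (hypothetical structure; the kernel keeps it per process): one counter per (process, node), and a node is in the process’s active set exactly while its counter is nonzero.

```python
from collections import defaultdict

class ActiveNodeTable:
    """Per-process page counts per memory node; O(1) work per map/unmap."""
    def __init__(self):
        self.count = defaultdict(int)  # (pid, node) -> mapped-page count

    def page_mapped(self, pid, node):
        self.count[(pid, node)] += 1   # one increment per mapping op

    def page_unmapped(self, pid, node):
        self.count[(pid, node)] -= 1   # one decrement per unmapping op

    def active_nodes(self, pid, num_nodes=16):
        return {n for n in range(num_nodes) if self.count[(pid, n)] > 0}

t = ActiveNodeTable()
t.page_mapped(pid=0, node=3)
t.page_mapped(pid=0, node=3)
t.page_unmapped(pid=0, node=3)
print(t.active_nodes(0))  # {3}: one page still mapped on node 3
```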
Implementation Details
Problem: DLLs and files shared by multiple processes (buffer cache) become scattered all over memory with a straightforward assignment of incoming pages to each process’s active nodes – large energy footprints after all.
Implementation Details
Solutions:
• DLL Aggregation
– Special-case DLLs by allocating them sequential first-touch in low-numbered nodes
• Migration
– Kernel thread – kmigrated – runs in the background when the system is idle (waking up every 3 s)
– Scans pages used by each process, migrating when conditions are met (see the sketch below):
• Private page not on the process’s preferred nodes ri
• Shared page outside the union of its sharers’ preferred nodes
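A minimal sketch of the kmigrated loop, under the assumption that the conditions read as above (private pages pulled onto the process’s preferred nodes ri, shared pages onto their sharers’ combined preferred nodes); all helpers and attributes are hypothetical:

```python
import time

SCAN_INTERVAL = 3  # seconds; kmigrated wakes up every 3 s

def union_preferred(procs):
    """Union of the preferred node sets (ri) of a group of processes."""
    nodes = set()
    for p in procs:
        nodes |= p.preferred_nodes
    return nodes

def kmigrated(processes, system_idle, migrate):
    """Background migration sketch. Each process has .preferred_nodes (ri)
    and .pages; each page has .node, .shared, and .sharers."""
    while True:
        time.sleep(SCAN_INTERVAL)
        if not system_idle():
            continue  # only consolidate while the system is idle
        for proc in processes():
            for page in proc.pages:
                if not page.shared and page.node not in proc.preferred_nodes:
                    migrate(page, min(proc.preferred_nodes))
                elif page.shared and page.node not in union_preferred(page.sharers):
                    migrate(page, min(union_preferred(page.sharers)))
```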
Evaluation Methodology
• Linux implementation
• Measurements/counts taken of events; energy results calculated (not measured)
• Metric – energy used by memory (only)
• Workloads – 3 mixes: light (editing, browsing, MP3), poweruser (light + kernel compile), multimedia (playing an MPEG movie)
• Platform – 16 nodes, 512 MB of RDRAM
• Not considered: DMA and kernel maintenance threads
Results
• Base – standby when not accessing
• On/Off – nap when system idle
• PAVM
[Results chart]
Results
• PAVM
• PAVMr1 – DLL aggregation
• PAVMr2 – both DLL aggregation & migration
[Results chart]
Conclusions
• Multiprogramming environment
• Basic PAVM: saves 34-89% of the energy of a 16-node RDRAM
• With optimizations: an additional 20-50%
• Works with other kinds of power-aware memory devices
Discussion: What about page replacement policies? Should (or how could) they be power-aware?
Related Work
• Lebeck et al., ASPLOS 2000 – dynamic hardware controller policies and page placement
• Fan et al.
– ISLPED 2001
– PACS 2002
• Delaluz et al., DAC 2002
Dual-state HW Power State Policies
• All chips in one base state
• Individual chip Active while requests are pending
• Return to base power state if no pending access
[State diagram: Active <-> Standby/Nap/Powerdown – an access makes or keeps a chip Active; with no pending access it returns to the base state. Timeline: power rises to Active on each access and falls back to the base state in between]
(See the sketch below.)
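A minimal sketch (hypothetical interface) of the dual-state policy, which needs no timers: a chip’s state is a pure function of whether it has pending requests.

```python
def dual_state(pending_requests, base_state="nap"):
    """Dual-state HW policy: a chip is Active while it has pending
    requests and drops straight back to the single base state otherwise."""
    return "active" if pending_requests else base_state
```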
Quad-state HW Policies
• Downgrade state if no access for a threshold time
• Independent transitions based on the access pattern to each chip
• Competitive analysis – rent-to-buy
– Active to Nap: 100’s of ns
– Nap to PDN: 10,000 ns
[State diagram: Active -> STBY after no access for Ta-s -> Nap after no access for Ts-n -> PDN after no access for Tn-p; any access returns the chip to Active. Timeline: power steps down through STBY, Nap, PDN between accesses]
(See the sketch below.)
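A minimal sketch of a threshold-based quad-state controller (thresholds named after the slide’s Ta-s, Ts-n, Tn-p, here expressed as cumulative idle time; the values are illustrative, on the rent-to-buy scale quoted above):

```python
# Cumulative idle time (ns) before a chip has downgraded to each state.
T_A_S = 100       # Active -> STBY
T_S_N = 500       # STBY   -> Nap   (Active to Nap: 100's of ns)
T_N_P = 10_000    # Nap    -> PDN   (around 10,000 ns)

def quad_state(idle_ns):
    """Power state of a chip idle for idle_ns; any access resets the idle
    time to 0, returning the chip to Active."""
    if idle_ns >= T_N_P:
        return "pdn"
    if idle_ns >= T_S_N:
        return "nap"
    if idle_ns >= T_A_S:
        return "stby"
    return "active"

for t in (0, 150, 700, 20_000):
    print(t, "->", quad_state(t))  # active, stby, nap, pdn
```

The rent-to-buy framing sets each threshold so the energy spent idling at the higher state is comparable to the cost of the transition it would avoid.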
Page Allocation Policies
Virtual to Physical Page Mapping
• Random Allocation – baseline policy
– Pages spread across chips
• Sequential First-Touch Allocation
– Consolidate pages into a minimal number of chips
– One shot
• Frequency-based Allocation
– First-touch not always best
– Allow (limited) movement after first-touch
Summary of Results
(Energy*Delay product, RDRAM, ASPLOS00)

                        Random Allocation            Sequential Allocation
Dual-state HW           Nap is the best dual-state   Additional 10% to 30%
(2-state model)         policy: 60%-85%              over Nap
Quad-state HW           Improvement not obvious;     Best approach: 6% to 55% over
(4-state model)         could be equal to            dual-nap-seq, 80% to 99% over
                        dual-state                   all active
OS Support for Superpages
Juan Navarro, Sitaram Iyer, Peter Druschel, Alan Cox
OSDI 2002
• Increasing cost of TLB miss overhead
– growing working sets
– TLB size does not grow at the same pace
• Processors now provide superpages
– one TLB entry can map a large region
• OSes have been slow to harness them
– no transparent superpage support for apps
• Proposed: a practical and transparent solution to support superpages
Translation look-aside buffer
• TLB caches virtual-to-physical address translations
• TLB coverage
– amount of memory mapped by the TLB
– amount of memory that can be accessed without TLB misses
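As a worked example (using the 128-entry DTLB and 8 KB base pages of the evaluation platform later in this section): the TLB maps at most 128 * 8 KB = 1 MB, so any working set beyond 1 MB incurs TLB misses regardless of how much main memory is installed.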
TLB coverage trend
[Chart: TLB coverage as a percentage of main memory, log scale from 0.001% to 10.0%, for machines from 1985 to 2000 – a factor of 1000 decrease in 15 years. Annotated TLB miss overheads range from 5% to 5-10% and up to 30%]
How to increase TLB coverage
• Typical TLB coverage ~1 MB
• Use superpages!
– large and small pages
• Increases TLB coverage
– no increase in TLB size
– no internal fragmentation
• If only large pages: larger working sets, more I/O
What are these superpages anyway?
• Memory pages of larger sizes
– supported by most modern CPUs
• Otherwise, same as normal pages
– power-of-2 size
– use only one TLB entry
– contiguous
– aligned (physically and virtually)
– uniform protection attributes
– one reference bit, one dirty bit
A superpage TLB
[Figure: a TLB translating virtual addresses to physical addresses, holding a base page entry (size = 1) and a superpage entry (size = 4)]
Supported superpage sizes –
Alpha: 8, 64, 512 KB; 4 MB
Itanium: 4, 8, 16, 64, 256 KB; 1, 4, 16, 64, 256 MB
The superpage problem
Issue 1: superpage allocation
[Figure: pages A, B, C, D contiguous in virtual memory with superpage boundaries marked; the same pages scattered across physical memory]
How / when / what size to allocate?
Issue 2: promotion
• Promotion: create a superpage out of a set of smaller pages
– mark the page table entry of each base page
• When to promote?
– Wait for the app to touch pages? May lose the opportunity to increase TLB coverage.
– Create small superpages? May waste overhead.
– Forcibly populate pages? May cause internal fragmentation.
Issue 3: demotion
• Demotion: convert a superpage into smaller pages
– when page attributes of the base pages of a superpage become non-uniform
– during partial pageouts
Issue 4: fragmentation
• Memory becomes fragmented due to
– use of multiple page sizes
– persistence of file cache pages
– scattered wired (non-pageable) pages
• Contiguity: a contended resource
• The OS must
– use contiguity restoration techniques
– trade off the impact of contiguity restoration against superpage benefits
Design
Key observation: once an application touches the first page of a memory object, it is likely to quickly touch every page of that object
• Example: array initialization
• Opportunistic policies
– make superpages as large and as soon as possible
– as long as there is no penalty for a wrong decision
Superpage allocation: preemptible reservations
[Figure: pages A, B, C, D in virtual memory with superpage boundaries; in physical memory, A and B land at the start of a reserved, aligned range of frames held for the rest of the superpage]
How much do we reserve? Goal: good TLB coverage, without internal fragmentation.
Allocation: reservation size
Opportunistic policy:
• Go for the biggest size that is no larger than the memory object (e.g., file)
• If that size is not available, try preemption before resigning to a smaller size
– the preempted reservation had its chance
Allocation: managing reservations
[Figure: reservation lists indexed by the size of the largest unused (and aligned) chunk: 4, 2, 1]
Best candidate for preemption at the front: the reservation whose most recently populated frame was populated the least recently (see the sketch below).
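A minimal sketch of that preemption rule (hypothetical Reservation record, assuming each reservation tracks when a frame was last populated): among candidates, evict the one least recently extended.

```python
from dataclasses import dataclass

@dataclass
class Reservation:
    frames: int                     # size of the reserved, aligned range
    last_populated_at: float = 0.0  # time the most recent frame was filled

def pick_preemption_victim(reservations):
    """Preempt the reservation whose most recently populated frame was
    populated the least recently (it 'had its chance' to grow)."""
    return min(reservations, key=lambda r: r.last_populated_at)

rs = [Reservation(4, last_populated_at=10.0),
      Reservation(4, last_populated_at=2.5)]
print(pick_preemption_victim(rs).last_populated_at)  # 2.5: the stale one
```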
Incremental promotions
Promotion policy: opportunistic
[Figure: as base pages are populated, the superpage grows incrementally – 2, then 4, then 4+2, then 8 (see the sketch below)]
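A minimal sketch of the incremental promotion check (hypothetical helper; the real kernel marks page-table entries): each time the next power-of-2-aligned region is fully populated, the superpage doubles.

```python
def promote_incrementally(populated, region_size):
    """Return the largest power-of-2 superpage size whose aligned region
    starting at offset 0 is fully populated. `populated` is the set of
    base-page offsets already touched within the region."""
    size = 1
    while size * 2 <= region_size and all(
            off in populated for off in range(size * 2)):
        size *= 2  # e.g. 2 -> 4 -> 8, as in the figure
    return size

pages = {0, 1, 2, 3}
print(promote_incrementally(pages, 8))  # 4: first half fully populated
pages |= {4, 5, 6, 7}
print(promote_incrementally(pages, 8))  # 8: promote the whole region
```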
Speculative demotions
• One reference bit per superpage
– how do we detect portions of a superpage that are no longer referenced?
• On memory pressure, demote superpages when resetting the ref bit
• Re-promote (incrementally) as pages are referenced
Demotions: dirty superpages
• One dirty bit per superpage
– what’s dirty and what’s not?
– must page out the entire superpage
• Demote on the first write to a clean superpage
• Re-promote (incrementally) as other pages are dirtied
Fragmentation control
• Low contiguity: modified page daemon
– restore contiguity
• move clean, inactive pages to the free list
– minimize impact
• prefer pages that contribute the most to contiguity
• keep contents for as long as possible
(even when part of a reservation: if reactivated, break the reservation)
• Cluster wired pages
Experimental evaluation
Experimental setup:
• FreeBSD 4.3
• Alpha 21264, 500 MHz, 512 MB RAM
• 8 KB, 64 KB, 512 KB, 4 MB pages
• 128-entry DTLB, 128-entry ITLB
• Unmodified applications
Best-case benefits
• TLB miss reduction usually above 95%
• SPEC CPU2000 integer
– 11.2% improvement (0 to 38%)
• SPEC CPU2000 floating point
– 11.0% improvement (-1.5% to 83%)
• Other benchmarks
– FFT (200³ matrix): 55%
– 1000x1000 matrix transpose: 655%
• 30%+ in 8 out of 35 benchmarks
Why multiple superpage sizes
Improvements with only one superpage size vs. all sizes:

          64 KB   512 KB   4 MB   All
FFT         1%      0%     55%    55%
galgel     28%     28%      1%    29%
mcf        24%     31%     22%    68%
Conclusions
• Superpages: 30%+ improvement
– transparently realized; low overhead
• Contiguity restoration is necessary
– sustains benefits; low impact
• Multiple page sizes are important
– scales to very large superpages