Outline for Today
• Objective: physical page placement matters
  – Power-aware memory
  – Superpages
• Announcements
  – Deadline extended (wrong kernel was our fault)

Memory System Power Consumption
• Laptop power budget (9 W processor): memory is a small percentage of the total power budget
• Handheld power budget (1 W processor): with a low-power processor, memory is relatively more important
(figure: laptop vs. handheld power budgets split among processor, memory, and other components)

Opportunity: Power-Aware DRAM
• Multiple power states
  – Fast access, high power
  – Low power, slow access
• A new take on the memory hierarchy
• How to exploit the opportunity? (a power-state table sketch follows this part)
(figure: Rambus RDRAM power states, with latencies to return to read/write transactions: Active 300 mW; Standby 180 mW, +6 ns; Nap 30 mW, +60 ns; Power Down 3 mW, +6000 ns)

RDRAM as a Memory Hierarchy
• Each chip can be independently put into the appropriate power mode
• The number of chips at each “level” of the hierarchy can vary dynamically
• Policy choices
  – initial page placement in an “appropriate” chip
  – dynamic movement of a page from one chip to another
  – transitioning the power state of the chip containing a page

RAMBUS RDRAM Main Memory Design
• A single RDRAM chip provides high bandwidth per access
  – Novel signaling scheme transfers multiple bits on one wire
  – Many internal banks: many requests to one chip
• Energy implication: activate only one chip to perform an access at the same high bandwidth as a conventional design
(figure: CPU/$ reads part of a cache block from Chip 0 (Active) while Chips 1–3 sit in Standby or Power Down)

Conventional Main Memory Design
• Multiple DRAM chips provide high bandwidth per access
  – Wide bus to the processor
  – Few internal banks
• Energy implication: must activate all those chips to perform an access at high bandwidth
(figure: CPU/$ reads part of a cache block striped across Chips 0–3, all Active)

Exploiting the Opportunity
• Interaction between the power state model and access locality
• How to manage the power state transitions?
  – Memory controller policies
  – Quantify the benefits of power states
• What role does software have?
  – Energy impact of the allocation of data/text to memory
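To make the RDRAM "memory hierarchy" framing concrete, here is a minimal C sketch of the power-state model, using the per-chip numbers quoted above. The enum, struct, and field names are illustrative, not from any real controller interface.

```c
#include <stdio.h>

/* Sketch of the RDRAM power-state hierarchy: each deeper state trades
 * lower power for a longer transition back to Active. Values are the
 * per-chip figures from the slide above. */
enum rdram_state { ACTIVE, STANDBY, NAP, POWERDOWN, NUM_STATES };

struct state_info {
    const char *name;
    double power_mw;   /* power drawn while sitting in this state */
    double resync_ns;  /* latency to transition back to Active */
};

static const struct state_info states[NUM_STATES] = {
    [ACTIVE]    = { "Active",     300.0,    0.0 },
    [STANDBY]   = { "Standby",    180.0,    6.0 },
    [NAP]       = { "Nap",         30.0,   60.0 },
    [POWERDOWN] = { "Power Down",   3.0, 6000.0 },
};

int main(void) {
    /* Deeper states save roughly 2x-100x power but cost 10x-1000x
     * more latency to wake: the essence of the hierarchy. */
    for (int s = 0; s < NUM_STATES; s++)
        printf("%-10s %7.1f mW  +%g ns to Active\n",
               states[s].name, states[s].power_mw, states[s].resync_ns);
    return 0;
}
```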
Power-Aware DRAM Main Memory Design
• Properties of PA-DRAM allow us to access and control each chip individually
• Two dimensions affect energy policy: the HW controller and the OS
• Energy strategy
  – Cluster accesses onto already powered-up chips
  – Exploit the interaction between power state transitions and data locality
(figure: the OS performs page mapping/allocation in software; per-chip hardware controllers put Chip 0 … Chip n-1 into Active, Standby, or Power Down)

Power State Transitioning: Ideal Case
After the last request in a run completes, a gap opens before the next run of requests arrives. In the ideal case, assume we want no added latency, so the chip must be back in the high-power state exactly when the next request arrives. Transitioning down is worthwhile when

  (t_h->l + t_l->h + t_benefit) * p_high  >  t_h->l * p_h->l + t_l->h * p_l->h + t_benefit * p_low

where p_high and p_low are the powers of the high and low states, t_h->l and t_l->h are the downward and upward transition times, p_h->l and p_l->h are the powers drawn during those transitions, and t_benefit is the time actually spent in the low-power state. The left side is the constant cost of simply staying in the high-power state for the same interval. (A worked break-even sketch follows this part.)

Power State Transitioning: On Demand
• Transition down immediately after the last request completes; transition back up when the next request arrives
• Adds the latency of the transition back up to the first request of the next run

Power State Transitioning: Threshold Based
• Wait a threshold period of idleness before transitioning down
• Delays the transition down, but avoids paying transition costs on short gaps

Page Allocation Policies: Virtual-to-Physical Page Mapping
• Random allocation: the baseline policy
  – Pages spread across chips
• Sequential first-touch allocation
  – Consolidate pages into a minimal number of chips
  – One shot
• Frequency-based allocation
  – First-touch is not always best
  – Allow (limited) movement after first touch

Power-Aware Virtual Memory Based on Context Switches
Huang, Pillai, Shin, “Design and Implementation of Power-Aware Virtual Memory”, USENIX 2003.
• Power state transitions are under SW control (not a HW controller)
• Memory is treated explicitly as a hierarchy: a process’s active set of nodes is kept in the higher power state
• The size of the active node set is kept small by grouping a process’s pages together in nodes: its “energy footprint”
  – Page mapping is viewed as a NUMA layer for implementation
  – The active set of pages, a_i, is put on preferred nodes, r_i
• At context switch time, hide the latency of transitioning
  – Transition the union of the active sets of the next-to-run process and the likely next-after-that process from nap to standby (pre-charging)
  – Overlap transitions with other context switch overhead

Power-Aware DRAM Main Memory Design (PAVM view)
• Cluster accesses onto each process’s preferred memory nodes
• OS-triggered power state transitions on context switch
(figure: the same per-chip controller diagram, chips now in Active, Standby, and Nap)

Rambus RDRAM Power States (PAVM numbers)
(figure: Active 313 mW; Standby 225 mW; Nap 11 mW; Power Down 7 mW; transition latencies back toward read/write transactions of +3 ns, +20 ns, +225 ns, and +22510 ns)

RDRAM Active Components

  State     Refresh   Clock   Row decoder   Col decoder
  Active       X        X          X             X
  Standby      X        X          X
  Nap          X        X
  Pwrdn        X

Determining Active Nodes
• A node is active iff at least one page from the node is mapped into process i’s address space
• A per-process table of page counts per node is maintained whenever a page is mapped or unmapped in the kernel (a bookkeeping sketch follows this part)
(figure: a count table with nodes n0 … n15 as columns and processes p0 … pn as rows, e.g., 108, 2, …, 17)
• Alternatives were rejected due to overhead
  – Extra page faults
  – Page table scans
• The overhead is only one increment/decrement per mapping/unmapping operation

Implementation Details
Problem: DLLs and files shared by multiple processes (the buffer cache) become scattered all over memory under a straightforward assignment of incoming pages to each process’s active nodes, yielding large energy footprints after all.
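The ideal-case inequality above can be solved for t_benefit to get the minimum gap for which a downward transition pays off. This is a minimal sketch of that rearrangement; all parameter values in main() are hypothetical examples, including the assumption that transitions themselves draw 300 mW.

```c
#include <stdio.h>

/* Break-even analysis from the ideal case: rearranging
 *   (t_hl + t_lh + t_b) * p_high > t_hl*p_hl + t_lh*p_lh + t_b*p_low
 * for t_b, the minimum time that must be spent in the low-power state. */
static double min_beneficial_gap_ns(double p_high_mw, double p_low_mw,
                                    double p_hl_mw, double p_lh_mw,
                                    double t_hl_ns, double t_lh_ns) {
    return (t_hl_ns * (p_hl_mw - p_high_mw) +
            t_lh_ns * (p_lh_mw - p_high_mw)) / (p_high_mw - p_low_mw);
}

int main(void) {
    /* Hypothetical example: Standby (180 mW) -> Nap (30 mW), 60 ns
     * transitions each way, transitions assumed to draw 300 mW. */
    double t = min_beneficial_gap_ns(180.0, 30.0, 300.0, 300.0, 60.0, 60.0);
    printf("transition pays off if time in Nap exceeds %.1f ns\n",
           t > 0 ? t : 0);
    return 0;
}
```

With these numbers the break-even point is 96 ns: if the transitions drew no more than the high-state power, any nonzero gap would pay off.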
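The PAVM active-node bookkeeping described above reduces to one counter per (process, node) pair, touched once per map/unmap. A sketch under that reading follows; the struct and function names are invented for illustration and are not the paper's code.

```c
#include <stdbool.h>
#include <stdio.h>

#define NODES 16  /* matches the 16-node evaluation platform below */

struct proc_nodes {
    unsigned pages[NODES];  /* pages of this process resident on each node */
};

/* The only overhead: one increment or decrement per mapping operation. */
static void on_map(struct proc_nodes *p, int node)   { p->pages[node]++; }
static void on_unmap(struct proc_nodes *p, int node) { p->pages[node]--; }

/* A node is active for a process iff at least one of its pages lives there. */
static bool node_active(const struct proc_nodes *p, int node) {
    return p->pages[node] > 0;
}

int main(void) {
    struct proc_nodes p0 = { {0} };
    on_map(&p0, 0); on_map(&p0, 0); on_map(&p0, 3);
    on_unmap(&p0, 3);  /* node 3 drops back to inactive */
    for (int n = 0; n < NODES; n++)
        if (node_active(&p0, n))
            printf("node %d active (%u pages)\n", n, p0.pages[n]);
    return 0;
}
```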
Implementation Details: Solutions
• DLL aggregation
  – Special-case DLLs by allocating them sequential first-touch in low-numbered nodes
• Migration
  – A kernel thread, kmigrated, runs in the background when the system is idle (waking up every 3 s)
  – It scans the pages used by each process, migrating a page if it is
    • a private page not on the process’s preferred nodes r_i, or
    • a shared page outside the aggregated low-numbered nodes

Evaluation Methodology
• Linux implementation
• Measurements/counts are taken of events; energy results are calculated (not measured)
• Metric: energy used by memory (only)
• Workloads, three mixes: light (editing, browsing, MP3), poweruser (light + kernel compile), multimedia (playing an MPEG movie)
• Platform: 16 nodes, 512 MB of RDRAM
• Not considered: DMA and kernel maintenance threads

Results
• Base: standby when not accessing
• On/Off: nap when the system is idle
• PAVM
• PAVMr1: PAVM + DLL aggregation
• PAVMr2: PAVM + both DLL aggregation and migration

Conclusions
• Multiprogramming environment
• Basic PAVM: saves 34–89% of the energy of 16-node RDRAM
• With the optimizations: an additional 20–50%
• Works with other kinds of power-aware memory devices

Discussion: What about page replacement policies? Should (or how could) they be power-aware?

Related Work
• Lebeck et al., ASPLOS 2000: dynamic hardware controller policies and page placement
• Fan et al.: ISLPED 2001, PACS 2002
• Delaluz et al., DAC 2002

Dual-State HW Power State Policies
• All chips sit in one base state (Standby, Nap, or Powerdown)
• An individual chip goes Active while it has pending requests
• It returns to the base power state once no access is pending
(figure: a chip’s state over time, rising to Active on each access and falling back to the base state)

Quad-State HW Policies
• Downgrade a chip’s state if it sees no access for a threshold time: Active to Standby after T_a-s, Standby to Nap after T_s-n, Nap to PDN after T_n-p (a threshold-controller sketch follows this part)
• Transitions are independent, based on the access pattern to each chip
• Competitive analysis: rent-to-buy
  – Active to Nap: hundreds of ns
  – Nap to PDN: 10,000 ns
(figure: a chip’s state stepping down through Active, STBY, Nap, PDN between accesses)

Page Allocation Policies (recap): random, sequential first-touch, and frequency-based, as above (a first-touch sketch follows this part).

Summary of Results (Energy*Delay product, RDRAM, ASPLOS ’00)

                   Random Allocation                 Sequential Allocation
  Dual-state HW    Nap is the best dual-state        An additional 10% to 30%
                   policy: 60%–85%                   over Nap
  Quad-state HW    Improvement not obvious;          Best approach: 6% to 55% over
                   could be equal to dual-state      dual-nap-sequential, 80% to 99%
                                                     over all-Active
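A sketch of the quad-state threshold policy follows, interpreting the thresholds as cumulative idle time since the last access. The threshold values and the per-tick evaluation are illustrative assumptions, not tuned or specified numbers.

```c
#include <stdio.h>

/* Quad-state, threshold-based controller: each chip independently
 * downgrades one level after a period with no accesses. */
enum state { ACTIVE, STBY, NAP, PDN };

struct chip {
    enum state st;
    unsigned long long last_access_ns;
};

/* Hypothetical thresholds (ns of idleness before each downgrade). */
static const unsigned long long T_A_S = 100;    /* Active -> Standby */
static const unsigned long long T_S_N = 1000;   /* Standby -> Nap */
static const unsigned long long T_N_P = 10000;  /* Nap -> Powerdown */

static void on_access(struct chip *c, unsigned long long now_ns) {
    c->st = ACTIVE;  /* any access brings the chip fully up */
    c->last_access_ns = now_ns;
}

static void on_tick(struct chip *c, unsigned long long now_ns) {
    unsigned long long idle = now_ns - c->last_access_ns;
    if (c->st == ACTIVE && idle >= T_A_S)                  c->st = STBY;
    else if (c->st == STBY && idle >= T_A_S + T_S_N)       c->st = NAP;
    else if (c->st == NAP && idle >= T_A_S + T_S_N + T_N_P) c->st = PDN;
}

int main(void) {
    struct chip c = { ACTIVE, 0 };
    on_access(&c, 0);
    for (unsigned long long t = 0; t <= 20000; t += 500) {
        on_tick(&c, t);
        if (t % 5000 == 0)
            printf("t=%llu ns state=%d\n", t, (int)c.st);
    }
    return 0;
}
```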
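Sequential first-touch allocation is simple enough to sketch directly: on a page's first touch, place it on the lowest-numbered chip with free frames, so pages consolidate into as few chips as possible. Chip and frame counts here are illustrative.

```c
#include <stdio.h>

#define CHIPS           16
#define FRAMES_PER_CHIP 4096

static unsigned used[CHIPS];

/* Returns the chip chosen for a newly touched page, or -1 if full.
 * "One shot": the decision is made once, at first touch. */
static int first_touch_place(void) {
    for (int c = 0; c < CHIPS; c++) {
        if (used[c] < FRAMES_PER_CHIP) {
            used[c]++;
            return c;
        }
    }
    return -1;
}

int main(void) {
    /* Touch 10000 pages: they fill chip 0, spill into chip 1, etc.,
     * leaving the remaining chips free to stay powered down. */
    for (int i = 0; i < 10000; i++)
        first_touch_place();
    for (int c = 0; c < CHIPS; c++)
        if (used[c]) printf("chip %d: %u pages\n", c, used[c]);
    return 0;
}
```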
(figure: energy results for the 2-state vs. 4-state model)

OS Support for Superpages
Juan Navarro, Sitaram Iyer, Peter Druschel, Alan Cox; OSDI 2002.
• Increasing cost of TLB miss overhead
  – growing working sets
  – TLB size does not grow at the same pace
• Processors now provide superpages
  – one TLB entry can map a large region
• OSes have been slow to harness them
  – no transparent superpage support for apps
• Proposed: a practical and transparent solution for supporting superpages

Translation Look-aside Buffer
• The TLB caches virtual-to-physical address translations
• TLB coverage
  – the amount of memory mapped by the TLB
  – i.e., the amount of memory that can be accessed without TLB misses

TLB Coverage Trend
(figure: TLB coverage as a percentage of main memory, 1985–2000, on a log scale from 10% down to 0.001%: a factor-of-1000 decrease in 15 years; TLB miss overhead grows from about 5%, to 5–10%, up to 30%)

How to Increase TLB Coverage
• Typical TLB coverage: about 1 MB
• Use superpages! Mix large and small pages
• Increases TLB coverage
  – with no increase in TLB size
  – and no internal fragmentation
• If only large pages were used: larger working sets, more I/O
(a coverage back-of-the-envelope calculation follows this part)

What Are These Superpages Anyway?
• Memory pages of larger sizes
  – supported by most modern CPUs
  – Alpha: 8, 64, 512 KB; 4 MB. Itanium: 4, 8, 16, 64, 256 KB; 1, 4, 16, 64, 256 MB
• Otherwise, the same as normal pages
  – power-of-2 size
  – use only one TLB entry
  – contiguous
  – aligned (physically and virtually)
  – uniform protection attributes
  – one reference bit, one dirty bit
(figure: a TLB holding a base-page entry (size 1) and a superpage entry (size 4), mapping virtual addresses to physical memory)

The Superpage Problem
Issue 1: superpage allocation
(figure: pages A, B, C, D laid out within superpage boundaries in virtual memory but scattered across physical memory)
• How / when / what size to allocate?

Issue 2: promotion
• Promotion: create a superpage out of a set of smaller pages
  – mark the page table entry of each base page
• When to promote? (a promotion-check sketch follows this part)
  – Wait for the app to touch the pages? May lose the opportunity to increase TLB coverage.
  – Create a small superpage? May waste overhead.
  – Forcibly populate the pages? May cause internal fragmentation.

Issue 3: demotion
• Demotion: convert a superpage into smaller pages
  – when the page attributes of a superpage’s base pages become non-uniform
  – during partial pageouts

Issue 4: fragmentation
• Memory becomes fragmented due to
  – the use of multiple page sizes
  – the persistence of file cache pages
  – scattered wired (non-pageable) pages
• Contiguity: a contended resource
• The OS must
  – use contiguity restoration techniques
  – trade off the impact of contiguity restoration against superpage benefits

Design: Key Observation
Once an application touches the first page of a memory object, it is likely to quickly touch every page of that object.
• Example: array initialization
• Opportunistic policies
  – make superpages as large, and as soon, as possible
  – as long as there is no penalty for a wrong decision

Superpage Allocation: Preemptible Reservations
(figure: pages A–D within superpage boundaries in virtual memory; frames are reserved in physical memory ahead of use)
• How much do we reserve? Goal: good TLB coverage, without internal fragmentation.
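The coverage numbers above follow from simple arithmetic: coverage = TLB entries × page size. A quick check against the 128-entry DTLB of the evaluation platform below, where the 4 MB figure assumes every entry maps a superpage:

```c
#include <stdio.h>

/* Back-of-the-envelope TLB coverage: entries x page size.
 * 128 entries x 8 KB base pages gives the ~1 MB quoted above;
 * 128 entries x 4 MB superpages gives 512 MB. */
int main(void) {
    const unsigned entries = 128;
    const unsigned long long sizes[] = { 8ULL << 10, 4ULL << 20 };
    for (int i = 0; i < 2; i++)
        printf("page size %llu KB -> coverage %llu MB\n",
               sizes[i] >> 10, (entries * sizes[i]) >> 20);
    return 0;
}
```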
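Returning to Issue 2: the structural constraints a superpage must satisfy (power-of-2 size, alignment, contiguity, uniform protection) can be expressed as a promotion-eligibility check. This is a sketch; the page-table representation is invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BASE_PAGE 8192  /* 8 KB base pages, as on the Alpha */

struct pte { uintptr_t pfn; unsigned prot; bool valid; };

/* A run of n base pages at virtual address vaddr may be promoted only if
 * it satisfies the superpage constraints listed above. */
static bool can_promote(uintptr_t vaddr, const struct pte *ptes, size_t n) {
    size_t bytes = n * BASE_PAGE;
    if (n == 0 || (n & (n - 1)) != 0) return false;           /* power of 2 */
    if (vaddr % bytes != 0) return false;                     /* VA aligned */
    if ((ptes[0].pfn * BASE_PAGE) % bytes != 0) return false; /* PA aligned */
    for (size_t i = 0; i < n; i++) {
        if (!ptes[i].valid) return false;                     /* populated */
        if (ptes[i].pfn != ptes[0].pfn + i) return false;     /* contiguous */
        if (ptes[i].prot != ptes[0].prot) return false;       /* uniform */
    }
    return true;
}

int main(void) {
    /* Eight contiguous, aligned, uniformly protected 8 KB pages:
     * promotable to one 64 KB superpage. */
    struct pte run[8];
    for (int i = 0; i < 8; i++)
        run[i] = (struct pte){ .pfn = 1024 + i, .prot = 3, .valid = true };
    printf("promotable: %s\n", can_promote(0x200000, run, 8) ? "yes" : "no");
    return 0;
}
```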
Allocation: Reservation Size
Opportunistic policy (a sketch follows at the end of this part):
• Go for the biggest size that is no larger than the memory object (e.g., the file)
• If that size is not available, try preemption before resigning to a smaller size
  – the preempted reservation had its chance

Allocation: Managing Reservations
(figure: reservation lists indexed by the largest unused (and aligned) chunk: 4, 2, 1)
• Best candidate for preemption at the front of each list: the reservation whose most recently populated frame was populated least recently

Incremental Promotions
• Promotion policy: opportunistic
(figure: as a region fills, promote a 2, then a 4, then 4+2, then an 8)

Speculative Demotions
• One reference bit per superpage
  – how do we detect portions of a superpage that are no longer referenced?
• On memory pressure, demote superpages when resetting the reference bit
• Re-promote (incrementally) as pages are referenced again

Demotions: Dirty Superpages
• One dirty bit per superpage
  – what’s dirty and what’s not?
  – must page out the entire superpage
• Demote on the first write to a clean superpage (a demotion sketch follows at the end of this part)
• Re-promote (incrementally) as other pages are dirtied

Fragmentation Control
• Low contiguity: a modified page daemon
  – restores contiguity
    • moves clean, inactive pages to the free list
  – minimizes impact
    • prefers pages that contribute the most to contiguity
    • keeps page contents for as long as possible (even when a page is part of a reservation: if it is reactivated, break the reservation)
• Cluster wired pages

Experimental Evaluation
Experimental setup:
• FreeBSD 4.3
• Alpha 21264, 500 MHz, 512 MB RAM
• 8 KB, 64 KB, 512 KB, 4 MB pages
• 128-entry DTLB, 128-entry ITLB
• Unmodified applications

Best-Case Benefits
• TLB miss reduction usually above 95%
• SPEC CPU2000 integer: 11.2% improvement (0% to 38%)
• SPEC CPU2000 floating point: 11.0% improvement (-1.5% to 83%)
• Other benchmarks
  – FFT (200^3 matrix): 55%
  – 1000x1000 matrix transpose: 655%
• 30%+ improvement in 8 out of 35 benchmarks

Why Multiple Superpage Sizes
Improvements with only one superpage size vs. all sizes:

            64 KB   512 KB   4 MB    All
  FFT        1%       0%     55%     55%
  galgel    28%      28%      1%     29%
  mcf       24%      31%     22%     68%

Conclusions
• Superpages: 30%+ improvement
  – transparently realized; low overhead
• Contiguity restoration is necessary
  – sustains the benefits; low impact
• Multiple page sizes are important
  – scales to very large superpages
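A minimal sketch of the opportunistic reservation-size policy: pick the largest supported superpage size that does not exceed the memory object. The sizes match the Alpha configuration above; availability checking (including preemption) is stubbed out as a hypothetical helper.

```c
#include <stdio.h>

/* Supported sizes on the evaluation platform, largest first. */
static const unsigned long long sizes[] =
    { 4ULL << 20, 512ULL << 10, 64ULL << 10, 8ULL << 10 };

/* Hypothetical stand-in: is a free (or preemptible) aligned chunk of
 * this size available? A real allocator would consult its free lists. */
static int chunk_available(unsigned long long size) { (void)size; return 1; }

static unsigned long long pick_reservation(unsigned long long object_bytes) {
    for (int i = 0; i < 4; i++)
        if (sizes[i] <= object_bytes && chunk_available(sizes[i]))
            return sizes[i];  /* biggest size not exceeding the object */
    return sizes[3];          /* fall back to a base page */
}

int main(void) {
    printf("1 MB file  -> reserve %llu KB\n",
           pick_reservation(1ULL << 20) >> 10);   /* 512 KB */
    printf("16 MB heap -> reserve %llu KB\n",
           pick_reservation(16ULL << 20) >> 10);  /* 4096 KB */
    return 0;
}
```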
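And a sketch of the dirty-superpage policy: since a superpage carries one dirty bit, the first write to a clean superpage demotes it and dirties only the written base page, so a partial write never forces paging out the whole superpage. The data structures are invented for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

#define BASE_PAGES 8  /* base pages per superpage in this example */

struct region {
    bool is_superpage;
    bool dirty[BASE_PAGES];  /* per-base-page dirty bits after demotion */
};

static void on_write_fault(struct region *r, int page) {
    if (r->is_superpage) {
        r->is_superpage = false;  /* demote: drop the superpage mapping */
        /* base pages inherit "clean"; only the faulting one gets dirtied */
    }
    r->dirty[page] = true;
    /* incremental re-promotion would apply once all pages are dirty */
}

int main(void) {
    struct region r = { .is_superpage = true };
    on_write_fault(&r, 3);  /* first write to a clean superpage */
    printf("superpage=%d dirty[3]=%d dirty[0]=%d\n",
           (int)r.is_superpage, (int)r.dirty[3], (int)r.dirty[0]);
    return 0;
}
```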