Dynamic Compiler-Driven Microprocessor Energy and Performance Optimization
Vijay Janapa Reddi
The University of Texas at Austin

Introduction
Power management is important

Trends Over the Years

Power Background
Where does power go?
» Dynamic power ("switching" power): charging/discharging capacitance on a 0→1 or 1→0 transition consumes power
» Static power ("leakage" power): transistors are not perfect switches
» Short-circuit power: occurs briefly during a transition
Dynamic power dominates, but static power is increasing more as technology scales down.

Dynamic Power
$P_{\text{dynamic}} = \alpha C V^2 f$
[Figure: circuit diagram; labels: iVdd, Vout, CL]
» α: switching factor (e.g. the clock toggles every cycle, so its switching factor is 1)
» C: capacitance, a function of wire length and transistor size
» V: supply voltage
» f: clock frequency

Short-circuit Power
Short-circuit power is still "dynamic" power
Short-circuit current is caused by a finite-slope input signal transition
During the transition, NMOS and PMOS are both conducting
» A short circuit forms between VDD and GND

Leakage Power
Three main sources of leakage
» Subthreshold, gate, and junction
Leakage currents grow exponentially with increases in temperature and decreases in threshold voltage
(EE382M-8 Class Notes, 1/17/2011)

Introduction (2)
Adaptive power management is important
Dynamic voltage and frequency scaling (DVFS) is one effective technique to reduce power
Previous DVFS approaches
» Hardware or OS-interrupt based [Semeraro et al., Micro’02]
» Static compiler based [Hsu et al., PLDI’03]
No prior work on dynamic compiler DVFS

Goal of This Work
Explore power control opportunities in a dynamic compiler
An effective complement to existing techniques
Ultimate goal: a multi-layer (SW &
HW) collaborative control

What is Dynamic Compiler Driven DVFS?
What is a dynamic compiler?
» Software that compiles/optimizes binary code at runtime
[Figure: application binary; dynamic compilation system (performance optimization, DVFS optimization, …); OS and hardware]

Why Dynamic Compiler Driven DVFS?
Advantages over existing approaches

                          High-level program structure   Run-time processor information
Hardware-based            x                              √
Static compiler based     √                              x
Dynamic compiler based    √                              √

More code aware, more adaptive, …
Disadvantages and challenges
» Run-time operation & optimization overhead

Motivational Data (based on Qsort)
Cache miss behavior changes over time
Input-specific behavior at runtime

Contributions of This Work
A new concept
» Dynamic compiler driven DVFS
A design framework
» Implementation of prototype in a real system

Outline of This Talk
Introduction and motivation
Design framework and DVFS decision algorithms
Implementation and deployment in a real system
Experimental results
Conclusions and future work

A Design Framework
Overall operation block diagram
[Figure: block diagram; labels: Start, Monitor, Dynamic Optimizer, Dispatcher, Run-time DVFS Optimizer (RDO), cold code execution, hot code execution (w/ optimization & DVFS), OS and hardware]
Key design issues: Candidate region selection? DVFS decision algorithm? DVFS insertion and code transformation?

A Design Framework (2)
Key design issues
» Candidate code region selection
» DVFS decision algorithm for code regions
» DVFS code insertion and transformation

A DVFS Decision Algorithm
RDO inserts a test module at entry and exit points of a code region
» How and why?
DVFS decision based on test information
» Is it long-running?
» Is it beneficial to apply DVFS?
» What is the optimal DVFS setting?
Key observation:
» OK to slow down the CPU if it is waiting for memory
How to decide the DVFS setting quantitatively?

An Analytical Decision Model
[Figure: memory operation and CPU operation plotted against execution time; labels: Nconcurrent, tasyn_mem, Ndependent, f]

DVFS Setting Computation
[Equation: frequency scaling factor, 0 < … ≤ 1.0; terms: relative memory intensity, relative CPU intensity; f = fmax …]

Implementation in a Real System
Platform: Intel PIN system [Luk et al., PLDI’05]
» O-PIN: optimization version of PIN
Implementation highlights:
» Candidate regions: functions and loops
» Adjustable profiling/optimizing granularity
» JIT (generate) code selectively
» Possibly multiple DVFS code regions
» Fast DVFS decision

Deployment in Hardware
Hardware platform with power measurements
» Intel development board (Pentium-M, 600 ~ 1600MHz)
» OS: Linux 2.4.18
[Figure: measurement setup; labels: voltage/current measurement, noise reduction, data acquisition (DAQ), data logging]

Experimental Setup and Benchmarks
Over 40 benchmarks
» SPEC 2K INT/FP
» SPEC95 FP
» Olden
Experimental setup
» 5% performance loss constraint
» Run to completion with largest input set
» Report average results from three separate runs

Different Power Metrics
Performance metrics
» Delay (execution time) per instruction, MIPS
» IPC, CPI: abstracts out frequency (MHz)
Energy and power metrics
» Joules (J) and Watts (W)
Combining both performance and power
» MIPS/W ~ energy per instruction
» MIPS^2/W ~ energy * delay (EDP)
» MIPS^3/W ~ energy * delay^2 (ED2P)

Power/Energy Metrics
Energy metrics
» Compare battery durations with given workload
» Compare different processors’ energy efficiency
Power metrics
» Maximum power (TDP): cost, thermal dissipation, reliability
» Average power: battery life, electric bill
EDP/ED2P
» Compare
power-performance efficiency
» E ~ CV^2: reducing voltage always reduces energy

An Illustrative Example
SPEC2K benchmark: 173.applu
» 72 hot regions, 5 DVFS regions
[Figure: CPU voltage (V) and power (W) versus time (seconds), with frequency levels 800MHz, 1.2GHz, 1.4GHz, and 1.6GHz]

Region name   Total ops   Avg. mem. trans   Avg. inst. retired   DVFS setting (Hz)
jacld()       208M        24.8K             0.99M                0.8G
blts()        286M        11.5K             0.99M                1.2G
jacu()        156M        25.6K             0.99M                0.8G
buts()        254M        12.9K             0.99M                1.2G
rhs()         188M        8.2K              1.0M                 1.4G

Energy and Performance Results
Results relative to O-PIN without DVFS
» Including all RDO optimization overhead
» EDP improvement varies from -1% to +70%
[Figure: bar chart of RDO EDP improvement per benchmark, SPEC95 FP (101.tomcatv through 146.wave5, with 95FP_Avg) and SPEC2K FP (168.wupwise through 301.apsi, with 2KFP_Avg); vertical axis from -10% to 70%]

Energy and Performance Results
Average results

Benchmark suite   Performance degradation   Energy savings   Energy-delay product improvement
SPEC95 FP         2.1%                      24.1%            22.4%
SPEC2K FP         3.3%                      24.0%            21.5%
SPEC2K INT        0.7%                      6.5%             6.0%
Olden             3.7%                      25.3%            22.7%

SPEC2K INT: dominantly CPU bound, except for 181.mcf with 45% EDP improvement

Energy and Performance Results
Compared to StaticScale

Benchmark suite   EDP improvement (RDO)   EDP improvement (StaticScale)
SPEC95 FP         22.4%                   5.6%
SPEC2K FP         21.5%                   6.8%
SPEC2K INT        6.0%                    -0.3%
Olden             22.7%                   6.3%

Our results versus StaticScale: 3~5X

Energy and Performance Results
A rough comparison with [Hsu PLDI’03]
» Only reported results for SPEC95 FP

Scheme          Performance degradation   Energy savings   Energy-delay product improvement
Our results     2.1%                      24.1%            22.4%
[Hsu PLDI’03]   2.1%                      11.0%            9.0%

2X improvement due to:
(1) Multiple DVFS regions.
(2) Utilizing run-time hardware information

Basic O-PIN Overhead
Average O-PIN overhead

Benchmark suite   Overhead (performance)   Overhead (energy)
SPEC95 FP         3.3%                     3.6%
SPEC2K FP         1.8%                     2.4%
SPEC2K INT        3.7%                     3.2%
Olden             0.6%                     1.3%

Basic O-PIN Overhead
O-PIN has basic setup/operation overhead
Performance optimizations may offset it and lead to a net performance gain [Bala et al., PLDI’00]
» So far, we only do energy optimization; no performance optimization
0.5~15% basic O-PIN overhead, average 1~4%
[Figure: bar chart of O-PIN overhead (performance and energy) per benchmark for SPEC95 FP and SPEC2K FP; vertical axis from 0% to 20%]

Results Summary & Conclusions
A dynamic compiler driven DVFS framework
» A new concept and a real implementation
» Up to 70% EDPI from physical experiments
» May be generalized for other issues like di/dt
Higher-level dynamic compiler has an important role in power control

Dynamic vs Static
Dynamic power dominates, but static power is increasing more as technology scales down.
[Figure: Xeon @ 3.4GHz/1.25V; adapted from Li et al. 2009]

Future Work
Deeper analysis of experimental results
» Breakdown by contributing factors & comparison
Performance optimizations & interactions
Collaborative SW-HW control scheme
» More micro-architecture support (e.g. special PMC)
Extension for joint CPU and memory energy control
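
Worked example
The dynamic power relation (dynamic power = αCV²f) and the energy-delay product (EDP) used in the results above can be illustrated with a short Python sketch. The function names and the normalized operating points are hypothetical, chosen only for illustration; the 3.3% and 24.0% inputs are the SPEC2K FP averages reported in the deck. This is arithmetic on the reported numbers, not the RDO decision algorithm itself.

```python
def dynamic_power(alpha, capacitance, voltage, frequency):
    """Dynamic (switching) power: alpha * C * V^2 * f."""
    return alpha * capacitance * voltage ** 2 * frequency


def edp_improvement(perf_degradation, energy_savings):
    """Fractional EDP improvement, taking EDP = energy * delay.

    perf_degradation: fractional increase in execution time (0.033 = 3.3%)
    energy_savings:   fractional decrease in energy (0.240 = 24.0%)
    """
    relative_delay = 1.0 + perf_degradation
    relative_energy = 1.0 - energy_savings
    return 1.0 - relative_energy * relative_delay


if __name__ == "__main__":
    # Hypothetical, normalized operating points: only the ratio matters.
    base = dynamic_power(alpha=1.0, capacitance=1.0, voltage=1.0, frequency=1.0)
    scaled = dynamic_power(alpha=1.0, capacitance=1.0, voltage=0.8, frequency=0.5)
    print(f"relative dynamic power after scaling: {scaled / base:.2f}")  # 0.32

    # SPEC2K FP averages from the deck: 3.3% slowdown, 24.0% energy savings.
    print(f"EDP improvement: {edp_improvement(0.033, 0.240):.1%}")  # 21.5%
```

The second figure agrees with the 21.5% reported for SPEC2K FP in the average-results table. For the other suites the same arithmetic lands within a few tenths of a percent of the reported values, which is what one would expect if the reported averages were computed per benchmark and not from the suite-level averages.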