
Dynamic Compiler-Driven Microprocessor Energy and Performance Optimization
Vijay Janapa Reddi
The University of Texas at Austin
Introduction

Power management is important
-2-
Trends Over the Years
-3-
Power Background

Where does power go?
» Dynamic power – “switching” power
» Static power – “leakage” power
» Short-circuit power

Static power: transistors are not perfect switches and leak current
Dynamic power: charging/discharging capacitance on a 0→1 or 1→0 transition consumes power
Short-circuit power: occurs briefly during a transition
Dynamic power dominates, but static power is increasing as technology scales down.
-4-
Dynamic Power

Dynamic power = $\alpha C V^{2} f$
[Figure: circuit diagram labeled with supply current i_Vdd, output voltage V_out, and load capacitance C_L]

α – Switching factors
» e.g. clock toggles every cycle - switching factor is 1

C – Capacitance
» Function of wire length, transistor size


V – Supply voltage
f – Clock frequency
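
As a rough numerical illustration of the formula above, the sketch below evaluates dynamic power at two operating points. The switching factor, capacitance, and voltage/frequency pairs are made-up illustrative values, not measurements from this work.

```python
def dynamic_power(alpha: float, c_farads: float, v_volts: float, f_hz: float) -> float:
    """Dynamic (switching) power in watts: alpha * C * V^2 * f."""
    return alpha * c_farads * v_volts ** 2 * f_hz


# Hypothetical values chosen only to make the arithmetic easy to follow.
ALPHA = 0.2      # assumed average switching factor
C_EFF = 10e-9    # assumed effective switched capacitance, in farads

p_high = dynamic_power(ALPHA, C_EFF, v_volts=1.4, f_hz=1.6e9)
p_low = dynamic_power(ALPHA, C_EFF, v_volts=1.0, f_hz=0.8e9)

print(f"high: {p_high:.3f} W, low: {p_low:.3f} W, ratio: {p_low / p_high:.3f}")
```

Halving the frequency alone would halve dynamic power. Lowering the voltage at the same time adds a quadratic reduction on top, which is why voltage and frequency are scaled together.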
-5-
Short-circuit Power



Short-circuit power is still “dynamic” power
Short-circuit current is caused by the finite slope of input signal transitions
During a transition, the NMOS and PMOS transistors both conduct briefly
» A short circuit forms between VDD and GND
-6-
Leakage Power

Three main sources of leakage
» Subthreshold, gate, and junction
Leakage currents grow exponentially with increasing temperature and decreasing threshold voltage
EE382M-8 Class Notes
1/17/2011

-7-
Introduction (2)


Adaptive power management is important
Dynamic voltage and frequency scaling (DVFS)
is one effective technique to reduce power
-8-
Introduction (3)



Adaptive power management is important
Dynamic voltage and frequency scaling (DVFS)
is one effective technique to reduce power
Previous DVFS approaches
» Hardware or OS-interrupt based [Semeraro et al., MICRO’02]
» Static compiler based [Hsu et al., PLDI’03]
No prior work on dynamic compiler DVFS
-9-
Goal of This Work



Explore power control opportunities in a dynamic
compiler
An effective complement to existing techniques
Ultimate goal: a multi-layer (SW & HW) collaborative
control
- 10 -
What is Dynamic Compiler Driven DVFS?

What is dynamic compiler?
» Software that compiles/optimizes binary code at runtime
[Figure: an application binary runs on a dynamic compilation system, which hosts performance optimization, …, and DVFS optimization, on top of the OS and hardware]
- 11 -
Why Dynamic Compiler Driven DVFS?

Advantages over existing approaches

Approach                 High-level program structure   Run-time processor information
Hardware-based           x                              √
Static compiler based    √                              x
Dynamic compiler based   √                              √

More code aware, more adaptive, …

Disadvantages and Challenges
» Run-time operation & optimization overhead
- 18 -
Motivational Data (based on Qsort)
Cache miss behavior changes over time
Input-specific behavior at runtime
- 20 -
Contributions of This Work

A new concept
» Dynamic compiler driven DVFS

A design framework
» Implementation of prototype in a real system
- 21 -
Outline of This Talk …

Introduction and motivation

Design framework and DVFS decision algorithms

Implementation and Deployment in a real system

Experimental Results

Conclusions and future work
- 22 -
A Design Framework

Overall operation block diagram
[Figure: block diagram with Start, Monitor, Dispatcher, Dynamic Optimizer, Run-time DVFS Optimizer (RDO), cold code execution, hot code execution (w/ optimization & DVFS), and OS and hardware. Callouts list the key design issues: candidate region selection, DVFS decision algorithm, and DVFS code insertion and transformation]
- 23 -
A Design Framework (2)

Key design issues
» Candidate code region selection
» DVFS decision algorithm for code regions
» DVFS Code insertion and transformation
- 24 -
A DVFS Decision Algorithm

RDO inserts test module at entry and exit
points of a code region
» How and Why?

DVFS decision based on test information
» Is it long-running?
» Is it beneficial to apply DVFS?
» What is the optimal DVFS setting?

Key observation:
» OK to slow down the CPU while it is waiting for memory
How to decide the DVFS setting quantitatively?
- 26 -
An analytical decision model
[Figure: execution time broken into a memory operation (t_asyn_mem) and CPU operation (N_concurrent / f and N_dependent / f)]
- 27 -
DVFS setting computation

$f = \beta \, f_{max}$, with frequency scaling factor $0 < \beta \le 1.0$
» The scaling factor depends on the relative memory intensity and the relative CPU intensity
- 28 -
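
One way to picture this computation is the sketch below. It rests on loudly stated assumptions: execution time is modeled as the memory time overlapped with the concurrent CPU work, followed by the dependent CPU work, which is one reading of the timeline figure. The function names, the list of discrete frequency levels, and the example region parameters are all hypothetical. The 5% limit matches the performance loss constraint used in the experiments.

```python
from typing import Sequence


def exec_time(f_hz: float, t_asyn_mem: float,
              n_concurrent: float, n_dependent: float) -> float:
    """Assumed model: memory time overlaps with concurrent CPU cycles,
    then the dependent CPU cycles run afterwards."""
    return max(t_asyn_mem, n_concurrent / f_hz) + n_dependent / f_hz


def pick_frequency(levels_hz: Sequence[float], t_asyn_mem: float,
                   n_concurrent: float, n_dependent: float,
                   max_loss: float = 0.05) -> float:
    """Return the lowest frequency whose modeled slowdown, relative to
    running at the highest frequency, stays within max_loss."""
    f_max = max(levels_hz)
    t_limit = (1.0 + max_loss) * exec_time(f_max, t_asyn_mem,
                                           n_concurrent, n_dependent)
    for f in sorted(levels_hz):
        if exec_time(f, t_asyn_mem, n_concurrent, n_dependent) <= t_limit:
            return f
    return f_max


# Hypothetical discrete levels within a 600-1600 MHz range.
LEVELS_HZ = [0.6e9, 0.8e9, 1.0e9, 1.2e9, 1.4e9, 1.6e9]

# Hypothetical regions: one dominated by memory time, one by CPU cycles.
mem_bound = pick_frequency(LEVELS_HZ, t_asyn_mem=1e-3,
                           n_concurrent=4e5, n_dependent=2e4)
cpu_bound = pick_frequency(LEVELS_HZ, t_asyn_mem=1e-5,
                           n_concurrent=4e5, n_dependent=2e4)

# Scaling factor relative to f_max for each region.
print(mem_bound / max(LEVELS_HZ), cpu_bound / max(LEVELS_HZ))
```

In this sketch the memory-bound region can drop to the lowest level because the CPU work hides behind the memory time, while the CPU-bound region stays at the maximum frequency.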
Implementation In a Real System

Platform: Intel PIN system [Luk et al., PLDI’05]
» O-PIN: optimization version of PIN

Implementation highlights:
» Candidate regions: functions and loops
» Adjustable profiling/optimizing granularity
» JIT (generate) code selectively
» Possibly multiple DVFS code regions
» Fast DVFS decision
- 29 -
Deployment in Hardware

Hardware platform with power measurements
» Intel development board (Pentium-M, 600–1600 MHz)
» OS: Linux 2.4.18
[Figure: measurement setup with voltage/current measurement, noise reduction, data acquisition (DAQ), and data logging]
- 30 -
Experimental Setup and Benchmarks

Over 40 Benchmarks
» SPEC 2K INT/FP
» SPEC95 FP
» Olden

Experimental Setup
» 5% performance loss constraint
» Run to completion with largest input set
» Report average results from three separate runs
- 32 -
Different Power Metrics

Performance metrics
» Delay (execution time) per instruction, MIPS
» IPC, CPI – abstracts out frequency (MHz)

Energy and power metrics
» Joules (J) and Watts (W)

Combining both performance and power
» MIPS/W ~ energy per instruction
» MIPS²/W ~ energy * delay (EDP)
» MIPS³/W ~ energy * delay² (ED²P)
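
The correspondences above follow from the definitions. A short derivation, assuming a fixed instruction count $I$, execution time (delay) $t$, energy $E$, and average power $W = E/t$ (the symbols $I$, $t$, and $E$ are introduced here only for illustration):

```latex
\begin{align*}
\text{MIPS} &= \frac{I}{10^{6}\, t}, \qquad W = \frac{E}{t} \\
\frac{\text{MIPS}}{W} &= \frac{I}{10^{6}\, E} \;\propto\; \frac{1}{E / I} \\
\frac{\text{MIPS}^{2}}{W} &= \frac{I^{2}}{10^{12}\, E\, t} \;\propto\; \frac{1}{E \cdot t} \\
\frac{\text{MIPS}^{3}}{W} &= \frac{I^{3}}{10^{18}\, E\, t^{2}} \;\propto\; \frac{1}{E \cdot t^{2}}
\end{align*}
```

Each MIPS-based metric is therefore inversely related to its energy-delay counterpart: higher MIPS²/W means lower EDP for the same workload.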
- 33 -
Power/Energy Metrics

Energy metrics
» Compare battery durations with given workload
» Compare different processors’ energy efficiency

Power metrics
» Maximum power (TDP) – cost, thermal dissipation,
reliability
» Averaged power - battery life, electric bill

EDP/ED²P
» Compare power–performance efficiency
» E ~ CV² – reducing voltage always reduces energy
- 34 -
An Illustrative Example

SPEC2K benchmark: 173.applu
» 72 hot regions, 5 DVFS regions
[Figure: CPU voltage (V) and power (W) versus time (seconds), with voltage levels annotated 1.6 GHz, 1.4 GHz, 1.2 GHz, and 800 MHz]

Region name   Total ops   Avg. mem. trans   Avg. inst. retired   DVFS setting (Hz)
jacld()       208M        24.8K             0.99M                0.8G
blts()        286M        11.5K             0.99M                1.2G
jacu()        156M        25.6K             0.99M                0.8G
buts()        254M        12.9K             0.99M                1.2G
rhs()         188M        8.2K              1.0M                 1.4G
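
Using the per-region numbers above, the sketch below ranks the five regions by memory transactions per thousand retired instructions. This ratio is an illustrative proxy for memory intensity chosen for this example; it is not claimed to be the exact quantity RDO uses.

```python
# (region, avg. memory transactions, avg. instructions retired, DVFS setting in GHz)
regions = [
    ("jacld()", 24.8e3, 0.99e6, 0.8),
    ("blts()", 11.5e3, 0.99e6, 1.2),
    ("jacu()", 25.6e3, 0.99e6, 0.8),
    ("buts()", 12.9e3, 0.99e6, 1.2),
    ("rhs()", 8.2e3, 1.0e6, 1.4),
]


def mem_per_kinst(mem_trans: float, inst_retired: float) -> float:
    """Memory transactions per 1000 retired instructions (illustrative proxy)."""
    return 1000.0 * mem_trans / inst_retired


# Most memory-intensive regions first.
ranked = sorted(regions, key=lambda r: mem_per_kinst(r[1], r[2]), reverse=True)

for name, mem, inst, ghz in ranked:
    print(f"{name:8s} {mem_per_kinst(mem, inst):5.1f} mem trans / 1K inst -> {ghz} GHz")
```

With these numbers, the regions with the most memory traffic per instruction are the ones assigned the lowest frequency, consistent with the observation that it is OK to slow the CPU down while it waits for memory.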
- 35 -
Energy and Performance Results
Results relative to O-PIN without DVFS
» Including all RDO optimization overhead
» EDP improvement varies from -1% to +70%
[Figure: RDO EDP improvement (y-axis -10% to 70%) per benchmark for SPEC95 FP (101.tomcatv through 146.wave5, with 95FP_Avg) and SPEC2K FP (168.wupwise through 301.apsi, with 2KFP_Avg)]
- 36 -
Energy and Performance Results

Average results

Benchmark suite   Performance degradation   Energy savings   Energy-delay product improvement
SPEC95 FP         2.1%                      24.1%            22.4%
SPEC2K FP         3.3%                      24.0%            21.5%
SPEC2K INT        0.7%                      6.5%             6.0%
Olden             3.7%                      25.3%            22.7%

SPEC2K INT: dominantly CPU bound, except for 181.mcf with 45% EDP improvement
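
As a sanity check on how the three columns relate, the sketch below estimates EDP improvement from the average performance degradation and energy savings of each suite. The function name is hypothetical. Because the reported figures are averages over benchmarks, the estimate is only expected to land close to the reported EDP column, not to match it exactly.

```python
def edp_improvement(perf_degradation: float, energy_savings: float) -> float:
    """Relative EDP reduction when delay grows by perf_degradation and
    energy shrinks by energy_savings (both given as fractions)."""
    return 1.0 - (1.0 - energy_savings) * (1.0 + perf_degradation)


# suite -> (performance degradation, energy savings, reported EDP improvement)
suites = {
    "SPEC95 FP": (0.021, 0.241, 0.224),
    "SPEC2K FP": (0.033, 0.240, 0.215),
    "SPEC2K INT": (0.007, 0.065, 0.060),
    "Olden": (0.037, 0.253, 0.227),
}

estimates = {name: edp_improvement(loss, savings)
             for name, (loss, savings, _) in suites.items()}

for name, (_, _, reported) in suites.items():
    print(f"{name:10s} estimated {estimates[name]:.1%}, reported {reported:.1%}")
```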
- 37 -
Energy and Performance Results

Compared to StaticScale

Benchmark suite   EDP improvement (RDO)   EDP improvement (StaticScale)
SPEC95 FP         22.4%                   5.6%
SPEC2K FP         21.5%                   6.8%
SPEC2K INT        6.0%                    -0.3%
Olden             22.7%                   6.3%

Our results versus StaticScale: 3~5X
- 38 -
Energy and Performance Results

A rough comparison with [Hsu PLDI’03]
» Only reported results for SPEC95 FP

Scheme          Performance degradation   Energy savings   Energy-delay product improvement
Our results     2.1%                      24.1%            22.4%
[Hsu PLDI’03]   2.1%                      11.0%            9.0%

2X improvement due to:
(1) Multiple DVFS regions
(2) Utilizing run-time hardware information
- 39 -
Basic O-PIN Overhead

Average O-PIN overhead

Benchmark suite   Overhead (performance)   Overhead (energy)
SPEC95 FP         3.3%                     3.6%
SPEC2K FP         1.8%                     2.4%
SPEC2K INT        3.7%                     3.2%
Olden             0.6%                     1.3%
- 40 -
Basic O-PIN Overhead


O-PIN has basic setup/operation overhead
Performance optimizations may offset it and lead to a net performance gain [Bala et al., PLDI’00]
» So far, we only do energy optimization; no performance optimization
Basic O-PIN overhead is 0.5~15%, averaging 1~4%
[Figure: O-PIN overhead in performance and energy (y-axis 0% to 20%) per benchmark for SPEC95 FP (101.tomcatv through 146.wave5, with 95FP_Avg) and SPEC2K FP (168.wupwise through 301.apsi, with 2KFP_Avg)]
- 41 -
Results Summary & Conclusions

A dynamic compiler driven DVFS framework
» A new concept and a real implementation
» Up to 70% EDP improvement in physical experiments
» May be generalized to other issues such as di/dt

A higher-level dynamic compiler has an important role in power control
- 43 -
Dynamic vs Static

Dynamic power dominates, but static power is increasing as technology scales down.
[Figure: Xeon @ 3.4 GHz / 1.25 V, adapted from Li et al. 2009]
- 44 -
Future Work

Deeper analysis of experimental results
» Breakdown by contributing factors & comparison

Performance optimizations & interactions

Collaborative SW-HW control scheme
» More microarchitectural support (e.g., special PMCs)

Extension for joint CPU and memory energy control
- 45 -