A Hardware Architecture for Dynamic Performance and Energy

Hardware Architectures for
Power and Energy Adaptation
Phillip Stanley-Marbell
Outline

- Motivation
- Related Research
- Architecture
- Experimental Evaluation
- Extensions
- Summary and Future Work
Motivation

- Power consumption is becoming a limiting factor as technology scales to smaller feature sizes
  - Mobile/battery-powered computing applications
  - Thermal issues in high-end servers
- Low-power design is not enough: power- and energy-aware design
  - Adapt to non-uniform application behavior
  - Use only as many resources as the application requires
- This talk: exploit the processor-memory performance gap to save power, with limited performance degradation
Related Research

- Reducing power dissipation in on-chip caches
  - Reducing instruction cache leakage power dissipation [Powell et al., TVLSI '01]
  - Reducing dynamic power in set-associative caches and on-chip buffer structures [Dropsho et al., PACT '02]
- Reducing power dissipation of the CPU core
  - Compiler-directed dynamic voltage scaling of the CPU core [Hsu, Kremer, Hsiao, ISLPED '01]
Target Application Class:
Memory-Bound Applications

- Memory-bound applications
  - Limited by memory system performance
- Single-issue in-order processors
  - Limited overlap of main memory access and computation

(Figure: execution timelines comparing CPU @ Vdd and CPU @ Vdd/2)
Power-Performance Tradeoff

- Detect memory-bound execution phases
  - Maintain sufficient information to determine the compute / stall time ratio
- Pros
  - Scaling down the CPU core voltage yields significant energy savings (Energy ∝ Vdd²)
- Cons
  - Performance hit (Delay ∝ 1/Vdd)
Power Adaptation Unit (PAU)

- Maintains information to determine the ratio of compute to stall time
- Entries allocated for instructions which cause CPU stalls
  - Intuitively, one table entry required per program loop
- Fields:
  - State (I, A, T, V)
  - # instrs. executed (NINSTR)
  - Distance b/n stalls (STRIDE)
  - Saturating 'quality' counter (Q)

[From S-M et al., PACS 2002]
PAU Table Entry State Machine

- If the CPU is running at speed, slow it down

Slowdown factor, ∂, for a target 1% performance degradation:

    ∂ = (0.01 · STRIDE + NINSTR) / NINSTR
Example

    for (x = 100;;)
    {
        if (x-- > 0)
            a = i;
        b = *n;
        c = *p++;
    }

- PAU table entries created for each assignment
- After 100 iterations, the assignment to a stops
  - Entries for b or c can take over immediately
Experimental Methodology

- Simulated the PAU as part of a single-issue embedded processor
  - Used the Myrmigki simulator [S-M et al., ISLPED 2001]
  - Models a Hitachi SH RISC embedded processor
    - 5-stage in-order pipeline
    - 8K unified L1, 100-cycle latency to main memory
    - Empirical instruction power model, from an SH7708 device
    - Voltage scaling penalty of 1024 cycles, 14 µJ
- Investigated the effect of PAU table size on performance and power
  - Intuitively, PAU table entries track program loops with repeated stalls
Effect of Table Size on Energy Savings

- A single-entry PAU table provides a 27% reduction in energy, on average
- Scaling up to a 64-entry PAU table provides only an additional 4%
Effect of Table Size on Performance

- A single-entry PAU table incurs 0.75% performance degradation, on average
- A large PAU table leads to more aggressive behavior and an increased penalty
Overall Effect of Table Size: Energy-Delay Product

- Considering both performance and power, there is little benefit from larger PAU table sizes
Extending the PAU Structure

- Multiprogramming environments
- Superscalar architectures
- Slowdown factor computation
PAU in Multiprogramming Environments

- Only a single entry is necessary per application
- Amortize memory-bound phase detection
  - It would be wasteful to flush the PAU at each context switch (~10 ms)
- Extend PAU entries with an ID field:
  - CURID and IDMASK fields are written by the OS
PAU in Superscalar Architectures

(Figure: execution timelines comparing CPU @ Vdd and CPU @ Vdd/2)

- Dependent computations are 'stretched out'
  - FUs with no dependent instructions are unduly slowed down
- Maintain separate instruction counters per FU
  - Drawback: requires the ability to run FUs in the core at different voltages
Slowdown Factor Computation

- Computation is performed only on an application phase change
  - A hardware solution would be wasteful
- Solution: computation by a software ISR
  - Compute ∂, then look up a discrete Vdd/frequency setting by indexing into a lookup table
- A similar software handler solution was proposed in [Dropsho et al., 2002]
Summary & Future Work

- PAU: hardware identifies program regions (loops) with a compute / memory-stall mismatch
- Due to the nature of most programs, even a single-entry PAU is effective: it can achieve 27% energy savings with only 0.75% performance degradation
- Proposed extensions to the PAU architecture
- Future work
  - Evaluations with smaller miss penalties
  - Implementation of proposed extensions
  - More extensive evaluation of implications for applications
Questions