Hardware Architectures for
Power and Energy Adaptation
Phillip Stanley-Marbell
Outline
Motivation
Related Research
Architecture
Experimental Evaluation
Extensions
Summary and Future Work
Motivation
Power consumption is becoming a limiting factor with
scaling of technology to smaller feature sizes
Mobile/battery-powered computing applications
Thermal issues in high-end servers
Low Power Design is not enough:
Power- and Energy-Aware Design
Adapt to non-uniform application behavior
Only use as many resources as required by application
This talk: exploit the processor-memory performance
gap to save power, with limited performance
degradation
Related Research
Reducing power dissipation in on-chip caches
Reducing instruction cache leakage power
dissipation [Powell et al, TVLSI ‘01]
Reducing dynamic power in set-associative caches
and on-chip buffer structures [Dropsho et al, PACT ‘02]
Reducing power dissipation of CPU core
Compiler-directed dynamic voltage scaling of
CPU core [Hsu, Kremer, Hsiao. ISLPED ‘01]
Target Application Class:
Memory-Bound Applications
Memory-bound applications
Limited by memory system performance
Single-issue in-order processors
Limited overlap of main memory access and
computation
(Figure: execution timelines, CPU @ Vdd vs. CPU @ Vdd/2)
Power-Performance Tradeoff
Detect memory-bound execution phases
Maintain sufficient information to determine
compute / stall time ratio
Pros
Scaling down CPU core voltage yields significant
energy savings (Energy ∝ Vdd²)
Cons
Performance hit (Delay ∝ 1/Vdd)
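The tradeoff can be made concrete with a first-order CMOS scaling sketch; the quadratic energy and roughly linear frequency dependence on Vdd are textbook first-order assumptions, not measurements from this work:

```c
#include <assert.h>

/* First-order CMOS scaling sketch: dynamic energy per operation scales as
 * Vdd^2, and achievable clock frequency scales roughly linearly with Vdd,
 * so delay scales as 1/Vdd. */
static double energy_ratio(double vdd_scale)
{
    return vdd_scale * vdd_scale;   /* E proportional to Vdd^2 */
}

static double delay_ratio(double vdd_scale)
{
    return 1.0 / vdd_scale;         /* T proportional to 1/Vdd */
}
```

Under these assumptions, at Vdd/2 (vdd_scale = 0.5) per-operation energy drops to one quarter while each operation takes twice as long.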
Power Adaptation Unit (PAU)
Maintains information to determine the ratio of compute to stall time
Entries allocated for instructions that cause CPU stalls
Intuitively, one table entry required per program loop
Fields:
State (I, A, T, V)
# instrs. executed (NINSTR)
Distance between stalls (STRIDE)
Saturating ‘Quality’ counter (Q)
[From S-M et al, PACS 2002]
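A PAU table entry with these fields might be declared as follows; field widths, the tracked PC, and the expansion of the I/A/T/V state names are illustrative assumptions, not taken from the PACS 2002 design:

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of a PAU table entry with the fields listed above.
 * Widths and the PC field are assumptions for illustration. */
enum pau_state { PAU_I, PAU_A, PAU_T, PAU_V };

struct pau_entry {
    enum pau_state state;   /* entry state: one of I, A, T, V */
    uint32_t       pc;      /* stalling instruction this entry tracks (assumed) */
    uint32_t       ninstr;  /* # instructions executed (NINSTR) */
    uint32_t       stride;  /* distance between stalls (STRIDE) */
    uint8_t        q;       /* saturating 'quality' counter (Q) */
};
```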
PAU Table Entry State
Machine
If CPU at-speed,
slow it down
Slowdown factor, ∂, for a target
1% performance degradation:
∂ = (0.01 • STRIDE + NINSTR) / NINSTR
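As a concrete sketch of the computation above, assuming the formula reads ∂ = (0.01 · STRIDE + NINSTR) / NINSTR (variable names are illustrative):

```c
#include <assert.h>

/* Slowdown factor for a 1% performance-degradation target:
 * d = (0.01 * STRIDE + NINSTR) / NINSTR, where STRIDE is the
 * distance between stalls and NINSTR the instructions executed. */
static double slowdown_factor(unsigned stride, unsigned ninstr)
{
    return (0.01 * (double)stride + (double)ninstr) / (double)ninstr;
}
```

For example, with STRIDE = 100 and NINSTR = 1 the target slowdown is 2x; with no stall distance (STRIDE = 0) the factor is 1.0, i.e. no slowdown.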
Example

for (x = 100;;)
{
    if (x-- > 0)
        a = i;
    b = *n;
    c = *p++;
}

PAU table entries created for each assignment
After 100 iterations, assignment to a stops
Entries for b or c can take over immediately
Experimental Methodology
Simulated PAU as part of a single-issue embedded
processor
Used Myrmigki simulator [S-M et al, ISLPED 2001]
Models Hitachi SH RISC embedded processor
5 stage in-order pipeline
8 KB unified L1 cache, 100-cycle latency to main memory
Empirical instruction power model, from an SH7708 device
Voltage scaling penalty of 1024 cycles, 14 µJ
Investigated effect of PAU table size on performance,
power
Intuitively, PAU table entries track program loops with
repeated stalls
Effect of Table Size on Energy
Savings
Single-entry PAU table provides a 27% reduction in energy, on average
Scaling up to a 64-entry PAU table provides only an additional 4%
Effect of Table Size on
Performance
Single-entry PAU table incurs 0.75% performance degradation, on avg.
A larger PAU table leads to more aggressive behavior and an increased penalty
Overall Effect of Table Size :
Energy-Delay product
Considering both performance and power, there is little
benefit from larger PAU table sizes
Extending the PAU structure
Multiprogramming environments
Superscalar architectures
Slowdown factor computation
PAU in Multiprogramming
Environments
Only a single entry necessary per application
Amortize mem.-bound phase detection
Would be wasteful to flush the PAU at each context switch (~10 ms)
Extend PAU entries with an ID field:
CURID and IDMASK fields written to by OS
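A minimal sketch of the ID-matching check; the masked-equality rule below is an assumption, since the slide only names the CURID and IDMASK fields:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the proposed ID extension: each PAU entry carries an ID tag,
 * and the OS writes CURID and IDMASK on a context switch. An entry is
 * considered to belong to the running process when its tag equals CURID
 * under the mask (matching rule assumed). */
static int pau_entry_matches(uint32_t entry_id, uint32_t curid, uint32_t idmask)
{
    return (entry_id & idmask) == (curid & idmask);
}
```

With IDMASK = 0xF, for instance, only the low four ID bits distinguish processes, so up to 16 applications could share the table without flushing it on a context switch.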
PAU in Superscalar
Architectures
(Figure: superscalar execution timelines, CPU @ Vdd vs. CPU @ Vdd/2)
Dependent computations are ‘stretched out’
FUs with no dependent instructions unduly slowed down
Maintain separate instruction counters per FU:
Drawback: requires the ability to run
FUs in the core at different voltages
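The per-FU counters might be kept as follows; the FU count and the structure are illustrative assumptions, not from the slide:

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of per-functional-unit instruction counters, so that an FU with
 * no dependent instructions is not unduly slowed down. */
#define NUM_FU 4  /* number of functional units; illustrative */

struct fu_counters {
    uint32_t ninstr[NUM_FU];  /* instructions issued to each FU since the last stall */
};

/* Record one instruction issued to functional unit `fu`. */
static void count_issue(struct fu_counters *c, unsigned fu)
{
    if (fu < NUM_FU)
        c->ninstr[fu]++;
}
```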
Slowdown factor computation
Computation only performed on application
phase change
Solution: perform the computation in a software ISR
A dedicated hardware solution would be wasteful
Compute ∂, then look up a discrete Vdd/Freq. setting by
indexing into a lookup table
Similar software handler solution proposed in
[Dropsho et al, 2002]
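The handler's table lookup might look like this; the operating-point values and the selection rule are illustrative assumptions:

```c
#include <assert.h>

/* Sketch of the software-handler lookup: given a slowdown factor d,
 * pick a discrete (Vdd, frequency) operating point from a small table. */
struct op_point {
    double vdd_mv;    /* core voltage, millivolts */
    double freq_mhz;  /* clock frequency, MHz */
};

/* Discrete operating points, slowest first (values assumed). */
static const struct op_point optable[] = {
    {  900.0,  30.0 },
    { 1200.0,  60.0 },
    { 1500.0,  90.0 },
    { 1800.0, 120.0 },
};

/* Pick the slowest operating point that still meets the target
 * frequency f_nominal / d, where d is the slowdown factor. */
static struct op_point pick_op_point(double f_nominal_mhz, double d)
{
    unsigned i;
    double f_target = f_nominal_mhz / d;

    for (i = 0; i < sizeof optable / sizeof optable[0]; i++) {
        if (optable[i].freq_mhz >= f_target)
            return optable[i];
    }
    return optable[sizeof optable / sizeof optable[0] - 1];
}
```

For a nominal 120 MHz core and d = 2, the handler would select the 60 MHz point and its associated voltage.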
Summary & Future Work
PAU : Hardware identifies program regions (loops)
with compute / memory stall mismatch
Due to the nature of most programs, even a single-entry
PAU is effective: it can achieve 27% energy savings
with only 0.75% perf. degradation
Proposed extensions to PAU architecture
Future work
Evaluations with smaller miss penalties
Implementation of proposed extensions
More extensive evaluation across a broader set of applications
Questions