
Energy-efficiency potential of a
phase-based cache resizing
scheme for embedded systems
G. Pokam and F. Bodin
Motivation (1/3)

High performance is hard to reconcile with low power

Consider the cache hierarchy, for instance:

- benefits of large caches:
  - keep the embedded code + data workload on-chip
  - reduce off-chip memory traffic
- however, caches account for ~80% of the transistor count
  => we usually devote half of the chip area to caches

Motivation (2/3)

Cache impact on energy consumption:

- static energy is disproportionately large compared to the rest of the
  chip: the 80% of transistors in caches contribute steadily to leakage
  power
- dynamic energy (transistor switching activity) represents an important
  fraction of the total energy, due to the high access frequency of caches

Cache design is therefore critical in the context of high-performance
embedded systems
Motivation (3/3)

We seek to address cache energy management via hardware/software
interaction

- Any good way to achieve that?
  - Yes: add flexibility so that the cache can be reconfigured efficiently
- How?
  - Follow program phases and adapt the cache structure accordingly
Previous work (1/2)

Some configurable cache proposals that apply to embedded systems include:

- Albonesi [MICRO'99]: selective cache ways
  - disable/enable individual cache ways of a highly set-associative cache
- Zhang et al. [ISCA'03]: way-concatenation
  - reduce the cache associativity while still maintaining the full cache
    capacity
Previous work (2/2)

These approaches only consider configuration on a per-application basis

Problems:

- empirically, no single best cache size exists for a given application
- the dynamic cache behavior varies within an application, and from one
  application to another

Therefore, these approaches do not accommodate program phase changes well
Our approach

Objective:

- emphasize application-specific cache architectural parameters
- to do so, we consider a cache with a fixed line size and a modulus set
  mapping function
  => power/perf is dictated by size and associativity
- not all dynamic program phases have the same requirements on cache size
  and associativity!
- so we dynamically vary size and associativity to leverage the power/perf
  tradeoff at phase level
Cache model (1/8)

Baseline cache model:

- way-concatenation cache [Zhang ISCA'03]

Functionality of the way-concatenation cache (sketched below):

- on each cache lookup, a logic selects the m active cache ways out of the
  n available cache ways
- virtually, each active cache way is a multiple of the size of a single
  bank in the base n-way cache
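As a rough illustration, here is a minimal C sketch of the bank-select
logic such a way-concatenation cache might use; the constants and the two
configuration signals c0/c1 are assumptions for a 32KB, 4-bank, 32B-line
cache, not the paper's actual circuit:

    /* Minimal sketch (not the paper's RTL) of way-concatenation bank
     * selection for a 32KB cache built from four 8KB banks of 32B lines.
     * c0/c1 encode the configured associativity: 4-way (c0=c1=1),
     * 2-way (exactly one of them set), 1-way (c0=c1=0). */
    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_BANKS   4
    #define OFFSET_BITS 5   /* 32B lines */
    #define INDEX_BITS  8   /* 8KB bank / 32B lines = 256 sets per bank */

    /* Returns a bitmask of the banks probed for this address. */
    static uint8_t active_banks(uint32_t addr, bool c0, bool c1) {
        /* The two address bits just above the per-bank index pick the
         * bank(s) when the associativity is reduced. */
        uint32_t a13 = (addr >> (OFFSET_BITS + INDEX_BITS)) & 1u;
        uint32_t a14 = (addr >> (OFFSET_BITS + INDEX_BITS + 1)) & 1u;
        uint8_t mask = 0;
        for (int b = 0; b < NUM_BANKS; b++) {
            bool sel0 = c0 || ((uint32_t)(b & 1) == a13);
            bool sel1 = c1 || ((uint32_t)((b >> 1) & 1) == a14);
            if (sel0 && sel1)
                mask |= (uint8_t)(1u << b);
        }
        return mask;
    }

With c0=c1=1 all four banks are probed (32K 4-way); with one signal set,
two banks are probed (32K 2-way); with both clear, a single bank is probed
(32K direct-mapped).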

Cache model (2/8)

Our proposal:

- modify the associativity while guaranteeing cache coherency
- modify the cache size while preserving data availability in the unused
  cache portions
Cache model (3/8)

First enhancement: associativity level

Problem with the baseline model; consider the following scenario:

- Phase 0: 32K 2-way, active banks are 0 and 2; address @A is cached in
  bank 0
- Phase 1: 32K 1-way, active bank is 2; @A is modified, so a fresh copy of
  @A is written to bank 2 while the old copy in bank 0 becomes stale and
  must be invalidated

[Figure: banks 0-3, with the old copy of @A in bank 0 and the new copy in
bank 2, and an invalidation arrow pointing at the old copy]
Cache model (4/8)

Proposed solution:

- assume a write-through cache
- the unused tag and status arrays must be made accessible on a write to
  ensure coherency across cache configurations => associative tag array
- actions of the cache controller: access all tag arrays on a write request
  and set the corresponding status bit to invalid (a sketch follows)
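A minimal C sketch of that controller action, reusing the assumed constants
from the earlier bank-select sketch; the data structures are illustrative,
not the actual hardware:

    /* On a write, probe the tag/status arrays of ALL ways, including the
     * currently disabled ones, and invalidate any stale copy of the line.
     * Active ways are then updated by the normal write-through path. */
    typedef struct {
        uint32_t tag;
        bool     valid;
    } tag_entry;

    #define SETS_PER_BANK (1u << INDEX_BITS)

    static void write_invalidate(tag_entry tags[NUM_BANKS][SETS_PER_BANK],
                                 uint8_t active_mask, uint32_t addr) {
        uint32_t set = (addr >> OFFSET_BITS) & (SETS_PER_BANK - 1u);
        uint32_t tag = addr >> (OFFSET_BITS + INDEX_BITS);
        for (int w = 0; w < NUM_BANKS; w++) {
            bool active = (active_mask >> w) & 1u;
            tag_entry *e = &tags[w][set];
            if (!active && e->valid && e->tag == tag)
                e->valid = false;  /* kill the stale copy in a disabled way */
        }
    }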

Cache model (5/8)

Second enhancement: cache size level

Problem with the baseline model:

- gated-Vdd is used to disconnect a bank => data are not preserved across
  two configurations!

Proposed solution:

- unused cache ways are put in a low-power mode => drowsy mode
  [Flautner et al. ISCA'02]
- the tag portion is left unchanged!

Main advantage:

- we can reduce the cache size and preserve the state of the unused memory
  cells across program phases, while still reducing leakage energy!
  (see the sketch below)
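As a toy illustration, resizing then amounts to flipping per-way drowsy
bits rather than cutting power; this sketch reuses the assumed constants
from the earlier sketches:

    /* Put every inactive way into drowsy (low-Vdd, state-preserving) mode
     * instead of gating it off; a later access to a drowsy line first
     * wakes it up, costing one extra cycle. */
    static void resize_cache(bool drowsy[NUM_BANKS], uint8_t active_mask) {
        for (int w = 0; w < NUM_BANKS; w++)
            drowsy[w] = !((active_mask >> w) & 1u);  /* data kept, not lost */
    }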
Cache model (6/8)

Overall cache model

[Figure: overall cache model]
Cache model (8/8)



- Drowsy circuitry accounts for less than 3% of the chip area
- Accessing a line in drowsy mode requires a 1-cycle delay
  [Flautner et al. ISCA'02]

ISA extension:

- we assume the ISA can be extended with a reconfiguration instruction
  having the following effects on the WCR:
  way-mask | drowsy bit | config
  ---------+------------+----------------
     0     |    0/1     | 32K1W / 8K1W
     1     |    0/1     | 32K2W / 16K1W
     2     |    0/1     | 32K2W / 16K2W
     3     |     0      | 32K4W
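For concreteness, a hypothetical C wrapper around such an instruction could
look as follows; the WCR bit layout and the mnemonic are invented for
illustration, not taken from the paper:

    /* Hypothetical encoding: bits [1:0] = way-mask, bit [2] = drowsy bit. */
    static inline void cache_reconfig(unsigned way_mask, unsigned drowsy) {
        unsigned wcr = (way_mask & 0x3u) | ((drowsy & 0x1u) << 2);
        /* e.g. asm volatile("wcrset %0" :: "r"(wcr)); -- invented mnemonic */
        (void)wcr;
    }

    /* Example: switch to the 16K2W configuration at a phase boundary. */
    /* cache_reconfig(2, 1); */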
Trace-based analysis (1/3)

Goal:

- extract performance and energy profiles from the trace in order to adapt
  the cache structure to the dynamic application requirements

Assumptions:

- LRU replacement policy
- no prefetching
Trace-based analysis (2/3)

sample interval = i
set mapping function = map j
LRU-Stack distance d = x

Then, define the LRU-stack profiles :



(for varying the associativity)
(for varying the cache size)
P i (map j ( x)) : performance
for each pair ( map j , x ) , this expression defines
the number of dynamic references that hit in caches
with LRU-stack distance d  x
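A small C sketch of how such a profile can be computed from an address
trace; MAX_DIST and the per-set stack bound are illustrative assumptions:

    /* Per-set LRU stacks: each reference returns its LRU-stack distance d
     * (0 = MRU, -1 = first touch); P(x) is then the count of references
     * with d <= x, accumulated per sample interval. */
    #include <stdint.h>

    #define MAX_DIST 8   /* deepest stack distance we track (assumed) */

    typedef struct {
        uint32_t stack[MAX_DIST];  /* line tags, most recent first */
        int      depth;
    } lru_set;

    static int lru_reference(lru_set *s, uint32_t line_tag) {
        int d = -1;
        for (int k = 0; k < s->depth; k++)
            if (s->stack[k] == line_tag) { d = k; break; }
        int top;
        if (d >= 0) {
            top = d;                       /* hit: rotate entry to MRU */
        } else {
            if (s->depth < MAX_DIST) s->depth++;
            top = s->depth - 1;            /* miss: evict stack bottom */
        }
        for (int k = top; k > 0; k--) s->stack[k] = s->stack[k - 1];
        s->stack[0] = line_tag;
        return d;
    }

    /* hist[d] counts references at distance d; P(x) = sum of hist[0..x]. */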

Trace-based analysis (3/3)

Ei (map j ( x)) : energy
Ei (map j ( x))  Pi (map j ( x)) * Ecache
 i * ETag
 N  i * Edrowsy
 Write i * Ememory
Cache
energy
Tag energy
Drowsy
transitions
energy
memory
energy
Ememory  ratio * Ecache
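Under those definitions, the per-interval energy evaluation just mirrors
the formula; in this C sketch the coefficient names follow the labels above
and all values are placeholders to be filled from the energy models:

    /* Per-interval energy for one (map_j, x) candidate configuration. */
    typedef struct {
        double e_cache;    /* energy of one cache access */
        double e_tag;      /* energy of one (associative) tag access */
        double e_drowsy;   /* energy of one drowsy transition */
        double ratio;      /* E_memory = ratio * E_cache */
    } energy_params;

    static double interval_energy(double hits_P, double tag_accesses,
                                  double drowsy_transitions, double writes,
                                  const energy_params *p) {
        double e_memory = p->ratio * p->e_cache;
        return hits_P * p->e_cache
             + tag_accesses * p->e_tag
             + drowsy_transitions * p->e_drowsy
             + writes * e_memory;
    }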
Experimental setup (1/2)


- Focus on the data cache
- Simulation platform:
  - 4-issue VLIW processor [Faraboschi et al. ISCA'00]
  - 32KB 4-way data cache
  - 32B block size
  - 20-cycle miss penalty
- Benchmarks:
  - MiBench: fft, gsm, susan
  - MediaBench: mpeg, epic
  - PowerStone: summin, whetstone, v42bis
Experimental setup (2/2)

- CACTI 3.0:
  - to obtain energy values
  - we extend it to provide leakage energy values for each simulated cache
    configuration
- HotLeakage:
  - from which we adapted the leakage energy calculation for each simulated
    leakage reduction technique
- estimated memory ratio = 50
- drowsy energy taken from [Flautner et al. ISCA'02]
Program behavior (1/4)

GSM

[Figure: GSM profiles on log10 scales; annotations: ~100K, all 32K configs,
all 16K configs, 8K config, capacity miss effect, sensitive region,
tradeoff region, insensitive region]
Program behavior (2/4)

FFT

[Figure: FFT energy/performance profiles]
Program behavior (3/4)

Working set size sensitivity property:

- the working set can be partitioned into clusters with similar cache
  sensitivity

Capturing sensitivity through working set size clustering:

- the partitioning is done relative to the base cache configuration
- we use a simple metric based on the Manhattan distance between two
  working-set vectors v_i^k1 and v_i^k2:

      |v_i^k2 - v_i^k1| <= Delta

  (a sketch follows)
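A minimal C sketch of that clustering test; the vector dimension and the
threshold value are placeholders, not the paper's numbers:

    /* Two sampled working-set vectors fall in the same cluster when their
     * Manhattan (L1) distance stays within a threshold Delta. */
    #include <math.h>

    #define VEC_DIM 8      /* components per working-set vector (assumed) */
    #define DELTA   16.0   /* clustering threshold (placeholder value) */

    static double manhattan(const double *a, const double *b, int n) {
        double d = 0.0;
        for (int i = 0; i < n; i++)
            d += fabs(a[i] - b[i]);
        return d;
    }

    static int same_cluster(const double *v_k1, const double *v_k2) {
        return manhattan(v_k1, v_k2, VEC_DIM) <= DELTA;
    }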
Program behavior (4/4)

More energy/performance profiles

[Figure: energy/performance profiles for summin and whetstone]
Results (1/3)

Dynamic energy reduction

[Figure: dynamic energy reduction results]
Results (2/3)

Leakage energy savings (0.07um)

[Figure: leakage energy savings; annotation: better due to gated-Vdd]
Results (3/3)

Performance

[Figure: performance results; worst-case degradation (65% due to drowsy
transitions)]
Conclusions and future work

- We can do better at improving performance:
  - reduce the frequency of drowsy transitions within a phase with refined
    cache bank access policies
- Management of reconfiguration at the compiler level:
  - insert basic-block (BB) annotations in the trace
  - exploit feedback-directed compilation
- A promising scheme for embedded systems