
RUBIK: FAST ANALYTICAL POWER MANAGEMENT
FOR LATENCY-CRITICAL SYSTEMS
HARSHAD KASTURE, DAVIDE BARTOLINI, NATHAN BECKMANN,
DANIEL SANCHEZ
MICRO 2015
Motivation

- Low server utilization in today’s datacenters results in resource and energy inefficiency
- Stringent latency requirements of user-facing services are a major contributing factor
- Power management for these services is challenging
  - Strict requirements on tail latency
  - Inherent variability in request arrival and service times
- Rubik uses statistical modeling to adapt to short-term variations
  - Respond to abrupt load changes
  - Improve power efficiency
  - Allow colocation of latency-critical and batch applications
Understanding Latency-Critical Applications

[Diagram: a client request enters the datacenter at a root node, which fans out to multiple leaf nodes, each with its own back-end servers; per-node latency budgets are on the order of 1 ms.]

- The few slowest responses determine user-perceived latency (see the quick check below)
- Tail latency (e.g., 95th/99th percentile), not mean latency, determines performance
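Why the slowest few dominate: with fan-out, a user request is only as fast as its slowest leaf. A quick back-of-the-envelope check (assuming, purely for illustration, independent leaves that each meet some latency x at their own 99th percentile):

```python
# Probability that at least one of N independent leaves exceeds its own
# 99th-percentile latency, making the whole user request at least that slow.
for n_leaves in (1, 10, 100):
    p_slow = 1 - 0.99 ** n_leaves
    print(f"{n_leaves:3d} leaves: {p_slow:.0%} of requests see a >p99 leaf")
```

With 100 leaves, roughly 63% of user requests are slower than a single leaf’s 99th percentile, which is why the tail, not the mean, is the metric that matters.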
Prior Schemes Fall Short

- Traditional DVFS schemes (cpufreq, TurboBoost…)
  - React to coarse-grained metrics like processor utilization, oblivious to short-term performance requirements
- Power management for embedded systems (PACE, GRACE…)
  - Do not consider queuing
- Schemes designed specifically for latency-critical systems (PEGASUS [Lo ISCA’14], Adrenaline [Hsu HPCA’15])
  - Rely on application-specific heuristics
  - Too conservative
Insight 1: Short-Term Load Variations

- Latency-critical applications have significant short-term load variations

[Plot: load (QPS, 0–300) vs. time (0–4 s) for moses, showing large swings within fractions of a second.]

- PEGASUS [Lo ISCA’14] uses feedback control to adapt the frequency setting to diurnal load variations
  - Deduces server load from observed request latency
  - Cannot adapt to short-term variations
Insight 2: Queuing Matters!

- Tail latency is often determined by queuing, not the length of individual requests (illustrated by the simulation below)

[Plots: response time (ms), service time (ms), and queue length vs. time (0–4 s) for moses.]

- Adrenaline [Hsu HPCA’15] uses application-level hints to distinguish long requests from short ones
  - Long requests boosted (sped up)
  - Frequency settings must be conservative to handle queuing
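To see why queuing dominates the tail, here is a minimal simulation (not from the paper; it assumes a single-server FIFO queue with Poisson arrivals and a fixed 1 ms service time, so every variation in response time comes from queuing alone):

```python
import random

def simulate(load: float, n: int = 100_000, service: float = 1.0) -> list[float]:
    """Response times (ms) of a FIFO server at the given utilization."""
    random.seed(42)
    t = 0.0           # arrival clock
    free_at = 0.0     # when the server next becomes idle
    latencies = []
    for _ in range(n):
        t += random.expovariate(load / service)  # Poisson arrivals
        start = max(t, free_at)                  # wait if server is busy
        free_at = start + service                # fixed 1 ms service time
        latencies.append(free_at - t)            # queuing delay + service
    return sorted(latencies)

for load in (0.3, 0.5, 0.7, 0.9):
    lat = simulate(load)
    p50, p99 = lat[len(lat) // 2], lat[int(0.99 * len(lat))]
    print(f"load={load:.1f}  p50={p50:6.2f} ms  p99={p99:6.2f} ms")
```

Even though every request needs exactly 1 ms of service, the 99th percentile grows many times larger as load rises, purely from waiting behind other requests.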
Rubik Overview

- Use queue length as a measure of instantaneous system load
- Update frequency whenever queue length changes (see the sketch below)
  - Adapt to short-term load variations

[Diagram: Rubik observes core activity (busy/idle) and queue length over time, and sets core frequency over time in response.]
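A minimal sketch of this event-driven loop (names are illustrative: required_frequency stands in for the analytical model on the next slides, and set_core_frequency for the fast per-core DVFS interface):

```python
from collections import deque

MIN_FREQ_GHZ = 0.8  # illustrative lowest V/F state

def required_frequency(queue: deque) -> float:
    """Stub for Rubik's analytical model: the lowest frequency that keeps
    every queued request's projected tail latency under the target."""
    return MIN_FREQ_GHZ

def set_core_frequency(f_ghz: float) -> None:
    """Stub for the fast per-core DVFS interface."""
    print(f"core frequency -> {f_ghz:.1f} GHz")

queue: deque = deque()  # requests waiting for or in service

def on_arrival(request) -> None:
    # Queue length just changed: re-evaluate the frequency immediately.
    queue.append(request)
    set_core_frequency(required_frequency(queue))

def on_departure() -> None:
    queue.popleft()
    # Re-evaluate, or drop to the lowest V/F state when the core goes idle.
    set_core_frequency(required_frequency(queue) if queue else MIN_FREQ_GHZ)
```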
Goal: Reshaping Latency Distribution

[Figure: probability density vs. response latency.]
Key Factors in Setting Frequencies

- Distribution of cycle requirements of individual requests
  - Larger variance → more conservative frequency setting
- How long has a request spent in the queue?
  - Longer wait times → higher frequency
- How many requests are queued waiting for service?
  - Longer queues → higher frequency
There’s Math!

Distribution of the cycles still needed by the in-service request, given that it has already received ω cycles of service:

\[ P[S_0 = c] = P[S = c + \omega \mid S > \omega] = \frac{P[S = c + \omega]}{P[S > \omega]} \]

Distribution of the total cycles needed to finish requests 0 through i, obtained by convolving (*) with the per-request service-cycle distribution P_S:

\[ P_{S_i} = P_{S_{i-1}} * P_S = \underbrace{P_{S_0} * P_S * \cdots * P_S}_{i\ \text{times}} \]

Lowest frequency that meets the latency target L for every queued request, where c_i is the target-tail cycle count for request i, t_i the time it has already spent in the system, and m_i its (frequency-independent) memory stall time:

\[ f \ge \max_{i=0,\dots,N} \frac{c_i}{L - (t_i + m_i)} \]
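These quantities are straightforward to compute from an empirical histogram of per-request service cycles. A sketch under that assumption (illustrative names, not the paper’s code; NumPy’s convolve implements the * operator above):

```python
import numpy as np

def remaining_dist(p_s: np.ndarray, omega: int) -> np.ndarray:
    """P[S0 = c] = P[S = c + omega] / P[S > omega]: cycles still needed by
    the in-service request after it has already received omega cycles."""
    p_s0 = p_s[omega:].copy()
    p_s0[0] = 0.0                  # S > omega: at least one cycle remains
    return p_s0 / p_s0.sum()

def queued_dists(p_s0: np.ndarray, p_s: np.ndarray, n: int) -> list:
    """P_{S_i} = P_{S_{i-1}} * P_S: total cycles to drain requests 0..n,
    built by repeated convolution with the service distribution."""
    dists = [p_s0]
    for _ in range(n):
        dists.append(np.convolve(dists[-1], p_s))
    return dists

def tail_cycles(dist: np.ndarray, q: float = 0.99) -> int:
    """Smallest c with P[cycles <= c] >= q (the target-tail count c_i)."""
    return int(np.searchsorted(np.cumsum(dist), q))

def min_frequency(dists, t, m, L, q=0.99):
    """f >= max_i c_i / (L - (t_i + m_i)); times in seconds, result in Hz."""
    return max(tail_cycles(d, q) / (L - (t_i + m_i))
               for d, t_i, m_i in zip(dists, t, m))

# Toy example: service cycles uniform over 10k-50k, three queued requests,
# in-service request 5k cycles in, 1 ms latency target, no waiting so far.
p_s = np.zeros(50_001)
p_s[10_000:] = 1.0 / 40_001
dists = queued_dists(remaining_dist(p_s, 5_000), p_s, n=2)
print(min_frequency(dists, t=[0, 0, 0], m=[0, 0, 0], L=1e-3) / 1e9, "GHz")
```

Running these convolutions on every queue change would be far too slow, which is what motivates the precomputed tables on the next slide.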
Efficient Implementation

- Pre-computed tables store most of the required quantities (see the lookup sketch below)

  Target Tail Tables — one row per ω bucket, each holding per-queue-position entries c_0 … c_15 and m_0 … m_15:

    ω = 0
    ω < 25th pct
    ω < 50th pct
    ω < 75th pct
    otherwise

  Updated periodically; read on each request arrival/departure.

- Table contents are independent of system load!
- Implemented as a software runtime
  - Hardware support: fast per-core DVFS, performance counters for CPI stacks
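With the tables in place, the per-event work reduces to one row lookup and a handful of divides. A sketch of that fast path (table layout inferred from this slide; all values and names are illustrative):

```python
import bisect

# Omega-bucket boundaries (cycles): 0 / 25th / 50th / 75th percentile of the
# service-cycle distribution. Values here are made up for illustration.
OMEGA_EDGES = [1, 10_000, 20_000, 40_000]  # bucket upper bounds

# One row per omega bucket; each row holds (c_i, m_i) for queue positions
# 0..15, precomputed offline and refreshed periodically in the background.
TABLE = [[(50_000 * (i + 1), 0.05e-3) for i in range(16)]
         for _ in range(len(OMEGA_EDGES) + 1)]

def pick_frequency(omega: int, t: list[float], L: float) -> float:
    """Read the row for this omega bucket and apply
    f >= max_i c_i / (L - (t_i + m_i)). t[i] is the time request i has
    already spent in the system (seconds); returns frequency in Hz."""
    row = TABLE[bisect.bisect_right(OMEGA_EDGES, omega)]
    return max(c / (L - (t_i + m)) for (c, m), t_i in zip(row, t))

# e.g., three queued requests, in-service request 15k cycles in, 1 ms target
print(pick_frequency(15_000, [0.2e-3, 0.1e-3, 0.0], L=1e-3) / 1e9, "GHz")
```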
Evaluation

- Microarchitectural simulations using zsim
  - Power model tuned to a real system
  - Six Westmere-like OOO cores sharing an L3
  - Fast per-core DVFS
  - CPI stack counters
  - Threads pinned to cores
- Compare Rubik against two oracular schemes:
  - StaticOracle: pick the lowest static frequency that meets latency targets for a given request trace
  - AdrenalineOracle: assume oracular knowledge of long and short requests, use offline training to pick frequencies for each
Evaluation

- Five diverse latency-critical applications
  - xapian (search engine)
  - masstree (in-memory key-value store)
  - moses (statistical machine translation)
  - shore-mt (OLTP)
  - specjbb (Java middleware)
- For each application, the latency target is set at the tail latency achieved at nominal frequency (2.4 GHz) and 50% utilization
Tail Latency

[Plots: load, tail latency (ms), and per-scheme frequency (GHz) vs. time (0–2 s) for StaticOracle, Rubik, and AdrenalineOracle.]
Core Power Savings

[Bar chart: core power savings (%) of StaticOracle, AdrenalineOracle, and Rubik on masstree, moses, shore, specjbb, xapian, and their mean, at 30%, 40%, and 50% utilization.]

- All three schemes save significant power at low utilization
  - Rubik performs best, reducing core power by up to 66%
- Rubik’s relative savings increase as short-term adaptation becomes more important
- Rubik saves significant power even at high utilization
  - 17% on average, and up to 34%
Real Machine Power Savings

- V/F transition latencies of >100 µs even with integrated voltage regulators
  - Likely due to inefficiencies in firmware
- Rubik successfully adapts to these higher V/F transition latencies

[Bar chart: core power savings (%) of StaticOracle and Rubik on masstree and moses at 30%, 40%, and 50% utilization.]
Static Power Limits Efficiency

[Diagram: a datacenter segregated into latency-critical and batch machine pools; the latency-critical pool runs at low utilization, leaving machines partly idle and their static power wasted.]
RubikColoc: Colocation Using Rubik

[Diagram: latency-critical and batch applications colocated on the same machine, with a statically partitioned LLC; Rubik sets the frequencies of the latency-critical cores.]
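RubikColoc is evaluated in simulation, but static LLC partitioning can be approximated on real Intel hardware with Cache Allocation Technology via the Linux resctrl filesystem. A minimal sketch (assumes resctrl is mounted at /sys/fs/resctrl and a single L3 cache domain; group names, PIDs, and way masks are illustrative):

```python
import os

RESCTRL = "/sys/fs/resctrl"  # assumes: mount -t resctrl resctrl /sys/fs/resctrl

def make_partition(name: str, l3_mask: str, pids: list[int]) -> None:
    """Create a resctrl group limited to the given L3 way mask and move
    the given tasks into it (one PID per write, per the resctrl ABI)."""
    group = os.path.join(RESCTRL, name)
    os.makedirs(group, exist_ok=True)
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(f"L3:0={l3_mask}\n")       # ways allowed on L3 domain 0
    for pid in pids:
        with open(os.path.join(group, "tasks"), "w") as f:
            f.write(f"{pid}\n")

# Illustrative split of a 12-way L3: 8 ways latency-critical, 4 ways batch.
make_partition("latency_critical", "ff0", [1234])
make_partition("batch", "00f", [5678])
```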
RubikColoc Savings

- RubikColoc saves significant power and resources over a segregated datacenter baseline
  - 17% reduction in datacenter power consumption and 19% fewer machines at high load
  - 31% reduction in datacenter power consumption and 41% fewer machines at low load
Conclusions

- Rubik uses fine-grained power management to reduce active core power consumption by up to 66%
- Rubik uses statistical modeling to account for various sources of uncertainty, and avoids application-specific heuristics
- RubikColoc uses Rubik to colocate latency-critical and batch applications, reducing datacenter power consumption by up to 31% while using up to 41% fewer machines
THANKS FOR YOUR ATTENTION!
QUESTIONS?