RUBIK: FAST ANALYTICAL POWER MANAGEMENT FOR LATENCY-CRITICAL SYSTEMS
Harshad Kasture, Davide Bartolini, Nathan Beckmann, Daniel Sanchez
MICRO 2015

Motivation
- Low server utilization in today's datacenters results in resource and energy inefficiency
- The stringent latency requirements of user-facing services are a major contributing factor
- Power management for these services is challenging:
  - Strict requirements on tail latency
  - Inherent variability in request arrival and service times
- Rubik uses statistical modeling to adapt to short-term variations:
  - Respond to abrupt load changes
  - Improve power efficiency
  - Allow colocation of latency-critical and batch applications

Understanding Latency-Critical Applications
[Figure: a client request enters the datacenter at a root node, which fans out to many leaf nodes, each backed by several back-end servers; the root must wait for all leaf responses, each taking on the order of 1 ms]
- The few slowest responses determine user-perceived latency
- Tail latency (e.g., the 95th/99th percentile), not mean latency, determines performance

Prior Schemes Fall Short
- Traditional DVFS schemes (cpufreq, TurboBoost, ...) react to coarse-grained metrics like processor utilization and are oblivious to short-term performance requirements
- Power management schemes for embedded systems (PACE, GRACE, ...) do not consider queuing
- Schemes designed specifically for latency-critical systems (PEGASUS [Lo ISCA'14], Adrenaline [Hsu HPCA'15]):
  - Rely on application-specific heuristics
  - Are too conservative

Insight 1: Short-Term Load Variations
- Latency-critical applications have significant short-term load variations
[Figure: load for moses over a 4 s window, fluctuating rapidly between roughly 50 and 300 QPS]
- PEGASUS [Lo ISCA'14] uses feedback control to adapt the frequency setting to diurnal load variations
  - It deduces server load from observed request latency
  - It cannot adapt to short-term variations

Insight 2: Queuing Matters!
- Tail latency is often determined by queuing, not by the length of individual requests
[Figure: for moses, response time (up to ~15 ms), service time (up to ~4 ms), and queue length over a 4 s window; response-time spikes line up with queue buildup, not with long individual requests]
- Adrenaline [Hsu HPCA'15] uses application-level hints to distinguish long requests from short ones
  - Long requests are boosted (sped up)
  - Frequency settings must still be conservative to handle queuing

Rubik Overview
- Use queue length as a measure of instantaneous system load
- Update frequency whenever the queue length changes
  - Adapt to short-term load variations
[Figure: core activity, queue length, and the core frequency Rubik chooses over time; frequency rises as the queue grows and drops when the core goes idle]

Goal: Reshaping the Latency Distribution
[Figure: probability density of response latency; Rubik reshapes the distribution so that its tail sits just inside the latency target]

Key Factors in Setting Frequencies
- The distribution of cycle requirements of individual requests
  - Larger variance → more conservative frequency setting
- How long has a request spent in the queue?
  - Longer wait times → higher frequency
- How many requests are queued waiting for service?
  - Longer queues → higher frequency

There's Math!
- The remaining work of the in-service request, after it has already executed ω cycles, is distributed as

      P[S_0 = c] = P[S = c + ω | S > ω] = P[S = c + ω] / P[S > ω]

- The completion-cycle distribution of the i-th queued request is an i-fold convolution with the service-cycle distribution P_S:

      P_{S_i} = P_{S_{i-1}} * P_S = P_{S_0} * P_S * P_S * ... * P_S   (i times)

- The frequency must be high enough for every queued request to meet the latency target L:

      f ≥ max_{i=0...N} c_i / (L − (t_i + m_i))

  where c_i is the target-tail cycle count drawn from P_{S_i}, t_i is the time request i has already spent in the system, and m_i is its projected memory stall time

Efficient Implementation
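The frequency computation above can be made concrete with a minimal Python sketch. The service-cycle distribution, bin granularity, units, and the 95% target are made-up for illustration; Rubik itself reads these quantities from pre-computed tables rather than convolving online.

```python
import numpy as np

# Hypothetical service-cycle distribution P[S = c], c in Mcycles (bins 0..4).
P_S = np.array([0.0, 0.5, 0.3, 0.15, 0.05])

def conditional_dist(P, omega):
    """P[S_0 = c] = P[S = c + omega] / P[S > omega]: remaining cycles for the
    in-service request after it has already executed omega cycles."""
    tail = P[omega + 1:]
    # Prepend a zero bin so index j of the result = j remaining cycles.
    return np.concatenate(([0.0], tail / tail.sum()))

def tail_cycles(dist, target):
    """Smallest c with P[completion cycles <= c] >= target."""
    return int(np.searchsorted(np.cumsum(dist), target))

def min_frequency(queue, P, omega, L, target=0.95):
    """f >= max_i c_i / (L - (t_i + m_i)); queue holds (t_i, m_i): the time
    request i has spent in the system and its projected memory stall time."""
    d = conditional_dist(P, omega)   # distribution for the request in service
    f = 0.0
    for i, (t_i, m_i) in enumerate(queue):
        if i > 0:
            d = np.convolve(d, P)    # P_{S_i} = P_{S_{i-1}} * P_S
        c_i = tail_cycles(d, target)
        f = max(f, c_i / (L - (t_i + m_i)))
    return f

# Two queued requests; the head has already run omega = 1 Mcycle.
f = min_frequency([(2.0, 1.0), (0.5, 1.0)], P_S, omega=1, L=10.0)
print(f"minimum frequency: {f:.3f} Mcycles/ms")
```

Note how the later-queued request dominates: its completion cycles accumulate the head's remaining work plus its own, so the max over queue positions is what forces the frequency up as queues grow.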
- Pre-computed tables store most of the required quantities
  - Tables of target-tail cycle counts (c_0 ... c_15) and projected memory stall times (m_0 ... m_15), one row per bucket of ω (ω = 0; ω < 25th percentile; ω < 50th; ω < 75th; otherwise)
  - Tables are updated periodically and read on each request arrival/departure
  - Table contents are independent of system load!
- Implemented as a software runtime
  - Hardware support: fast per-core DVFS, performance counters for CPI stacks

Evaluation
- Microarchitectural simulations using zsim
  - Power model tuned to a real system
  - Six Westmere-like OOO cores sharing an L3
  - Fast per-core DVFS, CPI-stack counters, threads pinned to cores
- Compare Rubik against two oracular schemes:
  - StaticOracle: pick the lowest static frequency that meets latency targets for a given request trace
  - AdrenalineOracle: assume oracular knowledge of long and short requests; use offline training to pick frequencies for each
- Five diverse latency-critical applications:
  - xapian (search engine)
  - masstree (in-memory key-value store)
  - moses (statistical machine translation)
  - shore-mt (OLTP)
  - specjbb (Java middleware)
- For each application, the latency target is the tail latency achieved at the nominal frequency (2.4 GHz) at 50% utilization

Tail Latency
[Figure: load, tail latency, and chosen frequencies over a 2 s window for StaticOracle, AdrenalineOracle, and Rubik; all three meet the tail latency target, but Rubik tracks load with much finer-grained frequency changes]

Core Power Savings
[Figure: core power savings (%) at 30%, 40%, and 50% utilization for masstree, moses, shore, specjbb, xapian, and the mean, under StaticOracle, AdrenalineOracle, and Rubik]
- All three schemes save significant power at low utilization
- Rubik performs best, reducing core power by up to 66%
- Rubik's relative savings increase as short-term adaptation becomes more important
- Rubik saves significant power even at high utilization: 17% on average, and up to 34%

Real-Machine Power Savings
- V/f transition latencies exceed 100 µs even with integrated voltage controllers, likely due to inefficiencies in firmware
- Rubik successfully adapts to these higher V/f transition latencies
[Figure: core power savings (%) on a real machine for masstree and moses at 30-50% utilization, under StaticOracle and Rubik]

Static Power Limits Efficiency
[Figure: in a segregated datacenter, latency-critical machines sit mostly idle while batch machines run at high utilization; the idle latency-critical machines still burn static power]

RubikColoc: Colocation Using Rubik
- Colocate latency-critical and batch applications on the same machines
- The LLC is statically partitioned between them; Rubik sets the frequencies of the latency-critical cores

RubikColoc Savings
- RubikColoc saves significant power and resources over a segregated datacenter baseline
  - 17% reduction in datacenter power consumption and 19% fewer machines at high load
  - 31% reduction in datacenter power consumption and 41% fewer machines at low load

Conclusions
- Rubik uses fine-grained power management to reduce active core power consumption by up to 66%
- Rubik uses statistical modeling to account for various sources of uncertainty, and avoids application-specific heuristics
- RubikColoc uses Rubik to colocate latency-critical and batch applications, reducing datacenter power consumption by up to 31% while using up to 41% fewer machines

THANKS FOR YOUR ATTENTION! QUESTIONS?
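As a closing illustration, the table-driven fast path from the "Efficient Implementation" slide might look like the sketch below. The class layout, bucket boundaries, table sizes, and all numbers are assumptions for illustration; the real runtime uses 16-entry tables per ω bucket, rebuilds them periodically off the critical path, and relies on hardware CPI-stack counters for the memory-time estimates.

```python
import bisect

class RubikTables:
    """Sketch of Rubik-style pre-computed tables (structure inferred from the
    slides: rows are buckets of omega, columns are queue positions; contents
    are rebuilt periodically and only read on arrivals/departures)."""

    def __init__(self, omega_percentiles, c_table, m_table):
        self.cuts = omega_percentiles  # [25th, 50th, 75th] pct of omega
        self.c = c_table               # c[bucket][i]: target-tail cycles
        self.m = m_table               # m[bucket][i]: projected memory time

    def bucket(self, omega):
        # Buckets: omega == 0; < 25th pct; < 50th; < 75th; otherwise.
        if omega == 0:
            return 0
        return 1 + bisect.bisect_right(self.cuts, omega)

    def min_frequency(self, omega, queue_times, L):
        """Fast path on each request arrival/departure:
        f >= max_i c_i / (L - (t_i + m_i))."""
        b = self.bucket(omega)
        f = 0.0
        for i, t_i in enumerate(queue_times):
            slack = L - (t_i + self.m[b][i])
            if slack <= 0:
                return float("inf")  # past the target: run at max frequency
            f = max(f, self.c[b][i] / slack)
        return f

# Toy tables: 5 omega buckets x 2 queue positions (illustrative numbers).
tables = RubikTables([1.0, 2.0, 3.0],
                     c_table=[[2, 4]] * 5,
                     m_table=[[0.5, 0.5]] * 5)
f = tables.min_frequency(omega=0.5, queue_times=[1.0, 0.0], L=5.0)
print(f"frequency to set: {f:.3f}")
```

The point of the table design is that the expensive statistics (conditional distributions, convolutions, tail percentiles) are folded into the table contents offline, leaving only a handful of array reads, subtractions, and divides on the per-request critical path.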