Computer Structure
Power and Power Management
Lihu Rappoport and Adi Yoaz
Computer Structure 2016 – Advanced Topics

Thermal Design Point
TDP power
– Maximum amount of power that the platform's thermal solution is required to dissipate
– Calculated as the 5-second rolling average of the power drawn by the most power-hungry real applications
– Does not include power viruses
   • Some virus-like applications are ignored as well
TDP power impacts
– Guaranteed frequency
– Form factor
– Cooling solution cost
– Acoustic noise
[Figure: a CPU and its fan. How much power can this fan dissipate?]

Average Power
Average power
– Average power = Total energy / Total time
– Includes low-activity and idle time (~90% idle time for client)
Average power determines
– Battery life, for mobile devices
– Electricity bill
– Air-conditioning bill (for servers)
– Air pollution
[Figure: battery life under continuous Web surfing over wireless]
1 web search
• Uses the same energy as an 11 W light bulb for 1 hour
• Emits 7 g of CO2
• 4B web searches daily

Platform Power
Processor average power is <10% of the platform

Component                     Share of platform power
Display (panel + inverter)    33%
CPU                           10%
Power Supply                  10%
MCH                            9%
HDD                            8%
GFX                            8%
Misc.                          8%
CLK                            5%
ICH                            3%
LAN                            2%
Fan                            2%
DVD                            2%

CPU Power Components
Dynamic power: Pdyn = C × V² × f
– C: total electrical capacitance charged/discharged per cycle
   • The sum of the capacitances of all transistors and wires that are charged/discharged (toggle 0→1 or 1→0) per cycle
   • Transistors that maintain their value do not consume dynamic power
   • Application dependent: e.g., an application with no floating-point calculations does not toggle the transistors in the floating-point execution unit
   • Typically, a bigger CPU has a bigger Cdyn
– V: voltage
– f: frequency; increasing f also requires increasing V ~linearly
   • C × V² × f ~ f³
   • X% more frequency costs ~3X% more power (for small X)
Leakage power
– Leakage of transistors under voltage, which is a function of
   • Z: total size of all transistors
   • V: voltage
   • t: temperature

Performance per Watt
In small form-factor devices, the thermal budget limits performance
– Old target: get max performance
– New target: get max performance at a given power envelope, i.e., performance per Watt
Increasing f also requires increasing V (~linearly)
– Dynamic power = α × C × V² × f = K × f³
   • X% more performance costs ~3X% more power
– A power-efficient feature is better than 1:3 performance : power
   • Otherwise, it is better to just increase frequency (and voltage)
Vmin is the minimal operating voltage
– Once at Vmin, reducing frequency no longer reduces voltage
– At this point, a feature is power efficient only if it is 1:1 performance : power
Active energy efficiency tradeoff
– Energy_active = Power_active × Time_active ~ Power_active / Perf_active
– Energy-efficient feature: 1:1 performance : power

Example – Power/Performance Exam Question
A multiprocessor system is to be designed, and two types of processors are to be compared, so that applications can run efficiently while achieving maximum performance (MP performance) within the power budget.
The system's power envelope is 60 Watt, where two thirds of the power budget is for the processors (the cores) and the rest is for the uncore.
Two candidate processors for building the system are to be compared, a big processor and a small processor:

          Big processor   Small processor
Area      4 mm²           2 mm²
Width     4 wide          3 wide
IPC       3               2

The static (leakage) power is 0.25 Watt per square millimeter (the static power is constant and does not change with voltage/frequency).
The dynamic capacitance of each processor is given as a function of the IPC of the application it runs: Cdyn = IPC × 750 pF

Example – Power/Performance Exam Question (cont.)
The following table gives the possible voltage and frequency operating points for the big and small processors:

Big processor frequency [GHz]   Small processor frequency [GHz]   Voltage [V]
0.75                            1                                 Vmin = 0.7
1                               1.25                              Vmin = 0.7
1.35                            1.45                              Vmin = 0.7
1.75                            1.75                              0.8
2.25                            2                                 0.85
2.5                             2.25                              0.9
3                               2.5                               0.95
3.5                             2.75                              Vmax = 1

For each system, find the optimal number of processors and compute the performance (MP performance, in instructions per second).
In light of the results, which system would you recommend producing: the one with the big processor or the one with the small processor?

Solution
Power = C × V² × f + leakage
Leakage = 0.25 W/mm² × A (mm²)
– Leakage of the big core = 0.25 W/mm² × 4 mm² = 1 W
– Leakage of the small core = 0.25 W/mm² × 2 mm² = 0.5 W
Cdyn = IPC × 750 pF
– Big core: Cdyn = 3 × 0.75 nF = 2.25 nF
– Small core: Cdyn = 2 × 0.75 nF = 1.5 nF
To maximize total MP performance, work at the maximum frequency available at Vmin:
– Big core: V = 0.7 V, f = 1.35 GHz
– Small core: V = 0.7 V, f = 1.45 GHz
Per-core power at the most efficient operating point:
– Big-core system: P = 2.25 × 0.7² × 1.35 + 1 = 2.49 W
– Small-core system: P = 1.5 × 0.7² × 1.45 + 0.5 = 1.57 W

Solution (cont.)
Power budget for all cores = 60 W × 2/3 = 40 W
Number of cores = power for all cores / power per core (rounded down)
– Big cores: 40 W / 2.49 W = 16
– Small cores: 40 W / 1.57 W = 25
MP performance = number of cores × IPC × f (inst/sec)
– Big cores: 16 × 3 × 1.35 GHz = 64.8 × 10⁹ inst/sec
– Small cores: 25 × 2 × 1.45 GHz = 72.5 × 10⁹ inst/sec
The small-core system provides higher MP performance within the given system power envelope
– It also takes less area: 25 × 2 = 50 mm² vs. 16 × 4 = 64 mm²
– This is the recommended system

Managing Power
Typical CPU usage varies over time
– Bursts of high utilization and long idle periods (~90% of the time in client)
Optimize power and energy consumption
– High power when high performance is needed
– Low power at low activity or idle
Enhanced Intel SpeedStep® Technology
– Multiple voltage/frequency operating points
– OS changes frequency to meet performance needs and minimize power
– Referred to as processor Performance states = P-states
OS notifies CPU when no tasks are ready for execution
– CPU enters a sleep state, called a C-state
– Using the MWAIT instruction, with the C-state level as an argument
– Tradeoff between power and latency
   • Deeper sleep → more power savings → longer to wake

P-states
Operation frequencies are called P-states = Performance states
– P0 is the highest frequency
– P1, P2, P3, … are lower frequencies
– Pn is the min Vcc point = energy-efficient point
DVFS = Dynamic Voltage and Frequency Scaling
– Power = C × V² × f; f = K × V → Power ~ f³
– Program execution time ~ 1/f
– E = P × t → E ~ f² → Pn is the most energy-efficient point
– Going up/down the cubic curve of power
   • High cost to achieve frequency
   • Large power savings for a small frequency reduction
[Figure: Power vs. Freq curve with points P0, P1, P2 and Pn]

C-States: C0
C0: CPU active state
[Figure: active core power consists of local clocks and logic, clock distribution, and leakage]

C-States: C1
C1: Halt state:
• Stop core pipeline
• Stop most core clocks
• No instructions are executed
• Caches respond to external snoops
[Figure: remaining core power is clock distribution and leakage]

C-States: C3
C3 state:
• Stop remaining core clocks
• Flush internal core caches
[Figure: remaining core power is leakage]

C-States: C6
C6 state:
• Processor saves architectural state
• Turn off power gate, eliminating leakage
[Figure: core power goes to ~0]

Putting it all together
CPU running at max power and frequency
– Periodically enters C1
Going into an idle period
– Gradually enters deeper C-states
– Controlled by OS
Tracking CPU utilization history
– OS identifies low activity
– Switches CPU to a lower P-state
CPU enters idle state again
Further lowering the P-state
– DVD play runs at the lowest P-state
[Figure: Power [W] (0 to 20) vs. Time, showing active periods at C0 P0, C0 P1 and C0 P2 separated by idle periods in C1, C2, C3 and C4]

Voltage and Frequency Domains
Two independent variable power planes
– CPU cores, ring and LLC
   • Embedded power gates: each core can be turned off individually
   • Cache power gating: turn off portions or all of the cache at deeper sleep states
– Graphics processor
   • Can be varied or turned off when not active
Shared frequency for all IA32 cores and ring
Independent frequency for PG
Fixed programmable power plane for System Agent
– Optimize SA power consumption
– System-on-chip functionality and PCU logic
– Periphery: DDR, PCIe, Display
[Figure: die power planes: VCC SA, VCC Core (four gated, one ungated), VCC Graphics, VCC Periphery]

Turbo Mode
P0 is the max possible frequency: the Turbo frequency
– P1-P0 has a significant frequency range (GHz)
   • Single-threaded or lightly loaded applications
   • GFX <> CPU balancing
– OS treats P0 as any other P-state
   • Requesting it when it needs more performance
– P1 to P0 range is fully H/W controlled
   • Frequency transitions handled completely in HW
   • PCU keeps silicon within existing operating limits
– Systems designed to the same specs, with or without Turbo Mode
P1 is the guaranteed frequency
– CPU and GFX simultaneous heavy load at worst-case conditions
– Actual power has a high dynamic range
Pn is the energy-efficient state
– Lower than Pn is controlled by Thermal-State
[Figure: frequency axis from LFM up to P0 (1C "Turbo"): H/W control between P0 and P1, OS-visible states from P1 to Pn under OS control, T-state and throttle below Pn]

Turbo Mode
Power gating
– Zero power for inactive cores
Turbo Mode
– Use the thermal budget of inactive cores to increase the frequency of active cores (lightly threaded workload)
– Increase frequency within thermal headroom (active cores running workloads below TDP)
– Both can apply together: lightly threaded workload and active cores below TDP
[Figure: per-core frequency (F) bars for Core 0 to Core 3, comparing "No Turbo" with each Turbo case]

Thermal Capacitance
Classic model
– Steady-state thermal resistance
– Design guide for steady state
New model
– Steady-state thermal resistance AND dynamic thermal capacitance
– More realistic response to power changes
Temperature rises as energy is delivered to the thermal solution
Thermal solution response is calculated in real time
[Figure: Temperature vs. Time for the classic model and for the new model, with the classic model response shown for comparison]
Foil taken from IDF 2011

Intel® Turbo Boost Technology 2.0
– After idle periods, the system accumulates "energy budget" and can accommodate high power/performance for a few seconds
– In steady-state conditions, the power stabilizes on TDP
– Use the accumulated energy budget to enhance user experience
[Figure: Power vs. Time: thermal budget builds up during sleep or low-power periods, power then rises to C0/P0 (Turbo) above "TDP" and settles back to "TDP"]
Foil taken from IDF 2011

Core and Graphic Power Budgeting
• Cores and Graphics integrated on the same die with separate voltage/frequency controls; tight HW control
• Full package power specifications available for sharing
• Power budget can shift between Cores and Graphics
[Figure: Core Power [W] vs. Graphics Power [W], from heavy CPU workload to heavy Graphics workload. Labels: specification core power, specification graphics power, total package power, sum of max power, realistic concurrent max power, applications, Sandy Bridge Next Gen Turbo for short periods]
Foil taken from IDF 2011
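
The exam solution earlier in the deck can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the names `core_power` and `best_system` are hypothetical, the core count is assumed to be rounded down to whole cores, and capacitance is expressed in nF so that nF × V² × GHz gives Watts. It evaluates every operating point from the exam table rather than assuming that the maximum frequency at Vmin is the best one.

```python
import math

# Operating points from the exam question: (voltage [V], frequency [GHz]).
BIG_POINTS = [(0.7, 0.75), (0.7, 1.0), (0.7, 1.35), (0.8, 1.75),
              (0.85, 2.25), (0.9, 2.5), (0.95, 3.0), (1.0, 3.5)]
SMALL_POINTS = [(0.7, 1.0), (0.7, 1.25), (0.7, 1.45), (0.8, 1.75),
                (0.85, 2.0), (0.9, 2.25), (0.95, 2.5), (1.0, 2.75)]

LEAKAGE_W_PER_MM2 = 0.25      # static power per mm^2, independent of V and f
CDYN_NF_PER_IPC = 0.75        # 750 pF per unit of IPC, expressed in nF
CORE_BUDGET_W = 60.0 * 2 / 3  # two thirds of the 60 W envelope


def core_power(ipc, area_mm2, volt, freq_ghz):
    """Per-core power in W: Cdyn * V^2 * f + leakage (nF * V^2 * GHz = W)."""
    cdyn_nf = ipc * CDYN_NF_PER_IPC
    return cdyn_nf * volt ** 2 * freq_ghz + LEAKAGE_W_PER_MM2 * area_mm2


def best_system(ipc, area_mm2, points):
    """Return (perf in 1e9 inst/sec, cores, volt, freq) for the best point."""
    best = None
    for volt, freq in points:
        # Assumption: only whole cores can be built, so round down.
        cores = math.floor(CORE_BUDGET_W / core_power(ipc, area_mm2, volt, freq))
        perf = cores * ipc * freq
        if best is None or perf > best[0]:
            best = (perf, cores, volt, freq)
    return best


big = best_system(ipc=3, area_mm2=4, points=BIG_POINTS)
small = best_system(ipc=2, area_mm2=2, points=SMALL_POINTS)

if __name__ == "__main__":
    for name, (perf, cores, volt, freq) in (("big", big), ("small", small)):
        print(f"{name}: {cores} cores at {volt} V / {freq} GHz "
              f"-> {perf:.1f}e9 inst/sec")
```

Under these assumptions the search picks 0.7 V / 1.35 GHz with 16 big cores and 0.7 V / 1.45 GHz with 25 small cores, which matches the operating points and core counts used in the solution.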
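
The DVFS relations from the P-states slide (Power ~ f³, execution time ~ 1/f, E ~ f²) can be illustrated with a small Python sketch. The function name `dvfs_scaling` is hypothetical, and the model is an idealization: it assumes voltage scales linearly with frequency (so it only applies above Vmin) and it ignores leakage.

```python
def dvfs_scaling(freq_ratio):
    """Relative power, execution time and energy when f scales by freq_ratio.

    Idealized model, assuming V ~ f (above Vmin) and no leakage:
    power ~ f^3, time ~ 1/f, energy = power * time ~ f^2.
    """
    power = freq_ratio ** 3
    time = 1.0 / freq_ratio
    energy = power * time
    return power, time, energy


if __name__ == "__main__":
    for ratio in (1.0, 0.9, 0.8, 0.5):
        power, time, energy = dvfs_scaling(ratio)
        print(f"f x{ratio}: power x{power:.3f}, "
              f"time x{time:.3f}, energy x{energy:.3f}")
```

For example, lowering the frequency by 10% in this model cuts power to about 73% and energy to about 81% of the original, while the program takes about 11% longer to run.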