Measure Twice and Cut Once: Robust Dynamic Voltage Scaling for FPGAs
Ibrahim Ahmed, Shuze Zhao, Olivier Trescases and Vaughn Betz
Email: [email protected]

FPGA Power Consumption Challenge
[Figure: VDD (V) vs. technology node (150, 130, 90, 65, 40, 28, 14 nm). Annotation: VDD not scaling]
• Obstacle against entering the emerging low-power/mobile market (IoT)
• Must show superior perf/W to compete in data centers
• Need innovation to bring power down
“The future of continued scaling is dependent on adaptive power management and voltage scaling”, IEEE Fellow Kevin Zhang, VP of Intel's Technology and Manufacturing Group

Worst-case Modelling is Wasteful
[Figure builds labelled: High Temperature, Lower VDD, End-of-life]
• Devices have different delays -> variation
• Delay is temperature dependent
• Delay is affected by VDD
• Aging also affects delay
• Static timing analysis (STA) accommodates the tail
• Timing models add margins for:
  - Slow device
  - Worst temperature
  - Worst voltage droop
  - End-of-life effects
  - Guard-bands for noise, etc.

How significant are the added margins?
[Figure: FIR filter Fmax on a 60-nm Cyclone IV (1.2 V nominal VDD). Measured Fmax (MHz) vs. supply voltage (800 to 1400 mV), compared against the CAD-reported Fmax]
• More than 20 % reduction in VDD without reducing Fmax
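As a rough feel for what a margin of that size is worth, the estimate below assumes the usual first-order CMOS model in which dynamic power scales with the square of the supply voltage, with the clock frequency held constant. The 20 % figure is the measured VDD reduction quoted for the FIR filter; the symbols α (activity), C (switched capacitance) and f (clock frequency) are standard notation and are not taken from the slides.

```latex
\frac{P_{dyn}(0.8\,V_{nom})}{P_{dyn}(V_{nom})}
  = \frac{\alpha C \,(0.8\,V_{nom})^{2} f}{\alpha C \,V_{nom}^{2}\, f}
  = 0.8^{2}
  = 0.64
```

Under this model a 20 % lower VDD at unchanged Fmax would cut dynamic power by about 36 %. Static power is not captured by this estimate.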
Dynamic Voltage Scaling
• Find the minimum VDD that guarantees operation at the required speed
• Lowering VDD reduces both dynamic and static power
  - $P_{dynamic} \propto V_{DD}^2$
  - Static power drops even faster
• DVS has been commercially adopted by CPUs, but not FPGAs
• FPGA programmability -> critical path is unknown at fabrication time
• This work: exploit programmability to perform design- and chip-specific calibration

Outline
• DVS proposal
• Testing Procedure
• FRoC
• Results
• Summary & Future work

Conventional Design Cycle
[Flow diagram: application HDL, passes timing, application bit-stream, FPGA]
• One measurement, by STA
• Program & run application with nominal VDD

DVS Proposal Overview
[Flow diagram: the CAD system takes the application HDL and produces a calibration bit-stream (replicated critical path, heaters) and an application bit-stream (critical path); the FPGA is fed by a VDD power stage; the calibration table (CT) holds Fmax vs. V at temperatures T = t1, T = t2. The last build marks part of the flow as "Today's talk"]
• 1st measurement by conventional STA (once per application)
• 2nd measurement by on-chip calibration (repeated for each FPGA)
• Program the calibration bit-stream & generate the calibration table (CT)
• Program & run the application with DVS

Generating the Calibration Bit-stream
• Performed on each FPGA at least once
• For aging effects, calibration with every power-up
• Capture all speed-limiting paths
• Invisible to FPGA users
• Fast Robust Automated Calibration: the FRoC CAD tool

How to measure Fmax
[Figure: tested path through a LUT; off-path inputs held at a steady 1/0; edge applied on the tested path]
• Stimulate with random inputs and check the output?
  - Does not guarantee exercising the critical path (CP)
• To robustly measure the delay of a path:
  - Off-path inputs must have a steady non-controlling value
  - Control over the edge transition from input to output

Measuring the Delay of a Single Path
[Figure builds: the application's critical path (CP) between FFs through LUTs, and its replicated copy with XOR LUT masks, Edge1/Edge2 inputs, input stimulus and an XNOR feeding an Error FF]
• Replicate the critical path (CP)
• Change the LUT masks (LUT -> XOR)
• Control the edge transition
• Apply input stimulus and detect timing faults (error detection)

A Single Path Delay is Not Robust
• Many paths have delay close to the CP
• Within-die variation may cause some other paths to be more critical
• Varying VDD affects the delay of FPGA elements differently
• Robust: measure the delay of many near-critical paths
• Fast: use 1 calibration bit-stream

Testing Disjoint Paths
[Figure: application paths and their calibration copies, each with ⨁ stages and its own Error FF]
• Testing many disjoint paths is mostly easy
• Repeat the same procedure as for single-path testing

..but What to Do with Overlapping Paths?
[Figure: FFs S1 and S2, LUT A and LUT B (with FixA and FixB controls), Path1 and Path2 meeting at LUT C, which drives an FF]
• Paths sharing a LUT through different inputs
• To test Path1, fix the off-path input at C
• Path1 & Path2 can't be tested together
• Need 2 separate test phases
  - Add Fix control signals to keep the LUT output constant
  - Test controller cycles through the test phases sequentially

LUT Masks for Testing
$F = \overline{Fix} \cdot (I_1 \oplus I_2 \oplus \dots \oplus I_{K-2} \oplus Edge) + Fix$
[Figure: K-LUT with inputs $I_1, I_2, \dots, I_{K-2}$, Fix and Edge. Fix: fix off-path inputs, break re-convergent fan-outs. Edge: control edge transition]
• Fix is only added when required
• Developed more LUT masks to test Cyclone IV carry-chains with the same controllability
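To make the test LUT mask concrete, the sketch below enumerates the truth table of a K-input LUT configured as an XOR of its data inputs and Edge, gated by Fix. Several things here are assumptions rather than facts from the slides: the polarity of Fix (taken as Fix = 1 forcing the output to a constant 1), the input ordering (bit 0 = Edge, bit 1 = Fix), and the function name `build_test_mask`, which is hypothetical.

```python
def build_test_mask(k: int, use_fix: bool = True) -> int:
    """Return the 2**k-bit truth table of the test function as an integer.

    Assumed (hypothetical) input order: bit 0 = Edge, bit 1 = Fix when
    use_fix is True, all remaining bits = on-path data inputs.
    Assumed polarity: Fix = 1 holds the LUT output at a constant 1.
    """
    reserved = 2 if use_fix else 1
    if k < reserved:
        raise ValueError("LUT too small for the reserved control inputs")

    mask = 0
    for addr in range(1 << k):
        edge = addr & 1
        fix = (addr >> 1) & 1 if use_fix else 0
        data = addr >> reserved
        # XOR of all data inputs and Edge: any single input toggle
        # propagates to the output, so the tested edge is never masked.
        parity = (bin(data).count("1") + edge) & 1
        out = 1 if fix else parity
        mask |= out << addr
    return mask


if __name__ == "__main__":
    # 4-LUT with both control signals reserved: two data inputs remain.
    print(hex(build_test_mask(4)))
```

With Fix left out (`use_fix=False`) the mask degenerates to a plain XOR, which matches the note that Fix is only added when required.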
Can't Test Everything with 1 Bit-stream
[Figure: a LUT whose inputs carry paths P1 and P2 plus the Edge and Fix control signals; Path1 and Path2 re-converging through LUT A, LUT B and LUT C]
• One or two LUT inputs are used as control signals
• Fixing the LUT output does not break all re-convergent fan-outs
• LAB input constraints
• Carry-chain constraints

CAD System with FRoC
[Flow diagram of the proposed CAD system: application HDL, Quartus P&R, Quartus STA, FRoC, calibration HDL with location & routing constraints, Quartus, calibration bit-stream and application bit-stream]
FRoC steps:
1) Path selection
2) Path replication
3) Grouping replicated paths
4) Test controller generation

1) Path selection
[Figure: application circuit of FFs and 4-LUTs with near-critical paths P1 to P5]
• Extract near-critical paths from STA
  - {P1, P2, P3, P4, P5}
• Select which paths to test
  - Can't test {P2, P3, P4} in 1 bit-stream: two inputs of the 4-LUT are reserved for control signals (Fix, Edge)
• Select the more critical paths
  - {P1, P2, P3, P5}

2) Path replication
[Figure: application circuit paths P1, P2, P3, P5 and their replicated copies, with control signals Fix1 to Fix3 and Edge1 to Edge3 on the replicated 4-LUTs]
• Replication + control signals
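Replicated paths that cannot be tested together have to land in different test phases, and fewer phases mean a shorter calibration run. The talk frames this assignment as a graph coloring problem; the sketch below shows one simple way to do it, with greedy coloring over a conflict graph. The conflict rule used here (two paths entering the same LUT through different inputs), the greedy heuristic, and all names are assumptions for illustration and not necessarily what FRoC implements.

```python
from collections import defaultdict


def build_conflict_graph(paths):
    """paths maps a path name to a list of (lut, input_pin) hops.

    Assumed rule: two paths conflict when they enter the same LUT
    through different input pins.
    """
    users = defaultdict(list)
    for name, hops in paths.items():
        for lut, pin in hops:
            users[lut].append((name, pin))

    graph = {name: set() for name in paths}
    for entries in users.values():
        for a, pin_a in entries:
            for b, pin_b in entries:
                if a != b and pin_a != pin_b:
                    graph[a].add(b)
    return graph


def assign_phases(graph):
    """Greedy coloring: each path gets the lowest phase unused by its conflicts."""
    phase = {}
    # Visit the most-constrained paths first; ties broken by name.
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        taken = {phase[m] for m in graph[node] if m in phase}
        p = 0
        while p in taken:
            p += 1
        phase[node] = p
    return phase


if __name__ == "__main__":
    # Hypothetical netlist: Path1 and Path2 meet at LUT C on different pins.
    paths = {
        "Path1": [("A", 0), ("C", 0)],
        "Path2": [("B", 0), ("C", 1)],
        "Path3": [("D", 0)],
    }
    print(assign_phases(build_conflict_graph(paths)))
```

Greedy coloring does not guarantee the minimum number of phases, but it is fast and never puts two conflicting paths in the same phase.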
3) Grouping replicated paths
[Figure: replicated paths P1, P2, P3, P5 with control signals Fix1 to Fix3 and Edge1 to Edge3 on the 4-LUTs]
• Minimising test phases -> minimises calibration time
• Graph coloring problem
• Tested > 5000 paths using only 17 phases

4) Test controller generation
[Figure: test controller driving input stimulus and control signals into the replicated paths; sink registers feed back to the controller, which raises Error]
• For each test phase:
  - Sets the appropriate control signals
  - Generates input stimulus
  - Detects timing faults

Benchmarks & Target Chip
• Dual-channel 51-tap low-pass FIR filter
• Full crossbar (Xbar) with 16 100-bit-wide ports

Application    LE utilization    Reported Fmax
FIR filter     67,505 (59 %)     121 MHz
Crossbar       26,579 (23 %)     115 MHz

• Targeting Cyclone IV EP4CE115F29C7 (TSMC 60-nm technology)
• Nominal VDD 1.2 V

How Many Edges Are We Covering?
• A timing edge is a connection between
  - the input & output of a cell (cell delay), or
  - the output of a cell & the input of another cell (connection delay)
• Timing edge criticality = (longest path using this edge) / (CP delay)
[Figure: timing edge coverage vs. criticality (%) for Xbar and FIR, 10000 candidate paths each]
• Covering more than 90 % of the more critical bins
• FRoC favours testing the more critical edges

First, a Sanity Check
• Need to validate the CT values
• Selected benchmarks are feed-forward applications with no buried states
[Figure: BIST setup (LFSR, application, MISR, BIST controller, "Ref = Tested" comparison) and measured Fmax (MHz) vs. supply voltage (800 to 1400 mV) for FIR and Xbar, with CAD-reported Fmax of 121.18 MHz (FIR) and 115.19 MHz (Xbar)]

How Many Paths to Measure?
[Figure: Fmax (MHz) vs. VDD (0.8 to 1.3 V) for Xbar and FIR; curves for 1 path, 2000 paths, 10000 paths and the benchmark's actual Fmax]
• 1 path is not robust
• Fan-out loading effects

Fan-out Correction & Guard-banding
• Correcting for fan-out through the difference in reported delay (by Quartus STA) between the calibration and the application bit-streams
  - 1 % for FIR & 5 % for Xbar
• Guard-banding for IR-drop and crosstalk effects
  - 5 % for both benchmarks (experimental values)
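One way the two corrections could be applied to a calibration table is sketched below: each measured Fmax is derated by the fan-out correction and by the guard-band, and the lowest VDD whose derated Fmax still meets the target clock is selected. Applying the two percentages multiplicatively, the function names, and every number in the sample table are assumptions for illustration; the table values are made up and are not measurements from the talk.

```python
def derate(fmax_mhz: float, fanout_pct: float, guard_pct: float) -> float:
    """Scale a measured Fmax down by the fan-out correction and the guard-band.

    Assumption: the two percentages are applied multiplicatively.
    """
    return fmax_mhz * (1 - fanout_pct / 100) * (1 - guard_pct / 100)


def min_safe_vdd(table, target_mhz, fanout_pct, guard_pct):
    """Return the lowest VDD whose derated Fmax meets the target, or None.

    table maps VDD (volts) to the Fmax (MHz) measured during calibration.
    """
    safe = [
        vdd
        for vdd, fmax in table.items()
        if derate(fmax, fanout_pct, guard_pct) >= target_mhz
    ]
    return min(safe) if safe else None


if __name__ == "__main__":
    # Hypothetical calibration points, NOT data from the slides.
    toy_table = {0.8: 60.0, 0.9: 90.0, 1.0: 120.0, 1.1: 150.0}
    print(min_safe_vdd(toy_table, target_mhz=80.0, fanout_pct=5, guard_pct=5))
```

A real calibration table would also be indexed by temperature; that dimension is left out here to keep the sketch short.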
Generated CT & Power Savings
[Figure: Fmax (MHz) vs. VDD (0.8 to 1.3 V) for Xbar and FIR; benchmark actual Fmax, guard-banded CT, and the nominal operating point]
• With DVS, run both applications safely at 1 V
• Save > 33 % of total power consumption

Summary
• Presented a DVS approach tailored for FPGAs (off-line calibration)
• Created the FRoC tool to automate the calibration procedure
• Achieved more than 33 % total power reduction

Future Work
• Reducing guard-bands to enable more power savings
  - Complete fan-out modelling for tested paths
  - Account for IR-drop during calibration
• # of required calibration bit-streams for full coverage
• Testing hard blocks to find the safest minimum VDD