FPGA-accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-off
Kaan Kara, Dan Alistarh, Gustavo Alonso, Onur Mutlu, Ce Zhang

What is the most efficient way of training dense linear models on an FPGA?
• FPGAs can handle floating-point arithmetic, but they are certainly not the best choice for it. Yet! We have tried: the result is on par with a 10-core Xeon, not a clear win.
• How about some recent developments in machine learning research? Low-precision data in theory leads to EXACTLY the same end result.

Stochastic Gradient Descent (SGD)
The data set A has M samples and N features, the model x is an N×1 vector, and the labels form an M×1 vector b. The data source can be DRAM, an SSD, or a sensor; the computation runs on a CPU, GPU, or FPGA; the model is kept in CPU cache, DRAM, or FPGA BRAM. SGD streams the samples and updates the model with the gradient g_i = (a_i · x − b_i) a_i:

while not converged do
    for i from 1 to M do
        g_i = (a_i · x − b_i) a_i
        x ← x − γ g_i
    end
end

Advantage of Low-Precision Data + FPGA
The same loop can run on a compressed data set: the samples are quantized before being streamed, so less data has to move from the data source to the computation while the update rule stays unchanged.

Key Takeaway
• Using an FPGA, we can increase the hardware efficiency while maintaining the statistical efficiency of SGD for dense linear models.
[Plots: training loss vs. iterations for naïve rounding, stochastic rounding, and full precision (stochastic rounding reaches the same quality), and training loss vs. execution time (the low-precision design converges faster).]
• In practice, the system opens up a multivariate trade-off space: data properties, FPGA implementation, SGD parameters...

Outline for the Rest
1. Stochastic quantization. Data layout.
2. Target platform: Intel Xeon+FPGA.
3. Implementation: from 32-bit to 1-bit SGD on FPGA.
4. Evaluation: main results, side effects.
5. Conclusion.

Stochastic Quantization [1]
Consider a feature value 0.7 that has to be quantized to the levels {0, 1}.
• Naive solution, nearest rounding (0.7 → 1): SGD converges to a different solution.
• Stochastic rounding (0 with probability 0.3, 1 with probability 0.7): the expectation remains the same, so SGD converges to the same solution.
For the least-squares objective min_x Σ_i (a_i · x − b_i)^2, the gradient is g_i = (a_i · x − b_i) a_i. With quantized data we need two independent quantization samples,

    g_i = (Q_1(a_i) · x − b_i) Q_2(a_i),

and each iteration of the algorithm needs fresh samples.
[1] H. Zhang, K. Kara, J. Li, D. Alistarh, J. Liu, C. Zhang: ZipML: An End-to-end Bitwise Framework for Dense Generalized Linear Models, arXiv:1611.05402

Before the FPGA: The Data Layout
In the full-precision layout, each of the m samples stores one 32-bit float per feature, which is currently slow to stream. Quantizing to 4 bits gives 4x compression: since we need 2 independent quantizations, each feature stores a pair of 4-bit values (4-bit | 4-bit) under one quantization index, and each iteration over the data uses a new index.
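To make this concrete, here is a minimal Python sketch (my illustration, not part of the slides and not the fixed-point FPGA datapath); stochastic_quantize and sgd_step are hypothetical names, and the value range [lo, hi] with evenly spaced levels is a simplifying assumption.

import numpy as np

def stochastic_quantize(a, bits, lo=-1.0, hi=1.0, rng=None):
    # Round each entry of `a` onto 2**bits evenly spaced levels in [lo, hi],
    # rounding up with probability equal to the fractional position, so that
    # E[Q(a)] equals a for values inside the range (unbiased).
    rng = np.random.default_rng() if rng is None else rng
    num_intervals = 2 ** bits - 1
    scaled = (np.clip(a, lo, hi) - lo) / (hi - lo) * num_intervals
    low = np.floor(scaled)
    q = low + (rng.random(np.shape(a)) < (scaled - low))
    return q / num_intervals * (hi - lo) + lo

def sgd_step(a_i, b_i, x, gamma, bits, rng):
    # One SGD update on a quantized sample: two independent quantizations of
    # a_i keep the gradient estimate unbiased, as on the slide above.
    q1 = stochastic_quantize(a_i, bits, rng=rng)   # Q1(a_i)
    q2 = stochastic_quantize(a_i, bits, rng=rng)   # Q2(a_i)
    return x - gamma * (q1 @ x - b_i) * q2         # x <- x - gamma * g_i

Running this step over the data with a fresh rng stream plays the role of drawing new quantization samples in every iteration, as the slide requires.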
Target Platform
Intel Xeon+FPGA: an Intel Xeon E5-2680 v2 CPU (Ivy Bridge, 10 cores, 25 MB L3) is paired with an Altera Stratix V FPGA. The CPU reaches 96 GB of main memory through its memory controller at ~30 GB/s; the FPGA accesses the CPU's caches through a QPI endpoint at 6.5 GB/s.

Full Precision SGD on FPGA
Each 64 B cache line delivers 16 32-bit floating-point values. The pipeline: (1) 16 float multipliers compute the element-wise products of the incoming values with the model, (2) 16 float-to-fixed converters feed (3) a tree of 16 fixed adders that accumulates the dot product a·x, while (4) FIFOs buffer a and b; (5) one fixed-to-float converter and (6) one float adder form (a·x − b), and (7) one float multiplier applies the step size to obtain γ(a·x − b). For the model update, (8) 16 float multipliers and (9) 16 float-to-fixed converters compute γ(a·x − b)a, and 16 fixed adders subtract it from the model x, which is held in FPGA BRAM until the batch size is reached and the model is written back to the storage device. Consuming 16 32-bit values per cache line gives a processing rate of 12.8 GB/s.

Scale out the design for low-precision data! How can we increase the internal parallelism while maintaining the processing rate?

Challenge + Solution
A 64 B cache line holds 16 float values, but 32 values at 8 bits (Q8), 64 at 4 bits (Q4), 128 at 2 bits (Q2), and 256 at 1 bit (Q1); scaling out the floating-point pipeline is therefore not trivial. Two observations help: 1) we can get rid of floating-point arithmetic, and 2) we can further simplify the integer arithmetic for the lowest-precision data.

Trick 1: Selection of quantization levels
1. Select the lower and upper bound [L, U]. 2. Select the size of the interval Δ. With this choice, all quantized values are integers.

This is enough for Q4 and Q8
With integer quantized values, the low-precision circuit needs no floating point at all: K = 32, 64, or 128 fixed multipliers and a K-input tree of fixed adders compute the dot product Q(a)·x, one fixed adder forms (Q(a)·x − b), a single bit-shift applies the step size γ (a power of two), and K fixed multipliers plus K fixed adders perform the model update x ← x − γ(Q(a)·x − b) Q(a), again until the batch size is reached.

Trick 2: Implement Multiplication using Multiplexer
For 2-bit quantized values normalized to [0, 2], the Q2 "multiplier" is just a multiplexer on the 2-bit code: 00 selects 0, 01 selects In[31:0], and 10 selects In[31:0] << 1. For values normalized to [−1, 1], the codes 00, 01, and 11 select 0, In[31:0], and −In[31:0]. The Q1 multiplier is even simpler: a single bit selects either 0 or In[31:0].

Support for all Qx
Two circuits cover all precisions. The circuit for Q8, Q4, and Q2 SGD consumes a full 64 B cache line of K = 32, 64, or 128 quantized values per cycle and sustains 12.8 GB/s. For Q1 a cache line carries 256 values, so one 64 B cache line is split into 2 parts of K = 128 quantized values each, and the Q1 circuit (128 fixed multipliers and 128 fixed adders per stage) sustains 6.4 GB/s.
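Trick 2 can be mimicked in software to see what the hardware saves (an illustrative Python sketch of my own, assuming the {0, 1, 2} and {−1, 0, 1} encodings shown above; in the circuit these are 32-bit multiplexers, not function calls).

def q2_multiply_unsigned(x_fixed, code):
    # Q2 "multiplier" for values normalized to [0, 2]: the 2-bit code selects
    # 0, x, or x << 1, so no real multiplier is needed.
    if code == 0b00:
        return 0
    if code == 0b01:
        return x_fixed
    if code == 0b10:
        return x_fixed << 1
    raise ValueError("unused Q2 encoding")

def q2_multiply_signed(x_fixed, code):
    # Variant for values normalized to [-1, 1]: codes 00, 01, 11 select 0, x, -x.
    return {0b00: 0, 0b01: x_fixed, 0b11: -x_fixed}[code]

def q1_multiply(x_fixed, bit):
    # Q1 "multiplier": a single bit selects either 0 or x.
    return x_fixed if bit else 0

In the Q2 and Q1 circuits, multiplexers of this kind can stand in for the fixed multipliers of both the dot-product and the model-update stages.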
Support for all Qx
[Figure: the Q8/Q4/Q2 and Q1 circuits annotated with FPGA resource utilization (Logic, DSP, and BRAM percentages) per data type: float, Q8, Q4, Q2/Q1.]

Data sets for evaluation
Name      Size     # Features   # Classes
MNIST     60,000   780          10
GISETTE   6,000    5,000        2
EPSILON   10,000   2,000        2
SYN100    10,000   100          Regression
SYN1000   10,000   1,000        Regression
Following the Intel legal guidelines on publishing performance numbers, we would like to make the reader aware that results in this publication were generated using preproduction hardware and software, and may not reflect the performance of production or future systems.

SGD Performance Improvement
[Plots: training loss vs. time for float CPU 1-thread, float CPU 10-threads, Qx FPGA, and float FPGA.] On Synthetic100, 4-bit precision works: Q4 FPGA-SGD reaches the same loss 7.2x faster. On Synthetic1000, 8-bit precision works: Q8 FPGA-SGD is 2x faster.

1-bit also works on some data sets
[Plot: training loss vs. time for SGD on MNIST, digit 7; 1-bit works, and Q1 FPGA-SGD is 10.6x faster.]
MNIST multi-classification accuracy:
Precision        Accuracy   Training Time
float CPU-SGD    85.82%     19.04 s
1-bit FPGA-SGD   85.87%     2.41 s

Naïve Rounding vs. Stochastic Rounding
[Plot: training loss vs. number of epochs for float, stochastically rounded Q2/Q4/Q8, and naïvely rounded F2/F4/F8.] Stochastic quantization results in unbiased convergence, whereas naïve rounding converges with a visible bias (a small worked example follows at the end of this evaluation section).

Effect of Step Size
[Plot: training loss vs. number of epochs for float with step sizes 1/2^9, 1/2^12, 1/2^15 and Q2 with step sizes 2/2^9, 2/2^12, 2/2^15.] The comparison shown is large step size + full precision vs. small step size + low precision.
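A one-dimensional worked example (my illustration, not from the slides) shows where the bias in naïve rounding comes from, and why two independent stochastic samples are needed. Take a single sample with feature a = 0.7, label b = 0.7, and quantization levels {0, 1}; the full-precision least-squares solution is x = 1.
• Nearest rounding replaces a by 1 for good, so SGD solves (1·x − 0.7)·1 = 0 and converges to x = 0.7, a different solution.
• Reusing one stochastic sample for both factors is also biased: E[(Q(a)·x − b) Q(a)] = E[Q(a)^2]·x − E[Q(a)]·b = 0.7x − 0.49, which vanishes at x = 0.7.
• Two independent samples remove the bias: E[(Q_1(a)·x − b) Q_2(a)] = E[Q_1(a)] E[Q_2(a)]·x − E[Q_2(a)]·b = 0.49x − 0.49, which vanishes at x = 1, the full-precision solution.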
Conclusion
• Highly scalable, parametrizable FPGA-based stochastic gradient descent implementations for dense linear model training.
• Open source: www.systems.ethz.ch/fpga/ZipML_SGD
Key Takeaways:
1. The way to train linear models on an FPGA is to use stochastically rounded, low-precision data.
2. Multivariate trade-off space: precision vs. end-to-end runtime, convergence quality, design complexity, data and system properties.

Backup
Why do dense linear models matter?
• They are the workhorse algorithm for regression and classification.
• Sparse, high-dimensional data sets can be converted into dense ones, e.g., via transfer learning or random features.
Why is training speed important?
• Training is often a human-in-the-loop process (configuring parameters, selecting features, data dependencies, etc.) that cycles between training and evaluation.

Effect of Reusing Indexes
[Plot: training loss vs. number of epochs for float and for Q2 with 64, 32, 16, 8, 4, and 2 quantization indexes.] In practice, 8 indexes are enough; a sketch of the reuse scheme follows.
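As a rough software sketch of that index-reuse scheme (my illustration, not the paper's code or the FPGA implementation): assume Q holds num_indexes independently quantized copies of the data set, where each copy stores the pair of independent quantizations per sample described on the data-layout slide, and one SGD epoch picks a new index from the pool.

import numpy as np

def sgd_epoch_reusing_indexes(Q, b, x, gamma, epoch):
    # Q has shape (num_indexes, 2, M, N): for every stored index, each of the
    # M samples keeps a pair of independent quantizations (the 4-bit | 4-bit
    # layout). A new index is used per epoch, and the small pool is reused
    # cyclically instead of re-quantizing the data every time.
    num_indexes, _, M, _ = Q.shape
    k = epoch % num_indexes
    for i in range(M):
        q1, q2 = Q[k, 0, i], Q[k, 1, i]
        x = x - gamma * (q1 @ x - b[i]) * q2   # g_i = (Q1(a_i)·x - b_i) Q2(a_i)
    return x

The backup plot suggests that a pool of about 8 such indexes already behaves like drawing fresh quantizations in every epoch.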