FPGA-accelerated Dense Linear Machine Learning:
A Precision-Convergence Trade-off
Kaan Kara, Dan Alistarh, Gustavo Alonso, Onur Mutlu, Ce Zhang
What is the most efficient way of training dense linear models on an FPGA?
• FPGAs can handle floating-point, but they are certainly not the best choice. Yet!
  We have tried: on par with a 10-core Xeon, not a clear win.
• How about some recent developments in machine learning research?
  Low-precision data, in theory, leads to EXACTLY the same end result.
Stochastic Gradient Descent (SGD)

Data set: A (M×N, M samples, N features), model x (N×1), labels b (M×1), related by A·x = b.

while not converged do
  for i from 1 to M do
    g_i = (a_i · x - b_i) a_i
    x ← x - γ g_i
  end
end

The rows a_i stream from a data source (DRAM, SSD, sensor); the gradient (a_i · x - b_i) a_i is evaluated on the computation device (CPU, GPU, FPGA); the model x resides in memory (CPU cache, DRAM, FPGA BRAM).
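A minimal NumPy sketch of this loop, as a reference point only (variable names and random data are illustrative; this is not the FPGA implementation):

```python
import numpy as np

def sgd(A, b, gamma=0.05, epochs=20):
    """Plain SGD for the dense linear model A·x ≈ b."""
    M, N = A.shape
    x = np.zeros(N, dtype=np.float32)
    for _ in range(epochs):                 # "while not converged"
        for i in range(M):                  # for i from 1 to M
            a_i = A[i]
            g_i = (a_i @ x - b[i]) * a_i    # g_i = (a_i · x - b_i) a_i
            x -= gamma * g_i                # x <- x - gamma * g_i
    return x

# Illustrative usage on random data
rng = np.random.default_rng(0)
A = rng.random((1000, 20), dtype=np.float32)
b = A @ rng.random(20, dtype=np.float32)
x_hat = sgd(A, b)
```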
Advantage of Low-Precision Data + FPGA

Same setup and SGD loop as before, but the data set A is compressed (quantized) before it streams from the data source (DRAM, SSD, sensor) to the computation device (CPU, GPU, FPGA). The gradient per row is still (a_i · x - b_i) a_i and the model x still resides in memory (CPU cache, DRAM, FPGA BRAM); only the representation of the streamed data changes.
Key Takeaway
• Using an FPGA, we can increase the hardware efficiency while maintaining the statistical efficiency of SGD for dense linear models.
[Figure: training loss vs. iterations and vs. execution time. With stochastic rounding, low-precision SGD reaches the same quality as full precision (naïve rounding does not), and it gets there faster in execution time.]
β€’ In practice, the system opens up a multivariate trade-off space:
Data properties, FPGA implementation, SGD parameters…
Outline for the Rest…
1. Stochastic quantization. Data layout.
2. Target platform: Intel Xeon+FPGA.
3. Implementation: From 32-bit to 1-bit SGD on FPGA.
4. Evaluation: Main results. Side effects.
5. Conclusion.
Stochastic Quantization [1]

Objective: min_x Σ_r (a_r · x - b_r)²
Example: a feature value of 0.7, to be quantized to the levels {0, 1}.
• Naïve solution: nearest rounding (0.7 → 1) ⇒ converge to a different solution.
• Stochastic rounding: 0 with probability 0.3, 1 with probability 0.7 ⇒ the expectation remains the same; converge to the same solution.

Gradient: g_i = (a_i · x - b_i) a_i
We need 2 independent samples: g_i = (Q1(a_i) · x - b_i) Q2(a_i)
…and each iteration of the algorithm needs fresh samples.

[1] H. Zhang, K. Kara, J. Li, D. Alistarh, J. Liu, C. Zhang: ZipML: An End-to-end Bitwise Framework for Dense Generalized Linear Models, arXiv:1611.05402
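A minimal Python sketch of this rule, assuming data already scaled to [0, 1] and only two quantization levels {0, 1} (an illustration of the idea, not the system's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round_01(a):
    """Round values in [0, 1] to {0, 1} without changing the expectation:
    e.g. 0.7 -> 1 with probability 0.7, 0 with probability 0.3."""
    return (rng.random(a.shape) < a).astype(np.float32)

def quantized_gradient(a_i, x, b_i):
    """g_i = (Q1(a_i) · x - b_i) Q2(a_i) with two independent quantizations;
    reusing one sample in both places would bias the gradient."""
    q1 = stochastic_round_01(a_i)   # fresh sample for the dot product
    q2 = stochastic_round_01(a_i)   # independent sample for the outer factor
    return (q1 @ x - b_i) * q2
```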
Before the FPGA: The Data Layout

Original layout: samples 1…m are stored row by row, each row holding features 1…n as 32-bit float values. (Currently slow!)

Quantized layout: quantize to 4-bit. Since we need 2 independent quantizations, each feature is stored as a "4-bit | 4-bit" pair, which is still 4x compression over 32-bit floats. One stored pair corresponds to 1 quantization index, and each iteration should use a new index.
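A sketch of how such a pair-per-byte layout could be produced in Python (an illustration of the packing idea under assumed scaling to [0, 1]; not the system's exact on-disk format):

```python
import numpy as np

rng = np.random.default_rng(1)
LEVELS = 15  # 4-bit quantization: integer levels 0..15 over [0, 1]

def stochastic_quantize_4bit(a):
    """Stochastically round values in [0, 1] to an integer level in 0..15."""
    pos = a * LEVELS
    low = np.floor(pos)
    return (low + (rng.random(a.shape) < (pos - low))).astype(np.uint8)

def pack_pairs(a):
    """Store two independent 4-bit quantizations per value ("4-bit | 4-bit"):
    8 bits instead of a 32-bit float, i.e. 4x compression."""
    q1 = stochastic_quantize_4bit(a)
    q2 = stochastic_quantize_4bit(a)
    return (q1 << 4) | q2

def unpack_pairs(packed):
    return packed >> 4, packed & 0x0F   # the two independent samples

row = rng.random(16, dtype=np.float32)  # one sample with 16 features
packed = pack_pairs(row)                # 16 bytes instead of 64
```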
Target Platform: Intel Xeon+FPGA

[Diagram] Intel Xeon CPU (Ivy Bridge, Xeon E5-2680 v2, 10 cores, 25 MB L3) connected to 96 GB main memory through its memory controller (~30 GB/s). The FPGA accelerator (Altera Stratix V) sits behind a QPI endpoint and reaches the CPU's caches and memory over QPI at ~6.5 GB/s.
Full-Precision SGD on FPGA

[Dataflow diagram] Data arrives from the data source (DRAM, SSD, sensor) as 64 B cache lines, i.e. 16 32-bit floating-point values per cycle: a processing rate of 12.8 GB/s.

1-2. 16 float multipliers multiply the incoming values with the model x, followed by 16 float2fixed converters.
3-5. A tree of 16 fixed adders accumulates the dot product; 1 fixed2float converter produces ax.
6-7. 1 float adder subtracts the label b (buffered in the b FIFO, while the sample waits in the a FIFO) and 1 float multiplier applies the step size γ, yielding γ(ax-b): the gradient calculation.
8-9. 16 float multipliers form γ(ax-b)a, and 16 float2fixed converters plus 16 fixed adders compute the model update x - γ(ax-b)a. The model x is stored in FPGA BRAM; when the batch size is reached, x loading takes place.
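A rough functional Python model of this datapath (arithmetic only; it ignores the pipelining, the float/fixed conversions, and the batch/x-loading mechanism):

```python
import numpy as np

CACHE_LINE_FLOATS = 16   # 64 B cache line = 16 x 32-bit floats

def fpga_style_sgd_step(a_i, b_i, x, gamma):
    """One SGD step organized like the circuit: the sample streams in as
    16-float cache lines, a scalar gamma*(a·x - b) is formed, and the same
    cache lines (held in the a FIFO in hardware) drive the model update."""
    dot = np.float32(0.0)
    # Stages 1-5: 16 multipliers + adder tree accumulate the dot product.
    for off in range(0, len(a_i), CACHE_LINE_FLOATS):
        chunk = slice(off, off + CACHE_LINE_FLOATS)
        dot += a_i[chunk] @ x[chunk]
    # Stages 6-7: one adder and one multiplier form gamma*(a·x - b).
    scale = gamma * (dot - b_i)
    # Stages 8-9: 16 multipliers + 16 adders per cache line update the model.
    for off in range(0, len(a_i), CACHE_LINE_FLOATS):
        chunk = slice(off, off + CACHE_LINE_FLOATS)
        x[chunk] -= scale * a_i[chunk]
    return x
```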
Scale out the design for low-precision data!
How can we increase the internal parallelism while maintaining the processing rate?
Challenge + Solution

With quantized data, one 64 B cache line carries many more values than the 16 floats the full-precision circuit was built for: 32 values at 8-bit (Q8), 64 at 4-bit (Q4), 128 at 2-bit (Q2), 256 at 1-bit (Q1). Replicating the float multipliers, float/fixed converters, and adder trees for that many values per cycle ⇒ scaling out is not trivial!

1) We can get rid of floating-point arithmetic.
2) We can further simplify integer arithmetic for the lowest-precision data.
Trick 1: Selection of quantization levels

[Diagram: quantization levels spaced Δ apart between a lower bound L and an upper bound U]
1. Select the lower and upper bound [L, U].
2. Select the size of the interval Δ.
⇒ All quantized values are integers!
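A Python sketch of such a quantizer, with L, U, and Δ as its parameters (a hypothetical helper for illustration; the level choice and scaling in the real system may differ):

```python
import numpy as np

rng = np.random.default_rng(2)

def stochastic_quantize_levels(a, L, U, delta):
    """Stochastically round values in [L, U] onto the grid L, L+delta, ..., U.
    The result is a small integer per value (its level index), so the
    downstream arithmetic can be purely fixed-point."""
    pos = (np.clip(a, L, U) - L) / delta    # position on the level grid
    low = np.floor(pos)
    up = rng.random(a.shape) < (pos - low)  # go up with prob = fractional part
    return (low + up).astype(np.int32)

# Example: 4-bit levels over [-1, 1], i.e. 16 levels with delta = 2/15
a = rng.uniform(-1.0, 1.0, size=8).astype(np.float32)
q = stochastic_quantize_levels(a, L=-1.0, U=1.0, delta=2.0 / 15)
dequantized = -1.0 + q * (2.0 / 15)         # unbiased estimate of a
```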
This is enough for Q4 and Q8

[Dataflow diagram] One 64 B cache line now delivers K = 32 (Q8), 64 (Q4), or 128 (Q2) quantized values.

1. K fixed multipliers multiply Q(a) with x, and a tree of K fixed adders accumulates the dot product Q(a)x.
2. 1 fixed adder subtracts b, and the step size γ is applied with 1 bit-shift, yielding γ(Q(a)x-b): the gradient calculation.
3. K fixed multipliers form γ(Q(a)x-b)Q(a), and K fixed adders compute the model update x - γ(Q(a)x-b)Q(a); when the batch size is reached, x loading takes place.

No floating-point units remain: the datapath is purely fixed-point, and restricting γ to a power of two turns the step-size multiplication into a bit-shift.
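An integer-only sketch of one such step in Python, assuming the model and labels are kept in fixed point with FRAC fractional bits and γ = 2^-GAMMA_SHIFT applied as a right shift (these scaling conventions are an assumption, not the hardware's exact format):

```python
import numpy as np

FRAC = 20          # model and labels in fixed point: value = integer / 2^FRAC
GAMMA_SHIFT = 12   # step size gamma = 2^-GAMMA_SHIFT, applied as a bit-shift

def fixed_point_sgd_step(q1_a, q2_a, b_fp, x_fp):
    """Integer-only SGD step: q1_a and q2_a are the two independent integer
    quantizations of one sample; b_fp (int) and x_fp (np.int64 array) are the
    label and model in the fixed-point format above."""
    dot_fp = int(q1_a.astype(np.int64) @ x_fp)      # K multipliers + adder tree
    residual_fp = dot_fp - b_fp                     # 1 fixed adder
    scaled_fp = residual_fp >> GAMMA_SHIFT          # gamma applied as a shift
    x_fp -= scaled_fp * q2_a.astype(np.int64)       # K multipliers + K adders
    return x_fp
```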
Trick 2: Implement Multiplication using Multiplexer

Q2 multiplier: the 2-bit code Q2value[1:0] selects the output Out[31:0] with a multiplexer, so no real multiplier is needed. With the Q2 value normalized to [0, 2] (levels 0, 1, 2): 00 → 0, 01 → In[31:0], 10 → In[31:0] << 1. Normalized to [-1, 1] (levels -1, 0, 1): 00 → 0, 01 → In[31:0], 11 → -In[31:0].

Q1 multiplier: the single bit Q1value[0] selects the result: 0 → 0, 1 → In[31:0].
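A bit-level Python emulation of these multiplexer-based multipliers (a functional illustration of the selects, not HDL):

```python
def q2_multiply(q2_code: int, x: int) -> int:
    """Q2 'multiplier' for values normalized to [0, 2] (levels 0, 1, 2):
    a 3-way mux replaces the multiplier."""
    return {0b00: 0, 0b01: x, 0b10: x << 1}[q2_code]

def q2_multiply_signed(q2_code: int, x: int) -> int:
    """Variant for values normalized to [-1, 1] (levels -1, 0, 1)."""
    return {0b00: 0, 0b01: x, 0b11: -x}[q2_code]

def q1_multiply(q1_bit: int, x: int) -> int:
    """Q1 'multiplier': a 2-way mux, 0 -> 0, 1 -> x."""
    return x if q1_bit else 0
```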
Support for all Qx
Circuit for Q8, Q4 and Q2 SGD: one 64 B cache line delivers K = 32, 64, or 128 quantized values; K fixed multipliers and K fixed adders compute the dot product Q(a)x, 1 fixed adder and 1 bit-shift produce γ(Q(a)x-b), and K fixed multipliers plus K fixed adders apply the model update x - γ(Q(a)x-b)Q(a). Processing rate: 12.8 GB/s.

Circuit for Q1 SGD: one 64 B cache line carries K = 256 quantized values, so it is split into 2 parts of 128 values each; 128 fixed (multiplexer-based) multipliers and 128 fixed adders handle each part, with the same gradient-calculation and model-update stages. Processing rate: 6.4 GB/s.
Support for all Qx
[Figure: the Q8/Q4/Q2 circuit (processing rate 12.8 GB/s) and the Q1 circuit (processing rate 6.4 GB/s) from the previous slide, annotated with the Logic, DSP, and BRAM utilization of the Stratix V for each data type (float, Q8, Q4, Q2/Q1).]
Data sets for evaluation

Name     | Size   | # Features | # Classes
MNIST    | 60,000 | 780        | 10
GISETTE  | 6,000  | 5,000      | 2
EPSILON  | 10,000 | 2,000      | 2
SYN100   | 10,000 | 100        | Regression
SYN1000  | 10,000 | 1,000      | Regression

Following the Intel legal guidelines on publishing performance numbers, we would like to make the reader aware that results in this publication were generated using preproduction hardware and software, and may not reflect the performance of production or future systems.
SGD Performance Improvement

[Figure: training loss vs. time (s), comparing float CPU 1-thread, float CPU 10-threads, float FPGA, and quantized FPGA.
Left: SGD on Synthetic100; 4-bit works, and Q4 FPGA converges 7.2x faster.
Right: SGD on Synthetic1000; 8-bit works, and Q8 FPGA converges 2x faster.]
1-bit also works on some data sets

[Figure: training loss vs. time (s) for SGD on MNIST, digit 7, comparing float CPU 1-thread, float CPU 10-threads, float FPGA, and Q1 FPGA; 1-bit works, and Q1 FPGA converges 10.6x faster.]

MNIST multi-classification accuracy:
Precision      | Accuracy | Training Time
float CPU-SGD  | 85.82%   | 19.04 s
1-bit FPGA-SGD | 85.87%   | 2.41 s
Naïve Rounding vs. Stochastic Rounding

[Figure: training loss vs. number of epochs for float, Q2/Q4/Q8 (stochastic rounding), and F2/F4/F8 (naïve rounding). The naïvely rounded runs converge with a visible bias; stochastic quantization results in unbiased convergence.]
Effect of Step Size

[Figure: training loss vs. number of epochs for float with step size 1/2^9, 1/2^12, 1/2^15 and Q2 with step size 2/2^9, 2/2^12, 2/2^15.]

Large step size + full precision vs. small step size + low precision.
Conclusion
• Highly scalable, parametrizable FPGA-based stochastic gradient descent implementations for linear model training.
• Open source: www.systems.ethz.ch/fpga/ZipML_SGD

Key Takeaways:
1. The way to train linear models on an FPGA is to use stochastically rounded, low-precision data.
2. Multivariate trade-off space: precision vs. end-to-end runtime, convergence quality, design complexity, data and system properties.
Backup
Why do dense linear models matter?
• Workhorse algorithm for regression and classification.
• Sparse, high-dimensional data sets can be converted into dense ones, e.g. via transfer learning or random features (see the sketch below).
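For the random-features point, a minimal sketch using random Fourier features (Rahimi and Recht style) to turn a high-dimensional, sparse-looking input into a dense, fixed-width one; the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def random_fourier_features(X, out_dim=512, sigma=1.0):
    """Map M x N inputs to a dense M x out_dim representation that
    approximates an RBF kernel."""
    N = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(N, out_dim))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=out_dim)
    return np.sqrt(2.0 / out_dim) * np.cos(X @ W + phase)

# A sparse, high-dimensional input becomes a dense 512-wide one
X_raw = rng.random((100, 10000)) * (rng.random((100, 10000)) < 0.01)
X_dense = random_fourier_features(X_raw)   # shape (100, 512), fully dense
```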
Why is training speed important?
• Often a human-in-the-loop process.
• Configuring parameters, selecting features, data dependencies, etc., in a train-evaluate loop.
Effect of Reusing Indexes

[Figure: training loss vs. number of epochs for float and for Q2 with 64, 32, 16, 8, 4, and 2 stored quantization indexes.]

In practice, 8 indexes are enough!