
Analysis of Sparse Convolutional Neural Networks
Sabareesh Ganapathy
Manav Garg
Prasanna Venkatesh Srinivasan
UNIVERSITY OF WISCONSIN-MADISON
Convolutional Neural Network
• State of the art in image classification
• Terminology: Feature Maps, Weights
• Layers: Convolution, ReLU, Pooling, Fully Connected
• Example: AlexNet (2012)
(Image taken from mdpi.com)
Sparsity in Networks
• Trained networks occupy huge memory (~200 MB for AlexNet).
• Many DRAM accesses.
• Filters can be sparse. Sparsity percentages for the filters in AlexNet's convolutional layers are shown below.
Layer    Sparsity Percentage
CONV2    6
CONV3    7
CONV4    7
• This natural sparsity is too low to exploit.
• Thresholding weights at inference time would lead to accuracy loss.
• Sparsity therefore has to be accounted for during training.
Sparsity in Networks
• Deep Compression
  • Prunes low-magnitude weights, then retrains to recover accuracy.
  • Reports reduced storage and a speedup in Fully Connected layers.
• Structured Sparsity Learning (SSL) in Deep Neural Networks
  • Employs locality optimization for sparse weights.
  • Reports memory savings and a speedup in Convolutional layers. SSL sparsity for the AlexNet CONV layers is shown below.
Layer    Sparsity Percentage
CONV1    14
CONV2    54
CONV3    72
CONV4    68
CONV5    58
Caffe Framework
• Open-source framework to build and run convolutional neural nets.
• Provides Python and C++ interfaces for inference.
• Source code in C++ and CUDA.
• Employs efficient data structures for feature maps and weights.
  • Blob data structure
• Caffe Model Zoo - a repository of CNN models for analysis.
• Pretrained models for the base and compressed versions of AlexNet are available in the Model Zoo.
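As an illustration of the pycaffe interface, the sketch below loads a pretrained Model Zoo model and reports per-layer weight sparsity. It is a minimal sketch, not the exact analysis code used in this work, and the file paths are placeholders.

```python
# Minimal pycaffe sketch: load a pretrained AlexNet and report the fraction of
# zero weights in each parameterized layer. Paths are placeholders for Model Zoo files.
import numpy as np
import caffe

caffe.set_mode_cpu()
net = caffe.Net('deploy.prototxt',          # network definition (placeholder)
                'bvlc_alexnet.caffemodel',  # pretrained weights (placeholder)
                caffe.TEST)

for name, params in net.params.items():
    w = params[0].data                      # weight blob as a numpy array
    sparsity = 100.0 * np.sum(w == 0) / w.size
    print('%s: %.1f%% zero weights' % (name, sparsity))
```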
Convolution = Matrix Multiply
(Figure: IFM 227x227x3 convolved with an 11x11x3x96 filter bank at stride 4 to produce a 55x55x96 OFM.)
• IFM converted to a 363x3025 matrix (im2col).
• Each filter sees an 11x11x3 input volume at 55 locations along W and H.
• Weights converted to a 96x363 matrix.
• OFM = Weights x IFM.
• BLAS libraries are used to implement the matrix multiply (GEMM).
• MKL for CPU, cuBLAS for GPU.
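A minimal NumPy sketch of this lowering for the conv1 shapes above; Caffe's actual im2col/GEMM path is C++/CUDA, so this only illustrates the data layout.

```python
# Convolution as GEMM for AlexNet conv1: IFM 227x227x3, 96 filters of 11x11x3,
# stride 4 -> OFM 55x55x96. Random data, since only the shapes matter here.
import numpy as np

C, H, W = 3, 227, 227
K, R, S = 96, 11, 11
stride = 4
out = (H - R) // stride + 1               # 55 output positions per dimension

ifm = np.random.rand(C, H, W).astype(np.float32)
weights = np.random.rand(K, C, R, S).astype(np.float32)

# im2col: each column is one 11x11x3 input patch -> 363 x 3025 matrix
cols = np.empty((C * R * S, out * out), dtype=np.float32)
for y in range(out):
    for x in range(out):
        patch = ifm[:, y*stride:y*stride+R, x*stride:x*stride+S]
        cols[:, y*out + x] = patch.ravel()

# Weights reshaped to 96x363; the convolution is now a single dense GEMM
ofm = weights.reshape(K, -1) @ cols       # 96 x 3025
print(ofm.reshape(K, out, out).shape)     # (96, 55, 55)
```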
Sparse Matrix Multiply
• The weight matrix can be represented in a sparse format for sparse networks.
• Compressed Sparse Row (CSR) format: the matrix is converted into three arrays that represent the non-zero values.
  • Array A - the non-zero values.
  • Array JA - the column index of each element in A.
  • Array IA - the cumulative count of non-zero values in the preceding rows.
• The sparse representation saves memory and could enable more efficient computation.
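A small example of the A / JA / IA arrays, using scipy's CSR type (where they are named data, indices, and indptr); the matrix values are arbitrary.

```python
# CSR representation of a small sparse matrix, mirroring the A / JA / IA arrays.
import numpy as np
from scipy.sparse import csr_matrix

W = np.array([[0, 2, 0, 0],
              [3, 0, 0, 4],
              [0, 0, 0, 0],
              [0, 5, 6, 0]], dtype=np.float32)

W_csr = csr_matrix(W)
print(W_csr.data)     # A  -> [2. 3. 4. 5. 6.]  non-zero values
print(W_csr.indices)  # JA -> [1 0 3 1 2]       column index of each value
print(W_csr.indptr)   # IA -> [0 1 3 3 5]       cumulative non-zeros per row
```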
• Wen-Wei – new Caffe branch for sparse convolution.
  • Represents convolutional layer weights in CSR format.
  • Uses sparse matrix multiply routines (CSRMM).
  • Weights in sparse format, IFM dense, output dense.
  • MKL library for CPU, cuSPARSE for GPU.
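A minimal sketch of this CSRMM setup, with scipy standing in for the MKL/cuSPARSE routines; the 30% weight density and conv1-like shapes are illustrative.

```python
# CSRMM: CSR weights x dense im2col IFM -> dense OFM.
import numpy as np
from scipy.sparse import random as sparse_random

weights_csr = sparse_random(96, 363, density=0.3, format='csr', dtype=np.float32)
ifm = np.random.rand(363, 3025).astype(np.float32)   # dense im2col matrix

ofm = weights_csr @ ifm                              # dense 96 x 3025 output
print(ofm.shape)
```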
Analysis Framework
• Initially, gem5-gpu was planned as the simulation framework, but gem5 proved very slow for the large size of deep neural networks.
• Analysis was instead performed by running Caffe and CUDA programs on native hardware.
• For CPU analysis, an AWS system with two Intel Xeon cores running at 2.4 GHz was used.
• For GPU analysis, a dodeca system with an NVIDIA GeForce GTX 1080 GPU was used.
Convolutional Layer Analysis
• Deep Compression and SSL trained networks were used for analysis; both showed similar trends.
• Memory savings obtained with the sparse representation are given below.
Layer    Memory Saving (x)
CONV1    1.11
CONV2    2.48
CONV3    2.72
CONV4    2.53
CONV5    2.553
• The time taken for the multiplication was recorded. Conversion time to CSR format was not included, as the weights are sparsified only once for a set of IFMs.
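For reference, a rough way to estimate such savings, assuming float32 values and 32-bit CSR indices; the layer shape and sparsity below are illustrative, so the result will not exactly match the measured ratios above.

```python
# Dense vs CSR footprint for a weight matrix: values + column indices + row pointers.
def csr_saving(rows, cols, sparsity):
    dense_bytes = rows * cols * 4                 # float32 dense storage
    nnz = int(rows * cols * (1.0 - sparsity))
    csr_bytes = nnz * (4 + 4) + (rows + 1) * 4    # A + JA + IA arrays
    return dense_bytes / float(csr_bytes)

# e.g. a CONV3-like weight matrix (384 x 2304) at an assumed 72% sparsity
print(round(csr_saving(384, 3 * 3 * 256, 0.72), 2))
```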
CPU: CSRMM vs GEMM
• CSRMM is slower compared to GEMM.
• The overhead depends on the sparsity percentage.
GPU: CSRMM vs GEMM
• The CSRMM overhead is larger on the GPU.
• GPU operations are faster compared to the CPU.
Fully Connected Layer (FC) Analysis
• Fully Connected layers form the final layers of a typical CNN and are implemented as a matrix-vector multiply (GEMV).
• Caffe's internal data structure (blob) was modified to represent the FC-layer weights in sparse format.
• Sparse matrix-vector multiplication (SpMV) is used for the sparse computation.
• The Deep Compression model was used for analysis.
Image taken from petewarden.com
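A minimal SpMV sketch for an fc6-sized layer (4096 x 9216), with scipy standing in for the MKL/cuSPARSE routines; the 90% weight sparsity is an assumed, illustrative value.

```python
# SpMV: CSR weight matrix times a dense input vector -> dense output vector.
import numpy as np
from scipy.sparse import random as sparse_random

w_csr = sparse_random(4096, 9216, density=0.1, format='csr', dtype=np.float32)
x = np.random.rand(9216).astype(np.float32)

y = w_csr @ x            # 4096-element dense output
print(y.shape)
```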
FC Layer Analysis
• A speedup of about 3x was observed on both CPU and GPU.
Matrix Multiply Analysis
• Custom C++ and CUDA programs were written to measure the time taken by the matrix multiplication routines alone.
• This allowed the sparsity of the weight matrix to be varied, to find the break-even point where CSRMM becomes faster than GEMM.
• The size of the weight matrix was chosen to match the largest AlexNet CONV layer.
• The zeros were distributed randomly in the weight matrix.
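A simplified Python version of this sweep is sketched below; the actual measurements used the custom C++ and CUDA programs, and the matrix shapes here are only roughly CONV2-sized illustrations.

```python
# Time dense GEMM vs CSRMM on a randomly sparsified weight matrix and report
# the speedup of CSRMM over GEMM at a few sparsity levels.
import time
import numpy as np
from scipy.sparse import csr_matrix

rows, cols, n = 256, 1200, 729
ifm = np.random.rand(cols, n).astype(np.float32)

for sparsity in (0.95, 0.97, 0.99):
    w = np.random.rand(rows, cols).astype(np.float32)
    w[np.random.rand(rows, cols) < sparsity] = 0.0    # randomly placed zeros
    w_csr = csr_matrix(w)

    t0 = time.time()
    _ = w @ ifm
    t_gemm = time.time() - t0

    t0 = time.time()
    _ = w_csr @ ifm
    t_csrmm = time.time() - t0

    print('sparsity %.2f: CSRMM speedup over GEMM %.2fx' % (sparsity, t_gemm / t_csrmm))
```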
Matrix Multiply Analysis
(Charts: Speedup of CSRMM over GEMM for random sparsity, CPU and GPU. X-axis: sparsity, 0.945 to 0.995; y-axis: speedup, scaled 0 to 2.5 for the CPU chart and 0 to 0.6 for the GPU chart.)
GPU Memory Scaling Experiment
Motivation:
Because the sparse matrix representation occupies significantly less space than the dense one, a larger working set can fit in the cache/memory system in the sparse case.
Implementation:
Since the weight matrix is the one passed in sparse format, its size was increased to a larger value.
GPU Results (GEMM vs CSRMM)
GEMM vs CSRMM (Weight Matrix = 256 x 1200)
Format    Total Memory Used (MB)    Runtime (ns)
Dense     3377                      60473
Sparse    141                       207910

GEMM vs CSRMM (Weight Matrix = 25600 x 24000)
Format    Total Memory Used (MB)    Runtime (ns)
Dense     5933                      96334
Sparse    1947                      184496
While GEMM is still faster in both cases, the GEMM/CSRMM time ratio increases from 0.29 to 0.52 as the weight matrix dimensions are scaled up.
GPU: Sparse X Sparse
• IFM sparsity arises from the ReLU activation. IFM sparsity in the CONV layers of AlexNet is given below.
Layer    Sparsity Percentage
CONV2    23.57
CONV3    56.5
CONV4    64.7
CONV5    68.5
• The CUSP library was used in a custom program for the sparse x sparse multiply of IFM and weights.
  • A speedup of 4x was observed compared to the GEMM routine.
  • Memory savings of 4.2x compared to GEMM.
  • Could not yet be scaled to typical AlexNet dimensions; this is in progress.
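A minimal sparse x sparse sketch with scipy standing in for the CUSP routines; the CONV3-like shapes and the sparsity levels are illustrative.

```python
# Sparse x sparse: CSR weights times a ReLU-sparsified, CSR-converted IFM.
import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

weights = sparse_random(384, 2304, density=0.3, format='csr', dtype=np.float32)

ifm = np.random.rand(2304, 169).astype(np.float32)
ifm[ifm < 0.6] = 0.0                    # mimic ReLU-induced IFM sparsity (~60%)
ifm_csr = csr_matrix(ifm)

ofm = weights @ ifm_csr                 # sparse x sparse -> sparse 384 x 169 result
print(ofm.shape, ofm.nnz)
```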
CONCLUSION
• Representing matrices in sparse format results in significant memory savings, as expected.
• We did not observe any practical computational benefit for Convolutional layers using the library routines provided by MKL and cuSPARSE, on either CPU or GPU.
• Fully Connected layers showed around a 3x speedup for layers with high sparsity.
• With a larger dataset and more GPU memory, we might see a drop in convolutional runtime for the sparse representation.
• Sparse x sparse computation showed promising results and will be implemented in Caffe.
THANK YOU
QUESTIONS ?