Analysis of Sparse Convolutional Neural Networks
Sabareesh Ganapathy, Manav Garg, Prasanna Venkatesh Srinivasan
UNIVERSITY OF WISCONSIN-MADISON

Convolutional Neural Network
• State of the art in image classification.
• Terminology: feature maps, weights.
• Layers: Convolution, ReLU, Pooling, Fully Connected.
• Example: AlexNet, 2012. (Image taken from mdpi.com)

Sparsity in Networks
• Trained networks occupy a large amount of memory (~200 MB for AlexNet), leading to many DRAM accesses.
• Filters can be sparse. Sparsity percentages for the filters in AlexNet's convolutional layers:

  Layer   Sparsity (%)
  CONV2   6
  CONV3   7
  CONV4   7

• This sparsity is too low to exploit directly, and thresholding small weights at inference time would cost accuracy.
• Instead, sparsity must be accounted for during training.

Sparsity in Networks
• Deep Compression
  • Prunes low-magnitude weights, then retrains to recover accuracy.
  • Reports reduced storage and also speedup in fully connected layers.
• Structured Sparsity Learning (SSL) in Deep Neural Networks
  • Employs locality optimization for sparse weights.
  • Reports memory savings and speedup in convolutional layers.

  Layer   Sparsity (%)
  CONV1   14
  CONV2   54
  CONV3   72
  CONV4   68
  CONV5   58

Caffe Framework
• Open-source framework to build and run convolutional neural networks.
• Provides Python and C++ interfaces for inference; source code is in C++ and CUDA.
• Employs efficient data structures (Blobs) for feature maps and weights.
• Caffe Model Zoo: a repository of CNN models for analysis. Pretrained models for both the base and compressed versions of AlexNet are available.

Convolution = Matrix Multiply
IFM: 227x227x3, Filter: 11x11x3x96, Stride: 4, OFM: 55x55x96
• Each filter looks at an 11x11x3 input volume at 55 locations along both W and H.
• The IFM is unrolled into a 363x3025 matrix; the weights into a 96x363 matrix.
• OFM = Weights x IFM.
• BLAS libraries are used to implement the matrix multiply (GEMM): MKL on CPU, cuBLAS on GPU.

Sparse Matrix Multiply
• For sparse networks, the weight matrix can be stored in Compressed Sparse Row (CSR) format, which represents the matrix with three arrays:
  • Array A: the non-zero values.
  • Array JA: the column index of each element in A.
  • Array IA: the cumulative count of non-zero values in all previous rows (row pointers).
• The sparse representation saves memory and could enable more efficient computation.
• Wen-Wei's Caffe branch for sparse convolution:
  • Represents convolutional layer weights in CSR format and uses sparse matrix multiply routines (CSRMM).
  • Weights are in sparse format, the IFM is dense, and the output is dense.
  • Uses the MKL library on CPU and cuSPARSE on GPU.

Analysis Framework
• gem5-gpu was initially planned as the simulation framework, but it proved too slow given the large size of deep neural networks.
• Instead, analysis was performed by running Caffe and CUDA programs on native hardware.
• CPU analysis: AWS system with 2 Intel Xeon cores running at 2.4 GHz.
• GPU analysis: dodeca system with an NVIDIA GeForce GTX 1080 GPU.

Convolutional Layer Analysis
• Networks trained with Deep Compression and SSL were both analyzed; they showed similar trends.
• Memory savings obtained with the sparse representation:

  Layer   Memory Saving
  CONV1   1.11x
  CONV2   2.48x
  CONV3   2.72x
  CONV4   2.53x
  CONV5   2.553x

• Only the multiplication time was recorded; conversion time to CSR format was excluded because the weights are sparsified only once for a whole set of IFMs.

CPU: CSRMM vs GEMM
• CSRMM is slower than GEMM; the overhead depends on the sparsity percentage.

GPU: CSRMM vs GEMM
• The CSRMM overhead is larger on GPU, where the dense operations are much faster than on CPU.
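The convolution-as-GEMM lowering described earlier (im2col followed by a dense matrix multiply) can be sketched in NumPy. This is a minimal illustration of the shapes on the AlexNet CONV1 slide, not Caffe's actual C++/CUDA implementation:

```python
import numpy as np

# AlexNet CONV1 shapes from the slides.
H = W = 227; C = 3            # IFM: 227x227x3
K = 11; stride = 4; F = 96    # filter: 11x11x3x96
out = (H - K) // stride + 1   # 55 output positions along each of W, H

ifm = np.random.rand(C, H, W).astype(np.float32)
weights = np.random.rand(F, C * K * K).astype(np.float32)  # 96x363

# im2col: unroll each 11x11x3 receptive field into one column.
cols = np.empty((C * K * K, out * out), dtype=np.float32)  # 363x3025
for y in range(out):
    for x in range(out):
        patch = ifm[:, y*stride:y*stride+K, x*stride:x*stride+K]
        cols[:, y * out + x] = patch.ravel()

# GEMM: (96x363) x (363x3025) = 96x3025, reshaped to the OFM.
ofm = (weights @ cols).reshape(F, out, out)
print(ofm.shape)  # (96, 55, 55)
```

In Caffe, this final product is dispatched to a BLAS GEMM routine (MKL on CPU, cuBLAS on GPU) rather than computed with Python loops.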
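The CSR layout and the sparse-weight x dense-IFM product (CSRMM) can be illustrated with SciPy. This is only a sketch of the data layout on a toy matrix; the Caffe branch itself calls the MKL and cuSPARSE routines:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small matrix standing in for a pruned weight matrix.
weights = np.array([[0., 2., 0., 1.],
                    [0., 0., 0., 0.],
                    [3., 0., 4., 0.]], dtype=np.float32)

w_csr = csr_matrix(weights)
print(w_csr.data)     # A:  non-zero values            -> [2. 1. 3. 4.]
print(w_csr.indices)  # JA: column index of each A     -> [1 3 0 2]
print(w_csr.indptr)   # IA: cumulative non-zeros by row -> [0 2 2 4]

# CSRMM: sparse weights x dense IFM -> dense output.
ifm = np.random.rand(4, 5).astype(np.float32)
ofm = w_csr @ ifm
assert np.allclose(ofm, weights @ ifm)  # matches the dense GEMM result
```

SciPy's `indptr` is exactly the IA array described on the slide: entry i gives the number of non-zeros in all rows before row i, so row i's values live in `data[indptr[i]:indptr[i+1]]`.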
Fully Connected Layer (FC) Analysis
• Fully connected layers form the final layers of a typical CNN and are implemented as a matrix-vector multiply (GEMV).
• Caffe's internal data structure (Blob) was modified to represent the FC-layer weights in sparse format.
• Sparse matrix-vector multiplication (SpMV) was used for the sparse computation.
• The Deep Compression model was used for this analysis. (Image taken from petewarden.com)

FC Layer Analysis
• A speedup of 3x was observed on both CPU and GPU.

Matrix Multiply Analysis
• Custom C++ and CUDA programs were written to time only the matrix multiplication routines.
• This allowed the sparsity of the weight matrix to be varied to find the break-even point where CSRMM becomes faster than GEMM.
• The weight matrix size was chosen to match the largest AlexNet CONV layer, with the zeros distributed randomly.

[Plot: Speedup of CSRMM over GEMM for random sparsity (CPU); speedup vs. sparsity, 0.945 to 0.995]
[Plot: Speedup of CSRMM over GEMM for random sparsity (GPU); speedup vs. sparsity, 0.945 to 0.995]

GPU Memory Scaling Experiment
• Motivation: since the sparse representation occupies significantly less space than the dense one, a larger working set can fit in the cache/memory system in the sparse case.
• Implementation: as the weight matrix is the one passed in sparse format, its size was increased to a larger value.
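The memory argument behind this experiment can be quantified with a back-of-the-envelope footprint model for the two formats. The 4-byte value/index sizes and the 90% sparsity figure below are illustrative assumptions, not measured values, and the slide's reported totals also include the dense IFM and output buffers:

```python
def dense_bytes(rows, cols, elem=4):
    # Dense storage: one element per matrix entry.
    return rows * cols * elem

def csr_bytes(rows, nnz, elem=4, idx=4):
    # CSR storage: A (values) + JA (column indices) + IA (row pointers).
    return nnz * elem + nnz * idx + (rows + 1) * idx

# The larger weight-matrix shape from the slides, at an assumed 90% sparsity.
rows, cols = 25600, 24000
nnz = int(rows * cols * 0.10)
print(dense_bytes(rows, cols) / 2**20)  # 2343.75 MB dense
print(csr_bytes(rows, nnz) / 2**20)     # ~468.9 MB in CSR
```

The model shows why the sparse working set scales so much better: above roughly 50% sparsity (with equal value and index widths), CSR stores fewer bytes than dense, and the gap widens as sparsity grows.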
GPU Results (GEMM vs CSRMM)

GEMM vs CSRMM (Weight Matrix = 256 x 1200)
  Format   Total Memory Used (MB)   Runtime (ns)
  Dense    3377                     60473
  Sparse   141                      207910

GEMM vs CSRMM (Weight Matrix = 25600 x 24000)
  Format   Total Memory Used (MB)   Runtime (ns)
  Dense    5933                     96334
  Sparse   1947                     184496

• GEMM is still faster in both cases, but the GEMM/CSRMM time ratio increases from 0.29 to 0.52 as the weight matrix dimensions grow.

GPU: Sparse x Sparse
• The IFM is also sparse, due to the ReLU activation. IFM sparsity in AlexNet's CONV layers:

  Layer   Sparsity (%)
  CONV2   23.57
  CONV3   56.5
  CONV4   64.7
  CONV5   68.5

• The CUSP library was used in a custom program for a sparse x sparse multiply of IFM and weights.
• A 4x speedup and a 4.2x memory saving were observed compared to the GEMM routine.
• Scaling to the typical dimensions of AlexNet has not yet succeeded; this is in progress.

Conclusion
• Representing matrices in sparse format yields significant memory savings, as expected.
• No practical computational benefit was observed for convolutional layers using the library routines provided by MKL and cuSPARSE, on either CPU or GPU.
• Fully connected layers with high sparsity showed around a 3x speedup.
• With a large enough dataset and GPU memory, the sparse representation might reduce convolutional runtime.
• Sparse x sparse computation showed promising results and will be implemented in Caffe.

THANK YOU
QUESTIONS?