April 4-7, 2016 | Silicon Valley
NVIDIA GIE: HIGH-PERFORMANCE GPU INFERENCE ENGINE
Michael Andersch, 7th April 2016

WHAT IS INFERENCE, ANYWAYS?
Building a deep neural network based application:
- Step 1: Use data to train the neural network - training
- Step 2: Use the neural network to process unseen data - inference

INFERENCE VS TRAINING
How is inference different from training?
1. No backpropagation / static weights
   - enables graph optimizations and simplifies memory management
2. Tendency towards smaller batch sizes
   - harder to amortize weight loading and to achieve high GPU utilization
3. Reduced precision requirements
   - opportunity for bandwidth savings and accelerated arithmetic

OPTIMIZING SOFTWARE FOR INFERENCE
Extracting every bit of performance
What's running on the GPU: cuDNN optimizations
- Support for standard tensor layouts and major frameworks
- Available automatically and "for free"
How you use it: framework optimizations
- Every last bit of performance matters
- Challenging due to framework structure
- Changes to one framework don't propagate to others

OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Efficient small-batch convolutions
The optimal convolution algorithm depends on the convolution layer's dimensions.
Winograd speedup over GEMM-based convolution (VGG-E layers, N=1):
- conv 1.1: 1.84x
- conv 1.2: 2.03x
- conv 2.1: 1.83x
- conv 2.2: 2.07x
- conv 3.1: 2.26x
- conv 3.2: 1.98x
- conv 4.1: 1.92x
- conv 4.2: 1.25x
- conv 5.0: 0.73x
Meta-parameters (data layouts, texture memory) afford higher performance:
- Using texture memory for convolutions: 13% inference speedup (GoogLeNet, batch size 1)

OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Graph optimization
[Diagram: GoogLeNet-style inception module - 1x1, 3x3, and 5x5 convolution branches and a max pool branch over the input, joined by a tensor concat]

OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Graph optimization
[Diagram: the same module expanded into its individual layers - each convolution followed by bias and ReLU, plus max pooling, feeding the concat and the next layer's input]

OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Vertical fusion
[Diagram: each convolution + bias + ReLU sequence fused into a single CBR block (1x1, 3x3, and 5x5 CBR plus the 1x1 CBR projections), alongside the max pool, feeding the concat]

OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Horizontal fusion
[Diagram: the 1x1 CBR blocks that read the same input merged into one wider 1x1 CBR, leaving 1x1, 3x3, and 5x5 CBR branches plus the max pool]

OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Concat elision
[Diagram: the concat layers removed - each branch writes its output directly into the next layer's input buffer]

OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Concurrency
[Diagram: the remaining independent branches (1x1, 3x3, and 5x5 CBR and the max pool) executed concurrently]

OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Effective use of cuBLAS
- Run GEMV instead of GEMM: small batch sizes degrade the N dimension, so the B matrix becomes narrow
- Pre-transpose weight matrices: allows using NN/NT GEMM, where NT > NN > TN in performance

ACCELERATED INFERENCE ON PASCAL
Support for fast mixed-precision arithmetic
- Inference products will support a new dedicated vector math instruction: a multi-element dot product with 8-bit integer inputs and a 32-bit accumulator
- 4x the rate of equivalent FP32 operations
- Full-speed FP32 processing for any layers that require higher precision
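The slide above describes the new instruction only abstractly. As an illustration, here is a minimal CUDA C++ sketch of that kind of 8-bit dot product; it assumes a Pascal-class GPU (compute capability 6.1 or newer) and CUDA 8+, where the operation is exposed as the __dp4a intrinsic. The kernel, its launch configuration, and the test values are illustrative and not taken from the talk.

```cpp
// Minimal sketch (assumption: built with nvcc -arch=sm_61 or newer, CUDA 8+).
// __dp4a treats each 32-bit operand as four packed signed 8-bit values and
// accumulates their pairwise products into a 32-bit integer accumulator,
// i.e. the "multi-element dot product" described on the slide.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void int8_dot(const int* a, const int* b, int numWords, int* out)
{
    int acc = 0;
    // Each 32-bit word packs four int8 values, so numWords words
    // cover 4 * numWords int8 elements.
    for (int i = threadIdx.x; i < numWords; i += blockDim.x)
        acc = __dp4a(a[i], b[i], acc);   // acc += four int8*int8 products
    atomicAdd(out, acc);                 // combine the per-thread partial sums
}

int main()
{
    // One packed word per operand, holding the int8 lanes {1, 2, 3, 4}.
    int ha = 0x04030201, hb = 0x04030201, hout = 0;
    int *da, *db, *dout;
    cudaMalloc(&da, sizeof(int));
    cudaMalloc(&db, sizeof(int));
    cudaMalloc(&dout, sizeof(int));
    cudaMemcpy(da, &ha, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, &hb, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dout, &hout, sizeof(int), cudaMemcpyHostToDevice);

    int8_dot<<<1, 32>>>(da, db, 1, dout);

    cudaMemcpy(&hout, dout, sizeof(int), cudaMemcpyDeviceToHost);
    printf("dot = %d\n", hout);          // expect 1*1 + 2*2 + 3*3 + 4*4 = 30
    cudaFree(da); cudaFree(db); cudaFree(dout);
    return 0;
}
```

In a real inference kernel, the packed operands would come from quantized weights and activations, with FP32 retained for any layers that are sensitive to reduced precision, as the slide notes.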
BUT WHO WILL IMPLEMENT IT?
Introducing NVIDIA GIE: GPU Inference Engine
[Diagram: a trained network enters the OPTIMIZATION ENGINE, which produces a STRATEGY that is run by the EXECUTION ENGINE]

GPU INFERENCE ENGINE WORKFLOW
[Diagram: models from DIGITS and other training tools feed the OPTIMIZATION ENGINE; the resulting STRATEGY is deployed through the EXECUTION ENGINE]
(A hedged code sketch of this two-stage workflow appears after the closing slide.)

SUMMARY
Inference on the GPU - Tesla M4 Hyperscale Accelerator
GPUs are a great platform for inference:
- Efficiency: great performance/watt
- Scalability: from 3W to 300W
GPU-based inference affords …
- … the same performance in a much tighter power envelope
- … freeing up the CPU to do other work
Questions: [email protected], or find me after the talk!

April 4-7, 2016 | Silicon Valley
THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join
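For a concrete picture of the optimize-then-execute split shown in the workflow slide, here is a hedged C++ sketch written against the early TensorRT-style interface that GIE later shipped under. The API names (createInferBuilder, ICaffeParser, buildCudaEngine, IExecutionContext) come from that later library, and the file names, blob names, and buffer sizes are placeholders; treat this as an illustration of the two stages, not as the exact API presented in this talk.

```cpp
// Hedged sketch of the two-stage GIE workflow: an offline optimization step
// that turns a trained network into an execution plan, and a runtime step
// that executes that plan. Written against the early TensorRT-style C++ API;
// file names, blob names, and buffer sizes below are placeholders.
#include "NvInfer.h"
#include "NvCaffeParser.h"
#include <cuda_runtime.h>
#include <iostream>

using namespace nvinfer1;
using namespace nvcaffeparser1;

// Minimal logger required by the builder and runtime interfaces.
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity != Severity::kINFO) std::cout << msg << std::endl;
    }
} gLogger;

int main()
{
    // --- Optimization engine: import the trained model and build a plan ---
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();

    ICaffeParser* parser = createCaffeParser();
    const IBlobNameToTensor* blobs =
        parser->parse("deploy.prototxt", "net.caffemodel", *network, DataType::kFLOAT);
    network->markOutput(*blobs->find("prob"));   // placeholder output blob name

    builder->setMaxBatchSize(1);                 // optimize for small-batch inference
    builder->setMaxWorkspaceSize(16 << 20);      // scratch space for algorithm selection

    // The graph optimizations described earlier (vertical/horizontal fusion,
    // concat elision, per-layer algorithm choice) happen inside this call.
    ICudaEngine* engine = builder->buildCudaEngine(*network);

    // --- Execution engine: run inference using the optimized plan ---
    IExecutionContext* context = engine->createExecutionContext();

    // Allocate device buffers for the input and output bindings.
    // Sizes are placeholders; real code derives them from the network.
    void* buffers[2];
    int inIdx  = engine->getBindingIndex("data");
    int outIdx = engine->getBindingIndex("prob");
    cudaMalloc(&buffers[inIdx],  3 * 224 * 224 * sizeof(float));
    cudaMalloc(&buffers[outIdx], 1000 * sizeof(float));
    // (copying the input image in and the result out is omitted for brevity)

    context->execute(/*batchSize=*/1, buffers);

    cudaFree(buffers[inIdx]);
    cudaFree(buffers[outIdx]);
    context->destroy();
    engine->destroy();
    parser->destroy();
    network->destroy();
    builder->destroy();
    return 0;
}
```

The key point, independent of exact API details, is that all of the graph-level work happens once at build time, so the runtime engine only has to launch the pre-selected kernels.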