NVIDIA GIE: HIGH-PERFORMANCE GPU INFERENCE ENGINE

April 4-7, 2016 | Silicon Valley
Michael Andersch, 7th April 2016
WHAT IS INFERENCE, ANYWAY?
Building a deep neural network based application
Step 1: Use data to train the neural network - training
Step 2: Use the neural network to process unseen data - inference
INFERENCE VS TRAINING
How is inference different from training?
1. No backpropagation / static weights
enables graph optimizations and simplifies memory management
2. Tendency towards smaller batch sizes
makes it harder to amortize weight loading and achieve high GPU utilization
3. Reduced precision requirements
provide opportunities for bandwidth savings and accelerated arithmetic
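In practice, reduced precision usually means mapping FP32 values to 8-bit integers with a per-tensor scale. A minimal sketch of symmetric linear quantization (the max-abs scale choice and helper names are illustrative assumptions, not GIE's actual scheme):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric linear quantization: map [-max|x|, +max|x|] onto [-127, 127].
// Hypothetical helper for illustration only.
float compute_scale(const std::vector<float>& x) {
    float max_abs = 0.0f;
    for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
    return max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
}

int8_t quantize(float v, float scale) {
    float q = std::round(v / scale);
    return static_cast<int8_t>(std::max(-127.0f, std::min(127.0f, q)));
}

float dequantize(int8_t q, float scale) { return q * scale; }
```

Each stored value shrinks from 4 bytes to 1, which is where the bandwidth savings come from; the accelerated arithmetic comes from integer dot-product hardware discussed later in the talk.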
OPTIMIZING SOFTWARE FOR INFERENCE
Extracting every bit of performance
What’s running on the GPU: cuDNN optimizations
Support for standard tensor layouts and major frameworks
Available automatically and “for free”
How you use it: Framework optimizations
Every last bit of performance matters
Challenging due to framework structure
Changes to one framework don’t propagate to others
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Efficient small batch convolutions
Optimal convolution algorithm depends on convolution layer dimensions
Winograd speedup over GEMM-based convolution (VGG-E layers, N=1):
conv 1.1: 1.84x
conv 1.2: 2.03x
conv 2.1: 1.83x
conv 2.2: 2.07x
conv 3.1: 2.26x
conv 3.2: 1.98x
conv 4.1: 1.92x
conv 4.2: 1.25x
conv 5.0: 0.73x
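The Winograd gain comes from minimal filtering: F(2,3) produces two outputs of a 3-tap filter with 4 multiplies instead of 6 (the 2D F(2x2,3x3) case used for 3x3 convolutions nests this construction in both dimensions). A scalar sketch of the 1D transform, written from the standard Winograd formulas rather than any cuDNN internals:

```cpp
#include <array>

// Winograd minimal filtering F(2,3): two outputs of a 3-tap convolution
// from four inputs, using 4 multiplications instead of 6.
std::array<float, 2> winograd_f23(const std::array<float, 4>& d,
                                  const std::array<float, 3>& g) {
    // Filter transform (precomputable once per filter, amortized at inference).
    float g0 = g[0];
    float g1 = 0.5f * (g[0] + g[1] + g[2]);
    float g2 = 0.5f * (g[0] - g[1] + g[2]);
    float g3 = g[2];
    // Data transform and elementwise products: the only 4 multiplies.
    float m0 = (d[0] - d[2]) * g0;
    float m1 = (d[1] + d[2]) * g1;
    float m2 = (d[2] - d[1]) * g2;
    float m3 = (d[1] - d[3]) * g3;
    // Inverse transform back to the two convolution outputs.
    return {m0 + m1 + m2, m1 - m2 - m3};
}
```

The extra additions are cheap next to multiplies, which is why the technique pays off for the multiply-bound 3x3 layers above but not for every layer shape (note conv 5.0's 0.73x).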
Meta-parameters (data layouts, texture memory) afford higher performance
Using texture memory for convolutions: 13% inference speedup
(GoogLeNet, batch size 1)
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Graph optimization
[Figure: GoogLeNet inception module: the input tensor feeds 1x1, 3x3, and 5x5 convolution branches (with 1x1 convolution reductions) and a max pool branch followed by a 1x1 convolution; the branch outputs are concatenated into the output tensor]
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Graph optimization
[Figure: the same module between input and next input, with each convolution expanded into separate convolution, bias, and ReLU nodes feeding the concat]
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Vertical fusion
[Figure: vertical fusion: each convolution, bias, ReLU chain collapses into a single CBR node, leaving 1x1, 3x3, and 5x5 CBR branches plus the max pool and 1x1 CBR branch feeding the concat]
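Vertical fusion removes two full round-trips through memory: instead of writing the convolution output, re-reading it to add bias, and re-reading again for ReLU, one kernel does all three while the value is still in registers. An illustrative host-side sketch for a 1x1 convolution at a single spatial position (names and layout are assumptions for the example, not GIE's kernels):

```cpp
#include <algorithm>
#include <vector>

// Fused CBR (convolution + bias + ReLU) in one pass. Unfused, this would be
// three passes over the activation tensor, each reading and writing memory.
float cbr_fused(const std::vector<float>& in,       // C input channel values
                const std::vector<float>& weights,  // C weights, one output channel
                float bias) {
    float acc = 0.0f;
    for (size_t c = 0; c < in.size(); ++c)
        acc += in[c] * weights[c];       // convolution (1x1 = dot product)
    acc += bias;                         // bias, applied in-register
    return std::max(acc, 0.0f);          // ReLU, applied in-register
}
```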
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Horizontal fusion
[Figure: horizontal fusion: the 1x1 CBR nodes reading the same input are merged into a single wider 1x1 CBR alongside the 3x3 CBR, 5x5 CBR, and max pool branches]
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Concat elision
[Figure: concat elision: the explicit concat node is removed; each branch writes its result directly into the correct region of the next layer's input buffer]
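Concat along the channel dimension is pure data movement, so it can be elided: give each branch a pointer offset into the final buffer and let it write its slice in place, skipping the copy. A minimal sketch (the branch "kernel" here is hypothetical filler standing in for a real layer):

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a branch's computation writing n outputs.
void run_branch(float* out, std::size_t n, float value) {
    for (std::size_t i = 0; i < n; ++i) out[i] = value;
}

// Concat elision: branches write at offsets into one preallocated buffer,
// so no separate concatenation pass is needed.
std::vector<float> concat_elided(std::size_t n_a, std::size_t n_b) {
    std::vector<float> out(n_a + n_b);
    run_branch(out.data(), n_a, 1.0f);        // branch A fills [0, n_a)
    run_branch(out.data() + n_a, n_b, 2.0f);  // branch B fills [n_a, n_a + n_b)
    return out;                               // already concatenated
}
```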
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Concurrency
[Figure: concurrency: the independent branches (1x1 CBR, 3x3 CBR, 5x5 CBR, max pool plus 1x1 CBR) can now execute concurrently]
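With the concat gone, the branches have no dependence on one another until the join point. On the GPU this maps naturally to separate CUDA streams; as a host-side illustration of the same structure, `std::async` stands in for stream launches (the branch bodies are hypothetical placeholders):

```cpp
#include <algorithm>
#include <future>
#include <numeric>
#include <vector>

// Independent branches of the graph launched concurrently; the join at the
// end corresponds to the (elided) concat / stream synchronization point.
float run_module_concurrently(const std::vector<float>& input) {
    auto branch_a = std::async(std::launch::async, [&] {   // e.g. conv path
        return std::accumulate(input.begin(), input.end(), 0.0f);
    });
    auto branch_b = std::async(std::launch::async, [&] {   // e.g. pool path
        return *std::max_element(input.begin(), input.end());
    });
    return branch_a.get() + branch_b.get();                // join point
}
```

At small batch sizes no single branch fills the GPU, so overlapping branches recovers utilization.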
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Effective use of cuBLAS intrinsics
Run GEMV instead of GEMM
Small batch sizes shrink the N dimension, so the B (activation) matrix becomes narrow
Pre-transpose weight matrices
Allows using the NN/NT GEMM variants, where performance ranks NT > NN > TN
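At batch size 1 the activation matrix collapses to a vector, so the fully connected layer y = Wx is a GEMV (in cuBLAS, `cublasSgemv` rather than `cublasSgemm`). A plain-loop sketch of why the pre-transposed, row-major weight layout helps: each output's dot product streams through contiguous memory (function name and layout conventions are assumptions for the example):

```cpp
#include <cstddef>
#include <vector>

// Batch-1 fully connected layer as GEMV: y = W x, with W of shape (M x K).
// Storing W row-major (pre-transposed, if it was produced column-major)
// makes each dot product a contiguous, unit-stride read of K weights.
std::vector<float> gemv_rowmajor(const std::vector<float>& W,  // M*K, row-major
                                 const std::vector<float>& x,  // K
                                 std::size_t M, std::size_t K) {
    std::vector<float> y(M, 0.0f);
    for (std::size_t i = 0; i < M; ++i) {
        const float* row = &W[i * K];   // contiguous row: coalesces on GPU
        for (std::size_t k = 0; k < K; ++k)
            y[i] += row[k] * x[k];
    }
    return y;
}
```

The transpose is done once offline on the static weights, so inference never pays the strided-access (TN) penalty.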
ACCELERATED INFERENCE ON PASCAL
Support for fast mixed precision arithmetic
Inference products will support a new dedicated vector math instruction
Multi-element dot product, 8-bit integer inputs, 32-bit accumulator
4x the rate of equivalent FP32 operations
Full-speed FP32 processing for any layers that require higher precision
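The instruction described is a 4-way int8 dot product with 32-bit accumulate (Pascal later exposed this in CUDA as the `__dp4a` intrinsic). Its semantics can be sketched on the host:

```cpp
#include <cstdint>

// Reference semantics of a 4-element dot product of 8-bit integers with a
// 32-bit accumulator: acc += sum_i a[i] * b[i]. On the GPU this is a single
// instruction; products are formed at 32-bit precision, so the partial sums
// cannot overflow the way an 8- or 16-bit accumulator would.
int32_t dp4a_ref(const int8_t a[4], const int8_t b[4], int32_t acc) {
    for (int i = 0; i < 4; ++i)
        acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
    return acc;
}
```

Packing four 8-bit operands per register lane is what yields the quoted 4x rate over equivalent FP32 operations.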
BUT WHO WILL IMPLEMENT IT?
Introducing NVIDIA GIE: GPU Inference Engine
[Diagram: the OPTIMIZATION ENGINE produces a STRATEGY consumed by the EXECUTION ENGINE]
GPU INFERENCE ENGINE WORKFLOW
[Diagram: DIGITS TRAINING TOOLS feed the OPTIMIZATION ENGINE, which produces a STRATEGY consumed by the EXECUTION ENGINE]
SUMMARY
Inference on the GPU
Tesla M4 Hyperscale Accelerator
GPUs are a great platform for inference
Efficiency: great performance/watt
Scalability: from 3W to 300W
GPU-based inference affords …
… the same performance in a much tighter power envelope
… while freeing the CPU to do other work
Questions: [email protected], or find me after the talk!
THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join