GPU computing for embedded applications

High Performance Consulting: GPU computing solutions provider
“HPC was the sole solutions provider for developing our Gradientech Tracking Tool. The high quality of the end product means scientists now have an excellent software tool for analyzing chemotactic cell responses in areas such as cancer research.”
- Sara Thorslund, CEO, Gradientech
Authorized Nvidia Partners
“HPC was a key contributor to our new real-time video enhancement product for interventional radiology. In medical devices, performance and reliability are of paramount importance, and HPC helped us reach our goals with their deep GPGPU expertise, great team skills, and their dedication to our success. We look forward to future collaborations with HPC.”
- Arto Järvinen, R&D Director, ContextVision
“HPC's contributions to our LDI 5s Data Path software meant we didn't just reach our performance goals, we exceeded them. Their input on GPU programming in particular had a significant impact on our product's compute performance. I strongly recommend employing HPC for any performance-oriented software development.”
- Anders Österberg, Expert, Innovation Field High Performance Computing, Micronic Mydata
“2 GPU developers from HPC took 3 months for a project that had occupied 7 FPGA developers for 11 months. We've seen our development cycle shrink by more than 10x.”
- EU-based Aerospace/Defence client
 Performance and efficiency
 Client benefits
 Embedded GPU hardware solutions
 Programming environment
 Hardware and software trends
Performance and efficiency
Performance over time
[chart] Peak performance of Nvidia GPUs (GFLOPS*), 2005-2014.
*GFLOPS = Giga Floating-point Operations Per Second
Performance over time
[chart] Peak performance (GFLOPS) of Nvidia GPUs vs. Intel CPUs, 2005-2014; the GPU curve pulls far ahead.
Power efficiency
[chart] Power efficiency (GFLOPS/watt) of Nvidia GPUs vs. Intel CPUs, 2005-2014.
Power efficiency
[chart] Power efficiency (GFLOPS/watt) per device: GPU GTX 850M (~40 W), GPU Titan (~250 W), GPU GTX 980 (~165 W), CPU Core i7-3960X, CPU Core i7-3770K.
A rugged PC + 850M GPU draws ~150 W. What about embedded low-power systems?
Power efficiency
[chart] Power efficiency (GFLOPS/watt): Nvidia TK1 (SoC, ~5 W) leads the GTX 850M (GPU, ~40 W), GTX Titan (GPU, ~250 W), GTX 980 (GPU, ~165 W), Intel Core i7-3960X (CPU) and Intel Core i7-3770K (CPU).
The TK1 is a full SoC (System on a Chip)! What about compared to other embedded processors?
Power efficiency
[chart] Power efficiency (GFLOPS/watt) of the Tegra K1 compared with two ARM CPUs and an x86 CPU.
FAQ: How do they compare to FPGAs?
Power efficiency
[chart] Power efficiency (GFLOPS/watt) including FPGAs. An apples-to-oranges comparison: reconfigurable hardware vs. a general-purpose processor.
Next-generation FPGAs employ “hardened floating-point DSP blocks” (Altera).
Client benefits
Scalability
[diagram] Simplified CPU model vs. simplified GPU model.
[chart] Number of execution units (FPUs*) over time: Nvidia GPUs grow past 3000 units while Intel CPUs stay far below.
*FPU = Floating Point Unit
 GPUs are built on a scalable hardware platform.
 Software frameworks enforce scalability.
Scalability: Client example
We developed a complex signal-processing application, consisting of 10,000+ lines of code, utilizing GPU computing.
[chart] Scalability with compute power: achieved performance tracks GPU compute performance across the Nvidia GT 240 (2008), GTX 460M (2010) and GTX 480 (2011).
[chart] Scalability with bandwidth: achieved performance tracks GPU bandwidth across the same three GPUs.
Increased utilization on newer generations using identical source code 
GPU performance scales very well on new GPUs
Deployment of complex algorithms
CPU-like flexibility + more:
1. C++ support
2. Recursion
3. Dynamic tree structures
4. Regular cache hierarchy (L1 ↔ L2 ↔ RAM)
5. Very large RAM size (up to 12 GB)
6. 2D spatial cache and manual “scratch pad” cache
7. Completely dynamic run-time memory allocation
8. Good hardware support for atomics
9. Hardware-accelerated sin, cos, sqrt, 1/sqrt, ...
10. IEEE 754-2008 32-bit and 64-bit floating-point support
 Together these features make it possible to deploy very complex algorithms on the GPU.
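As a minimal sketch of two of the CPU-like features in the list above (recursion and dynamic run-time allocation on the device), the kernel below recursively sums an implicit binary tree and uses device-side `new`/`delete`. The tree-sum example itself is hypothetical, not from the original deck; it assumes a GPU of compute capability 2.0 or later and omits error checking for brevity.

```cuda
#include <cstdio>

// Recursive device function: depth-first sum over an implicit binary tree
// whose node i has children 2i and 2i+1.
__device__ int treeSum(int node, int depth) {
    if (depth == 0) return node;              // leaf node
    return treeSum(2 * node, depth - 1)       // left subtree
         + treeSum(2 * node + 1, depth - 1);  // right subtree
}

__global__ void kernel(int *out) {
    // Completely dynamic run-time allocation on the device heap.
    int *scratch = new int[1];
    scratch[0] = treeSum(1, 3);  // sums leaves 8..15 -> 92
    out[0] = scratch[0];
    delete[] scratch;
}

int main() {
    int *d_out, h_out;
    cudaMalloc(&d_out, sizeof(int));
    kernel<<<1, 1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("tree sum = %d\n", h_out);  // prints "tree sum = 92"
    cudaFree(d_out);
    return 0;
}
```

Neither recursion nor device-side heap allocation was possible on early GPUs; their availability is part of what makes "very complex algorithms" deployable.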
Rapid prototyping
(1) Algorithm design: design the new algorithm
(2) FPGA prototype: develop a real-time prototype
(3) Validation & field testing: test in a real environment
(4) Modify the design, and repeat
Rapid prototyping
Algorithm design  GPU prototype  Validation & field testing  Embedded deployment
The exact same GPU code can be reused on all platforms with perfect scaling: develop on a high-end PC (~200 W) to reach real-time quickly, then deploy on a PC (50-100 W) or a SoC (3-10 W).
“A project that took 2 GPU developers 3 months, we had 7 FPGA developers work on for 11 months. We've seen our development cycle shrink by more than 10x ... this is the benefit of being able to work in plain C code.”
- Defense client
Embedded GPU-accelerated hardware
Embedded hardware: High-end
Cores: 640 FPUs
Performance: 1130 GFLOPS
Bandwidth: 80 GB/s
I/O: PCI Express 3.0 x16
Max TDP: 45 W
Form factor: 3U
Embedded hardware: High-end
GE Intelligent Platforms IPN251: the same GPU as on the last slide, integrated with a CPU.
Name: IPN251
CPU: Quad-core Intel Core i7-3610
CPU arch: Ivy Bridge
CPU RAM: 16 GB ECC
GPU: GM107
I/O: PCIe 3.0 x16
Form factor: 6U
OS drivers: Linux / Windows
TDP: 120 W
Embedded hardware: Tegra K1 SoC
Tesla HPC cluster GPU: 2880 cores @ 225 W.
Desktop & cluster compute capabilities move directly into mobile:
 9 years of platform maturing and library code now available on mobile.
Tegra K1 GPU: 1 streaming multiprocessor  192 cores.
Embedded hardware: Tegra K1 SoC
[diagram] Tegra K1 SoC block diagram: 4 ARM cores with L2 caches and a 192-core GPU share a memory controller (LPDDR3), alongside USB, UART, PCIe 2.0, HDMI, GbE, H.264, GPIO and CSI/DSI blocks.
 326 GFLOPS
 ~5 watt
Embedded hardware: Tegra K1 SoC
 COMe Type 10 module (55x84 mm), 2 GB LPDDR3. “Typical draw during load is less than 6 watts” - Dustin Franklin, GE-IP
 MXC module (70x85 mm): Artix-7 FPGA, HD-SDI input & output, 8 GB LPDDR3
Programming environment
[diagram] Software stack: Application  CUDA / OpenCL / OpenACC / libraries  graphics card.
CUDA & OpenCL provide relative programming ease.
Programming environment
CUDA spans supercomputers, servers/cloud, entertainment, and embedded: the same GPU codes AND libraries run on all platforms.
Programming environment: Tegra K1
 Rapid development in a desktop environment.
 Benchmark and tweak on a TK1 dev board: cross-compile (or compile natively) the CUDA application from a Linux host, then profile & remote-debug. The profiler gives performance hints and an application timeline to find your bottlenecks.
 Deploy on rugged hardware.
Programming environment: Accelerated libraries
 Linear algebra (FFT, BLAS, SPARSE, matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
 Numerical & math (RAND, statistics): NVIDIA Math Lib, NVIDIA cuRAND
 Data structures & AI (sort, scan, zero sum): GPU AI - path finding, GPU AI - board games
 Visual processing (image & video): NVIDIA NPP, NVIDIA Video Encode
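To illustrate how little host code an accelerated library call needs, here is a minimal cuBLAS sketch: a single-precision AXPY (y = a*x + y) on device data. This example is ours, not from the deck; it assumes a CUDA toolkit is installed and omits error checking for brevity.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 4;
    float hx[n] = {1, 2, 3, 4}, hy[n] = {10, 20, 30, 40};
    float *dx, *dy, alpha = 2.0f;

    // Copy the operands to device memory.
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    // One library call runs the whole operation on the GPU.
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);  // dy = alpha*dx + dy
    cublasDestroy(handle);

    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%g %g %g %g\n", hy[0], hy[1], hy[2], hy[3]);  // prints 12 24 36 48
    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

Because these libraries ship for every CUDA platform, the same call works unchanged on a desktop GPU or a Tegra K1.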
Programmability: Easy to program?
“GPUs have outpaced standard CPUs in performance, and have become as easy, if not easier to program than multicore CPUs.”
- Prof. Jack Dongarra, Oak Ridge National Laboratory
Development time (months): GPU GT200: 2 | GPU GF100: 2 | Virtex-4 FPGA: 30 | Virtex-5 FPGA: 24
Source: Karl Pauwels, Matteo Tomasi, Javier Diaz, Eduardo Ros, and Marc M. Van Hulle, “A Comparison of FPGA and GPU for Real-Time Phase-Based Optical Flow, Stereo, and Local Image Features,” IEEE.
CPUs, GPUs, FPGAs
[diagram] CPU  GPU  FPGA: programming ease decreases from CPU to FPGA while efficiency increases.
Hardware and software trends
Trends: Accelerated Processing Units (APU)
Everyone's doing it! Intel, AMD, Nvidia and ARM all integrate CPU and GPU on a single chip, spanning 1-150 W:
 AMD Trinity: 17-100 W
 Intel Haswell: 10-90 W
 AMD / Xbox One: 150 W
 AMD / Sony PS4: 150 W (GPU > CPU in die area)
 Apple A7: 1-4 W
Trends: Programmability
[diagram] Memory hierarchy: before vs. now.
Hardware and software trends
Dynamic parallelism: the GPU can call itself.
[diagram] Before: every kernel launch comes from the CPU. Now: kernels can launch child kernels directly on the GPU.
This minimizes the need for communication between CPU and GPU:
• Easier for the developer.
• Less strain placed on the host CPU.
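A hedged sketch of what "the GPU can call itself" looks like in code: a parent kernel launching a child grid from the device, so only one launch ever crosses the CPU-GPU boundary. The doubling example is hypothetical; it assumes a GPU of compute capability 3.5+ compiled with relocatable device code (e.g. nvcc -arch=sm_35 -rdc=true), and omits error checking.

```cuda
#include <cstdio>

__global__ void child(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;  // the child grid does the actual work
}

__global__ void parent(int *data, int n) {
    // The GPU picks the launch configuration itself; no round trip
    // to the host CPU is needed.
    child<<<(n + 255) / 256, 256>>>(data, n);
}

int main() {
    const int n = 8;
    int h[n], *d;
    for (int i = 0; i < n; ++i) h[i] = i;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    parent<<<1, 1>>>(d, n);  // the single host-side launch
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[7] = %d\n", h[7]);  // prints "h[7] = 14"
    cudaFree(d);
    return 0;
}
```

Before dynamic parallelism, the parent would have had to return control to the CPU just to decide on and issue the second launch.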
Hardware and software trends
APU: closer integration between CPU and GPU.
• Higher CPU-GPU bandwidth
• Lower CPU-GPU latency
Programmability: GPUs are becoming easier to program.
• More programmer-friendly memory hierarchy
• Dynamic parallelism: the GPU can call itself
Hardware news
The Tegra K1 follow-up has been announced: Tegra X1
 512 GFLOPS FP32 / 1024 GFLOPS FP16
 H.265 encode & decode hardware support
 Already on the roadmap for embedded applications
Hardware news
Nvidia and AMD are releasing HBM memory for their next-generation high-end GPUs:
 Smaller physical form factor
 Much higher memory capacity
 Much higher bandwidth
 Less power consumption (6 pJ/bit vs. 22 pJ/bit)
Jimmy Pettersson
GPU computing specialist
High Performance Consulting
Nvidia-recommended consultants
[email protected]
BONUS SLIDES
Triple Tegra comparison
 45 W GPU + SBC (single-board computer): ~60-150 W systems
 Triple Tegra: full system in under 10 W (~5-10 W); typical load TDP 5-7 W (tested)
Bonus
[diagram] Simplified GPU layout: compute units (CU) paired with caches.