GPU computing for embedded applications

High Performance Consulting (HPC), GPU computing solutions provider. Authorized Nvidia Partners.

"HPC was the sole solutions provider for developing our Gradientech Tracking Tool. The high quality of the end product means scientists now have an excellent software tool for analyzing chemotactic cell responses in areas such as cancer research."
Sara Thorslund, CEO, Gradientech

"HPC was a key contributor to our new real-time video enhancement product for interventional radiology. In medical devices, performance and reliability are of paramount importance, and HPC helped us reach our goals with their deep GPGPU expertise, great team skills, and their dedication to our success. We look forward to future collaborations with HPC."
Arto Järvinen, R&D Director, ContextVision

"HPC's contributions to our LDI 5s Data Path software meant we didn't just reach our performance goals, we exceeded them. Their input on GPU programming in particular had a significant impact on our product's compute performance. I strongly recommend employing HPC for any performance-oriented software development."
Anders Österberg, Expert, Innovation Field High Performance Computing, Micronic Mydata

"2 GPU developers from HPC took 3 months for a project that had occupied 7 FPGA developers for 11 months. We've seen our development cycle shrink by more than 10x."
EU-based Aerospace/Defence Client

Agenda:
- Performance and efficiency
- Client benefits
- Embedded GPU hardware solutions
- Programming environment
- Hardware and software trends

Performance and efficiency

[Chart: performance over time, 2005-2014. Nvidia GPUs climb toward 6,000 GFLOPS while Intel CPUs remain far below. GFLOPS = giga floating-point operations per second.]

[Chart: power efficiency over time, 2005-2014. Nvidia GPUs reach roughly 25-30 GFLOPS/watt; Intel CPUs stay in the single digits.]

[Chart: power efficiency by part. GPU GTX 850M (~40 W; a rugged PC with an 850M draws ~150 W), GPU GTX Titan (~250 W), GPU GTX 980 (~165 W), CPU Core i7-3960X, CPU Core i7-3770K. Open question: what about embedded low-power systems?]

[Chart: the same comparison with the Nvidia TK1 SoC added. At ~5 W the TK1 reaches roughly 65 GFLOPS/watt, well ahead of every discrete part shown.] And this is a full SoC (system on a chip)! What about compared to other embedded processors?

[Chart: the TK1 against two ARM CPUs and an x86 CPU; the TK1 leads by a wide margin.]

FAQ: How do they compare to FPGAs? It is an apples-to-oranges comparison: reconfigurable hardware versus a general-purpose processor. Note that next-generation FPGAs employ "hardened floating point DSP blocks" (Altera).

Client benefits

Scalability. [Diagram: simplified CPU model vs. simplified GPU model. Chart: number of execution units (FPUs, i.e. floating-point units) over time; Nvidia GPUs grow past 3,000 units while Intel CPUs stay nearly flat.] GPUs are built on a scalable hardware platform.
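To make the scalability claim concrete, here is a minimal sketch (not from the presentation) of the standard CUDA "grid-stride loop" pattern. The kernel makes no assumption about how many threads actually run in parallel, and the launch sizes itself from the device that is present, so identical source code can fill a small mobile GPU or a large desktop GPU. The kernel name and the oversubscription factor are illustrative choices, not anything specified by HPC.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride loop: each thread processes elements i, i + stride, i + 2*stride, ...
// where stride is the total number of launched threads. The code therefore
// works, and scales, for any grid size the hardware can run.
__global__ void scaleArray(float* data, float factor, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        data[i] *= factor;
    }
}

int main()
{
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    // Size the grid from the GPU actually installed, so the same binary
    // automatically uses more parallelism on a bigger device.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocks = prop.multiProcessorCount * 8;  // illustrative oversubscription

    scaleArray<<<blocks, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

This pattern is one reason the client example below scales across GPU generations without source changes: the work decomposition is decoupled from the physical core count.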
Software frameworks enforce scalability.

Scalability: client example. We developed a complex signal-processing application consisting of 10,000+ lines of code utilizing GPU computing. [Charts: achieved performance tracking GPU compute performance, and achieved performance tracking GPU bandwidth, across the Nvidia GT 240 (2008), GTX 460M (2010), and GTX 480 (2011).] Utilization increased on each newer generation using identical source code: GPU performance scales very well on new GPUs.

Deployment of complex algorithms. GPUs offer CPU-like flexibility, and more:
1. C++ support
2. Recursion
3. Dynamic tree structures
4. Regular cache hierarchy (L1 ↔ L2 ↔ RAM)
5. Very large RAM size (up to 12 GB)
6. 2D spatial cache and manual "scratch pad" cache
7. Completely dynamic run-time memory allocation
8. Good hardware support for atomics
9. Hardware-accelerated sin, cos, sqrt, 1/sqrt...
10. IEEE 754-2008 32-bit and 64-bit floating-point support

Rapid prototyping of a very complex algorithm runs in a loop: (1) design the new algorithm, (2) develop a real-time prototype, (3) test it in a real environment (validation and field testing), (4) modify the design, and repeat. With an FPGA prototype each iteration is slow. With a GPU prototype, the exact same GPU code can be reused on all platforms with perfect scaling, from validation and field testing through to embedded deployment (SoC: 3-10 W; PC: 50-100 W; development machine: ~200 W). Develop on a high-end PC and reach real-time quickly.

"A project that took 2 GPU developers 3 months, we had 7 FPGA developers work on for 11 months. We've seen our development cycle shrink by more than 10x... this is the benefit of being able to work in plain C code."
Defense client

Embedded GPU-accelerated hardware

Embedded hardware: high-end
- Cores: 640 FPUs
- Performance: 1,130 GFLOPS
- Bandwidth: 80 GB/s
- I/O: PCI Express 3.0 x16
- Max TDP: 45 watts
- Form factor: 3U

Embedded hardware: high-end. GE Intelligent Platforms IPN251 (the same GPU as on the previous slide, integrated with a CPU):
- Name: IPN251
- CPU: quad-core Intel Core i7-3610
- CPU architecture: Ivy Bridge
- CPU RAM: 16 GB ECC
- GPU: GM107
- I/O: PCIe 3.0 x16
- Form factor: 6U
- OS drivers: Linux/Windows
- TDP: 120 watts

Embedded hardware: Tegra K1 SoC. The Tesla HPC cluster GPU packs 2,880 cores at 225 watts; the Tegra K1 brings desktop and cluster compute capabilities directly into mobile. Nine years of platform maturing and library code are now available on mobile.
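As an illustration of "library code now available on mobile", here is a minimal sketch (not from the presentation) of calling cuFFT, one of the accelerated libraries named later in the deck. The same host code links against cuFFT on a Tesla server or a Tegra K1 board; only the compile target changes. Error handling and input data are omitted for brevity.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

// 1D complex-to-complex FFT of 4096 points via cuFFT.
int main()
{
    const int n = 4096;
    cufftComplex* d_signal = nullptr;
    cudaMalloc(&d_signal, n * sizeof(cufftComplex));
    // (copy or generate input samples into d_signal here)

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);                    // one transform
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place FFT
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_signal);
    return 0;
}
```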
Tegra K1 GPU: 1 streaming multiprocessor, 192 cores.

Embedded hardware: Tegra K1 SoC. [Block diagram: 4 ARM cores with L2 caches, a memory controller with LPDDR3, and peripherals including USB, UART, PCIe 2.0, HDMI, GbE, H.264 codec, GPIO, and CSI/DSI.]
- 192-core GPU
- 4 ARM cores
- 326 GFLOPS
- ~5 watts

Carrier modules:
- COMe Type 10 (55 x 84 mm), 2 GB LPDDR3. "Typical draw during load is less than 6 watts" (Dustin Franklin).
- GE-IP MXC (70 x 85 mm), Artix-7 FPGA, HD-SDI input and output, 8 GB LPDDR3.

Programming environment

An application reaches the graphics card through CUDA, OpenCL, OpenACC, or libraries; CUDA and OpenCL provide relative programming ease. The same GPU code and libraries run on all platforms: supercomputer, server/cloud, entertainment, and embedded.

Programming environment: Tegra K1. Rapid development happens in a desktop environment: build the application against CUDA on a Linux host, cross-compile, profile and remote-debug on a TK1 development board, benchmark and tweak (the profiler provides performance hints, and the application timeline helps you find your bottlenecks), then deploy on rugged hardware (native compile).

Programming environment: accelerated libraries
- Linear algebra (FFT, BLAS, sparse, matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
- Numerical and math (RAND, statistics): NVIDIA Math Lib, NVIDIA cuRAND
- Data structures and AI: sort, scan, zero sum; GPU AI for path finding and board games
- Visual processing (image and video): NVIDIA NPP, NVIDIA Video Encode

Programmability: easy to program? "GPUs have outpaced standard CPUs in performance, and have become as easy, if not easier to program than multicore CPUs." (Prof. Jack Dongarra, Oak Ridge National Laboratory)

Development time (months):
- GPU GT200: 2
- GPU GF100: 2
- Virtex-4 FPGA: 30
- Virtex-5 FPGA: 24
Source: "A Comparison of FPGA and GPU for Real-Time Phase-Based Optical Flow, Stereo, and Local Image Features", Karl Pauwels, Matteo Tomasi, Javier Diaz, Eduardo Ros, and Marc M. Van Hulle, Senior Member, IEEE.

[Diagram: CPUs, GPUs, and FPGAs ranked by programming ease (CPU easiest) against efficiency (FPGA highest); GPUs sit between the two.]

Hardware and software trends

Trends: accelerated processing units (APUs), i.e. CPU and GPU on one chip. Everyone is doing it: Intel, AMD, Nvidia, and ARM, spanning roughly 1-150 watts. Examples: AMD Trinity (17-100 W), AMD/Xbox One (150 W), Intel Haswell (10-90 W), AMD/Sony PS4 (150 W), Apple A7 (1-4 W). In several of these designs the GPU portion is larger than the CPU portion.

Trends: programmability. The memory hierarchy has become more programmer friendly, and dynamic parallelism means the GPU can call itself. This minimizes the need for communication between CPU and GPU: it is easier for the developer, and it places less strain on the host CPU.

In summary:
- APU: closer integration between CPU and GPU, giving higher CPU-GPU bandwidth and lower CPU-GPU latency.
- Programmability: GPUs are becoming easier to program, with a more programmer-friendly memory hierarchy and dynamic parallelism (the GPU can call itself).

Hardware news
- The Tegra K1 follow-up has been announced: the Tegra X1, with 512 GFLOPS FP32, 1,024 GFLOPS FP16, and hardware H.265 encode and decode support. It is already on the roadmap for embedded applications.
- Nvidia and AMD are releasing HBM memories for their next-generation high-end GPUs: smaller physical form factor, much higher memory capacity, much higher bandwidth, and lower power consumption (6 pJ/bit vs. 22 pJ/bit).

Jimmy Pettersson, GPU computing specialist, High Performance Consulting, Nvidia recommended consultants. [email protected]

BONUS SLIDES

Triple-Tegra comparison: a GPU + SBC (single-board computer) system draws ~60-150 watts (the GPU alone has a 45-watt TDP), whereas a Tegra-based full system runs in under 10 watts: typical load 5-7 watts (tested), ~5-10 watts overall.

Bonus: [Diagram: GPU compute units (CUs), each paired with its own cache.]
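The dynamic-parallelism trend described above ("the GPU can call itself") can be sketched in a few lines. This is a minimal illustrative example, not code from the presentation; it requires a GPU of compute capability 3.5 or newer and compilation with relocatable device code (nvcc -rdc=true). The kernel names are invented for the sketch.

```cuda
#include <cstdio>

// Child kernel: the second stage of work, launched from the device.
__global__ void childKernel(int stage)
{
    printf("stage %d, thread %d\n", stage, threadIdx.x);
}

// Parent kernel: decides on the device that more work is needed and
// launches it directly, with no round-trip through the host CPU.
__global__ void parentKernel()
{
    if (threadIdx.x == 0) {
        childKernel<<<1, 4>>>(1);
    }
}

int main()
{
    parentKernel<<<1, 32>>>();
    cudaDeviceSynchronize();  // wait for parent and child to finish
    return 0;
}
```

Because the follow-up launch happens on the GPU, the host CPU neither synchronizes nor re-dispatches between the stages, which is exactly the reduced CPU-GPU communication the slide claims.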