
GPUs and GPU Programming
Bharadwaj Subramanian, Apollo Ellis, Keshav Pingali
Imagery taken from the Nvidia Dawn Demo
Slides on GPUs, CUDA, and Programming Models by Apollo Ellis
Slides on OpenCL by Bharadwaj Subramanian
A GPU is a Multi-core Architecture
• High throughput is prioritized over low-latency single-task execution
• Large collection of fixed-function and software-programmable resources
Graphics Pipeline
• A virtual scene is rendered from the viewpoint of a virtual camera
• Direct3D and OpenGL formulate the process as a pipeline of operations on fundamental entities:
– Vertices
– Primitives
– Fragments
– Pixels
• Data flows in entity streams between pipeline
stages.
Graphics Pipeline
• GPU Front End
– Otherwise known as Vertex Generator
• Takes in vertex descriptors: Location plus Type (Line,
Triangle, Quad, Poly)
– Attributes (Normal, Texture Coordinate, Color etc.)
• Performs a prefetch on the vertex data and constructs a
vertex stream.
Graphics Pipeline
• Vertex Processing
– Programmable vertex shader execution
• Typically converts from world space to camera space
• Languages include Cg and HLSL
• Primitive Assembly
– Converts from vertices to primitives
• Rasterization
– Primitive Sampler in Screen space
– Fragment Generator
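As a concrete illustration of the vertex-processing step above, the sketch below applies a world-to-camera transform as a 4x4 matrix-vector product. It is written as a CUDA device helper purely for illustration; the function name, the row-major matrix layout, and the use of CUDA rather than a real shading language are assumptions, not part of the pipeline description above.

// Minimal sketch of the world-to-camera transform a vertex shader
// typically performs: a 4x4 view matrix times a homogeneous position.
// Assumes a row-major matrix; names are illustrative only.
__device__ float4 worldToCamera(const float view[16], float4 v)
{
    return make_float4(
        view[0]*v.x  + view[1]*v.y  + view[2]*v.z  + view[3]*v.w,
        view[4]*v.x  + view[5]*v.y  + view[6]*v.z  + view[7]*v.w,
        view[8]*v.x  + view[9]*v.y  + view[10]*v.z + view[11]*v.w,
        view[12]*v.x + view[13]*v.y + view[14]*v.z + view[15]*v.w);
}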
Graphics Pipeline
• Fragment Processing
– Programmable fragment shader execution
• Texture Lookup and Light Interaction Calculation
• Cg and HLSL
• ROP
– Raster Operations (Depth Buffer Cull, Alpha Blend)
– Calculate each fragment’s contribution to given
pixels
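To make the ROP bullet concrete, here is a minimal sketch of a depth-buffer cull followed by standard "over" alpha blending for one fragment. The function name and the convention that alpha lives in the w component are assumptions for illustration; real ROP hardware is fixed-function, not user code.

// Sketch of per-fragment raster operations: depth cull, then alpha blend.
// src/dst are RGBA colors with alpha in .w; smaller depth means closer.
__device__ float4 ropBlend(float4 src, float srcDepth,
                           float4 dst, float dstDepth)
{
    if (srcDepth > dstDepth)   // depth-buffer cull: fragment is occluded
        return dst;
    float a = src.w;           // "over" blend: src contributes a, dst keeps (1 - a)
    return make_float4(a*src.x + (1.0f - a)*dst.x,
                       a*src.y + (1.0f - a)*dst.y,
                       a*src.z + (1.0f - a)*dst.z,
                       a       + (1.0f - a)*dst.w);
}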
Shader Programming
• Fragment and vertex processing are defined by shader programs written in Cg, GLSL, or HLSL
• Compiled at runtime to binary
• Or compiled offline and then transformed at
runtime
• C-like function that processes a single input and
output in isolation
• Run in parallel on multiple shader cores
• Wide SIMD instructions are used because the same instruction stream is applied to many elements
Parallel Processing and Encapsulation
• Task parallelism is available across stages
– E.g., vertices are processed while fragments are processed, etc.
• Data parallelism is available across stream entities
– Each entity is independent of the others because of the offloading of tasks onto the fixed-function units
• Fixed-function units encapsulate hard-to-parallelize work in optimized hardware components
Still A Scheduling Problem
• Processing and on-chip resources must be
dynamically reallocated to pipeline stages
• Depends on the current loads at different
stages
• Deciding whether different stages get more cores or more cache becomes an issue
• Hardware multithreading provides a solution: it hides thread stalls and distributes resources more evenly
CUDA
• CUDA is a more general data-parallel model
– No pipeline
• Clusters of threads
• Scatter operations (multiple writes)
• Gather operations (multiple reads)
• Application-based decomposition of threads
• Threads can share data and communicate with each other
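A minimal sketch of what gather (multiple reads) and scatter (multiple writes) look like in a CUDA kernel; the kernel name, arrays, and indexing pattern are illustrative assumptions.

// Gather: each thread reads from a data-dependent location.
// Scatter: each thread writes to a data-dependent location.
__global__ void gather_scatter(const float* in, const int* idx,
                               float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[idx[i]];      // gather: arbitrary read
        out[idx[i]] = 2.0f * v;    // scatter: arbitrary write
    }
}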
CUDA Programming Model
• GPU is viewed as a coprocessor with DRAM
and many parallel threads
• Data parallel portions of applications can be
offloaded onto this coprocessor
• C on the GPU
– Global and Shared Variables
– Pointers and Explicit Memory Allocation
– OpenGL and DirectX interoperability
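The coprocessor view above implies explicit device memory management. The sketch below shows the usual pattern with the CUDA runtime API (cudaMalloc, cudaMemcpy, kernel launch); the kernel body and sizes are placeholder assumptions.

#include <cuda_runtime.h>

// Placeholder data-parallel kernel: scale every element by s.
__global__ void scale(float* data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

// Offload pattern: allocate GPU DRAM, copy in, launch, copy back.
void offload(float* host, int n)
{
    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}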
Tesla Architecture
• Scalable array of multithreaded Streaming Multiprocessors (SMs) supporting 768 to 12,288 concurrent threads
Kernels
• C/C++: simple functions or full programs
• Consist of thread blocks and grids
– Thread Block
• A set of concurrent threads that cooperate through barriers and shared memory
– Grid
• A set of thread blocks that are independent from each other
• Multiple grids per kernel
Syntax Example
• __global__
void my_par_func(float a) {
    // do something with a
}

int dimGrid = 256, dimBlock = 256;
my_par_func<<<dimGrid, dimBlock>>>(5.0f);
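Building on the syntax example, the sketch below shows the cooperation mentioned on the Kernels slide: threads in a block share data through __shared__ memory and synchronize with __syncthreads() barriers. The reduction kernel and the fixed block size of 256 are illustrative assumptions.

// Each block sums 256 inputs cooperatively (assumes blockDim.x == 256).
__global__ void block_sum(const float* in, float* block_sums)
{
    __shared__ float tile[256];
    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                          // barrier: tile fully loaded

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];  // combine pairs in shared memory
        __syncthreads();                      // barrier after each step
    }
    if (tid == 0)
        block_sums[blockIdx.x] = tile[0];     // one partial sum per block
}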
Execution
• SIMT (Single Instruction, Multiple Thread): the scheduler schedules warps, sets of concurrent threads, onto SM units
• A warp is scheduled independently of other warps
• If a warp's threads diverge in control flow, each path is executed in turn, disabling the threads that do not take it
• No recursion is allowed, to avoid stack space problems
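A small sketch of the divergence behavior described above: when threads of one warp disagree on a branch, the two paths run one after the other with the non-participating threads masked off. The kernel and data are illustrative assumptions.

// Both sides of the branch are serialized for a warp whose threads
// disagree on flags[i]; masked-off threads simply idle.
__global__ void divergent(const int* flags, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (flags[i])
            out[i] = 1.0f;    // path A, taken threads active
        else
            out[i] = -1.0f;   // path B, the remaining threads active
    }
}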
SIMD vs SIMT
• CUDA utilizes the wide SIMD units
• However, SIMD is not exposed to the programmer
• Instead, the SIMD units are used by multiple threads at once
• SIMT is how CUDA makes use of SIMD
CUDA Wrap Up
• More general model using the same hardware
• GPU is a CUDA coprocessor
• Tesla architecture: 768 to 12,000+ threads
• C/C++ syntax
• Serial branching
• No recursion
• SIMD used by SIMT
Another Model GRAMPS
• General Runtime Architecture for Multicore
Parallel Systems
• A programming model for graphics pipelines
• Allows for custom pipelines mixing fixed
function and programmable stages
• Data is exchanged using queues and buffers
• Motivation comes from hybrid applications
– REYES Rasterization and Ray Tracing
Execution Graphs
• Analog of a GPU pipeline
• Made up of stages
• Provides scheduling information
• Not limited to execution DAGs
– Cycles are not forbidden
• Forward progress is not guaranteed
• Flexibility presumably outweighs giving up the assurance that programs are well behaved
Stages
• Types: SHADER, THREAD, FIXEDFUNCTION
• Operate asynchronously exposing parallelism
• Indicate similarities in data access and
execution characteristics for efficient
processing
• Useful when the benefits of coherent execution outweigh the cost of deferred processing
Shader
• Short-lived, run-to-completion computations
• Per element programs
• Push operation introduced for conditional
output
• Otherwise queue inputs and outputs are
managed automatically
• Shader instances are scheduled in packets
similar to GPU execution
Threads and Fixed Function
• Threads
– Similar to CPU threads designed for task
parallelism
– Must be manually parallelized by the application
– Useful for repacking data between Shader stages
and processing bulk chunks of data where sharing
or cross communication is needed
• Fixed Function
– Hardware unit wrappers
Buffers and Queues
• Buffers
– Essentially shared memory across stages
• Queues
– Packets are the primitive data format of the queue
defined at creation
– Opaque packets: for data chunks that need not be interpreted
– Collection packets: for shader group dispatch
• Queue Manipulation
– Thread/Fixed Stages
– Shader Stages
Thread Fixed Stages
• reserve-commit
– reserve: returns to the caller a reference to one or more contiguous packets; a reservation on them is also acquired
– commit: a notification that releases the referenced data back to the system
– An input commit means the packet has been consumed
– An output commit means the packet can go downstream
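These slides do not show the actual GRAMPS interface, so the following is only a hypothetical sketch of the reserve-commit protocol a Thread (or fixed-function) stage would follow; Queue, Packet, reserve(), and commit() are invented stand-ins for illustration.

// Hypothetical stand-ins; not a real GRAMPS API.
struct Packet { float data[16]; };

struct Queue {
    Packet storage[64];
    int    next = 0;
    Packet* reserve() { return &storage[next++ % 64]; }  // acquire a reservation
    void    commit(Packet*) { /* release the reservation to the system */ }
};

// A Thread stage's main loop under the reserve-commit protocol.
void thread_stage(Queue& in, Queue& out, int num_packets)
{
    for (int p = 0; p < num_packets; ++p) {
        Packet* in_pkt  = in.reserve();    // reference to a contiguous input packet
        Packet* out_pkt = out.reserve();   // reserve space for the output packet
        for (int j = 0; j < 16; ++j)
            out_pkt->data[j] = in_pkt->data[j];  // repack/process the bulk chunk
        in.commit(in_pkt);    // input commit: packet has been consumed
        out.commit(out_pkt);  // output commit: packet may flow downstream
    }
}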
Shader Stages
• Queue ops are transparent to the user
• As input packets arrive output reservations
are attained
• When all shader instances for a collection packet are done, the commits happen automatically
• Queue Sets are introduced
– Groups of queues viewed as single queues for
sharing among shaders
Summary GRAMPS
• Application creates stages, queues, and
buffers.
• Queues and buffers are bound to stages
• Computation proceeds according to execution
graphs
• Computation graphs are fully programmable
• Dynamic aggregation of work at runtime