CS 179: GPU Programming

CS 179: Lecture 3
A bold problem!
 Suppose we have a polynomial P(r) with coefficients c0, …, c(n-1), given by:
    P(r) = c0 + c1*r + c2*r^2 + … + c(n-1)*r^(n-1)
 We want, for inputs r0, …, r(N-1), the sum:
    P(r0) + P(r1) + … + P(r(N-1))
A kernel
 The problem: every thread does an unsynchronized read-modify-write of *output – a race condition! (sketch below)
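 A minimal sketch of such a kernel – the name polynomialSumNaive, the argument list, and the indexing are illustrative assumptions, not the slide's actual code:
__global__ void polynomialSumNaive(const float *coeffs, const float *inputs,
                                   float *output, int n, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        // Evaluate P(r_i) = c0 + c1*r + ... + c(n-1)*r^(n-1)
        float r = inputs[i];
        float value = 0.0f;
        float power = 1.0f;
        for (int k = 0; k < n; k++) {
            value += coeffs[k] * power;
            power *= r;
        }
        *output += value;   // unsynchronized read-modify-write: a race!
    }
}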
A more correct kernel
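 A corrected sketch under the same assumptions – the only change is that the accumulation uses atomicAdd(), so concurrent updates are no longer lost (float atomicAdd() requires compute capability 2.0 or higher):
__global__ void polynomialSumAtomic(const float *coeffs, const float *inputs,
                                    float *output, int n, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float r = inputs[i];
        float value = 0.0f;
        float power = 1.0f;
        for (int k = 0; k < n; k++) {
            value += coeffs[k] * power;   // c_k * r^k
            power *= r;
        }
        atomicAdd(output, value);   // correct, but atomics to one address serialize
    }
}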
Performance problem!
 Serialization of atomicAdd(): atomic updates to the same address execute one at a time, so all N additions form a serial bottleneck
Parallel accumulation
Accumulation methods – A visual
 Purely atomic: every thread atomicAdd()s its value directly into *output
 “Linear accumulation”: each thread stores its value in shared memory; thread 0 of the block sums them and does a single atomicAdd() into *output (sketch below)
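 A sketch of linear accumulation, assuming a fixed power-of-two block size (BLOCK_SIZE, the kernel name, and the argument list are illustrative):
#define BLOCK_SIZE 512   // assumed block size for this sketch

__global__ void polynomialSumLinear(const float *coeffs, const float *inputs,
                                    float *output, int n, int N) {
    __shared__ float partial_outputs[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread evaluates P(r_i), or contributes 0 past the end of the input.
    float value = 0.0f;
    if (i < N) {
        float r = inputs[i];
        float power = 1.0f;
        for (int k = 0; k < n; k++) {
            value += coeffs[k] * power;
            power *= r;
        }
    }
    partial_outputs[threadIdx.x] = value;
    __syncthreads();

    // Thread 0 linearly accumulates the block's values: one atomicAdd() per
    // block instead of one per thread.
    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int j = 0; j < blockDim.x; j++)
            sum += partial_outputs[j];
        atomicAdd(output, sum);
    }
}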
GPU: Internals
 Multiprocessor – blocks go here
 SIMD processing unit – warps go here
Warp
 32 threads that execute simultaneously!
 A[] + B[] -> C[] problem (see the sketch after this list):
 Warps didn’t matter there – every thread executes the same instructions!
 Warp of threads 0-31:
 A[idx] + B[idx] -> C[idx]
 Warp of threads 32-63:
 A[idx] + B[idx] -> C[idx]
 Warp of threads 64-95:
 A[idx] + B[idx] -> C[idx]
…
Divergence – Example
//Suppose we have a pointer to some floating-point
//value *val...
if (threadIdx.x % 2 == 0)
    *val += 2;   //Branch A
else
    *val /= 3;   //Branch B
 Branches result in different instructions!
Divergence
 What happens:
 Executes normally until if-statement
 Branches to calculate Branch A (the even-indexed threads)
 Goes back (!) and branches to calculate Branch B (the odd-indexed threads)
Calculating polynomial values – does it matter?
 Warp of threads 0-31:
 (calculate some values)
 Warp of threads 32-63:
 (calculate some values)
 Warp of threads 64-95:
 (calculate some values)
 Same instructions! Doesn’t matter!
Linear reduction - does it matter?
 (after calculating values…)
 Warp of threads 0-31:
 Thread 0: Accumulate sum
 Threads 1-31: Do nothing
 Warp of threads 32-63:
 Do nothing
 Warp of threads 64-95:
 Do nothing
 Doesn’t really matter… in this case.
Improving our reduction
 More threads participating in the process!
 “Binary tree”
Improving our reduction
//Let our shared memory block be partial_outputs[]...
synchronize threads before starting...
set offset to 1
while ((offset * 2) <= block dimension):
    if (thread index % (offset * 2) is 0)
            AND you won’t exceed your block dimension:
        add partial_outputs[thread index + offset]
            to partial_outputs[thread index]
    double the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
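 One possible CUDA version of this pseudocode, assuming a power-of-two BLOCK_SIZE and an input array values[] holding one number per thread (names are illustrative):
#define BLOCK_SIZE 512   // assumed power-of-two block size

__global__ void reduceDivergentTree(const float *values, float *output, int N) {
    __shared__ float partial_outputs[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    partial_outputs[threadIdx.x] = (i < N) ? values[i] : 0.0f;
    __syncthreads();   // all values written before the tree starts

    for (int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
        // Only threads whose index is a multiple of (offset * 2) do work,
        // so the active threads are scattered across every warp: divergence.
        if (threadIdx.x % (offset * 2) == 0 &&
            threadIdx.x + offset < blockDim.x) {   // bound check per the pseudocode
            partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
        }
        __syncthreads();   // finish this level before reading the next
    }

    if (threadIdx.x == 0)
        atomicAdd(output, partial_outputs[0]);
}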
Improving our reduction
 What will happen?
 Each warp will do meaningful execution on ~16 threads
 You use lots more warps than you have to!
 A “divergent tree”
 How to improve?
“Non-divergent tree”
//Let our shared memory block be partial_outputs[]...
synchronize threads before starting...
set offset to highest power of 2 that’s less than the block dimension
//For the first iteration, check that you don’t access
//out of range memory
while (offset >= 1):
    if (thread index < offset):
        add partial_outputs[thread index + offset]
            to partial_outputs[thread index]
    halve the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
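 A possible CUDA version, again assuming a power-of-two BLOCK_SIZE (so the first offset is simply blockDim.x / 2 and the out-of-range check disappears); names are illustrative:
#define BLOCK_SIZE 512   // assumed power-of-two block size

__global__ void reduceNonDivergentTree(const float *values, float *output, int N) {
    __shared__ float partial_outputs[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    partial_outputs[threadIdx.x] = (i < N) ? values[i] : 0.0f;
    __syncthreads();

    // Active threads are always the contiguous range [0, offset), so whole
    // warps are either fully active or fully idle until offset drops below 32.
    for (int offset = blockDim.x / 2; offset >= 1; offset /= 2) {
        if (threadIdx.x < offset)
            partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(output, partial_outputs[0]);
}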
“Non-divergent tree”
 Suppose we have our block of 512 threads…
 Warp of threads 0-31:
 Accumulate result
…
 Warp of threads 224-255:
 Accumulate result
 Warp of threads 256-287:
 Do nothing
…
 Warp of threads 480-511:
 Do nothing
“Non-divergent tree”
 Suppose we’re now down to 32 threads…
 Warp of threads 0-31:
 Threads 0-15: Accumulate result
 Threads 16-31: Do nothing
 Much less divergence
 Divergence only occurs within the last active warp, once fewer than 32 threads remain – if at all!
Reduction – Four Approaches
 Atomic only: every thread atomicAdd()s directly into *output
 Linear: thread 0 of each block sums shared memory, then one atomicAdd() per block
 Divergent tree: binary tree in shared memory, but active threads are scattered across warps
 Non-divergent tree: binary tree with contiguous active threads, so warps stay uniform
Notes on files? (An aside)
 Labs and CUDA programming typically have the following files involved (a hypothetical example follows this list):
 ____.cc
 Allows C++ code
 g++ compiles this
 ____.cu
 Allows CUDA syntax
 nvcc compiles this
 ____.cuh
 CUDA header file (declare accessible functions)
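 For instance, a header like the sketch below (polynomial.cuh and cudaPolynomialSum are made-up names) declares a plain C++ wrapper: the ____.cu file defines the wrapper and its kernels and is compiled by nvcc, the ____.cc file includes the header and calls the wrapper and is compiled by g++, and the object files are linked together:
// polynomial.cuh (hypothetical)
#ifndef POLYNOMIAL_CUH
#define POLYNOMIAL_CUH

// Declared with plain C++ types so g++-compiled .cc code can call it;
// the definition, which launches the CUDA kernels, lives in polynomial.cu.
void cudaPolynomialSum(const float *coeffs, const float *inputs,
                       float *output, int n, int N);

#endif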
Big Picture (so far)
 CUDA allows large speedups!
 Parallelism with CUDA: Different from CPU parallelism
 Threads are “small”
 Memory to think about
 Increase speedup by thinking about problem constraints!
 Reduction
 (and more!)
Big Picture (so far)
 Steps in these two problems are widely applicable!
 Dot product (see the sketch after this list)
 Norm calculation
 (and more)
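 For example, a dot product only changes the per-thread value – multiply two elements instead of evaluating a polynomial – and then reduces exactly as before (a sketch under the same assumed block size; the name dotProduct is illustrative):
#define BLOCK_SIZE 512   // assumed power-of-two block size

__global__ void dotProduct(const float *A, const float *B, float *output, int N) {
    __shared__ float partial_outputs[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Per-thread value: A[i] * B[i]; everything after this line is the same
    // non-divergent tree reduction shown earlier.
    partial_outputs[threadIdx.x] = (i < N) ? A[i] * B[i] : 0.0f;
    __syncthreads();

    for (int offset = blockDim.x / 2; offset >= 1; offset /= 2) {
        if (threadIdx.x < offset)
            partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(output, partial_outputs[0]);
}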
Other notes
 Office hours:
 Monday: 8-10 PM
 Tuesday: 8-10 PM
 Lab 1 due Wednesday, 5 PM