CS179: GPU Programming
Lecture 7: Lab 3 Recitation
Today
 Miscellaneous CUDA syntax
 Recap on CUDA and buffers
 Shared memory for an N-body simulation
 Flocking simulations
 Integrators
CUDA Kernels
 Launching the kernel:
 kernel<<<gridDim, blockDim, sMemSize>>>(args);
 Need to know gridDim, blockDim, sMemSize (and args)
 If no sMemSize set, it will default to 0
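 A minimal sketch of what this looks like end to end (the kernel and array here are hypothetical, not part of the lab):

// Hypothetical kernel: adds 1.0f to each of n floats.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

// Host-side launch: the gridDim and blockDim slots get 1D sizes here,
// and sMemSize is omitted, so it defaults to 0.
int blockSize = 256;
int gridSize = (n + blockSize - 1) / blockSize;   // round up so all n elements are covered
addOne<<<gridSize, blockSize>>>(devData, n);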
CUDA Kernels
 Grid and block architecture:
 Grids can be 1D, 2D, or on CUDA 2.x+, 3D
 Blocks can be 1D, 2D, or 3D
 1024 threads per block maximum (512 on older systems)
 Dimension is only for convenience, choose what’s best for you
 Most applications are fine in 1D
 Image processing may lend itself more intuitively to a 2D block/grid
 Shared memory size:
 Requirement is application-dependent
 Limited by CUDA version (probably 48kB for you)
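 For instance, a 2D image-processing launch might be configured like this sketch (the image dimensions, buffer, and kernel name are made up):

// 16x16 = 256 threads per block; enough blocks to cover a width x height image.
dim3 blockSize(16, 16);
dim3 gridSize((width  + blockSize.x - 1) / blockSize.x,
              (height + blockSize.y - 1) / blockSize.y);
// Dynamic shared memory: one float per thread in the block (stay under the ~48kB limit).
size_t sMemSize = blockSize.x * blockSize.y * sizeof(float);
blurKernel<<<gridSize, blockSize, sMemSize>>>(devImage, width, height);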
CUDA Functions
 Three different kinds of CUDA functions:
 __host__: runs on CPU (__host__ keyword is superfluous)
 __device__: runs on GPU, only called from GPU
 Think of these as helper functions
 __global__: runs on GPU, only called from CPU
 These are our kernel functions
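 A small sketch showing all three qualifiers together (the functions themselves are illustrative):

// __device__ helper: callable only from GPU code.
__device__ float squaredLength(float3 v) {
    return v.x * v.x + v.y * v.y + v.z * v.z;
}

// __global__ kernel: launched from the CPU, runs on the GPU, must return void.
__global__ void normalizeAll(float3 *vecs, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float len = sqrtf(squaredLength(vecs[i]));
        if (len > 0.0f)
            vecs[i] = make_float3(vecs[i].x / len, vecs[i].y / len, vecs[i].z / len);
    }
}

// __host__ function: ordinary CPU code (the qualifier could be omitted).
__host__ void normalizeOnGpu(float3 *devVecs, int n) {
    normalizeAll<<<(n + 255) / 256, 256>>>(devVecs, n);
}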
CUDA Functions
 Things to be aware of:
 On older CUDA versions, __device__ and __global__ functions do not support recursion
 Cannot have function pointers to __device__ functions
 Restrictions on __global__ functions:
 Must return void
 4kB maximum total size for parameters (they are passed via constant memory)
CUDA Functions
 Error checking for memory calls:
 Check the returned cudaError_t; cudaGetErrorString() turns it into a readable message
 For lab 3, we provide a macro:
#define gpuErrchk(ans) { gpuAssert((ans), (char *)__FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
 Call as gpuErrchk(cudaMemcpy(…));
CUDA Variables
 Like functions, variables come in a few different types:
 __device__/__constant__
 Stored in global/constant memory, respectively
 Accessible by all threads and blocks
 Set using cudaMalloc, cudaMemset, cudaMemcpy, etc.
 We can also write to __device__ memory on GPU
 __shared__
 Lives in shared memory
 Accessible only by threads within associated block
 Requires a __syncthreads() call to guarantee "correctness"
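 A sketch of the three variable qualifiers in one place (sizes and names are made up):

// Global memory, visible to all threads; can also be written from the GPU.
__device__ int d_counter;

// Constant memory, read-only on the GPU; set from the host with cudaMemcpyToSymbol.
__constant__ float c_dt;

__global__ void loadTile(float *in) {
    // Shared memory: one copy per block, visible only to that block's threads.
    __shared__ float tile[256];               // assumes <= 256 threads per block
    tile[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();                          // whole tile loaded before anyone reads it
    // ... use tile[] ...
}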
CUDA Variables
 Some CUDA vector variable types:
 char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1,
ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1,
int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3,
ulong3, long4, ulong4, float1, float2, float3, float4, double2, …
 Vector components available via .x, .y, .z, .w
 var.x
 Make vectors with make_<type>(args)
 var = make_float3(1.0, 2.0, 3.0);
 dim3: used for assigning block/grid size
 Essentially just a uint3
 Each component of a dim3 must be at least 1!
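 One gotcha: CUDA doesn't define arithmetic operators on these vector types (unless you pull in a helper header like the SDK's cutil_math.h), so combine components yourself. A small helper sketch:

// float3 has no built-in operator+; write tiny helpers like this as needed.
__host__ __device__ float3 add3(float3 a, float3 b) {
    return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
}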
CUDA and Buffers
 Need to know how to link buffers into CUDA
 Nothing conceptually new, just some functions:
 cudaGLRegisterBufferObject(bufferObj)
 Used to first register the buffer into CUDA -- done once
 cudaGLUnregisterBufferObject(bufferObj)
 Once we’re done with it, we unregister -- done once
 cudaGLMapBufferObject((void**)&devPtr, bufferObj);
 Associates CUDA memory with the buffer -- done once per kernel call
 cudaGLUnmapBufferObject(bufferObj);
 Disassociates the buffer so OpenGL can read -- done once per kernel
call after kernel finishes
 Remember to include <cuda_gl_interop.h>
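 Putting the calls together, each simulation step looks roughly like this sketch (the buffer, kernel, and launch parameters are assumptions):

#include <cuda_gl_interop.h>

// Once, at startup:
cudaGLRegisterBufferObject(positionVBO);

// Every frame / kernel call:
float4 *devPositions;
cudaGLMapBufferObject((void **)&devPositions, positionVBO);
updateParticles<<<gridSize, blockSize, sMemSize>>>(devPositions, numParticles);
cudaGLUnmapBufferObject(positionVBO);        // hand the buffer back to OpenGL for drawing

// Once, at shutdown:
cudaGLUnregisterBufferObject(positionVBO);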
N-Body Simulation
 1 thread = 1 particle
 Kernel call handles one step in simulation
 Calculate acceleration, then velocity, then position*
*not quite, as we’ll see in a few slides
 1 block won't be enough for all of the particles
 How do we share all positions?
 Load as much of global memory as will fit into shared memory
 Calculate acceleration based on those positions
 Update velocity, then load the next chunk of global memory and repeat (see the sketch below)
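 A rough sketch of that tiling pattern follows. The names and the per-pair rule are placeholders (not the required solution); it accumulates acceleration across all tiles and leaves the velocity/position update for the end, and it assumes numParticles is a multiple of the block size for simplicity. Launch it with sMemSize = blockSize * sizeof(float3) so the tile fits.

__global__ void flockKernel(float3 *oldPos, float3 *accelOut, int numParticles) {
    extern __shared__ float3 sharedPos[];     // one tile of positions per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float3 myPos = oldPos[i];
    float3 accel = make_float3(0.0f, 0.0f, 0.0f);

    // Walk over all particles, one tile (blockDim.x positions) at a time.
    for (int tile = 0; tile * blockDim.x < numParticles; tile++) {
        sharedPos[threadIdx.x] = oldPos[tile * blockDim.x + threadIdx.x];
        __syncthreads();                      // wait until the whole tile is loaded

        for (int k = 0; k < blockDim.x; k++) {
            // accumulate e.g. 1/d^2 repulsion between myPos and sharedPos[k] into accel
        }
        __syncthreads();                      // finish reading before the next load
    }

    accelOut[i] = accel;   // velocity and position updates follow (see Integrators)
}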
Flocking
 First, a video:
 https://www.youtube.com/watch?v=ctMty7av0jc
Flocking
 2 main ideas (3 for bird flocking)
 Separation: bugs will try to stay away from other bugs
 Cohesion: bugs will try to stay near the center of the swarm
 Alignment: birds will try to steer towards the average heading of their neighbors
 Not present in bug flocking algorithms
Flocking
 Separation: think repelling magnets
 An inverse-square law works pretty well:
 accel ≈ 1/d², where d is the distance between two particles
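 As a sketch, one per-pair separation term might look like this (the constants are made-up tuning knobs):

#define SEP_EPS      1e-6f   // avoids division by zero when two particles overlap
#define SEP_STRENGTH 0.05f   // overall strength of the repulsion

// Acceleration one neighbor at `other` contributes to the particle at `pos`.
__device__ float3 separation(float3 pos, float3 other) {
    float3 diff = make_float3(pos.x - other.x, pos.y - other.y, pos.z - other.z);
    float d2 = diff.x * diff.x + diff.y * diff.y + diff.z * diff.z + SEP_EPS;
    // scale = SEP_STRENGTH / d^3, so diff * scale has magnitude SEP_STRENGTH / d^2
    float scale = SEP_STRENGTH * rsqrtf(d2) / d2;
    return make_float3(diff.x * scale, diff.y * scale, diff.z * scale);
}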
Flocking
 Cohesion: move towards average position
 Cohesion fights separation; try to find factors that balance the two out well
Flocking
 Alignment: steer towards average heading of neighbors
 Dependent on both positions AND velocities!
 Requires you to make more buffers to store velocities
 A fair amount more work, so it's a good candidate for extra credit!
Integrators
 After acceleration is calculated, update
 Simple Euler is easiest…
 But it's a bad integrator!
 Symplectic Euler works better:
 Basic idea: update velocity based on old position,
then update new position based on new velocity
 new_vel = old_vel + dt * accel(old_pos)
 new_pos = old_pos + dt * new_vel
 More complex integrators can be even more
accurate, but even harder to implement
 If you have time, try implementing a different one
(Runge-Kutta, maybe?) for EC
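 In code, the order of operations is the whole trick; a per-particle sketch (variable names are illustrative):

// Symplectic Euler: step the velocity first, then use the NEW velocity for the position.
float3 a = accel;                             // computed from the old positions
newVel[i] = make_float3(oldVel[i].x + dt * a.x,
                        oldVel[i].y + dt * a.y,
                        oldVel[i].z + dt * a.z);
newPos[i] = make_float3(oldPos[i].x + dt * newVel[i].x,
                        oldPos[i].y + dt * newVel[i].y,
                        oldPos[i].z + dt * newVel[i].z);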
Pingponging
 Two sets of buffers, one for new, one for old.
 Why?
 Suppose one block finishes and writes its results while another block is still reading
 New positions would then be read where old positions were expected!
 Solution: pingponging with 2 buffers
 Both buffers already made for you
 Use one set for old state, one for new state, then flip when done
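 The pattern itself is just a swap after each step; a minimal sketch, assuming devOldPos / devNewPos point at the two provided buffers:

// Kernel reads the old state and writes the new state...
stepKernel<<<gridSize, blockSize, sMemSize>>>(devOldPos, devNewPos, numParticles);

// ...then the roles flip for the next step.
float3 *tmp = devOldPos;
devOldPos = devNewPos;
devNewPos = tmp;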
Final Notes
 When reading from shared memory, be sure not to access out-of-bounds memory
 Can use %, i.e. mod by the shared memory size
 Problem: % is slow!
 Solution: we've provided a WRAP macro for you:
#define WRAP(x,m) ((x)<(m)?(x):((x)-(m)))
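 One possible use, when a shared-memory index could run at most one pass past the end of the tile (a sketch; exact indexing depends on how you structure your loop):

// Same result as (threadIdx.x + k) % blockDim.x as long as the index stays
// below 2 * blockDim.x, but without the expensive modulo.
float3 neighbor = sharedPos[WRAP(threadIdx.x + k, blockDim.x)];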
Final Notes
 You will also need to set initial positions and velocities
 This can be done however you’d like!
 Idea: have a few initial clusters with semi-random velocities
 Don’t feel restricted to this!
Final Notes
 gluPerspective: controls the camera
 Based on your simulation, current setup might not fit
 Feel free to adjust!
 gluPerspective(GLdouble fovy, GLdouble aspect, GLdouble zNear, GLdouble zFar)
Final Notes
 Due Wednesday, 5PM
 OH at regular posted times
 Important note: this lab will NOT work remotely!
 You can ssh in and compile fine, but running will throw crazy errors!
 2 new CUDA-capable computers coming to 104ANB soon…
 For now, get work done early if you need to use minuteman