CS179: GPU Programming
Lecture 7: Lab 3 Recitation
Today
Miscellaneous CUDA syntax
Recap on CUDA and buffers
Shared memory for an N-body simulation
Flocking simulations
Integrators
CUDA Kernels
Launching the kernel:
kernel<<<gridDim, blockDim, sMemSize>>>(args);
Need to know gridDim, blockDim, sMemSize (and args)
If no sMemSize set, it will default to 0
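A minimal 1D launch sketch (the kernel name, device pointer, and particle count are hypothetical):

int blockSize = 512;                                        // threads per block
int gridSize = (numParticles + blockSize - 1) / blockSize;  // round up so every particle is covered
size_t sMemSize = blockSize * sizeof(float3);               // dynamic shared memory, if the kernel needs it
myKernel<<<gridSize, blockSize, sMemSize>>>(d_positions, numParticles);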
CUDA Kernels
Grid and block architecture:
Grids can be 1D, 2D, or on CUDA 2.x+, 3D
Blocks can be 1D, 2D, or 3D
1024 threads per block maximum (512 on older systems)
Dimension is only for convenience, choose what’s best for you
Most applications are fine in 1D
Image processing may lend itself more intuitively to a 2D block/grid
Shared memory size:
Requirement is application-dependent
Limited by CUDA version (probably 48kB for you)
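To make the block/grid choice above concrete, a 2D image-processing launch might be sketched like this (kernel, buffers, and image dimensions are hypothetical):

dim3 blockSize(16, 16);          // 256 threads per block
dim3 gridSize((imgWidth  + blockSize.x - 1) / blockSize.x,
              (imgHeight + blockSize.y - 1) / blockSize.y);
blurKernel<<<gridSize, blockSize>>>(d_in, d_out, imgWidth, imgHeight);   // no sMemSize given, so it defaults to 0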
CUDA Functions
Three different kinds of CUDA functions:
__host__: runs on CPU (__host__ keyword is superfluous)
__device__: runs on GPU, only called from GPU
Think of these as helper functions
__global__: runs on GPU, only called from CPU
These are our kernel functions
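A small illustrative sketch (all names are made up): a __device__ helper called from a __global__ kernel.

__device__ float square(float x)        // GPU-only helper, callable only from GPU code
{
    return x * x;
}

__global__ void squareAll(float *in, float *out, int n)   // kernel, launched from the CPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = square(in[i]);
}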
CUDA Functions
Things to be aware of:
On older CUDA versions, __device__ and __global__ functions don't support recursion
Cannot have function pointers to __device__ functions
Restrictions on __global__ functions:
Must return void
4kB maximum size for parameters (256 bytes on very old GPUs)
CUDA Functions
Error checking for memory calls:
Most CUDA calls return a cudaError_t; convert it to a readable message with cudaGetErrorString()
For lab 3, we provide a macro for you:
#define gpuErrchk(ans) { gpuAssert((ans), (char*)__FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, char* file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
Call as gpuErrchk(cudaMemcpy(…));
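Kernel launches themselves don't return a cudaError_t, so one common pattern (optional, not required for the lab) is to check the launch separately:

kernel<<<gridDim, blockDim>>>(args);
gpuErrchk(cudaGetLastError());        // catches launch/configuration errors
gpuErrchk(cudaDeviceSynchronize());   // catches errors raised while the kernel ran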
CUDA Variables
Like functions, have a few different types:
__device__/__constant__
Stored in global/constant memory, respectively
Accessible by all threads and blocks
Set using cudaMalloc, cudaMemset, cudaMemcpy, etc.
We can also write to __device__ memory on GPU
__shared__
Lives in shared memory
Accessible only by threads within associated block
Requires a __syncthreads() call to guarantee correctness
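A minimal sketch of __shared__ plus __syncthreads() (names and the fixed tile size are hypothetical):

__global__ void reverseWithinBlock(float *data)
{
    __shared__ float tile[512];                   // assumes blockDim.x == 512
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];                  // each thread loads one element
    __syncthreads();                              // wait until the whole tile is in shared memory
    data[i] = tile[blockDim.x - 1 - threadIdx.x]; // now safe to read another thread's element
}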
CUDA Variables
Some CUDA vector variable types:
char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1,
ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1,
int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3,
ulong3, long4, ulong4, float1, float2, float3, float4, double2, …
Vector components available via .x, .y, .z, .w
var.x
Make vectors with make_<type>(args)
var = make_float3(1.0, 2.0, 3.0);
dim3: used for assigning block/grid size
Essentially just a uint3
Each component of a dim3 must be at least 1!
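Quick examples of both (the values are arbitrary):

float3 pos = make_float3(1.0f, 2.0f, 3.0f);
pos.x += 0.5f;                  // components accessed with .x/.y/.z

dim3 blockSize(256);            // unspecified components default to 1, never 0
dim3 gridSize(64, 64);          // a 64x64 grid of blocks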
CUDA and Buffers
Need to know how to link buffers into CUDA
Nothing conceptually new, just some functions:
cudaGLRegisterBufferObject(bufferObj)
Used to first register the buffer into CUDA -- done once
cudaGLUnregisterBufferObject(bufferObj)
Once we’re done with it, we unregister -- done once
cudaGLMapBufferObject((void**)&devPtr, bufferObj);
Associates CUDA memory with the buffer -- done once per kernel call
cudaGLUnmapBufferObject(bufferObj);
Disassociates the buffer so OpenGL can read -- done once per kernel
call after kernel finishes
Remember to include <cuda_gl_interop.h>
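Putting those calls together, one frame might look like this sketch (positionVBO, updateKernel, and the launch parameters are hypothetical):

#include <cuda_gl_interop.h>

float4 *devPtr;

cudaGLRegisterBufferObject(positionVBO);               // once, at startup

// every frame / every kernel call:
cudaGLMapBufferObject((void**)&devPtr, positionVBO);   // hand the buffer to CUDA
updateKernel<<<gridSize, blockSize>>>(devPtr, numParticles);
cudaGLUnmapBufferObject(positionVBO);                  // hand it back so OpenGL can draw

cudaGLUnregisterBufferObject(positionVBO);             // once, at shutdown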
N-Body Simulation
1 thread = 1 particle
Kernel call handles one step in simulation
Calculate acceleration, then velocity, then position*
*not quite, as we’ll see in a few slides
1 block won't be enough for all of the particles
How do we share all positions?
Load as many positions from global memory into shared memory as will fit
Calculate acceleration based on those positions
Update velocity, then load the next chunk from global memory and repeat
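One possible shape for that tiled loop, as a sketch only (the tile size, the assumption that the particle count is a multiple of the block size, and the addInteraction helper are all hypothetical):

__global__ void accelKernel(float3 *pos, float3 *accel, int n)
{
    __shared__ float3 tile[256];                       // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float3 myPos = pos[i];
    float3 a = make_float3(0.0f, 0.0f, 0.0f);

    for (int start = 0; start < n; start += blockDim.x) {
        tile[threadIdx.x] = pos[start + threadIdx.x];  // stage one tile of positions
        __syncthreads();                               // make sure the whole tile is loaded
        for (int j = 0; j < blockDim.x; j++)
            a = addInteraction(a, myPos, tile[j]);     // hypothetical per-pair helper
        __syncthreads();                               // don't overwrite the tile while others still read it
    }
    accel[i] = a;
}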
Flocking
First, a video:
https://www.youtube.com/watch?v=ctMty7av0jc
Flocking
2 main ideas (3 for bird flocking)
Separation: bugs will try to stay away from other bugs
Cohesion: bugs will try to stay near the center of the swarm
Alignment: birds will try to head toward the average heading
Not present in bug flocking algorithms
Flocking
Separation: think repelling magnets
An inverse-square law works pretty well:
accel ≈ 1/d², where d is the distance between two particles
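A tiny sketch of that term (the function name and softening constant are hypothetical):

__device__ float3 separationAccel(float3 p, float3 q)
{
    float3 d = make_float3(p.x - q.x, p.y - q.y, p.z - q.z);   // points from q toward p
    float distSq = d.x*d.x + d.y*d.y + d.z*d.z + 1e-6f;        // small softening avoids divide-by-zero
    float invDist = rsqrtf(distSq);
    float s = invDist * invDist * invDist;                     // (d/|d|) * 1/|d|^2, so magnitude ≈ 1/d²
    return make_float3(d.x * s, d.y * s, d.z * s);
}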
Flocking
Cohesion: move towards average position
Cohesion fights separation; try to find factors that balance
the two out well
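A sketch of one way to apply it (the function name and the constant k are hypothetical; k is exactly the kind of factor that needs tuning against separation):

// Pull a particle toward the centroid of the positions it can see.
__device__ float3 cohesionAccel(float3 p, float3 centroid, float k)
{
    return make_float3(k * (centroid.x - p.x),
                       k * (centroid.y - p.y),
                       k * (centroid.z - p.z));
}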
Flocking
Alignment: steer towards average heading of neighbors
Dependent on both positions AND velocities!
Requires you make more buffers to store velocities
A fair amount more work… a good candidate for extra credit!
Integrators
After acceleration is calculated, update
Simple Euler is easiest…
But it's a bad integrator!
Symplectic Euler works better:
Basic idea: update velocity based on old position,
then update new position based on new velocity
new_vel = old_vel + dt * accel(old_pos)
new_pos = old_pos + dt * new_vel
More complex integrators can be even more
accurate, but even harder to implement
If you have time, try implementing a different one
(Runge-Kutta, maybe?) for EC
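A sketch of how the symplectic Euler update above might look inside the kernel (variable names are hypothetical):

// Symplectic Euler: update velocity with the acceleration from the old
// position, then update position with the *new* velocity.
vel.x += dt * accel.x;   vel.y += dt * accel.y;   vel.z += dt * accel.z;
pos.x += dt * vel.x;     pos.y += dt * vel.y;     pos.z += dt * vel.z;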
Pingponging
Two sets of buffers, one for new, one for old.
Why?
Suppose one block finishes while another block is still reading
New positions would get mixed into calculations that should only use old positions!
Solution: pingponging with 2 buffers
Both buffer sets are already made for you
Use one set for old state, one for new state, then flip when done
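On the host, the flip can be as simple as swapping pointers after each step (a sketch with hypothetical names):

#include <utility>   // std::swap

stepKernel<<<gridSize, blockSize>>>(oldPos, oldVel, newPos, newVel, n, dt);
std::swap(oldPos, newPos);   // the state we just wrote becomes next step's "old" state
std::swap(oldVel, newVel);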
Final Notes
When loading from shared memory, be sure not to try to
access out-of-bounds memory
Can use %, i.e. take the index mod the shared-memory size
Problem: % is slow!
Solution: we've provided a WRAP macro for you:
#define WRAP(x,m) ((x)<(m)?(x):((x)-(m)))
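For example (all index names here are hypothetical):

// WRAP(x, m) only handles x < 2*m, i.e. at most one wrap-around.
int idx = WRAP(baseIdx + offset, wrapSize);   // in place of (baseIdx + offset) % wrapSize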
Final Notes
You will also need to set initial positions and velocities
This can be done however you’d like!
Idea: have a few initial clusters with semi-random velocities
Don’t feel restricted to this!
Final Notes
gluPerspective: controls the camera
Based on your simulation, current setup might not fit
Feel free to adjust!
gluPerspective(GLdouble fovy, GLdouble aspect, GLdouble zNear, GLdouble zFar)
Final Notes
Due Wednesday, 5PM
OH at regular posted times
Important note: this lab will NOT work remotely!
Trying to ssh in and compile will be fine, but running will throw crazy errors!
2 new CUDA-capable computers coming to 104ANB soon…
For now, get work done early if you need to use minuteman