Compute Unified
Device Architecture
A Highly Scalable Parallel
Programming Framework
Submitted in partial fulfillment of the requirements for the Maryland high school diploma
Andrew “Shirley” Das Sarma (Calico Cannonballs McMullins), Blair Computational Methods 2009
Background: Why CUDA?
Scientific Computing
• A large computer market
• Arithmetic-intensive
• Huge datasets
• Distributed
• Parallel
Background: Why CUDA?
Moore’s Law
• Transistor counts double roughly every 24 months
• Slowing down?
• New tricks
– Multicore
– Multi-node
• Metrics
– Transistors per circuit
– Performance per unit cost
Background: Why CUDA?
CPU vs. GPU
• CPUs optimized for general workload
– More instructions per second
– Pipelining, branch prediction, etc.
• GPUs optimized for parallel calculations
– 1 pixel shader = 1 thread
– Lots of pixel shaders
– Lots of arithmetic
– On-card DRAM
Background: Why CUDA?
CPU vs. GPU
In terms of raw computing power,
GPUs surpass CPUs.
What is CUDA?
• GPGPU: general-purpose computing on GPUs (not just graphics, or even no graphics at all)
• Runs on CPU and GPU
• High-level language
– Extension of C
– FORTRAN coming soon
• One compiler
– Only NVIDIA hardware so far
– Tesla
– (Intel's Larrabee is a planned competitor, not a CUDA device)
• Unfathomably cool
How it works
• C language extension
– Language constructs
– Keywords
• Low-overhead threads
• Independent blocks
• CPU or GPU: choose one
– CPU good for sequential or non-numerical tasks
– GPU good for highly parallel calculations
GPU block diagram
[Block diagram: the Host feeds an Input Assembler and Thread Execution Manager, which dispatch threads to an array of multiprocessors, each with its own Parallel Data Cache and Texture unit; Load/store paths connect everything to off-chip Global Memory]
CUDA: A C extension
• Declspecs: __host__, __global__, __device__
• Keywords: blockIdx, threadIdx, etc.
• Intrinsics: __syncthreads()
• Runtime API
– cudaMalloc()
– cudaMemcpy()
– etc.
• Kernel launch:
kernel<<<blocks,threads>>>()
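A minimal sketch tying these pieces together (the kernel and array names are illustrative, not from the slides; __syncthreads() shows up in the integration example later):

__global__ void scale(float * A, float s)           //declspec
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;    //keywords
    A[i] *= s;
}

__host__ int main()
{
    size_t sz = 1024*sizeof(float);
    float * h_A = (float *) malloc(sz);
    for(int i=0; i<1024; i++) h_A[i] = i;
    float * d_A;
    cudaMalloc((void **)&d_A, sz);                  //runtime API
    cudaMemcpy(d_A, h_A, sz, cudaMemcpyHostToDevice);
    scale<<<4, 256>>>(d_A, 2.0f);  //kernel launch: 4 blocks x 256 threads
    cudaMemcpy(h_A, d_A, sz, cudaMemcpyDeviceToHost);
    cudaFree(d_A); free(h_A);
    return 0;
}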
CUDA: A C extension
Integrated source (foo.cu)
→ nvcc / cudacc (EDG C/C++ frontend) splits the source:
– CPU host code (foo.cpp), compiled by gcc / cl
– GPU code: Open64 Global Optimizer → GPU assembly (foo.s) → OCG → G80 SASS (foo.sass)
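In practice the whole pipeline is driven by a single command (assuming the CUDA toolkit is installed), e.g. nvcc foo.cu -o foo: nvcc separates host from device code, hands the host side to the system C++ compiler (gcc on Linux, cl on Windows), and takes the device side down to GPU machine code itself.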
Background: Pointers
• Pointer: a variable that holds the address of some other data in memory
• malloc(size_t sz) returns a pointer to sz bytes of newly allocated memory
• To declare a 20-element int array:
int * A = (int *) malloc(20*sizeof(int));
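A short fragment showing the full round trip (names are arbitrary):

int * A = (int *) malloc(20*sizeof(int)); //A now holds an address
A[5] = 42;   //write the 6th element
int x = *A;  //dereference: read the element A points at (same as A[0])
free(A);     //hand the memory back when done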
Background: Threads
• A thread is a sequence of instructions
• A core runs one thread at a time
– Multicore chips run a few at once
• A desktop computer has thousands of threads
– But usually fewer than 4 cores
• GPU comfortably runs millions of threads
– Hundreds of cores
CUDA execution model
• Arrays of parallel threads
• Each thread executes the same code
• Work determined by threadIdx,
blockIdx, blockDim, gridDim
• Blocks: collections of threads
– Threads in a block can cooperate and share
fast local memory
– No inter-block cooperation
• 1D, 2D, or 3D block/thread numbering
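Each thread typically derives which data element it owns from these built-ins. A minimal sketch (illustrative, not from the slides):

__global__ void doubleAll(float * data)
{
    //global index = block offset plus position within the block
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    data[i] = 2.0f*data[i]; //same code in every thread, different i
}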
CUDA execution model
[Figure 3.2, courtesy NVIDIA: an example of CUDA thread organization. The Host launches Kernel 1 on Grid 1 (a 2×2 array of blocks) and Kernel 2 on Grid 2 on the Device; each block, e.g. Block (1,1), is itself a 4×2×2 array of threads indexed (x,y,z).]
CUDA execution model
• All functions are declared __host__,
__global__, or __device__
• Host: Runs on CPU, called from CPU
• Global: Runs on GPU, called from CPU
• Device: Runs on GPU, called from GPU
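A sketch showing all three in one file (function names are illustrative):

__device__ float square(float x)      //GPU, callable only from GPU code
{
    return x*x;
}

__global__ void squareAll(float * A)  //GPU, launched from the CPU
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    A[i] = square(A[i]); //a __global__ kernel may call __device__ functions
}

__host__ void run(float * d_A, int N) //CPU, called from the CPU
{
    squareAll<<<N/256, 256>>>(d_A);   //assumes N is a multiple of 256
}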
CUDA memory model
• Global memory
– Higher bandwidth than CPU memory
– Slower than on-chip cache
– Accessible by all threads
• Block shared memory
– Small-ish, fast, shared by threads in a block
• Thread memory
– Small, fast, local
• Texture memory
– Small, fast, global
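The first three map directly onto source constructs; a sketch (texture memory needs a separate binding API and is omitted here):

__global__ void reverseBlock(float * g)  //g points into global memory
{
    __shared__ float s[256]; //block shared memory (launch 256 threads/block)
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    float t = g[i];          //t lives in fast per-thread (register) memory
    s[threadIdx.x] = t;
    __syncthreads();         //every thread in the block has now written s
    g[i] = s[blockDim.x-1-threadIdx.x]; //read another thread's value
}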
Example: SAXPY
__host__ void
SAXPYCPU(float * X, float * Y, float a, int N)
{
for(int i=0; i<N; i++)
Y[i] = a*X[i] + Y[i];
}
__global__ void
SAXPYGPU(float * X, float * Y, float a)
{
int i = blockDim.x*blockIdx.x+threadIdx.x; //global thread index
Y[i] = a*X[i] + Y[i]; //one array element per thread
}
(continued)
Example: SAXPY
__host__ int main() {
int N = 1073741824; //2^30 ≈ 1 billion
size_t sz = N * sizeof(float); //bytes we need
float * h_X = (float *) malloc(sz); //allocate the
float * h_Y = (float *) malloc(sz); //host memory
/*some code to fill up h_X and h_Y*/
float * d_X, * d_Y;
cudaMalloc((void **)&d_X, sz); //allocate the
cudaMalloc((void **)&d_Y, sz); //device memory
//move the data onto the GPGPU
cudaMemcpy(d_X, h_X, sz, cudaMemcpyHostToDevice);
cudaMemcpy(d_Y, h_Y, sz, cudaMemcpyHostToDevice);
(continued)
Example: SAXPY
//data is on the device; time to do some SAXPY
int threadsPerBlock = 256;
int blocks = N / threadsPerBlock; //exact: N is a multiple of 256
SAXPYGPU<<<blocks, threadsPerBlock>>>(d_X, d_Y, 2.0f);
cudaThreadSynchronize(); //wait until done
cudaMemcpy(h_Y, d_Y, sz, cudaMemcpyDeviceToHost);
cudaFree(d_X);
cudaFree(d_Y); //we no longer need the device memory
}
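One caveat: the kernel above assumes N is an exact multiple of threadsPerBlock (true here). For general N, the usual pattern (a sketch, not the code the slides time) rounds the block count up and guards the access:

__global__ void
SAXPYGPU(float * X, float * Y, float a, int N)
{
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if(i < N)              //guard: the last block may overhang the array
        Y[i] = a*X[i] + Y[i];
}
//launch with enough blocks to cover all N elements:
//SAXPYGPU<<<(N+threadsPerBlock-1)/threadsPerBlock, threadsPerBlock>>>(d_X, d_Y, 2.0f, N);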
Example: SAXPY
That was easy.
Example: 2D integration
Simpson 2D coefficient matrix: the tensor product of two 1D Simpson weight vectors w = (1, 4, 2, 4, ..., 2, 4, 1), so the coefficient at (i,j) is w_i*w_j (corners 1; edges 2 or 4; interior 4, 8, or 16)
Our function:
f(x,y) = e^(xy) * (x+y+π)^(-1/2) * sin(log(x-y+π))
We want ∫∫ f(x,y) dA over |x|, |y| ≤ 1
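The nested ternaries in the code on the following slides compute exactly this product of two 1D weights; an equivalent, more readable helper (a sketch, not the code that was timed) would be:

__host__ __device__ int w1(int i, int n) //1D composite Simpson weight
{
    if(i == 0 || i == n) return 1;  //endpoints
    return (i % 2 == 0) ? 2 : 4;    //interior: even index 2, odd index 4
}
//the 2D coefficient at (i,j) is w1(i,N)*w1(j,N): the 1/2/4/8/16 pattern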
Example: 2D integration
__host__ int main()
{
int B = N/T; //(N+1)^2 points; T threads per block, B blocks across
size_t sz = B*N*sizeof(dtyp); //one partial sum per block
dtyp * d, * h = (dtyp *) malloc(sz); //dtyp is typedef’d (float or double)
cudaMalloc((void **)&d, sz);
dim3 Threads(T);
dim3 Grid(B, N); //B×N grid of blocks
S2DGPU<<<Grid, Threads>>>(-W, W, -W, W, d); //W = bound of integration
cudaThreadSynchronize(); //wait for it to finish
cudaMemcpy(h, d, sz, cudaMemcpyDeviceToHost);
cudaFree(d);
dtyp u = 0;
for(int i=0; i<B*N; i++)
u += h[i]; //sum the per-block partial results
u += f2(W, W); //the device grid misses the last corner point
u *= (dtyp)4*W*W/((dtyp)9*N*N); //normalize by h*h/9, h = 2W/N
//(the dtyp cast keeps 9*N*N from overflowing int for large N)
}
Example: 2D integration
__host__ void S2DCPU(dtyp x0, dtyp xf, dtyp y0, dtyp yf, dtyp* a)
{
*a=0;
dtyp x=x0, y;
for(int i=0; i<=N; i++)
{
y = y0;
for(int j=0; j<=N; j++)
{
//Simpson weight w_i*w_j: corners 1, edges 2 or 4,
//interior 4, 8, or 16
bool c1 = i==0||i==N, c2 = j==0||j==N;
*a+=(c1?(c2?1:(j%2==0?2:4)):
(i%2==0?(c2?2:(j%2==0?4:8)):
(c2?4:(j%2==0?8:16))))*f2(x,y);
y += (yf-y0)/N;
}
x += (xf-x0)/N;
}
}
Example: 2D integration
__global__ void S2DGPU(dtyp x0, dtyp xf, dtyp y0, dtyp yf, dtyp * a)
{
int X = blockIdx.x*blockDim.x+threadIdx.x;
int Y = blockIdx.y;
dtyp x = x0+(xf-x0)*X/(gridDim.x*blockDim.x);
dtyp y = y0+(yf-y0)*Y/gridDim.y;
__shared__ dtyp u[T]; //per-block scratch for the partial sum
bool evx = (X&1)==0, evy = (Y&1)==0;
//Simpson weight w_X*w_Y for the N×N interior grid; the missing
//X==N column and Y==N row are patched in by the threads below
u[threadIdx.x] =
(X==0?(Y==0?1:(evy?2:4)):(evx?(Y==0?2:(evy?4:8)):
(Y==0?4:(evy?8:16))))*F(x,y);
if(threadIdx.x==0)
if(blockIdx.x==0)
u[threadIdx.x]+=(blockIdx.y==0?1:
((blockIdx.y&1)==0?2:4))*F(xf,y);
else if(blockIdx.x==1)
u[threadIdx.x]+=(blockIdx.y==0?1:
((blockIdx.y&1)==0?2:4))*F(x0+(xf-x0)
*blockIdx.y/gridDim.y, yf);
__syncthreads();
if(threadIdx.x==0)
{
for(int i=1; i<T; i++)
u[0]+=u[i];
a[blockIdx.x*gridDim.y+Y]=u[0]; //one partial sum per block
}
}
Next-gen GPGPUs
NVIDIA Tesla S1070
– 960 cores @ 1.44 GHz
– 16 GB DRAM
– No more, no less
– 506 GB/s memory bandwidth
– 4000 GFLOPS
– 800 W (0.2 W/GFLOPS)
– $4,000 ($1/GFLOPS)
Intel Xeon 5500
– 4 cores @ 3.2 GHz
– Up to 192 GB DRAM*
– *Memory not included
– 64 GB/s memory bandwidth
– ~50 GFLOPS
– 130 W (2.6 W/GFLOPS)
– $2,300 ($46/GFLOPS)
Runtime data: 2D integration
dtyp    N      GPU 16  GPU 256  GPU 512  CPU 16  CPU 256  CPU 2048
double  1024   8       9        -        114     208      -
double  16384  1685    2047     -        25996   15192    17056
float   1024   8       9        10       113     210      -
float   16384  353     255      361      29949   23907    23009
Note: All times are in milliseconds.