Broad Phase Collision Detection

Collision detection is the problem of detecting collisions between objects in an n-dimensional space. Among the approaches in use, a broad phase followed by a narrow phase is the most efficient: the broad phase selects potentially colliding pairs of objects, which are then passed to the narrow phase for an exact intersection test. The broad phase itself consists of several steps, and the implementation here is done in parallel.
Spatial Subdivision
Spatial Subdivision partitions the space into a uniform 3D grid of cubical cells. This can be visualized as the particle space being divided into cubical cells of the same dimension.
The grid space has an origin, and the positions of objects are represented as 3D coordinates relative to it. The 3D coordinates of an object are those of its centroid (cx, cy, cz). The cell in which an object's centroid lies at a given instant is identified and stored as the home (H) cell of the object. In practice the objects being considered for collision tests have varied dimensions, have different bounding-sphere radii, and are not uniformly distributed in 3D space, so an object need not lie completely within one cell. Besides the home (H) cell containing its centroid, an object may overlap up to 2^n - 1 of its 3^n - 1 neighboring cells; this bound follows from choosing the cell size based on the dimension of the largest bounding sphere of any object. The overlapped cells are known as phantom (P) cells. Furthermore, every cell in the grid is labeled with a type from 1 to 2^n, where n is the dimension of the grid space, which is 3 in this case. This label is used while processing collisions between objects in the 3D grid space.
3D uniform grid space (Image courtesy Chapter 32, GPU gems 3 )
Object in the grid space (Image courtesy Chapter 32, GPU gems 3 )
Implementation Considerations of Spatial Subdivision:
The 3D grid space does not have one specific data structure but is represented as a collection of data structures. Three arrays are used:
- Cell ID Array:
o Size – NUM_OF_OBJECTS * 8
o Stores – the hash of the home (H) cell and the up to 2^n - 1 phantom (P) cells of each object
- Object List Array:
o Size – NUM_OF_OBJECTS * 8
o Stores – an array of structures holding the object ID and the values of the H and P cell types of the object
- General Object Array:
o Size – NUM_OF_OBJECTS
o Stores – an array of structures holding the object ID, the centroid coordinates x, y, z, and the bounding sphere of the object
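These arrays might be declared as follows (a C sketch; the field names are ours, since the report does not give the exact structure layouts):

```c
#include <stdint.h>

#define NUM_OF_OBJECTS   1024
#define CELLS_PER_OBJECT 8   /* 1 home cell + up to 2^3 - 1 phantom cells */

/* One entry per (object, cell) pair in the Object List array */
typedef struct {
    uint32_t object_id;
    uint8_t  is_phantom;  /* 0 = home (H) cell, 1 = phantom (P) cell */
    uint8_t  cell_type;   /* 1..8: type of the cell this entry refers to */
} ObjectListEntry;

/* One entry per object in the General Object array */
typedef struct {
    uint32_t object_id;
    float    cx, cy, cz;  /* centroid coordinates */
    float    radius;      /* bounding-sphere radius */
} Object;

/* Parallel arrays: cell_ids[i] is the hashed cell id for object_list[i] */
uint32_t        cell_ids[NUM_OF_OBJECTS * CELLS_PER_OBJECT];
ObjectListEntry object_list[NUM_OF_OBJECTS * CELLS_PER_OBJECT];
Object          objects[NUM_OF_OBJECTS];
```

Keeping the Cell ID array separate from the Object List array lets the sort operate on compact 32-bit keys while the larger entries are permuted alongside.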
Parallel implementation of Spatial Subdivision on CUDA:
The parallel implementation uses threads to create the Cell ID and Object List arrays. Each of these arrays has NUM_OF_OBJECTS * 8 locations. Each thread is responsible for a single object from the General Object array, or for multiple objects if the number of objects exceeds the number of threads.
With a parallel implementation, an object involved in multiple collision tests could be updated multiple times by different threads. This must be prevented, both for correctness and to avoid duplicated processing. To prevent it we use the cell types, which determine the pass in which an object is processed: in 3D this means 2^3 = 8 computational passes, where pass T (1 to 8) traverses all objects in cells of type T. We also spatially separate the cells by fixing the cell dimension to be at least as large as the largest bounding sphere of an object. This separation ensures that any two cells with the same cell type are separated by at least one cell of a different type, so within a pass no two threads can update the same object.
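One common way to assign the 2^n cell types so that this separation property holds (our sketch; the report does not state the exact rule) is to derive the type from the parity of the integer cell coordinates:

```c
/* Cell type in 1..8, derived from the parity of the cell's integer
   coordinates. Two cells with the same type must differ by an even
   number of cells on every axis, so they are never adjacent. */
int cell_type(int x, int y, int z)
{
    return 1 + (x & 1) + ((y & 1) << 1) + ((z & 1) << 2);
}
```

With this rule, processing one type per pass guarantees that no two cells handled in the same pass touch each other.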
The other problem with parallel execution is performing the same collision test between two objects multiple times. This arises when two objects have centroids in different cells but overlap each other's cells. To eliminate this possibility we:
1. Store the values that track the H cell and the P cells intersected by the bounding volume of each object.
2. Further scale the bounding sphere by sqrt(2) and make the cell size 1.5 times the scaled bounding sphere.
Before performing a collision test during pass T, we check whether the participating objects share a common cell type T' with T' < T, where T' is the H cell type of one of them. If so, we can skip the test, since it was already performed in pass T'. For example, if O1 and O2 have the same overlapping cells but different H cell types, then in pass T3 we do not test them, because the test was already performed in the lower pass T1. In the case of O3 and O4, both overlap cells of type 3 but have their centroids in different cells; the chosen cell size ensures that the two cannot collide.
Eliminating duplicate collision tests using cell type & scaling bounding spheres & cells (Image courtesy Chapter 32,
GPU gems 3 )
Our parallel implementation of Spatial Subdivision on CUDA:
The creation of the H and P cell IDs for each object is done by one thread. If there are more objects than threads, one thread works on more than one object.
Preprocessing:
1. Determine the bounding sphere of each object.
2. Find the largest bounding sphere, i.e. the largest object.
3. Scale it by sqrt(2) and ensure the grid cell is at least 1.5 times as large as the scaled bounding sphere of the largest object.
4. Create the Object structure:
- Object ID
- Position (x, y, z) in 3D space – generated randomly
- Bounding sphere (B.S.) dimension – generated randomly
5. Create the Object List structure:
- Object ID
- Cell type H or P (indicated with 0 for H and 1 for P)
- H cell type (1 to 2^n)
- P cell type (1 to 2^n)
6. Create the Cell ID and Object List arrays of NUM_OF_OBJECTS x 8 elements.
- The Cell ID array stores the cell id of the H cell of each object followed by the cell ids of its P cells. The cell id is calculated as a hash of the centroid coordinates. Whenever the Cell ID array is sorted, the Object List array is reorganized in the same way.
- The Object List array stores, for every object, the object ID, the type of cell (H or P), and the value of the cell type (1–8).
7. Create the Object array.
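Steps 1–3 above can be sketched serially as follows (we take the "size" being scaled to be the sphere's diameter, which the report leaves ambiguous):

```c
#include <math.h>

/* Compute the grid cell size from the bounding-sphere radii of all
   objects: find the largest sphere, scale its diameter by sqrt(2),
   then make the cell 1.5 times the scaled diameter. */
double compute_cell_size(const double *radii, int n)
{
    double r_max = 0.0;
    for (int i = 0; i < n; i++)
        if (radii[i] > r_max) r_max = radii[i];       /* largest bounding sphere */
    double scaled_diameter = 2.0 * r_max * sqrt(2.0); /* scale by sqrt(2) */
    return 1.5 * scaled_diameter;                     /* cell >= 1.5x scaled sphere */
}
```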
Generation and assignment of Cell ID & Object List Arrays:
1. The object IDs are generated and assigned sequentially to both the Object array and the Object List array.
2. We launch a CUDA kernel to populate the Cell ID and Object List arrays.
3. Each thread works on one object, or on multiple objects if the number of objects exceeds the number of threads. Thread j in block i works on objects i*B + j, i*B + j + T, i*B + j + 2T, ..., where B is the number of threads per block, j is the thread number within the block, and T is the total number of threads.
4. For each object the home cell is found by a simple hash function:
hash = ((int)(obj_d[object_num].cx / cell_size) << xshift) |
((int)(obj_d[object_num].cy / cell_size) << yshift) |
((int)(obj_d[object_num].cz / cell_size) << zshift);
This value is stored in the Object List array and the Cell ID array as the home cell id of the object, and the cell type is recorded as H in the Object List array.
5. The 3^n - 1 neighbors are then checked for intersection. For each intersection, the hash is computed and stored as a P cell id of the object in the Cell ID array and Object List array, and the cell type is recorded as P in the Object List array.
6. For slots with no intersection we simply store the value 0xffffffff in the Cell ID and Object List arrays for the corresponding object ID. These entries are ignored during collision list processing.
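Steps 4 and 5 can be modeled serially as below; the shift values are illustrative (they stand in for xshift/yshift/zshift), and a standard sphere-vs-box test stands in for the intersection check:

```c
#include <stdint.h>

/* Hash a point to its cell id, as in the kernel above.
   Assumes non-negative coordinates; shift values are examples. */
static uint32_t cell_hash(double x, double y, double z, double cell_size)
{
    uint32_t cx = (uint32_t)(x / cell_size);
    uint32_t cy = (uint32_t)(y / cell_size);
    uint32_t cz = (uint32_t)(z / cell_size);
    return (cx << 16) | (cy << 8) | cz;
}

/* Does a sphere at (x,y,z) with radius r overlap the cell (cx,cy,cz)?
   Standard sphere/AABB test: clamp the center to the cell's box and
   compare the squared distance against r^2. */
static int sphere_overlaps_cell(double x, double y, double z, double r,
                                int cx, int cy, int cz, double cell_size)
{
    double lo[3] = { cx * cell_size, cy * cell_size, cz * cell_size };
    double c[3]  = { x, y, z };
    double d2 = 0.0;
    for (int i = 0; i < 3; i++) {
        double hi = lo[i] + cell_size;
        double p = c[i] < lo[i] ? lo[i] : (c[i] > hi ? hi : c[i]);
        d2 += (c[i] - p) * (c[i] - p);
    }
    return d2 <= r * r;
}
```

On the GPU each thread would run the hash once for the home cell and the overlap test against the 26 neighbors of that cell.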
After the Cell ID array is created by Spatial Subdivision it needs to be sorted. While sorting the Cell ID array, the Object List array must be reorganized in the same order, and ties must preserve their relative order, so a stable sorting algorithm is required. We use a parallel radix sort, which in turn needs a parallel prefix sum; the details of both follow in the next sections.
Parallel Prefix Sum
The all-prefix-sums operation on an array A = (a1, a2, ..., an), given a binary associative operator ʘ with identity I, is defined as the array whose kth element is the result of applying ʘ to the first k - 1 elements of A [Blelloch, 1990]:

A:          a1, a2, a3, ..., an
Prefix sum: I, (a1), (a1 ʘ a2), ..., (a1 ʘ a2 ʘ ... ʘ an-1)
For example, if ʘ is addition, then for

A:          1  3  5  6  7
Prefix sum: 0  1  4  9  15
The all-prefix-sums operation performed on an array is usually called a scan [GPU Gems chapter 39]. We are interested in the version where ʘ is addition. The serial version of the prefix scan initializes the first element of the scan array to 0 (the additive identity), and then, at each position, adds the previous element of the original array to the previous element of the scan array to get the current element of the scan array:
out[0] = 0;
for i = 1 to n-1
out[i] = out[i-1] + in[i-1]
Sequential Scan Algorithm [GPU Gems chapter 39]
where out is the prefix scan array and in is the original input array.
The algorithm loops over all elements of the array once and hence runs in O(n) time.
An efficient parallel version should have the same complexity order. A naive parallel algorithm would be the following:
for d = 1 to log2 n do
for all k in parallel do
if k >= 2^(d-1) then
x[k] = x[k - 2^(d-1)] + x[k]
Naive Parallel Scan Algorithm [GPU Gems chapter 39]
This algorithm has a complexity of O(n log n) and is hence undesirable. A better parallel prefix sum algorithm is presented in [Blelloch, 1990]. The algorithm has two phases. In the reduce (up-sweep) phase, the sum of all the elements is calculated in log2 n steps; this can be visualized as the construction of a binary tree with all the elements as leaves and each internal node holding the sum of its children. In the next phase, called the down-sweep phase, the root of the conceptual tree formed above is set to zero; then at each step the left child of a node is given the value of its parent, and the right child is given the sum of its parent and the previous value of the left child.
Up Sweep Phase [GPU Gems chapter 39]:
for d = 0 to log2 n - 1 do
for all k = 0 to n - 1 by 2^(d+1) in parallel do
x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

Down Sweep Phase [GPU Gems chapter 39]:
x[n - 1] <- 0
for d = log2 n - 1 down to 0 do
for all k = 0 to n - 1 by 2^(d+1) in parallel do
t = x[k + 2^d - 1]
x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
(Images Courtesy GPU GEMS chapter 39)
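The two phases can be put together into a serial C model of the work-efficient exclusive scan (the inner k-loops, which run in parallel on the GPU, are executed sequentially here; n must be a power of two):

```c
/* Work-efficient exclusive scan (Blelloch): up-sweep then down-sweep.
   Operates in place; n must be a power of two. */
static void blelloch_scan(int *x, int n)
{
    /* Up-sweep (reduce): build partial sums up the implicit tree. */
    for (int d = 1; d < n; d <<= 1)           /* d doubles each level */
        for (int k = 0; k < n; k += 2 * d)
            x[k + 2 * d - 1] += x[k + d - 1];

    /* Down-sweep: zero the root, then push sums back down the tree. */
    x[n - 1] = 0;
    for (int d = n / 2; d >= 1; d >>= 1)
        for (int k = 0; k < n; k += 2 * d) {
            int t = x[k + d - 1];
            x[k + d - 1] = x[k + 2 * d - 1];
            x[k + 2 * d - 1] += t;
        }
}
```

Running it on the padded example array [1, 3, 5, 6, 7, 0, 0, 0] reproduces the exclusive prefix sums 0, 1, 4, 9, 15 from the table above.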
Implementation Considerations:
The algorithm described above is efficient in that it performs in O(n) complexity. But to efficiently implement it on CUDA
we need to take care of certain issues.
Avoiding Bank Conflicts: The up-sweep phase of the algorithm needs to load elements from global memory into shared memory. One way to do this is to make each thread access neighboring elements in the array:

__shared__ temp[]
temp[threadIdx.x * 2] = Input[threadIdx.x * 2];
temp[threadIdx.x * 2 + 1] = Input[threadIdx.x * 2 + 1];

But this access pattern has a stride of 2 and hence causes a two-way bank conflict. To avoid it, each thread is instead made to load elements from the two halves of the array:

temp[threadIdx.x] = Input[threadIdx.x];
temp[threadIdx.x + n/2] = Input[threadIdx.x + n/2];

where n is the array size (the reduced bank conflicts case). Accessing global memory this way also gives better performance, since the accesses are coalesced. Accessing shared memory this way, however, does not eliminate all bank conflicts, because the up-sweep and down-sweep algorithms work on elements that are d apart, where d is a power of 2; such accesses result in 2-, 4-, 8- and 16-way bank conflicts at different steps. To avoid these bank conflicts we pad the shared memory array with one element for every 16 elements and access it in the following manner: if the original index is ai, the padded index is (ai + ai / 16). For example, the shared memory access above becomes
temp[threadIdx.x + threadIdx.x / 16] = Input[threadIdx.x];
instead of
temp[threadIdx.x] = Input[threadIdx.x];
(Image Courtesy GPU GEMS chapter 39)
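The padding rule can be expressed as a small helper (assuming 16 shared-memory banks, as on the hardware discussed here):

```c
#define NUM_BANKS 16

/* Map a logical shared-memory index to its padded physical index:
   one extra element is inserted after every NUM_BANKS elements, so
   indices that are a multiple of NUM_BANKS apart no longer land in
   the same bank. */
static int padded_index(int ai)
{
    return ai + ai / NUM_BANKS;
}
```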
Array Size:
The scan algorithm above assumes that the array size is a power of two. When it is not, the array can simply be padded to the next power of two before running the algorithm. When extracting the results it is enough to consider the elements up to the original size of the array, as they are not affected by the elements after them. The given algorithm works using a single block of threads, hence the maximum size of the array is twice the maximum number of threads in a block, i.e. 512 * 2 = 1024.
Our Implementation Specifications:
We use the parallel prefix sum as part of the parallel radix sort algorithm. The maximum length of the input array on which we need to perform a prefix scan is 384. This case arises when we use 32 thread blocks with each block having 12 thread groups for parallel radix sorting. Since 384 is well within the limit of 1024, we need not extend the algorithm to larger sizes. The 384-element array is padded to the next power of 2, i.e. 512, for the reasons stated above.
Once we have the parallel prefix sum we use it to perform the parallel radix sort.
Parallel Radix Sort
Radix sort performs sorting in passes, the result of each pass being the input to the next. If the keys are B bits long, sorting is done in B/L passes, sorting a group of L bits in each pass. Radix sort is a stable sort, which means that the order of keys with the same value is preserved.
"Each sorting pass of a radix sort proceeds by masking off all bits of each key except for the currently active set of L bits, and then it tabulates the number of occurrences of each of the 2^L possible values that result from this masking. The resulting array of radix counters is then processed to convert each value into offsets from the beginning of the array. In other words, each value of the table is replaced by the sum of all its preceding values; this operation is called a prefix sum. Finally, these offsets are used to determine the new positions of the keys for this pass, and the array of keys is reordered accordingly." [GPU Gems 3, chapter 32]
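A serial model of one such pass makes the three steps concrete (L = 8 bits here; the helper name and signature are ours, and values travel with their keys as stability requires):

```c
#include <stdint.h>

#define RADIX_BITS  8
#define NUM_RADICES (1 << RADIX_BITS)   /* 2^L possible masked values */

/* One radix-sort pass over n keys, examining RADIX_BITS bits starting
   at `shift`. Values (e.g. object ids) are permuted with their keys. */
static void radix_pass(const uint32_t *keys_in, uint32_t *keys_out,
                       const uint32_t *vals_in, uint32_t *vals_out,
                       int n, int shift)
{
    uint32_t count[NUM_RADICES] = {0};
    for (int i = 0; i < n; i++)                    /* tabulate occurrences */
        count[(keys_in[i] >> shift) & (NUM_RADICES - 1)]++;

    uint32_t offset[NUM_RADICES], sum = 0;         /* prefix sum -> offsets */
    for (int r = 0; r < NUM_RADICES; r++) { offset[r] = sum; sum += count[r]; }

    for (int i = 0; i < n; i++) {                  /* stable reorder */
        uint32_t r = (keys_in[i] >> shift) & (NUM_RADICES - 1);
        keys_out[offset[r]] = keys_in[i];
        vals_out[offset[r]] = vals_in[i];
        offset[r]++;
    }
}
```

Running two such passes (shift 0, then shift 8) fully sorts 16-bit keys; the parallel version below splits exactly these steps across thread blocks.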
Our implementation of each pass of the parallel radix sort is done in three phases, described below.
First phase - Counting Phase:
First, the currently active group of bits is extracted by masking. This is done with a simple AND operation and a shift of the keys by the appropriate amount: for example, to sort on the lowest 8 bits of the keys, we AND the keys with 0xff and shift them right by 0 bits. Once this is done, the occurrences of each of the 2^L possible values are counted in this phase.
Implementation Considerations and specifications:
When implementing this phase on CUDA, the work of counting the occurrences of each value is divided among the thread blocks, and each thread in a block counts the occurrences of the values within its assigned portion of the input. Each thread needs its own set of 2^L counters to avoid race conditions, so the parallel radix sort requires Num_Blocks * Num_Threads_Per_Block * Num_Radices counters [GPU Gems chapter 32], where Num_Radices = 2^L.
The GPUs we worked on (Tesla) both have a shared memory size of 16 KB, which allows a maximum of 4096 32-bit counters per block. We prefer the thread block size to be a multiple of 64 [GPU Gems chapter 32], and choosing Num_Threads_Per_Block = 256 gives us 16 (4096/256) counters per thread. This means L = 4 and 8 passes to sort 32-bit keys (our case). Hence, to reduce the number of passes, we make each group of R = 16 threads share 256 counters; the number of counters required by a block then becomes (Num_Threads_Per_Block / 16) * 2^L. If we choose L = 8 and Num_Threads_Per_Block = 256, the number of counters required exactly matches the maximum available value of 4096. But some shared memory is used for housekeeping [GPU Gems chapter 32], so we use 192 threads per block instead; each thread block then has 12 groups of threads that share counters among themselves. To avoid race conditions among the threads in the same group, we make sure they update the counters sequentially. This costs little performance, because the number of passes needed to sort 32-bit keys is halved from 8 to 4. Making the threads in a group work sequentially also introduces divergence, but the reduction in the number of passes more than makes up for it.
The elements accessed by each thread block need to be consecutive to preserve the stable-sort property of radix sort. Each thread block therefore accesses an equal, contiguous portion of the input array; within a thread block the first group accesses the first part of that portion, and within the section assigned to a thread group each thread accesses elements with a stride of 16.
The next phase requires the radix counters of each block to be in a particular layout. When writing the counters back to global memory in that layout, the shared memory access pattern is radixCounters[i*16 + mx][gx], where i*16 + mx is the value of the radix, mx = threadIdx.x % 16, and gx = threadIdx.x / 16 is the group of the thread. The stride of such an access is the number of groups, which is 12 in our case. This causes bank conflicts, so to avoid them extra shared memory is allocated: instead of declaring
__shared__ int radixCounters[NUM_OF_RADICES][GROUPS_PER_BLOCK]
we use
__shared__ int radixCounters[NUM_OF_RADICES][GROUPS_PER_BLOCK + 1].
This makes the stride an odd number (13) and hence avoids bank conflicts.
This phase is implemented with two kernels, one for masking and the other for counting.
Since we use the radix sort to sort the cell ids, the keys passed in are the Cell ID array elements.
Results:

Phase | Number of Blocks | Number of elements | Processing Time
1     | 32               | 3072               | 0.04875000
1     | 32               | 6144               | 0.04625000
1     | 32               | 15360              | 0.04700000
1     | 32               | 30720              | 0.06200000
Second phase - Radix Summation:
In the final phase we reorder the input elements based on offsets. Those offsets are calculated in this phase, with a small amount of work left for the next phase. We need to find the prefix sum and the total sum of the counters for each radix. The work is again divided among the thread blocks, and each thread block gets the counters of an equal number of radices to perform the prefix sum on. The prefix sum is calculated as described in the Parallel Prefix Sum section and implemented in the same way. After this phase we have the total sum of the counters for each radix and the prefix sum of the counters for each radix. But to reorder the elements in the final phase we need to convert the total sums into absolute offsets. This needs the total sums of the radices from the other blocks as well and hence requires inter-block synchronization, so it is done in the next phase (a new kernel).
Results:
For a 256 element array the timing is 0.00018457 ms
Third Phase - Reordering:
This phase first calculates the absolute offsets needed to reorder the elements. From the total sums computed in the last phase, it performs one parallel prefix scan to obtain the absolute offsets, then adds the corresponding absolute offset to the prefix scan of the counters of each radix.
Finally, for the reordering itself, each block reads values from the input array the same way it did in the first phase, in groups, and reorders the elements in the following steps:
"1. Reads the value of the radix counter corresponding to the thread group and radix of the cell ID
2. Writes the cell ID to the output cell ID array at the offset indicated by the radix counter
3. Increments the radix counter" [GPU Gems chapter 32]
The write and increment steps again need to be done sequentially by the threads in a group to avoid race conditions.
Implementation Considerations and specifications:
When reordering the elements, not just the keys but also the values attached to the keys need to be sent to the kernel. This is because the radix sort is done in passes, the output of each pass being the input of the next, and the stability of the sort depends on the values moving with their keys. So in our case, while the cell ids are the keys we sort, we also send the object ID array to the kernel so that it is rearranged according to the changes made to the Cell ID array.
Results:

Phase | Number of Blocks | Number of elements | Processing Time
3     | 32               | 3072               | 0.06125000
3     | 32               | 6144               | 0.06275000
3     | 32               | 15360              | 0.06200000
3     | 32               | 30720              | 0.06425000
The total processing time for the radix sort is 1.8 ms, including all memory transfers.
Detecting Collisions
Creating the Collision Cell List:
Once the Cell ID array is sorted we can prepare for the collision tests between objects contained in the same cell. After the radix sort, identical cell ids are adjacent in the list, so we scan the list for changes in cell id; each change marks the end of the objects within one cell and the beginning of a new cell and the objects within it.
Implementation Considerations:
We parallelize this by assigning each thread a roughly equal chunk of the Cell ID array, which the thread scans for changes in cell id. Each thread ignores the first change it encounters and uses the second change to mark the end of the first transition; the first change is left to be recorded by the preceding thread, except for the first thread, which begins at the first cell id. Invalid cell ids with the value 0xffffffff are ignored.
The sorted Cell ID array is then processed in three steps:
1. Count the number of objects contained within each collision cell and convert the counts to offsets within the array through a parallel prefix sum in several passes, similar to the one done in phase 2 of the radix sort above.
2. For every collision cell, create an entry in a new array with its start and its number of H and P occupants.
3. Divide these cells into 8 lists of non-interacting cells, which are traversed in 8 passes, each pass working on a single list.
Scan sorted Cell ID for transitions to locate Collision Cells (Image courtesy Chapter 32, GPU gems 3 )
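The transition scan that builds the collision cell list can be modeled serially as follows (each thread would run this loop over its chunk; the function and parameter names are ours):

```c
#include <stdint.h>

#define EMPTY_CELL 0xffffffffu

/* Scan a sorted cell-id array for transitions. For every run of equal,
   valid cell ids with more than one occupant, record (start, count) as
   a collision cell. Returns the number of collision cells found. */
static int find_collision_cells(const uint32_t *cell_ids, int n,
                                int *starts, int *counts)
{
    int num_cells = 0, run_start = 0;
    for (int i = 1; i <= n; i++) {
        /* A transition (or the array end) closes the current run. */
        if (i == n || cell_ids[i] != cell_ids[run_start]) {
            int run_len = i - run_start;
            if (run_len > 1 && cell_ids[run_start] != EMPTY_CELL) {
                starts[num_cells] = run_start;  /* index of first occupant */
                counts[num_cells] = run_len;    /* H + P occupants */
                num_cells++;
            }
            run_start = i;
        }
    }
    return num_cells;
}
```

Runs of length 1 are skipped because a cell with a single occupant cannot produce a collision, and invalid 0xffffffff entries sort to the end and are ignored.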
Traversing the Collision Cell List:
Every collision cell is assigned to a thread for the collision check. The thread checks for intersections between the bounding volumes of the objects in the cell and flags a collision; this serves as the narrow phase of collision detection.
Thread assignment of a collision list (Image courtesy Chapter 32, GPU gems 3)
References:
Le Grand, Scott. 2007. "Broad-Phase Collision Detection with CUDA." In GPU Gems 3, Chapter 32. NVIDIA Corporation.
Harris, Mark, Shubhabrata Sengupta, and John D. Owens. 2007. "Parallel Prefix Sum (Scan) with CUDA." In GPU Gems 3, Chapter 39.
McCutchan, John. 2006. "Introduction to Collision Detection." Computing and Software Department, McMaster University, November 9, 2006.
Erleben, Kenny, Jon Sporring, Knud Henriksen, and Henrik Dohlmann. 2005. Physics-Based Animation. Charles River Media. See especially pp. 613–616.
Mirtich, Brian. 1996. "Impulse-Based Dynamic Simulation of Rigid Body Systems." Ph.D. thesis. Department of Computer Science, University of California, Berkeley.
Sedgewick, Robert. 1992. Algorithms in C++. Addison-Wesley. See especially pp. 133–144.
Witkin, A., and D. Baraff. 1997. "An Introduction to Physically Based Modeling: Rigid Body Simulation II—Nonpenetration Constraints." SIGGRAPH 1997 Course 19.
Blelloch, Guy E. 1989. "Scans as Primitive Parallel Operations." IEEE Transactions on Computers 38(11), pp. 1526–1538.
Blelloch, Guy E. 1990. "Prefix Sums and Their Applications." Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University.
Submitted By :
Aritra Nath (9119 6092)
Harsha Hardhagari (1883 9151)