Broad Phase Collision Detection

Collision detection is a technique for detecting collisions between objects in an n-dimensional space. Of the approaches in use, a broad phase followed by a narrow phase is the most efficient. The broad phase selects potentially colliding pairs of objects, which are passed as input to the narrow phase, which performs an exact intersection test on the candidate sets. The broad phase implementation here has been done in parallel.

Spatial Subdivision

Spatial subdivision partitions the space into a uniform 3D grid; this can be visualized as the particle space being divided into cubical cells of the same dimension. The grid has an origin, and the positions of objects are represented as 3D coordinates relative to that origin. The 3D coordinates of an object are its centroid (cx, cy, cz). If at a certain instant of time an object's centroid is located within a cell, that cell is identified and stored as the home (H) cell of the object.

Empirically, the objects being considered for collision tests have varied dimensions, have different radii of the bounding spheres around them, and are not uniformly distributed in the 3D space. Hence every object might not be located completely within one cell. Besides the home (H) cell containing its centroid, an object can also overlap up to 2^n - 1 of its 3^n - 1 neighboring cells; this bound follows from choosing the cell size relative to the dimension of the largest bounding sphere of an object. These up to 2^n - 1 cells which the object can overlap are known as its phantom (P) cells. Furthermore, every cell in the grid is labeled with a type from 1 to 2^n, where n is the dimension of the grid space, which is 3 in this case. This label is used while processing collisions between objects in the 3D grid space.
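As a concrete illustration of the grid mapping above, the home cell of a centroid and a parity-based cell-type labeling can be sketched in Python. The parity scheme is an assumed concrete choice: the text only requires 2^3 = 8 distinct, spatially separated types, and labeling by coordinate parity is one common way to achieve that.

```python
import math

def home_cell(centroid, cell_size):
    """Map a centroid (cx, cy, cz) to the integer coordinates of its home (H) cell."""
    return tuple(int(math.floor(c / cell_size)) for c in centroid)

def cell_type(cell):
    """Label a cell 1..8 (2**n for n = 3) by the parity of its coordinates.
    This labeling is an assumption for illustration; any scheme giving 2**n
    types with same-type cells never adjacent would do."""
    x, y, z = cell
    return 1 + (x % 2) + 2 * (y % 2) + 4 * (z % 2)
```

With a cell size at least as large as the largest bounding sphere, two cells of the same type are always separated by a cell of a different type, which is what the multi-pass traversal below relies on.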
3D uniform grid space (Image courtesy Chapter 32, GPU Gems 3)
Object in the grid space (Image courtesy Chapter 32, GPU Gems 3)

Implementation Considerations of Spatial Subdivision:

The 3D grid space does not have one specific data structure but is generated as a collection of data structures. There are 3 arrays used:
- Cell ID Array:
  o Size - NUM_OF_OBJECTS * 8
  o Stores - the hash of the home (H) cell and up to 2^n - 1 phantom (P) cells of each object
- Object List Array:
  o Size - NUM_OF_OBJECTS * 8
  o Stores - an array of structures holding the object ID and the values of the H and P cell types of the object
- General Object Array:
  o Size - NUM_OF_OBJECTS
  o Stores - an array of structures holding the object ID, centroid coordinates (x, y, z), and bounding sphere of each object

Parallel implementation of Spatial Subdivision on CUDA:

The parallel implementation uses threads to create the Cell ID and Object List arrays, each of which has NUM_OF_OBJECTS * 8 locations. Each thread is responsible for working on a single object from the General Object array, or on multiple objects if the number of objects is greater than the number of threads being used. In a parallel computation, an object involved in multiple collision tests could be updated multiple times by different threads; this must be prevented to avoid unnecessary duplication of processing. To prevent it we use the cell types, which specify the pass in which an object is processed. In 3D this means 2^3 = 8 computational passes to cover all cell types; in pass i (1 to 8), all objects in cells of type i are traversed. We also enforce a spatial separation of cells by fixing the cell dimension to be at least as large as the largest bounding sphere of an object. Spatial separation ensures that every cell is separated from all other cells of the same type by a cell of a different type, which prevents multiple threads from updating the same object.
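A minimal sketch of the three data structures above, with Python records standing in for the CUDA structs (the field names here are illustrative assumptions, not taken from the original code):

```python
NUM_OF_OBJECTS = 4
EMPTY = 0xFFFFFFFF  # sentinel for unused slots

# General Object array: one record per object.
general_objects = [
    {"id": i, "centroid": (0.0, 0.0, 0.0), "radius": 1.0}
    for i in range(NUM_OF_OBJECTS)
]

# Cell ID and Object List arrays: 8 slots per object
# (1 home cell + up to 2**3 - 1 phantom cells).
cell_id_array = [EMPTY] * (NUM_OF_OBJECTS * 8)
object_list_array = [
    {"object_id": None, "is_phantom": None, "cell_type": None}
    for _ in range(NUM_OF_OBJECTS * 8)
]
```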
The other problem associated with parallel execution is performing the same collision test between two objects more than once. This arises when two objects with centroids in different cells overlap each other's cells. To eliminate this possibility we:
1. Store values to track the H cell and the P cells intersected by the bounding volume of each object.
2. Further increase the size of the bounding sphere by sqrt(2) times and scale the cell size to 1.5 times the scaled bounding sphere.
Before performing a collision test during pass T, we check whether the H cell type T' of either participating object is less than T and common to the cell types of the 2 objects. If so, we can forgo the test, since it has already been performed in pass T'. For example, O1 and O2 have the same overlapping cells but different H cell types; in pass T3 we will not consider collision detection, since it has already been performed in the lower pass T1. In the case of O3 and O4, both overlap cells of type 3 but have their centroids in different cells; the cell size ensures that the two cannot collide.

Eliminating duplicate collision tests using cell types and scaled bounding spheres and cells (Image courtesy Chapter 32, GPU Gems 3)

Our parallel implementation of Spatial Subdivision on CUDA:

The creation of the H and P cell IDs for each object is done by one thread. If there are more objects than threads, one thread works on more than one object.

Preprocessing:
1. Determine the bounding sphere of each object.
2. Find the largest bounding sphere, i.e. the largest object.
3. Scale it by sqrt(2), ensuring the grid cell is at least 1.5 times as large as the scaled bounding sphere of the largest object.
4. Create the Object structure:
   - Object ID
   - Position (x, y, z) in 3D space - generated randomly
   - Bounding sphere (B.S.) dimension - generated randomly
5.
Create the Object List structure:
   - Object ID
   - Cell category H or P (indicated with 0 for H and 1 for P)
   - H cell type (1 to 2^n)
   - P cell type (1 to 2^n)
6. Create the Cell ID and Object List arrays of NUM_OF_OBJECTS x 8 elements.
   - The Cell ID array stores the cell ID of each object's H cell followed by the cell IDs of its P cells. The cell ID is calculated as a hash of the cell containing the relevant coordinates. Whenever the Cell ID array is sorted, the Object List array is reorganized in the same order.
   - The Object List array stores the object ID of every object, the category of the cell (H or P), and the cell type value (1 to 8).
7. Create the Object array.

Generation and assignment of the Cell ID and Object List arrays:
1. The object IDs are generated and assigned sequentially to both the Object array and the Object List array.
2. A CUDA kernel is launched to populate the Cell ID and Object List arrays.
3. Each thread works on one object, or on multiple objects if the number of objects is greater than the number of threads. Thread j in block i works on objects iB + j, iB + j + T, iB + j + 2T, ..., i.e. object iB + j + nT in iteration n, where B is the number of threads per block and T is the total number of threads.
4. For each object, the home cell is found by a simple hash function:

   hash = ((int)(obj_d[object_num].cx / cell_size) << xshift) |
          ((int)(obj_d[object_num].cy / cell_size) << yshift) |
          ((int)(obj_d[object_num].cz / cell_size) << zshift);

   This value is stored in the Object List array and the Cell ID array as the home cell ID of the object. The cell category is updated as H in the Object List array.
5. The 3^n - 1 neighbors are next checked for intersection. If there is an intersection, the hash is computed and stored as a P cell ID of the object in the Cell ID array and Object List array. The cell category is updated as P in the Object List array.
6. For slots with no intersection we simply store the value 0xffffffff in the Object List and Cell ID arrays for the corresponding object ID.
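Steps 4 and 5 can be sketched on the CPU as follows. The bit-shift values are assumptions for illustration (the actual xshift/yshift/zshift depend on the grid dimensions), and the sphere-versus-cell test is the standard clamp-to-box check; slots with no intersection keep the 0xffffffff sentinel from step 6.

```python
XSHIFT, YSHIFT, ZSHIFT = 16, 8, 0   # assumed bit fields for the hash
EMPTY = 0xFFFFFFFF                  # sentinel for unused slots

def cell_hash(cell):
    """Pack integer cell coordinates into one cell ID, mirroring step 4's hash."""
    x, y, z = cell
    return (x << XSHIFT) | (y << YSHIFT) | (z << ZSHIFT)

def sphere_overlaps_cell(centroid, radius, cell, cell_size):
    """Step 5's neighbor test: clamp the sphere centre to the cell's box and
    compare the squared distance against radius**2."""
    d2 = 0.0
    for c, k in zip(centroid, cell):
        nearest = min(max(c, k * cell_size), (k + 1) * cell_size)
        d2 += (c - nearest) ** 2
    return d2 <= radius * radius
```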
Entries with the value 0xffffffff will be ignored during collision list processing. After the Cell ID array is created by spatial subdivision, it needs to be sorted. While sorting the Cell ID array, the order of the Object List array must be kept in step, hence there is a need for a stable sorting algorithm. We use a parallel radix sort to do this, which in turn needs a parallel prefix sum; its details follow in the next section.

Parallel Prefix Sum

The all-prefix-sums operation on a given array A(1, 2, 3, ..., n) is defined as the array whose k-th element is the result of applying a binary associative operator ʘ to the elements from index 1 to k - 1 [Blelloch, 1990]. For an array

A:          a1, a2, a3, ..., an
Prefix sum: I, (a1), (a1 ʘ a2), ..., (a1 ʘ a2 ʘ ... ʘ an-1)

where I is the identity of ʘ. For example, if ʘ is addition, then for

A:          1 3 5 6 7
Prefix sum: 0 1 4 9 15

The all-prefix-sums operation performed on an array is usually called a scan [GPU Gems, Chapter 39]. We are interested in the version where ʘ is addition. The serial version of the prefix scan initializes the first element of the scan array to 0 (the additive identity) and obtains each current element of the scan array by adding the previous element of the original array to the previous element of the scan array:

out[0] = 0;
for i = 1 to n - 1
    out[i] = out[i - 1] + in[i - 1]

Sequential scan algorithm [GPU Gems, Chapter 39], where out is the prefix scan array and in is the original input array. The algorithm loops over all the elements of the array once and hence works in O(n) time; an efficient parallel version should be of the same complexity order. A naive parallel algorithm would be the following:

for d = 1 to log2 n do
    for all k in parallel do
        if k >= 2^(d-1) then
            x[k] = x[k - 2^(d-1)] + x[k]

Naive parallel scan algorithm [GPU Gems, Chapter 39]. This algorithm has complexity O(n log n) and is hence undesirable. A better parallel prefix sum algorithm is presented in [Blelloch, 1990].
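Both scans above can be checked with a serial sketch. The naive parallel version is simulated with double buffering (each step reads the previous step's values, as the parallel threads would); it computes an inclusive scan, and shifting its result right by one position gives the exclusive scan used in the example.

```python
def exclusive_scan_serial(a):
    """O(n) sequential exclusive scan: out[i] = a[0] + ... + a[i-1]."""
    out = [0] * len(a)
    for i in range(1, len(a)):
        out[i] = out[i - 1] + a[i - 1]
    return out

def naive_scan_inclusive(a):
    """The O(n log n) naive 'parallel' scan, simulated serially with
    double buffering. At step d every element k >= 2**(d-1) adds in
    the element 2**(d-1) positions back."""
    x = list(a)
    n = len(x)
    d = 1
    while (1 << (d - 1)) < n:
        prev = list(x)
        for k in range(n):
            if k >= (1 << (d - 1)):
                x[k] = prev[k - (1 << (d - 1))] + prev[k]
        d += 1
    return x
```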
The algorithm contains two phases. In the reduce (up-sweep) phase, the sum of all the elements is calculated in log2 n steps. This can be visualized as the construction of a binary tree with all the elements as the leaves and each higher node the sum of its children. In the next phase, called the down-sweep phase, the root of the conceptual tree formed above is set to zero; then at each step the left child of the tree is given the value of its parent and the right child is given the sum of its parent and the previous value of the left child.

Up-sweep phase [GPU Gems, Chapter 39]:

for d = 0 to log2 n - 1 do
    for all k = 0 to n - 1 by 2^(d+1) in parallel do
        x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

Down-sweep phase [GPU Gems, Chapter 39]:

x[n - 1] <- 0
for d = log2 n - 1 down to 0 do
    for all k = 0 to n - 1 by 2^(d+1) in parallel do
        t = x[k + 2^d - 1]
        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]

(Images courtesy GPU Gems, Chapter 39)

Implementation Considerations:

The algorithm described above is efficient in that it performs O(n) work, but implementing it efficiently on CUDA requires taking care of certain issues.

Avoiding bank conflicts: the up-sweep phase needs to load elements from global memory into shared memory. One way to do this is to make each thread access neighboring elements in the array:

__shared__ temp[];
temp[threadIdx.x * 2]     = Input[threadIdx.x * 2];
temp[threadIdx.x * 2 + 1] = Input[threadIdx.x * 2 + 1];

But this method of access has a stride of 2 and hence results in a two-way bank conflict. To avoid this, each thread instead loads one element from each of the two halves of the array (the reduced bank conflicts case):

temp[threadIdx.x]         = Input[threadIdx.x];
temp[threadIdx.x + n / 2] = Input[threadIdx.x + n / 2];

where n is the array size. Accessing global memory this way also gives better performance, as the accesses are coalesced.
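A serial simulation of the two sweep phases (assuming, as the text does, that the array length is a power of two; here the variable d holds 2^d from the pseudocode):

```python
def blelloch_exclusive_scan(a):
    """Work-efficient exclusive scan: up-sweep (reduce) then down-sweep,
    simulated serially. len(a) must be a power of two."""
    x = list(a)
    n = len(x)
    # Up-sweep: build partial sums bottom-up; x[n-1] ends up as the total.
    d = 1
    while d < n:
        for k in range(0, n, 2 * d):
            x[k + 2 * d - 1] += x[k + d - 1]
        d *= 2
    # Down-sweep: zero the root, then push prefix sums back down the tree.
    x[n - 1] = 0
    d = n // 2
    while d >= 1:
        for k in range(0, n, 2 * d):
            t = x[k + d - 1]
            x[k + d - 1] = x[k + 2 * d - 1]
            x[k + 2 * d - 1] += t
        d //= 2
    return x
```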
Loading shared memory from the two halves of the array does not eliminate all bank conflicts, however, because the up-sweep and down-sweep phases work on elements that are 2^d elements apart. Such accesses result in 2-, 4-, 8- and 16-way bank conflicts at different steps. To avoid these bank conflicts we pad the shared memory array with one element for every 16 elements and access it in the following manner: if the original index is ai, the padded index is ai + ai / 16. For example, the above-mentioned shared memory access becomes

temp[threadIdx.x + threadIdx.x / 16] = Input[threadIdx.x];

instead of

temp[threadIdx.x] = Input[threadIdx.x];

(Image courtesy GPU Gems, Chapter 39)

Array size: the scan algorithm above assumes that the array size is a power of two. When it is not, the array can simply be padded to the next power of two before running the algorithm. When extracting the results it is enough to consider the elements up to the original size of the array, as they are not affected by the elements after them. The given algorithm works using a single block of threads, hence the maximum size of the array is twice the maximum number of threads in a block, i.e. 512 * 2 = 1024.

Our implementation specifications: we use the parallel prefix sum as part of the parallel radix sort algorithm. The maximum length of input array on which we need to perform a prefix scan is 384; this case arises when we use 32 thread blocks with each block having 12 thread groups for parallel radix sorting. Since 384 is well within the limit of 1024, we need not extend the algorithm for larger sizes. The 384-element array is padded to the next power of two, 512, for the reasons stated above. With the parallel prefix sum in place, we use it to perform the parallel radix sort.
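The padding rule can be checked numerically: with 16 banks, a stride-16 access pattern that would land every access in the same bank spreads across distinct banks after padding.

```python
NUM_BANKS = 16

def padded_index(ai):
    """One extra slot per NUM_BANKS elements: padded index = ai + ai // 16."""
    return ai + ai // NUM_BANKS

# A stride-16 pattern hits bank 0 for every access...
banks_before = [(i * 16) % NUM_BANKS for i in range(4)]
# ...but after padding each access lands in a different bank.
banks_after = [padded_index(i * 16) % NUM_BANKS for i in range(4)]
```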
Parallel Radix Sort

Radix sort performs sorting in passes, the result after each pass being the input to the next pass. If the keys are of length B bits, the sort is done in B/L passes, sorting a group of L bits in each pass. Radix sort is a stable sort, which means that the order of the keys with the same value is preserved.

"Each sorting pass of a radix sort proceeds by masking off all bits of each key except for the currently active set of L bits, and then it tabulates the number of occurrences of each of the 2^L possible values that result from this masking. The resulting array of radix counters is then processed to convert each value into offsets from the beginning of the array. In other words, each value of the table is replaced by the sum of all its preceding values; this operation is called a prefix sum. Finally, these offsets are used to determine the new positions of the keys for this pass, and the array of keys is reordered accordingly." [GPU Gems 3, Chapter 32]

Our implementation performs each pass of the parallel radix sort in 3 phases. These phases are described below.

First phase - counting:

First, the currently active group of bits needs to be extracted by masking. This is done with a simple AND operation and a shift of the keys by the appropriate amount. For example, to sort on the lowest 8 bits of the keys, we AND the keys with 0xff and shift right by 0 bits. Once this is done, the occurrences of each of the 2^L possible values are counted in this phase.

Implementation considerations and specifications: when implementing this phase on CUDA, the work of counting the occurrence of each value is divided among the thread blocks. Each thread in a block then counts the occurrences of the values within its assigned portion. Each thread needs its own set of 2^L counters to avoid race conditions.
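The masking and counting steps can be sketched serially (on the GPU each thread, or thread group, keeps its own copy of the counters; here a single counter array stands in for them):

```python
L = 8
NUM_RADICES = 1 << L  # 2**L possible values per pass

def extract_radix(key, pass_num):
    """Mask off all bits except the active L bits for this pass
    (pass 0 = least-significant L bits)."""
    return (key >> (pass_num * L)) & (NUM_RADICES - 1)

def count_radices(keys, pass_num):
    """Tabulate occurrences of each of the 2**L masked values."""
    counters = [0] * NUM_RADICES
    for k in keys:
        counters[extract_radix(k, pass_num)] += 1
    return counters
```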
The parallel radix sort therefore requires Num_Blocks * Num_Threads_Per_Block * Num_Radices counters [GPU Gems 3, Chapter 32], where Num_Radices = 2^L. The GPUs we worked on (Tesla) both have a shared memory size of 16 KB, which allows a maximum of 4096 32-bit counters per block. We prefer the size of the thread block to be a multiple of 64 [GPU Gems 3, Chapter 32], and choosing Num_Threads_Per_Block = 256 gives us 16 (4096 / 256) counters per thread. This means L = 4 and 8 passes to sort 32-bit keys (our case). Hence, to reduce the number of passes, we make groups of R = 16 threads share 256 counters. The number of counters required by a block then becomes (Num_Threads_Per_Block / 16) * 2^L. If we choose L = 8 and Num_Threads_Per_Block = 256, the number of counters required exactly matches the maximum available value of 4096. But some of the shared memory is used for housekeeping [GPU Gems 3, Chapter 32], hence we use 192 threads per block; each thread block then has 12 groups of threads that share counters among themselves.

To avoid race conditions among the threads in the same group, we make sure they update the counters sequentially. This is not much of a performance loss, because the number of passes needed to sort 32-bit keys is halved from 8 to 4. Making the threads in a group work sequentially also introduces divergence, but the reduction in the number of passes more than makes up for it.

The elements accessed by each thread block need to be consecutive to preserve the stable-sort property of radix sort, which dictates the thread access pattern over the input array: each thread block accesses an equal portion of the input array; within a thread block, the first group accesses the first part of the block's section; and within the section assigned to a thread group, each thread accesses elements at an offset of 16.
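The counter budget works out as follows, a worked check of the arithmetic above:

```python
SHARED_MEM_BYTES = 16 * 1024          # Tesla-class shared memory per block
MAX_COUNTERS = SHARED_MEM_BYTES // 4  # 32-bit counters
L = 8
NUM_RADICES = 1 << L                  # 256 radices per pass
THREADS_PER_BLOCK = 192               # leaves shared memory for housekeeping
THREADS_PER_GROUP = 16                # R = 16 threads share one counter set
GROUPS_PER_BLOCK = THREADS_PER_BLOCK // THREADS_PER_GROUP
counters_per_block = GROUPS_PER_BLOCK * NUM_RADICES
passes_for_32bit_keys = 32 // L
```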
The access pattern just described determines the index of the first element accessed by each thread. The next phase also requires the radix counters of each block to be in a particular layout. For such a layout, when writing the counters back to global memory the memory access pattern is radixCounters[i * 16 + mx][gx], where i * 16 + mx gives the value of the radix, mx = threadIdx.x % 16, and gx = threadIdx.x / 16 is the group of the thread. The stride of such an access is the number of groups, which is 12 in our case. This causes bank conflicts, and hence to avoid them extra shared memory is created: instead of declaring

__shared__ int radixCounters[NUM_OF_RADICES][GROUPS_PER_BLOCK];

we use

__shared__ int radixCounters[NUM_OF_RADICES][GROUPS_PER_BLOCK + 1];

This makes the stride an odd number (13) and hence avoids bank conflicts. This phase is implemented in 2 kernels, one for masking and the other for counting. Since our use of the radix sort is to sort the cell IDs, the keys that are passed are the Cell ID array elements.

Results:

Phase   Number of Blocks   Number of elements   Processing Time
1       32                 3072                 0.04875000
1       32                 6144                 0.04625000
1       32                 15360                0.04700000
1       32                 30720                0.06200000

Second phase - radix summation:

In the final phase we reorder the input elements based on offsets. These offsets are calculated in this phase, with a small amount of work left for the next phase. We need to find the prefix sum and the total sum of the counters for each radix. The work is again divided among the thread blocks, and each thread block gets the counters of an equal number of radices to perform the prefix sum on. The calculation of the prefix sum is described in the subsection Parallel Prefix Sum and implemented in the same way mentioned there. After this phase we have the total sum of the counters for each radix and the prefix sum of the counters within each radix. But to reorder the elements in the final phase we need to convert the total sums into absolute offsets.
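The per-radix summation can be sketched serially: given one row of counters per processing unit (a thread group, in our layout), an exclusive scan down each radix column yields each unit's offset within that radix plus the radix total; converting the totals to absolute offsets is left to the next phase, as described.

```python
def radix_offsets(counter_rows):
    """counter_rows[u][r]: occurrences of radix r counted by unit u.
    Returns (within, totals): within[u][r] is unit u's exclusive offset
    inside radix r's span; totals[r] is the overall count for radix r."""
    num_units, num_radices = len(counter_rows), len(counter_rows[0])
    within = [[0] * num_radices for _ in range(num_units)]
    totals = [0] * num_radices
    for r in range(num_radices):
        running = 0
        for u in range(num_units):  # exclusive scan down the column
            within[u][r] = running
            running += counter_rows[u][r]
        totals[r] = running
    return within, totals
```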
Converting the total sums into absolute offsets needs the totals of the radices from the other blocks as well, and hence requires inter-block synchronization; so it is done in the next phase, in a new kernel.

Results: for a 256-element array the timing is 0.00018457 ms.

Third phase - reordering:

This phase first calculates the absolute offsets needed to reorder the elements. From the total sums computed in the last phase, it performs one parallel prefix scan to get the absolute offsets. Once the offsets are found, it adds the corresponding absolute offset to the prefix scan of the counters of each radix. Finally, starting with the reordering, each block reads values from the input array the same way it did in the first phase, in groups, and reorders the elements in the following steps:

"1. Reads the value of the radix counter corresponding to the thread group and radix of the cell ID
2. Writes the cell ID to the output cell ID array at the offset indicated by the radix counter
3. Increments the radix counter" [GPU Gems 3, Chapter 32]

The write and increment steps again need to be done sequentially by the threads in the same group, to avoid race conditions.

Implementation considerations and specifications: when reordering the elements, the kernel must be passed not just the keys but also the data attached to the keys. This is because the radix sort is done in passes, the output from each pass being sent as input to the next pass, and this is what maintains the stability of the sort. So in our case, while the cell IDs are the keys we need to sort, we also send the object ID array to the kernel so that it is rearranged according to the changes made to the cell IDs.
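A serial sketch of one complete pass, combining the counting, the count-to-offset conversion, and the stable reordering of keys together with their attached object IDs:

```python
def radix_sort_pass(keys, values, pass_num, L=8):
    """One stable counting-sort pass over the active L bits, rearranging
    the attached values (e.g. object IDs) alongside the keys."""
    num_radices = 1 << L
    shift = pass_num * L
    counts = [0] * num_radices
    for k in keys:
        counts[(k >> shift) & (num_radices - 1)] += 1
    # Exclusive prefix sum converts counts into starting offsets.
    offsets, running = [0] * num_radices, 0
    for r in range(num_radices):
        offsets[r], running = running, running + counts[r]
    out_keys, out_vals = [0] * len(keys), [None] * len(keys)
    for k, v in zip(keys, values):  # in-order walk preserves stability
        r = (k >> shift) & (num_radices - 1)
        out_keys[offsets[r]], out_vals[offsets[r]] = k, v
        offsets[r] += 1
    return out_keys, out_vals
```

Running B/L = 4 passes with L = 8 sorts 32-bit keys, with the value array tracking every move of its key.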
Results:

Phase   Number of Blocks   Number of elements   Processing Time
3       32                 3072                 0.06125000
3       32                 6144                 0.06275000
3       32                 15360                0.06200000
3       32                 30720                0.06425000

The total processing time for the radix sort is 1.8 ms, including all memory transfers.

Detecting Collisions

Creating the collision cell list: once we have the Cell ID array sorted, we begin preparing for the collision detection tests between objects contained in the same cell. After performing the radix sort on the Cell ID array, identical cell IDs sit next to each other in the list. We need to scan this list for changes in cell ID, which indicate the end of all the objects within one particular cell and the beginning of a new cell with all the objects within it.

Implementation considerations: we parallelize this by assigning each thread a roughly equal chunk of the Cell ID array, which the thread scans for changes in cell ID. Each thread ignores the first change it encounters and leaves it to be recorded by the preceding thread, except for the first thread, which begins at the first cell ID; the second change a thread sees marks the end of the first transition it encountered. Invalid cell IDs with the value 0xffffffff are ignored. The sorted Cell ID array is then processed as follows:
1. Count the number of objects contained within each collision cell and convert the counts to offsets within the array through a parallel prefix sum in several passes, similar to the one done in phase 2 of the radix sort referred to above.
2. For every collision cell, create an entry in a new array with its start index and the number of H and P occupants.
3. Divide these cells into 8 different lists of non-interacting cells, traversed in 8 passes, each pass working on a single list.
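The transition scan can be sketched serially; it emits (start index, occupant count) for every cell holding more than one entry and skips the 0xffffffff sentinels:

```python
EMPTY = 0xFFFFFFFF

def collision_cells(sorted_ids):
    """Scan a sorted Cell ID array for transitions and return (start, count)
    for every cell with more than one occupant -- a candidate collision cell.
    Invalid entries (0xffffffff) are ignored."""
    cells = []
    i, n = 0, len(sorted_ids)
    while i < n:
        j = i
        while j < n and sorted_ids[j] == sorted_ids[i]:
            j += 1  # advance past the run of identical cell IDs
        if sorted_ids[i] != EMPTY and j - i > 1:
            cells.append((i, j - i))
        i = j
    return cells
```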
Scan the sorted Cell ID array for transitions to locate collision cells (Image courtesy Chapter 32, GPU Gems 3)

Traversing the collision cell list: every collision cell is assigned to a thread for the collision check. The thread checks for intersections of the bounding volumes and indicates a collision, serving as the narrow-phase collision detection.

Thread assignment of a collision list (Image courtesy Chapter 32, GPU Gems 3)

References:

Le Grand, Scott. 2007. "Broad-Phase Collision Detection with CUDA." In GPU Gems 3, Chapter 32. NVIDIA Corporation.
McCutchan, John. 2006. "Introduction to Collision Detection." Computing and Software Department, McMaster University, November 9, 2006.
Erleben, Kenny, Jon Sporring, Knud Henriksen, and Henrik Dohlmann. 2005. Physics-Based Animation. Charles River Media. See especially pp. 613-616.
Mirtich, Brian. 1996. "Impulse-Based Dynamic Simulation of Rigid Body Systems." Ph.D. thesis. Department of Computer Science, University of California, Berkeley.
Sedgewick, Robert. 1992. Algorithms in C++. Addison-Wesley. See especially pp. 133-144.
Witkin, A., and D. Baraff. 1997. "An Introduction to Physically Based Modeling: Rigid Body Simulation II - Nonpenetration Constraints." SIGGRAPH 1997 Course 19.
Blelloch, Guy E. 1989. "Scans as Primitive Parallel Operations." IEEE Transactions on Computers 38(11), pp. 1526-1538.
Blelloch, Guy E. 1990. "Prefix Sums and Their Applications." Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University.

Submitted By: Aritra Nath (9119 6092), Harsha Hardhagari (1883 9151)