CUDA-Based Implementation of GSLIB: The Geostatistical Software Library
Daniel Baeza ([email protected]), Oscar Peredo ([email protected])

The Mining Process: Exploration → Evaluation → Planning → Operation

GSLIB: The Geostatistical Software Library
• GSLIB is a software package composed of a set of utilities and applications related to geostatistics
• Fully implemented in Fortran 77/90
• Runs on OSX, Linux and Windows
• Widely used by academics, researchers and engineers
Reference: GSLIB: Geostatistical Software Library and User's Guide (1998), Deutsch, Clayton V. and Journel, André G.

Variogram calculation with GSLIB
• gamv is the GSLIB variogram calculation program
• It is a fundamental tool in geostatistics
• It quantifies the spatial variability of a variable
• It is used in geostatistical estimation and simulation
• It has a high computational cost
[Figure: normal-scores semivariograms (γ vs. distance, 0–500) for low-solubility and high-solubility data.]

Variogram calculation
Given sample values z(u) at locations u and a lag vector h, the experimental semivariogram is

\[ 2\gamma(h) = \frac{1}{N(h)} \sum_{\|u_a - u_b\| \approx h} \left[ z(u_a) - z(u_b) \right]^2 , \]

where N(h) is the number of sample pairs (a, b) separated approximately by h. Small values of γ(h) indicate low spatial variability; large values indicate high variability.

Variogram computation: Sequential & Parallel Implementations

Among its main applications, gam and gamv can calculate variogram values from structured and scattered 3D data points. In this work we are interested in gamv, since it is of general usage for any data set and also of special interest in terms of execution speed (currently gamv is orders of magnitude slower than gam). Inside gamv it is possible to distinguish three subroutines: the first loads the data and parameters, the second carries out the main computation, and the third stores the results in an output file. In the current implementation of gamv, the variogram is calculated by measuring the spatial variability between all pairs of points separated by a lag vector distance. This spatial search has a computational cost of order O(n²), where n is the number of data points. The sequential implementation of GSLIB gamv is shown in Algorithm 1. Lines 4 and 5 correspond to the loops over all pairs of samples from the database, lines 6 to 12 correspond to the extraction of statistics from the selected pairs of points, and line 13 performs the calculation of the variogram values from the previously extracted statistics.

Sequential implementation

Algorithm 1: gamv sequential implementation from GSLIB
Input:
• (VR, Ω): sample data values VR (m columns) defined in a 3D domain of coordinates Ω
• nvar: number of variables (nvar ≤ m)
• nlag: number of lags
• h: lag separation distance
• ndir: number of directions
• h_1, …, h_ndir: directions
• τ_1, …, τ_ndir: geometrical tolerance parameters
• nvarg: number of variograms
• (ivtype_1, ivtail_1, ivhead_1), …, (ivtype_nvarg, ivtail_nvarg, ivhead_nvarg): variogram types
1   Read input parameter file;                         /* setup parameters & load data */
2   Read sample data values file;
3   Σ ← zeros(nvar × nlag × ndir × nvarg);
4   for i ∈ {1, …, |Ω|} do                              /* main computation: loop over pairs of points */
5     for j ∈ {i, …, |Ω|} do
6       for id ∈ {1, …, ndir} do
7         for iv ∈ {1, …, nvarg} do
8           for il ∈ {1, …, nlag} do
9             p_i = (x_i, y_i, z_i) ∈ Ω;
10            p_j = (x_j, y_j, z_j) ∈ Ω;
11            if (p_i, p_j) satisfy tolerances τ_id and ||p_i − p_j|| ≈ h_id · il · h then
12              Save (VR_{i,ivhead_iv} or VR_{i,ivtail_iv}) and (VR_{j,ivhead_iv} or VR_{j,ivtail_iv}) according to variogram type ivtype_iv into Σ;
13  Build the variogram values γ from the statistics Σ and write them to the output file;
Output: output file with the γ values
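To make the pair-loop structure of Algorithm 1 concrete, the sketch below computes a single omni-directional experimental semivariogram for one variable with the same O(n²) pair search. It is a minimal illustration, not the GSLIB code: the Point struct, the function name semivariogram and the simple nearest-lag binning (with no directional or geometrical tolerances) are assumptions introduced here.

```cpp
#include <cmath>
#include <vector>

struct Point { double x, y, z; };  // hypothetical layout of the coordinates in Ω

// Omni-directional experimental semivariogram:
// gamma[il] ≈ (1 / 2N(il·h)) * Σ [z(u_a) - z(u_b)]^2 over pairs falling in lag bin il.
void semivariogram(const std::vector<Point>& pts, const std::vector<double>& vr,
                   int nlag, double lag,
                   std::vector<double>& gamma, std::vector<long>& npairs) {
    gamma.assign(nlag, 0.0);
    npairs.assign(nlag, 0);
    const std::size_t n = pts.size();
    for (std::size_t i = 0; i < n; ++i) {              // loop over all pairs: O(n^2)
        for (std::size_t j = i + 1; j < n; ++j) {
            const double dx = pts[i].x - pts[j].x;
            const double dy = pts[i].y - pts[j].y;
            const double dz = pts[i].z - pts[j].z;
            const double d  = std::sqrt(dx*dx + dy*dy + dz*dz);
            const int il = static_cast<int>(d / lag + 0.5);   // nearest lag bin
            if (il < 1 || il > nlag) continue;
            const double diff = vr[i] - vr[j];
            gamma[il - 1]  += 0.5 * diff * diff;       // accumulate statistics (Algorithm 1, line 12)
            npairs[il - 1] += 1;
        }
    }
    for (int il = 0; il < nlag; ++il)                  // build variogram values (Algorithm 1, line 13)
        if (npairs[il] > 0) gamma[il] /= npairs[il];
}
```

The actual gamv additionally filters every pair by the directional and geometrical tolerances τ_id and supports several variogram types and variables, which the sketch omits for brevity.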
Commonly, gamv is not a costly part of the structural modeling process in terms of computational time, but some processes may need to calculate a variogram many times during their execution. Examples of such processes are automatic variogram fitting or inverse modeling of kernels fitting experimental variogram values. In these scenarios the gamv execution time becomes a bottleneck. In this work we focused our effort on translating the current gamv main computation to a CUDA implementation.
Parallel implementation

The main computation of gamv was translated into a CUDA kernel, extract_statistics_kernel(), shown in Algorithm 2.

Algorithm 2: extract_statistics_kernel()
Input:
• (VR, Ω): sample data values VR (m columns) defined in a 3D domain of coordinates Ω
• nvar: number of variables (nvar ≤ m)
• nlag: number of lags
• h: lag separation distance
• ndir: number of directions
• h_1, …, h_ndir: directions
• τ_1, …, τ_ndir: geometrical tolerance parameters
• nvarg: number of variograms
• (ivtype_1, ivtail_1, ivhead_1), …, (ivtype_nvarg, ivtail_nvarg, ivhead_nvarg): variogram types
1   id_x = blockIdx.x*blockDim.x + threadIdx.x          /* x thread coordinate in the GPU grid */
2   id_y = blockIdx.y*blockDim.y + threadIdx.y          /* y thread coordinate in the GPU grid */
3   Set and initialize shared memory in the block
4   syncthreads()
5   iter_x = id_x
6   iter_y = id_y
7   while (iter_x, iter_y) ∈ ChunkPoints                /* chunk of points belonging to thread (x,y) */
8   do
9     j = iter_x + |Ω|/2
10    i = iter_y
11    compute_statistics(p_i, p_j)
12    store result in shared memory via atomic functions
13    if (iter_x > iter_y) then
14      i = iter_y
15      j = iter_x
16      compute_statistics(p_i, p_j)
17      store result in shared memory via atomic functions
18    else
19      if (iter_x == iter_y) then
20        i = iter_y
21        j = iter_y
22        compute_statistics(p_i, p_j)
23        store result in shared memory via atomic functions
24        i = iter_x + |Ω|/2
25        j = iter_y + |Ω|/2
26        compute_statistics(p_i, p_j)
27        store result in shared memory via atomic functions
28    update(iter_x, iter_y)
29  syncthreads()
30  Save the statistics values that are in shared memory into global memory via atomic functions
Output: statistics array Σ

GPU kernel: each thread computes only a couple of correlation values per iteration.
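The CUDA sketch below illustrates the structure of Algorithm 2 for a single omni-directional variogram of one variable: a 2D thread grid covers the point pairs, each thread accumulates lag statistics in per-block shared memory with atomicAdd, and the block then merges its partial sums into global memory. The kernel signature, the MAX_LAG bound and the nearest-lag binning are assumptions made for this example, and, for clarity, it assigns one pair per thread instead of the chunked traversal used in Algorithm 2; it is not the actual GSLIB-CUDA interface.

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define MAX_LAG 64   // compile-time bound on the number of lags (assumption for this sketch)

// One omni-directional semivariogram, one variable: each thread handles one pair (i, j),
// accumulates into per-block shared memory with atomics, and the block then flushes its
// partial sums to global memory, as in lines 11-12 and 29-30 of Algorithm 2.
__global__ void extract_statistics_kernel(const float* x, const float* y, const float* z,
                                          const float* vr, int n, int nlag, float lag,
                                          float* gamma_global, unsigned int* np_global) {
    __shared__ float        s_gamma[MAX_LAG];   // partial sums of 0.5*(z_i - z_j)^2 per lag
    __shared__ unsigned int s_np[MAX_LAG];      // partial pair counts per lag

    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    for (int il = tid; il < nlag; il += blockDim.x * blockDim.y) {
        s_gamma[il] = 0.0f;
        s_np[il]    = 0u;
    }
    __syncthreads();

    int i = blockIdx.y * blockDim.y + threadIdx.y;   // "head" sample index
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // "tail" sample index

    if (i < n && j < n && i < j) {                   // visit each unordered pair exactly once
        float dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
        float d  = sqrtf(dx * dx + dy * dy + dz * dz);
        int  il  = (int)(d / lag + 0.5f);            // nearest lag bin
        if (il >= 1 && il <= nlag) {
            float diff = vr[i] - vr[j];
            atomicAdd(&s_gamma[il - 1], 0.5f * diff * diff);
            atomicAdd(&s_np[il - 1], 1u);
        }
    }
    __syncthreads();

    for (int il = tid; il < nlag; il += blockDim.x * blockDim.y) {
        atomicAdd(&gamma_global[il], s_gamma[il]);   // merge block results into global memory
        atomicAdd(&np_global[il], s_np[il]);
    }
}
```

The real kernel additionally handles several directions, tolerances, variables and variogram types, and γ(il·h) is obtained afterwards on the host by dividing each accumulated sum by its pair count.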
[Figure: block-wise traversal of the pair matrix. In each step (STEP 1, STEP 2, STEP 3, …, STEP N) every thread block processes a different chunk of point pairs, until all pairs have been visited.]
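A host-side launch of the kernel above might look as follows; the grid and block dimensions, buffer names and the final normalization step are illustrative assumptions rather than the configuration used for the reported experiments.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Host-side sketch: allocate the result buffers, launch the kernel, normalize on the CPU.
// Assumes d_x, d_y, d_z, d_vr already hold the n sample coordinates/values on the device,
// and that n/16 blocks per grid dimension fit within the hardware grid limits (the real
// implementation avoids this constraint by chunking the pair space as in Algorithm 2).
void run_variogram(const float* d_x, const float* d_y, const float* d_z, const float* d_vr,
                   int n, int nlag, float lag, float* gamma_host) {
    float*        d_gamma = nullptr;
    unsigned int* d_np    = nullptr;
    cudaMalloc(&d_gamma, nlag * sizeof(float));
    cudaMalloc(&d_np,    nlag * sizeof(unsigned int));
    cudaMemset(d_gamma, 0, nlag * sizeof(float));
    cudaMemset(d_np,    0, nlag * sizeof(unsigned int));

    dim3 block(16, 16);                                   // 256 threads per block (assumption)
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    extract_statistics_kernel<<<grid, block>>>(d_x, d_y, d_z, d_vr, n, nlag, lag, d_gamma, d_np);
    cudaDeviceSynchronize();

    std::vector<float>        gamma(nlag);
    std::vector<unsigned int> np(nlag);
    cudaMemcpy(gamma.data(), d_gamma, nlag * sizeof(float),        cudaMemcpyDeviceToHost);
    cudaMemcpy(np.data(),    d_np,    nlag * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    for (int il = 0; il < nlag; ++il)                      // gamma(il*h) = sum / N(il*h)
        gamma_host[il] = (np[il] > 0) ? gamma[il] / np[il] : 0.0f;

    cudaFree(d_gamma);
    cudaFree(d_np);
}
```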
Results

[Figure: omni-directional semivariograms for the 510x510x1 and 10x10x120 data sets, standard version vs. CUDA version.]

Standard version hardware: Dell Precision T7600, Intel Xeon E5 at 3.3 GHz, 10 MB cache, 16 GB RAM (1.6 GHz), 1 TB HDD, Linux.
CUDA version hardware: Tesla C2075, 448 CUDA cores at 1.15 GHz, 6 GB RAM, 114 GB/s memory bandwidth.

Execution times and speedup:

Number of points | Standard version (s) | CUDA version (s) | Speedup
           5000  |                0.80  |            0.25  |   3.2x
          12000  |                4.68  |            0.53  |   8.8x
          65000  |                9.52  |            0.61  |  15.6x
         261000  |              143.20  |            3.62  |  39.6x
         500000  |              518.69  |           12.28  |  42.2x
         800000  |             1330.00  |           28.45  |  46.7x
        1000000  |             2095.00  |           43.20  |  48.5x

For the largest case, the sequential CPU version takes about 34 min while the parallel GPU version takes about 43 sec, a speedup of roughly 48x.

Current and future work
• Finish the CUDA versions of the indicator simulation and Gaussian simulation programs of GSLIB
• Release the first GSLIB-CUDA version with these three methods

Thanks! Questions?