
CUDA-Based Implementation of GSLIB: The Geostatistical Software Library
Daniel Baeza
Oscar Peredo
[email protected]
[email protected]
The Mining Process
• Exploration
• Evaluation
• Planning
• Operation
GSLIB: The Geostatistical Software Library
• GSLIB is a software package composed of a set of utilities and applications related to geostatistics
• Fully implemented in Fortran 77/90
• Runs on OS X, Linux and Windows
• Widely used by academics, researchers and engineers

Deutsch, C. V. and Journel, A. G. (1998). GSLIB: Geostatistical Software Library and User's Guide.
Variogram calculation with GSLIB
• gamv is the GSLIB variogram calculation program
• It is a fundamental tool in geostatistics
• It allows quantifying the spatial variability of a variable
• It is used in geostatistical estimation and simulation
• It has a high computational cost
Figure: normal scores semivariograms for the low-solubility and high-solubility variables (γ versus distance, 0–500).
Variogram calculation

2γ(h) = (1 / N(h)) · Σ [z(u) − z(u + h)]²

where the sum runs over the N(h) pairs of samples (a, b) = (u, u + h) with ||a − b|| = h.

Figure: schematic of a sample pair Z(u), Z(u + h) in the (x, y, z) domain, and the variogram γ(h) plotted against h, rising from less variability at short separation distances to more variability at large ones.
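As a concrete illustration of the formula above (not part of GSLIB), the following minimal C++ sketch evaluates γ(h) for a few lags on a regularly spaced 1-D transect; the sample values in z are made up for the example.

#include <cstddef>
#include <cstdio>
#include <vector>

// Experimental semivariogram on a regular 1-D transect:
// 2*gamma(h) = (1/N(h)) * sum over the N(h) pairs separated by lag h of [z(u) - z(u+h)]^2.
// GSLIB's gamv additionally handles scattered 3-D data, directions and tolerances.
double semivariogram(const std::vector<double>& z, int lag) {
    double sqsum = 0.0;
    long npairs = 0;
    for (std::size_t i = 0; i + lag < z.size(); ++i) {
        double d = z[i] - z[i + lag];
        sqsum += d * d;
        ++npairs;
    }
    return (npairs > 0) ? sqsum / (2.0 * npairs) : 0.0;  // gamma(h)
}

int main() {
    std::vector<double> z = {0.1, 0.4, 0.3, 0.9, 1.2, 0.8, 1.5, 1.1};
    for (int lag = 1; lag <= 3; ++lag)
        std::printf("gamma(%d) = %.4f\n", lag, semivariogram(z, lag));
    return 0;
}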
Variogram computation
Sequential & Parallel Implementations
Among its main applications, gam and gamv can calculate variogram values from structured and scattered 3D data points. In this work we are interested in gamv, since it is of general use for any data set and of special interest in terms of execution speed (currently gamv is orders of magnitude slower than gam). Inside gamv it is possible to distinguish three subroutines: the first loads the data and parameters, the second performs the main computation, and the third stores the results in an output file. In the current implementation of gamv, the variogram is calculated by measuring the spatial variability between all pairs of points separated by a lag vector distance. This spatial search has a computational cost of order O(n²), where n is the number of data points. The sequential implementation of GSLIB gamv is shown in Algorithm 1. Lines 4 and 5 correspond to the loops over all pairs of samples in the database, lines 6 to 12 correspond to statistics extraction from the selected pairs of points, and line 13 performs the calculation of the variogram values from the previously extracted statistics.
Sequential implementation

Algorithm 1: gamv sequential implementation from GSLIB
Input:
• (VR, Ω): sample data values VR (m columns) defined in a 3D domain of coordinates Ω
• nvar: number of variables (nvar ≤ m)
• nlag: number of lags
• h: lag separation distance
• ndir: number of directions
• h1, ..., hndir: directions
• τ1, ..., τndir: geometrical tolerance parameters
• nvarg: number of variograms
• (ivtype1, ivtail1, ivhead1), ..., (ivtypenvarg, ivtailnvarg, ivheadnvarg): variogram types

1  Read input parameter file;                                    /* setup parameters & load data */
2  Read sample data values file;
3  Initialize the statistics array to zeros(nvar × nlag × ndir × nvarg);
4  for i ∈ {1, ..., |Ω|} do                                      /* main computation: loop over pairs of points */
5    for j ∈ {i, ..., |Ω|} do
6      for id ∈ {1, ..., ndir} do
7        for iv ∈ {1, ..., nvarg} do
8          for il ∈ {1, ..., nlag} do
9            pi = (xi, yi, zi) ∈ Ω;
10           pj = (xj, yj, zj) ∈ Ω;
11           if (pi, pj) satisfy tolerances τid and ||pi − pj|| ≈ hid × il × h then
12             Save (VRi,ivheadiv or VRi,ivtailiv) and (VRj,ivheadiv or VRj,ivtailiv) according to variogram type ivtypeiv into the statistics array;
13 Build the variogram from the statistics and write the values in the output file   /* read computation results */
Output: output file with the variogram values
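For readers who prefer code over pseudocode, here is a condensed C++ sketch of the sequential sweep of Algorithm 1 under simplifying assumptions: a single variable, a single omni-directional variogram and a fixed lag tolerance of h/2. The names Point, PairStats and extract_statistics are illustrative and are not GSLIB identifiers.

#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y, z; };

// Per-lag accumulators: number of pairs and sum of squared differences.
struct PairStats {
    std::vector<long long> npairs;
    std::vector<double> sqsum;
};

// Simplified sequential sweep over all pairs (Algorithm 1, lines 4-12):
// one variable, one omni-directional variogram, lag tolerance h/2.
void extract_statistics(const std::vector<Point>& p, const std::vector<double>& vr,
                        int nlag, double h, PairStats& s) {
    s.npairs.assign(nlag + 1, 0);
    s.sqsum.assign(nlag + 1, 0.0);
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i; j < p.size(); ++j) {          // O(n^2) pair loop
            double dx = p[i].x - p[j].x, dy = p[i].y - p[j].y, dz = p[i].z - p[j].z;
            double dist = std::sqrt(dx * dx + dy * dy + dz * dz);
            int il = static_cast<int>(dist / h + 0.5);        // nearest lag index
            if (il >= 1 && il <= nlag && std::fabs(dist - il * h) <= 0.5 * h) {
                double d = vr[i] - vr[j];
                s.npairs[il] += 1;                            // statistics extraction
                s.sqsum[il] += d * d;
            }
        }
    }
    // Line 13 of Algorithm 1 then builds gamma(il*h) = sqsum[il] / (2 * npairs[il]).
}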
Commonly gamv is not a costly part of the structural modeling process in terms of computational time, but some processes may need to calculate a variogram many times during their execution. Examples of such processes are automatic variogram fitting or inverse modeling of kernels fitting experimental variogram values. In these scenarios gamv execution time becomes a bottleneck. In this work we focus our effort on translating the current gamv main computation into a parallel CUDA implementation.
Parallel implementation

Algorithm 2: extract_statistics_kernel()
Input:
• (VR, Ω): sample data values VR (m columns) defined in a 3D domain Ω
• nvar: number of variables (nvar ≤ m)
• nlag: number of lags
• h: lag separation distance
• ndir: number of directions
• h1, ..., hndir: directions
• τ1, ..., τndir: geometrical tolerance parameters
• nvarg: number of variograms
• (ivtype1, ivtail1, ivhead1), ..., (ivtypenvarg, ivtailnvarg, ivheadnvarg): variogram types

1  idx = blockIdx.x*blockDim.x + threadIdx.x      /* x thread coordinate in the GPU grid */
2  idy = blockIdx.y*blockDim.y + threadIdx.y      /* y thread coordinate in the GPU grid */
3  Set and initialize shared memory in the block
4  syncthreads()
5  iterx = idx
6  itery = idy
7  while (iterx, itery) ∈ ChunkPoints             /* chunk of points belonging to thread (x, y) */
8  do
9    j = iterx + |Ω|/2
10   i = itery
11   compute_statistics(pi, pj)
12   store result in shared memory via atomic functions
13   if (iterx > itery) then
14     i = itery
15     j = iterx
16     compute_statistics(pi, pj)
17     store result in shared memory via atomic functions
18   else
19     if (iterx == itery) then
20       i = itery
21       j = itery
22       compute_statistics(pi, pj)
23       store result in shared memory via atomic functions
24       i = iterx + |Ω|/2
25       j = itery + |Ω|/2
26       compute_statistics(pi, pj)
27       store result in shared memory via atomic functions
28   update(iterx, itery)
29 syncthreads()
30 Save the statistics values held in shared memory into global memory via atomic functions
Output: statistics array

GPU kernel: each thread computes only a couple of correlation values.

Figure: the grid of threads traverses the point-pair domain chunk by chunk, in successive steps (STEP 1, STEP 2, ..., STEP N).
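The following CUDA sketch illustrates the main mechanisms of Algorithm 2 (2-D thread indexing, per-block shared-memory accumulation with atomic functions, one block-level flush to global memory) under simplifying assumptions: a single variable, a single omni-directional variogram, single precision, and one (i, j) pair per thread over the upper triangle rather than the chunked half-split traversal of the real kernel. All names here (extract_statistics_kernel_sketch, g_sqsum, MAX_LAG, ...) are illustrative and are not the authors' implementation.

#include <cuda_runtime.h>
#include <math.h>

#define MAX_LAG 64   // assumed upper bound on nlag, so shared arrays can be sized statically

// Sketch of the gamv statistics-extraction kernel: each thread handles one
// (i, j) pair, per-lag partial sums are accumulated in shared memory with
// atomics, and each block flushes its partial sums once into global memory.
__global__ void extract_statistics_kernel_sketch(
        const float* __restrict__ x, const float* __restrict__ y,
        const float* __restrict__ z, const float* __restrict__ vr,
        int n, int nlag, float h,
        unsigned int* g_npairs, float* g_sqsum) {
    __shared__ float s_sqsum[MAX_LAG + 1];
    __shared__ unsigned int s_npairs[MAX_LAG + 1];

    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    for (int il = tid; il <= nlag; il += blockDim.x * blockDim.y) {
        s_sqsum[il] = 0.0f;                                  // initialize shared memory (lines 3-4)
        s_npairs[il] = 0u;
    }
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;         // x coordinate in the GPU grid
    int idy = blockIdx.y * blockDim.y + threadIdx.y;         // y coordinate in the GPU grid

    if (idx < n && idy <= idx) {                             // pair (i, j) = (idy, idx), j >= i
        int i = idy, j = idx;
        float dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
        float dist = sqrtf(dx * dx + dy * dy + dz * dz);
        int il = (int)(dist / h + 0.5f);                     // nearest lag index
        if (il >= 1 && il <= nlag && fabsf(dist - il * h) <= 0.5f * h) {
            float d = vr[i] - vr[j];
            atomicAdd(&s_sqsum[il], d * d);                  // shared-memory atomics (line 12)
            atomicAdd(&s_npairs[il], 1u);
        }
    }
    __syncthreads();

    for (int il = tid; il <= nlag; il += blockDim.x * blockDim.y) {
        atomicAdd(&g_sqsum[il], s_sqsum[il]);                // flush block totals to global memory (line 30)
        atomicAdd(&g_npairs[il], s_npairs[il]);
    }
}

A launch such as dim3 block(16, 16); dim3 grid((n + 15) / 16, (n + 15) / 16); covers all pairs, and the host then assembles gamma(il·h) = g_sqsum[il] / (2·g_npairs[il]), mirroring line 13 of Algorithm 1. The real kernel additionally loops over directions, variogram types and several variables, and reuses each thread for a whole chunk of pairs, as shown in Algorithm 2.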
Results
Figure: omni-directional semivariograms (γ versus distance) for validation, computed on a 510×510×1 data set and on a 10×10×120 data set, with the standard version and the CUDA version overlaid.
CPU test machine: Dell Precision T7600
- Intel Xeon E5, 3.3 GHz CPU frequency
- 10 MB cache
- 16 GB RAM (1,600 MHz)
- 1 TB HDD
- Linux
GPU test device: Tesla C2075
- 448 CUDA cores
- 1.15 GHz CUDA core frequency
- 6 GB RAM
- 114 GB/sec memory bandwidth

Execution times and speedup (standard version vs. CUDA version):

Number of points    Standard version (s)    CUDA version (s)    Speedup
5,000               0.8                     0.25                3.2x
12,000              4.68                    0.53                8.8x
65,000              9.52                    0.61                15.6x
261,000             143.2                   3.62                39.6x
500,000             518.69                  12.28               42.2x
800,000             1,330                   28.45               46.7x
1,000,000           2,095                   43.2                48.5x

Overall, for the largest data set the sequential CPU run takes about 34 min (1x) while the parallel GPU run takes about 43 sec (about 48x speedup).
Current and future work
• Finish the CUDA versions of indicator simulation and Gaussian simulation from GSLIB
• Release the first GSLIB-CUDA version with these three methods
Thanks! Questions?