Big Data Complexities for Scientific Computing in the Oil and Gas Industry
noSQL, SQL, and mo’SQL
http://www.limitpoint.com/images/Publications/BigDataInOilAndGas.pdf
David M. Butler, President
Limit Point Systems, Inc.
© Limit Point Systems, Inc. 2014
Outline
 Big Data in oil & gas exploration & production
 Field theory for data scientists
 The data model paradigm
 The sheaf data model
 A query language for the sheaf data model
The oil and gas business
Adapted from [Krebbers]
“Upstream” is exploration and production (“E&P”) (upper left)
“Downstream” is transportation, refining, and marketing (lower right)
Major Acquired “Upstream” Data Types
 Time lapse raw seismic
 Time lapse prestack seismic image
 Time lapse poststack seismic image
 Well logs
 Production monitoring
dozens of other data types
Time lapse raw seismic data
 each sensor gives amplitude as a function of time
 ~10K sensors
 moving towards ~1M
 ~10K shots
 ~5K samples/shot
 ~4 – 12 bytes/sample
 time lapse:
 repeat ~2/year
 ~10 years
from [KrisEnergy]
~10 TB/project × ~100 projects/year/major company
~1 PB/year/major
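As a rough check on these orders of magnitude, the size of one survey can be estimated from the figures above. This is a sketch only: it assumes that "~5K samples/shot" means samples per sensor per shot, and the name `raw_seismic_bytes` is made up for illustration.

```python
def raw_seismic_bytes(sensors, shots, samples_per_trace, bytes_per_sample):
    """Bytes in one survey, assuming one trace per (sensor, shot) pair."""
    return sensors * shots * samples_per_trace * bytes_per_sample

TB = 10**12

# One survey: ~10K sensors, ~10K shots, ~5K samples, 4 to 12 bytes/sample.
low = raw_seismic_bytes(10_000, 10_000, 5_000, 4)
high = raw_seismic_bytes(10_000, 10_000, 5_000, 12)

print(low / TB, high / TB)  # terabytes per survey
```

Under these assumptions a single survey comes to a few terabytes, and each time lapse repeat adds another survey of similar size.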
Time lapse prestack seismic image data
 clean up seismic data
 remove noise
 remove artifacts
 other signal processing operations
 “migrate” data
 focus signal energy
 convert time to position
 up to 5D array of data
 reflectivity as a function of
 3D position
 source-sensor 2D offset
~same size as raw seismic
Poststack seismic image data
 “stack” of prestack data
 aggregate over 1 or more array indices
 reduces size ~100x
 2D or 3D image
 reflectivity as function of position
 similar to medical ultrasound image
[epmag 1]
interpret to produce model of subsurface
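The idea of aggregating over an array index can be sketched as follows. This is a minimal illustration, assuming the aggregate is a plain average over the offset index; real stacking involves further processing, and the gather values below are invented.

```python
def stack(gather):
    """Average a prestack gather over its offset index.

    gather[k][i] is the reflectivity at depth sample i for offset k.
    Returns one poststack trace with a value per depth sample.
    """
    n_offsets = len(gather)
    return [sum(column) / n_offsets for column in zip(*gather)]

# Hypothetical gather: 4 offsets, 3 depth samples each.
gather = [
    [1.0, 0.0, 2.0],
    [1.0, 0.5, 2.0],
    [1.0, 0.5, 4.0],
    [1.0, 1.0, 4.0],
]
trace = stack(gather)
print(trace)  # 4 offset traces reduced to 1 trace
```

Here the size reduction equals the number of offsets aggregated (4 in this toy case).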
Well logs
 lower sensor package into well
 measure various properties as a function of depth
 ~10k samples
 ~1k components
 simple numbers
 bore hole images
 others
 typically done once before production starts
[decogeo]
~100MB/well × ~1K wells/year/major
~100GB/year/major
Production monitoring
 Classical methods at well head
 flow volumes
 gas/oil/water composition
 temperature
 pressure
 Distributed sensing methods
 fiber optic cables in well
 acoustic sensing
 temperature sensing
 ~1000 equivalent discrete sensors
 ~1k samples/sec
 continuous monitoring
 ~10-100GB/day/well
 function of time and position along well path
~1K wells (growing rapidly)
~1PB/year/major
[epmag 2]
[slb 1]
Major interpreted/modeled data types
 Geological structure model
 Velocity model
 Basin model
 Reservoir models
 geological
 quantitative
 engineering
 Geomechanical model
dozens of other data types
Geological structure model
 geologist interprets seismic image
 identifies surfaces defining rock strata and faults
 very complex networks of intersecting surfaces
 iterative process
 seismic image depends on acoustic velocity
 acoustic velocity depends on rock type
 rock type interpreted from seismic image and well data
 ~1GB/structure
~1K structures/year/major
~1TB/year/major
Velocity model
 velocity of sound as a function of position in volume corresponding to geological structure
 scalar, vector, or tensor models
 used to produce seismic images
 accurate velocity model key to good seismic image
 ~1-10GB/model
[geosoft]
[pdgm 1]
~1K models/year/major
~1TB/year/major
Basin model
 dynamic model of entire sedimentary basin
 rock movement
 fluid movement
 study history of hydrocarbon deposits
 generation
 expulsion
 migration to reservoir
 entrapment
 useful in predicting whether structure contains oil or gas
~100GB/model × ~100 models/year/major
~10TB/year/major
[outernode]
Reservoir models
 static models
 prior to production
 estimate volume and other properties
 dynamic models
 fluid flow
 fluid composition
 function of position and time
 used to guide drilling & production
 keep wells producing
 ~100GB/project
[dgi]
many fields, many versions/year/major
~100 TB/year/major
Geomechanical model
 simulation of mechanical stresses and strains
 whole subsurface
 specific reservoirs
 stress, strain, deformation as function of position and time
 used to anticipate mechanical changes around bore hole and in reservoir
 ~1-10GB/model
[slb 3]
~100 models/year/major
~100GB/year/major
Summary of “Upstream” Data Types
(Order of magnitude estimates)
Variety                 Volume (/object)   Velocity (/year/major)
Raw seismic             ~1TB               ~1PB
Prestack seismic        ~1TB               ~1PB
Poststack seismic       ~10GB              ~10TB
Well logs               ~100MB/well        ~100GB
Production monitoring   ~10GB              ~1PB
Geological structure    ~1GB               ~1TB
Velocity model          ~1GB               ~1TB
Basin model             ~100GB             ~10TB
Reservoir models        ~100GB             ~100TB
Geomechanical model     ~1GB               ~100GB
dozens of other data types, all important
variety rather than volume or velocity is dominant feature
Upstream Data Flow (partial)
[cda]
complex interoperation between data types
Shared Earth Model concept
 integrated data base for evolving models of subsurface
 all data types
 multiple scales
 structure
 reservoir
 basin
 multiple interpretations and versions per object
 uncertainty quantification for everything
 provenance for everything
 constantly evolving
holy grail of Exploration and Production (“E&P”) data integration
in practice: still mostly vendor proprietary islands of integration
Shared Earth Model conceptually similar to conventional enterprise data warehouse
 analysis and report oriented rather than transaction oriented
 integrates data from many different applications
 Extract-Transform-Load (“ETL”) processes a critical component
 conventional warehouse and ETL
 relational data model provides conceptual framework
 Shared Earth Model for E&P data
 relational data model has not proven particularly useful
 why not?
most data is physicist’s “field” data
Field Theory for Data Scientists
 physicist’s “field” not same as database admin’s “field”
 field describes some physical property as function of position and/or time in some physical object
 position in a physical object
 physical property
 physical property as a function of position
use a simple example to introduce these ideas
A simple example
[Figure: branched well, with labeled parts derrick floor, upper well, well junction, lower well, bore 1, and bore 2]
position in a physical object
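 position represented by coordinate vector
$$\vec{r} = \begin{pmatrix} x(p) \\ y(p) \end{pmatrix}$$
[Figure: point p in the plane $R^2$, with coordinates x(p) and y(p) marked on the x and y axes]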
Physical property
 physical property types specified by mathematical physics
 family of types jointly referred to as multilinear algebra
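 scalar types
 single number $F$
 vector types
 column of numbers $\vec{F} = \begin{pmatrix} F_0 \\ F_1 \end{pmatrix}$
 tensor types
 matrix of numbers $\overleftrightarrow{F} = \begin{pmatrix} F_{00} & F_{01} \\ F_{10} & F_{11} \end{pmatrix}$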
each has important algebraic properties
a few dozen standard types, many more app specific types
Physical property as a function of position
 function (map) from physical space to property space
 associates a value of F with each p in the object: $F(\vec{r}) = F(x(p), y(p))$
 infinite number of points
 infinite number of property values
[Figure: property values F attached to points p = (x(p), y(p)) of the object in $R^2$]
how do we represent this on the computer?
How do we represent a field on the computer?
 numerous methods
 small industry busy creating new methods
 makes interoperation and integration difficult
 some common features
 decompose physical object into simple pieces
 approximate by simple function on each piece
Decompose physical object into simple pieces
 mathematicians call each piece a “cell”
 decomposition is a “cell complex”
[Figure: branched well decomposed into cells labeled df, s0, v1, s1, j, s2, v3, s3, v4, s4, v5, s5, v6]
more commonly called a “mesh”
Approximate by simple function on each cell
 for each cell c:
 store a data tuple
 specify an evaluation method
 evaluation method
 $F(p) = \mathrm{eval}_{c(p)}(p, \text{data tuple})$
 data tuple may or may not correspond to value of field at some point
 depends on evaluation method
 example: linear interpolation
[Figure: cell with vertices v0 and v1 carrying data values F0 and F1; a point p with local coordinate u(p) has interpolated value(p) on the line from F0 to F1]
$$\mathrm{value}(p) = u\,F_1 + (1-u)\,F_0$$
data for entire field is an array of tuples
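The linear interpolation rule can be written out directly. This is a minimal sketch; the function names and the convention that u runs from 0 at v0 to 1 at v1 are assumptions made for illustration.

```python
def local_coordinate(x, x0, x1):
    """u(p) for a point at position x in the segment [x0, x1]."""
    return (x - x0) / (x1 - x0)

def eval_linear(u, data):
    """Linear interpolation on one cell.

    u is the local coordinate of p in the cell (0 at v0, 1 at v1);
    data is the cell's data tuple (F0, F1).
    """
    f0, f1 = data
    return u * f1 + (1 - u) * f0

u = local_coordinate(2.5, 2.0, 4.0)
value = eval_linear(u, (10.0, 20.0))
print(u, value)
```

With this evaluation method the data tuple happens to hold the field values at the two vertices; other evaluation methods need not have that property.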
Data for entire field is an array of tuples
scalar:  cell 0: F0 | cell 1: F1 | cell 2: F2 | ... | cell n-1: Fn-1
vector:  cell 0: F0,0 F1,0 | cell 1: F0,1 F1,1 | cell 2: F0,2 F1,2 | ... | cell n-1: F0,n-1 F1,n-1
tensor:  cell 0: F00,0 F01,0 F10,0 F11,0 | cell 1: F00,1 F01,1 F10,1 F11,1 | ... | cell n-1: F00,n-1 F01,n-1 F10,n-1 F11,n-1
tuple components typically real (float or double) but may be of any type
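A whole scalar field stored as an array of tuples might look like the sketch below. It assumes a hypothetical 1D mesh, one tuple (F0, F1) per cell, and linear interpolation as the evaluation method; none of the names come from the original slides.

```python
from bisect import bisect_right

# Hypothetical 1D mesh: cell i spans [vertices[i], vertices[i + 1]].
vertices = [0.0, 1.0, 3.0, 4.0]

# One data tuple (F0, F1) per cell: the whole field is an array of tuples.
field_data = [(0.0, 2.0), (2.0, 6.0), (6.0, 5.0)]

def locate_cell(x):
    """Index c(p) of the cell containing position x."""
    if not vertices[0] <= x <= vertices[-1]:
        raise ValueError("point lies outside the mesh")
    return min(bisect_right(vertices, x) - 1, len(field_data) - 1)

def evaluate(x):
    """F(p) = eval_c(p)(p, data tuple), using linear interpolation."""
    c = locate_cell(x)
    x0, x1 = vertices[c], vertices[c + 1]
    u = (x - x0) / (x1 - x0)
    f0, f1 = field_data[c]
    return u * f1 + (1 - u) * f0

print(evaluate(2.0))
```

Evaluating the field at a point needs both the mesh (to find the cell) and the tuple array (to supply the data), which is why the two are treated as one entity.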
How do we want to use field data?
 operations specified by mathematical physics
 five main categories
 topological operations
 compose and decompose
 geometric operations
 change the shape
 functional operations
 set and get the value at a point
 move field from one mesh to another
 algebraic operations
 add, subtract, multiply, divide, diagonalize, ...
 calculus operations
 differentiate and integrate
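Two of these categories can be sketched for a piecewise linear scalar field on a 1D mesh. The mesh, the field values, and the function names are illustrative assumptions; adding tuples component by component is valid here only because linear interpolation is linear in the data.

```python
# Hypothetical 1D mesh: cell i spans [vertices[i], vertices[i + 1]].
vertices = [0.0, 1.0, 3.0, 4.0]

# Two fields on the same mesh, one (F0, F1) tuple per cell.
f = [(0.0, 2.0), (2.0, 6.0), (6.0, 5.0)]
g = [(1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]

def add_fields(a, b):
    """Algebraic operation: pointwise sum of two fields on the same mesh."""
    return [(a0 + b0, a1 + b1) for (a0, a1), (b0, b1) in zip(a, b)]

def integrate(field):
    """Calculus operation: integral of a piecewise linear field."""
    total = 0.0
    for c, (f0, f1) in enumerate(field):
        width = vertices[c + 1] - vertices[c]
        total += 0.5 * (f0 + f1) * width
    return total

print(integrate(f), integrate(add_fields(f, g)))
```

Both operations take whole fields as arguments and return a field or a number, which is the level at which an application would want to work.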
Why isn’t the relational model useful for field data?
 doesn’t fit the way we want to store field data
 relational schema can’t directly capture field entity
 captures data tuple entity instead of entire field entity
 field entity has to be reconstructed by queries
 normalization forces introduction of surrogate keys
 may require recursive queries
 doesn’t fit the way we want to use field data
 table operations are too low level
 aren’t useful for high level field operations
 no pay-off to using relational model
 most field data is stored in app-specific, proprietary flat files
so what data model is useful for field data?
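To make the mismatch concrete, here is a sketch of the same kind of field stored relationally, using Python's built-in `sqlite3` module. The table names, column names, and the surrogate key `field_id` are hypothetical; the point is only that the schema describes individual data tuples, and the field has to be reassembled by a query before it can be evaluated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cell (cell_id INTEGER PRIMARY KEY, x0 REAL, x1 REAL);
    CREATE TABLE field_value (
        field_id INTEGER,          -- surrogate key naming the field
        cell_id  INTEGER REFERENCES cell(cell_id),
        f0 REAL, f1 REAL,          -- the data tuple for this cell
        PRIMARY KEY (field_id, cell_id)
    );
    INSERT INTO cell VALUES (0, 0.0, 1.0), (1, 1.0, 3.0), (2, 3.0, 4.0);
    INSERT INTO field_value VALUES
        (7, 0, 0.0, 2.0), (7, 1, 2.0, 6.0), (7, 2, 6.0, 5.0);
""")

def evaluate(field_id, x):
    """Reassemble enough of the field from rows to evaluate it at x."""
    row = conn.execute(
        """SELECT c.x0, c.x1, v.f0, v.f1
           FROM field_value AS v JOIN cell AS c ON c.cell_id = v.cell_id
           WHERE v.field_id = ? AND c.x0 <= ? AND ? <= c.x1
           ORDER BY c.cell_id LIMIT 1""",
        (field_id, x, x),
    ).fetchone()
    x0, x1, f0, f1 = row
    u = (x - x0) / (x1 - x0)
    return u * f1 + (1 - u) * f0

print(evaluate(7, 2.0))
```

Nothing in the two tables says that the rows together form a field, or which evaluation method applies; that knowledge lives in the application code.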
The data model paradigm
 Data model [Codd] specifies
 class of mathematical objects
 operations on those objects
 constraints valid instances must satisfy
 Languages, libraries, tools based on data model
 Applications developed on top of tools
Numerous benefits
Benefits of data model paradigm
 Increases level of abstraction for application development
 Increases capability of applications
 Facilitates interoperation and integration
 Increases productivity of programmers
But …
 Benefits only accrue if model captures application structure
 The more structure captured the bigger the benefit
Important to capture as much structure as possible
Spectrum of mathematical structure captured by various data models
 most noSQL models capture less structure than relational
 the “no” in noSQL should perhaps be “less”
 scientific apps have way more mathematical structure
 relational model isn’t nearly structured enough
scientific apps don’t need no Structured Query Language
need a (much) more Structured Query Language – mo’SQL
Data model/mo’SQL requirements
 must capture common math structure of scientific data
 scalars, vectors, tensors
 topology and geometry
 fields
 algebra and calculus operations
 must describe how math entities are represented/stored
 decomposition into primitive types and operations
 decomposition for parallelism
 must maintain rigorous connection between high level semantics and low level implementation
need a new data model
Sheaf data model
 objects are discrete sheaves over finite distributive lattices
 math details:
http://www.limitpoint.com/images/Publications/The%20Sheaf%20Data%20Model.pdf
 finite distributive lattice
 “part space”
 all distinct composite parts formed from set of basic parts
 discrete sheaf
 describes association of attributes with parts
algebraic description of decomposition of abstract data types into tuples of primitive attributes
Visualizing a finite distributive lattice
 directed acyclic graph
 “Hasse diagram”
 two kinds of nodes
 composite parts
 basic parts
 links represent “covers”
 covers := immediately includes
 A covers B if and only if
 A includes B
 there is no C, distinct from A and B, such that A includes C and C includes B
 draw graph so that if A covers B, B is lower on page
[Figure: Hasse diagram in which composite part A covers basic part B and basic part C]
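The covers relation can be computed from the inclusion relation as in the sketch below. Modeling parts as sets of basic-part names, with inclusion as proper superset, is an assumption made for illustration.

```python
def covers(includes, elements):
    """Return the set of pairs (a, b) such that a covers b.

    includes(a, b) must be a strict partial order: True when a
    includes b and a != b.
    """
    result = set()
    for a in elements:
        for b in elements:
            if not includes(a, b):
                continue
            # a covers b only if nothing sits strictly between them.
            if not any(includes(a, c) and includes(c, b) for c in elements):
                result.add((a, b))
    return result

def strictly_includes(a, b):
    """Proper superset test for parts modeled as frozensets."""
    return a > b

# Hypothetical parts: basic parts B and C, and the composite part made of both.
parts = [frozenset("B"), frozenset("C"), frozenset("BC")]
links = covers(strictly_includes, parts)
print(len(links))
```

The pairs in `links` are exactly the links that would be drawn in the Hasse diagram.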
example
Example: branched well
[Figure: the parts of the branched well (derrick floor (df), upper well, junction, lower well, bore 1, bore 2) shown beside the corresponding Hasse diagram, whose top node is the whole well]
basic parts are independent objects
composite parts are precisely the sum of their basic parts
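The statement that composite parts are sums of basic parts can be sketched by enumerating a part space. The choice of basic parts below is an assumption, since the figure is not reproduced here, and the sketch treats the basic parts as fully independent so that every subset is a distinct part.

```python
from itertools import combinations

# Assumed basic parts of the branched well (illustrative choice only).
basic_parts = ["upper well", "bore 1", "bore 2"]

def part_space(basic):
    """All distinct parts that can be formed as sums of basic parts."""
    return [
        frozenset(combo)
        for size in range(len(basic) + 1)
        for combo in combinations(basic, size)
    ]

parts = part_space(basic_parts)

# Named composite parts are particular members of the part space.
lower_well = frozenset({"bore 1", "bore 2"})
well = frozenset(basic_parts)

print(len(parts))            # number of parts, including the empty part
print(lower_well in parts)   # lower well is a sum of basic parts
print(lower_well < well)     # the well strictly includes the lower well
```

With independent basic parts the part space is a Boolean lattice; inclusion between parts is just the subset relation on their basic parts.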
Sheaf table metaphor
 data base is a set of tables
 each table represents a type
 each row an instance
 each column an attribute
 rows carry client-defined lattice order
 col lattice is row lattice of some other table
 schema are first class objects
unified algebraic framework for all common scientific data types
Unified framework for scientific data types
 tabular types
 contains relational model as limiting case
 row lattice is a boolean lattice
 physical property types
 scalars, vectors, tensors
 object-oriented types with multiple inheritance
 col lattice is subobject inclusion hierarchy
 spatial types (meshes)
 any decomposition of space
 row lattice represents spatial inclusion
 field types
 any property, any mesh, any evaluation method
 col lattice = tensor(mesh row lattice, property col lattice)
rigorous connection between abstract math types and numeric reps
from high level specification to tuples of primitives
Open Source Implementation
 SheafSystem™ Community Edition
 C++ libraries with Java, Python, and C# bindings
[Figure: software stack, listed from top layer to bottom layer]
Field API: field types, pushers, refiners
Geometry API: coordinate sections (invertible sections), point locators
Fiber Bundle Data Model API: spatial types, physical property types, groups, tensors, Jacobians, section types
Sheaf Data Model API: sheaf storage agent
HDF5
www.sheafsystem.org or github
Query language for sheaf data model
 work in progress
 with Prof Magne Haveraaen
 Bergen Language Design Laboratory, University of Bergen
 started with initial guess at operators
 extension of relational operators
 experience with implementation
 formalizing and refining definitions
goal is “mo’SQL”
Acknowledgements
 Mark Verschuren, Shell, provided many useful comments and other input for this presentation
 Original research and development funded by subcontracts B347785, B515090, and B560973 of prime contract W-7405-ENG-48 with the Department of Energy National Nuclear Security Administration (DOE/NNSA)
 Ongoing development has been funded by Shell GameChanger and Shell TaCIT
http://www.limitpoint.com/images/Publications/BigDataInOilAndGas.pdf
References
 [Krebbers] “Big Data & Analytics: Exploiting it”, Johan Krebbers, VP Architecture, Shell. http://cdn.osisoft.com/corp/en/media/presentations/2013/UsersConference2013/PDF/UC2013_Shell_Krebbers_GlobalITArchitecture_1.pdf
 [KrisEnergy] http://www.krisenergy.com/company/aboutoil-and-gas/exploration/
 [epmag 1] http://www.epmag.com/Exploration-GeologyGeophysics/Three-D-Seismic-Advances-Improve-ExplorationSuccess_90469
 [decogeo] http://www.decogeo.com/upload/Image/log1_bigl.jpg
 [epmag 2] http://www.epmag.com/item/DAS-enables-simultaneous-multiwell-VSP_121593
 [slb 1] http://www.slb.com/resources/case_studies/completions/~/media/Images/completions/intelligent/wellwatcher_neon_tp_01tn.jpg
 [slb 2] System of subsurface faults and horizons in the Gullfaks oil field in the Norwegian sector of the North Sea. Data set courtesy of Schlumberger Limited.
 [geosoft] http://blogs.geosoft.com/exploringwithdata/2012/08/3dmodelling-with-velocity-volumes-in-gm-sys.html
 [pdgm 1] http://www.pdgm.com/getmedia/c72b49d9-571b-4fe8ae3f-bfd00f862b0d/Skua-salt2010.jpg.aspx?width=1024&height=650&ext=.jpg
 [slb 3] http://www.software.slb.com/PublishingImages/total-stress.jpg
 [dgi] http://www.dgi.com/images/cvslideshow/fullsize/CoViz4D_Slideshow_003.jpg
 [outernode] http://outernode.pir.sa.gov.au/__data/assets/image/0020/119009/Curnamona_3D.jpg
 [cda] http://www.oilandgasuk.co.uk/cmsfiles/custom/html/report14.png
 [Codd] E. F. Codd. 1970. A relational model of data for large shared data banks. Commun. ACM 13, 6 (June 1970), 377-387. DOI=10.1145/362384.362685 http://doi.acm.org/10.1145/362384.362685