Toward a semantic web of paleoclimatology - USC

Article
Volume 14, Number 2
28 February 2013
doi:10.1002/ggge.20067
ISSN: 1525-2027
Toward a semantic web of paleoclimatology
Julien Emile-Geay
Department of Earth Sciences, University of Southern California, Los Angeles, California, USA
([email protected])
Jason A. Eshleman
IO Informatics, Berkeley, California, USA
[1] The paleoclimate record is information-rich, yet significant technical barriers currently exist before this
record can be used to answer scientific questions. Here we make the case for a universal format to structure
paleoclimate data. A simple example demonstrates the scientific utility of such a self-describing way of
organizing coral data and meta-data. This example is generalized to a universal ontology that may form the
backbone of an open-source, open-access, and crowd-sourced paleoclimate database. The format is designed
to enable semantic searches, and is expected to accelerate discovery on topical scientific problems like climate
extremes, the characteristics of natural climate variability, and climate sensitivity to various forcings.
Components: 8,100 words, 1 table, 4 figures.
Keywords: paleoclimatology; database; data-mining; machine-learning; semantic; archive.
Index Terms: 0473 Biogeosciences: Paleoclimatology and paleoceanography (3344, 4900); 3344 Atmospheric
Processes: Paleoclimatology (0473, 4900); 1912 Informatics: Data management, preservation, rescue; 1914
Informatics: Data mining.
Received 6 August 2012; Revised 10 January 2013; Accepted 10 January 2013; Published 28 February 2013.
Emile-Geay J., and J. A. Eshleman (2013), Toward a semantic web of paleoclimatology, Goechem. Geophys. Geosyst.,
14, 457–469, doi:10.1002/ggge.20067.
1. Introduction
[2] Science is based on building on, reusing and
openly criticising the published body of scientific knowledge. For science to effectively
function, and for society to reap the full benefits from scientific endeavours, it is crucial that
science data be made open.
Panton Principles [Murray-Rust et al., 2010]
[3] In the past decade or so, the paleoclimate community has embraced these principles to a remarkable
extent, so much so that a vast number of proxy records
from all parts of the world are now publicly available
through online databases like the International Tree
©2013. American Geophysical Union. All Rights Reserved.
Ring Data Bank [ITRDB, 2009], the National Climatic Data Center (NCDC) (http://www.ncdc.noaa.
gov/paleo/data.html) or PANGAEA (http://www.
pangaea.de/). The benefits of making this information
freely available are immense. In principle, it may:
(1) Facilitate the reproduction of published results,
a core principle of modern science,
(2) Enable the exploration of all available data, and
ease comparisons between all relevant proxy
records through meta-searches,
(3) Serve as the basis for multiproxy reconstructions of past climates, and
(4) Widen the of citizen participation to individuals
outside the relatively small community of climate
science specialists.
457
Geochemistry
Geophysics
Geosystems
3
G
EMILE-GEAY AND ESHLEMAN: PALEOCLIMATE SEMANTIC WEB
[4] Yet, as currently archived, this information does
not allow to answer societally relevant questions like:
“How anomalous is 20th century climate in the context
of the past 2000 years?” or “What is the relationship
between climate forcing and observed climate variations in the paleoclimate record?” To be sure, no simple
answer is expected to either question; nonetheless, a
persistent and seldom-recognized barrier to the optimal
use of paleoclimate information is that the diversity of
proxies, measurements, and chronological information
requires a flexible format that maintains sufficient
metadata. So far, no universal format has been established to do so; the synthesis of climate proxy information is therefore labor-intensive, unnecessarily tedious,
and increases the probability of introducing human
errors. Furthermore, such obstacles are common to
other geoscientific fields; it was estimated, for instance
that NASA scientists spend up to 60% of their time preparing data for comparison with model output—an
enormous waste of their precious time [NASA, 2012].
Such considerations have led the National Science
Foundation to launch the EarthCube initiative (http://
www.nsf.gov/geo/earthcube/) in June 2011, seeking
“transformative concepts and approaches to create
integrated data management infrastructures across
the geosciences.”
[5] Many obstacles to the access to, and analysis of,
paleoclimate data would vanish if data and metadata
were archived in a machine-readable format, as has
been demonstrated in other scientific domains (e.g.,
BioDash [Neumann and Quan, 2006], QuakeML
[Schorlemmer et al., 2004], GenBank [Benson et al.,
2011], and MAGE-ML [Spellman et al., 2002]).
Microarray experimental metadata and parameters
can be searched and complete data sets rapidly retrieved from two centralized best-practice repositories,
Gene Expression Omnibus [Edgar et al., 2002] and
ArrayExpress [Parkinson et al., 2007]. This ability
to rapidly construct novel data sets with all appropriate
data and metadata has enabled data-mining and
machine-learning algorithms to be applied to a much
wider data set than ever possible before. Such work,
to give but a few examples, has allowed for the discovery and validation of robust biomarkers for organ transplant rejection [Chen et al., 2010], characterize gene
function [Pierre et al., 2010], and detect patterns of
gene coexpression [Adler et al., 2009]. Similar breakthroughs would be possible in climate science if online
archives conformed to relatively simple principles.
[6] The key to those recent success stories has
unquestionably been the concept of interoperability
(the ability for different systems to readily exchange
data); what has enabled it is semantic technology.
In this article, we discuss the benefits of designing a
10.1002/ggge.20067
universal, semantically based format for the archival
of paleoclimate data. To further motivate this technical
operation and illustrate its scientific merit, we start by
describing a currently existing format used in recent
publications [Emile-Geay et al., 2013a, 2013b],
demonstrating how it can enable rapid information
extraction (section 2). We then propose semantic
extensions of this format (section 3), discussing the
pros and cons of the approach and outlining an
achievable path toward the stated objectives. We
finish by discussing the climate science repercussions
of this potentially disruptive technology in section 4.
2. The Power of Well-Organized Data
2.1. Database Format
[7] The University of Southern California climate
dynamics group recently compiled all publicly available coral records and archived them into a format
suitable for the Matlab programming language. The
format organizes data as a Matlab structure with
fields describing the proxy class (e.g., class:
’CORAL’), the type of measurement (e.g., d18O,
Sr/Ca , extension rate, fluorescence), includes the
raw chronology (chron) and raw data (data) as
vectors, as well as metadata such as latitude (lat, a
scalar), longitude (lon, a scalar), the site name
(site, a character string), a quick bibliographic reference (reference, a string), and a cite key to a
bibliographical software like EndNote or BibTeX.
For a concrete example, consider the Sr/Ca record
from Palmyra Island from Nurhati et al. [2011]. The
corresponding database entry reads:
[8] Accompanying the aforementioned fields are
others that are calculated from them and enable a
rapid parsing of the database. These include the initial year present in the chronology (year_i), the
final year (year_f), and the average resolution in
months (raw_res). Such quantities will soon prove
458
Geochemistry
Geophysics
Geosystems
3
G
10.1002/ggge.20067
EMILE-GEAY AND ESHLEMAN: PALEOCLIMATE SEMANTIC WEB
useful to select records on the basis of their time
coverage or resolution. In addition to these fields,
one may want to process the raw proxy information
to obtain more meaningful quantities. For instance,
the data may have been collected at uneven time intervals, so it would need to be interpolated onto a regular
time grid for certain types of analysis (e.g., spectral
analysis, empirical orthogonal function (EOF) analysis) to be performed. In addition, a monthly-resolved
record such as this one exhibits a prominent seasonal
cycle that one may want to remove prior to analysis.
This can be done very simply by adding the
corresponding data vectors chron_reg and
data_reg, the new resolution res (in this case,
res = raw_res = 1) and a vector of monthly
anomalies anom).
[9] This strategy had the advantage of enabling
a “journaled” approach to data processing; it
makes clear which transformations were applied
to each proxy record and enables users to go back
to the raw data and apply their own processing
steps if so desired. Let us now see what scientific
information may be derived from this form of
organization.
2.2. Analysis
[10] The first thing one might want to do is visualize
the spatiotemporal coverage of the database. This
may be done in a few lines of code, and is
illustrated in Figure 1.
a) Proxy representation by age (67 records)
40oN
30oN
20oN
10oN
0o
o
10 S
18
O
20oS
Sr/Ca
30oS
other
40oS
1200
1300
1400
1500
1600
1700
1800
1900
Most ancient age resolved
b) Proxy availability over time
# proxies
60
18
O
Sr/Ca
other
40
20
0
1500
1550
1600
1650
1700
1750
1800
1850
1900
1950
2000
Time
Figure 1. An information-rich representation of the coral database introduced herein. (a) Location of each record,
stratified by proxy types d18O (circles), Sr/Ca (squares), “other”, (triangles, covering fluorescence and extension rate)
coded by shape and most ancient age resolved. (b) Proxy availability over time, against stratified by proxy type.
459
Geochemistry
Geophysics
Geosystems
3
G
EMILE-GEAY AND ESHLEMAN: PALEOCLIMATE SEMANTIC WEB
[11] Next, one may want to inspect the data them-
selves. It is easy to annualize all records (using
DJF averages for those subannually-resolved
records) and assemble them all into a single matrix,
which is plotted in Figure 2 to obtain a synoptic
view of the data set. The well-documented tendency
for negative d18O and Sr/Ca values in the late 20th
century (probably due to large-scale warming and/
or reorganizations of the hydrological cycle, see
Gagan et al. [2000]) is then immediately apparent.
[12] Finally, one may want to perform a more
revealing analysis of climate variability as portrayed by these records. The extensive meta-data
and the structured array syntax allow the user to
easily subset records according to several criteria,
e.g., finding records containing d18O measurements
only, with a quarterly resolution or finer, at least
100 years in length and overlapping. This leaves
25 records out of the original 67 (Table 1), covering
the period 1893 to 1984. These records can be readily annualized (taking December–January–February
averages for monthly records, cold season averages
otherwise), assembled into a matrix and an EOF
analysis [Lorenz, 1956] may be readily performed.
[13] The first mode of this EOF analysis is pre-
sented in Figure 3. As expected, the spatial pattern
of sea-surface temperature associated with this
10.1002/ggge.20067
mode bears a strong resemblance to El Niño-Southern
Oscillation (ENSO), as confirmed by its temporal
expression (PC1), which displays maxima and
minima coincident with known ENSO events. The
multitaper spectrum of this principal component
reveals a relatively strong annual component and
dominant interannual variability, consistent with
what is independently known about ENSO [e.g.,
Sarachik and Cane, 2010]. The relative area covered by each circle illustrates the magnitude of the
EOF coefficients (“loadings”), while the color
refers to their sign (after multiplying d18O values
by –1 so that negative excursions in oxygen isotopes
correspond to positive temperature anomalies). Their
signs are consistent with the ENSO thermal signal
expected in each isotopic record, though the disparate amplitudes reflect the varying influence of other
factors (climatic or not).
[14] That a pan-tropical network of subannually-
resolved coral records may capture ENSO variability
has been amply demonstrated before [Evans et al.,
2000, 2002]. It is merely confirmed by the present
analysis, with the difference that this information
was extracted in less than 200 lines of code (including visualization) and can be easily modified for
different data selection criteria, which amount to a
mere logical indexing of the Matlab structured array.
Figure 2. Synopsis of the coral matrix. Colors represent Z-scores with records sorted in alphabetical order.
460
Geochemistry
Geophysics
Geosystems
3
G
10.1002/ggge.20067
EMILE-GEAY AND ESHLEMAN: PALEOCLIMATE SEMANTIC WEB
Table 1. List of Coral Records Used for the EOF Analysis of Figure 3. The Records Were Obtained From the Full
Database (Illustrated in Figure 1) After Applying the Following Conditions: d18O Only, Quarterly Resolution or Finer,
at Least 100 Years Long, and Overlapping. Time Resolution is Expressed in Months. The Quantity r Represents the
Linear Correlation to Local Temperature Derived From the HadSST2 Data Set [Rayner et al., 2006]. Correlations are
Assessed Using a Nonparametric Monte Carlo Test Against 1000 Isopersistent Red Noise Time Series; Significant
Correlations are Denoted in Bold
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
Site
Type
Resolution
Latitude
Longitude
Range
r
Reference
Abrolhos
Amedée Light.
Bunaken
Clipperton Atoll
Guam
Ifaty
La Réunion
Laing
Lombok Strait
Madang
Mafia
Mahe
Maiana Atoll
Mentawai
Moorea
Ningaloo Reef
Palmyra
Rabaul
Rarotonga(2R)
Rarotonga(3R)
Ras Umm Sidd
Secas
d18O
d18O
d18O
d18O
d18O
d18O
d18O
d18O
d18O
d18O
d18O
d18O
d18O
d18O
d18O
d18O
d18O
d18O
d18O
d18O
d18O
d18O
1
3
1
1
1
2
2
3
1
3
2
1
2
1
2
2
1
1
1
1
2
1
28 S
22 S
2 N
10 N
13 N
23 S
21 S
4 S
8 S
5 S
8 S
5 S
1 N
0 N
18 S
22 S
6 N
4 S
21 S
21 S
28 N
8 N
114 E
166 E
125 E
109 W
145 E
44 E
55 E
145 E
116 E
146 E
40 E
55 E
173 E
99 E
150 W
114 E
162 W
152 E
160 E
160 E
34 E
82 W
1794–1994
1660–1993
1860–1990
1893–1994
1790–2000
1659–1995
1832–1995
1884–1993
1782–1990
1880–1993
1622–1722
1846–1995
1840–1994
1858–1997
1852–1990
1878–1995
1886–1998
1867–1997
1726–1996
1874–2000
1751–1995
1707–1984
–0.27
–0.24
–0.21
–0.27
–0.32
–0.11
–0.07
–0.37
–0.13
–0.22
–0.20
–0.32
–0.46
–0.37
–0.06
–0.24
–0.62
–0.23
–0.18
–0.20
–0.24
–0.11
Kuhnert et al. [1999]
Quinn et al. [1998]
Charles et al. [2003]
Linsley et al. [2000]
Asami et al. [2005]
Zinke et al. [2004]
Pfeiffer et al. [2004]
Tudhope et al. [2001]
Charles et al. [2003]
Tudhope et al. [2001]
Damassa et al. [2006]
Charles et al. [1997]
Urban et al. [2000]
Abram et al. [2008]
Boiseau et al. [1998]
Kuhnert et al. [2000]
Cobb et al. [2001]
Quinn et al. [2006]
Linsley et al. [2008]
Linsley et al. [2006]
Felis et al. [2000]
Linsley et al. [1994]
The main reason for the conciseness of this code is
the consistent format associated with each record,
which enable very easy manipulations of the data
set. In the next section, we describe how simple
extensions of this principle of data organization
may enable much richer scientific questions to be
asked of paleoclimate proxy networks.
3. The Paleoclimate Database of the
Future
[15] Given the imperfections of each proxy type, the
value of inferring climate information from multiple
proxy classes is no longer in doubt [e.g., Mann,
2002; Li et al., 2010; Wahl et al., 2010], and indeed
forms the basis of surface temperature reconstructions
of the past millennium and beyond [Jones et al., 2009;
NRC, 2006], which have wide implications in the public sphere. The recent community focus embodied by
the PAGES 2 K project (http://www.pages-igbp.org/
workinggroups/2k-network) illustrates the widespread
recognition of the importance of pooling together all
the proxy information available over the Common
Era to enable new scientific discoveries to be made.
[16] Three leading questions motivated the PAGES
2 K data collation initiative:
(1) What did the main patterns and modes of
climate variability on subdecadal to orbital
time scales look and operate like?
(2) How does climate variability and extreme
events relate to the important primary forcing
factors, namely orbital, solar, and volcanic?
(3) What feedbacks operated to modulate the
climate response?
[17] In an ideal world, the first two questions could
be answered relatively easily by applying machinelearning and signal-processing algorithms to an internally consistent database gathering all paleoclimate
records contributed to date. Our world, as it stands,
is far from ideal, but relatively simple steps, which
we describe next, should improve upon this situation.
3.1. Attributes of an Ideal Paleoclimate
Data Format
[18] Our own experience and discussions with col-
leagues suggest that an ideal data format would
conform to the following attributes:
1. Parsability: the format should be self-describing
(hence machine-readable), and would therefore
enable a semantic web [Berners-Lee et al., 2001]
of paleoclimate information.
461
Geochemistry
Geophysics
Geosystems
3
G
10.1002/ggge.20067
EMILE-GEAY AND ESHLEMAN: PALEOCLIMATE SEMANTIC WEB
Coral
18
O network, 22 sites, EOF 1 - 18% variance
24oN
16oN
o
8 N
o
0
o
8 S
o
16 S
24oS
EOF > 0, size magnitude
EOF < 0, size magnitude
-1
-0.5
0
0.5
Contours: SST ( C) regressed onto PC 1
Principal component timeseries
1
PC 1 MTM spectrum
0.15
Frequency
0.05
0
-0.05
Power
PC 1 (unitless)
0.1
-0.1
100
50
20
10
6
4
2
1
10-2
-0.15
10-4
-0.2
1900
1910
1920
1930
1940
1950
1960
1970
1980
Time
10-2
10-1
100
Frequency (cpy)
Figure 3. First EOF mode of the network of 25 d18O records presented in Table 1. (top) EOF coefficients (blue < 0,
red > 0) overlain on a map of HadSST2 DJF temperature regressed onto the first principal component. d18O values
were multiplied by –1 so that positive excursions correspond to warming temperature; (bottom left) time series of
the first principal component; (bottom right) multitaper spectrum [Thomson, 1982] of the first principal component;
note that the y-axis is the product of the power by the frequency so that the relative area under the spectrum is preserved in this logarithmic scale. Numbers refers to the period of oscillation in years
2. Universality: the format should be platformindependent (readable on all computer and operating systems), and language-independent (readable
in major programming languages like Fortran,
C, MATLAB, R, Python, IDL, and NCL/PyNGL).
It should enable remote data access (e.g., OPeNDAP, http://opendap.org/) to ensure that users
rely on the most up-to-date information prior to
conducting analysis.
3. Extensibility: the format should require a minimum set of fields to appropriately define a
paleoclimate record, but allow for the database
to grow organically as more records are added,
or—equally important—as more metadata are
added to existing records.
4. Citability: a legitimate concern amongst scientists who generate paleoclimate observations is
that their work be given due recognition in subsequent analyses and syntheses. Currently, the
NCDC relies on an honor system, but one could
imagine easier and more stringent tools. This
could include, as in the database described in
section 2, electronic references to publications
(such a digital object identifiers, EndNote or
BibTeX cite keys) embedded in the data format.
This would allow the automatic citation of
peer-reviewed articles as well as data citations
[Killeen, 2012] whenever a data record is being
used for analysis. For instance, Table 1 was
created automatically in Matlab (using LaTeX
conventions) to make due recognition of scientific work an automatic part of paleoclimate data
analysis. This principle should be foundational
to a universal paleoclimate data format.
5. Ergonomy: The format should be easy to use,
update and manage.
[19] We submit that semantic technology provides
the best way of realizing this vision. In the next
section, we discuss semantics and ontologies as
relevant to paleoclimate knowledge representation.
462
Geochemistry
Geophysics
Geosystems
3
G
EMILE-GEAY AND ESHLEMAN: PALEOCLIMATE SEMANTIC WEB
3.2. Semantic Technology for
Paleoclimatology
3.2.1. What is the Semantic Web?
[20] Anyone familiar with computers knows that they
scarcely understand human language. A Google
search for instance, parses the World Wide Web for
instances of certain keywords, and returns links to
documents containing these keywords, ranked by
some measure of relevance. The current Web links
together strings of characters, without any notion of
what they mean. In contrast, semantic technology
[e.g., Berners-Lee et al., 2001] aims to assign meaning
to its constituents; it defines various entities (e.g., a
place, a person, a document) and describes the relationships between them (e.g., Albert Einstein is a
person who wrote articles about physics while living
in Switzerland, Germany, and California). It does so
by utilizing three-component statements or “triples”:
a subject of the statement, a predicate expressing some
property of the statement, and an object related to the
subject via this predicate. For instance, in the expression “Albert Einstein is a person”, the subject of the
statement is “Albert Einstein”, the object “person”,
and the predicate relating the two is “is a.” Objects
of one statement can be the subject of another statement and predicates may themselves be the subjects
of other statements, thus creating a potentially limitless graph of information. In this simplicity lies its
main strength: elementary relationships are easy to
define, but the potentially infinite size of the graph
means that complex reasoning may be automatically
carried out as an emergent property of the great many
triples. Semantic technology underlies the recent
success of computers to process queries in human
language, such as IBM’s Watson (http://www-03.
ibm.com/innovation/us/watson/science-behind_watson.
shtml) or Apple’s Siri. They also underlie many recent innovations in knowledge management (http://
www.huffingtonpost.com/steve-hamby/semantic-webtechnology_b_1228883.html). The key to this success
is knowledge representation (KR), an emerging field
of computer science central to the Semantic Web. KR
studies how to formalize relationships so that the
information can make sense to machines. At the core
of knowledge representation is an ontology. Ontologies define relationships between entities and provide
a core vocabulary for organizing information within
a domain. A common vocabulary provides for easier
interoperability between data modeled using the ontology. The Dublin Core Metadata Initiative (http://
dublincore.org/) provides a vocabulary sufficient
to generally describe most published documents;
domain-specific ontologies, like the Semantic Web
10.1002/ggge.20067
for Earth and Environmental Technology (SWEET,
http://sweet.jpl.nasa.gov/) describe classes of entities, relationships between classes and relationship
more specific to modeling Earth Science data.
Subclasses inherit the properties from parent classes;
membership in a subclass implies membership in the
parent class. Adherence to an ontology also allows
for formal reasoning whereby properties not explicitly described can be inferred from known properties.
As a simple example from the SWEET ontology, the
Greenhouse Effect is a subclass of the general
phenomena of Global Warming. By simple inference,
any paper discussing the Greenhouse Effect is
discussing the broader concept Global Warming.
The latter property of such a paper need not be
made explicit to be true.
[21] Knowledge representation is no magic wand.
But while it cannot solve any scientific question per
se, it can make many problems far easier. Just as
the same integer can be expressed in Roman or
Hindu-Arabic representation, the latter makes calculations much easier than the former. Similarly,
expressing data via structured vocabularies may help
solve many issues related to access, manipulation
and interpretation. Simply put, by shifting the burden
of description onto the data themselves, one can build
leaner code to read, write and manipulate it.
[22] In the Earth Sciences, knowledge representation
has helped solve a number of real-life problems:
setting up virtual observatories [Fox et al., 2009],
tracking volcanic effects through several subdisciplines of the geosciences [Fox et al., 2007] or greatly
accelerating the analysis of stream metabolism
for watershed ecosystem management [Gil et al.,
2011]. EarthCube is keenly aware of the potential
of KR to solve many data management issues,
with an entire working group devoted to ontologies
(http://earthcube.ning.com/group/semantics-andontologies). One should note that several ontologies
already exist in the Earth Sciences (e.g., SWEET,
GeoSCI-ML, http://www.geosciml.org/), and some
are specific to EarthCube [Berg-Cross et al., 2012].
Some initial efforts have also been made to encode climate data via ontologies (http://iridl.ldeo.columbia.
edu/ontologies/). Whichever format is adopted by
paleoclimatologists should be made compatible with
those, perhaps via the use of ontology design patterns
[Janowicz and Hitzler, 2012a].
3.2.2. Semantic Pros and Cons
[23] When engaging colleagues about semantic tech-
nology in paleoclimate, we are always asked this
pertinent question: “what can a semantic framework
463
Geochemistry
Geophysics
Geosystems
3
G
EMILE-GEAY AND ESHLEMAN: PALEOCLIMATE SEMANTIC WEB
achieve that a standard database cannot?” The most
honest answer is probably “nothing” [Janowicz and
Hitzler, 2012b], but issues of efficiency quickly come
to the fore: one can go anywhere on a bicycle, but it is
often too slow or impractical to do so. Similarly,
while much of EarthCube could conceptually be built
with existing tools, the fact that it has not is a testimony to the difficulty of the enterprise; we believe
that a semantic framework would lower those
barriers. For the sake of discussion, we shall limit
ourselves to knowledge representation using the
resource description framework (RDF), although it
is worth noting that much of the same would be true
of non-RDF implementations as well.
[24] A graph-data format for representing informa-
tion, RDF is in fact the backbone of a semantic
web. Bergman [2009] provides a comprehensive
overview of semantic technology in RDF, listing
no fewer than 60 advantages over competing
approaches. RDF is readily expressed (“serialized”)
in many formats, including—but not limited to—
eXtensible Markup Language (XML) text files.
Open-source database backend “triple stores” also
exist that make it possible to store RDF in a manner
that facilitates distributed querying. When made
available as SPARQL endpoints, the data within
can be queried across the internet. While much of
this is also true of storing data in a relational database, storing data in a triple store does not require
a priori schema, unlike relational databases. To
extend a data set requires only adding additional triples. Perhaps most importantly, dramatically fewer
limits, if any at all, are placed on the extensibility
of a data set when compared to a relational database
where the initial choice of schema can limit the
types of information that can be recorded and the
types of queries that can be performed. This is a
crucial advantage: science is a dynamic enterprise,
and as it progresses, terminology, notation and understanding inevitably change. It is therefore critical that a paleoclimate database be allowed to grow
organically. To quote Bergman [2009]: “This very
adaptability is what enables RDF to be viewed as
data-driven design. We can deal with a partial
and incomplete world; we can learn as we go; we
can start small and simple and evolve to more
understanding and structure; and we can preserve
all structure and investments we have previously
made.” This flexibility is particularly attractive
as large organizations like NSF contemplate adopting a format that may commit them for several
decades.
[25] Equally importantly, because RDF data come
with their own “dictionary” in the form of an
10.1002/ggge.20067
ontology, querying a triple store can be done without prior knowledge of the structure of the data.
Distributed querying via SPARQL is aimed at
breaking down barriers between the segmented data
“silos” where RDBM reigns supreme. RDF is in
fact the basis for the Linked Open Data framework
[Bizer et al., 2009], a set of best practices for storing and exchanging data on the Web. Publishing
paleoclimate data online as Linked Open Data
means that they will be instantly accessible to a
wide community, helping to valorize the data by
encouraging re-use, and will enable machine learning algorithms to find common patterns between
paleoclimate data and other sources (e.g., weather
station data, climate model output). Finally, there
are already libraries for creating, writing, storing,
searching, and analyzing RDF databases available in open-source languages like R and Python
(http://code.google.com/p/rdflib/). Some open-source
triple stores already offer formal reasoning and inferencing capabilities.
[26] What are the disadvantages of a semantic
representation? An obvious one is that it is more
complex than existing tools like, say, netCDF
(http://www.unidata.ucar.edu/software/netcdf/). More
complexity means a steeper learning curve and,
unless anything is done, a slower adoption. However, although netCDF is ideally suited to gridded
data characteristic of numerical model output, it is
in fact very ill-suited to the storage of very heterogeneous data sets composed of multiple proxies from
multiple locations at multiples time resolutions, all
hallmarks of paleoclimate networks. To some
extent, then, this complexity is warranted by the
heterogeneity of the paleoclimate data themselves.
Furthermore, upward of 100 “RDFizers” already
exist, allowing users to turn an Excel spreadsheet
into an RDF representation with minimal programming knowledge. Along those lines, we should note
that there are also tools to translate between RDF
and the more traditional RDB format (http://www.
w3.org/2001/sw/rdb2rdf/), so adopting RDF is anything but a dead-end. It also makes it easy to build
Web or desktop applications that assist paleoclimate
data contributors in properly formatting the data and
metadata prior to uploading it to a central repository,
or to their own webpage.
[27] Another common concern is the memory over-
head of storing data in RDF. In this nascent age of
Big Data, paleoclimatologists paradoxically find
themselves more plagued by the scarcity, rather
than the excess, of observations. Storage space is
therefore not a primary concern. Nonetheless, one
would obviously want as efficient a storage solution
464
Geochemistry
Geophysics
Geosystems
3
G
EMILE-GEAY AND ESHLEMAN: PALEOCLIMATE SEMANTIC WEB
as possible. The—at times—more verbose triples
may present a storage overhead, although this is
mitigated by well-formed RDF that adheres to a
robust ontology and binary serializations of the data
(e.g., HDT (http://www.w3.org/Submission/2011/
SUBM-HDT-20110330/) or Sesame (http://rivulidevelopment.com/2011/11/binary-rdf-in-sesame/)).
[28] There are some practical limitations to utilizing
RDF as a means of modeling and storing data.
While possible to model as such, RDF is not the
most efficient means of storing some tabular data
or multidimensional arrays. In such cases, it may
be preferable to employ a hybrid approach, separating out metadata—which is efficiently stored as
triples—and the source data themselves, maintained in their original format or stored in a
relational database. Nonetheless, modeling data as
RDF allows for flexible integration from multiple
sources without the same level of curation necessary to store data in a relational system. Once
modeled, it is possible to model some or all of the
graph as relational tables [Ramanujam et al.,
2009]. Presently, query performance on graph data
in triple stores may underwhelm some users, especially when compared to a relational database. A
direct comparison of query performance between
data modeled as RDF and data stored in a relational
database is difficult to quantify, because performance is a product of the database software and
the relational model employed (http://www.w3.org/
wiki/RdfStoreBenchmarking). For many queries,
10.1002/ggge.20067
however, relational models presently outperform
triples. The flexible semantic model makes more
queries possible, but it is more difficult to optimize.
Additionally, while the triple construction of data
semantics was proposed almost four decades ago
[Abrial, 1974] dedicated triple stores do not have
the maturity of relational database management
software and tools.
3.3. Building a Paleoclimate Web of Data
[29] We now consider practical steps to building a
paleoclimate Web of data. The standard should be
widely adopted, which requires collective discussion and community consensus. This paper serves
as a blueprint for the elaboration of a format of
optimal use to the paleoclimate community.
[30] In practice, two elements are necessary to a
semantic database. The first is an ontology, which
describes the objects and the relationships between
them. As an illustration, a basic ontology describing relationships between the data and metadata
fields of the climate record described in section
2.1 is given in Figure 4; there is no question that
it will have to be extended based on community
input. The second element needed is a way to store
this information. As described above, RDF is
optimal for metadata and ontologies, whereas the
data themselves may be stored in a universal columnar format like comma-operated values, or in RDF
as well (with some storage overhead).
Figure 4. Preliminary ontology describing relationships between the data and meta-data fields of the Nurhati et al.
[2011] climate record. Several fields are viewed as instances of larger classes (ProxyClass, Site, Reference), which
would allow computers to perform operations on all records within a specific class (e.g., if the measurement type is
d18O , or if the proxy class is “Tree Ring Width”, or if the resolution is less than 3 months, etc.). All records in such
a database would be bound to each other by similar links, allowing machines to automatically process any form of
query involving existing information. Such a design would also allow growth, by adding records and/or additional
information about each record.
465
Geochemistry
Geophysics
Geosystems
3
G
EMILE-GEAY AND ESHLEMAN: PALEOCLIMATE SEMANTIC WEB
[31] A specificity of paleoclimate observations is
that they almost always consist of time series. Their
semantic representation should therefore build this
into the initial design, perhaps using dedicated
time series representation tools like SDMX-RDF
(http://publishing-statistical-data.googlecode.com/svn/
trunk/specs/src/main/html/index.html). This would
allow the reuse of a considerable body of R code
dealing with SDMX-formatted data (http://opensdmxdevelopers.wikispaces.com/RSDMX).
[32] An important problem to address is to ensure
that scientists who take the time to upload their data
to online repositories (a) conform to this new data
standard and (b) do not face significant additional
time burden for doing so. This can be done most
effectively by providing them with spreadsheet or
Web-form templates, which does not require any
programming expertise and will automatically
organize the data into a form that can be converted
to the desired one. The complexity of the data network in the backend should be largely or wholly
invisible to the scientist sharing data. Once uploaded,
however, the full extent of this data network complexity, including any raw and processed data as
well as metadata, should be easily ported directly
into analysis tools in a manner similar to that
employed by the BioConductor package [Gentleman
et al., 2004], where entire microarray data sets can
be imported directly from ArrayExpress or Gene
Expression Omnibus into R for analysis.
[33] Another important consideration is scalability:
it should be easy to nucleate the database using
existing efforts (cf. section 2) or a similar initiative
in R (vpIR, T. Ault, pers. comm., 2012). It should
also be easy to grow given a backbone. This leads
to the definition of a minimal information standard
(part of the ontology) necessary to define a paleoclimate record. At a minimum, our own experience
suggests that the record must include: a chronology
(time series of dates, one date per sample), one
measurement time series (one proxy observation
per sample), latitude, longitude, site name, and bibliographic reference. Of course, considerably more
information could be added, and indeed, should
be. For instance, age model uncertainties in nonannual proxies are best assessed if the actual radiometric dates are available, not just the one age
model selected for analysis and publication. One
may therefore need to define a standard format for
geoscientific chronologies as part of the ontology,
incorporating raw measurements so that the chronology can be updated in the light of different calibrations (e.g., updates to the “radiocarbon curve”)
as was done (manually) in Anchukaitis and Tierney
10.1002/ggge.20067
[2012]. Designing a universal representation of
chronologies used in paleoclimatology would yield
additional dividends, such as enabling a more rigorous comparison between records (e.g., a U-series
dated cave deposit and a varve-counted lake record)
and disseminating statistical best practices via
open-source code.
[34] Additionally, one should be able to add as
much metadata as is thought relevant. One could
think of adding information about experimental
protocol, proxy interpretation, uncertainties, instrument precision, sample number, reproducibility, as
well as tracking various processing steps applied
to the raw data (e.g., calibration information, isotopic corrections, standardization, detrending) as
illustrated in section 2. This method of defining a
minimal information standard has proved successful with microarray data, where minimal information about microarray experiments [Brazma et al.,
2001] specifies those core elements for recording
data and its provenance, experimental protocols
necessary, and versioning. How much or how little
metadata to include for each proxy class will
require community discussion; several workshops
are planned over the next few years as part of the
EarthCube roadmap.
[35] Semantic technology has the power to greatly
simplify the input, organization, intelligibility, and
therefore usability of paleoclimate data. It is worth
noting, however, that its adoption will take time,
and that other structured formats—like the one described in section 2—may provide useful stepping
stones along that path.
4. Discussion
[36] This article has proposed a new, self-describing
format to store paleoclimate data. It is intended as a
blueprint to spark discussion among the community,
and to hopefully lead to the implementation and deployment of one such technology in the near future.
[37] The large-scale adoption of such a system
would profoundly affect the way we do science. As
demonstrated in section 2.2, a rich data format
means that more information can be extracted with
less effort from the user. The semantic framework
we propose here would enable proxy records to be
viewed as instances of certain classes of objects, thus
enabling object-oriented programming to be applied
to paleoclimatology. A universal, open-source data
format would quickly enable software libraries to
be built around those objects, and would apply, by
design, to all objects within a class. Making this
466
Geochemistry
Geophysics
Geosystems
3
G
EMILE-GEAY AND ESHLEMAN: PALEOCLIMATE SEMANTIC WEB
code widely available (perhaps on the same Web
servers as the data themselves) would enable paleoclimate scientists to easily apply sophisticated data
analysis tools to their own data before sharing it,
even without much programming knowledge.
[38] In a nutshell, a semantic web for paleoclimatol-
ogy would allow the power of automation to enter
the study of past climates. This would first change
the way paleoclimate records are compared to each
other, a task done routinely in paleoclimate studies.
Currently, each new record is typically stacked
against existing ones and a comparison is made.
The choice of records used in this comparison is
partially dictated by scientific considerations (e.g.,
gathering proxies thought to be sensitive to a particular climate phenomenon), and partially dictated by
ease of access. A universal data format would ensure that all the information relevant to a comparison would be gathered using objective criteria,
and using the most appropriate methods (e.g., formally taking time uncertainties into account). Furthermore, a semantic web would radically change
the way paleoclimate information is used to evaluate general circulation models, an endeavor at the
heart of initiatives like the Paleoclimate Modeling
Intercomparison Project (http://pmip3.lsce.ipsl.fr/).
It would also transform the integration of paleoclimate records with instrumental variables (i.e., climate reconstruction), again allowing records to be
included in a reconstruction only if they fit certain
objective criteria (e.g., “give me all annuallyresolved records located within 30 degrees of the
equator”). Coupled to process-based models of
proxy behavior, it would eventually enable the
assimilation of paleoclimate data into climate
model trajectories. All these tasks presently require
sophisticated tools and a daunting amount of
human resources, chiefly provided by volunteers.
These tasks would be considerably eased by agreeing upon a common data standard, whose overhead
cost would be negligible once the submission of
proxy records to online databases is designed to
structure the data to conform to these standards
(via web forms or spreadsheet templates). Indeed,
since a majority of paleoclimate scientist already
send their data to the NCDC or PANGAEA, there
would be little additional work involved for them
once the submission process is standardized.
[39] On one hand, this semantic web is expected to
bring many benefits: increased relevance of paleoclimate data to future climate projections, increased
transparency, increased accessibility, increased reproducibility, and increased educational opportunities.
On the other hand, one may easily imagine a scenario
10.1002/ggge.20067
whereby inadequate algorithms will be thoughtlessly
applied to crunch through the database and come up
with absurd answers to irrelevant questions. Obviously, automation will never be a substitute for
knowledge, and expert judgment will always be critical to the interpretation of any result derived from
paleoclimate records. The solution to such problems
is more collaboration between all members of the
paleoclimate community; a common language to
describe such data can only help this objective. Furthermore, malicious or ignorant use of climate data
is already performed routinely by climate change
deniers, and a paleoclimate semantic web would do
little to change this.
[40] The evolution we propose parallels that of
other data-driven fields like bioinformatics, whose
recent history has shown both the power and limits
of automation. A semantic web of paleoclimatology
would create much of the same opportunities, and
likely much of the same pitfalls. Yet, in this field
as in others, the advantages seem to far outweigh
the disadvantages: a machine-readable database of
paleoclimate proxies means that paleoclimatologists will be able to put their data in a larger, more
useful context, spend less time accessing data, and
more time thinking about the scientific implications
of their results. Who would not want that?
Acknowledgments.
[41] J.E.G. acknowledges Tiffany Tsai and Nasim Mirnateghi
for help with assembling the coral database. This work was
partially supported by grant NA10OAR4310115 from the
National Oceanographic and Atmospheric Administration. Data
and code necessary to generate the figures of this article are
available at http://climdyn.usc.edu/Publications_files/EmileGeay%26Eshleman_SI.zip.
References
Abram, N. J., M. K. Gagan, J. E. Cole, W. S. Hantoro, and
M. Mudelsee (2008), Recent intensification of tropical climate
variability in the Indian Ocean, Nat. Geosci., 1, 849–853,
doi:10.1038/ngeo357.
Abrial, J. (1974), Data semantics, in Data Base Management Proceedings IFIP Working Conference on Data Base
Management, edited by J. W. Klimbie and K. L. Koffeman,
pp. 1–59, North-Holland Publishing Company, Cargèse,
Corsica, France.
Adler, P., R. Kolde, M. Kull, A. Tkachenko, H. Peterson,
J. Reimand, and J. Vilo (2009), Mining for coexpression
across hundreds of datasets using novel rank aggregation
and visualization methods, Genome Biol., 10(12), R139
doi:10.1186/gb-2009-10-12-r139.
Anchukaitis, K. J., and J. E. Tierney (2012), Identifying
coherent spatiotemporal modes in time-uncertain proxy
467
Geochemistry
Geophysics
Geosystems
3
G
EMILE-GEAY AND ESHLEMAN: PALEOCLIMATE SEMANTIC WEB
paleoclimate records, Clim. Dyn., 1–16, doi:10.1007/s00382012-1483-0.
Asami, R., T. Yamada, Y. Iryu, T. M. Quinn, C. P. Meyer, and G.
Pauley (2005), Interannual and decadal variability of the
western Pacific sea surface condition for the years 1787-2000:
Reconstruction based on stable isotope record from a Guam
coral, J. Geophys. Res., 110, doi:10.1029/2004JC002555.
Benson, D. A., I. Karsch-Mizrachi, D. J. Lipman, J. Ostell,
and E. W. Sayers (2011), GenBank, Nucleic Acids Res., 39
(Database issue), D32–37, doi:10.1093/nar/gkq1079.
Berg-Cross, G., et al. (2012), Semantics and ontologies for
earthcube, in Proceedings of the Workshop on GIScience in
the Big Data age.
Bergman, M. (2009), Advantages and Myths of RDF, http://
www.mkbergman.com/483/advantages-and-myths-of-rdf/
Berners-Lee, T., J. Hendler, and O. Lassila (2001), The semantic
web: a new form of Web content that is meaningful to computers
will unleash a revolution of new possibilities, Sci. Am., 29–37.
Bizer, C., T. Heath, and T. Berners-Lee (2009), Linked data - the
story so far, Inter. J. Semantic Web Inform. Syst., 5(3), 1–22,
doi:10.4018/jswis.2009081901.
Boiseau, M., A. Juillet-Leclerc, P. Yiou, B. Salvat, P. Isdale,
and M. Guillaume (1998), Atmospheric and oceanic evidences of El Niño-Southern Oscillation events in the south
central Pacific Ocean from coral stable isotopic records over
the last 137 years, Paleoceanogr., 13, 671–685, doi:10.1029/
98PA02502.
Brazma, A., et al. (2001), Minimum information about a microarray experiment (MIAME)-toward standards for microarray
data, Nat. Genet., 29(4), 365–371 doi:10.1038/ng1201-365.
Charles, C. D., D. E. Hunter, and R. G. Fairbanks (1997),
Interaction Between the ENSO and the Asian Monsoon in a
Coral Record of Tropical Climate, Science, 277(5328),
925–928, doi:10.1126/science.277.5328.925.
Charles, C. D., K. Cobb, M. D. Moore, and R. G. Fairbanks
(2003), Monsoon-tropical ocean interaction in a network of
coral records spanning the 20th century, Mar. Geol., 201,
207–222, doi:10.1016/S0025-3227(03)00217-2.
Chen, R., et al. (2010), Differentially expressed RNA from
public microarray data identifies serum protein biomarkers
for cross-organ transplant rejection and other conditions,
PLoS Comput. Biol., 6(9), doi:10.1371/journal.pcbi.1000940.
Cobb, K. M., C. D. Charles, and D. E. Hunter (2001), A central
tropical Pacific coral demonstrates Pacific, Indian, and
Atlantic decadal climate connections, Geophys. Res. Lett.,
28, 2209–2212, doi:10.1029/2001GL012919.
Damassa, T. D., J. E. Cole, H. R. Barnett, T. R. Ault, and
T. R. McClanahan (2006), Enhanced multidecadal climate
variability in the seventeenth century from coral isotope
records in the western Indian Ocean, Paleoceanogr., 21,
PA2016, doi:10.1029/2005PA001217.
Edgar, R., M. Domrachev, and A. E. Lash (2002), Gene
Expression Omnibus: NCBI gene expression and hybridization array data repository, Nucleic Acids Res., 30(1), 207–210.
Emile-Geay, J., K. Cobb, M. Mann, and A. T. Wittenberg
(2013), Estimating Central Equatorial Pacific SST variability
over the Past Millennium. Part 1: Methodology and Validation, J. Clim., in press doi:10.1175/JCLI-D-11-00510.1.
Emile-Geay, J., K. Cobb, M. Mann, and A. T. Wittenberg
(2013), Estimating Central Equatorial Pacific SST variability
over the Past Millennium. Part 2: Reconstructions and Implications, J. Clim., in press doi:10.1175/JCLI-D-11-00511.1.
Evans, M. N., A. Kaplan, and M. A. Cane (2000), Intercomparison of coral oxygen isotope data and historical sea surface
temperature (SST): Potential for coral-based SST field reconstructions, Paleoceanogr., 15, 551–562.
10.1002/ggge.20067
Evans, M. N., A. Kaplan, and M. A. Cane (2002), Pacific sea
surface temperature field reconstruction from coral d18O
data using reduced space objective analysis, Paleoceanogr.,
17, 7–1, doi:10.1029/2000PA000590.
Felis, T., J. Pätzold, Y. Loya, M. Fine, A. H. Nawar, and G.
Wefer (2000), A coral oxygen isotope record from the
northern Red Sea documenting NAO, ENSO, and North
Pacific teleconnections on Middle East climate variability
since the year 1750, Paleoceanogr., 15, 679–694, doi:10.1029/
1999PA000477.
Fox, P., D. McGuinness, R. Raskin, and K. Sinha (2007), A
volcano erupts: semantically mediated integration of heterogeneous volcanic and atmospheric data, in Proceedings of
the ACM First Workshop on CyberInfrastructure: Information Management in eScience, CIMS ’07, pp. 1–6, ACM,
New York, NY, USA, doi:10.1145/1317353.1317355.
Fox, P., D. L. McGuinness, L. Cinquini, P. West, J. Garcia, J.
L. Benedict, and D. Middleton (2009), Ontology-supported
scientific data frameworks: The virtual solar-terrestrial observatory experience, Comput. Geosci., 35(4), 724–738,
doi:10.1016/j.cageo.2007.12.019.
Gagan, M. K., L. K. Ayliffe, J. W. Beck, J. E. Cole, E. R. M.
Druffel, R. B. Dunbar, and D. P. Schrag (2000), New views
of tropical paleoclimates from corals, Q. Sci. Rev., 19(1-5),
45–64, doi:10.1016/S0277-3791(99)00054-2.
Gentleman, R. C., et al. (2004), Bioconductor: open software
development for computational biology and bioinformatics,
Genome Biol., 5(10), R80, doi:10.1186/gb-2004-5-10-r80.
Gil, Y., P. Szekely, S. Villamizar, T. Harmon, V. Ratnakar,
S. Gupta, M. Muslea, F. Silva, and C. Knoblock (2011), Mind
your metadata: Exploiting semantics for configuration, adaptation, and provenance in scientific workflows, in Proceedings
of the Tenth International Semantic Web Conference (ISWC),
Bonn, Germany, Springer: Verlag, 395.
ITRDB (2009), National Climatic Data Center, World Data
Center for Paleoclimatology, Tree Ring Database.
Janowicz, K., and P. Hitzler (2012a), The digital earth as
knowledge engine, Semant. Web, 3(3), 213–221,
doi:10.3233/SW-2012-0070.
Janowicz, K., and P. Hitzler (2012b), Key ingredients for your
next semantics elevator talk, in Advances in Conceptual
Modeling, Lecture Notes in Computer Science, vol. 7518,
edited by S. Castano, P. Vassiliadis, L. Lakshmanan, and
M. Lee, pp. 213–220, Springer, Berlin Heidelberg,
doi:10.1007/978-3-642-33999-8_27.
Jones, P., et al. (2009), High-resolution palaeoclimatology of the
last millennium: a review of current status and future prospects,
Holocene, 19(1), 3–49, doi:10.1177/0959683608098952.
Killeen, T. (2012), Data citation in the geosciences, http://
www.nsf.gov/pubs/2012/nsf12058/nsf12058.jsp
Kuhnert, H., J. Pätzold, B. Hatcher, K. H. Wyrwoll, A. Eisenhauer,
L. B. Collins, Z. R. Zhu, and G. Wefer (1999), A 200-year
coral stable oxygen isotope record from a high-latitude reef
off Western Australia, Coral Reefs, 18, 1–12, doi:10.1007/
s003380050147.
Kuhnert, H., J. Pätzold, K.-H. Wyrwoll, and G. Wefer (2000),
Monitoring climate variability over the past 116 years in
coral oxygen isotopes from Ningaloo Reef, Western
Australia, Inter. J. Earth Sci., 88, 725–732, doi:10.1007/
s005310050300.
Li, B., D. W. Nychka, and C. M. Ammann (2010), The value of
multi-proxy reconstruction of past climate, J. Amer. Statist.
Assoc., 105, 883–911, doi:10.1198/jasa.2010.ap09379.
Linsley, B. K., R. B. Dunbar, G. M. Wellington, and
D. A. Mucciarone (1994), A coral-based reconstruction
of intertropical convergence zone variability over Central
468
Geochemistry
Geophysics
Geosystems
3
G
EMILE-GEAY AND ESHLEMAN: PALEOCLIMATE SEMANTIC WEB
America since 1707, J. Geophys. Res., 99, 9977–9994,
doi:10.1029/94JC00360.
Linsley, B. K., L. Ren, R. B. Dunbar, and S. S. Howe (2000),
El Niño Southern Oscillation (ENSO) and decadal-scale
climate variability at 10 N in the eastern Pacific from 1893
to 1994: A coral-based reconstruction from Clipperton Atoll,
Paleoceanogr., 15, 322–335, doi:10.1029/1999PA000428.
Linsley, B. K., A. Kaplan, Y. Gouriou, J. Salinger,
P. B. de Menocal, G. M. Wellington, and S. S. Howe
(2006), Tracking the extent of the South Pacific Convergence
Zone since the early 1600 s, Geochem. Geophys. Geosyst., 7,
5003–+, doi:10.1029/2005GC001115.
Linsley, B. K., P. Zhang, A. Kaplan, S. S. Howe, and
G. M. Wellington (2008), Interdecadal-decadal climate
variability from multicoral oxygen isotope records in the South
Pacific Convergence Zone region since 1650 A.D., Paleoceanogr.,
23(26), A262,219+, doi:10.1029/2007PA001539.
Lorenz, E. N. (1956), Empirical orthogonal functions and
statistical weather prediction, Scientific Report 1, Statistical
Forecasting Project. 110268, Massachusetts Institute of
Technology Defense Doc. Center.
Mann, M. E. (2002), Climate reconstruction: The value of
multiple proxies, Science, 297(5586), 1481–1482, doi:10.1126/
science.1074318.
Murray-Rust, P., C. Neylon, R. Pollock, and J. Wilbanks
(2010), Panton principles, principles for open data in science.
NASA (2012), A.40 computational modeling algorithms and
cyberinfrastructure, Tech. rep., National Aeronautics and
Space Administration (NASA).
Neumann, E. K., and D. Quan (2006), BioDash: a semantic
web dashboard for drug development, Pac. Symp. Biocomput., 176–187, PMID: 17094238.
Committee on Surface Temperature Reconstructions for the Last
2, 000 Years, N. R. C. (2006), Surface Temperature Reconstructions for the Last 2000 Years, The National Academies
Press, Washington, D.C., 160.
Nurhati, I. S., K. M. Cobb, and E. Di Lorenzo (2011), Decadal-scale
SST and salinity variations in the central tropical pacific: Signatures of Natural and Anthropogenic Climate change, J. Clim.,
24(13), 3294–3308, doi:10.1175/2011JCLI3852.1.
Parkinson, H., et al. (2007), ArrayExpress--a public database of
microarray experiments and gene expression profiles,
Nucleic Acids Res., 35(Database issue), D747–750,
doi:10.1093/nar/gkl995.
Pfeiffer, M., O. Timm, W.-C. Dullo, and S. Podlech (2004),
Oceanic forcing of interannual and multidecadal climate variability in the southwestern Indian Ocean: Evidence from a
160 year coral isotopic record (La Réunion, 55 E, 21 S),
Paleoceanogr., 19, 4006–+, doi:10.1029/2003PA000964.
Pierre, M., B. DeHertogh, A. Gaigneaux, B. DeMeulder,
F. Berger, E. Bareke, C. Michiels, and E. Depiereux (2010),
Meta-analysis of archived DNA microarrays identifies genes
10.1002/ggge.20067
regulated by hypoxia and involved in a metastatic phenotype
in cancer cells, BMC Cancer, 10, 176, doi:10.1186/14712407-10-176.
Quinn, T. M., T. J. Crowley, F. W. Taylor, C. Henin, P. Joannot,
and Y. Join (1998), A multicentury stable isotope record
from a New Caledonia coral: Interannual and decadal sea
surface temperature variability in the southwest Pacific since
1657 A.D., Paleoceanogr., 13, 412–426, doi:10.1029/
98PA00401.
Quinn, T. M., F. W. Taylor, and T. J. Crowley (2006), Coral-based
climate variability in the Western Pacific Warm Pool since
1867, J. Geophys. Res. (Oceans), 111, 11,006–+, doi:10.1029/
2005JC003243.
Ramanujam, S., A. Gupta, L. Khan, S. Seida, and B. Thuraisngham
(2009), R2d: A bridge between the semantic weband rerlational
visualization tools, in IEEE International Conference on
Semantic Computing ICSC ’09.
Rayner, N., P. Brohan, D. Parker, C. Folland, J. Kennedy,
M. Vanicek, T. Ansell, and S. Tett (2006), Improved analyses
of changes and uncertainties in sea surface temperature measured in situ since the mid-nineteenth century: the HadSST2
data set, J. Clim., 19(3), 446–469, doi:10.1175/JCLI3637.1.
Sarachik, E. S., and M. A. Cane (2010), The El Niño-Southern
Oscillation Phenomenon, Cambridge University Press, Cambridge,
UK. 336.
Schorlemmer, D., A. Wyss, S. Maraini, S. Wiemer, and
M. Baer (2004), QuakeML - An XML schema for seismology,
Tech. rep., Swiss Seismological Service.
Spellman, P. T., et al. (2002), Design and implementation of
microarray gene expression markup language (MAGE-ML),
Genome Biol., 3(9), RESEARCH0046, doi:10.1186/
gb-2002-3-9-research0046.
Thomson, D. J. (1982), Spectrum estimation and harmonic
analysis, Proc. IEEE, 70(9), 1055–1096.
Tudhope, A. W., C. P. Chilcott, M. T. McCulloch, E. R. Cook,
J. Chappell, R. M. Ellam, D. W. Lea, J. M. Lough, and G. B.
Shimmield (2001), Variability in the El Niño-Southern
oscillation through a glacial-interglacial cycle, Science, 291,
1511–1517, doi:10.1126/science.1057969.
Urban, F. E., J. E. Cole, and J. T. Overpeck (2000), Influence
of mean climate change on climate variability from a 155-year
tropical Pacific coral record, Nature, 407, 989–993.
Wahl, E. R., D. M. Anderson, B. A. Bauer, R. Buckner,
E. P. Gille, W. S. Gross, M. Hartman, and A. Shah (2010),
An archive of high-resolution temperature reconstructions
over the past 2+ millennia, Geochem. Geophys. Geosyst.,
11(1), doi:10.1029/2009GC002817.
Zinke, J., W.-C. Dullo, G. A. Heiss, and A. Eisenhauer (2004),
ENSO and Indian Ocean subtropical dipole variability is
recorded in a coral record off southwest Madagascar for
the period 1659 to 1995, Earth Planet. Sci. Lett., 228,
177–194, doi:10.1016/j.epsl.2004.09.028.
469