Exposing legacy file-based data
(interop-for-files)
Andrew Woolf
CCLRC Rutherford Appleton Laboratory
AUKEGGS
Canberra, 2006-11-29
Outline
• Introduction
• The feature model as integration key
• An interoperability approach for files
• xlink review and proposed profile for legacy data
• Examples
• Issues
Introduction
• Much ‘earth-science’ data exists as large legacy
file-stores
– e.g. ECMWF: 2 PB of file-based data
– e.g. British Atmospheric Data Centre: 40 TB of file-based data
• Interoperability demands common approaches
• BUT, multitude of formats masks commonality
– netCDF, HDF4, HDF5, GRIB, NASA Ames, PP, ...
Introduction
• File-centred data management focusses on the
container rather than content
• File API is fundamental point of reference
– binary format details not always exposed or
guaranteed
– public API may be only supported access mechanism
– often implemented as performant optimised native
library
• Conclusion: can’t/shouldn’t migrate
Introduction
• Want to expose information, not format...
Introduction
• Information structures may be composed
across files
The feature model
• Common pattern with file-data:
– need to integrate information structures
across multiple files
– (relational tables provide this implicitly)
• Semantics provide an integration key
– e.g. an oceanographer and meteorologist can
share a conversation about data despite
format differences
A model for file-based interoperability
• Retain file-based persistence format
• Supplement with feature-based conceptual
model
• ‘Cast’ legacy data onto conceptual model
– interoperableData = (featureModel) legacyData
• Legacy file data + GML-encoded conceptual
‘metadata’ = ‘interoperable view’
– may be exposed through W*S (OGC web services, e.g. WMS, WFS, WCS)
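The 'cast' idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the approach's actual implementation: the `Feature` class, its `loader` hook, and the `read_legacy_variable` stub are hypothetical names introduced here. The feature object carries the semantics, while the values stay in the legacy store and are read through its API only when asked for.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

calls = 0  # counts how often the (stubbed) legacy file API is hit


def read_legacy_variable() -> List[float]:
    """Hypothetical stand-in for a call into a native file API."""
    global calls
    calls += 1
    return [13.5, 24.9, 32.4, 37.7, 41.5, 46.8, 54.4, 65.7]


@dataclass
class Feature:
    """Conceptual 'skeleton': semantics live here, values stay in the file."""
    name: str
    units: str
    loader: Callable[[], List[float]]  # hypothetical hook onto the legacy data
    _cache: Optional[List[float]] = field(default=None, repr=False)

    @property
    def values(self) -> List[float]:
        # Read from the legacy store on first access only, then reuse.
        if self._cache is None:
            self._cache = self.loader()
        return self._cache


# 'Cast' the legacy variable onto the conceptual model.
longitude = Feature(name="longitude", units="degrees_east",
                    loader=read_legacy_variable)
first = longitude.values
second = longitude.values
print(longitude.name, longitude.units, len(first))
```

The file is never copied or migrated; only the thin conceptual wrapper is new.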
A model for file-based interoperability
• GML provides conceptual feature ‘skeleton’
• File provides ‘flesh’
• GML ‘by-reference’ pattern for property
values
– uses simple xlink
– “The value of a GML property that carries an
xlink:href attribute is the resource returned
by traversing the link”
xlink review
• extended xlink: [role] [title]
– local resource A: [role] [title] [label]
– remote resource B: [href] [role] [title] [label]
– remote resource C: [href] [role] [title] [label]
– local resource D: [role] [title] [label]
– arcs 1, 2, 3 between resources: [arcrole] [title] [show] [actuate]
xlink review
• simple xlink: [role] [title]
– local resource: [role] [title] [label]
– remote resource: [href] [role] [title] [label]
– arc (single, local to remote): [arcrole] [title] [show] [actuate]
xlink review
• ‘role’ (URI):
– indicates a property of the remote resource
– must be a URI reference that “identifies some
resource that describes the intended property”
• ‘arcrole’ (URI):
– describes the “meaning of the arc’s ending
resource relative to its starting resource”
– corresponds to RDF notion of a property
• starting-resource HAS arc-role ending-resource
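The RDF reading of 'arcrole' can be made concrete with a small sketch. The `Triple` type and `arc_to_triple` function are hypothetical helpers introduced here, and the starting-resource identifier `#ID001` is chosen purely for illustration; the point is only that an arc reads as subject, predicate, object.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """An RDF-style statement: subject HAS predicate object."""
    subject: str
    predicate: str
    obj: str


def arc_to_triple(starting_resource: str, arcrole: str,
                  ending_resource: str) -> Triple:
    # The arcrole URI plays the part of the RDF property.
    return Triple(starting_resource, arcrole, ending_resource)


statement = arc_to_triple(
    "#ID001",                                   # illustrative local resource
    "http://ndg.nerc.ac.uk/xlinkUsage/insert",  # arcrole URI
    "myfile.nc#lon",                            # remote resource (href)
)
print(f"{statement.subject} HAS {statement.predicate} {statement.obj}")
```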
xlink patterns for files
• extended xlink from GML feature instance
– aggregation semantics determined by xlink arc traversal rules
xlink patterns for files
• simple xlink from GML feature instance
– aggregation semantics determined by storage descriptor
xlink proposal
<someGMLElement
xlink:arcrole="hasRemoteContentEmbeddedAt#localXpath"
xlink:href="storageDescriptor#portion"
xlink:role="storageSchemaIdentifier"
xlink:show="embed"
xlink:actuate="onRequest | onLoad"/>
• href examples:
– netCDF#variable
– RDBMS#SQLQuery
– GRIBFile#recordNumber
– CSMLStorageDescriptor#arrayID
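A consumer of the proposed attributes might model them roughly as below. This is a sketch under assumptions: the `StorageLink` class and its property names are hypothetical, and the only library call relied on is `urllib.parse.urldefrag`, which splits a URI reference at its `#` fragment. The placeholder attribute values are taken from the proposal as written.

```python
from dataclasses import dataclass
from urllib.parse import urldefrag

ALLOWED_ACTUATE = {"onRequest", "onLoad"}


@dataclass(frozen=True)
class StorageLink:
    """Hypothetical holder for the xlink attributes of the proposed profile."""
    href: str
    role: str
    arcrole: str
    show: str = "embed"
    actuate: str = "onRequest"

    def __post_init__(self):
        # The proposal fixes show="embed" and allows two actuate values.
        if self.show != "embed":
            raise ValueError("profile expects xlink:show='embed'")
        if self.actuate not in ALLOWED_ACTUATE:
            raise ValueError(f"unexpected xlink:actuate: {self.actuate!r}")

    @property
    def descriptor(self) -> str:
        """Part of href before '#': the file or storage descriptor."""
        return urldefrag(self.href).url

    @property
    def portion(self) -> str:
        """Part of href after '#': format-specific portion of the resource."""
        return urldefrag(self.href).fragment

    @property
    def local_xpath(self) -> str:
        """Part of arcrole after '#': where the remote content is embedded."""
        return urldefrag(self.arcrole).fragment


link = StorageLink(
    href="GRIBFile#recordNumber",
    role="storageSchemaIdentifier",
    arcrole="hasRemoteContentEmbeddedAt#localXpath",
)
print(link.descriptor, link.portion, link.local_xpath)
```

Interpreting `portion` is left to a handler selected by `role`, since the fragment syntax is format-specific.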
Example
• GML CR 06-160
– ISO 19123 CV_ReferenceableGrid
<gml:ReferenceableGrid gml:id="ID001" srsName="urn:ogc:def:crs:EPSG:6.6:4326" dimension="2">
  <gml:limits>
    <gml:GridEnvelope>
      <gml:low>0 0</gml:low>
      <gml:high>7 4</gml:high>
    </gml:GridEnvelope>
  </gml:limits>
  <gml:axisLabels>x y</gml:axisLabels>
  <gml:coordTransformTable>
    <gml:GridCoordinatesTable>
      <gml:gridOrdinate>
        <gml:GridOrdinateDescription>
          <gml:coordAxisLabel>Geodetic longitude</gml:coordAxisLabel>
          <gml:coordAxisValues>
            <gml:SpatialOrTemporalPositionList>
              <gml:coordinateList>13.5 24.9 32.4 37.7 41.5 46.8 54.4 65.7</gml:coordinateList>
            </gml:SpatialOrTemporalPositionList>
          </gml:coordAxisValues>
          <gml:gridAxesSpanned>x</gml:gridAxesSpanned>
          <gml:sequenceRule axisOrder="+1">Linear</gml:sequenceRule>
        </gml:GridOrdinateDescription>
      </gml:gridOrdinate>
      <gml:gridOrdinate>
        <gml:GridOrdinateDescription>
          <gml:coordAxisLabel>Geodetic latitude</gml:coordAxisLabel>
          <gml:coordAxisValues>
            <gml:SpatialOrTemporalPositionList>
              <gml:coordinateList>
                53.1 48.7 46.2 44.7 43.9 43.3 43.1 44.0
                46.2 43.2 41.5 40.6 40.2 40.0 40.3 41.7
                37.1 36.1 35.6 35.5 35.7 36.0 37.1 39.5
                30.4 30.2 30.4 30.7 31.1 32.0 33.8 37.2
                24.3 24.8 25.3 26.0 26.6 27.7 29.7 33.4
              </gml:coordinateList>
            </gml:SpatialOrTemporalPositionList>
          </gml:coordAxisValues>
          <gml:gridAxesSpanned>x y</gml:gridAxesSpanned>
          <gml:sequenceRule axisOrder="+1 -2">Linear</gml:sequenceRule>
        </gml:GridOrdinateDescription>
      </gml:gridOrdinate>
    </gml:GridCoordinatesTable>
  </gml:coordTransformTable>
</gml:ReferenceableGrid>
Example
• netCDF ASCII dump:
netcdf myfile {
dimensions:
    x = 8 ;
    y = 5 ;
variables:
    float lon(x) ;
        lon:long_name = "longitude" ;
        lon:units = "degrees_east" ;
    float lat(x,y) ;
        lat:long_name = "latitude" ;
        lat:units = "degrees_north" ;
    float temp(x,y) ;
        temp:coordinates = "lon lat" ;
        temp:long_name = "temperature" ;
        temp:units = "degC" ;
data:
    lon = 13.5, 24.9, 32.4, 37.7, 41.5, 46.8, 54.4, 65.7 ;
    lat = 53.1, 48.7, 46.2, 44.7, 43.9, 43.3, 43.1, 44.0, 46.2, 43.2, 41.5, ...
Example
<gml:gridOrdinate>
  <gml:GridOrdinateDescription>
    <gml:coordAxisLabel>Geodetic longitude</gml:coordAxisLabel>
    <gml:coordAxisValues>
      <gml:SpatialOrTemporalPositionList>
        <gml:coordinateList srsName="WGS84">13.5 24.9 32.4 37.7 41.5 46.8 54.4 65.7</gml:coordinateList>
      </gml:SpatialOrTemporalPositionList>
    </gml:coordAxisValues>
    <gml:gridAxesSpanned>x</gml:gridAxesSpanned>
    <gml:sequenceRule axisOrder="+1">Linear</gml:sequenceRule>
  </gml:GridOrdinateDescription>
</gml:gridOrdinate>
<gml:coordAxisValues
    xlink:arcrole="http://ndg.nerc.ac.uk/xlinkUsage/insert#SpatialOrTemporalPositionList/coordinateList"
    xlink:href="myfile.nc#lon"
    xlink:role="http://ndg.nerc.ac.uk/fileFormat/netcdf"
    xlink:show="embed">
  <gml:SpatialOrTemporalPositionList>
    <gml:coordinateList srsName="WGS84"/>
  </gml:SpatialOrTemporalPositionList>
</gml:coordAxisValues>
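How a client might resolve the by-reference form back into the inline form can be sketched with the Python standard library. Several things here are assumptions rather than part of the proposal: `FAKE_STORE` stands in for a real netCDF reader, `resolve_embedded` is a hypothetical helper, the namespace URIs are added so the fragment parses on its own, and each step of the arcrole path is assumed to be in the GML namespace.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

GML = "http://www.opengis.net/gml"
XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical stand-in for a netCDF reader: file name -> variable -> values.
FAKE_STORE = {
    "myfile.nc": {"lon": [13.5, 24.9, 32.4, 37.7, 41.5, 46.8, 54.4, 65.7]},
}

DOC = f"""
<gml:coordAxisValues xmlns:gml="{GML}" xmlns:xlink="{XLINK}"
    xlink:arcrole="http://ndg.nerc.ac.uk/xlinkUsage/insert#SpatialOrTemporalPositionList/coordinateList"
    xlink:href="myfile.nc#lon"
    xlink:role="http://ndg.nerc.ac.uk/fileFormat/netcdf"
    xlink:show="embed">
  <gml:SpatialOrTemporalPositionList>
    <gml:coordinateList srsName="WGS84"/>
  </gml:SpatialOrTemporalPositionList>
</gml:coordAxisValues>
"""


def resolve_embedded(elem, store):
    """Fetch the linked portion and insert it at the arcrole's relative path."""
    resource, portion = urldefrag(elem.get(f"{{{XLINK}}}href"))
    _, local_path = urldefrag(elem.get(f"{{{XLINK}}}arcrole"))
    # Assumption: every step of the relative path is a GML element.
    ns_path = "/".join(f"{{{GML}}}{step}" for step in local_path.split("/"))
    target = elem.find(ns_path)
    if target is None:
        raise LookupError(f"no element at {local_path!r}")
    target.text = " ".join(str(v) for v in store[resource][portion])
    return elem


root = ET.fromstring(DOC)
resolve_embedded(root, FAKE_STORE)
coord_list = root.find(
    f"{{{GML}}}SpatialOrTemporalPositionList/{{{GML}}}coordinateList")
print(coord_list.text)
```

The `srsName` attribute already present on the skeleton element is left untouched, which is the 'merge' behaviour the profile relies on.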
Issues
• Need to ‘get as close as possible’ to target
– ‘merge’ semantics consistent with GML?
(Opportunity: no best practice for GML yet!)
• “If both a link and content are present in an
instance of a property element, then the object
found by traversing the xlink:href link shall be the
normative value of the property. The object
included as content shall be used by the data
recipient only if the remote instance cannot be
resolved; this may be considered to be a "cached"
version of the object.” [GML 7.2.3.4]
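The precedence rule quoted above can be written down as a short sketch. The `property_value` function, the `UnresolvableLink` exception, and the dictionary-backed resolver are hypothetical names used only to illustrate the rule: the linked object is normative, and inline content is a fallback 'cache'.

```python
from typing import Callable, Optional


class UnresolvableLink(Exception):
    """Raised by a resolver when the remote instance cannot be fetched."""


def property_value(href: Optional[str], inline: Optional[str],
                   resolve: Callable[[str], str]) -> Optional[str]:
    """Apply the GML rule: prefer the xlink target, fall back to content."""
    if href is not None:
        try:
            return resolve(href)  # normative value
        except UnresolvableLink:
            return inline         # "cached" version
    return inline


REMOTE = {"myfile.nc#lon": "13.5 24.9 32.4"}  # illustrative remote content


def resolver(href: str) -> str:
    try:
        return REMOTE[href]
    except KeyError:
        raise UnresolvableLink(href)


resolved = property_value("myfile.nc#lon", "stale copy", resolver)
fallback = property_value("missing.nc#lon", "stale copy", resolver)
inline_only = property_value(None, "inline value", resolver)
print(resolved, "|", fallback, "|", inline_only)
```

Under this rule inline content is replaced rather than merged, which is why the 'merge' question in the first bullet matters for the proposed profile.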
Issues
• xlink:href (URI) for remote resource fragment (format-specific)
– e.g. RDBMS#SQLQuery, netCDF#variable, etc...
• xlink:role (URI) for resource format
– e.g. reference PRONOM-type format repository?
• implied conversion to GML target content type
• xlink:arcrole (URI) for ‘embed remote content’ semantics
– ‘insert at relative XPath’ essential
• simple xlink can’t handle multiple resources
– application-specific ‘storage descriptor’ schemas for file
aggregation semantics
Conclusion
• Presented a profile for xlink with files in absence
of current best practice
• Meets key practical requirements
– retain file-based persistence formats
– provide interoperability ‘wrapper’
– focus on logical content, not container (feature model)
• Semantic governance at appropriate points
• Enables powerful, scalable mechanism for real
data
– e.g. large meteorological datasets