Introduction to Marine Metadata

SeaDataNet Training Course
Introduction to Marine Metadata
Roy Lowry
British Oceanographic Data Centre
Jargon Warning
• The nature of the material Geoff and I will be
presenting inevitably involves many words that
are very familiar to us, but not to you
• One approach would be for us to define
everything, but that could take all day
• Please
 Don’t feel you have a problem if you don’t know what
we’re talking about – it’s our role to help you
understand
 Consider our presentations interactive and feel free
to ask any question at any time
Overview
• Metadata definition
• Metadata function
• Metadata classification
• Metadata interoperability
• Vocabulary management
• Metadata standards and crosswalks
• Ontologies
• Metadata horror stories
Metadata Definition
• What is metadata?
 Information about data
 Includes everything except the numbers
themselves
 “42” is data, but means nothing
 “42 is the abundance of Calanus
finmarchicus per litre at a location 56N 4E,
between depths of 10m and 20m at 00:30 on
01/02/1990” means a lot more
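As a rough sketch only (the field names are invented and belong to no particular standard), the same observation could be held as a structured record in which everything except the value is metadata:

```python
# Everything except "value" is metadata; field names are illustrative only.
observation = {
    "value": 42,                                   # the data itself
    "parameter": "Abundance of Calanus finmarchicus",
    "units": "count per litre",
    "latitude": 56.0,                              # degrees north
    "longitude": 4.0,                              # degrees east
    "depth_range_m": (10, 20),
    "time": "1990-02-01T00:30:00Z",
}

print(observation["value"])   # "42" on its own means nothing...
print(observation)            # ...the surrounding metadata gives it meaning
```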
Metadata Function
• Why is metadata needed?
 To provide information to allow data
to be discovered
 To provide information on whether
data should be used for a given
purpose
 To provide information on how to use
data
Metadata Classification
• EU INSPIRE draft metadata rules
follow this approach, classifying
metadata into:
 Discovery metadata
 Evaluation metadata
 Use metadata
Discovery Metadata
• Discovery metadata is information posted to
allow datasets to be located by search engines
• Bare minimum is 5-dimensional co-ordinate
coverage
 3 spatial (x,y,z) as numeric ranges or keywords
 1 temporal (time) as numeric ranges or keywords
 1 other (parameter space) as keywords
• May be enriched by keywords covering
aspects such as instrument, platform, project,
activity
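A minimal sketch of such a discovery record, assuming invented field names rather than any particular catalogue schema:

```python
# Bare-minimum 5-dimensional co-ordinate coverage plus optional enrichment
# keywords. Field names and values are illustrative only.
discovery_record = {
    "spatial": {                          # 3 spatial dimensions (x, y, z)
        "longitude_range": (-5.0, 10.0),
        "latitude_range": (50.0, 60.0),
        "depth_range_m": (0.0, 200.0),
    },
    "temporal": {                         # 1 temporal dimension
        "start": "1990-01-01",
        "end": "1990-12-31",
    },
    "parameters": [                       # 1 parameter-space dimension, as keywords
        "zooplankton abundance",
        "chlorophyll concentration",
    ],
    "instrument": ["plankton net"],       # enrichment keywords
    "platform": ["research vessel"],
    "project": ["example project"],
}
```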
Evaluation Metadata
• Evaluation metadata is information that
allows a potential user to ascertain
whether a discovered dataset is fit for
purpose
• Covers issues like resolution, precision,
accuracy, methodology, provenance,
data quality, access restrictions
• Often includes a plain-text abstract to
provide scientific context
Use Metadata
• Use metadata is information required to
make use of the data in a tool or
application
 Access protocols (technical and political)
 5-dimensional co-ordinate coverage (see
discovery) plus units of measure
 Properties of co-ordinate coverage known
as dataset ‘shape’ or feature type (e.g. point
time series, profile, spatial grid)
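A comparable sketch of use metadata for a single point time series; the feature-type label and field names are invented for illustration:

```python
# Use metadata adds units of measure and the dataset 'shape' (feature type)
# to the co-ordinate coverage. Structure is illustrative only.
use_record = {
    "feature_type": "point time series",
    "position": {"longitude": 4.0, "latitude": 56.0, "depth_m": 15.0},
    "time_axis": {
        "start": "1990-02-01T00:00:00+00:00",
        "interval_minutes": 30,
        "n_points": 48,
    },
    "variables": [
        {"name": "sea temperature", "units": "degrees Celsius"},
        {"name": "salinity", "units": "dimensionless"},
    ],
    "access": {"protocol": "ftp", "licence": "restricted"},  # technical and 'political'
}
```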
Metadata Classification
• Those still awake will have noticed that coordinate coverage represents a significant
overlap between discovery and use
• Controversy rages about whether ‘discovery’
coverage and ‘use’ coverage should be the
same
• My current view is:
 They are different with significantly more detail
required for use
 Systems should be able to convert from use to
discovery, but not the other way round
 Search engines should be able to drill down from
discovery metadata into use metadata to satisfy the
evaluation use case
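A minimal sketch of that one-way conversion, assuming the illustrative use record above; the reverse direction is impossible because the detail has already been discarded:

```python
from datetime import datetime, timedelta

def use_to_discovery(use_record):
    """Derive coarse discovery coverage from detailed use metadata (sketch only)."""
    pos = use_record["position"]
    axis = use_record["time_axis"]
    start = datetime.fromisoformat(axis["start"])
    end = start + timedelta(minutes=axis["interval_minutes"] * (axis["n_points"] - 1))
    return {
        "spatial": {
            "longitude_range": (pos["longitude"], pos["longitude"]),
            "latitude_range": (pos["latitude"], pos["latitude"]),
            "depth_range_m": (pos["depth_m"], pos["depth_m"]),
        },
        "temporal": {"start": start.isoformat(), "end": end.isoformat()},
        "parameters": [v["name"] for v in use_record["variables"]],  # search keywords
    }
```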
Metadata Interoperability
• Interoperability is the ability to share
data from multiple sources as a
common resource from a single tool
• Interoperability has four levels (Bishr,
1998)
 System – protocols, hardware and operating
systems
 Syntactic/Structural – loading data into a
common tool (reading each other's files)
 Semantic – understanding of terms used in
the data by both humans and machines
• Only two levels (syntactic/semantic)
need worry us as data managers
Metadata Interoperability
• The easiest way to achieve any kind of
interoperability is by maintaining
uniformity across distributed systems
• Nice idea, but this is the real world and
many different people have had many
different reasons (some valid, others
not) why they should do it ‘their way’
• So we have to face the reality of
heterogeneous legacy metadata
repositories
Metadata Interoperability
• Most marine metadata greybeards agree that if
they had known 20 years ago what they know now,
we wouldn’t have the problem of a
heterogeneous legacy
• Anyone in this day and age with the blank
canvas of a new system who decides to
ignore standards and design their own
metadata structures deserves to be damned to spend
eternity making legacy systems interoperate
• I’ve re-invented wheels in the past and am
currently on my way to eternity!
Metadata Interoperability
• Metadata standards support
interoperability by specifying
 The fields to be included in a
metadata document
 The way in which those fields are
populated
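A toy sketch of what that means mechanically; the required fields and permitted terms below are invented and do not come from DIF, ISO 19115 or any other real specification:

```python
# A toy 'standard': which fields must appear and how they may be populated.
REQUIRED_FIELDS = {"title", "abstract", "bounding_box", "parameters"}
CONTROLLED_VALUES = {"parameters": {"temperature", "salinity", "nitrate"}}

def validate(document):
    """Return a list of problems found in a metadata document (a dict)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - document.keys()]
    for field, allowed in CONTROLLED_VALUES.items():
        for value in document.get(field, []):
            if value not in allowed:
                problems.append(f"{field}: '{value}' is not a permitted term")
    return problems

print(validate({"title": "Example", "parameters": ["temperature", "chl"]}))
```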
Vocabulary Management
• A controlled vocabulary contains the
terms that may be used to populate a
metadata field
• A good controlled vocabulary
comprises:
 Terms
 Term definitions
 Keys (semantically neutral strings that may
be used to represent the term to computers)
 Term abbreviations
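Sketched as a data structure (illustrative only, not the SeaDataNet layout; the example values are invented):

```python
from dataclasses import dataclass

@dataclass
class VocabEntry:
    """One controlled-vocabulary entry."""
    key: str            # semantically neutral string used by computers
    term: str           # the human-readable term
    definition: str     # what the term means
    abbreviation: str   # short form for constrained displays

entry = VocabEntry(
    key="TERM0042",
    term="Abundance of Calanus finmarchicus in the water body",
    definition="Number of individuals of Calanus finmarchicus per unit volume of water.",
    abbreviation="CalFinAbund",
)
```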
Vocabulary Management
• A good controlled vocabulary possesses:
 Content governance
 A mechanism for the management of vocabulary
entries that:
– Makes decisions about new entries
– Makes decisions about changes to existing
entries
 Technical governance
 A mechanism to
– Control changes dictated by content governance
including
» Versioning
» Audit trails to allow recreation of previous
versions
– Distribute the most up-to-date vocabulary version
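A toy sketch of the technical-governance side: every change is logged so any previous version can be recreated. The storage model is invented, not how SeaDataNet actually does it:

```python
from datetime import datetime, timezone

class VersionedVocabulary:
    """Toy vocabulary store with versioning and an audit trail."""

    def __init__(self):
        self.terms = {}     # key -> current term text
        self.audit = []     # chronological record of every change
        self.version = 0

    def apply(self, key, term):
        """Add or change an entry; the change is time-stamped and versioned."""
        self.version += 1
        self.audit.append({
            "version": self.version,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "key": key,
            "old": self.terms.get(key),
            "new": term,
        })
        self.terms[key] = term

    def as_of(self, version):
        """Recreate the vocabulary as it stood at an earlier version."""
        snapshot = {}
        for change in self.audit:
            if change["version"] > version:
                break
            snapshot[change["key"]] = change["new"]
        return snapshot
```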
Vocabulary Management
• In SEA-SEARCH and EDIOS we had:
 Content governance
 Ad-hoc decisions made by individuals,
including vacation students (non-specialist
undergraduates), on the spur of the moment
 No rules of engagement for permitted
changes
 Technical governance
 CSV files on one or more FTP sites updated
haphazardly with no formal time stamping
and no version labelling
Vocabulary Management
• In SeaDataNet we now have:
 Content governance
 SeaVoX, a moderated e-mail list under the joint
auspices of SeaDataNet and IOC MarineXML to make
decisions concerning vocabulary change
 Technical governance
 Vocabularies held in an Oracle back-end that
automatically documents change, including
timestamps, versioning and previous version
preservation
 A web service API (plus client for those who need it)
maintained by BODC on behalf of SeaDataNet and
the UK NERC DataGrid providing live access to the
latest version of the BODC Oracle database
Metadata Standards
• Metadata specifications, primarily targeted at ‘discovery’
metadata, relevant to ‘evaluation’ metadata but of little
relevance to ‘use’ metadata
• DIF - Set up by NASA’s Global Change Master Directory
(GCMD) primarily to document satellite datasets
• FGDC - Mandatory US Government dataset description
• ISO19115/19139 - A metadata content standard (19115)
now developed into an XML schema (19139) targeted at
describing GIS datasets, but much more useful
• INSPIRE Metadata Rules - A European draft standard for
geospatial data, largely based on ISO19115, destined to
become the European answer to FGDC
Metadata Standards
• Continuing…..
• EDMED - Standard developed by BODC for EU
MAST programme to describe datasets and
subsequently developed by SEA-SEARCH
• Cruise Summary Report - IOC standard
description for research cruises and
associated datasets
• EDIOS - Standard developed by EuroGOOS
with EU funding to describe datasets of
repeated measurements
• My view is that this list is far too long…….
Metadata Standards
• All these standards do pretty much the same
thing
• It would seem a very good idea to provide
‘crosswalks’ – the means to translate
documents conforming to one standard into
another
• XML technology provides the means to do this
through XSLT scripts
• I have yet to find one that actually works at
what I consider an acceptable level
 Lowest Common Denominator mappings make the
conversions too ‘lossy’
 Semantic issues are generally ignored
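For the mechanics only, a hedged sketch of applying an XSLT crosswalk with the lxml library; the input format, output format and stylesheet are invented, and a real crosswalk between standards is far larger and, as noted above, usually lossy:

```python
from lxml import etree

source_xml = b"<record><title>Example dataset</title></record>"

stylesheet = etree.XML(b"""
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/record">
    <metadata>
      <citation><xsl:value-of select="title"/></citation>
    </metadata>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(stylesheet)          # compile the crosswalk
result = transform(etree.XML(source_xml))   # apply it to a source document
print(etree.tostring(result, pretty_print=True).decode())
```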
Ontologies
• Each standard is associated with a set of
controlled vocabularies
• Semantic interoperability requires us to either
 Harmonise the vocabularies (develop a single
overarching vocabulary)
 Translate between the vocabularies
• Harmonisation usually considered to be too
difficult for repositories with significant legacy
population
• Which brings us to ontologies…..
Ontologies
• Vocabulary translation can be based on
a simple mapping
 Term in vocabulary 1 maps (i.e. has some
relation to) term in vocabulary 2
 Now considered to be a gross oversimplification – consider the examples
 Pigments map to chlorophyll
 Nitrate maps to nutrients
 Carbon concentration due to phytoplankton
maps to phytoplankton carbon biomass
Ontologies
• An ontology (small ‘o’ – this is computer
science, not philosophy) may be
considered as a set of lists with
relationships specified between list
members
• The previous example becomes
 Pigments is broader than chlorophyll
 Nitrate is narrower than nutrients
 Carbon concentration due to phytoplankton
is synonymous with phytoplankton carbon
biomass
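These relationships can be written down as simple (subject, relation, object) triples, in the spirit of SKOS mapping properties; the vocabulary prefixes below are invented:

```python
triples = [
    ("vocab1:pigments", "broader_than", "vocab2:chlorophyll"),
    ("vocab1:nitrate", "narrower_than", "vocab2:nutrients"),
    ("vocab1:carbon_concentration_due_to_phytoplankton",
     "synonymous_with", "vocab2:phytoplankton_carbon_biomass"),
]

def related(term, relation):
    """Return everything a term is linked to by a given relation."""
    return [obj for subj, rel, obj in triples if subj == term and rel == relation]

print(related("vocab1:nitrate", "narrower_than"))   # ['vocab2:nutrients']
```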
Ontologies
• Computer science provides tools for ontology
management
 XML specification languages (the Web Ontology
Language, OWL, and the SKOS thesaurus language)
 Tools, such as inference engines, to use these
languages as a basis for decision making and to
derive additional relationships
 Statements
– A synonymous with B
– B synonymous with C
 Inference
– A synonymous with C
 Welcome to the world of Artificial Intelligence….
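A toy version of that inference step, treating synonymy as symmetric and transitive; a real inference engine working over OWL does a great deal more:

```python
asserted = {("A", "B"), ("B", "C")}   # A synonymous with B, B synonymous with C

def synonym_closure(pairs):
    """Derive every synonym pair implied by symmetry and transitivity."""
    closure = set(pairs) | {(b, a) for a, b in pairs}
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

print(("A", "C") in synonym_closure(asserted))   # True: A synonymous with C
```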
Metadata Horror Stories
• Vocabulary nightmares
• The plaintext monster
• The evil shoehorn
Vocabulary Nightmares
• Weak content governance
 Maintaining a vocabulary properly requires a surprisingly
large amount of intellectual input as consistent and robust
decisions need to be made quickly
 Many vocabularies have been populated by isolated
individuals, who are sometimes inexperienced and working
under pressure at the coal-face
 The result is vocabularies with useless terms like ‘see
website’ (referring to a broken URL) or rubbish like ‘NVT’
(Dutch for not applicable) in a list of sea level datums
• Weak technical governance
 Lack of clearly defined, readily obtainable and versioned
master copies leads to a proliferation of local lists
 Like finches on the Galapagos Islands, these soon evolve
into something completely different. Eventually, just as the
finches lose the ability to interbreed, the lists lose the
ability to interoperate
Vocabulary Nightmares
• Semantic keys
 During the 80s and 90s, great importance was placed
on making keys meaningful mnemonics
 Not scalable, particularly if there are restrictions on
key size (try getting 18,000 meaningful unique labels
out of 8 bytes!)
 Following this doctrine has caused
 New vocabularies to be created requiring months of
subsequent mapping work to re-establish
interoperability
 The disintegration of established standards (e.g. ICES
Ship Codes when the USA left the fold)
 Insanity in other vocabularies (e.g. the USA has three
different IOC Country Codes due to Ship Code key
syntax)
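By contrast, semantically neutral keys scale trivially: a fixed-width counter never runs out of room in the way a mnemonic scheme does (the prefix and format below are invented):

```python
def next_key(counter, prefix="XX"):
    """Return a fixed-width, meaning-free 8-character key for a new entry."""
    return f"{prefix}{counter:06d}"

print([next_key(n) for n in range(1, 4)])   # ['XX000001', 'XX000002', 'XX000003']
print(next_key(18000))                      # 18,000 terms fit in 8 bytes with room to spare
```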
Vocabulary Nightmares
• GCMD maintenance
 NASA’s GCMD maintain vocabularies (called GCMD
keywords) for the DIF metadata standard
 They have no compunction about deleting terms
which
 Invalidates DIFs in legacy repositories
 Breaks referential integrity in user databases
 Their entries have no keys which
 Makes changes or corrections to terms difficult to find
 Causes these changes to break referential integrity in
user databases
 Consequently GCMD keyword updates are only done
when there is a dire need, resulting in yet more local
list evolution
The Plaintext Monster
• Some of the previous generation of data
managers saw hard-copy printout as the
primary metadata delivery mechanism
• This can easily be delivered by metadata
repositories based on big chunks of plaintext
• Such repositories are virtually useless for
machine-based metadata processing
• Remember that sticking text together is much
easier than picking it apart
The Plaintext Monster
• SeaDataNet has EDMED and not DIF because
of history and a misguided desire for plaintext
over structured fields
• EDMED through its XML schema is currently
evolving towards a structured standard
(ISO19115) that will make interoperability much
easier and this evolution will continue during
the life of SeaDataNet
• When populating EDMED plaintext, always think
about how you can make it easier to pick apart,
by being consistent or by using internal markup
(but not XML or XHTML - embedding these inside
fields is not XML-friendly)
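A small sketch of why consistency pays off: if a plaintext field follows a simple 'label: value' convention (invented here, not an EDMED rule), a machine can pick it apart again, which free prose does not allow:

```python
import re

field = "Instrument: CTD; Platform: research vessel; Depth range: 0-200 m"

# Split a consistently populated field back into labelled parts.
parsed = {
    label.strip(): value.strip()
    for label, value in re.findall(r"([^:;]+):\s*([^;]+)", field)
}
print(parsed["Platform"])   # 'research vessel'
```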
The Evil Shoehorn
• A shoehorn forces something (a foot) to
occupy a space that it doesn’t quite fit (a shoe)
• The metadata equivalent is using a metadata
structure designed to describe a particular
thing to describe something else
• For example, using a Cruise Summary Report
to describe the activities of a scientist on a
beach collecting mussels for analysis
The Evil Shoehorn
• Why is this evil?
 Shoehorning causes data model entity definitions to
be changed
 Changing entity definitions causes strange things to
happen to supporting vocabularies, for example CSR
shoehorning has led to the following ‘ship names’
 RRS Challenger (a ship)
 Dr Mussel Collector (a person)
 Helicopter (a type of platform)
 Dover to Boulogne (a ferry route)
 These vocabularies are shared between data models
and used as constraints or to populate drop-down
lists in user interfaces
 Would you want ‘Dr Mussel Collector’ appearing in
the drop-down list labelled ‘ship name’ in your
system?
That’s All Folks!
Questions or Coffee?