CCLRC Scientific Metadata (CSMD) Model
April 2004 NESC
Shoaib Sufi
CCLRC e-Science Centre
Model Motivation
• A common general format/standard for Scientific
Studies and data holdings metadata does not
exist
• By proposing a Model and Implementation we can:
– Form a specification for the types of metadata
that should be captured by Scientific Studies
– Ease citation, collaboration, exploitation and
integration
– Allow easy integration of distributed
heterogeneous metadata systems into a
homogeneous (albeit virtual) platform
Structure of Metadata Model
• The CCLRC Scientific Metadata Model (CSMD) is a
study-dataset orientated model covering:
– Indexing
– Provenance
– Data Description
– Data Location
– Access Conditions
– Related Material
What influenced CSMD
• CIP from Earth Observation
• DDI from Social Sciences
• DublinCore from the Library community
– Publication only metadata
• XSIL as used on LIGO
– Low level ‘Scientific Data Objects’ focus
• CERA from the MPIM
– A bit specific to Earth Sciences but close
• … hence the need to develop our own General
Model: the CCLRC Scientific Metadata Model
Some Model Aims
• Abstract class orientated description of the
types of metadata that should be captured
by Scientific Studies
• Create a common denominator for Scientific Study
metadata which forms a specification
• At the NIEES 2002 metadata workshop, during
a discussion on metadata standards, the question
was asked: are people capturing metadata at the moment?
– The simple answer given was no!
CSMD Used on DataPortal
• XML Implementation
used as Data Interface for
DataPortal
• Single view of
heterogeneous
systems/schemas
• Acts as a stress test of
the model
– Limitations feed into
Model requirements
– New requirements are fed
back into the implementation
Model Breakdown: Provenance
• The Study contains the following metadata:
– The Study Name
– The Study Institution
– The Investigator
– Extended Study Information
• Abstract
• Funding
• Start and End times
– Investigations
Investigations
• A Study can have more than one Investigation;
possible Investigation types include experiment,
simulation and measurement
• Each Investigation contains:
– Name
– Investigation Type
– Abstract
– Resource
– Link to DataHolding
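A minimal sketch of how the Study and Investigation metadata above might be held in code. This is an illustration only, not the CSMD XML or relational implementation: the class and field names are assumptions derived from the slide bullets, and times are kept as plain strings for simplicity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Investigation:
    """One investigation within a study (field names are illustrative)."""
    name: str
    investigation_type: str                 # e.g. "experiment", "simulation"
    abstract: str = ""
    resource: str = ""
    data_holding_ref: Optional[str] = None  # link to the DataHolding


@dataclass
class Study:
    """Provenance metadata for a study (field names are illustrative)."""
    name: str
    institution: str
    investigator: str
    abstract: str = ""
    funding: str = ""
    start_time: Optional[str] = None
    end_time: Optional[str] = None
    investigations: List[Investigation] = field(default_factory=list)


# A study may hold more than one investigation.
study = Study(name="Example study", institution="Example Lab",
              investigator="A. Researcher")
study.investigations.append(Investigation("Run 1", "experiment"))
study.investigations.append(Investigation("Model run", "simulation"))
```

Keeping the investigations as a list on the Study mirrors the one-to-many relationship described in the slides.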
Topic (for indexing)
• Keywords
– Discipline (i.e. domain)
– Keyword Source (e.g.
domain dictionary)
– Keyword
• Subjects
– Discipline
– Subject Source (e.g.
domain taxonomy)
– Subject
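The Topic structure above can be sketched as follows. The class names and the `matches` helper are hypothetical, introduced only to show how keywords and subjects, each tagged with a discipline and a source, could support a simple index lookup.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Keyword:
    discipline: str   # i.e. the domain
    source: str       # e.g. a domain dictionary
    keyword: str


@dataclass(frozen=True)
class Subject:
    discipline: str
    source: str       # e.g. a domain taxonomy
    subject: str


@dataclass
class Topic:
    keywords: List[Keyword] = field(default_factory=list)
    subjects: List[Subject] = field(default_factory=list)

    def matches(self, term: str) -> bool:
        """Case-insensitive exact match against any keyword or subject."""
        wanted = term.lower()
        return (any(k.keyword.lower() == wanted for k in self.keywords)
                or any(s.subject.lower() == wanted for s in self.subjects))


topic = Topic(
    keywords=[Keyword("physics", "example dictionary", "neutron scattering")],
    subjects=[Subject("physics", "example taxonomy", "condensed matter")],
)
```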
Access Condition & Related Material
• Access Conditions
– Contains a list of users or groups who are
allowed access to the metadata and data, or a
pointer to an access control system which
contains such data for this study
• Related Material
– One or many links and/or textual descriptions
of material related to this study, e.g. earlier
studies or parallel studies
Data
• Data Description holds a
logical description of the
Study’s data:
– Data Name
– Type of Data
– Status
– Data Topic
– Parameters
– Related Data Ref
– Relation type (e.g.
derived)
• Data Location contains
the link between the logical
name and the physical URIs
– Data Name
– Locator(s)
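The split between logical description and physical location can be sketched as below. The names are assumptions for illustration, the Data Topic and Parameters fields are left out for brevity, and the URIs are placeholders. The `resolve` helper is hypothetical; it shows the Data Name acting as the join key between the two parts.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataDescription:
    """Logical description of a data item (illustrative subset of fields)."""
    data_name: str
    data_type: str
    status: str
    related_data_ref: Optional[str] = None
    relation_type: Optional[str] = None   # e.g. "derived"


@dataclass
class DataLocation:
    """Maps a logical data name to one or more physical URIs."""
    data_name: str
    locators: List[str] = field(default_factory=list)


def resolve(data_name: str, locations: List[DataLocation]) -> List[str]:
    """Return every physical URI recorded for the given logical name."""
    return [uri
            for loc in locations if loc.data_name == data_name
            for uri in loc.locators]


raw = DataDescription("run42-raw", "raw", "complete")
reduced = DataDescription("run42-reduced", "reduced", "complete",
                          related_data_ref="run42-raw",
                          relation_type="derived")
locations = [
    DataLocation("run42-raw", ["file:///archive/run42.raw",
                               "ftp://mirror.example.org/run42.raw"]),
    DataLocation("run42-reduced", ["file:///archive/run42.red"]),
]
```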
More on Parameters
• Parameters contain a lot of information about the
data objects (DO) and collections
• A collection/DO can have many parameter
entries, each parameter entry contains:
– Parameter derivation (e.g. measured/fixed)
– The value
– The units
– Range
– Error margin
• Parameter aggregation is also supported
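A parameter entry with aggregation might look like the sketch below. The `name` field and the `children` list used to express aggregation are assumptions, since the slides do not say how aggregation is represented.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Parameter:
    name: str                                   # assumed; not listed on the slide
    derivation: str                             # e.g. "measured" or "fixed"
    value: Optional[float] = None
    units: str = ""
    value_range: Optional[Tuple[float, float]] = None
    error_margin: Optional[float] = None
    children: List["Parameter"] = field(default_factory=list)  # aggregation

    def flatten(self) -> List["Parameter"]:
        """Return this parameter followed by all aggregated parameters."""
        result = [self]
        for child in self.children:
            result.extend(child.flatten())
        return result


sample_env = Parameter(
    name="sample environment",
    derivation="fixed",
    children=[
        Parameter("temperature", "measured", value=4.2, units="K",
                  error_margin=0.1),
        Parameter("pressure", "fixed", value=1.0, units="bar",
                  value_range=(0.9, 1.1)),
    ],
)
```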
Cardinality Issues
• The model recommends a certain
cardinality of elements
• Certain metadata components are
necessary for one to have an instance of
the implemented model – treating
everything as optional is not acceptable
• It is thought that implementations may modify
this to suit their needs; the model attempts
to remain ideal (i.e. the most common
cardinality)
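A cardinality check of this kind can be sketched as below. The element names and the minimum/maximum values in `CARDINALITY` are hypothetical; the slides do not list which components are mandatory, so the table only illustrates the idea that an implementation may override the defaults.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical cardinalities: (minimum, maximum); None means unbounded.
CARDINALITY: Dict[str, Tuple[int, Optional[int]]] = {
    "study_name": (1, 1),
    "investigator": (1, None),
    "investigation": (1, None),
    "related_material": (0, None),
}


def check_cardinality(
    record: Dict[str, List[str]],
    rules: Dict[str, Tuple[int, Optional[int]]] = CARDINALITY,
) -> List[str]:
    """Return a list of human-readable cardinality violations."""
    problems = []
    for element, (minimum, maximum) in rules.items():
        count = len(record.get(element, []))
        if count < minimum:
            problems.append(
                f"{element}: expected at least {minimum}, found {count}")
        elif maximum is not None and count > maximum:
            problems.append(
                f"{element}: expected at most {maximum}, found {count}")
    return problems


record = {"study_name": ["Example study"],
          "investigator": ["A. Researcher"],
          "investigation": []}
problems = check_cardinality(record)
```

An implementation that relaxes a rule would pass its own `rules` table rather than editing the model-level defaults.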
Enumeration Issues
• Enumerations (or controlled vocabularies), e.g.
types of investigator or types of institution, are
distinct from the model, just as taxonomies are
• However, they are necessary for the model to
work, so implementations (e.g. the CCLRC
DataPortal XML implementation of the model)
propose some enumerations for common things
• It is hoped that implementations will use recognised
and relevant controlled vocabularies where
they are available
Conformance Level
• For a complete metadata study-dataset
record a large amount of metadata has to
be stored/processed
• So it’s useful to have conformance levels
• Model uses 5 levels
• Each level specifies that more metadata (and
indexing information) should be held
Level 1
• Type of Information captured:
– Study and Investigation metadata with
indexing at the Study level
• Level 1 metadata is similar to
library/publication style metadata (e.g.
DublinCore)
Level 2
• Type of Information captured:
– Level 1 + DataHolding metadata (i.e.
DataSets and DataObjects)
Level 3
• Type of Information captured:
– Level 2 + related material, access
conditions, indexing to data collection
level
Level 4
• Type of Information captured:
– Level 3 + indexing to data object level
and data object parameter information
Level 5
• Type of Information captured:
– All metadata components are filled:
Level 4 + funding, resources used, facilities
used etc.
Conformance Levels
• L1 is similar to library/publication style metadata (e.g.
DublinCore)
• The current DataPortal uses somewhere between L2
and L3 – indexing at study level moving towards
collection level but with parameter information
• It is envisaged that only new systems designed with CSMD will
conform to L4+
• Benefit of conformance levels: the higher the level of
conformance to the CSMD, the richer the clients that
operate on the data can be
– e.g. identifying datasets and objects which link
directly to keywords/taxonomies and not just studies
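The cumulative nature of the five levels can be sketched as a simple check. The component names below are shorthand assumptions for the items listed under Levels 1 to 5, not identifiers from the model, and the function is an illustration rather than an official conformance test.

```python
from typing import List, Set, Tuple

# Components each level adds on top of the previous one (shorthand names).
LEVEL_REQUIREMENTS: List[Tuple[int, Set[str]]] = [
    (1, {"study", "investigation", "study_indexing"}),
    (2, {"data_holding"}),
    (3, {"related_material", "access_conditions", "collection_indexing"}),
    (4, {"object_indexing", "object_parameters"}),
    (5, {"funding", "resources", "facilities"}),
]


def conformance_level(present: Set[str]) -> int:
    """Return the highest level whose requirements, and those of all
    lower levels, are fully present; 0 if Level 1 is not met."""
    level = 0
    for candidate, required in LEVEL_REQUIREMENTS:
        if not required <= present:
            break
        level = candidate
    return level


# A system that has DataHolding metadata and is moving towards
# collection-level indexing, but lacks the rest of Level 3.
partial = {"study", "investigation", "study_indexing",
           "data_holding", "collection_indexing"}
full = set().union(*(required for _, required in LEVEL_REQUIREMENTS))
```

Because the levels are cumulative, a system holding some Level 3 components but not all of them still reports Level 2, which matches the "somewhere between L2 and L3" situation described above.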
Facilities using CSMD
• CCLRC Facilities (via CCLRC DataPortal):
– ISIS - Neutron Spallation at Rutherford Appleton Laboratory
(test)
– SR – Synchrotron Radiation source at Daresbury Laboratory
(test)
– British Atmospheric Data Centre (BADC) at RAL (prototype)
• External Facilities (via CCLRC DataPortal):
– Max-Planck-Institut für Meteorologie (MPIM) in Hamburg
• External Projects using CSMD
– NERC funded E-mineral ‘environment from the molecular level’
– EPSRC funded E-materials project
– Manchester MyGrid project uses an adapted version
– ISIS (RAL) have taken data needs in-house and use a model
based heavily on CSMD
The Future
• Increased use/recommendation for use of
Controlled vocabularies
• Increased support for formal identification
systems
• Feeding in relevant ideas from other standards
• Update XML and Relational implementations so
they more closely track the model.
• Look into internationalisation issues and see if
these affect the model or the implementations
More information
• Latest Model description
– http://wwwdienst.rl.ac.uk/library/2002/tr/dltr2002001.pdf
• For an XML implementation, a Relational
implementation, or a newer draft of the model
documentation, e-mail:
– [email protected] with the subject
containing [metadata model request]