NP 24622-1 CMDI

Component Metadata Infrastructure for Language
Resources
Part 1. The Component Metadata Model
- proposal for a new work item in ISO TC37/SC4
D. Broeder, D. van Uytvanck, Thorsten Trippel, Maria Gravilidou, P. Wittenburg¹
Introduction.
This document presents the essential aspects of the so-called Component Metadata
model that is currently used in the CLARIN metadata infrastructure [1] and is being considered by other
infrastructure projects. At the ISO TC37/SC4 meeting in Berlin in 2010, it was discussed that a
standardization effort for the current approaches to component metadata would
both bolster the current achievements and foster interoperability between
different Language Resource infrastructure projects. It was also discussed that, in order to distribute
responsibility over multiple groups working on these issues, the work would be split into three separate
standardization documents:
1. CMDI – Part 1, The Component Metadata Model
2. CMDI – Part 2, The Component Metadata Specification Language
3. CMDI – Part 3, Recommended Metadata Components
This document, although offering information relevant for all three parts, concentrates on Part 1
and is intended as a basis that can be extended into a first standard draft for Part 1.
General Background.
The need for and use of metadata, especially metadata for Language Resources (LRs), are not
elaborated here. We do want to make the point that the landscape of metadata used in the LR domain
is currently fragmented. Different infrastructure initiatives and projects [2], [3]
have concluded that existing metadata schemas such as DC/OLAC, IMDI and the
TEI header have limitations. These limitations include:
– Inflexibility: too many (IMDI) or too few (OLAC) metadata elements
– Limited interoperability (both semantic and syntactic)
– Problematic (unfamiliar) terminology for some sub-communities.
– Limited support for LT tool & services descriptions
Discussions in recent years have led to the hope that solutions might come from using:
– Explicitly defined schema & semantics
– User/project/community defined metadata components
This was partly inspired by the work on the component structure of LMF [ISO 24613:2008] and the
emergence of the ISO DCR [ISO 12620:2009] as a stable concept registry.
Work on this solution, which we will call a Component Metadata Infrastructure (CMDI), started in
the CLARIN project [4]. It has now reached a level of maturity where we seek stabilization and
generalization in cooperation with other infrastructure projects in the Language Resource domain, such as
META-NET [5].
¹ The authors are partly members of the German and Dutch ISO groups.
Component Metadata Infrastructure
A component metadata infrastructure (CMDI) is not an attempt to introduce a single new metadata schema but
rather to create an environment that allows the coexistence of many community and researcher defined metadata
schemas. Essential points of CMDI are:
– Metadata components are bundles of metadata elements that describe a specific aspect of a resource
– Components can be grouped together to form new components, making more complex resource
descriptions possible
– A group of components can be used to create a metadata schema that can be instantiated into metadata
descriptions for resources
– All metadata components and their constituent metadata elements make their semantics explicit by
referencing concepts or data categories in a concept registry.

[Figure 1 (diagram): a Technical Metadata component (Sample freq., Format, Size, …), a Language component
(Name, Id, …) and an Actor component (Name, Age, Sex, Language, …) are combined step by step into one schema.]
Figure 1. Building a metadata schema by successively adding a Technical Metadata component, a Language component
and an Actor component. The resulting metadata schema can then be used to describe, for instance, a speech
recording.
Metadata Component Specifications and Schemas
The individual metadata components are reusable and need a specification format, as do the equally reusable
metadata profiles. Part 2 of this standard will handle the exact specifications. For the purpose of presenting an
overview of the CMDI, we assume in this part of the standard that the component/profile specifications will be
XML based and that the resulting metadata schema is a W3C XML schema. Figure 2 shows the naming of the
different specifications and schemas. The resulting metadata schema can be used by an XML editor or a
specialized metadata editor to create metadata descriptions for specific resources.
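[Figure 2 (diagram): component definitions (XML) for Project, Location, Actor, Language and Technical Metadata
are combined into a metadata profile (profile definition, XML), from which a metadata schema (W3C XML Schema)
is derived; a metadata description (XML file) is an instance of that schema.]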
Figure 2. A CMD profile can be transformed into a W3C XML metadata schema that can be used to create
(instantiate) metadata descriptions
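As a rough illustration of the profile-to-schema step in Figure 2, the sketch below turns a toy profile into a W3C XML Schema fragment. The profile layout (nested Python dictionaries), the names `PROFILE`, `component_to_xsd` and `profile_to_xsd`, and the mapping of multiplicity onto `minOccurs`/`maxOccurs` are assumptions made for this example only; Part 2 of this standard will define the actual specification language.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)  # serialize with the conventional xs: prefix


def xs(tag):
    """Return the namespace-qualified name of an XML Schema tag."""
    return f"{{{XS}}}{tag}"


# Hypothetical profile: each component has a name, leaf elements and child components.
PROFILE = {
    "name": "SpeechRecording",
    "elements": [],
    "components": [
        {"name": "Language", "max": "unbounded", "components": [],
         "elements": [{"name": "Name", "type": "xs:string"},
                      {"name": "Id", "type": "xs:string"}]},
        {"name": "Actor", "max": "unbounded", "components": [],
         "elements": [{"name": "Name", "type": "xs:string"},
                      {"name": "Age", "type": "xs:integer", "min": "0"}]},
    ],
}


def occurs(spec):
    """Map the assumed 'min'/'max' multiplicity keys to XSD attributes."""
    return {"minOccurs": spec.get("min", "1"), "maxOccurs": spec.get("max", "1")}


def component_to_xsd(comp, top_level=False):
    """Recursively translate one component into an xs:element declaration."""
    attrs = {"name": comp["name"]}
    if not top_level:  # global element declarations may not carry occurrence bounds
        attrs.update(occurs(comp))
    element = ET.Element(xs("element"), attrs)
    sequence = ET.SubElement(ET.SubElement(element, xs("complexType")), xs("sequence"))
    for leaf in comp["elements"]:
        ET.SubElement(sequence, xs("element"),
                      {"name": leaf["name"], "type": leaf["type"], **occurs(leaf)})
    for child in comp["components"]:
        sequence.append(component_to_xsd(child))
    return element


def profile_to_xsd(profile):
    """Wrap the translated root component in an xs:schema document."""
    schema = ET.Element(xs("schema"))
    schema.append(component_to_xsd(profile, top_level=True))
    return ET.tostring(schema, encoding="unicode")


print(profile_to_xsd(PROFILE))
```

The printed schema could then be loaded into an XML editor to create metadata descriptions; concept links are left out here to keep the sketch short.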
The Metadata Model
The essential features of the CMDI that determine its descriptive power, its connection with semantic tools such as
the ISO-DCR, and its integration with resource repositories and archives are:
1. A component has attributes: name, multiplicity, concept link
2. The component model should support recursion
3. A component contains a number of metadata elements
4. A metadata element has attributes: name, value scheme, multiplicity, concept link
5. A component can refer to a number of resources or to other metadata components
6. A component can contain information about resource relations
7. A component grammar has to be fully deterministic to avoid ambiguity
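The first five features in this list can be pictured as a small data model. The following is a minimal sketch, not the CLARIN implementation: the class and field names, the way multiplicity is encoded, and the `http://example.org/...` concept links are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Element:
    """Feature 4: name, value scheme, multiplicity, concept link."""
    name: str
    value_scheme: str                  # e.g. "string" or the name of a vocabulary
    concept_link: Optional[str] = None
    min_occurs: int = 1
    max_occurs: Optional[int] = 1      # None is used here to mean "unbounded"


@dataclass
class Component:
    """Features 1, 2, 3 and 5 of the list above."""
    name: str
    concept_link: Optional[str] = None
    min_occurs: int = 1
    max_occurs: Optional[int] = 1
    elements: List[Element] = field(default_factory=list)
    components: List["Component"] = field(default_factory=list)  # recursion
    resource_refs: List[str] = field(default_factory=list)       # references

    def walk(self, prefix: str = "") -> Iterator[Tuple[str, Element]]:
        """Yield (path, element) pairs for this component and all nested ones."""
        path = f"{prefix}/{self.name}"
        for element in self.elements:
            yield f"{path}/{element.name}", element
        for child in self.components:
            yield from child.walk(path)


# Hypothetical concept links; a real profile would point into a concept registry.
language = Component(
    "Language", "http://example.org/concept/language",
    elements=[Element("Name", "string", "http://example.org/concept/languageName"),
              Element("Id", "string")],
)
actor = Component(
    "Actor", "http://example.org/concept/actor", max_occurs=None,
    elements=[Element("Name", "string", "http://example.org/concept/actorName"),
              Element("Age", "integer", "http://example.org/concept/age"),
              Element("Sex", "sexVocabulary", "http://example.org/concept/sex")],
    components=[language],
)

paths = [path for path, _ in actor.walk()]
unlinked = [path for path, element in actor.walk() if element.concept_link is None]
print(paths)
print(unlinked)
```

Here `unlinked` lists the elements that do not yet reference a concept, which is one simple way to check the explicit semantics requirement for a component.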
Clearly not all features are equally important, but we leave an analysis to an extended version of this document.
One of the essential features of the model is the requirement that metadata elements and components link or
refer to concept registries such as the ISO DCR or the ISO CDB (ISO Concept Database). This is essential to
meet the explicit semantics requirement, and it partly solves the problems created by the semantic overlap that
occurs when different components are used to describe identical aspects of resources. Such overlap is unavoidable when
we leave it to the community to create metadata components and profiles that suit their needs as they see fit. Part
3 of this standard, “Recommended Metadata Components and Profiles”, will alleviate some of these problems.
A supplementary approach to the problems introduced by semantic overlap between metadata
components is the use of a relation registry (RR) [6], which can be used to create relations between different
concepts. A user can define such relations to tune the trade-off between precision and recall when executing
metadata search queries.
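How user-defined relations can shift the balance between precision and recall may be sketched as follows. The concept identifiers (`ex:...`), the relation types `sameAs` and `almostSameAs`, the sample records and the function names are all hypothetical; the relation registry described in [6] may work differently.

```python
from collections import deque

# Hypothetical relation registry: (concept, concept, relation type) triples.
RELATIONS = [
    ("ex:languageName", "ex:languageLabel", "sameAs"),
    ("ex:languageName", "ex:motherTongue", "almostSameAs"),
]

# Hypothetical metadata records, indexed by the concept each element links to.
RECORDS = [
    {"id": "rec1", "fields": {"ex:languageName": "Dutch"}},
    {"id": "rec2", "fields": {"ex:languageLabel": "Dutch"}},
    {"id": "rec3", "fields": {"ex:motherTongue": "Dutch"}},
]


def expand(concept, allowed_relations):
    """Return all concepts reachable from `concept` via the allowed relation types."""
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for left, right, relation in RELATIONS:
            if relation not in allowed_relations:
                continue
            # Relations are treated as symmetric in this sketch.
            for source, target in ((left, right), (right, left)):
                if source == current and target not in seen:
                    seen.add(target)
                    queue.append(target)
    return seen


def search(concept, value, allowed_relations=frozenset()):
    """Find records whose value matches under the query concept or a related one."""
    concepts = expand(concept, allowed_relations)
    return [record["id"] for record in RECORDS
            if any(record["fields"].get(c) == value for c in concepts)]


strict = search("ex:languageName", "Dutch")
relaxed = search("ex:languageName", "Dutch", {"sameAs"})
loose = search("ex:languageName", "Dutch", {"sameAs", "almostSameAs"})
print(strict, relaxed, loose)
```

Allowing more relation types returns more records (higher recall) at the risk of including records where the related concept means something slightly different (lower precision).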
Figure 3. A UML diagram version of Figure 2, showing the dependencies between metadata components, profiles and
schemas
Figure 4. The metadata element component relation
The relation between metadata components and the constituent metadata elements and other metadata
components is shown in Figure 4. The recursive nature of the component-component relations greatly enhances
the expressive power of the model and limits the necessary component variety.
Figure 5. A complete CMDI model as used by the CLARIN CMDI implementation.
The complete model is presented in Figure 5. This is the model that was developed in the CLARIN project; it
exhibits some extra features such as a “JournalFileProxy” component meant to allow references to journal files
from applications. It also allows an extra modelling feature: the possibility to have metadata descriptions refer
to other metadata descriptions instead of referring to data resources. This is a second recursive property of the
model, next to the possibility for components to contain other components.
The references and links between metadata descriptions, resources and concepts in the ISO-DCR or other
concept registries should all be stable references using Persistent Identifiers (PIDs) according to ISO-FDIS 24619 PISA.
This includes using cool URIs for the concept links to ISOcat and the ISO CDB; all references to
resources and metadata can contain PIDs.
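The second recursive property, descriptions that refer to other descriptions by PID, can be illustrated with a short traversal. This is a sketch under stated assumptions: the `hdl:example/...` identifiers are made-up placeholders, and the in-memory `DESCRIPTIONS` table stands in for whatever PID resolution a real infrastructure would use.

```python
# Hypothetical store: PID of a metadata description -> list of (kind, target PID).
# kind is "metadata" for a reference to another description, "resource" for data.
DESCRIPTIONS = {
    "hdl:example/collection": [
        ("metadata", "hdl:example/session-1"),
        ("metadata", "hdl:example/session-2"),
    ],
    "hdl:example/session-1": [
        ("resource", "hdl:example/audio-1"),
        ("resource", "hdl:example/transcript-1"),
    ],
    "hdl:example/session-2": [
        ("resource", "hdl:example/audio-2"),
        ("metadata", "hdl:example/collection"),  # back-reference, forms a cycle
    ],
}


def collect_resources(pid, _visited=None):
    """Return the PIDs of all data resources reachable from one description."""
    visited = set() if _visited is None else _visited
    if pid in visited:  # guard against cycles between descriptions
        return []
    visited.add(pid)
    resources = []
    for kind, target in DESCRIPTIONS.get(pid, []):
        if kind == "resource":
            resources.append(target)
        else:
            resources.extend(collect_resources(target, visited))
    return resources


print(collect_resources("hdl:example/collection"))
```

Because every link is a PID rather than a file path, the traversal does not depend on where the descriptions or resources are stored.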
[1] CMDI Component Metadata Infrastructure, http://www.clarin.eu/cmdi
[2] IMDI, Isle Metadata Initiative, http://www.mp.nl/IMDI
[4] CLARIN, Common Language Resource Infrastructure, http://www.clarin.eu/
[5] META-NET, http://www.meta-net.eu/
[6] M. Kemps-Snijders, M.A. Windhouwer, S.E. Wright. Putting data categories in their semantic context. In
proceedings of the IEEE e-Humanities – an emerging discipline Workshop at the 4th IEEE International
Conference on e-Science, Indianapolis, Indiana, USA, December 10, 2008. http://www.clarin.eu/system/files/eHumanities-ISOcat-final.pdf