Component Metadata Infrastructure for Language Resources
Part 1. The Component Metadata Model - proposal for a new work item in ISO TC37/SC4

D. Broeder, D. van Uytvanck, Thorsten Trippel, Maria Gravilidou, P. Wittenburg¹

Introduction

This document is meant to present the essential aspects of the so-called Component Metadata model that is currently used in the CLARIN metadata infrastructure [1] and is being considered by other infrastructure projects. At the ISO TC37/SC4 meeting in Berlin in 2010 it was discussed that a standardization effort for the current approaches to component metadata would both bolster the current achievements and foster interoperability between different Language Resource infrastructure projects. It was also discussed that, in order to distribute responsibility over the multiple groups working on these issues, the work would be split into three separate standardization documents:

1. CMDI – Part 1, The Component Metadata Model
2. CMDI – Part 2, The Component Metadata Specification Language
3. CMDI – Part 3, Recommended Metadata Components

This document, although offering information relevant to all three parts, concentrates on Part 1 and should be usable as the basis for a first standard draft of Part 1.

General Background

The need for and use of metadata, and especially metadata for Language Resources (LR), is not elaborated here. We do want to make the point that the landscape of metadata used in the LR domain is currently fragmented. Different infrastructure initiatives and projects [2], [3] have concluded that existing metadata schemas such as DC/OLAC, IMDI and the TEI header have limitations. These limitations include:

– Inflexibility: too many (IMDI) or too few (OLAC) metadata elements
– Limited interoperability (both semantic and syntactic)
– Problematic (unfamiliar) terminology for some sub-communities
– Limited support for LT tool & services descriptions

Discussions in recent years have led to the hope that solutions might come from using:

– Explicitly defined schemas & semantics
– User/project/community defined metadata components

This was partly inspired by the work on the component structure of LMF [ISO 24613:2008] and the emergence of the ISO DCR [ISO 12620:2009] as a stable concept registry. Work on this solution, which we will call a Component Metadata Infrastructure (CMDI), started in the CLARIN project [4]. It has now reached a level of maturity where we look for stabilization and generalization in cooperation with other infrastructure projects in the Language Resource domain such as META-NET [5].

¹ The authors are partly members of the German and Dutch ISO groups.

Component Metadata Infrastructure

A component metadata infrastructure (CMDI) is not an attempt to introduce a single new metadata schema but rather to create an environment that allows the coexistence of many community- and researcher-defined metadata schemas. Essential points of CMDI are:

– Metadata components are bundles of metadata elements that describe a specific aspect of a resource.
– Components can be grouped together to form new components, making more complex resource descriptions possible.
– A group of components can be used to create a metadata schema that can be instantiated into metadata descriptions for resources.
– All metadata components and their constituent metadata elements make their semantics explicit by referencing concepts or data categories in a concept registry.

[Figure 1: diagram of three components and their elements: Technical Metadata (Sample freq., Format, Size, …), Language (Name, Id, …) and Actor (Name, Age, Sex, Language, …).]

Figure 1. Building a metadata schema by successively combining a Technical Metadata component, a Language component and an Actor component. The resulting metadata schema can then be used to describe, for instance, a speech recording.
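The composition shown in Figure 1 can be illustrated with a minimal Python sketch. It is not the CMDI specification format, which is left to Part 2. The class names, the `paths` helper, the profile name `SpeechRecording` and the spelled-out element names are assumptions made for illustration, as is the choice to let the Actor component reuse the Language component.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """A single metadata element, e.g. 'Format'."""
    name: str


@dataclass
class Component:
    """A bundle of elements that may also contain other components."""
    name: str
    elements: List[Element] = field(default_factory=list)
    components: List["Component"] = field(default_factory=list)

    def paths(self, prefix: str = "") -> List[str]:
        """Return the slash-separated path of every element, depth first."""
        here = f"{prefix}/{self.name}" if prefix else self.name
        found = [f"{here}/{e.name}" for e in self.elements]
        for child in self.components:
            found.extend(child.paths(here))
        return found


# Components named in Figure 1; their element lists are abbreviated there too.
technical = Component(
    "TechnicalMetadata",
    [Element("SampleFrequency"), Element("Format"), Element("Size")],
)
language = Component("Language", [Element("Name"), Element("Id")])

# Assumption: the Actor component reuses the Language component as a child.
actor = Component(
    "Actor",
    [Element("Name"), Element("Age"), Element("Sex")],
    [language],
)

# Hypothetical profile name for the grouping that describes a speech recording.
speech_recording = Component(
    "SpeechRecording", components=[technical, language, actor]
)

for path in speech_recording.paths():
    print(path)
```

Because a component is simply a named bundle that may contain other components, the same Language definition appears both at the top level and inside Actor without being redefined.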
Metadata Component Specifications and Schemas

The (reusable) individual metadata components will need a specification format, as will the equally reusable metadata profiles. Part 2 of this standard will handle the exact specifications. For the purpose of presenting an overview of the CMDI, we assume in this part of the standard that the component and profile specifications will be XML based and that the resulting metadata schema is a W3C XML Schema. Figure 2 shows the naming of the different specifications and schemas. The resulting metadata schema can be used by an XML editor or a specialized metadata editor to create metadata descriptions for specific resources.

[Figure 2: diagram with the labels Component definition (XML), Profile definition (XML), Metadata profile, Metadata schema (W3C XML Schema) and Metadata description (XML file); example components shown: Project, Location, Actor, Language, Technical Metadata.]

Figure 2. A CMD profile can be transformed into a W3C XML metadata schema that can be used to instantiate metadata descriptions.

The Metadata Model

The essential features of the CMDI that determine its descriptive power, its connection with semantic tools such as the ISO DCR, and its integration with resource repositories and archives are:

1. A component has attributes: name, multiplicity, concept link
2. The component model should support recursion
3. A component contains a number of metadata elements
4. A metadata element has a name, value scheme, multiplicity and concept link
5. A component can refer to a number of resources or to other metadata components
6. A component can contain information about resource relations
7. A component grammar has to be fully deterministic to avoid ambiguity

Clearly not all features are equally important, but we leave an analysis to an extended version of this document. One of the essential features of the model is the requirement for metadata elements and components to link or refer to concept registries such as the ISO DCR or the ISO CDB (ISO Concept Database).
This is essential to meet the explicit semantics requirement, and it partly solves the problems created by the semantic overlap that occurs when different components are used to describe identical aspects of resources. Such overlap is unavoidable when we leave it to the community to create metadata components and profiles that suit their needs as they see fit. Part 3 of this standard, “Recommended Metadata Components and Profiles”, will alleviate some of these problems. A supplementary approach to the problems introduced by semantic overlap between metadata components is the use of a relation registry (RR) [6], which can be used to create relations between different concepts. A user can define such relations to tune the compromise between precision and recall when executing metadata search queries.

Figure 3. A UML diagram version of Figure 2, showing the dependencies between metadata components, profiles and schemas.

Figure 4. The metadata element component relation.

The relation between metadata components and their constituent metadata elements and other metadata components is shown in Figure 4. The recursive nature of the component-component relations greatly enhances the expressive power of the model and limits the necessary component variety.

Figure 5. A complete CMDI model as used by the CLARIN CMDI implementation.

The complete model is presented in Figure 5. This is the model that was developed in the CLARIN project. It exhibits some extra features, such as a “JournalFileProxy” component meant to allow references to journal files from applications. It also allows an extra modelling feature: the possibility to have metadata descriptions refer to other metadata descriptions instead of referring to data resources. This is a second recursive property of the model, next to the possibility for components to contain other components.
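The role of concept links and of the relation registry mentioned earlier in this section can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the `Participant` component, the `concept:…` identifiers and the dictionary-based registry are hypothetical placeholders, not actual ISO DCR entries or the interface of the relation registry described in [6].

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Element:
    """A metadata element with the concept link that states its meaning."""
    component: str
    name: str
    concept: str  # placeholder identifier of a concept in a registry


# Two community-defined components that describe the same aspect differently.
ELEMENTS: List[Element] = [
    Element("Actor", "Sex", "concept:sex"),
    Element("Participant", "Gender", "concept:gender"),
    Element("Actor", "Age", "concept:age"),
]

# Hypothetical relation registry: user-defined relations between concepts.
RELATIONS: Dict[str, Set[str]] = {
    "concept:sex": {"concept:gender"},
}


def related(concept: str, relations: Dict[str, Set[str]]) -> Set[str]:
    """Return the concept plus its directly related concepts.

    Relations are followed one step only and treated as symmetric.
    """
    result = {concept}
    for source, targets in relations.items():
        if source == concept:
            result |= targets
        elif concept in targets:
            result.add(source)
    return result


def search(concept: str, use_relations: bool) -> List[str]:
    """Find elements whose concept link matches the query concept."""
    wanted = related(concept, RELATIONS) if use_relations else {concept}
    return [f"{e.component}/{e.name}" for e in ELEMENTS if e.concept in wanted]


# Exact concept only: higher precision, lower recall.
print(search("concept:sex", use_relations=False))
# Related concepts included: higher recall, possibly lower precision.
print(search("concept:sex", use_relations=True))
```

Switching `use_relations` on or off is the kind of tuning between precision and recall that user-defined relations are meant to allow: the query matches on concepts rather than on element names, so differently named elements can still be found together.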
The references and links between metadata descriptions, resources and concepts in the ISO DCR or other concept registries should all be stable references using Persistent Identifiers (PIDs) according to ISO/FDIS 24619 (PISA). This includes using cool URIs for the concept links to ISOcat and the ISO CDB; all references to resources and metadata can contain PIDs.

[1] CMDI, Component Metadata Infrastructure, http://www.clarin.eu/cmdi
[2] IMDI, Isle Metadata Initiative, http://www.mp.nl/IMDI
[4] CLARIN, Common Language Resource Infrastructure, http://www.clarin.eu/
[5] META-NET, http://www.meta-net.eu/
[6] M. Kemps-Snijders, M.A. Windhouwer, S.E. Wright. Putting data categories in their semantic context. In Proceedings of the IEEE e-Humanities – an emerging discipline Workshop at the 4th IEEE International Conference on e-Science, Indianapolis, Indiana, USA, December 10, 2008. http://www.clarin.eu/system/files/eHumanities-ISOcat-final.pdf