EnviroInfo 2013: Environmental Informatics and Renewable Energies
Copyright 2013 Shaker Verlag, Aachen, ISBN: 978-3-8440-1676-5
EnvThes - interlinked thesaurus for long term ecological research,
monitoring, and experiments
Herbert Schentz, Johannes Peterseil1, Nic Bertrand2
(contributors Mauro Bastiano3, David Blankman4, Mark Frenzel5, Ulf Grandin6, Miki
Kertesz7, Alessandro Oggioni8)
Abstract
Long term ecological research and monitoring produce a vast amount of data describing environmental
characteristics, drivers and pressures. Despite harmonisation efforts in the field of methods and observation designs
within the frame of LTER Europe or other related networks, different management solutions, together with a varying
set of terms and concepts describing the data, are often used for local data management. Therefore, not only
syntactic but also semantic harmonisation is needed in order to ensure data exchange for cross site and cross domain analysis. The use of a common controlled vocabulary or thesaurus is the first step to enable semantic harmonisation across a network of sites.
Within the European scale projects EnvEurope (LIFE08 ENV/IT/000399) and ExpeER, the development of a controlled vocabulary for long term ecological research and monitoring was started in order to provide a common set of
terms and concepts used in this domain. It is called EnvThes and is implemented as a SKOS/RDF based thesaurus using
PoolParty as management tool. PoolParty offers a web browser based viewing and editing interface, as well as a
LinkedData interface and a SPARQL endpoint.
EnvThes is based on existing vocabularies, such as the US-LTER controlled vocabulary, the EUNIS Habitat list, the Catalogue of
Life and the NASA units, extended by concepts which are additionally needed. Links to the original sources and to matching concepts within commonly used thesauri like EUROVOC, GEMET, AGROVOC or EarTh are established, thus
overcoming the islands of conceptual models developed for different purposes. The links to slowly changing vocabularies like EUROVOC or GEMET are the anchors for stable semantics, and the LTER specific concepts, for
which no linkable equivalent can be found in those thesauri, are the building blocks of the flexible conceptual model
needed by science.
EnvThes provides keywords for the metadata system DEIMS, and is used for semantic annotation of the data in the
EnvEurope data repository. It can additionally be used as a source of terms and links to concepts for any sort of document (for example a Microsoft Excel spreadsheet, Word document, PowerPoint presentation, LaTeX document,
etc.) used in the LTER context.
1 Umweltbundesamt GmbH, Spittelauer Lände 5, A-1090 Vienna
2 CEH Computing Services, Oxford Centre for Ecology and Hydrology, Mansfield Road, Oxford, OX1 3SR
3 ISMAR CNR, Castello 1364/a, 30122 Venezia, Italy
4 ILTER Information Management Committee, Information Management, Israel LTER
5 Helmholtz-Zentrum für Umweltforschung – UFZ, Permoserstraße 15, 04318 Leipzig, Germany
6 Swedish University of Agricultural Sciences, Department of Aquatic Sciences and Assessment, SE-750 07 Uppsala, Sweden
7 Institute of Ecology and Botany, MTA Centre for Ecological Research, Vácrátót, Hungary
8 CNR-IREA, Via Bassini, 15 - 20133 Milano, Italy
1. Introduction
The long term ecological monitoring and research network (LTER) in Europe9 provides a vast amount
of data with regard to drivers and impacts of environmental change (Mirtl & Krauze 2007, Mirtl et al.
2013). A variety of ecosystems are monitored touching research topics of several different disciplines, like
biodiversity research, soil science, air pollution measurement, ground water and surface water analysis,
meteorology, hydrology and others. In addition to the variety of disciplines involved, data management
and analysis are fragmented and carried out by several geographically dispersed institutes which are responsible for the
monitored sites. Because of the distributed responsibility, the implemented data management systems vary
from collections of simple spreadsheet tables and distributed domain specific databases to centralized data
management systems. Within those systems data are organized using different data models and related
vocabularies, taxonomies and code lists. Existing semantic de facto standards like the EUNIS habitat list10,
the Catalogue of Life, or the World Reference Base for Soil Resources11 are often not considered, depending on the autonomous decision of each data manager and provider.
Overcoming the described semantic heterogeneity, resulting from the variety of disciplines and institutes, is crucial for the intended exchange of metadata and data. The establishment and use of a common
set of exactly defined concepts is a first step in the direction of semantic harmonization; using those
concepts to annotate data and metadata and finally using them directly in the local data management systems should be the subsequent activities (Schentz et al. 2005, Adamescu et al. 2010). As the issue “establishment of semantic interoperability” is specific neither to the domain of long term ecological research nor
to ecology, appropriate methods and tools have been developed all over the world and across disciplines (Gruber
1993). Controlled vocabularies and thesauri are widely accepted as good practice for establishing the foundation for semantic harmonization, a precondition for reuse and sharing of data. They were established for that purpose long before the first computer was built. Printed dictionaries for medicine and
pharmacy, as well as taxonomies of, for example, species, soils or habitats, were created centuries
ago. Nowadays, a vast number of controlled vocabularies and thesauri exist and are digitally available.
LTER Europe and related projects like EnvEurope (LIFE08 ENV/IT/00039912) or ExpeER13 are working towards a harmonisation of methods and data flows for long term ecological research, monitoring and
experiments. A common conceptual model and definition of terms is therefore crucial in order to allow
not only for a syntactically but also a semantically harmonised exchange of metadata and data. Because of the multidisciplinary character of long term ecological research, monitoring and experiments, existing thesauri mostly
cover only a part of the conceptual model needed for this domain. Therefore, within the projects EnvEurope and ExpeER, the decision was taken to build a domain thesaurus to be used as the semantic base for
metadata and data exchange. It should be built on existing controlled vocabularies and extended wherever
needed. The US LTER Community has established a controlled vocabulary for long term ecological research and monitoring (LTER Controlled Vocabulary Working Group 2011), which should serve as the backbone for EnvThes.
9 See http://www.lter-europe.net/
10 See http://eunis.eea.europa.eu/habitats.jsp
11 See http://www.fao.org/nr/land/soils/soil/wrb-documents/en/
12 See http://www.enveurope.eu/
13 See http://www.expeeronline.eu/
2. Requirements and Tools
The aim of EnvThes is to provide a list of significant terms and their definitions used in the domain of
long term ecological research, monitoring and experiments. When establishing such a thesaurus, two
slightly contradicting tasks have to be fulfilled. On the one hand, the vocabulary has to cover the broad and
dynamic range of research, which means that it must include very specific terms which may
change fast with the development of science and the fields it addresses. On the other hand, the
vocabulary must guarantee continuity, and widely used and rather stable existing vocabularies shall neither be neglected nor reinvented. The thesaurus thus has to achieve stability and flexibility at the same
time.
Based on these basic requirements different special requirements can be derived in order to provide a
useful semantic tool for the community of users:
Completeness – The concepts from EnvThes are used to annotate metadata with keywords and therefore
support discovery and interpretation of dataset metadata. The thesaurus therefore needs to cover concepts
of the related domains with a given level of detail. In addition to the keywords the common vocabulary
shall provide the basis for semantic interoperability of exchanged information and knowledge, related to
the observed entities, parameters, methods, campaigns, data and metadata. Therefore, EnvThes contains concepts
in addition to those which are directly used as metadata keywords.
Usability – Two features are highly desired when using the concepts of this common vocabulary: first,
an easy way to take a concept into any document (MS-Word document, HTML page, LaTeX document, Excel spreadsheet, PowerPoint presentation) and second, a possibility to resolve the inserted term,
which means to follow the way back to the thesaurus, where additional information can be found. Within
the first version of data collection, the terms shall be entered with simple “copy and paste procedures”,
but, seen from the point of view of EnvThes, the establishment of links is as simple as the insertion of text
and can be done from within any document which can handle hyperlinks.
Accessibility – Using a controlled vocabulary like EnvThes as a common semantic basis means that it
must be accessible to any community member. Whereas open access for use is the first level of exchange, distributed maintenance and administration should also be possible.
Multilingualism – Another requirement in an international network is to provide multilingualism, even
though English is established as the main language for the use and definition of concepts within the LTER
community. Multilingualism is supported by providing translations of the concepts into native languages
occurring in the project context. However, a decision was made that EnvThes has a primary language:
English, which means that every term is represented in English and that the labels in other languages are
translations and may or may not be present, according to the availability of resources for translations. So
far there are more non-translated terms than translated ones.
Support data – Within the LTER datasets a lot of data are nominal. An important need with nominal data is that values are taken from common code lists, which therefore have to be accessible by the whole
community. A thesaurus is a good repository for such code lists.
Compliance with standards – In order to allow for future cooperation and collaboration with other communities, existing standards and best practices should be followed. The W3C standard SKOS/RDF (Isaac et
al. 2009, Klyne et al. 2004, Miles et al. 2009) has reached a high level of acceptance within the thesaurus
community. Therefore, a large number of tools fulfilling this standard are available. As interlinking thesauri is important for the harmonisation of controlled vocabularies, tools with a LinkedData interface are
very common.
Therefore, the solution for the management of semantics chosen by LTER Europe is the establishment
and use of a SKOS/RDF14 based thesaurus. Within the EnvEurope and ExpeER projects the PoolParty15
thesaurus management tool is used. The resulting thesaurus is called EnvThes and exists in two instances:
a development instance for the work in progress, containing the latest set of terms which are not yet regarded as stable (http://vocabs.lter-europe.net/EnvThes3.html), and a production instance which is stable and
ready for citations (http://vocabs.lter-europe.net/EnvThes.html).
14 see http://www.w3.org/2004/02/skos/
15 see http://poolparty.biz/
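To make the multilingualism requirement more concrete, the following minimal Python sketch models a concept whose English label is mandatory and whose translations are optional, and writes it out in Turtle using the SKOS properties skos:prefLabel and skos:definition. The class name, the example URI and the German label are assumptions made for illustration only; they are not taken from EnvThes or PoolParty.

```python
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"


class Concept:
    """A thesaurus concept with English as mandatory primary language."""

    def __init__(self, uri, pref_labels, definitions=None):
        if "en" not in pref_labels:
            raise ValueError("every concept needs an English label")
        self.uri = uri
        self.pref_labels = dict(pref_labels)        # language tag -> label
        self.definitions = dict(definitions or {})  # language tag -> text

    def label(self, lang="en"):
        """Return the label in `lang`, falling back to English."""
        return self.pref_labels.get(lang, self.pref_labels["en"])

    def to_turtle(self):
        """Serialise as Turtle (sketch only: quotes in texts are not escaped)."""
        parts = [f"<{self.uri}> a skos:Concept"]
        for lang, text in sorted(self.pref_labels.items()):
            parts.append(f'skos:prefLabel "{text}"@{lang}')
        for lang, text in sorted(self.definitions.items()):
            parts.append(f'skos:definition "{text}"@{lang}')
        return f"@prefix skos: <{SKOS_NS}> .\n\n" + " ;\n    ".join(parts) + " .\n"


# Hypothetical concept; URI and labels are illustrative, not real EnvThes data.
concept = Concept(
    "http://example.org/thesaurus/soil_water",
    {"en": "soil water", "de": "Bodenwasser"},
)
print(concept.label("de"))   # translated label is available
print(concept.label("hu"))   # no Hungarian label: falls back to English
print(concept.to_turtle())
```

The fallback in `label` mirrors the decision that translations may or may not be present, while the check in the constructor mirrors the rule that every term is represented in English.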
3. The elements of EnvThes
Extending existing controlled vocabularies
EnvThes is based on existing controlled vocabularies using the US LTER Controlled Vocabulary16 as stable backbone. Missing concepts are added using a collaborative approach to provide a comprehensive list
of concepts and definitions. The needs for new concepts arising from the addressed scientific domains are
analysed and existing controlled vocabularies from different domains are searched to avoid redefinitions.
When a term is found in an existing vocabulary, it is entered into EnvThes and a link to the source is provided to document its provenance.
For EnvThes, the inclusion of the following foreign controlled vocabularies was decided:
• the INSPIRE spatial data themes17,
• the EUNIS habitat list18 (down to level 3),
• list of units, based on the NASA units19.
In addition, the inclusion of organismic top level categories (down to family level) of the Catalogue of
Life20 was decided in order to ensure agreed concepts for the taxonomic coverage of the data. However, users
are encouraged to use the sources of the Catalogue of Life down to species level.
Figure 1: Components of EnvThes (US LTER controlled vocabulary as backbone, Catalogue of Life, (NASA – Sweet) units, EUNIS Habitats L1 – L3, INSPIRE spatial data themes, special units for ecology, EnvEurope parameters, ExpeER parameters)
Only concepts which cannot be found in the US-LTER Controlled Vocabulary or in other existing thesauri are added and defined. The concepts already defined in other, rather slowly changing vocabularies,
together with the links to those vocabularies, form the stable core of EnvThes, and the allowed extensions provide
the flexibility.
16 see http://im.lternet.edu/VocabTOR
17 see http://www.eionet.europa.eu/gemet/inspire_themes
18 see http://eunis.eea.europa.eu/
19 see http://www.nasa.gov/offices/oce/functions/standards/isu.html and http://www.qudt.org/qudt/owl/1.0.0/unit/
20 see http://www.catalogueoflife.org/
The example in Table 1 shows this principle of extending a stable core: we imported the NASA SWEET SI
units and units outside SI, and established a link to the origins, in order to be compliant with a stable de
facto standard for units. However, some units needed by the LTER community are missing in
the “NASA units” and had to be added. We used the SKOS scope notes to annotate which units are
NASA units and which are EnvThes units.
term        NASAnote          Enote
FNU         NASA SWEET unit   EnvThes unit
MOA         NASA SWEET unit   EnvThes unit
MYA         NASA SWEET unit   EnvThes unit
Np                            EnvThes unit
SOA         NASA SWEET unit   EnvThes unit
`                             EnvThes unit
century     NASA SWEET unit   EnvThes unit
cm          NASA SWEET unit   EnvThes unit
d           NASA SWEET unit   EnvThes unit
db          NASA SWEET unit   EnvThes unit
decade      NASA SWEET unit   EnvThes unit
eV                            EnvThes unit
g/(ha·yr)                     EnvThes unit
Table 1: Extension of NASA units
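The use of scope notes to separate the imported core from the extensions can be sketched in a few lines of Python. The data below is a small subset of the rows of Table 1; the dictionary layout and the function name are assumptions made for this illustration and do not reflect how PoolParty stores scope notes.

```python
NASA_NOTE = "NASA SWEET unit"
ENVTHES_NOTE = "EnvThes unit"

# Unit symbol -> set of scope notes (subset of Table 1, for illustration).
unit_scope_notes = {
    "cm": {NASA_NOTE, ENVTHES_NOTE},
    "d": {NASA_NOTE, ENVTHES_NOTE},
    "decade": {NASA_NOTE, ENVTHES_NOTE},
    "eV": {ENVTHES_NOTE},
    "g/(ha·yr)": {ENVTHES_NOTE},
}


def envthes_extensions(units):
    """Return the units that carry the EnvThes note but not the NASA note."""
    return sorted(
        symbol
        for symbol, notes in units.items()
        if ENVTHES_NOTE in notes and NASA_NOTE not in notes
    )


def imported_units(units):
    """Return the units that were taken over from the NASA units."""
    return sorted(symbol for symbol, notes in units.items() if NASA_NOTE in notes)


print(envthes_extensions(unit_scope_notes))  # ['eV', 'g/(ha·yr)']
print(imported_units(unit_scope_notes))      # ['cm', 'd', 'decade']
```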
3.1 Definitions
It is regarded as good practice within the semantic community to define concepts in a human readable form
in addition to all the other structuring elements. EnvThes follows this practice and allows definitions
in all the languages allowed for concepts. Just as EnvThes assumes a primary language, English, for terms,
English is also the primary definition language. That means that if a concept has any definition,
it must have an English definition, and it may have definitions in other languages.
3.2 Scope notes
EnvThes uses the scope notes, offered by SKOS/RDF, to define special usages of concepts:
• The concepts which are used by the US-LTER community for keywords of their metadata
• The concepts which are very broad and therefore should just be used for annotation of metadata
and not for data annotation.
4. Usage of EnvThes
The commonly defined semantics are used on two different levels: first, to attach keywords to the metadata
description of datasets, and second, to annotate data with concepts from EnvThes (see Figure 2). Both levels
are needed to allow for semantically enhanced data management and exchange.
EnvEurope and ExpeER metadata are managed using a Drupal based application, DEIMS (Drupal
Ecological Information Management System21), and EnvThes is integrated as a taxonomy. The keywords for
the dataset metadata are taken from EnvThes. In the version of Drupal currently used, EnvThes has to
be physically imported, as no live link to a SKOS/RDF thesaurus can be used in Drupal 6. With the migration of DEIMS to Drupal 7 in the upcoming month this direct link will be established, thus avoiding
the need for updates and enforcing good, consistent and comprehensive versioning within EnvThes.
Figure 2: Usage of EnvThes (diagram labels: DEIMS, Any Document, EnvThes, EnvEurope Data Repository; connections labelled Term and Link)
In addition to the keywords, EnvThes is used to annotate data with defined concepts. This is done, for example, for the observed medium, defined as the ecosystem compartment which is investigated, e.g. soil water, which can be linked to the appropriate concept in EnvThes. So even if different
names for the same concept are used in the original datasets, a semantically harmonised view on the data can be
established by using EnvThes for semantic transformation. Within the EnvEurope project a LinkedData
(Berners-Lee 2006, Bizer et al. 2011) service was established using the D2RQ server22. This implements
links to EnvThes which can be followed by humans and machines, as D2RQ offers a web browser interface, a LinkedData interface and a SPARQL endpoint.
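The idea of a semantic transformation from local names to one shared concept can be illustrated with a short Python sketch. The concept URI, the local names and the record layout are invented for this example; in the EnvEurope repository the links are provided by the D2RQ based service, not by code of this form.

```python
# Hypothetical concept URI, used only for illustration.
SOIL_WATER = "http://example.org/thesaurus/soil_water"

# Local names as they might appear in different source datasets (assumed).
LOCAL_TO_CONCEPT = {
    "soil water": SOIL_WATER,
    "soil_water": SOIL_WATER,
    "bodenwasser": SOIL_WATER,
}


def annotate(records, mapping):
    """Add a 'medium_concept' URI to each record; None if no concept matches."""
    annotated = []
    for record in records:
        key = record["medium"].strip().lower()
        annotated.append({**record, "medium_concept": mapping.get(key)})
    return annotated


records = [
    {"site": "A", "medium": "Soil water", "value": 4.2},
    {"site": "B", "medium": "Bodenwasser", "value": 3.9},
    {"site": "C", "medium": "throughfall", "value": 1.1},
]

for row in annotate(records, LOCAL_TO_CONCEPT):
    print(row["site"], row["medium"], "->", row["medium_concept"])
```

Records from sites A and B end up with the same concept URI although they use different local names, while the unmapped medium of site C stays visible as a gap that needs a new mapping.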
As EnvThes is LinkedData based, with a unique URL for every concept, pointers to the concepts can
also be inserted into any document supporting hyperlinks. To test this, follow the link
http://vocabs.lter-europe.net/EnvThes3/USLterCV_110 when viewing this document with a tool supporting
hyperlinks.
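A machine can follow the same concept URL that a human follows in a browser. The sketch below only builds the HTTP request and does not send it. It assumes that the server selects the representation by content negotiation on the Accept header, which is a common LinkedData convention but is not confirmed here for the EnvThes server; the function name is hypothetical.

```python
from urllib.request import Request

CONCEPT_URL = "http://vocabs.lter-europe.net/EnvThes3/USLterCV_110"


def build_request(url, machine_readable=True):
    """Build (but do not send) a request for a concept URL.

    Assumption: the server honours the Accept header and returns RDF/XML
    for machines and HTML for browsers.
    """
    accept = "application/rdf+xml" if machine_readable else "text/html"
    return Request(url, headers={"Accept": accept})


request = build_request(CONCEPT_URL)
print(request.full_url)
print(request.get_header("Accept"))
# Passing the request to urllib.request.urlopen would send it over the network.
```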
21 See http://data.lter-europe.net/deims/
22 See http://d2rq.org/
5. Interlinked Thesauri
Semantic interoperability with other groups working within or close to the long term ecological research
domain is a highly desired feature. But large and expensive institutes or substantial projects for the harmonization of semantics are far beyond the resources of the domain projects and the contributing scientists. The reuse of existing concepts, as described above, is a good method to avoid duplicated effort in the
development of controlled vocabularies, and is one ingredient for avoiding semantic silos.
The establishment of links from concepts within EnvThes to concepts within other existing thesauri
is the second ingredient. It is important that those links can not only be established but can also be resolved (followed) at runtime. The LinkedData23 (Berners-Lee 2006, Bizer et al. 2011) technology,
with URLs as identifiers, supports both the establishment and the resolution of links. A link between a local concept and a concept in a foreign thesaurus is established by including a pointer to the foreign
URL, in the same way as tables are linked within a relational database. When reading the data, the pointer can
be followed by machines and humans, similar to the way a hyperlink can be followed in the World
Wide Web or a foreign key can be resolved in a relational database.
Figure 3: Interlinked Thesauri EnvThes, Eurovoc, Agrovoc, Wikipedia. For the example concept “pollution”, the linked sources offer: AGROVOC, definitions and translations, broader and narrower terms, as seen by FAO; Wikipedia, definitions, designs, pictures; EUROVOC, definitions and translations, broader and narrower terms, as seen by the EU Commission; GEMET, definitions and translations, broader and narrower terms, as seen by EEA.
SKOS/RDF allows five different relations to model these links: exact match, close match, narrower
match, broader match and related match. Within EnvThes, exact match is the most used link, as close
match and related match are too fuzzy and the establishment of narrower match and broader match is very
labour intensive and can be supported only to a small extent by automated processes.
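In SKOS these five relations are the mapping properties skos:exactMatch, skos:closeMatch, skos:narrowMatch, skos:broadMatch and skos:relatedMatch. The following Python sketch writes such a link as one N-Triples line and rejects any other relation name. Both URIs in the usage example are placeholders and not real EnvThes or GEMET identifiers, and the function name is hypothetical.

```python
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"

# The five SKOS mapping properties.
MAPPING_PROPERTIES = (
    "exactMatch",
    "closeMatch",
    "narrowMatch",
    "broadMatch",
    "relatedMatch",
)


def mapping_triple(local_uri, relation, foreign_uri):
    """Return one N-Triples line linking a local concept to a foreign one."""
    if relation not in MAPPING_PROPERTIES:
        raise ValueError(f"not a SKOS mapping property: {relation}")
    return f"<{local_uri}> <{SKOS_NS}{relation}> <{foreign_uri}> ."


# Placeholder URIs for illustration only.
print(
    mapping_triple(
        "http://example.org/thesaurus/pollution",
        "exactMatch",
        "http://example.org/other-thesaurus/pollution",
    )
)
```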
Links to controlled vocabularies like GEMET (see http://www.eionet.europa.eu/gemet), EarTh (see
http://uta.iia.cnr.it/earth_eng.htm), AgroVoc (see http://aims.fao.org/standards/agrovoc/about), EuroVoc
(see http://eurovoc.europa.eu/drupal/), EUNIS habitat list (see http://eunis.eea.europa.eu/habitats-codebrowser.jsp), Wikipedia (see http://en.wikipedia.org/wiki/Main_Page), DBpedia (work in progress), as
well as NASA units (work in progress, see: http://sweet.jpl.nasa.gov/ontology/ and
http://www.qudt.org/qudt/owl/1.0.0/quantity/) are established and the degree of concept match (e.g. exact match, close match) is defined, thus creating a web of interlinked thesauri.
23 see http://linkeddata.org/
The user of EnvThes can see whether a concept of interest is linked to other thesauri, can thus estimate whether the term is widely accepted or not, and can search for further details within the foreign vocabulary by following the link, perhaps finding translations, related terms, images and explanations.
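Such a lookup could also be done by a program through a SPARQL endpoint. The sketch below builds a SPARQL query for all mapping links of one concept and encodes it as a GET request URL in the style of the SPARQL protocol. The endpoint address and the concept URI are placeholders, since the actual address of the EnvThes endpoint is not given here, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Placeholder endpoint; the real SPARQL endpoint address is an assumption.
ENDPOINT = "http://example.org/sparql"


def mapping_query(concept_uri):
    """Build a SPARQL query listing all SKOS mapping links of one concept."""
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?relation ?target WHERE {\n"
        f"  <{concept_uri}> ?relation ?target .\n"
        "  FILTER (?relation IN (skos:exactMatch, skos:closeMatch,\n"
        "                        skos:narrowMatch, skos:broadMatch,\n"
        "                        skos:relatedMatch))\n"
        "}"
    )


def query_url(endpoint, concept_uri):
    """Encode the query as a GET request URL (query passed as 'query' parameter)."""
    return endpoint + "?" + urlencode({"query": mapping_query(concept_uri)})


# Placeholder concept URI for illustration only.
print(mapping_query("http://example.org/thesaurus/pollution"))
print(query_url(ENDPOINT, "http://example.org/thesaurus/pollution"))
```

A concept for which the query returns several exact matches in independent thesauri could then be regarded as widely accepted, in the sense described above.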
Unfortunately, DEIMS (based on Drupal 6) so far does not allow for multiple controlled vocabularies.
Therefore all the mentioned sources have to be integrated into one thesaurus, even when the original vocabulary is already LinkedData based, which means that for now we have to create duplicates of concepts from
foreign vocabularies. We do, however, keep the greatest part of the unique identifier and provide the link to the
original concept, and can therefore eliminate the duplicates as soon as all our implemented tools allow that.
6. Outlook – squaring the circle
The integration of data from different domains raises two needs which at first glance contradict
each other. First, for the description of the data a common and rather stable vocabulary is needed in order
to support semantic interoperability. Second, at the same time extensions to the vocabulary are needed in
order to allow the description of new scientific fields and new technologies. A solution is a stable set of
related core concepts and dynamic models extending them. It is highly desirable that this is done using an
architecture of dispersed machines running different kinds of software, just having their interfaces in
common: a large common virtual thesaurus composed of several sub-thesauri, a web of controlled vocabularies. The realisation of controlled vocabularies with a LinkedData interface is a good method to put
these requirements into practice. From the point of view of long term ecological research and monitoring, GEMET, EarTh, EuroVoc, and EUNIS habitats represent rather stable vocabularies linked to EnvThes. Wikipedia offers the citizen scientist’s view and AgroVoc is a dynamic vocabulary for the related
agricultural domain. The same system of interlinks could also help to harmonize ontologies step by step,
such as those registered in the very valuable BioPortal (see: http://bioportal.bioontology.org/).
The direct use of EnvThes within the project frame is related to the discovery of data and services.
Unambiguous and well defined keywords are needed to annotate metadata in order to be able to search
consistently for data. In addition to that, the use of common concepts, e.g. within data generation, enhances
the semantic interoperability and usability of the data. EnvThes therefore is an important step towards an integrated
network of information (Bizer et al. 2009) and data resulting from the LTER Europe network as well as
the EnvEurope project.
There is no doubt that we are still far away from a web of LTER data, more for data policy reasons
than for technical reasons, but the first step has been taken with EnvThes, which at the same time is the foundation for
the establishment of links to other domains.
7. Bibliography
Adamescu, M., Peterseil, J., Dactu, S., Cazacu, C., Vadineanu, A. (2010). Elements for the design of a
General Ecological Database. In: Maurer, I. and Tochtermann, K. (eds.) Information and Communication Technologies for Biodiversity and Agriculture. Shaker Verlag, Aachen. pp. 49-66.
Berners-Lee, T. (2006), Design Issues: Linked Data. Online
http://www.w3.org/DesignIssues/LinkedData.html
Bizer, C. (et al) (2011) Linked Data: Evolving the Web into a Global Data Space, (1st edition). Synthesis
Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool. (Full text
available for free at http://linkeddatabook.com/editions/1.0/).
Bizer, C., Heath, T., Berners-Lee, T. (2009), Linked Data - The Story So Far. International Journal on
Semantic Web and Information Systems (IJSWIS), 5(3), 1-22. DOI: 10.4018/jswis.2009081901
European Environment Agency (2009) European Nature Information System (EUNIS).
http://eunis.eea.europa.eu/about.jsp and http://eunis.eea.europa.eu/habitats.jsp
European Environment Agency (1999) GEMET, GEneral Multilingual Environmental Thesaurus
http://www.eionet.europa.eu/gemet/
European Union (2013) EuroVoc Multilingual Thesaurus of the European Union.
http://eurovoc.europa.eu/
Food and Agriculture Organization of the United Nations (2013) AGROVOC.
http://aims.fao.org/standards/agrovoc/about
Gruber, T.R. (1993). Toward principles for the design of ontologies used for knowledge sharing. Originally in N. Guarino and R. Poli, (Eds.), International Workshop on Formal Ontology, Padova, Italy.
Revised August 1993. International Journal of Human-Computer Studies 43(5-6):907-928.
Isaac, A. (et al.) (2009) Simple Knowledge Organization System Primer. W3C Working Group Note 18
August 2009. http://www.w3.org/TR/skos-primer/
Klyne G., Carroll J.J. (2004) Resource Description Framework (RDF): Concepts and Abstract Syntax.
http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
LTER Controlled Vocabulary Working Group (2011) Terms of Reference.
http://im.lternet.edu/VocabTOR
Miles, A., Bechhofer S. (2009) SKOS Simple Knowledge Organization System Reference.
http://www.w3.org/TR/2009/REC-skos-reference-20090818/
Mirtl, M., Krauze, K. (2007). Developing a new Strategy for Environmental Research and Monitoring:
The European Long-Term Ecological Research Network’s (LTER Europe) role and perspective. In:
Chmielewski, T.J. (Ed.) Nature Conservation Management: From Idea to practical Results. Lublin –
Lodz – Helsinki – Aarhus. pp. 36-52.
Mirtl, M., Orenstein, D., Wildenberg, M., Peterseil, J., Frenzel, M. (2013). Development of LTSER Platforms in LTER Europe: Challenges and Experiences in implementing Place-Based Long Term Socio-ecological Research in Selected Regions. In: Singh, S.J., Haberl, H., Chertow, M., Mirtl, M.
and Schmid, M. (Eds.). Long Term Socio-Ecological Research. Studies in Society-Nature Interactions Across Spatial and Temporal Scales. Springer, Dordrecht Heidelberg New York London. pp.
409-442.
National Aeronautics and Space Administration (2011) Semantic Web for Earth and Environmental Terminology (SWEET) http://sweet.jpl.nasa.gov/ontology/ and
http://www.qudt.org/qudt/owl/1.0.0/quantity/
National Centers for Biomedical Computing (2005) BioPortal. http://bioportal.bioontology.org/
PoolParty http://www.poolparty.biz/
Schentz, H., Schleidt, K., Lane, M. (2005). Requirements for an IT Framework for ALTER-Net – Results
from the Requirement Analysis conducted in Work Package I6. ALTER-Net Deliverable
WPI6_2005_01.
Species 2000 Secretariat (2013) Catalogue of life, Annual Checklist.
http://www.catalogueoflife.org/colwebsite/