EnviroInfo 2013: Environmental Informatics and Renewable Energies
Copyright 2013 Shaker Verlag, Aachen, ISBN: 978-3-8440-1676-5

EnvThes - interlinked thesaurus for long term ecological research, monitoring, and experiments

Herbert Schentz, Johannes Peterseil [1], Nic Bertrand [2] (contributors: Mauro Bastiano [3], David Blankman [4], Mark Frenzel [5], Ulf Grandin [6], Miki Kertesz [7], Alessandro Oggioni [8])

Abstract

Long term ecological research and monitoring produce a vast amount of data describing environmental characteristics, drivers and pressures. Despite efforts to harmonise methods and observation designs within LTER Europe and related networks, local data management often relies on different management solutions and on varying sets of terms and concepts to describe the data. Therefore, semantic as well as syntactic harmonisation is needed to enable data exchange for cross site and cross domain analysis. The use of a common controlled vocabulary or thesaurus is the first step towards semantic harmonisation across a network of sites. Within the European scale projects EnvEurope (LIFE08 ENV/IT/000399) and ExpeER, the development of a controlled vocabulary for long term ecological research and monitoring was started in order to provide a common set of terms and concepts used in this domain. It is called EnvThes and is implemented as a SKOS/RDF based thesaurus using PoolParty as management tool. PoolParty offers a web browser based viewing and editing interface, as well as a linkedData interface and a SPARQL endpoint. EnvThes is based on existing vocabularies, such as the US-LTER controlled vocabulary, the EUNIS habitat list, the Catalogue of Life and the NASA units, extended by concepts which are additionally needed.
Links to the original sources and to matching concepts within commonly used thesauri like EUROVOC, GEMET, AGROVOC or EarTh are established, thus overcoming the islands of conceptual models developed for different purposes. These links to slowly changing vocabularies, like EUROVOC or GEMET, are the anchors for stable semantics; the LTER specific concepts, for which no linkable equivalent can be found in those thesauri, are the building blocks of the flexible conceptual model needed by science. EnvThes provides keywords for the metadata system DEIMS and is used for semantic annotation of the data in the EnvEurope data repository. It can additionally be used as a source of terms and links to concepts for any sort of document (for example a Microsoft Excel spreadsheet, Word document, PowerPoint presentation or LaTeX document) used in the LTER context.

[1] Umweltbundesamt GmbH, Spittelauer Lände 5, A-1090 Vienna
[2] CEH Computing Services, Oxford Centre for Ecology and Hydrology, Mansfield Road, Oxford, OX1 3SR
[3] ISMAR CNR, Castello 1364/a, 30122 Venezia, Italy
[4] ILTER Information Management Committee, Information Management, Israel LTER
[5] Helmholtz-Zentrum für Umweltforschung – UFZ, Permoserstraße 15, 04318 Leipzig, Germany
[6] Swedish University of Agricultural Sciences, Department of Aquatic Sciences and Assessment, SE-750 07 Uppsala, Sweden
[7] Institute of Ecology and Botany, MTA Centre for Ecological Research, Vácrátót, Hungary
[8] CNR-IREA, Via Bassini, 15 - 20133 Milano, Italy

1. Introduction

The long term ecological monitoring and research network (LTER) in Europe [9] provides a vast amount of data on drivers and impacts of environmental change (Mirtl & Krauze 2007, Mirtl et al. 2013). A variety of ecosystems are monitored, touching research topics of several disciplines, such as biodiversity research, soil science, air pollution measurement, ground water and surface water analysis, meteorology, hydrology and others.
In addition to the variety of disciplines involved, data management and analysis are fragmented and carried out by several geographically dispersed institutes which are responsible for the monitored sites. Because of this distributed responsibility, the implemented data management systems range from collections of simple spreadsheet tables and distributed domain specific databases to centralized data management systems. Within those systems data are organized using different data models and related vocabularies, taxonomies and code lists. Existing semantic de facto standards like the EUNIS habitat list [10], the Catalogue of Life, or the World Reference Base for Soil Resources [11] are often not considered, as their use depends on the autonomous decision of each data manager and provider. Overcoming the described semantic heterogeneity, which results from the variety of disciplines and institutes, is crucial for the intended exchange of metadata and data. The establishment and use of a common set of exactly defined concepts is a first step towards semantic harmonization; using those concepts to annotate data and metadata, and finally using them directly in the local data management systems, should be the subsequent activities (Schentz et al. 2005, Adamescu et al. 2010).

As the establishment of semantic interoperability is specific neither to long term ecological research nor to ecology, appropriate methods and tools have been developed across the world and across disciplines (Gruber 1993). Controlled vocabularies and thesauri are widely accepted good practice for establishing the foundation for semantic harmonization, a precondition for reuse and sharing of data. They were established for that purpose long before the first computer was built. Printed dictionaries for medicine and pharmacy, as well as taxonomies, for example of species, soils or habitats, were created centuries ago.
Nowadays, a vast number of controlled vocabularies and thesauri exist and are digitally available.

LTER Europe and related projects like EnvEurope (LIFE08 ENV/IT/000399 [12]) or ExpeER [13] are working towards a harmonisation of methods and data flows for long term ecological research, monitoring and experiments. A common conceptual model and definition of terms is therefore crucial in order to allow not only for a syntactically but also a semantically harmonised exchange of metadata and data. Because of the multidisciplinary character of long term ecological research, monitoring and experiments, existing thesauri mostly cover only a part of the conceptual model needed for this domain. Therefore, within the projects EnvEurope and ExpeER, the decision was taken to build a domain thesaurus to be used as the semantic base for metadata and data exchange. It should be built on existing controlled vocabularies and extended wherever needed. The US LTER community has established a controlled vocabulary for long term ecological research and monitoring (LTER Controlled Vocabulary Working Group 2011), which should serve as the backbone for EnvThes.

[9] See http://www.lter-europe.net/
[10] See http://eunis.eea.europa.eu/habitats.jsp
[11] See http://www.fao.org/nr/land/soils/soil/wrb-documents/en/
[12] See http://www.enveurope.eu/
[13] See http://www.expeeronline.eu/

2. Requirements and Tools

The aim of EnvThes is to provide a list of significant terms and their definitions used in the domain of long term ecological research, monitoring and experiments. When establishing such a thesaurus, two slightly contradicting tasks have to be fulfilled.
On the one hand, the vocabulary has to cover the broad and dynamic range of research, which means that it must include very specific terms which may change quickly as science and its fields develop. On the other hand, the vocabulary must guarantee continuity, and widely used and rather stable existing vocabularies shall neither be neglected nor reinvented. The thesaurus thus has to achieve stability and flexibility at the same time. From these basic requirements, more specific requirements can be derived in order to provide a useful semantic tool for the community of users:

Completeness – The concepts from EnvThes are used to annotate metadata with keywords and therefore support discovery and interpretation of dataset metadata. The thesaurus therefore needs to cover the concepts of the related domains at a given level of detail. In addition to providing keywords, the common vocabulary shall provide the basis for semantic interoperability of exchanged information and knowledge related to the observed entities, parameters, methods, campaigns, data and metadata. EnvThes therefore contains concepts in addition to those which are directly used as metadata keywords.

Usability – Two features are highly desired when using the concepts of this common vocabulary: first, an easy way to insert a concept into any document (MS Word document, HTML page, LaTeX document, Excel spreadsheet, PowerPoint presentation) and, second, a possibility to resolve the inserted term, that is, to follow the way back to the thesaurus, where additional information can be found. In the first version of data collection, terms are to be entered by simple “copy and paste procedures”; from the point of view of EnvThes, however, establishing a link is as simple as inserting text and can be done from within any document that supports hyperlinks.
Accessibility – Using a controlled vocabulary like EnvThes as a common semantic basis means that it must be accessible to every community member. Open access for use is the first level of exchange; beyond that, distributed maintenance and administration should be possible.

Multilingualism – Another requirement in an international network is multilingualism, even though English is established as the main language for the use and definition of concepts within the LTER community. Multilingualism is supported by providing translations of the concepts into the native languages occurring in the project context. However, a decision was made that EnvThes has a primary language, English. This means that every term is represented in English and that the labels in other languages are translations, which may or may not be present depending on the resources available for translation. So far there are more non-translated terms than translated ones.

Support data – Many of the data within the LTER datasets are nominal. Nominal values need to be taken from common code lists, which therefore have to be accessible to the whole community. A thesaurus is a good repository for such code lists.

Compliant to standards – In order to allow for future cooperation and collaboration with other communities, existing standards and best practices should be followed. The W3C standard SKOS/RDF (Isaac et al. 2009, Klyne et al. 2004, Miles et al. 2009) has reached a high level of acceptance within the thesaurus community, and a large number of tools supporting this standard are therefore available. As interlinking thesauri is important for the harmonisation of controlled vocabularies, tools with a LinkedData interface are very common. The solution chosen by LTER Europe for the management of semantics is therefore the establishment and use of a SKOS/RDF [14] based thesaurus.
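The primary language rule described under Multilingualism can be illustrated with a minimal sketch. It is not the PoolParty implementation: the `Concept` class, its methods, the example URI and the German label are assumptions made purely for illustration. The sketch requires an English label for every concept, falls back to English when a translation is missing, and writes the labels as `skos:prefLabel` statements in Turtle.

```python
from dataclasses import dataclass, field

# Namespace declaration needed by the Turtle output below.
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"


@dataclass
class Concept:
    """Hypothetical in-memory model of one thesaurus concept."""
    uri: str
    labels: dict[str, str] = field(default_factory=dict)  # language code -> label

    def __post_init__(self) -> None:
        # Primary language rule: every concept must carry an English label.
        if "en" not in self.labels:
            raise ValueError(f"{self.uri}: missing English label")

    def label(self, lang: str) -> str:
        # Translations are optional, so fall back to English when absent.
        return self.labels.get(lang, self.labels["en"])

    def to_turtle(self) -> str:
        # Simplified serialisation: labels containing quotes are not escaped.
        lines = [f"<{self.uri}> a skos:Concept"]
        for lang, text in sorted(self.labels.items()):
            lines.append(f'    skos:prefLabel "{text}"@{lang}')
        return SKOS_PREFIX + " ;\n".join(lines) + " ."


# Illustrative URI and labels, not taken from EnvThes.
soil_water = Concept(
    "http://example.org/envthes/soil-water",
    {"en": "soil water", "de": "Bodenwasser"},
)
print(soil_water.label("hu"))  # no Hungarian label, so the English one is used
print(soil_water.to_turtle())
```

Enforcing the rule when a concept is created, rather than when it is read, keeps the fallback in `label` trivially safe.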
Within the EnvEurope and ExpeER projects the PoolParty [15] thesaurus management tool is used. The resulting thesaurus is called EnvThes and exists in two instances: a development instance for work in progress, containing the latest set of terms which are not yet regarded as stable (http://vocabs.lter-europe.net/EnvThes3.html), and a production instance which is stable and ready for citation (http://vocabs.lter-europe.net/EnvThes.html).

[14] see http://www.w3.org/2004/02/skos/
[15] see http://poolparty.biz/

3. The elements of EnvThes

Extending existing controlled vocabularies

EnvThes is based on existing controlled vocabularies, using the US LTER Controlled Vocabulary [16] as stable backbone. Missing concepts are added using a collaborative approach to provide a comprehensive list of concepts and definitions. The needs for new concepts arising from the addressed scientific domains are analysed, and existing controlled vocabularies from different domains are searched to avoid redefinitions. When a term is found in an existing vocabulary, it is entered into EnvThes and a link to the source is provided to record its provenance. For EnvThes, the inclusion of the following external controlled vocabularies was decided:
• the INSPIRE spatial data themes [17],
• the EUNIS habitat list [18] (down to level 3),
• a list of units, based on the NASA units [19].
In addition, the inclusion of the top level organism categories (down to family level) of the Catalogue of Life [20] was decided, in order to provide agreed concepts for the taxonomic coverage of the data. However, users are encouraged to use the Catalogue of Life itself as the source down to species level.
Figure 1 Components of EnvThes (US LTER controlled vocabulary, Catalogue of Life backbone, (NASA SWEET) units, EUNIS Habitats L1 to L3, INSPIRE spatial data themes, special units for ecology, EnvEurope parameters, ExpeER parameters)

Only concepts which cannot be found in the US-LTER controlled vocabulary or in other existing thesauri are added and defined. The concepts already defined in other, rather slowly changing vocabularies, together with the links to those vocabularies, form the stable core of EnvThes, and the allowed extensions provide the flexibility.

[16] see http://im.lternet.edu/VocabTOR
[17] see http://www.eionet.europa.eu/gemet/inspire_themes
[18] see http://eunis.eea.europa.eu/
[19] see http://www.nasa.gov/offices/oce/functions/standards/isu.html and http://www.qudt.org/qudt/owl/1.0.0/unit/
[20] see http://www.catalogueoflife.org/

The example in Table 1 shows this principle of extending a stable core: we imported the NASA SWEET SI units and units outside SI and established a link to the origins, in order to be compliant with a stable de facto standard for units. However, some units needed by the LTER community are missing from the “NASA units” and had to be added. We used the SKOS scope notes to annotate which units are NASA units and which are EnvThes units.

Table 1 Extension of NASA units (columns: term, NASAnote, Enote; terms shown: FNU, MOA, MYA, Np, SOA, century, cm, d, db, decade, eV, g/(ha·yr); the notes take the values “NASA SWEET unit” and “EnvThes unit”)

3.1 Definitions

It is seen as good practice within the semantic community to define concepts in a human readable form in addition to all the other structuring elements.
EnvThes follows this good practice and allows definitions in all the languages allowed for concepts. Just as EnvThes assumes a primary language, English, there is also a primary definition language, English. That means that, if a concept has any definition at all, it must have an English definition, and it may have definitions in other languages.

3.2 Scope notes

EnvThes uses the scope notes offered by SKOS/RDF to define special usages of concepts:
• The concepts which are used by the US-LTER community as keywords for their metadata
• The concepts which are very broad and therefore should only be used for annotation of metadata and not for data annotation.

4. Usage of EnvThes

The commonly defined semantics are used on two different levels: first, to attach keywords to the metadata description of datasets and, second, to annotate data with concepts from EnvThes (see Figure 2). Both levels are needed to allow for semantically enhanced data management and exchange.

EnvEurope and ExpeER metadata are managed using a Drupal based application, DEIMS (Drupal Ecological Information Management System [21]), into which EnvThes is integrated as a taxonomy. The keywords for the dataset metadata are taken from EnvThes. In the version of Drupal currently used, EnvThes has to be physically imported, as no live link to a SKOS/RDF thesaurus can be used in Drupal 6. With the migration of DEIMS to Drupal 7 in the upcoming months this direct link will be established, thus avoiding the need for updates and enforcing good, consistent and comprehensive versioning within EnvThes.

Figure 2 Usage of EnvThes (diagram elements: EnvThes, DEIMS, any document, EnvEurope data repository; connector labels: Term, Link)

In addition to providing keywords, EnvThes is used to annotate data with defined concepts. This is done, for example, for the observed medium, defined as the ecosystem compartment which is investigated. This reflects e.g.
soil water, which can be linked to the appropriate concept in EnvThes. So even if different names for the same concept are used in the original datasets, a semantically harmonised view of the data can be established by using EnvThes for semantic transformation. Within the EnvEurope project a linkedData (Berners-Lee 2006, Bizer et al. 2011) service was established using the D2RQ server [22]. It implements links to EnvThes which can be followed by humans and machines, as D2RQ offers a web browser interface, a linkedData interface and a SPARQL endpoint. As EnvThes is linkedData based, with a unique URL for every concept, pointers to the concepts can also be inserted into any document supporting hyperlinks. To test this, follow the link http://vocabs.lter-europe.net/EnvThes3/USLterCV_110 when viewing this document with a tool supporting hyperlinks.

[21] See http://data.lter-europe.net/deims/
[22] See http://d2rq.org/

5. Interlinked Thesauri

Semantic interoperability with other groups working within or close to the long term ecological research domain is highly desirable. But large and expensive institutes or substantial projects for the harmonization of semantics are far beyond the resources of the domain projects and the contributing scientists. The reuse of existing concepts, as described above, is a good method to avoid parallel development of controlled vocabularies and is one ingredient for avoiding semantic silos. The establishment of links from concepts within EnvThes to concepts within other existing thesauri is the second ingredient. It is important that those links can not only be established but also resolved (followed) at runtime. The LinkedData [23] (Berners-Lee 2006, Bizer et al. 2011) technology, with URLs as identifiers, supports both the establishment and the resolution of links.
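How a machine and a human might follow the same concept URL can be sketched as follows. The URL is the one quoted above. The helper name `build_concept_request` is hypothetical, and it is only an assumption that the server answers content negotiation via the `Accept` header. The sketch prepares the request without sending it, so it makes no claim about what the server returns.

```python
from urllib.request import Request

# Concept URL quoted in the text; it may no longer resolve.
CONCEPT_URL = "http://vocabs.lter-europe.net/EnvThes3/USLterCV_110"


def build_concept_request(url: str, want_rdf: bool) -> Request:
    """Prepare (but do not send) a request for a concept URL.

    want_rdf=True asks for a machine readable RDF/XML representation,
    want_rdf=False asks for the HTML page a human would read.
    """
    accept = "application/rdf+xml" if want_rdf else "text/html"
    return Request(url, headers={"Accept": accept})


machine_request = build_concept_request(CONCEPT_URL, want_rdf=True)
human_request = build_concept_request(CONCEPT_URL, want_rdf=False)
print(machine_request.full_url, machine_request.get_header("Accept"))
print(human_request.full_url, human_request.get_header("Accept"))
```

Passing either request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access.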
Establishing a link between a local concept and a concept in a foreign thesaurus is done by including a pointer to the foreign URL, just as tables are linked within a relational database. When reading the data, the pointer can be followed by machines and humans, similar to the way a hyperlink can be followed in the World Wide Web or a foreign key can be resolved in a relational database.

[23] see http://linkeddata.org/

Figure 3 Interlinked Thesauri EnvThes, Eurovoc, Agrovoc, Wikipedia. The linked sources shown for the concept “pollution” are:
• AGROVOC: definitions and translations for “pollution”, broader and narrower terms, as seen by FAO
• Wikipedia: definitions, designs, pictures
• EUROVOC: definitions and translations for “pollution”, broader and narrower terms, as seen by the EU Commission
• GEMET: definitions and translations for “pollution”, broader and narrower terms, as seen by EEA

SKOS/RDF allows five different relations to model these links: _exact match, close match, narrower match, broader match_ and _related match_. Within EnvThes, exact match is the most used link, as close match and related match are too fuzzy and the establishment of narrower match and broader match is very labour intensive and can be supported by automated processes only to a small extent. Links to controlled vocabularies like GEMET (see http://www.eionet.europa.eu/gemet), EarTh (see http://uta.iia.cnr.it/earth_eng.htm), AgroVoc (see http://aims.fao.org/standards/agrovoc/about), EuroVoc (see http://eurovoc.europa.eu/drupal/), EUNIS habitat list (see http://eunis.eea.europa.eu/habitats-codebrowser.jsp), Wikipedia (see http://en.wikipedia.org/wiki/Main_Page), DBpedia (work in progress), as well as NASA units (work in progress, see: http://sweet.jpl.nasa.gov/ontology/ and http://www.qudt.org/qudt/owl/1.0.0/quantity/) are established and the degree of concept match (e.g. exact match, close match) is defined, thus creating a web of interlinked thesauri.
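The five relations can be sketched as plain subject, predicate, object triples. The SKOS namespace and the property names `exactMatch`, `closeMatch`, `narrowMatch`, `broadMatch` and `relatedMatch` come from the SKOS Reference (narrower and broader match are spelled `narrowMatch` and `broadMatch` there). The concept URIs and the function `mapped_concepts` are invented for illustration and are not taken from EnvThes or the linked thesauri.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
MAPPING_PROPERTIES = (
    "exactMatch", "closeMatch", "narrowMatch", "broadMatch", "relatedMatch",
)

# Hypothetical URIs standing in for a local concept and two foreign concepts.
LOCAL = "http://example.org/envthes/pollution"
triples = [
    (LOCAL, SKOS + "exactMatch", "http://example.org/gemet/pollution"),
    (LOCAL, SKOS + "closeMatch", "http://example.org/agrovoc/pollution"),
]


def mapped_concepts(
    triples: list[tuple[str, str, str]],
    concept: str,
    relation: str = "exactMatch",
) -> list[str]:
    """Return the foreign concept URLs linked to `concept` by `relation`."""
    if relation not in MAPPING_PROPERTIES:
        raise ValueError(f"not a SKOS mapping property: {relation}")
    predicate = SKOS + relation
    return [o for s, p, o in triples if s == concept and p == predicate]


print(mapped_concepts(triples, LOCAL))                # exact matches only
print(mapped_concepts(triples, LOCAL, "closeMatch"))  # the fuzzier links
```

Because each link is just a pointer to a foreign URL, the result of `mapped_concepts` can be followed in the same way as any other hyperlink.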
The user of EnvThes can see whether a concept of interest is linked to other thesauri, can thus estimate whether the term is widely accepted, and can search for further details within the foreign vocabulary by following the link, perhaps finding translations, related terms, images and explanations.

Unfortunately, DEIMS (based on Drupal 6) does not so far allow for multiple controlled vocabularies. Therefore all the mentioned sources have to be integrated into one thesaurus, even when the original vocabulary is already LinkedData based, which means that we currently have to create duplicates of concepts from foreign vocabularies. We do, however, keep the greatest part of the unique identifier and provide the link to the original concept, and can therefore eliminate the duplicates as soon as all our implemented tools allow that.

6. Outlook – squaring the circle

The integration of data from different domains raises two needs which at first glance contradict each other. First, a common and rather stable vocabulary is needed for the description of the data in order to support semantic interoperability. Second, extensions to the vocabulary are needed at the same time in order to allow the description of new scientific fields and new technologies. A solution is a stable set of related core concepts, with dynamic models extending them. It is highly desirable that this is done using an architecture of dispersed machines running different kinds of software which have only their interfaces in common: a large common virtual thesaurus composed of several sub-thesauri, a web of controlled vocabularies. Implementing controlled vocabularies with a linkedData interface is a good way to put these requirements into practice. From the point of view of long term ecological research and monitoring, GEMET, EarTh, EuroVoc and EUNIS habitats represent rather stable vocabularies linked to EnvThes.
Wikipedia offers the citizen scientist’s view, and AgroVoc is a dynamic vocabulary for the related agricultural domain. The same system of interlinks could also help to harmonize ontologies step by step, such as those registered in the very valuable BioPortal (see: http://bioportal.bioontology.org/).

The direct use of EnvThes within the project frame is related to the discovery of data and services. Unambiguous and well defined keywords are needed to annotate metadata in order to be able to search consistently for data. In addition, the use of common concepts, e.g. within data generation, enhances the semantic interoperability and usability of the data. EnvThes therefore is an important step towards an integrated network of information (Bizer et al. 2009) and data resulting from the LTER Europe network as well as the EnvEurope project. There is no doubt that we are still far away from a web of LTER data, more for data policy reasons than for technical reasons, but with EnvThes the first step has been taken, and at the same time it is the foundation for the establishment of links to other domains.

7. Bibliography

Adamescu, M., Peterseil, J., Dactu, S., Cazacu, C., Vadineanu, A. (2010). Elements for the design of a General Ecological Database. In: Maurer, I. and Tochtermann, K. (eds.) Information and Communication Technologies for Biodiversity and Agriculture. Shaker Verlag, Aachen. pp. 49-66.
Berners-Lee, T. (2006). Design Issues: Linked Data. Online http://www.w3.org/DesignIssues/LinkedData.html
Bizer, C. (et al.) (2011). Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool. (Full text available for free at http://linkeddatabook.com/editions/1.0/).
Bizer, C., Heath, T., Berners-Lee, T. (2009). Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS), 5(3), 1-22.
DOI: 10.4018/jswis.2009081901
European Environment Agency (2009). European Nature Information System (EUNIS). http://eunis.eea.europa.eu/about.jsp and http://eunis.eea.europa.eu/habitats.jsp
European Environment Agency (1999). GEMET, GEneral Multilingual Environmental Thesaurus. http://www.eionet.europa.eu/gemet/
European Union (2013). EuroVoc Multilingual Thesaurus of the European Union. http://eurovoc.europa.eu/
Food and Agriculture Organization of the United Nations (2013). AGROVOC. http://aims.fao.org/standards/agrovoc/about
Gruber, T.R. (1993). Toward principles for the design of ontologies used for knowledge sharing. Originally in N. Guarino and R. Poli (Eds.), International Workshop on Formal Ontology, Padova, Italy. Revised August 1993. International Journal of Human-Computer Studies 43(5-6):907-928.
Isaac, A. (et al.) (2009). Simple Knowledge Organization System Primer. W3C Working Group Note 18 August 2009. http://www.w3.org/TR/skos-primer/
Klyne, G., Carroll, J.J. (2004). Resource Description Framework (RDF): Concepts and Abstract Syntax. http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
LTER Controlled Vocabulary Working Group (2011). Terms of Reference. http://im.lternet.edu/VocabTOR
Miles, A., Bechhofer, S. (2009). SKOS Simple Knowledge Organization System Reference. http://www.w3.org/TR/2009/REC-skos-reference-20090818/
Mirtl, M., Krauze, K. (2007). Developing a new Strategy for Environmental Research and Monitoring: The European Long-Term Ecological Research Network’s (LTER Europe) role and perspective. In: Chmielewski, T.J. (Ed.) Nature Conservation Management: From Idea to practical Results. Lublin – Lodz – Helsinki – Aarhus. pp. 36-52.
Mirtl, M., Orenstein, D., Wildenberg, M., Peterseil, J., Frenzel, M. (2013). Development of LTSER Platforms in LTER Europe: Challenges and Experiences in implementing Place-Based Long Term Socio-ecological Research in Selected Regions. In: Singh, S.J., Haberl, H., Chertow, M., Mirtl, M. and Schmid, M. (Eds.).
Long Term Socio-Ecological Research. Studies in Society-Nature Interactions Across Spatial and Temporal Scales. Springer, Dordrecht Heidelberg New York London. pp. 409-442.
National Aeronautics and Space Administration (2011). Semantic Web for Earth and Environmental Terminology (SWEET). http://sweet.jpl.nasa.gov/ontology/ and http://www.qudt.org/qudt/owl/1.0.0/quantity/
National Centers for Biomedical Computing (2005). BioPortal. http://bioportal.bioontology.org/
poolParty. http://www.poolparty.biz/
Schentz, H., Schleidt, K., Lane, M. (2005). Requirements for an IT Framework for ALTER-Net – Results from the Requirement Analysis conducted in Work Package I6. ALTER-Net Deliverable WPI6_2005_01.
Species 2000 Secretariat (2013). Catalogue of life, Annual Checklist. http://www.catalogueoflife.org/colwebsite/