Unified Thesaurus - possibilities, representation and design

Unified Thesaurus feasibility study
Possibilities, Representation and Design
MMLab, ITEC, IBCN

Table of Contents

INTRODUCTION
PURPOSE
REQUIREMENTS
    Ambiguity
    Synonymy
    Semantic Relationships
    Facets
CONTROLLED VOCABULARY CAPABILITIES
    Terms-Concepts
    Preferred Terms / Non-preferred Terms
    Compound Concepts
    Equivalence Relationships
    Hierarchical Relationships
    Associative Relationships
    Top Concepts
    Concept Groups
    Notes for the Concepts and Terms
    Provenance Information
DATA MODEL
    Terms and Concepts / Preferred – Non Preferred Terms
    Equivalence Relationships / Compound Concepts
    Hierarchical / Associative Relationships / Top Concepts
    Concept Groups
    Notes on Concepts and Terms
    Provenance Information
CONTROLLED VOCABULARY REPRESENTATION: SKOS
    Introduction
    SKOS Model and ISO 25964
    SKOS for Interlinking Controlled Vocabularies
    SKOS for Unified Thesaurus
A DESIGN STRATEGY FOR UNIFIED THESAURUS
    Skosification
    Proposal for Unified Thesaurus
APPENDIX: SKOS TECHNICAL
    SKOS vocabulary
    SKOS classes
    SKOS properties
    Concept grouping

Author Affiliation

Sam Coppens, iMinds – MMLab – UGent
Pedro Debevere, iMinds – MMLab – UGent
Erik Mannens, iMinds – MMLab – UGent
Introduction

This document was drafted in the context of the Unified Thesaurus feasibility study commissioned by VIAA [1]. The feasibility study was finalized in January 2014. In a previous deliverable, we discussed the different kinds of controlled vocabularies and their capabilities. In this deliverable, we show how such controlled vocabularies can be formalized.

Several standards exist for modeling and structuring controlled vocabularies. The first international standard for thesauri was ISO 2788, Guidelines for the establishment and development of monolingual thesauri, originally published in 1974 and updated in 1986. In 1985 it was joined by the complementary standard ISO 5964, Guidelines for the establishment and development of multilingual thesauri. Over the years ISO 2788 and ISO 5964 were adopted as national standards in several countries, for example Canada, France and the UK. ISO 2788 and ISO 5964 were withdrawn in 2011, when they were replaced by the first part of ISO 25964. Whereas most of the applications envisaged for ISO 2788 and ISO 5964 were databases in a single domain, often in-house or for paper-based systems, ISO 25964 provides additional guidance for the new context of networked applications, including the Semantic Web.

The ISO 25964 standard specifies a conceptual data model for the structure of a controlled vocabulary. It is intended for thesauri that support information retrieval, and specifically to guide the choice of terms used in indexing, tagging, and search queries. It consists of two parts. The first part explains how to build a monolingual or multilingual thesaurus, how to display it and how to manage its development. A compliant thesaurus should list all the concepts available for indexing in a given context and label them with preferred terms and synonyms. Relationships between the concepts and between the terms are shown, making it easy to navigate the thesaurus. This part essentially offers a data model for controlled vocabularies. The second part deals with the challenges of using one thesaurus in combination with another, and/or with some other type of controlled vocabulary.

Next to the ISO standard, there is also the American ANSI/NISO Z39.19-2005 standard (Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies). This standard covers more or less the same ground as ISO 25964, Part 1, but does not go as far in providing a data model for controlled vocabularies.

This deliverable consists of two sections. One section focuses on requirements and best practices for controlled vocabulary representations; the other section details a data model that covers most of these best practices.

[1] More information on the Unified Thesaurus feasibility study is available on the iminds.be site: http://www.iminds.be/nl/projecten/2014/09/08/unified-thesaurus
Purpose

A controlled vocabulary serves information retrieval. This can be summarized into five purposes:

• Translation: Provide a means for converting the natural language of authors, indexers, and users into a vocabulary that can be used for indexing and retrieval.
• Consistency: Promote uniformity in term format and in the assignment of terms.
• Indication of relationships: Indicate semantic relationships among terms.
• Label and browse: Provide consistent and clear hierarchies in a navigation system to help users locate desired content objects.
• Retrieval: Serve as a searching aid in locating content objects.

Requirements

These purposes define the requirements of a controlled vocabulary. The basic requirements discussed in this section are derived from the ISO 25964, Part 1 specification. Every controlled vocabulary should be able to meet them.

Ambiguity

Ambiguity occurs when a word (or group of words) in a language has more than one meaning. A controlled vocabulary must compensate for the problems caused by ambiguity by ensuring that each term has one and only one meaning.

Synonymy

Synonymy occurs when a concept can be represented by multiple terms having the same or similar meanings. A controlled vocabulary must compensate for the problems caused by synonymy.

Semantic Relationships

A controlled vocabulary may support various kinds of relationships among its concepts. These include:

• equivalence relationships
• hierarchical relationships
• associative relationships

These relationships may be defined as required for a particular application. By working with concepts, a controlled vocabulary can support the various types of relationships regardless of the different spellings, words, or word groups used for denoting a concept.

Facets

Controlled vocabularies, especially large ones consisting of thousands of terms, may be easier to use if they are organized in some way other than just hierarchically. Faceted analysis is another way of
organizing knowledge. It takes a bottom-up approach: the parts are pieced together first and the areas of knowledge they form are determined afterwards, in contrast to the discipline-directed, top-down approach of hierarchies. This way a controlled vocabulary may support multiple hierarchies and can thus be poly-hierarchical.

Controlled Vocabulary Capabilities

To tackle the basic requirements discussed in the previous section, the ISO 25964, Part 1 specification prescribes certain capabilities. These capabilities are discussed in this section. A controlled vocabulary that provides them can be called ISO 25964, Part 1-compliant. The capabilities cover the following requirements:

• Ambiguity
• Synonymy
• Semantic Relationships
• Facets

Terms-Concepts

To cover the ambiguity requirement, the structure of the controlled vocabulary is based on concepts instead of terms. This means the model for describing the controlled vocabulary shows the relationships between the concepts, and distinguishes these from the terms that are used for labeling the concepts. This way, a controlled vocabulary can deal with ambiguity: different concepts can be linked to the same term. A typical example of such a term is “Java”, which is linked to three different concepts, i.e., the island, the coffee, and the programming language. The relationships in a controlled vocabulary are between the concepts, not the terms. This way, a controlled vocabulary is able to deal with synonyms, hyponyms, ambiguity, and associative relationships.

Preferred Terms / Non-preferred Terms

Synonymy is dealt with by ensuring that each concept is represented by a single preferred term. The vocabulary should list the other synonyms and variants as non-preferred terms with USE references to the preferred term. In a multilingual controlled vocabulary, each concept can have one preferred term for each language the vocabulary supports. The terms then need language attributes.

Compound Concepts

Compound or multiword terms in natural language are considered lexemes, i.e., words bound together as lexical units. “Media Art” is a good example. Dictionaries differ in their policies regarding the inclusion of various categories of compound terms. For this reason, guidelines on the handling of compound terms in controlled vocabularies have been developed.
To be acceptable as a term, a compound term should express a single concept or unit of thought, capable of being arranged in a genus-species relationship within a hierarchy or tree structure. “Media Art” is an example of an acceptable compound term, because both “Media” and “Art” are broader concepts of “Media Art”. A compound concept can be constructed from its individual parts. For instance, “Media Art” can be defined as the intersection of two concepts, i.e., “Media” and “Art”. At the same time, compound concepts can be defined as the union of the individual concepts. Because the interpretation is not fixed, it is better to model compound concepts as concepts in their own right in the controlled vocabulary, optionally with relationships to the individual terms to support search and retrieval.

Equivalence Relationships

When the same concept can be expressed by two or more terms, one of these is selected as the preferred term. The relationship between preferred and non-preferred terms is an equivalence relationship in which each term is regarded as referring to the same concept. The preferred term in effect substitutes for other terms expressing equivalent or nearly equivalent concepts. A cross-reference to the preferred term should be made from any “equivalent” entry term. Equivalence relationships are used for synonyms, lexical variants, near-synonyms, and cross-references to the concepts of a compound concept.

Hierarchical Relationships

The use of hierarchical relationships is the primary feature that distinguishes a taxonomy or thesaurus from other, simpler forms of controlled vocabularies such as lists and synonym rings. Hierarchical relationships are based on degrees or levels of superordination and subordination, where the superordinate term represents a class or a whole, and subordinate terms refer to its members or parts. Reciprocity should be expressed by the following relationship indicators:

• Broader Term: a label for the superordinate (parent) term
• Narrower Term: a label for the subordinate (child) term

For instance, “Media Art” is a narrower term for “Art”, and “Art” is a broader term for “Media Art”. Hierarchical relationships can be used for generic relationships, like the example with “Media Art”, but also for instance relationships (e.g., “Person” and “Hugo Claus”) and whole-part relationships (e.g., “News” and “Politics”). These relationships bring hierarchy into the controlled vocabulary and support using it as a classification scheme.

Associative Relationships

This relationship covers associations between terms that are neither equivalent nor hierarchical, yet the terms are semantically or conceptually associated to such an extent that the link between them should be made explicit in the controlled vocabulary, on the grounds that it may suggest additional terms for use in indexing or retrieval. For instance, terms from a common broader term
can be related to each other via an associative relationship: “Media Art” and “Modern Art”, both narrower terms of “Art”, are an example.

Top Concepts

Each concept can have a pointer linking it to the concept at the top of any hierarchy in which it occurs. These top concepts can be facet names, for example, and this link can facilitate browsing by clearly indicating which facet a concept is in. It can also be used for validation, because hierarchical relationships are valid only if the two concepts are in the same facet. This feature can be useful in producing a list of top-level concepts from which to start browsing. These links and attributes are, strictly speaking, redundant, because top concepts could be identified by navigating up the hierarchy until no more broader concepts are found. As this search would use substantial processing resources, it will generally be more efficient to store the information than to determine it every time it is needed.

Concept Groups

Many thesauri group concepts into subsets, often discipline based, called “themes,” “micro-thesauri,” “domains” or “groups.” The box in the model called “concept group” provides for such groups. The concepts within such a group may or may not have any hierarchical or associative relationship with each other and may be drawn from distinct hierarchies or facets of the thesaurus, such as activities, people, places or things. Concept groups may be nested. They thus provide the possibility of a classified arrangement that complements the generic hierarchy of the thesaurus itself. In effect, a thesaurus should be able to support poly-hierarchies, as this is one of the requirements, and concept groups are a means to support this feature.

Notes for the Concepts and Terms

The controlled vocabulary should support notes on concepts and terms. These notes can, for instance, describe the usage of a concept, describe the concept itself, or give an editorial note. They are free-text fields that give the concept more contextual information. This is needed because the controlled vocabulary can have different concepts denoted with the same terms. The notes provide the context information to disambiguate between them.

Provenance Information

A controlled vocabulary should provide provenance information. The provenance information records the version history of the controlled vocabulary: which concepts were added, which terms were modified, who made the modifications, what the justifications for the modifications were, etc. This information is essential for managing the controlled vocabulary, as it can grow large very quickly.
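The trade-off described under Top Concepts above can be made concrete with a small sketch. It assumes a hypothetical in-memory mapping from each concept to its direct broader concepts (the concept names are borrowed from the examples in this section; the function and variable names are illustrative only) and an acyclic hierarchy. Top concepts are found by walking up the broader links, and the result is then stored per concept so the walk does not have to be repeated.

```python
from functools import lru_cache

# Hypothetical hierarchy: concept -> its direct broader concepts.
# Assumed to be acyclic; a cycle would make the walk below recurse forever.
BROADER = {
    "Media Art": ("Art",),
    "Modern Art": ("Art",),
    "Art": (),
    "Politics": ("News",),
    "News": (),
}


@lru_cache(maxsize=None)
def top_concepts(concept: str) -> frozenset:
    """Walk up the broader links until concepts without a broader concept are reached."""
    parents = BROADER.get(concept, ())
    if not parents:
        return frozenset({concept})
    found = set()
    for parent in parents:
        found |= top_concepts(parent)
    return frozenset(found)


# Store the result once per concept instead of recomputing it on every request.
TOP_CONCEPT_OF = {concept: top_concepts(concept) for concept in BROADER}

print(sorted(TOP_CONCEPT_OF["Media Art"]))
```

Because a concept may have several broader concepts, the stored value is a set: in a poly-hierarchical vocabulary one concept can sit under more than one top concept.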
Data Model

The specification for controlled vocabularies defines a data model that fulfills all the requirements and follows the guidelines. The data model is shown below. In the remainder of this section, we show how each of the desired capabilities is covered by this data model.

Figure 1: ISO 25964 Thesaurus UML Model
Terms and Concepts / Preferred – Non Preferred Terms

In the model there is a clear distinction between the concepts and the terms denoting them. Concepts and terms are linked via PreferredTerm or SingleNonPreferredTerm relationships.

Equivalence Relationships / Compound Concepts

Equivalence relationships can be made between the terms of concepts, as shown in the figure above. Next to the equivalence relationships, there is also support for CompoundEquivalence relationships, which can be used for compound concepts. The “Media Art” compound concept can have a link to the “Media” term and the “Art” term, and these terms can be linked via the CompoundEquivalence relationship.

Hierarchical / Associative Relationships / Top Concepts

Next to equivalence relationships between terms, the data model also provides hierarchical relationships between concepts, as shown in the figure above. Associative relationships are supported in the same way, and concepts can likewise be linked to their top concepts. This is shown in the overall figure of the data model.

Concept Groups

Concepts can be linked to a thesaurus, or to other concepts to denote the hierarchy, but they can also be grouped into multiple hierarchies via the ConceptGroup class, which in turn can be labelled. This is shown in the figure above.
Notes on Concepts and Terms

The data model also allows adding notes to concepts and terms, as shown in this figure. Examples of such notes are ScopeNote or Definition.

Provenance Information

Via the VersionHistory class, the data model allows describing some basic provenance information on the thesaurus.
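To make the structure of this model more tangible, the following is a minimal sketch in Python. It is not the ISO 25964 schema itself: ConceptGroup and VersionHistory are names taken from the text above, while the remaining class layout, the field names and the helper function concepts_for are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    lexical_value: str
    lang: str = "en"


@dataclass
class Concept:
    identifier: str
    # One preferred term per language, keyed by language code.
    preferred_terms: Dict[str, Term] = field(default_factory=dict)
    non_preferred_terms: List[Term] = field(default_factory=list)
    broader: List["Concept"] = field(default_factory=list)
    scope_note: Optional[str] = None


@dataclass
class ConceptGroup:
    label: str
    members: List[Concept] = field(default_factory=list)


@dataclass
class VersionHistory:
    version: str
    date: str
    note: str = ""


@dataclass
class Thesaurus:
    identifier: str
    concepts: List[Concept] = field(default_factory=list)
    groups: List[ConceptGroup] = field(default_factory=list)
    versions: List[VersionHistory] = field(default_factory=list)


def concepts_for(thesaurus: Thesaurus, text: str) -> List[Concept]:
    """Return every concept labelled with `text`, preferred or non-preferred."""
    hits = []
    for concept in thesaurus.concepts:
        labels = [t.lexical_value for t in concept.preferred_terms.values()]
        labels += [t.lexical_value for t in concept.non_preferred_terms]
        if text in labels:
            hits.append(concept)
    return hits


# The ambiguous term "Java" labels three distinct concepts; the scope notes disambiguate.
java_island = Concept("c1", {"en": Term("Java")}, scope_note="Island")
java_coffee = Concept("c2", {"en": Term("Java")}, [Term("Java coffee")], scope_note="Coffee")
java_language = Concept("c3", {"en": Term("Java")}, scope_note="Programming language")
thesaurus = Thesaurus("demo", [java_island, java_coffee, java_language])

print([c.scope_note for c in concepts_for(thesaurus, "Java")])
```

Keeping relationships on Concept rather than on Term mirrors the point made earlier: the term “Java” alone cannot carry a hierarchy, but each of the three concepts can.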
Controlled Vocabulary Representation: SKOS

Introduction

In this section, we have a closer look at the Simple Knowledge Organization System (SKOS). SKOS is a data model for modeling controlled vocabularies, defined by the Semantic Web Deployment Working Group, which is part of the Semantic Web Activity of the World Wide Web Consortium (W3C). At the moment, it is the only standardized data model for controlled vocabularies. It incorporates all the capabilities a controlled vocabulary should have according to the ISO 25964 specification. This means that when you use SKOS for describing your controlled vocabulary, you are compliant with the ISO specification.

Controlled vocabularies have typically been developed in close relationship with the CMS. Hence, many controlled vocabularies have a proprietary data model. Various technologies are used for this data model, ranging from spreadsheets, over XML, to RDF. Since many datasets nowadays are being published as Open Data, a uniform model was needed for publishing the controlled vocabularies used in these datasets. For this reason, SKOS was designed. On the Web, a controlled vocabulary should always be described using SKOS. This way, different SKOS vocabularies can be linked to each other, thus reusing information from one another. SKOS is not only a way of disseminating your controlled vocabulary, it is also a way of designing it, because you are then sure to be ISO-compliant and to exploit all capabilities of a controlled vocabulary.

SKOS allows formally describing the content and structure of structured vocabularies, such as thesauri, taxonomies, classification schemes, etc. Because SKOS is based on the Resource Description Framework (RDF) (1), a Knowledge Organization System (KOS) that is defined using SKOS is machine-readable and publishable on the World Wide Web. This subsequently allows linking concepts with other concepts defined in the Web of Data and is important in creating a contextualized description.

At the same time, SKOS is also a means to support multilingualism. The concepts of a SKOS thesaurus are identified using URIs, but each concept can have multiple labels in different languages to support multilingual search and classification.

Another benefit of SKOS is that it is very flexible and extensible. Controlled vocabularies tend to change often. SKOS allows modeling them in such a way that existing descriptions remain valid when the vocabulary is changed or extended. Terms can be added to the vocabulary and even their hierarchy can change without affecting the already existing descriptions that make use of the vocabulary.

In this section, we show how SKOS is actually an implementation of the ISO 25964 standard. It maps almost one-to-one onto the data model of the ISO standard. At the same time SKOS has some extra benefits that make it a good representation choice for a Unified Thesaurus. For a more detailed and technical description of the SKOS ontology, we refer to the appendix.
SKOS Model and ISO 25964

Basically, SKOS consists of a set of classes and properties for describing the controlled vocabulary. All classes and properties are denoted by URIs, as SKOS is described using RDF. In this section, we go through these classes and properties, showing how SKOS relates to the ISO standard and justifying the choice for SKOS as a representation for any unified thesaurus.

Concepts and Terms / Preferred Terms - Non-preferred Terms

Class URI
skos:Concept
skos:ConceptScheme

Property URI       Domain          Range
skos:inScheme      rdfs:Resource   skos:ConceptScheme
skos:altLabel      -               rdfs:Literal
skos:hiddenLabel   -               rdfs:Literal
skos:prefLabel     -               rdfs:Literal

A thesaurus in SKOS is denoted by the skos:ConceptScheme class. A set of concepts can be linked to this class using the skos:inScheme property. Each concept in turn is described by a preferred label, at most one per language, using skos:prefLabel. The non-preferred labels are linked to the concept using skos:altLabel. The skos:hiddenLabel property links terms to a concept that should never be shown to end-users, mostly to cover common misspellings for extra retrieval possibilities.
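As an illustration of these classes and properties, the sketch below builds a handful of triples for one concept in plain Python and prints them one statement per line. The namespace http://example.org/ut/, the concept and all of its labels are made-up placeholders; only the SKOS and RDF namespace URIs and the property names come from the SKOS vocabulary. The helper has_unique_pref_labels is a hypothetical check for the rule that a concept has at most one preferred label per language. Literal escaping is deliberately left out.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/ut/"  # hypothetical namespace for this example


def uri(value: str) -> str:
    return f"<{value}>"


def literal(text: str, lang: str) -> str:
    # No escaping of quotes or backslashes: fine for these sample labels only.
    return f'"{text}"@{lang}'


scheme = uri(EX + "scheme")
concept = uri(EX + "concept/media-art")

triples = [
    (scheme, uri(RDF_TYPE), uri(SKOS + "ConceptScheme")),
    (concept, uri(RDF_TYPE), uri(SKOS + "Concept")),
    (concept, uri(SKOS + "inScheme"), scheme),
    (concept, uri(SKOS + "prefLabel"), literal("Media Art", "en")),
    (concept, uri(SKOS + "prefLabel"), literal("Mediakunst", "nl")),
    (concept, uri(SKOS + "altLabel"), literal("New media art", "en")),
    (concept, uri(SKOS + "hiddenLabel"), literal("Media Arrt", "en")),
]


def has_unique_pref_labels(statements) -> bool:
    """True if no concept has two preferred labels in the same language."""
    seen = set()
    for subject, predicate, obj in statements:
        if predicate == uri(SKOS + "prefLabel"):
            key = (subject, obj.rsplit("@", 1)[1])
            if key in seen:
                return False
            seen.add(key)
    return True


for subject, predicate, obj in triples:
    print(subject, predicate, obj, ".")
```

The hidden label carries a deliberate misspelling, which is the use case described above: it improves retrieval but is never displayed.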
Top Concepts / Concept Groups

Class URI
skos:Collection
skos:OrderedCollection

Property URI         Domain                   Range
skos:hasTopConcept   skos:ConceptScheme       skos:Concept
skos:topConceptOf    skos:Concept             skos:ConceptScheme
skos:member          skos:Collection          union(skos:Concept, skos:Collection)
skos:memberList      skos:OrderedCollection   rdf:List

Just as in the ISO specification, top concepts can be described in the controlled vocabulary using SKOS. For this, SKOS provides the properties skos:hasTopConcept and skos:topConceptOf. Besides the top concepts, SKOS also provides concept groupings using the classes skos:Collection and skos:OrderedCollection. The properties used for linking a concept to such a collection are skos:member and skos:memberList. This way, SKOS supports top concepts and facets, and thus poly-hierarchy.

Semantic Relationships

Property URI             Domain         Range
skos:broader             skos:Concept   skos:Concept
skos:broaderTransitive   skos:Concept   skos:Concept
skos:narrower            skos:Concept   skos:Concept
skos:narrowerTransitive  skos:Concept   skos:Concept
skos:related             skos:Concept   skos:Concept
skos:semanticRelation    skos:Concept   skos:Concept

SKOS provides semantic relationships: hierarchical relationships to bring hierarchy into the controlled vocabulary, and associative relationships for associations between concepts. For this, SKOS provides the following properties: skos:broader, skos:broaderTransitive, skos:narrower, skos:narrowerTransitive, skos:related, and skos:semanticRelation.

Notes on Concepts and Terms / Provenance information

Property URI         Domain   Range
skos:notation        -        rdfs:Resource
skos:changeNote      -        rdfs:Resource
skos:definition      -        rdfs:Resource
skos:editorialNote   -        rdfs:Resource
skos:example         -        rdfs:Resource
skos:historyNote     -        rdfs:Resource
skos:note            -        rdfs:Resource
skos:scopeNote       -        rdfs:Resource

Notes can always be attached to concepts in SKOS. These notes can provide information on the use of the concept, a change that has been made, an editorial note, etc. They provide the extra contextual information.

SKOS for Interlinking Controlled Vocabularies

Besides being a data model for designing and describing controlled vocabularies, SKOS is also a model that supports the interlinking of several SKOS vocabularies. For this, SKOS has mapping properties that relate concepts to each other. SKOS thus allows building a controlled vocabulary that consists purely of mappings that interlink other, existing SKOS vocabularies.
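The idea of a vocabulary that consists purely of mappings can be sketched as follows. The two source vocabularies, their URIs and the chosen mappings are invented for illustration; skos:exactMatch, skos:closeMatch and skos:narrowMatch are SKOS mapping properties. The helper exact_matches is hypothetical and only follows direct exactMatch links in both directions; it does not compute a transitive closure.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical namespaces of two institutions' vocabularies.
A = "http://example.org/vocab-a/"
B = "http://example.org/vocab-b/"

# A mapping-only vocabulary: every statement links a concept in A to one in B.
mappings = [
    (A + "media-art", SKOS + "exactMatch", B + "mediakunst"),
    (A + "cartoon", SKOS + "closeMatch", B + "animation"),
    # narrowMatch: the concept in B is narrower than the concept in A.
    (A + "media-art", SKOS + "narrowMatch", B + "video-art"),
]


def exact_matches(concept: str, statements) -> set:
    """Concepts directly linked to `concept` by skos:exactMatch, in either direction."""
    found = set()
    for subject, predicate, obj in statements:
        if predicate != SKOS + "exactMatch":
            continue
        if subject == concept:
            found.add(obj)
        elif obj == concept:
            found.add(subject)
    return found


print(exact_matches(B + "mediakunst", mappings))
```

Looking up the mapping from either side works because skos:exactMatch is symmetric; the looser properties would need their own handling depending on how much imprecision a search application tolerates.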
Property URI           Domain         Range
skos:broadMatch        skos:Concept   skos:Concept
skos:closeMatch        skos:Concept   skos:Concept
skos:exactMatch        skos:Concept   skos:Concept
skos:mappingRelation   skos:Concept   skos:Concept
skos:narrowMatch       skos:Concept   skos:Concept
skos:relatedMatch      skos:Concept   skos:Concept

SKOS provides exact mappings and fuzzy mappings. The latter are needed because they allow mapping vocabularies with different granularities or hierarchies. The skos:exactMatch property interlinks two concepts from different SKOS vocabularies that are the same. The other mapping properties are less exact, but broaden the scope of their usage.

SKOS for Unified Thesaurus

SKOS is an ideal representation for controlled vocabularies, and more specifically for a Unified Thesaurus. The reasons for this are:

• ISO 25964 compliant
• Mapping between concepts
• Supports several design strategies
• Easily maintainable
• Poly-hierarchical

The first two reasons have just been discussed and do not need any further explanation. The latter three are discussed in more detail here.

Design Strategies

Two different approaches can be used when developing a controlled vocabulary:

• Top down: This approach is used when the hierarchy of the vocabulary is known, but the complete set of concepts belonging to the vocabulary is not known. Therefore, the hierarchy is first modeled and an initial set of concepts that are known to be part of the vocabulary is introduced. Additional concepts are then gradually added (e.g., to improve search operations).
• Bottom up: In this approach all concepts of the vocabulary are known, but their hierarchy is unknown. Here, additional relations are gradually introduced in order to further develop the vocabulary.
In practice, however, a mix of the above-mentioned approaches is often encountered. In this case, a vocabulary initially consists of a number of concepts and relations, and additional concepts and relations are then gradually added to the vocabulary. SKOS supports both strategies for developing new controlled vocabularies.

Easily Maintainable

SKOS supports easy maintainability of the controlled vocabulary. This means it is able to guarantee backward compatibility: the SKOS vocabulary can be changed over time without affecting the records that are described using it.

Adding concepts

OWL makes use of an open world assumption. As the SKOS data model has been designed as an OWL ontology, this assumption also applies to the SKOS data model. The open world assumption states that the truth value of a statement is independent of whether or not it is known by any single observer or agent to be true. For a SKOS concept scheme, this assumption means that a concept scheme only describes the concept scheme, but does not define it. In other words, it is possible that other concepts belong to the concept scheme, but are not known for now. Therefore, it is possible to add new concepts to a concept scheme along the way.

Removing concepts

When a concept should no longer be used for annotation purposes, it should also no longer be selectable in an annotation tool. However, this concept should still be discoverable in a search application, as items could have been annotated with it. For this reason, one approach could be to design two concept schemes: one containing all concepts currently allowed for annotation and one containing all concepts that have been used in the earlier annotation process. In this way, concepts that are no longer in use are no longer available for annotation, while it is still possible to use them during search.
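A minimal sketch of the two-scheme approach, with invented scheme contents: one set holds the concepts currently offered for annotation, the other holds every concept that has ever been available. The function names and the example concepts are assumptions for illustration and are not part of SKOS.

```python
# Concepts currently selectable in the annotation tool (hypothetical content).
annotation_scheme = {"Media Art", "Cartoon", "Politics"}
# Every concept that has ever been available; used by the search application.
search_scheme = set(annotation_scheme)


def add_concept(concept: str) -> None:
    """A new concept becomes available for both annotation and search."""
    annotation_scheme.add(concept)
    search_scheme.add(concept)


def retire_concept(concept: str) -> None:
    """A retired concept leaves the annotation scheme but stays searchable."""
    annotation_scheme.discard(concept)


def can_annotate(concept: str) -> bool:
    return concept in annotation_scheme


def can_search(concept: str) -> bool:
    return concept in search_scheme


add_concept("Video Art")
retire_concept("Cartoon")

print(can_annotate("Cartoon"), can_search("Cartoon"))
```

In a SKOS representation the same effect could be obtained by placing a concept in two concept schemes and removing it from only one of them; the sets above stand in for those schemes.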
Changing relations between concepts

A SKOS model is rather static, i.e., the SKOS data model only allows describing a KOS and it does not provide a way to keep track of changes to the structure of the KOS, other than the use of notes. Altering a SKOS model by changing relationships between concepts can have a major influence on the result set of information retrieval requests, as relationships between concepts are often used to construct a result set. Suppose for example that a concept “Media Art” has a narrower concept “Cartoon”, and that during annotation of a media item only the concept “Cartoon” was provided. When a user performs a search operation that involves the concept “Media Art”, the item annotated with “Cartoon” can also be relevant and is therefore included in the result set through query expansion (e.g., by including all narrower concepts of “Media Art” in the query). However, once this relation has been removed, it is no longer possible to retrieve this item using a query involving the concept “Media Art”.

One way to alleviate this problem is to store the concept position in the model during annotation (i.e., its broader terms, narrower terms, transitive closure, etc.) as part of the annotation. The search application then uses this information instead of the current SKOS model. Changing the concept position in the SKOS model then no longer influences the previously annotated media items, as the search application only uses the stored concept position. New annotations store the concept position in the current SKOS model.

Poly-hierarchy

A SKOS vocabulary supports poly-hierarchy. This means that concepts can be grouped in different ways or hierarchies at the same time. This is essential when developing a controlled vocabulary that will be used in very different domains. Agreeing on concepts is one thing; agreeing on the hierarchy of the concepts is another, more difficult task, as hierarchies tend to be domain-specific. Supporting poly-hierarchy is a major asset for such vocabularies.

A Design Strategy for Unified Thesaurus

A simplified strategy for designing the Unified Thesaurus is depicted below, and consists of the following basic steps:

• Identification of the domains
• Identification of the concepts for each domain
• Representation of the concepts
• Enrichment of the concepts
• Maintenance of the vocabulary / extensions

First one has to decide how to split up the vocabulary to be designed into domains, facets, top concepts or root elements. This way one can split up the vocabulary into pieces that can be managed on their own. At the same time, they deliver starting
points for browsing through the vocabulary. Following the ISO 25964 specification, this can be done through top concepts, hierarchical relationships, facets (concept groups) or even separate vocabularies.

Once the domains are identified, one can start thinking about how each domain will be populated with concepts. Which concepts are relevant for that domain, and how will they be gathered? How do you decide which concepts will be represented in the vocabulary?

Once the concepts are identified, the gathered controlled vocabulary can be represented using an appropriate representation. For the Unified Thesaurus, SKOS can be chosen as the representation for the reasons explained in the previous section. The process of turning the concepts into a SKOS representation is called skosification.

After this, the concepts of the controlled vocabulary can be enriched. These enrichments give the concepts extra contextual information, which allows disambiguation. By adding, for instance, a DBpedia link to the concepts, they are already a little more disambiguated, because concepts denoted with the same terms will get a different DBpedia link as enrichment.

The last step is deciding on a management strategy. Until now, the controlled vocabulary was conceived as a static thing, but this is just a starting point. A controlled vocabulary is in fact dynamic and evolves over time: concepts will be added and relationships will change. Thus, you need a strategy to manage the controlled vocabulary such that it suits the content it is used for. This is discussed in another deliverable on dynamic vocabularies and the tooling to support them.

The remainder of this section is split up into two parts:

• Skosification strategy
• UT proposal

First, we give a detailed workflow for skosifying a controlled vocabulary. The Unified Thesaurus (UT) will be represented in SKOS. In order for UT to make links to the controlled vocabularies of the contributing institutions, their controlled vocabularies will also need to be disseminated as SKOS vocabularies. Thus, skosification is a crucial step for all the involved controlled vocabularies. As explained earlier, this skosification is only needed for dissemination purposes. The internal representation of the institutions’ vocabularies can remain unaltered.

In the second part, we suggest how to jumpstart UT. One does not need to develop a unifying controlled vocabulary from scratch. There are some starting points from which the Unified Thesaurus can evolve.
Skosification
In this section, we propose a skosification workflow: once the concepts are identified, how can they be turned into a SKOS vocabulary? For this, we propose a 4-step workflow consisting of the following steps:
• Identifiers for the concepts
• Modeling of the concepts
• Relating the concepts internally
• Enriching the concepts

Before explaining the workflow, it is important to know that SKOS is built on top of RDF. In RDF, information is described using triples. A triple consists of a subject, a predicate, and an object; the subject is linked to its object via the predicate. Subjects and predicates are always denoted with URIs, objects with URIs or literals. Namespaces are supported for the URIs, as in XML. This way, one can describe everything. The two triples below, for instance, respectively state that something denoted with the URI :Dali represents a person, and that this person’s name is "Salvador Dali" in Dutch.

:Dali rdf:type foaf:Person .
:Dali foaf:name "Salvador Dali"@nl .

To explain the operational steps, we will use an example: a very basic set of terms that will be converted into a SKOS vocabulary. The terms are depicted below.

Identifiers for the concepts
In a first step, one needs to define identifiers for the concepts. Not only the concepts need identifiers; the controlled vocabulary they belong to needs one as well. As explained earlier, SKOS is built on top of RDF, in which everything is identified using URIs and where namespaces for the URIs are supported. Below we show how this is done for the proposed set of terms. Notice also the separate URI for the controlled vocabulary itself.
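The identifier step can be sketched in a few lines of Python. This is only an illustration under assumptions: the two namespaces are the example.org namespaces used in the serializations of this section, the four terms are the concept names that appear in those serializations, and the CamelCase naming rule and the helper names `mint_local_name` and `mint_concept_uri` are hypothetical choices, not something prescribed by SKOS.

```python
import re

# Namespaces as used in the example serializations (example.org placeholders).
CONCEPT_NS = "http://example.org/concept#"
SCHEME_NS = "http://example.org/vocabulary#"


def mint_local_name(term: str) -> str:
    """Turn a term such as 'media art' into a CamelCase local name ('MediaArt').

    Only ASCII letters and digits are kept; terms with diacritics would need
    an additional transliteration step, which is left out of this sketch.
    """
    words = re.findall(r"[A-Za-z0-9]+", term)
    return "".join(word[:1].upper() + word[1:] for word in words)


def mint_concept_uri(term: str, namespace: str = CONCEPT_NS) -> str:
    """Build the full URI of a concept from its term."""
    return namespace + mint_local_name(term)


terms = ["Culture", "Art", "Contemporary Art", "Media Art"]
concept_uris = {term: mint_concept_uri(term) for term in terms}

# The controlled vocabulary itself gets its own URI in a separate namespace.
scheme_uri = SCHEME_NS + "UT"

for term, uri in concept_uris.items():
    print(term, "->", uri)
print("scheme ->", scheme_uri)
```

Whatever naming rule is chosen, the important design point is that the URI, once published, stays stable even when the labels of the concept change later on.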
Modeling
During the second step, we model the concepts and the controlled vocabulary. This is shown schematically in the figure below. We need to state which URIs denote a controlled vocabulary (skos:ConceptScheme) and which denote a concept (skos:Concept). Then, for each concept, we can give one preferred label per language (skos:prefLabel). We can also add alternative labels (skos:altLabel), and even hidden labels (skos:hiddenLabel). Finally, we need to state that the concepts belong to that vocabulary (skos:inScheme). The triples shown in this picture can be serialized into a textual form. One way of serializing triples with namespace prefixes is the Turtle format:
@prefix voc: <http://example.org/vocabulary#> .
@prefix con: <http://example.org/concept#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

voc:UT rdf:type skos:ConceptScheme .
…
con:MediaArt rdf:type skos:Concept .
con:MediaArt skos:inScheme voc:UT .
con:MediaArt skos:prefLabel "Mediakunst"@nl .
con:MediaArt skos:altLabel "mediakunst"@nl .
con:MediaArt skos:hiddenLabel "media kunst"@nl .
con:MediaArt skos:prefLabel "Media Art"@en .
con:Art rdf:type skos:Concept .
con:Art skos:prefLabel "Kunst"@nl .
…

This set of triples can also be serialized into XML:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:voc="http://example.org/vocabulary#"
         xmlns:con="http://example.org/concept#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <rdf:Description rdf:about="http://example.org/vocabulary#UT">
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#ConceptScheme" />
  </rdf:Description>
  …
  <rdf:Description rdf:about="http://example.org/concept#MediaArt">
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept" />
    <skos:inScheme rdf:resource="http://example.org/vocabulary#UT" />
    <skos:prefLabel xml:lang="nl">Mediakunst</skos:prefLabel>
    <skos:altLabel xml:lang="nl">mediakunst</skos:altLabel>
    <skos:hiddenLabel xml:lang="nl">media kunst</skos:hiddenLabel>
    <skos:prefLabel xml:lang="en">Media Art</skos:prefLabel>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/concept#Art">
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept" />
    <skos:prefLabel xml:lang="nl">Kunst</skos:prefLabel>
    …
  </rdf:Description>
  …
</rdf:RDF>
Relating the concepts/hierarchy
During the third step, one needs to build some hierarchy into the vocabulary. For this, we use the skos:broader or skos:narrower relationship. In the example, skos:broader is used. The serialization looks like this:

@prefix voc: <http://example.org/vocabulary#> .
@prefix con: <http://example.org/concept#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

voc:UT rdf:type skos:ConceptScheme .
…
con:MediaArt rdf:type skos:Concept .
con:MediaArt skos:inScheme voc:UT .
con:MediaArt skos:broader con:ContemporaryArt .
con:MediaArt skos:prefLabel "Mediakunst"@nl .
con:MediaArt skos:altLabel "mediakunst"@nl .
con:MediaArt skos:hiddenLabel "media kunst"@nl .
con:MediaArt skos:prefLabel "Media Art"@en .
con:ContemporaryArt rdf:type skos:Concept .
con:ContemporaryArt skos:inScheme voc:UT .
con:ContemporaryArt skos:broader con:Art .
con:Art rdf:type skos:Concept .
con:Art skos:inScheme voc:UT .
con:Art skos:broader con:Culture .
con:Culture rdf:type skos:Concept .
con:Culture skos:topConceptOf voc:UT .
…
Enrichment/Contextualisation
During the last step, the concepts are contextualized by relating them to external resources, i.e., to concepts of other vocabularies or to resources of other datasets. This is shown in the example below, which is serialized as follows:

@prefix voc: <http://example.org/vocabulary#> .
@prefix con: <http://example.org/concept#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

voc:UT rdf:type skos:ConceptScheme .
…
con:MediaArt rdf:type skos:Concept .
con:MediaArt skos:broader con:ContemporaryArt .
con:MediaArt skos:exactMatch dbpedia:MediaArt .
con:MediaArt skos:exactMatch freebase:MediaArt .
con:MediaArt skos:exactMatch vrt:media_kunst .
con:MediaArt skos:prefLabel "Mediakunst"@nl .
con:MediaArt skos:altLabel "mediakunst"@nl .
con:MediaArt skos:hiddenLabel "media kunst"@nl .
con:MediaArt skos:prefLabel "Media Art"@en .
…

The dbpedia:, freebase: and vrt: prefixes are not declared in this excerpt; they stand for the namespaces of the external datasets.

Proposal for Unified Thesaurus
When starting a unifying thesaurus, there are several possible strategies. A first strategy would be to start from scratch and develop a new vocabulary that everyone needs to integrate into their records. This approach is not the right way to go, because it is not backwards compatible: everybody would need to adapt their records to the vocabulary. In the Unified Thesaurus scenario this is not manageable.

Another approach is to directly map all the existing vocabularies onto each other, as shown below. This approach is also not feasible, because everybody would need to adjust their vocabularies every time another vocabulary provider makes changes to its vocabulary. Thus, this approach is also not the right way to proceed.

For Unified Thesaurus, the best approach is to develop a concise vocabulary to start from, and to use this vocabulary as a linking hub to relate all the controlled vocabularies. This way, the management of the mappings is centralized, the number of mappings is minimized, and the approach is backwards compatible: nobody needs to alter their vocabulary.
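The claim that a linking hub minimizes the number of mappings can be made concrete with a small calculation. The sketch below counts sets of mappings (one set per pair of vocabularies that are linked), not individual concept-to-concept links; the function names are hypothetical and only serve the illustration.

```python
def pairwise_mapping_sets(n: int) -> int:
    """Mapping sets needed when each of n vocabularies is mapped directly to every other one."""
    return n * (n - 1) // 2


def hub_mapping_sets(n: int) -> int:
    """Mapping sets needed when each of n vocabularies is mapped only to the central hub."""
    return n


for n in (3, 5, 10, 20):
    print(f"{n} vocabularies: pairwise={pairwise_mapping_sets(n)}, hub={hub_mapping_sets(n)}")
```

With direct mapping the number of mapping sets grows quadratically with the number of contributing vocabularies, while with a hub it grows linearly. In addition, a change in one vocabulary only affects its own mapping to the hub.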
With this strategy, one first has to decide on the domains of the Unified Thesaurus. For each domain a separate vocabulary will be designed. This has the advantage that every vocabulary can be managed on its own, even by different institutions, which distributes the effort. When deciding on the domains, there are several parameters to take into account, which will affect the split into domains:
• Domain
• Size of the vocabulary
• Hierarchy (flat vs. layered)
• Dynamicity
Based on these parameters, we propose to start with the following vocabularies:
• Actors: these can be persons, companies, named entities
• Locations
• Concepts: this vocabulary acts like a dictionary of which the locations and actors are instantiations.
• Categories: this vocabulary will bring in the relations between the concepts (hierarchy).
Each of these vocabularies differs in these parameters, as shown below. That is why it is better to split them up into separate vocabularies. Each vocabulary will have a different design strategy, mapping strategy, and maintenance strategy.
As explained earlier, SKOS allows extending and altering the vocabulary at a later stage without breaking backwards compatibility, i.e., without the need to alter the records that are described with it. SKOS also allows a top-down, a bottom-up, or a hybrid approach. That is why we propose to start each vocabulary flat and concise, and to extend the concepts and the relationships at a later stage.

Of the four proposed vocabularies, two can be implemented quite fast and two will progress at a different speed. The two to start with are the actor vocabulary and the location vocabulary. Actors and locations can be identified unambiguously. Concepts and categories are much more ambiguous in nature. One can always argue about the hierarchy of some concepts, and certain concepts are far more difficult to identify than named entities. For instance, is a "Space Shuttle" different from a "Space Transportation System"? Hard to tell, while the "Discovery" and "Atlantis" were certainly different named entities.
Locations are easy to disambiguate, based on their coordinates. You can use a bottom-up approach: identify the locations first and add some hierarchy later on. There is also a good reference point for locations, i.e., GeoNames².

Actors are a little more difficult, but they too can be identified unambiguously, assuming of course that all the necessary information is available. This is the difference with the categories and concepts vocabularies: with the latter two, even if enough information is available to identify what is meant, confusion can still remain. Persons, for instance, even when they have the same name, can always be identified unambiguously if the data provider can deliver their name in combination with their birthplace and birthdate. In this context, too, there are good starting points. The VRT has a huge controlled vocabulary of actors from which the Unified Thesaurus Actors vocabulary can start. Other good starting points could be DBpedia³ or Freebase⁴, but these are not that specialized in Belgian actors. The approach here is definitely bottom-up. Actually, it is better not to start with a hierarchy for persons: they can always change location, role, name, …

The categories controlled vocabulary will be very difficult. It will add the hierarchy to the concepts vocabulary. This is always a very subjective process and hard to agree on. For this reason, we propose to start this vocabulary with a top-down approach at a later stage, after the location and actor vocabularies have been started. In a first stage, you should limit the number of categories and also the category depth, e.g., first start with top concepts only, thus only two layers, and add extra layers later on. A good starting point here would again be DBpedia, which has a lot of hierarchy in its categories, or something like WordNet⁵. WordNet is in fact a dictionary, but with some hierarchy defined in it.

The concepts vocabulary will be more difficult than the location and actor vocabularies, but easier than the categories vocabulary. This vocabulary will be flat and just needs to be populated, for which automatic extraction tools can be used. Thus, here we will follow a strictly bottom-up approach. The difficulties here are the associative relationships and the equivalence relationships. When are two concepts truly different, which are related, etc.? Here a starting point can be WordNet, DBpedia or Freebase.

² http://www.geonames.org/
³ http://dbpedia.org
⁴ http://www.freebase.com
⁵ http://wordnet.princeton.edu/
Appendix: SKOS Technical
In this section, an overview is given of important aspects of SKOS that need to be taken into consideration when modeling a KOS in order to maintain consistency and to obey all integrity conditions defined in the SKOS specification. Additionally, a number of best practices are formulated. Later on, in the following Chapter, these best practices are employed for designing a multilingual vocabulary for describing contemporary art to support multilingual search in Europeana. The SKOS specification is defined in (2). A primer introducing SKOS is available at (3). SKOS use cases are described in (4).
SKOS vocabulary
Table 7 and Table 8 give an overview of all the terms introduced in the SKOS data model⁶.
Table 7 gives an overview of the introduced classes (names of classes start by convention with
a capital letter) while Table 8 lists the properties (property names start by convention with a
lowercase character). The SKOS namespace URI is ‘http://www.w3.org/2004/02/skos/core#’. By
convention, the prefix used with this namespace is ‘skos’.
URI
skos:Concept
skos:ConceptScheme
skos:Collection
skos:OrderedCollection
Table 7: Classes introduced in SKOS specification
URI                     | Domain                 | Range                                | Symmetric | Transitive
skos:inScheme           | rdfs:Resource          | skos:ConceptScheme                   |           |
skos:hasTopConcept      | skos:ConceptScheme     | skos:Concept                         |           |
skos:topConceptOf       | skos:Concept           | skos:ConceptScheme                   |           |
skos:altLabel           | -                      | rdfs:Literal                         |           |
skos:hiddenLabel        | -                      | rdfs:Literal                         |           |
skos:prefLabel          | -                      | rdfs:Literal                         |           |
skos:notation           | -                      | rdfs:Resource                        |           |
skos:changeNote         | -                      | rdfs:Resource                        |           |
skos:definition         | -                      | rdfs:Resource                        |           |
skos:editorialNote      | -                      | rdfs:Resource                        |           |
skos:example            | -                      | rdfs:Resource                        |           |
skos:historyNote        | -                      | rdfs:Resource                        |           |
skos:note               | -                      | rdfs:Resource                        |           |
skos:scopeNote          | -                      | rdfs:Resource                        |           |
skos:broader            | skos:Concept           | skos:Concept                         |           |
skos:broaderTransitive  | skos:Concept           | skos:Concept                         |           | X
skos:narrower           | skos:Concept           | skos:Concept                         |           |
skos:narrowerTransitive | skos:Concept           | skos:Concept                         |           | X
skos:related            | skos:Concept           | skos:Concept                         | X         |
skos:semanticRelation   | skos:Concept           | skos:Concept                         |           |
skos:member             | skos:Collection        | union(skos:Concept, skos:Collection) |           |
skos:memberList         | skos:OrderedCollection | rdf:List                             |           |
skos:broadMatch         | skos:Concept           | skos:Concept                         |           |
skos:closeMatch         | skos:Concept           | skos:Concept                         | X         |
skos:exactMatch         | skos:Concept           | skos:Concept                         | X         | X
skos:mappingRelation    | skos:Concept           | skos:Concept                         |           |
skos:narrowMatch        | skos:Concept           | skos:Concept                         |           |
skos:relatedMatch       | skos:Concept           | skos:Concept                         | X         |
Table 8: Properties introduced in SKOS specification
⁶ The SKOS data model is defined as an OWL Full ontology.
The following sections give a detailed overview of the SKOS vocabulary.
SKOS classes
The SKOS specification defines four classes: Concept, ConceptScheme, Collection and
OrderedCollection. Each class is defined as an instance of owl:Class (defined in the OWL
specification). The following subsections describe each class in more detail.
The skos:Concept class
Instances of the skos:Concept class represent concepts that will be used in the KOS. The
SKOS specification states that an instance of the Concept class can be viewed as an idea or
notion; a unit of thought. This is a very flexible view. However, during KOS modeling it is not
always straightforward to decide what should be modeled as concept.
In order to introduce a SKOS concept, it is sufficient to state (in RDF) that a resource is an
instance of the skos:Concept class, as illustrated below (Turtle notation).
<MediaArt> rdf:type skos:Concept .
The skos:ConceptScheme class
SKOS offers the means of representing a KOS using the skos:ConceptScheme class. A SKOS
concept scheme can be viewed as an aggregation of one or more SKOS concepts. Semantic
relationships between those concepts may also be viewed as part of a concept scheme,
although (as will be discussed next) it can only be formally stated which concepts are part of a
concept scheme. SKOS provides no means to state which relationships between concepts are
part of a concept scheme.
Instances of the skos:ConceptScheme class are used to aggregate instances of the
skos:Concept class. In order to introduce a SKOS concept scheme, it is sufficient to provide a
resource with a type indication of skos:ConceptScheme, as illustrated below.
<ContemporaryArt> rdf:type skos:ConceptScheme .
In order to add a concept to a concept scheme, the skos:inScheme property is used.
<MediaArt> skos:inScheme <ContemporaryArt> .
The skos:inScheme property is defined as an instance of the class owl:ObjectProperty. The
range (asserted by using the rdfs:range property defined in RDF Schema) of skos:inScheme is
the class skos:ConceptScheme. Note that no domain is stated for the skos:inScheme property,
i.e., the domain is the class of all resources (rdfs:Resource). This decision has been made to
provide more flexibility.
Important to note is that SKOS allows instances of the skos:Concept class to belong to more
than one concept scheme. This is in contrast to many information systems, which do not allow
using concepts in more than one concept scheme. This flexibility allows new concept schemes
to be defined by using concepts defined in other concept schemes.
ex:MediaArt skos:inScheme ex:ExternalConceptScheme .
ex:MediaArt skos:inScheme <ContemporaryArt> .
Integrity constraint: The extension (i.e., the set of instances of a class) of the
skos:ConceptScheme class must be disjoint with the extension of the skos:Concept class. This
means that a resource cannot simultaneously be an instance of the skos:Concept class and the
skos:ConceptScheme class.
Best practice: As a concept scheme can contain a very large amount of concepts, a best
practice is to formally state which concepts are positioned at the top in the hierarchy of the
concept scheme.
This can be done by using one of the following properties: skos:hasTopConcept or
skos:topConceptOf (which is defined as a sub-property of skos:inScheme). The properties
skos:hasTopConcept and skos:topConceptOf are defined as inverse properties (using the
owl:inverseOf property defined in OWL). This means that when the following fact is stated:
<ContemporaryArt> skos:hasTopConcept <MediaArt> .
an OWL-reasoner will entail the fact:
<MediaArt> skos:topConceptOf <ContemporaryArt> .
Best practice: In order to improve readability, it is advised to only use one of these two
(skos:hasTopConcept or skos:topConceptOf) properties during modeling.
Note that the use of skos:hasTopConcept (or skos:topConceptOf) is a best practice and
therefore there are no integrity conditions on their usage. As a consequence, it is possible
to define structures that do not seem logical, but are still
consistent with the SKOS specification, as illustrated in the following example.
<ContemporaryArt> skos:hasTopConcept <Animation> .
<Animation> skos:broader <MediaArt> .
<MediaArt> skos:inScheme <ContemporaryArt> .
Here a top concept of the scheme has a broader concept in that same scheme. Although this
does not seem valid from a modeling perspective, this example is consistent with
the SKOS specification.
For every concept that must be part of a concept scheme, a formal assertion is needed stating
that the concept belongs to the concept scheme. For example, when a concept has been linked
to a concept scheme (e.g., using the skos:inScheme property) and this concept has several
narrower concepts, this does not imply that these narrower concepts are also part of the
concept scheme, unless explicitly stated. This means for example that the following facts:
<MediaArt> skos:narrower <Animation> .
<MediaArt> skos:inScheme <ContemporaryArt> .
will not entail the fact
<Animation> skos:inScheme <ContemporaryArt> .
Note that because skos:topConceptOf is defined as a sub-property of skos:inScheme, the
usage of this property allows an RDFS-reasoner to entail the fact that the concept belongs to
the concept scheme. The fact
<MediaArt> skos:topConceptOf <ContemporaryArt> .
entails
<MediaArt> skos:inScheme <ContemporaryArt> .
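The entailments described in this subsection can be imitated with a few lines of Python. The sketch below is not a reasoner: it hard-codes only the two rules discussed above (the inverse relation between skos:hasTopConcept and skos:topConceptOf, and skos:topConceptOf being a sub-property of skos:inScheme) and represents triples as plain tuples of strings. The function name `infer_scheme_membership` is hypothetical.

```python
HAS_TOP = "skos:hasTopConcept"
TOP_OF = "skos:topConceptOf"
IN_SCHEME = "skos:inScheme"


def infer_scheme_membership(triples):
    """Return the asserted triples plus the facts derived by the two rules."""
    inferred = set(triples)

    # Rule 1: skos:hasTopConcept and skos:topConceptOf are inverse properties.
    for subject, predicate, obj in triples:
        if predicate == HAS_TOP:
            inferred.add((obj, TOP_OF, subject))
        elif predicate == TOP_OF:
            inferred.add((obj, HAS_TOP, subject))

    # Rule 2: skos:topConceptOf is a sub-property of skos:inScheme.
    for subject, predicate, obj in list(inferred):
        if predicate == TOP_OF:
            inferred.add((subject, IN_SCHEME, obj))

    return inferred


asserted = {
    ("<ContemporaryArt>", HAS_TOP, "<MediaArt>"),
    ("<MediaArt>", "skos:narrower", "<Animation>"),
}
closure = infer_scheme_membership(asserted)

for triple in sorted(closure - asserted):
    print("entailed:", *triple)
```

Note that nothing is derived for <Animation>: being narrower than a concept of the scheme does not make it a member of the scheme, which is exactly the behaviour described above.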
The skos:Collection class
An instance of the skos:Collection class can be used to group concepts. This can be useful to
state that some concepts have something in common. This collection can then for instance be
labeled or provided with additional information.
To introduce a collection, a resource has to be typed as being an instance of the class
skos:Collection. Concepts can be added to this collection using the skos:member property.
<MyInterests> rdf:type skos:Collection ;
skos:member <Animation> , <VideoGameArt> , <MultimediaInstallation> .
The domain of the skos:member property is skos:Collection, the range is the union of the
classes skos:Concept and skos:Collection. This allows nesting of collections.
The skos:OrderedCollection class
The skos:OrderedCollection class has a similar purpose as the skos:Collection class. However,
this class should be used when the order of the elements appearing in a collection is important.
The skos:OrderedCollection class is defined as a subclass of the skos:Collection class.
Therefore, when a resource is typed as being an instance of the skos:OrderedCollection class, it is
derived that this resource is also an instance of the skos:Collection class.
An ordered collection is introduced by typing a resource with skos:OrderedCollection. Then a list
is constructed containing all member concepts. This list is then linked to the instance of the
skos:OrderedCollection class using the skos:memberList property. The domain of this property
is the class skos:OrderedCollection, the range is rdf:List. A Collection instance can be inferred
from a skos:OrderedCollection instance.
The skos:memberList property is defined as a functional property, as it is not logical that an
ordered collection has multiple member lists. However, this is not an integrity constraint, as a
violation could not be detected without explicitly stating that two lists are different.
Restrictions on class usage
Integrity constraint: The classes skos:Concept, skos:ConceptScheme and skos:Collection are
disjoint. This means, for example, that it cannot be stated that a resource is an instance of either
the skos:Concept or skos:ConceptScheme class once it has been stated that it is an instance of
the skos:Collection class.
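A simple consistency check for this constraint could look as follows. This is a minimal sketch, assuming the vocabulary is available as a list of (subject, predicate, object) tuples with prefixed names as strings; the function name is hypothetical. Because skos:OrderedCollection is a subclass of skos:Collection, the sketch treats both as the same class for the purpose of the check.

```python
from collections import defaultdict

DISJOINT_CLASSES = {"skos:Concept", "skos:ConceptScheme", "skos:Collection"}


def class_disjointness_violations(triples):
    """Map each resource typed with more than one of the disjoint classes to those classes."""
    classes_of = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate != "rdf:type":
            continue
        # An ordered collection is also a collection (subclass relation).
        if obj == "skos:OrderedCollection":
            obj = "skos:Collection"
        if obj in DISJOINT_CLASSES:
            classes_of[subject].add(obj)
    return {res: cls for res, cls in classes_of.items() if len(cls) > 1}


triples = [
    ("<MediaArt>", "rdf:type", "skos:Concept"),
    ("<ContemporaryArt>", "rdf:type", "skos:ConceptScheme"),
    ("<MyInterests>", "rdf:type", "skos:OrderedCollection"),
    ("<MyInterests>", "rdf:type", "skos:Concept"),  # deliberately inconsistent
]

for resource, classes in class_disjointness_violations(triples).items():
    print(resource, "is typed with disjoint classes:", sorted(classes))
```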
SKOS properties
Labeling properties
The SKOS specification defines three properties that can be used to provide resources with
labels: skos:prefLabel, skos:altLabel, skos:hiddenLabel.
These properties are defined as instances of the owl:AnnotationProperty class defined in OWL
(11) and as sub-properties of the rdfs:label property, which is defined in the RDFS (RDF
Schema) vocabulary (8). As a result, the range of these properties is the rdfs:Literal class.
No domain has been specified for these properties. Therefore, these properties can be used to
provide labels to any resource.
The property skos:prefLabel is used to denote a preferred lexical label for a resource. In an
information system, this label will typically be presented to the user.
<MediaArt> skos:prefLabel "Media Art"@en .
Integrity constraint: A resource is not allowed to have more than one preferred label for a
specific language tag.
The property skos:altLabel is used to provide alternative labels for a concept. These alternative
labels represent valid alternative representations for a resource, but are not the preferred
representation. Therefore, these may also be displayed to the user.
<MediaArt> skos:altLabel "MultimediaArt"@en .
Note that it is valid to provide multiple alternative labels for a specific language.
The property skos:hiddenLabel is used when a label is not expected to be shown to a user, but
is to be used for example for text indexing. In this way, a misspelled label (or a label that has
been deprecated as a consequence of spelling rule changes) can be provided to a resource
allowing a search application to find the relevant resource.
Integrity constraint: The properties skos:prefLabel, skos:altLabel, skos:hiddenLabel are
pairwise disjoint.
Best practice: Every concept should have a preferred label.
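The labeling rules above lend themselves to automatic checking. The following sketch tests three of them: at most one preferred label per language tag, pairwise disjointness of the three labeling properties, and the best practice that every concept has a preferred label. The input format (tuples of concept, property, label text and language tag) and the function name `check_labels` are assumptions made for this illustration.

```python
from collections import defaultdict

LABEL_PROPERTIES = ("skos:prefLabel", "skos:altLabel", "skos:hiddenLabel")


def check_labels(concepts, labels):
    """Return human-readable problems found in the labels.

    concepts: iterable of concept identifiers
    labels:   iterable of (concept, property, text, language_tag) tuples
    """
    problems = []
    pref_by_lang = defaultdict(list)      # (concept, lang) -> preferred label texts
    props_by_literal = defaultdict(set)   # (concept, text, lang) -> labeling properties

    for concept, prop, text, lang in labels:
        if prop not in LABEL_PROPERTIES:
            continue
        props_by_literal[(concept, text, lang)].add(prop)
        if prop == "skos:prefLabel":
            pref_by_lang[(concept, lang)].append(text)

    # Integrity constraint: at most one preferred label per language tag.
    for (concept, lang), texts in sorted(pref_by_lang.items()):
        if len(set(texts)) > 1:
            problems.append(f"{concept}: more than one prefLabel for '{lang}'")

    # Integrity constraint: the labeling properties are pairwise disjoint.
    for (concept, text, lang), props in sorted(props_by_literal.items()):
        if len(props) > 1:
            problems.append(f'{concept}: "{text}"@{lang} is used with {sorted(props)}')

    # Best practice: every concept should have a preferred label.
    with_pref = {concept for concept, _lang in pref_by_lang}
    for concept in sorted(set(concepts) - with_pref):
        problems.append(f"{concept}: no prefLabel")

    return problems


concepts = ["MediaArt", "Animation"]
labels = [
    ("MediaArt", "skos:prefLabel", "Media Art", "en"),
    ("MediaArt", "skos:prefLabel", "Multimedia Art", "en"),  # second English prefLabel
    ("MediaArt", "skos:altLabel", "Media Art", "en"),        # clashes with the prefLabel
]

for problem in check_labels(concepts, labels):
    print(problem)
```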
Best practice: Following common practice in KOS design, the preferred label of a concept may
also be used to unambiguously represent this concept within a KOS and its applications. So
even though the SKOS data model does not formally enforce it, it is recommended that no two
concepts in the same KOS have the same preferred lexical label for any given language tag.
The object of triples using a labeling property is a plain literal. As already mentioned, a plain
literal has an optional language tag. The language tag then indicates the language used in the
literal (21). This indication is important for many types of information processing, which require
knowledge of the language in which information is expressed in order for the information
processing to be performed, e.g., spell-checking and computer-synthesized speech.
Best practice: Always provide language tags to literals when using SKOS labeling properties.
As it is best practice to always provide a language tag to labeling properties, this can lead to
multiple modeling options. Consider the following example where we want to label a concept
with different labels in different languages.
Minimal labeling approach
<MediaArt> rdf:type skos:Concept;
    skos:prefLabel "Media Art"@en;
    skos:altLabel "Multimedia Art"@en;
    skos:altLabel "MM Art"@en;
    skos:prefLabel "Media Kunst"@nl;
    skos:altLabel "Multimedia Kunst"@nl;
    skos:altLabel "MM Kunst"@nl;
    skos:narrower <Animation>;
    skos:narrower <MultimediaInstallation>.
<Animation> rdf:type skos:Concept;
    skos:prefLabel "Animatie"@nl;
    skos:prefLabel "Animation"@en.
<MultimediaInstallation> rdf:type skos:Concept;
    skos:prefLabel "Multimedia Installatie"@nl;
    skos:prefLabel "Multimedia Installation"@en.
In this example, we provided both a Dutch and English preferred label to the concepts. In
addition, we added abbreviations for these labels as alternative labels. The decision to use the
full label instead of the abbreviation as preferred label was taken because it is best practice to
have unique preferred labels for each concept in the KOS (and using the full representation
minimizes the possibility of duplicate preferred labels). However, if the abbreviations are actually
the preferred labels (this depends on the usage scenario of the KOS) and no other concepts
appear in the KOS having the same abbreviations, it can be decided to use the abbreviations as
preferred labels.
The model above does not state that the label “Multimedia Art” is also used in Dutch to refer to
this concept. This is because, according to (21), the language tag indicates the language used
in the label, which is English in this case.
When using this modeling approach, it is up to the application using the KOS to use labels from
other languages in the absence of a label for a specific language (e.g., Dutch). When a user, for
example, has selected Dutch as the preferred language to be used in the user interface, the
application cannot present Dutch labels for the concepts above, as these are not present. One
solution is to present the English labels by default in the absence of a Dutch label.
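The fallback behaviour described above could be implemented in the application roughly as follows. This is a minimal sketch: the function name `pick_label`, the dictionary representation of the preferred labels, and the concept that only has an English label are hypothetical and serve only to illustrate the idea.

```python
def pick_label(pref_labels, preferred_lang, fallback_langs=("en",)):
    """Return (label, language) for display, or None when no usable label exists.

    pref_labels: dict mapping a language tag to the preferred label in that language
    """
    for lang in (preferred_lang, *fallback_langs):
        if lang in pref_labels:
            return pref_labels[lang], lang
    return None


# A concept with Dutch and English labels, and one with an English label only.
animation = {"nl": "Animatie", "en": "Animation"}
stop_motion = {"en": "Stop Motion"}

print(pick_label(animation, "nl"))
print(pick_label(stop_motion, "nl"))
```

Returning the language together with the label lets the user interface signal that a label is shown in a language other than the one the user selected.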
Maximal labeling approach
When it is not expected that the KOS will be subject to information processing applications (e.g.,
when the KOS is only used as a navigational assistance tool in a search application) and/or the
application using the KOS is not able to select labels defined in other languages, it could be
decided to explicitly model the labels that are used in Dutch, although the label itself is actually
an English label.
<MediaArt> rdf:type skos:Concept;
    skos:prefLabel "Media Art"@en;
    skos:altLabel "Multimedia Art"@en;
    skos:altLabel "MM Art"@en;
    skos:prefLabel "Media Kunst"@nl;
    skos:altLabel "Multimedia Kunst"@nl;
    skos:altLabel "MM Kunst"@nl;
    skos:altLabel "Media Art"@nl;
    skos:narrower <Animation>;
    skos:narrower <MultimediaInstallation>.
<Animation> rdf:type skos:Concept;
    skos:prefLabel "Animatie"@nl;
    skos:altLabel "Animation"@nl;
    skos:prefLabel "Animation"@en.
<MultimediaInstallation> rdf:type skos:Concept;
    skos:prefLabel "Multimedia Installatie"@nl;
    skos:altLabel "Multimedia Installation"@nl;
    skos:prefLabel "Multimedia Installation"@en.
The choice between these two modeling approaches depends on the usage of the KOS.
Notation properties
A notation is a lexical code used to uniquely identify a concept of a concept scheme. Although in
the Web of Data (and SKOS) the URI is the preferred way to identify a concept, notations can
be useful as a bridging mechanism.
A notation can be added to a skos:Concept instance using the skos:notation property.
<Fiction> skos:notation "1.2.32"^^<GenreNotationDatatype> .
The skos:notation property is defined as an instance of the owl:DatatypeProperty class defined
in OWL.
Best practice: Use only typed literals when providing notations. The (user-defined) datatype
URI then denotes a system of notations or classification codes. When providing preferred labels
for a concept, it is best practice to not use typed literals but plain literals.
Best practice: No two concepts in a concept scheme should be given an identical notation.
Properties denoting a hierarchical relation
The following properties are defined in SKOS to denote a hierarchical relation (i.e., to state that
one concept is more general than another one):
• skos:broader
• skos:narrower
• skos:broaderTransitive
• skos:narrowerTransitive
These properties are all defined as instances of the owl:ObjectProperty class. skos:broader and
skos:narrower are defined as inverse properties of each other. Similarly, skos:broaderTransitive
and skos:narrowerTransitive are defined as inverse properties of each other.
The properties skos:broaderTransitive, skos:narrowerTransitive and skos:related are defined as
sub-properties of the skos:semanticRelation property. The domain and range of the
skos:semanticRelation property is skos:Concept. skos:broader is defined as a sub-property of
skos:broaderTransitive. Similarly, the skos:narrower property is defined as a sub-property of
skos:narrowerTransitive.
Best practice: Only the properties skos:broader and/or skos:narrower should be used during
modeling. The properties skos:broaderTransitive and skos:narrowerTransitive should be used
by an application to access the transitive closure of a hierarchy expressed with skos:broader
and skos:narrower properties.
Best practice: Only use the skos:broader and skos:narrower properties to assert a direct
hierarchical link between two concepts.
In this way, an application is able to retrieve the direct broader and narrower terms of a concept.
For this reason, the properties skos:broader and skos:narrower are not defined as transitive
properties. Consider the following example.
<MediaArt> skos:narrower <Animation> .
<Animation> skos:narrower <StopMotion> .
As the skos:narrower property is not defined as a transitive property, the fact
<MediaArt> skos:narrower <StopMotion> .
will not be entailed by a reasoner. If it were entailed (i.e., if skos:narrower were
defined as a transitive property), it would become impossible to retrieve only the direct narrower
concepts.
However, in some applications, it can be important to retrieve all narrower (or broader)
concepts, even if these concepts are not directly related. For this reason, SKOS defined the
skos:broaderTransitive and skos:narrowerTransitive properties. As the property names suggest,
these properties are defined as transitive properties (by defining these properties as instances
of the owl:transitiveProperty class). The properties skos:broader and skos:narrower are defined
as a sub-property of the
skos:broaderTransitive and skos:narrowerTransitive property
respectively. As a result, after applying an OWL-reasoner, the following facts (among others)
will be entailed:
<MediaArt> skos:narrowerTransitive <Animation> .
<Animation> skos:narrowerTransitive <StopMotion> .
<MediaArt> skos:narrowerTransitive <StopMotion> .
The first two entailed facts are the result of the fact that skos:narrower is defined as a
sub-property of skos:narrowerTransitive. The last fact is entailed because the
skos:narrowerTransitive property is defined as a transitive property.
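The entailment described above can be sketched in a few lines of Python. This is only an illustration of what a reasoner derives, not a replacement for one: the concept names are taken from the example, while the function name narrower_transitive and the plain tuple representation of triples are assumptions made for this sketch.

```python
from collections import defaultdict

# Asserted skos:narrower links from the example (direct links only).
narrower = {
    ("MediaArt", "Animation"),
    ("Animation", "StopMotion"),
}


def narrower_transitive(asserted):
    """Return the transitive closure of the asserted skos:narrower pairs."""
    children = defaultdict(set)
    for parent, child in asserted:
        children[parent].add(child)

    closure = set()
    for start in list(children):
        stack = list(children[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            # .get() avoids creating new keys while we iterate.
            stack.extend(children.get(node, ()))
    return closure


entailed = narrower_transitive(narrower)

# Direct narrower concepts come from the asserted links only,
# all narrower concepts come from the entailed closure.
direct_of_media_art = {c for p, c in narrower if p == "MediaArt"}
all_of_media_art = {c for p, c in entailed if p == "MediaArt"}
```

Keeping the asserted links and the closure in separate sets mirrors the split between skos:narrower and skos:narrowerTransitive: the application can still ask for direct children only.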
Note that SKOS allows multiple skos:broader and skos:narrower relations to be asserted between
concepts. For example, it is valid to assert the following:
<MediaArt> skos:narrower <Animation> .
<Animation> skos:narrower <StopMotion> .
<MediaArt> skos:narrower <StopMotion> .
However, the above example is not considered best practice, as skos:narrower should only be
used to link a concept to its direct narrower concepts.
Best practice: In order to improve readability, it is advised to use only one of the
skos:broader and skos:narrower properties when modeling a KOS. Which property to use
depends on the modeling approach. If the modeling approach follows a top-down
pattern, then the usage of the skos:narrower property is preferred. Alternatively, when the
modeling approach is bottom-up, the usage of the property skos:broader is preferred.
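Since skos:broader and skos:narrower are declared as inverses of each other in SKOS, asserting one of them is sufficient and the other direction can be derived. The following minimal sketch shows this for a bottom-up model. The triples are illustrative data and the helper name invert is an assumption of this sketch.

```python
# Links asserted bottom-up with skos:broader only (illustrative data).
broader = {
    ("StopMotion", "Animation"),
    ("Animation", "MediaArt"),
}


def invert(pairs):
    """Swap subject and object of every pair."""
    return {(obj, subj) for subj, obj in pairs}


# The skos:narrower view is derived and never has to be maintained by hand.
narrower = invert(broader)
```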
Properties denoting an associative relation
The property skos:related can be used to state that two concepts are related to each other,
but neither is more general or specific (otherwise, the skos:broader or the skos:narrower
property should be used). skos:related is defined as a symmetric property, but it is not defined
as a transitive property.
Integrity constraint: skos:related is disjoint with skos:broaderTransitive (and consequently with
skos:narrowerTransitive, skos:broader and skos:narrower). Although this integrity constraint is
not formally stated in the SKOS ontology, when this is not obeyed, the model is considered not
consistent.
For example, the following is considered not consistent as there is a conflict between
associative links and hierarchical links.
<MediaArt> skos:narrower <Animation> .
<Animation> skos:narrower <StopMotion> .
<MediaArt> skos:related <StopMotion> .
This is an important constraint that should be taken into consideration when developing a KOS
using SKOS. If two resources are related using the property skos:related, there must not be a
chain using skos:broader, skos:narrower (and also skos:broaderTransitive and
skos:narrowerTransitive, but these last two should not be used during modeling anyhow)
between these two concepts.
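Because this constraint is not enforced by the SKOS ontology itself, it can be useful to check it explicitly. The following is a minimal sketch of such a check under simplifying assumptions: triples are held as plain Python tuples, and the function names descendants and related_clashes are hypothetical.

```python
# Asserted links from the inconsistent example.
narrower = {("MediaArt", "Animation"), ("Animation", "StopMotion")}
related = {("MediaArt", "StopMotion")}


def descendants(concept, narrower_pairs):
    """Collect every concept reachable from `concept` via skos:narrower."""
    found, stack = set(), [concept]
    while stack:
        current = stack.pop()
        for parent, child in narrower_pairs:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def related_clashes(narrower_pairs, related_pairs):
    """Return the pairs that are related and also hierarchically connected."""
    clashes = set()
    for a, b in related_pairs:
        # skos:related is symmetric, so both directions are checked.
        if b in descendants(a, narrower_pairs) or a in descendants(b, narrower_pairs):
            clashes.add(frozenset((a, b)))
    return clashes


clashes = related_clashes(narrower, related)
```

For the example above, the check reports the pair MediaArt and StopMotion, because StopMotion is reachable from MediaArt through a chain of skos:narrower links.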
Finally, note that both the domain and the range of the skos:semanticRelation property are
defined as skos:Concept. Therefore, this property (and properties defined as sub-properties of
skos:semanticRelation) cannot be applied to, for example, a skos:Collection, as this
would render the model inconsistent.
Documentation properties
It is often desired to provide resources with additional information. For this purpose the SKOS
specification defines the following properties:
• skos:note
• skos:changeNote
• skos:definition
• skos:editorialNote
• skos:example
• skos:historyNote
• skos:scopeNote
skos:changeNote, skos:definition, skos:editorialNote, skos:example, skos:historyNote and
skos:scopeNote are defined as sub-properties of the more general skos:note property. All
properties are also defined as instances of the owl:AnnotationProperty class. As no domain is
defined for these properties, they can be used to annotate any resource (a skos:Concept, a
skos:ConceptScheme, a resource of type owl:Class, etc.). Also, no range is defined for these
properties, allowing both RDF plain literals and other resources to appear as the object of the
assertion. For example:
<ExampleConcept> skos:note "note containing additional information"@en .
<ExampleConceptScheme> skos:note <ConceptSchemeNoteResource> .
Best practice: The skos:note property is a property that can be used for general documentation
purposes. However, it is advised to use the more specialized properties when applicable.
The following list explains when to use each property.
• skos:scopeNote: Is used to provide additional information related to the intended
meaning/usage of the resource. This additional information may not be complete.
• skos:definition: Is used to give a complete description of the intended meaning of the
resource.
• skos:example: Is used to give an example of the usage of the concept.
• skos:historyNote: Is used to record that, and how, the meaning of the resource has
changed.
• skos:editorialNote: Is used to provide general information for the editor of the KOS.
• skos:changeNote: Is used to provide detailed information about modifications to the
KOS.
The skos:editorialNote and skos:changeNote properties are intended to be used to provide
information to the KOS editor(s) and not to their users.
Mapping properties
Mapping concepts defined in different concept schemes can be very useful in many
applications. For this reason, the SKOS specification introduced the following mapping
properties:
• skos:mappingRelation
• skos:exactMatch
• skos:closeMatch
• skos:broadMatch
• skos:narrowMatch
• skos:relatedMatch
All properties stated above are defined as instances of the owl:ObjectProperty class.
skos:closeMatch, skos:broadMatch, skos:narrowMatch and skos:relatedMatch are
sub-properties of skos:mappingRelation. skos:mappingRelation is defined as a sub-property of
skos:semanticRelation. Therefore, the domain and range of these properties is skos:Concept.
Best practice: When two concepts of different concept schemes denote the same concept, and
therefore could be used interchangeably, the skos:exactMatch property can be used to map
these concepts. Note that the skos:exactMatch property is defined as a transitive property, and
therefore care should be taken when using this property in modeling. This property should only
be used when two concepts have a similar intended meaning over a broad range of applications
and concept schemes.
If this is not the case, i.e., the concepts can only be used interchangeably in some applications,
the property skos:closeMatch should be used instead. The reason why skos:closeMatch is
preferred in this case is that it is not defined as a transitive property. skos:exactMatch is defined
as a sub-property of skos:closeMatch.
The skos:broadMatch and skos:narrowMatch properties can be used to define a hierarchical
mapping between concepts defined in separate concept schemes. These properties are defined
as inverse properties of each other. skos:broadMatch and skos:narrowMatch are defined as
sub-properties of skos:broader and skos:narrower respectively.
An associative mapping between concepts defined in separate concept schemes is established
using the skos:relatedMatch property. skos:relatedMatch is defined as a sub-property of
skos:related.
The skos:closeMatch, skos:exactMatch and skos:relatedMatch are defined as symmetric
properties.
Integrity constraint: skos:exactMatch is disjoint with the properties skos:broadMatch,
skos:narrowMatch and skos:relatedMatch. A KOS is not considered consistent according to the
SKOS data model when a clash can be found between exact mappings and associative or
hierarchical mappings. The following example is therefore considered not consistent according
to the SKOS data model.
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix my: <http://example.org/myvocab/> .
@prefix ext: <http://example.com/vocab/> .
my:MediaArt skos:exactMatch ext:MediaArt .
my:MediaArt skos:relatedMatch ext:MediaArt .
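A clash of this kind arises when the same two concepts are linked by skos:exactMatch and also by skos:broadMatch, skos:narrowMatch or skos:relatedMatch. The sketch below shows one possible way to detect such pairs. It is deliberately simplified: it compares asserted links only and does not compute the transitive closure of skos:exactMatch. The function names and the concept ext:Art are assumptions made for illustration.

```python
# Asserted mapping links (illustrative data, ext:Art is hypothetical).
exact = {("my:MediaArt", "ext:MediaArt")}
other = {
    "skos:relatedMatch": {("my:MediaArt", "ext:MediaArt")},
    "skos:broadMatch": {("my:MediaArt", "ext:Art")},
}


def unordered(pairs):
    """Ignore direction: the same two concepts clash whichever way they are linked."""
    return {frozenset(pair) for pair in pairs}


def exact_match_clashes(exact_pairs, other_mappings):
    """Map each conflicting property to the concept pairs it shares with exactMatch."""
    exact_links = unordered(exact_pairs)
    found = {}
    for prop, pairs in other_mappings.items():
        hits = exact_links & unordered(pairs)
        if hits:
            found[prop] = hits
    return found


clashes = exact_match_clashes(exact, other)
```

Here only the skos:relatedMatch link is reported, since the skos:broadMatch link connects a different pair of concepts.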
OWL also provides some properties that can be used to map resources: owl:equivalentClass,
owl:equivalentProperty and owl:sameAs. The owl:equivalentClass is used to assert that two
classes have the same class extension. Similarly, the owl:equivalentProperty property is used to
assert that two properties have the same property extension. The owl:sameAs property is used
to assert that two resources actually refer to the same thing. At first glance, it can seem useful
to use the owl:sameAs property to map concepts from different concept schemes when they
denote the same thing.
Best practice: However, because of the formal consequences the usage of this property has, it
is not recommended to use the owl:sameAs property to map SKOS concepts.
Consider the following example as an illustration of the unwanted results the usage of
owl:sameAs can have.
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix my: <http://example.org/myvocab/> .
@prefix ext: <http://example.com/vocab/> .
my:MediaArt skos:prefLabel "Media Art"@en .
ext:Media-Art skos:prefLabel "Multimedia Art"@en .
my:MediaArt owl:sameAs ext:Media-Art .
After applying an OWL reasoner, the following additional facts (among others) will be entailed.
my:MediaArt skos:prefLabel "Multimedia Art"@en .
ext:Media-Art skos:prefLabel "Media Art"@en .
Note that now both resources have two different preferred English labels. As already noted, a
SKOS model is considered to be inconsistent when a concept has more than one preferred
label for a given language tag. This example illustrates why the use of skos:exactMatch is
preferred over the use of owl:sameAs.
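The effect of owl:sameAs on preferred labels can be imitated with a small Python sketch. It mimics only one consequence of owl:sameAs, namely that every statement about one resource also holds for the other, and it handles each sameAs pair in a single pass without following chains. The data structures and function names are assumptions of this sketch.

```python
# Preferred labels keyed by (resource, language tag), as in the example.
pref_labels = {
    ("my:MediaArt", "en"): {"Media Art"},
    ("ext:Media-Art", "en"): {"Multimedia Art"},
}
same_as = [("my:MediaArt", "ext:Media-Art")]


def apply_same_as(labels, same_as_pairs):
    """Copy preferred labels between resources declared identical."""
    merged = {key: set(values) for key, values in labels.items()}
    for a, b in same_as_pairs:
        langs = {lang for (res, lang) in merged if res in (a, b)}
        for lang in langs:
            union = merged.get((a, lang), set()) | merged.get((b, lang), set())
            merged[(a, lang)] = set(union)
            merged[(b, lang)] = set(union)
    return merged


def pref_label_violations(labels):
    """Return (resource, language) keys with more than one preferred label."""
    return {key for key, values in labels.items() if len(values) > 1}


merged = apply_same_as(pref_labels, same_as)
violations = pref_label_violations(merged)
```

Before the owl:sameAs statement is applied there are no violations. Afterwards both resources carry two English preferred labels, which is exactly the inconsistency described above.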
Best practice: Use the mapping properties skos:broadMatch, skos:narrowMatch and
skos:relatedMatch to define hierarchical and associative relations between concepts defined in
different concept schemes and use the skos:broader, skos:narrower and skos:related properties
to define hierarchical and associative relations between concepts defined in the same concept
scheme.
Concept grouping
When modeling a KOS, it is often desired to group concepts that have something in common.
We explain two possible approaches to do this.
Grouping concepts using collections.
Consider the following example. A concept ‘animation’ has been introduced as the top level
concept of a concept scheme about contemporary art. Additionally, concepts such as ‘cartoon’,
‘stopmotion’, ‘animatronics’, etc. are introduced as narrower concepts of the ‘animation’
concept. Note that these concepts can be grouped. For example, ‘cartoon’ and ‘animatronics’
can rely on computers for generating the animation, whereas ‘stopmotion’ and ‘cartoon’
animations can be generated by hand. This can be done using collections as illustrated below.
<Animation> rdf:type skos:Concept .
<Animation> skos:inScheme <ContemporaryArt> .
<Animation> skos:topConceptOf <ContemporaryArt> .
<cartoon> rdf:type skos:Concept .
<stopmotion> rdf:type skos:Concept .
<animatronics> rdf:type skos:Concept .
<cartoon> skos:inScheme <ContemporaryArt> .
<stopmotion> skos:inScheme <ContemporaryArt> .
<animatronics> skos:inScheme <ContemporaryArt> .
<Animation> skos:narrower <cartoon> .
<Animation> skos:narrower <stopmotion> .
<Animation> skos:narrower <animatronics> .
<HandGeneratedArt> rdf:type skos:Collection ;
    skos:member <stopmotion> , <cartoon> ;
    skos:prefLabel "Hand generated art"@en .
<ComputerGeneratedArt> rdf:type skos:Collection ;
    skos:member <cartoon> , <animatronics> ;
    skos:prefLabel "Computer generated art"@en .
As collections and concepts must be disjoint in SKOS, it is not possible to define semantic
relations for a collection (because the domain and range of skos:semanticRelation are defined
as skos:Concept). Therefore, grouping concepts into collections has no influence on the position
of a concept in the concept scheme. SKOS collections can be used to model hierarchies where
the notion denoted by the collection is not considered a real concept of the KOS.
It is up to the application using the SKOS model to decide how the concept hierarchy is
displayed (e.g., whether or not to include collection membership information in the displayed
hierarchy).
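The following sketch illustrates this choice for the collection example. The hierarchy is built from skos:narrower links only, and collection membership is added as an optional annotation at display time. The dictionaries mirror the example data, while the function name render and the bracket notation are assumptions of this sketch.

```python
# skos:narrower links and skos:member links from the collection example.
narrower = {
    "Animation": ["cartoon", "stopmotion", "animatronics"],
}
collections = {
    "HandGeneratedArt": ["stopmotion", "cartoon"],
    "ComputerGeneratedArt": ["cartoon", "animatronics"],
}


def render(concept, show_collections=False, depth=0):
    """Return the hierarchy below `concept` as a list of indented lines."""
    label = concept
    if show_collections:
        groups = sorted(
            name for name, members in collections.items() if concept in members
        )
        if groups:
            label += " [" + ", ".join(groups) + "]"
    lines = ["  " * depth + label]
    for child in narrower.get(concept, []):
        lines.extend(render(child, show_collections, depth + 1))
    return lines


plain = render("Animation")
annotated = render("Animation", show_collections=True)
```

In both renderings every concept stays at the same depth. The collections only add labels, which reflects the fact that collection membership does not change the position of a concept in the scheme.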
Grouping concepts using hierarchical relations.
SKOS concept schemes are often designed to be used as a navigation hierarchy (e.g., to assist
users in a search application). If this is the case, the designer can decide not to use collections
and to organize concepts using hierarchical relations only. This approach is illustrated in the
following example.
<Animation> rdf:type skos:Concept .
<Animation> skos:inScheme <ContemporaryArt> .
<Animation> skos:topConceptOf <ContemporaryArt> .
<HandGeneratedArt> rdf:type skos:Concept ;
    skos:prefLabel "Hand generated art"@en .
<ComputerGeneratedArt> rdf:type skos:Concept ;
    skos:prefLabel "Computer generated art"@en .
<cartoon> rdf:type skos:Concept .
<stopmotion> rdf:type skos:Concept .
<animatronics> rdf:type skos:Concept .
<HandGeneratedArt> skos:inScheme <ContemporaryArt> .
<ComputerGeneratedArt> skos:inScheme <ContemporaryArt> .
<cartoon> skos:inScheme <ContemporaryArt> .
<stopmotion> skos:inScheme <ContemporaryArt> .
<animatronics> skos:inScheme <ContemporaryArt> .
<Animation> skos:narrower <ComputerGeneratedArt> .
<Animation> skos:narrower <HandGeneratedArt> .
<HandGeneratedArt> skos:narrower <stopmotion> , <cartoon> .
<ComputerGeneratedArt> skos:narrower <cartoon> , <animatronics> .
The decision as to which approach to follow is left to the KOS designer. This decision can be
influenced by the intended use of the KOS (e.g., when the KOS will only be used as a
navigation aid) and by the application utilizing the KOS (e.g., when an application does not
support collection processing). However, from a pure modeling perspective, the first approach
using collections is often considered better practice than the sole use of hierarchical relations,
as it better models the semantics of the KOS.
Best practice: Nesting of concept schemes is not considered best practice. A concept scheme
should be considered as an aggregation of concepts only. Therefore, if a concept should be part
of more than one concept scheme, this should be denoted by using multiple
skos:inScheme properties: one for each concept scheme the concept belongs to.
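As a minimal illustration, the sketch below records scheme membership as flat (concept, scheme) pairs, one per skos:inScheme statement, and groups them per scheme. The second scheme FilmTechniques and the function name concepts_by_scheme are hypothetical and only serve the example.

```python
# One pair per skos:inScheme statement. A concept may appear in several schemes.
in_scheme = {
    ("Animation", "ContemporaryArt"),
    ("Animation", "FilmTechniques"),  # hypothetical second scheme
    ("stopmotion", "ContemporaryArt"),
}


def concepts_by_scheme(pairs):
    """Group concepts per concept scheme without nesting schemes."""
    schemes = {}
    for concept, scheme in pairs:
        schemes.setdefault(scheme, set()).add(concept)
    return schemes


schemes = concepts_by_scheme(in_scheme)
```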