Concept drift and how to identify it

Shenghui Wang (b,a,∗), Stefan Schlobach (a), Michel Klein (a)
(a) Vrije Universiteit Amsterdam, The Netherlands
(b) OCLC, Leiden, The Netherlands

Abstract

This paper studies concept drift over time. We first define the meaning of a concept in terms of intension, extension and label. Then we study concept drift over time using two theories: one based on concept identity and one based on concept morphing. We propose a qualitative toolkit for analysing concept drift: it detects concept shift and stability when concept identity is available, and concept split and the strength of the morphing chain under the morphing theory. We apply our framework in four case-studies: a political vocabulary in SKOS, the DBpedia ontology in RDFS, the LKIF-Core ontology in OWL and a few biomedical ontologies in OBO. We describe ways of identifying interesting changes in the meaning of concepts within given application contexts. These case-studies illustrate the feasibility of our framework in analysing concept drift in knowledge organisation schemas of varying expressiveness.

1. Introduction

Knowledge organisation systems (KOS), such as formal ontologies (e.g. modelled in OWL), thesauri or taxonomies (e.g. described in SKOS) or other term classification schemes, play a crucial role in providing semantic interoperability in many domains and use cases. They have become critical to the Web of Data for structured access to documents in libraries or to patient records based on diagnostic information, and for many more applications. In almost all modern types of KOS, concepts are the central constructs that are used to describe sets of objects with shared characteristics. Although it is widely recognised to be an oversimplification, most current systems consider their underlying KOS to be stable over time.
For many applications this starts to be a critical problem, and this paper attempts to provide a theory of what we call concept drift.¹

As the world is continuously changing, concepts also change over time. For example, a concept may refer to different objects at different points in time: the term Government of the Netherlands refers to different people in 1999 and in 2009. Or consider the concept Middle class, which has very different properties in various periods of time. Concept drift is known to be a problem in data mining and machine learning, where learned models lose their predictive power over time [50]. Here, we deal with concepts that are explicitly and symbolically represented in some KOS, and where the drift needs to be made explicit as well. In this sense our contribution is orthogonal and complementary to the work in the Machine Learning literature.

In this paper we will formally define what concept drift in a KOS is, and how to study its impact, given two different ontological paradigms, and without committing to a particular type of knowledge organisation scheme. These definitions are not intended to provide new philosophical insights, but aim at making existing accepted notions of intension, extension and labelling applicable in the context of dynamics of semantics.

Semantic drift: Throughout the theoretical part of the paper we will use the generic example of the European Union (the concept European Union) and the countries of the EU (EUCountry) as our concepts of reference.² The historical overview provided by the EU in [5] gives an interesting discussion of its development. Starting in the early Fifties “the European Union is set up” with “Belgium, France, Germany, Italy, Luxembourg and the Netherlands” as founding members.
Within the next six decades many other countries joined, and the European Union slowly transformed from an economic community into an “international organisation governing common economic, social, and security policies” [4]. The following list gives some definitions for the concept of the European Union over time.

(1979) The European Community is a common denominator for the European Economic Community (EEC), the European Coal and Steel Community (ECSC), and the European Atomic Energy Community (EAEC). [3]
(1999) The European Community is the new stage in the implementation of increasing the Union of the European people. [2]

∗ Corresponding author.

¹ Concept drift will be our name for change in meaning of a concept over time. This drift in meaning occurs not only over time, but also over location, culture, etc. For ease of presentation we will mostly refer to drift in time, but significant parts of the framework should extend to other kinds of “contexts.”

² We use two different concepts, EUCountry and European Union, as they exemplify two different kinds of change: the (intensional) definition of the EU is highly unstable and changes throughout history, but as a relatively abstract concept it is not so clear what its instances (extension) are. For EUCountry it is the opposite: the set of instances changes regularly, but its intensional definition as the countries of the EU is relatively stable.

Preprint submitted to Elsevier, June 14, 2013

Morphing-based concept drift: The previous analysis identifying shift and stability of the meaning of a concept makes sense under the assumption, or ontological commitment, that the concept of the European Union in 1995 is considered to be the same concept as the European Union in 2007. This is just one possible interpretation, and we do not want to restrict our framework to just this one world-view.
The discussion on identity is probably as old as philosophy itself, and has played a role in ontology engineering as well. We try to keep out of this discussion by providing methods that work both with identity and without it. For the latter, we consider concepts to pertain to just one moment in time and to evolve (morph) into new, but highly similar, concepts at each subsequent moment. Meaning drift is then defined as the degree of dissimilarity of these maximally similar concepts over time. Again, we will study drift with respect to intension, extension and labels.

In this alternative conceptualisation of concept drift we also want to identify some qualitative notions to describe more drastic forms of drift. One example is the notion of split, which occurs when the different semantic elements (intension, extension or label) morph into different concepts. When a series of concepts at different moments are linked by this morphing relation, a morphing chain is formed. The strength of the morphing chain, i.e., how similar the morphed concepts are, is another notion to study.

Research questions: To our knowledge there has been no formalisation of what concept drift actually means and implies. In order to identify different types of changes in concepts and to understand the impact of concept drift, such a formalisation is critical. Therefore, this paper focuses on the following research questions:

(2003) The European Union or EU is an international organisation of European states, established by the Treaty on European Union. [52]
(2006) The European Union (EU) is a supranational and intergovernmental union of 25 independent, democratic member states. [53]
(2010) The European Union (EU) is an economic and political union of 27 member countries, located primarily in Europe. [51]

Most of the aspects of the change of meaning of a concept over time that we introduce in this paper can be identified in this example.
The instances of the set of countries of the European Union, also called its extension, change (e.g. from 25 to 27 between 2006 and 2007). The label of the concept European Union changes from European Community to European Union. From a judicial perspective those are different concepts. However, the European Union website [5] considers the EU and EC to be the same concept. Note that it uses the label European Union to signify the concept that existed in 1952. There was, however, no European Union at the time, only the European Community. In our opinion the website refers to an abstract concept of an organisational unit which stands for the European idea, which we now refer to as the European Union. Of course, the properties of this European Union have changed significantly as well: from a union bound by economic treaties to a full political, military and social organisation. Ontologically, we say that the intension of the concept European Union has changed. These three types of changes will be discussed as three different aspects of semantic drift: extensional, label and intensional drift respectively.

Identity-based concept drift: The impact of such drift is difficult to measure, and for that purpose we would like to identify more drastic and qualitative changes in meaning. In the example of the EU countries there are some of these more drastic changes. Recall that the EU started out with just 6 countries, all of which were Central European. The effect of the expansion of the EU is that the original meaning of the concept EUCountry in terms of its members in 1950 is now far closer to the concept CES of Central European states than to the concept EUCountry. We will call such changes shift.

Another way of looking at drift is to study the similarity between the meaning over time. Figure 1 shows, e.g., the number of European member countries of NATO and EU. Studying the similarity between these sets, e.g.
in this simple example as the ratio of the number of new versus old instances, indicates the moments when the most drastic changes occurred (in both cases between 1995 and 2004). We say that we measure the stability of the concept as compared to other concepts. In principle, there are two lessons: for each concept we can identify the most unstable moments in a temporal chain, but we can also compare the average or overall instability of a concept over time. An interesting comparison can be made with the concept of European Countries within NATO, which has also seen expansion over time. Note that the latter is more extensionally stable, as countries joined more gradually.

1. What is concept drift, and how to formalise it?
2. Can we identify the impact of concept drift?

It is important to note that we want to develop a generic framework which allows us to study concept drift without a priori choosing one of the above described ontological commitments (identity versus morphing) or a particular type of KOS.

Methodology: We provide a generic formalisation of the meaning of concepts in terms of label, intension and extension. Depending on the ontological commitment towards an identity-based or a morphing-based model, concept drift is defined differently, and so are the qualitative follow-up notions. For the former, the instability over a time period and concept shift between time points (where part of the meaning of a concept shifts to some other concept) are the crucial notions. For the latter, the notion of split and the morphing strength will be introduced. As our proposal is meant in a very generic way, we need to discuss the notions of intension and extension in more detail.
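As a rough illustration of the stability reading of Figure 1, the sketch below recomputes the diff columns under the assumption that diff is the relative growth (new - old) / old, which is consistent with the printed percentages. The member counts are copied from Figure 1 as printed; the function names are hypothetical and not part of the paper's toolkit.

```python
def relative_growth(counts):
    """Return (year, count, growth) rows; growth is (new - old) / old,
    and None for the first row, which has no predecessor."""
    rows = []
    previous = None
    for year, count in counts:
        growth = None if previous is None else (count - previous) / previous
        rows.append((year, count, growth))
        previous = count
    return rows


def most_unstable(counts):
    """Return the row with the largest relative growth."""
    scored = [row for row in relative_growth(counts) if row[2] is not None]
    return max(scored, key=lambda row: row[2])


def mean_growth(counts):
    """Average relative growth over all transitions (a crude instability score)."""
    scored = [row[2] for row in relative_growth(counts) if row[2] is not None]
    return sum(scored) / len(scored)


# Member counts as printed in Figure 1 (assumption: columns read as year, count).
EU = [(1952, 6), (1955, 9), (1973, 9), (1981, 10), (1986, 12),
      (1995, 15), (1999, 15), (2004, 25), (2007, 27)]
E_NATO = [(1949, 9), (1952, 10), (1955, 11), (1973, 11), (1981, 12),
          (1986, 12), (1995, 12), (1999, 15), (2004, 22), (2007, 22)]

if __name__ == "__main__":
    for name, series in (("EU", EU), ("E-NATO", E_NATO)):
        year, count, growth = most_unstable(series)
        print(name, "largest change at", year, f"({growth:.1%})",
              "mean growth", f"{mean_growth(series):.1%}")
```

Under this reading, the largest single change falls at the 2004 row for both series, and the average growth per step is lower for the NATO series, in line with the remark that it is the more extensionally stable of the two.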
Year   EU   diff    E-NATO   diff
1949   -    -       9        -
1952   6    -       10       11.1%
1955   9    50%     11       10%
1973   9    0%      11       0%
1981   10   11.1%   12       9.1%
1986   12   20%     12       0%
1995   15   25%     12       0%
1999   15   0%      15       25%
2004   25   66.7%   22       46.7%
2007   27   8.0%    22       0%

Figure 1: The number of European member countries of NATO and EU over time.

Case-studies: We instantiate our framework in four case-studies, studying concept drift in our motivating case, a political ontology in SKOS [17] used by communication scientists for political analysis, as well as in a general purpose RDFS [30] ontology, DBpedia [25], a legal OWL [31] ontology, LKIF-Core [15], and three OBO biomedical ontologies [41]. We investigate the introduced mechanisms for studying concept drift in these four different KR models. Our experiments show the feasibility of both the formalisation and the identification mechanisms by pointing to some examples of concept (in)stability, shift, split, and morphing strength which were identified as relevant by collaborating domain experts. We are aware that those case-studies do not constitute a formal evaluation, let alone an empirical proof of the validity of our approach. However, they indicate the broad coverage and potential of the toolkit (although not all tools are useful in all scenarios). Furthermore, selected domain experts have in three of the four cases confirmed the usefulness of some of the results, the fourth being general enough to allow an informal assessment without expert input.

Contributions: This paper operationalises established (philosophical) insights into a general pragmatic framework for dealing with concept drift and develops a general toolkit to detect drift in different domains. We believe that we contribute to a better understanding of temporal change of meaning in formal knowledge organisation schemes and its impact in practical applications.
We motivate and define the crucial notions of shift, split, stability and morphing strength for different ontological commitments and different types of knowledge organisation systems. The four case-studies illustrate the feasibility of our framework in analysing concept drift in knowledge organisation schemas of varying expressiveness.

This is a significantly extended version of our previous paper [47], in which some of the ideas were introduced. We now provide a far more detailed analysis and formalisation, including an alternative model for dealing with changes based on concept morphing (with the related notions of split and morphing strength). We also provide a far more thorough evaluation, including new datasets.

Structure of the paper: Section 2 gives a general introduction to the four types of knowledge organisation schemas investigated and discusses the most relevant related work. In Section 3, we provide the basic formalisation of the meaning of concepts, on which the two theories of concept drift are based, namely the identity-based and the morphing-based theory. These two theories are defined in Section 4 together with a practical toolkit for analysing concept drift. In Section 5, we apply our general framework in four different case studies involving four different KOSs. Finally, Section 6 concludes the paper.

2. Background

In this section, we describe some general notions that are necessary for understanding the remainder of the paper. In addition, we describe what has been done on formalizing concept drift in knowledge systems.

2.1. Knowledge Organisation Schemas

Knowledge Organisation Schema (KOS) is a broad notion used to refer to schemes meant to describe information. This notion encompasses term lists and glossaries, classification schemes such as subject headings or taxonomies, and semantically richer structures such as thesauri and ontologies. More specifically, in [14] the notion of a KOS is explained as follows:

    The term knowledge organization systems is intended to encompass all types of schemes for organizing information and promoting knowledge management. Knowledge organization systems include classification and categorization schemes that organize materials at a general level, subject headings that provide more detailed access, and authority files that control variant versions of key information such as geographic names and personal names. Knowledge organization systems also include highly structured vocabularies, such as thesauri, and less traditional schemes, such as semantic networks and ontologies. Because knowledge organization systems are mechanisms for organizing information, they are at the heart of every library, museum, and archive.

Over the last decades, several standards for representing knowledge structures have been developed. For example, the International Organization for Standardization (ISO) has developed a guideline for monolingual thesauri (ISO-2788) and a guideline for multilingual thesauri (ISO-5964), and the NISO has developed guidelines for the construction, format, and management of monolingual controlled vocabularies (ANSI/NISO Z39.19). In this paper, we focus on representations used in the context of the Semantic Web, i.e. SKOS and OWL.

2.2. RDF Schema

RDF Schema (RDFS) is a simple type system for the Resource Description Framework (RDF). It provides a mechanism to define domain-specific properties and classes of resources to which those properties can be applied. The basic modeling primitives in RDFS are class definitions (rdfs:Class) and subclass-of statements (rdfs:subClassOf), which together allow the definition of class hierarchies. RDFS also includes property definitions and subproperty-of (rdfs:subPropertyOf) statements to build property hierarchies, as well as domain and range statements (rdfs:domain and rdfs:range) to restrict the possible combinations of properties and classes. Inherited from RDF, the
In ad3 type statements (rdf:type) are used to declare a resource as an instance of a specific class). In addition, RDFS comes with a specific labeling relation rdfs:label which can be useful to add natural language terms to a class. 2.5. OBO The OBO format is widely used to represent a significant number of biomedical ontologies, including the Gene Ontology (GO).3 Originated along with the Gene Ontology, the OBO “flat file format” has evolved to support the needs of the biomedical ontologies that fall under the Open Biomedical Ontologies (OBO) umbrella. According to its authors, the OBO flat format main aims are to provide 1) human readability, 2) ease of parsing, 3) extensibility and 4) minimal redundancy. The OBO format currently forms the backbone of most GO based annotation and data analysis tools. 2.3. OWL The Web Ontology Language (OWL) is a family of knowledge representation languages for authoring ontologies endorsed by the World Wide Web Consortium. They are characterised by formal semantics and a number of different serialisations, eg. RDF/XML. OWL is built upon RDF and RDFS. An OWL class is defined as an RDF resource of type owl:Class. The mechanism to define class hierarchies is inherited from RDFS. Furthermore, properties are used to describe resources and specify class characteristics. Properties may possess logical capabilities such as being transitive, symmetric, inverse and functional. The W3C Consortium has recently endorsed a new version of OWL as standard referred to as OWL2 [32], which extends the previous version with more expressiveness and fixes some previous problems. [Term] id: HP:0000023 name: Inguinal hernia namespace: medical genetics exact synonym: ”Inguinal hernias” [] is a: HP:0000035 ! Abnormality of the testis is a: HP:0004299 ! Hernia of the abdominal wall As the above example shows, a OBO ontology is a collection of stanzas, each of which describes one element of the ontology. 
A stanza is introduced by a line containing a stanza name that identifies the type of element being described. The rest of the stanza consists of lines, each of which contains a tag followed by a colon, a value, and an optional comment introduced by “!”.

³ http://www.geneontology.org/

2.4. SKOS

SKOS (Simple Knowledge Organization System) is a common data model for sharing and linking knowledge organization systems via the Web, published as a W3C recommendation [18]. Its aim is to capture much of the similarity between thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary. SKOS is built upon RDF and RDFS. The core of the SKOS model defines the classes and properties needed for representing the features most commonly found in thesauri and other types of structured controlled vocabularies. The primitive objects in SKOS are abstract concepts, which are defined as RDF resources of type skos:Concept. These concepts can be further described via RDF properties by:

• labels (i.e. terms), especially the preferred labels skos:prefLabel and alternative labels skos:altLabel; these properties are sub-properties of rdfs:label, and are used to link a skos:Concept to an RDF literal (a character string) optionally combined with a language tag (e.g. "en-US");
• semantic relations, such as skos:broader, skos:narrower and skos:related;
• documentary notes, such as definitions, scope notes and examples, represented by (subproperties of) skos:note.

Using the semantic relationships, the concepts can be organized in hierarchies using broader-narrower relationships, or linked by non-hierarchical (“related”) relationships. Moreover, instances of skos:Concept can be grouped into schemes, using the skos:inScheme property.

2.6. Related work

Although concept drift is widely recognised as an important issue in Semantic Web applications,⁴ there is not yet a generally accepted way of dealing with changing concepts over time. However, there is quite some related research (e.g., see [11, 27]) that discusses aspects of coping with concept drift. Below, an overview is given of the research about concept change in different areas.

⁴ E.g., see the fact that the topic is mentioned in both the discussion about the design of SKOS (http://www.w3.org/2001/sw/wiki/SKOS/Issues/ConceptEvolution) and the previous version of OWL (http://www.w3.org/TR/2004/REC-owl-guide/), while there is no fine-grained solution yet in OWL2.

Let us first look at research in other domains that is related to our notion of concept drift. In historical linguistics, semantic shift describes the evolution of word usage. Each word has multiple senses and connotations which can be added, removed or altered over time. Semantic change is a change in one meaning of a word. Semantic shift can be triggered by different forces and can be of different types [1]. In this interpretation of “semantic”, there is no explicit relation with the meaning and change of an “underlying concept”, as the main entity of concern is the word itself.

In machine learning, concept drift addresses a similar problem [45]. The term concept refers to the quantity that a learning model is trying to predict, i.e. the variable. Concept drift is the situation in which the statistical properties of the target concept change over time. This requires regular updates of the predicting model itself. A special case is virtual concept drift [49] (or sampling shift), in which the meaning of a concept does not change; instead, the data that this concept classifies changes. An example is the concept spam, for which the meaning does not change, but the data distribution (i.e. the relative frequency of the properties) is changing.

In 1994, Klenner and Hahn [24] discussed exactly the problem of concept drift because of evolving notions over time, however, not in the context of Semantic Web applications but for technical standards.
As a mechanism for updating static value restrictions or integrity constraints, they propose an automatic procedure. This generates a generational stratification of the underlying level of generic concepts in terms of concept versions; single instances are then related to their associated concept version. The procedure exploits a so called progress model—provided by an expert—which describes in qualitative terms the regularities of foreseeable changes of attributes in a domain. Versions are then detected by measuring the change in values of attributes of instances. With the goal of detecting concept drift and the occurrence of new concepts in a domain, Fanizzi et al. describe the use of a conceptual clustering technique based on unsupervised learning [6]. In their approach, a clustering method is used to hierarchically organize groups of similar instances. Concept drift is detected by finding new individuals that are too far apart from existing clusters, but that together do not form a new cluster. If the unclustered instances do form a cluster, a new concept has occurred. In the Semantic Web community, the problem addressed here is related to a broader problem of ontology evolution and change management, which refers to the “problem of deciding the modifications to perform upon an ontology in response to a certain need for change as well as the implementation of these modifications and the management of their effects in depending data, services, applications, agents or other elements” [7]. The existing research on ontology evolution and versioning often addresses this problem at the macro ontological level, that is, the effect of certain change operations over the ontology elements, including concepts, relations and instances, as well as the interoperability issue between different variants (versions) over time. For example, [13] formally defines ontology perspectives, which describe the relation between versions of an ontology and its extension. 
For analyzing interoperability between different versions of an ontology, change detection for individual concepts is a relevant issue [34]. Different techniques for detecting such changes have been developed. For example, Plessers and De Troyer introduce a change detection mechanism [37, 38] that is able to find meaningful changes in sets of changes, based on the formal definition of these changes. This approach, however, assumes a fixed identity of concepts. As such, it is of limited use in cases where this assumption does not hold. A syntactic change detection approach for ontologies is presented in [23]. In this approach, the semantic implication of the change has to be made explicit by a human user. PromptDiff [35] is an approach and tool for comparing ontology versions that also detects changes in individual concepts. In [22], an interpretation of the effect of such changes on different types of ontological compatibility is given. This approach does not assume fixed identifiers, but tries to match different ontological elements using a set of heuristics focussing on the structural aspects of an ontology. It does not take the extension of concepts into account, nor does it look at the intension of a change. In contrast, in [16], the intensional change of concepts in different versions of ontologies is studied. This approach again assumes a fixed identity.

In the context of integrating ontologies, there has been a lot of work on mechanisms and measures for calculating concept similarity. An overview can be found in [20]. Such measures are related to our research, as our framework also makes use of different similarity measures. However, the aim of our work is not to introduce another similarity measure, but a framework that allows different aspects of concept similarity to be combined to describe concept change.
A whole body of research in ontology change management investigates methodologies and techniques for change management in collaborative working environments [28, 12, 21, 36, 33, 42, 43, 44]. In these papers, methods for the coordination of the changes made by different authors are developed and implemented, such as locking, change detection and approval mechanisms. Although very relevant for ontology development, these approaches focus on a specific use case.

There has not been much study on the specific meaning of change for specific concepts. In [10] a series of “concept signatures” extracted from the textual definitions of the same concept at different times are used to detect drifts. This definition-based method is applicable if there are rich definitions of concepts and the definitions are constantly modified. In the Ontoclean framework [9] the meta-properties identity and rigidity are defined with respect to the stability of a concept.

In this article, we do not focus on a specific application of concept change, but we present a generic framework to describe and analyse concept drift within two common ontological paradigms. This framework helps in understanding and analyzing specific strategies for detecting and coping with concept change in knowledge structures.

3. Two theories of concept drift

Arguably, the meaning of concepts changes over time. In this section we define precisely what this actually means, as there are different philosophical views on the matter. The two core alternatives can briefly be summarised upfront:

• Although possibly changing its meaning, a concept can exist over periods of time. Different variants of the same concept can differ in meaning.
• Concepts only exist at specific moments in time, and evolve gradually into some other concepts (possibly with almost the same meaning) at the next moment in time.

We will build a framework that works for both views, and provide qualitative notions of concept drift for both views.
of the domain, and a set of properties (unary predicates). Both universe and properties depend on the application and the formalism used.

Definition 1. The meaning $C_t$ of a concept $C$ at some moment in time $t$ is a triple $(\mathit{label}_t(C), \mathit{int}_t(C), \mathit{ext}_t(C))$, where $\mathit{label}_t(C)$ is a String, $\mathit{int}_t(C)$ a set of properties (the intension of $C$), and $\mathit{ext}_t(C)$ a subset of the universe (the extension of $C$).

We will refer to intension, extension and label of a concept as the aspects of the concept.

3.1.1. A closer look at extension and intension

The example ontology introduced in Fig. 2 shows that Definition 1 is far from unambiguous. Depending on the semantics underlying a particular ontological representation language, the extension, intension and even labels can have different definitions. OWL and RDFS are the simplest cases, as the model theory of DL semantics provides an immediate mapping to the definition of extension as a subset of the domain. For the concept EUCountry in Fig. 2, the extension is defined as the resources in the subject position of an rdf:type triple. For abstract concepts, this view on extension is hardly applicable. In knowledge organisation schemes with less rigid formal semantics, such as SKOS, the notion of extension is far less clear cut. In many applications there is a clearly defined domain of discourse which we can take as the semantic universe for defining the extension of a concept. In Definition 1, the meaning of concepts has been defined in a rather non-committal way in order to accommodate these weaker semantic structures as well. For example, European Union is a skos:Concept which is used to annotate a number of articles.
We argue that, in certain applications, it is acceptable to use the articles as the extension of the concept European Union. In our political study in Section 5.1 we will define the set of sentences annotated with a particular concept as its extension.

For the aspect of intension, there is no once-and-for-all interpretation either. The definition is deliberately left vague to allow broad coverage of our theory of concept drift. In a First-Order Logic sense, the properties of a concept C correspond to the binary predicates that hold for C. Which predicates to consider then depends on the formalisms used for representing the ontology. If we interpret the ontology in Fig. 2 in RDF only, the intension of the concept dbpedia:EUCountry could be defined as just the set of triples it occurs in (except those with rdf:type as the predicate), i.e.,

Figure 2: An example ontology representing the EU in 2003.

To explain and motivate our ideas we will use the following simple information, which is freely inspired by the data available through DBpedia.⁵ Let us consider the following ontology (depicted in Fig. 2), which contains some information valid in 2003 about, among others, the resources dbpedia:EUCountry, cyc:Political Entity, and cyc:European Union [29], drawn from several different sources and related using different relations. Concept dbpedia:EUCountry is defined as countries, which are political entities. The EU was at the time a political entity which was also an economy. There are some example countries as instances of the concept dbpedia:EUCountry, and some articles art1, art2 and art3 about the European Union. Fig. 3, which will be introduced later in the paper, describes the European Union in 2010. This example, and some extensions provided later, are used in the further discussions to explain our general ideas.

3.1. The meaning of concepts

Let us first commit to some basic definitions regarding the meaning of concepts, which are common to both views.
The intension of a concept are the properties implied by it, the extension the set of things it extends to. We also consider the labelling as a part of the meaning of a concept, because the way people reference a concept is also crucial in studying concept drift. Labels do not refer to a unique identifier but to a natural language description used to convey the meaning of a concept from one human to another. We believe this proposal to be consistent with the most common philosophical approaches. We apply and formalise the standard distinction between intension and extension which goes back to [8]. We include, somewhat more unconventionally, the labelling in the meaning of a concept (in the tradition of the signifier [39]). Our definition of the meaning of a concept, and its drift is intended to be generic enough to be applied in different ontological frameworks. In this paper, we apply our ideas on a SKOS vocabulary, a few RDFS, OWL and OBO ontologies. We start out from a set of objects referred to as the universe (dbpedia:EUCountry rdfs:subclass dbpedia:Country) (cyc:European Union skos:broader dbpedia:EUCountry) If interpreted in RDFS, the semantic closure could be taken into account, i.e. , the triple (dbpedia:EUCountry rdfs:subClassOf cyc:PoliticalEntity) is probably relevant as well and could be part of the intension. For OWL and Description Logics formalisations, the intension could also contain implicitly true predicates that can be expressed. In our example, the concept dbpedia:EUCountry is also a subclass of the class of all the things 5 http://wiki.dbpedia.org/ 6 In the most simple cases calculating similarity between two versions of an aspect is using standard notion of similarity directly on the aspects. For labels, being defined as Strings, there are standard distance measures, such as Levenshtein’s [26] edit distance. For sets of predicates variants of Jaccard similarity [19] are the obvious candidates. 
The situation can, however, be more intricate depending on the ontological commitment w.r.t. extension and intension. For example, if the extension of a concept is defined as the set of articles annotated with it the overlap between extensions at different moments in time is usually empty. In our case study on political journalism we will later first define similarity between the instances of the extension and aggregate over this similarity in order to determine the overall similarity between the extensions. As stated before, similarity needs to be defined explicitly per application. Again, we believe that a framework for studying concept drift should be generic enough to cater for particular definitions of similarity. which have at least one instance which is a Central European Country (or ∃ rdf:type . cyc:Central European Country in a Description Logic like notation). The above examples show that there is no clearcut definition of intension and extension, and even the definition of the label of a concept can depend on the application: often there are alternative labels and we could even take glosses or other descriptions into account. In our view, a framework on the change of meaning of a concept over time should be applicable to all those types of ontological commitments as long as the three aspects of the meaning are unambiguously formalised. 3.1.2. A closer look at time It was stated earlier in the paper that we do not intend to provide a formal study of time and logical formalisations of change over time. Instead, we commit to the most simple temporal model where time is defined as linear, discrete and complete. This allows us to refer to particular points in time by subscripts i, j such that ti ≤ t j whenever i ≤ j. We will use ≤ to denote both the ordering on subscripts and in time which we expect not to cause any confusion. A further and deeper study of different temporal models is out of the scope of this paper. 3.2. 
Meaning drift based on concept identity over time All aspects of the meaning of a concept can change. We claim that the meaning of a concept drifts if there are two variants that differ at different points in time. If a concept at different times has the same meaning, there is no concept drift. This definition of drift is based on the idea that a concept retains its identity over time, i.e. , remains the same at least temporarily. In order to define drift, we need a workable definition of a concept and variants of such concepts. Given the philosophical assumption that concepts can exist at different moments in time, concepts can naturally also have different meanings at different times. 3.1.3. Similarity Investigating changes in the meaning of a concept means studying changes in intension, extension and label. Given the above discussion, it should not come as a surprise that the difference between the aspects over time also do not come in a single form. The only notion which is required is a notion of similarity between the aspects of two concepts or two variants of the same concept at two moments in time. There will be intensional similarity, simint , which is calculated between sets of predicates, extensional similarity, simext , between sets of objects, and label similarity, simlabel , between strings. Each similarity is a function with the range [0, 1], where a similarity value of 1 indicates equality. We will show different examples on how to define the similarity between two aspects in the discussion of our case-studies, and will only briefly discuss some generic problems here. For given notions of similarity between the meaning of two concepts according to a particular aspect we can extend this to similarity of a series of meanings of concepts over time: simasp (C1t1 , C2t2 , . . . , Cntn ) = Definition 2. The meaning C t of a concept C at time t is a variant of C. 
This definition assumes the stability of a concept over time even though different variants might semantically differ, in other words, the concept itself retains its identity over time. Identity allows us to compare two variants of the same concept at different moments in time even if the meaning (either label, extension and possibly parts of its intension) has changed. We will assume that there is always only one variant of a concept at one moment in time, i.e. , only one concept at a moment can be identical to a concept at another moment. ti ti+1 Σn−1 i=1 simasp (C i , C i+1 ) . n−1 Definition 3. The ordered set of all variants for a concept C between two moments t1 and tn is a chain and abbreviate: chain(C, t1 , tn ) = C t1 → C t2 → . . . → C tn where asp is an aspect asp ∈ {int, ext, label}. The maximal similarity between the meaning of two concepts for a series of concepts over time is defined as: max simasp (C1t1 , Cntn ) ti+1 ) Σn−1 simasp (Citi , Ci+1 = i=1 . n−1 for all time-points between t1 and tn . Given our “meaning of a concept” and the notion of identity, we can define concept drift per aspect as the basic notion of change in the meaning of a concept. (1) ti+1 i+1 where simasp (Citi , Ci+1 ) ≥ simasp (Citi , Dti+1 ) for the meaning of any other concept Di+1 at time ti+1 . We will call the sequence of maximally similar concepts Ci+1 the max-sim-chain between C1 and Cn between t1 and tn for aspect asp, referred to by maxsimasp (C1 , Cn , t1 , tn ). Definition 4. A concept C has extensionally drifted (according to identity semantics) between time ti and t j (ti ≤ t j ) if and only if simext (Cti , Ct j ) , 1. Intensional and label drift are defined similarly. The meaning of a concept has drifted if one of the aspects has drifted. 7 Before we go on to discuss an alternative way of defining concept drift based on concept morphing, let us discuss the concept of identify in more detail. 
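The series similarity and Definition 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `jaccard`, `chain_similarity` and `has_drifted` are hypothetical, Jaccard similarity is assumed as the extensional similarity, and the extensions of dbpedia:EUCountry are read off the running example (three countries in 2003, five in 2010). The three-variant chain at the end repeats the 2010 variant purely for illustration.

```python
from typing import Callable, FrozenSet, Sequence

Aspect = FrozenSet[str]


def jaccard(a: Aspect, b: Aspect) -> float:
    """Jaccard similarity in [0, 1]; two empty sets are treated as equal."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def chain_similarity(variants: Sequence[Aspect],
                     sim: Callable[[Aspect, Aspect], float]) -> float:
    """Average similarity of consecutive variants (the series formula)."""
    if len(variants) < 2:
        raise ValueError("a chain needs at least two variants")
    pairs = zip(variants, variants[1:])
    return sum(sim(x, y) for x, y in pairs) / (len(variants) - 1)


def has_drifted(x: Aspect, y: Aspect,
                sim: Callable[[Aspect, Aspect], float]) -> bool:
    """Definition 4: an aspect has drifted iff the similarity differs from 1."""
    return sim(x, y) != 1.0


# Extensions of dbpedia:EUCountry in the running example (assumed reading
# of Fig. 2 and Fig. 3).
eu_2003 = frozenset({"Germany", "Netherlands", "France"})
eu_2010 = eu_2003 | {"Romania", "Poland"}

print(jaccard(eu_2003, eu_2010))
print(has_drifted(eu_2003, eu_2010, jaccard))
# Hypothetical chain with a third, unchanged variant.
print(chain_similarity([eu_2003, eu_2010, eu_2010], jaccard))
```

Passing the similarity function as a parameter mirrors the paper's position that similarity has to be defined per application.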
Figure 3: An example ontology representing the EU in 2010

3.2.1. A closer look at identity

Our definition of a concept, its meaning and the subsequent notion of identity are rather circular and non-committal. However, an exhaustive discussion of the nature of what a concept is, and of whether it remains itself over time, is not at the core of this paper. What matters for this model is the assumption of identity and what we can do whenever we can identify identical concepts over time.

First, let us give one possible explanation of what identity of a concept could mean. This suggestion is based on the idea of rigidity of the intension of a concept.6 Formally, we assume that the intension of a concept $C$ is the disjoint union of a rigid and a non-rigid set of properties (i.e., $\mathit{int}_r(C) \cup \mathit{int}_{nr}(C)$). This separation between rigid and non-rigid properties does not need to be explicitly specified, but rigidity is one way to define the identity of a concept over time. Intuitively, this amounts to the assumption that a concept is uniquely identified through some core properties that do not change over time. Formally, this is:

Conjecture 1. Two concepts $C_1$ and $C_2$ are considered identical if their rigid intensions are equivalent, i.e., $\mathit{int}_r(C_1) = \mathit{int}_r(C_2)$.

This assumption implies that if the rigid core of a concept changes, the new concept will be a different concept. An example is demagogue, which used to denote the concept Political Leader, while the same label now refers to a different concept, Populist Demagogue.

6 Rigidity is discussed in Ontoclean [9] in a slightly different way. There, rigidity is a meta-property of a concept that modellers should make explicit. We use it in a stronger way, as an intrinsic and the only stable part of the meaning of a concept, which otherwise can drift in various ways.

How to determine identity? Although we provide a possible explanation and motivation for the notion of identity based on the rigid part of the intension, this definition is by no means practical, as the rigid core is unknown in almost all applications. Since all three aspects of the meaning of a concept can drift over time, identifying identity is crucial and non-trivial. In order to render practical algorithms, we need to study ways of determining identity of concepts. There are several methods for doing so: using oracles, domain knowledge or automatic techniques. Although identity resolution between concepts over time is not at the core of this paper, we need to discuss some of these options to make credible our claim that identity-based analysis of meaning drift is practically applicable.

The simplest way of determining identity of concepts over time is to ask an oracle. Such an oracle usually comes in the form of a human domain expert who can determine whether two variants have the same rigid properties based on his or her domain knowledge. Unfortunately, such oracles are difficult to find and usually very expensive.

A cheaper way of determining identity is based on expert knowledge about the process and structure of the ontology in question. It is often the case that labels were explicitly kept stable over time to enforce a priori identity of concepts. Close knowledge of such processes can help to decide the determining factor in identifying identity.

Finally, there are attempts to detect identity automatically, for example, by assuming that most similar concepts are identical, or by conjecturing that concepts in a similarity chain between two variants are identical. The advantage is that those methods are simple and cheap to apply. The disadvantage, however, is that identity and drift are based on the same information. In fact, this observation can be taken to the extreme of defining concept drift without the underlying notion of identity in the first place. We call this approach concept morphing.

3.3. Meaning drift based on concept morphing

In this alternative conceptualisation of concept drift, there is no underlying notion of identity we can rely on. This means that two concepts at two different moments in time are never identical. Instead, a concept $C_1$ morphs over time into a new concept $C_2$. Let us first define morphing of the three aspects of the semantics and of the meaning of a concept in general.

Definition 5. A concept $C_i$ extensionally morphs into a concept $C_j$ from $t$ to $t+1$ if and only if $\mathit{ext}_{t+1}(C_j)$ is maximally similar to $\mathit{ext}_t(C_i)$. Formally, we write:

$$\mathit{morph}_{ext}^{t,t+1}(C_i, C_j) \quad \text{iff} \quad \arg\max_k \, \mathit{sim}_{ext}(C_i^{t}, C_k^{t+1}) = j$$

and similarly for intensional and label morphing. A concept $C_i$ morphs into a concept $C_j$ from $t$ to $t+1$ if and only if all semantic aspects, i.e., extension, intension and label of $C_i$, morph into $C_j$.

Let us consider the EU in 2010 and in 2003 in our example ontology from Fig. 2 and 3. Now, the concepts with label EUCountry are different concepts. There is label morphing from the concept dbpedia:EUCountry in 2003 to the concept with the same name in 2010, as the label stays the same. However, given that the extension of the concept dbpedia:EUCountry in 2003 is more similar to the extension of the concept cyc:Central European Country, the former morphs extensionally into the latter.
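A small Python sketch may help make Definition 5 concrete on this example. The helper `morph_targets` is a hypothetical name, Jaccard similarity and exact string match are assumed as the extensional and label similarities, and the instance sets for 2010 are an assumed reading of Fig. 3 (in particular, the instances of cyc:Eastern European Country are a guess made for illustration).

```python
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity in [0, 1]; two empty sets are treated as equal."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def exact_label(a: str, b: str) -> float:
    """Simplest possible label similarity: 1 for equal strings, else 0."""
    return 1.0 if a == b else 0.0


def morph_targets(aspect_now: T, candidates_next: Dict[str, T],
                  sim: Callable[[T, T], float]) -> List[str]:
    """All concepts at t+1 whose aspect is maximally similar (Definition 5).

    More than one name is returned when several candidates tie.
    """
    scores = {name: sim(aspect_now, value)
              for name, value in candidates_next.items()}
    best = max(scores.values())
    return sorted(name for name, score in scores.items() if score == best)


core = frozenset({"Germany", "Netherlands", "France"})
ext_2010 = {
    "dbpedia:EUCountry": core | {"Romania", "Poland"},
    "cyc:Central European Country": core,
    # Assumed instances, for illustration only.
    "cyc:Eastern European Country": frozenset({"Romania", "Poland"}),
}
label_2010 = {
    "dbpedia:EUCountry": "EUCountry",
    "cyc:Central European Country": "Central European Country",
    "cyc:Eastern European Country": "Eastern European Country",
}

ext_morph = morph_targets(core, ext_2010, jaccard)
label_morph = morph_targets("EUCountry", label_2010, exact_label)
print(ext_morph, label_morph)
```

Under these assumptions the extension and the label of the 2003 concept morph into different 2010 concepts, which is the situation the text describes.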
Note that an aspect of a concept $C$ can morph into several concepts if the maximal similarity between $C$ and those concepts is the same w.r.t. this aspect.

Definition 6. A concept $C$ has extensionally drifted (according to morphing semantics) between $t$ and $t+1$ if and only if $\mathit{sim}_{ext}(C, C_1) \neq 1$ for $\mathit{morph}_{ext}^{t,t+1}(C, C_1)$. Intensional and label drift are defined similarly. The meaning of a concept has drifted if one of the aspects has drifted.

In the previous section, we argued that the dbpedia:EUCountry of 2010, although having changed in some aspects, was still the same concept as the one in 2003, and even as the one in 1979. An alternative is to consider the label EUCountry to refer to different concepts at different times. In this morphing view, the concept dbpedia:EUCountry in 2003 morphs into the concept dbpedia:EUCountry in 2010 on the label aspect, although neither extension nor intension stayed the same.

In the case of morphing, we call the max-sim-chain between $C_1$ and $C_n$ between $t_1$ and $t_n$ for aspect $asp$ the $asp$-morphing chain. The similarity between the concepts in such a maximal morphing chain will later be used to indicate the strength of a morphing chain.

4. A qualitative toolkit for analysing concept drift

In the previous section, we have formally defined concept drift as the change of aspects of the meaning of a concept over time. Concept drift happens regularly. Even when it can be measured, its impact is often difficult to grasp, as the measurement probably results in a far too fine-grained analysis. In this section, we will define some more practical notions that describe changes in meaning: our toolkit for analysing concept drift. This toolkit takes both possible notions of drift into account, and we will define identity-based observations and morphing-based ones. Practically, it does not matter whether one of our tools is identity-based or morphing-based, but, conceptually, we will introduce them separately.

4.1. Identity based observations

Based on identity-based concept drift we define two notions: stability and shift.

4.1.1. Concept (in)stability

The more the meaning of a concept drifts, the more unstable it becomes. Although there is no indication of when an unstable concept becomes critically unstable, stability can still be an interesting notion to study. A first alternative is to define a threshold based on experience. Every similarity between two variants below this threshold signifies some interesting semantic change. A typical example is label similarity, which is often defined using edit distance. A high edit distance signals label instability.

Another way of analysing concept drift over time is to compare the (average) stability of concepts. As a relative measure, it does not require any a priori commitment (such as a threshold).

Definition 7. A concept $C$ between time $t_1$ and $t_n$ is more stable than a concept $C_1$ on aspect $asp$ if, and only if,

$$\mathrm{avg}_{1 \leq i, j \leq n,\, i \neq j}\big(\mathit{sim}_{asp}(C^{t_i}, C^{t_j})\big) > \mathrm{avg}_{1 \leq i, j \leq n,\, i \neq j}\big(\mathit{sim}_{asp}(C_1^{t_i}, C_1^{t_j})\big).$$

Relative stability is the most interesting use of stability: to order the concepts by stability values. In our experience, the sets of the most stable and the most unstable concepts are usually of interest to domain experts. In our example ontology about the European Union from Fig. 2 and 3, the extension of the concept dbpedia:EUCountry grows from 3 to 5 countries, while that of the concept cyc:Political Entity consists of just one (explicit) instance, cyc:European Union, in both variants. The Jaccard similarity between the extensions of dbpedia:EUCountry in 2003 and 2010 is thus 3/5 = 0.6, while the one for cyc:Political Entity is 1, which makes the latter the more stable concept.

The second use of stability is to track the change of similarity between variants of a concept. The most unstable moments are those when the similarity is the lowest.

Definition 8. In a chain $\mathit{chain}(C, t_1, t_n)$ the concept $C$ is more stable at time $t_i$ than at $t_j$ on aspect $asp$ if $\mathit{sim}_{asp}(C^{t_i}, C^{t_{i+1}}) > \mathit{sim}_{asp}(C^{t_j}, C^{t_{j+1}})$.

We can use this ordering in our use-cases to identify the most unstable and the most stable moments in a chain.

4.1.2. Concept shift

A special case of instability is when a concept becomes so unstable that part of its meaning is more representative of a different concept than of itself. We call this concept shift.

Definition 9. The meaning of a concept $C$ extensionally shifts between two of its variants $C^{t_i}$ and $C^{t_j}$ if the extension of $C^{t_j}$ is more similar to the extension of a non-identical concept than to the extension of $C^{t_i}$. Intensional and label shift are defined similarly.

Concept shift can have drastic consequences on the use of a concept in an application, because some other concept has in fact taken over its meaning. Concept shift is explained in Fig. 4, where ovals denote the meaning of a concept, with each circle a particular aspect. Arrows stand for the most similar aspects of the new variant. From the moment $t_1$ to $t_2$, there is a shift in terms of the middle aspect because, in this aspect, the concept $C_2$ at $t_2$ is more similar to $C_1$ at $t_1$ than $C_1$ at $t_2$ is, while the other two aspects do not shift.

Figure 4: Concept shift (identity-based)

Consider the extension of the concept dbpedia:EUCountry in 2003 with the three countries Germany, Netherlands and France. In 2010 this set of countries is more similar to the concept cyc:Central European Country than to the original dbpedia:EUCountry, which means that the extensional meaning of the concept has shifted.

4.2. Morphing based observations

Morphing-based concept drift comes with two notions: morphing strength and split.

4.2.1. Strength of morphing chains

Without an explicit definition of identity, it does not make sense to define stability of a concept over time. A closely related notion is the strength of the morphing chain, which indicates how similar the concepts in this (maximal) morphing chain between two time-points are.

Definition 10. A morphing chain $\mathit{morphchain}_{asp}(C_b, C_e, t_b, t_e)$ is stronger than another morphing chain $\mathit{morphchain}_{asp}(C_{b'}, C_{e'}, t_b, t_e)$ on aspect $asp$ if and only if $\max \mathit{sim}_{asp}(C_b^{t_b}, C_e^{t_e}) > \max \mathit{sim}_{asp}(C_{b'}^{t_b}, C_{e'}^{t_e})$, where $\max \mathit{sim}_{asp}(C_b^{t_b}, C_e^{t_e})$ has been defined in Equation 1 in Section 3.

We already pointed out that in our example ontology the concept dbpedia:EUCountry extensionally morphs into the concept cyc:Central European Country. This similarity equals 1 (complete overlap), which makes this a very strong chain (though of course a rather short one). The concept cyc:European Union extensionally morphs into the concept of the same name, but there is only an overlap of 1 out of 5 articles, which is 0.2 according to standard Jaccard similarity. This means that dbpedia:EUCountry→cyc:Central European Country is a stronger extensional morph chain than cyc:European Union→cyc:European Union.

Relative strength is the most interesting use of strength, and the one we will apply in our case-study: to order the chains by strength. In our experience, the set of strongest and weakest concepts are usually of interest to domain experts.

4.2.2. Concept split

Concept split is very similar to the concept shift discussed in the previous section. It is defined based on the fact that not all aspects of a concept at $t_i$ are most similar to the same concept at $t_{i+1}$. As there is no identity as the reference to talk about shift, we can only say that the meaning of a concept splits into several new concepts. This can happen in two ways (as shown in Fig. 5): either one of the aspects morphs into the aspect of a different concept from the one that the other aspects morph to (on the left-hand side), or an aspect is equally similar to the same aspect of more than one other concept (on the right-hand side of Fig. 5).

Figure 5: 1-way concept split (left) and synonymy split (right) (morphing-based)

Definition 11. The meaning of a concept $C$ splits at time $t$ if, and only if, either

1. it does not morph into another concept, or
2. one aspect morphs into more than one concept, which happens when the similarity between those concepts is the same according to at least one aspect.

Remember that the notion of morphing of a concept is defined through simultaneous morphing of all aspects to the same concept. This implies that if the concept does not morph, there must be an aspect that morphs into a different concept from the one(s) the other two aspects morph to. In our ontology in Fig. 2 and 3, the concept dbpedia:EUCountry in 2003 1-way splits, as it label-morphs into the concept dbpedia:EUCountry with the same name, but extensionally morphs into the concept cyc:Central European Country.

Finally, there is one additional possibility for the meaning of a concept to split (depicted in Fig. 6): remember that a morphing chain $\mathit{morphchain}_{asp}(C_b, C_e, t_b, t_e)$ is defined as a chain of concepts between $C_b$ (begin) and $C_e$ (end) over time points $t_b$ and $t_e$, where the similarity between the aspect $asp$ of the concepts in the chain is maximal for every time-point. A morphing chain split now occurs when the endpoint of such a chain is less similar to the original concept at the start of the chain than some other concept.

Definition 12. A concept has a morphing-based chain split for aspect $asp$ if the similarity between a concept $C_b$ at the beginning and $C_e$ at the end of the chain $\mathit{morphchain}_{asp}(C_b, C_e, t_b, t_e)$ is smaller than the similarity between the same aspect of concept $C_b$ and some other concept $C_{e'}$ at $t_e$.

Figure 6: Chainsplit (morphing-based)

4.3. Applying the framework

To apply our framework for concept drift in a specific use-case, the following steps are required:

1. to define intension, extension and a labelling function.
2. to define similarity functions over intension, extension and labels.

Given that the mission of the Semantic Web includes giving meaning to resources on the Web, it could come as a surprise that defining intensions, extensions and even labelling functions is by no means trivial. The usual model-theoretic notions of extension and intension, e.g., for RDF(S) or OWL semantics, are slightly misleading here, as they refer to specific models, whereas ontologies usually represent classes of models. In practice, one needs to define the relevant notions per use-case, where each such definition is an ontological commitment. In the following section, we will give such commitments for different case-studies. It should be understood that such a commitment is never uncontroversial.

4.4. Implementation of our toolkit

This toolkit has been implemented in Python and is available at https://www.few.vu.nl/~swang/data/concept_drift/getConceptDrift.py. The toolkit currently requires the different versions of the ontology to be queryable through independent SPARQL endpoints. It first collects all the information on intension, extension and labels of the concepts, and then calculates the similarity between variants of the same concepts. Based on the calculated similarity it automatically identifies concept shift and split, and provides a ranking based on concept stability and morphing strength.

5. Case-studies

In this section we will discuss 4 different case-studies as examples of how to use our methods, and to give some evidence of the potential of the ideas introduced in Section 3. The purpose of these case-studies is two-fold: first, we want to indicate how to practically apply our analysis and toolkit by exemplifying the usage in a number of very different scenarios, based on a variety of different knowledge organisation principles. Secondly, we believe that we gather evidence of the potential usefulness of the results in a number of different domains. Although the presented results do not amount to proofs, nor are empirically representative, our experience with legal and medical experts and with communication scientists suggests that our approach can produce non-trivial and interesting results.

5.1. Case study 1: Concept shift in political reporting

Communication scientists annotate various media content with concepts from controlled vocabularies (of increasing expressiveness) in order to study the relationship between the media and the political processes. The idea is to code the meaning of sentences and articles in a formalised graph representation, using the method of Network analysis of Evaluative Texts (NET) [46]. Take the following sentence as an example.

Het Openbaar Ministerie (OM) wil de komende vier jaar mensenhandel uitroeien. (The Justice Department (OM) wants to eliminate human trafficking within the next four years.)

The sentence is coded as <om, -1, human trafficking>, where om is an actor concept and human trafficking an issue concept, and -1 indicates that the Justice Department is negative about human trafficking. Here, the concepts are taken from controlled vocabularies which generally consist of all related political actors and issues. They have recently been represented using the SKOS model [18].

In this case study, we focus on five variants of a political vocabulary used during the five most recent Dutch national election campaigns, which took place in 1994, 1998, 2002, 2003 and 2006. The size of the vocabulary increased from 101 concepts in 1994 to 580 concepts in 2006. Newspaper articles on Dutch politics during these campaign periods were manually annotated with the concepts from the particular variant of that year. In our case, the annotated sentences of these articles are considered as the instantiation of the abstract political concepts with which they were annotated. For 1994, we gathered 1502 such annotated sentences, while for 2006 there were 14572 annotated sentences.

Our case study in political reporting was supported by an expert from the Dept. of Communication Science who provided us with pairs of identical concepts over time and who performed a small-scale qualitative evaluation of the produced results. This expert has extensive experience in annotation and encoding of political statements over time, and has recently produced a manual mapping between the vocabularies used to describe the election campaigns for a qualitative comparison of the respective newspaper articles.

Political concepts and their meaning. We now formally define our problem: for each election campaign $t \in \{1994, 1998, 2002, 2003, 2006\}$ we have a set $\Delta_t$ of sentences annotated by concepts from a SKOS vocabulary $V_t$. The label of a concept is obtained using the SKOS labelling property skos:prefLabel. The extension $\mathit{ext}_s(C_t)$ of a concept $C_t \in V_t$ at time $t$ is the set of all sentences annotated by $C_t$, i.e.,

$$\mathit{ext}_s(C_t) = \{ s \in \Delta_t \mid s \ \mathrm{annotatedBy}\ C_t \}.$$

It is most difficult to formally define the intension of a concept, as there is no explicit intensional definition of the concepts available. We construct an explicit intension based on co-occurrence of concepts in annotations. For each concept $C$, we calculate the top $K$ concepts $\mathit{topKuse}(C)$ which co-occur the most with $C$ in the sentences they code at one moment in time. The properties we use to define the intension are based on "topicality", i.e., a property $P_C(D)$ is true if, and only if, $D$ is in the topKuse relation with $C$. The intension of $C$ is then the set of properties $\mathit{int}_s(C) = \{P_C(D) = \mathit{true}\}$; in other words, the intension is in fact determined by all associated concepts. In our following experiments, we took $K = 3$. Note that this is an empirical choice which gave sensible results.

Figure 8: Intension of concept Democracy in 3 years, with average drift of ($\mathit{sim}_{int} = 0.02$)

Similarity of intension, extensions and labels. Similarity of labels can be determined through standard Levenshtein edit distance. In our case we define similarity as 1 minus the hyperbolic tangent of the original edit distance. Extensional similarity is usually determined by calculating the overlap of the extensions. In our case, this is not possible, as the sets of sentences in different years are disjoint. In order to use disjoint extensions to measure the similarity of concepts from different years, we applied the mapping tool which was developed in [40]. For each sentence in Year1, the mapping tool first looks for the most similar sentence in Year2. Then this sentence of Year1 is considered to be coded by the concept(s) with which its most similar sentence of Year2 is coded. In this way, two disjoint extensions become dually annotated, and we then measure the similarity between two concepts using their extensions.7 Since the intension of a concept is determined by its associated concepts, the intensional similarity between two concepts is determined by the set similarity between the sets of concepts with which they are associated. We use the Jaccard similarity for this purpose. Once the similarity is calculated, we can study concept drift using the two theories introduced above, respectively.
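The commitments for intension and label similarity described above can be sketched in a few lines of Python. This is an illustrative sketch, not the published toolkit: the function names are hypothetical, ties between equally frequent co-occurring concepts are broken alphabetically (an assumption, since the text does not say how ties are handled), and the annotated sentences are invented toy data rather than the election corpus.

```python
import math
from collections import Counter
from typing import FrozenSet, Iterable, Set


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def label_similarity(a: str, b: str) -> float:
    """1 minus the hyperbolic tangent of the edit distance."""
    return 1.0 - math.tanh(levenshtein(a, b))


def top_k_use(concept: str, sentences: Iterable[Set[str]],
              k: int = 3) -> FrozenSet[str]:
    """The k concepts that most often annotate the same sentence as `concept`."""
    counts: Counter = Counter()
    for annotation in sentences:
        if concept in annotation:
            counts.update(annotation - {concept})
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    return frozenset(ranked[:k])


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Invented toy annotations: each set holds the concepts coding one sentence.
year_a = [{"Employers", "Unions", "Employees"}, {"Employers", "Unions"},
          {"Employers", "Employees", "Social pact"},
          {"Employers", "Socio-Economic Council"}]
year_b = [{"Employers", "Unions"}, {"Employers", "Employees"},
          {"Employers", "Work migration"}]

int_a = top_k_use("Employers", year_a)
int_b = top_k_use("Employers", year_b)
print(jaccard(int_a, int_b), label_similarity("Military", "Military"))
```

Because the hyperbolic tangent saturates quickly, this label similarity is close to 0 for labels that differ in more than a few characters, so it mainly separates identical and near-identical labels from all others.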
5.1.1. Study concept drift based on identity

The first step of studying concept drift based on identity is to determine concept identity. The identity problem is solved, in this case, by the manual concept mapping provided by a communication science expert. This enables us to investigate concept shift and stability in terms of label, intension and extension.

Measuring concept stability. Since our domain expert has already indicated the identical variants of the same concepts over the years, we can easily rank the concepts based on their stability. Here, the stability is calculated as the average similarity between the variants at neighbouring time-points. Fig. 7 shows the histogram of the stability values in terms of the three aspects. We can see that many concepts have very stable labels ($sim_{label}(C) = 1$) over time. For example, the concept Asielzoekers (asylum seekers) has exactly the same label across all the years. However, some concepts do have very unstable labels. For instance, according to the domain expert, the following 6 concepts are intensionally identical:

1994 sjo creawetsto → 1998 wcorruptie (corruption) → 1998 rbeursfraude (stock fraud) → 2002 belangenverstrengeling (conflict of interest) → 2003 corruptie (corruption) → 2006 fraude en corruptie (fraud and corruption)

Figure 7: Stability value distribution, with panels (a) Label Stability, (b) Intensional Stability and (c) Extensional Stability: the X-axis gives the average corresponding similarity over the years between variants of the same concept; the Y-axis the number of concepts with this average.

In Fig. 7 (b) and (c), we find that most concepts have rather unstable intension and extension over the years ($sim_{int}(C)$ and $sim_{ext}(C)$ are low). On the one hand, this suggests that the label of a concept is more stable than its intension, which means the concept label could be used as a pseudo-identity in practice if an oracle is not available. On the other hand, we use an approximation of the real extension, so the reliability of the calculated extensional similarity is not fully guaranteed. The instability of intension is far beyond our expectation, which suggests that we should look for another way of formalising concept intension, as the reliability of the results provided by the current formalisation is doubtful.

Fig. 8 and Fig. 9 give an example of an intensionally very unstable and an intensionally very stable concept, respectively. Here, the red links connect the identical concepts provided by domain experts, the black links are the association links (i.e., to the concepts which most often co-occur as annotations of the same sentences), and the numbers next to the links are the strengths of the associations. As the $sim_{int}$ value indicates, Concept Employers is rather stable over the years, while Concept Democracy seems to shift its meaning in different years. These observations are also consistent with the political reality. Although in this particular case this is related to the definition of the intension, it is still interesting to observe from these two figures that, when one concept is stable, its closely associated concepts tend to be stable too, while the concepts closely associated with an unstable concept tend to be unstable as well.

Figure 9: Intension of concept Employers in 3 years (2002, 2003, 2006), with average drift of ($sim_{int}$ = 0.15)

Identifying concept shift. Even with the defects of our commitments, we can still identify interesting concept shifts.

Label Shift. As shown in Fig. 10 (a) (see footnote 8), 1994 Military has the same label as 1998 Military. However, according to our domain expert, 1998 Military is actually identical to 2006 Dutch military deployment, while 1994 Military is identical to 2006 Military. These two concepts with the same label “Military” are actually different concepts. Therefore, Concept 1994 Military has a shift in label as, according to Definition 9, its label shifts to another concept.

Extension Shift. Fig. 10 (b) suggests an extensional shift. According to the domain expert, 2003 Childcare is the same as 2006 Childcare. However, 2006 Free childcare is more similar to 2003 Childcare in terms of their extension. In this case, we say 2003 Childcare has shifted its meaning extensionally towards a more specific (narrower) topic, namely free childcare. This is also confirmed by the post-hoc analysis of our domain experts.

Figure 10: Example of label shift (a) and extensional shift (b), where the red links indicate that the two concepts are identical according to our domain experts, while the blue links point to the most similar concepts in terms of the corresponding aspect.

Figure 11: Example of morphing chains, where the red links are consistent with concept identity provided by the domain expert.

Footnote 7: In [48] we have shown that this method produces reasonable results.
Footnote 8: For convenience, we translate all the Dutch labels into English.
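The identity-based analysis of Section 5.1.1 can be illustrated with a small Python sketch. It rests on two assumptions that should be read as such: stability is the mean similarity between neighbouring variants, as stated in the text, and a concept is taken to shift in some aspect when a non-identical concept at the next time point is more similar to it than its identical variant (Definition 9 is not reproduced in this section, so this reading is inferred from the Childcare example). The sentence identifiers below are invented toy data, and all function names are hypothetical.

```python
from statistics import mean


def jaccard(x, y) -> float:
    """Jaccard set similarity; two empty sets count as identical (assumption)."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if (x or y) else 1.0


def stability(variants, sim) -> float:
    """Average similarity between variants of one concept at neighbouring time points.

    `variants` lists the identical variants in temporal order (at least two).
    """
    return mean(sim(a, b) for a, b in zip(variants, variants[1:]))


def detect_shift(concept, identical_next, candidates_next, sim):
    """Return the concept that `concept` shifts to at the next time point, or None.

    Assumed reading of a shift: some other concept at the next time point is
    strictly more similar to `concept` than its identical variant.
    """
    best = max(candidates_next, key=lambda c: sim(concept, c))
    if best != identical_next and sim(concept, best) > sim(concept, identical_next):
        return best
    return None


# Invented toy extensions (sets of sentence ids), mirroring the Childcare example.
extension = {
    "2003 Childcare": {"s1", "s2", "s3", "s4"},
    "2006 Childcare": {"s1", "s9"},
    "2006 Free childcare": {"s1", "s2", "s3"},
}


def ext_sim(a, b) -> float:
    return jaccard(extension[a], extension[b])


shifted_to = detect_shift(
    "2003 Childcare",
    identical_next="2006 Childcare",
    candidates_next=["2006 Childcare", "2006 Free childcare"],
    sim=ext_sim,
)
childcare_stability = stability(["2003 Childcare", "2006 Childcare"], ext_sim)
```

With these toy sets, the identical variant shares only one of five sentences with 2003 Childcare, while 2006 Free childcare shares three of four, so the sketch reports an extensional shift towards the narrower concept. The same two functions apply unchanged to label and intensional similarity.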
5.1.2. Study concept drift based on morphing

Let us now study concept drift without relying on the existence of concept identities. As we introduced earlier, morphing-based concept drift involves two notions: morphing strength and concept split.

Measuring morphing strength. When each concept in a series of concepts can always morph into a concept at the next time point, a morphing chain is formed. Starting from each concept, we find its max-sim-chain, as defined in Section 3.1.3. Even if we take only the most similar concept as the morphed one, these morphing chains each have a different strength (see Definition 10). We therefore rank these morphing chains based on their strength. In Fig. 11, we list three morphing chains. The top one is actually consistent with the identity provided by our domain expert. This morphing chain also has the highest strength. Clearly, this chain makes more sense than the bottom one, which is the weakest morphing chain.

Detecting concept split. If the three aspects do not morph unambiguously into one concept, or at least one aspect morphs into multiple concepts, there is a concept split. In Fig. 12, Concept Free Childcare has a 1-way concept split from 1998 to 2002, as defined in 4.2.2. Concept Capitalism intensionally splits into three concepts which are also different from its extensional morph. Unfortunately, almost all concepts have a morphing-based chain split, which is not really surprising. As we discussed above, our commitments to concept intension and extension in this case are both problematic, so the reliability of the similarity is not guaranteed. As a consequence, the detected splits are also doubtful.

Figure 12: Example of morphing chains, where the red links are consistent with concept identity provided by the domain expert.

5.1.3. Identify concept drift moments

We can also measure at which moment(s) the concepts are the most stable or unstable. At time point $t_i$, we measure the similarity of concepts between time points $t_i$ and $t_{i+1}$. When concept identity can be defined, we take the average similarity between two variants of the same concepts. When using the morphing theory, we take the average similarity between all concepts at $t_i$ and their morphed ones at the next time point $t_{i+1}$. Fig. 13 compares the analysis using these two theories of concept drift, where the error bars indicate the corresponding standard deviations over all morphing strengths. The three aspects show the same trend. Both theories indicate that the concepts are the most stable between 2002 and 2003, as the similarity is the highest. Between 1998 and 2002, in contrast, there were big changes in terms of intension, especially using the morphing theory. This indicates that the concepts were associated with quite different concepts in these two years.

Figure 13: Concept stability over time, using two formalisations (political vocabulary). Panels: (a) Label stability, (b) Intensional stability, (c) Extensional stability, (d) Whole concept stability. X-axis: version pairs 1994_1998, 1998_2002, 2002_2003, 2003_2006; Y-axis: concept stability; series: Morphing and ID.

5.2. Case-study 2: Concept drift in DBpedia

DBpedia (footnote 9) is probably the most successful ontology currently linked within the Linked-Open Data (LOD) cloud (footnote 10). It combines a hand-crafted class hierarchy with automatically generated instance data taken from the Wikipedia effort. Through its huge coverage, DBpedia is now the most strongly linked dataset within the LOD. For the sake of this research we consider the DBpedia ontology in RDFS [30], i.e., we ignore the (very few) OWL operators used in the model. RDF(S) and its underlying semantics is a very common modelling framework, which makes DBpedia an interesting object of study as it is almost exclusively modelled in RDF(S).

RDF(S) concepts and their meaning. RDFS comes with a specific labelling relation rdfs:label. Furthermore, RDF is equipped with an rdf:type relation that relates objects with classes. It seems natural to define the extension of an RDF class to be the set of all instances in the rdf:type relation. The intension of a class, on the other hand, is not specifically defined in RDF(S). We have chosen a simple approach which focusses on the “semantic” operators in RDFS with fixed semantics, more precisely rdfs:subClassOf, rdfs:range and rdfs:domain. The intension of a concept C is then simply the set of all triples with C in the subject or object position of these three types of triples.

Let us define the meaning of a DBpedia concept formally. We will call the combination of the DBpedia terminology T (footnote 11), the explicit type information and the relations translated from Wikipedia the DBpedia ontology.

Definition 13. Let O be the DBpedia ontology, i.e., a set of triples (s, p, o), and $O^*$ the semantic closure of O. The rdf-label $lab_r(C)$ of C is defined as the object o of (C, rdfs:label, o). The rdf-extension $ext_r(C)$ of C is defined as the set of resources r such that (r, rdf:type, C) ∈ $O^*$. The rdf-intension $int_r(C)$ of C is defined as the set of all triples (C, p, o) ∈ $O^*$ where p = rdfs:subClassOf, and all triples (s, p, C) where p ∈ {rdfs:subClassOf, rdfs:domain, rdfs:range}.

Please note that this definition is just one possible choice of ontological commitment regarding the meaning of a concept in RDF(S). Similarity between labels is defined on the basis of string similarity. Similarity between extensions and intensions, which are just sets of resources and sets of triples respectively, can easily be defined through set similarity (Jaccard).

Experiments to study concept drift in DBpedia. We studied the four latest versions of the DBpedia ontology, namely DBpedia 3.5, 3.4, 3.3 and 3.2. Table 1 gives some general information about these four versions. These versions were uploaded into independent RDF repositories.

Version   #Concept   #Resource
3.5       255        1,477,377
3.4       204        1,161,678
3.3       174        1,054,199
3.2       174        875,273

Table 1: Four versions of the DBpedia ontology

5.2.1. Study concept drift based on identity

DBpedia concepts have URI references which remain stable over the versions. Therefore we use the URIrefs as the identities of concepts. Again, this is a rather arbitrary decision that can be argued for or against. In our case, a DBpedia concept refers to its website, rather than an underlying abstract concept. If this website suddenly refers to some other topic (which happens occasionally), this change will be detected as an intensional shift.

Although most DBpedia concepts define their label using rdfs:label, these labels are mostly equivalent to the localnames of their URIrefs. This means that labels are strongly correlated with the identity, and it is thus less interesting to study label drift. For each concept, we built its extensional representation (i.e., the set of instances) and its intensional representation (i.e., the set of related triples). The Jaccard similarity measure was applied to measure the extensional and intensional similarity between the different variants of the same concept. In the end, we averaged over all similarities to indicate the general level of stability of this concept, in terms of its extensional and intensional semantics, respectively. Table 2 gives the top 5 most stable and unstable concepts in terms of these two aspects.

Rank: 1 2 3 4 5 ... 163 164 165 166 167
Extensional, Intensional (ranks 1 to 5): Planet, Road, Infrastructure, SportsEvent, FormulaOneRacer, WineRegion, Cyclist, LunarCrater, Cleric, WrestlingEvent
Extensional, Intensional (ranks 163 to 167): OfficeHolder, Politician, City, Vein, BasketballPlayer, EthnicGroup, College, ChemicalCompound, Band, BritishRoyalty

Table 2: The top 5 most stable and last 5 least stable DBpedia concepts in terms of their extension and intension (of the 167 concepts present in all four versions)

Concept Politician is considered extensionally very unstable. This can easily be confirmed by the change in the sheer number of instances. In Version 3.2 it has only 476 instances, while in Version 3.5 it already has 19,285 instances. The extension of this concept has clearly expanded significantly. However, low stability does not necessarily lead to concept shift. For example, although growing, the extension of Politician is always the most similar to the extension of its following variant.

Concept City is also very unstable extensionally. In Version 3.4 it has indeed shifted to another concept in Version 3.5, Settlement. These two concepts share more than 73% of their instances, which causes a high extensional similarity between them, higher than the similarity between the two variants of City. We found that Settlement appeared only in Version 3.5. For some reason, most of the instances of City in Version 3.4 have been transferred to Settlement. As DBpedia is of course based on the community effort of Wikipedia, this poses the interesting question of whether this was an intentional decision or simply happened (footnote 12).

Similarly, studying the intensional stability and shifts also gives insight into the evolution of the intensional semantics of a concept. For example, the intensionally very unstable concept EthnicGroup has been involved in the rdfs:domain and rdfs:range of a continuously changing set of properties. This indicates that this concept is related to different concepts in different versions, which contributes to the intensional instability. Furthermore, real shifts happened to some concepts. The identified extensional shift from City to Settlement is also found to be an intensional shift, that is, these two concepts not only share a lot of instances, but their intensional definitions are very similar too. This double confirmation is valuable because it may well indicate a genuine concept shift.

Footnote 9: http://wiki.dbpedia.org/
Footnote 10: http://linkeddata.org/
Footnote 11: The terminology is called differently in the different DBpedia versions, but usually something like dbpedia-ontoloy.owl.
Footnote 12: The Wikipedia logs unfortunately do not give clear insights into this question.

5.2.2. Study concept drift based on morphing

If we do not constrain ourselves to use explicit identities, what can the morphing theory tell us? Concept SportsEvent has the strongest morphing chain, while Concept ChemicalCompound has the weakest morphing chain. Stronger morphing chains can involve concept shift if the identities are known, such as the red links in Fig. 14. Let us look at the morphing-based chain split. Out of 104 morphing chains from Version 3.2 to Version 3.5, only 6 concepts have such a chain split. This indicates that the propagation of the similarity along maximum morphing pairs of concepts to a great extent preserves the identity of the concepts, although for a very small number of concepts such propagation leads to a relatively more drastic drift.

Most stable concepts, Most unstable concepts: norm.owl#Custom, expression.owl#Promise, norm.owl#Potestative Expression, norm.owl#Hohfeldian Power, legal-action.owl#Mandate, legal-action.owl#Public Law, legal-action.owl#Asignment, legal-action.owl#Act of Law, relative-places.owl#Place, legal-action.owl#Delegation

Table 3: Top 5 stable and unstable concepts.
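The morphing-based analysis of Section 5.2.2 can be sketched as follows. Several parts are assumptions rather than the definitions of the paper: the max-sim-chain is built by repeatedly following the most similar concept in the next version, the chain strength is taken to be the mean of the link similarities (Definition 10 is not reproduced in this section, so the actual formula may differ), and a split is reported when the aspects of a concept do not all morph into one single concept. Function names and the toy extensions are hypothetical.

```python
from statistics import mean


def jaccard(x, y) -> float:
    """Jaccard set similarity; two empty sets count as identical (assumption)."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if (x or y) else 1.0


def max_sim_chain(start, later_versions, sim):
    """Follow, version by version, the most similar concept in the next version.

    Returns the chain of concepts and the similarity of each link.
    """
    chain, links = [start], []
    current = start
    for candidates in later_versions:
        nxt = max(candidates, key=lambda c: sim(current, c))
        links.append(sim(current, nxt))
        chain.append(nxt)
        current = nxt
    return chain, links


def chain_strength(links) -> float:
    """Strength of a morphing chain; the mean is an assumed stand-in for Definition 10."""
    return mean(links)


def has_split(morph_targets) -> bool:
    """True if the aspects do not all morph into one single concept.

    `morph_targets` maps each aspect name to the set of concepts it morphs into.
    """
    targets = set().union(*morph_targets.values())
    return len(targets) > 1


# Invented toy extensions for three versions, keyed by (version, concept).
extension = {
    ("v1", "City"): {"a", "b", "c", "d"},
    ("v2", "City"): {"a", "b", "c", "d", "e"},
    ("v2", "River"): {"x"},
    ("v3", "City"): {"e"},
    ("v3", "Settlement"): {"a", "b", "c", "d"},
}


def ext_sim(a, b) -> float:
    return jaccard(extension[a], extension[b])


chain, links = max_sim_chain(
    ("v1", "City"),
    [[("v2", "City"), ("v2", "River")], [("v3", "City"), ("v3", "Settlement")]],
    ext_sim,
)
strength = chain_strength(links)
```

In this toy setting the chain leaves City for Settlement in the last version, which is the kind of link that becomes a detected shift once identities are known. Computing every chain requires the similarity between all pairs of concepts in neighbouring versions, so the cost grows with the product of the version sizes.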
Figure 14: Examples of morphing chains (DBpedia versions 3.2 to 3.5)

5.2.3. Identify concept drift moments

As shown in Fig. 15, the concept label is indeed the most stable aspect of the concept meaning. There are relatively big intensional changes between Version 3.3 and 3.4, while the concepts are rather stable in terms of their extensions. As opposed to Fig. 13, the two theories give rather similar analyses. One reason is that RDFS provides much more explicit definitions for concept intension and extension, which leads to a more reliable and consistent analysis.

5.3. Case-study 3: Concept drift in LKIF-Core

In this section we look at the evolution of concepts in LKIF-Core, an OWL ontology of basic legal concepts (footnote 13). It consists of 15 modules, each of which describes a set of closely related concepts from both legal and common sense domains. This ontology has also been continuously developed, and uses most of OWL’s expressiveness. Our case study on a legal ontology was supported by an expert from the Leibniz Center for Law of the University of Amsterdam, who performed a small-scale qualitative evaluation of the produced results and gave advice on the intended meaning of the concepts in LKIF. This expert is one of the leading developers of LKIF.

OWL concepts and their meaning. The formal meaning of a concept in an OWL DL ontology is often far more explicitly defined than in other formalisms. The intension of a concept could potentially be defined as the set of all possible DL concepts that are equivalent to it with respect to the ontology. Unfortunately, this is only possible in so-called definitorial terminologies, and difficult to calculate even if possible. We will approximate this set in our experiments (rather coarsely) by using the OWLIM (footnote 14) interpretation of LKIF, and consider finite sets of consequences (triple chains). As OWL ontologies are specifications of sets of possible models, there is no unique notion of the extension of concepts. However, once committed to a particular model, DL semantics provide the formal instance-of relation to specify the extension of a concept. Let us define the meaning of concepts in LKIF.

Definition 14. Let O be the LKIF-Core ontology and let $O^*$ denote the OWLIM inferred semantic closure. The owl-label $lab_o(C)$ of C is defined as the object o of (C, rdfs:label, o). The owl-extension $ext_o(C)$ of C is defined as the set of individuals i such that (i, rdf:type, C) ∈ $O^*$. The owl-intension $int_o(C)$ of C is defined as:

1. all triples (C, p, o) ∈ $O^*$ and (s, p, C) ∈ $O^*$,
2. all triples in chains $\{(C, p_1, o_1) \circ (s_2, p_2, o_2) \circ \dots \circ (s_n, p_n, o_n)\}$ where $s_k = o_{k-1}$, plus
3. all triples in chains $\{(s_1, p_1, o_1) \circ (s_2, p_2, o_2) \circ \dots \circ (s_n, p_n, C)\}$ where $s_{k+1} = o_k$ being blank nodes.

Again, the above definition is only one possible ontological commitment regarding the meaning of an OWL concept. Based on this definition, the similarity of label, intension and extension can be calculated in the same way as for the DBpedia case.

Experiments to study concept drift in LKIF. We studied 4 major versions of LKIF, namely 1.0, 1.0.2, 1.0.3 and 1.1. As before, we take the localname in the URIrefs as the identity of concepts and then study concept drift based on identity. We also study concept drift using the morphing theory. Unfortunately, rdfs:label was rarely used (only 4 concepts) and the LKIF ontology does not come with any instance data. This limits us to the intensional shift, stability, split and morphing strength.

5.3.1. Study concept drift based on identity

All concepts were ranked according to their intensional stability; see Table 3 for some examples. The ranked list has been confirmed by one of the developers of LKIF to be consistent with his expectations. By comparing the intension of concepts between different versions, we were also able to find true concept shifts, listed in Table 4. As Table 4 shows, some intensional shifts correspond to shifts between modules. For example, Speech Act is no longer a general action; instead, it belongs to the expression module for describing propositions and propositional attitudes (belief, intention), qualifications, statements and media. Other shifts, for example Mental Concept to Mental Entity, were confirmed to be renaming operations.

lkif1.0:action.owl#Speech Act               →  lkif1.0.2:expression.owl#Speech Act
lkif1.0:action.owl#Termination              →  lkif1.0.2:process.owl#Termination
lkif1.0.2:lkif-top.owl#Mental Concept       →  lkif1.0.3:lkif-top.owl#Mental Entity
lkif1.0.2:lkif-top.owl#Physical Concept     →  lkif1.0.3:lkif-top.owl#Physical Entity

Table 4: Examples of confirmed intensional shift in LKIF-Core

Footnote 13: http://ontology.leibnizcenter.org/trac/wiki/LKIFCore
Footnote 14: OWLIM is an RDF database management system, supporting the semantics of RDFS, OWL Horst and OWL 2 RL, see http://www.ontotext.com/owlim/. For our experiments we used OWLIM version 3.

5.3.2. Study concept drift based on morphing

Some shifts (found based on identity semantics) can also be recognised when looking at the morphing chains. In Fig. 16, the shift from action.owl#Speech Act in Version 1.0 to expression.owl#Speech Act in Version 1.0.2 is the first link (in red) of the morphing chain. This morphing chain is consistent with the actual operations on this ontology. Another morphing chain, starting from Concept lkif1.0:action.owl#Termination, is also confirmed by our domain expert, but its strength is weaker.
Similar to the DBpedia case, out of 98 morphing chains from Version 1.0 to Version 1.1, only 7 concepts have a morphing-based chain split. As we know that the labels serve almost as unique identifiers in LKIF, this further indicates that the max-sim-chain on intension can be used as a reliable strategy for identifying concept identities.

Figure 15: Concept stability over time, using two formalisations (DBpedia). Panels: (a) Label stability, (b) Intensional stability, (c) Extensional stability, (d) Whole concept stability. X-axis: version pairs dbpedia32_dbpedia33, dbpedia33_dbpedia34, dbpedia34_dbpedia35; Y-axis: concept stability; series: Morphing and ID.

Figure 16: Examples of morphing chains (LKIF)

5.3.3. Study concept drift moments

Again we compare the stability of chains, either based on identity or morphing. As visible in Fig. 17, there are fewer changes between Version 1.0.2 and 1.0.3. Similar to the DBpedia case, the two theories gave quite similar analyses.

Figure 17: Concept stability over time, using two formalisations (LKIF). Panel: Intension. X-axis: version pairs lkif1.0_lkif1.0.2, lkif1.0.2_lkif1.0.3, lkif1.0.3_lkif1.1; Y-axis: concept stability; series: Morphing and ID.

5.4. Case-study 4: Concept drift in NCBO Biomedical Ontologies

We also studied concept drift in a few biomedical ontologies from the NCBO Bioportal (footnote 15). These ontologies are in the OBO format (footnote 16). Unfortunately, there is also no instance data available from the Bioportal. We therefore only study concept drift in terms of label and intension. Our case study on biomedical ontologies was supported by NCBO in Stanford. Through the contacts at NCBO, small-scale qualitative analyses of the results were done by one of the curators of the Gene Ontology and by main contributors to the HPO.

OBO concepts and their meaning. The meaning of an OBO concept is defined by a term stanza, which consists of tag: value pairs. Each term has its identity defined using the tag id and its label using the tag name. Its intension is then defined by all the remaining tag: value pairs. The instances of an OBO ontology are defined by instance stanzas, each of which is associated with its concept using the tag instance_of. Therefore, we define the meaning of an OBO concept formally as follows.

Definition 15. Let O be an OBO ontology. The obo-label $lab_o(C)$ of concept C is defined as the value of the tag name. The obo-extension $ext_o(C)$ of C is the set of instances whose value of the tag instance_of is C. The obo-intension $int_o(C)$ of C is the set of the remaining tag: value pairs of C (i.e. all but the id and name tags).

Based on this definition, the similarity of label, intension and extension can be calculated in the same way as for the previous DBpedia and LKIF cases.

Footnote 15: http://bioportal.bioontology.org/
Footnote 16: http://www.geneontology.org/GO.format.obo-1_2.shtml

Study concept drift in OBO ontologies. The Human Phenotype Ontology (HPO) has 28 versions, with the number of concepts increasing from 8773 to 9559. Among these versions, we identified 18 label shifts. These label shifts occur mainly because the concept of the previous version was deleted and replaced by a new concept with the same label but a new id number. For example, Concept HP:0005730 (Small epiphyses) in Version 1.66 is substituted by Concept HP:0010585 in Version 1.66. We also identified 6707 intensional shifts. Some intensional changes are due to changes in the ontology structure. For example, the property synonym in Version 1.1 was changed into exact_synonym in Version 1.2, which was then changed back to synonym. The current similarity measure is sensitive to such name changes; therefore, many false intensional shifts were identified.

In Fig. 18, the stability between Version 1.54 and 1.59 drops to around 0.5, because a new property xref was introduced and most of the concepts were enriched by UMLS references using this tag. The intensions of the concepts were therefore enlarged at that moment, which should be considered a crucial moment in the evolution of the concepts. However, this of course relies on the commitments we made about which information should be considered a part of the intension.

According to the HPO ontology developers, it is not surprising that the early days of ontology development were more dynamic and that concepts were less stable than in the later versions. With the quality of the ontology improving, the similarity between versions is expected to become higher, which is consistent with our results.

Figure 18: Concept stability over time, using two formalisations (Human Phenotype Ontology). Panels: (a) Label stability, (b) Intensional stability. X-axis: version pairs from 1.1_1.2 to 1.79_1.8; Y-axis: concept stability; series: Morphing and ID.

We also looked at the evolution of the Gene Ontology (23 versions), see Fig. 19. For the Gene Ontology, we only took sample versions from the Bioportal repository on a monthly basis from June 2008 to May 2010 (footnote 17). There are different types of edits, including term name changes, term definition updates, updates to relationships used by terms, and extensive changes to external references. These all contribute to the label and intensional drift. The GO developers did find that more changes had been made at the less stable moments. As shown in Fig. 19, the labels were changed the most between September and November 2008, while the most intensionally unstable moment is between June and July 2009, which the GO expert we consulted confirmed to be due to substantial changes in the SDB (Society for Developmental Biology) terms.
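Definition 15 can be illustrated with a short Python sketch that reads one term stanza and compares the intensions of two versions of the same concept. The parser below handles only simple tag: value lines and is not a complete OBO reader; the function names, the identifiers with the EX: prefix and both stanzas are invented for illustration.

```python
def parse_term_stanza(stanza: str):
    """Split one OBO term stanza into (id, label, intension).

    The intension is the set of all tag: value pairs other than id and name.
    """
    concept_id, label, intension = None, None, set()
    for line in stanza.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("["):  # skip blank lines and the [Term] header
            continue
        tag, _, value = line.partition(":")   # split on the first colon only
        tag, value = tag.strip(), value.strip()
        if tag == "id":
            concept_id = value
        elif tag == "name":
            label = value
        else:
            intension.add((tag, value))
    return concept_id, label, frozenset(intension)


def jaccard(x, y) -> float:
    """Jaccard set similarity; two empty sets count as identical (assumption)."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if (x or y) else 1.0


# Invented stanzas: the only change between versions is the name of one tag.
OLD_VERSION = """
[Term]
id: EX:0000001
name: Example term
synonym: "Sample" []
is_a: EX:0000000
"""

NEW_VERSION = """
[Term]
id: EX:0000001
name: Example term
exact_synonym: "Sample" []
is_a: EX:0000000
"""

old_id, old_label, old_intension = parse_term_stanza(OLD_VERSION)
new_id, new_label, new_intension = parse_term_stanza(NEW_VERSION)
intensional_similarity = jaccard(old_intension, new_intension)
```

Here the identity and the label are unchanged, yet the intensional similarity drops to one third because a single tag was renamed. This is the sensitivity to tag name changes that produced false intensional shifts in the HPO experiments; normalising known tag renamings before the comparison would be one way to reduce it.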
Footnote 17: As shown in previous cases, the two theories provide similar results. However, the analysis based on morphing is computationally very expensive, as a full similarity matrix between all pairs of concepts needs to be computed. The number of terms in our sampled versions grows from 27542 to 33243, which makes the computation practically infeasible. Therefore, we only show the results based on identity.

Figure 19: Concept stability (Gene Ontology). Panels: label, Intension. X-axis: version pairs from Jun08 to Apr10; Y-axis: concept stability.

6. Conclusion

More and more applications critically depend on some kind of concept scheme for the semantic interoperability of their data. However, although it is recognised by many as a critical problem, the continuous change in the meaning of concepts (called drift in this paper) has not yet received the attention it deserves in the ontology modelling community. Despite the significant efforts that have gone into topics such as ontology evolution, semantic versioning or temporal modelling and reasoning, most tools are still based on static representations. The existing ontology versioning frameworks focus on the interoperability between versions and data. There is not yet a formal framework for concept drift, nor an implementation for identifying significant concept drift.

This paper attempts to close this gap by introducing a theoretical foundation for concept drift. We study concept drift based on two theories: one based on concept identity and one based on concept morphing. We also propose a generic qualitative toolkit to study concept drift over time through some more practical notions: shift and stability, split and morphing strength. We apply this general formalisation in practical applications modelled in SKOS, RDFS, OWL and OBO. These plausibility tests are still preliminary, but encouraging: although intensional drift is difficult to study because the concepts are often not formally defined, the detected concept drift gives useful information to the domain experts, for example by identifying substantial ontology design changes. Additionally, our experiments suggest that studying morphing chains is a promising strategy in practice for identifying concept identities.

Our case-studies also indicate that in few realistic scenarios all of the proposed methods give useful insights, or can even be meaningfully applied. In some cases, one or even two of the aspects of the meaning were missing; in others, one of the aspects was almost functionally coupled with identity. For that reason we introduced our mechanisms as a toolkit, for the knowledge engineer and domain expert to try out, and in the end to choose those that are most likely to produce meaningful insights.

Future research will go in two directions. First, in cooperation with communication scientists working on political reporting and with legal experts, we will apply the proposed methods to detect changes of meaning in real-world scenarios. We are also involved in a close cooperation with NCBO, which will give us more insights into the problem. In at least two of these domains we will have to do a more detailed, larger-scale usability study to assess the validity of our proposed notions beyond the slightly anecdotal evidence provided in this paper. Second, with the experience gained, we plan to create applications; the ultimate goal is of course to improve the usability of our toolkit for easier analysis of drift in dynamic environments, and to extend the semantics of concept drift into query or reasoning tasks.

[9] N. Guarino, C. Welty, Evaluating ontological decisions with ontoclean, Commun.
ACM 45 (2) (2002) 61–65. [10] J. A. Gulla, G. Solskinnsbakk, P. Myrseth, V. Haderlein, O. Cerrato, Semantic drift in ontologies, in: Proceedings of 6th International Conference on Web Information Systems and Technologies, Valencia, Spain, 2010. [11] P. Haase, F. Harmelen, Z. Huang, H. Stuckenschmidt, Y. Sure, A framework for handling inconsistency in changing ontologies, The Semantic Web–ISWC 2005 (2005) 353–367. [12] P. Haase, A. Hotho, L. Schmidt-Thieme, Y. Sure, Collaborative and usage-driven evolution of personal ontologies, in: A. Gmez-Prez, J. Euzenat (eds.), The Semantic Web: Research and Applications, vol. 3532 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2005, pp. 486–499. URL http://dx.doi.org/10.1007/11431053_33 [13] J. Heflin, Z. Pan, A model theoretic semantics for ontology versioning, in: In Third International Semantic Web Conference, Springer, 2004. [14] G. Hodge, Systems of knowledge organization for digital libraries: Beyond traditional authority files, Tech. rep., Council on Library and Information Resources (Apr. 2000). [15] R. Hoekstra, J. Breuker, M. D. Bello, A. Boer, The lkif core ontology of basic legal concepts, in: P. Casanovas, M. A. Biasiotti, E. Francesconi, M. T. Sagri (eds.), Proceedings of the Workshop on Legal Ontologies and Artificial Intelligence Techniques, 2007. [16] Z. Huang, H. Stuckenschmidt, Reasoning with multi-version ontologies: A temporal logic approach, in: In Proceeding of the 4th International Semantic Web Conference (ISWC, 2005. [17] A. Isaac, E. Summers, Skos simple knowledge organization system primer, http://www.w3.org/TR/skos-primer/ (2008). [18] A. Isaac, E. Summers, SKOS Primer, W3C Group Note (2009). URL http://www.w3.org/TR/skos-primer/ [19] P. Jaccard, Étude comparative de la distribution florale dans une portion des alpes et des jura, Bulletin de la Socit Vaudoise des Sciences Naturelles 37 (1901) 547–579. [20] Y. Kalfoglou, M. 
Schorlemmer, Ontology mapping: the state of the art, The knowledge engineering review 18 (01) (2003) 1–31. [21] A. Kalyanpur, B. Parsia, E. Sirin, B. Grau, J. Hendler, Swoop: A web ontology editing browser, Web Semantics: Science, Services and Agents on the World Wide Web 4 (2) (2006) 144–153. [22] M. Klein, Change management for distributed ontologies, Ph.D. thesis, Vrije Universiteit Amsterdam (Aug. 2004). URL http://www.cs.vu.nl/˜mcaklein/thesis/ [23] M. C. A. Klein, D. Fensel, A. Kiryakov, D. Ognyanov, Ontology versioning and change detection on the web, in: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web, EKAW ’02, Springer-Verlag, 7. Acknowledgements We are especially grateful to Janet Takens, Jan Kleinnijenhuis, Sebastian Khler, Peter Robinson, Sandra Dölken, Paea LePendu, Rinke Hoekstra, Doug Howe who participated our use case study and provided crucial feedback. [1] [2] [3] [4] [5] L. Bloomfield, Language, Allen and Unwin, 1933. Brockhaus: Europaeische gemeinschaft, translated (1999). DTV, DTV Atlas (1979). Encyclopedia britannica, online version visited 27 May 2010 (2010). European Union, The history of the european union, Website, http: //europa.eu/abc/history/index_en.htm (2010). [6] N. Fanizzi, C. d’Amato, F. Esposito, Conceptual clustering and its application to concept drift and novelty detection, in: ESWC’08: Proceedings of the 5th European semantic web conference on The semantic web, Springer-Verlag, Berlin, Heidelberg, 2008. [7] G. Flouris, D. Manakanatas, H. Kondylakis, D. Plexousakis, G. Antoniou, Ontology change: classification and survey, Knowledge Eng. Review 23 (2) (2008) 117–152. [8] F. L. G. Frege, über sinn und bedeutung, Zeitschrift für Philosophie und philosophische Kritik 100 (1892) 25–50. 20 [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] London, UK, 2002. URL http://portal.acm.org/citation.cfm?id=645362. 
756436 M. Klenner, U. Hahn, Concept versioning: A methodology for tracking evolutionary concept drift in dynamic concept systems, in: In Proc. of ECAI 1994, Wiley, 1994. J. Lehmann, C. Bizer, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, S. Hellmann, Dbpedia - a crystallization point for the web of data, Journal of Web Semantics 7 (3) (2009) 154–165. V. Levenshtein, Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Physics Doklady 10 (1966) 707. Y. Liang, H. Alani, N. Shadbolt, Changing ontology breaks queries, in: I. Cruz, S. Decker, D. Allemang, C. Preist, D. Schwabe, P. Mika, M. Uschold, L. Aroyo (eds.), The Semantic Web - ISWC 2006, vol. 4273 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2006, pp. 982–985. URL http://dx.doi.org/10.1007/11926078_79 Y. Liang, H. Alani, N. Shadbolt, et al., Change management: The core task of ontology versioning and evolution, in: Proceedings of Postgraduate Research Conference in Electronics, Photonics, Communications and Networks, and Computing Science, 2005. C. Matuszek, J. Cabral, M. Witbrock, J. Deoliveira, An introduction to the syntax and content of cyc, in: Proceedings of the 2006 AAAI Spring Symposium on Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering, 2006. B. McBride, Rdf vocabulary description language 1.0: Rdf schema, http://www.w3.org/TR/rdf-schema/ (2004). D. L. McGuinness, F. van Harmelen, Owl web ontology language overview, http://www.w3.org/TR/owl-features/ (January 2004). B. Motik, B. Cuenca Grau, I. Horrocks, Z. Wu, A. Fokoue, C. Lutz, Owl 2 web ontology language profiles, http://www.w3.org/TR/owl2-profiles/ (2009). N. Noy, A. Chugh, W. Liu, M. Musen, A framework for ontology evolution in collaborative environments, The Semantic Web-ISWC 2006 (2006) 544–558. N. F. Noy, M. Klein, Ontology evolution: Not the same as schema evolution, Knowledge and Information Systems 6 (2004) 428–440. 
URL http://dx.doi.org/10.1007/s10115-003-0137-2 N. F. Noy, M. A. Musen, Promptdiff: a fixed-point algorithm for comparing ontology versions, in: Eighteenth national conference on Artificial intelligence, American Association for Artificial Intelligence, Menlo Park, CA, USA, 2002. URL http://portal.acm.org/citation.cfm?id=777092. 777207 N. F. Noy, M. A. Musen, Ontology versioning in an ontology management framework, IEEE Intelligent Systems 19 (2004) 6–13. URL http://dx.doi.org/10.1109/MIS.2004.33 P. Plessers, O. De Troyer, Ontology change detection using a version log, in: Y. Gil, E. Motta, V. Benjamins, M. Musen (eds.), The Semantic Web ISWC 2005, vol. 3729 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2005, pp. 578–592. URL http://dx.doi.org/10.1007/11574620_42 P. Plessers, O. D. Troyer, S. Casteleyn, Understanding ontology evolution: A change detection approach, Web Semantics: Science, Services and Agents on the World Wide Web 5 (1) (2007) 39 – 49, selected Papers from the International Semantic Web Conference (ISWC2005). URL http://www.sciencedirect.com/ science/article/B758F-4MX4VXJ-1/2/ a6eaf2835f9d1ca7799bcc27980d2973 F. D. Saussure, Course in General Linguistics., Open Court Classics, 1986, reedition. B. Schopman, S. Wang, S. Schlobach, Deriving concept mappings through instance mappings, in: Proceedings of the 3rd Asian Semantic Web Conference, Bangkok, Thailand, 2008. B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, L. J. Goldberg, K. Eilbeck, A. Ireland, C. J. Mungall, The OBI consortium, N. Leontis, P. Rocca-Serra, A. Ruttenberg, S.-A. Sansone, R. H. Scheuermann, N. Shah, P. L. Whetzel, S. Lewis, The obo foundry: coordinated evolution of ontologies to support biomedical data integration, Nature Biotechnology (25) (2007) 1251 – 1255. [42] L. Stojanovic, A. Maedche, B. Motik, N. 
Stojanovic, User-driven ontology evolution management, Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web (2002) 133–140. [43] L. Stojanovic, B. Motik, Ontology evolution within ontology editors, in: Proceedings of the OntoWeb-SIG3 Workshop, Citeseer, 2002. [44] Y. Sure, M. Erdmann, J. Angele, S. Staab, R. Studer, D. Wenke, OntoEdit: Collaborative ontology development for the semantic web, The Semantic WebISWC 2002 (2002) 221–235. [45] A. Tsymbal, The problem of concept drift: definitions and related work, Tech. Rep. TCD-CS-2004-15, Computer Science Department, Trinity College Dublin, Ireland (2004). [46] J. Van Cuilenburg, J. Kleinnijenhuis, J. De Ridder, Towards a graph theory of journalistic texts., European Journal of Communication 1 (1986) 65– 96. [47] S. Wang, S. Schlobach, M. Klein, Concept drift and how to identify it, in: European Knowledge Acquisition Workshop (EKAW), 2010. [48] S. Wang, S. Schlobach, J. Takens1, W. van Atteveldt, Mapping-chains for studying concept shift in political ontologies, in: Proceedings of the Ontology Matching workshop (OM 2009), Washington, USA, 2009. [49] G. Widmer, M. Kubat, Effective learning in dynamic environments by explicit context tracking, in: ECML ’93: Proceedings of the European Conference on Machine Learning, Springer-Verlag, London, UK, 1993. [50] G. Widmer, M. Kubat, Learning in the presence of concept drift and hidden contexts, Mach. Learn. 23 (1) (1996) 69–101. [51] Wikipedia: European union, Timeback machine: visited (2010). URL http://en.wikipedia.org/w/index.php?title= European_Union&oldid=391894227 [52] Wikipedia: European union, Timeback machine: visited (2003). URL http://en.wikipedia.org/w/index.php?title= European_Union&oldid=1433926 [53] Wikipedia: European union, Timeback machine: visited (2006). URL http://en.wikipedia.org/w/index.php?title= European_Union&oldid=84415101 21