Concept drift and how to identify it

Shenghui Wang b,a,∗, Stefan Schlobach a, Michel Klein a
a Vrije Universiteit Amsterdam, The Netherlands
b OCLC, Leiden, The Netherlands
Abstract
This paper studies concept drift over time. We first define the meaning of a concept in terms of intension, extension and label.
Then we study concept drift over time using two theories: one based on concept identity and one based on concept morphing. A
qualitative toolkit for analysing concept drift is proposed to detect concept shift and stability when concept identity is available,
and concept split and strength of morphing chain if using the morphing theory. We apply our framework in four case-studies: a
political vocabulary in SKOS, the DBpedia ontology in RDFS, the LKIF-Core ontology in OWL and a few biomedical ontologies
in OBO. We describe ways of identifying interesting changes in the meaning of concepts within given application contexts. These
case-studies illustrate the feasibility of our framework in analysing concept drift in knowledge organisation schemas of varying
expressiveness.
1. Introduction
Knowledge organisation systems (KOS), such as formal
ontologies (e.g. modelled in OWL), thesauri or taxonomies
(e.g. described in SKOS) or other term classification schemes,
play a crucial role in providing semantic interoperability in
many domains and use cases. They have become critical to
the Web of Data, for structured access to documents in libraries
or to patient records based on diagnostic information, and for many
more applications. In almost all modern types of KOS, concepts are the central constructs that are used to describe sets of
objects with shared characteristics. Although it is widely recognised to be an oversimplification, most current systems consider
their underlying KOS to be stable over time. For many applications this is becoming a critical problem, and this paper attempts
to provide a theory of what we call concept drift.1 As the world
is continuously changing, concepts also change over time. For
example, a concept may refer to different objects at different
points in time: the term Government of the Netherlands refers to
different people in 1999 and in 2009. Or consider the concept
Middle class, which has very different properties in
different periods of time. Concept drift is known to be a problem
in data mining and machine learning, where learned models lose
their predictive power over time [50]. Here, we deal with concepts that are explicitly and symbolically represented in some
KOS, and where the drift needs to be made explicit as well. In
this sense our contribution is orthogonal and complementary to
the work in the Machine Learning literature.
In this paper we will formally define what concept drift in
a KOS is, and how to study its impact, given two different
ontological paradigms, and without committing to a particular
type of knowledge organisation scheme. These definitions are
not intended to provide new philosophical insights, but aim at
making existing accepted notions of intension, extension and
labelling applicable in the context of dynamics of semantics.
Semantic drift: Throughout the theoretical part of the paper we
will use the generic example of the European Union (the concept European Union) and the countries of the EU (EUCountry) as
our concepts of reference.2 The historical overview provided
by the EU in [5] gives an interesting discussion of its development. Starting in the early Fifties “the European Union
is set up” with “Belgium, France, Germany, Italy, Luxembourg
and the Netherlands” as founding members. Within the next six
decades many other countries joined, and the European Union
slowly transformed from an economic community into an “international organisation governing common economic, social, and
security policies” [4]. The following list gives some definitions
of the concept of the European Union over time.
(1979) The European Community is a common denominator
for the European Economic Community (EEC), the European Coal and
Steel Community (ECSC), and the European Atomic Energy Community (EAEC). [3]
(1999) The European Community is the new stage in the implementation of increasing the Union of the European people. [2]
(2003) The European Union or EU is an international organisation of European states, established by the Treaty on
European Union. [52]
(2006) The European Union (EU) is a supranational and intergovernmental union of 25 independent, democratic member states. [53]
(2010) The European Union (EU) is an economic and political
union of 27 member countries, located primarily in Europe. [51]
Most of the aspects of the change of meaning of a concept
over time that we introduce in this paper can be identified in
this example. The instances of the set of countries of the European Union, also called its extension, change (e.g. from 25 to
27 between 2006 and 2007). The label of the concept European
Union changes from European Community to European Union.
From a judicial perspective those are different concepts. However, the European Union website [5] considers the EU and EC
to be the same concept. Note that it uses the label European
Union to signify the concept that existed in 1952. There was no
European Union at the time, however, only the European Community. In our opinion the website refers to an abstract concept
of an organisational unit which stands for the European idea,
which we now refer to as the European Union. Of course, the
properties of this European Union have significantly changed as
well: from a union bound by economic treaties to a full political, military and social organisation. Ontologically, we say that
the intension of the concept European Union has changed. These
three types of changes will be discussed as three different aspects of semantic drift: extensional, label and intensional drift
respectively.
∗ Corresponding author.
Email addresses: [email protected] (Shenghui Wang), [email protected] (Stefan Schlobach), [email protected] (Michel Klein)
1 Concept drift will be our name for change in meaning of a concept over
time. This drift in meaning occurs not only over time, but also over location,
culture, etc. For ease of presentation we will mostly refer to drift in time, but
significant parts of the framework should extend to other kinds of “contexts.”
2 We use two different concepts EUCountry and European Union as they
exemplify two different kinds of change: the (intensional) definition of the EU
is highly unstable and changes throughout history, but as a relatively abstract
concept it is not so clear what its instances (extension) are. For EUCountry it is
the opposite: the set of instances changes regularly, but its intensional definition as countries of the EU is relatively stable.
Preprint submitted to Elsevier, June 14, 2013
Identity based concept drift: The impact of such drift is difficult to measure and for that purpose we would like to identify
more drastic and qualitative changes in meaning. In the example of the EU countries there are some of these more drastic
changes. Recall that the EU started out with just 6 countries, all
of which were Central European. The effect of the expansion
of the EU is that the original meaning of the concept EUCountry
in terms of its members in 1950 is now far closer to the concept
CES of Central European states than to the concept EUCountry.
We will call such changes shift. Another way of looking at drift
is to study the similarity between the meaning over time. Figure 1 shows, e.g., the number of European member countries
of NATO and the EU over time.
Studying the similarity between these sets, e.g., in this simple example as the ratio of the number of new versus old instances, indicates the moments when the most drastic changes
occurred (in both cases between 1995 and 2004). We say that
we measure stability of the concept as compared to other concepts. In principle, there are two lessons: for each concept we
can identify the most unstable moments in a temporal chain, but
we can also compare the average or overall instability of a concept over time. An interesting comparison can be made with the
concept of European Countries within NATO, which has also seen
expansion over time. Note that the latter is more extensionally
stable as countries joined more gradually.
Morphing based concept drift: The previous analysis identifying shift and stability of the meaning of a concept makes
sense under the assumption, or ontological commitment, that
the concept of the European Union in 1995 is considered to be
the same concept as the European Union in 2007. This is just
one possible interpretation, and we do not want to restrict our
framework to just this one world-view. The discussion on identity is probably as old as philosophy itself, and has played a role
in ontology engineering as well. We try to keep out of this discussion by providing methods that work with identity as well as methods that work
without it. For the latter we consider concepts to pertain
to just one moment in time, and to evolve/morph into new,
but highly similar, concepts at each next moment in time. Meaning
drift is then defined as the degree of dissimilarity of these
maximally similar concepts over time. Again, we will study
drift with respect to intension, extension and labels.
In this alternative conceptualisation of concept drift we also
want to identify some qualitative notions to describe more drastic forms of drift. One example is the notion of split, which occurs when the different semantic elements (intension, extension
or label) morph into different concepts. When a series of concepts at different moments are linked by this morphing relation,
a morphing chain is formed. The strength of the morphing
chain, i.e., how similar the morphed concepts are, is another
notion to study.
Research questions: To our knowledge there has been no formalisation of what concept drift actually means and implies. In
order to identify different types of changes in concepts and to
understand the impact of concept drift, such a formalisation is
critical. Therefore, this paper focuses on the following research
questions:
1. What is concept drift, and how to formalise it?
2. Can we identify the impact of concept drift?
It is important to note that we want to develop a generic framework which allows us to study concept drift without a priori
choosing one of the above described ontological commitments
(identity versus morphing) or a particular type of KOS.
Methodology: We provide a generic formalisation of the meaning of concepts in terms of label, intension and extension.
Depending on the ontological commitment to an identity-based or a
morphing-based model, concept drift is defined differently, and
so are the qualitative follow-up notions.
For the former, the instability over a time period and concept shift between time points (where part of the meaning of a
concept shifts to some other concept) are crucial notions. For
the latter, the notion of split and the morphing strength will be
introduced.
As our proposal is meant to be very generic, we need to
discuss the notions of intension and extension in more detail.
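To make the morphing view more concrete, the following minimal Python sketch links a concept to its most similar successor in the next version of a KOS and reports the dissimilarity as drift. It only looks at the extensional aspect and assumes Jaccard similarity as the similarity measure; the measure, the function names and the toy data are illustrative assumptions, not definitions taken from this paper.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Extension = FrozenSet[str]


def jaccard(a: Extension, b: Extension) -> float:
    """Jaccard similarity of two extensions (assumed measure, not the paper's)."""
    if not a and not b:
        return 1.0  # two empty extensions are treated as identical
    return len(a & b) / len(a | b)


def morph(ext: Extension, next_version: Dict[str, Extension]) -> Tuple[Optional[str], float]:
    """Return the name and similarity of the most similar concept in the next version."""
    best_name: Optional[str] = None
    best_sim = -1.0
    for name, candidate in sorted(next_version.items()):
        sim = jaccard(ext, candidate)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name, best_sim


# Toy data (hypothetical): extension of EUCountry at one moment in time ...
eu_country_t0 = frozenset({"BE", "FR", "DE", "IT", "LU", "NL"})
# ... and the concepts available at the next moment in time.
version_t1 = {
    "EUCountry": frozenset({"BE", "FR", "DE", "IT", "LU", "NL", "DK", "IE", "UK"}),
    "NordicCountry": frozenset({"DK", "NO", "SE", "FI", "IS"}),
}

target, similarity = morph(eu_country_t0, version_t1)
drift = 1.0 - similarity  # extensional drift along this morphing step
```

Repeating this step over consecutive versions yields a morphing chain, and the similarities along the chain give one possible reading of its strength. Running the same step separately on labels and intensions, and checking whether they select different successors, is one way to look for a split.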
Case-studies: We instantiate our framework in four case-studies, studying concept drift in our motivating case, a political ontology in SKOS [17] used by communication scientists
for political analysis, as well as a general purpose RDFS [30]
ontology, DBpedia [25], a legal OWL [31] ontology, LKIF-Core [15], and three OBO biomedical ontologies [41]. We investigate the introduced mechanisms for studying concept drift in
these four different KR models. Our experiments show the feasibility of both the formalisation and the identification mechanisms
by pointing to some examples of concept (in)stability, shift,
split, and morphing strength which were identified as relevant
by collaborating domain experts. We are aware that these case-studies do not constitute a formal evaluation, let alone an empirical proof of the validity of our approach. However, they indicate
the broad coverage and potential of the toolkit (although not all
tools are useful in all scenarios). Furthermore, selected domain
experts have in three of the four cases confirmed the usefulness
of some of the results, the fourth being general enough to allow
an informal assessment without expert input.
Year   EU   diff    E-NATO   diff
1949                9        -
1952   6            10       11.1%
1955   9    50%     11       10%
1973   9    0%      11       0%
1981   10   11.1%   12       9.1%
1986   12   20%     12       0%
1995   15   25%     12       0%
1999   15   0%      15       25%
2004   25   66.7%   22       46.7%
2007   27   8.0%    22       0%
Figure 1: The number of European member countries of NATO and EU over time.
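The "diff" values reported in Figure 1 can be reproduced with a few lines of Python. The sketch below assumes that "diff" is the growth of the member count relative to the previous time point, which matches the numbers in the figure but is our reading rather than a definition stated in the text; since only counts are available, it also implicitly assumes that no member left. The function name is hypothetical.

```python
from typing import Dict


def relative_change(series: Dict[int, int]) -> Dict[int, float]:
    """Percentage growth of each count relative to the previous time point."""
    years = sorted(series)
    return {
        new: round(100.0 * (series[new] - series[old]) / series[old], 1)
        for old, new in zip(years, years[1:])
    }


# Member counts as listed in Figure 1.
eu = {1952: 6, 1955: 9, 1973: 9, 1981: 10, 1986: 12,
      1995: 15, 1999: 15, 2004: 25, 2007: 27}
e_nato = {1949: 9, 1952: 10, 1955: 11, 1973: 11, 1981: 12,
          1986: 12, 1995: 12, 1999: 15, 2004: 22, 2007: 22}

eu_diff = relative_change(eu)
nato_diff = relative_change(e_nato)

# Most unstable moment of each concept: the time point with the largest change.
eu_least_stable = max(eu_diff, key=eu_diff.get)
nato_least_stable = max(nato_diff, key=nato_diff.get)

# Average instability over the whole chain, for comparing the two concepts.
eu_mean = sum(eu_diff.values()) / len(eu_diff)
nato_mean = sum(nato_diff.values()) / len(nato_diff)
```

Under these assumptions both concepts show their largest change at the 2004 time point, and the average change is lower for the European NATO members than for the EU, in line with the observation that the former concept is extensionally more stable.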
Contributions: This paper operationalises established (philosophical) insights into a general pragmatic framework for dealing with concept drift and develops a general toolkit to detect
drift in different domains. We believe that we contribute to a
better understanding of temporal change of meaning in formal
knowledge organisation schemes and its impact in practical applications. We motivate and define the crucial notions of shift,
split, stability and morphing strength for different ontological
commitments and different types of knowledge organisation
systems. The four case-studies illustrate the feasibility of our
framework in analysing concept drift in knowledge organisation schemas of varying expressiveness. This is a significantly
extended version of our previous paper [47], in which some of
the ideas were introduced. We now provide far more detailed
analysis and formalisations, including an alternative model for
dealing with changes based on concept morphing (with the related notions of split and morphing strength). We also provide a
far more thorough evaluation including new datasets.
Structure of the paper: Section 2 gives a general introduction
to the four types of knowledge organisation schemas investigated and discusses the most relevant related work. In Section 3, we provide the basic formalisation of the meaning of
concepts, on which two theories of concept drift are based,
namely identity-based and morphing-based. These two theories are defined in Section 4 together with a practical toolkit
for analysing concept drift. In Section 5, we apply our general
framework in four different case studies involving four different
KOSs. Finally, Section 6 concludes the paper.
2. Background
In this section, we describe some general notions that are
necessary for understanding the remainder of the paper. In addition, we describe what has been done on formalizing concept
drift in knowledge systems.
2.1. Knowledge Organisation Schemas
Knowledge Organisation Schema (KOS) is a broad notion
used to refer to schemes meant to describe information. This
notion encompasses term lists and glossaries, classification
schemes such as subject headings or taxonomies, and semantically richer structures such as thesauri and ontologies. More
specifically, in [14] the notion of a KOS is explained as follows:
“The term knowledge organization systems is intended
to encompass all types of schemes for organizing
information and promoting knowledge management.
Knowledge organization systems include classification and categorization schemes that organize materials at a general level, subject headings that provide more detailed access, and authority files that
control variant versions of key information such as
geographic names and personal names. Knowledge
organization systems also include highly structured
vocabularies, such as thesauri, and less traditional
schemes, such as semantic networks and ontologies.
Because knowledge organization systems are mechanisms for organizing information, they are at the heart
of every library, museum, and archive.”
Over the last decades, several standards for representing knowledge structures have been developed. For example, the International Organization for Standardization (ISO) has developed a
guideline for monolingual thesauri (ISO-2788) and a guideline
for multilingual thesauri (ISO-5964), and the NISO has developed guidelines for the construction, format, and management
of monolingual controlled vocabularies (ANSI/NISO Z39.19).
In this paper, we focus on representations used in the context of
the Semantic Web, i.e. SKOS and OWL.
2.2. RDF Schema
RDF Schema (RDFS) is a simple type system for the Resource Description Framework (RDF). It provides a mechanism
to define domain-specific properties and classes of resources
to which those properties can be applied. The basic modeling primitives in RDFS are class definitions (rdfs:Class) and
subclass-of statements (rdfs:subClassOf), which together allow
the definition of class hierarchies. RDFS also includes property
definitions and subproperty-of (rdfs:subPropertyOf) statements to
build property hierarchies, as well as domain and range statements (rdfs:domain and rdfs:range) to restrict the possible combinations of properties and classes. Inherited from RDF,
type statements (rdf:type) are used to declare a resource as an
instance of a specific class. In addition, RDFS comes with a
specific labeling relation rdfs:label which can be useful to add
natural language terms to a class.
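As a rough illustration of how these primitives interact, the following Python sketch stores a handful of triples as plain tuples and derives superclasses and instances by following rdfs:subClassOf and rdf:type. The triples are toy data loosely based on the running EU example, the label value is invented, and the helper names are hypothetical; a real application would use an RDF library with RDFS reasoning rather than this hand-rolled traversal.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Toy graph (illustrative data only).
triples: Set[Triple] = {
    ("dbpedia:EUCountry", "rdfs:subClassOf", "dbpedia:Country"),
    ("dbpedia:Country", "rdfs:subClassOf", "cyc:PoliticalEntity"),
    ("dbpedia:EUCountry", "rdfs:label", "EU country"),
    ("Netherlands", "rdf:type", "dbpedia:EUCountry"),
    ("France", "rdf:type", "dbpedia:EUCountry"),
    ("Germany", "rdf:type", "dbpedia:EUCountry"),
}


def superclasses(cls: str, graph: Set[Triple]) -> Set[str]:
    """All classes reachable from cls via rdfs:subClassOf (transitively)."""
    found: Set[str] = set()
    frontier: List[str] = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def instances(cls: str, graph: Set[Triple]) -> Set[str]:
    """Resources typed with cls directly or with one of its subclasses."""
    return {
        s
        for s, p, o in graph
        if p == "rdf:type" and (o == cls or cls in superclasses(o, graph))
    }
```

With this toy graph, superclasses("dbpedia:EUCountry", triples) contains both dbpedia:Country and cyc:PoliticalEntity, and instances("dbpedia:Country", triples) returns the three countries even though they are only typed as dbpedia:EUCountry.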
2.3. OWL
The Web Ontology Language (OWL) is a family of knowledge representation languages for authoring ontologies, endorsed by the World Wide Web Consortium. These languages are characterised by formal semantics and a number of different serialisations, e.g. RDF/XML. OWL is built upon RDF and RDFS.
An OWL class is defined as an RDF resource of type owl:Class.
The mechanism to define class hierarchies is inherited from
RDFS. Furthermore, properties are used to describe resources
and specify class characteristics. Properties may possess logical characteristics such as being transitive, symmetric, inverse and
functional. The W3C has recently endorsed a new
version of OWL as a standard, referred to as OWL2 [32], which
extends the previous version with more expressiveness and fixes
some previous problems.
2.4. SKOS
SKOS (Simple Knowledge Organization System) is a common data model for sharing and linking knowledge organization systems via the Web, published as a W3C recommendation [18]. Its aim is to capture much of the similarity between
thesauri, classification schemes, taxonomies, subject-heading
systems, or any other type of structured controlled vocabulary.
SKOS is built upon RDF and RDFS.
The core of the SKOS model defines the classes and properties needed for representing the features most commonly found
in thesauri and other types of structured controlled vocabularies. The primitive objects in SKOS are abstract concepts, which
are defined as RDF resources of type skos:Concept. These concepts can be further described via RDF properties by:
• labels (i.e. terms), especially the preferred labels skos:prefLabel
and alternative labels skos:altLabel; these
properties are sub-properties of rdfs:label, and are used to
link a skos:Concept to an RDF literal (a character string)
optionally combined with a language tag (e.g. "en-US");
• semantic relations, such as skos:broader, skos:narrower and
skos:related;
• documentary notes, such as definitions, scope notes and
examples, represented by (subproperties of) skos:note.
Using the semantic relationships, the concepts can be organized in hierarchies using broader-narrower relationships, or
linked by non-hierarchical (“related”) relationships. Moreover, resources of type skos:Concept can be grouped into schemes, using the
skos:inScheme property.
2.5. OBO
The OBO format is widely used to represent a significant
number of biomedical ontologies, including the Gene Ontology
(GO).3 Having originated along with the Gene Ontology, the OBO “flat
file format” has evolved to support the needs of the biomedical ontologies that fall under the Open Biomedical Ontologies
(OBO) umbrella. According to its authors, the main aims of the OBO flat file format are to provide 1) human readability, 2) ease of
parsing, 3) extensibility and 4) minimal redundancy. The OBO
format currently forms the backbone of most GO based annotation and data analysis tools.
[Term]
id: HP:0000023
name: Inguinal hernia
namespace: medical_genetics
exact_synonym: "Inguinal hernias" []
is_a: HP:0000035 ! Abnormality of the testis
is_a: HP:0004299 ! Hernia of the abdominal wall
As the above example shows, an OBO ontology is a collection of
stanzas, each of which describes one element of the ontology.
A stanza is introduced by a line containing a stanza name that
identifies the type of element being described. The rest of the
stanza consists of lines, each of which contains a tag followed
by a colon, a value, and an optional comment introduced by “!”.
3 http://www.geneontology.org/
2.6. Related work
Although concept drift is widely recognised as an important
issue in Semantic Web applications,4 there is not yet a generally accepted way of dealing with changing concepts over time.
However, there is quite some related research (e.g., see [11, 27])
that discusses aspects of coping with concept drift. Below, an
overview is given of the research on concept change in different areas.
4 E.g., see the fact that the topic is mentioned in both the discussion about
the design of SKOS (http://www.w3.org/2001/sw/wiki/SKOS/Issues/ConceptEvolution)
and the previous version of OWL (http://www.w3.org/TR/2004/REC-owl-guide/),
while there is no fine-grained solution yet in OWL2.
Let us first look at research in other domains that is related
to our notion of concept drift. In historical linguistics, semantic
shift describes the evolution of word usage. Each word has multiple senses and connotations which can be added, removed or
altered over time. Semantic change is a change in one meaning
of a word. Semantic shift can be triggered by different forces
and have different types [1]. In this interpretation of “semantic”, there is no explicit relation with the meaning and change
of an “underlying concept”, as the main entity of concern is the
word itself.
In machine learning, concept drift addresses a similar problem [45]. The term concept refers to the quantity that a learning
model is trying to predict, i.e. the target variable. Concept drift is the
situation in which the statistical properties of the target concept
change over time. This requires regular updates of the predictive model itself. A special case is virtual concept drift [49]
(or sampling shift), in which the meaning of a concept does not
change; instead, the data that this concept classifies changes.
An example is the concept spam, for which the meaning does
not change, but the data distribution (i.e. the relative frequency
of the properties) is changing.
In 1994, Klenner and Hahn [24] discuss exactly the problem of concept drift because of evolving notions over time,
however, not in the context of Semantic Web applications but
for technical standards. As a mechanism for updating static
value restrictions or integrity constraints, they propose an automatic procedure. This generates a generational stratification
of the underlying level of generic concepts in terms of concept
versions; single instances are then related to their associated
concept version. The procedure exploits a so-called progress
model—provided by an expert—which describes in qualitative
terms the regularities of foreseeable changes of attributes in a
domain. Versions are then detected by measuring the change in
values of attributes of instances.
With the goal of detecting concept drift and the occurrence
of new concepts in a domain, Fanizzi et al. describe the use of
a conceptual clustering technique based on unsupervised learning [6]. In their approach, a clustering method is used to hierarchically organize groups of similar instances. Concept drift is
detected by finding new individuals that are too far apart from
existing clusters, but that together do not form a new cluster. If
the unclustered instances do form a cluster, a new concept has
occurred.
In the Semantic Web community, the problem addressed
here is related to a broader problem of ontology evolution and
change management, which refers to the “problem of deciding
the modifications to perform upon an ontology in response to a
certain need for change as well as the implementation of these
modifications and the management of their effects in depending data, services, applications, agents or other elements” [7].
The existing research on ontology evolution and versioning often addresses this problem at the macro ontological level, that
is, the effect of certain change operations over the ontology elements, including concepts, relations and instances, as well as
the interoperability issue between different variants (versions)
over time. For example, [13] formally defines ontology perspectives, which describe the relation between versions of an
ontology and its extension.
For analyzing interoperability between different versions of
an ontology, change detection for individual concepts is a relevant issue [34]. Different techniques for detecting such changes
have been developed. For example, Plessers and De Troyer introduce a change detection mechanism [37, 38] that is able to
find meaningful changes in sets of changes, based on the formal
definition of these changes. This approach, however, assumes a
fixed identity of concepts. As such, it is of limited use in cases
where this assumption does not hold. A syntactic change detection approach for ontologies is presented in [23]. In this approach, the semantic implication of the change has to be made
explicit by a human user. PromptDiff [35] is an approach and
tool for comparing ontology versions that also detects changes
in individual concepts. In [22], an interpretation of the effect
of such changes on different types of ontological compatibility is given. This approach does not assume fixed identifiers,
but tries to match different ontological elements using a set of
heuristics focussing on the structural aspects of an ontology. It
does not take the extension of concepts into account, nor does it
look at the intension of a change. In contrast, in [16], the intensional change of concepts in different versions of ontologies
is studied. This approach again assumes a fixed identity.
In the context of integrating ontologies, there has been a lot of
work on mechanisms and measures for calculating concept similarity. An overview can be found in [20].
Such measures are related to our research, as our framework
also makes use of different similarity measures. However, the
aim of our work is not to introduce another similarity measure,
but a framework that allows different aspects of concept similarity to be combined to describe concept change.
A whole body of research in ontology change management
investigates methodologies and techniques for change management in collaborative working environments [28, 12, 21, 36,
33, 42, 43, 44]. In these papers, methods for the coordination of
the changes made by different authors are developed and implemented, such as locking, change detection and approval mechanisms. Although very relevant for ontology development, these
approaches focus on a specific use case.
There has not been much study of the specific meaning of
change for specific concepts. In [10] a series of “concept signatures” extracted from the textual definitions of the same concept
at different times is used to detect drift. This definition-based
method is applicable if there are rich definitions of concepts and
the definitions are constantly modified. In the OntoClean framework [9] the meta-properties identity and rigidity are defined
with respect to the stability of a concept.
In this article, we do not focus on a specific application of
concept change, but we present a generic framework to describe and analyse concept drift within two common ontological paradigms. This framework helps in understanding and analysing specific strategies for detecting and coping with concept
change in knowledge structures.
3. Two theories of concept drift
Arguably, the meaning of concepts changes over time. In this
section we define precisely what this actually means, as there
are different philosophical views on the matter. The two core
alternatives can briefly be summarised upfront:
• Although possibly changing its meaning, a concept can exist over periods of time. Different variants of the same
concept can differ in meaning.
• Concepts only exist at specific moments in time, and
evolve gradually into some other concepts (possibly with
almost the same meaning) at the next moment in time.
We will build a framework that works on both views, and provide qualitative notions of concept drift for both views.
5
cyc: Political Entity
art1
art2
art3
skos:Concept
rdf:type
rdfs:subClassOf
of the domain, and a set of properties (unary predicates). Both
universe and properties depend on the application and the formalism used.
Definition 1. The meaning C t of a concept C at some moment
in time t is a triple (labelt (C), intt (C), extt (C)), where labelt (C)
is a String, intt (C) a set of properties (the intension of C), and
extt (C) a subset of the universe (the extension of C).
rdf:type
dc:subject
cyc: European Union
dbpedia:Country
rdf:type
skos:broader
yago:economy
rdfs:subClassOf
rdf:type
Netherlands
France
rdf:type
dbpedia:EUCountry
Germany
We will refer to intension, extension and label of a concept
as the aspects of the concept.
cyc:Central
European
Country
3.1.1. A closer look at extension and intension
The example ontology introduced in Fig. 2 shows that Definition 1 is far from unambiguous. Depending on the semantics
underlying a particular ontological representation language, the
extension, intension and even labels can have different definitions.
OWL and RDFS are the most simple cases as the model theory of DL semantics provides immediate mapping to the definition of extension as subset of the domain. For the concept EUcountry in Fig. 2, the extension is defined as the resources in the
subject position of an rdf:type triple. For abstract concepts, this
view on extension is hardly applicable. In knowledge organisation schemes such as SKOS with less rigid formal semantics,
the notion of extension is far less clear cut. In many applications there is a clearly defined domain of discourse which we
can take as the semantic universe for defining the extension of a
concept. In Definition 1, the meaning of concepts has been defined in a rather non-committal way in order to accommodate
for these weaker semantic structures as well. For example, European Union is a skos:concept which is used to annotate a number
of articles. We argue that, in certain applications, it is acceptable to use the articles as the extension of the concept European
Union. In our political study in Section 5.1 we will define the
set of sentences annotated with a particular concept as its extension.
For the aspect of intension, there is no one-and-for-all interpretation either. The definition is deliberately left vague to
allow broad coverage of our theory of concept drift. In a FirstOrder Logic sense, the properties of a concept C correspond to
the binary predicates that hold for C. Which predicates to consider then depends on the formalisms used for representing the
ontology. If we interpret the ontology in Fig. 2 in RDF only,
the intension of the concept dbpedia:EUCountry could be defined
as just the set of triples it occurs in (except those with rdf:type as
the predicate), i.e. ,
Figure 2: An example ontology representing the EU in 2003
To explain and motivate our ideas we will use the following simple information, which is freely inspired by the data available through DBpedia.5 Let us consider the following ontology (depicted in Fig. 2), which contains some information valid in 2003 about, among others, the resources dbpedia:EUCountry, cyc:Political Entity, and cyc:European Union [29], taken from several different sources and related using different relations. Concept dbpedia:EUCountry is defined as countries, which are political entities. The EU was at the time a political entity which also was an economy. There are some example countries as instances of the concept dbpedia:EUCountry, and some articles art1, art2 and art3 about the European Union. Fig. 3, which will be introduced later in the paper, describes the European Union in 2010. This example, and some extensions provided later, are used in the further discussions to explain our general ideas.
3.1. The meaning of concepts
Let us first commit to some basic definitions regarding the meaning of concepts, which are common to both views. The intension of a concept is the set of properties implied by it; the extension is the set of things it extends to. We also consider the labelling as a part of the meaning of a concept, because the way people reference a concept is also crucial in studying concept drift. Labels do not refer to a unique identifier but to a natural language description used to convey the meaning of a concept from one human to another. We believe this proposal to be consistent with the most common philosophical approaches. We apply and formalise the standard distinction between intension and extension which goes back to [8]. We include, somewhat more unconventionally, the labelling in the meaning of a concept (in the tradition of the signifier [39]).
Our definition of the meaning of a concept, and its drift is
intended to be generic enough to be applied in different ontological frameworks. In this paper, we apply our ideas on a
SKOS vocabulary, a few RDFS, OWL and OBO ontologies.
We start out from a set of objects referred to as the universe
(dbpedia:EUCountry rdfs:subClassOf dbpedia:Country)
(cyc:European Union skos:broader dbpedia:EUCountry)
If interpreted in RDFS, the semantic closure could be taken
into account, i.e. , the triple (dbpedia:EUCountry rdfs:subClassOf
cyc:PoliticalEntity) is probably relevant as well and could be part
of the intension. For OWL and Description Logics formalisations, the intension could also contain implicitly true predicates that can be expressed. In our example, the concept dbpedia:EUCountry is also a subclass of the class of all the things
5 http://wiki.dbpedia.org/
In the simplest cases, the similarity between two versions of an aspect is calculated by applying a standard notion of similarity directly to the aspects. For labels, being defined as Strings, there
are standard distance measures, such as Levenshtein’s [26] edit
distance. For sets of predicates variants of Jaccard similarity
[19] are the obvious candidates. The situation can, however,
be more intricate depending on the ontological commitment
w.r.t. extension and intension. For example, if the extension
of a concept is defined as the set of articles annotated with it
the overlap between extensions at different moments in time
is usually empty. In our case study on political journalism we
will later first define similarity between the instances of the extension and aggregate over this similarity in order to determine
the overall similarity between the extensions. As stated before,
similarity needs to be defined explicitly per application. Again,
we believe that a framework for studying concept drift should
be generic enough to cater for particular definitions of similarity.
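As a minimal sketch of the set-based case, the following computes Jaccard similarity between two versions of an intension represented as sets of triples. The first set uses the two triples given for dbpedia:EUCountry under the RDF reading; the second version, with one additional triple, is a hypothetical later variant introduced only for illustration, and the function name `jaccard` is our own.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Intension of dbpedia:EUCountry as the set of triples it occurs in (RDF reading).
int_v1 = {
    ("dbpedia:EUCountry", "rdfs:subClassOf", "dbpedia:Country"),
    ("cyc:European Union", "skos:broader", "dbpedia:EUCountry"),
}

# Hypothetical later variant: the same triples plus one more (illustrative only).
int_v2 = int_v1 | {
    ("dbpedia:EUCountry", "rdfs:subClassOf", "cyc:PoliticalEntity"),
}

# Two shared triples out of three distinct triples overall.
sim_int = jaccard(int_v1, int_v2)
```

Treating two empty sets as identical is a design choice of this sketch, not something the framework prescribes.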
which have at least one instance which is a Central European
Country (or ∃ rdf:type . cyc:Central European Country in a Description Logic-like notation).
The above examples show that there is no clear-cut definition
of intension and extension, and even the definition of the label
of a concept can depend on the application: often there are alternative labels and we could even take glosses or other descriptions into account. In our view, a framework on the change of
meaning of a concept over time should be applicable to all those
types of ontological commitments as long as the three aspects
of the meaning are unambiguously formalised.
3.1.2. A closer look at time
It was stated earlier in the paper that we do not intend to provide a formal study of time and logical formalisations of change
over time. Instead, we commit to the simplest temporal model, where time is defined as linear, discrete and complete. This allows us to refer to particular points in time by subscripts i, j such that $t_i \leq t_j$ whenever $i \leq j$. We will use ≤ to denote both the ordering on subscripts and the ordering in time, which we expect not to cause any confusion. A further and deeper study of different
temporal models is out of the scope of this paper.
3.2. Meaning drift based on concept identity over time
All aspects of the meaning of a concept can change. We
claim that the meaning of a concept drifts if there are two variants that differ at different points in time. If a concept at different times has the same meaning, there is no concept drift. This
definition of drift is based on the idea that a concept retains its
identity over time, i.e. , remains the same at least temporarily.
In order to define drift, we need a workable definition of a concept and variants of such concepts.
Given the philosophical assumption that concepts can exist
at different moments in time, concepts can naturally also have
different meanings at different times.
3.1.3. Similarity
Investigating changes in the meaning of a concept means studying changes in intension, extension and label. Given the above discussion, it should not come as a surprise that the differences between the aspects over time do not come in a single form either. The only notion which is required is a notion of similarity between the aspects of two concepts or two variants of the same concept at two moments in time. There will be intensional similarity, $\mathrm{sim}_{int}$, which is calculated between sets of predicates, extensional similarity, $\mathrm{sim}_{ext}$, between sets of objects, and label similarity, $\mathrm{sim}_{label}$, between strings. Each similarity is a function with the range [0, 1], where a similarity value of 1 indicates equality. We will show different examples of how to define the similarity between two aspects in the discussion of our case-studies, and will only briefly discuss some generic problems here.
For given notions of similarity between the meaning of two
concepts according to a particular aspect we can extend this to
similarity of a series of meanings of concepts over time:
$$\mathrm{sim}_{asp}(C_1^{t_1}, C_2^{t_2}, \ldots, C_n^{t_n}) = \frac{\sum_{i=1}^{n-1} \mathrm{sim}_{asp}(C_i^{t_i}, C_{i+1}^{t_{i+1}})}{n-1}$$
where asp is an aspect, asp ∈ {int, ext, label}. The maximal similarity between the meaning of two concepts for a series of concepts over time is defined as:
$$\mathrm{max\,sim}_{asp}(C_1^{t_1}, C_n^{t_n}) = \frac{\sum_{i=1}^{n-1} \mathrm{sim}_{asp}(C_i^{t_i}, C_{i+1}^{t_{i+1}})}{n-1} \qquad (1)$$
where $\mathrm{sim}_{asp}(C_i^{t_i}, C_{i+1}^{t_{i+1}}) \geq \mathrm{sim}_{asp}(C_i^{t_i}, D_{i+1}^{t_{i+1}})$ for the meaning of any other concept $D_{i+1}$ at time $t_{i+1}$. We will call the sequence of maximally similar concepts $C_{i+1}$ the max-sim-chain between $C_1$ and $C_n$ between $t_1$ and $t_n$ for aspect asp, referred to by $\mathrm{maxsim}_{asp}(C_1, C_n, t_1, t_n)$.
Definition 2. The meaning $C^t$ of a concept C at time t is a variant of C.
This definition assumes the stability of a concept over time even though different variants might semantically differ; in other words, the concept itself retains its identity over time. Identity allows us to compare two variants of the same concept at different moments in time even if the meaning (label, extension and possibly parts of its intension) has changed. We will assume that there is always only one variant of a concept at one moment in time, i.e., only one concept at a moment can be identical to a concept at another moment.
Definition 3. The ordered set of all variants of a concept C between two moments $t_1$ and $t_n$ is called a chain, abbreviated as
$$\mathrm{chain}(C, t_1, t_n) = C^{t_1} \to C^{t_2} \to \ldots \to C^{t_n}$$
for all time-points between $t_1$ and $t_n$.
Given our “meaning of a concept” and the notion of identity, we can define concept drift per aspect as the basic notion of change in the meaning of a concept.
Definition 4. A concept C has extensionally drifted (according to identity semantics) between time $t_i$ and $t_j$ ($t_i \leq t_j$) if and only if $\mathrm{sim}_{ext}(C^{t_i}, C^{t_j}) \neq 1$. Intensional and label drift are defined similarly. The meaning of a concept has drifted if one of the aspects has drifted.
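The averaged similarity over a chain of variants and the drift test of Definition 4 can be sketched as follows. This is only an illustration under assumptions: extensions are plain Python sets, similarity is Jaccard, and the middle variant of the example chain is hypothetical (the running example only gives the 2003 and 2010 extensions). The function names are ours.

```python
def jaccard(a, b):
    """Jaccard similarity; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def chain_similarity(variants, sim):
    """Average similarity between neighbouring variants of one concept."""
    if len(variants) < 2:
        raise ValueError("a chain needs at least two variants")
    pairs = zip(variants, variants[1:])
    return sum(sim(a, b) for a, b in pairs) / (len(variants) - 1)


def has_drifted(variant_i, variant_j, sim):
    """Definition 4: an aspect has drifted iff the similarity is not 1."""
    return sim(variant_i, variant_j) != 1.0


# Extension variants of dbpedia:EUCountry; the middle one is hypothetical.
ext_chain = [
    {"Germany", "Netherlands", "France"},
    {"Germany", "Netherlands", "France"},
    {"Germany", "Netherlands", "France", "Romania", "Poland"},
]

avg_sim = chain_similarity(ext_chain, jaccard)  # (1.0 + 0.6) / 2
drift_first_last = has_drifted(ext_chain[0], ext_chain[2], jaccard)
drift_first_second = has_drifted(ext_chain[0], ext_chain[1], jaccard)
```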
Before we go on to discuss an alternative way of defining concept drift based on concept morphing, let us discuss the concept of identity in more detail.
3.2.1. A closer look at identity
Our definition of a concept, its meaning and the subsequent notion of identity is rather circular and non-committal. However, an exhaustive discussion of the nature of what a concept is, and whether it remains itself over time, is not at the core of this paper. What matters for this model is the assumption of identity and what we can do whenever we can identify identical concepts over time.
First, let us give one possible explanation of what identity of a concept could mean. This suggestion is based on the idea of rigidity of the intension of a concept.6 Formally, we assume that the intension of a concept C is the disjoint union of a rigid and a non-rigid set of properties (i.e., $int_r(C) \cup int_{nr}(C)$). This separation between rigid and non-rigid properties does not need to be explicitly specified, but rigidity is one way to define the identity of a concept over time. Intuitively, this amounts to the assumption that a concept is uniquely identified through some core properties that do not change over time. Formally, this is
Figure 3: An example ontology representing the EU in 2010
A cheaper way of determining identity is based on expert
knowledge about the process and structure of the ontology in
question. It is often the case that labels were explicitly kept
stable over time to enforce a priori identity of concepts. Close
knowledge of such processes can help to decide the determining
factor in identifying identity.
Finally, there are attempts to detect identity automatically, for
example, by assuming that most similar concepts are identical,
or by conjecturing that concepts in a similarity chain between
two variants are identical. The advantage is that those methods
are simple and cheap to apply. The disadvantage, however, is
that identity and drift are based on the same information.
In fact, this observation can be taken to the extreme of defining concept drift without the underlying notion of identity in
the first place. We call this approach concept morphing.
Conjecture 1. Two concepts $C_1$ and $C_2$ are considered identical if their rigid intensions are equivalent, i.e., $int_r(C_1) = int_r(C_2)$.
This assumption implies that if the rigid core of a concept
changes, the new concept will be a different concept. An example is demagogue which used to denote the concept Political
Leader while the same label now refers to a different concept
Populist Demagogue.
How to determine identity? Although we provide a possible explanation and motivation for the notion of identity based on the rigid part of the intension, this definition is by no means practical, as the rigid core is unknown in almost all applications. Since all three aspects of the meaning of a concept can drift over time, identifying identity is crucial and non-trivial.
In order to render practical algorithms, we need to study
ways of determining identity of concepts. There are several
methods for doing so: using oracles, domain knowledge or automatic techniques. Although identity resolution between concepts over time is not at the core of this paper, we need to
discuss some of these options to make credible our claim that
identity-based analysis of meaning drift is practically applicable.
The simplest way of determining identity of concepts over time is to ask an oracle. Such an oracle usually comes in the form of a human domain expert who can determine whether two variants have the same rigid properties based on their domain knowledge. Unfortunately, such oracles are difficult to find and usually very expensive.
3.3. Meaning drift based on concept morphing
In this alternative conceptualisation of concept drift, there is
no underlying notion of identity we can rely on. This means
that two concepts at two different moments in time are never
identical. Instead, a concept C1 morphs over time into a new
concept C2 .
Let us first define morphing of the three aspects of the semantics and of the meaning of a concept in general.
Definition 5. A concept $C_i$ extensionally morphs into a concept $C_j$ from t to t + 1 if and only if $ext_{t+1}(C_j)$ is maximally similar to $ext_t(C_i)$. Formally, we write:
$$\mathrm{morph\_ext}^{t,t+1}(C_i, C_j) \text{ iff } \arg\max_k \mathrm{sim}_{ext}(C_i^t, C_k^{t+1}) = j$$
and similarly for intensional and label morphing.
A concept $C_i$ morphs into a concept $C_j$ from t to t + 1 if and only if all semantic aspects, i.e., extension, intension and label of $C_i$, morph into $C_j$.
6 Rigidity is discussed in Ontoclean [9] in a slightly different way. There
rigidity is a meta-property of a concept that modellers should make explicit. We
use it in a stronger way as an intrinsic and the only stable part of the meaning
of a concept which otherwise can drift in various ways.
Let us consider the EU in 2010 and in 2003 in our example ontology from Fig. 2 and 3. Now, the concepts with label EUCountry are different concepts. There is label morphing for the concept dbpedia:EUCountry in 2003 to the concept with the same name in 2010, as the label stays the same. However, given that the extension of concept dbpedia:EUCountry in 2003 is more similar to the extension of concept cyc:Central European Country, the former morphs extensionally to the latter.
Note that an aspect of a concept C can morph into several
concepts if the maximal similarity between C and those concepts is the same w.r.t. this aspect.
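A sketch of extensional morphing, including the case of several equally similar targets, is given below. It assumes Jaccard similarity on extensions; the extensions assigned to the two cyc: concepts in 2010 are our reading of the running example (the text only states that the 2003 extension overlaps completely with cyc:Central European Country), and `morph_targets` is a hypothetical helper name.

```python
def jaccard(a, b):
    """Jaccard similarity; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def morph_targets(source, candidates, sim):
    """Return all candidate names at t+1 that are maximally similar to source,
    together with that maximal similarity (several names mean a tie)."""
    scores = {name: sim(source, value) for name, value in candidates.items()}
    best = max(scores.values())
    names = sorted(name for name, score in scores.items() if score == best)
    return names, best


ext_2003 = {"Germany", "Netherlands", "France"}

# Assumed 2010 extensions, for illustration only.
ext_2010 = {
    "dbpedia:EUCountry": {"Germany", "Netherlands", "France", "Romania", "Poland"},
    "cyc:Central European Country": {"Germany", "Netherlands", "France"},
    "cyc:Eastern European Country": {"Romania", "Poland"},
}

targets, strength = morph_targets(ext_2003, ext_2010, jaccard)
```

Returning a list rather than a single name keeps ties visible, which is what the synonymy case relies on.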
Definition 6. A concept C has extensionally drifted (according to morphing semantics) between t and t + 1, if and only if $\mathrm{sim}_{ext}(C, C_1) \neq 1$ for $\mathrm{morph\_ext}^{t,t+1}(C, C_1)$. Intensional and label drift are defined similarly. The meaning of a concept has drifted if one of the aspects has drifted.
Figure 4: Concept shift (identity-based)
Another way of analysing concept drift over time is to compare the (average) stability of concepts. As a relative measure, it does not require any a priori commitment (such as a threshold).
In the previous section, we argued that the dbpedia:EUCountry
of 2010, although having changed in some aspects, was still the
same concept as the one in 2003, and even as the one in 1979.
An alternative is to consider the label EUCountry to refer to
different concepts at different times. In this morphing view,
the concept dbpedia:EUCountry in 2003 morphs into the concept
dbpedia:EUCountry in 2010 on the label aspect, although neither
extension nor intension stayed the same.
In the case of morphing, we call the max-sim-chain between $C_1$ and $C_n$ between $t_1$ and $t_n$ for aspect asp the asp-morphing chain. The similarity between the concepts in such a maximal morphing chain will later be used to indicate the strength of a morphing chain.
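Definition 7. A concept C between time $t_1$ and $t_n$ is more stable than a concept $C_1$ on aspect asp if, and only if,
$$\mathrm{avg}_{1 \leq i,j \leq n,\, i \neq j}\big(\mathrm{sim}_{asp}(C^{t_i}, C^{t_j})\big) > \mathrm{avg}_{1 \leq i,j \leq n,\, i \neq j}\big(\mathrm{sim}_{asp}(C_1^{t_i}, C_1^{t_j})\big).$$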
Relative stability is the most interesting use of stability: to
order the concepts by stability values. In our experience, the set
of the most stable and the most unstable concepts are usually
of interest to domain experts. In our example ontology about the European Union from Fig. 2 and 3, the extension of the concept dbpedia:EUCountry grows from 3 to 5 countries, while that of concept cyc:Political Entity consists of just one (explicit) instance, cyc:European Union, in both variants. The Jaccard similarity between the extensions of dbpedia:EUCountry in 2003 and 2010 is thus 3/5 = 0.6, and the one for cyc:Political Entity is 1, which makes the latter the more stable concept.
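The comparison above can be reproduced with a short sketch of relative stability. It assumes a symmetric similarity function, so that averaging over unordered pairs of variants gives the same value as averaging over all ordered pairs with i ≠ j; the function names and the dictionary layout are ours.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stability(variants, sim):
    """Average similarity over all pairs of distinct variants of one concept."""
    pairs = list(combinations(variants, 2))
    if not pairs:
        raise ValueError("stability needs at least two variants")
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


# Extensions in 2003 and 2010, taken from the running example.
extensions = {
    "dbpedia:EUCountry": [
        {"Germany", "Netherlands", "France"},
        {"Germany", "Netherlands", "France", "Romania", "Poland"},
    ],
    "cyc:Political Entity": [
        {"cyc:European Union"},
        {"cyc:European Union"},
    ],
}

scores = {name: stability(vs, jaccard) for name, vs in extensions.items()}

# Concepts ordered from most stable to least stable.
ranking = sorted(scores, key=scores.get, reverse=True)
```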
The second use of stability is to track the change of similarity
between variants of a concept. The most unstable moments are
those when the similarity is the lowest.
4. A qualitative toolkit for analysing concept drift
In the previous section, we have formally defined concept
drift as the change of aspects of the meaning of a concept over
time. Concept drift happens regularly. Even if it can be measured, it is often difficult to grasp its impact, which probably
results in a far too fine-grained analysis. In this section, we
will define some more practical notions that describe changes
in meaning: our toolkit for analysing concept drift.
This toolkit takes both possible notions of drift into account, and we will define identity-based observations and morphing-based ones. Practically, it does not matter whether one of our tools is identity or morphing based, but, conceptually, we will introduce them separately.
Definition 8. In a chain $\mathrm{chain}(C, t_1, t_n)$ the concept C is more stable at time $t_i$ than at $t_j$ on aspect asp if
$$\mathrm{sim}_{asp}(C^{t_i}, C^{t_{i+1}}) > \mathrm{sim}_{asp}(C^{t_j}, C^{t_{j+1}}).$$
We can use this ordering in our use-cases to identify the most unstable and the most stable moments in a chain.
4.1. Identity based observations
Based on identity-based concept drift we define two notions:
stability and shift.
4.1.2. Concept shift
A special case of instability is when a concept becomes so
unstable that part of its meaning is more representative for a
different concept rather than for itself. We call this concept
shift.
4.1.1. Concept (in)stability
The more the meaning of a concept drifts, the more unstable
it becomes. Although there is no indication of when an unstable concept becomes critically unstable, stability can still be an
interesting notion to study.
A first alternative is to define a threshold based on experience. Every similarity between two variants below this threshold signifies some interesting semantic change. A typical example is label similarity which is often defined using edit distance.
A high edit-distance signals label instability.
Definition 9. The meaning of a concept C extensionally shifts between two of its variants $C^{t_i}$ and $C^{t_j}$ if the extension of $C^{t_j}$ is more similar to the extension of a non-identical concept than to the extension of $C^{t_i}$. Intensional and label shift are defined similarly.
Concept shift can have drastic consequences on the use of a concept in an application, because some other concept has in fact taken over its meaning. Concept shift is explained in Fig. 4, where ovals denote the meaning of a concept, with each circle a particular aspect. Arrows stand for the most similar aspects of the new variant. From the moment $t_1$ to $t_2$, there is a shift in terms of the middle aspect, because, in this aspect, the concept $C_2$ at $t_2$ is more similar to $C_1$ at $t_1$ than $C_1$ at $t_2$ is, while the other two aspects do not shift.
Consider the extension of concept dbpedia:EUCountry in
2003 with the three countries Germany, Netherlands and France.
In 2010 this set of countries is more similar to the concept cyc:Central European Country than to the original dbpedia:EUCountry, which means that the extensional meaning of the
concept has shifted.
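A sketch of an identity-based shift check, following the reading used in this example: the old variant of a concept is compared with its own new variant and with every other concept at the later time. Identity is assumed to be given by the shared name, the 2010 extensions of the cyc: concepts are our assumption for illustration, and `shifted_to` is a hypothetical helper.

```python
def jaccard(a, b):
    """Jaccard similarity; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def shifted_to(name, old_value, new_values, sim):
    """Return the other concepts at the later time that are more similar to
    the old variant than the concept's own new variant is."""
    own = sim(old_value, new_values[name])
    return sorted(
        other
        for other, value in new_values.items()
        if other != name and sim(old_value, value) > own
    )


ext_2003 = {"Germany", "Netherlands", "France"}

# Assumed 2010 extensions, for illustration only.
ext_2010 = {
    "dbpedia:EUCountry": {"Germany", "Netherlands", "France", "Romania", "Poland"},
    "cyc:Central European Country": {"Germany", "Netherlands", "France"},
    "cyc:Eastern European Country": {"Romania", "Poland"},
}

takers = shifted_to("dbpedia:EUCountry", ext_2003, ext_2010, jaccard)
has_shifted = bool(takers)
```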
Figure 5: 1-way concept split (left) and synonymy split (right) (morphing-based)
a different concept from the one that the other aspects morph
to (on the left-hand side), or an aspect is equally similar to the
same aspect of more than one other concept (on the right-hand
side of Fig. 5).
4.2. Morphing based observations
Morphing-based concept drift comes with two notions: morphing strength and split.
4.2.1. Strength of morphing chains
Without an explicit definition of identity, it does not make sense to define stability of a concept over time. A closely related notion is strength of the morphing chain, which indicates how similar the concepts in this (maximal) morphing chain between two time-points are.
Definition 11. The meaning of a concept C splits at time t if, and only if, either
1. it does not morph into another concept, or
2. one aspect morphs into more than one concept, which happens when the similarity between those concepts is the same according to at least one aspect.
Remember that the notion of morphing of a concept is defined through simultaneous morphing of all aspects to the same concept. This implies that if the concept does not morph, there must be an aspect that morphs into a different concept from the one(s) the other two aspects morph to.
In our ontology in Fig. 2 and 3 the concept dbpedia:EUCountry in 2003 1-way splits, as it label-morphs into concept dbpedia:EUCountry with the same name, but extensionally morphs into concept cyc:Central European Country.
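The two split conditions can be sketched as a check over the morph targets of each aspect. The input format (one set of maximally similar target names per aspect) and the function name are assumptions of this sketch; the intensional target in the example is hypothetical, since the text only states the label and extension targets.

```python
def concept_splits(targets_per_aspect):
    """Return True if the concept splits: either some aspect has several
    equally similar targets, or no single target is shared by all aspects."""
    target_sets = [set(t) for t in targets_per_aspect.values()]
    if not target_sets:
        raise ValueError("at least one aspect is required")
    has_tie = any(len(t) > 1 for t in target_sets)
    common = set.intersection(*target_sets)
    return has_tie or not common


# dbpedia:EUCountry (2003): label and extension targets as in the text,
# intension target assumed for illustration.
eu_country_targets = {
    "label": {"dbpedia:EUCountry"},
    "ext": {"cyc:Central European Country"},
    "int": {"dbpedia:EUCountry"},
}

# A concept whose aspects all morph into the same single target does not split.
stable_targets = {
    "label": {"cyc:European Union"},
    "ext": {"cyc:European Union"},
    "int": {"cyc:European Union"},
}

eu_country_split = concept_splits(eu_country_targets)
stable_split = concept_splits(stable_targets)
```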
Finally, there is one additional possibility for the meaning of a concept to split (depicted in Fig. 6): remember that a morphing-chain $\mathrm{morphchain}_{asp}(C_b, C_e, t_b, t_e)$ is defined as a chain of concepts between $C_b$ (begin) and $C_e$ (end) over time points $t_b$ and $t_e$ where the similarity between the aspect asp of the concepts in a chain for every time-point is maximal. A morphing chain split now occurs when the endpoint of such a chain is less similar to the original concept at the start of the chain than some other concept.
Definition 10. A morphing chain $\mathrm{morphchain}_{asp}(C_b, C_e, t_b, t_e)$ is stronger than another morphing chain $\mathrm{morphchain}_{asp}(C_{b'}, C_{e'}, t_b, t_e)$ on aspect asp if and only if
$$\mathrm{max\,sim}_{asp}(C_b^{t_b}, C_e^{t_e}) > \mathrm{max\,sim}_{asp}(C_{b'}^{t_b}, C_{e'}^{t_e})$$
where $\mathrm{max\,sim}_{asp}(C_b^{t_b}, C_e^{t_e})$ has been defined in Equation 1 in Section 3.
We already pointed out that in our example ontology concept dbpedia:EUCountry extensionally morphs into concept cyc:Central European Country. This similarity equals 1 (complete overlap), which makes this a very strong chain (though of course rather short). Concept cyc:European Union extensionally morphs into the concept of the same name, but there is only an overlap of 1 out of 5 articles, which is 0.2 according to standard Jaccard similarity. This means dbpedia:EUCountry→cyc:Central European Country is a stronger extensional morph chain than cyc:European Union→cyc:European Union.
Relative strength is the most interesting use of strength we
will apply in our case-study: to order the chains by strength. In
our experience, the set of strongest and weakest concepts are
usually of interest to domain experts.
Definition 12. A concept has a morphing-based chain split for aspect asp if the similarity between a concept $C_b$ at the beginning and $C_e$ at the end of the chain $\mathrm{morphchain}_{asp}(C_b, C_e, t_b, t_e)$ is smaller than the similarity between the same aspect of concept $C_b$ and some other concept $C_{e'}$ at $t_e$.
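A chain split can be illustrated with a small sketch that first follows the maximally similar concept step by step and then compares the start of the chain with all concepts at the final time-point. The data are entirely hypothetical, ties are broken alphabetically (an assumption of this sketch), and the function names are ours.

```python
def jaccard(a, b):
    """Jaccard similarity; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def morph_chain(start, snapshots, sim):
    """Follow the most similar concept from one snapshot to the next.
    snapshots is a time-ordered list of dicts: concept name -> aspect value."""
    chain = [start]
    current = snapshots[0][start]
    for snapshot in snapshots[1:]:
        best = max(sorted(snapshot), key=lambda name: sim(current, snapshot[name]))
        chain.append(best)
        current = snapshot[best]
    return chain


def has_chain_split(chain, snapshots, sim):
    """True if some other concept at the final time-point is more similar to
    the start of the chain than the chain's own endpoint is."""
    begin = snapshots[0][chain[0]]
    final = snapshots[-1]
    end_sim = sim(begin, final[chain[-1]])
    return any(
        sim(begin, value) > end_sim
        for name, value in final.items()
        if name != chain[-1]
    )


# Hypothetical extensions at three time-points.
snapshots = [
    {"C1": {"a", "b", "c", "d"}},
    {"C2": {"a", "b", "c", "e"}, "D2": {"a", "b"}},
    {"C3": {"a", "b", "e", "f"}, "D3": {"a", "b", "c", "d", "g"}},
]

chain = morph_chain("C1", snapshots, jaccard)
split = has_chain_split(chain, snapshots, jaccard)
```

In this toy data each step follows the locally best match, yet the endpoint C3 ends up less similar to C1 than D3 is, which is exactly the situation Definition 12 describes.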
4.2.2. Concept split
Concept split is very similar to concept shift discussed in the previous section. It is defined based on the fact that not all aspects of a concept at $t_i$ are most similar to the same concept at $t_{i+1}$. As there is no identity as the reference to talk about shift, we can only say that the meaning of a concept splits into several new concepts. This can happen in two ways (as shown in Fig. 5): either one of the aspects morphs into the aspect of
4.3. Applying the framework
To apply our framework for concept drift in a specific use-case, the following steps are required:
1. to define intension, extension and a labelling function;
2. to define similarity functions over intension, extension and labels.
5.1. Case study 1: Concept shift in political reporting
Communication scientists annotate various media content
with concepts from controlled vocabularies (of increasing expressiveness) in order to study the relationship between the Media and the political processes. The idea is to code the meaning of sentences and articles in a formalised graph representation, using the method of Network analysis of Evaluative Texts
(NET) [46]. Take the following sentence as an example.
Het Openbaar Ministerie (OM) wil de komende vier jaar mensenhandel uitroeien. (The Justice Department (OM) wants to eliminate human trafficking within the next four years.)
Figure 6: Chainsplit (morphing-based)
Given that the mission of the Semantic Web includes giving
meaning to resources on the Web, it could come as a surprise
that defining intensions, extensions and even labelling functions is by no means trivial. The usual model-theoretic notions
of extension and intension, e.g. , for RDF(S) or OWL semantics are slightly misleading here, as they refer to specific models, whereas ontologies usually represent classes of models. In
practice, one needs to define the relevant notions per use-case,
where each such definition is an ontological commitment. In
the following section, we will give such commitments for different case-studies. It should be understood that such a commitment is never uncontroversial.
The sentence is coded as <om, -1, human trafficking> where om
is an actor concept and human trafficking an issue concept, and -1
indicates the Justice Department is negative about human trafficking. Here, the concepts are taken from controlled vocabularies which generally consist of all related political actors and
issues. They have recently been represented using the SKOS
model [18].
In this case study, we focus on five variants of a political vocabulary used during the five most recent Dutch national election campaigns, which took place in 1994, 1998, 2002, 2003
and 2006. The size of the vocabulary increased from 101 concepts in 1994 to 580 concepts in 2006. Newspaper articles on
Dutch politics during these campaign periods were manually
annotated with the concepts from the particular variant of that
year. In our case, the annotated sentences of these articles are
considered as the instantiation of the abstract political concepts
which annotated them. In 1994, we gathered 1502 such annotated sentences while in 2006, there were 14572 annotated
sentences.
Our case study in political reporting was supported by an expert from the Dept. of Communication Science who provided us with pairs of identical concepts over time and who performed a small-scale qualitative evaluation of the produced results. This expert has extensive experience in annotation and encoding of political statements over time, and has recently produced a manual mapping between the vocabularies used to describe the election campaigns for a qualitative comparison of the respective newspaper articles.
4.4. Implementation of our toolkit
This toolkit has been implemented in Python and is available at https://www.few.vu.nl/~swang/data/concept_drift/getConceptDrift.py.
This toolkit currently requires different versions of the ontology to be queryable through independent SPARQL-endpoints.
It first collects all the information of intension, extension and
labels of the concepts, and then calculates the similarity between variants of the same concepts. Based on the calculated
similarity it automatically identifies concept shift and split, and
provides a ranking based on concept stability and morphing
strength.
5. Case-studies
In this section we will discuss four different case-studies as examples of how to use our methods, and to give some evidence of the potential of the ideas introduced in Section 3. The purpose of these case-studies is two-fold: first, we want to indicate how to practically apply our analysis and tool-kit by exemplifying the usage in a number of very different scenarios, and based on a variety of different knowledge organisation principles. Secondly, we believe that we gather evidence of the potential usefulness of the results in a number of different domains. Although the presented results do not amount to proofs, nor are empirically representative, our experience with legal and medical experts and communication scientists shows promisingly that our approach can produce non-trivial and interesting results.
Political concepts and their meaning. We now formally define our problem: for each election campaign t ∈ {1994, 1998, 2002, 2003, 2006} we have a set $\Delta_t$ of sentences annotated by concepts from a SKOS vocabulary $V_t$.
The label of a concept is obtained using the SKOS labelling property skos:prefLabel. The extension $ext_s(C_t)$ of a concept $C_t \in V_t$ at time t is the set of all sentences annotated by $C_t$, i.e.,
$$ext_s(C_t) = \{ s \in \Delta_t \mid s \text{ annotatedBy } C_t \}.$$
It is most difficult to formally define the intension of a concept, as there is no explicit intensional definition of the concepts available. We construct an explicit intension based on
co-occurrence of concepts in annotations. For each concept C, we calculate the top K concepts topKuse(C) which co-occur the most with C in the sentences they code at one moment in time. The properties we use to define the intension are based on “topicality”, i.e., a property $P_C(D)$ is true if, and only if, D is in the topKuse relation with C. The intension of C is then the set of properties $int_s(C) = \{P_C(D) \mid P_C(D) = \mathrm{true}\}$; in other words, the intension is in fact determined by all associated concepts. In our following experiments, we took K = 3. Note that this is an empirical choice which gave sensible results.
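The co-occurrence-based intension can be sketched as follows. This is an illustration under assumptions, not the toolkit's code: annotations are represented as one set of concepts per sentence, the toy data are invented, and ties in co-occurrence counts are resolved by first encounter (the behaviour of `Counter.most_common`).

```python
from collections import Counter
from itertools import permutations


def top_k_use(annotations, k=3):
    """Map each concept to the set of the k concepts it co-occurs with most
    often. annotations is a list with one set of concepts per sentence."""
    cooccurrence = {}
    for concepts in annotations:
        for concept, other in permutations(sorted(concepts), 2):
            cooccurrence.setdefault(concept, Counter())[other] += 1
    return {
        concept: {other for other, _ in counts.most_common(k)}
        for concept, counts in cooccurrence.items()
    }


# Invented annotations for one campaign year.
annotations_one_year = [
    {"employers", "unions", "employees"},
    {"employers", "unions", "employees"},
    {"employers", "unions", "work"},
]

# k = 2 only because the toy data are small; the case study uses K = 3.
intension = top_k_use(annotations_one_year, k=2)
```

Concepts that never co-occur with another concept do not appear in the result, so a caller would need to treat a missing key as an empty intension.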
Figure 8: Intension of concept Democracy in 3 years (2002, 2003, 2006), with average drift of ($\mathrm{sim}_{int} = 0.02$)
Similarity of intension, extensions and labels. Similarity of labels can be determined through standard Levenshtein edit distance. In our case we define similarity as 1 minus the hyperbolic
tangent of the original edit distance.
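The label similarity described here can be sketched with a standard dynamic-programming edit distance. The implementation below is a generic textbook version written for this illustration (unit costs for insertion, deletion and substitution are an assumption), and the function names are ours.

```python
import math


def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def label_similarity(a, b):
    """1 minus the hyperbolic tangent of the edit distance, in [0, 1]."""
    return 1.0 - math.tanh(levenshtein(a, b))


same = label_similarity("Asielzoekers", "Asielzoekers")
different = label_similarity("corruptie", "fraude en corruptie")
```

Because tanh(1) is already about 0.76, a single edit lowers the similarity to roughly 0.24, so this measure mainly separates identical labels from changed ones.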
Extensional similarity is usually determined by calculating
the overlap of the extensions. In our case, this is not possible,
as the set of sentences in different years are disjoint. In order
to use disjoint extensions to measure the similarity of concepts
from different years, we applied the mapping tool which was
developed in [40]. For each sentence in Year1, the mapping
tool first looks for the most similar sentence in Year2. Then this
sentence of Year1 is considered to be coded by the concept(s)
with which its most similar sentence of Year2 is coded. In this
way, two disjoint extensions become dually annotated, and we
then measure the similarity between two concepts using their
extensions.7
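The dual-annotation step can be illustrated with the sketch below. The actual mapping tool of [40] is not described here, so the sentence similarity (token-level Jaccard), the data layout and all function names are assumptions made only to show the idea of projecting the codes of one year onto the sentences of another.

```python
def token_jaccard(s1, s2):
    """Stand-in sentence similarity: Jaccard overlap of lower-cased tokens."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


def project_annotations(year1, year2, sentence_sim=token_jaccard):
    """For each Year1 sentence, take over the concepts of its most similar
    Year2 sentence. Both inputs map sentence -> set of concepts."""
    projected = {}
    for s1 in year1:
        nearest = max(sorted(year2), key=lambda s2: sentence_sim(s1, s2))
        projected[s1] = year2[nearest]
    return projected


def cross_year_similarity(c1, c2, year1, projected):
    """Jaccard similarity of the Year1 sentences coded with c1 and the Year1
    sentences that received the Year2 concept c2 through projection."""
    ext1 = {s for s, concepts in year1.items() if c1 in concepts}
    ext2 = {s for s, concepts in projected.items() if c2 in concepts}
    union = ext1 | ext2
    return len(ext1 & ext2) / len(union) if union else 1.0


# Invented toy data for two campaign years.
year1 = {
    "the unions reject the pay offer": {"unions"},
    "the minister defends asylum policy": {"asylum"},
}
year2 = {
    "unions reject new pay deal": {"trade unions"},
    "asylum policy under fire": {"asylum seekers"},
}

projected = project_annotations(year1, year2)
sim_related = cross_year_similarity("unions", "trade unions", year1, projected)
sim_unrelated = cross_year_similarity("unions", "asylum seekers", year1, projected)
```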
Since the intension of a concept is determined by its associated concepts, the intensional similarity between two concepts
is therefore determined by the set similarity between the sets of
concepts with which they are associated. We use the Jaccard
similarity for this purpose.
Once the similarity is calculated, we can study concept drift using each of the two theories introduced above.
Figure 9: Intension of concept Employers in 3 years (2002, 2003, 2006), with average drift of ($\mathrm{sim}_{int} = 0.15$)
1998 beursfraude (stock fraud) →
2002 belangenverstrengeling (conflict of interest) →
2003 corruptie (corruption) →
2006 fraude en corruptie (fraud and corruption)
In Fig. 7 (b) and (c), we find that most concepts have rather unstable intension and extension over the years ($\mathrm{sim}_{int}(C)$ and $\mathrm{sim}_{ext}(C)$ are low). On the one hand, this suggests that the label of concepts is more stable than their intension, which means the concept label could be used as a pseudo-identity in practice if an oracle is not available. On the other hand, we use an approximation of the real extension; therefore, the reliability of the calculated extensional similarity is not fully guaranteed. The instability of intension is far beyond our expectation, which suggests that we should look for another way of formalising concept intension, as the reliability of the results provided by the current formalisation is doubtful.
Fig. 8 and Fig. 9 respectively give an example of an intensionally very unstable and a very stable concept. Here, the red links are the identical concepts provided by domain experts, the black links are the association links (i.e., the concepts which co-occur the most in annotated sentences) and the numbers next to the links are the strengths of the association. As the $\mathrm{sim}_{int}$ value indicates, Concept Employers is rather stable over the years, while Concept Democracy seems to shift its meaning in different years. These observations are also consistent with the political reality. Although in this particular case this is related to the definition of the intension, it is still interesting to observe from these two figures that, when one concept is stable, its closely associated concepts tend to be stable too, while the concepts closely associated with an unstable concept also tend to be unstable.
5.1.1. Study concept drift based on identity
The first step of studying concept drift based on identity is
to determine concept identity. The identity problem is solved, in
this case, by the manual concept mapping provided by a communication science expert. This enables us to investigate
concept shift and stability in terms of label, intension and
extension.
Measuring concept stability. Since our domain expert has already indicated the identical variants of the same concepts over
the years, we can easily rank the concepts by
their stability. Here, the stability is calculated as the average
similarity between the variants of neighbouring time-points.
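The stability measure just described can be sketched as follows. This is an illustrative reading of "average similarity between the variants of neighbouring time-points", not the authors' code: the names `stability` and `exact_match` are hypothetical, and exact string equality is only an assumed stand-in for the label similarity.

```python
def stability(variants, sim):
    """Average similarity between variants at neighbouring time points.

    variants: representations of one concept, ordered by time.
    sim: function returning a similarity in [0, 1] for two representations.
    """
    pairs = list(zip(variants, variants[1:]))
    if not pairs:
        raise ValueError("need at least two time points")
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


def exact_match(a, b):
    # ASSUMPTION: label similarity is 1.0 for identical strings, else 0.0.
    return 1.0 if a == b else 0.0


# A label that never changes (one entry per time point) has stability 1.0.
label_stability = stability(["Asielzoekers"] * 5, exact_match)
```

Ranking concepts by stability then amounts to sorting them by this value, once per aspect (label, intension, extension).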
Fig. 7 shows the histogram of the stability values in terms
of the three aspects. We can see that many concepts have very
stable labels (sim_label(C) = 1) over time. For example, the
concept Asielzoekers (asylum seekers) has exactly the same label across all the years. However, some concepts do have very
unstable labels. For instance, according to the domain expert,
the following 6 concepts are intensionally identical:
Identifying concept shift. Even with the defects of our commitments, we can still identify interesting concept shift.
Label Shift As shown in Fig 10 (a),8 1994 Military has the
same label as 1998 Military. However, according to our domain
expert, 1998 militairen is actually identical to 2006 Dutch military
1994 sjo creawetsto → 1998 wcorruptie (corruption) →
7 In [48] we have shown that this method produces reasonable results.
8 For convenience, we translate all the Dutch labels into English.
Figure 7: Stability value distribution, with panels (a) Label Stability, (b) Intensional Stability and (c) Extensional Stability. The X-axis gives the average corresponding similarity over the years between variants of the same concept; the Y-axis the
number of concepts with this average.
Figure 10: Example of label shift and extension shift, with panels (a) Label shift and (b) Extensional shift. The red links indicate the two concepts are identical according to our domain experts, while the
blue links are the most similar concepts in terms of the corresponding aspect.
Figure 11: Example of morphing chains, where the red links are consistent with
concept identity provided by the domain expert.
deployment,
while 1994 Military is identical to 2006 Military. These
two concepts with the same label “Military” are actually different concepts. Therefore, Concept 1994 Military has a shift in
label as, according to Definition 9, its label shifts to another
concept.
Extension Shift Fig 10 (b) suggests an extensional shift.
According to the domain expert, 2003 Childcare is the same as
2006 Childcare. However, 2006 Free childcare is more similar to
2003 Childcare in terms of their extension. In this case, we say
2003 Childcare has shifted its meaning extensionally towards a
more specific (narrower) topic, namely free childcare. This is
also confirmed by the post-hoc analysis of our domain experts.
Figure 12: Example of morphing chains, where the red links are consistent with
concept identity provided by the domain expert.
expert. This morphing chain also has the highest strength.
Clearly, this chain makes more sense than the bottom one,
which is the weakest morphing chain.
Detecting concept split. If the three aspects do not morph unambiguously into one concept, or at least one aspect morphs to
multiple concepts, there is a concept split. In Fig. 12, Concept Free Childcare has a 1-way concept split from 1998 to 2002,
as defined in Section 4.2.2. Concept Capitalism intensionally splits into
three concepts which are also different from its extensional
morph. Unfortunately, almost all concepts have a morphing-based chain split, which is not really surprising. As we discussed above, our commitments to concept intension and extension in this case are both problematic, so the reliability of
the similarity is not guaranteed. As a consequence, the detected
splits are also doubtful.
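One possible reading of the split criterion above can be sketched in Python. The formal definition is in Section 4.2.2 and is not reproduced here, so this is only an approximation under stated assumptions: each aspect morphs to its most similar candidate(s) in the next version, ties count as multiple targets, and a best similarity of zero means the aspect morphs to nothing. All names and the toy data are hypothetical.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def exact_match(a, b):
    return 1.0 if a == b else 0.0


def morph_targets(concept, next_version, sims):
    """For each aspect, collect the most similar concept(s) in the next version.

    concept: dict aspect -> representation.
    next_version: dict name -> dict aspect -> representation.
    sims: dict aspect -> similarity function for that aspect.
    """
    targets = {}
    for aspect, sim in sims.items():
        scores = {name: sim(concept[aspect], cand[aspect])
                  for name, cand in next_version.items()}
        best = max(scores.values(), default=0.0)
        # ASSUMPTION: a best score of 0 means "morphs to nothing".
        targets[aspect] = {n for n, s in scores.items()
                           if s == best and best > 0.0}
    return targets


def has_split(targets):
    # Split: the aspects together reach more than one distinct concept.
    reached = set().union(*targets.values())
    return len(reached) > 1


# Toy data (hypothetical): label and extension morph to different concepts.
concept = {"label": "Free Childcare", "extension": {1, 2, 3}}
next_version = {
    "Free Childcare": {"label": "Free Childcare", "extension": {9}},
    "Childcare": {"label": "Childcare", "extension": {1, 2, 3, 4}},
}
sims = {"label": exact_match, "extension": jaccard}
targets = morph_targets(concept, next_version, sims)
split = has_split(targets)
```

Because the criterion fires as soon as any two aspects disagree, noisy similarity values will produce many splits, which matches the caveat in the text.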
5.1.2. Study concept drift based on morphing
Let us now study concept drift without relying on the existence of concept identities. As we introduced earlier, morphing-based concept drift involves two notions: morphing strength
and concept split.
Measuring morphing strength. When each concept of a series
of concepts can always morph into the concept of the next time
point, a morphing chain is formed. Starting from each concept,
we find its max-sim-chain, as defined in Section 3.1.3. Even
if we take only the most similar concept as the morphed one,
these morphing chains each have a different strength (see Definition 10). We therefore rank these morphing chains based on
their strength.
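The chain construction described above can be illustrated with a short Python sketch. It assumes a greedy reading of the max-sim-chain (at each step, morph into the single most similar concept of the next version) and, since Definition 10 is not reproduced in this section, it assumes the chain strength is the mean of the link similarities; the actual definition may differ. Function names and toy data are hypothetical.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_sim_chain(start, versions, sim):
    """Greedy morphing chain starting from concept `start` in versions[0].

    versions: list of dicts (name -> representation), ordered by time.
    Returns the chain of concept names and the similarity of each link.
    """
    chain, link_sims = [start], []
    current = versions[0][start]
    for nxt in versions[1:]:
        # Morph into the most similar concept of the next version.
        best = max(nxt, key=lambda name: sim(current, nxt[name]))
        link_sims.append(sim(current, nxt[best]))
        chain.append(best)
        current = nxt[best]
    return chain, link_sims


def chain_strength(link_sims):
    # ASSUMPTION: strength is the mean link similarity (not Definition 10).
    return sum(link_sims) / len(link_sims)


# Toy extensions (hypothetical instance ids), three versions.
versions = [
    {"City": {1, 2, 3, 4}},
    {"City": {1}, "Settlement": {1, 2, 3}},
    {"Settlement": {1, 2, 3}},
]
chain, link_sims = max_sim_chain("City", versions, jaccard)
strength = chain_strength(link_sims)
```

Computing such chains for every starting concept requires a similarity value for every pair of concepts in neighbouring versions, which is the cost noted later for the larger ontologies.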
In Fig. 11, we listed three morphing chains. The top one is
actually consistent with the identity provided by our domain
5.1.3. Identify concept drift moments
We can also measure at which moment(s) the concepts are
the most stable or unstable. At time point ti , we measure the
similarity of concepts between time points ti and ti+1 . When
concept identity can be defined, we take the average similarity
between two variants of the same concept. When using the morphing theory, we take the average similarity between all concepts
at ti and their morphed ones at the next time point ti+1. Fig. 13
compares the analysis using these two theories of concept drift,
where the error bars indicate the corresponding standard deviations over all the morphing strengths.
All three aspects show the same trend. Both theories indicate
that the concepts are the most stable between 2002 and 2003,
as the similarity is the highest. Between 1998 and 2002, by contrast,
there were big changes in terms of intension, especially using
the morphing theory. This indicates that the concepts were associated with quite different concepts in these two years.
Version   #Concept   #Resource
3.5       255        1,477,377
3.4       204        1,161,678
3.3       174        1,054,199
3.2       174        875,273
Table 1: Four versions of the DBpedia ontology
Please note that this definition is just one possible choice of
ontological commitment regarding the meaning of a concept in
RDF(S).
Similarity between labels is defined on the basis
of string similarity. Similarity between intensions and between extensions,
which are just sets of triples and resources respectively, can easily be defined through set similarity (Jaccard).
5.2. Case-study 2: Concept drift in DBpedia
DBpedia9 is probably the most successful ontology currently
linked within the Linked-Open Data (LOD) cloud.10 It combines a hand-crafted class hierarchy with automatically generated instance data taken from the Wikipedia effort. Through its
huge coverage, DBpedia is now the most strongly linked dataset
within the LOD. For the sake of this research we consider the
DBpedia ontology in RDFS [30], i.e. , we ignore the (very few)
OWL operators used in the model. RDF(S) and its underlying semantics is a very common modeling framework, which
makes DBpedia an interesting object of study as it is almost
exclusively modeled in RDF(S).
Experiments to study concept drift in DBpedia. We studied the
four latest versions of the DBpedia ontology, namely, DBpedia
3.5, 3.4, 3.3 and 3.2. Table 1 gives some general information
about these four versions. These versions were uploaded into
independent RDF repositories.
5.2.1. Study concept drift based on identity
DBpedia concepts have their URI references which remain
stable over the versions. Therefore we use the URIrefs as the
identities of concepts. Again, this is a rather arbitrary decision
that can be argued for or against. In our case, a DBpedia concept refers to its website, rather than an underlying abstract concept. If this website suddenly refers to some other topic (which
happens occasionally), this change will be detected as an intensional shift.
Although most DBpedia concepts define their label using
rdfs:label, these labels are mostly equivalent to the localnames
of their URIrefs. This means that labels are strongly correlated
with the identity, and it is thus less interesting to study label drift.
For each concept, we built its extensional representation
(i.e. , the set of instances) and the intensional representation
(i.e. , the set of related triples). The Jaccard similarity measure
was applied to measure the extensional and intensional similarity between the different variants of the same concept. In the
end, we averaged over all similarities to indicate the general
level of stability of this concept, in terms of their extensional
and intensional semantics, respectively. Table 2 gives the top 5
most stable and unstable concepts in terms of these two aspects.
Concept Politician is considered extensionally very unstable.
This can easily be confirmed by the change in the sheer number
of instances. In Version 3.2, it has only 476 instances, while in
Version 3.5, it already has 19,285 instances. The extension of
this concept has clearly expanded significantly. However, low
stability does not necessarily lead to concept shift. For example,
although growing, the extension of Politician is always the most
similar to the extension of its following variant.
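The difference between low stability and an actual shift can be made concrete with a small Python sketch. It assumes that a shift is reported when some other concept in the next version is strictly more similar than the concept's own identity variant; the formal notion is Definition 9, which is not reproduced here, so this is only an approximation. The function name `detect_shift` and the toy instance ids are hypothetical.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def detect_shift(name, current, next_version, sim):
    """Return the concept that `name` shifts to, or None if it does not shift.

    current, next_version: dicts (concept name -> representation).
    Identity is assumed to be given by the shared name (e.g. a URIref).
    """
    rep = current[name]
    own = sim(rep, next_version[name]) if name in next_version else 0.0
    best = max(next_version, key=lambda n: sim(rep, next_version[n]))
    if best != name and sim(rep, next_version[best]) > own:
        return best
    return None


# Toy extensions (hypothetical instance ids).
current = {"City": {1, 2, 3, 4}, "Politician": {10, 11}}
next_version = {
    "City": {1},
    "Settlement": {1, 2, 3, 5},
    "Politician": {10, 11, 12, 13, 14, 15},
}
city_shift = detect_shift("City", current, next_version, jaccard)
politician_shift = detect_shift("Politician", current, next_version, jaccard)
```

In this toy data the growing concept has a low similarity to its own successor but still no shift, because no other concept is closer.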
Concept City is also very unstable extensionally. Indeed, City in Version 3.4 has shifted to another concept in Version
3.5, Settlement. These two concepts share more than 73% of their instances, which causes a high extensional similarity between
RDF(S) concepts and their meaning. RDFS comes with a specific labeling relation rdfs:label. Furthermore, RDF is equipped
with an rdf:type relation that relates objects with classes. It seems
natural to define the extension of an RDF class to be the set of
all instances in the rdf:type relation. The intension of a class is, on
the other hand, not specifically defined in RDF(S). We have chosen a simple approach which focusses on the “semantic” operators in RDFS with fixed semantics, more precisely rdfs:subClassOf,
rdfs:range and rdfs:domain. The intension of a concept C is then simply the set of all triples with C in the subject or object position
of these three types of triples.
Let us define the meaning of a DBpedia concept formally.
We will call the combination of the DBpedia terminology T 11
and the explicit type information as well as the relations translated from Wikipedia, the DBpedia ontology.
Definition 13. Let O be the DBpedia ontology, i.e., a set of
triples (s, p, o), and O∗ the semantic closure of O. The rdf-label lab_r(C) of C is defined as the object o of (C, rdfs:label, o).
The rdf-extension ext_r(C) of C is defined as the set of resources
r such that (r, rdf:type, C) ∈ O∗. The rdf-intension int_r(C)
of C is defined as the set of all triples (C, p, o) ∈ O∗
where p = rdfs:subClassOf, and (s, p, C) ∈ O∗, where p ∈ {rdfs:subClassOf,
rdfs:domain, rdfs:range}.
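To make the three components concrete, here is a minimal Python sketch over plain (s, p, o) tuples. It follows the prose description of the intension (C in subject or object position of the three RDFS predicates) and assumes the triple set passed in is already the semantic closure. The function names, the prefixed-name strings and the toy triples are illustrative assumptions, not DBpedia data or the authors' code.

```python
RDF_TYPE = "rdf:type"
RDFS_LABEL = "rdfs:label"
# The three RDFS predicates with fixed semantics used for the intension.
SEMANTIC_PREDICATES = {"rdfs:subClassOf", "rdfs:domain", "rdfs:range"}


def rdf_label(triples, c):
    """Object of the first (c, rdfs:label, o) triple, or None."""
    return next((o for s, p, o in triples if s == c and p == RDFS_LABEL), None)


def rdf_extension(triples, c):
    """All resources r with (r, rdf:type, c) in the triple set."""
    return {s for s, p, o in triples if p == RDF_TYPE and o == c}


def rdf_intension(triples, c):
    """All subClassOf/domain/range triples with c as subject or object."""
    return {(s, p, o) for s, p, o in triples
            if p in SEMANTIC_PREDICATES and (s == c or o == c)}


# Toy triples (hypothetical); assumed to be closed under RDFS inference.
triples = {
    ("ex:City", "rdfs:label", "City"),
    ("ex:City", "rdfs:subClassOf", "ex:PopulatedPlace"),
    ("ex:mayor", "rdfs:domain", "ex:City"),
    ("ex:Amsterdam", "rdf:type", "ex:City"),
    ("ex:Amsterdam", "rdf:type", "ex:PopulatedPlace"),
}
label = rdf_label(triples, "ex:City")
extension = rdf_extension(triples, "ex:City")
intension = rdf_intension(triples, "ex:City")
```

Both the extension and the intension come out as sets, so they can be compared across versions with a set similarity such as Jaccard.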
9 http://wiki.dbpedia.org/
10 http://linkeddata.org/
11 The terminology is called differently in the different DBpedia versions, but
usually something like dbpedia-ontology.owl.
Figure 13: Concept stability over time, using two formalisations (political vocabulary). Panels: (a) Label stability, (b) Intensional stability, (c) Extensional stability, (d) Whole concept stability. X-axis: version pairs (1994_1998, 1998_2002, 2002_2003, 2003_2006); Y-axis: concept stability; series: Morphing and ID.
Rank   Extensional        Intensional
1      Planet             WineRegion
2      Road               Cyclist
3      Infrastructure     LunarCrater
4      SportsEvent        Cleric
5      FormulaOneRacer    WrestlingEvent
...    ...                ...
163    OfficeHolder       EthnicGroup
164    Politician         College
165    City               ChemicalCompound
166    Vein               Band
167    BasketballPlayer   BritishRoyalty
teresting question whether this was an intentional decision, or
rather happened.12
Similarly, studying the intensional stability and shifts also
gives insight into the evolution of the intensional semantics of a
concept. For example, the intensionally very unstable concept
EthnicGroup has been involved in the rdfs:domain and rdfs:range
of a continuously changing set of properties. This indicates that this
concept is related to different concepts in different versions,
which contributes to the intensional instability. Furthermore,
real shifts happened to some concepts. The identified extensional shift from City to Settlement is also found to be an intensional shift, that is, these two concepts not only share a lot
of instances, but their intensional definitions are very similar
too. This double confirmation is valuable because it may well
indicate a genuine concept shift.
Table 2: The top 5 most stable and last 5 least stable DBpedia concepts in
terms of their extension and intension (of the 167 concepts present in all four
versions)
5.2.2. Study concept drift based on morphing
If we do not constrain ourselves to use explicit identities,
what can the morphing theory tell us?
them, which is higher than the similarity between the two variants of City. We found that Settlement appeared only in version
3.5. For some reason, most of the instances of City in version
3.4 have been transferred to Settlement. As DBpedia is of course
based on a community effort of Wikipedia, this poses an in-
12 The Wikipedia logs unfortunately do not give clear insights into this question.
Concept SportsEvent has the strongest morphing chain, while
Concept ChemicalCompound has the weakest morphing chain.
Stronger morphing chains can involve concept shift if the identities are known, such as the red links in Fig. 14.
Let us look at the morphing-based chain split. Out of 104
morphing chains from Version 3.2 to Version 3.5, only 6 concepts have such a chain split. This indicates that the propagation of
the similarity along maximum morphing pairs of concepts to a
great extent preserves the identity of the concepts, although for
a very small number of concepts, such propagation leads to a
relatively more drastic drift.
Most stable concepts
Most unstable concepts
norm.owl#Custom
expression.owl#Promise
norm.owl#Potestative Expression
norm.owl#Hohfeldian Power
legal-action.owl#Mandate
legal-action.owl#Public Law
legal-action.owl#Asignment
legal-action.owl#Act of Law
relative-places.owl#Place
legal-action.owl#Delegation
Table 3: Top 5 stable and unstable concepts.
Definition 14. Let O be the LKIF-Core ontology and O∗
denote the OWLIM inferred semantic closure. The owl-label
lab_o(C) of C is defined as the object o of (C, rdfs:label, o).
The owl-extension ext_o(C) of C is defined as the set of individuals i such that (i, rdf:type, C) ∈ O∗. The owl-intension int_o(C)
of C is defined as:
5.2.3. Identify concept drift moments
As shown in Fig. 15, concept label is indeed the most stable
aspect of the concept meaning. There are relatively big intensional changes between Version 3.3 and 3.4, while the concepts
are rather stable in terms of their extensions. As opposed to
Fig. 13, the two theories give rather similar analyses. One reason is that RDFS provides much more explicit definitions for
concept intension and extension, which leads to more reliable
and consistent analysis.
1. all triples (C, p, o) ∈ O∗ and (s, p, C) ∈ O∗,
2. all triples in chains {(C, p_1, o_1) ◦ (s_2, p_2, o_2) ◦ ... ◦ (s_n, p_n, o_n)} where s_k = o_{k−1}, plus
3. all triples in chains {(s_1, p_1, o_1) ◦ (s_2, p_2, o_2) ◦ ... ◦ (s_n, p_n, C)} where s_{k+1} = o_k being
blank nodes.
5.3. Case-study 3: Concept drift in LKIF-Core
In this section we look at the evolution of concepts in
LKIF-Core, an OWL ontology of basic legal concepts.13
It consists of 15 modules, each of which describes a set of
closely related concepts from both legal and common sense domains. This ontology has also been continuously developed,
and uses most of OWL’s expressiveness.
Our case study on a legal ontology was supported by an expert from the Leibniz Center for Law of the University of Amsterdam who performed a small-scale qualitative evaluation of
the produced results and gave advice on the intended meaning
of the concepts in LKIF. This expert is one of the leading developers of LKIF.
Again, the above definition is only one possible ontological
commitment regarding the meaning of an OWL concept. Based
on such definition, the similarity of label, intension and extension can be calculated in the same way as for the DBpedia case.
Experiments to study concept drift in LKIF. We studied 4 major
versions of LKIF, namely, 1.0, 1.0.2, 1.0.3 and 1.1. As before,
we take the localname in the URIrefs as the identity of concepts and then study the concept drift based on identity. We
also study concept drift using the morphing theory. Unfortunately, rdfs:label was rarely used (only 4 concepts)
and the LKIF ontology does not come with any instance data.
This limits us to intensional shift, stability,
split and morphing strength.
OWL concepts and their meaning. The formal meaning of a
concept in an OWL DL ontology is often far more explicitly
defined than in other formalisms. The intension of a concept
could potentially be defined as the set of all possible DL concepts that are equivalent to it w.r.t. the ontology. Unfortunately,
this is only possible in so-called definitorial terminologies, and
difficult to calculate even if possible. We will approximate this
set in our experiments (rather coarsely) by using the OWLIM14
interpretation of LKIF, and consider finite sets of consequences
(triple chains). As OWL ontologies are specifications of sets
of possible models, there is no unique notion of the extension of
concepts. However, once committed to a particular model, DL
semantics provide the formal instance-of relation to specify the
extension of a concept.
Let us define the meaning of concepts in LKIF.
5.3.1. Study concept drift based on identity
All concepts were ranked according to their intensional stability; see Table 3 for some examples. The ranked list has been
confirmed by one of the developers of LKIF to be consistent with
his expectations.
By comparing the intension of concepts between different
versions, we were also able to find true concept shifts, listed in
Table 4.
As Table 4 shows, some intensional shifts correspond to
shifts between modules. For example, Speech Act is no longer a general action; instead, it belongs to the expression module for
describing propositions and propositional attitudes (belief, intention), qualifications, statements and media. The other
kind of shift, for example Mental Concept to Mental Entity, was
confirmed to be a renaming operation.
13 http://ontology.leibnizcenter.org/trac/wiki/LKIFCore
14 OWLIM is an RDF database management system, supporting the semantics
of RDFS, OWL Horst and OWL 2 RL, see http://www.ontotext.com/owlim/.
For our experiments we used OWLIM version 3.
Figure 14: Examples of morphing chains (across dbpedia32, dbpedia33, dbpedia34 and dbpedia35)
lkif1.0:action.owl#Speech Act             →  lkif1.0.2:expression.owl#Speech Act
lkif1.0:action.owl#Termination            →  lkif1.0.2:process.owl#Termination
lkif1.0.2:lkif-top.owl#Mental Concept     →  lkif1.0.3:lkif-top.owl#Mental Entity
lkif1.0.2:lkif-top.owl#Physical Concept   →  lkif1.0.3:lkif-top.owl#Physical Entity
Table 4: Examples of confirmed intensional shift in LKIF-Core
5.3.2. Study concept drift based on morphing
Some shifts (found based on identity semantics) can also be
recognised when looking at the morphing chains. In Fig. 16,
the shift from action.owl#Speech Act in Version 1.0 to expression.owl#Speech Act in Version 1.0.2 is the first link (in red) of the
morphing chain. This morphing chain is consistent with the actual operations on this ontology. Another morphing chain, starting from Concept lkif1.0:action.owl#Termination, is also confirmed
by our domain expert, but its strength is weaker.
Similar to the DBpedia case, out of 98 morphing chains from
Version 1.0 to Version 1.1, only 7 concepts have a morphing-based chain split. As we know that the labels act almost as unique
identifiers in LKIF, this indicates further that the max-sim-chain
on intension can be used as a reliable strategy for identifying
concept identities.
5.3.3. Study concept drift moments
Again we compare the stability of chains, either based on
identity or morphing. As visible in Fig. 17, there are fewer
changes between Version 1.0.2 and 1.0.3. Similar to the DBpedia case, the two theories gave quite similar analyses.
Figure 17: Concept stability over time, using two formalisations (LKIF). Panel: Intension. X-axis: version pairs (lkif1.0_lkif1.0.2, lkif1.0.2_lkif1.0.3, lkif1.0.3_lkif1.1); Y-axis: concept stability; series: Morphing and ID.
qualitative analyses of the results were done by one of the curators of the Gene Ontology, and main contributors to the HPO.
OBO concepts and their meaning. The meaning of an OBO
concept is defined by a term stanza, which consists of
tag: value pairs. Each term has its identity defined using the tag id and its label using the tag name. Its intension is
then defined by all the remaining tag: value pairs. The instances
of an OBO ontology are defined by instance stanzas, each of which is
associated with its concept using the tag instance_of. Therefore, we
define the meaning of an OBO concept formally as follows.
5.4. Case-study 4: Concept drift in NCBO Biomedical Ontologies
We also studied the concept drift in a few biomedical ontologies from the NCBO Bioportal.15 These ontologies are in the
OBO format.16 Unfortunately, there is also no instance data
available from the Bioportal. We therefore only study concept
drift in terms of label and intension.
Our case study on biomedical ontology was supported by
NCBO in Stanford. Through the contacts at NCBO small-scale
Definition 15. Let O be an OBO ontology. The obo-label
lab_o(C) of concept C is defined as the value of the tag name.
The obo-extension ext_o(C) of C is the set of instances whose
value of the tag instance_of is C. The obo-intension int_o(C) of C
is the set of the remaining tag: value pairs of C (i.e. all but the
15 http://bioportal.bioontology.org/
16 http://www.geneontology.org/GO.format.obo-1_2.shtml
Figure 15: Concept stability over time, using two formalisations (DBpedia). Panels: (a) Label stability, (b) Intensional stability, (c) Extensional stability, (d) Whole concept stability. X-axis: version pairs (dbpedia32_dbpedia33, dbpedia33_dbpedia34, dbpedia34_dbpedia35); Y-axis: concept stability; series: Morphing and ID.
name
changes; therefore, many false intensional shifts were identified. In Fig. 18, the stability between Version 1.54 and 1.59
drops to around 0.5, because a new property xref was
introduced and most of the concepts were enriched with UMLS references using this tag. The intensions of the concepts therefore grew at that moment, which should be considered a crucial moment in the evolution of the concepts. However, this
of course relies on the commitments we made about which information should be considered part of the intension.
tag).
Based on such definition, the similarity of label, intension and
extension can be calculated in the same way as for the previous
DBpedia and LKIF cases.
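As an illustration of the OBO commitments above, the following Python sketch splits a single term stanza into identity, label and intension, and then shows how renaming one tag lowers the Jaccard similarity of the intension. It is a simplified reading, not a full OBO parser: trailing comments and modifiers are not handled, both id and name are excluded from the intension (following the prose description), and every tag value other than the id and name is made up for the example.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def parse_term_stanza(text):
    """Split one [Term] stanza into (id, label, intension).

    The intension is the set of (tag, value) pairs other than id and name.
    """
    pairs = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("[") or line.startswith("!"):
            continue  # skip blank lines, the stanza header and comments
        tag, _, value = line.partition(":")
        pairs.append((tag.strip(), value.strip()))
    ident = next(v for t, v in pairs if t == "id")
    label = next((v for t, v in pairs if t == "name"), None)
    intension = {(t, v) for t, v in pairs if t not in ("id", "name")}
    return ident, label, intension


# Hypothetical stanza: only the id and name are taken from the text above.
stanza = """
[Term]
id: HP:0005730
name: Small epiphyses
is_a: HP:0000001
synonym: "placeholder synonym"
"""
ident, label, intension = parse_term_stanza(stanza)

# Renaming the tag (synonym -> exact_synonym) changes the pair itself,
# so the set similarity drops although the content is unchanged.
renamed = {("exact_synonym" if t == "synonym" else t, v) for t, v in intension}
sim_after_rename = jaccard(intension, renamed)
```

A similarity computed this way reacts to purely structural edits, which is one way to see why such changes can be reported as intensional shifts.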
Study concept drift in OBO ontologies. The Human Phenotype
Ontology (HPO) has 28 versions, with the number of concepts
increasing from 8773 to 9559. Among these versions, we identified 18 label shifts. These label shifts occur mainly because the
concept of the previous version was deleted and replaced by
a new concept with the same label but a new id number.
For example, Concept HP:0005730 (Small epiphyses) in Version
1.66 is substituted by Concept HP:0010585 in Version 1.66.
We also identified 6707 intensional shifts. Some intensional
changes are due to changes in the ontology structure. For
example, the property synonym in Version 1.1 was changed into
exact_synonym in Version 1.2, which then was changed back to
synonym. The current similarity measure is sensitive to such
According to the HPO ontology developers, it is not surprising that the early days of ontology development were more dynamic and concepts are less stable than the later versions. With
the quality of the ontology improving, the similarity between
versions is expected to become bigger, which is consistent with
our results.
We also looked at the evolution of the Gene Ontology (23
versions), see Fig. 19. For the Gene Ontology, we only took the
sample versions from the Bioportal repository at the monthly
Figure 16: Examples of morphing chains (LKIF)
Figure 18: Concept stability over time, using two formalisations (Human Phenotype Ontology). Panels: (a) Label stability, (b) Intensional stability. X-axis: version pairs (1.1_1.2 to 1.79_1.8); Y-axis: concept stability; series: Morphing and ID.
basis from June 2008 to May 2010.17 There are different types
of edits, including term name changes, term definition updates,
updates to relationships used by terms, and extensive changes to
external references. These all contribute to the label and intensional drift. The GO developers did find that more changes had been
made at the less stable moments. As shown in Fig. 19, the
labels were changed the most between September and November 2008, while the most intensionally unstable moment is between June and July 2009, which the GO
expert we consulted attributed to substantial changes in the
SDB (Society for Developmental Biology) terms.
efforts that have gone into topics such as ontology evolution,
semantic versioning or temporal modelling and reasoning, most
tools are still based on static representations. The existing ontology versioning frameworks focus on the interoperability between versions and data. There is not yet a formal framework
for concept drift, nor an implementation for identifying significant concept drift.
This paper attempts to close this gap by introducing a theoretical foundation for concept drift. We study concept drift
based on two theories: one based on concept identity and one
based on concept morphing. We also propose a generic qualitative toolkit to study concept drift through some more practical
notions: shift and stability, split and morphing strength, over
time. We apply this general formalisation in practical applications modelled in SKOS, RDFS, OWL and OBO. These plausibility tests are still preliminary, but encouraging: although
intensional drift is difficult to study because the concepts are
often not formally defined, the detected concept drift gives useful information for the domain experts, for example, identifying
substantial ontology design changes. Additionally, our experiments also suggest that studying morphing chains is a promising strategy in practice for identifying concept identities.
Our case-studies also indicate that only in few realistic scenarios
do all of the proposed methods give useful insights, or can they even be
meaningfully applied. In some cases, one or even two of the
aspects of the meaning were missing; in others, one of the aspects was almost functionally coupled with identity. For that reason
6. Conclusion
More and more applications critically depend on some kind
of concept schemes for the semantic interoperability of their
data. However, although it is recognised by many as a critical
problem, the continuous change in meaning of concepts (called
drift in this paper) has not yet received the attention it deserves
in the ontology modelling community. Despite the significant
17 As shown in previous cases, two theories provide similar results. However,
the analysis based on morphing is very computationally expensive as a full
similarity matrix between all pairs of concepts needs to be computed. The
number of terms in our sampled versions grows from 27542 to 33243, which
makes the computation not feasible practically. Therefore, we only show the
results based on identity.
Figure 19: Concept stability (Gene Ontology). Panels: label and Intension. X-axis: version pairs (monthly versions, Jun08 to Apr10); Y-axis: concept stability.
we introduced our mechanisms as a toolkit, for the knowledge
engineer and domain expert to try out, and in the end to choose
those that are most likely to produce meaningful insights.
Future research will go in two directions. First, in cooperation with Communication Scientists working on political
reporting and legal experts, we will apply the proposed methods
to detect changes of meaning in real-world scenarios. We are
also involved in a close cooperation with NCBO which will give
us more insights into the problem. In at least two of these domains we will have to do a more detailed larger-scale usability
study to assess the validity of our proposed notions beyond the
slightly anecdotal evidence provided in this paper. Second, with the
experience gained, we plan to create applications; the ultimate
goal is of course to improve the usability of our toolkit for
easier analysis of drift in dynamic environments, and
to extend the semantics of concept drift into query or reasoning
tasks.
[9] N. Guarino, C. Welty, Evaluating ontological decisions with ontoclean,
Commun. ACM 45 (2) (2002) 61–65.
[10] J. A. Gulla, G. Solskinnsbakk, P. Myrseth, V. Haderlein, O. Cerrato, Semantic drift in ontologies, in: Proceedings of 6th International Conference on Web Information Systems and Technologies, Valencia, Spain,
2010.
[11] P. Haase, F. Harmelen, Z. Huang, H. Stuckenschmidt, Y. Sure, A framework for handling inconsistency in changing ontologies, The Semantic
Web–ISWC 2005 (2005) 353–367.
[12] P. Haase, A. Hotho, L. Schmidt-Thieme, Y. Sure, Collaborative and
usage-driven evolution of personal ontologies, in: A. Gmez-Prez, J. Euzenat (eds.), The Semantic Web: Research and Applications, vol. 3532 of
Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2005,
pp. 486–499.
URL http://dx.doi.org/10.1007/11431053_33
[13] J. Heflin, Z. Pan, A model theoretic semantics for ontology versioning,
in: In Third International Semantic Web Conference, Springer, 2004.
[14] G. Hodge, Systems of knowledge organization for digital libraries: Beyond traditional authority files, Tech. rep., Council on Library and Information Resources (Apr. 2000).
[15] R. Hoekstra, J. Breuker, M. D. Bello, A. Boer, The lkif core ontology of
basic legal concepts, in: P. Casanovas, M. A. Biasiotti, E. Francesconi,
M. T. Sagri (eds.), Proceedings of the Workshop on Legal Ontologies and
Artificial Intelligence Techniques, 2007.
[16] Z. Huang, H. Stuckenschmidt, Reasoning with multi-version ontologies:
A temporal logic approach, in: In Proceeding of the 4th International
Semantic Web Conference (ISWC, 2005.
[17] A. Isaac, E. Summers, Skos simple knowledge organization system
primer, http://www.w3.org/TR/skos-primer/ (2008).
[18] A. Isaac, E. Summers, SKOS Primer, W3C Group Note (2009).
URL http://www.w3.org/TR/skos-primer/
[19] P. Jaccard, Étude comparative de la distribution florale dans une portion
des alpes et des jura, Bulletin de la Socit Vaudoise des Sciences Naturelles
37 (1901) 547–579.
[20] Y. Kalfoglou, M. Schorlemmer, Ontology mapping: the state of the art,
The knowledge engineering review 18 (01) (2003) 1–31.
[21] A. Kalyanpur, B. Parsia, E. Sirin, B. Grau, J. Hendler, Swoop: A web
ontology editing browser, Web Semantics: Science, Services and Agents
on the World Wide Web 4 (2) (2006) 144–153.
[22] M. Klein, Change management for distributed ontologies, Ph.D. thesis,
Vrije Universiteit Amsterdam (Aug. 2004).
URL http://www.cs.vu.nl/˜mcaklein/thesis/
[23] M. C. A. Klein, D. Fensel, A. Kiryakov, D. Ognyanov, Ontology versioning and change detection on the web, in: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web, EKAW ’02, Springer-Verlag,
7. Acknowledgements
We are especially grateful to Janet Takens, Jan Kleinnijenhuis, Sebastian Köhler, Peter Robinson, Sandra Dölken, Paea
LePendu, Rinke Hoekstra and Doug Howe, who participated in our
use case studies and provided crucial feedback.
