Journal of Information Science
http://jis.sagepub.com
An adaptive thesaurus employing semantic distance, relational inheritance and nominal
compound interpretation for linguistic support of information retrieval
Christopher C. Byrne and Stephen A. McCracken
Journal of Information Science 1999; 25; 113
DOI: 10.1177/016555159902500203
The online version of this article can be found at:
http://jis.sagepub.com/cgi/content/abstract/25/2/113
Published by:
http://www.sagepublications.com
On behalf of:
Chartered Institute of Library and Information Professionals
Downloaded from http://jis.sagepub.com at PENNSYLVANIA STATE UNIV on April 14, 2008
© 1999 Chartered Institute of Library and Information Professionals. All rights reserved. Not for commercial use or unauthorized distribution.
An adaptive thesaurus employing semantic distance, relational inheritance and nominal
compound interpretation for linguistic support of information retrieval
Christopher C. Byrne and Stephen A. McCracken
Penn State University, University Park, PA, USA
Received 29 June 1998
Revised 4 November 1998
Abstract.
Presented is a domain specific thesaurus for linguistic
support of an information broker. It employs an adaptive
semantic distance function based on pathfinder networks
and algorithms for automated population by nominal compound interpretation and relational inheritance. Created in
support of the National Information Infrastructure Testbed
for Diagnostic and Prognostic Maintenance of Equipment
and Processes, funded by the US Department of Energy, the
thesaurus and information broker were demonstrated
through two pilot applications: Internet-based decision support for maintenance and operation of the Eddystone powerplant near Philadelphia, PA, and maintenance of the Allison
T-56 engine used in the E2-C naval aircraft. The structure of
the thesaurus was formalized to facilitate application in
other domains and to further the understanding of thesauri,
linguistic support methodologies and automated language
processing in general.
Correspondence to: C.C. Byrne, Applied Research Laboratory,
Penn State University, PO Box 30, State College, PA 16804,
USA. Tel: +1 814 863 4343. E-mail: byrne@games.arl.psu.edu
1. Introduction
The National Information Infrastructure (NII) Testbed
was funded by the US Department of Energy to develop
technology to build on the physical connectivity of
electronic systems worldwide, with software addressing issues of semantic heterogeneity of user groups and
electronic resources which limit the interoperability of
distributed systems despite the physical connectivity.
Its particular focus has been to develop and demonstrate
such technology in the context of (remote) electronic
support of monitoring, diagnosis and prognosis (mdp)
of electromechanical systems [1].
An information broker and associated thesaurus were
developed as part of the project, the functions of which
include enabling human or computer users to locate
and retrieve available data, using their own local terminology to drive the search and without needing any
knowledge of the actual name, location or access
methods of the data source. There is a growing body of
research in this area and this work synthesizes several
approaches which have been put forth, including those
of Hurson et al. [2, 3], Schvaneveldt [4] and Gay [5].
The primary challenge for an information broker is to
mediate the semantic heterogeneity of the users and
information sources. The information broker should
enable users to retain their own language usage and
likewise allow legacy information sources (databases,
analysis programs, etc) to retain their interfaces, no
matter how cryptic. Requiring modification of legacy
systems for integration would, of course, greatly restrict
the number of sources which would be integrated.
Mediation of the heterogeneity of sources and users can
be achieved via a domain model of all concepts relevant
to a given domain (mdp of electro-mechanical equipment in this case). A domain model is a formal structure supportive of automated reference and inclusive of
domain specific semantics used by humans in the
domain. Information is represented in a source model,
which is a representation of the information using the
formalisms of the domain model. The domain model
acts to bridge user semantics and information source
semantics, including terminology and the semantic
relations associating different concepts. These relations
are used to control the search mechanism by ‘matching’
on information model elements whose relations to the
search elements satisfy various constraints.
The domain model in the information broker developed for the NII is a thesaurus and the source model
and associated search engine are an implementation
of Hurson’s Summary Schemas Model [2, 3], which
assumes the existence of a thesaurus with a semantic
distance function. Since the source model represents
available information in the language of the domain
model, which in this case is the thesaurus, information
is represented by terms and relations. At the site of each
information source, a local map (schema) identifies
each piece of information by one or more thesaurus
terms and subject categories. A hierarchical (tree)
representation is then constructed. Each information
source is regarded as a node whose (local) schema identifies each piece of available information from that
source. Nodes are then grouped under a common parent
node, and a (summary) schema for the parent node is
constructed automatically by using the thesaurus to
represent several precise terms with one more abstract
term (the common parent in the thesaurus). These
parent nodes are grouped under a common grandparent
node by the same process until a single top node is
reached. There is no necessary correlation between the
logical organization of the tree and the physical configuration of participant computers. The tree structure
enables efficient search for information by restricting
the search to only continue down paths through nodes
whose schemas contain terms matching search terms.
The thesaurus is used to declare a match between
related if not identical terms, and the precision of the
thesaurus relations, in particular the semantic distance
function, enables user control over the precision and
recall of the search. Hurson’s original model places no
particular requirements on the thesaurus other than the
existence of a semantic distance function. It could thus
be implemented using a distance function based on statistically generated co-occurrence
frequencies, as in latent semantic indexing, although such functions do
not contain the wealth of relation information of thesauri. The present implementation follows Hurson’s
model by using a user input semantic distance tolerance
to control the matching process. Using a full thesaurus,
however, allows generalization of the model with
regard to what constraints are imposed on the matching
process of the search algorithm. In theory, any information in the thesaurus could be used to control the
search. An extension, for example, would be to pass a
list of relations along with the semantic distance
tolerance in a query, directing the search algorithm to
declare matches only if terms were within the distance
tolerance along paths specified by the list of relations.
The technical details of paths and semantic distance are
explained in the body of subsequent sections.
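The pruned tree search described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the Summary Schemas Model implementation used in the project: the names `SchemaNode`, `search`, `matches` and the toy `BROADER` map are hypothetical, the summary schema of the parent node is written by hand rather than generated from the thesaurus, and the match predicate simply accepts identical or broader terms where the real broker would apply a semantic distance tolerance.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SchemaNode:
    """A node of the (hypothetical) summary schema tree."""
    name: str
    schema: Set[str]                      # thesaurus terms describing the node
    children: List["SchemaNode"] = field(default_factory=list)


def search(node, query, matches):
    """Return the names of leaf sources whose schema matches the query.

    A subtree is only descended into when the schema of its root
    contains at least one term that matches the query term.
    """
    if not any(matches(query, term) for term in node.schema):
        return []                         # prune the whole subtree
    if not node.children:
        return [node.name]                # a leaf is an information source
    found = []
    for child in node.children:
        found.extend(search(child, query, matches))
    return found


# Toy broader-term map (assumed data, for illustration only).
BROADER = {"fan speed": "measurement", "boiler temperature": "measurement"}


def matches(query, term):
    """Match when term is the query itself or one of its broader terms."""
    current = query
    while current is not None:
        if current == term:
            return True
        current = BROADER.get(current)
    return False


fan_db = SchemaNode("fan_db", {"fan speed"})
boiler_db = SchemaNode("boiler_db", {"boiler temperature"})
plant = SchemaNode("plant", {"measurement"}, [fan_db, boiler_db])

print(search(plant, "fan speed", matches))   # ['fan_db']
```

Passing the match predicate as an argument mirrors the generalization discussed above: a distance tolerance, a list of admissible relations, or both could be substituted without changing the tree traversal.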
In the interest of motivating and illustrating the
mathematical treatment of thesauri, presentation of the
actual mdp thesaurus which was constructed will
precede the formal treatment of thesauri in general. Its
functionality and its role in support of the information
broker will be explained and illustrated with examples.
Finally, the formal (mathematical) structure will be
presented with reference to the examples given. The
main benefit of this formal treatment is that, by clarifying the abstract ideas separately from the particular
application, these ideas might more easily be applied to
new domains, where the particular relations, inheritance rules and post-coordination algorithm might be
quite different due to context. The formal treatment,
however, provides a clear description of the salient
properties defining these features. Formalization is also
a prerequisite for computer automation and, while the
current work made only minor progress toward automated thesaurus population and processing, future
work can build on this beginning. Other formal treatments of knowledge
representation for information retrieval are more thoroughly developed
than the present work, notably the work on ontologies at the Knowledge
Systems Laboratory at Stanford, but the present work has the advantage of
simplicity and compatibility with legacy resources.
Domain specific thesauri are already in existence for
many specialized knowledge fields such as medicine
and law. These may be put to use immediately with the
methods described here and gradually enhanced in
fidelity to improve performance. Likewise, no constraints are placed on the structure or terminology of
legacy information sources for their integration via
these methods. Section 2 will document the structure
and contents of the mdp thesaurus, explain the design
choices and the construction process and illustrate
with examples. Section 3 will document the electronic
implementation of the thesaurus. Section 4 will revisit
the key components, giving mathematical definitions
and algorithms and relating them to the examples of
section 2. Section 5 will conclude with comparisons to
related work and outline open research issues.
2. Mdp thesaurus
Thesauri have been in common usage as linguistic
references for some time; Roget’s Thesaurus [6] being
perhaps the most well known. It is assumed that this
audience is sufficiently familiar with thesauri that the
detailed features of our thesaurus can be presented with
only brief introduction. The knowledge captured by our
thesaurus, like that of any thesaurus, consists of the
meaning relationships between terms. The uniqueness
of the mdp thesaurus lies in the scope and fidelity of
these relationships as well as in methods for automating portions of the population process, which is the
most human intensive aspect of this approach. The
naming convention for relations is designed to eliminate ambiguities in the directionality of asymmetric
relations by names which include verbs, so that expressions of the form term1 relation term2 form sentences
when full relation names are used. Hence, we call the
International Standards Organization (ISO) standard
relation Broader_Term by the name Has_Broader_Term
[7]. We have preserved the ISO standard abbreviations
so that ISO standard thesauri can be imported. In
comparison with the generality of the ISO standard
relations Broader_Term, Narrower_Term, and Related_Term,
the relations in the mdp thesaurus discriminate
between many logical types of non-equivalence relatedness between terms. Also included is an equivalence
relation other than synonymy. In addition to this
logical precision, the semantic distance in our thesaurus attempts to capture the relation of psychological
(dis)similarity of terms, as evidenced, for example, by
frequency of association. There are several approaches
in the literature to capturing this type of relation, all of
them experimental, and ours is based on Pathfinder
Networks (PFNets) [4] with some important modifications. The automated population features are relational
inheritance and nominal compound interpretation
rules. Relational inheritance rules allow relations
between some terms to be inferred automatically by the
software and thus they need not be entered explicitly
by the (human) thesaurus compiler. Rules for nominal
compound interpretation automatically expand the
thesaurus vocabulary to include all compounds of arbitrary finite length of terms explicitly posted in the
thesaurus, though with reduced fidelity of relations
as a trade-off in the interest of simplicity, which is
regarded as both a theoretical and practical advantage.
The topics introduced by this brief overview will now
be described in detail.
2.1. Mdp thesaurus relations
Table 1 lists the relations in the mdp thesaurus, organized by the usual categories of equivalence, hierarchical and associative relations, whose meaning here is that
of standard thesaurus usage. This meaning is formally
defined by the reflexivity, symmetry and transitivity
properties of abstract relations, for which standard
mathematical definitions are included in Appendix A.
Appendix A also includes standard mathematical
definitions for graph theoretic concepts used such as
weighted directed graph, vertices, edges, edge weights,
path, path length, etc. The numbers listed with each relation are the ‘edge weights’ for the PFNet style semantic
distance computation, which is explained in section 2.2.
In short, they represent the distance from one term to
another when the given relation relates the two terms.
The precision of our thesaurus relations is evident
from the listing in Table 1. As mentioned above, the ISO
standard relations are supported, both for import of
ISO standard thesauri and for exceptional term pairs
whose precise relation is not distinguished in the
thesaurus. Reciprocal relations can be identified from
the relation names, and relations with no apparent reciprocals are symmetric (self-reciprocal). Technically,
the relations Has_Preferred_Synonym and Is_Used_For
are not symmetric and are only reflexive as applied
to preferred terms in the thesaurus. This, however,
derives from the use of one preferred term to represent
each equivalence class of synonyms for convenience
and memory savings. The synonymy relation is indeed
an equivalence relation, but is represented by two relations so that all other relatives (which are presumed to
be the same for all synonyms) need only be entered
explicitly for one representative of each equivalent
class. Relatives of non-preferred terms are accessed by
first referencing their preferred synonyms.
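The preferred-term representation of synonymy can be illustrated with a short sketch. The dictionaries and function names below are hypothetical, and the toy data (including which of header and pipe is the preferred term) is an assumption made only for illustration.

```python
# Non-preferred term -> preferred synonym (Has_Preferred_Synonym).
# Which synonym is preferred is assumed here.
HAS_PREFERRED_SYNONYM = {"header": "pipe"}

# All other relations are stored once, on the preferred term only (toy data).
RELATIONS = {"pipe": [("Is_Component_of", "boiler")]}


def preferred(term):
    """Return the representative of the term's synonym class."""
    return HAS_PREFERRED_SYNONYM.get(term, term)


def relatives(term):
    """Look up relatives by first resolving the preferred synonym."""
    return RELATIONS.get(preferred(term), [])


print(relatives("header"))   # [('Is_Component_of', 'boiler')]
```

Because every synonym resolves to the same representative, relatives entered once for the preferred term are shared by the whole equivalence class.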
Table 1
Mdp thesaurus relations

Relation                        Default edge weight

Hierarchical relations (anti-reflexive, antisymmetric, transitive)
  Has_Broader_Term*             3
  Has_Narrower_Term*            3
  Has_Supertype                 3
  Has_Subtype                   3
  Is_Instance_of_Class          4
  Has_Instance                  1
  Is_Component_of               4
  Has_Component                 4

Equivalence relations (reflexive, symmetric, transitive)
  Has_Preferred_Synonym*        0
  Is_Used_for*                  0
  Is_Derived_from               1

Associative relations (all others)
  Has_Related_Term*             6
  Accepts_Input                 5
  Is_Input_to                   5
  Produces_Output               5
  Is_Output_by                  5
  Performs_Process              5
  Is_Performed_by               5
  Has_Measurement_Unit          5
  Is_Measurement_Unit_of        1
  Has_Property                  6
  Is_Property_of                6
  Has_Subject_Category*         6
  Is_Subject_Category_of        6
  Has_Semantic_Role             6
  Is_Semantic_Role_of           6
  Has_No_Direct_Relation_to     9

* indicates ISO standard thesaurus relations. Integers indicate default edge weights.

The relation is_derived_from means that if term1
is_derived_from term2 then, if either term1 or term2 is
known, the other can be computed from an invertible
map, e.g. radius is_derived_from diameter, diameter
is_derived_from circular area and circular area
is_derived_from radius. The relation has_semantic_role
associates each thesaurus term with a high-level
characterization such as object or process. Semantic
roles are used for nominal compound interpretation,
which is discussed in section 2.5. The relation
has_no_direct_relation_to exists only to place an upper
bound of 9 on the semantic distance function for consistency with the PFNet model as discussed in section 2.2.
The relation has_component and its reciprocal
is_component_of are used for objects such as airplane
has_component engine, as well as for processes such as
fossil fueled electric power generation has_component
steam generation. The relation has_instance and its
reciprocal is_instance_of allow for the inclusion in the
thesaurus of proper names, which can be valuable and
even necessary search terms, whether model names of
equipment such as E2-C aircraft, facility names such as
Eddystone Powerplant, software program names such
as Matlab or method names such as Fourier Transform.
The relation has_subject_category links every term in
the vocabulary to a particular subject category in the
mdp category hierarchy. The subject category hierarchy
reflects application sub-domains. Many taxonomies
could be used for a category breakdown, and this breakdown was chosen, as were the thesaurus contents as a
whole, to reflect discriminations important to the target
user community. Maintenance professionals are typically interested in their own particular application,
even though many similarly named data elements are
available through the information broker, e.g. components such as fans and boilers or measurements such as
fan speed and boiler temperature.
The thesaurus can be visualized as a graph, with
terms for vertices and each relation represented by a
set of directed edges connecting pairs of terms. The
category hierarchy is then a subgraph, which is in fact
a tree since inter-category relations are all hierarchical.
Fig. 1 depicts one subgraph of the mdp thesaurus.
Fig. 2 depicts the category hierarchy.
Fig. 1. Mdp thesaurus subgraph.
Fig. 2. Mdp thesaurus category hierarchy.
Fig. 2 shows the category tree of the mdp thesaurus, which will be expanded as required by expansion of the
application domains. The two most expanded subtrees reflect the two pilot applications of the NII.
The mdp thesaurus relations, which are considerably
more precise than typical thesaurus relations, provide
more detailed information to any query; and furthermore, they will support more detailed queries, whether
by humans or software programs. The information
broker, in particular, which matches query terms with
source model terms representing available information,
can be instructed to declare a match using any number
of precise criteria based on the thesaurus relations.
Aside from speed, the two major performance criteria
of information retrieval systems are precision and
recall: precision being the percentage of information
retrieved that is relevant and recall being the percentage of relevant information which is retrieved. Recall
is generally increased by returning more information;
precision by returning less. The use of a thesaurus
to return information labeled by terms related but
not identical to the search terms increases recall.
Restriction of such returns by using precise relations,
including a semantic distance tolerance, increases
precision. Thus, the precision of the relations and the
use of a semantic distance give the user control over
the precision-recall trade-off and, to some extent, mitigate that trade-off.
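As a small worked illustration of the two measures, the following sketch computes precision and recall for a narrow and a broad result set. The function name and the document identifiers are made up for the example and are not taken from the system described here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}

# Narrow search (e.g. identical terms only): few items, all relevant.
print(precision_recall({"d1", "d2"}, relevant))                            # (1.0, 0.5)

# Broad search (e.g. related terms admitted): more relevant items, more noise.
print(precision_recall({"d1", "d2", "d3", "d5", "d6", "d7"}, relevant))    # (0.5, 0.75)
```

The broad search raises recall from 0.5 to 0.75 at the cost of precision, which is the trade-off a distance tolerance lets the user tune.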
2.2. Semantic distance, Pathfinder Networks and adaptation
PFNets [4] model semantic distance by creating a graph
with terms as vertices and weighted directed edges
whose weights (in the range 0 to 9) are input by human
subjects in response to presentation of all ordered pairs
of terms in a target vocabulary. Subjects are prompted
for all weights twice, as a consistency check. Once these
weights have been entered, distance is modeled as the
shortest path through the graph between two terms,
where path length is computed from the edge weights
by the Minkowski Functional, discussed in detail in
section 4. The consequence of the path length computation is that, although a subject may have intuitively
judged one term to be a distance of, say, 5 from another,
if there exists a path of length less than 5 through other
terms in the vocabulary, then the input edge of weight
(length) 5 will be deleted. Two parameters affect the
distance computation. The maximum number of edges
in any path (Q in Schvaneveldt’s notation) is one parameter, so that input edges will only be deleted if there
exists a shorter path consisting of at most Q edges. The
other parameter (r in Schvaneveldt’s and Minkowski’s
notation) affects the relative importance of edges of
differing weight. For large r, path length is dominated
by the weight of the largest edge in the path, so that, for
example, the length of a path with two edges weighted
5 and 1 will be longer than a path with six edges of
weight 2 each. This parametric control of the distance
computation is extremely useful for semantic modeling. Two terms linked by a path of several hierarchical
edges, for example, may psychologically be more
closely related than two terms linked by a path of fewer
associative edges. For example, aircraft, related to leer
jet by the path aircraft has_subtype jet aircraft has_instance
leer jet, is more closely related to leer jet than
is flight, which is related to leer jet by the path flight
is_performed_by leer jet. In our implementation, the
parameter Q is not used, so the semantic distance
between two terms is always the length of the shortest
path, regardless of how many edges are in the path. In
this way, our semantic distance satisfies the triangle
inequality required of all metrics in standard mathematics (see section 4 for details).
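The shortest-path distance described above can be sketched in a few lines. The code below is an illustrative sketch, not the project's implementation: it assumes the usual Minkowski form, path length = (sum of w^r)^(1/r) for a finite r of at least 1, uses Dijkstra's algorithm (a substitution made here, since the source does not name a search algorithm), and models the upper bound of 9 with a hypothetical `cap` parameter. Edge weights are the Table 1 defaults for the relations in the aircraft example.

```python
import heapq


def minkowski_length(weights, r):
    """Length of a path with the given edge weights under the r-metric."""
    return sum(w ** r for w in weights) ** (1.0 / r)


def semantic_distance(edges, source, target, r=2.0, cap=9.0):
    """Shortest Minkowski path length from source to target.

    edges maps a term to a list of (neighbour, weight) directed edges.
    Minimising sum(w**r) also minimises its r-th root, so Dijkstra's
    algorithm is run on the weights raised to the power r.
    """
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, term = heapq.heappop(queue)
        if term == target:
            return min(cost ** (1.0 / r), cap)
        if cost > best.get(term, float("inf")):
            continue                       # stale queue entry
        for neighbour, weight in edges.get(term, []):
            new_cost = cost + weight ** r
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return cap                             # no path found: treated as weight 9


edges = {
    "aircraft": [("jet aircraft", 3)],     # has_subtype
    "jet aircraft": [("leer jet", 1)],     # has_instance
    "flight": [("leer jet", 5)],           # is_performed_by
}

print(round(semantic_distance(edges, "aircraft", "leer jet"), 2))  # 3.16
print(round(semantic_distance(edges, "flight", "leer jet"), 2))    # 5.0
```

With r = 2 the two-edge hierarchical path is shorter than the single associative edge, as in the example in the text. The same function reproduces the comparison of paths weighted 5 and 1 against six edges of weight 2: `minkowski_length([5, 1], 2)` exceeds `minkowski_length([2] * 6, 2)`, whereas with r = 1 the order is reversed.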
PFNets in their original form would not be useable on
a large vocabulary, since the number of user responses
required for a vocabulary of size N is 2N(N–1), requiring
19,800 user inputs for a vocabulary of only 100 terms,
and the resulting weights would be entirely customized
for the inputting user. The shortest path concept of
PFNets has theoretical merit, however, as it embodies a
cognitive mechanism for term associations. This
approach was modified for large-scale application by
using the thesaurus relations to assign default weights
to edges between terms. For distance as the length of
the shortest path to be defined on all pairs of terms,
there must be a path through the thesaurus between all
pairs of terms; mathematically speaking, the graph of
the thesaurus must be connected. Since the categories
are all linked by paths through the category tree,
and every term has at least one associated category, the
mdp thesaurus is connected. In addition, the relation
has_no_direct_relation_to is included to provide a
(single edge) path of weight 9 between terms without
any other edge, thus setting an upper limit of 9 on all
distances, which conforms to the 0–9 range of the
PFNet model. The default weight for each relation is
shown in Table 1.
Although relation-based weights make the
PFNet approach scalable, they may also reduce accuracy. For this
reason, an adaptation mechanism was created to allow
user input to adjust the edge weights to match individual semantics. The thesaurus graphical user interface (GUI), as well as the data search GUI, of the
information broker provide mechanisms for a user to
suppress or reinforce individual links or entire paths
in the thesaurus. This adaptation is implemented by
storing a user profile for all users who reinforce or
suppress links (the user profile is created automatically
the first time a user uses the reinforce or suppress
options). A group of users can share a profile by using
the same user_id with the thesaurus and/or information
broker. As a research tool, this approach also makes
possible assessments of the commonality of semantics
within a specified user group and the effects of adaptation for a less than completely homogeneous user
group. The underlying mathematics is detailed in
section 4.
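One way to picture the profile mechanism is as a table of per-profile overrides layered on the default weights. The sketch below is hypothetical throughout: the class and function names are invented, and the update rule (a fixed step of 1, clamped to the 0 to 9 range) is an assumption made for illustration, not the rule defined in section 4.

```python
# Default edge weights per relation, as listed in Table 1 (subset).
DEFAULT_WEIGHTS = {"has_subtype": 3, "has_instance": 1, "is_performed_by": 5}

MIN_WEIGHT, MAX_WEIGHT = 0, 9   # range of the PFNet-style weights


class UserProfile:
    """Per-user (or per-group) overrides of the default edge weights."""

    def __init__(self, user_id):
        self.user_id = user_id
        self.overrides = {}     # (term1, relation, term2) -> adapted weight

    def weight(self, term1, relation, term2):
        """Adapted weight if one exists, otherwise the relation default."""
        key = (term1, relation, term2)
        return self.overrides.get(key, DEFAULT_WEIGHTS[relation])

    def reinforce(self, term1, relation, term2, step=1):
        """Bring two terms closer (assumed rule: subtract a fixed step)."""
        new = max(MIN_WEIGHT, self.weight(term1, relation, term2) - step)
        self.overrides[(term1, relation, term2)] = new

    def suppress(self, term1, relation, term2, step=1):
        """Push two terms apart (assumed rule: add a fixed step)."""
        new = min(MAX_WEIGHT, self.weight(term1, relation, term2) + step)
        self.overrides[(term1, relation, term2)] = new


_profiles = {}


def profile_for(user_id):
    """Create a profile lazily; users sharing a user_id share the profile."""
    if user_id not in _profiles:
        _profiles[user_id] = UserProfile(user_id)
    return _profiles[user_id]


group = profile_for("maintenance_group")
group.reinforce("flight", "is_performed_by", "leer jet")
print(group.weight("flight", "is_performed_by", "leer jet"))                      # 4
print(profile_for("other_user").weight("flight", "is_performed_by", "leer jet"))  # 5
```

Keeping overrides separate from the defaults means one shared thesaurus can serve many profiles, and comparing the override tables of different profiles would be one way to examine how far their semantics diverge.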
The use of relation-based weights also makes
possible the extension of the distance function to
compounds of thesaurus terms, as discussed in section
2.5, by providing weights for the inferred (hierarchical)
relations generated by the abstraction algorithm.
2.3. Domain capture
The relations included are by no means an exhaustive
list of all semantic relations in the English language, but
were chosen as particularly important to distinctions
made by the target user community: engineers, operators
and mechanics working on mdp of electro-mechanical
equipment. Both the terms and the relations included in
the thesaurus were chosen based on two criteria.
First, interviews were conducted with full-time maintenance personnel
to learn the domain-specific semantics in use, both at the Eddystone
powerplant in Philadelphia, where NII project partner EPRI M&D
Center had an on-site development and technology
transfer facility, and at the Patuxent River Naval Air Test
Center, also a project partner. For example,
header is synonymous with pipe in this domain, whereas
header is synonymous with preamble in computer science.
All instances of time are organized into two types:
instant (e.g. date and time stamp) and duration (e.g.
access time and flight time). Given the vast amounts of
time data in databases, this distinction increases precision considerably without sacrificing recall, since a
working engineer will often query for the instant of an
event but not its duration, or vice versa. Recall can be
increased by querying for time in the event that both
instant and duration information are sought. The relation antonym, found in any general thesaurus such as
Roget’s [6], is not in this thesaurus, because maintenance
professionals do not normally search for opposites.
Distinctions between forms of relatedness, such as has_subtype,
has_instance and has_component, are important, however, since relatives by one of these relations
will often be sought to the exclusion of the others; the same
holds for the associative relations performs_process, is_input_to,
has_measurement_unit, etc. The reciprocal pair
has_measurement_unit and is_measurement_unit_of is
interesting because of its asymmetric weights.
Personnel in the mdp domain will often query for measurements of system properties using the name of the
measurement units rather than the name of the property,
e.g. querying for Hertz or rpm when frequency data is
sought or querying for psi when pressure data is sought.
The same people do not typically query for properties
when the units of measurement are sought, however.
Because of the nearly synonymous use of measurement
units and the corresponding properties they measure,
the default weight of the is_measurement_unit_of
relation is 1, but the default weight of the reciprocal
has_measurement_unit is 5. Furthermore, if this asymmetry were not present, the distance between all measured properties would be vastly decreased, owing to
relatively short paths through the units of measurement hierarchy, e.g. frequency has_measurement_unit
frequency units has_supertype units of measurement
has_subtype pressure units is_measurement_unit_of
pressure. Hierarchical relations are typically ‘close’ psychologically, so if both has_measurement_unit and
is_measurement_unit_of were also ‘close’ then precision
would suffer greatly as all physical properties would be
returned when any one of them was sought with only a
nominal semantic distance tolerance. All such domain
knowledge was acquired through conversations with
working professionals in the domain and trial use of the
system in its early development stages.
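The effect of the asymmetric weights on the frequency-to-pressure path can be checked with a short calculation. This is a sketch under assumptions: the Minkowski exponent r = 2 is chosen here only for illustration, the relation weights are the defaults read from Table 1, and the "symmetric" variant is a hypothetical alternative in which has_measurement_unit is also given weight 1.

```python
def minkowski_length(weights, r=2.0):
    """Path length under the Minkowski r-metric (r = 2 assumed here)."""
    return sum(w ** r for w in weights) ** (1.0 / r)


# Path: frequency -> frequency units -> units of measurement
#       -> pressure units -> pressure
# Relations: has_measurement_unit, has_supertype, has_subtype,
#            is_measurement_unit_of
asymmetric = [5, 3, 3, 1]   # Table 1 defaults
symmetric = [1, 3, 3, 1]    # hypothetical: has_measurement_unit lowered to 1

print(round(minkowski_length(asymmetric), 2))  # 6.63
print(round(minkowski_length(symmetric), 2))   # 4.47
```

Under these assumptions a tolerance of 5 would return pressure data for a frequency query in the symmetric case but not with the asymmetric defaults, which is the loss of precision the text warns about.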
Secondly, the terms representing available information items obviously drive the minimal required vocabulary. A thesaurus to support term-based searches can
be populated from scratch by starting with acceptable
terms to represent information items, be they text documents, diagrams, database fields, etc, and then adding
related terms to the thesaurus in case they might be
entered as search terms. This process can be iterated,
adding relatives of relatives, and so on. This is not the
only method and sometimes it will be faster to take a
legacy thesaurus and edit its contents to customize its
vocabulary and relations. For this project, the Defense
Technical Information Center (DTIC) Thesaurus was
used as a baseline to make the system operational at the
start. Exceptions to domain specific usage were then
identified through actual exercise of the system. Likewise, the high-fidelity relations were added incrementally to improve precision.
2.4. Initial population
Thesaurus terms are limited to nouns, though there is
a noun form of every verb, meaning the process or the
action of the verb, such as flight, power generation and
calibration. Adjectives are important in the domain
vocabulary as modifiers of nouns and, when needed,
the entire modified noun is stored explicitly in the
thesaurus, as in main steam header, high pressure and
early warning. Any time a desired term is not found by
a user or by an information provider, the thesaurus
compiler is contacted with a request for the term to
be entered. The ISO standard thesaurus relations are
supported, enabling electronic import of ISO standard
thesauri for rapid start-up population of a new domain,
such as medicine or law. Following import, the compiler can modify the vocabulary and relation data.
Commercial ‘off the shelf’ (COTS) software (currently
MultiThes) is used to maintain the vocabulary and relation data. The use of COTS software notwithstanding,
population still requires extensive manual effort. Terms
are added one at a time or imported from a text file.
Relations are built by relating pairs of terms by the relation, one pair at a time. Reciprocal relations are created
automatically by the software, and consistency checks
are also performed automatically, to avoid loops with
hierarchical relations, for example. To minimize the
manual effort, algorithms were developed to automate
part of the process. An algorithm for nominal compound interpretation was implemented to automatically relate compound terms to their constituent terms
explicitly present in the thesaurus. Relational inheritance rules automatically relate members of related
hierarchies.
2.5. Post-coordination/Nominal compound interpretation
Post-coordination in information retrieval is the logical
processing of compound terms without their explicit
representation in the underlying domain model. It is
closely related to the interpretation of nominal (noun-noun) compounds in natural language processing algorithms [5, 8–14]. The linguistics literature on this topic
is in its infancy, with most research focusing on high-fidelity classifications of nouns, in an effort to infer from
the classes of the component nouns the relation of the
compound term to each of its parts. To date, however,
no taxonomy has been conceived which does not result
in numerous misclassifications when applied in this
manner. Our application does not require the high-fidelity comprehension of natural language processing,
however. The goal of post-coordination in the mdp thesaurus is to relieve the (human) thesaurus compiler of
the necessity to enter explicitly all reasonable combinations of all basic vocabulary terms. For example, terms
such as temperature can be compounded with any
object, process or substance to produce terms such as
engine temperature, boiling temperature and water temperature. Likewise, objects are often compounded to
indicate a component relation, such as brake pad or aircraft engine. The interpretation required for the information broker is limited to: (i) being able to compute the
semantic distance between any two terms and (ii) being
able to abstract any term. This approach to interpreting
compounds is fundamentally akin to algorithms in the
literature, but the weaker functional requirements of
the application allow a simple classification scheme
and inference rule to be used.
The has_semantic_role relation assigns to each term
in the thesaurus one of seven ‘semantic roles’:
● object – aircraft, domain;
● process – flight, damage, analysis;
● property – temperature, time, precision, power;
● substance – steel, coal, water, steam;
● unit – seconds, pounds, inches;
● meta-information – report, summary, graph, diagram, estimate;
● designator – name, number, type, identity.
Object is general enough to include both physical
entities, such as aircraft, and terms such as domain,
which refers to a conceptual, rather than physical,
object. Meta-information refers to information about
information, as in the examples above, which indicate
the form (report, graph, diagram) or the fidelity
(summary, estimate) rather than the subject of the information.
The semantic roles are ranked in importance, with
object and process being of primary importance and the
other five being of secondary importance. These rankings are then used to abstract a compound term by the
following rule.
An arbitrary nominal compound with two constituent
nouns abstracts to the constituent noun with the highest ranking semantic role, or to the first constituent noun
in case the semantic roles have equal ranking. No
attempt is made to infer the exact nature of the abstraction, and so the inferred relation is the generic ISO
standard has_broader_term.
Examples of inferences following this rule are brake
pad has_broader_term brakes, aircraft engine has_
broader_term aircraft, steam temperature has_broader_
term steam and temperature sensor has_broader_term
sensor.
Although this rule is stated for compounds with just
two constituent nouns, if applied to a compound of
greater length by first applying it to constituent pairs
of nouns, the final term to which the compound is
abstracted is independent of the intermediate pairings.
Mathematically speaking the rule is associative. The
associativity of abstraction enables the simplest possible implementation. An arbitrary nominal compound
with any number of constituent nouns abstracts to the
leftmost constituent noun having the highest ranking
semantic_role.
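The abstraction rule can be sketched in a few lines. This is only an illustration under stated assumptions: the role table below is a hypothetical fragment (the role given to performance is assumed, not taken from the thesaurus), and the function names are invented for the example. The sketch returns the constituent noun as written; mapping it to the posted form of the term (for example brake to brakes) is not shown.

```python
# Roles of primary importance; all other roles are secondary.
PRIMARY_ROLES = {"object", "process"}

# Hypothetical fragment of the has_semantic_role relation (assumed values).
SEMANTIC_ROLE = {
    "brake": "object",
    "pad": "object",
    "aircraft": "object",
    "engine": "object",
    "sensor": "object",
    "steam": "substance",
    "temperature": "property",
    "performance": "process",        # assumed for illustration
    "summary": "meta-information",
}


def rank(noun):
    """Return 0 for a primary role (object, process) and 1 for a secondary role."""
    return 0 if SEMANTIC_ROLE[noun] in PRIMARY_ROLES else 1


def abstract_compound(compound):
    """Return the leftmost constituent noun having the highest-ranking role."""
    nouns = compound.split()
    # min() returns the first minimal element, which gives the leftmost tie-break.
    return min(nouns, key=rank)


for compound in ("brake pad", "aircraft engine",
                 "steam temperature", "temperature sensor"):
    print(compound, "has_broader_term", abstract_compound(compound))
```

Because the selection is a single left-to-right scan, the result does not depend on how a longer compound is split into pairs, which is the associativity property noted above.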
The general principle behind these rankings is that,
in the mdp domain, a particular object or process is
typically the main target of investigation and particular
information regarding that object or process is not
desired in the abstract. For example, mdp personnel do
not typically query for temperature data from arbitrary
sources, but rather from particular sources, such as
engine temperature, and such data is most likely to be
stored together with other engine data, not with temperature data from unrelated sources. Whenever this
abstraction rule produces results inconsistent with the
mdp domain, the exceptional compounds are put
explicitly in the thesaurus, so the goal of saving compilation time and reducing the stored thesaurus data is
mostly achieved, yet the domain specific fidelity of the
thesaurus is not sacrificed in the process.
The abstraction rule produces a hierarchical path
from any compound term t1 ... tn to a term ti explicitly in the thesaurus. The semantic distance function
can then be extended to all compounds of arbitrary
finite length by assigning the default weight for
has_broader_term to each stage in the abstraction and
computing the Minkowski path length just as it is done
for terms explicitly in the thesaurus. In the event that
two compound terms share more than one constituent
noun, the abstraction process can result in the two
compounds having a common ancestor which is itself
a compound not explicitly in the thesaurus. In this case,
the distance will be computed along a path consisting
entirely of inferred hierarchical edges. The distance
from aircraft engine performance summary to flight data
would be computed along the path:
aircraft engine performance summary
    has_broader_term (weight 3)
aircraft engine performance
    has_broader_term (weight 3)
aircraft engine
    has_broader_term (weight 3)
aircraft
    performs_process (weight 5)
flight
    has_narrower_term (weight 3)
flight data
total distance: 6.15
If all compounds in the domain vocabulary were
explicitly in the thesaurus, then the shortest path
would be:
aircraft engine performance summary
    is_output_by (weight 5)
aircraft engine performance data analysis
    accepts_input (weight 5)
aircraft engine performance data
    is_instance_of (weight 1)
flight data
total distance: 6.3
The reader is reminded that this path length is dependent on the Minkowski r parameter, which has the
value 3 in these calculations. For the example given,
the lengths of the two paths would remain close for any
value greater than or equal to 2, but both would shrink
toward 5 if r were increased, since 5 is the greatest
weight of any edge in either path (see section 4 for
details). Despite the difference in logical precision of
the two paths, the closeness of the semantic distance
resulting from the automated abstraction to the semantic distance which would result from posting all
compounds explicitly assures that the search algorithm
will behave similarly given that: (i) the first term is
entered in a query and the second term is present in a
schema of the source model and (ii) semantic distance
is the only search control. If particular relations were
also used to control the search, greater precision in
search would be possible, but the automated abstraction algorithm would not fully support this precision,
since only the relations has_broader_term and
has_narrower_term are inferred by the algorithm.
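The two totals above can be checked numerically. The following is a minimal sketch, assuming the path length is the r-th root of the sum of the r-th powers of the edge weights (the Minkowski path length of section 4); the function and variable names are illustrative only.

```python
def minkowski_length(weights, r=3):
    """Path length: (sum of w**r over the edges) ** (1/r)."""
    return sum(w ** r for w in weights) ** (1.0 / r)


# Edge weights read off the two example paths.
inferred_path = [3, 3, 3, 5, 3]   # path built from inferred hierarchical edges
explicit_path = [5, 5, 1]         # path if every compound were posted explicitly

# Print both lengths for several values of the r parameter.
for r in (1, 2, 3, 10, 50):
    print(r,
          round(minkowski_length(inferred_path, r), 2),
          round(minkowski_length(explicit_path, r), 2))
```

With r = 3 the two lengths are the cube roots of 233 and 251, which agree with the totals quoted above to the precision shown. With r = 1 they are the plain sums 17 and 11, and for large r both approach 5, the greatest edge weight on either path.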
2.6. Relational inheritance
Relational inheritance refers to the inclusion of a term
pair in a relation, based on the inclusion in that relation of a pair of terms hierarchically related to the term
pair in question. There are four inherited relations in
the mdp thesaurus.
Has_semantic_role is inherited, since any subtype,
instance or component of an object is an object; any
subtype, instance or component of a process is a process,
etc. The savings in manual input for automatically
including the pairs (term, role) in the has_semantic_
role relation are considerable: one manual entry is
saved for each term not at the top of a hierarchy. This
translates into weeks of labor, with a vocabulary of a
few thousand words organized in hierarchies averaging
four to five levels deep.
Also inherited is the relation has_measurement_unit
and its reciprocal is_measurement_unit_of. For example, the term time, its subtypes and their instances
(date, access time, etc) are input manually, as are the
term time units and its instances (seconds, hours, days,
years, nano-seconds, etc). After this, manual input of
the pair (time, time units) in the relation has_measurement_unit results in the automatic inclusion in that
relation of any subtype or instance of time with any
instance of time units. Thus, date has_measurement_
unit years, date has_measurement_unit days, access
time has_measurement_unit seconds, access time has_
measurement_unit nano-seconds and all other pairs of
this form are generated automatically. The savings in
this case are NM for each hierarchy of measured
properties (time, pressure, temperature, frequency, etc),
where N is the number of measured properties in the
hierarchy and M is the number of applicable measurement units (which are entered as instances of a generic
term such as time units or pressure units).
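The NM saving can be illustrated with a small sketch. It assumes a one-level hierarchy stored as a dictionary; the data fragment and the function name are hypothetical and only mirror the time example above.

```python
from itertools import product

# Hypothetical fragment: subtypes or instances listed under each parent term.
SUBTYPES_OR_INSTANCES = {
    "time": ["date", "access time"],
    "time units": ["seconds", "hours", "days", "years", "nano-seconds"],
}


def inherit_measurement_units(prop, unit_term, hierarchy):
    """Expand one manually entered (property, unit term) pair into inherited pairs."""
    properties = hierarchy.get(prop, [])    # the N measured properties
    units = hierarchy.get(unit_term, [])    # the M applicable units
    return set(product(properties, units))  # N * M has_measurement_unit pairs


pairs = inherit_measurement_units("time", "time units", SUBTYPES_OR_INSTANCES)
print(len(pairs))
```

Here one manual entry yields N * M = 2 * 5 = 10 inherited pairs, including (date, years) and (access time, nano-seconds). A deeper hierarchy would need a recursive walk over subtypes, which is omitted here.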
A third relation in the mdp thesaurus which is inherited is has_subject_category. The inheritance in this case
works up the hierarchy for the subtype and instance
relations: parents inherit the categories of their children.
Subject categories reflect application in the mdp thesaurus. Since a helicopter is a subtype of an aircraft, any
application of a helicopter is, in fact, an application of an
aircraft. Since an E2-C is an instance of an aircraft, any
applications of an E2-C are applications of an aircraft.
The component relation confers category in the usual
(downward) direction. In general, the simpler the item,
the broader the application. Screws has_subject_category mdp, for example, since screws are used virtually
everywhere in mdp. Boilers are used in fossil-fueled
and nuclear electric power generation but not in solar,
wind or water electric power generation. Any time a
term is assigned all subcategories of a given category,
the pairs (term, subcategory) are all removed from the
relation and replaced with the pair (term, category),
which, in fact, implies application in all subcategories.
Therefore, subject categories are inherited up subtype
and instance hierarchical paths, and down component
paths. Thus, categories need only be entered for leaf
terms in subtype and instance hierarchies and only for
top terms in component hierarchies.
The savings in manual input from component inheritance of categories are substantial, as they are for
semantic role inheritance. The savings from subtype
and instance category inheritance are less straightforward to compute in general form. If hierarchies split
into exactly two branches at each level, the savings are
N–1 for each instance or subtype hierarchy where N is
the number of leaf terms. If hierarchies split into three
branches at each level, the savings are (N+1)/2 for each
such hierarchy (if the hierarchies have more than two
levels). Generally, as the breadth (branching) of these
hierarchies increases, the relative savings from subtype
or instance category inheritance decreases. Thesaurus
hierarchies are typically broader than they are deep,
and the savings might reasonably be estimated as about
N/5 per hierarchy, i.e. one term need not be manually
assigned a category for every five terms whose category
is assigned in an instance or subtype hierarchy.
The savings in manual input from the relational
inheritance rules and the post-coordination algorithm
together add up to major reduction in the work required
to populate the thesaurus. Moreover, the sophisticated
semantic assessments required for this input mandate
the assignment of highly qualified personnel to a task
which necessarily involves tedious labor in documenting those assessments. This is reason enough to
focus additional research on automated population
methodologies and, moreover, such research will
contribute to the development of automated language
processing on the whole.
3. Thesaurus electronic implementation
The mdp thesaurus is implemented with a client-server
architecture, the server coded in C++ and the client a
mixture of C++ and Java to facilitate access by Web-based Java GUIs. A call level interface is provided
for access by computers and a GUI is provided for
human access. The client-server design enables multiple users to query the thesaurus asynchronously
without the overhead of multiple programs running. It
also allows access on a low end computer without the
central processing unit or memory resources to run a
large thesaurus. There are potential applications which
favor distributing the knowledge in the thesaurus,
which could be done by assigning to each thesaurus
program another one to query if it does not find a
given term.
3.1. Call level interface
The call level interface provides the function calls from
either a C++ or Java program. The thesaurus is implemented as a generic object in C++, which has methods
to return information and accept input. Table 2 lists the
functions, their arguments and their returns.
3.2. Graphical user interface
The GUI for the thesaurus has three main screens,
shown in Figs. 3–5. The first provides alphabetical
look-up capability. Clicking on any term will bring up
the second screen, which displays all relatives organized by relation and allows the user to query for
semantic distance between the current term and any
other term. The third screen shows the semantic
distance requested, displays the path along which the
minimal path length was realized and has reinforce and
suppress buttons for adaptation. Reinforcing or suppressing a path adapts the weight of every individual
edge in the path.
Table 2
Mdp thesaurus call level interface
Function                              Argument                              Return          Comments (where needed)
get_ancestors                         of: term                              list<term>      predecessors of term by any path whose edges are the same hierarchical relation
get_common_ancestors                  of: list<term>                        list<term>      ancestors common to every term in list
get_immediate_relatives               of: term, by: relation                list<term>      terms t such that term relation t
relation_between                      from: term1, to: term2                relation        relation r such that term1 r term2
relation_exists                       from: term1, to: term2, by: relation  boolean         true if term1 relation term2; else false
get_neighborhood                      around: term, radius: distance        list<term>      all terms t such that distance(term, t) < distance
semantic_distance                     from: term, to: term                  distance
preferred_synonym                     of: term                              term
is_post-coordinated                   term                                  boolean         false if the term is explicitly posted; else true
recognizes                            term                                  boolean         false if neither term nor its component terms are posted; else true
get_shortest_path                     from: term, to: term                  list<term>
get_alphabetical_list                 starting: term1, ending: term2        list<term>
reinforce_path                        from: term, to: term                  void            applies adaptation weights to each edge of the shortest path from term1 to term2
suppress_path                         from: term, to: term                  void            applies adaptation weights to each edge of the shortest path from term1 to term2
hierarchical_relation_exists_between  from: term1, to: term2                boolean         true if term1 and term2 are hierarchically related in either direction
get_sub_categories                    of: category, to: number_levels_down  list<category>
get_super_category                    of: category, to: number_levels_up    list<category>
get_categories                        of: term                              list<term>      get_immediate_relatives(term, has_subject_category)
4. Mathematical structure of thesauri
The use of a thesaurus as a domain model has been
adopted by research efforts in information retrieval
varying from abstract modeling to commercial system
implementation [2, 3, 15, 16], though none of this
literature addresses the mathematical structure of a
thesaurus. This section will define the formal structure
of the thesaurus. Refer to Appendix A for formal definitions
of graphs and relations, which are part of standard mathematics literature.
4.1. Graph model of a thesaurus
Using the basic definitions of relations and graphs
contained in Appendix A, the contents and form of a
thesaurus can be modeled as follows:
Fig. 3 shows the main screen of the thesaurus GUI. Terms are listed alphabetically and clicking any letter at the
bottom jumps to that part of the list. Double-clicking any term jumps to the term information screen (Fig. 4). The
scroll bar at the right scrolls through the listing term by term.
Fig. 3. Thesaurus browser GUI (main screen).
Definition 4.1: A term is a particular semantic representation of an entity or concept.
Definition 4.2: A vocabulary is a set of terms.
Definition 4.3: A thesaurus is a pair $T = (V, R)$ of a
vocabulary $V$ and a set of relations $R = \{r_1, \ldots, r_k\}$ on $V$.
Observation 4.4: A thesaurus $T = (V, R)$ is in one-to-one
correspondence with the graph $G = \sum_{i=1}^{k} G_i$, where $G_i = (V, r_i)$
is the graph of $r_i$ for $i = 1, \ldots, k$.
Definition 4.5: A thesaurus $T = (V, R)$ is connected if its
graph is connected.
Definition 4.6: Given any thesaurus $T$, the trivial
completion of $T$ is the thesaurus obtained by the addition of the relation $r = V \times V - \bigcup_{i} r_i$.
The semantic interpretation of $r$ is the relation ‘not
directly related’.
Definition 4.7: Let $T = (V, R)$ be a thesaurus and $\mathbb{R}^+ = \{x : x \text{ real and } x \geq 0\}$.
A semantic distance is a function $d: V \times V \to \mathbb{R}^+$
which models the conceptual dissimilarity of
pairs of terms in $V$.
Observation 4.8: The graph of a thesaurus with a
semantic distance is a weighted complete graph.
4.2. Semantic distance and metrics
Fig. 4 shows the screen produced in response to double-clicking a term in the alphabetical listing, or likewise double-clicking on a term on this screen, enabling a user to follow links through the thesaurus. The GUI queries the
thesaurus and displays a box for each relation connecting the focus term to other terms. The boxes at the bottom allow
the user to query for the shortest path from one term to another, the starting term defaulting to the focus term of the
current screen.
Fig. 4. Thesaurus GUI term information screen.
The following definition of path length is taken from
the work of Schwaneveldt [4], who did not create the
formula, but rather conceived its application to
semantic distance:
Definition 4.9: Given a weighted graph $T = (V, R)$ and a
path $p = \{v_1, \ldots, v_n\}$, define the path length $l(p) \in \mathbb{R}^+$ by
$l(p) = \left( \sum_{i=1}^{n-1} w(v_i, v_{i+1})^r \right)^{1/r}$, where $r$ is a positive integer
parameter.
The formula of 4.9 is called the Minkowski Functional,
which is typically applied in vector spaces, wherein
the term $w(v_i, v_{i+1})$ (the weight on the edge $(v_i, v_{i+1})$)
is replaced by $|x_i|$ (the $i$th coordinate relative to
some basis) to define a norm from which a metric can
be defined as $d(x, y) = \|x - y\|$. The reason the traditional application of the Minkowski Functional
induces a metric and our application does not is that
$|x_i - y_i| = |y_i - x_i|$ in the field of coefficients of a vector
space, but in general $w(v_i, v_j) \neq w(v_j, v_i)$ for a weighted
directed graph, in particular for the graph of a thesaurus
in which the semantic distance between terms can be
asymmetric. The Minkowski r parameter is currently
set to 3, but is an input to the thesaurus program. The
effect of r is as follows: if r = 1, path length is just
the simple sum of the edge weights along the path,
and as r increases, although the absolute path length
decreases, the relative effect is to give greater importance to edges with greater weight, the limit as r tends
toward infinity being the weight of the longest edge:
$\lim_{r \to \infty} l(p) = \max_{1 \leq i \leq n-1} w(v_i, v_{i+1})$
Fig. 5 shows the screen produced in response to a
submission of a semantic distance query in the term
information screen (Fig. 4). The starting term is at the
top, and the path along which the shortest
distance is realized is displayed. The weight of each edge in the path is
shown as well as the total distance, computed by the
Minkowski Functional (2.21, with r = 3).
Fig. 5. Thesaurus GUI shortest path screen.
Observation 4.10: Given a weighted connected graph
T = (V,R), the function:
d(v1, v2) = min{l(p)|p is a path from v1 to v2}
is defined on VxV, since T is connected. If the weights
on the edges of T are non-negative, then d is nonnegative for any value of r.
Definition 4.11: The function d of 4.10 is used to
compute semantic distance in the mdp thesaurus.
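One way to evaluate the function d of 4.10 is sketched below. It is not the thesaurus server's code: it assumes the graph is held as a dictionary of directed, non-negative edge weights, and it relies on the observation that minimizing the Minkowski path length is equivalent to minimizing the sum of the r-th powers of the weights, so that Dijkstra's algorithm applies to the transformed weights. All names and the example graph are illustrative.

```python
import heapq


def semantic_distance(graph, source, target, r=3):
    """Minimal Minkowski path length from source to target, or None if unreachable.

    graph maps each term to {neighbor: weight}; weights are non-negative and
    may differ by direction.
    """
    # Dijkstra on w**r; the r-th root is taken once at the end.
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, term = heapq.heappop(heap)
        if term == target:
            return cost ** (1.0 / r)
        if cost > best.get(term, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(term, {}).items():
            new_cost = cost + weight ** r
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return None


# Illustrative directed graph; the direct edge of weight 6 is an assumption.
example_graph = {
    "aircraft engine": {"aircraft": 3, "flight data": 6},
    "aircraft": {"flight": 5},
    "flight": {"flight data": 3},
}

print(semantic_distance(example_graph, "aircraft engine", "flight data"))
print(semantic_distance(example_graph, "flight data", "aircraft engine"))
```

In this example the three-edge path (weights 3, 5, 3) has length equal to the cube root of 179, which is shorter than the direct edge of weight 6, so it is the one selected. The reverse query returns None because the example graph has no edges in that direction, which illustrates the asymmetry of the distance.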
Table 3 compares the definition of semantic distance
and the specific semantic distance in the mdp thesaurus
with the definition of a metric.
Table 3
Mathematical comparison of semantic distance with a metric
Property of the induced     Metric                                Semantic distance           Semantic distance
relations                   d: VxV->R+                            d: VxV->R+                  in mdp thesaurus
{∪rd|d∈D,R+⊇D}
reflexivity                 d(v1, v2) = 0 ⇔ v1 = v2               v1 = v2 ⇒ d(v1, v2) = 0     distance between synonyms is also zero
symmetry                    d(v1, v2) = d(v2, v1)                 optional; ultimately an     no, likelihood of term substitution is
                                                                  empirical question          not universally symmetric
triangle inequality         d(v1, v3) ≤ d(v1, v2) + d(v2, v3)     optional; ultimately an     yes, shortest path satisfies triangle
(effects a weakened form                                          empirical question          inequality
of transitivity)
4.3. Adaptation of semantic distance
As described in section 2, the semantic distance function in the mdp thesaurus adapts to usage by individuals or groups of users sharing a user_id. The weight
stored for an edge is given as a function of the number
of adaptation inputs:
Definition 4.12: A damping sequence for an adaptive
process is a sequence of real numbers $\{\varepsilon_t\}$ satisfying:
$\lim_{t \to \infty} \varepsilon_t = 0$ and $\sum_{t=1}^{\infty} \varepsilon_t = \infty$
Definition 4.13: The adaptive weight on a link $(v, v')$ in
the mdp thesaurus is given by:
$w_0(v, v') =$ relation-based default
$w_t(v, v') = (1 - \varepsilon_t)\, w_{t-1}(v, v') + \varepsilon_t\, w_{\mathrm{target}}$
where $\{\varepsilon_t\}$ is a damping sequence for an adaptive
learning process, $t$ is the number of times the particular
link $(v, v')$ has been adapted, and $w_{\mathrm{target}} = w_{\min}$ for reinforcement and $w_{\mathrm{target}} = w_{\max}$ for suppression.
The current values of the parameters in 4.13 are
$w_{\min} = 0$, $w_{\max} = 9$ and $\varepsilon_t = 1/(t + 10)$. They are parameterized in the model and in the computer code to facilitate sensitivity studies.
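The update of 4.13 with the current parameter values can be sketched as follows. The function names are illustrative and are not taken from the thesaurus code; the starting weight of 3 is just an example default.

```python
W_MIN, W_MAX = 0.0, 9.0   # target weights for reinforcement and suppression


def damping(t):
    """Damping term 1/(t + 10) for the t-th adaptation of a link (t >= 1)."""
    return 1.0 / (t + 10)


def adapt(weight, t, reinforce):
    """Apply the t-th adaptation step of 4.13 to a link weight."""
    target = W_MIN if reinforce else W_MAX
    eps = damping(t)
    return (1.0 - eps) * weight + eps * target


def adapt_repeatedly(weight, steps, reinforce):
    """Apply `steps` consecutive adaptations in the same direction."""
    for t in range(1, steps + 1):
        weight = adapt(weight, t, reinforce)
    return weight


print(adapt_repeatedly(3.0, 10, reinforce=True))
print(adapt_repeatedly(3.0, 10, reinforce=False))
```

For this particular damping sequence the product of the factors (1 - 1/(s + 10)) telescopes, so t consecutive adaptations toward the same target move the weight to w_target + (w_0 - w_target) * 10/(t + 10). Starting from 3, ten reinforcements therefore give 1.5 and ten suppressions give 6, up to floating-point rounding.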
4.4. Precision and recall
Definition 4.14: Let m be a measure on information sets,
let Z(Q) be the set of all available information which is
desired in response to the query Q, and let F(Q) be the
information returned by the retrieval system in response to the query Q. Then the precision and recall
relative to Q are defined by:
Precision(Q) = 100*m(F(Q)∩Z(Q)) / m(F(Q))
Recall(Q) = 100*m(F(Q)∩Z(Q)) / m(Z(Q))
An overall measure of precision and recall can be
defined from these query dependent definitions by
taking an average of query dependent precision and
recall, possibly a weighted average, with weights reflecting the observed frequency or the predicted likelihood
of different queries.
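As an illustration of 4.14, the sketch below takes the measure m to be the counting measure on finite sets of item identifiers; that choice, the handling of empty sets and the function names are assumptions made for the example.

```python
def precision_recall(returned, desired):
    """Percent precision and recall for one query, using the counting measure."""
    returned, desired = set(returned), set(desired)
    hits = len(returned & desired)
    # Empty denominators are reported as 0.0 here; this is a convention of the
    # sketch, since the definition leaves those cases undefined.
    precision = 100.0 * hits / len(returned) if returned else 0.0
    recall = 100.0 * hits / len(desired) if desired else 0.0
    return precision, recall


def weighted_average(values, weights):
    """Average of per-query values, weighted by query frequency or likelihood."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


p, r = precision_recall(returned={"a", "b", "c", "d"}, desired={"b", "c", "e"})
print(p, r)
```

In the example, two of the four returned items are desired and two of the three desired items are returned, giving a precision of 50 and a recall of about 66.7.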
4.5. Relational inheritance
Observation 4.15: If r is a hierarchical relation on V,
then r induces a partial order on V by either t < t’ if t r
t’ or t > t’ if t’ r t.
Definition 4.16: A relation r is inherited if ∃ a hierarchical relation rh ∋ t > t’ and t r t’’ ⇒ t’ r t’’ where > is
the partial order induced by rh.
The inherited relations in the mdp thesaurus are has_measurement_unit,
is_measurement_unit_of, has_semantic_role and has_subject_category, which are
inherited by the following rules.
4.17: Relation inheritance rules:
(a) t > t’ and t has_semantic_role t’’ ⇒ t’ has_semantic_role t’’
(b) t > t’ and t has_measurement_unit t’’ ⇒ t’ has_measurement_unit t’’
(c) t > t’ and t is_measurement_unit_of t’’ ⇒ t’ is_measurement_unit_of t’’
(d) t > t’ and t’ has_subject_category t’’ ⇒ t has_subject_category t’’
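Rules (a) and (d) differ only in direction, which the following sketch tries to make concrete. It assumes a hierarchy stored as a parent-to-children dictionary and a relation stored as a term-to-set dictionary; the example terms come from section 2.6, while the category names and all function names are hypothetical.

```python
def descendants(term, children):
    """All terms below `term` in a hierarchy given as {parent: [children]}."""
    found = []
    stack = list(children.get(term, []))
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return found


def inherit_down(relation, children):
    """Rule (a): each descendant inherits the related terms of its ancestors."""
    result = {term: set(values) for term, values in relation.items()}
    for term, values in relation.items():
        for node in descendants(term, children):
            result.setdefault(node, set()).update(values)
    return result


def inherit_up(relation, children):
    """Rule (d): each ancestor inherits the related terms of its descendants."""
    result = {term: set(values) for term, values in relation.items()}
    for parent in children:
        for node in descendants(parent, children):
            result.setdefault(parent, set()).update(relation.get(node, ()))
    return result


hierarchy = {"aircraft": ["helicopter", "E2-C"]}
roles = {"aircraft": {"object"}}
categories = {"helicopter": {"search and rescue"},   # hypothetical categories
              "E2-C": {"early warning"}}

print(inherit_down(roles, hierarchy))
print(inherit_up(categories, hierarchy))
```

With this data, helicopter and E2-C both receive the role object from aircraft, while aircraft receives both subject categories from its children. Rules (b) and (c) follow the same downward pattern as rule (a).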
5. Summary/Issues for further research
The thesaurus just described has been successfully
demonstrated in support of the information broker as
part of the overall NII Testbed for mdp [1], though
coding the adaptation is still under way, so no experimental results on its performance are yet available.
Intended as a foundation for the development of more
sophisticated semantic processing, much attention was
given to formalizing the underlying mathematics.
Intended also as a prototype to demonstrate the added
value of such information technology, the implementation was kept simple and thus the thesaurus serves as
a useful example of a ‘minimally sophisticated’
linguistic support tool.
5.1. Ontolingua
There are far more complicated systems under development; notably, the work on ontologies at the Knowledge
Systems Laboratory (KSL), Stanford University. The
Ontolingua programming language [17] was developed
as a meta-language striving to provide the fundamental
building blocks necessary to define an ontology (a
domain specific perspective for interpreting information). One can use the basic types in Ontolingua to
define classes of information and relations between
them, and the programming language is sophisticated
enough to support inferencing and consistency checks
on information. Example applications (developed at
KSL) include a Euclidean geometry ontology in which
geometric structures are defined and questions can be
asked as to the relations between them. It is impressive
how much can be captured by a software program with
the Ontolingua language, but this approach requires that
‘knowledge’ be coded into a program with this language.
The relative merit of the approach presented in this
paper is that search and retrieval of information from
legacy sources can be vastly improved and simplified
with very low start-up costs. The information broker is
implemented as a collection of small autonomous programs which communicate with each other and with
Web-accessible GUIs by passing messages over the
Internet. The code is written in C++ and Java using standard libraries and has been run on Solaris, DEC Unix,
Linux, Silicon Graphics Unix and Windows 95.
The philosophy underlying this work was to maximize intelligently the benefits of readily available
legacy resources, both in terms of thesaurus information and of existing data sources. The increased fidelity
and intelligent processing of the thesaurus points the
way to optimizing a fundamentally simple structure for
knowledge and domain representation which appears
to be entirely adequate for many applications. Since
the methodology was designed to incorporate legacy
resources, the only requirements placed on information
sources are that the information items they provide can
be named with thesaurus terms and that access to these
items can be automated. The first is satisfied by virtually any information source providing a finite set of
fixed items and, for analysis programs which provide
processing outputs in response to queries entered in a
formal grammar, it is the analysis program which will
be found by the information broker. The second is satisfied by any electronic information source. Thus, if new
information sources are constructed using advanced
technologies such as Ontolingua programs, these will
be accessible via our design, as are legacy systems.
Sources can include databases, as in Hurson’s original
conception of the summary schema paradigm, but can
likewise include executable programs which produce
information as output. Thus, an Ontolingua program or
a CORBA object could be mapped into a leaf node in
the summary schema tree by representing all possible
program outputs as thesaurus terms or by representing
the entire program as one or more thesaurus terms and
transferring control to the program’s own interface
when its output is requested by a user.
5.2. Co-occurrence lists and latent semantic indexing
There are other approaches to semantic distance in the
literature as well, and methods for suggesting alternate
search terms with explicit reference to numerical
measures. Generation of co-occurrence lists is perhaps
the leading approach of this type. Matrices are compiled, wherein both the rows and columns are indexed
by the terms in a particular vocabulary and the matrix
entries count the number of times each term pair occurs
together. The vocabulary is typically the entire vocabulary appearing in some set of documents, excluding
conjunctions, prepositions, pronouns and articles, but
including verbs, adjectives, adverbs, nouns and proper
names. That terms ‘occur together’ can mean consecutively, in the same sentence, in the same paragraph, in
the same document, in the same title, listed as coauthors or listed in the same reference list. The Digital
Library Initiative at the University of Illinois at Urbana-Champaign
[15] is using co-occurrence lists in combination with
subject-category thesauri as methods for query refinement. When a user enters a search term, the resolution
(if any) is accompanied by two lists of term suggestions
for improving the recall of the search while minimizing
any loss in precision. The list of term suggestions generated by the subject thesaurus reflects the (human)
thesaurus compiler’s intelligent organization of the
documents. The list of term suggestions in the co-occurrence list is ranked in order of the frequency of
co-occurrence of vocabulary terms with the search
terms. In this way, the user is given a wealth of information to refine his or her query, albeit with the burden
of entering multiple queries. The co-occurrence lists are
purely statistical and give no information about the
precise relationship of the user’s term to the suggested
term, but, by the same token, they have the advantage
of associating terms which may be good clues for
finding relevant information but are not directly related
semantically to the user’s term and thus would not be
considered by a thesaurus. The philosophy of the
Digital Library Initiative is to offer a variety of linguistic
support tools from which the user may choose, rather
than focusing on one method for the sake of automation.
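The co-occurrence counting and frequency-ranked term suggestion described above can be sketched as follows. This is a minimal illustration, not the Digital Library Initiative's implementation: the stop list, the function names `cooccurrence_counts` and `suggest`, and the sample titles are all assumptions, and "occur together" is taken here to mean "appear in the same title".

```python
from collections import Counter
from itertools import combinations

# Hypothetical stop list standing in for the excluded conjunctions,
# prepositions, pronouns and articles.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "for", "to", "it"}


def cooccurrence_counts(units):
    """Count how often each unordered term pair shares a unit.

    `units` is an iterable of strings; each string is one co-occurrence
    window (here a title, but it could be a sentence or a document).
    """
    counts = Counter()
    for unit in units:
        terms = {w for w in unit.lower().split() if w not in STOP_WORDS}
        # Sorting makes every pair appear as (smaller, larger) exactly once.
        for a, b in combinations(sorted(terms), 2):
            counts[(a, b)] += 1
    return counts


def suggest(term, counts, limit=5):
    """Rank the terms co-occurring with `term` by descending frequency."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


# Illustrative titles only; they are not taken from any real collection.
titles = [
    "pump bearing vibration analysis",
    "bearing wear in the pump",
    "vibration monitoring of bearing faults",
]
counts = cooccurrence_counts(titles)
print(suggest("bearing", counts, limit=2))  # [('pump', 2), ('vibration', 2)]
```

As the text notes, such a ranking says nothing about how "pump" relates to "bearing"; it records only that the two tend to appear together.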
Latent semantic indexing [18] is an application of
multivariate analysis using singular value decomposition (SVD) of a matrix with terms for rows and documents for columns. The SVD represents the original
matrix as a product of two matrices with orthonormal columns and a diagonal matrix of scaling weights (the singular values); the orthonormal columns can
be regarded as axes to plot terms in k-dimensional
Euclidean space. The original output of the SVD may
contain as many as m = min(number of terms, number
of documents) axes, but if the scaling weights on many
of these are very small, setting them to zero will result
in the product of the modified factors approximating the
original matrix closely. Thus, the corresponding axes
can be disregarded with little effect and perhaps they
should be, since the original matrix of (sample) term
associations may imperfectly reflect true semantic associations and axes with small scaling weights may be
modeling the sample noise. Terms, queries (groups of
terms) and documents are all mapped into the ‘concept
space’ spanned by the chosen axes, with queries and
documents being mapped to the centroid of their constituent terms. In this model, similarity could be measured by Euclidean distance or by the cosine between
the vectors representing concepts. Bellcore, the lead
developer of this approach, is currently investigating
the best choice of a distance measure.
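The truncation and the two candidate similarity measures can be written out in standard linear-algebra notation. The symbols below are assumptions introduced for illustration and do not appear in the original: $A$ is the $t \times d$ term-document matrix, $m = \min(t, d)$, $\sigma_i$ are the scaling weights, and $x_i$ is the $k$-dimensional coordinate vector of term $i$ (under one common convention, row $i$ of $U_k \Sigma_k$).

```latex
% Full decomposition: U and V have orthonormal columns.
A = U \Sigma V^{\mathsf T}, \qquad
U^{\mathsf T} U = V^{\mathsf T} V = I_m, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_m), \quad
\sigma_1 \ge \cdots \ge \sigma_m \ge 0

% Keep only the k largest scaling weights; the squared error of the
% approximation is the sum of the squares of the discarded weights.
A_k = U_k \Sigma_k V_k^{\mathsf T}, \qquad
\lVert A - A_k \rVert_F^2 = \sum_{i=k+1}^{m} \sigma_i^2

% A query or document q is mapped to the centroid of its constituent terms.
\hat{q} = \frac{1}{|q|} \sum_{i \in q} x_i

% The two candidate similarity measures mentioned in the text.
d(x, y) = \lVert x - y \rVert_2, \qquad
\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert_2 \, \lVert y \rVert_2}
```

The second line makes the remark about small scaling weights precise: if the discarded $\sigma_i$ are all small, $A_k$ differs little from $A$.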
All such empirical methods to infer relatedness from
usage are a valuable complement to theoretical models
such as thesauri, for both theoretical and practical
reasons. To the extent that they suggest (if not fully characterize) heretofore overlooked semantic relations, they
aid the clarification and validation of theoretical models.
To the extent that the creative dynamics of human
language usage defy static organization, they offer automated organization and thus adapt more quickly. Latent
semantic indexing is also being applied to multilingual
search [19]. Limitations include the computing resource
requirements, the intensive computations and the fact
that it is the usage in the source documents which is captured, not the usage of the searchers. The searchers are
expected to adapt to that usage. Application to databases
could introduce performance issues as well, owing to the
relative sparsity of text, much of which would have to be
generated by humans as an intermediate step to handle
cryptic legacy schemas.
5.3. Future development
The thesaurus documented in this paper and the
entire information broker are prototypes rather than
commercial products. They are ready to be employed by
enterprises eager to improve access to enterprise-wide
information resources, though incremental additional
development would be expected to accompany such
application. The system is being marketed to enterprises
interested in technology transition, and also serves as a
testbed facilitating experimentation with new theories.
For example, the co-occurrence statistical methodologies
discussed above could be used as a basis for semantic distance and tested with the same system, owing to the modular, object-oriented implementation of the thesaurus
server. Open research issues remain to be addressed, and
the final three subsections touch on the major issues
encountered in connection with the NII work.
5.4. Quantification of performance measures
Maximization of both precision and recall while minimizing requirements for user knowledge and expertise
are the natural performance goals of information
retrieval systems. To assess how good is good, quantitative measures are required. Precision and recall are
well defined given a measure on information sets, but
there exist no such measures which have been empirically demonstrated to accurately reflect human evaluations of information sets. Counting measure is the most
elementary measure for information sets, whereby
the measure of each individual information item is 1.
Thus, if 10 items are found and 5 of them are relevant,
and 15 relevant items are available, precision is 50%
and recall is 33%. This measure fails to capture distinctions between the relative value of each information
item, e.g. 1 item might contain almost all of the relevant information and the other 9 contain relatively
little. Most commercial Internet search engines use
a weighted counting measure for text documents, not
measuring each document as 1, but rather as a weighted
sum of the number of occurrences of the search terms
in the document. The weights give more importance to
an occurrence of a search term in the title than in the
body of the text, give more weight to occurrences close
to the beginning of the document and give more weight
to occurrences of multiple search terms which are adjacent or at least in the same sentence or paragraph. Until
progress is made in the computerized interpretation of
natural language and context dependent interpretation,
empirically valid measures on information sets will be
out of reach for many types of information.
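The dependence of precision and recall on the choice of measure can be made concrete with a short sketch. It reproduces the counting-measure example from this section (10 items found, 5 of them relevant, 15 relevant items available) and then repeats it under a weighted measure. The function names and the weights are hypothetical, chosen only to show one item carrying most of the value; they are not the weighting scheme of any particular search engine.

```python
def measure(items, weight=lambda item: 1.0):
    """Total measure of an information set; the default is counting measure."""
    return sum(weight(item) for item in items)


def precision_recall(found, relevant, weight=lambda item: 1.0):
    """Precision and recall of `found` against `relevant` under `weight`."""
    hits = found & relevant
    precision = measure(hits, weight) / measure(found, weight)
    recall = measure(hits, weight) / measure(relevant, weight)
    return precision, recall


# Counting measure: 10 items found, 5 of them relevant, 15 relevant overall.
found = set(range(10))        # items 0..9
relevant = set(range(5, 20))  # items 5..19, so items 5..9 are the hits
p, r = precision_recall(found, relevant)
print(f"precision={p:.0%} recall={r:.0%}")  # precision=50% recall=33%

# Hypothetical weighted measure: item 5 is assumed to be worth 11 ordinary items.
value = {item: 1.0 for item in found | relevant}
value[5] = 11.0
p_w, r_w = precision_recall(found, relevant, weight=value.get)
print(f"precision={p_w:.0%} recall={r_w:.0%}")  # precision=75% recall=60%
```

The same retrieved set scores quite differently under the two measures, which is why an empirically validated measure matters before such figures are compared.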
5.5. Supporting multiple domains
The principle of domain specific knowledge representation raises the question as to whether domains are
(i) well defined and (ii) non-overlapping. In other
words, is the relation ‘in the same domain’ an equivalence relation on the set of users? Clearly it is not,
unless each individual user is considered to have his or
her own domain. Within a given domain, there are local
variations and there are commonalities with larger
domains. In general, the more specific the domain,
the larger the vocabulary with unambiguous interpretation. No matter where the lines are drawn between
domains, it is to be expected that users may seek information using, and perhaps requiring, cross domain
terminology. Thus, a mechanism is required to bridge
or correlate domain models.
Just as the thesaurus is a weighted graph with edges
reflecting relations, each relation can be regarded as a
domain specific relation and a graph could be constructed with edges representing relation-domain pairs. The
mathematical machinery is well suited to handle such
a model. Additionally, cross-domain relations could
account for differing interpretations of nouns, e.g.
‘header’, which means ‘pipe’ to a mechanical engineer
and means ‘preamble’ to a computer scientist. This
modeling approach needs to be thoroughly developed
and tested.
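One way to picture a graph whose edges represent relation-domain pairs is sketched below. It is only an illustration of the idea, not the thesaurus server's data structure: the class name `DomainThesaurus`, the relation label `"synonym"` and the unit weights are all assumptions.

```python
from collections import defaultdict


class DomainThesaurus:
    """Weighted multigraph whose edges are labelled (relation, domain)."""

    def __init__(self):
        # term -> list of (related term, relation, domain, weight)
        self._edges = defaultdict(list)

    def relate(self, term, other, relation, domain, weight=1.0):
        """Add a directed edge from `term` to `other` for one relation-domain pair."""
        self._edges[term].append((other, relation, domain, weight))

    def neighbours(self, term, domain=None):
        """Edges leaving `term`, optionally restricted to a single domain."""
        return [
            edge
            for edge in self._edges.get(term, [])
            if domain is None or edge[2] == domain
        ]


thesaurus = DomainThesaurus()
# The 'header' example from the text; relation label and weights are hypothetical.
thesaurus.relate("header", "pipe", "synonym", "mechanical engineering")
thesaurus.relate("header", "preamble", "synonym", "computer science")

print(thesaurus.neighbours("header", domain="computer science"))
# [('preamble', 'synonym', 'computer science', 1.0)]
```

Querying without a domain returns both interpretations, which is where a cross-domain bridging mechanism would have to choose or rank among them.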
5.6. Post-coordination/Nominal compound interpretation
The post-coordination algorithm and associated semantic roles were outside the scope of the NII program
and were undertaken only as decision support for
population of the thesaurus. Such mechanisms are the
topic of considerable research by linguists; the bulk
of the literature consists of competing classification schemes, far more complicated than the semantic
roles discussed in section 2.5, yet still producing
numerous wrong interpretations. It could be that all
such taxonomies, including that of this paper, are
missing the forest for the trees. A system for interpretation of nominal compounds which is ‘almost always’
correct is still a formidable challenge for the research
community.
Appendix A: Relations and graphs
Definition A.1: A relation is a subset $r$ of $T \times T$, where $T$
is any set; $t_1 \, r \, t_2$ means $(t_1, t_2) \in r$.
Definition A.2: Let $r$ be a relation. Then:
$r$ is reflexive if $t \, r \, t \;\; \forall\, t \in T$
$r$ is symmetric if $t_1 \, r \, t_2 \Rightarrow t_2 \, r \, t_1 \;\; \forall\, t_1, t_2 \in T$
$r$ is transitive if $t_1 \, r \, t_2$ and $t_2 \, r \, t_3 \Rightarrow t_1 \, r \, t_3 \;\; \forall\, t_1, t_2, t_3 \in T$
$r$ is anti-reflexive if $\neg(t \, r \, t) \;\; \forall\, t \in T$
$r$ is anti-symmetric if $t_1 \, r \, t_2 \Rightarrow \neg(t_2 \, r \, t_1) \;\; \forall\, t_1, t_2 \in T$
$r$ is an equivalence relation if $r$ is reflexive, symmetric
and transitive.
Definition A.3: If $V$ is any set, a partition $P = \{P_1, \ldots, P_n\}$
of $V$ is a collection of subsets of $V$ such that $\bigcup_i P_i = V$ and
$i \neq j \Rightarrow P_i \cap P_j = \emptyset$.
Observation A.4: If $r$ is an equivalence relation on $V$,
then $r$ induces a partition of $V$ into disjoint sets of elements, the elements of each set being related by $r$.
Conversely, if $P = \{P_1, \ldots, P_n\}$ is any partition of $V$, and
the subset $r$ of $V \times V$ is defined by $(v_1, v_2) \in r \Leftrightarrow v_1 \in P_j$ and
$v_2 \in P_j$ for some $P_j \in P$, then $r$ is an equivalence relation.
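Observation A.4 can be checked mechanically on a small finite set. The sketch below builds the relation defined by a partition, verifies the three properties of Definition A.2, and recovers the partition from the relation. The function names and the example set are assumptions made for illustration.

```python
def relation_from_partition(P):
    """The relation of A.4: (v1, v2) is in r iff both lie in the same block."""
    return {(a, b) for block in P for a in block for b in block}


def is_equivalence(V, r):
    """Check reflexivity, symmetry and transitivity as in Definition A.2."""
    reflexive = all((v, v) in r for v in V)
    symmetric = all((b, a) in r for (a, b) in r)
    transitive = all(
        (a, d) in r for (a, b) in r for (c, d) in r if b == c
    )
    return reflexive and symmetric and transitive


def partition_from_relation(V, r):
    """Group V into classes of related elements; assumes r is an equivalence."""
    classes = []
    for v in sorted(V):
        for cls in classes:
            representative = next(iter(cls))
            if (v, representative) in r:
                cls.add(v)
                break
        else:
            # v is related to no existing class, so it starts a new one.
            classes.append({v})
    return classes


V = {1, 2, 3, 4, 5, 6}
P = [{1, 2}, {3}, {4, 5, 6}]
r = relation_from_partition(P)
print(is_equivalence(V, r))           # True
print(partition_from_relation(V, r))  # [{1, 2}, {3}, {4, 5, 6}]
```

Removing a single pair such as `(2, 1)` from `r` breaks symmetry, and `is_equivalence` then returns `False`.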
Definition A.5: A directed graph $G = (V, E)$ is a set $V$ of
vertices (also called nodes) and a set $E$ of edges, each
edge $e = (v_1, v_2)$ being a pair of nodes, and the edge is
said to go from $v_1$ to $v_2$. If $v_1 = v_2$, $e$ is called a loop. Note
that, in general, a graph may contain multiple edges
between the same two nodes.
Definition A.6: Given a set $V$ and a relation $r \subseteq V \times V$,
the graph of $r$ is the graph $G = (V, r)$ whose nodes are
the elements of $V$ and whose edges are the elements of $r$.
Definition A.7: If $G = (V, E)$ is a graph, $V' \subseteq V$, $E' \subseteq E$
and $E' \subseteq V' \times V'$, then $G' = (V', E')$ is a subgraph of $G$.
Definition A.8: If $G_1, \ldots, G_n$ is a collection of graphs, $G_i = (V_i, E_i)$,
then $G = \bigcup G_i = (\bigcup V_i, \bigcup E_i)$ is a graph as well,
which contains each $G_i$ as a subgraph. Define $\Sigma E_i$ to be
the set obtained by concatenating $E_1, \ldots, E_n$, where
$G_i = (V_i, E_i)$. Then $\Sigma G_i = (\bigcup V_i, \Sigma E_i)$ is a graph as well,
which contains each $G_i$ as a subgraph. If $i \neq j \Rightarrow E_i \cap E_j = \emptyset$,
then $\bigcup G_i = \Sigma G_i$.
Definition A.9: A weighted graph is a graph $G = (V, E)$
and a function $w : E \to \mathbb{R}$ associating with each edge $e$ a
weight $w(e)$. Equivalently, $G = (V, E \cup W)$, where $W$ is
the union over $d$ of the relations $r_d = \{(v_1, v_2) \mid w(v_1, v_2) = d\}$.
Definition A.10: If $G$ is a graph, a path $p = \{e_1, \ldots, e_n\}$
in $G$ is a sequence of edges in $G$. Equivalently, a path is
a sequence of nodes $p = \{v_1, \ldots, v_{n+1}\}$ where $(v_i, v_{i+1}) \in E$
$\forall\, i$. If $v_1 = v_{n+1}$ then $p$ is a cycle. A path of the form $(v, v)$
is a loop.
Definition A.11: A graph $G = (V, E)$ is complete if
$E \supseteq V \times V$.
Definition A.12: A graph $G = (V, E)$ is connected if $\forall\, (v_1, v_2) \in V \times V$,
there exists a path in $G$ from $v_1$ to $v_2$.
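Definitions A.11 and A.12 translate directly into checks on a finite directed graph. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the graph is given as a set of nodes and a set of ordered pairs, and each node is treated as reachable from itself by the empty path, which the definitions above do not state explicitly.

```python
from collections import deque


def reachable(V, E, start):
    """Nodes reachable from `start` along directed edges (breadth-first search)."""
    adjacency = {v: [] for v in V}
    for v1, v2 in E:
        adjacency[v1].append(v2)
    seen = {start}  # assumption: the empty path reaches `start` itself
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for nxt in adjacency[v]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def is_connected(V, E):
    """A.12: a path exists from v1 to v2 for every ordered pair (v1, v2)."""
    return all(reachable(V, E, v) == set(V) for v in V)


def is_complete(V, E):
    """A.11: E contains every pair in V x V, loops included."""
    return all((v1, v2) in E for v1 in V for v2 in V)


V = {1, 2, 3}
cycle = {(1, 2), (2, 3), (3, 1)}  # every node reaches every other
chain = {(1, 2), (2, 3)}          # node 3 reaches nothing else

print(is_connected(V, cycle), is_connected(V, chain))  # True False
print(is_complete(V, cycle))                           # False
```

Because the edges are directed, the chain is not connected even though its undirected counterpart would be.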
Acknowledgements
The authors gratefully acknowledge Eric Grele (computer science), Steven Hess (psychology), Joonki Hong
(operations research), Ali Hurson (computer science),
Alex Margoulis (mathematics), Shashi Phoha (information science, NII principal investigator) and Sekhar
Tangirala (mechanical engineering) for many helpful
conversations and for fostering the interdisciplinary
perspective which became the hallmark of the NII
Testbed program. In addition, J. Hong is credited with
coding the GUIs in Java, and J. Duzak and A. Margoulis
contributed to the C++ thesaurus server
code, which was primarily written by S. McCracken.
References
[1] S. Phoha, Using the National Information Infrastructure
for Monitoring, Diagnosis and Prognosis of Operating
Machinery. [Paper presented at the IEEE 35th Conference
on Decision and Control, Kyoto, Japan, December 1996.]
[2] M.W. Bright and A.R. Hurson, Linguistic support for
semantic identification and interpretation in multidatabases. In: Proceedings of the Workshop on Interoperability in Multidatabase Systems (IEEE Computer Society
Press, 1991), pp. 306–313.
[3] M.W. Bright, A.R. Hurson and S.H. Pakzad, Automated
resolution of semantic heterogeneity in multidatabases,
ACM Transactions on Database Systems 19(2) (1994) 212–253.
[4] R.W. Schvaneveldt, Pathfinder Associative Networks:
Studies in Knowledge Organization (Ablex Publishing
Corporation, Norwood, NJ, 1990).
[5] L.S. Gay, Interpreting Nominal Compounds for
Information Retrieval (COINS Technical Report 88-86)
(Computer and Information Science Department,
University of Massachusetts, 1988).
[6] P.M. Roget and C.O.S. Mawson, Roget’s Thesaurus
(Garden City Publishing Company, Inc, New York, 1940).
[7] International Organization for Standardization, Documentation – Guidelines for the Establishment and
Development of Monolingual Thesauri (ISO 2788) 2nd
ed. (ISO, 1989).
[8] A.W.S. Cater and D.W.F. McLoughlin, Compound Noun
Interpretation using Taxonomic Links: An Experiment
in Progress (Department of Computer Science,
University College, Dublin, Ireland, 1996).
[9] Y. Arens, J.J. Granacki and A.C. Parker, Phrasal analysis
of long noun sequences, Proceedings of the ACL (1987).
[10] R.B. Lees, The Grammar of English Nominalization
(Indiana University, Bloomington, IN, 1960).
[11] J.N. Levi, The Syntax and Semantics of Complex
Nominals (Academic Press, New York, 1979).
[12] P. Downing, On the creation and use of English compound nouns, Language 53 (1977) 810–842.
[13] D.B. McDonald, Understanding Noun Compounds (PhD
Thesis) (Carnegie Mellon University, 1982).
[14] R.J. Byrd, Discovering relationships among word senses:
dictionaries in the electronic age. In: Proceedings of the
5th Annual Conference at the Center for the New Oxford
English Dictionary (University of Waterloo, Canada,
1989).
[15] B.R. Schatz, E.H. Johnson, P.A. Cochrane and H. Chen,
Interactive Term Suggestion for Users of Digital Libraries:
Using Subject Thesauri and Co-occurrence Lists for
Information Retrieval (Digital Library Initiative, Grainger
Engineering Library Information Center, University of
Illinois at Urbana-Champaign, 1996).
[16] W.B. Croft and R.T. Thompson, I3R: a new approach to
the design of document retrieval systems, Journal of the
American Society for Information Science 38(6) (1987)
389–404.
[17] T.R. Gruber, A translation approach to portable ontology
specifications, Knowledge Acquisition 5(2) (1993)
199–220.
[18] S. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas
and R.A. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science 41(6) (1990)
391–407.
[19] S.T. Dumais, T.K. Landauer and M.L. Littman,
Automatic cross-linguistic information retrieval using
latent semantic indexing. In: Proceedings of the ACM
SIGIR 96 Workshop on Cross-Linguistic Information
Retrieval, August 1996.