THE EFFECTS OF QUERY COMPLEXITY, EXPANSION
AND STRUCTURE ON RETRIEVAL PERFORMANCE
IN PROBABILISTIC TEXT RETRIEVAL
Jaana Kekäläinen
Department of Information Studies
University of Tampere
1999
Acta Universitatis Tamperensis 678
ISBN 951-44-4596-1; ISSN 1455-1616
To my Andorinha Sinhá.
J.S.Bach: Wir müssen durch viel Trübsal in das Reich Gottes eingehen.
Symphony. BWV 146.
J.S.Bach: Herz und Mund und Tat und Leben.
Choral. BWV 147.
Acknowledgements
This study was part of The Finnish Multimedia Programme, KaMu-TILA, and was also supported by
a grant from The Emil Aaltonen Foundation, The Aamulehti Centenary Fund under the auspices of
The Finnish Cultural Foundation, The Pirkanmaa Regional Fund, and The Scientific Foundation of
The City of Tampere.
Markku Juusola, Juha Riihioja and Ulla Saikkonen were of great assistance in judging the
relevance of thousands of articles. Pauliina Arresalo, Leena Leppänen, Marja-Kaisa Miilumäki,
Solveig Stormbom and Arja Tuuliniemi kindly helped me to test facet analysis. I am grateful to my
assiduous supervisor, Prof. Kalervo Järvelin, for encouragement and confidence in my work. I
would also like to thank Heikki Keskustalo for his patience in Unix matters, the whole FIRE group
and Ruud van der Pol for discussions and thoughtful comments, and Virginia Mattila for the
language checking.
Abstract
In this study the effects of query complexity, expansion and structure on retrieval performance –
measured as precision and recall – in probabilistic text retrieval were tested. Complexity refers to
the number of search facets or intersecting concepts in a query. Facets were divided into major and
minor facets on the basis of their importance with respect to a corresponding request. Two
complexity levels were tested: high-complexity queries used all search facets identified
from the requests, while low-complexity queries were formulated with major facets only. Query
expansion was based on a thesaurus, from which the expansion keys were elicited for queries.
There were five expansion types: (1) the first query version was an unexpanded, original query
with one search key for each search concept (original search concepts) elicited from the test
thesaurus; (2) the synonyms of the original search keys were added to the original query; (3) search
keys representing the narrower concepts of the original search concepts were added to the original
query; (4) search keys representing the associative concepts of the original search concepts were
added to the original query; (5) all previous expansion keys were cumulatively added to the
original query. Query structure refers to the syntactic structure of a query expression, marked with
query operators and parentheses. The structure of queries was either weak (queries with no
differentiated relations between search keys, except weights) or strong (different relationships
between search keys). More precisely, strong query structures were based on facets or intersecting
concepts. Altogether five weak and eight strong structure types were tested. The test involved 30
test requests, each of which was formulated into 110 queries representing different structure, expansion
and complexity combinations. The test database was a text database of 53,893 newspaper articles.
The test was run in InQuery, a probabilistic text retrieval system.
The test revealed that when the queries were unexpanded, there were no great differences
between different structure types irrespective of the complexity level. When queries were
expanded, the performance of the weakly structured queries dropped, but the performance of the
best strongly structured queries improved. The differences in performance between complexity
levels varied with the expansion type, but overall the differences were minor. The best
performance was achieved with a combination of a facet structure, high complexity, and the largest
expansion. However, not all strong structures performed well with expansion. The operator
combining search keys within a facet was more decisive than the operator combining facets. The
typical interpretation given to the OR operator in partial match retrieval proved to be too
permissive, and thus, performance decreased when queries were expanded. The best result was
achieved by treating all search keys within a facet as instances of one search key, i.e., using the
SYN operator as a ‘facet operator’.
Contents
1 Problems in Search Key Selection and Query Formulation.....................................................1
2 Concepts, Words and Semantic Relations in Linguistics and IR .............................................5
2.1 Concept, Meaning and Symbol.................................................................................................5
2.2 Meanings and Semantic Relations............................................................................................6
Literal meaning ........................................................................................................................6
Lexical meaning .......................................................................................................................7
Synonymy..................................................................................................................................8
Hyponymy and taxonymy .........................................................................................................8
Meronymy.................................................................................................................................9
Opposition ................................................................................................................................10
Syntactic meaning ....................................................................................................................10
2.3 Thesauri and Other Conceptual Models ...................................................................................10
The ISO standard for monolingual thesauri.............................................................................11
Semantic nets............................................................................................................................12
Statistical methods for identification of relationships between words .....................................13
Three abstraction levels model for thesauri .............................................................................14
Ontologies ................................................................................................................................14
3 Retrieval Techniques: Exact and Partial Matching.................................................................16
3.1 Vector Space and Probabilistic Models....................................................................................17
3.2 The InQuery Retrieval System .................................................................................................19
Syntax of the query language of InQuery.................................................................................22
3.3 Expression of Linguistic Relations with Operators ..................................................................23
4 From Requests to Queries ...........................................................................................................25
4.1 Requests ....................................................................................................................................25
4.2 Query Construction ..................................................................................................................25
Conceptual level.......................................................................................................................25
Linguistic level .........................................................................................................................27
String level ...............................................................................................................................28
4.3 Query Structures.......................................................................................................................29
5 Query Expansion..........................................................................................................................32
5.1 QE Based on Search Results ....................................................................................................33
5.2 QE Based on Knowledge Structures.........................................................................................34
QE based on a statistical, collection dependent thesaurus ......................................................35
QE based on semantic relations...............................................................................................36
5.3 Summary of the Earlier QE Studies..........................................................................................39
6 Aims of the Study .........................................................................................................................41
6.1 Setting.......................................................................................................................................41
6.2 Research Problems ...................................................................................................................43
7 Data and Methods........................................................................................................................45
7.1 Retrieval System and Test Database.........................................................................................45
7.2 Requests and Conceptual Query Plans .....................................................................................46
7.3 Test Thesaurus and ExpansionTool..........................................................................................47
Thesaurus representation in ExpansionTool............................................................................47
Thesaurus organisation............................................................................................................53
Query construction and expansion...........................................................................................56
7.4 Test Queries..............................................................................................................................65
Complexity................................................................................................................................65
Coverage ..................................................................................................................................66
Broadness.................................................................................................................................66
Query structures .......................................................................................................................67
Structural features....................................................................................................................71
Combining variables ................................................................................................................76
7.5 Relevance Assessments and Measures of Effectiveness...........................................................76
Relevance .................................................................................................................................76
Measures of effectiveness .........................................................................................................77
7.6 Statistical Testing .....................................................................................................................80
8 Results ...........................................................................................................................................83
8.1 Characteristics of Queries: Complexity, Coverage and Broadness..........................................83
8.2 Precision of Queries .................................................................................................................84
Weak query structures ..............................................................................................................85
Strong query structures ............................................................................................................89
Weak vs. strong structures........................................................................................................91
8.3 Rank-based Comparison of Queries .........................................................................................93
8.4 Statistical Significance of Differences in Effectiveness............................................................94
8.5 Performance of Query Combinations by Requests ...................................................................97
Analysis of a deviant result ......................................................................................................100
8.6 Summary of the Results............................................................................................................101
9 Discussion......................................................................................................................................104
Effects of complexity.................................................................................................................107
Effectiveness of thesaurus-based query expansion...................................................................107
Effects of weighting schemes ....................................................................................................108
Co-effects of query broadness and structure............................................................................109
Reliability and validity of the results........................................................................................110
Further study ............................................................................................................................111
10 Conclusion ....................................................................................................................................113
Appendix 1 .........................................................................................................................................121
Appendix 2 .........................................................................................................................................123
Appendix 3 .........................................................................................................................................125
Appendix 4 .........................................................................................................................................127
Appendix 5 .........................................................................................................................................139
List of Figures
Figure 1. Information storage and retrieval process ................................................................................ 2
Figure 2. Test setting. .............................................................................................................................. 4
Figure 3. Semantic triangles .................................................................................................................... 5
Figure 4. Examples of a taxonomy and a hyponymy. ............................................................. 9
Figure 5. A classification of retrieval techniques .................................................................................... 16
Figure 6. Document retrieval inference network ..................................................................................... 20
Figure 7. Abstraction levels of IR............................................................................................................ 26
Figure 8. QE methods and the sources of expansion keys....................................................................... 32
Figure 9. The TREC request number 122................................................................................................ 37
Figure 10. Test variables.......................................................................................................................... 42
Figure 11. Query structure classification I............................................................................................... 73
Figure 12. Query structure classification II ............................................................................................. 75
Figure 13. P-R curves of the best weak query structure,
complexity level and QE combinations .............................................................................................. 88
Figure 14. P-R curves of the best combinations of the strong structures................................................. 90
Figure 15. P-R curves for the best and worst combinations
of weak / strong query structure, query expansion, complexity level................................................. 92
Figure 16. P-R curves of the best combinations
of different query structure groups ..................................................................................................... 96
Figures 17-29. Histograms comparing average p@50 scores
to median performance by query structures and requests ................................................................... 99
Figure 30. P-R curves for all structures with low complexity and no expansion .................................... 105
Figure 31. P-R curves for all structures with high complexity and no expansion ................................... 105
Figure 32. P-R curves for all structures with low complexity and full expansion................................... 106
Figure 33. P-R curves for all structures with high complexity and full expansion.................................. 106
Appendix 4: Fig. 1-2. P-R curves of the SUM combinations.................................................................. 127
Appendix 4: Fig. 3-4. P-R curves of the SYN1 combinations ................................................................ 128
Appendix 4: Fig. 5-6. P-R curves of the SYN2 combinations ................................................................ 129
Appendix 4: Fig. 7-8. P-R curves of the WSUM1 combinations............................................................ 130
Appendix 4: Fig. 9-10. P-R curves of the WSUM2 combinations.......................................................... 131
Appendix 4: Fig. 11-12. P-R curves of the SSYN1 combinations .......................................................... 132
Appendix 4: Fig. 13-14. P-R curves of the SSYN2 combinations .......................................................... 133
Appendix 4: Fig. 15-16. P-R curves of the ASYN combinations ........................................................... 134
Appendix 4: Fig. 17-18. P-R curves of the WSYN1-2 combinations ..................................................... 135
Appendix 4: Fig. 19-20. P-R curves of the BOOL combinations ........................................................... 136
Appendix 4: Fig. 21-22. P-R curves of the XSUM combinations........................................................... 137
Appendix 4: Fig. 23-24. P-R curves of the OSUM combinations........................................................... 138
List of Tables
Table 1. Semantic relations in WordNet .................................................................................................. 13
Table 2. The relation CONCEPTS........................................................................................................... 47
Table 3. The relation CONS_EXPRS ...................................................................................................... 48
Table 4. The relation EXPRESSIONS..................................................................................................... 49
Table 5. The relation STRINGS .............................................................................................................. 50
Table 6. The relation EXPR_STRINGS .................................................................................................. 52
Table 7. The relation ASSOCIATIONS .................................................................................................. 52
Table 8. The relation PHYS_PARTITIVE and PHYS_GENERIC ......................................................... 53
Table 9. Structural features in test query types ........................................................................................ 72
Table 10. Overview of the combination of controlled variables.............................................................. 76
Table 11. Classification of the search concepts ....................................................................................... 83
Table 12. Average number of concepts per facet and query.................................................................... 84
Table 13. Average broadness................................................................................................................... 84
Table 14. Precision averages over 11 document cut-off values............................................................... 86
Table 15. Precision averages over 10 recall levels .................................................................................. 87
Table 16. Average ranks .......................................................................................................................... 93
Table 17. Significant differences in performance within query structure groups.................................... 94
Appendix 3: Table 1. Classification of the search concepts ................................................................... 125
Appendix 5: Table 1. P@50 precision scores of the baseline
and best combinations by requests...................................................................................................... 139
Appendix 5: Table 2. P@50 precision scores of the baseline
and worst combinations by requests ................................................................................................... 140
1 Problems in Search Key Selection and Query Formulation
Information retrieval (IR) is the activity of information searching and the name of a research area.
“IR is concerned with the processes involved in the representation, storage, searching and finding
of information which is relevant to a requirement for information desired by a human user”
(Ingwersen 1992, 49). IR research develops concepts, methods, and systems through which all
information – in different forms and places – is easily accessible in forms as convenient as
possible for those who need it. The problems of text storage and retrieval have been a major issue
in research. (Järvelin 1995, 25.)
The reasons why people seek information are called information needs. They are countless,
and, in principle, unique in every case. That is, persons looking for information differ in their
background knowledge, and the situations in which the needs arise vary.
Information needs may be categorised into (1) verificative, (2) conscious topical, and (3) muddled
or ill-defined needs. The first category refers to situations where documents with known properties
are sought, e.g., by author name, titles of known authors, etc. The second type implies that the
topic is known and definable, but less exact than in the first category. A person looking for
information has some level of understanding of it. In the third category are the cases in which a
person wishes to find new knowledge and concepts in domains he is not familiar with. (Ingwersen
& Willett 1995, 169.)
An IR system embodies the belief that information can be organised and represented for
retrieval, and that information needs have some recurring characteristics. From the representation
of information we cannot deduce its possible uses. IR systems are based on the idea that the
needed information can be described. Thus, information retrieval is directed towards information
representation instead of information use. This also means that the person engaged in information
retrieval must be able to describe relevant1 information beforehand. However, in some situations
relevant information is recognised only when it is encountered and perused, not before. (Fugmann
1985.)
Figure 1 illustrates the information storage and retrieval process. Because this study is about
text retrieval2, documents are assumed to have textual representations. In a database, documents
are represented by their full text, some part of the text or a description in a documentation
language, or by some combination of these. The representations – whatever they are – are saved as
character strings, say, in an inverted file, which is a typical file structure used in text-based
retrieval. A string representing a document is a key. Additional information about the position of a
key in a text may also be stored in order to allow searching for the proximity of keys.
1 Relevance is discussed in detail below, see p. 94. Relevant information is information the person with an information
need wants to elicit. Typically, relevance is seen as a property of a document in relation to a request, and it is binary-valued: a document either is or is not relevant.
2 Retrieval refers here to retrieval of digital documents.
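The inverted-file organisation described above can be sketched as follows. This is a minimal illustration of the principle only, not the file structure of the test system, and all document texts and names in it are invented for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each key (a character string) to the documents, and the word
    positions within them, in which it occurs. The stored positions are
    what makes proximity searching possible."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, key in enumerate(text.lower().split()):
            index[key][doc_id].append(pos)
    return index

docs = {1: "text retrieval systems", 2: "probabilistic text retrieval"}
index = build_inverted_index(docs)
# index["retrieval"] records an occurrence in both documents,
# together with the position of each occurrence.
```

Matching a query against such an index compares character strings only, which is precisely why, as noted above, no meanings are involved in the matching itself.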
[Flowchart omitted; recoverable components: Tasks/activities, Need formation, Document production, Document acquisition, Index language, Search negotiation, Query formulation, Storage, Storage/retrieval mechanism, Result, Database.]
Figure 1. Information storage and retrieval process (by courtesy of Prof. Järvelin, 1997).
Information needs1 must be expressible in natural language to be communicable in text retrieval.
This formulation is known as a request. If a decision is made to use an IR system, the request must
be adapted to the conditions of the system, i.e., it must be translated to correspond to the
representation of documents (in Figure 1 this is referred to as search negotiation). A request
translated into a form acceptable for the retrieval mechanism is known as a query. Queries consist
of search keys, which are usually words or phrases represented by character strings, and operators,
which express the requirements for the occurrences of the keys. The query language of an IR
system defines the operators and a syntax for queries. Queries are matched with the
representations of documents. These representations consist of text keys which are stored into the
index of a database as character strings. Matching refers to matching of character strings, hence,
no meanings are involved and talking about matching of search words or terms is somewhat
misleading.
The linguistic expression of an information need is not a trivial task, because the need may be
ill-defined. Yet, a textual representation, i.e., a query, is a prerequisite for text retrieval.2 The
process of transforming the information need into a corresponding query may be compared to
translation: the ideas should be communicated, not the sequence of words. In natural language one
idea or subject may be expressed in countless ways. To retrieve all documents that contain
potentially relevant information for the information need, one should find all the expressions that
have been used to represent that information. Thus, sticking to words obscures the many-to-many
relationship between subjects and expressions, or concepts and words.
If we agree that a search3 should be started at the conceptual level, how are we to recognise
concepts from the information need and how to select concepts that are useful for searching?
Concept and facet analysis are classical information storage and retrieval methods for
identification and organisation of the concepts of documents and requests. A facet is an aspect of a
1 In Figure 1 the information needs arise from tasks. This is typical, but not the only reason for information seeking.
2 Browsing may be seen as an exception to this. It is a type of information searching which does not always require an
initial query.
3 A search refers to the process of forming a query, entering the query into the system, reformulating the query, getting
search results, and all other interaction within the IR system.
document or a request that consists of one or more concepts somehow related in meaning.
(Lancaster & Warner 1993; Soergel 1985.) Further, expressions represent the search concepts and
search keys represent the expressions. The search keys should be as reliable as possible, i.e., they
should match text keys representing the right expressions, not any others. These search keys
should appear in relevant, but not in non-relevant documents. This is seldom the case; rather, some
search keys are better or more reliable than others, but the reliable ones do not cover all the
relevant documents.
Non-professional searchers typically formulate queries with only a few search keys.
This causes problems because of the variability of expressions in the natural language of
documents – or their representations. Query expansion is a method of adding new search keys to a
query to obtain a better correspondence between the query and documents carrying potential
information. Expansion keys are usually elicited from the search results of an original
(unexpanded) query or some external source, such as vocabularies. (Efthimiadis 1996; Xu & Croft
1996.)
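The principle of vocabulary-based expansion can be sketched as follows, using an invented miniature thesaurus (not the test thesaurus of this study) with invented relation names: for each original search key, the keys related to it by the chosen relation types are added to the query.

```python
# Illustrative thesaurus: each entry lists related keys by relation type.
THESAURUS = {
    "car": {"synonym": ["automobile"], "narrower": ["sedan", "hatchback"]},
    "pollution": {"synonym": ["contamination"], "narrower": ["smog"]},
}

def expand(query_keys, relations):
    """Return the original keys together with their expansion keys,
    elicited from the thesaurus for the requested relation types."""
    expanded = []
    for key in query_keys:
        expanded.append(key)
        for rel in relations:
            expanded.extend(THESAURUS.get(key, {}).get(rel, []))
    return expanded

expand(["car", "pollution"], ["synonym"])
# adds the synonyms of each original key to the query
```

Selecting ["synonym"], ["narrower"], or both corresponds to the graded expansion types of the abstract, where expansion keys of one relation type at a time, or all types cumulatively, are added to the original query.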
Text retrieval methods may be divided into exact and partial (or best) match methods (Belkin &
Croft 1987; Ingwersen & Willett 1995). The former is, in practice, Boolean retrieval; the latter
comprises several methods, of which the most prevalent are perhaps probabilistic methods and
methods based on the vector space model. In Boolean retrieval a database is divided into two
parts: documents that exactly match the query (the result set, presumed relevant documents),
and documents that do not match (presumed non-relevant documents). The query expresses the
retrieval conditions with Boolean logic. All documents in the result set are assumed to have equal
relevance. In partial match retrieval, all documents of the database or documents containing at
least one search key are ranked according to their presumed relevance. The scores of documents
are calculated from weights given to text keys, and possibly also to search keys. The weighting of
the keys is usually based on the frequency of the key in a document and the inverse frequency of
the key in the whole database (known as tf*idf weighting1). (Hersh 1996; Ingwersen & Willett
1995; Salton 1989; Sparck Jones 1972.)
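As a sketch of this weighting principle, the weight of a key in a document can be computed with idf given as log(N/n). This illustrates the general tf*idf idea only; the actual weighting formulas of probabilistic systems such as InQuery are more elaborate.

```python
import math

def tfidf(tf, n_docs_with_key, n_docs_total):
    """tf*idf weight of a key in a document: frequency of the key in the
    document times the inverse document frequency log(N/n), where N is
    the collection size and n the number of documents containing the key."""
    return tf * math.log(n_docs_total / n_docs_with_key)

# A key occurring 3 times in a document, and in 10 of 1,000 documents:
w = tfidf(3, 10, 1000)   # 3 * log(100): frequent locally, rare globally
```

A key that occurs in every document gets an idf of log(1) = 0, i.e., a zero weight, which matches the intuition that such a key does not discriminate between documents.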
The formulation of queries is different within the different retrieval methods. In this study
query structure refers to the use of operators to express relations between search keys. In Boolean
retrieval, operators based on Boolean logic are available to indicate conjunctions, disjunctions and
negations of search keys, as well as parentheses to mark the order of operations. In the vector
space model and probabilistic methods search keys may appear without explicitly marked
relations, or the relations may be expressed with operators, which guide the calculation of scores
from the weights of the keys. The structure of queries may be described as weak (queries with a
single operator, no differentiated relations between search keys) or strong (queries with several
operators, different relationships between search keys).
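The contrast between weak and strong structures can be illustrated with InQuery-style operators. The sketch below is illustrative only: it uses two operators (#sum and #syn) for concreteness, whereas the structure types actually tested are more varied.

```python
def weak_query(keys):
    """Weak structure: a single operator over all search keys,
    with no differentiated relations between them."""
    return "#sum(" + " ".join(keys) + ")"

def strong_query(facets):
    """Strong structure: keys within a facet are grouped (here with
    #syn, treating them as instances of one key), and the facets are
    then combined with another operator."""
    inner = ["#syn(" + " ".join(facet) + ")" for facet in facets]
    return "#sum(" + " ".join(inner) + ")"

weak_query(["car", "automobile", "pollution"])
strong_query([["car", "automobile"], ["pollution"]])
```

In the strong form, the operator inside the parentheses of each facet is the 'facet operator' discussed in the abstract; the choice of that inner operator turned out to be more decisive than the operator combining the facets.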
In the IR literature, query structures, query formulation, query expansion, and concepts have
been discussed. Some of these topics have been studied within the exact match paradigm, others in
the partial match approach. In the latter, unstructured queries are typically used in evaluations.
This setting has led to claims about the usefulness of QE sources and the effectiveness of search key
weighting schemes. However, the interaction between the number of search keys (e.g., QE) and
query structures has not been systematically tested. In partial match retrieval, the use of concepts
1 The abbreviation tf means term frequency, i.e., frequency of the key t in a document, and idf means inverse document
frequency, which is often given as log(N/n), where N is the number of documents in the collection, and n is the number
of documents containing the key t. NB. We use ‘key’ rather than ‘term’ because we reserve ‘term’ to refer to certain
units of documentation languages. A search key, by contrast, may be a natural language word, an abbreviation, a term,
etc. See Järvelin 1993.
or facets has not been popular because their identification requires human involvement.
Nevertheless, the elicitation of concepts from searchers is possible. In many situations ‘conceptual
tools’ are available for automatic query formulation and expansion. These tools may be thesauri,
monolingual dictionaries and vocabularies, and multilingual dictionaries. Thus, concept-based
query structures in partial match retrieval are worth studying. This study investigates concept-based query formulation and query expansion in probabilistic text retrieval, and shows that query
structure matters in QE. The following aspects of queries are examined: the number of concepts in
a query, the number of search keys in a query, and the structure of a query.
[Figure 2 outlines the test setting as a flow: an information need gives rise to a request; search concepts and facets are recognised with the help of the test thesaurus (a conceptual model), producing a conceptual query plan; the first query formulation, in the query language, yields unexpanded queries with varying structure, and query expansion yields expanded queries with varying structure; both are matched against the database; the retrieved documents receive relevance judgements, giving the results. The figure distinguishes information need specific data, processes, and persistent data.]
Figure 2. Test setting.
Figure 2 illustrates the test setting. Starting from a request we shall identify search concepts from a
test thesaurus and divide them into facets. The resulting conceptual query plan will be formulated
into queries with different structures. These unexpanded queries will be matched against a test
document collection. Then the queries will be expanded with search keys taken from the thesaurus.
Expanded queries (with different structure variants) will also be matched against the database. The
effectiveness of unexpanded and expanded structure variants will be compared.
The outline of the dissertation is as follows: In Chapters 2–5 the literature is reviewed and the
main concepts are defined – Chapter 2 discusses concepts and the linguistic aspects of QE, in
Chapter 3 retrieval techniques are reviewed, Chapter 4 is about query formulation, Chapter 5
reviews previous QE studies. Research questions are formulated in Chapter 6, test methods and
material are explained in Chapter 7, results are presented in Chapter 8, discussion and conclusions
are given in Chapters 9–10.
2 Concepts, Words and Semantic Relations in Linguistics and IR
Linguistics studies such phenomena of natural language as the meanings of words and texts. The
study of meaning is called semantics. IR borrows tools from linguistics for handling natural
language. For the present study understanding of concepts and their relationships is crucial,
because query formulation is based on concepts and query expansion is based on semantic
relationships. The need for QE arises from variations in expression in natural language, which may
be examined through semantic relations between concepts and words.
Linguists talk about concepts reluctantly, preferring to discuss meanings, but concepts are often
discussed in IR, especially in the areas of classification and documentation languages. What, then,
is the relationship between concepts and meanings? They are not true synonyms, but in some
cases they are interchangeable. In this chapter both linguistic and IR viewpoints on meanings and
concepts are reviewed first, then semantic relationships will be introduced.
2.1 Concept, Meaning and Symbol
A concept may be defined as a generalised idea, mental entity or category (Oxford reference
dictionary 1986, 178; Lyons 1977, 110; Karlsson 1980, 233). When concepts are discussed, the
semantic triangle introduced by Ogden and Richards is very often cited (Ogden & Richards 1923;
Karlsson 1994; Fig. 3).
[Figure 3 shows two semantic triangles: (a) Symbol – Thought or Reference – Referent; (b) Mind – Language – World.]
Figure 3. Semantic triangles. (Karlsson 1994, 189.)
Karlsson (1994, 189) points out that the symbol in the model means particularly the form of a
symbol, the thought or reference represents the relationship between a symbol and a concept, and
the referent represents the reality that is discussed. The word thought in the triangle (a) in Figure 3
is sometimes replaced with the word meaning (Karlsson 1980, 233) or concept (Haarala 1981, 19;
Lyons 1977, 96). The other triangle (b) is on a more general level representing relations between
mind, language and the world. Thus, it seems that a concept is a perceived entity of the world. To
be more accurate, there are more worlds than the real world. Imaginary worlds or possible worlds
can also be referred to by the symbols of a language. (Karlsson 1994, 189–193.) Thus, a concept
may also be a conceived entity. Dahlberg (1992, 66) defines concepts as knowledge units
consisting of a referent, a verbal form and characteristics. In this study the word concept is used as
a synonym for the words thought or reference or meaning, as they are defined in the semantic
triangle. A concept is an abstraction1 of a referent, denoted by one or more linguistic symbols.
A concept may be a category, a universal (e.g., dog2 refers to any dog of any world), or a
concept may refer to an individual (e.g., the first Finnish female professor refers to a specified
person in a specified world). The existence of universal concepts is usually accepted3, at least as
an abstraction or mental representation. “Categorization ... refers to the process of dividing the
world of experience into groups – or categories – whose members bear some perceived relation of
similarity to each other. ... It is the recognition of similarities between otherwise unlike entities
and the subsequent identification of categories that permits the individual to discover order within
an otherwise complex and chaotic environment” (Jacob 1991, 78).
If one tries to find a concept for each word, one soon notes that not all words seem to denote
any concept or meaning. Concrete nouns are the easiest to define and the typical examples when
concepts are discussed; abstract nouns and verbs are more difficult; and prepositions and particles
(for instance ‘and’, ‘if’, ‘to’) are very hard to fit into the triangle. Not all words have referents in
any world; their meaning lies in the system of language. They are needed because language is not only
words but also phrases and sentences.
2.2 Meanings and Semantic Relations
“The entities of the world are what they are but they may be phrased (= somebody may phrase
them if he needs or wants to), that is, they may be turned into words and phrases”4 (Karlsson 1994,
190). Because people may have different viewpoints on entities, these may be referred to by
countless expressions in natural language. The relation between a concept and a symbol will be
discussed next. This is an area of linguistics, thus, meanings are discussed rather than concepts.
Literal meaning
A symbol consists of a form and a meaning that together refer to an entity. The symbols we are
interested in are words5. Some words have more precise meanings than others, e.g., it is much
easier to define a liver than a soul. Yet, a word does have a literal meaning that can be found in the
way the word is used by most people; otherwise communication would not be possible. The literal
meaning may not be easily defined; rather, meanings are often fuzzy. The literal meaning does not
1 Concepts as mental representations are partly individual, partly common to all speakers of a language or all human
beings. The individuality and universality of symbols are discussed by Ikonen (1994). Putnam (1973) discusses the
definition of concepts or meanings or intensions as mental vs. abstract entities.
2 In the examples concepts are in grotesque, form of a symbol – when specially emphasised – in bold face.
3 For example Lyons (1977, 109–114) discusses conceptualism, nominalism and realism, which all have different views
on existence of universals. Although he and Palmer (1982, 24–29) are rejecting concepts as mentalism, in IR the
concept of a concept is a useful abstraction, and it is merely a question of naming whether to talk about a meaning or a
concept. A concept as an abstraction can be seen analogously to the definition of a lexeme by Lyons (1977, 22; see
also Lexical meaning p. 10). This is a good example of different usage of words in linguistics and IR.
4 Translated by the present author.
5 ‘Expression’ would often be more accurate than ‘word’ in this discussion, because the interest lies in linguistic units
representing concepts. However, ‘word’ is rooted in the linguistic terminology, and unless there is a special need to
point out the distinction, ‘word’ is used.
inhibit variation in the use of the word: people use language creatively. The meaning of a word is
negotiable, that is, the listeners make their interpretation according to the situation and their
background knowledge. (Karlsson 1994, 187.) However, existence of the literal meaning implies
that the basic meaning of a word – a concept referred to – can be agreed upon.
Svenonius (1992) discusses the well-known theories of meaning: the referential theory and the
instrumentalist theory. In the former, the essential part of the meaning of the word is what it refers
to. The meanings of words, and thus, the categories named by them are stable. According to the
latter, the meaning of a word is its use. Words do not have fixed meanings, and some words have
more fluid boundaries than others. It seems that both arguments hold: the meaning of a word may
not change endlessly, otherwise communication would be impossible; however, it is easy to find
examples of words that change their meaning with the context. Thus, words seem to have both
literal and situational meanings (see Karlsson 1994, 222-224).
Lexical meaning
Lexical semantics studies the meaning of individual words or lexemes. A lexeme is a dictionary
word, an abstraction of a word, which may appear in many forms (tokens) in language (Lyons
1977, 22). Because the distinction is not crucial here, ‘word’ is used instead of ‘lexeme’. A word
has a denotation or an intension1 and an extension. The denotation is the potential of the word to
refer to the referents existing in the world (any possible world); it can also be defined as a lexical
meaning. It could be described by enumerating semantic traits. For example, the semantic traits of
woman might be the following: living, human, feminine, adult. This description reveals only some
aspects of the meaning because listing all semantic traits is not feasible. Another way to describe
the denotation is by prototypes2, which are typical examples of a category. A prototype of flower is
rose, but it might also be marigold; thus a prototype is not indisputable, rather it is a way to show the
fuzziness of categories and the strength of the membership in a category. Extension is the union of
all possible referents of a word. (Karlsson 1994, 191, 195-200.)
Besides a denotation, words may have connotations. These refer to subjective or affective
meanings that the use of a word arouses in individuals or a group of people. Certain characteristics
may be attached to a word, and they are inherited by all entities denoted by the word. For example,
white is connected with purity, innocence, emptiness, and death. Then, death may be referred to by
white colour or anything that is white. Affective meaning is connected to the expression of
emotions and attitudes. Some entities and processes are more loaded with emotions, thus they are
often paraphrased gaining numerous names (e.g., die, pass away, decease, perish, expire, kick the
bucket, pop off, peg out, cop it, etc.). (Karlsson 1994, 218–219; Palmer 1976, 92–93.)
Some words have more than one meaning, e.g., tongue, mint. The former is an example of
polysemy, i.e., the meanings of a word are slightly different, but coming from the same, usually
concrete origin; the latter is an example of homonymy, i.e., the meanings are not related, but the
words have the same form. The phrase “Do you like the tongue?” might have several meanings
depending on whether the listener is eating, kissing or visiting a foreign country. (Karlsson 1994,
197.)
1 Here ‘denotation’ and ‘intension’ are considered equivalents. They are used in overlapping fields of study, e.g.
linguistics and lexicography.
2 For prototypes, see Rosch (1973, 1978), Rosch & Mervis (1975), Rosch et al. (1976).
Words are more or less related to each other through their meanings. For instance, a book, a
chapter, and a paragraph are semantically more related than a fork, ecology and eternity. Related
words form a semantic field that affects the literal meaning of a word. Semantic fields are analysed
in linguistics by identifying semantic relations, which are first divided into paradigmatic and
syntagmatic relations. A word has a paradigmatic relation to all words that could replace it. A
syntagmatic relation exists between words that may be joined together (to a phrase, a sentence,
etc., see syntactic meaning below). (Karlsson 1994, 17–18, 202–203; Cruse 1986.) In the
following, synonymy, hyponymy, meronymy, and opposition will be introduced as examples of
paradigmatic relationships. These examples are chosen because they are the semantic relations most
often exploited in IR, and are relevant for the present study.
Synonymy
Synonymy means the identity of the denotations of two – or more – words1. Absolute synonyms
could replace each other in every textual context without a change in the meaning. This is so rare
that it is hard to find an example without any counter example, but someone and somebody might
illustrate the case. More often words are near-synonyms or quasisynonyms, i.e., they can replace each other in
some contexts (e.g., dismiss, discharge). The style or connotation of quasisynonyms may vary
(e.g., madhouse, mental hospital), or one might be more general than the other (e.g., flesh,
meat). (Karlsson 1994, 203–204; Cruse 1986, 265–294.)
Hyponymy and taxonymy
Hyponymy is defined as a hierarchic subordination relation between meanings. The subordinate
word is a hyponym; the word of the higher category is a superordinate or a hypernym. A word may
be both a hyponym and a hypernym, depending on the levels of the hierarchy, e.g., dog is a hyponym of
animal, and a hypernym of fox terrier. (Karlsson 1994, 204–205.)
Taxonymy is a special type of hyponymy. If hyponyms have a common hypernym, and if they
are strict incompatibles (class exclusion), and the principle of classification is constant, a hierarchy
is a taxonomy. It is a ‘Y is a kind of X’ relation. (Cruse 1986, 136-137.) In the hyponymy of
Figure 4 the two hyponyms of the same superordinate (book) are not incompatibles, because the
principle of classification has not been held constant. In the taxonomy of Figure 4 classes exclude
each other.
Typical examples of taxonomies are the hierarchies of science, but natural or folk taxonomies
also exist. Ethnolinguists have been studying the latter and suggest that folk taxonomies have five
or fewer levels. Cruse gives an example (1986, 145):
unique beginner (plant)
  life-form (bush)
    generic (rose)
      specific (hybrid tea)
        varietal (Peace)

1 This relation holds for expressions as well, but the terminology of lexical semantics is adopted here.
The generic level is the most important for speakers of natural language. Cruse states:
“This is the level of the ordinary everyday names for things and creatures: cat, oak, carnation, apple, car,
church, cup, etc. Items at this level are particularly likely to be morphologically simple, and to be ‘original’
in the sense that they are not borrowed by metaphorical extension from other semantic areas. This is also
the level at which the greatest number of items is likely to occur, although it is obvious that if every generic
item in a taxonomy had several specifics, then the number of items would be greater at the specific level.
The point is, however, that most branches of taxonomic hierarchies terminate at generic level. Items which
occur at specific and varietal levels are particularly likely to be morphologically complex, and compound
words are frequent.”
(Cruse 1986, 146.)
Part of the taxonomy of birds:

AVES (Birds)
  ... PICIFORMES, PASSERIFORMES, STRIGIFORMES ...
    PASSERIFORMES: ... LANIIDAE, HIRUNDINIDAE, PARIDAE ...
      HIRUNDINIDAE: RIPARIA RIPARIA (Sand martin), DELICHON URBICA (House martin),
      HIRUNDO RUSTICA (Swallow)

Two hyponyms of book:

BOOK: PAPERBACK, NOVEL

Figure 4. Examples of a taxonomy and a hyponymy.
Meronymy
Meronymy (partonymy) is the relationship between the whole (holonym) and its parts
(meronyms). A hand and fingers is a typical example, or a bike and wheels, pedals, a saddle, a
handlebar, ... The difference between parts and pieces is decisive: if a bike is broken down with a
hacksaw, the pieces are not parts of the bike. As in taxonymy, the principle of differentiation of
the whole into its parts should be constant, e.g., if one item is an abstract entity, all the other parts
must be as well. (Cruse 1986, 157-159.)
In some cases, especially concerning abstract entities, it is not obvious whether a relation is a
partonymy or not. As an example Cruse mentions geographical relations (continent : country, e.g.,
Europe : France; country : capital, e.g., France : Paris); group – member relations (senate :
senator); class – member relations (clergy : bishop); collection – member relations (forest : tree);
object – material relations (tumbler : glass); substance – particle relations (snow : flake). Typical
for some of these relations is that they are structurally less integrated than typical physical objects,
and their parts are often less differentiated, and form non-branching hierarchies. Some of the
wholes are collectivities, and their parts are, under another aspect, independent wholes on their
own. (Cruse 1986, 173-177.)
Opposition
Opposites are simultaneously close to and distant from each other. They differ along one
dimension of meaning; in respect of all other features they are identical. They are mostly either
adjectives or verbs. Typical examples of opposition are complementaries and antonyms.
Complementaries divide a conceptual domain into two exclusive classes, which cover the whole
domain (e.g., dead : alive; true : false; pass : fail). Antonyms denote degrees of some variable
property. They do not bisect a domain, rather, there is a range of values lying between the
antonyms (e.g., long : short; kind : cruel; like : dislike). (Cruse 1986, 197-199, 204.)
Directional opposites are another group of oppositions. Adverbs, prepositions, and verbs indicating
motion in opposite directions are typical examples (e.g., forwards : backwards; up : down; ascend :
descend). (Cruse 1986, 223–224.)
Syntactic meaning
The meaning of a sentence cannot be inferred from the meanings of the individual words alone:
‘Bill saw Sue’ does not mean the same as ‘Sue saw Bill’. The syntactic meaning of a
word takes into account the textual surroundings and the morphological and syntactical rules that
guide the formulation of words, phrases, and sentences. Words form groups that tend to appear
together, i.e., words collocate with certain other words (e.g., nouns that collocate with the verb
‘save’ in newswires: save forests ~ lives ~ enormous ~ annually ~ jobs ~ money ~ life ~ dollars ~
costs ...). (Karlsson 1994, 215-216; Leech 1990, 17.)
2.3 Thesauri and Other Conceptual Models
Information science has employed conceptual structures for knowledge organisation and
representation, classification and IR. These structures may be conceptual models consisting of
concepts and relations between the concepts, that is, semantic nets. The classic example is
a thesaurus, a documentation language used for the representation of information at the storage
phase and for query formulation at the retrieval phase. Other conceptual models are ontologies
which have their origin in philosophy and which have been studied in computer science and
information system management.
The ISO standard for monolingual thesauri
The types of relations a thesaurus represents are semantic relations. The International standard for
monolingual thesauri (ISO 1986, 1) distinguishes two kinds of inter-term1 relationships: “a)
syntactical or a posteriori relationships between the terms that together summarise the subject of a
document. ...; b) those a priori or thesaural relationships between terms assigned to documents and
other terms which, because they form part of common and shared frame of reference, are present
by implication.”
Equivalence, hierarchical and associative relationships are introduced in the standard. The
equivalence relationship holds between terms referring to the same concept, and it covers synonyms and
quasi-synonyms. Synonyms are defined as terms whose meanings can be regarded as the same,
therefore they are practically interchangeable. As an example of quasi-synonymy the standard
gives antonymy (e.g., wetness : dryness – “terms whose meanings are generally regarded as
different in ordinary usage, but they are treated as though they are synonyms for indexing
purposes”). In a documentation language one of the equivalent expressions is chosen to represent
the concept, and it is called a preferred term, the others are non-preferred terms. (Ibid., 14.)
The hierarchical relationship covers both taxonymy and meronymy; the former is called the
generic relationship, and the latter the hierarchical whole-part relationship. Besides these, the
standard mentions the instance relationship, “which identifies the link between a general category
of things or events, expressed by a common noun, and an individual instance of that category, the
instance then forming a class-of-one which is represented by a proper name” (ibid., 17).
Superordinate terms of any type of hierarchy (hypernyms, holonyms, categories) are called
broader terms, and subordinate terms (hyponyms, meronyms, instances) narrower terms.
Polyhierarchical relations are allowed, that is, a concept may belong to more than one hierarchy at
the same time (e.g., organs – as instruments – may have keyboard instruments and wind instruments
as superordinates). (Ibid., 15-17.)
The associative relationship does not have an easily recognised counterpart in the semantic
relationships. It is defined to cover relations between terms that are mentally associated but are not
in either equivalence or hierarchical relation. Further, the standard divides terms into two groups:
those belonging to the same category (e.g., ships and boats), and those belonging to different
categories (e.g., forestry and forests). The former type concerns siblings with overlapping
meanings – not all siblings are close in meaning. (Ibid., 17-18.) The latter type is diverse, and the
standard gives the following examples to illustrate the case (ibid., 18-19):
(a) a discipline or field of study and the objects or phenomena studied
aesthetics : beauty
(b) an operation or process and its agent or instrument
temperature control : thermostat
(c) an action and the product of the action
weaving : cloth
(d) an action and its patient
imprisonment : prisoners
(e) concepts related to their properties
poisons : toxicity
(f) concepts related to their origins
Denmark : Dane
1 ISO 2788-1986 is a standard for a documentation language, thus ‘term’ is used rather than ‘word’. The standard admits
using ‘term’ and ‘concept’ sometimes interchangeably.
(g) concepts linked by causal dependence
diseases : pathogens
(h) a thing and its counter agent
plants : herbicides
(i) a concept and its unit of measurement
astronomical distance : light year
(j) syncategorematic phrases and their embedded nouns
fossil reptiles : reptiles
The standard emphasises that these are representative examples, not a normative list. Some of
these relations could be described as syntagmatic in a linguistic sense, for instance, an action and
the product of the action, and an action and its patient. The conceptual difference in IR and
linguistics is worth noting: in IR these relations mainly represent paradigmatic relations, that is,
they may be alternatives in a document or request description.
Typically, equivalence and associative relations are reciprocal in IR. However, there are
occasions where this may not hold. Consider the expression ‘dismiss workers’, which
could be paraphrased as ‘discharge workers’. Equivalence between dismiss and discharge could
thus be established. Nevertheless, more meanings are attached to the form discharge than to the
form dismiss, and some meanings of discharge have nothing in common with the meanings of
dismiss. If one is searching for ‘discharges of duties’, dismiss would not be a good candidate to
replace discharge. Discharge, however, could in most cases replace dismiss. Similarly, if
association is established between plant and herbicide, the latter would always be associated with
the former, but for the former the latter represents a special aspect.
The relationships given in the standard are the most typical ones in thesauri, yet other kinds of
relations have been proposed for IR, especially finer distinctions instead of the associative
relationship, which is vaguely defined in the standard and in most thesauri (e.g., Fox 1980; Wang,
Vandendorpe & Evens 1985; Nutter, Fox & Evens 1990; Myaeng & McHale 1991; see also
Ontologies, below).
Semantic nets
Paradigmatic and syntagmatic relations tie the denotations of words to semantic nets, which
represent the system of meaning relations in natural language (Karlsson 1994, 216–217). WordNet
is an example of a lexical database that is a semantic net. It includes English nouns, verbs,
adjectives and adverbs organised as sets of synonyms, each set representing a lexicalised concept.
A number of semantic relations between sets of synonyms, called synsets, are given (see Table 1).
In WordNet a word with more than one sense is defined as polysemous, and two words that have at
least one sense1 in common are defined as synonymous. The different senses of polysemous words
are given, and they are disambiguated through their semantic relations. In WordNet homonymy
falls under polysemy; a word may be both a noun and a verb, for example. The possible parts of
speech (syntactic categories) for each semantic relation are given in Table 1. The relations may be
represented as word lists or graphs. (Miller 1995.)
1 For differences of ‘sense’ and ‘meaning’ see Lyons 1977, 197–206; Palmer 1976, 29–32. In this study they are used
interchangeably.
Semantic relation          Syntactic category   Examples
Synonymy (similar)         N, V, Aj, Av         pipe, tube; rise, ascend; sad, unhappy;
                                                rapidly, speedily
Antonymy (opposite)        Aj, Av, (N, V)       wet, dry; powerful, powerless;
                                                friendly, unfriendly; rapidly, slowly
Hyponymy (superordinate)   N                    sugar maple, maple; maple, tree; tree, plant
Meronymy (part)            N                    brim, hat; gin, martini; ship, fleet
Troponymy (manner)         V                    march, walk; whisper, speak
Entailment                 V                    drive, ride; divorce, marry

Note: N = Nouns, V = Verbs, Aj = Adjectives, Av = Adverbs
Table 1. Semantic relations in WordNet (Miller 1995, 40.)
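A tiny in-memory model of such a semantic net, with synsets linked by superordinate (hypernym) pointers, might look like the following Python sketch. The data and field names are illustrative only, not WordNet's actual format:

```python
# Toy synsets keyed by an id; each has its member words and a hypernym link.
synsets = {
    "sugar_maple": {"words": {"sugar maple"}, "hypernym": "maple"},
    "maple":       {"words": {"maple"},       "hypernym": "tree"},
    "tree":        {"words": {"tree"},        "hypernym": "plant"},
    "plant":       {"words": {"plant", "flora"}, "hypernym": None},
}

def hypernym_chain(syn_id):
    """Follow the superordinate links upward, as in the hyponymy rows of Table 1."""
    chain = []
    while syn_id is not None:
        chain.append(syn_id)
        syn_id = synsets[syn_id]["hypernym"]
    return chain

chain = hypernym_chain("sugar_maple")
```

Walking the chain from sugar_maple yields the superordinates maple, tree, and plant in turn.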
Statistical methods for identification of relationships between words
Constructing thesauri is a laborious task; thus, there is great interest in automating it. The basic
idea is that frequently co-occurring words are semantically related or associated. In the following
citation Svenonius (1992) describes automatic classification, which is a hypernym of automatic
thesaurus construction:
“In the field of classification research the idea of category formation by the method of family resemblances
underlies techniques of automatic classification. ... Categories that are automatically derived differ in
several ways from those in a manually constructed classification: 1) the manner of their derivation is purely
a posteriori; this is the ultimate operationalization of the principle of literary warrant; 2) the relationship
between members forming such categories is essentially statistical; the members of a given class are similar
to each other not because they possess a category-defining characteristic but by virtue of sharing a family
resemblance; and the nature of the hierarchical and precedence relationships among automatically derived
categories is not defined in terms of inheritance properties.”
(Svenonius 1992, 11–12.)
The simplest case of automatic thesaurus construction is to collect word pairs co-occurring
frequently enough in a database. This gives a word-based thesaurus with one relation type. Schütze
and Pedersen (1997) remark that words co-occurring in long documents need not have any
relation.1 A more reliable way is to require words to co-occur within a given distance (‘a window
of n words’). Further, phrases are important in IR to preserve the meaning (Hull et al. 1997; Allan
et al. 1997a). Natural language processing has methods for phrase recognition (see Ruge &
Schwartz 1991; Smeaton 1992, 1995; Strzalkowski 1995; Strzalkowski, Lin & Perez-Carballo
1997). With these methods it is possible to yield a phrase-based thesaurus, but types of relations
between expressions cannot be differentiated.
1 See also Peat and Willett (1991) who argue against the usefulness of QE based on co-occurrence data.
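The window-based variant described above can be sketched as follows. The window size, frequency threshold and sample text are arbitrary illustrative choices:

```python
from collections import Counter

def cooccurring_pairs(tokens, window=5, min_freq=2):
    """Collect word pairs that co-occur within a window of n words.

    A sketch of the simplest automatic-thesaurus idea: frequently
    co-occurring words are taken to be associated; the type of the
    relation is not differentiated.
    """
    counts = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                counts[tuple(sorted((tokens[i], tokens[j])))] += 1
    return {pair for pair, c in counts.items() if c >= min_freq}

text = ("query expansion helps recall query expansion uses a thesaurus "
        "the thesaurus lists keys for query expansion").split()
pairs = cooccurring_pairs(text)
```

In the sample text the pair (expansion, query) recurs within the window and survives the frequency filter, while pairs seen only once are discarded.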
Jing and Croft (1994) present PhraseFinder1, a program that automatically constructs a thesaurus
using text analysis. PhraseFinder considers co-occurrences between phrases and words as
associations. A phrase is defined as a sequence of words whose parts of speech satisfy any of the
specified phrase rules, e.g., a simple noun phrase rule {NNN, NN, N} states that a phrase may
consist of triple nouns, double nouns, or a single noun. A phrase may have other parts-of-speech as
well. Co-occurrences are recognised from windows of 3-10 sentences, not from the whole
documents, because the latter would produce too many invalid associations. Pairwise associations
are generated between phrases and words within a paragraph. The association frequency, which is
equal to word frequency times phrase frequency, is summed over the whole collection for each
association. Finally, the association data are filtered by discarding associations with frequency 1
and phrases that are associated with too many words. PhraseFinder is accessed through the
InQuery retrieval system, i.e., similarity between a query and thesaurus items is computed, and the
output is a ranked list of phrases.
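A greedy longest-match reading of such phrase rules can be sketched as follows; this illustrates the rule format described ({NNN, NN, N}: triple, double or single nouns), and is not the actual PhraseFinder implementation:

```python
def extract_phrases(tagged, rules=("NNN", "NN", "N")):
    """Greedy longest-match phrase extraction over POS-tagged text."""
    phrases, i = [], 0
    lengths = sorted({len(r) for r in rules}, reverse=True)  # try longest rules first
    while i < len(tagged):
        for k in lengths:
            if i + k > len(tagged):
                continue
            window_tags = "".join(tag for _, tag in tagged[i:i + k])
            if window_tags in rules:
                phrases.append(" ".join(word for word, _ in tagged[i:i + k]))
                i += k
                break
        else:
            i += 1  # no rule matched at this position
    return phrases

# Invented tagging: N = noun, V = verb, A = adjective.
tagged = [("automatic", "A"), ("query", "N"), ("expansion", "N"),
          ("improves", "V"), ("recall", "N")]
phrases = extract_phrases(tagged)
```

The double-noun rule NN matches ‘query expansion’ before the single-noun rule can split it, while the lone noun ‘recall’ falls back to rule N.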
Three abstraction levels model for thesauri
Järvelin, Kristensen, Niemi, Sormunen and Keskustalo (1996b) propose a deductive data model
for thesauri, which is based on three abstraction levels: the conceptual level, the linguistic level
and the string level (Järvelin 1993). The conceptual level represents concepts and conceptual
relationships (e.g., hierarchical generic or partitive relationship, association relationship). The linguistic level represents natural language expressions for the concepts and equivalence relations
between them. Each concept may have several synonymous expressions with varied reliability.
This is comparable to family resemblance or fuzzy membership: some expressions are more
typical names for a concept, but others may be used as well. Each concept is denoted by a term
(the principal name of the concept) and possibly a number of synonyms2. Polysemy is allowed in
the model by letting one expression denote many concepts. Concepts are identified through an
identification code and the relations they have. Each expression may be represented by one or
more strings – also of varied reliability – at the string level. Each string is a matching model representing how the expression may be matched in database indices built in various ways (e.g., with or
without compound words split into component words, and with or without stemming). The
proposed data model can be used for the representation and navigation of indexing and retrieval
thesauri and as a vocabulary source for concept-based query expansion in heterogeneous retrieval
environments. (Järvelin et al. 1996b.)
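As an illustration, the three levels might be sketched as data structures; all class and field names here are hypothetical, not part of the published model:

```python
from dataclasses import dataclass, field

@dataclass
class MatchString:
    """String level: a matching model for a particular index type."""
    pattern: str
    reliability: float = 1.0

@dataclass
class Expression:
    """Linguistic level: a natural language name for a concept."""
    text: str
    reliability: float = 1.0          # how typical a name this is
    strings: list = field(default_factory=list)   # MatchString items

@dataclass
class Concept:
    """Conceptual level: identified by a code and its relations."""
    code: str
    term: Expression                  # the principal name
    synonyms: list = field(default_factory=list)  # Expression items
    relations: dict = field(default_factory=dict) # e.g. {"NT": [codes]}
```

Polysemy can be expressed by letting the same Expression object appear under several Concept instances.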
Ontologies
The name ontology is adopted from the philosophical study of the nature of being, i.e., Ontology.
Ontologies are specifications of a conceptualisation, i.e., models showing concepts and their
relations (in some possible world). (Guarino 1995.) They are used as a unifying framework (or
shared understanding) for communication between people with different viewpoints and needs,
1 PhraseFinder was later renamed InFinder (Allan, Ballesteros, Callan, Croft & Lu 1995).
2 Synonymy is understood rather loosely here (cf. Miller 1995). Possible candidates are synonyms, quasi-synonyms,
corresponding verbs/nouns (a term is not necessarily a noun) and common names.
and for inter-operability among systems with different paradigms, languages and software tools.
Recently, ontologies have been studied for information system management and agent
technologies. (Guarino, Masolo & Vetere 1998; Uschold & Gruninger 1996.)
Are ontologies and thesauri then identical? No, because their origin and uses differ. Relations
in thesauri are often based on linguistics (lexical semantics). The study of formal ontology is the
basis for ontology construction. Formal ontology is defined as “the theory of a priori distinctions
within [1] (our perception of) the entities of the world, or particulars (physical objects, events,
regions of space, amounts of matter... ); [2] the categories we use to talk about the real world, or
universals (concepts, properties, qualities, states, relations, roles, parts...)” (Guarino 1997). A
thesaurus may be an ontology. Guarino (1997) states that ontologies include formal definitions
which are mostly lacking from thesauri; thus, the latter are ‘simple ontologies’. Another
differentiation is made between reference ontologies and shareable ontologies, thesauri belonging
to the latter. Guarino et al. (1998) blame WordNet for undifferentiated relations (e.g. mixing
disjoint and overlapping concepts), which might serve as an enlightening example. However, differences
are not the most interesting point, rather, formal ontology provides tools for elaborating
conceptual relationships further for IR.
*
*
*
The present study is based on the idea that requests can be represented as a set of concepts and
relations between them. Conceptual models, which are representations of concepts and their
relations in a subject area, could thus be used for formulating a request into a query. In this chapter
we have discussed the methods and tools linguistics gives for the identification of concepts and
their relations, as well as the nature of different conceptual relations and conceptual models. This
study will employ a special conceptual model – a thesaurus for text searching – for testing QE.
The construction of the thesaurus is based on ideas outlined by ISO (1986), and the relations of the
thesaurus represent the basic semantic relations discussed in Section 2.2. (i.e., synonymy,
meronymy, taxonymy). We shall test what effects QE – based on different semantic relations – has
on retrieval performance.
3
Retrieval Techniques : Exact and Partial
Matching
Query and document representations are compared using some retrieval or matching technique.
Figure 5 shows a classification of retrieval techniques. An essential differentiation lies between
exact and partial match. The former is identified with matching based on Boolean logic; the latter
is more diverse, the most studied cases being the vector space model (Salton & McGill 1983) and probabilistic retrieval (Robertson & Sparck Jones 1976; Croft & Harper 1979; van Rijsbergen 1979).
Exact match means that the conditions given in a query must be fulfilled exactly in a document for it to
be retrieved. Boolean retrieval has been the prevalent technique of operational IR systems until
recently. The technique has some disadvantages: it
(1) misses many relevant documents whose representations match the query only partially;
(2) does not rank retrieved documents;
(3) cannot take into account the relative importance of search or text keys;
(4) requires complicated query logic formulation. (Belkin & Croft 1987, 113.)
Points 1 and 2 refer to the fact that Boolean retrieval is binary-valued: documents either match
or not, they are either relevant or non-relevant. Point 4 may also be regarded as an advantage:
Boolean logic gives the means to build a query with a conceptual structure and express some
relations between concepts, that is, to construct a query with a structure.
IR research has been more interested in partial match techniques, which seem to be a remedy
for the problems of Boolean retrieval: partial match techniques retrieve documents that match
the query only partially; they arrange the result set according to degrees of assumed
relevance; they allow both text key and search key weighting; queries may be natural language
sentences or just sets of words with no structure. The weighting is an essential feature of partial
match techniques, because it allows to express the importance of the keys in relation to documents,
and further allows the ranking of the documents in relation to a query. Partial match techniques
and their weighting schemes may be empirically (e.g., the vector space model) or theoretically
(e.g., probabilistic approaches) justified. (Efthimiadis 1992, 13.)
[Figure 5 residue: a tree classification dividing retrieval techniques into exact match and partial match, with sub-branches labelled document based, structure based, logic, feature based, graph, formal, probabilistic, network based, cluster, browsing, spreading activation, ad hoc, fuzzy sets and vector space.]
Figure 5. A classification of retrieval techniques (Belkin & Croft 1987, 112).
3.1
Vector Space and Probabilistic Models
In the vector space model1 both documents and queries are represented in a t-dimensional space,
where t is the number of keys in the database index. Each representation is a vector V with t
components, each component wi representing the weight – importance as a numeric value – of the
corresponding key i in a document or query. The document and query vectors are compared using
a similarity measure, which shows the similarity as a value between 0 and 1. The cosine function
is typically used to compute similarities in the model, but there are alternative functions. Weights
are often computed by using the frequency of the key in the document and the inverse frequency
of the key in the whole collection (known as tf*idf – key frequency times inverse document
frequency weighting). (Salton & McGill 1983, 118–130.) This weighting is based on the following
principles: (1) the degree to which the document and a key are related should increase as the
frequency of the key in the document increases; (2) the usefulness of a key in discriminating
among documents should decrease as the number of documents in which it appears increases
(Kantor 1994, 64).
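A minimal sketch of tf*idf weighting and cosine comparison under these two principles (function names are ours; the idf variant log(N/df) is one of several used in the literature):

```python
import math

def tf_idf_vector(doc_tokens, doc_freq, n_docs):
    """Weight each key by tf * idf: within-document key frequency
    times log(N / df), following principles (1) and (2) above."""
    weights = {}
    for key in set(doc_tokens):
        tf = doc_tokens.count(key)
        idf = math.log(n_docs / doc_freq[key])
        weights[key] = tf * idf
    return weights

def cosine(q, d):
    """Cosine similarity of two sparse weight vectors (dicts),
    a value between 0 and 1 for non-negative weights."""
    dot = sum(w * d.get(k, 0.0) for k, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0
```

Note how a key occurring in every document receives weight 0, implementing the second principle.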
Probabilistic retrieval is based on the probability ranking principle2. It states:
“If a reference retrieval system’s response to each request is a ranking of the documents in the
collection in order of decreasing probability of usefulness to the user who submitted the request, where
the probabilities are estimated as accurately as possible on the basis of whatever data is available for
this purpose, then the overall effectiveness of the system to its user will be the best that is obtainable on
the basis of that data.”
(Cooper 1994, 242.)
An optimal discriminant function is P(R | D) / P(NR | D), where P(R | D) represents the probability
of relevance when the document D is given; P(NR | D) represents the probability of non-relevance
given the document D. The probability of relevance is computed relative to keys representing a
document, and documents are ranked according to this estimated probability. The estimation is a
hard problem and different solutions lead to different probabilistic retrieval models. (Croft &
Harper 1979; van Rijsbergen 1979, 111–120.)
In the binary independence model document descriptions (i.e., keys) have binary weights (1 or
0), and it is assumed that the distribution of keys in relevant documents is independent and their
distribution in non-relevant documents is independent (Robertson & Sparck Jones 1976)3.
Bayesian inversion is used to simplify the discriminant function to the form

P(D | R) P(R) / (P(D | NR) P(NR))     (1)

where P(D | R) = ∏i=1...t pi^xi (1 − pi)^(1−xi),
pi is the probability that key i has weight 1 in the set of relevant documents – pi = P(xi=1 | rel); (1 − pi) = P(xi=0 | rel),
xi = 1 if key i occurs in the document, otherwise xi = 0,
P(D | NR) = ∏i=1...t qi^xi (1 − qi)^(1−xi),
1 Retrieval model is a larger entity including a retrieval technique. The former specifies the representations of documents
and information needs, and how they are compared. It is an embodiment of the underlying theory. The latter refers to the
matching function of a query and documents. (Turtle & Croft 1992, 279.)
2 For discussion about the principle, see Robertson 1977b.
3 Cooper (1991) discusses the known fact that the independence assumptions are not plausible. He suggests that these
should be replaced by a single linked dependence assumption.
qi is the probability that key i has weight 1 in the set of non-relevant documents – qi = P(xi=1 | non-rel); (1 − qi) = P(xi=0 | non-rel).
By substituting the values of P(D | R) and P(D | NR) in the function and taking logs, the function is
transformed into a linear discriminant function
gD = ∑i=1...t xi log [pi(1 − qi) / (qi(1 − pi))] + ∑i=1...t log [(1 − pi) / (1 − qi)] + log [P(R) / P(NR)]     (2)
The last element, log [pi(1 − qi) / (qi(1 − pi))], of the first term of the function is known as a relevance
weight, which is given to all documents containing key i. The other terms of the function are
neglected as constants. The discriminant function may thus be written as follows
gD = ∑i=1...t xi log [pi(1 − qi) / (qi(1 − pi))]     (3)
(Robertson & Sparck Jones 1976; Robertson 1977b; Croft & Harper 1979; van Rijsbergen 1979).
Robertson and Sparck Jones (1976), and Robertson (1986) discuss the problems of estimating
pi and qi. A sample of retrieved documents should be used, but such a sample is often small,
especially the number of relevant documents, and thus estimation is not reliable. The authors
suggest the point-5 formula for estimating the relevance weight. The formula follows:
wt = log [(r + 0.5)(N − n − R + r + 0.5) / ((n − r + 0.5)(R − r + 0.5))]     (4)
where wt is the relevance weight
N is the total number of documents in the collection
R is the number of relevant documents in the collection
n is the number of documents containing the key
r is the number of relevant documents containing the key
The constants 0.5 prevent the formula from giving infinite weights under certain circumstances. If
no relevance information is available (i.e., R = r = 0), the formula reduces to
wt = log [(N − n + 0.5) / (n + 0.5)]     (5)
Croft and Harper (1979) suggest that if no relevance information is available, pi should be
assumed constant and qi should be estimated by the proportion of documents in the whole
collection that contain the key. Then the formula for the relevance weight becomes
wt = constant + log [(N − n) / n]     (6)
Robertson (1986) shows that (5) and (6) are closely related to IDF weighting proposed by Sparck
Jones (1972), which is
wt = log (N / n)     (7)
The problems and benefits related to these estimations are discussed by Robertson (1986), and
Robertson and Walker (1997). Robertson and Walker (1994), and Robertson and Sparck Jones
(1997) show that adding information about within-document key frequency and document length
to the formula (4) further improves performance1.
3.2
The InQuery Retrieval System
A different probabilistic approach is taken in the InQuery2 retrieval system. It is based on
Bayesian inference networks, which are directed, acyclic dependency graphs. The nodes of the
graph represent propositional variables and arcs represent dependence relations between
propositions. The nodes in the network are either true or false. Values assigned to arcs range from
0 to 1, and are interpreted as beliefs. If a proposition represented by a node d implies a proposition
represented by a node t, they are connected by an arc from d to t. The node t contains a link
matrix3 specifying P(t | d) for all the possible values of the two variables. If a node has multiple
parents, the matrix specifies the dependence of the node on all its parents. If a set of prior
probabilities for the root nodes of the graph is given, the probability or degree of belief connected
to all remaining nodes can be computed using inference networks. (Turtle 1990; Callan, Croft &
Harding 1992.)
A network for information retrieval is shown in Figure 6. It consists of a document network and
a query network. The former represents the document collection, documents (d1 ... dk) as roots of
the graph, texts (t1 ...t m) representing documents and concepts4 (r1 ... rn) representing texts. A
document may have several types of contents, although text is a typical case, and each content has
several concepts. The document network is constructed once for a given collection and its
structure is not altered during query processing. The query network consists of a user’s
information need (I), queries (q1 ... qp) representing the need, and concepts (c1 ... cs) included in the
queries. A query network is built for every information need and it changes when queries are
modified or new queries are added to the network. These two networks are connected by arcs between
the concepts representing texts and concepts representing queries. (Turtle & Croft 1992, 280–281.)
1 See also Robertson, Walker & Hancock-Beaulieu 1995.
2 InQuery is an information retrieval system developed at the Center for Intelligent Information Retrieval, Computer
Science Department, University of Massachusetts.
3 Link matrix, see Turtle 1990, 52–59.
4 The use of the word ‘concept’ in Turtle and Croft (1992) is not very strict, it is almost a synonym of ‘word’. In Figure
6 the relation between concepts representing texts and concepts representing queries is more comprehensible if this is
kept in mind.
[Figure 6 residue: the inference network links document nodes (d1 ... dk) to text nodes (t1 ... tm), these to concepts representing texts (r1 ... rn), which are connected to concepts representing queries (c1 ... cs), to queries (q1 ... qp), and finally to the information need node I.]
Figure 6. Document retrieval inference network (Turtle 1990, 39).
A simplification of the model assumes one-to-one correspondence between document nodes and
text nodes, and between concepts representing texts and concepts representing queries. These
assumptions are often justified in IR systems, e.g., a document does not have many different text (or other format) versions, and the conceptual levels of the representations of queries and documents may be interpreted as one. However, if the difference between, say, intellectual and
automatic indexing is important, the separation between concepts1 of documents and concepts of
queries should be retained. (Turtle 1990, 44–46.)
The network, as a whole, represents the dependence of an information need on the documents
in the collection where the dependence is mediated by concepts representing documents and
queries. When a document di is observed, evidence di = true is attached to the network, and all
other document nodes are set to false. A probability that the information need is met may now be
computed. By repeating this process for other documents (dj, i ≠ j), the probability that the
information need is met may be computed given each document in the collection, and the
documents ranked accordingly. (Ibid., 47.)
For all non-root nodes in the inference network the probability that a node takes on a value,
given any set of values for its parent nodes2, must be estimated. If a node a has parents pa = {p1,
..., pn}, P(a | p1, ..., pn) must be estimated. A link matrix is used to represent the probability that
the node a takes the value true or false for all combinations of parent values. For example, node
Q1 has two parents A, B, and node Q2 one parent C, and P(A = true) = a, P(B = true) = b, and P(C
= true) = c. The and combination, Q1 AND Q2, is true only when A, B, and C all are true, and
conditioning over the set of parents gives
Pand(Q1, Q2 = true) = abc     (8)
1 In fact, the differentiation is done between expressions representing concepts, rather, than between concepts.
2 Parent nodes of a node are the directly preceding nodes in the graph, and the node itself is a child node in relation to
the parent nodes.
Pand(Q1, Q2 = false) = (1 - a) (1 - b) (1 - c) + (1 - a) (1 - b) c + (1 - a) b (1 - c) +(1 - a) bc +
a (1 - b) (1 - c) + a (1 - b) c + ab (1 - c) = 1 - abc.
(Ibid., 52–54.)
Other expressions may be defined similarly:
Pnot(Q) = 1 − p1     (9)
Por(Q1, Q2, ..., Qn) = 1 − (1 − p1)(1 − p2) ... (1 − pn)     (10)
Psum(Q1, Q2, ..., Qn) = (p1 + p2 + ... + pn) / n     (11)
Pwsum(ws, w1Q1, w2Q2, ..., wnQn) = ws (w1p1 + w2p2 + ... + wnpn) / (w1 + w2 + ... + wn)     (12)
where P denotes probability, Qi, i = 1...n, is either a string or an InQuery expression, pi is the
belief value of Qi, wi is the weight of Qi, and ws is a weight given for a clause. (Turtle 1990, 57,
59; Rajashekar & Croft 1995, 274–275.)
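The operator evaluations (8)–(12) translate directly into code; a sketch with our own function names:

```python
def p_not(p):
    """#NOT, formula (9)."""
    return 1 - p

def p_and(ps):
    """#AND, formula (8) generalised to n operands: the product
    of the operand beliefs."""
    result = 1.0
    for p in ps:
        result *= p
    return result

def p_or(ps):
    """#OR, formula (10): one minus the product of the complements."""
    result = 1.0
    for p in ps:
        result *= 1 - p
    return 1 - result

def p_sum(ps):
    """#SUM, formula (11): the mean of the operand beliefs."""
    return sum(ps) / len(ps)

def p_wsum(ws, weighted):
    """#WSUM, formula (12); weighted is a list of (weight, belief)
    pairs and ws is the clause weight."""
    return ws * sum(w * p for w, p in weighted) / sum(w for w, _ in weighted)
```

For example, with parent beliefs 0.5, 0.4 and 0.2 the AND belief is their product, 0.04, matching formula (8).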
Each document node has a prior probability associated with it that describes the probability that
the document is observed; usually this is set to 1/collection_size. The estimation of the conditional
probabilities for the concept nodes (ri, representing documents) is based on tf*idf weighting.
Given the concepts ri = {r1, ..., rn} and the documents dj = {d1, ..., dk}, P(ri = true | dj = true),
the belief in the representation concept1, is estimated by the function
α + (1 − α) * ntf * idf     (13)
The first factor in the formula (α), known as a default probability, represents the probability that a
key should be assigned to a document in which it does not occur. This has been shown to improve
the retrieval performance. (Turtle 1990; Turtle & Croft 1992.) In the InQuery Version 3.1 the
function is as follows:
0.4 + 0.6 * [tfij / (tfij + 0.5 + 1.5 * dlj / adl)] * [log((N + 0.5) / dfi) / log(N + 1.0)]     (14)
where tfij = the frequency of the key i in the document j
dlj = the length of the document j (as a number of keys)
adl = average document length in the collection
N = collection size (as a number of documents)
dfi = number of documents containing key i.
(Allan et al. 1997b.)
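Formula (14) as a function (the function name, argument names and order are ours):

```python
import math

def inquery_belief(tf, dl, adl, n_docs, df, alpha=0.4):
    """Belief in key i given document j, formula (14): the default
    probability alpha plus a length-normalised tf component times a
    scaled idf component."""
    ntf = tf / (tf + 0.5 + 1.5 * dl / adl)
    idf = math.log((n_docs + 0.5) / df) / math.log(n_docs + 1.0)
    return alpha + (1 - alpha) * ntf * idf
```

A document not containing the key (tf = 0) receives exactly the default probability 0.4, and the belief grows with tf.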
The inference network model differs from the binary independence model, because it does not
use Bayesian inversion, thus there are no probabilities that correspond directly to pi and qi of the
latter model. Samples of relevant documents are not used in estimation of probabilities, and this is
justified by the lack of representative samples. (Turtle & Croft 1992, 288.)
1 In the formula a representation concept is equalised with a key.
Syntax of the query language of InQuery
In the query language of InQuery an operator precedes its argument (i.e., prefix notation is used).
All operators are marked with the hash sign (#), and the arguments of the operator are delimited by
parentheses, for instance: #and(cat swallow). According to the InQuery manual, #SUM, #WSUM,
#OR, #AND, #NOT are belief operators, and #SYN, #n, #uwn are proximity operators (in #n, #uwn
the symbol n denotes an integer, uw stands for unordered window). The interpretations of the
former operators are explained above. All keys within the SYN operator are treated as instances of
one key, thus the SYN operator influences the calculation of tf*idf values (Rajashekar & Croft 1995).
The probability for operands connected by the SYN operator is calculated by modifying the tf*idf
function as follows:
0.4 + 0.6 * [∑i∈S tfij / (∑i∈S tfij + 0.5 + 1.5 * dlj / adl)] * [log((N + 0.5) / dfS) / log(N + 1.0)]     (15)
where tfij = the frequency of the key i in the document j
S = a set of search keys within the SYN operator
dlj = the length of the document j (as a number of keys)
adl = average document length in the collection
N = collection size (as a number of documents)
dfS = number of documents containing at least one key of the set S.
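Formula (15) differs from (14) only in pooling the member keys' frequencies and using a joint document frequency; a sketch (names ours):

```python
import math

def syn_belief(tfs, dl, adl, n_docs, df_s, alpha=0.4):
    """Belief in a #SYN clause, formula (15): the member keys'
    within-document frequencies tfs are pooled (summed over the set
    S) and the joint document frequency df_s replaces a single key's
    df, so the clause is scored as if it were one key."""
    tf = sum(tfs)
    ntf = tf / (tf + 0.5 + 1.5 * dl / adl)
    idf = math.log((n_docs + 0.5) / df_s) / math.log(n_docs + 1.0)
    return alpha + (1 - alpha) * ntf * idf
```

Two synonyms with frequencies 2 and 3 in a document thus score exactly like a single key with frequency 5.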
Operators #n and #uwn are more traditional proximity operators. The former requires that all its
argument keys occur in the specified order with at most n words separating any of the keys. The
#uwn is the unordered window operator, which requires that all keys within its scope occur in any
order, within a window of n words. Proximities are not based on grammatical units but on the
absolute number of words, including stopwords.1
Clauses with different operators may be nested as follows: #and(#or(cat polecat weasel)
#or(martin swallow swift)). However, there are some restrictions for the nesting of clauses. “A
primary rule in formulating structured queries is that ‘belief operators’ may not occur inside of
‘proximity operators’. This is because proximity lists (the basic unit of InQuery knowledge) can be
converted to a belief list (a score or weight), but belief lists may not be converted to proximity
lists.” (InQuery s.a.) Consequently, structural transformations are necessary if there is a need to
express, say, the disjunction of search keys within a proximity clause. Thus, #uw3(#or(cat polecat
weasel) #or(martin swallow swift)) is syntactically incorrect and must be expressed, instead, as
#or(#uw3(cat martin) #uw3(cat swallow) #uw3(cat swift) ... ).
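The required transformation – distributing the disjunctions over the proximity window – can be sketched as follows (function name ours):

```python
from itertools import product

def distribute_or(window, groups):
    """Rewrite an (illegal) #uwN over #or groups into the legal
    equivalent: an #or over #uwN clauses, one per combination of
    keys, since belief operators may not occur inside proximity
    operators."""
    clauses = ["#uw%d(%s)" % (window, " ".join(combo))
               for combo in product(*groups)]
    return "#or(" + " ".join(clauses) + ")"
```

Note that the number of proximity clauses grows as the product of the group sizes, which makes heavily expanded proximity queries long.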
In InQuery 3.1 the operation #0(key key) is normally used for specially identified and indexed
phrases. In the test setting, however, it was used to retrieve compound words by their parts. To
give an example, a compound word such as puun | jalostus | teollisuus (which means wood processing
industry, parts separated by ‘|’ ), is stored in the index as three parts and one compound consisting
of three parts, all of which have the same address. Thus, the word is retrievable by the following
keys: puu2, jalostus, teollisuus, #0(puu jalostus), #0(jalostus teollisuus), #0(puu teollisuus).
1 NB. In the database index of the present study no stopwords were filtered out.
2 The first part of the compound word is in genitive (puun), but through morphological analysis all the text words are
transformed to and stored in their basic forms (puu).
This is especially useful for the long compound words typical of newspaper language (e.g.,
työehtosopimuskäytäntö / collective labour agreement practice; jälleenrakentamisurakkakilpailu /
public tender for a reconstruction contract).
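The set of keys generated for a split compound can be sketched as follows; the rule of pairing every two parts in order is our reading of the example above, not a documented specification:

```python
from itertools import combinations

def compound_keys(parts):
    """Keys by which a compound word split into parts is retrievable:
    each part alone, plus a #0(...) phrase over every ordered pair of
    parts, since all parts are stored at the same index address."""
    keys = list(parts)
    keys += ["#0(%s %s)" % pair for pair in combinations(parts, 2)]
    return keys
```

Applied to puu | jalostus | teollisuus this reproduces the six keys listed in the text.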
3.3
Expression of Linguistic Relations with Operators
In Boolean retrieval the AND operator may be used to express syntagmatic relations between
search keys, and the OR operator paradigmatic relations. However, neither of them is accurate: the
AND operator does not work at the sentence or phrase level, and the OR operator does not specify the
relation type. Proximity operators are a more restrictive interpretation of AND. With them it is
possible to (1) match textual units (paragraphs, sentences) of documents; (2) specify distance for
two or more search keys; (3) specify order of occurrence for two or more search keys (Keen 1991,
3). Even though proximity operators usually improve the accuracy of results, the closeness of keys
does not guarantee a meaningful syntactic relationship. Thus, all these operators are modest means
to approximate structures of natural language.
Originally, partial match queries did not include operators. It was possible to express the
importance of the search keys by using weights, but paradigmatic or syntagmatic relations could
not be expressed. In the extended Boolean models – of which the p-norm model is the best known
– Boolean operators are given a soft interpretation. The main idea is to loosen the strictness of the
AND operator in order to retrieve almost matching documents, and to tighten the permissiveness
of the OR operator in order to give a higher rank to documents containing more than one search
key. (Salton, Fox & Wu 1983; Fox, Betrabet, Koushik & Lee 1992; Lee 1994.) Probabilistic
models, too, have introduced a range of operators, including interpretations of the Boolean operators
and proximity operators (e.g., InQuery, see above). Nevertheless, these operators do not allow
exact representation of the structures of natural language either.
Kantor (1994, 60–62) examines methods of representing Boolean operations when keys have
non-binary weights. He evaluates how the methods respect DeMorgan’s rules relating conjunction
and disjunction, i.e.,
(1) ¬(p ∧ q) ⇔ ¬p ∨ ¬q; (2) ¬(p ∨ q) ⇔ ¬p ∧ ¬q,
and divides them into product-based and sum-based representations. These may be formulated as
follows:
f¬x = 1 − fx     (16)

Product-based representation:
fxANDy = fx fy     (17)
fxORy = 1 − (1 − fx)(1 − fy)     (20)

Sum-based representation:
fxORy = ((fx^p + fy^p) / 2)^(1/p)     (18)
fxANDy = 1 − f¬xOR¬y     (19)
where fz represents the match of some specific document to the key combination z. In the product-based representation the OR function is calculated from the AND function by DeMorgan’s rules,
and this is used, for example, in InQuery. In the sum-based representation the AND function is
calculated from the OR function, and it is used in the p-norm model.
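Both representations and their DeMorgan relationship can be sketched and verified numerically (function names ours; the p-norm OR here uses equal key weights):

```python
def f_not(f):
    """Negation, formula (16)."""
    return 1 - f

def and_product(fx, fy):
    """Product-based AND, formula (17)."""
    return fx * fy

def or_product(fx, fy):
    """Product-based OR via DeMorgan: 1 - f(not x AND not y)."""
    return 1 - and_product(f_not(fx), f_not(fy))

def or_pnorm(fx, fy, p):
    """Sum-based (p-norm) OR with equal weights, formula (18)."""
    return ((fx ** p + fy ** p) / 2) ** (1 / p)

def and_pnorm(fx, fy, p):
    """Sum-based AND via DeMorgan: 1 - f(not x OR not y)."""
    return 1 - or_pnorm(f_not(fx), f_not(fy), p)
```

At p = 1 the p-norm operators reduce to simple averaging, and as p grows the OR approaches the maximum of its operands, recovering strict Boolean behaviour in the limit.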
Structure is thus brought to partial match queries, but the formulation of queries is no easier
than in traditional Boolean logic. Rather, it is more difficult to guess the consequences of
operations when using partial match techniques. Different mathematical interpretations given to
the operators seem to have a great influence on retrieval performance (e.g., Greiff, Croft & Turtle
1997).
*
*
*
In Chapter 3 we have introduced the main text retrieval techniques and their differences. The
difference between exact and partial match techniques is important for us, because the effects of
query structures and QE are different in exact and partial matching. Our test will be conducted in a
partial match environment, and the matching principle of the retrieval system of this study,
InQuery, was explained. In InQuery, operators adopted from Boolean logic are given ‘soft’
interpretations, and a range of other operators are defined as well. With these operators it is
possible to construct several query structures that we shall test.
4
From Requests to Queries
An information need should be formulated into the language of an IR system in order to retrieve
documents. In this chapter we shall present the transformation process.
4.1 Requests
A request is an embodiment of the information need in natural language, but it can hardly include
all the aspects of the need (Swanson 1988). As soon as the request is stated, it dominates the scene,
and the words of the request govern query formulation. This is related to the label effect, which
means that information need is expressed by words and phrases that do not wholly describe the
need but label it. (Ingwersen 1982, 177-178; Ingwersen & Willett 1995, 169.) This is problematic
because the labels may hinder proper search key selection in the query formulation phase.
Information retrieval situations may be divided according to who is participating in the
situation. A user may be searching alone, a user and an intermediary may collaborate, or an
intermediary may be searching on the basis of a request. In many IR tests written requests have
been the only representations of information needs (e.g., TREC), and this is sometimes the case
even in real information retrieval environments (e.g., in some libraries, newspaper archives).
Collaboration is a more prototypical situation, however. In the web era, the user as a searcher is
common. The concept of a request is problematic, because it seems to imply an intermediary to
whom the request is addressed. If the user is searching by himself, is the request ever articulated?
In that case the idea of the search topic could be understood as a request.
4.2
Query Construction
We adopt the model of three abstraction levels, that is, we distinguish between conceptual,
linguistic and string levels (see Fig. 7). First, search concepts are identified from a request.
Second, the concepts are connected to expressions, which, in turn, are replaced by search keys. As
the result of this process, a request is translated into a query.
Conceptual level
A conceptual or facet analysis of the request is an essential part of query formulation (Soergel
1985, 350–355; Fidel 1991a, 491; Paice 1991; Lancaster & Warner 1993, 133–136). At the
conceptual level a query should be formulated without any ties to query languages, rather it should
respect the information need behind the request. A result of this analysis is a conceptual query
plan. A facet is an aspect of a request. It is difficult to specify how concepts and facets should be
identified, and which of them should be selected as search elements. The type of information need
behind the request, the document collection searched, the representation of documents, the
searcher’s understanding of the information need, among other considerations, influence the
selection. Concepts may be divided into disjunctive and conjunctive concepts according to whether
they belong to the same facet or to different facets. In principle, the relations of concepts in a facet are
paradigmatic, while the facets of a plan are in a syntagmatic relation. The concepts of a facet typically
have semantic relations, but this certainly is a loose condition which does not hold in every case.1
(Soergel 1985; Järvelin 1995, 142–145.)
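One way this division shows up in query construction: concepts within a facet become disjuncts, facets become conjuncts. A sketch in InQuery-like syntax (the function name and the choice of #syn for within-facet disjunction are illustrative; the study tests several structures):

```python
def facets_to_query(facets):
    """Build a structured query from a conceptual query plan:
    concepts inside one facet are paradigmatic alternatives (#syn),
    and the facets are combined syntagmatically (#and)."""
    clauses = [c[0] if len(c) == 1 else "#syn(" + " ".join(c) + ")"
               for c in facets]
    return "#and(" + " ".join(clauses) + ")"
```

For the lung cancer request of the footnote below, the smoking/asbestos facet becomes one disjunctive clause conjoined with the lung cancer facet.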
[Figure 7 residue: at the conceptual level a conceptual query plan is built of search concepts; at the linguistic level an expressed query is built of search expressions; at the string level a formal query is built of search keys.]
Figure 7. Abstraction levels of IR (adapted from Järvelin 1995).
Mirja Iivonen (1995) studied the consistency of the formulation of query statements in
bibliographic retrieval. Her subjects were 24 experienced searchers and eight students of
information studies. Iivonen had rules for the identification of concepts and the equivalence of
concepts. In her study, elemental and superimposed classes were recognised as concepts. The
equivalence of concepts was established if there was an equivalence relationship between the
expressions, hierarchical or commonly recognised relationship between the concepts, or if there
was a coordinative relationship related to the search request between the concepts. Iivonen found
that at the search word level the intersearcher consistency is low (31.2%), but when consistency is
considered at the conceptual level, it is much higher (87.6%)2. She also points out that if a request
contained many concepts searchers did not select all of them into a query. Some concepts were
selected by most searchers, but others were selected less frequently and their variation was greater.
(Iivonen 1995.)
Harter (1990, 144–145) distinguishes between single-meaning and multi-meaning facets. The
former type gives an overall context within which the expressions of the other type are to be
interpreted. On the other hand, the multi-meaning facet specifies certain aspects of the singlemeaning facet. The expressions of the single-meaning facet are more likely to be synonyms or
near synonyms than the expressions of the multi-meaning facet. “Within single-meaning facets,
searchers can expect a certain amount of redundancy and overlap among relevant postings, and
need not to be so concerned with finding every possible synonym or near synonym. Within multimeaning facets, there will be little semantic redundancy or overlap among relevant postings within
1 Let us consider a request: 'Do smoking or asbestos dust cause lung cancer?' Asbestos dust, smoking and lung
cancer may be recognised as search concepts. Asbestos dust and smoking belong to one facet, yet, they are not
semantically related. Another example is a request 'The origins of paddling'. In this case the facet of paddling may be
represented by kayak and canoe. These are semantically related to paddling, but in a linguistic viewpoint represent a
syntagmatic rather than paradigmatic relation. In a query, however, they may be alternatives, i.e., disjunctives.
2 Averages of pairwise asymmetric consistency figures.
a facet, and searchers will need to find as many ways of representing the facet as possible.” (Harter
1990, 145.)
Sormunen (1994, 73) distinguishes between basic and additional concepts1. He states that (1) a
request cannot be formulated into a specific query without words referring to the major concept of
the request; (2) most probably at least one word representing a major concept occurs in texts about
the concept; (3) a major concept can be represented with a limited number of words. Accordingly,
in this study a major facet refers to a facet that is necessary in a reasonable conceptual query plan.
It is similar to concepts selected by most searchers in Iivonen’s study, or a single-meaning facet.
The other facets of a conceptual query plan are minor facets, conceptually similar to multi-meaning facets.
The conceptual query plan may be described by complexity, specificity, and coverage.
Complexity is the number of facets in the conceptual query plan. A facet is represented in the plan
if any of the concepts representing the facet is included. Complexity is connected to the
syntagmatic relations of facets (or to conjunction of concepts). Specificity refers to the specificity
of the concepts representing facets. If the concepts are of the same specificity level as the request,
the plan is fully specific. The specificity is connected to conceptual hierarchies. Coverage is the
number of concepts in the conceptual query plan. Coverage is connected to the paradigmatic
relations of the concepts in a facet (or to disjunction of concepts). (Järvelin 1995, 146–147.)
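The three attributes can be illustrated with a minimal sketch, in which a conceptual query plan is represented as a list of facets, each facet a list of the concepts that represent it (the representation and the function names are ours, not from Järvelin 1995):

```python
def complexity(plan):
    """Complexity: the number of facets in the conceptual query plan."""
    return len(plan)

def coverage(plan):
    """Coverage: the total number of concepts across all facets."""
    return sum(len(facet) for facet in plan)

# Request: 'Do smoking or asbestos dust cause lung cancer?'
plan = [
    ["smoking", "asbestos dust"],  # one facet, two disjunctive concepts
    ["lung cancer"],               # a second facet
]
# complexity(plan) == 2, coverage(plan) == 3
```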
In Boolean retrieval the effects of complexity, specificity and coverage on search results are
quite well understood (e.g., Harter & Peters 1985; Fidel 1991a, c; Harter 1990), but in partial
match retrieval results have not been examined according to these characteristics. In Boolean
retrieval, facets are added in order to limit the size of a result set – or removed to increase the
number of retrievals. What is the role of facets in a partial match environment? Facet analysis of
partial match queries is almost non-existent because these queries are thought of as ‘natural
language queries’ or ‘bags of words’ with no specified concept or facet structure.
Linguistic level
The conceptual query plan is turned into the level of expressions by naming the concepts with
accurate words and phrases (expressed query). (See Fig. 7.) A search expression refers to words
and phrases of natural language, to common codes and abbreviations, and to terms of
documentation languages. Search keys are either string constants, which correspond to expressions
at string level, or string patterns, which match several string constants and are typically formed
from constants by applying wild cards. (Järvelin 1995, 176–178.)
A request may be represented by different queries that are constructed according to different
principles, known as strategies (Bates 1990). The building block strategy is a typical Boolean
search strategy, and corresponds to the facet analysis of a request. In this strategy the searcher
decomposes the search topic into facets and names the concepts in each facet with accurate expressions. All expressions belonging to the same facet are combined by the Boolean OR operator, and
facets are joined by the Boolean AND or proximity operators. (Harter 1986, 173.) The process of
joining expressions by the OR operator to the initial search key identifying the facet is, in effect,
query expansion2 (Efthimiadis 1996, 127).
1 Sormunen refers to conjunctive concepts.
2 Query expansion, see Chapter 5.
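As a sketch, the building block strategy can be expressed as a small function that joins the expressions of each facet with the OR operator and the resulting blocks with the AND operator (the function name is our own, for illustration only):

```python
def building_block_query(facets):
    """Combine the expressions of each facet with the Boolean OR operator
    and join the resulting facet blocks with the Boolean AND operator."""
    blocks = ["(" + " OR ".join(expressions) + ")" for expressions in facets]
    return " AND ".join(blocks)

# Facet analysis of the request 'The origins of paddling':
query = building_block_query([["paddling", "kayak", "canoe"], ["origin", "history"]])
# query == "(paddling OR kayak OR canoe) AND (origin OR history)"
```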
Through search expression selection, the query is made to correspond to the representations
of relevant documents. The selection depends on the search topic, the search environment, and the
searcher. The database may be intellectually indexed, and thus allow descriptor field and text field
searching. The vocabulary of the documents may be general (e.g., newspaper articles) or special
(e.g., medical articles). Fidel (1991a, 492) mentions two criteria in search expression selection
which are crucial for intermediaries: whether an expression is a single-meaning or common
expression, and whether controlled vocabulary may be used or not. Different types of concepts
have a different number of expressions: individual concepts do not usually have many names, but
universals, especially abstract universals, are referred to by several expressions. If controlled
vocabulary is used (i.e., if only descriptor fields are searched), expressions are selected from a
thesaurus, and the number of expressions in a query could be smaller compared to the use of
natural language expressions. However, Fidel (1991b, 510) reports that searchers who prefer ‘free’
words use, on average, the same number of search expressions as those who prefer descriptors.
The information need (accurate or vague) and goal of the search (broad or narrow) influence
search expression selection.
String level
At the string level search expressions are turned into search keys (character strings) suitable for
matching in the index of a database, and the expressed query becomes a formal query. In this
study, a ‘search key’ refers to a ‘search string’1. The differentiation between a lexeme and a token
is useful (see p. 7): a search expression is comparable to a lexeme and its occurrences in texts may
be referred to as tokens. A search key should match as many occurrences (tokens) of the
expression (lexeme) in documents as possible. Truncation and character masking of search
expressions are typical methods of search key building for database indices containing inflected
word forms. If a database index is built of basic word forms, search keys are in the basic form as
well, and truncation is not needed as much. Then, the correspondence between search expressions
and search keys, and expressed and formal queries is greater. However, differences may exist, say,
if two phrases with different proximity operators at string level represent one search expression, or
due to truncation.
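Truncation and character masking can be sketched by compiling a search key into a string pattern; here ‘*’ truncates and ‘?’ masks a single character (a simplified illustration of the technique, not the syntax of any particular retrieval system):

```python
import re

def key_to_pattern(search_key):
    """Compile a truncated/masked search key into a regular expression
    that matches whole tokens (string constants) in a database index."""
    escaped = re.escape(search_key)
    pattern = escaped.replace(r"\*", r"\w*").replace(r"\?", r"\w")
    return re.compile("^" + pattern + "$")

pattern = key_to_pattern("retriev*")
tokens = ["retrieval", "retrieves", "retriever", "relevance"]
matched = [t for t in tokens if pattern.match(t)]
# matched == ["retrieval", "retrieves", "retriever"]
```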
The number of search keys in a query is a concept analogous to coverage, and is referred to
as broadness. In a strict sense, complexity and coverage belong to the conceptual query plan, but
are used to describe formal queries as well, since no confusion should arise. In other words, the
attributes of the conceptual level are passed on to the linguistic and string levels. In all, abstraction
levels should be understood as overlapping layers, and a concept as a kernel item, which has
representatives at other levels.
1 A ‘search key’ is often used to refer to a ‘search term’, but here it is used to refer to a ‘search string’ because 1) we
have adopted the differentiation between a word, a term and a string (see Järvelin 1993) and, 2) matching is done at the
string level, thus, search strings are the true search keys.
4.3 Query Structures
In the literature the expression structured queries typically refers to queries formulated with the
Boolean operators in contrast to natural language queries, which are sets of words (e.g., Salton et
al. 1983; Croft 1986; Croft, Turtle & Lewis 1991; Brown 1995; Hull 1997). Thus, structure may
be understood as relationships between search keys in a query, and it is expressed in query languages by operators or search key weights, which guide the matching of search keys and document
representations. In the present study, this is referred to as a query structure. A facet structure
implies that facets must be identified, as is the case with Boolean queries in the conjunctive
normal form. However, a query may have a structure, but this structure is not necessarily
reducible to a conjunctive facet structure.
Partial match techniques interpret logical Boolean operations as arithmetic operations with
which the individual weights of the text keys appearing in a query and a document are combined
to rank the documents. Ranking is always based on calculation; thus, rather than talking about
unstructured and structured queries, one might talk about the features of structure. Search key
weighting is also a way to give a query a structure, because the relative importance of search keys
with respect to the request may be expressed with weights. Still, we distinguish between weights
and operators. The former are numerical values attached to keys, concepts, facets, or clauses; the
latter indicate logical or arithmetic operations, or the position of keys in documents.
The distinction between weights of the text keys and search keys is important: ranking is based
on the weights of the text keys that appear in a query; search key weighting is used to emphasise
some keys more than others. The weights of text keys may be binary or non-binary. In the latter
case, tf*idf weighting is typical. Search keys may also have binary or non-binary weights. In the
former case, there are no explicit weights, i.e., all search keys have a weight of 1. In the latter case,
weights may be tf*idf based or they may be chosen by the searcher to reflect the importance of the
search keys. If search key weighting is allowed, the weights of search and text keys are usually
multiplied in ranking.
For example, if a query with no operators is submitted to InQuery, ranking is based on the
averages of the weights of the keys that occur in a query and a document, or in other words, the
SUM operator is a default operator in InQuery. A query in which search keys are weighted (i.e.,
their relative importance is given), has more structure. In InQuery this type is a WSUM query,
which gives the weighted averages of the weights of the search keys that occur in documents. In
the vector space model, ranking is based on summing the products of weights of search keys and
text keys.
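The difference between the SUM and WSUM operators can be sketched as follows (a simplified reading of the operators as described above; InQuery’s actual belief computation includes defaults not shown here):

```python
def sum_score(key_beliefs):
    """#SUM: the plain average of the per-key weights for a document."""
    return sum(key_beliefs) / len(key_beliefs)

def wsum_score(search_key_weights, key_beliefs):
    """#WSUM: the average weighted by searcher-assigned key importance."""
    weighted = sum(w * b for w, b in zip(search_key_weights, key_beliefs))
    return weighted / sum(search_key_weights)

beliefs = [0.2, 0.6]  # per-key weights of two keys in one document
# sum_score(beliefs) == 0.4
# wsum_score([1.0, 3.0], beliefs) == (0.2 + 1.8) / 4.0 == 0.5
```

Weighting the second key three times as heavily moves the score toward that key's weight, which is the sense in which WSUM queries "have more structure" than plain SUM queries.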
The structure of queries may be described as weak or strong. A weak query structure does not
indicate facet or concept structure with operators (i.e., queries with a single or no explicit operator;
undifferentiated relations between search keys). In a strong query structure search keys representing
different concepts or facets are separated by operators (i.e., queries with several operators;
differentiated relationships between search keys). An example of a weak query structure is
#SUM(a1 a2 a3 b1 b2 b3), and of a strong query structure #AND(#OR(a1 a2 a3) #OR(b1 b2 b3)),
when ai and bi (i = 1, ..., 3) are search keys.1 The latter is also a facet-structured query.
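The effect of the two structures can be sketched with common soft-Boolean interpretations (AND as a product of beliefs, OR as a complemented product; a simplification for illustration, not necessarily InQuery’s exact formulas):

```python
from math import prod

def soft_sum(beliefs):
    """Weak structure: plain average over all keys."""
    return sum(beliefs) / len(beliefs)

def soft_or(beliefs):
    """Soft OR: complemented product of the complements."""
    return 1.0 - prod(1.0 - b for b in beliefs)

def soft_and(beliefs):
    """Soft AND: product of the facet beliefs."""
    return prod(beliefs)

# Per-key beliefs for a document containing only the a-facet keys:
a = [0.8, 0.6, 0.0]  # a1, a2, a3
b = [0.0, 0.0, 0.0]  # b1, b2, b3

weak = soft_sum(a + b)                       # #SUM(a1 a2 a3 b1 b2 b3)
strong = soft_and([soft_or(a), soft_or(b)])  # #AND(#OR(a1 a2 a3) #OR(b1 b2 b3))
# weak > 0 while strong == 0.0: the facet structure penalises the missing b-facet
```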
Boolean structured queries have been claimed to be more effective2 than weakly structured
queries, both in a probabilistic retrieval system (e.g., InQuery) and the vector space model1 (Salton
1 In the example, the query language of InQuery is used, see pp. 27-29.
2 Effectiveness measured as precision and recall.
et al. 1983; Turtle 1990; Belkin, Kantor, Fox & Shaw 1995; Hull 1997). Tests by Turtle and Salton
were run in small standard test collections, and a test by Belkin et al. in the TREC-2 collection.
The number of queries was representative. Nevertheless, it is difficult to say what other
characteristics besides structure may have influenced the results, because the details of the test
queries are not reported. Turtle and Croft (1991) explain the performance improvements due to
structural information (phrase structure, compound nominals, and negation) captured in Boolean
queries, information that is not exploited in weakly structured queries. The report by Hull (1997)
confirms these conclusions. He compared a weighted Boolean model (based on probabilistic
principles) with the vector space model. However, neither report describes the queries in detail.
Queries can be classified according to their structural features:
1) are concepts identified?
2) are facets identified?
3) are concepts weighted?
4) are facets weighted?
5) are search keys based on single words or phrases?
6) are search keys weighted?
7) by what operators are facets / concepts / search keys connected?
The first division is based on concept identification. If concepts are identified, query structure
may be based on concepts or facets. An example of facet structure is a query constructed with the
Boolean block search strategy (a query in the conjunctive normal form), i.e., concepts representing
one aspect of a request are connected with the OR operator, and facets are connected with the
AND operator. In partial match retrieval models, facets may be combined with soft interpretations
of Boolean operators or other types of operators (see Rajashekar & Croft 1995; Fox, Betrabet,
Koushik & Lee 1992; Lee 1994, Salton, Fox & Wu 1982). It would be possible to replace Boolean
or soft Boolean operators with proximity operators, e.g., between facets, but in such cases
proximity operators may be interpreted as stricter versions of Boolean operators.
Concepts are not necessarily grouped into facets, e.g., in a conjunctive normal form query, but
concepts may be conjuncted instead of facets. Facets or concepts may be weighted in some partial
match retrieval models. When no explicit operator is used, facets or concepts should be weighted,
otherwise they cannot be identified. In all concept-based structures search keys may be weighted
or unweighted, and word- or phrase-based.
If no concepts are identified, a query is normally a ‘bag of words’. This implies further that no
explicit operators between search keys are used (e.g., vector space queries, queries in probabilistic
models without operators), or relations between search keys are not specified (e.g., all search keys
are treated equally). Search keys may be weighted when no explicit operator is given or when
probabilistic operators are used. Search keys may be formed of single words or phrases.
In Section 7.5 we shall develop a classification for query structures based on the structural
features discussed above.
*   *   *
In this study we rely on differentiation between the conceptual, linguistic and string level of
queries. These levels are manifested in query formulation and expansion: query formulation begins
with conceptual analysis of requests, after that search keys representing search concepts are
elicited through the linguistic level, i.e., through expressions denoting concepts. QE runs through
the same levels. Two principal types of query formulation will be tested: one type is based on
1 Comparisons of the vector space model and the p-norm model.
concept or facet identification, and results in strongly structured queries; the other is based on the
combination of search keys rather than concepts (i.e., concept or facet structures are relaxed) and
leads to weakly structured queries.
5 Query Expansion
Query formulation, reformulation, and expansion have been studied extensively because the
selection of good search keys is difficult but crucial for good results. Real users’ requests and/or
queries do not usually contain all the expressions – and not necessarily the best expressions – that
might be used to describe the concepts of interest. Typically, requests are short, and each concept
in them is named by one expression only, and queries may contain only a few search expressions.
(Bates, Wilde & Siegfried 1993, 23–27; Lu & Keefer 1995.) The first query formulation typically
acts as an entry to the search system and is followed by browsing and query reformulations
(Marchionini, Dwiggens, Katz & Lin 1993, 39).
[Figure 8 classifies QE methods by the source of expansion keys: sources are either search results or knowledge structures (collection dependent or collection independent), and the user's role makes the expansion intellectual, interactive, or automatic.]
Figure 8. QE methods and the sources of expansion keys
(Efthimiadis 1996, 124).
The first query may be reformulated by adding search keys, with or without reweighting (NB, both
search keys and text keys may be reweighted), a process known as query expansion (QE). QE
may be specified by the sources of expansion keys, by the methods of selecting expansion keys,
and by the methods of adding expansion keys to queries. Efthimiadis (1996, 124; see Fig. 8)
divides the sources into knowledge structures, which are collection dependent or independent, and
search results. The methods for selecting expansion keys are statistical or linguistic. The user’s
role in QE may be active or passive, i.e., the user may select expansion keys (intellectual
expansion), or the system may suggest possible search keys (interactive expansion), or the process
may run invisibly to the user (automatic expansion). (Efthimiadis 1996, 124; Ekmekcioglu,
Robertson & Willett 1992, 139; Hancock-Beaulieu 1992, 99.) In general, QE is not tied to any
retrieval technique, but may be applied with any of them. However, the effects of QE could be
dependent on retrieval technique and query structure, although little consideration has been given
to query structures in partial match retrieval. In the following sections we shall review QE studies
on the basis of the expansion source and query structures.
5.1 QE Based on Search Results
Relevance feedback is a query modification technique based on search results. Information about
the occurrences of possible search keys in relevant and non-relevant documents is used in the
selection of new search keys or in reweighting. Search keys are added from relevant documents
retrieved in the previous search, or a new query is formulated abandoning the initial search keys.
Because the method is based on the relevant documents of the retrieved set, the searcher’s
relevance judgements are needed. In some tests this has been avoided by assuming that the first n
documents of a ranked output are relevant. All words of relevant documents may be added, but
usually expansion keys are ranked by some algorithm, which is typically akin to matching
algorithms1. The optimal number of added words is a research issue. The range varies from a few
words to several hundred. (Efthimiadis 1992, 34–35; Efthimiadis 1996, 134–135; Harman 1992.)
Both moderate and massive automatic QE based on relevance feedback have been proved effective
many times (e.g., Harman 1992; Buckley, Singhal, Mitra & Salton 1995; Robertson, Walker,
Jones, Hancock-Beaulieu & Gatford 1995; Beaulieu et al. 1997; Walker, Robertson, Boughanem,
Jones & Sparck Jones 1997). The queries in these studies were weakly structured.
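A minimal sketch of this "pseudo-relevance" selection of expansion keys follows; real systems rank candidates with algorithms akin to the matching algorithm, not raw frequency as here, and the function name is hypothetical:

```python
from collections import Counter

def expansion_keys(top_documents, original_keys, n_keys=3):
    """Assume the top-ranked documents are relevant and rank their
    words by frequency as candidate expansion keys, excluding the
    original search keys."""
    counts = Counter(
        word
        for doc in top_documents
        for word in doc.lower().split()
        if word not in original_keys
    )
    return [word for word, _ in counts.most_common(n_keys)]

docs = ["cancer drug trial results", "new cancer drug approved", "drug testing phase"]
keys = expansion_keys(docs, original_keys={"cancer"})
# keys[0] == "drug" (the most frequent non-query word in the assumed-relevant set)
```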
The user may be asked either to judge the relevance of the results or to choose the expansion
keys from a ranked list of words, or he may do both. Efthimiadis (1992) investigated real users’
selection of expansion keys which were obtained from relevance feedback. The users chose about
one third of the words offered. They were asked to state the relation of the five best expansion
keys to the original search keys. The relationships were of standard thesaural type, e.g.,
hierarchical and associative relations. For 34% of expansion keys there was no relation to the
original search keys. Of the remaining 66%, most keys (70%) were hyponyms of the original
search keys, 5% were hypernyms, and an associative relationship held for 25%. The overall search
results provided some evidence for the effectiveness of interactive QE based on relevance
feedback. The queries were weakly structured. The users seem to be fastidious in QE; thus,
automatic QE has given greater improvements. (Efthimiadis 1992.) However, Magennis and van
Rijsbergen (1997) suggest that there is a difference between experienced and inexperienced users,
the former group being able to improve retrieval performance through interactive expansion
whereas the latter group is not.
QE based on relevance feedback and concept identification was tested in the TREC
environment by the research group of the Australian National University (Hawking, Thistlewaite
& Bailey 1997; Hawking, Thistlewaite & Craswell 1997). The main idea was that in order to be
relevant a document should contain evidence for the presence of all search concepts, not just one.
In the TREC-5 study, search concepts were intellectually selected from requests and then search
keys were generated for each concept without using information from the collection. Queries
consisted of concept intersections. Matching was distance-based, that is, documents were scored
on the basis of the occurrences of any representatives of the search concepts within a given
distance. The queries were then expanded by words of the top ranked documents retrieved by
initial queries. The expansion words had to be allocated to the right concepts because of the query
structure. This was achieved by computing association strengths between concepts (their
representatives) and candidate expansion words. Compared to unexpanded queries, expansion
1 For discussion on algorithms, see Efthimiadis & Biron (1994).
increased recall significantly. The performance of the concept-structured queries was also superior
to automatically constructed queries. (Hawking, Thistlewaite & Bailey 1997.)
In the TREC-6 study Hawking, Thistlewaite and Craswell (1997) tested three relevance scoring
methods: (1) a frequency based matching, i.e., a tf*idf weighting variation; (2) concepts scoring, in
which “the final score s for a document is derived from the concept scores c1, ..., cn using s =
(kc1+1)* ... *(kcn+1)”; (3) distance scoring, also used in TREC-5. Queries were first automatically
formulated from requests (auto queries), and these queries were then further intellectually refined
by asking a user to recognise concepts, remove bad search keys, combine any suitable pair of
words into a phrase, add new search keys which were obviously missing, and alter search key
weights. The user was not allowed to consult the collection (blind queries). In the next phase, the
user modified queries by running them and scanning the documents retrieved (interactive queries).
In the last phase, queries were expanded automatically using 20 top ranked documents as a source
for 30 expansion words1. The performance of all queries with concept-based structure
(unexpanded blind, expanded, and interactive) was better than the performance of the unexpanded
and expanded automatic (weakly structured) queries. In addition, concept scoring worked significantly better than frequency scoring; distance scoring was the worst method. The authors point out
that concept scoring improves the ranking of documents that contain representatives of all search
concepts. (Hawking, Thistlewaite & Craswell 1997.)
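The quoted concept-scoring formula can be sketched directly (here k is the constant from the quotation; the value used in the experiments is not given above, so k = 1 below is our assumption):

```python
def concept_score(concept_scores, k=1.0):
    """Document score s = (k*c1 + 1) * ... * (k*cn + 1).
    A missing concept (ci = 0) contributes only a factor of 1, so
    documents with evidence for every concept reach the highest scores."""
    s = 1.0
    for c in concept_scores:
        s *= k * c + 1.0
    return s

# Two concepts both present vs. one strong concept only (k = 1):
# concept_score([1.0, 1.0]) == 4.0  >  concept_score([2.0, 0.0]) == 3.0
```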
Mitra, Singhal and Buckley (1998) discuss the problem of query drift in relevance feedback,
when no relevance judgements are available, and n top ranking documents are assumed to be
relevant. If a large proportion of these n documents is not relevant, bad expansion keys will be
added to the query. The researchers suggest that Boolean constraints could be used in the selection
of the n documents. In other words, all aspects of a request should be represented in documents
assumed to be relevant and used as a source for expansion keys. Human constructed constraints
were compared to automatic constraints. Both increased the effectiveness of relevance feedback.
5.2 QE Based on Knowledge Structures
Thesauri are typical knowledge structures used in IR. They may be collection dependent or
independent. Further, they may be divided into general and domain specific, and into
indexing/searching and searching thesauri. The idea of a searching thesaurus is not a new one, but
it is especially attractive now that storage of the full texts of documents is common while
intellectual indexing is no longer widespread. Lancaster (1972, 150) proposed such a tool to
counsel a user in search key selection. Piternick (1984), Strong and Drott (1986), Bates (1986) and
Fidel and Efthimiadis (1995) discuss possible forms and use of conceptual models in IR. All
systems that include intermediary functions, especially presearch counselling, maintain a
conceptual knowledge structure, which might be a thesaurus (e.g., CITE, see Doszkocs 1983).
1 For a method for expansion key selection, see Hawking, Thistlewaite & Craswell 1997.
QE based on a statistical, collection dependent thesaurus
Jing and Croft (1994), and Callan, Croft & Broglio (1995) report on QE with an automatically
constructed association thesaurus (PhraseFinder thesaurus, see p. 13). The tests involved one small
collection and three large collections1, and two sets of 50 queries. In unexpanded queries
broadness was varied by using different parts of requests in query formulation. The retrieval model
was probabilistic matching, i.e., the test system was InQuery. The test thesauri were constructed
from samples of the test collections, and one from the whole (one large) collection. In QE nouns,
verbs, adjectives, adverbs, numerals, and different phrase combinations of these were added into
queries. The number of added words or phrases varied, and different weights for search keys and
expansion keys were tested.2 Obviously the queries were weakly structured (WSUM queries, see
p. 21). Jing and Croft (1994) report that a phrase-based thesaurus yielded better performance than
a word-based thesaurus, but both improved performance compared to unexpanded queries in a
small collection. Further, roughly the same improvement was obtained with a thesaurus based on a
sample as with a thesaurus based on the whole collection (tested in one large collection). The shorter
(original) queries achieved the greatest improvement with QE; however, their overall performance was
worse than the performance of the longer queries. The researchers conclude that although the
overall performance improved with QE, the optimal number of expansion keys varied from request
to request. In addition, it was difficult to adjust the weights for expansion keys. (Ibid., 13.) Callan
et al. (1995) confirm the results showing that QE by PhraseFinder improves performance overall
and with different document cut-off values (DCV3) in a large TIPSTER subcollection. QE was
more effective with short queries constructed from the concepts of the TREC requests4 only (see
Fig. 9), compared to long queries with all words from the requests.
The PhraseFinder thesaurus is based on expression co-occurrences in the whole collection,
whereas typical relevance feedback uses n top ranked documents as search key sources. Xu and
Croft (1996) combined these techniques into an approach they called local context analysis. Noun
phrases were selected on the basis of their co-occurrence with search keys in n top ranked passages
of 300 words. The phrases were ranked, and 70 top ranked phrases were added into a query. The
expanded query included the original query as one part and the expansion keys as another part.
Expansion keys were weighted according to their rank. The expanded query structure follows:
QEXPANDED = #WSUM(1.0 1.0 Q wQ’ Q’)   (21)
where wQ’ = 2, Q’ = #WSUM(1.0 w1 np1 w2 np2 ... w70 np70), wi = 1.0 – 0.9 * i/70, i = 1, ..., 70,
and np is a noun phrase. In other words, the first expansion key got the weight 0.99, and the last
the weight 0.1, and the whole expansion clause was weighted higher (by the weight 2.0) than the
clause with the original keys (by the weight 1.0).
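The rank-based weights of formula (21) can be computed directly (note that the formula gives w1 = 1.0 − 0.9/70 ≈ 0.987 for the top-ranked phrase, which the text rounds to 0.99):

```python
def lca_weights(n=70):
    """Expansion-key weights w_i = 1.0 - 0.9 * i / n for ranks i = 1..n,
    as in the local context analysis expansion clause above."""
    return [1.0 - 0.9 * i / n for i in range(1, n + 1)]

weights = lca_weights()
# weights[0] ≈ 0.987 (top-ranked phrase), weights[-1] ≈ 0.1 (rank 70)
```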
Xu and Croft showed that local context analysis was more effective than relevance feedback or
PhraseFinder-based QE: the unexpanded baseline average precision was 25.2%, PhraseFinder
based 26.0%, relevance feedback based 27.9%, and local context analysis based 31.1%
1 NPL (National Physical Laboratory collection), about 11,429 abstracts in the field of physics. TIPSTER collections,
240,000 – 740,000 documents (newswires and articles).
2 The reports are not very exact about these details.
3 DCV, document cut-off value indicates the number of documents in a retrieval set in best match retrieval. Precision
scores may be calculated at different DCVs, and then averaged over these points (see Hull 1996).
4 The TREC requests have different parts, the requests of TREC 1-3 being longest. Typically, title, description and
narrative parts are included, but the earlier requests have domain and concept fields as well. (See Voorhees & Harman
1997.)
respectively. The local context analysis approach was further tested in TREC-5 and -6 with similar
results by the research group of the University of Massachusetts (Allan et al. 1997a, b).
QE based on semantic relations
Voorhees (1994) expanded queries with WordNet, which includes types of semantic relations
similar to those of a thesaurus and is general in scope (see p. 11). Test requests were TREC-3
requests (Fig. 9) and the retrieval model was the vector space model. Three kinds of unexpanded
queries were constructed: (1) queries based on full TREC requests, (2) queries based on summary
statements1 and concepts, (3) queries based on summary statements only. Voorhees selected word
groups (synsets) that she found appropriate for QE on the basis of the request, thus, she
disambiguated polysemous words. Four expansion strategies were tested: expansion by synonyms
only; expansion by synonyms and all descendants in the is-a hierarchy; expansion by synonyms
and all parents in the is-a hierarchy; and expansion by synonyms and any synset directly related to
the given set of synonyms. Queries were vectors composed of n subvectors of different search key
types, e.g., original query words and expansion keys representing synonyms or different
hierarchical levels in WordNet. The similarity between a document vector and an extended query
vector was computed as the weighted sum of the similarities between the document vector and
each of the query’s subvectors:
sim(D,Q) = ∑i=1..n αi * (D • Si)   (22)
where D is document vector, Q = <S1, ...Sn> is query vector, Si (i = 1, ..., n) are subvectors, αi is a
weight reflecting the importance of the ith subvector Si, and • denotes the inner product of the two
vectors. Text keys and original search keys were weighted by typical vector space weights (see
Lee 1995), for expansion keys the normalisation factor of the original search keys was used. For
the original keys the weighting factor α was always 1, for different expansion key types α varied
from 0.1 to 2. In most cases the original search keys were given higher weights according to a
common assumption that user supplied keys are superior to expansion keys. (Voorhees 1994.) The
queries of the test had some structure, but not a concept structure.
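Formula (22) can be sketched as follows, with vectors as plain lists; in the actual experiments the weights would come from the vector space weighting schemes mentioned above, so the values below are illustrative only:

```python
def inner(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def extended_sim(document, subvectors, alphas):
    """sim(D,Q) = sum over i of alpha_i * (D . S_i), formula (22):
    the weighted sum of the similarities between the document vector
    and each of the query's subvectors."""
    return sum(a * inner(document, s) for a, s in zip(alphas, subvectors))

doc = [1.0, 0.5, 0.0]
original = [1.0, 0.0, 0.0]   # subvector of original search keys (alpha = 1.0)
expansion = [0.0, 1.0, 1.0]  # subvector of expansion keys (alpha = 0.5)
score = extended_sim(doc, [original, expansion], [1.0, 0.5])
# score == 1.0 * 1.0 + 0.5 * 0.5 == 1.25
```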
QE in Voorhees’ (1) and (2) type queries did not prove useful. For type (3) queries the
performance of expanded queries was significantly better than performance of the unexpanded
queries (35% improvement in the 11-point average precision), but the overall performance was
lower than the performance of unexpanded type (1) queries (39% decrease in the 11-point average
precision). (Voorhees 1994.) The impact of the broadness of the unexpanded query on the
effectiveness of QE is very clear in this study. The early TREC requests were quite long compared
to typical end-user requests or queries (Sparck Jones 1995; Lu & Keefer 1995). In TREC-4 the
requests were much shorter and the overall performance in that round dropped notably (Harman
1995; Voorhees & Harman 1997; Kwok & Grunfeld 1995).
1 According to Voorhees (1994), summary statements are shorter versions (single sentences) of requests. They are
frequently, but not always, identical to the description fields.
<dom> Domain: Medical & Biological
<title> Topic: RDT&E of New Cancer Fighting Drugs
<desc> Description:
Document will report on the research, development, testing, and evaluation (RDT&E) of a new anticancer drug developed anywhere in the world.
<narr> Narrative:
A relevant document will report on any phase in the world-wide process of bringing new cancer
fighting drugs to market, from conceptualization to government marketing approval. The laboratory
or company responsible for the drug project, the specific type of cancer(s) which the drug is
designed to counter, and the chemical/medical properties of the drug must be identified.
<con> Concept(s):
1. cancer, leukemia
2. drug, chemotherapy
Figure 9. The TREC request number 122 (Voorhees 1994, 64).
Jones, Gatford, Robertson, Hancock-Beaulieu and Secker (1995) reported on thesaurus-based QE
done by users. The test collection was the INSPEC database searched through the probabilistic
OKAPI system (see Robertson 1997). Twenty users entered their queries and obtained a list of best
matching terms from the INSPEC thesaurus. The users selected terms they found accurate as
expansion keys. Then, four versions were constructed and run for each query: (1) – original query
containing the keys from the user only, run as a text search, i.e., on all fields, (2) – controlled
query containing selected thesaurus terms only, run as a controlled vocabulary search on
descriptor fields only, (3) – query containing thesaurus terms only, run as a text search, (4) –
hybrid query containing original keys and thesaurus terms, run as a text search. The queries had a
weak structure. On average, the users saw 67 terms from which they selected 6.5 for QE. The top
20 documents of each query type were first pooled and judged for relevance by the users. The
overall performance of the queries (2) – (3) was slightly poorer than the performance of the
original query (1), while the hybrid query (4) had slightly better effectiveness. In 10 cases the
hybrid query yielded a better result than the original query, in 9 cases a worse result, in one case
the result was the same. An interesting observation was that QE had a reordering effect: the overlap in
top ranked result sets of the hybrid and original queries was very low. (Jones et al. 1995.)
QE with a domain specific searching thesaurus was tested by Kristensen and Järvelin (1990),
Kristensen (1993) and Järvelin et al. (1996a). The study by Kristensen and Järvelin was conducted
in an operational database of a newspaper archive, using Boolean retrieval. A small test thesaurus
was constructed for the study on the basis of words occurring in newspaper articles on economic
issues. Journalists’ requests were first formulated into (original) queries and then expanded by
expressions given by the test thesaurus. Queries were expanded in two phases: First, the keys of
the original query were expanded by disjunctions of the synonyms given by the searching
thesaurus without modifying the overall logic of the query. Second, the disjunctions in the
synonym queries were further expanded by disjunctions of related words given by the searching
thesaurus again without modifying the overall logic of the synonym query. The results of three
query types were analysed in terms of relative recall and precision by setting the relative
recall for the largest expansion queries at 100%. The average relative recall for the unexpanded
queries was 45% and for the synonym queries 82%. The average precision values for the three
query types were 51%, 41% and 33% respectively.
Kristensen (1993) confirmed these results in another QE study. The methodology was similar, but
the test thesaurus was larger, and hierarchical expansion (hyponyms) was also tested. The
environment was a database of 227,000 full text newspaper articles, with a Boolean retrieval
system. Kristensen reports doubling in recall with an 11% decline in precision (from 62.5% to
51.2%) for the best expansion type compared to unexpanded queries. The best expansion included
synonyms, hyponyms and related expressions of original search keys.
Järvelin et al. (1996a, b) tested QE with ExpansionTool (see Section 7.3) both with exact and
partial match techniques. The test was conducted in a database containing 54,000 Finnish
newspaper articles operated under the InQuery retrieval system (Version 1.6, both Boolean and
probabilistic matching techniques). The test involved ten requests. First, conceptual query plans
were constructed for each request. In an unexpanded or original query (oq), the concepts were
named using expressions from the request. Then, the queries were expanded cumulatively with
synonyms, hyponyms and related concepts given by the test thesaurus of ExpansionTool. In the
basic synonym queries (sq) the concepts of the requests were identified in the thesaurus and their
synonymous expressions were used in query construction. No conceptual expansion was
performed. In the narrower concept expanded queries (nq), the narrower concepts were first added
and then their synonymous expressions were used as above in query construction. In the
associative concept (aq) expanded queries, both the narrower concepts and the associative
concepts were added and queries were constructed as above. (Järvelin et al. 1996b, 46–47.)
For Boolean retrieval the queries were constructed with the block search strategy (conjunctive
normal form queries). For probabilistic matching, a query had the following structure:
#sum (#or (c1 #sum(c11 ... c1n)) ... #or(ck #sum(ck1 ... ckm)))
(23)
where ci (i = 1...k) are the original search keys (k ≥ 1) and cij (j ≥ 0) are the expansion keys for the
key ci.
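Constructing a query of form (23) from original keys and their expansion keys can be sketched as follows; the operator syntax follows the InQuery syntax quoted above, while the key lists (drawn from the concepts of Figure 9) are illustrative:

```python
# Sketch of building a query of form (23): each original key c_i is OR'ed
# with a #sum of its expansion keys c_i1 ... c_in, and the resulting clauses
# are combined with an outer #sum. The key lists are hypothetical examples.

def or_clause(original, expansions):
    """#or (c_i #sum (c_i1 ... c_in)); a key with no expansions stays bare."""
    if not expansions:
        return original
    return f"#or ({original} #sum ({' '.join(expansions)}))"

def sum_query(keys):
    """keys: list of (original_key, [expansion_keys]) pairs."""
    clauses = " ".join(or_clause(o, e) for o, e in keys)
    return f"#sum ({clauses})"

q = sum_query([("cancer", ["leukemia", "carcinoma"]),
               ("drug", ["chemotherapy"])])
print(q)
# #sum (#or (cancer #sum (leukemia carcinoma)) #or (drug #sum (chemotherapy)))
```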
The average number of concepts in the requests was 4.2, which gave 3.8 conjunctive facets for
conjunctive queries. The average number of search keys in different query types for the
conjunctive and probabilistic SUM queries were as follows: oq 4.2; sq 12.2; nq 21.6; aq 40.2.
(Järvelin et al. 1996b, 47.)
For Boolean retrieval the results show that the largest expansion (aq) increased recall
significantly, but decrease in precision was also significant, even if less so. If only the narrower
concept expansion was done, recall still improved significantly, but decrease in precision was
negligible. For the probabilistic queries, precision was averaged over given retrieved set sizes from
one to 100 documents for each query type. For document cut-off values from 20 to 100 the
performance of the largest expansion (aq) was the best. The differences both in recall and
precision were significant between the original queries and all expanded queries. However,
differences between expanded queries were not significant. (Järvelin et al. 1996b, 49–50.)
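Precision at a given retrieved set size (document cut-off value), the measure used above, can be sketched as follows, with an invented ranking and invented relevance judgements:

```python
# Precision at a document cut-off value (DCV): the fraction of relevant
# documents among the top-k retrieved. Ranking and relevance set are toy data.

def precision_at(ranking, relevant, k):
    """Precision of the top-k documents of a ranked result list."""
    top = ranking[:k]
    return sum(1 for d in top if d in relevant) / k

ranking  = ["d3", "d7", "d1", "d9", "d2"]   # ranked retrieval result
relevant = {"d1", "d2", "d3"}               # judged relevant documents

print(precision_at(ranking, relevant, 5))   # 3 relevant in top 5 -> 0.6
```

Averaging this value over cut-offs from 1 to 100 for each query type gives the figures reported by Järvelin et al.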
Kristensen (1996) tested thesaural QE with various query structures. A set of 30 requests was
formulated into Boolean structured queries and queries with the SUM operator (InQuery). These
queries were searched in the same test environment that Järvelin et al. (1996a) used. The
unexpanded queries were constructed and the expansions performed in the same manner as above.
Kristensen found that QE decreased performance at cut-off level of 20 documents with both types
of query structures. Expansion included synonyms, hyponyms, and expressions in associative
relations to original search keys. The average precision for unexpanded Boolean structured queries
was 51%, for unexpanded SUM queries 49%, for expanded Boolean structured queries 31%, and
for expanded SUM queries 44%. When the queries were unexpanded, the Boolean queries worked
better than the SUM queries, but with QE the result was the opposite, the SUM queries were more
effective. The broadness of unexpanded queries was on average 4.5, and of expanded queries 32.5.
These results confirm the effectiveness of Boolean structured queries compared to simple
probabilistic queries when no QE is used (see Turtle 1990). When Boolean queries are expanded
with a thesaurus, alternative search keys are added to each facet, that is, the number of disjuncted
keys in each facet increases. If expansion is large, the interpretation of the OR operator in InQuery
(system Version 1.6), and the use of default value 0.4 for search keys not present in a document
lead to a situation where the weight of the expanded facets (even if containing no document keys)
is high relative to the weights of other facets. Usually, not all facets are expanded with equally
many keys, and the non- or less expanded facets get more weight as a result of the interpretation of
the AND operator combining the facets. The AND operator is interpreted as multiplication of the
belief values of search keys – or other structure components of the query. When the number of
alternatives in OR clauses is low, Boolean queries work reasonably, as the results show.
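The weight drift described above can be verified with a small calculation. The sketch assumes the probabilistic interpretations discussed in the text (OR as the complement of the product of complements, AND as the product of beliefs, and a default belief of 0.4 for absent keys); the facet sizes are invented:

```python
# With OR interpreted as 1 - prod(1 - p_i) and a 0.4 default belief for
# absent keys, a heavily expanded facet scores high even when none of its
# keys occurs in the document. Facet sizes are illustrative.

def or_belief(beliefs):
    p = 1.0
    for b in beliefs:
        p *= (1.0 - b)
    return 1.0 - p

def and_belief(beliefs):
    p = 1.0
    for b in beliefs:
        p *= b
    return p

DEFAULT = 0.4  # belief assigned to a key not present in the document

# Facet A: unexpanded, its single key absent from the document.
# Facet B: expanded with 10 keys, all of them absent from the document.
facet_a = or_belief([DEFAULT])        # 0.4
facet_b = or_belief([DEFAULT] * 10)   # 1 - 0.6**10, roughly 0.994

print(facet_a, facet_b, and_belief([facet_a, facet_b]))
```

Facet B dominates the AND combination despite matching nothing, which is exactly the imbalance described above.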
Although the expanded SUM queries were more effective than expanded Boolean structured
queries, they were less effective than unexpanded SUM and Boolean queries. The reason is that in
ranking based on averages, documents that include expressions referring to one concept only may
obtain very high ranks. This is a problem if the request has many search concepts. The importance
of balance between search concepts has recently been noticed by other researchers, e.g., Hawking
and Thistlewaite (1995); Hawking, Thistlewaite and Craswell (1997); Buckley, Mitra, Walz and
Cardie (1998); Mitra, Singhal and Buckley (1998).
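The averaging problem can likewise be illustrated numerically; the belief values below are invented:

```python
# The SUM operator averages the belief values of its arguments; a document
# matching only one of two facets, but strongly, can outrank a document
# matching both facets moderately. Beliefs are illustrative.

def sum_belief(beliefs):
    return sum(beliefs) / len(beliefs)

one_facet_doc   = sum_belief([0.95, 0.40])  # strong on facet 1, default elsewhere
both_facets_doc = sum_belief([0.60, 0.60])  # moderate match on both facets

print(one_facet_doc > both_facets_doc)  # True: 0.675 > 0.6
```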
5.3
Summary of the Earlier QE Studies
Altogether, automatic QE based on relevance feedback has been shown to improve retrieval
performance, and it has become a standard approach1. QE based on thesauri has given mixed
results: general, collection independent thesauri have not been very successful; domain specific,
collection dependent thesauri have sometimes improved performance, sometimes impaired it. The
most successful have been the statistically constructed, collection dependent knowledge structures.
It seems that the conceptual QE based on semantic relations cannot give results as good as the
statistical approach. Are concepts and semantics to be abandoned in QE? Before the question can
be settled, the variables tested should be examined again.
Characteristics of requests and queries are usually not reported in detail. Requests seem to be of
the type ‘conscious topical’ (Ingwersen & Willett 1995), and often verbose, as in TREC. The
effect of this is that queries have usually been broad, since they are automatically produced from
requests. However, the broadness of original queries has varied across tests, and it has been
shown that shorter queries gain the most from QE (Voorhees 1994; Jing & Croft 1994; Callan et al.
1995; Lu & Keefer 1995). Only in some studies have requests been submitted to conceptual
analysis, e.g. Kristensen (1993, 1996), Järvelin et al. (1996a, b), Hawking and Thistlewaite (1995),
Hawking, Thistlewaite and Bailey (1997), and Hawking, Thistlewaite and Craswell (1997). Neither
in these studies nor in QE studies in general, however, have the effects of complexity, coverage and
broadness of queries been tested.
The structure of queries may be described as weak (queries without operators, e.g., vectors) or
strong (queries with operators, e.g., Boolean operators). Some evidence of better performance for
strongly structured queries exists, though without QE (e.g., Turtle 1990; Turtle & Croft 1991;
Belkin et al. 1995; Hull 1997).
1 Sparck Jones (1997, 7) calls this the generic tf*idf paradigm with relevance feedback refinement.
In QE studies, only Järvelin et al. (1996a, b) and Hawking,
Thistlewaite and Craswell (1997) have compared some types of weakly and strongly structured
queries, the latter having a concept-based structure. Queries with different types of operators (e.g.,
the Boolean operators and types of ‘probabilistic’ operators) have been combined (Belkin et al.
1993; 1994; 1995; Fox, Koushik, Shaw, Modlin & Rao 1993; Fox & Shaw 1994; Shaw & Fox
1995). This leads to complicated structures, but the method improves performance. Because
neither the number nor the overlap of search keys in different queries is reported, it is hard to
judge the possible effects of QE and broadness on performance.
Principles of search key weighting are different for different QE methods. Relevance feedback
often utilises tf*idf information from the retrieved documents. When knowledge structures, such as
thesauri, are used for QE, the weighting of search keys is more problematic. It has been claimed
that original search keys are more effective than expansion keys, and that weighting schemes
should reflect this (Fox 1980; Wang et al. 1985; Voorhees 1994). Both Fox and Voorhees used
constant weights, whereas Wang, Vandendorpe and Evens weighted expansion keys according to
similarity between a query and a test thesaurus (giving less weight to the word which has a long
list of associated words). In most of these tests, all expansion keys had lower weights than original
keys.1
Statistical, search result or collection-based QE methods have been prevalent in research.
Although concepts and conceptual relations are mentioned, they cannot truly be captured with
statistical methods, nor with automatic syntactical analysis, e.g., automatically constructed
thesauri have only one type of relation: association. Semantic relations have been tested in QE
using both general vocabularies (Voorhees 1994; Jones et al. 1995) and collection dependent
thesauri (Wang et al. 1985; Kristensen 1993; Järvelin et al. 1996a, b). Nevertheless, these studies
do not consider the interaction of query structure, complexity and expansion.
1 An exception is the test by Voorhees (1994). She tested several weights for the expansion sub-vectors, including
weights 1.0 and 2.0 while the weight of the original key vector was 1.0. These weights, however, were inferior to
weights smaller than 1.0.
6
Aims of the Study
6.1
Setting
In this study the retrieval performance of different query formulation methods in a probabilistic
retrieval system was analysed. The interaction and effects of the following variables were tested:
• the number of facets in a query
• the number of search concepts representing facets
• the number of search keys representing concepts
• QE with different semantic relationships
• the ways to combine weights in scoring, i.e., query structures.
Query formulation was based on facets, and QE was based on a thesaurus. First, concepts
representing requests were selected from the thesaurus; second, queries were formulated in which
the number of concepts and the number of search keys representing the concepts were varied by
QE, and the query structure was varied.
In this study requests represented information needs in a laboratory setting. Requests provided
facets representing the subject of a search. Queries were characterised by complexity, coverage,
broadness, and query structure. Complexity is the number of facets in a query. Facets were divided
into major and minor facets on the basis of their importance with respect to the request. Major
facets included the main search concepts, and were essential for formulating a reasonable query.
Minor facets contained additional aspects of the request which in Boolean block search strategy
would be used to reduce the size of the result set. Two complexity levels were tested: high
complexity refers to queries using all search facets identified from requests, low complexity was
achieved by formulating queries with major facets only.
Coverage is the number of concepts in a query, and the coverage of unexpanded queries was
given in the request, i.e., the search concepts were identified on the basis of requests. Broadness is
the total number of search keys in a query. Coverage and broadness were varied by different query
expansion types.
Query structure refers to the way scores are calculated from individual weights of the keys,
guided by operators. The structure of queries may be weak (queries with a single or no explicit
operator, no differentiated relations between search keys, except weights) or strong (queries with
several operators, different relationships between search keys). More precisely, strong query
structures are based on facets. Each facet indicates one aspect of the request, facets are represented
by a set of concepts, which, in turn, are expressed by sets of search keys.
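The facet-concept-key hierarchy described above can be sketched as a simple data structure; the class names and the example plan are hypothetical, with concept names borrowed from the test thesaurus:

```python
# A strong query structure as described above: a query plan is a list of
# facets, each facet holds a set of concepts, and each concept is expressed
# by a set of search keys. The example plan is illustrative only.

from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    keys: list  # expressions for the concept (term, synonyms, ...)

@dataclass
class Facet:
    major: bool                          # major vs. minor facet
    concepts: list = field(default_factory=list)

query_plan = [
    Facet(major=True, concepts=[
        Concept("nuclear power plant", ["nuclear power plant", "nuclear reactor"])]),
    Facet(major=False, concepts=[
        Concept("nuclear waste", ["nuclear waste", "radioactive waste"])]),
]

# Low-complexity query: keep major facets only.
low_complexity = [f for f in query_plan if f.major]
print(len(query_plan), len(low_complexity))  # 2 1
```

Dropping the minor facets, as in the last line, is exactly how the low complexity level is produced from a full query plan.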
Query formulation and expansion were based on a conceptual model, i.e., a searching
thesaurus. The thesaurus of the test includes the following types of semantic relationships:
synonymy, taxonymy (generic relations), meronymy (partitive relations). It also has instance
hierarchies and associative relations akin to relations defined in the ISO standard (1986).
DOCUMENT AXIS
Document characteristics
• language
• length
• text type (newspaper articles)
Document representation in DB
• index type (basic forms)
REQUEST AXIS
Request characteristics
• subject field
• complexity
• coverage
Request representation as queries
• conceptual analysis
- complexity
• search key selection
- unexpanded and expanded queries
- a searching thesaurus as QE source
- coverage and broadness varied by QE
• structural features of queries
- facets, concepts, phrases, single keys, operators, weights
IR TECHNIQUE AXIS
Query language
• operators
• weighting
Matching technique
• partial matching
Problem scope: Request → Query (structure, complexity, QE) → Matching & DB → Retrieval results, DCV → Precision, Recall, Relevance
Figure 10. Test variables.
Figure 10 illustrates the test variables. The varied controlled variables are (1) queries, in which
complexity, coverage, broadness and structure are varied; (2) the size of the retrieved document
sets (document cut-off value). The constant controlled variables are (a) searching thesaurus
consisting of concepts, expressions, keys, and semantic relations; (b) query language; (c) matching
technique. Others are concomitant variables: requests; database, which consists of documents with
varied length and topics, but constant language, and index, the type of which is constant. Recall
and precision of the retrieved document sets are the response variables.
This study relies on human identification of facets and concepts. Thus, it is in contrast to
statistical methods that handle requests as sets of expressions or strings and are employed in
automatic query formulation and relevance feedback. The model underlying the test does not
assume mediated searching; it supports any searcher in query formulation and expansion. The
searching thesaurus gives a searcher a view of a special domain, or an area of interest. The
searcher may then select the concepts that are of interest for his search, and the thesaurus supplies
the search keys corresponding to the concepts. The searching thesaurus supports conceptual QE.
However, the aim of the present study is to explore the co-effects of query complexity, broadness
and structure in general, not to compare different QE methods. Conceptual query formulation and
expansion gave us an opportunity to control the test variables systematically. This study was an
experiment conducted in a laboratory environment. This implies that assumptions had to be made
about information needs, requests and relevance, that is, about the user and the use.
6.2
Research Problems
This study seeks to answer the question: Are the effects of concept-based query expansion on
retrieval performance dependent on query structure and complexity level in a partial match
retrieval system? This main problem is divided into testable subproblems, which are presented
below.
The queries of end-users are often short, consisting of few search keys. Then, the number of
search facets – or concepts – is also low, and the facets may be considered principal facets in
relation to the information need. Subproblems 1 and 2 focus on the impact of QE on the
effectiveness of short queries with different structures.
1. Are there differences in performance within any weak query structure given that
complexity is low and the broadness of queries is systematically varied?
2. Are there differences in performance within any strong query structure given that
complexity is low and the broadness of queries is systematically varied?
In exact match (Boolean) retrieval, additional facets are used to improve precision. In partial
match retrieval the effects of varied complexity on performance are not well understood.
Subproblems 3 and 4 test the effects of high complexity with different structures and varied
broadness.
3. Are there differences in performance within any weak query structure given that
complexity is high and the broadness of queries is systematically varied?
4. Are there differences in performance within any strong query structure given that
complexity is high and the broadness of queries is systematically varied?
In Subproblem 5 the performance between different weak query structures and in Subproblem
6 between different strong query structures is compared when complexity and broadness are
varied.
5. Are there differences in performance between weak query structures when the
complexity and broadness of queries are systematically varied?
6. Are there differences in performance between strong query structures when the
complexity and broadness of queries are systematically varied?
Kristensen (1996) demonstrated that QE was not effective on queries with the Boolean
structure in a partial match system, but neither did it enhance queries with a weak structure. The
best performance for different structures was achieved with different broadness. In earlier studies,
however, Boolean structured queries have shown better performance than weakly structured
queries, but without QE. In Subproblems 7-10 the performance between the strong and weak query
structures is compared at different complexity levels and with different expansion types.
7. Is the performance of the strongly structured queries better than the performance of
the weakly structured queries when the queries are unexpanded and complexity is
low?
8. Is the performance of the strongly structured queries better than the performance of
the weakly structured queries when the queries are unexpanded and complexity is
high?
9. Is the performance of the strongly structured queries better than the performance of
the weakly structured queries when the queries are expanded and complexity is low?
10. Is the performance of the strongly structured queries better than the performance of
the weakly structured queries when the queries are expanded and complexity is
high?
The identification of facets is somewhat laborious and hard to automate. The identification of
concepts is simpler, though not trivial either. However, the naming of search concepts rather than
search facets would be a desirable option for many IR applications. Subproblem 11 tests the
effects of disregarding facets and recognising concepts only.
11. Are there differences in retrieval performance when the facet structure is reduced
to a concept-based structure?
Weighting is a typical way to express the importance of search keys and give a query a
structure. According to a common belief, expansion keys should be given lower weights than
original search keys. This is based on the idea that the original search keys are the most typical
expressions for the corresponding concepts. In the test setting the search concepts were named by
expressions which were believed to be their most common names in the search environment, i.e.,
in newspaper articles. In the test thesaurus these expressions were referred to as terms.
Subproblem 12 tests different weighting schemes for original and expansion keys.
12. Are there differences in retrieval performance between queries in which expansion
keys are given lower weights than original search keys and queries in which weights
are equal?
Subproblem 13 compares search key weighting with facet weighting. In the former, search keys
are weighted according to their semantic type (see Section 7.4), in the latter, major facets are given
higher weights than minor facets.
13. Does search key weighting yield better retrieval performance than facet weighting
with any QE type at any complexity level?
The relations given in a statistically constructed thesaurus and an intellectually constructed
thesaurus are likely to be different. Yet, it is not known whether differentiation between semantic
relations is useful in QE. The use of a searching thesaurus gives an opportunity to examine the
effectiveness of different semantic relations in QE.
14. Is any semantic expansion type more effective than the others?
7
Data and Methods
The test of this study was conducted in the Information Retrieval Laboratory of the Department of
Information Studies, University of Tampere. In this chapter the test setting, methods and data will
be described.
7.1
Retrieval System and Test Database
In this study searches were conducted in a partial match system. InQuery (Version 3.1) was chosen
for the test, because it has a wide range of operators, including interpretations of the Boolean
operators, and it allows search key weighting. Moreover, InQuery has shown good performance in
several tests (e.g., TREC 2–5, see Harman 1995 and Voorhees & Harman 1997; Allan et al. 1995;
Xu & Croft 1996). Details of the system are given in Section 3.2.
The test database contained 53,893 newspaper articles published in three Finnish newspapers in
1988-1992. The articles represented different sections of the newspapers, most articles addressed
economics, foreign and internal affairs. The average article length was 233 words, and typical
paragraphs were two or three sentences in length. The database index contained all keys in their
morphological basic forms. In the basic word form analysis, all compound words were split into
their component words in their morphological basic forms.
Many expressions that are typically prepositional in English are built by adding suffixes in
Finnish (e.g., talossammekaan, where the first four letters ‘talo’ are the basic form of the word
meaning ‘house’, ‘-ssa’ is the inessive case ending, ‘-mme’ is a possessive suffix, and ‘-kaan’ is an
enclitic particle meaning ‘[not] even’; the whole means ‘[not] even in our house’). (Karlsson 1983,
22–23.) Inflected word forms may be reduced to basic forms by morphological analysis. Thus, even
though Finnish has 14 noun cases, the searcher need not formulate all of them in his query, nor
worry about word truncation, which is difficult because the stem of a word may change in
inflection (e.g., yö, öisin, which mean ‘night, in the nights’). In compounding languages, for
instance Finnish and German, compound words are often written together (e.g., ydinvoimalaitos,
Atomkraftwerk, which mean nuclear power plant).
Compound word splitting makes all parts of the word searchable. (Alkula & Honkela 1992.) This
improves performance both in exact match and partial match retrieval.
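The indexing principle described above can be sketched as follows; the tiny lookup table stands in for the dictionary-based morphological analyser, and its entries (including the assumed decomposition of the compound) are purely illustrative:

```python
# Basic word form indexing with compound splitting, as described above:
# inflected forms are reduced to basic forms and compounds are split into
# the basic forms of their component words. The lookup table below is a
# hypothetical stand-in for a real morphological analyser.

ANALYSES = {
    "talossammekaan":  ["talo"],                     # '[not] even in our house'
    "öisin":           ["yö"],                       # 'in the nights' -> 'night'
    "ydinvoimalaitos": ["ydin", "voima", "laitos"],  # assumed compound split
}

def index_keys(word):
    """Return the index keys (basic forms of all compound parts) for a word."""
    return ANALYSES.get(word.lower(), [word.lower()])

print(index_keys("ydinvoimalaitos"))  # ['ydin', 'voima', 'laitos']
```

Because every component basic form becomes an index key, a query key such as laitos matches the whole compound, which is the query-expansion-like effect discussed below.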
Compound word splitting has also a query expansion effect. Many hierarchical relations in
Finnish are based on compound words (e.g., tehdas -> paperitehdas -> hienopaperitehdas which
means factory -> paper mill -> paper mill producing fine paper). When compound words are
split, the use of any part of the word as a search key will match all compound words that include
the part. Typically searchers might use heads (for instance, tehdas /mill) to get all the narrower
concepts. Usually there is no way to control whether the search key matches only heads or any
other compound part, i.e., modifiers, and thus, false matches occur. Nevertheless, as compound
words contain the search key, many of the matches are relevant. A single word may also be too
general in meaning, e.g. not all kinds of mills are wanted. If searching for combinations of parts,
rather than single parts only, is allowed, the precision problem is somewhat alleviated (e.g.
paperitehdas / paper mill instead of tehdas / mill). To some extent, compound word splitting
automatically introduces narrower concept expansion and associative concept expansion into
queries. This reduces the effects of these expansion types.
The morphological analysis is dictionary-based; thus, not all text words are recognised.
Typically, these unrecognised words are foreign words (e.g., the names Bildt, Untag) or typos.
Unrecognised words were stored in the database index in the form in which they occurred in the
text. In the index these words were marked with @. For example, the name Bildt appeared in the
index with the following entries: @bildt, @bildtiä, @bildtiin, @bildtillä, @bildtille, @bildtiltä,
@bildtin, @bildtistä, i.e., in its various inflected forms. In query formulation this was taken into
account by representing unrecognised search keys by the strings that appeared in the index.
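The handling of unrecognised words can be sketched in the same spirit; the dictionary below is hypothetical:

```python
# Out-of-dictionary words are stored in their surface form, prefixed with
# '@', as described above. KNOWN is a hypothetical stand-in for the
# analyser's dictionary of recognised basic forms.

KNOWN = {"talo", "yö"}

def index_entry(word):
    """Index a word: basic form if recognised, '@' + surface form otherwise."""
    w = word.lower()
    return w if w in KNOWN else "@" + w

print(index_entry("Bildt"))  # '@bildt'
print(index_entry("talo"))   # 'talo'
```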
7.2
Requests and Conceptual Query Plans
In the Laboratory there is a collection of 35 requests. A set of requests represents a selection1 of
information needs. The requests cover topics about economics, foreign and internal affairs,
matching the scope of the database. Further, the requests are divided into subject (8 requests),
person (5 requests), organisation (12 requests) and geographically oriented requests (10). (Sormunen 1994.) For this study 30 requests were selected on the basis of their expandability, i.e., they
provided possibilities for studying the interaction of the test parameters (see Appendix 1). One
geographic request, one person, and three organisation requests were abandoned, because their
most essential concepts were proper names, which usually do not have many associated concepts
or synonyms. The complexity of these requests was also low, i.e., they had only one expandable
facet.
A facet analysis of the requests was accomplished in the study by Sormunen (1994), resulting
in conceptual query plans for the requests, in which major and minor facets were identified. It
seems evident that the importance of facets varies. The assumption that searchers identify
major and minor facets consistently is also plausible (e.g., Harter 1990; Iivonen 1995). In the
present study the concepts for facets were chosen from the thesaurus. The earlier conceptual query
plans of the requests were not used as such but as guidelines for the selection of facets and
concepts. That was because of the problems in syntactic factoring which led to concepts not
included in the thesaurus. Hence, the researcher identified facets from the requests on the basis of
the earlier plans, and selected concepts to the facets from the thesaurus in accordance with the
requests (see Appendix 2). Real searchers were not involved in the query formulation because this
study seeks to test the effect of the structural and representation parameters on retrieval
performance, not to find out how real searchers would use the thesaurus, nor to evaluate the
performance of their queries.
1 These requests, like test requests generally, are not a random sample but a selection with various characteristics,
which are not controlled in the test.
7.3
Test Thesaurus and ExpansionTool
Thesaurus representation in ExpansionTool
The test thesaurus was constructed to be operated under the ExpansionTool software. The
thesaurus representation in ExpansionTool is described by Järvelin et al. (1996b). Here a sample is
given of the ExpansionTool thesaurus of this study.
The thesaurus is represented in a relational database, which is in the third normal form (e.g.,
Ullman 1988). Concepts are represented in the relation CONCEPTS (Table 2) which has the
attributes CNO (concept number), CNAME (concept name), and CATEGORY (concept category).
Concept categories are chosen according to the application domain of the thesaurus. In the test
thesaurus for newspaper articles, the concept categories are individuals (ind), professions (prof),
organisations (org), abstract (abstr) or physical (phys) objects, processes/events (proc_event),
geographical names/places (geog)1. The total number of concepts in the test thesaurus is 832.
CONCEPTS
CNO   CNAME                 CATEGORY
c1    nuclear power plant   phys
c2    nuclear reactor       phys
c3    nuclear power         phys
c4    radioactive waste     phys
c5    nuclear waste         phys
c6    high-active waste     phys
c7    low-active waste      phys
c8    fission product       phys
c9    spent fuel            phys
c10   storage               phys
c11   repository            phys
c12   process               proc_event
c13   refine                proc_event
c14   treat                 proc_event
Table 2. The relation CONCEPTS.
The relationships between concepts and their expressions are given in the link relation
CONS_EXPRS (Table 3) which has the identifier attributes CNO and ENO, as well as the attributes
ETYPE (expression type) and STRENGTH. The attribute ETYPE defines one expression as a
preferred term for the concept (term). The other expressions for the concept are entry terms and
are described as synonyms (syno), quasi-synonyms (qsyn), ellipses (ellip), acronyms (acro),
common names (comn), antonyms (anto), corresponding adjectives (adje), or corresponding verbs
(verb). These are based on lexical relationships recognised in the ISO standard (ISO 1986) for
thesaurus construction, except the verbs and adjectives.2 The latter are included in the thesaurus
because they have proved useful in text searching (Jackson 1983; Kristensen 1992). The attribute
STRENGTH specifies the strength of the association between the concept and the expression as a
1 The categories of the thesaurus are not used in this study, but are briefly explained here, since they are essential for
understanding the thesaurus representation.
2 For semantic relations and thesaurus relations, see also Sections 2.2 and 2.3.
real number in the interval [0, 1]. The strength figures were judged by the researcher, but they could
also be based on statistics or relevance feedback.
CONS_EXPRS
CNO    ENO     ETYPE   STRENGTH
c1     e10     term    1.0
c1     e11     syno    0.9
c1     e12     syno    0.9
c2     e20     term    1.0
c2     e21     syno    0.9
c3     e30     term    1.0
c3     e31     qsyn    0.9
c4     e40     term    1.0
c5     e50     term    1.0
c6     e60     term    1.0
c7     e70     term    1.0
c8     e80     term    1.0
c9     e90     term    1.0
c10    e100    term    1.0
c10    e101    qsyn    0.8
c10    e102    qsyn    0.8
c11    e110    term    1.0
c12    e120    term    1.0
c13    e130    term    1.0
c14    e140    term    1.0
Table 3. The relation CONS_EXPRS.
In the test thesaurus terms always have the strength value 1.0. The strength values of other
expressions reflect their reliability as the representatives of the concept in relation to the term. It is
noteworthy that this model allows polysemy and homonymy at the linguistic level: expressions
may denote several concepts.1 This can be useful if, say, storage is needed both as a computer
science term and as a commerce term. The hierarchies and associative relations of the corresponding
concepts will then differ. For the user the difference in meaning can be explained
through a scope note or through the relations of the word.
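Since CONS_EXPRS is a many-to-many link relation, the same expression number may in principle be linked to several concepts. A minimal sketch of this in Python (the concept numbers c200 and c300 and the helper concepts_of are hypothetical illustrations, not part of the test thesaurus, which does not use this feature):

```python
# Hypothetical polysemy: the expression e100 ('storage') linked to two
# concepts, e.g. a commerce concept and a computer science concept.
CONS_EXPRS = [
    ("c200", "e100", "term", 1.0),  # c200: storage (commerce), hypothetical
    ("c300", "e100", "term", 1.0),  # c300: storage (computing), hypothetical
]

def concepts_of(eno):
    """All concepts that the expression eno may denote."""
    return [cno for cno, e, _etype, _strength in CONS_EXPRS if e == eno]

print(concepts_of("e100"))  # -> ['c200', 'c300']
```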
Concept expressions are represented in the relation EXPRESSIONS (Table 4) with two
attributes ENO (expression number) and EXPRESSION (the expression). The number of
expressions in the thesaurus is 1,345. The expressions, given in the example of Table 4, are
translations from Finnish expressions, and thus may not be quite accurate. However, in this case
proper English wording is not the primary focus.
1 In the test thesaurus this feature is not used.
EXPRESSIONS
ENO     EXPRESSION
e10     nuclear power plant
e11     nuclear station
e12     atomic power plant
e20     nuclear reactor
e21     atomic reactor
e30     nuclear energy
e31     atomic energy
e40     radioactive waste
e50     nuclear waste
e60     high-active waste
e70     low-active waste
e80     fission product
e90     spent fuel
e100    storage
e101    store
e102    stock
e110    repository
e120    process
e130    refine
e140    treat
Table 4. The relation EXPRESSIONS.
The matching models are represented in the relation STRINGS (Table 5). Each matching model
represents, in a query-language independent way, how the expression may be matched in
document texts or database indices built in various ways. The relation has three attributes SNO
(string number), STYPE (string type), and STRING. The string type indicates whether the STRING
attribute gives a matching model for a morphological basic word form (bw) or for a stem of
inflected forms (st). The idea here is that the database index may contain either morphological
basic forms or inflected words as they appear in the text.1 In the test thesaurus only basic word
forms are given. The attribute STRING gives a matching model for the expression in a way
independent of retrieval systems. A matching model is either a string pattern or a string constant.
The number of matching models in the thesaurus is 1,558.
1 See Section 4.2 – String level, p. 36.
STRINGS
SNO     STYPE   STRING
s100    bw      phra(3, <bw(nuclear), bw(power), bw(plant)>)
s101    bw      prox(3, <bw(nuclear), bw(power), bw(plant)>, 3)
s110    bw      phra(2, <bw(nuclear), bw(station)>)
s111    bw      prox(2, <bw(nuclear), bw(station)>, 3)
...     ...     ...
s400    bw      phra(2, <bw(radioactive), bw(waste)>)
s401    bw      prox(2, <bw(radioactive), bw(waste)>, 3)
s500    bw      phra(2, <bw(nuclear), bw(waste)>)
s501    bw      prox(2, <bw(nuclear), bw(waste)>, 3)
s600    bw      phra(2, <cw(<bw(high), bw(active)>), bw(waste)>)
s601    bw      prox(2, <cw(<bw(high), bw(active)>), bw(waste)>, 3)
s700    bw      phra(2, <cw(<bw(low), bw(active)>), bw(waste)>)
s701    bw      prox(2, <cw(<bw(low), bw(active)>), bw(waste)>, 3)
s800    bw      phra(2, <bw(fission), bw(product)>)
s801    bw      prox(2, <bw(fission), bw(product)>, 3)
s900    bw      phra(2, <bw(spend), bw(fuel)>)
s901    bw      prox(2, <bw(spend), bw(fuel)>, 3)
s1000   bw      bw(storage)
s1010   bw      bw(store)
s1020   bw      bw(stock)
s1100   bw      bw(repository)
s1200   bw      bw(process)
s1300   bw      bw(refine)
s1400   bw      bw(treat)
Table 5. The relation STRINGS.
The matching model language has the following characteristics:
• “Matching of basic words by their morphological basic forms. This is expressed by bw(word), when word is a morphological basic form (e.g., bw(accident)).
• Matching of compound words by their morphological basic forms. The basic form matching models take into account that the database index may or may not recognize the compound word components. Thus the matching models are able to generate both whole compound word in the basic form and each of its components. In many languages, including Finnish, the component words may occur in an inflected form in the compound but are in the basic form if split by the analysis program. Thus the compound word basic forms are expressed by cw(<comp1, ..., compn>), when comp1, ..., compn are the component words in the correct order. Components which may inflect in the compound (genitive in Finnish) are modelled by iw(basicform, inflform), where basicform and inflform are the basic and inflected forms respectively. Other components are modelled by the basic word form construct. For example, "shipyard" and "plywood" would be modelled by cw(<bw(ship), bw(yard)>) and cw(<bw(ply), bw(wood)>). The Finnish word "puun|jalostus|teollisuus" (for wood processing industry, "|" indicating components) has a genitive in the first component. The model is thus cw(<iw(puu, puun), bw(jalostus), bw(teollisuus)>). Now both the compound form "puunjalostusteollisuus" and the component form "puu" + "jalostus" + "teollisuus" can be constructed automatically.
...
• Matching of phrases with a defined word order through morphological basic forms. The model is phra(CompCount, Comps), where CompCount indicates the number of components of the phrase, Comps gives the component in their required order. The components may be stems, basic words or compound words. For example, ‘information retrieval’ would be modelled by phra(2, <bw(information), bw(retrieval)>) in this order and no intervening words.
• Matching of words in a defined order, with intervening words allowed, through morphological basic forms or stems. The model is prox(CompCount, Comps, Distance), where CompCount indicates the number of components of the phrase, Comps gives the component in their required order, and Distance gives the allowed distance (in words) between the components. The components may be stems, basic words or compound words. For example, ‘information retrieval’ would be modelled by prox(2, <bw(information), bw(retrieval)>, 3) in this order and with distance of 0 - 3 words.
• Matching of more general expressions with required word adjacency (but not order) through morphological basic forms or stems. The model is adj(CompCount, Comps, Distance), with components as above. For example, information retrieval could be modelled by adj(2, <bw(information), bw(retrieval)>, 4) to match e.g. "information organization, storage and retrieval".”1
(Järvelin et al. 1996b, 9–10)
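The matching models translate mechanically into any concrete query syntax. The following Python sketch illustrates the idea for a basic-form index with compounds split; the target syntax (quoted phrases, a NEAR/n proximity operator) is an assumption for illustration only, not the notation of any particular retrieval system used in the study:

```python
# Illustrative matching-model constructors and a translator into a
# hypothetical query syntax. Not part of ExpansionTool.

def bw(word):
    """Matching model for a morphological basic word form."""
    return {"type": "bw", "word": word}

def cw(components):
    """Compound word model: here rendered as its split components."""
    return {"type": "cw", "components": components}

def phra(n, components):
    """Phrase model: n components in order, no intervening words."""
    return {"type": "phra", "n": n, "components": components}

def prox(n, components, distance):
    """Proximity model: n components in order, up to `distance` words apart."""
    return {"type": "prox", "n": n, "components": components,
            "distance": distance}

def translate(model):
    """Translate a matching model into a hypothetical query-language string."""
    t = model["type"]
    if t == "bw":
        return model["word"]
    if t == "cw":
        # For an index with compounds split, match the component sequence.
        return " ".join(translate(c) for c in model["components"])
    if t == "phra":
        parts = " ".join(translate(c) for c in model["components"])
        return f'"{parts}"'                           # exact phrase
    if t == "prox":
        parts = " ".join(translate(c) for c in model["components"])
        return f'NEAR/{model["distance"]}({parts})'   # ordered proximity
    raise ValueError(f"unknown model type: {t}")

print(translate(phra(2, [bw("radioactive"), bw("waste")])))
print(translate(prox(2, [bw("nuclear"), bw("waste")], 3)))
```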
The relationships between expressions and matching models are given in the link relation
EXPR_STRINGS (Table 6), which has the identifier attributes ENO and SNO, and the attribute
RELIABILITY. The attribute RELIABILITY gives the matching model’s reliability in matching the
intended expression as a real number in the interval [0, 1]. In the test thesaurus strings are either
strict (‘precision oriented’) with the value 1.0, or loose (‘recall oriented’) with a value less than 1.0.
Unexpanded queries consisted of strict matching models; loose matching models were used in QE.
Concept relationships are either association relationships or hierarchic relationships. They are
never synonym relationships, because synonymy holds between expressions, not between concepts.
The relation ASSOCIATIONS (Table 7) represents the association relationships between concepts.
It has the attributes CNO and ASS_CNO, containing pairs of a concept and an associated concept,
as well as the attributes ASS_TYPE (association type) and ASS_STRENGTH (association strength).
The following association types specify how the concepts are associated: sibling concepts (sibl),
familial relationship (fami), process relationship (actor-act-acting-instrument-object-place; proc),
cause and consequence (cause), and antonyms (anto).2 The values of association strength were
based on the judgements of the researcher.
1 The sample thesaurus does not include all matching model types. There are no examples of matching models for
adjacent words without a specified order. However, these are analogous in structure to the proximity matching
models.
2 Associative relationships are discussed in Section 2.3, pp. 15-16.
EXPR_STRINGS
ENO     SNO     RELIABILITY
e10     s100    1.0
e10     s101    0.9
e11     s110    1.0
e11     s111    0.9
e40     s400    1.0
e40     s401    0.8
e50     s500    1.0
e50     s501    0.9
e60     s600    1.0
e60     s601    0.9
e70     s700    1.0
e70     s701    0.9
e80     s800    1.0
e80     s801    0.9
e90     s900    1.0
e90     s901    0.9
e100    s1000   1.0
e101    s1010   0.9
e102    s1020   1.0
e110    s1100   1.0
e120    s1200   1.0
e130    s1300   1.0
e140    s1400   1.0
Table 6. The relation EXPR_STRINGS.
In the relation ASSOCIATIONS (Table 7) the concepts c1 (nuclear power plant) and c3 (nuclear
power) are associated in the process relationship with the strength 0.8, the concepts c4 (radioactive
waste) and c8 (fission product) are associated on the basis of their familial relation with the
strength 0.7, and the concepts c6 (high-active waste) and c7 (low-active waste) are siblings with the
strength 0.8. Each associated concept pair is stored twice, the only difference being the position
of the concepts. This is done in order to simplify queries – there is no need to check both attributes
for a concept. It also makes it possible, if needed, to make association strengths asymmetric.
ASSOCIATIONS
CNO    ASS_CNO   ASS_TYPE   ASS_STRENGTH
c1     c3        proc       0.8
c3     c1        proc       0.8
c4     c8        fami       0.7
c8     c4        fami       0.7
c4     c9        fami       0.6
c9     c4        fami       0.6
c6     c7        sibl       0.8
c7     c6        sibl       0.8
c12    c13       fami       0.5
c13    c12       fami       0.5
c12    c14       fami       0.6
c14    c12       fami       0.6
Table 7. The relation ASSOCIATIONS.
Representation of hierarchic relationships is based on binary relations partitioned according to
concept categories (e.g., organisations, processes, abstract or physical objects, etc.) and hierarchy
types (i.e., generic, partitive and instance relationships).1 In the sample thesaurus there are a
partitive relationship and a generic relationship for physical objects, and a generic relationship for
processes and events. In Table 8 the concept c2 (nuclear reactor) is given as a partitive narrower
concept, a part, of the concept c1 (nuclear power plant). The concept c5 (nuclear waste) is a generic
narrower concept of the concept c4 (radioactive waste), and has ‘low-active’ and ‘high-active
waste’ as narrower concepts. ‘Repository’ (c11) is a narrower concept of ‘storage’ (c10).
PHYS_PARTITIVE
BROADER_CONS    NARROWER_CONS
c1              c2

PHYS_GENERIC
BROADER_CONS    NARROWER_CONS
c4              c5
c5              c6
c5              c7
c10             c11
Table 8. The relations PHYS_PARTITIVE and PHYS_GENERIC.
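To summarise the schema, the relations above can be mirrored as in-memory structures and joined to collect every expression denoting a concept. A Python sketch (the function name expressions_of and the tuple/dictionary encoding are illustrative assumptions; data abbreviated from Tables 3 and 4):

```python
# Fragments of Tables 3 and 4 as in-memory relations.
CONS_EXPRS = [  # (CNO, ENO, ETYPE, STRENGTH)
    ("c4", "e40", "term", 1.0),
    ("c10", "e100", "term", 1.0),
    ("c10", "e101", "qsyn", 0.8),
    ("c10", "e102", "qsyn", 0.8),
]
EXPRESSIONS = {"e40": "radioactive waste", "e100": "storage",
               "e101": "store", "e102": "stock"}

def expressions_of(cno):
    """Join CONS_EXPRS with EXPRESSIONS: every expression denoting cno,
    with its expression type and strength."""
    return [(EXPRESSIONS[eno], etype, strength)
            for c, eno, etype, strength in CONS_EXPRS if c == cno]

print(expressions_of("c10"))
# -> [('storage', 'term', 1.0), ('store', 'qsyn', 0.8), ('stock', 'qsyn', 0.8)]
```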
Thesaurus organisation
The following formalisation of the thesaurus is based on the set-theoretic description of a
thesaurus database by Sintichakis and Constantopoulos (1997), and notations and formal
representation conventions by Järvelin and Niemi (1993). The formalisation offers a compact and
exact way to understand the query expansion process in ExpansionTool. However, it is
simplified: not all details of the application (e.g., strength, reliability, and association
strength figures) are given.
The thesaurus consists of concepts, expressions and matching models, and relations between
these objects, as shown above in the thesaurus sample. The concepts form a set C = {c1, c2, ..., cn}.
The domain of the set is denoted by C = string. The set of expressions is denoted by EXP = {e1,
e2, ..., en}. Expressions are divided into two sets, a set of terms T = {t1, t2, ..., tn} and a set of non-term expressions NT = {nt1, nt2, ..., ntm}, T ⊂ EXP, NT ⊂ EXP, and T ∩ NT = ∅. The domain of
all expressions is denoted by EXP = string. The set of matching models is denoted by MM =
{mm1, mm2, ..., mmp}. The set of strict matching models is SM = {sm1, sm2, ..., smq}. Strict
matching models are a subset of the set of matching models, SM ⊂ MM. The domain of all
matching models is denoted by MM = string.
Notational convention 1. n-tuples are denoted between angle brackets, e.g., <a, b, c>.
Definition 1. Each concept has a principal name, a term, in the thesaurus. The concepts are
mapped to the terms through a bijection. Function c-term maps each concept to its term. This
function is given by enumerating its constituent concept and term identifier pairs:
c-term: C → T
c-term = {<c1, t1>, <c2, t2>, ..., <cn, tn>}, ci ∈ C ∧ ti∈ T.
1 Hierarchical relationships are discussed in Section 2.2, pp. 11-13.
Definition 2. Expressions are represented by matching models. Some of the models are more
reliable than the others, i.e., these strict models do not match occurrences of many other
expressions. The function e-strict links each expression to its strict matching models. Let ei be
any expression and smij (j ≥ 1) be ei’s corresponding strict matching models, then
e-strict: EXP → P(SM)
e-strict = {<e1, {sm11, ..., sm1n}>, <e2, {sm21, ..., sm2m}>, ..., <en, {smn1, ..., smnk}>}, ei ∈
EXP ∧ smij ∈ SM.
Definition 3. In addition to strict matching models, an expression may have other, less reliable
matching models, i.e., these models may match occurrences of other expressions as well. Let ei be
any expression and mmij (j ≥ 1) be any matching model of ei. All matching models of an
expression are obtained by the function:
e-all: EXP → P(MM)
e-all = {<e1, {mm11, ..., mm1n}>, <e2, {mm21, ..., mm2m}>, ..., <en, {mmn1, ..., mmnk}>},
ei ∈ EXP ∧ mmij ∈ MM.
The generalisation and association relations are concept relations, and equivalence is a relation
between terms and non-term expressions.
Definition 4. The generalisation relation includes all hierarchical relations between concepts.
Let c and x be concepts, {c, x} ⊆ C. The relation GEN ⊆ C × C is a binary relation. In each tuple
<c, x> ∈ GEN, x denotes the hierarchically broader concept of c.
Definition 5. Let GEN be the generalisation relation, and c, x and y be concepts, {c, x, y} ⊆ C.
The transitive closure of GEN, GEN+, is the relation obtained by recursive application of the
following rules:
1° if <c, x> ∈ GEN, then <c, x> ∈ GEN+.
2° if <c, x> ∈ GEN+ and <x, y> ∈ GEN+, then <c, y> ∈ GEN+.
Definition 6. Let GEN be the generalisation relation, and c and x be concepts, {c, x} ⊆ C. All
broader concepts of c are obtained by the function:
c-broad: C × GEN → P(C)
c-broad (c, GEN) = {x | <c, x> ∈ GEN+}.
Definition 7. Let GEN be the generalisation relation, and c, x be concepts, {c, x} ⊆ C. All
narrower concepts of x are obtained by the function:
c-narr: C × GEN → P(C)
c-narr (x, GEN) = {c | <c, x> ∈ GEN+}.
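The transitive closure and the functions c-broad and c-narr of Definitions 5-7 can be sketched directly. In the Python illustration below, GEN holds the Table 8 hierarchy as <narrower, broader> pairs, following the orientation of Definition 4; the set-of-tuples encoding is an assumption for illustration:

```python
# Definitions 5-7: GEN, its transitive closure GEN+, c-broad and c-narr.
# Pairs are <narrower, broader>; data from Table 8.
GEN = {("c2", "c1"), ("c5", "c4"), ("c6", "c5"), ("c7", "c5"), ("c11", "c10")}

def gen_plus(gen):
    """Transitive closure: apply rule 2 of Definition 5 to a fixed point."""
    closure = set(gen)
    while True:
        new = {(c, y) for c, x in closure for x2, y in closure if x == x2}
        if new <= closure:
            return closure
        closure |= new

def c_broad(c, gen):
    """All transitively broader concepts of c (Definition 6)."""
    return {x for cc, x in gen_plus(gen) if cc == c}

def c_narr(x, gen):
    """All transitively narrower concepts of x (Definition 7)."""
    return {c for c, xx in gen_plus(gen) if xx == x}

assert c_broad("c6", GEN) == {"c5", "c4"}
assert c_narr("c4", GEN) == {"c5", "c6", "c7"}
```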
Definition 8. The association relation includes the associative relations between concepts. Let c
and z be concepts, {c, z} ⊆ C. The relation ASS ⊆ C × C is a binary relation. In each tuple <c, z>
∈ ASS, c and z denote associatively related concepts. ASS is symmetric, thus if <c, z> ∈ ASS, then
<z, c> ∈ ASS.
Definition 9. Let ASS be the association relation, and let c and z be concepts, {c, z} ⊆ C. The
set of concepts associated with the concept c is given by the function:
c-asso: C × ASS → P(C)
c-asso(c, ASS) = {z | <c, z> ∈ ASS}.
Definition 10. The equivalence relation SYN ⊆ T × P(NT) is a binary relation between terms and
their synonymous expressions. SYN = {<t1, {nt11, ..., nt1n}>, <t2, {nt21, ..., nt2m}>, ..., <tn, {ntn1, ...,
ntnk}>}. Let <t, SN> ∈ SYN, then t denotes a term and SN is a set of the synonymous non-term
expressions of t.
Definition 11. Let SYN be the equivalence relation, and t ∈ T any term. The set of synonymous
expressions of t, SN, is obtained by the function:
t-syns: T × P(T × P(NT)) → P(NT)
t-syns(t, SYN) = SN, when <t, SN> ∈ SYN.
Definition 12. Let c ∈ C be any concept, c-term be the bijection connecting the concepts to their
corresponding terms, and SYN be the equivalence relation. The set of expressions denoting the
concept c is obtained by the function:
c-expr: C × P(C × T) × P(T × P(NT)) → P(EXP)
c-expr(c, c-term, SYN) = {c-term(c)} ∪ t-syns(c-term(c), SYN)
Notational convention 2. Let f: X→Y be any function. Then dom(f) = X denotes the domain of f
and rng(f) = Y the range of f.
Definition 13. The thesaurus is a tuple THESAURUS = <c-term, e-strict, e-all, SYN, GEN, ASS>
where c-term, e-strict and e-all are functions defined above, and SYN, GEN, ASS are the
equivalence, generalisation, and association relations of THESAURUS respectively, defined above.
The relationships of the thesaurus components can be defined as follows:
1° ∀<x, y> ∈ GEN : {x, y}⊆ dom(c-term)
– all concepts in the generic relationship are concepts of the thesaurus.
2° ∀<x, y> ∈ ASS : {x, y}⊆ dom(c-term)
– all concepts in the associative relationship are concepts of the thesaurus.
3° ∀<t, s> ∈ SYN : t ∈ rng(c-term)
– all terms in the synonym relationship are terms of the thesaurus.
4° Hierarchies in GEN must be acyclic. This can be defined by the requirement that no
concept is a broader concept of itself:
∀c ∈ C: c ∉ c-broad(c, GEN).
5° The concept relations GEN and ASS should be pairwise disjoint, i.e., two concepts related
through association should not have hierarchical relation, and concepts related through
generalisation / specialisation should not be related through association:
GEN+ ∩ ASS = ∅.
6° A concept is not allowed to be related to itself through any type of relationship:
∀c ∈ C: {c} ∩ c-broad(c, GEN) = {c} ∩ c-narr(c, GEN) = {c} ∩ c-asso(c, ASS) =
∅.
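Constraints 4°-6° are mechanically checkable over the relations. A sketch (the validate function is illustrative, assuming GEN and ASS encoded as sets of concept pairs in the <narrower, broader> orientation of Definition 4):

```python
# Checks constraints 4-6: acyclic hierarchies, GEN+ and ASS disjoint,
# and no concept associated with itself.

def gen_plus(gen):
    """Transitive closure of GEN (Definition 5)."""
    closure = set(gen)
    while True:
        new = {(a, c) for a, b in closure for b2, c in closure if b == b2}
        if new <= closure:
            return closure
        closure |= new

def validate(gen, ass):
    gp = gen_plus(gen)
    assert all(c != x for c, x in gp), "4: hierarchy contains a cycle"
    assert not (gp & ass), "5: GEN+ and ASS are not disjoint"
    assert all(c != z for c, z in ass), "6: concept associated with itself"

# Sample data from the thesaurus example.
GEN = {("c5", "c4"), ("c6", "c5"), ("c7", "c5")}
ASS = {("c4", "c8"), ("c8", "c4")}
validate(GEN, ASS)  # passes: the sample data satisfies the constraints
```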
Notational convention 3. Let s be any finite n-tuple (of length n). The selector function σi, (i =
1, ..., n), selects the ith component of s. For example, if s = <a, b, {3,7}>, then σ3(s) = {3, 7}.
Now, σ1(THESAURUS) = c-term, ..., σ6(THESAURUS) = ASS. For simplicity, c-term, e-strict, e-all, SYN, GEN, and ASS will be used below to denote the thesaurus components.
A sample thesaurus, used in examples below, follows:
THESAURUS-1 = <{<c4, t40>, <c5, t50>, <c6, t60>, <c7, t70>, <c8, t80>, <c9, t90>, <c10,
t100>, <c11, t110>, <c12, t120>, <c13, t130>, <c14, t140>},
{<t40, {phra(2, <bw(radioactive), bw(waste)>)}>,
<t50, {phra(2, <bw(nuclear), bw(waste)>)}>,
<t60, {phra(2, <cw(<bw(low), bw(active)>),
bw(waste)>)}>,
<t70, {phra(2, <cw(<bw(high), bw(active)>),
bw(waste)>)}>,
<t80, {phra(2, <bw(fission), bw(product)>)}>,
<t90, {phra(2, <bw(spend), bw(fuel)>)}>,
<t100, {bw(storage)}>, <nt101, {bw(store)}>,
<nt102, {bw(stock)}>, <t110, {bw(repository)}>,
<t120, {bw(process)}>, <t130, {bw(refine)}>,
<t140, {bw(treat)}> },
{<t40, {phra(2, <bw(radioactive), bw(waste)>),
prox(2, <bw(radioactive), bw(waste)>, 3)}>,
<t50, {phra(2, <bw(nuclear), bw(waste)>),
prox(2, <bw(nuclear), bw(waste)>, 3)}>,
<t60, {phra(2, <cw(<bw(low), bw(active)>),
bw(waste)>), prox(2, <cw(<bw(low),
bw(active)>), bw(waste)>, 3)}>,
<t70, {phra(2, <cw(<bw(high), bw(active)>),
bw(waste)>), prox(2, <cw(<bw(high),
bw(active)>), bw(waste)>, 3)}>,
<t80, {phra(2, <bw(fission), bw(product)>),
prox(2, <bw(fission), bw(product)>, 3)}>,
<t90, {phra(2, <bw(spend), bw(fuel)>),
prox(2, <bw(spend), bw(fuel)>, 3)}>,
<t100, {bw(storage)}>, <nt101, {bw(store)}>,
<nt102, {bw(stock)}>, <t110, {bw(repository)}>,
<t120, {bw(process)}>, <t130, {bw(refine)}>,
<t140, {bw(treat)}>},
{<t100, {nt101, nt102}>},
{<c5, c4>, <c6, c5>, <c7, c5>, <c11, c10>},
{<c4, c8>, <c8, c4>, <c4, c9>, <c9, c4>, <c5, c8>,
<c8, c5>, <c5, c9>, <c9, c5>, <c6, c8>, <c8, c6>,
<c6, c9>, <c9, c6>, <c7, c8>, <c8, c7>, <c7, c9>,
<c9, c7>,<c12, c13>, <c13, c12>, <c12, c14>,
<c14, c12>}>.
Query construction and expansion
In this study queries were formulated and expanded using ExpansionTool. The deductive data
model and query expansion in ExpansionTool are explained by Järvelin et al. (1996b).
ExpansionTool processes queries at conceptual, linguistic and string level independently of any
query language. Translation into a specific query language is done last, in a phase known as matching
model translation. The following abstracted and formalised description of the different expansion types of
the present study is based on the presentation of Järvelin et al. (1996b).
Unexpanded queries
Unexpanded – original – queries are formulated from concepts selected from the thesaurus. Further,
these concepts are interpreted as conjunctive facets representing aspects of the information need.
The concepts of each facet are alternative (or disjunctive) interpretations of the facet. Thus the
starting point is a concept query Q={{c11, ..., c1n1}, {c21, ..., c2n2}, ..., {ck1, ..., cknk}} where F1 =
{c11, ..., c1n1}, F2 = {c21, ..., c2n2} and Fk = {ck1, ..., cknk} are facets and cij are concept identifiers.
In principle, there is an ‘AND’ between the facets F1, F2, ..., Fk and an ‘OR’ between the
concepts within each facet, e.g., between c11, ..., c1n1. This high-level structure is maintained
throughout query construction and discarded only in the matching model translation phase if the
query structure requires it (e.g., in weakly structured queries, or with probabilistic operators
instead of Boolean operators).
At the linguistic level, expressions are collected as concept representatives. In unexpanded
queries, the original concepts are represented by their terms only. That is, unexpanded concept
facets are replaced by unexpanded expression facets {E1, ..., Ek}, where each facet Ei is derived
from the corresponding concept facet Fi by replacing each concept identifier in Fi by the identifier
of the corresponding term.
At the string level the expression identifiers are replaced by matching models. In unexpanded
queries each term identifier is replaced by one strict matching model with reliability value one.
Consequently, unexpanded expression facets {E1, ..., Ek} are replaced by string facets {S1, ..., Sk}.
Only one index type was used in this test. The index included words in their basic forms with
compound words split. Thus, matching models were selected accordingly.
In unexpanded queries concepts are represented by their terms, and each term is represented by
its strict matching models. The definition of the function for finding the strict matching models for
a concept follows.
Definition 14. Let THESAURUS be a thesaurus <c-term, e-strict, e-all, SYN, GEN, ASS>, and c
∈ dom(c-term) any concept. The strict matching models of a concept in an unexpanded facet are
elicited by the function:
ET-c-uxp: C × THESAURUS → P(SM)
ET-c-uxp(c, THESAURUS) = e-strict(c-term(c)).
Query formulation from concepts to a query is illustrated by the following example. Assume
that the test request is about the processing and storage of radioactive waste. In the sample
thesaurus THESAURUS-1 c-term(c4) = t40 denoting ‘radioactive waste’, c-term(c10) = t100
denoting ‘storage’, and c-term(c12) = t120 denoting ‘process’. Two facets are identified: F1
={c4}, F2 ={c10, c12}. They form the concept query Q1={{c4}, {c10, c12}}. The terms t40, t100
and t120 are represented by strict matching models phra(2, <bw(radioactive), bw(waste)>),
bw(storage) and bw(process) respectively.1
The strict matching models for the original concepts are obtained by the function:
ET-c-uxp(c4, THESAURUS-1) = {phra(2, <bw(radioactive), bw(waste)>)}.
ET-c-uxp(c10, THESAURUS-1) = {bw(storage)}.
ET-c-uxp(c12, THESAURUS-1) = {bw(process)}.
In an unexpanded facet all concepts are represented by their terms, which in turn are
represented by their strict matching models. We now define the function ET-f-uxp for finding the
strict matching models of an unexpanded facet.
Definition 15. Let F = {c1, ..., cn} be any facet, c ∈ dom(c-term) any concept of the facet F, and
SM a set of strict matching models. The strict matching models for each unexpanded facet are
obtained by the function:
1 NB. The matching models are given using the ExpansionTool notation, explained in the thesaurus example above, see
pp. 62-64. The reliability values were used mechanically in the test, thus they are not given in the formulation. The
strict matching models for phrases were formulated with phrasal structures, the loose models were formulated with
more relaxed proximity operators.
ET-f-uxp: P(C) × THESAURUS → P(SM)
ET-f-uxp(F, THESAURUS) = ∪c∈F ET-c-uxp(c, THESAURUS).
To continue with the example, the strict matching models for the concepts in unexpanded facets
are obtained as follows:
ET-f-uxp(F1 ,THESAURUS-1) = {phra(2, <bw(radioactive), bw(waste)>)},
ET-f-uxp(F2, THESAURUS-1) = {bw(storage), bw(process)}.
For an unexpanded query all strict matching models of all facets are collected. We now define
the function ET-Q-uxp for formulating an unexpanded query.
Definition 16. Let Q = {F1, ..., Fn} be any query with F1, ..., Fn facets. The strict matching
models for an unexpanded query are obtained by the function:
ET-Q-uxp: P(P(C)) × THESAURUS → P(P(SM))
ET-Q-uxp(Q, THESAURUS) = {ET-f-uxp(F, THESAURUS) | F ∈ Q}
The strict matching models for the unexpanded sample query are obtained as follows:
ET-Q-uxp(Q1, THESAURUS-1) =
{{phra(2, <bw(radioactive), bw(waste)>)},{bw(storage), bw(process)}}.
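Definitions 14-16 amount to three nested set unions. A Python sketch over the sample data (the dictionary encoding of c-term and e-strict, and the matching models as plain strings, are illustrative assumptions) reproduces the unexpanded query above:

```python
# Definitions 14-16 over the sample data: facets are sets of concepts,
# a query is a list of facets.
C_TERM = {"c4": "t40", "c10": "t100", "c12": "t120"}
E_STRICT = {
    "t40": {"phra(2, <bw(radioactive), bw(waste)>)"},
    "t100": {"bw(storage)"},
    "t120": {"bw(process)"},
}

def et_c_uxp(c):
    """Strict matching models of the concept's term (Definition 14)."""
    return E_STRICT[C_TERM[c]]

def et_f_uxp(facet):
    """Union over the concepts of a facet (Definition 15)."""
    return set().union(*(et_c_uxp(c) for c in facet))

def et_q_uxp(query):
    """One string facet per concept facet (Definition 16)."""
    return [et_f_uxp(f) for f in query]

Q1 = [{"c4"}, {"c10", "c12"}]
assert et_q_uxp(Q1) == [{"phra(2, <bw(radioactive), bw(waste)>)"},
                        {"bw(storage)", "bw(process)"}]
```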
Query expansion
Each concept is expanded to a disjunctive set of concepts based on conceptual relationships
selected. In this study narrower and associative concepts were used in concept expansion.
Hierarchical expansions could have been done upwards or downwards in hierarchies; however,
only the latter was tested here, because upward expansion is suspected of unreliability. In the
hierarchical expansion all hierarchical relationship types, i.e., generic, partitive and instance
relationship, were used. When expansion to the direction of narrower concepts was requested, the
concepts were expanded to all their transitively narrower concepts. When expansion to the
associated concepts was requested, the concepts were expanded by one step in the association
relationship. The association strength may, in principle, vary between 0 and 1. In the present study
association strength of concepts was not a variable, and thus only one value was attached to it.
This means, in practice, that all concepts irrespective of the strength of their association were
involved in expansions. The full expansion included original, narrower and associative concepts.
The expansion result, in each expansion case, is an expanded concept query Q' ={F1', ..., Fk'}
where each facet was Fi' = {ci1, ci11, ci12, ..., ci1m1, ..., cin, cin1, cin2, ..., cinmn} containing the
original concepts {ci1, ..., cin}, and the expansion concepts, {ci11, ci12, ..., ci1m1, ..., cin1, cin2, ...,
cinmn}.
Synonym expansion. Expressions are divided into terms and other equivalent expressions1. In
the synonym expansion, equivalent expressions of all terms are added to the query. The set of
unexpanded concept facets {F1, ..., Fk} is expanded with synonyms by adding all equivalent
expressions of the terms of the original concepts to the query. The result is an expanded set of
expression facets {E01', ..., E0k'}, where each facet E0i' is derived from the corresponding concept
facet Fi by representing each original concept in Fi by its term and all its equivalent expressions.
The expanded concept facets {F1', ..., Fk'} are similarly expanded with synonyms. The result is an
expanded set of expression facets {E1', ..., Ek'}, where each facet Ei' is derived from the corre1 Other expressions are synonyms, quasisynonyms, ellipses, verbs, adjectives, common names, and partial expressions,
which all are understood as equivalent expressions. However, in this study the differentiation was only made between
terms and non-term expressions. The more elaborate classification was not used.
58
sponding expanded concept facet Fi' by replacing each concept in Fi' by its term and all its
equivalent expressions. Thus, synonyms of a term representing a concept are included in all query
expansions.
In the string expansion, all loose matching models for each expression are attached to the
query. The sets of expanded expression facets {E01', ..., E0k'} and {E1', ..., Ek'} are replaced by
expanded string facets {S1', ..., Sk'}.
In the synonym expansion each original concept is represented by its term and non-term
expressions, and each expression is represented by all its matching models. We now define the
function ET-c-syn for the synonym expansion of a concept.
Definition 17. Let THESAURUS be a thesaurus <c-term, e-strict, e-all, SYN, GEN, ASS>, and c
∈ dom(c-term) be any concept. Matching models for the concept c in the synonym expansion are
obtained by the function:
ET-c-syn: C × THESAURUS → P(MM)
ET-c-syn(c, THESAURUS) = ∪e ∈c-expr(c, c-term, SYN) e-all(e).
Let us continue with the query example. After the synonym expansion the facets still include
the original concepts, but the concepts are represented by terms and non-term expressions. The
term of the concept c10 has the following equivalent expressions: e101 denoting ‘store’ and e102
denoting ‘stock’. The thesaurus gives the following matching models for these expressions:
bw(store), bw(stock). Further, one loose matching model is given for the term of the concept c4:
prox(2, <bw(radioactive), bw(waste)>, 3).
All matching models for the original concepts in synonym expansion are obtained as follows:
ET-c-syn(c4, THESAURUS-1) = {phra(2, <bw(radioactive), bw(waste)>),
prox(2, <bw(radioactive), bw(waste)>, 3)}.
ET-c-syn(c10, THESAURUS-1) = {bw(storage), bw(store), bw(stock)}.
ET-c-syn(c12, THESAURUS-1) = {bw(process)}.
The synonym expansion of concept facets results in expanded string facets, where the original
concepts are represented by terms and non-term expressions, which, in turn, are represented by all
matching models. Next, we shall define the function for the synonym expansion of a facet.
Definition 18. Let F = {c1, ..., cn} be any facet, and THESAURUS a thesaurus <c-term, e-strict, e-all, SYN, GEN, ASS>. The matching models for each synonym-expanded facet are elicited by the
function:
ET-f-syn: P(C) × THESAURUS → P(MM)
ET-f-syn(F, THESAURUS) = ∪c ∈ F ET-c-syn(c, THESAURUS)
All matching models for the synonym-expanded facets in the example are obtained as follows:
ET-f-syn(F1, THESAURUS-1) = {phra(2, <bw(radioactive), bw(waste)>),
prox(2, <bw(radioactive), bw(waste)>, 3)}.
ET-f-syn(F2, THESAURUS-1) = {bw(storage), bw(store), bw(stock),
bw(process)}.
In the synonym expansion of a query, all matching models of the terms and non-term
expressions of the original concepts in all facets are collected. We now define the function for
formulating a synonym-expanded query.
Definition 19. Let Q = {F1, ..., Fn} be any unexpanded query with F1, ..., Fn facets. All
matching models for each synonym-expanded query are obtained by the function:
ET-Q-syn: P(P(C)) × THESAURUS → P(P(MM))
ET-Q-syn(Q, THESAURUS) = {ET-f-syn(F, THESAURUS) | F ∈ Q}
In the example, the matching models for the synonym-expanded query are obtained as follows:
ET-Q-syn(Q1, THESAURUS-1) = {{phra(2, <bw(radioactive), bw(waste)>),
                              prox(2, <bw(radioactive), bw(waste)>, 3)},
                             {bw(storage), bw(store), bw(stock), bw(process)}}
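The synonym expansion machinery above can be sketched in executable form. The following toy implementation is illustrative only: the thesaurus is reduced to plain Python dictionaries (hypothetical C_EXPR and E_ALL mappings standing in for c-expr and e-all), and matching models are plain strings rather than structured phra/prox/bw expressions.

```python
# A toy sketch of Definitions 17-19: synonym expansion of a concept,
# a facet, and a query. Matching models are plain strings here.

# Hypothetical mini-thesaurus for the running example (THESAURUS-1).
C_EXPR = {                      # concept -> term and synonym expressions
    "c4":  ["radioactive waste"],
    "c10": ["storage", "store", "stock"],
    "c12": ["process"],
}
E_ALL = {                       # expression -> all matching models
    "radioactive waste": ["phra(2,<bw(radioactive),bw(waste)>)",
                          "prox(2,<bw(radioactive),bw(waste)>,3)"],
    "storage": ["bw(storage)"],
    "store":   ["bw(store)"],
    "stock":   ["bw(stock)"],
    "process": ["bw(process)"],
}

def et_c_syn(c):
    """ET-c-syn: union of e-all(e) over the expressions of concept c."""
    models = set()
    for e in C_EXPR[c]:
        models |= set(E_ALL[e])
    return models

def et_f_syn(facet):
    """ET-f-syn: union of ET-c-syn over the concepts of a facet."""
    return set().union(*(et_c_syn(c) for c in facet))

def et_q_syn(query):
    """ET-Q-syn: one expanded matching model set per facet."""
    return [et_f_syn(f) for f in query]

Q1 = [{"c4"}, {"c10", "c12"}]
expanded = et_q_syn(Q1)
```

Running `et_q_syn(Q1)` reproduces the two expanded facets of the example: the first facet expands to the phrase and proximity models, the second to the four basic word models.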
Narrower concept expansion. In the narrower concept expansion all narrower concepts of the
original concepts are collected and combined with the original concept. Narrower concepts are
represented by all matching models. We shall next define the function ET-c-narr for expanding a
concept with its narrower concepts.
Definition 20. Let THESAURUS be a thesaurus <c-term, e-strict, e-all, SYN, GEN, ASS>, c
∈ dom(c-term) be any concept. The matching models of the original concept and its narrower
concepts are obtained by the function:
ET-c-narr: C × THESAURUS → P(MM)
ET-c-narr(c, THESAURUS) = ∪e ∈ J e-all(e),
where J = ∪c ∈ NR c-expr(c, c-term, SYN),
and NR = {c} ∪ c-narr(c, GEN).
In the case of sample query Q1 the narrower concept expansion adds the following concepts into
the original unexpanded query: c5 , c6, c7, c11. They are represented by the following terms: t50
denoting ‘nuclear waste’, t60 denoting ‘low-active waste’, t70 denoting ‘high-active waste’, and
t110 denoting ‘repository’.
The matching models for the original and narrower concepts are obtained as follows:
ET-c-narr(c4, THESAURUS-1) =
{phra(2, <bw(radioactive), bw(waste)>),
prox(2, <bw(radioactive), bw(waste)>, 3),
phra(2, <bw(nuclear), bw(waste)>),
prox(2, <bw(nuclear), bw(waste)>, 3),
phra(2, <cw(<bw(low), bw(active)>), bw(waste)>),
prox(2, <cw(<bw(low), bw(active)>), bw(waste)>, 3),
phra(2, <cw(<bw(high), bw(active)>), bw(waste)>),
prox(2, <cw(<bw(high), bw(active)>), bw(waste)>, 3)}.
ET-c-narr(c10, THESAURUS-1) = {bw(storage), bw(store), bw(stock),
bw(repository)}.
ET-c-narr(c12, THESAURUS-1) = {bw(process)}.
The narrower concept expansion of concept facets results in expanded string facets, where
original concepts and their narrower concepts are represented by all matching models representing
the terms and non-term expressions. Matching models for a narrower concept expanded string
facet form a bag rather than a union because, if a facet has two or more concepts with the same
narrower concepts, these duplicate concepts are not discarded1. We shall first define a notation for
forming a bag, then we shall define the function for the narrower concept expansion for a facet.
Notational convention 4. Let X = {x1, x2, x3, ..., xn} and Y = {y1, y2, y3, ..., ym} be any sets
which are not necessarily disjoint. The bag – including duplicate elements – formed from these
sets is the set X || Y = {x1, x2, ..., xn, y1, y2, ..., ym}. Let X1, X2, ..., Xn be any sets. Then
||k = 1,..., n Xk = X1 || ... || Xn.
1 Neither are duplicates eliminated between facets. However, the duplicates were rare: in one request of this study two
concepts had common descendants; in four requests two concepts had common associative concepts.
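In Python terms, the bag of Notational convention 4 is naturally modelled as list concatenation, which keeps the duplicates that a set union would discard; a minimal sketch:

```python
# Notational convention 4: X || Y keeps duplicates, unlike set union.
X = ["bw(storage)", "bw(repository)"]
Y = ["bw(repository)", "bw(process)"]

bag = X + Y               # X || Y, duplicates preserved
union = set(X) | set(Y)   # ordinary set union

assert len(bag) == 4      # bw(repository) appears twice in the bag
assert len(union) == 3    # the duplicate collapses under union
```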
Definition 21. Let F = {c1, ..., cn} be any facet. The matching models for each narrower concept
expanded facet form the bag:
ET-f-narr: P(C) × THESAURUS → P(MM)
ET-f-narr(F, THESAURUS) = ||c∈F ET-c-narr(c, THESAURUS).
In the expansion example the matching models for the narrower concept expanded facets are
elicited as follows:
ET-f-narr(F1, THESAURUS-1)=
{phra(2, <bw(radioactive), bw(waste)>),
prox(2, <bw(radioactive), bw(waste)>, 3),
phra(2, <bw(nuclear), bw(waste)>),
prox(2, <bw(nuclear), bw(waste)>, 3),
phra(2, <cw(<bw(low), bw(active)>), bw(waste)>),
prox(2, <cw(<bw(low), bw(active)>), bw(waste)>, 3),
phra(2, <cw(<bw(high), bw(active)>), bw(waste)>),
prox(2, <cw(<bw(high), bw(active)>), bw(waste)>, 3)}.
ET-f-narr(F2, THESAURUS-1)=
{bw(storage), bw(store), bw(stock), bw(repository),
bw(process)}.
In the narrower concept expansion of a query, all matching models of the terms and non-term
expressions of the original concepts and their narrower concepts in all facets are collected. We
now define the function for the narrower concept query expansion.
Definition 22. Let Q = {F1, ..., Fn} be any unexpanded query with F1, ..., Fn facets. All
matching models for a narrower concept expanded query are obtained by the function:
ET-Q-narr: P(P(C)) × THESAURUS → P(P(MM))
ET-Q-narr(Q, THESAURUS) = {ET-f-narr(F, THESAURUS) | F ∈ Q}
In the expansion example, the matching models for the narrower concept expanded query are
obtained as follows:
ET-Q-narr(Q1, THESAURUS-1) =
{{phra(2, <bw(radioactive), bw(waste)>),
prox(2, <bw(radioactive), bw(waste)>, 3),
phra(2, <bw(nuclear), bw(waste)>),
prox(2, <bw(nuclear), bw(waste)>, 3),
phra(2, <cw(<bw(low), bw(active)>), bw(waste)>),
prox(2, <cw(<bw(low), bw(active)>), bw(waste)>, 3),
phra(2, <cw(<bw(high), bw(active)>), bw(waste)>),
prox(2, <cw(<bw(high), bw(active)>), bw(waste)>, 3)},
{bw(storage), bw(store), bw(stock), bw(repository),
bw(process)}}.
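The narrower concept expansion can be sketched analogously to the synonym case, with the important difference that facet-level results are bags (lists) rather than sets. The mini-thesaurus below is a hypothetical reduction of THESAURUS-1, and the model strings are abbreviated.

```python
# Toy sketch of Definitions 20-22: narrower concept expansion.
C_NARR = {               # c-narr: concept -> its narrower concepts
    "c4":  ["c5", "c6", "c7"],
    "c10": ["c11"],
    "c12": [],
}
MODELS = {               # all matching models per concept (abbreviated)
    "c4":  ["phra(radioactive waste)", "prox(radioactive waste)"],
    "c5":  ["phra(nuclear waste)", "prox(nuclear waste)"],
    "c6":  ["phra(low-active waste)", "prox(low-active waste)"],
    "c7":  ["phra(high-active waste)", "prox(high-active waste)"],
    "c10": ["bw(storage)", "bw(store)", "bw(stock)"],
    "c11": ["bw(repository)"],
    "c12": ["bw(process)"],
}

def et_c_narr(c):
    """ET-c-narr: models of c and of all its narrower concepts."""
    out = []
    for cc in [c] + C_NARR[c]:
        out.extend(MODELS[cc])
    return out

def et_f_narr(facet):
    """ET-f-narr: a bag (list) - duplicates between concepts are kept."""
    bag = []
    for c in facet:          # facet given as a list to fix iteration order
        bag.extend(et_c_narr(c))
    return bag
```

With facet F2 given as ["c10", "c12"], `et_f_narr` yields the five models of the example above, and concatenation (rather than set union) is what preserves any duplicates between concepts.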
Associative concept expansion. In the associative concept expansion all associative concepts of
the original concepts are collected and combined with the original concepts. The concepts are
represented by all matching models. We shall next define the function for the associative concept
expansion for a concept.
Definition 23. Let THESAURUS be a thesaurus <c-term, e-strict, e-all, SYN, GEN, ASS>, and
c ∈ dom(c-term) be any concept. The matching models of the original concept and all its
associative concepts are obtained by the function:
ET-c-asso: C × THESAURUS → P(MM)
ET-c-asso(c, THESAURUS) = ∪e∈K e-all(e)
where K = ∪c∈ACc-expr(c, c-term, SYN),
and AC={c} ∪ c-asso(c, ASS).
To continue with the query expansion example, the associative concept expansion adds concepts
c8, c9, c13 and c14 into the original conceptual query plan. These concepts are represented by the
following terms: t80 denoting ‘fission product’, t90 denoting ‘spent fuel’, t130 denoting ‘refine’
and t140 denoting ‘treat’. The associative concept expansion for each search concept follows:
ET-c-asso(c4, THESAURUS-1) =
{phra(2, <bw(radioactive), bw(waste)>),
prox(2, <bw(radioactive), bw(waste)>, 3),
phra(2, <bw(fission), bw(product)>),
prox(2, <bw(fission), bw(product)>, 3),
phra(2, <bw(spend), bw(fuel)>),
prox(2, <bw(spend), bw(fuel)>, 3)}.
ET-c-asso(c10, THESAURUS-1) = {bw(storage), bw(store), bw(stock)}.
ET-c-asso(c12, THESAURUS-1) = {bw(process), bw(refine), bw(treat)}.
The associative concept expansion of concept facets results in expanded string facets, where the
original concepts and their associative concepts are represented by all matching models
representing the terms and non-term expressions. As in the narrower concept expansion, matching
models for an associative concept expanded string facet form a bag. Next, we shall define the
function for the associative facet expansion.
Definition 24. Let F= {c1, ..., cn} be any facet. The matching models for each associative
concept expanded facet form the bag:
ET-f-asso: P(C) × THESAURUS → P(MM)
ET-f-asso(F, THESAURUS) = ||c∈F ET-c-asso(c, THESAURUS).
In the example, the matching models for the associative concept expanded facets are elicited as
follows:
ET-f-asso(F1, THESAURUS-1 ) =
{phra(2, <bw(radioactive), bw(waste)>),
prox(2, <bw(radioactive), bw(waste)>, 3),
phra(2, <bw(fission), bw(product)>),
prox(2, <bw(fission), bw(product)>, 3),
phra(2, <bw(spend), bw(fuel)>),
prox(2, <bw(spend), bw(fuel)>, 3)}.
ET-f-asso(F2, THESAURUS-1) =
{bw(storage), bw(store), bw(stock), bw(process), bw(refine),
bw(treat)}.
In the associative concept expansion of a query, all matching models of the terms and non-term
expressions of the original concepts and their associative concepts in all facets are collected. Now
we shall define the function for the associative concept expansion of a query.
Definition 25. Let Q = {F1, ..., Fn} be any unexpanded query with F1, ..., Fn facets. All matching
models for each associative concept expanded query are obtained by the function:
ET-Q-asso: P(P(C)) × THESAURUS → P(P(MM))
ET-Q-asso(Q, THESAURUS) = {ET-f-asso(F, THESAURUS) | F ∈ Q}
In the example, the matching models for the associative concept expanded query are obtained
as follows:
ET-Q-asso(Q1, THESAURUS-1) =
{{phra(2, <bw(radioactive), bw(waste)>),
prox(2, <bw(radioactive), bw(waste)>, 3),
phra(2, <bw(fission), bw(product)>),
prox(2, <bw(fission), bw(product)>, 3),
phra(2, <bw(spend), bw(fuel)>),
prox(2, <bw(spend), bw(fuel)>, 3)},
{bw(storage), bw(store), bw(stock), bw(process), bw(refine),
bw(treat)}}.
Full expansion. In the full expansion all previous expansions are combined, i.e., all matching
models for the terms and non-term expressions of the original concepts and of their narrower and
associative concepts are collected. We shall now define the function for the full expansion of a
concept.
Definition 26. Let THESAURUS be a thesaurus <c-term, e-strict, e-all, SYN, GEN, ASS>, and c
∈ dom(c-term) be any concept. The matching models for each concept in the full concept
expansion are elicited by the function:
ET-c-full: C × THESAURUS → P(MM)
ET-c-full(c, THESAURUS) = ET-c-narr(c, THESAURUS) ∪ ET-c-asso(c, THESAURUS)
In the query expansion example, the full expansion combines the matching models of the
previous expansions. The matching models of any original concept and all its related concepts are
obtained as follows:
ET-c-full(c4, THESAURUS-1) =
{phra(2, <bw(radioactive), bw(waste)>),
prox(2, <bw(radioactive), bw(waste)>, 3),
phra(2, <bw(nuclear), bw(waste)>),
prox(2, <bw(nuclear), bw(waste)>, 3),
phra(2, <cw(<bw(low), bw(active)>), bw(waste)>),
prox(2, <cw(<bw(low), bw(active)>), bw(waste)>, 3),
phra(2, <cw(<bw(high), bw(active)>), bw(waste)>),
prox(2, <cw(<bw(high), bw(active)>), bw(waste)>, 3),
phra(2, <bw(fission), bw(product)>),
prox(2, <bw(fission), bw(product)>, 3),
phra(2, <bw(spend), bw(fuel)>),
prox(2, <bw(spend), bw(fuel)>, 3)}.
ET-c-full(c10, THESAURUS-1) = {bw(storage), bw(store), bw(stock),
bw(repository)}.
ET-c-full(c12, THESAURUS-1) = {bw(process), bw(refine), bw(treat)}.
The full expansion of concept facets results in expanded string facets, where the original
concepts and all their related concepts are represented by all matching models representing the
terms and non-term expressions.
Definition 27. Let F = {c1, ..., cn} be any facet. The matching models for each fully expanded facet
form the bag:
ET-f-full: P(C) × THESAURUS → P(MM)
ET-f-full(F, THESAURUS) = ||c∈F ET-c-full(c, THESAURUS).
In the query expansion example, the matching models for each fully expanded facet are
obtained as follows:
ET-f-full(F1, THESAURUS-1)=
{phra(2, <bw(radioactive), bw(waste)>),
prox(2, <bw(radioactive), bw(waste)>, 3),
phra(2, <bw(nuclear), bw(waste)>),
prox(2, <bw(nuclear), bw(waste)>, 3),
phra(2, <cw(<bw(low), bw(active)>), bw(waste)>),
prox(2, <cw(<bw(low), bw(active)>), bw(waste)>, 3),
phra(2, <cw(<bw(high), bw(active)>), bw(waste)>),
prox(2, <cw(<bw(high), bw(active)>), bw(waste)>, 3),
phra(2, <bw(fission), bw(product)>),
prox(2, <bw(fission), bw(product)>, 3),
phra(2, <bw(spend), bw(fuel)>),
prox(2, <bw(spend), bw(fuel)>, 3)}.
ET-f-full(F2, THESAURUS-1) = {bw(storage), bw(store), bw(stock),
bw(repository), bw(process), bw(refine),
bw(treat)}.
In the full expansion of a query, all matching models of the terms and non-term expressions of
the original concepts and all their related concepts in all facets are collected. Next, we shall define
the function for the full expansion of a query.
Definition 28. Let Q = {F1, ..., Fn} be any query with F1, ..., Fn facets. All matching models for
the fully expanded query are obtained by the function:
ET-Q-full: P(P(C)) × THESAURUS → P(P(MM))
ET-Q-full(Q, THESAURUS) = {ET-f-full(F, THESAURUS) | F ∈ Q}
In the query expansion example, the fully expanded query is a set of all fully expanded facets. The
matching models for the fully expanded query are obtained as follows:
ET-Q-full(Q1, THESAURUS-1) =
{{phra(2, <bw(radioactive), bw(waste)>),
prox(2, <bw(radioactive), bw(waste)>, 3),
phra(2, <bw(nuclear), bw(waste)>),
prox(2, <bw(nuclear), bw(waste)>, 3),
phra(2, <cw(<bw(low), bw(active)>), bw(waste)>),
prox(2, <cw(<bw(low), bw(active)>), bw(waste)>, 3),
phra(2, <cw(<bw(high), bw(active)>), bw(waste)>),
prox(2, <cw(<bw(high), bw(active)>), bw(waste)>, 3),
phra(2, <bw(fission), bw(product)>),
prox(2, <bw(fission), bw(product)>, 3),
phra(2, <bw(spend), bw(fuel)>),
prox(2, <bw(spend), bw(fuel)>, 3)},
{bw(storage), bw(store), bw(stock), bw(repository),
bw(process), bw(refine), bw(treat)}}.
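Per Definition 26, the full expansion of a concept reduces to a set union of the two previous expansions. A minimal self-contained sketch (model strings abbreviated; the inputs are illustrative):

```python
# Toy sketch of Definition 26: the full expansion unions the narrower
# and associative expansions of a concept.
def et_c_full(narr_models, asso_models):
    """ET-c-full(c) = ET-c-narr(c) ∪ ET-c-asso(c)."""
    return set(narr_models) | set(asso_models)

# For c10 in the example, the associative models are a subset of the
# narrower-expansion models, so the union equals the narrower expansion.
narr = {"bw(storage)", "bw(store)", "bw(stock)", "bw(repository)"}
asso = {"bw(storage)", "bw(store)", "bw(stock)"}
full = et_c_full(narr, asso)
assert full == {"bw(storage)", "bw(store)", "bw(stock)", "bw(repository)"}
```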
Facets replaced by concepts
Research problem 11 suggests that the facet structure could be abandoned and replaced by
concepts. This may be seen as a special case of the facet structure, where each facet includes
one concept only, i.e., Q = {{c1}, {c2}, ..., {cn}}.
The query formulation and expansion then follow the definitions given above.
Query structuring
The matching models are still query language independent; e.g., phrases are denoted by phra(2,
<bw(key), bw(key)>). Such models are translated into search keys of the selected query language
in the next phase, when the query is constructed. In ExpansionTool this phase is called matching
model translation.
“This step translates the query language independent expanded query expression into a query of a given
specific query language. [...] Matching model translation is implemented on the basis of logic grammars.
Each logic grammar is a set of logical rules which generate well formed expressions of a specified query
language. Each query language has its own logic grammar which takes the expression types and syntax
limitations on the query language into account. This means, for example, that the limitations of a particular
query language may require restructuring the CNF structure of matching models in addition to their
translation. The logic grammars are able to generate expressions which, from the logical viewpoint, contain
a proximity operator instead of an ordinary conjunction, between the facets. Most query languages however
do not allow disjunctive components within a proximity expression. In such a case the logic grammars turn
the set of matching model facets into disjunctive normal form (DNF, Aho & Ullman 1992) wherein
proximity operations can be applied.”
(Järvelin et al. 1996b, 37-38.)
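The CNF-to-DNF restructuring mentioned in the quotation can be sketched generically with itertools.product: a query in CNF is a conjunction of facets (each a disjunction), and its DNF is the disjunction of all conjunctions taking one disjunct per facet. This is a generic sketch, not the ExpansionTool logic grammars; the keys are illustrative.

```python
from itertools import product

# CNF: a conjunction of facets, each facet a disjunction of keys.
cnf = [["radioactive-waste", "nuclear-waste"],   # facet 1
       ["storage", "process"]]                   # facet 2

# DNF: a disjunction of conjunctions, one key per facet, so that
# proximity operators (which typically reject inner disjunctions)
# can be applied to each conjunction separately.
dnf = [list(combo) for combo in product(*cnf)]

assert len(dnf) == 4                             # 2 x 2 conjunctions
assert ["radioactive-waste", "storage"] in dnf
```

Note that the DNF grows as the product of the facet sizes, which is why expanded facets make this restructuring expensive.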
The parameters of matching model translation are (1) the target query language identifier, (2) the
facet operator and (3) the database index type indicator. Only one target query language identifier,
inquery, was needed in the test. The facet operator shows how the facets are combined. The
alternatives are disjunction, conjunction, conjunction with a paragraph or a sentence proximity
condition, probabilistic operators, or any combination of these. These alternatives are identified
in the facet operator parameter by the values or, and, para, sent, sum, wsum, osum, xsum, syns,
and wsyns respectively.
The queries with different structures produced in matching model translation are described in
detail in the next chapter. Proximity operators were not used between facets. Phrasal expressions
were translated into InQuery syntax with proximity operators.
7.4 Test Queries
In this section we shall explain the composition of the test queries.
Complexity
Testing the effects of complexity requires facet analysis and some reasonable way to formulate
queries with different numbers of facets. The most obvious way to construct query versions would
be to add one facet after another. However, the order of facets is not self-evident, and this leads to
a rotation problem (a large number of possible combinations). In the present study, complexity was
represented by major and minor facets. Complexity was tested with two different combinations of
facets in queries: (a) major facets only – F1; (b) major facets + minor facets – F2. Appendix 2
gives the conceptual query plans for the 30 test queries.
For testing the other variables, facet analysis is not necessary; it could be replaced by the
identification of concepts.
Coverage
The search concepts were identified from the test thesaurus on the basis of the request. The
selected concepts were grouped into facets according to their relations (disjunctive or conjunctive
concepts1) and the aspect they represented in the request. Hence, the baseline coverage was
determined by requests, and it represented the case of no conceptual expansion (C0). The search
concepts selected from the thesaurus were called original concepts. Coverage was extended by
expanding the query first with the narrower concepts and then with the associative concepts of the
original concepts from the test thesaurus. These were expansion concepts, and the expansions were
named Cn and Ca respectively. These expansions were not cumulative. All semantically related
concepts were added in the last expansion (Cf). The number of semantic relations of a concept is
not constant, because it depends on the semantic field the concept belongs to. All search concepts
of all requests were included in the thesaurus.
Broadness
Each concept is represented in the test thesaurus by a term, which is the principal name of the
concept, and a set of synonyms. When a concept was represented by its term only in a query,
expression expansion was not performed (E0); when a concept was represented by its term and all
its synonyms, expression expansion was done (E1). Each expression is represented by one or
several search keys. In the case of no string expansion (S0) only one search key (the most reliable)
was selected for each expression. In the string expansion (S1) all search keys for each expression
were added to the query.
Conceptual, expression and string expansions were combined in the following manner:
C0+E0+S0 -> Q0
C0+E1+S1 -> Qs
Cn+E1+S1 -> Qn
Ca+E1+S1 -> Qa
Cf+E1+S1 -> Qf
This led to five different expansion types. The case of no expression expansion was not
combined with the conceptual expansions, because the previous study by Järvelin et al. (1996a, b)
did not show a great difference in performance between expression expansion types within a
given conceptual expansion.
1 Request number 14 is an exception: within one facet concepts were conjuncted to form an alternative concept, see
Appendix 2.
In query construction, the set of facets containing matching models was derived for each
expansion type by the formulae (Q being the original concept identifier set for the request):
Q0: ET-Q-uxp(Q, THESAURUS)
Qs: ET-Q-syn(Q, THESAURUS)
Qn: ET-Q-narr(Q, THESAURUS)
Qa: ET-Q-asso(Q, THESAURUS)
Qf: ET-Q-full(Q, THESAURUS)
In each case the result is a set of matching model sets, albeit with differing content, of the type
{{mm11, ..., mm1n}, ..., {mmk1, ..., mmkl}} when the original concept query Q contains k facets,
and each facet contains some number of models. Each model mmij is a matching model for some
expression j belonging to facet i.
Query structures
Search keys were combined using the following query structure types (these types will be defined
and discussed below):
• SUM, average of the weights of keys
• SYN1-2, synonym groups of keys
• WSUM1-2, weighted averages of the weights of keys
• SSYN1-2, averages of the weights of synonym groups
• ASYN, product of the weights of synonym groups
• WSYN1-2, weighted averages of the weights of synonym groups
• BOOL, soft Boolean query
• XSUM, average of the weights of OR groups
• OSUM, average of the weights of the original query and expansion key groups.
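In InQuery's inference-network model, the operators underlying these structure types combine the belief values of their operands roughly as follows. This is a simplified sketch after the standard inference-network interpretation, not the actual InQuery implementation, which also handles default beliefs, and which realises SYN at the term-statistics level rather than over beliefs.

```python
from math import prod

# Simplified belief-combination sketch for InQuery-style operators.
# b is a list of operand beliefs in [0, 1]; w a list of weights.
def op_sum(b):            # SUM: average of operand beliefs
    return sum(b) / len(b)

def op_wsum(w, b):        # WSUM: weighted average of operand beliefs
    return sum(wi * bi for wi, bi in zip(w, b)) / sum(w)

def op_and(b):            # probabilistic AND: product of beliefs
    return prod(b)

def op_or(b):             # probabilistic OR: 1 - prod(1 - belief)
    return 1 - prod(1 - bi for bi in b)

# NB. SYN is not a belief combiner: it merges its operands' occurrence
# statistics so that they are scored as instances of a single key.

beliefs = [0.8, 0.5]
assert abs(op_sum(beliefs) - 0.65) < 1e-12
assert abs(op_and(beliefs) - 0.40) < 1e-12
assert abs(op_or(beliefs) - 0.90) < 1e-12
assert abs(op_wsum([10, 3], beliefs) - (10*0.8 + 3*0.5)/13) < 1e-12
```

The sketch makes the later discussion concrete: a product (AND) is pulled down by any single low operand, whereas an average (SUM) is not, which is why a failed facet hurts an AND-structured query more than a SUM-structured one.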
Next we shall define the query structure types and illustrate them with sample queries. One
request is formulated into different query structures in two expansion types (Q0, Qn), with full
complexity (F2). Because compound words are much more frequent in Finnish than in English,
the example is only illustrative. The repetition of search expressions is greater in the example than
in the test queries, because it is caused by phrasal expressions, which are typically compound
words in Finnish (e.g., nuclear power plant vs. ydinvoimalaitos; nuclear waste vs. ydinjäte). In
those queries where the repetition of search keys has no effect, it has been omitted (e.g., SYN1). To
show how compound words were processed, search keys high-active and low-active are regarded
as compound words.
The request is:
The processing and storage of radioactive waste.
The conceptual query plan follows (major facets in bold face):
radioactive waste AND (process OR storage)
SUM. An unexpanded SUM query was constructed of the original concepts of a request, each
concept represented by a key corresponding to a term. In the expansions all expressions were
added as single words, i.e., no phrases were included.
SUM/Q0
#sum(radioactive waste process storage)
SUM/Qn
#sum(radioactive waste radioactive waste nuclear waste nuclear waste
#0(high active) waste #0(high active) waste #0(low active) waste #0(low
active) waste process storage stock store repository)
SYN1-2. In SYN1-2 queries search keys were combined with the SYN operator. All keys within
the SYN operator are treated as instances of one key; thus the SYN operator influences the
calculation of tf*idf values (Rajashekar & Croft 1995, see p. 22). In SYN1 queries no phrases were
marked, but in SYN2 queries phrases were built with proximity operators.
SYN1/Q0
#syn(radioactive waste process storage)
SYN1/Qn
#syn(radioactive waste nuclear #0(high active)
#0(low active) process storage stock store repository)
SYN2/Q0
#syn(#1(radioactive waste) process storage)
SYN2/Qn
#syn(#1(radioactive waste) #3(radioactive waste)
#1(nuclear waste) #3(nuclear waste)
#1(#0(high active) waste) #3(#0(high active) waste)
#1(#0(low active) waste) #3(#0(low active) waste)
process storage stock store repository)
WSUM1-2. Weighting original keys higher than expansion keys is a common method in QE. In
WSUM1 queries, expansion keys and the keys of equivalent expressions (i.e., synonyms for the
terms of the original concepts) were given smaller weights (1.0) than the keys of the original
concepts (2.0). This weighting scheme represents the simplest way to differentiate between
original and expansion keys. In WSUM2 queries, keys were weighted according to the type of
expansion they belonged to: the weight of term keys was 10, of synonym keys 9, of narrower
concept keys 7, and of associative concept keys 5.1 In this weighting scheme the semantic distance
between concepts was taken into account. The values of the weights are, to some extent, arbitrary.
Unexpanded WSUM2 queries are identical with unexpanded WSUM1 queries.
In SUM and SYN1-2 queries all search keys were treated equally and no facets were identified. In
WSUM queries there was no facet structure either, although search keys were given different
weights. Thus, WSUM queries are more structured than SUM and SYN queries, but the structuring
is based on search key types, not on their meanings.
WSUM1/Q0 #sum(#1(radioactive waste) process storage)
WSUM1/Qn #wsum(1
2 #1(radioactive waste) 1 #3(radioactive waste)
1 #1(nuclear waste) 1 #3(nuclear waste)
1 #1(#0(high active) waste) 1 #3(#0(high active) waste)
1 #1(#0(low active) waste) 1 #3(#0(low active) waste)
2 process
2 storage 1 stock 1 store 1 repository)
WSUM2/Q0 #sum(#1(radioactive waste) process storage)
WSUM2/Qf #wsum(1
10 #1(radioactive waste) 9 #3(radioactive waste)
7 #1(nuclear waste) 7 #3(nuclear waste)
7 #1(#0(high active) waste) 7 #3(#0(high active) waste)
7 #1(#0(low active) waste) 7 #3(#0(low active) waste)
5 #1(spent fuel) 5 #1(fission product)
10 process 5 refine 5 treat
10 storage 9 stock 9 store 7 repository)
1 NB. In the example, WSUM2 is illustrated with a fully expanded query because of the weighting scheme of the
structure.
SSYN1-2. In SSYN1 queries, concepts were identified instead of facets. Search keys of each
search concept were combined with the SYN operator, and the SYN clauses were combined with
the SUM operator, i.e., an average of the weights of synonym groups was calculated.1 In SSYN2
queries, SYN clauses were formed of facets instead of concepts. The SSYN1 structure is a
somewhat simpler approach to query formulation, because concepts need not be divided into facets.
SSYN1/Q0
#sum(#1(radioactive waste) process storage)
SSYN1/Qn
#sum(#syn(#1(radioactive waste) #3(radioactive waste)
#1(nuclear waste) #3(nuclear waste)
#1(#0(high active) waste) #3(#0(high active) waste)
#1(#0(low active) waste) #3(#0(low active) waste))
#syn(process)
#syn(storage stock store repository))
SSYN2/Q0
#sum(#1(radioactive waste) #syn(process storage))
SSYN2/Qn
#sum(#syn(#1(radioactive waste) #3(radioactive waste)
#1(nuclear waste) #3(nuclear waste)
#1(#0(high active) waste) #3(#0(high active) waste)
#1(#0(low active) waste) #3(#0(low active) waste))
#syn(process storage stock store repository))
ASYN. ASYN queries were similar to SSYN2 queries, but SYN clauses were combined with the
probabilistic AND operator, i.e., a product of facets was calculated instead of an average.
ASYN/Q0
#and( #1(radioactive waste) #syn(process storage))
ASYN/Qn
#and(#syn(#1(radioactive waste) #3(radioactive waste)
#1(nuclear waste) #3(nuclear waste)
#1(#0(high active) waste) #3(#0(high active) waste)
#1(#0(low active) waste) #3(#0(low active) waste))
#syn(process storage stock store repository))
WSYN1-2. WSYN1-2 queries were similar to SSYN2, but facets were weighted according to their
type. In WSYN1 queries, the weight of major facets was 10 and of minor facets 3, in WSYN2
queries, weights were 10 for majors and 7 for minors. This query structure tested the effect of
emphasising major facets, when no extra weight was given to original keys.
WSYN1/Q0 #wsum(1 10 #1(radioactive waste) 3 #syn(process storage))
WSYN1/Qn #wsum(1
10 #syn(#1(radioactive waste) #3(radioactive waste)
#1(nuclear waste) #3(nuclear waste)
#1(#0(high active) waste) #3(#0(high active) waste)
#1(#0(low active) waste) #3(#0(low active) waste))
3 #syn(process storage stock store repository))
WSYN2/Q0 #wsum(1 10 #1(radioactive waste) 7 #syn(process storage))
WSYN2/Qn #wsum(1
10 #syn(#1(radioactive waste) #3(radioactive waste)
#1(nuclear waste) #3(nuclear waste)
#1(#0(high active) waste) #3(#0(high active) waste)
1 NB. The unexpanded SSYN1 queries were not equal to the WSUM1-2 queries, if the number of search keys
representing a term (and a concept) was more than one (in the case of unrecognised words; see morphological analysis, pp.
57-58). Then, a concept had more than one representative at the string level, and these keys formed a SYN clause in an
unexpanded query.
#1(#0(low active) waste) #3(#0(low active) waste))
7 #syn(process storage stock store repository))
BOOL. A query with Boolean operators was a typical facet-based query. The interpretations of
Boolean operators are given above, i.e., queries were not strict Boolean queries. Earlier studies
provide evidence that queries with Boolean operators are more effective than weakly structured
queries in best match retrieval systems. However, QE in InQuery with Boolean structured queries
has not been successful. The interpretation of the Boolean OR operator in InQuery Version 3.1
used in the present study differs from the interpretation of the Version 1.6 used in the earlier QE
study by Kristensen (1996). The OR operator of Version 3.1 works more like the exact match OR
operator (“One of terms within the OR operator must be found in a document for that document to
get credit for this operator.” [InQuery s.a.]). The stricter interpretation was believed to improve
performance in QE.
BOOL/Q0
#and(#1(radioactive waste) #or(process storage))
BOOL/Qn
#and(
#or(#1(radioactive waste) #3(radioactive waste)
#1(nuclear waste) #3(nuclear waste)
#1(#0(high active) waste) #3(#0(high active) waste)
#1(#0(low active) waste) #3(#0(low active) waste))
#or(process storage stock store repository))
OSUM. An OSUM query was a combination of a Boolean query consisting of terms of different
facets (i.e., original keys) and SYN clauses consisting of expansion keys for each facet. The
Boolean query and the SYN clauses were combined with the SUM operator. This structure gave
extra weight for the clause with the original search concepts. This approach was similar to query
combination explored in some earlier studies (e.g., Belkin et al. 1993, 1995; Shaw & Fox 1995).
The unexpanded OSUM query was identical with the unexpanded BOOL query.
OSUM/Q0
#and(#1(radioactive waste) #or(process storage))
OSUM/Qn
#sum(
#and(#1(radioactive waste) #or(process storage))
#syn(#3(radioactive waste) #1(nuclear waste) #3(nuclear waste)
#1(#0(high active) waste) #3(#0(high active) waste)
#1(#0(low active) waste) #3(#0(low active) waste))
#syn(stock store repository))
XSUM. The XSUM query structure is a modification of the Boolean structure: facets (OR clauses)
were combined with the SUM operator instead of the AND operator. An average is less
discriminating than a product; thus, a failure or a lack in the expansion of one facet does not affect
the total score dramatically. Within OR clauses, extra weight was given to original keys (terms)
through the structure. All expansion keys for one facet formed a SUM clause, which was combined
into an OR clause with the original key(s) of the facet.
XSUM/Q0
#sum(#1(radioactive waste) #or(process storage))
XSUM/Qn
#sum(
#or(#1(radioactive waste)
#sum(#3(radioactive waste) #1(nuclear waste) #3(nuclear waste)
#1(#0(high active) waste) #3(#0(high active) waste)
#1(#0(low active) waste) #3(#0(low active) waste)))
#or(process storage #sum(store stock repository)))
Structural features
Weakness or strength of query structures is not an absolute property. In Table 9 different
structural features are given scores to rank the query types by structuredness. The
scores reflect the difficulty of recognising the different structural features, yet they are in no sense
absolute. A structure type was given one point if phrases were identified, one point if search
strings were weighted, two points if concepts were identified, and three points if facets were
identified. Queries were divided into weakly (score ≤ 2) and strongly (score ≥ 3) structured queries
according to the presence or absence of concepts/facets. In this study, SUM, SYN1-2 and WSUM1-2
represent weak query structures, and SSYN1-2, ASYN, WSYN1-2, BOOL, XSUM, and OSUM
represent strong query structures.
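The scoring rule described above can be written out directly. The feature flags in the examples are illustrative readings of the prose (SUM has none of the features; WSUM1 uses phrases and weighted keys; SSYN1 uses phrases and identified concepts), not a reproduction of Table 9.

```python
# A sketch of the structure scoring described above:
# phrases 1 point, weighted search strings 1, concepts 2, facets 3;
# weak if the total is <= 2, strong if it is >= 3.
def structure_score(phrases, weighted_strings, concepts, facets):
    return ((1 if phrases else 0) + (1 if weighted_strings else 0)
            + (2 if concepts else 0) + (3 if facets else 0))

def classify(score):
    return "weak" if score <= 2 else "strong"

assert classify(structure_score(False, False, False, False)) == "weak"   # SUM
assert classify(structure_score(True, True, False, False)) == "weak"     # WSUM1
assert classify(structure_score(True, False, True, False)) == "strong"   # SSYN1
```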
Figures 11-12 illustrate the different structural combinations that are possible in InQuery. They
also show which combinations were tested in this study (boxes without shading). Structures are
classified according to the following principles:
Concept identification: no concepts identified / concepts identified / facets identified
Weighting of concepts: unweighted concepts / weighted concepts
Weighting of facets: unweighted facets / weighted facets
The type of the operator: soft Boolean (AND, OR) / probabilistic (the SUM, SYN and WSUM
operators) / hybrid (both soft Boolean and probabilistic operators)
Search key weighting: unweighted search keys / weighted search keys
Phrase recognition1: single keys / phrases.
1 NB. In Figures 11-12 proximity operators appear implicitly in phrase formation. They could be used between concepts
or facets as well; thus, they could be placed under both the soft Boolean and the probabilistic operators.
[Table 9: a feature-by-structure matrix assigning each query type its feature points – phrases
recognised (1), search keys weighted (1), concepts identified (2), facets identified (3; as logical,
weighted, equal or ‘hybrid’ facets) – and the resulting structure score per query type.]
Table 9. Structural features in test query types.
Structures by concept identification: Figure 11 is a classification diagram in the original layout and is summarised here. For structures with no concepts identified, the soft Boolean and hybrid combinations were not tested; with probabilistic operators, unweighted search keys as single words give SUM and SYN1, unweighted search keys as phrases give SYN2, and weighted search keys give WSUM1-2 (the remaining cells were not tested).
Figure 11. Query structure classification I.
Figure 12 is a classification diagram in the original layout and is summarised here. For structures with concepts recognised, only the probabilistic-operator combinations SSYN1 and SSYN2 were tested. For structures with facets recognised, the tested combinations were WSYN1-2 (probabilistic operators), BOOL (soft Boolean operators), and XSUM and OSUM (hybrid structures); all other combinations were not tested.
Figure 12. Query structure classification II.
Figure 12. Query structure classification III.
Combining variables
In all there were 13 query structure types, five of them representing weakly structured queries (SUM, WSUM1-2, SYN1-2) and eight representing strongly structured queries (BOOL, SSYN1-2, ASYN, WSYN1-2, OSUM, XSUM). Of the latter type, seven (all excluding SSYN1) were also facet structured. Queries were first constructed without expansion (Q0) and then expanded at all four expansion levels (Qs, Qn, Qa, Qf), except for the WSUM2 queries, which were either unexpanded or fully expanded (Q0/Qf) because of the weighting scheme. Queries were also formulated at two different complexity levels (F1/F2); the exceptions were the WSYN1-2 queries, which were formulated with all facets because of the weighting scheme. This setting generated 110 queries1 with different structure and expansion combinations for each of the 30 requests, i.e., in all 3,300 queries were run and evaluated for the present study.
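The count of 110 queries per request can be verified by enumerating the combinations just described. This is a sketch; the tuple encoding of query types is ours, and the deduplication rule restates the footnote (unexpanded BOOL and OSUM queries are identical, as are unexpanded WSUM1 and WSUM2 queries).

```python
# Enumerating structure x complexity x expansion combinations: the raw
# product gives 114; removing the four duplicate unexpanded queries
# (BOOL/OSUM and WSUM1/WSUM2, at two complexity levels each) gives 110.

FULL = ["Q0", "Qs", "Qn", "Qa", "Qf"]

plan = {  # structure: (complexity levels, expansion levels)
    "SUM":   (["F1", "F2"], FULL),
    "SYN1":  (["F1", "F2"], FULL),
    "SYN2":  (["F1", "F2"], FULL),
    "WSUM1": (["F1", "F2"], FULL),
    "WSUM2": (["F1", "F2"], ["Q0", "Qf"]),
    "SSYN1": (["F1", "F2"], FULL),
    "SSYN2": (["F1", "F2"], FULL),
    "ASYN":  (["F1", "F2"], FULL),
    "WSYN1": (["F2"], FULL),
    "WSYN2": (["F2"], FULL),
    "BOOL":  (["F1", "F2"], FULL),
    "XSUM":  (["F1", "F2"], FULL),
    "OSUM":  (["F1", "F2"], FULL),
}

def canonical(structure, complexity, expansion):
    # Unexpanded queries that coincide are mapped to one representative.
    if expansion == "Q0" and structure in ("OSUM", "BOOL"):
        structure = "BOOL"
    if expansion == "Q0" and structure in ("WSUM1", "WSUM2"):
        structure = "WSUM1"
    return (structure, complexity, expansion)

raw = [(s, c, e) for s, (cs, es) in plan.items() for c in cs for e in es]
distinct = {canonical(*q) for q in raw}
assert len(raw) == 114 and len(distinct) == 110
```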
Query structure   Weighting        Complexity   Expansion
Weak
  SUM             –                F1/F2        Q0 – Qf
  SYN1-2          –                F1/F2        Q0 – Qf
  WSUM1           All keys         F1/F2        Q0 – Qf
  WSUM2           All keys         F1/F2        Q0/Qf
Strong
  SSYN1-2         –                F1/F2        Q0 – Qf
  ASYN            –                F1/F2        Q0 – Qf
  WSYN1-2         Facets           F2           Q0 – Qf
  BOOL            –                F1/F2        Q0 – Qf
  XSUM            Original keys*   F1/F2        Q0 – Qf
  OSUM            Original keys*   F1/F2        Q0 – Qf

Legend: F1 = major facets only, F2 = all facets; Q0 – Qf = the five expansion types.
* Weighting of the terms was gained through structure, i.e., no explicit weights were assigned.
Table 10. Overview of the combination of controlled variables.
Query structure is a prerequisite for a query; hence, it was fixed first and the other variables were considered in relation to it. Table 10 gives an overview of the combinations of the variables. Together they determine the overall structure of a query.
7.5 Relevance Assessments and Measures of Effectiveness
Relevance
Measurement of effectiveness in IR is based on the notion of relevance. Relevance should reveal
whether the retrieved document was a wanted document according to the query, which represents
the request, which represents the information need. Thus, a relevant document should contain the information needed. This, however, is an ideal case, because the correspondence between the query and the request, and between the request and the information need, is not one-to-one. A document may be relevant in relation, say, to the query and the request, but not to the information need, or relevant only in combination with another document (Harter 1992; Swanson 1988).
1 This number is not 114 because unexpanded BOOL queries and OSUM queries were always identical, and so were unexpanded WSUM1 and WSUM2 queries.
Researchers with different research interests have used relevance in different meanings –
referring to different concepts (Saracevic 1975; Belkin 1981; Fugmann 1985; Schamber,
Eisenberg & Nilan 1990; Schamber 1994). Saracevic (1996; 1997) points out that there are different frameworks in which relevance has been used and defined, and thus that there is more than one relevance. Saracevic (1996) distinguishes between the following relevances:
• System or algorithmic relevance is a relation between a query and documents in a given
IR system with a given procedure or algorithm.
• Topical or subject relevance is a relation between the subject expressed in a query and
subject covered by documents in the file of the system. (‘Aboutness is the criterion by
which topicality is inferred.’)
• Cognitive relevance or pertinence is a relation between the state of knowledge and
cognitive information need of the user, and documents retrieved, or in the file of the
system, or even in existence.
• Situational relevance or utility is a relation between the task, or problem at hand, and
documents retrieved by or contained in the system, or even in existence.
• Motivational or affective relevance is a relation between the intents, goals and
motivations of the user and documents retrieved, or in the file of the system, or even in
existence.
In this study the relevance used as a basis for measurement is topical relevance. We assume
that topical relevance is a prerequisite for other relevances, and for IR systems to function
reasonably (see Froehlich 1994; Kristensen 1995). Further, topical relevance is the only one obtainable in laboratory experiments. Even with topical relevance, the problem of judging it remains. Ellis
(1996) and Saracevic (1996) point out that this judgement is a human action (using people as
measuring instruments) and, thus, it lacks the exactness of physical sciences. In addition, the
dichotomous judgements used in relevance assessments (i.e., relevant / non-relevant) are a
simplification because relevance is more likely to be continuous (Robertson 1977a; Robertson &
Belkin 1978). Nevertheless, the idea of the organisation of information presupposes that the
aboutness of an information item can be resolved, and topicality is aboutness. The judgements
may vary, but aboutness can be agreed upon in a similar sense as literal meaning can be agreed
upon.
Measures of effectiveness
Prevailing performance measures based on relevance are recall and precision. Recall is defined as

    recall = (number of relevant documents retrieved) / (total number of relevant documents in the collection).

Precision is defined as

    precision = (number of relevant documents retrieved) / (total number of documents retrieved).
(Lancaster 1986, 132-133.) In this study precision and recall scores were calculated on the basis of dichotomous relevance, i.e., relevant articles were those judged fairly relevant or relevant; non-relevant were articles judged marginally relevant or non-relevant (see Recall base, p. 96).
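Computed directly from these definitions, for a single query result, the two measures can be sketched as follows (the document identifiers are invented):

```python
# Precision and recall from the definitions above: both count the relevant
# documents among those retrieved, but divide by the retrieved set and the
# full relevant set respectively.

def precision_recall(retrieved, relevant):
    """retrieved: result list of one query; relevant: all relevant docs
    in the collection for the corresponding request."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"],
                        relevant=["d2", "d4", "d9"])
assert (p, r) == (0.5, 2 / 3)
```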
Retrieval evaluation results achieved by precision and recall are typically presented in two
forms, (1) as an average precision-recall curve (P-R curve), or (2) as average precision and recall
at a fixed document cut-off value (DCV). The former measures the precision at the same fixed
levels of recall for all requests and averages the results, the latter examines a fixed number of
documents for each request and uses this information to compute recall and precision scores,
which are averaged over requests. (Hull 1993, 329.)
In the average P-R curve precision and recall are used as a bivariate measure of retrieval effectiveness. Recall is defined as an independent variable and precision as a dependent variable, since precision is averaged at the fixed levels of recall. Because precision cannot be exactly defined at all fixed recall levels, interpolation is needed. Further, 100 per cent recall is not reached by every algorithm or query; thus, the P-R curve is not faithful to the actual data points. The assumptions attached to the P-R curve are that a particular level of recall must be attained by every request, and that the best method is the one that reaches this level with the fewest non-relevant documents. The P-R curve provides no information about the number of documents that have to be retrieved in order to reach a given recall level. Because requests have different numbers of relevant documents, a recall level of, say, 50% may mean a result set of 20 or 200 documents for different requests. (Keen 1992, 494; Tague-Sutcliffe 1992, 484; Hull 1993, 331.)
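The interpolation step can be illustrated with the common rule that precision at a fixed recall level r is taken as the highest precision observed at any actual recall of at least r. The study does not spell out its interpolation method, so the following is a general sketch, not the exact procedure used:

```python
# "Ceiling" interpolation of a P-R curve: for each fixed recall level,
# take the maximum precision among the observed points whose recall is at
# least that level; levels no query reaches get precision 0.

def interpolated_precision(points, levels):
    """points: observed (recall, precision) pairs for one query;
    levels: fixed recall levels, e.g. 0.1 .. 1.0."""
    curve = []
    for r in levels:
        candidates = [p for (rec, p) in points if rec >= r]
        curve.append(max(candidates) if candidates else 0.0)
    return curve

points = [(0.2, 0.8), (0.4, 0.5), (0.6, 0.6), (0.8, 0.3)]
levels = [0.2, 0.4, 0.6, 0.8, 1.0]
assert interpolated_precision(points, levels) == [0.8, 0.6, 0.6, 0.3, 0.0]
```

Note how the sawtooth at recall 0.4-0.6 is smoothed to 0.6, and how the unreached 100 per cent recall level gets precision 0, which is why the averaged curve is not faithful to the actual data points.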
The DCV method uses the number of retrieved documents as an independent variable and
precision and recall as dependent variables. Thus, from a user’s point of view evaluation is based
on equivalent effort rather than equivalent performance. If the DCV is larger than the number of
relevant documents for any request, it is impossible to get 100 per cent precision. If the DCV is
smaller than the number of relevant documents for any request, it is impossible to get 100 per cent
recall. Precision and recall should be measured over a range of DCVs, not at one cut-off point, and
then averaged over requests. This alleviates the irregularities caused by one cut-off point. (Keen
1992, 493–494; Hull 1993, 331–332.)
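The DCV-based evaluation described above can be sketched as follows (the rankings and relevance sets are invented toy data, not from the experiment):

```python
# Precision and recall at a range of document cut-off values, averaged
# over requests: the user-effort-oriented alternative to the P-R curve.

def p_r_at_dcv(ranking, relevant, dcv):
    """Precision and recall after examining the first dcv documents."""
    hits = len(set(ranking[:dcv]) & set(relevant))
    return hits / dcv, hits / len(relevant)

def averaged_over_requests(runs, dcvs):
    """runs: one (ranking, relevant) pair per request."""
    out = {}
    for dcv in dcvs:
        scores = [p_r_at_dcv(rank, rel, dcv) for rank, rel in runs]
        out[dcv] = (sum(p for p, _ in scores) / len(scores),
                    sum(r for _, r in scores) / len(scores))
    return out

runs = [(["a", "b", "c", "d"], ["a", "c"]),
        (["e", "f", "g", "h"], ["f"])]
avg = averaged_over_requests(runs, dcvs=[2, 4])
assert avg[2] == (0.5, 0.75)
```

At DCV 4 the second request cannot exceed 25 per cent precision (it has only one relevant document), illustrating why a single cut-off point is misleading and a range of DCVs is preferred.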
Hull (1996, 75) points out that when the number of relevant documents differs greatly between requests, the variability in precision may be somewhat misleading. He proposes that the unequal variance in the evaluation measure could be normalised by replacing the score of each method with its rank among the methods with respect to each request.
Recall base
In ideal IR experiments the whole test collection is judged for relevance, i.e., the relevance of each document in the collection in relation to every request is known. If the number of documents in the collection is large, it may be impossible to judge the relevance of all documents; then recall has to be estimated. For this, the number of relevant documents in the collection for each test request should be estimated. We refer to the estimate of the total number of relevant documents for all test requests as the recall base.
For the test requests and test collection of the present experiment, a recall base was partly
constructed in earlier studies. Sormunen (1994) constructed a set of Boolean queries with varied
complexity and broadness for each request. Retrieved documents were assessed for relevance by
two journalists and two information analysts. The assessments were based on a four level scale:
(0) not relevant, (1) the topic of the request is mentioned, but only in passing, marginally relevant,
78
(2) the topic of request is discussed briefly, fairly relevant, (3) the topic is the main theme of the
article, relevant. The relevance of 20 requests was assessed by two (one by three) persons, the rest
by one person. The assessors agreed in 73% of the parallel assessments, in 21% of the cases the
difference was one point, and in 6% two or three points. If the difference was one point, the
assessment was chosen from each judge in turn. If the difference was two or three points, the
article was checked by the researcher to find out if there was a logical reason for disagreement,
and a more plausible alternative was selected. (Sormunen 1994, 92.)
Kristensen (1996) tested query expansion with different query structures on 30 requests of the
collection. The retrieved documents that were not in the recall base were judged for relevance by
three of the earlier judges using the same scale and procedure as before. Three of the requests were
assessed by three, 10 requests by two, and 17 requests by one person. Based on the 13 requests judged by more than one person, the inter-assessor consistencies within the three assessor pairs were 77.3%, 77.7%, and 93.2% (83% on average)1. To find out how consistent each person was in his or her own assessments, 24 previously judged documents were rejudged. The intra-assessor consistencies were 79.2%, 91.7%, and 95.8% (88.9% on average). After these two studies, the recall base (for the 30 requests of the present
study) contained relevance assessments for 8,321 documents of which 1,024 were relevant.
The result set size for queries in the present study was set at 50 (that is, document cut-off value
was 50). The total number of documents retrieved was 30 (number of requests) x 110 (number of
query types) x 50 (DCV) = 165,000. Documents retrieved for each request were pooled, duplicates
were removed, and the remaining documents were compared against the existing recall base. 7,010
documents did not exist in the recall base. These documents were assessed by the same three persons, who were given the same instructions as in the study by Kristensen (1996); the results for each request were assessed by one person. Among these documents, there were 42 new relevant documents. Thus, the total number of known relevant documents for the 30 test requests was 1,066; the rest of the database, 52,827 documents, was assumed to be non-relevant to any of the 30 requests.
Can the recall base be considered reliable? If the recall base does not include all relevant documents of the database, recall figures are too high when calculated at different DCVs up to 50. If P-R curves are considered, both precision and recall may be erroneous, the errors being greater at high recall levels than at low recall levels. There were not many relevant documents in the set of newly retrieved documents. This was expected, because Sormunen tested Boolean queries with low complexity and high broadness. In the present study the new relevant documents were retrieved by queries of 13 requests; 17 requests did not yield any new relevant documents. The number of new relevant documents per request varied from one to 10. For three requests the percentage of new relevant documents among all known relevant documents was higher than 10 (request #5 / 18.5%, request #8 / 27.6%, request #10 / 10.2%). Most of the new relevant documents were gathered between DCV 20 and 30. In other words, with DCV 10, in all five new relevant documents (12%) were retrieved; with DCV 20 the number of new relevant documents was 19 (45%); DCV 30 retrieved 31 new relevant documents (74%); and DCV 40 and 50 retrieved 35 (83%) and 42 (100%) new relevant documents respectively. Thus, it seems very likely that inspection of the rest of the database would result in fewer new relevant documents than were found with DCV 50 in this test setting.
1 NB. For these consistency figures, the relevance judgement scale was transformed into binary judgements, i.e.,
documents given scores from 0 to 1 were non-relevant, and those with scores of 2-3 were relevant. Thus, differences
between these two groups were measured.
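The transformation described in this footnote, and the resulting consistency percentage, can be sketched as follows (the judgement scores are invented):

```python
# Binary transformation of the four-level relevance scale: scores 0-1
# count as non-relevant, 2-3 as relevant. Consistency is then the share
# of documents on which two sets of judgements agree after transforming.

def binary(score):
    return 1 if score >= 2 else 0

def consistency(judgements_a, judgements_b):
    pairs = list(zip(judgements_a, judgements_b))
    agree = sum(binary(a) == binary(b) for a, b in pairs)
    return agree / len(pairs)

# Score pairs (1, 0) and (3, 2) agree after the transformation
# even though the raw scores differ by one point.
assert consistency([1, 3, 2, 0], [0, 2, 0, 2]) == 0.5
```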
We shall demonstrate the effectiveness of different query types as average precision over 11 DCVs
(cut-off values being 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50), and as P-R curves – precision at 10
recall levels. Precision, instead of recall, was chosen for the former because in average precision
each relevant document has equal weight, while in average recall relevant documents from queries
with a larger relevant document set are less important than those from queries with a smaller
relevant set (Hull 1993, 331). Further, average precision scores over DCVs were replaced by
average ranks for different structure and expansion combinations with respect to each request, in
order to normalise the variance of the number of relevant documents of different requests.
7.6 Statistical Testing
The measurement in IR experiments brings about several problems with regard to statistical
testing. Robertson (1981, 25) points out that sampling and sample sizes are problematic in IR
experiments. This holds especially for requests, since usually it is easier to get a sufficient number
of documents than requests. However the requests are collected, they cannot be regarded as a random sample; instead, they are a selection. At least, the characteristics of the queries should be described so that they can be compared to what is known about requests in general.
Another problem with statistical testing is related to recall. The parametric methods of
statistical inference assume that the underlying distribution is normal. Robertson remarks:
“Because there may be few relevant documents for any given query1, and because the values
of recall for individual queries may be very widely spread, the distribution of recall values
over queries tends to look very strange indeed. In particular one tends to find many
occurrences of the extreme values (0 or 100 per cent), and many occurrences of those values
that happen to be low-denominator fractions (e.g. 75, 33, 60 per cent).”
(Robertson 1981, 25.)
Non-parametric methods for significance tests in IR experiments are recommended by Robertson
(1981, 25), Keen (1992, 496–497), and Hull (1993, 333). These tests have less power2 but their
requirements on distributions and sample sizes are not as strict as those of the parametric
counterparts. Suitable tests for comparing two methods3 are the sign test and the Wilcoxon signed
ranks test. The sign test utilises information only about the order of the differences between test
results, but the Wilcoxon test uses the relative magnitude as well, thus, the latter assumes the
interval scale. (Hull 1993, 333; Siegel & Castellan 1988, 87). Keen (1992, 496) says that “the
precise suitability of the data for Wilcoxon test has a measure of doubt about it”, and van
Rijsbergen (1979, 179) argues that only the sign test is valid for IR experiments, because precision
and recall are discrete measures. Conversely, Hull (1993, 334) states that discrete distributions can
be approximated by the normal distribution, if the measure in question is averaged over a
reasonable number of discrete measures. This means that even parametric statistics could be suitable. Thus, Hull advocates trying all of the tests (parametric and non-parametric) on the data. If the results are very different, the data can be examined with diagnostic plots in order to determine which test is most reliable. A possible parametric test is the two-sample t-test, which compares the magnitude of the differences between two methods to the variation among the differences. (Hull 1993, 334; see also Zobel 1998.)
1 Note that Robertson uses query in the same sense as request is used in this study.
2 The power of the test is defined as the probability of rejecting H0 (the null hypothesis) when it is in fact false. The null hypothesis is a hypothesis of ‘no effect’. (Siegel & Castellan 1988, 7, 10.)
3 A case where two categories or a matched pair is compared (see Siegel & Castellan 1988, 73).
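As an illustration of the sign test for paired comparisons (the per-request precision scores are invented; a Wilcoxon test would additionally use the magnitudes of the differences, not only their direction):

```python
# Sign test: count how often method A beats method B across requests
# (ties are dropped) and compute the exact two-sided binomial probability
# under the null hypothesis that each direction is equally likely.

from math import comb

def sign_test(scores_a, scores_b):
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n, wins = len(diffs), sum(d > 0 for d in diffs)
    k = min(wins, n - wins)
    # two-sided exact p-value: both tails of Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

a = [0.50, 0.42, 0.61, 0.38, 0.55, 0.47]
b = [0.45, 0.40, 0.58, 0.38, 0.51, 0.44]
p = sign_test(a, b)  # A wins all five untied pairs
assert p == 0.0625
```

With only five untied pairs even a clean sweep does not reach the conventional 0.05 level, which illustrates why the number of requests, rather than documents, limits the power of such tests.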
If more than two methods are to be compared, the test should not be conducted between each pair, because there is a chance of finding a significant difference between some pair even though none exists when all the methods are considered at the same time. A parametric test for more than two samples is a two-way Analysis of Variance (ANOVA), which simultaneously compares the average scores of each method. The Friedman two-way analysis of variance by ranks is a generalisation of the sign test, a non-parametric alternative for comparing more than two related samples. (Conover 1980, 299; Siegel & Castellan 1988, 168, 174; Hull 1993, 334.)
Keen (1992, 497) points out that statistical significance does not necessarily imply practical
importance. As a rule of thumb Sparck Jones (1974) proposes that the differences in the mean
results would be regarded as material if > 10%, noticeable if 5% – 10%, and not noticeable if
< 5%. The consistency of differences implies practical importance, and it shows in statistical tests
as well.
In the present study non-parametric statistical tests were used, because the data did not seem to fit the requirements of parametric tests1. Since more than two methods were tested, the Friedman two-way analysis of variance by ranks was selected for statistical testing. It tests whether k related samples or repeated measures come from the same population or from populations with the same median. The data is cast in b rows and k columns, where the rows represent requests (units) and the columns the respective queries (variables). The test is based on ranks: the scores of each row are ranked from 1 to k, and the Friedman test determines the probability that the rank totals of the variables (columns) differ significantly from the values that would be expected by chance. (Siegel & Castellan 1988, 175-176.)
The test value, Fc, may be computed by the following formula2:

    Fc = (b − 1)(B2 − bk(k + 1)²/4) / (A2 − B2)    (24)

where

    A2 = Σ(i=1..b) Σ(j=1..k) (R(Xij))²    (25)

and

    B2 = (1/b) Σ(j=1..k) Rj²    (26)

and

    b = number of rows
    k = number of columns
    R(Xij) = rank of the cell in the ith row and jth column
    Rj = sum of ranks in the jth column. (Conover 1980, 299–300.)
1 For example, the number of ties in the data indicates that errors are not continuous. (See Hull 1993, 335.)
2 The formula allows ties.
If the Friedman test allows the rejection of the null hypothesis1, at least one of the tested methods is known to differ from at least one other method. Methods i and j are considered different if the following inequality is satisfied:

    | Rj − Ri | ≥ t(1 − α/2) [2b(A2 − B2) / ((b − 1)(k − 1))]^(1/2)    (27)

where

    Rj and Ri are the rank sums of the jth and ith columns
    A2, B2, b, and k are given above
    t(1 − α/2) is the 1 − α/2 quantile of the t distribution with (b − 1)(k − 1) degrees of freedom, i.e., the critical value below which a proportion 1 − α/2 of the distribution lies.

(Conover 1980, 300.)2
1 H0: all the variables come from populations with the same median.
2 Hull (1996) recommends the Friedman test in the form given by Conover (1980) because it is more powerful than the usual form given, for example, by Siegel and Castellan (1988). The former is used in this study because it proved more powerful in the test.
8 Results
8.1 Characteristics of Queries: Complexity, Coverage and Broadness
The conceptual analysis of the 30 test requests resulted in the identification of 112 facets. The number of facets per conceptual query plan ranged from 3 to 5, averaging 3.7. The number of major facets ranged from 2 to 3 (average 2.1, total 64), and the number of minor facets from 1 to 3 (average 1.6, total 48). Thus, the number of major facets per conceptual query plan was higher than the number of minor facets.
The concepts in major and minor facets were different in nature. The concepts in major facets
were specific things, persons, organisations, etc., whereas the concepts in minor facets were more
general matters, such as abstract objects, processes, actions and occasions. Their semantic
relations were also different: the former type had more hierarchical relations, the latter more
associative relations. In order to demonstrate the difference between the facets, we divided the search concepts into five classes: geographical names, other proper names, other concrete objects, abstract objects, and processes/events. Table 11 shows a summary of the classification.
The names of the concepts in each class are given in Appendix 3. In major facets, there were quite
a lot of names of persons, organisations and geographical places. In major facets the number of
concrete objects was markedly higher than in minor facets. All these concept types, except the
names of persons, appear in hierarchies, whereas the concepts of minor facets, abstract objects and
processes / events, tend to have more associative relations.
               Geograph.  Other proper  Other        Abstract  Process/  All
               name       name          concr. obj.  object    event     concepts
Major facets   11         16            15           18        10         70
Minor facets    7          1             2           23        42         75
All facets     18         17            17           41        52        145

Table 11. Classification of the search concepts.
The average number of concepts in a query, the average coverage, is given in Table 12. The
average number of concepts per facet is also given. It should be remembered that F1 denotes major
facets (average per query 2.1), and F2 all facets (average per query 3.7). As stated above, major and minor facets behaved differently in concept expansion: expansion brought more narrower concepts into major facets than into minor facets, which, in turn, gained more associative concepts than major facets. The average broadness of differently expanded queries at the two complexity levels is
given in Table 13. The broadness figures are different for query structures where phrases are not
identified (SUM, SYN1-2), and query structures with phrases (all others).
Average number of concepts per facet

Complexity levels      QE types
                       Q0    Qn    Qa    Qf
F1 (major facets)      1.1   3.8   4.3   7.0
–  (minor facets)      1.6   2.1   7.6   8.0
F2 (all facets)        1.3   3.0   5.5   7.2

Average number of concepts per query

Complexity levels      QE types
                       Q0    Qn    Qa    Qf
F1 (major facets)      2.4   8.0   9.0   14.7
F2 (all facets)        4.9   11.3  20.3  26.8

Table 12. Average number of concepts per facet and query.
Average number of search keys per query (no phrases)

Complexity levels      QE types
                       Q0    Qs    Qn    Qa    Qf
F1 (major facets)      3.5   11.8  21.4  26.9  36.8
F2 (all facets)        6.1   18.5  30.6  49.5  62.3

Average number of search keys per query (phrases recognised)

Complexity levels      QE types
                       Q0    Qs    Qn    Qa    Qf
F1 (major facets)      2.9   7.9   16.4  20.6  29.0
F2 (all facets)        5.4   14.1  24.4  42.1  52.4

Table 13. Average broadness.
8.2 Precision of Queries
The precision scores of combinations of query structures, expansion types and complexity levels
are given in Tables 14 and 15. In the former the average precision scores are calculated over 11
document cut-off values (i.e., cut-off values 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50; referred to as
p@50), in the latter the scores are average precisions over 10 recall levels (i.e., 10-100; referred to
as 10pr). The overall precision level is higher for p@50 scores than for 10pr scores, because in the
latter case the high recall levels imply that the number of retrieved documents is often higher than
50, and the number of non-relevant documents is also larger. The overall picture of the
performance of the different combinations is similar. However, the ranking order of combinations
is not exactly the same in 10pr results and p@50 results. There are some changes between and
within structure types (e.g., between structures: the p@50 score of SSYN2/F2/Qf is better than the p@50 score of ASYN/F2/Qf, but when comparing the 10pr scores the order is reversed; within structures: the p@50 score of XSUM/F2/Qf is better than the p@50 score of XSUM/F1/Qf, but the order is reversed for the 10pr scores). This implies that there are crossovers in performance, i.e.,
comparing two combinations, the first is better at the low recall end but the other performs better
at the high recall end. The highest recall levels were not obtainable with DCV 50 in every case for
two reasons: (1) the number of relevant documents was more than 50 for seven requests; (2) many
combinations had not retrieved all relevant documents at DCV 50 though the number of relevant
documents was less than 50.
Weak query structures
In the weakly structured queries three tendencies were apparent: first, for the SUM and WSUM2
query structures QE was detrimental; second, the performance of the SYN1-2 and WSUM1 query
structures was slightly enhanced by some expansion types, but precision did not increase
continuously; third, queries with high complexity were in most cases more effective than queries
with low complexity. (See Tables 14-15, and Appendix 4, Fig. 1–10.) Within weak structures, the
largest gain through expansion was 5.2 percentage units for the p@50 results, and 3.6 percentage
units for the 10pr results. Thus, QE did not prove very effective within weak structures. Losses
through QE were greater, at most 11.7 percentage units for the p@50 results, and 6.6 percentage
units for the 10pr results. The greatest losses were in the precision scores of the SUM/F1 queries,
the greatest benefits in the precision of SYN2/F2 queries, though the latter had a very low
precision level overall.
The weighting scheme for the WSUM queries was quite mechanistic, based on each expansion key's relationship to the original key. If the weighting had been based on some other criterion, e.g., on the frequencies of keys in the database, or on tf*idf weights, the results might have been different.
The effects of phrase recognition were somewhat unexpected: the SUM queries (i.e., queries without phrases) were the most effective of all weak query structures according to p@50. The 10pr results gave one higher score, for the WSUM1/F2/Qn combination, than the score of SUM/F2/Q0. A probable explanation for this phrase phenomenon is that a phrase as the only expression for a concept (since there was no expansion) was too strict a condition; single search keys were more effective. The splitting of compound words for the database index intensified the effect, because single words matched the components of compound words; thus, relaxing the phrases gave search keys with higher frequencies.
Another explanation might be that the repeated search keys, which resulted from relaxing the phrases, were not removed from the queries. In InQuery this repetition within SUM or WSUM structures gives extra weight to the repeated keys. However, the advantage was lost when the SUM queries were expanded. A similar pattern can be detected within the SYN queries – the precision of the unexpanded SYN1 queries was slightly better than the precision of the unexpanded SYN2 queries. The repetition of search keys is not a valid explanation for the SYN structures, but the strictness of phrases as the only search keys holds here as well.
Comparison of the different weak query structures reveals two groups: the SYN structures, whose performance is the lowest of all structures, and the other weak structures (SUM, WSUM1-2), whose performance is very similar and, without QE or with slight QE, the best overall (see Fig. 13). One might try to explain the failure of QE in the weakly structured queries by defects in the test thesaurus. This explanation would imply failures in the strong structures as well as in the weak structures. However, since QE proved effective with some of the strong structures, this explanation seems wrong. Instead, we seek the explanation in the query structures.
Query structures/   QE types
Complexity level    Q0     Qs     Q0-Qs   Qn     Q0-Qn   Qa     Q0-Qa   Qf     Q0-Qf

Weak
SUM/F1              46.5   40.7   -5.8    34.8   -11.7   37.3   -9.2    36.1   -10.4
SUM/F2              47.7   44.0   -3.7    40.7   -7.0    39.3   -8.4    40.8   -6.9
SYN1/F1             24.9   26.1   +1.2    24.7   -0.2    23.9   -1.0    26.1   +1.2
SYN1/F2             24.7   26.0   +1.3    27.5   +2.8    27.7   +3.0    29.4   +4.7
SYN2/F1             24.1   27.7   +3.6    24.1   0       24.5   +0.4    23.5   -0.6
SYN2/F2             20.9   26.1   +5.2    25.2   +4.3    23.1   +2.2    25.1   +4.2
WSUM1/F1            43.2   44.3   +1.1    39.2   -4.0    41.8   -1.4    40.8   -2.4
WSUM1/F2            44.3   46.3   +2.0    44.0   -0.3    46.1   +1.8    45.2   +0.9
WSUM2/F1            43.2   •      •       •      •       •      •       36.6   -6.6
WSUM2/F2            44.3   •      •       •      •       •      •       41.3   -3.0

Strong
SSYN1/F1            44.2   50.9   +6.7    52.5   +8.3    48.0   +3.8    51.0   +6.8
SSYN1/F2            45.0   50.5   +5.5    50.3   +5.3    54.9   +9.9    55.0   +10.0
SSYN2/F1            43.7   51.4   +7.7    53.2   +9.5    49.7   +6.0    51.8   +8.1
SSYN2/F2            45.1   51.0   +5.9    51.9   +6.8    54.1   +9.0    56.3   +11.2
ASYN/F1             44.2   51.9   +7.7    54.1   +9.9    50.0   +5.8    52.4   +8.2
ASYN/F2             45.6   50.9   +5.3    52.7   +7.1    54.4   +8.8    56.1   +10.5
WSYN1/F2            45.9   53.1   +7.2    54.9   +9.0    53.0   +7.1    56.0   +10.1
WSYN2/F2            46.7   52.0   +5.3    52.9   +6.2    54.6   +8.0    56.7   +10.0
BOOL/F1             40.9   37.3   -3.6    31.8   -9.1    29.6   -11.3   28.6   -12.3
BOOL/F2             42.5   32.0   -10.5   20.4   -22.1   29.3   -13.2   25.1   -17.4
XSUM/F1             41.0   43.1   +2.1    44.9   +3.9    44.5   +3.5    46.0   +5.0
XSUM/F2             42.6   44.1   +1.5    44.1   +1.5    45.7   +3.1    46.1   +3.5
OSUM/F1             40.9   39.1   -1.8    40.5   -0.4    39.5   -1.4    44.3   +3.4
OSUM/F2             42.5   34.5   -8.0    37.1   -5.4    38.0   -4.5    45.1   +2.6

Table 14. Precision averages over 11 document cut-off values. (N=30)
(NB. In the original table the best precision score of each column was set in italics, the best precision score of each row in bold face, and precision scores higher than the precision of SUM/F2/Q0 were shaded; these marks cannot be reproduced here.)
Query structures/   QE types
Complexity level    Q0     Qs     Q0-Qs   Qn     Q0-Qn   Qa     Q0-Qa   Qf     Q0-Qf

Weak
SUM/F1              36.0   31.9   -4.1    29.4   -6.6    29.4   -6.6    29.5   -6.5
SUM/F2              36.8   34.8   -2.0    32.9   -3.9    30.3   -6.5    32.2   -4.6
SYN1/F1             17.8   16.4   -1.4    16.9   -0.9    15.2   -2.6    16.9   -0.9
SYN1/F2             15.8   15.7   -0.1    17.5   +1.7    16.5   +0.7    18.4   +2.6
SYN2/F1             16.6   18.7   +2.1    17.3   +0.7    16.4   -0.2    16.0   -0.6
SYN2/F2             12.8   16.0   +3.2    16.4   +3.6    13.4   +0.6    15.2   +2.4
WSUM1/F1            33.9   35.1   +1.2    33.3   -0.6    33.4   -0.5    34.4   +0.5
WSUM1/F2            33.8   36.2   +2.4    36.9   +3.1    35.4   +1.6    36.4   +2.6
WSUM2/F1            33.9   •      •       •      •       •      •       30.0   -3.9
WSUM2/F2            33.8   •      •       •      •       •      •       32.6   -1.2

Strong
SSYN1/F1            34.6   42.6   +8.0    45.7   +11.1   41.0   +6.4    45.2   +10.6
SSYN1/F2            34.2   41.3   +7.1    43.8   +9.6    47.2   +13.0   49.1   +14.9
SSYN2/F1            34.5   42.8   +8.3    46.3   +11.8   43.0   +8.5    46.5   +12.0
SSYN2/F2            34.4   42.5   +8.1    45.9   +11.5   48.0   +13.6   51.1   +16.7
ASYN/F1             35.4   43.2   +7.8    47.2   +11.8   43.6   +8.2    47.3   +11.9
ASYN/F2             34.8   42.4   +7.6    45.9   +11.1   48.5   +13.7   51.6   +16.8
WSYN1/F2            36.4   44.9   +8.5    48.7   +12.3   47.0   +10.6   51.0   +14.6
WSYN2/F2            35.9   44.5   +8.6    48.1   +12.2   48.3   +12.4   51.8   +15.9
BOOL/F1             32.3   27.5   -4.8    26.3   -6.0    22.9   -9.4    24.8   -7.5
BOOL/F2             32.7   24.8   -7.9    16.7   -16.0   21.9   -10.8   20.7   -12.0
XSUM/F1             32.1   34.3   +2.2    35.6   +3.5    36.6   +4.5    36.4   +4.3
XSUM/F2             32.4   34.5   +2.1    34.3   +1.9    37.2   +4.8    36.1   +3.7
OSUM/F1             32.3   29.2   -3.1    31.4   -0.9    32.3   0       36.3   +4.0
OSUM/F2             32.7   22.8   -9.9    25.9   -6.8    28.9   -3.8    32.1   -0.6

Table 15. Precision averages over 10 recall levels. (N=30)
(NB. In the original table the best precision score of each column was set in italics, the best precision score of each row in bold face, and precision scores higher than the precision of SUM/F2/Q0 were shaded; these marks cannot be reproduced here.)
The operator SUM ranks the documents according to the average weight of the search keys. Search
keys present in a document get their tf*idf weights, and absent keys get the default value. A
simple average of the weights is a ‘bag of words’ solution which takes into account neither the
relative importance of the words nor the grouping of keys into concepts or facets. When the number
of search keys is small, the SUM queries perform best. Any expansion, however, tends to decrease
effectiveness, because expansion keys increase the number of possible key combinations in a
document, and the structure does not ensure that all search concepts are represented in the
retrieved documents. Consequently, documents with many occurrences of a single search key, or of a
combination of search keys that is arbitrary from the relevance point of view, may rank high.
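The SUM behaviour described above can be sketched in a few lines. This is an illustrative stand-in, not InQuery's actual belief computation: keys present in a document contribute a tf*idf-style weight, absent keys contribute a default (0.4, the InQuery default mentioned later in the text), and the documents, keys and weights are invented for the example.

```python
# Illustrative sketch of SUM ranking -- NOT InQuery's actual belief formula.
# Present keys contribute their weight; absent keys contribute a default.

DEFAULT = 0.4

def sum_score(doc_weights, query_keys, default=DEFAULT):
    """Plain average of search-key weights: a 'bag of words' score."""
    return sum(doc_weights.get(k, default) for k in query_keys) / len(query_keys)

query = ["green", "party", "parliament", "voting"]
doc_one_key = {"green": 0.95}              # one very strong key only
doc_all_keys = {k: 0.5 for k in query}     # every facet moderately present

# Nothing in the structure requires all facets, so the one-key document wins:
print(round(sum_score(doc_one_key, query), 4))   # → 0.5375
print(round(sum_score(doc_all_keys, query), 4))  # → 0.5
```

The second document covers every facet, yet the first ranks higher: exactly the arbitrary-combination problem the text describes.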
[Figure omitted in this version: P-R curves (precision vs. recall, both axes 0-100%) for the combinations SUM/F2/Q0, WSUM1/F2/Qn, WSUM2/F1/Q0, SYN1/F2/Qf and SYN2/F1/Qs.]
Figure 13. P-R curves of the best weak query structure, complexity level and QE combinations.
The WSUM queries attacked this problem by weighting original search keys higher than
expansion keys, as suggested in the literature. The ranking is based on weighted averages of the
search key weights. The weighting supported QE somewhat, but the overall performance was not
markedly better than that of the SUM queries; thus the problem of arbitrary search key
combinations remained. In the WSUM1 queries all expansion keys were given equal weights, lower
than the weights of the original keys; in the WSUM2 queries the expansion keys were weighted
according to their semantic relation type, but again with lower weights than the original keys.
Comparison of these two structures shows that the simpler weighting scheme (WSUM1) was
better.
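The WSUM idea can be sketched as a weighted average in which original keys count more than expansion keys. The importance values used here (1.0 for originals, a flat 0.3 for WSUM1-style expansion keys, relation-specific 0.5/0.2 for WSUM2-style) are invented for the example, not the weights used in the study.

```python
# Illustrative sketch of the WSUM weighted average; the importance values
# are assumptions for demonstration, not the study's actual weights.

DEFAULT = 0.4

def wsum_score(doc_weights, keyed_importances, default=DEFAULT):
    """Weighted average over (key, importance) pairs."""
    total = sum(imp for _, imp in keyed_importances)
    return sum(imp * doc_weights.get(k, default)
               for k, imp in keyed_importances) / total

original = [("parliament", 1.0), ("voting", 1.0)]
wsum1 = original + [("ballot", 0.3), ("election", 0.3)]   # flat expansion weight
wsum2 = original + [("ballot", 0.5), ("election", 0.2)]   # weight by relation type

doc = {"parliament": 0.8, "voting": 0.7, "election": 0.6}
print(round(wsum_score(doc, wsum1), 3))  # → 0.692
print(round(wsum_score(doc, wsum2), 3))  # → 0.674
```

Note that expansion keys still enter the same flat average: the weighting damps their influence but, as the text observes, does not group keys into facets.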
The SYN queries also tended to retrieve documents that did not contain all search facets. The
SYN operator treats all its search keys as occurrences of one key by modifying the basic tf*idf
formula for weighting the keys (for the formula, see p. 22). As the only operator of a query, SYN
is inappropriate because the discriminating power of the tf*idf modification is lost when the
number of search keys grows.
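The loss of discriminating power can be illustrated with a pooled pseudo-key: member keys share one tf and one document frequency, so as the clause grows the idf component shrinks. The weighting below is a generic tf*idf stand-in, not InQuery's actual formula (which is given on p. 22 of the study), and the counts are invented.

```python
# Minimal sketch of why a SYN-style pooled clause loses discriminating
# power; generic tf*idf stand-in, not InQuery's formula.
import math

def syn_weight(tfs, dfs, n_docs):
    """Weight a pooled pseudo-key from member tf and df counts."""
    tf = sum(tfs)                         # occurrences of any member key
    df = sum(dfs)                         # crude pooled document frequency
    idf = math.log((n_docs + 1) / (df + 1))
    return (tf / (tf + 1)) * idf          # saturating tf times idf

small = syn_weight([3], [50], 10_000)                    # rare single key
large = syn_weight([3, 2, 4], [50, 900, 2000], 10_000)   # many frequent keys
print(small > large)  # → True: the expanded clause discriminates less
```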
Strong query structures
The strong query structures may clearly be divided into three groups on the basis of the effects of
QE. In the first group, including the BOOL queries, QE was clearly detrimental; in the second
group, with the SSYN1-2, ASYN, WSYN and XSUM queries, QE was clearly useful; in the third
group, including the OSUM queries, the effects of QE were mixed. This holds for both high and low
complexity queries. (See Tables 14-15, and Appendix 4, Fig. 11-24.)
All QE types increased the precision of the SSYN1-2, ASYN and WSYN1-2 queries with high
complexity. The positive change in the precision scores from unexpanded queries to the best
expanded versions varied from 5.3 to 11.2 percentage units for the average p@50, and from 7.1 to
16.8 percentage units for the average 10pr. The best overall precision scores were among these
queries, as well as the largest improvements. The difference between the precision of unexpanded
and best expanded combinations was higher for the 10pr scores than for the p@50 scores, which
implies that precision improved most at the high recall end. The overall best precision score –
both 10pr and p@50 – was yielded by WSYN2 with high complexity and full expansion, i.e., with
most evidence available (WSYN2/F2/Qf).
When SSYN and ASYN structures were combined with low complexity the performance of
these structures was somewhat different. QE increased precision, but the best expansion was not
always the largest. Especially with regard to the p@50 results, the narrower concept expansion
tended to give the best performance, and the associative concept expansion the worst performance
(compared to other expansion types). This may be explained by the different nature of the
concepts in major and minor facets. The concepts of major facets had more hierarchical relations,
which obviously were specific and matched the conceptual structures of relevant documents. The
associative concept expansion added more general search keys, which obviously had rather high
frequency in the collection. It seems that the associative relations for major facet concepts were
too general. However, in the 10pr results the best performance was achieved by full expansion.
The comparison of the 10pr scores of the low and high complexity levels of the strongly
structured SYN queries shows that low complexity queries tended to perform better than
high complexity queries up to the associative concept expansion (see Table 15). The p@50 scores
are better for low complexity queries with the synonym and narrower concept expansion, but not
without expansion. As explained above, the synonym and narrower concept expansion did not add
as many search keys to the minor facets as to the major facets. Thus, the minor facets with few
expressions imposed too strict conditions. In other words, if the weight of the SYN clause containing a
minor facet is very low, or, in the extreme case, the default, the query is better off without the minor
facet, because weights calculated as a product or an average will be higher when the worst
factor is left out. This also means that the search keys brought to the minor facets by associative
concept expansion were good enough to compensate for the weaknesses of the major facets with this
expansion type. The best performances were achieved by the full expansion, which implies that the
combination of the hierarchical relations of the major facet concepts and the associative relations
of the minor facet concepts was useful.1
1 If the thesaurus had given more associative concepts for the concepts of the major facets, or more narrower concepts
for the concepts of the minor facets, the results might have been different. We strongly believe that the different nature
of the concepts in major and minor facets is not a coincidence, but the validity of the thesaurus cannot be proved.
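The arithmetic behind the "better off without the minor facet" observation is easy to check: when a clause earns only the default weight, dropping it raises the score under both product and average combining. The clause weights below are invented; 0.4 is the default belief stated in the text.

```python
# Illustrative arithmetic: omitting a default-weight minor facet raises the
# combined score. Clause weights are invented examples.

DEFAULT = 0.4
major = [0.8, 0.7]        # SYN-clause weights of two well-matched major facets
minor = DEFAULT           # minor facet matched nothing: default weight only

def product(ws):
    out = 1.0
    for w in ws:
        out *= w
    return out

def average(ws):
    return sum(ws) / len(ws)

print(round(product(major), 3), round(product(major + [minor]), 3))  # → 0.56 0.224
print(round(average(major), 3), round(average(major + [minor]), 3))  # → 0.75 0.633
```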
[Figure omitted in this version: P-R curves (precision vs. recall, both axes 0-100%) for the combinations WSYN2/F2/Qf, ASYN/F2/Qf, SSYN1/F2/Qf, XSUM/F2/Qa, OSUM/F1/Qf and BOOL/F2/Q0.]
Figure 14. P-R curves of the best combinations of the strong structures.1
The SSYN1-2, ASYN and WSYN structures combine weights of search keys of a facet (SSYN1 of
a concept) in a SYN clause, in which all search keys are treated as instances of one key. An
average, product or weighted average is then calculated of the weights of SYN clauses. The facet
or concept structure formed into a SYN clause is the decisive component, the operation used to
combine the weights of SYN clauses does not seem so important. (See Fig. 14.)
The SSYN1 structure was a modification of the SSYN2 structure. Operators were the same but
the original concepts included in one facet were divided into separate SYN clauses. Thus, each
SYN clause contained the expressions representing or referring to one original search concept.
Figure 14 shows that there is only a slight difference between the best SSYN1 query and the best
overall query (WSYN2/F2) or the best ASYN query. The result is not surprising because the
number of concepts per facet was not high. Nevertheless, since the identification of concepts is easier
than the identification of facets, the result suggests that facets could be dispersed into separate
concept clauses.
Within the BOOL structure, queries with no expansion yielded the best performance; the
difference between low and high complexity was slight. All expansions decreased precision. This
can be explained by the interpretation given to the AND and OR operators. If a search key is not
present in a document it gets a default weight of 0.4; if the key is present it is given the tf*idf
weight. The weight for all keys within the OR operator is calculated as follows: 1 – (1 – p1)*(1 –
p2)* ... *(1 – pn), where pi is the weight of search key i. The more search keys the OR clause
includes, the higher the weight it gets, whether the search keys are present or not. The weights of the
OR clauses are combined as a product. This leads to a ranking in which the OR clause or clauses with
the fewest keys are decisive, i.e., in the shortest OR clauses the presence or absence of search keys
becomes decisive irrespective of the importance of the keys for the query.
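The OR and AND interpretations just described can be sketched directly from the text: an absent key gets the default belief 0.4, an OR clause combines member weights as 1 – (1–p1)(1–p2)…(1–pn), and the OR-clause weights are combined as a product. The example weights are invented.

```python
# OR/AND combining as described in the text; example weights are invented.

DEFAULT = 0.4

def or_weight(ps):
    out = 1.0
    for p in ps:
        out *= 1.0 - p
    return 1.0 - out

def and_weight(clause_weights):
    out = 1.0
    for w in clause_weights:
        out *= w
    return out

# A long, fully-expanded OR clause scores high on defaults alone, so after
# the AND product only the SHORT clauses still discriminate:
five_absent = or_weight([DEFAULT] * 5)     # no member key present at all
two_keys = or_weight([0.8, DEFAULT])       # one genuine match in a short clause
print(round(five_absent, 3))  # → 0.922
print(round(two_keys, 3))     # → 0.88
```

Five absent keys already outweigh a short clause with a strong real match, which is why expansion hurt the BOOL queries: it lengthened the OR clauses and handed the ranking to whichever clauses stayed short.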
The performance of the XSUM queries gained somewhat from QE. The largest difference
between the precision scores of the best unexpanded and the best expanded combination was about
5 percentage units. The p@50 scores were best for the XSUM queries with high complexity and
full expansion; the 10pr scores were best for the XSUM queries with high complexity and
associative concept expansion. In most cases, the performance of queries with high complexity
was better, but the differences between the high and low complexity queries were rather small.
1 Two strong structures, SSYN2 and WSYN1, are left out of Figure 14 to keep it readable. Their P-R curves overlap
with ASYN and WSYN2.
The p@50 and 10pr scores for the OSUM structure differ somewhat. According to the former,
the full expansion enhanced effectiveness at both low and high complexity levels; the latter shows
that full expansion enhanced precision only for the low complexity queries. Both scores
agree that the other expansion types were detrimental for OSUM queries with both high and low
complexity. The idea of the OSUM structure was to give, through structure, more weight to the
original search keys than to the expansion keys. The Boolean structure was chosen because it had
been shown to perform better than weakly structured queries (SUM / WSUM, Turtle 1990).
However, the weight of the Boolean clause in OSUM queries became lower than the weights of the
SYN clauses because of the interpretation of the AND and SYN operators in InQuery. Taking an
average (SUM) of the weights of all clauses then led to a ranking in which the expansion keys were,
in fact, decisive. If some original concept or facet had no or few expansion keys, its
importance was buried in the low total weight of a Boolean clause, or in the low weight of a SYN
clause (with only one or a few keys). The facet balance was then also lost. The decrease in
performance caused by synonym, narrower concept and associative concept expansion was markedly
greater for queries with high complexity than for queries with low complexity.
Weak vs. strong structures
Figure 15 shows the best and worst combinations of both weakly and strongly structured queries.
On the one hand, the precision of the best weakly structured query exceeded that of the worst
strongly structured query; on the other, the precision of the best strongly structured queries was
markedly better than that of the best weakly structured query. The facet or concept structure alone
did not guarantee high performance; the operation used to form a facet or a concept was also
important. The differences between the best unexpanded weakly and strongly structured queries
were rather small. The effectiveness of weakly structured queries was not enhanced by QE, but the
strong SYN structures gained notably from QE.
In Figures 30-33 (pp. 129-130) the P-R curves for all structures with low and high complexity,
with and without expansion, are given. Figures 30-31 illustrate that without expansion the
differences between structures are negligible – the only exception being the SYN structures, which
were much worse than the other structures. Figure 32 shows that with low complexity
and full expansion the structures form a continuum, but the curves of the three best performing
structures (SSYN1, SSYN2, ASYN) are clearly above the other curves. (NB. WSYN1-2 do not
have low complexity versions.) Figure 33 gives the P-R curves for all structures with high complexity
and full expansion. The curves form three distinct groups: the worst group with the SYN1-2 and
BOOL structures; the middle group with the SUM, WSUM1-2, OSUM and XSUM structures; the best
group with all the strong SYN structures, SSYN1-2, ASYN and WSYN1-2.
[Figure omitted in this version: P-R curves (precision vs. recall, both axes 0-100%) for the combinations WSYN2/F2/Qf, WSUM1/F2/Qn, BOOL/F2/Qn and SYN2/F2/Q0.]
Figure 15. P-R curves for the best and worst combinations of weak / strong query structure, query expansion,
complexity level.
8.3 Rank-based Comparison of Queries
All 110 test combinations of query structure, complexity level and expansion type were ranked for
each request on the basis of their DCV precision scores. That is, each request was a row in a table
containing 110 precision scores, and these scores were ranked from 1 to 110 (1 was given to the
worst precision and 110 to the best). Each combination formed a column, and an average rank over
requests was calculated for each column, i.e., combination. The average ranks are shown in Table
16. Average performance for ranking based on 110 combinations is 55.5.
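The ranking procedure just described can be sketched as follows: within each request the combinations' precision scores are ranked 1 (worst) to k (best), tied scores share the average rank, and each combination's ranks are averaged over the requests. With k = 110 the expected mean rank is (110+1)/2 = 55.5, matching the figure above. The toy data are invented.

```python
# Sketch of the rank-based comparison: per-request ranks (ties averaged),
# then a column-wise mean over requests. Toy data only.
from statistics import mean

def rank_row(scores):
    """Ascending 1-based ranks for one request, averaging over ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2        # average of the tied 1-based ranks
        for idx in order[i:j + 1]:
            ranks[idx] = shared
        i = j + 1
    return ranks

def average_ranks(table):
    """table: one row of precision scores per request."""
    rank_rows = [rank_row(row) for row in table]
    return [mean(col) for col in zip(*rank_rows)]

# Toy example: 3 combinations over 2 requests (the second request has a tie).
print(average_ranks([[10.0, 30.0, 20.0], [5.0, 5.0, 9.0]]))  # → [1.25, 2.25, 2.5]
```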
Query structures     |               QE types
                     |  Q0  |  Qs  |  Qn  |  Qa  |  Qf
---------------------+------+------+------+------+------
Weak:                |      |      |      |      |
SUM/F1               | 65.8 | 48.5 | 37.3 | 49.7 | 42.9
SUM/F2               | 64.5 | 54.0 | 47.6 | 48.3 | 49.3
SYN1/F1              | 23.3 | 25.0 | 25.3 | 22.6 | 27.4
SYN1/F2              | 20.2 | 22.2 | 27.9 | 27.2 | 34.1
SYN2/F1              | 26.3 | 31.5 | 23.6 | 22.0 | 19.6
SYN2/F2              | 16.6 | 23.1 | 20.6 | 21.6 | 25.5
WSUM1/F1             | 60.4 | 59.5 | 50.3 | 55.5 | 55.1
WSUM1/F2             | 56.5 | 62.5 | 57.4 | 64.7 | 61.1
WSUM2/F1             | 60.4 |   •  |   •  |   •  | 45.0
WSUM2/F2             | 56.5 |   •  |   •  |   •  | 53.0
Strong:              |      |      |      |      |
SSYN1/F1             | 60.5 | 76.5 | 77.0 | 68.6 | 75.7
SSYN1/F2             | 57.6 | 70.8 | 70.0 | 81.5 | 82.0
SSYN2/F1             | 59.1 | 78.4 | 80.9 | 71.5 | 79.0
SSYN2/F2             | 60.4 | 75.6 | 75.7 | 82.5 | 87.3
ASYN/F1              | 60.2 | 80.9 | 85.1 | 74.7 | 81.8
ASYN/F2              | 60.8 | 72.7 | 76.0 | 81.9 | 85.1
WSYN1/F2             | 64.8 | 83.7 | 84.7 | 79.5 | 86.5
WSYN2/F2             | 66.6 | 78.9 | 79.0 | 82.8 | 87.6
BOOL/F1              | 55.3 | 46.9 | 39.8 | 37.7 | 37.9
BOOL/F2              | 55.5 | 43.8 | 23.8 | 36.3 | 31.4
OSUM/F1              | 55.3 | 48.1 | 46.2 | 49.0 | 57.9
OSUM/F2              | 55.5 | 39.6 | 39.0 | 47.5 | 55.1
XSUM/F1              | 55.2 | 62.3 | 65.6 | 62.0 | 68.7
XSUM/F2              | 57.4 | 58.9 | 57.3 | 64.3 | 64.3
Table 16. Average ranks.
(NB. The best rank of each column is in italics, the best rank of each row is in bold face, ranks higher than the rank of
WSYN2/F2/Q0 - the best unexpanded - are shaded.)
The rank-based measurement gives a very similar overview to the measurement based on average
precision scores. However, the comparison of the low and high complexity combinations shows
that in many cases the better performance has passed from the high to the low complexity
counterpart (e.g. F2 -> F1 for SUM/Q0, WSUM1-2/Q0, SSYN1/Q0, OSUM/Qf and XSUM/Qf; cf.
Table 14). This exhibits how average ranks differ from average precision: the former does not take
into account the magnitude of change, just how many times one method performs better than the
other. Among the weak structures the ranks make QE look more promising, although the performance
of most combinations is lower than 55.5. For the strong structures Table 16 shows that the
differences in performance between the best combinations are slight. The best performance for an
unexpanded query and an expanded query could not be reached by the same query structure. In
addition, the tendency of strong SYN queries with low complexity to yield their best scores with
narrower concept expansion is apparent in Table 16.
8.4 Statistical Significance of Differences in Effectiveness
The Friedman two-way analysis of variance by ranks was based on precision scores calculated over
DCVs. The test used the ranks over which the averages of Table 16 were calculated. The Friedman
test between the 110 combinations over 30 requests showed that there were significant differences
between the combinations (Fc = 19.4, p<.005). The combinations were then compared pairwise to
find out between which pairs the significant differences lay. Because the number of comparisons
between all combinations was rather large, and not all comparisons were interesting, the results of
the pairwise comparisons are presented as follows: (1) the significant differences within each
query structure group are given (i.e., the precision scores of the combinations of one query
structure, two complexity levels and five expansion types were compared); (2) the significant
differences in performance between the best combinations of different query structure groups. The
significance level was the same as in the Friedman test, p<.005.
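The Friedman statistic can be sketched from rank sums: rank the k combinations within each of the n requests, then compare how unevenly the rank sums are distributed. Ties are ignored here for brevity, so the study's exact test procedure may differ; the statistic is referred to the chi-square distribution with k-1 degrees of freedom, and the data below are invented.

```python
# Stdlib-only sketch of the Friedman rank statistic (no tie correction).
# Toy data; compare the result to chi-square with k-1 df for significance.

def friedman_statistic(table):
    """table: n rows (requests) x k columns (combinations) of scores."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):   # rank 1 = worst
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

# Four toy "requests" over three "combinations"; the third wins every time:
scores = [
    [0.2, 0.3, 0.9],
    [0.1, 0.4, 0.8],
    [0.3, 0.2, 0.7],
    [0.2, 0.5, 0.6],
]
print(friedman_statistic(scores))  # → 6.5
```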
Query structure | Combinations of complexity levels and expansion types
----------------+------------------------------------------------------
SUM             | F1/Q0 > F1/Qs, F1/Qn, F2/Qn, F2/Qa, F1/Qf
                | F2/Q0 > F1/Qn, F2/Qn, F1/Qf
SYN1            | no significant differences
SYN2            | no significant differences
WSUM1           | no significant differences
WSUM2           | no significant differences
BOOL            | F1/Q0 > F2/Qn, F1/Qa, F2/Qa, F1/Qf, F2/Qf
                | F2/Q0, F1/Qs, F2/Qs > F2/Qn
SSYN1           | F2/Qa, F2/Qf > F1/Q0, F2/Q0
                | F1/Qs, F1/Qn, F1/Qf > F2/Q0
SSYN2           | F1/Qs, F1/Qn, F2/Qa, F1/Qf, F2/Qf > F1/Q0, F2/Q0
ASYN            | F1/Qs, F1/Qn, F2/Qa, F1/Qf, F2/Qf > F1/Q0, F2/Q0
WSYN1           | F2/Qs, F2/Qn, F2/Qf > F2/Q0
WSYN2           | F2/Qf > F2/Q0
OSUM            | F1/Qf > F2/Qs, F2/Qn
XSUM            | no significant differences
Table 17. Significant differences in performance within query structure groups (p<.005).
Within the weak query structures, significant differences in performance appeared only in the
SUM structured queries (Table 17). The precision of the unexpanded SUM queries was significantly
better than the precision of many expanded SUM queries, yet not all expansions performed
significantly worse (e.g. the full expansion with all facets did not). The significant differences cannot
be explained by the broadness of the combinations: the precision of F2/Qa was significantly worse
than the precision of F1/Q0, but the precision of a broader combination, F2/Qf, was not.
Within the strong structures more differences were significant. These differences confirm the
division of strong structures into three groups: 1. a structure for which QE was detrimental
(BOOL); 2. structures for which QE was useful (SSYN1-2, ASYN, WSYN1-2); 3. structures for
which QE had no effects or results were mixed (OSUM, XSUM).
Within the BOOL structure, the performance of many expanded queries was significantly
worse than the performance of the unexpanded queries. Within the SSYN2 and ASYN structures
the improvement in the performance of five expansion and complexity combinations over the
unexpanded queries was significant (F1/Qs, F1/Qn, F2/Qa, F1/Qf, F2/Qf). Within the SSYN1
structure, the significant differences were between the performance of the associative and full
expansions with high complexity (F2/Qa, F2/Qf) and the performance of the unexpanded queries
with low and high complexity (F1/Q0, F2/Q0); also the low complexity queries with the synonym,
narrower concept and full expansion (F1/Qs, F1/Qn, F1/Qf) were performing significantly better
than the unexpanded queries with high complexity (F2/Q0). Within the WSYN1 structure, the
improvement in the performance of all expansions, except associative concept expansion, was
significant compared to the performance of the unexpanded queries. Within WSYN2 structure, the
performance of the full expansion was significantly better than the performance of the unexpanded
queries. It is noteworthy that within the best strong structures (i.e., SSYN1-2, ASYN, WSYN1-2)
the performance of the queries with different expansions did not differ significantly from each
other.
Within the OSUM structure, the only significant differences were between the performance of
the full-expanded queries with low complexity (F1/Qf) and the synonym and narrower concept
expanded queries with high complexity (F2/Qs, F2/Qn). The performance of the unexpanded
queries was not significantly different from the performance of the expanded queries. Within
XSUM structured queries there were no significant differences.
The best combinations within the different query structures were the following:
1. SUM/F1/Q0
2. SYN1/F2/Qf
3. SYN2/F1/Qs
4. WSUM1/F2/Qa
5. WSUM2/F1/Q0
6. BOOL/F2/Q0
7. SSYN1/F2/Qf
8. SSYN2/F2/Qf
9. ASYN/F2/Qf
10. WSYN1/F2/Qf
11. WSYN2/F2/Qf
12. OSUM/F1/Qf
13. XSUM/F1/Qf
The significant differences between the best combinations of different structures (by ranks) were the
following (complexity and QE type indicators have been left out for readability; > denotes p <
.005):
SSYN1, SSYN2, ASYN, WSYN1, WSYN2 > WSUM1, WSUM2, BOOL, OSUM > SYN1, SYN2
SSYN2, ASYN, WSYN1, WSYN2 > SUM > SYN1, SYN2
SSYN2, WSYN1, WSYN2 > XSUM > SYN1, SYN2
If the performance of any of the best combinations is significantly better than the performance of
some other combination, this allows the interpretation that the better performing combination is
significantly better than the whole of the other structure group; yet it does not prove that the whole
structure group of the better combination differs from the other group.
Figure 16 illustrates the best performing combinations of each of the 13 structure groups.
The selection of the best combinations is based on ranks, but the curves are drawn on the basis of the
10pr scores. The overall picture is similar irrespective of the statistic used: the best structure /
complexity / expansion combinations form three groups, and the performance differences between
the groups – the best combination of each group compared with the group(s) below – are statistically
significant.
The performance of the two SYN structure groups is significantly worse than the performance of
all the other best combinations. The performance of the best combinations of all the strong SYN
structures, except SSYN1, is significantly better than the performance of all the weak structures
and the strong BOOL and OSUM structures. The performance of the best combination of SSYN1
differs significantly from the performance of the WSUM1-2, BOOL and OSUM combinations.
The performance of the best combination of SSYN2 is also better than the performance of the
XSUM structure combinations. These statistical tests confirm the effectiveness of the different
strong SYN structures.
[Figure omitted in this version: P-R curves (precision vs. recall, both axes 0-100%) for WSYN2/F2/Qf, SSYN2/F2/Qf, WSYN1/F2/Qf, ASYN/F2/Qf, SSYN1/F2/Qf, XSUM/F1/Qf, SUM/F1/Q0, WSUM1/F2/Qa, WSUM2/F1/Q0, OSUM/F1/Qf, BOOL/F2/Q0, SYN1/F2/Qf and SYN2/F1/Qs.]
Figure 16. P-R curves of the best combinations of different query structure groups.
To resolve research problems 7-10, the two complexity levels were pooled, separately for the
unexpanded and the expanded queries, and the weakly structured queries were compared to the strongly
structured queries. The following trends were observed in the significance testing:
• Low & high complexity, no expansion (see research problems 7-8):
When queries were unexpanded and the complexity was either low or high, there were no
significant differences in the performance of structure types, with one exception – the
performance of the SYN1-2 queries was significantly worse than the performance of all
other structure types, whether strong or weak.
• Low complexity, QE (see research problem 9):
The strongly structured SSYN2, ASYN and SSYN1/Qs-Qn queries exceeded the
performance of all weakly structured queries. The best combination of the XSUM structure
(XSUM/Qf) exceeded the expanded weakly structured queries - except WSUM1 queries - in
performance. The performance of the best OSUM and BOOL structured queries was
significantly better than the performance of the SYN1-2 queries, but the WSUM1/Qs, Qa and Qf
queries exceeded BOOL queries in performance.
• High complexity, QE (see research problem 10):
The performance of the strong SYN queries (SSYN1-2, ASYN, WSYN1-2) was
significantly better than the performance of the weakly structured queries (SUM, SYN1-2,
WSUM2) - except the performance of the expanded WSUM1 queries. The latter were
exceeded in performance only by the best combinations of the strongly structured SYN
queries. The performance of expanded XSUM queries was significantly better than the
performance of SYN1-2 queries. On the one hand, the performance of the expanded SYN1
queries was significantly worse than the performance of the best expanded OSUM and
BOOL queries (OSUM exceeded SYN2 as well) at high complexity level, on the other, the
best WSUM1 combinations were performing significantly better than the worst OSUM and
BOOL combinations.
8.5 Performance of Query Combinations by Requests
There was no combination of complexity level, expansion type and query structure, not even a
single structure that would have yielded the best performance for all requests. To demonstrate the
differences in performance between requests, the combinations for each request were ranked
according to the p@50 precision. The median precision score (over all combinations1) for each
request was taken as a baseline to which the best and worst combinations were compared. In
Figures 17-29 the average precision (p@50) histograms of the best combinations of each structure
type are given. These histograms measure the average precision2 of a combination against the
median average precision of all corresponding combinations on that request. These graphs should
give an overview of the performance of the query structure types by requests.
1 Median is calculated over all complexity levels (i.e., F1 and F2) and all expansion types (i.e., Q0–Qf).
2 The average precision was calculated over the 11 DCVs.
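The precision measures behind these comparisons can be sketched briefly: precision at a document cut-off value (DCV), and its average over 11 DCVs. The exact cut-off values used in the study are not restated in this section, so the list below is an assumption for illustration only.

```python
# Sketch of DCV precision and its 11-point average. The cut-off list is an
# assumed example, not necessarily the study's actual DCVs.

DCVS = [5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 100]   # assumed 11 cut-offs

def precision_at(ranked, relevant, cutoff):
    """Fraction of the top-'cutoff' documents that are relevant."""
    return sum(1 for d in ranked[:cutoff] if d in relevant) / cutoff

def dcv_average_precision(ranked, relevant, dcvs=DCVS):
    """Average precision over the document cut-off values."""
    return sum(precision_at(ranked, relevant, c) for c in dcvs) / len(dcvs)

ranked = list(range(100))      # document ids in retrieval order
relevant = set(range(10))      # ten known relevant ids, all ranked on top
print(round(dcv_average_precision(ranked, relevant), 3))  # → 0.408
```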
[Figures omitted in this version: thirteen histograms, one per combination, plotting precision relative to the median (y-axis, -60 to +60) against request number (x-axis, 0-30): Fig. 17 SUM/F2/Q0, Fig. 18 SYN1/F2/Qf, Fig. 19 SYN2/F1/Qs, Fig. 20 WSUM1/F2/Qs, Fig. 21 WSUM2/F2/Q0, Fig. 22 SSYN1/F2/Qf, Fig. 23 SSYN2/F2/Qf, Fig. 24 ASYN/F2/Qf, Fig. 25 WSYN1/F2/Qf, Fig. 26 WSYN2/F2/Qf, Fig. 27 BOOL/F2/Q0, Fig. 28 XSUM/F2/Qf, Fig. 29 OSUM/F2/Qf.]
Figures 17-29. Histograms comparing average p@50 scores to median performance by query structures and requests.
It can be seen that the average precisions of the WSYN1-2/F2/Qf, SSYN1-2/F2/Qf and
ASYN/F2/Qf combinations are almost always above the median precision, and in a few cases at
about the median level. The negative performance of the SYN1-2 structures is marked. The
performance of the other combinations is more varied. The performance of OSUM/F2/Qf differs
strongly from the median in both directions, and thus seems rather unpredictable.
The p@50 scores of the baseline and the best combinations for each request are given in Appendix
5, Table 1, together with the difference between the baseline and the best precision. The median
score also gives a hint of the ‘hardness’ of the request. For 26 requests (of 30) the best performing
query structure is strong, and for 17 requests it is some strong SYN structure1. For 29 requests the
best combination is expanded, and for 22 requests the best combination includes all facets (high
complexity). In five cases the best structure is weak (there is overlap, because in some cases more
than one combination yielded the same result). Only in one case was the best precision obtained
by an unexpanded weak structure. The most frequent best structure in Table 1 of Appendix 5 is
ASYN, with nine occurrences; the other structures are more dispersed.
1 The second best performance is yielded by some strong SYN structure in nine cases, i.e., strong SYN structures are the
best or second best performing structures for 26 requests.
The worst performing structures for each request are given in Appendix 5, Table 2. The
difference between the worst and the baseline performances is greater than that between the best and
the baseline performances. The worst query structure types are also more concentrated: BOOL is the
worst query structure for 15 requests, and SYN1-2 for 12 requests each. High and low complexity
queries occur equally often among the worst combinations.
Analysis of a deviant result
It is hard to identify characteristics of requests that would explain why weak structures performed
better than strong structures in some cases. These deviations cannot be grouped into any one of the
request categories, i.e., subject, person, organisation or geographically oriented requests. Nor
does the number of known relevant documents seem to explain the phenomenon. In order to
understand this variation better, we analysed the performance of the different combinations with
respect to request number 30. This request was selected because it divided the best combinations very
clearly: the average precision histograms show that for this request all strong combinations
performed at about the median level, whereas all weak SUM and WSUM combinations performed
better. The analysis is based on DCV 50 results.
Request number 30 is an organisation name request – The initiatives, interpellations, speeches,
and voting activity of the Greens in the Finnish Parliament. The actions of the party and the
individual representatives are of interest. Altogether three facets were identified: F1 = (green
party)1, F2 = (parliament, representative), and F3 = (initiatives, interpellations, speeches, voting).
The number of known relevant items is 13. It is a ‘hard’ request: the median p@50 score is 2.6.
The best precision score for the request, 37.4, was achieved by the SUM/F2/Q0 combination. In
general, the precision scores (p@50) of request number 30 reveal two tendencies: first, QE was
never useful; second, high complexity queries (with all three facets) were more effective than low
complexity queries (with facets 1 and 2).
Scanning the search keys present in the relevant documents shows that the Green party was
referred to as ‘Greens’ in 12 articles and as ‘Green union’ once. Six of the 12 first-mentioned
articles contained the word ‘party’ as well, but not within a three-word window of the word
‘green’. Thus, all combinations including the search phrase #1(Green party), or its synonym
#3(Green party), but not ‘green’ as a single key, failed. The document frequency of the
key #1(Green party) was 44, of the key #3(Green party) 77, and of the key ‘green’ 880. The
SUM/F2/Q0 query included the search keys ‘green’ and ‘party’ as single keys, the WSUM1/F2/Q0 query2
included the phrase #1(Green party); otherwise the queries were equivalent. The average precision
scores for these queries were 37.4 and 36.7 respectively. The latter precision did not drop more
because of the other search keys; the former was impaired by the search key ‘party’, which alone
could not successfully represent the first facet. Other search keys in the unexpanded queries were
‘parliament’ (with document frequency 1,118), ‘representative’ (1,529), ‘initiative’ (1,088),
1 The search concepts and search keys were taken from the thesaurus in the query formulation phase. In the test
thesaurus Greens were referred to as Green party, and MP as representative.
2 The unexpanded WSUM1 and WSUM2 queries are identical.
‘interpellation’ (35), ‘speech’ (572), and ‘voting’ (2,112)1. The second facet was represented in all
relevant documents by the key ‘parliament’ and in seven relevant articles by ‘representative’. The
third facet was represented in 10 relevant articles, in six of them by ‘voting’, in five articles by
‘interpellation’, in four by ‘initiative’, and once by ‘speech’2. It seems that the WSUM query retrieved almost as many relevant documents as the SUM query through the low-frequency words of the third facet, although it missed the first facet. The SUM and WSUM queries retrieved the same eight relevant articles among the top 50 documents, but in a different order. In all, the top 50 documents of these queries had 35 documents in common.
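The document frequencies quoted above largely explain the ranking behaviour through inverse document frequency. A rough sketch with a plain idf = log(N/df) formula (InQuery's actual belief estimator differs) and the collection size of 53,893 articles reported in the validity discussion:

```python
import math

N = 53_893  # number of articles in the test collection

# Document frequencies quoted in the text.
df = {
    '#1(Green party)': 44,
    '#3(Green party)': 77,
    'green': 880,
    'parliament': 1_118,
    'representative': 1_529,
    'initiative': 1_088,
    'interpellation': 35,
    'speech': 572,
    'voting': 2_112,
}

# Plain idf = log(N / df): the rarer the key, the higher its weight,
# which is why documents matching the rare phrase keys rank high.
idf = {key: math.log(N / n) for key, n in df.items()}
for key, weight in sorted(idf.items(), key=lambda kv: -kv[1]):
    print(f'{key:18s} {weight:.2f}')
```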
The precision scores of the unexpanded SSYN2/F2, ASYN/F2, WSYN1/F2 and WSYN2/F2 queries were 15.9%, 19.2%, 6.8% and 10.4% respectively3. Their performance was worse than that of the weakly structured queries because the phrase #1(Green party) was rather rare, and the documents containing it ranked high. The WSYN queries gave extra weight to the major facets (1 and 2) and were less effective because the evidence from Facet 3 was diminished. In the WSYN1 query the weight for Facet 3 was smaller than in the WSYN2 query; thus its precision was lower. Replacing the phrase #1(Green party) with the single key ‘green’ in the SSYN2 query increased precision from 15.9% to 25.8%.
QE was not useful in the case of this request because many of the initial search keys occurred in all relevant articles. Some of the expansion keys given by the thesaurus turned out to be harmful. Adding ‘election’ (4,994), a key representing an associative concept of voting, and ‘the house of representatives’ (4,887), a synonym of parliament, especially impaired the performance of the expanded queries. These two keys also tended to co-occur. The overlap in the document sets retrieved by the expanded, strongly structured SYN queries was almost complete, and all these queries retrieved the same relevant document in the DCV 50 set. QE was not useful because the expansion keys did not suit the context of the request. Thus, in this case the unexpected results traced back to a thesaurus failure.
8.6 Summary of the Results
We summarise our findings by addressing each of our research problems. In general, the effects of QE on retrieval performance depend on the structure of a query. To some extent the effectiveness of QE depends on the complexity of a query, but the effects of complexity are not as large as those of query structure. In addition, the nature of the search concepts seems to influence the co-effects of QE and complexity.
1. Within each weak query structure unexpanded or slightly (with synonyms)
expanded queries tended to yield the best performance when the complexity of
queries was low. The only significant differences in performance were within SUM
structured queries (F1/Q0 > F1/Qn, F1/Qf).
1 At this point translations into English turn out to be inaccurate. For example, the frequency and other meanings of the
word ‘speech’ are different from those of the original Finnish word ‘puheenvuoro’. The word ‘parliament’ is also
problematic because it has two translations in Finnish – a native equivalent, ‘eduskunta’, which was the original search
key, and a loan word ‘parlamentti’, which is here translated to ‘the house of representatives’. The former may be used
to refer to the Finnish parliament and to parliaments of some other countries, but never, for example, to the parliament
of Great Britain, which is referred to as ‘parlamentti’.
2 The sum of occurrences is 16 because words co-occurred in articles.
3 Here we will restrict our inquiry to these best strong structures. The SSYN1 is not interesting in this case because its
unexpanded version is equivalent to WSUM1.
2. The strongly structured queries did not behave as similarly in relation to each other as the weakly structured queries. Within the strong SYN structures with low complexity, all expanded queries gave significantly better precision scores than the unexpanded queries. Between different expansion types there were no significant differences in performance. The performance of the BOOL structured queries was not improved by QE. The unexpanded BOOL queries with low complexity yielded the best precision scores, which were also significantly better than the precision of some expanded BOOL queries (BOOL/F1/Q0 > BOOL/F1/Qa, Qf). The performance of OSUM and XSUM structured queries with low complexity was not markedly improved by QE. The best precision scores were yielded by the associative concept or full expansion.
3. An increase in the complexity of queries did not dramatically change the effects of QE on the performance of the weakly structured queries. Within SUM and WSUM2 query structures with high complexity, QE was not useful. The performance of SYN1-2 and WSUM1 queries at the high complexity level was slightly, but not significantly, improved by QE.
4. Within the strongly structured queries, moving from low complexity to high complexity did not change the effects of QE. The most apparent change was that at the high complexity level the strongly structured SYN queries yielded the best precision scores with the full expansion, while at the low complexity level the best expansion type was the narrower concept expansion. Statistically this difference is non-significant.
5. Comparing the performance of the best combinations of each weak query structure shows that the best SUM, WSUM1 and WSUM2 queries performed significantly better than SYN1-2 queries (SUM/F1/Q0, WSUM1/F2/Qa, WSUM2/F2/Q0 > SYN1-2/F1-2/Q0-f).
6. Comparing the performance of the best combinations of each strong query structure shows that the best strongly structured SYN queries performed significantly better than BOOL and OSUM queries (SSYN1-2/ASYN/WSYN1-2/F2/Qf > BOOL/OSUM/F1-2/Q0-f). In addition, the performance of the SSYN2/F2/Qf combination was significantly better than the performance of all XSUM structured queries. Other differences between the performance of the best combinations were non-significant.
7. The performance of all strongly structured queries was significantly better than the performance of the weakly structured SYN queries when the queries were unexpanded and complexity was low. Otherwise, there were no significant differences between unexpanded weakly and strongly structured queries at the low complexity level.
8. The performance of all strongly structured queries was significantly better than the performance of the weakly structured SYN queries when the queries were unexpanded and complexity was high. Otherwise, there were no significant differences between unexpanded weakly and strongly structured queries at the high complexity level.
9. The performance of the strongly structured SSYN2, ASYN and SSYN1/Qs-Qn queries exceeded the performance of all weakly structured queries when queries were expanded and complexity was low. The best combination of the XSUM structure (XSUM/Qf) exceeded the expanded weakly structured queries – except WSUM1 queries – in performance. The performance of the best OSUM and BOOL structured queries was significantly better than the performance of SYN1-2 queries, but the WSUM1/Qs,Qa,Qf queries exceeded BOOL queries in performance, given that queries were expanded and complexity was low.
10. At the high complexity level the performance of the expanded, strongly structured SYN queries (SSYN1-2, ASYN, WSYN1-2) was significantly better than the performance of the expanded, weakly structured queries (SUM, SYN1-2, WSUM2) – except for the expanded WSUM1 queries. The latter were exceeded in performance only by the best combinations of the strong SYN structured queries. The performance of the expanded XSUM queries was significantly better than the performance of the SYN1-2 queries. The performance of the best expanded OSUM and BOOL queries at the high complexity level was significantly better than the performance of the expanded SYN1 queries, and the OSUM queries exceeded the performance of the expanded SYN2 queries as well. However, the best WSUM1 combinations performed significantly better than the worst OSUM and BOOL combinations.
11. Reducing the facet structure of the strong SSYN2 queries to the concept-based structure of the SSYN1 queries did not affect retrieval performance substantially. When compared at the same complexity level, SSYN1 queries yielded lower precision scores than SSYN2 queries, but SSYN1/F2 queries performed better than SSYN2/F1 queries. The differences were non-significant.
12. Weighting expansion keys lower than original keys did not improve retrieval performance. The performance of the queries with this weighting scheme (WSUM1-2/F1-2/Qs-f) was at about the average performance level.
13. The weighting schemes for individual search keys tested in this study did not improve retrieval performance compared to the facet weighting schemes. The latter, weighting major facets higher than minor facets, improved retrieval performance.
14. Most of those query structure and complexity combinations that were improved by QE yielded the best performance with the full expansion. The narrower concept expansion also proved effective, especially with the strongly structured SYN queries with low complexity.
9 Discussion
The results obtained in this study may be viewed from two aspects: first, the results stress the
effects of query structures when the broadness of queries is varied; second, the results demonstrate
the effectiveness of concept-based query expansion, with a thesaurus as a source for QE. In other
words, QE was a way to vary broadness, but the results concerning the co-effects of query
structures and broadness are not limited to this kind of QE, or QE in general. The results indicate
the effectiveness of different query structures in probabilistic text retrieval when the number of
search keys per query varies from low to high.
The broadness of queries was lowest when queries were unexpanded. In this case, query
structures formed two groups: (1) the weakly structured SYN queries and (2) all other structures.
The former group performed much worse than the latter, but within the groups, there were no
significant differences in performance. The complexity of queries did not affect this result (see
Fig. 30-31). When queries were expanded and broadness was high, the grouping was either
dissolved or turned into three groups. If complexity was low and queries were fully expanded, the
performance of different structures was scattered. The strongly structured SYN queries yielded the
best performance and the weakly structured SYN queries the worst performance. The performance
of other structures was evenly scattered between these extremes. (See Fig. 32.) If complexity was
high and queries were fully expanded, three performance groups could be distinguished: (1) the
worst performing group (weakly structured SYN queries and strong BOOL queries); (2) the best
performing group (strongly structured SYN queries); (3) the mediocre group (all other structures).
(See Fig. 33.)
The effectiveness of thesaurus-based QE was best when synonyms, narrower and associative
concepts were added to queries, i.e. with the largest expansion. The structure of queries was based
on concepts or facets, with the keys representing a concept or facet combined by the SYN operator.
In brief, given a probabilistic text retrieval system, the findings indicate that when the number
of search keys in queries is low, the structure of queries is not decisive for performance. However,
when queries are expanded, the best effectiveness is yielded by facet structured queries, with a
proper facet operator. This is a new result because the co-effects of query structure and broadness
have not been studied in a partial match environment.1
1 Some of the results have been published in The proceedings of 21st ACM-SIGIR conference (Kekäläinen & Järvelin
1998).
[Plot: precision (y, 0–100) against recall (x, 0–100); curves for SUM, ASYN, SSYN1, SSYN2, WSUM1-2, BOOL & OSUM, XSUM, SYN1 and SYN2.]
Figure 30. P-R curves for all structures with low complexity and no expansion.
[Plot: precision (y, 0–100) against recall (x, 0–100); curves for SUM, WSYN1, WSYN2, ASYN, SSYN2, SSYN1, WSUM1-2, BOOL & OSUM, XSUM, SYN1 and SYN2.]
Figure 31. P-R curves for all structures with high complexity and no expansion.
[Plot: precision (y, 0–100) against recall (x, 0–100); curves for ASYN, SSYN2, SSYN1, XSUM, OSUM, WSUM1, WSUM2, SUM, BOOL, SYN1 and SYN2.]
Figure 32. P-R curves for all structures with low complexity and full expansion.
[Plot: precision (y, 0–100) against recall (x, 0–100); curves for WSYN2, ASYN, SSYN2, WSYN1, SSYN1, WSUM1, XSUM, WSUM2, SUM, OSUM, BOOL, SYN1 and SYN2.]
Figure 33. P-R curves for all structures with high complexity and full expansion.
Effects of complexity
The facets representing the different aspects of a request were divided into major and minor facets.
The former were essential for formulating a reasonable query, the latter represented additional
specifying aspects. The complexity of queries was varied by formulating queries either with major
facets only, or with all facets. Overall, the high complexity queries were more effective than the
low complexity queries, but the difference in performance was slight and statistically non-significant. Nevertheless, the different characteristics of major and minor facets could explain the
difference: major facets had more narrower concept relations and fewer associative concept
relations than minor facets. When QE was most effective, i.e., with strongly structured SYN
queries, the performance of low complexity queries with narrower concept expansion was better
than the performance of high complexity queries with the same expansion. The associative concept
expansion changed the order. The variation in complexity was not very great – on average, there
were 2.1 facets in low complexity queries and 3.7 facets in high complexity queries. A greater
variation could have intensified the difference, but it is doubtful whether the number of
meaningful facets could be much higher.
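The complexity manipulation can be sketched as query construction from facets: low complexity keeps the major facets only, high complexity keeps them all. The clause syntax below is an InQuery-flavoured illustration (not the exact test queries), using the facets of request number 30 from the preceding chapter:

```python
# Facets are written as #syn clauses inside a #sum clause, echoing the
# thesis's SUM/SYN structure names; this is an illustrative sketch only.

def facet_clause(keys):
    return '#syn(' + ' '.join(keys) + ')'

def build_query(facets, complexity='high'):
    # Low complexity: major facets only; high complexity: all facets.
    chosen = [f for f in facets if complexity == 'high' or f['major']]
    return '#sum(' + ' '.join(facet_clause(f['keys']) for f in chosen) + ')'

# Facets of request 30 as given in the text (F1 and F2 major, F3 minor).
facets = [
    {'major': True,  'keys': ['green', 'party']},
    {'major': True,  'keys': ['parliament', 'representative']},
    {'major': False, 'keys': ['initiative', 'interpellation', 'speech', 'voting']},
]

print(build_query(facets, 'low'))
print(build_query(facets, 'high'))
```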
Effectiveness of thesaurus-based query expansion
The effectiveness of QE was dependent on query structures. With all the weak query structures, and with the OSUM and BOOL structures, the effect of QE was negative or indifferent, whereas with strong
SYN structures QE improved performance significantly. The best precision scores were obtained
by adding all expansion keys from the thesaurus, i.e., by combining the synonym, narrower
concept and associative concept expansions. Differences in performance between synonym,
narrower concept and associative concept expansions within different structures were not great.
However, the major facets had more narrower concepts than associative concepts related to them, whereas the minor facets had more associative concept expansions than narrower concept expansions.
In addition, there was a clear tendency for the low complexity queries to yield better performance
by the narrower concept expansion than by the associative concept expansion, whereas the high
complexity queries gained most through full expansion. It seems that major facets are specific and
represent the focus of a request; the minor facets represent multiple aspects and are general in
nature. A useful QE strategy could be obtained by expanding major facets with narrower concepts
and minor facets with associative concepts, and then combining the expanded facets.
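The suggested strategy can be sketched as follows. The thesaurus stub below mixes one relation quoted earlier (‘election’ as an associative concept of ‘voting’) with an invented entry (‘committee’ as a narrower concept of ‘parliament’); the real test thesaurus is, of course, far larger:

```python
# Hypothetical thesaurus stub keyed by (key, relation type).
# ('voting', 'associative') -> 'election' is from the text;
# ('parliament', 'narrower') -> 'committee' is an invented example.
thesaurus = {
    ('voting', 'associative'): ['election'],
    ('parliament', 'narrower'): ['committee'],
}

def expand_facet(keys, major):
    # Major facets take narrower concepts, minor facets associative ones.
    relation = 'narrower' if major else 'associative'
    expanded = list(keys)
    for key in keys:
        expanded.extend(thesaurus.get((key, relation), []))
    return expanded

print(expand_facet(['parliament'], major=True))   # a major facet
print(expand_facet(['voting'], major=False))      # a minor facet
```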
In statistical thesaurus construction, statistically associated keys are identified, but the type of the relationship cannot be ascertained. Nevertheless, any keys associated with the original keys, regardless of their relationship type, could be considered for QE, since the largest expansion generally performed best. QE using statistical thesauri as a source for expansion keys has been
shown to improve retrieval performance (Jing & Croft 1994; Callan et al. 1995; Xu & Croft 1996).
We do not know whether a statistical thesaurus would yield results similar to those of our test thesaurus. Our findings suggest that it might, but the present study did not compare one. A possible problem might be that the statistical approach produces both
syntagmatic and paradigmatic associations, while the associations of a traditional thesaurus are, in
principle, paradigmatic. A statistical thesaurus might supply keys representing ‘wrong’ concepts in
a facet, if a facet-based query structure is assumed.
The unexpanded test queries were short. There were on average 2.9 search keys in unexpanded
low complexity queries, and 5.4 keys in unexpanded high complexity queries. Thus, these queries
were similar to the queries of non-professional searchers. For short queries QE is most useful, as
has been demonstrated in earlier studies (Voorhees 1994; Jing & Croft 1994). If initial queries are
short, the precision of top ranking documents might not be high, and QE based on relevance
feedback might not be effective. Thesaurus-based QE is not dependent on search results, and
might thus be more useful, though without testing this remains speculative.
Jing and Croft (1994) state that the evaluation of statistical thesauri could be done via QE to see
if retrieval performance is improved. We have evaluated an intellectually constructed test
thesaurus in this way, and it was capable of improving the performance given that the query
structure was suitable. A closer analysis of request number 30, on which the structure and QE
combinations behaved unexpectedly, revealed that not all expansion keys were useful; on the contrary, some of them turned out to impair performance. Thus, other thesauri might prove more or less useful in a similar test setting; the performance level is not absolute. The analysis also
suggests that QE is not useful for all requests, because in some cases all relevant documents may
be retrieved with original search keys. Nevertheless, the findings concerning the co-effects of QE
and query structure are not dependent on the thesaurus. We believe that the growth of the
thesaurus would not change the relative QE effectiveness with appropriate query structures,
although the number of expansion keys would then rise.
Effects of weighting schemes
In this study both search key weighting and facet weighting were tested. Two weighting schemes
gave higher weights to original search keys than to expansion keys, a method which has been
believed to improve performance because original search keys are claimed to be more effective
than expansion keys (Fox 1980; Wang et al. 1985; Voorhees 1994). Two weighting schemes gave
higher weights to major facets than to minor facets because the former were believed to be more
important for the request than the latter. The weighting schemes were mechanistic: (a) WSUM1 –
the weight of the original search keys was 2 and weight of the expanded keys was 1; (b) WSUM2
– the weight of the original search keys was 10 and the expansion keys were weighted according to
their type (synonyms were weighted by 9, narrower concept keys by 7, associative concept keys by
5); (c) WSYN1 – the weight of the major facets was 10 and of minor facets 3; (d) WSYN2 – the
weight of the major facets was 10 and of minor facets 7. Thus, weighting schemes were insensitive
to differences in quality of expansion keys.
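The four schemes can be written down directly as lookup tables; the scoring machinery itself belongs to the InQuery system and is not reproduced here:

```python
# Weighting schemes (a)-(d) as stated in the text: WSUM1-2 weight
# individual search keys by their type, WSYN1-2 weight whole facets
# by their major/minor status.

KEY_WEIGHTS = {
    'WSUM1': {'original': 2, 'synonym': 1, 'narrower': 1, 'associative': 1},
    'WSUM2': {'original': 10, 'synonym': 9, 'narrower': 7, 'associative': 5},
}

FACET_WEIGHTS = {
    'WSYN1': {'major': 10, 'minor': 3},
    'WSYN2': {'major': 10, 'minor': 7},
}

def key_weight(scheme, key_type):
    return KEY_WEIGHTS[scheme][key_type]

print(key_weight('WSUM2', 'narrower'))  # narrower concept key under WSUM2
```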
The performance of queries with weighted search keys was mediocre. QE with search key
weighting was not especially useful, nor did the effectiveness of original search keys get any
support. By contrast, weighting the major facets higher than the minor facets was a useful strategy.
The best average precision was achieved by the weighting scheme (d) and full expansion
(WSYN2/F2/Qf). The comparison of WSYN1 and WSYN2 structures demonstrates the difference
between major and minor facets: the WSYN1 queries with low complexity and narrower concept
expansion were performing better than WSYN2 queries with the same combination. The former
structure gave lower weights to minor facets than the latter structure. The associative concept
expansion changed the rank order of the structures.
In the first two weighting schemes (WSUM1-2) no facets or concepts were identified, but the
structuring was based on weighting the search keys by their types. This kind of structure does not
ensure that the different aspects of the query are present in retrieved documents. More elaborate weighting schemes should take into account the facet (concept) structure of a request and the
quality of different keys within a facet (concept). The quality of a key could be defined by the
searcher or as the distance of an expansion key from the original key in a semantic net. Although
this study did not indicate notable differences in effectiveness of search keys representing different
semantic relations, we did not test the weighting of keys within facets, which might be worth testing.
Co-effects of query broadness and structure
Variation in the effectiveness of unexpanded queries was slight irrespective of query structures. Exceptions to this general notion were the queries with the weak SYN structure, which performed notably worse than the queries with other structure types. The result is in contrast to earlier studies which provided evidence of better performance for Boolean structured queries over weakly structured, ‘natural language’ queries (Salton et al. 1983; Turtle 1990; Hull 1997). In those studies the number of search keys in queries is not reported, but QE is not mentioned. If the broadness of the queries in those studies was higher than the broadness of the unexpanded queries in this study, the contradiction is even harder to explain: we suggest that Boolean structured queries with high broadness cannot be effective, given the typical soft interpretations of the Boolean operators.
The recognition of phrases did not improve the effectiveness of unexpanded queries; on the contrary, the average precision scores of the unexpanded SUM queries – without phrases – were the best of all unexpanded queries. This result is corroborated by Turtle (1990), who reported that replacing single search keys with phrases (or ‘word pairs’) was not useful, but that a combination of single keys and phrases improved performance in Boolean structured queries, again without QE. This seems predictable: phrases in queries with few keys tend to impair performance because, without alternatives, they impose too strict a condition for matching. When phrases are added to queries alongside single keys, performance improves.
The performance of expanded queries varied with structure types, and variation was greatest
with the largest expansion, i.e., when the broadness of queries was highest. A group of concept- and facet-structured queries performed significantly better than any other combination. A common
feature in these structures was that search keys representing a facet or a concept were formed into
a synonym group. This means that the occurrences of these search keys were counted together and
treated as if they were instances of one search key. In other words, the SYN operator affects the
calculation of tf*idf weights. Intuitively, this seems to be a reasonable way to construct a concept
or a facet of search keys representing it. Thus, the weight calculation of the SYN operator seems to
be an appropriate way to interpret disjunctions in partial match retrieval. The best performing
structures combined synonym groups with different operators, but the choice of the operator at this
point had no notable effects on performance.
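The effect of the SYN operator on weight calculation can be illustrated numerically with a plain tf*idf formula (a simplification of InQuery's belief estimator; the per-key statistics are invented):

```python
import math

N = 53_893  # collection size

def tfidf(tf, df):
    # Plain tf * idf; InQuery's actual belief estimator differs.
    return tf * math.log(N / df)

# Invented statistics for two keys of one facet in one document.
keys = [{'tf': 1, 'df': 44},       # rare key
        {'tf': 2, 'df': 2_112}]    # common key

# Keys weighted separately, contributions summed (a SUM-like structure):
separate = sum(tfidf(k['tf'], k['df']) for k in keys)

# Keys pooled by the SYN operator: occurrences counted together, as if
# they were one key. Summing the df values is an upper bound, since
# documents containing both keys are counted twice.
pooled = tfidf(sum(k['tf'] for k in keys), sum(k['df'] for k in keys))

# Pooling raises tf but also df, so the rare key no longer dominates.
print(f'separate={separate:.2f} pooled={pooled:.2f}')
```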
We do not know of any other studies that have examined systematically the effects of query
structure and broadness. Recently, however, it has been suggested that retrieval performance could
be enhanced by ensuring that all search concepts are represented in top ranking documents (e.g.,
Buckley et al. 1998; Mitra et al. 1998; Hawking et al. 1997). The forming of concept groups and
concept-based scoring tested by Hawking, Thistlewaite and Bailey (1997), and Hawking,
Thistlewaite and Craswell (1997) is similar to our approach. The researchers report positive effects of these methods on retrieval performance, a result that confirms our findings.
Pirkola (1998) tested the effects of query structures in cross-lingual IR. The test environment
included a part of the TREC-collection with 514,825 documents, the InQuery system, English
requests, their Finnish equivalents and dictionary-based translation for Finnish queries into
English. The researcher compared the effectiveness of unstructured and structured translated
queries. The structure of structured queries was similar to the SSYN1 structure, and the structure
of unstructured queries to the SUM structure of this study. In Pirkola’s study each word of a
request (excluding stop-words) was taken as a concept and all its translations (representing the
concept) were collected into a SYN clause. Pirkola reports that structured queries were performing
notably better than the unstructured queries. This corroborates our findings, since translation
seems to have a QE effect, i.e., translations given to one search key include several alternatives.
These alternatives may be related to, or represent, different meanings of the key. The ‘wrong’
alternatives are eliminated by other search concepts, because not all combinations of search keys
exist in documents. These results also indicate that the growth of the collection does not impair the
co-effects of structure and broadness.
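Pirkola's structured queries can be sketched as follows: each request word contributes one SYN clause containing all its translations, while the unstructured baseline pools every translation into a flat query. The translation lists below are invented for illustration:

```python
# Hypothetical dictionary output: each Finnish word maps to several
# English translation alternatives.
translations = {
    'eduskunta': ['parliament', 'diet'],
    'edustaja': ['representative', 'delegate', 'agent'],
}

def structured_query(words):
    # One #syn clause per source word, combined with #sum (SSYN1-like).
    clauses = ('#syn(' + ' '.join(translations[w]) + ')' for w in words)
    return '#sum(' + ' '.join(clauses) + ')'

def unstructured_query(words):
    # All translations pooled into one flat clause (SUM-like).
    keys = [t for w in words for t in translations[w]]
    return '#sum(' + ' '.join(keys) + ')'

print(structured_query(['eduskunta', 'edustaja']))
print(unstructured_query(['eduskunta', 'edustaja']))
```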
Ballesteros and Croft (1998) tested disambiguation of the ‘wrong’ translations in cross-lingual
IR using a query structure based on SYN groups1. They too formed a SYN clause of all the translations given to an original search key. Ballesteros and Croft report that the retrieval performance of the strongly structured queries was slightly better than the performance of weakly structured queries2. The researchers state that the SYN operator is useful because it de-emphasises infrequent words that would otherwise get excessively high weights due to their high idf values.
Indeed, the SYN operator increases the df value and decreases the idf value, but it also increases
the tf value.
Reliability and validity of the results
Robertson and Beaulieu (1997) point out that experimentation should illuminate and help to
develop theories and models which, in turn, should guide the design of good systems. The authors
state that otherwise “every experimental result is unique and peculiar to the circumstances of the
experiment”.
This study was an experiment in a laboratory setting. The environment gave us an opportunity
to test the variables of interest – the broadness, complexity and structure of queries – in a
controlled way. We have shown that when the number of search keys is low, query structure does
not affect retrieval performance, but when queries are expanded – in other words, the number of
search keys is increased – query structure affects retrieval performance substantially. In addition,
we have tested concept-based QE using different semantic concept and word relations. This kind
of QE proved to be useful with the strong SYN structures. May these results be generalised? The
following factors of the test setting are critical for answering the question: the test requests, the
document collection, and the retrieval technique.
Test requests
The requests of this study represented subject searching, i.e., conscious topical information needs. Thus, the results are valid only for similar types of information needs – or requests representing
1 The query structure tested in the study by Ballesteros & Croft might be similar to SSYN1 or WSYN, but this is
uncertain, since the authors do not explain how the SYN clauses were combined.
2 We suppose that the query structure to which the use of the SYN operator was compared is similar to SUM or WSUM
structure in the present study, however, this can only be assumed from the report.
them. The number of requests was 30. There is no general knowledge about the sufficient number
of requests, though the problem has been discussed (e.g., Robertson 1981; 1990). However, as the
trends in our results were consistent, and also supported by statistical testing, we believe that a
larger number of requests would also corroborate the results.
Test collection
The document collection of the study consisted of 53,893 newspaper articles. Two main concerns
about the validity of the collection are its size and its documents. More interesting than the
absolute size of the collection is whether the results would hold if the collection were to grow. In
general, the growth of the collection could lead to a lower performance level, but relative performance between different query structure and broadness combinations would likely remain. The QE
results are dependent on the relation of the thesaurus and the collection, thus the thesaurus should
be updated when the collection – its vocabulary – grows.
The length and style of documents have effects on QE outcomes and their generalisation. The
vocabulary of newspaper articles is varied and general. In other types of documents with fewer
synonyms and other alternative expressions (e.g., law texts, scientific articles, etc.), QE might not
prove as useful. However, there are many documents written in general, informal style (e.g.,
newswires, some public records, newsgroup mails, advertisements, ...) which could be comparable
with newspaper articles. The test articles were fairly homogeneous in length. Collections where document length is more varied might not be comparable to the test collection with respect to the results, yet this is not known.
The language of the articles was Finnish. The features of a language that affect the
generalisation of the results are, at least, the construction of compound words and the inflection of
words. Finnish is an agglutinative language, i.e., it has many compound words spelled together
and a lot of inflection. However, keys were stored in the index of the test database in their basic
forms, and compound words were split. Thus, the results are not strictly bound to Finnish – in
principle, morphological analysis and compound word splitting would produce an environment
similar to that of any language using an alphabet. Further, the existence of semantic relations
between concepts and words is common to all natural languages.
Retrieval technique
Query structures, if understood as the calculation of ranking scores from individual weights of the
keys, guided by the query and determined by operators, are dependent on the matching technique
of the IR system. Operators are defined in the query language of the system. Most partial match
techniques use weighting algorithms based on tf*idf weighting; thus, their combination in scoring is somewhat comparable. Although the operators of this study were specific to the InQuery system,
the basic ideas of the tested structures are more general. Therefore, the results would be indicative
beyond this test setting.
Further study
The average performance of the strong SYN structures was significantly enhanced by QE, but
differences between requests were great. The analysis of the retrieved documents of one request
suggests that the variation may be explained by failures in the thesaurus. In other words, the
relations given by the thesaurus were not suitable for all requests. Another reason may be that the
vocabulary in the relevant documents of the request was much the same and thus no QE was
needed. A further microanalysis of the search results by requests is still needed to fully understand
the variation. The study provided slight evidence of differences between major and minor facets,
which implies that different kinds of concepts should be expanded differently, i.e., some concepts
have hierarchical relations, others have associative relations. We shall explore further the nature of
concepts in query formulation and QE.
The relevance used in this study was dichotomous.1 In a further study we shall test the effects
of different relevance levels. First, we shall limit the relevant articles to those judged fully relevant
(excluding fairly relevant articles); second, we shall accept fully relevant, fairly relevant and
marginally relevant articles as relevant. In these new test settings, we shall examine the precision
scores of different combinations in order to find out whether any combination tends to retrieve
articles at any particular relevance level.
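The planned re-evaluation amounts to recomputing precision with different cut-off points on the four-level relevance scale. A minimal sketch follows; the level coding, the judgement table and the ranked list are illustrative, not data from the test collection.

```python
# Relevance on a four-level scale: 0 = not relevant, 1 = marginally,
# 2 = fairly, 3 = fully relevant (coding assumed for illustration).
judgements = {"d1": 3, "d2": 1, "d3": 0, "d4": 2, "d5": 3}
ranked = ["d1", "d3", "d4", "d2", "d5"]   # top-5 result of some query

def precision_at(ranked, judgements, threshold):
    """Precision over the ranked list, accepting documents whose
    relevance level is at least `threshold`."""
    hits = sum(1 for d in ranked if judgements.get(d, 0) >= threshold)
    return hits / len(ranked)

print(precision_at(ranked, judgements, 3))  # strict: fully relevant only
print(precision_at(ranked, judgements, 2))  # dichotomous, as in this study
print(precision_at(ranked, judgements, 1))  # liberal: marginal also accepted
```

Comparing the precision of each query combination under the three thresholds would reveal whether a combination favours documents at a particular relevance level.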
In this study we did not compare QE with different expansion sources. However, a statistical thesaurus and relevance feedback are interesting alternatives, since the best precision scores were yielded by full expansion. We suspect that these sources will yield different expansion keys from those of the test thesaurus, but the overlap should be tested. We believe that for short queries relevance feedback would be more effective after QE.
Concept-based query formulation is convenient when a searcher is not a search professional, or is not familiar with the vocabulary of the database. The IR interface could support a
searcher by showing him a conceptual model from which he could select the search concepts.
Then search keys representing the concepts could be shown to the searcher for selection. Keys for
concepts could be collected by a statistical approach, intellectually, or they could be elicited from
existing vocabularies or thesauri. The structure for a query could be chosen on the basis of the
number of selected search keys, or some strong facet / concept structure could be used as a default,
unless the searcher selects a structure for his query. This is a sketch for an interface we would like
to implement in order to test interactive searching by real users. First, we want to know if there
are real information needs that would lead to information retrieval situations where concept-based
queries would be useful. Second, we want to test how concept-based query formulation would
work in interactive searching, and what level of performance would be obtainable.
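At its core, the sketched interface reduces to mapping the searcher's selected concepts and their keys into a structured query string. A toy formulation step follows; the concepts, keys and the default choice of a strong concept structure are invented for illustration, and the operators are written in an InQuery-like syntax.

```python
# Hypothetical concept-to-keys mapping, e.g. elicited from a thesaurus
# after the searcher has selected concepts and accepted keys for them.
concepts = {
    "pollution": ["pollution", "contamination", "emission"],
    "waterways": ["waterway", "river", "lake"],
}

def formulate(concepts, strong=True):
    """Build a query string. With the strong structure each concept's keys
    are wrapped in #syn, i.e., treated as instances of one key; otherwise
    all keys go into one flat #sum."""
    if strong:
        facets = ["#syn(%s)" % " ".join(keys) for keys in concepts.values()]
        return "#and(%s)" % " ".join(facets)
    keys = [k for keys in concepts.values() for k in keys]
    return "#sum(%s)" % " ".join(keys)

print(formulate(concepts, strong=True))
print(formulate(concepts, strong=False))
```

Defaulting to the strong structure agrees with the finding that concept-based structures tolerate large numbers of expansion keys.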
1 NB. Relevance was judged on a four-level scale. The two highest levels were accepted as relevant for the calculation of precision and recall.
10 Conclusion
In this study we have tested the effects of query structure, complexity and expansion on retrieval
performance in partial match retrieval. We have demonstrated that when the broadness of queries
is low, the query structure does not have significant effects on retrieval performance, but when
queries are expanded – in other words, when the number of search keys is increased – the query
structure affects retrieval performance substantially. The best effectiveness for broad queries is
yielded by facet- or concept-based query structures that treat all keys representing a facet or a
concept as instances of one key.
QE based on concepts and semantic relations proved useful, given an appropriate query
structure. The results show that the expansion including all semantic relations performed best. In
general, differences in performance between synonym, narrower concept and associative concept
expansions were not great within each structure. Weighting original keys higher than expansion
keys did not improve performance, but weighting major facets higher than minor facets was
useful.
A formal description for the thesaurus model and query expansion operations was given. The
formalisation may be used as a basis for the development of thesaurus software. In addition, a
classification for the structural features of query structures was constructed. The classification may
clarify the different types of queries and principles of query construction. This would be useful
because the results of different studies cannot be compared without knowledge of the features of
test queries. Further, we have shown the importance of the analysis of results at the request level to
understand variation in performance.
References
Aho, A. V. & Ullman, J. D. (1992). Foundations of computer science. New York: Computer Science Press.
Alkula, R. & Honkela, T. (1992). Tekstin tallennus- ja hakumenetelmien kehittäminen suomen kielen tulkintaohjelmien
avulla. FULLTEXT-projektin loppuraportti. [Linguistic processing and retrieval techniques in Finnish fulltext
databases. Final report of the FULLTEXT project.] VTT julkaisuja 765. Espoo: Valtion teknillinen tutkimuskeskus.
[In Finnish.]
Allan, J., Ballesteros, L., Callan, J. P., Croft, W. B. & Lu, Z. (1995). Recent experiments with INQUERY [online]. [Cited 5.2.1998.] Available from: <URL: http://trec.nist.gov/pubs/trec4/papers/umass.ps>.
Allan, J., Callan, J., Croft, B., Ballesteros, L., Broglio, J., Xu, J. & Shu, H. (1997a). INQUERY at TREC 5. In E. M.
Voorhees & D. K. Harman (Eds.), Information technology: The Fifth Text Retrieval Conference (TREC-5).
Gaithersburg, MD: National Institute of Standards and Technology, 119–132.
Allan, J., Callan, J., Croft, B., Ballesteros, L., Byrd, D., Swan, R. & Xu, J. (1997b). INQUERY does battle with TREC-6 [online, to appear in Proceedings of TREC-6]. [Cited 26.2.1998.] Available from: <URL: http://trec.nist.gov/pubs/trec6/papers/umass-trec97.ps>.
Ballesteros, L. & Croft, W. B. (1998). Resolving ambiguity for cross-language retrieval. In W. B. Croft, A. Moffat, C. J.
van Rijsbergen, R. Wilkinson & J. Zobel (Eds.), Proceedings of the 21st Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval. New York, NY: ACM, 64–71.
Bates, M. J. (1986). Subject access in online catalogs: A design model. Journal of the American Society for Information
Science, 37(6), 357–376.
Bates, M. J. (1990). Where should the person stop and the information search interface start? Information Processing &
Management, 26(5), 575–591.
Bates, M. J., Wilde, D. N. & Siegfried, S. (1993). An analysis of search terminology used by humanities scholars: The
Getty Online Searching Project report number 1. Library Quarterly, 63(1), 1–39.
Beaulieu, M. M., Gatford, M., Huang, X., Robertson, S. E., Walker, S. & Williams, P. (1997). Okapi at TREC–5. In E.
M. Voorhees & D. K. Harman (Eds.), Information technology: The Fifth Text Retrieval Conference (TREC-5).
Gaithersburg, MD: National Institute of Standards and Technology, 143–166.
Belkin, N. J. (1981). Ineffable concepts in information retrieval. In K. Sparck Jones (Ed.), Information retrieval
experiment. London: Butterworths, 44–57.
Belkin, N., Cool, C., Croft, W. B. & Callan, J. P. (1993). The effect of multiple query representations on information retrieval performance. In Korfhage, R., Rasmussen, E. M. & Willett, P. (Eds.), Proceedings of the 16th International
Conference on Research and Development in Information Retrieval. New York, NY: ACM, 339–346.
Belkin, N. J. & Croft, W. B. (1987). Retrieval techniques. In M. E. Williams (Ed.), Annual review of information science
and technology, vol. 22. New York, NY: Elsevier, 109–145.
Belkin, N. J., Kantor, P., Cool, C. & Quatrain, R. (1994). Combining evidence for information retrieval. In D. K.
Harman (Ed.), The Second Text REtrieval Conference (TREC–2). Gaithersburg, MD: National Institute of Standards
and Technology, 35–44.
Belkin, N., Kantor, P., Fox, E. A. & Shaw, J. A. (1995). Combining evidence of multiple query representations for
information retrieval. Information Processing & Management, 31(3), 431–448.
Brown, E. W. (1995). Fast evaluation of structured queries for information retrieval. In E. A. Fox, P. Ingwersen & R.
Fidel (Eds.), Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval. New York, NY: ACM, 30–38.
Buckley, C., Mitra, M., Walz, J. & Cardie, C. (1998). Using clustering and superconcepts within SMART: TREC 6 [online, to appear in Proceedings of TREC-6]. [Cited 7.7.1998.] Available from: <URL: http://trec.nist.gov/pubs/trec6/papers/cornell.ps>.
Buckley, C., Singhal, A., Mitra, M. & (Salton, G.) (1995). New retrieval approaches using SMART: TREC 4 [online]. [Cited 5.2.1998.] Available from: <URL: http://trec.nist.gov/pubs/trec4/papers/Cornell-trec4.ps>.
Burgin, R. (1992). Variations in relevance judgements and the evaluation of retrieval performance. Information
Processing & Management, 28(5), 619–627.
Callan, J. P., Croft, W. B. & Harding, S. M. (1992). The INQUERY retrieval system. In Proceedings of the Third
International Conference on Databases and Expert Systems Applications. Berlin: Springer Verlag, 78–84.
Callan, J. P., Croft, W. B. & Broglio, J. (1995). TREC and TIPSTER experiments with INQUERY. Information
Processing & Management, 31(3), 327–343.
Conover, W. J. (1980). Practical nonparametric statistics (2nd ed.). New York: John Wiley & Sons.
Cooper, W.S. (1991). Some inconsistencies and misnomers in probabilistic information retrieval. In A. Bookstein, Y.
Chiaramella, G. Salton & V.V. Raghavan (Eds.), Proceedings of the 14th Annual International ACM–SIGIR
Conference on Research and Development in Information Retrieval. New York, NY: ACM, 57–61.
Cooper, W. S. (1994). The formalism of probability theory in IR: A foundation or an encumbrance? In W. Bruce Croft &
C. J. van Rijsbergen (Eds.), Proceedings of the 17th Annual International ACM–SIGIR Conference on Research and
Development in Information Retrieval. New York, NY: ACM, 242–247.
Croft, W. B. (1986). Boolean queries and term dependencies in probabilistic retrieval models. Journal of the American
Society for Information Science, 37(2), 71–77.
Croft, W. B. & Harper, D. J. (1979). Using probabilistic models of document retrieval without relevance information.
Journal of Documentation, 35(4), 285–295.
Croft, W. B., Turtle, H. R. & Lewis, D. (1991). The use of phrases and structured queries in information retrieval. In A.
Bookstein, Y. Chiaramella, G. Salton & V.V. Raghavan (Eds.), Proceedings of the 14th Annual International ACM–
SIGIR Conference on Research and Development in Information Retrieval. New York, NY: ACM, 32–45.
Cruse, D. A. (1986). Lexical semantics. Cambridge: Cambridge University Press.
Dahlberg, I. (1992). Knowledge organization and terminology: Philosophical and linguistic bases. International
Classification, 19(2), 65–71.
Efthimiadis, E. N. (1992). Interactive query expansion and relevance feedback for document retrieval systems. Ph.D.
dissertation. Department of Information Science, The City University, London.
Efthimiadis, E. N. (1996). Query expansion. In M. E. Williams (Ed.), Annual Review of Information Science and
Technology, vol. 31. Medford, NJ: Information Today, 121–187.
Efthimiadis, E. N. & Biron, P. V. (1994). UCLA-Okapi at TREC 2: Query expansion experiments. In D. K. Harman
(Ed.), The Second Text REtrieval Conference (TREC–2). Gaithersburg, MD: National Institute of Standards and
Technology, 279–289.
Ekmekcioglu, F. C., Robertson, A. M. & Willett, P. (1992). Effectiveness of query expansion in ranked-output document retrieval systems. Journal of Information Science, 18(2), 139–147.
Ellis, D. (1996). The dilemma of measurement in information retrieval research. Journal of the American Society for
Information Science, 47(1), 23–36.
Fidel, R. (1991a). Searchers’ selection of search keys: I. The selection routine. Journal of the American Society for
Information Science, 42(7), 490–500.
Fidel, R. (1991b). Searchers’ selection of search keys: II. Controlled vocabulary or free-text searching. Journal of the
American Society for Information Science, 42(7), 501–514.
Fidel, R. (1991c). Searchers’ selection of search keys: III. Searching styles. Journal of the American Society for
Information Science, 42(7), 515–527.
Fidel, R. & Efthimiadis, E. N. (1995). Terminological knowledge structure for intermediary expert systems. Information
Processing & Management, 31(1), 15–27.
Fox, E. A. (1980). Lexical relations: Enhancing effectiveness of information retrieval systems. ACM – SIGIR Forum, 15(3), 5–36.
Fox, E., Betrabet, S., Koushik, M. & Lee, W. (1992). Extended Boolean models. In W.B. Frakes & R. Baeza-Yates
(Eds.), Information Retrieval: Data Structures & Algorithms. Englewood Cliffs, NJ: Prentice Hall, 393–418.
Fox, E. A., Koushik, M. P., Shaw, J., Modlin, R. & Rao, D. (1993). Combining evidence from multiple searches. In D.
K. Harman (Ed.), The First Text REtrieval Conference (TREC–1). Gaithersburg, MD: National Institute of Standards
and Technology, 319–328.
Fox, E. A. & Shaw, J. A. (1994). Combination of multiple searches. In D. K. Harman (Ed.), The Second Text REtrieval
Conference (TREC–2). Gaithersburg, MD: National Institute of Standards and Technology, 243–249.
Froehlich, T. J. (1994). Relevance reconsidered – towards an agenda for the 21st century: Introduction to special topic
issue on relevance research. Journal of the American Society for Information Science, 45(3), 124–134.
Fugmann, R. (1985). The five axiom theory of indexing and information supply. Journal of the American Society for
Information Science, 36(2), 116–129.
Greiff, W. R., Croft, W. B. & Turtle, H. (1997). Computationally tractable probabilistic modelling of Boolean operators.
In N. J. Belkin, A. D. Narasimhalu & P. Willett (Eds.), Proceedings of the 20th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval. New York, NY: ACM, 119–127.
Guarino, N. (1995). Formal ontology, conceptual analysis and knowledge representation. International Journal of Human-Computer Studies, 43(5/6), 625–640.
Guarino, N. (1997). Semantic matching: Formal ontological distinctions for information organization, extraction, and
integration. In M. T. Pazienza (Ed.), Information extraction: A multidisciplinary approach to an emerging
information technology. Lecture notes in computer science, vol. 1299. Berlin: Springer Verlag, 139–170.
Guarino, N., Masolo, C. & Vetere, G. (1998). OntoSeek: Using large linguistic ontologies for gathering information
resources from the Web. LADSEB-CNR Technical Report 01/98.
Haarala, R. (1981). Sanastotyön opas [A guide to lexicography]. Kotimaisten kielten tutkimuskeskuksen julkaisuja 16.
Helsinki: Kotimaisten kielten tutkimuskeskus. [In Finnish]
Hancock-Beaulieu, M. (1992). Query expansion: Advances in research in online catalogues. Journal of Information
Science, 18(2), 99–103.
Harman, D. K. (1992). Relevance feedback revisited. In N. Belkin, P. Ingwersen & A. Mark Pejtersen (Eds.),
Proceedings of the 15th Annual International ACM–SIGIR Conference on Research and Development in Information
Retrieval. New York, NY: ACM, 1–10.
Harman, D. K. (1995). Overview of the fourth text retrieval conference (TREC-4) [online]. [Cited 5.2.1998] Available from: <URL: http://trec.nist.gov/pubs/trec4/papers/overview.ps>.
Harter, S. P. (1986). Online information retrieval: concepts, principles, and techniques. Orlando: Academic Press.
Harter, S. P. (1990). Search term combination and retrieval overlap: A proposed methodology and case study. Journal of
the American Society for Information Science, 41(2), 132–146.
Harter, S. P. (1992). Psychological relevance and information science. Journal of the American Society for Information
Science, 43(9), 602–615.
Harter, S. P. & Peters, A. R. (1985). Heuristics for online information retrieval: A typology and preliminary listing. Online Review, 9(5), 407–424.
Hawking, D. & Thistlewaite, P. (1995). Proximity operators – so near and yet so far [online]. [Cited 11.9.1996.] Available from: <URL: http://trec.nist.gov/pubs/trec4/papers/anu.ps>.
Hawking, D., Thistlewaite, P. & Bailey, P. (1997). ANU/ACSys TREC-5 experiments. In E. M. Voorhees & D. K.
Harman (Eds.), Information technology: The Fifth Text Retrieval Conference (TREC-5). Gaithersburg, MD: National
Institute of Standards and Technology, 359–375.
Hawking, D., Thistlewaite, P. & Craswell, P. (1997). ANU/ACSys TREC-6 experiments [online, to appear in
Proceedings of TREC-6]. [Cited 26.2.1998.] Available from: <URL: http://trec.nist.gov/pubs/trec6/papers/anu.ps>.
Hersh, W. R. (1996). Information retrieval: A health care perspective. New York: Springer.
Hull, D. (1993). Using statistical testing in the evaluation of retrieval experiments. In Korfhage, R., Rasmussen, E. M. & Willett, P. (Eds.), Proceedings of the 16th International Conference on Research and Development in Information Retrieval. New York, NY: ACM, 329–338.
Hull, D. A. (1996). Stemming algorithms: A case study for detailed evaluation. Journal of the American Society for Information Science, 47(1), 70–84.
Hull, D. A. (1997). Using structured queries for disambiguation in cross-language information retrieval. In AAAI Spring
Symposium on Cross-Language Text and Speech Retrieval Electronic Working Notes [online]. [Cited 13.8.1997.]
Available from: <URL: http://www.clis.umd.edu/dlrg/filter/sss/papers/hull3.ps>
Hull, D. A., Grefenstette, G., Schulze, B., Gaussier, E., Schütze, H. & Pedersen, J. (1997). Xerox TREC-5 site report:
Routing, filtering, NLP, and Spanish tracks. In E. M. Voorhees & D. K. Harman (Eds.), Information technology: The
Fifth Text Retrieval Conference (TREC-5). Gaithersburg, MD: National Institute of Standards and Technology, 167–
180.
Iivonen, M. (1995). Hakulausekkeiden muotoilun yhdenmukaisuus onlineviitehaussa [Consistency in the formulation of
query statements in online bibliographic retrieval]. Ph.D. dissertation. Acta Universitatis Tamperensis, ser. A, vol.
443. Tampere: University of Tampere. [English summary]
Ikonen, P. (1994). Näkökohtia symboliprosessista [Viewpoints on the process of symbolisation]. In P. Ikonen & E.
Rechardt, Thanatos, häpeä ja muita tutkielmia [Thanatos, shame and other studies]. Helsinki: Nuorisopsykoterapiasäätiö, 203–211. [In Finnish]
Ingwersen, P. (1992). Information retrieval interaction. London: Taylor Graham.
Ingwersen, P. & Willett, P. (1995). An introduction to algorithmic and cognitive approaches for information retrieval.
Libri, 45, 160–177.
INQUERY [s.a.]. InQuery 3.1 documentation.
ISO (1986). ISO International Standard 2788. Documentation - Guidelines for the establishment and development of
monolingual thesauri. International Organization for Standardization.
Jackson, L. (1983). Searching full-text databases. In Proceedings of the 7th International Online Information Meeting.
Oxford: Learned Information, 419–426.
Jacob, E.K. (1991). Classification and categorization: Drawing the line. In B. H. Kwasnik & R. Fidel (Eds.), Advances in
classification research. Vol. 2. Proceedings of the 2nd ASIS SIG/CR classification research workshop. Medford, NJ:
Learned Information, 67–83.
Jing, Y. & Croft, W. B. (1994). An association thesaurus for information retrieval [online]. [Cited 5.8.1997.] Available from: <URL: http://ciir.cs.umass.edu/info/ciirbiblo.html#1>.
Jones, S., Gatford, M., Robertson, S., Hancock-Beaulieu, M. & Secker, J. (1995). Interactive thesaurus navigation:
Intelligence rules ok? Journal of the American Society for Information Science, 46(1), 52–59.
Järvelin, K. (1993). Merkkijonot, sanat, termit ja käsitteet informaation haussa [Strings, words, terms and concepts in information retrieval]. Kirjastotiede ja informatiikka, 12(4), 119–128. [In Finnish.]
Järvelin, K. (1995). Tekstitiedonhaku tietokannoista: johdatus periaatteisiin ja menetelmiin [Textual information
retrieval in databases: Principles and methods]. Espoo: Suomen ATK-kustannus. [In Finnish.]
Järvelin, K. (1997). Access to information: Information retrieval. Notes of the lecture given in Copenhagen in November
1997.
Järvelin, K., Kristensen, J., Niemi, T., Sormunen, E. & Keskustalo, H. (1996a). A deductive data model for query
expansion. In H.-P. Frei, D. Harman, P. Schäuble, and R. Wilkinson (Eds.), Proceedings of the 19th Annual
International ACM–SIGIR Conference on Research and Development in Information Retrieval. New York, NY:
ACM, 235–249.
Järvelin, K., Kristensen, J., Niemi, T., Sormunen, E. & Keskustalo, H. (1996b). A deductive data model for thesaurus
navigation and query expansion. Finnish Information Studies 5. Tampere: Informaatiotutkimuksen laitos, Tampereen
yliopisto.
Järvelin, K. & Niemi, T. (1993). An entity-based approach to query processing in relational databases. Part I: Entity type
representation. Data & Knowledge Engineering, 10, 117–150.
Kantor, P. (1994). Information retrieval techniques. In M.E. Williams (Ed.), Annual Review of Information Science and
Technology, vol. 29, 53–90.
Karlsson, F. (1980). Johdatusta yleiseen kielitieteeseen [Introduction to general linguistics]. Helsinki: Gaudeamus. [In
Finnish.]
Karlsson, F. (1983). Finnish grammar. Translated by A. Chesterman. Porvoo: WSOY.
Karlsson, F. (1994). Yleinen kielitiede [General linguistics]. Helsinki: Yliopistopaino. [In Finnish.]
Keen, E. M. (1991). The use of term position devices in ranked output experiments. Journal of Documentation, 47(1), 1–
22.
Keen, E. M. (1992). Presenting results of experimental retrieval comparisons. Information Processing & Management, 28(4), 491–501.
Kekäläinen, J. & Järvelin, K. (1998). The impact of query structure and query expansion on retrieval performance. In W.
B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson & J. Zobel (Eds.), Proceedings of the 21st Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM,
130–137.
Kristensen, J.1 (1992). Vapaasanahakujen laajentaminen hakutesauruksen avulla haettaessa indeksoimattomasta
tekstitietokannasta [Expanding queries for text searching with a searching thesaurus]. Kirjastotieteen ja informatiikan
lisensiaatintutkimus. Tampereen yliopisto. [In Finnish.]
Kristensen, J. (1993). Expanding end-users’ query statements for free text searching with a search-aid thesaurus.
Information Processing & Management, 29(6), 733–744.
Kristensen, J. (1995). Aiherelevanssi ja käyttäjärelevanssi tulkinnan näkökulmasta. [Topicality and user-relevance from
the point of the view of interpretation.] Kirjastotiede ja informatiikka 14(3), 95–99. [In Finnish.]
Kristensen, J. (1996). Concept-based query expansion in a probabilistic IR system. In P. Ingwersen & N. O. Pors (Eds.),
Proceedings of the Second International Conference on Conceptions of Library and Information Science:
Integration in Perspective. Copenhagen: The Royal School of Librarianship, 281–291.
Kristensen, J. & Järvelin, K. (1990). The effectiveness of a searching thesaurus in free-text searching of a full-text
database. International Classification, 17(2), 77–84.
Kwok, K. L. & Grunfeld, L. (1995). TREC-4 Ad-hoc, routing retrieval and filtering experiments using PIRCS [online]. [Cited 5.2.1998.] Available from: <URL: http://trec.nist.gov/pubs/trec4/papers/queenst4.ps>.
Lancaster, F. W. (1972). Vocabulary control for information retrieval. Washington, DC: Information Resources Press.
Lancaster, F. W. (1986). Vocabulary control for information retrieval (2nd ed.). Washington, DC: Information
Resources Press.
Lancaster, F. W. & Warner, Amy J. (1993). Information retrieval today. Arlington, VA: Information Resources Press.
Lee, J. H. (1994). Properties of extended Boolean models in information retrieval. In W. Bruce Croft & C. J. van
Rijsbergen (Eds.), Proceedings of the 17th Annual International ACM–SIGIR Conference on Research and
Development in Information Retrieval. New York, NY: ACM, 182–190.
Lee, J. H. (1995). Combining multiple evidence from different properties of weighting schemes. In E. A. Fox, P.
Ingwersen and R. Fidel (Eds.), Proceedings of the 18th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval. New York, NY: ACM, 180–188.
Leech, G. (1981). Semantics: The study of meaning (2nd ed.). Harmondsworth: Penguin Books.
Lu, X. A. & Keefer, R. B. (1995). Query expansion/reduction and its impact on retrieval effectiveness. In D. K. Harman
(Ed.), The Third Text REtrieval Conference (TREC–3). Gaithersburg, MD: National Institute of Standards and
Technology, 231–239.
Lyons, J. (1977). Semantics, vol. 1. Cambridge: Cambridge University Press.
Magennis, M. & van Rijsbergen, C. J. (1997). The potential and actual effectiveness of interactive query expansion. In
N. J. Belkin, A. D. Narasimhalu & P. Willett (Eds.), Proceedings of the 20th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval. New York, NY: ACM, 324–332.
Marchionini, G., Dwiggens, S., Katz, A. & Lin, X. (1993). Information seeking in full-text end-user-oriented search
systems: The roles of domain and search expertise. Library & Information Science Research 15(1), 35–70.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM 38(11), 39–41.
Mitra, M., Singhal, A. & Buckley, C. (1998). Improving automatic query expansion. In W. B. Croft, A. Moffat, C. J. van
Rijsbergen, R. Wilkinson & J. Zobel (Eds.), Proceedings of the 21st Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval. New York, NY: ACM, 206–214.
Myaeng, S. H. & McHale, M. L. (1991). Toward a relation hierarchy for information retrieval. In B. H. Kwasnik & R.
Fidel (Eds.), Advances in classification research. Vol. 2. Proceedings of the 2nd ASIS SIG/CR classification
research workshop. Medford, NJ: Learned Information, 101–113.
Nutter, J. T., Fox, E. A. & Evens, M. W. (1990). Building a lexicon from machine-readable dictionaries for improved
information retrieval. Literary and Linguistic Computing 5(2), 129–138.
Paice, C. D. (1991). A thesaural model of information retrieval. Information Processing & Management 27(5): 433–447.
Palmer, F. R. (1982). Semantics. (2nd ed.) Cambridge: Cambridge University Press.
1 The family name of the present author in the period 1981-1997 was Kristensen.
Peat, H. J. & Willett, P. (1991). The limitations of term co-occurrence data for query expansion in document retrieval systems. Journal of the American Society for Information Science, 42(5), 378–383.
Pirkola, A. (1998). The effects of query structure and dictionary setups in dictionary-based cross-language information
retrieval. In W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson & J. Zobel (Eds.), Proceedings of the 21st
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York:
ACM, 55–63.
Piternick, A. B. (1984). Searching vocabularies: a developing category of online search tools. Online Review 8(5), 441–
449.
Putnam, H. (1973). Meaning and reference. The Journal of Philosophy 70(19), 699–711.
Rajashekar, T. B. & Croft, W. B. (1995). Combining automatic and manual index representations in probabilistic
retrieval. Journal of the American Society for Information Science 46(4), 272–283.
Robertson, S. E. (1977a). The probabilistic character of relevance. Information Processing & Management 13(13), 247–
251.
Robertson, S. E. (1977b). The probability ranking principle in IR. Journal of Documentation 33(4), 294–304.
Robertson, S. E. (1981). The methodology of information retrieval experiment. In K. Sparck Jones (Ed.), Information
retrieval experiment. London: Butterworths, 9–31.
Robertson, S. E. (1986). On relevance weight estimation and query expansion. Journal of Documentation 42(3), 182–
188.
Robertson, S. E. (1990). On sample sizes for non-matched-pair IR experiments. Information Processing & Management
26(6), 739–753.
Robertson, S. E. (1997). Overview of the OKAPI project. Journal of Documentation 53(1), 3–7.
Robertson, S. E. & Beaulieu, M. (1997). Research and evaluation in information retrieval. Journal of Documentation
53(1), 51–57.
Robertson, S. E. & Belkin, N. J. (1978). Ranking in principle. Journal of Documentation 34(2), 93–100.
Robertson, S. E. & Sparck Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for
Information Science 27(3), 129–146.
Robertson, S. E. & Sparck Jones, K. (1997). Simple, proven approaches to text retrieval (rev. ed.) [online]. [Cited 10.10.1998.] Technical report No. 356. Cambridge: Computer Laboratory. Available from: <URL: http://www.cl.cam.ac.uk/ftp/papers/index.html>.
Robertson, S. E. & Walker, S. (1994). Some simple effective approximations to the 2-Poisson model for probabilistic
weighted retrieval. In W. Bruce Croft & C. J. van Rijsbergen (Eds.), Proceedings of the 17th Annual International
ACM–SIGIR Conference on Research and Development in Information Retrieval. New York, NY: ACM, 222–241.
Robertson, S. E. & Walker, S. (1997). On relevance weights with little relevance information. In N. J. Belkin, A. D.
Narasimhalu & P. Willett (Eds.), Proceedings of the 20th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval. New York, NY: ACM, 16–24.
Robertson, S. E., Walker, S. & Hancock-Beaulieu, M. M. (1995). Large test collection experiments on an operational,
interactive system: Okapi at TREC. Information Processing & Management 31(3), 345–360.
Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M., & Gatford, M. (1995). Okapi at TREC–3. In D. K.
Harman (Ed.), The Third Text REtrieval Conference (TREC–3). Gaithersburg, MD: National Institute of Standards
and Technology, 109–126.
Rosch, E. H. (1973). Natural categories. Cognitive Psychology, 4, 328–350.
Rosch, E. H. (1978). Principles of categorization. In E. Rosch & B. Lloyd (Eds.), Cognition and categorisation.
Hillsdale, NJ: Lawrence Erlbaum Associates.
Rosch, E. H. & Mervis, C. (1975). Family resemblance: Studies in the internal structure of categories. Cognitive Psychology, 7, 573–605.
Rosch, E. H., Mervis, C., Gray, W., Johnson, D. & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382–439.
Ruge, G. & Schwartz, C. (1991). Term associations and computational linguistics. International Classification, 18(1), 19–25.
Salton, G., Fox, E. & Wu, H. (1983). Extended Boolean information retrieval. Communications of the ACM 26(11),
1022–1036.
Salton, G. & McGill, M. J. (1983). Introduction to modern information retrieval. London: McGraw-Hill.
Saracevic, T. (1975). Relevance: A review of and framework for the thinking on the notion in information science.
Journal of the American Society for Information Science 26(6), 321–343.
Saracevic, T. (1996). Relevance reconsidered ‘96. In P. Ingwersen & N. O. Pors (Eds.), Proceedings of the Second
International Conference on Conceptions of Library and Information Science: Integration in Perspective.
Copenhagen: The Royal School of Librarianship, 201–218.
Saracevic, T. (1997). The stratified model of information retrieval interaction: Extension and applications. In C.
Schwartz & M. Rorvik (Eds.), ASIS ‘97: Proceedings of the 60th ASIS annual meeting. Medford, NJ: Information
Today, 313–327.
Schamber, L. (1994). Relevance and information behavior. In M. E. Williams (Ed.), Annual review of Information
Science and Technology, vol. 29. Medford, NJ: Learned Information, 3–48.
Schamber, L., Eisenberg, M. B. & Nilan, M. S. (1990). A re-examination of relevance: toward a dynamic, situational
definition. Information Processing & Management, 26(6), 755–776.
Schütze, H. & Pedersen, J. O. (1997). A cooccurrence-based thesaurus and two applications to information retrieval. Information Processing & Management, 33(3), 307–318.
Shaw, J. A. & Fox, E. A. (1995). Combination of multiple searches. In D. K. Harman (Ed.), The Third Text REtrieval
Conference (TREC–3). Gaithersburg, MD: National Institute of Standards and Technology, 105–108.
Siegel, S. & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences. New York, NY: McGraw-Hill.
Sintichakis, M. & Constantopoulos, P. (1997). A method for monolingual thesauri merging. In N. J. Belkin, A. D.
Narasimhalu & P. Willett (Eds.), Proceedings of the 20th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval. New York, NY: ACM, 129–138.
Smeaton, A. F. (1992). Progress in the application of natural language processing to information retrieval tasks. The
Computer Journal 35(3), 268–278.
Smeaton, A. F. (1995). Natural language processing & information retrieval. A tutorial presented at the Second
European Summer School in Information Retrieval – ESSIR ‘95, Glasgow, Scotland, September 1995.
Soergel, D. (1985). Organizing information. Orlando: Academic Press.
Sormunen, E. (1994). Vapaatekstihaun tehokkuus ja siihen vaikuttavat tekijät sanomalehtiaineistoa sisältävässä
tekstikannassa [Free-text searching efficiency and factors affecting it in a newspaper article database]. VTT
Julkaisuja 790. Espoo: Valtion Teknillinen Tutkimuskeskus. [In Finnish.]
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of
Documentation 28, 11–21.
Sparck Jones, K. (1974). Automatic indexing. Journal of Documentation 30(4), 393–432.
Sparck Jones, K. (1995). Reflection on TREC. Information Processing & Management 31(3), 291–314.
Sparck Jones, K. (1997). Summary performance comparisons TREC-2, TREC-3, TREC-4, TREC-5, TREC-6 [online].
[Cited 5.6.1998] Available from:
<http://trec.nist.gov/pubs /trec6/papers/sparck.ps>.
Strong, G. W. & Drott, M. C. (1986). A thesaurus for end-user indexing and retrieval. Information Processing &
Management 22(6), 487–492.
Strzalkowski, T. (1995). Natural language information retrieval. Information Processing & Management 31(3), 397–417.
Strzalkowski, T., Lin, F. & Perez-Carballo. J. (1997). Natural language information retrieval TREC-6 report [online, to
appear
in
Proceedings
of
TREC-6].
[Cited
26.2.1998.]
Available
from:
<URL:
http://trec.nist.gov/pubs/trec6/papers/ge.ps>.
Swanson, D. R. (1988). Historical note: information retrieval and the future of an illusion. Journal of the American
Society for Information Science 39(4), 92–98.
Svenonius, E. (1992). Classification: Prospects, problems and possibilities. In N. J. Williamson & M. Hudon (Eds.),
Classification research for knowledge representation and organization. Proceedings of 5th international Study
Conference on Classification Research. Amsterdam: Elsevier, 5–25.
Tague-Sutcliffe, J. (1992). The pragmatics of information retrieval experimentation, revisited. Information Processing &
Management 28(4), 467–490.
The Oxford reference dictionary (1986). J. M. Hawkins (Ed.). Oxford: Clarendon Press.
Turtle, H. R. (1990). Inference networks for document retrieval. Ph.D. dissertation. Computer and information Science
Department, University of Massachusetts. COINS Technical Report 90–92.
Turtle, H. R. & Croft, W. B. (1991). Evaluation of an inference network-based retrieval model. ACM transactions on
Information systems 9(3), 187–222.
Turtle, H. R. & Croft, W. B. (1992). A comparison of text retrieval models. The Computer Journal 35(3), 279–290.
Ullman, J. D. (1988). Principles of Database and Knowledge Base Systems. Vol. I. Rockville, MD: Computer Science
Press.
Uschold. M. & Gruninger, M. (1996). Ontologies: principles, methods and applications. The Knowledge Engineering
Review 11(2), 93–136.
Walker, S., Robertson, S. E., Boughanem, M., Jones, G. J. F. & Sparck Jones, K. (1997). Okapi at TREC-6: Automatic
ad hoc, VLC, routing, filtering and QSDR [online, to appear in Proceedings of TREC-6]. [Cited 26.2.1998.]
Available from: <URL: http://trec.nist.gov /pubs/trec6/papers/city_proc_auto.ps>.
Van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). Butterworths: London.
Wang, Y.-C., Vandendorpe, J. & Evens, M. (1985). Relational thesauri in information retrieval. Journal of the American
Society for Information Science 36(1), 15–27.
Voorhees, E. (1994). Query expansion using lexical-semantic relations. In W. Bruce Croft & C. J. van Rijsbergen (Eds.),
Proceedings of the 17th Annual International ACM–SIGIR Conference on Research and Development in Information
Retrieval. New York, NY: ACM, 61–69.
Voorhees, E. (1998). Variations in relevance judgements and the measurement of retrieval effectiveness. In W. B. Croft,
A. Moffat, C. J. van Rijsbergen, R. Wilkinson & J. Zobel (Eds.), Proceedings of the 21st Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 315–323.
119
Voorhees, E. M. & Harman, D. (1997). Overview of the fifth Text REtrieval conference (TREC-5). In E. M. Voorhees &
D. Harman (Eds.), Information technology: The Fifth Text REtrieval Conference (TREC-5). Gaithersburg, MD:
National Institute of Standards and Technology, 1–28.
Xu, J. & Croft, W. B. (1996). Query expansion using local and global document analysis. In H.-P. Frei, D. Harman, P.
Schäuble & R. Wilkinson (Eds.), Proceedings of the 19th Annual International ACM–SIGIR Conference on
Research and Development in Information Retrieval. New York, NY: ACM, 4–11.
Zobel, J. (1998). How reliable are the results of large-scale information retrieval experiments? In W. B. Croft, A. Moffat,
C. J. van Rijsbergen, R. Wilkinson & J. Zobel (Eds.), Proceedings of the 21st Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval. New York: ACM, 307–314.
120
Appendix 1
Requests

1. The Bush - Gorbachev Summit in Helsinki in September 1990. The subjects of the negotiations, and resolutions and agreements.
2. The South-American debt crisis. How has the debt problem developed? What has been done in order to solve the problem?
3. Dumping charges against the Finnish forest industry in the U.S. The content of the dumping charges, the result of the trial.
4. The proposal for a federation of municipalities between the town and the rural community of Jyväskylä. Supporters’ and opponents’ opinions and reasons are wanted. Calculation of the economic effects (economic incentives, subsidies, among other things).
5. The abolition of the Warsaw pact. Everything about the process of change, the reactions of different member countries, decisions, etc.
6. The economic boycott against Lithuania by the Soviet Union in spring 1990. What actions were linked to the boycott, and how were these manifest in Lithuania? Events that terminated the boycott.
7. Annihilation of Iraqi weapons of mass destruction. According to the armistice agreement of the Gulf War, Iraq must surrender chemical, biological, and nuclear weapons and their production engineering. The UN is responsible for the inventory and annihilation of the weapons. Has the mission succeeded?
8. The decisions of OPEC regarding oil prices and production.
9. The violence against the opposition committed by the miners on whom President Iliescu’s government had called for help in Bucharest. Background information on the events, victims and repercussions.
10. The UN peace keeping operation for Namibia’s independence. Information about the preparations, events linked to the operation, and the action of the UNTAG troops and the Finnish battalion.
11. The role of the European Parliament in EU decision-making. The role of the EU parliament in relation to the Commission and other official organs is of interest. What kinds of changes to the present situation have been desired, and who has been asking for them? How does democratic control function in the EU?
12. Carl Bildt and Nordic co-operation. Bildt’s statements concerning Nordic co-operation. What has Bildt said about the co-operation between Finland and Sweden in particular?
13. News about the working of the Yugoslavian presidential council, especially information about the sessions and decisions.
14. The 2 + 4 negotiations between East and West Germany and the four Allies [the United States, the UK, France, and the Soviet Union] concerning the reunification of Germany. What were the most essential problems? What particular conflicts came up? What is essential in the treaty?
15. The profitability of VALMET in the production of tractors and vehicles. Forest machines, tractors, transportation vehicles, and railway carriages (Transtech, among others) are included in the branch. Partnerships in the car and truck industry are not of interest.
16. Layoffs in Tampella. The target is information about layoffs in the various companies of the Tampella consortium.
17. The investments of KERA and KTM (Department of Trade and Industry) in tourism. Information about loans and subsidies (here = investments) granted for the business. Summaries are especially valuable.
18. An overview of natural gas acquisition by Neste. What has Neste achieved in natural gas acquisition (fields and import contracts), delivery (building a network), and marketing?
19. The handling and storage of radioactive waste produced by nuclear power plants. Examples of problems, risks and radioactive waste accidents.
20. The spread of AIDS in the EU. How serious is the AIDS situation in these countries? Information about contagion, campaigns and other activities to stop AIDS spreading.
21. The lifting of foodstuff import restrictions affecting the Finnish food industry.
22. Economic trends and cyclic fluctuation in housebuilding in Finland: especially statistics, prognoses, and estimates.
23. Exhaust gas emissions of road traffic in Finland and abroad. The development of emissions and future expectations (among others, the influence of legislation). What is the influence of catalysers on emission levels? The technology of catalysers is not of interest.
24. The investments of the Japanese car industry in Europe, and production co-operation with European car manufacturers. In what countries have Japanese automobile factories been planned, set up, or extended?
25. The environmental investments of the forest industry, especially investments in sewage treatment by the chemical forest industry. Both investments in sewage treatment plants and the adoption of more environment-friendly processes are of interest.
26. The opening hours of shops. The request is to ascertain the debate on free opening hours for retail shops. The statements and actions of the business organisations and those of the trade unions need special scrutiny.
27. Packages as an environmental protection issue. Of special interest are the development and testing of recycling systems, and legislation concerning recycling in different countries.
28. Esko Aho and Finland’s application for EU membership. Aho’s opinions, attitude, and activities regarding the application. Comments on his actions by others.
29. Kauko Juhantalo’s speeches and activities regarding nuclear power. Juhantalo’s opinions and justifications for the fifth nuclear power plant. How did Juhantalo further the decision on nuclear power?
30. The initiatives, interpellations, speeches, and voting activity of the Greens in the Finnish Parliament. The actions of both the party and the individual representatives are of interest.
Appendix 2
Conceptual query plans

Concepts are marked with square brackets, ‘[]’. In principle, a facet consists of all the concepts between two AND operators; exceptions are marked with parentheses, ‘()’, as in the conceptual query plan for request 14. Major facets are in bold face.
1. [Bush] AND [Gorbachev] AND [Helsinki] AND [summit] AND ([decision] OR [agreement])
2. [South-America] AND [debt] AND ([crisis] OR [problem]) AND ([development] OR [solution])
3. [forest industry] AND [dumping] AND [USA] AND [trial]
4. [Jyväskylä] AND ([town] OR [rural community]) AND [federation of municipalities] AND [economics] AND ([supporter] OR [opponent])
5. [Warsaw pact] AND [abolition] AND [member country] AND [decision]
6. [Lithuania] AND [economic boycott] AND [Soviet Union] AND [termination]
7. [Iraqi] AND [mass destruction weapon] AND [UN] AND ([annihilation] OR [inventory])
8. [Opec] AND [oil] AND ([production] OR [price]) AND [decision]
9. [Bucharest] AND [miner] AND [violence] AND [opposition]
10. [Namibia] AND ([peace keeping operation] OR [UNTAG]) AND [UN] AND [independence]
11. [EU parliament] AND [decision making] AND [official organs of EU]
12. [Bildt] AND [Nordic countries] AND [co-operation] AND [statement]
13. [Yugoslavia] AND [presidential council] AND ([decision] OR [session])
14. [Germany] AND [reunification] AND ([allied] OR ([USA] AND [United Kingdom] AND [France] AND [Soviet Union])) AND ([conflict] OR [treaty])
15. [Valmet] AND ([forest machine] OR [tractor] OR [railway carriage]) AND ([production] OR [profitability])
16. [Tampella] AND [layoff] AND [labour]
17. ([Kera] OR [KTM]) AND [tourism] AND ([loan] OR [subsidy] OR [investment])
18. [Neste] AND [natural gas] AND ([acquisition] OR [delivery] OR [marketing])
19. [nuclear power plant] AND [radioactive waste] AND ([processing] OR [storage]) AND ([problem] OR [accident] OR [risk])
20. [EU country] AND [AIDS] AND [spread]
21. [food industry] AND [import restriction] AND [lifting] AND [Finland]
22. [house] AND [building] AND [cyclic fluctuation] AND ([statistics] OR [prognosis])
23. [road traffic] AND [emission] AND [catalyser] AND ([legislation] OR [development])
24. [Japanese car industry] AND [European car manufacturer] AND [car factory] AND ([co-operation] OR [investment])
25. [chemical forest industry] AND [sewage treatment] AND [investment]
26. [retail] AND [business hours] AND ([business organisation] OR [trade union]) AND [liberation] AND [statement]
27. [package] AND [recycling] AND [legislation]
28. [Esko Aho] AND [EU membership] AND [application] AND [opinion]
29. [Kauko Juhantalo] AND [nuclear power] AND ([decision] OR [opinion])
30. [Green party] AND ([parliament] OR [representative]) AND ([initiative] OR [interpellation] OR [speech] OR [voting])
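In the plans above, the facets are combined with AND and the alternative concepts within a facet with OR. The following is a minimal illustrative sketch, not code from the study: a plan is represented as a list of concept sets and matched against a document's concepts, and the low-complexity variant keeps only the major facets. The concept names come from plan 6 above, but the choice of which facets are major is a hypothetical assumption here, since the bold-face marking does not survive in plain text.

```python
# Illustrative sketch: a conceptual query plan as a list of facets.
# Each facet is a set of alternative concepts (ORed); facets are ANDed.

def matches(plan, doc_concepts):
    """A document matches if every facet contributes at least one concept."""
    return all(facet & doc_concepts for facet in plan)

# Plan 6 as four single-concept facets.
plan = [
    {"Lithuania"},
    {"economic boycott"},
    {"Soviet Union"},
    {"termination"},
]

# Low-complexity variant: keep only the major facets (assumed here to be
# the first two; the study marks major facets in bold face).
low_complexity = plan[:2]

doc = {"Lithuania", "economic boycott", "import", "oil"}
print(matches(plan, doc))            # fails: no 'Soviet Union' or 'termination'
print(matches(low_complexity, doc))  # succeeds on the assumed major facets alone
```

Query expansion would enlarge the concept sets within a facet (adding thesaurus keys as further OR alternatives) without adding new facets, which is why expansion and complexity vary independently in the study design.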
Appendix 3
Classification of the search concepts

Geographical names: Bucharest, EU country, Finland, France, Germany, Great Britain, Helsinki, Iraqi, Jyväskylä, Lithuania, Namibia, Nordic country, South-America, Soviet Union (2), USA (2), Yugoslavia.

Other proper names: Bildt, Bush, Esko Aho, EU parliament, Gorbachev, Green party, Kauko Juhantalo, Kera, KTM, Neste, Opec, Tampella, UN (2), UNTAG, Valmet, Warsaw pact.

Other concrete objects: allied, car factory, catalyser, forest machine, house, miner, natural gas, nuclear power plant, oil, package, radioactive waste, railway carriage, representative, tractor, weapon of mass destruction.

Abstract objects and processes/events: abolition, accident, acquisition, agreement, AIDS, annihilation, application, building, business hours, business organisations, chemical forest industry, conflict, co-operation (2), crisis, cyclic fluctuation, debt, decision (5), decision making, delivery, development (2), dumping, economic boycott, economics, emission, EU membership, European car manufacturer, federation of municipalities, food industry, forest industry, import restriction, independence, initiatives, interpellations, inventory, investment (3), Japanese car industry, labour, layoff, legislation (2), liberation, lifting, loan, marketing, member country, nuclear power, official organs of EU, opinion (3), opponent, opposition, parliament, peace keeping operation, presidential council, price, problem, processing, production (2), profitability, prognosis, recycling, retail, retail organisation, reunification, risk, road traffic, rural community, session, sewage treatment, solution, speech, spread, statement, statistics, storage, subsidy, summit, supporter, termination, tourism, town, trade union, treaty, trial, violence, voting.

Table 1. Classification of the search concepts. Category totals: geographical names 18 (11), other proper names 17 (16), other concrete objects 17 (15), abstract objects 41 (18), processes/events 52 (10). The concepts of major facets are in bold face.
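Appendix 4 below reports P-R curves for combinations of query structures such as SUM, WSUM and the SYN variants. As a hedged sketch of the intuition behind these structures, the toy functions below combine belief values in [0, 1]; the actual operators are computed from term occurrence statistics inside the retrieval system, so these simplified functions only illustrate how the structures differ, and the numbers are invented.

```python
# Hedged sketch of structured-query operators over toy belief values in
# [0, 1]; simplified illustrations, not the retrieval system's implementation.

def op_sum(beliefs):
    """sum: all operands contribute equally (the mean belief)."""
    return sum(beliefs) / len(beliefs)

def op_wsum(weighted):
    """wsum: operands contribute in proportion to their weights."""
    total = sum(w for w, _ in weighted)
    return sum(w * b for w, b in weighted) / total

def op_syn(beliefs):
    """syn: operands are treated as instances of one concept; here the
    strongest single piece of evidence is taken."""
    return max(beliefs)

beliefs = [0.8, 0.4, 0.6]
print(op_sum(beliefs))                # mean of the operand beliefs
print(op_wsum([(2, 0.8), (1, 0.4)]))  # the 0.8 operand counts twice
print(op_syn(beliefs))                # evidence of the strongest operand
```

Roughly, SYN-type structures group a facet's search keys into one synonym set, while SUM-type structures let every key contribute separately, which is why the two families react differently to query expansion.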
Appendix 4
P-R curves of the query structures at different complexity and expansion levels.

[Each figure plots precision (%) against recall (%); a panel shows one structure/facet combination (F1 or F2) at the expansion levels Q0, Qs, Qn, Qa and Qf. The WSUM2 panels include the levels Q0 and Qf only, and the WSYN1-2 panels the facet combination F2 only.]

Fig. 1-2. P-R curves of the SUM combinations.
Fig. 3-4. P-R curves of the SYN1 combinations.
Fig. 5-6. P-R curves of the SYN2 combinations.
Fig. 7-8. P-R curves of the WSUM1 combinations.
Fig. 9-10. P-R curves of the WSUM2 combinations.
Fig. 11-12. P-R curves of the SSYN1 combinations.
Fig. 13-14. P-R curves of the SSYN2 combinations.
Fig. 15-16. P-R curves of the ASYN combinations.
Fig. 17-18. P-R curves of the WSYN1-2 combinations.
Fig. 19-20. P-R curves of the BOOL combinations.
Fig. 21-22. P-R curves of the XSUM combinations.
Fig. 23-24. P-R curves of the OSUM combinations.
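Each curve above plots precision at fixed recall levels, averaged over the requests. A minimal sketch of how the points of one such curve can be derived from a single ranked result list follows; the data are hypothetical, and this is not the study's evaluation code.

```python
# Minimal sketch of computing P-R curve points from a ranked result list.
# 'ranked' is a list of document ids in retrieval order; 'relevant' is the
# set of known relevant ids (hypothetical data).

def pr_points(ranked, relevant, levels=(0.25, 0.5, 0.75, 1.0)):
    """Precision at the first rank where each recall level is reached."""
    points = {}
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        recall = hits / len(relevant)
        precision = hits / rank
        for level in levels:
            if recall >= level and level not in points:
                points[level] = precision
    return points

ranked = ["d1", "d7", "d3", "d9", "d2", "d8"]
relevant = {"d1", "d3", "d2", "d5"}
print(pr_points(ranked, relevant))
```

A recall level that the run never reaches (here 100%, since one relevant document is not retrieved at all) gets no point; published curves typically interpolate such levels instead.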
Appendix 5
The baseline and best precision scores by requests.

Request #   Known relevants   Baseline precision   Best precision   Best prec. - baseline prec.
 1          32                31.7                 50.6             18.9
 2          55                45.0                 90.9             45.9
 3          16                42.2                 63.2             21.0
 4           8                33.1                 46.2             13.1
 5          48                63.7                 74.8             11.1
 6          87                80.0                 90.4             10.4
 7          65                66.8                 91.4             24.6
 8          29                67.1                 71.3              4.2
 9          24                50.4                 62.2             11.8
10          98                88.2                 96.1              7.9
11          29                19.8                 50.7             30.9
12          13                36.5                 57.5             21.0
13          36                69.4                 73.6              4.2
14          54                51.2                 84.2             33.0
15          45                63.8                 75.2             11.4
16          14                21.6                 49.9             28.3
17          17                33.7                 50.8             17.1
18          37                57.7                 70.3             12.6
19          34                67.2                 84.7             17.5
20          21                13.8                 54.2             40.4
21          14                22.6                 34.6             12.0
22          36                27.4                 39.7             12.3
23          87                74.3                 85.3             11.0
24          16                21.5                 50.5             29.0
25          25                 2.0                 14.4             12.4
26          27                38.1                 70.1             32.0
27          59                84.4                 89.9              5.5
28          21                45.9                 65.6             19.7
29           6                26.5                 39.2             12.7
30          13                 2.6                 37.4             34.8

Best combinations, in request order: SSYN2/F2/Qa-Qf; WSYN2/F2/Qn; ASYN/F2/Qf; BOOL/F1/Q0; XSUM/F1-2/Qs-Qf; OSUM/F1/Qs-Qn; ASYN/F2/Qs; SSYN1/F2/Qf; WSYN1/F2/Qs-Qn; ASYN/F2/Qa-Qf; WSUM1/F2/Qf; SUM/F2/Qf; OSUM/F1/Qf; SSYN2/F2/Qf; SSYN1/F2/Qs, Qa; ASYN/F2/Qf; OSUM/F2/Qf; ASYN/F2/Qa-Qf; XSUM/F1/Qn; BOOL/F2/Qf; ASYN/F1/Qn; BOOL/F1/Qn; ASYN/F1/Qf; WSYN1/F2/Qf; WSYN2/F2/Qf; SUM/F2/Q0; BOOL/F2/Qn; ASYN/F2/Qn; WSUM1/F2/Qn; OSUM/F1/Qa-Qf; ASYN/F1-F2/Qa-Qf; SSYN1/F2/Qa-Qf; SUM/F2/Q0.

Table 1. P@50 precision scores of the baseline and best combinations by requests.
Baseline and worst precision scores by requests.

Request #   Known relevants   Baseline precision   Worst precision   Baseline prec. - worst prec.
 1          32                31.7                  4.9              26.8
 2          55                45.0                  2.1              42.9
 3          16                42.2                  0                42.2
 4           8                33.1                  3.5              29.6
 5          48                63.7                  1.4              62.3
 6          87                80.0                  1.1              78.9
 7          65                66.8                 26.2              40.6
 8          29                67.1                 31.2              35.9
 9          24                50.4                 14.5              35.9
10          98                88.2                 14.5              73.7
11          29                19.8                  1.6              18.2
12          13                36.5                  0                36.5
13          36                69.4                 15.7              53.7
14          54                51.2                  0.4              50.8
15          45                63.8                  0                63.8
16          14                21.6                  0                21.6
17          17                33.7                  0                33.7
18          37                57.7                  5.8              51.9
19          34                67.2                  4.2              63.0
20          21                13.8                  0                13.8
21          14                22.6                  0                22.6
22          36                27.4                  0                27.4
23          87                74.3                  6.2              68.1
24          16                21.5                  0                21.5
25          25                 2.0                  0                 2.0
26          27                38.1                  4.7              33.4
27          59                84.4                  0.9              83.5
28          21                45.9                  2.2              43.7
29           6                26.5                  1.1              25.4
30          13                 2.6                  0                 2.6

Worst combinations, in request order: SYN2/F1/Q0; SYN2/F1/Qa; SUM/F1/Qn; BOOL/F2/Qa-Qf; SYN1/F1/Qs-Qf; SYN1/F2/Q0-Qn, SYN2/F2/Q0; OSUM/F2/Qn-Qf; BOOL/F2/Qs-Qn; BOOL/F1/Qf; OSUM/F1/Qa; SYN1-2/F2/Q0; SYN1-2/F2/Q0; BOOL/F2/Qf; SYN2/F1/Qa-Qf; SYN2/F2/Qa; BOOL/F1-2/Qn; BOOL/F2/Qf; SYN1/F2/Qa; BOOL/F1-2/Qf; BOOL/F2/Qn; BOOL/F1-2/Qn; BOOL/F2/Qf; SYN1/F1-2/Qs; BOOL/F1-2/Qf; SYN2/F1/Qn; SUM/F1-2/Qs-Qf; WSUM2/F1/Qf; SYN1/F1-2/Qs-Qf; SYN2/F1/Qf; BOOL/F2/Qn-Qf; BOOL/F1/Qa; BOOL/F2/Qn-Qa; BOOL/F1/Qf; SYN1/F1-2/Q0; SYN2/F2/Qs; SUM/F1/Q0; SYN1/F1-2/Q0-Qs, Qa; SYN2/F1/Qn; SYN2/F2/Qs-Qf; WSUM1/F2/Qn; SSYN1/F1-2/Qn; SSYN2/F1-2/Qn; ASYN/F1-2/Qn; WSYN1/F2/Qn; WSYN2/F2/Qn; BOOL/F2/Qa; OSUM/F1-2/Qn; SYN1/F1/Q0; SYN2/F1-2/Q0; BOOL/F2/Qn; SYN1/F1/Qa-Qf; SYN1/F1/Qs-Qn; SYN1/F1/Qa-Qf; SYN1/F2/Q0-Qa; SYN2/F2/Qn-Qf; BOOL/F1/Q0; OSUM/F1/Qn.

Table 2. P@50 precision scores of the baseline and worst combinations by requests.
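The P@50 figures in the two tables above are precision after the first 50 retrieved documents. A minimal sketch of the measure, with hypothetical ranked-list and relevance data rather than the study's own evaluation code:

```python
def precision_at(ranked, relevant, cutoff=50):
    """Percentage of the top-`cutoff` retrieved documents that are relevant."""
    top = ranked[:cutoff]
    return 100.0 * sum(1 for doc in top if doc in relevant) / cutoff

# Hypothetical ranked list with relevant documents at every other rank.
ranked = [f"d{i}" for i in range(100)]
relevant = {f"d{i}" for i in range(0, 100, 2)}
print(precision_at(ranked, relevant))  # 25 of the top 50 are relevant -> 50.0
```

Because the cutoff is fixed, a request with fewer than 50 known relevant documents can never reach 100% at P@50, which is one reason the "known relevants" column is reported alongside the scores.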