Revisiting Exhaustivity and Specificity Using Propositional Logic and Lattice Theory

Karam Abdulahhad
Université de Grenoble
LIG laboratory, MRIM group
Grenoble, France
[email protected]

Jean-Pierre Chevallet
UPMF-Grenoble 2
LIG laboratory, MRIM group
Grenoble, France
[email protected]

Catherine Berrut
UJF-Grenoble 1
LIG laboratory, MRIM group
Grenoble, France
[email protected]
ABSTRACT
Exhaustivity and Specificity in the logical Information Retrieval framework were introduced by Nie [16]. However, despite some attempts, they remain theoretical notions without a clear idea of how to implement them. In this study, we present a new approach to deal with them. We use propositional logic and lattice theory in order to redefine the two implications and their uncertainty, P(d → q) and P(q → d). We also show how to integrate the two notions into a concrete IR model in order to build a new, effective model. Our proposal is validated against six corpora, using two types of terms (words and concepts). The experimental results confirm our viewpoint: the explicit integration of Exhaustivity and Specificity into IR models improves their retrieval performance. Moreover, there should be a balance between the two notions.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Information
Search and Retrieval
Keywords
Information retrieval, Exhaustivity, Specificity
1. INTRODUCTION
Many studies in the Information Retrieval (IR) field argue that the retrieval process can be represented as a logical implication directed from a document d to a query q. Van Rijsbergen [23] proposes that if d and q are sets of logical sentences in a specific logic, then d should be retrieved iff it implies q, noted d → q.
However, the IR process is uncertain [8]. Therefore, a measure P for estimating the certainty of the implication d → q, and of course for ranking, is needed. In this way, the IR process can be modeled as a two-step process:
1. Retrieval: extracting candidate documents from a corpus by recognizing the reason or reasons that make a
document d a candidate answer to a query q.
2. Ranking: estimating the Relevance Score Value (RSV )
between d and q, or equivalently, estimating the strength
of the reasons that make d a candidate answer to q.
Nie [16] introduced two notions: Exhaustivity and Specificity (ES). Exhaustivity d → q means that q is deducible from d, i.e. there is a deductive path from d to q. In other words, Nie represented d as a possible world (expressed using a modal logic); then either q is directly true in d, or there is a series of changes to apply to d in order to make q true. Specificity q → d means that d is deducible from q, i.e. there is a deductive path from q to d. Exhaustivity is recall-oriented whereas Specificity is precision-oriented [15].
Exhaustivity and Specificity describe, in a subjective manner, the nature of the retrieval process. We think that if the
two notions are objectively described and integrated into IR
models, the gain in effectiveness would be appreciable.
In this work, we show how both notions can be theoretically redefined using Propositional Logic and Lattice theory,
and how they can be integrated into an IR model, and then
we translate the theoretical description into a practical one.
The paper is organized as follows. In Section 2, we review some studies about Exhaustivity and Specificity. In Section 3, we redefine Exhaustivity and Specificity using Propositional Logic and Lattice theory, and we theoretically integrate them into IR models. In Section 4, we show how to implement this theoretical framework in a concrete IR model. In Section 5, we describe our experiments and discuss the results. We conclude in Section 6.
2. ES IN IR LITERATURE
Exhaustivity and Specificity (ES) were first introduced by Spärck Jones [12], who talked about document exhaustivity, which refers to the number of terms a document contains, and term specificity, which refers to the number of documents a term belongs to. According to [12], Exhaustivity and Specificity are statistical notions. In this study, we are more concerned with the definitions of Exhaustivity and Specificity introduced by Nie [16]. According to Nie, Exhaustivity says how many elements of q are mentioned in d, whereas Specificity says at what level of detail the elements of q are mentioned in d. Nie claims that the retrieval process is not only d → q but also q → d.
However, IR is an uncertain process [8], because:
• q is an imperfect representation of user needs;
• d is also an imperfect representation of the content of
documents;
• Relevance judgement depends on external factors, like
user background knowledge (user dependent).
Therefore, another component, besides the logic, should be
added to estimate the certainty of an implication and to
offer a mechanism of ranking. In other words, a measure P
should exist and be capable of measuring the certainty of
the logical implication between d and q, noted P (d → q).
The final RSV(d, q), according to Nie [16], becomes:

RSV(d, q) = F[P(d → q), P(q → d)]    (1)

where P(d → q) = 1 if d → q is true, 0 ≤ P(d → q) < 1 otherwise, and F is an arbitrary function for combining two numerical values.
Crestani [10] proposes the existence of a function Sim capable of estimating the semantic similarity between any two terms. He extends the retrieval function to become:
• Query viewpoint: starting from a query and comparing it to a document.

RSV_{max(q.d)}(d, q) = Σ_{t∈q} Sim(t, t*) w_d(t*) w_q(t)

where t* ∈ T is the document term that gives the maximum similarity with t, w_d(t*) is the weight of t* in d, and w_q(t) is the weight of t in q.
• Document viewpoint: starting from a document and comparing it to a query.

RSV_{max(d.q)}(d, q) = Σ_{t∈d} Sim(t, t*) w_d(t) w_q(t*)

Crestani [10] thus tries to integrate Exhaustivity and Specificity in one IR model, where q.d refers to Exhaustivity and d.q refers to Specificity.
Losada and Barreiro [15] also use the two notions in order to build a more precise retrieval model. They exploit propositional logic, its truth-value interpretations, and the belief revision technique to define the two notions and to build an implementable retrieval function.
Exhaustivity and Specificity are also exploited in the structured document and passage retrieval fields, where IR systems try to find a document d that answers a query q (Exhaustivity) and, within d, to find the most appropriate or precise component or passage (Specificity) [9].
As we said, according to Nie [16], the relation between d and q is not symmetric: d → q is different from q → d. However, some IR models represent d and q in
the same formalism. This is the case in the Vector Space
Model (VSM) [20], where it is assumed that both d and q
are vectors in the same term space and the relevance value is
the similarity between them. In that case, RSV is computed
using a symmetric function (e.g. Cosine).
In some other IR models, the RSV computation is not
symmetric. This is the case in Probabilistic Models (PM)
[18] and Language Models (LM) [17]. PMs compute the
probability that a certain document is relevant to a certain
query, whereas they do not compute the probability that a
query is the appropriate question for retrieving a document.
In other words, PMs talk about P (R|d, q), but they do not
actually talk about P (A|d, q), where A is a binary random
variable: A = True if q is an appropriate question for d, or
A = False otherwise. LMs compute the ability of a document to reproduce a query, P(q|d), but they rarely compute the ability of a query to reproduce a document, P(d|q).
The two notions, Exhaustivity and Specificity, have been studied in the state of the art, but mostly from a theoretical point of view. We propose to go deeper into this analysis by revisiting them theoretically in a logical framework, and then by proposing a practical implementation.
3. REVISITING ES MODEL
In this study, we basically depend on Propositional Logic (PL) and Lattice theory. We also assume that the logical implication d → q is actually the material implication [3, 4]. We start this section with a mathematical introduction to PL, lattices, and the potential relation between them.
3.1 Mathematical Foundations
3.1.1 Propositional Logic (PL)
We define a set of atomic propositions A = {a1, ..., an}. The set A forms the alphabet that is used to build logical sentences. The set A is finite, |A| = n, and any proposition ai ∈ A can have one of two values: True (T) or False (F).
In PL, a semantics is given to a logical sentence s by assigning a truth value (T or F) to each atomic proposition in s. In this manner, each logical sentence s has several interpretations I_s, depending on the truth values of its propositions. The subset of interpretations M_s ⊆ I_s that makes s true is called the set of models of s, noted M_s |= s. Moreover, |= s means that s is a tautology, i.e. it is true in all interpretations. According to this formalism, M_s is a set of models, and each model m ∈ M_s can be defined as the subset of atomic propositions that are interpreted as T [15]. For example, assume A = {a, b, c} and s = a ∧ ¬b; then M_s = {m1, m2}, where m1 = {a} and m2 = {a, c}. Note that propositions that do not occur in the sentence (e.g. c) can be T or F.
Assume s1 and s2 are logical sentences. By definition, the material implication s1 → s2 is equivalent to model inclusion:

[|= s1 → s2] ⇔ [M_{s1} ⊆ M_{s2}]    (2)
In this study, we consider a special kind of logical sentence, namely clauses. Each clause is a conjunction of non-negated atomic propositions, e.g. a, a ∧ b, a ∧ c, etc. Each clause s has, like any logical sentence, a set of models M_s. However, depending on the set of models M_s, it is possible to define one model m_s:

m_s = ∩_{m ∈ M_s} m    (3)

where m_s contains exactly the propositions that occur in the clause s. For example, assume A = {a, b, c} and the clause s = a ∧ b; then M_s = {m1, m2}, where m1 = {a, b} and m2 = {a, b, c}. According to the previous definition, m_s = m1 ∩ m2 = {a, b}.
Assume s1 and s2 are two clauses. Then, according to (Eqs. 2 & 3), we have:

[|= s1 → s2] ⇔ [m_{s2} ⊆ m_{s1}]    (4)
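To make the correspondence between clauses and their single model m_s concrete, here is a minimal Python sketch (our own illustration, not part of the original formalism) that represents a clause by the set of its propositions (Eq. 3) and tests the material implication of (Eq. 4) by set inclusion:

```python
# A clause (conjunction of non-negated atomic propositions) is fully
# characterized by the set m_s of the propositions it contains (Eq. 3).
def m(clause_terms):
    """Return m_s for a clause given as an iterable of propositions."""
    return frozenset(clause_terms)

def implies(s1, s2):
    """|= s1 -> s2 holds iff m_s2 is included in m_s1 (Eq. 4)."""
    return m(s2) <= m(s1)

# Example: s1 = a AND b AND c, s2 = a AND b
print(implies({"a", "b", "c"}, {"a", "b"}))  # True: s1 -> s2
print(implies({"a", "b"}, {"a", "b", "c"}))  # False: s2 does not imply s1
```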
3.1.2 Lattice Theory
The algebraic structure B = (2^A, ∩, ∪, \, ⊤, ⊥) is a Boolean algebra, or equivalently, a distributive and complemented lattice, where:
1) 2^A is the power set of A;
2) the meet operation is the set intersection (∩);
3) the join operation is the set union (∪);
4) the complement operation is the set difference (\);
5) the top element is ⊤ = A;
6) the bottom element is ⊥ = ∅.
The ordering relation ≤ defined on B is:

∀x, y ∈ 2^A, [x ≤ y] ⇔ [x ⊆ y]    (5)

We previously showed that any logical clause corresponds to a subset of the propositions of A (Eq. 3). Therefore, any logical clause corresponds to one node of the lattice B. From (Eqs. 4 & 5), we conclude that for any two clauses s1 and s2:

[|= s1 → s2] ⇔ [m_{s2} ≤ m_{s1}]    (6)

3.1.3 Degree of Inclusion
In any lattice (L, ∧_L, ∨_L) and for any two elements x, y ∈ L, even if x does not include y, it is possible to describe the degree to which x includes y. Knuth [13, 14] generalizes inclusion to a degree of inclusion represented by real numbers. He introduced the z function:

∀x, y ∈ L, z(x, y) = 1 if x ≥ y; 0 if x ∧_L y = ⊥; z, with 0 < z < 1, otherwise    (7)

where z(x, y) quantifies the degree to which x includes y. In the extreme case where (L, ∧_L, ∨_L, ¬_L, ⊤, ⊥) is a Boolean algebra and z is consistent with all properties of the Boolean algebra structure, z is the conditional probability [14]:

∀x, y ∈ L, z(x, y) = P(x|y)    (8)

In general, the z function is not necessarily consistent with all these properties, and z can then be replaced by a more relaxed function than probability [3].
Back to our logical framework, the degree of certainty of a material implication can be replaced by the degree of inclusion, or equivalently the z function (Eqs. 6 & 7). In other words, assume s1 and s2 are two clauses. We know that s1 and s2 correspond to two subsets of propositions, m_{s1} and m_{s2} (Eq. 3), respectively. We also know that for the material implication s1 → s2 to be true, m_{s2} ≤ m_{s1} must be satisfied (Eq. 6). Hence, the certainty of the implication P(s1 → s2) can be replaced by the degree of inclusion z(m_{s1}, m_{s2}):

P(s1 → s2) = z(m_{s1}, m_{s2})    (9)
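For intuition only, one simple instance of a z function that respects the boundary cases of (Eq. 7) is the set ratio below; it is our own illustration (Section 4 will instead instantiate z with the inner product):

```python
def z(x, y):
    """Degree to which x includes y (Eq. 7), for x and y subsets of the alphabet A:
    1 if y is included in x, 0 if x and y are disjoint, a value in (0, 1) otherwise."""
    x, y = set(x), set(y)
    if not y:
        return 1.0                 # by convention, everything includes the empty set
    return len(x & y) / len(y)

# Certainty of the implication between two clauses (Eq. 9): P(s1 -> s2) = z(m_s1, m_s2)
print(z({"a", "b", "c"}, {"a", "b"}))  # 1.0 -> the implication certainly holds
print(z({"a"}, {"a", "b"}))            # 0.5 -> partial support
print(z({"a"}, {"b", "c"}))            # 0.0 -> disjoint clauses
```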
3.2 IR Notions
In this section, we redefine, in view of our previous mathematical foundations, the basic IR notions: document, query, retrieval decision, uncertainty, Exhaustivity, and Specificity.
The alphabet A = {a1, ..., an} is the set of all terms, where each term ai ∈ A is an atomic proposition. We say a term ai is true with respect to a document d if it appears in d, and false otherwise.

3.2.1 Documents & Queries
A document d is a logical clause, or equivalently, a conjunction of its terms. Queries are represented in the same way. Representing a document d as a conjunction of its terms means that, in any model of d, the terms that appear in d must be true, while the other terms can be true or false. This is a relaxed version of the underlying hypothesis of most classical IR models, where the terms that do not appear in d are implicitly false.
Our representation, like most classical IR models, is not capable of representing negated terms; e.g. ¬ai, where ai ∈ A, is not a valid clause.
Since a document d is a logical clause, it corresponds to a set of models M_d and to a subset of terms m_d, where m_d = ∩_{m ∈ M_d} m (Eq. 3). More precisely, m_d is the set of terms that appear in d.

3.2.2 Retrieval Decision & Uncertainty
As we mentioned, the retrieval decision can be modeled as a logical implication d → q between a document d and a query q. IR is intrinsically an uncertain process; therefore, representing the retrieval decision by d → q alone is quite limited, because d → q is a binary decision. We thus need to measure the degree of implication, namely P(d → q).
Hence, we first choose the material implication to represent the retrieval decision [3, 4]. Then, according to (Eq. 9), it is possible to estimate P(d → q) through z(m_d, m_q): if s1 → s2 is true then P(s1 → s2) = 1 (Eq. 1) and m_{s2} ≤ m_{s1} (Eq. 6); in addition, if m_{s2} ≤ m_{s1} then z(m_{s1}, m_{s2}) = 1 (Eq. 7).

P(d → q) = z(m_d, m_q)    (10)

3.2.3 Exhaustivity & Specificity & RSV(d, q)
Nie [16] differentiates between d → q (Exhaustivity) and q → d (Specificity). According to (Eq. 10), P(d → q) = z(m_d, m_q) and P(q → d) = z(m_q, m_d).
We said that if the z function is consistent with all properties of the Boolean algebra structure, then it is equivalent to the conditional probability (Eq. 8). We also said that this is not a mandatory constraint. If the z function is only consistent with the join associativity property, x ∨_L (y ∨_L t) = (x ∨_L y) ∨_L t [14], then the z function must satisfy the following rule:

z(x ∧_L y, t) = z(x, t) + z(y, t) − z(x ∨_L y, t)    (11)

Depending on (Eq. 11), we have:
• z(m_d ∩ m_q, m_q) = z(m_d, m_q) + z(m_q, m_q) − z(m_d ∪ m_q, m_q). We know from (Eq. 7) that z(m_q, m_q) = 1 because m_q ≥ m_q, and z(m_d ∪ m_q, m_q) = 1 because m_d ∪ m_q ≥ m_q. Therefore, Exhaustivity:

P(d → q) = z(m_d, m_q) = z(m_d ∩ m_q, m_q)

• z(m_d ∩ m_q, m_d) = z(m_d, m_d) + z(m_q, m_d) − z(m_d ∪ m_q, m_d). We know from (Eq. 7) that z(m_d, m_d) = 1 because m_d ≥ m_d, and z(m_d ∪ m_q, m_d) = 1 because m_d ∪ m_q ≥ m_d. Therefore, Specificity:

P(q → d) = z(m_q, m_d) = z(m_d ∩ m_q, m_d)

These conclusions are compatible with the intuition about Exhaustivity and Specificity presented in [2], where Exhaustivity or Specificity can be estimated by comparing the coordination level d ∩ q with the query q or with the document d, respectively.
Finally, for computing RSV(d, q), it is now possible to replace (Eq. 1) by:

RSV(d, q) = F[z(m_d ∩ m_q, m_q), z(m_d ∩ m_q, m_d)]

For simplifying the notation, we replace, in the rest of this paper, m_d ∩ m_q by σ:

RSV(d, q) = F[z(σ, m_q), z(σ, m_d)]    (12)
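As a tiny worked example of (Eq. 12), reusing the illustrative set-ratio z sketched above (again, not the inner product finally adopted in Section 4):

```python
m_d = {"a", "b", "c"}            # terms of the document
m_q = {"a", "b", "d"}            # terms of the query
sigma = m_d & m_q                # {"a", "b"}

exhaustivity = len(sigma) / len(m_q)   # z(sigma, m_q) = 2/3: how much of q is covered by d
specificity  = len(sigma) / len(m_d)   # z(sigma, m_d) = 2/3: how much of d is devoted to q
rsv = exhaustivity * specificity       # with F chosen as the product, as in Section 4
```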
4. FROM THEORETICAL TO CONCRETE IR MODEL
We project our logical modeling onto a vector space IR framework, and study the effect of integrating Exhaustivity and Specificity into a concrete IR model on the retrieval performance of that model. As we said, d, q, and σ are all of the same nature. Moreover, according to our model, m_d contains the terms occurring in d and m_q contains the terms occurring in q, whereas σ represents what d and q share. Hence, there are several ways to express σ. The most direct way is to build σ from the terms that occur in both d and q. A more natural way is to build σ from the terms of d (or q) that either occur in q (or d) or have a semantic relation with some terms in q (or d). We here assume that terms within d and q are independent.
In order to take a further step from our logical framework to the vector space framework, we redefine the basic IR notions:
• The term space A: each proposition or term ai ∈ A corresponds to a dimension in the term space, |A| = n.
• Each document d is a vector in the space A. Each term ai ∈ A corresponds to a component w_{ai}^d of this vector. Terms that occur in d have a weight w_{ai}^d > 0, whereas w_{ai}^d = 0 for the others.
• The query q is a vector in the space A. Each term ai ∈ A corresponds to a component w_{ai}^q of this vector. Terms that occur in q have a weight w_{ai}^q > 0, whereas w_{ai}^q = 0 for the others.
• The object σ is also a vector in the space A. As we said, it is possible to build different versions of σ. In this study, we build two versions, σ and σ+, as follows:
1- Basic object σ: it is represented using the terms that occur in both d and q. Each term ai ∈ A corresponds to a component w_{ai}^σ, where w_{ai}^σ > 0 for terms that occur in both d and q, whereas w_{ai}^σ = 0 for the others.
2- Extended object σ+: it is represented using the terms of q that either occur in d or have a semantic relation with some terms of d. Each term ai ∈ A corresponds to a component w_{ai}^{σ+}, where w_{ai}^{σ+} > 0 for the terms of q that either occur in d or have a semantic relation with some terms of d, whereas w_{ai}^{σ+} = 0 for the others. Here, we need a function Sim for computing the semantic similarity between any two terms a1, a2 ∈ A:

Sim(a1, a2) = 1 if a1 = a2; 0 if there is no semantic relation; u, with 0 < u < 1, otherwise    (13)

We present later the exact definition of the similarity function and of the semantic relation between terms, because they depend on the type of terms used for indexing documents and queries.
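The following short sketch illustrates which terms enter the two versions of σ; it is our own toy example, and the semantic-relation map is hypothetical:

```python
def basic_sigma(d_terms, q_terms):
    """Terms of the basic object sigma: terms occurring in both d and q."""
    return set(q_terms) & set(d_terms)

def extended_sigma(d_terms, q_terms, related):
    """Terms of the extended object sigma+: terms of q that occur in d
    or are semantically related (per `related`) to some term of d."""
    return {t for t in q_terms
            if t in d_terms or related.get(t, set()) & set(d_terms)}

d_terms = {"dog", "bone"}
q_terms = {"animal", "bone"}
related = {"animal": {"dog", "cat"}}               # hypothetical semantic relations
print(basic_sigma(d_terms, q_terms))               # {'bone'}
print(extended_sigma(d_terms, q_terms, related))   # {'animal', 'bone'}
```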
Two functions in (Eq. 12), F and z, remain to be determined. The function z now becomes a function for comparing two vectors, but we need a function consistent with the join associativity property of the Boolean algebra structure. Abdulahhad et al. [3] show that the inner product satisfies this condition; we thus choose the inner product to replace the z function in (Eq. 12). The function F is a function for combining two numerical values; in this study, we choose the product (×).
Finally, in view of our choices, the retrieval formula (Eq. 12) can be developed into two concrete retrieval formulae, depending on which version of the object σ is used.
4.1 Basic Object (σ)
According to the choices of the two functions z and F, (Eq. 12) can be rewritten as follows:

RSV^σ(d, q) = (σ · q) × (σ · d)    (14)

where σ, q, and d denote the corresponding vectors and · is the inner product.
Applying a log to (Eq. 14) does not change the ranking, but it transforms the product into a sum and allows us to study the mutual impact between Exhaustivity and Specificity. Therefore, (Eq. 14) can be rewritten as follows:

RSV^σ(d, q) ∝ α × log(σ · q) + (1 − α) × log(σ · d)    (15)
where α ∈ [0, 1] is a tuning parameter, introduced to study the mutual impact between Exhaustivity and Specificity. The retrieval formula thus becomes:

RSV^σ(d, q) ∝ α × log(Σ_{a∈A} w_a^σ × w_a^q) + (1 − α) × log(Σ_{a∈A} w_a^σ × w_a^d)    (16)
In IR, documents are ranked with respect to a query and not the inverse; we thus choose to weight the terms of the object σ using the query: w_a^σ = c(a, q) for terms a ∈ A that occur in both d and q, and 0 otherwise. Query terms are equally important: w_a^q = 1/|q| for terms a ∈ A that occur in q, and 0 otherwise. Here, c(a, q) is the count of the term a in the query q and |q| is the query length. The only component left to clarify is w_a^d. Fang et al. [11] reviewed, in the form of constraints, the weighting heuristics that should be respected by weighting functions in order to be effective. In this paper, we use (Eq. 17), which respects the first four constraints (TFC1, TFC2, TDC, and LNC1) of Fang et al. [11]:

w_a^d = [c(a, d) / (c(a, d) + |d|/avdl)] × [N / df(a)]    (17)

where c(a, d) is the count of a in d, |d| is the document length, avdl is the average document length in the corpus, N is the total number of documents in the corpus, and df(a) is the number of documents in the corpus that contain a. If a does not occur in d then w_a^d = 0.
Equation (16) consists of two components. The first is monotonically increasing w.r.t. the number of shared terms between d and q, but monotonically decreasing w.r.t. the query length. The second is also monotonically increasing w.r.t. the number of shared terms, but monotonically decreasing w.r.t. the document length.
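To make (Eqs. 16 & 17) concrete, here is a small Python sketch of RSV^σ under the weighting choices above; the function names and the way the statistics are passed in are our own assumptions:

```python
import math

def w_d(a, doc_tf, doc_len, avdl, N, df):
    """Document weight of term a (Eq. 17); 0 if a does not occur in d."""
    c = doc_tf.get(a, 0)
    if c == 0:
        return 0.0
    return c / (c + doc_len / avdl) * (N / df[a])

def rsv_sigma(query_tf, doc_tf, doc_len, avdl, N, df, alpha=0.5):
    """RSV^sigma (Eq. 16) with w_a^sigma = c(a, q), w_a^q = 1/|q|, and w_a^d from Eq. 17."""
    q_len = sum(query_tf.values())                      # |q|
    shared = [a for a in query_tf if a in doc_tf]       # terms of sigma
    if not shared:
        return float("-inf")                            # nothing shared: not a candidate
    exh  = sum(query_tf[a] * (1.0 / q_len) for a in shared)                         # vs the query
    spec = sum(query_tf[a] * w_d(a, doc_tf, doc_len, avdl, N, df) for a in shared)  # vs the document
    return alpha * math.log(exh) + (1 - alpha) * math.log(spec)
```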
4.2 Extended Object (σ+)
The retrieval formula of the extended object σ+ can be developed in the same way as that of the basic object σ (Eq. 16). However, the main difference is that the terms of the extended object σ+ do not necessarily occur in the document: some terms of σ+ do not occur in d, but they have a semantic relation with some terms of d. Therefore, when using the extended object σ+, the retrieval formula becomes (Eq. 18) instead of (Eq. 16).
RSV^{σ+}(d, q) ∝ α × log(Σ_{a∈A} w_a^{σ+} × w_a^q) + (1 − α) × log(Σ_{a∈A} w_a^{σ+} × Sim(a, a′) × w_{a′}^d)    (18)
where w_a^{σ+} = c(a, q) for the terms of q that either occur in d or have a semantic relation with some terms of d, and 0 otherwise; w_a^q = 1/|q| for terms that occur in q, and 0 otherwise. In addition, a′ is the document term that has a semantic relation with a and gives the maximum semantic similarity value with a according to the function Sim (Eq. 13).
Equations (16) and (18) are the main retrieval formulae; they are the ones used in our experiments.
4.3 Terms Definition
We naturally use words, the most common type of terms, to represent the content of documents and queries. However, it is also possible to use more informative terms, namely concepts [7]. Normally, concepts are part of an external resource or knowledge base, and they are connected by a variety of semantic relations. Since concepts are not part of documents and queries, we need a tool to map the text of documents and queries to concepts. UMLS (Unified Medical Language System, www.nlm.nih.gov) is an example of an external resource containing concepts, and MetaMap (metamap.nlm.nih.gov) [6] is an example of a tool that maps text to UMLS concepts.
On the one hand, using concepts instead of words raises some problems, because natural languages are ambiguous and mapping tools are thus not very precise. Abdulahhad et al. [1] discuss one of the potential problems that can appear when moving from words to concepts.
On the other hand, we said that the object σ refers to what d and q semantically share. Therefore, using concepts makes the definition of σ more flexible, by exploiting the semantic relations between concepts. Hence, it is possible to go beyond the simple intersection-based definition of σ.
Concerning the object σ, we use the basic definition σ with the two types of terms, words and concepts, whereas the extended definition σ+ is only used with concepts.
The semantic relation used to build the extended object σ+ is the 'isa' relation (Hyponymy / Hypernymy) between concepts. This relation is defined on the concepts of UMLS. The isa relation is:
• transitive: for any three concepts c1, c2, and c3, if c1 →isa c2 and c2 →isa c3, then c1 →isa c3;
• directed: c1 →isa c2 is different from c2 →isa c1.
Therefore, it is possible to define the length, i.e. the number of edges l_{ci.cj}, of the isa-path between two concepts ci and cj, directed from ci to cj, in a resource such as UMLS in our case. The extended object σ+ will contain the concepts cq of q that either occur in d or for which there exists at least one concept cd in d such that cd →isa cq. We choose the isa relation directed from the concepts of d to the concepts of q, i.e. a query about 'animal' will be satisfied by a document about 'dog', but not the inverse.
There are several possible choices for the semantic similarity function (Eq. 13) between two concepts. The main constraint is that it must be monotonically decreasing w.r.t. the length of the isa-path between the two concepts. One possible definition is the exponential function:

Sim(ci, cj) = 1 if ci = cj; 0 if there is no semantic relation; e^{−β × l_{ci.cj}} if ci →isa cj    (19)

where β ∈ R+ is a tuning parameter. We use this definition of the semantic similarity function in our experiments.
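As an illustration only, the sketch below computes Sim of (Eq. 19) from isa-path lengths over a tiny, made-up concept hierarchy (real experiments use UMLS); the identifiers and helper names are hypothetical:

```python
import math

# Hypothetical isa edges, directed child -> parent, e.g. 'dog' isa 'canine' isa 'animal'.
ISA_PARENTS = {"dog": {"canine"}, "canine": {"animal"}, "cat": {"feline"}, "feline": {"animal"}}

def isa_path_length(ci, cj):
    """Number of isa edges from ci up to cj, or None if cj is not reachable (assumes an acyclic hierarchy)."""
    frontier, length = {ci}, 0
    while frontier:
        if cj in frontier and length > 0:
            return length
        frontier = set().union(*(ISA_PARENTS.get(c, set()) for c in frontier))
        length += 1
    return None

def sim(ci, cj, beta=7.0):
    """Semantic similarity of Eq. (19): 1 if equal, exp(-beta * l_{ci.cj}) if ci -isa-> cj, 0 otherwise."""
    if ci == cj:
        return 1.0
    l = isa_path_length(ci, cj)
    return math.exp(-beta * l) if l is not None else 0.0

# Directed from document concepts to query concepts: a document about 'dog'
# partially satisfies a query about 'animal', but not the other way around.
print(sim("dog", "animal"), sim("animal", "dog"))  # small positive value, 0.0
```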
5. EXPERIMENTS
The main goal of our experiments is to show, on the one hand, the validity of our hypothesis about the improvement in retrieval performance that IR models can gain by integrating Exhaustivity and Specificity, and, on the other hand, that even with simple retrieval formulae (Eqs. 16 & 18), an IR model can perform as well as or better than some state-of-the-art models, e.g. the Pivoted Normalization method (an example of VSM) [21], BM25 (an example of probabilistic models) [19], and Jelinek-Mercer and Dirichlet smoothing (examples of language models) [26].
As we previously illustrated, calculating σ is not a simple set intersection. The underlying problem is to compute exactly what d and q semantically share. Therefore, we propose to compare two definitions of terms: 1- words, classically extracted after stop-word removal and stemming, and 2- concepts, obtained by mapping the text of documents to concepts through an external knowledge base. This is why we use medical corpora from CLEF (www.clef-initiative.eu), which can be used with the well-known knowledge base UMLS.
The retrieval formulae are compared to each other using the Mean Average Precision (MAP) metric. As a statistical significance test, we use Fisher's randomization test at the 0.05 level [22].
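For reference, a minimal sketch of the Fisher randomization (permutation) test in its usual paired form (random sign flips of per-query score differences); the variable names are ours:

```python
import random

def randomization_test(scores_a, scores_b, trials=100000, seed=0):
    """Two-sided randomization test on paired per-query scores (e.g. average precision).
    Returns the proportion of random sign assignments whose mean difference
    is at least as extreme as the observed one (the p-value)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            extreme += 1
    return extreme / trials

# A difference is considered significant when the returned p-value is below 0.05.
```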
5.1 Experimental Setup
Six corpora and two types of indexing terms are used. We
also define the object σ in two different ways.
5.1.1 Indexing Terms
Words: the final list of words indexing documents and queries is the list of words in the corpus after removing stop words and stemming the remainder with the Porter algorithm.
Concepts: the text of documents and queries is mapped to UMLS concepts using MetaMap. UMLS is a multi-source knowledge base in the medical domain, whereas MetaMap is a tool for mapping text to UMLS concepts. Using concepts allows us to profit from the semantic relations between concepts, and thus to build different versions of the object σ.
5.1.2 Corpora
Four corpora from CLEF and two from TREC (http://trec.nist.gov/) are used. Table 1 shows some statistics about them. The statistics are computed after stop-word removal and stemming.
img09, img10, img11, img12: contain short medical documents and queries.
trec6, trec8: contain long general-content documents and queries. For TREC queries, we use the three fields 'title', 'desc', and 'narr'. For trec6, documents on disks 4 & 5 and topics 301-350 are used. For trec8, documents on disks 4 & 5 without the Congressional Record (CR) and topics 401-450 are used.
Table 1: Corpora statistics. avdl and avql are the average length of documents and queries.

Corpus   #d       #q   Terms      avdl     avql
img09    74901    25   Words      62.16    3.36
                       Concepts   157.48   10.84
img10    77495    16   Words      62.12    3.81
                       Concepts   157.27   12.0
img11    230088   30   Words      44.83    4.0
                       Concepts   101.92   12.73
img12    306530   22   Words      47.16    3.55
                       Concepts   47.16    9.41
trec6    551787   50   Words      266.67   46.56
                       Concepts   -        -
trec8    523865   50   Words      243.87   29.06
                       Concepts   -        -
5.1.3 Retrieval Formulae
First, in order to study the effect of Exhaustivity and Specificity, we compare our retrieval formulae RSV^σ (Eq. 16) and RSV^{σ+} (Eq. 18) with RSV^{d·q} (Eq. 20):

RSV^{d·q}(d, q) = d · q = (1/|q|) × Σ_{a∈A} [c(a, d) / (c(a, d) + |d|/avdl)] × [N / df(a)]    (20)
Second, we compare the retrieval performance of RSV^σ and RSV^{σ+} with:
• Pivoted Normalization (PIV), where s = 0.2 [21, 11];
• BM25, where k1 = 1.2, b = 0.75, k3 = 1000 [19, 11];
• Dirichlet Prior method (DIR), where µ = 2000 [26];
• Jelinek-Mercer method (JM), where usually λ = 0.1 for short queries, which is the case of the imgXX corpora, and λ = 0.7 for long queries, which is the case of the trec6 and trec8 corpora [26].
We set the parameters to their usually used default values.
5.2 Results and Discussion
We divide our experiments into three main categories: 1- experiments studying the effect of the value of α on the retrieval performance of RSV^σ (Eq. 16) and RSV^{σ+} (Eq. 18); 2- experiments using words, comparing the retrieval performance of RSV^σ, which corresponds to the basic definition of σ, with the performance of some state-of-the-art IR models; and 3- experiments using concepts, comparing the performance of RSV^σ with other IR models, and studying the effect of the extended definition of σ on the overall retrieval performance through the performance of RSV^{σ+}.
5.2.1 Exhaustivity vs. Specificity
In this category of experiments, we only use words as indexing terms. By varying the value of α in RSV^σ (Eq. 16) between α = 0.0 (only Specificity) and α = 1.0 (only Exhaustivity), it is possible to study the mutual impact between Exhaustivity and Specificity.
In Table 2, we apply RSV^σ (Eq. 16) to all corpora using words; (*) marks a significant improvement compared to α = 0.0 (only Specificity) and (†) marks a significant improvement compared to α = 1.0 (only Exhaustivity). Table 2 shows that, on the one hand, integrating the two components (Exhaustivity and Specificity) is better than depending on only one of them. On the other hand, there should be a balance between the two components, without dominance. However, the exact value of α is corpus-dependent. In general, α = 0.5, which is approximately the average of the best α values across corpora, gives good performance. In the rest of this paper, we fix α = 0.5 to give equal importance to Exhaustivity and Specificity.
5.2.2 Experiments Using Words
All experiments in this category are done using words as indexing terms. The main goal of this part of the experiments is to study the retrieval performance of RSV^σ (Eq. 16 with α = 0.5), and to compare it with: 1- the basic inner-product Vector Space Model without Exhaustivity and Specificity, RSV^{d·q} (Eq. 20), and 2- some classical IR models.
In Table 3, we apply RSV^σ (Eq. 16 with α = 0.5), RSV^{d·q} (Eq. 20), PIV, BM25, JM, and DIR to all corpora using words; (*) indicates that RSV^σ is significantly better. Table 3 shows the effect of integrating Exhaustivity and Specificity on the performance of RSV^σ. The performance of RSV^σ, using the basic version of the intermediate object σ, is always better than the performance of RSV^{d·q}, even though the improvement is sometimes small. This type of result shows that the indirect matching between d and q through an object σ gives better results than direct matching.
Table 3 also shows that the performance of RSV^σ is better than or equivalent to the performance of some classical IR models, like PIV, BM25, JM, and DIR. This shows that, by using Exhaustivity and Specificity, it is possible to build an effective, high-performance IR model.
Concerning trec6, the performance of our formula RSV^σ is slightly better than the best run of the TREC6 conference (the run 'anu6alol', where MAP = 0.2602) [25], and it is also better than the best result obtained by the DFR framework (the run 'I(ne)L2', where MAP = 0.2600) [5]. Concerning trec8, all runs of the TREC8 conference that obtain a better score than ours use some supplementary techniques, for example query expansion, pseudo-relevance feedback, POS tools, or fusion of several runs [24].
This category of experiments shows that it is possible to build a high-performance IR model by exploiting the two notions, Exhaustivity and Specificity, or in other words, through indirect matching between documents and queries.
5.2.3 Experiments Using Concepts
All experiments in this category are done using UMLS concepts as indexing terms. The main goal of this part of the experiments is, on the one hand, to study the retrieval performance of RSV^σ by repeating the experiments of the previous section using concepts instead of words, and, on the other hand, to study the effect of using the extended version σ+ of the object σ.
Here, we compare the retrieval performance of our formulae RSV^σ (Eq. 16 with α = 0.5) and RSV^{σ+} (Eq. 18 with α = 0.5) with the performance of some classical IR models.
Table 2: Exhaustivity vs. Specificity

α                   img09      img10      img11      img12      trec6      trec8
0.0 (Specificity)   0.3149     0.3117     0.1965     0.1207     0.2116     0.1934
0.1                 0.3340 †∗  0.3266 †∗  0.2113 †∗  0.1376 †∗  0.2295 †∗  0.2166 †∗
0.2                 0.3544 †∗  0.3352 †∗  0.2229 †∗  0.1477 †∗  0.2464 †∗  0.2339 †∗
0.3                 0.3653 †∗  0.3402 †∗  0.2284 †∗  0.1539 †∗  0.2584 †∗  0.2480 †∗
0.4                 0.3737 †∗  0.3102 †   0.2330 †∗  0.1580 †∗  0.2639 †∗  0.2609 †∗
0.5                 0.3789 †∗  0.3171 †   0.2317 †∗  0.1586 †∗  0.2661 †∗  0.2666 †∗
0.6                 0.3834 †∗  0.3221 †   0.2282 †   0.1539 †   0.2613 †∗  0.2631 †∗
0.7                 0.3846 †∗  0.3194 †   0.2221 †   0.1485 †   0.2357 †∗  0.2478 †∗
0.8                 0.3846 †∗  0.3144 †   0.2114 †   0.1463 †   0.1742 †   0.2112 †
0.9                 0.3731 †∗  0.3056 †   0.1922 †   0.1348 †   0.0617 †   0.1271 †
1.0 (Exhaustivity)  0.2300     0.1921     0.0749     0.0551     0.0120     0.0369

Table 3: Experiments Using Words

Formula     img09      img10      img11      img12      trec6      trec8
RSV^σ       0.3789     0.3171     0.2317     0.1586     0.2661     0.2666
RSV^{d·q}   0.3149*    0.3117     0.1965*    0.1207*    0.1677*    0.1674*
PIV         0.3664     0.2992     0.1546*    0.1027*    0.2076*    0.2302*
BM25        0.3726     0.2745*    0.1995*    0.1438     0.2238     0.2521
JM          0.3792     0.2994     0.1985*    0.1371     0.2532     0.2627
DIR         0.3353*    0.2960     0.1534*    0.1161*    0.2410     0.2514
In Table 4, we apply RSV^{σ+}, RSV^σ, RSV^{d·q}, PIV, BM25, JM, and DIR to the CLEF corpora using concepts, where (*) means RSV^{σ+} is significantly better and (†) means RSV^σ is significantly better. Table 4 shows that the performance of RSV^σ, using the basic version of σ, is better than or equivalent to the performance of RSV^{d·q} and of some classical IR models, like PIV, BM25, JM, and DIR. This again shows that, by using Exhaustivity and Specificity, it is possible to build an effective, high-performance IR model.
Comparing RSV^{σ+} with RSV^σ in Table 4 shows that using the extended object σ+ slightly improves the performance, except on the img10 corpus. The performance of RSV^{σ+} reported in Table 4 is for β = 7 in the semantic similarity function (Eq. 19). Here, we do not work on the optimization of the value of β; we only show the potential of our model. However, interestingly, the same value of β gives the best performance on all corpora. Hence, the value of β seems to depend on the resource containing the concepts and relations, UMLS in our case, and not on the corpora.
The performance of RSV^{σ+} is significantly better than the performance of the classical IR models, except on the img10 corpus. In addition, RSV^{σ+} gives slightly better performance than RSV^σ, except on the img10 corpus. This shows that integrating Exhaustivity and Specificity into an IR model, together with an appropriate choice of the object σ, can lead to an important gain in retrieval performance.
Comparing the results obtained using words (Table 3) with those obtained using concepts (Table 4) shows that the improvement is more significant when using words. That happens because natural languages are ambiguous; therefore, mapping tools like MetaMap map each noun phrase to several candidate concepts. In this case, σ does not correspond to what d and q semantically share, because there is noise in the text-to-concept mapping process. In our opinion, the idea presented in [1] somehow justifies the results obtained in this study.
Table 4: Experiments Using Concepts

Formula     img09      img10      img11      img12
RSV^{σ+}    0.3620     0.3247     0.1868     0.1427
RSV^σ       0.3489     0.3448     0.1822     0.1322*
RSV^{d·q}   0.3195*    0.3340     0.1783     0.1249*
PIV         0.2626*†   0.2530     0.1096*†   0.0934*†
BM25        0.2672*†   0.2127*†   0.1552*    0.1034*
JM          0.3058*    0.2451†    0.1580*†   0.1022*
DIR         0.2675*†   0.2455     0.1228*†   0.0861*†
In general, the experiments show that: 1- both Exhaustivity and Specificity are important for building a well-performing IR model, and there should be a balance between the two components; 2- using two types of terms (words and concepts) gives more credibility to our results, and shows that even though using words generally gives better results, using concepts gives more flexibility to explore more versions of the object σ; and 3- the performance of our model (RSV^σ and RSV^{σ+}) shows, in most cases, a statistically significant improvement over the performance of classical IR models.
6. CONCLUSION
We study, in this paper, two IR notions: Exhaustivity and
Specificity. In order to present an implementable version
of Exhaustivity and Specificity, we use Propositional Logic
and Lattice theory. We also exploit the potential relation
between them.
First, we show how to translate a special kind of logical
sentences, namely clauses, to subsets of atomic propositions
(Eq.3). Second, the set of subsets forms a distributive and
complemented lattice. Third, we show that the material
implication between two clauses is equivalent to the partial
order relation defined on the previous lattice (Eq.6). Fourth,
we exploit the degree of inclusion measure (Eq.7) in order
to quantify the partial order relation. Fifth, we present a
new definition of the uncertain implications (Eq.9). Finally,
we redefine, in view of our previous modelling, the basic IR
notions and build a new implementable IR model capable of
integrating Exhaustivity and Specificity.
In this paper, we see IR as a process of computing the
similarity between d and q, and also as a process of estimating the degree of alignment between d and q (Exhaustivity
and Specificity). In other words, IR should not only compute similarity but also dissimilarity between d and q. We
present an implementable approach to integrate Exhaustivity and Specificity into a concrete IR model.
The experiments are carried out on six corpora, four with short queries and documents and two with long ones. We use two types of terms: words and UMLS concepts. We also present two different definitions of the object m_d ∩ m_q: the basic definition σ and the extended definition σ+. The experimental results show that the retrieval performance can be strongly improved by explicitly taking Exhaustivity and Specificity into account. Moreover, even the simple formulae RSV^σ and RSV^{σ+} can perform better than some state-of-the-art models.
Experiments also show that there should be a type of balance between the two components (Exhaustivity and Specificity). However, the relative weight of Exhaustivity or Specificity is corpus-dependent.
This study could be developed in different directions in the future. We presented a general IR framework based on comparing d and q through an intermediate object instead of comparing them directly. Therefore, this study could, on the one hand, be developed by building new and more informative versions of the object σ. On the other hand, it could be developed by instantiating the new framework with other concrete IR models, e.g. Language Models or Probabilistic Models.
7. REFERENCES
[1] K. Abdulahhad, J.-P. Chevallet, and C. Berrut.
MRIM at ImageCLEF2012. From Words to Concepts:
A New Counting Approach. In CLEF, 2012.
[2] K. Abdulahhad, J.-P. Chevallet, and C. Berrut. The Effective Relevance Link between a Document and a Query. In DEXA, Part I, pages 206–218, Vienna, Austria, Sept. 2012.
[3] K. Abdulahhad, J.-P. Chevallet, and C. Berrut. A New
Lattice-Based Information Retrieval Theory. Research
Report RR-LIG-038, LIG, Grenoble, France, 2013.
[4] K. Abdulahhad, J.-P. Chevallet, and C. Berrut. Is
uncertain logical-matching equivalent to conditional
probability? SIGIR ’13, 2013.
[5] G. Amati and C. J. Van Rijsbergen. Probabilistic
Models of Information Retrieval Based on Measuring
the Divergence from Randomness. ACM Trans. Inf.
Syst., 20(4):357–389, Oct. 2002.
[6] A. R. Aronson. Metamap: Mapping Text to the
UMLS Metathesaurus, 2006.
[7] J.-P. Chevallet, J.-H. Lim, and D. T. H. Le. Domain
Knowledge Conceptual Inter-Media Indexing:
Application to Multilingual Multimedia Medical
Reports. CIKM ’07, pages 495–504, Lisbon, Portugal,
2007.
[8] Y. Chiaramella and J. P. Chevallet. About Retrieval
Models and Logic. Comput. J., 35:233–242, June 1992.
[9] Y. Chiaramella, P. Mulhem, and F. Fourel. A Model
for Multimedia Information Retrieval. Technical
report, 1996.
[10] F. Crestani. Exploiting the Similarity of Non-Matching Terms at Retrieval Time. Inf. Retr., 2(1):27–47, 2000.
[11] H. Fang, T. Tao, and C. Zhai. A Formal Study of
Information Retrieval Heuristics. SIGIR ’04, pages
49–56, Sheffield, United Kingdom, 2004.
[12] K. S. Jones. A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation, 28:11–21, 1972.
[13] K. H. Knuth. Deriving Laws from Ordering Relations. In Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Jackson Hole, WY, pages 204–235, 2003.
[14] K. H. Knuth. Lattice Duality: The Origin of
Probability and Entropy. Neurocomput., 67:245–274,
Aug. 2005.
[15] D. E. Losada and A. Barreiro. A Logical Model for
Information Retrieval Based on Propositional Logic
and Belief Revision. The Computer Journal,
44:410–424, 2001.
[16] J. Nie. An Outline of a General Model for Information
Retrieval Systems. SIGIR ’88, pages 495–506,
Grenoble, France, 1988.
[17] J. M. Ponte and W. B. Croft. A Language Modeling
Approach to Information Retrieval. SIGIR ’98, pages
275–281, Melbourne, Australia, 1998.
[18] S. E. Robertson. The Probability Ranking Principle in
IR. pages 281–286. Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA, 1997.
[19] S. E. Robertson and S. Walker. Some Simple Effective
Approximations to the 2-Poisson Model for
Probabilistic Weighted Retrieval. SIGIR ’94, pages
232–241, Dublin, Ireland, 1994.
[20] G. Salton, A. Wong, and C. S. Yang. A Vector Space
Model for Automatic Indexing. Communications of
the ACM, (18):613–620, 1975.
[21] A. Singhal, C. Buckley, and M. Mitra. Pivoted
Document Length Normalization. SIGIR ’96, pages
21–29, Zurich, Switzerland, 1996.
[22] M. D. Smucker, J. Allan, and B. Carterette. A
Comparison of Statistical Significance Tests for
Information Retrieval Evaluation. CIKM ’07, pages
623–632, Lisbon, Portugal, 2007.
[23] C. J. van Rijsbergen. A Non-Classical Logic for
Information Retrieval. Comput. J., 29(6):481–485,
1986.
[24] E. M. Voorhees and D. Harman. Overview of the
Eighth Text REtrieval Conference (TREC-8). In
TREC, 1999.
[25] E. M. Voorhees and D. Harman. Overview of the
Sixth Text REtrieval Conference (TREC-6). Inf.
Process. Manage., 36(1):3–35, Jan. 2000.
[26] C. Zhai and J. Lafferty. A Study of Smoothing
Methods for Language Models Applied to Ad Hoc
Information Retrieval. SIGIR ’01, pages 334–342, New
Orleans, Louisiana, United States, 2001.