Revisiting Exhaustivity and Specificity Using Propositional Logic and Lattice Theory

Karam Abdulahhad, Jean-Pierre Chevallet, Catherine Berrut
Université de Grenoble (UPMF-Grenoble 2, UJF-Grenoble 1), LIG laboratory, MRIM group, Grenoble, France

ABSTRACT
Exhaustivity and Specificity in the logical Information Retrieval framework were introduced by Nie [16]. However, despite several attempts, they remain theoretical notions without a clear idea of how they should be implemented. In this study, we present a new approach to deal with them. We use propositional logic and lattice theory in order to redefine the two implications and their uncertainty, P(d → q) and P(q → d). We also show how to integrate the two notions into a concrete IR model in order to build a new, effective model. Our proposal is validated on six corpora and using two types of terms (words and concepts). The experimental results support our viewpoint, which states that the explicit integration of Exhaustivity and Specificity into IR models improves their retrieval performance, and that there should be a balance between the two notions.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords
Information retrieval, Exhaustivity, Specificity

1. INTRODUCTION
Many studies in the Information Retrieval (IR) field argue that the retrieval process can be represented as a logical implication directed from a document d to a query q. Van Rijsbergen [23] proposes that if d and q are sets of logical sentences in a specific logic, then d should be retrieved iff it implies q, noted d → q. However, the IR process is uncertain [8]. Therefore, a measure P for estimating the certainty of the implication d → q, and of course for ranking, is needed. In this way, the IR process can be modeled as a two-step process:

1. Retrieval: extracting candidate documents from a corpus by recognizing the reason or reasons that make a document d a candidate answer to a query q.

2. Ranking: estimating the Relevance Score Value (RSV) between d and q, or equivalently, estimating the strength of the reasons that make d a candidate answer to q.

Nie [16] introduced two notions: Exhaustivity and Specificity (ES). Exhaustivity d → q means that q is deducible from d, or that there is a deductive path from d to q. In other words, Nie presented d as a possible world (expressed using a modal logic); then either q is directly true in d, or there is a series of changes that should be applied to d to make q true. Specificity q → d means that d is deducible from q, or that there is a deductive path from q to d. Exhaustivity is recall-oriented whereas Specificity is precision-oriented [15].
Exhaustivity and Specificity describe, in a subjective manner, the nature of the retrieval process. We think that if the two notions are objectively described and integrated into IR models, the gain in effectiveness can be appreciable. In this work, we show how both notions can be theoretically redefined using Propositional Logic and Lattice theory and integrated into an IR model, and we then translate this theoretical description into a practical one.

The paper is organized as follows. In Section 2, we review some studies about Exhaustivity and Specificity. In Section 3, we redefine Exhaustivity and Specificity using Propositional Logic and Lattice theory, and we theoretically integrate them into IR models. We show in Section 4 how to implement this theoretical framework in a concrete IR model. In Section 5, we describe our experiments and discuss the results. We conclude in Section 6.

2. ES IN IR LITERATURE
Exhaustivity and Specificity (ES) were introduced early by Spärck Jones [12]. She talked about document exhaustivity, which refers to the number of terms a document contains, and term specificity, which refers to the number of documents a term belongs to. According to [12], Exhaustivity and Specificity are statistical notions. In this study, we are more concerned with the definition of Exhaustivity and Specificity introduced by Nie [16]. According to Nie, Exhaustivity says how many elements of q are mentioned in d, whereas Specificity says at what level of detail the elements of q are mentioned in d. Nie claims that the retrieval process is not only d → q but also q → d.

However, IR is an uncertain process [8], because:

• q is an imperfect representation of the user's needs;
• d is also an imperfect representation of the content of the document;
• the relevance judgement depends on external factors, like the user's background knowledge (it is user dependent).

Therefore, another component, besides the logic, should be added to estimate the certainty of an implication and to offer a ranking mechanism. In other words, a measure P should exist and be capable of measuring the certainty of the logical implication between d and q, noted P(d → q). The final RSV(d, q), according to Nie [16], becomes:

RSV(d, q) = F[P(d → q), P(q → d)]   (1)

where P(d → q) = 1 if d → q is true, or 0 ≤ P(d → q) < 1 otherwise, and F is an arbitrary function for combining two numerical values.

Crestani [10] proposes the existence of a function Sim capable of estimating the semantic similarity between any two terms. He extends the retrieval function as follows:

• Query viewpoint: starting from the query and comparing it to the document.

RSV_max(q.d)(d, q) = Σ_{t∈q} Sim(t, t*) × w_d(t*) × w_q(t)

where t* ∈ T is the document term that gives the maximum similarity with t, w_d(t*) is the weight of t* in d, and w_q(t) is the weight of t in q.

• Document viewpoint: starting from the document and comparing it to the query.

RSV_max(d.q)(d, q) = Σ_{t∈d} Sim(t, t*) × w_d(t) × w_q(t*)

Crestani [10] tries to integrate Exhaustivity and Specificity in one IR model, where q.d refers to Exhaustivity and d.q refers to Specificity. Losada and Barreiro [15] also use the two notions in order to build a more precise retrieval model. They exploit propositional logic, its truth-value interpretations, and the belief revision technique to define the two notions and build an implementable retrieval function.
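To make the asymmetry between Crestani's two viewpoints concrete, the following sketch evaluates both sums on toy data. The term names, weights, and the similarity table are invented for the illustration and are not taken from [10]; a real system would use corpus statistics and a thesaurus to define Sim.

```python
# Illustrative sketch of Crestani's two asymmetric retrieval functions [10].
# All term names, weights, and similarities below are toy values, not values from the paper.

sim_table = {("cat", "feline"): 0.8, ("kitten", "feline"): 0.9}  # hypothetical Sim(t, t')

def sim(t1, t2):
    if t1 == t2:
        return 1.0
    return max(sim_table.get((t1, t2), 0.0), sim_table.get((t2, t1), 0.0))

def best_match(t, terms):
    """The term t* of `terms` with the maximum similarity to t."""
    return max(terms, key=lambda u: sim(t, u))

def rsv_query_viewpoint(wd, wq):
    """RSV_max(q.d): iterate over query terms (the Exhaustivity side)."""
    score = 0.0
    for t in wq:
        t_star = best_match(t, wd)
        score += sim(t, t_star) * wd[t_star] * wq[t]
    return score

def rsv_document_viewpoint(wd, wq):
    """RSV_max(d.q): iterate over document terms (the Specificity side)."""
    score = 0.0
    for t in wd:
        t_star = best_match(t, wq)
        score += sim(t, t_star) * wd[t] * wq[t_star]
    return score

wd = {"feline": 0.5}                     # toy document term weights
wq = {"cat": 0.5, "kitten": 0.5}         # toy query term weights
print(rsv_query_viewpoint(wd, wq))       # 0.425: both query terms match 'feline'
print(rsv_document_viewpoint(wd, wq))    # 0.225: 'feline' matches only its best query term
```

The two scores differ on the same document/query pair, which is exactly the asymmetry the two viewpoints are meant to capture.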
Exhaustivity and Specificity are also exploited in the structured-document and passage retrieval fields, where IR systems try to find a document d that answers a query q (Exhaustivity), and within d they try to find the most appropriate or precise component or passage (Specificity) [9].

As we said, according to Nie [16], the relation between d and q is not symmetric: d → q is different from q → d. However, some IR models represent d and q in the same formalism. This is the case in the Vector Space Model (VSM) [20], where it is assumed that both d and q are vectors in the same term space and the relevance value is the similarity between them. In that case, RSV is computed using a symmetric function (e.g. cosine). In some other IR models, the RSV computation is not symmetric. This is the case in Probabilistic Models (PM) [18] and Language Models (LM) [17]. PMs compute the probability that a certain document is relevant to a certain query, whereas they do not compute the probability that a query is the appropriate question for retrieving a document. In other words, PMs talk about P(R|d, q), but they do not actually talk about P(A|d, q), where A is a binary random variable: A = True if q is an appropriate question for d, and A = False otherwise. LMs compute the ability of a document to reproduce a query, P(q|d), but they rarely compute the ability of a query to reproduce a document, P(d|q).

The two notions, Exhaustivity and Specificity, have thus been studied in the state of the art, but mostly from a theoretical point of view. We propose to go deeper into this analysis by revisiting them from a theoretical point of view in a logical framework, and then proposing a practical implementation.

3. REVISITING ES MODEL
In this study, we basically depend on Propositional Logic (PL) and Lattice theory. We also assume that the logical implication d → q is actually the material implication [3, 4]. We start this section with a mathematical introduction to PL, lattices, and the potential relation between them.

3.1 Mathematical Foundations

3.1.1 Propositional Logic (PL)
We define a set of atomic propositions A = {a1, ..., an}. The set A forms the alphabet that is used to build logical sentences. The set A is finite, |A| = n, and any proposition ai ∈ A can have one of two values: True (T) or False (F). In PL, a semantics is given to a logical sentence s by assigning a truth value (T or F) to each atomic proposition in s. In this manner, each logical sentence s has several interpretations Is, depending on the truth values of its propositions. The subset of interpretations Ms ⊆ Is that make s true is called the set of models of s, noted Ms |= s. Moreover, |= s means that s is a tautology, i.e. it is true in all interpretations. According to this formalism, Ms is a set of models, and each model m ∈ Ms can be defined as the subset of atomic propositions that are interpreted as T [15]. For example, assume A = {a, b, c} and s = a ∧ ¬b; then Ms = {m1, m2} where m1 = {a} and m2 = {a, c}. Note that propositions that do not occur in the sentence (e.g. c) can be T or F.

Assume s1 and s2 are logical sentences. By definition, the material implication s1 → s2 is equivalent to model inclusion:

[|= s1 → s2] ⇔ [Ms1 ⊆ Ms2]   (2)

In this study, we consider a special kind of logical sentences, namely clauses. Each clause is a conjunction of non-negated atomic propositions, e.g. a, a ∧ b, a ∧ c, etc. Each clause s has, like any logical sentence, a set of models Ms.
However, depending on the set of models Ms, it is possible to define one model ms:

ms = ⋂_{m ∈ Ms} m   (3)

where ms only contains the propositions that occur in the clause s. For example, assume A = {a, b, c} and the clause s = a ∧ b; then Ms = {m1, m2} where m1 = {a, b} and m2 = {a, b, c}. According to the previous definition, ms = m1 ∩ m2 = {a, b}. Assume s1 and s2 are two clauses; then, according to (Eqs. 2 & 3), we have:

[|= s1 → s2] ⇔ [ms2 ⊆ ms1]   (4)

3.1.2 Lattice Theory
The algebraic structure B = (2^A, ∩, ∪, \, ⊤, ⊥) is a Boolean algebra, or equivalently, a distributive and complemented lattice, where: 1) 2^A is the power set of A; 2) the meet operation is the set intersection (∩); 3) the join operation is the set union (∪); 4) the complement operation is the set difference (\); 5) the top element is ⊤ = A; 6) the bottom element is ⊥ = ∅. The ordering relation ≤ defined on B is:

∀x, y ∈ 2^A, [x ≤ y] ⇔ [x ⊆ y]   (5)

We previously showed that any logical clause corresponds to a subset of propositions of A (Eq. 3). Therefore, any logical clause corresponds to one node of the lattice B. From (Eqs. 4 & 5), we conclude that for any two clauses s1 and s2:

[|= s1 → s2] ⇔ [ms2 ≤ ms1]   (6)

3.1.3 Degree of Inclusion
In any lattice (L, ∧L, ∨L) and for any two elements x, y ∈ L, even when x does not include y, it is possible to describe the degree to which x includes y. Knuth [13, 14] generalizes inclusion to a degree of inclusion represented by real numbers. He introduced the z function:

∀x, y ∈ L, z(x, y) = { 1 if x ≥ y; 0 if x ∧L y = ⊥; z, with 0 < z < 1, otherwise }   (7)

where z(x, y) quantifies the degree to which x includes y. In the extreme case where (L, ∧L, ∨L, ¬L, ⊤, ⊥) is a Boolean algebra and z is consistent with all properties of the Boolean algebra structure, z is the conditional probability [14]:

∀x, y ∈ L, z(x, y) = P(x|y)   (8)

In general, the z function is not necessarily consistent with all these properties, and then z can be replaced by a more relaxed function than probability [3].

Back to our logical framework, the degree of certainty of a material implication can be replaced by the degree of inclusion, or equivalently the z function (Eqs. 6 & 7). In other words, assume s1 and s2 are two clauses. We know that s1 and s2 correspond to two subsets of propositions, ms1 and ms2 (Eq. 3), respectively. We also know that for the material implication s1 → s2 to be true, ms2 ≤ ms1 must be satisfied (Eq. 6). Hence, the certainty of the implication P(s1 → s2) can be replaced by the degree of inclusion z(ms1, ms2):

P(s1 → s2) = z(ms1, ms2)   (9)

This is consistent because if s1 → s2 is true then P(s1 → s2) = 1 (Eq. 1) and ms2 ≤ ms1 (Eq. 6); in addition, if ms2 ≤ ms1 then z(ms1, ms2) = 1 (Eq. 7).
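As a quick illustration of Eqs. 3-6, the following sketch represents a clause directly by the set ms of its (non-negated) propositions and tests the material implication between two clauses through set inclusion. The clause syntax and the toy examples are ours, introduced only for illustration.

```python
# Sketch: a conjunctive clause as the set of its atomic propositions (Eq. 3),
# and material implication tested as set inclusion (Eq. 4) / lattice order (Eq. 6).

def clause_to_set(clause: str) -> frozenset:
    """Map a clause written as 'a & b & c' to m_s = {a, b, c}."""
    return frozenset(p.strip() for p in clause.split("&"))

def implies(s1: str, s2: str) -> bool:
    """|= s1 -> s2  iff  m_s2 is a subset of m_s1 (Eq. 4)."""
    return clause_to_set(s2) <= clause_to_set(s1)

print(implies("a & b & c", "a & b"))   # True:  a ∧ b ∧ c entails a ∧ b
print(implies("a & b", "a & b & c"))   # False: the converse does not hold
```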
3.2 IR Notions
In this section, we redefine, in view of the previous mathematical foundations, the basic IR notions: document, query, retrieval decision, uncertainty, Exhaustivity, and Specificity. The alphabet A = {a1, ..., an} is the set of all terms, where each term ai ∈ A is an atomic proposition. We say that a term ai is true with respect to a document d if it appears in d, and false otherwise.

3.2.1 Documents & Queries
A document d is a logical clause, or equivalently, a conjunction of its terms. Queries are represented in the same way. Representing a document d as a conjunction of its terms means that, in any model of d, the terms that appear in d must be true and the other terms can be true or false. This is a relaxed version of the underlying hypothesis in most classical IR models, where the terms that do not appear in d are implicitly false. Our representation, like most classical IR models, is not capable of representing negative terms; e.g. ¬ai, where ai ∈ A, is not a valid clause. Since a document d is a logical clause, it corresponds to a set of models Md and to a subset of terms md, where md = ⋂_{m ∈ Md} m (Eq. 3). More precisely, md is the set of terms that appear in d.

3.2.2 Retrieval Decision & Uncertainty
As we mentioned, the retrieval decision can be modeled as a logical implication d → q between a document d and a query q. IR is intrinsically an uncertain process; therefore, representing the retrieval decision by d → q alone is quite limited, because d → q is a binary decision. We thus need to measure the degree of implication, namely P(d → q). Hence, we first choose the material implication to represent the retrieval decision [3, 4], and then, according to (Eq. 9), it is possible to estimate P(d → q) through z(md, mq):

P(d → q) = z(md, mq)   (10)

3.2.3 Exhaustivity & Specificity & RSV(d, q)
Nie [16] differentiates between d → q (Exhaustivity) and q → d (Specificity). According to (Eq. 10), P(d → q) = z(md, mq) and P(q → d) = z(mq, md). We said that if the z function is consistent with all properties of the Boolean algebra structure, then it is equivalent to the conditional probability (Eq. 8). We also said that this is not a mandatory constraint. If the z function is only consistent with the join associativity property, x ∨L (y ∨L t) = (x ∨L y) ∨L t [14], then it must satisfy the following rule:

z(x ∧L y, t) = z(x, t) + z(y, t) − z(x ∨L y, t)   (11)

Depending on (Eq. 11), we have:

• z(md ∩ mq, mq) = z(md, mq) + z(mq, mq) − z(md ∪ mq, mq). We know from (Eq. 7) that z(mq, mq) = 1 because mq ≥ mq, and z(md ∪ mq, mq) = 1 because md ∪ mq ≥ mq. Therefore, Exhaustivity: P(d → q) = z(md, mq) = z(md ∩ mq, mq).

• z(md ∩ mq, md) = z(md, md) + z(mq, md) − z(md ∪ mq, md). We know from (Eq. 7) that z(md, md) = 1 because md ≥ md, and z(md ∪ mq, md) = 1 because md ∪ mq ≥ md. Therefore, Specificity: P(q → d) = z(mq, md) = z(md ∩ mq, md).

These conclusions are compatible with the intuition about Exhaustivity and Specificity presented in [2], according to which Exhaustivity or Specificity can be estimated by comparing the coordination level d ∩ q with the query q or with the document d, respectively.

Finally, for computing RSV(d, q), it is now possible to replace (Eq. 1) by:

RSV(d, q) = F[z(md ∩ mq, mq), z(md ∩ mq, md)]

To simplify the notation, we replace md ∩ mq by σ in the rest of this paper:

RSV(d, q) = F[z(σ, mq), z(σ, md)]   (12)
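The z function in Eq. 12 is only instantiated in Section 4. Still, a minimal set-based sketch helps fix the intuition: assuming, purely for illustration, a uniform overlap-ratio choice for z (which respects the boundary cases of Eq. 7 but is not the weighting adopted later), Exhaustivity and Specificity reduce to the recall-like and precision-like coordination-level ratios of [2].

```python
# Illustration only: a uniform, set-overlap instantiation of z for Eq. 12.
# The concrete model of Section 4 replaces these ratios with weighted inner products.

def rsv(md: set, mq: set) -> float:
    sigma = md & mq                          # the shared object sigma = m_d ∩ m_q
    if not sigma:
        return 0.0
    exhaustivity = len(sigma) / len(mq)      # z(sigma, m_q): how much of q is covered by d
    specificity = len(sigma) / len(md)       # z(sigma, m_d): how focused d is on q
    return exhaustivity * specificity        # F chosen as the product, as in Section 4

d = {"lattice", "logic", "retrieval", "ranking"}
q = {"logic", "retrieval"}
print(rsv(d, q))   # exhaustivity = 1.0, specificity = 0.5 -> RSV = 0.5
```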
4. FROM THEORETICAL TO CONCRETE IR MODEL
We project our logical modeling onto a vector space IR framework, and we study the effect of integrating Exhaustivity and Specificity into a concrete IR model on the retrieval performance of that model. As we said, d, q, and σ are all of the same nature. Moreover, according to our model, md contains the terms occurring in d and mq contains the terms occurring in q, whereas σ represents what d and q share. Hence, there are several ways to express σ. The most direct way is to build σ from the terms that occur in both d and q. A more natural way is to build σ from the terms of d (or q) that either occur in q (or d) or have a semantic relation with some terms of q (or d). We assume here that terms within d and q are independent.

In order to take a further step from our logical framework to the vector space framework, we redefine the basic IR notions:

• The term space A: each proposition or term ai ∈ A corresponds to a dimension of the term space, |A| = n.

• Each document d is a vector in the space A. Each term ai ∈ A corresponds to a component w_{ai}^d of the document vector. Terms that occur in d have a weight w_{ai}^d > 0, whereas w_{ai}^d = 0 for the others.

• The query q is a vector in the space A. Each term ai ∈ A corresponds to a component w_{ai}^q of the query vector. Terms that occur in q have a weight w_{ai}^q > 0, whereas w_{ai}^q = 0 for the others.

• The object σ is also a vector in the space A. As we said, it is possible to build different versions of σ. In this study, we build two versions, σ and σ+, as follows:

1- Basic object σ: it is represented using the terms that occur in both d and q. Each term ai ∈ A corresponds to a component w_{ai}^σ, where w_{ai}^σ > 0 for terms that occur in both d and q, and w_{ai}^σ = 0 for the others.

2- Extended object σ+: it is represented using the terms of q that either occur in d or have a semantic relation with some terms of d. Each term ai ∈ A corresponds to a component w_{ai}^{σ+}, where w_{ai}^{σ+} > 0 for the terms of q that either occur in d or have a semantic relation with some terms of d, and w_{ai}^{σ+} = 0 for the others.

Here, we need a function Sim for computing the semantic similarity between any two terms a1, a2 ∈ A:

Sim(a1, a2) = { 1 if a1 = a2; 0 if there is no semantic relation; u, with 0 < u < 1, otherwise }   (13)

We present the exact definition of the similarity function and of the semantic relation between terms later, because they depend on the type of terms used for indexing documents and queries.

Two functions of (Eq. 12), F and z, remain to be determined. The function z now becomes a function for comparing two vectors, but it must be consistent with the join associativity property of the Boolean algebra structure. Abdulahhad et al. [3] show that the inner product satisfies this condition; we thus choose the inner product to replace the z function in (Eq. 12). The function F is a function for combining two numerical values; in this study we choose the product (×). Finally, in view of these choices, the retrieval formula (Eq. 12) can be developed into two concrete retrieval formulae, depending on which version of the object σ is used.

4.1 Basic Object (σ)
According to the choices of the two functions z and F, (Eq. 12) can be rewritten as follows:

RSV^σ(d, q) = (σ · q) × (σ · d)   (14)

where "·" denotes the inner product of the corresponding vectors. Applying the log to (Eq. 14) does not change the ranking, but it transforms the product into a sum and allows us to study the mutual impact of Exhaustivity and Specificity. Therefore, (Eq. 14) can be rewritten as follows:

RSV^σ(d, q) ∝ α × log(σ · q) + (1 − α) × log(σ · d)   (15)

where α ∈ [0, 1] is a tuning parameter introduced to study the mutual impact of Exhaustivity and Specificity. The retrieval formula thus becomes:

RSV^σ(d, q) ∝ α × log Σ_{a∈A} (w_a^σ × w_a^q) + (1 − α) × log Σ_{a∈A} (w_a^σ × w_a^d)   (16)

In IR, documents are ranked with respect to a query and not the inverse; we thus choose to weight the terms of the object σ using the query: w_a^σ = c(a, q) for terms a ∈ A that occur in both d and q, and 0 otherwise.
Query terms are considered equally important: w_a^q = 1/|q| for terms a ∈ A that occur in q, and 0 otherwise. Here, c(a, q) is the count of the term a in the query q and |q| is the query length. Actually, the only component that still needs to be clarified is w_a^d. Fang et al. [11] reviewed, in the form of constraints, the weighting heuristics that weighting functions should respect in order to be effective. In this paper, we use (Eq. 17), which respects the first four constraints (TFC1, TFC2, TDC, and LNC1) of Fang et al. [11]:

w_a^d = [c(a, d) / (c(a, d) + |d|/avdl)] × [N / df(a)]   (17)

where c(a, d) is the count of a in d, |d| the document length, avdl the average document length in the corpus, N the total number of documents in the corpus, and df(a) the number of documents in the corpus that contain a. If a does not occur in d, then w_a^d = 0.

Equation (16) consists of two components. One is monotonically increasing w.r.t. the number of shared terms between d and q, but monotonically decreasing w.r.t. the query length. The other is also monotonically increasing w.r.t. the number of shared terms, but monotonically decreasing w.r.t. the document length.

4.2 Extended Object (σ+)
The retrieval formula of the extended object σ+ can be developed in the same way as that of the basic object σ (Eq. 16). The main difference is that the terms of the extended object σ+ do not necessarily occur in the document: some terms of σ+ do not occur in d, but they have a semantic relation with some terms of d. Therefore, when using the extended object σ+, the retrieval formula becomes (Eq. 18) instead of (Eq. 16):

RSV^{σ+}(d, q) ∝ α × log Σ_{a∈A} (w_a^{σ+} × w_a^q) + (1 − α) × log Σ_{a∈A} (w_a^{σ+} × Sim(a, a′) × w_{a′}^d)   (18)

where w_a^{σ+} = c(a, q) for the terms of q that either occur in d or have a semantic relation with some terms of d, and 0 otherwise; w_a^q = 1/|q| for terms that occur in q, and 0 otherwise. In addition, a′ is the document term that has a semantic relation with a and the maximum semantic similarity value with a according to the function Sim (Eq. 13). Equations (16 & 18) are the main retrieval formulae, and they are the formulae used in our experiments.
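To make the basic-object formula concrete, the sketch below puts Eqs. 16 and 17 together over a tiny in-memory corpus, following the weighting choices stated above (w_a^σ = c(a, q), w_a^q = 1/|q|, and Eq. 17 for w_a^d). The toy documents, the whitespace tokenization, and the handling of documents that share no term with the query are our own simplifications, not part of the model.

```python
import math
from collections import Counter

# Sketch of RSV^sigma (Eq. 16) with the weights of Section 4.1:
#   w^sigma_a = c(a, q) for terms shared by d and q,
#   w^q_a     = 1 / |q|,
#   w^d_a     = c(a,d) / (c(a,d) + |d|/avdl) * N / df(a)   (Eq. 17).

docs = {
    "d1": "logic lattice retrieval lattice".split(),
    "d2": "retrieval ranking evaluation".split(),
}
N = len(docs)
avdl = sum(len(d) for d in docs.values()) / N
df = Counter(t for d in docs.values() for t in set(d))

def w_d(a, doc):
    c = doc.count(a)
    if c == 0:
        return 0.0
    return c / (c + len(doc) / avdl) * N / df[a]

def rsv_sigma(doc, query, alpha=0.5):
    cq = Counter(query)
    shared = set(doc) & set(cq)
    if not shared:
        return float("-inf")                               # no shared term: not scored
    exh = sum(cq[a] * (1 / len(query)) for a in shared)    # sigma . q  (Exhaustivity)
    spe = sum(cq[a] * w_d(a, doc) for a in shared)         # sigma . d  (Specificity)
    return alpha * math.log(exh) + (1 - alpha) * math.log(spe)

q = "lattice retrieval".split()
ranking = sorted(docs, key=lambda name: rsv_sigma(docs[name], q), reverse=True)
print(ranking)   # d1 first: it covers all of q and remains focused on it
```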
4.3 Terms Definition
Of course, we use words, as the usual type of terms, to represent the content of documents and queries. However, it is also possible to use a more informative type of terms, namely concepts [7]. Normally, concepts are part of an external resource or knowledge base, and they are connected by a variety of semantic relations. Concepts are not part of documents and queries, so we need a tool to map the text of documents and queries to concepts. UMLS (Unified Medical Language System, www.nlm.nih.gov) is an example of an external resource containing concepts, and MetaMap (metamap.nlm.nih.gov) [6] is an example of a tool that maps text to UMLS concepts. On the one hand, using concepts instead of words raises some problems, because natural languages are ambiguous and mapping tools are thus not very precise. Abdulahhad et al. [1] discuss one of the potential problems that can appear when moving from words to concepts. On the other hand, we said that the object σ refers to what d and q semantically share. Therefore, using concepts makes the definition of σ more flexible, through exploiting the semantic relations between concepts; hence, it is possible to go beyond the simple intersection-based definition of σ. Concerning the object σ, we use the basic definition σ with the two types of terms, words and concepts, whereas the extended definition σ+ is only used with concepts.

The semantic relation used to build the extended object σ+ is the 'isa' relation (Hyponymy / Hypernymy) between concepts. This relation is defined on the concepts of UMLS. The isa relation is:

• transitive: for any three concepts c1, c2, and c3, if c1 isa→ c2 and c2 isa→ c3, then c1 isa→ c3;

• directed: c1 isa→ c2 is different from c2 isa→ c1.

Therefore, it is possible to define the length, or number of edges, l_{ci.cj} of the isa-path between two concepts ci and cj, directed from ci to cj, in a resource like UMLS in our case. The extended object σ+ will contain the concepts cq of q that either occur in d or for which there exists at least one concept cd in d such that cd isa→ cq. We choose the isa relation directed from the concepts of d to the concepts of q, i.e. a query about 'animal' will be satisfied by a document about 'dog' but not the inverse.

There are several possible choices for the semantic similarity function (Eq. 13) between two concepts. The main constraint is that it must be monotonically decreasing w.r.t. the length of the isa-path between the two concepts. One possible definition is the exponential function:

Sim(ci, cj) = { 1 if ci = cj; 0 if there is no semantic relation; e^{−β × l_{ci.cj}} if ci isa→ cj }   (19)

where β ∈ R+ is a tuning parameter. We use this definition of the semantic similarity function in our experiments.
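As an illustration of Eq. 19, the sketch below computes the exponential similarity over a toy directed 'isa' hierarchy. The concept names and edges are invented, whereas the real model measures isa-path lengths in UMLS; β defaults here to 7, the value reported in Section 5.2.3.

```python
import math
from collections import deque

# Sketch of the exponential similarity of Eq. 19 over a toy directed 'isa' hierarchy.
# The concepts and edges below are invented; the real model walks UMLS relations.

isa = {                      # child -> parents (directed isa edges)
    "dog": ["canine"],
    "canine": ["mammal"],
    "mammal": ["animal"],
}

def isa_path_length(ci, cj):
    """Number of edges on the shortest directed isa-path ci -> cj, or None if absent."""
    queue, seen = deque([(ci, 0)]), {ci}
    while queue:
        node, dist = queue.popleft()
        if node == cj:
            return dist
        for parent in isa.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    return None

def sim(ci, cj, beta=7.0):
    if ci == cj:
        return 1.0
    length = isa_path_length(ci, cj)
    return 0.0 if length is None else math.exp(-beta * length)

print(sim("dog", "canine"))   # one isa edge: e^(-beta)
print(sim("animal", "dog"))   # 0.0: the relation is directed, so the inverse path counts for nothing
```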
5. EXPERIMENTS
The main goal of our experiments is to show, on the one hand, the validity of our hypothesis that the retrieval performance of IR models can be improved by integrating Exhaustivity and Specificity, and on the other hand, that even with simple retrieval formulae (Eqs. 16 & 18) an IR model can perform as well as or better than some state-of-the-art models, e.g. the Pivoted Normalization method (an example of VSM) [21], BM25 (an example of probabilistic models) [19], and Jelinek-Mercer and Dirichlet smoothing (examples of language models) [26].

As we previously illustrated, calculating σ is not a simple set intersection; the underlying problem is to compute exactly what d and q semantically share. Therefore, we propose to compare two definitions of terms: 1- words, which are classically extracted after removing stop words and stemming; and 2- concepts, where the text of documents is mapped to concepts through an external knowledge base. That is why we use medical corpora from CLEF (www.clef-initiative.eu), which can be used with the well-known knowledge base UMLS. The retrieval formulae are compared to each other using the Mean Average Precision (MAP) metric. As statistical significance test, we use Fisher's randomization test at the 0.05 level [22].

5.1 Experimental Setup
Six corpora and two types of indexing terms are used. We also define the object σ in two different ways.

5.1.1 Indexing Terms
Words: the final list of words that index documents and queries is the list of words in the corpus after removing stop words and stemming the remaining words with the Porter algorithm.
Concepts: the text of documents and queries is mapped to UMLS concepts using MetaMap. UMLS is a multi-source knowledge base in the medical domain, whereas MetaMap is a tool for mapping text to UMLS concepts. Using concepts allows us to profit from the semantic relations between concepts, and so to build different versions of the object σ.

5.1.2 Corpora
Four corpora from CLEF and two from TREC (http://trec.nist.gov/) are used. Table 1 shows some statistics about them; the statistics are taken after removing stop words and stemming. img09, img10, img11, and img12 contain short medical documents and queries. trec6 and trec8 contain long general-content documents and queries. For the TREC queries, we use the three fields 'title', 'desc', and 'narr'. Concerning trec6, the documents on disks 4 & 5 and topics 301-350 are used. Concerning trec8, the documents on disks 4 & 5 without the Congressional Record (CR) and topics 401-450 are used.

Table 1: Corpora statistics. avdl and avql are the average lengths of documents and queries.

| Corpus | #d     | #q | Terms    | avdl   | avql  |
|--------|--------|----|----------|--------|-------|
| img09  | 74901  | 25 | Words    | 62.16  | 3.36  |
|        |        |    | Concepts | 157.48 | 10.84 |
| img10  | 77495  | 16 | Words    | 62.12  | 3.81  |
|        |        |    | Concepts | 157.27 | 12.0  |
| img11  | 230088 | 30 | Words    | 44.83  | 4.0   |
|        |        |    | Concepts | 101.92 | 12.73 |
| img12  | 306530 | 22 | Words    | 47.16  | 3.55  |
|        |        |    | Concepts | 47.16  | 9.41  |
| trec6  | 551787 | 50 | Words    | 266.67 | 46.56 |
|        |        |    | Concepts | -      | -     |
| trec8  | 523865 | 50 | Words    | 243.87 | 29.06 |
|        |        |    | Concepts | -      | -     |

5.1.3 Retrieval Formulae
First, in order to study the effect of Exhaustivity and Specificity, we compare our retrieval formulae RSV^σ (Eq. 16) and RSV^{σ+} (Eq. 18) with RSV^{d·q} (Eq. 20):

RSV^{d·q}(d, q) = d · q = Σ_{a∈A} [c(a, d) / (c(a, d) + |d|/avdl)] × [N / df(a)] × w_a^q   (20)

where w_a^q = 1/|q| for terms that occur in q, and 0 otherwise. Second, we compare the retrieval performance of RSV^σ and RSV^{σ+} with:

• Pivoted Normalization (PIV), where s = 0.2 [21, 11].
• BM25, where k1 = 1.2, b = 0.75, k3 = 1000 [19, 11].
• Dirichlet Prior method (DIR), where µ = 2000 [26].
• Jelinek-Mercer method (JM), where λ = 0.1 for short queries, which is the case for the imgXX corpora, and λ = 0.7 for long queries, which is the case for trec6 and trec8 [26].

We set the parameters to their usually used default values.

5.2 Results and Discussion
We divide our experiments into three main categories: 1- experiments to study the effect of the value of α on the retrieval performance of RSV^σ (Eq. 16) and RSV^{σ+} (Eq. 18); 2- experiments using words to compare the retrieval performance of RSV^σ, which corresponds to the basic definition of σ, with the performance of some state-of-the-art IR models; and 3- experiments using concepts to compare the performance of RSV^σ with other IR models, and also to study the effect of the extended definition of σ on the overall retrieval performance through the performance of RSV^{σ+}.

5.2.1 Exhaustivity vs. Specificity
In this category of experiments, we only use words as indexing terms. By varying the value of α in RSV^σ (Eq. 16) between α = 0.0 (only Specificity) and α = 1.0 (only Exhaustivity), it is possible to study the mutual impact of Exhaustivity and Specificity. In Table 2, we apply RSV^σ (Eq. 16) to all corpora using words; (*) means a significant improvement compared to α = 0.0 (only Specificity) and (†) means a significant improvement compared to α = 1.0 (only Exhaustivity).

Table 2: Exhaustivity vs. Specificity

| α                  | img09     | img10     | img11     | img12     | trec6     | trec8     |
|--------------------|-----------|-----------|-----------|-----------|-----------|-----------|
| 0.0 (Specificity)  | 0.3149    | 0.3117    | 0.1965    | 0.1207    | 0.2116    | 0.1934    |
| 0.1                | 0.3340 †∗ | 0.3266 †∗ | 0.2113 †∗ | 0.1376 †∗ | 0.2295 †∗ | 0.2166 †∗ |
| 0.2                | 0.3544 †∗ | 0.3352 †∗ | 0.2229 †∗ | 0.1477 †∗ | 0.2464 †∗ | 0.2339 †∗ |
| 0.3                | 0.3653 †∗ | 0.3402 †∗ | 0.2284 †∗ | 0.1539 †∗ | 0.2584 †∗ | 0.2480 †∗ |
| 0.4                | 0.3737 †∗ | 0.3102 †  | 0.2330 †∗ | 0.1580 †∗ | 0.2639 †∗ | 0.2609 †∗ |
| 0.5                | 0.3789 †∗ | 0.3171 †  | 0.2317 †∗ | 0.1586 †∗ | 0.2661 †∗ | 0.2666 †∗ |
| 0.6                | 0.3834 †∗ | 0.3221 †  | 0.2282 †  | 0.1539 †  | 0.2613 †∗ | 0.2631 †∗ |
| 0.7                | 0.3846 †∗ | 0.3194 †  | 0.2221 †  | 0.1485 †  | 0.2357 †∗ | 0.2478 †∗ |
| 0.8                | 0.3846 †∗ | 0.3144 †  | 0.2114 †  | 0.1463 †  | 0.1742 †  | 0.2112 †  |
| 0.9                | 0.3731 †∗ | 0.3056 †  | 0.1922 †  | 0.1348 †  | 0.0617 †  | 0.1271 †  |
| 1.0 (Exhaustivity) | 0.2300    | 0.1921    | 0.0749    | 0.0551    | 0.0120    | 0.0369    |

In Table 2, we see that, on the one hand, integrating the two components (Exhaustivity and Specificity) is better than depending on only one of them. On the other hand, there should be a balance between the two components, without dominance. However, the exact value of α is corpus-dependent. In general, α = 0.5, which is the average of the best α values over all corpora, gives good performance. In the rest of this paper, we fix α = 0.5 to give equal importance to Exhaustivity and Specificity.

5.2.2 Experiments Using Words
All experiments in this category are done using words as indexing terms.
The main goal of this part of the experiments is to study the retrieval performance of RSV^σ (Eq. 16 with α = 0.5) and to compare it with: 1- the basic inner-product vector space model without Exhaustivity and Specificity, RSV^{d·q} (Eq. 20); and 2- some classical IR models. In Table 3, we apply RSV^σ (Eq. 16 with α = 0.5), RSV^{d·q} (Eq. 20), PIV, BM25, JM, and DIR to all corpora using words, where (*) indicates that RSV^σ is significantly better.

Table 3: Experiments Using Words

| Formula   | img09   | img10   | img11   | img12   | trec6   | trec8   |
|-----------|---------|---------|---------|---------|---------|---------|
| RSV^σ     | 0.3789  | 0.3171  | 0.2317  | 0.1586  | 0.2661  | 0.2666  |
| RSV^{d·q} | 0.3149* | 0.3117  | 0.1965* | 0.1207* | 0.1677* | 0.1674* |
| PIV       | 0.3664  | 0.2992  | 0.1546* | 0.1027* | 0.2076* | 0.2302* |
| BM25      | 0.3726  | 0.2745* | 0.1995* | 0.1438  | 0.2238  | 0.2521  |
| JM        | 0.3792  | 0.2994  | 0.1985* | 0.1371  | 0.2532  | 0.2627  |
| DIR       | 0.3353* | 0.2960  | 0.1534* | 0.1161* | 0.2410  | 0.2514  |

Table 3 shows the effect of integrating Exhaustivity and Specificity on the performance of RSV^σ. The performance of RSV^σ, using the basic version of the intermediate object σ, is always better than the performance of RSV^{d·q}, even though the improvement is sometimes small. This type of result shows that the indirect matching between d and q through an object σ gives better results than direct matching. Table 3 also shows that the performance of RSV^σ is better than or equivalent to the performance of some classical IR models, like PIV, BM25, JM, and DIR. This also shows that, by using Exhaustivity and Specificity, it is possible to build an effective, high-performance IR model.

Concerning trec6, the performance of our formula RSV^σ is slightly better than the best run of the TREC-6 conference (the run 'anu6alol', where MAP = 0.2602) [25], and it is also better than the best result obtained by the DFR framework (the run 'I(ne)L2', where MAP = 0.2600) [5]. Concerning trec8, all runs of the TREC-8 conference that have a better score than ours use some supplementary techniques, for example query expansion, pseudo-relevance feedback, POS tools, or the fusion of several runs [24].

This category of experiments shows that it is possible to build a high-performance IR model by exploiting the two notions of Exhaustivity and Specificity, or in other words, through indirect matching between documents and queries.

5.2.3 Experiments Using Concepts
All experiments in this category are done using UMLS concepts as indexing terms. The main goal of this part of the experiments is, on the one hand, to study the retrieval performance of RSV^σ by repeating the experiments of the previous section using concepts instead of words, and on the other hand, to study the effect of using the extended version σ+ of the object. Here, we compare the retrieval performance of our formulae RSV^σ (Eq. 16 with α = 0.5) and RSV^{σ+} (Eq. 18 with α = 0.5) with the performance of some classical IR models. In Table 4, we apply RSV^{σ+}, RSV^σ, RSV^{d·q}, PIV, BM25, JM, and DIR to the CLEF corpora using concepts, where (*) means RSV^{σ+} is significantly better and (†) means RSV^σ is significantly better.
Table 4: Experiments Using Concepts

| Formula    | img09    | img10    | img11    | img12    |
|------------|----------|----------|----------|----------|
| RSV^{σ+}   | 0.3620   | 0.3247   | 0.1868   | 0.1427   |
| RSV^σ      | 0.3489   | 0.3448   | 0.1822   | 0.1322*  |
| RSV^{d·q}  | 0.3195*  | 0.3340   | 0.1783   | 0.1249*  |
| PIV        | 0.2626*† | 0.2530   | 0.1096*† | 0.0934*† |
| BM25       | 0.2672*† | 0.2127*† | 0.1552*  | 0.1034*  |
| JM         | 0.3058*  | 0.2451†  | 0.1580*† | 0.1022*  |
| DIR        | 0.2675*† | 0.2455   | 0.1228*† | 0.0861*† |

Table 4 shows that the performance of RSV^σ, using the basic version of σ, is better than or equivalent to the performance of RSV^{d·q} and of some classical IR models, like PIV, BM25, JM, and DIR. This again shows that, by using Exhaustivity and Specificity, it is possible to build an effective, high-performance IR model. Comparing RSV^{σ+} with RSV^σ in Table 4 shows that using the extended object σ+ slightly improves the performance, except on the img10 corpus. The performance of RSV^{σ+} reported in Table 4 is for β = 7 in the semantic similarity function (Eq. 19). We do not try to optimize the value of β here; we only show the potential of our model. The interesting point is that the same value of β gives the best performance on all corpora. Hence, the value of β seems to depend on the resource containing the concepts and relations, UMLS in our case, and not on the corpora.

The performance of RSV^{σ+} is significantly better than the performance of the classical IR models, except for the img10 corpus. In addition, RSV^{σ+} performs slightly better than RSV^σ, except for the img10 corpus. This shows that integrating Exhaustivity and Specificity in an IR model, together with an appropriate choice of the object σ, can lead to an important gain in retrieval performance.

Comparing the results obtained using words (Table 3) with those obtained using concepts (Table 4) shows that the improvement is more significant when using words. This happens because natural languages are ambiguous; mapping tools, like MetaMap, therefore map each noun phrase to several candidate concepts. In that case, σ does not correspond to what d and q semantically share, because there is noise in the text-to-concept mapping process. In our opinion, the idea presented in [1] somehow explains the results obtained in this study.

In general, the experiments show that: 1- both Exhaustivity and Specificity are important for building a well-performing IR model, and there should be a balance between the two components; 2- using two types of terms (words and concepts) gives more credibility to our results, and shows that although using words generally gives better results, using concepts gives more flexibility to explore more versions of the object σ; and 3- the performance of our model (RSV^σ and RSV^{σ+}) shows, in most cases, a statistically significant improvement over the performance of classical IR models.

6. CONCLUSION
We study, in this paper, two IR notions: Exhaustivity and Specificity. In order to present an implementable version of Exhaustivity and Specificity, we use Propositional Logic and Lattice theory, and we exploit the potential relation between them. First, we show how to translate a special kind of logical sentences, namely clauses, into subsets of atomic propositions (Eq. 3). Second, the set of these subsets forms a distributive and complemented lattice.
Third, we show that the material implication between two clauses is equivalent to the partial order relation defined on this lattice (Eq. 6). Fourth, we exploit the degree-of-inclusion measure (Eq. 7) in order to quantify the partial order relation. Fifth, we present a new definition of the uncertain implications (Eq. 9). Finally, we redefine, in view of this modelling, the basic IR notions and build a new, implementable IR model capable of integrating Exhaustivity and Specificity.

In this paper, we see IR as a process of computing the similarity between d and q, and also as a process of estimating the degree of alignment between d and q (Exhaustivity and Specificity). In other words, IR should not only compute the similarity but also the dissimilarity between d and q. We present an implementable approach to integrate Exhaustivity and Specificity into a concrete IR model. The experiments are done on six corpora, four with short queries and documents and two with long ones. We use two types of terms: words and UMLS concepts. We also present two different definitions of the object md ∩ mq: the basic definition σ and the extended definition σ+. The experimental results show that retrieval performance can be strongly improved by explicitly taking Exhaustivity and Specificity into account. Moreover, even the simple formulae RSV^σ and RSV^{σ+} can perform better than some state-of-the-art models. The experiments also show that there should be a balance between the two components (Exhaustivity and Specificity); however, the relative weight of Exhaustivity or Specificity is corpus-dependent.

This study could be developed in different directions in the future. We presented a general IR framework based on comparing d and q through an intermediate object instead of comparing them directly. Therefore, this study could, on the one hand, be developed by building new and more informative versions of the object σ. On the other hand, it could be developed by instantiating the new framework with other concrete IR models, e.g. Language Models or Probabilistic Models.

7. REFERENCES
[1] K. Abdulahhad, J.-P. Chevallet, and C. Berrut. MRIM at ImageCLEF2012. From Words to Concepts: A New Counting Approach. In CLEF, 2012.
[2] K. Abdulahhad, J.-P. Chevallet, and C. Berrut. The Effective Relevance Link between a Document and a Query. In DEXA, Part I, pages 206-218, Vienna, Austria, Sept. 2012.
[3] K. Abdulahhad, J.-P. Chevallet, and C. Berrut. A New Lattice-Based Information Retrieval Theory. Research Report RR-LIG-038, LIG, Grenoble, France, 2013.
[4] K. Abdulahhad, J.-P. Chevallet, and C. Berrut. Is Uncertain Logical-Matching Equivalent to Conditional Probability? In SIGIR '13, 2013.
[5] G. Amati and C. J. Van Rijsbergen. Probabilistic Models of Information Retrieval Based on Measuring the Divergence from Randomness. ACM Trans. Inf. Syst., 20(4):357-389, Oct. 2002.
[6] A. R. Aronson. MetaMap: Mapping Text to the UMLS Metathesaurus, 2006.
[7] J.-P. Chevallet, J.-H. Lim, and D. T. H. Le. Domain Knowledge Conceptual Inter-Media Indexing: Application to Multilingual Multimedia Medical Reports. In CIKM '07, pages 495-504, Lisbon, Portugal, 2007.
[8] Y. Chiaramella and J. P. Chevallet. About Retrieval Models and Logic. Comput. J., 35:233-242, June 1992.
[9] Y. Chiaramella, P. Mulhem, and F. Fourel. A Model for Multimedia Information Retrieval. Technical report, 1996.
[10] F. Crestani. Exploiting the Similarity of Non-Matching Terms at Retrieval Time. Inf. Retr., 2(1):27-47, 2000.
[11] H. Fang, T. Tao, and C. Zhai. A Formal Study of Information Retrieval Heuristics. In SIGIR '04, pages 49-56, Sheffield, United Kingdom, 2004.
[12] K. S. Jones. A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation, 28:11-21, 1972.
[13] K. H. Knuth. Deriving Laws from Ordering Relations. In Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Jackson Hole, WY, pages 204-235, 2003.
[14] K. H. Knuth. Lattice Duality: The Origin of Probability and Entropy. Neurocomput., 67:245-274, Aug. 2005.
[15] D. E. Losada and A. Barreiro. A Logical Model for Information Retrieval Based on Propositional Logic and Belief Revision. The Computer Journal, 44:410-424, 2001.
[16] J. Nie. An Outline of a General Model for Information Retrieval Systems. In SIGIR '88, pages 495-506, Grenoble, France, 1988.
[17] J. M. Ponte and W. B. Croft. A Language Modeling Approach to Information Retrieval. In SIGIR '98, pages 275-281, Melbourne, Australia, 1998.
[18] S. E. Robertson. The Probability Ranking Principle in IR. Pages 281-286, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.
[19] S. E. Robertson and S. Walker. Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. In SIGIR '94, pages 232-241, Dublin, Ireland, 1994.
[20] G. Salton, A. Wong, and C. S. Yang. A Vector Space Model for Automatic Indexing. Communications of the ACM, 18:613-620, 1975.
[21] A. Singhal, C. Buckley, and M. Mitra. Pivoted Document Length Normalization. In SIGIR '96, pages 21-29, Zurich, Switzerland, 1996.
[22] M. D. Smucker, J. Allan, and B. Carterette. A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. In CIKM '07, pages 623-632, Lisbon, Portugal, 2007.
[23] C. J. van Rijsbergen. A Non-Classical Logic for Information Retrieval. Comput. J., 29(6):481-485, 1986.
[24] E. M. Voorhees and D. Harman. Overview of the Eighth Text REtrieval Conference (TREC-8). In TREC, 1999.
[25] E. M. Voorhees and D. Harman. Overview of the Sixth Text REtrieval Conference (TREC-6). Inf. Process. Manage., 36(1):3-35, Jan. 2000.
[26] C. Zhai and J. Lafferty. A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. In SIGIR '01, pages 334-342, New Orleans, Louisiana, United States, 2001.