INFORMETRICS 87/88
L. Egghe and R. Rousseau (Editors)
© Elsevier Science Publishers B.V., 1988

EXPLOITING THE PROBABILITY RANKING PRINCIPLE TO INCREASE THE EFFECTIVENESS OF CONVENTIONAL BOOLEAN RETRIEVAL SYSTEMS

Tadeusz RADECKI
Department of Computer Science, University of Nebraska-Lincoln, Lincoln, Nebraska 68588-0115, U.S.A.

Abstract

This paper reports on research aimed at developing practical methods for improving the performance of conventional Boolean information retrieval systems. More specifically, the objective of this research is to incorporate into these systems a mechanism for ranking the documents of a collection in descending order of their probabilities of usefulness to the user. There are several reasons why a ranking mechanism of this type may be expected to provide retrieval results superior to those of traditional Boolean search techniques, which normally do not furnish users with any indication of document relevance. In particular, the so-called output overload problem, which occurs when the size of the document set retrieved in response to a given query is unmanageable, could practically be eliminated, since the ranking information would assist the searcher in deciding when to end an examination of the system's output with the confidence of identifying most of the useful items that have been retrieved. Moreover, since it would no longer be necessary to inspect all of the output documents, a user query could then be constructed broad enough to allow more relevant items to be retrieved. Accordingly, such a system enhancement may result in an improvement in retrieval effectiveness, especially in an increase in recall. This paper presents and illustrates a theoretical framework, based on the Probability Ranking Principle, for implementing this enhancement.
Research indicates that a relevance-based ranking scheme of this type could readily be incorporated into conventional retrieval systems without altering their underlying fundamental principles; some additional software in the form of a front-end is all that would be needed. As a result, the performance of these systems may be significantly improved at an acceptable cost.

A great deal of research on information retrieval has been motivated by the recognition that clues for estimating the degrees of usefulness, or relevance, of documents in a collection are rarely complete. More specifically, it is widely recognized that predicting document relevance usually involves uncertainties which are inherently probabilistic, and as a result retrieval systems are generally unable to precisely determine the extent to which a given document will satisfy the user's needs. Thus, designing retrieval systems which are capable of identifying all and only relevant documents does not seem feasible under available theoretical models. However, it is often suggested that they should be able to present the retrieved documents in descending order of their estimated probabilities of relevance. The desire to rank the output in this fashion underlies the Probability Ranking Principle (PRP) (see, e.g., Ref. [1]), which is the foundation for much of the research into probabilistic methods of information retrieval since the earlier studies by Maron and Kuhns [2].

MOTIVATION

The ultimate objective of the PRP-based approach to information retrieval is to rank the individual documents of a collection in descending order of their estimated probabilities of usefulness to the user. Therefore, a ranking mechanism of this type could be very helpful in operational retrieval environments.
In particular, it would practically eliminate the so-called output overload problem, which occurs when an unmanageable number of items are retrieved with respect to a given query. More specifically, ranking information would assist the searcher in deciding when to end an examination of the system's output with the confidence of identifying most of the useful documents that have been retrieved. Since it would no longer be necessary to inspect the entire output, user queries could be constructed broad enough to allow the retrieval of more relevant items. Accordingly, using a PRP-based ranking mechanism may result in an improvement in retrieval effectiveness, especially in an increase in recall (*).

In view of the potential of the PRP-approach, it is of particular importance to try to incorporate it into conventional Boolean retrieval systems. However, this approach has been developed primarily for use in systems in which both documents and queries are represented by sets of index terms. Thus, of the two major components of any conventional retrieval system, namely (i) procedures for document indexing and (ii) procedures for query formulation, only the indexing component is compatible with the PRP approach (in conventional systems, queries are represented by Boolean expressions referred to as Boolean search request formulations (Boolean SRFs)). Therefore, one may conclude that this ranking mechanism cannot readily be incorporated into these systems. However, preliminary studies indicate that a formal theory can be developed, on the basis of which a front-end could be designed to merge the PRP-based ranking scheme with conventional Boolean retrieval systems [3, 4]. Since such an enhancement would permit the fundamental principles behind these systems to be preserved, significant improvement in their performance may be achieved at an acceptable cost.
The remainder of this paper presents a theoretical framework for extending the PRP-based approach to conventional Boolean retrieval systems. First, the original PRP approach is briefly described to help clarify this presentation.

OVERVIEW OF THE ORIGINAL PRP-APPROACH

As already mentioned, the PRP-based approach, as first introduced, requires both documents and queries to be represented by sets of index terms. A major observation underlying the development of this approach is that, with respect to this type of query representation, referred to as a binary SRF, a number of disjoint sets can be identified in the document collection, each corresponding to a different non-empty subset of the query's index terms. More specifically, each set contains those documents whose representations include the terms that are present in the corresponding subset, but do not include those terms which are absent. For instance, if a query is represented by the three index terms $t_1$, $t_2$ and $t_3$, then seven disjoint document sets can be identified: one containing documents which are indexed by all three terms, a second containing those that are indexed by $t_1$ and $t_2$, but not $t_3$; etc.

The next step in the reasoning is to note that individual query terms, present in or absent from a document representation (DR), provide clues which can be used to estimate the probability of the document being useful. From the perspective of a given binary SRF, all the documents in a given set are indistinguishable. Thus, each document set may be characterized by a hypothetical document that is represented by the corresponding subset of the query's index terms.

* Recall is defined as the proportion of relevant documents which have been retrieved (and presented to the user).
For example, a hypothetical document associated with the document set containing those items indexed by $t_1$ and $t_2$, but not $t_3$, is represented by $\{t_1, t_2\}$. Therefore, it is sufficient to calculate the probability of relevance for each hypothetical document. Accordingly, with respect to a given query, the number of different probabilities that need to be estimated is generally equal to the number of different non-empty subsets of the query's index terms, or equivalently, the number of the corresponding document sets. There are several procedures available which can be used to calculate these probabilities (one of which is applied later in this paper). The details of these procedures may be found in the relevant literature on probabilistic design principles (see, e.g., Refs. [5, 9]).

EXTENDING THE ORIGINAL PRP-APPROACH TO CONVENTIONAL BOOLEAN RETRIEVAL SYSTEMS

The rationale for extending the original PRP-based retrieval approach to conventional Boolean retrieval systems is based on the well-known result of Boolean algebra (see, e.g., [10]), which states that each element of such an algebra can be represented in a unique form referred to as its disjunctive normal form (DNF). From this result, it follows that any Boolean SRF can be uniquely transformed into the disjunction of a number of distinct expressions, each of which is a conjunction of all the query's index term-related Boolean variables, where each variable occurs once, either negated or unnegated. Accordingly, the output retrieved by a conventional Boolean search strategy can be viewed as the union of a number of disjoint document sets, each of which corresponds to one of the query's DNF-conjuncts. This implies that a given document which satisfies the Boolean SRF matches exactly one of the constituent conjuncts, and consequently would be assigned to only one of the document sets.
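The DNF transformation described above can be sketched in Python by enumerating every truth assignment of the query's variables and keeping those that satisfy the request. This is only an illustrative sketch, not the paper's implementation: the function names `dnf_conjuncts` and `hypothetical_document` are assumptions introduced here, and the request `(t1 or t2) and not t3` is the one the paper itself uses as its first illustration.

```python
from itertools import product


def dnf_conjuncts(srf, variables):
    """Return the DNF conjuncts of a Boolean SRF.

    Each conjunct is a dict mapping every variable to True (unnegated)
    or False (negated). Only assignments satisfying the SRF are kept.
    """
    conjuncts = []
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if srf(**assignment):
            conjuncts.append(assignment)
    return conjuncts


def hypothetical_document(conjunct):
    """Index terms whose variables occur unnegated in the conjunct."""
    return frozenset(term for term, present in conjunct.items() if present)


# Hypothetical encoding of the request (t1 or t2) and not t3.
def q_b(t1, t2, t3):
    return (t1 or t2) and not t3


conjuncts = dnf_conjuncts(q_b, ["t1", "t2", "t3"])
hypothetical_docs = [hypothetical_document(c) for c in conjuncts]
# Three conjuncts survive, represented by {t2}, {t1} and {t1, t2}.
```

Because every conjunct fixes each variable as either negated or unnegated, no two conjuncts can be satisfied by the same document representation, which is why the corresponding document sets are disjoint.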
Thus, from the perspective of the Boolean SRF, all the documents matching a particular DNF-conjunct are indistinguishable and have the same representation. More specifically, all the documents in each set are seen as being represented by the index terms whose Boolean variables appear unnegated in the corresponding conjunct. Accordingly, each document set can again be characterized by a hypothetical document represented by these index terms.

The PRP-based scheme can now be applied to rank the hypothetical documents in descending order of their probabilities of relevance. In estimating such a probability, the index terms which are associated with the negated Boolean variables in a given conjunct are to be regarded as absent from the representation of the corresponding hypothetical document. Since there is a one-to-one correspondence between the hypothetical documents and the respective document sets, these sets can be ranked according to the rank values of the hypothetical documents. Thus, by calculating the rank values of the hypothetical documents, we can determine a partial ranking of the output documents in descending order of their probabilities of relevance. Therefore, it can be concluded that the original PRP-based approach can be extended to conventional Boolean retrieval systems without changing any of their underlying fundamental principles.

To further clarify this theoretical framework, let us consider two query representations: (i) a Boolean SRF; and (ii) a binary SRF (a set of index terms), where each term in the binary SRF corresponds to a variable occurring in the Boolean SRF (there is a one-to-one mapping from the set of index terms onto the set of Boolean variables).
Assuming that the Boolean and binary SRFs are processed against the same document collection by the extended and original PRP-based retrieval methods, respectively, each of the constituent output document sets for the Boolean query also occurs as one of the output's constituent document sets for the binary SRF, and both of these sets have equal rank values (*). Moreover, the respective rankings of the output document sets would be identical if the Boolean SRF were a disjunction of all the possible conjuncts formed from the Boolean SRF's variables, excluding the conjunct in which all the variables are negated, or equivalently, a disjunction of all the Boolean SRF's variables. Therefore, the extended ranking scheme can be regarded as more general than the original.

In order to illustrate this relationship between the extended and the original PRP-based retrieval models, let us consider the Boolean SRF

$$q_B = (t_1 \text{ or } t_2) \text{ and not } t_3,$$

where the Boolean variables $t_1$, $t_2$ and $t_3$ correspond to the index terms $t_1$, $t_2$ and $t_3$, respectively, while or, and, and not denote the Boolean operations of disjunction, conjunction, and negation, respectively. Accordingly, the disjunctive normal form of $q_B$ for the set $X = \{t_1, t_2, t_3\}$ is:

$$(q_B)_X = (\text{not } t_1 \text{ and } t_2 \text{ and not } t_3) \text{ or } (t_1 \text{ and not } t_2 \text{ and not } t_3) \text{ or } (t_1 \text{ and } t_2 \text{ and not } t_3)$$

Thus, if $q_B$ is processed against a collection of documents represented by sets of unweighted index terms, the resulting output can be viewed as the union of the three disjoint document sets: (i) the first containing those documents indexed by $t_2$, but not $t_1$ and $t_3$; (ii) the second containing those indexed by $t_1$, but not $t_2$ and $t_3$; and (iii) the third containing those indexed by $t_1$ and $t_2$, but not $t_3$.
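How a retrieved output splits into these three disjoint sets can be sketched in Python as follows. The collection below is entirely hypothetical (the paper gives none at this point), `t9` stands for any index term not referenced in the query, and the helper name `partition_output` is an assumption made for illustration.

```python
from collections import defaultdict

QUERY_TERMS = ("t1", "t2", "t3")


def q_b(present):
    """Boolean SRF (t1 or t2) and not t3, evaluated on a set of query terms."""
    return ("t1" in present or "t2" in present) and "t3" not in present


def partition_output(collection, srf, query_terms):
    """Group retrieved documents by the subset of query terms indexing them."""
    sets = defaultdict(list)
    for doc_id, index_terms in collection.items():
        # Terms outside the query are irrelevant to the grouping.
        signature = frozenset(index_terms) & frozenset(query_terms)
        if srf(signature):
            sets[signature].append(doc_id)
    return dict(sets)


# Hypothetical collection of indexed documents.
collection = {
    "a": {"t2"},
    "b": {"t1", "t9"},
    "c": {"t1", "t2"},
    "d": {"t1", "t3"},  # rejected by "not t3"
    "e": {"t2", "t9"},
    "f": {"t9"},        # indexed by no query term
}

document_sets = partition_output(collection, q_b, QUERY_TERMS)
# {t2} -> ['a', 'e'], {t1} -> ['b'], {t1, t2} -> ['c']
```

Each key of `document_sets` plays the role of a hypothetical document: all documents grouped under it look identical from the perspective of the query, whatever other terms index them.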
If the query is represented by the binary SRF, $q_b = \{t_1, t_2, t_3\}$, then the ranked output generated by the original PRP-based procedure would include all of the document sets generated by the extended procedure with the same rank values assigned. Clearly, if $q_B = t_1 \text{ or } t_2 \text{ or } t_3$, then the resulting disjunctive normal form (for the same set $X$) would consist of seven conjuncts, and the ranked document sets, as well as their rank values, would be identical to those obtained for $q_b$.

As can be seen from the above discussion, the extended Boolean retrieval scheme allows a wide range of user abilities in specifying an information need. One extreme is where the searcher can precisely represent such a need by providing a list of terms, as well as the logical relationships between them, and the other extreme is where the user is only able to furnish a list of index terms (without any of the logical relationships). Thus the extended scheme is considerably more flexible in terms of system-user communication than the conventional Boolean scheme. This is another significant advantage of the suggested enhancement.

Any procedure that can be used to estimate the probability of relevance of a document represented by unweighted index terms could also be applied to a hypothetical document to determine its rank value. One such procedure, which is discussed below, is based on Bayesian decision theory and utilizes information concerning various frequency distributions of index terms within the document collection (see, e.g., Refs. [5, 6]). In the adopted procedure, a discriminant function, $g(d_H)$, plays a central role in ranking the output documents.
This function can be concisely expressed as:

$$g(d_H) = \sum_{i=1}^{k} w_i \, x_{d_H}(i) + C$$

In this formula:

$w_i$ is the so-called relevance weight of the index term, $t_i$, corresponding to the $i$th variable in the Boolean SRF (the number of relevance weights which are to be calculated with respect to a given query is generally equal to the number of variables in the Boolean SRF);

$x_{d_H}(i)$ is the value of the index term, $t_i$, with regard to a hypothetical document, $d_H$, which can take one of two values: 1 if the index term, $t_i$, is present in the hypothetical document representation $d_H$, or equivalently, if the $i$th term-related Boolean variable occurs unnegated in the corresponding conjunct; or 0 if the index term, $t_i$, is absent from the hypothetical document representation $d_H$, or equivalently, if the $i$th term-related Boolean variable occurs negated in the corresponding conjunct;

$C$ is a constant, interpreted as the cutoff value [6], which is the same for all the hypothetical documents (and therefore all of the documents in the collection) with respect to a given query.

* Provided, of course, that only those Boolean variables that occur in the Boolean SRF have been taken into account by the extended ranking scheme. (This qualifies the earlier statement that corresponding document sets have equal rank values.)

In order to calculate the value of $g(d_H)$ with regard to a particular hypothetical document $d_H$, the summation of the $w_i$'s for those index terms which are present ($x_{d_H}(i) = 1$) in its representation, $d_H$, is to be performed (or equivalently, the summation of the $w_i$'s for those terms which correspond to the unnegated Boolean variables in the associated conjunct).

An implicit role of the discriminant function is to classify each hypothetical document (and the corresponding document set) into one of two classes: either the class of relevant documents or the class of non-relevant documents.
Specifically, the decision rule is to classify a given hypothetical document as relevant if $g(d_H) > 0$, and as non-relevant if $g(d_H) < 0$. However, the principal role of the discriminant function is to rank hypothetical documents, and simultaneously the retrieved documents, in decreasing order of their probabilities of usefulness. More specifically, the greater the value of $g(d_H)$, the higher the probability of relevance of the hypothetical document, $d_H$, and thus all the hypothetical documents can be ranked accordingly. Since the value of the constant $C$ is the same for all of the documents in the collection, it will not affect the resulting ranking (although it may be a factor in the classification process). Accordingly, this constant can be ignored in generating the ranked output, and will not be considered further in this paper. This will considerably simplify further discussions without affecting the final conclusions. Thus, the rank value of a hypothetical document, $d_H$, can be determined by

$$g(d_H) = \sum_{i=1}^{k} w_i \, x_{d_H}(i).$$

The formula for $w_i$ can be shown to be in the following form [5-6]:

$$w_i = \log \frac{\xi_i (1 - \eta_i)}{\eta_i (1 - \xi_i)}$$

where $\xi_i$ is the probability of occurrence of the index term, $t_i$, in a DR given that it represents a relevant document, and $\eta_i$ is the probability of occurrence of $t_i$ in a DR representing a non-relevant item. Thus, in order to calculate the $w_i$'s it is necessary to estimate the probabilities $\xi_i$ and $\eta_i$ for each index term, $t_i$, associated with the Boolean SRF. One possibility is to use the occurrence characteristics of $t_i$ for a sample document set that is representative of the entire collection, and estimate these probabilities in terms of proportions.

Specifically, let $N$ be the total number of documents in the representative sample, in which $R$ items have been assessed as relevant by the user. Thus, it follows that there are $N - R$ non-relevant items in the sample. Furthermore, let $n_i$ denote the number of sample documents (relevant
Thus, i t follows t h a t there a r e N - R non-relevant items in t h e Furthermore, l e t ni denote t h e number of sample documents (relevant simple in the by t h e sample. or not) which a r e indexed by the term ti, and l e t ri stand for t h e number of relevant sample documents indexed by Using ti. this notation, the occurrence characteristics f a a particular index term, t i , can b e presented conveniently by means of a contingency table [ > 6 ] a s followr Relevant x Non-relevant (i) = I n. d~ (i) = 0 X - r. N - n . - R + r i n. 1 N - ni d~ Accordingly L i and ni can b e estimated by the following ratios: and ". I n.-r. =- I I ' N-R Thus, t h e formula f o r wi can now be expressed a a w.= log ril (R-ri) hi-ri)l (N - n.-I R + ri) This is in f a c t t h e well-known index term weighting formula referred t o a s F4 by Robertson and Sparck Jones [ 5 ] . The following simple example illustrates t h e use of t h e adopted F4 formula i n determining t h e relevance weights f a index terms, and subsequently generating t h e ranked output from a given Boolean SRF. 215 The Effectiveness of Conventio~l Boolean RetrievalSystems Example. Suppose a query submitted t o a conventional Boolean retrieval system is represented by the Boolean SRF, qB, as follows: 48 = (t. and not t2) or (t2 and t3) 1 . Thus, the disjunctive normal form, (qB)X, for the set X = itl, (qg& z (not tl and t2 and t ) or (t 3 (tl Accordingly, and not t 2 01 . and t2 and t3) the retrieved output, Y (qB), may equivalently be viewed as the union conjuncts: (i) not t 2 and not t:, and not t ) 3 and t3) or (tl of the disjoint document sets, Al, not t I t2, t3) is: ,, A2, A3 and ,A corresponding to the and t 2 and t3: (ii) tl and not t2 and not t3; (iii) tl and 1 and t ; and (iv) t and t and t3, respectively. That is, 3 I 2 u(qB) = Al A2 A3 Ag . 
The first set, $A_1$, contains those documents in the collection which are indexed by the terms $t_2$ and $t_3$, but not $t_1$ (it is of no importance whether the items are indexed by any of the other terms); $A_2$ contains those documents which are indexed by $t_1$, but not $t_2$ and $t_3$; and so forth.

As already mentioned, from the perspective of a conventional Boolean search scheme, all the items in an output's constituent document set are indistinguishable. In other words, the documents contained in such a set are considered by the system to be represented by the same set of index terms (although, in fact, they may be indexed by other terms that are not referenced in the query). Accordingly, the documents in $A_1$, $A_2$, $A_3$ and $A_4$ are viewed as being characterized by the index term sets $\{t_2, t_3\}$, $\{t_1\}$, $\{t_1, t_3\}$, and $\{t_1, t_2, t_3\}$, respectively. In compliance with the extended document ranking scheme, these sets of index terms can be regarded as the representations of the hypothetical documents, $d_{H_1}$, $d_{H_2}$, $d_{H_3}$ and $d_{H_4}$, respectively. Therefore, the system's output can formally be viewed as the set of four hypothetical documents.

We can now apply the adopted F4 formula to determine the rank values of these hypothetical documents in terms of their probabilities of relevance. Since there exists a one-to-one correspondence between the hypothetical documents and the output's constituent document sets, the rank values of the hypothetical documents are also the rank values of the associated document sets. Thus, having determined the rank values of the hypothetical documents, we can determine the ranking of the corresponding output's document sets in descending order of their probabilities of relevance. In order to use the F4 formula, a representative sample of the document collection is required.
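The weighting and ranking steps of the example can be sketched in Python as follows. This is a minimal sketch rather than the paper's own procedure: the function names are assumptions, the counts are the ones the example reports for its ten-document sample ($N = 10$, $R = 5$, and $(n_i, r_i)$ of (5, 3), (7, 4) and (6, 2) for $t_1$, $t_2$ and $t_3$), and base-10 logarithms are assumed because they reproduce the weights quoted in the example.

```python
import math


def f4_weight(N, R, n_i, r_i):
    """F4 relevance weight of a term from sample counts (base-10 log assumed)."""
    relevant_odds = r_i / (R - r_i)
    nonrelevant_odds = (n_i - r_i) / (N - n_i - R + r_i)
    return math.log10(relevant_odds / nonrelevant_odds)


def rank_value(hypothetical_doc, weights):
    """Sum of the weights of the terms present; the constant C is ignored."""
    return sum(weights[term] for term in hypothetical_doc)


N, R = 10, 5
counts = {"t1": (5, 3), "t2": (7, 4), "t3": (6, 2)}  # term -> (n_i, r_i)
weights = {t: f4_weight(N, R, n, r) for t, (n, r) in counts.items()}

# Hypothetical documents characterizing the sets A1..A4 of the example.
hypothetical_docs = {
    "A1": {"t2", "t3"},
    "A2": {"t1"},
    "A3": {"t1", "t3"},
    "A4": {"t1", "t2", "t3"},
}
rank_values = {name: rank_value(doc, weights)
               for name, doc in hypothetical_docs.items()}
ranking = sorted(rank_values, key=rank_values.get, reverse=True)
print(ranking)  # ['A2', 'A4', 'A1', 'A3']
```

Note that `f4_weight` divides by zero or takes the logarithm of zero when any cell of the contingency table is empty; the sketch does not guard against that case.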
Assume that the sample document set consists of 10 items, $d_1$, $d_2$, $d_3$, $d_4$, $d_5$, $d_6$, $d_7$, $d_8$, $d_9$, $d_{10}$, with the following representations:

[Table of sample document representations not recoverable.]

Let us further assume that of the sample documents, $d_1$, $d_4$, $d_6$, $d_8$ and $d_9$ have been assessed as relevant by the user. This information is sufficient to determine the relevance weights for the individual index terms associated with $q_B$. Specifically, for the term $t_1$, we have $r_1 = 3$ and $n_1 = 5$. Accordingly, the related contingency table is ($N = 10$, $R = 5$, $r_1 = 3$, $n_1 = 5$):

                        Relevant     Non-relevant     Total
    x_dH(1) = 1         3            2                5
    x_dH(1) = 0         2            3                5
    Total               5            5                10

Thus,

$$w_1 = \log \frac{3/2}{2/3} = \log \frac{9}{4} = 0.3522$$

The contingency table for $t_2$ is ($N = 10$, $R = 5$, $r_2 = 4$, $n_2 = 7$):

                        Relevant     Non-relevant     Total
    x_dH(2) = 1         4            3                7
    x_dH(2) = 0         1            2                3
    Total               5            5                10

Hence,

$$w_2 = \log \frac{4/1}{3/2} = \log \frac{8}{3} = 0.4260$$

Finally, the contingency table for $t_3$ is ($N = 10$, $R = 5$, $r_3 = 2$, $n_3 = 6$):

                        Relevant     Non-relevant     Total
    x_dH(3) = 1         2            4                6
    x_dH(3) = 0         3            1                4
    Total               5            5                10

Therefore,

$$w_3 = \log \frac{2/3}{4/1} = \log \frac{2}{12} = \log \frac{1}{6} = -0.7782$$

These index term relevance weights will now allow us to calculate a rank value, $g(d_{H_i})$, for each hypothetical document, $d_{H_i}$, and simultaneously for the corresponding output document set, $A_i$. These rank values are:

$$g(d_{H_1}) = w_2 + w_3 = -0.3522, \quad g(d_{H_2}) = w_1 = 0.3522, \quad g(d_{H_3}) = w_1 + w_3 = -0.4260, \quad \text{and} \quad g(d_{H_4}) = w_1 + w_2 + w_3 = 0.$$

This implies the following ranking, $\mathrm{Ord}\, Y(q_B)$:

$$\mathrm{Ord}\, Y(q_B) = \langle A_2, A_4, A_1, A_3 \rangle.$$

The $A_2$ document set is presumably the most likely to contain relevant documents. Thus, it should be presented to the user first, followed, if needed, by the next highest ranked document set, $A_4$, and so forth.

CONCLUDING REMARKS

This paper has presented a theoretical framework for incorporating a PRP-based document ranking mechanism into a conventional Boolean retrieval system.
This extended system exhibits a number of advantages which include:

(i) accommodating traditional Boolean SRFs used in commercial retrieval systems without changing any of the fundamental principles upon which these systems are based;

(ii) providing for the retrieval of documents in descending order of their probabilities of usefulness to the user;

(iii) facilitating a system-user adaptation mechanism by determining relevance weights for a query's index terms;

(iv) allowing for a wide range of user abilities in specifying an information need.

To summarize, by demonstrating the suitability of Boolean logic for exploiting the Probability Ranking Principle, it has been shown that the implementation of an extended document ranking procedure should result in considerable improvement in the retrieval effectiveness of traditional search techniques at an acceptable cost.

BIBLIOGRAPHY

[1] Robertson, S.E., The Probability Ranking Principle in IR, Journal of Documentation 33 (4) (1977) p. 294-304.

[2] Maron, M.E., Kuhns, J.L., On Relevance, Probabilistic Indexing and Information Retrieval, Journal of the Association for Computing Machinery 7 (3) (1960) p. 216-244.

[3] Radecki, T., Reducing the Perils of Merging Boolean and Weighted Retrieval Systems, Journal of Documentation 38 (3) (1982) p. 207-211.

[4] Radecki, T., A Probabilistic Approach to Information Retrieval in Systems with Boolean Search Request Formulations, Journal of the American Society for Information Science 33 (6) (1982) p. 365-370.

[5] Robertson, S.E., Sparck Jones, K., Relevance Weighting of Search Terms, Journal of the American Society for Information Science 27 (3) (1976) p. 129-146.

[6] Van Rijsbergen, C.J., Information Retrieval (2nd ed., Butterworths, London, UK, 1979).
[7] Cooper, W.S., Huizinga, P., The Maximum Entropy Principle and its Application to the Design of Probabilistic Retrieval Systems, Information Technology: Research and Development 1 (2) (1982) p. 99-112.

[8] Heine, M.H., A Simple, Intelligent Front End for Information Retrieval Systems Using Boolean Logic, Information Technology: Research and Development 1 (4) (1982) p. 247-260.

[9] Cooper, W.S., Exploiting the Maximum Entropy Principle to Increase Retrieval Effectiveness, Journal of the American Society for Information Science 34 (1) (1983) p. 31-39.

[10] Arnold, B.H., Logic and Boolean Algebra (Prentice-Hall, Englewood Cliffs, New Jersey, USA, 1962).