OneTouch 4.0 Sanned Documents

INFORMETRICS R7/RR
L. t'gghr and K. Koursrau (t'dion)
0 Elvvier Science PublrJl~rsBV.. 198b
EXPLOITING THE PROBABILITY RANKING PRINCIPLE TO INCREASE
THE EFFECTIVENESS
SYSTEMS
OF
CONVENTIONAL
BOOLEAN
RETRIEVAL
Tadeusz RADECKl
Department of Computer Science,
Lincoln, Nebraska 68588-0115, U.S.A.
University
of
Nebraska-Lincoln,
Abstract
This paper reports on research aimed at developing practical methods
for improvina t h e performance of conventional Boolean information
retrievai systims. ~ o i specifically,
e
t h e objective of this research is t o
incorporate into these systems a mechanism for ranking t h e documents
of a collection in descending order of their probabilities of usefulness
t o t h e user. There are several reasons why a ranking mechanism of
this type may b e expected t o provide retrieval results superior t o
those of traditional Boolean search techniques which normally d o not
furnish users with any indication of document relevance. In particular,
t h e sc-called output overload problem, which occurs when t h e size of
t h e document s e t retrieved in response t o a given query is
unmanageable, could practically be eliminated since t h e ranking
information would assist t h e searcher in deciding when t o end an
examination of t h e system's output with t h e confidence of identifying
most of t h e useful items t h a t have been retrieved. Moreover, since i t
would no longer be necessary t o inspect all of t h e output documents, a
user q w r y could then be constructed broad enough t o allow more
relevant items t o be retrieved. Accordingly, such a system enhancement
may result in an improvement in retrieval effectiveness, especially in
a n increase in recall.
his
presents and illustrates a theoretical framework, based on
t h e Probability Ranking Principle, for implementing this enhancement.
Research indicates t h a t a relevance-based ranking scheme of this type
could readily b e incorporated into conventional retrieval systems
without altering their underlying fundamental principles; some additional
software in t h e form of a front-end is all t h a t would be needed. As a
result, t h e performance of these systems may be significantly improved
at a n acceptable cost.
A great deal of research on information retrieval has been motivated by t h e
recognition t h a t clues for estimating t h e degrees of usefulness, or relevance, of
documents in a collection a r e rarely complete. More specifically, i t is widely
recognized t h a t predicting document relevance usually involves uncertainties which
a r e inherently probabilistic, and a s a result retrieval systems a r e generally unable
t o precisely determine t h e extent t o which a given document will satisfy t h e
user's needs. Thus, designing retrieval systems which a r e capable of identifying
all and only relevant documents does not seem feasible under available
theoretical models. However, i t is often suggested t h a t they should be able t o
present t h e retrieved documents in descending order of their estimated
210
probabilities of relevance.
t h e Probability Ranking
foundation for much of
retrieval since t h e earlier
T. Radecki
The desire t o rank t h e output in this fashion underlies
Principle (PRP) (see, e.g., Ref. [I]), which is t h e
t h e research into probabilistic methods of information
studies by Maron and Kuhns [ Z ] .
MOTIVATION
The ultimate objective of t h e PRP-based approach t o information retrieval is t o
rank t h e individual documents of a collection in descending order of their
estimated probabilities of usefulness t o t h e user. Therefore, a ranking mechanism
of this type could be very helpful in operational retrieval environments. In
particular, i t would practically eliminate t h e so-called output overload problem
which occurs when an unmanageable number of items a r e retrieved with respect
t o a given query. More specifically, ranking information would assist t h e searcher
in deciding when t o end an examination of t h e system's output with t h e
confidence of identifying most of t h e useful documents t h a t have been retrieved.
Since i t would no longer be necessary t o inspect t h e entire output, user queries
could be constructed broad enough t o allow t h e retrieval of more relevant items.
Accordingly, using a PRP-based ranking mechanism may result in an improvement
in retrieval effectiveness, especially in an increase in recall (*).
In view of t h e potential of t h e PRP-approach, i t is of particular importance t o
t r y t o incorporate i t i n t o conventional Boolean retrieval systems. However, this
approach has been developed primarily for use in systems in which both
documents and queries a r e represented by s e t s of index terms. Thus, of t h e two
major components of any conventional retrieval system-(i) procedures for
document indexing, and (ii) procedures for query formulation-only t h e indexing
component is compatible with t h e P R P approach (in conventional systems, queries
a r e represented by Boolean expressions referred t o a s Boolean search request
formulations (Boolean SRF)). Therefore, one may conclude t h a t this ranking
mechanism cannot readily be incorporated into these systems. However,
preliminary studies indicate t h a t a formal theory can be developed, on t h e basis
of which a front-end could be designed t o merge t h e PRP-based ranking scheme
with conventional Boolean retrieval systems [ 3 , 4 ] . Since such an enhancement
would permit t h e fundamental principles behind these systems t o be preserved,
significant improvement in their performance may b e achieved a t a n acceptable
cost. The remainder of this paper presents a theoretical framework for extending
t h e PRP-based approach t o conventional Boolean retrieval systems. First, t h e
original PRP approach is briefly described to help clarify this presentation.
OVERVIEW O F THE ORIGINAL PRP-APPROACH
As already mentioned, t h e PRP-based approach, a s first introduced, requires both
documents and queries t o be represented by s e t s of index terms. A major
observation underlying t h e development of this approach is t h a t with respect t o
this type of query representation, referred t o a s a binary SRF, in t h e document
collection a number of disjoint s e t s can be identified, e a c h cwresponding t o a
different non-empty subset of t h e q w r y ' s index terms. More specifically, e a c h
s e t contains those documents whose representations include t h e terms t h a t are
present in t h e corresponding subset, but d o not include those terms which a r e
absent. For instance, if a query is represented by t h e t h r e e index terms, t
h
and t then seven disjoint document s e t s can be identified: one containing ~
documznts which a r e indexed by all t h r e e terms, a second containing those t h a t
are indexed by t and t,, but not t ; etc. The nex s t e p in t h e reasoning is t o
note t h a t indivibual query terms,'present
in or absent from a document
representation (DR), provide clues which can b e used t o estimate t h e probability
of t h e document being useful. From t h e perspective of a given binary SRF, all
*
Recall is defined a s t h e proportion of relevant documents which have been
retrieved (and presented t o t h e user).
k
The Effectiveness o f ConventionnlBooleon RefrievalSysfems
21 1
the documents in a given s e t a r e indistinguishable. Thus, e a c h document s e t may
be characterized by a hypothetical document t h a t is represented by t h e
corresponding subset of t h e query's index terms. For example, a hypothetical
document associated with t h e document s e t containing those items indexed by t
and t2, but not t j is represented by I t l , tZ]. Therefore, i t is sufficient t oI
calculate t h e probability of relevance for each hypothetical document set.
Accordingly, with respect t o a given query, t h e number of different probabilities
t h a t need t o be estimated is generally equal t o t h e number of different
non-empty subsets of t h e query's index terms, or equivalently, t h e number of t h e
cwresponding document sets. there a r e several procedures available which can be
used t o calculate these probabilities (one of which is applied later in this paper).
The details of these procedures may be found in t h e relevant literature on
probabilistic design principles (see, e.g., Refs. [ 5 , 9 ] ) .
EXTENDING THE ORIGINAL
RETRIEVAL SYSTEMS
PRP-APPROACH
TO CONVENTIONAL
BOOLEAN
The rationale for extending t h e original PRP-based retrieval approach t o
conventional Boolean retrieval systems is based on t h e well-known result of
Boolean algebra (see, e.g., [lo]), which s t a t e s t h a t each element of such an
algebra can be represented in a unique form referred t o a s i t s disjunctive normal
form (DNF). From this result, i t follows t h a t any Boolean SRF can b e uniquely
transformed into t h e disjunction of a number of distinct expressions, each of
which is a conjunction of all t h e query's index term-related Boolean variables,
where each variable occurs once, either negated or unnegated.
Accordingly, t h e output retrieved by a conventional Boolean search strategy can
be viewed a s the union of a number of disjoint document sets, each of which
corresponds t o one of t h e query's DNF-conjuncts. This implies t h a t a given
document which satisfies t h e Boolean SRF matches exactly one of t h e constituent
conjuncts, and consequently would be assigned t o only one of t h e document sets.
Thus, from t h e perspective of t h e Boolean SRF, all t h e documents matching a
particular DNF-conjunct a r e indistinguishable and have t h e same representation.
More specifically, all t h e documents in each s e t a r e seen a s being represented by
t h e index terms whose Boolean variables appear unnegated in t h e corresponding
conjunct. Accordingly, each document s e t can again be characterized by a
hypothetical document represented by these index terms. The PRP-based scheme
can now be applied t o rank t h e hypothetical documents in descending order of
their probabilities of relevance. In estimating such a probability, t h e index terms
which a r e associated with t h e negated Boolean variables in a given conjunct a r e
t o he regarded a s absent from t h e representation of t h e corresponding
hypothetical document. Since there is a one-to-one correspondence between t h e
hypothetical documents and t h e respective document sets, these sets can b e
ranked according t o t h e rank values of t h e hypothetical documents. Thus, by
calculating t h e rank values of t h e hypothetical documents, we can determine a
partial ranking of t h e output documents in descending order of their probabilities
of relevance. Therefore, i t can be concluded t h a t t h e original PRP-based
approach can be extended t o conventional Boolean retrieval systems without
changing any of their underlying fundamental principles.
T o further clarify this theoretical framework, Let us consider two query
representations: (i) a Boolean SRF; and (ii) a binary SRF (a s e t of index terms),
where each t e r m in t h e binary SRF corresponds t o a variable occurring in t h e
Boolean SRF (there is a one-to-one mapping from t h e s e t of index terms onto
t h e s e t of Boolean variables). Assuming t h a t t h e Boolean and binary SRFs a r e
processed against t h e same document collection by t h e extended and original
PRP-based retrieval methods, respectively, e a c h of t h e constituent output
document s e t s for t h e Boolean query also occurs a s one of t h e output's
constituent document s e t s for t h e binary SR, and both of these s e t s have equal
*
.
rank values
Moreover, t h e respective rankings of t h e output documents s e t s
would be identical if t h e Boolean SRF was a disjunction of a l l t h e possible
conjuncts formed from t h e Boolean SRF's variables, excluding t h e conjunct in
which all t h e variables are negated, o r equivalently, a disjunction of all t h e
Boolean SRF's variables. Therefore, t h e extended ranking scheme can be regarded
a s more general than t h e original.
In order t o illustrate this relationship between t h e extended and t h e original
PRP-based retrieval models, l e t us consider t h e Boolean SRF, q, = ( t , o r t,) and
"
not t,,,
-
-
where t,t, and t,, a r e t h e Boolean variables corresponding t o t h e index
--
-
t e r m s t I, t 2 and t3, respectively, while or, and, and not denote t h e Boolean
operations of disjunction, conjunction, and negation, respectively. Accordingly, t h e
disjunctive normal form
of qB for t h e Boolean s e t X = itl, t2, t3} is:
(qB))( = (not t l and t2 and not t ) or ( t l and not t and not t3)
3
2
o r ( t l and t
and not t3)
2
Thus, if qB is processed against a collection of documents represented by s e t s of
unweighted index terms, t h e resulting output c a n b e viewed as t h e union of t h e
t h r e e disjoint document sets: (i) t h e first containing those documents indexed by
but not t l and t3; (ii) t h e second containing those indexed by t l , but not t
t2'
2
and t3; and (iii) the third containing those indexed by t
and t but not t3. If
I
2
t h e query is represented by t h e binary SRF, q b = i t , t , t 1, then t h e ranked
1 2 3
output generated by t h e original PRP-based procedure would include all of t h e
document s e t s generated by t h e extended procedure with t h e s a m e rank values
assigned. Clearly, if q = t l o r t 2 o r t3, then t h e resulting disjunctive normal
B
form (for t h e same s e t X) would consist of seven conjuncts, and t h e ranked
document sets, a s well a s their rank values, would be identical t o those obtained
for q
b'
As can b e seen from t h e above discussion, t h e extended Boolean retrieval scheme
allows a wide range of user abilities in specifying a n information need. One
extreme is where t h e searcher c a n precisely represent such a need by providing
a list of terms, a s well a s t h e logical relationships between them, and t h e other
e x t r e m e is where t h e user is only able t o furnish a list of index t e r m s (without
any of t h e logical relationships). Thus t h e extended scheme is considerably more
flexible in t e r m s of system-user communication than t h e conventional Boolean
scheme. This is another significant advantageous f e a t u r e of t h e suggested
enhancement.
Any procedure t h a t can b e used t o estimate t h e probability of relevance of a
document represented by unweighted index terms could also be applied to a
hypothetical document t o determine i t s rank value. One such procedure, which is
discussed below, is based on Bayesian decision theory and utilizes information
concerning various frequency distributions of index terms within t h e document
collection (see, e.g., Refs. [ M I ) .
In t h e adopted procedure, a discriminant function, g(dH), plays a central role in
ranking t h e output documents. This function c a n be concisely expressed as:
k
g(dH) =
2
wixd(i)
+ C,
i=l
Provided, of course, t h a t only those Boolean variables t h a t occur in t h e
Boolean SRF have been taken into account by t h e extended ranking scheme.
The Effectivenes of Conventional Boolean Retrieval Systems
213
In this formula:
wi is t h e swcalled relevance weight of t h e index term, ti, corresponding t o t h e
ith variable in t h e Boolean SRF (the number of relevance weights which are
t o be calculated with respect t o a iven query is generally equal t o t h e
number of variables in t h e Boolean sRF?;
x
d~
(i) is t h e value of t h e index t e r m , ti, with regard
document, dH, which c a n t a k e one of two values:
I
if
the
index
term,
ti,
is
present
in
the
to
a
hypothetical
hypothetical
representation
d
or equivalently, if the ith term-related
H'
variable occurs unnegated in t h e corresponding conjunct; or
0
if
the
index
term,
ti,
absent
from
the
hypothetical
representation d
or equivalently, if t h e ith term-related
H'
variable occurs negated in t h e corresponding conjunct;
document
Boolean
document
Boolean
C - is a constant, interpreted a s t h e cutoff value 161, which is t h e same for
a l l t h e hypothetical documents (and therefore all of t h e documents in t h e
collection) with respect t o a given query.
In order t o calculate t h e value of g(dH) with regard t o a particular hypothetical
document d
t h e summation of t h e w;s for those index t e r m s which are
H'
present (x (i) = I) in i t s representation, dH, is t o be performed (or equivalently,
d
t h e summanon of t h e w.'s for those terms which correspond t o t h e unnegated
1
Boolean variables in t h e associated conjunct). An implicit role of t h e discriminant
function is t o classify e a c h hypothetical document (and t h e corresponding
document set) into one of two classes: either t h e class of relevant documents or
t h e class of nonrelevant documents. Specifically, t h e decision rule is t o classify a
given hypothetical document a s relevant if g(dH) > 0, and a s non-relevant if
g(dH) ( 0. However, t h e principal role of t h e discriminant function is t o rank
hypothetical documents, and simultaneously t h e retrieved documents, in decreasing
order of their probabiliteis of usefulness. More specifically, t h e greater t h e value
of g(d ) t h e higher t h e probability of relevance of t h e hypothetical document,
H
and thus all t h e hypothetical documents can be ranked accordingly. Since t h e
d ~ '
value of t h e constant C is t h e same f o r all of t h e documents in t h e collection,
i t will not affect the resulting ranking (although i t may be a factor in t h e
classification process). Accordirlgly, this constant c a n be ignored in generating the
ranked output, and will not be considered further in this paper. This will
considerably simplify further discussions without affecting t h e final conclusions.
Thus, t h e rank value of a hypothetical document, dH, c a n be determined by
k
g(d ) - 2 w.x (i) + C . The formula for wi can be shown t o be in t h e following
H I d
i= I
form [5-61:
<
-
n i)
w. = log
1
iii(l
where
-
ti)
c i is t h e probability of occurrence of t h e index t e r m ,
ti in a DR given
T.Radecki
214
t h a t it represents a relevant document, and g
is t h e probability of occurrence
of t; in a DR representing a non-relevant item. Thus, in order t o calculate t h e
and n ifor each index
w i 3 s i t is necessary t o estimate t h e probabilities r,
term, ti, associated with the Boolean SRF. One pmsibility is t o use the
occurrence characteristics of ti, for a sample document s e t that is representative
of the entire collection, and estimate these probabilities in terms of
proportions. Specifically, l e t N be t h e total number of documents
representative sample, in which R items have k e n assessed a s relevant
user. Thus, i t follows t h a t there a r e N - R non-relevant items in t h e
Furthermore, l e t ni denote t h e number of sample documents (relevant
simple
in the
by t h e
sample.
or not)
which a r e indexed by the term ti, and l e t ri stand for t h e number of relevant
sample
documents
indexed
by
Using
ti.
this
notation,
the
occurrence
characteristics f a a particular index term, t i , can b e presented conveniently by
means of a contingency table [ > 6 ] a s followr
Relevant
x
Non-relevant
(i) = I
n.
d~
(i) = 0
X
-
r.
N - n . - R + r i
n.
1
N
-
ni
d~
Accordingly
L i and
ni can b e estimated by the following ratios:
and
".
I
n.-r.
=-
I
I
'
N-R
Thus, t h e formula f o r wi can now be expressed a a
w.= log
ril (R-ri)
hi-ri)l (N
-
n.-I R
+ ri)
This is in f a c t t h e well-known index term weighting formula referred t o a s F4
by Robertson and Sparck Jones [ 5 ] . The following simple example illustrates t h e
use of t h e adopted F4 formula i n determining t h e relevance weights f a index
terms, and subsequently generating t h e ranked output from a given Boolean SRF.
215
The Effectiveness of Conventio~l
Boolean RetrievalSystems
Example. Suppose a query submitted t o a conventional Boolean retrieval system is
represented by the Boolean SRF, qB, as follows:
48
= (t. and not t2) or (t2 and t3)
1
.
Thus, the disjunctive normal form, (qB)X, for the set X = itl,
(qg&
z
(not tl and t2 and t ) or (t
3
(tl
Accordingly,
and not t
2
01
.
and t2 and t3)
the retrieved output, Y (qB), may equivalently be viewed as the union
conjuncts: (i) not t
2
and not t:, and not t )
3
and t3) or (tl
of the disjoint document sets, Al,
not t
I
t2, t3) is:
,,
A2, A3 and ,A
corresponding to the
and t 2 and t3: (ii) tl and not t2 and not t3; (iii) tl and
1
and t ; and (iv) t and t and t3, respectively. That is,
3
I
2
u(qB) = Al
A2
A3
Ag
.
The first set, A1, contains those documents in the collection which are indexed
by the terms t and t but not tl (it is of no importance whether the items
2
3
are indexed by any of the other terms) : A contains those documents which are
2
indexed by t but not t2 and t3; and so forth.
1
As already mentioned, from the perspective of a conventional Boolean search
scheme,
all
the
items i n an output's
constituent document set are
indistinguishable. In other words, the documents contained in such a set are
considered by the system to be represented by the same set of index terms
(although, i n fact, they may be indexed by other terms that are not referenced
i n the query). Accordingly, the documents in A,, A
,,
A 2 and A, are viewed as
A
itl,
t2, t3},
respectively.
I n compliance
*
<
- the
-
being characterized by the index term sets it,,
t?), (t,l,
with
-
it,, t?), and
-
extended document
ranking
scheme, these sets of index terms can be regarded as the representations of the
hypothetical documents, dH
dH2, dH
and dH
respectively. Therefore, the
,
system's output
Formally,
can
be
hewed
,
as me set
,
of
4 four hypothetical documents.
We can now apply the adopted F4 formula t o determine the rank values of these
hypothetical documents i n terms of their probabilities of relevance. Since there
exists a one-to-one correspondence between the hypothetical documents and the
output's constituent document sets, the rank values of the hypothetical documents
are also the rank values of the associated document sets. Thus, having
determined the rank values of the hypothetical documents, we can determine the
ranking of the corresponding output's document sets i n descending order of their
T.Radecki
216
probabilities of relevance. In order t o use t h e F4 formula, a representative
sample of t h e document collection is required. Assume t h a t t h e sample document
s e t consists of 10 items, d l , d2, d3, d4, dg, d6, d7, dg, dg, dlO, with the
following representations:
Let us further assume t h a t of t h e sample documents, dl, d4, d6, dg and d9 have
been assessed a s relevant by the user.
This information is sufficient t o determine t h e relevance weights f o r t h e
individual index terms associated with qg. Specifically, f o r the term tl, we have
r l = 3 and nl = 5; Accordingly, t h e related contingency table is (N = 10, R = 5,
r1 = 3, n1 = 5):
Relevant
x
(1)
5
Non-relevant
1
d~
Xd (I) = 0
H
Thus,
w1 = log
The contingency table for t
2
#$ = log 914
= 0.3522
is (N = 10, R = 5, r
Relevant
.
2 -- 4, n2 = 7):
Non-relevant
Hence,
w2 = log '/1
312 = log 813 = 0.4260.
The Effectiveness of ConventionnlBoolean Retrieval Systems
Finally, t h e contingency table for t
3 is (N
:
Relevant
10, R = 5, r 3
--
217
2, n3
:
6):
Non-relevant
Theref ore,
m =log
w3 = Log 213
2/12 = log 116 =
-
.
0.7782
These index term relevance weights w i l l now allow us t o calculate a rank value,
dli, for e a c h hypothetical document,
and simultaneously for t h e
d ~ i )
corresponding output document set, A.
.
1
These rank values are:
=
-
0.3522,
and
This implies t h e following ranking, Ord Y (q): Ord
The
A2
document
set
is
presumably
the
(q ) =
B
most likely
Y
<
A ~ A
,,
to
Al, A3
contain
>.
relevant
documents. Thus, i t should be presented t o t h e user first, followed, if needed, by
and s o forth.
t h e next highest ranked document set, A
4'
CONCLUDING REMARKS
This paper has presented a theoretical framework for incorporating a PRP-based
document ranking mechanism into a conventional Boolean retrieval system. This
extended system exhibits a number of advantages which include:
(i) accommodating traditional Boolean SRF's used in commercial retrieval systems
without changing any of t h e fundamental principles upon which these systems
a r e based;
(ii) providing for t h e retrieval of documents in
probabilities of usefulness t o t h e user;
(iii) facilitating a system-user adaptation mechanism
weights for a query's index terms;
descending
order
of
their
by determining relevance
T. Radecki
2la
(iv) allowing for a wide range of user abilities in specifying an information need.
T o summarize, by demonstrating t h e suitability of Boolean logic for exploiting t h e
Probability Ranking Principle, i t has been shown t h a t t h e implementation of an
extended document ranking procedure should result in considerable improvement in
t h e retrieval effectiveness of traditional search techniques at a n acceptable cost.
BIBLIOGRAPHY
Robertson, S.E.,
The Probability Ranking Principle in IR, Journal of
Documentation 3 3 (4) (1977) p. 294-304.
Maron, M.E.,
Kuhns, J.L.,
On Relevance, Probabilistic Indexing and
lnformation Retrieval. Journal of t h e Association for C o m.~ u t i -n pMachinery
7 (3) (1960) p. 216-244.
Radecki, T., Reducing t h e Perils of Merging Boolean and Weighted
Retrieval systems, Journal of Documentation 38 (3) (1982) p. 207-211.
Radecki, T., A Probabilistic Approach t o Information Retrieval in Systems
with Boolean Search Request Formulations, Journal of t h e American
Society for Information Science 3 3 (6) (1982) p. 365-370.
Robertson, S.E., Sparck Jones, K., Relevance Weighting of Search Terms,
Journal of t h e American Society for lnformation Science 27 (3) (1976)
p. 129-146.
Van R i j s k r g e n , C.J., Information Retrieval. (2nd ed. Butterworths, London,
11K
.. 19791.
. ,.
Cooper, W.S.,
Huizinga, P., The Maximum Entropy Principle and i t s
Application t o t h e Design of Probabilistic Retrieval Systems, Information
Technology: Research and Development 1 (2) (1982) p. 99-112.
Heine, M.H., A Simple, Intelligent Front End for lnformation Retrieval
Systems Using Boolean Logic, lnformation Technology: Research and
Development I (4) (1982) p. 247-260.
C o o E r . W.S.,
Exoloitine t h e Maximum Entroov Princiole t o Increase
~ e t ; i e v a l ~ f f & t i v e n e s s , Journal of t h e ~ m i r i c a y~ o c l e t y f' o r lnformation
Science 34 (1) (1983) p. 31-39.
Arnold, B.H., Logic and Boolean Algebra, (Prentice-Hall, Englewood Cliffs,
New Jersey, USA, 1962).
-
A
>
>
-