
Query Language
Baeza-Yates and Navarro
Modern Information Retrieval, 1999
Chapter 4
Query Language

Keyword-based Querying
» Single-word Queries
» Context Queries
– Phrase
– Proximity
» Boolean Queries
» Natural Language
Query Language (Cont.)

Pattern Matching
» Words
» Prefixes
» Suffixes
» Substrings
» Ranges
» Allowing errors
» Regular expressions
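The following minimal Python sketch shows one way these pattern types can be checked against a single word; the helper names and examples are illustrative (not from the book), "ranges" is read as a lexicographical word range, and "allowing errors" is tested with a plain Levenshtein distance.

```python
import re

def matches(pattern_type: str, pattern: str, word: str) -> bool:
    """Illustrative checks for the pattern types listed above."""
    if pattern_type == "word":
        return word == pattern
    if pattern_type == "prefix":
        return word.startswith(pattern)
    if pattern_type == "suffix":
        return word.endswith(pattern)
    if pattern_type == "substring":
        return pattern in word
    if pattern_type == "range":            # lexicographical range "lo..hi"
        lo, hi = pattern.split("..")
        return lo <= word <= hi
    if pattern_type == "regex":
        return re.fullmatch(pattern, word) is not None
    raise ValueError(pattern_type)

def within_errors(pattern: str, word: str, k: int) -> bool:
    """'Allowing errors': edit distance between pattern and word is <= k."""
    prev = list(range(len(word) + 1))
    for i, pc in enumerate(pattern, 1):
        cur = [i]
        for j, wc in enumerate(word, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (pc != wc)))   # substitution
        prev = cur
    return prev[-1] <= k

print(matches("prefix", "comput", "computer"))      # True
print(matches("range", "abc..acme", "accent"))      # True
print(within_errors("retrieval", "retreival", 2))   # True: two substitutions
```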
Query Language (Cont.)

Structural Queries
» Form-like fixed structures
» Hypertext structure
» Hierarchical structure
Structural Queries
[Figure: (a) form-like fixed structure, (b) hypertext structure, and (c) hierarchical structure]
Hierarchical Structure
[Figure: an example of a hierarchical structure: the page of a book (Chapter 4, with sections "4.1 Introduction" and "4.4 Structural Queries"), its schematic view (chapter, sections, titles, figures), and a parsed query to retrieve the figure in a section with title "structural"]
The Types of Queries
[Figure: a taxonomy of query types]
» keywords and context
  – basic queries: words, phrases, proximity
  – Boolean queries, fuzzy Boolean, natural language
» pattern matching
  – words, prefixes, suffixes, substrings
  – errors, regular expressions, extended patterns
» structural queries
Query Operations
Baeza-Yates
Modern Information Retrieval, 1999
Chapter 5
Query Modification

Improving initial query formulation
» Relevance feedback
– approaches based on feedback information from users
» Local analysis
– approaches based on information derived from the set of
documents initially retrieved (called the local set of documents)
» Global analysis
– approaches based on global information derived from the
document collection
Relevance Feedback

Relevance feedback process
» it shields the user from the details of the query reformulation
process
» it breaks down the whole searching task into a sequence of
small steps which are easier to grasp
» it provides a controlled process designed to emphasize
some terms and de-emphasize others

Two basic techniques
» Query expansion
– addition of new terms from relevant documents
» Term reweighting
– modification of term weights based on the user relevance
judgement
Vector Space Model

Definition
» w_{i,j}: the weight of the i-th term in the vector for document d_j
» w_{i,k}: the weight of the i-th term in the vector for query q_k
» t: the number of unique terms in the data set

  \vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})
  \vec{q_k} = (w_{1,k}, w_{2,k}, \ldots, w_{t,k})

  similarity(d_j, q_k) = \sum_{i=1}^{t} w_{i,j} \times w_{i,k}

  w_{i,j} = \frac{(0.5 + 0.5\,\frac{tf_{i,j}}{\max_k tf_{k,j}})\, idf_i}{\sqrt{\sum_{k=1}^{t} (0.5 + 0.5\,\frac{tf_{k,j}}{\max_k tf_{k,j}})^2\, idf_k^2}}
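As a concrete illustration, here is a minimal Python sketch of this augmented tf-idf weighting and dot-product similarity; the toy documents, tokens, and function names are illustrative, not from the book.

```python
import math
from collections import Counter

def doc_weights(tokens, idf):
    """Augmented tf x idf weights, normalized as in the formula above."""
    tf = Counter(tokens)
    max_tf = max(tf.values())
    raw = {t: (0.5 + 0.5 * tf[t] / max_tf) * idf[t] for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

docs = [["query", "languages", "for", "text"],
        ["ranking", "text", "retrieval"],
        ["boolean", "query", "model"]]
N = len(docs)
df = Counter(t for d in docs for t in set(d))
idf = {t: math.log(N / df[t]) for t in df}

vectors = [doc_weights(d, idf) for d in docs]
query = doc_weights(["query", "text"], idf)   # the query treated as a short document

for j, v in enumerate(vectors):
    sim = sum(w * v.get(t, 0.0) for t, w in query.items())
    print(j, round(sim, 3))
```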
Query Expansion and Term Reweighting for the Vector Model

Ideal situation
» CR: set of relevant documents among all documents in the collection
  \vec{q}_{opt} = \frac{1}{|C_R|} \sum_{\vec{d_j} \in C_R} \vec{d_j} - \frac{1}{N - |C_R|} \sum_{\vec{d_j} \notin C_R} \vec{d_j}
Rocchio (1965, 1971)
» R: set of relevant documents, as identified by the user among
the retrieved documents
» S: set of non-relevant documents among the retrieved
documents
  \vec{q}_m = \alpha \vec{q} + \frac{\beta}{|R|} \sum_{\vec{d_j} \in R} \vec{d_j} - \frac{\gamma}{|S|} \sum_{\vec{d_j} \in S} \vec{d_j}
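A minimal sketch of this update rule over sparse dict-based vectors; the default alpha, beta, gamma values and the toy vectors below are illustrative choices, not prescribed by the chapter.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Sparse Rocchio update; vectors are dicts {term: weight}."""
    def add(acc, vec, scale):
        for t, w in vec.items():
            acc[t] = acc.get(t, 0.0) + scale * w
    qm = {}
    add(qm, query, alpha)
    for d in relevant:
        add(qm, d, beta / len(relevant))         # assumes R is non-empty
    for d in nonrelevant:
        add(qm, d, -gamma / len(nonrelevant))    # assumes S is non-empty
    return {t: w for t, w in qm.items() if w > 0}  # negative weights commonly dropped

q = {"query": 1.0, "text": 1.0}
R = [{"query": 0.8, "languages": 0.6}]
S = [{"ranking": 0.9, "text": 0.4}]
print(rocchio(q, R, S))
```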
Rocchio’s Algorithm


Ide_Regular (1971)

  \vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d_j} \in R} \vec{d_j} - \gamma \sum_{\vec{d_j} \in S} \vec{d_j}

Ide_Dec_Hi

  \vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d_j} \in R} \vec{d_j} - \gamma \max_{non\text{-}relevant}\{\vec{d_j}\}

  (where the max is the highest-ranked non-relevant document)

 Parameters
» usually \alpha = \beta = \gamma = 1
» in general \alpha, \beta, \gamma > 0
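For comparison, a sketch of Ide_Dec_Hi under the same dict-based representation as the Rocchio sketch above; unlike Rocchio it does not normalize by |R| or |S|, and it subtracts only the highest-ranked non-relevant document.

```python
# Ide_Dec_Hi sketch with alpha = beta = gamma = 1: unnormalized sum over
# relevant documents, minus only the top-ranked non-relevant document.
def ide_dec_hi(query, relevant, top_nonrelevant):
    qm = dict(query)
    for d in relevant:
        for t, w in d.items():
            qm[t] = qm.get(t, 0.0) + w
    for t, w in top_nonrelevant.items():
        qm[t] = qm.get(t, 0.0) - w
    return qm

q = {"query": 1.0, "text": 1.0}
R = [{"query": 0.8, "languages": 0.6}]
print(ide_dec_hi(q, R, {"ranking": 0.9, "text": 0.4}))
```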
Probabilistic Model

Definition
» pi: the probability of observing term ti in the set of relevant
documents
» qi: the probability of observing term ti in the set of
nonrelevant documents
  sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,j}\, w_{i,q} \left( \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \right)

Initial search assumption
» pi is constant for all terms ti (typically 0.5)
» qi can be approximated by the distribution of ti in the whole
collection
  w_{t_i} = \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} = \log \frac{N - df_i}{df_i} \approx \log \frac{N}{df_i} = idf_i
Term Reweighting for the
Probabilistic Model


Robertson and Sparck Jones (1976)
With relevance feedback from user
N: the number of documents in the collection
R: the number of relevant documents for query q
ni: the number of documents having term ti
ri: the number of relevant documents having term ti
Contingency table (document indexing vs. document relevance):

                           relevant    non-relevant        total
  documents with t_i       r_i         n_i - r_i           n_i
  documents without t_i    R - r_i     N - n_i - R + r_i   N - n_i
  total                    R           N - R               N
Term Reweighting for the
Probabilistic Model (cont.)

Initial search assumption
» pi is constant for all terms ti (typically 0.5)
» qi can be approximated by the distribution of ti in the whole collection

  sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,j}\, w_{i,q} \log \frac{N - n_i}{n_i}

 With relevance feedback from users
» pi and qi can be approximated by

  p_i = \frac{r_i}{R} \qquad q_i = \frac{n_i - r_i}{N - R}

» hence the term weight is updated by

  sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,j}\, w_{i,q} \left( \log \frac{r_i (N - n_i - R + r_i)}{(R - r_i)(n_i - r_i)} \right)
Term Reweighting for the
Probabilistic Model (Cont.)

However, the last formula poses problems for certain small values of R
and ri (e.g., R = 1 and ri = 0); the usual adjustment adds 0.5:
  p_i = \frac{r_i + 0.5}{R + 1} \qquad q_i = \frac{n_i - r_i + 0.5}{N - R + 1}

 Instead of 0.5, alternative adjustments have been proposed

  p_i = \frac{r_i + \frac{n_i}{N}}{R + 1} \qquad q_i = \frac{n_i - r_i + \frac{n_i}{N}}{N - R + 1}
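A minimal sketch of this feedback weight with both adjustments; the function name and the toy counts are illustrative.

```python
import math

def feedback_weight(N, R, n_i, r_i, smoothing="0.5"):
    """log(p(1-q) / (q(1-p))) with smoothed estimates of p_i and q_i.

    smoothing="0.5" uses the constant 0.5; smoothing="df" uses the
    alternative adjustment n_i / N mentioned above."""
    adj = 0.5 if smoothing == "0.5" else n_i / N
    p = (r_i + adj) / (R + 1)
    q = (n_i - r_i + adj) / (N - R + 1)
    return math.log(p * (1 - q) / (q * (1 - p)))

# Toy counts: 1000 docs, 10 judged relevant, term in 50 docs, 8 of them relevant.
print(round(feedback_weight(1000, 10, 50, 8), 2))          # ~4.33
print(round(feedback_weight(1000, 10, 50, 8, "df"), 2))
```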
Term Reweighting for the
Probabilistic Model (Cont.)

Characteristics
» Advantage
– the term reweighting is optimal under the assumptions of
   term independence
   binary document indexing (w_{i,q} ∈ {0,1} and w_{i,j} ∈ {0,1})
» Disadvantage
– no query expansion is used
– weights of terms in the previous query formulations are also
disregarded
– document term weights are not taken into account during the
feedback loop
Evaluation of relevance feedback

Standard evaluation method (recall-precision) is not suitable
» because the relevant documents used to reweight the query terms are
  moved to higher ranks

The residual collection method
» the set of all documents minus the set of feedback
documents provided by the user
» because highly ranked documents are removed from the
collection, the recall-precision figures for qm tend to be lower
than the figures for the original query q
» as a basic rule of thumb, any experimentation involving relevance
  feedback strategies should always evaluate recall-precision figures
  relative to the residual collection
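A minimal sketch of the residual-collection protocol; the document identifiers and the precision-at-k choice are illustrative.

```python
# Residual-collection evaluation: feedback documents are removed from both
# the ranking and the relevance judgements before computing precision.
def residual_precision_at_k(ranking, relevant, feedback_docs, k=10):
    residual = [d for d in ranking if d not in feedback_docs]
    rel = relevant - feedback_docs
    top = residual[:k]
    return sum(1 for d in top if d in rel) / len(top) if top else 0.0

ranking  = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d3", "d9", "d2"}
feedback = {"d3"}                     # documents the user already judged
print(residual_precision_at_k(ranking, relevant, feedback, k=3))  # 1/3
```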
Automatic Local Analysis

Definition
» local document set Dl : the set of documents retrieved by a
query
» local vocabulary Vl : the set of all distinct words in Dl
» stemmed vocabulary Sl : the set of all distinct stems derived
from Vl

Building local clusters
» association clusters
» metric clusters
» scalar clusters
Association Clusters

Idea
» co-occurrence of stems (or terms) inside documents
  c(k_u, k_v) = \sum_{j=1}^{|D_l|} f_{u,j} \times f_{v,j}

– f_{u,j}: the frequency of a stem k_u in a document d_j
» local association cluster for a stem k_u
– the set of the k largest values of c(k_u, k_v)
» given a query q, find clusters for the |q| query terms
» normalized form

  s(k_u, k_v) = \frac{c(k_u, k_v)}{c(k_u, k_u) + c(k_v, k_v) - c(k_u, k_v)}
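A minimal sketch over a toy local document set; freqs[u][j] plays the role of f_{u,j}, and the stem names are illustrative.

```python
from collections import defaultdict

def association_clusters(freqs, stems, top_k=2):
    """Per-stem clusters of the top_k most associated other stems."""
    c = defaultdict(float)
    for u in stems:
        for v in stems:
            c[u, v] = sum(fu * fv for fu, fv in zip(freqs[u], freqs[v]))
    def s(u, v):  # normalized association, as in the formula above
        return c[u, v] / (c[u, u] + c[v, v] - c[u, v])
    clusters = {}
    for u in stems:
        neighbours = sorted((v for v in stems if v != u),
                            key=lambda v: s(u, v), reverse=True)
        clusters[u] = neighbours[:top_k]
    return clusters

# Toy frequencies of three stems across three local documents.
freqs = {"databas": [3, 0, 1], "queri": [2, 1, 1], "index": [0, 2, 1]}
print(association_clusters(freqs, list(freqs), top_k=1))
```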
Metric Clusters

Idea
» consider the distance between two terms in the same cluster

Definition
» V(ku): the set of keywords which have the same stem form as ku
» r(k_i, k_j): the number of words between k_i and k_j

  c(k_u, k_v) = \sum_{k_i \in V(k_u)} \sum_{k_j \in V(k_v)} \frac{1}{r(k_i, k_j)}

» normalized form

  s(k_u, k_v) = \frac{c(k_u, k_v)}{|V(k_u)| \times |V(k_v)|}
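A minimal sketch with toy word positions from a single document; the absolute offset difference stands in for r(k_i, k_j), clamped to at least 1 to avoid division by zero for adjacent occurrences.

```python
def metric_correlation(V_u, V_v, positions):
    """Sum of inverse word distances between all variant occurrences."""
    c = 0.0
    for ki in V_u:
        for kj in V_v:
            for pi in positions.get(ki, []):
                for pj in positions.get(kj, []):
                    c += 1.0 / max(1, abs(pi - pj))
    return c

# Toy word offsets; V maps each stem to its word variants.
positions = {"query": [2, 40], "queries": [15], "expand": [4], "expansion": [17]}
V = {"queri": ["query", "queries"], "expan": ["expand", "expansion"]}

c = metric_correlation(V["queri"], V["expan"], positions)
s = c / (len(V["queri"]) * len(V["expan"]))   # normalized form
print(round(c, 3), round(s, 3))
```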
Scalar Clusters

Idea
» two stems with similar neighborhoods have a synonymy
  relationship

Definition
» c_{u,v} = c(k_u, k_v)
» vectors of correlation values for stems k_u and k_v

  \vec{s_u} = (c_{u,1}, c_{u,2}, \ldots, c_{u,t})
  \vec{s_v} = (c_{v,1}, c_{v,2}, \ldots, c_{v,t})

» scalar association matrix

  S_{u,v} = \frac{\vec{s_u} \cdot \vec{s_v}}{|\vec{s_u}| \times |\vec{s_v}|}
» scalar clusters
– the set of k largest values of scalar association
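A minimal sketch: the cosine of two rows of a toy correlation matrix; the stem names and values are illustrative.

```python
import math

def scalar_similarity(cu, cv):
    """Cosine between two stems' vectors of correlation values."""
    dot = sum(a * b for a, b in zip(cu, cv))
    return dot / (math.sqrt(sum(a * a for a in cu)) *
                  math.sqrt(sum(b * b for b in cv)))

# Rows of a toy correlation matrix c(k_u, k_v) for three stems.
c = {"car":  [10.0, 6.0, 0.5],
     "auto": [6.0,  8.0, 0.4],
     "bank": [0.5,  0.4, 9.0]}
print(round(scalar_similarity(c["car"], c["auto"]), 3))  # high: similar rows
print(round(scalar_similarity(c["car"], c["bank"]), 3))  # low: dissimilar rows
```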
Automatic Global Analysis


A thesaurus-like structure
Short history
» Until the beginning of the 1990s, global analysis was
considered to be a technique which failed to yield consistent
improvements in retrieval performance with general
collections
» This perception has changed with the appearance of modern
procedures for global analysis
Query Expansion based on a
Similarity Thesaurus

» Idea by Qiu and Frei (1993)
» Similarity thesaurus is based on term to term relationships
rather than on a matrix of co-occurrence
» Terms for expansion are selected based on their similarity to
the whole query rather than on their similarities to individual
query terms

Definition
» N: total number of documents in the collection
» t: total number of terms in the collection
» tf_{i,j}: occurrence frequency of term k_i in the document d_j
» t_j: the number of distinct index terms in the document d_j
» itf_j: the inverse term frequency for document d_j

  itf_j = \log \frac{t}{t_j}
Similarity Thesaurus

Each term is associated with a vector
  \vec{k_i} = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})

» where w_{i,j} is a weight associated to the index-document pair [k_i, d_j]

  w_{i,j} = \frac{(0.5 + 0.5\,\frac{tf_{i,j}}{\max_k tf_{i,k}})\, itf_j}{\sqrt{\sum_{k=1}^{N} (0.5 + 0.5\,\frac{tf_{i,k}}{\max_l tf_{i,l}})^2\, itf_k^2}}

 The relationship between two terms k_u and k_v is

  c_{u,v} = \vec{k_u} \cdot \vec{k_v} = \sum_{j=1}^{N} w_{u,j} \times w_{v,j}

» Note that this is a variation of the correlation measure used for computing scalar association matrices
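A minimal sketch of building the term vectors and one correlation value c_{u,v}; the frequencies are toy values, and a term absent from a document is assumed to get weight 0.

```python
import math

def term_vector(tf_row, itf):
    """Term-over-documents vector, weighted with itf as in the slide."""
    max_tf = max(tf_row) or 1
    raw = [(0.5 + 0.5 * tf / max_tf) * itf[j] if tf > 0 else 0.0
           for j, tf in enumerate(tf_row)]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

t = 5                                            # index terms in the collection
t_j = [3, 2, 4]                                  # distinct terms per document
itf = [math.log(t / x) for x in t_j]             # itf_j = log(t / t_j)

tf = {"databas": [3, 0, 1], "queri": [2, 1, 1]}  # toy term-document frequencies
k = {w: term_vector(row, itf) for w, row in tf.items()}

# c_{u,v}: dot product of the two term vectors
c_uv = sum(a * b for a, b in zip(k["databas"], k["queri"]))
print(round(c_uv, 3))
```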
Term weighting vs. Term concept space
» Standard term weighting (document space): term k_i is weighted within document d_j using idf_i

  w_{i,j} = \frac{(0.5 + 0.5\,\frac{tf_{i,j}}{\max_k tf_{k,j}})\, idf_i}{\sqrt{\sum_{k=1}^{t} (0.5 + 0.5\,\frac{tf_{k,j}}{\max_k tf_{k,j}})^2\, idf_k^2}}

» Term-concept space: document d_j weights the vector of term k_i using itf_j

  w_{i,j} = \frac{(0.5 + 0.5\,\frac{tf_{i,j}}{\max_k tf_{i,k}})\, itf_j}{\sqrt{\sum_{k=1}^{N} (0.5 + 0.5\,\frac{tf_{i,k}}{\max_l tf_{i,l}})^2\, itf_k^2}}
Query Expansion Procedure with
Similarity Thesaurus
1. Represent the query in the concept space by using the representation of the index terms

  \vec{q} = \sum_{k_u \in q} w_{u,q}\, \vec{k_u}

2. Compute the similarity sim(q, k_v) between each term k_v and the whole query

  sim(q, k_v) = \vec{q} \cdot \vec{k_v} = \sum_{k_u \in q} w_{u,q}\, c_{u,v}

3. Expand the query with the top r ranked terms according to sim(q, k_v), weighting each added term by

  w_{v,q'} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}}
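A minimal sketch of the three steps, assuming the term-term correlations c_{u,v} have already been computed as above; the terms and values are toys.

```python
def expand_query(query_weights, c, r=2):
    """Expand a dict query {term: weight} using correlations c[u][v]."""
    vocab = {v for row in c.values() for v in row}
    # Step 2: sim(q, k_v) = sum over query terms k_u of w_{u,q} * c_{u,v}
    sims = {v: sum(w * c.get(u, {}).get(v, 0.0)
                   for u, w in query_weights.items()) for v in vocab}
    # Step 3: top r terms not already in the query, each weighted by
    # sim(q, k_v) divided by the sum of the original query weights
    top = sorted((v for v in vocab if v not in query_weights),
                 key=lambda v: sims[v], reverse=True)[:r]
    total = sum(query_weights.values())
    expanded = dict(query_weights)
    expanded.update({v: sims[v] / total for v in top})
    return expanded

c = {"databas": {"databas": 1.0, "sql": 0.7, "index": 0.4},
     "queri":   {"queri": 1.0, "sql": 0.5, "expan": 0.3}}
print(expand_query({"databas": 2.0, "queri": 1.0}, c, r=1))
```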
Example of Similarity Thesaurus
The distance of a given term kv to the query centroid QC might be
quite distinct from the distances of kv to the individual query
terms
[Figure: query terms ka and kb with centroid QC = {ka, kb}; other terms kv, ki, kj lie at different distances from QC than from ka or kb individually]
Query Expansion based on a
Similarity Thesaurus (Cont.)
» A document d_j is represented in the term-concept space by

  \vec{d_j} = \sum_{k_v \in d_j} w_{v,j}\, \vec{k_v}

» If the original query q is expanded to include all the t index terms, then the similarity sim(q, d_j) between the document d_j and the query q can be computed as

  sim(q, d_j) = \left( \sum_{k_u \in q} w_{u,q}\, \vec{k_u} \right) \cdot \left( \sum_{k_v \in d_j} w_{v,j}\, \vec{k_v} \right) = \sum_{k_v \in d_j} \sum_{k_u \in q} w_{v,j}\, w_{u,q}\, c_{u,v}

– which is similar to the generalized vector space model
Query Expansion based on a
Statistical Thesaurus

Idea by Crouch and Yang (1992)
» Use complete link algorithm to produce small and tight
clusters
» Use term discrimination value to select terms for entry into a
particular thesaurus class

Term discrimination value
» A measure of the change in space separation which occurs
when a given term is assigned to the document collection
Term Discrimination Value

Terms
» good discriminators: (terms with positive discrimination values)
– index terms
» indifferent discriminators: (near-zero discrimination values)
– thesaurus class
» poor discriminators: (negative discrimination values)
– term phrases

Document frequency dfk
» dfk > n/10: high frequency term (poor discriminators)
» dfk < n/100: low frequency term (indifferent discriminators)
» n/100 ≤ dfk ≤ n/10: good discriminators
Statistical Thesaurus

Term discrimination value theory
» the terms which make up a thesaurus class must be
indifferent discriminators

The proposed approach
» cluster the document collection into small, tight clusters
» A thesaurus class is defined as the intersection of all the low
frequency terms in that cluster
» documents are indexed by the thesaurus classes
» the thesaurus classes are weighted by

  w_{t_C} = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|}
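A minimal sketch of this class weighting, assuming the complete-link clustering and the frequency thresholds above have already produced the classes; the class names and weights are toys.

```python
# Each thesaurus class gathers the low-frequency terms of one tight cluster;
# its weight is the average weight of its member terms, per the formula above.
def class_weight(member_weights):
    return sum(member_weights) / len(member_weights)

classes = {"C1": [0.12, 0.08, 0.10],   # term weights within one cluster
           "C2": [0.30, 0.26]}
for name, weights in classes.items():
    print(name, round(class_weight(weights), 3))
```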
Discussion

Query expansion
» useful in practice
» still a little-explored technique

Trends and research issues
» The combination of local analysis, global analysis, visual
displays, and interactive interfaces is also a current and
important research problem