Query Language
Baeza-Yates and Navarro
Modern Information Retrieval, 1999
Chapter 4
Query Language
Keyword-based Querying
» Single-word Queries
» Context Queries
– Phrase
– Proximity
» Boolean Queries
» Natural Language
Query Language (Cont.)
Pattern Matching
» Words
» Prefixes
» Suffixes
» Substrings
» Ranges
» Allowing errors
» Regular expressions
Query Language (Cont.)
Structural Queries
» Form-like fixed structures
» Hypertext structure
» Hierarchical structure
Structural Queries
[Figure: (a) form-like fixed structure, (b) hypertext structure, and
(c) hierarchical structure]
Hierarchical Structure
[Figure: an example of a hierarchical structure — the page of a book
(Chapter 4, with sections 4.1 Introduction, "We cover in this chapter the
different kinds of ...", and 4.4 Structural Queries), its schematic view
(a chapter node with section, title, and figure nodes), and a parsed query
to retrieve the figure in a section whose title contains "structural"]
The Types of Queries
[Figure: a taxonomy of query types — keywords and context (words, phrases,
proximity) as basic queries; Boolean queries, fuzzy Boolean, and natural
language built on top of them; pattern matching (words, prefixes, suffixes,
substrings, allowing errors, regular expressions, extended patterns); and
structural queries]
Query Operations
Baeza-Yates, 1999
Modern Information Retrieval
Chapter 5
Query Modification
Improving initial query formulation
» Relevance feedback
– approaches based on feedback information from users
» Local analysis
– approaches based on information derived from the set of
documents initially retrieved (called the local set of documents)
» Global analysis
– approaches based on global information derived from the
document collection
Relevance Feedback
Relevance feedback process
» it shields the user from the details of the query reformulation
process
» it breaks down the whole searching task into a sequence of
small steps which are easier to grasp
» it provides a controlled process designed to emphasize
some terms and de-emphasize others
Two basic techniques
» Query expansion
– addition of new terms from relevant documents
» Term reweighting
– modification of term weights based on the user relevance
judgement
Vector Space Model
Definition
» wi,j: the weight of the ith term in the vector for document dj
» wi,k: the weight of the ith term in the vector for query qk
» t: the number of unique terms in the data set

    \vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})
    \vec{q}_k = (w_{1,k}, w_{2,k}, \ldots, w_{t,k})

    similarity(d_j, q_k) = \sum_{i=1}^{t} w_{i,j} \cdot w_{i,k}

    w_{i,j} = \frac{(0.5 + 0.5 \frac{tf_{i,j}}{\max_k tf_{k,j}}) \, idf_i}
                   {\sqrt{\sum_{k=1}^{t} (0.5 + 0.5 \frac{tf_{k,j}}{\max_l tf_{l,j}})^2 \, idf_k^2}}
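The weighting and similarity above can be sketched in a few lines of Python
(function and variable names are illustrative, not from the chapter):

```python
import math

def tfidf_vector(tf, df, N):
    """Augmented-tf x idf weights for one document, cosine-normalized,
    following the slide's w_{i,j} formula. tf maps term -> frequency in
    the document, df maps term -> document frequency, N = collection size."""
    max_tf = max(tf.values())
    raw = {t: (0.5 + 0.5 * f / max_tf) * math.log(N / df[t])
           for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

def similarity(d, q):
    """Inner product between a document vector and a query vector."""
    return sum(w * q.get(t, 0.0) for t, w in d.items())
```

With the normalization in the denominator, each document vector has unit
length, so the inner product behaves like a cosine similarity.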
Query Expansion and Term Reweighting for the Vector Model
Ideal situation
» CR: set of relevant documents among all documents in the collection

    \vec{q}_{opt} = \frac{1}{|C_R|} \sum_{d_j \in C_R} \vec{d}_j
                  - \frac{1}{N - |C_R|} \sum_{d_j \notin C_R} \vec{d}_j

Rocchio (1965, 1971)
» R: set of relevant documents, as identified by the user among the
  retrieved documents
» S: set of non-relevant documents among the retrieved documents

    \vec{q}_m = \alpha \vec{q} + \frac{\beta}{|R|} \sum_{d_j \in R} \vec{d}_j
              - \frac{\gamma}{|S|} \sum_{d_j \in S} \vec{d}_j
Rocchio’s Algorithm
Ide_Regular (1971)

    \vec{q}_m = \alpha \vec{q} + \beta \sum_{d_j \in R} \vec{d}_j
              - \gamma \sum_{d_j \in S} \vec{d}_j

Ide_Dec_Hi

    \vec{q}_m = \alpha \vec{q} + \beta \sum_{d_j \in R} \vec{d}_j
              - \gamma \, \max_{rank}\{\vec{d}_j \mid d_j \in S\}

  – max_rank selects the highest-ranked non-relevant document
Parameters
» \alpha = \beta = \gamma = 1 (all parameters > 0)
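A minimal sketch of Rocchio reformulation over term-weight dictionaries
(names are illustrative; clipping negative weights to zero is a common
practical choice, not something the slide states):

```python
def rocchio(q, rel, nonrel, alpha=1.0, beta=1.0, gamma=1.0):
    """q_m = alpha*q + (beta/|R|) * sum(rel) - (gamma/|S|) * sum(nonrel).
    All vectors are dicts mapping term -> weight."""
    terms = set(q)
    for d in rel + nonrel:
        terms |= set(d)
    qm = {}
    for t in terms:
        w = alpha * q.get(t, 0.0)
        if rel:
            w += beta / len(rel) * sum(d.get(t, 0.0) for d in rel)
        if nonrel:
            w -= gamma / len(nonrel) * sum(d.get(t, 0.0) for d in nonrel)
        qm[t] = max(w, 0.0)  # clip negative weights to zero
    return qm
```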
Probabilistic Model
Definition
» pi: the probability of observing term ti in the set of relevant
  documents
» qi: the probability of observing term ti in the set of non-relevant
  documents

    sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,j} \, w_{i,q} \,
                     \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}

Initial search assumption
» pi is constant for all terms ti (typically 0.5)
» qi can be approximated by the distribution of ti in the whole
  collection

    w_i = \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}
        = \log \frac{N - df_i}{df_i}
        \approx \log \frac{N}{df_i} = idf_i
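The simplification under the initial assumptions can be checked numerically
(a sketch; the function name is illustrative):

```python
import math

def initial_weight(df, N):
    """RSJ term weight with p = 0.5 and q = df / N:
    log(p(1-q) / (q(1-p))) simplifies to log((N - df) / df)."""
    p, q = 0.5, df / N
    return math.log(p * (1 - q) / (q * (1 - p)))
```

For rare terms (small df relative to N) this is numerically very close to
the idf value log(N / df).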
Term Reweighting for the Probabilistic Model
Robertson and Sparck Jones (1976)
With relevance feedback from the user
» N: the number of documents in the collection
» R: the number of relevant documents for query q
» ni: the number of documents having term ti
» ri: the number of relevant documents having term ti

                     Relevant    Non-relevant       Total
    ti present       ri          ni - ri            ni
    ti absent        R - ri      N - ni - R + ri    N - ni
    Total            R           N - R              N
Term Reweighting for the Probabilistic Model (Cont.)
Initial search assumption
» pi is constant for all terms ti (typically 0.5)
» qi can be approximated by the distribution of ti in the whole
  collection

    sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,j} \, w_{i,q} \,
                     \log \frac{N - n_i}{n_i}

With relevance feedback from users
» pi and qi can be approximated by

    p_i = \frac{r_i}{R}        q_i = \frac{n_i - r_i}{N - R}

» hence the term weight is updated by

    sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,j} \, w_{i,q} \,
                     \log \frac{r_i (N - n_i - R + r_i)}{(R - r_i)(n_i - r_i)}
Term Reweighting for the Probabilistic Model (Cont.)
However, the last formula poses problems for certain small values of R
and ri (e.g., R = 1, ri = 0)

    p_i = \frac{r_i + 0.5}{R + 1}        q_i = \frac{n_i - r_i + 0.5}{N - R + 1}

Instead of 0.5, alternative adjustments have been proposed

    p_i = \frac{r_i + \frac{n_i}{N}}{R + 1}        q_i = \frac{n_i - r_i + \frac{n_i}{N}}{N - R + 1}
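The 0.5-adjusted estimates can be verified in a short sketch (illustrative
names; only the 0.5 variant is shown):

```python
import math

def rsj_feedback_weight(r, R, n, N):
    """Relevance-feedback term weight with the 0.5 adjustment:
    p = (r + 0.5) / (R + 1), q = (n - r + 0.5) / (N - R + 1)."""
    p = (r + 0.5) / (R + 1)
    q = (n - r + 0.5) / (N - R + 1)
    return math.log(p * (1 - q) / (q * (1 - p)))
```

Unlike the unadjusted formula, this stays well defined in the problematic
case R = 1, ri = 0.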
Term Reweighting for the
Probabilistic Model (Cont.)
Characteristics
» Advantage
– the term reweighting is optimal under the assumptions of
  term independence and
  binary document indexing (w_{i,q} ∈ {0,1} and w_{i,j} ∈ {0,1})
» Disadvantage
– no query expansion is used
– weights of terms in the previous query formulations are also
disregarded
– document term weights are not taken into account during the
feedback loop
Evaluation of Relevance Feedback
The standard evaluation method (i.e., recall-precision) is not suitable,
because the relevant documents used to reweight the query terms are moved
to higher ranks.
The residual collection method
» the set of all documents minus the set of feedback documents provided
  by the user
» because highly ranked documents are removed from the collection, the
  recall-precision figures for qm tend to be lower than the figures for
  the original query q
» as a basic rule of thumb, any experimentation involving relevance
  feedback strategies should always evaluate recall-precision figures
  relative to the residual collection
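The residual-collection rule is easy to state in code: drop the judged
documents before scoring the modified query's ranking (a sketch; names
are illustrative):

```python
def residual_ranking(ranking, feedback_docs):
    """Remove the documents already judged by the user from a ranked list."""
    seen = set(feedback_docs)
    return [d for d in ranking if d not in seen]

def precision_at(ranking, relevant, k):
    """Fraction of the top-k documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k
```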
Automatic Local Analysis
Definition
» local document set Dl : the set of documents retrieved by a
query
» local vocabulary Vl : the set of all distinct words in Dl
» stemmed vocabulary Sl : the set of all distinct stems derived
from Vl
Building local clusters
» association clusters
» metric clusters
» scalar clusters
Association Clusters
Idea
» co-occurrence of stems (or terms) inside documents

    c(k_u, k_v) = \sum_{j=1}^{|D|} f_{u,j} \cdot f_{v,j}

  – fu,j: the frequency of stem ku in document dj
» local association cluster for a stem ku
  – the set of the k largest values of c(ku, kv)
» given a query q, find clusters for the |q| query terms
» normalized form

    s(k_u, k_v) = \frac{c(k_u, k_v)}{c(k_u, k_u) + c(k_v, k_v) - c(k_u, k_v)}
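Both measures can be sketched over documents given as term-frequency
dictionaries (all names illustrative):

```python
from collections import defaultdict

def association_matrix(docs):
    """c(u, v) = sum over documents of f_{u,j} * f_{v,j};
    each doc is a term -> frequency dict."""
    c = defaultdict(float)
    for d in docs:
        for u, fu in d.items():
            for v, fv in d.items():
                c[(u, v)] += fu * fv
    return c

def normalized_association(c, u, v):
    """s(u, v) = c(u,v) / (c(u,u) + c(v,v) - c(u,v))."""
    return c[(u, v)] / (c[(u, u)] + c[(v, v)] - c[(u, v)])
```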
Metric Clusters
Idea
» factor in the distance between two terms in the same document
Definition
» V(ku): the set of keywords which have the same stem form as ku
» distance r(ki, kj): the number of words between terms ki and kj

    c(k_u, k_v) = \sum_{k_i \in V(k_u)} \sum_{k_j \in V(k_v)}
                  \frac{1}{r(k_i, k_j)}

» normalized form

    s(k_u, k_v) = \frac{c(k_u, k_v)}{|V(k_u)| \cdot |V(k_v)|}
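Given the word positions of the occurrences of two stems, the metric
correlation sums inverse distances over all occurrence pairs (a sketch;
the name is illustrative):

```python
def metric_correlation(pos_u, pos_v):
    """c(k_u, k_v) = sum of 1 / r(k_i, k_j) over all occurrence pairs,
    with r the word distance between the two occurrences."""
    return sum(1.0 / abs(i - j) for i in pos_u for j in pos_v if i != j)
```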
Scalar Clusters
Idea
» two stems with similar neighborhoods have some synonymy relationship
Definition
» cu,v = c(ku, kv)
» vectors of correlation values for stems ku and kv

    \vec{s}_u = (c_{u,1}, c_{u,2}, \ldots, c_{u,t})
    \vec{s}_v = (c_{v,1}, c_{v,2}, \ldots, c_{v,t})

» scalar association matrix

    s_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \, |\vec{s}_v|}

» scalar clusters
  – the set of the k largest values of scalar association
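The scalar association is just the cosine between the two correlation
vectors (a sketch; names are illustrative):

```python
import math

def scalar_association(su, sv):
    """Cosine between the correlation-value vectors of two stems;
    su and sv list c_{u,k} and c_{v,k} over the same term order."""
    dot = sum(a * b for a, b in zip(su, sv))
    nu = math.sqrt(sum(a * a for a in su))
    nv = math.sqrt(sum(b * b for b in sv))
    return dot / (nu * nv)
```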
Automatic Global Analysis
A thesaurus-like structure
Short history
» Until the beginning of the 1990s, global analysis was
considered to be a technique which failed to yield consistent
improvements in retrieval performance with general
collections
» This perception has changed with the appearance of modern
procedures for global analysis
Query Expansion based on a Similarity Thesaurus
Idea by Qiu and Frei [1993]
» Similarity thesaurus is based on term-to-term relationships rather than
  on a matrix of co-occurrence
» Terms for expansion are selected based on their similarity to the whole
  query rather than on their similarities to individual query terms
Definition
» N: total number of documents in the collection
» t: total number of terms in the collection
» tfi,j: occurrence frequency of term ki in document dj
» tj: the number of distinct index terms in document dj
» itfj: the inverse term frequency for document dj

    itf_j = \log \frac{t}{t_j}
Similarity Thesaurus
Each term is associated with a vector

    \vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})

» where wi,j is a weight associated to the index term-document pair

    w_{i,j} = \frac{(0.5 + 0.5 \frac{tf_{i,j}}{\max_k tf_{i,k}}) \, itf_j}
                   {\sqrt{\sum_{l=1}^{N} (0.5 + 0.5 \frac{tf_{i,l}}{\max_k tf_{i,k}})^2 \, itf_l^2}}

The relationship between two terms ku and kv is

    c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{j=1}^{N} w_{u,j} \cdot w_{v,j}

» Note that this is a variation of the correlation measure used for
  computing scalar association matrices
Term Weighting vs. Term-Concept Space
[Figure: the same tf_{i,j} counts weighted in two dual ways]
» document space: document dj is a vector over terms ki (tf × idf weighting)

    w_{i,j} = \frac{(0.5 + 0.5 \frac{tf_{i,j}}{\max_k tf_{k,j}}) \, idf_i}
                   {\sqrt{\sum_{k=1}^{t} (0.5 + 0.5 \frac{tf_{k,j}}{\max_l tf_{l,j}})^2 \, idf_k^2}}

» term-concept space: term ki is a vector over documents dj (tf × itf weighting)

    w_{i,j} = \frac{(0.5 + 0.5 \frac{tf_{i,j}}{\max_k tf_{i,k}}) \, itf_j}
                   {\sqrt{\sum_{l=1}^{N} (0.5 + 0.5 \frac{tf_{i,l}}{\max_k tf_{i,k}})^2 \, itf_l^2}}
Query Expansion Procedure with Similarity Thesaurus
1. Represent the query in the concept space by using the representation
   of the index terms

    \vec{q} = \sum_{k_u \in q} w_{u,q} \, \vec{k}_u

2. Compute the similarity sim(q, kv) between each term kv and the whole
   query

    sim(q, k_v) = \vec{q} \cdot \vec{k}_v
                = \sum_{k_u \in q} w_{u,q} \, \vec{k}_u \cdot \vec{k}_v
                = \sum_{k_u \in q} w_{u,q} \, c_{u,v}

3. Expand the query with the top r ranked terms according to sim(q, kv)

    w_{v,q'} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}}
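The three steps above can be sketched directly over a precomputed c_{u,v}
table (all names illustrative):

```python
def expand_query(q_weights, c, r):
    """Steps 2 and 3: score every non-query term k_v by
    sim(q, k_v) = sum_u w_{u,q} * c[u, v], then add the top r terms
    with weight sim(q, k_v) / sum_u w_{u,q}."""
    vocab = {v for (_, v) in c}
    scores = {v: sum(w * c.get((u, v), 0.0) for u, w in q_weights.items())
              for v in vocab if v not in q_weights}
    top = sorted(scores, key=scores.get, reverse=True)[:r]
    denom = sum(q_weights.values())
    expanded = dict(q_weights)
    for v in top:
        expanded[v] = scores[v] / denom
    return expanded
```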
Example of Similarity Thesaurus
The distance of a given term kv to the query centroid QC might be quite
distinct from the distances of kv to the individual query terms
[Figure: query terms ka and kb with centroid QC = {ka, kb}; terms ki, kj,
and kv placed so that their distances to QC differ from their distances
to ka and kb individually]
Query Expansion based on a Similarity Thesaurus (Cont.)
» A document dj is represented in the term-concept space by

    \vec{d}_j = \sum_{k_v \in d_j} w_{v,j} \, \vec{k}_v

» If the original query q is expanded to include all the t index terms,
  then the similarity sim(q, dj) between the document dj and the query q
  can be computed as

    sim(q, d_j) \sim \sum_{k_u \in q} w_{u,q} \, \vec{k}_u \cdot
                     \sum_{k_v \in d_j} w_{v,j} \, \vec{k}_v
                = \sum_{k_v \in d_j} \sum_{k_u \in q} w_{v,j} \, w_{u,q} \, c_{u,v}

  – which is similar to the generalized vector space model
Query Expansion based on a
Statistical Thesaurus
Idea by Crouch and Yang (1992)
» Use complete link algorithm to produce small and tight
clusters
» Use term discrimination value to select terms for entry into a
particular thesaurus class
Term discrimination value
» A measure of the change in space separation which occurs
when a given term is assigned to the document collection
Term Discrimination Value
Terms
» good discriminators (terms with positive discrimination values)
  – index terms
» indifferent discriminators (near-zero discrimination values)
  – thesaurus classes
» poor discriminators (negative discrimination values)
  – term phrases
Document frequency dfk (n documents in the collection)
» dfk > n/10: high-frequency terms (poor discriminators)
» dfk < n/100: low-frequency terms (indifferent discriminators)
» n/100 ≤ dfk ≤ n/10: good discriminators
Statistical Thesaurus
Term discrimination value theory
» the terms which make up a thesaurus class must be
indifferent discriminators
The proposed approach
» cluster the document collection into small, tight clusters
» A thesaurus class is defined as the intersection of all the low
frequency terms in that cluster
» documents are indexed by the thesaurus classes
» the thesaurus classes are weighted by

    w_{tC} = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|}
Discussion
Query expansion
» a useful but still little-explored technique
Trends and research issues
» the combination of local analysis, global analysis, visual displays,
  and interactive interfaces is a current and important research problem