
Computing scores in a complete search system
Recap
- Term frequency weighting
  The log frequency weight of term t in d is defined as follows:
  w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise
- idf weight
  df_t is the document frequency: the number of documents that t occurs in.
  df_t is an inverse measure of the informativeness of the term.
  idf is a measure of the informativeness of the term.
  idf weight of term t: idf_t = log10(N / df_t), where N is the number of documents in the collection.
Recap
tf-idf weighting
- The tf-idf weight of a term is the product of its tf weight and its idf weight:
  w_{t,d} = (1 + log10(tf_{t,d})) · log10(N / df_t)
- Best-known weighting scheme in information retrieval.
Recap
Cosine similarity between query and document
- cos(q, d) = (q · d) / (|q| |d|) = Σ_i q_i·d_i / (√(Σ_i q_i²) · √(Σ_i d_i²))
- q_i is the tf-idf weight of term i in the query.
- d_i is the tf-idf weight of term i in the document.
- |q| and |d| denote the lengths (Euclidean norms) of the two vectors.
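As a concrete recap, a small Python sketch computing these tf-idf weights and the cosine similarity between a query and a document; the collection size and term statistics below are made-up toy values.

```python
import math
from collections import Counter

def tfidf_weight(tf, df, N):
    """Log-frequency tf times idf: (1 + log10 tf) * log10(N / df); 0 if absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

def cosine(q_vec, d_vec):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    norm_q = math.sqrt(sum(w * w for w in q_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

# Toy statistics (illustrative only): N = 1000 documents in the collection.
N = 1000
df = {"jealous": 50, "gossip": 20, "quality": 300}
doc_tf = Counter({"jealous": 3, "gossip": 1, "quality": 2})
query_tf = Counter({"jealous": 1, "gossip": 1})

d_vec = {t: tfidf_weight(tf, df[t], N) for t, tf in doc_tf.items()}
q_vec = {t: tfidf_weight(tf, df[t], N) for t, tf in query_tf.items()}
print(cosine(q_vec, d_vec))
```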
Recap
Cosine similarity illustrated
Efficient scoring and ranking
- Example: q = jealous gossip
- Observations
  - The unit vector v(q) has only two non-zero components.
  - In the absence of any weighting for query terms, these non-zero components are equal; in this case, both equal 0.707.
  - For any two documents d1, d2, their relative order under cos(v(q), v(d)) is the same as under v(d) · V(q), where V(q) simply has a 1 in each query-term component. So for ranking purposes we can skip query weighting and normalization, and score each document by the sum of the (length-normalized) weights of the query terms it contains.
A faster algorithm
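A rough sketch of a faster, term-at-a-time scoring routine in this spirit: score only documents appearing in the query terms' postings lists, skip query-term weighting (per the observation above), normalize by precomputed document lengths, and read off the top k with a heap. The postings layout and helper names are assumptions for illustration.

```python
import heapq
import math

def fast_cosine_scores(query_terms, postings, doc_length, N, k):
    """Term-at-a-time scoring with per-document accumulators.

    postings: {term: list of (doc_id, tf)}        (assumed index layout)
    doc_length: {doc_id: length of the document's tf-idf vector}
    """
    scores = {}  # accumulators, one per document touched by a query term
    for term in query_terms:
        plist = postings.get(term, [])
        if not plist:
            continue
        idf = math.log10(N / len(plist))
        for doc_id, tf in plist:
            w_td = (1 + math.log10(tf)) * idf
            scores[doc_id] = scores.get(doc_id, 0.0) + w_td
    # Normalize by document length (the constant query length does not affect
    # the ranking) and pick the k largest scores with a heap.
    return heapq.nlargest(k, ((s / doc_length[d], d) for d, s in scores.items()))
```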
How do we compute the top k in ranking?
- In many applications, we don't need a complete ranking.
- We just need the top k for a small k (e.g., k = 100).
- If we don't need a complete ranking, is there an efficient way of computing just the top k?
- Naive:
  - Compute scores for all N documents
  - Sort
  - Return the top k
- What's bad about this?
- Alternative?
Use a heap for selecting the top k
- A heap efficiently implements a priority queue.
- Binary tree in which each node's value is greater than the values of its children.
- Takes O(J) operations to construct (where J is the number of documents with a nonzero score).
- The k winners can then be read off in O(k log J) steps (O(log J) each).
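A small sketch of this selection step with Python's heapq (a min-heap, so scores are negated): heapify builds the heap over the J nonzero scores in O(J), and each of the k pops costs O(log J).

```python
import heapq

def top_k(scores, k):
    """scores: {doc_id: score} holding only the J documents with nonzero score."""
    heap = [(-score, doc_id) for doc_id, score in scores.items()]
    heapq.heapify(heap)                          # O(J) construction
    winners = []
    for _ in range(min(k, len(heap))):
        neg_score, doc_id = heapq.heappop(heap)  # O(log J) per winner
        winners.append((doc_id, -neg_score))
    return winners
```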
Inexact top K document retrieval
- Produce K documents that are likely to be among the K highest-scoring documents for a query.
- Goal: lower the cost of computing the K documents.
- The heuristics follow a two-step scheme:
  1. Find a set A of documents that are contenders, where K < |A| ≪ N. A does not necessarily contain the K top-scoring documents for the query, but is likely to have many documents with scores near those of the top K.
  2. Return the K top-scoring documents in A.
Inexact top K document retrieval
- Index elimination
- Champion lists, static quality scores and ordering
- Cluster pruning
Index elimination
- Only consider documents containing query terms whose idf exceeds a preset threshold.
  Benefit: the postings lists of low-idf terms are generally long, so skipping them eliminates a lot of scoring work.
- Only consider documents that contain many of the query terms.
  A danger of this scheme is that we may end up with fewer than K candidate documents in the output.
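A minimal sketch of both heuristics, assuming a toy postings layout ({term: set of doc_ids}) and illustrative threshold values.

```python
import math

def candidate_docs(query_terms, postings, N, idf_threshold=2.0, min_terms=2):
    """Return documents containing at least min_terms of the high-idf query terms."""
    # Heuristic 1: keep only query terms whose idf exceeds the preset threshold.
    kept = [t for t in query_terms
            if t in postings and math.log10(N / len(postings[t])) >= idf_threshold]
    # Heuristic 2: count how many of the kept terms each document contains.
    counts = {}
    for term in kept:
        for doc_id in postings[term]:
            counts[doc_id] = counts.get(doc_id, 0) + 1
    # Note the danger above: this set may contain fewer than K documents.
    return {d for d, c in counts.items() if c >= min_terms}
```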
Champion lists, static quality scores and ordering
- Champion lists
  For each term t, precompute the set of the r documents with the highest weights for t.
  At query time, take the union of the champion lists of the terms comprising q and score only those documents.
- Static quality scores
  A query-independent measure of quality g(d) for each document d.
  The net score for a document d is some combination of g(d) together with the query-dependent score.
Champion lists, static quality scores and ordering
- Combining the two ideas: for a well-chosen value r, we maintain for each term t a global champion list of the r documents with the highest values for g(d) + tf-idf_{t,d}.
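A sketch of maintaining and using such global champion lists; the quality function g, the precomputed tf-idf table, and the index layout are illustrative assumptions.

```python
import heapq

def build_champion_lists(postings, tfidf, g, r):
    """postings: {term: list of doc_ids}; tfidf[term][doc_id]: precomputed weight;
    g: {doc_id: static quality score}; r: champion-list size."""
    return {term: heapq.nlargest(r, docs, key=lambda d: g[d] + tfidf[term][d])
            for term, docs in postings.items()}

def candidate_set(query_terms, champions):
    """Union of the champion lists of the query terms; only these get scored."""
    cands = set()
    for term in query_terms:
        cands.update(champions.get(term, []))
    return cands
```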
Cluster pruning
- Preprocessing
  1. Pick √N documents at random from the collection. Call these leaders.
  2. For each document that is not a leader, compute its nearest leader. The documents attached to a leader are called its followers.
- Query processing
  1. Given a query q, find the leader L that is closest to q. This entails computing cosine similarities from q to each of the √N leaders.
  2. The candidate set A consists of L together with its followers. Compute the cosine scores for all documents in this candidate set.
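A sketch of cluster pruning, assuming every document (and the query) is already a tf-idf vector represented as a list of floats; the leader sampling and follower assignment mirror the two preprocessing steps above.

```python
import math
import random

def cos(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def preprocess(doc_vectors):
    """Pick sqrt(N) random leaders, then attach every other document to its nearest leader."""
    n = len(doc_vectors)
    leaders = random.sample(list(doc_vectors), round(math.sqrt(n)))
    followers = {ld: [] for ld in leaders}
    for doc_id, vec in doc_vectors.items():
        if doc_id in followers:
            continue
        nearest = max(leaders, key=lambda ld: cos(vec, doc_vectors[ld]))
        followers[nearest].append(doc_id)
    return followers

def prune_query(q_vec, doc_vectors, followers, k):
    """Find the leader closest to q, then rank only that leader and its followers."""
    leader = max(followers, key=lambda ld: cos(q_vec, doc_vectors[ld]))
    candidates = [leader] + followers[leader]
    return sorted(candidates, key=lambda d: cos(q_vec, doc_vectors[d]), reverse=True)[:k]
```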
Cluster pruning
Cluster pruning
- Variations of cluster pruning introduce additional parameters b1 and b2, both positive integers:
  - Attach each follower to its b1 closest leaders, rather than to a single closest leader.
  - At query time, consider the b2 leaders closest to the query q.
The complete search system
Tiered indexes
- Basic idea:
  - Create several tiers of indexes, corresponding to the importance of indexing terms.
  - During query processing, start with the highest-tier index.
  - If the highest-tier index returns at least k (e.g., k = 100) results: stop and return the results to the user.
  - If we've only found < k hits: repeat for the next index in the tier cascade.
- Example: two-tier system
  - Tier 1: index of all titles
  - Tier 2: index of the rest of the documents
  - Pages containing the search words in the title are better hits than pages containing the search words in the body of the text.
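A sketch of the tier cascade; the per-tier index objects and their search method are placeholders, not a real API.

```python
def tiered_search(query, tiers, k=100):
    """tiers: indexes ordered from highest to lowest tier; each is assumed to
    expose search(query) returning (doc_id, score) pairs."""
    results, seen = [], set()
    for index in tiers:                      # start with the highest tier (e.g., titles)
        for doc_id, score in index.search(query):
            if doc_id not in seen:
                seen.add(doc_id)
                results.append((doc_id, score))
        if len(results) >= k:                # enough hits: stop the cascade
            break
    return results[:k]
```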
Tiered index
Tiered indexes
The use of tiered indexes is believed to be one of the reasons that Google's search quality was significantly higher initially (2000/01) than that of its competitors.
Query-term proximity
- Users prefer a document in which most or all of the query terms appear close to each other.
- Let ω be the width (in words) of the smallest window in a document d that contains all the query terms (a sketch of computing ω follows below).
  Example:
  - Query: strained mercy
  - Sentence: "The quality of mercy is not strained"
  - ω = 4
- Such proximity-weighted scoring functions are a departure from pure cosine similarity and closer to "soft conjunctive" semantics.
  Used by Google and other web search engines.
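A sketch of computing ω with a sliding window, assuming the document is given as a list of tokens; on the example sentence above it returns 4.

```python
def smallest_window(tokens, query_terms):
    """Width (in words, inclusive) of the smallest window containing every query term;
    returns None if some query term does not occur in the document."""
    query_terms = set(query_terms)
    need, have = len(query_terms), 0
    counts, best, left = {}, None, 0
    for right, tok in enumerate(tokens):
        if tok in query_terms:
            counts[tok] = counts.get(tok, 0) + 1
            if counts[tok] == 1:
                have += 1
        while have == need:                      # window covers all terms: shrink from the left
            width = right - left + 1
            best = width if best is None else min(best, width)
            left_tok = tokens[left]
            if left_tok in query_terms:
                counts[left_tok] -= 1
                if counts[left_tok] == 0:
                    have -= 1
            left += 1
    return best

print(smallest_window("the quality of mercy is not strained".split(),
                      {"strained", "mercy"}))   # prints 4
```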
Query parser and scoring functions
- Query parser
  - Translates the user-specified keywords into a query with various operators that is executed against the underlying indexes.
  - Example for a three-term query (a sketch follows below):
    - Run the user-generated query string as a phrase query; rank the matches by vector space scoring.
    - If fewer than ten documents contain the phrase, run the two 2-term phrase queries; rank these using vector space scoring.
    - If we still have fewer than ten results, run the vector space query consisting of the three individual query terms.
- Score
  - Vector space scoring, static quality, proximity weighting, and potentially other factors.
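A sketch of this cascade for a three-term query; phrase_search and vector_space_rank are hypothetical helpers standing in for the underlying index operations.

```python
def parse_and_score(terms, phrase_search, vector_space_rank, want=10):
    """terms: the user's query as a list of three terms.
    phrase_search(terms) -> set of doc_ids containing the phrase (assumed helper).
    vector_space_rank(terms, docs=None) -> ranked doc_ids (assumed helper)."""
    # Step 1: run the whole query as a phrase query.
    docs = phrase_search(terms)
    if len(docs) >= want:
        return vector_space_rank(terms, docs)
    # Step 2: fewer than ten hits, so try the two 2-term phrase queries.
    for phrase in (terms[:2], terms[1:]):
        docs |= phrase_search(phrase)
    if len(docs) >= want:
        return vector_space_rank(terms, docs)
    # Step 3: still too few, so fall back to the plain vector space query.
    return vector_space_rank(terms)
```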
Complete search system
Components we haven't covered yet
- Document cache: we need this for generating snippets (= dynamic summaries).
- Zone indexes: they keep separate indexes for different zones: the body of the document, all highlighted text in the document, anchor text, text in metadata fields, etc.
- Proximity ranking (e.g., rank documents in which the query terms occur in the same local window higher than documents in which the query terms occur far from each other).
Components we haven't covered yet: Query parser
- IR systems often guess what the user intended.
  - The two-term query London tower (without quotes) may be interpreted as the phrase query "London tower".
  - The query 100 Madison Avenue, New York may be interpreted as a request for a map.
- How do we "parse" the query and translate it into a formal specification containing phrase operators and proximity operators?
Vector space retrieval: Complications
- How do we combine phrase retrieval with vector space retrieval?
  We do not want to compute document frequency / idf for every possible phrase. Why?
- How do we combine Boolean retrieval with vector space retrieval?
  For example: "or"-constraints and "not"-constraints.
  Postfiltering is simple, but can be very inefficient; there is no easy answer.
- How do we combine wildcards with vector space retrieval?
  Again, no easy answer.