
Web Systems and Algorithms
Searching and Indexing
Chris Brooks
Department of Computer Science
University of San Francisco
Searching
Once we’ve collected a large body of Web pages (or
other documents), we would probably like to find
information within this collection.
There are a few criteria we would like to satisfy:
Expressivity - can we ask complex queries?
Scalability - Can we handle large document
collections?
Relevance - do the documents retrieved satisfy our
need? Are they ranked?
Information Needs
A user typically has an information need.
The job of an IR system (such as a search engine) is to
translate that need into a query and then find
documents that satisfy that need.
We can evaluate a search engine by its effectiveness at
satisfying this information need.
Queries
What sorts of queries might we want to ask?
Boolean queries: “cat AND dog”, “(cat OR dog) AND fish”,
“cat AND NOT mouse”
proximity queries: “cat NEAR dog” “cat, dog within same
sentence”
phrases: “the lazy dog”, “the cat in the hat”
“full text”: user just inputs keywords “cat dog fish”
Synonymy: documents with words like “cat”
Similarity: documents similar to a given document
Boolean queries
Let’s start simple, with boolean queries.
The user provides one or more keywords, and we must
find documents containing those keywords.
Given this query language, how can we structure our
document collection?
We want to avoid linear search through all documents.
Inverted Index
The standard way to solve this is through the
construction of an inverted index.
This is a hashtable mapping tokens to documents they
appear in.
Construction is very easy:
index = {}
for document in collection:
    for word in document:
        if word not in index:
            index[word] = set()      # one posting set per token
        index[word].add(document)    # a set ignores duplicate adds
(Note: using a set means a document is only added once, even if a word
occurs multiple times.)
Inverted Index
Retrieval is also easy.
For single-word queries, return the set of documents the word maps to.
For multi-word queries, compute the intersection (for AND), union (for
OR), or set difference (for NOT), as sketched below.
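A minimal sketch (the index here is hypothetical toy data, assuming each
token maps to a Python set of document IDs):

index = {
    "cat": {1, 2, 4},
    "dog": {2, 3, 4},
    "fish": {3, 4},
}

print(index["cat"] & index["dog"])   # cat AND dog -> {2, 4}
print(index["cat"] | index["dog"])   # cat OR dog -> {1, 2, 3, 4}
print(index["cat"] - index["dog"])   # cat AND NOT dog -> {1}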
Intersection
How do we efficiently compute the intersection of two
lists?
Intersection
### assume both lists are sorted
def intersect(l1, l2):
    i, j = 0, 0
    result = []
    while i < len(l1) and j < len(l2):
        if l1[i] == l2[j]:
            result.append(l1[i])
            i += 1
            j += 1
        elif l1[i] < l2[j]:
            i += 1
        else:
            j += 1
    return result
Union and NOT
Union and NOT are very similar.
For more complex queries, we want to take care to
merge lists in the correct order.
We want to start with the least frequent term, then proceed in order of
increasing frequency (see the sketch after this list).
Nested queries can be much more challenging.
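A minimal sketch of this ordering (self-contained, using sets for
brevity; a production system would merge sorted lists as above):

def intersect_many(posting_lists):
    # AND together several posting lists, smallest (rarest term) first.
    lists = sorted(posting_lists, key=len)
    result = set(lists[0])
    for l in lists[1:]:
        result &= set(l)
        if not result:        # short-circuit once the result is empty
            break
    return sorted(result)

print(intersect_many([[1, 2, 3, 9], [2, 9], [2, 5, 9, 12]]))  # [2, 9]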
Processing Documents
We actually skipped a step here.
We need to go from a document, which is a string, to a
list of tokens, which are the keys in the inverted index.
We need to decide:
How to separate tokens?
Which tokens to retain?
Can tokens be combined or grouped?
Tokenizing
The simplest thing to do is to split on whitespace.
What about punctuation or non-alphanumeric
characters? Throw them away?
What about that’s, or aren’t, or C++?
What about San Francisco, or Tiger Woods?
What about anti-social vs antisocial?
Dates? Phone numbers?
What about languages like Chinese or Japanese?
Tokenizing
Choices include:
Fixed set of rules, written as regular expressions (sketched below)
Learn rules from labeled data
Use a segmentation algorithm, such as Viterbi
Choice depends on anticipated information needs,
experience, and efficiency issues.
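As one illustrative sketch, a fixed rule written as a regular expression
(the rule here is an assumption, not a complete tokenizer):

import re

# Keep contractions (that's) and hyphenated words (anti-social);
# split on everything else.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['\-][A-Za-z0-9]+)*")

def tokenize(text):
    return [t.lower() for t in TOKEN_RE.findall(text)]

print(tokenize("That's the anti-social cat in the hat."))
# ["that's", 'the', 'anti-social', 'cat', 'in', 'the', 'hat']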
Removing non-useful tokens
We would also like to avoid indexing tokens that do not
help us find documents.
Markup. We can strip this out with an HTML parser.
Stopwords. These are words that carry little useful semantic content.
a, an, the, are, among, about, he, she, they, etc.
We can either use a fixed list, or determine them through frequency
analysis (sketched below).
Removing stopwords may give us problems with phrase
search
“As We May Think”, “University of San Francisco”
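A minimal sketch of the frequency-analysis approach (the cutoff top_n is
an arbitrary assumption):

from collections import Counter

def find_stopwords(documents, top_n=25):
    # Treat the top_n most frequent tokens in the collection as stopwords.
    counts = Counter(word for doc in documents for word in doc)
    return {word for word, _ in counts.most_common(top_n)}

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat"]]
print(find_stopwords(docs, top_n=2))  # {'the', 'cat'}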
Stemming
We may also want to remove prefixes and suffixes.
For example, a user searches for “car” but the document contains the
token “cars”.
This process is known as stemming.
Most stemmers use a fixed set of rules for removing
suffixes.
This can introduce errors, due to inconsistencies in
English
university and universal both stem to univers
In Web search, stemming can also introduce problems
due to acronyms and jargon
gated, SOCKS, ides
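For illustration, the “univers” collision reproduced with NLTK's Porter
stemmer (assuming the nltk package is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["university", "universal", "cars", "car"]:
    print(word, "->", stemmer.stem(word))
# university -> univers
# universal -> univers
# cars -> car
# car -> car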
Conflation
Stemming is a special case of what is called
normalization or conflation.
Terms with the same meaning are grouped into an
equivalence class.
e.g. car, auto, automobile
If we have a thesaurus such as WordNet that contains
synonyms, we can deal with this in two different ways:
During querying, look up synonyms and add them to the
query as a disjunction (sketched below).
“car” becomes “car” OR “auto” OR “automobile”
During indexing, store the document under the entry for
each synonym.
Tradeoff: space vs time
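A minimal sketch of the query-time option, with a hypothetical toy
thesaurus standing in for WordNet:

# Hypothetical thesaurus: term -> its equivalence class.
THESAURUS = {"car": {"car", "auto", "automobile"}}

def expand(term):
    # Rewrite a term as the disjunction of its synonyms.
    synonyms = THESAURUS.get(term, {term})
    return " OR ".join(sorted(synonyms))

print(expand("car"))   # auto OR automobile OR car
print(expand("fish"))  # fish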
Normalization
Other issues include:
Accents and diacritical marks. Less important in
English, more in other languages.
Case. Do we want to convert everything to lower case?
What about proper names? (Bush, Windows)
Acronyms?
Misspellings, dates, regionalisms (color vs colour)
Measuring performance
There are two basic metrics for measuring retrieval
performance:
Precision
What fraction of the documents returned actually
meet a user’s information need?
Recall
What fraction of the documents that would meet the
user’s information need are returned?
Often we trade one against the other.
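As a worked sketch (assuming we know the true relevant set, which in
practice requires human relevance judgments):

def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5, 6, 7, 8}
print(precision_recall(retrieved, relevant))
# (0.5, 0.333...): half of what we returned is relevant,
# but we found only a third of the relevant documents.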
Ranked Retrieval
We probably don’t want to just give the user an unsorted
list of results.
We would instead like to present the best results first.
What does best mean?
Best match to our query
Most authoritative
Ranked Retrieval
If we want to score documents according to how well
they match our query, we can use TFIDF and cosine
similarity.
This assigns each word in a document a weight
according to how frequently it occurs.
Cosine similarity measures the angle between two vectors; with
non-negative TFIDF weights this produces a score between 0 and 1.
Nice, but perhaps not effective for short queries.
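A minimal sketch of TFIDF weighting and cosine scoring (raw term counts
and a standard log idf; real systems use many weighting variants):

import math
from collections import Counter

def tfidf(doc, docs):
    # Weight each term by tf * idf over the whole collection.
    n = len(docs)
    tf = Counter(doc)
    return {t: tf[t] * math.log(n / sum(t in d for d in docs))
            for t in tf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [["cat", "dog"], ["cat", "fish"], ["dog", "dog", "tree"]]
vecs = [tfidf(d, docs) for d in docs]
print(cosine(vecs[0], vecs[1]))  # > 0: both contain "cat"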
Ranked Retrieval
An alternative approach is to use a document’s place in
the Web to determine its rank.
Intuition: Each document has prestige, which is a
function of the prestige of the documents that link to it.
This is the basis of PageRank.
One advantage is that a document’s owner cannot
“game” the system by putting extra words in a document.
PageRank
The idea behind PageRank is this:
What is the probability of winding up on a web page x
if one is surfing “at random”?
Let’s assume the web is strongly connected, and that
every page has out-degree >= 1.
To have clicked once and wound up at x, a surfer
must have:
Been at a page y that links to x
Chosen the link from y to x out of the E(y) outward
links from y. Assuming a uniform distribution, this
probability is 1/E(y).
PageRank
Since there may be several pages y linking to x, we sum over them:
p1[x] = Σ_{y → x} p0[y] · (1/E(y))
(the probability of being at y and then choosing to go to x, summed over
all pages y that link to x)
We can follow this computation out to determine the
stationary distribution of p.
This is a page’s prestige, or PageRank.
We’ll go through this more carefully next week.
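A minimal power-iteration sketch on a toy strongly connected graph (no
damping factor, which the full algorithm adds):

def pagerank(links, iters=50):
    # links: page -> list of pages it links to (out-degree >= 1).
    pages = list(links)
    p = {x: 1.0 / len(pages) for x in pages}      # uniform start
    for _ in range(iters):
        nxt = {x: 0.0 for x in pages}
        for y, outs in links.items():
            for x in outs:                        # y spreads p[y] evenly
                nxt[x] += p[y] / len(outs)
        p = nxt
    return p

toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(toy))  # converges to roughly a: 0.4, b: 0.2, c: 0.4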
PageRank
What’s interesting about PageRank is that a document’s
“value” is determined by the value of its neighbors.
In practice, we would first find pages that matched a
query, then use PageRank to order these results,
perhaps in combination with document-level rankings.
A criticism of PageRank is that there’s not necessarily a
connection between quality and popularity.
Index compression
An inverted index can take up a large amount of space,
particularly if positional information about where a term
occurred in a document is also kept.
For performance reasons, it is preferable to keep the
index small enough to be held in memory.
We can do this through index compression
If our decompression algorithm is fast enough, the cost
of decompression will be less than the cost of retrieving
an entry from disk.
Index compression
A simple thing we can do is to store the gaps (offsets) between
successive document IDs.
For example, if we have foo: 13, 100, 150, we store
foo: 13, 87, 50
For large collections, most gaps will be much smaller than the raw IDs,
so they can be encoded in fewer bits (sketched below).
We can also construct a hierarchy of indices, using a
tree-based approach.
Each identifier maps into a second-level index.
Compression makes updating more challenging.
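A minimal sketch of gap encoding plus variable-byte compression, one
common scheme (7-bit chunks per byte, high bit marking a number's last
byte):

def vbyte_encode(gaps):
    out = bytearray()
    for g in gaps:
        chunk = [g & 0x7F]              # lowest 7 bits
        g >>= 7
        while g:
            chunk.append(g & 0x7F)
            g >>= 7
        chunk[0] |= 0x80                # flag the final byte
        out.extend(reversed(chunk))
    return bytes(out)

doc_ids = [13, 100, 150]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
print(gaps)                     # [13, 87, 50]
print(len(vbyte_encode(gaps)))  # 3 bytes, versus 12 for three 32-bit ints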
Relevance Feedback
Retrieval can be very challenging in the Web domain:
queries are very short, and there are lots of potential results.
If we can get the user to give us some help in
understanding their information needs, we can improve
performance.
This is called relevance feedback.
Relevance Feedback
The basic idea is this:
The user submits a query q
We return some documents.
The user labels each returned document as “yes” or
“no”.
We then use features of the “good” documents to do a
second search.
Rocchio’s method
A simple way to do this is called Rocchio’s method.
We compute each document’s score as before.
We then compute a weight for the most useful words in
the “good” documents and a weight for the most useful
words in the “bad” documents.
These weights are used to adjust a document’s score.
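A minimal sketch of the classic formulation, which reweights the query
vector toward the “good” documents and away from the “bad” ones (the
alpha, beta, gamma values are conventional defaults, not values from
this course):

def rocchio(query, good, bad, alpha=1.0, beta=0.75, gamma=0.15):
    # All vectors are dicts of term -> weight.
    terms = set(query)
    for d in good + bad:
        terms |= set(d)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if good:
            w += beta * sum(d.get(t, 0.0) for d in good) / len(good)
        if bad:
            w -= gamma * sum(d.get(t, 0.0) for d in bad) / len(bad)
        new_q[t] = max(w, 0.0)          # drop negative weights
    return new_q

q = {"cat": 1.0}
good = [{"cat": 0.5, "kitten": 0.8}]
bad = [{"cat": 0.1, "bulldozer": 0.9}]
print(rocchio(q, good, bad))
# {'cat': 1.36, 'kitten': 0.6, 'bulldozer': 0.0} (approximately)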
Probabilistic feedback
If we have “good” and “bad” documents, we can also use
them to build a Naive Bayes classifier that will predict
the likelihood of a document matching a search query.
We can also build a probabilistic model directly from our
collection and use this to predict the likelihood that a
document will satisfy a query.
P(dx = True | q = “cat dog tree”)
Relevance Feedback
Relevance feedback has not been widely used in Web
search.
Issues:
Users don’t like labeling data
Users don’t want to search twice
Mostly useful for boosting recall, which is not as
important as precision for this task.
We can approximate “good” documents by mining
clickstream data.
We can also treat documents that are linked from many
pages as implicitly “good”.
Handling metadata
The standard bag-of-words approach ignores metadata
and document structure.
The simplest approach is to add weights to terms that
occur inside a tag of interest.
The META tag was originally useful for this, but is an
obvious target for spammers.
The anchor text inside links to a particular page can be
particularly helpful.
Complex queries
The approach we’ve seen so far will work for Boolean
and free-text searches.
What about phrases or sentences?
The simplest approach would be to construct a separate
index for phrases of length n or less.
We don’t want to keep all phrases, though. Just those
that are statistically interesting.
To do this, we estimate the frequency of tokens t1 and t2
from our collection.
If t1 t2 is an interesting phrase, P(t1 t2) >> P(t1) · P(t2) or
P(t1 t2) << P(t1) · P(t2).
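A minimal sketch of this test over a token stream (the threshold is an
arbitrary assumption):

from collections import Counter

def interesting_phrases(tokens, threshold=5.0):
    # Flag bigrams whose observed probability differs greatly from
    # the independence prediction P(t1) * P(t2).
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    out = []
    for (t1, t2), c in bigrams.items():
        observed = c / (n - 1)
        expected = (unigrams[t1] / n) * (unigrams[t2] / n)
        ratio = observed / expected
        if ratio > threshold or ratio < 1.0 / threshold:
            out.append((t1, t2, round(ratio, 1)))
    return out

tokens = ("san francisco is in california and san francisco "
          "is on the bay").split()
print(interesting_phrases(tokens))
# ('san', 'francisco') is flagged, though on a sample this tiny
# most bigrams are; real collections need far more text.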
Positional indices
An alternate, more scalable approach is to keep track of
the position in a document where a term occurs.
To process a phrase query, we would treat it as a
Boolean query, but in computing intersection only admit
adjacent words.
For example, suppose we had the query “fat cat” and
the following indices:
fat: d1: (34, 72, 103), d2: (44, 61)
cat: d1: (35, 88, 104), d2: (42, 99)
We would conclude that d1 had two matches, and d2
zero.
We can also use this technique to implement proximity
or sentence queries.
“cat w/3 mat”
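A minimal sketch of phrase matching over a positional index, with the
layout assumed as in the example above:

def phrase_matches(index, t1, t2):
    # Count places where t2 appears immediately after t1, per document.
    hits = {}
    for doc, positions1 in index[t1].items():
        positions2 = set(index[t2].get(doc, []))
        count = sum(1 for p in positions1 if p + 1 in positions2)
        if count:
            hits[doc] = count
    return hits

index = {
    "fat": {"d1": [34, 72, 103], "d2": [44, 61]},
    "cat": {"d1": [35, 88, 104], "d2": [42, 99]},
}
print(phrase_matches(index, "fat", "cat"))  # {'d1': 2}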
Approximate matches
We might also want to find documents that contain
tokens very close to the terms in a search query.
Misspellings, regional dialects, translation errors.
We could have an aggressive conflation service.
We could reduce words to n-grams and match on those (see the sketch
below).
We could mine query logs and match to terms in our
index using q-grams, and then use this to update
conflation.
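A minimal character-n-gram sketch (bigrams and Jaccard overlap; real
systems tune n and the similarity measure):

def ngrams(word, n=2):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def similarity(a, b, n=2):
    # Jaccard overlap of character n-gram sets.
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

print(similarity("color", "colour"))     # 0.5: plausibly the same term
print(similarity("color", "bulldozer"))  # 0.0: unrelated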
Summary
Inverted indexes are used to map from terms to the
documents that contain them.
Supporting rich queries can require a great deal of
preprocessing of text.
We can rank results either according to goodness of fit
or to prestige (or both).