Introduction to information retrieval (IR)
Haim Levkowitz
University of Massachusetts Lowell
© Copyright 2006-2008 Haim Levkowitz
http://www.cs.uml.edu/~haim/teaching/iws/ir/presentations/
Index construction
Plan
• Previous: Tolerant retrieval
  • Wildcards
  • Spell correction
  • Soundex
• Now:
  • Index construction
Index construction
• How do we construct an index?
• What strategies can we use with limited main memory?
Overview
• Intro
• Block merge indexing
• Distributed indexing
• Dynamic indexing
• Other indices
Recall index construction
• Documents parsed to extract words
• saved w/ Document ID

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Term      Doc #
I         1
did       1
enact     1
julius    1
caesar    1
I         1
was       1
killed    1
i'        1
the       1
capitol   1
brutus    1
killed    1
me        1
so        2
let       2
it        2
be        2
with      2
caesar    2
the       2
noble     2
brutus    2
hath      2
told      2
you       2
caesar    2
was       2
ambitious 2
Key step: sort
• After all docs parsed, the inverted file is sorted by terms
• Core indexing step
• 667M items to sort

Before sorting (parse order): I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After sorting (by term, then doc):

Term      Doc #
ambitious 2
be        2
brutus    1
brutus    2
caesar    1
caesar    2
caesar    2
capitol   1
did       1
enact     1
hath      2
I         1
I         1
i'        1
it        2
julius    1
killed    1
killed    1
let       2
me        1
noble     2
so        2
the       1
the       2
told      2
was       1
was       2
with      2
you       2
Index construction
• Index construction in main memory: simple & fast. But:
• As we build the index, hard to use compression
• Parse docs one at a time
• Final postings for any term incomplete until the end
• With compression, much more complex (later)
• 10-12 bytes per postings entry ==> need several temporary gigabytes on disk (HDD)
Example: The Reuters collection

Symbol  Statistic                                             Value
N       documents                                             800,000
L       avg. # word tokens per document                       200
M       (distinct) word types                                 400,000
        avg. # bytes per word token (incl. spaces/punct.)     6
        avg. # bytes per word token (without spaces/punct.)   4.5
        avg. # bytes per word type                            7.5
        non-positional postings                               100,000,000

Why such a difference?
System parameters for design
• Disk seek ~10 milliseconds
• Block transfer from disk ~1 microsecond/byte (following seek)
• All other ops ~10 microseconds
  • E.g., compare two postings entries and decide their merge order
Bottleneck
• Parse & build postings entries one doc at a time
• Now sort postings entries by term
  • then by doc within each term
• Sorting 100M records on disk ==> too slow -- too many disk seeks

If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take?
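A back-of-the-envelope answer to this question, plugging in the seek time quoted on the previous slide (~10 ms) and N = 100M postings:

```python
import math

# Assumptions from the slides: ~10 ms per disk seek, 2 seeks per
# comparison, N log2 N comparisons to sort N items.
def disk_bound_sort_time(n, seeks_per_cmp=2, seek_s=0.010):
    comparisons = n * math.log2(n)
    return comparisons * seeks_per_cmp * seek_s  # seconds

secs = disk_bound_sort_time(100_000_000)
print(f"{secs / 86400:.0f} days")  # on the order of hundreds of days
```

The point of the exercise: a seek-per-comparison sort is hopeless, which motivates the block-based approach on the next slides.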
Sorting with fewer disk seeks
• 12-byte (4+4+4) records (term, doc, freq)
• Generated as we parse docs
• Must now sort 100M such 12-byte records by term
• Define a block as ~1.6M such records
  • can “easily” fit a couple into memory
  • will have 64 such blocks to start with
  • (example; blocks are much larger on modern hardware)
• Accumulate postings for each block, sort, write to disk
• Then merge the blocks into one long sorted order
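The block-based procedure above can be sketched as follows -- a simplified in-memory stand-in where pickle files play the role of on-disk runs and heapq.merge performs the final multi-way merge:

```python
import heapq
import pickle
import tempfile

def block_merge_index(pairs, block_size=1_600_000):
    """Blocked sort-based indexing sketch: accumulate (term, doc) pairs
    per block, sort each block, spill it to disk, then merge the runs."""
    runs = []
    block = []
    for pair in pairs:
        block.append(pair)
        if len(block) >= block_size:
            runs.append(_spill(sorted(block)))
            block = []
    if block:
        runs.append(_spill(sorted(block)))
    # Merge all sorted runs into one long sorted stream of postings.
    return heapq.merge(*(_load(r) for r in runs))

def _spill(sorted_block):
    # Stand-in for writing a sorted run to disk.
    f = tempfile.TemporaryFile()
    pickle.dump(sorted_block, f)
    f.seek(0)
    return f

def _load(f):
    # Stand-in for streaming a sorted run back from disk.
    yield from pickle.load(f)

merged = list(block_merge_index(
    [("b", 1), ("a", 1), ("a", 2), ("c", 2)], block_size=2))
print(merged)
```

A real implementation would stream each run block by block rather than unpickling it whole, but the structure -- sort within blocks, then merge -- is the same.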
Sorting 64 blocks of 1.6M records
• First, read each block and sort within:
  • Quicksort takes 2N ln N expected steps
  • Here 2 x (1.6M ln 1.6M) steps
• Exercise: estimate total time to read each block from disk and quicksort it
• 64 times this estimate ==> 64 sorted runs of 1.6M records each
• Need 2 copies of data on disk throughout
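One way to work the exercise, plugging in the system parameters quoted earlier (10 ms seek, ~1 μs/byte transfer, and -- as an assumption -- the slide's "~10 microseconds" figure for each in-memory step):

```python
import math

SEEK_S = 0.010             # ~10 ms disk seek
XFER_S_PER_BYTE = 1e-6     # ~1 microsecond per byte transferred
OP_S = 10e-6               # ~10 microseconds per in-memory step (assumed)

def per_block_time(n=1_600_000, rec_bytes=12):
    read = SEEK_S + n * rec_bytes * XFER_S_PER_BYTE   # one sequential read
    sort = 2 * n * math.log(n) * OP_S                 # quicksort: 2 N ln N steps
    return read, sort

read_s, sort_s = per_block_time()
print(read_s, sort_s)  # reading is cheap next to the in-memory sort
```

Under these parameters reading a block takes roughly 19 s while sorting it dominates; multiplying by 64 gives the total for producing the sorted runs.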
Merging 64 sorted runs
• Merge tree of log2 64 = 6 layers
• During each layer, read runs into memory in blocks of 1.6M, merge, write back
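A sketch of the binary merging this describes -- lists stand in for on-disk runs, and the outer loop performs the log2(#runs) layers (a real implementation streams blocks of each run rather than holding whole runs in memory):

```python
def merge_two(a, b):
    """Merge two sorted runs, reading and writing sequentially."""
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(b[j]); j += 1
    merged.extend(a[i:]); merged.extend(b[j:])
    return merged

def merge_layer(runs):
    """One layer of the merge tree: merge the runs pairwise."""
    out = []
    for i in range(0, len(runs), 2):
        pair = runs[i:i + 2]
        out.append(merge_two(*pair) if len(pair) == 2 else pair[0])
    return out

runs = [[1, 5], [2, 6], [3, 7], [4, 8]]
while len(runs) > 1:          # log2(#runs) layers
    runs = merge_layer(runs)
print(runs[0])
```

Each layer halves the number of runs and doubles their length, matching the merge tree on the next slide.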
(figure: runs 1-4 on disk are read block by block into memory, merged, and the merged run written back to disk)
Merge tree
(bottom-up, 6 layers)
  1 run  … ?
  2 runs … ?
  4 runs … ?
  8 runs, 12.8M records/run
  16 runs, 6.4M records/run
  32 runs, 3.2M records/run
  64 sorted runs (1, 2, …, 63, 64) -- bottom level of tree
Block merge indexing
Merging 64 runs
• Time estimate for disk transfer:
  • 6 -- # layers in the merge tree
  • x (64 blocks x 1.6M records x 12 bytes x 10^-6 sec/byte) -- disk block transfer time
  • x 2 -- read + write
  • ≈ 41 mins per layer
• Why is this an overestimate?
• Exercise: work out how these transfers are staged, and the total time for merging.
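A quick check of this estimate under the stated parameters; note that the "≈ 41 mins" corresponds to a single read+write layer, and the full six-layer product comes to roughly four hours:

```python
LAYERS = 6                 # log2(64) layers in the merge tree
BLOCKS = 64
RECORDS_PER_BLOCK = 1_600_000
REC_BYTES = 12
XFER_S_PER_BYTE = 1e-6     # disk block transfer time, ~1 us/byte

# One layer reads and writes the whole data once.
per_layer = BLOCKS * RECORDS_PER_BLOCK * REC_BYTES * XFER_S_PER_BYTE * 2
total = LAYERS * per_layer
print(per_layer / 60, total / 3600)  # ~41 min per layer, ~4.1 h total
```

It is an overestimate because reads within a layer are sequential within each run, so the transfers can be overlapped and staged more cleverly than this worst-case accounting assumes.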
Exercise -- fill in this table

Step  Description                                      Time
1     64 initial quicksorts of 1.6M records each       ?
2     Read 2 sorted blocks for merging, write back     ?
3     Merge 2 sorted blocks                            ?
4     Add (2) + (3) = time to read/merge/write         ?
5     64 times (4) = total merge time                  ?
Large memory indexing
• Suppose instead we had 16GB of memory for the above indexing task
• Key decision in block merge indexing: block size
• Exercise: What initial block sizes would you choose? ==> What index time?
  • Repeat with a couple of values of n, m
• In practice, spidering (crawling) is often interlaced with indexing
  • Spidering is bottlenecked by WAN speed and many other factors -- more later
Distributed indexing
• For web-scale indexing (don’t try this at home!):
  • must use a distributed computing cluster
• Individual machines are fault-prone
  • can unpredictably slow down or fail
• How do we exploit such a pool of machines?
Distributed indexing
• Maintain a master machine directing the indexing job
  • considered “safe”
• Break up indexing into sets of (parallel) tasks
• Master machine assigns each task to an idle machine from the pool
Parallel tasks
• Will use two sets of parallel tasks
  • Parsers
  • Inverters
• Break the input document corpus into splits
  • Each split = a subset of documents
• Master assigns a split to an idle parser machine
• Parser
  • reads one document at a time
  • emits (term, doc) pairs
Parallel tasks
• Parser writes pairs into j partitions
  • Each covers a range of terms’ first letters
  • E.g., a-f, g-p, q-z (here j = 3)
• Now to complete the index inversion
Data flow

(figure: Master assigns splits to Parsers; each Parser writes (term, doc) pairs into partitions a-f, g-p, q-z; one Inverter per partition -- a-f, g-p, q-z -- reads its partition and produces postings)
Inverters
• Collect all (term, doc) pairs for a partition
• Sort and write to postings lists
• Each partition contains a set of postings

The above process flow is a special case of MapReduce.
Schema in MapReduce
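The schema figure did not survive extraction; the sketch below is a minimal, illustrative rendering of the two phases for index construction (map: a split of documents ==> (term, docID) pairs; reduce: all pairs for a term ==> its postings list). Function names are assumptions, not from the slides.

```python
from collections import defaultdict

def map_phase(split):
    """Map: parse one split of documents into (term, docID) pairs."""
    for doc_id, text in split:
        for term in text.lower().split():
            yield term, doc_id

def reduce_phase(pairs):
    """Reduce: collect the pairs for each term into a sorted postings list."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        # Skip consecutive duplicates so each doc appears once per term.
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return dict(postings)

split = [(1, "caesar was killed"), (2, "caesar was ambitious")]
index = reduce_phase(map_phase(split))
print(index["caesar"])
```

In the real framework the map outputs are partitioned by term range (as on the data-flow slide) and many reducers run in parallel, one per partition.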
Distributed indexing
• MapReduce is a simple interface for distributing complex tasks
• Several indexing steps at Google are computed with MapReduce
Dynamic indexing
• Docs come in over time
  • postings updates for terms already in the dictionary
  • new terms added to the dictionary
• Docs get deleted
Simplest approach
• Maintain a “big” main index
• New docs go into a “small” auxiliary index
• Search across both, merge results
• Deletions:
  • Invalidation bit-vector for deleted docs
  • Filter docs in the search result by this invalidation bit-vector
• Periodically, re-index into one main index
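A minimal sketch of this scheme (class and method names are illustrative, not from the slides): a big main index, a small auxiliary index for new docs, and a set standing in for the invalidation bit-vector.

```python
class DynamicIndex:
    """Main index + auxiliary index + invalidation "bit-vector" sketch."""

    def __init__(self, main_postings):
        self.main = main_postings          # term -> sorted doc list
        self.aux = {}                      # term -> doc list for new docs
        self.deleted = set()               # invalidation bit-vector stand-in

    def add(self, doc_id, terms):
        """New docs go into the small auxiliary index."""
        for t in terms:
            self.aux.setdefault(t, []).append(doc_id)

    def delete(self, doc_id):
        """Mark a doc deleted; it is filtered out at query time."""
        self.deleted.add(doc_id)

    def search(self, term):
        """Search both indices, merge, and filter by the invalidation set."""
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return sorted(d for d in hits if d not in self.deleted)

idx = DynamicIndex({"caesar": [1, 2]})
idx.add(3, ["caesar", "brutus"])
idx.delete(2)
print(idx.search("caesar"))
```

Periodic re-indexing would fold `aux` into `main` and drop the invalidated docs for good.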
Issue with big and small indices
• Corpus-wide statistics are hard to maintain
• E.g., spell-correction: which of several corrected alternatives do we present to the user?
  • We said: pick the one with the most hits
• How do we maintain the top ones with multiple indices?
  • One possibility: ignore the small index for such ordering
• Will see more such statistics used in results ranking
Issue with big and small indices
• Frequent merges
• Poor performance during a merge
  • (at least if implemented naively)
Logarithmic merge
• Maintain a series of indices, each twice as large as the previous one
• Keep the smallest (J0) in memory
• Larger ones (I0, I1, ...) on disk
• If J0 gets too big (> n), write it to disk as I0
  • or merge with I0 (if I0 already exists) and write the merger to I1
• etc.
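The cascade described above can be sketched as follows (a toy in-memory model; lists stand in for on-disk indexes, and names follow the slide's J0/I0/I1 convention):

```python
class LogMergeIndex:
    """Logarithmic merging sketch: in-memory index J0 of capacity n;
    on overflow, cascade-merge into I0, I1, ..., where level i holds
    up to n * 2**i postings."""

    def __init__(self, n):
        self.n = n
        self.j0 = []        # in-memory postings
        self.disk = {}      # level i -> sorted postings run ("on disk")

    def add(self, posting):
        self.j0.append(posting)
        if len(self.j0) >= self.n:
            self._flush()

    def _flush(self):
        run, self.j0, i = sorted(self.j0), [], 0
        while i in self.disk:               # Ii occupied: merge and carry up
            run = sorted(run + self.disk.pop(i))
            i += 1
        self.disk[i] = run                  # write the merger to Ii

idx = LogMergeIndex(n=2)
for p in range(8):
    idx.add(p)
print(sorted(idx.disk))  # after 8 postings with n=2, only I2 is occupied
```

Each flush behaves like binary addition with carries, which is why at most O(log T) levels ever exist.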
Logarithmic merge
• # indices bounded by O(log T)
  • (T = # postings read so far)
• So query processing requires merging at most O(log T) indices
• Each posting is merged at most O(log T) times
• Auxiliary-index scheme: construction time O(T^2)
  • worst case
Dynamic indexing at large search engines
• Often a combination:
  • Frequent incremental changes
  • Occasional complete rebuild
Building positional indices
• Why? Still a sorting problem (but larger)
• Exercise: given 1GB of memory, how would you adapt the block merge described earlier?
Building n-gram indices
• As text is parsed, enumerate n-grams
• For each n-gram, need pointers to all dictionary terms containing it -- the “postings”
• NB: the same “postings entry” can arise repeatedly while parsing docs
  • need an efficient “hash” to keep track
  • E.g., that the trigram uou occurs in the term deciduous will be discovered on each text occurrence of deciduous
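A sketch of the deduplication idea: build the n-gram-to-terms map over the dictionary using sets, so that repeated discoveries of the same (n-gram, term) pair produce a single posting. (No boundary markers like $ are used here, for simplicity.)

```python
from collections import defaultdict

def kgrams(term, k=3):
    """Enumerate the distinct k-grams of a term."""
    return {term[i:i + k] for i in range(len(term) - k + 1)}

def build_kgram_index(vocabulary, k=3):
    """Map each k-gram to the sorted list of dictionary terms containing it.
    Sets play the role of the "hash" that suppresses duplicate postings."""
    index = defaultdict(set)
    for term in vocabulary:
        for g in kgrams(term, k):
            index[g].add(term)
    return {g: sorted(terms) for g, terms in index.items()}

idx = build_kgram_index({"deciduous", "duo"})
print(idx["uou"])
```

Adding "deciduous" a second time would change nothing: the set already contains it, which is exactly the dedup behavior the slide calls for.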
Building n-gram indices
• Once all (n-gram ∈ term) pairs are enumerated, must sort for inversion
• Recall the average English dictionary term is ≈8 characters
  • ==> about 6 trigrams per term on average
• For a vocabulary of 500K terms ==> ~3 million pointers
  • can compress
Index on disk vs. memory
• Most retrieval systems keep
  • the dictionary in memory
  • the postings on disk
• Web search engines frequently keep both in memory
  • massive memory requirement
  • feasible for large Web service installations
  • less so for commercial usage where query loads are lighter
Indexing in the real world
• Typically, not all docs are on a local file system
  • Documents need to be spidered
  • Could be dispersed over a WAN with varying connectivity
  • Must schedule distributed spiders
  • Already discussed distributed indexers
• Could be (secure content) in
  • Databases
  • Content management applications
  • Email applications
Content residing in applications
• Mail systems/groupware and content management systems contain the most “valuable” documents
• HTTP is often not the most efficient way to fetch these documents
  • native API fetching
• Specialized, repository-specific connectors
  • also facilitate document viewing when a search result is selected for viewing
Secure documents
• Each document is accessible to a subset of users
  • Access Control Lists (ACLs)
• Search users are authenticated
• A query retrieves a doc only if the user may access it
  • So if there are docs matching your search but you’re not privy to them: “Sorry, no results found”
  • E.g., as a lowly employee in a company, you get “No results” for the query “salary roster”
Users in groups, docs from groups
• Index ACLs; filter results by them

(figure: users x documents 0/1 matrix -- 0 if the user can’t read the doc, 1 otherwise)

• Often, user membership in an ACL group is verified at query time ==> slowdown
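A minimal sketch of post-filtering results by indexed ACL bits; the `acl` mapping here plays the role of the 0/1 matrix (one set of permitted users per document), and all names are illustrative:

```python
def filter_by_acl(result_docs, user, acl):
    """Keep only the result docs whose ACL bit for this user is 1.
    acl[doc] is the set of users allowed to read doc."""
    return [d for d in result_docs if user in acl.get(d, set())]

# Doc 1 readable by alice and bob; doc 2 by alice only.
acl = {1: {"alice", "bob"}, 2: {"alice"}}
print(filter_by_acl([1, 2], "bob", acl))
```

The group-membership check the slide mentions would sit inside the `user in acl[d]` test, which is why resolving it at query time slows retrieval down.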
Exercise
• Can spelling suggestion compromise such document-level security?
• Consider the case when there are documents matching my query, but I lack access to them
Compound documents
• What if a doc consists of components
  • Each component has its own ACL
• Search should return the doc only if
  • the query matches one of its components
  • + the user has access to that component
• More generally: a doc assembled from computations on components
  • e.g., in Lotus databases or in content management systems
• How to index such docs? No good answers …
“Rich” documents
• How to index images?
• Query Based on Image Content (QBIC) systems
  • “show me a picture similar to this CAT scan”
  • vector space retrieval (later)
• Currently, image search is usually based on meta-data
  • E.g., file name -- e.g., monalisa.jpg
• New approaches exploit social tagging
  • E.g., flickr.com
  • Still meta-data (not image features)
Passage/sentence retrieval
• Suppose we want to retrieve
  • not the entire document matching a query
  • only a passage/sentence/fragment
  • E.g., in a very long document
• Can index passages/sentences as mini-documents
  • What should the index units be?
  • ==> XML search
Resources
• IIR, Chapter 4
• MG, Chapter 5