Indexing and Complexity

Agenda
• Inverted indexes
• Computational complexity
Some Interesting Questions
• How long will it take to find a document?
– Is there any work we can do in advance?
• If so, how long will that take?
• How big a computer will I need?
– How much disk space? How much RAM?
• What if more documents arrive?
– How much of the advance work must be repeated?
– Will searching become slower?
– How much more disk space will be needed?
A Cautionary Tale
• Searching is easy - just ask Microsoft!
– “Find” can search my 1 GB disk in 30 seconds
• Well, actually it only looks at the file names...
• How long do you think “Find” would take for
– The 100 GB disk we just got?
– The World Wide Web?
• Computers are getting faster, but…
– How does AltaVista give answers in 5 seconds?
The “Inverted File” Trick
• Organize the bag-of-words matrix by terms
– You know the terms that you are looking for
• Look up terms like you search phone books
– For each letter, jump directly to the right spot
• For terms of reasonable length, this is very fast
– For each term, store the document identifiers
• For every document that contains that term
• At query time, use the document identifiers
– Consult a “postings file”
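The trick above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the on-disk structure the slides go on to describe; the function name `build_inverted_index` and the three toy documents are made up for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization (lowercase, split on whitespace) for illustration.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical toy collection, not the documents behind the slides.
docs = {
    1: "quick brown fox",
    2: "now is the time",
    3: "brown dog",
}
index = build_inverted_index(docs)
print(index["brown"])  # [1, 3]
```

At query time only the entry for the query term is consulted; the documents themselves are never scanned.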
An Example
Inverted file entries: A (AI, AL), B (BA, BR), C, D, F, G, J, L, M, N, O, P, Q, T (TH, TI)

Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8   Postings
aid       0      0      0      1      0      0      0      1     4, 8
all       0      1      0      1      0      1      0      0     2, 4, 6
back      1      0      1      0      0      0      1      0     1, 3, 7
brown     1      0      1      0      1      0      1      0     1, 3, 5, 7
come      0      1      0      1      0      1      0      1     2, 4, 6, 8
dog       0      0      1      0      1      0      0      0     3, 5
fox       0      0      1      0      1      0      1      0     3, 5, 7
good      0      1      0      1      0      1      0      1     2, 4, 6, 8
jump      0      0      1      0      0      0      0      0     3
lazy      1      0      1      0      1      0      1      0     1, 3, 5, 7
men       0      1      0      1      0      0      0      1     2, 4, 8
now       0      1      0      0      0      1      0      1     2, 6, 8
over      1      0      1      0      1      0      1      1     1, 3, 5, 7, 8
party     0      0      0      0      0      1      0      1     6, 8
quick     1      0      1      0      0      0      0      0     1, 3
their     1      0      0      0      1      0      1      0     1, 5, 7
time      0      1      0      1      0      1      0      0     2, 4, 6
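A small sketch of how each matrix row turns into a postings list. The four rows below are copied from this slide's term-document matrix; the helper name `row_to_postings` is an assumption made for the example.

```python
# Rows copied from the slide's matrix; columns are Doc 1 through Doc 8.
matrix = {
    "aid":   [0, 0, 0, 1, 0, 0, 0, 1],
    "dog":   [0, 0, 1, 0, 1, 0, 0, 0],
    "over":  [1, 0, 1, 0, 1, 0, 1, 1],
    "quick": [1, 0, 1, 0, 0, 0, 0, 0],
}

def row_to_postings(row):
    """Return the 1-based document numbers whose column holds a 1."""
    return [doc for doc, bit in enumerate(row, start=1) if bit]

postings = {term: row_to_postings(row) for term, row in matrix.items()}
print(postings["over"])  # [1, 3, 5, 7, 8]
```

The postings list stores only the positions of the 1s, which is why it is much smaller than the mostly-zero matrix.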
The Finished Product
Inverted file entries: A (AI, AL), B (BA, BR), C, D, F, G, J, L, M, N, O, P, Q, T (TH, TI)

Term    Postings
aid     4, 8
all     2, 4, 6
back    1, 3, 7
brown   1, 3, 5, 7
come    2, 4, 6, 8
dog     3, 5
fox     3, 5, 7
good    2, 4, 6, 8
jump    3
lazy    1, 3, 5, 7
men     2, 4, 8
now     2, 6, 8
over    1, 3, 5, 7, 8
party   6, 8
quick   1, 3
their   1, 5, 7
time    2, 4, 6
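As an illustration of how these postings might be used at query time, here is a sketch of a Boolean AND over two sorted postings lists. The merge-style `intersect` function is a common textbook approach, not something the slides specify; the postings values are taken from the table.

```python
def intersect(a, b):
    """Merge two sorted postings lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Postings copied from the table on this slide.
postings = {
    "brown": [1, 3, 5, 7],
    "fox": [3, 5, 7],
    "quick": [1, 3],
}
print(intersect(postings["quick"], postings["fox"]))  # [3]
print(intersect(postings["brown"], postings["fox"]))  # [3, 5, 7]
```

Because both lists are sorted, each is read once from start to end, so the work is proportional to the combined length of the two lists.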
What Goes in a Postings File?
• Boolean retrieval
– Just the document number
• Ranked Retrieval
– Document number and term weight (TF*IDF, ...)
• Proximity operators
– Word offsets for each occurrence of the term
• Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
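A sketch of how word offsets could support a proximity operator. The first entry reuses the offsets from the example above (Doc 3 at t17 and t36, Doc 13 at t3 and t45); the term names, the second entry, and the `within` helper are hypothetical.

```python
# Positional postings: term -> {document number: sorted word offsets}.
positional = {
    "term_a": {3: [17, 36], 13: [3, 45]},  # offsets from the slide's example
    "term_b": {3: [18, 90], 13: [20]},     # made up for illustration
}

def within(positional, t1, t2, k):
    """Return documents where t1 and t2 occur within k words of each other."""
    p1, p2 = positional[t1], positional[t2]
    hits = []
    for doc in sorted(p1.keys() & p2.keys()):
        # Brute-force comparison of every pair of offsets; fine for a sketch.
        if any(abs(a - b) <= k for a in p1[doc] for b in p2[doc]):
            hits.append(doc)
    return hits

print(within(positional, "term_a", "term_b", 1))   # [3]
print(within(positional, "term_a", "term_b", 20))  # [3, 13]
```

Storing one offset per occurrence, rather than one entry per document, is what makes this kind of postings file so much larger.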
How Big Is the Postings File?
• Very compact for Boolean retrieval
– About 10% of the size of the documents
• If an aggressive stopword list is used!
• Not much larger for ranked retrieval
– Perhaps 20%
• Enormous for proximity operators
– Sometimes larger than the documents!
• But access is fast - you know where to look
Building an Inverted Index
• Simplest solution is a single sorted array
– Fast lookup using binary search
– But sorting large files on disk is very slow
– And adding one document means starting over
• Tree structures allow easy insertion
– But the worst-case lookup time is linear
• Balanced trees provide the best of both
– Fast lookup and easy insertion
– But they require 45% more disk space
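The single-sorted-array option can be sketched with Python's standard `bisect` module. This in-memory version only hints at the on-disk costs the slide describes: lookup is a binary search, but every insertion has to shift all later entries. The `lookup` and `insert` helpers are assumptions for the example; the terms and postings come from the earlier table, and "cat" with document 9 is made up.

```python
import bisect

# Parallel arrays kept in term order (values from the earlier table).
terms = ["aid", "all", "back", "brown", "come", "dog", "fox"]
postings = [[4, 8], [2, 4, 6], [1, 3, 7], [1, 3, 5, 7],
            [2, 4, 6, 8], [3, 5], [3, 5, 7]]

def lookup(term):
    """Binary search: O(log n) comparisons."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []

def insert(term, doc_ids):
    """Insert a new term in sorted position; shifts every later entry, O(n)."""
    i = bisect.bisect_left(terms, term)
    terms.insert(i, term)
    postings.insert(i, doc_ids)

print(lookup("dog"))   # [3, 5]
insert("cat", [9])     # hypothetical new term and document
print(lookup("cat"))   # [9]
```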
Starting a B+ Tree Inverted File
Now is the time for all good …
(Figure: B+ tree with node keys aaaaa, all, good, now, now, time)
Adding a New Term
Now is the time for all good men …
(Figure: B+ tree after inserting "men", with node keys aaaaa, aaaaa, all, good, now, men, men, now, time)
How Big Is the Inverted Index?
• Typically smaller than the postings file
– Depends on number of terms, not documents
• Eventually almost all terms will be indexed
– But the postings file will continue to grow
• Postings dominate asymptotic space complexity
– Linear in the number of documents
• Assuming that the documents remain about the same size
Some Facts About Disks
• It takes a long time to get the first byte
– A Pentium can do 1,000,000 operations in 10 ms
• But you can get 1,000 bytes just about as fast
– 40 MB/sec transfer rates are typical
• So it pays to put related stuff in each “block”
– M-ary B+ trees are better than binary B+ trees
• Time complexity is measured in disk blocks read
– Since computing time is negligible by comparison
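A rough worked calculation of why block size and fan-out matter. It assumes the 10 ms on the slide is the time to reach the first byte, takes 40 MB as 40,000,000 bytes, and uses a hypothetical vocabulary of 1,000,000 terms and a fan-out of 100; none of those last two figures come from the slides.

```python
SEEK_MS = 10.0                      # assumed time to reach the first byte
BYTES_PER_MS = 40_000_000 / 1000    # 40 MB/sec transfer rate

def read_ms(n_bytes):
    """Approximate time to fetch n_bytes in one contiguous read."""
    return SEEK_MS + n_bytes / BYTES_PER_MS

def levels(n_keys, fanout):
    """Smallest tree height whose fanout**height covers n_keys."""
    height, capacity = 0, 1
    while capacity < n_keys:
        capacity *= fanout
        height += 1
    return height

print(read_ms(1))                  # about 10.000025 ms for one byte
print(read_ms(1000))               # about 10.025 ms for 1,000 bytes
print(levels(1_000_000, 2))        # 20 block reads with a binary tree
print(levels(1_000_000, 100))      # 3 block reads with fan-out 100
```

Under these assumptions a lookup costs roughly 200 ms with a binary tree but about 30 ms with the wider tree, since each level costs one block read.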
Time Complexity
• Indexing
– Walk the inverted file, splitting if needed
– Insert into the postings file in sorted order
– Hours or days for large collections
• Query processing
– Walk the inverted file
– Read the postings file
– Seconds, even for enormous collections
Summary
• Slow indexing yields fast query processing
• We use extra disk space to save query time
– Index space is in addition to document space
– Time and space complexity must be balanced
• Disk block reads are the critical resource
– Fast disks are more useful than fast computers
A Question
• If insertions are more common than queries (for example, filtering news stories as they arrive and then never looking at them again), what kind of an index should you build?