2.2 T, lecture

INFO 320: Information Needs, Searching, and Presentation (aka… Search)
Instructor: William Jones
Email: [email protected]
TA: Brennen Smith
Email: [email protected]
Lectures: Tuesdays & Thursdays, 1:30 – 3:20 pm, MGH 238
Labs: Wed., 1:30 – 2:20 pm, MGH 030
For this Week 3 (10/13/2013) (Basics of Search)


2.2 T
• Add Word exercise in class
• Boolean search vs. the vector space model
• B-trees
2.2 W
• One-minute madness – each team gets one minute to describe progress on lab exercises & issues encountered.
• On-going work in lab.
William Jones, a 2013
1.2 T
INFO 320
2
And also for this Week 3 (of 10/13)

2.2 Th
• Cool tool presentations
• Essay review
• Wrap-up
• Guest speaker on SEO
2.2 F
• Quiz on Module 2.
Components of a web crawler
from Büttcher, Clarke & Cormack, 2010, Information Retrieval, Chapter 15
Parsing a document
• What format is it in? pdf/word/excel/html?
• What language is it in?
• How to handle “and”?
• What character set is in use? …
What you see..
*from http://en.wikipedia.org/wiki/Lawrence_Massacre
Is not what the crawler gets
*from http://en.wikipedia.org/wiki/Lawrence_Massacre
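The gap between the rendered page and the raw markup can be illustrated with a small text extractor. This is a minimal sketch using Python's standard `html.parser`; the `SAMPLE` markup, the `TextExtractor` class, and the `visible_text` helper are hypothetical names for illustration, not the crawler described in the lecture.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical markup: what the crawler receives.
SAMPLE = (
    "<html><head><title>Sample</title>"
    "<style>p { color: red; }</style></head>"
    "<body><h1>What you see</h1>"
    "<p>Is <b>not</b> what the crawler gets</p></body></html>"
)

print(visible_text(SAMPLE))
```

A real parser would also have to decide what to do with titles, link text, and attributes such as `alt`; here everything outside `<script>` and `<style>` is simply concatenated.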
What character set is in use?
• ISO-8859-1. Latin alphabet part 1 covers North America, Western Europe, Latin America, the Caribbean, Canada, Africa; historically the default for Web pages.
• UTF-8. A character encoding of Unicode. A character in UTF-8 can be from 1 to 4 bytes long. UTF-8 can represent any character in the Unicode standard. UTF-8 is backwards compatible with ASCII. UTF-8 is the preferred encoding for e-mail and web pages.
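The byte-length and ASCII-compatibility claims can be checked directly. The following is a small sketch using only Python's built-in `str.encode` and `bytes.decode`; the sample characters and the helper name `utf8_len` are chosen here for illustration.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# One sample character for each possible UTF-8 length (1 to 4 bytes).
SAMPLES = ["A", "ü", "€", "\U0001F600"]
for ch in SAMPLES:
    print(repr(ch), utf8_len(ch))

# Backwards compatibility: pure ASCII text has identical bytes in both encodings.
ascii_text = "search"
print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))

# Guessing the wrong character set garbles non-ASCII text.
name = "Schütze"
utf8_bytes = name.encode("utf-8")
print(utf8_bytes.decode("utf-8"))       # decoded with the right character set
print(utf8_bytes.decode("iso-8859-1"))  # decoded with the wrong one
```

The last line shows why a parser needs to know the character set before it tokenizes: the same bytes yield a different, garbled term when read as ISO-8859-1.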
*from http://www.w3schools.com/TAGS/ref_charactersets.asp
An HTML sample
*from http://en.wikipedia.org/wiki/Lawrence_Massacre
Typical Stop Word List
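A minimal sketch of stop word removal at indexing time. `STOP_WORDS` below is a tiny hypothetical list chosen for illustration, not the list shown on the slide, and the tokenizer is a deliberately crude assumption (lowercase, split on non-alphanumerics).

```python
import re

# Hypothetical, deliberately tiny stop list; real lists are much longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "in", "to", "is", "or"})


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(text, stop_words=STOP_WORDS):
    """Return the tokens that would be indexed after stop word removal."""
    return [token for token in tokenize(text) if token not in stop_words]


print(index_terms("Venetian blinds and the blind Venetians"))
print(index_terms("juvenile victims of crime"))
print(index_terms("victims of juvenile crime"))
```

Note the cost: once "of" is dropped and word order is ignored, "juvenile victims of crime" and "victims of juvenile crime" reduce to the same set of terms.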
Ambiguity of Natural Language (NL)
• Synonymy: Different Words, Same Meaning
  - “car” ~= “automobile”
  - “stomach pain after eating” ~= “post-prandial abdominal discomfort”
• Polysemy: Same Words, Different Meanings
  - “jaguar” as animal vs. kind of automobile.
  - “juvenile victims of crime” vs. “victims of juvenile crime”
  - Venetian blinds vs. blind Venetians
How to handle synonyms?
• car = automobile
  - When the document contains automobile, index it under car as well (and vice versa).
  - Or expand the query: when the query contains automobile, look under car too.
• Or form a concept, <automobile>
  - When “car” is encountered, index under “<automobile>” (and “car” too?).
  - Likewise for “automobile”.
  - When either “car” or “automobile” is encountered in a query, add the term “<automobile>”.
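The concept approach can be sketched as follows. This is one possible reading, assuming the original word is kept alongside the concept term and that query terms are combined with OR; the `CONCEPTS` table, the sample documents, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical concept table: each synonym maps to one shared concept term.
CONCEPTS = {"car": "<automobile>", "automobile": "<automobile>"}


def with_concepts(tokens):
    """Keep every token and add its concept term, if it has one."""
    out = []
    for token in tokens:
        out.append(token)
        concept = CONCEPTS.get(token)
        if concept and concept not in out:
            out.append(concept)
    return out


def build_index(docs):
    """Map each term (and concept term) to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in with_concepts(text.lower().split()):
            index[term].add(doc_id)
    return index


def search(index, query):
    """OR together the postings of the query terms and their concept terms."""
    hits = set()
    for term in with_concepts(query.lower().split()):
        hits |= index.get(term, set())
    return sorted(hits)


# Hypothetical three-document collection.
DOCS = {1: "used car for sale", 2: "automobile repair manual", 3: "jaguar habitat"}
INDEX = build_index(DOCS)

print(search(INDEX, "car"))
print(search(INDEX, "automobile"))
```

Because both the documents and the query are mapped to `<automobile>`, a query for either word finds documents 1 and 2. Indexing the original word as well still allows an exact-word search when that is wanted.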
Term Weighting
• Binary – presence or absence of term
• TF
  - Simple count
  - “Sublinear” TF scaling
• IDF
• TF·IDF
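The weighting schemes listed above can be written out as small functions. This is a sketch under common textbook conventions, which are assumptions here: sublinear scaling as 1 + log10(tf), and IDF as log10(N / df), where N is the number of documents in the corpus and df is the number of documents containing the term. Other variants exist.

```python
import math
from collections import Counter


def binary_weight(count):
    """Binary: 1 if the term is present, 0 if absent."""
    return 1 if count > 0 else 0


def sublinear_tf(count):
    """Sublinear TF scaling: dampens the effect of very frequent terms."""
    return 1 + math.log10(count) if count > 0 else 0.0


def idf(n_docs, doc_freq):
    """Inverse document frequency: rare terms get a larger weight."""
    return math.log10(n_docs / doc_freq)


def tf_idf(count, n_docs, doc_freq, sublinear=False):
    """TF-IDF: term frequency (raw or sublinear) times IDF."""
    tf = sublinear_tf(count) if sublinear else count
    return tf * idf(n_docs, doc_freq)


# Simple count of each term in one (hypothetical) document.
counts = Counter("blind venetians buy venetian blinds for blind venetians".split())
print(counts["blind"], binary_weight(counts["blind"]), sublinear_tf(counts["blind"]))

# A term occurring 3 times in a document and in 10 of 1000 documents.
print(tf_idf(3, n_docs=1000, doc_freq=10))
```

A term that occurs in every document gets an IDF of 0 under this formula, so it contributes nothing to a TF-IDF score no matter how often it occurs.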
A matrix as a way to understand the index, the vector model and more.

         Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6
Term 1     1      0      1      1      0      0
Term 2     0      1      0      1      0      1
Term 3     0      0      0      0      1      0
Term 4     1      1      …
Term 5     1
Term 6     0
…
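The three complete rows of the matrix are enough to contrast Boolean search with a simple vector-style ranking. This is a sketch: the function names are hypothetical, and the "vector" score here is just the dot product of each document column with a binary query vector, the simplest possible case.

```python
DOCS = ["Doc 1", "Doc 2", "Doc 3", "Doc 4", "Doc 5", "Doc 6"]

# Rows copied from the matrix; only the fully specified rows are used.
MATRIX = {
    "Term 1": [1, 0, 1, 1, 0, 0],
    "Term 2": [0, 1, 0, 1, 0, 1],
    "Term 3": [0, 0, 0, 0, 1, 0],
}


def boolean_and(*terms):
    """Documents that contain every query term."""
    rows = [MATRIX[term] for term in terms]
    return [doc for doc, cells in zip(DOCS, zip(*rows)) if all(cells)]


def boolean_or(*terms):
    """Documents that contain at least one query term."""
    rows = [MATRIX[term] for term in terms]
    return [doc for doc, cells in zip(DOCS, zip(*rows)) if any(cells)]


def ranked(*terms):
    """Score each document by how many query terms it contains, best first."""
    rows = [MATRIX[term] for term in terms]
    scores = [(doc, sum(cells)) for doc, cells in zip(DOCS, zip(*rows))]
    matches = [(doc, score) for doc, score in scores if score > 0]
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))


print(boolean_and("Term 1", "Term 2"))
print(boolean_or("Term 1", "Term 2"))
print(ranked("Term 1", "Term 2"))
```

Boolean AND returns an unordered set (here only Doc 4), while the ranked version returns every partially matching document with the best match first.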
Cells can have weights. Terms can be composites. Documents can have sections…

                   Doc 1.1  Doc 1.2  Doc 2.1  Doc 2.2  Doc 6
William.title         0        3        4        0       0
William.abstract      2        1        0        1       0
William.intro         0        0        7        0       0
…
The index has 3 essential components
1. A term list, structured for fast access to individual terms (the row labels of the matrix).
2. For each term, a list of associations to documents (the cells in that term's row).
3. A list of documents that are indexed (the column labels of the matrix).
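The three components can be derived from the matrix rows. A minimal sketch, assuming documents are identified by their column position and that each term's associations are kept as a sorted list of document ids (a postings list); the `intersect` helper is a standard merge of two sorted lists, named here for illustration.

```python
# Component 3: the list of indexed documents (the position is the document id).
DOCS = ["Doc 1", "Doc 2", "Doc 3", "Doc 4", "Doc 5", "Doc 6"]

# The fully specified rows of the matrix.
MATRIX = {
    "Term 1": [1, 0, 1, 1, 0, 0],
    "Term 2": [0, 1, 0, 1, 0, 1],
    "Term 3": [0, 0, 0, 0, 1, 0],
}

# Component 1: the term list, kept sorted for fast lookup.
TERM_LIST = sorted(MATRIX)

# Component 2: for each term, the ids of the documents it is associated with.
POSTINGS = {
    term: [doc_id for doc_id, cell in enumerate(row) if cell]
    for term, row in MATRIX.items()
}


def intersect(a, b):
    """Merge two sorted postings lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


both = intersect(POSTINGS["Term 1"], POSTINGS["Term 2"])
print(POSTINGS)
print([DOCS[doc_id] for doc_id in both])
```

Storing only the non-zero cells is what makes the index practical: a real term-document matrix is almost entirely zeros.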
The index can store information for each component
1. For terms – overall frequency in the corpus (the basis for IDF), methods to identify or compute the term (e.g., variations in spelling, sound wave transformations, checksums, etc.)
2. For term-to-doc associations – weights, # of occurrences, occurrence offsets, etc.
3. For documents – address (by which to access content), summary, overall popularity (e.g., using PageRank).
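One way to picture the per-component information is as three small record types. This is a sketch only: the class names, the fields chosen, the `example.org` addresses, and the truncated-text "summary" are all assumptions for illustration, and popularity scores such as PageRank are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Posting:
    """Term-to-doc association: where the term occurs in one document."""
    doc_id: int
    offsets: list = field(default_factory=list)  # token positions

    @property
    def count(self):
        return len(self.offsets)  # number of occurrences


@dataclass
class TermEntry:
    """Per-term information: its postings, from which frequency follows."""
    postings: dict = field(default_factory=dict)  # doc_id -> Posting

    @property
    def doc_freq(self):
        return len(self.postings)  # documents containing the term


@dataclass
class DocEntry:
    """Per-document information."""
    address: str
    summary: str


def build_index(docs):
    """docs maps an address to its text; returns (terms, doc_table)."""
    terms, doc_table = {}, []
    for doc_id, (address, text) in enumerate(docs.items()):
        doc_table.append(DocEntry(address=address, summary=text[:40]))
        for offset, token in enumerate(text.lower().split()):
            entry = terms.setdefault(token, TermEntry())
            posting = entry.postings.setdefault(doc_id, Posting(doc_id))
            posting.offsets.append(offset)
    return terms, doc_table


# Hypothetical two-document collection with placeholder addresses.
terms, doc_table = build_index({
    "http://example.org/a": "venetian blinds and blind venetians",
    "http://example.org/b": "blind venetians",
})

print(terms["blind"].doc_freq, terms["blind"].postings[0].offsets)
print(doc_table[1].address)
```

Keeping offsets in the postings is what later allows phrase queries, so that "venetian blinds" can be told apart from "blind venetians".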
Methods for fast access to terms
• Simple sort
  - If updates are few; or term list can reside in RAM.
• Hashing*
• B-trees (more next Thursday)
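The first two methods can be sketched with Python's standard library: binary search over a sorted list via `bisect`, and a hash table via `dict`. The term list below reuses words from earlier slides; the helper names are hypothetical, and B-trees are not shown.

```python
import bisect

# A sorted term list; the position of a term serves as its id.
TERMS = sorted(["car", "automobile", "jaguar", "juvenile",
                "venetian", "blinds", "crime"])


def find_sorted(terms, term):
    """Binary search: position of the term, or None if it is absent."""
    i = bisect.bisect_left(terms, term)
    return i if i < len(terms) and terms[i] == term else None


def prefix_range(terms, prefix):
    """All terms starting with the prefix; they are adjacent in a sorted list."""
    lo = bisect.bisect_left(terms, prefix)
    # Sketch assumption: no term contains a character above U+FFFF.
    hi = bisect.bisect_left(terms, prefix + "\uffff")
    return terms[lo:hi]


# Hashing: constant-time exact lookup, but no ordering between terms.
TERM_IDS = {term: i for i, term in enumerate(TERMS)}

print(find_sorted(TERMS, "car"), TERM_IDS.get("car"))
print(find_sorted(TERMS, "truck"), TERM_IDS.get("truck"))
print(prefix_range(TERMS, "j"))
```

The sorted list supports prefix lookups that a hash table cannot, but inserting a new term means shifting everything after it, which is why it suits cases where updates are few.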
*from http://wapedia.mobi/en/Hash_function
Zipf’s law
• If the terms of a corpus are ranked (r) by the frequency (f) of their occurrence, then…
  r · f ≈ k, for some constant k
• Relates to the Pareto principle, aka the “80-20 rule”.
Schütze, Hinrich; Christopher D. Manning; Prabhakar Raghavan (2008) Introduction to Information Retrieval
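The relation can be checked on any token stream by ranking terms by frequency and multiplying rank by frequency. A minimal sketch: the `rank_frequency` helper is a hypothetical name, and the toy token list is constructed to fit the law exactly, which real text only does approximately.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, term, frequency, rank * frequency), most frequent first."""
    counts = Counter(tokens)
    table = []
    for rank, (term, freq) in enumerate(counts.most_common(), start=1):
        table.append((rank, term, freq, rank * freq))
    return table


# Toy data built so that frequency is exactly 6 / rank.
TOKENS = ["the"] * 6 + ["of"] * 3 + ["and"] * 2

for row in rank_frequency(TOKENS):
    print(row)

# Share of all tokens accounted for by the top-ranked term.
top_share = rank_frequency(TOKENS)[0][2] / len(TOKENS)
print(round(top_share, 2))
```

On a real corpus the product r · f drifts rather than staying fixed, but it remains within a narrow band across a wide range of ranks.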
A sample Zipf distribution
The graph “hugs” the y- and x-axes. Much is accounted for by top-ranked items, but much is also hidden in a looong tail.
*from http://www.celtnet.org.uk/info/long_tail.php
Questions?