Information Retrieval

Prof. Paolo Ferragina, Algoritmi per
"Information Retrieval"
Information Retrieval
Document Parsing
Basic indexing pipeline
Documents to be indexed: Friends, Romans, countrymen.
→ Tokenizer
Token stream: Friends Romans Countrymen
→ Linguistic modules
Modified tokens: friend roman countryman
→ Indexer
Inverted index.
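The pipeline above can be sketched in a few lines of Python. This is only a minimal illustration: the function names (`tokenize`, `normalize`, `build_index`), the toy suffix rules standing in for the linguistic modules, and the second document are all assumptions, not part of the slides.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into alphabetic tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Stand-in for the linguistic modules: lowercase, then two toy suffix rules."""
    token = token.lower()
    if token.endswith("men"):  # countrymen -> countryman (toy rule)
        return token[:-3] + "man"
    if token.endswith("s") and not token.endswith("ss"):  # friends -> friend
        return token[:-1]
    return token

def build_index(docs):
    """Map each normalized term to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Document 1 is the slide's example; document 2 is made up for illustration.
docs = {1: "Friends, Romans, countrymen.", 2: "A friend in Rome."}
index = build_index(docs)
```

Here `index["friend"]` lists both documents, while `index["roman"]` and `index["countryman"]` list only the first.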
Parsing a document
What format is it in?
pdf/word/excel/html?
What character set is in use?
Plain ASCII, UTF-8, UTF-16,…
What language is it in?
Each of these is a classification problem,
with many complications…
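As a rough illustration of the character-set question, one crude heuristic is to try a few candidate encodings in order and keep the first that decodes without error. The function name `guess_encoding` and the candidate order are assumptions. Real detectors treat this as a proper classification problem, as the slide says.

```python
def guess_encoding(raw, candidates=("ascii", "utf-8", "utf-16")):
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None  # none of the candidates fits

sample = "città".encode("utf-8")
```

The order matters: every ASCII byte string is also valid UTF-8, so the most restrictive encoding is tried first. A successful decode does not prove the guess is right.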
Tokenization: Issues
Chinese/Japanese: no spaces between words
Not always guaranteed a unique tokenization
Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
Scripts mixed in the example: Katakana, Hiragana, Kanji, “Romaji”
What about DNA sequences ? ACCCGGTACGCAC...
Definition of tokens → what you can search!
Case folding
Reduce all letters to lower case
Many exceptions
e.g., General Motors
USA vs. usa
Morgen will ich in MIT …
Is this the
German “mit”?
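A minimal sketch of case folding with an exception list. The set `KEEP_CASE` and the function `fold` are hypothetical names, and the list itself is made up for illustration.

```python
KEEP_CASE = {"MIT", "USA"}  # hypothetical list of case-sensitive tokens

def fold(token, keep_case=KEEP_CASE):
    """Lowercase a token unless it is listed as a case-sensitive exception."""
    return token if token in keep_case else token.lower()

folded = [fold(t) for t in ["Friends", "USA", "usa", "MIT"]]
```

A fixed list cannot resolve the “MIT” vs. German “mit” case on its own, because that depends on the language and context of the sentence.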
Stemming
Reduce terms to their “roots”
language dependent
e.g., automate(s), automatic, automation all
reduced to automat.
e.g., casa, casalinga, casata, casamatta, casolare,
casamento, casale, rincasare, case reduced to cas
Originally used to reduce the dictionary size, now…
Porter’s algorithm
Commonest algorithm for stemming English
Conventions + 5 phases of reductions
phases applied sequentially
each phase consists of a set of commands
sample convention: Of the rules in a
compound command, select the one that
applies to the longest suffix.
sses → ss, ies → i, ational → ate, tional → tion
Full morphological analysis: modest benefit!
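The longest-suffix convention can be sketched as follows, using only the four rules quoted on the slide. This is not the full Porter algorithm. In the real algorithm these rules belong to different phases and some carry extra conditions on the stem, which are omitted here. The names `RULES` and `apply_compound` are assumptions.

```python
# Rules quoted on the slide, written as (suffix, replacement) pairs.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_compound(word, rules=RULES):
    """Apply the rule whose suffix is the longest one matching `word`."""
    matching = [(s, r) for s, r in rules if word.endswith(s)]
    if not matching:
        return word  # no rule applies: leave the word unchanged
    suffix, replacement = max(matching, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + replacement

stems = [apply_compound(w) for w in ["caresses", "ponies", "relational", "conditional"]]
```

“relational” matches both “ational” and “tional”. The longer suffix wins, giving “relate” rather than “relation”.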
Thesauri
Handle synonyms and polysemy
Hand-constructed equivalence classes
e.g., car = automobile
e.g., macchina = automobile = spider
For each word it specifies a list of correlated words (usually
synonyms, polysemous terms, or phrases for complex concepts).
Co-occurrence pattern: BT (broader term), NT (narrower term)
Vehicle (BT) → Car → Fiat 500 (NT)
How to use it in a search engine (SE)?
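One possible use, sketched here under assumptions, is query expansion: each query term is extended with its thesaurus entries before the index lookup. The slides do not say this is the intended answer. The dictionary `THESAURUS` and the function `expand_query` are hypothetical.

```python
THESAURUS = {  # hypothetical hand-built equivalence classes
    "car": {"automobile"},
    "automobile": {"car"},
}

def expand_query(terms, thesaurus=THESAURUS):
    """Return the query terms plus every thesaurus entry related to them."""
    expanded = []
    for term in terms:
        for candidate in [term, *sorted(thesaurus.get(term, ()))]:
            if candidate not in expanded:  # keep order, avoid duplicates
                expanded.append(candidate)
    return expanded

query = expand_query(["car", "rental"])
```

The alternative is to apply the equivalence classes at indexing time, which trades a larger index for faster queries.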
Dmoz Directory
Yahoo! Directory
Information Retrieval
Statistical Properties of Documents
Statistical properties of texts
Tokens are not distributed uniformly
They follow the so-called “Zipf Law”
Few tokens are very frequent
A middle-sized set has medium frequency
Many are rare
The 100 most frequent tokens account for 50% of the text
Many of these tokens are stopwords
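The share of a text covered by its most frequent tokens is easy to measure. A minimal sketch, where the function name `top_k_coverage` and the toy sentence are assumptions:

```python
from collections import Counter

def top_k_coverage(tokens, k):
    """Fraction of all token occurrences covered by the k most frequent tokens."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(k))
    return covered / len(tokens)

tokens = "the cat and the dog and the bird".split()
coverage = top_k_coverage(tokens, 2)  # "the" (3) + "and" (2) out of 8 tokens
```

On a real collection, `top_k_coverage(tokens, 100)` is the quantity the slide refers to.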
An example of “Zipf curve”
Zipf’s law log-log plot
The Zipf Law, in detail
The k-th most frequent term has frequency approximately proportional to 1/k;
equivalently, the product of the frequency (f) of a token and its rank (r) is almost a constant:
$r \cdot f = c\,|T|$, that is, $f = c\,|T| / r$
General Law: $f = c\,|T| / r^{s}$, with $s$ between 1.5 and 2.0
Scale-invariant: $f(br) = b^{-s} \cdot f(r)$
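Scale invariance can be checked numerically. In the sketch below, the function name `zipf_frequency` and the values chosen for c, |T|, b and s are assumptions picked only for illustration.

```python
import math

def zipf_frequency(rank, total_tokens, c=0.1, s=1.0):
    """Expected frequency of the token at `rank` under f = c * |T| / rank**s."""
    return c * total_tokens / rank ** s

T, b, s = 1_000_000, 3, 1.5  # illustrative values, not taken from the slides
checks = []
for r in (1, 10, 100):
    lhs = zipf_frequency(b * r, T, s=s)          # f(b * r)
    rhs = b ** -s * zipf_frequency(r, T, s=s)    # b^(-s) * f(r)
    checks.append(math.isclose(lhs, rhs))
```

Multiplying the rank by b always rescales the frequency by the same factor, whatever the starting rank. This is why the curve is a straight line on a log-log plot.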
Distribution vs. cumulative distribution
Power law with smaller exponent
Sum after the k-th element is $\le f_k \, k/(s-1)$
Sum up to the k-th element is $\ge f_k \, k$
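Both bounds follow from the general law. The short derivation below assumes $f_r = c\,|T|/r^{s}$ with $s > 1$ and bounds the tail sum by an integral.

```latex
\[
\sum_{r>k} f_r
  = \sum_{r>k} \frac{c\,|T|}{r^{s}}
  \le \int_{k}^{\infty} \frac{c\,|T|}{x^{s}}\,dx
  = \frac{c\,|T|\,k^{1-s}}{s-1}
  = \frac{f_k\,k}{s-1},
\qquad
\sum_{r \le k} f_r \ge k\,f_k .
\]
```

The second bound holds because each of the first k frequencies is at least $f_k$.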
Consequences of Zipf Law
There exist a few very frequent tokens that do not
discriminate. These are the so-called “stop words”
English: to, from, on, and, the, ...
Italian: a, per, il, in, un,…
There exist many tokens that occur once in a text and
thus discriminate poorly (possibly errors).
English: Calpurnia
Italian: Precipitevolissimevolmente (or the typo “paklo”)
Words with medium frequency are the words that discriminate ☺
Other statistical properties of texts
The number of distinct tokens grows as $|T|^{\beta}$, where $\beta < 1$
(the so-called “Heaps' Law”)
Hence the token length is $\Omega(\log |T|)$
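The vocabulary-growth curve described by Heaps' Law can be measured directly on a token stream. A minimal sketch, where the function name `vocabulary_growth` and the toy sentence are assumptions:

```python
def vocabulary_growth(tokens):
    """Return, for each prefix of the token stream, its number of distinct tokens."""
    seen, growth = set(), []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth

tokens = "to be or not to be that is the question".split()
growth = vocabulary_growth(tokens)
```

A sample this small is far too short to estimate β. On a real collection one would plot `growth` against the prefix length on log-log axes and read β from the slope.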
Interesting words are the ones with medium frequency (Luhn)
Frequency vs. Term significance (Luhn)