Information Retrieval - Classic IR Models

Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Information Retrieval
Classic IR Models
Bruno Martins
Departamento de Engenharia Informática
Instituto Superior Técnico
1st Semester
2012/2013
Outline
Information
Retrieval
Bruno Martins
Generic
Document
Model
1
Generic Document Model
2
The Boolean Model
3
The Vector Space Model
4
The Probabilistic Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Outline
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
1
Generic Document Model
2
The Boolean Model
3
The Vector Space Model
4
The Probabilistic Model
The Vector
Space Model
The
Probabilistic
Model
Index Terms
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
In the classic IR models, documents are represented by
index terms
full text/selected keywords
structure/no structure
Not all terms are equally useful
index terms can be weighted
We assume that terms are mutually independent
this is, of course, a simplification
Definition of a Document Model
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Definition
Let t be the number of index terms in the collection of
documents, and let ki be a generic index term.
K = {k1 , . . . , kt } is the set of all index terms.
A weight wi,j ≥ 0 is associated with each index term ki of
a document dj .
For an index term which does not appear in the document
text, wi,j = 0.
Each document dj is associated to a term vector d~j ,
represented by d~j = (w1,j , w2,j , . . . , wt,j ).
Function gi (d~j ) returns the weight of index term ki in
vector d~j .
Outline
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
1
Generic Document Model
2
The Boolean Model
3
The Vector Space Model
4
The Probabilistic Model
The Vector
Space Model
The
Probabilistic
Model
The Boolean Model
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Simple model based on Boolean algebra
Intuitive concept
Precise semantics
Clear formal basis
Widely adopted by early information systems
Boolean Model Queries
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Follows Boolean algebra syntax and semantics
Term weights are binary
wi,j ∈ {0, 1}
wi,j = 1 — term present,
wi,j = 0 — term not present
Queries are Boolean expressions
E.g., q = ka ∧ (kb ∨ ¬kc )
Documents are considered relevant if the query evaluates
to 1 (true)
An example
Information
Retrieval
Bruno Martins
Generic
Document
Model
d1
That government is best which
governs least
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
d2
That government is best which
governs not at all
d3
When men are prepared for it,
that will be the kind of
government which they will have
An example
Information
Retrieval
Bruno Martins
Generic
Document
Model
d1
That government is best which
governs least
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
d2
That government is best which
governs not at all
d3
When men are prepared for it,
that will be the kind of
government which they will have
q = government ∧ best
answer: d1 , d2
An example
Information
Retrieval
Bruno Martins
Generic
Document
Model
d1
That government is best which
governs least
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
d2
That government is best which
governs not at all
d3
When men are prepared for it,
that will be the kind of
government which they will have
q = government ∧ best
answer: d1 , d2
q = government∧best∧¬all
answer: d1
An example
Information
Retrieval
Bruno Martins
Generic
Document
Model
d1
That government is best which
governs least
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
d2
That government is best which
governs not at all
d3
When men are prepared for it,
that will be the kind of
government which they will have
q = government ∧ best
answer: d1 , d2
q = government∧best∧¬all
answer: d1
q=
government ∨ (best ∧ ¬all)
answer: d1 , d2 , d3
Document-Query Similarity
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Queries can be translated to a disjunction of conjunctive
vectors
~q = ka ∧ (kb ∨ ¬kc ) ⇔ (1, 1, 1) ∨ (1, 1, 0) ∨ (1, 0, 0)
each tuple corresponds to a vector (ka , kb , kc )
Similarity of a document to a query is defined as:
sim(dj , q) =
1 if ∃q~c ∈ ~q |∀i , gi (d~j ) = gi (q~c )
0 otherwise
Problems of The Boolean Model
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Retrieval based only on binary decisions
More similar to data retrieval than information retrieval
Can retrieve too many, or too little documents
Some documents may be more relevant than others
How do you translate a query to a boolean expression?
Non-expert users may no be able to represent their
information needs using Boolean expressions
Terms are all equally important
Index term weighting can bring great improvements in
performance
Outline
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
1
Generic Document Model
2
The Boolean Model
3
The Vector Space Model
4
The Probabilistic Model
The Vector
Space Model
The
Probabilistic
Model
The Vector Space Model
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Simple model, based on linear algebra
Term weights are not binary
Allows computing a continuous degree of similarity
between queries and documents
Thus, allows ranking documents according to their
possible relevance
Documents as Vectors
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Documents are represented as vectors
d~j = (w1,j , w2,j , . . . , wt,j )
wi,j is the weight of term i in document j
Queries are also vectors
~q = (w1,q , w2,q , . . . , wt,q )
Vector operations can be used to compare
queries×documents (or documents×documents)
An example
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Example
Suppose the vocabulary has two terms k1 = ‘men’,
k2 = ‘government’
Two documents, d1 and d2 can be defined as, for instance
d~1 = (2.2, 5.2)
d~2 = (4.9, 1.0)
An example
Information
Retrieval
Bruno Martins
The Vector
Space Model
The
Probabilistic
Model
government
The Boolean
Model
d~1 = (2.2, 5.2)
d~2 = (4.9, 1.0)
d~1
Generic
Document
Model
d~2
men
Defining Document Vectors
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Two questions are still unanswered:
1
How do we define term weights?
2
How do we compare documents to queries?
Defining Term Weights — TF
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Term frequency
Term frequency is a measure of term importance within a
document
Definition
Let N be the total number of documents in the system and ni
be the number of documents in which term ki appears. The
normalized frequency of a term ki in document dj is given by:
fi,j =
freqi,j
maxl freql,j
where freqi,j is the number of occurrences of term ki in
document dj .
Defining Term Weights — IDF
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
(Inverse) Document frequency
Document frequency is a measure of term importance within a
collection
Definition
The inverse document frequency of a term ki is given by:
idfi = log
N
ni
Defining Term Weights — TF-IDF
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Vector
Space Model
Definition
The weight of a term ki in document dj for the vector space
model is given by the tf-idf formula:
The
Probabilistic
Model
wi,j = fi,j × log
The Boolean
Model
N
ni
Document Similarity
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Similarity between documents and queries is a measure of
the correlation between their vectors
Documents/queries that share the same terms, with
similar weights, should be more similar
Thus, as a similarity measure, we use the cosine of the
angle between the vectors
Pt
wi,j × wi,q
d~j · ~q
qP
sim(dj , q) =
= qP i=1
t
t
2 ×
2
|d~j | × |~q |
w
i=1 i,j
i=1 wi,q
An example
Information
Retrieval
Bruno Martins
d~1
Generic
Document
Model
The Vector
Space Model
The
Probabilistic
Model
cos(α) = 0.9
cos(θ) = 0.8
government
The Boolean
Model
~q
α
d~2
θ
men
Outline
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
1
Generic Document Model
2
The Boolean Model
3
The Vector Space Model
4
The Probabilistic Model
The Vector
Space Model
The
Probabilistic
Model
The Probabilistic Model
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Models the IR problem in a probabilistic framework
Estimates the probability of document dj being relevant to
the end user
Assumes that:
1
2
3
the relevance probability depends only on the query and
the document
there is a subset R of relevant documents
index terms are independent
Term weights are binary
Document Query Similarity
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
As a similarity measure, we use the ratio between the
probability of finding the relevant documents and the
probability of finding the non-relevant documents
sim(dj , q) =
P(R|d~j )
P(R|d~j )
Similarity Probabilities I
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
Initial Equation
sim(dj , q) =
The Vector
Space Model
The
Probabilistic
Model
P(R|d~j )
P(d~j |R) × P(R)
P(d~j |R)
=
∼
P(R|d~j )
P(d~j |R) × P(R)
P(d~j |R)
Assuming term independence...
sim(dj , q) ∼
(Πgi (d~j )=1 P(ki |R)) × (Πgi (d~j )=0 P(ki |R))
(Πgi (d~j )=1 P(ki |R)) × (Πgi (d~j )=0 P(ki |R))
Similarity Probabilities II
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Tanking logs and removing constant factors...
sim(dj , q) =
t
X
i=1
wi,q ×wi,j × log
1 − P(ki |R)
P(ki |R)
+ log
1 − P(ki |R)
P(ki |R)
Blind assumptions
P(ki |R) = 0.5
P(ki |R) = nNi
Similarity Probabilities III
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
After document retrieval...
Let V be the number of returned documents; let Vi be the
number of returned documents containing term ki .
P(ki |R) =
P(ki |R) =
Vi
V
ni −Vi
N−V
Avoiding small values...
n
P(ki |R) =
P(ki |R) =
Vi + Ni
V +1 n
ni −Vi + Ni
N−V +1
Problems of the Probabilistic Model
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
There is no accurate estimate for the first run probabilities
The Vector
Space Model
Index terms are not weighted
The
Probabilistic
Model
Terms are assumed mutually independent
Comparison of the Classic Models
Information
Retrieval
Bruno Martins
Generic
Document
Model
The Boolean
Model
The Vector
Space Model
The
Probabilistic
Model
Boolean model is considered the weakest
There is some controversy over which model shows better
performance: vector space or probabilistic
However, nowadays, the vector space model is the most
widely used