Information Retrieval and Management
Classic IR Models
Departamento de Engenharia Informática
Instituto Superior Técnico
1st Semester
2007/2008
Outline
1. Generic Document Model
2. The Boolean Model
3. The Vector Space Model
4. The Probabilistic Model
5. Comparison of the Classic Models
Index Terms
In the classic IR models, documents are represented by
index terms
full text/selected keywords
structure/no structure
Not all terms are equally useful
index terms can be weighted
We assume that terms are mutually independent
this is, of course, a simplification
Definition of a Document Model
Definition
Let t be the number of index terms in the collection of
documents, and let ki be a generic index term.
K = {k1 , . . . , kt } is the set of all index terms.
A weight wi,j ≥ 0 is associated with each index term ki of
a document dj .
For an index term which does not appear in the document
text, wi,j = 0.
Each document dj is associated with a term vector d~j = (w1,j , w2,j , . . . , wt,j ).
Function gi (d~j ) returns the weight of index term ki in
vector d~j .
The Boolean Model
Simple model based on Boolean algebra
Intuitive concept
Precise semantics
Clear formal basis
Widely adopted by early information systems
Boolean Model Queries
Follows Boolean algebra syntax and semantics
Term weights are binary
wi,j ∈ {0, 1}
wi,j = 1 — term present,
wi,j = 0 — term not present
Queries are Boolean expressions
E.g., q = ka ∧ (kb ∨ ¬kc )
Documents are considered relevant if the query evaluates
to 1 (true)
An example
d1: That government is best which governs least
d2: That government is best which governs not at all
d3: When men are prepared for it, that will be the kind of government which they will have

q = government ∧ best
answer: d1, d2

q = government ∧ best ∧ ¬all
answer: d1

q = government ∨ (best ∧ ¬all)
answer: d1, d2, d3
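The evaluation above can be sketched in Python. This is a minimal illustrative implementation, not part of the original slides; the index layout and the `matches` helper are assumptions:

```python
# Minimal sketch of Boolean retrieval over the three example documents.
docs = {
    "d1": "that government is best which governs least",
    "d2": "that government is best which governs not at all",
    "d3": "when men are prepared for it that will be the kind of "
          "government which they will have",
}

# Binary incidence: wi,j = 1 iff term ki occurs in document dj,
# so each document reduces to the set of terms it contains.
index = {name: set(text.split()) for name, text in docs.items()}

def matches(predicate):
    """Return the documents whose term sets satisfy the Boolean predicate."""
    return sorted(name for name, terms in index.items() if predicate(terms))

# q = government AND best
print(matches(lambda t: "government" in t and "best" in t))
# → ['d1', 'd2']
# q = government AND best AND NOT all
print(matches(lambda t: "government" in t and "best" in t and "all" not in t))
# → ['d1']
# q = government OR (best AND NOT all)
print(matches(lambda t: "government" in t or ("best" in t and "all" not in t)))
# → ['d1', 'd2', 'd3']
```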
Document-Query Similarity
Queries can be translated to a disjunction of conjunctive vectors
~q = (1, 1, 1) ∨ (1, 1, 0) ∨ (1, 0, 0)
each tuple corresponds to a weight vector over (ka , kb , kc )
Similarity of a document to a query is defined as:

sim(dj , q) = 1 if ∃ q~cc | (q~cc ∈ ~q) ∧ (∀ ki , gi (d~j ) = gi (q~cc ))
              0 otherwise
Problems of the Boolean Model
Retrieval based only on binary decisions
More similar to data retrieval than information retrieval
Can retrieve too many, or too few, documents
Some documents may be more relevant than others
How do you translate a query to a Boolean expression?
Non-expert users may not be able to represent their
information needs using Boolean expressions
Terms are all equally important
Index term weighting can bring great improvements in
retrieval performance
The Vector Space Model
Simple model, based on linear algebra
Term weights are not binary
Allows computing a continuous degree of similarity
between queries and documents
Thus, allows ranking documents according to their
possible relevance
Documents as Vectors
Documents are represented as vectors
d~j = (w1,j , w2,j , . . . , wt,j )
wi,j is the weight of term i in document j
Queries are also vectors
~q = (w1,q , w2,q , . . . , wt,q )
Vector operations can be used to compare queries to
documents (or documents to documents)
An example
Example
Suppose the vocabulary has two terms k1 = ‘men’,
k2 = ‘government’
Two documents, d1 and d2 , can be represented as, for instance:
d~1 = (2.2, 5.2)
d~2 = (4.9, 1.0)
An example
[Figure: d~1 = (2.2, 5.2) and d~2 = (4.9, 1.0) drawn as vectors in the plane, with 'men' on the horizontal axis and 'government' on the vertical axis]
Defining Document Vectors
Two questions are still unanswered:
1. How do we define term weights?
2. How do we compare documents to queries?
Defining Term Weights — TF
Term frequency
Term frequency is a measure of term importance within a
document
Definition
Let N be the total number of documents in the system and ni
be the number of documents in which term ki appears. The
normalized frequency of a term ki in document dj is given by:
fi,j = freqi,j / maxl freql,j

where freqi,j is the number of occurrences of term ki in
document dj .
Defining Term Weights — IDF
(Inverse) Document frequency
Document frequency is a measure of term importance within a
collection
Definition
The inverse document frequency of a term ki is given by:
idfi = log(N / ni )
Defining Term Weights — TF-IDF
Definition
The weight of a term ki in document dj for the vector space
model is given by the tf-idf formula:
wi,j = fi,j × log(N / ni )
Document Similarity
Similarity between documents and queries is a measure of
the correlation between their vectors
Documents/queries that share the same terms, with
similar weights, should be more similar
Thus, as a similarity measure, we use the cosine of the
angle between the vectors:

sim(dj , q) = (d~j · ~q) / (|d~j | × |~q |)
            = Σ_{i=1}^t (wi,j × wi,q ) / ( sqrt(Σ_{i=1}^t wi,j ²) × sqrt(Σ_{i=1}^t wi,q ²) )
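Applying the cosine formula to the two-term vectors from the earlier example (the query vector below is a hypothetical one that weights both terms equally):

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q)

# Vectors from the example with k1 = 'men', k2 = 'government':
d1 = (2.2, 5.2)
d2 = (4.9, 1.0)
q = (1.0, 1.0)   # hypothetical query: both terms equally weighted

print(round(cosine_similarity(d1, q), 2))  # → 0.93
print(round(cosine_similarity(d2, q), 2))  # → 0.83
```

Because the cosine depends only on the angle between the vectors, a short document and a long document with the same term proportions receive the same score.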
An example
[Figure: the query vector ~q between d~1 and d~2 in the (men, government) plane; α is the angle between ~q and d~1 , with cos(α) = 0.9, and θ is the angle between ~q and d~2 , with cos(θ) = 0.8]
The Probabilistic Model
Models the IR problem in a probabilistic framework
Estimates the probability of document dj being relevant to
the user
Assumes that:
1. the probability depends only on the query and the document
2. there is a subset R of relevant documents
3. index terms are independent
Term weights are binary
Document-Query Similarity
As a similarity measure, we use the ratio between the
probability of finding the relevant documents and the
probability of finding the non-relevant documents:

sim(dj , q) = P(R | d~j ) / P(R̄ | d~j )

where R̄ is the complement of R, the set of non-relevant documents.
Similarity Probabilities I
Initial Equation
Applying Bayes' rule, and dropping P(R) and P(R̄), which are constant for all documents:

sim(dj , q) = P(R | d~j ) / P(R̄ | d~j )
            = (P(d~j | R) × P(R)) / (P(d~j | R̄) × P(R̄))
            ∼ P(d~j | R) / P(d~j | R̄)

Assuming term independence...

sim(dj , q) ∼ [ (Π_{gi (d~j )=1} P(ki | R)) × (Π_{gi (d~j )=0} P(k̄i | R)) ] / [ (Π_{gi (d~j )=1} P(ki | R̄)) × (Π_{gi (d~j )=0} P(k̄i | R̄)) ]
Similarity Probabilities II
Taking logs and removing constant factors...

sim(dj , q) ∼ Σ_{i=1}^t wi,q × wi,j × ( log [ P(ki | R) / (1 − P(ki | R)) ] + log [ (1 − P(ki | R̄)) / P(ki | R̄) ] )

Blind assumptions (first run, with no relevance information available):
P(ki | R) = 0.5
P(ki | R̄) = ni / N
Similarity Probabilities III
After document retrieval...
Let V be the number of retrieved documents, and let Vi be the
number of retrieved documents containing term ki . The
probabilities can then be re-estimated as:

P(ki | R) = Vi / V
P(ki | R̄) = (ni − Vi ) / (N − V)

Avoiding small values (adding the adjustment factor ni /N)...

P(ki | R) = (Vi + ni /N) / (V + 1)
P(ki | R̄) = (ni − Vi + ni /N) / (N − V + 1)
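The first-run ranking under the blind assumptions can be sketched as follows. The toy collection and the `score` helper are illustrative assumptions; note that with P(ki |R) = 0.5 the first log term vanishes, so each matching term contributes an idf-like weight log((N − ni )/ni ):

```python
import math

# Toy binary term-incidence collection (illustrative):
# each document is represented by the set of index terms it contains.
docs = {
    "d1": {"government", "best", "least"},
    "d2": {"government", "best", "all"},
    "d3": {"men", "prepared"},
    "d4": {"kind", "men"},
}
N = len(docs)
# ni: number of documents in which term ki appears
n = {}
for terms in docs.values():
    for term in terms:
        n[term] = n.get(term, 0) + 1

def score(doc_terms, query_terms):
    """First-run probabilistic score with the blind assumptions
    P(ki|R) = 0.5 and P(ki|R_bar) = ni/N."""
    s = 0.0
    for term in query_terms & doc_terms:  # only shared terms contribute
        p_r = 0.5
        p_nr = n[term] / N
        s += math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)
    return s

query = {"government", "least"}
best_doc = max(docs, key=lambda d: score(docs[d], query))
print(best_doc)  # → d1: the only document containing the rare term 'least'
```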
Problems of the Probabilistic Model
There is no accurate estimate for the first-run probabilities
Index terms are not weighted
Terms are assumed mutually independent
Comparison of the Classic Models
The Boolean model is considered the weakest
There is some controversy over which shows better
performance: the vector space model or the probabilistic model
However, nowadays, the vector space model is the most
widely used