Supervised IR
Refined computation of the relevant set based on:
• Incremental user feedback (Relevance Feedback)
  – User tags documents relevant/irrelevant
OR
• Initial fixed training sets
  – Routing problem → initial class
Big open question: how do we obtain feedback automatically with minimal effort?
“Unsupervised” IR
Predicting relevance without user feedback
Pattern Matching:
–Query vector/set
–Document vector/set
–Co-occurrence of terms assumed to be an indication of relevance
Relevance Feedback
Incremental feedback in the vector model (Rocchio, 1971):

Q0 = initial query
Q1 = a·Q0 + b·(1/NRel)·Σ(i=1..NRel) Ri − d·(1/NIrrel)·Σ(i=1..NIrrel) Si

where the Ri are the vectors of the NRel documents tagged relevant and the Si are the vectors of the NIrrel documents tagged irrelevant.
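Below is a minimal Python sketch of this update; the function name rocchio_update and the default weights (a=1.0, b=0.75, d=0.25) are illustrative choices, not values from the notes.

```python
# Minimal sketch of Rocchio relevance feedback.
# Vectors are dicts mapping term -> weight; relevant/irrelevant are lists of such vectors.

def rocchio_update(q0, relevant, irrelevant, a=1.0, b=0.75, d=0.25):
    """Return Q1 = a*Q0 + b*mean(relevant docs) - d*mean(irrelevant docs)."""
    q1 = {term: a * w for term, w in q0.items()}
    for docs, sign, coef in ((relevant, +1, b), (irrelevant, -1, d)):
        if not docs:
            continue
        for doc in docs:
            for term, w in doc.items():
                q1[term] = q1.get(term, 0.0) + sign * coef * w / len(docs)
    # Drop negative weights, a common practical choice.
    return {t: w for t, w in q1.items() if w > 0}

# Example: positive feedback pushes the query toward "collie".
q0 = {"dog": 1.0, "show": 0.5}
rel = [{"dog": 1.0, "collie": 1.0}, {"collie": 1.0, "groom": 1.0}]
irr = [{"compiler": 1.0, "c++": 1.0}]
print(rocchio_update(q0, rel, irr))
```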
Probabilistic IR/Text Classification
Document Retrieval
If P(Rel|Doci) > P(Irrel|Doci)
Then Doci is “relevant”
Else Doci is “not relevant”
-OR-
If P(Rel|Doci) / P(Irrel|Doci) > 1
Then Doci is "relevant"…
The magnitude of the ratio indicates our confidence.
Text Classification
Select Classj such that:
  P(Classj | Doci) is maximized
  (Classj = Bowling, DogBreeding, etc.; Doci = e.g. an incoming mail message)
Alternately, select Classj such that:
  P(Classj | Doci) / P(NOT Classj | Doci) is maximized
General Formulation
Compute:
  P(Classj | Evidence)
  – Classj: one of a fixed K *disjoint classes* (can't be a member of more than one)
  – Evidence: a set of feature values (e.g. words in a language, medical test results, etc.)
Uses:
• REL/IRREL → Document Retrieval
• Work/Bowling/Dog Breeding → Text Classification/Routing
• Spanish/Italian/English → Language ID
• Sick/Well → Medical Diagnosis
• Herpes/Not Herpes → Medical Diagnosis
Feature Set
Goal: To Compute:
  P(Classj | Doci) – abstract formulation
  P(Classj | Representation of Doci) – probability given a representation of Doci
  P(Classj | W1, W2, …, Wk) – one representation: a vector of the words in the document
  -OR-
  P(Classj | F1, F2, …, Fk) – more general: a list of document features
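As a small illustration of the W1…Wk representation, here is a hedged Python sketch that reduces a document to a bag of word counts; the tokenization rule is a simplifying assumption.

```python
# Minimal sketch: represent a document by its word features (W1, W2, ..., Wk).
import re
from collections import Counter

def document_features(text):
    """Return a Counter mapping each token to its count in the document."""
    tokens = re.findall(r"[a-z+#]+", text.lower())
    return Counter(tokens)

msg = "Collie show tonight -- groom the pup, then back to the C++ compiler."
print(document_features(msg))
```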
Problem – Sparse Feature Set
In medical diagnosis there are few enough tests that it is worth considering all possible feature combinations:

  Class     Test 1   Test 2   Test 3   F(H), F(Not H)
  Herpes    T        T        T        30/1
  -Herpes   T        T        F        12/120
  Herpes    T        F        T        17/3
  -Herpes   T        F        F        4/186
  -Herpes   F        T        T        100/32

Can compute P(Evidence|Classi) directly from the data for all evidence patterns,
e.g. P(T,T,F|Herpes) = 12 / (total # of Herpes cases).
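A minimal Python sketch of this direct estimation, assuming the counts in the table above (only the rows shown are included, so the totals are illustrative):

```python
# For each complete test pattern store (count with Herpes, count without Herpes).
counts = {
    (True,  True,  True ): (30, 1),
    (True,  True,  False): (12, 120),
    (True,  False, True ): (17, 3),
    (True,  False, False): (4, 186),
    (False, True,  True ): (100, 32),
}

total_herpes = sum(h for h, _ in counts.values())
total_not    = sum(n for _, n in counts.values())

def p_evidence_given_class(pattern, herpes=True):
    """P(Evidence | Class) estimated directly from the pattern counts."""
    h, n = counts[pattern]
    return h / total_herpes if herpes else n / total_not

# e.g. P(T,T,F | Herpes) = 12 / (total # of Herpes cases)
print(p_evidence_given_class((True, True, False), herpes=True))
```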
In IR there are too many combinations of feature values to estimate the class distribution over all combinations:

  Word 17   Word 24   Word 38   Word 54   Class
  C++       Compile   Run       486       Work
  Collie    Show      Pup       Fur       Personal
  Akita     Show      Pup       Groom     Work? Personal?
Bayes Rule
P(Classi|Evidence) = P(Evidence|Classi) * P(Classi) / P(Evidence)
  – P(Classi|Evidence): posterior probability of the class given the evidence
  – P(Classi): prior probability of the class
Uninformative prior: P(Classi) = 1 / (total # of classes)
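A minimal Python sketch of Bayes rule over a fixed set of disjoint classes; the numeric likelihoods and priors in the example are hypothetical, not taken from the notes.

```python
# `likelihoods` maps class -> P(Evidence|Class); `priors` maps class -> P(Class).

def posterior(likelihoods, priors=None):
    """Return P(Class|Evidence) for every class via Bayes rule."""
    classes = list(likelihoods)
    if priors is None:
        # Uninformative prior: 1 / (total number of classes)
        priors = {c: 1.0 / len(classes) for c in classes}
    # P(Evidence) = sum over classes of P(Evidence|Class) * P(Class)
    p_evidence = sum(likelihoods[c] * priors[c] for c in classes)
    return {c: likelihoods[c] * priors[c] / p_evidence for c in classes}

# Hypothetical numbers: a test result far more probable under Herpes than not.
print(posterior({"Herpes": 0.30, "Not Herpes": 0.02},
                {"Herpes": 0.05, "Not Herpes": 0.95}))
```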
Example in Medical Diagnosis
A single blood test:
P(Herpes|Evidence) = P(Evidence|Herpes) * P(Herpes) / P(Evidence)
  – P(Herpes|Evidence): probability of herpes given a test result
  – P(Evidence|Herpes): probability of the test result if the patient has herpes
  – P(Herpes): prior probability of the patient having herpes
  – P(Evidence): probability of a (pos/neg) test result
For example:
  P(Herpes|Positive Test) = .9      P(Not Herpes|Positive Test) = .1
  P(Herpes|Negative Test) = .001    P(Not Herpes|Negative Test) = .999
Evidence Decomposition
P(Classj | Evidence), where the evidence is a given combination of feature values.

Medical Diagnosis:
  Class      Blood Test   Visible Sores   Fever   Blood Test 2
  HERPES     POS          T               T       F
  NOT HERP   NEG          F               T       F
  NOT HERP   NEG          F               F       F
  HERPES     NEG          T               F       T

Text Classification/Routing:
  Class          W13        W27       W34        W49            Wi
  Work           Compiler   C++       YK486      Disassembler   …
  Dog Breeding   Collie     Show      Grooming   Sire           …
  Personal       date       Tonight   movie      love           …
  Bowling
Example in Text Classification / Routing
Evidence: an incoming message containing (collie, groom, show) – is it Dog Breeding?

P(Classi|Evidence) = P(Evidence|Classi) * P(Classi) / P(Evidence)
  – P(Classi): prior chance that mail is about dog breeding
  – P(Evidence|Classi): observed directly from training data

Training data:
  Class 1 – Dog Breeding: Collie, Fur, Collie, Groom, Show, Poodle, Sire, Breed, Akita, Pup
  Class 2 – Work: Compiler, C++, X86, Lex, YACC, Computer, Java
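A hedged Python sketch of estimating P(word|Class) from these two training lists and scoring the incoming message (collie, groom, show); the add-one (Laplace) smoothing and the uniform 0.5 prior are assumptions added so that unseen words do not zero out the product.

```python
from collections import Counter

training = {
    "Dog Breeding": ["collie", "fur", "collie", "groom", "show", "poodle",
                     "sire", "breed", "akita", "pup"],
    "Work": ["compiler", "c++", "x86", "lex", "yacc", "computer", "java"],
}
vocab = {w for words in training.values() for w in words}

def p_word_given_class(word, cls):
    """P(word|Class) with add-one smoothing over the training vocabulary."""
    counts = Counter(training[cls])
    return (counts[word] + 1) / (len(training[cls]) + len(vocab))

def score(cls, message, prior=0.5):
    """P(Class) * product of P(word|Class) -- proportional to P(Class|message)."""
    p = prior
    for word in message:
        p *= p_word_given_class(word, cls)
    return p

msg = ["collie", "groom", "show"]
for cls in training:
    print(cls, score(cls, msg))
```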
Probabilistic IR
Target/Goal: estimate, for a given model of relevance to the user's needs:

Document Retrieval:
  Evidence (Document)   P(Rel|Doci)   P(Irrel|Doci)
  (Words in) Doc1       .95           .05
  (Words in) Doc2       .80           .20
  (Words in) Doc3       .01           .99

Routing / Classification:
  Evidence (Document)   P(Work1)   P(Work2)   P(Dog Breeding)   P(Bowling)   P(other)
  (Words in) Doc1       .91        .01        .07               .02          .01
  (Words in) Doc2       .45        .45        .03               .05          .02
  (Words in) Doc3       .01        .03        .94               .01          .01
Multiple Binary Splits
  (diagram: Q1 splits into A and B; A splits into A1 and A2; B splits into B1 and B2)
Flat K-Way Classification
  (diagram: Q1 splits directly into A, B, C, D, E, F, G)
Likelihood Ratios
P(Class1|Evidence) = P(Evidence|Class1) * P(Class1) / P(Evidence)
P(Class2|Evidence) = P(Evidence|Class2) * P(Class2) / P(Evidence)

Dividing the two, P(Evidence) cancels:
P(Class1|Evidence) / P(Class2|Evidence) = [P(Evidence|Class1) / P(Evidence|Class2)] * [P(Class1) / P(Class2)]
Likelihood Ratios
1. Binary classifications:
   – P(Rel|Doci) / P(Irrel|Doci): document retrieval (the options are Rel and Irrel)
   – P(Work|Doci) / P(Personal|Doci): binary routing task (2 possible classes)
2. Can treat K-way classification as a series of binary classifications
3. P(Classj|Doci) / P(NOT Classj|Doci):
   • Compute this ratio for all classes
   • Choose the class j for which this ratio is greatest (see the sketch after this list)
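A minimal sketch of step 3, assuming we already have posteriors P(Classj|Doc) (here the Doc3 row from the routing table above): compute the ratio against NOT Classj for every class and take the largest.

```python
# K-way classification as a series of binary splits:
# for each class j compute P(Classj|Doc) / P(NOT Classj|Doc) and take the max.
# `posteriors` maps class -> P(Class|Doc), summing to 1 over disjoint classes.

def best_class_by_ratio(posteriors):
    def ratio(cls):
        p = posteriors[cls]
        # P(NOT Classj|Doc) = 1 - P(Classj|Doc) for disjoint classes
        return p / (1.0 - p) if p < 1.0 else float("inf")
    return max(posteriors, key=ratio)

doc3 = {"Work1": .01, "Work2": .03, "Dog Breeding": .94, "Bowling": .01, "other": .01}
print(best_class_by_ratio(doc3))  # -> "Dog Breeding"
```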
Independence Assumption
Evidence = w1, w2, w3, …, wk

P(Class1|Evidence) / P(Class2|Evidence)
  = [P(Class1) / P(Class2)] * [P(Evidence|Class1) / P(Evidence|Class2)]
  = [P(Class1) / P(Class2)] * Π(i=1..k) [P(wi|Class1) / P(wi|Class2)]

Final odds = initial odds * product of the per-word likelihood ratios.
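A minimal sketch of this product form; the per-word probabilities below are hypothetical placeholders.

```python
# Posterior odds = prior odds * product over words of P(wi|Class1)/P(wi|Class2).

def posterior_odds(words, p_word_given_class, prior_odds=1.0,
                   class1="Class1", class2="Class2"):
    odds = prior_odds
    for w in words:
        odds *= p_word_given_class[class1][w] / p_word_given_class[class2][w]
    return odds

p = {
    "Class1": {"akita": 0.004, "pup": 0.006, "show": 0.003},
    "Class2": {"akita": 0.0002, "pup": 0.0004, "show": 0.005},
}
print(posterior_odds(["akita", "pup", "show"], p))
```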
Using Independence Assumption

P(Personal|Akita, pup, fur, show) / P(Work|Akita, pup, fur, show)
  = [P(Personal) / P(Work)]
    * [P(Akita|Personal) / P(Akita|Work)]
    * [P(pup|Personal) / P(pup|Work)]
    * [P(fur|Personal) / P(fur|Work)]
    * [P(show|Personal) / P(show|Work)]

Product of the likelihood ratios for each word (a1 = some constant).
Example values for the ratios: 1/9, 27/2, 18/0, 36/2, 3/5.
Note: Ratios (Partially) Self-Weighting
  P(Akita|Personal) / P(Akita|Work) ≈ 37   (e.g. (37/100,000) / (1/100,000) = 37/1)
  P(The|Personal) / P(The|Work) ≈ 1        (e.g. (5137/100,000) / (5238/100,000) ≈ 1/1)
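Using the frequencies quoted above, a two-line check of the self-weighting effect:

```python
# A discriminative word like "Akita" contributes a large ratio,
# while a common word like "the" contributes a ratio near 1.
ratio_akita = (37 / 100_000) / (1 / 100_000)       # P(Akita|Personal)/P(Akita|Work) = 37
ratio_the   = (5137 / 100_000) / (5238 / 100_000)  # P(The|Personal)/P(The|Work) ~ 0.98
print(ratio_akita, ratio_the)
```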
Bayesian Model Applications
Authorship Identification:
  P(Hamilton|Evidence) / P(Madison|Evidence) = [P(Evidence|Hamilton) / P(Evidence|Madison)] * [P(Hamilton) / P(Madison)]
Sense Disambiguation:
  P(Tank-Container|Evidence) / P(Tank-Vehicle|Evidence) = [P(Evidence|Tank-Container) / P(Evidence|Tank-Vehicle)] * [P(Tank-Container) / P(Tank-Vehicle)]
Dependence Trees
(Hierarchical Bayesian Models)
P(w1, w2, …, w6) = P(w1) * P(w2|w1) * P(w3|w2) * P(w4|w2) * P(w5|w2) * P(w6|w5)
(Tree diagram: w1 → w2; w2 → w3, w4, w5; w5 → w6; arrows give the direction of dependence.
If w6 also depended on w4, its factor would become P(w6|w5,w4) and the structure would no longer be a tree.)
Full Probability Decomposition
P(w) = P(w1) * P(w2|w1) * P(w3|w2,w1) * P(w4|w3,w2,w1) * …
Using Simplifying (Markov) Assumptions
P(w) = P(w1) * P(w2|w1) * P(w3|w2) * P(w4|w3) * …
(Assume each word is conditioned only on the immediately preceding word)
Assumption of Full Independence
P(w) = P(w1) * P(w2) * P(w3) * P(w4) * …
Graphical Models – Partial Decomposition into Dependence Trees
P(w) = P(w1) * P(w2|w1) * P(w3) * P(w4|w1,w3) * …
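A small Python sketch contrasting the full-independence and Markov factorizations above; the probability values and the P(wi|previous word) stub are hypothetical placeholders, included only to show how the joint probability factors under each assumption.

```python
# Hypothetical unigram probabilities P(wi)
p1 = {"collie": 0.02, "show": 0.05, "pup": 0.03, "groom": 0.01}

def p2(w, prev):
    """Stub for P(w | previous word) -- a constant placeholder."""
    return 0.10

words = ["collie", "show", "pup", "groom"]

# Full independence: P(w) = P(w1) * P(w2) * P(w3) * P(w4)
independent = 1.0
for w in words:
    independent *= p1[w]

# Markov assumption: P(w) = P(w1) * P(w2|w1) * P(w3|w2) * P(w4|w3)
markov = p1[words[0]]
for prev, w in zip(words, words[1:]):
    markov *= p2(w, prev)

print(independent, markov)
```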