Supervised IR
Refined computation of the relevant set, based on:
• Incremental user feedback (relevance feedback), OR
• An initial fixed training set
  – User tags documents as relevant/irrelevant
  – Routing problem: an initial class label per document
Big open question – how do we obtain feedback automatically, with minimal user effort?

"Unsupervised" IR
Predicting relevance without user feedback. Pattern matching between:
– Query vector/set
– Document vector/set
Co-occurrence of terms is assumed to be an indication of relevance.

Relevance Feedback
Incremental feedback in the vector model (see Rocchio, 1971):
Q0 = initial query
Q1 = a·Q0 + b·(1/N_Rel)·Σ_{i=1..N_Rel} R_i − d·(1/N_Irrel)·Σ_{i=1..N_Irrel} S_i
where the R_i are the vectors of documents judged relevant and the S_i are the vectors of documents judged irrelevant.
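In code, the Rocchio update is just a weighted sum of the original query vector and the two feedback centroids. The sketch below is a minimal illustration, assuming a toy term space and illustrative values for the coefficients a, b, and d; it is not the weighting used by any particular system.

```python
import numpy as np

def rocchio_update(q0, relevant, irrelevant, a=1.0, b=0.75, d=0.15):
    """One round of Rocchio relevance feedback.

    q0          -- initial query vector Q0
    relevant    -- list of vectors R_i for documents judged relevant
    irrelevant  -- list of vectors S_i for documents judged irrelevant
    a, b, d     -- weights on the original query, the relevant centroid,
                   and the irrelevant centroid (illustrative defaults)
    """
    q1 = a * np.asarray(q0, dtype=float)
    if relevant:
        q1 += b * np.mean(relevant, axis=0)    # + b * (1/N_Rel) * sum of R_i
    if irrelevant:
        q1 -= d * np.mean(irrelevant, axis=0)  # - d * (1/N_Irrel) * sum of S_i
    return q1

# Toy term space: (collie, groom, compiler, c++)
q0 = [1, 0, 0, 1]
rel = [[1, 1, 0, 0], [1, 0, 0, 0]]   # documents the user marked relevant
irr = [[0, 0, 1, 1]]                 # document the user marked irrelevant
print(rocchio_update(q0, rel, irr))
```

In practice, negative components of the updated query are often clipped at zero before the query is reissued.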
Probabilistic IR / Text Classification
Document retrieval:
If P(Rel | Doc_i) > P(Irrel | Doc_i), then Doc_i is "relevant"; else Doc_i is "not relevant".
-OR-
If P(Rel | Doc_i) / P(Irrel | Doc_i) > 1, then Doc_i is "relevant" ...
The magnitude of the ratio indicates our confidence.

Text classification:
Select Class_j such that P(Class_j | Doc_i) is maximized
(classes such as Bowling, Dog Breeding, etc.; Doc_i might be an incoming mail message).
Alternately, select the Class_j for which P(Class_j | Doc_i) / P(NOT Class_j | Doc_i) is maximized.

General Formulation
Compute P(Class_j | Evidence), where:
• Class_j is one of a fixed set of K *disjoint* classes (an item can't be a member of more than one)
• Evidence is a set of feature values (e.g. the words in a language, medical test results, etc.)
Uses:
• REL / IRREL – document retrieval
• Work / Bowling / Dog Breeding – text classification / routing
• Spanish / Italian / English – language ID
• Sick / Well – medical diagnosis
• Herpes / Not Herpes – medical diagnosis

Feature Set
Goal: compute P(Class_j | Doc_i), or more abstractly P(Class_j | representation of Doc_i), the probability of the class given some representation of Doc_i:
• P(Class_j | W1, W2, ..., Wk) – one representation: the vector of words in the document
-OR-
• P(Class_j | F1, F2, ..., Fk) – more generally, a list of document features

Problem – Sparse Feature Set
In medical diagnosis it is worth considering all possible feature combinations:

Class     Test 1   Test 2   Test 3   F(H) / F(Not H)
Herpes    T        T        T        30 / 1
-Herpes   T        T        F        12 / 120
Herpes    T        F        T        17 / 3
-Herpes   T        F        F        4 / 186
-Herpes   F        T        T        100 / 32

Here P(Evidence | Class_i) can be computed directly from the data for every evidence pattern, e.g. P(T, T, F | Herpes) = 12 / (total Herpes cases).

In IR there are far too many combinations of feature values to estimate the class distribution for every combination:

Class      Word 17   Word 24   Word 38   Word 54
Work       C++       Compile   Run       486
Personal   Collie    Show      Pup       Fur
Personal   Akita     Show      Pup       Groom

Bayes Rule
P(Class_i | Evidence) = P(Evidence | Class_i) * P(Class_i) / P(Evidence)
• P(Class_i | Evidence) – posterior probability of the class given the evidence
• P(Class_i) – prior probability of the class
• An uninformative prior: P(Class_i) = 1 / (total # of classes)

Example in Medical Diagnosis
A single blood test:
P(Herpes | Evidence) = P(Evidence | Herpes) * P(Herpes) / P(Evidence)
• P(Herpes | Evidence) – probability of herpes given a test result
• P(Evidence | Herpes) – probability of the test result if the patient has herpes
• P(Herpes) – prior probability of the patient having herpes
• P(Evidence) – probability of a (pos/neg) test result
P(Herpes | Positive Test) = .9       P(Not Herpes | Positive Test) = .1
P(Herpes | Negative Test) = .001     P(Not Herpes | Negative Test) = .999

Evidence Decomposition
P(Class_j | Evidence), where the evidence is a given combination of feature values.

Medical diagnosis:
Class      Blood Test   Visible Sores   Fever   Blood Test 2
HERPES     POS          T               T       F
NOT HERP   NEG          F               T       F
NOT HERP   NEG          F               F       F
HERPES     NEG          T               F       T

Text classification / routing:
Class          W13        W27       W34        W49            ...
Work           Compiler   C++       YK486      Disassembler   ...
Dog Breeding   Collie     Show      Grooming   Sire           ...
Personal       date       Tonight   movie      love           ...
Bowling        ...

Example in Text Classification / Routing
Class: Dog Breeding; Evidence: (collie, groom, show)
P(Class_i | Evidence) = P(Evidence | Class_i) * P(Class_i) / P(Evidence)
• P(Class_i) – the prior chance that a mail message is about dog breeding
• P(Evidence | Class_i) – observed directly from the training data

Training data:
Class 1 – Dog Breeding: Collie, Fur, Collie, Groom, Show, Poodle, Sire, Breed, Akita, Pup
Class 2 – Work: Compiler, C++, X86, Lex, YACC, Computer, Java

Probabilistic IR
Target/goal: from the evidence (the words in the document), estimate P(Rel | Doc_i) and P(Irrel | Doc_i).

Document retrieval:
Evidence            P(Rel | Doc_i)   P(Irrel | Doc_i)
(Words in) Doc1     .95              .05
(Words in) Doc2     .80              .20
(Words in) Doc3     .01              .99

Document routing / classification, for a given model of relevance to the user's needs:
Evidence            P(Work1)   P(Work2)   P(Dog Breeding)   P(Bowling)   P(other)
(Words in) Doc1     .91        .01        .07               .02          .01
(Words in) Doc2     .45        .45        .03               .05          .02
(Words in) Doc3     .01        .03        .94               .01          .01

Multiple Binary Splits vs. Flat K-Way Classification
[Figure: question Q1 resolved either by repeated binary splits (A vs. B, then A1/A2 and B1/B2) or by a single flat K-way split into classes A through G.]

Likelihood Ratios
P(Class1 | Evidence) = P(Evidence | Class1) * P(Class1) / P(Evidence)
P(Class2 | Evidence) = P(Evidence | Class2) * P(Class2) / P(Evidence)
Dividing one by the other, P(Evidence) cancels:
P(Class1 | Evidence) / P(Class2 | Evidence) = [P(Evidence | Class1) / P(Evidence | Class2)] * [P(Class1) / P(Class2)]

1. Works directly for binary classifications.
2. A K-way classification can be treated as a series of binary classifications.

• P(Rel | Doc_i) / P(Irrel | Doc_i) – document retrieval: the options are Rel and Irrel
• P(Work | Doc_i) / P(Personal | Doc_i) – a binary routing task (2 possible classes)
• P(Class_j | Doc_i) / P(NOT Class_j | Doc_i) – compute this ratio for all classes; choose the class j for which the ratio is greatest

Independence Assumption
Evidence = w1, w2, w3, ..., wk
P(Class1 | Evidence) / P(Class2 | Evidence)
  = [P(Class1) / P(Class2)] * [P(Evidence | Class1) / P(Evidence | Class2)]
  = [P(Class1) / P(Class2)] * Π_{i=1..k} [P(wi | Class1) / P(wi | Class2)]
Final odds = initial odds * the product of the likelihood ratios for each word.

Using the Independence Assumption
P(Personal | Akita, pup, fur, show) / P(Work | Akita, pup, fur, show)
  = [P(Personal) / P(Work)]
    * [P(Akita | Personal) / P(Akita | Work)]
    * [P(pup | Personal) / P(pup | Work)]
    * [P(fur | Personal) / P(fur | Work)]
    * [P(show | Personal) / P(show | Work)]
i.e. the product of the likelihood ratios for each word, with each ratio estimated from the training counts.

Note: the ratios are (partially) self-weighting.
e.g. P(The | Personal) / P(The | Work) = (5137/100,000) / (5238/100,000) ≈ 1 – common words contribute ratios near 1
e.g. P(Akita | Personal) / P(Akita | Work) = (37/100,000) / (1/100,000) = 37 – discriminative words contribute large ratios

Bayesian Model Applications
Authorship identification:
P(Hamilton | Evidence) / P(Madison | Evidence) = [P(Evidence | Hamilton) / P(Evidence | Madison)] * [P(Hamilton) / P(Madison)]
Sense disambiguation:
P(Tank-Container | Evidence) / P(Tank-Vehicle | Evidence) = [P(Evidence | Tank-Container) / P(Evidence | Tank-Vehicle)] * [P(Tank-Container) / P(Tank-Vehicle)]

Dependence Trees (Hierarchical Bayesian Models)
P(w1, w2, ..., w6) = P(w1) * P(w2|w1) * P(w3|w2) * P(w4|w2) * P(w5|w2) * P(w6|w5)
(or P(w6 | w5, w4) for the last factor, if w6 depends on both w4 and w5)
[Figure: a dependence tree over w1 ... w6; the arrows give the direction of dependence.]

Full Probability Decomposition
P(w) = P(w1) * P(w2|w1) * P(w3|w2,w1) * P(w4|w3,w2,w1) * ...

Using simplifying (Markov) assumptions:
P(w) = P(w1) * P(w2|w1) * P(w3|w2) * P(w4|w3) * ...
(assume each word is conditioned only on the previous word)

Assumption of full independence:
P(w) = P(w1) * P(w2) * P(w3) * P(w4) * ...

Graphical models – partial decomposition into dependence trees:
P(w) = P(w1) * P(w2|w1) * P(w3) * P(w4|w1,w3) * ...
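The decompositions above trade exactness for estimability: the full chain rule reproduces the joint distribution, while the Markov and full-independence versions only approximate it. The sketch below makes that concrete with an invented joint distribution over three binary "words"; every number in it is hypothetical and chosen only for the illustration.

```python
# Hypothetical joint distribution over three binary "words" (1 = present).
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.05, (1, 1, 0): 0.10, (1, 1, 1): 0.30,
}

def marg(fixed):
    """P of the event where the given word positions take the given values."""
    return sum(p for w, p in joint.items()
               if all(w[i] == v for i, v in fixed.items()))

def cond(target, given):
    """P(target | given); both arguments map position -> value."""
    return marg({**given, **target}) / marg(given)

w = (1, 1, 1)
# Full chain rule: P(w1) * P(w2|w1) * P(w3|w1,w2)  -- exact
chain = marg({0: w[0]}) * cond({1: w[1]}, {0: w[0]}) * cond({2: w[2]}, {0: w[0], 1: w[1]})
# Markov assumption: P(w1) * P(w2|w1) * P(w3|w2)   -- approximation
markov = marg({0: w[0]}) * cond({1: w[1]}, {0: w[0]}) * cond({2: w[2]}, {1: w[1]})
# Full independence: P(w1) * P(w2) * P(w3)         -- cruder approximation
indep = marg({0: w[0]}) * marg({1: w[1]}) * marg({2: w[2]})

print(f"true joint   {joint[w]:.4f}")
print(f"chain rule   {chain:.4f}")
print(f"markov       {markov:.4f}")
print(f"independent  {indep:.4f}")
```

The chain-rule value matches the joint entry exactly; the Markov and independence values drift away from it, which is the price paid for needing far fewer parameters to estimate.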
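Pulling the Bayes-rule, likelihood-ratio, and independence-assumption slides together, here is a minimal sketch of a two-class word-based classifier in the spirit of the Work vs. Dog Breeding examples. The tiny training documents echo the slides' word lists but are otherwise invented, and the add-one smoothing is a standard fix (not on the slides) for words with a zero count in one class, which would otherwise drive a likelihood ratio to zero or infinity.

```python
import math
from collections import Counter

def train(docs):
    """Count word occurrences per class; docs is a list of (class, [words])."""
    counts, totals, ndocs = {}, Counter(), Counter()
    for label, words in docs:
        counts.setdefault(label, Counter()).update(words)
        totals[label] += len(words)
        ndocs[label] += 1
    return counts, totals, ndocs

def log_odds(words, c1, c2, counts, totals, ndocs):
    """log [ P(c1|words) / P(c2|words) ] under the independence assumption."""
    vocab = set()
    for c in counts.values():
        vocab |= set(c)
    V = len(vocab)
    # initial odds: log P(c1) - log P(c2), priors estimated from document counts
    score = math.log(ndocs[c1]) - math.log(ndocs[c2])
    for w in words:
        # add-one smoothed likelihoods P(w | class)
        p1 = (counts[c1][w] + 1) / (totals[c1] + V)
        p2 = (counts[c2][w] + 1) / (totals[c2] + V)
        score += math.log(p1) - math.log(p2)   # product of ratios -> sum of logs
    return score

# Invented training data echoing the slides' word lists.
docs = [
    ("DogBreeding", ["collie", "fur", "collie", "groom", "show", "pup"]),
    ("DogBreeding", ["akita", "show", "pup", "sire", "breed"]),
    ("Work", ["compiler", "c++", "x86", "lex", "yacc", "java"]),
    ("Work", ["compiler", "run", "computer", "c++"]),
]
counts, totals, ndocs = train(docs)
s = log_odds(["akita", "pup", "fur", "show"], "DogBreeding", "Work", counts, totals, ndocs)
print("log odds:", s, "->", "DogBreeding" if s > 0 else "Work")
```

Working in log space turns the product of likelihood ratios into a sum and avoids numeric underflow; a positive score corresponds to the ratio P(Class1 | Doc) / P(Class2 | Doc) exceeding 1, the decision rule from the slides.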