Discriminative Training of Markov Logic Networks
Parag Singla & Pedro Domingos

Outline
- Motivation
- Review of MLNs
- Discriminative Training
- Experiments
  - Link Prediction
  - Object Identification
- Conclusion and Future Work

Markov Logic Networks (MLNs)
- AI systems must be able to learn, reason logically, and handle uncertainty
- Markov Logic Networks [Richardson and Domingos, 2004]: an effective way to combine first-order logic and probability
- Markov networks are used as the underlying representation
- Features are specified using arbitrary formulas in finite first-order logic

Training of MLNs: Generative Approach
- Optimize the joint distribution of all the variables
- Parameters are learnt independently of the specific inference task
- Maximum-likelihood (ML) training: computing the gradient involves inference, which is too slow
- Use pseudo-likelihood (PL) as an alternative: easy to compute
- PL is suboptimal: it ignores any non-local interactions between variables
- ML and PL are generative training approaches

Training of MLNs: Discriminative Approach
- No need to optimize the joint distribution of all the variables
- Optimize the conditional likelihood (CL) of non-evidence variables given evidence variables
- Parameters are learnt for a specific inference task
- Tends to do better than generative training in general

Why is Discriminative Better?
Generative | Discriminative
Parameters learnt are not optimized for the specific inference task. | Parameters learnt are optimized for the specific inference task.
Need to model all the dependencies in the data; learning might become complicated. | Need not model dependencies between evidence variables; makes the learning task easier.
Example of generative models: MRFs | Example of discriminative models: CRFs [Lafferty, McCallum, Pereira 2001]

Markov Logic Networks
- A Markov Logic Network (MLN) is a set of pairs (F, w) where
  - F is a formula in first-order logic
  - w is a real number
- Together with a finite set of constants, it defines a Markov network with
  - One node for each grounding of each predicate in the MLN
  - One feature for each grounding of each formula F in the MLN, with the corresponding weight w

Likelihood
$$P(X = x) = \frac{1}{Z} \exp\Big(\sum_{i \in F} w_i\, n_i(x)\Big) = \frac{1}{Z} \exp\Big(\sum_{j \in G} w_j\, g_j(x)\Big)$$
- $n_i(x)$: number of true groundings of the ith clause; the sum over $F$ iterates over all MLN clauses
- $g_j(x)$: 1 if the jth ground clause is true, 0 otherwise; the sum over $G$ iterates over all ground clauses

Gradient of Log-Likelihood
$$\frac{\partial}{\partial w_i} \log P_w(x) = n_i(x) - E_w[n_i(x)]$$
- 1st term, $n_i(x)$: feature count according to the data (number of true groundings of the formula in the DB)
- 2nd term, $E_w[n_i(x)]$: feature count according to the model; inference required (slow!)
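The gradient above can be checked by hand on a model small enough to enumerate. The following is a minimal sketch, assuming a hypothetical toy MLN with two constants and the single formula Smokes(p) => Cancer(p); the predicate names, constants, and function names are illustrative and do not come from the slides. The expectation is computed by brute-force enumeration of all worlds, which is only feasible for tiny domains and is exactly the step that makes maximum-likelihood training slow in practice.

```python
import itertools
import math

# Hypothetical toy domain (not from the slides): two constants, one formula
# Smokes(p) => Cancer(p), grounded once per constant.
CONSTANTS = ["Anna", "Bob"]
ATOMS = [(pred, c) for pred in ("Smokes", "Cancer") for c in CONSTANTS]


def count_true_groundings(world):
    """n(x): number of constants p for which Smokes(p) => Cancer(p) holds."""
    return sum(
        1 for c in CONSTANTS
        if (not world[("Smokes", c)]) or world[("Cancer", c)]
    )


def all_worlds():
    """Enumerate every truth assignment to the ground atoms (2^4 worlds here)."""
    for values in itertools.product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))


def log_likelihood_gradient(world, weight):
    """d/dw log P_w(x) = n(x) - E_w[n(x)], with the expectation by enumeration."""
    z = 0.0
    weighted_count = 0.0
    for candidate in all_worlds():
        n = count_true_groundings(candidate)
        unnormalized = math.exp(weight * n)
        z += unnormalized
        weighted_count += n * unnormalized
    return count_true_groundings(world) - weighted_count / z


# Observed data: Anna smokes and has cancer; nothing is true for Bob.
observed = {atom: False for atom in ATOMS}
observed[("Smokes", "Anna")] = True
observed[("Cancer", "Anna")] = True
print(log_likelihood_gradient(observed, 0.0))  # prints 0.5
```

At weight 0 every world is equally likely, each grounding of the implication is true in 3 of 4 cases, so the expected count is 1.5 while the observed count is 2. The positive gradient says the weight should increase.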
Pseudo-Likelihood [Besag, 1975]
$$PL(X) = \prod_{x} P(x \mid MB(x))$$
- Likelihood of each ground atom given its Markov blanket in the data
- Does not require inference at each step
- Optimized using L-BFGS [Liu & Nocedal, 1989]

Gradient of Pseudo-Log-Likelihood
$$\nabla_i = \sum_{x} \big[\, nsat_i(x) - p(x=0)\, nsat_i(x=0) - p(x=1)\, nsat_i(x=1) \,\big]$$
where $nsat_i(x=v)$ is the number of satisfied groundings of clause i in the training data when x takes value v
- Most terms not affected by changes in weights
- After initial setup, each iteration takes O(# ground predicates × # first-order clauses)

Conditional Likelihood (CL)
- Normalize over all possible configurations of non-evidence variables
$$P(Y = y \mid X = x) = \frac{1}{Z_x} \exp\Big(\sum_{i \in F_Y} w_i\, n_i(x, y)\Big)$$
- $Y$: non-evidence variables; $X$: evidence variables
- The sum over $F_Y$ iterates over all MLN clauses with at least one grounding containing query variables

Derivative of log CL
$$\frac{\partial}{\partial w_i} \log P_w(y \mid x) = n_i(x, y) - E_w[n_i(x, y)]$$
- 1st term: number of true groundings (involving query variables) of the formula in the DB
- 2nd term: inference required, as before (slow!)

Derivative of log CL
- Approximate the expected count by the MAP count
$$\frac{\partial}{\partial w_i} \log P_w(y \mid x) \approx n_i(x, y) - n_i(x, y^*)$$
where $y^*$ is the MAP state

Approximating the Expected Count
- Use the Voted Perceptron Algorithm [Collins, 2002]
- Approximate the expected count by the count for the most likely (MAP) state
- Used successfully for linear chain Markov networks
- MAP state found using Viterbi

Voted Perceptron Algorithm
- Initialize $w_i = 0$
- For t = 1 to T
  - Find the MAP configuration according to the current set of weights
  - $\Delta w_{i,t} = \eta \cdot (\text{training count} - \text{MAP count})$
- $w_i = \frac{1}{T} \sum_{t=1}^{T} w_{i,t}$ (avoids over-fitting)

Generalizing Voted Perceptron
- Finding the MAP configuration is NP-hard in the general case
- Can be reduced to a weighted satisfiability (MaxSAT) problem
- Given a SAT formula in clausal form, e.g.
  (x1 ∨ x3 ∨ x5) ∧ … ∧ (x5 ∨ x7 ∨ x50)
  with clause i having weight w_i, find the assignment maximizing the sum of weights of satisfied clauses

MaxWalkSAT [Kautz, Selman & Jiang 97]
- Assumes clauses with positive weights
- Mixes greedy search with random walks
- Procedure:
  1. Start with some configuration of variables.
  2. Randomly pick an unsatisfied clause.
  3. With probability p, flip the literal in the clause that gives maximum gain; with probability 1-p, flip a random literal in the clause.
  4. Repeat for a pre-decided number of flips, storing the best configuration seen.

Handling the Negative Weights
- MLN allows formulas with negative weights
- A formula with weight w can be replaced by its negation with weight -w in the ground Markov network
  (x1 ∨ x3 ∨ x5) [w] => ¬(x1 ∨ x3 ∨ x5) [-w] => (¬x1 ∧ ¬x3 ∧ ¬x5) [-w]
  (¬x1 ∧ ¬x3 ∧ ¬x5) [-w] => ¬x1, ¬x3, ¬x5 [-w/3]

Weight Initialization and Learning Rate
- Weights initialized using log odds of each clause being true in the data
- Determining the learning rate: use a validation set
- Learning rate 1/#(ground predicates)

Link Prediction
- UW-CSE database
  - Used by Richardson & Domingos [2004]
  - Database of people/courses/publications at UW-CSE
- 22 predicates, e.g.
  Student(P), Professor(P), AdvisedBy(P1,P2)
- 1158 constants divided into 10 types
- 4,055,575 ground atoms
- 3212 true ground atoms
- 94 hand-coded rules stating various regularities, e.g. Student(P) => !Professor(P)
- Predict AdvisedBy in the absence of information about the predicates Professor and Student

Systems Compared
- MLN(VP)
- MLN(ML)
- MLN(PL)
- KB
- CL
- NB
- BN

Results on Link Prediction
[Bar chart: -CLL by system (MLN(VP), MLN(ML), MLN(PL), KB, CL, NB, BN). Bar values as extracted, assignment to systems not recoverable: 1.237, 0.693, 0.53, 0.063, 0.033, 0.046, 0.034.]

Results on Link Prediction
[Bar chart: AUC by system (MLN(VP), MLN(ML), MLN(PL), KB, CL, NB, BN). Bar values as extracted, assignment to systems not recoverable: 0.295, 0.232, 0.114, 0.077, 0.006, 0.065, 0.02.]

Object Identification
- Given a database of various records referring to objects in the real world
- Each record is represented by a set of attribute values
- Want to find out which of the records refer to the same object
- Example: a paper may have more than one reference in a bibliography database

Why is it Important?
- Data cleaning and integration: first step in the KDD process
- Merging of data from multiple sources results in duplicates
- Entity resolution: extremely important for doing any sort of data mining
- State of the art: far from what is required.
- Citeseer has 30 different entries for the AI textbook by Russell and Norvig

Standard Approach [Fellegi & Sunter, 1969]
- Look at each pair of records independently
- Calculate the similarity score for each attribute value pair based on some metric
- Find the overall similarity score
- Merge the records whose similarity is above a threshold
- Take a transitive closure

An Example
Record | Title | Author | Venue
B1 | Object Identification using MLNs | Linda Stewart | KDD 2004
B2 | Object Identification using MLNs | Linda Stewart | SIGKDD 10
B3 | Learning Boolean Formulas | Bill Johnson | KDD 2004
B4 | Learning of Boolean Formulas | William Johnson | SIGKDD 10
Subset of a Bibliography Relation

Graphical Representation in Standard Model
[Figure: record-pair nodes "b1=b2?" and "b3=b4?", each attached to its own evidence nodes. For b1=b2: Title Sim(Object Identification using MLNs, Object Identification using MLNs), Author Sim(Linda Stewart, Linda Stewart), Venue Sim(KDD 2004, SIGKDD 10). For b3=b4: Title Sim(Learning Boolean Formulas, Learning of Boolean Formulas), Author Sim(Bill Johnson, William Johnson), Venue Sim(KDD 2004, SIGKDD 10). Legend: record-pair node, evidence node.]

What's Missing?
[Figure: same graph as in the standard model.]
- If from b1=b2 you infer that "KDD 2004" is the same as "SIGKDD 10", how can you use that to help figure out if b3=b4?

Collective Model: Basic Idea
- Perform simultaneous inference for all the candidate pairs
- Facilitate flow of information through shared attribute values

Representation in Standard Model
[Figure: record-pair nodes "b1=b2?" and "b3=b4?", each with its own Title, Author, and Venue evidence nodes; Sim(KDD 2004, SIGKDD 10) appears once per record pair.]
- No sharing of nodes

Merging the Evidence Nodes
[Figure: record-pair nodes "b1=b2?" and "b3=b4?" with a single Venue evidence node Sim(KDD 2004, SIGKDD 10); the Title and Author evidence nodes remain separate for each pair.]
- Still does not solve the problem. Why?

Introducing Information Nodes
[Figure: evidence nodes Sim(...) for Title, Author, and Venue; information nodes "b1.T=b2.T?", "b1.A=b2.A?", "b1.V=b2.V?", "b3.T=b4.T?", "b3.A=b4.A?", "b3.V=b4.V?"; record-pair nodes "b1=b2?" and "b3=b4?".]
- Full representation in Collective Model

Flow of Information
[Figure sequence: the collective-model graph with evidence, information, and record-pair nodes, repeated over several animation frames; the node labels are identical in every frame.]
MLN Predicates for De-Duplicating Citation Databases
- If two bib entries are the same: SameBib(b1,b2)
- If two field values are the same: SameAuthor(a1,a2), SameTitle(t1,t2), SameVenue(v1,v2)
- If the cosine-based TFIDF score of two field values lies in a particular range (0, 0-.2, .2-.4, etc.): 6 predicates for each field
  - E.g. AuthorTFIDF.8(a1,a2) is true if the TFIDF similarity score of a1,a2 is in the range (.2, .4]

MLN Rules for De-Duplicating Citation Databases
- Singleton predicates
  ! SameBib(b1,b2)
- Two fields are same => corresponding bib entries are same
  Author(b1,a1) ∧ Author(b2,a2) ∧ SameAuthor(a1,a2) => SameBib(b1,b2)
- Two papers are same => corresponding fields are same
  Author(b1,a1) ∧ Author(b2,a2) ∧ SameBib(b1,b2) => SameAuthor(a1,a2)
- High similarity score => two fields are same
  AuthorTFIDF.8(a1,a2) => SameAuthor(a1,a2)
- Transitive closure (currently not incorporated)
  SameBib(b1,b2) ∧ SameBib(b2,b3) => SameBib(b1,b3)
- 25 first-order predicates, 46 first-order clauses

Cora Database
- Cleaned-up version of McCallum's Cora database
- 1295 citations to 132 different Computer Science research papers, each citation described by author, venue, and title fields
- 401,552 ground atoms.
- 82,026 tuples (true ground atoms)
- Predict SameBib, SameAuthor, SameVenue

Systems Compared
- MLN(VP), MLN(ML), MLN(PL), KB, CL, NB, BN

Results on Cora: Predicting the Citation Matches
[Bar chart: -CLL by system (MLN(VP), MLN(ML), MLN(PL), KB, CL, NB, BN). Bar values as extracted, assignment to systems not recoverable: 13.261, 8.629, 0.699, 0.461, 0.082, 0.069, 0.067.]
[Bar chart: AUC by system. Bar values as extracted, assignment to systems not recoverable: 0.973, 0.945, 0.722, 0.149, 0.111, 0.187, 0.951.]

Results on Cora: Predicting the Author Matches
[Bar chart: -CLL by system. Bar values as extracted, assignment to systems not recoverable: 12.973, 3.062, 8.096, 2.375, 0.203, 0.203, 0.069.]
[Bar chart: AUC by system. Bar values as extracted, assignment to systems not recoverable: 0.969, 0.734, 0.734, 0.323, 0.18, 0.162, 0.09.]

Results on Cora: Predicting the Venue Matches
[Bar chart: -CLL by system. Bar values as extracted, assignment to systems not recoverable: 13.38, 8.475, 0.708, 0.232, 0.233, 0.233, 1.261.]
[Bar chart: AUC by system. Bar values as extracted, assignment to systems not recoverable: 0.771, 0.342, 0.339, 0.096, 0.061, 0.339, 0.047.]

Conclusions
- Markov Logic Networks: a powerful way of combining logic and probability
- MLNs can be discriminatively trained using a voted perceptron algorithm
- Discriminatively trained MLNs perform better than purely logical approaches, purely probabilistic approaches, and generatively trained MLNs

Future Work
- Discriminative learning of MLN structure
- Max-margin type training of MLNs
- Extensions of MaxWalkSAT
- Further application to link prediction, object identification, and possibly other application areas
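Backup: Voted Perceptron on a Toy Problem
As a concrete illustration of the voted perceptron training named in the conclusions, here is a minimal sketch under stated assumptions. The toy task (three record pairs with a hypothetical evidence predicate HighSim and query predicate Same), the two clauses, the function names, and the learning rate of 1.0 are all illustrative and not taken from the experiments. The MAP state is found by brute-force enumeration, which stands in for MaxWalkSAT and only works for tiny problems.

```python
import itertools

# Hypothetical toy task: decide Same(p) for three record pairs given HighSim(p).
HIGH_SIM = (True, True, False)    # evidence variables x
TRAIN_SAME = (True, True, False)  # query variables y as observed in training data


def feature_counts(high_sim, same):
    """Return (n_1, n_2): true groundings of HighSim(p) => Same(p) and of !Same(p)."""
    n_implies = sum(1 for h, s in zip(high_sim, same) if (not h) or s)
    n_not_same = sum(1 for s in same if not s)
    return (n_implies, n_not_same)


def map_state(high_sim, weights):
    """Brute-force MAP over the query atoms (an assumption; replaces MaxWalkSAT)."""
    def score(same):
        counts = feature_counts(high_sim, same)
        return sum(w * n for w, n in zip(weights, counts))

    candidates = itertools.product([True, False], repeat=len(high_sim))
    return max(candidates, key=score)  # ties go to the first candidate


def voted_perceptron(high_sim, train_same, num_iters=10, learning_rate=1.0):
    """Run perceptron updates and return the average of the visited weights."""
    weights = [0.0, 0.0]
    totals = [0.0, 0.0]
    train_counts = feature_counts(high_sim, train_same)
    for _ in range(num_iters):
        map_counts = feature_counts(high_sim, map_state(high_sim, weights))
        # Update: learning rate times (training count - MAP count).
        weights = [w + learning_rate * (n_train - n_map)
                   for w, n_train, n_map in zip(weights, train_counts, map_counts)]
        totals = [total + w for total, w in zip(totals, weights)]
    return [total / num_iters for total in totals]


averaged = voted_perceptron(HIGH_SIM, TRAIN_SAME)
print(averaged, map_state(HIGH_SIM, averaged))
```

Once the MAP state under the current weights matches the training labels, the update is zero and the weights stop changing; averaging over iterations then damps the early, noisier weight vectors.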