Discriminative training of MLNs

Discriminative Training of
Markov Logic Networks
Parag Singla & Pedro Domingos
Outline
Motivation
Review of MLNs
Discriminative Training
Experiments
    Link Prediction
    Object Identification
Conclusion and Future Work
Markov Logic Networks (MLNs)
AI systems must be able to learn, reason logically and handle uncertainty
Markov Logic Networks [Richardson and Domingos, 2004]: an effective way to combine first-order logic and probability
Markov networks are used as the underlying representation
Features are specified using arbitrary formulas in finite first-order logic
Training of MLNs: Generative Approach
Optimize the joint distribution of all the variables
Parameters are learnt independently of the specific inference task
Maximum-likelihood (ML) training: computing the gradient involves inference, which is too slow
Use pseudo-likelihood (PL) as an alternative: easy to compute
PL is suboptimal: it ignores any non-local interactions between variables
ML and PL are both generative training approaches
Training of MLNs: Discriminative Approach
No need to optimize the joint distribution of all the variables
Optimize the conditional likelihood (CL) of non-evidence variables given evidence variables
Parameters are learnt for a specific inference task
Tends to do better than generative training in general
Why is Discriminative Better?
Generative
    Parameters learnt are not optimized for the specific inference task.
    Need to model all the dependencies in the data; learning might become complicated.
    Example of generative models: MRFs
Discriminative
    Parameters learnt are optimized for the specific inference task.
    Need not model dependencies between evidence variables; makes the learning task easier.
    Example of discriminative models: CRFs [Lafferty, McCallum, Pereira 2001]
Markov Logic Networks

A Markov Logic Network (MLN) is a set of
pairs (F, w) where



F is a formula in first-order logic
w is a real number
Together with a finite set of constants,
it defines a Markov network with


One node for each grounding of each predicate in
the MLN
One feature for each grounding of each formula F
in the MLN, with the corresponding weight w
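A minimal sketch of how grounding produces the nodes and features of the Markov network. The predicate names are borrowed from the UW-CSE example later in the talk; the constants `Anna` and `Bob` and both helper functions are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical toy domain: predicate name -> arity, plus a finite set of constants.
predicates = {"Student": 1, "Professor": 1, "AdvisedBy": 2}
constants = ["Anna", "Bob"]

def ground_atoms(predicates, constants):
    """Return one ground atom (network node) per predicate and tuple of constants."""
    atoms = []
    for name, arity in predicates.items():
        for args in product(constants, repeat=arity):
            atoms.append((name, args))
    return atoms

def ground_formula(variables, constants):
    """Return one variable binding (network feature) per grounding of a formula."""
    return [dict(zip(variables, args))
            for args in product(constants, repeat=len(variables))]

atoms = ground_atoms(predicates, constants)
# Groundings of the one-variable rule Student(P) => !Professor(P)
rule_groundings = ground_formula(["P"], constants)
print(len(atoms), len(rule_groundings))
```

With two constants, the two unary predicates contribute two nodes each and the binary predicate contributes four, which is why the number of ground atoms grows quickly with the number of constants.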
Likelihood
$$P(X = x) = \frac{1}{Z} \exp\left( \sum_{i \in F} w_i \, n_i(x) \right) = \frac{1}{Z} \exp\left( \sum_{j \in G} w_j \, g_j(x) \right)$$
n_i(x): # true groundings of ith clause
g_j(x): 1 if jth ground clause is true, 0 otherwise
First sum: iterate over all MLN clauses (F)
Second sum: iterate over all ground clauses (G)
Gradient of Log-Likelihood
$$\frac{\partial}{\partial w_i} \log P_w(x) = n_i(x) - E_w[n_i(x)]$$
n_i(x): feature count according to data
E_w[n_i(x)]: feature count according to model
1st term: # true groundings of formula in DB
2nd term: inference required (slow!)
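The likelihood and its gradient can be checked by exhaustive enumeration on a tiny model. The sketch below assumes a hypothetical two-atom world, two clauses and made-up weights; enumerating all worlds is only feasible at this toy scale, which is exactly why the second term is slow in practice.

```python
from itertools import product
import math

# Hypothetical toy model over two ground atoms x = (student, professor).
# Each function is n_i(x): the number of true groundings of clause i in world x.
def n_rule(x):
    student, professor = x
    return int((not student) or (not professor))  # Student(A) => !Professor(A)

def n_unit(x):
    return int(x[0])                              # unit clause Student(A)

counts = [n_rule, n_unit]
weights = [1.5, 0.5]                              # assumed weights w_i
worlds = list(product([False, True], repeat=2))

def score(x, w):
    return math.exp(sum(wi * n(x) for wi, n in zip(w, counts)))

def probability(x, w):
    z = sum(score(y, w) for y in worlds)          # partition function Z
    return score(x, w) / z

def log_likelihood_gradient(x, w):
    """Return n_i(x) - E_w[n_i(x)] for every clause i."""
    grad = []
    for n in counts:
        expected = sum(probability(y, w) * n(y) for y in worlds)
        grad.append(n(x) - expected)
    return grad

data = (True, False)
grad = log_likelihood_gradient(data, weights)
print(grad)
```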
Pseudo-Likelihood
[Besag, 1975]
$$PL(X) = \prod_x P(x \mid MB(x))$$
Likelihood of each ground atom given its Markov blanket in the data
Does not require inference at each step
Optimized using L-BFGS [Liu & Nocedal, 1989]
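A rough sketch of the pseudo-log-likelihood on a toy model, to show why no global inference is needed: each conditional only compares the current world with the world where one atom is flipped. The two-atom model and its weights are hypothetical, and in a model this small the Markov blanket of an atom is simply every other atom.

```python
import math

# Hypothetical toy model: a world x is a tuple of booleans (student, professor).
def total_weight(x, w):
    student, professor = x
    n_rule = int((not student) or (not professor))  # Student(A) => !Professor(A)
    n_unit = int(student)                           # unit clause Student(A)
    return w[0] * n_rule + w[1] * n_unit

def conditional(x, idx, w):
    """P(x_idx | all other atoms), computed from two unnormalized scores."""
    flipped = list(x)
    flipped[idx] = not x[idx]
    a = math.exp(total_weight(x, w))
    b = math.exp(total_weight(tuple(flipped), w))
    return a / (a + b)

def pseudo_log_likelihood(x, w):
    return sum(math.log(conditional(x, i, w)) for i in range(len(x)))

w = [1.5, 0.5]  # assumed weights
pll = pseudo_log_likelihood((True, False), w)
print(pll)
```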
Gradient of Pseudo-Log-Likelihood
$$\nabla_i = \sum_x \Big[ \mathrm{nsat}_i(x) - \big( p(x=0)\, \mathrm{nsat}_i(x=0) + p(x=1)\, \mathrm{nsat}_i(x=1) \big) \Big]$$
where nsat_i(x=v) is the number of satisfied groundings of clause i in the training data when x takes value v
Most terms not affected by changes in weights
After initial setup, each iteration takes O(# ground predicates × # first-order clauses)
Conditional Likelihood (CL)
Normalize over all possible configurations of non-evidence variables
$$P(Y = y \mid X = x) = \frac{1}{Z_x} \exp\left( \sum_{i \in F_Y} w_i \, n_i(x, y) \right)$$
Y: non-evidence variables
X: evidence variables
F_Y: iterate over all MLN clauses with at least one grounding containing query variables
Derivative of log CL
$$\frac{\partial}{\partial w_i} \log P_w(y \mid x) = n_i(x, y) - E_w[n_i(x, y)]$$
1st term: # true groundings (involving query variables) of formula in DB
2nd term: inference required, as before (slow!)
Derivative of log CL
Approximate the expected count by MAP count
$$\frac{\partial}{\partial w_i} \log P_w(y \mid x) \approx n_i(x, y) - n_i(x, y^*)$$
y^*: MAP state
Approximating the Expected Count
Use Voted Perceptron Algorithm [Collins, 2002]
Approximate the expected count by the count for the most likely (MAP) state
Used successfully for linear chain Markov networks
MAP state found using Viterbi
Voted Perceptron Algorithm
Initialize $w_i = 0$
For t = 1 to T
    Find the MAP configuration according to the current set of weights.
    $w_{i,t} = w_{i,t-1} + \eta \cdot (\text{training count} - \text{MAP count})$
$w_i = \sum_{t=1}^{T} w_{i,t} / T$ (Avoids over-fitting)
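The loop above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the MAP state is found by exhaustive enumeration (only viable for a handful of query atoms), and the predicates `SimHigh` and `Match`, the two clauses, the learning rate and `T` are all hypothetical.

```python
from itertools import product

def clause_counts(sim_high, match):
    """True-grounding counts for two hypothetical clauses over record pairs k."""
    n_a = sum((not s) or m for s, m in zip(sim_high, match))  # SimHigh(k) => Match(k)
    n_b = sum((not m) or s for s, m in zip(sim_high, match))  # Match(k) => SimHigh(k)
    return [n_a, n_b]

def voted_perceptron(count_fn, evidence, labels, n_weights, eta=0.1, T=10):
    worlds = list(product([False, True], repeat=len(labels)))
    train_counts = count_fn(evidence, labels)
    w = [0.0] * n_weights
    total = [0.0] * n_weights
    for _ in range(T):
        # Exhaustive MAP search over query atoms (assumption: tiny problem).
        y_map = max(worlds, key=lambda y: sum(
            wi * ni for wi, ni in zip(w, count_fn(evidence, y))))
        map_counts = count_fn(evidence, y_map)
        # Perceptron update: training count minus MAP count.
        w = [wi + eta * (tc - mc) for wi, tc, mc in zip(w, train_counts, map_counts)]
        total = [s + wi for s, wi in zip(total, w)]
    # Return the average of the weights over all T iterations.
    return [s / T for s in total]

evidence = (True, True, False, False)   # SimHigh for four record pairs
labels = (True, True, False, False)     # Match in the training data
w_avg = voted_perceptron(clause_counts, evidence, labels, n_weights=2)
print(w_avg)
```

Averaging over iterations, rather than keeping only the final weights, is the step the slide credits with avoiding over-fitting.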
Generalizing Voted Perceptron
Finding the MAP configuration is NP-hard in the general case.
Can be reduced to a weighted satisfiability (MaxSAT) problem.
Given a SAT formula in clausal form, e.g. (x1 ∨ x3 ∨ x5) ∧ … ∧ (x5 ∨ x7 ∨ x50), with clause i having weight w_i
Find the assignment maximizing the sum of weights of satisfied clauses.
MaxWalkSAT
[Kautz, Selman & Jiang 97]
Assumes clauses with positive weights
Mixes greedy search with random walks
    Start with some configuration of variables.
    Randomly pick an unsatisfied clause.
    With probability p, flip the literal in the clause which gives maximum gain. With probability 1-p, flip a random literal in the clause.
    Repeat for a pre-decided number of flips, storing the best seen configuration.
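The four steps above can be sketched as follows. This is a simplified illustration rather than the published MaxWalkSAT code: the clause encoding (signed integers for literals), the default `p`, the flip budget and the example formula are assumptions, and the total weight is recomputed from scratch at each step for clarity instead of being updated incrementally.

```python
import random

def max_walk_sat(clauses, weights, n_vars, max_flips=1000, p=0.5, seed=0):
    """Approximately maximize the total weight of satisfied clauses.

    clauses: list of clauses, each a list of nonzero ints; literal k is
    satisfied when variable |k| is True (k > 0) or False (k < 0).
    weights: one positive weight per clause.
    """
    rng = random.Random(seed)
    assign = [rng.random() < 0.5 for _ in range(n_vars + 1)]  # index 0 unused

    def satisfied(clause):
        return any(assign[abs(lit)] == (lit > 0) for lit in clause)

    def total():
        return sum(w for c, w in zip(clauses, weights) if satisfied(c))

    def total_after_flip(lit):
        var = abs(lit)
        assign[var] = not assign[var]
        value = total()
        assign[var] = not assign[var]  # undo the trial flip
        return value

    best, best_score = list(assign), total()
    for _ in range(max_flips):
        unsat = [c for c in clauses if not satisfied(c)]
        if not unsat:
            break
        clause = rng.choice(unsat)
        if rng.random() < p:
            lit = max(clause, key=total_after_flip)  # greedy move
        else:
            lit = rng.choice(clause)                 # random-walk move
        var = abs(lit)
        assign[var] = not assign[var]
        score = total()
        if score > best_score:
            best, best_score = list(assign), score
    return best[1:], best_score

# Hypothetical weighted formula over variables x1..x3.
clauses = [[1, 2], [-1, 3], [-2, -3], [2, 3]]
weights = [1.0, 2.0, 1.5, 0.5]
assignment, score = max_walk_sat(clauses, weights, n_vars=3)
print(assignment, score)
```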
Handling the Negative Weights
MLN allows formulas with negative weights.
A formula with weight w can be replaced by its negation with weight -w in the ground Markov network.
(x1 ∨ x3 ∨ x5) [w] => ¬(x1 ∨ x3 ∨ x5) [-w] => (¬x1 ∧ ¬x3 ∧ ¬x5) [-w]
(¬x1 ∧ ¬x3 ∧ ¬x5) [-w] => ¬x1, ¬x3, ¬x5 [-w/3]
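A small sketch of this rewriting, assuming clauses are encoded as lists of signed integers (a hypothetical encoding, not one given in the talk). A negatively weighted disjunction is negated into a conjunction of negated literals, which is then split into unit clauses that share the weight equally, as in the last line of the slide.

```python
def to_positive_clauses(clause, weight):
    """Rewrite a weighted ground clause so that every weight is non-negative.

    clause: list of nonzero ints (a disjunction); k > 0 means x_k, k < 0 means not x_k.
    Returns a list of (clause, weight) pairs.
    """
    if weight >= 0:
        return [(list(clause), weight)]
    k = len(clause)
    # Negating (l1 v ... v lk) gives (!l1 ^ ... ^ !lk); split it into k unit
    # clauses, each carrying -weight / k.
    return [([-lit], -weight / k) for lit in clause]

converted = to_positive_clauses([1, 3, 5], -1.5)
print(converted)
```

The unit clauses give the same total weight as the conjunction when all of them are satisfied, but they also reward partially satisfying assignments, so the split is an approximation of the original conjunction.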
Weight Initialization and Learning Rate
Weights initialized using log odds of each clause being true in the data.
Determining the learning rate: use a validation set.
Learning rate ∝ 1/#(ground predicates)
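One plausible reading of these two bullets, as a sketch. The clipping constant `eps` and the `scale` factor are assumptions added for illustration: clipping keeps the log odds finite when a clause is always or never true, and `scale` stands in for the constant that would be tuned on the validation set.

```python
import math

def log_odds_weight(true_groundings, total_groundings, eps=1e-6):
    """Initial weight: log odds of the clause being true in the data."""
    p = true_groundings / total_groundings
    p = min(max(p, eps), 1 - eps)  # assumption: clip to avoid log(0)
    return math.log(p / (1 - p))

def learning_rate(num_ground_predicates, scale=1.0):
    """Learning rate proportional to 1 / #(ground predicates)."""
    return scale / num_ground_predicates

w0 = log_odds_weight(90, 100)
eta = learning_rate(1000)
print(w0, eta)
```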
Link Prediction

UW-CSE database







Used by Richardson & Domingos [2004]
Database of people/courses/publications at UW-CSE
22 predicates, e.g. Student(P), Professor(P), AdvisedBy(P1,P2)
1158 constants divided into 10 types
4,055,575 ground atoms
3212 true ground atoms
94 hand-coded rules stating various regularities, e.g. Student(P) => !Professor(P)
Predict AdvisedBy in the absence of information about the predicates Professor and Student
Systems Compared







MLN(VP)
MLN(ML)
MLN(PL)
KB
CL
NB
BN
Results on Link Prediction
[Bar chart: -CLL (y-axis) by System (x-axis: MLN(VP), MLN(ML), MLN(PL), KB, CL, NB, BN). Bar labels in extraction order, not matched to systems: 1.237, 0.693, 0.53, 0.063, 0.033, 0.046, 0.034]
Results on Link Prediction
[Bar chart: AUC (y-axis) by System (x-axis: MLN(VP), MLN(ML), MLN(PL), KB, CL, NB, BN). Bar labels in extraction order, not matched to systems: 0.295, 0.232, 0.114, 0.077, 0.006, 0.065, 0.02]
Object Identification




Given a database of various records referring
to objects in the real world
Each record represented by a set of attribute
values
Want to find out which of the records refer to
the same object
Example: A paper may have more than one
reference in a bibliography database
Why is it Important?





Data Cleaning and Integration – first step in
the KDD process
Merging of data from multiple sources results
in duplicates
Entity Resolution: Extremely important for
doing any sort of data-mining
State of the art – far from what is required.
Citeseer has 30 different entries for the AI
textbook by Russell and Norvig
Standard Approach






[Fellegi & Sunter, 1969]
Look at each pair of records independently
Calculate the similarity score for each
attribute value pair based on some metric
Find the overall similarity score
Merge the records whose similarity is above a
threshold
Take a transitive closure
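The standard pipeline can be sketched as below. Several choices are assumptions made for illustration: token-level Jaccard overlap stands in for "some metric", the overall score is a plain average over fields, the threshold of 0.5 is arbitrary, and the transitive closure is taken with a union-find structure. The four records are the ones used in the talk's bibliography example.

```python
from itertools import combinations

def jaccard(a, b):
    """Token-overlap similarity between two field values (assumed metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def deduplicate(records, threshold=0.5):
    parent = list(range(len(records)))
    # Look at each pair of records independently.
    for i, j in combinations(range(len(records)), 2):
        scores = [jaccard(records[i][f], records[j][f]) for f in records[i]]
        overall = sum(scores) / len(scores)
        if overall > threshold:
            parent[find(parent, i)] = find(parent, j)  # merge; union-find gives the closure
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())

records = [
    {"title": "Object Identification using MLNs", "author": "Linda Stewart", "venue": "KDD 2004"},
    {"title": "Object Identification using MLNs", "author": "Linda Stewart", "venue": "SIGKDD 10"},
    {"title": "Learning Boolean Formulas", "author": "Bill Johnson", "venue": "KDD 2004"},
    {"title": "Learning of Boolean Formulas", "author": "William Johnson", "venue": "SIGKDD 10"},
]
clusters = deduplicate(records)
print(clusters)
```

Under these assumptions the first two records are merged while the last two are not, because each pair is scored in isolation and nothing learned from one pair (such as the two venue strings naming the same venue) carries over to another.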
An Example
Record  Title                             Author           Venue
B1      Object Identification using MLNs  Linda Stewart    KDD 2004
B2      Object Identification using MLNs  Linda Stewart    SIGKDD 10
B3      Learning Boolean Formulas         Bill Johnson     KDD 2004
B4      Learning of Boolean Formulas      William Johnson  SIGKDD 10
Subset of a Bibliography Relation
Graphical Representation in Standard Model
Record-pair node b1=b2? with evidence nodes:
    Title: Sim(Object Identification using MLNs, Object Identification using MLNs)
    Author: Sim(Linda Stewart, Linda Stewart)
    Venue: Sim(KDD 2004, SIGKDD 10)
Record-pair node b3=b4? with evidence nodes:
    Title: Sim(Learning Boolean Formulas, Learning of Boolean Formulas)
    Author: Sim(Bill Johnson, William Johnson)
    Venue: Sim(KDD 2004, SIGKDD 10)
What’s Missing?
[Figure: the standard-model graph, with record-pair nodes b1=b2? and b3=b4? each linked to its own title, author and venue evidence nodes.]
If from b1=b2, you infer that “KDD 2004” is same as “SIGKDD 10”, how can you use that to help figure out if b3=b4?
Collective Model – Basic Idea


Perform simultaneous inference for all
the candidate pairs
Facilitate flow of information through
shared attribute values
Representation in Standard Model
[Figure: record-pair nodes b1=b2? and b3=b4?, each with its own title, author and venue evidence nodes; the venue node Sim(KDD 2004, SIGKDD 10) appears once per record pair.]
No sharing of nodes
Merging the Evidence Nodes
[Figure: record-pair nodes b1=b2? and b3=b4? now share a single venue evidence node Sim(KDD 2004, SIGKDD 10); each keeps its own title and author evidence nodes.]
Still does not solve the problem. Why?
Introducing Information Nodes
Record-pair nodes: b1=b2?, b3=b4?
Information nodes: b1.T=b2.T?, b1.A=b2.A?, b1.V=b2.V?, b3.T=b4.T?, b3.A=b4.A?, b3.V=b4.V?
Evidence nodes:
    Title: Sim(Object Identification using MLNs, Object Identification using MLNs); Sim(Learning Boolean Formulas, Learning of Boolean Formulas)
    Author: Sim(Linda Stewart, Linda Stewart); Sim(Bill Johnson, William Johnson)
    Venue: Sim(KDD 2004, SIGKDD 10)
Full representation in Collective Model
Flow of Information
[Figure, shown over several animation steps: the collective-model graph with record-pair nodes b1=b2? and b3=b4?, information nodes for the title, author and venue of each pair, and the Sim(...) evidence nodes, including the single venue node Sim(KDD 2004, SIGKDD 10).]
MLN Predicates for De-Duplicating Citation Databases
If two bib entries are the same: SameBib(b1,b2)
If two field values are the same: SameAuthor(a1,a2), SameTitle(t1,t2), SameVenue(v1,v2)
If cosine-based TFIDF score of two field values lies in a particular range (0, 0-.2, .2-.4, etc.): 6 predicates for each field.

E.g. AuthorTFIDF.8(a1,a2) is true if TFIDF
similarity score of a1,a2 is in the range (.2, .4]
MLN Rules for De-Duplicating Citation Databases
Singleton Predicates
    ! SameBib(b1,b2)
Two fields are same => corresponding bib entries are same
    Author(b1,a1) ∧ Author(b2,a2) ∧ SameAuthor(a1,a2) => SameBib(b1,b2)
Two papers are same => corresponding fields are same
    Author(b1,a1) ∧ Author(b2,a2) ∧ SameBib(b1,b2) => SameAuthor(a1,a2)
High similarity score => two fields are same
    AuthorTFIDF.8(a1,a2) => SameAuthor(a1,a2)
Transitive closure (currently not incorporated)
    SameBib(b1,b2) ∧ SameBib(b2,b3) => SameBib(b1,b3)
25 first-order predicates, 46 first-order clauses.
Cora Database





Cleaned-up version of McCallum’s Cora database.
1295 citations to 132 different Computer Science research papers, each citation described by author, venue, title fields.
401,552 ground atoms.
82,026 tuples (true ground atoms)
Predict SameBib, SameAuthor, SameVenue
Systems Compared







MLN(VP)
MLN(ML)
MLN(PL)
KB
CL
NB
BN
Results on Cora
Predicting the Citation Matches
[Bar chart: -CLL (y-axis) by System (x-axis: MLN(VP), MLN(ML), MLN(PL), KB, CL, NB, BN). Bar labels in extraction order, not matched to systems: 13.261, 8.629, 0.699, 0.461, 0.082, 0.069, 0.067]
Results on Cora
Predicting the Citation Matches
[Bar chart: AUC (y-axis) by System (x-axis: MLN(VP), MLN(ML), MLN(PL), KB, CL, NB, BN). Bar labels in extraction order, not matched to systems: 0.973, 0.945, 0.722, 0.149, 0.111, 0.187, 0.951]
Results on Cora
Predicting the Author Matches
[Bar chart: -CLL (y-axis) by System (x-axis: MLN(VP), MLN(ML), MLN(PL), KB, CL, NB, BN). Bar labels in extraction order, not matched to systems: 12.973, 3.062, 8.096, 2.375, 0.203, 0.203, 0.069]
Results on Cora
Predicting the Author Matches
[Bar chart: AUC (y-axis) by System (x-axis: MLN(VP), MLN(ML), MLN(PL), KB, CL, NB, BN). Bar labels in extraction order, not matched to systems: 0.969, 0.734, 0.734, 0.323, 0.18, 0.162, 0.09]
Results on Cora
Predicting the Venue Matches
[Bar chart: -CLL (y-axis) by System (x-axis: MLN(VP), MLN(ML), MLN(PL), KB, CL, NB, BN). Bar labels in extraction order, not matched to systems: 13.38, 8.475, 0.708, 0.232, 0.233, 0.233, 1.261]
Results on Cora
Predicting the Venue Matches
[Bar chart: AUC (y-axis) by System (x-axis: MLN(VP), MLN(ML), MLN(PL), KB, CL, NB, BN). Bar labels in extraction order, not matched to systems: 0.771, 0.342, 0.339, 0.096, 0.061, 0.339, 0.047]
Conclusions



Markov Logic Networks – a powerful
way of combining logic and probability.
MLNs can be discriminatively trained
using a voted perceptron algorithm
Discriminatively trained MLNs perform
better than purely logical approaches,
purely probabilistic approaches as well
as generatively trained MLNs.
Future Work




Discriminative learning of MLN structure
Max-margin type training of MLNs
Extensions of MaxWalkSAT
Further application to link prediction, object identification and possibly other areas.