Hybrids of generative and discriminative models

MSRC Summer School - 30/06/2009
Hybrids of generative and discriminative methods for machine learning
Cambridge – UK
Motivation
Generative models
• incorporate prior knowledge
• handle missing data, such as labels
Discriminative models
• perform well at classification
However
• no straightforward way to combine them
Content
Generative and discriminative methods
A principled hybrid framework
• Study of the properties on a toy example
• Influence of the amount of labelled data
Content
Generative and discriminative methods
A principled hybrid framework
• Study of the properties on a toy example
• Influence of the amount of labelled data
Generative methods
Answer: “what does a cat look like? and a dog?” => joint distribution of data and labels
x : data
c : label
θ : parameters
Generative methods
Objective function:
G(θ) = p(θ) p(X, C | θ)
G(θ) = p(θ) ∏_n p(x_n, c_n | θ)
One reusable model per class; can deal with incomplete data
Example: GMMs
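As an illustrative sketch (not code from the talk), the generative criterion with one spherical Gaussian per class and a flat prior p(θ), in NumPy:

    import numpy as np

    def fit_generative(X, c, n_classes):
        # Maximum-likelihood fit: one spherical Gaussian per class, plus class priors.
        params = []
        for k in range(n_classes):
            Xk = X[c == k]
            params.append({
                "pi": len(Xk) / len(X),       # class prior p(c = k)
                "mu": Xk.mean(axis=0),        # class mean
                "var": Xk.var() + 1e-6,       # shared spherical variance
            })
        return params

    def log_G(X, c, params):
        # Generative criterion (log scale, flat prior over theta):
        # sum_n log p(x_n, c_n | theta)
        total = 0.0
        for x, k in zip(X, c):
            p = params[k]
            d = x.size
            total += (np.log(p["pi"])
                      - 0.5 * d * np.log(2.0 * np.pi * p["var"])
                      - 0.5 * np.sum((x - p["mu"]) ** 2) / p["var"])
        return total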
Example of generative model
Discriminative methods
Answer: “is it a cat or a dog?” => posterior distribution of the labels
x : data
c : label
θ : parameters
Discriminative methods
The objective function is:
D(θ) = p(θ) p(C | X, θ)
D(θ) = p(θ) ∏_n p(c_n | x_n, θ)
Focus on regions of ambiguity; make faster predictions
Example: neural networks, SVMs
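A corresponding sketch (again only an illustration, not code from the talk): the discriminative criterion for a softmax classifier, so θ = (W, b):

    import numpy as np

    def log_D(W, b, X, c):
        # Discriminative criterion (log scale, flat prior over theta):
        # sum_n log p(c_n | x_n, theta), with theta = (W, b) a softmax classifier.
        logits = X @ W + b                                     # shape (N, K)
        logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return log_probs[np.arange(len(c)), c].sum()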
Example of discriminative model: SVMs / NNs
Generative versus discriminative
No effect of the double mode on the decision boundary
Content
Generative and discriminative methods
A principled hybrid framework
• Study of the properties on a toy example
• Influence of the amount of labelled data
Semi-supervised learning
Few labelled data points, lots of unlabelled data
Discriminative methods overfit; generative models only help classification if they are “good”
We need the modelling power of generative models while performing well at discrimination => hybrid models
Discriminative training
Bach et al, ICASSP 05
Discriminative objective function:
D(θ) = p(θ) ∏_n p(c_n | x_n, θ)
Using a generative model:
D(θ) = p(θ) ∏_n [ p(x_n, c_n | θ) / ∑_c p(x_n, c | θ) ]
Convex combination
Bouchard et al, COMPSTAT 04
Generative objective function:
G(θ) = p(θ) ∏_n p(x_n, c_n | θ)
Discriminative objective function:
D(θ) = p(θ) ∏_n p(c_n | x_n, θ)
Convex combination:
log L(θ) = α log D(θ) + (1 − α) log G(θ),   α ∈ [0, 1]
A principled hybrid model
θ  - posterior distribution of the labels
θ' - marginal distribution of the data
θ and θ' communicate through a prior
Hybrid objective function:
L(θ, θ') = p(θ, θ') ∏_n p(c_n | x_n, θ) ∏_n p(x_n | θ')
A principled hybrid model
θ = θ' => p(θ, θ') = p(θ) δ(θ − θ')
L(θ, θ') = p(θ) δ(θ − θ') ∏_n p(c_n | x_n, θ) ∏_n p(x_n | θ')
L(θ) = G(θ)   generative case
θ independent of θ' => p(θ, θ') = p(θ) p(θ')
L(θ, θ') = [ p(θ) ∏_n p(c_n | x_n, θ) ] × [ p(θ') ∏_n p(x_n | θ') ]
L(θ, θ') = D(θ) × f(θ')   discriminative case
A principled hybrid model
Anything in between – hybrid case
Choice of prior:
p(θ, θ') = p(θ) N(θ' | θ, σ(α))
α → 0  =>  σ → 0  =>  θ = θ'
α → 1  =>  σ → ∞  =>  θ independent of θ'
Why principled?
Consistent with the likelihood of graphical
models
=> one way to train a system
Everything can now be modelled
=> potential to be Bayesian
Potential to learn α
Learning
EM / Laplace approximation / MCMC: either intractable or too slow
Conjugate gradients: flexible, easy to check, BUT sensitive to initialisation and slow
Variational inference
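A sketch of the conjugate-gradient route, using scipy.optimize.minimize with method="CG" on a hybrid log-objective assumed to take θ and θ' as flat vectors; as noted above, this is sensitive to the starting point:

    import numpy as np
    from scipy.optimize import minimize

    def fit_hybrid_cg(log_objective, theta0, theta_prime0):
        # Maximise the hybrid log-objective by conjugate gradients,
        # treating [theta, theta'] as a single flat parameter vector.
        x0 = np.concatenate([np.ravel(theta0), np.ravel(theta_prime0)])
        split = np.ravel(theta0).size

        def neg(x):
            return -log_objective(x[:split], x[split:])

        # Gradient is approximated numerically unless jac is supplied.
        res = minimize(neg, x0, method="CG")
        return res.x[:split], res.x[split:], res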
Content
Generative and discriminative methods
A principled hybrid framework
• Study of the properties on a toy example
• Influence of the amount of labelled data
Toy example
2 elongated distributions
Only spherical Gaussians allowed => wrong model
2 labelled points per class => strong risk of overfitting
Toy example
Decision boundaries
Content
Generative and discriminative methods
A principled hybrid framework
• Study of the properties on a toy example
• Influence of the amount of labelled data
A real example
Images are a special case, as each image contains several features
2 levels of supervision: at the image level and at the feature level
• Image label only => weakly labelled
• Image label + segmentation => fully labelled
The underlying generative model
[Graphical model figure: two multinomial nodes and one Gaussian node; shown for both the weakly and the fully labelled cases]
Experimental set-up
3 classes: bikes, cows, sheep
1 Gaussian per class => poor generative model
75 training images for each category
HF framework
HF versus CC
Results
When increasing the proportion of fully labelled data, the trend is:
generative → hybrid → discriminative
Weakly labelled data has little influence on
the trend
With sufficient fully labelled data, HF tends
to perform better than CC
Experimental set-up
3 classes: lions, tigers and cheetahs
1 Gaussian per class => poor generative model
75 training images for each category
HF framework
HF versus CC
Results
Hybrid models consistently perform better
However, generative and discriminative
models haven’t reached saturation
No clear difference between HF and CC
Conclusion
Principled hybrid framework
Possibility to learn the best trade-off
Helps for ambiguous datasets when labelled
data is scarce
Problem of optimisation
Future avenues
Bayesian version (posterior distribution of α) under study
Replace σ by a diagonal matrix Σ to allow more flexibility => requires the Bayesian version
Choice of priors
Thank you!