
Modeling Latent Variable Uncertainty
for Loss-based Learning
M. Pawan Kumar
École Centrale Paris
École des Ponts ParisTech
INRIA Saclay, Île-de-France
Ben Packer
Stanford University
Daphne Koller
Stanford University
Aim
Accurate learning with weakly supervised data
Train Input xi
Output yi
Input x
Classes: Bison, Deer, Elephant, Giraffe, Llama, Rhino
Output y = “Deer”
Latent Variable h
Object Detection
Aim
Accurate learning with weakly supervised data
Input x
Feature Ψ(x,y,h) (e.g. HOG)
Function f : Ψ(x,y,h) → (-∞, +∞)
Output y = “Deer”
Latent Variable h
Prediction
(y(f), h(f)) = argmax_{y,h} f(Ψ(x,y,h))
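The joint prediction rule above can be sketched as an exhaustive search over (y, h) for a linear score; the names `predict`, `psi`, and the label/latent sets are illustrative, not from the talk.

```python
import numpy as np

def predict(w, psi, labels, latent_vals):
    """Joint prediction (y(f), h(f)) = argmax_{y,h} w^T psi(x, y, h).

    `psi` is a hypothetical joint feature map psi(y, h) -> np.ndarray
    (the input x is assumed fixed inside the closure).
    """
    best, best_score = None, -np.inf
    for y in labels:
        for h in latent_vals:
            score = float(w @ psi(y, h))
            if score > best_score:
                best, best_score = (y, h), score
    return best, best_score
```

In practice the argmax is computed with a structured solver rather than brute force, but the search semantics are the same.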
Aim
Accurate learning with weakly supervised data
Input x
Feature Ψ(x,y,h) (e.g. HOG)
Function f : Ψ(x,y,h) → (-∞, +∞)
Output y = “Deer”
Latent Variable h
Learning
f* = argmin_f Objective(f)
Aim
Find a suitable objective function to learn f*
Input x
Feature Ψ(x,y,h) (e.g. HOG)
Function f : Ψ(x,y,h) → (-∞, +∞)
Encourages accurate prediction
User-specified criterion for accuracy
Learning
Output y = “Deer”
Latent Variable h
f* = argmin_f Objective(f)
Outline
• Previous Methods
• Our Framework
• Optimization
• Results
• Ongoing and Future Work
Latent SVM
Linear function parameterized by w
Prediction (y(w), h(w)) = argmax_{y,h} wᵀΨ(x,y,h)
Learning
min_w Σi Δ(yi, yi(w), hi(w))   User-defined loss
✔ Loss based learning
✖ Loss independent of true (unknown) latent variable
✖ Doesn’t model uncertainty in latent variables
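As a sketch, the Latent SVM empirical risk above can be written as below; the helper names (`latent_svm_risk`, `psi`, `delta`) are hypothetical. Note that `delta` never receives the true latent variable, which is exactly the limitation flagged above.

```python
import numpy as np

def latent_svm_risk(w, data, psi, labels, latent_vals, delta):
    """Empirical risk sum_i Delta(y_i, y_i(w), h_i(w)) for a linear latent model.

    The loss `delta(y_true, y_pred, h_pred)` compares only the annotated
    output y_i with the prediction; the true h is never seen.
    """
    total = 0.0
    for x, y_true in data:
        # (y_i(w), h_i(w)) = argmax_{y,h} w^T psi(x, y, h)
        y_pred, h_pred = max(
            ((y, h) for y in labels for h in latent_vals),
            key=lambda yh: float(w @ psi(x, yh[0], yh[1])),
        )
        total += delta(y_true, y_pred, h_pred)
    return total
```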
Expectation Maximization
Joint probability Pθ(y,h|x) = exp(θᵀΨ(x,y,h)) / Z
Prediction (y(θ), h(θ)) = argmax_{y,h} Pθ(y,h|x)
Expectation Maximization
Joint probability Pθ(y,h|x) = exp(θᵀΨ(x,y,h)) / Z
Prediction (y(θ), h(θ)) = argmax_{y,h} θᵀΨ(x,y,h)
Learning
max_θ Σi log Pθ(yi|xi)
Expectation Maximization
Joint probability Pθ(y,h|x) = exp(θᵀΨ(x,y,h)) / Z
Prediction (y(θ), h(θ)) = argmax_{y,h} θᵀΨ(x,y,h)
Learning
max_θ Σi log Σ_hi Pθ(yi,hi|xi)
✔ Models uncertainty in latent variables
✖ Doesn’t model accuracy of latent variable prediction
✖ No user-defined loss function
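A minimal sketch of the EM learning objective above (the marginal log-likelihood of the annotated outputs under the log-linear model); function and argument names are assumptions for illustration.

```python
import numpy as np

def log_marginal_likelihood(theta, data, psi, labels, latent_vals):
    """sum_i log sum_{h} P_theta(y_i, h | x_i) for the model
    P_theta(y, h | x) = exp(theta^T psi(x, y, h)) / Z(x)."""
    total = 0.0
    for x, y in data:
        # score table over all (y, h) pairs
        scores = np.array([[float(theta @ psi(x, yy, h)) for h in latent_vals]
                           for yy in labels])
        log_Z = np.logaddexp.reduce(scores.ravel())          # log partition
        y_idx = labels.index(y)
        # log sum_h exp(score(y_i, h)) - log Z
        total += np.logaddexp.reduce(scores[y_idx]) - log_Z
    return total
```

EM ascends this objective by alternating a posterior (E) step and a parameter (M) step; here only the objective itself is shown.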
Outline
• Previous Methods
• Our Framework
• Optimization
• Results
• Ongoing and Future Work
Problem
Model Uncertainty in Latent Variables
Model Accuracy of Latent Variable Predictions
Solution
Use two different distributions for the two different tasks
Model Uncertainty in Latent Variables
Model Accuracy of Latent Variable Predictions
Solution
Use two different distributions for the two different tasks
Pθ(hi|yi,xi)
hi
Model Accuracy of Latent Variable Predictions
Solution
Use two different distributions for the two different tasks
Pθ(hi|yi,xi)
hi
Pw(yi,hi|xi)
(yi(w),hi(w))
(yi,hi)
The Ideal Case
No latent variable uncertainty, correct prediction
Pθ(hi|yi,xi)
hi
Pw(yi,hi|xi)
(yi(w),hi(w))
(yi,hi)
The Ideal Case
No latent variable uncertainty, correct prediction
Pθ(hi|yi,xi)
hi(w)
hi
(yi(w),hi(w))
(yi,hi)
Pw(yi,hi|xi)
The Ideal Case
No latent variable uncertainty, correct prediction
Pθ(hi|yi,xi)
hi(w)
hi
(yi,hi(w))
(yi,hi)
Pw(yi,hi|xi)
In Practice
Restrictions in the representation power of models
Pθ(hi|yi,xi)
hi
Pw(yi,hi|xi)
(yi(w),hi(w))
(yi,hi)
Our Framework
Minimize the dissimilarity between the two distributions
Pθ(hi|yi,xi)
hi
User-defined dissimilarity measure
Pw(yi,hi|xi)
(yi(w),hi(w))
(yi,hi)
Our Framework
Minimize Rao’s Dissimilarity Coefficient
Pθ(hi|yi,xi)
hi
Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)
Pw(yi,hi|xi)
(yi(w),hi(w))
(yi,hi)
Our Framework
Minimize Rao’s Dissimilarity Coefficient
Pθ(hi|yi,xi)
hi
Hi(w,θ) - β Σ_{h,h'} Δ(yi,h,yi,h') Pθ(h|yi,xi) Pθ(h'|yi,xi)
Pw(yi,hi|xi)
(yi(w),hi(w))
(yi,hi)
Our Framework
Minimize Rao’s Dissimilarity Coefficient
Pθ(hi|yi,xi)
hi
Hi(w,θ) - β Hi(θ,θ) - (1-β) Δ(yi(w),hi(w),yi(w),hi(w))
(the last term vanishes: Δ of a prediction with itself is 0)
Pw(yi,hi|xi)
(yi(w),hi(w))
(yi,hi)
Our Framework
Minimize Rao’s Dissimilarity Coefficient
Pθ(hi|yi,xi)
hi
min_{w,θ} Σi [Hi(w,θ) - β Hi(θ,θ)]
Pw(yi,hi|xi)
(yi(w),hi(w))
(yi,hi)
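One term of the objective above can be sketched as follows, assuming the conditional Pθ(h|yi,xi) and the loss values are given as arrays; all names are illustrative.

```python
import numpy as np

def dissimilarity_objective(p, delta_pred, delta_pair, beta):
    """One term H_i(w,theta) - beta * H_i(theta,theta) of the objective.

    p          : P_theta(h | y_i, x_i) as a length-H probability vector
    delta_pred : delta_pred[h]     = Delta(y_i, h, y_i(w), h_i(w))
    delta_pair : delta_pair[h, h'] = Delta(y_i, h, y_i, h')
    """
    H_w_theta = float(p @ delta_pred)          # expected loss vs. the prediction
    H_theta_theta = float(p @ delta_pair @ p)  # expected self-dissimilarity
    return H_w_theta - beta * H_theta_theta
```

The self-dissimilarity term rewards a Pθ that concentrates its mass on mutually similar latent values, while the first term ties it to the prediction (yi(w), hi(w)).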
Outline
• Previous Methods
• Our Framework
• Optimization
• Results
• Ongoing and Future Work
Optimization
min_{w,θ} Σi [Hi(w,θ) - β Hi(θ,θ)]
Initialize the parameters to w0 and θ0
Repeat until convergence
Fix w and optimize θ
Fix θ and optimize w
End
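The alternating scheme above, sketched generically; `step_theta` and `step_w` are assumed to return (approximate) minimizers of their respective blocks.

```python
def alternate_minimize(w0, theta0, step_theta, step_w, max_iters=100, tol=1e-6):
    """Block-coordinate descent on J(w, theta): update theta with w fixed,
    then w with theta fixed, until the iterates stop moving."""
    w, theta = w0, theta0
    for _ in range(max_iters):
        theta_new = step_theta(w, theta)   # fix w, optimize theta
        w_new = step_w(w, theta_new)       # fix theta, optimize w
        if abs(w_new - w) + abs(theta_new - theta) < tol:
            return w_new, theta_new
        w, theta = w_new, theta_new
    return w, theta
```

For example, for the scalar J(w,θ) = (w-θ)² + θ², the exact block minimizers are θ = w/2 and w = θ, and the iterates converge to the global minimum (0, 0).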
Optimization of θ
min_θ Σi [Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) - β Hi(θ,θ)]
Pθ(hi|yi,xi)
hi(w)
Case I: yi(w) = yi
hi
Optimization of θ
min_θ Σi [Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) - β Hi(θ,θ)]
Pθ(hi|yi,xi)
hi(w)
Case I: yi(w) = yi
hi
Optimization of θ
min_θ Σi [Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) - β Hi(θ,θ)]
Pθ(hi|yi,xi)
hi
Case II: yi(w) ≠ yi
Optimization of θ
min_θ Σi [Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) - β Hi(θ,θ)]
Stochastic subgradient descent
Pθ(hi|yi,xi)
hi
Case II: yi(w) ≠ yi
Optimization of w
min_w Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)
Expected loss, models uncertainty
Form of optimization similar to Latent SVM
Concave-Convex Procedure (CCCP)
Observation: When Δ is independent of true h,
our framework is equivalent to Latent SVM
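The expected-loss term in the w-step, as a sketch; when Δ does not depend on the true h, the expectation is constant in Pθ, which is the Latent SVM equivalence observed above. Names are illustrative.

```python
import numpy as np

def expected_loss(p, delta):
    """Expected loss sum_h Delta(y_i, h, y_i(w), h_i(w)) P_theta(h | y_i, x_i).

    p     : distribution over the true latent variable h
    delta : delta[h] = Delta(y_i, h, y_i(w), h_i(w)) for the current prediction
    """
    return float(p @ np.asarray(delta))
```

If `delta` is constant across h, `expected_loss` returns that constant for every distribution `p`, so minimizing over w reduces to the Latent SVM objective.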
Outline
• Previous Methods
• Our Framework
• Optimization
• Results
• Ongoing and Future Work
Object Detection
Train Input xi
Output yi
Input x
Classes: Bison, Deer, Elephant, Giraffe, Llama, Rhino
Output y = “Deer”
Latent Variable h
Mammals Dataset
60/40 Train/Test Split
5 Folds
Results – 0/1 Loss
Statistically Significant
[Bar chart: average test loss on each of Folds 1–5, LSVM vs. Our method (y-axis 0–0.9)]
Results – Overlap Loss
[Bar chart: average test loss on each of Folds 1–5, LSVM vs. Our method (y-axis 0–0.6)]
Action Detection
Train Input xi
Output yi
Input x
Classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking
Output y = “Using Computer”
Latent Variable h
PASCAL VOC 2011
60/40 Train/Test Split
5 Folds
Results – 0/1 Loss
Statistically Significant
[Bar chart: average test loss on each of Folds 1–5, LSVM vs. Our method (y-axis 0–1.2)]
Results – Overlap Loss
Statistically Significant
[Bar chart: average test loss on each of Folds 1–5, LSVM vs. Our method (y-axis 0.63–0.74)]
Outline
• Previous Methods
• Our Framework
• Optimization
• Results
• Ongoing and Future Work
Slides Deleted !!!