

Decision Forests
for computer vision and medical image analysis
A. Criminisi, J. Shotton and E. Konukoglu
http://research.microsoft.com/projects/decisionforests
Material: book, code, slides, videos, exercises
Book: Decision Forests for Computer Vision and Medical Image Analysis. A. Criminisi and J. Shotton. Springer, 2013.
Code: The Microsoft Research Cambridge Sherwood Software Library. http://research.microsoft.com/projects/decisionforests/
What can decision forests do? Tasks:
Classification forests
Density forests
Regression forests
Manifold forests
Decision trees and decision forests
A decision tree
[Figure: a general tree structure with a root node, internal (split) nodes and terminal (leaf) nodes, numbered 0 to 14. Each split node asks a question about the input image, e.g. "Is top part blue?", "Is bottom part blue?", "Is bottom part green?", and routes the image along its true/false branch; the leaves label the image as indoor or outdoor.]
A forest is an ensemble of trees. The trees are all slightly different from one another.
[Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9:1545-1588, 1997] [L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001]
Decision tree testing (runtime)
[Figure: input data in feature space and an input test point; at runtime the test point is split at each node it visits and a prediction is made at the leaf it reaches.]
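A minimal sketch (not the Sherwood library API) of how a test point descends a binary tree: at each split node an axis-aligned weak learner compares one feature response against a threshold, and the leaf reached returns its stored prediction. The `Node` structure and its field names are illustrative assumptions.

```python
import numpy as np

class Node:
    """A binary decision-tree node (illustrative, not the Sherwood API)."""
    def __init__(self, feature=None, threshold=None, left=None, right=None,
                 prediction=None):
        self.feature = feature        # index of the selected feature response
        self.threshold = threshold    # split threshold tau
        self.left = left              # child taken when the test is false
        self.right = right            # child taken when the test is true
        self.prediction = prediction  # e.g. class histogram stored at a leaf

def predict(node, v):
    """Send test point v down the tree and return the leaf prediction."""
    while node.prediction is None:                 # until a leaf is reached
        if v[node.feature] >= node.threshold:      # axis-aligned weak learner
            node = node.right
        else:
            node = node.left
    return node.prediction

# Toy usage: a depth-1 tree splitting on feature 0 at threshold 0.5.
leaf_a = Node(prediction=np.array([0.9, 0.1]))
leaf_b = Node(prediction=np.array([0.2, 0.8]))
root = Node(feature=0, threshold=0.5, left=leaf_a, right=leaf_b)
print(predict(root, np.array([0.7, 0.3])))         # -> [0.2 0.8]
```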
Decision tree training (off-line)
[Figure: input training data in feature space.]
How to split the data? Binary tree or ternary? How big a tree? What tree structure?
Decision forest training (off-line)
How many trees? How different? How to fuse their outputs?
The decision forest model
Basic notation
- Input data point: $\mathbf{v} = (x_1, x_2, \dots, x_d) \in \mathbb{R}^d$, a collection of feature responses. Features can be e.g. wavelets, pixel intensities, context.
- Output/label space: categorical or continuous? e.g. $c \in \{c_k\}$ or $\mathbf{y} \in \mathbb{R}^n$.
- Feature response selector: $\phi(\mathbf{v}) \in \mathbb{R}^{d'}$ with $d' \ll d$, selecting a small subset of the $d$ features.
Forest model
- Node test parameters: $\boldsymbol{\theta}_j = (\phi, \psi, \tau)$, the parameters related to each split node: i) which features, ii) what geometric primitive, iii) thresholds.
- Node weak learner: $h(\mathbf{v}, \boldsymbol{\theta}_j) \in \{0, 1\}$, the test function for splitting data at node $j$, e.g. a thresholded feature response.
- Node objective function (training): the "energy" optimized when training the $j$-th split node, e.g. maximizing the information gain $I_j$.
- Leaf predictor model: e.g. $p(c \,|\, \mathbf{v})$; point estimate or full distribution?
- Randomness model (training): how is randomness injected during training, and how much? 1. Bagging, 2. Randomized node optimization, e.g. controlled by $\rho = |\mathcal{T}_j|$.
- Stopping criteria (training): when to stop growing a tree during training, e.g. max tree depth $= D$.
- Forest size: $T$, the total number of trees in the forest.
- The ensemble model: how to compute the forest output from that of the individual trees? e.g. $p(c \,|\, \mathbf{v}) = \frac{1}{T} \sum_{t=1}^{T} p_t(c \,|\, \mathbf{v})$.
Decision forest model: the randomness model
1) Bagging (randomizing the training set)
- $\mathcal{S}_0$: the full training set.
- $\mathcal{S}_0^t \subset \mathcal{S}_0$: the randomly sampled subset of training data made available to tree $t$.
Forest training then grows each tree on its own subset, which also makes training efficient.
Decision forest model: the randomness model
2) Randomized node optimization (RNO)
- $\mathcal{T}$: the full set of all possible node test parameters.
- $\mathcal{T}_j \subset \mathcal{T}$: for each node $j$, the set of randomly sampled candidate parameters.
- $\rho = |\mathcal{T}_j|$: randomness control parameter. For $\rho = |\mathcal{T}|$: no randomness and maximum tree correlation. For $\rho = 1$: maximum randomness and minimum tree correlation.
Node training: $\boldsymbol{\theta}_j^* = \arg\max_{\boldsymbol{\theta} \in \mathcal{T}_j} I_j$, with node weak learner $h(\mathbf{v}, \boldsymbol{\theta}_j)$ and node test parameters $\boldsymbol{\theta}_j$.
The effect of $\rho$: a small value of $\rho$ gives little tree correlation; a large value of $\rho$ gives large tree correlation.
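A minimal sketch of randomized node optimization under the notation above: only $\rho$ randomly sampled candidates $\boldsymbol{\theta}$ = (feature index, threshold) are evaluated at a node, and the one maximizing the information gain for categorical labels is kept. Function names and the axis-aligned parameterization are illustrative assumptions.

```python
import numpy as np

def shannon_entropy(labels, n_classes):
    """H(S) = -sum_c p(c) log p(c) over the empirical class distribution."""
    p = np.bincount(labels, minlength=n_classes) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def information_gain(labels, left_mask, n_classes):
    """I = H(S) - sum_i |S_i|/|S| H(S_i) for a binary split."""
    left, right = labels[left_mask], labels[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    h_before = shannon_entropy(labels, n_classes)
    h_after = (len(left) * shannon_entropy(left, n_classes) +
               len(right) * shannon_entropy(right, n_classes)) / len(labels)
    return h_before - h_after

def train_node_rno(X, labels, n_classes, rho=10, rng=np.random.default_rng(0)):
    """Randomized node optimization: evaluate only rho random candidates
    theta = (feature index, threshold) and keep the best one."""
    best_gain, best_theta = -np.inf, None
    for _ in range(rho):
        d = rng.integers(X.shape[1])                       # random feature
        tau = rng.uniform(X[:, d].min(), X[:, d].max())    # random threshold
        gain = information_gain(labels, X[:, d] < tau, n_classes)
        if gain > best_gain:
            best_gain, best_theta = gain, (d, tau)
    return best_theta, best_gain
```

Increasing `rho` towards the full candidate set reduces the randomness and increases the correlation between trees, as described above.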
Decision forest model: the ensemble model
An example forest to predict continuous variables.
Decision forest model: training and information gain
Shannon's entropy, a measure of information content: for a discrete distribution, $H(\mathcal{S}) = -\sum_{c} p(c) \log p(c)$.
Decision forest model: training and information gain
Information gain (for categorical, non-parametric distributions):
$I = H(\mathcal{S}) - \sum_{i \in \{L,R\}} \frac{|\mathcal{S}^i|}{|\mathcal{S}|} H(\mathcal{S}^i)$, with Shannon's entropy $H(\mathcal{S}) = -\sum_{c} p(c) \log p(c)$.
[Figure: class histograms before the split and after two candidate splits (Split 1, Split 2); node training prefers the split with the higher information gain.]
Decision forest model: training and information gain
Information gain (for continuous, parametric densities):
the same expression $I = H(\mathcal{S}) - \sum_{i \in \{L,R\}} \frac{|\mathcal{S}^i|}{|\mathcal{S}|} H(\mathcal{S}^i)$, now with the differential entropy of a Gaussian, $H(\mathcal{S}) = \frac{1}{2} \log\left((2\pi e)^d \, |\Lambda(\mathcal{S})|\right)$.
[Figure: data distributions before the split and after two candidate splits (Split 1, Split 2); node training prefers the split with the higher information gain.]
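A small sketch of the continuous-case objective above, assuming a full-rank sample covariance and at least two points per child; function names are illustrative.

```python
import numpy as np

def gaussian_entropy(S):
    """Differential entropy of a d-dim Gaussian fitted to the rows of S:
    H = 0.5 * log( (2*pi*e)^d * |Lambda(S)| ). Assumes len(S) >= 2."""
    d = S.shape[1]
    cov = np.atleast_2d(np.cov(S, rowvar=False))
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

def continuous_information_gain(S, left_mask):
    """I = H(S) - sum_i |S_i|/|S| H(S_i) with Gaussian differential entropies."""
    left, right = S[left_mask], S[~left_mask]
    if len(left) < 2 or len(right) < 2:   # not enough points to fit a Gaussian
        return 0.0
    return gaussian_entropy(S) - (len(left) * gaussian_entropy(left) +
                                  len(right) * gaussian_entropy(right)) / len(S)
```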
Background: overfitting and underfitting
Classification forests
Efficient, supervised multi-class classification
[V. Lepetit and P. Fua. Keypoint Recognition Using Randomized Trees. IEEE Trans. PAMI. 2006.]
Classification forest
[Figure: labelled training data in feature space together with unlabelled test points ('?'); classification tree training partitions the space so that each test point can be assigned a class posterior.]
Model specialization for classification
- Input data point: $\mathbf{v} = (x_1, \dots, x_d)$ ($x_i$ is a feature response).
- Output is categorical: $c \in \mathcal{C} = \{c_k\}$ (discrete set).
- Entropy of a discrete distribution: $H(\mathcal{S}) = -\sum_{c \in \mathcal{C}} p(c) \log p(c)$.
- Node weak learner: $h(\mathbf{v}, \boldsymbol{\theta}_j) \in \{0,1\}$ with $\boldsymbol{\theta} = (\phi, \psi, \tau)$.
- Objective function for node $j$: $I_j = H(\mathcal{S}_j) - \sum_{i \in \{L,R\}} \frac{|\mathcal{S}_j^i|}{|\mathcal{S}_j|} H(\mathcal{S}_j^i)$ (information gain).
- Training node $j$: $\boldsymbol{\theta}_j^* = \arg\max_{\boldsymbol{\theta} \in \mathcal{T}_j} I_j$.
- Predictor model: $p_t(c \,|\, \mathbf{v})$ (class posterior).
Classification forest: the weak learner model
Splitting data at node $j$: node weak learner $h(\mathbf{v}, \boldsymbol{\theta}_j) \in \{0,1\}$, node test parameters $\boldsymbol{\theta} = (\phi, \psi, \tau)$.
Examples of weak learners (feature response $\phi(\mathbf{v}) = (x_1, x_2)$ for the 2D example):
- Axis aligned: the split compares a single selected coordinate against a threshold.
- Oriented line: $\boldsymbol{\psi}$ is a generic line in homogeneous coordinates.
- Conic section: $\boldsymbol{\psi}$ is a matrix representing a conic. See Appendix C for the relation with the kernel trick.
In general $\phi$ may select only a very small subset of the features.
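A sketch of the three weak-learner geometries on a 2D feature response $(x_1, x_2)$; the parameterizations (line in homogeneous coordinates, symmetric 3x3 conic matrix) follow the description above, and thresholding at zero is an assumption made for illustration.

```python
import numpy as np

def axis_aligned(v, d, tau):
    """h = [ v[d] > tau ]: split along a single coordinate axis."""
    return v[d] > tau

def oriented_line(v, psi):
    """psi = (a, b, c): a generic line a*x1 + b*x2 + c = 0 in homog. coords."""
    x = np.array([v[0], v[1], 1.0])
    return np.dot(psi, x) > 0

def conic_section(v, Psi):
    """Psi: a symmetric 3x3 matrix representing a conic x^T Psi x = 0."""
    x = np.array([v[0], v[1], 1.0])
    return x @ Psi @ x > 0

# Example: a unit circle centred at the origin as a conic weak learner.
Psi = np.diag([1.0, 1.0, -1.0])
print(conic_section(np.array([0.5, 0.5]), Psi))   # inside  -> False
print(conic_section(np.array([2.0, 0.0]), Psi))   # outside -> True
```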
Classification forest: the prediction model
What do we do at the leaf?
Prediction model: probabilistic. Each leaf stores the empirical class distribution of the training points that reached it and returns it as the posterior $p_t(c \,|\, \mathbf{v})$.
Classification forest: the ensemble model
The ensemble model
Forest output probability: $p(c \,|\, \mathbf{v}) = \frac{1}{T} \sum_{t=1}^{T} p_t(c \,|\, \mathbf{v})$.
[Figure: posteriors of trees $t = 1, 2, 3$ averaged into the forest posterior.]
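A minimal sketch of the ensemble model for classification: the per-tree posteriors are simply averaged. Each "tree" here is any callable returning a class posterior; the constant lambdas in the toy usage are placeholders, not real trained trees.

```python
import numpy as np

def forest_posterior(trees, v):
    """Forest output probability: p(c|v) = 1/T * sum_t p_t(c|v)."""
    posteriors = np.array([tree(v) for tree in trees])
    return posteriors.mean(axis=0)

# Toy usage with three hand-made "trees" (constant posteriors for brevity).
trees = [lambda v: np.array([0.9, 0.1]),
         lambda v: np.array([0.6, 0.4]),
         lambda v: np.array([0.7, 0.3])]
print(forest_posterior(trees, np.array([0.2, 0.8])))   # -> [0.733..., 0.266...]
```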
Classification forest: effect of tree depth
[Figure: training points from a 4-class mixed distribution and forest posteriors for T=200, weak learner = conic, predictor model = prob., at max tree depth D = 3, 6 and 15. Too small a max tree depth D underfits; too large a D overfits.]
Classification forest: max-margin properties
Randomness type = randomized node optimization.
Node parameters $\boldsymbol{\theta} = (\phi, \psi, \tau)$: $\phi$ = which features, $\psi$ = what geometric model, $\tau$ = thresholds. Example: a vertical weak learner.
Training node $j$: $\boldsymbol{\theta}_j^* = \arg\max_{\boldsymbol{\theta} \in \mathcal{T}_j} I_j$.
Classification forest: max-margin properties
Randomness type = randomized node optimization.
[Figure: information gain (between 0 and 1) as a function of the split location; the gain is maximal and constant over the whole gap between the two classes, so the range of max info gain spans the margin.]
Training node $j$: $\boldsymbol{\theta}_j^* = \arg\max_{\boldsymbol{\theta} \in \mathcal{T}_j} I_j$.
Classification forest: max-margin properties
[Figure: two separable classes of training points with a gap between them; the training points bounding the gap act as "support" vectors.]
Demonstrating the max-margin property: within the gap every split location yields the same (maximal) information gain, so each randomized tree places its boundary at a random position inside the gap. The optimum partition is obtained in the limit of many trees, since averaging the tree posteriors places the expected boundary at the centre of the gap; consequently the forest decision boundary tends to the max-margin one. In practice it suffices to use a forest that is "large enough".
Classification forest: max-margin properties
Randomness type = Bagging (only)
With bagging each tree is trained on a different randomly sampled subset of the data, so the range of max info gain, and hence the chosen split location, shifts from tree to tree across the gap.
Training node $j$: $\boldsymbol{\theta}_j^* = \arg\max_{\boldsymbol{\theta} \in \mathcal{T}} I_j$ (all candidate parameters available; randomness comes only from the data).
Classification forest: max-margin – comparing with bagging
[Figure: forest posteriors obtained with randomized node optimization (only) vs. with bagging (50%) combined with RNO.]
Classification forests in practice
Microsoft Kinect for Xbox 360
[J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-Time Human Pose Recognition in Parts from a Single Depth Image. In Proc. IEEE CVPR, June 2011.]
Body tracking in Microsoft Kinect for Xbox 360
[Figure: input depth image and training labelled data with body parts such as left hand, right shoulder, neck and right foot; visual features computed on the depth image feed a classification forest.]
- Input data point: a pixel in the input depth image.
- Labels are categorical: body-part classes.
- Objective function: information gain.
- Visual features / feature response: depth comparisons at offsets around the pixel.
- Node parameters and weak learner: a thresholded depth-comparison feature response.
- Node training: choose the feature and threshold maximizing the information gain (randomized node optimization).
- Predictor model: class posterior over body parts stored at each leaf.
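A hedged sketch of a depth-difference feature response in the spirit of the Kinect paper, where offsets are scaled by the inverse depth at the reference pixel for approximate depth invariance; the array layout, offset handling and background value below are illustrative assumptions, not the production implementation.

```python
import numpy as np

def depth_feature(depth, x, u, v, background=1e6):
    """Depth-difference feature response at pixel x = (row, col):
    f = d(x + u / d(x)) - d(x + v / d(x)),
    with offsets u, v scaled by the inverse depth at x.
    Out-of-image probes return a large background value."""
    def probe(offset):
        r = int(round(x[0] + offset[0] / depth[x[0], x[1]]))
        c = int(round(x[1] + offset[1] / depth[x[0], x[1]]))
        if 0 <= r < depth.shape[0] and 0 <= c < depth.shape[1]:
            return depth[r, c]
        return background
    return probe(u) - probe(v)

# The weak learner at a split node simply thresholds this response: h = [ f > tau ].
```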
Body tracking in Microsoft Kinect for Xbox 360
[Figure: input depth image (background removed) and the inferred body-part posterior.]
Regression forests
Probabilistic, non-linear estimation of continuous parameters
[ L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and Regression Trees. Chapman and Hall/CRC, 1984.]
Regression forest
[Figure: training data with continuous outputs and query points ('?') whose outputs must be predicted.]
Model specialization for regression
- Input data point: $\mathbf{v} = (x_1, \dots, x_d)$.
- Output/labels are continuous: $\mathbf{y} \in \mathbb{R}^n$.
- Node weak learner: $h(\mathbf{v}, \boldsymbol{\theta}_j) \in \{0,1\}$.
- Objective function for node $j$: an information gain defined on continuous variables.
- Training node $j$: $\boldsymbol{\theta}_j^* = \arg\max_{\boldsymbol{\theta} \in \mathcal{T}_j} I_j$.
- Leaf model: a probabilistic estimate $p_t(\mathbf{y} \,|\, \mathbf{v})$.
Regression forest: the node weak learner model
Splitting data at node $j$: node weak learner $h(\mathbf{v}, \boldsymbol{\theta}_j) \in \{0,1\}$, node test parameters $\boldsymbol{\theta} = (\phi, \psi, \tau)$.
Examples of node weak learners (feature response $\phi(\mathbf{v}) = (x_1, x_2)$ for the 2D example):
- Axis aligned: the split compares a single selected coordinate against a threshold.
- Oriented line: $\boldsymbol{\psi}$ is a generic line in homogeneous coordinates.
- Conic section: $\boldsymbol{\psi}$ is a matrix representing a conic. See Appendix C for the relation with the kernel trick.
In general $\phi$ may select only a very small subset of the features.
Regression forest: the predictor model
What do we do at the leaf?
Examples of leaf (predictor) models:
- Constant: a single output value for all points reaching the leaf.
- Polynomial of order $n$ (note: linear for $n = 1$, constant for $n = 0$).
- Probabilistic-linear: at node $j$, a linear fit with an associated Gaussian uncertainty.
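A sketch of a probabilistic-linear leaf model for a 1D input and output: fit a least-squares line to the training points reaching the leaf and keep the residual variance as a constant-variance Gaussian uncertainty. The constant variance is a simplification of the probabilistic line fitting described in Appendix A; function names are illustrative.

```python
import numpy as np

def fit_prob_linear_leaf(x, y):
    """Fit y ~ a*x + b by least squares and estimate the residual variance."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    sigma2 = np.mean((y - (a * x + b)) ** 2)
    return a, b, sigma2

def leaf_posterior(a, b, sigma2, x_query):
    """p(y | x) = N(y; a*x + b, sigma2): mean and variance at the leaf."""
    return a * x_query + b, sigma2

# Toy usage.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.2, 1.9, 3.1])
a, b, s2 = fit_prob_linear_leaf(x, y)
print(leaf_posterior(a, b, s2, 1.5))
```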
Regression forest: objective function
Computing the regression information gain at node $j$: start from the generic information gain $I_j = H(\mathcal{S}_j) - \sum_{i \in \{L,R\}} \frac{|\mathcal{S}_j^i|}{|\mathcal{S}_j|} H(\mathcal{S}_j^i)$ and use the differential entropy of a Gaussian over the continuous output. Our regression information gain therefore rewards splits that reduce the uncertainty of the predicted variable.
Regression forest: the ensemble model
The ensemble model
Forest output probability: $p(\mathbf{y} \,|\, \mathbf{v}) = \frac{1}{T} \sum_{t=1}^{T} p_t(\mathbf{y} \,|\, \mathbf{v})$.
[Figure: posteriors of trees $t = 1, 2, 3$ averaged into the forest posterior.]
Regression forest: probabilistic, non-linear regression
[Figure: training points, the individual tree posteriors and the forest posterior.]
Generalization properties: smooth interpolating behaviour in the gaps between training data; uncertainty increases with distance from the training data.
See Appendix A for probabilistic line fitting.
Parameters: T=400, D=2, weak learner = aligned, leaf model = prob. line
Regression forest: probabilistic, non-linear regression
[Figure: max tree depth D = 1 underfits, max tree depth D = 5 overfits.]
Parameters: T=400, weak learner = aligned, leaf model = prob. line
Density forests
Forests as generative models
Density forest model
Unlabelled input data.
Model specialization for density estimation
- Input data point: $\mathbf{v} = (x_1, \dots, x_d)$.
- Output: a density $p(\mathbf{v}) \geq 0$ with $\int p(\mathbf{v}) \, d\mathbf{v} = 1$.
- Node weak learner: $h(\mathbf{v}, \boldsymbol{\theta}_j) \in \{0,1\}$.
- Objective function for node $j$: an unsupervised information gain.
- Training node $j$: $\boldsymbol{\theta}_j^* = \arg\max_{\boldsymbol{\theta} \in \mathcal{T}_j} I_j$.
- Leaf model: a multivariate Gaussian fitted to the points reaching the leaf.
Density forest model
The objective function
Information gain: $I_j = H(\mathcal{S}_j) - \sum_{i \in \{L,R\}} \frac{|\mathcal{S}_j^i|}{|\mathcal{S}_j|} H(\mathcal{S}_j^i)$, an example entropy measure in the unsupervised setting, with $H(\mathcal{S}) = \frac{1}{2} \log\left((2\pi e)^d \, |\Lambda(\mathcal{S})|\right)$. Valid for a Gaussian model; the differential entropy here is related to the volume of the cluster.
Training and information gain
Information gain (for continuous, parametric densities): as before, $I = H(\mathcal{S}) - \sum_{i \in \{L,R\}} \frac{|\mathcal{S}^i|}{|\mathcal{S}|} H(\mathcal{S}^i)$ with the differential entropy of a Gaussian.
[Figure: unlabelled data before the split and after two candidate splits (Split 1, Split 2); node training prefers the split producing the tighter clusters, i.e. the higher information gain.]
Density forest model
The prediction and ensemble models
Output of tree $t$: a piece-wise Gaussian density $p_t(\mathbf{v}) = \frac{\pi_{l(\mathbf{v})}}{Z_t} \mathcal{N}\!\left(\mathbf{v}; \boldsymbol{\mu}_{l(\mathbf{v})}, \Lambda_{l(\mathbf{v})}\right)$, with $\pi_l$ the fraction of training points reaching leaf $l$ and $Z_t$ the partition function.
The ensemble model: $p(\mathbf{v}) = \frac{1}{T} \sum_{t=1}^{T} p_t(\mathbf{v})$.
[Figure: densities of trees $t = 1, 2, 3$.]
Density forest model
The prediction and ensemble models
Partition function $Z_t$: it normalizes the piece-wise Gaussian density of a single tree, $p_t(\mathbf{v}) = \frac{\pi_{l(\mathbf{v})}}{Z_t} \mathcal{N}\!\left(\mathbf{v}; \boldsymbol{\mu}_{l(\mathbf{v})}, \Lambda_{l(\mathbf{v})}\right)$. It can be computed:
1. via the cumulative multivariate normal, for axis-aligned weak learners;
2. via grid-based numerical integration, in the general case.
The ensemble model then averages the normalized tree densities as before.
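A sketch of the piece-wise Gaussian density of a single tree under the assumptions above: each leaf stores a Gaussian weighted by the fraction of training points that reached it, and the partition function is approximated here by grid-based numerical integration (option 2) in 2D. The `leaves` structure and function names are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(v, mu, cov):
    """Multivariate normal density evaluated at v."""
    d = len(mu)
    diff = v - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def tree_density(v, leaves, Z):
    """p_t(v) = pi_l(v) * N(v; mu_l(v), Lambda_l(v)) / Z_t.
    `leaves` is a list of (region_test, pi, mu, cov); region_test(v) is True
    iff v falls in that leaf's cell of the partition."""
    for region_test, pi, mu, cov in leaves:
        if region_test(v):
            return pi * gaussian_pdf(v, mu, cov) / Z
    return 0.0

def partition_function(leaves, grid_min, grid_max, n=100):
    """Z_t via grid-based numerical integration over a 2D bounding box."""
    xs = np.linspace(grid_min[0], grid_max[0], n)
    ys = np.linspace(grid_min[1], grid_max[1], n)
    cell = (xs[1] - xs[0]) * (ys[1] - ys[0])
    total = 0.0
    for x in xs:
        for y in ys:
            v = np.array([x, y])
            for region_test, pi, mu, cov in leaves:
                if region_test(v):
                    total += pi * gaussian_pdf(v, mu, cov)
                    break
    return total * cell
```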
Density forest: forest building
[Figure: input points, the trained decision trees, their decision boundaries and the estimated Gaussians for increasing max depth D = 2, 3, 4, 5; too large a depth overfits the density. Parameters: T=200, weak learner = aligned, predictor = multivariate Gaussian.]
Density forest: quantitative evaluation
[Figure: randomly sampled points, ground-truth densities, estimated forest densities for D = 3, 4, 5, 6, and the corresponding error curves. Parameters: T=100, weak learner = linear, predictor = Gaussian.]
Density forest: quantitative evaluation
[Figure: a second example with randomly sampled points, ground-truth densities, estimated forest densities for D = 5, 6, and error curves. Parameters: T=100, weak learner = linear, predictor = Gaussian.]
Manifold forests
Non-linear embedding and dimensionality reduction
[M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation. 2003.]
Manifold learning and dimensionality reduction
Unlabelled input data.
Problem statement: given the points $\{\mathbf{v}_i\}_{i=1}^{n}$ with $\mathbf{v}_i \in \mathbb{R}^d$, find a mapping $f: \mathbb{R}^d \to \mathbb{R}^{d'}$ with $d' \ll d$ which respects their distance relationships.
Manifold forest: the model
Manifold learning is achieved here via density forests…
[Figure: unlabelled input data and the different partitions induced by clustering trees 1, 2 and 3.]
Model specialization for non-linear manifold learning
- Input data point: $\mathbf{v} = (x_1, \dots, x_d)$.
- Output is a mapping: $f: \mathbb{R}^d \to \mathbb{R}^{d'}$ with $d' \ll d$.
- Node weak learner: $h(\mathbf{v}, \boldsymbol{\theta}_j) \in \{0,1\}$.
- Objective function for node $j$: an example entropy measure in the unsupervised setting, the information gain with the differential entropy of a fitted Gaussian. Valid for a Gaussian model; the differential entropy here is related to the area of the cluster.
- Training node $j$: $\boldsymbol{\theta}_j^* = \arg\max_{\boldsymbol{\theta} \in \mathcal{T}_j} I_j$.
- Predictor model: each tree's leaves define a clustering of the points, used below to build affinities.
Manifold Forest for non-linear dimensionality reduction
Input points.
1) Using forests to construct the graph
We remember that each tree $t$ defines a clustering of the input points, with $l(\mathbf{v})$ a leaf index and $\mathcal{L}_t$ the set of all leaves; different trees give different clusterings.
The affinity model: for each clustering tree we can compute an $n \times n$ association matrix $W^t$ with entries $W^t_{ij} = e^{-D(\mathbf{v}_i, \mathbf{v}_j)}$ (unnormalized transition probabilities in Markov random walks), where the distance $D$ can be defined in many different ways, e.g.:
- Mahalanobis,
- Gaussian,
- Binary: $D = 0$ if the two points fall in the same leaf, $\infty$ otherwise.
The ensemble model: the affinity matrix for the entire forest is $W = \frac{1}{T} \sum_{t=1}^{T} W^t$.
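A sketch of the binary affinity model under the assumptions above: `leaf_of[t][i]` is assumed to give the leaf index reached by point i in clustering tree t (any clustering forest will do). Points falling in the same leaf get affinity 1, and the forest affinity averages over trees.

```python
import numpy as np

def tree_affinity_binary(leaf_of_t):
    """W^t_ij = 1 if points i and j fall in the same leaf of tree t, else 0
    (the 'binary' distance: D = 0 in the same leaf, infinity otherwise)."""
    leaf_of_t = np.asarray(leaf_of_t)
    return (leaf_of_t[:, None] == leaf_of_t[None, :]).astype(float)

def forest_affinity(leaf_of):
    """Ensemble model: W = 1/T * sum_t W^t over all clustering trees."""
    return np.mean([tree_affinity_binary(l) for l in leaf_of], axis=0)

# Toy usage: 4 points clustered by 2 trees.
leaf_of = [[0, 0, 1, 1],    # tree 1: points 0,1 share a leaf; 2,3 share a leaf
           [0, 1, 1, 1]]    # tree 2: points 1,2,3 share a leaf
print(forest_affinity(leaf_of))
```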
Manifold Forest for non-linear dimensionality reduction
Input points.
2) Spectral dimensionality reduction
A low-dimensional embedding is now found by linear algebra: from the forest affinity matrix $W$, build the normalized graph Laplacian $L$ (an $n \times n$ matrix), compute its eigenvectors with the smallest non-zero eigenvalues, and read off the new coordinates $\mathbf{v}'_1, \mathbf{v}'_2, \dots, \mathbf{v}'_n$ row by row; this defines the final mapping $f$.
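A sketch of the spectral step, assuming the affinity matrix from the previous sketch and a connected affinity graph: form the normalized graph Laplacian, take the eigenvectors with the smallest non-zero eigenvalues, and use their entries as the low-dimensional coordinates (Laplacian eigenmaps).

```python
import numpy as np

def laplacian_eigenmap(W, d_prime=2):
    """Embed n points into d' dimensions from an n x n affinity matrix W.
    L = I - D^{-1/2} W D^{-1/2} is the normalized graph Laplacian."""
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(len(W)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    evals, evecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    # Skip the trivial eigenvector (eigenvalue ~ 0), keep the next d' ones.
    return evecs[:, 1:1 + d_prime]

# Usage: embedding = laplacian_eigenmap(forest_affinity(leaf_of), d_prime=2)
```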
Manifold Forests
[Figure/video: manifold forest embedding of the input points and the effect of increasing T. Parameters: T=50, D=3, weak learner = linear, predictor = Gaussian.]
Conclusion / Summary
- Trees partition the input space into meaningful parts.
- Decision trees are "meta"-classifiers: their modeling power is not bounded.
- Building forests from trees (bagging, randomization) increases the generalization capabilities and prevents overfitting.
- Classification, regression, density estimation, manifold learning: very similar learning procedures, with differences only at the leaves.
Interesting to think about: all of the classifiers considered in the course (i.e. SVM, AdaBoost, Neural Networks, Decision Forests) have infinite modeling capabilities.
- Which one to use (in practice)?
- How to combine them?
- How to transform one into another?