Decision Forests for computer vision and medical image analysis
A. Criminisi, J. Shotton and E. Konukoglu
http://research.microsoft.com/projects/decisionforests

Material: book, code, slides, videos, exercises
Book: Decision Forests in Computer Vision and Medical Image Analysis. A. Criminisi and J. Shotton. Springer, 2013.
Code: The Microsoft Research Cambridge Sherwood Software Library, http://research.microsoft.com/projects/decisionforests/

What can decision forests do?
Tasks: classification forests, regression forests, density forests, manifold forests.

Decision trees and decision forests
A decision tree has a general tree structure: a root node, internal (split) nodes and terminal (leaf) nodes.
[Figure: an example tree that classifies an image as indoor or outdoor by asking "Is the top part blue?", "Is the bottom part blue?", "Is the bottom part green?" at successive split nodes, each branching on true/false.]
A forest is an ensemble of trees. The trees are all slightly different from one another.
[Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9:1545-1588, 1997]
[L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001]

Decision tree testing (runtime)
An input test point in feature space is pushed through the tree: each split node sends the point left or right according to its test, and the leaf it reaches returns the prediction.

Decision tree training (off-line)
Given the input training data in feature space: how should the data be split at each node? Binary or ternary tree? How big a tree? What tree structure?

Decision forest training (off-line)
How many trees? How different should they be? How to fuse their outputs?

The decision forest model
Basic notation
• Input data point v = (x_1, ..., x_d): a collection of feature responses. Features can be e.g. wavelets, pixel intensities, context.
• Output/label space: categorical or continuous, depending on the task.
• Feature response selector φ(v): picks a small subset of the d feature dimensions.

Forest model
Parameters related to each split node j: i) which features, ii) what geometric primitive, iii) thresholds; collected in the node test parameters θ_j.
• Node weak learner h(v, θ_j): the test function for splitting data at node j.
• Node objective function (training): the "energy" optimized when training the j-th split node, e.g. the information gain.
• Leaf predictor model: point estimate or full distribution?
• Randomness model (training): how is randomness injected during training (1. bagging, 2. randomized node optimization), and how much?
• Stopping criteria (training): when to stop growing a tree, e.g. max tree depth D.
• Forest size T: total number of trees in the forest.
• The ensemble model: how to compute the forest output from that of the individual trees.

Decision forest model: the randomness model
1) Bagging (randomizing the training set)
Each tree t is trained on a randomly sampled subset S_t of the full training set S_0. Bagging also makes forest training efficient.
2) Randomized node optimization (RNO)
At each node only a randomly sampled subset T_j of the full set T of all possible node test parameters is made available, and the node weak learner is trained by optimizing the node test parameters over T_j only. The randomness control parameter ρ = |T_j| governs the amount of randomness: ρ = |T| gives no randomness and maximum tree correlation; ρ = 1 gives maximum randomness and minimum tree correlation. The effect of ρ: a small value of ρ yields little tree correlation; a large value of ρ yields large tree correlation.
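To make the randomness model concrete, here is a minimal Python sketch (our own illustration, not code from the Sherwood library) of randomized node optimization for one split node with an axis-aligned weak learner: ρ candidate (feature, threshold) tests are sampled at random and the one with the highest information gain is kept. Function names and the toy data are assumptions made for the example.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a set of categorical labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def train_split_node(X, y, rho, rng):
    """Randomized node optimization for a single split node.

    Samples `rho` axis-aligned candidate tests (feature index, threshold)
    and keeps the one with the highest information gain.
    Returns (best_feature, best_threshold, best_gain).
    """
    n, d = X.shape
    h_parent = entropy(y)
    best = (None, None, -np.inf)
    for _ in range(rho):
        f = rng.integers(d)                               # random feature dimension
        tau = rng.uniform(X[:, f].min(), X[:, f].max())   # random threshold
        left, right = y[X[:, f] < tau], y[X[:, f] >= tau]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = h_parent - (len(left) / n) * entropy(left) \
                        - (len(right) / n) * entropy(right)
        if gain > best[2]:
            best = (f, tau, gain)
    return best

# Toy usage: two Gaussian blobs, two classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(train_split_node(X, y, rho=10, rng=rng))
```

A full tree would recurse on the two child subsets until a stopping criterion (e.g. maximum tree depth D) is met; increasing ρ makes the trees more similar to one another, as described above.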
Decision forest model: the ensemble model
[Figure: an example forest used to predict continuous variables; the individual tree outputs are combined into the forest output.]

Decision forest model: training and information gain
Shannon's entropy is a measure of the "information content" of a distribution. For a categorical, non-parametric distribution over classes c defined by a training set S,
  H(S) = - Σ_c p(c) log p(c).
The information gain of splitting the set S_j at node j into left and right children S_j^L, S_j^R is
  I_j = H(S_j) - Σ_{i ∈ {L,R}} (|S_j^i| / |S_j|) H(S_j^i).
A split that separates the classes well (e.g. Split 2 vs. Split 1 in the slides) lowers the entropy of the children and thus has higher information gain. For continuous, parametric densities the same construction applies with the differential entropy; for a d-variate Gaussian with covariance Λ(S),
  H(S) = 1/2 log( (2πe)^d |Λ(S)| ).

Background: overfitting and underfitting.

Classification forests
Efficient, supervised multi-class classification.
[V. Lepetit and P. Fua. Keypoint recognition using randomized trees. IEEE Trans. PAMI, 2006]

Classification forest
[Figure: labelled training data in feature space and the partitions produced by classification tree training.]

Model specialization for classification
• Input data point v (x_i is a feature response).
• Output is categorical: c ∈ C, a discrete set of classes.
• Entropy of a discrete distribution, as above.
• Node weak learner h(v, θ_j); objective function for node j: the information gain I_j.
• Training node j: maximize I_j over the sampled node parameters.
• Predictor model: the class posterior p_t(c|v) stored at each leaf.

Classification forest: the weak learner model
The data at node j is split by the node weak learner h(v, θ_j) with node test parameters θ_j. Examples of weak learners (feature responses shown for a 2D example):
• Axis-aligned: threshold a single selected feature dimension; in general φ may select only a very small subset of features.
• Oriented line: threshold a generic line in homogeneous coordinates.
• Conic section: threshold a quadratic form given by a matrix representing a conic. See Appendix C for the relation with the kernel trick.

Classification forest: the prediction model
What do we do at the leaf? The prediction model is probabilistic: each leaf stores the empirical class posterior of the training points that reached it.

Classification forest: the ensemble model
The forest output probability is the average of the tree posteriors:
  p(c|v) = (1/T) Σ_{t=1}^T p_t(c|v).

Classification forest: effect of tree depth
[Figure: 4-class mixed training points and forest posteriors for T=200, weak learner = conic, predictor model = prob., with max tree depth D = 3, 6, 15.] As the max tree depth D grows, the model moves from underfitting to overfitting.

Classification forest: max-margin properties
Randomness type: randomized node optimization. Node parameters θ: which features, what geometric model, and the thresholds. With e.g. a vertical weak learner trained on two linearly separable classes, any threshold placed inside the gap between the two "support" training points achieves the maximum information gain, so the range of maximum info gain spans the whole gap. Averaging the posteriors of many randomly trained trees therefore places the forest decision boundary in the middle of the gap, demonstrating the max-margin property; in practice it suffices to use a forest that is "large enough".
With bagging (only), each tree is trained on a different random subset of the data, so the range of maximum info gain varies from tree to tree and the margin is only approximately preserved.

Classification forest: max-margin, comparing with bagging
[Figure: decision boundaries for randomized node optimization (only) vs. bagging (50%) combined with RNO.]
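As a concrete complement to the weak-learner slides above, the sketch below (our own illustration, not Sherwood code) evaluates the three 2D weak-learner families, axis-aligned, oriented line and conic section, on a single data point; each returns the branch (0 or 1) obtained by thresholding the corresponding feature response. The parameter names are assumptions made for the example.

```python
import numpy as np

def axis_aligned(v, f, tau):
    """Axis-aligned weak learner: threshold a single selected feature dimension."""
    return 1 if v[f] > tau else 0

def oriented_line(v, psi, tau):
    """Oriented-line weak learner: threshold a generic 2D line in homogeneous
    coordinates, psi = (a, b, c) representing a*x1 + b*x2 + c."""
    x = np.array([v[0], v[1], 1.0])
    return 1 if psi @ x > tau else 0

def conic(v, Psi, tau):
    """Conic-section weak learner: threshold the quadratic form x^T Psi x,
    with x in homogeneous coordinates and Psi a symmetric 3x3 matrix
    representing a conic."""
    x = np.array([v[0], v[1], 1.0])
    return 1 if x @ Psi @ x > tau else 0

v = np.array([0.5, -1.2])
print(axis_aligned(v, f=0, tau=0.0))
print(oriented_line(v, psi=np.array([1.0, -2.0, 0.3]), tau=0.0))
print(conic(v, Psi=np.diag([1.0, 1.0, -1.0]), tau=0.0))  # unit circle as the conic
```

The richer the weak-learner family (axis-aligned, line, conic), the more flexible each split, at the cost of a larger node parameter space to sample from during randomized node optimization.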
Classification forests in practice: Microsoft Kinect for Xbox 360
[J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from a single depth image. In Proc. IEEE CVPR, June 2011]

Body tracking in Microsoft Kinect for Xbox 360
[Figure: input depth image annotated with body parts such as left hand, right shoulder, neck, right foot.]
A classification forest is trained on labelled depth data:
• Training data: depth images with per-pixel body-part labels; the labels are categorical.
• Input data point: a pixel of a depth image; its feature responses are simple visual (depth-comparison) features.
• Node parameters and weak learner: thresholded feature responses, as in the generic classification forest.
• Objective function and node training: information gain, as before.
• Predictor model: per-leaf class posterior over body parts.
[Figure: input depth image (background removed) and the inferred body-parts posterior.]

Regression forests
Probabilistic, non-linear estimation of continuous parameters.
[L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and Regression Trees. Chapman and Hall/CRC, 1984]

Regression forest
Model specialization for regression:
• Input data point v.
• Output/labels are continuous.
• Node weak learner h(v, θ_j), as in classification.
• Objective function for node j: an information gain defined on continuous variables.
• Training node j: maximize that gain over the sampled node parameters.
• Leaf model: a distribution over the continuous output.

Regression forest: the node weak learner model
The data at node j is split by the node weak learner h(v, θ_j) with node test parameters θ_j. The same families of weak learners as in classification apply (feature responses shown for a 2D example):
• Axis-aligned: threshold a single selected feature; in general φ may select only a very small subset of features.
• Oriented line: threshold a generic line in homogeneous coordinates.
• Conic section: threshold a quadratic form given by a matrix representing a conic. See Appendix C for the relation with the kernel trick.

Regression forest: the predictor model
What do we do at the leaf? Examples of leaf (predictor) models:
• Constant.
• Polynomial (linear for n = 1, constant for n = 0).
• Probabilistic-linear: at each leaf, a linear fit of the output on the data together with its predictive uncertainty.

Regression forest: objective function
The regression information gain at node j follows the same template as in classification, with the differential entropy of a Gaussian fitted to the continuous outputs in place of the discrete entropy: splits that reduce the uncertainty of the children's output distributions have high gain.

Regression forest: the ensemble model
The forest output probability is the average of the tree posteriors:
  p(y|v) = (1/T) Σ_{t=1}^T p_t(y|v).

Regression forest: probabilistic, non-linear regression
[Figure: training points, individual tree posteriors and the forest posterior. Parameters: T=400, D=2, weak learner = aligned, leaf model = prob. line.]
Generalization properties:
- Smooth interpolating behaviour in gaps between the training data.
- Uncertainty increases with distance from the training data.
See Appendix A for probabilistic line fitting.
[Figure/videos: effect of max tree depth with T=400, weak learner = aligned, leaf model = prob. line: D = 1 underfits, D = 5 overfits.]
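The following sketch (illustrative only; function names and toy data are assumptions, not the tutorial's exact formulation) shows how a continuous information gain can drive regression node training in 1D: the differential entropy of a Gaussian fitted to the outputs reduces, up to constants, to the log-variance, and a constant leaf model then predicts the mean output in each child.

```python
import numpy as np

def gaussian_entropy(y):
    """Differential entropy of a 1D Gaussian fitted to the outputs y,
    up to an additive constant: 0.5 * log(variance)."""
    return 0.5 * np.log(np.var(y) + 1e-12)

def regression_gain(x, y, tau):
    """Continuous information gain of thresholding the 1D input x at tau."""
    left, right = y[x < tau], y[x >= tau]
    if len(left) < 2 or len(right) < 2:
        return -np.inf
    return gaussian_entropy(y) \
        - (len(left) / len(y)) * gaussian_entropy(left) \
        - (len(right) / len(y)) * gaussian_entropy(right)

# Toy 1D regression data: a noisy step; the best split should sit near x = 0.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = np.where(x < 0, 0.0, 1.0) + rng.normal(0, 0.1, 200)

candidates = rng.uniform(-1, 1, 20)   # randomly sampled thresholds (RNO)
gains = [regression_gain(x, y, tau) for tau in candidates]
best_tau = candidates[int(np.argmax(gains))]
print("best threshold:", best_tau)

# Constant leaf model: each child predicts the mean of its training outputs.
print("left leaf mean:", y[x < best_tau].mean(),
      "right leaf mean:", y[x >= best_tau].mean())
```

Replacing the constant leaf model with the probabilistic-linear one used in the slides would additionally store a fitted line and its predictive variance at each leaf, which is what produces the smooth, uncertainty-aware forest posteriors shown above.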
Density forests
Forests as generative models.

Density forest model
Model specialization for density estimation:
• Unlabelled input data; input data point v.
• Output: a probability density p(v).
• Node weak learner h(v, θ_j), as before.
• Objective function for node j: an unsupervised information gain; node training as before.
• Leaf model: a multivariate Gaussian fitted to the training points reaching the leaf.

The objective function
An example entropy measure in the unsupervised setting is the differential entropy of a Gaussian fitted to the points reaching a node, H(S) = 1/2 log((2πe)^d |Λ(S)|), so the information gain for continuous, parametric densities is
  I_j = H(S_j) - Σ_{i ∈ {L,R}} (|S_j^i| / |S_j|) H(S_j^i).
This is valid for a Gaussian model; the differential entropy here is related to the volume of the cluster, so a good split (e.g. Split 2 vs. Split 1 in the slides) produces compact, well-separated clusters.

Density forest model: the prediction and ensemble models
Each tree defines a piece-wise Gaussian density. The output of tree t for a point v reaching leaf l(v) is
  p_t(v) = ( π_{l(v)} / Z_t ) N( v ; μ_{l(v)}, Λ_{l(v)} ),
where π_l is the proportion of training points reaching leaf l and Z_t is the tree's partition function, which ensures that p_t integrates to one. The partition function can be computed:
1. via the cumulative multivariate normal distribution, for axis-aligned weak learners;
2. by grid-based numerical integration, in the general case.
The ensemble model averages the trees: p(v) = (1/T) Σ_{t=1}^T p_t(v).

Density forest: forest building
[Figure: input points, decision trees, estimated Gaussians and decision boundaries for increasing max tree depth D = 2, 3, 4, 5; large D overfits. Parameters: T=200, weak learner = aligned, predictor = multivariate Gaussian.]

Density forest: quantitative evaluation
[Figure: ground-truth densities, randomly sampled points, estimated forest densities and error curves for D = 3 to 6. Parameters: T=100, weak learner = linear, predictor = Gaussian.]

Manifold forests
Non-linear embedding and dimensionality reduction.
[M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 2003]

Manifold learning and dimensionality reduction
Problem statement: given unlabelled input points, find a mapping to a lower-dimensional space which respects their distance relationships.

Manifold forest: the model
Manifold learning is achieved here via density forests: each clustering tree partitions the unlabelled input data, and different trees yield different clusterings. Model specialization for non-linear manifold learning:
• Input data point v; the output is a low-dimensional mapping of the input points.
• Node weak learner, objective function for node j and node training: as in density forests, using the unsupervised information gain based on the differential entropy of a Gaussian (valid for a Gaussian model; the differential entropy here is related to the area of the cluster).
• Predictor model: a Gaussian per leaf, as in density forests.

Manifold forest for non-linear dimensionality reduction
1) Using forests to construct the graph
Recall that each tree t defines a clustering of the input points, with l_t(v_i) the leaf index of point v_i and L_t the set of all leaves of tree t. The affinity model: for each clustering tree we can compute an association matrix W^t (unnormalized transition probabilities in a Markov random walk), W^t_ij = exp(-D(v_i, v_j)), where the distance D can be defined in many different ways, e.g.
- Mahalanobis,
- Gaussian,
- binary (D = 0 if the two points reach the same leaf, D = ∞ otherwise).
The ensemble model: the affinity matrix for the entire forest is the average over the trees, W = (1/T) Σ_t W^t.
2) Spectral dimensionality reduction
A low-dimensional embedding is now found by linear algebra: from W build the n x n normalized graph Laplacian L and take the eigenvectors associated with its smallest non-trivial eigenvalues; the final mapping f sends point v_i to the i-th components of those eigenvectors. [Video: effect of increasing T.]

Manifold forests
[Figure/video: embedding of the input points and the effect of increasing T. Parameters: T=50, D=3, weak learner = linear, predictor = Gaussian.]
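A minimal sketch of the two manifold-forest steps, under the simplifying assumption of the binary affinity model: given the leaf index of every point in every clustering tree, build the forest affinity matrix W and embed the points via the normalized graph Laplacian. Function names and the toy leaf assignments are illustrative, not part of the tutorial's code.

```python
import numpy as np

def forest_affinity(leaf_ids):
    """Binary affinity model: leaf_ids is an (n, T) array where leaf_ids[i, t]
    is the leaf reached by point i in clustering tree t.  For each tree,
    W^t_ij = 1 if the two points share a leaf, 0 otherwise; the forest
    affinity is the average over the T trees."""
    n, T = leaf_ids.shape
    W = np.zeros((n, n))
    for t in range(T):
        same_leaf = leaf_ids[:, t][:, None] == leaf_ids[:, t][None, :]
        W += same_leaf.astype(float)
    return W / T

def laplacian_eigenmap(W, d_embed=2):
    """Spectral embedding from the normalized graph Laplacian
    L = I - Lambda^{-1/2} W Lambda^{-1/2}: return the eigenvectors
    associated with the smallest non-trivial eigenvalues."""
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    return eigvecs[:, 1:1 + d_embed]          # skip the trivial eigenvector

# Toy usage: made-up leaf assignments for n = 6 points and T = 3 trees.
leaf_ids = np.array([[0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 1],
                     [1, 0, 2],
                     [1, 0, 2],
                     [1, 0, 2]])
W = forest_affinity(leaf_ids)
print(laplacian_eigenmap(W, d_embed=2))
```

Averaging over more trees smooths the affinity matrix, which is exactly the effect shown in the videos above as T increases; the Mahalanobis or Gaussian affinity models would only change how each per-tree matrix W^t is filled in.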
Conclusion/Summary
• Trees partition the input space into meaningful parts.
• Decision trees are "meta"-classifiers: the modeling power is not bounded.
• Building forests from trees (bagging, randomization) increases the generalization capability and prevents overfitting.
• Classification, regression, density estimation, manifold learning: very similar learning procedures, with differences only at the leaves.
Interesting to think about: all of the classifiers considered in the course (i.e. SVM, AdaBoost, neural networks, decision forests) have infinite modeling capabilities.
• Which one to use in practice?
• How to combine them?
• How to transform one into another?