STRATEGIES FOR VISUAL RECOGNITION
Donald Geman
Dept. of Applied Mathematics and Statistics, Center for Imaging Science, Johns Hopkins University

Outline
- General orientation within imaging
- Semantic scene interpretation
- Three paradigms with examples: generative, predictive, hierarchical
- Critique and conclusions

Sensors to Images
Constructing images from measured data.
Examples: ordinary visible-light cameras; computed tomography (CT, SPECT, PET, MRI); ultrasound, molecular imaging, etc.
Mathematical tools: harmonic analysis; partial differential equations; Poisson processes.

Images to Images/Surfaces
Transforming images into more compact or informative data structures.
Examples: restoration (de-noising, de-blurring, inpainting); compression; shape-from-shading.
Mathematical tools: harmonic analysis; regularization theory and variational methods; Bayesian inference, graphical models, MCMC.

Images to Words
Semantic and structural interpretations of images.
Examples: selective attention and figure/ground separation; object detection and classification; scene categorization.
Mathematical tools: distributions on grammars, graphs and transformations; computational learning theory; shape spaces; geometric and algebraic invariants.

Semantic Scene Interpretation
Understanding how brains interpret sensory data, or how computers might do so, is a major challenge. Here: one greyscale image, so no cues from color, motion or depth, although these are likely crucial to biological learning. There is an objective reality $Y(I)$, at least at the level of key words.

Dreaming
A description machine $\hat{Y} : \mathcal{I} \to \mathcal{Y}$ from images $I \in \mathcal{I}$ to descriptions $Y \in \mathcal{Y}$ of an underlying scene. Better yet: a sequence of increasingly fine interpretations $Y = (Y_1, Y_2, \dots)$, perhaps "nested."

More Dreaming
- ACCURACY: $\hat{Y}(I) = Y(I)$ for most images.
- LEARNING: there is an explicit set of instructions for building $\hat{Y}$ involving samples from a learning set $\mathcal{L} = \{(I_1, Y_1), \dots, (I_n, Y_n)\}$.
- EXECUTION: there is an explicit set of instructions for evaluating $\hat{Y}(I)$ with as little computation as possible.
- ANALYSIS: there is "supporting theory" which guides construction and predicts performance.

Illustrative image slides: Detecting Boats; Where Are the Faces? Whose?; Within-Class Variability; How Many Samples Are Necessary?; Recognizing Context; Many Levels of Description.

Confounding Factors
- Local (but not global) ambiguity.
- Complexity: there are so many things to look for!
- Arbitrary views and lighting.
- Clutter: the alternative hypothesis is not "white noise."
- Knowledge: somehow quantify the domination of clutter, the invariance of object names under transformations, and the regularity of the physical world.

Confounding Factors (cont.)
- Scene interpretation is an infinite-dimensional classification problem.
- Is segmentation/grouping performed before, during or after recognition?
- No advances in computers or statistical learning will overcome the small-sample dilemma.
- Some organizational framework is unavoidable.

Small-Sample Computational Learning
$\mathcal{L} = \{(x_1, y_1), \dots, (x_n, y_n)\}$: training set for inductive learning; $x_i \in \mathcal{X}$ is a measurement or feature vector and $y_i \in \mathcal{Y}$ is the true label or explanation of $x_i$. (A minimal code sketch of this setup appears after the next slide.)
Examples:
- $\mathcal{X}$: acoustic speech signals; $\mathcal{Y}$: transcriptions into words.
- $\mathcal{X}$: natural images; $\mathcal{Y}$: semantic descriptions.
Common property: $n$ is very small relative to the effective dimensions.

Three Paradigms
- Generative: centered on a joint statistical model for features X and interpretations Y.
- Predictive: proceed (almost) directly from data to decision boundaries.
- Hierarchical: exploit shared features among objects and interpretations.
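As a concrete anchor for the setup on the Small-Sample Computational Learning slide, here is a minimal sketch, in Python, of the inductive-learning interface: a training set $\mathcal{L}$ of (feature vector, label) pairs and a classifier induced from it. The code is illustrative and not from the talk; the 1-nearest-neighbour rule merely stands in for whichever learner is used.

```python
# Minimal sketch of the inductive-learning setup (illustrative, not from the talk):
# a training set L = [(x_1, y_1), ..., (x_n, y_n)] and an induced classifier f.
from typing import Callable, List, Tuple
import numpy as np

FeatureVector = np.ndarray                      # x in X, e.g. edge or wavelet responses
Label = int                                     # y in Y, one of finitely many classes
TrainingSet = List[Tuple[FeatureVector, Label]]

def induce_classifier(L: TrainingSet) -> Callable[[FeatureVector], Label]:
    """Stand-in learner: the 1-nearest-neighbour rule."""
    xs = np.stack([x for x, _ in L])
    ys = np.array([y for _, y in L])
    def f(x: FeatureVector) -> Label:
        return int(ys[np.argmin(np.linalg.norm(xs - x, axis=1))])
    return f
```

The small-sample dilemma is that n stays tiny relative to the effective dimension of X, whatever learner is plugged in here.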
Generative Modeling
The world is very special: not all explanations and observations are equally likely. Capture regularities with stochastic models. Learning and decision-making are based on P(X, Y), derived from:
- a distribution P(Y) on interpretations, accounting for a priori knowledge and expectations;
- a conditional data model P(X | Y), accounting for visual appearance.
Inference principle: given X, choose the interpretation Y which maximizes P(Y | X).

Generative Modeling: Examples
- Deformable templates: a prior on transformations and a template-plus-noise data model.
- Hidden Markov models.
- Probabilities on grammars and production rules.
- Graphical models, e.g., Bayesian networks, LDA, etc.
- Gaussian mixtures.

Gaussian Part/Appearance Model
- Y: shape class with prior p(y).
- Z: locations of object "parts."
- X = X(I): features whose components capture local topography (interest points, edges, wavelets).
Compound Gaussian model:
- p(z | y): multivariate normal with mean m(y) and covariance C(y);
- p(x | z, y): multivariate normal with mean m(z, y) and covariance C(z, y).
Estimate Y via $\arg\max_{z,y} p(z, y \mid x) = \arg\max_{z,y} p(x \mid z, y)\, p(z \mid y)\, p(y)$. (A code sketch of this MAP estimate appears at the end of this part.)

Generative Modeling: Critique
In principle, a very general framework. In practice:
- it is diabolically hard to model P(Y);
- computation with P(Y | X) is intensive;
- P(X | Y) alone amounts to "templates-for-everything," which lacks power and requires infinite computation.

Predictive Learning
Do not solve a more difficult problem than is necessary; ultimately only a decision boundary is needed. Representation and learning:
- Replace I by a fixed-length feature vector X.
- Quantize Y to a finite number of classes 1, 2, ..., C.
- Specify a family F of "classifiers" f(X).
- Induce f(X) directly from a training set $\mathcal{L}$.
This often does require some modeling.

Predictive Learning: Examples
Examples which, in effect, learn P(Y | X) directly and apply the Bayes rule:
- artificial neural networks;
- k-NN with smart metrics (e.g., "shape context");
- decision trees;
- support vector machines (interpreted as the Bayes rule via logistic regression);
- multiple classifiers (e.g., random forests).

Support Vector Machines
Let $\mathcal{L} = \{(x_1, y_1), \dots, (x_n, y_n)\}$ be a training set generated i.i.d. according to P(X, Y). The separating hyperplane is $w^T x + b = 0$, with margin hyperplanes $w^T x + b = \pm 1$; maximizing the margin $2 / \|w\|$ gives the primal problem
$$\min_{w,b}\ \tfrac{1}{2} w^T w \quad \text{s.t.}\quad y_i (w^T x_i + b) - 1 \ge 0 \ \ \forall i,$$
with dual
$$\max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{s.t.}\quad \alpha_i \ge 0 \ \forall i \ \ \text{and} \ \ \sum_i \alpha_i y_i = 0.$$

SVM (cont.)
The classification function is
$$f(x) = b + w^T x = b + \sum_{i:\, y_i = 1} \alpha_i \langle x_i, x \rangle - \sum_{i:\, y_i = -1} \alpha_i \langle x_i, x \rangle.$$
Data in the input space are mapped by $\varphi$ into a higher-dimensional space where linear separability holds.

SVM (cont.)
The optimization problem and the classification function are similar to the linear case:
$$\max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle \varphi(x_i), \varphi(x_j) \rangle \quad \text{s.t.}\quad \alpha_i \ge 0 \ \forall i \ \ \text{and} \ \ \sum_i \alpha_i y_i = 0,$$
$$f(x) = b + \sum_{i:\, y_i = 1} \alpha_i \langle \varphi(x_i), \varphi(x) \rangle - \sum_{i:\, y_i = -1} \alpha_i \langle \varphi(x_i), \varphi(x) \rangle.$$
The scalar product $\langle \varphi(x_i), \varphi(x) \rangle$ is replaced by a kernel $K(x_i, x)$. (A code sketch of the kernelized decision function appears at the end of this part.)

Predictive Learning: Critique
In principle, universal learning machines which could mimic natural processes and "learn" invariance from enough examples. In practice, they lack a global organizing principle to confront:
- a very large number of classes (say 30,000);
- the small-sample dilemma;
- the complexity of clutter;
- excessive computation.
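The MAP estimate in the Gaussian part/appearance model above (referenced there) can be sketched as follows. This is only a sketch under stated assumptions: the class prior, the Gaussian parameters (prior, m_z, C_z and the functions m_x, C_x) and the finite list z_candidates of candidate part locations are hypothetical placeholders; a real system would estimate them from data and search over z far more cleverly.

```python
# Sketch of MAP estimation in the compound Gaussian part/appearance model:
# maximize log p(y) + log p(z|y) + log p(x|z,y) over shape classes y and
# candidate part locations z.  All model inputs here are assumed placeholders.
import numpy as np
from scipy.stats import multivariate_normal as mvn

def map_estimate(x, classes, prior, m_z, C_z, m_x, C_x, z_candidates):
    """Return the (y, z) maximizing p(x|z,y) p(z|y) p(y)."""
    best, best_score = None, -np.inf
    for y in classes:
        for z in z_candidates[y]:                                     # crude search over part locations
            score = (np.log(prior[y])                                 # p(y)
                     + mvn.logpdf(z, mean=m_z[y], cov=C_z[y])         # p(z | y)
                     + mvn.logpdf(x, mean=m_x(z, y), cov=C_x(z, y)))  # p(x | z, y)
            if score > best_score:
                best, best_score = (y, z), score
    return best
```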
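The kernelized decision function from the SVM slides, $f(x) = b + \sum_i \alpha_i y_i K(x_i, x)$, can be sketched as below. The dual is solved with scikit-learn on toy data and the decision function is then re-evaluated directly from the dual coefficients and support vectors; the data, the RBF kernel and the parameter values (gamma, C) are illustrative choices, not from the talk.

```python
# Sketch of an SVM with the kernel trick: solve the dual with scikit-learn, then
# evaluate f(x) = b + sum_i (alpha_i y_i) K(x_i, x) over the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)      # not linearly separable in the input space

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

def rbf(u, v, gamma=0.5):
    """K(u, v) = <phi(u), phi(v)> for the Gaussian (RBF) kernel."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

def f(x):
    """b + sum over support vectors of (alpha_i y_i) K(x_i, x)."""
    return clf.intercept_[0] + sum(
        coef * rbf(sv, x) for coef, sv in zip(clf.dual_coef_[0], clf.support_vectors_))

x0 = np.array([0.3, 0.4])
print(f(x0), clf.decision_function([x0])[0])    # the two evaluations agree
```

Only the inner product changes when moving from the linear to the nonlinear case, exactly as the slides state.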
Hierarchical Modeling
The world is very special: vision is only possible due to its hierarchical organization into common parts and sub-interpretations. Determine common visual structure by:
- clustering images;
- information-theoretic criteria (e.g., mutual information) to select common patches;
- building classifiers (e.g., decision trees or multiclass boosting);
- constructing grammars.

Hierarchical: Examples
- Compositional vision: a "theory of reusable parts."
- Hierarchies of image patches or fragments.
- Algorithmic modeling: a coarse-to-fine (CTF) representation of the computational process.

Hierarchical Indexing
Coarse-to-fine modeling of both the interpretations and the computational process:
- unites representation and processing;
- proceeds from broad scope with low power to narrow scope with high power;
- concentrates processing on ambiguous areas.
There is evidence that coarse information is conveyed earlier than fine information in neural responses to visual stimuli.

Hierarchical Indexing (cont.)
Estimate Y by exploring $X = \{X_A,\ A \in \mathcal{H}\}$, where $\mathcal{H}$ is a hierarchy of nested partitions of $\mathcal{Y}$ and $X_A$ is a binary test for $Y \in A$ vs. $Y \notin A$. The index $D$ is the set of explanations $y \in \mathcal{Y}$ not ruled out by any test:
$$D = \{\, y \in \mathcal{Y} : X_A = 1 \ \ \forall A \ni y \,\}.$$
(A code sketch of this coarse-to-fine search appears at the end of these notes.)

Hierarchical Indexing (cont.)
Figure: a recursive partitioning of Y with four levels and a binary test for each of the 15 cells. (A) Positive tests are shown in black. (B) The index is the union of leaves 3 and 4. (C) The "trace" of the coarse-to-fine search.

When is CTF Optimal?
Let c(A) and p(A) be the cost and power of the test for cell A in $\mathcal{H}$, and let $c^*$ be the cost of a perfect test for a single hypothesis. The mean cost of a sequential testing strategy T is
$$E[C(T)] = \sum_{A} c(A)\, q_A(T) + c^* \, E|D|,$$
where $q_A(T)$ is the probability of performing test $X_A$.
THEOREM (G. Blanchard / D. Geman): CTF is optimal if
$$\frac{c(A)}{p(A)} \le \frac{c(B)}{p(B)} \qquad \forall A,\ \forall B \in C(A),$$
where C(A) denotes the direct children of A in $\mathcal{H}$.

Density of Work
Figure: original image and the spatial concentration of processing.

Modeling vs. Learning: Variations on the Bias-Variance Dilemma
- Reduce variance (dependence on $\mathcal{L}$) by introducing the "right" biases (a priori structure), or by introducing more complexity?
- Is dimensionality a "curse" or a "blessing"?
- Hard-wiring vs. tabula rasa.
- $|\mathcal{L}|$ small vs. large: is there "credit" for learning with small $\mathcal{L}$? Is the interesting limit $|\mathcal{L}| \to \infty$ or $|\mathcal{L}| \to 0$?

Conclusions
- Automatic scene interpretation remains elusive.
- However, there is growing success with particular object categories, e.g., vehicles and faces, and in many industrial applications (e.g., wafer inspection).
- There is no dominant mathematical framework, and the "right" one is unclear.
- There are few theoretical results outside classification.

Naïve Bayes
- Map I to a feature vector X: Boolean edges, wavelet coefficients, interest points.
- Assume the components of X are conditionally independent given Y.
- Learn the marginal distributions under the object and background hypotheses from data.
- Use a uniform prior P(Y).
- Perform a likelihood-ratio test to detect objects against the background (sketched below).
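The naïve-Bayes detector on the last slide reduces, with a uniform prior, to thresholding a log-likelihood ratio at zero. The sketch below assumes binary features already extracted from the image and clips the estimated marginals away from 0 and 1; the clipping constant and the threshold tau are assumptions, not from the talk, and the feature extraction itself is left out.

```python
# Sketch of the naive-Bayes likelihood-ratio detector: binary features assumed
# conditionally independent given object/background, marginals learned from data.
import numpy as np

def fit_marginals(X, eps=1e-3):
    """X: (n_samples, n_features) 0/1 array -> clipped estimates of P(X_j = 1)."""
    return np.clip(X.mean(axis=0), eps, 1 - eps)

def log_likelihood_ratio(x, p_obj, p_bg):
    """log P(x | object) - log P(x | background) under conditional independence."""
    return np.sum(x * np.log(p_obj / p_bg) + (1 - x) * np.log((1 - p_obj) / (1 - p_bg)))

def detect(x, p_obj, p_bg, tau=0.0):
    """With a uniform prior, the MAP rule is a likelihood-ratio test at tau = 0."""
    return log_likelihood_ratio(x, p_obj, p_bg) > tau
```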
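Finally, the coarse-to-fine exploration behind the Hierarchical Indexing slides (referenced there) can be sketched as a recursion over a tree of cells: a cell's binary test is performed only if its parent's test was positive, and the index D collects the hypotheses of every surviving leaf. The Cell structure, the string-valued hypotheses and the zero-argument tests are illustrative assumptions, not the construction used in the talk.

```python
# Sketch of coarse-to-fine testing over a hierarchy of nested cells: prune a whole
# subtree as soon as a coarse test fails, and return the index D of survivors.
from dataclasses import dataclass, field
from typing import Callable, List, Set

@dataclass
class Cell:
    hypotheses: Set[str]                    # the subset A of Y covered by this cell
    test: Callable[[], bool]                # the binary test X_A ("Y in A" not ruled out)
    children: List["Cell"] = field(default_factory=list)

def ctf_index(root: Cell) -> Set[str]:
    """Hypotheses not ruled out by any performed test along their branch."""
    if not root.test():                     # coarse test fails: rule out the whole subtree
        return set()
    if not root.children:                   # surviving leaf: its hypotheses join the index
        return set(root.hypotheses)
    return set().union(*(ctf_index(child) for child in root.children))
```

Processing then concentrates exactly where tests keep passing, matching the "Density of Work" picture above.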