
STRATEGIES FOR
VISUAL RECOGNITION
Donald GEMAN
Dept. of Applied Mathematics and Statistics
Center for Imaging Science
Johns Hopkins University
Outline
 General Orientation within Imaging
 Semantic Scene Interpretation
 Three Paradigms with Examples
   Generative
   Predictive
   Hierarchical
 Critique and Conclusions
Sensors to Images
 Constructing images from measured data.
 Examples
   Ordinary visible-light cameras
   Computed tomography (CT, SPECT, PET) and MRI
   Ultrasound, molecular imaging, etc.
 Mathematical Tools
   Harmonic analysis
   Partial differential equations
   Poisson processes
Images to Images/Surfaces
 Transform images to more compact or informative
data structures.
 Examples
   Restoration (de-noising, de-blurring, inpainting)
   Compression
   Shape-from-shading
 Mathematical Tools
   Harmonic analysis
   Regularization theory and variational methods
   Bayesian inference, graphical models, MCMC
Image to Words
 Semantic and structural interpretations of images.
 Examples
 Selective attention, figure/ground separation
 Object detection and classification
 Scene categorization
 Mathematical Tools
 Distributions on grammars, graphs,
transformations
 Computational learning theory
 Shape spaces
 Geometric and algebraic invariants
Semantic Scene Interpretation
 Understanding how brains interpret sensory data,
and how computers might do the same, is a major
challenge.
 Here: a single greyscale image; no cues from color,
motion, or depth, although these are likely crucial to
biological learning.
 There is an objective reality Y(I), at least at the level
of key words.
Dreaming
A description machine

    f : I → Y

from an image I to a description Y of an
underlying scene.

Better yet: a sequence of increasingly fine
interpretations Y = (Y1, Y2, ...), perhaps “nested.”
More Dreaming
 ACCURACY: Ŷ(I) = Y(I) for most images.
 LEARNING: There is an explicit set of instructions for building
Ŷ involving samples from a learning set
    L = (I1, Y1), ..., (In, Yn).
 EXECUTION: There is an explicit set of instructions for
evaluating Ŷ(I) with as little computation as possible.
 ANALYSIS: There is “supporting theory” which guides
construction and predicts performance.
Detecting Boats
Where Are the Faces? Whose?
Within Class Variability
How Many Samples are Necessary?
Recognizing Context
Many Levels of Description
Confounding Factors
 Local (but not global) ambiguity
 Complexity: There are so many things to look for!
 Arbitrary views and lighting
 Clutter: Alternative hypothesis is not “white noise”
 Knowledge: Somehow quantify
   Domination of clutter
   Invariance of object names under transforms
   Regularity of the physical world
Confounding Factors (cont)
 Scene interpretation is an infinite-dimensional
classification problem.
 Is segmentation/grouping performed before,
during or after recognition?
 No advances in computers or statistical learning
will overcome the small-sample dilemma.
Some organizational framework is unavoidable.
Small-Sample Computational Learning
L  ( x1 , y1 ),...,( xn , yn ) : Training set for inductive learning
 xi  X : Measurement or feature vector;
 yi Y : True label or explanation of xi.
 Examples:
 X : Acoustic speech signals; Y : transcription into words
 X : Natural images; Y :semantic description
 Common property: n is very small relative to the
effective dimensions
17
Three Paradigms
 Generative: Centered on a joint statistical model
for features X and interpretations Y.
 Predictive: Proceed (almost) directly from data to
decision boundaries.
 Hierarchical: Exploit shared features among
objects and interpretations.
Generative Modeling
 The world is very special: Not all explanations and
observations are equally likely. Capture regularities with
stochastic models.
 Learning and decision-making are based on P(X,Y), derived from
   A distribution P(Y) on interpretations, accounting for a priori
  knowledge and expectations.
   A conditional data model P(X|Y), accounting for visual
  appearance.
 Inference principle: Given X, choose the interpretation Y
which maximizes P(Y|X).
Generative Modeling: Examples
 Deformable templates
   Prior on transformations
   Template-plus-noise data model
 Hidden Markov models
 Probabilities on grammars and production rules
 Graphical models, e.g., Bayesian networks
 LDA, etc.
 Gaussian mixtures
Gaussian Part/Appearance Model
 Y: shape class with prior p(y)
 Z: locations of object “parts”
 X=X(I): features whose components capture local
topography (interest points, edges, wavelets)
 Compound Gaussian model:
   p(z|y): multivariate normal with mean m(y), covariance C(y)
   p(x|z,y): multivariate normal with mean m(z,y), covariance C(z,y)
 Estimate Y as

    arg max_{z,y} p(z,y|x) = arg max_{z,y} p(x|z,y) p(z|y) p(y)
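The MAP estimate above can be sketched in a few lines of numpy. This is a toy illustration, not an implementation from the slides: it assumes two shape classes, two-dimensional part locations, an identity appearance map m(z,y) = z, and a small discrete candidate set for z in place of the full maximization over part locations; all names and numbers are hypothetical.

```python
import numpy as np

def log_gauss(v, mean, cov):
    """Log-density of a multivariate normal, evaluated directly."""
    d = len(mean)
    diff = v - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def map_estimate(x, classes, prior, m_z, C_z, m_x, C_x, z_candidates):
    """arg max over (z, y) of log[p(x|z,y) p(z|y) p(y)], searching z
    over a discrete candidate set (a toy stand-in for the full
    maximization over part locations)."""
    best_y, best_z, best_score = None, None, -np.inf
    for y in classes:
        for z in z_candidates:
            score = (np.log(prior[y])
                     + log_gauss(z, m_z[y], C_z[y])         # p(z|y)
                     + log_gauss(x, m_x(z, y), C_x(z, y)))  # p(x|z,y)
            if score > best_score:
                best_y, best_z, best_score = y, z, score
    return best_y, best_z, best_score

# Toy setup: class 1 has its part near (5,5); features equal locations.
prior = {0: 0.5, 1: 0.5}
m_z = {0: np.zeros(2), 1: np.array([5.0, 5.0])}
C_z = {0: np.eye(2), 1: np.eye(2)}
m_x = lambda z, y: z                  # identity appearance model
C_x = lambda z, y: 0.1 * np.eye(2)
cands = [np.zeros(2), np.array([5.0, 5.0])]
```

In a real detector the candidate locations would come from interest-point or edge features, and the means and covariances would be learned from L.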
Generative Modeling: Critique
 In principle, a very general framework.
 In practice,
   Diabolically hard to model P(Y).
   Intensive computation with P(Y|X).
   P(X|Y) alone amounts to “templates-for-everything,”
  which lacks power and requires infinite computation.
Predictive Learning
 Do not solve a more difficult problem than is
necessary; ultimately only a decision boundary is
needed.
 Representation and learning:
   Replace I by a fixed-length feature vector X.
   Quantize Y to a finite number of classes 1, 2, ..., C.
   Specify a family F of “classifiers” f(X).
   Induce f(X) directly from a training set L.
 Often does require some modeling.
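As a minimal instance of this recipe (fixed-length features, quantized labels, a classifier induced directly from L), here is a plain k-nearest-neighbor rule in numpy; the training data and the parameter k are hypothetical.

```python
import numpy as np

def knn_classify(x, train_x, train_y, k=3):
    """Predict the label of x by majority vote among its k nearest
    training points under Euclidean distance."""
    dists = np.linalg.norm(train_x - x, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

# Toy training set L: two well-separated classes in the plane.
train_x = np.array([[0., 0.], [0., 1.], [1., 0.],
                    [5., 5.], [5., 6.], [6., 5.]])
train_y = np.array([0, 0, 0, 1, 1, 1])
```

Nothing here models P(X,Y); the decision boundary is induced directly from the samples, which is exactly the predictive stance.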
Predictive Learning: Examples
 Examples which, in effect, learn P(Y|X) directly
and apply Bayes rule:
 Artificial neural networks
 k-NN with smart metrics (e.g., “shape context”)
 Decision trees
 Support vector machines (interpretation as Bayes
rule via logistic regression)
 Multiple classifiers (e.g., random forests)
Support Vector Machines
 Let L = (x1, y1), ..., (xn, yn) be a training set
generated i.i.d. according to P(X,Y).
 The separating hyperplane w^T x + b = 0 lies midway
between the margin hyperplanes w^T x + b = +1 and
w^T x + b = −1, which are a distance 2/||w|| apart.
 Maximize the margin:

    min_{w,b}  (1/2) w^T w   s.t.  y_i (w^T x_i + b) − 1 ≥ 0, ∀i

 Equivalently, in dual form:

    max_α  Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩
    s.t.   α_i ≥ 0, ∀i   and   Σ_i α_i y_i = 0
SVM (cont)
 The classification function:

    f(x) = b + w^T x
         = b + Σ_{i: y_i = +1} α_i ⟨x_i, x⟩ − Σ_{i: y_i = −1} α_i ⟨x_i, x⟩

 Data in the input space are mapped by a feature map Φ into a
higher-dimensional space, where linear separability holds.
SVM (cont)
 The optimization problem and the classification
function are similar to the linear case:

    max_α  Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j ⟨Φ(x_i), Φ(x_j)⟩
    s.t.   α_i ≥ 0, ∀i   and   Σ_i α_i y_i = 0

    f(x) = b + Σ_{i: y_i = +1} α_i ⟨Φ(x_i), Φ(x)⟩ − Σ_{i: y_i = −1} α_i ⟨Φ(x_i), Φ(x)⟩

 The scalar product ⟨Φ(x_i), Φ(x)⟩ is replaced by a
kernel K(x_i, x), so Φ need never be computed explicitly.
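The kernelized decision function above can be sketched directly in numpy. This is a toy illustration with hand-set multipliers and support vectors (not obtained by solving the dual), using a Gaussian RBF kernel; all names and numbers are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_decision(x, sv, sv_y, alpha, b, kernel=rbf_kernel):
    """Kernelized SVM decision function
        f(x) = b + sum_i alpha_i y_i K(x_i, x),
    summing over support vectors sv with labels sv_y in {-1, +1};
    the sign of f(x) gives the predicted class."""
    return b + sum(a * y * kernel(xi, x)
                   for a, y, xi in zip(alpha, sv_y, sv))

# Toy setup: one support vector per class, equal multipliers.
sv = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
sv_y = [-1, +1]
alpha = [1.0, 1.0]
b = 0.0
```

Writing f as a single signed sum over support vectors is equivalent to the slide's split into the y_i = +1 and y_i = −1 sums.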
Predictive Learning: Critique
 In principle, universal learning machines which could
mimic natural processes and “learn” invariance from
enough examples.
 In practice, lacks a global organizing principle to confront
 A very large number of classes (say 30,000)
 The small-sample dilemma
 The complexity of clutter
 Excessive computation
Hierarchical Modeling
 The world is very special: Vision is only possible
due to its hierarchical organization into common
parts and sub-interpretations.
 Determine common visual structure by:
   Clustering images;
   Information-theoretic criteria (e.g., mutual
  information) to select common patches;
   Building classifiers (e.g., decision trees or
  multiclass boosting);
   Constructing grammars.
Hierarchical: Examples
 Compositional vision: A “theory of reusable parts”
 Hierarchies of image patches or fragments.
 Algorithmic modeling: coarse-to-fine
representation of the computational process.
Hierarchical Indexing
Coarse-to-fine modeling of both the interpretations
and the computational process:
 Unites representation and processing.
 Proceed from broad scope with low power to
narrow scope with high power.
 Concentrate processing on ambiguous areas.
 Evidence that coarse information is conveyed
earlier than fine information in neural responses to
visual stimuli.
Hierarchical Indexing (cont)
Estimate Y by exploring a family of binary tests

    X = { X_A : A ∈ H },

where H is a hierarchy of nested partitions of Y
and X_A is a binary test for Y ∈ A vs. Y ∉ A.

Index D: explanations y ∈ Y not ruled out by any test:

    D = { y ∈ Y : X_A = 1 for every A ∈ H with y ∈ A }
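The coarse-to-fine computation of the index D can be sketched as a recursion over the hierarchy: test a cell, prune its entire subtree on a negative answer, and only refine on a positive one. The tree representation and the oracle-style tests below are assumptions for illustration, not the authors' implementation.

```python
def flatten(cell):
    """Hypotheses contained in a cell: a leaf is a list of
    hypotheses, an internal node is a tuple of child cells."""
    if isinstance(cell, list):
        return list(cell)
    return [h for child in cell for h in flatten(child)]

def ctf_index(cell, test):
    """Coarse-to-fine computation of the index D.
    `test(hyps)` plays the role of X_A: it returns 1 if the cell
    with hypothesis set `hyps` passes its binary test.
    A failed test rules out the whole subtree with no further
    tests; a passed test is refined into the children."""
    hyps = flatten(cell)
    if not test(hyps):
        return []            # Y ∉ A: prune entire subtree
    if isinstance(cell, list):
        return hyps          # leaf never ruled out: goes into D
    out = []
    for child in cell:
        out += ctf_index(child, test)
    return out

# A two-level hierarchy over four hypotheses {1,2,3,4}.
tree = (([1], [2]), ([3], [4]))
```

With perfect tests, D is exactly the set of true explanations; the point of the strategy is that negative coarse tests remove large parts of Y at the cost of a single test.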
Hierarchical Indexing (cont)
 A recursive partitioning of Y with four levels; there
is a binary test for each of the 15 cells.
 (A): Positive tests are shown in black
 (B): The index is the union of leaves 3 and 4.
 (C): The “trace” of coarse-to-fine search.
When is CTF Optimal?
 c(A) = cost, p(A) = power of the test for cell A in H.
 c* = cost of a perfect test for a single hypothesis.
 The mean cost of a sequential testing strategy T is

    EC(T) = Σ_A c(A) q_A(T) + c* E|D|,

where q_A(T) is the probability of performing test X_A.

 THEOREM (G. Blanchard/DG): CTF is optimal if

    c(A)/p(A)  ≤  Σ_{B ∈ C(A)} c(B)/p(B),   ∀A,

where C(A) = direct children of A in H.
Density of Work
Original image vs. the spatial concentration of processing.
Modeling vs. Learning: Variations on the
Bias-Variance Dilemma
 Reduce variance (dependence on L) by introducing
the “right” biases (a priori structure), or reduce bias
by introducing more complexity?
 Is dimensionality a “curse” or a “blessing”?
 Hard-wiring vs. tabula rasa.
 |L| small vs. large:
   “Credit” for learning with small L?
   Is the interesting limit |L| → infinity or |L| → 0?
Conclusions
 Automatic scene interpretation remains elusive.
 However, growing success with particular object
categories, e.g., vehicles and faces, and many
industrial applications (e.g., wafer inspection).
 No dominant mathematical framework, and the
“right” one is unclear.
 Few theoretical results outside classification.
Naïve Bayes
 Map I to a feature vector X
 Boolean edges
 Wavelet coefficients
 Interest points
 Assume the components of X are conditionally
independent given Y.
 Learn the marginal distributions under object and
background hypotheses from data.
 Uniform prior P(Y).
 Perform a likelihood ratio test to detect objects against
background.
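The likelihood-ratio test above is easy to state in code. This sketch assumes Boolean features with learned marginals P(X_i = 1 | object) and P(X_i = 1 | background); under the conditional-independence assumption the log-likelihood ratio is a sum over components, and a uniform prior P(Y) corresponds to a threshold of 0. The numbers below are hypothetical.

```python
import numpy as np

def nb_log_likelihood_ratio(x, p_obj, p_bg):
    """Naive-Bayes log-likelihood ratio for binary features x:
    log P(X|object) / P(X|background), with the components of X
    assumed conditionally independent given the hypothesis."""
    x = np.asarray(x, dtype=float)
    ll_obj = x * np.log(p_obj) + (1 - x) * np.log(1 - p_obj)
    ll_bg = x * np.log(p_bg) + (1 - x) * np.log(1 - p_bg)
    return np.sum(ll_obj - ll_bg)

def detect(x, p_obj, p_bg, threshold=0.0):
    """Declare 'object' when the log-likelihood ratio exceeds the
    threshold; threshold 0 corresponds to a uniform prior P(Y)."""
    return nb_log_likelihood_ratio(x, p_obj, p_bg) > threshold

# Learned marginals for three Boolean features (edges, wavelets, ...).
p_obj = np.array([0.9, 0.8, 0.7])   # P(X_i = 1 | object)
p_bg = np.array([0.2, 0.3, 0.4])    # P(X_i = 1 | background)
```

Working in log space turns the product of marginals into a sum and avoids underflow when the feature vector is long.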