3D Scene Models
6.870 Object recognition and scene understanding
Krista Ehinger
Questions


What makes a good 3D scene model? How accurate does it need to be?
How far can you get with automatic surface detection? Where do you need human input?
Modelling the scene

Real scenes have far too many surfaces to model every one of them
Modelling the scene

Option 1: Diorama world
Tour Into the Picture (TIP)


Model the scene as 5 planes + foreground objects
Easy implementation: planes/objects defined by humans
Y. Horry, K.I. Anjyo and K. Arai. "Tour Into the Picture: Using a spidery mesh user interface to make animation from a single image". ACM SIGGRAPH 1997
TIP Implementation


User defines the vanishing point and the rear wall of the scene (inner rectangle)
Given some assumptions about the camera, the position/size of all planes can be computed
Y. Horry, K.I. Anjyo and K. Arai. "Tour Into the Picture: Using a spidery mesh user interface to make animation from a single image". ACM SIGGRAPH 1997
Defining the box


Define planes: floor -> y = 0, ceiling -> y = H
Given the horizon (vanishing point), the corners of the floor and ceiling can be computed from their 2D image positions (see the sketch below)
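As a concrete illustration, here is a minimal Python sketch of the similar-triangles geometry behind this step, assuming a pinhole camera at a known height looking parallel to the floor (the function and parameter names are illustrative, not from the paper):

    # Minimal sketch of the ground-plane geometry, assuming a pinhole camera
    # at height cam_height (meters) above the floor plane y = 0.
    # focal_px is the focal length in pixels, cx the principal point column.
    def floor_point_3d(u, v, horizon_v, cx, focal_px, cam_height):
        """Back-project a floor pixel (u, v) to 3D; v must be below the horizon row."""
        if v <= horizon_v:
            raise ValueError("pixel at or above the horizon is not on the floor")
        z = focal_px * cam_height / (v - horizon_v)  # depth, by similar triangles
        x = (u - cx) * z / focal_px                  # lateral offset
        return (x, 0.0, z)

    # Example: f = 500 px, camera 1.6 m high, horizon at row 200 -> the floor
    # pixel (320, 400) lies 4 m in front of the camera: (0.0, 0.0, 4.0).
    print(floor_point_3d(320, 400, horizon_v=200, cx=320, focal_px=500, cam_height=1.6))

The ceiling corners follow the same way, with y = H in place of y = 0.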
Defining the box

Once the positions of the planes are known, compute each plane's texture by warping the corresponding image region onto it (sketched below)
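Extracting a plane's texture amounts to perspective rectification of its image quadrilateral. A hedged sketch using OpenCV (a modern stand-in; the paper predates these libraries), where the quad corners would come from the user-defined spidery mesh:

    import cv2
    import numpy as np

    def rectify_plane(image, quad, out_w, out_h):
        """Warp the plane's image quadrilateral to an out_w x out_h texture.
        quad: four (x, y) corners, ordered top-left, top-right,
        bottom-right, bottom-left."""
        src = np.float32(quad)
        dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
        H = cv2.getPerspectiveTransform(src, dst)  # homography: quad -> rectangle
        return cv2.warpPerspective(image, H, (out_w, out_h))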
What about foreground objects?


Assume a quadrangle attached to the floor; compute its attachment points, then its upper points
Hierarchical model of foreground objects
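Placing such a foreground "billboard" is one more similar-triangles step. This sketch reuses the hypothetical floor_point_3d helper from the earlier example (same assumed camera model):

    def billboard_extent(u, v, pixel_height, horizon_v, cx, focal_px, cam_height):
        """3D base and top of a vertical quad attached to the floor at pixel (u, v);
        horizontal extent is recovered the same way from the quad's side pixels."""
        x, _, z = floor_point_3d(u, v, horizon_v, cx, focal_px, cam_height)
        world_h = pixel_height * z / focal_px  # pixel height scaled by depth
        return (x, 0.0, z), (x, world_h, z)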
Extracting foreground objects


Foreground objects are removed and added to a mask
Holes in the background are filled in using photo completion software
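TIP used the photo completion tools of its day; a convenient modern stand-in is OpenCV's inpainting (the file names here are placeholders):

    import cv2

    image = cv2.imread("scene.jpg")
    # Mask is nonzero exactly where foreground objects were removed.
    mask = cv2.imread("foreground_mask.png", cv2.IMREAD_GRAYSCALE)
    filled = cv2.inpaint(image, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
    cv2.imwrite("background_filled.png", filled)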
TIP Demonstration
TIP Discussion


Pros:

Accurate model (due to human input)

Deals with foreground objects, occlusions
Cons:

Requires human input, not automatic

Model too simple for many real-world scenes
Modelling the scene

Option 2: Pop-up book world
Automatic Photo Pop-Up

Three classes of surface: ground, sky, vertical

Not just a box: can model more kinds of scenes

Automatic classification, no labeling
D. Hoiem, A.A. Efros, and M. Hebert, "Automatic Photo Pop-up", ACM SIGGRAPH 2005.
Photo Pop-Up Implementation




Pixels -> superpixels -> constellations
Automatic labeling of constellations as ground, vertical, or sky
Define angles of vertical planes (using attachment to ground)
Map textures to vertical planes (as in TIP)
Superpixels, constellations


Superpixels are neighboring pixels that have nearly the same color (Tao et al, 2001)
Superpixels are assigned to constellations according to how likely they are to share a label (ground, vertical, sky), based on the difference between their feature vectors
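To get a feel for the oversegmentation step, here is a sketch with scikit-image; SLIC is not the algorithm the paper used, just a convenient modern superpixel method, and the file name is a placeholder:

    from skimage.io import imread
    from skimage.segmentation import slic

    image = imread("scene.jpg")
    labels = slic(image, n_segments=400, compactness=10)  # ~400 color-coherent superpixels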
Feature vectors


Color features: RGB, hue, saturation
Texture features: difference of oriented Gaussians, textons

Location (absolute and percentile)

Number of superpixels in the constellation

Line and intersection detectors

Not used: constellation shape (contiguous, N sides), some texture features
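As a rough illustration, here is a sketch of a few of these per-superpixel features (mean color plus normalized centroid); the paper's full feature list is considerably larger:

    import numpy as np
    from skimage.color import rgb2hsv

    def superpixel_features(image, labels):
        """Mean RGB + HSV color and normalized centroid per superpixel.
        Assumes an 8-bit RGB image and a label map from the segmentation step."""
        hsv = rgb2hsv(image)
        h, w = labels.shape
        ys, xs = np.mgrid[0:h, 0:w]
        feats = {}
        for s in np.unique(labels):
            m = labels == s
            feats[s] = np.concatenate([
                image[m].mean(axis=0) / 255.0,         # mean R, G, B
                hsv[m].mean(axis=0),                   # mean hue, saturation, value
                [xs[m].mean() / w, ys[m].mean() / h],  # normalized centroid
            ])
        return feats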
Training process


For each of 82 labeled training images:

Compute superpixels, features, pairwise likelihoods

Form a set of N constellations (N = 3 to 25), each labeled with ground truth

Compute constellation features
Compute constellation label, homogeneity likelihood
Training process



Adaboost weak classifiers learn to estimate whether superpixels have the same label, based on their feature vectors (sketched below)
Another set of Adaboost weak classifiers learns the constellation label and homogeneity likelihood (expressed as percent ground, vertical, sky, mixed)
Emphasis on classifying larger constellations
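A hedged sketch of the pairwise step, using scikit-learn's AdaBoost in place of the paper's exact boosted-classifier setup, with stand-in random data where the real system would use pairs sampled from the 82 training images:

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    rng = np.random.default_rng(0)
    # Stand-in data: 1000 sampled superpixel pairs with 8-D feature vectors and
    # ground-truth labels (0 = ground, 1 = vertical, 2 = sky).
    feats_i, feats_j = rng.random((1000, 8)), rng.random((1000, 8))
    labels_i, labels_j = rng.integers(0, 3, 1000), rng.integers(0, 3, 1000)

    # Learn P(same label) from the absolute feature difference of each pair.
    X = np.abs(feats_i - feats_j)
    y = (labels_i == labels_j).astype(int)
    same_label = AdaBoostClassifier(n_estimators=50)  # decision stumps by default
    same_label.fit(X, y)
    p_same = same_label.predict_proba(X)[:, 1]        # pairwise likelihoods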
Building the 3D model


Along the vertical/ground boundary, fit line segments (Hough transform); the goal is to find the simplest shape (fewest lines)
Project lines up from the corners of the boundary segments, then cut and fold
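A sketch of the line-fitting step with OpenCV's probabilistic Hough transform; the boundary image and the thresholds are assumptions, and the "fewest lines" model selection is not shown:

    import cv2
    import numpy as np

    # 8-bit binary image, nonzero along the ground/vertical label boundary
    boundary = cv2.imread("ground_vertical_boundary.png", cv2.IMREAD_GRAYSCALE)
    segments = cv2.HoughLinesP(boundary, rho=1, theta=np.pi / 180, threshold=50,
                               minLineLength=40, maxLineGap=10)
    # Each fitted segment becomes a crease: the vertical plane above it is
    # folded up from the ground plane along that segment.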
Photo Pop-Up Demonstration
Photo Pop-Up Discussion


Pros:

Automatic

Can handle a variety of scenes, not just boxes
Cons:

No handling of foreground objects

Misclassification leads to very strange models

Only 2 kinds of surface in the model: ground and vertical (sky is discarded)
Modelling the scene

Option 3: Actually try to model surface angles
3D Scene Structure from a Still Image



Compute the surface normal for each surface
No right-angle assumptions; surfaces can have any angle
Automatic (trained on images with known depth maps)
3D Scene Implementation


Segment the image into superpixels
Estimate the surface normal of each superpixel (using a Markov Random Field model)

Optional: detect and extract foreground objects

Map textures to planes
(Figure: original image and its modeled depth map)
A. Saxena, M. Sun, A. Y. Ng. "Learning 3-D Scene Structure from a Single Still Image". In ICCV workshop on 3D Representation for Recognition (3dRR-07), 2007
Image features


Superpixel features (x_i)

Color and texture features, as in Photo Pop-Up

Vector also includes features of neighboring superpixels
Boundary features (x_ij)

Color difference, texture difference, edge detector response
Markov Random Field Model


First term: models the planes in terms of the image features of individual superpixels
Second term: models the planes in terms of pairs of superpixels, subject to the constraints on the next slide; a schematic form is given below
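Schematically, and only as the generic conditional-MRF form rather than the paper's exact energy, with \alpha_i the plane parameters of superpixel i:

\[
P(\alpha \mid X; \theta) \;\propto\; \exp\Big(-\sum_i f(x_i, \alpha_i; \theta) \;-\; \sum_{(i,j)} g(x_{ij}, \alpha_i, \alpha_j)\Big)
\]

Here f scores each superpixel's plane against its own features x_i, and g scores neighboring pairs through the boundary features x_ij, encoding the constraints on the next slide.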
Model constraints



Connected structure: except where there is an occlusion, neighboring superpixels are likely to be connected
Coplanar structure: except where there are folds, neighboring superpixels are likely to lie on the same plane
Co-linearity: long straight lines in the image correspond to straight lines in 3D
Foreground objects


Automatically detected foreground objects may be removed from the model (for example, pedestrians, using the Dalal & Triggs detector; see the sketch below)
Detected objects add 3D cues (pedestrians are basically vertical and occlude other surfaces)
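The detector named here ships with OpenCV, so this step can be sketched almost directly (the file name and window stride are placeholders):

    import cv2

    hog = cv2.HOGDescriptor()  # Dalal & Triggs HOG person detector
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
    image = cv2.imread("street.jpg")
    boxes, weights = hog.detectMultiScale(image, winStride=(8, 8))
    # Each box's bottom edge gives a ground contact point; the detection is
    # modeled as a vertical surface that occludes whatever lies behind it.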
3D Scene Demonstration
Results
3D Scene Discussion


Pros:

Handles a variety of scene types

Fairly accurate (about 2/3 of scenes correct)

Automatic

Handles foreground objects
Cons:

Still fails on 1/3 of scenes
Discussion



Simple 3D models are adequate for many scenes
You can get pretty far without human input (though results would still be better with human annotation of scenes)
Extensions?

Use photo completion techniques to handle occlusions?

Massive training sets -> better 3D models?