Motion and Stereo 2

Feature matching for stereo image analysis
Introduction
We have seen that, in order to compute disparity in a pair of
stereo images, we need to determine the local shift
(disparity) of one image with respect to the other image.
In order to perform the matching which allows us to compute the disparity, we have 2 problems to solve :
• Which points to select for matching?
- we require distinct 'feature points'
• How do we match feature points in one image with feature points in the second image?
[Figure : left and right stereo images, each with 5 feature points marked]
There are some fairly obvious guidelines in solving both
problems.
Feature points must be :
• Local (extended line segments are no good, we require local disparity)
• Distinct (a lot 'different' from neighbouring points)
• Invariant (to rotation, scale, illumination)
The matching process must be :
• Local (thus limiting the search area)
• Consistent (leading to 'smooth' disparity estimates)
Approaches to feature point selection
Previous approaches to feature point selection have been :
• Moravec interest operator
- this is based on thresholding local greylevel squared differences
• Symmetric features
- circular features, spirals
• Line segment endpoints
• Corner points
A feature extraction algorithm based on corner point
detection is described in Haralick & Shapiro, pp. 334 which
allows corner point locations to be accurately estimated
along with the covariance matrix of the location estimate.
The approach uses least-squares estimation and leads to a covariance matrix identical to the one obtained for the optical flow estimation algorithm!
An algorithm for corner point estimation
Assume that the corner point lies at location $(x_c, y_c)$. We require an estimate $(\hat{x}_c, \hat{y}_c)$ of the location, together with a covariance matrix, which we obtain by considering the intersection of line segments within an $m$-point window spanning the corner point location.
Suppose we observe 2 'strong' edges in the window. These edges will meet at the corner point estimate $(\hat{x}_c, \hat{y}_c)$ (unless they are parallel).
[Figure : two edge segments through $(x_1, y_1)$ and $(x_2, y_2)$ intersecting at $(\hat{x}_c, \hat{y}_c)$]
What if we observe 3 strong edges?
[Figure : three edge segments through $(x_1, y_1)$, $(x_2, y_2)$ and $(x_3, y_3)$, which do not in general meet at a single point]
The location estimate $(\hat{x}_c, \hat{y}_c)$ minimises the sum of squared perpendicular distances to each line segment.
[Figure : perpendicular distances $n_1$, $n_2$, $n_3$ from $(\hat{x}_c, \hat{y}_c)$ to the three edge segments]
Thus, in this case $(\hat{x}_c, \hat{y}_c)$ is selected to minimise $n_1^2 + n_2^2 + n_3^2$.
How do we characterise an edge segment?
Each edge passes through some point $(x_i, y_i)$ and has a gradient $\nabla f(x_i, y_i) \equiv \nabla f_i$ at that point. The edge segment is characterised by the perpendicular distance $l_i$ from the edge to the origin and the gradient vector direction $\theta_i$ :

$$\frac{\nabla f_i}{|\nabla f_i|} = (\cos\theta_i, \sin\theta_i)$$

$$l_i = (x_i, y_i) \cdot \frac{\nabla f_i}{|\nabla f_i|} = x_i\cos\theta_i + y_i\sin\theta_i$$

[Figure : edge segment through $(x_i, y_i)$ with gradient direction $\theta_i$ and perpendicular distance $l_i$ to the origin]
Finally, we can quantify the certainty with which we believe an edge to be present passing through $(x_i, y_i)$ by the squared magnitude of the gradient vector, $|\nabla f_i|^2 = w_i$.
In order to establish a mathematical model of the corner location, we model the perpendicular distance of the edge segment $(l_i, \theta_i)$ as the distance predicted by the corner location plus a random error $n_i$ with variance $\sigma_n^2$ :

$$l_i = x_c\cos\theta_i + y_c\sin\theta_i + n_i$$

[Figure : edge segment through $(x_i, y_i)$ with direction $\theta_i$, perpendicular distance $l_i$ to the origin and perpendicular distance $n_i$ to the corner point $(x_c, y_c)$]
Assuming our observations are the quantities $(l_i, \theta_i)$ for each location $i$ in the $m$-point window, we can employ a weighted least-squares procedure to estimate $(x_c, y_c)$ :

$$(\hat{x}_c, \hat{y}_c) = \arg\min_{(x_c, y_c)} \sum_{i=1}^{m} w_i n_i^2$$

where $w_i = |\nabla f_i|^2$.
Defining :

$$\epsilon(x_c, y_c) = \sum_{i=1}^{m} w_i n_i^2 = \sum_{i=1}^{m} w_i \left(l_i - x_c\cos\theta_i - y_c\sin\theta_i\right)^2$$

the estimate $(\hat{x}_c, \hat{y}_c)$ satisfies :

$$\left.\frac{\partial\epsilon(x_c, y_c)}{\partial x_c}\right|_{x_c = \hat{x}_c} = 0 \qquad \left.\frac{\partial\epsilon(x_c, y_c)}{\partial y_c}\right|_{y_c = \hat{y}_c} = 0$$
These equations straightforwardly lead to a matrix-vector equation :

$$\begin{pmatrix} \sum_{i=1}^{m} w_i\cos^2\theta_i & \sum_{i=1}^{m} w_i\cos\theta_i\sin\theta_i \\ \sum_{i=1}^{m} w_i\cos\theta_i\sin\theta_i & \sum_{i=1}^{m} w_i\sin^2\theta_i \end{pmatrix} \begin{pmatrix} \hat{x}_c \\ \hat{y}_c \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{m} l_i w_i\cos\theta_i \\ \sum_{i=1}^{m} l_i w_i\sin\theta_i \end{pmatrix}$$
Substituting :

$$l_i = x_i\cos\theta_i + y_i\sin\theta_i \qquad |\nabla f_i|\cos\theta_i = \frac{\partial f_i}{\partial x} \qquad |\nabla f_i|\sin\theta_i = \frac{\partial f_i}{\partial y}$$

gives :

$$\begin{pmatrix} \sum_{i=1}^{m}\left(\frac{\partial f_i}{\partial x}\right)^2 & \sum_{i=1}^{m}\frac{\partial f_i}{\partial x}\frac{\partial f_i}{\partial y} \\ \sum_{i=1}^{m}\frac{\partial f_i}{\partial x}\frac{\partial f_i}{\partial y} & \sum_{i=1}^{m}\left(\frac{\partial f_i}{\partial y}\right)^2 \end{pmatrix} \begin{pmatrix} \hat{x}_c \\ \hat{y}_c \end{pmatrix} = \sum_{i=1}^{m} \begin{pmatrix} \left(\frac{\partial f_i}{\partial x}\right)^2 & \frac{\partial f_i}{\partial x}\frac{\partial f_i}{\partial y} \\ \frac{\partial f_i}{\partial x}\frac{\partial f_i}{\partial y} & \left(\frac{\partial f_i}{\partial y}\right)^2 \end{pmatrix} \begin{pmatrix} x_i \\ y_i \end{pmatrix}$$

so that :

$$\begin{pmatrix} \hat{x}_c \\ \hat{y}_c \end{pmatrix} = \left[\sum_{i=1}^{m} \begin{pmatrix} \left(\frac{\partial f_i}{\partial x}\right)^2 & \frac{\partial f_i}{\partial x}\frac{\partial f_i}{\partial y} \\ \frac{\partial f_i}{\partial x}\frac{\partial f_i}{\partial y} & \left(\frac{\partial f_i}{\partial y}\right)^2 \end{pmatrix}\right]^{-1} \sum_{i=1}^{m} \begin{pmatrix} \left(\frac{\partial f_i}{\partial x}\right)^2 & \frac{\partial f_i}{\partial x}\frac{\partial f_i}{\partial y} \\ \frac{\partial f_i}{\partial x}\frac{\partial f_i}{\partial y} & \left(\frac{\partial f_i}{\partial y}\right)^2 \end{pmatrix} \begin{pmatrix} x_i \\ y_i \end{pmatrix}$$
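As an illustration, here is a minimal numpy sketch of this least-squares corner estimator (the function name and interface are our own, not from the notes). It assumes the image gradients over the $m$-point window have already been computed, e.g. by finite differences, and that the window genuinely contains a corner :

```python
import numpy as np

def estimate_corner(fx, fy, xs, ys):
    """Least-squares corner location from window gradients (fx, fy) and
    pixel coordinates (xs, ys), following the derivation above."""
    w = fx**2 + fy**2                       # weights w_i = |grad f_i|^2
    # 2x2 normal matrix; the weights are already absorbed since
    # w_i cos^2(theta_i) = fx_i^2, etc.
    N = np.array([[np.sum(fx * fx), np.sum(fx * fy)],
                  [np.sum(fx * fy), np.sum(fy * fy)]])
    # right-hand side  sum_i [fx^2*x + fx*fy*y ; fx*fy*x + fy^2*y]
    b = np.array([np.sum(fx * fx * xs + fx * fy * ys),
                  np.sum(fx * fy * xs + fy * fy * ys)])
    xc, yc = np.linalg.solve(N, b)          # corner estimate (x_c, y_c)
    # residuals n_i = l_i - x_c cos(theta_i) - y_c sin(theta_i)
    theta = np.arctan2(fy, fx)
    l = xs * np.cos(theta) + ys * np.sin(theta)
    n = l - xc * np.cos(theta) - yc * np.sin(theta)
    sigma2 = np.sum(w * n**2) / (len(xs) - 2)    # noise variance estimate
    C = sigma2 * np.linalg.inv(N)                # covariance of the estimate
    return (xc, yc), C
```

The same normal matrix, scaled by the noise variance estimate and inverted, gives the covariance matrix discussed next.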
We can estimate $\sigma_n^2$ by :

$$\hat{\sigma}_n^2 = \frac{1}{m-2}\sum_{i=1}^{m} w_i \left(l_i - \hat{x}_c\cos\theta_i - \hat{y}_c\sin\theta_i\right)^2$$

The covariance matrix is then given by :

$$C_{(\hat{x}_c, \hat{y}_c)} = \hat{\sigma}_n^2 \left[\sum_{i=1}^{m} \begin{pmatrix} \left(\frac{\partial f_i}{\partial x}\right)^2 & \frac{\partial f_i}{\partial x}\frac{\partial f_i}{\partial y} \\ \frac{\partial f_i}{\partial x}\frac{\partial f_i}{\partial y} & \left(\frac{\partial f_i}{\partial y}\right)^2 \end{pmatrix}\right]^{-1}$$

This is the same covariance matrix we obtained for the optical flow estimation algorithm.
Window selection and confidence measure for the corner
location estimate
How do we know our window contains a corner point? We require a confidence estimate based on the covariance matrix $C_{(\hat{x}_c, \hat{y}_c)}$.
As in the case of the optical flow estimate, we focus on the eigenvalues $\lambda_1 \geq \lambda_2$ of $C_{(\hat{x}_c, \hat{y}_c)}$ and define a confidence ellipse :
[Figure : confidence ellipse centred on $(\hat{x}_c, \hat{y}_c)$ with axes corresponding to the eigenvalues $\lambda_1$ and $\lambda_2$]
We require 2 criteria to be fulfilled in order to accept the corner point estimate :
• The maximum eigenvalue $\lambda_1$ should be smaller than a threshold value
• The confidence ellipse should not be too elongated, which would imply that the precision of the corner location estimate is much greater in one direction than in the other
In order to satisfy the second criterion, we can quantify the circularity of the confidence ellipse by the form factor $q$ :

$$q = 1 - \left(\frac{\lambda_1 - \lambda_2}{\lambda_1 + \lambda_2}\right)^2 = \frac{4\lambda_1\lambda_2}{(\lambda_1 + \lambda_2)^2} = \frac{4\det(C)}{\mathrm{tr}^2(C)}$$
It is easy to check that $0 \leq q \leq 1$, with $q \approx 0$ for an elongated ellipse and $q = 1$ for a circular ellipse. Thus we can threshold the value of $q$ in order to decide whether to accept the corner location estimate or not.
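A minimal sketch of this acceptance test, assuming a covariance matrix `C` such as the one returned by the estimator sketched earlier; the threshold values are illustrative choices, not values from the notes :

```python
import numpy as np

def accept_corner(C, lambda_max=0.5, q_min=0.5):
    """Accept a corner estimate if the larger eigenvalue of its covariance
    matrix is small enough and the confidence ellipse is nearly circular."""
    lam1, lam2 = sorted(np.linalg.eigvalsh(C), reverse=True)   # lam1 >= lam2
    q = 4.0 * np.linalg.det(C) / (np.trace(C) ** 2)            # form factor
    return lam1 < lambda_max and q > q_min
```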
Correspondence matching for disparity computation
The general problem is to pair each corner point estimate $(\hat{x}_{c_i}, \hat{y}_{c_i})$ in one stereo image with a corresponding estimate $(\hat{x}'_{c_j}, \hat{y}'_{c_j})$ in the other stereo image. The disparity estimate is then the displacement vector $\hat{d}_{ij} = (\hat{x}_{c_i} - \hat{x}'_{c_j}, \hat{y}_{c_i} - \hat{y}'_{c_j})$.
We can represent each possible match $i \leftrightarrow j$ for feature points $i, j = 1..m$ by a bipartite graph, with each graph node being a feature point. The correspondence matching problem is then to prune the graph so that one node in the left image is only matched with one node in the right image.
[Figure : bipartite graph of candidate matches between left (L) and right (R) feature points, before and after pruning]
There are a number of feature correspondence algorithms that have been developed, ranging from relaxation labelling type algorithms to neural network techniques. Typically these algorithms work on the basis of :
• Feature similarity
• Smoothness of the underlying disparity field
We can define feature point $f_i$, corresponding to node $i$ in the left stereo image. In our case, $f_i$ is a corner feature with image location $(x_{c_i}, y_{c_i})$.
We need to match $f_i$ with some feature $f'_j$ in the right stereo image. Such a match would produce a disparity estimate $d_{ij} = (x_{c_i} - x'_{c_j}, y_{c_i} - y'_{c_j})$ at position $(x_{c_i}, y_{c_i})$.
[Figure : feature point $f_i$ at $(x_{c_i}, y_{c_i})$ with candidate disparity vector $d_{ij}$]
We will outline a matching algorithm based on relaxation labelling which makes the following assumptions :
• We can define a similarity function $w_{ij}$ between matching feature points $f_i$ and $f'_j$. Matching feature points would be expected to have a large value of $w_{ij}$. One possibility (see the sketch after this list) might be :

$$w_{ij} = \frac{1}{1 + \kappa s_{ij}}$$

where $\kappa$ is a constant and $s_{ij}$ is the normalised sum-of-squared greylevel differences between pixels in the $m$-point window centred on the two feature points.
• Disparity values for neighbouring feature points are similar. This is based on the fact that most object surfaces are smooth, and large jumps in the depth values at neighbouring points on the object surface, corresponding to large jumps in disparity values at neighbouring feature points, are unlikely.
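As a sketch of the first assumption, the similarity might be computed from the normalised window SSD as follows; the window half-size and the constant `kappa` are illustrative choices, not values from the notes :

```python
import numpy as np

def similarity(left, right, pL, pR, half=3, kappa=1.0):
    """w_ij = 1 / (1 + kappa * s_ij), where s_ij is the normalised
    sum-of-squared greylevel differences over an m-point window."""
    (xl, yl), (xr, yr) = pL, pR
    wL = left[yl - half:yl + half + 1, xl - half:xl + half + 1].astype(float)
    wR = right[yr - half:yr + half + 1, xr - half:xr + half + 1].astype(float)
    s = np.sum((wL - wR) ** 2) / wL.size     # normalised SSD
    return 1.0 / (1.0 + kappa * s)
```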
Our relaxation labelling algorithm is based on the probability $p_i(d_{ij})$ that feature point $f_i$ is matched to feature point $f'_j$, resulting in a disparity value of $d_{ij}$.
We can initialise this probability using our similarity function :

$$p_i^{(0)}(d_{ij}) = \frac{w_{ij}}{\sum_j w_{ij}}$$
We next compute the support $q_i^{(r)}(d_{ij})$ for the match by considering the contribution from all of the neighbouring feature points of $f_i$ which, themselves, have disparities close to $d_{ij}$ :

$$q_i^{(r)}(d_{ij}) = \sum_{k \,:\, f_k \text{ neighbour of } f_i} \;\; \sum_{l \,:\, d_{kl} \text{ close to } d_{ij}} p_k^{(r)}(d_{kl})$$

[Figure : neighbouring feature points of $f_i$ with disparities close to $d_{ij}$]
Finally, we update our probabilities according to :

$$p_i^{(r+1)}(d_{ij}) = \frac{p_i^{(r)}(d_{ij})\, q_i^{(r)}(d_{ij})}{\sum_j p_i^{(r)}(d_{ij})\, q_i^{(r)}(d_{ij})}$$
This is an iterative algorithm, whereby all feature points are updated simultaneously. The algorithm converges to a state where none of the probabilities associated with each feature point changes by more than some pre-defined threshold value. At this point the matches are defined such that :

$$f_i \leftrightarrow f'_j \quad \text{if} \quad p_i^{(R)}(d_{ij}) = \max_k p_i^{(R)}(d_{ik})$$
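A compact sketch of one way this iteration could be implemented; the data structures (a similarity matrix `W`, per-feature neighbour lists and a disparity 'closeness' test) are our own illustrative choices :

```python
import numpy as np

def relax_matches(W, disp, neighbours, close, n_iter=20, tol=1e-4):
    """Relaxation labelling for feature matching.
    W[i, j]      : similarity w_ij between f_i and f'_j
    disp[i, j]   : candidate disparity vector d_ij (shape (nL, nR, 2))
    neighbours[i]: indices of feature points neighbouring f_i
    close(a, b)  : True if disparities a and b are 'close'."""
    p = W / W.sum(axis=1, keepdims=True)           # p_i^(0)(d_ij)
    for _ in range(n_iter):
        q = np.zeros_like(p)
        for i in range(p.shape[0]):
            for j in range(p.shape[1]):
                # support from neighbours k whose disparity d_kl is close to d_ij
                q[i, j] = sum(p[k, l]
                              for k in neighbours[i]
                              for l in range(p.shape[1])
                              if close(disp[k, l], disp[i, j]))
        new_p = p * q
        new_p /= new_p.sum(axis=1, keepdims=True) + 1e-12   # normalise over j
        if np.max(np.abs(new_p - p)) < tol:        # convergence test
            p = new_p
            break
        p = new_p
    return p.argmax(axis=1)                        # best match j for each f_i
```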
Example
Stereo pair (left and right images) generated by a ray tracing
program.
A selection of successfully matched corner features (using the above corner detection and matching algorithms) and the final image disparity map, which is inversely proportional to the depth at each image point.
Motion and 3D structure from optical flow
Introduction
This area of computer vision research attempts to
reconstruct the structure of the imaged 3D environment and
the 3D motion of objects in the scene from optical flow
measurements made on a sequence of images.
Applications include autonomous vehicle navigation and robot assembly. Typically a video camera (or more than one camera for stereo measurements) is attached to a mobile robot. As the robot moves, it can build up a 3D model of the objects in its environment.
The fundamental relation between the optical flow vector at position $(x, y)$ in the image plane, $v(x, y) = (v_x(x, y), v_y(x, y))$, and the relative motion of the point on an object surface projected to $(x, y)$ can easily be derived using the equations for perspective projection.
We assume that the object has a rigid body translational motion $V = (V_X, V_Y, V_Z)$ relative to a camera-centred co-ordinate system $(X, Y, Z)$.
[Figure : scene point $(X, Y, Z)$ moving with velocity $V$, projected to image point $(x, y)$ moving with velocity $v$; camera centre $O$, focal length $f$, axes $X$, $Y$, $Z$]
The equations for perspective projection are :

$$\begin{pmatrix} x \\ y \end{pmatrix} = \frac{f}{Z(x, y)}\begin{pmatrix} X \\ Y \end{pmatrix}$$

We can differentiate this equation with respect to $t$ :

$$\frac{d}{dt}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} v_x \\ v_y \end{pmatrix} = \frac{f}{Z^2}\begin{pmatrix} ZV_X - XV_Z \\ ZV_Y - YV_Z \end{pmatrix} = \begin{pmatrix} \dfrac{fV_X}{Z} - \dfrac{fXV_Z}{Z^2} \\ \dfrac{fV_Y}{Z} - \dfrac{fYV_Z}{Z^2} \end{pmatrix}$$

where $(V_X, V_Y, V_Z) = \left(\dfrac{dX}{dt}, \dfrac{dY}{dt}, \dfrac{dZ}{dt}\right)$.

Substituting in the perspective projection equations, this simplifies to :

$$\begin{pmatrix} v_x \\ v_y \end{pmatrix} = \frac{1}{Z}\begin{pmatrix} fV_X - xV_Z \\ fV_Y - yV_Z \end{pmatrix} = \frac{1}{Z}\begin{pmatrix} f & 0 & -x \\ 0 & f & -y \end{pmatrix}\begin{pmatrix} V_X \\ V_Y \\ V_Z \end{pmatrix}$$
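The relation is easy to verify numerically; the numbers below are made up for illustration :

```python
import numpy as np

f = 1.0                                   # focal length
X, Y, Z = 2.0, -1.0, 10.0                 # scene point (made-up values)
VX, VY, VZ = 0.3, 0.1, -0.8               # translational velocity (made up)
x, y = f * X / Z, f * Y / Z               # perspective projection

# optical flow from the derived formula
v = np.array([f * VX - x * VZ, f * VY - y * VZ]) / Z

# numerical check : project the point again after a small time step dt
dt = 1e-6
Xp, Yp, Zp = X + VX * dt, Y + VY * dt, Z + VZ * dt
v_numeric = (np.array([f * Xp / Zp, f * Yp / Zp]) - np.array([x, y])) / dt
print(v, np.allclose(v, v_numeric, atol=1e-6))   # [0.046 0.002] True
```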
We can invert this equation by solving for $(V_X, V_Y, V_Z)$ :

$$\begin{pmatrix} V_X \\ V_Y \\ V_Z \end{pmatrix} = \frac{Z}{f}\begin{pmatrix} v_x \\ v_y \\ 0 \end{pmatrix} + \frac{V_Z}{f}\begin{pmatrix} x \\ y \\ f \end{pmatrix}$$

This consists of a component parallel to the image plane and an unobservable component along the line of sight $(x, y, f)$ :
[Figure : decomposition of the 3D velocity $V$ into the component $\frac{Z}{f}(v_x, v_y, 0)$ parallel to the image plane and the component $\frac{V_Z}{f}(x, y, f)$ along the line of sight through $O$]
Focus of expansion
From the expression for the optical flow, we can determine a simple structure for the flow vectors in an image corresponding to a rigid body translation :

$$\begin{pmatrix} v_x \\ v_y \end{pmatrix} = \frac{1}{Z}\begin{pmatrix} fV_X - xV_Z \\ fV_Y - yV_Z \end{pmatrix} = \frac{V_Z}{Z}\begin{pmatrix} \dfrac{fV_X}{V_Z} - x \\ \dfrac{fV_Y}{V_Z} - y \end{pmatrix} = \frac{V_Z}{Z}\begin{pmatrix} x_0 - x \\ y_0 - y \end{pmatrix}$$

where $(x_0, y_0) = \left(\dfrac{fV_X}{V_Z}, \dfrac{fV_Y}{V_Z}\right)$ is called the focus of expansion (FOE). For $V_Z$ towards the camera (negative), the flow vectors point away from the FOE (expansion), and for $V_Z$ away from the camera, the flow vectors point towards the FOE (contraction).
[Figure : flow vector $v(x, y)$ at image point $(x, y)$, directed along the line through the FOE $(x_0, y_0)$]
Example
Diverging tree test sequence (40 frame sequence). The first and last frames in the sequence are shown, along with the computed optical flow; the flow vectors point away from the FOE, indicating an expanding flow field.
What 3D information does the FOE provide?
$$(x_0, y_0, f) = \left(\frac{fV_X}{V_Z}, \frac{fV_Y}{V_Z}, f\right) = \frac{f}{V_Z}(V_X, V_Y, V_Z)$$

Thus, the direction of translational motion $V / |V|$ can be determined from the FOE position.
[Figure : the vector from the camera centre $O$ to the FOE $(x_0, y_0, f)$ on the image plane is parallel to the translational velocity $V$]
We can also determine the time to impact from the optical flow measurements close to the FOE :

$$\begin{pmatrix} v_x \\ v_y \end{pmatrix} = \frac{V_Z}{Z}\begin{pmatrix} x_0 - x \\ y_0 - y \end{pmatrix} \qquad \frac{V_Z}{Z} = \frac{1}{\tau}$$

$\tau = \dfrac{Z}{V_Z}$ is the time to impact, an important quantity for both mobile robots and biological vision systems!
The position of the FOE and the time to impact can be found using a least-squares technique based on measurements of the optical flow $(v_{x_i}, v_{y_i})$ at a number of image points $(x_i, y_i)$ (see Haralick & Shapiro, pp. 191).
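One simple least-squares formulation (a sketch under the stated assumptions, not necessarily the formulation in Haralick & Shapiro) uses the fact that each flow vector is parallel to $(x_0 - x, y_0 - y)$, so $v_y(x_0 - x) - v_x(y_0 - y) = 0$ is linear in $(x_0, y_0)$; the time to impact then follows from $\tau = (x_0 - x)/v_x$, assuming approximately constant depth over the points used :

```python
import numpy as np

def foe_and_tau(xs, ys, vx, vy):
    """Estimate the focus of expansion (x0, y0) and time to impact tau
    from flow vectors (vx, vy) measured at image points (xs, ys)."""
    # parallelism constraint: vy*x0 - vx*y0 = vy*x - vx*y, linear in (x0, y0)
    A = np.column_stack([vy, -vx])
    b = vy * xs - vx * ys
    (x0, y0), *_ = np.linalg.lstsq(A, b, rcond=None)
    # tau = Z / V_Z : each point gives (x0 - x)/vx or (y0 - y)/vy;
    # exclude near-zero flow components and take the median
    taus = np.concatenate([((x0 - xs) / vx)[np.abs(vx) > 1e-6],
                           ((y0 - ys) / vy)[np.abs(vy) > 1e-6]])
    return (x0, y0), np.median(taus)
```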
An algorithm for determining structure from motion
The structure from motion problem can be stated as follows :
Given an observed optical flow field $v(x_i, y_i)$ measured at $N$ image locations $(x_i, y_i)$, $i = 1..N$, which are the projected points of $N$ points $(X_i, Y_i, Z_i)$ on a rigid body surface moving with velocity $V = (V_X, V_Y, V_Z)$, determine, to a multiplicative scale constant, the positions $(X_i, Y_i, Z_i)$ and the velocity $V = (V_X, V_Y, V_Z)$ from the flow field.
Note that this is a rather incomplete statement of the full
problem since, in most camera systems attached to mobile
robots, the camera can pan and tilt inducing an angular
velocity about the camera axes. This significantly
complicates the solution.
[Figure : rigid body point $(X_i, Y_i, Z_i)$ moving with velocity $V$, projected to the flow vector $v(x_i, y_i)$ in the image plane; camera centre $O$, focal length $f$]
An elegant solution exists for the simple case of pure translation, i.e. zero angular velocity $(\Omega_X, \Omega_Y, \Omega_Z) = 0$ (Haralick & Shapiro, pp. 188-191). The general case is more complex, and a lot of research is still going on investigating this general problem and the stability of the algorithms developed.
The algorithm starting point is the optical flow equation :

$$\begin{pmatrix} v_x \\ v_y \end{pmatrix} = \frac{1}{Z}\begin{pmatrix} f & 0 & -x \\ 0 & f & -y \end{pmatrix}\begin{pmatrix} V_X \\ V_Y \\ V_Z \end{pmatrix}$$
Thus, since $V = (V_X, V_Y, V_Z)^T$ is the vector sum of multiples of $(v_x, v_y, 0)^T$ and $(x, y, f)^T$, the vector product of these 2 vectors is orthogonal to $V$ :

$$(v_x, v_y, 0) \times (x, y, f) \;\perp\; (V_X, V_Y, V_Z)$$

[Figure : the vectors $(v_x, v_y, 0)$, $(x, y, f)$ and $V = (V_X, V_Y, V_Z)$, with the cross product orthogonal to $V$]
But :

$$\begin{pmatrix} v_x \\ v_y \\ 0 \end{pmatrix} \times \begin{pmatrix} x \\ y \\ f \end{pmatrix} = \begin{pmatrix} v_y f \\ -v_x f \\ v_x y - v_y x \end{pmatrix}$$

Thus :

$$\begin{pmatrix} v_y f \\ -v_x f \\ v_x y - v_y x \end{pmatrix}^T \begin{pmatrix} V_X \\ V_Y \\ V_Z \end{pmatrix} = 0$$
This equation applies to all points $(x_i, y_i)$, $i = 1..N$. Obviously a trivial solution to this equation would be $V = 0$. Also, if some non-zero vector $V$ is a solution then so is the vector $cV$ for any scalar constant $c$. This confirms that we cannot determine the absolute magnitude of the velocity vector; we can only determine it to a multiplicative scale constant.
We want to solve the above equation, in a least-squares sense, over all points $(x_i, y_i)$, $i = 1..N$, subject to the condition that :

$$V^T V = k^2$$

This constrains the squared magnitude of the velocity vector to be some arbitrary value $k^2$.
We can rewrite the above orthogonality condition for all points $(x_i, y_i)$, $i = 1..N$, in matrix-vector notation :

$$A\begin{pmatrix} V_X \\ V_Y \\ V_Z \end{pmatrix} = 0$$
where :

$$A = \begin{pmatrix} v_{y_1} f & -v_{x_1} f & v_{x_1} y_1 - x_1 v_{y_1} \\ \vdots & \vdots & \vdots \\ v_{y_N} f & -v_{x_N} f & v_{x_N} y_N - x_N v_{y_N} \end{pmatrix}$$
The problem is thus stated as :

$$\min_V \; (AV)^T(AV) \quad \text{subject to} \quad V^T V = k^2$$

This is a classic problem in optimization theory, and the solution is that the optimum value $\hat{V}$ is given by the eigenvector of $A^T A$ corresponding to the minimum eigenvalue. $A^T A$ is given by :
$$A^T A = \begin{pmatrix} f^2\sum\limits_{i=1}^{N} v_{y_i}^2 & -f^2\sum\limits_{i=1}^{N} v_{x_i}v_{y_i} & f\sum\limits_{i=1}^{N} v_{y_i}(v_{x_i}y_i - x_i v_{y_i}) \\ -f^2\sum\limits_{i=1}^{N} v_{x_i}v_{y_i} & f^2\sum\limits_{i=1}^{N} v_{x_i}^2 & -f\sum\limits_{i=1}^{N} v_{x_i}(v_{x_i}y_i - x_i v_{y_i}) \\ f\sum\limits_{i=1}^{N} v_{y_i}(v_{x_i}y_i - x_i v_{y_i}) & -f\sum\limits_{i=1}^{N} v_{x_i}(v_{x_i}y_i - x_i v_{y_i}) & \sum\limits_{i=1}^{N}(v_{x_i}y_i - x_i v_{y_i})^2 \end{pmatrix}$$
Once we have determined our estimate $\hat{V}$, we can then compute the depths $Z_i$, $i = 1..N$, of our scene points since, from our original optical flow equation :

$$\begin{pmatrix} v_x \\ v_y \end{pmatrix} = \frac{1}{Z}\begin{pmatrix} f & 0 & -x \\ 0 & f & -y \end{pmatrix}\begin{pmatrix} V_X \\ V_Y \\ V_Z \end{pmatrix}$$

$$Z v_x = fV_X - xV_Z \qquad Z v_y = fV_Y - yV_Z$$
We can compute the least-squares estimate of each $Z_i$ from the above 2 equations as :

$$\min_{Z_i} \left[ \left( Z_i v_{x_i} - f\hat{V}_X + x_i\hat{V}_Z \right)^2 + \left( Z_i v_{y_i} - f\hat{V}_Y + y_i\hat{V}_Z \right)^2 \right]$$

The solution is :

$$\hat{Z}_i = \frac{v_{x_i}(f\hat{V}_X - x_i\hat{V}_Z) + v_{y_i}(f\hat{V}_Y - y_i\hat{V}_Z)}{v_{x_i}^2 + v_{y_i}^2}$$
Finally, the scene co-ordinates ( X i ,Yi ) can be found by
applying the perspective projection equations.
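Putting the pieces together, a minimal numpy sketch of the whole procedure (build $A$, take the eigenvector of $A^TA$ with the smallest eigenvalue, then recover the depths and scene co-ordinates) might look like this; $V$ is recovered only up to scale and sign, as discussed above :

```python
import numpy as np

def structure_from_flow(xs, ys, vx, vy, f=1.0):
    """Estimate translational velocity (up to scale) and per-point scene
    structure from optical flow (vx, vy) measured at image points (xs, ys)."""
    # rows of A are (v_y*f, -v_x*f, v_x*y - x*v_y)
    A = np.column_stack([vy * f, -vx * f, vx * ys - xs * vy])
    # eigenvector of A^T A with the smallest eigenvalue (unit norm, k = 1)
    evals, evecs = np.linalg.eigh(A.T @ A)
    V = evecs[:, 0]
    VX, VY, VZ = V
    # least-squares depth at each point
    Z = (vx * (f * VX - xs * VZ) + vy * (f * VY - ys * VZ)) / (vx**2 + vy**2)
    # scene co-ordinates from the perspective projection equations
    X, Y = xs * Z / f, ys * Z / f
    return V, np.column_stack([X, Y, Z])
```

Because of the sign ambiguity in the eigenvector, if most of the recovered depths come out negative the signs of $\hat{V}$ and of the depths should be flipped together.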
Structure from stereo
Introduction
The problem of structure from stereo is to reconstruct the
co-ordinates of the 3D scene points from the disparity
measurements. For the simple case of the cameras in normal
position, we can derive a relationship between the scene
point depth and the disparity using simple geometry :
[Figure : stereo cameras in normal position, optical centres L and R separated by baseline $b$, focal length $f$; scene point $(X, Y, Z)$ projects to $x_L$ in the left image and $x_R$ in the right image]
Take $O$ to be the origin of the $(X, Y, Z)$ co-ordinate system. Using similar triangles :

$$\frac{X}{Z} = \frac{x_L}{f} \qquad \frac{X - b}{Z} = \frac{x_R}{f}$$

$$\frac{x_L}{f} - \frac{x_R}{f} = \frac{b}{Z} \qquad \Rightarrow \qquad Z = \frac{bf}{x_L - x_R}$$
$x_L - x_R$ is known as the disparity between the projections of the same point in the 2 images. By measuring the disparity of corresponding points (conjugate points), we can infer the depth of the scene point (given knowledge of the camera focal length $f$ and baseline $b$) and hence build up a depth map of the object surface.
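For example, with an illustrative baseline of 0.1 m and focal length of 700 pixels :

```python
b, f = 0.1, 700.0     # baseline (m) and focal length (pixels) - made-up values
d = 14.0              # measured disparity x_L - x_R (pixels)
Z = b * f / d         # depth Z = bf / (x_L - x_R) = 5.0 m
```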
We can also derive this relationship using simple vector algebra. Relative to the origin $O$ of our co-ordinate system, the left and right camera centres are at $(0, 0, 0)^T$ and $(b, 0, 0)^T$, where $b$ is the camera baseline. For some general scene point $(X, Y, Z)$ with left and right image projections $(x_L, y_L)$ and $(x_R, y_R)$ :

$$\begin{pmatrix} x_L \\ y_L \end{pmatrix} = \frac{f}{Z}\begin{pmatrix} X \\ Y \end{pmatrix} \qquad \begin{pmatrix} x_R \\ y_R \end{pmatrix} = \frac{f}{Z}\begin{pmatrix} X - b \\ Y \end{pmatrix}$$

$$x_L - x_R = \frac{f}{Z}\left(X - (X - b)\right) = \frac{fb}{Z} \qquad \Rightarrow \qquad Z = \frac{fb}{x_L - x_R}$$

Because the cameras are in normal position, the $y$ disparity $y_L - y_R$ is zero.
Once the scene point depth is known, the $X$ and $Y$ co-ordinates of the scene point can be found using the perspective projection equations.
When the cameras are in some general orientation with
respect to each other, the situation is a little more complex
and we have to use epipolar geometry in order to reconstruct
our scene points.
Epipolar geometry
Epipolar geometry reduces the 2D image search space for matching feature points to a 1D search space along the epipolar line. The epipolar line $e_R(p_L)$ in the right image plane is determined from the projection $p_L$ of a scene point $P$ in the left image plane. There is a corresponding epipolar line $e_L(p_R)$ determined from the projection $p_R$ of $P$ in the right image plane.
The key to understanding epipolar geometry is to recognise that $O_L$, $P$ and $O_R$ lie in a plane (the epipolar plane). The epipolar line $e_R(p_L)$ is the intersection of the epipolar plane with the right image plane. The point in the right image plane corresponding to $p_L$ in the left image plane can be the projection of any point along the ray $O_L P$, thus defining the epipolar line $e_R(p_L)$.
[Figure : epipolar geometry — scene point $P$, camera centres $O_L$ and $O_R$, projection $p_L$, and the epipolar lines $e_R(p_L)$ and $e_L(p_R)$]
Given a feature point with position vector $O_R p_R$ in the right-hand image plane, the epipolar line $e_L(p_R)$ in the left-hand image plane can easily be found. Any vector $\lambda O_R p_R$, for $\lambda \geq 1$, projects onto the epipolar line. With respect to the co-ordinate system centred on the left-hand camera, $\lambda O_R p_R$ becomes $R(\lambda O_R p_R) + O_L O_R$, where $R$ is a rotation matrix aligning the left and right camera co-ordinate systems. The rotation matrix and the baseline vector $O_L O_R$ can be determined using a camera calibration algorithm.
Thus for any value $\lambda \geq 1$, the corresponding position $(x, y)$ on the epipolar line can be found using perspective projection. By varying $\lambda$, the epipolar line is swept out. Of course, only two values of $\lambda$ are required to determine the line.
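A sketch of this construction, assuming the calibration has provided the rotation matrix `R` and the baseline vector `t` $= O_L O_R$ expressed in left-camera co-ordinates (the function name and interface are ours) :

```python
import numpy as np

def epipolar_line_points(pR, R, t, f, lambdas=(1.0, 2.0)):
    """Sweep out the epipolar line e_L(p_R) in the left image.
    pR : (x_R, y_R) feature point in the right image
    R  : rotation aligning the right camera frame with the left
    t  : baseline vector O_L O_R in left-camera co-ordinates
    f  : focal length. Two lambda values suffice to define the line."""
    ray = np.array([pR[0], pR[1], f])          # O_R p_R in the right-camera frame
    pts = []
    for lam in lambdas:
        P = R @ (lam * ray) + t                # candidate 3D point, left frame
        pts.append((f * P[0] / P[2], f * P[1] / P[2]))   # perspective projection
    return pts                                  # two points defining e_L(p_R)
```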
For cameras not in alignment, we can define a surface of zero disparity relative to the vergence angle $\theta$, which is the angle that the cameras' optical axes make with each other :
[Figure : zero-disparity surface for verging cameras; points on the surface have $d = 0$ ($x_L = x_R$), points outside it $d < 0$ and points inside it $d > 0$]
For angles $\theta' < \theta$, disparities are defined as negative (i.e. objects lie outside the zero disparity surface). For angles $\theta' > \theta$, disparities are defined as positive (i.e. objects lie inside the zero disparity surface).
A particularly simple case arises when the image planes are parallel and perpendicular to the optical axes; the epipolar lines are then just the image rows.
[Figure : parallel cameras — scene point $P$, camera centres $O_L$ and $O_R$, projections $p_L$ and $p_R$, and the epipolar lines $e_R(p_L)$ and $e_L(p_R)$ lying along the image rows]
Summary
We have looked at how we can interpret 2D information, specifically optical flow and disparity measurements, in order to determine 3D motion and structure information about the imaged scene.
We have described algorithms for :
• Optical flow estimation
• Feature corner point location estimation
• Correspondence matching for feature points
• 3D velocity and structure estimation from optical flow