Fast Stereo Vision for Mobile Robots by Global Minima
of Cost Functions
Roland Brockers, Marcus Hund, Bärbel Mertsching
GET Lab, University of Paderborn, Pohlweg 47-49, 33098 Paderborn, Germany
{brockers, hund, mertsching}@get.upb.de
Abstract
In this paper we introduce a novel stereo algorithm for
computing a disparity map from a stereo image pair by minimizing a global cost function. The approach consists of two
steps: First a "traditional" correlation-based similarity
measurement is performed, then a relaxation takes place to
eliminate possible ambiguities. The relaxation is formulated
as a cost-optimizing approach, taking into account both the
stereoscopic continuity constraint and considerations of the
pixel similarity. The special formulation guarantees the existence of a unique minimum of the cost function which can
be easily and rapidly found by standard numerical procedures. Results on real and synthetic images demonstrate the
operative potential of the approach.
1. Introduction and Motivation
Many stereo algorithms emphasize the quality of the resulting disparity maps while disregarding computation time. Efficient approaches with good accuracy have been developed in the past [2, 5, 6]. Unfortunately, good accuracy generally can only be achieved at additional computational expense, which lowers the efficiency of the algorithms. This is the crucial point when implementing them on a mobile system, which depends on fast 3D information acquisition because its surroundings change during movement.
In our approach we have developed a new stereo algorithm that fuses the advantages of a highly accurate relaxation approach with the acceleration of a special mathematical formulation based on minimizing a cost function. The algorithm works in two steps: First, a similarity measurement is computed which represents the probability of a valid disparity between two pixels in the stereo images. In a second step, an optimization via a cost function chooses the most plausible pixel pairs and assigns the resulting disparity map. The cost function approach reduces to solving a system of linear equations with a unique solution, which can easily be computed with the help of a fast standard numerical procedure.
0-7803-8387-7/04/$20.00 © 2004 IEEE
In the following we give an overview of the algorithm. After the similarity measurement is described in section 2, section 3 explains the cost-optimizing approach. Section 4 presents the post-processing, including the explicit occlusion detection, and section 5 depicts results showing the performance of the entire algorithm.
2. Similarity Measurement
The first part of the algorithm consists of computing a similarity measurement as a metric for the probability that two pixels in the two stereo images form a correspondence pair, i.e. that they are the two projections of a 3D scene point onto the camera planes. To compute a more general similarity measurement, the sum of absolute differences (SAD), combined with the image gradient and the amount of texture, is applied to a local neighborhood of the pixels concerned:
S_0(x, y, d) = E - \sum_u \sum_v w(x - u, y - v) \, (c_{S1} h + c_{S2} v + c_{S3} g + c_{S4} t)(u, v, d)   (1)
g(x, y, d) = |i_l(x + d, y) - i_r(x, y)|
h(x, y, d) = |f_{hl}(x + d, y) - f_{hr}(x, y)|
v(x, y, d) = |f_{vl}(x + d, y) - f_{vr}(x, y)|
t(x, y, d) = 255 - \min\{f_{hl}(x + d, y), f_{hr}(x, y)\}   (2)
with the grey level differences

f_{hl} = i_l(x + 1, y) - i_l(x, y),   f_{hr} = i_r(x + 1, y) - i_r(x, y)
f_{vl} = i_l(x, y + 1) - i_l(x, y),   f_{vr} = i_r(x, y + 1) - i_r(x, y)   (3)
The four functions h, v, g and t compute the SAD of the horizontal and vertical image gradient (h and v), the SAD of the image grey value (g), and the amount of texture (t) between the pixels (x, y) in the right camera image i_r and (x + d, y) in the left camera image i_l. They are windowed across a local neighborhood of the pixel (x, y) via a square window function w. The four constants c_{S1}, ..., c_{S4} allow an individual weighting of the similarity measurement, making it possible to shift the focus between the four functions to adapt to different kinds of input images. However, in most cases it is sufficient to compute the SAD of the grey value differences (g) to achieve good results with the subsequent relaxation process.
E is a positive constant which inverts the order of our measurement, ensuring that a high similarity coincides with higher values of S_0.
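As an illustrative sketch (not the authors' code), the grey-value term g of eq. (1), windowed into an initial match value, could look like the following; all names and the restriction to the g term are assumptions of this sketch:

```python
import numpy as np

def initial_match(il, ir, d, r=1, c_s3=1.0, E=255.0):
    """Sketch of S0(x, y, d) from eq. (1), using only the grey value term
    g(x, y, d) = |il(x+d, y) - ir(x, y)| summed over a square
    (2r+1) x (2r+1) window; E inverts the order so that high similarity
    yields high S0."""
    il = il.astype(float)
    ir = ir.astype(float)
    H, W = ir.shape
    S = np.full((H, W), -np.inf)      # border / out-of-range pixels stay invalid
    for y in range(r, H - r):
        for x in range(r, W - r - d):
            g = np.abs(il[y - r:y + r + 1, x + d - r:x + d + r + 1]
                       - ir[y - r:y + r + 1, x - r:x + r + 1])
            S[y, x] = E - c_s3 * g.sum()
    return S
```

For a left image that is the right image shifted by two pixels, the match value at d = 2 reaches E exactly (zero SAD), while other disparities score lower.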
As the result we get a "traditional" measurement, an initial match value S_0(x, y, d) indicating the probability of d being the correct 1D disparity of the point (x, y), assuming rectified, undistorted input images. Due to ambiguities occurring in the input images, it is not sufficient to declare the pixel pair with the highest match value as the resulting correspondence pair for the pixel (x, y). This ambiguity is resolved in a second step.
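A hypothetical toy example of such an ambiguity: on a periodic stripe pattern, the initial match value (here only the g term of eq. (1) with a one-pixel window, a simplification for illustration) reaches its maximum at two different disparities, so the best match alone cannot decide:

```python
import numpy as np

ir = np.tile([0.0, 255.0], 4)   # periodic stripe pattern, right image row
il = ir.copy()                  # left image row; the true disparity is 0
E = 255.0
x = 2
scores = [E - abs(il[x + d] - ir[x]) for d in range(4)]
# d = 0 and d = 2 reach the same maximum: winner-take-all is ambiguous here
assert scores[0] == scores[2] == max(scores)
```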
3. Global Optimization via Cost Functions
In contrast to approaches in the literature [7, 8, 9], the elimination of ambiguity is performed via a special relaxation procedure formulated in terms of a cost function, which allows stereoscopic constraints to be stated as individual cost terms. The great advantage lies in the fact that, with an appropriate definition of the cost terms, the search for the optimal solution is reduced to solving a system of linear equations, which can be done easily and, more importantly, rapidly with the help of a standard numerical procedure.
At first, each pixel (x, y) of the image is assigned a value k ∈ {1, ..., n}, with n ∈ ℕ being the number of pixels in the reference image. Then each element S(x, y, d) of the disparity space is assigned a value ξ(k, d), and the ξ(k, d) are arranged as a vector ξ:
\xi = (\xi(1, d_{min}), \ldots, \xi(n, d_{min}), \xi(1, d_{min} + 1), \ldots, \xi(n, d_{max}))^T   (4)
The arrangement of the variables as a vector later results in a simpler formulation of the gradient descent method used to minimize the cost function.
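The layout of eq. (4) can be made concrete with a small index helper (a hypothetical, 0-based illustration, not part of the paper):

```python
def vec_index(k, d, n, d_min):
    """Position of xi(k, d) inside the vector xi of eq. (4): all n pixels
    of disparity d_min come first, then those of d_min + 1, and so on."""
    return (d - d_min) * n + (k - 1)
```

For n = 4 pixels and d_min = 0, the first four slots hold disparity 0, the next four disparity 1, etc.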
Two assumptions are made about the cost function. First, costs arise if the elements ξ(k, d) of the disparity space differ from the initial values ξ_0(k, d) given by the similarity measurement. Second, there will be costs if the stereoscopic continuity constraint [10] is not fulfilled.
The first requirement leads to a cost term penalizing the distance of the parameter vector ξ from ξ_0:
P_1(\xi) = c_1 \sum_{d = d_{min}}^{d_{max}} \sum_{i = 1}^{n} (\xi(i, d) - \xi_0(i, d))^2   (5)
The distance is summed over all components of the disparity space and weighted by a positive constant c_1.
According to the stereoscopic continuity constraint, the disparity values of neighboring pixels in one image are expected to be piecewise smooth [10]. This leads to the formulation of a second cost term which generates costs if the disparity of a pixel diverges from those of its neighbors:
P_2(\xi) = c_2 \sum_{d = d_{min}}^{d_{max}} \sum_{i = 1}^{n} \sum_{j \in U_i} (\xi(i, d) - \xi(j, d))^2   (6)
c_2 again is a positive constant weight, while U_k represents the local support area of a given pixel k, defining the neighboring pixels of k.
U_k can be chosen as a fixed square window with k at its center (fig. 1); alternatively, it is defined depending on the amount of texture in the surroundings of the considered pixel. Starting from a square window, its boundaries are reduced when edge transitions, detected by a gradient threshold filter, occur in the neighborhood (fig. 2). This gives the entire process a discontinuity-preserving behavior and reduces the phenomenon, well known in relaxation processes, of disparity areas growing across disparity discontinuities due to fixed neighborhood coupling.
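One way to realize the reduced support area of fig. 2 can be sketched as follows (a sketch with hypothetical names; `edges` stands for the output of the gradient threshold filter mentioned above):

```python
import numpy as np

def support_mask(edges, y, x, r):
    """Discontinuity-preserving support area U_k (cf. fig. 2): a neighbor
    (v, u) belongs to U_k only if the axis-aligned rectangle spanned by
    (y, x) and (v, u) contains no detected edge pixel.  `edges` is a
    boolean edge map."""
    H, W = edges.shape
    mask = np.zeros((2 * r + 1, 2 * r + 1), dtype=bool)
    for v in range(max(0, y - r), min(H, y + r + 1)):
        for u in range(max(0, x - r), min(W, x + r + 1)):
            y0, y1 = sorted((y, v))
            x0, x1 = sorted((x, u))
            if not edges[y0:y1 + 1, x0:x1 + 1].any():
                mask[v - y + r, u - x + r] = True
    return mask
```

With a vertical edge just right of the pixel, the window is cut off at the edge while the untextured side remains fully coupled.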
Fig. 1: Fixed local support area U of two pixels a and b; the influence is symmetric: a ∈ U_b, b ∈ U_a

Fig. 2: a), c) If no edge transition occurs in the rectangle spanned by two pixels a and b, the pixels belong to the local support area of each other; b), d) otherwise they are excluded; e) reduced support area of pixel k

The complete cost function appears as the sum of the individual cost terms:

P(\xi) = c_1 \sum_{d = d_{min}}^{d_{max}} \sum_{i = 1}^{n} (\xi(i, d) - \xi_0(i, d))^2 + c_2 \sum_{d = d_{min}}^{d_{max}} \sum_{i = 1}^{n} \sum_{j \in U_i} (\xi(i, d) - \xi(j, d))^2   (7)

Due to the quadratic terms of the cost function, a minimum must exist which represents the optimal solution. This minimum has to satisfy ∇P(ξ) = 0, which leads to a system of linear equations

A \xi - \xi_0 = 0   (8)

Since the matrix A is symmetric and positive definite, its inverse A^{-1} must exist. Therefore equation (8) has a unique solution.

In our approach the minimum of equation (7) is computed numerically with the gradient descent method. Formulating ξ as a vector simplifies the resulting iteration rule to

\xi_{i+1} = \xi_i - \lambda \nabla P(\xi_i)   (9)

with a positive fixed increment λ > 0. It can be shown that the iteration converges very rapidly to the global minimum of the cost function, which represents the optimal solution.

4. Disparity Estimation, Explicit Occlusion Detection and Subpixel Accuracy

Once the minimum of the cost function is found, the valid disparity d(k) of a pixel k can be retrieved via a maximum search across the parameters belonging to k:

d(k) = j   with   \xi(k, j) = \max_{i \in \{d_{min}, \ldots, d_{max}\}} \xi(k, i)   (10)

The fact that the variable ξ(k, j) of the winning disparity of a pixel does not converge toward a fixed value makes it possible to set up an explicit occlusion detection. Keeping in mind that ξ(k, j) still represents a measurement of the probability of a pixel's disparity, we can, in cases of ambiguity, select the parameter with the highest value to obtain the most probable disparity: A maximum search across all pixels r in one image corresponding with an identical pixel x_l in the other image (fig. 3a) eliminates correspondences of pixels originating from occluded areas matched with regular pixels in the second view (11), whereas a search for correspondence pairs that aim at scene points lying in succession from the viewpoint of a cyclopean camera (fig. 3b) eliminates correspondences of occlusion areas in both images (12):

d(k) = \begin{cases} d(k), & \xi(k, d(k)) = \max\{\xi(r, d(r)) \mid x_l = r + d(r)\} \\ c, & \text{otherwise} \end{cases}   (11)

d(k) = \begin{cases} d(k), & \xi(k, d(k)) = \max\{\xi(r, d(r)) \mid x_c = r + d(r)/2\} \\ c, & \text{otherwise} \end{cases}   (12)

with c being a constant value outside of the disparity range, used to label occluded areas in the disparity map.
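The central claim of section 3, that the quadratic cost possesses a unique minimum which the gradient descent iteration (9) reaches, can be checked on a tiny 1-D instance (a sketch with hypothetical sizes, taking U_i = {i-1, i+1} and a single disparity layer):

```python
import numpy as np

c1, c2, n = 1.0, 1.0, 20
rng = np.random.default_rng(1)
xi0 = rng.random(n)                     # initial match values (cf. eq. (5))

# chain neighborhood U_i = {i-1, i+1}; L is the graph Laplacian, so the
# smoothness term of eq. (6) equals 2 * c2 * x^T L x
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(axis=1)) - A

def grad_P(x):
    # gradient of P(x) = c1*||x - xi0||^2 + c2 * sum_i sum_{j in U_i} (x_i - x_j)^2
    return 2 * c1 * (x - xi0) + 4 * c2 * (L @ x)

# gradient descent iteration of eq. (9) with a fixed increment lambda > 0
x = np.zeros(n)
lam = 0.05
for _ in range(2000):
    x = x - lam * grad_P(x)

# the unique minimum solves a symmetric positive definite linear system (cf. eq. (8))
x_star = np.linalg.solve(2 * c1 * np.eye(n) + 4 * c2 * L, 2 * c1 * xi0)
assert np.allclose(x, x_star, atol=1e-8)
```

The system matrix is symmetric and positive definite (its eigenvalues are bounded below by 2 c_1), which is what guarantees the unique solution and the convergence of the iteration for a sufficiently small λ.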
Fig. 3: a) Two image points r_p and r_q correspond with the same point x_l in the second view; b) correspondence pairs located in the black area of both images point to scene points located behind an object in the view of a cyclopean camera

Finally, the same cost approach can be used in a modified form to compute sub-pixel precise disparity values for each pixel (13), which in substance means applying the continuity constraint to the computed pixel-accurate disparity map. Before this can be done, a filter operation eliminates detected occlusion areas one pixel wide, which appear as a result of the pixel-precise resolution (fig. 4c). The detected pixels are filled with the average disparity value of their neighbors and can then participate in the following final relaxation process:

P(\tilde{\xi}) = c_3 \sum_{i = 1}^{m} (d(i) - d_0(i))^2 + c_4 \sum_{i = 1}^{m} \sum_{j \in \tilde{U}_i} (d(i) - d(j))^2   (13)

Similar to ξ, \tilde{\xi} is composed of the computed pixel disparity values:

\tilde{\xi} = (d(1), \ldots, d(n))^T   (14)

\tilde{U}_i contains neighboring pixels in a square surrounding with |d(i) - d(j)| < 1.3, again ensuring that disparity discontinuities are maintained.

Fig. 4: a), b) Left and right grey level image of the Pentagon; c) pixel precise and d) sub-pixel precise disparity map, light areas indicating higher disparity values, i.e. scene points nearer to the observer; e) upper section of the disparity map together with the contour information (white) used to reduce the local support area (c_{S1}=1; c_{S2}=1; c_{S3}=10; c_{S4}=1; c_1=1; c_2=1; c_3=1; c_4=0.8; max U_k = 9x9; max \tilde{U}_i = 5x5)

5. Experimental Results

The results for the Pentagon scene (fig. 4) show the effect of the discontinuity-preserving local support area. The sub-pixel disparity map (fig. 4d) is piecewise smooth while details on the roof are still visible. Since the right image is the reference image, most of the detected occlusions occur on the Pentagon's right side. Detected occlusions and regions left unmatched are marked black.

The algorithm was applied with the full power of the defined similarity measurement (see eq. 1). The local support area U_k initially had a maximum dimension of 9x9 pixels and was reduced by detected edge transitions (fig. 4e).
To evaluate the quality of our approach, the algorithm was tested with several test images that provide ground truth data of the observed scene. Figure 5 shows the images of the Sawtooth, Venus and Map scenes, first introduced by R. Szeliski and R. Zabih [4, 7], together with the corresponding ground truth maps and the computed sub-pixel precise disparity maps.
Fig. 5: Sawtooth, Venus and Map scenes: a), d), g) original image, left view; b), e), h) ground truth referring to the left view; c), f), i) computed sub-pixel precise disparity map (c_{S1}=0; c_{S2}=0; c_{S3}=8; c_{S4}=0; c_1=1; c_2=10.7; c_3=1; c_4=0.8; U_k = 3x3; max \tilde{U}_i = 5x5)
Table 1: Percentage of "bad" pixels whose disparity error is greater than one; values marked * acquired from http://www.middlebury.edu/stereo/

                          Sawtooth      Venus         Map
Graph Cut [2]             1.30*         1.79*         0.31*
Dynamic Programming [1]   4.84*         10.10*        3.33*
Bayesian Diffusion [5]    1.45*         4.00*         0.20*
Cost Function             2.09          2.96          0.20
                          (0.269 RMS)   (0.255 RMS)   (0.304 RMS)
Due to the high amount of texture, the scenes were computed with a minimal 3x3 local support area and a reduced similarity measurement. The set of parameters was chosen to obtain the best error behavior for all three images. In figure 5f some mismatches can be observed in image regions without texture. Raising the coupling factor c_2 removes this effect.
Table 1 shows the quantitative results of the cost function approach compared with some other selected stereo algorithms. The comparison shows that algorithms like the Graph Cut algorithm introduced by Kolmogorov and Zabih [2] achieve better results while requiring much more computation time, whereas fast approaches like the Dynamic Programming algorithm [1] cannot keep up with the quality of the results.
On an AMD 1800XP processor, the computation time of our approach for the sub-pixel refined disparity map of the Map scene was 53 seconds with a disparity range of 61 disparities, 80 iterations and a 217 x 209 image size.
For a 100x100 image size, 10 iterations and 20 disparity levels, the algorithm yields reasonable results at about 3 fps on the same machine.
6. References
[1] Bobick, A.F.; Intille, S.S.: "Large occlusion stereo", IJCV 33(3), 1999, pp. 181-200
[2] Kolmogorov, V.; Zabih, R.: "Visual correspondence with occlusions using graph cuts", ICCV 2001, 2001, pp. 508-515
[3] Lin, M.; Tomasi, C.: "Surfaces with occlusions from layered stereo", CVPR 2003 (I), 2003, pp. 710-717
[4] Scharstein, D.; Szeliski, R.: "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms", IJCV 47(1/2/3), 2002, pp. 7-42
[5] Scharstein, D.; Szeliski, R.: "Stereo matching with nonlinear diffusion", IJCV 28(2), 1998, pp. 155-174
[6] Sun, J.; Shum, H.-Y.; Zheng, N.-N.: "Stereo matching using belief propagation", ECCV 2002 (2), Copenhagen, Denmark, 2002, pp. 510-524
[7] Szeliski, R.; Zabih, R.: "An experimental comparison of stereo algorithms", International Workshop on Vision Algorithms, Kerkyra, Greece, 1999, pp. 1-19
[8] Trapp, R.; Drüe, S.; Hartmann, G.: "Stereo matching with implicit detection of occlusions", ECCV 1998 (2), 1998, pp. 17-33
[9] Zitnick, C.L.; Kanade, T.: "A volumetric iterative approach to stereo matching and occlusion detection", Technical Report CMU-RI-TR-98-30, Carnegie Mellon University, Pittsburgh, PA, 1998
[10] Marr, D.; Poggio, T.: "Cooperative computation of stereo disparity", Science, vol. 194, 1976, pp. 283-287
[11] Murray, D.; Little, J.: "Using real-time stereo vision for mobile robot navigation", Autonomous Robots, vol. 8, no. 2, 2000, pp. 161-171