Joint Non-rigid Motion Estimation and Segmentation

Boris Flach¹ and Radim Sara²

¹ Dresden University of Technology
² Czech Technical University Prague
Abstract. Object segmentation and motion estimation are usually considered (and modelled) as separate tasks. For motion estimation this leads to problems arising especially at the boundary of an object moving in front of another, e.g. if prior assumptions about continuity of the motion field are made. We therefore expect that a good segmentation will improve the motion estimation and vice versa. To demonstrate this, we consider the simple task of joint segmentation and motion estimation of an arbitrary (non-rigid) object moving in front of a still background. We propose a statistical model which represents the moving object as a triangular (hexagonal) mesh of pairs of corresponding points, and we introduce a provably correct iterative scheme which simultaneously finds the optimal segmentation and the corresponding motion field.
1 Introduction
Even though motion estimation is a thoroughly investigated problem of image processing, which has attracted attention for decades, we must admit that at least a handful of crucial open problems remain on the agenda. One of them is the treatment of object boundaries and occlusions: imagine e.g. an object moving in front of a background. If motion estimation is considered as a separate task, then usually some continuity or smoothness prior assumptions for the motion field are modelled [1, 2], which regularise and thus improve the result almost everywhere, but not at boundaries and partial occlusions. Thus we can expect that a good segmentation will improve the motion estimation and vice versa. In case of strict a-priori knowledge – e.g. rigid body motion – this segmentation can be modelled directly in terms of admissible motion fields [3]. In more general situations, such as non-rigid motion, this is not possible. To overcome this problem, we propose a joint motion estimation and segmentation model. To start with, we consider this task for the simple situation of an arbitrary (non-rigid) object moving in front of a still background.
We propose a statistical model which represents both the moving foreground object and the background in terms of a labelling of the vertices of a triangular (hexagonal) mesh. The state (label) of each vertex is a segment label and a displacement vector. This allows us to incorporate prior assumptions about the object's shape and the motion field as either hard or statistical restrictions. For instance, to avoid
R. Klette and J. Žunić (Eds.): IWCIA 2004, LNCS 3322, pp. 631–638, 2004.
© Springer-Verlag Berlin Heidelberg 2004
motion fields which do not preserve the topology of the object (given by the segmentation), we require coherent orientations for the displaced elementary triangles of the lattice in the foreground segment. Image similarity as well as consistency with colour models for the segments are modelled statistically. Consequently, we obtain a statistical model for the (hidden) state field and the images, which in our case is a Gibbs probability distribution of higher order. This allows us to pose joint segmentation and motion estimation as a Bayes task with respect to a certain loss function.
2 The Model
Consider a non-rigid object moving against a stationary background. The object may have holes but must not self-occlude during the motion. Given two images, the task is to segment the moving object and to estimate the motion field.
Let R be a set of vertices associated with some subset of image pixels chosen in a regular way. We consider a hexagonal lattice on these vertices (see Fig. 1). Its edges are denoted by e ∈ E, where E1 denotes the subset of edges forming the rectangular sub-lattice. The elementary triangles of the hexagonal lattice are denoted by t ∈ T. Each vertex r ∈ R has a compound label (x(r), v(r)), where x(r) ∈ {0, 1} is a segment label and v(r) ∈ V is an integer-valued displacement vector. A complete labelling is thus a pair of mappings x : R → {0, 1}, v : R → V and defines a simultaneous segmentation and motion field.
Fig. 1. Hexagonal lattice and segmentation
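For concreteness, such a lattice can be built from a regular grid of vertices by adding one diagonal per grid cell, so that each cell splits into two elementary triangles. The following sketch (with a hypothetical `build_lattice` helper; the paper does not prescribe this construction) illustrates one way to obtain R, E1 and T:

```python
def build_lattice(width, height):
    """Vertices on a regular grid; E1 holds the rectangular edges; one diagonal
    per grid cell splits it into two elementary triangles of the lattice."""
    R = [(i, j) for j in range(height) for i in range(width)]
    E1, T = [], []
    for i in range(width):
        for j in range(height):
            if i + 1 < width:
                E1.append(((i, j), (i + 1, j)))      # horizontal edge
            if j + 1 < height:
                E1.append(((i, j), (i, j + 1)))      # vertical edge
            if i + 1 < width and j + 1 < height:
                # the diagonal yields two triangles per cell
                T.append(((i, j), (i + 1, j), (i + 1, j + 1)))
                T.append(((i, j), (i, j + 1), (i + 1, j + 1)))
    return R, E1, T
```

With the diagonals added, each interior vertex has six neighbours, which is what makes the lattice hexagonal.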
We consider a prior statistical model for such labellings which favours compact foreground segments and continuous displacement fields and preserves the topology of the foreground segment:

p(x, v) = \frac{1}{Z} \exp\Big[ -\beta \sum_{(r,r') \in E_1} \big(x(r) - x(r')\big)^2 - \sum_{t \in T} H_t\big(x(t), v(t)\big) - \sum_{e \in E_1} H_c\big(x(e), v(e)\big) - \sum_{r \in R} H_b\big(x(r), v(r)\big) \Big]   (1)
The first sum is over all edges in E1 and represents a Potts model for segmentation. The second sum is over all elementary triangles of the hexagonal lattice.
The function H_t is infinity if all three vertices r ∈ t are labelled as foreground and their displacements reverse the orientation of that triangle; it is zero in all other cases. Hence, this term zeroes the probability of displacement fields which do not preserve the topology of the foreground segment. The third sum is over all edges in E_1. The function H_c is infinity if both vertices are labelled as foreground and their displacement vectors differ by more than a predefined value; it is zero in all other cases. Hence, this term zeroes the probability of non-continuous displacement fields for the foreground segment. The last sum is over all vertices, and H_b is infinity if a vertex marked as background has a nonzero displacement vector and is zero otherwise. Hence, this term reflects our assumption of a still background. Putting it all together, we obtain a third-order Gibbs probability distribution, i.e. a Markov random field.
Our measurement model is as follows. Let y, ỹ be two feature fields taken
from two consecutive images of the scene i.e. y, ỹ : R → F , where F is the
set of feature values (which might be a set of colours in the simplest case). To
express the conditional probability of obtaining y and ỹ given segmentation x and
displacement field v, we introduce the following subsets of vertices. Let S(x) ⊆ R
denote the subset of vertices labelled as foreground, S(x) = {r ∈ R | x(r) = 1}.
All vertices labelled as background are divided into two sets: O(x) represents
those which are occluded by foreground in the second image and B(x) are those
which are visible in both images:
O(x) = \{ r \in R \mid x(r) = 0, \; \exists\, r' : x(r') = 1, \; r = r' + v(r') \},   (2)

B(x) = R \setminus \big( S(x) \cup O(x) \big).   (3)
Using these sets, the conditional probability is as simple as

p(y, ỹ \mid x, v) = \exp\Big[ \sum_{r \in S(x)} q_f\big(y(r), ỹ(r + v(r))\big) + \sum_{r \in B(x)} q_b\big(y(r), ỹ(r)\big) + \sum_{r \in O(x)} \bar q_b\big(y(r)\big) \Big],   (4)
where q_f(f, f') and q_b(f, f') are the log-likelihoods of observing feature values f and f' at corresponding image points in the foreground and background, respectively, and where \bar q_b(f) is the log-likelihood of observing the feature value f at a background image point. These probabilities can be easily estimated from foreground and background feature distributions if a simple independent noise model is assumed for the camera. It is worth noting that the second and third sums in (4) are non-local: the occlusion status of a vertex r depends on the states of all vertices which might occlude r.
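For a given labelling, the sets (2), (3) and the log of the conditional probability (4) can be evaluated directly. The sketch below assumes vertices are coordinate tuples and that the log-likelihoods q_f, q_b and \bar q_b are supplied as functions (hypothetical interfaces, not the authors' code):

```python
def vertex_sets(x, v):
    """Split vertices into foreground S, occluded background O,
    and background B visible in both images -- equations (2), (3)."""
    S = {r for r in x if x[r] == 1}
    # a background vertex is occluded if some foreground vertex moves onto it
    targets = {(r[0] + v[r][0], r[1] + v[r][1]) for r in S}
    O = {r for r in x if x[r] == 0 and r in targets}
    B = set(x) - S - O
    return S, O, B

def log_likelihood(x, v, y, y2, qf, qb, qb_bar):
    """Log of the conditional probability (4) of feature fields y, y2
    given segmentation x and displacement field v."""
    S, O, B = vertex_sets(x, v)
    ll = 0.0
    for r in S:   # foreground: compare with the displaced point in image 2
        r2 = (r[0] + v[r][0], r[1] + v[r][1])
        ll += qf(y[r], y2[r2])
    for r in B:   # visible background: compare the same point in both images
        ll += qb(y[r], y2[r])
    for r in O:   # occluded background: only the first image is observed
        ll += qb_bar(y[r])
    return ll
```

The non-locality noted above shows up in `vertex_sets`: whether r lands in O depends on the displacements of all foreground vertices, not just on the state at r.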
Having a complete statistical model for quadruples (y, ỹ, x, v), we formulate the recognition task as a Bayes decision with respect to the following loss function, which compares the true labelling (x, v) with a decision (x', v'):

C\big((x, v), (x', v')\big) = \sum_{r} \Big[ \mu_1 \, 1\{x(r) \ne x'(r)\} + \mu_2 \, \|v(r) - v'(r)\|^2 \Big],   (5)
which is locally additive. Each local term penalises wrong decisions with respect
to segmentation and displacement vectors, respectively. Minimising the average
loss (i.e. the risk) gives the following Bayes decision [6]
x^*(r) = \arg\max_k \; p_r\big(x(r) = k \mid y, ỹ\big),   (6)

v^*(r) = \sum_{v} v \cdot p_r\big(v(r) = v \mid y, ỹ\big),   (7)

where p_r(x(r) \mid y, ỹ) and p_r(v(r) \mid y, ỹ) are the marginal posterior probabilities of the segment label and the displacement vector, respectively:

p_r\big(x(r) = k \mid y, ỹ\big) = \sum_{x \,:\, x(r) = k} \; \sum_{v} p(x, v \mid y, ỹ),   (8)
and similarly for p_r(v(r) = v | y, ỹ). Hence, we need these marginal probabilities for the Bayes decision. Note that (7) gives non-integer decisions for the displacement vectors, even though the states are integer-valued vectors.
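Given estimates of the marginal posteriors at a vertex, the decisions (6) and (7) reduce to a maximisation and an expectation. A minimal sketch, assuming the marginals are given as dictionaries (a hypothetical representation):

```python
def decide(marg_x, marg_v):
    """Bayes decisions (6), (7) from estimated marginal posteriors at one vertex.

    marg_x: dict mapping segment label k to p_r(x(r) = k | y, y2)
    marg_v: dict mapping 2-D displacement vector v to p_r(v(r) = v | y, y2)
    """
    # (6): maximum-posterior-marginal decision for the segment label
    x_star = max(marg_x, key=marg_x.get)
    # (7): posterior mean of the displacement -- in general not integer-valued
    v_star = tuple(sum(p * vec[i] for vec, p in marg_v.items())
                   for i in range(2))
    return x_star, v_star
```

The posterior-mean form of (7) is exactly why the decided displacement need not be an integer vector even though every state is.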
We do not know how to perform the sums in (7) and (8) over all state
fields explicitly and in polynomial time. Nevertheless, it is possible to estimate
the needed probabilities using a Gibbs sampler [5]. In one sampling step, one
chooses a vertex r, fixes the states in all other vertices and randomly generates a new state according to its posterior conditional probability p(x(r), v(r) |
x(R \ {r}), v(R \ {r}), y, ỹ) given the fixed states x(R \ {r}), v(R \ {r}) in all
other vertices. According to [5] the relative state frequencies observed during the
sampling process converge to the needed marginal probabilities.
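This sampling scheme can be sketched as follows. The `local_posterior` function, which returns the conditional weights of the admissible states at a vertex given all other states and the images, is a hypothetical ingredient supplied by the model; the sampler itself only needs to draw from it and count relative frequencies:

```python
import random
from collections import Counter

def gibbs_marginals(vertices, states, local_posterior, init, sweeps, burn_in,
                    rng=random):
    """Estimate marginal posteriors by Gibbs sampling [5].

    states: list of admissible compound states (x(r), v(r)) at a vertex.
    local_posterior(r, field): unnormalised probability weights of `states`
        at vertex r given the current field at all other vertices.
    """
    field = dict(init)
    counts = {r: Counter() for r in vertices}
    for sweep in range(sweeps):
        for r in vertices:
            # resample the state at r from its conditional distribution
            weights = local_posterior(r, field)
            field[r] = rng.choices(states, weights=weights)[0]
        if sweep >= burn_in:   # discard early sweeps before counting
            for r in vertices:
                counts[r][field[r]] += 1
    total = sweeps - burn_in
    # relative state frequencies converge to the marginal posteriors
    return {r: {s: c / total for s, c in counts[r].items()} for r in vertices}
```

In practice the conditional weights at r involve only the cliques containing r, plus the non-local occlusion terms of (4), so each sampling step remains cheap.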
3 Experiments
In this section we show results on three image pairs: rect, hand and ball; see Figs. 2, 3 and 4, respectively. The rect images are synthetic and have size 100 × 100 pixels. The hand images are 580 × 500 pixels and the ball images 253 × 256 pixels; both of these pairs are JPEG images shot with a standard compact digital camera.
The motion in the rect pair is a combination of a translation and a projective transform. In this dataset the segmentation alone is not simple, but combined with motion estimation the result is fairly good.
The motion in the hand pair is almost uniform, directed towards the lower-left image corner. The motion in the ball pair is rather non-uniform: although the dominant motion is a translation, again towards the lower-left corner of the image, the additional components are an in-plane counter-clockwise rotation due to wrist rotation and an out-of-plane rotation due to closing the hand towards the forearm. The ball moves rigidly but the wrist does not.
In the first natural dataset, the segmentation task itself is relatively easy, based on the foreground–background colour difference, but the motion is hard to estimate due to the lack of sufficiently strong features in the skin region in
sub-quantised low-resolution images. In the second dataset, however, the motion estimation task is easier thanks to the rich colour structure (up to highlights and uniform-colour patches), but segmentation based on colour alone would be more difficult.
The hand images are re-quantised to 32 levels per colour channel and the ball images to 16 levels. The image features f were the re-quantised RGB triples. The log-likelihoods q_f, q_b and \bar q_b were estimated from the re-quantised images based on a rough manual pre-segmentation (although their automatic estimation from data is possible, see e.g. [4], it is not the focus of the present paper).
The spacing of the regular lattice was four pixels in the hand pair and three
pixels in the ball pair. In the hand pair, the expected motion range was −24±12
pixels in the horizontal direction and ±12 pixels in the vertical direction; in the
ball pair, the corresponding ranges were −18 ± 6 pixels and 11 ± 4 pixels,
Fig. 2. Results on the rect pair. Top row: input images. Bottom row: segmentation in
the first frame and the deformed triangular mesh in the second frame, both overlaid
over the corresponding images
respectively. The initial estimate of the field x was based on local (vertex-wise) decisions using the log-likelihoods q_f and q_b. The initial estimate of the motion field v was zero.
Results for the hand pair are shown in Fig. 3. The bottom left overlay shows those vertices of the hexagonal lattice that are labelled as foreground (in red). The bottom right overlay shows the lattice after displacement by the estimated motion field. The moving boundary at the lower edge of the wrist is captured, although it is fuzzy because of body hair seen against the background.
Results for the ball pair are shown in Fig. 4. We used a stronger β for
the Potts model compared to the hand pair, due to the more complex colour
structure. Again, the bottom left overlay shows in red those vertices of the
hexagonal lattice that are labelled as foreground. The bottom right overlay shows
the vertices displaced by the motion field as blue dots, together with the residual motion field after subtracting the mean motion vector v̄ = (−20.5, 11.0), which reveals the other modes of the motion.
Fig. 3. Results on the hand pair. Top row: input images and their overlay. Bottom row:
segmentation in the first frame and the deformed triangular mesh in the second frame,
both overlaid over the corresponding images
Fig. 4. Results on the ball pair. Top row: input images and their overlay. Bottom row:
segmentation in the first frame and the residual motion field after subtracting mean
motion vector
4 Conclusions
In this paper we presented preliminary results on a simplified version of joint segmentation and motion estimation. Though these results seem promising, some open problems remain:
1. The segmentation model is too simple.
2. The model is not symmetric with respect to time reversal.
3. The topology preservation condition does not enforce the boundary to be a Jordan curve.
4. Self-occlusions of the foreground are not allowed.
We believe these can be addressed in future work.
References
1. Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via
graph cuts. In Proc. of the 7. Intl. Conf. on Computer Vision, volume 1, pages
377–384, 1999.
2. Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In T. Pajdla and J. Matas, editors, Computer Vision, volume 3024 of Lecture Notes in Computer Science, pages 25–36. Springer, 2004. Proc. of ECCV 2004.
3. Daniel Cremers and Christoph Schnörr. Motion competition: Variational integration
of motion segmentation and shape recognition. In Luc van Gool, editor, Pattern
Recognition, volume 2449 of Lecture Notes in Computer Science, pages 472–480.
Springer, 2002. Proc. of DAGM2002.
4. Boris Flach, Eeri Kask, Dmitrij Schlesinger, and Andriy Skulish. Unifying registration and segmentation for multi-sensor images. In Luc Van Gool, editor, Pattern
Recognition, volume 2449 of Lecture Notes in Computer Science, pages 190–197.
Springer Verlag, 2002.
5. Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.
6. Michail I. Schlesinger and Václav Hlaváč. Ten Lectures on Statistical and Structural Pattern Recognition, volume 24 of Computational Imaging and Vision. Kluwer Academic Publishers, 2002.