Int J Comput Vis
DOI 10.1007/s11263-009-0267-4
Quasi-perspective Projection Model: Theory and Application
to Structure and Motion Factorization from Uncalibrated Image
Sequences
Guanghui Wang · Q.M. Jonathan Wu
Received: 1 December 2008 / Accepted: 29 June 2009
© Springer Science+Business Media, LLC 2009
Abstract This paper addresses the problem of factorization-based 3D reconstruction from uncalibrated image sequences. Previous studies on structure and motion factorization are based either on the simplified affine assumption or on general perspective projection. The affine approximation is widely adopted due to its simplicity, whereas the extension to the perspective model suffers from recovering projective depths. To fill the gap between the simplicity of affine and the accuracy of the perspective model, we propose a quasi-perspective projection model for structure and motion recovery of rigid and nonrigid objects within a factorization framework. The novelty and contribution of this paper are as follows. First, under the assumption that the camera is far away from the object with small lateral rotations, we prove that the imaging process can be modeled by quasi-perspective projection, which is more accurate than the affine model according to both geometrical error analysis and experimental studies. Second, we apply the model to establish a framework of rigid and nonrigid factorization under the quasi-perspective assumption. Finally, we propose an Extended Cholesky Decomposition to recover the rotation part of the Euclidean upgrading matrix. We also prove that the last column of the upgrading matrix corresponds to a global scale and translation of the camera and thus may be set freely. The proposed method is validated and evaluated extensively on synthetic and real image sequences, and improved results over existing schemes are observed.

The work is supported in part by the Natural Sciences and Engineering Research Council of Canada, and the National Natural Science Foundation of China under Grant No. 60575015.

Electronic supplementary material The online version of this article (http://dx.doi.org/10.1007/s11263-009-0267-4) contains supplementary material, which is available to authorized users.

G. Wang (✉) · Q.M.J. Wu
Department of Electrical and Computer Engineering, University of Windsor, 401 Sunset, Windsor, N9B 3P4, Ontario, Canada
e-mail: [email protected]

Q.M.J. Wu
e-mail: [email protected]

G. Wang
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China
Keywords Structure from motion · Computational models
of vision · Quasi-perspective projection · Imaging
geometry · Matrix factorization · Singular value
decomposition · Euclidean reconstruction
1 Introduction
The problem of structure and motion recovery from image sequences is an important theme in computer vision.
Great progress has been made for different applications
during the last two decades (Hartley and Zisserman 2004).
Among these methods, the factorization-based approach is widely studied for its robustness and accuracy, since it deals uniformly with the data sets of all images (Poelman and Kanade 1997; Quan 1996; Tomasi and Kanade 1992; Triggs 1996). The factorization algorithm was first proposed by Tomasi and Kanade (1992) in the early 1990s. The main
idea of this algorithm is to factorize the tracking matrix into
motion and structure matrices simultaneously by singular
value decomposition (SVD) with low-rank approximation.
The algorithm assumes an orthographic projection model. It
was extended to weak perspective and paraperspective projection by Poelman and Kanade (1997). The orthographic,
weak perspective, and paraperspective projections can be
generalized as affine camera model.
More generally, Christy and Horaud (1996) extended the
above methods to a perspective camera model by incrementally performing the factorization under affine assumption.
The method is an affine approximation to full perspective
projection. Triggs (1996) and Sturm and Triggs (1996) proposed a full projective reconstruction method via rank-4 factorization of a scaled tracking matrix with projective depths
recovered from pairwise epipolar geometry. The method
was further studied in Han and Kanade (2000), Heyden et al.
(1999), Mahamud and Hebert (2000), where different iterative schemes were proposed to recover the projective depths
through minimizing reprojection errors. Recently, Oliensis
and Hartley (2007) provided a complete theoretical convergence analysis for the iterative extensions. Unfortunately, no
iteration has been shown to converge sensibly, and they proposed a simple extension, called CIESTA, to give a reliable
initialization to other algorithms.
The above methods work only for rigid objects and static scenes. In the real world, however, many scenarios are nonrigid or dynamic, such as articulated motion, human faces carrying different expressions, lip movements, hand gestures, and moving vehicles. In order to deal with such situations,
many extensions stemming from the factorization algorithm
were proposed to relax the rigidity constraint. Costeira and
Kanade (1998) first discussed how to recover the motion and
shape of several independent moving objects via factorization using orthographic projection. Bascle and Blake (1998)
proposed a method for factorizing facial expressions and
poses based on a set of preselected basis images. Recently,
Li et al. (2007) proposed to segment multiple rigid-body motions from point correspondences via subspace separation.
Yan and Pollefeys (2005, 2008) proposed a factorizationbased approach to recover the structure and kinematic chain
of articulated objects.
In the pioneering work of Bregler et al. (2000), it was demonstrated that the 3D shape of a nonrigid object can be expressed as a weighted linear combination of a set of shape
bases. Then the shape bases and camera motions are factorized simultaneously for all time instants under the rank
constraint of the tracking matrix. Following this idea, the
method was extensively investigated and developed by many
researchers, such as Brand (2001, 2005), Del Bue et al.
(2006, 2004), Torresani et al. (2008, 2001), and Xiao et al.
(2006), and Xiao and Kanade (2005). Recently, Rabaud and Belongie (2008) relaxed Bregler's assumption (2000) by
assuming that only small neighborhoods of shapes are wellmodeled by a linear subspace, and proposed a novel approach to solve the problem by adopting a manifold-learning
framework.
Most nonrigid factorization methods are based on affine
camera model due to its simplicity. It was extended to perspective projection in Xiao and Kanade (2005) by iteratively
recovering the projective depths. The perspective factorization is more complicated and does not guarantee its convergence to the correct depths, especially for nonrigid scenarios (Hartley and Zisserman 2004). Vidal and Abretske
(2006) showed that the constraints among multiple views
of a nonrigid shape consisting of k shape bases can be reduced to multilinear constraints. They presented a closed
form solution to the reconstruction of a nonrigid shape consisting of two shape bases. Hartley and Vidal (2008) proposed a closed form solution to the nonrigid shape and motion with calibrated cameras or fixed intrinsic parameters.
Since the factorization is only defined up to a nonsingular
transformation matrix, many researchers adopt the metric
constraints to recover the matrix and upgrade the factorization to the Euclidean space (Brand 2001; Bregler et al. 2000;
Del Bue et al. 2004; Torresani et al. 2001). However, the rotation constraint may cause ambiguity in the combination of
shape bases. Xiao et al. (2006) proposed a basis constraint
to solve the ambiguity and provided a closed-form solution.
The essence of the factorization algorithm is to find a
low-rank approximation of the tracking matrix. Most algorithms adopt SVD to compute the approximation. Alternatively, Hartley and Schaffalizky (2003) proposed to use
power factorization (PF) to find the low-rank approximation, which can handle missing data in a tracking matrix. It
was extended to nonrigid factorization in both metric space
(Wang et al. 2008) and affine space (Wang and Wu 2008a).
Vidal et al. (2008) proposed to combine the PF algorithm
for motion segmentation. There are some other nonlinear
based studies to deal with incomplete tracking matrix with
some entries unavailable, such as Damped Newton method
(Buchanan and Fitzgibbon 2005) and Levenberg-Marquardt
based method (Chen 2008). Torresani et al. (2008) proposed a Probabilistic Principal Components Analysis algorithm to estimate the 3D shape and motion with missing
data. Camera calibration is an indispensable step in retrieving 3D metric information from 2D images. Many self-calibration algorithms were proposed to calibrate fixed camera parameters (Maybank and Faugeras 1992; Hartley 1997;
Luong and Faugeras 1997), varying parameters (Heyden and
Åström 1997; Pollefeys et al. 1999), and affine camera models (Quan 1996).
Previous studies on factorization are either based on
affine camera model or perspective projection. The affine assumption is widely adopted due to its simplicity although it
is just an approximation to the real imaging process. The extension to the perspective model, in contrast, suffers from recovery of the projective depths, which is computationally intensive and offers no convergence guarantee. In this paper, we try to
make a trade-off between the simplicity of affine and accuracy of full perspective projection and propose a novel
framework for the problem. Assuming that the camera is far
away from the object with small lateral rotations, which is
similar to affine assumption and is easily satisfied in practice, we propose a quasi-perspective projection model and
give an error analysis of different projection models. The
model is proved to be more accurate than affine approximation since the projective depths are implicitly embedded in
the shape matrix, but its computational complexity is similar
to affine. We apply this model to the factorization algorithm
and establish a framework of rigid and nonrigid factorization
under quasi-perspective projection. We elaborate the computational details on recovery of the Euclidean upgrading
matrix. To the best of our knowledge, there is no similar report in the literature. The idea was first proposed at CVPR 2008
(Wang and Wu 2008b) and we will present more theoretical
analysis and experimental evaluations in the paper.
The remainder of the paper is organized as follows.
The definition and background on the factorization algorithm is given in Sect. 2. The proposed quasi-perspective
model and error analysis are elaborated in Sect. 3. The application to rigid factorization under the proposed model is
detailed in Sect. 4. The quasi-perspective nonrigid factorization is presented in Sect. 5. Extensive experimental evaluations on synthetic data are given in Sect. 6. Some test results
on real image sequences are reported in Sect. 7. Finally, the
concluding remarks are presented in Sect. 8.
2 Background on Factorization
2.1 Problem Definition
Under perspective projection, a 3D point Xj is projected
onto an image point xij in frame i according to equation
λij xij = Pi Xj = Ki [Ri , Ti ]Xj
(1)
where λij is a non-zero scale factor, commonly called projective depth; xij = [uij , vij , 1]T and Xj = [xj , yj , zj , 1]T
are expressed in homogeneous form; Pi is the projection matrix of the i-th frame; Ri and Ti are the corresponding rotation matrix and translation vector of the camera with respect
to the world system; Ki is the camera calibration matrix in
form of
K_i = \begin{bmatrix} f_i & \varsigma_i & u_{0i} \\ 0 & \kappa_i f_i & v_{0i} \\ 0 & 0 & 1 \end{bmatrix}   (2)
where fi represents the camera’s focal length; [u0i , v0i ]T is
the coordinates of the camera’s principal point; ςi refers to
the skew factor; κi is called aspect ratio of the camera. For
some precise industrial CCD cameras, we may assume zero
skew, known principal point, and unit aspect ratio, i.e., ςi =
0, u0i = v0i = 0, and κi = 1. Then the camera is simplified
to have only one intrinsic parameter.
When the distance of an object from a camera is much
greater than the depth variation of the object, we may assume affine camera model. Under affine assumption, the last
row of the projection matrix is of the form P_{3i}^T ∼ [0, 0, 0, 1], where '∼' denotes equality up to scale. Then the projection process (1) can be simplified by removing the scale factor λ_ij:
x̄ij = Ai X̄j + T̄i
(3)
where, Ai ∈ R2×3 is composed by the upper-left 2 × 3 submatrix of Pi ; x̄ij = [uij , vij ]T and X̄j = [xj , yj , zj ]T are
the non-homogeneous form of xij and Xj , respectively; T̄i
is the corresponding translation vector, which is actually the
image of world origin. Under affine projection, it is easy to
verify that the centroid of a set of space points is projected to
the centroid of their images. Therefore, the translation term
vanishes if all the image points in each frame are registered
to the corresponding centroid, and the projection is simplified to the form
x̄ij = Ai X̄j
(4)
The problem of structure from motion is defined as follows: given n tracked feature points of an object across a sequence of m frames, {x_ij | i = 1, …, m, j = 1, …, n}, we want to recover the structure {X_ij | i = 1, …, m, j = 1, …, n} and the motion {R_i, T_i} of the object. The factorization-based algorithm has proved to be an effective method for this problem. As shown in Table 1, the algorithms can generally be classified into the following four categories according to the camera assumption and object property: (i) rigid object under affine assumption; (ii) rigid object under perspective projection; (iii) nonrigid object under affine assumption; (iv) nonrigid object under perspective projection. In Table 1, 'Quasi-Persp' stands for the quasi-perspective projection
model to be discussed in the paper. The meaning of symbols
W, M, S, B, and H in the table is defined in the following
subsections.
2.2 Rigid Factorization
Under affine assumption (4), the projection from space to
the sequence is expressed as
\underbrace{\begin{bmatrix} \bar{x}_{11} & \cdots & \bar{x}_{1n} \\ \vdots & \ddots & \vdots \\ \bar{x}_{m1} & \cdots & \bar{x}_{mn} \end{bmatrix}}_{W_{2m\times n}} = \underbrace{\begin{bmatrix} A_1 \\ \vdots \\ A_m \end{bmatrix}}_{M_{2m\times 3}} \underbrace{\left[\bar{X}_1, \ldots, \bar{X}_n\right]}_{\bar{S}_{3\times n}}   (5)
where W is called tracking matrix; M and S̄ are called motion matrix and shape matrix respectively. It is evident that
the rank of the tracking matrix is at most 3, and the rank
Table 1 Classification of structure and motion factorization of rigid and nonrigid objects

Classification             Tracking matrix   Motion matrix        Shape matrix       Upgrading matrix
Rigid     Affine           W ∈ R^{2m×n}      M ∈ R^{2m×3}         S̄ ∈ R^{3×n}        H ∈ R^{3×3}
          Perspective      Ẇ ∈ R^{3m×n}      M ∈ R^{3m×4}         S ∈ R^{4×n}        H ∈ R^{4×4}
          Quasi-Persp      W ∈ R^{3m×n}      M ∈ R^{3m×4}         S ∈ R^{4×n}        H ∈ R^{4×4}
Nonrigid  Affine           W ∈ R^{2m×n}      M ∈ R^{2m×3k}        B̄ ∈ R^{3k×n}       H ∈ R^{3k×3k}
          Perspective      Ẇ ∈ R^{3m×n}      M ∈ R^{3m×(3k+1)}    B ∈ R^{(3k+1)×n}   H ∈ R^{(3k+1)×(3k+1)}
          Quasi-Persp      W ∈ R^{3m×n}      M ∈ R^{3m×(3k+1)}    B ∈ R^{(3k+1)×n}   H ∈ R^{(3k+1)×(3k+1)}
constraint can be easily imposed by performing SVD decomposition on the tracking matrix W and truncating it to
rank 3. However, the decomposition is not unique since it is
only defined up to a nonsingular linear transformation matrix H ∈ R3×3 as W = (MH)(H−1 S̄). Actually, the decomposition is just one of the affine reconstructions of an object. By inserting H into the factorization, we can upgrade
the reconstruction from affine to the Euclidean space. We
will alternatively name the matrix as (Euclidean) upgrading
matrix in the following. Many researchers utilize the metric constraints of the motion matrix to recover the matrix
(Poelman and Kanade 1997; Quan 1996), which is indeed
a self-calibration process under the constraints of simplified
camera parameters.
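As a concrete sketch of the rank-3 affine factorization described above, the following minimal NumPy illustration (synthetic cameras and points invented for the example, not the authors' implementation) builds a tracking matrix as in (5) and factorizes it by SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 50                          # frames and feature points (arbitrary)

# Synthetic affine cameras A_i (2x3) and 3D points, invented for the demo
A = rng.standard_normal((m, 2, 3))
X = rng.standard_normal((3, n))
X -= X.mean(axis=1, keepdims=True)    # register points to their centroid

# Tracking matrix W of equation (5): two rows per frame
W = np.vstack([A[i] @ X for i in range(m)])

# Rank-3 factorization by SVD with low-rank truncation
U, s, Vt = np.linalg.svd(W, full_matrices=False)
M_hat = U[:, :3] * s[:3]              # 2m x 3 motion matrix
S_hat = Vt[:3]                        # 3 x n shape matrix

# W is reproduced exactly, but (M_hat, S_hat) differ from (A, X)
# by an unknown nonsingular H in R^{3x3}
assert np.allclose(M_hat @ S_hat, W, atol=1e-8)
```

The recovered M̂ and Ŝ form only an affine reconstruction; the metric constraints discussed in the text are what remove the remaining H ∈ R^{3×3} ambiguity.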
When the perspective projection model (1) is adopted, the
factorization equation can be formulated as
\underbrace{\begin{bmatrix} \lambda_{11}x_{11} & \cdots & \lambda_{1n}x_{1n} \\ \vdots & \ddots & \vdots \\ \lambda_{m1}x_{m1} & \cdots & \lambda_{mn}x_{mn} \end{bmatrix}}_{\dot{W}_{3m\times n}} = \underbrace{\begin{bmatrix} P_1 \\ \vdots \\ P_m \end{bmatrix}}_{M_{3m\times 4}} \underbrace{\begin{bmatrix} \bar{X}_1 & \cdots & \bar{X}_n \\ 1 & \cdots & 1 \end{bmatrix}}_{S_{4\times n}}   (6)

where Ẇ is called the projective-depth scaled tracking matrix, and its rank is at most 4 if a consistent set of scalars λ_ij is present; M and S are the camera matrix and homogeneous shape matrix, respectively. Obviously, any such factorization corresponds to a valid projective reconstruction, which is defined up to a projective transformation matrix H ∈ R^{4×4}. We can still use the metric constraint to recover the upgrading matrix.

The most difficult part of perspective factorization is to recover the projective depths that are consistent with (1). One method is to estimate the depths pairwise from the fundamental matrix and then string them together (Sturm and Triggs 1996; Triggs 1996); the disadvantage of this approach is its computational cost and possible error accumulation. Another method is to start with initial depths λ_ij = 1 and iteratively refine the depths by reprojection (Han and Kanade 2000; Hartley and Zisserman 2004; Mahamud and Hebert 2000). However, there is no guarantee that the procedure will converge to a global minimum; as recently proved in Oliensis and Hartley (2007), no iteration has been shown to converge sensibly.

2.3 Nonrigid Factorization

When an object is nonrigid, many studies follow Bregler's assumption (Bregler et al. 2000) that the nonrigid structure can be approximated by a linearly weighted combination of k rigid shape bases:

\bar{S}_i = \sum_{l=1}^{k} \omega_{il} B_l   (7)

where B_l ∈ R^{3×n} is a shape base that embodies a principal mode of the deformation, and ω_il ∈ R is called the deformation weight. Under this assumption and an affine camera model, the nonrigid factorization is modeled as

\underbrace{\begin{bmatrix} \bar{x}_{11} & \cdots & \bar{x}_{1n} \\ \vdots & \ddots & \vdots \\ \bar{x}_{m1} & \cdots & \bar{x}_{mn} \end{bmatrix}}_{W_{2m\times n}} = \underbrace{\begin{bmatrix} \omega_{11}A_1 & \cdots & \omega_{1k}A_1 \\ \vdots & \ddots & \vdots \\ \omega_{m1}A_m & \cdots & \omega_{mk}A_m \end{bmatrix}}_{M_{2m\times 3k}} \underbrace{\begin{bmatrix} B_1 \\ \vdots \\ B_k \end{bmatrix}}_{\bar{B}_{3k\times n}}   (8)
We call M the nonrigid motion matrix and B̄ the nonrigid shape matrix, which is composed of the k shape bases. It is easy to see
from (8) that the rank of the nonrigid tracking matrix W is at
most 3k. The decomposition can be achieved by SVD with
the rank-3k constraint, which is defined up to a nonsingular
upgrading matrix H ∈ R3k×3k . If the matrix is known, Ai ,
ωil and S̄i can be recovered accordingly from M and B̄. The
computation of H here is more complicated than in rigid
case. Many researchers (Brand 2001; Del Bue et al. 2004;
Torresani et al. 2001) adopted the metric constraints of the
motion matrix. However, the constraints may be insufficient
when the object deforms at varying speed. Xiao et al. (2006)
proposed a basis constraint to solve such ambiguity.
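The rank bound implied by (8) is easy to verify numerically. The small sketch below (bases, weights, and cameras all synthetic, invented for the example) builds a nonrigid tracking matrix and checks its rank:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 10, 60, 3                   # frames, points, shape bases (arbitrary)

A = rng.standard_normal((m, 2, 3))    # affine cameras
B = rng.standard_normal((k, 3, n))    # shape bases B_l of equation (7)
w = rng.standard_normal((m, k))       # deformation weights w_il

# Per-frame shape S_i = sum_l w_il B_l, imaged and stacked as in equation (8)
W = np.vstack([A[i] @ sum(w[i, l] * B[l] for l in range(k)) for i in range(m)])

print(np.linalg.matrix_rank(W))       # bounded by 3k = 9
```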
Similarly, the factorization under perspective projection
can be formulated as follows (Xiao and Kanade 2005).
\dot{W}_{3m\times n} = \underbrace{\begin{bmatrix} \omega_{11}P_1^{(1:3)} & \cdots & \omega_{1k}P_1^{(1:3)} & P_1^{(4)} \\ \vdots & \ddots & \vdots & \vdots \\ \omega_{m1}P_m^{(1:3)} & \cdots & \omega_{mk}P_m^{(1:3)} & P_m^{(4)} \end{bmatrix}}_{M_{3m\times(3k+1)}} \underbrace{\begin{bmatrix} B_1 \\ \vdots \\ B_k \\ \mathbf{1}^T \end{bmatrix}}_{B_{(3k+1)\times n}}   (9)

where Ẇ is the depth-scaled tracking matrix as in (6); P_i^{(1:3)} and P_i^{(4)} denote the first three and the fourth columns of P_i, respectively; and 1 = [1, …, 1]^T is an n-vector with unit entries.
The rank of the correctly scaled tracking matrix is at most
3k + 1. The decomposition is defined up to a transformation
H ∈ R(3k+1)×(3k+1) , which can be determined in a similar
but more complicated way. Just as in the rigid case, the most
difficult part for nonrigid perspective factorization is to determine the projective depths. Since there is no pairwise fundamental matrix for deformable features, we can only use
the iterative method to recover the depth, although it is more
likely to get stuck in a local minimum in the nonrigid situation.
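For reference, the iterative depth-recovery baseline discussed in this section can be sketched as follows. This is a simplified rigid-case illustration (synthetic cameras and points, invented for the example) of the λij = 1 initialization followed by alternating rank-4 truncation and depth re-reading; as noted in the text, such iterations carry no convergence guarantee:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 40

# Synthetic perspective cameras and homogeneous points (invented for the demo)
P = rng.standard_normal((m, 3, 4))
X = np.vstack([rng.standard_normal((3, n)), np.ones((1, n))])
x = np.stack([(P[i] @ X) / (P[i] @ X)[2] for i in range(m)])  # images x_ij

lam = np.ones((m, n))                 # initial projective depths lambda_ij = 1
for _ in range(20):
    # Scaled tracking matrix of equation (6), then its rank-4 approximation
    W = np.vstack([lam[i] * x[i] for i in range(m)])
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W4 = (U[:, :4] * s[:4]) @ Vt[:4]
    # Re-read the depths from the third row of each frame's 3-row block,
    # since the third coordinate of lambda_ij * x_ij is lambda_ij itself
    lam = np.stack([W4[3 * i + 2] for i in range(m)])
    lam /= np.linalg.norm(lam)        # fix the free global scale
```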
3 Quasi-perspective Projection
In this section, we propose a new quasi-perspective projection model to fill the gap between the simplicity of the affine camera and the accuracy of perspective projection.
3.1 Quasi-perspective Projection
Under perspective projection, the image formation process
is shown in Fig. 1. In order to ensure a large overlap of the object to be reconstructed between views, the camera usually undergoes only small movements across adjacent views,
especially for images of a video sequence. Suppose Ow −
Xw Yw Zw is the world coordinate system selected on the object to be reconstructed. Oi − Xi Yi Zi is the camera system
with Oi being the optical center of the camera. Without loss
of generality, we assume there is a reference camera system Or − Xr Yr Zr . As the world system can be set freely,
we align it with the reference frame as illustrated in Fig. 1.
Therefore, the rotation Ri of frame i with respect to the reference frame is the same as the rotation of the camera to the
world system.
Definition 1 (Axial and lateral rotation) The orientation of
a camera is usually described by roll-pitch-yaw angles. For
the i-th frame, we define the pitch, yaw, and roll as the rotations αi , βi , and γi of the camera with respect to the Xw , Yw ,
and Zw axes of the world system. As shown in Fig. 1, the optical axis of the camera usually points towards the object. For convenience of discussion, we define γi as the axial rotation angle, and αi, βi as the lateral rotation angles.
Proposition 2 Suppose the camera undergoes small lateral rotations with respect to the reference frame. Then the variation of the projective depth λij is mainly proportional to the depth of the space point, and the projective depths of a point in different views have a similar trend of variation.
Proof Suppose the rotation and translation of the i-th frame
to the world system are Ri = [r1i , r2i , r3i ]T and Ti =
[txi , tyi , tzi ]T , respectively. Then the projection matrix can
be written as
P_i = K_i [R_i, T_i] = \begin{bmatrix} f_i r_{1i}^T + \varsigma_i r_{2i}^T + u_{0i} r_{3i}^T & f_i t_{xi} + \varsigma_i t_{yi} + u_{0i} t_{zi} \\ \kappa_i f_i r_{2i}^T + v_{0i} r_{3i}^T & \kappa_i f_i t_{yi} + v_{0i} t_{zi} \\ r_{3i}^T & t_{zi} \end{bmatrix}   (10)
Let us decompose the rotation matrix into the rotations
around three axes as R(γi )R(βi )R(αi ). Then we have
R_i = R(\gamma_i) R(\beta_i) R(\alpha_i)
    = \begin{bmatrix} C\gamma_i & -S\gamma_i & 0 \\ S\gamma_i & C\gamma_i & 0 \\ 0 & 0 & 1 \end{bmatrix}
      \begin{bmatrix} C\beta_i & 0 & S\beta_i \\ 0 & 1 & 0 \\ -S\beta_i & 0 & C\beta_i \end{bmatrix}
      \begin{bmatrix} 1 & 0 & 0 \\ 0 & C\alpha_i & -S\alpha_i \\ 0 & S\alpha_i & C\alpha_i \end{bmatrix}
    = \begin{bmatrix} C\gamma_i C\beta_i & C\gamma_i S\beta_i S\alpha_i - S\gamma_i C\alpha_i & C\gamma_i S\beta_i C\alpha_i + S\gamma_i S\alpha_i \\ S\gamma_i C\beta_i & S\gamma_i S\beta_i S\alpha_i + C\gamma_i C\alpha_i & S\gamma_i S\beta_i C\alpha_i - C\gamma_i S\alpha_i \\ -S\beta_i & C\beta_i S\alpha_i & C\beta_i C\alpha_i \end{bmatrix}   (11)
Fig. 1 Imaging process of an object. (a) Camera setup with respect to the object. (b) Relationship between the world coordinate system and the camera systems at different viewpoints
where 'S' stands for the sine function and 'C' for the cosine function. By inserting (10) and (11) into (1), we have

\lambda_{ij} = [r_{3i}^T, t_{zi}] X_j = -(S\beta_i)x_j + (C\beta_i S\alpha_i)y_j + (C\beta_i C\alpha_i)z_j + t_{zi}   (12)

From Fig. 1, we know that the rotation angles αi, βi, γi of the camera to the world system are the same as those to the reference frame. Under small lateral rotations, i.e., small angles αi and βi, we have Sβi ≪ CβiCαi and CβiSαi ≪ CβiCαi. Thus (12) can be approximated by

\lambda_{ij} \approx (C\beta_i C\alpha_i)z_j + t_{zi}   (13)

All features {x_ij | j = 1, …, n} in the i-th frame correspond to the same rotation angles αi, βi, γi and translation tzi. It is evident from (13) that the projective depths of a point in all frames have a similar trend of variation, in proportion to the value zj of the space point. Note that the projective depths are unrelated to the axial rotation γi.

Proposition 3 Under small lateral rotations and the further assumption that the distance of the camera to an object is much larger than the depth of the object, i.e., tzi ≫ zj, the ratio of {λij | i = 1, …, m} corresponding to any two different frames can be approximated by a constant.

Proof Let us take the reference frame as an example. The ratio of the projective depths of any frame i to those of the reference frame can be written as

\mu_i = \frac{\lambda_{rj}}{\lambda_{ij}} \approx \frac{(C\beta_r C\alpha_r)z_j + t_{zr}}{(C\beta_i C\alpha_i)z_j + t_{zi}} = \frac{C\beta_r C\alpha_r (z_j/t_{zi}) + t_{zr}/t_{zi}}{C\beta_i C\alpha_i (z_j/t_{zi}) + 1}   (14)

where CβiCαi ≤ 1. Under the assumption tzi ≫ zj, the ratio can be approximated by

\mu_i = \frac{\lambda_{rj}}{\lambda_{ij}} \approx \frac{t_{zr}}{t_{zi}}   (15)

All features in a frame have the same translation term. Thus, from (15), the projective depth ratios of two frames for all features have the same approximation μi.
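Propositions 2 and 3 can be probed numerically. The sketch below (angles and camera distances invented for the example, chosen small and large respectively) evaluates the exact depths of equation (12) for two frames and checks that their ratio is nearly the constant tzr/tzi of equation (15):

```python
import numpy as np

rng = np.random.default_rng(3)
x, y, z = rng.uniform(-1, 1, size=(3, 200))   # object points of unit size

def depths(alpha, beta, tz):
    # exact projective depth of equation (12)
    return (-np.sin(beta) * x + np.cos(beta) * np.sin(alpha) * y
            + np.cos(beta) * np.cos(alpha) * z + tz)

lam_r = depths(0.02, -0.03, 100.0)    # reference frame, small lateral angles
lam_i = depths(0.04, 0.01, 120.0)     # frame i, with tz much larger than depth

ratio = lam_r / lam_i                      # mu_i of equation (14), per point
print(ratio.std() / ratio.mean())          # tiny: the ratio is nearly constant
print(abs(ratio.mean() - 100.0 / 120.0))   # and close to tzr/tzi, equation (15)
```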
According to Proposition 3, we have λ_ij = (1/μ_i) λ_rj. Thus the perspective projection equation (1) can be approximated by

\frac{1}{\mu_i} \lambda_{rj} x_{ij} = P_i X_j   (16)

Let us denote λ_rj as 1/ε_j, and reformulate (16) as

x_{ij} = P_{qi} X_{qj}   (17)

where

P_{qi} = \mu_i P_i, \qquad X_{qj} = \varepsilon_j X_j   (18)

We call (17) the quasi-perspective projection model. Compared with general perspective projection, quasi-perspective assumes that the projective depths between different frames are defined up to a constant μ_i. Thus the projective depths are implicitly embedded in the scalars of the homogeneous structure X_qj and the projection matrix P_qi, and the difficult problem of estimating the unknown depths is avoided. The model is more general than the affine projection model (3), where all projective depths are simply assumed to be equal.
3.2 Error Analysis of Different Projection Models

In this subsection, we give a heuristic analysis of the imaging errors of the quasi-perspective and affine camera models with respect to general perspective projection. For simplicity, the subscript 'i' of the frame number is omitted in the following. Suppose the intrinsic parameters of the cameras are known, and all images are normalized by applying the inverse K_i^{-1} to each frame. Then the projection matrices under the different projection models can be written as

P = \begin{bmatrix} r_1^T & t_x \\ r_2^T & t_y \\ r_3^T & t_z \end{bmatrix}, \quad r_3^T = [-S\beta,\; C\beta S\alpha,\; C\beta C\alpha],   (19)

P_q = \begin{bmatrix} r_1^T & t_x \\ r_2^T & t_y \\ r_{3q}^T & t_z \end{bmatrix}, \quad r_{3q}^T = [0,\; 0,\; C\beta C\alpha],   (20)

P_a = \begin{bmatrix} r_1^T & t_x \\ r_2^T & t_y \\ \mathbf{0}^T & t_z \end{bmatrix}, \quad \mathbf{0}^T = [0,\; 0,\; 0]   (21)

where P is the projection matrix of perspective projection, P_q that of the quasi-perspective assumption, and P_a that of affine projection. It is clear that the main difference among these projection matrices lies only in the last row. For a space point X̄ = [x, y, z]^T, its projection under the different camera models is given by

m = P \begin{bmatrix} \bar{X} \\ 1 \end{bmatrix} = \begin{bmatrix} u \\ v \\ r_3^T\bar{X} + t_z \end{bmatrix},   (22)

m_q = P_q \begin{bmatrix} \bar{X} \\ 1 \end{bmatrix} = \begin{bmatrix} u \\ v \\ r_{3q}^T\bar{X} + t_z \end{bmatrix},   (23)

m_a = P_a \begin{bmatrix} \bar{X} \\ 1 \end{bmatrix} = \begin{bmatrix} u \\ v \\ t_z \end{bmatrix}   (24)

where

u = r_1^T\bar{X} + t_x, \quad v = r_2^T\bar{X} + t_y,   (25)

r_3^T\bar{X} = -(S\beta)x + (C\beta S\alpha)y + (C\beta C\alpha)z,   (26)

r_{3q}^T\bar{X} = (C\beta C\alpha)z   (27)

and the nonhomogeneous image points can be denoted as

\bar{m} = \frac{1}{r_3^T\bar{X} + t_z}\begin{bmatrix} u \\ v \end{bmatrix},   (28)

\bar{m}_q = \frac{1}{r_{3q}^T\bar{X} + t_z}\begin{bmatrix} u \\ v \end{bmatrix},   (29)

\bar{m}_a = \frac{1}{t_z}\begin{bmatrix} u \\ v \end{bmatrix}   (30)

The point m̄ is the ideal image under perspective projection. Let us define e_q = |m̄_q − m̄| as the error of quasi-perspective and e_a = |m̄_a − m̄| as the error of affine, where '|·|' stands for the norm of a vector. Then we have

e_q = |\bar{m}_q - \bar{m}| = \left| \frac{r_3^T\bar{X} + t_z}{r_{3q}^T\bar{X} + t_z}\,\bar{m} - \bar{m} \right| = \left| \frac{(r_3 - r_{3q})^T\bar{X}}{r_{3q}^T\bar{X} + t_z} \right| |\bar{m}| = \left| \frac{-(S\beta)x + (C\beta S\alpha)y}{(C\beta C\alpha)z + t_z} \right| |\bar{m}|,   (31)

e_a = |\bar{m}_a - \bar{m}| = \left| \frac{r_3^T\bar{X} + t_z}{t_z}\,\bar{m} - \bar{m} \right| = \left| \frac{r_3^T\bar{X}}{t_z} \right| |\bar{m}| = \left| \frac{-(S\beta)x + (C\beta S\alpha)y + (C\beta C\alpha)z}{t_z} \right| |\bar{m}|   (32)

Based on the above equations, it is reasonable to state the following results for the different projection models.

1. The axial rotation angle γ around the Z-axis has no influence on the images m̄, m̄_q and m̄_a.
2. When the distance of the camera to the object is much larger than the object depth, both m̄_q and m̄_a are close to m̄.
3. When the camera system is aligned with the world system, i.e., α = β = 0, we have r_{3q}^T = r_3^T = [0, 0, 1] and e_q = 0. Thus m̄_q = m̄, and the quasi-perspective assumption is equivalent to perspective projection.
4. When the rotation angles α and β are small, we have e_q < e_a, i.e., the quasi-perspective assumption is more accurate than the affine assumption.
5. When the space point lies on the plane through the world origin and perpendicular to the principal axis, i.e., the direction of r_3^T, we have α = β = 0 and z = 0. It is easy to verify that m̄ = m̄_q = m̄_a.
4 Quasi-Perspective Rigid Factorization
Under quasi-perspective projection (17), the factorization
equation of a tracking matrix is expressed as
\underbrace{\begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix}}_{W} = \begin{bmatrix} \mu_1 P_1 \\ \vdots \\ \mu_m P_m \end{bmatrix} \left[ \varepsilon_1 X_1, \ldots, \varepsilon_n X_n \right]   (33)

which can be written concisely as

W_{3m\times n} = M_{3m\times 4} S_{4\times n}   (34)
The form is similar to perspective factorization (6). However, the projective depths in (33) are embedded in the motion and shape matrices, hence there is no need to estimate
them explicitly. By performing SVD on the tracking matrix
and imposing the rank-4 constraint, W may be factorized
as M̂3m×4 Ŝ4×n . However, the decomposition is not unique
since it is defined up to a nonsingular linear transformation
H4×4 as M = M̂H and S = H−1 Ŝ. If a reasonable upgrading matrix is recovered, the Euclidean structure and motions
can be easily recovered from the shape matrix S and motion
matrix M. Due to the special form of (33), the recovery of
an upgrading matrix has some special properties compared
with those under affine and perspective projection. We will
show the computation details later in the article.
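A minimal numerical sketch of the quasi-perspective factorization (33)-(34), with synthetic μi, εj, cameras, and points all invented for the example. Note that no projective depths are estimated; the rank-4 structure holds by construction:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 8, 50

P = rng.standard_normal((m, 3, 4))            # projection matrices
mu = rng.uniform(0.8, 1.2, size=m)            # per-frame constants mu_i
eps = rng.uniform(0.8, 1.2, size=n)           # per-point scalars eps_j
X = np.vstack([rng.standard_normal((3, n)), np.ones((1, n))])

M = np.vstack([mu[i] * P[i] for i in range(m)])   # 3m x 4 motion matrix
S = X * eps                                       # 4 x n scaled shape matrix
W = M @ S                                         # tracking matrix, rank <= 4

# Rank-4 factorization by SVD, as in equation (34)
U, s, Vt = np.linalg.svd(W, full_matrices=False)
M_hat = U[:, :4] * s[:4]
S_hat = Vt[:4]
assert np.allclose(M_hat @ S_hat, W, atol=1e-8)   # exact, no depth estimation
```

As in the affine case, M̂ and Ŝ differ from (M, S) by an unknown H ∈ R^{4×4}, recovered next via the metric constraints.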
We adopt the metric constraint to compute an upgrading matrix H4×4 . Let us represent the matrix into two parts as
(35)
where Hl denotes the first three columns of H, Hr denotes
the fourth column. Suppose M̂i is the i-th triple rows of M̂,
then from M̂i H = [M̂i Hl |M̂i Hr ], we know that
(1:3)
M̂i Hl = μi Pi
(4)
M̂i Hr = μi Pi
= μi Ki Ri ,
= μi Ki Ti
(36)
(37)
Let us denote Ci = M̂i QM̂Ti , where Q = Hl HTl is a 4 × 4
symmetric matrix. As in previous factorization studies (Han
and Kanade 2000; Quan 1996), we adopt a simplified camera model with only one parameter as Ki = diag(fi , fi , 1).
Then from
Ci = M̂i QM̂Ti = (μi Ki Ri )(μi Ki Ri )T
⎡ 2
⎤
fi
⎦
= μ2i Ki KTi = μ2i ⎣
fi2
1
where Uij denotes the (i, j )-th element of U, and uij is a
scalar. For example, a n × (n − 1) vertical extended upper
triangular matrix can be written explicitly as
⎡
⎤
u11 u12 · · · u1(n−1)
⎢u21 u22 · · · u2(n−1) ⎥
⎢
⎥
⎢
u32 · · · u3(n−1) ⎥
U=⎢
(41)
⎥
⎢
.. ⎥
.
.
⎣
.
. ⎦
un(n−1)
4.1 Recovery of the Euclidean Upgrading Matrix
H = [Hl |Hr ]
Definition 4 (Vertical extended upper triangular matrix)
Suppose U is a n × k (n > k) matrix. We call U a vertical
extended upper triangular matrix if it is of the form
uij if i ≤ j + (n − k)
(40)
Uij =
0
if i > j + (n − k)
Proposition 5 (Extended Cholesky Decomposition) Suppose Qn is a n × n positive semidefinite symmetric matrix of
rank k. Then it can be decomposed as Qn = Hk HTk , where
Hk is a n × k matrix of rank k. Furthermore, the decomposition can be written as Qn = k Tk with k a n × k vertical
extended upper triangular matrix. The degree-of-freedom of
the matrix Q is nk − 12 k(k − 1), which is the number of unknowns in k .
The proof of Proposition 5 is given in Appendix 1. Form
the Extended Cholesky Decomposition we can easily obtain
the following result.
Result 6 The matrix Q recovered from (39) is a 4 × 4 positive semidefinite symmetric matrix of rank 3. It can be decomposed as Q = Hl HTl , where Hl is a 4 × 3 rank 3 matrix.
The decomposition can be further written as Q = 3 T3
with 3 a 4 × 3 vertical extended upper triangular matrix.
(39)
The computation of Hl is very simple. Suppose the SVD
decomposition of Q is U4 4 UT4 , where U4 is a 4 × 4 orthogonal matrix, 4 = diag(σ1 , σ2 , σ3 , 0) is a diagonal matrix
with σi the singular value of Q. Thus we can immediately
have
⎡√
⎤
σ1
√
⎦
Hl = U(1:3) ⎣
σ2
(42)
√
σ3
Since the factorization (33) can be defined up to a global
scalar as W = MS = (εM)(S/ε), we may set μ1 = 1 to
avoid the trivial solution of Q = 0. Thus we have 4m + 1
linear constraints in total on the 10 unknowns of Q, which
can be solved via least squares. Ideally, Q is a positive semidefinite symmetric matrix, the matrix Hl can be recovered
from Q via matrix decomposition.
where U(1:3) denotes the first three columns of U. Then the
vertical extended upper triangular matrix 3 can be constructed from Hl as shown in Appendix 1. The computation is an extension of Cholesky Decomposition to the case
of positive semidefinite symmetric matrix, while general
Cholesky Decomposition can only be applied to positive
definite symmetric matrix. From the number of unknowns
in 3 we know that Q is only defined up to 9 degree-offreedom.
we can obtain the following constraints:

$$\begin{cases} C_i(1,2) = C_i(2,1) = 0\\ C_i(1,3) = C_i(3,1) = 0\\ C_i(2,3) = C_i(3,2) = 0\\ C_i(1,1) - C_i(2,2) = 0 \end{cases} \qquad (38)$$
Remark 7 In Result 6, we assume Q is positive semidefinite. However, the recovered matrix Q may not be positive semidefinite in the case of noisy data, and then we cannot adopt the above method to decompose it into the form H_l H_l^T or Λ_3 Λ_3^T. In this case, let us denote

$$\Lambda_3 = \begin{bmatrix} h_1 & h_2 & h_3\\ h_4 & h_5 & h_6\\ 0 & h_7 & h_8\\ 0 & 0 & h_9 \end{bmatrix} \qquad (43)$$
and substitute the matrix Q in (38) with Λ_3 Λ_3^T. A best estimate of Λ_3 in (43) can then be obtained by minimizing the following cost function:

$$J_1 = \min_{\Lambda_3}\frac{1}{2}\sum_{i=1}^{m}\Big[C_i^2(1,2) + C_i^2(1,3) + C_i^2(2,3) + \big(C_i(1,1) - C_i(2,2)\big)^2\Big] \qquad (44)$$
The minimization can be carried out with any nonlinear optimization technique, such as gradient descent or the Levenberg-Marquardt (LM) algorithm.
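A minimal numerical sketch of minimizing (44) with the Levenberg-Marquardt routine in SciPy, parameterizing Λ_3 as in (43). The synthetic data, the focal-length normalization, and all names are our own illustrative assumptions, not part of the paper's method.

```python
import numpy as np
from scipy.optimize import least_squares

def lambda3(h):
    # 4x3 vertical extended upper triangular matrix of (43), 9 unknowns
    L = np.zeros((4, 3))
    L[0, :] = h[0:3]
    L[1, :] = h[3:6]
    L[2, 1:] = h[6:8]
    L[3, 2] = h[8]
    return L

def residuals(h, M_hat):
    # one residual per term of the cost function (44), for every frame
    L = lambda3(h)
    Q = L @ L.T                      # Q = Lambda_3 Lambda_3^T, PSD by construction
    r = []
    for Mi in M_hat:                 # Mi: 3x4 motion block of frame i
        C = Mi @ Q @ Mi.T            # C_i = M_i Q M_i^T
        r += [C[0, 1], C[0, 2], C[1, 2], C[0, 0] - C[1, 1]]
    return np.asarray(r)

# Synthetic check: metric motion blocks M_i = [R_i | t_i] (focal length
# normalized to 1 for conditioning, an assumption of this sketch), so the
# identity upgrade, Lambda_3 = [I_3; 0], satisfies all constraints exactly.
rng = np.random.default_rng(1)
M_hat = []
for _ in range(10):
    R, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthonormal R_i
    M_hat.append(np.hstack([R, rng.standard_normal((3, 1))]))

h0 = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0], dtype=float)
sol = least_squares(residuals, h0 + 0.05 * rng.standard_normal(9),
                    args=(M_hat,), method='lm')
print(sol.cost)                      # essentially zero at convergence
```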
Remark 8 In Result 6, we claim that the symmetric matrix Q can be decomposed into Λ_3 Λ_3^T. In practice, the recovery of Λ_3 is unnecessary, since the upgrading matrix (35) is not unique; we can simply decompose the matrix into H_l H_l^T as shown in (42). However, this decomposition is impossible when Q is not positive semidefinite. In such cases, it is suggested to parameterize Q by Λ_3, since the vertical extended upper triangular form (43) has 3 fewer unknowns than a general 4 × 3 matrix; hence we only need to optimize 9 parameters in the minimization scheme (44).
We now show how to recover the right part H_r of the upgrading matrix (35). From the quasi-perspective equation (17), we have

$$x_{ij} = \big(\mu_i P_i^{(1:3)}\big)(\varrho_j \bar{X}_j) + \big(\mu_i P_i^{(4)}\big)\varrho_j \qquad (45)$$

where μ_i P_i^{(1:3)} can be recovered from M̂_i H_l and μ_i P_i^{(4)} = M̂_i H_r. Summing the coordinates of all the features in the i-th frame gives

$$\sum_{j=1}^{n} x_{ij} = \mu_i P_i^{(1:3)}\sum_{j=1}^{n}(\varrho_j \bar{X}_j) + \mu_i P_i^{(4)}\sum_{j=1}^{n}\varrho_j \qquad (46)$$

Since the world coordinate system can be chosen freely, we may set Σ_{j=1}^n ϱ_j X̄_j = 0, which is equivalent to setting the origin of the world system at the gravity center of the scaled space points. On the other hand, since the reconstruction is defined up to a global scalar, we may simply set Σ_{j=1}^n ϱ_j = 1. Thus equation (46) is simplified to

$$\hat{M}_i H_r = \sum_{j=1}^{n} x_{ij} = \Big[\sum\nolimits_j u_{ij},\ \sum\nolimits_j v_{ij},\ n\Big]^T \qquad (47)$$

which provides 3 linear constraints on the four unknowns of H_r. Therefore, we can obtain 3m equations from the sequence, and H_r can be recovered via linear least squares.
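Under the centroid and scale normalizations above, (47) stacks into an ordinary linear least-squares problem for the four entries of H_r. A sketch with hypothetical variable names:

```python
import numpy as np

def recover_Hr(M_hat, W):
    """Solve the stacked linear system of (47): M_hat_i @ Hr = sum_j x_ij.

    M_hat : (3m, 4) stacked motion blocks from the rank-4 factorization
    W     : (3m, n) tracking matrix of homogeneous image points
    """
    b = W.sum(axis=1)                    # per frame: [sum_j u_ij, sum_j v_ij, n]
    Hr, *_ = np.linalg.lstsq(M_hat, b, rcond=None)
    return Hr                            # the recovered 4-vector

# consistency check: if W = M_hat @ S exactly, the recovered Hr equals the
# sum of the columns of S (the weighted-origin plus scale term of (46))
rng = np.random.default_rng(2)
M_hat = rng.standard_normal((30, 4))     # m = 10 frames
S = rng.standard_normal((4, 50))         # n = 50 points
W = M_hat @ S
print(np.allclose(recover_Hr(M_hat, W), S.sum(axis=1)))   # True
```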
From the above analysis, we note that the solution of H_r is not unique, as it depends on the selection of the world origin Σ_{j=1}^n ϱ_j X̄_j and the global scalar Σ_{j=1}^n ϱ_j. Actually, H_r may be set freely, as shown in the following proposition.
Proposition 9 (Recovery of H_r) Suppose H_l in (35) is already recovered. Let us construct a matrix H̃ = [H_l | H̃_r], where H̃_r is an arbitrary 4-vector that is independent of the three columns of H_l. Then H̃ must be a valid upgrading matrix, i.e., M̃ = M̂H̃ is a valid Euclidean motion matrix, and S̃ = H̃^{-1}Ŝ corresponds to a valid Euclidean shape matrix.
The proof can be found in Appendix 2. According to Proposition 9, the value of H_r can be set randomly as any 4-vector that is independent of the columns of H_l. In practice, H_r may be set from the SVD decomposition of H_l:

$$H_l = U_{4\times4}\,\Sigma_{4\times3}\,V_{3\times3}^T = [u_1, u_2, u_3, u_4]\begin{bmatrix}\sigma_1 & 0 & 0\\ 0 & \sigma_2 & 0\\ 0 & 0 & \sigma_3\\ 0 & 0 & 0\end{bmatrix}[v_1, v_2, v_3]^T \qquad (48)$$

where U and V are two orthogonal matrices and Σ is the diagonal matrix of the three singular values. Let us choose an arbitrary value σ_r between the biggest and the smallest singular values of H_l; then we may set

$$H_r = \sigma_r u_4, \qquad H = [H_l, H_r] \qquad (49)$$

This construction guarantees that H is invertible and has the same condition number as H_l, so that we can obtain good precision in computing the inverse H^{-1}. After recovering the Euclidean motion and shape matrices, the intrinsic parameters and pose of the camera associated with each frame can be easily computed as follows.
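The construction (48)–(49) is easy to verify numerically; a sketch, with names of our own choosing:

```python
import numpy as np

def complete_upgrading_matrix(Hl):
    """Build H = [Hl | sigma_r * u4] as in (48)-(49).

    u4 is the left singular vector orthogonal to the column space of Hl;
    sigma_r is chosen between the largest and smallest singular values,
    so that cond(H) = cond(Hl).
    """
    U, s, Vt = np.linalg.svd(Hl)        # Hl = U (4x4) Sigma (4x3) V^T (3x3)
    sigma_r = np.sqrt(s[0] * s[2])      # any value in [s[2], s[0]] works
    Hr = sigma_r * U[:, 3]
    return np.column_stack([Hl, Hr])

rng = np.random.default_rng(3)
Hl = rng.standard_normal((4, 3))
H = complete_upgrading_matrix(Hl)
# H is invertible and its condition number equals that of Hl
print(np.isclose(np.linalg.cond(H), np.linalg.cond(Hl)))   # True
```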
$$\mu_i = \big\|M_{i(3)}^{(1:3)}\big\| \qquad (50)$$

$$f_i = \frac{1}{\mu_i}\big\|M_{i(1)}^{(1:3)}\big\| = \frac{1}{\mu_i}\big\|M_{i(2)}^{(1:3)}\big\| \qquad (51)$$

$$R_i = \frac{1}{\mu_i}K_i^{-1}M_i^{(1:3)}, \qquad T_i = \frac{1}{\mu_i}K_i^{-1}M_i^{(4)} \qquad (52)$$

where M_{i(t)}^{(1:3)} denotes the t-th row of M_i^{(1:3)}. The result is obtained under the quasi-perspective assumption, which is a close approximation to the general perspective projection. The solution may be further optimized toward perspective projection by minimizing the image reprojection residuals.
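A round-trip sketch of (50)–(52), extracting the scale, focal length, rotation, and translation from a metric motion block; function and variable names are ours.

```python
import numpy as np

def camera_from_motion(Mi):
    """Extract (mu, f, R, T) from a 3x4 metric motion block via (50)-(52).

    Assumes the one-parameter camera K = diag(f, f, 1).
    """
    A = Mi[:, :3]                     # M_i^(1:3) = mu * K * R
    mu = np.linalg.norm(A[2])         # (50): the third row has norm mu
    f = np.linalg.norm(A[0]) / mu     # (51): rows 1 and 2 have norm mu * f
    Kinv = np.diag([1.0 / f, 1.0 / f, 1.0])
    R = Kinv @ A / mu                 # (52)
    T = Kinv @ Mi[:, 3] / mu
    return mu, f, R, T

# round-trip check with a synthetic motion block
rng = np.random.default_rng(4)
R0, _ = np.linalg.qr(rng.standard_normal((3, 3)))
mu0, f0, T0 = 2.0, 1000.0, rng.standard_normal(3)
Mi = mu0 * np.diag([f0, f0, 1.0]) @ np.hstack([R0, T0[:, None]])
mu, f, R, T = camera_from_motion(Mi)
print(np.isclose(mu, mu0), np.isclose(f, f0), np.allclose(R, R0))
```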
$$J_2 = \min_{\{K_i, R_i, T_i, \mu_i, X_j\}}\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n}\big\|\bar{x}_{ij} - \hat{x}_{ij}\big\|^2 \qquad (53)$$

where x̂_ij denotes the reprojected image point computed via perspective projection (1). The minimization process is termed bundle adjustment (Hartley and Zisserman 2004), which is usually solved via Levenberg-Marquardt iterations.
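A deliberately reduced sketch of (53): the cameras are held fixed and only the points X_j are refined, whereas a full bundle adjustment would also update K_i, R_i, T_i, and μ_i. All data and names below are synthetic stand-ins.

```python
import numpy as np
from scipy.optimize import least_squares

def reproject(P, X3):
    """Perspective projection of 3D points X3 (3 x n) by a 3x4 camera P."""
    Xh = np.vstack([X3, np.ones((1, X3.shape[1]))])
    x = P @ Xh
    return x[:2] / x[2]

def ba_residuals(flat_X, Ps, obs):
    X3 = flat_X.reshape(3, -1)
    return np.concatenate([(reproject(P, X3) - o).ravel()
                           for P, o in zip(Ps, obs)])

# synthetic setup: points in front of cameras with small camera motion,
# matching the small-movement assumption of the paper
rng = np.random.default_rng(5)
Xtrue = rng.uniform(-1, 1, (3, 40)) + np.array([[0], [0], [10.0]])
K = np.diag([800.0, 800, 1])
Ps, obs = [], []
for _ in range(6):
    t = 0.1 * rng.standard_normal((3, 1)) + np.array([[0], [0], [2.0]])
    P = K @ np.hstack([np.eye(3), t])   # near-identity pose per frame
    Ps.append(P)
    obs.append(reproject(P, Xtrue))     # noise-free observations

X0 = Xtrue + 0.05 * rng.standard_normal(Xtrue.shape)
sol = least_squares(ba_residuals, X0.ravel(), args=(Ps, obs), method='lm')
print(sol.cost)                         # drops to essentially zero
```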
4.2 Outline of the Algorithm
The implementation of the rigid factorization algorithm is
summarized as follows.
Algorithm 10 (Quasi-perspective rigid factorization) Given
the tracking matrix W ∈ R3m×n across a sequence with
small camera movements. Compute the Euclidean structure
and motion parameters under quasi-perspective projection.
1. Balance the tracking matrix via point-wise and image-wise rescalings, as in (Sturm and Triggs 1996), to improve numerical stability;
2. Perform rank-4 SVD factorization on the tracking matrix
to obtain a solution of M̂ and Ŝ;
3. Compute the left part of upgrading matrix Hl according
to (42), or (44) for negative definite matrix Q;
4. Compute Hr and H according to (49);
5. Recover the Euclidean motion matrix M = M̂H and
shape matrix S = H−1 Ŝ;
6. Estimate the camera parameters and pose from (50) to
(52);
7. Optimize the solution via bundle adjustment (53).
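Steps 1–2 of Algorithm 10 can be sketched as follows; the balancing loop is a simple variant in the spirit of Sturm and Triggs (1996), and all names are ours.

```python
import numpy as np

def rank4_factorize(W, n_iter=3):
    """Balance the tracking matrix and factor it as W ~= M_hat @ S_hat (rank 4).

    Note: with balancing enabled, the factors approximate the rebalanced
    matrix rather than the raw input.
    """
    W = W.astype(float).copy()
    m3, n = W.shape
    for _ in range(n_iter):
        # point-wise: give every feature track (column) the same norm
        W /= np.linalg.norm(W, axis=0, keepdims=True) / np.sqrt(m3)
        # image-wise: give every 3-row image block the same norm
        rows = W.reshape(m3 // 3, 3, n)
        norms = np.linalg.norm(rows, axis=(1, 2), keepdims=True)
        W = (rows / norms * np.sqrt(3 * n)).reshape(m3, n)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    M_hat = U[:, :4] * np.sqrt(s[:4])        # rank-4 truncation of the SVD
    S_hat = np.sqrt(s[:4])[:, None] * Vt[:4]
    return M_hat, S_hat

# sanity check on an exactly rank-4 tracking matrix (balancing skipped)
rng = np.random.default_rng(6)
W = rng.standard_normal((30, 4)) @ rng.standard_normal((4, 50))
M_hat, S_hat = rank4_factorize(W, n_iter=0)
print(np.allclose(M_hat @ S_hat, W))         # True
```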
Remark 11 In the above analysis, as well as in other factorization algorithms, we usually assume a one-parameter camera model as in (38) so that we may use this constraint to recover an upgrading matrix H. When the one-parameter assumption is not satisfied in real applications, it is possible to take the proposed solution as an initial value and optimize the camera parameters via the Kruppa constraints arising from pairwise images (Wang et al. 2008).
Remark 12 The essence of quasi-perspective factorization (34) is to find a rank-4 approximation MS of the tracking matrix, i.e., to minimize the Frobenius norm ‖W − MS‖²_F. Most studies adopt the SVD decomposition of W and truncate it to the desired rank. However, when the tracking matrix is not complete, e.g., when some features are missing in some frames due to occlusions, it is hard to perform the SVD decomposition. In the case of missing data, we can replace step 2 in Algorithm 10 with the power factorization algorithm (Hartley and Schaffalitzky 2003; Wang and Wu 2008a) to obtain a least-squares solution of M̂ and Ŝ, and then upgrade the solution to Euclidean space according to the proposed scheme.
5 Quasi-perspective Nonrigid Factorization
For nonrigid factorization, we still follow Bregler's assumption (7) and represent a nonrigid shape by a weighted combination of k shape bases. Under quasi-perspective projection, the structure is expressed in homogeneous form with nonzero scalars. Let us denote the scale-weighted nonrigid structure associated with the i-th frame as S̄_i = [ϱ_1 X̄_1, ..., ϱ_n X̄_n], and the l-th scale-weighted shape basis as B_l = [ϱ_1 X̄_{l1}, ..., ϱ_n X̄_{ln}]. Then from (7) we have

$$\bar{X}_j = \sum_{l=1}^{k}\omega_{il}\bar{X}_{lj}, \qquad j = 1, \ldots, n \qquad (54)$$
Multiplying both sides by the weight scale ϱ_j gives

$$\varrho_j\bar{X}_j = \sum_{l=1}^{k}\omega_{il}(\varrho_j\bar{X}_{lj}), \qquad j = 1, \ldots, n \qquad (55)$$
then we can immediately have the following result.
$$S_i = \begin{bmatrix}\bar{S}_i\\ \boldsymbol{\varrho}^T\end{bmatrix} = \begin{bmatrix}\sum_{l=1}^{k}\omega_{il}B_l\\ \boldsymbol{\varrho}^T\end{bmatrix} \qquad (56)$$
We call (56) the extended Bregler's assumption for the homogeneous case. Under this extension, the quasi-perspective projection of the i-th frame can be formulated as

$$W_i = (\mu_i P_i)S_i = \big[\mu_i P_i^{(1:3)},\ \mu_i P_i^{(4)}\big]\begin{bmatrix}\sum_{l=1}^{k}\omega_{il}B_l\\ \boldsymbol{\varrho}^T\end{bmatrix} = \big[\omega_{i1}\mu_i P_i^{(1:3)}, \ldots, \omega_{ik}\mu_i P_i^{(1:3)},\ \mu_i P_i^{(4)}\big]\begin{bmatrix}B_1\\ \vdots\\ B_k\\ \boldsymbol{\varrho}^T\end{bmatrix} \qquad (57)$$
Thus the nonrigid factorization under quasi-perspective projection can be expressed as

$$W_{3m\times n} = \begin{bmatrix}\omega_{11}\mu_1 P_1^{(1:3)} & \cdots & \omega_{1k}\mu_1 P_1^{(1:3)} & \mu_1 P_1^{(4)}\\ \vdots & \ddots & \vdots & \vdots\\ \omega_{m1}\mu_m P_m^{(1:3)} & \cdots & \omega_{mk}\mu_m P_m^{(1:3)} & \mu_m P_m^{(4)}\end{bmatrix}\begin{bmatrix}B_1\\ \vdots\\ B_k\\ \boldsymbol{\varrho}^T\end{bmatrix} \qquad (58)$$
or, expressed concisely in matrix form,

$$W_{3m\times n} = M_{3m\times(3k+1)}\,B_{(3k+1)\times n} \qquad (59)$$
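The rank bound implied by (58)–(59) can be checked numerically; all quantities in this sketch are random stand-ins, not real data.

```python
import numpy as np

# A tracking matrix generated from k shape bases under the quasi-perspective
# model has rank at most 3k + 1, since W = M B with B having 3k + 1 rows.
rng = np.random.default_rng(7)
m, n, k = 12, 60, 3
bases = rng.standard_normal((k, 3, n))      # stand-ins for the bases B_l
rho = rng.standard_normal((1, n))           # common scale row of (56)
blocks = []
for i in range(m):
    P = rng.standard_normal((3, 4))         # stand-in for mu_i * P_i
    w = rng.standard_normal(k)              # weights omega_il
    Si = np.vstack([sum(w[l] * bases[l] for l in range(k)), rho])   # (56)
    blocks.append(P @ Si)                   # W_i of (57)
W = np.vstack(blocks)                       # 3m x n tracking matrix
print(np.linalg.matrix_rank(W))             # at most 3k + 1 = 10
```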
The factorization expression is similar to (9); however, the difficult problem of estimating the projective depths is avoided here. The rank of the tracking matrix is at most 3k + 1, and the factorization is again defined up to a transformation matrix H ∈ R^{(3k+1)×(3k+1)}. Suppose the SVD factorization of the tracking matrix under this rank constraint is W = M̂B̂.
Similar to the rigid case, we can adopt the metric constraints to compute an upgrading matrix. Let us partition the matrix into k + 1 parts as

$$H = [H_1, \ldots, H_k \mid H_r] \qquad (60)$$
where Hl ∈ R(3k+1)×3 (l = 1, . . . , k) denotes the l-th triple
columns of H, and Hr denotes the last column of H. Then
we have
(1:3)
M̂i Hl = ωil μi Pi
= ωil μi Ki Ri ,
(4)
M̂i Hr = μi Pi = μi Ki Ti
(61)
(62)
Similar to (38) in the rigid case, let us denote C_{ii'} = M̂_i Q_l M̂_{i'}^T with Q_l = H_l H_l^T; then we get

$$C_{ii'} = \hat{M}_i Q_l \hat{M}_{i'}^T = (\omega_{il}\mu_i K_i R_i)(\omega_{i'l}\mu_{i'}K_{i'}R_{i'})^T = \omega_{il}\omega_{i'l}\mu_i\mu_{i'}K_i\big(R_i R_{i'}^T\big)K_{i'}^T \qquad (63)$$
where i and i' (= 1, ..., m) correspond to different frame numbers, and l = 1, ..., k corresponds to different shape bases.
Assuming a simplified camera model with only one parameter, K_i = diag(f_i, f_i, 1), we have for i' = i

$$C_{ii} = \hat{M}_i Q_l \hat{M}_i^T = \omega_{il}^2\mu_i^2\,\mathrm{diag}\big(f_i^2, f_i^2, 1\big) \qquad (64)$$
from which we can obtain the following four constraints:

$$\begin{cases} f_1(Q_l) = C_{ii}(1,2) = 0\\ f_2(Q_l) = C_{ii}(1,3) = 0\\ f_3(Q_l) = C_{ii}(2,3) = 0\\ f_4(Q_l) = C_{ii}(1,1) - C_{ii}(2,2) = 0 \end{cases} \qquad (65)$$
The above constraints are similar to (39) in the rigid case. However, the matrix Q_l in (64) is a (3k + 1) × (3k + 1) symmetric matrix. According to Proposition 5, it has only 9k degrees of freedom, since it can be decomposed into the product of a (3k + 1) × 3 vertical extended upper triangular matrix and its transpose. Given m frames, we have 4m linear constraints on Q_l. It appears that if we have enough features and frames, the matrix Q_l can be solved linearly by stacking all the constraints in (65). Unfortunately, the rotation constraints alone may be insufficient when an object deforms at varying speed, since most of the constraints are redundant. Xiao and Kanade (2005) proposed a basis constraint to resolve this ambiguity. The main idea is to select k frames that include independent shapes and treat them as a set of bases. Suppose the first k frames are independent of each other; then their corresponding weighting coefficients can be set as
$$\omega_{il} = \begin{cases} 1 & \text{if } i, l = 1, \ldots, k \text{ and } i = l\\ 0 & \text{if } i, l = 1, \ldots, k \text{ and } i \neq l \end{cases} \qquad (66)$$
From (63) we can obtain the following basis constraint:

$$C_{ii'} = \begin{bmatrix}0 & 0 & 0\\ 0 & 0 & 0\\ 0 & 0 & 0\end{bmatrix} \quad \text{if } i = 1, \ldots, k,\ i' = 1, \ldots, m,\ \text{and } i \neq l \qquad (67)$$
Given m images, (67) can provide 9m(k − 1) linear constraints on the matrix Q_l (some of the constraints are redundant since Q_l is symmetric). By combining the rotation constraint (65) and the basis constraint (67), the matrix Q_l can be computed linearly. Each H_l, l = 1, ..., k, can then be decomposed from Q_l according to the following result.
Result 13 The matrix Q_l is a (3k + 1) × (3k + 1) positive semidefinite symmetric matrix of rank 3. It can be decomposed as Q_l = H_l H_l^T, where H_l is a (3k + 1) × 3 matrix of rank 3. The decomposition can be further written as Q_l = Λ_3 Λ_3^T with Λ_3 being a (3k + 1) × 3 vertical extended upper triangular matrix.
The result can be easily derived from Proposition 5. Note that Proposition 9 is still valid for the nonrigid case. Thus the vector H_r in (60) can be set as an arbitrary (3k + 1)-vector that is independent of all the columns in {H_l}_{l=1,...,k}. After recovering the Euclidean upgrading matrix, the camera parameters, motions, shape bases, and weighting coefficients can be easily determined from the motion matrix M = M̂H and the shape matrix B = H^{-1}B̂.
6 Evaluations on Synthetic Data
6.1 Evaluation on Quasi-perspective Projection
During the simulation, we randomly generated 200 points
within a cube of 20 × 20 × 20 in space as shown in Fig. 2(a),
where we only displayed the first 50 points for simplicity. The depth variation in Z-direction of the space points
is shown in Fig. 2(b). We simulated 10 images from these
points by perspective projection. The image size is set at
Fig. 2 Evaluation on projective depth approximation of the first 50 points. (a) and (b) Coordinates of the synthetic space points (c) and (d) The
real and the approximated projective depths under quasi-perspective assumption
800 × 800. The camera parameters are set as follows: The
focal lengths are set randomly between 900 and 1100, the
principal point is set at the image center, and the skew is
zero. The rotation angles are set randomly between ±5◦ .
The X and Y positions of the cameras are set randomly
between ±15, while the Z positions are set evenly from
200 to 220. The true projective depths λij associated with
these points across 10 different views are shown in Fig. 2(c),
where the values are given after normalization so that they
have unit mean value. We then estimate λ_{1j} and μ_i from (13) and (14), and construct the estimated projective depths as λ̂_ij = λ_{1j}/μ_i. The registered result is shown in Fig. 2(d).
We can see from experiment that the recovered projective
depths are very close to the ground truths, and are generally
proportional to the variation of space points in Z-direction.
If we adopt an affine camera model, it is equivalent to setting all the projective depths to 1. The error is obviously much bigger than that under the quasi-perspective assumption.
According to projection equations (28) to (32), different images will be obtained if we adopt different camera
models. Here we generated three sets of images using the
simulated space points via general perspective projection
model, affine camera model, and quasi-perspective projection model. We compared the errors of quasi-perspective
projection model (31) and affine assumption (32). The mean
errors of different models in each frame are shown in
Fig. 3(a); the histogram distribution of the errors for all 200 points across 10 frames is shown in Fig. 3(b). From the result, we can see that the error of the quasi-perspective assumption is much smaller than that under the affine assumption.
The influence of different imaging conditions on the quasi-perspective assumption is also investigated. Initially, we fix the camera position as given in the first test and vary the amplitude of the rotation angles from ±5° to ±50° in steps of 5°. At
each step, we check the relative error of recovered projective
depths, which is defined as

$$e_{ij} = \frac{|\lambda_{ij} - \hat{\lambda}_{ij}|}{\lambda_{ij}}\times 100\ (\%) \qquad (68)$$
where λ̂ij is the estimated projective depth. We carried out
100 independent tests at each step so as to obtain a statistically meaningful result. The mean and standard deviation
of eij are shown in Fig. 4(a). We then fix the rotation angles at ±5◦ and vary the relative distance of a camera to
an object (i.e. the ratio between the distance of a camera to
an object center and that of the object depth) from 2 to 20
in a step of 2. The mean and standard deviation of eij at
each step for 100 tests are shown in Fig. 4(b). The result
shows that the quasi-perspective projection is a good approximation (e_ij < 0.5%) when the rotation angles are less than ±35° and the relative distance is larger than 6. Please note that the result is obtained from noise-free data.

Fig. 3 Evaluation of the imaging errors by different camera models. (a) The mean error in each frame. (b) The histogram distribution of the errors under quasi-perspective and affine projection model

Fig. 4 Evaluation on quasi-perspective projection under different imaging conditions. (a) The relative error of the estimated depths with different rotation angles. (b) The relative error with respect to different relative distances
6.2 Evaluation on Rigid Factorization
We added Gaussian white noise to the initially generated
10 images, and varied the noise level from 0 to 3 pixels
with a step of 0.5. At each noise level, we reconstructed the
3D structure of the object, which is defined up to a similarity transformation with respect to the ground truth. We registered the reconstructed model with the ground truth and calculated the reconstruction error, defined as the mean point-wise distance between the reconstructed structure and the ground truth.
The mean and standard deviation of the error on 100 independent tests are shown in Fig. 5. The proposed algorithm
(Quasi) is compared with (Poelman and Kanade 1997) under affine assumption (Affine) and (Han and Kanade 2000)
under perspective projection (Persp). We then take these solutions as initial values and perform the perspective optimization through LM iterations. It is evident that the pro-
posed method performs much better than that of affine, the
optimized solution (Quasi+LM) is very close to perspective
projection with optimization (Persp+LM).
The proposed model is based on the assumption of a large relative camera-to-object distance and small camera rotations. We compared the effect of these two factors on different camera models. In the first case, we vary the relative distance from 4 to 18 in steps of 2. At each relative distance,
we generated 20 images with the following parameters. The
rotation angles are confined between ±5◦ , the X and Y positions of the camera are set randomly between ±15. We recovered the structure and computed the reconstruction error
for each group of images. The mean error by different methods is shown in Fig. 6(a). In the second case, we increase
the rotation angles to the range of ±20◦ , and retain other
camera parameters as in the first case. The mean reconstruction error is given in Fig. 6(b). The results are evaluated on
100 independent tests with 1-pixel Gaussian noise. We can
obtain the following conclusions from the results. (1) The error by quasi-perspective projection is consistently less than
that by affine, especially at small relative distances. (2) Both
Fig. 5 Evaluation on rigid factorization. The mean (a) and standard deviation (b) of the reconstruction errors by different algorithms at different
noise levels
Fig. 6 The mean reconstruction error of different projection models with respect to varying relative distance. The rotation angles of the camera
are confined to a range of (a) ±5◦ and (b) ±20◦
reconstruction errors by affine and quasi-perspective projection increase greatly when the relative distance is less than
6, since both models are based on large distance assumption.
(3) The error at each relative distance increases with the rotation angles, especially at small relative distances, since the
projective depths are related with rotation angles. (4) Theoretically the relative distance and rotation angles have no
influence on the result of full perspective projection. However, we can see that the error by perspective projection also
increases slightly with an increase in rotation angles and the
decrease in relative distance. This is because we estimate the
projective depths iteratively, starting with an affine assumption (Han and Kanade 2000), and the iteration easily gets stuck in local minima due to bad initialization.
We compared the computation time of different factorization algorithms without LM optimization. The program was
implemented with Matlab 6.5 on an Intel Pentium 4 3.6 GHz
CPU. In this test, we use all 200 feature points and vary the frame number from 5 to 200 so as to generate different data sizes. The actual computation times (seconds) for the different data sets are listed in Table 2, where the computation time
Table 2 The average computation time (s) of different algorithms

Frame number    5      10     50     100    150    200
Affine          0.015  0.015  0.031  0.097  0.156  0.219
Quasi           0.015  0.016  0.047  0.156  0.297  0.531
Persp           0.281  0.547  3.250  6.828  10.58  15.25
for perspective projection is taken under 10 iterations (it
usually takes about 30 iterations to compute the projective
depths in perspective factorization). Clearly, the computation time of quasi-perspective is at the same level as that under the affine assumption, while perspective factorization is computationally more intensive than the other methods.
6.3 Evaluation on Nonrigid Factorization
In this test, we generated a synthetic cube with 6 evenly distributed points on each visible edge. There are three sets of
moving points on adjacent surfaces of the cube that move
on the surfaces at constant speed as shown in Fig. 7(a), each
Fig. 7 Simulation result on nonrigid factorization. (a) Two synthetic cubes with moving points in space. (b) The quasi-perspective factorization
result of the two frames superimposed with the ground truth. (c) The final structures after optimization
Fig. 8 Evaluation on nonrigid factorization. The mean (a) and standard deviation (b) of the reconstruction errors by different algorithms at
different noise levels
moving set is composed of 5 points. The cube with moving
points can be taken as a nonrigid object with 2 shape bases.
We generated 10 frames with the same camera parameters
as in the first test of rigid case. We reconstructed the structure associated with each frame by the proposed method as
shown in Fig. 7(b) and (c). We can see that the structure after
optimization is visually the same as the ground truth, while
the result before optimization is slightly deformed due to the perspective effect.
We compared our method with the nonrigid factorization
under affine assumption (Xiao et al. 2006) and that under
perspective projection (Xiao and Kanade 2005). The mean
and standard deviation of the reconstruction errors with respect to different noise levels are shown in Fig. 8. It is clear
that the proposed method performs much better than that under affine camera model.
7 Evaluation on Real Image Sequences
We tested our proposed method on many real sequences,
and we report results of four experiments here. All images
in the test, except those in the Franck face sequence, were
captured by a Canon PowerShot G3 camera at a resolution of 1024 × 768. In order to ensure a large overlap of the object to be reconstructed, the camera undergoes small movement during image acquisition; hence the quasi-perspective
Fig. 9 Reconstruction result of the stone post sequence. (a) Three images from the sequence, where the tracked features with relative disparities
are overlaid to the second and the third images. (b) The reconstructed VRML model of the scene shown from different viewpoints with texture
mapping. (c) The corresponding triangulated wireframe of the reconstructed model
assumption is satisfied for all these sequences. Please refer to the supplemental video for details of these test results.
7.1 Test on Stone Post Sequence
There are 8 images in the stone post sequence, which were taken at the Sculpture Park near downtown Windsor. We established the initial correspondences by utilizing the technique in Wang (2006) and eliminated outliers iteratively as in Torr et al. (1998). In total, 3693 reliable features were tracked across the sequence; the features in two frames with relative disparities are shown in Fig. 9. We recovered the 3D structure
of the object and camera motions by utilizing the proposed
algorithm, as well as some previous methods. The recovered
camera focal lengths are listed in Table 3, where we give the result of the first frame only due to limited space; 'Quasi+LM',
‘Affine+LM’, and ‘Persp+LM’ stand for quasi-perspective,
affine, and perspective factorization with global optimization, respectively. Figure 9 shows the reconstructed VRML
model with texture and corresponding triangulated wireframe viewed from different viewpoints. The reconstructed
model is visually plausible and realistic.
In order to give a comparative quantitative evaluation, we reproject the reconstructed 3D structure back onto the images and calculate the reprojection errors, defined as the distances between the detected and reprojected image points. Figure 10 shows the histogram distributions of the errors using
9 bins. The corresponding mean (‘Mean’) and standard deviation (‘STD’) of the errors are listed in Table 3. We can see
that the reprojection error by our proposed model is much
smaller than that under affine assumption.
7.2 Test on Fountain Base Sequence
There are 7 images in the fountain base sequence, which
were also taken at the Sculpture Park of Windsor. The correspondences were established using the same technique as in the previous test. In total, 4218 reliable features were tracked across
the sequence as shown in Fig. 11(a). Figure 11(b) and (c)
show the reconstructed VRML model with texture mapping
and the corresponding triangulated wireframe from different
viewpoints. The model looks realistic and most details are
correctly recovered by the method. A comparative analysis of camera parameters and reprojection errors is presented in Table 3 and Fig. 10, respectively. We can see from the results that our proposed scheme outperforms that under the affine
camera model.
7.3 Test on Dynamic Grid Sequence
There are 12 images in the dynamic grid sequence. The
background of the sequence is two orthogonal sheets with
square grids, which are used as ground truth for evaluation. On the two orthogonal surfaces, there are three moving objects that move linearly in three directions. We established correspondences using the method of Wang (2006) and eliminated outliers interactively. In total, 206 features were
Table 3 Camera parameters of the first frame and reprojection errors in the real sequence tests

Sequence        Method      Focus (f)   Mean    STD
Stone post      Quasi+LM    2151.8      0.421   0.292
                Affine+LM   2167.3      0.667   0.461
                Persp+LM    2154.6      0.237   0.164
Fountain base   Quasi+LM    2140.5      0.418   0.285
                Affine+LM   2153.4      0.629   0.439
                Persp+LM    2131.7      0.240   0.168
tracked across the sequence, where 140 features belong to
static background and 66 features belong to the three moving objects, as shown in Fig. 12(a). We recovered the metric structure of the scenario by utilizing the proposed method. Figure 12(b) and (c) show the reconstructed VRML models and
corresponding wireframes associated with two dynamic positions. It is clear that the dynamic structure is correctly recovered.
The background of this sequence is two orthogonal sheets with square grids. We take this as ground truth and compute the angle (in degrees) between the two reconstructed surfaces of the orthogonal background, the length ratio of the two diagonals of each square grid, and the angle formed by the two diagonals. The mean errors of these three values are denoted by Eα1, Erat, and Eα2, respectively. The mean reprojection error Erep1 of the reconstructed structure is also computed.
As a comparison, the results obtained by different methods
are listed in Table 4. The result by the proposed model outperforms that of affine.
7.4 Test on Franck Face Sequence
The Franck face sequence was downloaded from the European working group on face and gesture recognition
(www-prima.inrialpes.fr/FGnet/). We selected 60 frames
with various facial expressions for the test. The image resolution is 720 × 576, and there are 68 tracked features across the sequence, which were also downloaded from the internet.
Figure 13 shows the reconstructed models of four frames
utilizing our proposed method. Different facial expressions
are correctly recovered, though some points are not very accurate due to tracking errors. The result could be used for
visualization and recognition. For analysis, the relative reprojection errors Erep2 generated by the different methods are listed in Table 4. We can see that in all these tests, the accuracy of the proposed method is fairly close to that of full perspective projection and much better than that under the affine assumption.
Fig. 10 The histogram distributions of the reprojection errors by different algorithms in real sequence test. (a) Result of stone post sequence.
(b) Result of fountain base sequence
Fig. 11 Reconstruction result of the fountain base sequence. (a) Three images from the sequence, where the tracked features with relative disparities are overlaid to the second and the third images. (b) The reconstructed VRML model of the scene shown from different viewpoints with texture
mapping. (c) The corresponding triangulated wireframe of the reconstructed model
Table 4 Performance comparison on the grid and face sequences

Method      Eα1    Eα2    Erat   Erep1   Erep2
Quasi       1.62   0.75   0.12   4.37    5.26
Affine      2.35   0.92   0.15   5.66    6.58
Persp       1.28   0.63   0.10   3.64    4.35
Quasi+LM    0.58   0.26   0.04   1.53    2.47
Affine+LM   0.96   0.37   0.07   2.25    3.19
Persp+LM    0.52   0.24   0.04   1.46    1.96
8 Conclusion
In this paper, we proposed a quasi-perspective projection
model and analyzed the projection errors of different projection models. We applied the proposed model to rigid and nonrigid factorization and elaborated the computation details of the Euclidean upgrading matrix. The proposed method
avoids the difficult problem of computing projective depths
in perspective factorization. It is computationally simple
with better accuracy than affine approximation. The proposed model is suitable for structure and motion factorization of a short sequence with small camera motions. Experiments demonstrated improvements of our algorithm over
existing techniques. It should be noted that the small-rotation assumption of the proposed model is not very restrictive and is usually satisfied in many real applications. During image acquisition of an object to be reconstructed, we tend to control the camera movement so as to guarantee a large overlapping part, which also facilitates the feature tracking process.
Fig. 12 Reconstruction results of the dynamic grid sequence. (a) Three images from the sequence overlaid with the tracked features and relative
disparities shown in the second and the third images, please note the three moving objects. (b) The reconstructed VRML model of the structure
shown from different viewpoints with texture mapping. (c) The corresponding triangulated wireframe of the reconstructed model
For a long sequence of images taken around an object, the
assumption is violated. However, we can simply divide the
sequence into several subsequences with small movements,
then register and merge the result of each subsequence to
reconstruct the structure of the whole object.
Acknowledgements The authors would like to thank the anonymous
reviewers for their valuable comments and constructive suggestions.
The work is supported in part by Natural Sciences and Engineering
Research Council of Canada, and the National Natural Science Foundation of China under Grant No. 60575015.
Appendix 1: Proof of Proposition 5
Extended Cholesky Decomposition: Suppose Q_n is an n × n positive semidefinite symmetric matrix of rank k. Then it can be decomposed as Q_n = H_k H_k^T, where H_k is an n × k matrix of rank k. Furthermore, the decomposition can be written as Q_n = Λ_k Λ_k^T with Λ_k an n × k vertical extended upper triangular matrix. The degree of freedom of the matrix Q_n is nk − k(k − 1)/2, which is the number of unknowns in Λ_k.
Proof Since Q_n is an n × n positive semidefinite symmetric matrix of rank k, it can be decomposed by SVD as

$$Q_n = U\Sigma U^T = U\,\mathrm{diag}\big(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0\big)\,U^T \qquad (69)$$
where U is an n × n orthogonal matrix and Σ is a diagonal matrix with σ_i the singular values of Q_n. Thus we can immediately
Fig. 13 Reconstruction of different facial expressions in Franck face sequence. (a) Four frames from the sequence with the 68 tracked features overlaid to the last frame. (b) Front, side, and top views of the reconstructed VRML models with texture mapping. (c) The corresponding
triangulated wireframe of the reconstructed model
have

$$H_k = U^{(1:k)}\,\mathrm{diag}\big(\sqrt{\sigma_1}, \ldots, \sqrt{\sigma_k}\big) = \begin{bmatrix}H_{ku}\\ H_{kl}\end{bmatrix} \qquad (70)$$
such that Q_n = H_k H_k^T, where U^{(1:k)} denotes the first k columns of U, H_ku denotes the upper (n − k) × k submatrix of H_k, and H_kl denotes the lower k × k submatrix of H_k. By applying RQ-decomposition to H_kl, we have H_kl = Λ_kl O_k, where Λ_kl is an upper triangular matrix and O_k is an orthogonal matrix.
Let us denote H_ku O_k^T as Λ_ku, and construct an n × k vertical extended upper triangular matrix Λ_k = [Λ_ku; Λ_kl]. Then we have H_k = Λ_k O_k, and

$$Q_n = H_k H_k^T = (\Lambda_k O_k)(\Lambda_k O_k)^T = \Lambda_k\Lambda_k^T \qquad (71)$$

It is easy to verify that the degree of freedom of the matrix Q_n (i.e., the number of unknowns in Λ_k) is nk − k(k − 1)/2. □
The proposition can be regarded as an extension of the Cholesky decomposition to positive semidefinite symmetric matrices, whereas the standard Cholesky decomposition applies only to positive definite symmetric matrices.
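The degree-of-freedom count in Proposition 5 follows from direct enumeration: the lower $k \times k$ block of $\mathbf{\Lambda}_k$ is upper triangular with $\frac{1}{2}k(k+1)$ free entries, and the upper $(n-k) \times k$ block is full, giving $(n-k)k + \frac{1}{2}k(k+1) = nk - \frac{1}{2}k(k-1)$. This counting argument can be checked with a minimal Python sketch (not part of the paper; the function name is ours):

```python
def dof_vertical_extended_upper_triangular(n, k):
    """Count the free entries of an n x k matrix whose lower k x k block
    is upper triangular and whose upper (n - k) x k block is full."""
    full_block = (n - k) * k              # upper (n - k) x k submatrix
    triangular_block = k * (k + 1) // 2   # upper triangular k x k submatrix
    return full_block + triangular_block

# Agrees with the closed form nk - k(k - 1)/2 of Proposition 5
for n in range(1, 10):
    for k in range(1, n + 1):
        assert dof_vertical_extended_upper_triangular(n, k) == n * k - k * (k - 1) // 2
```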
Appendix 2: Proof of Proposition 9
Recovery of $\mathbf{H}_r$: Suppose $\mathbf{H}_l$ in (35) has already been recovered. Let us construct a matrix $\tilde{\mathbf{H}} = [\mathbf{H}_l | \tilde{\mathbf{H}}_r]$, where $\tilde{\mathbf{H}}_r$ is an arbitrary 4-vector that is independent of the three columns of $\mathbf{H}_l$. Then $\tilde{\mathbf{H}}$ must be a valid upgrading matrix, i.e., $\tilde{\mathbf{M}} = \hat{\mathbf{M}}\tilde{\mathbf{H}}$ is a valid Euclidean motion matrix, and $\tilde{\mathbf{S}} = \tilde{\mathbf{H}}^{-1}\hat{\mathbf{S}}$ corresponds to a valid Euclidean shape matrix.
Proof Suppose the correct transformation matrix is $\mathbf{H} = [\mathbf{H}_l | \mathbf{H}_r]$; then from
$$\mathbf{S} = \mathbf{H}^{-1}\hat{\mathbf{S}} = \begin{bmatrix} \lambda_1\bar{\mathbf{X}}_1, & \ldots, & \lambda_n\bar{\mathbf{X}}_n \\ \lambda_1, & \ldots, & \lambda_n \end{bmatrix} \tag{72}$$
we can obtain one correct Euclidean structure $[\bar{\mathbf{X}}_1, \ldots, \bar{\mathbf{X}}_n]$ of the object, under a certain world coordinate frame, by dehomogenizing the shape matrix $\mathbf{S}$. The arbitrarily constructed matrix $\tilde{\mathbf{H}} = [\mathbf{H}_l | \tilde{\mathbf{H}}_r]$ and the correct matrix $\mathbf{H}$ are defined up to a $4 \times 4$ invertible matrix $\mathbf{G}$ of the form
$$\mathbf{H} = \tilde{\mathbf{H}}\mathbf{G}, \quad \mathbf{G} = \begin{bmatrix} \mathbf{I}_3 & \mathbf{g} \\ \mathbf{0}^T & s \end{bmatrix} \tag{73}$$
where $\mathbf{I}_3$ is a $3 \times 3$ identity matrix, $\mathbf{g}$ is a 3-vector, $\mathbf{0}$ is a zero 3-vector, and $s$ is a nonzero scalar. Under the transformation
matrix $\tilde{\mathbf{H}}$, the motion $\hat{\mathbf{M}}$ and shape $\hat{\mathbf{S}}$ are transformed to
$$\tilde{\mathbf{M}} = \hat{\mathbf{M}}\tilde{\mathbf{H}} = \hat{\mathbf{M}}\mathbf{H}\mathbf{G}^{-1} = \mathbf{M}\begin{bmatrix} \mathbf{I}_3 & -\mathbf{g}/s \\ \mathbf{0}^T & 1/s \end{bmatrix} \tag{74}$$
$$\tilde{\mathbf{S}} = \tilde{\mathbf{H}}^{-1}\hat{\mathbf{S}} = (\mathbf{H}\mathbf{G}^{-1})^{-1}\hat{\mathbf{S}} = \mathbf{G}(\mathbf{H}^{-1}\hat{\mathbf{S}}) = s\begin{bmatrix} \lambda_1(\bar{\mathbf{X}}_1 + \mathbf{g})/s & \cdots & \lambda_n(\bar{\mathbf{X}}_n + \mathbf{g})/s \\ \lambda_1 & \cdots & \lambda_n \end{bmatrix} \tag{75}$$
We can see from (75) that the new shape $\tilde{\mathbf{S}}$ is simply the original structure translated by $\mathbf{g}$ and scaled by $1/s$, which does not change the Euclidean structure. From (74) we have $\tilde{\mathbf{M}}^{(1:3)} = \mathbf{M}^{(1:3)}$, which indicates that the first three columns of the new motion matrix (corresponding to the rotation part) do not change, while the last column, which corresponds to the translation part, is modified in accordance with the translation and scale changes of the structure. Therefore, the constructed matrix $\tilde{\mathbf{H}}$ is a valid transformation matrix that upgrades the factorization from projective space to Euclidean space.
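The key step in (74) is the block form of $\mathbf{G}^{-1}$. As a numerical sanity check (a minimal Python sketch, not part of the paper; the test values of $\mathbf{g}$ and $s$ are arbitrary), multiplying $\mathbf{G}$ from (73) by the claimed inverse yields the $4 \times 4$ identity:

```python
def matmul(A, B):
    """Plain dense matrix product of nested lists."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def G_matrix(g, s):
    """G = [[I3, g], [0^T, s]] as in (73)."""
    return [[1, 0, 0, g[0]],
            [0, 1, 0, g[1]],
            [0, 0, 1, g[2]],
            [0, 0, 0, s]]

def G_inverse(g, s):
    """Claimed block inverse [[I3, -g/s], [0^T, 1/s]] used in (74)."""
    return [[1, 0, 0, -g[0] / s],
            [0, 1, 0, -g[1] / s],
            [0, 0, 1, -g[2] / s],
            [0, 0, 0, 1 / s]]

g, s = [2.0, -1.0, 3.0], 4.0
P = matmul(G_matrix(g, s), G_inverse(g, s))
assert all(abs(P[i][j] - (1.0 if i == j else 0.0)) < 1e-12
           for i in range(4) for j in range(4))
```

Since $\mathbf{G}$ acts on homogeneous columns $[\lambda_i\bar{\mathbf{X}}_i; \lambda_i]$ as a translation by $\mathbf{g}$ followed by a rescaling of the homogeneous coordinate by $s$, the dehomogenized structure is preserved up to similarity, as (75) states.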
References
Bascle, B., & Blake, A. (1998). Separability of pose and expression in facial tracking and animation. In Proceedings of the international conference on computer vision (pp. 323–328) 1998.
Brand, M. (2001). Morphable 3D models from video. In Proceedings
of IEEE conference on computer vision and pattern recognition
(Vol. 2, pp. 456–463) 2001.
Brand, M. (2005). A direct method for 3D factorization of nonrigid motion observed in 2D. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 122–128) 2005.
Bregler, C., Hertzmann, A., & Biermann, H. (2000). Recovering nonrigid 3D shape from image streams. In Proceedings of IEEE
conference on computer vision and pattern recognition (Vol. 2,
pp. 690–696) 2000.
Buchanan, A. M., & Fitzgibbon, A. W. (2005). Damped Newton algorithms for matrix factorization with missing data. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 316–322) 2005.
Chen, P. (2008). Optimization algorithms on subspaces: revisiting
missing data problem in low-rank matrix. International Journal
of Computer Vision, 80(1), 125–142.
Christy, S., & Horaud, R. (1996). Euclidean shape and motion from
multiple perspective views by affine iterations. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 18(11), 1098–
1104.
Costeira, J., & Kanade, T. (1998). A multibody factorization method for independently moving objects. International Journal of Computer Vision, 29(3), 159–179.
Del Bue, A., Smeraldi, F., & de Agapito, L. (2004). Non-rigid structure from motion using nonparametric tracking and non-linear optimization. In IEEE workshop in articulated and nonrigid motion
ANM04, held in conjunction with CVPR2004 (pp. 8–15), June
2004.
Del Bue, A., Lladó, X., & de Agapito, L. (2006). Non-rigid metric
shape and motion recovery from uncalibrated images using priors. In Proceedings of IEEE conference on computer vision and
pattern recognition (Vol. 1, pp. 1191–1198) 2006.
Han, M., & Kanade, T. (2000). Creating 3D models with uncalibrated
cameras. In Proceedings of IEEE computer society workshop
on the application of computer vision (WACV2000), December
2000.
Hartley, R. (1997). Kruppa’s equations derived from the fundamental
matrix. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2), 133–135.
Hartley, R., & Schaffalitzky, F. (2003). PowerFactorization: 3D reconstruction with missing or uncertain data. In Australia-Japan advanced workshop on computer vision, 2003.
Hartley, R., & Vidal, R. (2008). Perspective nonrigid shape and motion
recovery. In ECCV (1), Lecture notes in computer science: Vol.
5302 (pp. 276–289). Berlin: Springer.
Hartley, R., & Zisserman, A. (2004). Multiple view geometry in computer vision (2nd edn.). Cambridge: Cambridge University Press.
Heyden, A., & Åström, K. (1997). Euclidean reconstruction from image sequences with varying and unknown focal length and principal point. In IEEE conference on computer vision and pattern recognition (pp. 438–443) 1997.
Heyden, A., Berthilsson, R., & Sparr, G. (1999). An iterative factorization method for projective structure and motion from image
sequences. Image and Vision Computing, 17(13), 981–991.
Li, T., Kallem, V., Singaraju, D., & Vidal, R. (2007). Projective factorization of multiple rigid-body motions. In IEEE conference on
computer vision and pattern recognition, 2007.
Luong, Q., & Faugeras, O. (1997). Self-calibration of a moving camera from point correspondences and fundamental matrices. International Journal of Computer Vision, 22(3), 261–289.
Mahamud, S., & Hebert, M. (2000). Iterative projective reconstruction
from multiple views. In IEEE conference on computer vision and
pattern recognition (Vol. 2, pp. 430–437) 2000.
Maybank, S., & Faugeras, O. (1992). A theory of self-calibration of a
moving camera. International Journal of Computer Vision, 8(2),
123–151.
Oliensis, J., & Hartley, R. (2007). Iterative extensions of the
Sturm/Triggs algorithm: convergence and nonconvergence. IEEE
Transactions on Pattern Analysis and Machine Intelligence,
29(12), 2217–2233.
Poelman, C., & Kanade, T. (1997). A paraperspective factorization method for shape and motion recovery. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(3), 206–218.
Pollefeys, M., Koch, R., & Van Gool, L. (1999). Self-calibration and
metric reconstruction in spite of varying and unknown intrinsic
camera parameters. International Journal of Computer Vision,
32(1), 7–25.
Quan, L. (1996). Self-calibration of an affine camera from multiple
views. International Journal of Computer Vision, 19(1), 93–105.
Rabaud, V., & Belongie, S. (2008). Re-thinking non-rigid structure
from motion. In IEEE conference on computer vision and pattern
recognition, 2008.
Sturm, P. F., & Triggs, B. (1996). A factorization based algorithm for
multi-image projective structure and motion. In European conference on computer vision (2) (pp. 709–720) 1996.
Tomasi, C., & Kanade, T. (1992). Shape and motion from image
streams under orthography: a factorization method. International
Journal of Computer Vision, 9(2), 137–154.
Torr, P. H. S., Zisserman, A., & Maybank, S. J. (1998). Robust detection of degenerate configurations while estimating the fundamental matrix. Computer Vision and Image Understanding, 71(3),
312–333.
Torresani, L., Yang, D. B., Alexander, E. J., & Bregler, C. (2001).
Tracking and modeling non-rigid objects with rank constraints.
In IEEE conference on computer vision and pattern recognition
(Vol. 1, pp. 493–500) 2001.
Torresani, L., Hertzmann, A., & Bregler, C. (2008). Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5), 878–892.
Triggs, B. (1996). Factorization methods for projective structure and
motion. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 845–851). San Francisco, California, USA, 1996.
Vidal, R., & Abretske, D. (2006). Nonrigid shape and motion from
multiple perspective views. In European conference on computer
vision (2). Lecture notes in computer science: Vol. 3952 (pp. 205–
218). Berlin: Springer.
Vidal, R., Tron, R., & Hartley, R. (2008). Multiframe motion segmentation with missing data using powerfactorization and GPCA. International Journal of Computer Vision, 79(1), 85–105.
Wang, G. (2006). A hybrid system for feature matching based on SIFT
and epipolar constraints. (Tech. Rep.). Department of Electrical
and Computer Engineering, University of Windsor.
Wang, G., Tsui, H.-T., & Wu, J. (2008). Rotation constrained power
factorization for structure from motion of nonrigid objects. Pattern Recognition Letters, 29(1), 72–80.
Wang, G., & Wu, Q. J. (2008a). Stratification approach for 3D Euclidean reconstruction of nonrigid objects from uncalibrated image sequences. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 38(1), 90–101.
Wang, G., & Wu, J. (2008b). Quasi-perspective projection with applications to 3D factorization from uncalibrated image sequences.
In IEEE conference on computer vision and pattern recognition,
2008.
Wang, G., Wu, J., & Zhang, W. (2008). Camera self-calibration and
three dimensional reconstruction under quasi-perspective projection. In Proceedings Canadian conference on computer and robot
vision (pp. 129–136) 2008.
Xiao, J., & Kanade, T. (2005). Uncalibrated perspective reconstruction
of deformable structures. In Proceedings of the international conference on computer vision (Vol. 2, pp. 1075–1082) 2005.
Xiao, J., Chai, J., & Kanade, T. (2006). A closed-form solution to nonrigid shape and motion recovery. International Journal of Computer Vision, 67(2), 233–246.
Yan, J., & Pollefeys, M. (2005). A factorization-based approach to articulated motion recovery. In IEEE conference on computer vision
and pattern recognition (2) (pp. 815–821) 2005.
Yan, J., & Pollefeys, M. (2008). A factorization-based approach for
articulated nonrigid shape, motion and kinematic chain recovery
from video. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 30(5), 865–877.