Perception of Structure from Motion:
Lower Bound Results
John Aloimonos and Amit Bandyopadhyay
Department of Computer Science
The University of Rochester
Rochester, NY 14627
TR 158
March 1985
Abstract
This paper investigates lower bounds in relation to the structure from
motion problem, i.e., the minimal number of points from an ensemble of
points that move in a rigid configuration and the minimal number of
projections that are required to uniquely recover the structure. We
prove that two orthographic projections of four noncoplanar points
admit only four interpretations (up to a reflection) of structure. This
forms the basis for an algorithm to recover structure from motion. We
also show that it is possible to uniquely recover structure from three
orthographic projections of three points in space, when a certain
condition holds. Furthermore, when this condition does not hold, the
number of structures compatible with the motion is at most two.
This research was supported in part by the National Science
Foundation under Grants MCS-8203306 and MCS-8104008.
Keywords: Perception, Vision Processing, Structure from motion.
1. Introduction
The interpretation of visual motion by humans and other biological organisms
is an exciting field in the study of perception. An issue here is what kinds of
mathematical analysis are adequate and lead to a biologically plausible model of
computation for the task. In this paper we examine ways and means by which a
perceptual system may be organized to detect the three dimensional structure of
rigid objects from their projected motion. The ability of the human visual system
to discern structure from motion stimulus was demonstrated in experiments by
Wallach and O'Connell in the 1950s [6]. Subsequently Gunnar Johansson [3]
discovered our ability to recognize the human form from the projected motion of
as few as ten points on the body, such as the various joints like elbows, shoulders
and knees.
It would seem that the perception of rigid structure from motion should not
require the detection of the projected trajectory of too many points. One of the
first rigorous mathematical treatments of this problem was done by S. Ullman [5].
In his classical paper on the computation of structure from motion, Ullman
showed how structure was determined uniquely (up to a reflection) from the
projected locations of four noncoplanar points, obtained at three different instants
of time. His analysis is based on the orthographic projection model. The
treatment also considered the correspondence of the four projected points
between the three frames, as available. In our analysis we too work with
orthographic projection and assume the point correspondences already given.
While it is true that the perspective or central projection model is more
appropriate for image formation in the human visual system, we will argue that
orthographic projection is a realistic simplification for this specific problem. One
reason is that at small retinal eccentricities perspective effects are small. Another
reason is that in Ullman's scheme, as well as ours, only a small number of points
are considered at a time and so orthography will serve as an adequate model.
Ullman's analysis shows that three orthographic projections of four
noncoplanar points are sufficient to recover structure. To our knowledge there has
been no attempt to investigate lower bound results in relation to this problem. In
other words, the question as to whether four points are necessary to recover
structure from three orthographic views is yet to be answered. The results
reported in this paper address this important question. We show that two
orthographic views of four noncoplanar points admit only four interpretations (up to
a reflection) of the structure of the four points. This is the nucleus of an algorithm
for recovering structure from motion. Our second result shows that it is possible to
uniquely recover structure from three orthographic views of three points in space
when a certain general condition holds. Furthermore, even when the above
condition does not hold, the number of structures compatible with the motion is at
most two.
The fact that these are lower bounds can be established by showing that the
structure compatible with two orthographic views of three points, or three
orthographic views of two points, forms a continuum and hence is not unique.
We should mention in passing that the problem of interpretation of
Johansson's "biological motion" was analysed by Hoffman and Flinchbaugh [1],
Hoffman and Bennett [2] and Webb and Aggarwal [7]. Their analysis is for
orthographic projection with the additional assumption that the axis of rotation is
fixed for the entire period of observation (i.e., equivalently, the motion is planar).
Our analysis, on the other hand, does not require the fixed axis assumption.
2. Mathematical formulation and lower bound arguments
Consider the Cartesian representation of a point in 3-D space. This is the
vector $(X, Y, Z)$. A quartet of such points can be written as $(X_i, Y_i, Z_i)$ for
$i = 1, 2, 3, 4$. Let these points move and take up new positions $(X'_i, Y'_i, Z'_i)$. Considering
rigidity, we have the fact that the motion can be represented by an affine
transformation:
$(X'_i, Y'_i, Z'_i)^T = R\,(X_i, Y_i, Z_i)^T + (\Delta X, \Delta Y, \Delta Z)^T$    (1)
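where $R$ is a 3 by 3 rotation matrix and $(\Delta X, \Delta Y, \Delta Z)$ is a translation vector. Taking
the orthographic projection of the above we have:
$X'_i = r_{11} X_i + r_{12} Y_i + r_{13} Z_i + \Delta X$    (2)
$Y'_i = r_{21} X_i + r_{22} Y_i + r_{23} Z_i + \Delta Y$    (3)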
where the elements rij of the rotation matrix depend upon three independent
parameters - the axis of rotation and the angle of rotation about this axis.
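The motion and projection model of equations (1) to (3) can be sketched as follows. This is a minimal illustration, assuming projection along the Z axis and an axis-angle (Rodrigues) parameterization of the rotation; the function names `rotation_matrix`, `move` and `project` and the sample numbers are hypothetical and do not come from the paper.

```python
import math

def rotation_matrix(axis, angle):
    """3x3 rotation matrix about `axis` by `angle` radians (Rodrigues' formula)."""
    norm = math.sqrt(sum(c * c for c in axis))
    ux, uy, uz = (c / norm for c in axis)
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [c + ux * ux * t,      ux * uy * t - uz * s, ux * uz * t + uy * s],
        [uy * ux * t + uz * s, c + uy * uy * t,      uy * uz * t - ux * s],
        [uz * ux * t - uy * s, uz * uy * t + ux * s, c + uz * uz * t],
    ]

def move(point, R, translation):
    """Rigid motion of equation (1): P' = R P + T."""
    return tuple(
        sum(R[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def project(point):
    """Orthographic projection: keep the image coordinates (X, Y), drop the depth Z."""
    return (point[0], point[1])

# Hypothetical example: a quarter turn about the Z axis followed by a translation.
R = rotation_matrix((0.0, 0.0, 1.0), math.pi / 2)
moved = move((1.0, 0.0, 2.0), R, (0.5, 0.0, -1.0))
image = project(moved)
```

Only the first two rows of $R$ and the first two translation components reach the image, which is why the depth translation does not appear among the unknowns counted in the next paragraph.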
Now if we take two views of three points, we obtain six equations in
seven variables: three for the rotation, two depth variables (we have three depths
but only relative depth can be recovered) and two for the translation. Thus we
cannot solve the problem in this case. A similar argument holds for three views of
two points and two views of four coplanar points. So, according to the above
argument, the following theorem has been proved.
Theorem 1: In general it is impossible to recover the structure of
1) Three points, given two orthographic projections of these points,
2) Two points, given three orthographic projections of these points, and
3) Four coplanar points, given two orthographic projections of these points.
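The counting argument behind Theorem 1 can be tabulated with a few lines of code. This is a minimal sketch, assuming the two-view count in the text extends to more views by adding three rotation and two image-plane translation unknowns per additional view; the function name `count_equations_and_unknowns` is hypothetical.

```python
def count_equations_and_unknowns(n_points, n_views):
    """Naive count for n_points tracked over n_views under orthographic projection.

    Each point contributes two image equations per additional view. Each
    additional view contributes three rotation and two translation unknowns,
    and the structure contributes n_points - 1 relative depths.
    """
    extra_views = n_views - 1
    equations = 2 * n_points * extra_views
    unknowns = (3 + 2) * extra_views + (n_points - 1)
    return equations, unknowns

# (points, views) pairs discussed in the text.
counts = {
    case: count_equations_and_unknowns(*case)
    for case in [(3, 2), (2, 3), (4, 2), (3, 3)]
}
underdetermined = [case for case, (eqs, unk) in counts.items() if eqs < unk]
```

Counting alone only rules out cases 1) and 2); it cannot detect the coplanar case 3), which depends on the dependency between the equations rather than on their number.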
In the sequel we are going to prove that two orthographic projections of four
noncoplanar points admit only four interpretations (up to a reflection) of the
structure of the four points, and that three orthographic projections of three
points uniquely recover the structure of these points. So, given theorem 1, the
above results will constitute lower bounds for the problem at hand. Before we
proceed, we need constraints between the structure of rigidly moving points and
their image displacements. In the next section, we develop these constraints in
lemmas 1 and 2.
3. Mathematical preliminaries
In this section we develop the constraint that was mentioned in the previous
section, in two forms, in lemmas 1 and 2.
Lemma 1 :
Given two distinct orthographic projections of three points in a rigid configuration,
the gradient $(p, q)$ of the plane that the three points define (with respect to the
coordinate system of the first frame) lies on a conic section in the gradient space.
The coefficients of this conic section depend entirely on the interframe displacements
of the above points.
Proof:
Let the three points in space be $O, A, B$ in their first position and $O', A', B'$ in
their second position, and let their projections in the two frames be $O_1, A_1, B_1$ and
$O_2, A_2, B_2$ respectively (see Figure 1). Let also the gradient of the plane $OAB$ be
$G = (p, q)$. Furthermore, let:
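$\overrightarrow{O_1A_1} = \alpha_1 = (x_1, y_1)$    (4)
$\overrightarrow{O_1B_1} = \beta_1 = (c_1, d_1)$    (5)
$\overrightarrow{O_2A_2} = \alpha_2 = (x_2, y_2)$    (6)
$\overrightarrow{O_2B_2} = \beta_2 = (c_2, d_2)$    (7)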
Figure 1. (Panels: frame 1, frame 2.)
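Considering the geometry of the first projection ($OAB$ to $O_1A_1B_1$), we have
that:
$\overrightarrow{OA} = (x_1, y_1, G \cdot \alpha_1)$    (8)
$\overrightarrow{OB} = (c_1, d_1, G \cdot \beta_1)$    (9)
where $\alpha_1 = (x_1, y_1)$ and $\beta_1 = (c_1, d_1)$ are the image vectors of the first frame.
Similarly, considering the second projection ($O'A'B'$ to $O_2A_2B_2$), we get:
$\overrightarrow{O'A'} = (x_2, y_2, \lambda)$    (10)
$\overrightarrow{O'B'} = (c_2, d_2, \mu)$    (11)
where $\lambda$ and $\mu$ are to be determined.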
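But, because of the rigid motion, the vectors $OA$ and $O'A'$ have the same length.
The same holds for the vectors $OB$ and $O'B'$. From these requirements we get:
$\lambda^2 = \|\alpha_1\|^2 + (G \cdot \alpha_1)^2 - \|\alpha_2\|^2$    (12)
$\mu^2 = \|\beta_1\|^2 + (G \cdot \beta_1)^2 - \|\beta_2\|^2$    (13)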
Finally, again because of the rigidity, the angles between the vectors
$OA$, $OB$ and $O'A'$, $O'B'$ are the same. From this, we get:
$\overrightarrow{OA} \cdot \overrightarrow{OB} = \overrightarrow{O'A'} \cdot \overrightarrow{O'B'}$    (14)
where "$\cdot$" denotes the dot product operation.
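Substituting into equation (14) from equations (8), (9), (10), (11), (12) and (13), we get:
$\alpha_1 \cdot \beta_1 + (G \cdot \alpha_1)(G \cdot \beta_1) = \alpha_2 \cdot \beta_2 + \lambda\mu$
and substituting the values for $\lambda$ and $\mu$ and squaring appropriately, we get:
$(\|\beta_1\|^2 - \|\beta_2\|^2)(G \cdot \alpha_1)^2 + (\|\alpha_1\|^2 - \|\alpha_2\|^2)(G \cdot \beta_1)^2 - 2(\alpha_1 \cdot \beta_1 - \alpha_2 \cdot \beta_2)(G \cdot \alpha_1)(G \cdot \beta_1)$
$\quad + (\|\alpha_1\|^2 - \|\alpha_2\|^2)(\|\beta_1\|^2 - \|\beta_2\|^2) - (\alpha_1 \cdot \beta_1 - \alpha_2 \cdot \beta_2)^2 = 0$    (15)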
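Given that
$G \cdot \alpha_1 = p x_1 + q y_1$
and
$G \cdot \beta_1 = p c_1 + q d_1$
the above equation (15) is of the form:
$A p^2 + B q^2 + \Gamma pq + \Delta = 0$
where the coefficients $A, B, \Gamma, \Delta$ depend on the image vectors $\alpha_1, \alpha_2, \beta_1$ and
$\beta_2$. (q.e.d.)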
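For reference, the coefficients of the conic in Lemma 1 can be written out explicitly. The block below is a worked expansion rather than part of the original proof: it assumes the component names $\alpha_1 = (x_1, y_1)$ and $\beta_1 = (c_1, d_1)$ for the first-frame image vectors, reads equation (15) as the squared rigidity constraint, and introduces the shorthand $P$, $Q$, $D$ purely for compactness.

```latex
% Shorthand introduced here (not in the original text):
%   P = \|\alpha_1\|^2 - \|\alpha_2\|^2, \quad
%   Q = \|\beta_1\|^2 - \|\beta_2\|^2, \quad
%   D = \alpha_1\cdot\beta_1 - \alpha_2\cdot\beta_2
\begin{align*}
0 &= Q\,(p x_1 + q y_1)^2 + P\,(p c_1 + q d_1)^2
     - 2D\,(p x_1 + q y_1)(p c_1 + q d_1) + PQ - D^2 \\
A &= Q x_1^2 + P c_1^2 - 2 D x_1 c_1 \\
B &= Q y_1^2 + P d_1^2 - 2 D y_1 d_1 \\
\Gamma &= 2\,\bigl[\,Q x_1 y_1 + P c_1 d_1 - D\,(x_1 d_1 + y_1 c_1)\,\bigr] \\
\Delta &= PQ - D^2
\end{align*}
```

Each coefficient is built only from image measurements in the two frames, which is the content of the last sentence of the lemma.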
We now state and prove a second lemma, that relates the depth differences of the
world points with the interframe displacements.
Lemma 2:
Given two distinct orthographic projections of three points $O, A, B$, with depths
$Z_O, Z_A, Z_B$ (with respect to the coordinate system of the first frame), the tuple $(z_1, z_2)$,
with $z_1 = Z_O - Z_A$ and $z_2 = Z_O - Z_B$, lies on a conic section on the plane $(z_1, z_2)$.
The coefficients of this conic depend entirely on the interframe displacements of the
above points.
Proof:
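It is obvious that this statement is equivalent to the previous lemma. The
reason that we state it is that we will use this form of the constraint in our
subsequent analysis. Using the nomenclature of the previous lemma, we observe
that:
$(G \cdot \alpha_1)^2 = z_1^2, \quad (G \cdot \beta_1)^2 = z_2^2$
and
$(G \cdot \alpha_1)(G \cdot \beta_1) = z_1 z_2$
so that equation (15) becomes:
$(\|\beta_1\|^2 - \|\beta_2\|^2) z_1^2 + (\|\alpha_1\|^2 - \|\alpha_2\|^2) z_2^2 - 2(\alpha_1 \cdot \beta_1 - \alpha_2 \cdot \beta_2) z_1 z_2$
$\quad + (\|\alpha_1\|^2 - \|\alpha_2\|^2)(\|\beta_1\|^2 - \|\beta_2\|^2) - (\alpha_1 \cdot \beta_1 - \alpha_2 \cdot \beta_2)^2 = 0$    (16)
The above equation (16) proves the claim.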
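The depth form of the constraint is easy to check numerically on synthetic data. The following is a minimal sketch, assuming the conic has the form $K z_1^2 + L z_2^2 + M z_1 z_2 + N = 0$ with coefficients built from the image vectors as in the derivation of Lemma 1; the helper names (`depth_conic`, `rotate`) and the sample points and motion are hypothetical.

```python
import math

def dot2(u, v):
    """Dot product of two image (2-D) vectors."""
    return u[0] * v[0] + u[1] * v[1]

def depth_conic(a1, b1, a2, b2):
    """Coefficients (K, L, M, N) of K*z1^2 + L*z2^2 + M*z1*z2 + N = 0."""
    P = dot2(a1, a1) - dot2(a2, a2)   # |alpha_1|^2 - |alpha_2|^2
    Q = dot2(b1, b1) - dot2(b2, b2)   # |beta_1|^2 - |beta_2|^2
    D = dot2(a1, b1) - dot2(a2, b2)   # alpha_1.beta_1 - alpha_2.beta_2
    return Q, P, -2.0 * D, P * Q - D * D

def rotate(v, axis, angle):
    """Rotate the 3-vector v about `axis` by `angle` radians (Rodrigues' formula)."""
    n = math.sqrt(sum(c * c for c in axis))
    k = [c / n for c in axis]
    c, s = math.cos(angle), math.sin(angle)
    kxv = (k[1] * v[2] - k[2] * v[1], k[2] * v[0] - k[0] * v[2], k[0] * v[1] - k[1] * v[0])
    kdv = sum(k[i] * v[i] for i in range(3))
    return tuple(v[i] * c + kxv[i] * s + k[i] * kdv * (1.0 - c) for i in range(3))

# Hypothetical structure (first frame) and rigid motion.
O, A, B = (0.2, -0.1, 3.0), (1.1, 0.4, 2.2), (-0.5, 0.9, 3.6)
axis, angle, shift = (1.0, 2.0, 0.5), 0.6, (0.3, -0.2, 1.0)
O2, A2, B2 = (
    tuple(r + t for r, t in zip(rotate(p, axis, angle), shift)) for p in (O, A, B)
)

# Image vectors: differences of orthographic projections (the translation cancels).
a1, b1 = (A[0] - O[0], A[1] - O[1]), (B[0] - O[0], B[1] - O[1])
a2, b2 = (A2[0] - O2[0], A2[1] - O2[1]), (B2[0] - O2[0], B2[1] - O2[1])

K, L, M, N = depth_conic(a1, b1, a2, b2)
z1, z2 = O[2] - A[2], O[2] - B[2]          # true depth differences
residual = K * z1 * z1 + L * z2 * z2 + M * z1 * z2 + N
```

The residual vanishes up to rounding for the true depth differences, and it is unchanged if the roles of moving points and moving observer are exchanged, since only the relative motion enters.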
The above lemmas relate the structure ( shape ) of three points with their two
distinct orthographic projections. Whether the points move or the projection plane
moves (moving observer) or both of them move, the analysis remains the same.
We will now state and prove the theorems pertaining to lower bound results in the
recovery of structure from motion.
4. Lower bound results
So far, we have established the fact that two orthographic views of fewer than
four points cannot recover the structure of these points. We now show that if the
number of points is four, structure can be determined up to a finite number of
interpretations.
Theorem 2:
Two orthographic projections of four rigidly linked noncoplanar points are
compatible with at most four interpretations (plus reflections) of their relative 3-D
positions. Adding a third view yields a unique interpretation of the structure of the
four points.
Proof:
Let the four points in space be $O, A, B, C$. Let also the projections of the four
points in the two frames be $O_1, A_1, B_1, C_1$ and $O_2, A_2, B_2, C_2$ respectively (see Fig. 2),
and the gradients of the planes $OAB$, $OBC$ and $OCA$ be $G_1 = (p_1, q_1)$, $G_2 = (p_2, q_2)$
and $G_3 = (p_3, q_3)$ respectively (with respect to the first frame).
Figure 2. (Panels: frame 1, frame 2.)
To prove that there is no ambiguity, we have to prove that the gradients
$G_1, G_2, G_3$ are uniquely determined. Using the projections $O_1A_1$, $O_1B_1$ and their
corresponding ones $O_2A_2$, $O_2B_2$ and utilizing lemma 1 we get:
$A_1 p_1^2 + B_1 q_1^2 + C_1 p_1 q_1 + D_1 = 0$    (17)
where the coefficients depend entirely on the image vectors.
Similarly, considering the projections $O_1B_1$ and $O_1C_1$ and their corresponding ones
in the second frame, and the projections $O_1C_1$ and $O_1A_1$ and their corresponding
ones in the second frame, we get:
$A_2 p_2^2 + B_2 q_2^2 + C_2 p_2 q_2 + D_2 = 0$    (18)
$A_3 p_3^2 + B_3 q_3^2 + C_3 p_3 q_3 + D_3 = 0$    (19)
At this point we should say that the above equations are independent because
they come from the rigidity of the three rods OA,OB,OC. In other words the fact
that the three lengths OA, OB, and OC in space remain constant and the two
angles AOB and BOC in space remain constant between the two frames, does not
imply that the third angle COA will remain the same. Later we will give a rigorous
proof of the independence of these equations, using algebraic geometrical tools.
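Proceeding, we note that we have more information about the gradients
$G_1, G_2, G_3$ from the well known Mackworth constraints, which state:
$G_1 \cdot \overrightarrow{O_1B_1} = G_2 \cdot \overrightarrow{O_1B_1}$    (20)
$G_2 \cdot \overrightarrow{O_1C_1} = G_3 \cdot \overrightarrow{O_1C_1}$    (21)
$G_3 \cdot \overrightarrow{O_1A_1} = G_1 \cdot \overrightarrow{O_1A_1}$    (22)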
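The constraints (20) to (22) say that two planes sharing an edge must assign the same depth change to the image of that edge. A minimal numerical check is sketched below, assuming the gradient $(p, q)$ of a plane is defined by $dZ = p\,dX + q\,dY$ and that the third constraint pairs $G_3$ and $G_1$ along $O_1A_1$; the function name `plane_gradient` and the sample edge vectors are hypothetical.

```python
def plane_gradient(u, v):
    """Gradient (p, q) of the plane spanned by the 3-D edge vectors u and v.

    Solves p*dX + q*dY = dZ for both edges with Cramer's rule.
    """
    det = u[0] * v[1] - u[1] * v[0]
    if det == 0:
        raise ValueError("edges project to parallel image vectors")
    p = (u[2] * v[1] - u[1] * v[2]) / det
    q = (u[0] * v[2] - u[2] * v[0]) / det
    return p, q

def dot2(g, w):
    """Dot product of a gradient with an image (2-D) vector."""
    return g[0] * w[0] + g[1] * w[1]

# Hypothetical edge vectors OA, OB, OC of a tetrahedron, first-frame coordinates.
OA, OB, OC = (1.0, 0.0, 0.5), (0.0, 1.0, -0.7), (0.6, 0.8, 1.2)
G1 = plane_gradient(OA, OB)   # plane OAB
G2 = plane_gradient(OB, OC)   # plane OBC
G3 = plane_gradient(OC, OA)   # plane OCA

a1, b1, c1 = OA[:2], OB[:2], OC[:2]   # image vectors O1A1, O1B1, O1C1
residuals = (
    dot2(G1, b1) - dot2(G2, b1),      # constraint (20)
    dot2(G2, c1) - dot2(G3, c1),      # constraint (21)
    dot2(G3, a1) - dot2(G1, a1),      # constraint (22), as assumed above
)
```

All three residuals vanish because each dot product equals the depth difference along the shared edge.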
The above equations (17)-(22) constitute a system of six
equations in the six unknowns $p_1, q_1, p_2, q_2, p_3, q_3$. Since three of these equations are
quadratic and three are linear, the product of their degrees is 8, so by Bezout's
theorem this system has at most 8 solutions (actually it has four solutions plus
the inherent Necker reflections), and the theorem has been proved.
Before we proceed with a rigorous proof, we shed some light on the form and
information content of the equations (17)-(22). Equations (20), (21) and (22)
simply express the fact that the gradients $G_1, G_2, G_3$ of the three planes make a
triangle the directions of whose sides are known, but we don't know its position
and its scaling. On the other hand, equations (17), (18) and (19) state that each
of the gradients $G_1, G_2, G_3$ lies on a conic section in gradient space. So, in order to
solve the problem (i.e. to find the three gradients) we have to put a triangle on
gradient space, such that its sides have the orientation defined by the Mackworth
constraints (equations (20), (21), (22)) and each one of its vertices lies on one
of the three conic sections (defined by equations (17), (18) and (19)). At this point
we should say that several important problems in Vision Processing have been
solved in a very similar way. Horn (Horn, B.K.P., "Understanding image
intensities", AI 8:201-231, 1977) solved the problem of determining the shape of a
polyhedral object from intensity information and the Mackworth constraints, and
Kanade (Kanade, T., "Recovery of the three dimensional shape of an object from
a single view", AI 17 (1981) 409-460) solved the same problem (shape of
polyhedral objects) but using skewed symmetry and the Mackworth constraints.
We now show an alternative way for the proof of the present theorem, which
is constructive and so it can be used as the basis for an algorithm that will
compute the structure of four points from two projections.
In this scheme we use a slightly different formulation that makes use of
lemma 2. Consider again four points $O, A, B, C$ in the world with depths $Z_O, Z_A, Z_B, Z_C$
(with respect to the coordinate system of the first frame), and the projections of
the points in the two views $O_1, A_1, B_1, C_1$ and $O_2, A_2, B_2, C_2$ respectively (see Fig. 3).
Figure 3. (Panels: frame 1, frame 2.)
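Let $\overrightarrow{O_iA_i} = \alpha_i$, $\overrightarrow{O_iB_i} = \beta_i$ and $\overrightarrow{O_iC_i} = \gamma_i$ for $i = 1, 2$ and also let
$z_1 = Z_O - Z_A, \quad z_2 = Z_O - Z_B, \quad z_3 = Z_O - Z_C$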
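It is now obvious that the values of $z_1, z_2, z_3$ uniquely define the structure of
the four points. Using lemma 2 we derive the following three equations:
$(\|\beta_1\|^2 - \|\beta_2\|^2) z_1^2 + (\|\alpha_1\|^2 - \|\alpha_2\|^2) z_2^2 - 2(\alpha_1 \cdot \beta_1 - \alpha_2 \cdot \beta_2) z_1 z_2$
$\quad + (\|\alpha_1\|^2 - \|\alpha_2\|^2)(\|\beta_1\|^2 - \|\beta_2\|^2) - (\alpha_1 \cdot \beta_1 - \alpha_2 \cdot \beta_2)^2 = 0$    (23)
$(\|\gamma_1\|^2 - \|\gamma_2\|^2) z_1^2 + (\|\alpha_1\|^2 - \|\alpha_2\|^2) z_3^2 - 2(\alpha_1 \cdot \gamma_1 - \alpha_2 \cdot \gamma_2) z_1 z_3$
$\quad + (\|\alpha_1\|^2 - \|\alpha_2\|^2)(\|\gamma_1\|^2 - \|\gamma_2\|^2) - (\alpha_1 \cdot \gamma_1 - \alpha_2 \cdot \gamma_2)^2 = 0$    (24)
$(\|\beta_1\|^2 - \|\beta_2\|^2) z_3^2 + (\|\gamma_1\|^2 - \|\gamma_2\|^2) z_2^2 - 2(\gamma_1 \cdot \beta_1 - \gamma_2 \cdot \beta_2) z_3 z_2$
$\quad + (\|\gamma_1\|^2 - \|\gamma_2\|^2)(\|\beta_1\|^2 - \|\beta_2\|^2) - (\gamma_1 \cdot \beta_1 - \gamma_2 \cdot \beta_2)^2 = 0$    (25)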
The above equations (23), (24) and (25) constitute a system $\Sigma$ of three
polynomial equations in the three unknowns $z_1, z_2, z_3$ that define the structure of the
four points.
The simple fact that we have three equations and three unknowns here does
not mean that this system will have a finite number of solutions. To find out if
there are a finite number of solutions we apply the inverse function theorem. This
theorem allows us to conclude that whenever the Jacobian of these equations is
nonsingular, the mapping defined by these equations is locally one to one and
onto. Hence, any roots at points where the Jacobian is nonsingular are isolated and
not part of a continuum of solutions.
It is a simple exercise to compute the Jacobian of the above system and prove
that in general it has rank three. (One has to be careful when determining the
rank of the Jacobian: all the coefficients have to be expressed in the image
coordinates, otherwise hidden dependencies may cause problems. The degenerate
cases can be easily found by factoring the determinant of the Jacobian.)
Consequently we can assert that the system has but a finite number of solutions.
(From the Jacobian test we conclude that the solutions are isolated, i.e.
countable. But the set of the solutions of the system $\Sigma$ is an algebraic set and as
such it will have finitely many components. So, the number of the solutions is finite.)
By Bezout's theorem, we know that the sum of the multiplicities of the
solutions does not exceed the product of the degrees of the equations, which in
this case is eight. So, there are at most eight solutions. (Actually, there are four
solutions and their four Necker reflections.)
If we eliminate two of the unknowns from the equations of the system $\Sigma$ we
get one equation in one unknown which is of the form:
$A z_i^8 + B z_i^6 + C z_i^4 + D z_i^2 + E = 0$    (26)
where the coefficients are functions entirely of the image data. Equation (26) is
nonhomogeneous of fourth degree in $z_i^2$ and can be solved in closed form for $z_i^2$.
Knowing one $z_i$ we can solve for the remaining $z_j$'s using the equations of the
system $\Sigma$.
To conclude the proof of the theorem, if we add one more view, then the
solution is unique, and the proof is immediate from the "Structure from Motion"
theorem by S. Ullman (Ullman, S., 1979). (q.e.d.)
We now proceed with our second theorem.
Theorem 3:
Three orthographic projections of three rigidly linked points are compatible with at
most one interpretation (plus reflection) of their relative 3-D positions, in general.
Furthermore, when a certain testable condition holds then there are at most two
interpretations (plus reflections). Adding a fourth view yields a unique interpretation
of the structure of the three points.
Proof:
Let the three points in space be $O, A, B$ with depths $Z_O, Z_A, Z_B$ (with respect to
the coordinate system of the first view), and their projections on the three frames
be $O_i, A_i, B_i$ for $i = 1, 2, 3$ respectively (see Fig. 4).
Figure 4. (Panels: frame 1, frame 2, frame 3.)
Let also:
$z_1 = Z_O - Z_A, \quad z_2 = Z_O - Z_B$
and $\overrightarrow{O_iA_i} = \alpha_i$, $\overrightarrow{O_iB_i} = \beta_i$ for $i = 1, 2, 3$.
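Now applying lemma 2 for frames 1 and 2 and then for frames 1 and 3 we get
the following equations:
$(\|\beta_1\|^2 - \|\beta_2\|^2) z_1^2 + (\|\alpha_1\|^2 - \|\alpha_2\|^2) z_2^2 - 2(\alpha_1 \cdot \beta_1 - \alpha_2 \cdot \beta_2) z_1 z_2$
$\quad + (\|\alpha_1\|^2 - \|\alpha_2\|^2)(\|\beta_1\|^2 - \|\beta_2\|^2) - (\alpha_1 \cdot \beta_1 - \alpha_2 \cdot \beta_2)^2 = 0$    (27)
$(\|\beta_1\|^2 - \|\beta_3\|^2) z_1^2 + (\|\alpha_1\|^2 - \|\alpha_3\|^2) z_2^2 - 2(\alpha_1 \cdot \beta_1 - \alpha_3 \cdot \beta_3) z_1 z_2$
$\quad + (\|\alpha_1\|^2 - \|\alpha_3\|^2)(\|\beta_1\|^2 - \|\beta_3\|^2) - (\alpha_1 \cdot \beta_1 - \alpha_3 \cdot \beta_3)^2 = 0$    (28)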
The above equations constitute a system $\Sigma'$ of two equations in the two unknowns
$z_1, z_2$. The Jacobian of this system has rank two in general, and so by applying the
inverse function theorem we conclude that the system has finitely many solutions. Using
Bezout's theorem we conclude that the system has at most four solutions.
(Actually two solutions, plus the Necker reflections.)
In the sequel we prove that in general the above system $\Sigma'$ has a unique
solution (plus reflection).
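After eliminating the constant terms from equations (27), (28) we get:
$(K_2N_1 - K_1N_2) z_1^2 + (M_2N_1 - M_1N_2) z_1 z_2 + (L_2N_1 - L_1N_2) z_2^2 = 0$    (29)
with
$K_1 = \|\beta_1\|^2 - \|\beta_2\|^2, \quad K_2 = \|\beta_1\|^2 - \|\beta_3\|^2$
$L_1 = \|\alpha_1\|^2 - \|\alpha_2\|^2, \quad L_2 = \|\alpha_1\|^2 - \|\alpha_3\|^2$
$M_1 = -2(\alpha_1 \cdot \beta_1 - \alpha_2 \cdot \beta_2), \quad M_2 = -2(\alpha_1 \cdot \beta_1 - \alpha_3 \cdot \beta_3)$
$N_1 = L_1 K_1 - (\alpha_1 \cdot \beta_1 - \alpha_2 \cdot \beta_2)^2, \quad N_2 = L_2 K_2 - (\alpha_1 \cdot \beta_1 - \alpha_3 \cdot \beta_3)^2$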
Equation (29) is homogeneous in $z_1, z_2$ and by dividing by $z_2^2$ and setting
$z_1 / z_2 = x$ we get the following equation:
$(K_2N_1 - K_1N_2) x^2 + (M_2N_1 - M_1N_2) x + (L_2N_1 - L_1N_2) = 0$    (30)
The solution of the above equation (30) is given by:
$x = \dfrac{-(M_2N_1 - M_1N_2) \pm \sqrt{Disc}}{2 (K_2N_1 - K_1N_2)}$    (31)
where $Disc$ is the discriminant of equation (30).
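On the other hand, if the lengths of the vectors $OA$, $OB$ are $\rho$ and $\mu$
respectively, then from the geometry of the projection on the first frame, it is
obvious that:
$z_1 = \pm\sqrt{\rho^2 - \|\alpha_1\|^2}, \quad z_2 = \pm\sqrt{\mu^2 - \|\beta_1\|^2}$
Consequently, $x = \pm \sqrt{\rho^2 - \|\alpha_1\|^2} \,/\, \sqrt{\mu^2 - \|\beta_1\|^2}$.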
Thus, if $x$ has two solutions then these solutions must have the same absolute
value and opposite sign if both are to be valid. From (31) we conclude that $x$ will
have two valid solutions if:
$M_2N_1 - M_1N_2 = 0$    (32)
Obviously the above condition (32) is a testable condition in the image data.
So far, we have concluded that if condition (32) holds, then the problem has
two solutions (plus reflections), because then there will be two solutions for
$x = z_1 / z_2$, and so four solutions for $(z_1, z_2)$ (actually two solutions, plus reflections).
If condition (32) does not hold, then there is only one solution for $x$ and
consequently two solutions for $(z_1, z_2)$ (actually one solution, plus reflection).
In addition, the above description can be used to actually find the structure of
three points from three projections, by developing equation (30), solving for $x$ and
then using this value in conjunction with the equations (27) and (28) of the system
$\Sigma'$ to solve for $z_1, z_2$, rejecting the imaginary roots.
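The procedure just described can be sketched in code. This is a minimal sketch under stated assumptions, not the authors' implementation: it forms the two depth conics of lemma 2, eliminates the constant term to get a quadratic in $x = z_1/z_2$, and back-substitutes into the first conic. It returns every real candidate rather than applying the uniqueness argument, and it ignores degenerate inputs. All function names and the synthetic structure and motions are hypothetical.

```python
import math

def dot2(u, v):
    """Dot product of two image (2-D) vectors."""
    return u[0] * v[0] + u[1] * v[1]

def depth_conic(a1, b1, ai, bi):
    """(K, L, M, N) with K*z1^2 + L*z2^2 + M*z1*z2 + N = 0 for frames 1 and i."""
    P = dot2(a1, a1) - dot2(ai, ai)
    Q = dot2(b1, b1) - dot2(bi, bi)
    D = dot2(a1, b1) - dot2(ai, bi)
    return Q, P, -2.0 * D, P * Q - D * D

def candidate_depths(conic_12, conic_13, tol=1e-12):
    """Real candidate (z1, z2) pairs satisfying both conics, with reflections."""
    K1, L1, M1, N1 = conic_12
    K2, L2, M2, N2 = conic_13
    qa = K2 * N1 - K1 * N2            # coefficient of x^2 after elimination
    qb = M2 * N1 - M1 * N2            # coefficient of x
    qc = L2 * N1 - L1 * N2            # constant term
    disc = qb * qb - 4.0 * qa * qc
    if abs(qa) < tol or disc < 0.0:
        return []                      # degenerate input: not handled in this sketch
    candidates = []
    for sign in (1.0, -1.0):
        x = (-qb + sign * math.sqrt(disc)) / (2.0 * qa)   # x = z1 / z2
        denom = K1 * x * x + M1 * x + L1
        if abs(denom) < tol:
            continue
        z2_squared = -N1 / denom       # first conic with z1 = x * z2
        if z2_squared <= 0.0:
            continue                   # reject imaginary roots
        z2 = math.sqrt(z2_squared)
        candidates.append((x * z2, z2))
        candidates.append((-x * z2, -z2))   # Necker reflection
    return candidates

def rotate(v, axis, angle):
    """Rotate the 3-vector v about `axis` by `angle` radians (Rodrigues' formula)."""
    n = math.sqrt(sum(c * c for c in axis))
    k = [c / n for c in axis]
    c, s = math.cos(angle), math.sin(angle)
    kxv = (k[1] * v[2] - k[2] * v[1], k[2] * v[0] - k[0] * v[2], k[0] * v[1] - k[1] * v[0])
    kdv = sum(k[i] * v[i] for i in range(3))
    return tuple(v[i] * c + kxv[i] * s + k[i] * kdv * (1.0 - c) for i in range(3))

def image_vectors(O, A, B):
    """Image vectors alpha = O_iA_i and beta = O_iB_i of one orthographic view."""
    return (A[0] - O[0], A[1] - O[1]), (B[0] - O[0], B[1] - O[1])

# Hypothetical structure and two rotations about different axes.
O, A, B = (0.0, 0.0, 0.0), (1.0, 0.3, 0.8), (-0.4, 1.2, -0.5)
views = [(O, A, B)]
for axis, angle in [((1.0, 2.0, 0.5), 0.6), ((-1.0, 0.3, 2.0), -0.9)]:
    views.append(tuple(rotate(p, axis, angle) for p in (O, A, B)))
(a1, b1), (a2, b2), (a3, b3) = (image_vectors(*v) for v in views)

candidates = candidate_depths(depth_conic(a1, b1, a2, b2), depth_conic(a1, b1, a3, b3))
true_depths = (O[2] - A[2], O[2] - B[2])
```

On this synthetic input the true depth differences appear among the candidates; how many other candidates survive depends on the data.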
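Finally, to conclude the proof we have to prove that if we add one more view,
then we get a unique result. If we call $O_4, A_4, B_4$ the projections in the fourth view,
and let $\overrightarrow{O_4A_4} = \alpha_4$ and $\overrightarrow{O_4B_4} = \beta_4$, then considering the first and the fourth frame
we get the equation:
$(\|\beta_1\|^2 - \|\beta_4\|^2) z_1^2 + (\|\alpha_1\|^2 - \|\alpha_4\|^2) z_2^2 - 2(\alpha_1 \cdot \beta_1 - \alpha_4 \cdot \beta_4) z_1 z_2$
$\quad + (\|\alpha_1\|^2 - \|\alpha_4\|^2)(\|\beta_1\|^2 - \|\beta_4\|^2) - (\alpha_1 \cdot \beta_1 - \alpha_4 \cdot \beta_4)^2 = 0$    (33)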
Equations (27), (28) and (33) constitute a system of three equations in two
unknowns. So, this system, barring degeneracy, will have at most one solution.
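One way to use the extra equation in practice is as a consistency test on candidates obtained from the first three views. The sketch below assumes the fourth-view conic is stored as coefficients `(K, L, M, N)` of $K z_1^2 + L z_2^2 + M z_1 z_2 + N = 0$; the function names, the tolerance, and the numbers are hypothetical and chosen only to exercise the code.

```python
def conic_residual(conic, z1, z2):
    """Left-hand side of K*z1^2 + L*z2^2 + M*z1*z2 + N = 0."""
    K, L, M, N = conic
    return K * z1 * z1 + L * z2 * z2 + M * z1 * z2 + N

def filter_by_extra_view(candidates, conic_14, tol=1e-6):
    """Keep the candidates that also satisfy the fourth-view conic within tol."""
    return [
        (z1, z2)
        for z1, z2 in candidates
        if abs(conic_residual(conic_14, z1, z2)) < tol
    ]

# Made-up fourth-view conic z1^2 + z2^2 - 5 = 0 and made-up candidates.
conic_14 = (1.0, 1.0, 0.0, -5.0)
candidates = [(1.0, 2.0), (-1.0, -2.0), (2.0, 2.0), (-2.0, -2.0)]
kept = filter_by_extra_view(candidates, conic_14)
```

Because each conic contains only quadratic terms and a constant, the reflected pair $(-z_1, -z_2)$ satisfies it whenever $(z_1, z_2)$ does, so such a filter can at best single out one solution together with its reflection.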
5. Discussion and Conclusion
The perception of rigid structure from motion stimulus is well within the
competence level of the human visual system. An analytical exposition of the
problem and its solution is due to Ullman. In his pioneering work on this topic,
Ullman demonstrated that such a task is easily accomplished by observing the
motion of as few as four noncoplanar points for three frames.
It was believed for a long time [4] that it is not possible to do any better.
However our investigation shows that the stimulus necessary for discerning 3-D
structure can in fact be simpler than in Ullman's scheme. We set out to examine
the minimum stimulus - in terms of number of points and number of views - that
we can utilize to form a coherent global percept of structure. For two
orthographic views this turns out to be four noncoplanar points. In this case
structural interpretation narrows down to four alternative arrangements. The
perceptual algorithm can be arranged in a manner similar to Ullman's scheme. This
involved parallel computation of structure and motion using four points at
different image locations, with global coherence signalled by matching motion
transformation parameters.
However, when the observation period spans three separate views, tracking
three points provides enough information to perceive the structure. The advantage
in this case is that, unlike the four points case, the interpretation of structure is not
contingent upon noncoplanarity, which is a 3-D property of the points not directly
evident from the image.
Our results fill an important gap in the study of the perception of structure
from motion - showing the limitation of the approach.
We believe that our work forms an important extension to Ullman's
theory and, in conjunction with interpretation schemes for recovering structure in
the case of biological motion using the planarity (or fixed axis) assumption [1, 2, 7],
constitutes a significant advance in the problem of the interpretation of structure
from motion.
6. Acknowledgments
We would like to thank Christopher Brown and Dana
Ballard for their help during the preparation of this paper. Our thanks also go to
Jerome Feldman and the Rochester Vision group for their constructive criticism.
7. References
1. D. D. Hoffman and B. E. Flinchbaugh, The Interpretation of Biological
Motion, Biol. Cybernetics 42, (1982), 195-204.
2. D. D. Hoffman and B. M. Bennett, Inferring the relative three dimensional
positions of two moving points, J. Opt. Soc. Am. A 2(2), (February 1985),
350-353.
3. G. Johansson, Visual Perception of Biological Motion and a Model for its
Analysis, Percept. & Psychophysics 14(2), (1973), 201-211.
4. D. Marr, VISION, W.H. Freeman, San Francisco, 1982.
5. S. Ullman, The Interpretation of Structure from Motion, Proc. R. Soc. Lond.
B 203, (1979), 405-426.
6. H. Wallach and D. N. O'Connell, Kinetic Depth Effect, J. Exp. Psychol.
45(4), (1953), 205-217.
7. J. Webb and J. K. Aggarwal, Structure from motion of rigid and jointed
objects, Artificial Intelligence 19, (1982), 107-130.