Stabilization of video streams using rotational
and translational motion estimation
E. Ardizzone, R. Gallea, A. V. Miceli Barone, M. Morana
Abstract—We present a real-time method for stabilizing video sequences taken from an oscillating camera. We estimate camera rotation (roll) and translation on the image plane with Fourier spectrum analysis and integral projection matching, and then filter out high-frequency motion components (presumably due to unintentional oscillation) while preserving intentional motion.
In this paper we refer particularly to the stabilization of video streams taken from the camera of a Sony Aibo ERS-7, a dog-like robot, as an example application, although the method is not limited to this domain.
Index Terms—Video Stabilization, Motion compensation, Video signal processing, Integral projection matching, Frequency domain analysis
I. INTRODUCTION
VIDEO streams taken from a hand-held camera or a camera mounted on a moving vehicle are often corrupted by the oscillation of the unstable mounting. These oscillations are unpleasant to a human observer and hinder any kind of automated analysis, such as object recognition and localization. Thus, for both amateur video recording and automated video analysis applications (as, for instance, robotics), a method for stabilizing video streams is required. This method should be fast enough to run in real time on performance-limited hardware, like an amateur camera's video signal processor or a robot control computer.
Image stabilization can be divided into three phases: motion estimation, motion compensation and image composition.
In the literature there are two classes of motion estimation techniques: block matching [1], [2] and feature matching [3], [4], [5]. Block matching can estimate only global image translations, not other kinds of motion such as rotation. Feature matching tries to find corresponding features between subsequent frames. It can estimate various kinds of motion, but its quality is limited by the kind of feature used. Finding useful features in a generic unknown environment can be computationally intensive, so these methods are not well suited to applications where there is no prior knowledge of the environment [5].
The method we present performs stabilization by estimating global 2D image rotational and translational motion using Fourier spectrum analysis and integral projection matching, and then filtering out oscillation using polynomial interpolation.
We do not use correspondence matching like other stabilization methods, since this approach is computationally expensive and thus requires specialized hardware for a real-time implementation.
II. METHOD DESCRIPTION
Fig. 1. High-level dataflow diagram: input image → roll stabilization → translational stabilization → stabilized image.
A. Overview
Our stabilization system is made of two cascaded subsystems: a roll correction system and a translation correction system.
B. Roll correction system
Roll is the rotation of a vehicle (or a camera) about its own longitudinal axis. In ships, for instance, roll is generated by the waves; in a Sony Aibo it is generated by the leg motion.
Usually the roll axis of a vehicle is approximately orthogonal to the horizon line, so we choose this line as a reference for the correction.
First, we measure the inclination of lines known to be almost parallel to the horizon line. In the scenario of an Aibo playing RoboCup soccer these are distant field lines and the top edges of the gates; in an indoor or urban outdoor scenario they are the edges of buildings, windows, walls, furniture, and so on; in a sea scenario it is simply the true horizon.
Second, we rotate each frame by the mean measured
inclination of the detected lines.
The whole processing consists of the following operations, applied to each frame:
1) Extraction of a vertical gradient estimate
2) Horizontal line detection mask filtering
3) Fourier transform
4) Extraction of the Fourier spectrum principal direction
5) Spectrum thresholding for angle detection
6) Frame rotation
The vertical gradient (b) is estimated as the point-to-point difference between each pixel and its lower neighbour; the frame is then filtered (convolved) with a horizontal line detection matrix which enhances almost horizontal segments of sufficient length (c). The Fourier transform is applied to this image; its spectrum contains, apart from a vertical peak, an almost vertical line passing through the centre, which is orthogonal to the horizon line of the original image (d).
This image is thresholded (e), and least-squares regression is used to find the line that best fits the points. The angle between this line and the vertical image axis is the mean roll angle of the original frame.
Note that we work in the frequency domain because there the line is unique and passes through the centre, while in the original image we have multiple lines in various positions.
After the angle has been estimated, the whole original image is rotated by it to compensate for the roll (f).
Fig. 2. Processing steps: (a) input frame; (b) vertical gradient; (c) horizontal segments; (d) Fourier spectrum; (e) thresholded spectrum; (f) rotated frame.
Note that this method works well in scenarios with a predominance of straight edges, like indoor, urban or sea scenarios, while it does not perform well in natural or otherwise complex scenes.
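As an illustration, the angle estimation can be sketched in Python with NumPy and SciPy. This is a minimal sketch, not the paper's Matlab code: the 1x9 line-detection kernel, the threshold ratio and the sign convention of the final rotation are our own assumptions.

import numpy as np
from scipy import ndimage

def estimate_roll_angle(frame, threshold_ratio=0.5):
    # 1) Vertical gradient: difference between each pixel and its lower neighbour.
    grad = np.abs(np.diff(frame.astype(float), axis=0))
    # 2) Enhance almost horizontal segments (an illustrative 1x9 row kernel).
    segments = ndimage.convolve(grad, np.ones((1, 9)) / 9.0)
    # 3) Fourier transform, with the zero frequency shifted to the centre.
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(segments)))
    # 4-5) Keep only the strongest spectral points, in centred coordinates.
    ys, xs = np.nonzero(spectrum > threshold_ratio * spectrum.max())
    cy, cx = spectrum.shape[0] // 2, spectrum.shape[1] // 2
    ys, xs = ys - cy, xs - cx
    # Least-squares fit of a line x = m*y through the origin: the spectral
    # line is almost vertical, so we regress x on y to keep the slope finite.
    m = (xs * ys).sum() / max((ys * ys).sum(), 1e-9)
    return np.degrees(np.arctan(m))  # angle from the vertical image axis

# 6) Compensate the roll by rotating the frame back (the sign may need
# flipping depending on the chosen axis conventions):
# stabilized = ndimage.rotate(frame, -estimate_roll_angle(frame), reshape=False)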
C. Translation correction system
To remove translational oscillations while preserving
intentional camera translations (without having prior
knowledge of them) we perform three steps:
1) Global motion estimation
2) Intentional motion estimation
3) Oscillation removal
Global motion estimation is done using the integral projection matching technique [1]. For each NxN frame, a summation over the rows and one over the columns is performed to obtain two separate N-element vectors.
For two consecutive frames, the vectors obtained from the summation over the rows are compared by sliding one over the other within a specified window, computing the displacement that gives the least sum of squared differences. That displacement is the estimated vertical translation. Similarly, operating on the vectors obtained from the summation over the columns, we get the horizontal translation.
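A sketch of this estimator in Python follows. It is a minimal illustration: the default window size and the sign convention (positive displacement meaning motion towards larger indices) are our assumptions.

import numpy as np

def integral_projection_shift(prev, curr, window=10):
    # Slide one 1-D projection over the other and keep the displacement
    # with the least sum of squared differences on the overlap.
    def best_shift(a, b):
        best_d, best_ssd = 0, np.inf
        for d in range(-window, window + 1):
            sa = a[d:] if d >= 0 else a[:len(a) + d]
            sb = b[:len(b) - d] if d >= 0 else b[-d:]
            if len(sa) == 0:          # no overlap at this displacement
                continue
            ssd = np.mean((sa - sb) ** 2)  # normalised by overlap length
            if ssd < best_ssd:
                best_ssd, best_d = ssd, d
        return best_d

    prev, curr = prev.astype(float), curr.astype(float)
    dy = best_shift(curr.sum(axis=1), prev.sum(axis=1))  # vertical shift
    dx = best_shift(curr.sum(axis=0), prev.sum(axis=0))  # horizontal shift
    return dy, dx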
Example: consider the two frames shown in Fig. 3.
Fig. 3. Two translated frames.
Summing over the columns of each frame we obtain two projection vectors (Fig. 11).
Fig. 11. Integral projection of the columns.
By scrolling the first vector over the second we get a minimum sum of squared differences at a displacement of 2.
Similarly, summing over the rows we obtain two more projection vectors (Fig. 12); here the minimum occurs at a displacement of -1.
Fig. 12. Integral projection of the rows.
So the global translation between the two frames is (2, -1), as can be verified by inspecting the matrices directly.
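Under the conventions of the sketch above, this example can be reproduced with two small synthetic frames (hypothetical data, not the paper's):

# Frame b is frame a shifted down by 2 rows and left by 1 column.
a = np.zeros((8, 8)); a[3:5, 2:6] = 1.0
b = np.roll(np.roll(a, 2, axis=0), -1, axis=1)
print(integral_projection_shift(a, b, window=3))  # -> (2, -1)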
After estimating the global motion, we need to estimate the intentional motion, so we fit a polynomial to the global translation values of the last K frames. Using a low-order polynomial, the fitted curve is smooth, so the oscillation almost disappears while the intentional motion remains. But if the order is too low, the curve does not promptly follow abrupt intentional motion changes, so a trade-off must be made. We found experimentally that second or third order polynomials gave the best results. The number of frames K taken into account for fitting is also a trade-off: too small a K gives an estimate that still suffers from oscillation, while too large a K gives an estimate that follows the intentional movement with too much delay and also degrades performance. In our tests on the Sony Aibo we found 60 frames, corresponding to 4 seconds of video stream, to be a good value; other applications may call for other values.
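The fitting step can be sketched as follows; treating the per-axis translation history as a 1-D signal, one sample per frame, is our reading of the text:

import numpy as np

def intentional_motion(history, order=2):
    # Least-squares fit of a low-order polynomial to the last K global
    # translation samples, evaluated at the newest frame.
    k = len(history)
    coeffs = np.polyfit(np.arange(k), history, order)
    return np.polyval(coeffs, k - 1)

# e.g. smoothed_x = intentional_motion(global_x[-60:])  # K = 60 frames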
Fig. 5. Complete system block diagram. Roll stabilization: intensity image → vertical derivative → horizontal segments detection → FFT → thresholding → least-squares line computation → rotation of the full image. Translational stabilization: global motion estimation → intentional motion estimation → oscillation motion deletion → frame mosaicing → stabilized image.
In general the rotated and translated frame does not fit the original rectangular bounds: some parts at the corners would remain without image data. To prevent the unaesthetic effect of 'black corners' and to maintain temporal coherence between frames, we put the new frame on top of the previous one, creating a mosaic effect [6].
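A minimal sketch of this compositing step, assuming the roll angle and the translational oscillation of the frame have already been estimated (the sign conventions are assumptions):

import numpy as np
from scipy import ndimage

def composite(mosaic, frame, angle, oscillation):
    # Warp the new frame, then paste only its valid pixels over the running
    # mosaic so uncovered corners keep older image data.
    shift = -np.asarray(oscillation, dtype=float)
    warped = ndimage.shift(ndimage.rotate(frame.astype(float), -angle,
                                          reshape=False), shift)
    valid = ndimage.shift(ndimage.rotate(np.ones(frame.shape), -angle,
                                         reshape=False), shift) > 0.5
    out = mosaic.copy()
    out[valid] = warped[valid]
    return out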
III. RESULTS
This method has been tested on video sequences taken in different environments at different resolutions. The algorithm, implemented in Matlab, processes an average of 1.5 frames per second at a resolution of 416x320. About half of the processing time is spent on the image rotation, since it is a computationally intensive geometrical transformation; an efficient implementation of the rotation would significantly speed up the whole algorithm.
In our tests we used polynomials of second and third order for translational motion compensation. Second order polynomials produce a smooth visible motion, removing significant oscillations, but do not promptly follow intentional motion changes. Third order polynomials follow the intentional motion promptly but do not remove all oscillations.
Fig. 4. Global motion estimation versus corrected motion.
After estimating the global and the intentional translational motion, we compute the oscillation simply as the difference between the two, and then translate the frame to compensate for it.
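As glue code, reusing the hypothetical intentional_motion helper sketched earlier (history_y and history_x hold the estimated global translations of the last K frames, newest last):

import numpy as np
from scipy import ndimage

def remove_oscillation(frame, history_y, history_x):
    # Oscillation = measured global motion minus fitted intentional motion;
    # the frame is translated back by that amount.
    osc_y = history_y[-1] - intentional_motion(history_y)
    osc_x = history_x[-1] - intentional_motion(history_x)
    return ndimage.shift(frame.astype(float), (-osc_y, -osc_x))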
The number of samples considered for smoothing affects the smoothness of the output sequence as well as the complexity of the algorithm. We found values of about 60 to perform well; lower values, while speeding up the method, produce visible oscillations, and higher values produce slowly adapting motion estimates.
IV. CONCLUSION
We presented a video stream stabilization method suitable for real-time implementation. The method assumes a simple 2D rotation and translation motion model. It also assumes that the images contain straight edges, especially almost horizontal ones, so it is suited to operation in artificial environments such as building interiors or cities, or at sea, where the horizon is visible. It is not suited to environments with many complex, variously oriented edges, like a forest.
The main performance bottlenecks are the image rotation and translation used to compensate for the oscillation after the estimation, so the main optimizations should be directed towards these steps.
Also, for some applications, like robotics, it may be unnecessary to perform such corrections, as the estimated oscillation values could be sufficient for an image processing module to perform a correct analysis.
REFERENCES
[1] K. Ratakonda, "Real-Time Digital Video Stabilization for Multi-Media Applications" (Online). Available: URL
[2] Y. M. Yeh, H. C. Chiang, and S. J. Wang, "A digital camcorder image stabilizer based on gray coded bit-plane block matching," in Proc. 13th IPPR Conf. Computer Vision, Graphics and Image Processing, Taipei, Taiwan, 2000, pp. 244-251.
[3] Z. Duric and A. Rosenfeld, "Image sequence stabilization in real time," Real-Time Imaging, vol. 2, no. 5, pp. 271-284, 1996.
[4] T. Stepleton and E. Tira-Thompson, "AIBO Camera Stabilization," 16-720, Fall 2003.
[5] Y.-M. Liang, H.-R. Tyan, S.-L. Chang, H.-Y. M. Liao, and S.-W. Chen, "Video Stabilization for a Camcorder Mounted on a Moving Vehicle," IEEE Trans. Vehicular Technology, vol. 53, no. 6, Nov. 2004.
[6] M. Hansen, P. Anandan, K. Dana, G. van der Wal, and P. Burt, "Real-time Scene Stabilization and Mosaic Construction," David Sarnoff Research Center, CN 5300, Princeton, NJ 08543.