
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 16, NO. 2, FEBRUARY 2006
An MPEG-4-Compatible Stereoscopic/Multiview Video Coding Scheme
W. Yang, K. N. Ngan, Fellow, IEEE, and J. Cai, Member, IEEE
Abstract—In this paper, we propose an efficient codec for multiview video coding that is compatible with the MPEG-4 video standard. The main views of the multiview video are encoded using an MPEG-4 encoder, and the auxiliary views are encoded by joint disparity and motion compensation. An edge-preserving regularization scheme that jointly calculates disparity and motion vectors is performed on a VOP basis. The output of the encoder contains one bitstream for each view, and the main-view bitstreams can be decoded by a standard MPEG-4 decoder. In addition, in the case of five-view encoding, we compare four different prediction structures in order to find the best one under certain scenarios. To evaluate the proposed encoder, the MPEG-2 multiview profile (MVP) is implemented on the MPEG-4 platform for fair comparison; this is referred to as MPEG-4 MVP in this paper. Experimental results show that the proposed encoder achieves higher image quality at a similar bit rate than the conventional scheme and is very promising for applications including videoconferencing and three-dimensional telepresence.
Index Terms—Joint disparity and motion estimation, MPEG-4
compatible, multiview video coding, stereoscopic video coding.
I. INTRODUCTION
STEREOSCOPIC or multiview video, due to its ability to provide the perception of depth, yields a more vivid and accurate representation of the structure of a scene as compared to
monocular video. When each eye of the viewer is presented with
the corresponding image from the two views that form stereoscopic video, the viewer experiences the sensation of three-dimensional (3-D) vision. Therefore, stereoscopic and multiview
systems have a wide range of applications in entertainment,
manufacturing, telemedicine, remote operations, telerobotics,
3-D visual communications, and virtual reality [1].
For transmission and storage of multiview video data,
compression is important as the required bandwidth linearly
increases with the number of camera channels. Multiview
sequences can be compressed much more efficiently than by independent compression of the individual views by
exploiting, in addition to the intra- and inter-frame redundancy,
the high inter-channel correlations. In the literature, most of the
multiview encoders that are compatible with existing standards
are based on MPEG-2. The DISTIMA [2] project developed
a system for capturing, coding, transmitting, and presenting
digital stereoscopic image sequences, and the PANORAMA
[3] project enhanced the visual information exchange with 3-D
telepresence. Both can be integrated with MPEG-2. Puri [4] and Luo [5] also proposed MPEG-2-compatible stereoscopic video encoders and adopted disparity compensation to remove the inter-channel redundancy. The MPEG-2 multiview profile (MVP) is a straightforward way to encode stereoscopic video sequences.
There are three major issues we consider important for developing an efficient multiview video encoder. First, compatibility with existing video coding standards should be maintained. While the main application area of the MPEG-2 MVP is stereoscopic TV, it is expected that the multiview aspects of MPEG-4 will play a major role in interactive applications, e.g., navigation through a virtual 3-D world with embedded natural video objects [5]. Therefore, the main view is encoded using MPEG-4 in our proposed encoder. Second, the computational complexity of disparity and motion estimation should be low, and the relations between the disparity and motion fields should be fully exploited. This can be achieved by jointly estimating the disparity and motion fields, which provides accurate and smooth vector fields at relatively low complexity. Third, the reference structure of the multiview video encoder should be flexible for various scenarios, and the encoder should be easily extendable.
In this paper, we propose a multiview video encoder subject to the above considerations. The framework of the proposed encoder is described in Section II, where different reference structures for coding multiple views are comparatively studied. The experimental results are analyzed in Section III, and conclusions are drawn in Section IV.
Manuscript received November 1, 2003; revised February 8, 2005. This paper
was recommended by Associate Editor A. Puri.
W. Yang and J. Cai are with Nanyang Technological University, Singapore 639798 (e-mail: [email protected]; [email protected]).
K. N. Ngan is with the Chinese University of Hong Kong, Shatin, NT, Hong
Kong (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSVT.2005.862496
II. MPEG-4-COMPATIBLE MULTIVIEW VIDEO CODING
A. Encoder Structure
As a preprocessing step, the auxiliary view images are input
to the encoder after balancing [6] with the main view images.
The purpose of image balancing is to eliminate the potential
signal difference between the stereo images, which is due to
lighting conditions and camera differences.
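The paper adopts the balancing method of [6]; as an illustration only, a global mean–variance matching sketch (with a hypothetical `balance_image` helper operating on flat luminance lists) behaves as follows:

```python
def balance_image(aux, main):
    """Match the mean and standard deviation of the auxiliary-view
    luminance to those of the main view (global balancing sketch)."""
    n = len(aux)
    mean_a = sum(aux) / n
    mean_m = sum(main) / len(main)
    var_a = sum((p - mean_a) ** 2 for p in aux) / n
    var_m = sum((p - mean_m) ** 2 for p in main) / len(main)
    gain = (var_m / var_a) ** 0.5 if var_a > 0 else 1.0
    # y = gain * (x - mean_a) + mean_m, clipped to the 8-bit range
    return [min(255, max(0, round(gain * (p - mean_a) + mean_m)))
            for p in aux]
```

The actual method of [6] may use a different (e.g., local or histogram-based) mapping; this sketch only conveys the intent of removing global gain/offset mismatch between the two cameras.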
The block diagram of the proposed stereoscopic video encoder is shown in Fig. 1. For multiple views, each pair of main and auxiliary views is handled in a similar way. The main view is encoded using an MPEG-4 encoder [7]. The only difference is that the motion vectors for P- and B-frames in the main view are calculated by the joint disparity and motion estimation module rather than by full-search block matching as in standard MPEG-4.
The auxiliary view is encoded by joint disparity and motion compensation. One important application area of stereoscopic/multiview video is teleconferencing, where head-and-shoulder videos are frequently used. Thus, our proposed encoder can operate either on a frame basis or on a VOP basis. For VOP-based encoding, the MPEG-4 shape coding is adopted to code the contours of the VOPs. For the auxiliary views, the shape is also encoded after disparity or motion compensation. For every two image pairs, joint regularization of
Fig. 1. Block diagram of the proposed stereoscopic video encoder.
disparity and motion fields is performed. The joint regularization procedure is performed iteratively under the stereo consistency constraint

d_t(x) + m_r(x + d_t(x)) = m_l(x) + d_{t+1}(x + m_l(x))  (1)

where d_t and d_{t+1} denote the disparity fields at times t and t+1, and m_l and m_r denote the motion fields of the main (left) and auxiliary (right) views, respectively.
A detailed description of the joint regularization procedure is
given in [8]. The joint estimation is performed on the original
images. To obtain a closed-loop implementation so that the results are reproducible at the decoder, refinement of the disparity
and motion vectors is required. However, this may destroy the
smoothness of the vector fields. As a tradeoff, the refinement
is performed within a very small search range with half-pixel
accuracy during the encoding process using the reconstructed
reference images. The disparity and motion vectors thus obtained are encoded by DPCM entropy coding using a predefined VLC table. The intra-frames and the residual images after
compensation are encoded by DCT transform, quantization, and
run-length coding (RLC) of the DCT coefficients.
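To make the consistency constraint concrete, the sketch below checks, for a single pixel, that chaining the disparity at time t with auxiliary-view motion lands at the same point as chaining main-view motion with the disparity at time t+1. This is only an illustration of the usual disparity–motion loop (1-D positions, dict-based fields, hypothetical `consistency_error` helper); the exact constraint used by the joint regularization is given in [8].

```python
def consistency_error(x, d_t, d_t1, m_l, m_r):
    """Residual of the stereo consistency loop for pixel x:
    left(t) --d_t--> right(t) --m_r--> right(t+1) should coincide with
    left(t) --m_l--> left(t+1) --d_t1--> right(t+1).
    All fields map a 1-D pixel position to a 1-D vector."""
    via_right = d_t[x] + m_r[x + d_t[x]]   # disparity first, then right-view motion
    via_left = m_l[x] + d_t1[x + m_l[x]]   # left-view motion first, then disparity
    return via_right - via_left            # zero when the fields are consistent
```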
Fig. 2 shows the GOP structure, which is defined by two parameters: the intra distance N, i.e., the length of the GOP, and the prediction distance M. Without loss of generality, both distances are fixed throughout our experiments. As shown in Fig. 2, we introduce the new picture types I_D, P_D, and B_D for the auxiliary view, where the subscript D indicates that the picture is also predicted from the main-view images by disparity compensation. Thus, I_D-frames are predicted by disparity from the corresponding I-frames, and P_D- and B_D-frames are predicted jointly by the disparity and motion fields from both views.
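Under these definitions, the picture-type pattern of one auxiliary-view GOP can be sketched as follows (hypothetical `gop_pattern` helper; n is the intra distance and m the prediction distance, with the D suffix marking disparity-predicted auxiliary pictures):

```python
def gop_pattern(n, m):
    """Display-order picture types of one auxiliary-view GOP of length n
    with prediction distance m: one Id anchor, Pd every m frames, Bd
    in between (a sketch of the structure in Fig. 2)."""
    types = []
    for i in range(n):
        if i == 0:
            types.append("Id")       # intra picture, disparity-predicted
        elif i % m == 0:
            types.append("Pd")       # forward motion + disparity
        else:
            types.append("Bd")       # bidirectional motion + disparity
    return types
```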
The macroblock (MB) prediction modes of texture data for the different picture types are defined in Table I. The prediction mode that minimizes the sum of absolute differences (SAD) is selected for each MB. For I_D-VOPs, the shape is also encoded by disparity compensation; for P_D- and B_D-VOPs, the shape is encoded either by disparity or by motion.
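The minimum-SAD mode decision can be sketched as follows (hypothetical `select_mode` helper over flattened MB pixel lists; the actual mode set per picture type is the one given in Table I):

```python
def sad(block, pred):
    """Sum of absolute differences between an MB and a prediction."""
    return sum(abs(a - b) for a, b in zip(block, pred))

def select_mode(block, candidates):
    """Pick the MB prediction mode with minimum SAD.
    `candidates` maps a mode name (e.g. 'INTRA', 'DISP', 'MOTION')
    to its prediction block."""
    return min(candidates, key=lambda mode: sad(block, candidates[mode]))
```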
Similar to the MPEG-4 video coding rate control, we can assume a constant, predefined ratio between the average quantization parameters of the different frame types.
Thus, the MPEG-4 rate control scheme [7] can be extended
straightforwardly by increasing the number of frame types.
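For instance, extending the rate control amounts to deriving each picture type's quantization parameter from a base QP through a fixed ratio. The ratios below are hypothetical placeholders, not the values used in the paper:

```python
# Hypothetical ratios between the average QP of each picture type and
# that of I-pictures; in practice these would be tuned constants.
QP_RATIO = {"I": 1.0, "P": 1.1, "B": 1.3, "Id": 1.0, "Pd": 1.1, "Bd": 1.3}

def frame_qp(base_qp, ptype):
    """Derive the quantization parameter for a picture type from the
    base (I-picture) QP using a predefined constant ratio."""
    qp = round(base_qp * QP_RATIO[ptype])
    return min(31, max(1, qp))  # MPEG-4 QP range is 1..31
```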
Fig. 2. GOP structure of proposed encoder.
TABLE I
VOP TYPES AND MB PREDICTION TYPES
For multiple views, the encoder works similarly on each pair of corresponding main and auxiliary views. In this paper,
three-view and five-view video coding are considered.
B. Reference Structure
In most previous works, the reference frames used to predict
a desired frame were both fixed and heuristically chosen. These
reference frames do not necessarily yield the best prediction,
and, accordingly, compression performance suffers. Prediction
performance is related to the similarity between the two frames.
For one video sequence, the temporally adjacent frames will be
most similar, but, for multiview video data, the similarity is not
as straightforward.
To keep compatibility with MPEG-4, we only consider the
adaptive reference frame selection for the auxiliary views in multiview video coding. The problem can be considered at both the view level and the picture level.
The view-level reference structure addresses the following problem: given the total number of views and the number of main views, how should the positions of the main views be determined? In our case, we only consider the situation in which all the views are parallel, with equal distance between adjacent
Fig. 3. View-level prediction structures for five-view video encoding.
pairs. The basic principle for defining the reference structure is that the greater the distance between two views, the more occlusion occurs and the worse the disparity estimation becomes. For two-view and three-view video coding,
the configuration is quite straightforward. In our experiments,
the left view is coded as the main view and the right view is
coded as the auxiliary view for two-view video. For three-view
video, the middle view is selected as the main view to provide
better prediction to the two auxiliary views.
For the five-view video, we consider the four configurations
as shown in Fig. 3. The bold lines represent the main views and
the others are the auxiliary views. Config 1 directly adopts the
concept from the GOP structure in mono-view video coding. In
particular, in Config 1, view 4 is predicted from view 0, and the
other views are predicted bi-directionally from views 0 and 4. In
this way, high compression can be achieved especially when the
disparity between the views is very small. In both Configs 2 and
3, the middle view is coded as main view and the difference is
whether the outermost views are predicted from the main view
or from the adjacent auxiliary views. In Config 4, there are two main views; this is suitable for the case in which the disparity between adjacent view pairs is very large.
The picture-level reference structure concerns the selection among disparity vectors, forward motion vectors, and backward motion vectors. The selection is done at the MB level based on the minimum-distortion criterion. As shown in Fig. 2, the auxiliary-view pictures are predicted either by disparity alone (I_D) or jointly by the disparity and motion fields (P_D and B_D). In Config 4, the auxiliary view can be predicted from two main views. In this case, the I_D pictures are predicted by bidirectional disparity, and the P_D and B_D pictures are predicted by two disparity fields and one motion field.
III. EXPERIMENTAL RESULTS AND ANALYSIS
The two-view sequence Train and Tunnel and the three-view
sequence Reading are used to evaluate the proposed encoder.
The original images are shown in Fig. 4. Since the original sequences contain only two or three views, the intermediate views
are synthesized [9] to generate three-view and five-view video.
The resolutions are 720 × 576 pixels for Train and Tunnel and 768 × 288 pixels for Reading. Under the epipolar constraint, the disparity vectors are searched only in the horizontal direction within a search range of ±32 pixels. For the motion fields, the search range is ±8 pixels in both directions. The
image balancing is performed for both the proposed scheme and
the MPEG-4 MVP scheme.
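The epipolar constraint reduces disparity estimation to a 1-D search along image rows. A minimal full-search sketch (hypothetical `disparity_search` helper on single rows; the sign convention x - d assumes scene content appears shifted leftward in the right view, which depends on the camera setup):

```python
def disparity_search(left_row, right_row, x, block, max_d):
    """Full-search 1-D disparity estimation under the epipolar constraint:
    the block at position x in the left row is matched against candidate
    positions x - d (d = 0..max_d) in the right row; returns the
    disparity minimizing the SAD."""
    target = left_row[x:x + block]
    best_d, best_sad = 0, float("inf")
    for d in range(0, max_d + 1):
        if x - d < 0:
            break  # candidate would fall outside the image
        cand = right_row[x - d:x - d + block]
        s = sum(abs(a - b) for a, b in zip(target, cand))
        if s < best_sad:
            best_d, best_sad = d, s
    return best_d
```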
Fig. 4. Test sequences. The first frame of the left view of (a) Train and Tunnel
and (b) Reading.
The comparisons of the coding results of the proposed encoder with the MPEG-4 MVP are given in Figs. 5 and 6. The bit
rate here is the sum of bits per second (bps) for the two views,
and the PSNR is the overall value calculated from the average
distortion for all the frames in all of the views. As shown in
Fig. 5, the proposed encoder increases the average PSNR by about
1.2–1.5 dB for both sequences for coding two views. The performance gain comes from both the main view and the auxiliary
views. The decoded main view quality of our proposed encoder
is always better than that of the MPEG-4 MVP method. The reason is that the motion vectors calculated by the proposed encoder require fewer bits to encode, so more bits can be used for texture coding. However, this is also partly due to the
rate control scheme, which allocates bits to different frame types
proportionally according to the predefined ratio. In addition, the
quality of the auxiliary view of our proposed encoder is always
better than that of the MPEG-4 MVP encoder, although fewer bits are used. The reasons are as follows. First, the proposed encoder
Fig. 5. Results comparison for stereoscopic video coding. (a) Train and Tunnel
and (b) Reading.
structure, including the prediction structure and the MB mode design that consider the different properties of disparity and motion, better exploits the correlations among the multiview video data than MPEG-4 MVP does. Second, the prediction frames from the main view in our proposed encoder have higher quality, which provides more efficient estimation and compensation. Finally, the
disparity and motion fields obtained by the joint regularization
algorithm require fewer bits for coding of P and B pictures.
The three-view coding results are compared in Fig. 6, where
the coding gain in PSNR is about 1.2 dB. Compared with
stereoscopic video coding, the performance gain for coding
three views is limited; this is reasonable since the gain comes only from the additional auxiliary view. In fact, the performance gain for coding five views is predictable once the numbers of main and auxiliary views are known. Thus,
we only compare the coding results of the four configurations
for five-view video coding.
The comparison of the four configurations in view-level
prediction for five-view video coding is shown in Fig. 7. For
Reading, since the disparity range is very large, Config 4,
where views 1 and 3 are encoded as main views, performs the
best. Config 1 has the worst result since the direct disparity
between the two outermost views (views 0 and 4) is very large
and thus disparity prediction for view 4 from view 0 almost
Fig. 6. Results comparison for three-view video coding. (a) Train and Tunnel
and (b) Reading.
fails. For Train and Tunnel, since the disparity range is small,
Config 1 has the best result, which is more obvious at higher bit
rates. Config 4 has the worst result since coding two main views requires many bits. For Config 3, the results are unstable and
a lot of bits to code. For Config 3, the results are unstable and
largely depend on the rate control algorithm. The reason is that,
in Config 3, views 1 and 3 are predicted from the main view,
view 2, but they are also the reference views for views 0 and 4,
respectively. The reconstructed image qualities of views 1 and
3 have great effects on those of views 0 and 4.
From these results, we confirm that the best view-level prediction structure depends strongly on the properties of the multiview video data, i.e., the disparity range. For a small baseline distance, as in applications such as robotic stereovision, Config 1 is a good choice. For a large baseline distance, as in videoconferencing, Config 4 achieves better performance. To evaluate the different prediction structures fairly, the properties of the video data have to be taken into consideration.
IV. CONCLUSION
A multiview video encoder has been proposed in this paper that exploits not only the redundancy within each view but also that among different views. The encoder is object-based
and is compatible with the MPEG-4 video standard. The experimental results have demonstrated that our proposed encoder significantly outperforms the conventional MPEG-4 MVP scheme.
We have also compared different prediction structures in the case of five-view coding and have found that the optimal prediction structure depends on the video properties, e.g., the camera parameters and disparity range, as well as on the application. The encoder can easily be extended to encode various numbers of views and to handle multiple video objects, given the original shape masks of the sequences.
REFERENCES
[1] L. Guan, S.-Y. Kung, and J. Larsen, Multimedia Image and Video Processing. Boca Raton, FL: CRC Press, 2000.
[2] D. Tzovaras, N. Grammalidis, and M. G. Strintzis, “Object-based coding
of stereo image sequences using joint 3-D motion/disparity compensation,” IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 2, pp.
312–327, Apr. 1997.
[3] J.-R. Ohm, K. Gruneberg, E. Hendriks, M. E. Izquierdo, D. Kalivas,
M. Karl, D. Papadimatos, and A. Redert, “A realtime hardware system
for stereoscopic videoconferencing with viewpoint adaptation,” Signal
Process.: Image Commun., vol. 14, pp. 147–171, 1998.
[4] A. Puri, R. V. Kollarits, and B. B. Haskell, “Basics of stereoscopic video,
new compression results with MPEG-2 and a proposal for MPEG-4,”
Signal Process.: Image Commun., vol. 10, pp. 201–234, 1997.
[5] L. Yan, Z. Zhaoyang, and A. Ping, “Stereo video coding based on frame
estimation and interpolation,” IEEE Trans. Broadcasting, vol. 49, no. 1,
pp. 14–21, Mar. 2003.
[6] A. Mancini, “Disparity estimation and intermediate view reconstruction for novel applications in stereoscopic video,” M.S. thesis, McGill Univ., Montreal, QC, Canada, Feb. 1998.
[7] MPEG-4 Video Verification Model Version 18.0, ISO/IEC JTC1/SC29/WG11 N3908, Jan. 2000.
[8] W. Yang, K. Ngan, J. Lim, and K. Sohn, “Joint motion and disparity
fields estimation for stereoscopic video sequences,” Signal Process.:
Image Commun., vol. 20, no. 3, pp. 265–276, Mar. 2005.
[9] H. S. Kim and K. H. Sohn, “Feature-based disparity estimation for intermediate view reconstruction of multiview images,” in Proc. Int. Conf. Imaging Sci., Syst., Technol., vol. 2, Jun. 2001, pp. 1–8.
Fig. 7. Comparison of view-level reference structures for five-view video
coding. (a) Train and Tunnel and (b) Reading.