IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 16, NO. 2, FEBRUARY 2006

An MPEG-4-Compatible Stereoscopic/Multiview Video Coding Scheme

W. Yang, K. N. Ngan, Fellow, IEEE, and J. Cai, Member, IEEE

Abstract—In this paper, we propose an efficient codec for multiview video coding that is compatible with the MPEG-4 video standard. The main views of the multiview video are encoded using an MPEG-4 encoder, and the auxiliary views are encoded by joint disparity and motion compensation. An edge-preserving regularization scheme that jointly calculates disparity and motion vectors is performed on a VOP basis. The output of the encoder contains one bitstream for each view, and the main-view bitstreams can be decoded by a standard MPEG-4 decoder. In addition, in the case of five-view encoding, we compare four different prediction structures in order to find the best one under certain scenarios. To evaluate the proposed encoder, the MPEG-2 multiview profile (MVP) is implemented on the MPEG-4 platform for fair comparison; it is referred to as MPEG-4 MVP in this paper. Experimental results show that the proposed encoder achieves higher image quality at a similar bit rate than the conventional scheme and is very promising for applications including videoconferencing and three-dimensional telepresence.

Index Terms—Joint disparity and motion estimation, MPEG-4 compatible, multiview video coding, stereoscopic video coding.

I. INTRODUCTION

STEREOSCOPIC or multiview video, owing to its ability to provide the perception of depth, yields a more vivid and accurate representation of the structure of a scene than monocular video. When each eye of the viewer is presented with the corresponding image from the two views that form a stereoscopic video, the viewer experiences the sensation of three-dimensional (3-D) vision. Therefore, stereoscopic and multiview systems have a wide range of applications in entertainment, manufacturing, telemedicine, remote operations, telerobotics, 3-D visual communications, and virtual reality [1]. For the transmission and storage of multiview video data, compression is important because the required bandwidth increases linearly with the number of camera channels.

Multiview sequences can be compressed much more efficiently than by independently compressing the individual views, by exploiting the high inter-channel correlations in addition to the intra- and inter-frame redundancy. In the literature, most of the multiview encoders that are compatible with existing standards are based on MPEG-2. The DISTIMA [2] project developed a system for capturing, coding, transmitting, and presenting digital stereoscopic image sequences, and the PANORAMA [3] project enhanced visual information exchange with 3-D telepresence; both can be integrated with MPEG-2. Puri [4] and Luo [5] also proposed MPEG-2-compatible stereoscopic video encoders that adopt disparity compensation to remove the inter-channel redundancy. The MPEG-2 multiview profile (MVP) is a straightforward way to encode stereoscopic video sequences.

There are three major issues we consider important for developing an efficient multiview video encoder. First, compatibility with existing video coding standards should be maintained. While the main application area of the MPEG-2 MVP is stereoscopic TV, it is expected that the multiview aspects of MPEG-4 will play a major role in interactive applications, e.g., navigation through a virtual 3-D world with embedded natural video objects [5]. Therefore, the main view is encoded using MPEG-4 in our proposed encoder. Second, the computational complexity of disparity and motion estimation should be low, and the relations between the disparity and motion fields should be fully exploited. This can be achieved by jointly estimating the disparity and motion fields, which provides accurate and smooth vector fields at relatively low complexity. Third, the reference structure of the multiview video encoder should be flexible for various scenarios, and the encoder should be easily extendable.

In this paper, we propose a multiview video encoder subject to the above considerations. The framework of the proposed encoder is described in Section II, where different reference structures for coding multiple views are also comparatively studied. The experimental results are analyzed in Section III, and the conclusions are drawn in Section IV.

Manuscript received November 1, 2003; revised February 8, 2005. This paper was recommended by Associate Editor A. Puri.
W. Yang and J. Cai are with Nanyang Technological University, Singapore (e-mail: [email protected]; [email protected]).
K. N. Ngan is with the Chinese University of Hong Kong, Shatin, NT, Hong Kong (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSVT.2005.862496

II. MPEG-4-COMPATIBLE MULTIVIEW VIDEO CODING

A. Encoder Structure

As a preprocessing step, the auxiliary-view images are input to the encoder after being balanced [6] against the main-view images. The purpose of image balancing is to eliminate the potential signal difference between the stereo images, which is due to lighting conditions and camera differences. The block diagram of the proposed stereoscopic video encoder is shown in Fig. 1. For multiple views, the correlation between the main view and each auxiliary view is exploited in the same way. The main view is encoded using an MPEG-4 encoder [7]; the only difference is that the motion vectors for the P- and B-frames of the main view are calculated by the joint disparity and motion estimation module rather than by full-search block matching as in MPEG-4. The auxiliary view is encoded by joint disparity and motion compensation.
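The balancing algorithm of [6] is not reproduced in this paper. As an illustrative stand-in only — not the authors' method — a simple histogram-matching sketch that maps the auxiliary view's intensity distribution onto the main view's:

```python
import numpy as np

def match_histogram(aux, main):
    """Balance the auxiliary view against the main view by mapping the
    auxiliary intensities so their empirical CDF matches the main view's.
    This is a generic heuristic, not the balancing scheme of [6]."""
    # Unique intensity values, their positions, and their counts
    _, aux_idx, aux_counts = np.unique(
        aux.ravel(), return_inverse=True, return_counts=True)
    main_vals, main_counts = np.unique(main.ravel(), return_counts=True)
    # Empirical CDFs of both views
    aux_cdf = np.cumsum(aux_counts) / aux.size
    main_cdf = np.cumsum(main_counts) / main.size
    # For each auxiliary quantile, look up the main-view intensity
    mapped = np.interp(aux_cdf, main_cdf, main_vals)
    return mapped[aux_idx].reshape(aux.shape)
```

After matching, block-based disparity estimation between the two views is less disturbed by global brightness offsets between the cameras.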
One important application area of stereoscopic/multiview video is teleconferencing, where head-and-shoulder videos are frequently used. Thus, our proposed encoder can operate either on a frame basis or on a VOP basis. For VOP-based encoding, the shape coding of MPEG-4 is adopted to code the contours of the VOPs. For the auxiliary views, the shape is also encoded after disparity or motion compensation. For every two image pairs, joint regularization of the disparity and motion fields is performed.

Fig. 1. Block diagram of the proposed stereoscopic video encoder.

The joint regularization procedure is performed iteratively under the stereo consistency constraint (1); a detailed description of the procedure is given in [8]. The joint estimation is performed on the original images. To obtain a closed-loop implementation, so that the results are reproducible at the decoder, refinement of the disparity and motion vectors is required. However, refinement may destroy the smoothness of the vector fields. As a tradeoff, the refinement is performed within a very small search range with half-pixel accuracy during the encoding process, using the reconstructed reference images. The disparity and motion vectors thus obtained are encoded by DPCM entropy coding using a predefined VLC table. The intra-frames and the residual images after compensation are encoded by the DCT, quantization, and run-length coding (RLC) of the DCT coefficients.

Fig. 2 shows the GOP structure, which is defined by the intra distance (the length of the GOP) and the prediction distance. Without loss of generality, fixed values of these two parameters are used in our experiments. As shown in Fig.
2, we introduce the new picture types I_D, P_D, and B_D for the auxiliary view, where the subscript D denotes that the picture is also predicted from the main-view images by disparity compensation. Thus, I_D-frames are predicted by disparity from the corresponding I-frames, and P_D- and B_D-frames are predicted jointly by the disparity and motion fields from both views. The macroblock (MB) prediction modes of the texture data for the different picture types are defined in Table I. The prediction mode that minimizes the sum of absolute differences (SAD) is selected for each MB. For I_D-VOPs, the shape is also encoded by disparity compensation; for P_D- and B_D-VOPs, the shape is encoded either by disparity or by motion. Similarly to the MPEG-4 video coding rate control, we can assume a constant, predefined ratio among the average quantization parameters of the different frame types. Thus, the MPEG-4 rate control scheme [7] can be extended straightforwardly by increasing the number of frame types.

Fig. 2. GOP structure of the proposed encoder.

TABLE I. VOP TYPES AND MB PREDICTION TYPES

The encoder works similarly for multiple views, for each pair of corresponding main and auxiliary views. In this paper, three-view and five-view video coding are considered.

B. Reference Structure

In most previous works, the reference frames used to predict a desired frame were both fixed and heuristically chosen. These reference frames do not necessarily yield the best prediction, and, accordingly, compression performance suffers. Prediction performance is related to the similarity between the two frames. For a single video sequence, the temporally adjacent frames are the most similar, but, for multiview video data, the similarity is not as straightforward. To keep compatibility with MPEG-4, we only consider adaptive reference frame selection for the auxiliary views in multiview video coding. The problem can be considered at both the view-level and picture-level reference structures.
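The per-MB mode decision described above can be sketched as follows. The mode names are illustrative placeholders, not the encoder's actual syntax elements:

```python
import numpy as np

def select_mb_mode(mb, predictions):
    """Pick the prediction mode minimizing the SAD for one macroblock.

    mb          -- the current 16x16 block (integer samples)
    predictions -- dict mapping a mode name (e.g. "disparity_comp",
                   "motion_fwd") to its predicted 16x16 block; the keys
                   here are hypothetical labels, not Table I entries.
    Returns (best_mode, best_sad).
    """
    sads = {m: int(np.abs(mb.astype(int) - p.astype(int)).sum())
            for m, p in predictions.items()}
    mode = min(sads, key=sads.get)
    return mode, sads[mode]
```

In a real encoder the candidate set per MB would be restricted to the modes Table I allows for the current VOP type (I_D, P_D, or B_D).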
The view-level reference structure addresses the following problem: given the total number of views and the number of main views, how should the positions of the main views be determined? In our case, we only consider the situation in which all the views are parallel, with equal distances between adjacent pairs. The basic principle for defining the reference structure is that the farther apart two views are, the more occlusion occurs and the worse the disparity estimation will be.

For two-view and three-view video coding, the configuration is quite straightforward. In our experiments, for two-view video, the left view is coded as the main view and the right view as the auxiliary view. For three-view video, the middle view is selected as the main view to provide better prediction for the two auxiliary views. For five-view video, we consider the four configurations shown in Fig. 3, where the bold lines represent the main views and the others are the auxiliary views. Config 1 directly adopts the concept of the GOP structure from mono-view video coding: view 4 is predicted from view 0, and the other views are predicted bi-directionally from views 0 and 4. In this way, high compression can be achieved, especially when the disparity between the views is very small. In both Configs 2 and 3, the middle view is coded as the main view; the difference is whether the outermost views are predicted from the main view or from the adjacent auxiliary views. In Config 4, there are two main views, which is suitable for the case when the disparity between adjacent view pairs is very large.

Fig. 3. View-level prediction structures for five-view video encoding.

The picture-level reference structure considers the selection among the disparity vector, the forward motion vector, and the backward motion vector.
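Returning to the view-level structures, the four five-view configurations can be written as dependency tables from which a valid encoding order follows. The exact reference arrows for Configs 2 and 3 are our reading of Fig. 3 and should be treated as assumptions:

```python
# view -> list of reference views (main views have no references).
# Config 1 and Config 4 follow the text; Configs 2 and 3 are assumed.
CONFIGS = {
    1: {0: [], 4: [0], 1: [0, 4], 2: [0, 4], 3: [0, 4]},
    2: {2: [], 0: [2], 1: [2], 3: [2], 4: [2]},
    3: {2: [], 1: [2], 3: [2], 0: [1], 4: [3]},
    4: {1: [], 3: [], 0: [1], 2: [1, 3], 4: [3]},
}

def encoding_order(cfg):
    """Topologically sort the views so that every reference view is
    encoded (and reconstructed) before the views that depend on it."""
    deps = {v: set(r) for v, r in cfg.items()}
    order = []
    while deps:
        ready = sorted(v for v, r in deps.items() if not r)
        order.extend(ready)
        for v in ready:
            del deps[v]
        for r in deps.values():
            r.difference_update(ready)
    return order
```

Such a table makes the tradeoff explicit: Config 1 has long prediction chains over a wide baseline, while Config 4 pays for two intra-coded main views but keeps every disparity reference one camera away.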
The selection is done at the MB level based on the minimum-distortion criterion. As shown in Fig. 2, the auxiliary-view pictures are predicted either by disparity alone (I_D) or jointly by the disparity and motion fields (P_D and B_D). In Config 4, an auxiliary view can be predicted from two main views; in this case, the I_D pictures are predicted by bidirectional disparity, and the P_D and B_D pictures are predicted by two disparity fields and one motion field.

III. EXPERIMENTAL RESULTS AND ANALYSIS

The two-view sequence Train and Tunnel and the three-view sequence Reading are used to evaluate the proposed encoder. The original images are shown in Fig. 4. Since the original sequences contain only two or three views, intermediate views are synthesized [9] to generate three-view and five-view video. The resolutions are 720 × 576 pixels for Train and Tunnel and 768 × 288 pixels for Reading, respectively. Under the epipolar constraint, the disparity vectors are searched only in the horizontal direction with a search range of 32 pixels. For the motion fields, the search range is 8 pixels in both directions. Image balancing is performed for both the proposed scheme and the MPEG-4 MVP scheme.

Fig. 4. Test sequences. The first frame of the left view of (a) Train and Tunnel and (b) Reading.

The coding results of the proposed encoder are compared with those of the MPEG-4 MVP in Figs. 5 and 6. The bit rate here is the sum of the bits per second (bps) for the two views, and the PSNR is the overall value calculated from the average distortion over all the frames in all of the views. As shown in Fig. 5, the proposed encoder increases the average PSNR by about 1.2–1.5 dB for both sequences when coding two views. The performance gain comes from both the main view and the auxiliary views. The decoded main-view quality of our proposed encoder is always better than that of the MPEG-4 MVP method.
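The epipolar-constrained disparity search used in these experiments can be sketched as a one-dimensional full search by SAD (integer-pixel only here, for brevity):

```python
import numpy as np

def disparity_search(cur_blk, ref, x, y, max_d=32):
    """Full-search block matching restricted to the horizontal direction,
    as the epipolar constraint allows for rectified parallel views.

    cur_blk -- block from the auxiliary view, located at (x, y)
    ref     -- main-view reference image
    Returns the horizontal disparity minimizing the SAD.
    """
    h, w = cur_blk.shape
    best_sad, best_d = float("inf"), 0
    for d in range(-max_d, max_d + 1):
        xx = x + d
        if 0 <= xx and xx + w <= ref.shape[1]:  # stay inside the image
            sad = np.abs(cur_blk.astype(int)
                         - ref[y:y + h, xx:xx + w].astype(int)).sum()
            if sad < best_sad:
                best_sad, best_d = sad, d
    return best_d
```

Restricting the search to one dimension cuts the candidate count from (2·32+1)² to 2·32+1 per block, which is why the epipolar constraint matters for encoder complexity.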
The reason for this is that the motion vectors calculated in the proposed encoder require fewer bits to encode, so more bits can be used for texture coding. However, this is also partly due to the rate control scheme, which allocates bits to the different frame types proportionally according to the predefined ratio. In addition, the quality of the auxiliary view of our proposed encoder is always better than that of the MPEG-4 MVP encoder, although fewer bits are used. The reasons are as follows. First, the proposed encoder structure, including the prediction structure and the MB mode design that considers the different properties of disparity and motion, better exploits the correlations among the multiview video data than MPEG-4 MVP. Second, the prediction frames from the main view in our proposed encoder have higher quality, which provides more efficient estimation and compensation. Finally, the disparity and motion fields obtained by the joint regularization algorithm require fewer bits for the coding of P_D and B_D pictures.

Fig. 5. Results comparison for stereoscopic video coding. (a) Train and Tunnel and (b) Reading.

The three-view coding results are compared in Fig. 6, where the coding gain in PSNR is about 1.2 dB. Compared with stereoscopic video coding, the performance gain for coding three views is limited; this is reasonable, since the gain comes only from the additional auxiliary view. In fact, the performance gain for coding five views is predictable once the numbers of main and auxiliary views are known. Thus, we only compare the coding results of the four configurations for five-view video coding. The comparison of the four configurations in view-level prediction for five-view video coding is shown in Fig. 7. For Reading, since the disparity range is very large, Config 4, where views 1 and 3 are encoded as main views, performs the best.
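The overall PSNR quoted in these comparisons is computed from the mean distortion over all frames in all views (not by averaging per-frame PSNRs), which can be sketched as:

```python
import math

def overall_psnr(mse_per_frame, peak=255.0):
    """Overall PSNR from the average MSE over all frames in all views.

    mse_per_frame -- flat list of per-frame mean squared errors,
                     collected across every view of the sequence.
    """
    mean_mse = sum(mse_per_frame) / len(mse_per_frame)
    return 10.0 * math.log10(peak * peak / mean_mse)
```

Averaging the distortion first (rather than the per-frame PSNRs) prevents a few nearly lossless frames from inflating the reported quality.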
Config 1 has the worst result, since the direct disparity between the two outermost views (views 0 and 4) is very large, and thus disparity prediction of view 4 from view 0 almost fails. For Train and Tunnel, since the disparity range is small, Config 1 gives the best result, which is more obvious at higher bit rates. Config 4 gives the worst result, since the two main views cost many bits to code. For Config 3, the results are unstable and depend largely on the rate control algorithm. The reason is that, in Config 3, views 1 and 3 are predicted from the main view, view 2, but they are also the reference views for views 0 and 4, respectively; the reconstructed image qualities of views 1 and 3 therefore strongly affect those of views 0 and 4. From these results, we confirm that the best view-level prediction structure depends very much on the properties of the multiview video data, i.e., the disparity range. For a small baseline distance, as in applications such as robotic stereovision, Config 1 is a good choice. For a large baseline distance, as in videoconferencing, Config 4 achieves better performance. To evaluate the different prediction structures fairly, the properties of the video data have to be taken into consideration.

Fig. 6. Results comparison for three-view video coding. (a) Train and Tunnel and (b) Reading.

Fig. 7. Comparison of view-level reference structures for five-view video coding. (a) Train and Tunnel and (b) Reading.

IV. CONCLUSION

A multiview video encoder has been proposed in this paper, which exploits not only the redundancy within each view but also that among different views. The encoder is object-based and is compatible with the MPEG-4 video standard. The experimental results have demonstrated that our proposed encoder significantly outperforms the conventional MPEG-4 MVP scheme. We have also compared different prediction structures in the case of five-view coding and have found that the optimal prediction structure depends on the video properties, i.e., the camera parameters, the disparity range, and the applications. The encoder can be easily extended to encode various numbers of views and can also handle multiple video objects, given the original shape masks of the sequences.

REFERENCES

[1] L. Guan, S.-Y. Kung, and J. Larsen, Multimedia Image and Video Processing. Boca Raton, FL: CRC Press, 2000.
[2] D. Tzovaras, N. Grammalidis, and M. G. Strintzis, "Object-based coding of stereo image sequences using joint 3-D motion/disparity compensation," IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 2, pp. 312–327, Apr. 1997.
[3] J.-R. Ohm, K. Gruneberg, E. Hendriks, M. E. Izquierdo, D. Kalivas, M. Karl, D. Papadimatos, and A. Redert, "A realtime hardware system for stereoscopic videoconferencing with viewpoint adaptation," Signal Process.: Image Commun., vol. 14, pp. 147–171, 1998.
[4] A. Puri, R. V. Kollarits, and B. B. Haskell, "Basics of stereoscopic video, new compression results with MPEG-2 and a proposal for MPEG-4," Signal Process.: Image Commun., vol. 10, pp. 201–234, 1997.
[5] L. Yan, Z. Zhaoyang, and A. Ping, "Stereo video coding based on frame estimation and interpolation," IEEE Trans. Broadcast., vol. 49, no. 1, pp. 14–21, Mar. 2003.
[6] A. Mancini, "Disparity estimation and intermediate view reconstruction for novel applications in stereoscopic video," M.S. thesis, McGill Univ., Feb. 1998.
[7] MPEG-4 Video Verification Model Version 18.0, ISO/IEC JTC1/SC29/WG11 N3908, Jan. 2000.
[8] W. Yang, K. Ngan, J. Lim, and K. Sohn, "Joint motion and disparity fields estimation for stereoscopic video sequences," Signal Process.: Image Commun., vol. 20, no. 3, pp. 265–276, Mar. 2005.
[9] H. S. Kim and K. H. Sohn, "Feature-based disparity estimation for intermediate view reconstruction of multiview images," in Proc. Int. Conf. Imaging Sci., Syst., Technol., vol. 2, Jun. 2001, pp. 1–8.