EXPLOITING REGION OF INTEREST FOR IMPROVED VIDEO CODING

A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By Ramya Gopalan, B.E. Telecommunication Engg.

Graduate Program in Electrical and Computer Engineering
The Ohio State University
2009

Master's Examination Committee:
Prof. Ashok Krishnamurthy, Adviser
Prof. Yuan Zheng

Copyright © Ramya Gopalan 2009

ABSTRACT

Today, massive amounts of video data are successfully transferred over the Internet, but achieving good-quality real-time video transfer is still a challenge. Video requires large bandwidth for error-free transmission over the network, and since bandwidth is an expensive resource, ensuring sufficient bandwidth for error-free transmission of video is difficult. Hence, low bit rate transmission is preferred; but low bit rate transmission implies lower quality video. This is where Region of Interest video coding comes into play. It exploits the fact that certain areas in the video are of higher perceptual importance than other areas, and assigns a higher quality to those areas. A better perceptual video quality is thus achieved under the same bandwidth conditions. Region of Interest coding can be broken into two parts: Region of Interest Tracking and Video Preprocessing. Region of Interest Tracking involves tracking the region of interest in all the frames of the video based on some criteria specified by the user. Video Preprocessing is the preprocessing performed to ensure higher quality for the region of interest over the background. This thesis presents a methodology for both Region of Interest Tracking and Video Preprocessing. In Region of Interest Tracking, a covariance tracking method is studied and applied in the two different cases of rigid and non-rigid body motion of the Region of Interest (ROI).
In Video Preprocessing, a spatio-temporal preprocessing technique is studied and tested on different standard video sequences.

This is dedicated to my family and friends.

ACKNOWLEDGMENTS

This thesis would not have been possible if not for a few people who have aided me academically and otherwise. I would like to thank Dr. Ashok Krishnamurthy, without whose guidance and research ideas this thesis would not have been possible. I am truly grateful to him for having confidence in me and trusting my research decisions. Through him, I have learnt the importance of understanding a problem completely and the importance of a good literature survey before starting out. I also thank Prof. Yuan F. Zheng for being a part of my thesis defense committee and sharing his thoughts and ideas on my thesis. I would like to thank Dr. Prasad Calyam; his constant guidance and support throughout my thesis study helped me a great deal. Apart from his technical guidance, he has helped me understand the importance of organizing and presenting work. I am also grateful to the staff and students of OSC for making my work there an enjoyable and pleasurable one. Finally, I would like to thank my family and friends, whose support means a lot to me. I am eternally grateful to my parents for having supported me all throughout and for having faith in me and in whatever I choose to do. As I finish my masters, I take back with me not just a degree but also the friendships I formed with a few people here. I am thankful to my friends here and in other places for making my masters a valuable and enjoyable journey.

VITA

August 21, 1985: Born, Mumbai, India
June 30, 2007: B.E. Telecommunication Engg., R.V. College of Engineering, Bangalore, India
September 2007 - present: M.S. Candidate, Department of Electrical and Computer Engineering, The Ohio State University
September 2007 - June 2009: Graduate Research Associate, Ohio Supercomputer Center, The Ohio State University

FIELDS OF STUDY

Major Field: Electrical and Computer Engineering
Studies in: Video Processing, Random Signal Analysis, Linear Algebra

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Figures

Chapters:

1. Introduction
   1.1 ROI Tracking
   1.2 ROI Preprocessing
   1.3 Thesis Outline
2. Region of Interest Tracking
   2.1 Introduction
       2.1.1 Tracking as a Matching Problem
       2.1.2 Tracking as an Estimation Problem
   2.2 Region Descriptors
   2.3 Covariance as a Region Descriptor
   2.4 Distance Calculation on Covariance Matrices
   2.5 The Covariance Tracking Algorithm
   2.6 Model Update Strategies
   2.7 Chapter Summary
3. ROI Coding & Preprocessing
   3.1 Introduction
   3.2 Spatial ROI Coding
       3.2.1 Flexible Macroblock Ordering
   3.3 Temporal ROI Coding
       3.3.1 Decreasing the Frame Rate of the Background
   3.4 ROI Preprocessing Techniques
       3.4.1 Spatial ROI Preprocessing Techniques
       3.4.2 Temporal ROI Preprocessing Techniques
       3.4.3 Spatio-Temporal ROI Preprocessing Techniques
   3.5 Chapter Summary
4. Experiments and Results
   4.1 ROI Detection Results
       4.1.1 Example 1: Rigid Motion
       4.1.2 Example 2: Non-Rigid Motion
   4.2 ROI Preprocessing Results
   4.3 Conclusion
5. Conclusion and Future Work

Bibliography

LIST OF FIGURES

3.1 FMO Types
3.2 Block Diagram of the FMO Technique in the H.264 Codec for ROI Coding
3.3 Schematic Representation of Decreasing the Frame Rate of the Background
3.4 Quality Map: the leftmost figure shows the original sequence; the center one shows the ROI masked in white and the background in black; the rightmost one shows the quality map indicating the importance given, white being most important and black the least
3.5 Block Diagram of a Spatial ROI Preprocessing Technique: Duplication of ROI Macroblocks
3.6 Schematic Representation of the ROI Macroblock Duplication
3.7 Schematic Representation of the Motion Compensated Temporal Filter
3.8 Lifting Scheme of the Discrete Wavelet Transform
3.9 The MCTF ROI Preprocessing Scheme
3.10 Proposed ROI Preprocessing Scheme
4.1 Frame 1 of the Container Sequence with ROI Selected
4.2 Tracking Results for the Container Sequence
4.3 Frame 1 of the Stefan Sequence with ROI Selected
4.4 Tracking Results for the Stefan Sequence
4.5 Normalized PSNR of the ROI for the Spatial, Temporal, and Spatio-Temporal Processed Carphone Sequence
4.6 PSNR of the Background for the Spatial, Temporal, and Spatio-Temporal Processed Carphone Sequence
4.7 PSNR of the ROI for the Spatial, Temporal, and Spatio-Temporal Processed News Sequence
4.8 PSNR of the Background for the Spatial, Temporal, and Spatio-Temporal Processed News Sequence
4.9 PSNR of the ROI for the Spatial, Temporal, and Spatio-Temporal Processed Foreman Sequence
4.10 PSNR of the Background for the Spatial, Temporal, and Spatio-Temporal Processed Foreman Sequence
4.11 Improvement Ratio of the 3 Sequences

CHAPTER 1

INTRODUCTION

The main challenge in video coding is to reduce the amount of data (number of bits) needed to describe the video while preserving the video quality.
ROI video coding is an approach that exploits the fact that the viewer gives more importance to certain regions of the video than to others. It provides more quality (more bits) to the areas corresponding to the ROI at the expense of reduced quality in the background. ROI video coding is typically used in specific applications. One application is transmitting video under low bit-rate conditions. Transmitting video conference material over a mobile phone can be problematic at low bit rates, since encoding these videos can produce a large number of artifacts. This is addressed by giving more quality to the important areas in the video than to other areas, which is what is done in any video conferencing application, where the facial region is the most important. If the information communicated by the movement of the lips and by facial expressions is lost, the artifacts are significant. Also, reduced quality in the facial area can appear more disturbing to the viewer than reduced quality of the background. As mentioned in [1], this is solved, in several approaches, by applying more compression to the background than to the facial region, as in the articles by Eleftheriadis et al. [2] and Chen et al. [3].

Region of Interest video coding also finds use in surveillance, where regions of interest are often well defined, for example people or vehicles. The problem of transmitting surveillance video from several cameras in real time at low bit rates has been addressed in several publications. Tankus et al. [4] dealt with detecting camouflaged enemies in the shape of people or military vehicles, which [5] suggested could be applied to ROI video coding. Other areas of surveillance where ROI approaches have been applied include traffic surveillance [6] and security applications [7]. In [8], Region of Interest coding is applied to soccer games, where the ball and the soccer players form the region of interest. In summary, ROI coding is used in videoconferencing, surveillance and other low bit-rate video transmissions.

In this thesis, we study Region of Interest coding by breaking it down into two parts: ROI Tracking and Video Preprocessing. A brief overview of each part is given in Sections 1.1 and 1.2; they are discussed in detail in Chapters 2 and 3.

1.1 ROI Tracking

Region of Interest Tracking involves detecting the Region of Interest in each frame given prior information about the ROI's appearance. It is key to good ROI video coding, because correctly predicting and detecting the ROI is crucial: a falsely detected ROI leads to lower perceptual quality than the original video. Region of Interest Tracking can either be done keeping a specific application in mind (face tracking for video conferencing applications), based on models of the human visual system, or treated as a general tracking problem, where the user specifies the ROI in the first frame and the tracker detects it in the consecutive frames. In this thesis, we treat ROI tracking as a tracking problem. This kind of tracking can again be divided into two methods. The first, and the most popular, is where the ROI model is matched with candidate blocks of the current frame, and the candidate block with the highest similarity to the target ROI model becomes the new ROI model. The second is where the ROI model is estimated/predicted based on state space and transition models. These classifications are explained in detail in Chapter 2. In this thesis, ROI tracking was implemented as a matching problem, and Chapter 4 shows the tracking results. The detected ROIs in all the frames were observed and the performance was measured subjectively. A few papers also use objective metrics to measure tracking performance.
Tracking rate is an objective metric that measures the ratio of successfully detected frames to the total number of frames in the video sequence. A detection is considered a mis-track if the detected ROI deviates from the correct ROI by more than a specific number of pixels; the threshold beyond which a detection is declared a mis-track is user-dependent. Hence, in this thesis we have not measured the tracking rate, but instead show the different frames and the detected ROIs.

1.2 ROI Preprocessing

At low bit rates, video coding is performed to optimize the average quality under the constraint of limited bandwidth. Region of Interest coding optimizes the average quality of the video by allotting more quality to certain areas (the Region of Interest) than to others. As mentioned earlier, ROI coding consists of two parts. First, the ROI must be detected, which requires prior knowledge of what the user finds interesting in the sequence. Second, the video sequence is compressed/encoded to different degrees based on the bandwidth conditions. This is achieved by bit allocation, which controls the number of bits allocated to the different parts of the video sequence. Video preprocessing is a method of ROI coding that processes the video in such a way that, when bit allocation takes place in the codec, more bits are allocated to the ROI than to the background. Most ROI coding techniques make modifications to the codec; ROI preprocessing makes no codec modifications, but instead modifies the video sequence before the encoding stage. ROI preprocessing can be classified as spatial or temporal preprocessing. In spatial ROI preprocessing the processing takes place within a single frame, whereas in temporal ROI preprocessing the processing takes place between frames. These two types are explained in detail in Chapter 3.
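For concreteness, the tracking-rate metric described in Section 1.1 could be computed as in the sketch below. This is a hypothetical helper, not the thesis's implementation: the bounding-box format and the pixel tolerance are assumptions.

```python
import numpy as np

def tracking_rate(detected, reference, max_offset_px=10):
    """Tracking rate: fraction of frames whose detected ROI stays
    within a user-chosen pixel tolerance of the reference ROI.
    Boxes are (x, y, w, h) tuples; a frame counts as mis-tracked if
    any box coordinate drifts by more than max_offset_px pixels."""
    hits = 0
    for det, ref in zip(detected, reference):
        if np.max(np.abs(np.asarray(det) - np.asarray(ref))) <= max_offset_px:
            hits += 1
    return hits / len(reference)
```

As the text notes, the tolerance `max_offset_px` is entirely up to the user, which is why the metric is not reported in this thesis.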
The objective metrics used to measure the performance of ROI preprocessing are Peak Signal to Noise Ratio (PSNR) and bit rate. PSNR is the ratio between the maximum possible power of a signal and the power of the noise that affects the signal. The PSNR of each compressed video frame with respect to the corresponding original video frame is calculated, and an average over all frames is found. PSNR is given by equation (1.1):

    PSNR_dB = 10 log10 (255^2 / MSE)    (1.1)

where MSE is the Mean Squared Error, given by:

    MSE = (1/N) Σ_{i=1}^{N} (x_i − y_i)^2    (1.2)

where N is the number of pixels within each frame, 255 is the maximum pixel intensity value, and x_i and y_i represent pixel i in the original and distorted video sequences, respectively. Wang [56] discusses including more HVS characteristics in objective measures to improve on PSNR as a metric, since PSNR disregards the position of a pixel, its context and its perceptual importance. Bit rate is another objective metric; it indicates the bandwidth needed to transmit the video file. Since network bandwidth is an expensive resource, a low bit rate is always desired. In this thesis we measure bit rate through file sizes: a lower file size corresponds to a lower bit rate.

1.3 Thesis Outline

This chapter gave an introduction to all topics discussed in this thesis. The rest of the thesis is organized as follows. Chapter 2 discusses Region of Interest Tracking. First, the different ways of doing tracking (matching and estimation) are discussed, followed by region descriptors and covariance as a region descriptor. Finally, the covariance tracking algorithm and its model update strategies are explained. Chapter 3 discusses Region of Interest Preprocessing, starting with Region of Interest Coding and its different methods (spatial and temporal), and moving on to Region of Interest Preprocessing and its different methods (spatial, temporal and spatio-temporal). Chapter 4 contains the ROI tracking and preprocessing results.
Chapter 5 concludes the thesis and provides some ideas for further work.

CHAPTER 2

REGION OF INTEREST TRACKING

ROI Tracking is the detection of a region in each video frame given prior knowledge of what the user finds interesting in the sequence. As mentioned earlier, ROI tracking can either be done keeping a specific application in mind (face tracking for video conferencing applications), based on models of the human visual system, or treated as a tracking problem, where the user specifies the ROI in the first frame and the tracker detects it in the consecutive frames. It thus involves detecting an object/region in consecutive video frames given its location in the first frame. In other words, if V is a video composed of n frames F1...Fn and B1 is the contour of the ROI in frame F1, the tracking problem consists in defining the contours B2...Bn from B1 [9]. This kind of tracking can be done using not just appearance models but also motion estimation. ROI Tracking finds application in video surveillance (crime watch, sports statistics/reporting), post-production cinema (applying special color effects to objects), video transfer (video-conferencing) and other vision applications. Though a significant amount of work has been done in this area, there is still no single robust solution for detecting non-rigid, fast-moving bodies. In this thesis, we do not propose a new solution; rather, we study a method which combines spatial, structural and other appearance attributes to detect the ROI.

2.1 Introduction

A Region of Interest is an area in a video frame characterized by several attributes such as color, structure, texture and brightness. ROI Tracking involves detecting this area in consecutive frames given its location in the first frame. The difficulty is that the area/ROI can undergo slight appearance changes, illumination changes, occlusions, etc.
It is the task of the ROI tracking scheme to come up with a method that detects the ROI in the presence of such outliers. ROI Tracking can be done in two primary ways:

1. Matching the target block (ROI) with the candidate blocks and making the best match the next target block

2. Estimating the state (target block) given all the measurements up to that moment, or equivalently constructing the probability density function of the object location

2.1.1 Tracking as a Matching Problem

Tracking can be modeled as a matching problem. It involves matching the target block in the previous frame with the candidate blocks in the current frame. Similarity measures are used to quantify the match: similarities of geometric features, like shape, location and structure, and of non-geometric features, like color and illumination, are measured. The candidate block with the highest similarity measure becomes the new ROI. Tracking using color histogram matching [10] implements tracking as a matching problem. Here, the color histograms of the target region and the candidate region are obtained, and the similarity of the histograms is measured by a similarity measure called the Bhattacharyya coefficient. The candidate region with the highest Bhattacharyya coefficient becomes the new target region. The method is now explained in detail. The target is modeled by its color distribution. If x_i, for i = 1, 2, ..., n, are the pixel locations of the target model centered at zero, we define a function b : R^2 → {1...m} which associates with each pixel the index of the histogram bin corresponding to the color of that pixel. The probability of the color u in the target model, denoted q_u, is derived by employing a convex, monotonically decreasing kernel which assigns a smaller weight to locations farther from the center of the target. This probability is calculated for all colors u = 1...m. Starting from an estimated location, the current frame is broken down into blocks of the same size as the target model. The color histograms of these regions are found and their similarities are measured. The Bhattacharyya coefficient is given by:

    ρ(y) = ρ[p(y), q] = Σ_{u=1}^{m} sqrt(p_u(y) q_u)    (2.1)

where p(y) = {p_u(y)}, u = 1...m, with Σ_{u=1}^{m} p_u(y) = 1, is the m-bin color histogram of the candidate region at location y, and q = {q_u}, u = 1...m, with Σ_{u=1}^{m} q_u = 1, is the m-bin color histogram of the target region. The similarity of the two histograms is measured using the distance:

    d(y) = sqrt(1 − ρ[p(y), q])    (2.2)

Minimum distance implies highest similarity, so the Bhattacharyya coefficient ρ[p(y), q] is maximized in order to find the minimum distance. Mean shift iterations are used for the maximization; the iterations start at the same location as that of the ROI in the previous frame.

Advantages:

1. It is robust to rotational changes, camera position, partial occlusions, etc.

2. It performs extremely well when the ROI is highly specific in color. For example, color histogram matching works very well in face tracking, since skin color occupies a very compact area in the two-chroma-component space.

Disadvantages:

1. It fails in cases where the discriminating power between histograms is not high. This happens when the ROI is not strongly characterized by color.

We did not choose the color histogram method for our tracking implementation because the tracking is very specific to color and hence is not robust. There are joint histogram matching methods available which model many features as a histogram, but we did not choose those either, since the dimensionality becomes high as the number of features increases. The meaning of this statement will become clearer later in this chapter.

2.1.2 Tracking as an Estimation Problem

Tracking can also be treated as an estimation problem. It is also called stochastic detection/tracking.
In this method, the ROI of the video frame Y_t is modeled as Z_t, parameterized by θ_t (as shown in equation (2.3)), and a state space model (described by state transition and observation models) is employed to accommodate the multiple frames, as shown in equations (2.4) and (2.5).

Region of Interest:

    Z_t = T[Y_t, θ_{t−1}]    (2.3)

State Transition Model:

    θ_t = F_t(θ_{t−1})    (2.4)

Observation Model:

    Y_t = G_t(θ_t)    (2.5)

where G_t maps the state to the observation. The task of the estimation problem is to find p(θ_t | Y_{1:t}) given the observation likelihood p(Y_t | θ_t) and the state transition probability p(θ_t | θ_{t−1}). This estimation problem can be implemented with the observation (appearance) model embedded in a particle filter, or using the EM (Expectation-Maximization) algorithm. These methods will not be discussed in detail, but their basic ideas are presented here. A particle filter approximates the posterior distribution p(θ_t | Y_{1:t}) by a set of weighted particles S_t = {(θ_t^(j), w_t^(j))}_{j=1}^{J} with Σ_{j=1}^{J} w_t^(j) = 1. The state estimate θ̂_t (the ROI characteristics) is given by the Minimum Mean Square Error estimate (MMSE), the Maximum A Posteriori estimate (MAP), or other forms. Equation (2.6) shows the MMSE estimate of the unknown state:

    θ̂_t = E[θ_t | Y_{1:t}] ≈ Σ_{j=1}^{J} w_t^(j) θ_t^(j)    (2.6)

where E is the expectation operator, θ_t is the unknown state (target characteristics), Y_t is the observation (frame), w_t^(j) are the weights, and J is the number of particles used to approximate the filter. The weights are calculated using the state transition probability p(θ_t | θ_{t−1}) and the observation likelihood p(Y_t | θ_t).

The EM algorithm finds the state estimate θ using the Maximum Likelihood Estimate (MLE). It alternates between an expectation (E) step, which computes the expectation of the log likelihood with respect to the current estimate of the distribution of the latent (unknown) variables, and a maximization (M) step, which computes the parameters that maximize the expected log likelihood found in the E step. These parameters are then used to determine the distribution of the latent variables in the next E step. One of the main advantages of treating tracking as an estimation problem is that it does not just find the local best match but a global one. This comes at the expense of being computationally intensive: such algorithms are stochastic in nature, that is, they require the analysis of random phenomena, which is generally expensive. We did not use an estimation approach in our tracking scheme, as the complexity of such methods is high and they are generally computationally intensive. To perform any kind of tracking (matching or estimation), the first task lies in modeling the ROI. These models are called region descriptors and are studied in detail in the next section.

2.2 Region Descriptors

The ROI needs to be described/modeled well to perform good ROI tracking. A model based on the different features of the region is called a region descriptor. Features like color, edge information (gradient) and intensity, which are specific to each region, form the model parameters. Hence, feature selection is one of the most important steps in tracking and classification problems. Good features should be discriminative and easy to compute. Features such as color, gradients and filter responses are the simplest choices for image features and have been used for many years in computer vision. Histograms are a natural extension of such features; they are non-parametric estimators, effective in non-rigid object tracking.
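To make the histogram-matching similarity of Section 2.1.1 (equations (2.1) and (2.2)) concrete, it can be sketched in a few lines. This is a minimal sketch that assumes the m-bin histograms have already been built and normalized to sum to one:

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """Distance d(y) = sqrt(1 - rho) between two normalized m-bin
    histograms, where rho = sum_u sqrt(p_u * q_u) is the Bhattacharyya
    coefficient (rho = 1 for identical histograms)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    rho = float(np.sum(np.sqrt(p * q)))
    # Clamp tiny negative values caused by floating-point round-off.
    return float(np.sqrt(max(0.0, 1.0 - rho)))
```

The candidate region whose histogram minimizes this distance (equivalently, maximizes rho) becomes the new target region.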
A major concern is the lack of a competent similarity criterion that captures both spatial and statistical properties; most approaches depend either on color distributions alone or on structural models alone. Joint feature histograms can be used, but they do not scale to higher dimensions: their dimensionality is b^d, where b is the number of histogram bins and d is the number of features. Even plain appearance models, where the region is represented by raw pixel values, do not scale well, as they need n × d dimensions, where n is the number of pixels. This is where covariance matrices are superior [11]. They provide an excellent way of fusing multiple features and scale well to higher dimensions, since the size of the descriptor (the covariance matrix) depends only on the number of features d; it has only (d^2 + d)/2 distinct values. We will now study the covariance matrix as a region descriptor in detail, and also study distance calculation on these matrices.

2.3 Covariance as a Region Descriptor

Let I be a one-dimensional intensity/gray-scale image or a three-dimensional color image. Let F be the feature image obtained from I. The dimensionality of F is W × H × d, where W × H is the size of the ROI and d is the number of features used to describe the ROI. It is given by the following equation:

    F(x, y) = φ(I, x, y)    (2.7)

where φ can be any mapping such as intensity, color, gradients, filter responses, etc. An example feature vector is shown below. Note that the feature image/vector can be constructed using two types of attributes: spatial attributes and structural attributes.

    f_k = [x, y, I(x, y), I_x(x, y), ...]    (2.8)

where k is the pixel, x and y are the coordinates of the pixel, I(x, y) is the intensity at pixel k, and I_x(x, y) is the intensity gradient along the x direction.
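The feature-image mapping of equations (2.7) and (2.8) can be sketched as follows. The particular choice of d = 5 features here (coordinates, intensity and both gradients) is one illustrative option, not the only one:

```python
import numpy as np

def feature_image(gray):
    """Map a gray-scale ROI to an H x W x d feature image, as in
    F(x, y) = phi(I, x, y): pixel coordinates, intensity, and
    central-difference gradients standing in for I_x and I_y."""
    h, w = gray.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    iy, ix = np.gradient(gray.astype(float))  # d/dy, d/dx
    return np.dstack([xs, ys, gray.astype(float), ix, iy])
```

Each pixel k of the result is exactly the vector f_k = [x, y, I(x, y), I_x(x, y), I_y(x, y)].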
After the feature image F for the region R is obtained, its d × d covariance matrix is constructed as follows:

    C_R = (1 / (n − 1)) Σ_{k=1}^{n} (z_k − μ)(z_k − μ)^T    (2.9)

where z_k, for k = 1...n, are the d-dimensional feature points inside the ROI R, and μ is the mean of the points. It can be seen from the above equations that the covariance matrix is an efficient way to fuse multiple features. The diagonal entries represent the variance of each feature and the off-diagonal entries represent the correlations. As the covariance measurement involves averaging, the noise corrupting individual samples is largely filtered out.

There are several advantages to using covariance matrices as region descriptors. Some of them are:

1. It is the most natural and simple way to combine structural and statistical features

2. They are low dimensional compared to other descriptors. A covariance descriptor has only (d^2 + d)/2 distinct values, since its size does not depend on the size of the region but only on the number of features used

3. Scale and rotation invariance can be achieved easily, depending on how the features are defined

4. Covariance is invariant to mean changes, such as an identical shifting of color values, since computing covariance involves subtracting the mean from the actual values; shifting all values identically shifts the mean by the same amount. As a result, covariance descriptors can track sequences under varying illumination conditions

In the next section, we study how to measure the similarity of such matrices; this similarity needs to be computed in order to perform ROI tracking as matching.

2.4 Distance Calculation on Covariance Matrices

The space of covariance matrices is not closed under multiplication with negative scalars. Hence, a simple subtraction of covariance matrices will not work.
A different metric involving traces and joint eigen-values of the matrices is used [12]. It is shown 14 to be the distance coming from a canonical invariant Riemann metric on the space of real symmetric positive definite matrices and is given by the following equation: 𝜌(𝐶1 , 𝐶2 ) = √ Σ𝑛𝑖=1 [𝑙𝑛2 𝜆𝑖 (𝐶1 , 𝐶2 )] (2.10) Where 𝜆𝑖 (𝐶1 , 𝐶2 ) for 𝑖 = 1...𝑛 are the generalized eigen values computed from 𝜆𝑖 𝐶1 𝑥𝑖 − 𝐶2 𝑥𝑖 = 0 for 𝑖 = 1...𝑑 and 𝑥𝑖 ∕= 0 . The reason that the sum of logarithmic values of generalized eigen values of the 2 matrices indicate the similarity of 2 covariance matrices has been shown below: Let 𝑥 be a column vector of properties of the candidate region that is a Gaussian random variable. Let us now multiply the random vector 𝑥 with a scalar vector 𝑒𝑇 such that 𝜎 2(𝐶) ≤ 𝜎 2(𝐻) (2.11) where 𝐶 and 𝐻 are the covariance matrices of 𝑥 (candidate region) and the reference ROI respectively, and 𝜎 is variance. From this we get the equation below: 𝑒𝑇 𝐶𝑒 ≤ 𝑒𝑇 𝐻𝑒 (2.12) Hence, 0 ≤ 𝜆(𝑒) = 𝑒𝑇 𝐶𝑒 ≤1 𝑒𝑇 𝐻𝑒 (2.13) 𝑒 ∕= 0. From this we get an equation of the form 𝜆𝐻𝑒 − 𝐶𝑒 = 0 (2.14) This equation is solved by setting 𝜆 as the maximum generalized eigen vector (𝜆). But the maximum 𝜆 value is not completely descriptive of the covariance matrix 15 similarity. We can use traces and determinants of the 𝜆 values, bu they are also not indicative of the similariy of the covariance matrices. Hence we use Σ𝑛𝑖=1 𝑙𝑛2 𝜆𝑖 as the metric where 𝜆𝑖 is the ith generalized eigen value. It is a monotonically increasing function with 𝜆𝑖 . It is a metric because it satisfies the metric axioms for positive definite symmetric matrices. The axioms are shown below. 1. 𝜌(𝐶1 , 𝐶2 ) ≥ 0 and 𝜌(𝐶1 , 𝐶2 ) = 0 only if 𝐶1 = 𝐶2 2. 𝜌(𝐶1 , 𝐶2 ) = 𝜌(𝐶2 , 𝐶1 ) 3. 
\rho(C_1, C_2) + \rho(C_1, C_3) \geq \rho(C_2, C_3)

The generalized eigenvalues can be computed with O(d^3) arithmetic operations using numerical methods, and an additional d logarithm operations are required for the distance computation, where d is the number of features used in the region descriptor. This is usually faster than comparing two histograms, whose size grows exponentially with d.

2.5 The Covariance Tracking Algorithm

We now study the whole tracking process.

1. Start at the estimated location (x0, y0) (the location of the ROI in the previous frame) in the current frame, and obtain the covariance matrix of the region centered at (x0, y0) and of the same size as the reference ROI.

2. Repeat step 1 for all locations in the neighborhood of the pixel (x0, y0). The neighborhood is the region around the pixel whose size is decided by the user.

3. Calculate the distance of each covariance matrix obtained in step 2 from the reference covariance matrix, which is the covariance matrix of the ROI of the previous frame.

4. Find the minimum distance and assign the region having the minimum distance as the new ROI / target.

5. Repeat steps 1 to 4 for all frames in the video sequence.

This algorithm applies when the ROI characteristics (location and other information) for the first frame are provided at the start.

2.6 Model Update Strategies

The above algorithm was used as the implementation of the tracker. We tested it on sequences in which the ROI moved in both rigid and non-rigid fashion. The tracking was successful on these sequences when we kept the previous frame's ROI as the desired ROI in the current frame. But there are cases when the ROI changes a lot over the sequence, and it is necessary to adapt to these variations: a model update of the ROI needs to be performed. Before we get into how the model update is done, let us look at the different ways to fix the ROI used as the reference for the next frame.

1. Use a fixed ROI / template as the reference

2.
Make the previous ROI the new reference

3. Update the ROI model with every frame

The first method fails when the ROI changes appreciably over the sequence, since a fixed template cannot follow it. If we implement the second method, making the last match the current reference, large errors can also result, because a rapidly changing ROI is susceptible to drift. Thus a compromise between these two extremes is needed. This is what is called model updating, where the reference is constructed so that it contains not only the ROI information of the immediately previous frame but that of earlier frames as well. There are again 2 ways of doing the model update:

1. Give equal priority to all frames

2. Weight the frames so that priority is given to the most recent frames over the earliest ones

Most model update techniques weight the frames exponentially, giving more priority to the recent frames over the older ones [13]. The only drawback of model updating is the increased cost of computation, but this is compensated by the robustness of the reference ROI model built.

2.7 Chapter Summary

In this chapter, we looked at tracking posed both as a matching and as an estimation problem, with an example of each. We then studied how the ROI is modeled, under 'region descriptors'. The covariance matrix as a region descriptor and the metric used to find the similarity between 2 covariance matrices were studied next; finally, the whole tracking algorithm and some additions to it (model update) were studied.

CHAPTER 3

ROI CODING & PREPROCESSING

The main idea behind ROI coding is to increase the quality within the ROI by decreasing the quality of the background. This can be achieved by reallocating bits from the background to the foreground; more bits for the ROI implies better visual quality there. There are several ways of doing ROI coding. Preprocessing is one of them, where the required processing takes place outside the codec. The most important advantage of this method is that changes can be made easily.
But it comes at the cost of losing the ability to adapt the background quality to the variable bit rates of the channel. We will study different preprocessing techniques in detail in this chapter, but we first look at ROI coding in general and its classifications.

3.1 Introduction

As mentioned above, ROI coding is essentially the reallocation of bits from the background to the foreground. ROI coding can be achieved either by a preprocessing stage before the encoding or by controlling parameters in the encoder; this forms one basis of classification. Methods that make modifications to the codec allow alterations of the encoder as long as the required features are included and the syntax of the bit stream is unaltered. Most ROI coding techniques use this approach.

Apart from classifying ROI coding techniques by where the processing is done relative to the encoder, they can also be classified by how the processing is done: spatially or temporally. ROI coding can thus also be classified as:

1. Spatial ROI coding

2. Temporal ROI coding

They are explained in detail in the following sections.

3.2 Spatial ROI coding

As mentioned in [1], most of the research on bit allocation for ROI video coding applies spatial methods, and mostly methods that control parameters within the codec. The most common way of performing spatial ROI coding is to change the quantization step size. A larger / coarser step size for a particular region implies that fewer bits are allocated to that region. Hence, a larger quantization step size is chosen for the background and a smaller one for the foreground. In one example, two step sizes are used for each frame: the smaller step size for the ROI and the larger one for the background [14] [15]. The problem with this method is that the difference between background and ROI quality appears abrupt if there is a large difference in quantization. This is referred to as blockiness.
Solutions include adapting the quantization step size of the background based on the distance from the ROI border [16], [17], applying three quantization levels [18], or deciding the step size based on a sensitivity function defined by the user [18]. Other approaches that change other encoder parameters (such as controlling the number of non-zero DCT coefficients) have been addressed in [19] and [20]. Controlling quantization parameters allows direct integration with the rate-distortion function within the codec, but can introduce blockiness due to coarse quantization.

Some preprocessing techniques, whose resulting error is generally less disturbing than that of the conventional methods, perform spatial ROI coding by low-pass filtering the background. Low-pass filtering (blurring) of the non-ROI is dealt with in [3] and [21]. Low-pass filtering results in fewer non-zero DCT components and thus reduces information. There is also a reduction in the prediction error due to the absence of high frequencies. The rate-distortion optimization of the encoder then reallocates bits to the ROI, which still contains high frequencies. Low-pass filtering of the background using a single filter for the whole frame is discussed in [3] by Chen et al. This leads to a distinct boundary separating the ROI and the background, decreasing the perceptual quality of the video. A gradual transition of quality away from the ROI to overcome this problem has been discussed in [21] by Itti.

We have looked at an overview of spatial ROI encoding techniques. We now study a specific technique, called FMO, which is a tool in the H.264 video codec that can be used to perform ROI video coding. This method was not implemented in the thesis, but is explained to give a better picture of how ROI coding is performed by changing codec parameters.
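Before turning to FMO, the background-blurring idea above can be sketched in code. This is an illustrative NumPy/SciPy sketch: the linear, distance-based quality map is one possible way to obtain the gradual ROI-to-background transition discussed above (the cited works use their own falloffs), and sigma and falloff are arbitrary tuning values.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, gaussian_filter

def roi_blur(frame, roi_mask, sigma=3.0, falloff=10.0):
    """Blur the background of a grayscale frame with a quality map in
    [0, 1]: 1 inside the ROI, decaying linearly with distance from the
    ROI boundary (an illustrative choice of map). The output therefore
    transitions gradually from a sharp ROI to a blurred background."""
    dist = distance_transform_edt(~roi_mask)        # 0 inside the ROI
    quality = np.clip(1.0 - dist / falloff, 0.0, 1.0)
    blurred = gaussian_filter(frame, sigma)
    return quality * frame + (1.0 - quality) * blurred
```

Pixels inside the ROI are returned untouched, pixels farther than `falloff` from the ROI are fully blurred, and the band in between is a blend, avoiding the distinct boundary produced by a single whole-frame filter.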
3.2.1 Flexible Macroblock Ordering

One of the new features of the H.264/AVC standard is an error resilience tool called Flexible Macroblock Ordering (FMO) [22]. FMO divides an image into regions called slice groups (a maximum of 8). These slice groups are collections of macroblocks that are assigned using an MBAmap (MacroBlock Allocation map). The MBAmap contains, for each macroblock, an identification number that specifies to which slice it belongs. How this identification number is decided for each macroblock, in other words to which slice each macroblock goes, is discussed later in this section, under 'Types of FMO'. The macroblocks in each slice are processed in scan order (from left to right and top to bottom). Each slice is transmitted independently in a separate unit called a packet. Each packet contains its own header, so each slice (which is transmitted as a packet) can be decoded independently. This makes it easy to correct errors by exploiting the spatial redundancy of the images. For example, the macroblocks could be grouped in such a way that no macroblock and its neighbor belong to the same group. This way, if a slice is lost during transmission, it is easy to reconstruct the lost blocks by interpolating information from the neighboring blocks. The use of FMO together with advanced error resilience tools can preserve visual quality even at a packet loss rate of 10%.

When using FMO, the image can be divided into different scan patterns of macroblocks. FMO has 7 different types, from Type 0 to Type 6. Type 6 is the most random, giving the user full flexibility. All the others follow a certain pattern. These patterns can be exploited when storing and transmitting the MBAmap:

1. Type 0: uses run lengths which are repeated to fill the frame. Therefore only those run lengths have to be known to rebuild the image on the decoder side.

2.
Type 1: also known as scattered slices; it uses a mathematical function, known to both the encoder and the decoder, to spread the macroblocks. The distribution in the figure, in which the macroblocks are spread to form a checkerboard, is very common.

Figure 3.1: FMO Types

3. Type 2: is used to mark rectangular areas, so-called regions of interest. In this case the top-left and bottom-right coordinates of the rectangles are saved in the MBAmap.

4. Types 3-5: are dynamic types that let the slice groups grow and shrink over the different pictures in a cyclic way. Only the growth rate, the direction and the position in the cycle have to be known.

The different types are illustrated in figure [3.1].

Each FMO type has its own application. FMO Type 1 can be useful in videoconferences to maintain privacy; the checkerboard pattern could be used for this. Each slice group is sent in a different packet, so an eavesdropper trying to decode the videoconference would have to know exactly in which two packets the information is being sent. It can also be used in network environments with a high packet loss rate.

It is FMO Type 2 which is used for ROI encoding. It performs ROI encoding by encoding the background slice with fewer bits compared to the foreground / ROI slice. The number of bits in the ROI is increased by decreasing the quantization step size of the slice containing the ROI information:

B_i = A \left( K \frac{\sigma_i^2}{Q_i^2} + C \right)    (3.1)

where B_i is the total number of bits allotted to slice i; A is the number of pixels in a macroblock; K is a constant adjusted based on the statistics of the specific frame being encoded; \sigma_i^2 is the variance of the motion-compensated residual signal of the i-th macroblock; Q_i is the quantization step size of the i-th slice; and C is the average rate needed to encode the motion vectors and the bit stream header.
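In code, the rate model of Eq. 3.1 reads as follows; all the numeric parameter values below are arbitrary and purely illustrative.

```python
def bits_for_slice(A, K, sigma2, Q, C):
    """Eq. 3.1: B_i = A * (K * sigma_i^2 / Q_i^2 + C), where A is the
    pixels per macroblock, K a frame-dependent constant, sigma2 the
    variance of the motion-compensated residual, Q the quantization
    step size, and C the motion-vector/header overhead rate."""
    return A * (K * sigma2 / Q ** 2 + C)

# Illustrative values only: a finer step size for the ROI slice
roi_bits = bits_for_slice(A=256, K=1.0, sigma2=100.0, Q=8, C=0.1)
bg_bits = bits_for_slice(A=256, K=1.0, sigma2=100.0, Q=16, C=0.1)
assert roi_bits > bg_bits
```

Halving the step size Q of the ROI slice quadruples the residual term of its bit budget, which is exactly the lever the FMO Type 2 scheme pulls.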
Therefore, to increase the number of bits B_i in a slice, the quantization step size Q_i is decreased. The generalized block diagram of the technique is shown in figure [3.2].

Figure 3.2: Block Diagram of the FMO Technique in the H.264 codec for ROI coding

A brief summary of the FMO ROI technique:

1. It is a spatial ROI technique.

2. FMO Type 2 is most commonly used for ROI encoding, where each frame is divided into 2 rectangular slices, one being the foreground and the other the background.

3. It is codec dependent, since parameters inside the codec are changed.

We now proceed to temporal ROI coding and study one of its techniques in detail.

3.3 Temporal ROI coding

The main idea in temporal ROI coding is to reduce the frame rate of the background compared to that of the ROI. A lower frame rate requires fewer bits because the motion vectors become smaller, and thus the number of bits needed to represent them also decreases. One way of reducing the frame rate of the background is to extract the ROI from the video and transmit the ROI and the background as 2 different sequences, transmitting the background at a lower frame rate. This is supported by the MPEG-4 standard. Schemes which extract and encode objects and background in separate layers, transmitted at 2 different frame rates (a lower one for the background), have been discussed in papers [23] and [24].

There are some preprocessing methods which perform temporal coding. Temporally averaging the background with the help of a filter reduces the variance of the background and hence reduces the number of bits required to represent it. All of these methods leave the syntax of the encoded bit stream unaffected, and therefore compatibility with all video standards remains. There are also temporal methods that do change the syntax of the bit stream. In [25], the macroblocks not belonging to the ROI are skipped in the inter-frames.
In [18], all macroblocks of the background are skipped if the motion between the frames is small (does not cross a global threshold). [26] also describes a similar approach. After this brief overview of the different temporal coding techniques, we now proceed to a specific temporal technique.

3.3.1 Decreasing the frame rate of the background

The block diagram of this technique is shown in figure [3.3]. The main idea of this technique is that the frame rate of the background is decreased by a factor of 2 while that of the ROI is retained, so the number of bits needed to encode the background is much smaller than that of the ROI. The frame rate of the background is halved by replacing the background of every even frame with the background of the previous odd frame (as can be seen in the figure). As a result, the motion vectors of the background, which encode the number of pixels a macroblock has moved, decrease in size. The prediction error of the background, which is the error between a macroblock and its best match in the previous frame, also decreases. Therefore the number of bits needed to represent the background decreases.

Figure 3.3: Schematic Representation of decreasing frame rate of the background

Summary of the temporal technique:

1. It is a temporal ROI processing technique.

2. It is a preprocessing technique, as all the ROI coding happens outside the codec.

3. It only works in cases where the background does not change much between consecutive frames.

We have looked at the 2 different ways ROI encoding is done: spatial and temporal techniques. We now look at Region of Interest coding done in the preprocessing stages.

3.4 ROI Preprocessing Techniques

Quantization bits are reallocated from the background areas of the images to the ROI to ensure quality in the ROI.
Previous works have mainly concentrated on controlling the quantization step sizes of the blocks in a frame by deciding whether each block is classified as ROI or background [3-4]. The blocks belonging to the ROI get a smaller step size and those in the background get a bigger one. But these methods require changes within the codec; preprocessing methods achieve independence of the codec. [1] and [5] discuss low-pass filtering (blurring) of the background. This decreases the information resulting from the discrete cosine transform or discrete wavelet transform of the background, as the high-pass coefficients are removed. The low-passed image is passed on to a codec, which allocates small quantization step sizes (i.e. more coding resources) to the ROI and large step sizes to the background. In [1], a low-pass filter is applied to the background using only one filter. This leads to boundary effects and therefore a decrease in the perceptual quality. A Gaussian pyramid is used to overcome this problem, as it creates a smoother transition from the ROI to the background, assuming the centers of the regions to be placed at the precise locations where the human is predicted to gaze. Reducing the frame rate of the background by a factor of 2, as explained in [3.3.1], is another ROI preprocessing technique. Like the ROI coding schemes, ROI preprocessing can be classified as spatial and temporal. We discussed some of these techniques above; we study them in detail in the next sections.

3.4.1 Spatial ROI Preprocessing Techniques

Spatial ROI preprocessing makes modifications within each frame, before the encoding stage, to give more quality to the ROI than to the background. The most common way of doing spatial preprocessing is low-pass filtering (blurring) the background. Though there is a considerable degradation in the quality of the background, the compression ratio achieved is very high.
One variation on these low-pass filters is to use a quality map, which helps provide a more gradual transition from the ROI to the background. Figure [3.4] shows the original frame, the binary Region of Interest map showing the ROI in white, and the quality map. The quality map is basically an importance map which gives more priority to pixels close to the center of the ROI and gradually reduces the importance towards the background. The quality map is assigned values ranging from 0 to 1, 1 being the most important and 0 the least. A similar spatial low-pass filtering preprocessing technique is described in [27]; it uses several low-pass Gaussian filters and helps improve the PSNR of the ROI by removing the border effects, as there is an even smoother transition (compared to only one filter) from the ROI to the background.

Figure 3.4: Quality Map; the leftmost figure shows the original sequence, the center one shows the ROI masked in white and the background in black; the rightmost one shows the quality map indicating the importance given - white being most important and black the least

Duplication of pixels in the ROI is another spatial preprocessing technique. It adds redundancy to the ROI so that it is more resilient to losses in the network. It is used in conditions where high bit-rate transmission is supported, since adding redundant bits to the bit stream increases the required bit-rate. The block diagram of this duplication technique is shown in Figure [3.5]. The preprocessing step T is a non-linear transform that duplicates each row of a macroblock in the ROI just below the original row. A representation of the non-linear transform / duplication is shown in figure [3.6]. The post-processing block T', which occurs after decoding a frame, consists of an error recovery step and the inverse transform step.
When either the coefficients or the motion vectors for a particular macroblock are lost (a bad macroblock), and its corresponding macroblock pair was received successfully (a good macroblock), the error recovery step simply uses the reconstructed data of the good macroblock.

Figure 3.5: Block Diagram of a spatial ROI preprocessing technique - Duplication of ROI macroblocks

When both macroblocks in a macroblock pair are bad, that is, both suffer either coefficient or motion vector loss, the error concealment block takes over. In the concealment step, when coefficient loss occurs, the reconstructed macroblock pointed to by the motion vector is used. When the motion vector is lost, median prediction among the motion vectors associated with the neighboring macroblocks is used to perform the concealment.

In brief, there are 2 main ways to perform spatial ROI preprocessing:

1. Low-pass filtering / blurring the background

2. Duplicating macroblocks in the ROI

We implemented low-pass filtering as the spatial preprocessing technique. We chose it over the duplication of ROI macroblocks because of its simplicity of implementation and the fact that it works under the same bandwidth conditions, unlike the duplication technique, which requires additional bandwidth.

Figure 3.6: Schematic Representation of the ROI macroblock Duplication

We have looked at spatial ROI preprocessing techniques so far; we now look at temporal ROI preprocessing techniques.

3.4.2 Temporal ROI Preprocessing Techniques

Spatial preprocessing techniques work within a frame, whereas temporal preprocessing techniques work on a set of frames. There are 2 primary ways of doing temporal ROI preprocessing. One is to reduce the frame rate of the background, as explained in [3.3.1]. The other is to temporally average the background, that is, to apply a temporal averaging filter to the background.
On temporally averaging the background, the variance of the background decreases, and hence the number of bits needed to represent the background, according to equation [3.1], also decreases. The temporal filter can be implemented as an ordinary averaging filter or as a wavelet filter. The motion-compensated 5/3 lifted temporal wavelet filter is one such wavelet filter, considered in the literature an effective approach for building scalable video codecs. The basic idea of this approach is that a motion-compensated temporal filter is used to decompose a set of video frames into multi-level wavelet subbands (low-pass and high-pass). The high-pass subbands are adaptively weighted, and an inverse transformation is performed to reconstruct the video sequence.

Motion compensation (MC) is used to remove temporal redundancy, and it plays an important role in achieving high compression of the video. As mentioned in [28], temporal redundancy comes from temporal correlations: the same objects exist between frames when the sampling period is small enough that no significant deformation occurs. Motion compensation removes this temporal redundancy by describing a picture in terms of the transformation of a reference picture into the current picture. Motion-compensated temporal filtering (MCTF) is a multiple-frame-reference motion compensation technique which can be thought of as applying the wavelet transform in the temporal domain. An implementation of the MCTF with 5/3 filters for a set of 4 frames is shown in figure [3.7].

Figure 3.7: Schematic Representation of the Motion Compensated Temporal Filter
Figure 3.8: Lifting Scheme of Discrete Wavelet Transform

The high-pass and low-pass subbands are obtained after the prediction and update steps, as shown in the equations below.

High-pass subbands:

h_k[m, n] = x_{2k+1}[m, n] - \frac{1}{2} (x_{2k}[m, n] + x_{2k+2}[m, n])    (3.2)

Low-pass subbands:

l_k[m, n] = x_{2k}[m, n] + \frac{1}{4} (h_{k-1}[m, n] + h_k[m, n])    (3.3)

where x_k[m, n] are the video frames, h_k[m, n] are the high-pass subbands and l_k[m, n] are the low-pass subbands. Another schematic representation of the MCTF is shown in figure [3.8].

To invert the transform, the same lifting steps are applied in reverse order with the signs changed. The inverse transform equations are shown below.

x_{2k}[m, n] = l_k[m, n] - \frac{1}{4} (h_{k-1}[m, n] + h_k[m, n])    (3.4)

x_{2k+1}[m, n] = h_k[m, n] + \frac{1}{2} (x_{2k}[m, n] + x_{2k+2}[m, n])    (3.5)

Having studied the motion-compensated wavelet transform, we now look at MCTF in the context of temporal ROI preprocessing. MCTF decomposes the video frames into low-pass and high-pass subband frames. The high-pass frames are then adaptively weighted such that details in the ROI are given more importance / weight than those in the background; the low-pass subband frames are kept intact. Video content in areas given higher weights is reproduced with more fidelity than in areas given lower weights. Since the background receives a lower weight in the high-pass subband frames, its pixel variation in the temporal domain decreases, and hence the data rate (number of bits) required for the background is also reduced. A block diagram of the temporal preprocessing scheme is shown in figure [3.9]. The object tracking performs the ROI detection, and the importance map generator generates the weights needed to adaptively weight the subbands.
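The lifting equations (3.2)-(3.5) can be exercised directly in code. The sketch below is illustrative: it omits motion compensation (prediction and update use co-located pixels) and uses symmetric extension at the boundaries of the frame group, an assumption the text does not spell out.

```python
import numpy as np

def mctf_53_forward(x):
    """Forward 5/3 lifting in time (Eqs. 3.2-3.3), motion compensation
    omitted for clarity. x: array of 2N frames."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    nxt = np.concatenate([even[1:], even[-1:]])   # x_{2k+2}, symmetric edge
    h = odd - 0.5 * (even + nxt)                  # prediction step
    prev = np.concatenate([h[:1], h[:-1]])        # h_{k-1}, symmetric edge
    l = even + 0.25 * (prev + h)                  # update step
    return l, h

def mctf_53_inverse(l, h):
    """Inverse lifting (Eqs. 3.4-3.5): same steps, reverse order, signs
    changed."""
    prev = np.concatenate([h[:1], h[:-1]])
    even = l - 0.25 * (prev + h)
    nxt = np.concatenate([even[1:], even[-1:]])
    odd = h + 0.5 * (even + nxt)
    x = np.empty((len(l) + len(h),) + l.shape[1:])
    x[0::2], x[1::2] = even, odd
    return x
```

Left untouched, the pair reconstructs the frames exactly; down-weighting the background pixels of the high-pass subbands `h` before inverting is what reduces the background's temporal variance in the preprocessed output.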
In summary, the MCTF preprocessing technique temporally filters the background, which decreases the temporal variance of the background and in turn improves the coding efficiency of the video encoder.

Figure 3.9: The MCTF ROI preprocessing scheme

In brief, there are 2 primary ways to perform temporal ROI preprocessing:

1. Low-pass filtering the background temporally

2. Reducing the frame rate of the background

The first method, low-pass filtering the background temporally, was the chosen implementation because of its simplicity and the fact that knowledge of the ROI coordinates is not needed at the receiver side, unlike in the second method.

3.4.3 Spatio-Temporal ROI Preprocessing Techniques

It is intuitive that combining spatial and temporal preprocessing techniques helps increase the coding efficiency of the video encoder. As mentioned in [1], the spatial methods reduce the background information transmitted in the DCT components or the motion prediction error, while the temporal filters mainly reduce the bits assigned to the background motion vectors. Thus a combination of spatial and temporal approaches increases the reallocation of bits from the background to the ROI, since the spatial filters and the temporal filters reduce the background bits in different parts of the encoding.

Figure 3.10: Proposed ROI preprocessing scheme

In [25], Lee et al. combine a spatial method controlling quantization step sizes with the skipping of background blocks for every second frame under limited global frame motion. Spatial methods controlling the number of DCT components are combined with a temporal method similar to [25], but adapted to ROI and background content, in [49], and combined with a temporal averaging filter in [45].
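For contrast with the wavelet approach, the two plainer temporal preprocessing operations discussed in this chapter, holding the background of every second frame (Sec. 3.3.1) and temporally averaging the background (Sec. 3.4.2), can be sketched as follows. This is an illustrative NumPy sketch; the window length and mask handling are assumptions, not the thesis implementation.

```python
import numpy as np

def halve_background_rate(frames, roi_mask):
    """Sec. 3.3.1: copy the background of every second frame from the
    preceding frame, shrinking background motion vectors and prediction
    error. roi_mask is True inside the ROI."""
    out = [frames[0].copy()]
    for t in range(1, len(frames)):
        f = frames[t].copy()
        if t % 2 == 1:                        # every second (even, 1-based) frame
            f[~roi_mask] = out[t - 1][~roi_mask]
        out.append(f)
    return out

def average_background(frames, roi_mask, window=3):
    """Sec. 3.4.2: sliding-window temporal average of background pixels,
    reducing their temporal variance; ROI pixels are left untouched."""
    stack = np.stack(frames).astype(float)
    out = stack.copy()
    for t in range(len(stack)):
        lo, hi = max(0, t - window // 2), min(len(stack), t + window // 2 + 1)
        out[t][~roi_mask] = stack[lo:hi].mean(axis=0)[~roi_mask]
    return out
```

Both leave the ROI pixels untouched while shrinking either the motion (first function) or the temporal variance (second function) of the background, which is where the bit savings come from.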
In this thesis, we have combined a spatial low-pass filter with the MCTF temporal filter and studied the efficiency of the resulting preprocessing scheme with respect to no ROI preprocessing, spatial-only and temporal-only schemes. A block diagram of the spatio-temporal ROI preprocessing scheme is shown in figure [3.10].

3.5 Chapter Summary

In this chapter, we looked at the different ways of doing Region of Interest coding (spatial and temporal) and then looked specifically at ROI preprocessing techniques, studying them under the same classification of spatial and temporal coding.

CHAPTER 4

EXPERIMENTS AND RESULTS

This section is divided into 2 parts: ROI detection results and ROI preprocessing results. In ROI detection, we study the performance of the covariance tracking method and show its effectiveness in cases where the ROI undergoes large positional and structural changes. In ROI preprocessing, we study the efficiency of our spatio-temporal filter when applied to sequences of different activity levels. We study the performance in terms of PSNR and Improvement Ratio. The Improvement Ratio is indicative of the compression ratio / bit-rate: it is the ratio of the file size of the encoded video sequence with no ROI processing to the file size of the encoded video sequence with ROI processing.

4.1 ROI detection Results

As explained in [2.3], the covariance tracking method offers many advantages over methods such as color histogram matching and other appearance-model matching methods. One main advantage is its ability to fuse multiple features while maintaining a low dimensionality. We assessed the performance of the covariance tracking method by testing it on 2 types of motion: rigid motion and non-rigid motion.
We selected the feature vector as

f_k = [x, y, R(x, y), G(x, y), B(x, y), I_x(x, y), I_y(x, y)]

where k is the pixel index, x and y are the position coordinates, R(x, y), G(x, y), B(x, y) are the RGB (color) values, and I_x(x, y) and I_y(x, y) are the x and y intensity gradients respectively. These features were chosen because they combine position (x, y) with appearance attributes: color (R(x, y), G(x, y), B(x, y)) and structure (I_x(x, y) and I_y(x, y)).

Figure 4.1: Frame 1 of Container Sequence with ROI selected

4.1.1 Example 1 - Rigid Motion

The container sequence, a standard test sequence, consists of a steamer boat moving down the river in a straight line; it exhibits linear / rigid body motion. A small portion at the front of the boat was chosen as the ROI and was tracked / detected in the consecutive frames, as seen in figure [4.1]. Satisfactory results were obtained. Sample tracking results are shown in figure [4.2].

Figure 4.2: Tracking Results for Container Sequence

4.1.2 Example 2 - Non Rigid Motion

The Stefan sequence is now used to see how the covariance tracking algorithm works when the ROI undergoes non-rigid motion. In this sequence, the tennis player is selected as the ROI, as seen in figure [4.3]. The tennis player exhibits non-rigid motion as he twists and turns his body. The covariance tracking algorithm does a good job of tracking him until the last frame. The tracking results are shown in figure [4.4].

We note visually that the ROI in Example 1 is characterized by shape, and that in Example 2 is characterized by color and structure. The covariance tracker does a great job of detecting the ROI in both cases and is as good as any other appearance-based matching approach or color histogram matching.
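The detection pipeline used in these experiments, the descriptor of Eq. 2.9 over the feature vector f_k above, the distance of Eq. 2.10, and the exhaustive neighborhood search of Sec. 2.5, can be sketched end to end. This is an illustrative NumPy/SciPy reimplementation; the window size, search radius and synthetic data are arbitrary, not the thesis test setup.

```python
import numpy as np
from scipy.linalg import eigh

def descriptor(patch):
    """Eq. 2.9 over f_k = [x, y, R, G, B, Ix, Iy] (d = 7)."""
    h, w, _ = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    iy, ix = np.gradient(patch.mean(axis=2))
    z = np.stack([xs, ys, patch[..., 0], patch[..., 1], patch[..., 2],
                  ix, iy], axis=-1).reshape(-1, 7)
    z = z - z.mean(axis=0)
    return z.T @ z / (len(z) - 1)

def distance(c1, c2):
    """Eq. 2.10: sqrt of the sum of squared logs of the generalized
    eigenvalues of the pair; since ln^2(lam) = ln^2(1/lam), the value
    is the same whichever matrix is placed first."""
    lam = eigh(c1, c2, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam) ** 2))

def track(frame, ref_cov, x0, y0, h, w, radius=8):
    """Sec. 2.5, steps 1-4: exhaustive search around (x0, y0) for the
    window whose covariance is closest to the reference."""
    best, best_xy = np.inf, (x0, y0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = y0 + dy, x0 + dx
            if 0 <= y and 0 <= x and y + h <= frame.shape[0] and x + w <= frame.shape[1]:
                d = distance(descriptor(frame[y:y+h, x:x+w]), ref_cov)
                if d < best:
                    best, best_xy = d, (x, y)
    return best_xy, best
```

On a synthetic frame containing a distinctive textured patch, the search recovers the patch location exactly, since the distance vanishes only where the candidate covariance matches the reference.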
Figure 4.3: Frame 1 of Stefan Sequence with ROI selected

Figure 4.4: Tracking Results for Stefan Sequence

4.2 ROI preprocessing Results

The performance of the spatio-temporal ROI preprocessor is studied in terms of 2 metrics: PSNR and Improvement Ratio. PSNR gives a quantitative measure of the quality of the video and is observed both for the ROI and for the background. The Improvement Ratio gives a measure of the bit-rate; as explained earlier, a higher Improvement Ratio implies a lower bit-rate, which is desired. The preprocessor was applied to 3 different video sequences. We first look at the Carphone sequence, then at 2 other sequences, Foreman and News.

As we already know, there is a large degradation in the quality of the background with ROI processing, as the video is compressed. Even though the background quality degrades with the compressed file size, the quality of the ROI remains almost the same. Hence we use normalized PSNR (PSNR / file size) as the metric for the quality of the ROI, as it shows the improvement in ROI quality with ROI processing. The results obtained after passing the sequence through the spatio-temporal filter are shown in the following figures. Figure [4.5] shows the PSNR / file size of the ROI, and figure [4.6] shows the PSNR of the background, for the 3 kinds of processing on the Carphone sequence. The spatio-temporal filter was also run on the other sequences; their PSNRs are shown in figures [4.7], [4.8], [4.9], [4.10], and their Improvement Ratios are plotted in figure [4.11].

Figure 4.5: Normalized PSNR of the ROI for Spatial, Temporal, Spatio-temporal processed Carphone Sequence

4.3 Conclusion

ROI Detection - We ran the covariance tracking algorithm first on a simple sequence in which the ROI moved in a straight line, and then on a more complicated sequence in which the ROI exhibited non-rigid motion.
The reason we chose this method was its simplicity and its ability to fuse multiple features while retaining a small dimensionality. The results obtained were as expected: the covariance tracking method did a good job of detecting the ROI.

Figure 4.6: PSNR of the background for Spatial, Temporal, Spatio-temporal processed Carphone Sequence

ROI Preprocessing - Looking at all the plots, we conclude that spatio-temporal ROI preprocessing is the most efficient. Efficiency was measured both in terms of PSNR (of the ROI) and Improvement Ratio. We also note that the preprocessing scheme shows a marked improvement over no ROI preprocessing at all.

Figure 4.7: PSNR of the ROI for Spatial, Temporal, Spatio-temporal processed News Sequence

Figure 4.8: PSNR of the background for Spatial, Temporal, Spatio-temporal processed News Sequence

Figure 4.9: PSNR of the ROI for Spatial, Temporal, Spatio-temporal processed Foreman Sequence

Figure 4.10: PSNR of the background for Spatial, Temporal, Spatio-temporal processed Foreman Sequence

Figure 4.11: Improvement Ratio of 3 sequences

CHAPTER 5

CONCLUSION AND FUTURE WORK

Real-time video transmission over unreliable and lossy communication channels, under bandwidth and delay constraints, is still a relevant problem. It is particularly difficult over the Internet, a public network not designed to handle traffic with strict bandwidth and delay requirements. There are previous works on error resilience methods which reduce the effect of network losses on the transmitted video streams, such as forward error correction, layered coding (scalable video coding) and others. But all these methods provide the same resilience for all parts of the video. Region of Interest video coding provides more resilience to certain areas of the video which have higher perceptual importance. These areas of high perceptual importance are collectively called the Region of Interest.
Region of Interest Coding involves detecting the region of interest (Region of Interest Tracking) and coding the video (Region of Interest Coding), giving higher priority to the region of interest. In this thesis, we surveyed several methods for Region of Interest Tracking and Video Preprocessing, then selected and implemented one method for each. The work done in this thesis is briefly summarized below.

1. Region of Interest Tracking: We studied the covariance tracking algorithm because of its simplicity and its ability to easily fuse multiple features together. We showed that the covariance tracking algorithm performs well when the ROI undergoes both rigid-body and non-rigid-body motion.

2. Video Preprocessing: There are numerous ways to do Region of Interest video coding. We chose a preprocessing approach due to its simplicity and the fact that it requires no changes within the codec. In preprocessing, we explored both spatial and temporal methods; we also combined them into a spatio-temporal approach and found it superior to the purely spatial and purely temporal methods.

Though other methods of performing ROI tracking and coding were studied, they were not implemented and compared with the chosen methods. Also, the ROI tracking and coding schemes were not integrated, so the performance of the complete ROI coding system was not tested on different video sequences and under different network conditions. There are many additions that can be made to the existing study and implementation; they are listed below.

Work to be done in Region of Interest Tracking:

1. Perform model update.

2. Build a rotation-invariant model.

3. Implement other tracking schemes, such as color histogram matching and particle filtering methods, and compare their performance with the covariance tracking algorithm.

Work to be done in Video Preprocessing:

1.
Implement other coding methods (preprocessing and non-preprocessing) and compare them with the existing spatio-temporal filter.

2. Test the ROI coding scheme under different network conditions.

3. Adapt coding parameters to changing network conditions, detecting those changes through a feedback system.

4. Integrate the Region of Interest Tracking scheme with the preprocessing scheme and evaluate performance based on the PSNR of the desired region of interest.

With these additions we can build a truly complete and efficient system for Region of Interest video coding, for the application of low-bit-rate video transmission over networks.

BIBLIOGRAPHY

[1] L. S. Karlsson, "Spatio-temporal pre-processing methods for region-of-interest video coding," Thesis, Department of Information Technology and Media, Mid Sweden University, 2007.

[2] A. Eleftheriadis and A. Jacquin, "Automatic face location detection and tracking for model-assisted coding of video teleconferencing sequences at low bit-rates," Signal Processing: Image Communication, vol. 7, pp. 231–248, September 1995.

[3] M. J. Chen, M. C. Chi, C. T. Hsu, and J. W. Chen, "ROI Video Coding Based on H.263+ with Robust Skin-Color Detection Technique," IEEE Transactions on Consumer Electronics, vol. 49, pp. 724–730, August 2003.

[4] A. Tankus and Y. Yeshurun, "Detection of Regions of Interest and Camouflage Breaking by Direct Convexity Estimation," IEEE Workshop on Visual Surveillance, pp. 42–48, 1998.

[5] L. S. Karlsson, "Detection of interesting areas in images by using convexity and rotational symmetries," Master Thesis No. 31, Dept. of Science and Technology, Linkoping University, Sweden, 2002.

[6] J. Meessen, C. Parisot, X. Desurmont, and J.-F. Delaigle, "Scene Analysis for Reducing Motion JPEG 2000 Video Surveillance Delivery Bandwidth and Complexity," IEEE International Conference on Image Processing, Genoa, Italy, vol. 1, pp. 577–580, 2005.

[7] G. Wang, T. T. Wong, and P. Heng, "Real-Time Surveillance Video Display with Salience," Third ACM International Workshop on Video Surveillance and Sensor Networks, pp. 37–43, 2005.

[8] J. D. McCarthy, M. A. Sasse, and D. Miras, "Sharp or Smooth? Comparing the effects of quantization vs. frame rate for streamed video," SIGCHI Conference on Human Factors in Computing Systems, pp. 535–542, 2004.

[9] V. Garcia, E. Debreuve, and M. Barlaud, "Region of Interest tracking based on keypoint trajectories on a group of pictures," International Workshop on Content-Based Multimedia Indexing, pp. 198–203, June 2007.

[10] D. Comaniciu, V. Ramesh, and P. Meer, "Real-Time Tracking of Non-Rigid Objects using Mean Shift," IEEE Conference on Computer Vision and Pattern Recognition, 2000.

[11] F. Porikli, O. Tuzel, and P. Meer, "Covariance Tracking using Model Update Based on Lie Algebra," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 728–735, 2006.

[12] W. Förstner and B. Moonen, "A metric for covariance matrices," 1999.

[13] S. K. Zhou, R. Chellappa, and B. Moghaddam, "Visual Tracking and Recognition Using Appearance-Adaptive Models in Particle Filters," IEEE Transactions on Image Processing, vol. 13, November 2004.

[14] D. Chai, K. N. Ngan, and A. Bouzerdoum, "Foreground/Background Bit Allocation for Region-of-Interest Coding," IEEE International Conference on Image Processing, vol. 2, pp. 923–926, 2000.

[15] M. J. Chen, M. C. Chi, C. T. Hsu, and J. W. Chen, "ROI Video Coding Based on H.263+ with Robust Skin-Color Detection Technique," IEEE Transactions on Consumer Electronics, vol. 49, pp. 724–730, 2003.

[16] S. Daly, K. Matthews, and J. Ribas-Corbera, "Face-Based Visually-Optimized Image Sequence Coding," IEEE International Conference on Image Processing, vol. 3, pp. 443–447, 1998.

[17] S. Sengupta, S. K. Gupta, and J. M.
Hannah, "Perceptually Motivated Bit Allocation for H.264 Encoded Video Sequences," IEEE International Conference on Image Processing, vol. 3, pp. 797–800, 2003.

[18] J.-B. Lee and A. Eleftheriadis, "Spatio-Temporal Model-Assisted Very Low-Bit-Rate Coding With Compatibility," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, pp. 1517–1531, December 2005.

[19] X. Yang, W. Lin, Z. Lu, X. Lin, S. Rahardja, E. Ong, and S. Yao, "Local Visual Perceptual Clues and its use in Videophone Rate Control," International Symposium on Circuits and Systems, vol. 3, pp. 805–808, 2004.

[20] H. Wang and K. El-Maleh, "Joint Adaptive Background Skipping and Weighted Bit Allocation for Wireless Video Telephony," IEEE International Conference on Wireless Networks, Communications and Mobile Computing, vol. 3, pp. 1243–1248, 2005.

[21] L. Itti, "Automatic Foveation for Video Compression Using a Neurobiological Model of Visual Attention," IEEE Transactions on Image Processing, vol. 13, pp. 1304–1318, October 2004.

[22] Y. Dhondt and P. Lambert, "Flexible Macroblock Ordering, an error resilience tool in H.264/AVC," Fifth FTW PhD Symposium, Faculty of Engineering, Ghent University, December 2004, Paper No. 106.

[23] J. W. Lee, A. Vetro, Y. Wang, and Y. S. Ho, "Bit Allocation for MPEG-4 Video Coding With Spatio-Temporal Tradeoffs," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 488–502, June 2003.

[24] J. Meessen, C. Parisot, X. Desurmont, and J.-F. Delaigle, "Scene Analysis for Reducing Motion JPEG 2000 Video Surveillance Delivery Bandwidth and Complexity," IEEE International Conference on Image Processing, Genoa, Italy, vol. 1, pp. 577–580, June 2005.

[25] J. Augustine, S. K. Rao, N. P. Jouppi, and S. Iyer, "Region of Interest Editing of MPEG-2 Video Streams in the Compressed Domain," IEEE International Conference on Multimedia and Expo, pp. 559–562, 2004.

[26] H. Wang, Y. Liang, and K. El-Maleh, "Real-Time Region-of-Interest Video Coding using Content-Adaptive Background Skipping with Dynamic Bit Reallocation," IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 45–48, 2006.

[27] L. S. Karlsson, M. Sjostrom, and R. Olsson, "Spatiotemporal filter for ROI video coding," 14th European Signal Processing Conference (EUSIPCO), Florence, Italy, 2006.

[28] X. Liu, R. J. Sclabassi, and M. Sun, "Content-Based Video Preprocessing for Remote Monitoring of Neurosurgery," 1st Transdisciplinary Conference on Distributed Diagnosis and Home Healthcare, vol. 2, pp. 67–70, 2006.