EXPLOITING REGION OF INTEREST FOR IMPROVED
VIDEO CODING
A Thesis
Presented in Partial Fulfillment of the Requirements for
the Degree Master of Science in the
Graduate School of The Ohio State University
By
Ramya Gopalan, B.E. Telecommunication Engg.
Graduate Program in Electrical and Computer Engineering
The Ohio State University
2009
Master’s Examination Committee:
Prof. Ashok Krishnamurthy, Adviser
Prof. Yuan Zheng
© Copyright by
Ramya Gopalan
2009
ABSTRACT
Today, massive video data transfers are routinely carried out over the Internet, but achieving good-quality real-time video transfer is still a challenge. Video requires large bandwidth for error-free transmission over the network. Since bandwidth is an expensive resource, ensuring sufficient bandwidth for error-free transmission of video is difficult; hence, low bit-rate transmission is preferred. But low bit-rate transmission implies lower-quality video. This is where Region of Interest video coding comes into play. It uses the fact that certain areas of the video are of higher perceptual importance than others and assigns a higher quality to those areas. A better perceptual video quality is thus achieved under the same bandwidth conditions.
Region of Interest coding can be broken into two parts: Region of Interest Tracking and Video Preprocessing. Region of Interest Tracking involves tracking the region of interest in all the frames of the video based on criteria specified by the user. Video Preprocessing is the preprocessing performed to ensure higher quality for the region of interest than for the background. This thesis presents a methodology for both Region of Interest Tracking and Video Preprocessing. In Region of Interest Tracking, a covariance tracking method is studied and applied in the two different cases of rigid and non-rigid motion of the Region of Interest (ROI). In Video Preprocessing, a spatio-temporal preprocessing technique is studied and tested on different standard video sequences.
This is dedicated to my family and friends
ACKNOWLEDGMENTS
This thesis would not have been possible if not for a few people who have aided
me academically and otherwise.
I would like to thank Dr. Ashok Krishnamurthy without whose guidance and
research ideas this thesis would have not been possible. I am truly grateful to him
for having confidence in me and trusting my research decisions. Through him, I have
learnt the importance of understanding a problem completely and the importance of
a good literature survey before starting out. I also thank Prof. Yuan F. Zheng, for
being a part of my thesis defense committee and sharing his thoughts and ideas on
my thesis. I would like to thank Dr. Prasad Calyam. His constant guidance and
support throughout my thesis study helped me a great deal. Apart from his technical
guidance, he has helped me understand the importance of organizing and presenting
work. I am also grateful to the staff and students of OSC for making my work in
OSC an enjoyable and a pleasurable one.
Finally, I would like to thank my family and friends whose support means a lot
to me. I am eternally grateful to my parents for having supported me all throughout
and having faith in me and in whatever I choose to do. As I finish my masters, I take
back with me not just a degree but also the friendships I formed with a few people
here. I am thankful to my friends here and in other places for making my masters a
valuable and an enjoyable journey.
VITA
August 21, 1985 . . . . . . . . . . . . . . . . . . . . . . . . . . . . Born - Mumbai, India
June 30, 2007 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B.E. Telecommunication Engg.,
R.V. College of Engineering,
Bangalore, India
September 2007 - present . . . . . . . . . . . . . . . . . . . M.S. Candidate,
Electrical and Computer Department,
The Ohio State University
September 2007 - June 2009 . . . . . . . . . . . . . . . . Graduate Research Associate,
Ohio Supercomputer Center,
The Ohio State University.
FIELDS OF STUDY
Major Field: Electrical and Computer Engineering
Studies in:
Video Processing
Random Signal Analysis
Linear Algebra
TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Figures

Chapters:

1. Introduction
   1.1 ROI Tracking
   1.2 ROI Preprocessing
   1.3 Thesis Outline

2. Region of Interest Tracking
   2.1 Introduction
       2.1.1 Tracking as a Matching Problem
       2.1.2 Tracking as an Estimation Problem
   2.2 Region Descriptors
   2.3 Covariance as a Region Descriptor
   2.4 Distance Calculation on Covariance Matrices
   2.5 The Covariance Tracking Algorithm
   2.6 Model Update Strategies
   2.7 Chapter Summary

3. ROI Coding & Preprocessing
   3.1 Introduction
   3.2 Spatial ROI Coding
       3.2.1 Flexible Macroblock Ordering
   3.3 Temporal ROI Coding
       3.3.1 Decreasing the Frame Rate of the Background
   3.4 ROI Preprocessing Techniques
       3.4.1 Spatial ROI Preprocessing Techniques
       3.4.2 Temporal ROI Preprocessing Techniques
       3.4.3 Spatio-Temporal ROI Preprocessing Techniques
   3.5 Chapter Summary

4. Experiments and Results
   4.1 ROI Detection Results
       4.1.1 Example 1 - Rigid Motion
       4.1.2 Example 2 - Non-Rigid Motion
   4.2 ROI Preprocessing Results
   4.3 Conclusion

5. Conclusion and Future Work

Bibliography
LIST OF FIGURES

3.1  FMO Types
3.2  Block Diagram of the FMO Technique in the H.264 codec for ROI coding
3.3  Schematic Representation of decreasing the frame rate of the background
3.4  Quality Map; the leftmost figure shows the original sequence, the center one shows the ROI masked in white and the background in black, and the rightmost one shows the quality map indicating the importance given - white being most important and black being the least
3.5  Block Diagram of a spatial ROI preprocessing technique - Duplication of ROI macroblocks
3.6  Schematic Representation of the ROI macroblock Duplication
3.7  Schematic Representation of the Motion Compensated Temporal Filter
3.8  Lifting Scheme of the Discrete Wavelet Transform
3.9  The MCTF ROI preprocessing scheme
3.10 Proposed ROI preprocessing scheme
4.1  Frame 1 of the Container Sequence with ROI selected
4.2  Tracking Results for the Container Sequence
4.3  Frame 1 of the Stefan Sequence with ROI selected
4.4  Tracking Results for the Stefan Sequence
4.5  Normalized PSNR of the ROI for the Spatial, Temporal, and Spatio-temporal processed Carphone Sequence
4.6  PSNR of the background for the Spatial, Temporal, and Spatio-temporal processed Carphone Sequence
4.7  PSNR of the ROI for the Spatial, Temporal, and Spatio-temporal processed News Sequence
4.8  PSNR of the background for the Spatial, Temporal, and Spatio-temporal processed News Sequence
4.9  PSNR of the ROI for the Spatial, Temporal, and Spatio-temporal processed Foreman Sequence
4.10 PSNR of the background for the Spatial, Temporal, and Spatio-temporal processed Foreman Sequence
4.11 Improvement Ratio of the 3 sequences
CHAPTER 1
INTRODUCTION
The main challenge in video coding is to reduce the amount of data (number of bits) needed to describe the video while preserving the video quality. ROI video coding is a clever approach that makes use of the fact that the viewer gives more importance to certain regions of the video than to others: it provides more quality (more bits) to the areas corresponding to the ROI at the expense of reduced quality in the background.
ROI video coding is typically used in specific applications. One application is transmitting video under low bit-rate conditions. Transmitting video-conference material over a mobile phone, for example, can be problematic at low bit rates: encoding these videos can introduce a high number of artifacts. This is addressed by giving more quality to the important areas of the video than to the other areas. This is what is done in any video-conferencing application, where the facial region is the most important. If the information communicated by the movement of the lips and facial expressions is lost, the artifacts are significant. Moreover, reduced quality in the facial area appears more disturbing to the viewer than reduced quality of the background. As mentioned in [1], several approaches solve this by applying more compression to the background than to the facial region; examples include the articles by Eleftheriadis et al. [2] and Chen et al. [3]. Region of Interest video coding also finds use in surveillance, where regions of interest, for example people or vehicles, are often well defined. The problem of transmitting surveillance video from several cameras in real time at low bit rates has been addressed in several publications. Tankus et al. [4] dealt with detecting camouflaged enemies in the shape of people or military vehicles, which [5] suggested could be applied to ROI video coding. Other areas of surveillance where ROI approaches have been applied include traffic surveillance [6] and security applications [7]. In [8], Region of Interest coding is applied to soccer games, where the ball and the players form the region of interest. In summary, ROI coding is used in video conferencing, surveillance, and other low bit-rate video transmissions.
In this thesis, we study Region of Interest Coding by breaking it down into two parts: ROI Tracking and Video Preprocessing. A brief overview of each part is given in Sections 1.1 and 1.2; they are discussed in detail in Chapters 2 and 3.
1.1 ROI Tracking
Region of Interest Tracking involves detecting the Region of Interest in each frame given prior information about the ROI's appearance. It is key to good ROI video coding, because correctly predicting and detecting the ROI is crucial: a falsely detected ROI leads to lower perceptual quality than the original video. Region of Interest Tracking can be done keeping a specific application in mind (e.g., face tracking for video-conferencing applications), based on models of the human visual system, or as a generic tracking problem, where the user specifies the ROI in the first frame and the tracker detects it in the consecutive frames. In this thesis, we deal with ROI tracking as a tracking problem. This kind of tracking can in turn be divided into two methods. The first and most popular is matching: the ROI model is compared with candidate blocks of the current frame, and the candidate block with the highest similarity to the target ROI model becomes the new ROI model. The second is estimation: the ROI model is estimated/predicted based on state-space and transition models. These classifications are explained in detail in Chapter 2. In this thesis, ROI Tracking was implemented as a matching problem, and Chapter 4 shows the results of the tracking. The detected ROI in each frame was observed and the performance was measured subjectively. A few papers also use objective metrics to measure tracking performance. Tracking rate is one such metric: the ratio of the successfully tracked frames to the total number of frames in the video sequence. A frame is considered mis-tracked if the detected ROI deviates from the correct ROI by more than a specific number of pixels; the threshold beyond which a mis-tracking is declared depends on the user. Hence, in this thesis we do not measure the tracking rate, but instead present the individual frames with the detected ROIs.
1.2 ROI Preprocessing
At low bit rates, video coding is performed in order to optimize the average quality under the constraint of limited bandwidth. Region of Interest coding optimizes the perceived quality of the video by allotting more quality to certain areas (the Region of Interest) than to others. As mentioned earlier, ROI coding consists of two parts. First, the ROI must be detected, which requires prior knowledge of what the user finds interesting in the sequence. Second, the video sequence is compressed/encoded to different degrees based on the bandwidth conditions. This is achieved by bit allocation, which controls the number of bits allocated to the different parts of the video sequence. Video preprocessing is a method of ROI coding that processes the video in such a way that, when bit allocation takes place in the codec, more bits are allocated to the ROI than to the background. Most ROI coding techniques make modifications in the codec; ROI preprocessing makes no codec modifications but instead modifies the video sequence before the encoding stage. ROI preprocessing can be classified as spatial or temporal preprocessing. In spatial ROI preprocessing the processing takes place within a single frame, whereas in temporal ROI preprocessing the processing takes place between frames. These two types are explained in detail in Chapter 3.
The objective metrics used to measure the performance of ROI preprocessing are Peak Signal-to-Noise Ratio (PSNR) and bit rate. PSNR is the ratio between the maximum possible power of a signal and the power of the noise that affects the signal. The PSNR of each compressed video frame with respect to the corresponding original frame is calculated, and an average over all frames is taken. PSNR is given by equation (1.1):
PSNR_dB = 10 log_10 (255^2 / MSE)    (1.1)
where MSE is the Mean Squared Error given by the following equation:
MSE = (1/N) Σ_{i=1}^{N} (x_i − y_i)^2    (1.2)
where N is the number of pixels within each frame, 255 is the maximum pixel intensity value, and x_i and y_i represent pixel i in the original and distorted video sequences, respectively. Wang [56] discusses including more HVS (human visual system) characteristics in objective measures to improve PSNR as a metric, since PSNR disregards the position of a pixel, its context, and its perceptual importance. Bit rate is another objective metric; it indicates the bandwidth needed to transmit the video file. Since network bandwidth is an expensive resource, a low bit rate is always desired. In this thesis we measure bit rate via file sizes: a smaller file size corresponds to a lower bit rate.
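As a concrete illustration (a minimal sketch in Python, not code from this thesis), the PSNR computation of equations (1.1) and (1.2) for 8-bit frames can be written as:

```python
import numpy as np

def psnr(original, distorted):
    """PSNR in dB between two 8-bit frames (equations 1.1 and 1.2)."""
    x = original.astype(np.float64)
    y = distorted.astype(np.float64)
    mse = np.mean((x - y) ** 2)              # equation (1.2)
    if mse == 0:
        return float("inf")                  # identical frames
    return 10 * np.log10(255.0 ** 2 / mse)   # equation (1.1)

def sequence_psnr(orig_frames, dist_frames):
    """Average PSNR over all frames of a sequence."""
    return float(np.mean([psnr(o, d) for o, d in zip(orig_frames, dist_frames)]))
```

The per-frame values are averaged exactly as described above; a separate average over ROI pixels only would be computed the same way on masked frames.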
1.3 Thesis Outline
This chapter gave an introduction to all topics discussed in this thesis. The rest of
the thesis is organized as follows. Chapter 2 talks about Region of Interest Tracking.
First, the different ways of doing tracking (matching and estimation) are discussed,
followed by region descriptors and covariance as region descriptors. Finally, the covariance tracking algorithm and its model update strategies are explained. Chapter 3
talks about Region of Interest Preprocessing, starting with Region of Interest Coding
and its different methods (spatial and temporal), moving on to Region of Interest Preprocessing and its different methods (spatial, temporal and spatio-temporal). Chapter
4 contains the ROI tracking and preprocessing results. Chapter 5 concludes the thesis
and provides some ideas for further work.
CHAPTER 2
REGION OF INTEREST TRACKING
ROI Tracking is the detection of a region in each video frame given prior knowledge of what the user finds interesting in the sequence. As mentioned earlier, ROI tracking can be done keeping a specific application in mind (e.g., face tracking for video-conferencing applications), based on models of the human visual system, or as a generic tracking problem, where the user specifies the ROI in the first frame and the tracker detects it in the consecutive frames. It thus involves detecting an object/region in consecutive video frames given its location in the first frame. In other words, if V is a video composed of n frames F1...Fn and B1 is the contour of the ROI in frame F1, the tracking problem consists in determining the contours B2...Bn from B1 [9]. This kind of tracking can be done using not just appearance models but also motion estimation. ROI Tracking finds application in video surveillance (crime watch, sports statistics/reporting), post-production cinema (applying special color effects to objects), video transfer (video conferencing), and other vision applications. Though a significant amount of work has been done in this area, there is still no single robust solution for detecting non-rigid, fast-moving bodies. In this thesis, we do not propose such a solution; rather, we study a method which combines spatial, structural, and other appearance attributes to detect the ROI.
2.1 Introduction
A Region of Interest is an area in a video frame characterized by several attributes such as color, structure, texture, and brightness. ROI Tracking involves detecting this area in consecutive frames given its location in the first frame. The difficulty is that the area can undergo slight appearance changes, illumination changes, occlusions, etc. It is the task of the ROI tracking scheme to detect the ROI in the presence of such disturbances. ROI Tracking can be done in two primary ways.
1. Matching the target block (ROI) with the candidate blocks and making the best
match the next target block
2. Estimation of the state (target block) given all the measurements up to that
moment, or equivalently constructing the probability density function of object
location
2.1.1 Tracking as a Matching Problem
Tracking can be modeled as a matching problem: the target block in the previous frame is matched against candidate blocks in the current frame. Similarity measures are used to quantify the match; similarities of geometric features, such as shape, location, and structure, and of non-geometric features, such as color and illumination, are measured. The candidate block with the highest similarity measure becomes the new ROI.
Tracking using color histogram matching [10] implements tracking as a matching problem. Here the color histograms of the target region and of each candidate region are obtained, and their similarity is measured using the Bhattacharyya coefficient; the candidate region with the highest Bhattacharyya coefficient becomes the new target region. The method is now explained in detail. The target is modeled by its color distribution. If x_i for i = 1, 2, ..., n are the pixel locations of the target model centered at zero, we define a function b : R² → {1...m} which associates with each pixel the index of the histogram bin corresponding to the color of the pixel. The probability of color u in the target model, denoted q_u, is derived by employing a convex and monotonically decreasing kernel which assigns a smaller weight to locations farther from the center of the target. This probability is calculated for all colors u = 1...m. Starting from an estimated location, the current frame is broken down into blocks of the same size as the target model; the color histograms of these regions are found and their similarities are measured. The equations for the above are shown below:
ρ(y) = ρ[p(y), q] = Σ_{u=1}^{m} √(p_u(y) q_u)    (2.1)

where p(y) = {p_u(y)}_{u=1...m} with Σ_{u=1}^{m} p_u(y) = 1 is the m-bin color histogram of the candidate region at location y, and q = {q_u}_{u=1...m} with Σ_{u=1}^{m} q_u = 1 is the m-bin color histogram of the target region.
The similarity of the two histograms is measured using the distance

d(y) = √(1 − ρ[p(y), q])    (2.2)

A minimum distance implies the highest similarity; the Bhattacharyya coefficient ρ[p(y), q] is therefore maximized in order to find the minimum distance. Mean Shift iterations are used for the maximization, starting at the location of the ROI in the previous frame.
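The histogram-comparison step of equations (2.1) and (2.2) can be sketched as follows (an illustrative sketch, not the implementation of [10]; for brevity it omits the kernel weighting and the Mean Shift iterations):

```python
import numpy as np

def color_histogram(region, bins=16):
    """m-bin normalized histogram of an 8-bit region (m = bins)."""
    hist, _ = np.histogram(region, bins=bins, range=(0, 256))
    return hist / hist.sum()

def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two normalized histograms (equation 2.1)."""
    return float(np.sum(np.sqrt(p * q)))

def histogram_distance(p, q):
    """Distance of equation (2.2); small distance means high similarity."""
    return float(np.sqrt(max(0.0, 1.0 - bhattacharyya(p, q))))
```

Identical histograms give a coefficient of 1 and distance 0; histograms with disjoint support give a coefficient of 0 and distance 1.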
Advantages

1. It is robust to rotational changes, camera position, partial occlusions, etc.

2. It performs extremely well when the ROI is highly specific in color. For example, color histogram matching works very well in face tracking, since skin color occupies a very compact area in the two-dimensional chroma space.

Disadvantages

1. It fails when the discriminating power between histograms is not high, which happens when the ROI is not strongly characterized by color.

We did not choose the color histogram method for our tracking implementation because the tracking is very specific to color and hence is not robust. Joint histogram matching methods are available which model many features as a histogram, but we did not choose those either, since the dimensionality becomes high as the number of features increases. The meaning of this statement will become clearer later in this chapter.
2.1.2 Tracking as an Estimation Problem
Tracking can also be treated as an estimation problem, also called stochastic detection/tracking. In this method the ROI of the video frame Y_t is modeled as Z_t, parameterized by θ_t (as shown in equation (2.3)), and a state-space model (described by state transition and observation models) is employed to accommodate the multiple frames, as shown in equations (2.4) and (2.5).
Region of Interest:

Z_t = T[Y_t, θ_{t−1}]    (2.3)

State Transition Model:

θ_t = F_t(θ_{t−1})    (2.4)

Observation Model:

Y_t = G_t(θ_t)    (2.5)
The task of the estimation problem is to find p(θ(t) | Y(1:t)) given the observation likelihood p(Y(t) | θ(t)) and the state transition probability p(θ(t) | θ(t−1)).
The above estimation problem can be implemented with the observation (appearance) model embedded in a particle filter, or using the EM (Expectation-Maximization) algorithm. These methods will not be discussed in detail, but their basic idea is presented here. A particle filter approximates the posterior distribution p(θ(t) | Y(1:t)) by a set of weighted particles S_t = {θ_t^(j), w_t^(j)}_{j=1}^{J} with Σ_{j=1}^{J} w_t^(j) = 1. The state estimate θ_t (the ROI characteristics) is given by the Minimum Mean Square Estimate (MMSE), the Maximum A Posteriori (MAP) estimate, or another estimator. Equation (2.6) shows the Minimum Mean Square Estimate of the unknown state θ. In the equation, E is the expectation operator, θ_t is the unknown state (target characteristics), Y_t is the observation (frame), w_t^(j) are the importance weights, and J is the number of particles used to approximate the filter. The weights are calculated using the state transition probability p(θ(t) | θ(t−1)) and the observation likelihood p(Y(t) | θ(t)).

θ̂_t = E[θ_t | Y_{1:t}] ≈ Σ_{j=1}^{J} w_t^(j) θ_t^(j)    (2.6)
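The MMSE estimate of equation (2.6) reduces to a weighted average of the particles. A minimal sketch (illustrative only; a full particle filter would also include the prediction and weight-update steps driven by p(θ(t)|θ(t−1)) and p(Y(t)|θ(t))):

```python
import numpy as np

def mmse_estimate(particles, weights):
    """Weighted-particle approximation of E[theta_t | Y_1:t] (equation 2.6).

    particles: (J, d) array, one d-dimensional state hypothesis per row
    weights:   (J,) array of importance weights, assumed to sum to 1
    """
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()  # renormalize defensively
    return w @ np.asarray(particles, dtype=np.float64)
```

For an ROI tracker, each particle would hold, e.g., the ROI center and scale, and the weights would come from the appearance-model likelihood.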
The EM algorithm finds the state estimate θ using the Maximum Likelihood Estimate (MLE). It alternates an expectation (E) step, which computes the expectation of the log-likelihood with respect to the current estimate of the distribution of the latent (unknown) variables, and a maximization (M) step, which computes the parameters that maximize the expected log-likelihood found in the E step. These parameters are then used to determine the distribution of the latent variables in the next E step. One of the main advantages of performing tracking as an estimation problem is that it does not just find the locally best match but a global one. This comes at the expense of being computationally intensive, because such algorithms are stochastic in nature and require the analysis of random phenomena. We did not use an estimation approach in our tracking scheme, as the complexity of such methods is high.
To perform any kind of Tracking (Matching or Estimation), the first task lies in
modeling the ROI. These models are called Region Descriptors and are studied in
detail in the next section.
2.2 Region Descriptors
The ROI needs to be described/modeled well to perform good ROI tracking. The model based on the different features of the region is called a region descriptor. Features such as color, edge information (gradient), and intensity, which are specific to each region, form the model parameters. Hence feature selection is one of the most important steps in tracking and classification problems. Good features should be discriminative and easy to compute. Features such as color, gradients, and filter responses are the simplest choices of image features and have been used for many years in computer vision. Histograms are a natural extension of such features; they are non-parametric estimators, effective in non-rigid object tracking.
A major concern is the lack of a competent similarity criterion that captures both spatial and statistical properties; most approaches depend only on color distributions or on structural models. Joint feature histograms can be used, but they do not scale to higher dimensions: their dimensionality is b^d, where b is the number of histogram bins and d is the number of features. Even plain appearance models, where the region is represented by raw pixel values, do not scale well, as they need n × d dimensions, where n is the number of pixels. This is where covariance matrices are superior [11]. They provide an excellent way of fusing multiple features and scale well to higher dimensions, since the size of the descriptor (the covariance matrix) depends only on the number of features d: it has only (d² + d)/2 distinct values. We will now study the covariance matrix as a region descriptor in detail and also study distance calculation on these matrices.
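The dimensionality argument is easy to check numerically (illustrative arithmetic with assumed example values b = 16, d = 5, and a 32 × 32 region):

```python
# Descriptor sizes for b histogram bins, d features, and an n-pixel region
b, d, n = 16, 5, 32 * 32

joint_histogram_size = b ** d        # grows exponentially with d
raw_appearance_size = n * d          # grows with the region size
covariance_size = (d ** 2 + d) // 2  # independent of the region size

print(joint_histogram_size, raw_appearance_size, covariance_size)
```

For these values the joint histogram needs 1,048,576 cells and the raw appearance model 5,120 values, while the covariance descriptor needs only 15.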
2.3 Covariance as a Region Descriptor
Let I be a one-dimensional intensity/gray-scale image or a three-dimensional color image, and let F be the feature image obtained from I. The dimensionality of F is W × H × d, where W and H give the size of the ROI and d is the number of features used to describe it. F is given by the following equation:

F(x, y) = φ(I, x, y)    (2.7)

where φ can be any mapping such as intensity, color, gradients, filter responses, etc. An example feature mapping is shown below. Note that the feature vector can be constructed using two types of attributes: spatial attributes and structural attributes.

f_k = [x, y, I(x, y), I_x(x, y), ...]    (2.8)

where k is the pixel index, x and y are the coordinates of the pixel, I(x, y) is the intensity at pixel k, and I_x(x, y) is the intensity gradient along the x direction. After the feature image F for the region R is obtained, its d × d covariance matrix is constructed as follows:
C_R = (1/(n − 1)) Σ_{k=1}^{n} (z_k − μ)(z_k − μ)^T    (2.9)
where z_k for k = 1...n are the d-dimensional feature points inside the ROI R and μ is their mean.
It can be seen from the above equations that the covariance matrix is an efficient way to fuse multiple features: the diagonal entries represent the variance of each feature and the off-diagonal entries represent their correlations. As the covariance computation involves averaging over the region, noise corrupting individual samples is largely filtered out.
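The construction in equations (2.7)-(2.9) can be sketched as follows (a sketch using an assumed feature set [x, y, I, |I_x|, |I_y|]; the thesis does not prescribe this exact set):

```python
import numpy as np

def covariance_descriptor(region):
    """d x d covariance descriptor of a grayscale region (equation 2.9).

    Features per pixel: f_k = [x, y, I(x, y), |I_x|, |I_y|]  (cf. equation 2.8)
    """
    I = region.astype(np.float64)
    H, W = I.shape
    ys, xs = np.mgrid[0:H, 0:W]        # spatial attributes
    Iy, Ix = np.gradient(I)            # structural attributes (gradients)
    # Stack the d = 5 feature planes into an (n, d) matrix, n = W * H
    F = np.stack([xs, ys, I, np.abs(Ix), np.abs(Iy)], axis=-1).reshape(-1, 5)
    mu = F.mean(axis=0)
    Z = F - mu
    return Z.T @ Z / (len(F) - 1)      # sum of (z_k - mu)(z_k - mu)^T / (n - 1)
```

The result is a 5 × 5 symmetric matrix whatever the region size, illustrating the (d² + d)/2 storage argument above.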
There are several advantages of using covariance matrices as region descriptors.
Some of them are:
1. It is the most natural and simple way to combine structural and statistical
features together
2. They are low dimensional compared to other descriptors. It has only (𝑑2 + 𝑑)/2
different values since the size of the covariance matrix does not depend on the
size of the region but only depends on the number of features used
3. Scale and rotation invariance can be achieved easily depending on the way the
features are defined
4. Covariance is invariant to mean changes such as an identical shift of the color values, since its computation subtracts the mean from the actual values; an identical shift of the mean therefore does not matter, as the individual values shift with it. As a result, covariance descriptors can track sequences under varying illumination conditions.
In the next section, we study how to measure the similarity of such matrices.
Similarity of these matrices need to be computed in order to perform ROI tracking
as matching.
2.4 Distance Calculation on Covariance Matrices
The space of covariance matrices is not closed under multiplication with negative scalars; hence, a simple subtraction of covariance matrices does not yield a valid distance. Instead, a metric involving the generalized eigenvalues of the matrices is used [12]. It is shown to be the distance arising from a canonical invariant Riemannian metric on the space of real symmetric positive definite matrices and is given by the following equation:
ρ(C1, C2) = √( Σ_{i=1}^{d} ln² λ_i(C1, C2) )    (2.10)

where λ_i(C1, C2) for i = 1...d are the generalized eigenvalues computed from λ_i C1 x_i − C2 x_i = 0 with x_i ≠ 0.
The reason that the sum of squared logarithms of the generalized eigenvalues of two covariance matrices indicates their similarity is shown below. Let x be a column vector of features of the candidate region, modeled as a Gaussian random vector, and let C and H be the covariance matrices of x (the candidate region) and of the reference ROI, respectively. Projecting x onto a direction e gives a scalar whose variance is e^T C e under the candidate model and e^T H e under the reference model. Suppose the candidate variance never exceeds the reference variance:

σ²_C(e) ≤ σ²_H(e)    (2.11)

From this we get the equation below:

e^T C e ≤ e^T H e    (2.12)
Hence,

0 ≤ λ(e) = (e^T C e) / (e^T H e) ≤ 1    (2.13)

for e ≠ 0. From this we obtain a generalized eigenvalue problem of the form

λHe − Ce = 0    (2.14)

which is solved by taking λ to be the maximum generalized eigenvalue.
But the maximum λ value alone is not completely descriptive of the similarity of the covariance matrices. The trace or determinant of the λ values could be used instead, but they are also not indicative of the similarity of the covariance matrices. Hence we use Σ_{i=1}^{d} ln² λ_i as the metric, where λ_i is the i-th generalized eigenvalue; each term is zero when λ_i = 1 (perfect agreement) and grows as λ_i moves away from 1.
It is a metric because it satisfies the metric axioms for positive definite symmetric
matrices. The axioms are shown below.
1. ρ(C1, C2) ≥ 0, and ρ(C1, C2) = 0 if and only if C1 = C2
2. ρ(C1, C2) = ρ(C2, C1)
3. ρ(C1, C2) + ρ(C1, C3) ≥ ρ(C2, C3)
The generalized eigenvalues can be computed with O(d³) arithmetic operations using numerical methods, and an additional d logarithm operations are required for the distance computation, where d is the number of features used in the region descriptor. This is usually faster than comparing histograms, whose size grows exponentially with d.
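As a concrete illustration, the distance of equation (2.10) can be sketched in a few lines of Python. The function name is ours; for positive definite matrices, the generalized eigenvalues of the pair (C1, C2) are the ordinary eigenvalues of C2⁻¹C1:

```python
import numpy as np

def covariance_distance(C1, C2):
    """Distance of eq. (2.10): rho(C1, C2) = sqrt(sum_i ln^2 lambda_i),
    where lambda_i are the generalized eigenvalues of the pair (C1, C2)."""
    # For SPD matrices, the generalized eigenvalues of C1 x = lambda C2 x
    # are the (real, positive) eigenvalues of inv(C2) @ C1.
    lam = np.linalg.eigvals(np.linalg.solve(C2, C1)).real
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))
```

Identical matrices give a distance of zero (all λᵢ = 1), and the metric is symmetric because swapping C1 and C2 inverts each λᵢ, which only flips the sign of ln λᵢ.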
2.5
The Covariance Tracking Algorithm
We now study the whole tracking process.
1. Start at the estimated location (x0,y0) (location of the ROI in the previous
frame) in the current frame and obtain the covariance matrix of the region
centered at (x0,y0) and of the same size as the reference ROI.
2. Repeat step 1 for all locations in the neighborhood of the pixel (x0, y0). The neighborhood is the region around the pixel, whose size is decided by the user.
3. Calculate the distance of each covariance matrix obtained in step 2 with respect
to the reference covariance matrix, which is the covariance matrix of the ROI
of the previous frame.
4. Find the minimum distance and assign the region having the minimum distance as the new ROI / target.
5. Repeat steps 1 to 4 for all frames in the video sequence.
This algorithm assumes that the ROI characteristics (location and other information) in the first frame are provided at the start.
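The steps above can be sketched as follows. This is a minimal Python illustration under simplifying assumptions: an exhaustive search over a square neighborhood, a feature image with one feature vector per pixel, an ROI that stays away from the frame borders, and helper names (`region_covariance`, `covariance_distance`, `track_roi`) that are ours:

```python
import numpy as np

def region_covariance(feat, x, y, w, h):
    """Covariance descriptor of the w x h window with top-left corner (x, y).
    feat is an H x W x d feature image (d features per pixel)."""
    patch = feat[y:y + h, x:x + w].reshape(-1, feat.shape[2])
    # a small ridge keeps the matrix positive definite
    return np.cov(patch, rowvar=False) + 1e-6 * np.eye(feat.shape[2])

def covariance_distance(C1, C2):
    """Metric of eq. (2.10) via the eigenvalues of inv(C2) @ C1."""
    lam = np.linalg.eigvals(np.linalg.solve(C2, C1)).real
    return float(np.sqrt(np.sum(np.log(np.maximum(lam, 1e-12)) ** 2)))

def track_roi(frames, roi0, radius=3):
    """Steps 1-5: frames is a list of H x W x d feature images,
    roi0 = (x, y, w, h) is the ROI given in the first frame."""
    x, y, w, h = roi0
    ref = region_covariance(frames[0], x, y, w, h)        # reference descriptor
    track = [(x, y)]
    for frame in frames[1:]:
        best, best_d = (x, y), np.inf
        for dy in range(-radius, radius + 1):             # step 2: neighborhood
            for dx in range(-radius, radius + 1):
                C = region_covariance(frame, x + dx, y + dy, w, h)
                d = covariance_distance(ref, C)           # step 3: distance
                if d < best_d:                            # step 4: minimum
                    best, best_d = (x + dx, y + dy), d
        x, y = best
        track.append(best)
        ref = region_covariance(frame, x, y, w, h)        # previous ROI as new reference
    return track
```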
2.6
Model Update Strategies
The above algorithm was used as the implementation of the tracker. We tested it on sequences in which the ROI moved in both rigid and non-rigid fashion. The tracking was successful on these sequences when we kept the previous frame's ROI as the desired ROI in the current frame. But there are cases where the ROI changes considerably over the sequence, and it becomes necessary to adapt to these variations by performing a model update of the ROI. Before we get into how the model update is done, let us look at the different ways to fix the ROI used as the reference for the next frame.
1. Use a fixed ROI / template as the reference
2. Make the previous ROI as the new reference
3. Update the ROI with every frame
Using a fixed template (the first method) cannot follow changes in the appearance of the ROI. If we implement the second method, making the last match the current ROI, large errors can also result, as a rapidly changing ROI is susceptible to drift. Thus a compromise between these 2 cases is needed. This is what is called model updating, where the reference ROI is constructed so that it contains not only the ROI information of the immediately previous frame but that of the other previous frames as well.
There are again 2 ways of doing the model update which are as follows:
1. Give equal priority to all frames
2. Weight the frames so that priority is given to the most recent frames over the
first few frames
Most model update techniques weight the frames exponentially, giving more priority to the recent frames over the older ones [13]. The only drawback of model updating
is the increased cost of computation, but this is compensated by the robustness of
the reference ROI model built.
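As an illustration, a recursive exponentially weighted update can be sketched as below. This is a deliberate simplification: a Euclidean blend of the covariance matrices, whereas [13] computes a mean that respects the geometry of the space of positive definite matrices; `alpha` is a hypothetical smoothing parameter:

```python
import numpy as np

def update_model(model_cov, new_cov, alpha=0.4):
    """Blend the running reference covariance with the newest frame's ROI
    covariance. Applied recursively, the frame t steps in the past receives
    weight alpha * (1 - alpha)**t, i.e. exponentially decaying priority."""
    return (1.0 - alpha) * model_cov + alpha * new_cov
```

With alpha = 0 the reference is frozen (method 1 above); with alpha = 1 only the previous frame is kept (method 2); intermediate values realize the compromise described in this section.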
2.7
Chapter Summary
In this chapter, we looked at tracking posed both as a matching and as an estimation problem, with an example of each. We then studied how the ROI is modeled, under 'region descriptors'. The covariance matrix as a region descriptor and the metric used to find the similarity between 2 covariance matrices were studied next, and finally the whole tracking algorithm and some additions to it (model update) were covered.
CHAPTER 3
ROI CODING & PREPROCESSING
The main idea behind ROI coding is to increase the quality within the ROI by decreasing the quality of the background. This can be achieved by reallocating bits from the background to the foreground: more bits for the ROI implies better visual quality. There are several ways of doing ROI coding. Preprocessing is one of them, where the required processing takes place outside the codec. The most important advantage of this method is that changes can be made easily, but it comes with a loss of adaptability of the background quality to variable bit rates of the channel. We will study different preprocessing techniques in detail in this chapter, but we first look at ROI coding in general and its classification.
3.1
Introduction
As mentioned above, ROI coding is nothing but reallocating bits from the background to the foreground. ROI coding can be achieved either by a preprocessing stage before the encoding or by controlling parameters within the encoder; this forms one basis of classification. Methods which make modifications to the codec allow alterations of the encoder as long as the required features are included and the syntax of the bit stream is unaltered. Most ROI coding techniques use this approach.

Apart from classifying ROI coding techniques by where the processing is done relative to the encoder, they can also be classified by how the processing is done, which is either spatial or temporal. ROI coding can thus be classified as:
1. Spatial ROI coding
2. Temporal ROI coding
They are explained in detail in the following sections.
3.2
Spatial ROI coding
As mentioned in [1], most of the research on bit allocation for ROI video coding applies spatial methods, and most of these control parameters within the codec. The most common way of performing spatial ROI coding is to change the quantization step size. A higher / coarser step size for a particular region implies that fewer bits are allocated to that region. Hence, a higher quantization step size is chosen for the background and a smaller one for the foreground. One example uses two step sizes for each frame, with the smaller step size for the ROI and the larger one for the background [14] [15]. The problem with this method is that the difference between background and ROI quality appears abrupt if there is a large difference in quantization. This is referred to as blockiness. Solutions include adapting the quantization step sizes of the background based on the distance from the ROI border [16], [17], applying three quantization levels [18], or deciding the step size based on a sensitivity function defined by the user [18]. Other approaches involving changing other encoder parameters (like controlling the number of non-zero DCT coefficients) have been addressed in [19] and [20]. Controlling quantization parameters allows direct integration with the rate-distortion function within the codec,
but can introduce blockiness due to coarse quantization. Some preprocessing techniques, whose resulting error is generally less disturbing than that of conventional methods, perform spatial ROI coding by low-pass filtering the background. Low-pass filtering (blurring) of non-ROI regions is dealt with in [3] and [21]. Low-pass filtering results in a smaller number of non-zero DCT components and thus reduces information. There is also a reduction in prediction error due to the absence of high frequencies. The rate-distortion optimization of the encoder then re-allocates bits to the ROI, which still contains high frequencies. Low-pass filtering of the background regions using a single filter for the whole frame is discussed in [3] by Chen et al. This leads to a distinct boundary separating the ROI and background, resulting in a decrease of the perceptual quality of the video. A gradual transition of quality from the ROI to the background, which overcomes this problem, has been discussed in [21] by Itti.
We have looked at an overview of spatial ROI encoding techniques. We now study a specific technique, called FMO, which is a tool present in the H.264 video codec, to perform ROI video coding. This method was not implemented in the thesis, but is explained to give a better picture of how ROI coding is performed by changing codec parameters.
3.2.1
Flexible Macroblock Ordering
One of the new characteristics of the H.264/AVC standard is an error resilience tool called Flexible Macroblock Ordering (FMO) [22]. FMO divides an image into regions called slices (a maximum of 8). These slices are collections of macroblocks that are assigned to a slice using an MBAmap (MacroBlock Allocation map). The MBAmap contains an identification number that specifies to which slice each macroblock belongs. How this identification number is decided, or in other words how each macroblock is assigned to a slice, is discussed later in this section, under 'Types of FMO'. The macroblocks in each slice are processed in scan order (from left to right and top to bottom). Each slice is transmitted independently in separate units called packets. Each packet contains its own header, so each slice (which is transmitted as a packet) can be decoded independently. This helps us to correct errors easily by exploiting the spatial redundancy of the images. For example, the macroblocks could be grouped in such a way that no macroblock and its neighbor belong to the same group. This way, if a slice is lost during transmission, it is easy to reconstruct the lost blocks by interpolating information from the neighboring blocks. The use of FMO together with advanced error resilience tools can maintain visual quality even with a packet loss rate of 10%.
When using FMO, the image can be divided into different scan patterns of the macroblocks. FMO consists of 7 different types, from Type 0 to Type 6. Type 6 is the most random, giving the user full flexibility. All the others follow a certain pattern. These patterns can be exploited when storing and transmitting the MBAmap:
1. Type 0: uses runlengths which are repeated to fill the frame. Therefore only
those runlengths have to be known to rebuild the image on the decoder side.
2. Type 1: also known as scattered slices; it uses a mathematical function, which
is known in both the encoder and the decoder, to spread the macroblocks. The
distribution in the figure, in which the macroblocks are spread forming a chess
board, is very common.
Figure 3.1: FMO Types
3. Type 2: is used to mark rectangular areas, so-called regions of interest. In this case the top-left and bottom-right coordinates of the rectangles are saved in the MBAmap.
4. Types 3-5: are dynamic types that let the slice groups grow and shrink over the different pictures in a cyclic way. Only the growth rate, the direction and the position in the cycle have to be known.
The different types are illustrated in Figure 3.1.
Each FMO type has its own application. FMO Type 1 can be useful in videoconferences to maintain privacy; the chessboard pattern could be used for this. Each slice group is sent in a different packet, so if a hacker wants to decode the videoconference, he or she has to know exactly in which two packets the information is being sent. It can also be used in network environments with a high packet loss rate. It is FMO Type 2 which is used for ROI encoding. It performs ROI encoding by encoding the background slice with fewer bits than the foreground / ROI slice. The number of bits in the ROI is increased by decreasing the quantization step size of the slice containing the ROI information.
\[
B_i = A\left(K \frac{\sigma_i^2}{Q_i^2} + C\right) \tag{3.1}
\]
where $B_i$ is the total number of bits allotted to slice $i$; $A$ is the number of pixels in the macroblock; $K$ is a value adjusted based on the statistics of the specific frame being encoded; $\sigma_i^2$ is the variance of the motion-compensated residual signal of the $i$th macroblock; $Q_i$ is the quantization step size of the $i$th slice; and $C$ is the average rate needed to encode the motion vectors and the bit stream header. Therefore, to increase the number of bits $B_i$ in a slice, the quantization step size $Q_i$ is decreased. The generalized block diagram of the technique is shown in Figure 3.2.
Figure 3.2: Block Diagram of the FMO Technique in the H.264 codec for ROI coding
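Equation (3.1) is easy to evaluate numerically. The sketch below (with illustrative parameter values, not taken from the thesis) shows that halving the quantization step size of a slice increases its bit budget:

```python
def slice_bits(A, K, sigma2, Q, C):
    """Bit allocation of equation (3.1): B = A * (K * sigma^2 / Q^2 + C)."""
    return A * (K * sigma2 / Q ** 2 + C)
```

For example, with A = 256 pixels, K = 1, a residual variance of 100 and C = 0.5, reducing Q from 4 to 2 raises the slice budget from 1728 to 6528 bits.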
A brief summary of the FMO ROI technique:
1. It is a spatial ROI technique
2. FMO Type 2 is most commonly used for ROI encoding, where each frame is divided into 2 rectangular slices - one being the foreground and the other the background
3. It is codec dependent since parameters inside the codec are changed
We now proceed to temporal ROI coding and study one of its techniques in detail.
3.3
Temporal ROI coding
The main idea in temporal ROI coding is to reduce the frame rate of the background compared to the ROI. A lower frame rate requires fewer bits because the motion vectors become smaller and thus the number of bits needed to represent them also reduces. One way of reducing the frame rate of the background is to extract the ROI from the video and transmit the ROI and background as 2 different sequences, transmitting the background at a lower frame rate. This is supported by the MPEG-4 standard. Schemes which extract and encode objects and background in separate layers, transmitted at 2 different frame rates (a lower frame rate for the background), have been discussed in papers [23] and [24].
There are some preprocessing methods which perform temporal coding. Temporally averaging the background with the help of a filter reduces the variance of the background and hence reduces the number of bits required to represent it. These methods do not affect the syntax of the encoded bit stream, and therefore compatibility with all video standards remains. There are also temporal methods that do change the syntax of the bit stream. In [25], the macroblocks not belonging to the ROI are skipped in the inter-frames. In [18], all macroblocks of the background are skipped if the motion between the frames is small (does not cross a global threshold). [26] also discusses a similar approach. After this brief overview of the different temporal coding techniques, we now proceed to a specific temporal technique.
3.3.1
Decreasing the frame rate of the background
The block diagram of this technique is shown in Figure 3.3. The main idea is that the frame rate of the background is decreased by a factor of 2 while that of the ROI is retained, so the number of bits needed to encode the background is much smaller than that of the ROI. The frame rate of the background is halved by replacing the background of every even frame with the background of the previous odd frame (as can be seen in the figure). As a result, the motion vectors of the background, which encode the number of pixels a macroblock has moved, decrease in size. The prediction error of the background, which is the error between a macroblock and its best match in the previous frame, also decreases. Therefore the number of bits needed to represent the background decreases.
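The replacement step can be sketched as below (our own minimal illustration; `masks[k]` is assumed to be a boolean ROI mask for frame k, True inside the ROI):

```python
import numpy as np

def halve_background_rate(frames, masks):
    """Replace the background of every even frame (frames 2, 4, ... in
    1-based numbering) with the background of the previous odd frame,
    keeping each frame's own ROI pixels untouched."""
    out = [frames[0].copy()]
    for k in range(1, len(frames)):
        f = frames[k].copy()
        if k % 2 == 1:  # 0-based index 1, 3, ... = 1-based frames 2, 4, ...
            f[~masks[k]] = out[k - 1][~masks[k]]
        out.append(f)
    return out
```

After this step, background pixels of the even frames match the previous frame exactly, so their motion vectors and prediction errors collapse to (near) zero.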
Figure 3.3: Schematic Representation of decreasing frame rate of the background
Summary of the temporal technique:
1. It is a temporal ROI processing technique.
2. It is a preprocessing technique, as all the ROI coding happens outside the codec.
3. It only works in cases where the background does not change much between consecutive frames.
We have looked at the 2 different ways ROI encoding is done: spatial and temporal techniques. We now look at Region of Interest coding done in the preprocessing stages.
3.4
ROI Preprocessing Techniques
Quantization bits are reallocated from background areas in the images to the ROI to ensure quality in the ROI. Previous works have mainly concentrated on controlling the quantization step sizes of the blocks in a frame depending on whether a block is classified as ROI or background [3-4]. The blocks belonging to the ROI get a smaller step size and those in the background get a bigger step size. But these require changes within the codec. Preprocessing methods achieve independence of the codec. [1] and [5] discuss low-pass filtering (blurring) of the background. This decreases the information resulting from the discrete cosine transform or discrete wavelet transform of the background, as the high-pass coefficients are removed. The low-passed image is passed on to a codec, which allocates small quantization step sizes to the ROI, i.e. more coding resources to the ROI, and large step sizes to the background. In [1], a low-pass filter is applied to the background using only one filter. This leads to boundary effects and therefore a decrease in the perceptual quality. A Gaussian pyramid is used to overcome this problem, as it creates a smoother transition from the ROI to the background, assuming the centers of the regions to be placed at the precise locations where the human is predicted to gaze. Reducing the frame rate of the background by a factor of 2, as explained in Section 3.3.1, is another ROI preprocessing technique.

Like the ROI coding schemes, ROI preprocessing can also be classified as spatial and temporal. We discussed some of these techniques above; we study them in detail in the next sections.
3.4.1
Spatial ROI Preprocessing Techniques
Spatial ROI preprocessing makes modifications within each frame, before the encoding stage, to give more quality to the ROI than to the background. The most common way of doing spatial preprocessing is low-pass filtering (blurring) of the background. Though there is a considerable degradation in the quality of the background, the compression ratio achieved is very high. One variation of these low-pass filters is to use a quality map, which provides a more gradual transition from the ROI to the background. Figure 3.4 shows the original frame, the binary Region of Interest map showing the ROI in white, and the quality map. The quality map is essentially an importance map which gives more priority to pixels close to the center of the ROI and gradually reduces the importance towards the background. The quality map is assigned values ranging from 0 to 1, with 1 being the most important and 0 the least. A similar spatial low-pass filtering preprocessing technique is discussed in [27], which uses various low-pass Gaussian filters and helps improve the PSNR of the ROI by removing the border effects, as there is an even smoother transition (compared to only one filter) from the ROI to the background.
Figure 3.4: Quality Map; the leftmost figure shows the original sequence, the center one shows the ROI masked in white and the background in black; the rightmost one shows the quality map indicating the importance given - white being most important and black the least
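The quality-map idea can be sketched as a per-pixel blend between the original frame and a blurred copy. This is our own illustration; the kernel size and sigma are arbitrary choices, not values taken from [27]:

```python
import numpy as np

def gaussian_kernel(size=7, sigma=2.0):
    """Normalized 2-D Gaussian low-pass kernel."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(img, kernel):
    """Brute-force 2-D convolution with edge padding (clarity over speed)."""
    pad = kernel.shape[0] // 2
    p = np.pad(img, pad, mode="edge")
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(p[i:i + kernel.shape[0], j:j + kernel.shape[1]] * kernel)
    return out

def roi_preprocess(frame, quality_map, sigma=2.0):
    """Blend the original frame with a blurred copy according to the quality
    map: 1 keeps full quality (ROI center), 0 gives the fully blurred
    background, and intermediate values yield a gradual transition."""
    blurred = blur(frame, gaussian_kernel(sigma=sigma))
    return quality_map * frame + (1.0 - quality_map) * blurred
```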
Duplication of pixels in the ROI is another spatial preprocessing technique. It adds redundancy to the ROI so that it is more resilient to losses in the network. It is used in conditions where high bit-rate transmission is supported, since adding redundant bits to the bit stream increases the required bit-rate. The block diagram of this duplication technique is shown in Figure 3.5. The preprocessing step T is a non-linear transform that duplicates each row of a macroblock in the ROI just below the original row. A representation of this non-linear transform / duplication is shown in Figure 3.6. The post-processing block T′, which occurs after decoding a frame, consists of an error recovery step and the inverse transform step. When either the coefficients or the motion vectors for a particular macroblock are lost (a bad macroblock), and its corresponding macroblock pair was received successfully (a good macroblock), the error recovery step simply uses the reconstructed data of the good macroblock.
Figure 3.5: Block Diagram of a spatial ROI preprocessing technique - Duplication of
ROI macroblocks
When both macroblocks in the macroblock pair are bad, that is, both suffer coefficient or motion vector loss, the error concealment block takes over. In the concealment step, when coefficient loss occurs, the reconstructed macroblock pointed to by the motion vector is used. When the motion vector is lost, median prediction among the motion vectors of the neighboring macroblocks is used to perform the concealment.
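A minimal sketch of the transform T and the error-recovery part of T′ (our own illustration; real macroblocks are 16x16, and the received/lost flags would come from the transport layer):

```python
import numpy as np

def duplicate_rows(mb):
    """Transform T: duplicate each row of an ROI macroblock just below the
    original row, doubling its height."""
    return np.repeat(mb, 2, axis=0)

def recover_pair(mb_a, mb_b, a_ok, b_ok):
    """Error-recovery step of T': if one macroblock of the pair was received,
    reuse its reconstructed data for the lost one."""
    if a_ok:
        return mb_a
    if b_ok:
        return mb_b
    raise ValueError("both macroblocks lost; fall back to error concealment")
```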
In brief, there are 2 main ways to perform Spatial ROI preprocessing and they
are:
1. Low Pass Filtering / Blurring the background
2. Duplicating macroblocks in the ROI
We implemented low-pass filtering as the spatial preprocessing technique. We chose it over the duplication of ROI macroblocks because of its simplicity of implementation and the fact that it works under the same bandwidth conditions, unlike the duplication technique, which requires additional bandwidth.

Figure 3.6: Schematic Representation of the ROI macroblock Duplication

We have looked at spatial ROI preprocessing techniques so far; we now look at temporal ROI preprocessing techniques.
3.4.2
Temporal ROI Preprocessing Techniques
Spatial preprocessing techniques work within a frame, whereas temporal preprocessing techniques work on a set of frames. There are 2 primary ways of doing temporal ROI preprocessing. One is to reduce the frame rate of the background, as explained in Section 3.3.1. The other is to temporally average the background, that is, to apply a temporal averaging filter to the background. On temporally averaging the background, the variance of the background decreases and hence the number of bits needed to represent it, according to equation (3.1), also decreases. The temporal filter can be implemented as a normal averaging filter or as a wavelet filter. The motion-compensated 5/3 lifted temporal wavelet filter is one type of wavelet filter which is considered an effective approach for building scalable video codecs in the literature. The basic idea of this approach is that a motion-compensated temporal filter is used to decompose a set of video frames into multi-level wavelet subbands (low-pass and high-pass). The high-pass subbands are adaptively weighted and the inverse transformation is performed to reconstruct the video sequence.

Motion compensation (MC) is used to remove temporal redundancy, and it plays an important role in achieving high compression of the video. As mentioned in [28], temporal redundancy comes from temporal correlations: the same objects exist between frames when the sampling period is small enough that no significant deformation occurs. Motion compensation removes this temporal redundancy by describing a picture in terms of the transformation of a reference picture into the current picture. Motion-compensated temporal filtering (MCTF) is a multiple-frame-reference motion compensation technique which can be thought of as applying the wavelet transform in the temporal domain. An implementation of the MCTF with 5/3 filters for a set of 4 frames is shown in Figure 3.7.

Figure 3.7: Schematic Representation of the Motion Compensated Temporal Filter
The high-pass and low-pass subbands are obtained after the prediction and update steps respectively, as shown in the equations below.

High-pass subbands:
\[
h_k[m, n] = x_{2k+1}[m, n] - \frac{1}{2}\left(x_{2k}[m, n] + x_{2k+2}[m, n]\right) \tag{3.2}
\]

Low-pass subbands:
\[
l_k[m, n] = x_{2k}[m, n] + \frac{1}{4}\left(h_{k-1}[m, n] + h_k[m, n]\right) \tag{3.3}
\]

where $x_k[m, n]$ are the video frames, $h_k[m, n]$ are the high-pass subbands and $l_k[m, n]$ are the low-pass subbands. Another schematic representation of the MCTF is shown in Figure 3.8.

Figure 3.8: Lifting Scheme of Discrete Wavelet Transform
37
To invert the transform, the same lifting equations / steps are applied in reverse order with the signs changed. The inverse transform equations are shown below.
\[
x_{2k}[m, n] = l_k[m, n] - \frac{1}{4}\left(h_{k-1}[m, n] + h_k[m, n]\right) \tag{3.4}
\]
\[
x_{2k+1}[m, n] = h_k[m, n] + \frac{1}{2}\left(x_{2k}[m, n] + x_{2k+2}[m, n]\right) \tag{3.5}
\]
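The four lifting equations can be checked to reconstruct the input exactly. The sketch below drops the motion compensation (the thesis applies the lifting steps along motion trajectories) and uses a simple symmetric boundary rule at the ends of the group of frames, which is our own choice:

```python
import numpy as np

def mctf_53_forward(frames):
    """One level of the 5/3 lifted temporal transform, equations (3.2)-(3.3),
    without motion compensation. Assumes an even number of frames."""
    x = [np.asarray(f, dtype=float) for f in frames]
    n = len(x)
    h, l = [], []
    for k in range(n // 2):  # prediction step: high-pass subbands, eq. (3.2)
        right = x[2 * k + 2] if 2 * k + 2 < n else x[2 * k]  # symmetric boundary
        h.append(x[2 * k + 1] - 0.5 * (x[2 * k] + right))
    for k in range(n // 2):  # update step: low-pass subbands, eq. (3.3)
        left = h[k - 1] if k > 0 else h[k]                   # symmetric boundary
        l.append(x[2 * k] + 0.25 * (left + h[k]))
    return l, h

def mctf_53_inverse(l, h):
    """Invert the lifting steps in reverse order with signs changed,
    equations (3.4)-(3.5), using the same boundary rule."""
    n = 2 * len(l)
    x = [None] * n
    for k in range(len(l)):   # undo the update step, eq. (3.4)
        left = h[k - 1] if k > 0 else h[k]
        x[2 * k] = l[k] - 0.25 * (left + h[k])
    for k in range(len(h)):   # undo the prediction step, eq. (3.5)
        right = x[2 * k + 2] if 2 * k + 2 < n else x[2 * k]
        x[2 * k + 1] = h[k] + 0.5 * (x[2 * k] + right)
    return x
```

Because each lifting step is inverted exactly, the round trip is lossless, which is what makes the subsequent adaptive weighting of the high-pass subbands the only source of quality change.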
Having studied the motion-compensated wavelet transform, we now look at MCTF in the context of temporal ROI preprocessing. MCTF decomposes the video frames into low-pass and high-pass subband frames. The high-pass frames are then adaptively weighted so that details in the ROI are given more importance / weight than those in the background. The low-pass subband frames are kept intact. Video content in areas given higher weights is reproduced with more fidelity than in areas given lower weights. Since the background receives lower weights in the high-pass subband frames, its pixel variation in the temporal domain decreases and hence the data rate (number of bits) required for the background is also reduced. A block diagram of the temporal preprocessing scheme is shown in Figure 3.9. The object tracking performs the ROI detection, and the importance map generator produces the weights needed to adaptively weight the subbands. In summary, the MCTF preprocessing technique temporally filters the background, which decreases the temporal variance of the background and in turn improves the coding efficiency of the video encoder.
Figure 3.9: The MCTF ROI preprocessing scheme
In brief, there are 2 primary ways to perform temporal ROI preprocessing:
1. Low Pass Filtering the background temporally
2. Reducing the frame-rate of the background
The first method, low-pass filtering the background temporally, was the chosen implementation because of its simplicity and the fact that knowledge of the ROI coordinates is not needed at the receiver side, unlike in the second method.
3.4.3
Spatio-Temporal ROI Preprocessing Techniques
It is intuitive that combining spatial and temporal preprocessing techniques helps increase the coding efficiency of the video encoder. As mentioned in [1], the spatial methods reduce the background information transmitted in the DCT components or the motion prediction error. The temporal filters, on the other hand, mainly reduce the bits assigned to the background motion vectors. Thus combinations of spatial and temporal approaches increase the reallocation of bits from the background to the ROI, since the spatial and temporal filters reduce the background bits in different parts of the encoding. In [25], Lee et al. combine a spatial method controlling quantization step sizes with the skipping of background blocks for every second frame under limited global frame motion. Spatial methods controlling the number of DCT components are combined with a temporal method similar to [25], but adapted to ROI and background content, in [49], and combined with a temporal averaging filter in [45]. In this thesis, we have combined a spatial low-pass filter with the MCTF temporal filter and studied the efficiency of the resulting preprocessing scheme with respect to no ROI preprocessing, only spatial, and only temporal schemes. A block diagram of the spatio-temporal ROI preprocessing scheme is shown in Figure 3.10.

Figure 3.10: Proposed ROI preprocessing scheme
3.5
Chapter Summary
In this chapter, we looked at the different ways of Region of Interest coding (spatial and temporal) and then specifically looked at ROI preprocessing techniques, studying them under the same classification of spatial and temporal coding.
CHAPTER 4
EXPERIMENTS AND RESULTS
This chapter is divided into 2 parts: ROI detection results and ROI preprocessing results. In ROI detection, we study the performance of the covariance tracking method and show its effectiveness in cases where the ROI undergoes large position and structural changes. In ROI preprocessing, we study the efficiency of our spatio-temporal filter when applied to sequences of different activity levels. We study the performance in terms of PSNR and Improvement Ratio. The Improvement Ratio is indicative of the compression ratio / bit-rate: it is the ratio of the file size of the encoded video sequence with no ROI processing to the file size of the encoded video sequence with ROI processing.
4.1
ROI detection Results
As explained in Section 2.3, the covariance tracking method offers many advantages over other methods, such as color histogram matching and other appearance-based model matching methods. One main advantage is its ability to fuse multiple features while maintaining a low dimensionality. We assessed the performance of the covariance tracking method by testing it on 2 types of motion: rigid motion and non-rigid motion. We selected the feature vector as
\[
f_k = [x, y, R(x, y), G(x, y), B(x, y), I_x(x, y), I_y(x, y)]
\]
where $k$ is the pixel index, $x$ and $y$ are the position coordinates, $R(x, y)$, $G(x, y)$, $B(x, y)$ are the RGB (color) values, and $I_x(x, y)$ and $I_y(x, y)$ are the x and y intensity gradients respectively. These features were chosen because they combine both position ($x$, $y$) and appearance attributes: color ($R(x, y)$, $G(x, y)$, $B(x, y)$) and structure ($I_x(x, y)$, $I_y(x, y)$).

Figure 4.1: Frame 1 of Container Sequence with ROI selected
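Building the 7x7 covariance descriptor from this feature vector can be sketched as follows (our own minimal illustration; the gradients here are simple finite differences of the mean intensity):

```python
import numpy as np

def feature_image(rgb):
    """Build the 7-dimensional feature image f = [x, y, R, G, B, Ix, Iy]."""
    hgt, wid, _ = rgb.shape
    xs, ys = np.meshgrid(np.arange(wid, dtype=float), np.arange(hgt, dtype=float))
    intensity = rgb.mean(axis=2)
    iy, ix = np.gradient(intensity)  # np.gradient returns d/drow, d/dcol
    return np.dstack([xs, ys, rgb[..., 0], rgb[..., 1], rgb[..., 2], ix, iy])

def covariance_descriptor(feat, x, y, w, h):
    """7x7 covariance matrix of the features inside the ROI window."""
    patch = feat[y:y + h, x:x + w].reshape(-1, feat.shape[2])
    return np.cov(patch, rowvar=False)
```

Note that adding a constant to all color values leaves the descriptor unchanged, which is exactly the illumination-shift invariance discussed in Chapter 2.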
4.1.1
Example 1 - Rigid Motion
The Container sequence, a standard test sequence, consists of a steamer boat moving down the river in a straight line; it exhibits linear / rigid body motion. A small portion at the front of the boat was chosen as the ROI and was tracked / detected in the consecutive frames, as seen in Figure 4.1. Satisfactory results were obtained. Sample tracking results are shown in Figure 4.2.
Figure 4.2: Tracking Results for Container Sequence
4.1.2
Example 2 - Non Rigid Motion
The Stefan sequence is now used to see how the covariance tracking algorithm works when the ROI undergoes non-rigid motion. In this sequence, the tennis player is selected as the ROI, as seen in Figure 4.3. The tennis player exhibits non-rigid motion as he twists and turns his body. The covariance tracking algorithm does a good job of tracking him until the last frame. The tracking results are shown in Figure 4.4.
We note visually that the ROI in Example 1 is characterized by shape and that in Example 2 by color and structure. The covariance tracker does a great job of detecting the ROI in both cases and is as good as any other appearance-based matching approach or color histogram matching.
Figure 4.3: Frame 1 of Stefan Sequence with ROI selected
Figure 4.4: Tracking Results for Stefan Sequence
4.2
ROI preprocessing Results
The performance of the spatio-temporal ROI preprocessor is studied in terms of 2 metrics: PSNR and Improvement Ratio. PSNR gives a quantitative measure of the quality of the video and is observed both for the ROI and the background. The Improvement Ratio gives a measure of the bit-rate: as explained earlier, a higher Improvement Ratio implies a lower bit-rate, which is desired. The preprocessor was applied to 3 different video sequences. We first look at the Carphone video sequence, then at 2 other sequences, Foreman and News.

As we already know, there is a large degradation in the quality of the background with ROI processing, as the video is compressed. Even though the background quality degrades as the file size shrinks, the quality of the ROI remains almost the same. Hence we use normalized PSNR (PSNR / file size) as the metric for the quality of the ROI, as it shows the improvement in ROI quality with ROI processing. The results obtained after passing the sequence through the spatio-temporal filter are shown in the following figures. Figure 4.5 shows the PSNR / file size of the ROI, and Figure 4.6 shows the PSNR of the background, for the 3 kinds of processing on the Carphone sequence. The spatio-temporal filter was run on the other sequences and their PSNRs are shown in Figures 4.7, 4.8, 4.9 and 4.10. Their Improvement Ratios are plotted in Figure 4.11.
Figure 4.5: Normalized PSNR of the ROI for Spatial, Temporal, Spatio-temporal
processed Carphone Sequence
4.3
Conclusion
ROI Detection - We ran the covariance tracking algorithm first on a simple sequence in which the ROI moved in a straight line, and then on a more complicated sequence where the ROI exhibited non-rigid motion. We chose this method because of its simplicity and its ability to fuse multiple features at the same time while retaining a small dimensionality. The results obtained were as expected: the covariance tracking method did a good job of detecting the ROI.
Figure 4.6: PSNR of the background for Spatial, Temporal, Spatio-temporal processed Carphone Sequence
ROI Preprocessing - From all the plots, we conclude that spatio-temporal ROI preprocessing is the most efficient; efficiency was measured both in terms of the PSNR of the ROI and the Improvement Ratio. We also note that the preprocessing scheme shows a marked improvement over no ROI preprocessing at all.
Figure 4.7: PSNR of the ROI for Spatial, Temporal, Spatio-temporal processed News
Sequence
Figure 4.8: PSNR of the background for Spatial, Temporal, Spatio-temporal processed News Sequence
Figure 4.9: PSNR of the ROI for Spatial, Temporal, Spatio-temporal processed Foreman Sequence
Figure 4.10: PSNR of the background for Spatial, Temporal, Spatio-temporal processed Foreman Sequence
Figure 4.11: Improvement Ratio of 3 sequences
CHAPTER 5
CONCLUSION AND FUTURE WORK
Real-time video transmission over unreliable and lossy communication channels, under bandwidth and delay constraints, remains a relevant problem. It is particularly difficult over the Internet, a public network not designed to handle traffic with strict bandwidth and delay requirements. Previous work on error resilience methods, such as forward error correction and layered (scalable) video coding, reduces the effect of network losses on transmitted video streams, but these methods provide the same resilience for all parts of the video. Region of Interest video coding instead provides more resilience to the areas of the video that have higher perceptual importance; these areas are collectively called the Region of Interest. Region of Interest coding involves detecting the region of interest (Region of Interest Tracking) and coding the video (Region of Interest Coding) while giving higher priority to the region of interest. In this thesis, we surveyed several methods for Region of Interest Tracking and video preprocessing, then studied and implemented a specific method for each.
The work done in this thesis is briefly explained below.
1. Region of Interest Tracking: We studied the covariance tracking algorithm for Region of Interest Tracking because of its simplicity and its ability to easily fuse multiple features together. We showed that the covariance tracking algorithm performs well when the ROI undergoes either rigid or non-rigid body motion.
2. Video Preprocessing: There are numerous ways to do Region of Interest video coding. We chose a preprocessing approach for its simplicity and the fact that it requires no changes within the codec. In preprocessing, we explored both spatial and temporal approaches, and also combined them into a spatio-temporal one, which proved superior to the purely spatial and purely temporal methods.
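The spatio-temporal preprocessing in item 2 can be sketched as follows: background pixels are low-pass filtered spatially and averaged with the neighbouring frames in time, while ROI pixels pass through untouched, so that the encoder spends fewer bits on the background. The box blur, the 3-frame temporal window, and the function names are illustrative assumptions, not the thesis's exact filter.

```python
import numpy as np

def box_blur(frame, k=3):
    """Simple k x k box blur (stand-in for the spatial low-pass filter)."""
    pad = k // 2
    padded = np.pad(frame.astype(np.float64), pad, mode="edge")
    out = np.zeros(frame.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + frame.shape[0], dx:dx + frame.shape[1]]
    return out / (k * k)

def preprocess_frame(frames, t, roi_mask, k=3):
    """Spatio-temporal ROI preprocessing of frame t.

    Background pixels are averaged with the neighbouring frames in time and
    spatially low-pass filtered; ROI pixels are copied through unchanged.
    """
    prev_f = frames[max(t - 1, 0)]
    next_f = frames[min(t + 1, len(frames) - 1)]
    temporal = (prev_f.astype(np.float64) + frames[t] + next_f) / 3.0
    background = box_blur(temporal, k)                # filtered background
    return np.where(roi_mask, frames[t], background)  # ROI kept at full quality
```

The filtered frames are then fed to an unmodified encoder, which is the appeal of the preprocessing approach noted above.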
Although other methods of performing ROI tracking and coding were studied, they were not implemented and compared with the chosen methods. Also, the ROI tracking and coding schemes were not integrated, so the performance of the complete ROI coding system was not tested on different video sequences and under different network conditions. Many additions can be made to the existing study and implementation; they are listed below.
Work to be done in Region of Interest Tracking:
1. Perform model update
2. Build a rotation invariant model
3. Implement other tracking schemes, such as color histogram matching and particle filtering methods, and compare their performance with the covariance tracking algorithm
Work to be done in Video Preprocessing:
1. Implement other coding methods (preprocessing or non-preprocessing) and compare them with the existing spatio-temporal filter
2. Test the ROI coding scheme in different network conditions
3. Adapt the coding parameters to changing network conditions, detecting those changes using a feedback system
4. Integrate the Region of Interest Tracking scheme with the preprocessing scheme
and evaluate performance based on PSNR of the desired region of interest.
With these additions we can build a truly complete and efficient Region of Interest video coding system for low bit rate video transmission over networks.