Feedback-less Distributed Video Coding and its Application in Compressing Endoscopy Videos

Rami Cohen
Research Thesis
In Partial Fulfillment of the Requirements for the Degree of
Master of Science in Electrical Engineering
Rami Cohen
Submitted to the Senate of the
Technion–Israel Institute of Technology
Tamuz 5772
Haifa
July 2012
The Research Thesis Was Done Under The Supervision of Prof. David Malah
in the Faculty of Electrical Engineering
Acknowledgement
I would like to thank my supervisor, Prof. David Malah, for his dedicated guidance throughout all the stages of this research. I wish to thank the staff of the Signal and Image Processing
Lab (SIPL) for their help and technical support. I would also like to thank my friends and
colleagues at the Technion for fruitful discussions and helpful comments. Finally, I express
my deep gratitude to my family for their constant encouragement and support.
The Generous Financial Help of The Technion is Gratefully Acknowledged
Contents

Abstract
1 Introduction
2 Distributed Video Coding
  2.1 Background
    2.1.1 Slepian-Wolf Theorem
    2.1.2 Wyner-Ziv Theorem
  2.2 DVC Systems - Overview
  2.3 Existing DVC Systems
    2.3.1 Stanford
    2.3.2 PRISM
    2.3.3 DISCOVER
3 LORD: LOw-complexity, Rate-controlled, Distributed video coding system
  3.1 Design Requirements
  3.2 Encoder
    3.2.1 Transform
    3.2.2 Block Classification
    3.2.3 Rate-Distortion Optimization
    3.2.4 Rate Control
  3.3 Decoder
    3.3.1 Intra Frame Decoder
    3.3.2 Side Information Creation
    3.3.3 Noise Correlation Model
    3.3.4 De-quantization and Inverse DCT
4 Adaptation of LORD to Endoscopy Video Compression
  4.1 Endoscopy Videos
  4.2 Bayer Format
  4.3 Adaptation of LORD to Endoscopy Videos
5 Experimental Results
  5.1 Standard Videos
  5.2 Bayer Endoscopy Videos
6 Conclusion
  6.1 Summary
  6.2 Future Work
A CFA Demosaicing Algorithms
B MMSE Reconstruction Using Side Information
C Rate-Distortion Model
References
Hebrew Abstract
List of Figures
2.1 Typical multi-hop wireless sensor network architecture
2.2 Distributed encoding of two statistically dependent i.i.d. sources X and Y
2.3 Slepian-Wolf Theorem, admissible rate region
2.4 Wyner-Ziv coding
2.5 H.264/AVC CODEC
2.6 Limited complexity video encoders
2.7 Typical high-level diagram of a DVC system
2.8 Zig-zag scan
2.9 Motion-compensated interpolation (MCI)
2.10 Stanford CODEC
2.11 Syndrome encoding and entropy encoding in a block
2.12 PRISM CODEC
2.13 DISCOVER CODEC
3.1 LORD – Encoder
3.2 Determining INTRA_TH
3.3 Block classification: example
3.4 Bins with their probabilities (m = 3)
3.5 RC scheme, applied to Foreman 40th frame (as a key frame). The target rate is 1.5bpp.
3.6 Foreman sequence, 100 frames, 1.5bpp rate constraint, intra-only coding mode
3.7 LORD – Decoder
3.8 Motion estimation, ipel precision
3.9 Prediction error, ipel
3.10 Motion estimation, qpel precision
3.11 Prediction error, qpel
3.12 Motion field – smoothing (intermediate stage before extrapolation)
3.13 Overlapping extrapolated blocks
3.14 Motion extrapolation
3.15 Extrapolation error, qpel
3.16 First 3 frames of Football and Foreman
3.17 Quantization interval
3.18 Possible relations between the side information y and the quantization interval
3.19 Estimated noise between SI and WZ frames
3.20 Actual noise between SI and WZ frame
3.21 Distribution of the noise between the WZ frame and the side information
4.1 Typical endoscope
4.2 Images from endoscopy process, performed on a pig
4.3 Bayer CFA
4.4 Luminous efficiency curve
4.5 Profile/cross-section of a Bayer filter
4.6 Raw images acquired using Bayer CFA
4.7 Bayer and its decomposition into RGB components
4.8 Calculation process of PSNR between a Bayer image and its reconstructed version
4.9 RGB and YCbCr
5.1 Blocks classification – Foreman and Football
5.2 Blocks classification – Coastguard
5.3 Bits allocation and PSNR results – Foreman and Football
5.4 Bits allocation and PSNR results – Coastguard
5.5 Compression results – Foreman and Football
5.6 Compression results – Coastguard
5.7 Samples from endoscopy videos
5.8 Blocks classification – Endoscopy videos
5.9 Bits allocation and PSNR results – Endoscopy videos, 2bpp
5.10 Compression results – Bayer videos
A.1 Bayer CFA 2 × 2 block
A.2 Bayer CFA 3 × 3 neighbourhood
A.3 Linear interpolation
A.4 Part of Bayer CFA
A.5 Bayer frame from endoscopy video
A.6 Results of bilinear and gradient-corrected bilinear interpolation methods
Abstract
In today’s digital video coding paradigm, as standardized by MPEG and the ITU-T H.26x
recommendations, the encoder performs both spatial and temporal analysis of the video,
in order to exploit the redundancies in the signal. This results in a high-complexity encoder, while the decoder has a much lower complexity. With the recent wide deployment of low-power multimedia sensors, wireless cameras and mobile camera phones, this traditional video coding architecture is being challenged, since such devices require low-power, low-complexity encoders. Surprisingly, it has been shown that efficient compression can also be achieved by exploiting source statistics, partially or fully, at the decoder only.
This insight is based on information-theoretic bounds from the 1970s, obtained by Slepian
and Wolf for distributed lossless coding and by Wyner and Ziv for lossy coding with decoder
side information. This compression method is generally referred to as Distributed Video
Coding (DVC) or Wyner-Ziv (WZ) Coding.
Two major DVC architectures are the Stanford WZ codec, which works at the frame level,
and the Berkeley (aka PRISM) WZ codec, which works at the block level. An important
difference between these architectures is the use of a feedback channel (between the encoder
and the decoder) in the Stanford codec, even though such a channel is unacceptable in many
practical scenarios. Results obtained by these codecs show the high potential of the DVC
approach, but they are still far from the abilities of codecs such as H.264, which applies both
intra- and inter-frame prediction at the encoder.
In this research, we propose a new DVC encoder which is based on the principles introduced in PRISM. We improve the offline noise model used in the PRISM and Stanford codecs by proposing a model that changes both spatially and temporally, hence adapting itself to the varying statistics of a video. Moreover, we use a rate-distortion optimization process and employ a highly accurate rate control scheme that enables the use of our encoder in real-time applications.
Finally, we adapt our solution to videos acquired by Bayer sensors in endoscopy (a medical procedure), in which only partial color information is known for each pixel. This special
video format has not been addressed yet in the DVC framework. We show that, using our
encoder, a significant improvement in performance can be achieved over a standard intra
coding method with a similar complexity.
List of Acronyms
AVC       Advanced Video Coding
CFA       Color Filter Array
CIF       Common Intermediate Format
CODEC     COder-DECoder
CRC       Cyclic Redundancy Check
DCT       Discrete Cosine Transform
DISCOVER  DIStributed COding for Video sERvices
DISCUS    Distributed Source Coding Using Syndromes
DSC       Distributed Source Coding
DVC       Distributed Video Coding
ECC       Error Correcting Codes
GOP       Group of Pictures
HVS       Human Visual System
ITU       International Telecommunication Union
JPEG      Joint Photographic Experts Group
LDPC      Low Density Parity Check
LORD      LOw-complexity, Rate-controlled, Distributed video coding system
MC        Motion Compensation
MCI       Motion-Compensated Interpolation
ME        Motion Estimation
MJPEG     Motion JPEG
ML        Maximum Likelihood
MMSE      Minimum Mean Square Error
MPEG      Moving Picture Experts Group
MSB       Most Significant Bit
MV        Motion Vector
MX        Motion eXtrapolation
PDF       Probability Density Function
PMF       Probability Mass Function
PRISM     Power-efficient, Robust, hIgh-compression, Syndrome-based Multimedia coding
PSNR      Peak Signal to Noise Ratio
QCIF      Quarter Common Intermediate Format
RC        Rate Control
RD        Rate Distortion
RDO       Rate Distortion Optimization
RGB       Red, Green, Blue
SI        Side Information
SW        Slepian-Wolf
WZ        Wyner-Ziv
List of Notations
α         Laplace distribution parameter
∆         Quantization step
λ         Lagrange multiplier
µ         Mean
σ²        Variance
D         Distortion
H         Parity-check matrix
X         Random variable / source / Wyner-Ziv frame
X̂         Decoded value of X
R_X       Coding rate of source X
E_d       Residual energy
ρ         Fraction of zeros among quantized transform coefficients
Q(·)      Quantizer
H(·)      Entropy
H(·, ·)   Joint entropy
I(·; ·)   Mutual information
R(D)      Rate-distortion function
Chapter 1
Introduction
In the last few decades we have been witnessing a transformation in the way we communicate. Digital media have become an integral part of our lifestyle. Advances in computer and communication technologies have led to a proliferation of digital media content and its integration into everyday devices and activities. Without efficient compression algorithms, storing this media, especially video content, would not be possible, because of the huge amount of storage required. These algorithms are an integral part of most video processing, communication and display systems.
Video coding systems, as standardized by MPEG and the ITU-T H.26x recommendations
in the last two decades, are largely based on a hybrid compression scheme consisting of spatio-temporal prediction and a block-wise Discrete Cosine Transform (DCT). These components have been adopted by almost all modern video coding standards, where the most
recent standard, H.264/Advanced Video Coding (AVC) [1], offers a bitrate saving of over
50% compared to earlier standards, such as MPEG-2.
In all of the modern video standards, such as MPEG and H.26x, the encoder extensively analyses the video signal in order to enable efficient compression. The most computationally expensive operation involved in the encoding process is the motion estimation (ME)
process, which produces a prediction of the current video frame, in terms of motion vectors
(MVs) and the previously decoded frames. This component of the encoder may constitute
up to 70% of the encoder’s complexity [2]. ME is one of the most effective methods in video
compression for reducing temporal redundancy, and as such it is used in spite of its high
complexity.
In those standards, the encoder is typically one to two orders of magnitude more complex than the decoder. This is suited for downlink-oriented applications, such as video broadcasting. In such applications, a low complexity decoder is important since the video is encoded
once and then decoded by millions of users, so it is highly reasonable to keep the decoder
complexity as low as possible.
However, today we see a shift towards producing and sharing videos, especially in real-time applications, such as video conferencing over wireless/cellular networks, video surveillance and many more, which rely on an upstream model. The clients that capture the video, often mobile, have low power and limited resources, in contrast to the central server, which is usually powerful.
In the latter case, a low complexity video encoder is needed (even at the expense of a
complex decoder). The growing need for low complexity encoders has led to new profiles in
H.264, such as the Constrained Baseline Profile (CBP), in which the ME is carried out without
using sub-pixel accuracy and the use of B and P slices is limited. However, a computationally
heavy rate distortion optimization is still needed in order to get reasonable results [3].
In an effort to answer this need, a novel video coding paradigm, which is
known as Distributed Video Coding (DVC), has emerged in the last decade. This paradigm
employs principles of lossy source coding with side information at the decoder, also known as
Wyner-Ziv (WZ) coding. These principles rely on the seminal information theory theorems
by Slepian and Wolf [4] (for the lossless case) and by Wyner and Ziv [5] (for the lossy case),
which are presented in Chapter 2.
These theorems give the theoretical foundation for the possibility of designing an efficient video compression system by exploiting the source statistics at the decoder only, thus transferring the complexity from the encoder to the decoder. In the DVC framework, the information is distributed and only the decoder has access to the predicted frame. The decoder interprets the predicted frame as a "noisy version" of the original frame, and tries to remove the noise by assuming an appropriate probability model for the error and by using error-correcting codes (ECC). The assumed statistical model, which is built at the decoder using information from
the currently encoded frame and from previously decoded frames, is an important part of
any DVC system.
DVC offers several potential advantages over standard video encoders. They include a
flexible distribution of the complexity between the encoder and the decoder and an intrinsic
error robustness, since a DVC encoder involves no prediction loop, which is part of most
video encoders, such as MPEG-2 and H.264.
The first practical DVC solutions based on the mentioned theorems have been evolving
in the last decade, trying to exploit the video data correlation only at the decoder. The main
DVC systems, such as PRISM (Power-efficient, Robust, High-compression, Syndrome-based
Multimedia coding) [6, 7], Stanford [8, 9] and DISCOVER [10] are reviewed in Chapter 2.
Currently, the performance of DVC systems is far from the performance of the state-of-the-art H.264. There is ongoing research in this field, with the aim of closing the
gap between DVC and standard video coders.
In this thesis, we propose a new feedback-less DVC system, based on the PRISM framework. The major drawbacks of PRISM are addressed, and appropriate solutions are proposed
and implemented. We propose a statistical distribution model, whose parameter varies online, spatially and temporally, in accordance with the changes in the video scene. We show
how to adapt our system to channel rate constraints, using an appropriate rate control
algorithm.
Our proposed video CODEC will be presented in Chapter 3. We also extend the DVC
framework by designing a system for distributed coding of videos acquired by Bayer sensors
[11] used in endoscopy. Bayer sensors acquire only one color value for each pixel
(R or G or B). In order to obtain (an estimation of) the full color information, demosaicing
algorithms are used. This video format has not been addressed yet in the DVC framework,
and it poses a challenge that involves the proper treatment of the different colors under
overall rate or quality constraints. The special structure of a Bayer video and the needed
adaptation of our encoder to this format will be presented in Chapter 4. A performance evaluation of our CODEC is given in Chapter 5. Finally, conclusions and future directions
are given in Chapter 6.
Chapter 2
Distributed Video Coding
In this chapter we briefly present the information-theoretic results that laid the basis
for designing distributed video coding (DVC) systems. These video coding systems fall
under a more general framework called distributed source coding (DSC), and the necessary
background of this field is given below. We start with the problem of coding two correlated
sources, in the lossless and the lossy cases. Following the presentation of theoretic results,
we give an overview of the DVC paradigm. In addition, three practical DVC systems are
presented in detail.
2.1
Background
Distributed Source Coding (DSC) is a compression paradigm that relies on the coding of
two (or more) random correlated information sources that do not communicate with each
other. By modelling the correlation between multiple sources at the decoder side, DSC is
able to shift the computational complexity from the encoder side to the decoder side, and
therefore provides an appropriate framework for applications with a complexity-constrained
sender, such as sensor networks and mobile video/multimedia compression.
An example of a wireless sensor network can be seen in Figure 2.1. This network consists of spatially distributed autonomous sensors that monitor some physical quantity. Each sensor encodes its monitored data separately and sends it to a central decoder. Assuming that there exists a correlation between different sensors that can be exploited at the decoder, the encoders can take this into account and send only the data that cannot be derived at the decoder.
Figure 2.1: Typical multi-hop wireless sensor network architecture

This is demonstrated in the following scenario. Assume that two of the sensors monitor the temperature of neighbouring zones, Ta and Tb. It may be possible to find a statistical model which describes the relation between the measurements of the sensors (e.g., Tb − Ta is distributed according to some known probability distribution), such that the data of one
sensor would be sufficient for creating an estimate of the data from the other sensor. This
example demonstrates the underlying principle of DSC.
One of the main properties of distributed source coding is that the computational burden is shifted from the encoder to the joint decoder, which is the side that exploits the correlation. This property is the main reason that led to the development of video encoding
systems that are based on DSC.
When dealing with lossless encoding and decoding of a random source X, it is a well
known result from information theory [12] that the minimal rate needed for encoding this
source is its entropy, H(X) bits. An encoder that employs an appropriate coding scheme
working at this rate can theoretically ensure that the source is reconstructed without errors.
In the case of two i.i.d. dependent sources, X and Y, distributed according to a joint probability mass function (PMF) P(X, Y), the minimum rate that theoretically ensures their perfect recovery is their joint entropy, R = H(X, Y). This is true when the encoder has access to both X and Y. It should be noted that the following inequality holds in general:
H(X, Y) ≤ H(X) + H(Y)          (2.1.1)
so the exploitation of the correlation between the sources can lead to a lower encoding rate.
Thus, it is interesting to ask what is the minimal rate needed for the transmission of X
and Y , when Y is not known at the encoder, but is known at the decoder along with the
joint PMF of X and Y . It makes sense that the needed rate should be lower than the sum of
the entropies, because of the correlation between X and Y . However, there were no known
bounds on this rate until the 1970s.
In 1973, D. Slepian and J. K. Wolf proposed the information-theoretic lossless compression bound on distributed compression of two statistically dependent i.i.d. sources X and
Y [4]. This bound was extended to a more general case in which there are more than two
sources by T. M. Cover in 1975 [13]. According to these bounds, the needed rate is indeed
lower than the sum of the entropies.
Later, theoretical bounds on the lossy compression case were presented by A. D. Wyner
and J. Ziv in 1976 [5]. These bounds and later works show that there is a rate loss in the lossy
case, but it is bounded. Moreover, there is a case in which no rate loss is incurred (compared with the case in which both X and Y are present at the encoder). The results of the WZ theorem are more relevant to the field of video coding, since efficient compression of video
is achieved through a lossy compression.
In the following section, these two seminal theorems are described, with the relevant
results that will be used later in the DVC framework.
2.1.1
Slepian-Wolf Theorem
Consider the case in which there are two statistically dependent i.i.d. finite-alphabet random
sources X and Y , and that the encoding and decoding process of these sources is performed
on blocks of length n such that $(x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n$. When these sequences are jointly
encoded, the theoretic total rate needed is lower bounded by R = H (X, Y ), where H (X, Y )
is their joint entropy (a simple extension of the minimal rate needed for one source).
It is obvious that separate encoding of these sources requires theoretically no more than
R = RX + RY = H (X) + H (Y ) bits, no matter whether they are jointly decoded or not.
However, the question which remained unanswered until the theorem of Slepian and Wolf
was whether there is a lower bound on the rate needed for separate encoding of these sources,
when they are jointly decoded. In other words, is there (theoretically) any rate loss because
the sources are not jointly encoded? This question is demonstrated in Figure 2.2.
Surprisingly, Slepian and Wolf [4] have shown in 1973 that the achievable total rate R in
this case is in fact smaller than H(X) + H(Y), and is lower bounded by the joint entropy,
i.e.: R ≥ H (X, Y ), given that the rates of the individual encoders are bounded by the
conditional entropy of one source given the other, i.e.: RX ≥ H (X| Y ) and RY ≥ H (Y | X).
A system that satisfies the last three inequalities for the rate pair (RX, RY) is said to be an admissible system. The admissible rate region is the closure of the set of all admissible rate
pairs. This region is depicted in Figure 2.3.
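To make these conditions concrete, the following Python sketch (with an illustrative joint PMF of our own choosing, not taken from the thesis) computes the relevant entropies and tests whether a given rate pair lies in the admissible region:

import numpy as np

def sw_admissible(p_xy, rx, ry):
    # Check whether the rate pair (rx, ry) is Slepian-Wolf admissible
    # for the joint PMF p_xy (2-D array indexed by [x, y]).
    p_xy = np.asarray(p_xy, dtype=float)
    px = p_xy.sum(axis=1)            # marginal P(X)
    py = p_xy.sum(axis=0)            # marginal P(Y)

    def h(p):                        # entropy in bits, skipping zero cells
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    h_xy = h(p_xy.ravel())           # joint entropy H(X, Y)
    h_x_given_y = h_xy - h(py)       # H(X|Y) = H(X, Y) - H(Y)
    h_y_given_x = h_xy - h(px)       # H(Y|X) = H(X, Y) - H(X)
    return rx >= h_x_given_y and ry >= h_y_given_x and rx + ry >= h_xy

# Two correlated binary sources: X ~ Bernoulli(0.5), Y = X with prob. 0.9
p = np.array([[0.45, 0.05],
              [0.05, 0.45]])
print(sw_admissible(p, rx=0.5, ry=1.0))   # True: H(X|Y) ~ 0.47, H(X, Y) ~ 1.47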
These bounds imply that the minimum coding rate for separate encoding is the same
as for joint encoding, provided that the individual rates are higher than their respective
conditional entropies.

Figure 2.2: Distributed encoding of two statistically dependent i.i.d. sources X and Y

Figure 2.3: Achievable rate region for lossless distributed compression of two statistically dependent i.i.d. sources X and Y [14]

Another important result of the Slepian-Wolf theorem is that the source X can be encoded using RX ≥ H(X|Y) bits, regardless of the encoder's access to Y,
assuming that the decoder has access to Y , which is referred to as side information. The
design process of a coding scheme achieving the Slepian-Wolf bound is described in [12].
The essential principle behind the proof of this theorem is the idea of random bins. We
choose a large random index for each source sequence. If the set of typical source sequences
(sequences with empirical entropy that is close to the entropy of the source) is small enough,
then with high probability, different source sequences have different indices, and the source
sequence can be recovered from the index only.
The dependency between X and Y , estimated in the decoder, is modelled as a virtual
dependency channel P(Y|X). The rationale behind this idea is that the higher the correlation between the sources, the less information needs to be sent by the encoder. For
example, an error correcting code can be applied to X, where only its parity bits are sent to
the decoder, which tries to improve the quality of the side information Y using the parity
bits and the dependency model. Channel capacity-achieving codes have been shown to give
good performance corresponding to the desired corner points of the Slepian-Wolf region. The
modelling of this channel is the main part of any Slepian-Wolf coding system.
It was already understood in the 1970s that Slepian-Wolf coding is closely related to channel coding, and after about 30 years, practical DSC systems started to be implemented using different
channel codes. One of the common ways to implement the DSC approach is to employ
syndrome encoding. Such an approach was suggested by Pradhan and Ramchandran in
their DISCUS (Distributed Source Coding Using Syndromes) system [15], where they also
suggested general design methods for trellis codes under this framework.
The basic framework of syndrome-based DSC is that, for each source, its input space is
partitioned into several cosets (partitions) according to the particular channel coding method
used. This is effectively done by using the parity-check matrix H of the code. Every input of
each source gets an output (which can be referred to as a quantized value) indicating which
coset the input belongs to. The joint decoder decodes the inputs by using the received coset
indices and by exploiting the dependence between the sources. The choice and the design of
the needed channel code should take into account the correlation between the input sources.
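As a toy illustration of this coset partitioning (not the trellis construction of DISCUS itself), the following Python sketch uses the parity-check matrix of a (7,4) Hamming code: the encoder transmits only the 3-bit syndrome of x, and the decoder recovers x as the member of that coset which is closest in Hamming distance to the side information y:

import numpy as np
from itertools import product

# Parity-check matrix H of the (7,4) Hamming code
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

def syndrome(x):
    return tuple(H.dot(x) % 2)

def coset_decode(s, y):
    # Return the 7-bit word with syndrome s that is closest to y
    # (brute force over all 2^7 words; fine for this toy size).
    best, best_dist = None, 8
    for bits in product([0, 1], repeat=7):
        x = np.array(bits)
        if syndrome(x) == s and np.sum(x != y) < best_dist:
            best, best_dist = x, np.sum(x != y)
    return best

x = np.array([1, 0, 1, 1, 0, 0, 1])   # source block
y = np.array([1, 0, 1, 0, 0, 0, 1])   # side information: x with one bit flipped
s = syndrome(x)                        # only these 3 bits are transmitted
print(coset_decode(s, y))              # recovers x exactly

Since the code has minimum distance 3, any coset member other than x is at distance at least 2 from y, so the nearest-member rule recovers x whenever the virtual channel flips at most one bit per block.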
In the last decade, more sophisticated channel codes have been adopted in DSC systems, such as turbo codes [16] and LDPC codes [17]. The encoders required for these codes are usually simple and easy to implement, while the decoders have much higher computational complexity and are able to achieve good performance by utilizing source statistics. By
using these sophisticated channel codes (that can achieve near-capacity performance), the
corresponding DSC system can approach the Slepian-Wolf bound.
Figure 2.4: Wyner-Ziv coding of X using correlated side information Y available at the
decoder only
2.1.2
Wyner-Ziv Theorem
The work of Slepian and Wolf was extended in 1976 to the lossy case by Wyner and Ziv [5].
The Wyner-Ziv theorem deals with lossy compression of a source X given side information
Y which is available at the decoder, but not at the encoder, as depicted in Figure 2.4. The
reconstruction quality of the source X is measured by some distortion metric $d : \mathcal{X} \times \hat{\mathcal{X}} \to [0, \infty)$, where $\hat{X} \in \hat{\mathcal{X}}$ is the reconstructed source.
Wyner and Ziv have shown that the theoretic coding rate needed to obtain some distortion D when Y is not present at the encoder, $R^{WZ}_{X|Y}(D)$, is in general larger than the rate required by a system with side information that is available both at the encoder and the decoder, $R_{X|Y}(D)$, i.e., $R^{WZ}_{X|Y}(D) \geq R_{X|Y}(D)$ (where $R(D)$ is the rate-distortion function). The Wyner-Ziv rate-distortion function is [12]:

$$R^{WZ}_{X|Y}(D) = \min_{\substack{P(Z|X),\, g(z,y):\\ E[d(X,\, g(Z,Y))] \leq D}} \big( I(X;Z) - I(Y;Z) \big) \qquad (2.1.2)$$

where Z is an auxiliary random variable such that Z ↔ X ↔ Y form a Markov chain, I(U;V) is the mutual information between two sources U and V, and g is the decoder reconstruction function, $g : \mathcal{Z} \times \mathcal{Y} \to \hat{\mathcal{X}}$.
It was also shown that in the special case where X and Y are jointly Gaussian sources, and the distortion metric is the mean squared error (MSE, $D = E[(X - \hat{X})^2]$), the last inequality turns into equality. That is, there is no loss compared with joint encoding. It was shown later by Pradhan et al. [18] that in fact only the difference X − Y needs to be Gaussian. In this case, the rate-distortion function can be calculated analytically:

$$R_{X|Y}(D) = R^{WZ}_{X|Y}(D) = \max\!\left(0,\ \frac{1}{2}\log\frac{\sigma^2_{x|y}}{D}\right) \qquad (2.1.3)$$
where $\sigma^2_{x|y}$ is the conditional variance of X given Y. For example, given that Y = X + N, where X and N are independent Gaussians with variances $\sigma_x^2$ and $\sigma_n^2$ respectively, this conditional variance is equal to:

$$\sigma^2_{x|y} = \frac{\sigma_x^2\, \sigma_n^2}{\sigma_x^2 + \sigma_n^2} \qquad (2.1.4)$$
The case of two sources that are jointly Gaussian is important, since the difference X − Y can be modelled in many systems as Gaussian noise. Moreover, a Gaussian random variable has the maximal possible differential entropy among all distributions with the same variance, so (2.1.3) can serve as a lower bound on the transmission rate for many systems. The theoretic
design process of a Wyner-Ziv coding scheme is described in [12]. In general, a Wyner-Ziv
coding scheme is obtained by adding a quantizer and a de-quantizer to the Slepian-Wolf
coding scheme. Therefore, a Wyner-Ziv coder design could focus on the quantizer and
corresponding reconstruction method design.
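A direct numeric reading of (2.1.3) and (2.1.4), with arbitrary illustrative variances:

import numpy as np

def wz_rate_gaussian(var_x, var_n, D):
    # Wyner-Ziv rate (bits/sample) for Y = X + N, X and N independent
    # Gaussians, under MSE distortion D, using (2.1.3) and (2.1.4).
    var_x_given_y = var_x * var_n / (var_x + var_n)    # (2.1.4)
    return max(0.0, 0.5 * np.log2(var_x_given_y / D))  # (2.1.3)

# Unit-variance source, noise variance 0.25, target MSE 0.05:
print(wz_rate_gaussian(1.0, 0.25, 0.05))   # 0.5*log2(0.2/0.05) = 1 bit/sample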
The rate loss of the Wyner-Ziv setting relative to joint encoding was investigated by Zamir et al. [19], who have shown that, using nested linear/lattice codes, this rate loss is upper bounded by 0.5 bits/sample: $R_{X|Y}(D) + 0.5\,\text{bit} \geq R^{WZ}_{X|Y}(D)$. This is true for general
correlated sources X and Y when the distortion measure is MSE. Zamir has also shown
that a more general analysis of the RD function of different types of sources and different
correlation models can be done using the extended Blahut-Arimoto algorithm [20,21], which
involves a numerical computation of the RD function.
The concept of Wyner-Ziv coding is well suited to the video coding scenario. In a video coding context such as DVC, X is usually the current frame to be encoded, while Y is an estimate of the current frame, obtained for example using previously decoded frames and partial information from the current frame. Since video sequences are usually both spatially and temporally correlated, the correlation model can exploit both dependencies.
The main component of any Wyner-Ziv video coding system is the modelling of this correlation, where in general there are two kinds of models: one estimates the correlation in advance, using training sequences (offline), whereas the other adapts itself to the changing statistics of the video (online). In the next section, practical DVC systems are described, with details on the correlation models they use.
Figure 2.5: Basic coding structure for H.264/AVC [1]
2.2 DVC Systems - Overview
Standard video encoders, such as H.264/AVC, can be viewed as a source coding system
with side information available both at the encoder and the decoder. To be specific, there
is a predictor Y of each frame X, which is created at the encoder and is known to both
the encoder and the decoder. This predictor, or side information, is created using temporal
(acquired via a motion estimation process) and spatial prediction.
Video encoders that exploit both temporal and spatial information are usually referred to
as Hybrid video encoders. In these encoders, the (quantized) residual between the frame to be
encoded X and the predictor Y , along with the motion vectors (MVs), are sent to the decoder.
They are used there with previously decoded frames in order to obtain a reconstruction X̂
of X. This process is well exemplified by the CODEC diagram of H.264/AVC in Figure 2.5.
As can be seen in this figure, the encoder side of H.264/AVC (which is the state-of-the-art
video encoder nowadays) is quite complex. Specifically, it includes both inter (temporal) and
intra (spatial) frame prediction modules, which are computationally heavy. The decoder,
on the other side, needs only to perform relatively simple operations, such as additions of
blocks of the existing decoded frames to the residue frame, using the MVs, and afterwards
de-quantization and inverse-DCT operations. This is a classic master-slave configuration,
where the encoder is the master and the decoder is the slave.
Figure 2.6: Limited complexity video encoders. Clockwise: mobile-phone, wireless camera,
capsule endoscopy, wireless endoscope
Together, the Slepian-Wolf and the Wyner-Ziv theorems suggest that it is possible to
compress two statistically dependent signals in a distributed way (separate encoding, joint
decoding), approaching the coding efficiency of conventional predictive coding schemes. This
gives rise to a new approach, according to which the encoder can be simple, whereas the
decoder is the part which exploits the statistical dependency between the sources, and hence
is the complex part of the video coding system.
A new video coding paradigm, known as distributed video coding (DVC), which is based
on the theorems above, has emerged in the last decade. DVC can be considered as a particular
application of the theoretic results from distributed source coding. Since efficient video
coding is usually obtained using lossy compression, DVC is also referred to as Wyner-Ziv
(WZ) video coding.
According to the DVC paradigm, the side information is created at the decoder. In
particular, the complex operation of temporal prediction (motion estimation) in the encoder
is avoided and hence the complexity of the encoder is reduced significantly. Generally, DVC
systems aim at combining intraframe coding features (which provide low-complexity encoding
and robustness to transmission errors) with the compression efficiency of interframe coding.
Examples of applications in which low-complexity encoders are needed are given in Figure
2.6.
Figure 2.7: Typical high-level diagram of a DVC system

A high-level diagram of a typical DVC system is shown in Figure 2.7. As can be seen in this diagram, the side information is created at the decoder and there is no exploitation at the
encoder of the temporal correlation between neighbouring frames. Instead, this correlation
is exploited at the decoder using temporal information that exists in the decoded frames.
The main goal of DVC is to provide compression performance comparable to that of
hybrid video coding systems, like MPEG-2 and H.264/AVC. According to the theorems of
Slepian-Wolf and Wyner-Ziv, this goal is achievable theoretically, in certain cases, like the
Gaussian case. However, there are some assumptions of these theorems which are not
necessarily satisfied under the DVC framework.
For example, the joint distribution of the source and the side information is not known in
practice, and is most likely to vary both spatially and temporally along the video sequence,
unless the video is highly static. In particular, the joint Gaussian distribution case that was
proven by Wyner and Ziv to incur no rate loss over conventional encoding, where both the
source and the side information are known to the encoder, is not necessarily satisfied, so
some loss in rate is unavoidable. Consequently, some degradation in performance is expected compared with hybrid video encoders.
There are two main advantages of a DVC system over "conventional" video encoders,
such as MPEG-2 or H.264/AVC:
1. Complexity The DVC framework allows designing decoders of different complexities. For
example, the creation of the side information can be done using simple algorithms
(such as simply copying the previously decoded frame) or more complex ones, such as
algorithms that are based on motion interpolation or motion extrapolation. Moreover,
in an environment that allows the encoder to be complex to some extent, the encoder
can be adapted accordingly, for example by performing a limited motion estimation.
2. Robustness to ‘drift’ errors In standard video coding systems, a transmission error
or a packet loss may lead to an erroneous predictor of the current frame at the decoder
(encoder-decoder mismatch). In this case, the closed loop that appears in Figure 2.5
loses its major advantage as a synchronization loop between the encoder and the
decoder. This difference may result in an erroneous reconstruction at the decoder, and
this error may even affect the reconstruction of other frames too (in practice, this drift
error is limited to the GOP size, which is typically 15 frames in H.264/AVC). This is
not the case in DVC, since in a DVC-based encoder each frame is usually encoded with no dependence on any other frame, so even in the case of a transmission error, only one
frame can be affected.
2.3 Existing DVC Systems
The first practical WZ systems were developed at Stanford University and at Berkeley around 2002. The Stanford architecture is characterized by frame-based coding,
using turbo codes and a feedback channel, whereas the Berkeley solution (named PRISM)
is characterized by block-based coding, using the coset coding approach, without a feedback
channel.
Later on, additional DVC architectures were proposed, based mainly on the two previously mentioned ones. Notable newer systems are DISCOVER [10] and VISNET II [22],
where DISCOVER has gained more popularity due to its simplicity and good performance.
DISCOVER is based on the Stanford architecture, with the main changes being an improved
noise model and the use of LDPC codes instead of turbo codes. VISNET II is similar to
DISCOVER, and its main new components are hierarchical motion estimation [23] in the SI
creation process and a deblocking filter [24] that is applied to the decoded frames.
In the following section, a review of three main DVC systems, Stanford, PRISM and
DISCOVER, is given.
2.3.1 Stanford
The Stanford DVC system was proposed by a group of researchers at Stanford University.
This video coding architecture was first proposed for the pixel domain [8] and was later
extended to the transform domain [9]. We will focus on the more efficient transform domain
Stanford DVC.
Figure 2.8: Zig-zag scan of the DCT coefficients. The upper left-most coefficient is the DC coefficient

Stanford's solution takes the frame-based approach. The video sequence is divided into two kinds of frames: key frames and Wyner-Ziv frames. The former are intra coded, using H.263+ intra mode or the more advanced (and complex) H.264/AVC intra mode, whereas the latter are encoded using distributed source coding techniques. In the following pages, a description of the encoder and the decoder used in the Stanford DVC system is given.
Encoder
1. Classification The video sequence is divided into key frames and WZ frames. Typically, a GOP of length two is chosen, meaning that the first frame in each GOP is
a key frame and the second one is a WZ frame that is encoded using principles from
DSC, as described below.
2. Transform A 4 × 4 or 8 × 8 non-overlapping DCT is applied to each video frame. The resulting transform coefficients are then zig-zag scanned, as depicted in Figure 2.8 (for an 8 × 8 block). This scan assigns lower index numbers to coefficients with higher energy. The DCT coefficients that have the same index k are then grouped together, forming DCT bands Bk (in the case of an 8 × 8 transform, k = 1, 2, ..., 64).
3. Quantization Each DCT band Bk is uniformly quantized, using a uniform scalar quantizer with 2^Mk levels. The number of quantization levels usually varies from band to band in order to match the bands' perceptual importance. Bits of the quantized coefficients are then grouped together, forming bit planes (a sketch of this band and bit-plane formation is given after this list).
4. Turbo Coding Each DCT band is turbo-coded, starting with the most significant
bit (MSB) plane. The parity bits are generated for each bit plane, stored in a buffer
and are sent to the decoder in chunks upon decoder requests, via a feedback channel.
19
As mentioned earlier, these bits are used for the "correction" of the side information
(estimation of the WZ frame), which is created at the decoder.
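The following Python sketch illustrates the band and bit-plane formation of steps 2 and 3 (the per-band bit depths M and the min/max quantizer range are our own illustrative choices, not the exact quantizers of [9]):

import numpy as np

def bands_and_bitplanes(blocks_dct, M):
    # blocks_dct: array of shape (num_blocks, 64), each row an 8x8 DCT
    # block after zig-zag scanning; M: list of 64 per-band bit depths.
    all_planes = []
    for k in range(64):
        band = blocks_dct[:, k]                    # DCT band B_k
        levels = 2 ** M[k]
        lo, hi = band.min(), band.max()
        step = (hi - lo) / levels                  # uniform quantizer step
        if step == 0:
            step = 1.0                             # constant band
        q = np.minimum(((band - lo) / step).astype(int), levels - 1)
        planes = [(q >> b) & 1 for b in range(M[k] - 1, -1, -1)]
        all_planes.append(planes)                  # MSB plane first
    return all_planes

In the Stanford codec, each such bit plane would then be turbo encoded and its parity bits buffered.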
Decoder
1. Side Information Creation The decoder creates side information that serves as an estimated (or "noisy") version of the WZ frames. This is done by motion-compensated interpolation (MCI), using the decoded frames closest to the current frame. In this method, motion vectors are created at the decoder using motion estimation between the two neighbouring (already decoded) key frames, where one is the previous key frame and the other is the next key frame. These MVs are then used for the prediction of the WZ frame from the previous key frame. This process is depicted in Figure 2.9 (a simplified sketch is also given after this list). As can be seen, this method typically introduces a delay of one frame, since the estimation process of the WZ frame also depends on the next key frame.
2. Correlation Noise Modelling The residual statistics between the DCT-transformed
source frame and the side information are assumed to be modelled by a Laplace distribution. The Laplace distribution parameter α is estimated using an off-line training
phase.
3. Turbo Decoding Each bit plane (starting from the MSB one) is turbo decoded using
the side information and the noise distribution model. Such decoding is commonly
performed using Pearl's belief propagation algorithm [25]. If the decoding error probability is higher than a pre-defined threshold Pe (usually set to 10^−3), the decoder requests more parity bits from the encoder, until the error probability is less than Pe or until a
stopping criterion is met (such as maximal allowed delay).
4. Reconstruction After the turbo decoding operation, the bit planes associated with
the DCT bands are grouped together in order to form a decoded quantized symbol
stream. Once all the (quantized) DCT bands are known, de-quantization and inverse DCT are performed in order to obtain the pixel-domain WZ frame.
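Under the simplifying assumptions of integer-pixel full-search block matching and linear motion, a minimal MCI sketch for step 1 might look as follows (holes and overlaps in the interpolated frame are left unhandled for brevity; the actual schemes in [8, 9] are more elaborate):

import numpy as np

def mci_side_info(prev_key, next_key, bs=8, search=8):
    # Block ME from prev_key to next_key; the matched pair is averaged
    # and placed halfway along the motion vector.
    h, w = prev_key.shape
    si = prev_key.astype(float).copy()       # fallback: previous key frame
    for by in range(0, h - bs + 1, bs):
        for bx in range(0, w - bs + 1, bs):
            blk = prev_key[by:by+bs, bx:bx+bs].astype(float)
            best, mv = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if 0 <= y <= h - bs and 0 <= x <= w - bs:
                        sad = np.abs(blk - next_key[y:y+bs, x:x+bs]).sum()
                        if sad < best:
                            best, mv = sad, (dy, dx)
            ny, nx = by + mv[0], bx + mv[1]            # match in next key
            my = min(max(by + mv[0] // 2, 0), h - bs)  # halfway position
            mx = min(max(bx + mv[1] // 2, 0), w - bs)
            si[my:my+bs, mx:mx+bs] = 0.5 * (blk + next_key[ny:ny+bs, nx:nx+bs])
    return si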
A schematic description of the Stanford CODEC is shown in Figure 2.10.

Figure 2.10: Stanford CODEC [9]

The Stanford CODEC suffers from a few drawbacks. First, it uses a feedback channel between the encoder and the decoder, which is impractical in many situations, especially in real-time systems. Moreover, using a noise model whose parameters were obtained off-line is unrealistic, since the noise in a typical video sequence varies both spatially and temporally. Partial solutions
to these issues are given in the following DVC architecture, named PRISM.
Figure 2.9: Motion-compensated interpolation (MCI)
2.3.2 PRISM
PRISM [6, 7] (Power-efficient, Robust, hIgh-compression, Syndrome-based Multimedia coding), developed at Berkeley, operates in the transform domain. Berkeley's solution takes the block-based approach, where each block in a frame can be encoded using one of several modes. The concept of GOPs is not employed in PRISM. In the following pages,
a description of the encoder and the decoder used in the PRISM DVC system is given.
Encoder
1. Transform An 8 × 8 non-overlapping DCT is applied to each video frame. The
resulting transform coefficients are then zig-zag scanned.
2. Quantization A scalar quantizer is applied to the DCT coefficients according to the desired reconstruction quality. The DC coefficient and a small number of the AC coefficients (which are near the DC in zig-zag order) are quantized using a small quantization step, since they contain most of the energy of the transform and hence give more information about the image than the remaining coefficients, which are quantized using a larger quantization step.
3. Classification In this stage, the coding mode for each block is decided: no coding (SKIP mode, in which the co-located block is copied from the previous frame), intraframe (JPEG-like) coding, or syndrome coding (described below), depending on the estimated temporal correlation noise. The correlation estimation process depends
on the permitted complexity of the encoder. If there is a requirement for the encoder to
be as simple as possible then the noise is estimated by simply subtracting the previous
co-located block from the current one (zero-motion block difference). Otherwise, some
basic motion estimation can be performed.
4. Syndrome Encoding In this mode, the low frequency coefficients (typically the lower
20%, called WZ coefficients) are encoded using the concept of syndromes, i.e., by
transmitting the label of the interval that contains the quantized symbol. The number
of bits required for transmitting the syndrome is smaller than the number needed for
the transmission of the coefficient, so compression is achieved. A BCH code [26] is usually
used for this purpose, since it works well for small block sizes.
5. Entropy Encoding The high-frequency quantized coefficients which were not syndrome encoded are entropy coded, using JPEG-based encoding. The different parts of a syndrome-coded block are shown in Figure 2.11.

Figure 2.11: Syndrome encoding and entropy encoding in a block
6. Hash generation A Cyclic Redundancy Check (CRC) code [27] is computed for the WZ coefficients in each block. The length of this code is usually taken to be 16 bits. This CRC code is used in the decoding process (see below) in order to select the best candidate block predictor in the process of syndrome decoding.
Decoder
1. Side Information Creation Each block with WZ coefficients has several candidates for side information, which are obtained from the previously decoded frame using half-pixel accuracy motion estimation. That is, all the neighbouring blocks in the previous frame (within a search range) can serve as potential side information.
2. Correlation Noise modelling The noise between WZ blocks and the side information
is assumed to have a Laplacian distribution. The distribution parameter is estimated
during an off-line training phase.
3. Syndrome Decoding Each block that serves as SI is used for the reconstruction of
WZ coefficients, by choosing the symbol from the coset indicated by the syndrome
which has minimal Hamming distance from the SI block. This can be done using the
Viterbi algorithm [28]. The recovered codeword is then verified using the CRC code. If the CRC code does not match, the decoding process is repeated for the next candidate (a toy version of this candidate loop is sketched after this list).
4. Entropy Decoding Once the WZ coefficients are recovered, the entropy coded coefficients, corresponding to the high-frequency components, are intra-decoded.
5. Reconstruction Rescaling and inverse DCT are applied to the decoded symbols.
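The candidate search of steps 1 and 3 can be sketched as follows; the scalar modulo-M cosets and zlib.crc32 used here are simple stand-ins for PRISM's BCH-based syndromes and 16-bit CRC, not the published implementation:

import numpy as np
import zlib

def nearest_in_coset(si_q, c, M):
    # Element-wise: nearest integers to si_q that are congruent to c mod M.
    k = np.round((si_q - c) / M).astype(int)
    return c + k * M

def prism_decode(coset_idx, crc_sent, candidates, M=4):
    # Try each candidate predictor (a vector of quantized SI coefficients)
    # in turn; accept the first whose coset decoding matches the CRC.
    for si_q in candidates:
        rec = nearest_in_coset(si_q, coset_idx, M)
        if zlib.crc32(rec.astype(np.int32).tobytes()) == crc_sent:
            return rec
    return None    # in practice: fall back to the closest-CRC candidate

# Toy usage: 12 quantized WZ coefficients, SI off by at most 1 per coefficient
rng = np.random.default_rng(0)
x = rng.integers(-20, 20, size=12)
si = x + rng.integers(-1, 2, size=12)
sent_cosets = x % 4                    # 2-bit syndrome per coefficient
sent_crc = zlib.crc32(x.astype(np.int32).tobytes())
print(prism_decode(sent_cosets, sent_crc, [si]))   # recovers x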
A schematic description of the PRISM CODEC is shown in Figure 2.12. The main drawbacks of the PRISM architecture are as follows. First, the noise model parameter is estimated off-line. As mentioned before, the noise in a video sequence varies both spatially and temporally.
Moreover, PRISM requires the additional overhead of CRC bits. Furthermore, there is
a possibility that none of the predictors will produce the needed CRC (in this case, the
predictor with the closest CRC to the calculated one is typically used).
Figure 2.12: PRISM CODEC [6]
2.3.3 DISCOVER
The DISCOVER (DIStributed COding for Video sERvices) video CODEC [10] architecture
is based on the Stanford WZ video coding architecture, with changes that include better SI
generation, the use of LDPC codes and an 8-bit CRC code for verification. In the following
pages, a description of the encoder and the decoder in the DISCOVER DVC system is given.
Encoder
1. Classification In this stage, the coding mode (WZ frame or key/intra frame) is decided for each frame. An adaptive GOP size selection algorithm [29] controls the insertion of key frames between WZ frames. This algorithm is based on simple temporal activity measures, and it groups frames with similar motion content in order to construct GOPs which are more correlated. The first frame in each such GOP is a key frame.
2. Transform A 4 × 4 non-overlapping block integer DCT is applied, as used in H.264 [1]. This transform is more computationally efficient than the regular DCT and reduces decoding errors. The resulting 16 DCT coefficients are then organized into 16 bands.
3. Quantization Each DCT band Bk (k = 1, 2, ..., 16) is uniformly quantized, using a uniform scalar quantizer with 2^Mk levels. The number of quantization levels usually
varies from band to band in order to match the bands’ perceptual importance. Bits of
the quantized coefficients are then grouped together, forming bit planes.
4. LDPC Encoding For each bit plane, syndrome bits are created using an LDPC code.
The Wyner-Ziv encoder stores these accumulated syndromes in a buffer and transmits
them in chunks/packets upon decoder request, through a feedback channel.
5. Hash generation A Cyclic Redundancy Check (CRC) code is computed for each WZ
bit plane, where its length is usually taken to be 8 bits. This CRC is used in the
decoding process in order to aid the decoder in detecting decoding errors.
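For illustration, a bitwise CRC-8 over a bit plane might be computed as follows (the generator polynomial 0x07 is an arbitrary common choice; the polynomial actually used by DISCOVER is not specified here):

def crc8(bits, poly=0x07):
    # MSB-first bitwise CRC-8 over an iterable of 0/1 integers.
    reg = 0
    for b in bits:
        reg ^= b << 7
        if reg & 0x80:
            reg = ((reg << 1) ^ poly) & 0xFF
        else:
            reg = (reg << 1) & 0xFF
    return reg

plane = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
print(f"CRC-8 of the bit plane: 0x{crc8(plane):02X}")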
Decoder
1. Side Information Creation The decoder creates the side information for each WZ
coded frame using an MCI technique similar to the one used in the Stanford CODEC,
using the previous and the next key frames. The obtained motion vectors are smoothed
using a median filter in order to improve consistency.
2. Correlation Noise Modeling The error distribution between corresponding DCT
bands of SI and WZ is modelled as a Laplace distribution with on-line evaluation of the
parameter α. The Laplace distribution parameter is estimated at different granularity
levels (the frame, the block and the pixel level) [30], using the frame difference between the previous and next frames (a band-level sketch is given after this list).
3. LDPC decoding The decoder requests additional chunks of parity bits if the decoding error of the current bit stream exceeds a pre-defined threshold.
4. Hash checking The success of the LDPC decoding operation is verified using
the 8-bit CRC code. If the CRC computed on the decoded bit plane matches the value
received from the encoder, the decoding is declared successful. Otherwise, the decoder
requests more parity bits.
5. Reconstruction The decoded value is reconstructed as the conditional expectation
of the source given the decoded quantized coefficient. Rescaling and inverse DCT are
applied to the decoded coefficients.
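As a band-level variant of the granularity levels in [30], the Laplace parameter can be estimated from the SI/WZ residual by matching variances (a minimal sketch, assuming a zero-mean residual):

import numpy as np

def laplace_alpha(residual):
    # For f(x) = (alpha/2) * exp(-alpha*|x|), the variance is 2/alpha^2,
    # hence alpha = sqrt(2 / var). 'residual' is, e.g., one DCT band of
    # the difference between the previous and next decoded frames.
    var = np.mean(np.asarray(residual, dtype=float) ** 2)
    return np.sqrt(2.0 / max(var, 1e-12))

rng = np.random.default_rng(1)
band_residual = rng.laplace(0.0, 2.0, size=1024)  # true alpha = 1/2 = 0.5
print(laplace_alpha(band_residual))               # close to 0.5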
A schematic description of DISCOVER CODEC is shown in Figure 2.13. The major
improvement of this DVC system over the previous ones is the on-line estimation of the
correlation noise. However, a feedback channel still exists in this architecture. According
to a comparison given in [31], DISCOVER outperforms both the PRISM and the Stanford DVC systems, with its best performance achieved for a GOP size of 2. The good performance of DISCOVER can be related to the improved noise model and to the use of LDPC codes, which have better error-correcting capabilities than turbo codes, especially in low-SNR environments [32].
It should be noted that using a larger GOP size leads to degradation in performance (up
to 1.5dB in the case of a GOP of size 8), so the use of an adaptive GOP size selection is not
needed in practice.
Figure 2.13: DISCOVER CODEC [10]
Chapter 3
LORD: LOw-complexity, Rate-controlled, Distributed video coding system
In this chapter we propose a new DVC system, named LORD, which addresses certain drawbacks of some of the previously mentioned DVC systems. We also address rate control and real-time implementation needs.
3.1 Design Requirements
As was shown before (Chapter 2), there are three main drawbacks in most of the existing
DVC systems:
1. Noise Model Most of the existing DVC systems use an off-line noise model. That is, the statistics of the noise are collected in advance, using some video samples, and are then used for the creation of the noise model. Clearly, this approach is unrealistic, since the noise in any typical video sequence varies both spatially and temporally. Even in the case of DISCOVER, in which the statistics of the noise are estimated on-line, they are used for the whole frame, without taking into account the different magnitude of the noise in different parts of the frame.
2. Rate Control Most of the existing DVC systems do not address the rate control needed for real-time applications or for channels that are limited to a maximal number of bits per transmission time interval. The rate in most of these systems can usually only be estimated by the encoder, and overflow or underflow phenomena are likely to occur. This is due to the partial information which is sent by the encoder,
where the decoder is left with the task of requesting more information.
3. Feedback Channel Most existing DVC systems today use a feedback channel. The decoder uses this channel in order to request more parity bits when the decoding fails. This is unacceptable in any real-time system, since it incurs an unknown delay (because the number of parity bit requests is not known in advance). It has been proposed in some systems to model the number of requests, using an appropriate probability model. However, deviation from the desired rate using such a model is usually unavoidable.
The aim in the design process of LORD is to address the issues above. First, the noise
statistics in LORD are updated on-line, at the decoder, using both spatial and temporal
information. Second, the rate control is performed at the encoder, in a highly accurate
way, without being affected by the decoder. Moreover, no feedback channel is used in our
implementation, and no delay is incurred by the process of creating side information.
Some of the components that are used in LORD were adapted from PRISM. For example,
we use the concept of working with blocks, rather than working with frames. This enables a
more localized treatment of the video content, and improves the consistency of the suggested
noise model. We take into account that the side information usually cannot predict the WZ frame as a whole, due to the varying statistics of the video sequence. Hence, using the block
approach, we consider only a selected number of blocks as the side information.
We use the concept of a GOP, according to which the input is partitioned into key (intra)
and WZ frames. In addition, we use a rate-distortion optimization scheme, which distributes the
available budget of bits inside the GOP. We also show how to ensure, in a highly precise
manner, that the allocated number of bits is not exceeded, using a simple rate control
scheme. Finally, we use the existing components of LORD for the task of compressing
endoscopy videos, with the adaptations needed for this task, as will be shown in Chapter 4.
During the design process, we focused on the general requirement of low complexity.
Accordingly, the components of LORD involve only simple and straightforward calculations,
which enable the implementation of LORD in real-time applications or in resource-limited
environments.
The components of the LORD CODEC are described in detail below, beginning with the
encoder and followed by the decoder.
Figure 3.1: LORD - Encoder (highlighted: our main contributions)
3.2 Encoder
A scheme of LORD’s encoder is given in Figure 3.1. The main new features in this encoder,
compared with existing encoders, are the rate-distortion optimization (RDO) and rate control
(RC) components. The RDO module distributes the available budget of bits between the
frames inside each group of pictures (GOP), which is composed of key (intra) and WZ frames,
and the RC component ensures that the transmission rate satisfies the rate constraint of the
transmission channel.
We will concentrate on the case of a GOP that consists of two frames: a key frame and a WZ
frame (in this order). Thus, the available bits should be distributed between two frames.
Extension to a GOP of a larger size, with any combination of key and WZ frames, is
straightforward, using the same principles that will be described later. It should be noted that
for most of the existing DVC encoders, it has been shown that a GOP of size 2 usually provides
the best results, and in the cases where a larger GOP achieved an improvement, it was
minor [31, 33].
3.2.1 Transform
The frame to be encoded is transformed to the DCT domain. We use blocks of
size 8 × 8, which provide a reasonable trade-off between quality and compression efficiency.
Moreover, this enables the use of JPEG encoder components, such as perceptual quantization
matrices, which give different weights to different DCT bands, according to their importance
to the Human Visual System (HVS). As mentioned earlier, this transform is an essential part
of any video coding system, since it compacts the energy of the source into a small number of
coefficients, and usually has performance close to that of the Karhunen-Loève transform
(KLT), which optimally compacts the energy [34].
3.2.2 Block Classification
In this stage, the coding mode of each block is decided at the encoder. The decision is made
according to the residual energy $E_d$ of the difference between each block and its co-located
one in the previous frame (i.e., zero-motion prediction). For simplicity, we will measure $E_d$
from now on in terms of PSNR; that is, a higher difference leads to a lower PSNR. Following
a similar principle used in PRISM and in one of its implementations by Fowler [35], there are
three possible coding modes:
1. SKIP If the energy $E_d$ is higher than a predefined threshold, $SKIP_{TH}$, the current
block is not encoded, and the decoder simply copies its co-located block from the
previous frame. This mode is very effective in stationary scenes or in stationary parts
of two consecutive frames, such as background regions or slowly varying objects.
2. INTRA If the energy $E_d$ is smaller than a predefined threshold, $INTRA_{TH}$, the block
is encoded using conventional intra-coding. The intra-coding method used in LORD is
JPEG, which offers low computational complexity and good performance. The first
two frames of the video input are intra-coded, because of the side information creation
process, which requires two decoded frames at the decoder in order to extrapolate the
third one (see Sec. 3.3.2).
3. COSET If the energy $E_d$ is between $INTRA_{TH}$ and $SKIP_{TH}$, the block is encoded
using DVC principles. In order to ensure a reasonable quality of the decoded block,
only the first 15 AC coefficients (WZ coefficients) are encoded using this method. The
DC coefficient (which is highly important and hence should be encoded with minimal distortion)
and the remaining AC coefficients (which usually have much smaller energy than the
first 15) are encoded using JPEG. The maximal absolute differences $V_k$ between
each of the current 15 AC coefficients ($k = 2, 3, ..., 16$) and the coefficients of the
co-located blocks in the previous frame are transmitted losslessly to the decoder. The
encoder quantizes each of the WZ coefficients uniformly to $2^m$ symmetric levels ($m = 3$
in our implementation), as can be seen in Figure 3.4. It sends the quantization index
of each coefficient using a Huffman code, which is simple to use, approaches
the entropy limit, and offers a quick decoding process, since it is a
prefix code.
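As an illustration, the classification rule can be summarized by the following minimal Python sketch. The helper names and the assumption of 8-bit pixels are ours, not part of the actual implementation:

```python
import numpy as np

INTRA_TH = 23.0  # dB (see the threshold selection below, Figure 3.2)
SKIP_TH = 40.0   # dB

def residual_psnr(block, colocated):
    """E_d of the zero-motion residual, expressed as PSNR (8-bit pixels)."""
    mse = np.mean((block.astype(np.float64) - colocated) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def classify_block(block, colocated):
    """Coding mode of a WZ-frame block, given its co-located predecessor."""
    e_d = residual_psnr(block, colocated)
    if e_d >= SKIP_TH:    # tiny residual: decoder copies the co-located block
        return "SKIP"
    if e_d <= INTRA_TH:   # large residual: conventional JPEG intra coding
        return "INTRA"
    return "COSET"        # in between: WZ coding of the first 15 AC coefficients
```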
Figure 3.2: Determining $INTRA_{TH}$
In order to determine the threshold $INTRA_{TH}$ (we set $SKIP_{TH} = 40$dB), we evaluated
the performance of LORD for different values of $INTRA_{TH}$, using three video sequences.
The results can be seen in Figure 3.2. According to these results, $INTRA_{TH}$ was set to
23dB. An example of the result of the block classification stage (using this threshold) is
given in Figure 3.3.
The distribution of $E_d$ is usually concentrated symmetrically around zero. Therefore,
given that it is known whether the quantization interval falls on the positive or negative side
of zero, we are left with estimating the probability that a coefficient belongs to one of $2^{m-1}$
bins. These probabilities will be denoted as $p_1, p_2, p_3, p_4$, as depicted in Figure 3.4.
It is reasonable to assume that these probabilities satisfy $p_1 > p_2 > p_3 > p_4$, as was indeed
obtained in our experiments for the majority of the frames. Hence, the Huffman dictionary
can be created once (off-line) using this assumption, as depicted in Table 3.1. This dictionary
is known in advance at the decoder, so there is no need to build the dictionary (and to
transmit it to the decoder) for each frame.
According to our tests, the additional rate incurred by using this fixed dictionary, compared
to an optimal (on-line computed) dictionary, is negligible. Hence, the use of the dictionary
from Table 3.1 is preferred, since it saves the calculations needed for reconstructing the dictionary
for each frame. It also prevents decoding errors that may occur because of erroneous reception
of the dictionary.
Figure 3.3: Block classification: example
Figure 3.4: Bins with their probabilities (m = 3)
Index   Codeword
1       01
2       00
3       101
4       100
5       1101
6       1100
7       1111
8       1110

Table 3.1: Bin indices and their associated codewords
As in DISCOVER (see Sec. 2.3.3), the GOP used in LORD consists of two frames,
where the first one is a key (intra) frame and the second one is a WZ frame. This GOP
structure will be denoted later as IW (Intra-Wyner-Ziv). The classification of blocks into
coding modes is applied only to WZ frames (for a key frame, all the blocks are classified as
intra blocks).
This use of a period of 2 for key frames improves the quality of the side information
(the estimate of the WZ frame) at the decoder, which relies heavily on the quality of the previously
decoded frame, as will be shown later. It should be noted that the first two frames of the video
input are intra-coded, due to the method in which the SI is computed, so the IW
structure begins from the second frame.
3.2.3 Rate-Distortion Optimization
The input GOP to the encoder consists of a key frame and a WZ frame. Assuming that the
available bit budget for each GOP is B bits (due to channel constraints), and that there are
T = 2 frames in the GOP (IW structure), we would like to distribute the available B bits
between the two frames in the GOP in an optimal way, taking into account the different
encoding modes used. The extension to a different number of frames in the GOP can be done
in a straightforward manner, using the results below.
For this purpose, we use a rate-distortion model, which relates the distortion to the
number of bits allocated for encoding the frames. Keeping in mind that low-complexity
models are needed in the DVC framework, we use a relatively simple and well-known RD
model, with the adaptations needed for the case of DVC. A detailed description of
this model is given in Appendix C, and is based on [36, 37]. The main parts of this model
are as follows.
Assume that we have $P$ random variables (possibly with different probability distributions):
$X_1, X_2, ..., X_P$, with zero mean and variances $\sigma_i^2$ ($i = 1, 2, ..., P$). The distortion (measured
as the mean squared error, MSE) incurred when uniformly quantizing $X_i$ using $b_i$ bits (i.e.,
$2^{b_i}$ levels) can be modelled as:
$$D_i(b_i) = h_i \sigma_i^2 2^{-2b_i} \tag{3.2.1}$$
where the constants $h_i$ are determined by the probability density function (PDF) of $X_i$. This
distortion is the average distortion over $N \to \infty$ realizations of $X_i$.
In our case, we refer to the DCT coefficients as being emitted by the random variables above,
associated with the DCT bands. We treat the DCT bands in each frame of the GOP
separately, since different coding modes are used in each frame, and the content of the
frames varies. For example, when dealing with a GOP of size 2, we have a total of $P = 128$
random variables.
Now, assume that there are $M_{key}$ (intra) blocks in the key frame, and $M_{coset}$ COSET blocks
and $M_{intra}$ INTRA blocks in the WZ frame (SKIP blocks are not taken into account, since
they do not consume bits). It should be noted that in the case of COSET blocks, we refer
only to their part that is intra-coded. We denote the indices of the DCT bands that belong
to this part as $A = \{1, 17, 18, ..., 64\}$, and we consider this part from now on as belonging to the
INTRA blocks. We weight the overall distortion according to the number of blocks (according
to their coding mode) in each frame, that is:
$$D = \underbrace{\sum_{i=1}^{P/2} m_i h_i \sigma_i^2 2^{-2b_i}}_{\text{distortion from key frame}} + \underbrace{\sum_{i=P/2+1}^{P} m_i h_i \sigma_i^2 2^{-2b_i}}_{\text{distortion from WZ frame}} \tag{3.2.2}$$
where:
1. $m_i = M_{key}$ for $i = 1, 2, ..., P/2$ (indices associated with the key frame)
2. $m_i = M_{intra}$ for $i = P/2 + 2, P/2 + 3, ..., P/2 + 16$ (indices associated with INTRA-only blocks of the WZ frame)
3. $m_i = M_{intra} + M_{coset}$ for $i = P/2 + \{A\}$, where $A = \{1, 17, 18, ..., 64\}$ as defined above (indices associated with INTRA blocks and the intra part of COSET blocks of the WZ frame)
It should be noted that the effective number of bits available for coding the intra parts
of the GOP is $B - COSET_{bits}$, where $COSET_{bits}$ denotes the total number of bits allocated
for the first 15 AC coefficients in the COSET blocks. Given that $\beta$ bits are allocated for
each of these coefficients ($\beta = 3$ in our implementation), we get:
$$COSET_{bits} = 15\beta \cdot M_{coset} \tag{3.2.3}$$
Now, writing the total distortion more compactly, the optimization problem we have to
solve is:
$$\min_{b_i} D = \sum_{i=1}^{P} m_i h_i \sigma_i^2 2^{-2b_i} \quad \text{s.t.} \quad \sum_{i=1}^{P} b_i \le B - COSET_{bits} \tag{3.2.4}$$
where $b_i$ denotes the bits assigned to the $i$th DCT band. We set (see Appendix C) $h_i = h_G = \frac{\sqrt{3}\pi}{2}$
for the DC band ($i = 1, P/2 + 1$ for a GOP of size 2), which is assumed to have a Gaussian
distribution (in JPEG, the differences of the DCs are actually encoded, but it can be assumed
that the distribution of the differences is also Gaussian). In addition, $h_i = h_L = \frac{9}{2}$ for the
AC bands, which are assumed to have a Laplace distribution [38].
The empirical maximum-likelihood (ML) estimate of the variance of each band is calculated
according to its distribution. That is, the variance of each DC band (assumed to be
Gaussian-distributed) is calculated according to:
$$\sigma_G^2 = \frac{1}{N_G}\sum_{j=1}^{N_G} x_j^2 \tag{3.2.5}$$
where $x_j$ denotes a realization (coefficient) associated with the DC band, and $N_G$ is
the total number of realizations in this band. The variance of a specific AC band (assumed
to be Laplace-distributed) is calculated according to:
$$\sigma_L^2 = 2\left(\frac{1}{N_L}\sum_{j=1}^{N_L} |x_j|\right)^2 \tag{3.2.6}$$
where $x_j$ denotes a realization associated with the AC band, and $N_L$ is the total
number of realizations in this band. These variances are calculated separately for each band
in each frame of the GOP.
The solution of the optimization problem (3.2.4) is (see Appendix C):
$$b_i = \bar{b} + \frac{1}{2}\log_2\frac{\sigma_i^2}{\rho^2} + \frac{1}{2}\log_2\frac{h_i}{H} + \frac{1}{2}\log_2\frac{m_i}{M} \tag{3.2.7}$$
where:
$$\bar{b} = \frac{B - COSET_{bits}}{P}, \quad \rho^2 = \left(\prod_{i=1}^{P}\sigma_i^2\right)^{1/P}, \quad H = \left(\prod_{i=1}^{P}h_i\right)^{1/P}, \quad M = \left(\prod_{i=1}^{P}m_i\right)^{1/P} \tag{3.2.8}$$
It should be noted that the $b_i$ in the solution (3.2.7) can be non-integer or negative
for some $i$. In our case, we consider the total number of bits allocated to each frame
(see below), rather than the individual $b_i$, so rounding this number (especially when it is
sufficiently large) is unlikely to cause any noticeable degradation in performance. Negative
$b_i$ should be set to 0, in which case the bits allocated to other bands must be reduced, in
order to satisfy the constraint on the total number of bits. It should be noted that in our
experiments, no negative values of $b_i$ were obtained.
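To make the use of (3.2.7) concrete, the following Python sketch computes the per-band allocation, working in the log domain so the geometric means in (3.2.8) never have to be formed explicitly. It is a minimal sketch under the assumption that all inputs are positive; the function name is ours:

```python
import numpy as np

def allocate_bits(B, coset_bits, m, h, var):
    """Per-band bit allocation b_i of eq. (3.2.7).
    m, h, var: length-P arrays of block counts m_i, PDF constants h_i
    and band variances sigma_i^2 (all assumed positive)."""
    m, h, var = (np.asarray(a, dtype=np.float64) for a in (m, h, var))
    P = len(var)
    b_bar = (B - coset_bits) / P
    # log2(x_i / geometric_mean(x)) == log2(x_i) - mean(log2(x))
    terms = [np.log2(a) for a in (var, h, m)]
    b = b_bar + 0.5 * sum(t - t.mean() for t in terms)
    return np.maximum(b, 0.0)  # negative allocations are clipped to zero
```

The frame totals of (3.2.9) then follow directly, e.g. `B_key = b[:P // 2].sum()`.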
Since we want to take HVS characteristics into account (see Sec. 3.2.1), we do not
quantize the coefficients in the $i$th band by simply using $b_i$ bits; instead, we use the perceptual
quantization matrices of JPEG. We first calculate the total number of bits that should be
allocated to each frame, that is:
$$B_{key} = \sum_{i=1}^{P/2} b_i, \quad B_{wz} = \sum_{i=P/2+1}^{P} b_i \tag{3.2.9}$$
and by using an appropriate rate control scheme (see Sec. 3.2.4) we determine the
quantization matrices that should be used in order to achieve the bit distribution (3.2.9)
inside the GOP. The solution (3.2.9) can be considered as an estimate of how "difficult" it
is to compress each frame in the GOP, and therefore of how to distribute the available bits.
3.2.4 Rate Control
Given the bits allocated to each frame in the GOP, according to the rate-distortion scheme
suggested in Sec. 3.2.3, our task now is to employ a rate control (RC) scheme that ensures
that each frame uses its optimal share of bits. Since after the classification stage we are
left essentially with coefficients that are JPEG encoded, the rate is controlled by an RC
algorithm that changes the quantization matrices of JPEG throughout each frame, in order
to achieve the desired bitrate.
The parameter that controls the quality in JPEG is the scale factor parameter (denoted
later as $q$), which multiplies the base quantization matrix given in Table 3.2.
The quantization matrix quantizes each DCT band with a quantization step $step_i(q)$
($i = 1, 2, ..., 64$ denotes the index of the DCT band, where in our case we treat the
intra parts of a key frame and a WZ frame separately) that is visually optimized for each band.
 16  11  10  16  24  40  51  61
 12  12  14  19  26  58  60  55
 14  13  16  24  40  57  69  56
 14  17  22  29  51  87  80  62
 18  22  37  56  68 109 103  77
 24  35  55  64  81 104 113  92
 49  64  78  87 103 121 120 101
 72  92  95  98 112 100 103  99

Table 3.2: JPEG base quantization matrix
Each DCT coefficient is quantized with a step proportional to $q$ (the proportion constants
are the elements of the base quantization matrix in Table 3.2). It is clear that a relation
$R = R(q)$ (where $R$ is the bit rate) is highly desirable. This way, given a target rate $R$,
we could ensure that this rate is not exceeded, by simply inverting the relation $R = R(q)$ in
order to obtain $q = q(R)$. However, as noted in [39], no such reliable relation exists.
He and Mitra [40] approached the problem from a different point of view. Instead of
modelling $R(q)$, they showed that in an image there exists a linear relationship between the
coding bit rate $R$ and the fraction of zeros among the quantized coefficients, denoted by $\rho$
($0 \le \rho \le 1$):
$$R(\rho) = \theta(1 - \rho) \tag{3.2.10}$$
where $\theta$ (the negative of the slope) is a constant related to the image content, which depends
mainly on the amount of texture and edges in the image. It should be noted that the
existence of such a relation makes sense, since $R$ being monotonically decreasing with $\rho$
implies a one-to-one mapping between them. We can first determine $\rho$ using $q$:
$$\rho(q) = \frac{1}{N}\sum_{i}\;\sum_{j:\,|x_{i,j}| \le step_i(q)} 1 \tag{3.2.11}$$
where $N$ is the total number of DCT coefficients that are not SKIP encoded
(where for COSET blocks only their intra part is considered), calculated separately for each
frame in the GOP. That is, $\rho$ is determined by counting only the DCT coefficients that are
quantized to zero. Once $\rho$ is known, we can simply obtain the associated rate using (3.2.10).
However, we still have to estimate the parameter $\theta$. The following scheme (adapted from [40])
proposes an adaptive way to estimate $\theta$, while maintaining the relation (3.2.10).
Let us denote by $M$ the number of blocks in the current frame (in our case, the
number of blocks that are not SKIP coded) and by $N_m$ the number of already coded blocks,
where each block contains $S$ coefficients (in our case, $S = 64$ for INTRA blocks, and $S = 49$
for COSET blocks). Let $B_m$ be the number of bits used to encode these $N_m$ blocks.
Also denote by $\eta_m$ the number of zeros in these blocks. Note that all the values above are
unnormalized.
The proposed algorithm for allocating $B - COSET_{bits}$ bits to the current frame consists
of the following stages:
1. Set $N_m = \eta_m = B_m = 0$, $\theta = 6.5$ (an average value for a typical video sequence).
2. According to (3.2.10), the number of zeros to be produced by quantizing the remaining
blocks is:
$$\eta = S \cdot (M - N_m) - \frac{B - COSET_{bits} - B_m}{\theta} \tag{3.2.12}$$
Using $\eta$ from (3.2.12), calculate $q = q(\eta)$ using (3.2.11) (where in this case $N$ in
(3.2.11) denotes the number of coefficients that have already been encoded) and encode
the current block with this $q$.
3. Let $\eta_0$ and $B_0$ denote the number of zeros and the number of bits produced by the
current block, respectively. Set:
$$\eta_m := \eta_m + \eta_0, \quad B_m := B_m + B_0, \quad N_m := N_m + 1 \tag{3.2.13}$$
and update $\theta$ (if $N_m$ is sufficiently large) according to:
$$\theta = \frac{B_m}{S \cdot N_m - \eta_m} \tag{3.2.14}$$
4. Repeat steps 2 and 3 until all the blocks are encoded.
This rate control algorithm has a low computational complexity and implementation cost,
as it involves mostly simple additions for the calculation of (3.2.11). It divides the video
frame into two groups, coded and not-yet-coded blocks, and balances the bit budget between
them. This RC scheme turns out to be very accurate and robust. It should also be noted
that the rate control of the current frame does not use any information or statistics from the
previous frame; therefore, it is not affected by scene changes.
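The loop below is a minimal Python sketch of this RC scheme. The two helpers (`quantize_block`, assumed to return the bits and zeros produced by coding a block with scale factor q, and `steps_for_q`, assumed to return the 8 × 8 step matrix $step_i(q)$), as well as the simple grid search used to invert (3.2.11), are our assumptions, not part of [40]:

```python
import numpy as np

def rho_domain_rc(dct_blocks, quantize_block, steps_for_q, budget,
                  S=64, theta=6.5, warmup=4):
    """One-pass rho-domain rate control over the non-SKIP blocks of a frame."""
    M = len(dct_blocks)
    Nm = eta_m = Bm = 0
    q_grid = np.linspace(0.2, 12.0, 60)   # candidate scale factors, ascending
    for block in dct_blocks:
        # eq. (3.2.12): zeros still to be produced by the remaining blocks
        eta = S * (M - Nm) - (budget - Bm) / theta
        rho_target = eta / max(S * (M - Nm), 1)
        # invert eq. (3.2.11): pick the smallest q whose predicted fraction
        # of zero-quantized coefficients reaches the target
        q = q_grid[-1]
        for cand in q_grid:
            if np.mean(np.abs(block) <= steps_for_q(cand)) >= rho_target:
                q = cand
                break
        bits0, zeros0 = quantize_block(block, q)
        Nm, eta_m, Bm = Nm + 1, eta_m + zeros0, Bm + bits0
        if Nm >= warmup and S * Nm > eta_m:
            theta = Bm / (S * Nm - eta_m)   # eq. (3.2.14)
```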
Figure 3.5: RC scheme, applied to Foreman 40th frame (as a key frame). The target rate is
1.5bpp.
According to tests on a set of video sequences in [40], the relative deviation from the
target rate is less than 2%. There is a PSNR gain of up to 1.17dB over other (and more
complicated) RC schemes, such as VM7 [41, 42] and TMN8 [43]. It should be noted that the
variance of $\theta$ over a frame is usually relatively small, so the adaptive estimation process
of $\theta$ is quite stable. An example of the RC algorithm, applied to one (key) frame, is
given in Figure 3.5.
Because of its low computational complexity and good performance, this rate control
algorithm is used in LORD's encoder. The separate treatment of each frame fits well into the
framework of DVC. Moreover, the relatively simple implementation of this algorithm enables
its use in real-time applications.
Results of this RC scheme, applied to the first 100 frames of the Foreman sequence
with a target rate of 1.5bpp, are shown in Figure 3.6. As can be seen in this example, the
maximal deviation of the actual rate from the target rate is about 1%. As expected, there
are fluctuations in the PSNR, due to the allocation of the same amount of bits to each frame
(for the sake of demonstration, all the frames are intra-coded, that is, each frame is a key
frame), without taking into account the differences in their spatial content.
It should be noted that we can deal in a simple manner with cases in which the resulting
rate using this RC scheme is higher than required. In this case, we can perform an additional
pass over the blocks and remove the bits used for the coefficients with the lowest allocation of
bits. This process is stopped when the rate constraint is satisfied, so there is usually no
need to pass over the whole frame.
This additional pass is especially important if no delay is permitted (so a buffer cannot
be used). An important example of an application with strict delay constraints is endoscopy
(see Sec. 4.1).
(a) bpp (the red lines denote deviation of 1% from 1.5bpp)
(b) PSNR
Figure 3.6: Foreman sequence, 100 frames, 1.5bpp rate constraint, intra-only coding mode
Figure 3.7: LORD - Decoder (highlighted: our main contributions)
3.3 Decoder
LORD's decoder consists of the components depicted in Figure 3.7. The main components
of this decoder are the side information creation and the noise correlation modelling.
These components are described next.
3.3.1 Intra Frame Decoder
The input to this component is the first frame of the GOP, that is, the key frame, which was
intra-coded using JPEG. The decoded frame is simply obtained using a conventional JPEG
decoder. This decoded frame, denoted $\hat{X}_{2k-1}$, along with one previously decoded
frame, denoted $\hat{X}_{2k-2}$, is used for side information creation.
3.3.2 Side Information Creation
In LORD, the side information is created using motion extrapolation. That is, a prediction
of the current Wyner-Ziv frame, $X_{2k}$, is created at the decoder, using the two previously decoded
frames, $\hat{X}_{2k-1}$ and $\hat{X}_{2k-2}$. This prediction serves as side information in the decoding process
of a WZ frame.
For this purpose, we begin by performing (at the decoder) a full-search motion estimation
(ME) process between $\hat{X}_{2k-1}$ and $\hat{X}_{2k-2}$. ME is the process of determining motion vectors
(MVs) that describe the displacement of each block from one image to another. As mentioned
earlier (see Sec. 3.2.1), we work with blocks of 8 × 8 pixels. Usually, the first (reference)
frame is called the anchor frame and the second one the tracked frame. In modern video
encoders, the motion vectors and the prediction error are sent to the decoder, so the tracked
frame can be recovered from the anchor frame and this information.
The basic motion estimation process is performed using integer-pixel (ipel) motion vectors.
This is depicted in Figure 3.8, in which the predicted tracked frame can be seen. The
prediction error can be seen in Figure 3.9. As expected, most of the energy of the error is
concentrated at the edges of the image.
To improve the accuracy of motion estimation, it is common to interpolate the anchor
frame to half-pixel (hpel) or quarter-pixel (qpel) precision. This is especially important
for slow-motion sequences, in which the motion is more likely to be in steps of hpel or qpel.
It is also possible to further interpolate the anchor frame to eighth-pixel precision, but the
additional complexity usually does not justify the improvement in performance.
The interpolation of the anchor frame to qpel precision (×4) can be achieved by any
appropriate interpolation filter. In H.264, the filter used for the interpolation process is:
$$h = \begin{bmatrix} 1 & 0 & -5 & 0 & 20 & 32 & 20 & 0 & -5 & 0 & 1 \end{bmatrix} / 32$$
where this filter is first used in order to obtain hpel precision, and then applied again on the
hpel grid, in order to obtain qpel precision. The interpolation process is separable, which
means that the sampling rate in one direction is doubled by inserting zero-valued samples followed
by filtering with the filter $h$, and then the process is repeated for the other direction.
In summary, the ME process used in LORD is as follows:
1. ipel ME Integer-pixel motion vectors are obtained using standard ipel ME (as shown
in Figure 3.8). These vectors serve as the basis for obtaining qpel MVs.
2. Interpolation The anchor frame is interpolated to qpel precision using the H.264
interpolation filter (applied twice). Bilinear interpolation can also be used, but the H.264 filter
provides better results and is separable.
3. qpel ME The ipel MVs are used as pointers to the starting points of the qpel search,
in order to reduce complexity. The search is performed over the sub-grids of hpel and
qpel precision.
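The interpolation step can be sketched in a few lines of Python. Zero-stuffing followed by separable filtering with $h$ preserves the original samples (the center tap is 32/32 = 1 and all other even taps are zero), while the odd positions are synthesized by the 6-tap kernel; applying the function twice yields the qpel grid. The boundary handling used here (sample replication) is our simplification of the exact H.264 behaviour:

```python
import numpy as np
from scipy.ndimage import convolve1d

H264_FILTER = np.array([1, 0, -5, 0, 20, 32, 20, 0, -5, 0, 1]) / 32.0

def upsample2(frame):
    """Double the resolution of `frame` in both directions: insert
    zero-valued samples, then filter each axis with the H.264 kernel."""
    h, w = frame.shape
    up = np.zeros((2 * h, 2 * w), dtype=np.float64)
    up[::2, ::2] = frame
    up = convolve1d(up, H264_FILTER, axis=0, mode="nearest")
    return convolve1d(up, H264_FILTER, axis=1, mode="nearest")
```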
(a) Football - tracked frame
(b) Football - predicted tracked frame
(c) Foreman - tracked frame
(d) Foreman - predicted tracked frame
Figure 3.8: Motion estimation, ipel precision
(a) Football (prediction error: 26.84dB)
(b) Foreman (prediction error: 32.15dB)
Figure 3.9: Prediction error (normalized absolute value, offset of 150 gray levels was used
for better visualization), ipel precision
An example of the qpel ME process is shown in Figure 3.10. The prediction error is
shown in Figure 3.11. It can be seen that the qpel ME process leads to a significant improvement
over ipel ME: there is a PSNR gain of more than 4dB for Foreman, and more than 2dB
for Football.
Following the ME process, a prediction of the WZ frame is obtained using a motion
extrapolation (MX) process, in which the qpel MVs are extrapolated in order to obtain an
extrapolated frame. This method works as follows. Suppose that we have two known frames
(called the anchor frame and the tracked frame) at the decoder, and we want to extrapolate
the third one (which will serve as an estimate of the WZ frame). The stages involved in this
MX process are:
1. Motion estimation Motion vectors are estimated for overlapping 8 × 8 blocks, for
the motion between the anchor and the tracked frames, using qpel ME, as described
earlier. This overlapping is obtained by using an offset parameter, which is a
two-dimensional vector $(o_x, o_y)$ that determines the 2D offset of the tracked frame from the
upper-left corner of the image in the x and y directions. The ME process is repeated
for several offset vectors, obtaining a set of MVs associated with each offset. Working
with various offset vectors and applying the ME process for each of them helps
prevent holes (pixels with no predictors) in the extrapolated frame (see stage 4).
2. Smoothing In order to obtain a better description of the motion, the motion field
(MF) is smoothed by replacing each MV with the median of this MV and its 4 nearest
MVs. This is necessary since the MVs between the frames are usually over-fitted after
ME, and the smoothing gives a more coherent description of the motion.
(a) Football - tracked frame
(b) Football - predicted tracked frame
(c) Foreman - tracked frame
(d) Foreman - predicted tracked frame
Figure 3.10: Motion estimation, qpel precision
(a) Football (prediction error: 29.14dB)
(b) Foreman (prediction error: 36.32dB)
Figure 3.11: Prediction error (normalized absolute value, offset of 150 gray levels was used
for better visualization), qpel precision
It should be noted that since we deal with vectors, the median vector is selected as the vector with
the minimal sum of (Euclidean) distances to the others; a sketch of this selection is given
after this list. Examples of motion fields, before and after smoothing, are depicted in
Figure 3.12. Our tests show that the median smoothing improves the quality of the
extrapolated frame by 0.4-0.5dB in PSNR, compared with not using smoothing at all. It also
provides better results, by 0.1-0.2dB in PSNR, than the averaging method, in which each MV is
replaced by the average of its 4 nearest neighbours.
3. Projection Assuming linear motion (from the anchor to the tracked frame), the pixels
of the tracked frame are projected onto the next (extrapolated) frame using the motion
vectors at qpel precision obtained before. These MVs are rounded so that they point
to the nearest grid point of the (non-interpolated) extrapolated frame.
4. Iterations In order to cover as many pixels as possible, the stages above are repeated
with different $(o_x, o_y)$ values (we use increasing steps of 2 pixels in both dimensions,
such that the frame is moved along the diagonal), until the percentage of non-covered
pixels falls under some threshold. Experimental results show that 2-3 iterations are
sufficient if a threshold of 1% is used.
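The vector-median selection referred to in the smoothing stage can be sketched as follows. This is a minimal sketch; the motion field is assumed to be stored as a (rows, cols, 2) array, and border blocks simply use the neighbours that exist:

```python
import numpy as np

def median_vector(vectors):
    """The vector with the minimal sum of Euclidean distances to the others."""
    V = np.asarray(vectors, dtype=np.float64)               # shape (n, 2)
    dists = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)
    return V[np.argmin(dists.sum(axis=1))]

def smooth_motion_field(mvs):
    """Replace each MV by the median of itself and its 4 nearest MVs."""
    out = mvs.astype(np.float64)
    R, C = mvs.shape[:2]
    for r in range(R):
        for c in range(C):
            nbrs = [mvs[r, c]]
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                if 0 <= r + dr < R and 0 <= c + dc < C:
                    nbrs.append(mvs[r + dr, c + dc])
            out[r, c] = median_vector(nbrs)
    return out
```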
Using the method above for MX, some pixels in the extrapolated frame may be predicted
from multiple pixels in the tracked frame, or from none, resulting in holes in the
extrapolated frame. The solutions used in these cases are as follows:
(a) Football - before smoothing
(b) Football - after smoothing
(c) Foreman - before smoothing
(d) Foreman - after smoothing
Figure 3.12: Motion field – smoothing (intermediate stage before extrapolation)
Figure 3.13: Overlapping extrapolated blocks
• Multiple predictions If the same pixel in the extrapolated frame is predicted by
multiple pixels (due to extrapolated MVs that point to overlapping projected blocks in
the extrapolated frame), as depicted in Figure 3.13, the average of the extrapolated values
of that pixel is used as the prediction.
• Non-covered pixels Each non-covered pixel is spatially interpolated, using the average
of the 4 nearest known (already extrapolated) pixels.
Results of this method are given in Figure 3.14. The extrapolation error can be seen in
Figure 3.15, where it should be noted that most of the error is concentrated at the edges,
since the description of the movement of edges using blocks is usually insufficient. Another
source of error stems from parts with complex motion that is not detected correctly, such
as the movement of the mouth in the Foreman sequence: the mouth opens slowly, but it is
also moving with the head. For reference, the first 3 frames of Football and Foreman can be
seen in Figure 3.16.
It is important to note that parts with complex motion will usually be intra-coded in our
DVC CODEC, assuming that the co-location differences in those parts are large. Thus, LORD
considers as reliable side information only the parts of the extrapolated frame that are
assumed to be the result of relatively slow motion from the previous frame, that is, only
blocks which are COSET-coded.
(a) Football - original frame
(b) Football - extrapolated frame
(c) Foreman - original frame
(d) Foreman - extrapolated frame
Figure 3.14: Motion extrapolation
(a) Football (prediction error: 19.3dB)
(b) Foreman (prediction error: 24.9dB)
Figure 3.15: Extrapolation error (normalized absolute value, offset of 150 gray levels was
used for better visualization), qpel precision
(a) Football 1
(b) Football 2
(c) Football 3
(d) Foreman 1
(e) Foreman 2
(f) Foreman 3
Figure 3.16: First 3 frames of Football and Foreman
3.3.3 Noise Correlation Model
Once the side information is created at the decoder, it is used to decode the COSET-coded
blocks. As mentioned earlier (see Sec. 3.2.2), only the first 15 AC coefficients of
these blocks are encoded, by sending the indices of their quantization intervals. In order to
determine the most probable location of each such coefficient inside the known interval, that
is, to "correct" the noisy (quantized) COSET coefficients, a noise model is used.
The noise model between the WZ frame and its prediction is obtained as follows. After
the ME process between the anchor and tracked frames, we consider the differences between
the tracked frame and its predicted version, where only the differences ($N$) associated with
blocks that are further extrapolated to COSET blocks in the WZ prediction are used. The
noise between the tracked frame and its prediction serves as an estimate of the noise
between the WZ frame ($X$) and its MX-predicted frame, which serves as the side information
($Y$).
Following a commonly used model [33], we assume that the noise $N = X - Y$, given
$Y = y$, is distributed according to a Laplace distribution, that is:
$$f_{X|y}(x) = f_N(x - y) = \frac{\alpha}{2}e^{-\alpha|x-y|} \tag{3.3.1}$$
where $\alpha$ is the Laplace parameter, calculated using the ML estimator, according to:
$$\alpha = \left(\frac{1}{K}\sum_{i=1}^{K}|x_i - \hat{\mu}|\right)^{-1} \tag{3.3.2}$$
where $x_i$ ($i = 1, 2, ..., K$) are the samples, and $\hat{\mu}$ is the median of these samples. $\alpha$ is
calculated separately for each of the WZ bands (2-16) in COSET mode.
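In code, the estimator (3.3.2) amounts to one line per band (a sketch; the guard against a degenerate, noise-free band is our addition):

```python
import numpy as np

def laplace_alpha(samples):
    """ML estimate of the Laplace parameter alpha, eq. (3.3.2): the
    reciprocal of the mean absolute deviation around the sample median."""
    x = np.asarray(samples, dtype=np.float64)
    mad = np.mean(np.abs(x - np.median(x)))
    return 1.0 / mad if mad > 0 else np.inf
```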
As can be seen in (3.3.1), we consider the (known) $y$ to be the center of the distribution.
Now, assuming that the boundaries of the quantization interval of the current DCT coefficient
are provided by the encoder (denoted $z_i$ and $z_{i+1}$), we can obtain an MMSE estimate of
the source $X$, using both these boundaries and the side information:
$$\hat{x} = E\left[x \mid x \in [z_i, z_{i+1}), y\right] = \frac{\int_{z_i}^{z_{i+1}} x f_{X|y}(x)\,dx}{\int_{z_i}^{z_{i+1}} f_{X|y}(x)\,dx} \tag{3.3.3}$$
where the integrals in (3.3.3) can be carried out analytically, resulting in a closed-form
expression for $\hat{x}$, which is essentially the centroid of the distribution over the given bin
$[z_i, z_{i+1})$ [44] (see Appendix B):
Figure 3.17: Quantization interval (assuming that y is within the interval)
Figure 3.18: Possible relations between the side information y and the quantization interval
$$\hat{x} = \begin{cases} z_i + \left(\dfrac{1}{\alpha} + \dfrac{\Delta}{1 - e^{\alpha\Delta}}\right) & \text{if } y < z_i \\[2ex] y + \dfrac{\left(\gamma + \frac{1}{\alpha}\right)e^{-\alpha\gamma} - \left(\delta + \frac{1}{\alpha}\right)e^{-\alpha\delta}}{2 - \left(e^{-\alpha\gamma} + e^{-\alpha\delta}\right)} & \text{if } y \in [z_i, z_{i+1}) \\[2ex] z_{i+1} - \left(\dfrac{1}{\alpha} + \dfrac{\Delta}{1 - e^{\alpha\Delta}}\right) & \text{if } y \ge z_{i+1} \end{cases} \tag{3.3.4}$$
where $\Delta = \Delta(V) = \frac{V}{2^{m-1}}$ is the quantization interval length, which depends on the
maximal DCT difference $V$ between co-located blocks (for simplicity we omit the index $k$
from $V$), $\gamma = y - z_i$ and $\delta = z_{i+1} - y$, as depicted in Figure 3.17. The three possible cases
for $y$ are depicted in Figure 3.18.
The following extreme cases should be noted. Let $g(\alpha; \Delta)$ denote the distance of $\hat{x}$ from
the bin boundaries in the cases $y < z_i$ and $y \ge z_{i+1}$:
$$g(\alpha; \Delta) = \frac{1}{\alpha} + \frac{\Delta}{1 - e^{\alpha\Delta}} = \frac{1 - e^{\alpha\Delta} + \alpha\Delta}{\alpha\left(1 - e^{\alpha\Delta}\right)} \tag{3.3.5}$$
Furthermore, denote by $h(\alpha; \gamma, \delta)$ the distance from the side information $y$ to $\hat{x}$ in the
case $y \in [z_i, z_{i+1})$:
$$h(\alpha; \gamma, \delta) = \frac{\left(\gamma + \frac{1}{\alpha}\right)e^{-\alpha\gamma} - \left(\delta + \frac{1}{\alpha}\right)e^{-\alpha\delta}}{2 - \left(e^{-\alpha\gamma} + e^{-\alpha\delta}\right)} \tag{3.3.6}$$
It can be shown that (see Appendix B):
$$\lim_{\alpha \to 0} g(\alpha; \Delta) = \frac{\Delta}{2}, \quad \lim_{\alpha \to \infty} g(\alpha; \Delta) = 0 \tag{3.3.7}$$
$$\lim_{\alpha \to 0} h(\alpha; \gamma, \delta) = \frac{\delta - \gamma}{2}, \quad \lim_{\alpha \to \infty} h(\alpha; \gamma, \delta) = 0 \tag{3.3.8}$$
That is, when $\alpha$ conveys no information about the noise distribution ($\alpha \to 0$), the MMSE
reconstructed value is simply the center of the quantization interval in the cases $y < z_i$ and
$y \ge z_{i+1}$, or $y + \frac{\delta - \gamma}{2}$ (i.e., $y$ is moved according to the relation between the lengths $\delta$
and $\gamma$) in the case $y \in [z_i, z_{i+1})$.
On the other hand, when the noise is highly localized ($\alpha \to \infty$), the reconstructed value
approaches the boundary of the quantization interval if $y < z_i$ or $y \ge z_{i+1}$, and approaches
the side information $y$ otherwise.
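A direct transcription of (3.3.4) into Python is given below (a minimal sketch; for very large $\alpha\Delta$ the exponentials should be guarded against overflow, which we omit here):

```python
import numpy as np

def mmse_reconstruct(y, z_lo, z_hi, alpha):
    """MMSE estimate of a coefficient known to lie in [z_lo, z_hi),
    given side information y and Laplace parameter alpha (eq. (3.3.4))."""
    width = z_hi - z_lo                                   # Delta
    if y < z_lo or y >= z_hi:
        g = 1.0 / alpha + width / (1.0 - np.exp(alpha * width))
        return z_lo + g if y < z_lo else z_hi - g
    gamma, delta = y - z_lo, z_hi - y
    num = ((gamma + 1.0 / alpha) * np.exp(-alpha * gamma)
           - (delta + 1.0 / alpha) * np.exp(-alpha * delta))
    den = 2.0 - (np.exp(-alpha * gamma) + np.exp(-alpha * delta))
    return y + num / den
```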
The estimated noise statistics between the WZ frame and the extrapolated frame are
given in Figure 3.19. This noise is obtained using a full search (at qpel resolution) between the
anchor and tracked frames, which results in a predictor for the tracked frame. It is compared
with the actual statistics between the extrapolated frame and the WZ frame, which are given
in Figure 3.20. Both statistics are shown with their fit to a Laplace distribution, where the
measure used for the quality of this fit is the root mean square (RMS) of the differences
between the fitted probability function and the empirical histogram (the smaller the RMS,
the better the approximation by the Laplace distribution).
It should be noted that other measures can be used, such as the Kullback-Leibler divergence,
but we chose the RMS measure due to its simplicity and the visually appealing results of
the fittings. As can be seen, the RMS values are small, meaning that the Laplace distribution
is indeed a reasonable model for the distribution of the noise. The fit of the first
15 AC coefficients to a Laplace distribution for Foreman and Football is given in Figure 3.21.
As expected, the distribution of the estimated noise is more localized than the actual one,
since it was obtained using full-search motion estimation, resulting in a relatively small
magnitude of the noise. This discrepancy may be corrected by multiplying the estimated
variance by an appropriate constant. This constant was found empirically to fall in
the range of 1.4-1.9. Its value can also be estimated by the decoder (for each DCT band),
using the last 3 decoded frames. In our implementation, a value of 1.5 was used.
The PSNR gain using MMSE reconstruction (3.3.4), compared to uniform quantization
of the first 15 AC coefficients, can be seen in Table 3.3. According to these experimental
results, the PSNR gain is 1.1-1.7dB. It should be noted that it is difficult to obtain an analytic
bound on the error $E(x - \hat{x})^2$, due to the unknown statistics of the side information $y$, as
the two cases $y \in [z_i, z_{i+1})$ and $y \notin [z_i, z_{i+1})$ also need to be treated differently.

Sequence      Average PSNR Gain
Foreman       1.43dB
Football      1.11dB
Coastguard    1.67dB

Table 3.3: MMSE reconstruction: average PSNR gain over uniform quantization
3.3.4 De-quantization and inverse DCT
Once the DCT coefficients of the WZ frame are reconstructed, using (3.3.4) and the
INTRA/SKIP blocks, they undergo de-quantization and inverse DCT (IDCT), as defined in the
JPEG standard, resulting in a reconstructed GOP in the pixel domain.
(a) Football - AC1
(d) Foreman - AC1
(b) Football - AC2
(e) Foreman - AC2
(c) Football - AC3
(f) Foreman - AC3
Figure 3.19: The noise between the anchor and the tracked frame, which serves as an estimate
of the noise between the SI and the WZ frame. The RMS value is given for the Laplacian
PDF fit of each histogram.
(a) Football - AC1
(b) Football - AC2
(c) Football - AC3
(d) Foreman - AC1
(e) Foreman - AC2
(f) Foreman - AC3
Figure 3.20: Actual noise between SI and WZ frame
(a) Foreman - Frame no. 3, actual noise
(b) Foreman - Frame no. 3, estimated noise
(c) Football - Frame no. 3, actual noise
(d) Football - Frame no. 3, estimated noise
Figure 3.21: Distribution of the noise (prediction error) between the WZ frame and the SI
(its prediction). The numbers in the legend denote the DCT bands in a zigzag order.
Chapter 4
Adaptation of LORD to Endoscopy
Video Compression
This chapter begins by providing background on endoscopy videos and on the Bayer color filter
array, which is the type of sensor used for acquiring endoscopy videos. We then show how
to adapt LORD to endoscopy videos, and to Bayer videos in general.
4.1 Endoscopy Videos
Endoscopy (meaning "looking inside") typically refers to looking inside the body for medical
reasons using an endoscope, an instrument used to examine the interior of a hollow organ
or cavity of the body. Unlike most other medical imaging devices, endoscopes are inserted
directly into the organ.
An endoscope consists of a long, thin, flexible tube that carries a light source and a video
camera. Images of the inside of the patient's body can be seen on a screen. These images
enable the examination of the interior surfaces of an organ or tissue. The endoscope can
also be used for performing biopsies and retrieving foreign objects.
The images recorded by an endoscope are obtained using a camera attached to it.
As the technology advances, the resolution of this sensor increases; nowadays,
such sensors can achieve a resolution of up to 2 megapixels.
Recently, there has been a shift to transmission of endoscopy videos over a wireless channel. This
enables the physician to work in a more sterile environment. However, the transmission rate
of video content during the endoscopy process is limited due to limited power resources; the
channel used usually provides a relatively low rate, compared with transmission of data over
a cable. A wireless endoscope is depicted in Figure 4.1. An example of an image
acquired during an endoscopy process is given in Figure 4.2.
Figure 4.1: Wireless endoscope
The limitations above pose a problem, especially when dealing with high-resolution endoscopy.
Therefore, an efficient compression scheme is needed, which meets the channel
constraints and also offers good reconstruction quality of the video.
This application fits well into the principles behind DVC. That is, there is a need
for a low-complexity encoder, while a high-complexity decoder is allowed, since the decoder is
essentially a computer connected to the endoscope. For this purpose, we adapt our LORD DVC
system to the compression of Bayer endoscopy videos. This is described in Section 4.3.
4.2 Bayer Format
In photography, a color filter array (CFA) is a mosaic of tiny color filters, placed over the
pixel sensors of an image sensor, to capture color information. Color filters are needed since
simply measuring the intensity of the light entering the sensor does not provide information
about the color. Thus, the needed color information is obtained by filtering the incoming
light, usually into its red, green and blue (RGB) components, which can be added together in
various ways to reproduce a broad array of colors.
A Bayer filter [11] mosaic is a CFA for arranging RGB color filters on a square grid of
photosensors (either CMOS or CCD). Its particular arrangement of color filters is used in
most single-chip digital image sensors used in digital cameras, camcorders and scanners to
create a color image. It is also commonly used in video recording devices used in modern
endoscopy operations.
Bayer CFA is composed of filter blocks of size 2 × 2, which are 50% green, 25% red and
25% blue. According to the order of the colors inside the block, we get different alignments,
called GBRG, GRBG, BGGR or RGGB, where each string represents the raster scan order of
the red, green, and blue sensors in each block in the CFA. An example of a specific alignment
(BGGR) is given in Figure 4.3.
(a) Raw (Bayer) endoscopy image (intensities of the pixels, 256 levels)
(b) Full color image
Figure 4.2: Images from endoscopy process, performed on a pig (Images courtesy Gyrus
ACMI, Inc.)
Figure 4.3: Bayer CFA, BGGR alignment (the basic block is encircled by a yellow rectangle)
This CFA uses twice as many green elements as red or blue, in order to conform with
the strong sensitivity of the human visual system (HVS) to green light. The luminance
perception of the human retina uses medium (M) and long (L) cone cells combined during
daylight vision, which are most sensitive to green light. This can be seen in Figure 4.4,
which depicts the response of the HVS to a range of wavelengths. The peak of the response
appears around 550nm, which corresponds to green light.
In a Bayer CFA, each physical pixel of the camera's sensor has an optical filter placed over
it, allowing penetration of only a particular light color (red, green or blue), as demonstrated
in Figure 4.5. An example of an image acquired using a Bayer CFA is given in Figure 4.6. The
blocks which constitute the Bayer CFA can be seen in this image.
The Bayer filter is almost universal in consumer digital cameras. Nevertheless, there
are a few alternatives to the Bayer CFA. One notable example is the Foveon X3 sensor, which
is able to record all three primary colors for each pixel, using three vertically stacked
photodiodes in each pixel sensor. The obvious advantage of this sensor is that it is no longer
needed to perform the demosaicing operation in order to obtain the full color information.
However, this sensor and some other similar sensors are much more expensive than a
Bayer sensor. Moreover, due to the need to obtain all three colors for the same pixel,
the noise in the acquisition process is more noticeable. There are also reported problems of
low performance under low-light conditions. Hence, the Bayer filter is still the dominant sensor
array used in commercially available cameras.
Due to the partial color information in each pixel, the other two primary colors need to be
estimated in order to have information about all three colors at every pixel. Bayer
demosaicing is the process of translating a Bayer array of primary colors into a final image
that contains full color information (RGB) at each pixel. We use an algorithm suggested by
Figure 4.4: Luminous efficiency curve [45]
Figure 4.5: Profile/cross-section of a Bayer filter (BGGR alignment)
(a) Bayer CFA image as a grayscale image (intensities of the
pixels, 256 levels)
(b) Left: Bayer CFA image as a color image (Each pixel has only one color component). Right: Enlargement of a
part of the color image.
Figure 4.6: Raw images acquired using Bayer CFA
Figure 4.7: Bayer and its decomposition into RGB components
Malvar et al. [46], which is described in Appendix A, along with several other demosaicing
algorithms.
4.3 Adaptation of LORD to Endoscopy Videos
Some adaptations are necessary when compressing Bayer endoscopy videos with
LORD, because of the special structure of this format. They include:
1. RGB Separation Each Bayer frame is decomposed into its RGB components, and the
encoder is fed with each component separately (see the sketch after this list). This way,
the spatial correlation of each color channel is exploited. An example of the decomposition
of a Bayer frame into its components is given in Figure 4.7. It can be seen that the green
component is twice as large as the red and blue components.
2. Rate-Distortion Optimization Assuming a GOP of size 2, there are three color
components per frame, resulting in a total of 6 color components in the GOP.
In order to decide on the appropriate bit distribution between these components, an
appropriate RDO scheme should be used. This scheme is described below, and is based
on the scheme described in Sec. 3.2.3.

Figure 4.8: Calculation process of PSNR between a Bayer image and its reconstructed version
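The RGB separation of stage 1 above reduces, for the BGGR alignment, to simple sub-sampling of the Bayer mosaic. The following is a minimal Python sketch; how the two green planes are packed into a single component is our assumption:

```python
import numpy as np

def split_bggr(bayer):
    """Split a BGGR Bayer frame into its R, G and B components."""
    b  = bayer[0::2, 0::2]        # top-left sample of each 2x2 block
    g1 = bayer[0::2, 1::2]        # top-right green
    g2 = bayer[1::2, 0::2]        # bottom-left green
    r  = bayer[1::2, 1::2]        # bottom-right red
    g  = np.hstack((g1, g2))      # green plane, twice the area of r and b
    return r, g, b
```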
The way in which the PSNR value is calculated (between the original frame and the
decompressed one) in the case of compressing a Bayer frame is depicted in Figure 4.8. As
can be seen in this figure, the calculation of the PSNR involves the transformation of the Bayer
frame (after demosaicing) from the RGB color space to the YCbCr color space, where the
quality comparison is performed between the Y components of the original frame and the
reconstructed one.
The relation between the RGB components and Y is obtained using the following weights
(according to the ITU-R BT.601 standard):
$$w_R = 0.299, \quad w_G = 0.587, \quad w_B = 0.114 \tag{4.3.1}$$
where:
$$Y = w_R R + w_G G + w_B B \tag{4.3.2}$$
An example of using the YCbCr color space is given in Figure 4.9. It is clear from this
example that the most significant component of YCbCr is Y, which is essentially a grayscale
version of the main image.
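The quality measure of Figure 4.8 therefore amounts to the following computation (a sketch; 8-bit demosaiced RGB frames of shape (H, W, 3) are assumed):

```python
import numpy as np

BT601 = np.array([0.299, 0.587, 0.114])   # w_R, w_G, w_B of eq. (4.3.1)

def y_psnr(rgb_ref, rgb_rec):
    """PSNR between the Y components of two demosaiced RGB frames."""
    y_ref = rgb_ref.astype(np.float64) @ BT601
    y_rec = rgb_rec.astype(np.float64) @ BT601
    mse = np.mean((y_ref - y_rec) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```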
(a) Full color image (RGB)
(b) Y component
(c) Cb component
(d) Cr component
Figure 4.9: RGB and YCbCr
The total distortion is the sum of the distortions incurred by the color components, when
each band in each color component (assuming a total of $P$ bands in the GOP) is assigned
$b_i$ ($i = 1, 2, ..., P$) bits. We will assume that a GOP of size 2 is used, so $P = 3 \cdot 2 \cdot 64 = 384$.
Taking the weights from (4.3.1) into account, the expectation of the squared error
$(Y - \hat{Y})^2$ is:
$$E\left[(Y - \hat{Y})^2\right] = E\left[\left(w_R(R - \hat{R}) + w_G(G - \hat{G}) + w_B(B - \hat{B})\right)^2\right]$$
$$= w_R^2 E\left[(R - \hat{R})^2\right] + w_G^2 E\left[(G - \hat{G})^2\right] + w_B^2 E\left[(B - \hat{B})^2\right] + \sum_{i \ne j} w_{C_i} w_{C_j} E\left[(C_i - \hat{C}_i)(C_j - \hat{C}_j)\right] \tag{4.3.3}$$
where $C_i \in \{R, G, B\}$ ($i = 1, 2, 3$). Assuming that the most significant part of the error is
contributed by the differences between color components of the same type, we approximate
(4.3.3) by:
$$E\left[(Y - \hat{Y})^2\right] \approx w_R^2 E\left[(R - \hat{R})^2\right] + w_G^2 E\left[(G - \hat{G})^2\right] + w_B^2 E\left[(B - \hat{B})^2\right] \tag{4.3.4}$$
It should be noted that it is possible to de-correlate the color components (e.g., by using
the Karhunen-Loève transform) so that the relation in (4.3.4) becomes exact. However,
the distribution of the DCT bands of the resulting de-correlated color components will not
necessarily be Laplace for the AC bands, making it difficult to employ our rate-distortion
model. Therefore, we use the approximation provided by (4.3.4).
As a result of (4.3.4), we weight the distortion of each color component $C$ by $w_C^2$
($C = R, G, B$). In order to simplify notation, we divide the total distortion of the GOP into the
distortion incurred by the key frame ($D_{key}$) and the distortion incurred by the WZ frame
($D_{wz}$):
$$D_{key} = w_R^2 \sum_{i=1}^{P/6} m_{R_i,key} h_i \sigma_i^2 2^{-2b_i} + w_G^2 \sum_{i=P/6+1}^{P/3} m_{G_i,key} h_i \sigma_i^2 2^{-2b_i} + w_B^2 \sum_{i=P/3+1}^{P/2} m_{B_i,key} h_i \sigma_i^2 2^{-2b_i} \tag{4.3.5}$$
$$D_{wz} = w_R^2 \sum_{i=P/2+1}^{2P/3} m_{R_i,wz} h_i \sigma_i^2 2^{-2b_i} + w_G^2 \sum_{i=2P/3+1}^{5P/6} m_{G_i,wz} h_i \sigma_i^2 2^{-2b_i} + w_B^2 \sum_{i=5P/6+1}^{P} m_{B_i,wz} h_i \sigma_i^2 2^{-2b_i} \tag{4.3.6}$$
where in each equation the three sums are the distortions from the R, G and B components,
respectively, $m_{C_i,key}$ denotes the number of (intra) blocks of color $C$ ($C = R, G, B$) in the
key frame (in our case, it is the same for each color), and $m_{C_i,wz}$ denotes the number of coded
blocks (i.e., not SKIP, where for COSET blocks only their intra parts are taken into account)
in the WZ frame, for each color $C$, calculated in a similar manner to the calculation of $m_i$
in Sec. 3.2.3.
Compared with the rate-distortion component used in LORD for standard videos (see
Sec. 3.2.3), the main difference in the case of Bayer videos is the separation between the
color components and the use of a different weight for each color. The total distortion is:
$$D = D_{key} + D_{wz} \tag{4.3.7}$$
which can be written more compactly as:
$$D = \sum_{i=1}^{P} w_i^2 m_i h_i \sigma_i^2 2^{-2b_i} \tag{4.3.8}$$
where $m_i, h_i, \sigma_i^2$ ($i = 1, 2, ..., P$) are defined as in Sec. 3.2.3, and $w_i^2$
denotes the squared weight from (4.3.2), chosen according to the color component associated
with the $i$th band. Taking into account the total number of bits allocated for the
first 15 AC coefficients in each COSET block, denoted by $COSET_{bits}$ (see Sec. 3.2.3), the
resulting optimization problem is:
$$\min_{b_i} D = \sum_{i=1}^{P} w_i^2 m_i h_i \sigma_i^2 2^{-2b_i} \quad \text{s.t.} \quad \sum_{i=1}^{P} b_i \le B - COSET_{bits} \tag{4.3.9}$$
The solution of this optimization problem is a simple extension of (3.2.7):
$$b_i = \bar{b} + \frac{1}{2}\log_2\frac{w_i^2}{W^2} + \frac{1}{2}\log_2\frac{m_i}{M} + \frac{1}{2}\log_2\frac{h_i}{H} + \frac{1}{2}\log_2\frac{\sigma_i^2}{\rho^2} \tag{4.3.10}$$
where:
$$\bar{b} = \frac{B - COSET_{bits}}{P}, \quad W^2 = \left(\prod_{i=1}^{P}w_i^2\right)^{1/P}, \quad M = \left(\prod_{i=1}^{P}m_i\right)^{1/P}, \quad H = \left(\prod_{i=1}^{P}h_i\right)^{1/P}, \quad \rho^2 = \left(\prod_{i=1}^{P}\sigma_i^2\right)^{1/P} \tag{4.3.11}$$
Finally, the bits are distributed among the color components by summing up the bits
allocated to each band in each color channel. In the case of a GOP of size 2, the available
bits are distributed among 6 color components.
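Since (4.3.10) only adds the color-weight term $\frac{1}{2}\log_2(w_i^2/W^2)$ to (3.2.7), the allocation sketch of Sec. 3.2.3 extends in one line (again a sketch, assuming all inputs are positive):

```python
import numpy as np

def allocate_bits_weighted(B, coset_bits, w2, m, h, var):
    """Per-band allocation b_i of eq. (4.3.10), for P = 384 bands (GOP of 2)."""
    arrays = [np.asarray(a, dtype=np.float64) for a in (w2, m, h, var)]
    b_bar = (B - coset_bits) / len(arrays[0])
    terms = [np.log2(a) for a in arrays]
    b = b_bar + 0.5 * sum(t - t.mean() for t in terms)
    return np.maximum(b, 0.0)
```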
Chapter 5
Experimental Results
In this chapter we present experimental tests and results of the compression scheme used in
LORD. The experiments were performed on standard test sequences, as well as on Bayer
endoscopy videos.
5.1 Standard Videos
In this section, we give the compression results obtained using LORD on the sequences
Foreman, Football and Coastguard. The resolution of the sequences is 176 × 144 (QCIF resolution,
which is commonly used in the literature for tests, for saving computation time), and their
frame rate is 15Hz. The first 100 frames of each sequence were encoded, and a GOP of
size 2 was used (IW structure, see Sec. 3.2.2).
First, the blocks in each WZ frame are classified into SKIP, COSET or INTRA mode.
The distribution of the modes is shown in Figure 5.1 and Figure 5.2. Examples of the bit
allocation scheme used and the PSNR results are given in Figure 5.3 and Figure 5.4. As can
be seen in these figures, the Foreman and Coastguard sequences are characterized by relatively
slow motion, resulting in a large fraction of SKIP and COSET blocks.
On the other hand, the Football sequence includes moving football players and fast
movement of the camera, resulting in fast motion. Moreover, the fraction of INTRA
blocks in Football drops abruptly for frames 35-55, due to a linear motion of the camera,
which results in many COSET blocks. In general, however, there is a large amount of INTRA
blocks in Football, especially at the beginning of this sequence. In all the sequences, the
fraction of COSET blocks is at least 0.3.
The results of encoding Foreman, Football and Coastguard using LORD are given in
Figure 5.5 and Figure 5.6. The performance is compared with INTRA (JPEG) encoding of all
the frames, where for a fair comparison the rate-distortion optimization (Sec. 3.2.3) was employed
in this mode as well, and with H.264/AVC (using an IPPP GOP structure and an INTRA-only mode).
As can be seen, LORD outperforms the intra coding mode in all cases. However,
H.264 IPPP (which is much more complex than our encoder and requires a
delay of 3 frames in the IPPP GOP structure) outperforms our system for all sequences. On the
other hand, the performance of H.264 INTRA is closer to LORD, especially in the cases of
Foreman and Coastguard, although H.264 INTRA has a higher computational complexity
than LORD (at the encoder), because of the intra-frame prediction used in this mode.
The PSNR gain (compared with INTRA) is about 2dB for Foreman, and up to 1dB
for Football. The most noticeable improvement is obtained for Coastguard, for which an
improvement of up to 4dB in PSNR is achieved. This can be explained by the linear motion
of the objects in this sequence, which is well suited to the side information creation process.
The maximal deviation of the bitrate from the predefined one is, in all cases, less than
1.5%.
It can be seen that the bit distribution scheme follows the content of each video
sequence well. We do not simply separate intra and WZ frames, but rather allocate bits according
to the actual content of the frames. For example, Foreman and Coastguard are characterized
by slow motion, and hence their bit distribution changes quickly between the intra
and WZ frames, due to the large fraction of SKIP and COSET blocks. However, in the case of
Football, there is fast movement that leads to a relatively large amount of INTRA blocks,
and hence the bit distribution is almost the same for both intra and WZ frames, apart from
several frames.
Our scheme also captures the changes in the video sequences. For example, it can be
readily seen that in the case of Coastguard, there is an abrupt change in both the block
classification and the bit distribution after frame number 70. This is due to the sudden vertical
movement of the camera in this part of the sequence.
(a) Foreman
(b) Football
Figure 5.1: Blocks classification - Foreman and Football
Figure 5.2: Blocks classification – Coastguard
(a) Foreman - bits allocation
(b) Foreman - PSNR
(c) Football - bits allocation
(d) Football - PSNR
Figure 5.3: Bits allocation and PSNR results – Foreman and Football, 1bpp
(a) Coastguard - bits allocation
(b) Coastguard - PSNR
Figure 5.4: Bits allocation and PSNR results – Coastguard, 1bpp
(a) Foreman
(b) Football
Figure 5.5: Compression results – Foreman and Football
Figure 5.6: Compression results – Coastguard
5.2 Bayer Endoscopy Videos
In this section, the performance of LORD is evaluated using Bayer endoscopy videos. Two
videos were used, simulating two typical endoscopy operations. The first one is a simulation of
a surgery, performed on a chicken, and the second one is a simulation of the detection of disorders
of the gastrointestinal tract (usually performed using capsule endoscopy [47]). Characteristic
frames of these simulations can be seen in Figure 5.7. The resolutions of the videos are 368 × 480
and 240 × 320, respectively. 100 frames from each video were used for the test, and a GOP of
size 2 was used.
The surgery simulation video includes the motion of a surgical instrument; besides that,
this video is slowly varying. Moreover, there are some dark parts in this video, due to the
limited light used in the endoscopy process (it should be noted that, for a fair comparison, we
removed a significant amount of the dark parts, which would otherwise be classified as SKIP).
This video is also characterized by many smooth parts, except for some edges caused by the
structure of the chicken and the surgical instrument. The second video is characterized by
fast motion, but is somewhat smoother, except for the last several frames, in which some
(synthetic) tumors are revealed.
We follow the compression scheme suggested for Bayer videos, as described in Section 4.3.
The block classification for each video is given in Figure 5.8 (averaged over the different
color channels). Examples of the bit allocation scheme used and the PSNR results are given in
Figure 5.9. As can be clearly seen, the available bits are distributed according to the weights
given in (4.3.1).
The results of encoding these video sequences using LORD are given in Figure 5.10. In
both cases, a significant PSNR gain over the INTRA-only mode is seen, where in the case of the
simulation on the chicken the gain is up to 5dB. The gain in the case of the second simulation
is significant too, though smaller on average by approximately 2dB than the first
one, due to the fast movement in this video. LORD also outperforms H.264 INTRA, and
this can be attributed to the bit allocation scheme used in LORD, which takes into account
the different weights of the color channels in Bayer videos.
(a) Simulation - surgery, Bayer
(b) Simulation - surgery, Demosaiced
(c) Simulation - gastrointestinal tract, Bayer
(d) Simulation - gastrointestinal tract, Demosaiced
Figure 5.7: Samples from endoscopy videos. Surgical instrument can be seen in the first two
images.
(a) Simulation - chicken
(b) Simulation - gastrointestinal tract
Figure 5.8: Blocks classification - Endoscopy videos
(a) Simulation - chicken, bits allocation
(b) Simulation - chicken, PSNR
(c) Simulation - gastrointestinal tract, bits allocation
(d) Simulation - gastrointestinal tract, PSNR
Figure 5.9: Bits allocation and PSNR results – Endoscopy videos, 2bpp
(a) Simulation - chicken
(b) Simulation - gastrointestinal tract
Figure 5.10: Compression results – endoscopy videos
Chapter 6
Conclusion
6.1
Summary
In this research work we focused on the design process of a distributed video coding system, resulting in a new DVC encoder, which we named LORD. During this process, we addressed several drawbacks of existing DVC systems. We propose an online, time-varying noise model, which adapts itself to the content of the video, rather than an offline model, which is unrealistic. Moreover, our model is more localized than the models used in some of the existing DVC systems, which work at the frame level rather than at the block level, as done in LORD.
The noise model is based on the assumption that the noise between the side information and the frame to be encoded follows a Laplace distribution. The Laplace parameter is estimated online at the coefficient level, instead of at the block or frame level as in existing DVC systems. Moreover, we treat the DC and the AC coefficients differently, taking into account their different distributions. We also take into account the low-energy nature of the AC coefficients with high (zigzag-order) indices, and encode them using standard JPEG even when they are part of a WZ (COSET) block.
We perform rate control at the encoder, in contrast to the rate control used in existing DVC systems, which is affected by the decoder and relies heavily on probability models; such control is inexact in nature, leading to a complex design of the required buffer. Our rate control is not affected by the decoder and is found to be exact.
Moreover, we do not use a feedback channel, which is almost universally employed in existing DVC systems. The use of this channel incurs an unknown delay in the encoding/decoding process, and is unsuitable for real-time tasks, or even for some offline tasks such as storage. By working without such a channel, we allow a real-time implementation of our codec.
In addition, we employ a rate-distortion optimization (RDO) process, for an optimized distribution of the available bits over the encoded frames. It should be noted that our RDO module works at the coefficient level, and takes into account the different coding modes used for each block. The RDO process involves only the calculation of the variances of the DCT coefficients, and the optimal distribution of the available bits is given by a closed-form expression. A comparison between the different DVC systems is given in Table 6.1.

Encoder    | Noise model | Rate control   | Feedback channel | Bayer support
-----------|-------------|----------------|------------------|---------------
LORD       | Online      | At the encoder | No               | Yes
DISCOVER   | Online      | At the decoder | Yes              | Not considered
Stanford   | Offline     | At the decoder | Yes              | Not considered
PRISM      | Offline     | Not reported   | No               | Not considered

Table 6.1: DVC encoders - comparison
Lastly, we adapted our encoder to Bayer endoscopy videos. This video format differs from standard video, since each pixel contains information on a single color only. The medical process of endoscopy fits well into our low-complexity DVC framework, since a wireless endoscope has limited video-encoding capabilities, due to its power consumption limitations.
We show how to treat each color component separately, and how to take HVS considerations into account during the rate-distortion optimization process in the Bayer case. The changes needed to the basic design of LORD are small, and our encoder (which is suited in principle to standard videos) can easily be adapted to handle Bayer videos as well.
As shown in this work, our encoder outperforms standard intra coding for both standard and Bayer videos, where in the case of Bayer endoscopy videos the PSNR gain is up to 5dB. This is achieved by a relatively low-complexity encoder, which is only slightly more complex than a JPEG encoder. As mentioned earlier, a DVC system enables shifting the computational load from the encoder to the decoder, and it fits well into the framework of compressing endoscopy videos.
In conclusion, this work focused on designing a feedback-less DVC system and on its adaptation to Bayer endoscopy videos. Solutions to the drawbacks existing in most DVC systems were presented. As always, there is still additional work that can be done to improve the performance of our codec. LORD is modular, allowing one to replace some of its components by others without necessarily affecting the functioning of the remaining components. Some additional ideas that can be applied to LORD are discussed in the next section.
6.2
Future Work
There are several possible directions for future work. The following ideas can be a basis for
further research:
1. Noise model: Our noise model is localized in the sense that it takes into account only blocks that are COSET encoded. However, it is possible to consider an even more localized model, which treats groups of COSET blocks separately, according to their temporal difference from their neighbouring blocks and according to their spatial content. In addition, different noise distribution models can also be tested.
2. Side information creation: LORD is well suited for video sequences with relatively low motion activity. In such cases, the linear motion assumption used in the extrapolation process is justified. However, in cases of fast motion this assumption may not hold, and more sophisticated SI creation components should be used.
3. Decision on coding modes: In our implementation, the coding modes are chosen according to a predefined threshold applied to the difference between co-located blocks. This threshold can be changed dynamically, according to the content of the video sequence. Moreover, measures other than the simple zero-motion difference can be used.
4. De-correlation of the RGB components in a Bayer frame: This is needed for a more accurate estimate of the squared error in the case of Bayer frames (see Sec. 4.3). After de-correlation, an appropriate distribution model should be used for the DCT bands of the de-correlated components.
As mentioned earlier, our codec is modular, enabling the replacement of individual components without affecting others. It also enables adapting each component to the complexity constraints at hand. For example, if the encoder is allowed to be somewhat more complex, some motion estimation can be used at the encoder, instead of relying entirely on the zero-motion difference between co-located blocks. Any sophisticated method can be used at the decoder for improving the quality of the side information, assuming that the decoder can be as complex as we wish. To sum up, further research may involve the improvement of individual components of LORD, with a performance evaluation after each change made.
Appendix A
CFA Demosaicing Algorithms
A demosaicing algorithm is a digital image processing method used to reconstruct a full-color image from the incomplete color samples output by an image sensor overlaid with a color filter array (CFA). In the case of a Bayer filter, the two missing colors have to be reconstructed for each pixel. Many modern digital cameras can save images in a raw format, allowing the user to demosaic them using software. The raw output of Bayer-filter cameras is referred to as a Bayer pattern image.
In order to obtain high-quality images as quickly as possible, a demosaicing algorithm should have the following traits:
• Avoiding false color artifacts, such as abrupt unnatural changes of intensity over a number of neighboring pixels.
• Maximum preservation of image resolution.
• Low computational complexity.
• Robustness to noise in the raw Bayer data.
Since the color subsampling of a CFA by its nature results in aliasing (especially in the
red and blue colors, which are sampled less frequently than the green color), an optical
anti-aliasing filter is typically placed in the optical path between the image sensor and the
lens.
There are many demosaicing algorithms which can be used for the interpolation of the
missing colors in a Bayer CFA. The demosaicing algorithms range from simple ones, which
use standard interpolation methods on each color channel, to more sophisticated ones, which
exploit the correlation between different color channels. In the following pages, a review of
commonly used interpolation methods used in demosaicing algorithms is provided. This
review is partially based on reviews given in [48–51].
Figure A.1: Bayer CFA 2 × 2 neighbourhood
Nearest-neighbour interpolation
The simplest of all interpolation algorithms is nearest-neighbour interpolation. In this method, the missing colors in each pixel are simply copied from an adjacent pixel of the same color channel (using a 2 × 2 neighbourhood, as depicted in Figure A.1). Usually, the missing green color in the red and blue pixels is taken as the average of the two known green pixels.
This method introduces significant color errors, especially along edges. However, since almost no calculations are performed, it may be beneficial in applications where low computational complexity is essential, such as low-power video imaging systems. It is unsuitable for any application where quality matters, but can be useful for generating an initial preview.
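As a minimal illustration of this copy-based scheme, a sketch is given below. The RGGB cell layout and the function name are assumptions of the example, not something fixed by the text above.

```python
import numpy as np

def nearest_neighbour_demosaic(cfa):
    """Nearest-neighbour demosaicing sketch for an assumed RGGB Bayer
    layout. Within each 2x2 cell, R and B are simply replicated, and
    the missing green is taken as the average of the two known greens,
    as described above. `cfa` is assumed to be a float 2D array."""
    h, w = cfa.shape
    rgb = np.zeros((h, w, 3), dtype=float)
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            r = cfa[i, j]                              # R sample of the cell
            g = (cfa[i, j + 1] + cfa[i + 1, j]) / 2.0  # average of the two greens
            b = cfa[i + 1, j + 1]                      # B sample of the cell
            rgb[i:i + 2, j:j + 2] = (r, g, b)          # copy over the whole 2x2 cell
    return rgb
```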
Linear Interpolation
Another simple interpolation algorithm is linear interpolation (sometimes referred to as bilinear interpolation, due to 2D grid used in images). A 3 × 3 neighbourhood is taken from
the CFA, as depicted in Figure A.2. The missing pixel color values are estimated by the
averaging of nearby known color values.
This interpolation can be performed by a convolution with the following kernels:

$$F_G = \frac{1}{4}\begin{bmatrix} 0 & 1 & 0 \\ 1 & 4 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \qquad F_{R/B} = \frac{1}{4}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}$$
where FG works on the green channel, and FR/B works on the red or the blue channels.
This convolution can be implemented efficiently using frequency filtering methods (see [52]).
As can be seen in the kernels above, lower weight is given to pixels that are far from the current pixel.

Figure A.2: Bayer CFA 3 × 3 neighbourhood

Figure A.3: Linear interpolation – (a) Original image, (b) Reconstructed image
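A short sketch of this kernel-based interpolation is given below. The boolean colour masks are assumed to be supplied by the caller, and scipy's 2D convolution stands in for the frequency-domain implementation mentioned above.

```python
import numpy as np
from scipy.signal import convolve2d

# The 3x3 kernels from the text: FG for the green plane, FR/B for red/blue.
FG = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0
FRB = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0

def bilinear_demosaic(cfa, masks):
    """Bilinear demosaicing sketch. `masks` (assumed given) maps 'R',
    'G', 'B' to boolean arrays marking each colour's CFA sample
    locations. Each plane is zero-filled outside its own samples and
    convolved: at sample positions the kernels reproduce the known
    value, and at the gaps they average the nearby known values."""
    planes = []
    for c, k in (("R", FRB), ("G", FG), ("B", FRB)):
        sparse = np.where(masks[c], cfa.astype(float), 0.0)
        planes.append(convolve2d(sparse, k, mode="same"))
    return np.dstack(planes)
```

The same routine extends to the 7 × 7 kernels given below by simply swapping in the larger FG and FR/B arrays.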
This interpolation method performs well in smooth areas, where the color changes slowly. However, along edges, where the color changes abruptly, false colors are introduced, resulting in a degraded quality of the reconstructed picture. These artifacts can be seen in Figure A.3. The low-pass nature of this demosaicing algorithm should also be noted.
An extended version of linear interpolation involves the use of a 7 × 7 neighbourhood.
The convolution kernels in this case are defined as follows:
$$F_G = \frac{1}{256}\begin{bmatrix}
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & -9 & 0 & -9 & 0 & 0 \\
0 & -9 & 0 & 81 & 0 & -9 & 0 \\
1 & 0 & 81 & 256 & 81 & 0 & 1 \\
0 & -9 & 0 & 81 & 0 & -9 & 0 \\
0 & 0 & -9 & 0 & -9 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix}$$

$$F_{R/B} = \frac{1}{256}\begin{bmatrix}
1 & 0 & -9 & -16 & -9 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
-9 & 0 & 81 & 144 & 81 & 0 & -9 \\
-16 & 0 & 144 & 256 & 144 & 0 & -16 \\
-9 & 0 & 81 & 144 & 81 & 0 & -9 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & -9 & -16 & -9 & 0 & 1
\end{bmatrix}$$
Linear interpolation with a 7 × 7 neighbourhood usually yields better results than a smaller neighbourhood. This is especially the case for high-resolution Bayer frames, in which neighbouring pixels are correlated even over a large neighbourhood.
Adaptive Color Plane Interpolation
In this method, it is assumed that the color channels (or color planes) are perfectly correlated in a small enough neighbourhood, in which the equations

$$G = B + b, \qquad G = R + r \tag{A.0.1}$$

hold for some constants b, r. These constants are estimated using arithmetic averages of the red and blue channels, and using an appropriately scaled second-derivative term for the green channel. The estimation of these constants also involves the use of edge classifiers, which are calculated in the horizontal and vertical directions using the gradient and the second derivative of the pixels in these directions.
Once the green plane is fully interpolated, the red and blue planes are interpolated next, using information from the interpolated green channel. The appropriate formulas are given in [46], Section 2.7.
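The exact formulas are given in the reference above. Purely as an illustration of the edge-classified interpolation idea, a common estimator of this family is sketched below (a Hamilton-Adams-style rule; the specific taps are an assumption of this sketch, not necessarily the variant referred to above).

```python
def green_at_r(cfa, i, j):
    """Illustrative adaptive interpolation of G at a red pixel (i, j)
    of an RGGB CFA (`cfa` is assumed to be a float 2D array). The edge
    classifiers combine a green gradient with a second derivative of
    red, as described above; interpolation follows the smoother
    direction."""
    g_w, g_e = cfa[i, j - 1], cfa[i, j + 1]  # horizontal green neighbours
    g_n, g_s = cfa[i - 1, j], cfa[i + 1, j]  # vertical green neighbours
    r_w, r_e = cfa[i, j - 2], cfa[i, j + 2]  # same-row red neighbours
    r_n, r_s = cfa[i - 2, j], cfa[i + 2, j]  # same-column red neighbours
    dh = abs(g_w - g_e) + abs(2 * cfa[i, j] - r_w - r_e)  # horizontal classifier
    dv = abs(g_n - g_s) + abs(2 * cfa[i, j] - r_n - r_s)  # vertical classifier
    if dh < dv:    # edge is more horizontal: interpolate along the row
        return (g_w + g_e) / 2 + (2 * cfa[i, j] - r_w - r_e) / 4
    if dv < dh:    # edge is more vertical: interpolate along the column
        return (g_n + g_s) / 2 + (2 * cfa[i, j] - r_n - r_s) / 4
    return (g_w + g_e + g_n + g_s) / 4 + (4 * cfa[i, j] - r_w - r_e - r_n - r_s) / 8
```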
Gradient-corrected bilinear interpolation
An improvement to the linear interpolation method was suggested in [46]. The assumption used in this method is that in a luminance/chrominance decomposition, the chrominance components do not vary much across pixels. This method exploits the inter-channel correlations between the different color channels, and uses the gradient within one color channel in order to correct the bilinearly interpolated value.
For example, in order to interpolate the green value at an R location ('+' pixel in Figure A.4), the bilinearly interpolated value ĝ_b(i, j) is corrected by a measure of the gradient of R at that location:

$$g_{\text{corrected}}(i,j) = \hat{g}_b(i,j) + \alpha \Delta_R \tag{A.0.2}$$

where

$$\Delta_R = r(i,j) - r_{\text{avg}} \tag{A.0.3}$$

and r_avg is the average of the 4 nearest red values. α is a gain factor that controls the intensity of the correction. The gradient corrections for B and R are calculated in a similar manner, and the relevant formulas are given in [46], with additional gain factors β and γ, corresponding to ∆_G and ∆_B, respectively. These gain factors were estimated using a data set of images, and were chosen experimentally as:

$$\alpha = 1/2, \quad \beta = 5/8, \quad \gamma = 3/4 \tag{A.0.4}$$

This method achieves an improvement of 5.5dB in PSNR over bilinear demosaicing, and outperforms many non-linear algorithms, while being relatively simple and amenable to linear implementation. It is used by the MATLAB function "demosaic".
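A sketch of the correction (A.0.2) for the G-at-R case is given below, with α = 1/2 as in (A.0.4). The red-location mask and the function name are assumptions of the example; the four nearest same-colour (red) neighbours sit two pixels away along the row and column of the Bayer grid.

```python
import numpy as np
from scipy.signal import convolve2d

def green_at_red_corrected(cfa, r_mask, g_hat, alpha=0.5):
    """Gradient correction (A.0.2)-(A.0.3) sketch for G at R locations.
    `g_hat` is the bilinearly interpolated green plane and `r_mask`
    (assumed given) marks the R sample positions of the CFA."""
    avg4 = np.array([[0, 0, 1, 0, 0],
                     [0, 0, 0, 0, 0],
                     [1, 0, 0, 0, 1],
                     [0, 0, 0, 0, 0],
                     [0, 0, 1, 0, 0]]) / 4.0
    r_plane = np.where(r_mask, cfa.astype(float), 0.0)
    r_avg = convolve2d(r_plane, avg4, mode="same")       # mean of the 4 nearest R's
    delta_r = cfa - r_avg                                # gradient measure Delta_R
    g = g_hat.copy()
    g[r_mask] = g_hat[r_mask] + alpha * delta_r[r_mask]  # equation (A.0.2)
    return g
```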
A demonstration of the result of this algorithm is provided in Figure A.5, where the original image was acquired using a Bayer sensor during an endoscopy process simulating a medical operation.
Figure A.4: Part of Bayer CFA
(a) Bayer image (as a grayscale image)
(b) Demosaiced image using gradient-corrected bilinear interpolation
Figure A.5: Bayer frame from endoscopy video (images courtesy Gyrus ACMI, Inc.)
Post Processing
There are two main methods which are used after the demosaicing process, in order to
improve the quality of the resulting image. These methods are:
1. Local Color Ratio Based Post-Processing: This method uses the mean values in some predefined regions of each color channel. These values are used for the correction of unnatural changes in color values, by smoothing the color-ratio planes [53] (the ratios between different color channels). Tests show that this method performs well in removing false colors, which occur mainly along edges.
2. Median Filtering: The relation introduced by equation (A.0.1) can be imposed by gathering all the color-difference values over a square neighbourhood around a pixel. Once the differences are gathered, their median is calculated and then used as an approximation of what the current pixel's color difference should be. It is best applied only to edge pixels. A sketch of this step is given after this list.
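A minimal sketch of the median step follows; the neighbourhood size and the decision to filter the whole image (rather than edge pixels only) are simplifications of the example.

```python
import numpy as np
from scipy.ndimage import median_filter

def median_postprocess(rgb, size=5):
    """Enforce the constant colour-difference assumption (A.0.1) by
    median-filtering the G-R and G-B difference planes over a square
    neighbourhood, then rebuilding R and B from the smoothed
    differences. `rgb` is assumed to be a float HxWx3 array."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    dr = median_filter(g - r, size=size)  # median of local G-R differences
    db = median_filter(g - b, size=size)  # median of local G-B differences
    out = rgb.copy()
    out[..., 0] = g - dr                  # corrected R
    out[..., 2] = g - db                  # corrected B
    return out
```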
Performance
In order to evaluate the performance of the demosaicing algorithms mentioned earlier, a Bayer image was acquired by sub-sampling a color image. Several tests were conducted in [48] and in [46], such as a PSNR comparison between the color image and the interpolated image, a blur measure (using information on the average edge width), an edge slope measure (using information on the edge height and width), and color artifacts at edges.
According to the results of the tests above, it was shown that the gradient-corrected bilinear interpolation algorithm provides the best average results, despite its low complexity compared with more sophisticated demosaicing algorithms. An example of the improvement of this algorithm over a bilinear one can be seen in Figure A.6. Since in a DVC system we are interested in low-complexity algorithms, this algorithm suits our needs and is used in our proposed DVC CODEC. As mentioned earlier, it is the default demosaicing algorithm in MATLAB.
(a) Original image
(b) Demosaiced image using bilinear interpolation
(c) Demosaiced image using gradient-corrected bilinear interpolation
Figure A.6: Results of bilinear and gradient-corrected bilinear interpolation methods [46].
Note the significant reduction of false colors (circled) when using the second method.
Appendix B
MMSE Reconstruction Using Side
Information
Assume that we want to decode a WZ frame $X$, distributed according to a Laplace distribution (with a known parameter $\alpha$) given the side information $Y$:

$$f_{X|y}(x) = \frac{\alpha}{2}\, e^{-\alpha|x-y|} \tag{B.0.1}$$
If we know the quantization interval [zi , zi+1 ) in which x resides, we can get an MMSE
estimate of the source x:
$$\hat{x}_{\mathrm{mmse}} = E\left[x \,\middle|\, x \in [z_i, z_{i+1}),\, y\right] = \frac{\displaystyle\int_{z_i}^{z_{i+1}} x\, f_{X|y}(x)\,dx}{\displaystyle\int_{z_i}^{z_{i+1}} f_{X|y}(x)\,dx} \tag{B.0.2}$$
Plugging (B.0.1) into (B.0.2), and denoting the numerator and denominator of (B.0.2) by $N$ and $D$ respectively, we get:

$$N = \frac{\alpha}{2}\int_{z_i}^{z_{i+1}} x\, e^{-\alpha|x-y|}\,dx \tag{B.0.3}$$

$$D = \frac{\alpha}{2}\int_{z_i}^{z_{i+1}} e^{-\alpha|x-y|}\,dx \tag{B.0.4}$$
For $y \notin [z_i, z_{i+1})$, $N$ and $D$ can be written as [44]:

$$N = \frac{(1 \pm \alpha z_i)\, e^{\pm\alpha(y - z_i)} - (1 \pm \alpha z_{i+1})\, e^{\pm\alpha(y - z_{i+1})}}{2\alpha} \tag{B.0.5}$$

$$D = \pm\, \frac{e^{\pm\alpha(y - z_i)} - e^{\pm\alpha(y - z_{i+1})}}{2} \tag{B.0.6}$$
where + and − correspond to the cases $y < z_i$ and $y \geq z_{i+1}$, respectively. For $y \in [z_i, z_{i+1})$, (B.0.3) can be decomposed into two parts:

$$N = \frac{\alpha}{2}\int_{z_i}^{y} x\, e^{-\alpha|x-y|}\,dx + \frac{\alpha}{2}\int_{y}^{z_{i+1}} x\, e^{-\alpha|x-y|}\,dx \tag{B.0.7}$$

where each of the integrals in (B.0.7) is given by (B.0.5), with either + or −. $D$ can be represented in a similar way, using (B.0.6). Finally, using the decomposition rule in (B.0.7) and (B.0.2), we get:

$$\hat{x} = \begin{cases}
z_i + \dfrac{1}{\alpha} + \dfrac{\Delta}{1 - e^{\alpha\Delta}} & \text{if } y < z_i \\[2ex]
y + \dfrac{\left(\gamma + \frac{1}{\alpha}\right) e^{-\alpha\gamma} - \left(\delta + \frac{1}{\alpha}\right) e^{-\alpha\delta}}{2 - \left(e^{-\alpha\gamma} + e^{-\alpha\delta}\right)} & \text{if } y \in [z_i, z_{i+1}) \\[2ex]
z_{i+1} - \dfrac{1}{\alpha} - \dfrac{\Delta}{1 - e^{\alpha\Delta}} & \text{if } y \geq z_{i+1}
\end{cases} \tag{B.0.8}$$
where ∆ is the quantization interval, γ = y − zi and δ = zi+1 − y.
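For concreteness, a direct transcription of (B.0.8) is sketched below. The scalar interface and the function name are assumptions of the example; a decoder would apply this per DCT coefficient, with α estimated online.

```python
import numpy as np

def mmse_reconstruct(y, z_lo, z_hi, alpha):
    """Closed-form MMSE reconstruction (B.0.8) of a Laplacian source,
    given the side information y and the quantization bin [z_lo, z_hi),
    with Laplace parameter alpha > 0."""
    delta = z_hi - z_lo
    # Offset g(alpha; Delta) of (B.0.9) from the nearest bin boundary.
    g = 1.0 / alpha + delta / (1.0 - np.exp(alpha * delta))
    if y < z_lo:
        return z_lo + g
    if y >= z_hi:
        return z_hi - g
    gamma, dlt = y - z_lo, z_hi - y
    num = (gamma + 1 / alpha) * np.exp(-alpha * gamma) \
        - (dlt + 1 / alpha) * np.exp(-alpha * dlt)
    den = 2.0 - (np.exp(-alpha * gamma) + np.exp(-alpha * dlt))
    return y + num / den  # offset h(alpha; gamma, delta) of (B.0.10)
```

For example, mmse_reconstruct(0.3, 0.0, 1.0, alpha=10.0) stays close to y = 0.3, while small values of alpha pull the estimate towards the bin middle, in agreement with the limits derived next.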
There are two extreme cases, in which it is interesting to calculate x̂. The first one is
α → ∞ (the noise is highly localized), and the other one is α → 0 (there is no information
about the noise distribution).
Let g (α; ∆) denote the distance of x̂ from the bin boundaries in the cases y < zi and
y ≥ zi+1 :
$$g(\alpha; \Delta) = \frac{1}{\alpha} + \frac{\Delta}{1 - e^{\alpha\Delta}} = \frac{1 - e^{\alpha\Delta} + \alpha\Delta}{\alpha\left(1 - e^{\alpha\Delta}\right)} \tag{B.0.9}$$
Furthermore, denote by h (α; γ, δ) the distance from the side information y to x̂ in the
case y ∈ [zi , zi+1 ):
$$h(\alpha; \gamma, \delta) = \frac{\left(\gamma + \frac{1}{\alpha}\right) e^{-\alpha\gamma} - \left(\delta + \frac{1}{\alpha}\right) e^{-\alpha\delta}}{2 - \left(e^{-\alpha\gamma} + e^{-\alpha\delta}\right)} = \frac{(\gamma\alpha + 1)\, e^{-\alpha\gamma} - (\delta\alpha + 1)\, e^{-\alpha\delta}}{2\alpha - \alpha\left(e^{-\alpha\gamma} + e^{-\alpha\delta}\right)} \tag{B.0.10}$$
It is easy to see that:
$$\lim_{\alpha\to\infty} g(\alpha;\Delta) = 0, \qquad \lim_{\alpha\to\infty} h(\alpha;\gamma,\delta) = 0 \tag{B.0.11}$$

That is, when the noise is highly localized ($\alpha \to \infty$), $\hat{x}$ is chosen as the bin boundary closest to the side information in the cases $y < z_i$ and $y \geq z_{i+1}$, and simply as $y$ otherwise.
In the case $\alpha \to 0$, L'Hôpital's rule has to be applied:

$$\lim_{\alpha \to 0} g(\alpha;\Delta) = \lim_{\alpha \to 0} \frac{1 - e^{\alpha\Delta} + \alpha\Delta}{\alpha\left(1 - e^{\alpha\Delta}\right)} = \lim_{\alpha \to 0} \frac{-\Delta^2 e^{\alpha\Delta}}{-2\Delta e^{\alpha\Delta} - \alpha\Delta^2 e^{\alpha\Delta}} = \frac{\Delta}{2} \tag{B.0.12}$$

$$\lim_{\alpha \to 0} h(\alpha;\gamma,\delta) = \lim_{\alpha \to 0} \frac{(\gamma\alpha + 1)\, e^{-\alpha\gamma} - (\delta\alpha + 1)\, e^{-\alpha\delta}}{2\alpha - \alpha\left(e^{-\alpha\gamma} + e^{-\alpha\delta}\right)} = \lim_{\alpha \to 0} \frac{\alpha\left(\delta^2 e^{-\alpha\delta} - \gamma^2 e^{-\alpha\gamma}\right)}{2 - \left(e^{-\alpha\gamma} + e^{-\alpha\delta}\right) + \alpha\left(\gamma e^{-\alpha\gamma} + \delta e^{-\alpha\delta}\right)} = \frac{\delta^2 - \gamma^2}{2(\gamma + \delta)} = \frac{\delta - \gamma}{2} \tag{B.0.13}$$

where the last equality in (B.0.13) follows from expanding the exponentials to first order in $\alpha$.
Hence, when $\alpha \to 0$, the reconstructed value $\hat{x}$ in the cases $y < z_i$ and $y \geq z_{i+1}$ is the middle of the quantization bin; otherwise it is $y + \frac{\delta - \gamma}{2} = \frac{z_i + z_{i+1}}{2}$, i.e., again the middle of the bin.
Appendix C
Rate-Distortion Model
Suppose that we have a set of $k$ random variables $X_1, X_2, \ldots, X_k$, each with zero mean and with variance $E[X_i^2] = \sigma_i^2$, for $i = 1, 2, \ldots, k$. The quadratic distortion associated with the quantization of each $X_i$ is:

$$D_i = E\left[(X_i - Q(X_i))^2\right] = \sum_j \int_{R_j^i} (x - Q(x))^2 f_{X_i}(x)\,dx \tag{C.0.1}$$

where $R_j^i$ denotes the partition cells of coefficient $i$.
Assuming that the step sizes of the quantizers are sufficiently small (high-resolution quantization), and assuming that $b_i$ bits are allocated to the $i$-th band (i.e., $2^{b_i}$ levels), the integral in (C.0.1) can be approximated as (assuming that the overload distortion is negligible):

$$D_i(b_i) \approx h_i \sigma_i^2\, 2^{-2b_i} \tag{C.0.2}$$
where the constants $h_i$ are determined by the pdf $f_i(x)$ of the normalized random variable $X_i/\sigma_i$ (i.e., $\tilde{f}_X(y) = \sigma_X f_X(\sigma_X y)$):

$$h_i = \frac{1}{12}\left\{\int_{-\infty}^{\infty} \left[f_i(x)\right]^{1/3} dx\right\}^3 \tag{C.0.3}$$
The approximation in (C.0.2) is accurate for high-rate quantizers, but it often serves as a good approximation even in medium to low rate cases. For example, the constant $h_i$ can be calculated analytically for the Gaussian and Laplacian distributions:

$$h_G = \frac{1}{12}\left\{\int_{-\infty}^{\infty} \left[\frac{e^{-x^2/2}}{\sqrt{2\pi}}\right]^{1/3} dx\right\}^3 = \frac{\sqrt{3}\,\pi}{2} \tag{C.0.4}$$

$$h_L = \frac{1}{12}\left\{\int_{-\infty}^{\infty} \left[\frac{\sqrt{2}}{2}\, e^{-\sqrt{2}\,|x|}\right]^{1/3} dx\right\}^3 = \frac{9}{2} \tag{C.0.5}$$
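Both closed forms are easy to verify numerically; the short sketch below evaluates (C.0.3) by quadrature for the unit-variance Gaussian and Laplacian densities.

```python
import numpy as np
from scipy.integrate import quad

def h_constant(pdf):
    """Numerical evaluation of (C.0.3) for a unit-variance pdf:
    h = (1/12) * (integral of pdf(x)**(1/3) dx) ** 3."""
    integral, _ = quad(lambda x: pdf(x) ** (1.0 / 3.0), -np.inf, np.inf)
    return integral ** 3 / 12.0

gauss = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
laplace = lambda x: (np.sqrt(2) / 2) * np.exp(-np.sqrt(2) * abs(x))

print(h_constant(gauss), np.sqrt(3) * np.pi / 2)  # ~2.7207 in both cases, cf. (C.0.4)
print(h_constant(laplace), 9 / 2)                 # ~4.5 in both cases, cf. (C.0.5)
```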
In a case in which we are limited by an overall target bit budget $B$ for encoding the random variables, we would like to distribute $B$ between them such that the resulting distortion is minimal. The following results regarding bit allocation optimization under a rate constraint are based partially on [36, 37].
The optimization problem to be solved is:

$$\min_{b_i} D = \sum_{i=1}^{k} h_i \sigma_i^2\, 2^{-2b_i} \quad \text{s.t.} \quad \sum_{i=1}^{k} b_i \leq B \tag{C.0.6}$$
We can further extend the optimization problem by assuming that the overall distortion is weighted, where the weights $w_i$ are specified (for example, a different weight can be assigned to different DCT coefficients, according to their importance to the HVS). The optimization problem then becomes:

$$\min_{b_i} D = \sum_{i=1}^{k} w_i h_i \sigma_i^2\, 2^{-2b_i} \quad \text{s.t.} \quad \sum_{i=1}^{k} b_i \leq B \tag{C.0.7}$$
The Lagrangian associated with the optimization problem (C.0.7) is:

$$J(b_i, \lambda) = \sum_{i=1}^{k} w_i h_i \sigma_i^2\, 2^{-2b_i} + \lambda\left(\sum_{i=1}^{k} b_i - B\right) \tag{C.0.8}$$
Differentiating $J$ with respect to $b_i$ and $\lambda$, and equating the derivatives to 0, leads to the solution:

$$b_i = \bar{b} + \frac{1}{2}\log_2\frac{\sigma_i^2}{\rho^2} + \frac{1}{2}\log_2\frac{w_i}{W} + \frac{1}{2}\log_2\frac{h_i}{H} \tag{C.0.9}$$

where:

$$\bar{b} = \frac{B}{k}, \qquad \rho^2 = \left(\prod_{i=1}^{k}\sigma_i^2\right)^{1/k}, \qquad W = \left(\prod_{i=1}^{k} w_i\right)^{1/k}, \qquad H = \left(\prod_{i=1}^{k} h_i\right)^{1/k} \tag{C.0.10}$$
That is, each random variable is first assigned the average number of bits, and then correction terms relative to the geometric means of the $\sigma_i^2$, $w_i$ and $h_i$ are added. The minimum overall distortion attained with this solution is given by:

$$D = k H W \rho^2\, 2^{-2\bar{b}} \tag{C.0.11}$$
In this case, each individual quantizer contributes an average distortion that is inversely proportional to the corresponding weight value $w_i$. Thus, a quantizer whose weight value is large (compared with the geometric mean $W$) is assigned a relatively important role in achieving the overall quality objective. For the optimal allocation, this quantizer will therefore have a relatively small average distortion.
It should be noted, however, that the assumption that the distortion behaves according to (C.0.2) is an approximation. In practice, the distortion as a function of the available bits depends strongly on the specific algorithm used for the compression. Each algorithm has its own R-D curve, which may not have an analytic representation. Thus, the solution (C.0.9) can serve as an estimate of how "difficult" it is to compress each component, and therefore of how to distribute the available bits.
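A small sketch of the closed-form allocation (C.0.9)-(C.0.10) is given below. The function name and the example numbers are illustrative; as noted above, a practical coder would further clip negative allocations and round to admissible rates.

```python
import numpy as np

def allocate_bits(B, var, w=None, h=None):
    """Closed-form bit allocation (C.0.9): each source gets the average
    rate b_bar = B/k plus log-ratio corrections of its variance, weight
    and pdf constant relative to their geometric means (C.0.10). The
    returned b_i are real-valued and sum to B."""
    k = len(var)
    var = np.asarray(var, dtype=float)
    w = np.ones(k) if w is None else np.asarray(w, dtype=float)
    h = np.ones(k) if h is None else np.asarray(h, dtype=float)
    b_bar = B / k
    rho2 = np.exp(np.mean(np.log(var)))  # geometric mean of the variances
    W = np.exp(np.mean(np.log(w)))       # geometric mean of the weights
    H = np.exp(np.mean(np.log(h)))       # geometric mean of the h constants
    return b_bar + 0.5 * (np.log2(var / rho2) + np.log2(w / W) + np.log2(h / H))

# Example: four DCT bands, 16 bits overall, HVS-style weights.
print(allocate_bits(16, var=[100.0, 25.0, 9.0, 1.0], w=[4, 2, 1, 1]))
```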
Bibliography
[1] T. Wiegand, G.J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC
video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560 –576, July 2003.
[2] Choi et al. Fast motion estimation with modified diamond search for variable motion
block sizes. In Proceedings of ICIP, pages 371–374, September 2003.
[3] Z. Wei, K. Lam Tang, and K.N. Ngan. Implementation of H.264 on mobile device. IEEE
Transactions on Consumer Electronics, 53(3):1109 –1116, August 2007.
[4] D. Slepian and J. Wolf. Noiseless coding of correlated information sources. IEEE
Transactions on Information Theory, 19(4):471–480, Jul 1973.
[5] A. Wyner and J. Ziv. The rate-distortion function for source coding with side information at the decoder. IEEE Transactions on Information Theory, 22(1):1–10, Jan
1976.
[6] R. Puri and K. Ramchandran. PRISM: An uplink-friendly multimedia coding paradigm. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), volume 4, pages 856–859, April 2003.
[7] R. Puri, A. Majumdar, and K. Ramchandran. PRISM: A video coding paradigm with motion estimation at the decoder. IEEE Transactions on Image Processing, 16(10):2436–2448, October 2007.
[8] A. Aaron, R. Zhang, and B. Girod. Wyner-Ziv coding of motion video. In Proc. Asilomar Conference on Signals and Systems, pages 240–244, 2002.
[9] A. Aaron, S. D. Rane, E. Setton, and B. Girod. Transform-domain Wyner-Ziv codec for
video. In S. Panchanathan & B. Vasudev, editor, Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, volume 5308 of Society of Photo-Optical
Instrumentation Engineers (SPIE) Conference Series, pages 520–528, January 2004.
[10] X. Artigas, J. Ascenso, M. Dalai, S. Klomp, D. Kubasov, and M. Ouaret. The DISCOVER codec: Architecture, Techniques and Evaluation. In Picture Coding Symposium
(PCS’07), 2007.
[11] B. E. Bayer. Color imaging array. U.S. Patent 3,971,065, 1976.
[12] T. M. Cover and J. A. Thomas. Elements of information theory, second edition. John
Wiley and Sons, New York, NY, USA, 2006.
[13] T. Cover. A proof of the data compression theorem of Slepian and Wolf for ergodic
sources (Corresp.). IEEE Transactions on Information Theory, 21(2):226 – 228, Mar
1975.
[14] B. Girod, A.M. Aaron, S. Rane, and D. Rebollo-Monedero. Distributed video coding.
Proceedings of the IEEE, 93(1):71 –83, Jan. 2005.
[15] S. S. Pradhan and K. Ramchandran. Distributed source coding using syndromes (DISCUS): design and construction. IEEE Transactions on Information Theory, 49(3):626
– 643, Mar 2003.
[16] C. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon limit error-correcting coding and decoding: Turbo-codes (1). In IEEE International Conference on Communications, volume 2, pages 1064–1070, May 1993.
[17] R. G. Gallager. Low Density Parity Check Codes. PhD thesis, M.I.T., 1963.
[18] S. S. Pradhan, J. Chou, and K. Ramchandran. Duality between source coding and channel coding and its extension to the side information case. IEEE Transactions on Information Theory, 49(5):1181–1203, May 2003.
[19] R. Zamir, S. Shamai, and U. Erez. Nested linear/lattice codes for structured multiterminal binning. IEEE Transactions on Information Theory, 48(6):1250 –1276, Jun
2002.
[20] H. H. Permuter and I. Naiss. Extension of the blahut-arimoto algorithm for maximizing
directed information. In 48th Annual Allerton Conference on Communication, Control,
and Computing (Allerton), 2010.
[21] F. Dupuis, W. Yu, and F. M. J. Willems. Blahut-arimoto algorithms for computing
channel capacity and rate-distortion with side information. In International Symposium
on Information Theory, ISIT 2004.
[22] J. Ascenso et al. The VISNET II DVC CODEC: Architecture, tools and performance.
In Proc. of the 18th European Signal Processing Conference (EUSIPCO), 2010.
[23] J. Ascenso, C. Brites, and F. Pereira. Content adaptive Wyner-Ziv video coding driven by motion activity. In IEEE International Conference on Image Processing, pages 605–608, Oct. 2006.
[24] R. Martins, C. Brites, J. Ascenso, and F. Pereira. Adaptive deblocking filter for transform domain Wyner-Ziv video coding. IET Image Processing, 3(6):315–328, December 2009.
[25] R. J. McEliece, D. J. C. MacKay, and J. Cheng. Turbo decoding as an instance of Pearl's belief propagation algorithm. IEEE Journal on Selected Areas in Communications, 16:140–152, 1998.
[26] S. Lin and D. J. Costello. Error Control Coding. Prentice Hall, 2004.
[27] G. Castagnoli, J. Ganz, and P. Graber. Optimum cycle redundancy-check codes with
16-bit redundancy. Communications, IEEE Transactions on, 38(1):111 –114, jan 1990.
[28] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260 –269, April
1967.
[29] J. Ascenso, C. Brites, and F. Pereira. Content adaptive Wyner-Ziv video coding driven by motion activity. In IEEE International Conference on Image Processing, pages 605–608, Oct. 2006.
[30] C. Brites, J. Ascenso, and F. Pereira. Studying temporal correlation noise modeling for pixel based Wyner-Ziv video coding. In IEEE International Conference on Image Processing, pages 273–276, Oct. 2006.
[31] P. L. Dragotti and M. Gastpar. Distributed Source Coding: Theory, Algorithms and
Applications. Academic Press, 2009.
[32] K. Fagervik and A.S. Larssen. Performance and complexity comparison of low density
parity check codes and turbo codes. In Nordic Signal Processing Symposium (NORSIG),
2003.
[33] F. Dufaux et al. Distributed video coding: Trends and perspectives. In EURASIP
Journal on Image and Video Processing (special issue on DVC), 2010.
[34] H. Stark and J. W. Woods. Probability, Statistics, and Random Processes for Engineers,
4th edition. Prentice Hall, 2011.
[35] J. E. Fowler. An Implementation of PRISM Using QccPack. Technical report, Mississippi State ERC, Mississippi State University, 2005.
[36] N. S. Jayant and P. Noll. Digital Coding of Waveforms. Prentice Hall, 1984.
[37] A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Kluwer
academic publishers, 1992.
[38] R. Reininger and J. Gibson. Distributions of the Two-Dimensional DCT Coefficients
for Images. IEEE Transactions on Communications, 31(6):835 – 839, June 1983.
[39] H. Hsueh-Ming and C. Jiann-Jone. Source model for transform video coder and its
application. IEEE Transactions on Circuits and Systems for Video Technology, 7(2):287
–298, 1997.
[40] H. Zhihai and S. K. Mitra. A linear source model and a unified rate control algorithm for
DCT video coding. IEEE Transactions on Circuits and Systems for Video Technology,
12(11):970 – 982, 2002.
[41] T. Chiang and Y. Q. Zhang. A new rate control scheme using quadratic rate distortion
model. IEEE Transactions on Circuits and Systems for Video Technology, 7(1):246
–250, Feb 1997.
[42] L. Hung-Ju, T. Chiang, and Y. Q. Zhang. Scalable rate control for MPEG-4 video.
IEEE Transactions on Circuits and Systems for Video Technology, 10(6):878 –894, Sep
2000.
[43] J. Ribas-Corbera and L. Shawmin. Rate control in DCT video coding for low-delay
communications. IEEE Transactions on Circuits and Systems for Video Technology,
9(1):172 –185, Feb 1999.
[44] D. Kubasov, J. Nayak, and C. Guillemot. Optimal reconstruction in Wyner-Ziv video
coding with multiple side information. In IEEE 9th Workshop on Multimedia Signal
Processing, pages 183–186, Oct. 2007.
[45] J. W. Woods. Multidimensional Signal, Image, and Video Processing and Coding. Academic Press, 2012.
[46] H. S. Malvar, L.-W. He, and R. Cutler. High-quality linear interpolation for demosaicing of Bayer-patterned color images. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), volume 3, pages 485–488, May 2004.
[47] K. Mergener. Advances in endoscopy: Update on the use of capsule endoscopy. Gastroenterology & Hepatology, Department of Gastroenterology, University of Pennsylvania
Health System, 2008.
[48] Maschal et al. Review of Bayer Pattern Color Filter Array (CFA) Demosaicing with
New Quality Assessment Algorithms. Technical report, U.S. Army Research Laboratory,
2010.
[49] R. Cohen. Demosaicing algorithms. Technical report, Technion - Israel Institute of
Technology, 2010.
[50] R. Cohen. Distributed Video Coding, Sec. 3-4. Technical report, Technion - Israel
Institute of Technology, 2011.
[51] R. Ramanath, W. E. Snyder, G. L. Bilbro, and W. A. Sander. Demosaicking methods for Bayer color arrays. Journal of Electronic Imaging, 11:633–642, July 2002.
[52] A. V. Oppenheim, R. W. Schafer, and J. R. Buck. Discrete-Time Signal Processing (2nd Edition). Prentice-Hall Signal Processing Series, 2009.
[53] R. Lukac and K.N. Plataniotis. Normalized color-ratio modeling for CFA interpolation.
IEEE Transactions on Consumer Electronics, 50(2):737 – 745, May 2004.
Abstract

In recent years we have been witnessing a change in the way we communicate. Digital media has become an inseparable part of our lives, and technological progress has led to a widespread flourishing of digital content and to its integration into many areas of our lives. Storing digital information, in particular video, without the use of efficient compression algorithms would be impossible, in light of the enormous amount of storage that would be required. Hence, such algorithms are a significant part of any system that produces digital media, and in particular video.

Standard video coding systems, such as H.264/AVC and MPEG-2, are usually based on a hybrid compression scheme, combining temporal-spatial prediction with transform coding. In every such system, the encoder examines the incoming video signal thoroughly, in order to identify redundancies and enable efficient compression. This process is usually very complex computationally; its main part is motion estimation, which provides a temporal prediction of the next video frame and may constitute up to 70% of the overall encoder complexity. This process is performed despite its high complexity, because it is very effective for compression purposes.

In these systems, the encoder is computationally more complex than the decoder by one to two orders of magnitude. This situation suits cases in which the compressed video has to be distributed to end users, such as cable TV broadcasts and video on demand (VOD). In such scenarios, low decoder complexity is important, since the video is encoded once and decoded many times by the users. This has been the prevailing situation in recent years, especially before the era of miniaturization and the development of powerful technology that is available to the home user as well.

Nowadays, however, we are experiencing a transition to applications that require fast creation and sharing of video, in particular for real-time purposes, such as wireless video conferencing, sensor networks, surveillance cameras, and so on. In applications of this kind, the encoder used for compressing the video is usually mobile and limited in power and resources, so the use of standard video encoders is impractical. On the other hand, the decoding side often consists of a powerful server that can carry out highly complex computations, so it makes sense to shift the complex computational tasks to the decoder side.

In these cases, an encoder with low computational complexity is required (even at the price of increased complexity at the decoder). It should be noted that simple compression methods that compress each frame separately, such as Motion JPEG, can indeed be used, but these methods do not exploit the temporal redundancy existing in a video sequence and are therefore inefficient. In general, when the temporal redundancy in video is not exploited, a significant drop in performance occurs. It should also be noted that special adaptations of advanced encoders to low-complexity settings exist today (for example, in the H.264 encoder, by defining suitable profiles for tasks such as video coding on cellular devices), but a complicated optimization process is still required in order to obtain reasonable compression results.

In order to answer this need, a field called distributed video coding (DVC) has developed over the last decade. In this coding method, also called Wyner-Ziv coding, lossy compression is performed, where the reconstruction process is aided by side information created at the decoder. The principles underlying this method rely on information-theoretic theorems of Slepian & Wolf (for the lossless case) and of Wyner & Ziv (an extension to the lossy case), developed in the 1970s.

These theorems provide the theoretical basis for building an efficient compression system that exploits the statistics of the input signal at the decoder only, thereby shifting the encoder complexity to the decoder. The Wyner-Ziv theorem even proves that in certain cases it is possible to reach, using distributed coding, compression performance identical to that of standard encoders. In this method, the complex part of performing motion estimation at the encoder is not required, so the computational complexity of the encoder decreases significantly, and the classical structure of a complex encoder and a simple decoder in standard video encoders is reversed.

In this compression method, only the decoder has access to the predicted video frame, and it treats it as a "noisy" video frame. The removal of the noise from this frame is performed with the aid of appropriate probability distribution models (relating the side information to the encoded frame), and with the use of error-correcting codes. The probabilistic model is created at the decoder, using previously decoded frames and partial information from the current frame. This model is the main component of any distributed video coding system, and it has the most significant effect on the quality of the reconstructed video, so an appropriate design of this model is important.

Distributed video coding has several advantages over standard video coding methods, among them a flexible division of complexity between the encoder and the decoder, and better robustness against errors. For example, if part of the transmitted information is lost during transmission (due to channel noise and the like), the effect of this loss is limited to a small number of frames (usually only to the currently encoded frame), in contrast to the case of a standard video encoder, in which the loss can lead to an error in all the frames belonging to the current group of pictures (GOP). This stems from the fact that the encoding of the current frame at the encoder does not depend on previous frames; this dependence is exploited only at the decoder.

Practical systems offering video compression using the principles of distributed video coding have been developed over the last decade. In this work we present three existing systems (Stanford, PRISM and DISCOVER), and review their main components. The performance of these systems is still far from that of standard encoders such as H.264, and current research focuses on methods for closing these gaps.

The common distributed video coding systems today make use of feedback between the encoder and the decoder. In systems of this kind, parity bits of the information to be encoded are sent (in parts) to the decoder. The decoding is performed using the side information created at the decoder and these parity bits. When the decoding error (under the assumption of an appropriate noise model) is larger than a certain threshold, the decoder requests additional parity bits from the encoder, using the feedback. This use of feedback is impractical in systems requiring low delay, and may also be unsuitable for real-time systems.

In light of this, we focus in this work on a feedback-less distributed coding system. An example of such a system is PRISM, and within this work we focus on improving it. We propose a new system for feedback-less distributed video coding, which we call LORD: LOw-complexity, Rate-controlled, Distributed video coding system. Within this system, we propose and implement solutions to the main drawbacks of PRISM, while using principles that were presented in PRISM, such as the partition of frames into blocks with similar statistics, based on the difference between them and the blocks of the previous frame, and the use of a syndrome-like coding scheme.

In particular, we propose a probabilistic model of the noise (between the side information and the information at the encoder) that varies in real time, both spatially and temporally, according to the statistics of the video entering the system. This model assumes that the noise has a Laplace distribution, and based on this assumption optimal estimators (in the MMSE sense) of the transmitted information are computed using a closed-form expression. In contrast to other noise models, such as interpolation-based models that require the decoding of the next frame as well, the model we use is based on extrapolation and requires the decoding of only the two previous frames, so it does not introduce any delay.

In addition, we propose a solution for adapting the system to channel transmission constraints, using an algorithm that controls the transmission rate, so that the proposed system is suitable for real-time applications. The rate control is performed at the encoder; it is particularly exact and does not rely on probabilistic calculations or on any dependence on the decoder, as is the case in many DVC systems. The system we propose does not require the use of a feedback channel, and it therefore constitutes a major step in the transition of DVC systems towards real-time implementations. As required in systems of this kind, we paid careful attention to using components of the lowest possible complexity at the encoder.

The encoder we propose is modular, so that each of its components can be changed in order to achieve a possible improvement in performance. The complexity of its components can also be adapted to the required degree of complexity and to the desired application. For example, some degree of motion estimation can be used at the encoder as well, in order to improve upon the current scheme, which relies on differences between consecutive frames.

Next, we adapt the system to the compression of Bayer-type videos, obtained during an endoscopy process. In the endoscopy process, a tiny camera is inserted into the patient's body, and the transmission of the images to a receiver is limited in complexity and in rate, so DVC systems are suitable for this application.

Bayer-type video contains only partial information about the color components of each frame (each pixel in a frame contains information about a single color only: red, green or blue), and in order to obtain full color information a process called demosaicing has to be performed, by which the three primary colors of each pixel are reconstructed. Explanations regarding the structure of this video type and regarding demosaicing algorithms are given in this work.

This video structure presents a challenge due to the partial color information it contains, so the exploitation of its spatial and temporal redundancy is performed differently than in standard video. It should be noted that this work is the first to address the compression of Bayer-type video using distributed coding methods, and it presents, for the first time, results of compressing such video using these principles. Experimental results of the system show considerable improvements both in the case of coding standard video sequences and in the case of coding endoscopy videos, where in the latter case a PSNR improvement of up to 5dB is obtained, compared with JPEG-based INTRA coding.