
Perceptual Vector Quantization for Video Coding
Jean-Marc Valin
Mozilla, Mountain View, USA
ABSTRACT
This paper describes a proposal for applying energy conservation principles to the Daala video codec, using
gain-shape vector quantization. The technique originates from the CELT mode of the Opus audio codec, where
it is used to conserve the spectral envelope of an audio signal. One potential advantage of conserving energy
of the AC coefficients in video coding is preserving textures rather than low-passing them. Also, by explicitly
quantizing a gain, we can have a simple contrast masking model with no signaling cost. The main challenge of
using gain-shape quantization for video coding is that we generally have a good prediction (the reference frame),
so we are essentially starting from a point that is already on the quantization hyper-sphere, rather than at the
origin like in CELT. We demonstrate how a predictor can be incorporated by applying a Householder reflection
and by encoding the prediction error as an angle. We also derive a new way of encoding the quantized coefficients
and show that the resulting technique improves the quality of the coded images and videos.
1. INTRODUCTION
Video codecs are traditionally based on scalar quantization of discrete cosine transform (DCT) coefficients with
fixed quantization step size. High-quality encoders adapt the quantization step size at the macroblock level to
account for contrast masking.1 The technique is still limited because the adjustment is frequency-independent
and cannot be applied at a level smaller than a full macroblock. Audio codecs have long considered frequency-dependent masking effects and, more recently, the Opus codec2 has integrated masking properties into its bitstream format so that they do not have to be signaled.
This is the approach we take with this proposal for applying energy conservation principles to the
Daala video codec,3 using gain-shape vector quantization. The technique originates from the CELT mode4 of the
Opus audio codec, where it is used to conserve the spectral envelope of an audio signal. One potential advantage
of conserving energy of the AC coefficients in video coding is preserving textures rather than low-passing them.
Also, by explicitly quantizing a gain, we can have a simple contrast masking model with no signaling cost.
The main challenge of using gain-shape quantization for video coding is that we generally have a good
prediction (the reference frame), so we are essentially starting from a point that is already on the quantization
hyper-sphere, rather than at the origin like in CELT. We demonstrate how a predictor can be incorporated by
applying a Householder reflection and by encoding the prediction error as an angle. We also derive a new way
of encoding the quantized coefficients when the sum of their magnitudes is known. We show the quality obtained
using the proposed technique on both still images and videos.
2. PYRAMID VECTOR QUANTIZER
A pyramid vector quantization codebook5 of dimension N and resolution K is constructed as the sum of K
signed unit pulses
\[ y \in \mathbb{Z}^N : \; \sum_{i=0}^{N-1} |y_i| = K \,. \qquad (1) \]
In the CELT mode4 of the Opus codec,2 this codebook is used in the context of gain-shape quantization, with
the gain encoded separately from a unit-norm vector derived from a codeword as u = y/‖y‖.
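For illustration, the sketch below shows one way to search this codebook: greedy pulse placement driven by the cosine similarity to the input vector, followed by normalization. It is a minimal demonstration of the codebook structure under our own function names, not the search routine used by Opus or Daala.

```python
import numpy as np

def pvq_search(x, k):
    """Greedily place k signed unit pulses so that y (with sum |y_i| = k)
    approximately maximizes the cosine similarity with x, then normalize
    to obtain the unit-norm shape vector u = y/||y||. Assumes k >= 1."""
    abs_x = np.abs(x)
    y = np.zeros(len(x), dtype=int)
    corr = 0.0    # running |x| . y
    norm2 = 0.0   # running ||y||^2
    for _ in range(k):
        # Adding a pulse at position i raises the correlation by |x_i| and
        # the squared norm by 2*y_i + 1; pick i to maximize the new cosine.
        scores = (corr + abs_x) ** 2 / (norm2 + 2 * y + 1)
        i = int(np.argmax(scores))
        y[i] += 1
        corr += abs_x[i]
        norm2 += 2 * y[i] - 1
    y *= np.sign(x).astype(int)  # restore the signs of the input
    return y, y / np.linalg.norm(y)

# Example: 5 pulses in 4 dimensions gives y = [3, -1, 1, 0].
y, u = pvq_search(np.array([0.8, -0.4, 0.2, 0.1]), 5)
```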
Figure 1: Band definition for 4x4, 8x8 and 16x16 blocks. The low frequencies are recursively divided following
the pattern of smaller blocks. Blocks of 4x4 have all their AC coefficients in a single band.
3. APPLICATION TO VIDEO CODING
To apply gain-shape vector quantization to DCT coefficients, it is important to first divide the coefficients into
frequency bands just like for audio, to avoid having energy being moved across octaves or directions during the
quantization process. Fig. 1 illustrates the current bands we use for different block sizes. Blocks of 4x4, 8x8 and
16x16 are split into 1, 4, and 7 bands, respectively. DC is always excluded and scalar quantized separately.
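The recursive layout of Fig. 1 can be written compactly; a sketch of that structure follows (our own illustration, matching the band counts above but not Daala's exact tables): each block contributes its three high-frequency quadrants as bands and recurses on the low-frequency quarter, bottoming out at 4x4 where all AC coefficients form one band.

```python
def bands(n):
    """List the bands of an n x n DCT block as lists of (row, col)
    positions, following the recursive layout of Fig. 1. DC at (0, 0)
    is excluded. An illustration of the structure, not Daala's tables."""
    if n == 4:
        # 4x4 blocks keep all 15 AC coefficients in a single band.
        return [[(r, c) for r in range(4) for c in range(4) if (r, c) != (0, 0)]]
    h = n // 2
    low = bands(h)  # low frequencies follow the pattern of smaller blocks
    horiz = [(r, c) for r in range(h) for c in range(h, n)]    # top-right quadrant
    vert = [(r, c) for r in range(h, n) for c in range(h)]     # bottom-left quadrant
    diag = [(r, c) for r in range(h, n) for c in range(h, n)]  # bottom-right quadrant
    return low + [horiz, vert, diag]

# bands(4), bands(8), bands(16) yield 1, 4 and 7 bands, matching Fig. 1.
```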
When no prediction exists for a DCT band b, then the quantized coefficients are simply x̂b = ĝb ub , where ĝb
is the quantized gain. In its simplest form, the gain is quantized linearly as ĝb = Qb γb , where γb is an integer
index and Qb is the quantization resolution (different for each band to take into account the contrast sensitivity
function of the human visual system). It is easy to modify the quantization to adapt based on the contrast in
the band being quantized. This can be done by using a non-linear gain quantizer
\[ \hat{g}_b = Q_b\,\gamma_b^{\beta}\,, \qquad (2) \]
where β > 1 controls the amount of masking. The value of K used for the shape quantizer is computed from γb
and β to achieve the same distortion in the direction of the gain as in the other directions. It does not need to
be signaled in the bitstream since γb and β are known to the decoder.
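As a sketch of this quantizer (in Python; the rule for K is reduced here to a rough heuristic of our own, standing in for the paper's distortion-matching derivation):

```python
def quantize_gain(g, q, beta):
    """Non-linear gain quantization of Eq. (2): g_hat = Q * gamma^beta.
    With beta > 1 the step size grows with the gain, which is the
    contrast-masking behavior described above."""
    gamma = int(round((g / q) ** (1.0 / beta)))  # invert Eq. (2), then round
    return gamma, q * gamma ** beta

def shape_resolution(gamma, beta):
    """Choose K from gamma and beta alone, so the decoder can derive it
    without signaling. Near gamma the gain step grows like
    beta * gamma^(beta - 1), i.e. a relative step of roughly beta/gamma,
    so matching the shape resolution to it suggests K proportional to
    gamma/beta. The constant of proportionality here is a placeholder
    assumption, not the paper's exact derivation."""
    return max(1, int(round(gamma / beta)))
```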
The main challenge in adapting gain-shape quantization to video coding is when a prediction is available.
While it is possible to apply the above technique to the motion compensation residual, it is not ideal since 1) the
gain parameter no longer represents a perceptually meaningful quantity and 2) it is no longer possible to adapt
the quantization to take into account contrast masking.
Instead, we still use a gain based on the coefficients xb of the current band, but we introduce an angle θb
derived from the cosine distance between the current coefficients xb and their prediction rb:
\[ \cos\theta_b = \frac{x_b^T r_b}{\|x_b\|\,\|r_b\|}\,. \qquad (3) \]
To make it tractable to represent the difference between the coefficients and their prediction, we compute a
Householder reflection that will align the prediction along a predetermined axis m. The reflection vector is given
by
\[ v = \frac{r_b}{\|r_b\|} + s\,e_m\,, \qquad (4) \]
where em is a unit vector along axis m, with the sign s and axis m selected based on the largest component of
the prediction to minimize numerical error (both are computable by the decoder). The reconstructed coefficients
are then given by
\[ \hat{x}_b = \hat{g}_b\left(\hat{z}_b - 2\,\frac{v^T\hat{z}_b}{v^T v}\,v\right), \qquad (5) \]
where the unit vector ẑb is constructed from the angle θ̂b and a unit-norm codeword ub as
\[ \hat{z}_b = \cos\hat{\theta}_b\,e_m + \sin\hat{\theta}_b\,u_b\,. \qquad (6) \]
Because of the angle, dimension m can be omitted from ub , which now has N − 1 dimensions and N − 2 degrees
of freedom (since its gain is unity).
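Putting Eqs. (3)-(6) together, the following sketch builds the reflection from the prediction and reconstructs the coefficients. Names are ours; in particular, the placement of the sign s on the e_m term is our bookkeeping assumption, chosen so that the reflection maps that term back onto the prediction direction.

```python
import numpy as np

def compute_reflection(r):
    """Eq. (4): reflection vector v aligning the prediction direction
    r/||r|| with coordinate axis m. The axis m and sign s come from the
    largest component of r, so the decoder can recompute both."""
    m = int(np.argmax(np.abs(r)))
    s = 1.0 if r[m] >= 0 else -1.0
    v = r / np.linalg.norm(r)
    v[m] += s
    return v, m, s

def reconstruct(g_hat, theta_hat, u, v, m, s):
    """Eqs. (5)-(6): rebuild x_hat from the quantized gain and angle and
    the (N-1)-dimensional unit-norm codeword u (dimension m omitted).
    The -s factor is our sign convention (see the lead-in above)."""
    # Re-insert dimension m, carrying the cos(theta) term along axis e_m.
    z = np.insert(np.sin(theta_hat) * u, m, -s * np.cos(theta_hat))
    # Householder reflection of Eq. (5).
    return g_hat * (z - 2.0 * np.dot(v, z) / np.dot(v, v) * v)
```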
3.1 Coefficient Encoding
Encoding coefficients quantized with PVQ differs from encoding scalar-quantized coefficients in that
the sum of the coefficients' magnitudes is known (equal to K). It is possible to take advantage of the known K
value either by modeling the distribution of coefficient magnitudes or by modeling the zero runs. In the case
of magnitude modeling, the expectation of the magnitude of coefficient n is modeled as
\[ E(|y_n|) = \mu\,\frac{K_n}{N-n}\,, \qquad (7) \]
where Kn is the number of pulses left after encoding coefficients from 0 to n − 1 and µ depends on
the distribution of the coefficients. For run-length modeling, the expectation of the position of the next non-zero
coefficient is given by
\[ E(\mathrm{run}) = \nu\,\frac{N-n}{K_n}\,, \qquad (8) \]
where ν also models the coefficient distribution.
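For illustration, the sketch below walks a PVQ codeword while tracking K_n and evaluates both expectations at each position; an actual coder would map them to probabilities for the entropy coder (function and parameter names are ours).

```python
def coding_stats(y, k, mu, nu):
    """Track K_n (pulses left) along a PVQ codeword y with sum |y_i| = k,
    reporting the modeled expectations of Eqs. (7) and (8)."""
    n_dims = len(y)
    k_n = k  # pulses remaining before coding coefficient n
    for n, y_n in enumerate(y):
        if k_n == 0:
            break  # all remaining coefficients are known to be zero
        e_mag = mu * k_n / (n_dims - n)   # Eq. (7): expected |y_n|
        e_run = nu * (n_dims - n) / k_n   # Eq. (8): expected zero run
        print(f"n={n}: E|y_n|={e_mag:.2f}, E[run]={e_run:.2f}, y_n={y_n}")
        k_n -= abs(y_n)
```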
4. RESULTS
The contrast masking algorithm is evaluated on both a set of 50 still images taken from Wikipedia and
downsampled to 1 megapixel, and on a set of short video clips ranging from CIF to 720p in resolution. First,
the PSNR performance of scalar quantization vs vector quantization is compared in Fig. 2a and 2c. To make
comparison easier, a flat quantization matrix is used for both scalar and vector quantization.
Since the use of contrast masking is expected to make measurements such as PSNR worse, we evaluate its
effect using a fast implementation of multi-scale structural similarity (FAST-SSIM).6 Fig. 2b and 2d show the
FAST-SSIM results with and without contrast masking. For still images, the average improvement is 0.90 dB,
equivalent to a 24.8% reduction in bitrate at equal quality, while for videos, the average improvement is 0.83 dB,
equivalent to a 13.7% reduction in bitrate.
Figure 2: Comparing scalar quantization to vector quantization with and without contrast masking. (a) Still images, PSNR of scalar vs. vector quantization; (b) still images, FAST-SSIM with and without masking; (c) video, PSNR of scalar vs. vector quantization; (d) video, FAST-SSIM with and without masking.
5. CONCLUSION
We have presented a perceptual vector quantization technique for still images and video. We have shown that
it can be used to implement adaptive quantization based on contrast masking without any signaling in a way
that improves quality. For now, contrast masking is restricted to the luma planes, but it remains to be seen if a
similar technique can be applied to the chroma planes.
This work is part of the Daala project.3 The full source code, including all of the PVQ work described in
this paper, is available in the project git repository.7
REFERENCES
[1] Osberger, W., [Perceptual Vision Models for Picture Quality Assessment and Compression Applications],
Queensland University of Technology, Brisbane (1999).
[2] Valin, J.-M., Vos, K., and Terriberry, T. B., “Definition of the Opus Audio Codec.” RFC 6716 (Proposed
Standard) (Sept. 2012).
[3] “Daala website.” https://xiph.org/daala/.
[4] Valin, J.-M., Maxwell, G., Terriberry, T. B., and Vos, K., "High-quality, low-delay music coding in the Opus
codec," in [Proc. 135th AES Convention] (Oct. 2013).
[5] Fischer, T. R., “A pyramid vector quantizer,” IEEE Trans. on Information Theory 32, 568–583 (1986).
[6] Chen, M.-J. and Bovik, A. C., "Fast structural similarity index algorithm," in [Proc. ICASSP], 994–997
(March 2010).
[7] “Daala git repository.” https://git.xiph.org/daala.git.