Perceptual Vector Quantization for Video Coding

Jean-Marc Valin
Mozilla, Mountain View, USA

ABSTRACT

This paper describes a proposal for applying energy conservation principles to the Daala video codec, using gain-shape vector quantization. The technique originates from the CELT mode of the Opus audio codec, where it is used to conserve the spectral envelope of an audio signal. One potential advantage of conserving the energy of the AC coefficients in video coding is preserving textures rather than low-passing them. Also, by explicitly quantizing a gain, we can have a simple contrast masking model with no signaling cost. The main challenge of using gain-shape quantization for video coding is that we generally have a good prediction (the reference frame), so we are essentially starting from a point that is already on the quantization hyper-sphere, rather than at the origin as in CELT. We demonstrate how a predictor can be incorporated by applying a Householder reflection and encoding the prediction error as an angle. We also derive a new way of encoding the quantized coefficients and show that the resulting technique improves the quality of the coded images and videos.

1. INTRODUCTION

Video codecs are traditionally based on scalar quantization of discrete cosine transform (DCT) coefficients with a fixed quantization step size. High-quality encoders adapt the quantization step size at the macroblock level to account for contrast masking [1]. That technique is still limited because the adjustment is frequency-independent and cannot be applied at a level smaller than a full macroblock. Audio codecs have long considered frequency-dependent masking effects and, more recently, the Opus codec [2] has integrated masking properties into its bitstream so that they do not have to be signaled. This is the approach we are taking with this proposal for applying energy conservation principles to the Daala video codec [3], using gain-shape vector quantization. The technique originates from the CELT mode [4] of the Opus audio codec, where it is used to conserve the spectral envelope of an audio signal. One potential advantage of conserving the energy of the AC coefficients in video coding is preserving textures rather than low-passing them. Also, by explicitly quantizing a gain, we can have a simple contrast masking model with no signaling cost. The main challenge of using gain-shape quantization for video coding is that we generally have a good prediction (the reference frame), so we are essentially starting from a point that is already on the quantization hyper-sphere, rather than at the origin as in CELT. We demonstrate how a predictor can be incorporated by applying a Householder reflection and encoding the prediction error as an angle. We also derive a new way of encoding the quantized coefficients when the sum of their magnitudes is known. We show the quality obtained using the proposed technique on both still images and videos.

2. PYRAMID VECTOR QUANTIZER

A pyramid vector quantization codebook [5] of dimension N and resolution K is constructed as the set of integer vectors that sum K signed unit pulses:

$$\mathbf{y} \in \mathbb{Z}^N : \sum_{i=0}^{N-1} |y_i| = K. \qquad (1)$$

In the CELT mode [4] of the Opus codec [2], this codebook is used in the context of gain-shape quantization, with the gain encoded separately from a unit-norm shape vector derived from a codeword as u = y / ||y||.
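To make the gain-shape decomposition concrete, the following sketch searches the codebook of Eq. (1) with a greedy pulse-placement loop. It is a minimal Python/numpy illustration under our own naming (pvq_search is not a function from Opus or Daala, whose optimized C searches differ in detail); it assumes that greedily maximizing the cosine similarity between y and x is an acceptable stand-in for a full search.

```python
import numpy as np

def pvq_search(x, K):
    """Greedy PVQ search (illustrative sketch): place K signed unit
    pulses so that sum(|y_i|) == K, choosing at each step the position
    that maximizes the squared cosine similarity (x . y)^2 / (y . y)."""
    x = np.asarray(x, dtype=float)
    sign = np.where(x < 0, -1, 1)
    absx = np.abs(x)                  # search in the all-positive orthant
    y = np.zeros(len(x), dtype=int)
    corr, energy = 0.0, 0.0           # running <absx, y> and <y, y>
    for _ in range(K):
        # Adding one pulse at position j changes the squared cosine
        # similarity to (corr + absx[j])^2 / (energy + 2*y[j] + 1).
        scores = (corr + absx) ** 2 / (energy + 2 * y + 1)
        j = int(np.argmax(scores))
        energy += 2 * y[j] + 1        # (y_j + 1)^2 - y_j^2
        corr += absx[j]
        y[j] += 1
    y *= sign                         # restore the signs of x
    g = np.linalg.norm(x)             # gain, quantized separately
    u = y / np.linalg.norm(y)         # unit-norm shape vector
    return g, u, y

# Example: an 8-dimensional band quantized with K = 4 pulses.
g, u, y = pvq_search([0.9, -0.2, 0.1, 0.4, 0.0, -0.6, 0.3, 0.05], K=4)
```

Only the integer codeword y (together with the quantized gain) needs to be entropy-coded; the decoder recovers the shape as u = y / ||y||.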
Figure 1: Band definition for 4x4, 8x8 and 16x16 blocks. The low frequencies are recursively divided following the pattern of smaller blocks. Blocks of 4x4 have all their AC coefficients in a single band.

3. APPLICATION TO VIDEO CODING

To apply gain-shape vector quantization to DCT coefficients, it is important to first divide the coefficients into frequency bands, just as for audio, to avoid energy being moved across octaves or directions during the quantization process. Fig. 1 illustrates the current bands we use for different block sizes. Blocks of 4x4, 8x8 and 16x16 are split into 1, 4, and 7 bands, respectively. DC is always excluded and scalar quantized separately.

When no prediction exists for a DCT band b, the quantized coefficients are simply x̂_b = ĝ_b u_b, where ĝ_b is the quantized gain. In its simplest form, the gain is quantized linearly as ĝ_b = Q_b γ_b, where γ_b is an integer index and Q_b is the quantization resolution (different for each band to take into account the contrast sensitivity function of the human visual system). It is easy to make the quantization adapt to the amount of contrast in the band being quantized. This can be done by using a non-linear gain quantizer

$$\hat{g}_b = Q_b \gamma_b^\beta, \qquad (2)$$

where β > 1 controls the amount of masking. The value of K used for the shape quantizer is computed from γ_b and β to achieve the same distortion in the direction of the gain as in the other directions. It does not need to be signaled in the bitstream since γ_b and β are known to the decoder.

The main challenge in adapting gain-shape quantization to video coding is when a prediction is available. While it is possible to apply the above technique to the motion-compensation residual, it is not ideal since 1) the gain parameter no longer represents a perceptually meaningful quantity and 2) it is no longer possible to adapt the quantization to take into account contrast masking. Instead, we still use a gain based on the coefficients x_b of the current band, but we introduce an angle θ_b derived from the cosine distance between the current coefficients x_b and their prediction r_b:

$$\cos\theta_b = \frac{\mathbf{x}_b^T \mathbf{r}_b}{\|\mathbf{x}_b\| \, \|\mathbf{r}_b\|}. \qquad (3)$$

To make it tractable to represent the difference between the coefficients and their prediction, we compute a Householder reflection that aligns the prediction along a predetermined axis m. The reflection vector is given by

$$\mathbf{v} = \frac{\mathbf{r}_b}{\|\mathbf{r}_b\|} + s \mathbf{e}_m, \qquad (4)$$

where e_m is a unit vector along axis m, with the sign s and axis m selected based on the largest component of the prediction to minimize numerical error (both are computable by the decoder). The reconstructed coefficients are then given by

$$\hat{\mathbf{x}}_b = \hat{g}_b \left( \hat{\mathbf{z}}_b - 2 \frac{\mathbf{v}^T \hat{\mathbf{z}}_b}{\mathbf{v}^T \mathbf{v}} \mathbf{v} \right), \qquad (5)$$

where the unit vector ẑ_b is constructed from the angle θ̂_b and a unit-norm codeword u_b as

$$\hat{\mathbf{z}}_b = -s \cos\hat{\theta}_b \, \mathbf{e}_m + \sin\hat{\theta}_b \, \mathbf{u}_b \qquad (6)$$

(the −s factor follows from Eq. (4), which reflects r_b/||r_b|| onto −s e_m, so that θ̂_b = 0 reproduces the prediction direction). Because of the angle, dimension m can be omitted from u_b, which now has N − 1 dimensions and N − 2 degrees of freedom (since its gain is unity).
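The decoder side of Eqs. (4)-(6) is compact enough to sketch directly. The Python function below is a minimal illustration under assumed names (reconstruct_band is ours, not Daala's C API): it rebuilds the reflection vector v from the prediction, re-inserts the omitted dimension m into the codeword, and applies the reflection to ẑ_b.

```python
import numpy as np

def reconstruct_band(g_hat, theta_hat, u, r):
    """Reconstruct x_hat from the quantized gain g_hat, quantized angle
    theta_hat, (N-1)-dimensional unit-norm codeword u, and prediction r,
    following Eqs. (4)-(6). Illustrative sketch only."""
    r = np.asarray(r, dtype=float)
    m = int(np.argmax(np.abs(r)))   # axis of the largest predictor component
    s = 1.0 if r[m] >= 0 else -1.0  # sign chosen to avoid cancellation
    v = r / np.linalg.norm(r)       # Eq. (4): v = r/||r|| + s*e_m
    v[m] += s
    # Eq. (6): z_hat = -s*cos(theta_hat)*e_m + sin(theta_hat)*u_b,
    # with dimension m re-inserted into the codeword.
    z = np.sin(theta_hat) * np.insert(np.asarray(u, dtype=float), m, 0.0)
    z[m] = -s * np.cos(theta_hat)
    # Eq. (5): x_hat = g_hat * (z - 2*(v^T z)/(v^T v) * v).
    return g_hat * (z - 2.0 * (v @ z) / (v @ v) * v)

# Sanity check: at theta_hat = 0 the reconstruction lies exactly along
# the prediction, scaled to the quantized gain (u is immaterial here
# since sin(0) = 0).
r = np.array([4.0, 1.0, -2.0, 0.5])
x0 = reconstruct_band(g_hat=3.0, theta_hat=0.0, u=np.zeros(3), r=r)
assert np.allclose(x0, 3.0 * r / np.linalg.norm(r))
```

The sanity check mirrors the geometry of the reflection: since the reflection maps r_b/||r_b|| onto −s e_m, a zero angle reproduces the prediction direction, and increasing θ̂_b rotates the reconstruction away from it.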
3.1 Coefficient Encoding

Encoding coefficients quantized with PVQ differs from encoding scalar-quantized coefficients in that the sum of the coefficients' magnitudes is known (equal to K). It is possible to take advantage of the known K value either by modeling the distribution of coefficient magnitudes or by modeling the zero runs. In the case of magnitude modeling, the expectation of the magnitude of coefficient n is modeled as

$$E(|y_n|) = \mu \frac{K_n}{N - n}, \qquad (7)$$

where K_n is the number of pulses left after encoding coefficients 0 to n − 1 and μ depends on the distribution of the coefficients. For run-length modeling, the expectation of the position of the next non-zero coefficient is given by

$$E(\mathrm{run}) = \nu \frac{N - n}{K_n}, \qquad (8)$$

where ν also models the coefficient distribution.
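Both models adapt as coding progresses, since K_n shrinks with every pulse consumed. The short Python sketch below (our own illustration; mu=0.8 is a placeholder value, not a tuned codec constant) walks a codeword and evaluates Eq. (7) at each position, the quantity a codec would use to parameterize its probability model for |y_n|; the same bookkeeping of K_n and N − n drives the run-length model of Eq. (8).

```python
def magnitude_expectations(y, mu=0.8):
    """For each coefficient position n, evaluate Eq. (7):
    E(|y_n|) = mu * K_n / (N - n), where K_n is the number of pulses
    left before coding coefficient n. The value of mu is an
    illustrative placeholder for a fitted distribution parameter."""
    N = len(y)
    K_n = sum(abs(c) for c in y)   # total number of pulses, K
    expectations = []
    for n, c in enumerate(y):
        if K_n == 0:
            break                  # all remaining coefficients are zero
        expectations.append(mu * K_n / (N - n))
        K_n -= abs(c)              # pulses consumed by coefficient n
    return expectations

# Example: a codeword with K = 4 pulses over N = 8 dimensions.
print(magnitude_expectations([2, 0, -1, 0, 1, 0, 0, 0]))
```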
4. RESULTS

The contrast masking algorithm is evaluated both on a set of 50 still images taken from Wikipedia and downsampled to 1 megapixel, and on a set of short video clips ranging from CIF to 720p in resolution. First, the PSNR performance of scalar quantization vs. vector quantization is compared in Fig. 2a and 2c. To make the comparison easier, a flat quantization matrix is used for both scalar and vector quantization. Since the use of contrast masking is expected to make measurements such as PSNR worse, we evaluate its effect using a fast implementation of multi-scale structural similarity (FAST-SSIM) [6]. Fig. 2b and 2d show the FAST-SSIM results with and without contrast masking. For still images, the average improvement is 0.90 dB, equivalent to a 24.8% reduction in bitrate at equal quality, while for videos, the average improvement is 0.83 dB, equivalent to a 13.7% reduction in bitrate.

Figure 2: Comparing scalar quantization to vector quantization with and without contrast masking. (a) Still images, PSNR of scalar vs. vector quantization; (b) still images, FAST-SSIM with and without masking; (c) video, PSNR of scalar vs. vector quantization; (d) video, FAST-SSIM with and without masking.

5. CONCLUSION

We have presented a perceptual vector quantization technique for still images and video. We have shown that it can be used to implement adaptive quantization based on contrast masking, without any signaling, in a way that improves quality. For now, contrast masking is restricted to the luma planes, but it remains to be seen whether a similar technique can be applied to the chroma planes. This work is part of the Daala project [3]. The full source code, including all of the PVQ work described in this paper, is available in the project git repository [7].

REFERENCES

[1] Osberger, W., Perceptual Vision Models for Picture Quality Assessment and Compression Applications, Queensland University of Technology, Brisbane (1999).
[2] Valin, J.-M., Vos, K., and Terriberry, T. B., "Definition of the Opus Audio Codec," RFC 6716 (Proposed Standard) (Sept. 2012).
[3] Daala website. https://xiph.org/daala/
[4] Valin, J.-M., Maxwell, G., Terriberry, T. B., and Vos, K., "High-quality, low-delay music coding in the Opus codec," in Proc. 135th AES Convention (Oct. 2013).
[5] Fischer, T. R., "A pyramid vector quantizer," IEEE Trans. on Information Theory 32, 568-583 (1986).
[6] Chen, M.-J. and Bovik, A. C., "Fast structural similarity index algorithm," in Proc. ICASSP, 994-997 (Mar. 2010).
[7] Daala git repository. https://git.xiph.org/daala.git