2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics O ctober 18-21, 2009, New Paltz, NY ITU-T G.719: A NEW LOW-COMPLEXITY FULL-BAND (20 KHZ) AUDIO CODING STANDARD FOR HIGH-QUALITY CONVERSATIONAL APPLICATIONS Anisse Taleb and Manuel Briand Minjie Xie ∗ and Peter Chu Polycom, Inc. 10 Mall Road, #110, Burlington, MA 01803, USA [email protected] ABSTRACT This paper describes a new low-complexity full-band (20 kHz) audio coding algorithm which has been recently standardized by ITU-T as Recommendation G.719. The algorithm is designed to provide 20 Hz - 20 kHz audio bandwidth using a 48 kHz sample rate, operating at 32 - 128 kbps. This codec features very high audio quality and low computational complexity and is suitable for use in applications such as videoconferencing, teleconferencing, and streaming audio over the Internet. Subjective test results from the Optimization/ Characterization phase of G.719 are also presented in the paper. Index Terms— Audio coding, full-band, low complexity, adaptive time-frequency transform, fast lattice vector quantization 1. INTRODUCTION In hands-free videoconferencing and teleconferencing markets, there is strong and increasing demand for audio coding providing the full human auditory bandwidth of 20 Hz to 20 kHz. This is because: • Conferencing systems are increasingly used for more elaborate presentations, often including music and sound effects (i.e. animal sounds, musical instruments, vehicles or nature sounds, etc.) which occupy a wider audio band than speech. Presentations involve remote education of music, playback of audio and video from DVDs and VCRs, audio/video clips from PCs, and elaborate audio-visual presentations from, for example, PowerPoint. • Users perceive the bandwidth of 20 Hz to 20 kHz as representing the ultimate goal for audio bandwidth. The resulting market pressures ar e causing a shift in this direction, now that sufficient IP bitrate and audio coding technology are available to deliver this. As with any audio codec for hands-free videoconferencing use, the requirements include: • Low latency (support natural conversation) • Low complexity (free cycles for video processing and other audio processing; reduce cost) • High quality on all signal types To meet the market need for such a full-band audio coding standard, ITU-T Q10/SG16 launched the standardization of Low-Complexity Full-Band Audio Coding Extension to G.722.1 (G.722.1FB) in November 2006. The ITU-T G.722.1FB ∗ Ericsson Research – Mu ltimedia Technologies Färögatan 6, Kista, Sweden {anisse.taleb, manuel.briand}@ericsson.com Qualification phase was completed in June 2007 and two candidates submitted respectively by Ericsson AB and Polycom, Inc. were qualified. In September 2007, the two proponents announced their collaboration in an Optimization/ Characterization phase and agreed to jointly develop a candidate codec. The joint candidate codec met all the requirements in the subjective quality tests for the G.722.1FB Optimization/ Characterization phase in April 2008. In June 2008, the joint codec was adopted as ITU-T Recommendation G.719 which is the first full-band audio coding standard of ITU-T [1]. In this paper we first give an overview of the ITU-T G.719 codec, followed by a description of two important modules of the codec - adaptive time-frequency transform and quantization of transform coefficients. We also present the evaluation of the computational complexity and memory requirements of the codec. Finally, subjective test results from the Optimization/ Characterization phase are summarized. 2. OVERVIEW OF THE G.719 CODEC The G.719 codec is a low-complexity transform-based audio codec and can provide an audio bandwidth of 20 Hz to 20 kHz at 32 - 128 kbps. The codec operates on frames of 20 ms corresponding to 960 samples at a sampling rate of 48 kHz and has an algorithmic delay of 40 ms. The codec features very high audio quality and extremely low computational complexity compared to other state-of-the-art audio coding algorithms. 2.1 Encoder Norm estimation Transient detector Input signal Fs = 48 kHz 265 Norm adjustment Bit allocation Spectrum normalization Transform FLVQ and coding Noise level adjustment Transient signaling Quantization indices Huffman code Noise level adjustment index MUX Figure 1: Block diagram of the G.719 encoder. Minjie Xie was with Polycom, Inc., when this work was done. He is now with Huawei Technologies (USA). 978-1-4244-3679-8/09/$25.00©2009 IEEE Norm quantization and coding 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 18-21, 2009, New Paltz, NY A block diagram of the G.719 encoder is shown in Figure 1. The input signal sampled at 48 kHz is processed through a transient detector. Depending on the detection of a transient, indicated by a flag IsTransient, a high frequency resolution or a low frequency resolution transform is applied on the input signal frame. The adaptive transform is based on a modified discrete cosine transform (MDCT) [2] in case of stationary frames. For transient frames, the MDCT is modified to obtain a higher temporal resolution without a need for additional delay and with very little overhead in complexity. Transient frames have a temporal resolution equivalent to 5 ms frames. The obtained transform coefficients are grouped into bands of unequal lengths as 8, 16, 24, or 32. As the bandwidth is 20 kHz, only 800 transform coefficients are used. The 160 transform coefficients representing frequencies above 20 kHz are ignored. The norm or power of each band is estimated and the resulting spectral envelope consisting of the norms of all bands is scalar quantized and encoded. The coefficients are then normalized by the quantized norms. The quantized norms are further adjusted based on adaptive spectral weighting and used as input for bit allocation. An adaptive bit-allocation scheme based on the quantized norms of the bands is used to assign the available bits in a frame among the bands. The number of bits assigned to each transform coefficient can be as large as 9 bits depending on the input signal. The normalized transform coefficients are lattice vector quantized according to the allocated bits for each band. Huffman coding is applied to the quantization indices for both the coded norms and transform coefficients. The norm of the non-coded transform coefficients is estimated, coded, and transmitted to the decoder. coefficients and regenerated transform coefficients are mixed and lead to normalized spectrum. The decoded spectral envelope is applied to the decoded full-band spectrum. Finally, the inverse transform is applied to recover the time-domain decoded signal. This is performed by applying either the inverse MDCT for stationary mode, or the inverse of the higher temporal resolution transform for transient mode. 2.2 Decoder The resulting signal y(k) represents the transform coefficients of the input frame. It should be noted that in stationary mode, the cascade of windowing, TDA, and DCTIV is equivalent to applying the modulated lapped transform (MLT) [2]. If the signal is detected as a transient, a further re-ordering of the time-domain aliased signal is performed. The basis functions of the resulting filter-bank would have an incoherent time and frequency responses without re-ordering. The reordering operation consists of shuffling the upper and lower half x (n) as follows: of the TDA output signal ~ DEMUX + Envelope Shaping Transient signaling Spectral-fill Generator Norm indices Noise level adjustment index FLVQ indices Transient signaling FLVQ Decoding Audio signal Fs = 48 kHz Inverse Transform Figure 2: Block diagram of the G.719 decoder. A block diagram of the decoder is shown in Figure 2. The transient flag is first decoded which indicates the frame configuration, i.e. stationary or transient. The spectral envelope is then decoded and the same, bit-exact, norm adjustment and bit-allocation algorithms are used at the decoder to re-compute the bit-allocation which is essential for decoding quantization indices of the normalized transform coefficients. After decoding the transform coefficients, the non-coded transform coefficients (allocated zero bits) in low frequencies are regenerated by using a spectral-fill codebook built from the decoded transform coefficients. The noise level adjustment index is used to adjust the level of the regenerated coefficients. The non-coded transform coefficients in high frequencies are regenerated using bandwidth extension. The decoded transform 3. ADAPTIVE TIME-FREQUENCY TRANSFORM The adaptive time-frequency transform is based on the detection of a transient. In the case of stationary signals, the transform has a high frequency resolution which is able to efficiently represent stationary sounds. In the case of transient sounds, the time-frequency transform will increase its time resolution and allows a better representation of the rapid changes in the input signal characteristics. These two modes of operations share a common buffering and windowing module and the switching between one mode of operation to the other is instantaneous. Thus no additional look-ahead in the transient detection is needed. This allows the codec to have selectable time resolution at low complexity and with zero additional delay. If the input signal is detected as stationary, a type IV discrete cosine transform (DCTIV) [2] is applied on the output ~ x (n) of the time domain aliasing (TDA) operation. The DCTIV of the time aliased signal is defined by the following equation: 959 y (k ) = ⎡ π ⎤ ∑ ~x (n) cos⎢⎣⎛⎜⎝ n + 2 ⎞⎟⎠⎛⎜⎝ k + 2 ⎞⎟⎠ 960 ⎥⎦ , k = 0, 1, …, 959 (1) 1 1 n =0 v ( n) = ~ x (959 − n) , n = 0, 1, …, 959. (2) This re-ordering is only conceptual and in reality no computations are involved. Higher time resolution is obtained by zero padding the signal v(n) and dividing the resulting signal into four overlapped equal length sub-frames. The amount of zero-padding is equal to 120 on each side of the signal. The segments are 50% overlapped and each segment has a length equal to 480. The two inner segments are post-windowed using a sine window of length 480. The windows for outer segments are constructed using half a sine window. Each resulting post-windowed segment is further processed by applying the MDCT, i.e. time aliasing followed by DCTIV. The output of the MDCT for each segment represents the signal spectrum at different time instants, thus in case of transients, a higher time resolution is used. The length of the output of each of the four MDCT is half the length of the input segment, i.e. 240. Therefore, this operation does not introduce additional redundancy. 266 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 4. and consists of the points falling on concentric spheres of radius TRANSFORM COEFFICIENT QUANTIZATION Each band consists of one or more vectors of 8-dimensional transform coefficients and the coefficients are normalized by the quantized norm. All 8-dimensional vectors belonging to one band are assigned the same number of bits for quantization. A fast lattice vector quantization (FLVQ) scheme [3] is used to quantize the normalized coefficients in 8 dimensions. In FLVQ the quantizer comprises two sub-quantizers: a D8-based higherrate lattice vector quantizer (HRQ) and an RE8-based lower-rate lattice vector quantizer (LRQ). HRQ is a multi-rate quantizer designed to quantize the transform coefficients at rates of 2 up to 9 bit/coefficient and its codebook is based on the so-called Voronoi code for the D8 lattice [4]. D8 is a well-known lattice and defined as: 8 D8 = {(y1, y2, y3, y4, y5, y6, y7, y8) ∈ Z8 | ∑y i = even} (3) 2 2r centered at the origin, where r = 0, 1, 2, ... . The set of points on a sphere constitutes a spherical code and can be used as a quantization codebook. The LRQ codebook of 256 codewords is stored in a structured look-up table so that a fast searching method and a fast indexing method are developed to reduce the computational complexity [3]. 5. COMPLEXITY OF THE G.719 CODEC The computational complexity of the G.719 codec in 16/32-bit fixed-point is estimated by encoding and decoding the source material used for the subjective tests of the G.719 Optimization/ Characterization phase. The observed average and worst-case complexity in units of WMOPS [6] are shown in Table 1 for different bitrates. Storage requirements are reported in Table 2. Table 1: Computational complexity of the G.719 codec. i =1 where Z8 is the lattice which consists of all points with integer coordinates. It can be seen that D8 consists of the points having integer coordinates with an even sum. The codebook of HRQ is constructed from a finite region of the D8 lattice and is not stored in memory. The codewords are generated by a simple algebraic method and a fast quantization algorithm is used. To minimize the distortion for a given rate, the D8 lattice should be truncated and scaled. Actually, the input vectors instead of the lattice codebook are scaled in order to use the fast searching and indexing algorithms proposed by Conway and Sloane [4]. However, the fast searching algorithm introduced in [4] assumes an infinite lattice which can not be used as the codebook in real-time audio coding systems. In other words, for a given rate the algorithm can not be used to quantize the input vectors lying outside the truncated lattice region. To resolve this problem, a fast method for quantizing “outliers” xo is developed and described as follows: 1) Scale down the vector xo by 2: xo = xo / 2. 2) Find the nearest D8 point u to xo and then compute the index vector j of u using the indexing algorithm in [4]. 3) Find the codeword y from the index vector j and then compare y with u. If y is different from u, repeat Steps 1) to 3). Otherwise, compute w = xo / 16. Note that due to the normalization of transform coefficients, a few iterations are required to find the best codeword. 4) Compute xo = xo + w. 5) Find the nearest D8 point u to xo and then compute the index vector j of u using the indexing algorithm in [4]. 6) Find the codeword y from the index vector j and then compare y with u. If y and u are exactly same, k = j and repeat Steps 4) to 6). Otherwise, k is the index of the best codeword to xo and stop. It has been observed that the 8-dimensional coefficient vectors have a high concentration of probability around the origin. Therefore, Huffman coding is optionally tried for the quantization indices of HRQ to increase efficiency of quantization when the rate is smaller than 6 bit/coefficient. LRQ is a 1-bit quantizer and uses spherical codes based on the rotated Gosset lattice RE8 [5] as the codebook. The RE8 lattice is defined as RE8 = 2 D8 U {2 D8 + (1, 1, 1, 1, 1, 1, 1, 1)} October 18-21, 2009, New Paltz, NY (4) 267 Encoder Decoder Encoder + Decoder Bitrate (kbps) Avg. Max Avg. Max Avg. Max 32 6.663 7.996 6.876 7.413 13.539 15.397 48 7.427 9.073 7.230 7.806 14.657 16.861 64 7.912 9.899 7.554 8.161 15.466 18.060 80 8.303 10.564 7.865 8.496 16.169 19.026 96 8.555 10.796 8.122 8.757 16.677 19.536 112 8.787 10.881 8.368 9.009 17.155 19.890 128 8.980 11.775 8.610 9.225 17.590 21.000 Table 2: Memory usage of the G.719 codec. Memory type Encoder Decoder Encoder + Decoder Static RAM 1 1.0 3.9 4.9 RAM 1 12.2 12.2 24.4 8.3 8.9 10.7 1.2 1.8 Scratch Data ROM 1 Program ROM 1 2 1.2 In units of 16-b kWords 6. 2 In units of 1000’s basic operators SUBJECTIVE PERFORMANCE OF G.719 Subjective tests for the ITU-T G.719 (ex. G.722.1FB) Optimization/Characterization phase were performed from mid February through early April 2008 by independent listening laboratories in American English, French, and Spanish. According to a test plan designed by ITU-T Q7/SG12 experts, the joint candidate codec conducted two experiments as follows: • Experiment 1: Speech (clean, reverberant, and noisy) • Experiment 2: Mixed content and music Mixed content items are representative of advertisement, film trailers, news with jingles, music with announcements, etc., and contain speech, music, and noise. Each experiment used the “triple stimulus/hidden reference/double blind test method” described in ITU-R Recommendation BS.1116-1. A standard MPEG audio codec, LAME MP3 version 3.97 as found on the LAME website as of September 2006 [7] was used as the reference codec in the subjective tests. The ITU-T requirement 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics was that the G.719 candidate codec at 32, 48, and 64kbps be proven “Not Worse Than” the reference codec at 40, 56, and 64 kbps, respectively, with a 95% statistical confidence level. In addition, the G.719 candidate codec at 64 kbps was also tested against the G.722.1C codecs at 48 kbps for Experiment 2. The subjective test results for the G.719 codec are shown in Figures 3-6. Statistical analysis of the results showed that the G.719 codec met all performance requirements specified for the subjective Optimization/Characterization test. For experiment 1 the G.719 codec was better than the reference codec at all bit rates in both languages. For experiment 2 the G.719 codec is better than the reference codec at the lowest bit rate for all the items and at the two other bitrates for most of the items. An additional subjective listening test for the G.719 codec was conducted later to evaluate the quality of the codec at rates higher than those described in the ITU-T test plan. Because the quality expectation of the codec at these high rates is high, a pre-selection of critical items, for which the quality at the lower bitrate range was most degraded, was conducted prior to testing. The test results are shown in Figure 7. It has been proven that transparency was reached for critical material at 128 kbps. Mixed-content and Music (Spanish) 5.0 4.5 DMOS 4.0 2.5 G.719 64kbps G.722.1C 48kbps G.719 64kbps Ref 64kbps G.719 48kbps Ref 56kbps G.719 32kbps Ref 40kbps 1.0 Figure 6: Subjective test results – Experiment 2 (Spanish). Additional Subjective Test Results (DMOS) 5.5 5.0 4.5 4.0 3.5 3.0 2.5 2.0 G.719 128kbps Source G.719 112kbps Source G.719 96kbps Source G.719 80kbps Source G.719 64kbps Source G.719 48kbps Source Source G.719 32kbps 1.5 1.0 Reverberant Ref 64kbps G.719 64kbps Ref 56kbps G.719 48kbps Ref 40kbps G.719 32kbps Ref 64kbps G.719 64kbps Ref 56kbps G.719 48kbps Ref 40kbps G.719 32kbps Ref 64kbps G.719 64kbps Ref 56kbps G.719 48kbps Ref 40kbps Figure 7: Additional subjective test results. G.719 32kbps DMOS 3.0 1.5 5.0 4.5 4.0 3.5 3.0 2.5 2.0 1.5 1.0 Clean Reverberant and Noisy Figure 3: Subjective test results – Experiment 1 (English). Speech (French) Ref 64kbps G.719 64kbps Ref 56kbps G.719 48kbps Ref 40kbps Reverberant G.719 32kbps Ref 64kbps G.719 64kbps Ref 56kbps G.719 48kbps Ref 40kbps Clean G.719 32kbps Ref 64kbps G.719 64kbps Ref 56kbps G.719 48kbps Ref 40kbps 5.0 4.5 4.0 3.5 3.0 2.5 2.0 1.5 1.0 G.719 32kbps DMOS 3.5 2.0 Speech (North American English) Reverberant and Noisy Figure 4: Subjective test results – Experiment 1 (French). Mixed-content and Music (North American English) 5.0 4.5 4.0 DMOS October 18-21, 2009, New Paltz, NY 3.5 3.0 2.5 2.0 G.719 64kbps G.719 64kbps Ref 64kbps G.719 48kbps Ref 56kbps G.719 32kbps Ref 40kbps 1.0 G.722.1C 48kbps 1.5 Figure 5: Subjective test results – Experiment 2 (English). 7. CONCLUSION This paper has presented the 20 kHz full-band audio coding algorithm and subjective Optimization/Characterization test results of ITU-T Recommendation G.719. The G.719 codec features very high audio quality, extremely low computational complexity, and low algorithmic delay compared to other stateof-the-art audio coding codecs. Subjective test results show that the G.719 codec achieves transparent audio quality at 128 kbps. The main intended applications for this codec are videoconferencing and teleconferencing, as well as Internet streaming. 8. REFERENCES [1] ITU-T Recommendation G.719, “Low-complexity fullband audio coding for high-quality conversational applications,” June 2008. [2] H. S. Malvar, Signal Processing with Lapped Transforms. Norwood, MA: Artech House, 1992. [3] M. Xie, “Lattice Vector Quantization and its Applications in Speech and Audio Coding,” in Speech Processing in Modern Communication: Challenges and Perspectives. Springer, To appear. [4] J. H. Conway and N. J. A. Sloane, Sphere Packings, Lattices and Groups. Springer-Verlag, New York, 1988. [5] C. Lamblin and J.-P. Adoul, “Algorithme de quantification vectorielle sphérique à partir du réseau de Gosset d’ordre 8,” Annales des Télécommunications, pp. 172-186, 1988. [6] ITU-T Recommendation G.191, “Software tools for speech and audio coding standardization,” Sept. 2005. [7] http://sourceforge.net/project/showfiles.php?group_id=290 &package_id=309. 268
© Copyright 2026 Paperzz