ITU-T G.719: A New Low-Complexity Full-Band

2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
O
ctober 18-21, 2009, New Paltz, NY
ITU-T G.719: A NEW LOW-COMPLEXITY
FULL-BAND (20 KHZ) AUDIO CODING
STANDARD FOR HIGH-QUALITY CONVERSATIONAL APPLICATIONS
Anisse Taleb and Manuel Briand
Minjie Xie ∗ and Peter Chu
Polycom, Inc.
10 Mall Road, #110, Burlington, MA 01803, USA
[email protected]
ABSTRACT
This paper describes a new low-complexity full-band (20 kHz)
audio coding algorithm which has been recently standardized
by ITU-T as Recommendation G.719. The algorithm is
designed to provide 20 Hz - 20 kHz audio bandwidth using a 48
kHz sample rate, operating at 32 - 128 kbps. This codec
features very high audio quality and low computational
complexity and is suitable for use in applications such as
videoconferencing, teleconferencing, and streaming audio over
the Internet. Subjective test results from the Optimization/
Characterization phase of G.719 are also presented in the paper.
Index Terms— Audio coding, full-band, low
complexity, adaptive time-frequency transform, fast
lattice vector quantization
1.
INTRODUCTION
In hands-free videoconferencing and teleconferencing markets,
there is strong and increasing demand for audio coding
providing the full human auditory bandwidth of 20 Hz to 20
kHz. This is because:
• Conferencing systems are increasingly used for more
elaborate presentations, often including music and sound
effects (i.e. animal sounds, musical instruments, vehicles or
nature sounds, etc.) which occupy a wider audio band than
speech. Presentations involve remote education of music,
playback of audio and video from DVDs and VCRs,
audio/video clips from PCs, and elaborate audio-visual
presentations from, for example, PowerPoint.
• Users perceive the bandwidth of 20 Hz to 20 kHz as
representing the ultimate goal for audio bandwidth. The
resulting market pressures ar
e causing a shift in this
direction, now that sufficient IP bitrate and audio coding
technology are available to deliver this.
As with any audio codec for hands-free videoconferencing
use, the requirements include:
• Low latency (support natural conversation)
• Low complexity (free cycles for video processing and other
audio processing; reduce cost)
• High quality on all signal types
To meet the market need for such a full-band audio coding
standard, ITU-T Q10/SG16 launched the standardization of
Low-Complexity Full-Band Audio Coding Extension to G.722.1
(G.722.1FB) in November 2006. The ITU-T G.722.1FB
∗
Ericsson Research – Mu ltimedia Technologies
Färögatan 6, Kista, Sweden
{anisse.taleb, manuel.briand}@ericsson.com
Qualification phase was completed in June 2007 and two
candidates submitted respectively by Ericsson AB and Polycom,
Inc. were qualified. In September 2007, the two proponents
announced their collaboration in an Optimization/
Characterization phase and agreed to jointly develop a candidate
codec. The joint candidate codec met all the requirements in the
subjective quality tests for the G.722.1FB Optimization/
Characterization phase in April 2008. In June 2008, the joint
codec was adopted as ITU-T Recommendation G.719 which is
the first full-band audio coding standard of ITU-T [1].
In this paper we first give an overview of the ITU-T G.719
codec, followed by a description of two important modules of
the codec - adaptive time-frequency transform and quantization
of transform coefficients. We also present the evaluation of the
computational complexity and memory requirements of the
codec. Finally, subjective test results from the Optimization/
Characterization phase are summarized.
2.
OVERVIEW OF THE G.719 CODEC
The G.719 codec is a low-complexity transform-based audio
codec and can provide an audio bandwidth of 20 Hz to 20 kHz
at 32 - 128 kbps. The codec operates on frames of 20 ms
corresponding to 960 samples at a sampling rate of 48 kHz and
has an algorithmic delay of 40 ms. The codec features very high
audio quality and extremely low computational complexity
compared to other state-of-the-art audio coding algorithms.
2.1 Encoder
Norm
estimation
Transient
detector
Input signal
Fs = 48 kHz
265
Norm
adjustment
Bit
allocation
Spectrum
normalization
Transform
FLVQ
and coding
Noise level
adjustment
Transient
signaling
Quantization
indices
Huffman
code
Noise
level
adjustment
index
MUX
Figure 1: Block diagram of the G.719 encoder.
Minjie Xie was with Polycom, Inc., when this work was done. He is now with Huawei Technologies (USA).
978-1-4244-3679-8/09/$25.00©2009 IEEE
Norm
quantization
and coding
2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
October 18-21, 2009, New Paltz, NY
A block diagram of the G.719 encoder is shown in Figure 1.
The input signal sampled at 48 kHz is processed through a
transient detector. Depending on the detection of a transient,
indicated by a flag IsTransient, a high frequency resolution or a
low frequency resolution transform is applied on the input signal
frame. The adaptive transform is based on a modified discrete
cosine transform (MDCT) [2] in case of stationary frames. For
transient frames, the MDCT is modified to obtain a higher
temporal resolution without a need for additional delay and with
very little overhead in complexity. Transient frames have a
temporal resolution equivalent to 5 ms frames.
The obtained transform coefficients are grouped into bands
of unequal lengths as 8, 16, 24, or 32. As the bandwidth is 20
kHz, only 800 transform coefficients are used. The 160
transform coefficients representing frequencies above 20 kHz
are ignored. The norm or power of each band is estimated and
the resulting spectral envelope consisting of the norms of all
bands is scalar quantized and encoded. The coefficients are
then normalized by the quantized norms. The quantized norms
are further adjusted based on adaptive spectral weighting and
used as input for bit allocation. An adaptive bit-allocation
scheme based on the quantized norms of the bands is used to
assign the available bits in a frame among the bands. The
number of bits assigned to each transform coefficient can be as
large as 9 bits depending on the input signal. The normalized
transform coefficients are lattice vector quantized according to
the allocated bits for each band. Huffman coding is applied to
the quantization indices for both the coded norms and transform
coefficients. The norm of the non-coded transform coefficients
is estimated, coded, and transmitted to the decoder.
coefficients and regenerated transform coefficients are mixed
and lead to normalized spectrum. The decoded spectral
envelope is applied to the decoded full-band spectrum. Finally,
the inverse transform is applied to recover the time-domain
decoded signal. This is performed by applying either the
inverse MDCT for stationary mode, or the inverse of the higher
temporal resolution transform for transient mode.
2.2 Decoder
The resulting signal y(k) represents the transform coefficients of
the input frame. It should be noted that in stationary mode, the
cascade of windowing, TDA, and DCTIV is equivalent to
applying the modulated lapped transform (MLT) [2].
If the signal is detected as a transient, a further re-ordering
of the time-domain aliased signal is performed. The basis
functions of the resulting filter-bank would have an incoherent
time and frequency responses without re-ordering. The reordering operation consists of shuffling the upper and lower half
x (n) as follows:
of the TDA output signal ~
DEMUX
+
Envelope
Shaping
Transient signaling
Spectral-fill
Generator
Norm indices
Noise level
adjustment index
FLVQ indices
Transient signaling
FLVQ
Decoding
Audio signal
Fs = 48 kHz
Inverse
Transform
Figure 2: Block diagram of the G.719 decoder.
A block diagram of the decoder is shown in Figure 2. The
transient flag is first decoded which indicates the frame
configuration, i.e. stationary or transient. The spectral envelope
is then decoded and the same, bit-exact, norm adjustment and
bit-allocation algorithms are used at the decoder to re-compute
the bit-allocation which is essential for decoding quantization
indices of the normalized transform coefficients.
After decoding the transform coefficients, the non-coded
transform coefficients (allocated zero bits) in low frequencies
are regenerated by using a spectral-fill codebook built from the
decoded transform coefficients. The noise level adjustment
index is used to adjust the level of the regenerated coefficients.
The non-coded transform coefficients in high frequencies are
regenerated using bandwidth extension. The decoded transform
3.
ADAPTIVE TIME-FREQUENCY TRANSFORM
The adaptive time-frequency transform is based on the detection
of a transient. In the case of stationary signals, the transform
has a high frequency resolution which is able to efficiently
represent stationary sounds. In the case of transient sounds, the
time-frequency transform will increase its time resolution and
allows a better representation of the rapid changes in the input
signal characteristics. These two modes of operations share a
common buffering and windowing module and the switching
between one mode of operation to the other is instantaneous.
Thus no additional look-ahead in the transient detection is
needed. This allows the codec to have selectable time resolution
at low complexity and with zero additional delay.
If the input signal is detected as stationary, a type IV
discrete cosine transform (DCTIV) [2] is applied on the output
~
x (n) of the time domain aliasing (TDA) operation. The DCTIV
of the time aliased signal is defined by the following equation:
959
y (k ) =
⎡
π ⎤
∑ ~x (n) cos⎢⎣⎛⎜⎝ n + 2 ⎞⎟⎠⎛⎜⎝ k + 2 ⎞⎟⎠ 960 ⎥⎦ , k = 0, 1, …, 959 (1)
1
1
n =0
v ( n) = ~
x (959 − n) , n = 0, 1, …, 959.
(2)
This re-ordering is only conceptual and in reality no
computations are involved. Higher time resolution is obtained
by zero padding the signal v(n) and dividing the resulting signal
into four overlapped equal length sub-frames. The amount of
zero-padding is equal to 120 on each side of the signal. The
segments are 50% overlapped and each segment has a length
equal to 480. The two inner segments are post-windowed using
a sine window of length 480. The windows for outer segments
are constructed using half a sine window.
Each resulting post-windowed segment is further processed
by applying the MDCT, i.e. time aliasing followed by DCTIV.
The output of the MDCT for each segment represents the signal
spectrum at different time instants, thus in case of transients, a
higher time resolution is used. The length of the output of each
of the four MDCT is half the length of the input segment, i.e.
240. Therefore, this operation does not introduce additional
redundancy.
266
2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
4.
and consists of the points falling on concentric spheres of radius
TRANSFORM COEFFICIENT QUANTIZATION
Each band consists of one or more vectors of 8-dimensional
transform coefficients and the coefficients are normalized by the
quantized norm. All 8-dimensional vectors belonging to one
band are assigned the same number of bits for quantization. A
fast lattice vector quantization (FLVQ) scheme [3] is used to
quantize the normalized coefficients in 8 dimensions. In FLVQ
the quantizer comprises two sub-quantizers: a D8-based higherrate lattice vector quantizer (HRQ) and an RE8-based lower-rate
lattice vector quantizer (LRQ).
HRQ is a multi-rate quantizer designed to quantize the
transform coefficients at rates of 2 up to 9 bit/coefficient and its
codebook is based on the so-called Voronoi code for the D8
lattice [4]. D8 is a well-known lattice and defined as:
8
D8 = {(y1, y2, y3, y4, y5, y6, y7, y8) ∈ Z8 |
∑y
i
= even} (3)
2 2r centered at the origin, where r = 0, 1, 2, ... . The set of
points on a sphere constitutes a spherical code and can be used
as a quantization codebook. The LRQ codebook of 256
codewords is stored in a structured look-up table so that a fast
searching method and a fast indexing method are developed to
reduce the computational complexity [3].
5.
COMPLEXITY OF THE G.719 CODEC
The computational complexity of the G.719 codec in 16/32-bit
fixed-point is estimated by encoding and decoding the source
material used for the subjective tests of the G.719 Optimization/
Characterization phase. The observed average and worst-case
complexity in units of WMOPS [6] are shown in Table 1 for
different bitrates. Storage requirements are reported in Table 2.
Table 1: Computational complexity of the G.719 codec.
i =1
where Z8 is the lattice which consists of all points with integer
coordinates. It can be seen that D8 consists of the points having
integer coordinates with an even sum. The codebook of HRQ is
constructed from a finite region of the D8 lattice and is not
stored in memory. The codewords are generated by a simple
algebraic method and a fast quantization algorithm is used.
To minimize the distortion for a given rate, the D8 lattice
should be truncated and scaled. Actually, the input vectors
instead of the lattice codebook are scaled in order to use the fast
searching and indexing algorithms proposed by Conway and
Sloane [4]. However, the fast searching algorithm introduced in
[4] assumes an infinite lattice which can not be used as the
codebook in real-time audio coding systems. In other words, for
a given rate the algorithm can not be used to quantize the input
vectors lying outside the truncated lattice region. To resolve
this problem, a fast method for quantizing “outliers” xo is
developed and described as follows:
1) Scale down the vector xo by 2: xo = xo / 2.
2) Find the nearest D8 point u to xo and then compute the
index vector j of u using the indexing algorithm in [4].
3) Find the codeword y from the index vector j and then
compare y with u. If y is different from u, repeat Steps 1)
to 3). Otherwise, compute w = xo / 16. Note that due to
the normalization of transform coefficients, a few
iterations are required to find the best codeword.
4) Compute xo = xo + w.
5) Find the nearest D8 point u to xo and then compute the
index vector j of u using the indexing algorithm in [4].
6) Find the codeword y from the index vector j and then
compare y with u. If y and u are exactly same, k = j and
repeat Steps 4) to 6). Otherwise, k is the index of the best
codeword to xo and stop.
It has been observed that the 8-dimensional coefficient
vectors have a high concentration of probability around the
origin. Therefore, Huffman coding is optionally tried for the
quantization indices of HRQ to increase efficiency of
quantization when the rate is smaller than 6 bit/coefficient.
LRQ is a 1-bit quantizer and uses spherical codes based on
the rotated Gosset lattice RE8 [5] as the codebook. The RE8
lattice is defined as
RE8 = 2 D8 U {2 D8 + (1, 1, 1, 1, 1, 1, 1, 1)}
October 18-21, 2009, New Paltz, NY
(4)
267
Encoder
Decoder
Encoder + Decoder
Bitrate
(kbps)
Avg.
Max
Avg.
Max
Avg.
Max
32
6.663
7.996
6.876
7.413
13.539
15.397
48
7.427
9.073
7.230
7.806
14.657
16.861
64
7.912
9.899
7.554
8.161
15.466
18.060
80
8.303
10.564
7.865
8.496
16.169
19.026
96
8.555
10.796
8.122
8.757
16.677
19.536
112
8.787
10.881
8.368
9.009
17.155
19.890
128
8.980
11.775
8.610
9.225
17.590
21.000
Table 2: Memory usage of the G.719 codec.
Memory type
Encoder
Decoder
Encoder + Decoder
Static RAM 1
1.0
3.9
4.9
RAM 1
12.2
12.2
24.4
8.3
8.9
10.7
1.2
1.8
Scratch
Data ROM 1
Program ROM
1
2
1.2
In units of 16-b kWords
6.
2
In units of 1000’s basic operators
SUBJECTIVE PERFORMANCE OF G.719
Subjective tests for the ITU-T G.719 (ex. G.722.1FB)
Optimization/Characterization phase were performed from mid
February through early April 2008 by independent listening
laboratories in American English, French, and Spanish.
According to a test plan designed by ITU-T Q7/SG12 experts,
the joint candidate codec conducted two experiments as follows:
• Experiment 1: Speech (clean, reverberant, and noisy)
• Experiment 2: Mixed content and music
Mixed content items are representative of advertisement, film
trailers, news with jingles, music with announcements, etc., and
contain speech, music, and noise. Each experiment used the
“triple stimulus/hidden reference/double blind test method”
described in ITU-R Recommendation BS.1116-1. A standard
MPEG audio codec, LAME MP3 version 3.97 as found on the
LAME website as of September 2006 [7] was used as the
reference codec in the subjective tests. The ITU-T requirement
2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
was that the G.719 candidate codec at 32, 48, and 64kbps be
proven “Not Worse Than” the reference codec at 40, 56, and 64
kbps, respectively, with a 95% statistical confidence level. In
addition, the G.719 candidate codec at 64 kbps was also tested
against the G.722.1C codecs at 48 kbps for Experiment 2.
The subjective test results for the G.719 codec are shown in
Figures 3-6. Statistical analysis of the results showed that the
G.719 codec met all performance requirements specified for the
subjective Optimization/Characterization test. For experiment 1
the G.719 codec was better than the reference codec at all bit
rates in both languages. For experiment 2 the G.719 codec is
better than the reference codec at the lowest bit rate for all the
items and at the two other bitrates for most of the items.
An additional subjective listening test for the G.719 codec
was conducted later to evaluate the quality of the codec at rates
higher than those described in the ITU-T test plan. Because the
quality expectation of the codec at these high rates is high, a
pre-selection of critical items, for which the quality at the lower
bitrate range was most degraded, was conducted prior to testing.
The test results are shown in Figure 7. It has been proven that
transparency was reached for critical material at 128 kbps.
Mixed-content and Music (Spanish)
5.0
4.5
DMOS
4.0
2.5
G.719
64kbps
G.722.1C
48kbps
G.719
64kbps
Ref
64kbps
G.719
48kbps
Ref
56kbps
G.719
32kbps
Ref
40kbps
1.0
Figure 6: Subjective test results – Experiment 2 (Spanish).
Additional Subjective Test Results (DMOS)
5.5
5.0
4.5
4.0
3.5
3.0
2.5
2.0
G.719
128kbps
Source
G.719
112kbps
Source
G.719
96kbps
Source
G.719
80kbps
Source
G.719
64kbps
Source
G.719
48kbps
Source
Source
G.719
32kbps
1.5
1.0
Reverberant
Ref 64kbps
G.719 64kbps
Ref 56kbps
G.719 48kbps
Ref 40kbps
G.719 32kbps
Ref 64kbps
G.719 64kbps
Ref 56kbps
G.719 48kbps
Ref 40kbps
G.719 32kbps
Ref 64kbps
G.719 64kbps
Ref 56kbps
G.719 48kbps
Ref 40kbps
Figure 7: Additional subjective test results.
G.719 32kbps
DMOS
3.0
1.5
5.0
4.5
4.0
3.5
3.0
2.5
2.0
1.5
1.0
Clean
Reverberant and Noisy
Figure 3: Subjective test results – Experiment 1 (English).
Speech (French)
Ref 64kbps
G.719 64kbps
Ref 56kbps
G.719 48kbps
Ref 40kbps
Reverberant
G.719 32kbps
Ref 64kbps
G.719 64kbps
Ref 56kbps
G.719 48kbps
Ref 40kbps
Clean
G.719 32kbps
Ref 64kbps
G.719 64kbps
Ref 56kbps
G.719 48kbps
Ref 40kbps
5.0
4.5
4.0
3.5
3.0
2.5
2.0
1.5
1.0
G.719 32kbps
DMOS
3.5
2.0
Speech (North American English)
Reverberant and Noisy
Figure 4: Subjective test results – Experiment 1 (French).
Mixed-content and Music (North American English)
5.0
4.5
4.0
DMOS
October 18-21, 2009, New Paltz, NY
3.5
3.0
2.5
2.0
G.719
64kbps
G.719
64kbps
Ref
64kbps
G.719
48kbps
Ref
56kbps
G.719
32kbps
Ref
40kbps
1.0
G.722.1C
48kbps
1.5
Figure 5: Subjective test results – Experiment 2 (English).
7.
CONCLUSION
This paper has presented the 20 kHz full-band audio coding
algorithm and subjective Optimization/Characterization test
results of ITU-T Recommendation G.719. The G.719 codec
features very high audio quality, extremely low computational
complexity, and low algorithmic delay compared to other stateof-the-art audio coding codecs. Subjective test results show that
the G.719 codec achieves transparent audio quality at 128 kbps.
The main intended applications for this codec are
videoconferencing and teleconferencing, as well as Internet
streaming.
8.
REFERENCES
[1] ITU-T Recommendation G.719, “Low-complexity fullband audio coding for high-quality conversational
applications,” June 2008.
[2] H. S. Malvar, Signal Processing with Lapped Transforms.
Norwood, MA: Artech House, 1992.
[3] M. Xie, “Lattice Vector Quantization and its Applications
in Speech and Audio Coding,” in Speech Processing in
Modern Communication: Challenges and Perspectives.
Springer, To appear.
[4] J. H. Conway and N. J. A. Sloane, Sphere Packings,
Lattices and Groups. Springer-Verlag, New York, 1988.
[5] C. Lamblin and J.-P. Adoul, “Algorithme de quantification
vectorielle sphérique à partir du réseau de Gosset d’ordre
8,” Annales des Télécommunications, pp. 172-186, 1988.
[6] ITU-T Recommendation G.191, “Software tools for speech
and audio coding standardization,” Sept. 2005.
[7] http://sourceforge.net/project/showfiles.php?group_id=290
&package_id=309.
268