A Phase Robust Spectral Magnitude Estimator
for Acoustic Echo Suppression
Øystein Birkenes
TANDBERG, now a part of Cisco Systems, Inc.
Philip Pedersens Vei 20, 1366 Lysaker, Norway
Email: [email protected]
Abstract—Acoustic echo cancellation (AEC) using linear adaptive filters breaks down during phase variations in the echo path.
Such phase variations frequently occur on personal computer
platforms. Acoustic echo suppression (AES) has been proposed
as a more robust method than AEC. In this method, the
spectral magnitude of the echo needs to be estimated. Existing
approaches to spectral magnitude estimation are either non-robust or inaccurate. In this paper we present a new spectral
magnitude estimator, which is accurate and at the same time
robust to phase variations.
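[Figure 1: block diagram. The far-end signal x(n) and the microphone signal y(n) each pass through an analysis filter bank. In each subband k, the echo magnitude |Û (m, k)| is estimated from X(m, k) and Y (m, k), a gain Gk (m) is computed from |Y (m, k)| and |Û (m, k)| and multiplied with Y (m, k) to give Z(m, k). A synthesis filter bank forms the output z(n).]
Fig. 1. Acoustic echo suppression.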
I. INTRODUCTION
Removal of unwanted echo caused by acoustic coupling between a loudspeaker and a microphone is essential towards the
goal of achieving natural communication in teleconferencing
systems. In traditional acoustic echo cancellation (AEC) [1],
the echo path is modeled using a linear adaptive filter. Echo is
removed by subtracting the output of the adaptive filter from
the microphone signal.
The above approach breaks down in the presence of phase
variations in the echo path, which typically occur during
teleconferencing with personal computers [2]. As a solution,
Avendano [3] proposed an acoustic echo suppression (AES)
method that removes echo by spectral subtraction. The idea
is to achieve robustness through estimation of the spectral
magnitude of the echo signal while ignoring the phase. Unfortunately, even though the spectral magnitude estimator in
[3] is accurate during ideal operating conditions, it is not
robust against phase variations since it relies on complex linear
adaptive subband filters.
In [4], the authors investigated spectral magnitude estimators in which the echo magnitude in each frequency subband
is estimated using a real adaptive filter operating on either
the magnitude or the squared magnitude of the input samples.
These estimators ignore the phase of the loudspeaker signal
and are therefore robust. They are also computationally efficient, since only a few real parameters need to be estimated for
each subband. However, the robustness and reduced computational complexity are achieved at the cost of poor estimates.
Good echo suppression may still be achieved by making the
spectral subtraction method more aggressive, but this may lead
to artifacts and near-end speech distortion, in particular during
doubletalk situations.
In this paper, we introduce a robust re-parametrization of
the spectral magnitude estimator in [3]. In this way, accurate
echo magnitude estimates are obtained even in the presence
of phase variations in the echo path. The result is that echo
may be sufficiently attenuated while artifacts and near-end
speech distortion are kept low. We demonstrate the accuracy
and robustness of the proposed method through simulations.
The paper is organized as follows. In the next section we
briefly review the AES approach in [3] before we introduce
our robust spectral magnitude estimator and make comparisons
with other approaches. In Sec. III we present results from simulation experiments, and in Sec. IV we state the conclusions.
II. ACOUSTIC ECHO SUPPRESSION AND SPECTRAL
MAGNITUDE ESTIMATION
A. Acoustic Echo Suppression
An illustration of the class of acoustic echo suppression
(AES) methods that we consider in this paper is shown in
Fig. 1. The microphone signal y(n), comprising echo, near-end speech, and noise, is divided into a set of frequency
subbands with the use of an analysis filter bank. We write
the microphone signal in frame m and subband k as
$$Y(m,k) = U(m,k) + V(m,k) + W(m,k), \tag{1}$$
where U (m, k), V (m, k), and W (m, k) are the echo, near-end
speech, and noise in frame m and subband k, respectively.
The signal x(n) from far-end is likewise divided into a set
of frequency subbands, where X(m, k) denotes the signal in
frame m and subband k. For each frame m and subband k,
an estimate |Û (m, k)| of the echo magnitude is computed
based on the signals X(m, k) and Y (m, k), followed by a
gain computation according to [5]
$$G_k(m) = \left(\frac{\max\left(|Y(m,k)|^{\alpha} - \beta\,|\hat{U}(m,k)|^{\alpha},\; 0\right)}{|Y(m,k)|^{\alpha}}\right)^{1/\alpha}, \tag{2}$$
where the parameters α > 0 and β > 0 are used to control the
amount of echo reduction versus speech distortion. The output
in frame m and subband k is
$$Z(m,k) = G_k(m)\,Y(m,k). \tag{3}$$
Finally, the full-band output z(n) sent to the far-end is formed
by combining all subband signals Z(m, k) using a synthesis
filter bank.
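
The gain rule in (2) and the output in (3) can be sketched for a single frame and subband as follows. This is a minimal illustration rather than the implementation used in the experiments; the function names are hypothetical, and returning zero gain for a silent bin is an assumption made here to avoid a division by zero.

```python
def suppression_gain(y_mag: float, echo_mag: float,
                     alpha: float = 1.0, beta: float = 1.0) -> float:
    """Gain of Eq. (2) for one frame and one subband."""
    if y_mag <= 0.0:
        return 0.0  # assumption: a silent bin gets zero gain (avoids 0/0)
    numerator = max(y_mag ** alpha - beta * echo_mag ** alpha, 0.0)
    return (numerator / y_mag ** alpha) ** (1.0 / alpha)


def suppress(Y: complex, echo_mag: float,
             alpha: float = 1.0, beta: float = 1.0) -> complex:
    """Output of Eq. (3): the real gain scales Y, so the microphone phase is kept."""
    return suppression_gain(abs(Y), echo_mag, alpha, beta) * Y


print(suppression_gain(2.0, 1.0))   # echo estimate is half the microphone magnitude
print(suppression_gain(1.0, 2.0))   # echo estimate exceeds the microphone magnitude
```

With α = β = 1 the rule reduces to magnitude subtraction; larger β subtracts more of the estimated echo and therefore also removes more near-end speech.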
If |Û (m, k)| is a good estimate of |U (m, k)|, AES works
well by attenuating subbands with echo and passing through
subbands with near-end speech. However, if |Û (m, k)| is a
poor estimate of |U (m, k)|, which may happen if the magnitude estimator is inaccurate and/or non-robust, AES may suffer
from residual echo, artifacts, and near-end speech distortion.
Higher echo reduction can be achieved by making the gains
more aggressive through selection of appropriate values for
α and β, but then there will be more audible artifacts and
near-end speech distortion, especially during doubletalk. This
trade-off between residual echo, on the one hand, and artifacts
and speech distortion, on the other, motivates the search for
accurate and robust spectral magnitude estimators. In the
following we derive one such estimator.
B. A Phase Robust Spectral Magnitude Estimator
As long as the overlap between adjacent frequency subbands
is small, a good model (although non-robust, as we shall see
in Sec. II-C) for the complex echo signal in frame m and
subband k is [3]
$$\hat{U}(m,k) = \sum_{i=0}^{L-1} H_k(i)\,X(m-i,k), \tag{4}$$
where Hk (0), . . . , Hk (L−1) are L complex parameters. In the
following we derive an expression for the squared magnitude
of (4) and split each term into two factors, where the first
factor is a function only of the parameters, and the second
factor is a function only of the input samples X(m, k), X(m−
1, k), . . . , X(m − L + 1, k). This will lead to an accurate and
robust estimator of the magnitude of each subband echo signal.
The squared magnitude of (4) can be written
$$|\hat{U}(m,k)|^2 = \sum_{i=0}^{L-1} |H_k(i)|^2\,|X(m-i,k)|^2 + \sum_{i=0}^{L-1}\sum_{\substack{j=0 \\ j\neq i}}^{L-1} H_k(i)H_k^*(j)\,X(m-i,k)X^*(m-j,k). \tag{5}$$
Recognizing that the last sum consists of pairs of complex
conjugate terms, we can write
$$|\hat{U}(m,k)|^2 = \sum_{i=0}^{L-1} |H_k(i)|^2\,|X(m-i,k)|^2 + 2\sum_{i=0}^{L-1}\sum_{j=i+1}^{L-1} \mathrm{Re}\{H_k(i)H_k^*(j)\,X(m-i,k)X^*(m-j,k)\}. \tag{6}$$
Next, we use the following easily verifiable identity: for two
complex numbers z1 and z2 , we can write Re{z1 z2 } =
Re{z1 }Re{z2 } − Im{z1 }Im{z2 }. Thus
$$|\hat{U}(m,k)|^2 = \sum_{i=0}^{L-1} |H_k(i)|^2\,|X(m-i,k)|^2 + 2\sum_{i=0}^{L-1}\sum_{j=i+1}^{L-1} \Big[\mathrm{Re}\{H_k(i)H_k^*(j)\}\,\mathrm{Re}\{X(m-i,k)X^*(m-j,k)\} - \mathrm{Im}\{H_k(i)H_k^*(j)\}\,\mathrm{Im}\{X(m-i,k)X^*(m-j,k)\}\Big]. \tag{7}$$
We arrive at a new model for the squared magnitude of the
echo in frame m and subband k by introducing a set of L² real parameters Fk (i, j), 0 ≤ i, j ≤ L − 1, as follows:
$$\hat{U}^2_{\mathrm{pro}}(m,k) = \sum_{i=0}^{L-1} F_k(i,i)\,|X(m-i,k)|^2 + \sum_{i=0}^{L-1}\sum_{j=i+1}^{L-1} \Big[F_k(i,j)\,\mathrm{Re}\{X(m-i,k)X^*(m-j,k)\} + F_k(j,i)\,\mathrm{Im}\{X(m-i,k)X^*(m-j,k)\}\Big]. \tag{8}$$
Finally, our proposed estimator for the echo magnitude is
$$\left|\hat{U}_{\mathrm{pro}}(m,k)\right| = \sqrt{\max\left(\hat{U}^2_{\mathrm{pro}}(m,k),\; 0\right)}, \tag{9}$$
where the max operator is needed since $\hat{U}^2_{\mathrm{pro}}(m,k)$ may be
negative.
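
As a numerical sanity check on the derivation, the sketch below builds the L² regressor terms of (8), sets each Fk (i, j) to its counterpart in (7), and compares the result of (9) with the magnitude of the linear model in (4). The function names, the dictionary layout keyed by (i, j), and the random test data are assumptions made for illustration only.

```python
import math
import random


def pro_regressor(x):
    """Input terms of Eq. (8); x[i] holds X(m - i, k) for i = 0..L-1.

    Returns a dict keyed by (i, j) so that each entry multiplies F_k(i, j).
    """
    L = len(x)
    phi = {}
    for i in range(L):
        phi[(i, i)] = abs(x[i]) ** 2
        for j in range(i + 1, L):
            c = x[i] * x[j].conjugate()
            phi[(i, j)] = c.real   # multiplies F_k(i, j), j > i
            phi[(j, i)] = c.imag   # multiplies F_k(j, i)
    return phi


def params_from_filter(h):
    """Counterparts of F_k(i, j) implied by Eq. (7) for a complex filter h."""
    L = len(h)
    F = {}
    for i in range(L):
        F[(i, i)] = abs(h[i]) ** 2
        for j in range(i + 1, L):
            c = h[i] * h[j].conjugate()
            F[(i, j)] = 2.0 * c.real
            F[(j, i)] = -2.0 * c.imag
    return F


def pro_magnitude(F, x):
    """Eq. (9): square root of the clipped squared-magnitude model."""
    phi = pro_regressor(x)
    squared = sum(F[key] * phi[key] for key in phi)
    return math.sqrt(max(squared, 0.0))


rng = random.Random(0)
L = 5
h = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(L)]  # hypothetical H_k(i)
x = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(L)]  # hypothetical X(m-i, k)

direct = abs(sum(hi * xi for hi, xi in zip(h, x)))      # |Eq. (4)|
via_model = pro_magnitude(params_from_filter(h), x)      # Eq. (9) with Eq. (7) parameters
print(len(pro_regressor(x)), abs(direct - via_model))
```

The regressor has L diagonal terms plus two terms per pair i < j, which gives the L² real parameters stated above.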
The model in (8) is non-linear in the input samples, but
linear in the real parameters. The latter implies that it is
straightforward to find an online estimation algorithm for
the parameters. In particular, common adaptive algorithms
[6] such as normalized least mean squares (NLMS), affine
projection algorithm (APA), or recursive least squares (RLS)
can be used. Furthermore, it should be noted that (8) is a more
general model than (7) since the parameters of the former are
not restricted to lie on the subset implied by the latter. As an
example, Fk (i, i) ∈ R may be negative, whereas |Hk (i)|² ≥ 0,
the corresponding factor in (7), is inevitably non-negative.
Nevertheless, since (4) has been shown [3] to be a good model
of the subband echo signal, we expect each parameter Fk (i, j)
to converge to its counterpart in (7).
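
Because (8) is linear in the real parameters, a standard adaptive update applies directly. The sketch below uses NLMS for brevity (the experiments in Sec. III use RLS). It assumes, purely for illustration, that the regression target is the noise-free squared echo magnitude produced by a hypothetical fixed filter; the step size, regularization constant, and signal statistics are likewise assumptions.

```python
import random


def regressor(x):
    """Flat regressor of Eq. (8): L power terms, then Re/Im of each cross-term."""
    L = len(x)
    phi = [abs(v) ** 2 for v in x]
    for i in range(L):
        for j in range(i + 1, L):
            c = x[i] * x[j].conjugate()
            phi += [c.real, c.imag]
    return phi


def nlms_step(f, phi, target, mu=0.5, eps=1e-8):
    """One NLMS update of the real parameter vector f; returns (new f, prior error)."""
    err = target - sum(a * b for a, b in zip(f, phi))
    norm = sum(b * b for b in phi) + eps
    return [a + mu * err * b / norm for a, b in zip(f, phi)], err


rng = random.Random(1)
L = 3
h = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(L)]  # hypothetical echo path
f = [0.0] * (L * L)      # L^2 real parameters, all starting at zero
x = [0j] * L             # delay line X(m, k), ..., X(m - L + 1, k)
errors = []
for m in range(5000):
    x = [complex(rng.gauss(0, 1), rng.gauss(0, 1))] + x[:-1]     # newest sample first
    target = abs(sum(hi * xi for hi, xi in zip(h, x))) ** 2       # noise-free |U(m, k)|^2
    f, err = nlms_step(f, regressor(x), target)
    errors.append(err * err)

early = sum(errors[:500]) / 500
late = sum(errors[-500:]) / 500
print(early, late)
```

In this noise-free setting the squared error decays toward the numerical floor, since the target lies exactly in the span of the model.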
Therefore, with the above assumption, Fk (i, j) for j > i attempts to estimate
$$2\,\mathrm{Re}\{H_k(i)H_k^*(j)\} = 2\,|H_k(i)|\,|H_k(j)|\cos\left(\psi_{H_k(i)} - \psi_{H_k(j)}\right),$$
where, for a complex number z, ψz denotes its phase. This means that instead of estimating the phases of the complex parameters
Hk (0), . . . , Hk (L − 1), the model in (8) estimates the phase
differences between them. These phase differences are approximately unaltered after phase variations in the echo path. The
same applies to Fk (i, j) for j < i. This phase robustness
property of the proposed estimator is demonstrated through
the simulation results in Sec. III.
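
The invariance argument can be illustrated numerically. The sketch below assumes an idealized phase variation that rotates every tap Hk (i) by the same angle, which is a simplification of a real delay change; the filter values and the angle are hypothetical.

```python
import cmath
import random


def cross_params(h):
    """Cross-term counterparts from Eq. (7): 2 Re and -2 Im of H(i) H*(j), j > i."""
    out = []
    for i in range(len(h)):
        for j in range(i + 1, len(h)):
            c = h[i] * h[j].conjugate()
            out += [2.0 * c.real, -2.0 * c.imag]
    return out


rng = random.Random(2)
h = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(4)]  # hypothetical H_k(i)
theta = 1.3                                   # hypothetical common phase shift (radians)
h_rot = [hi * cmath.exp(1j * theta) for hi in h]

tap_change = max(abs(a - b) for a, b in zip(h, h_rot))
param_change = max(abs(a - b)
                   for a, b in zip(cross_params(h), cross_params(h_rot)))
print(f"largest change in H_k(i):          {tap_change:.3e}")
print(f"largest change in F_k(i,j) targets: {param_change:.3e}")
```

The complex taps move substantially, so a filter of the form (4) must re-adapt, whereas the quantities that Fk (i, j) tracks depend only on products Hk (i)Hk∗ (j) and are unchanged up to rounding.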
As a final note, keep in mind that no special measures
have been taken in order to avoid parameter divergence during
doubletalk. A common solution to the doubletalk problem is to
freeze the parameter updates during doubletalk with the help
of a doubletalk detector [7]. Another solution that has been shown
to be effective is the two-path model introduced in [8].
C. Comparison with Other Spectral Magnitude Estimators
Avendano [3] proposed to estimate the echo magnitude in
each frame m and subband k by taking the magnitude of the
output of the model in (4), that is,
$$\left|\hat{U}_{\mathrm{nr}}(m,k)\right| = \left|\hat{U}(m,k)\right|. \tag{10}$$
This estimator, which we will refer to as the non-robust
magnitude estimator, is not phase robust since a change in the
phase requires the complex parameters Hk (0), . . . , Hk (L − 1)
to re-adapt. Our proposed estimator in (9) is a phase robust
re-parametrization of (10).
We now make comparisons with two robust magnitude
estimators that have appeared in the literature [4] under the
name of power regression and magnitude regression. In power
regression, the squared echo magnitude is modeled as
$$\hat{U}^2_{\mathrm{pow}}(m,k) = \sum_{i=0}^{L-1} F_k(i)\,|X(m-i,k)|^2, \tag{11}$$
where Fk (0), · · · , Fk (L − 1) are real parameters. The estimated echo magnitude is
$$\left|\hat{U}_{\mathrm{pow}}(m,k)\right| = \sqrt{\max\left(\hat{U}^2_{\mathrm{pow}}(m,k),\; 0\right)}. \tag{12}$$
Power regression is phase robust since the phase of the signal
X(m, k) is simply omitted. However, power regression is poor
at estimating the echo magnitude since, as we immediately observe by comparing (11) with (5), all cross-terms are missing.
Magnitude regression can be written
$$\hat{U}_{\mathrm{mag}}(m,k) = \sum_{i=0}^{L-1} F_k(i)\,|X(m-i,k)|, \tag{13}$$
where Fk (0), · · · , Fk (L − 1) are real parameters. The magnitude estimate Ûmag (m, k) may be negative so it is again
necessary to ensure non-negativity with the use of the max
operator. As with power regression, magnitude regression is
also phase robust since the phase of X(m, k) is omitted.
Squaring (13) leads to
$$|\hat{U}_{\mathrm{mag}}(m,k)|^2 = \sum_{i=0}^{L-1} |F_k(i)|^2\,|X(m-i,k)|^2 + \sum_{i=0}^{L-1}\sum_{\substack{j=0 \\ j\neq i}}^{L-1} |F_k(i)|\,|F_k(j)|\,|X(m-i,k)|\,|X(m-j,k)|, \tag{14}$$
so we see that magnitude regression does allow modeling of
the cross-terms. However, we can immediately observe that
the model in (14) lacks the phase factors appearing in (5),
thereby leading to poor estimation.
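
A two-tap toy case makes the missing phase information concrete. The sketch below uses a hypothetical echo path and two hypothetical input pairs with identical magnitudes but opposite relative phase. Power and magnitude regression see the same inputs in both cases, so no choice of their parameters can separate them, while the cross-term inputs of (8) differ.

```python
def echo_sq(h, x):
    """Squared magnitude of the linear model in Eq. (4)."""
    return abs(sum(hi * xi for hi, xi in zip(h, x))) ** 2


def power_regressor(x):
    """Inputs seen by power regression, Eq. (11)."""
    return [abs(v) ** 2 for v in x]


def magnitude_regressor(x):
    """Inputs seen by magnitude regression, Eq. (13)."""
    return [abs(v) for v in x]


def cross_regressor(x):
    """Additional two-tap inputs used by the proposed model, Eq. (8)."""
    c = x[0] * x[1].conjugate()
    return [c.real, c.imag]


h = [1.0 + 0j, 1.0 + 0j]        # hypothetical two-tap echo path
x_a = [1.0 + 0j, 1.0 + 0j]      # consecutive frames in phase
x_b = [1.0 + 0j, -1.0 + 0j]     # same magnitudes, opposite phase

print("true |U|^2:       ", echo_sq(h, x_a), echo_sq(h, x_b))
print("power inputs:     ", power_regressor(x_a), power_regressor(x_b))
print("magnitude inputs: ", magnitude_regressor(x_a), magnitude_regressor(x_b))
print("cross-term inputs:", cross_regressor(x_a), cross_regressor(x_b))
```

The true squared echo magnitude differs between the two cases because the taps add constructively in one and cancel in the other, which is exactly the information carried by the cross-terms.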
Faller and Chen [9] took a slightly different approach to
AES. Instead of estimating the spectral magnitude of the
echo, they proposed to estimate the spectral envelope, which
they define as a frequency smoothed spectral magnitude or
squared magnitude. Spectral subtraction is then performed
using the spectral envelope of the microphone in addition
to the estimated spectral envelope of the echo. For the same
reasons as with magnitude and power regression, the proposed
spectral envelope estimators in [9] are inaccurate.
III. EXPERIMENTS
We evaluated the performance of the proposed spectral
magnitude estimator through simulations. The sampling frequency was set to 48 kHz and the frame size was set to 10
milliseconds. A room impulse response (RIR) was measured
from the left loudspeaker to the microphone on a Lenovo
ThinkPad T61 in a normal office environment. The measured
RIR was delayed by 20 milliseconds in order to simulate buffer
delays. A microphone signal was generated by adding nearend noise and echo, where the echo signal was computed
by convolving far-end Gaussian white noise of 20 seconds
duration with the delayed RIR. The power of the near-end
Gaussian white noise signal was set such that the echo-to-noise ratio at the microphone was 30 dB. In all simulations an
oversampled filter bank with small overlap between adjacent
subbands was used. The RLS algorithm with forgetting factor
0.998 and a sufficiently small regularization parameter was
used to update the parameters. Initial experiments indicated
that the estimation errors with power and magnitude regression
flatten out beyond L = 3, whereas the errors with the non-robust method and the proposed method continue to decrease
beyond L = 3. We chose L = 5 for the experiments in this
paper, meaning that the non-robust method had 5 complex
parameters, power and magnitude regression had 5 real parameters, and the proposed method had 25 real parameters for
each subband.
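
The construction of the microphone signal can be sketched as follows. The measured RIR is not available here, so a short decaying-noise response stands in for it, and the clip is shortened to 0.1 seconds to keep the pure-Python example fast; both are assumptions of this sketch, as are all variable names. The 48 kHz sampling frequency, the 20 millisecond delay, and the 30 dB echo-to-noise ratio come from the text.

```python
import math
import random


def convolve(x, h):
    """Direct-form convolution, truncated to len(x) output samples."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        acc = 0.0
        for i in range(min(n + 1, len(h))):
            acc += h[i] * x[n - i]
        y[n] = acc
    return y


def power(s):
    """Average power of a real signal."""
    return sum(v * v for v in s) / len(s)


rng = random.Random(3)
fs = 48000                       # sampling frequency in Hz (from the text)
n_samples = 4800                 # 0.1 s instead of 20 s, for speed (assumption)
delay = 20 * fs // 1000          # 20 ms buffer delay = 960 samples
# Hypothetical stand-in for the measured RIR: decaying white noise, 64 taps.
rir = [rng.gauss(0, 1) * math.exp(-n / 16.0) for n in range(64)]

far_end = [rng.gauss(0, 1) for _ in range(n_samples)]
# Delaying the RIR is equivalent to delaying the convolved echo.
echo = [0.0] * delay + convolve(far_end, rir)[: n_samples - delay]

enr_db = 30.0                    # echo-to-noise ratio in dB (from the text)
noise = [rng.gauss(0, 1) for _ in range(n_samples)]
scale = math.sqrt(power(echo) / (power(noise) * 10 ** (enr_db / 10)))
noise = [scale * v for v in noise]
mic = [e + w for e, w in zip(echo, noise)]

measured_enr = 10 * math.log10(power(echo) / power(noise))
print(f"echo-to-noise ratio: {measured_enr:.2f} dB")
```

The noise is scaled against the power of the whole echo signal, including its initial silent 20 milliseconds; computing the ratio over the active part only would be an equally defensible choice.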
Figure 2 shows the average squared estimation error during
steady-state of the proposed spectral magnitude estimator versus other existing approaches. The initial convergence phase
was different for the various methods, with the non-robust
method being the fastest whereas the other methods showed
similar convergence behavior. Therefore, only the last part of
the 20-second clip, where all algorithms had converged, was
used to compute the average. Power regression is not shown in
the plot since its error is similar to, or slightly higher than, the
error with magnitude regression. We observe that the proposed
method performs similarly to the non-robust method and better
than magnitude/power regression for low frequencies. The
good performance compared to magnitude/power regression
can be attributed to the accurate modeling of the cross-terms.
For higher frequencies, where the echo-to-noise ratio is lower,
the cross-terms become more difficult to model, with the result
that the proposed method achieves a performance more similar
to magnitude/power regression.

[Figure 2: plot of average squared error in dB versus frequency in Hertz.]
Fig. 2. Average squared estimation error versus frequency for proposed
method (solid line), non-robust method (dotted line), and magnitude regression
(dashed line). Power regression is not shown since its error is similar to or
slightly higher than the error with magnitude regression.

We then examined the phase robustness of the proposed
estimator. Every second we toggled between the RIR used
above and a 20 sample delayed version of the same RIR. The
results are shown in Fig. 3. The non-robust method is seen to
perform poorly due to its phase sensitivity. The other methods
seem to be fairly robust.

[Figure 3: plot of average squared error in dB versus frequency in Hertz.]
Fig. 3. Average squared estimation error versus frequency for proposed
method (solid line), non-robust method (dotted line), and magnitude regression
(dashed line) during a delay change in the echo path by ± 20 samples every
second. Power regression is not shown since its error is similar to or slightly
higher than the error with magnitude regression.

Figure 4 shows echo return loss enhancement (ERLE),
defined as the ratio between estimated microphone power σ̂y²
and estimated output power σ̂z², versus the delay of the second
RIR during the toggle period. The parameters in the gain
computation in (2) were α = β = 1. The non-robust method is
seen to deteriorate even after a few samples delay change. The
proposed method, power regression, and magnitude regression
deteriorate only slowly as the delay change increases.

[Figure 4: plot of ERLE in dB versus delay change in samples.]
Fig. 4. Echo return loss enhancement (ERLE) versus delay change for
proposed method (solid line), non-robust method (dotted line), magnitude
regression (dashed line), and power regression (dash-dotted line) during delay
change of the echo path every second.

IV. CONCLUSION
We have presented a new phase robust spectral magnitude
estimator for acoustic echo suppression (AES). The estimator
was derived through a redundant re-parametrization of the
squared magnitude of the linear echo model for each frequency subband. Simulations demonstrated that the proposed
magnitude estimator is more accurate than existing robust
alternatives such as power and magnitude regression, even in
the presence of frequent phase variations in the echo path.
REFERENCES
[1] M. Sondhi, “An adaptive echo canceler,” Bell Syst. Tech. J., vol. 46, no. 3,
pp. 497–511, Mar 1967.
[2] G. Zoia, A. Sturzenegger, and O. Hochreutiner, “Audio quality and
acoustic echo issues for voip on portable devices,” in IEEE Int. Conf. on
Portable Information Devices, 2007, pp. 1–5.
[3] C. Avendano, “Acoustic echo suppression in the STFT domain,” in Proc.
IEEE Workshop on Appl. of Sig. Proc. to Audio, New Paltz, New York,
USA, Oct 2001.
[4] A. S. Chhetri, A. C. Surendran, J. W. Stokes, and J. C. Platt, “Regression-based residual acoustic echo suppression,” in IWAENC 2005, Eindhoven,
The Netherlands, Sep 2005.
[5] P. Vary, “Noise suppression by spectral magnitude estimation —
mechanism and theoretical limits—,” Signal Processing, vol. 8, no. 4,
pp. 387–400, July 1985.
[6] A. H. Sayed, Adaptive Filters. NJ: John Wiley & Sons, 2008.
[7] J. Benesty, D. R. Morgan, and J. H. Cho, “A new class of doubletalk
detectors based on cross-correlation,” IEEE Trans. Speech and Audio
Processing, vol. 8, no. 2, pp. 168–172, Mar 2000.
[8] K. Ochiai, T. Araseki, and T. Ogihara, “Echo canceler with two echo path
models,” IEEE Trans. Communications, vol. 25, no. 6, pp. 589–595, Jun
1977.
[9] C. Faller and J. Chen, “Suppressing acoustic echo in a spectral envelope
space,” IEEE Trans. Speech and Audio Processing, vol. 5, no. 13, pp.
1048–1062, Sept 2005.