A Phase Robust Spectral Magnitude Estimator for Acoustic Echo Suppression

Øystein Birkenes
TANDBERG, now a part of Cisco Systems, Inc.
Philip Pedersens Vei 20, 1366 Lysaker, Norway
Email: [email protected]

Abstract—Acoustic echo cancellation (AEC) using linear adaptive filters breaks down during phase variations in the echo path. Such phase variations frequently occur on personal computer platforms. Acoustic echo suppression (AES) has been proposed as a more robust method than AEC. In this method, the spectral magnitude of the echo needs to be estimated. Existing approaches to spectral magnitude estimation are either non-robust or inaccurate. In this paper we present a new spectral magnitude estimator, which is accurate and at the same time robust to phase variations.

Fig. 1. Acoustic echo suppression. (Block diagram: the far-end signal x(n) and the microphone signal y(n), containing echo, near-end speech, and noise, pass through analysis filter banks; for each frame m and subband k, the echo magnitude |Û(m, k)| is estimated from X(m, k) and Y(m, k), a gain Gk(m) is computed from |Y(m, k)| and |Û(m, k)|, and Z(m, k) = Gk(m)Y(m, k) is fed to a synthesis filter bank producing the output z(n).)

I. INTRODUCTION

Removal of unwanted echo caused by acoustic coupling between a loudspeaker and a microphone is essential towards the goal of achieving natural communication in teleconferencing systems. In traditional acoustic echo cancellation (AEC) [1], the echo path is modeled using a linear adaptive filter. Echo is removed by subtracting the output of the adaptive filter from the microphone signal.

The above approach breaks down in the presence of phase variations in the echo path, which typically occur during teleconferencing with personal computers [2]. As a solution, Avendano [3] proposed an acoustic echo suppression (AES) method that removes echo by spectral subtraction. The idea is to achieve robustness through estimation of the spectral magnitude of the echo signal while ignoring the phase. Unfortunately, even though the spectral magnitude estimator in [3] is accurate during ideal operating conditions, it is not robust against phase variations since it relies on complex linear adaptive subband filters.

In [4], the authors investigated spectral magnitude estimators in which the echo magnitude in each frequency subband is estimated using a real adaptive filter operating on either the magnitude or the squared magnitude of the input samples. These estimators ignore the phase of the loudspeaker signal and are therefore robust. They are also computationally efficient, since only a few real parameters need to be estimated for each subband. However, the robustness and reduced computational complexity are achieved at the cost of poor estimates. Good echo suppression may still be achieved by making the spectral subtraction method more aggressive, but this may lead to artifacts and near-end speech distortion, in particular during doubletalk situations.

In this paper, we introduce a robust re-parametrization of the spectral magnitude estimator in [3]. In this way, accurate echo magnitude estimates are obtained even in the presence of phase variations in the echo path. The result is that echo may be sufficiently attenuated while artifacts and near-end speech distortion are kept low. We demonstrate the accuracy and robustness of the proposed method through simulations.

The paper is organized as follows. In the next section we briefly review the AES approach in [3] before we introduce our robust spectral magnitude estimator and make comparisons with other approaches. In Sec. III we present results from simulation experiments, and in Sec. IV we state the conclusions.
II. ACOUSTIC ECHO SUPPRESSION AND SPECTRAL MAGNITUDE ESTIMATION

A. Acoustic Echo Suppression

An illustration of the class of acoustic echo suppression (AES) methods that we consider in this paper is shown in Fig. 1. The microphone signal y(n), comprising echo, near-end speech, and noise, is divided into a set of frequency subbands with the use of an analysis filter bank. We write the microphone signal in frame m and subband k as

  Y(m,k) = U(m,k) + V(m,k) + W(m,k),   (1)

where U(m, k), V(m, k), and W(m, k) are the echo, near-end speech, and noise in frame m and subband k, respectively. The signal x(n) from the far-end is likewise divided into a set of frequency subbands, where X(m, k) denotes the signal in frame m and subband k.

For each frame m and subband k, an estimate |Û(m, k)| of the echo magnitude is computed based on the signals X(m, k) and Y(m, k), followed by a gain computation according to [5]

  G_k(m) = \left( \frac{\max\left( |Y(m,k)|^\alpha - \beta\, |\hat{U}(m,k)|^\alpha ,\; 0 \right)}{|Y(m,k)|^\alpha} \right)^{1/\alpha},   (2)

where the parameters α > 0 and β > 0 are used to control the amount of echo reduction versus speech distortion. The output in frame m and subband k is

  Z(m,k) = G_k(m)\, Y(m,k).   (3)

Finally, the full-band output z(n) sent to the far-end is formed by combining all subband signals Z(m, k) using a synthesis filter bank.
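To make the gain computation in (2) and (3) concrete, the following sketch evaluates the suppression gain and the subband output for one frame, assuming the echo magnitude estimate |Û(m, k)| is already available. The function names, the NumPy implementation, and the small constant guarding against division by zero are our own illustrative choices, not part of the AES formulation above.

```python
import numpy as np

def aes_gain(Y, U_hat_mag, alpha=1.0, beta=1.0, eps=1e-12):
    """Suppression gain G_k(m) of Eq. (2) for all subbands of one frame.

    Y          : complex subband microphone spectrum Y(m, k), shape (K,)
    U_hat_mag  : estimated echo magnitudes |U_hat(m, k)|, shape (K,)
    alpha, beta: trade echo reduction against near-end speech distortion
    eps        : small constant (assumption) to avoid division by zero
    """
    num = np.maximum(np.abs(Y) ** alpha - beta * U_hat_mag ** alpha, 0.0)
    den = np.abs(Y) ** alpha + eps
    return (num / den) ** (1.0 / alpha)

def aes_output(Y, U_hat_mag, alpha=1.0, beta=1.0):
    """Suppressed subband signal Z(m, k) = G_k(m) Y(m, k) of Eq. (3)."""
    return aes_gain(Y, U_hat_mag, alpha, beta) * Y
```

Larger β (or smaller α) makes the suppression more aggressive, which reduces residual echo at the cost of the artifacts and speech distortion discussed below; the experiments in Sec. III use α = β = 1.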
If |Û(m, k)| is a good estimate of |U(m, k)|, AES works well by attenuating subbands with echo and passing through subbands with near-end speech. However, if |Û(m, k)| is a poor estimate of |U(m, k)|, which may happen if the magnitude estimator is inaccurate and/or non-robust, AES may suffer from residual echo, artifacts, and near-end speech distortion. Higher echo reduction can be achieved by making the gains more aggressive through selection of appropriate values for α and β, but then there will be more audible artifacts and near-end speech distortion, especially during doubletalk. This trade-off between residual echo, on the one hand, and artifacts and speech distortion, on the other, motivates the search for accurate and robust spectral magnitude estimators. In the following we derive one such estimator.

B. A Phase Robust Spectral Magnitude Estimator

As long as the overlap between adjacent frequency subbands is small, a good model (although non-robust, as we shall see in Sec. II-C) for the complex echo signal in frame m and subband k is [3]

  \hat{U}(m,k) = \sum_{i=0}^{L-1} H_k(i)\, X(m-i,k),   (4)

where H_k(0), ..., H_k(L-1) are L complex parameters. In the following we derive an expression for the squared magnitude of (4) and split each term into two factors, where the first factor is a function only of the parameters, and the second factor is a function only of the input samples X(m, k), X(m-1, k), ..., X(m-L+1, k). This will lead to an accurate and robust estimator of the magnitude of each subband echo signal.

The squared magnitude of (4) can be written

  |\hat{U}(m,k)|^2 = \sum_{i=0}^{L-1} |H_k(i)|^2 |X(m-i,k)|^2 + \sum_{i=0}^{L-1} \sum_{\substack{j=0 \\ j \neq i}}^{L-1} H_k(i) H_k^*(j)\, X(m-i,k)\, X^*(m-j,k).   (5)

Recognizing that the last sum consists of pairs of complex conjugate terms, we can write

  |\hat{U}(m,k)|^2 = \sum_{i=0}^{L-1} |H_k(i)|^2 |X(m-i,k)|^2 + 2 \sum_{i=0}^{L-1} \sum_{j=i+1}^{L-1} \mathrm{Re}\{ H_k(i) H_k^*(j)\, X(m-i,k)\, X^*(m-j,k) \}.   (6)

Next, we use the following easily verifiable identity: for two complex numbers z_1 and z_2, we can write Re{z_1 z_2} = Re{z_1}Re{z_2} − Im{z_1}Im{z_2}. Thus

  |\hat{U}(m,k)|^2 = \sum_{i=0}^{L-1} |H_k(i)|^2 |X(m-i,k)|^2 + 2 \sum_{i=0}^{L-1} \sum_{j=i+1}^{L-1} \big[ \mathrm{Re}\{H_k(i)H_k^*(j)\}\, \mathrm{Re}\{X(m-i,k)X^*(m-j,k)\} - \mathrm{Im}\{H_k(i)H_k^*(j)\}\, \mathrm{Im}\{X(m-i,k)X^*(m-j,k)\} \big].   (7)

We arrive at a new model for the squared magnitude of the echo in frame m and subband k by introducing a set of L² real parameters F_k(i, j), 0 ≤ i, j ≤ L−1, as follows:

  \hat{U}^2_{\mathrm{pro}}(m,k) = \sum_{i=0}^{L-1} F_k(i,i)\, |X(m-i,k)|^2 + \sum_{i=0}^{L-1} \sum_{j=i+1}^{L-1} \big[ F_k(i,j)\, \mathrm{Re}\{X(m-i,k)X^*(m-j,k)\} + F_k(j,i)\, \mathrm{Im}\{X(m-i,k)X^*(m-j,k)\} \big].   (8)

Finally, our proposed estimator for the echo magnitude is

  \hat{U}_{\mathrm{pro}}(m,k) = \sqrt{ \max\left( \hat{U}^2_{\mathrm{pro}}(m,k),\, 0 \right) },   (9)

where the max operator is needed since \hat{U}^2_{\mathrm{pro}}(m,k) may be negative.

The model in (8) is non-linear in the input samples, but linear in the real parameters. The latter implies that it is straightforward to find an online estimation algorithm for the parameters. In particular, common adaptive algorithms [6] such as normalized least mean squares (NLMS), the affine projection algorithm (APA), or recursive least squares (RLS) can be used. Furthermore, it should be noted that (8) is a more general model than (7) since the parameters of the former are not restricted to lie on the subset implied by the latter. As an example, F_k(i, i) ∈ R may be negative, whereas |H_k(i)|² ≥ 0, the corresponding factor in (7), is inevitably non-negative. Nevertheless, since (4) has been shown [3] to be a good model of the subband echo signal, we expect each parameter F_k(i, j) to converge to its counterpart in (7). Therefore, with the above assumption, F_k(i, j) for j > i attempts to estimate

  2\, \mathrm{Re}\{H_k(i)H_k^*(j)\} = 2\, |H_k(i)|\, |H_k(j)| \cos\big( \psi_{H_k(i)} - \psi_{H_k(j)} \big),

where, for a complex number z, ψ_z denotes its phase. This means that instead of estimating the phases of the complex parameters H_k(0), ..., H_k(L−1), the model in (8) estimates the phase differences between them. These phase differences are approximately unaltered after phase variations in the echo path. The same applies to F_k(i, j) for j < i. This phase robustness property of the proposed estimator is demonstrated through the simulation results in Sec. III.

As a final note, keep in mind that no special measures have been taken in order to avoid parameter divergence during doubletalk. A common solution to the doubletalk problem is to freeze the parameter updates during doubletalk with the help of a doubletalk detector [7]. Another solution that has been shown to be effective is the two-path model introduced in [8].
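To illustrate how (8) and (9) can be realized with a standard adaptive algorithm, the sketch below collects, for one subband, the L squared magnitudes and the real and imaginary parts of the cross products into a real feature vector, evaluates the magnitude estimate, and performs one NLMS-style update of the L² real parameters against the observed squared microphone magnitude |Y(m, k)|². The function names, the choice of NLMS, the step size, and the use of |Y(m, k)|² as adaptation target (reasonable when near-end activity is weak, cf. the doubletalk remark above) are our own assumptions; the derivation only requires that the model is linear in the parameters.

```python
import numpy as np

def features(x_hist):
    """Real feature vector of the model in Eq. (8) for one subband.

    x_hist : complex array [X(m, k), X(m-1, k), ..., X(m-L+1, k)]
    Returns the L squared magnitudes followed by the real and imaginary
    parts of X(m-i, k) X*(m-j, k) for j > i, i.e. L*L real entries in total.
    """
    L = len(x_hist)
    phi = list(np.abs(x_hist) ** 2)              # diagonal terms, weights F_k(i, i)
    for i in range(L):
        for j in range(i + 1, L):
            c = x_hist[i] * np.conj(x_hist[j])
            phi += [c.real, c.imag]              # weights F_k(i, j) and F_k(j, i)
    return np.asarray(phi)

def estimate_and_update(F, x_hist, y_abs2, mu=0.5, eps=1e-8):
    """One frame of the proposed estimator with an NLMS-style parameter update.

    F      : real parameter vector of length L*L for this subband
    x_hist : complex far-end history [X(m, k), ..., X(m-L+1, k)]
    y_abs2 : observed |Y(m, k)|**2 used as the adaptation target (assumption)
    Returns the magnitude estimate of Eq. (9) and the updated parameters.
    """
    phi = features(x_hist)
    u2 = float(F @ phi)                          # Eq. (8)
    u_mag = np.sqrt(max(u2, 0.0))                # Eq. (9)
    err = y_abs2 - u2
    F = F + mu * err * phi / (phi @ phi + eps)   # normalized gradient step
    return u_mag, F
```

With L = 5 this amounts to 25 real parameters per subband, matching the configuration used in the experiments of Sec. III.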
C. Comparison with Other Spectral Magnitude Estimators

Avendano [3] proposed to estimate the echo magnitude in each frame m and subband k by taking the magnitude of the output of the model in (4), that is,

  \hat{U}_{\mathrm{nr}}(m,k) = |\hat{U}(m,k)|.   (10)

This estimator, which we will refer to as the non-robust magnitude estimator, is not phase robust since a change in the phase requires the complex parameters H_k(0), ..., H_k(L−1) to re-adapt. Our proposed estimator in (9) is a phase robust re-parametrization of (10).

We now make comparisons with two robust magnitude estimators that have appeared in the literature [4] under the names of power regression and magnitude regression. In power regression, the squared echo magnitude is modeled as

  \hat{U}^2_{\mathrm{pow}}(m,k) = \sum_{i=0}^{L-1} F_k(i)\, |X(m-i,k)|^2,   (11)

where F_k(0), ..., F_k(L−1) are real parameters. The estimated echo magnitude is

  \hat{U}_{\mathrm{pow}}(m,k) = \sqrt{ \max\left( \hat{U}^2_{\mathrm{pow}}(m,k),\, 0 \right) }.   (12)

Power regression is phase robust since the phase of the signal X(m, k) is simply omitted. However, power regression is poor at estimating the echo magnitude since, as we immediately observe by comparing (11) with (5), all cross-terms are missing.

Magnitude regression can be written

  \hat{U}_{\mathrm{mag}}(m,k) = \sum_{i=0}^{L-1} F_k(i)\, |X(m-i,k)|,   (13)

where F_k(0), ..., F_k(L−1) are real parameters. The magnitude estimate \hat{U}_{\mathrm{mag}}(m,k) may be negative, so it is again necessary to ensure non-negativity with the use of the max operator. As with power regression, magnitude regression is also phase robust since the phase of X(m, k) is omitted. Squaring (13) leads to

  |\hat{U}_{\mathrm{mag}}(m,k)|^2 = \sum_{i=0}^{L-1} |F_k(i)|^2 |X(m-i,k)|^2 + \sum_{i=0}^{L-1} \sum_{\substack{j=0 \\ j \neq i}}^{L-1} |F_k(i)|\, |F_k(j)|\, |X(m-i,k)|\, |X(m-j,k)|,   (14)

so we see that magnitude regression does allow modeling of the cross-terms. However, we can immediately observe that the model in (14) lacks the phase factors appearing in (5), thereby leading to poor estimation.

Faller and Chen [9] took a slightly different approach to AES. Instead of estimating the spectral magnitude of the echo, they proposed to estimate the spectral envelope, which they define as a frequency-smoothed spectral magnitude or squared magnitude. Spectral subtraction is then performed using the spectral envelope of the microphone signal in addition to the estimated spectral envelope of the echo. For the same reasons as with magnitude and power regression, the spectral envelope estimators proposed in [9] are inaccurate.

III. EXPERIMENTS

We evaluated the performance of the proposed spectral magnitude estimator through simulations. The sampling frequency was set to 48 kHz and the frame size was set to 10 milliseconds. A room impulse response (RIR) was measured from the left loudspeaker to the microphone on a Lenovo ThinkPad T61 in a normal office environment. The measured RIR was delayed by 20 milliseconds in order to simulate buffer delays. A microphone signal was generated by adding near-end noise and echo, where the echo signal was computed by convolving far-end Gaussian white noise of 20 seconds duration with the delayed RIR. The power of the near-end Gaussian white noise signal was set such that the echo-to-noise ratio at the microphone was 30 dB. In all simulations an oversampled filter bank with small overlap between adjacent subbands was used. The RLS algorithm with forgetting factor 0.998 and a sufficiently small regularization parameter was used to update the parameters.

Initial experiments indicated that the estimation errors with power and magnitude regression flatten out beyond L = 3, whereas the errors with the non-robust method and the proposed method continue to decrease beyond L = 3. We chose L = 5 for the experiments in this paper, meaning that the non-robust method had 5 complex parameters, power and magnitude regression had 5 real parameters, and the proposed method had 25 real parameters for each subband.
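Since (8), (11), and (13) are all linear in their real parameters, the same exponentially weighted RLS recursion can drive any of the estimators compared here. The sketch below is a minimal per-subband version with the forgetting factor 0.998 used in our simulations; the class name, the initialization of the inverse correlation matrix, and the default regularization value are illustrative assumptions.

```python
import numpy as np

class SubbandRLS:
    """Exponentially weighted RLS for a model that is linear in real parameters."""

    def __init__(self, n_params, forget=0.998, delta=1e-2):
        self.F = np.zeros(n_params)          # real parameter vector, e.g. F_k(i, j)
        self.P = np.eye(n_params) / delta    # inverse correlation matrix (delta: regularization)
        self.lam = forget

    def update(self, phi, target):
        """phi: real feature vector of the chosen model; target: e.g. |Y(m, k)|**2."""
        Pphi = self.P @ phi
        k = Pphi / (self.lam + phi @ Pphi)   # gain vector
        err = target - self.F @ phi          # a priori estimation error
        self.F = self.F + k * err
        self.P = (self.P - np.outer(k, Pphi)) / self.lam
        return self.F @ phi                  # model output with updated parameters
```

For the proposed method, phi is the L²-dimensional feature vector sketched in Sec. II-B, while for power and magnitude regression it reduces to the L squared magnitudes or magnitudes, respectively.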
Figure 2 shows the average squared estimation error during steady state of the proposed spectral magnitude estimator versus other existing approaches. The initial convergence phase was different for the various methods, with the non-robust method being the fastest, whereas the other methods showed similar convergence behavior. Therefore, only the last part of the 20 seconds clip, where all algorithms had converged, was used to compute the average. Power regression is not shown in the plot since its error is similar to, or slightly higher than, the error with magnitude regression.

We observe that the proposed method performs similarly to the non-robust method and better than magnitude/power regression for low frequencies. The good performance compared to magnitude/power regression can be attributed to the accurate modeling of the cross-terms. For higher frequencies, where the echo-to-noise ratio is lower, the cross-terms become more difficult to model, with the result that the proposed method achieves a performance more similar to magnitude/power regression.

Fig. 2. Average squared estimation error versus frequency for the proposed method (solid line), the non-robust method (dotted line), and magnitude regression (dashed line). Power regression is not shown since its error is similar to or slightly higher than the error with magnitude regression.

We then examined the phase robustness of the proposed estimator. Every second we toggled between the RIR used above and a 20 sample delayed version of the same RIR. The results are shown in Fig. 3. The non-robust method is seen to perform poorly due to its phase sensitivity. The other methods seem to be fairly robust.

Fig. 3. Average squared estimation error versus frequency for the proposed method (solid line), the non-robust method (dotted line), and magnitude regression (dashed line) during a delay change in the echo path by ±20 samples every second. Power regression is not shown since its error is similar to or slightly higher than the error with magnitude regression.

Figure 4 shows echo return loss enhancement (ERLE), defined as the ratio between the estimated microphone power σ̂_y² and the estimated output power σ̂_z², versus the delay of the second RIR during the toggle period. The parameters in the gain computation in (2) were α = β = 1. The non-robust method is seen to deteriorate even after a delay change of only a few samples. The proposed method, power regression, and magnitude regression deteriorate only slowly as the delay change increases.

Fig. 4. Echo return loss enhancement (ERLE) versus delay change for the proposed method (solid line), the non-robust method (dotted line), magnitude regression (dashed line), and power regression (dash-dotted line) during a delay change of the echo path every second.

IV. CONCLUSION

We have presented a new phase robust spectral magnitude estimator for acoustic echo suppression (AES). The estimator was derived through a redundant re-parametrization of the squared magnitude of the linear echo model for each frequency subband. Simulations demonstrated that the proposed magnitude estimator is more accurate than existing robust alternatives such as power and magnitude regression, even in the presence of frequent phase variations in the echo path.
REFERENCES

[1] M. Sondhi, "An adaptive echo canceler," Bell Syst. Tech. J., vol. 46, no. 3, pp. 497–511, Mar. 1967.
[2] G. Zoia, A. Sturzenegger, and O. Hochreutiner, "Audio quality and acoustic echo issues for VoIP on portable devices," in IEEE Int. Conf. on Portable Information Devices, 2007, pp. 1–5.
[3] C. Avendano, "Acoustic echo suppression in the STFT domain," in Proc. IEEE Workshop on Appl. of Sig. Proc. to Audio, New Paltz, New York, USA, Oct. 2001.
[4] A. S. Chhetri, A. C. Surendran, J. W. Stokes, and J. C. Platt, "Regression-based residual acoustic echo suppression," in IWAENC 2005, Eindhoven, The Netherlands, Sep. 2005.
[5] P. Vary, "Noise suppression by spectral magnitude estimation — mechanism and theoretical limits," Signal Processing, vol. 8, no. 4, pp. 387–400, July 1985.
[6] A. H. Sayed, Adaptive Filters. NJ: John Wiley & Sons, 2008.
[7] J. Benesty, D. R. Morgan, and J. H. Cho, "A new class of doubletalk detectors based on cross-correlation," IEEE Trans. Speech and Audio Processing, vol. 8, no. 2, pp. 168–172, Mar. 2000.
[8] K. Ochiai, T. Araseki, and T. Ogihara, "Echo canceler with two echo path models," IEEE Trans. Communications, vol. 25, no. 6, pp. 589–595, Jun. 1977.
[9] C. Faller and J. Chen, "Suppressing acoustic echo in a spectral envelope space," IEEE Trans. Speech and Audio Processing, vol. 13, no. 5, pp. 1048–1062, Sept. 2005.