Two-Stage Speech Enhancement with Manipulation of the Cepstral Excitation
Samy Elshamy1, Nilesh Madhu2, Wouter Tirry2, Tim Fingscheidt1
1Technische Universität Braunschweig | Institute for Communications Technology | Braunschweig, Germany
{s.elshamy, t.fingscheidt}@tu-bs.de | phone +49 (0) 531 391-2450
2NXP Software | Leuven, Belgium
{nilesh.madhu,wouter.tirry}@nxp.com
Motivation
■ In our everyday life we often rely on telecommunication systems where high speech quality with little noise is important for a pleasant experience
■ Common noise reduction schemes benefit from precise a priori SNR estimates
■ Problem: Common noise reduction schemes sometimes suffer from musical tones, and increasing noise attenuation typically also increases speech distortion
■ Solution: Manipulate the excitation signal to directly model the spectral fine structure (harmonics), apply the envelope to the manipulated residual, and use the result for instantaneous a priori SNR estimation
New Two-Stage Speech Enhancement
Cepstral excitation manipulation (CEM) [block diagram and equations not recoverable]
■ Pitch estimation (50 Hz – 500 Hz): Search for the maximum in the cepstral LPC residual
■ Excitation template: Trained from speech data for each [symbol missing]
■ Manipulation step 1: Let the energy coefficient match the energy of the input’s excitation signal
■ Manipulation step 2: Replace the pitch bin amplitude and overestimate it
■ Manipulation step 3: Apply some start and end decay to the spectrum
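The pitch search and the first two manipulation steps can be illustrated with a minimal sketch. The poster's equations did not survive, so everything below is an assumption rather than the authors' implementation: the LPC analysis is skipped and a residual frame is taken as given, the "energy coefficient" is assumed to be cepstral bin 0, the search range is clipped to the unique half of the cepstrum, and `alpha` is a hypothetical overestimation factor. Step 3 (the start and end decay) is omitted because its shape is not stated.

```python
import numpy as np

def real_cepstrum(frame, eps=1e-12):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum."""
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + eps)
    return np.fft.irfft(log_mag, n=len(frame))

def find_pitch_bin(cepstrum, fs=8000, f_min=50.0, f_max=500.0):
    """Quefrency bin of the largest cepstral value within the pitch range."""
    q_lo = int(fs / f_max)  # 500 Hz -> bin 16 at 8 kHz
    # 50 Hz -> bin 160, clipped to the unique half of the cepstrum (assumption)
    q_hi = min(int(fs / f_min), len(cepstrum) // 2)
    return q_lo + int(np.argmax(cepstrum[q_lo:q_hi + 1]))

def manipulate_template(template, residual_cepstrum, pitch_bin, alpha=1.5):
    """Apply steps 1 and 2 to a copy of a (hypothetical) excitation template."""
    out = np.array(template, dtype=float)
    # Step 1: energy coefficient (assumed to be bin 0) matches the input's.
    out[0] = residual_cepstrum[0]
    # Step 2: replace the pitch bin amplitude and overestimate it.
    out[pitch_bin] = alpha * residual_cepstrum[pitch_bin]
    # Mirror the bin so the cepstrum stays symmetric (real log spectrum).
    out[-pitch_bin] = out[pitch_bin]
    return out

# Toy residual frame: a pulse plus one weaker repetition 40 samples later,
# which produces a dominant cepstral peak at quefrency bin 40 (200 Hz).
frame = np.zeros(256)
frame[0], frame[40] = 1.0, 0.5
cepstrum = real_cepstrum(frame)
pitch_bin = find_pitch_bin(cepstrum)
enhanced = manipulate_template(np.zeros_like(cepstrum), cepstrum, pitch_bin)
```

In the poster's scheme the template would come from training on clean speech; the all-zero template here only keeps the example self-contained.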
Legend
frame index
noise power estimate
frequency bin index
spectral weights
quefrency bin index
a priori SNR estimate
clean speech signal
intermediate clean speech estimate
clean speech estimate
microphone signal
preliminary denoised microphone signal
preliminary denoised signal’s envelope
preliminary denoised signal’s excitation
enhanced excitation signal
Experimental Setup
■ 8 kHz sample rate, frame size 256 samples, frame shift 50%, periodic square root Hann window
■ NTT super wideband database at 8 kHz, American and British English speakers (14 in total, 100 utterances per speaker)
■ Training of speaker-independent excitation templates on clean speech
  ■ Leave-one-out fashion (test on 1 speaker with templates obtained from the 13 other speakers)
  ■ 80 utterances per speaker, total of 1040 utterances for each training
■ Testing the proposed two-stage noise reduction
  ■ ETSI background noise database (road, car, office and pub noise), 6 SNRs: -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, 20 dB
  ■ 20 utterances per speaker, total of 6720 utterances
■ Baseline approaches:
  ■ Two common schemes with minimum statistics noise power estimation, decision-directed a priori SNR estimation, and MMSE-LSA [1], SG-jMAP [2] weighting rules
  ■ HRNR two-step approach [3]
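The framing parameters can be checked with a short sketch. It assumes the same periodic square root Hann window is used for both analysis and synthesis (the poster only states the window type), in which case the product of the two is a periodic Hann window whose half-overlapping copies sum to one. The last line reproduces the test-set size from the counts listed above.

```python
import numpy as np

FRAME_LEN = 256                # samples per frame (32 ms at 8 kHz)
FRAME_SHIFT = FRAME_LEN // 2   # 50% frame shift

# Periodic Hann window and its square root.
n = np.arange(FRAME_LEN)
hann_periodic = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / FRAME_LEN)
sqrt_hann = np.sqrt(hann_periodic)

# Analysis window times synthesis window (assumed identical) gives Hann;
# two copies shifted by half a frame add up to a constant.
overlap_sum = sqrt_hann[:FRAME_SHIFT] ** 2 + sqrt_hann[FRAME_SHIFT:] ** 2

# Test-set size: 14 speakers x 20 utterances x 4 noise types x 6 SNRs.
num_test_files = 14 * 20 * 4 * 6
```

Because `overlap_sum` is constant, overlap-add with unity spectral weights returns the input frame region unchanged, so any change in the output comes from the weighting rule alone.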
[1] Y. Ephraim and D. Malah, “Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 2, pp. 443–445, Apr. 1985.
[2] T. Lotter and P. Vary, “Speech Enhancement by MAP Spectral Amplitude Estimation Using a Super-Gaussian Speech Model,” EURASIP Journal on Applied Signal Processing, vol. 2005, no. 7, pp. 1110–1126, 2005.
[3] C. Plapous, C. Marro, L. Mauuary, and P. Scalart, “A Two-Step Noise Reduction Technique,” in Proc. of ICASSP, Montreal, Quebec, Canada, May 2004, pp. 289–292.
Experimental Evaluation
■ Segmental speech-to-speech-distortion ratio
■ PESQ MOS-LQO of the filtered clean speech component
■ Segmental noise attenuation
■ Delta SNR
■ A figure of merit (FoM) to facilitate the overall evaluation
[Figure: results at SNR = -5 dB and SNR = 20 dB, with panels measuring speech distortion and noise attenuation; annotations mark improvements of >1 dB.]
Conclusion
■ Novel two-stage speech enhancement (a priori SNR estimation) approach utilizing the cepstral domain to model the fine structure of the residual signal directly
■ The approach consistently outperforms common noise reduction schemes by obtaining higher noise attenuation without increasing the level of speech distortion
Some Details...
Evaluation Methodology and Metrics
■ We utilize the white-box approach [1] to obtain the separately processed components: the filtered clean speech component and the filtered noise component
■ Two different measures for each of the two categories, speech quality and noise suppression
■ In line with ITU-T P.1100 [2, Sect. 8] we measure the distortion of the filtered clean speech component and not the enhanced signal
White-Box Approach
■ The linearity assumption allows each component to be processed separately prior to superimposition while still obtaining the same enhanced signal
■ Allows for a detailed evaluation of the individual microphone signal components and the influence of the spectral weighting rule in a lab environment
[Figure: block diagram of the white-box approach around the NR under test.]
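The white-box idea can be sketched for a single frame as follows. The signals are random stand-ins and the spectral weights are a hypothetical Wiener-like gain; in the actual evaluation the weights would come from the noise reduction under test, computed on the microphone signal. All variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
frame_len = 256
speech = rng.standard_normal(frame_len)        # stand-in clean speech component
noise = 0.5 * rng.standard_normal(frame_len)   # stand-in noise component
mic = speech + noise                           # microphone signal

S = np.fft.rfft(speech)
N = np.fft.rfft(noise)
Y = np.fft.rfft(mic)

# Hypothetical spectral weights in [0, 1]; any real-valued gain works here.
gain = np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2)

# The same weights are applied to each component and to the mixture.
speech_filt = np.fft.irfft(gain * S, n=frame_len)  # filtered clean speech component
noise_filt = np.fft.irfft(gain * N, n=frame_len)   # filtered noise component
enhanced = np.fft.irfft(gain * Y, n=frame_len)     # enhanced signal

# Linearity: the filtered components superimpose to the enhanced signal.
max_error = np.max(np.abs(enhanced - (speech_filt + noise_filt)))
```

This works only because the clean speech and noise components are available separately, which is why the approach is confined to a lab environment.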
Segmental Speech-To-Speech-Distortion Ratio [5]
■ Compares reference and processed speech component sample by sample, providing a measure for speech distortion
■ Target a high value
■ Evaluated over the set of speech-active frames, determined by a VAD operating on the clean speech component
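A minimal sketch of a segmental speech-to-speech-distortion ratio follows; the exact formula from [5] is not reproduced on the poster, so this is one plausible reading. It assumes a per-frame ratio of clean speech energy to distortion energy, averaged in dB over the speech-active frames. The VAD itself is not implemented: `active` is a hypothetical boolean mask supplied by the caller, and `delay` stands in for the sample delay compensation.

```python
import numpy as np

def segmental_ssdr(s, s_filt, frame_len=256, delay=0, active=None, eps=1e-12):
    """Mean per-frame ratio (in dB) of speech energy to distortion energy."""
    s = np.asarray(s, dtype=float)
    s_filt = np.asarray(s_filt, dtype=float)[delay:]  # compensate processing delay
    num_frames = min(len(s), len(s_filt)) // frame_len
    usable = num_frames * frame_len
    ref = s[:usable].reshape(num_frames, frame_len)
    proc = s_filt[:usable].reshape(num_frames, frame_len)

    speech_energy = np.sum(ref ** 2, axis=1)
    distortion_energy = np.sum((proc - ref) ** 2, axis=1)
    ssdr_db = 10.0 * np.log10((speech_energy + eps) / (distortion_energy + eps))

    if active is None:  # no VAD mask given: use all frames
        active = np.ones(num_frames, dtype=bool)
    return float(np.mean(ssdr_db[np.asarray(active, dtype=bool)]))

# Example: a filtered component that is the reference scaled by 0.9 has a
# distortion of 0.1 times the reference, i.e. an energy ratio of 100.
rng = np.random.default_rng(1)
s = rng.standard_normal(1024)
ssdr_example = segmental_ssdr(s, 0.9 * s)
```

Any limiting of the per-frame values, as is common for segmental measures, is left out here because the poster does not state limits for this metric.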
Segmental Noise Attenuation [5]
■ Depicts a local frame-wise ratio of the noise component and the corresponding filtered noise component
■ Target a high value
■ Values are limited to a lower and an upper bound [bounds not recoverable]
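The segmental noise attenuation can be sketched in the same way. The assumptions are stated plainly: the per-frame ratio of noise energy to filtered noise energy is averaged over all frames before taking the logarithm, and the limits `na_min` and `na_max` are hypothetical optional parameters, since the poster's actual bounds and exact averaging are not available.

```python
import numpy as np

def segmental_na(noise, noise_filt, frame_len=256, delay=0,
                 na_min=None, na_max=None, eps=1e-12):
    """Frame-averaged ratio of noise energy to filtered noise energy, in dB."""
    noise = np.asarray(noise, dtype=float)
    noise_filt = np.asarray(noise_filt, dtype=float)[delay:]  # delay compensation
    num_frames = min(len(noise), len(noise_filt)) // frame_len
    usable = num_frames * frame_len
    ref = noise[:usable].reshape(num_frames, frame_len)
    proc = noise_filt[:usable].reshape(num_frames, frame_len)

    ratio = (np.sum(ref ** 2, axis=1) + eps) / (np.sum(proc ** 2, axis=1) + eps)
    if na_min is not None or na_max is not None:
        ratio = np.clip(ratio, na_min, na_max)  # hypothetical linear-scale limits
    return float(10.0 * np.log10(np.mean(ratio)))

# Example: attenuating the noise amplitude by a factor of 10 reduces its
# energy by a factor of 100 in every frame.
rng = np.random.default_rng(2)
noise = rng.standard_normal(1024)
na_example = segmental_na(noise, 0.1 * noise)
```

Limiting the per-frame ratio keeps a few frames with near-silent filtered noise from dominating the average.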
PESQ MOS-LQO P.862 [3]
■ Psychoacoustic measure
■ PESQ is not validated for artifacts stemming from noise suppression algorithms; to be more compliant with P.862 [3] we measure PESQ on the filtered clean speech component and not on the enhanced signal, providing a measure for speech distortion
■ Target a high value
Delta SNR
■ Global measure indicating achieved noise suppression
■ Levels measured according to P.56 [4]
■ Target a high value
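For the Delta SNR measure, a minimal sketch under a simplifying assumption: the levels are plain mean powers over the whole signal, used as a stand-in for the P.56 active speech level, which is not implemented here. Delta SNR is then taken as the output SNR of the filtered components minus the input SNR of the unprocessed components. Function and variable names are hypothetical.

```python
import numpy as np

def level_db(x, eps=1e-12):
    """Mean power in dB (a simple stand-in for a P.56 level measurement)."""
    x = np.asarray(x, dtype=float)
    return 10.0 * np.log10(np.mean(x ** 2) + eps)

def delta_snr(speech, noise, speech_filt, noise_filt):
    """SNR improvement in dB between the input and the output components."""
    snr_in = level_db(speech) - level_db(noise)
    snr_out = level_db(speech_filt) - level_db(noise_filt)
    return snr_out - snr_in

# Example: speech amplitude scaled by 0.9 and noise amplitude by 0.1 gives
# an improvement of 20*log10(0.9 / 0.1) dB.
rng = np.random.default_rng(3)
speech = rng.standard_normal(2048)
noise = 0.5 * rng.standard_normal(2048)
delta_example = delta_snr(speech, noise, 0.9 * speech, 0.1 * noise)
```

With an active speech level meter in place of `level_db` for the speech terms, pauses in the utterance would no longer lower the measured speech level.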
Legend
clean speech component
noise component
microphone signal
enhanced signal
filtered clean speech component
filtered noise component
frame index
frequency bin index
set of all frames
length of frame
sample delay compensation
[1] S. Gustafsson, R. Martin, and P. Vary, “On the Optimization of Speech Enhancement Systems Using Instrumental Measures,” in Proc. of Workshop on Quality Assessment in Speech, Audio, and Image Communication, Darmstadt, Germany, Mar. 1996, pp. 36–40.
[2] ITU, Rec. P.1100: Narrow-Band Hands-Free Communication in Motor Vehicles, International Telecommunication Union, Telecommunication Standardization Sector (ITU-T), Jan. 2015.
[3] ITU, Rec. P.862: Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-To-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs, International Telecommunication Union, Telecommunication Standardization Sector (ITU-T), Feb. 2001.
[4] ITU, Rec. P.56: Objective Measurement of Active Speech Level, International Telecommunication Union, Telecommunication Standardization Sector (ITU-T), Dec. 2011.
[5] T. Fingscheidt, S. Suhadi, and S. Stan, “Environment-Optimized Speech Enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 4, pp. 825–834, May 2008.