Presentation - ISCA Speech

IMPROVEMENT IN HOARSE VOICE DENOISING FOR REAL-TIME DSP IMPLEMENTATION
Claudia Manfredi, Fabrizio Dori, Ernesto Iadanza
Department of Electronics and Telecommunications
Università degli Studi di Firenze, Firenze, Italy
AIMS
• Voice hoarseness is mainly related to airflow turbulence in the vocal tract. It can be due to vocal fold paralysis, polyps, cordectomisation or other dysfunctions that alter regular speech production.
• Standardised yet reliable procedures and devices are desirable for medical, surgical and logopaedic treatment.
• Simple, user-friendly and cheap devices that are readily available on the market are needed.
• To this end, a portable device is proposed as an aid for dysphonic speakers. It could help reduce the effort of speaking and improve communication quality, which is closely tied to the social problems caused by an awkward-sounding voice.
INTRODUCTION
• This paper presents an approach for reducing voice hoarseness, based on low-order Singular Value Decomposition (SVD) of data matrices.
• Quantitative indexes (SNR and PSD ratios) are evaluated to test the filtering procedure.
• A prototype DSP board implementing the procedure has been developed, using optimised C and Assembler code.
• A new step is added to the filtering chain to reduce noise and increase the quality of the output signal.
WHY SVD
• SVD of noisy data provides a factorisation of the data matrix A in which the singular vectors corresponding to the dominant singular values give a good estimate of the noise-free matrix.
• SVD is a numerically reliable and robust means of estimating the space of clean data (signal subspace) from data corrupted by white noise, and is thus particularly suited to speech denoising.
• The resulting low-rank approximation gave a variable subspace dimension p, linked to the signal dynamics. However, from the denoising point of view, the best results were obtained with p=2, both with synthetic and real data.
SVD DENOISING PROCEDURE
IN → N-point frame extraction → SVD analysis → clean speech frame reconstruction → OUT
(repeat on subsequent frames until the whole speech signal is cleaned)
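The chain above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the DSP code: frames are taken as non-overlapping, the per-frame matrix follows the forward and backward row layout of the slide THE DATA MATRIX A, and each cleaned sample is recovered by averaging all entries of the rank-p matrix that hold it. The slides do not specify the overlap or the reconstruction rule, and the names `sample_index_matrix`, `denoise_frame` and `denoise_signal` are hypothetical.

```python
import numpy as np

def sample_index_matrix(n, l):
    """0-based sample indices arranged as forward rows stacked over backward rows."""
    rows = n - l
    cols = np.arange(l)
    # Forward row i holds y(L+i), y(L+i-1), ..., y(1+i).
    forward = (l - 1 - cols)[None, :] + np.arange(rows)[:, None]
    # Backward row j holds y(N-L+1-j), ..., y(N-j).
    backward = cols[None, :] + (rows - np.arange(rows))[:, None]
    return np.vstack([forward, backward])

def denoise_frame(frame, l, p=2):
    """Rank-p SVD filtering of one frame; returns a frame of the same length."""
    idx = sample_index_matrix(len(frame), l)
    a = frame[idx]                                   # data matrix of the frame
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    a_p = (u[:, :p] * s[:p]) @ vt[:p]                # keep the p largest singular values
    # ASSUMPTION: average every entry of a_p that corresponds to the same sample.
    sums = np.bincount(idx.ravel(), weights=a_p.ravel(), minlength=len(frame))
    counts = np.bincount(idx.ravel(), minlength=len(frame))
    return sums / counts

def denoise_signal(y, n, l, p=2):
    """Filter consecutive n-point frames; a trailing partial frame is left unchanged."""
    y = np.asarray(y, dtype=float)
    out = y.copy()
    for start in range(0, len(y) - n + 1, n):        # ASSUMPTION: no frame overlap
        out[start:start + n] = denoise_frame(y[start:start + n], l, p)
    return out
```

A noise-free sinusoid produces a data matrix of rank 2, so with p=2 it passes through this sketch essentially unchanged, which is a convenient sanity check before trying noisy data.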
QUANTITATIVE MEASURE: SNR
$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{n=1}^{M} y^2(n)}{\sum_{n=1}^{M} \left( y(n) - y_{\mathrm{filt}}(n) \right)^2}$$
y(n) = noisy signal sample at time n
yfilt(n) = filtered signal sample at time n
M = number of samples
Low SNR values correspond to strong filtering.
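As a minimal sketch of this index (the function name `snr_db` is hypothetical, and the handling of the two zero-energy cases is an assumption, since the slides do not discuss them):

```python
import math

def snr_db(y, y_filt):
    """SNR in dB: energy of the noisy signal over energy of what the filter removed."""
    if len(y) != len(y_filt):
        raise ValueError("signals must have the same length")
    signal_energy = sum(v * v for v in y)
    removed_energy = sum((v - f) ** 2 for v, f in zip(y, y_filt))
    if removed_energy == 0.0:
        return math.inf       # ASSUMPTION: the filter removed nothing
    if signal_energy == 0.0:
        return -math.inf      # ASSUMPTION: silent input
    return 10.0 * math.log10(signal_energy / removed_energy)
```

For example, a filter that halves every sample removes a quarter of the energy, giving 10 log10(4), about 6.02 dB. The more the filter removes, the lower the value, which is why low SNR corresponds to strong filtering.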
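QUANTITATIVE MEASURE: PSD
$$\mathrm{PSD} = 10 \log_{10} \frac{\mathrm{PSD}_{\mathrm{nonfilt}}}{\mathrm{PSD}_{\mathrm{filt}}}$$
$$\mathrm{PSD}_{\mathrm{low}} = 10 \log_{10} \frac{\mathrm{PSD}_{\mathrm{nonfilt}}(f \le 4\ \mathrm{kHz})}{\mathrm{PSD}_{\mathrm{filt}}(f \le 4\ \mathrm{kHz})}$$
$$\mathrm{PSD}_{\mathrm{high}} = 10 \log_{10} \frac{\mathrm{PSD}_{\mathrm{nonfilt}}(f \ge 4\ \mathrm{kHz})}{\mathrm{PSD}_{\mathrm{filt}}(f \ge 4\ \mathrm{kHz})}$$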
For good denoising:
PSD and PSDlow around zero (no loss of power)
High PSDhigh (loss of power due to noise)
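The three ratios could be computed along these lines. The slides do not say which PSD estimator was used, so this sketch assumes a raw periodogram summed over each band; `band_power` and `psd_ratios_db` are hypothetical names, and the 4 kHz split is the one given above.

```python
import numpy as np

def band_power(x, fs, f_lo, f_hi):
    """Periodogram power of x summed over f_lo <= f <= f_hi (frequencies in Hz)."""
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return spectrum[mask].sum()

def psd_ratios_db(y, y_filt, fs, f_split=4000.0):
    """Return (PSD, PSD_low, PSD_high) in dB: non-filtered over filtered band power."""
    def ratio(f_lo, f_hi):
        return 10.0 * np.log10(band_power(y, fs, f_lo, f_hi)
                               / band_power(y_filt, fs, f_lo, f_hi))
    nyquist = fs / 2.0
    return ratio(0.0, nyquist), ratio(0.0, f_split), ratio(f_split, nyquist)
```

With this convention, a filter that leaves the band below 4 kHz untouched and attenuates the band above it by a factor of 10 in amplitude gives PSD_low near 0 dB and PSD_high near 20 dB, matching the "good denoising" pattern described above.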
HW IMPLEMENTATION: DSP BOARD
Realised at the C.N.R. Labs, Institute of Clinical Physiology, Pisa
DSP BOARD: SW TOPICS
• The SVD algorithm is implemented by means of a two-step procedure:
1 - The data matrix A is bi-diagonalised by applying a sequence of Householder reflections;
2 - The resulting bi-diagonal matrix is made diagonal using a modified QR algorithm.
• C language routines build the data matrices, perform computations and collect the filtered data for the output frame reconstruction.
• Assembler code is used for the SVD factorisation.
• Pipelining is used for loops.
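Step 1 of the two-step procedure can be illustrated with a short NumPy sketch. It is not the C and Assembler code running on the board, which the slides do not show. The function names are hypothetical, the matrix is assumed to have at least as many rows as columns, and step 2 (the modified QR iteration) is not implemented here.

```python
import numpy as np

def householder_vector(x):
    """Unit vector v such that (I - 2 v v^T) x is a multiple of the first basis vector."""
    v = np.array(x, dtype=float)
    norm_x = np.linalg.norm(v)
    if norm_x == 0.0:
        return v                        # zero vector: the reflection reduces to the identity
    v[0] += np.copysign(norm_x, v[0])   # sign chosen to avoid cancellation
    return v / np.linalg.norm(v)

def bidiagonalise(a):
    """Reduce a (n x m, n >= m) to upper bi-diagonal form with Householder reflections."""
    b = np.array(a, dtype=float)
    n, m = b.shape
    for k in range(m):
        # Left reflection: zero column k below the diagonal.
        v = householder_vector(b[k:, k])
        b[k:, k:] -= 2.0 * np.outer(v, v @ b[k:, k:])
        if k < m - 2:
            # Right reflection: zero row k to the right of the superdiagonal.
            w = householder_vector(b[k, k + 1:])
            b[k:, k + 1:] -= 2.0 * np.outer(b[k:, k + 1:] @ w, w)
    return b
```

Because only orthogonal reflections are applied, the bi-diagonal matrix has the same singular values as A. This can be checked by comparing `np.linalg.svd(bidiagonalise(a), compute_uv=False)` with `np.linalg.svd(a, compute_uv=False)`.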
RESULTS: REAL DATA
• The subjects were adult male patients who had undergone partial cordectomisation, pronouncing the Italian word /aiuole/ (flowerbeds), which is composed of the five principal vowels /a/, /e/, /i/, /o/, /u/.
• Nine lancet-operated (A1-A9) and seven laser-operated (B1-B7) subjects were analysed (not all results are presented here).
RESULTS: PSD
[Figure: PSD [dB] versus frequency [Hz] (0-12000 Hz) of the original and filtered signals, in two panels.
Before improvement: PSDtot = 0.56198 dB, PSDlow = 0.53911 dB, PSDhigh = 20.7645 dB, SNR = 7.4194 dB.
After improvement: PSDtot = 0.018502 dB, PSDlow = -0.0037879 dB, PSDhigh = 14.6047 dB, SNR = 16.3975 dB.]
Before improvement:
• PSD is lower over the whole frequency range (PSD=0.56 dB), especially in the high frequency region (PSDhigh=20.76 dB).
• There is a small negative effect in the low frequency region (PSDlow=0.54 dB), with some lowering of the output signal level.
• Strong denoising is achieved (SNR=7.42 dB).
After improvement:
• Better values are obtained for the PSD over the whole frequency range (PSD=0.02 dB) and in the low frequency range (PSDlow=-0.004 dB), corresponding to a good voice level at the output.
• The PSD ratio in the high frequency region is lower (PSDhigh=14.6 dB), with a correspondingly higher SNR value (SNR=16.4 dB).
RESULTS: SPECTROGRAM
[Figure: three spectrograms, frequency [Hz] (0-6000) versus time [s] (0-0.8). Left: non-filtered signal. Middle: filtered signal. Right: enhanced filtering.]
The noise level is reduced (middle and right plots), especially above 4 kHz.
The middle plot clearly shows the presence of click noise, which appears as almost regularly spaced vertical lines of rather high intensity.
The right plot shows that the new approach reduces this side effect, which has almost disappeared. Moreover, harmonics have been enhanced and often recovered, as their intensity colour clearly shows.
RESULTS: FORMANTS
[Figure: formant trajectories, frequency [Hz] (0-5000) versus time [s] (0-0.8). Markers: o = non-filtered; * = SVD-filtered; + = enhanced SVD-filtered. Caption: Formant enhancement.]
Formant trajectories are obtained by means of autoregressive (AR) PSD estimation, with model order p = 25 = Fs (Fs in kHz). This choice is a good compromise between parsimony and resolution.
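An AR PSD estimate and its peaks, of the kind used here to track formants, might be obtained as in the sketch below. The slides do not state which AR estimator was used. The Yule-Walker (autocorrelation) method is assumed here as a stand-in, and all function names are hypothetical.

```python
import numpy as np

def yule_walker(x, order):
    """AR coefficients a[1..order] and noise variance for x(n) + sum a_k x(n-k) = e(n)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Biased autocorrelation estimates r(0), ..., r(order).
    r = np.array([x[: n - k] @ x[k:] for k in range(order + 1)]) / n
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[lags], -r[1:])     # Toeplitz normal equations
    sigma2 = r[0] + a @ r[1:]
    return a, sigma2

def ar_psd(x, fs, order, n_freq=1024):
    """AR spectrum sigma2 / |A(f)|^2 on a grid from 0 to fs/2 (Hz)."""
    a, sigma2 = yule_walker(x, order)
    freqs = np.linspace(0.0, fs / 2.0, n_freq)
    k = np.arange(order + 1)
    e = np.exp(-2j * np.pi * np.outer(freqs / fs, k))
    denom = np.abs(e @ np.concatenate(([1.0], a))) ** 2
    return freqs, sigma2 / denom

def spectral_peaks(freqs, psd):
    """Frequencies of the strict local maxima of the PSD (grid endpoints excluded)."""
    inner = (psd[1:-1] > psd[:-2]) & (psd[1:-1] > psd[2:])
    return freqs[1:-1][inner]
```

Applied frame by frame with `order=25`, the peak frequencies returned by `spectral_peaks` would give candidate formant positions over time. How peaks are then assigned to individual formants is not covered by this sketch.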
FINAL REMARKS
• The proposed algorithm has been successfully implemented on the DSP board and useful objective parameters for voice analysis are evaluated.
• The DSP board is a prototype of an HW tool for real-time voice processing and enhancing.
• The DSP board is the first step towards a portable device, which could be of help for dysphonic people.
FUTURE WORK
• Compare SVD with other methods (already successfully done with Least Squares and some wavelet families).
• Further optimise filtered voices (increase energy and reduce spikes). Specifically, a selective procedure should be defined for the normalisation step, in order to avoid enhancing undesired frequency ranges.
• Find correlations between the GIRBAS scale and quantitative parameters, also for classification purposes.
• Optimise the device as an aid for non-invasive diagnosis, rehabilitation and the speech of dysphonic patients.
THANKS FOR YOUR ATTENTION
DENOISING WITH SVD
• SVD solves the problem of finding the $n \times m$ matrix $A_p$ of rank $p \le \mathrm{rank}(A)$ that best approximates $A$ in the 2-norm sense.
• A solution to this problem is
$$A_p = U \Sigma_p V^T$$
where $U$ and $V$ contain the left and right singular vectors of $A$, and $\Sigma_p$ is obtained from $\Sigma$ by setting to zero all but its $p$ largest singular values.
• The accuracy of this approximation is given by $\sigma_{p+1}$, that is, $\|A - A_p\|_2 = \sigma_{p+1}$.
SINGULAR VALUE DECOMPOSITION (SVD)
$$U^T A V = \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix}$$
$A_{n \times m}$ = data matrix; $U$, $V$ = left and right singular vectors of $A$, respectively; $r = \min(n, m) = \mathrm{rank}(A)$.
$\Sigma_{r \times r} = \mathrm{diag}(\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r)$, $\sigma_i > 0$, contains the singular values of $A$. They display the distance of $A$ from low-rank matrices.
Together with $U$ and $V$, the $\sigma_i$ are used to construct optimal low-rank approximants of $A$: $A_p$, $p \le r$.
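A low-rank approximant can be built from the factors in a few lines of NumPy. This is a generic numerical illustration on a random matrix, not the DSP implementation, and the function name `low_rank_approximation` is hypothetical.

```python
import numpy as np

def low_rank_approximation(a, p):
    """Return A_p (the p largest singular triplets of a) and all singular values of a."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)   # s is sorted in decreasing order
    a_p = (u[:, :p] * s[:p]) @ vt[:p]
    return a_p, s

rng = np.random.default_rng(1)
a = rng.standard_normal((8, 5))
a_p, s = low_rank_approximation(a, p=2)
# The 2-norm error of the rank-2 approximant equals the third singular value.
error = np.linalg.norm(a - a_p, ord=2)
print(error, s[2])
```

The two printed numbers agree to rounding error, which is the numerical counterpart of the statement that the singular values display the distance of A from low-rank matrices.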
THE DATA MATRIX A
$$A = \begin{bmatrix}
y(L) & y(L-1) & \cdots & y(1) \\
y(L+1) & y(L) & \cdots & y(2) \\
\vdots & \vdots & & \vdots \\
y(N-1) & y(N-2) & \cdots & y(N-L) \\
y^T(N-L+1) & y^T(N-L+2) & \cdots & y^T(N) \\
y^T(N-L) & y^T(N-L+1) & \cdots & y^T(N-1) \\
\vdots & \vdots & & \vdots \\
y^T(2) & y^T(3) & \cdots & y^T(L+1)
\end{bmatrix}$$
y(t) = signal sample at time t
N = 3Fs/Fmin, where Fs = sampling frequency and Fmin = min F0
L = Fs + 4, with Fs in kHz
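To make the sizes concrete, the sketch below computes N and L and builds the matrix for one frame. Rounding N and L to integers is an assumption, as are the function names, and the values Fs = 25 kHz and Fmin = 100 Hz in the note that follows are illustrative only.

```python
import numpy as np

def frame_sizes(fs_hz, f_min_hz):
    """N = 3 Fs / Fmin (Fs in Hz) and L = Fs + 4 (Fs in kHz), rounded to integers."""
    n = int(round(3.0 * fs_hz / f_min_hz))    # about three periods of the lowest F0
    l = int(round(fs_hz / 1000.0)) + 4
    return n, l

def data_matrix(y, l):
    """Stack the forward rows over the backward rows for one N-point frame y."""
    y = np.asarray(y, dtype=float)
    rows = len(y) - l
    forward = [y[i:i + l][::-1] for i in range(rows)]            # y(L+i) ... y(1+i)
    backward = [y[rows - j:rows - j + l] for j in range(rows)]   # y(N-L+1-j) ... y(N-j)
    return np.array(forward + backward)
```

With the illustrative values above, `frame_sizes(25000, 100)` gives N = 750 and L = 29, so A has 2(N - L) = 1442 rows and 29 columns. For a toy frame y(1..6) = 1, ..., 6 with L = 3, the rows are (3 2 1), (4 3 2), (5 4 3), (4 5 6), (3 4 5), (2 3 4).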
FINDING THE SIGNAL SUBSPACE DIMENSION p
Classical system identification techniques, as well as a new approach (DME), were applied to the decreasing sequence of singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$ of A, on each data frame. This gave a variable subspace dimension p, linked to the signal dynamics.
However, from the denoising point of view, the best results were obtained with p=2, both with synthetic and real data.
QUALITATIVE MEASURE: GRBAS SCALE
It comprises five qualitative parameters, ranging from 0 (healthy voice) to 3 (severe disease):
G = grade of dysphonia, related to shimmer and noise
R = roughness, related to jitter
B = breathiness, related to shimmer
A = asthenicity
S = strain
RESULTS - GRBAS and SNR
            A3-SVD-DWT   A7-SVD-DWT   B3-SVD-DWT   B6-SVD-DWT
G           1-0-2        2-1-1        2-1-1        1-1-2
R           1-1-1        0-0-1        1-0-0        1-0-0
B           1-0-1        2-1-1        2-1-1        2-1-1
A           0-0-0        0-0-1        2-1-1        1-0-1
S           0-0-1        0-0-0        0-0-0        0-0-0
SNR [dB]    3.97-11.2    7.56-15.4    13.5-16.9    6.57-13.1
OTHER SOUND EXAMPLES
Original; SVD filtered; new SVD