S5.2

A CONNECTIONIST APPROACH FOR
AUTOMATIC SPEAKER IDENTIFICATION

Younès BENNANI*, Françoise FOGELMAN SOULIÉ**, Patrick GALLINARI*

* Université de Paris Sud, Centre d'Orsay
Laboratoire de Recherche en Informatique
CNRS UA 410, Bâtiment 490
91405 Orsay FRANCE

** École des Hautes Études en Informatique
Laboratoire d'Intelligence Artificielle
45 rue des Saints-Pères
75006 Paris FRANCE
Abstract

This paper presents a connectionist approach to automatic speaker identification, based for the first time on the LVQ (Learning Vector Quantization) algorithm. For each "adherent" to the identification system, a number of references is fixed. The algorithm is based on a nearest-neighbor principle, with adaptation through learning. Identification is realized by comparing the distance of the unknown utterance to the nearest reference against a given threshold. Preliminary tests run on a set of 10 speakers show an identification rate of 97% for MFC coefficients. We present the identification system and the data base used, and indicate the results obtained for different combinations of parameters. We further evaluate our system by comparing its performance with a Bayesian system.

1. INTRODUCTION

We introduce a complete system for Automatic Speaker Identification (ASI) and Automatic Speaker Verification (ASV). Connectionist models have recently shown good performances for classification tasks [10], [13], [14], [15], [17]. We are thus investigating how best to integrate connectionist methods into such a system so as to get enhanced performances. The main problem in integrating connectionist techniques is to determine, at each step, the best combination of classical and connectionist processings.

We have started working in this area for the ASI problem. We present in this paper preliminary results which aim at demonstrating the viability of such an approach in ASI: we have built a simple system (shown in fig. 1) investigating the optimal use of a few techniques (different coefficients: LPC and MFCC; different sentence models; and different classifiers: Bayesian and LVQ). LVQ is a nearest-neighbor classifier recently proposed in [13] which has produced good performances in classification tasks [6].

We are currently working at building a larger and more complex system which will combine more processings and seek their optimal combinations. In particular, other connectionist techniques are being investigated: multi-layer networks and TDNNs, topological maps...

The paper is organized as follows. In section 2, we present the architecture of our system, the speech data base and its pre-processing, the sentence modelization and the LVQ2 algorithm. In section 3, we give our experiments and results. Conclusions are drawn in section 4.

2. THE AUTOMATIC SPEAKER IDENTIFICATION SYSTEM

2.1 System architecture

A speaker identification system (fig. 1) includes several successive steps: a parameterization step (to produce, from the microphone signal, a population of vectors in R^p) and a modelization step (to build a model of the speaker's voice). The next step depends on the use of the system: in learning mode, for each speaker, models for training and for test are chosen to serve as references and, in the case of LVQ, further adapted during training. In recognition mode, a classifier is implemented.

[Figure 1: System architecture]

CH2847-2/90/0000-0265 $1.00 © 1990 IEEE
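As a reading aid, the chain of fig. 1 can be sketched as three stages followed by a nearest-reference decision; the function names and the threshold test are hypothetical placeholders for this sketch, not code from the paper:

```python
# Hypothetical skeleton of the identification chain of fig. 1:
# parameterization -> sentence modelization -> nearest-reference decision.

def identify(signal, references, parameterize, build_model, distance, threshold):
    """Return the identity of the nearest reference model, or None if the
    unknown utterance is farther than `threshold` from every reference."""
    frames = parameterize(signal)      # e.g. LPC or MFCC vectors per frame
    model = build_model(frames)        # e.g. mean + first eigenvector (sec. 2.5)
    best_id, best_d = None, float("inf")
    for speaker_id, ref in references:
        d = distance(model, ref)
        if d < best_d:
            best_id, best_d = speaker_id, d
    # Verification criterion from the abstract: accept only if the nearest
    # reference is close enough.
    return best_id if best_d <= threshold else None
```

The three stage functions are passed in as parameters here only to keep the sketch generic; sections 2.3-2.6 describe the concrete choices made in the paper.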
2.2 The data base

We have tested our ASI system on a population of 10 speakers, half male and half female. The data base contains ten sentences, in French, phonetically balanced [4]. Each sentence is very short, lasting from 1.5 to 3 s. Each of the 10 speakers has pronounced each sentence 10 times, in a unique session. The total number of sentences is thus 10x10x10 = 1000. In the preliminary results presented here, we have used the first sentence only.

We have already tested an ASI system on isolated words. We now use sentences instead of words, because they make better use of a speaker's identity. Balanced sentences provide short signals with optimal variability distribution.

The recordings were made on Memorex equipment, in our office, where the background noise is relatively high. Noise sources include conversations, steps, doors opening and closing, and exterior noises (telephone...). The energy of these different sources is concentrated in certain frequency bands and does not correspond to white noise.

2.3 Preprocessing of the analog signal

The analog recordings were first digitized at 10 kHz on 16 bits with an OROS card, after low-pass filtering (0-4000 Hz). The samples are then pre-emphasized by a first-order digital filter with transfer function 1 - 0.95 z^-1.

2.4 Digital signal analysis

We have tested two different parameterization methods:

LPC: a 12th-order autocorrelation analysis is carried out every 10 ms, using 25.6 ms overlapping Hamming windows. Each frame is then converted into a 12th-order LPC vector. Each of the 1000 sentences is thus converted into a 12xN array, where N is the number of frames in the sentence. This parameterization technique has been widely used for automatic speaker recognition [1], [11]...

MFCC: the MFCC (Mel Frequency Cepstral Coefficient) parameterization comes from a Fourier analysis of the signal. The Fourier spectrum is computed from a 25.6 ms frame (256 points at 10 kHz) obtained through Hamming windowing. 24 triangular filters are then applied to obtain the Mel scale. For each window, we obtain an 8-dimensional MFCC vector.

2.5 Sentences modelization

Since the signal has been parameterized, a sentence is now a pxN array, or a set of N points in R^p, where p = 12 (for LPC) or 8 (for MFCC) and N is the number of frames. For each speaker, and each utterance of a sentence, we model this cloud of points through a combination of its mean and the first two eigenvectors of the covariance matrix, as in Principal Component Analysis (PCA). The mean gives the position of the cloud and the two eigenvectors its "shape".

A voice model may thus be characterized by 3 vectors in R^p: one for the mean and two eigenvectors. In [11], it has been shown that such a model sufficiently captures the speaker characteristics to allow for good identification. In this paper we have chosen to use the mean and first eigenvector only. Models can then be compared using the Euclidean distance.

2.6 The learning algorithm: LVQ2

LVQ is one of the best connectionist techniques for classification tasks. Its performances are comparable, for example, to those of multi-layer networks, for a much reduced learning time. LVQ can thus be especially interesting for tasks requiring large training sets, such as e.g. speech processing.

LVQ works as follows: each class is characterized by a fixed set of reference vectors, of the same dimension as the data to be classified. When an unknown vector is presented, all the reference vectors are searched to determine the nearest one, in the Euclidean distance sense. The vector is then classified into the class of this nearest reference. LVQ is an algorithm for adaptively modifying the references. In [13], two versions of LVQ, LVQ1 and LVQ2, were proposed. We have used here LVQ2, because of its better performances.

Let x be a vector in the training set, mi(t) the nearest reference vector and Ci its class.

- If mi(t) and x are in different classes, then let mj(t) be the second nearest reference vector and Cj its class. If x is in class Cj, then a symmetrical window, of size w, is set around the mid-point of mi(t) and mj(t). If x falls within the window, then mi(t) is moved away from x and mj(t) is moved closer.

- In all other cases, nothing changes.

More precisely, the adaptation rule writes:

mi(t+1) = mi(t) - a(t) [x - mi(t)]
mj(t+1) = mj(t) + a(t) [x - mj(t)]

We have taken here:
- a(t) = 0.1 (1 - t/nmax), where nmax is the maximum number of iterations allowed;
- w = |mi(t) - mj(t)|.

The computations required for LVQ2 are thus very simple. On top of that, LVQ can be made to converge very fast by a careful initialization: for example, reference vectors for each class can be initialized by a K-means technique on the examples. This initial choice is thus already approximately correct; LVQ just has to refine it by making use of the class identification information.
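The modelization of section 2.5 and the LVQ2 step of section 2.6 can be sketched as follows (a sketch, not the authors' code; in particular, "x falls within the window" is read here as x lying within w/2 of the mid-point, which is one plausible interpretation of the symmetrical window):

```python
import numpy as np

def sentence_model(frames):
    """Model a sentence (N x p array of frames) by its mean and the first
    eigenvector of the covariance matrix, as in section 2.5."""
    mean = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    v1 = eigvecs[:, np.argmax(eigvals)]       # first principal direction
    return np.concatenate([mean, v1])         # the (m, V1) model vector

def lvq2_step(x, label, refs, ref_labels, alpha):
    """One LVQ2 adaptation step (refs is modified in place)."""
    d = np.linalg.norm(refs - x, axis=1)
    i, j = np.argsort(d)[:2]                  # nearest and second nearest
    if ref_labels[i] != label and ref_labels[j] == label:
        w = np.linalg.norm(refs[i] - refs[j])     # window size w = |mi - mj|
        mid = 0.5 * (refs[i] + refs[j])
        if np.linalg.norm(x - mid) < 0.5 * w:     # x falls within the window
            refs[i] -= alpha * (x - refs[i])      # move wrong reference away
            refs[j] += alpha * (x - refs[j])      # move correct reference closer
    return refs
```

With the paper's schedule, alpha would be 0.1 (1 - t/nmax) at sweep t, and refs would be initialized by K-means per class before the LVQ2 sweeps.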
3. EXPERIMENTS AND RESULTS
3.1 Experiments
The speaker identification system has first been simulated on a population of 10 speakers. The system must work independently of the sentence; as a first step, we have started here working on the first sentence only: "Il se garantira du froid avec ce bon capuchon". This sentence is very short, lasting between 2.4 s and 3 s. We thus have 100 sentences (10 pronunciations x 10 speakers) and their 100 models. We have run different experiments, for both the LPC and MFC coefficients, by using vectors built from the mean m and the first eigenvector V1. m behaves like a long-term spectrum. Other possibilities have also been tested, but led to poor results.

For each pair (vector, coefficients), the system has been tested, on the same data base, with the "leave-one-out" cross-validation technique described in [12] and [5]. The LVQ algorithm always converged in less than 50 sweeps through the data base, for each combination of 9 utterances x 10 speakers. The result is thus, in our case, an average over the 10 possible utterances "left out" (for each speaker). LVQ was initialized with a K-means with K=2 or 3, for each class.
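The "leave-one-out" protocol used here can be sketched generically (the `fit` and `classify` arguments stand for the K-means-initialized LVQ training and nearest-reference decision; they are placeholders in this sketch):

```python
def leave_one_out(models, labels, fit, classify):
    """Leave-one-out estimate of the identification rate: each utterance
    model is classified by a system trained on all the other models."""
    correct = 0
    for i in range(len(models)):
        train_x = models[:i] + models[i + 1:]   # leave utterance i out
        train_y = labels[:i] + labels[i + 1:]
        clf = fit(train_x, train_y)             # e.g. K-means init + LVQ2 sweeps
        if classify(clf, models[i]) == labels[i]:
            correct += 1
    return correct / len(models)
```

Because the classifier is retrained for every held-out utterance, the references differ from pass to pass, which is why a point that looks well placed in a global projection can still be misclassified.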
[Figure 2: projections of the (m, V1) model vectors on the first two PCA axes]
Moreover, the results of the LVQ technique have been compared to those of a Bayesian system [2], [3].
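The Bayesian system itself is described in [2], [3]; purely as an illustration of this kind of baseline, a minimal Gaussian maximum-likelihood classifier over the model vectors could look like the following (our own sketch, assuming diagonal covariances and equal priors; it does not reproduce the system of [2], [3]):

```python
import math

def fit_gaussians(models_by_speaker):
    """Per-speaker mean and variance of the model vectors (diagonal covariance)."""
    params = {}
    for spk, vecs in models_by_speaker.items():
        n, p = len(vecs), len(vecs[0])
        mean = [sum(v[d] for v in vecs) / n for d in range(p)]
        var = [max(sum((v[d] - mean[d]) ** 2 for v in vecs) / n, 1e-9)
               for d in range(p)]               # floor avoids zero variance
        params[spk] = (mean, var)
    return params

def log_likelihood(x, mean, var):
    return sum(-0.5 * (math.log(2 * math.pi * var[d])
                       + (x[d] - mean[d]) ** 2 / var[d])
               for d in range(len(x)))

def bayes_identify(x, params):
    """Maximum-likelihood decision under equal priors."""
    return max(params, key=lambda spk: log_likelihood(x, *params[spk]))
```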
3.2 Results
The results are given in table 1.
Table 1: Identification rate (in %) for the LVQ-based technique (with K=2 and K=3 references: LVQ-2 and LVQ-3) and the Bayesian system (BS)
These results show that, for both the Bayesian (BS) and the LVQ systems, the MFC coefficients do significantly better than the LPCs; but their computation time is twice that of the LPCs. However, the large difference obtained for the two parameter sets might also be due to some numerical instability in our computation of the LPCs (through the autocorrelation method).

These results indeed show that the information contained in the mean m and first eigenvector V1 is by itself sufficient to identify the different speakers. Figure 2 shows the projections of the (m, V1) vectors on the first two axes of PCA: the clouds are evidently relatively well separated. However, this figure should be used with care: the inertia gathered by these two axes is relatively low in our case (~40%), and points which look well separated in the figure, e.g. the utterances of speakers 5, 8 and 9 for LPCs, might still be misclassified because the references change at each pass of the "leave-one-out" technique. The classification errors can be better understood by looking at the confusion matrix, which is here an average over the different passes.
Table 2: Confusion matrices for LPCs (bottom) and MFCCs (top). The speaker in line i is identified as the speaker in column j.

In this table, it is interesting to see that the two methods are not complementary: the errors with MFCCs are the same as those for LPCs. As expected, confusions arise for females more than for males, females being always mistaken for other females and males for males.

These results are only preliminary. With an increased number of speakers, we intend to use two different data bases for males and females, and different models.

4. CONCLUSION

The experimental system that we have realized has the following properties:
- simple algorithm
- fast learning
- fast identification
- maximal error rate of 3%
- short signal (3 s): classical systems, with similar performances, use much longer durations (20 to 30 s).

We think that these preliminary results demonstrate the interest of using connectionist models for speaker identification tasks. By mapping the speech problem to a pattern recognition problem, we have been able to provide a connectionist classifier leading to an error rate of less than 3%, with a model based only on the statistical properties of the cloud of frames.

Of course, many problems have yet to be solved: increase of the number of speakers, addition/suppression of a speaker, automatic control of the learning process, update of the references (because of speaker adaptation).

We are presently working on a larger system, with an increased number of speakers, combining various connectionist algorithms.

ACKNOWLEDGEMENTS

We gratefully acknowledge helpful discussions with C. MONTACIE (ENST), E. DAVDIN (ENST) for his participation in programming the Bayesian system, J.S. LIENARD (LIMSI) for his comments on this work, and the members of LRI who contributed to the recording of the data base.

5. REFERENCES

[1] B. S. Atal. "Effectiveness of LPC characteristics of the speech wave for A.S.I. and A.S.V." JASA, vol. 55, 1974.
[2] J. B. Attili. "On the development of a real-time text-independent speaker verification system." Ph.D. thesis, Rensselaer Polytechnic Institute, 1987.
[3] J. B. Attili. "A TMS320C20-based real-time, text-independent automatic speaker verification system." Proc. of ICASSP, 1988.
[4] P. Combescure. "20 listes de dix phrases phonétiquement équilibrées." Revue d'acoustique, no. 56, 34-38, 1981.
[5] B. Efron. "Estimating the Error Rate in a Predictive Rule: Improvement on Cross-Validation." JASA, vol. 78, 316-331, 1983.
[6] E. McDermott and S. Katagiri. "Shift-invariant, Multi-Category Phoneme Recognition Using Kohonen's LVQ2." Proc. of ICASSP, S3.1, 1989.
[7] F. Fogelman-Soulié. "Méthodes connexionnistes pour l'apprentissage." 2èmes Journées du PRC-GRECO Intelligence Artificielle, Toulouse, Teknea, 275-293, 1988.
[8] F. Fogelman-Soulié, P. Gallinari, Y. Le Cun, S. Thiria. "Network learning." In "Machine Learning", Y. Kodratoff, R. Michalski Eds., Morgan Kaufmann, to appear.
[9] S. Furui. "Cepstral Analysis technique for automatic speaker verification." IEEE Trans. on ASSP, vol. 29, no. 2, April 1981.
[10] P. Gallinari, S. Thiria, F. Fogelman-Soulié. "Multilayer Perceptrons and Data Analysis." Second annual International Conference on Neural Networks, San Diego, I-391-401, 1988.
[11] Y. Grenier. "Identification du locuteur et Adaptation au locuteur d'un système de Reconnaissance Phonémique." Thèse de Docteur-Ingénieur, ENST-E-77005, Paris, 1977.
[12] Lachenbruch and Mickey. "Estimation of Error Rates in Discriminant Analysis." Technometrics, 1-11, 1968.
[13] T. Kohonen. "Self-Organization and Associative Memory" (2nd Ed.), Springer, Berlin-Heidelberg-New York-Tokyo, 1988.
[14] T. Kohonen. "The neural phonetic typewriter." IEEE Computer, 11-22, March 1988.
[15] T. Kohonen, G. Barna and R. Chrisley. "Statistical Pattern Recognition with neural networks: Benchmarking Studies." Proc. of ICNN, July 1988.
[16] F. K. Soong, A. E. Rosenberg. "On the use of instantaneous and transitional spectral information in speaker recognition." IEEE Trans. on ASSP, vol. 36, no. 6, June 1988.
[17] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. Lang. "Phoneme Recognition: Neural Networks vs. Hidden Markov Models." Proc. of ICASSP, S3.3, 107-110, April 1988.