S5.2

A CONNECTIONIST APPROACH FOR AUTOMATIC SPEAKER IDENTIFICATION

Younès BENNANI, Françoise FOGELMAN SOULIE, Patrick GALLINARI

* Université de Paris Sud, Centre d'Orsay, Laboratoire de Recherche en Informatique, CNRS UA 410, Bâtiment 490, 91405 Orsay, FRANCE
+ Ecole des Hautes Etudes en Informatique, Laboratoire d'Intelligence Artificielle, 45 rue des Saints-Pères, 75006 Paris, FRANCE

Abstract

This paper presents a connectionist approach to automatic speaker identification, based for the first time on the LVQ (Learning Vector Quantization) algorithm. For each "adherent" to the identification system, a number of references is fixed. The algorithm is based on a nearest neighbor principle, with adaptation through learning. The identification is realized by comparing to a given threshold the distance of the unknown utterance to the nearest reference. Preliminary tests run on a set of 10 speakers show an identification rate of 97% for MFC coefficients. We present the identification system and the data base used, and indicate the results obtained for different combinations of parameters. We further evaluate our system by comparing its performances with a Bayesian system.

1. INTRODUCTION

We introduce a complete system for Automatic Speaker Identification (ASI) and Automatic Speaker Verification (ASV). Connectionist models have recently shown good performances for classification tasks [10], [13], [14], [15], [17]. We are thus investigating how best to integrate connectionist methods into such a system so as to obtain enhanced performances. The main problem in integrating connectionist techniques is to determine, at each step, the best combination of classical and connectionist processings. We have started working in this area for the ASI problem. We present in this paper preliminary results which aim at demonstrating the viability of such an approach for ASI: we have built a simple system (shown in fig. 1) investigating the optimal use of a few techniques (different coefficients: LPC and MFCC; different sentence models; and different classifiers: Bayesian and LVQ). LVQ is a nearest neighbor classifier recently proposed in [13] which has produced good performances in classification tasks [6]. We are currently building a larger and more complex system which will combine more processings and seek their optimal combinations. In particular, other connectionist techniques are being investigated: multi-layer networks, TDNN, topological maps...

The paper is organized as follows. In section 2, we present the architecture of our system, the speech data base and its pre-processing, the sentence modelization and the LVQ2 algorithm. In section 3, we present our experiments and results. Conclusions are drawn in section 4.

2. THE AUTOMATIC SPEAKER IDENTIFICATION SYSTEM

2.1 System architecture

A speaker identification system (fig. 1) includes several successive steps: a parameterization step (to produce, from the microphone signal, a population of vectors in R^p) and a modelization step (to build a model of the speaker's voice). The next step depends on the use of the system: in learning mode, for each speaker, models for training and for test are chosen to serve as references and, in the case of LVQ, are further adapted during training. In recognition mode, a classifier is implemented.

Figure 1: System architecture
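As a reading aid only, the following minimal sketch (in Python/NumPy, not part of the original paper) illustrates the decision rule stated in the abstract: the unknown utterance is assigned to the speaker owning the nearest reference, provided that distance falls below a threshold. The function name, data layout and `threshold` parameter are our own illustrative choices, not the authors' implementation.

```python
# Hypothetical sketch of the nearest-reference, threshold-based decision
# described in the abstract. All names are illustrative assumptions.
import numpy as np

def identify(utterance_model, references, threshold):
    """Return the id of the nearest reference's speaker, or None (rejection).

    utterance_model : 1-D array modelling the unknown utterance
    references      : dict mapping speaker id -> list of 1-D reference vectors
    threshold       : maximum accepted Euclidean distance (assumed parameter)
    """
    best_speaker, best_dist = None, np.inf
    for speaker, refs in references.items():
        for ref in refs:
            d = np.linalg.norm(utterance_model - ref)   # Euclidean distance
            if d < best_dist:
                best_speaker, best_dist = speaker, d
    # Identification is accepted only if the nearest reference is close enough.
    return best_speaker if best_dist <= threshold else None
```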
2.2 The data base

We have tested our ASI system on a population of 10 speakers, half male and half female. The data base contains ten sentences, in French, phonetically balanced [4]. Each sentence is very short, lasting from 1.5 to 3 s. Each of the 10 speakers has pronounced each sentence 10 times, in a single session. The total number of recordings is thus 10 x 10 x 10 = 1000. In the preliminary results presented here, we have used the first sentence only.

We had already tested an ASI system on isolated words. We now use sentences instead of words, because they allow a better use of a speaker's identity: balanced sentences provide short signals with an optimal distribution of variability.

The recordings have been made on Memorex equipment, in our office, where the background noise is relatively high. Noise sources include conversations, steps, doors opening and closing, and exterior noises (telephone...). The energy of these different sources is concentrated in certain frequency bands and does not correspond to white noise.

2.3 Preprocessing of the analog signal

The analog recordings have first been digitized at 10 kHz on 16 bits with an OROS card, after low-pass filtering (0-4000 Hz). The samples are then pre-emphasized by a first order digital filter with transfer function 1 - 0.95 z^-1.

2.4 Digital signal analysis

We have tested two different parameterization methods:

- LPC: a 12th order autocorrelation analysis is carried out every 10 ms, using 25.6 ms overlapping Hamming windows. Each frame is then converted into a 12th order LPC vector. Each of the 1000 sentences is thus converted into a 12xN array, where N is the number of frames in the sentence. This parameterization technique has been widely used for automatic speaker recognition [1], [11].

- MFCC: the MFCC (Mel Frequency Cepstral Coefficient) parameterization comes from a Fourier analysis of the signal. The Fourier spectrum is computed from a 25.6 ms frame (256 points at 10 kHz) obtained through Hamming windowing. A bank of 24 triangular filters is then applied to obtain the Mel scale. For each window, we obtain an 8-dimension MFCC vector.
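The sketch below (ours, not the authors') illustrates the front-end steps of sections 2.3-2.4 in Python/NumPy: pre-emphasis with 1 - 0.95 z^-1, then framing into 25.6 ms Hamming windows taken every 10 ms. The sample rate, frame length and hop come from the paper; the function names are assumptions, and the subsequent LPC or Mel filter-bank analysis is deliberately not reproduced.

```python
# Illustrative front-end sketch: pre-emphasis, framing and windowing
# (at 10 kHz: 256-sample frames, 100-sample hop), then the power spectrum
# that an MFCC analysis would start from.
import numpy as np

SAMPLE_RATE = 10_000          # Hz, as in section 2.3
FRAME_LEN   = 256             # samples = 25.6 ms
FRAME_HOP   = 100             # samples = 10 ms

def pre_emphasize(signal, coeff=0.95):
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1]."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def frames(signal):
    """Cut the signal into overlapping Hamming-windowed frames (one per row)."""
    window = np.hamming(FRAME_LEN)
    n_frames = 1 + (len(signal) - FRAME_LEN) // FRAME_HOP
    out = np.empty((n_frames, FRAME_LEN))
    for i in range(n_frames):
        start = i * FRAME_HOP
        out[i] = signal[start:start + FRAME_LEN] * window
    return out

def power_spectrum(windowed_frames):
    """Magnitude-squared FFT of each frame (starting point for MFCCs)."""
    return np.abs(np.fft.rfft(windowed_frames, axis=1)) ** 2
```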
2.5 Sentence modelization

Once the signal has been parameterized, a sentence is a p x N array, i.e. a set of N points in R^p, where p = 12 (for LPC) or 8 (for MFCC) and N is the number of frames. For each speaker, and each utterance of a sentence, we model this cloud of points through a combination of its mean and the two first eigenvectors of its covariance matrix, as in Principal Component Analysis (PCA). The mean gives the position of the cloud and the two eigenvectors its "shape". A voice model may thus be characterized by 3 vectors in R^p: one for the mean and two eigenvectors. In [11], it has been shown that such a model captures enough of the speaker characteristics to allow for good identification. In this paper we have chosen to use the mean and the first eigenvector only. Models can then be compared by using the Euclidean distance.

2.6 The learning algorithm: LVQ2

LVQ is one of the best connectionist techniques for classification tasks. Its performances are comparable, for example, to those of multi-layer networks, for a much reduced learning time. LVQ can thus be especially interesting for tasks requiring large training sets, such as speech processing.

LVQ works as follows: each class is characterized by a fixed set of reference vectors, of the same dimension as the data to be classified. When an unknown vector is presented, all the reference vectors are searched to determine the nearest one, in the Euclidean distance sense. The vector is then classified into the class of this nearest reference. LVQ is an algorithm for adaptively modifying the references. In [13], two versions of LVQ, LVQ1 and LVQ2, were proposed. We have used LVQ2 here, because of its better performances.

Let x be a vector in the training set, m_i(t) the nearest reference vector and C_i its class.

- If m_i(t) and x are in different classes, then let m_j(t) be the second nearest reference vector and C_j its class. If x is in class C_j, then a symmetrical window of size w is set around the mid-point of m_i(t) and m_j(t). If x falls within the window, then m_i(t) is moved away from x and m_j(t) is moved closer. More precisely, the adaptation rule writes:

  m_i(t+1) = m_i(t) - a(t) [x - m_i(t)]
  m_j(t+1) = m_j(t) + a(t) [x - m_j(t)]

- In all other cases, nothing changes.

We have taken here:

- a(t) = 0.1 (1 - t/nmax), where nmax is the maximum number of iterations allowed;
- w = |m_i(t) - m_j(t)|.

The computations required for LVQ2 are thus very simple. Moreover, LVQ can be made to converge very fast by a careful initialization: for example, the reference vectors of each class can be initialized by a k-means technique on the examples. This initial choice is then already approximately correct; LVQ only has to refine it by making use of the class identification information.
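The following is a minimal sketch (ours, in Python/NumPy) of sections 2.5-2.6: building the (mean, first eigenvector) voice model and applying one LVQ2 sweep. The update rule follows the text above; the "window" test is our own reading of the paper's description (with w = |m_i - m_j|, x must project between m_i and m_j on the segment joining them). The learning rate schedule a(t) and the k-means initialization are left to the caller; all names are illustrative.

```python
# Illustrative sketch of the voice model of section 2.5 and the LVQ2 rule of
# section 2.6. Not the authors' implementation.
import numpy as np

def voice_model(frames):
    """frames: (N, p) array of parameter vectors for one utterance.
    Returns the concatenation of the mean and the first PCA eigenvector."""
    mean = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    v1 = eigvecs[:, np.argmax(eigvals)]       # first principal axis
    return np.concatenate([mean, v1])

def in_window(x, m_i, m_j):
    """One reading of the window test: x projects inside the segment [m_i, m_j]."""
    axis = m_j - m_i
    t = np.dot(x - m_i, axis) / np.dot(axis, axis)
    return 0.0 < t < 1.0

def lvq2_sweep(refs, ref_labels, data, labels, alpha):
    """One pass of the LVQ2 rule over the training data (refs updated in place)."""
    for x, y in zip(data, labels):
        d = np.linalg.norm(refs - x, axis=1)
        i, j = np.argsort(d)[:2]              # nearest and second nearest
        if ref_labels[i] != y and ref_labels[j] == y and in_window(x, refs[i], refs[j]):
            refs[i] -= alpha * (x - refs[i])  # push the wrong reference away
            refs[j] += alpha * (x - refs[j])  # pull the correct reference closer
    return refs

def classify(x, refs, ref_labels):
    """Nearest-reference decision (Euclidean distance)."""
    return ref_labels[np.argmin(np.linalg.norm(refs - x, axis=1))]
```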
3. EXPERIMENTS AND RESULTS

3.1 Experiments

The speaker identification system has first been simulated on a population of 10 speakers. The system must work independently of the sentence; as a first step, we have worked here on the first sentence only: "Il se garantira du froid avec ce bon capuchon". This sentence is very short, lasting between 2.4 s and 3 s. We thus have 100 sentences (10 pronunciations x 10 speakers) and their 100 models.

We have run different experiments, for both the LPC and MFC coefficients, by using vectors built from the mean m and the first eigenvector V1 (m behaves like a long-term spectrum). Other possibilities have also been tested, but led to poor results.

For each pair (vector, coefficients), the system has been tested, on the same data base, with the "leave-one-out" cross validation technique described in [12] and [5]. The LVQ algorithm always converged in less than 50 sweeps through the data base, for each training set of 9 utterances x 10 speakers. The result is thus, in our case, an average over the 10 possible utterances "left out" (for each speaker). LVQ was initialized with a K-means with K = 2 or 3 references per class.

Moreover, the results of the LVQ technique have been compared to those of a Bayesian system [2], [3].
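The sketch below (ours, not from the paper) spells out the "leave-one-out" protocol of section 3.1: each of the 10 utterances per speaker is left out in turn, the classifier is retrained on the remaining 9 x 10 voice models, and the identification rate is averaged over the left-out utterances. The `train_fn` / `classify_fn` callables stand in for any of the classifiers compared here (LVQ2, Bayesian); their names and the data layout are assumptions.

```python
# Illustrative leave-one-out evaluation harness for the protocol of section 3.1.
import numpy as np

def leave_one_out(models, train_fn, classify_fn):
    """models: array of shape (n_speakers, n_utterances, dim) of voice models.
    Returns the identification rate in [0, 1]."""
    n_speakers, n_utt, dim = models.shape
    correct = 0
    for left_out in range(n_utt):
        # Training set: all utterances except the left-out one, for every speaker.
        keep = [u for u in range(n_utt) if u != left_out]
        train_x = models[:, keep, :].reshape(-1, dim)
        train_y = np.repeat(np.arange(n_speakers), len(keep))
        classifier = train_fn(train_x, train_y)
        # Test: the left-out utterance of every speaker.
        for spk in range(n_speakers):
            if classify_fn(classifier, models[spk, left_out]) == spk:
                correct += 1
    return correct / (n_speakers * n_utt)
```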
3.2 Results

The results are given in table 1.

Table 1: Identification rate (in %) for the LVQ-based technique (with K = 2 and K = 3 references per class: LVQ-2 and LVQ-3) and the Bayesian system (BS).

These results show that, for both the Bayesian (BS) and the LVQ systems, the MFC coefficients do significantly better than the LPCs, but their computation time is twice that of the LPCs. However, the large difference observed between the two parameter sets might also be due to some numerical instability in our computation of the LPCs (through the autocorrelation method). These results also show that the information contained in the mean m and the first eigenvector V1 is by itself sufficient to identify the different speakers.

Figure 2 shows the projections of the (m, V1) vectors on the first two axes of a PCA: the clouds are relatively well separated. However, this figure should be used with care: the inertia captured by these two axes is relatively low in our case (about 40%), and points which look well separated in the figure, e.g. the utterances of speakers 5, 8 and 9 for LPCs, might still be misclassified, because the references change at each pass of the "leave-one-out" technique.

[Figure 2: PCA projections of the (m, V1) vectors for the LPC and MFC coefficients.]

The classification errors can be better understood by looking at the confusion matrix, which is here an average over the different passes.

Table 2: Confusion matrices for LPCs (bottom) and MFCCs (top). The speaker in line i is identified as the speaker in column j.

In this table, it is interesting to see that the two methods are not complementary: the errors with MFCCs are the same as those with LPCs. As expected, confusions arise more for females than for males, females always being mistaken for other females and males for other males. These results are only preliminary. With an increased number of speakers, we intend to use two different data bases for males and females, and different models.

4. CONCLUSION

The experimental system that we have realized has the following properties:
- simple algorithm
- fast learning
- fast identification
- maximal error rate of 3%
- short signal (3 s): classical systems with similar performances use much longer durations (20 to 30 s).

We think that these preliminary results demonstrate the interest of using connectionist models for speaker identification tasks. By mapping the speech problem onto a pattern recognition problem, we have been able to provide a connectionist classifier leading to an error rate of less than 3%, with a model based only on the statistical properties of the cloud of frames. Of course, many problems have yet to be solved: increasing the number of speakers, addition/suppression of a speaker, automatic control of the learning process, and update of the references (because of speaker adaptation).

We are presently working on a larger system, with an increased number of speakers, combining various connectionist algorithms.

ACKNOWLEDGEMENTS

We gratefully acknowledge helpful discussions with C. MONTACIE (ENST), E. DAVDIN (ENST) for his participation in programming the Bayesian system, J.S. LIENARD (LIMSI) for his comments on this work, and the members of LRI who contributed to the recording of the data base.

5. REFERENCES

[1] B.S. Atal. "Effectiveness of LPC characteristics of the speech wave for A.S.I and A.S.V." JASA, Vol. 55, 1974.
[2] J.B. Attili. "On the development of a real-time text-independent speaker verification system." Ph.D. thesis, Rensselaer Polytechnic Institute, 1987.
[3] J.B. Attili. "A TMS320C20 based real time, text-independent automatic speaker verification system." ICASSP, 1988.
[4] P. Combescure. "20 listes de dix phrases phonétiquement équilibrées." Revue d'acoustique, no. 56, 34-38, 1981.
[5] B. Efron. "Estimating the Error Rate in a Predictive Rule: Improvement on Cross-Validation." JASA, 78, 316-331, 1983.
[6] E. McDermott and S. Katagiri. "Shift-invariant, Multi-Category Phoneme Recognition Using Kohonen's LVQ2." Proc. of ICASSP, S3.1, 1989.
[7] F. Fogelman-Soulié. "Méthodes connexionnistes pour l'apprentissage." 2èmes Journées du PRC-GRECO Intelligence Artificielle, Toulouse, Teknea, 275-293, 1988.
[8] F. Fogelman-Soulié, P. Gallinari, Y. Le Cun, S. Thiria. "Network learning." In "Machine Learning", Y. Kodratoff, R. Michalski Eds., Morgan Kaufmann, to appear.
[9] S. Furui. "Cepstral Analysis technique for automatic speaker verification." IEEE Trans. on ASSP, Vol. 29, No. 2, April 1981.
[10] P. Gallinari, S. Thiria, F. Fogelman-Soulié. "Multilayer Perceptrons and Data Analysis." Second Annual International Conference on Neural Networks, San Diego, I-391-401, 1988.
[11] Y. Grenier. "Identification du locuteur et Adaptation au locuteur d'un système de Reconnaissance Phonémique." Thèse de Docteur-Ingénieur, ENST-E-77005, Paris, 1977.
[12] Lachenbruch & Mickey. "Estimation of Error Rates in Discriminant Analysis." Technometrics, 1-11, 1968.
[13] T. Kohonen. "Self-Organization and Associative Memory" (2nd Ed.), Springer, Berlin-Heidelberg-New York-Tokyo, 1988.
[14] T. Kohonen. "The neural phonetic typewriter." IEEE Computer, 11-22, March 1988.
[15] T. Kohonen, G. Barna and R. Chrisley. "Statistical Pattern Recognition with neural networks: Benchmarking Studies." IEEE Proc. of ICNN, Vol. I, 61-68, July 1988.
[16] F.K. Soong, A.E. Rosenberg. "On the use of instantaneous and transitional spectral information in speaker recognition." IEEE Trans. on ASSP, Vol. 36, No. 6, June 1988.
[17] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. Lang. "Phoneme Recognition: Neural Networks vs. Hidden Markov Models." Proc. of ICASSP, S3.3, 107-110, April 1988.