Score Information Decision Fusion Using Support Vector Machine for a Correlation Filter Based Speaker Authentication System

Dzati Athiar Ramli, Salina Abdul Samad, and Aini Hussain
Department of Electrical, Electronic and Systems Engineering, Faculty of Engineering, Universiti Kebangsaan Malaysia, 43600 Bangi Selangor, Malaysia
[email protected], [email protected], [email protected]

Abstract. In this paper, we propose a novel decision fusion scheme that fuses score information from multiple correlation filter outputs of a speaker authentication system. A correlation filter classifier is designed to yield a sharp peak in the correlation output for an authentic person, while no peak is perceived for an imposter. By appending the scores from multiple correlation filter outputs into a feature vector, a Support Vector Machine (SVM) is then executed for the decision process. In this study, cepstrumgraphic and spectrographic images are implemented as features to the system, and Unconstrained Minimum Average Correlation Energy (UMACE) filters are used as classifiers. The first objective of this study is to develop a multiple score decision fusion system using SVM for speaker authentication. Secondly, the performance of the proposed system using both features is evaluated and compared. The Digit Database is used for performance evaluation, and an improvement is observed after implementing multiple score decision fusion, which demonstrates the advantages of the scheme.

Keywords: Correlation Filters, Decision Fusion, Support Vector Machine, Speaker Authentication.

1 Introduction

Biometric speaker authentication is used to verify a person's claimed identity. During the authentication process, the system compares the claimant's speech with the client model [1]. The development of a client model database can be a complicated procedure due to voice variations.
These variations occur when the condition of the vocal tract is affected by internal problems such as a cold or dry mouth, and also by external factors such as temperature and humidity. The performance of a speaker authentication system is also affected by room and line noise, changes of recording equipment and uncooperative claimants [2], [3]. Thus, a biometric system has to correctly discriminate the biometric features of one individual from another, and at the same time it needs to handle the distortions in the features due to the problems stated. In order to overcome these limitations, we improve the performance of speaker authentication systems by extracting more information (samples) from the claimant and then executing fusion techniques in the decision process.

E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 235–242, 2009. © Springer-Verlag Berlin Heidelberg 2009

So far, many fusion techniques in the literature have been implemented in biometric systems for the purpose of enhancing system performance. These include the fusion of multiple modalities, multiple classifiers and multiple samples [4]. Teoh et al. in [5] proposed a combination of features of the face modality and the speech modality so as to improve the accuracy of biometric authentication systems. Person identification based on visual and acoustic features has also been reported by Brunelli and Falavigna in [6]. Suutala and Roning in [7] used Learning Vector Quantization (LVQ) and Multilayer Perceptron (MLP) classifiers for footstep-profile based person identification, whereas in [8] Kittler et al. utilized Neural Networks and a Hidden Markov Model (HMM) for a handwritten digit recognition task. The implementation of the multiple-sample fusion approach can be found in [4] and [9].
In general, these studies revealed that implementing fusion approaches in biometric systems can improve system performance significantly. This paper focuses on the fusion of score information from multiple correlation filter outputs for a correlation filter based speaker authentication system. Here, we use scores extracted from the correlation outputs by considering several samples extracted from the same modality as independent samples. The scores are concatenated to form a feature vector, and a Support Vector Machine (SVM) is then executed to classify the feature vector into either the authentic or the imposter class. Correlation filters have been effectively applied in biometric systems for visual applications such as face verification and fingerprint verification, as reported in [10], [11]. Lower face verification and lip movement for person identification using correlation filters have been implemented in [12] and [13], respectively. A study of using correlation filters in speaker verification with speech signals as features can be found in [14]. The advantages of correlation filters are shift invariance, the ability to trade off between discrimination and distortion tolerance, and having a closed-form expression.

2 Methodology

The database used in this study is obtained from the Audio-Visual Digit Database (2001) [15]. The database consists of video and corresponding audio of people reciting digits zero to nine. The video of each person is stored as a sequence of JPEG images with a resolution of 512 x 384 pixels, while the corresponding audio is provided as a monophonic, 16 bit, 32 kHz WAV file.

2.1 Spectrographic Features

A spectrogram is an image representing the time-varying spectrum of a signal. The vertical axis (y) shows frequency, the horizontal axis (x) represents time, and the pixel intensity or color represents the amount of energy (acoustic peaks) in frequency band y at time x [16], [17].
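Such a spectrogram image can be computed with a short-time Fourier transform. The following is a minimal Python sketch, not the authors' implementation: the 20 ms Hamming window and 50% overlap follow the settings described in this section, while the pure 440 Hz tone standing in for a real utterance and the -30 dB retention threshold are illustrative assumptions.

```python
# Minimal spectrogram sketch; a pure tone stands in for real speech.
import numpy as np
from scipy import signal

fs = 32000                                  # audio rate of the database WAV files
t = np.arange(0, 0.5, 1 / fs)               # 0.5 s of signal
speech = np.sin(2 * np.pi * 440 * t)        # stand-in for an utterance

nperseg = int(0.020 * fs)                   # 20 ms Hamming window, 50% overlap
f, frames, Sxx = signal.spectrogram(
    speech, fs=fs, window="hamming",
    nperseg=nperseg, noverlap=nperseg // 2)

log_energy = 10 * np.log10(Sxx + 1e-12)     # pixel intensity = log energy

# Retain only the high-energy acoustic peaks; the -30 dB cut-off below
# the maximum is an assumed threshold, other bins are set to zero.
mask = log_energy > log_energy.max() - 30
spec_img = np.where(mask, log_energy, 0.0)
```

The resulting `spec_img` array is the thresholded time-frequency image that would then be cleaned up morphologically and resized before filter synthesis.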
Fig. 1 shows samples of the spectrogram of the word 'zero' from person 3 and person 4 obtained from the database. From the figure, it can be seen that the spectrogram image contains personal information in terms of the way the speaker utters the word, such as speed and pitch, that is shown by the spectrum.

Fig. 1. Examples of the spectrogram image (frequency versus time) from person 3 and person 4 for the word 'zero'

Comparing both figures, it can be observed that although the spectrogram image holds inter-class variations, it also comprises intra-class variations. In order for the images to be successfully classified by correlation filters, we propose a novel feature extraction technique. The computation of the spectrogram is described below.

a. Pre-emphasis task. By using a high-pass filter, the speech signal is filtered using the following equation:

    x(t) = s(t) - 0.95 s(t - 1)    (1)

where x(t) is the filtered signal, s(t) is the input signal and t represents time.

b. Framing and windowing task. A Hamming window of 20 ms length with 50% overlap is applied to the signal.

c. Specification of FFT length. A 256-point FFT is used; this value determines the frequencies at which the discrete-time Fourier transform is computed.

d. The logarithm of the energy (acoustic peak) of each frequency bin is then computed.

e. Retaining the high energies. After a spectrogram image is obtained, we aim to eliminate the small blobs in the image which impose the intra-class variations. This is achieved by retaining the high energies of the acoustic peaks by setting an appropriate threshold. Here, the FFT magnitudes which are above a certain threshold are maintained; otherwise they are set to zero.

f. Morphological opening and closing.
The morphological opening process is used to clear up the residual noisy spots in the image, whereas morphological closing is used to recover the original shape of the image affected by the morphological opening process.

2.2 Cepstrumgraphic Features

Linear Predictive Coding (LPC) is used for the acoustic measurement of speech signals. This parametric modeling approach closely matches the resonant structure of the human vocal tract that produces the corresponding sounds [17]. The computation of the cepstrumgraphic features is described below.

a. Pre-emphasis task. By using a high-pass filter, the speech signal is filtered using equation (1).

b. Framing and windowing task. A Hamming window of 20 ms length with 50% overlap is applied to the signal.

c. Specification of FFT length. A 256-point FFT is used; this value determines the frequencies at which the discrete-time Fourier transform is computed.

d. Auto-correlation task. For each frame, a vector of LPC coefficients is computed from the autocorrelation vector using the Durbin recursion method. The LPC-derived cepstral coefficients (cepstrum) are then derived, leading to 14 coefficients per vector.

e. Resizing task. The feature vectors are then downsampled to a size of 64x64 in order to be verified by UMACE filters.

2.3 Correlation Filter Classifier

Unconstrained Minimum Average Correlation Energy (UMACE) filters, which evolved from the Matched Filter, are synthesized in the Fourier domain using a closed-form solution. Several training images are used to synthesize a filter template. The designed filter is then cross-correlated with the test image in order to determine whether the test image is from the authentic class or the imposter class. In this process, the filter optimizes a criterion to produce a desired correlation output plane by minimizing the average correlation energy while maximizing the correlation output at the origin [10], [11].
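Because the UMACE solution is diagonal, filter synthesis and scoring reduce to elementwise operations in the Fourier domain. The following numpy sketch is an illustration under stated assumptions, not the authors' code: random 32x32 arrays stand in for feature images, the small regularizer added to the power spectrum is assumed to keep the division well defined, and the sidelobe-region bookkeeping is one plausible reading of the PSR definition given below.

```python
# Sketch of UMACE synthesis (U = D^-1 m) and PSR scoring on toy data.
import numpy as np

rng = np.random.default_rng(0)
train = rng.standard_normal((6, 32, 32))    # training images for one filter

X = np.fft.fft2(train)                      # Fourier transforms of training images
D = np.mean(np.abs(X) ** 2, axis=0)         # average power spectrum (diagonal of D)
m = np.mean(X, axis=0)                      # mean Fourier transform
U = m / (D + 1e-12)                         # UMACE filter; 1e-12 is an assumed regularizer

def psr(test_img, filt):
    """Cross-correlate a test image with the filter and return the PSR."""
    corr = np.real(np.fft.ifft2(np.fft.fft2(test_img) * np.conj(filt)))
    corr = np.fft.fftshift(corr)            # move the origin to the center
    r, c = np.unravel_index(np.argmax(corr), corr.shape)
    peak = corr[r, c]
    # 20x20 sidelobe region around the peak, excluding a 5x5 central mask
    region = corr[max(r - 10, 0):r + 10, max(c - 10, 0):c + 10].copy()
    rr, cc = r - max(r - 10, 0), c - max(c - 10, 0)
    region[max(rr - 2, 0):rr + 3, max(cc - 2, 0):cc + 3] = np.nan
    sidelobe = region[~np.isnan(region)]
    return (peak - sidelobe.mean()) / sidelobe.std()

# A training image should score a far higher PSR against its own filter
# than an unrelated image does.
auth_psr = psr(train[0], U)
imposter_psr = psr(rng.standard_normal((32, 32)), U)
```

In the full system, one such filter is synthesized per word and per person, and the PSR values from the per-word filters form the score vector passed to the SVM.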
The UMACE filter optimization can be summarized as follows:

    U_mace = D^(-1) m    (2)

where D is a diagonal matrix with the average power spectrum of the training images placed along its diagonal, and m is a column vector containing the mean of the Fourier transforms of the training images. The resulting correlation plane produces a sharp peak at the origin, with values everywhere else close to zero, when the test image belongs to the same class as the designed filter [10], [11]. Fig. 2 shows the correlation outputs when using a UMACE filter on a test image from the authentic class (left) and the imposter class (right).

Fig. 2. Examples of the correlation plane for the test image from the authentic class (left) and imposter class (right)

The Peak-to-Sidelobe Ratio (PSR) metric is used to measure the sharpness of the peak. The PSR is given by

    PSR = (peak - mean) / σ    (3)

Here, the peak is the largest value in the correlation output of the test image. The mean and standard deviation σ are calculated from a 20x20 sidelobe region that excludes a 5x5 central mask [10], [11].

2.4 Support Vector Machine

The support vector machine (SVM) classifier in its simplest form, the linear and separable case, is the optimal hyperplane that maximizes the distance of the separating hyperplane from the closest training data points, called the support vectors [18], [19]. From [18], the solution of the linearly separable case is given as follows. Consider the problem of separating a set of training vectors belonging to two classes,

    D = {(x_1, y_1), ..., (x_L, y_L)}, x ∈ R^n, y ∈ {-1, 1}    (4)

with a hyperplane,

    <w, x> + b = 0    (5)

The hyperplane that optimally separates the data is the one that minimizes

    φ(w) = (1/2) ||w||^2    (6)

which is equivalent to minimizing an upper bound on the VC dimension.
The solution to this optimization problem is given by the saddle point of the Lagrange functional (Lagrangian)

    φ(w, b, α) = (1/2) ||w||^2 - Σ_{i=1..L} α_i ( y_i [<w, x_i> + b] - 1 )    (7)

where the α_i are the Lagrange multipliers. The Lagrangian has to be minimized with respect to w and b, and maximized with respect to α ≥ 0. Equation (7) is then transformed to its dual problem. Hence, the solution of the linearly separable case is given by

    α* = arg min_α (1/2) Σ_{i=1..L} Σ_{j=1..L} α_i α_j y_i y_j <x_i, x_j> - Σ_{k=1..L} α_k    (8)

with constraints

    α_i ≥ 0, i = 1, ..., L   and   Σ_{j=1..L} α_j y_j = 0    (9)

Subsequently, consider the SVM in the non-linear and non-separable case. The non-separable case is handled by adding an upper bound to the Lagrange multipliers, and the non-linear case by replacing the inner product with a kernel function. From [18], the solution of the non-linear and non-separable case is given as

    α* = arg min_α (1/2) Σ_{i=1..L} Σ_{j=1..L} α_i α_j y_i y_j K(x_i, x_j) - Σ_{k=1..L} α_k    (10)

with constraints

    0 ≤ α_i ≤ C, i = 1, ..., L   and   Σ_{j=1..L} α_j y_j = 0    (11)

Non-linear mappings (kernel functions) that can be employed are polynomials, radial basis functions and certain sigmoid functions.

3 Results and Discussion

Assume that N streams of testing data are extracted from M utterances. Let s = {s_1, s_2, ..., s_N} be a pool of scores from each utterance. The proposed verification system is shown in Fig. 3: for each of the n word groups (zero to nine), a filter is designed from m training images (a_11 ... a_m1), ..., (a_1n ... a_mn); each test image b_1, ..., b_n is cross-correlated (via FFT and IFFT) with the corresponding filter to yield a correlation output and a PSR score psr_1, ..., psr_n, and the score vector is passed to a support vector machine with a polynomial kernel for the decision.

Fig. 3.
Verification process using spectrographic / cepstrumgraphic images

For the spectrographic features, we use 250 filters, one for each word of the 25 persons. Our spectrographic image database consists of 10 groups of spectrographic images (zero to nine) for 25 persons, with 46 images per group of size 32x32 pixels, thus 11500 images in total. For each filter, we used 6 training images for the synthesis of a UMACE filter; these six training images were chosen based on the largest variations among the images. Then, 40 images are used in the testing process. In the testing stage, we cross-correlated each corresponding word with 40 authentic images and another 40x24 = 960 imposter images from the other 24 persons.

For the cepstrumgraphic features, we also have 250 filters, one for each word of the 25 persons. Our cepstrumgraphic image database consists of 10 groups of cepstrumgraphic images (zero to nine) for 25 persons, with 43 images per group of size 64x64 pixels, thus 10750 images in total. For each filter, we used 3 training images for the synthesis of the UMACE filter, and 40 images are used for the testing process. We cross-correlated each corresponding word with 40 authentic images and another 40x24 = 960 imposter images from the other 24 persons.

For both cases, a polynomial kernel was employed in the decision fusion procedure using SVM. Table 1 below compares the performance of single score decision and multiple score decision fusion for both spectrographic and cepstrumgraphic features. The false acceptance rate (FAR) and false rejection rate (FRR) of multiple score decision fusion are given in Table 2.

Table 1. Performance of single score decision and multiple score decision fusion

    features           single score    multiple score
    spectrographic     92.75%          96.04%
    cepstrumgraphic    90.67%          95.09%

Table 2.
FAR and FRR percentages of multiple score decision fusion

    features           FAR      FRR
    spectrographic     3.23%    3.99%
    cepstrumgraphic    5%       4.91%

4 Conclusion

The multiple score decision fusion approach using a support vector machine has been developed in order to enhance the performance of a correlation filter based speaker authentication system. Spectrographic and cepstrumgraphic images are employed as features, and UMACE filters are used as classifiers in the system. By implementing the proposed decision fusion, the error due to the variation of the data can be reduced, hence further enhancing the performance of the system. The experimental results are promising, and the approach can be an alternative method for biometric authentication systems.

Acknowledgements. This research is supported by the Fundamental Research Grant Scheme, Malaysian Ministry of Higher Education, FRGS UKM-KK-02-FRGS0036-2006, and the Science Fund, Malaysian Ministry of Science, Technology and Innovation, 01-01-02-SF0374.

References

1. Campbell, J.P.: Speaker Recognition: A Tutorial. Proceedings of the IEEE 85, 1437–1462 (1997)
2. Rosenberg, A.: Automatic speaker verification: A review. Proceedings of the IEEE 64(4), 475–487 (1976)
3. Reynolds, D.A.: An Overview of Automatic Speaker Recognition Technology. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing 4, 4065–4072 (2002)
4. Poh, N., Bengio, S., Korczak, J.: A multi-sample multi-source model for biometric authentication. In: 10th IEEE Workshop on Neural Networks for Signal Processing, pp. 375–384 (2002)
5. Teoh, A., Samad, S.A., Hussein, A.: Nearest Neighbourhood Classifiers in a Bimodal Biometric Verification System Fusion Decision Scheme. Journal of Research and Practice in Information Technology 36(1), 47–62 (2004)
6. Brunelli, R., Falavigna, D.: Person Identification Using Multiple Cues. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(10), 955–966 (1995)
7.
Suutala, J., Roning, J.: Combining Classifiers with Different Footstep Feature Sets and Multiple Samples for Person Identification. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 357–360 (2005)
8. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998)
9. Cheung, M.C., Mak, M.W., Kung, S.Y.: Multi-Sample Data-Dependent Fusion of Sorted Score Sequences for Biometric Verification. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2004), pp. 229–232 (2004)
10. Savvides, M., Vijaya Kumar, B.V.K., Khosla, P.: Face Verification Using Correlation Filters. In: 3rd IEEE Automatic Identification Advanced Technologies, pp. 56–61 (2002)
11. Venkataramani, K., Vijaya Kumar, B.V.K.: Fingerprint Verification Using Correlation Filters. In: AVBPA, pp. 886–894 (2003)
12. Samad, S.A., Ramli, D.A., Hussain, A.: Lower Face Verification Centered on Lips Using Correlation Filters. Information Technology Journal 6(8), 1146–1151 (2007)
13. Samad, S.A., Ramli, D.A., Hussain, A.: Person Identification Using Lip Motion Sequence. In: Apolloni, B., Howlett, R.J., Jain, L. (eds.) KES 2007, Part I. LNCS (LNAI), vol. 4692, pp. 839–846. Springer, Heidelberg (2007)
14. Samad, S.A., Ramli, D.A., Hussain, A.: A Multi-Sample Single-Source Model Using Spectrographic Features for Biometric Authentication. In: IEEE International Conference on Information, Communications and Signal Processing, CD-ROM (2007)
15. Sanderson, C., Paliwal, K.K.: Noise Compensation in a Multi-Modal Verification System. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 157–160 (2001)
16. Spectrogram, http://cslu.cse.ogi.edu/tutordemo/spectrogramReading/spectrogram.html
17. Klevents, R.L., Rodman, R.D.: Voice Recognition: Background of Voice Recognition, London (1997)
18.
Gunn, S.R.: Support Vector Machines for Classification and Regression. Technical Report, University of Southampton (2005)
19. Wan, V., Campbell, W.M.: Support Vector Machines for Speaker Verification and Identification. In: Proceedings of Neural Networks for Signal Processing, pp. 775–784 (2000)