Speaker Identification Using Second Order Complex Group Delay Functions

1 G. Latha Sree, 2 Dr. A. Subbarami Reddy
1 PG scholar, Dept of ECE (DECS), SKIT, JNTUA, Srikalahasti, AP, India, E-mail: [email protected]
2 Dept of ECE, SKIT, JNTUA, Srikalahasti, AP, India, E-mail: [email protected]

Abstract
In this paper, a text-independent speaker identification system is introduced. It consists of two steps: feature extraction and feature matching. Second order group delay functions are used for feature extraction and vector quantization is used for feature matching. The main idea is to derive cepstrum-like features from second order group delay functions instead of deriving them from the power spectrum. The second order group delay function is computed from the phase of the Fourier transform of the signal. For feature extraction, we propose a set of features obtained from spectrum estimation using the second order group delay. These features are more robust to channel variations than features based on Mel spectral coefficients. The extracted speech features of each speaker are quantized to a number of centroids using the K-means algorithm. These centroids represent the codebook of that speaker. The speaker is identified by calculating the minimum quantization distance between the centroids of each speaker obtained in the training phase and the test feature vectors of each individual speaker in the testing phase. The code is developed in Matlab and performs the identification successfully.

Keywords-- Group delay; second order group delay; k-means; speaker identification.

I. INTRODUCTION
Speech is the most usual and natural way of communication. Unlike other forms of identification, such as passwords or keys, speech is a non-intrusive biometric. In a speaker identification system, an unknown speaker is compared with a database of known speakers, and the best matched speaker is identified. The system relies on the speaker-related information contained in the speech waveform. Previous systems used traditional amplitude-related approaches; in this paper, we use a new approach based on phase-related group delay functions and clustering algorithms.
The process of speaker identification has two modes [11]:
* Training or Enrolment Mode
* Testing or Identification Mode
In the training mode, speakers with known identity are enrolled into the system's database. In the recognition mode, a speech sample from an unknown speaker, who is one of the trained speakers, is given as input and the code makes a decision about the identity of the speaker.

II. BLOCK DIAGRAM
A speaker identification system consists of two main phases. In the first phase, speaker enrolment, speech samples of all speakers are collected and used to train the system. This collection of enrolled speech samples is also called the speaker database. In the second phase, the identification phase, a test sample from an unknown speaker is compared with the speaker database. Both phases share the same first step, feature extraction, which extracts speaker-dependent characteristics from the speech. This step reduces the amount of test data while retaining the speaker-discriminative information. In the enrolment phase, these features are then modelled and stored in the database. In the identification phase, the extracted features are compared with the models stored in the speaker database, and the final decision about the speaker identity is made from these comparisons. This process [19] is represented in Figure 1.

Figure 1: Block diagram of Speaker identification system (blocks: speech input, feature extraction, speaker modelling, speaker database, pattern matching, decision logic, speaker id)

III. PRE-PROCESSING
In order to enhance the efficiency of the extraction process, speech signals are first pre-processed before extracting features.
This pre-processing consists of digital filtering and detection of the speech signal. Filtering includes pre-emphasis filtering and the removal of noise using several algorithms.

A. Pre-emphasis
Pre-emphasis is a technique used in speech processing to enhance the high frequencies of the speech signal. Two major factors make pre-emphasis necessary. Firstly, the speech signal usually contains more speaker-specific information in the higher frequencies than in the lower frequencies. Secondly, pre-emphasis removes the glottal effects from the vocal tract parameters. When a speech signal is recorded with a microphone from a certain distance, its spectrum has a downward slope of approximately -6 dB/octave compared with the true spectrum. Pre-emphasis flattens the spectrum, so that the formants have similar heights. The digitized speech waveform also suffers from additive noise; pre-emphasis reduces the dynamic range of the signal and is performed with an FIR high-pass filter. With x[n] as the input in the time domain and 0.9 ≤ α ≤ 1.0, the filter equation can be written as [11]

$$y[n] = x[n] - \alpha\, x[n-1]$$

Equivalently, it is implemented as a first-order Finite Impulse Response (FIR) filter defined as

$$H(z) = 1 - \alpha z^{-1}$$

Generally α is chosen between 0.9 and 0.95. We used α = 0.95.

B. Framing
Framing converts the continuous speech signal into short frames of the desired length, over which the signal can be treated as approximately stationary. The signal is divided into frames of N samples, with adjacent frames separated by M samples such that M < N. The first frame consists of the first N samples; the second frame begins M samples after the first frame and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps the first frame by N - 2M samples. This overlapping continues until all the speech samples are accounted for within one or more frames [11]. The typical values N = 256 and M = N/2 = 128 are used in this paper.

C. Windowing
The signal obtained after framing has discontinuities at the beginning and end of each frame. To minimize these discontinuities, windowing is used. Several windows are available for this purpose, including the Rectangular window, Triangular window, Hanning window, Hamming window, etc. If w(n) represents a window for 0 ≤ n ≤ N-1, where N is the number of samples in each frame, then the result of windowing the signal is

$$y(n) = x(n)\, w(n)$$

Here the Hamming window is used for the windowing process, which has the equation [11]

$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$

IV. GROUP DELAY
The group delay is defined as the negative derivative of the phase of the Fourier transform of a signal [1,2-3,4]. For a minimum phase signal, the group delay computed from the magnitude spectrum of the Fourier transform is equal to that computed from the phase spectrum [5,6]. Computing the group delay function of a real signal is difficult for several reasons. The most important one is the wrapping of the phase function: the phase function of a discrete time signal has discontinuities at multiples of $2\pi$. This problem can be overcome by computing the group delay function directly from the signal x(n) as follows [1,7,8]:

$$X(\omega) = \sum_{n} x(n)\, e^{-j\omega n} \tag{1}$$

$X(\omega)$ can also be expressed as

$$X(\omega) = |X(\omega)|\, e^{j\phi(\omega)}$$

Here

$$X(\omega) = X_R(\omega) + jX_I(\omega)$$

$$\phi(\omega) = \tan^{-1}\!\left(\frac{X_I(\omega)}{X_R(\omega)}\right) \tag{2}$$

Group delay is defined as [1,9,5-8]

$$\tau(\omega) = -\frac{d\phi(\omega)}{d\omega} \tag{3}$$

In order to avoid unwrapping, another method is used to calculate the group delay directly:

$$\log X(\omega) = \log(|X(\omega)|) + j\phi(\omega) \tag{4}$$

The group delay can then be written as

$$\tau(\omega) = -\operatorname{Im}\!\left[\frac{d \log X(\omega)}{d\omega}\right] \tag{5}$$

The group delay $\tau(\omega)$ can also be computed from (2) and (3) as

$$\tau(\omega) = -\frac{X_R(\omega)\,\dfrac{dX_I(\omega)}{d\omega} - X_I(\omega)\,\dfrac{dX_R(\omega)}{d\omega}}{|X(\omega)|^2} \tag{6}$$

where the subscripts R and I represent the real and imaginary parts.
As differentiation can only be approximated in the discrete-time domain, the following Fourier transform property is used:

$$\frac{dX(\omega)}{d\omega} = -j\,F\{n\,x(n)\} \tag{7}$$

where F denotes the Fourier transform. Separating the real and imaginary parts, we get

$$\frac{dX_R(\omega)}{d\omega} + j\,\frac{dX_I(\omega)}{d\omega} = F\{n\,x(n)\}_I - j\,F\{n\,x(n)\}_R \tag{8}$$

Using the above expression, the group delay in (6) can be rewritten [1,9] as

$$\tau(\omega) = \frac{X_R(\omega)\,F\{n\,x(n)\}_R + X_I(\omega)\,F\{n\,x(n)\}_I}{|X(\omega)|^2} \tag{9}$$

If $Y(\omega) = Y_R(\omega) + jY_I(\omega)$ is the Fourier transform of $n\,x(n)$, where the subscripts R and I denote the real and imaginary parts, we can rewrite equation (9) as

$$\tau(\omega) = \frac{X_R(\omega)\,Y_R(\omega) + X_I(\omega)\,Y_I(\omega)}{|X(\omega)|^2} \tag{10}$$

Complex Group Delay Functions
Assume an N-sample system function with n as the time domain index, and let $X(\omega)$ be its discrete time Fourier transform (DTFT), written in polar form as below [3]:

$$X(\omega) = |X(\omega)|\, e^{j\phi(\omega)}$$

In the above equation, $|X(\omega)|$ is the magnitude response of the filter, $\phi(\omega)$ is the phase response and $\omega$ is the frequency measured in radians per sample. Taking the derivative of $X(\omega)$ with respect to $\omega$, we have

$$\frac{dX(\omega)}{d\omega} = \frac{d|X(\omega)|}{d\omega}\, e^{j\phi(\omega)} + j\,|X(\omega)|\, e^{j\phi(\omega)}\,\frac{d\phi(\omega)}{d\omega}$$

Dividing this equation by $X(\omega)$ and multiplying both sides by j yields

$$j\,\frac{dX(\omega)/d\omega}{X(\omega)} = -\frac{d\phi(\omega)}{d\omega} + j\,\frac{1}{|X(\omega)|}\,\frac{d|X(\omega)|}{d\omega}$$

Here the group delay appears as a complex quantity, of which the real part is the group delay obtained from the traditional definition. The second term in the above equation appears as the imaginary part of the group delay $\tau(\omega)$; the dimensions of $\tau_{I1}(\omega)$ are the same as those of $\tau_{R1}(\omega)$. This equation is therefore a more general expression for the computation of the group delay. In terms of X and Y, the two parts are

$$\tau_{R1}(\omega) = \operatorname{Re}\!\left[j\,\frac{dX(\omega)/d\omega}{X(\omega)}\right] = \frac{X_R Y_R + X_I Y_I}{|X(\omega)|^2}$$

and

$$\tau_{I1}(\omega) = \operatorname{Im}\!\left[j\,\frac{dX(\omega)/d\omega}{X(\omega)}\right] = \frac{X_R Y_I - X_I Y_R}{|X(\omega)|^2}$$
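The computation described in Sections III and IV can be illustrated with a minimal sketch. It is written in Python using only the standard library, whereas the paper's own implementation is in Matlab, so every function name here (`pre_emphasis`, `hamming`, `dft`, `complex_group_delay`) is hypothetical. The sketch assumes a naive zero-padded DFT and a small floor on $|X(k)|^2$ to avoid division by zero; neither choice comes from the paper.

```python
import cmath
import math


def pre_emphasis(x, alpha=0.95):
    """y[n] = x[n] - alpha * x[n-1]; x[-1] is assumed to be zero."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def hamming(N):
    """Hamming window w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def dft(x, K):
    """Naive K-point DFT of x (implicitly zero-padded to K samples)."""
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / K) for n in range(len(x)))
        for k in range(K)
    ]


def complex_group_delay(x, K, floor=1e-12):
    """Real and imaginary parts of the complex group delay at K frequencies.

    Real part:      (X_R Y_R + X_I Y_I) / |X|^2   (equation (10))
    Imaginary part: (X_R Y_I - X_I Y_R) / |X|^2
    where X = DFT{x(n)} and Y = DFT{n x(n)}.
    """
    X = dft(x, K)
    Y = dft([n * v for n, v in enumerate(x)], K)
    tau_r, tau_i = [], []
    for Xk, Yk in zip(X, Y):
        power = max(abs(Xk) ** 2, floor)  # assumed guard against |X| = 0
        tau_r.append((Xk.real * Yk.real + Xk.imag * Yk.imag) / power)
        tau_i.append((Xk.real * Yk.imag - Xk.imag * Yk.real) / power)
    return tau_r, tau_i


if __name__ == "__main__":
    # One short synthetic frame: pre-emphasize, window, then compute group delay.
    N = 32
    frame = [math.exp(-0.05 * n) * math.sin(0.6 * n) for n in range(N)]
    emphasized = pre_emphasis(frame)
    windowed = [s * w for s, w in zip(emphasized, hamming(N))]
    tau_r, tau_i = complex_group_delay(windowed, 2 * N)
    print(len(tau_r), len(tau_i))
```

A quick sanity check of the formula: a unit impulse delayed by three samples has a group delay of exactly three samples at every frequency and a zero imaginary part, which is what the function returns for the input `[0, 0, 0, 1]`.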
Formulation of Complex II-order Group Delay
The proposed method is formulated by taking the derivative of the complex group delay with respect to $\omega$. After some mathematical manipulation, the result $d\tau(\omega)/d\omega$, the Complex II order Group Delay, is expressed in terms of the following quantities:
XR = Real part of fft(x(n))
XI = Imaginary part of fft(x(n))
YR = Real part of fft(n x(n))
YI = Imaginary part of fft(n x(n))
ZR = Real part of fft(n^2 x(n))
ZI = Imaginary part of fft(n^2 x(n))

V. FEATURE EXTRACTION
The steps involved in computing the new features are shown in the form of a flow diagram [12] in Figure 2.

Figure 2: Steps involved in computing feature vectors (speech signal → difference of samples → Hamming window → DFT → magnitude → IDFT → Hamming window → second order group delay → absolute value of 2nd order group delay → IDFT of 2nd order group delay → feature vectors)

As in other signal processing techniques, the speech signal s(n) is pre-emphasized to remove the dc component and to spectrally flatten the signal before feature extraction. The conventional pre-emphasis is done by a difference operation [4]. This operation can be performed on each frame of the signal and then, as shown in the figure, each frame of speech is multiplied by a Hamming window h(n), where m is the frame number and N is the number of samples in each frame. The windowing operation is followed by the computation of an autocorrelation-like function r(n), derived from the magnitude of the Fourier Transform (FT) evaluated at k = 0, 1, ..., N-1, where K is the size of the DFT, typically chosen as 2N. The function r(n) is truncated to L samples, where L is 16 to 24, to smooth the finer detail and to obtain the spectral envelope. The truncated sequence is then multiplied by a tapering window such as a Hamming window to eliminate discontinuities at the ends of the sequence. The group delay spectrum of r(n) is then computed as GD[k], which represents the sampled group delay spectrum. Finally, the features are computed in a manner similar to the cepstral coefficients, as the inverse DFT of the sampled group delay spectrum. The above group delay spectrum is computed using the following algorithm.

Algorithm for computing II-order group delay functions
1. Let x(n) be the given M-point causal sequence; then compute [1, 10] y(n) = n x(n).
2. Compute the N-point (N >> M) Discrete Fourier Transforms (DFT) X(k) and Y(k) of the sequences x(n) and y(n) respectively, for k = 0, 1, ..., N-1.
3. Compute the cepstrally smoothed spectrum S(k) of |X(k)|.
4. Compute the spectrum z(k) by dividing |X(k)| by S(k).
5. Compute the modified group delay function $\tau_0(k)$.
6. Compute the derivatives of the real and imaginary parts of the group delay function, $d\tau_R/d\omega$ and $d\tau_I/d\omega$, from $X_R$, $X_I$, $Y_R$, $Y_I$, $Z_R$ and $Z_I$, with $|X(\omega)|^2 = X_R^2 + X_I^2$ in the denominator.

VI. FEATURE MATCHING AND SPEAKER RECOGNITION
The first step is to build a speaker database C = {C1, C2, ..., CN} consisting of N codebooks, one codebook for each speaker in the database. This is done by converting the input signal into a sequence of feature vectors X = {x1, x2, ..., xN}. These feature vectors are clustered into a set of M codewords C = {c1, c2, ..., cM}. This set of codewords is called the codebook of the specified speaker and can be generated using a clustering algorithm. A number of algorithms exist for codebook generation, such as the Generalized Lloyd algorithm (GLA), the K-means algorithm, Linde-Buzo-Gray (LBG), Self Organizing Maps (SOM), Pairwise Nearest Neighbour (PNN), etc. In this paper, the K-means algorithm is used since it is the simplest, most popular and easiest to implement.

Figure 3: Clusters in the K-means algorithm (number of centroids C = 2); 2D plot of acoustic vectors, 5th dimension against 6th dimension.
Figure 4: Clusters in the K-means algorithm (number of centroids C = 8); 2D plot of acoustic vectors, 5th dimension against 6th dimension.

K-means algorithm
The K-means algorithm [19] partitions the M feature vectors into C clusters, each represented by its centroid.
This method first randomly In short, the K-means algorithm performs three steps below chooses C cluster-centroids from M feature vectors. Then each feature vector will be assigned to the nearest centroid C, and then the new set of centroids are calculated for the new clusters. This assigning procedure is continued until the mean square error between the M feature vectors and the cluster-centroids C is below a certain assigned threshold. In other words, the main objective of the K-means algorithm is to minimize the total intra-cluster variance, V as Where we have k clusters Si, i = 1,2,...,k and 𝑢𝑖 is the mean point or centroid of all the points. The clusters in the Kmeans algorithm are as shown in figure. until convergence[19]: 1. Determine the coordinate of centroid. 2. Determine the each object distance to the centroids. 3. Group the objects based on the minimum distance (finding the closest centroid). A Matlab code has been generated to implement speaker identification system. This v_kmeans.m pre-defined code needs disteusq.m, and rnsubset.m, functions voicebox.m, winenvar.m files from voicebox which is a speech processing tool box in Matlab. The output of the defined system is shown in the following table in the form of distortion measure with the input signals being given in .wav format. We have 30 clear speech signals names Sp01, Sp02, Sp03….Sp30. We can observe that the diagonal element has the least vector quantization distance value in their respective row. It indicates Sp01 matches with Sp01, Sp02 matches with Sp02 and so on. Here the Codebook size Figure 5: K-means algorithm flow diagram. is 16. The distortion measures for first 10 samples of 30 Feature Matching In the recognition phase, the unknown speaker represented clear speech samples are as given in table 1. first order group delay first order group delay by the feature vectors {X1, X2,…., XT}, is compared with 1.5 the codebooks in the database for the final recognition. 
The 0.5 speaker with the lowest distortion is chosen as best matched 0.5 0 0 50 100 150 200 250 300 1 1 0.5 0 50 100 150 50 200 250 300 0 50 Group delay of ‘sp01’ One way to calculate the distortion measure is the sum of centroid is to use the average of the Euclidean distances. distance, weighted Euclidean and Mahalanobis. We used 300 100 150 200 250 300 1 0.5 0.5 0 0 50 100 150 200 250 300 1 1 0.5 0 Euclidean distance here. 50 100 150 200 50 100 150 200 250 300 250 300 1.5 0.5 0 0 second order group delay second order group delay 0 250 1.5 1.5 The best-known distance measures as far are Euclidean 200 first order group delay first order group delay 1.5 0 150 Group delay of ‘sp02’ 1 squared distances between vector and its representative 100 1.5 0.5 0 0 second order group delay second order group delay 1.5 0 speaker[19]. 1 1 0 distortion measure is computed for each codebook and the 1.5 250 300 0 Group delay of ‘sp03’ 50 100 150 200 Group delay of ‘sp04’ first order group delay 1.5 first order group delay 1.5 1 1 0.5 Where 𝐶𝑚𝑖𝑛 denotes the nearest codeword 𝑥𝑡 in the 0.5 0 0 codebook 𝐶𝑖 and d( ) is the Euclidean distance. Thus, each feature vector in the sequence is compared with all the 0 50 100 150 200 250 V = (v1, v2...vn) is 150 200 given 𝑛 by 𝑖=1 The train speaker file which has the lowest distortion distance is chosen to be identified as the best match for given unknown speaker. VII. EXPERIMENTAL RESULTS 300 second order group delay 250 300 1 1 0.5 0 0 0 50 100 150 200 250 300 0 50 100 150 200 Group delay of ‘sp06’ first order group delay first order group delay 1.5 1.5 1 1 0.5 0.5 2 √(𝑢1 − 𝑣1 )2 + (𝑢2 − 𝑣2 )2 + 𝑢 − 𝑣𝑛 )2 = √∑(𝑢𝑖 − 𝑣𝑖 ) 250 second order group delay Group delay of ‘sp05’ The Euclidean distance between two points U = (u1, and 100 1.5 average distance is selected to be the best match. 
u2…un) 50 1.5 0.5 codebooks available, and the codebook with the minimum 0 300 0 0 0 50 100 150 200 250 300 1 1 0.5 0 50 100 150 200 250 Group delay of ‘sp07’ delay of ‘sp08’ 50 100 150 200 250 300 250 300 1.5 0.5 0 0 second order group delay second order group delay 1.5 300 0 0 50 100 150 200 Group Group delay of ‘sp19’ first order group delay 1.5 first order group delay 1.5 Group delay of ‘sp20’ 1 1 0.5 0.5 0 0 0 50 100 150 200 250 300 first order group delay 0 50 100 150 200 250 300 first order group delay 1.5 1.5 second order group delay second order group delay 1 1.5 1.5 1 1 0 0.5 0.5 0.5 0.5 1 0 0 50 100 150 200 250 300 0 0 50 100 150 200 250 300 0 50 Group delay of ‘sp09’ 100 1.5 1 1 0.5 0.5 0 50 100 150 250 300 1 1 0.5 0.5 0 0 0 50 100 150 100 150 200 250 300 200 250 300 0 50 100 150 200 250 300 first order group delay first order group delay 0 200 Group delay of ‘sp10’ 1.5 0 150 50 1.5 1.5 0 0 second order group delay second order group delay 200 250 300 Group delay of ‘sp21’ Group delay of ‘sp22’ 0 50 100 150 200 250 300 second order group delay second order group delay 1.5 1.5 1 1 0.5 0.5 first order group delay 1.5 first order group delay 1.5 1 1 0.5 0 0 50 100 150 200 250 0 300 0.5 0 50 100 150 200 250 300 0 0 50 100 150 200 250 300 0 second order group delay Group delay of ‘sp11’ 0 50 100 1.5 Group delay of ‘sp12’ 150 200 250 300 250 300 second order group delay 1.5 1 1 first order group delay 1.5 0.5 first order group delay 0.5 1.5 0 1 1 0 50 100 150 200 250 0 300 0 50 100 150 200 0.5 0.5 0 0 50 100 150 200 250 300 0 Group delay of ‘sp23’ 0 50 100 150 200 250 Group delay of ‘sp24’ 300 first order group delay second order group delay first order group delay 1.5 second order group delay 1.5 1.5 1.5 1 1 1 1 0.5 0.5 0 0 0.5 0.5 0 50 100 150 200 250 300 0 0 50 100 150 200 250 0 50 100 150 200 250 0 300 Group delay of ‘sp14’ first order group delay 1.5 100 150 200 250 300 250 300 1.5 1 1 0.5 0.5 0 first order group delay 50 second order group delay 1.5 
Group delay of ‘sp13’ 0 second order group delay 300 0 50 100 150 200 250 0 300 0 50 100 150 200 1.5 1 Group delay of ‘sp25’ 1 0.5 Group delay of ‘sp26’ 0.5 0 first order group delay 0 50 100 150 200 250 300 0 0 50 second order group delay 100 150 200 250 300 1.5 first order group delay 1.5 1 second order group delay 1.5 1.5 1 1 0.5 0.5 1 0 0.5 0 0.5 0 50 100 150 200 250 0 50 100 150 200 250 300 250 300 300 second order group delay second order group delay 0 0 50 100 150 200 250 0 300 1.5 1.5 0 50 100 150 200 250 300 1 1 Group delay of ‘sp15’ Group delay of ‘sp16’ 0 first order group delay 0.5 0.5 0 0 50 100 150 200 250 300 0 50 100 150 200 first order group delay 1.5 1.5 1 1 0.5 0.5 Group delay of ‘sp27’ Group delay of ‘sp28’ first order group delay 1.5 0 0 50 100 150 200 250 300 0 first order group delay 0 50 second order group delay 100 150 200 250 300 1.5 1 1 0.5 0.5 0 0 1.5 1 second order group delay 1.5 1 0.5 0 50 100 150 200 250 300 0.5 0 0 50 100 150 200 250 300 0 0 50 100 150 200 250 300 250 300 second order group delay 0 50 100 150 200 250 300 1.5 second order group delay 1.5 Group delay of ‘sp17’ 1 Group delay of ‘sp18’ 1 0.5 first order group delay first order group delay 1.5 1.5 1 1 0.5 0.5 0 0 0 50 100 150 200 250 300 0 0 50 1 1 0.5 0.5 0 50 100 150 200 100 150 200 250 250 300 0 50 100 150 200 250 300 0 0 50 100 150 200 Group delay of ‘sp30’ 300 second order group delay 1.5 0 Group delay of ‘sp29’ second order group delay 1.5 0 0.5 Therefore a new function called complex II-Order Group delay spectrum estimation function has been derived based 0 50 100 150 200 250 300 on the II-order derivative of the FT phase from the first order Group Delay. The above graphs explain the comparison of the results of I and II order group delay Sp01 Sp02 Sp03 Sp04 spectrum estimations. 
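The codebook training and matching of Section VI, which produce the distortion values tabulated below, can be sketched as follows. This is an illustrative Python version using only the standard library, not the Matlab/voicebox code used in the paper. The function names (`kmeans_codebook`, `average_distortion`, `identify`) are hypothetical. The stopping rule is an assumption: the loop ends when the centroids stop changing or after a fixed number of iterations, instead of using the mean square error threshold described in the text.

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kmeans_codebook(vectors, C, max_iters=50, seed=0):
    """Cluster the feature vectors into C centroids (the speaker's codebook)."""
    rng = random.Random(seed)
    # Initial centroids: C feature vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, C)]
    for _ in range(max_iters):
        # Assign every vector to its nearest centroid.
        clusters = [[] for _ in range(C)]
        for v in vectors:
            nearest = min(range(C), key=lambda i: euclidean(v, centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid as the mean of its cluster.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                dim = len(members[0])
                updated.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                updated.append(old)  # keep the centroid of an empty cluster
        if updated == centroids:  # assumed convergence test
            break
        centroids = updated
    return centroids


def average_distortion(features, codebook):
    """Average distance from each test vector to its nearest codeword."""
    total = sum(min(euclidean(x, c) for c in codebook) for x in features)
    return total / len(features)


def identify(features, database):
    """Return the speaker whose codebook gives the lowest average distortion."""
    return min(database, key=lambda name: average_distortion(features, database[name]))
```

With a `database` dictionary mapping speaker names to codebooks, `identify(test_features, database)` returns the best matched name. The distortion falls as the codebook grows and reaches zero on the training vectors once every vector has its own codeword.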
| | Sp01 | Sp02 | Sp03 | Sp04 | Sp05 | Sp06 | Sp07 | Sp08 | Sp09 | Sp10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Sp01 | 1.4015 | 3.5685 | 47.2517 | 24.3959 | 3.673 | 43.3044 | 45.4605 | 1.828 | 23.2688 | 8.8337 |
| Sp02 | 5.7877 | 1.546 | 73.6625 | 12.2542 | 8.7874 | 57.2519 | 66.8928 | 4.0014 | 11.7488 | 3.6959 |
| Sp03 | 2.515 | 1.9171 | 0.3509 | 2.0337 | 2.1988 | 2.4502 | 2.3813 | 2.997 | 1.6221 | 1.9297 |
| Sp04 | 37.7496 | 19.4954 | 198.4953 | 4.4036 | 42.4395 | 21.7943 | 120.8951 | 29.9203 | 8.2478 | 12.1787 |
| Sp05 | 9.9634 | 11.0722 | 44.2632 | 38.709 | 1.6697 | 31.7374 | 32.6545 | 9.45 | 37.0609 | 18.9741 |
| Sp06 | 108.7146 | 73.5691 | 270.8312 | 34.9402 | 110.8011 | 6.6072 | 52.1381 | 96.4685 | 30.6069 | 55.5596 |
| Sp07 | 292.5806 | 233.8699 | 552.4637 | 134.951 | 299.3711 | 52.8933 | 5.6412 | 270.5984 | 137.5742 | 183.8065 |
| Sp08 | 1.4689 | 2.8624 | 56.3202 | 19.7096 | 5.906 | 50.6259 | 49.9415 | 0.8891 | 19.0048 | 6.5949 |
| Sp09 | 30.9172 | 14.1947 | 144.7578 | 4.8229 | 36.6153 | 23.0945 | 123.3529 | 24.8221 | 1.782 | 7.4463 |
| Sp10 | 12.949 | 4.157 | 134.232 | 6.658 | 17.6606 | 39.5662 | 89.9253 | 11.1178 | 6.7061 | 2.4925 |

Table 1: Vector quantization distance for clear speech signals sp01, sp02, ..., sp10 for codebook size 16

For the codebook size C = 64, the distortion measures are as given in Table 2.

| | Sp01 | Sp02 | Sp03 | Sp04 | Sp05 | Sp06 | Sp07 | Sp08 | Sp09 | Sp10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Sp01 | 0.0477 | 2.263 | 44.211 | 21.9238 | 1.6905 | 38.8762 | 36.2556 | 0.4171 | 21.5793 | 7.6464 |
| Sp02 | 3.2468 | 0.1121 | 70.3834 | 11.1982 | 7.1873 | 55.0577 | 55.6189 | 1.8472 | 10.6776 | 2.106 |
| Sp03 | 1.3646 | 0.5392 | 0.0106 | 0.0956 | 0.3546 | 0.5221 | 0.2019 | 1.0475 | 0.3999 | 0.2589 |
| Sp04 | 33.0936 | 16.8433 | 145.9561 | 0.17 | 40.7936 | 17.8142 | 114.5024 | 27.1216 | 5.3323 | 10.695 |
| Sp05 | 5.6573 | 7.7478 | 40.7699 | 35.2992 | 0.1881 | 25.7197 | 23.3882 | 7.2705 | 34.5832 | 16.0768 |
| Sp06 | 105.1664 | 64.0917 | 265.321 | 21.7871 | 107.1294 | 1.222 | 47.1959 | 87.8631 | 22.9843 | 42.1392 |
| Sp07 | 292.5806 | 233.8699 | 552.4637 | 134.951 | 299.3711 | 52.8933 | 5.6412 | 270.5984 | 137.5742 | 183.8065 |
| Sp08 | 107.1294 | 1.222 | 47.1959 | 87.8631 | 22.9843 | 42.1392 | 45.605 | 42.5683 | 0.054 | 17.2582 |
| Sp09 | 28.2196 | 11.8204 | 140.0115 | 1.9762 | 34.8564 | 17.7642 | 114.0403 | 22.6694 | 0.1159 | 4.5542 |
| Sp10 | 10.9589 | 2.9404 | 94.7017 | 4.7994 | 16.0181 | 37.1376 | 76.7163 | 7.3841 | 4.0541 | 0.0988 |

Table 2: Vector quantization distance for clear speech signals sp01, sp02, ..., sp10 for codebook size 64

| | Sp11 | Sp12 | Sp13 | Sp14 | Sp15 | Sp16 | Sp17 | Sp18 | Sp19 | Sp20 |
|---|---|---|---|---|---|---|---|---|---|---|
| Sp11 | 0.1484 | 65.1784 | 21.5979 | 20.7737 | 49.3174 | 2.5639 | 7.425 | 10.0336 | 31.6638 | 58.4676 |
| Sp12 | 1.797 | 0.0163 | 1.0522 | 1.5982 | 1.3808 | 0.8691 | 2.3859 | 0.7867 | 2.2089 | 1.1561 |
| Sp13 | 23.4132 | 165.1833 | 0.1439 | 86.8203 | 138.6536 | 10.9179 | 54.6253 | 3.0934 | 109.393 | 9.7619 |
| Sp14 | 20.019 | 13.0474 | 18.8587 | 0.1281 | 7.2694 | 17.9446 | 4.5651 | 17.6394 | 1.6354 | 18.8146 |
| Sp15 | 6.0607 | 1.4044 | 4.6381 | 4.8876 | 0.022 | 4.15 | 7.0423 | 4.0129 | 1.3536 | 4.4477 |
| Sp16 | 3.6527 | 96.6039 | 10.3606 | 38.7142 | 79.3581 | 0.4683 | 20.2432 | 2.9393 | 55.0562 | 38.0559 |
| Sp17 | 6.1887 | 29.7645 | 40.4638 | 4.3573 | 18.8134 | 15.6141 | 0.0602 | 31.1431 | 9.6879 | 39.5601 |
| Sp18 | 11.2014 | 134.4288 | 4.4223 | 65.066 | 110.9595 | 3.1052 | 35.4422 | 0.0782 | 82.235 | 20.6682 |
| Sp19 | 12.1607 | 6.4206 | 10.7223 | 1.4321 | 3.282 | 9.985 | 9.6403 | 9.7418 | 0.1031 | 10.3455 |
| Sp20 | 60.0935 | 260.2641 | 11.5196 | 152.725 | 230.4135 | 38.2909 | 109.0684 | 20.8764 | 182.9044 | 0.2048 |

Table 3: Vector quantization distance for clear speech signals sp11, sp12, ..., sp20 for codebook size 64

| | Sp21 | Sp22 | Sp23 | Sp24 | Sp25 | Sp26 | Sp27 | Sp28 | Sp29 | Sp30 |
|---|---|---|---|---|---|---|---|---|---|---|
| Sp21 | 0.0119 | 1.8438 | 0.6474 | 1.4412 | 3.5724 | 0.8071 | 0.1006 | 1.7976 | 0.7391 | 0.3915 |
| Sp22 | 3.6261 | 0.0472 | 6.0256 | 7.8602 | 13.485 | 6.0815 | 2.4752 | 6.3189 | 6.1236 | 3.7091 |
| Sp23 | 125.3918 | 86.8163 | 0.055 | 5.1752 | 173.9393 | 13.2605 | 5.816 | 46.3378 | 10.2329 | 52.1404 |
| Sp24 | 116.5472 | 78.9724 | 2.0027 | 0.0427 | 161.6597 | 5.9124 | 1.1095 | 40.8969 | 4.2172 | 41.7361 |
| Sp25 | 0.0361 | 0.0449 | 0.0809 | 0.0982 | 0.023 | 0.0828 | 0.2068 | 0.0708 | 0.0973 | 0.1565 |
| Sp26 | 174.1992 | 126.171 | 6.9555 | 7.2877 | 233.8164 | 0.0595 | 10.3256 | 73.6223 | 16.5765 | 17.7002 |
| Sp27 | 111.134 | 77.9214 | 4.1117 | 3.9269 | 158.8772 | 10.3068 | 0.8846 | 42.577 | 4.6876 | 50.5968 |
| Sp28 | 19.1895 | 6.6693 | 24.3559 | 28.0832 | 39.469 | 24.4232 | 17.8915 | 0.1097 | 18.4084 | 21.8964 |
| Sp29 | 79.1624 | 49.2407 | 2.3483 | 3.5573 | 121.4712 | 15.7295 | 2.6265 | 20.2872 | 0.0669 | 65.0841 |
| Sp30 | 291.4751 | 231.8188 | 44.5492 | 44.067 | 378.5392 | 17.2398 | 49.7507 | 164.4803 | 66.2853 | 0.0942 |

Table 4: Vector quantization distance for clear speech signals sp21, sp22, ..., sp30 for codebook size 64
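The decision rule applied to these tables, picking the entry with the smallest distortion in each row, can be written out as a short worked example. The sketch below is in Python and uses the first three rows of Table 1 verbatim. The dictionary layout and the helper name `best_match` are assumptions made for illustration only.

```python
# First three rows of Table 1 (codebook size 16), copied from the paper.
columns = ["Sp01", "Sp02", "Sp03", "Sp04", "Sp05",
           "Sp06", "Sp07", "Sp08", "Sp09", "Sp10"]
table1 = {
    "Sp01": [1.4015, 3.5685, 47.2517, 24.3959, 3.673,
             43.3044, 45.4605, 1.828, 23.2688, 8.8337],
    "Sp02": [5.7877, 1.546, 73.6625, 12.2542, 8.7874,
             57.2519, 66.8928, 4.0014, 11.7488, 3.6959],
    "Sp03": [2.515, 1.9171, 0.3509, 2.0337, 2.1988,
             2.4502, 2.3813, 2.997, 1.6221, 1.9297],
}


def best_match(distances):
    """Return the column label with the minimum VQ distance in one row."""
    index = min(range(len(distances)), key=lambda i: distances[i])
    return columns[index]


for speaker, row in table1.items():
    print(speaker, "->", best_match(row))
# prints:
# Sp01 -> Sp01
# Sp02 -> Sp02
# Sp03 -> Sp03
```

For Sp01 the margin is small: the distance to Sp08 (1.828) is close to the diagonal value (1.4015), so the larger codebook of Table 2 gives a clearer separation for that speaker.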
The following according to the theory: “the most likely speaker sample table provides the performance results for various speaker must have the minimum possible Euclidean distance identification tests. when compared database”[19]. to The all the above codebooks procedure in has the been implemented for 10 speech signals where five of them are male and five of them are female. These speech samples are in a regional language Telugu and the system has successfully identified all the speakers. LPC Identification 96.2% MEL GDP 97.0% 97.0% result The results show that the identification rate of the system increases as the number of centroids increases. Also as the number of speakers increases, the number of centroids This method clearly provides better resolution and also suppresses the spikes which are generated due to noise in the spectrum when compared to first order group delay functions. increases. A spectral estimation method based on complex II-order derivative of the Fourier Transform of the phase also called as II order Group Delay has been proposed for the estimation of the signal characteristics and this newly The effect of changing the codebook size on proposed method is compared with the I-order derivative the VQ distortion of the Fourier Transform of the phase and proves to give Codebook Size best results. Matching Score (VQ distortion) REFERENCES Sp01 Sp03 Sp08 2 95.7358 16.0041 110.1414 8 2.9861 3.5683 3.7059 16 0.6435 0.3276 0.9069 64 0.1036 0.0123 0.0457 [1] K.Nagi Reddy, S.Narayana Reddy, ASR Reddy. “Significance of complex group delay functions in spectrum estimation.” Signal & image processing: An International journal (SIPIJ) Vol.2, No.1, pp.115-133, March 2011, ISSN 0976-710X. [2] B. Yegnanarayana and Hema A. Murthy “Significance of Group DelayFunctions in Spectrum Estimation” IEEE Transactions on signal processing. 128 0.0053 0.0020 0.0106 256 0 0 0 Vol. 40. NO.9.pp 2281-2289, September 1992. [3] B. 
Yegnanarayana, "Formant extraction from linear prediction phase spectra," J. Acoust. Soc. Amer., vol. 63, pp. 1638-1640, May 1978. Table 3: Matching score for different codebook sizes. From the above table 3, we can see that as the codebook [4] John G.Proakis and Dimitris G Monolakis “Digital size C increases, the Euclidean distance for the same signal Processing speaker is decreased. Applications“Prentice –Hall,1997. Anand Joseph M., Guruprasad VIII. CONCLUSION S., Principles,Algorithms Yegnanarayana B.” and Extracting A text-independent speaker identification system is the Formants from Short Segments of Speech using main goal of this paper. The feature extraction process is Group Delay Functions” done using complex second order group delay functions ICSLP, pp:1009-1012 and the feature matching is performed using Vector [5] INTERSPEECH 2006 – Yegnanarayana, B., Saikia, D. K., and Krishnan, T. Quantization technique. Using the extracted features, a R., “Significance of group delay functions in signal codebook for each speaker was created and clustering of reconstruction from spectral magnitude or phase”, feature vectors is done using the K-means algorithm. IEEE Trans. on Acoustics Speech and Signal Proc., Codebooks from all the speakers form the database for the Vol. 32, no. 3, pp. 610-623, Jun. 1984. system. 
A distortion measure based on minimizing the Euclidean distance was used for matching the unknown [6] A.V oppenheim and R.W Schafer ‘’ “Digital signal Processing” Englewood cliff,NJ , Prentice –Hall 11 [7] H K Lakshminarayana, J S Bhat and H M Mahesh, “Improved Estimation of Evolutionary Spectrum [17] based feature for Robust Speech Recognition” The andModified Magnitude Group Delay by Signal Annals of ”Dunarea de Jos” University of Galaţi, Decomposition” International Journal of Information Fascicle III, 2009,Vol 32,No 1,pp.60-65 , ISSN 1221- and Communication Engineering 5:3 2009,pp198-209 454X on Short Time Fourier Abbasian Ali and Marvi Hossien”The Phase Spectra based feature for Robust [18] University of Galaţi, Fascicle III, 2009,Vol 32,No 1,pp.60-65 , ISSN 1221-454X G. Farahani, Homayounpoor,” S.M. Use and Spectral M.M. Peaks In Autocorrelation And Group Delay Domains For Robust Speech Recognition” ICASSP 2006,pp:517520 [10] Ms. Mani Roja, Deepak Harjani and Mohita Jethwani. "Speaker Recognition System using MFCC and Vector Quantization Approach." International Journal for Scientific Research and Development 1.9 (2014): 1934-1937. [11] Aruna Bayya and B. Yegnanarayana , "Robust features for speech recognition Systems," in Proc. ICSLP '98, December 1998 [12] Yegnanarayana, B., Saikia, D. K., and Krishnan, T. R., “Significance of group delay functions in signal reconstruction from spectral magnitude or phase”, IEEE Trans. on Acoustics Speech and Signal Proc., Vol. 32, no. 3, pp. 610-623, Jun. 1984. [13] Rajesh M. Hegde, Hema A. Murthy, Venkata Ramana Rao Gadde: “Significance of the Modified Group Delay Feature in Speech Recognition” IEEE Transactions on audio, speech, and language processing, vol. 15, no. 1, january 2007 [14] Rajesh M. Hegde and Hema A. Murthy: “Speaker Identification using The Modified Group Delay Feature.” [15] Hema A. Murthy and Gadde V. Ramana Rao. 
“The Modified group delay function and its application to phoneme recognition.” In Proceedings of the ICASSP, Vol.I, pages 68-71, April 2003. [16] 2010. [19] E. Karpov, “Real Time Speaker Identification,” Master`s thesis, Department of Computer Science, Ahadi Of Abeer M. Abu-Hantash, Ala’a Tayseer Spaih:” Text Independent Speaker Identification system”, may Speech Recognition” The Annals of ”Dunarea de Jos” [9] Abbasian Ali and Marvi Hossien”The Phase Spectra Transforms based [8] pages 18-32, October 1994. H.Gish and M.Schmidt. “Text Independent Speaker Identification.” In IEEE Signal Processing Magazine, University of Joensuu, 2003.