Speaker Identification Using Second Order Complex Group Delay Functions

G. Latha Sree (1), Dr. A. Subbarami Reddy (2)
(1) PG scholar, Dept of ECE (DECS), SKIT, JNTUA, Srikalahasti, AP, India, E-mail: [email protected]
(2) Dept of ECE, SKIT, JNTUA, Srikalahasti, AP, India, E-mail: [email protected]

Abstract: In this paper, a text-independent speaker identification system is introduced. It includes two steps: feature extraction and feature matching. Second order group delay functions are used for feature extraction, and vector quantization is used for feature matching. The main idea is to derive cepstrum-like features from second order group delay functions instead of deriving them from the power spectrum. The second order group delay function is computed from the phase of the Fourier transform of the signal. For feature extraction, we propose a set of features from spectrum estimation using second order group delay. These features are more robust to channel variations than features based on Mel spectral coefficients. The extracted speech features of each speaker are quantized to a number of centroids using the K-means algorithm. These centroids represent the codebook of that speaker. The speaker is identified by calculating the minimum quantization distance between the centroids of each speaker from the training phase and the feature vectors of the individual speaker in the testing phase. The code is developed in Matlab and performs the identification successfully.

Keywords: Group delay; second order group delay; k-means; speaker identification.

I. INTRODUCTION
Speech is the most usual and natural way of communication. Unlike other forms of identification, such as passwords or keys, speech is non-intrusive as a biometric. In a speaker identification system, an unknown speaker is compared with a database of known speakers, and the best matched speaker is identified. The system is based on the speaker related information contained in the speech wave samples. Previous systems used traditional amplitude related approaches, but in this paper we use a new approach that makes use of phase related group delay functions and clustering algorithms.
The process of speaker identification contains two modes [11]:
* Training or Enrolment Mode
* Testing or Identification Mode
In the training mode, speakers with known identity are enrolled into the system's database. In the recognition mode, a speech sample from an unknown speaker, who is one of the enrolled speakers, is given as input and the code makes a decision about the identity of the speaker.

II. BLOCK DIAGRAM
The speaker identification process consists mainly of two phases. In the first phase, speaker enrolment, speech samples of all speakers are collected and used to train the system. This collection of enrolled speech samples is also called the speaker database. In the second phase, identification, a test sample from an unknown speaker is compared with the speaker database. Both phases share the same first step, feature extraction, which extracts speaker dependent characteristics from the speech. This step reduces the amount of test data while retaining the speaker discriminative information. Next, in the enrolment phase, these features are modelled and stored in the database. In the identification phase, the extracted features are compared with the models stored in the speaker database. Based on these comparisons, the final decision about speaker identity is made. This process [19] is represented in Figure 1.

[Figure 1 blocks: Speech input, Feature extraction, Speaker modelling, Speaker database, Pattern matching, Decision logic, Speaker id]
Figure 1: Block diagram of Speaker identification system

III. PRE-PROCESSING
In order to enhance the efficiency of the extraction process, speech signals are pre-processed before features are extracted. Pre-processing consists of digital filtering and detection of the speech signal. Filtering includes pre-emphasis filtering and the removal of noise using several algorithms.

A. Pre-emphasis
Pre-emphasis is the technique used in speech processing to enhance the high frequencies of the speech signal. Two major factors call for pre-emphasis. Firstly, the speech signal usually carries more speaker specific information in the higher frequencies than in the lower frequencies. Secondly, pre-emphasis removes the glottal effects from the vocal tract parameters. When a speech signal is recorded with a microphone from a certain distance, its spectrum has a downward slope of approximately -6 dB/octave compared with the true spectrum. Pre-emphasis is intended to flatten the spectrum, so that the formants have similar heights. The digitized speech waveform has a high dynamic range and suffers from additive noise. Pre-emphasis reduces this range and is done using an FIR high-pass filter. With x[n] as the time domain input and 0.9 ≤ α ≤ 1.0, the filter equation can be written as [11]

    y[n] = x[n] - \alpha\, x[n-1]

This is a first-order Finite Impulse Response (FIR) filter with transfer function

    H(z) = 1 - \alpha z^{-1}

Generally α is chosen between 0.9 and 0.95. We used α = 0.95.

B. Framing
Framing converts the continuous speech signal into frames of the desired length. The speech signal is divided into frames of N samples, with adjacent frames separated by M samples such that M < N. The initial frame consists of the first N samples. The second frame begins M samples after the first frame and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps the first frame by N - 2M samples. This overlapping continues until all the speech samples are accounted for within one or more frames [11]. The typical values N = 256 and M = N/2 = 128 are used in this paper.

C. Windowing
The signal obtained after framing has discontinuities at the beginning and end of each frame. To minimize these discontinuities, windowing is used. Several window functions are available for this purpose, including the rectangular, triangular, Hanning and Hamming windows. If w(n) represents a window for 0 ≤ n ≤ N-1, where N is the number of samples in each frame, then the result of windowing the signal is

    y(n) = x(n)\, w(n)

Here the Hamming window is used, which has the equation [11]

    w(n) = 0.54 - 0.46 \cos\left( \frac{2\pi n}{N-1} \right), \quad 0 \le n \le N-1

IV. GROUP DELAY
Group delay is defined as the negative derivative of the phase of the Fourier transform of a signal [1,2-3,4]. For a minimum phase signal, the group delay computed from the magnitude spectrum of the Fourier transform is equal to that computed from the phase spectrum [5,6].
Computing the group delay function of a real signal is difficult for many reasons. The most important one is the wrapping of the phase function: the phase function of a discrete time signal has discontinuities at multiples of 2π. This problem can be overcome by computing the group delay function \tau(\omega) directly from the signal x(n) as follows [1,7,8].

    X(\omega) = \sum_{n} x(n)\, e^{-j\omega n}    (1)

X(\omega) can also be expressed as

    X(\omega) = |X(\omega)|\, e^{j\theta(\omega)}

where X(\omega) = X_R(\omega) + j X_I(\omega) and

    \theta(\omega) = \tan^{-1}\left( \frac{X_I(\omega)}{X_R(\omega)} \right)    (2)

Group delay is defined as [1,9,5-8]

    \tau(\omega) = -\frac{d\theta(\omega)}{d\omega}    (3)

In order to avoid unwrapping, another method is used to calculate the group delay directly. Starting from

    \log X(\omega) = \log|X(\omega)| + j\theta(\omega)    (4)

the group delay is

    \tau(\omega) = -\mathrm{Im}\left[ \frac{d \log X(\omega)}{d\omega} \right]    (5)

\tau(\omega) can be computed from (2) and (3) as

    \tau(\omega) = -\frac{X_R(\omega)\,\frac{dX_I(\omega)}{d\omega} - X_I(\omega)\,\frac{dX_R(\omega)}{d\omega}}{|X(\omega)|^2}    (6)

where the subscripts R and I denote the real and imaginary parts. Since differentiation can only be approximated in the discrete-time domain, the Fourier transform property

    \frac{dX(\omega)}{d\omega} = -j\, F\{n\,x(n)\}    (7)

is used, where F denotes the Fourier transform. Separating the real and imaginary parts, we get

    \frac{dX_R(\omega)}{d\omega} + j\,\frac{dX_I(\omega)}{d\omega} = F\{n\,x(n)\}_I - j\, F\{n\,x(n)\}_R    (8)

Using the above expression, the group delay in (6) can be rewritten [1,9] as

    \tau(\omega) = \frac{X_R(\omega)\, F\{n\,x(n)\}_R + X_I(\omega)\, F\{n\,x(n)\}_I}{|X(\omega)|^2}    (9)

If Y(\omega) is the Fourier transform of n x(n), and the subscripts R and I denote the real and imaginary parts, equation (9) can be rewritten as

    \tau(\omega) = \frac{X_R(\omega)\, Y_R(\omega) + X_I(\omega)\, Y_I(\omega)}{|X(\omega)|^2}    (10)

Complex Group Delay Functions
Assume an N-sample system function with n being the time domain index and X(\omega) representing the filter's discrete time Fourier transform (DTFT), in polar form as below [3]:

    X(\omega) = |X(\omega)|\, e^{j\theta(\omega)}

In the above equation, |X(\omega)| is the magnitude response of the filter, \theta(\omega) is the phase response of the filter and \omega is the continuous frequency measured in radians per sample. Taking the derivative of X(\omega) with respect to \omega, we have

    \frac{dX(\omega)}{d\omega} = e^{j\theta(\omega)}\, \frac{d|X(\omega)|}{d\omega} + j\,X(\omega)\, \frac{d\theta(\omega)}{d\omega}

Dividing this equation by X(\omega), multiplying both sides by j and using (7) yields

    j\, \frac{dX(\omega)/d\omega}{X(\omega)} = -\frac{d\theta(\omega)}{d\omega} + j\, \frac{X_R(\omega)\, Y_I(\omega) - X_I(\omega)\, Y_R(\omega)}{|X(\omega)|^2}

Here the group delay appears as a complex quantity \tau(\omega) = \tau_{R1}(\omega) + j\,\tau_{I1}(\omega), of which \tau_{R1}(\omega) is the real part, that is, the group delay obtained from the traditional definition:

    \mathrm{Re}\left[ j\, \frac{dX(\omega)/d\omega}{X(\omega)} \right] = \tau_{R1}(\omega) = \frac{X_R(\omega)\, Y_R(\omega) + X_I(\omega)\, Y_I(\omega)}{|X(\omega)|^2}

and

    \mathrm{Im}\left[ j\, \frac{dX(\omega)/d\omega}{X(\omega)} \right] = \tau_{I1}(\omega) = \frac{X_R(\omega)\, Y_I(\omega) - X_I(\omega)\, Y_R(\omega)}{|X(\omega)|^2}

The second term appears as the imaginary part of the group delay \tau(\omega), and the dimensions of \tau_{I1}(\omega) are the same as those of \tau_{R1}(\omega). The equation is therefore a more general expression for the computation of the group delay.

Formulation of Complex II Order Group Delay
The formulation of the proposed method is derived by differentiating the above equation with respect to \omega. After some mathematical manipulation, its real and imaginary parts can be written as

    \frac{d\tau_{R1}(\omega)}{d\omega} = \frac{X_R (Z_I - 2\tau_{R1} Y_I) - X_I (Z_R - 2\tau_{R1} Y_R)}{X_R^2 + X_I^2}

    \frac{d\tau_{I1}(\omega)}{d\omega} = \frac{|Y(\omega)|^2 - (X_R Z_R + X_I Z_I)}{|X(\omega)|^2} - 2\tau_{I1}^2

where d\tau(\omega)/d\omega is the complex II order group delay.
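The first and second order expressions above can be checked numerically. The following is a minimal Python sketch, not the authors' Matlab code: it assumes a naive zero-padded DFT, a signal whose spectrum has no zeros on the sampled frequencies, and the hypothetical function names `dft` and `complex_group_delay`. It evaluates the real and imaginary parts of the complex group delay and of its derivative from the DFTs of x(n), n x(n) and n^2 x(n).

```python
import cmath


def dft(x, n_fft):
    """Naive zero-padded DFT: X(k) = sum_n x(n) exp(-j 2 pi k n / n_fft)."""
    return [sum(v * cmath.exp(-2j * cmath.pi * k * n / n_fft)
                for n, v in enumerate(x))
            for k in range(n_fft)]


def complex_group_delay(x, n_fft):
    """Return (tau, dtau): complex first and second order group delay per bin.

    Assumes |X(k)| is non-zero at every bin (no guard against division by zero).
    """
    X = dft(x, n_fft)
    Y = dft([n * v for n, v in enumerate(x)], n_fft)      # F{n x(n)}
    Z = dft([n * n * v for n, v in enumerate(x)], n_fft)  # F{n^2 x(n)}
    tau, dtau = [], []
    for Xk, Yk, Zk in zip(X, Y, Z):
        mag2 = Xk.real ** 2 + Xk.imag ** 2
        # First order: real part is the traditional group delay.
        tau_r = (Xk.real * Yk.real + Xk.imag * Yk.imag) / mag2
        tau_i = (Xk.real * Yk.imag - Xk.imag * Yk.real) / mag2
        # Second order: derivatives of the real and imaginary parts.
        dtau_r = (Xk.real * (Zk.imag - 2 * tau_r * Yk.imag)
                  - Xk.imag * (Zk.real - 2 * tau_r * Yk.real)) / mag2
        dtau_i = ((abs(Yk) ** 2 - (Xk.real * Zk.real + Xk.imag * Zk.imag)) / mag2
                  - 2 * tau_i ** 2)
        tau.append(complex(tau_r, tau_i))
        dtau.append(complex(dtau_r, dtau_i))
    return tau, dtau


# A unit impulse delayed by 3 samples has a constant group delay of 3
# and therefore a zero second order group delay.
tau, dtau = complex_group_delay([0.0, 0.0, 0.0, 1.0], 8)
```

For the delayed impulse, every entry of `tau` is 3 (up to rounding) and every entry of `dtau` is 0, which matches the interpretation of the real part as the traditional group delay.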
Here
X_R = real part of fft(x(n))
X_I = imaginary part of fft(x(n))
Y_R = real part of fft(n x(n))
Y_I = imaginary part of fft(n x(n))
Z_R = real part of fft(n^2 x(n))
Z_I = imaginary part of fft(n^2 x(n))

V. FEATURE EXTRACTION
The steps involved in computing the new features are shown in the form of a flow diagram [12] in Figure 2.

[Figure 2 blocks: Speech signal, Difference of samples, Hamming window, DFT, | |, IDFT, Hamming window, Second order group delay, Absolute value of 2nd order group delay, IDFT of 2nd order group delay, Feature vectors]
Figure 2: Steps involved in computing feature vectors

As in other speech processing techniques, the speech signal s(n) is pre-emphasized to remove the dc component and to spectrally flatten the signal before feature extraction. The conventional pre-emphasis is done by a difference operation [4], in which each sample is replaced by its difference from the previous sample.
This operation can be performed on each frame of the signal. Then, as shown in the figure, each frame of speech is multiplied by a Hamming window: with m the frame number and N the number of samples in each frame, the samples of frame m are weighted by h(n), the samples of the Hamming window.
The windowing operation is followed by the computation of an autocorrelation-like function r(n) derived from the magnitude of the Fourier transform (FT), that is, the inverse DFT of the magnitude of the K-point DFT of the windowed frame, where K is the size of the DFT, typically chosen as 2*N.
r(n) is truncated to its first L samples, where L is 16-24, to smooth the finer detail and to obtain the spectral envelope. Then the truncated sequence is multiplied by a tapering window, such as a Hamming window, to eliminate discontinuities at the ends of the sequence.
The group delay spectrum of r(n) is then computed; GD[k] represents the sampled group delay spectrum.
Finally, the features are computed in a manner similar to the cepstral coefficients, as the inverse DFT of the sampled group delay spectrum.
The above group delay spectrum is computed using the following algorithm.

Algorithm for computing II-order group delay functions
1. Let x(n) be the given M-point causal sequence; then compute [1, 10] y(n) = n x(n).
2. Compute the N-point (N >> M) Discrete Fourier Transforms (DFT) X(k) and Y(k) of the sequences x(n) and y(n) respectively, for k = 0, 1, ..., N-1.
3. Compute the cepstrally smoothed spectrum S(k) of X(k).
4. Compute the spectrum z(k) by dividing X(k) by S(k).
5. Compute the modified group delay function \tau_0(k) for k = 0, 1, ..., N-1.
6. Compute the derivatives of the real and imaginary parts of the group delay function as

    \frac{d\tau_{R1}}{d\omega} = \frac{X_R (Z_I - 2\tau_{R1} Y_I) - X_I (Z_R - 2\tau_{R1} Y_R)}{X_R^2 + X_I^2}

    \frac{d\tau_{I1}}{d\omega} = \frac{|Y(\omega)|^2 - (X_R Z_R + X_I Z_I)}{|X(\omega)|^2} - 2\tau_{I1}^2
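The front end of the feature extraction chain can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: the helper names (`split_frames`, `hamming`, `dft`, `autocorr_like`, `cepstrum_like`) are hypothetical, the DFT is a naive one for clarity, the pre-emphasis coefficient defaults to 1.0 (a pure difference), and the tapering window is assumed to be the falling half of a Hamming window. The group delay spectrum GD[k] itself is taken as an input to the last step.

```python
import cmath
import math


def split_frames(signal, frame_len, hop):
    """Overlapping frames of frame_len samples, starting every hop samples."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def hamming(n_len):
    """Hamming window h(n) = 0.54 - 0.46 cos(2 pi n / (n_len - 1)), n_len > 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_len - 1))
            for n in range(n_len)]


def dft(x, n_fft, sign=-1):
    """Naive zero-padded DFT (sign=-1) or unnormalised inverse DFT (sign=+1)."""
    return [sum(v * cmath.exp(sign * 2j * math.pi * k * n / n_fft)
                for n, v in enumerate(x))
            for k in range(n_fft)]


def autocorr_like(frame, trunc_len, alpha=1.0):
    """Difference, Hamming window, |DFT|, IDFT, truncate to trunc_len, taper."""
    diff = [frame[n] - alpha * frame[n - 1] for n in range(1, len(frame))]
    xw = [d * h for d, h in zip(diff, hamming(len(diff)))]
    k_fft = 2 * len(frame)                      # K = 2N
    mag = [abs(v) for v in dft(xw, k_fft)]
    # The magnitude spectrum is real and symmetric, so its IDFT is real.
    r = [v.real / k_fft for v in dft(mag, k_fft, sign=+1)]
    taper = hamming(2 * trunc_len)[trunc_len:]  # assumed: falling half window
    return [r[n] * taper[n] for n in range(trunc_len)]


def cepstrum_like(gd, n_feat):
    """First n_feat samples of the inverse DFT of |GD[k]|."""
    k_fft = len(gd)
    c = dft([abs(g) for g in gd], k_fft, sign=+1)
    return [v.real / k_fft for v in c[:n_feat]]


# Toy usage with short frames; the paper's values would be N = 256, M = 128.
signal = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) for n in range(48)]
frames = split_frames(signal, 16, 8)
rw = autocorr_like(frames[0], 8)
```

The naive DFT costs O(K^2) per frame and is used only to keep the sketch dependency-free; an FFT routine would replace it in practice.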

VI. FEATURE MATCHING AND SPEAKER RECOGNITION
The first step is to build a speaker database C = {C1, C2, ..., CN} which consists of N codebooks, one codebook for each speaker in the database. This is done by converting the input signal into a sequence of feature vectors X = {x1, x2, ..., xN}. These feature vectors are clustered into a set of M codewords C = {c1, c2, ..., cM}. This set of codewords is called the codebook of the specified speaker and is obtained using a clustering algorithm. A number of algorithms exist for codebook generation, such as the Generalized Lloyd algorithm (GLA) or K-means algorithm, Linde-Buzo-Gray, Self Organizing Maps (SOM) and Pairwise Nearest Neighbour (PNN). In this paper, the K-means algorithm is used since it is the simplest, most popular and easiest to implement.

K-means algorithm
The K-means algorithm [19] partitions the M feature vectors into C clusters, each represented by a centroid. The method first randomly chooses C cluster centroids from the M feature vectors. Then each feature vector is assigned to the nearest centroid, and a new set of centroids is calculated for the new clusters. This procedure is continued until the mean square error between the M feature vectors and the cluster centroids is below a certain assigned threshold. In other words, the main objective of the K-means algorithm is to minimize the total intra-cluster variance V,

    V = \sum_{i=1}^{k} \sum_{x_j \in S_i} |x_j - \mu_i|^2

where there are k clusters S_i, i = 1, 2, ..., k, and \mu_i is the mean point or centroid of all the points x_j in S_i. The clusters in the K-means algorithm are shown in Figures 3 and 4.

[2D plot of acoustic vectors: 5th dimension (horizontal axis) against 6th dimension (vertical axis)]
Figure 3: Clusters in the K-means algorithm (number of centroids C = 2).

[2D plot of acoustic vectors: 5th dimension (horizontal axis) against 6th dimension (vertical axis)]
Figure 4: Clusters in the K-means algorithm (number of centroids C = 8).

In short, the K-means algorithm performs the three steps below until convergence [19]:
1. Determine the coordinates of the centroids.
2. Determine the distance of each object to the centroids.
3. Group the objects based on the minimum distance (finding the closest centroid).

Figure 5: K-means algorithm flow diagram.

Feature Matching
In the recognition phase, the unknown speaker, represented by the feature vectors {X1, X2, ..., XT}, is compared with the codebooks in the database for the final recognition. The distortion measure is computed for each codebook and the speaker with the lowest distortion is chosen as the best matched speaker [19].
One distortion measure is the sum of squared distances between each vector and its representative centroid; another is the average of the Euclidean distances,

    D_i = \frac{1}{T} \sum_{t=1}^{T} d(x_t, c_{\min})

where c_{\min} denotes the codeword nearest to x_t in the codebook C_i and d( ) is the Euclidean distance. Thus, each feature vector in the sequence is compared with all the available codebooks, and the codebook with the minimum average distance is selected as the best match.
The best-known distance measures so far are the Euclidean distance, the weighted Euclidean distance and the Mahalanobis distance. We used the Euclidean distance here. The Euclidean distance between two points U = (u1, u2, ..., un) and V = (v1, v2, ..., vn) is given by

    d(U, V) = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \dots + (u_n - v_n)^2} = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}

The training speaker file which has the lowest distortion distance is identified as the best match for the given unknown speaker.

VII. EXPERIMENTAL RESULTS
A Matlab code has been written to implement the speaker identification system. This code needs the pre-defined functions v_kmeans.m, disteusq.m, rnsubset.m, and the voicebox.m, winenvar.m files from voicebox, which is a speech processing toolbox in Matlab. The output of the system is shown in the following tables in the form of a distortion measure, with the input signals given in .wav format. We have 30 clear speech signals named Sp01, Sp02, Sp03, ..., Sp30. We can observe that the diagonal element has the least vector quantization distance value in its respective row. It indicates that Sp01 matches with Sp01, Sp02 matches with Sp02 and so on. Here the codebook size is 16. The distortion measures for the first 10 of the 30 clear speech samples are given in Table 1.

[Figures: first order group delay and second order group delay plots for each of the speech samples, titled "Group delay of 'sp01'" through "Group delay of 'sp30'"]
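Codebook training with K-means and matching by average Euclidean distortion can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' Matlab code and not the voicebox `v_kmeans` routine: the function names, the fixed iteration limit, the seeded random initialisation and the toy two-dimensional vectors are all assumptions made for the example.

```python
import math
import random


def euclid(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kmeans(vectors, n_centroids, n_iter=50, seed=0):
    """Return a codebook of n_centroids centroids for the given vectors."""
    rng = random.Random(seed)
    # Start from randomly chosen feature vectors.
    centroids = [list(v) for v in rng.sample(vectors, n_centroids)]
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for v in vectors:
            nearest = min(range(n_centroids),
                          key=lambda i: euclid(v, centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def avg_distortion(vectors, codebook):
    """Average distance from each vector to its nearest codeword."""
    return sum(min(euclid(v, c) for c in codebook) for v in vectors) / len(vectors)


def identify(test_vectors, database):
    """Name of the codebook with the lowest average distortion."""
    return min(database, key=lambda name: avg_distortion(test_vectors, database[name]))


# Toy enrolment data for two hypothetical speakers.
train = {
    "spA": [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0)],
    "spB": [(10.0, 10.0), (10.1, 10.0), (11.0, 10.0), (11.1, 10.0)],
}
database = {name: kmeans(vecs, 2) for name, vecs in train.items()}
best = identify([(0.05, 0.0), (1.05, 0.0)], database)
```

Here `best` is "spA", since the test vectors lie far closer to the codewords of that speaker than to those of the other one.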
Therefore a new function called the complex II-order group delay spectrum estimation function has been derived, based on the II-order derivative of the FT phase from the first order group delay. The above graphs compare the results of the I and II order group delay spectrum estimations.

          Sp01      Sp02      Sp03      Sp04      Sp05      Sp06      Sp07      Sp08      Sp09      Sp10
Sp01    1.4015    3.5685   47.2517   24.3959     3.673   43.3044   45.4605     1.828   23.2688    8.8337
Sp02    5.7877     1.546   73.6625   12.2542    8.7874   57.2519   66.8928    4.0014   11.7488    3.6959
Sp03     2.515    1.9171    0.3509    2.0337    2.1988    2.4502    2.3813     2.997    1.6221    1.9297
Sp04   37.7496   19.4954  198.4953    4.4036   42.4395   21.7943  120.8951   29.9203    8.2478   12.1787
Sp05    9.9634   11.0722   44.2632    38.709    1.6697   31.7374   32.6545      9.45   37.0609   18.9741
Sp06  108.7146   73.5691  270.8312   34.9402  110.8011    6.6072   52.1381   96.4685   30.6069   55.5596
Sp07  292.5806  233.8699  552.4637   134.951  299.3711   52.8933    5.6412  270.5984  137.5742  183.8065
Sp08    1.4689    2.8624   56.3202   19.7096     5.906   50.6259   49.9415    0.8891   19.0048    6.5949
Sp09   30.9172   14.1947  144.7578    4.8229   36.6153   23.0945  123.3529   24.8221     1.782    7.4463
Sp10    12.949     4.157   134.232     6.658   17.6606   39.5662   89.9253   11.1178    6.7061    2.4925
Table 1: Vector quantization distance for clear speech signals sp01, sp02, ..., sp10 for codebook size 16
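The observation that the diagonal entry is the smallest value in each row can be checked mechanically. The short Python sketch below is illustrative only; the function name `best_matches` is hypothetical, and the 3 x 3 matrix is the upper-left corner of Table 1 (rows and columns Sp01 to Sp03).

```python
def best_matches(distortion, labels):
    """For each row, return the column label with the smallest distortion."""
    return [labels[min(range(len(row)), key=row.__getitem__)]
            for row in distortion]


labels = ["Sp01", "Sp02", "Sp03"]
distortion = [
    [1.4015, 3.5685, 47.2517],
    [5.7877, 1.546, 73.6625],
    [2.515, 1.9171, 0.3509],
]
matches = best_matches(distortion, labels)
```

For this sub-table `matches` equals `labels`, that is, each sample is matched to itself.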
For codebook size C = 64, the distortion measures are given in Table 2.

          Sp01      Sp02      Sp03      Sp04      Sp05      Sp06      Sp07      Sp08      Sp09      Sp10
Sp01    0.0477     2.263    44.211   21.9238    1.6905   38.8762   36.2556    0.4171   21.5793    7.6464
Sp02    3.2468    0.1121   70.3834   11.1982    7.1873   55.0577   55.6189    1.8472   10.6776     2.106
Sp03    1.3646    0.5392    0.0106    0.0956    0.3546    0.5221    0.2019    1.0475    0.3999    0.2589
Sp04   33.0936   16.8433  145.9561      0.17   40.7936   17.8142  114.5024   27.1216    5.3323    10.695
Sp05    5.6573    7.7478   40.7699   35.2992    0.1881   25.7197   23.3882    7.2705   34.5832   16.0768
Sp06  105.1664   64.0917   265.321   21.7871  107.1294     1.222   47.1959   87.8631   22.9843   42.1392
Sp07  292.5806  233.8699  552.4637   134.951  299.3711   52.8933    5.6412  270.5984  137.5742  183.8065
Sp08  107.1294     1.222   47.1959   87.8631   22.9843   42.1392    45.605   42.5683     0.054   17.2582
Sp09   28.2196   11.8204  140.0115    1.9762   34.8564   17.7642  114.0403   22.6694    0.1159    4.5542
Sp10   10.9589    2.9404   94.7017    4.7994   16.0181   37.1376   76.7163    7.3841    4.0541    0.0988
Table 2: Vector quantization distance for clear speech signals sp01, sp02, ..., sp10 for codebook size 64

          Sp11      Sp12      Sp13      Sp14      Sp15      Sp16      Sp17      Sp18      Sp19      Sp20
Sp11    0.1484   65.1784   21.5979   20.7737   49.3174    2.5639     7.425   10.0336   31.6638   58.4676
Sp12     1.797    0.0163    1.0522    1.5982    1.3808    0.8691    2.3859    0.7867    2.2089    1.1561
Sp13   23.4132  165.1833    0.1439   86.8203  138.6536   10.9179   54.6253    3.0934   109.393    9.7619
Sp14    20.019   13.0474   18.8587    0.1281    7.2694   17.9446    4.5651   17.6394    1.6354   18.8146
Sp15    6.0607    1.4044    4.6381    4.8876     0.022      4.15    7.0423    4.0129    1.3536    4.4477
Sp16    3.6527   96.6039   10.3606   38.7142   79.3581    0.4683   20.2432    2.9393   55.0562   38.0559
Sp17    6.1887   29.7645   40.4638    4.3573   18.8134   15.6141    0.0602   31.1431    9.6879   39.5601
Sp18   11.2014  134.4288    4.4223    65.066  110.9595    3.1052   35.4422    0.0782    82.235   20.6682
Sp19   12.1607    6.4206   10.7223    1.4321     3.282     9.985    9.6403    9.7418    0.1031   10.3455
Sp20   60.0935  260.2641   11.5196   152.725  230.4135   38.2909  109.0684   20.8764  182.9044    0.2048
Table 3: Vector quantization distance for clear speech signals sp11, sp12, ..., sp20 for codebook size 64

          Sp21      Sp22      Sp23      Sp24      Sp25      Sp26      Sp27      Sp28      Sp29      Sp30
Sp21    0.0119    1.8438    0.6474    1.4412    3.5724    0.8071    0.1006    1.7976    0.7391    0.3915
Sp22    3.6261    0.0472    6.0256    7.8602    13.485    6.0815    2.4752    6.3189    6.1236    3.7091
Sp23  125.3918   86.8163     0.055    5.1752  173.9393   13.2605     5.816   46.3378   10.2329   52.1404
Sp24  116.5472   78.9724    2.0027    0.0427  161.6597    5.9124    1.1095   40.8969    4.2172   41.7361
Sp25    0.0361    0.0449    0.0809    0.0982     0.023    0.0828    0.2068    0.0708    0.0973    0.1565
Sp26  174.1992   126.171    6.9555    7.2877  233.8164    0.0595   10.3256   73.6223   16.5765   17.7002
Sp27   111.134   77.9214    4.1117    3.9269  158.8772   10.3068    0.8846    42.577    4.6876   50.5968
Sp28   19.1895    6.6693   24.3559   28.0832    39.469   24.4232   17.8915    0.1097   18.4084   21.8964
Sp29   79.1624   49.2407    2.3483    3.5573  121.4712   15.7295    2.6265   20.2872    0.0669   65.0841
Sp30  291.4751  231.8188   44.5492    44.067  378.5392   17.2398   49.7507  164.4803   66.2853    0.0942
Table 4: Vector quantization distance for clear speech signals sp21, sp22, ..., sp30 for codebook size 64

From the above tables, the system identifies the speaker according to the theory: "the most likely speaker sample must have the minimum possible Euclidean distance when compared to all the codebooks in the database" [19]. The above procedure has been implemented for 10 speech signals, five from male speakers and five from female speakers. These speech samples are in the regional language Telugu, and the system has successfully identified all the speakers.

The effect of changing the codebook size on the VQ distortion

Codebook size    Matching score (VQ distortion)
                      Sp01        Sp03        Sp08
2                  95.7358     16.0041    110.1414
8                   2.9861      3.5683      3.7059
16                  0.6435      0.3276      0.9069
64                  0.1036      0.0123      0.0457
128                 0.0053      0.0020      0.0106
256                      0           0           0
Table 5: Matching score for different codebook sizes.

From Table 5, we can see that as the codebook size C increases, the Euclidean distance for the same speaker decreases.

VIII. CONCLUSION
A text-independent speaker identification system is the main goal of this paper. Feature extraction is done using complex second order group delay functions, and feature matching is performed using the vector quantization technique. Using the extracted features, a codebook is created for each speaker, with the feature vectors clustered using the K-means algorithm. The codebooks of all the speakers form the database of the system. A distortion measure based on minimizing the Euclidean distance is used for matching the unknown speaker with the speakers in the database. The following table provides the performance results for various speaker identification tests.

                          LPC      MEL      GDP
Identification result     96.2%    97.0%    97.0%

The results show that the identification rate of the system increases as the number of centroids increases. Also, as the number of speakers increases, the number of centroids increases. A spectral estimation method based on the complex II-order derivative of the phase of the Fourier transform, also called the II order group delay, has been proposed for the estimation of the signal characteristics. This newly proposed method is compared with the I-order derivative of the phase of the Fourier transform and is found to give the best results. The method provides better resolution and also suppresses the spikes generated by noise in the spectrum, when compared to first order group delay functions.

REFERENCES
[1] K. Nagi Reddy, S. Narayana Reddy, ASR Reddy, "Significance of complex group delay functions in spectrum estimation," Signal & Image Processing: An International Journal (SIPIJ), Vol. 2, No. 1, pp. 115-133, March 2011, ISSN 0976-710X.
[2] B. Yegnanarayana and Hema A. Murthy, "Significance of Group Delay Functions in Spectrum Estimation," IEEE Transactions on Signal Processing, Vol. 40, No. 9, pp. 2281-2289, September 1992.
[3] B. Yegnanarayana, "Formant extraction from linear prediction phase spectra," J. Acoust. Soc. Amer., vol. 63, pp. 1638-1640, May 1978.
[4] John G. Proakis and Dimitris G. Manolakis, "Digital Signal Processing: Principles, Algorithms and Applications," Prentice-Hall, 1997. Anand Joseph M., Guruprasad S., Yegnanarayana B., "Extracting Formants from Short Segments of Speech using Group Delay Functions," INTERSPEECH 2006 - ICSLP, pp. 1009-1012.
[5] Yegnanarayana, B., Saikia, D. K., and Krishnan, T. R., "Significance of group delay functions in signal reconstruction from spectral magnitude or phase," IEEE Trans. on Acoustics Speech and Signal Proc., Vol. 32, no. 3, pp. 610-623, Jun. 1984.
[6] A. V. Oppenheim and R. W. Schafer, "Digital Signal Processing," Englewood Cliffs, NJ, Prentice-Hall.
[7] H K Lakshminarayana, J S Bhat and H M Mahesh, "Improved Estimation of Evolutionary Spectrum and Modified Magnitude Group Delay by Signal Decomposition," International Journal of Information and Communication Engineering 5:3, 2009, pp. 198-209.
[8] Abbasian Ali and Marvi Hossien, "The Phase Spectra based feature for Robust Speech Recognition," The Annals of "Dunarea de Jos" University of Galaţi, Fascicle III, 2009, Vol. 32, No. 1, pp. 60-65, ISSN 1221-454X.
[9] G. Farahani, S.M. Ahadi and M.M. Homayounpoor, "Use Of Spectral Peaks In Autocorrelation And Group Delay Domains For Robust Speech Recognition," ICASSP 2006, pp. 517-520.
[10] Ms. Mani Roja, Deepak Harjani and Mohita Jethwani, "Speaker Recognition System using MFCC and Vector Quantization Approach," International Journal for Scientific Research and Development 1.9 (2014): 1934-1937.
[11] Aruna Bayya and B. Yegnanarayana, "Robust features for speech recognition Systems," in Proc. ICSLP '98, December 1998.
[12] Yegnanarayana, B., Saikia, D. K., and Krishnan, T. R., "Significance of group delay functions in signal reconstruction from spectral magnitude or phase," IEEE Trans. on Acoustics Speech and Signal Proc., Vol. 32, no. 3, pp. 610-623, Jun. 1984.
[13] Rajesh M. Hegde, Hema A. Murthy, Venkata Ramana Rao Gadde, "Significance of the Modified Group Delay Feature in Speech Recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, January 2007.
[14] Rajesh M. Hegde and Hema A. Murthy, "Speaker Identification using The Modified Group Delay Feature."
[15] Hema A. Murthy and Gadde V. Ramana Rao, "The Modified group delay function and its application to phoneme recognition," in Proceedings of the ICASSP, Vol. I, pages 68-71, April 2003.
[16] Abbasian Ali and Marvi Hossien, "The Phase Spectra based on Short Time Fourier Transforms based feature for Robust Speech Recognition," The Annals of "Dunarea de Jos" University of Galaţi, Fascicle III, 2009, Vol. 32, No. 1, pp. 60-65, ISSN 1221-454X.
[17] Abeer M. Abu-Hantash, Ala'a Tayseer Spaih, "Text Independent Speaker Identification system," May 2010.
[18] H. Gish and M. Schmidt, "Text Independent Speaker Identification," IEEE Signal Processing Magazine, pages 18-32, October 1994.
[19] E. Karpov, "Real Time Speaker Identification," Master's thesis, Department of Computer Science, University of Joensuu, 2003.
