
CODEBOOK MODELING IN SPEAKER
VERIFICATION/IDENTIFICATION TASK SOLUTION
Kh. M. Akhmad
Vladimir State University
600005, Vladimir, Gorkovo st., 87
+7 920 911 45 05, [email protected]
Speaker recognition methods based on the use of vector quantization (VQ) algorithms, distortion measures, reference modeling, and identification by codebook are considered.
Introduction
There are various methods of speech recognition [3]; recently, however, the method of comparison with a reference has become the core one. This is connected mainly with progress in the field of electronic components, in particular with the increase in processor computational power and memory capacity. In comparison with a reference, descriptions of speech signals are compared with previously stored reference (template) descriptions, and their degree of similarity is calculated. The recognition result is the most similar reference pattern.
Task description
In solving the identification problem, it is inefficient and inconvenient to use the entire set of extracted feature vectors as the reference, since their number can be very large (proportional to the length of the phrase), which entails significant growth of the speaker database and a reduction in identification speed. Besides, feature vectors are irregularly distributed over some region of the feature space and form groups: feature vectors similar to each other lie close together in the feature space.
Therefore it makes sense to partition all feature vectors into groups containing mutually similar vectors. Such a problem is solved by means of vector quantization (VQ): the process of mapping a large number of vectors into a finite number of regions of the vector space. Each such region is referred to as a cluster and can be represented by its centroid, called a code vector. The set of code vectors for one speaker is referred to as the codebook and serves as the reference [8, 4, 6, 7].
Task solution
The codebook modeling task is posed as follows. Let a set of feature vectors be given:

X = {x_i | i = 1, ..., L}, where x_i = (x_{i1}, ..., x_{iD})^T ∈ R^D,

and D is the vector dimension (D = 12). Let C = {c_1, ..., c_K} be the codebook, where K ≪ L is the number of code vectors (clusters) in the codebook and c_i = (c_{i1}, ..., c_{iD})^T ∈ R^D.
Then a vector quantizer Q of dimension D and size K is a mapping of vectors of the set X onto the set of centroids C = {c_i | i = 1, ..., K}: C = Q(X).
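As an illustration, a minimal NumPy sketch of such a quantizer mapping; the function name quantize and the array layout are assumptions for illustration, not from the paper:

import numpy as np

def quantize(x, codebook):
    # Map a feature vector x (shape (D,)) to the nearest centroid of the
    # codebook (shape (K, D)); returns the index of the chosen code vector.
    d2 = ((codebook - x) ** 2).sum(axis=1)
    return int(d2.argmin())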
The primary task in vector quantizer modeling is distortion minimization, i.e. minimization of the distances between vectors and centroids. The choice of the distortion measure in codebook modeling can affect its quality. The concept of distance, or a distortion measure, is understood as a function

d: R^D × R^D → R,  (1)
which measures the dissimilarity between two feature vectors x and y; for equal vectors the distance is zero.
Euclidean distance is the most common measure. It starts from the principle of physical distance between two points in space and is defined as:
d_E(x, y) = (x − y)^T (x − y) = Σ_{k=1}^{D} (x_k − y_k)^2.  (2)
The mean square error (MSE), calculated by the formula

d_MSE(x, y) = (1/D) (x − y)^T (x − y) = (1/D) Σ_{k=1}^{D} (x_k − y_k)^2,  (3)

can serve as a modification of this measure.
Weighted mean square error. The basic problem with the Euclidean distance is that the variances of the vector components differ from one another, and components with large variance dominate the distance. Moreover, the variances of the mel-frequency cepstral coefficient (MFCC) components differ strongly. Hence, normalization of the feature vectors is needed, or the use of a normalizing measure:
d_w(x, y) = (1/D) (x − y)^T W (x − y),  (4)
where W is a positive-definite weighting matrix. Often W = Γ^{-1} is taken, where Γ is the covariance matrix of the random vector x:

Γ = E[(x − x̄)(x − x̄)^T], x̄ = E[x].  (5)
In this case

d_w(x, y) = (1/D) (x − y)^T Γ^{-1} (x − y),  (6)

well known as the Mahalanobis distance [8].
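For concreteness, a minimal NumPy sketch of the measures (2), (3) and (6); the function names are illustrative, and the inverse covariance gamma_inv is assumed to be estimated elsewhere:

import numpy as np

def d_euclid(x, y):
    # Squared Euclidean distance, eq. (2).
    diff = x - y
    return float(diff @ diff)

def d_mse(x, y):
    # Mean square error, eq. (3).
    return d_euclid(x, y) / x.shape[0]

def d_mahalanobis(x, y, gamma_inv):
    # Weighted MSE with W taken as the inverse covariance matrix, eq. (6).
    diff = x - y
    return float(diff @ gamma_inv @ diff) / x.shape[0]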
A simplified normalization method based on the Mahalanobis distance [6] is used in this work. This method uses the MSE as the measure, but with preliminarily normalized arguments:

x̃_k = (x_k − μ_k) / σ_k, k = 1, ..., D,  (7)

where x_k and x̃_k are the original and normalized k-th feature vector components, respectively, and μ_k and σ_k are the mean and standard deviation of the k-th component over all feature vector samples. After the normalization the vectors have zero mean and unit variance.
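A minimal sketch of this normalization (7), assuming the training vectors are stored as rows of a NumPy array; the returned statistics would be reused to normalize test vectors identically:

import numpy as np

def normalize_features(X):
    # Component-wise normalization, eq. (7): subtract the mean and divide by
    # the standard deviation of each component over all L training vectors.
    # X: array of shape (L, D). Returns the normalized copy together with
    # (mu, sigma) so that test vectors can be normalized the same way.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma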
In codebook modeling it is necessary that the cluster for each successive feature vector be chosen by the minimum of the distortion measure, i.e. that the nearest cluster be chosen, and that each code vector be obtained from the condition of minimum average distortion within its cell. There are plenty of algorithms for clustering and codebook modeling [8, 6, 7].
The generalized Lloyd algorithm (GLA, LBG, k-means) starts with an initial codebook, which is iteratively improved until a local minimum is reached. It is the most popular algorithm for clustering problems. At each step, feature vectors are mapped to the nearest clusters, and then the centroid (code vector) of every cluster is recalculated from the vectors that have fallen into it. After an iteration, the quality of the new codebook is better than or equal to that of the previous one. This is repeated until the required codebook quality is reached. The algorithm is given below.
Assume that it is necessary to partition the training vector set X = {x_i | i = 1, ..., L} into K clusters. Let C_i^(m) be the i-th cluster at the m-th iteration, with centroid c_i^(m), and let K be the number of code vectors.
Step 1. Set k = 1 and create a codebook consisting of a single code vector:

c_1* = (1/L) Σ_{i=1}^{L} x_i.  (8)

Calculate the average distortion within this single cluster:

D*_avg = (1/L) Σ_{i=1}^{L} d(x_i, c_1*).  (9)
Step 2. Splitting. Double the codebook size, splitting each already created cluster in two according to the rule:

c_i^(0) = (1 + ε) c_i*, c_{k+i}^(0) = (1 − ε) c_i*, i = 1, ..., k,  (10)

where ε = 0.01 is the splitting parameter; set k = 2k.
Step 3. Iteration. Set D_avg^(0) = D*_avg and the iteration counter m = 0.
a) The training vector set X = {x_i | i = 1, ..., L} is classified into clusters C_i, i = 1, ..., k, by means of the nearest neighbour rule: x ∈ C_i^(m) if and only if d(x, c_i^(m)) ≤ d(x, c_j^(m)) for all j ≠ i. In other words, each feature vector is assigned to the cluster whose centroid is closest to it according to the chosen metric.
b) Correct the centroids by the formula:

c_i^(m+1) = (Σ_{x ∈ C_i^(m)} x) / |C_i^(m)|, i = 1, ..., k.  (11)

c) Set m = m + 1.
d) Calculate the average distance between the feature vectors and their corresponding centroids:

D_avg^(m) = (1/L) Σ_{i=1}^{L} d(x_i, c_j^(m)), j: x_i ∈ C_j^(m).  (12)
e) If (D_avg^(m−1) − D_avg^(m)) / D_avg^(m) > ε, go to step (3.a); otherwise set D*_avg = D_avg^(m).
f) Set c_i* = c_i^(m), i = 1, ..., k, as the resulting code vector set.
Step 4. If k < K, go to step (2); otherwise stop.
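A minimal NumPy sketch of this algorithm under the rules above; the function name gla_codebook is illustrative, K is assumed to be a power of two (each splitting step doubles the codebook), and the squared Euclidean distance (2) is used as d:

import numpy as np

def gla_codebook(X, K, eps=0.01, max_iter=100):
    # X: (L, D) matrix of training vectors; K: target codebook size.
    codebook = X.mean(axis=0, keepdims=True)              # step 1, eq. (8)
    while codebook.shape[0] < K:
        # step 2: split every code vector in two, eq. (10)
        codebook = np.vstack([(1 + eps) * codebook, (1 - eps) * codebook])
        prev = np.inf
        for _ in range(max_iter):                         # step 3
            # (3.a) nearest-neighbour classification
            d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # (3.b) recompute centroids, eq. (11); empty cells keep their centroid
            for i in range(codebook.shape[0]):
                cell = X[labels == i]
                if len(cell) > 0:
                    codebook[i] = cell.mean(axis=0)
            # (3.d) average distortion with the updated centroids, eq. (12)
            dist = ((X - codebook[labels]) ** 2).sum(axis=1).mean()
            # (3.e) stop when the relative improvement falls below eps
            if dist == 0 or (prev - dist) / dist <= eps:
                break
            prev = dist
    return codebook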
The generalized Lloyd algorithm (also known as LBG) is easy enough to implement, has complexity of order O(KLH) (H is the number of iterations), and yields good results in most cases. However, this method finds only the first local minimum, and the final clustering quality strongly depends on the initial values. One way to overcome these shortcomings is to use global (inter-cluster) optimization. Such optimization can be performed by random or deterministic interchange of centroids, or by a split-merge method.
An example is randomized local search (RLS). This algorithm includes global and local optimization stages [1, 2].
Step 1. Initialization. Select K (the number of clusters) feature vectors at random as centroids. On this basis, create the optimal partition by the nearest neighbour rule. As a result, the algorithm is insensitive to the initial partition. Set the iteration counter m = 1.
Step 2. Iteration.
a) Random swap. Select a cluster at random and replace its centroid with a randomly chosen feature vector:

c_i^(m) = x_j, i = random(1, K), j = random(1, L).  (13)

b) Local repartitioning. Create a new partition taking the changed centroid into account.
c) Calculate new centroids from the new partition.
d) Set m = m + 1.
Step 3. Let f be the criterion function. If f^(m) < f^(m−1), then take the centroids obtained at step (2.c) as the current solution.
Step 4. If m < T, go to step (2); otherwise stop. T is fixed and set before the algorithm runs (T = 100, 200, 500, 1000, 2000). The algorithm complexity is of order O(TL). The mean square error can serve as the criterion function f:

f = (1/L) Σ_{i=1}^{L} d(x_i, c_j^(m)), j: x_i ∈ C_j^(m).  (14)
The work of this algorithm can be improved by replacing step (2.c) with several iterations of the GLA, as in the sketch below. This allows the local optimization to be carried out much more thoroughly and creates optimal clusters, unlike the usual centroid recalculation procedure. The proposed improvement slows the algorithm down, which in most applications is not critical. In this case the algorithm complexity becomes of order O(TKL).
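A minimal NumPy sketch of RLS; the function name rls_codebook and the parameter gla_iters (which, when greater than one, switches on the improvement just described) are illustrative assumptions:

import numpy as np

def rls_codebook(X, K, T=1000, gla_iters=1, seed=0):
    # X: (L, D) matrix of training vectors; K: codebook size; T: number of
    # random-swap trials. gla_iters = 1 reproduces the basic step (2.c);
    # a larger value runs several GLA iterations after each swap.
    rng = np.random.default_rng(seed)
    L = X.shape[0]

    def mse(cb):
        # criterion function, eq. (14): mean distance to the nearest centroid
        d2 = ((X[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        return d2.min(axis=1).mean()

    def lloyd_step(cb):
        # one GLA iteration: repartition, then recompute the centroids
        d2 = ((X[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for i in range(cb.shape[0]):
            cell = X[labels == i]
            if len(cell) > 0:
                cb[i] = cell.mean(axis=0)
        return cb

    # step 1: random initial centroids
    codebook = X[rng.choice(L, size=K, replace=False)].copy()
    best_f = mse(codebook)
    for _ in range(T):                                   # steps 2-4
        trial = codebook.copy()
        trial[rng.integers(K)] = X[rng.integers(L)]      # (2.a) random swap, eq. (13)
        for _ in range(gla_iters):                       # (2.b)-(2.c)
            trial = lloyd_step(trial)
        f = mse(trial)
        if f < best_f:                                   # step 3: keep only improvements
            codebook, best_f = trial, f
    return codebook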
Other clustering algorithms exist as well, but those described above are generally more effective and more convenient to implement [7].
The codebook obtained in one of the ways described above represents the model of a given speaker and is then entered into the speaker database, which is a set of codebooks (references).
The speaker identification process over an existing set of codebooks (the speaker database) is similar to the training process.
From the test speaker's speech, a set of feature vectors X = {x_i | i = 1, ..., L} is extracted. It is then determined which of the codebooks in the database best corresponds to the obtained vector set.
The speaker database consists of a set of codebooks (references) B = {C_1, ..., C_N}, where N is the number of speakers in the database and C_i = {c_{i1}, ..., c_{iK}} is the codebook corresponding to the i-th speaker (K is the codebook size).
Hereinafter, for simplicity, it is assumed that all codebooks in the database have identical size, although all the algorithms remain valid in the general case.
One simple and effective way of determining the codebook that best corresponds to the test speaker's feature vectors is described by the following algorithm [7]:
Step 1. For each speaker codebook C_i (i = 1, ..., N) compute the distortion

D_i = d(X, C_i)  (15)

between X and C_i.
Step 2. Identify the index ID of the unknown speaker as the one with the smallest distortion, i.e.

ID = argmin_{1 ≤ i ≤ N} D_i.  (16)
The distortion measure (15) approximates the dissimilarity between the codebook C_i = {c_{i1}, ..., c_{iK}} and the feature vector set X = {x_1, ..., x_L}. We use the most intuitive distortion measure: map each vector in X to the nearest code vector in C_i and compute the average of these distances:

d(X, C_i) = (1/L) Σ_{j=1}^{L} min_{1 ≤ k ≤ K} d_E(x_j, c_{ik}),  (17)

where d_E is the Euclidean metric (2).
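A minimal NumPy sketch of this closed-set decision rule (15)-(17); the function name identify_speaker is illustrative:

import numpy as np

def identify_speaker(X, codebooks):
    # X: (L, D) matrix of test feature vectors; codebooks: list of (K, D)
    # arrays, one per enrolled speaker. Returns the index of the codebook
    # with the smallest average distortion.
    distortions = []
    for cb in codebooks:
        d2 = ((X[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)  # d_E, eq. (2)
        distortions.append(d2.min(axis=1).mean())                 # eq. (17)
    return int(np.argmin(distortions))                            # eq. (16)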
The approach based on introducing weight coefficients [5] is interesting enough. Based on this idea, an algorithm performing weighted identification has been devised. Before comparison, the correlation between codebooks is calculated, and code vectors that have higher distinctive power are assigned greater weights. No a priori information about the feature vectors is required. Unlike the previous method, here it turned out to be more convenient to use a similarity measure rather than a dissimilarity measure.
Thus, at identification, the best matching codebook is now defined as the codebook that maximizes the similarity measure between the feature vector set X and the codebook C_i, i.e.

ID = argmax_{1 ≤ i ≤ N} s(X, C_i),  (18)

where the similarity measure is defined as:

s(X, C_i) = (1/L) Σ_{j=1}^{L} 1 / min_{1 ≤ k ≤ K} d_E(x_j, c_{ik}).  (19)
As different code vectors have different distinctive powers, it is proposed to use for identification not only the distance from each vector to the nearest code vector, but also the distinctive power of that code vector. For this purpose weight coefficients are introduced, and the similarity measure (19) takes the form:

s(X, C_i) = (1/L) Σ_{j=1}^{L} [1 / min_{1 ≤ k ≤ K} d_E(x_j, c_{ik})] · w(c_{ij}^min),  (20)

where c_{ij}^min is the code vector of codebook C_i nearest to x_j, and w is the weight function.
Such weighting can be considered as an operator shifting the dividing surface in the direction of the more significant code vectors. The weight function is calculated for every code vector.
The modified speaker database is represented as

B_m = {(C_1, W_1), ..., (C_N, W_N)},

where W_i = {w(c_{i1}), ..., w(c_{iK})} are the weights assigned to the i-th codebook. The weights are calculated each time a new speaker is added to the database, which happens at the training stage and does not affect the computational complexity of identification.
The weight coefficients are calculated by the following formula:

w(c_{ij}) = (1/(N − 1)) Σ_{m=1, m≠i}^{N} min_{1 ≤ k ≤ K} d_E(c_{ij}, c_{mk}), i = 1, ..., N, j = 1, ..., K.  (21)
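A minimal NumPy sketch of the weighted scheme, assuming the reading of (21) as the average distance from a code vector to the nearest code vectors of all other speakers (so that more distinctive code vectors receive larger weights); the function names are illustrative:

import numpy as np

def codebook_weights(codebooks):
    # Discriminative weights per eq. (21): for each code vector, average the
    # distance to the nearest code vector of every other speaker's codebook.
    N = len(codebooks)
    weights = []
    for i, cb in enumerate(codebooks):
        w = np.zeros(cb.shape[0])
        for m, other in enumerate(codebooks):
            if m == i:
                continue
            d2 = ((cb[:, None, :] - other[None, :, :]) ** 2).sum(axis=2)
            w += d2.min(axis=1)
        weights.append(w / (N - 1))
    return weights

def identify_weighted(X, codebooks, weights):
    # Weighted identification, eqs. (18) and (20): maximize the similarity score.
    scores = []
    for cb, w in zip(codebooks, weights):
        d2 = ((X[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        d_min = d2[np.arange(len(X)), nearest] + 1e-12   # guard against zero distance
        scores.append((w[nearest] / d_min).mean())       # eq. (20)
    return int(np.argmax(scores))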
Conclusion
As experiments have shown, the developed weighted identification method copes with the task much better when the number of feature vectors available at testing is small, i.e. when identification is performed on short fragments of speech. The described methods solve the identification problem on the closed set, i.e. when it is known in advance that the reference taken from the tested speaker is present in the database. If a speaker whose reference is not in the database is taken as the test speaker, the given algorithms will choose the most similar speaker, which in any case will be erroneous. However, if threshold values are introduced for the distortion (similarity) measure, then, when the threshold is crossed for the chosen speaker, it is possible to conclude that this speaker is foreign (a stranger) and is not present in the database. Thus, the identification problem can be solved on the open set without significant changes to the described methods. The thresholds are selected experimentally, and in practice such a method works accurately and effectively enough.
References
1. Franti P., Kivijarvi J. Random swapping technique for improving clustering in unsupervised classification. // ftp://ftp.cs.joensuu.fi/franti/papers/scia99-l.ps
2. Franti P., Kivijarvi J. Randomized local search algorithm for the clustering problem. // Pattern Analysis and Applications, 3(4): 358-369, 2000. ftp://ftp.cs.joensuu.fi/franti/papers/rls.ps
3. Gorelik A. L., Skripkin V. A. Recognition Methods: textbook. 3rd edition. M.: Higher School, 1989. 232 p. (in Russian)
4. Gray R. M. Vector quantization. // IEEE ASSP Mag., vol. 1, pp. 4-29, April 1984.
5. Kinnunen T., Franti P. Speaker discriminative weighting method for VQ-based speaker identification. // http://cs.joensuu.fi/pages/tkinnu/research/pdf/DiscriminativeWeightingMethod.pdf
6. Kinnunen T., Karkkainen I., Franti P. Is speech data clustered? Statistical analysis of cepstral features. // http://cs.joensuu.fi/pages/tkinnu/research/pdf/IsSpeechClustered.pdf
7. Kinnunen T., Kilpelainen T., Franti P. Comparison of clustering algorithms in speaker identification. // Proc. IASTED Int. Conf. Signal Processing and Communications (SPC): 222-227, Marbella, Spain, 2000.
8. Makhoul J. Vector quantization in speech coding. // IEEE, 1985, vol. 73, no. 11, pp. 19-60. (in Russian)