FORMATION OF MODEL LIBRARY FOR RECOGNITION OF SPEECH COMMANDS ON THE BACKGROUND OF NOISE

N.A. Krasheninnikova
Ulyanovsk State University,
432970, Ulyanovsk, Lev Tolstoy St., 42, e-mail: [email protected]
The article considers the problem of forming a model library for the recognition of speech commands against background noise. Algorithms are suggested that obtain good solutions of the model-choice problem within acceptable time.
1. Introduction
When recognizing speech commands (SC) from a fixed dictionary, the speech command under recognition (SCR) is usually compared with model commands (MC), and the SCR is assigned to the category of the MC to which it is nearest. The distance between SCs is measured in some attribute space, for example by comparing SC autocorrelation portraits [1]. Due to speech variability, one and the same SC is pronounced slightly differently even by one and the same speaker, so a single MC does not represent the SC completely. Moreover, the presence of noise greatly influences the estimated distance between SCs. These two factors decrease the recognition quality.
To improve SC representation it is desirable to use several MCs; the more MCs we use, the better the representation and the recognition, especially against background noise. However, if the number of MCs is large, the computational cost of recognition increases. Moreover, the speaker has to spend much time pronouncing the commands. In [2], a method was suggested for imitating pronunciations of a command from a single real pronunciation by the speaker, which solves the problem of obtaining MCs.
To reduce the computational load, the number of MCs for each SC must be reduced in such a way that they still characterize the variability of the given SC's pronunciation completely enough. Thus, within each set of pronunciations of each SC, it is necessary to choose a subset that represents the given SC in the best way.

_______________________________________________________________________
1 Supported by RFBR grant а 06-08-008-10.
Let us formulate the problem under consideration. The dictionary consists of $m$ SCs: $\{C_1, C_2, \ldots, C_m\}$. For every SC $C_i$ there exists a set of its pronunciations $P_i = \{p_{i1}, p_{i2}, \ldots, p_{in_i}\}$. This set can consist of real pronunciations or can be formed artificially [2]; it can also include pronunciations against different background noises. In general, this set must describe, as completely as possible, the variants of the given SC that may occur during its recognition.
For any elements $p_i$ and $p_j$ from $P = P_1 \cup \ldots \cup P_m$, a function (quasimetric) $d(p_i, p_j)$ is defined; among the metric axioms it may fail to satisfy the triangle inequality. The distance $d(p_i, p_j)$ is a measure of the difference between elements $p_i$ and $p_j$ used in the process of SC recognition; for example, it may be the difference between the spectra of the sound signals, their autocorrelation functions, wavelet transforms, and so on.
From every set $P_i$ a subset $E_i = \{e_{i1}, e_{i2}, \ldots, e_{ik}\} \subset P_i$ of $k$ elements, which we will call MCs, must be chosen. This subset will be used in the process of recognition, so it has to represent all the variety of pronunciations in the sense of the metric $d(p_i, p_j)$ used. For this purpose the average quasi-distance

$$\bar{d} = \frac{1}{M} \sum_{i=1}^{m} \sum_{p \in P_i} \min\{d(p, e) : e \in E_i\}, \qquad M = n_1 + n_2 + \ldots + n_m - km, \qquad (1)$$

of the elements of $P$ to the nearest MC must be as small as possible. Besides, the MCs must be chosen in such a way that MCs of different commands can be easily distinguished from one another in the sense of the chosen metric. Let us express this requirement as the average quasi-distance between MCs of different commands:

$$\bar{D} = \frac{1}{mk} \sum_{i=1}^{m} \sum_{e \in E_i} \min\{d(e, f) : f \in E \setminus E_i\}, \qquad (2)$$

where $E = E_1 \cup \ldots \cup E_m$; this quantity, on the contrary, must be as large as possible.
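As an illustration, criteria (1) and (2) can be computed directly from their definitions. The following Python sketch uses hypothetical toy data and the absolute difference as the quasimetric $d$; in practice $d$ would compare spectra, autocorrelation portraits, and so on:

```python
# Sketch: evaluating criteria (1) and (2) for a candidate MC library.
# The pronunciations are plain numbers and the quasimetric is the absolute
# difference; in practice d() would compare spectra, autocorrelation
# portraits, wavelet transforms, etc.

def d(p, q):
    return abs(p - q)           # toy quasimetric

def avg_to_nearest_model(P, E):
    """Average quasi-distance (1): mean distance of the non-model
    pronunciations of every SC to the nearest MC of that SC
    (M = n_1 + ... + n_m - k*m non-model elements in total)."""
    total, count = 0.0, 0
    for P_i, E_i in zip(P, E):
        for p in P_i:
            if p in E_i:        # the models themselves contribute nothing
                continue
            total += min(d(p, e) for e in E_i)
            count += 1
    return total / count

def avg_between_models(E):
    """Average quasi-distance (2): mean distance of every MC to the
    nearest MC of a different command."""
    total, count = 0.0, 0
    for i, E_i in enumerate(E):
        others = [f for j, E_j in enumerate(E) if j != i for f in E_j]
        for e in E_i:
            total += min(d(e, f) for f in others)
            count += 1
    return total / count

# two commands, five pronunciations each, one model per command (k = 1)
P = [[0.0, 0.4, 0.5, 0.6, 1.0], [2.0, 2.4, 2.5, 2.6, 3.0]]
E = [[0.5], [2.5]]
d_bar = avg_to_nearest_model(P, E)   # criterion (1), to be minimized
D_bar = avg_between_models(E)        # criterion (2), to be maximized
```

A good library makes the first value small while keeping the second large.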
It should be noticed that the problem under consideration has much in common with the clustering problem, as each MC can be considered the "centre" of a cluster consisting of the pronunciations nearest to that MC (Fig. 1 shows one model from each of four SCs).

Fig. 1. The classes of speech command pronunciations and the models within them

Thus it is necessary to minimize (1) and maximize (2) simultaneously, and these requirements contradict each other. Therefore let us introduce a common quality criterion of MC library formation as a function

$$U = U(\bar{d}, \bar{D}), \qquad (3)$$

which does not decrease in $\bar{d}$ and does not increase in $\bar{D}$. The minimum of function (3) corresponds to the optimal library.

In [3, 4] this problem was considered for the case of a single SC. The algorithms suggested in those works can be applied to the case of several commands, but separately to each of them, which corresponds to function (3) in the form

$$U = \bar{d}. \qquad (4)$$

In this case, however, the relations between different SCs are not taken into consideration. More general variants of (3) are

$$U = \bar{d} - \bar{D}; \qquad (5)$$

$$U = \bar{d} / \bar{D}. \qquad (6)$$

Criterion (5) is not a good variant, as the following simple example illustrates. Let there be two classes represented by the segments $K_1 = [0; 1]$ and $K_2 = [2; 3]$ on the number line (Fig. 2), with the ordinary distance between points $A$ and $B$ taken as $d(A, B)$, and one model per class. The minimum of $\bar{d}$ is attained at $E_1 = \{0.5\}$ and $E_2 = \{2.5\}$, i.e. at the central points of the segments. The maximum of $\bar{D}$ is attained at $E_1 = \{0\}$ and $E_2 = \{3\}$, i.e. at the extreme points of the segments; the minimum of $U = \bar{d} - \bar{D}$ is attained at the same points, but this is obviously a bad variant of the library, as the models lie too far from the class centres.

Fig. 2. The choice of models on a system of segments

If $U = \bar{d} / \bar{D}$, the minimum is attained at the models $E_1 = \{(5 - \sqrt{17})/2 \approx 0.438\}$ and $E_2 = \{3 - (5 - \sqrt{17})/2 \approx 2.562\}$, shown as bold dots in Fig. 2. This variant of the library seems well-founded: each model is situated not far from the centre of its class and at the same time rather far from the other class.
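The behaviour of criteria (5) and (6) on this example can also be checked numerically. The Python sketch below samples each segment at a step of 0.1 and finds the best model pair by exhaustive search; the grid and step size are assumptions made for the illustration:

```python
# Brute-force check of the two-segment example: criterion (5) pushes the
# models to the extreme points of the segments, while criterion (6) keeps
# them strictly inside, near the class centres.

K1 = [i / 10 for i in range(11)]        # samples of segment [0; 1]
K2 = [2 + i / 10 for i in range(11)]    # samples of segment [2; 3]

def d_bar(e1, e2):
    # average distance of every sample point to the model of its class
    s = sum(abs(x - e1) for x in K1) + sum(abs(x - e2) for x in K2)
    return s / (len(K1) + len(K2))

def best_pair(U):
    # exhaustive search over all model pairs for the minimum of U
    return min(((e1, e2) for e1 in K1 for e2 in K2),
               key=lambda p: U(p[0], p[1]))

e5 = best_pair(lambda e1, e2: d_bar(e1, e2) - (e2 - e1))   # criterion (5)
e6 = best_pair(lambda e1, e2: d_bar(e1, e2) / (e2 - e1))   # criterion (6)
```

On this grid, criterion (5) selects the segment ends, whereas criterion (6) selects models strictly inside the segments, in line with the discussion above.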
Thus criterion (6) seems a more suitable variant, leading to intuitively better choices of MCs.

The problem under consideration can, of course, be solved by exhaustive search, but this is unacceptable even when the number of pronunciations is small. Below, several quasi-optimal algorithms are suggested that solve the problem with an acceptable amount of computation.
2. An Algorithm for Improving an Available Solution

At first an initial set of MCs $E^1$ is chosen at random, and the corresponding value $U(E^1)$ is computed by formula (3). Then all variants of replacing the first MC $e_{11}$ of the first SC by an element of $P_1 \setminus E^1$ are tried. The best of $E^1$ and these variants (in the sense of the minimum of $U$) is kept and taken as $E^2 = \{e'_{11}, e_{12}, \ldots, e_{1k}\}$, where $e'_{11}$ is the optimal replacement of model $e_{11}$. Then attempts are made to replace the second MC $e_{12}$ of the first SC in $E^2$ by elements of the set $P_1 \setminus E^2$, and so on, until the set of models $E^{k+1}$ is obtained.

Then the models of the other SCs are replaced in the same way. The described procedure of improving the MC set is carried out twice.
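A minimal sketch of this improvement procedure in Python, assuming criterion (6) as the concrete form of $U$ and a one-dimensional toy quasimetric (both are illustrative assumptions, not the setup used in the experiments):

```python
import random

# Sketch of the improvement algorithm: starting from a random library,
# every MC in turn is tentatively replaced by each remaining pronunciation
# of its SC, and a replacement is kept only if the criterion U improves.
# Criterion (6), U = d_bar / D_bar, and a 1-D toy quasimetric are assumed.

def d(p, q):
    return abs(p - q)           # toy quasimetric

def U(P, E):
    # d_bar: average distance of the non-model pronunciations to the
    # nearest MC of their own SC, as in formula (1)
    num, n = 0.0, 0
    for P_i, E_i in zip(P, E):
        for p in P_i:
            if p not in E_i:
                num += min(d(p, e) for e in E_i)
                n += 1
    d_bar = num / n
    # D_bar: average distance of every MC to the nearest MC of a
    # different command, as in formula (2)
    tot, cnt = 0.0, 0
    for i, E_i in enumerate(E):
        others = [f for j, E_j in enumerate(E) if j != i for f in E_j]
        for e in E_i:
            tot += min(d(e, f) for f in others)
            cnt += 1
    return d_bar / (tot / cnt)  # criterion (6)

def improve(P, k, passes=2, seed=0):
    rng = random.Random(seed)
    E = [rng.sample(P_i, k) for P_i in P]      # random initial library
    for _ in range(passes):                    # the procedure is run twice
        for i, P_i in enumerate(P):
            for j in range(k):
                for cand in P_i:
                    if cand in E[i]:
                        continue
                    u_now = U(P, E)
                    old = E[i][j]
                    E[i][j] = cand             # tentative replacement
                    if U(P, E) >= u_now:
                        E[i][j] = old          # no improvement: revert
    return E

# two toy commands with five pronunciations each, one model per command
P = [[0.0, 0.1, 0.5, 0.9, 1.0], [2.0, 2.1, 2.5, 2.9, 3.0]]
library = improve(P, k=1)
```

Because the result depends on the initial random library, in practice several random starts would be tried and the best result kept.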
Experiments with this algorithm have shown that the obtained set of MCs usually reaches a deadlock (it cannot be further improved by the suggested procedure), and even if it is not optimal, it is close to the optimum, being only 3-6 per cent inferior. The algorithm executes rather quickly; it is essential that the time spent grows nearly linearly with the number of pronunciations.

To run the algorithm it is only necessary to specify the distances $d(p, q)$ between pronunciations. These distances can be calculated by the concrete recognition algorithm for which the MC library is being chosen.

Of course, the final result depends substantially on the initial choice of MCs. That is why it is better to try several random initial variants; it was found that a good solution is usually obtained within a dozen attempts.
3. Gravitation Algorithm

Let pronunciations be represented as points of an $m$-dimensional Euclidean space with the usual metric (the attribute space in which SC recognition is conducted). Let them be material points of unit mass in a viscous medium. These points then experience mutual gravitation damped by the resistance of the medium: points situated nearer to each other attract one another more strongly, approach each other more quickly, and join into clusters.

There is an analogy between points forming clusters and points divided into clusters: in both cases a set of points is divided into groups of points close to each other. If during the motion of the points we mark the $k$ largest clusters and within each cluster take the point nearest to the centre of gravity as a model element, we obtain good solutions of the model-choice problem for each SC separately, i.e. for criterion (4). For a criterion of type (6), repulsion between points belonging to different SCs is introduced into the algorithm.

This heuristic algorithm is easy to use and needs little memory: only the current coordinates and velocities of the moving points have to be stored. The viscosity of the medium is imitated by multiplying the velocity of each point, obtained at every iteration, by a coefficient $c < 1$.
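A minimal sketch of this heuristic in Python (without the repulsion term; the point data, time step, softening constant and damping coefficient are all illustrative assumptions):

```python
import math

# Sketch of the gravitation heuristic on toy 2-D data: unit-mass points
# attract one another with a softened inverse-square pull, the viscous
# medium is imitated by damping the velocity with a coefficient c < 1,
# and close points therefore contract into clusters.

def gravitate(points, steps=300, dt=0.01, c=0.9, soft=0.1):
    pos = [list(p) for p in points]
    vel = [[0.0, 0.0] for _ in points]
    for _ in range(steps):
        for i, p in enumerate(pos):
            fx = fy = 0.0
            for j, q in enumerate(pos):
                if i == j:
                    continue
                dx, dy = q[0] - p[0], q[1] - p[1]
                r = math.hypot(dx, dy)
                f = 1.0 / (r**3 + soft)        # softened inverse-square pull
                fx += f * dx
                fy += f * dy
            vel[i][0] = c * (vel[i][0] + dt * fx)   # damping imitates viscosity
            vel[i][1] = c * (vel[i][1] + dt * fy)
        for p, v in zip(pos, vel):
            p[0] += dt * v[0]
            p[1] += dt * v[1]
    return pos

# two well-separated groups: intra-group attraction dominates, so each
# group contracts into a tight cluster long before the groups meet
group_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
group_b = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]
final = gravitate(group_a + group_b)
```

After the simulation, each cluster's point nearest to its centre of gravity would be taken as the model element.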
4. Algorithms of Fuzzy Clusterization

The problem considered in this article is close to the problem of fuzzy clusterization. The latter also requires dividing elements into classes (clusters), but an element's belonging to the clusters is fuzzy: an element belongs to all clusters to some extent, and the total measure of its belonging to all clusters is equal to one. Besides, the representatives ("middles") of the clusters are not necessarily elements of the initial set; they are typical representatives of their clusters in the sense of this fuzzy belonging. Optimality of the clusterization is understood in the sense of the minimum of criteria of type (3), but the sums are taken with coefficients equal to the extents of belonging to the clusters. In [5] a number of iterative algorithms for the quasi-optimal solution of the fuzzy clusterization problem were suggested.

Similar algorithms have been applied to the problem under consideration, i.e. the problem of model choice, with the extent of belonging chosen as a fixed decreasing function of distance.
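As a sketch, the standard fuzzy c-means iteration matches this description: memberships are a decreasing function of distance and sum to one over the clusters, and the cluster middles are membership-weighted means rather than elements of the initial set. The one-dimensional data and the parameters below are illustrative:

```python
# Sketch of fuzzy clusterization (standard fuzzy c-means, 1-D toy data):
# every element belongs to all clusters with memberships summing to one,
# and the cluster "middles" are membership-weighted means rather than
# elements of the initial set.

def fuzzy_cmeans(xs, centers, m=2.0, iters=50):
    for _ in range(iters):
        # memberships: a decreasing function of distance to each centre
        u = []
        for x in xs:
            ds = [abs(x - c) for c in centers]
            if any(di == 0.0 for di in ds):
                # element coincides with a centre: full membership there
                u.append([1.0 if di == 0.0 else 0.0 for di in ds])
                continue
            row = [1.0 / sum((di / dj) ** (2 / (m - 1)) for dj in ds)
                   for di in ds]
            u.append(row)
        # centres: membership-weighted means of the elements
        centers = [sum(u[i][j] ** m * x for i, x in enumerate(xs)) /
                   sum(u[i][j] ** m for i in range(len(xs)))
                   for j in range(len(centers))]
    return centers

xs = [0.0, 0.2, 0.4, 9.6, 9.8, 10.0]
c = fuzzy_cmeans(xs, centers=[1.0, 9.0])   # centres drift to the group means
```

For the model-choice problem, each resulting middle would be replaced by the nearest actual pronunciation to serve as an MC.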
6. Libraries with Different Numbers of MCs

In the formulated problem of MC library formation it was assumed that the number of MCs is the same for all SCs. This requirement is not necessary: different SCs may have different variability, so it is desirable to increase the number of MCs for SCs with larger variability and to decrease it for less variable ones.

The algorithms described above are easy to modify to form such libraries. For this purpose the average distances (1) are calculated for every SC separately. If this distance is much less than the average over all SCs, the number of MCs for that SC is decreased by one, and vice versa. After such a change of the MC numbers, the MC library is formed in the usual way; then another adjustment of the MC numbers is attempted. The value of the quality criterion is monitored throughout. The algorithm stops after a given number of cycles or when a set value of the criterion is reached.

In the gravitation algorithm the problem can be solved in another way: the maximal radius of the clusters, i.e. the maximal distance from a pronunciation to its MC, is specified; the number of MCs for each command is then determined by the algorithm itself in the process of attraction and repulsion of the points.
7. Algorithm Tests

The conducted experiments have shown that the probability of correct recognition of speech commands is higher when the models are chosen not at random but with the help of the described algorithms, because in that case the model sets better represent the different pronunciations.
8. Conclusion

The suggested algorithms obtain good solutions of the model-choice problem within acceptable time, much less than exhaustive search demands. The first of the described algorithms is the most universal, as it assumes no particular structure in the initial set of elements; only the quasimetric has to be known. The experiments showed that when the models are chosen with the help of these algorithms, the recognition quality is higher than when the MCs are simply recorded pronunciations.
References
1. V.R. Krasheninnikov, A.I. Armer. Speech signal recognition on the background of noise // "Sample recognition and image analysis: new information technologies", Proceedings of the 7th international conference РОАИ-7, St. Petersburg. – 2004. – P. 752-755 (in Russian).
2. V.R. Krasheninnikov, A.I. Armer. The Speech Commands Variability Simulation // Proceedings of International Concurrent Engineering, International Society for Productivity Enhancement (ISPE), Dallas, USA. – 2005. – P. 387-390.
3. V.R. Krasheninnikov, N.A. Krasheninnikova, V.V. Kuznetsov. Algorithms of speech command model choice in the process of speech recognition // Proceedings of the 62nd scientific session devoted to the Day of Radio, Moscow. – 2007. – P. 158-159 (in Russian).
4. V.R. Krasheninnikov, V.V. Kuznetsov, E.A. Rasputko. The algorithm of model choice in a given set of elements // Bulletin of UlSTU, Ulyanovsk: UlSTU. – 2006. – No. 3. – P. 59-61 (in Russian).
5. A.P. Velmisov. Algorithm of fuzzy clusterization // Works of the Middle-Volga Mathematical Society, Saransk. – 2006. – Vol. 8, No. 1. – P. 192-197 (in Russian).