Filter Bank Design for Speaker Diarization Based on

FILTER BANK DESIGN FOR SPEAKER DIARIZATION BASED ON GENETIC
ALGORITHMS
C. Charbuillet, B. Gas, M. Chetouani, J. L. Zarader
Laboratoire des Instruments et Systèmes d'Ile-de-France
Université Pierre et Marie Curie, Paris, FRANCE
ABSTRACT
Speech recognition systems usually need a feature
extraction stage aiming at obtaining the best signal
representation. In this article we propose to use genetic
algorithms to design a feature extraction method adapted to
the speaker diarization task.
We present an adaptation of the common MFCC feature
extractor which consists in designing a filter bank, with
optimized bandwidths.
Experiments are carried out using a state-of-the-art speaker
diarization system. The proposed method outperforms the
original filter bank based on the Mel scale one. Furthermore,
the obtained filter bank reveals the importance of some
specific spectral information for speaker recognition.
from the human auditory system [4]. The used non-linear
scale is the Mel. The following figure (Fig.1.) presents the
Mel triangular filter bank.
Frequency (Hz)
Fig.1. The Mel scaled triangular filter bank
We propose, in this paper, to design a filter bank for the
speaker diarization task. For this purpose, we use genetic
algorithms.
1. INTRODUCTION
Speech feature extraction plays a major role in speaker
recognition systems. Current speech feature extraction
methods are based on speech production for the Linear
Predictive Coding (LPC) or perception for the Mel
Frequency Cepstral Coefficients (MFCC). However,
traditional methods in speech feature extraction do not take
into consideration the specific information about the task to
achieve. To overcome this drawback, several approaches
have been proposed by Katagiri [1], Torkkola [2], and
Chetouani [3]. The main idea of these methods is to
simultaneously learn the parameters of both the feature
extractor and the classifier. This procedure consists in the
optimization of a criterion, which can be the Maximization
of the Mutual Information (MMI) or the Minimization of the
Classification Error (MCE).
This paper presents a new framework for the optimization of
the filter bank of a MFCC based feature extractor. The
MFCC feature extractor is considered as a standard for most
of the tasks such as speech and speaker recognition or
language identification.
The MFCC feature extraction process mainly consists in the
modification of the short-term spectrum by a filter bank.
The central frequencies and the bandwidths are inspired
1­4244­0469­X/06/$20.00 ©2006 IEEE
Genetic algorithms (GA) were first proposed by Holland in
1975 [5] and became widely used in various disciplines as a
new mean of complex systems optimization. GAs most
attractive quality is certainly their aptitude to avoid local
minima. However, our study relies on another quality which
is the fact that GAs are an unsupervised optimization
method. So they can be used as an exploration tool, free to
find the best solution without any constraint.
In the first part, the speaker diarization task is presented and
the clustering system is described. Afterwards, we present
the genetic algorithm we used, followed by the experiment
we made and the obtained results. In the end, we will
present an analysis of the obtained filter bank on which
some spectrum properties for speaker discrimination will be
showed.
2. SPEAKER DIARIZATION
Speaker diarization is composed of two successive phases:
segmentation and clustering. The aim of the segmentation
phase is to detect the speaker changes. The clustering phase
associates all the segments produced by the same speaker
(i.e. the clusters). Speaker diarization is currently gaining
importance as a mean of indexing voluminous spoken data
accumulated, for archival use [6]. For both segmentation
I ­ 673
Authorized licensed use limited to: Telecom ParisTech. Downloaded on January 18, 2010 at 05:29 from IEEE Xplore. Restrictions apply.
ICASSP 2006
and clustering, the Bayesian Information Criterion (BIC) is
used. Speaker change detection is obtained during an abrupt
change of the BIC. This segmentation phase forms the
initialization of our adaptation process. Indeed, features are
extracted with the MFCC and compared by the BIC.
In the second phase, a hierarchical clustering also based on
the BIC allows to merge the segments having the highest
BIC difference.
This difference is defined as:
d (d + 1) ⎞
λ⎛
dBIC(Ci , C j ) = −Dr (Ci , C j ) + ⎜ d +
⎟ log n i + n j
2⎝
2 ⎠
With:
ni + n j
nj
n
Dr (Ci , C j ) =
.log Σij − i .log Σi −
.log Σ j
2
2
2
(
Initialisation
p(t)
Variation
Evaluation
)
Where Ci, Cj are two sequences of feature vectors
representing two clusters to merge; ni is the size of Ci and Σi
the covariance matrix estimated from the cluster data Ci.; λ
is a penalty coefficient usually fixed to 1.5.
The key idea of the proposed work is to optimize the filter
bank for a global improvement of this clustering phase.
3. GENETIC ALGORITHM
A genetic algorithm is an optimization method. Its aim is to
find the best values of system's parameters in order to
maximize its performances. The basic idea is that of "natural
selection", i.e. the principle of "the survival of the fittest". A
GA operates on a population of systems. In our application,
each individual of the population is a filter bank defined by
its bandwidth and center frequency parameters.
The algorithm used is called the Selection Evaluation
Variation (SEV) [8] and is illustrated in figure 2. To evolve
the desired filter bank, we consider a population p(t) of Np
filter banks undergoing a variation-evaluation-selection
loop, i.e. p(t+1) = S E V p(t). First, a random initialization
is done for each individual of the population p(0). The
variation operator V consists in a random variation of each
filter bank's parameters. The evaluation operator E is
defined problem specific, and is usually given in terms of a
fitness function. It consists in evaluating the performances
of each individual of the population. In our application, the
performance of a filter bank is given by the clustering error
rate of the entire clustering system using it. After evaluating
the performance of each individual filter bank, the selection
operator S selects the Ns best individuals. These individuals
are then cloned according to the evaluation results to
produce the new generation p(t+1) of Np filter banks.
Consequently of this selection process, the average of the
performance of the population tends to increase and in our
application adapted filter banks tend to emerge.
Selection
p(t+1)
Fig.2. The S.E.V. algorithm
4.
EXPERIMENTS AND RESULTS
This section presents the experiments we made and the
results we obtained. Our proposed feature extraction method
was evaluated on the speaker diarization task of the French
ESTER [9] Broadcast News Evaluation.
4.1. Database
The ESTER [9] corpus is composed of 40 hours of audio
data recorded from four different French radio broadcast
news shows. This corpus is very varied, containing natural
speech, telephonic interventions, speech on musical
background and all that with different recording qualities.
The audio files’ length is either of 1 hour or of 20 minutes.
The number of speaker involved in each file varies from 1 to
39.
The performance measure used for the speaker diarization
task is the official RT 2003 NIST1 metric. It is based on an
optimum "one to one" mapping of reference speaker IDs to
system output speaker IDs.
4.2. Evolution data
We used two different databases for the evolution
simulation. The first one is called "evolution base" and was
used for the filter banks’ evaluation and selection. It is
composed of four hours of representative data. The second
one, called "cross validation base", was used to evaluate the
generalization capability of our algorithm. It is composed of
eight hours of radio emission.
4.3. Simulation parameters
In this experiment, we used the S.E.V. algorithm to
optimize the bandwidth parameters of the filter banks. The
overlapping between filters was preserved by using the
following rule: "each filter begins at the middle of the left
adjacent one". Therefore the bandwidth parameters and the
previous rule impose the center frequency parameters to be
as follows:
1
http://www.nist.gov/speech/tests/rt/rt2003/spring/
I ­ 674
Authorized licensed use limited to: Telecom ParisTech. Downloaded on January 18, 2010 at 05:29 from IEEE Xplore. Restrictions apply.
B
C0 = 0
2
and
Ci +1 = Ci +
different from the Mel-scaled one (Figure 1). An analysis of
this filter bank will be presented further on in the article (see
section 7).
Bi +1
2
Ci and Bi being respectively the center frequency and the
bandwidth of the ith filter in the bank.
The bandwidth and center frequency are discrete parameters
varying from 0 to 255 and coding the [0 8000] Hz frequency
domain. Before starting the S.E.V. algorithm, a random
initialization of the bandwidths is done using a uniform
random distribution of 25 units (~ 780Hz).
For evolving the desired filter bank, the following algorithm
parameters have been used:
-
Population size Np: 50
Number of selected individuals Ns: 15
Variation operator: uniform random variation of 3 units
(~100Hz) for each bandwidth.
Evaluation method: clustering error rate.
Number of filters in the bank: 24
Feature vector dimension: 24
4.4. Evolution simulation
The evolution simulation was done using the previously
defined parameters. Figure 3 reports the evolution of the
clustering error rate on the evolution and cross validation
bases.
Frequency (Hz)
Fig.4. Filter bank obtained at the 24th generation
4.5. Comparative results
The evaluation of the obtained filter bank’s capacities was
done on two data bases used on the ESTER campaign. The
first one, called "Development Base" is given to all
participants to self evaluate their system. It's composed of
eight hours of radio emissions. The second one, called
"Evaluation Base", is also composed of eight hours of radio
emissions and its particularity is the fact that two hours of
this base come from two unknown radios. Results on this
base are evaluated by an independent organism. The
following table (Table 1) reports the clustering error rates
obtained from three filter banks: a linear scaled filter bank, a
Mel-scaled one and our filter bank obtained with the S.E.V.
algorithm.
Clustering error rate on the evolution base (%)
Linear scaled
filter bank
Mel scaled
filter bank
Obtained filter
bank
Generation number
Clustering error
rate on the
Development
Base (%)
Clustering error
rate on the
Evaluation Base
(%)
17.64
24.11
17.80
23.38
16.98
22.74
Clustering error rate on the cross validation base (%)
Table.1. Comparison of filter bank performances
As we can notice from this table, the obtained filter
outperforms the Mel-scaled filter bank and the linear scaled
one.
Generation number
Best filter bank of the current generation
Mel filter bank
Fig.3. Evolution of the clustering error rate
We can observe that the clustering error rate decreases on
the cross validation base until the 24th generation. The
degradation of the generalization capacity which follows
can be explained as an over adaptation phenomenon. The
filter bank obtained in the 24th generation is depicted in
Figure 4. We can observe that this filter bank appears very
5. FILTER BANK ANALYSIS
In order to interpret the obtained results an additional
experiment was made. It consists in repeating the evolution
simulation with different initializations of the population
p(0) and in comparing the obtained filter banks. For all
simulations, the initialization of the bandwidths of each
filter bank of the population was done using a uniform
random distribution of 25 units (~ 780Hz). The simulation
parameters and conditions were the same as those described
in section 4.3.
Figure 5 presents the filter banks obtained with three
different simulations. Their clustering error rates on the
I ­ 675
Authorized licensed use limited to: Telecom ParisTech. Downloaded on January 18, 2010 at 05:29 from IEEE Xplore. Restrictions apply.
Development Base are respectively 17.54 %, 17.14 % and
16.98 %. They all outperform the Mel-scaled filter bank and
the linear scaled one. In order to compare them, Figure 6
presents the bandwidth according to the center frequency of
the filters for each filter bank. We can observe similarities
between all of the obtained filters. Especially concerning the
presence of large filters centered on 1500, 5100, and
6500Hz.
Filter bank n°1
Filter bank n°2
filter banks are fit on some speaker-discriminative spectral
properties.
6.
CONCLUSION
In this paper, we proposed to use genetic algorithms to
optimize the filter bank of the MFCC feature extractor to the
speaker clustering task. The experiments we made showed
that the obtained filter bank outperforms the Mel-scaled one.
Furthermore, robustness of the obtained solutions according
to the initial conditions showed us some spectrum properties
which seem to be relevant for the speaker clustering task.
Our work perspectives will consist in evaluating these
speaker discriminative spectrum properties. For this we are
planning to design a new spectrum-based feature extractor
according to this knowledge.
7.
REFERENCES
Filter bank n°3
[1] Katagiri, S., Handbook of Neural Networks for Speech
Processing, Artech House, Boston, 2000.
[2] K. Torkkola, “Feature Extraction by Non-Parametric Mutual
Information Maximization”, Journal of Machine Learning
Research, Vol. 3, MIT Press, Cambridge, MA, pp. 1415-1438,
2003.
Frequency (Hz)
Fig.5. Filter bank obtained with different initial conditions
[3] M. Chetouani, M. Faundez, B. Gas and J.L. Zarader , "Nonlinear Speech Feature Extraction for Phoneme Classification and
Speaker Recognition", Nonlinear speech processing : Algorithms
and Analysis, Springer Verlag, 2005.
Bandwidth (Hz)
[4] H. P. Combrinck and E. C. Botha, “On The Mel-Scaled
Cepstrum”, Proceedings of the Seventh Annual South African
Workshop on Pattern Recognition, University of Pretoria, Pretoria,
1996.
[5] Holland, J. H., Adaptation in Natural and Artificial Systems,
University of Michigan Press, Ann Arbor, MI, 1975.
[6] Makhoul, J., Kubala, F., Leek, T., Liu, D., Nguyen, L.,
Schwartz, R., and Srivastava, A., “Speech and language
technologies for audio indexing and retrieval”, Proceedings of the
IEEE, Vol. 88, No. 8, IEEE Press US, Piscataway, pp. 1338-1353,
2000.
Center frequency (Hz)
Mel scale filter bank
Filter bank n°1
Filter bank n°2
Filter bank n°3
Mean of the filter banks n° 1, 2, 3
[7] Gravier, G. and Betser, M., Audioseg: "Audio Segmentation
Toolkit", 2005. http://www.irisa.fr/metiss/guig/index-en.html
Fig.6. Bandwidth according to center frequency
In addition, the presence of a large filter centered in 1500
Hz makes these filter banks very different from the Melscaled one (Figure 1). In fact, MFCCs are more adapted to
phoneme discrimination than to speaker discrimination. This
task may rely on other speech spectrum properties.
The robustness of the obtained solutions according to the
initial conditions allows us to consider that the obtained
[8] F. Pasemann, U. Dieckmann, and U. Steinmetz, "Evolving
Structure and Function of Neurocontrollers", Proceedings of the
1999 Congress on Evolutionary Computation Journal, IEEE Press
US, Piscataway, pp. 1937-1978, 1999.
[9] S. Galliano, E. Geoffrois, D. Mostefa, K. Choukri, J.F.
Bonastre, and G. Gravier, “The Ester Phase II Evaluation
Campaign for the Rich Transcription of French Broadcast News”,
Proceedings of Eurospeech/Interspeech'05, Lisbon, pp.1149-1152,
2005.
I ­ 676
Authorized licensed use limited to: Telecom ParisTech. Downloaded on January 18, 2010 at 05:29 from IEEE Xplore. Restrictions apply.