Speaker verification using i-Vectors

Andreas Nautsch
Hochschule Darmstadt
4. Short duration experiments & analysis
2. From Speech to i-vectors
(d) C = 1024. (e) Detection error tradeoff diagram comparing the False Match Rate (FMR) and the False Non-Match Rate (FNMR).
System             Enrolment (33×2.5s/subject)   Verification (5s/attempt)   EER
HMM                206.2%                        5.5%                        0.9%
i-vector-128/400     6.1%                        3.2%                        2.1%
i-vector-256/400     9.9%                        3.1%                        2.5%
i-vector-512/300    16.2%                        3.3%                        2.3%
Different speakers occupy different subspaces within a universal acoustic cluster, the universal background model (UBM). The subspace offset from this universal cluster describes the direction vector of a sample, which depends on the text verbalised by the speaker. To obtain speaker-specific features, these vectors are analysed for characteristic factors by iteratively training a factor analysis model. The factor-analysed features are referred to as identity vectors (i-vectors) [2]. Reference and probe i-vectors can then be compared, e.g. by the cosine distance between the two vectors.
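As an illustrative sketch (not the thesis implementation), the cosine comparison of a reference and a probe i-vector can be written as follows; the 400-dimensional random vectors are hypothetical stand-ins for extracted i-vectors:

```python
import numpy as np

def cosine_score(reference, probe):
    """Cosine similarity between a reference and a probe i-vector."""
    return float(np.dot(reference, probe)
                 / (np.linalg.norm(reference) * np.linalg.norm(probe)))

# Hypothetical 400-dimensional i-vectors standing in for extracted ones.
rng = np.random.default_rng(0)
ref = rng.standard_normal(400)
genuine_probe = ref + 0.1 * rng.standard_normal(400)  # same-speaker probe
impostor_probe = rng.standard_normal(400)             # different-speaker probe

assert cosine_score(ref, genuine_probe) > cosine_score(ref, impostor_probe)
```

A higher cosine score indicates a more likely genuine attempt; a decision threshold on this score yields the accept/reject verdict.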
Figure: Speech sample; left: raw values (changes in air pressure over time in s), right: frequency content (in kHz) after segment-wise Fourier transform.
Table: Real-time comparison of the model-based Hidden Markov Model (HMM) baseline system and the i-vector systems.
Figure: (a)-(d): EER performance of i-vector system configurations across the number of characteristic factors (i-vector dimension), factor analysis model training iterations, and UBM components (C); (e): detection error tradeoff comparison of the three best configurations on the development (dashed) and evaluation (solid) data sets.
(a) C = 128. (b) C = 256. (c) C = 512.
Characteristic information about a subject's voice can be obtained by extracting speech signal features. These acoustic features depend on both the speaker and the spoken text. Different speakers can be distinguished by comparing patterns across the whole frequency band, obtained by a segment-wise Fourier transform followed by the extraction of Mel-Frequency Cepstrum Coefficients (MFCCs).
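A minimal numpy/scipy sketch of this pipeline (the thesis used public community tools, not this code): frame the signal, take segment-wise FFT power spectra, pool them with a triangular mel filterbank, and decorrelate the log energies with a DCT:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: frame -> |FFT|^2 -> mel filterbank -> log -> DCT."""
    # 1. Segment-wise Fourier transform (power spectrum per windowed frame).
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hamming(n_fft)
    frames = np.stack([signal[i*hop:i*hop+n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # (frames, n_fft//2+1)

    # 2. Triangular mel filterbank, equally spaced on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(mel(0), mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)

    # 3. Log filterbank energies, DCT to decorrelate -> cepstral coefficients.
    energies = np.log(power @ fbank.T + 1e-10)
    return dct(energies, type=2, axis=1, norm='ortho')[:, :n_ceps]

# One second of a toy signal: two sinusoids standing in for speech.
sr = 16000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)
feats = mfcc(sig, sr=sr)
print(feats.shape)  # → (97, 13)
```

The frame length (512 samples at 16 kHz, i.e. 32 ms) and hop (10 ms) are illustrative assumptions; real front-ends typically add deltas and channel compensation on top.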
- Current speaker recognition research focuses on:
  - Text-independent scenarios, i.e. free speech
  - Various-duration samples, e.g. call-centre speech
  - Short-duration samples, e.g. random PINs
- Statistical comparison approaches:
  - Model-based: sound unit models for each subject
    ⇒ accurate on short samples, but high computational effort on long samples
  - Template-based: offset to a universal background model (UBM)
    ⇒ accurate on long samples, but ambiguous behaviour on short samples
The short-duration scenario comprises sequences of 5 spoken German digits between 0 and 9 (approx. 5 seconds) by 356 subjects, who were separated into 56 calibration and 300 evaluation subjects. Configuration evaluations on the number of UBM components, the number of characteristic factors, and the number of factor analysis model training iterations were first performed on the calibration subset with respect to the Equal Error Rate (EER) and then validated on the evaluation subset.
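The EER used in these evaluations can be computed from genuine and impostor comparison scores as the error rate at the threshold where the false match rate equals the false non-match rate. A generic sketch with synthetic Gaussian scores (not the evaluation tooling used for the thesis):

```python
import numpy as np

def eer(genuine, impostor):
    """Equal Error Rate: operating point where FMR and FNMR coincide."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    fmr = np.array([(impostor >= t).mean() for t in thresholds])  # false matches
    fnmr = np.array([(genuine < t).mean() for t in thresholds])   # false non-matches
    idx = np.argmin(np.abs(fmr - fnmr))
    return float((fmr[idx] + fnmr[idx]) / 2)

# Synthetic scores: genuine attempts score higher than impostor attempts.
rng = np.random.default_rng(1)
genuine = rng.normal(2.0, 1.0, 1000)
impostor = rng.normal(-2.0, 1.0, 1000)
print(round(eer(genuine, impostor), 3))
```

Sweeping the threshold over all observed scores is a simple exhaustive variant; production evaluations typically use ROC convex-hull or interpolation methods.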
1. Motivation

Speaker verification is becoming more important as a biometric key security solution in industrial, forensic, and governmental contexts, e.g. for data encryption on mobile devices or user validation in contact centres. Users seem to be annoyed by maintaining several PINs and passwords; biometric characteristics, which cannot be lost or forgotten, therefore bring substantial benefits with respect to usability [1].
- Applicable i-vector configurations positively validated:
  - 128 & 256 UBM components: 5 iterations & 400 factors
  - 512 UBM components: 10 iterations & 300 factors
- Applicable performance on the short-duration scenario with content constraints, i.e. random PINs
- i-vector systems outperform the HMM baseline system in real time, with up to 97% relative gains on subject enrolments and 44% relative gains on verifications, whereas HMMs have lower EERs
In order to preserve reproducibility, an ISO/IEC-standardised general biometric system design [3] was adapted to speaker verification purposes and implemented in Matlab. Furthermore, public speaker recognition community tools were used for speech signal processing, i-vector extraction, and evaluation measurements.
Figure: Speakers within acoustic space clusters (schematic, axes MFCC-1 vs. MFCC-2); left: Gaussian Mixture Model (GMM) cluster with 4 components of the figure 1 sample; right: intended differences between the UBM cluster and speaker-dependent subspaces, leading to characteristic offsets.
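The schematic corresponds to fitting a GMM over pooled acoustic features. As an illustrative sketch under assumed toy data (not the actual UBM training tools), a diagonal-covariance GMM trained with EM:

```python
import numpy as np

def fit_gmm(X, C=4, iters=50, seed=0):
    """Diagonal-covariance GMM via EM, a stand-in for UBM training."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, C, replace=False)]   # initialise means on data points
    var = np.ones((C, D)) * X.var(axis=0)
    w = np.full(C, 1.0 / C)
    for _ in range(iters):
        # E-step: responsibility of each component for each frame.
        logp = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1) + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances.
        nk = r.sum(axis=0) + 1e-10
        w = nk / N
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

# Toy "acoustic space": two speakers as offset 2-D Gaussian clouds.
rng = np.random.default_rng(0)
speaker_a = rng.normal([-1.5, 0.0], 0.5, (500, 2))
speaker_b = rng.normal([+1.5, 0.0], 0.5, (500, 2))
w, mu, var = fit_gmm(np.vstack([speaker_a, speaker_b]), C=4)
```

In the real system the UBM is trained on frames pooled over many speakers; each speaker's data then shifts the component means, and those shifts carry the speaker-specific information that factor analysis compresses into an i-vector.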
3. Research questions
- Is the i-vector approach extensible to short-duration scenarios with applicable performance?
- Do i-vector systems deliver new information to approaches that are known to perform well on short-duration samples?
- Are duration-dependent performance mismatches compensable?
Student
Andreas Nautsch
[email protected]
Figure: Speaker verification framework design based on ISO/IEC 19795-1:2006 [3].
Examiner
Prof. Dr. Christoph Busch
[email protected]
Associated company
atip GmbH
http://www.atip.de
Second examiner
Dr. Christian Rathgeb
[email protected]
Supervisor
Prof. Dr. Herbert Reininger
[email protected]
5. Fusion of baseline and i-vector systems
7. Duration-invariant i-vector score normalisation
Score-level system fusions were performed on the three i-vector systems, and between the HMM baseline system and the most promising i-vector system on the calibration subset, the 256-UBM-component i-vector system. By fusing only i-vector systems, significant gains could be observed, and the HMM+i-vector-256 fusion significantly outperformed all other systems in high-security applications, which are indicated by low false match rates (FMRs); there, lower false non-match rates (FNMRs) mean fewer rejections of genuine attempts.
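The poster does not spell out the fusion rule; a common score-level scheme, sketched here under that assumption, z-normalises each system's scores and takes a weighted sum. Two systems with independent errors then separate genuine from impostor trials better than either alone:

```python
import numpy as np

def fuse(scores_a, scores_b, weight=0.5):
    """Score-level fusion: z-normalise each system's scores, weighted sum."""
    za = (scores_a - scores_a.mean()) / scores_a.std()
    zb = (scores_b - scores_b.mean()) / scores_b.std()
    return weight * za + (1 - weight) * zb

def separation(scores, labels):
    """d'-style separation between genuine (1) and impostor (0) scores."""
    g, i = scores[labels == 1], scores[labels == 0]
    return (g.mean() - i.mean()) / np.sqrt(0.5 * (g.var() + i.var()))

# Synthetic scores of two systems on the same 1,000 trials.
rng = np.random.default_rng(2)
labels = np.repeat([1, 0], 500)
hmm_scores = 3.0 * labels + rng.standard_normal(1000)
ivec_scores = 3.0 * labels + rng.standard_normal(1000)
fused = fuse(hmm_scores, ivec_scores)

# Averaging independent noise sharpens the genuine/impostor separation.
assert separation(fused, labels) > separation(hmm_scores, labels)
assert separation(fused, labels) > separation(ivec_scores, labels)
```

The 50/50 weight is a placeholder; in practice fusion weights are trained on the calibration subset, e.g. by logistic regression.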
The National Institute of Standards and Technology (NIST) hosts the 2014 i-vector machine learning challenge. In this challenge, i-vectors of 414 subjects were supplied together with the corresponding sample durations (1-300 seconds). Varying sample durations are emphasised by NIST for the purpose of examining durational effects on speaker verification robustness.
Figure: Occurrence quantity of observed phones decreases exponentially with duration [5].
Figure: Detection error tradeoff comparison of the HMM baseline, i-vector, and fused systems.
- By using the additional information from the i-vector systems, the biometric performance in terms of EER improves by 44% relative
- Hasan et al. [5] introduced duration groups of 5, 10, 20, and 40 seconds, and full (>40 seconds)
- Adaptive Symmetric (AS) score normalisation compares reference and probe i-vectors to a development set of i-vectors
  - Mean and standard deviation are computed over the top-100 cohort comparisons for each side, reference and probe:
    S' = 1/2 * ((S − μ_reference)/σ_reference + (S − μ_probe)/σ_probe)
- New approach: duration-invariant AS normalisation (dAS)
  - Idea: only compare i-vectors of the same duration group
  - Reference vs. development set: i-vectors with the probe's duration
  - Probe vs. development set: i-vectors with duration >60 seconds
- Offline evaluation: 10×5-fold cross-validation
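A sketch of the AS normalisation formula with synthetic cohort scores (for dAS, each cohort would additionally be restricted to a single duration group as described above):

```python
import numpy as np

def as_norm(score, ref_cohort_scores, probe_cohort_scores, top=100):
    """Adaptive Symmetric (AS) normalisation of a raw comparison score.

    Mean and standard deviation are taken over the top-`top` scores of the
    reference side and the probe side against a development cohort.
    """
    ref_top = np.sort(ref_cohort_scores)[-top:]
    probe_top = np.sort(probe_cohort_scores)[-top:]
    return 0.5 * ((score - ref_top.mean()) / ref_top.std()
                  + (score - probe_top.mean()) / probe_top.std())

# Synthetic cohort scores for one reference and one probe i-vector.
rng = np.random.default_rng(3)
ref_cohort = rng.normal(0.0, 1.0, 500)    # reference vs development set
probe_cohort = rng.normal(0.2, 1.2, 500)  # probe vs development set
print(as_norm(0.9, ref_cohort, probe_cohort))
```

Because the two terms are averaged, the normalisation is symmetric: swapping reference and probe cohorts leaves the result unchanged.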
6. Score cross-entropy evaluation
For forensic purposes, recognition systems not only need to have high performance; it is also important to know the evidence that these systems provide. In terms of a binary recognition system, evidence is measured by score cross-entropy, where Bayesian thresholds are denoted with respect to the prior odds of genuine attempts, or likewise the cost of falsely accepting impostors. For comparability among various prior odds, Bayesian entropy error rates are normalised by the corresponding default entropy, at which the evidence equals a random guess. Furthermore, the system entropy, as total entropy H^tot_norm, can be separated into the minimum entropy H^min_norm, assuming a well-calibrated system, and a calibration loss [4].
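This evidence measure can be sketched as Brümmer's cost of log-likelihood-ratio scores, C_llr [4]; the Gaussian scores below are synthetic, chosen so that the LLRs are well calibrated (means at ±σ²/2):

```python
import numpy as np

def cllr(tar_llrs, non_llrs):
    """Cost of log-likelihood-ratio scores (Cllr), in bits:
    0 = perfect evidence, 1 = no better than a random guess."""
    return 0.5 * (np.mean(np.log2(1.0 + np.exp(-tar_llrs)))
                  + np.mean(np.log2(1.0 + np.exp(non_llrs))))

# Synthetic well-calibrated LLRs: N(+sigma^2/2, sigma^2) for genuine
# attempts and N(-sigma^2/2, sigma^2) for impostors.
rng = np.random.default_rng(4)
sigma2 = 8.0
tar = rng.normal(+sigma2 / 2, np.sqrt(sigma2), 10000)
non = rng.normal(-sigma2 / 2, np.sqrt(sigma2), 10000)
print(round(cllr(tar, non), 3))
```

A system whose scores carry no information (all LLRs zero) yields exactly C_llr = 1; the minimum entropy H^min_norm corresponds to evaluating C_llr after an optimal monotonic recalibration of the scores.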
Table: System performances: avg. EER, FMR100, minDCF

System     EER    FMR100   minDCF   Challenge
Baseline   2.56   5.15     0.428    0.336
AS         2.49   4.48     0.378    0.331
dAS        2.06   3.47     0.364    0.312
Table: Average cross-entropy comparison: all scores & duration groups

System     all    5s     10s    20s    40s    full
Baseline   0.89   0.95   0.93   0.92   0.89   0.86
AS         2.49   0.17   1.18   0.41   0.18   0.05
dAS        0.10   0.35   0.20   0.11   0.07   0.07

Figure: Detection error tradeoff diagram.
Figure: Evaluation on varying durations: baseline i-vector system, baseline+AS normalisation, and baseline+dAS normalisation, where NIST refers to the minimum detection error cost function (minDCF), the normalised Bayesian error rate at the NIST threshold.
8. Future perspectives
Figure: Entropy evaluation of the HMM baseline, i-vector, and fused systems at the Bayesian threshold of the 2014 NIST i-vector machine learning challenge, NISTη

- In general, fused systems provide more evidence
- The HMM+i-vector-256 fusion was observed to have the lowest score cross-entropy, and hence contains the highest amount of information: the lowest integral under the H^tot_norm curve as the cost of log-likelihood-ratio (LLR) scores, C_llr = 0.028
- Examining other i-vector comparators, e.g. probabilistic linear discriminant analysis (PLDA)
- Analysing the effect of other speech signal features, e.g. MHECs, ProsPols, or RASTA-PLPs
- Evaluating feature-level fusions, i.e. concatenating i-vectors stemming from different features
References

[1] A. Krupp et al.: Social Acceptance of Biometric Technologies in Germany: A Survey, IEEE BIOSIG, 2013.
[2] N. Dehak et al.: Front-End Factor Analysis for Speaker Verification, IEEE/ACM TASLP, 2011.
[3] ISO/IEC: Information technology — Biometric performance testing and reporting — Part 1: principles and framework, ISO/IEC 19795-1:2006(E), 2011.
[4] N. Brümmer: Measuring, refining and calibrating speaker and language information extracted from speech, Ph.D. dissertation, University of Stellenbosch, 2010.
[5] T. Hasan et al.: Duration Mismatch Compensation for i-Vector based Speaker Recognition Systems, IEEE ICASSP, 2013.