Robust Speaker Recognition over Varying Channels

Niko Brummer, Lukas Burget, William Campbell, Fabio Castaldo, Najim Dehak, Reda Dehak, Ondrej Glembek, Valiantsina Hubeika, Sachin Kajarekar, Zahi Karam, Patrick Kenny, Jason Pelecanos, Douglas Reynolds, Nicolas Scheffer, Robbie Vogt

JHU WS'08 RSR Team

Intersession Variability

• The largest challenge to practical use of speaker detection systems is channel/session variability.
• Variability refers to changes in channel effects between training and successive detection attempts.
• Channel/session variability encompasses several factors:
  – The microphones: carbon-button, electret, hands-free, array, etc.
  – The acoustic environment: office, car, airport, etc.
  – The transmission channel: landline, cellular, VoIP, etc.
  – The differences in speaker voice: aging, mood, spoken language, etc.
• On NIST SRE2008 interview speech: a different microphone in training and test gives about 3% EER; the same microphone in training and test gives under 1% EER.

Tools to Fight Unwanted Variability

• Joint Factor Analysis: M = m + Vy + Dz + Ux

Baseline System

• System based on Joint Factor Analysis.
[DET plot: miss probability vs. false-alarm probability, NIST SRE08 short2–short3, telephone speech in training and test]

NIST SRE Evaluations

• Annual NIST evaluations of speaker verification technology (since 1995) using a common paradigm for comparing technologies.
• All team members participated in the recent 2008 NIST evaluation.
• The JHU workshop provided a great opportunity to:
  – do common post-evaluation analysis of our systems,
  – combine and improve techniques developed by individual sites.
• Thanks to the NIST evaluations we have:
  – identified some of the current problems that we worked on,
  – a well-defined setup and evaluation framework,
  – baseline systems that we tried to extend and improve during the workshop.

Subgroups

• Diarization using JFA
• Factor Analysis Conditioning
• SVM–JFA and fast scoring
• Discriminative System Optimization

Diarization using JFA

• Problem statement:
  – Diarization is an important upstream process for real-world multi-speaker speech.
  – At one level, diarization depends on accurate speaker discrimination for change detection and clustering.
  – JFA and Bayesian methods have the promise of providing improvements to speaker diarization.
• Goals:
  – Apply diarization systems to summed telephone speech and interview microphone speech: a baseline segmentation/agglomerative-clustering system, a streaming system using speaker-factor features, and a new variational Bayes approach using eigenvoices.
  – Measure performance in terms of DER and of the effect on speaker detection.

Factor Analysis Conditioning

• Problem statement:
  – A single FA model is sub-optimal across different conditions, e.g. different durations, phonetic content and recording scenarios.
• Goals — explore two approaches:
  – Build FA models specific to each condition and robustly combine the multiple models.
  – Extend the FA model to explicitly model the condition as another source of variability.
SVM–JFA

• Problem statement:
  – The Support Vector Machine is a discriminative recognizer which has proved to be useful for SRE.
  – Parameters of generative GMM speaker models are used as features for a linear SVM (sequence kernels).
  – We know Joint Factor Analysis provides higher-quality GMMs, but using these as-is in SVMs has not been so successful.
• Goals:
  – Analysis of the problem.
  – Redefinition of SVM kernels based on JFA?
  – Application of JFA vectors to recently proposed and closely related bilinear scoring techniques which do not use SVMs.

Discriminative System Optimization

• Problem statement:
  – Discriminative training has proved very useful in speech and language recognition, but has not been investigated in depth for speaker recognition.
  – In both speech and language recognition, the classes (phones, languages) are modeled with generative models, which can be trained with copious quantities of data.
  – But in speaker recognition, our speaker GMMs have at best a few minutes of training, typically of only one recording of the speaker.
• Goals:
  – Reformulate the speaker recognition problem as binary discrimination between pairs of recordings which can be (i) of the same speaker, or (ii) of two different speakers.
  – We then have lots of training data for these two classes and can afford to train complex discriminative recognizers.

Relevance MAP Adaptation

• Illustration: 2-D features, single-Gaussian model; only the mean vector(s) are adapted.
[Figure: UBM adapted toward the target speaker's enrollment data; test data scored against the target model and the UBM]

Intersession Variability

[Figures: the target speaker model and UBM shift between enrollment and test sessions, so the test data no longer fits the enrolled model]

Intersession Compensation

• For recognition, move both models along the high inter-session-variability direction(s) to best fit the test data (e.g. in the ML sense).

Joint Factor Analysis Model

• A probabilistic model proposed by Patrick Kenny.
• The speaker model is represented by the mean supervector M = m + Vy + Dz + Ux, where:
  – U is a subspace with high intersession/channel variability (eigenchannels),
  – V is a subspace with high speaker variability (eigenvoices),
  – D is a diagonal matrix describing the remaining speaker variability not covered by V,
  – Gaussian priors are assumed for the speaker factors y, z and the channel factors x.
[Figure: 3-D space of model parameters (e.g. a 3-component GMM with 1-D features), showing eigenvoice directions v1, v2, eigenchannel directions u1, u2 and diagonal directions d11, d22, d33]
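To make the decomposition concrete, here is a minimal numpy sketch that assembles a session-dependent mean supervector from the JFA factors. The dimensions are toy values and all matrices are random stand-ins for trained hyperparameters, not the workshop systems' actual configuration:

```python
import numpy as np

# Toy dimensions: C Gaussians x F features gives the supervector size.
C, F = 64, 39
D_sv = C * F           # supervector dimension
R_v, R_u = 300, 100    # ranks of the speaker (V) and channel (U) subspaces

rng = np.random.default_rng(0)
m = rng.standard_normal(D_sv)           # UBM mean supervector (stand-in)
V = rng.standard_normal((D_sv, R_v))    # eigenvoices
U = rng.standard_normal((D_sv, R_u))    # eigenchannels
d = np.abs(rng.standard_normal(D_sv))   # diagonal of D (residual speaker variability)

# Factors: y and z are tied to the speaker; x changes with every session.
y = rng.standard_normal(R_v)    # speaker factors
z = rng.standard_normal(D_sv)   # common (residual) speaker factors
x = rng.standard_normal(R_u)    # channel factors

# M = m + Vy + Dz + Ux: the session-dependent mean supervector.
M = m + V @ y + d * z + U @ x
```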
Working with JFA

• Enrolling a speaker model:
  – Given enrollment data and the hyperparameters m, Σ, V, D and U, obtain MAP point estimates (or posterior distributions) of all factors x, y, z.
  – Most of the speaker information is in the low-dimensional vector y; less is in the high-dimensional vector z; x should contain only channel-related information.
• Test:
  – Given fixed (distributions of) speaker-dependent factors y and z, obtain new estimates of the channel factors x for the test data.
  – The score for a test utterance is the log-likelihood ratio between the UBM and the speaker model defined by the factors x, y, z.
• Training the hyperparameters:
  – The hyperparameters m, Σ, V, D and U can be estimated from training data using an EM algorithm.
  – Posterior distributions of the "hidden" factors x, y, z and the hyperparameters are alternately estimated to maximize the likelihood of the training data.
  – The distributions of the speaker factors y, z are constrained to be the same for all segments of the same speaker, while the channel factors x may differ for every segment.

Flavors of JFA

• Relevance MAP adaptation: M = m + Dz with D² = Σ/τ, where Σ is the matrix with the UBM variance supervector on its diagonal.
• Eigenchannel adaptation (SDV, BUT):
  – Relevance MAP for enrolling the speaker model.
  – Adapt the speaker model to the test utterance using U estimated by PCA.
• JFA without V, with D² = Σ/τ (QUT, LIA).
• JFA without V, with D trained from data (CRIM):
  – Can be seen as training a different τ for each supervector coefficient; effective relevance factor τ_eff = trace(Σ)/trace(D²).
• JFA with V (CRIM).
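The relevance-MAP flavor above amounts to a per-Gaussian interpolation between the enrollment data and the UBM, controlled by the relevance factor τ. A minimal sketch, assuming frame posteriors are already computed; τ = 16 is a common choice, not a value taken from these slides:

```python
import numpy as np

def relevance_map_means(m_ubm, frames, posteriors, tau=16.0):
    """Relevance-MAP adaptation of GMM means only (sketch).

    m_ubm:      (C, F) UBM component means
    frames:     (T, F) enrollment feature frames
    posteriors: (T, C) per-frame component occupation probabilities
    tau:        relevance factor (D^2 = Sigma / tau in JFA terms)
    """
    n = posteriors.sum(axis=0)                 # zero-order statistics, (C,)
    f = posteriors.T @ frames                  # first-order statistics, (C, F)
    e = f / np.maximum(n, 1e-10)[:, None]      # per-component data means
    alpha = (n / (n + tau))[:, None]           # adaptation coefficients
    return alpha * e + (1.0 - alpha) * m_ubm   # interpolate with the UBM
```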
Flavors of JFA — Results

[DET plot, SRE 2006, all trials (det1), comparing: no JFA; eigenchannel adaptation; JFA with d² = Σ/τ; JFA with d trained on data; JFA with eigenvoices]
• Full JFA significantly outperforms the other JFA configurations.

Diarization Group
Douglas Reynolds, Patrick Kenny, Fabio Castaldo, Ciprian Costin

Roadmap

• Introduction: problem definition; experiment setup
• Diarization systems: variational Bayes system; streaming and hybrid systems
• Analysis and conclusions

Diarization: Segmentation and Clustering

• Determine when a speaker change has occurred in the speech signal (segmentation).
• Group together speech segments from the same speaker (clustering).
• Prior speaker information may or may not be available.
[Figure: two-speaker audio — where are the speaker changes, and which segments are from the same speaker (Speaker A vs. Speaker B)?]

Diarization Applications

• Diarization is used as a pre-process for other downstream applications.
• Human consumption:
  – Annotate a transcript with speaker changes/labels.
  – Provide an overview of speaker activity.
• Algorithm consumption:
  – Adaptation of a speech recognition system.
  – Application to speaker detection with multi-speaker speech: diarization splits the audio, each part is scored by a single-speaker detector, and the maximum is taken as the utterance score.

Diarization Error Measures

• Direct measure — Diarization Error Rate (DER):
  – Optimal alignment of the reference and hypothesized diarizations.
  – The error is the sum of miss (speaker in reference but not in hypothesis), false alarm (speaker in hypothesis but not in reference) and speaker error (mapped reference speaker is not the same as the hypothesized speaker).
  – A time-weighted measure; emphasizes talkative speakers.
• Consumer measure — effect on a speaker detection system:
  – Determine the speaker detection error rate (e.g. EER) when using different diarization outputs.
  – Focus on NIST SRE 2008 data with a fixed detection system (a JFA GMM-UBM system).

Diarization Experiment Data

• Summed-channel telephone speech:
  – Use summed-channel data for test only (avoids the complication of extra clustering in training).
  – We can derive a reference for DER scoring using ASR transcripts from the separate channels (no-score for silence and speaker overlap).
  – Compare the use of diarization to two extremes — best case: use the reference diarization; worst case: no diarization.
• Interview microphone speech:
  – A single microphone recording captures both the interviewee (target) and the interviewer.
  – Avoid using the unrealistic side information about the location of interviewee speech provided in the NIST eval.
  – The reference for DER scoring comes from ASR transcripts of the lavaliere microphones.

Baseline System

• Three stages in the baseline system:
  – BIC-based speaker change detection,
  – full-covariance agglomerative clustering with a BIC stopping criterion,
  – iterative re-segmentation with GMM Viterbi decoding (train GMMs, Viterbi decode, refine the speaker data, iterate to the final diarization).
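The first stage of the baseline compares a single full-covariance Gaussian over a window against two Gaussians split at a candidate change point. A minimal sketch of that ΔBIC test, with a tunable penalty weight λ; the windowing, thresholding and clustering logic around it are omitted:

```python
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """BIC change-point score between adjacent segments (sketch).

    X, Y: (n1, d) and (n2, d) feature matrices for the two half-windows.
    Positive values favour the two-model hypothesis, i.e. a speaker change.
    Assumes both halves have enough frames for a full covariance estimate.
    """
    Z = np.vstack([X, Y])
    n1, n2, n = len(X), len(Y), len(Z)
    d = Z.shape[1]

    def logdet_cov(A):
        return np.linalg.slogdet(np.cov(A, rowvar=False))[1]

    # Log-likelihood gain from modeling the two halves separately...
    gain = 0.5 * (n * logdet_cov(Z) - n1 * logdet_cov(X) - n2 * logdet_cov(Y))
    # ...minus the BIC penalty for the extra mean and covariance parameters.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty
```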
Factor Analysis Applied to Diarization

• State-of-the-art speaker recognition systems use hundreds of speaker and channel factors:
  – Processing requires entire utterances; it can't be implemented incrementally.
• State-of-the-art diarization systems require lots of local decisions:
  – Very short (~1 s) speech segments.
  – Speaker segmentation: is this frame a speaker change point?
  – Agglomerative clustering: given two short segments, is the speaker the same?
• Proposed solution: Variational Bayes (VB)
  – Fabio Valente, "Variational Bayesian Methods for Audio Indexing", PhD dissertation, Eurecom, 2005.

Advantages of a Bayesian Approach

• EM-like convergence guarantees.
• No premature hard decisions as in agglomerative clustering:
  – This suggested a "soft clustering" heuristic which reduced the diarization error rate of the baseline system by almost 50%.
• In theory at least, Bayesian methods are not subject to the over-fitting that maximum-likelihood methods are prone to:
  – Bayesian model selection is a quantitative version of Occam's razor (David MacKay).
  – It ought to be possible to determine the number of speakers in a file without resorting to BIC-like fudge factors (Fabio Valente).

Eigenvoice Speaker Model

• For diarization we use only the eigenvoice component of factor analysis: s = m + Vy, with y ~ N(0, I).
• A supervector s is the concatenation of the mean vectors in a speaker-dependent Gaussian mixture model.
• The supervector m is speaker-independent.
• The matrix V is of low rank:
  – The columns of V are the eigenvoices.
  – The entries of y are the speaker factors.
• This is a highly informative prior on speaker-dependent GMMs.
• Adding eigenchannels doesn't help in diarization (so far).

Variational Bayes Diarization

• Assume 2 speakers and uniformly segment the file into 1-second intervals:
  – This restriction can be removed in a second pass.
• Alternate between estimating two types of posterior distribution until convergence:
  – segment posteriors (soft clustering),
  – speaker posteriors (the location of the speakers in the space of speaker factors).
• Construct GMMs for each speaker and re-segment the data; iterate as needed.
[Figures: the segment-posterior and speaker-posterior update equations]

Variational Bayes Diarization — Details

• Begin: extract Baum-Welch statistics from each segment.
• On each iteration:
  – For each speaker: synthesize Baum-Welch statistics by weighting the per-segment statistics by the corresponding posteriors, then update the posterior distribution of the speaker factors.
  – For each segment: update the segment posteriors for each speaker.
• End: construct GMMs for each speaker, re-segment the data, and iterate.
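For intuition only, here is a deliberately simplified stand-in for the alternating updates. This is not Kenny's actual VB algorithm, which works on per-segment Baum-Welch statistics under the eigenvoice prior; here each segment is reduced to a point-estimate speaker-factor vector and each speaker to a spherical Gaussian in that space:

```python
import numpy as np

def vb_like_diarization(Y, n_speakers=2, n_iter=20, noise_var=1.0):
    """Toy alternating soft-clustering in speaker-factor space.

    Y: (n_segments, R) point-estimate speaker factors, one per ~1 s segment.
    Returns soft segment posteriors q of shape (n_segments, n_speakers).
    """
    n, _ = Y.shape
    rng = np.random.default_rng(0)
    q = rng.dirichlet(np.ones(n_speakers), size=n)   # segment posteriors

    for _ in range(n_iter):
        # "Speaker posterior" step: posterior mean of each speaker's location
        # under an N(0, I) prior, weighting segments by their posteriors.
        counts = q.sum(axis=0)                             # (S,)
        mu = (q.T @ Y) / (counts[:, None] + noise_var)     # MAP-style shrinkage

        # "Segment posterior" step: responsibilities under spherical Gaussians.
        d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(-1) / (2 * noise_var)
        logp = -d2
        logp -= logp.max(axis=1, keepdims=True)
        q = np.exp(logp)
        q /= q.sum(axis=1, keepdims=True)
    return q
```

No hard assignments are made until the very end, which is exactly the property that motivated the soft-clustering heuristic above.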
Experiment Configuration

• Features used for Variational Bayes:
  – Un-normalized cepstral coefficients c0, …, c12.
  – Including c0 was a lucky bug.
• Features used in the baseline system and in the re-segmentation phase of Variational Bayes:
  – The 39-dimensional feature set optimized by Brno for speaker recognition: cepstral coefficients c0, …, c12 + first, second and third order derivatives + Gaussianization + HLDA.
• Factor analysis configuration for Variational Bayes:
  – Universal background model with 512 Gaussians.
  – 200 speaker factors, no channel factors.
  – V matrix scaled by 0.6.
• Test set: the summed-channel telephone data provided by NIST in the 2008 speaker recognition evaluation — 2215 files (~200 hours).
• NIST Diarization Error Rate used to measure performance; ground-truth diarization is available.

Experiment Results — NIST 2008 Summed-Channel Telephone Speech

| # | Diarization system                   | Mean DER (%) | Std DER (%) |
|---|--------------------------------------|--------------|-------------|
| 1 | Baseline, BW + Viterbi               | 6.8          | 12.3        |
| 2 | VB                                   | 9.1          | 11.9        |
| 3 | VB, BW + Viterbi                     | 4.5          | 8.5         |
| 4 | VB, BW + Viterbi, 2nd pass           | 3.8          | 7.6         |
| 5 | Baseline, soft-cluster, BW + Viterbi | 3.5          | 8.0         |

• VB = variational Bayes; BW = Baum-Welch training of the speaker GMMs; Viterbi = re-segmentation with the speaker GMMs.
• The second pass in VB uses a non-uniform segmentation provided by the first pass.
• Compared to the baseline, soft clustering achieves a 50% reduction in error rate.

Streaming System — LPT Diarization System*

• Main ideas:
  – Use the eigenvoice model to create a stream of speaker factors y_t computed on sliding windows.
  – Perform segmentation and clustering with these new features.
• Eigenvoice model: s = m + Vy, y ~ N(0, I).
* Based on Castaldo, F.; Colibro, D.; Dalmasso, E.; Laface, P.; Vair, C., "Stream-based speaker segmentation using speaker factors and eigenvoices", ICASSP 2008.

Streaming System

[Diagram: feature extraction → audio slices → streaming factor analysis → per-slice GMMs → slice clustering]

Streaming System — Stream Factor Analysis

[Diagram: feature frames x1…x12 are windowed into speaker-factor vectors Y1, Y2, …; Viterbi segmentation and clustering then create the speaker GMMs]
[Figure: first 2 dimensions of the y stream]

Streaming System — Slice Clustering

• A GMM model is created for each slice (e.g. 60 s).
• Last step: cluster the GMMs created for the slices.
• The system decides whether GMMs come from the same or different speakers using an approximation of the Kullback-Leibler divergence between GMMs: if the minimum KL divergence to the existing models exceeds a threshold λ, a new speaker model is created; otherwise the slice is merged into the closest model.

Hybrid Clustering

• Speaker factors work well in the streaming diarization system.
• Experiments done during the workshop showed that the cosine distance between speaker factors produces low speaker detection errors.
• So we modified the baseline system using these new ideas.
• Hybrid clustering: replace the classical clustering with speaker factors and cosine distance.

Hybrid Clustering — Different Approaches

(Both approaches share the cosine-distance merge rule, sketched after the level-cutting figure below.)
• First approach — level cutting:
  – Stop the agglomerative clustering at a certain level and compute speaker factors for each cluster.
  – Merge the clusters that have the maximum similarity with respect to the cosine distance.
  – Iterate until only two clusters remain.
• Second approach — tree searching:
  – Build the agglomerative clustering up to the top level.
  – Select the nodes that have a number of frames above a threshold.
  – Merge the clusters that have the maximum similarity with respect to the cosine distance.
  – Iterate until only two clusters remain.

Hybrid Clustering — Level Cutting

[Diagram: the agglomerative clustering tree is cut at a fixed level; speaker factors Y1…Y5 are computed per cluster and re-clustered by cosine distance]
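A minimal sketch of the merge rule shared by both hybrid approaches; averaging the cluster factor vectors is a simplification (the real system would re-estimate y from the pooled statistics of the merged clusters):

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_until_two(factors):
    """Greedily merge clusters by cosine similarity of their speaker factors.

    factors: list of 1-D speaker-factor vectors, one per initial cluster.
    """
    clusters = [np.asarray(f, dtype=float) for f in factors]
    while len(clusters) > 2:
        # Find the most similar pair of clusters...
        best, pair = -np.inf, (0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine_sim(clusters[i], clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        # ...and merge it (simplified: average the factor vectors).
        i, j = pair
        merged = (clusters[i] + clusters[j]) / 2.0
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```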
Hybrid Clustering — Tree Searching

[Diagram: clustering tree with per-node frame counts (550, 340, 210, 180, 110, 70, 60, 50, …); nodes above the threshold of 100 frames are selected as clusters (Y) for speaker-factor re-clustering]

Results on Summed Telephone Speech (2213 audio files)

| System                          | Min DER (%) | Max DER (%) | Std dev (%) | Avg DER (%) |
|---------------------------------|-------------|-------------|-------------|-------------|
| Streaming system                | 0.0         | 53.2        | 8.8         | 4.6         |
| Baseline diarization system     | 0.0         | 57.2        | 12.3        | 6.8         |
| Hybrid system 1 (level cutting) | 0.0         | 67.0        | 14.6        | 17.1        |
| Hybrid system 2 (tree search)   | 0.0         | 63.2        | 13.6        | 6.8         |

DER vs. EER — Summed Telephone Speech

• There is some correlation of DER with EER.
• Systems with DER < 10% have comparable EERs.
• No clear knee in the curve: we still see EER gains (over doing nothing) with a relatively poor DER = 20% system.
[Scatter plot: EER (%) from 8 to 15 vs. DER (%) from 0 to 40]

DER vs. EER — Summed Telephone Speech (low-DER systems)

• Trends are unclear among the low-DER systems (VB + 2nd pass and baseline + soft-cluster).
• DER may be too coarse a measure for predicting effects on EER.
[Scatter plot: EER (%) from 9 to 10 vs. DER (%) from 0 to 10 for the Ref, LPT, BL, BL + soft-cluster, VB, VB + 2nd pass and Hybrid systems]

Interview Speech

• Interview speech differs from telephone speech in two main aspects:
  – Audio quality is much more variable across the various microphones.
  – Conversations are dominated by the interviewee.
• DER for do-nothing diarization (a single speaker for all time): telephone 35%; interview 11%.
• The next challenge is to apply the diarization systems to this new domain, avoiding the idealistic assumptions and knowledge used in the NIST eval.
• With no diarization in train or test: EER = 10.9%; with ideal diarization in train and test: EER = 5.4%.

Conclusion

• Implemented a variational Bayes diarization system using both segment- and speaker-posterior optimization.
• Used the speaker-factor model in three speaker diarization systems: streaming, VB, and hybrid.
• Demonstrated the effectiveness of soft clustering for improving speaker diarization.
• Produced low diarization error rates (3.5–4.5%) for telephone speech.
• New challenges await in the interview-speech domain: microphones and conversational patterns.

Speaker Recognition: Factor Analysis Conditioning (13 August 2008)
Sub-team: Sachin Kajarekar (SRI), Elly Na (GMU), Jason Pelecanos (IBM), Nicolas Scheffer (SRI), Robbie Vogt (QUT)

Overview

• Introduction
• A Phonetic Analysis
• Combination Strategies
• Within-Session Variability Modeling
• Hierarchical Factor Analysis
• Review
Introduction

• Problem statement:
  – A single FA model is sub-optimal across different conditions, e.g. different durations, phonetic content and recording scenarios.
• Goals — explore two approaches:
  – Build FA models specific to each condition and robustly combine the multiple models.
  – Extend the FA model to explicitly model the condition as another source of variability.
• Results and outcomes:
  – A conditioned FA model can provide improved performance, but score-level combination may not be the best way.
  – Including within-session factors in an FA model can reduce the sensitivity to utterance-duration and phonetic-content variability.
  – Stacking factors across conditions or data subsets can provide additional robustness.
  – Hierarchical modeling for factor analysis shows promise.
  – The approach is applicable to other condition types: languages, microphones, …

Introduction — Speech Partitioning: An Overview

[Diagram: train data ("w", "ah1", "n") and test data ("w", "ow", "d") are partitioned by phoneme; matching phoneme classes (I, II, III) are compared in feature space]

A Phonetic Analysis — Effect of Phonetic Mismatch

• How does the difference between the content in enrollment and test change the resulting performance?

| Enroll \ Test | Vowel           | Consonant       |
|---------------|-----------------|-----------------|
| Vowel         | 4.50% / 0.0208  | 12.47% / 0.0537 |
| Consonant     | 10.72% / 0.0521 | 7.03% / 0.0336  |
(each cell: EER / min. DCF)

• This result, albeit an extreme example, demonstrates the challenge of mismatched phonetic content.
• This phenomenon is especially present for short-duration utterances.

A Phonetic Analysis — Performance vs. % of Speech

| Phoneme | Type     | % of speech | DET1 EER (%) | DET1 DCF | DET3 EER (%) | DET3 DCF |
|---------|----------|-------------|--------------|----------|--------------|----------|
| E       | vowel    | 18.93       | 12.16        | 0.0567   | 8.62         | 0.0419   |
| O       | vowel    | 10.71       | 14.57        | 0.0645   | 12.30        | 0.0558   |
| i       | vowel    | 6.85        | 16.73        | 0.0749   | 15.49        | 0.0696   |
| A:      | vowel    | 5.89        | 23.31        | 0.0876   | 21.79        | 0.0852   |
| n       | nonvowel | 5.44        | 19.08        | 0.0779   | 17.23        | 0.0730   |
| e:      | vowel    | 4.73        | 25.31        | 0.0917   | 22.92        | 0.0866   |
| k       | stop     | 4.49        | 25.56        | 0.0926   | 22.26        | 0.0868   |
| z       | sibilant | 4.25        | 29.73        | 0.0980   | 28.22        | 0.0971   |
| o       | vowel    | 3.01        | 25.53        | 0.0924   | 25.24        | 0.0926   |
| t       | stop     | 2.76        | 27.04        | 0.0956   | 24.92        | 0.0936   |
| s       | sibilant | 2.74        | 30.73        | 0.0965   | 27.63        | 0.0908   |
| f       | sibilant | 2.41        | 34.43        | 0.0998   | 31.42        | 0.0984   |
| j       | nonvowel | 2.38        | 25.00        | 0.0918   | 22.41        | 0.0862   |
| v       | sibilant | 2.35        | 33.66        | 0.1000   | 30.78        | 0.0992   |
| m       | nonvowel | 2.29        | 21.18        | 0.0835   | 18.63        | 0.0782   |
| S       | sibilant | 2.21        | 31.97        | 0.0959   | 31.74        | 0.0981   |
| l       | nonvowel | 1.99        | 30.05        | 0.0974   | 29.91        | 0.0955   |
A Phonetic Analysis — Performance vs. % of Speech

[Scatter plot: per-phoneme EER (%) (0–40) vs. % of speech (0–20), grouped by vowel, nonvowel, sibilant and stop classes]

A Phonetic Analysis — Fusion Analysis

Vowel fused with other classes:

| Phonemes | % of speech | EER (%) |
|----------|-------------|---------|
| E+n      | 24.37       | 7.96    |
| E+k      | 23.42       | 8.40    |
| E+z      | 23.18       | 8.35    |
| E+t      | 21.69       | 8.72    |
| E+l      | 20.92       | 8.56    |
| O+n      | 16.15       | 9.64    |
| O+k      | 15.20       | 10.89   |
| O+z      | 14.96       | 11.76   |
| O+t      | 13.47       | 10.85   |
| i+n      | 12.29       | 11.93   |
| i+z      | 11.10       | 14.41   |
| n+e:     | 10.17       | 14.40   |
| i+s      | 9.59        | 14.35   |
| A:+t     | 8.65        | 17.60   |
| n+j      | 7.82        | 13.71   |
| z+s      | 6.99        | 23.03   |
| o+t      | 5.77        | 19.34   |
| t+s      | 5.50        | 21.88   |
| f+v      | 4.76        | 26.72   |
| S+l      | 4.20        | 25.63   |

Vowel fused with vowel:

| Phonemes | % of speech | EER (%) |
|----------|-------------|---------|
| E+O      | 29.64       | 7.04    |
| E+i      | 25.78       | 7.58    |
| E+A:     | 24.82       | 8.94    |
| E+e:     | 23.66       | 8.55    |
| E+o      | 21.94       | 8.29    |
| O+i      | 17.56       | 9.42    |
| O+A:     | 16.60       | 11.32   |
| O+e:     | 15.44       | 11.32   |
| O+o      | 13.72       | 11.38   |
| i+A:     | 12.74       | 13.05   |
| i+e:     | 11.58       | 13.91   |
| A:+e:    | 10.62       | 17.23   |
| i+o      | 9.86        | 14.07   |
| A:+o     | 8.90        | 17.59   |
| e:+o     | 7.74        | 18.46   |

• Fusing all phoneme systems (83.43% of speech) gives EER 5.68%.
[Figure: fusion EER vs. % of speech for "vowel with others" and "vowel with vowel" pairs]

Combination Strategies — Context

• Conditioned factor analysis:
  – Multiple systems for multiple conditions; multiple subspaces (e.g. microphones).
• Current solution:
  – Select the best system for each condition.
  – Perform score-level combination (our baseline).
• How can we robustly gather information from these systems?
• We explore combination strategies in the model space.
• Candidate for the study: broad-phone classes:
  – Work in the speaker space instead of the channel space.
  – A small set of events for FA conditioning.
• Smaller system configuration (512 Gaussians, 120 eigenvoices, 60 eigenchannels).

Combination Strategies — Baseline Results

Results for different phone sets (DET 1, SRE'06):

| Phone set                           | % data | EER (%) | Min. DCF |
|-------------------------------------|--------|---------|----------|
| Vowels                              | 60     | 6.17    | 0.296    |
| Consonants                          | 40     | 7.91    | 0.391    |
| NonVowels                           | 15     | 10.70   | 0.502    |
| Sibilants                           | 15     | 14.14   | 0.647    |
| Stops                               | 10     | 15.27   | 0.685    |
| Vow. + NV + Sib. + Stop (4 classes) | 100    | 5.42    | 0.272    |
| Vow. + Cons. (2 classes)            | 100    | 5.20    | 0.262    |
| Baseline                            | 100    | 5.12    | 0.241    |

Thanks to S. Kajarekar and C. Richey, SRI International.

Combination Strategies — Stacked Eigenvectors

• In training, estimate different subspaces modeling the same kind of variability, e.g. different utterance lengths or different microphone sets.
• In practice:
  – Merge the supervectors generated by each subspace.
  – The new rank is the sum of the individual subspace ranks.
  – This can generate very large (and redundant) subspaces.
• Advantages:
  – No retraining during enrollment/recognition.
  – No need for labeled data for system selection.
  – Increased robustness of the system in both scenarios (correlation between the two subspaces).
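In code, stacking is just concatenation of the condition-specific subspaces; a minimal numpy sketch with random stand-ins for the trained matrices:

```python
import numpy as np

# V1 and V2: eigenvoice subspaces trained on two different conditions
# (e.g. two microphone sets); shapes are (supervector_dim, rank).
D_sv, r1, r2 = 64 * 39, 120, 120
rng = np.random.default_rng(0)
V1 = rng.standard_normal((D_sv, r1))
V2 = rng.standard_normal((D_sv, r2))

# Stacked subspace: its rank is the sum of the individual ranks.
V_stacked = np.hstack([V1, V2])        # (D_sv, r1 + r2)

# A single factor vector y now spans both conditions:
y = rng.standard_normal(r1 + r2)
offset = V_stacked @ y                 # = V1 @ y[:r1] + V2 @ y[r1:]
```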
Combination Strategies — Combining U and/or V?

• Stacking U's (channel) was successfully demonstrated (at NIST) for a large set of microphones.
• Stacking V's (speaker) is suitable for phone conditioning because:
  – Phonetic models can represent the speaker.
  – Precedents are P-GMM, MLLR systems, …

2-class stacking results (DET 1, SRE'06):

| System         | EER (%) | Min. DCF |
|----------------|---------|----------|
| Baseline       | 5.20    | 0.262    |
| Stacked Us     | 5.72    | 0.279    |
| Stacked Vs     | 5.14    | 0.243    |
| Stacked Us+Vs  | 5.31    | 0.258    |

Combination Strategies — Augmented Eigenvectors

• Again, train several subspaces on the same kind of variability.
• In practice:
  – The subspace rank is unchanged.
  – The model size increases.
  – The joint variability needs to be retrained.
  – Not extendable to more than 2–3 classes.
• Close to tied factor analysis: produces a single y, independent of the class.

Combination Strategies — Factor Analysis: Un-tying

• Augmented eigenvectors produce a common y for all conditions.
• In practice:
  – There is always a between-class error.
  – The error is averaged out by the ML algorithm.
• Instead, keep each speaker factor (y) from each class, with its error:
  – More parameters to describe a speaker.
  – Feed this input to a classifier.
  – Experiments with Gaussians as classes are promising.

Combination Strategies — Results

Results for different factor configurations:

| # classes | Configuration    | Method           | EER (%) | DCF   |
|-----------|------------------|------------------|---------|-------|
| 1         | Baseline         | Single system    | 5.12    | 0.241 |
| 1         | Baseline (x2 EV) | Single system    | 4.83    | 0.239 |
| 2         | Baseline         | 2-system fusion  | 5.20    | 0.262 |
| 4         | Baseline         | 4-system fusion  | 5.42    | 0.272 |
| 2         | Stacked V        | 2-system fusion  | 5.09    | 0.247 |
| 4         | Stacked V        | 4-system fusion  | 5.03    | 0.250 |
| 2         | Stacked V        | Single system    | 5.14    | 0.243 |
| 4         | Stacked V        | Single system    | 4.76    | 0.234 |
| 2         | Augmented        | Single system    | 13.40   | 0.573 |
| 2         | Augmented        | Retrained (tied) | 5.39    | 0.266 |
| 16        | Un-tied          | Gaussian         | 4.54    | 0.233 |

Within-Session Variability Modeling

• The characteristics of inter-session variability depend on session duration.
• This doesn't fit well with the JFA model:
  – U is capturing more than channel effects!
  – Speech content (phonetic information) averages out for long utterances but becomes significant for short utterances.
• Proposed solution: model within-session variability as well. Break utterances into smaller segments, each described by
  M = m + Vy + U_I x + U_W w + dz
  – U is split into inter-session (U_I) and within-session (U_W) parts.
  – x is held constant for a whole utterance, but we have many w's!
• In this work we chose to align our segments with OLPR transcripts, i.e. one w per phonetic event — approximately 10 per second, approximately 1000 in a NIST conversation side.
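A sketch of the extended model in the same toy-dimension style as the earlier JFA example; the point is that x is drawn once per utterance while every short segment gets its own w (all matrices are random stand-ins):

```python
import numpy as np

D_sv, R_v, R_i, R_w = 64 * 39, 300, 50, 10
rng = np.random.default_rng(0)
m = rng.standard_normal(D_sv)
V = rng.standard_normal((D_sv, R_v))
U_I = rng.standard_normal((D_sv, R_i))   # inter-session subspace
U_W = rng.standard_normal((D_sv, R_w))   # within-session subspace
d = np.abs(rng.standard_normal(D_sv))

y = rng.standard_normal(R_v)    # speaker factors (per speaker)
z = rng.standard_normal(D_sv)   # residual factors (per speaker)
x = rng.standard_normal(R_i)    # one x for the whole utterance

n_segments = 1000               # ~10 phonetic events/s over a conversation side
W = rng.standard_normal((n_segments, R_w))   # one w per segment

# One supervector per segment: M_s = m + Vy + U_I x + U_W w_s + dz
M_segments = (m + V @ y + U_I @ x + d * z)[None, :] + W @ U_W.T
```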
[Figures: the within-session model M = m + Vy + U_I x + U_W w + dz, built up subspace by subspace (V, then U_I, then U_W)]

Within-Session Variability Modeling — Single Phonetic Event

[Figures: supervector contribution of a single phonetic event across dimensions]

Within-Session Variability Modeling — Contribution over Varying Utterance Lengths

[Figure: within-session contribution across dimensions for varying utterance lengths]

Within-Session Variability Modeling — Results

| JFA model     | Subspace dims | 1conv EER / MinDCF | 20 s EER / MinDCF | 10 s EER / MinDCF |
|---------------|---------------|--------------------|-------------------|-------------------|
| U+V+D         | 50            | 3.10% / .0159      | 12.79% / .0561    | 20.21% / .0819    |
| U+V+D         | 60            | 3.03% / .0156      | 13.01% / .0562    | 20.31% / .0820    |
| U_Matched+V+D | 50            | 3.10% / .0159      | 12.20% / .0531    | 19.71% / .0814    |
| U_I+U_W+V+D   | 50 I + 10 W   | 2.97% / .0170      | 11.98% / .0541    | 19.67% / .0807    |

• Similar performance for full conversations.
• Modest gains with reduced utterance lengths, mostly in EER:
  – Better than matching U to the utterance length in most cases.
  – Good flexibility across utterance lengths for a single model!

Hierarchical Factor Analysis

[Diagram: low-complexity coarse-grain model vs. high-complexity fine-grain model]

Hierarchical Factor Analysis — Multi-grained Hybrid Model

• Such a model may compensate for session effects that cause both large regional variability and localized distortions.
• A multi-grained model may be structured so that the nuisance kernel subspace has reduced complexity (a reduced number of parameters) while preserving compensation impact.

Hierarchical Factor Analysis — Multi-grained GMM/Phone Model

NIST 2008 minimum DCF results:

| System                                   | Condition 7 | Condition 8 |
|------------------------------------------|-------------|-------------|
| Base system with NAP                     | 0.179       | 0.182       |
| Base system with multigrained NAP        | 0.175       | 0.166       |
| Broad-phone system with NAP              | 0.212       | 0.209       |
| Broad-phone system with multigrained NAP | 0.206       | 0.190       |

Thanks to Jiri Navratil (IBM) for the phonetic results.
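The NAP compensation used in these systems removes an estimated nuisance subspace from the SVM feature vectors. A minimal sketch, assuming the nuisance basis U is orthonormal:

```python
import numpy as np

def nap_project(phi, U):
    """Nuisance Attribute Projection (sketch).

    phi: (D,) SVM feature vector (e.g. a scaled GMM supervector)
    U:   (D, k) orthonormal basis of the nuisance (session) subspace
    Applies phi' = (I - U U^T) phi without forming the DxD projector.
    """
    return phi - U @ (U.T @ phi)
```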
Hierarchical Factor Analysis — Multi-stage FA Broad-Phone Model

NIST 2006 minimum DCF / EER results (DET 3):

| Phone type | Baseline, base  | Baseline, ZT-norm | Hierarchical, base | Hierarchical, ZT-norm |
|------------|-----------------|-------------------|--------------------|-----------------------|
| NonVowel   | 0.0888 / 24.04% | 0.0413 / 9.05%    | 0.0852 / 23.24%    | 0.0420 / 9.53%        |
| Sibilant   | 0.0988 / 30.28% | 0.0584 / 13.05%   | 0.0994 / 28.93%    | 0.0585 / 14.20%       |
| Stop       | 0.0993 / 33.33% | 0.0631 / 13.81%   | 0.0991 / 33.27%    | 0.0655 / 14.63%       |
| Vowel      | 0.0604 / 11.26% | 0.0201 / 3.97%    | 0.0482 / 10.29%    | 0.0206 / 3.91%        |
| Consonant  | 0.0839 / 20.48% | 0.0323 / 6.28%    | 0.0777 / 18.26%    | 0.0312 / 6.45%        |

• Fusion of the hierarchical systems with the baseline system gives modest improvements.

Review

• Results and outcomes:
  – A conditioned FA model can provide improved performance, but score-level combination may not be the best way, and automatic system selection may not be feasible.
  – Including within-session factors in an FA model can reduce the sensitivity to utterance-duration and phonetic-content variability.
  – Stacking factors across conditions or data subsets can provide additional robustness.
  – Hierarchical modeling for factor analysis shows promise.
  – The approach is applicable to other condition types: languages, microphones, …

Support Vector Machines and Joint Factor Analysis
Najim Dehak, Reda Dehak, Zahi Karam, and John Noecker Jr.

Outline

• Introduction
• SVM-JFA: GMM supervector space
• SVM-JFA: speaker-factor space
• Intersession compensation in the speaker-factor space: within-class covariance normalization; handling variability with SVMs
• SVM-JFA: combined factor space
• Importance of speaker and channel factors
• Conclusion

Introduction

• Joint Factor Analysis is the state of the art in speaker verification.
• Idea: combine discriminative and generative models.
• SVM-JFA operates in:
  – the speaker GMM supervector space,
  – the speaker-factor space,
  – combinations of factors.
• Intersession variability compensation in the speaker factors.

SVM in the GMM Supervector Space: s = m + Vy + Dz

SVM-JFA: Supervector Space

• Our starting point: a kernel between two GMM supervectors (Campbell '06):
  K(g_a, g_b) = Σ_i λ_i μ_{a,i}^T Σ_i^{-1} μ_{b,i}
• Project each utterance into a high-dimensional space by stacking the mean vectors from the JFA-adapted GMM, and use a KL-based kernel.
[Diagram: target and non-target utterances are JFA-adapted from the UBM to supervectors; an SVM is trained on target vs. non-target supervectors and used to score test utterances]
(Thanks to Douglas Reynolds for this slide.)
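This KL-based kernel reduces to an inner product between supervectors scaled by the UBM weights and covariances. A minimal sketch, assuming diagonal UBM covariances:

```python
import numpy as np

def kl_supervector_kernel(mu_a, mu_b, weights, sigma2):
    """KL-divergence-motivated linear kernel between adapted GMMs (sketch).

    mu_a, mu_b: (C, F) adapted component means of the two utterances
    weights:    (C,)   UBM mixture weights lambda_i
    sigma2:     (C, F) UBM diagonal covariances Sigma_i
    Computes k(a, b) = sum_i lambda_i * mu_ai^T Sigma_i^{-1} mu_bi.
    """
    # Equivalent to an inner product of scaled, stacked supervectors:
    scale = np.sqrt(weights[:, None] / sigma2)   # sqrt(lambda_i) * Sigma_i^(-1/2)
    phi_a = (scale * mu_a).ravel()
    phi_b = (scale * mu_b).ravel()
    return float(phi_a @ phi_b)
```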
JFA Configuration

• Gender-independent JFA.
• 2048 Gaussians, 60-dimensional features: 19 Gaussianized MFCCs + energy + delta + double delta.
• 300 speaker factors.
• 100 channel factors for telephone speech.
• Decoupled estimation of the eigenvoice matrix and the diagonal matrix (D).
• JFA hyperparameters are obtained on the MIXER and Switchboard databases.

SVM-JFA: Supervector Space — Results

Results on NIST 2006 and 2008 SRE, core condition, telephone–telephone data (EER):

| System             | 2006 English | 2006 All | 2008 English | 2008 All |
|--------------------|--------------|----------|--------------|----------|
| JFA: s=m+Vy        | 1.95%        | 3.01%    | 2.81%        | 5.58%    |
| JFA: s=m+Vy+dz     | 1.80%        | 2.96%    | 2.81%        | 5.69%    |
| SVM-JFA: s=m+Vy    | 4.24%        | 4.98%    | 5.10%        | 7.92%    |
| SVM-JFA: s=m+Vy+dz | 4.23%        | 4.92%    | 5.27%        | 8.13%    |

• JFA scoring is computed frame by frame.

SVM in the JFA Speaker-Factor Space: s = m + Vy

SVM-JFA: Speaker-Factor Space

• Use the speaker factors y rather than the GMM supervectors s (s = m + Vy; channel offset c = Ux).
• A low-dimensional space, so we can perform quick experiments.
• First we used only eigenvoice adaptation (D = 0).

SVM-JFA: Speaker-Factor Space — Kernels

• Inner-product kernel: k(y1, y2) = <y1, y2>
• Gaussian kernel: k(y1, y2) = exp(−||y1 − y2||² / (2σ²))
• Cosine kernel: k(y1, y2) = <y1, y2> / (||y1|| · ||y2||)

Results on NIST 2006 SRE, core condition (EER):

| System                  | English no-norm | English T-norm | English ZT-norm | All no-norm | All T-norm | All ZT-norm |
|-------------------------|-----------------|----------------|-----------------|-------------|------------|-------------|
| JFA                     | –               | –              | 1.95%           | –           | –          | 3.01%       |
| KL kernel, supervectors | 4.24%           | –              | –               | 4.98%       | –          | –           |
| Linear kernel           | 3.47%           | 2.93%          | –               | 4.64%       | 4.04%      | –           |
| Gaussian kernel         | 3.03%           | 2.98%          | –               | 4.59%       | 4.46%      | –           |
| Cosine kernel           | 3.08%           | 2.92%          | –               | 4.18%       | 4.15%      | –           |

Results on NIST 2008 SRE, core condition, telephone–telephone data (EER, all trials):

| System                  | No-norm | T-norm | ZT-norm |
|-------------------------|---------|--------|---------|
| JFA                     | –       | –      | 5.58%   |
| KL kernel, supervectors | –       | 7.92%  | –       |
| Linear kernel           | 7.06%   | 7.10%  | –       |
| Gaussian kernel         | 7.84%   | 7.42%  | –       |
| Cosine kernel           | 7.24%   | 7.24%  | –       |

Intersession Compensation in the Speaker-Factor Space — Within-Class Covariance Normalization (WCCN)

• Within-class covariance:
  W = (1/S) Σ_{s=1..S} (1/n_s) Σ_{i=1..n_s} (y_i^s − ȳ_s)(y_i^s − ȳ_s)^T
  where ȳ_s = (1/n_s) Σ_i y_i^s is the mean of the utterances of speaker s, S is the number of speakers, and n_s is the number of utterances of speaker s.
• The within-class covariance is calculated on the MIXER and Switchboard databases.

SVM-JFA: Speaker-Factor Space with WCCN

• Linear kernel: k(y1, y2) = y1^T W^{-1} y2
• Cosine kernel: k(y1, y2) = y1^T W^{-1} y2 / ( sqrt(y1^T W^{-1} y1) · sqrt(y2^T W^{-1} y2) )
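A minimal numpy sketch of both steps: estimating W from a labeled development set and scoring with the WCCN cosine kernel. The Cholesky factorization mirrors the chol(·) projection used for cosine-distance scoring later in this talk:

```python
import numpy as np

def wccn_matrix(Y, labels):
    """Within-class covariance of speaker-factor vectors (sketch).

    Y:      (N, R) one speaker-factor vector per utterance
    labels: (N,)   speaker identity of each utterance
    """
    R = Y.shape[1]
    speakers = np.unique(labels)
    W = np.zeros((R, R))
    for s in speakers:
        Ys = Y[labels == s]
        dev = Ys - Ys.mean(axis=0)
        W += dev.T @ dev / len(Ys)
    return W / len(speakers)

def wccn_cosine_kernel(y1, y2, W):
    """Cosine kernel with WCCN: k = y1^T W^-1 y2, length-normalized."""
    B = np.linalg.cholesky(np.linalg.inv(W))   # W^-1 = B B^T
    a, b = B.T @ y1, B.T @ y2                  # so that a.b = y1^T W^-1 y2
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```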
SVM-JFA: Speaker-Factor Space (WCCN) — Results

Results on NIST 2006 SRE, core condition, English trials (EER):

| Kernel        | Without WCCN, no-norm | Without WCCN, T-norm | With WCCN, no-norm | With WCCN, T-norm |
|---------------|-----------------------|----------------------|--------------------|-------------------|
| Linear kernel | 3.47%                 | 2.93%                | 2.87%              | 2.44%             |
| Cosine kernel | 3.03%                 | 2.98%                | 2.60%              | 2.45%             |

• WCCN gives roughly a 17% relative improvement; JFA scoring achieves 1.95%.

Handling Variability with SVMs — Good and Bad Variability

• Types of variability:
  – Good variability (inter-speaker): speaker information.
  – Bad variability (nuisance): session, channel.
• Approaches to handling variability: joint factor analysis, GMM + LFA, SVM + NAP, SVM + WCCN.

Handling Variability with SVMs — Motivation: Handling Nuisance

[Diagram: the SVM direction w is adjusted away from the estimated nuisance subspace U]

Handling Variability with SVMs — SVM Formulation

[Diagram: SVM margin with slack values ξ = 0, 1, 2, ∞ relative to the nuisance subspace U]

Handling Variability with SVMs — Results

• Using only 300 speaker factors (s = m + Vy); dimension of the nuisance subspace = 50.
• The results plot (lost in extraction) was annotated with relative improvements of 11% and 18%.

Handling Variability with SVMs — Future Work

• Beyond nuisance compensation: handle all variability, with a bias towards using inter-speaker variability and away from nuisance.
• Extend the formulation to full supervectors.

SVM-JFA: Speaker and Common Factor Space (s = m + Vy + Dz)

• Full joint factor analysis: s = m + Vy + Dz
  – y: speaker factors.
  – z: common factors (EER 6.23% on NIST 2006 SRE, English trials).
• How can we use speaker and common factors together with SVMs?
  – Score fusion.
  – Kernel combination.

Score Fusion vs. Kernel Combination

• Score fusion: a linear weighted score, S_F(x) = w_0 + Σ_{l=1..M} w_l S_l(x); the weights are computed on a development score dataset.
• Kernel combination: a linear weighted kernel, k_F(x, y) = Σ_{l=1..M} β_l k_l(x, y); no development dataset is needed for weight estimation.

Kernel Combination — Space

• A kernel in the speaker-factor (y) space and a kernel in the common-factor (z) space are combined into a new kernel function.

Kernel Combination — Training

• The linear kernel combination k_F(x, y) = Σ_l β_l k_l(x, y) is learned by multiple-kernel SVM training: find the weights β that maximize the margin of the large-margin classifier.

Results (EER):

| System             | 2006 English | 2006 All | 2008 English | 2008 All |
|--------------------|--------------|----------|--------------|----------|
| Cosine kernel on y | 2.34%        | 3.59%    | 3.86%        | 6.55%    |
| Cosine kernel on z | 6.26%        | 8.68%    | 10.34%       | 13.45%   |
| Linear score fusion| 2.11%        | 3.62%    | 3.23%        | 6.86%    |
| Kernel combination | 2.08%        | 3.62%    | 3.20%        | 6.60%    |
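A minimal sketch of the combination itself, with the weights assumed given; in the real system they are learned by the multiple-kernel SVM training described above, and the 0.8/0.2 values below are purely illustrative:

```python
import numpy as np

def combined_kernel(kernels, betas):
    """Linear kernel combination: k_F(u, v) = sum_l beta_l * k_l(u, v).

    kernels: list of callables k_l(u, v) -> float
    betas:   non-negative combination weights
    """
    def k_F(u, v):
        return sum(b * k(u, v) for k, b in zip(kernels, betas))
    return k_F

# Example: combine cosine kernels on the y (speaker) and z (common) factors.
def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

k_y = lambda u, v: cos(u["y"], v["y"])
k_z = lambda u, v: cos(u["z"], v["z"])
k = combined_kernel([k_y, k_z], [0.8, 0.2])   # illustrative weights
```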
Importance of Speaker and Channel Factors

• Gender-dependent JFA (female part).
• 2048 Gaussians, 60-dimensional features: 19 Gaussianized MFCCs + energy + delta + double delta.
• 300 speaker factors, 0 common factors, 100 channel factors for telephone speech.
• JFA hyperparameters are obtained on the MIXER and Switchboard databases.

Importance of Speaker and Channel Factors (cont.)

• M = m + Vy + Dz + Ux: apply intersession compensation in the speaker-factor space rather than the supervector space.
• Result: EER = 20% — oops: the channel factors contain information about the speaker.
• Three systems:
  – M = m + Vy + Ux
  – M = m + Vy
  – M = m + Tt
• V: low-rank matrix, eigenvoices (speaker variability, 300 dims).
• U: low-rank matrix, eigenchannels (channel variability, 100 dims).
• T: low-rank matrix containing the total variability (speaker and channel variability, 400 dims).

EER results on NIST 2006 SRE, core condition, English trials, female part (using the cosine kernel):

| Model           | EER   |
|-----------------|-------|
| M = m + Vy + Ux | 2.56% |
| M = m + Vy      | 2.74% |
| M = m + Tt      | 2.19% |

Conclusion

• SVM scoring in the speaker-factor space rather than the GMM supervector space:
  – Fairly good linear separation in the speaker-factor space.
  – Improves over SVM-JFA on supervectors.
  – Performance comparable to other scoring methods.
  – Allows for faster scoring.
• Generalized using combinations of factors.
• Further improvement using intersession compensation in the speaker-factor space.
• JFA as feature extraction.

SVM vs. Cosine Distance (CD) Scoring

• SVMs find a linear separation in the speaker-factor space.
• The score can be computed as a nearest-neighbor-style distance.
• Can we omit the SVM and use a different distance?

SVM vs. CD Scoring

• The idea is to compute the trial score as a cosine distance between the enrollment and test speaker models:
  score = (y_train^T y_test) / (||y_train|| · ||y_test||)
• Inspired by the SVM, normalization by the within-class covariance (WCCN) can be applied by projecting y ← chol(Σ_wc)^{-1} y before computing the score.

SVM vs. CD Scoring — Results in the y Space

• CD scoring applies gender-dependent ZT-norm; the SVM does not need any score normalization.

| System                       | 2006 det1 EER / DCF | 2006 det3 EER / DCF | 2008 det6 EER / DCF |
|------------------------------|---------------------|---------------------|---------------------|
| SVM w. cosine kernel, WCCN   | 2.38 / 1.27         | 3.62 / 1.80         | 6.59 / 3.23         |
| SVM w. cosine kernel         | 2.98 / 1.55         | 4.09 / 2.02         | 7.20 / 3.24         |
| Cosine dist w. WCCN, ZT-norm | 2.00 / 1.19         | 3.82 / 1.93         | 6.43 / 3.60         |
| Cosine dist w. ZT-norm       | 2.82 / 1.50         | 4.07 / 2.11         | 6.95 / 3.55         |

SVM vs. CD Scoring — Extending to the x, y, z Space

• Motivation:
  – z carries residual speaker information, which we want to use.
  – Ideally, there should be no speaker information in the x vector, as it expresses the channel shift.
  – We tried substituting the y vectors by x and z.
  – z was much worse than y, but still gave a reasonable result (approx. 9% EER on 2006 det1).
  – Surprisingly, using x gave around 25% EER on 2006 det1.
  – So let's train a linear fusion of all these systems.

SVM vs. CD Scoring — Results in the x, y, z Space

• CD scoring applies gender-dependent ZT-norm and a gender-dependent linear logistic regression fusion (for 2006, trained on 2008 and vice versa), using x, y and z.
• The SVM uses kernel combination with y and z only (no improvement from using x).

| System                         | 2006 det1 EER / DCF | 2006 det3 EER / DCF | 2008 det6 EER / DCF |
|--------------------------------|---------------------|---------------------|---------------------|
| SVM w. cosine kernel           | 2.98 / 1.55         | 4.09 / 2.02         | 7.20 / 3.24         |
| SVM w. cosine kernel — y,z     | 2.08 / 1.27         | 3.62 / 2.00         | 6.60 / 3.41         |
| Cosine dist w. ZT-norm         | 2.82 / 1.50         | 4.07 / 2.11         | 6.95 / 3.55         |
| Cosine dist w. ZT-norm — x,y,z | 2.11 / 1.26         | 3.62 / 1.87         | 6.24 / 3.29         |
SVM vs. CD Scoring — Conclusion

• CD scoring:
  – Positives: the scoring problem is symmetric; no training steps.
  – Negatives: ZT-norm is needed; poorer relative improvement on all trials (det1); possibly needs calibration.
• SVM:
  – Positives: generalizes well on all trials (det1); no need for score normalization.
  – Negatives: the SVM training procedure.

Discriminative Optimization of Speaker Recognition Systems
Lukas Burget & Niko Brummer, with lots of help from other team members, especially Ondrej Glembek, Najim Dehak and Valja Hubeika.

Discriminative Training — What Is New Here?

• Discriminative training of speaker models has been around for more than a decade, and SVM speaker modeling has been a constant feature at the NIST SRE evaluations since 2003. So what is new in this work?
• We propose to discriminatively optimize the whole speaker recognition system, rather than individual speaker models.

Traditional Discriminative Training

[Diagram: enrollment and test speech are feature-extracted; the per-speaker model estimation is discriminatively optimized; matching produces the score]

Current State of the Art

• Generative modeling via Joint Factor Analysis (ML optimization).
[Diagram: the system hyperparameters feed the model-estimation and matching stages]

Proposed Discriminative System Optimization

• Discriminatively optimize the system hyperparameters themselves.
• This methodology directly measures and optimizes the quality of the output of the whole system.
[Diagram: the discriminative optimization loop is closed over the match score, back to the system hyperparameters]

Discriminative Training — What Is New Here? (cont.)

• Typically we have a small amount of enrollment data for the target speaker, which disallows the use of standard discriminative techniques.
• We need to consider inter-session variability — an important problem in SRE.
• Only the recent data collections with the same speaker recorded over various channels allowed us to start work in this direction.

Discriminative System Optimization — Outline

• Motivation
• Envisioned advantages
• Challenges
• A few techniques to address these challenges
• Some preliminary experiments
Discriminative System Optimization — Motivation

• Several participants of this workshop have previous successful experience with similar training:
  – Discriminative training of weighted linear combinations of the outputs of multiple sub-systems (a.k.a. fusion) has been very successful in the last few NIST Speaker Recognition Evaluations (neural networks, SVMs, logistic regression).
  – Lukas and BUT were very successful with discriminative (MMI) training of GMMs in the similar task of language recognition in the last two NIST LREs.

Discriminative System Optimization — Envisioned Advantages

• Discriminative training can compensate for unrealistic generative modeling assumptions:
  – It could find hyperparameter estimates that give better accuracy than the ML estimates.
• Discriminative training can optimize smaller, simpler, faster systems to rival the accuracy of larger generatively trained systems:
  – In this workshop we concentrated on this aspect, with a few encouraging results.

Discriminative System Optimization — Challenges

• This is a difficult problem! In large LVCSR systems it took years for discriminative methods to catch up with generative ones.
• Complexity:
  – Computation of derivatives for optimization (gradient, Hessian) of complex systems.
  – Finding and coding good numerical optimization algorithms.
• Scale (CPU, memory):
  – Our current state-of-the-art systems can have tens of millions of parameters.
  – 1500 hours of training speech, or 250 million training examples.
• Overtraining (with up to millions of parameters).

Techniques — Computing Derivatives

• We tried a coding technique that automatically implements the chain rule for partial derivatives of function compositions:
  – Similar to back-propagation in neural networks; computationally equivalent to reverse-mode automatic differentiation.
  – It did not scale well for our problem: it involved multiplication of multiple Jacobian matrices of very large dimension.
• Our solution was to restrict our efforts to very simple system designs, for which the derivatives could be hand-coded and optimized.

Derivatives — Hand-Optimized

• Lukas hand-optimized a gradient calculation of 6 million components over 440 million training examples to run in 15 minutes on a single machine.
• This was made possible by:
  – Replacing the GMM log-likelihood calculation with a linear approximation (without significant performance loss).
  – Not doing ZT-norm (at some performance loss).

Techniques — Optimization Algorithms

• We investigated stochastic gradient descent (after an inspiring invited talk here at JHU by Yann LeCun):
  – It did not scale well in our computing environment; the hyperparameters were difficult to set; it is not obvious how to parallelize it over machines.
• We investigated MATLAB's optimization toolbox:
  – We tried the 'large scale' trust-region optimization algorithm; it did not scale well in time and space; it needs further investigation.
• Lukas was successful in his experiments with the Extended Baum-Welch algorithm.
• I was successful in my experiments with the RPROP algorithm (see http://en.wikipedia.org/wiki/Rprop).
• In both cases, we coded our own optimization algorithms in MATLAB for economy of scale.
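For reference, a sketch of one common RPROP variant; the exact variant and constants used in the workshop code are not specified in the slides:

```python
import numpy as np

def rprop_step(grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=1.0):
    """One RPROP update: adapt per-parameter step sizes from gradient signs.

    Returns (delta, new_step); parameters are then updated as theta -= delta.
    Only the sign of the gradient is used, which keeps the method cheap and
    robust for very large hyperparameter sets.
    """
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    # Where the gradient sign flipped, skip the update for this iteration.
    effective_grad = np.where(sign_change < 0, 0.0, grad)
    delta = np.sign(effective_grad) * step
    return delta, step
```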
Objective Function

• Our discriminative optimization objective function has many names: Maximum Mutual Information (MMI), minimum cross-entropy, logistic regression, …
  – This criterion optimizes classification error rates over wide ranges of priors and cost functions.
  – For linear systems, it gives a nice convex optimization objective.
  – It gives some protection against over-training.
  – It has been very successfully applied to fusion of sub-systems in the NIST SRE evaluations.

Overtraining

• I was optimizing 90 000 parameters and Lukas 6 million.
• This allows the training to learn irrelevant detail of the training data (even though we used hundreds of millions of training examples).
• We both managed to optimize EER << 1% on the development data (Switchboard, SRE'04+05) if we allowed the training to go too far.
• These overtrained systems did not generalize to good performance on independent test data (SRE'06+08).

Regularization to Combat Overtraining

• We used early stopping to combat overtraining: just stop training when performance on a cross-validation set stops improving.
• We hope to apply more principled approaches in the future:
  – adding SVM-style regularization penalties, or
  – more general Bayesian methods with appropriate priors on the hyperparameters.
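A minimal sketch of the criterion written as a prior-weighted binary cross-entropy over trial scores (one standard formulation; up to scaling constants this is the logistic-regression objective referred to above):

```python
import numpy as np

def cross_entropy_objective(scores, labels, prior=0.5):
    """Prior-weighted cross-entropy / logistic-regression criterion (sketch).

    scores: (N,) detection log-likelihood ratios for the training trials
    labels: (N,) 1 for same-speaker trials, 0 for different-speaker trials
    Lower is better.
    """
    llr = scores + np.log(prior / (1.0 - prior))
    tgt = labels == 1
    softplus = lambda x: np.logaddexp(0.0, x)   # log(1 + e^x), computed stably
    c_tgt = softplus(-llr[tgt]).mean()          # cost on target trials
    c_non = softplus(llr[~tgt]).mean()          # cost on non-target trials
    return prior * c_tgt + (1.0 - prior) * c_non
```

Early stopping then just tracks this objective (or EER) on a held-out cross-validation set and halts when it stops improving.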
Proof-of-Concept Experiments

• Niko: a smaller-scale experiment using 300-dimensional y-vectors for train and test, training 90 000 parameters.
• Lukas: larger-scale experiments using a 300-dimensional y-vector for train and a 20 000-dimensional statistic for test, training 6 million parameters.

Small-Scale Experiment

• Within-class covariance-normalized dot product between the y-vectors for train and test.
• A generative (ML) covariance estimate gives, on a subset (English females) of SRE 2006: EER = 2.61%.
• Discriminative retraining of the covariance gave an 11% relative improvement: EER = 2.33%.

Large-Scale Experiment 1

• Pure eigenvoice system (only V; no U and D); GMM with 512 components; 39-D features.
• The V matrix is trained discriminatively (300 × 20k parameters); the original speaker factors y are kept fixed.

SRE 2006, all trials (det1), EER [%]:

| System                                      | No norm | ZT-norm |
|---------------------------------------------|---------|---------|
| Generative V                                | 15.44   | 11.42   |
| Discriminative V                            | 7.19    | 5.06    |
| Discriminative V with channel-compensated y | 6.80    | 4.81    |
| Generative V and U                          | 6.99    | 4.07    |

Large-Scale Experiment 2

• Channel-compensated system (V and U; no D); only the V matrix is trained discriminatively; the original speaker factors y are kept fixed.

SRE 2006, all trials (det1), EER [%]:

| System                          | No norm | ZT-norm |
|---------------------------------|---------|---------|
| Generative V and U              | 6.99    | 4.07    |
| Discriminative V (generative U) | 6.00    | 3.87    |

Next Steps

• Re-estimation of other hyperparameters (e.g. U).
• Iterative re-estimation of both the hyperparameters and the factors.
• Direct optimization of the ZT-normalized system (the derivatives are difficult to compute).

Conclusion

• This is a large and difficult problem, but it has the potential of worthwhile gains: the possibility of more accurate, yet faster and smaller, systems.
• We have managed to show some proof of concept, but so far without improving on the state of the art.
• The remaining problems are practical and theoretical: the complexity of optimization, and principled methods for combating overtraining.

Robust Speaker Recognition — Summary

• Diarization: examined the application of JFA and Bayesian methods to diarization; produced 3–4% DER on summed telephone speech; working on challenging interview speech.
• Factor Analysis Conditioning: explored ways to use JFA to account for non-session variability (phonetic content); showed robustness using within-session, stacking and hierarchical modeling.
• SVM-JFA: developed techniques to use JFA elements in SVM classifiers; results comparable to the full JFA system but with fast scoring and no score normalization; better performance using all JFA factors.
• Discriminative System Optimization: focused on means to discriminatively optimize the whole speaker recognition system; demonstrated proof-of-concept experiments.

Robust Speaker Recognition

• An extremely productive and enjoyable workshop.
• The aim is to continue collaborating on these problem areas going forward.
• Cross-site joint efforts will provide big gains in future speaker recognition evaluations and experiments.
• Possible special session at ICASSP on the team's workshop efforts.