Robust Speaker Recognition over Varying Channels

Niko Brummer, Lukas Burget, William Campbell, Fabio Castaldo, Najim Dehak, Reda Dehak, Ondrej Glembek, Valiantsina Hubeika, Sachin Kajarekar, Zahi Karam, Patrick Kenny, Jason Pelecanos, Douglas Reynolds, Nicolas Scheffer, Robbie Vogt

JHU WS'08 RSR Team

Intersession Variability

• The largest challenge to practical use of speaker detection systems is channel/session variability.
• Variability refers to changes in channel effects between training and successive detection attempts.
• Channel/session variability encompasses several factors:
  – The microphones: carbon-button, electret, hands-free, array, etc.
  – The acoustic environment: office, car, airport, etc.
  – The transmission channel: landline, cellular, VoIP, etc.
  – The differences in speaker voice: aging, mood, spoken language, etc.
• On NIST SRE2008 interview speech: a different microphone in training and test gives about 3% EER; the same microphone in training and test gives under 1% EER.

Tools to Fight Unwanted Variability

• Joint Factor Analysis: M = m + Vy + Dz + Ux

Baseline System

• System based on Joint Factor Analysis.
[DET plot: miss probability vs. false-alarm probability, NIST SRE08 short2–short3, telephone speech in training and test]

NIST SRE Evaluations

• Annual NIST evaluations of speaker verification technology (since 1995) using a common paradigm for comparing technologies.
• All team members participated in the recent 2008 NIST evaluation.
• The JHU workshop provided a great opportunity to:
  – do common post-evaluation analysis of our systems,
  – combine and improve techniques developed by individual sites.
• Thanks to the NIST evaluations we have:
  – identified some of the current problems that we worked on,
  – a well-defined setup and evaluation framework,
  – baseline systems that we tried to extend and improve during the workshop.

Subgroups

• Diarization using JFA
• Factor Analysis Conditioning
• SVM–JFA and fast scoring
• Discriminative System Optimization

Diarization using JFA

• Problem statement:
  – Diarization is an important upstream process for real-world multi-speaker speech.
  – At one level, diarization depends on accurate speaker discrimination for change detection and clustering.
  – JFA and Bayesian methods have the promise of providing improvements to speaker diarization.
• Goals:
  – Apply diarization systems to summed telephone speech and interview microphone speech: a baseline segmentation/agglomerative-clustering system, a streaming system using speaker-factor features, and a new variational Bayes approach using eigenvoices.
  – Measure performance in terms of DER and of the effect on speaker detection.

Factor Analysis Conditioning

• Problem statement:
  – A single FA model is sub-optimal across different conditions, e.g. different durations, phonetic content and recording scenarios.
• Goals — explore two approaches:
  – Build FA models specific to each condition and robustly combine the multiple models.
  – Extend the FA model to explicitly model the condition as another source of variability.
SVM–JFA

• Problem statement:
  – The Support Vector Machine is a discriminative recognizer which has proved to be useful for SRE.
  – Parameters of generative GMM speaker models are used as features for a linear SVM (sequence kernels).
  – We know Joint Factor Analysis provides higher-quality GMMs, but using these as-is in SVMs has not been so successful.
• Goals:
  – Analysis of the problem.
  – Redefinition of SVM kernels based on JFA?
  – Application of JFA vectors to recently proposed and closely related bilinear scoring techniques which do not use SVMs.

Discriminative System Optimization

• Problem statement:
  – Discriminative training has proved very useful in speech and language recognition, but has not been investigated in depth for speaker recognition.
  – In both speech and language recognition, the classes (phones, languages) are modeled with generative models, which can be trained with copious quantities of data.
  – But in speaker recognition, our speaker GMMs have at best a few minutes of training, typically of only one recording of the speaker.
• Goals:
  – Reformulate the speaker recognition problem as binary discrimination between pairs of recordings which can be (i) of the same speaker, or (ii) of two different speakers.
  – We then have lots of training data for these two classes and can afford to train complex discriminative recognizers.

Relevance MAP Adaptation

• Illustration: 2-D features, single-Gaussian model; only the mean vector(s) are adapted.
[Figure: UBM adapted toward the target speaker's enrollment data; test data scored against the target model and the UBM]

Intersession Variability

[Figures: the target speaker model and UBM shift between enrollment and test sessions, so the test data no longer fits the enrolled model]

Intersession Compensation

• For recognition, move both models along the high inter-session-variability direction(s) to best fit the test data (e.g. in the ML sense).

Joint Factor Analysis Model

• A probabilistic model proposed by Patrick Kenny.
• The speaker model is represented by the mean supervector M = m + Vy + Dz + Ux, where:
  – U is a subspace with high intersession/channel variability (eigenchannels),
  – V is a subspace with high speaker variability (eigenvoices),
  – D is a diagonal matrix describing the remaining speaker variability not covered by V,
  – Gaussian priors are assumed for the speaker factors y, z and the channel factors x.
[Figure: 3-D space of model parameters (e.g. a 3-component GMM with 1-D features), showing eigenvoice directions v1, v2, eigenchannel directions u1, u2 and diagonal directions d11, d22, d33]
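To make the decomposition concrete, here is a minimal numpy sketch that assembles a session-dependent mean supervector from the JFA factors. The dimensions are toy values and all matrices are random stand-ins for trained hyperparameters, not the workshop systems' actual configuration:

```python
import numpy as np

# Toy dimensions: C Gaussians x F features gives the supervector size.
C, F = 64, 39
D_sv = C * F           # supervector dimension
R_v, R_u = 300, 100    # ranks of the speaker (V) and channel (U) subspaces

rng = np.random.default_rng(0)
m = rng.standard_normal(D_sv)           # UBM mean supervector (stand-in)
V = rng.standard_normal((D_sv, R_v))    # eigenvoices
U = rng.standard_normal((D_sv, R_u))    # eigenchannels
d = np.abs(rng.standard_normal(D_sv))   # diagonal of D (residual speaker variability)

# Factors: y and z are tied to the speaker; x changes with every session.
y = rng.standard_normal(R_v)    # speaker factors
z = rng.standard_normal(D_sv)   # common (residual) speaker factors
x = rng.standard_normal(R_u)    # channel factors

# M = m + Vy + Dz + Ux: the session-dependent mean supervector.
M = m + V @ y + d * z + U @ x
```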
Working with JFA

• Enrolling a speaker model:
  – Given enrollment data and the hyperparameters m, Σ, V, D and U, obtain MAP point estimates (or posterior distributions) of all factors x, y, z.
  – Most of the speaker information is in the low-dimensional vector y; less is in the high-dimensional vector z; x should contain only channel-related information.
• Test:
  – Given fixed (distributions of) speaker-dependent factors y and z, obtain new estimates of the channel factors x for the test data.
  – The score for a test utterance is the log-likelihood ratio between the UBM and the speaker model defined by the factors x, y, z.
• Training the hyperparameters:
  – The hyperparameters m, Σ, V, D and U can be estimated from training data using an EM algorithm.
  – Posterior distributions of the "hidden" factors x, y, z and the hyperparameters are alternately estimated to maximize the likelihood of the training data.
  – The distributions of the speaker factors y, z are constrained to be the same for all segments of the same speaker, while the channel factors x may differ for every segment.

Flavors of JFA

• Relevance MAP adaptation: M = m + Dz with D² = Σ/τ, where Σ is the matrix with the UBM variance supervector on its diagonal.
• Eigenchannel adaptation (SDV, BUT):
  – Relevance MAP for enrolling the speaker model.
  – Adapt the speaker model to the test utterance using U estimated by PCA.
• JFA without V, with D² = Σ/τ (QUT, LIA).
• JFA without V, with D trained from data (CRIM):
  – Can be seen as training a different τ for each supervector coefficient; effective relevance factor τ_eff = trace(Σ)/trace(D²).
• JFA with V (CRIM).
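The relevance-MAP flavor above amounts to a per-Gaussian interpolation between the enrollment data and the UBM, controlled by the relevance factor τ. A minimal sketch, assuming frame posteriors are already computed; τ = 16 is a common choice, not a value taken from these slides:

```python
import numpy as np

def relevance_map_means(m_ubm, frames, posteriors, tau=16.0):
    """Relevance-MAP adaptation of GMM means only (sketch).

    m_ubm:      (C, F) UBM component means
    frames:     (T, F) enrollment feature frames
    posteriors: (T, C) per-frame component occupation probabilities
    tau:        relevance factor (D^2 = Sigma / tau in JFA terms)
    """
    n = posteriors.sum(axis=0)                 # zero-order statistics, (C,)
    f = posteriors.T @ frames                  # first-order statistics, (C, F)
    e = f / np.maximum(n, 1e-10)[:, None]      # per-component data means
    alpha = (n / (n + tau))[:, None]           # adaptation coefficients
    return alpha * e + (1.0 - alpha) * m_ubm   # interpolate with the UBM
```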
Flavors of JFA — Results

[DET plot, SRE 2006, all trials (det1), comparing: no JFA; eigenchannel adaptation; JFA with d² = Σ/τ; JFA with d trained on data; JFA with eigenvoices]
• Full JFA significantly outperforms the other JFA configurations.

Diarization Group
Douglas Reynolds, Patrick Kenny, Fabio Castaldo, Ciprian Costin

Roadmap

• Introduction: problem definition; experiment setup
• Diarization systems: variational Bayes system; streaming and hybrid systems
• Analysis and conclusions

Diarization: Segmentation and Clustering

• Determine when a speaker change has occurred in the speech signal (segmentation).
• Group together speech segments from the same speaker (clustering).
• Prior speaker information may or may not be available.
[Figure: two-speaker audio — where are the speaker changes, and which segments are from the same speaker (Speaker A vs. Speaker B)?]

Diarization Applications

• Diarization is used as a pre-process for other downstream applications.
• Human consumption:
  – Annotate a transcript with speaker changes/labels.
  – Provide an overview of speaker activity.
• Algorithm consumption:
  – Adaptation of a speech recognition system.
  – Application to speaker detection with multi-speaker speech: diarization splits the audio, each part is scored by a single-speaker detector, and the maximum is taken as the utterance score.

Diarization Error Measures

• Direct measure — Diarization Error Rate (DER):
  – Optimal alignment of the reference and hypothesized diarizations.
  – The error is the sum of miss (speaker in reference but not in hypothesis), false alarm (speaker in hypothesis but not in reference) and speaker error (mapped reference speaker is not the same as the hypothesized speaker).
  – A time-weighted measure; emphasizes talkative speakers.
• Consumer measure — effect on a speaker detection system:
  – Determine the speaker detection error rate (e.g. EER) when using different diarization outputs.
  – Focus on NIST SRE 2008 data with a fixed detection system (a JFA GMM-UBM system).

Diarization Experiment Data

• Summed-channel telephone speech:
  – Use summed-channel data for test only (avoids the complication of extra clustering in training).
  – We can derive a reference for DER scoring using ASR transcripts from the separate channels (no-score for silence and speaker overlap).
  – Compare the use of diarization to two extremes — best case: use the reference diarization; worst case: no diarization.
• Interview microphone speech:
  – A single microphone recording captures both the interviewee (target) and the interviewer.
  – Avoid using the unrealistic side information about the location of interviewee speech provided in the NIST eval.
  – The reference for DER scoring comes from ASR transcripts of the lavaliere microphones.

Baseline System

• Three stages in the baseline system:
  – BIC-based speaker change detection,
  – full-covariance agglomerative clustering with a BIC stopping criterion,
  – iterative re-segmentation with GMM Viterbi decoding (train GMMs, Viterbi decode, refine the speaker data, iterate to the final diarization).
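The first stage of the baseline compares a single full-covariance Gaussian over a window against two Gaussians split at a candidate change point. A minimal sketch of that ΔBIC test, with a tunable penalty weight λ; the windowing, thresholding and clustering logic around it are omitted:

```python
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """BIC change-point score between adjacent segments (sketch).

    X, Y: (n1, d) and (n2, d) feature matrices for the two half-windows.
    Positive values favour the two-model hypothesis, i.e. a speaker change.
    Assumes both halves have enough frames for a full covariance estimate.
    """
    Z = np.vstack([X, Y])
    n1, n2, n = len(X), len(Y), len(Z)
    d = Z.shape[1]

    def logdet_cov(A):
        return np.linalg.slogdet(np.cov(A, rowvar=False))[1]

    # Log-likelihood gain from modeling the two halves separately...
    gain = 0.5 * (n * logdet_cov(Z) - n1 * logdet_cov(X) - n2 * logdet_cov(Y))
    # ...minus the BIC penalty for the extra mean and covariance parameters.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty
```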
Factor Analysis Applied to Diarization

• State-of-the-art speaker recognition systems use hundreds of speaker and channel factors:
  – Processing requires entire utterances; it can't be implemented incrementally.
• State-of-the-art diarization systems require lots of local decisions:
  – Very short (~1 s) speech segments.
  – Speaker segmentation: is this frame a speaker change point?
  – Agglomerative clustering: given two short segments, is the speaker the same?
• Proposed solution: Variational Bayes (VB)
  – Fabio Valente, "Variational Bayesian Methods for Audio Indexing", PhD dissertation, Eurecom, 2005.

Advantages of a Bayesian Approach

• EM-like convergence guarantees.
• No premature hard decisions as in agglomerative clustering:
  – This suggested a "soft clustering" heuristic which reduced the diarization error rate of the baseline system by almost 50%.
• In theory at least, Bayesian methods are not subject to the over-fitting that maximum-likelihood methods are prone to:
  – Bayesian model selection is a quantitative version of Occam's razor (David MacKay).
  – It ought to be possible to determine the number of speakers in a file without resorting to BIC-like fudge factors (Fabio Valente).

Eigenvoice Speaker Model

• For diarization we use only the eigenvoice component of factor analysis: s = m + Vy, with y ~ N(0, I).
• A supervector s is the concatenation of the mean vectors in a speaker-dependent Gaussian mixture model.
• The supervector m is speaker-independent.
• The matrix V is of low rank:
  – The columns of V are the eigenvoices.
  – The entries of y are the speaker factors.
• This is a highly informative prior on speaker-dependent GMMs.
• Adding eigenchannels doesn't help in diarization (so far).

Variational Bayes Diarization

• Assume 2 speakers and uniformly segment the file into 1-second intervals:
  – This restriction can be removed in a second pass.
• Alternate between estimating two types of posterior distribution until convergence:
  – segment posteriors (soft clustering),
  – speaker posteriors (the location of the speakers in the space of speaker factors).
• Construct GMMs for each speaker and re-segment the data; iterate as needed.
[Figures: the segment-posterior and speaker-posterior update equations]

Variational Bayes Diarization — Details

• Begin: extract Baum-Welch statistics from each segment.
• On each iteration:
  – For each speaker: synthesize Baum-Welch statistics by weighting the per-segment statistics by the corresponding posteriors, then update the posterior distribution of the speaker factors.
  – For each segment: update the segment posteriors for each speaker.
• End: construct GMMs for each speaker, re-segment the data, and iterate.
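For intuition only, here is a deliberately simplified stand-in for the alternating updates. This is not Kenny's actual VB algorithm, which works on per-segment Baum-Welch statistics under the eigenvoice prior; here each segment is reduced to a point-estimate speaker-factor vector and each speaker to a spherical Gaussian in that space:

```python
import numpy as np

def vb_like_diarization(Y, n_speakers=2, n_iter=20, noise_var=1.0):
    """Toy alternating soft-clustering in speaker-factor space.

    Y: (n_segments, R) point-estimate speaker factors, one per ~1 s segment.
    Returns soft segment posteriors q of shape (n_segments, n_speakers).
    """
    n, _ = Y.shape
    rng = np.random.default_rng(0)
    q = rng.dirichlet(np.ones(n_speakers), size=n)   # segment posteriors

    for _ in range(n_iter):
        # "Speaker posterior" step: posterior mean of each speaker's location
        # under an N(0, I) prior, weighting segments by their posteriors.
        counts = q.sum(axis=0)                             # (S,)
        mu = (q.T @ Y) / (counts[:, None] + noise_var)     # MAP-style shrinkage

        # "Segment posterior" step: responsibilities under spherical Gaussians.
        d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(-1) / (2 * noise_var)
        logp = -d2
        logp -= logp.max(axis=1, keepdims=True)
        q = np.exp(logp)
        q /= q.sum(axis=1, keepdims=True)
    return q
```

No hard assignments are made until the very end, which is exactly the property that motivated the soft-clustering heuristic above.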
Experiment Configuration

• Features used for Variational Bayes:
  – Un-normalized cepstral coefficients c0, …, c12.
  – Including c0 was a lucky bug.
• Features used in the baseline system and in the re-segmentation phase of Variational Bayes:
  – The 39-dimensional feature set optimized by Brno for speaker recognition: cepstral coefficients c0, …, c12 + first, second and third order derivatives + Gaussianization + HLDA.
• Factor analysis configuration for Variational Bayes:
  – Universal background model with 512 Gaussians.
  – 200 speaker factors, no channel factors.
  – V matrix scaled by 0.6.
• Test set: the summed-channel telephone data provided by NIST in the 2008 speaker recognition evaluation — 2215 files (~200 hours).
• NIST Diarization Error Rate used to measure performance; ground-truth diarization is available.

Experiment Results — NIST 2008 Summed-Channel Telephone Speech

| # | Diarization system                   | Mean DER (%) | Std DER (%) |
|---|--------------------------------------|--------------|-------------|
| 1 | Baseline, BW + Viterbi               | 6.8          | 12.3        |
| 2 | VB                                   | 9.1          | 11.9        |
| 3 | VB, BW + Viterbi                     | 4.5          | 8.5         |
| 4 | VB, BW + Viterbi, 2nd pass           | 3.8          | 7.6         |
| 5 | Baseline, soft-cluster, BW + Viterbi | 3.5          | 8.0         |

• VB = variational Bayes; BW = Baum-Welch training of the speaker GMMs; Viterbi = re-segmentation with the speaker GMMs.
• The second pass in VB uses a non-uniform segmentation provided by the first pass.
• Compared to the baseline, soft clustering achieves a 50% reduction in error rate.

Streaming System — LPT Diarization System*

• Main ideas:
  – Use the eigenvoice model to create a stream of speaker factors y_t computed on sliding windows.
  – Perform segmentation and clustering with these new features.
• Eigenvoice model: s = m + Vy, y ~ N(0, I).
* Based on Castaldo, F.; Colibro, D.; Dalmasso, E.; Laface, P.; Vair, C., "Stream-based speaker segmentation using speaker factors and eigenvoices", ICASSP 2008.

Streaming System

[Diagram: feature extraction → audio slices → streaming factor analysis → per-slice GMMs → slice clustering]

Streaming System — Stream Factor Analysis

[Diagram: feature frames x1…x12 are windowed into speaker-factor vectors Y1, Y2, …; Viterbi segmentation and clustering then create the speaker GMMs]
[Figure: first 2 dimensions of the y stream]

Streaming System — Slice Clustering

• A GMM model is created for each slice (e.g. 60 s).
• Last step: cluster the GMMs created for the slices.
• The system decides whether GMMs come from the same or different speakers using an approximation of the Kullback-Leibler divergence between GMMs: if the minimum KL divergence to the existing models exceeds a threshold λ, a new speaker model is created; otherwise the slice is merged into the closest model.

Hybrid Clustering

• Speaker factors work well in the streaming diarization system.
• Experiments done during the workshop showed that the cosine distance between speaker factors produces low speaker detection errors.
• So we modified the baseline system using these new ideas.
• Hybrid clustering: replace the classical clustering with speaker factors and cosine distance.

Hybrid Clustering — Different Approaches

(Both approaches share the cosine-distance merge rule, sketched after the level-cutting figure below.)
• First approach — level cutting:
  – Stop the agglomerative clustering at a certain level and compute speaker factors for each cluster.
  – Merge the clusters that have the maximum similarity with respect to the cosine distance.
  – Iterate until only two clusters remain.
• Second approach — tree searching:
  – Build the agglomerative clustering up to the top level.
  – Select the nodes that have a number of frames above a threshold.
  – Merge the clusters that have the maximum similarity with respect to the cosine distance.
  – Iterate until only two clusters remain.

Hybrid Clustering — Level Cutting

[Diagram: the agglomerative clustering tree is cut at a fixed level; speaker factors Y1…Y5 are computed per cluster and re-clustered by cosine distance]
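A minimal sketch of the merge rule shared by both hybrid approaches; averaging the cluster factor vectors is a simplification (the real system would re-estimate y from the pooled statistics of the merged clusters):

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_until_two(factors):
    """Greedily merge clusters by cosine similarity of their speaker factors.

    factors: list of 1-D speaker-factor vectors, one per initial cluster.
    """
    clusters = [np.asarray(f, dtype=float) for f in factors]
    while len(clusters) > 2:
        # Find the most similar pair of clusters...
        best, pair = -np.inf, (0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine_sim(clusters[i], clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        # ...and merge it (simplified: average the factor vectors).
        i, j = pair
        merged = (clusters[i] + clusters[j]) / 2.0
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```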
Hybrid Clustering — Tree Searching

[Diagram: clustering tree with per-node frame counts (550, 340, 210, 180, 110, 70, 60, 50, …); nodes above the threshold of 100 frames are selected as clusters (Y) for speaker-factor re-clustering]

Results on Summed Telephone Speech (2213 audio files)

| System                          | Min DER (%) | Max DER (%) | Std dev (%) | Avg DER (%) |
|---------------------------------|-------------|-------------|-------------|-------------|
| Streaming system                | 0.0         | 53.2        | 8.8         | 4.6         |
| Baseline diarization system     | 0.0         | 57.2        | 12.3        | 6.8         |
| Hybrid system 1 (level cutting) | 0.0         | 67.0        | 14.6        | 17.1        |
| Hybrid system 2 (tree search)   | 0.0         | 63.2        | 13.6        | 6.8         |

DER vs. EER — Summed Telephone Speech

• There is some correlation of DER with EER.
• Systems with DER < 10% have comparable EERs.
• No clear knee in the curve: we still see EER gains (over doing nothing) with a relatively poor DER = 20% system.
[Scatter plot: EER (%) from 8 to 15 vs. DER (%) from 0 to 40]

DER vs. EER — Summed Telephone Speech (low-DER systems)

• Trends are unclear among the low-DER systems (VB + 2nd pass and baseline + soft-cluster).
• DER may be too coarse a measure for predicting effects on EER.
[Scatter plot: EER (%) from 9 to 10 vs. DER (%) from 0 to 10 for the Ref, LPT, BL, BL + soft-cluster, VB, VB + 2nd pass and Hybrid systems]

Interview Speech

• Interview speech differs from telephone speech in two main aspects:
  – Audio quality is much more variable across the various microphones.
  – Conversations are dominated by the interviewee.
• DER for do-nothing diarization (a single speaker for all time): telephone 35%; interview 11%.
• The next challenge is to apply the diarization systems to this new domain, avoiding the idealistic assumptions and knowledge used in the NIST eval.
• With no diarization in train or test: EER = 10.9%; with ideal diarization in train and test: EER = 5.4%.

Conclusion

• Implemented a variational Bayes diarization system using both segment- and speaker-posterior optimization.
• Used the speaker-factor model in three speaker diarization systems: streaming, VB, and hybrid.
• Demonstrated the effectiveness of soft clustering for improving speaker diarization.
• Produced low diarization error rates (3.5–4.5%) for telephone speech.
• New challenges await in the interview-speech domain: microphones and conversational patterns.

Speaker Recognition: Factor Analysis Conditioning (13 August 2008)
Sub-team: Sachin Kajarekar (SRI), Elly Na (GMU), Jason Pelecanos (IBM), Nicolas Scheffer (SRI), Robbie Vogt (QUT)

Overview

• Introduction
• A Phonetic Analysis
• Combination Strategies
• Within-Session Variability Modeling
• Hierarchical Factor Analysis
• Review
Introduction

• Problem statement:
  – A single FA model is sub-optimal across different conditions, e.g. different durations, phonetic content and recording scenarios.
• Goals — explore two approaches:
  – Build FA models specific to each condition and robustly combine the multiple models.
  – Extend the FA model to explicitly model the condition as another source of variability.
• Results and outcomes:
  – A conditioned FA model can provide improved performance, but score-level combination may not be the best way.
  – Including within-session factors in an FA model can reduce the sensitivity to utterance-duration and phonetic-content variability.
  – Stacking factors across conditions or data subsets can provide additional robustness.
  – Hierarchical modeling for factor analysis shows promise.
  – The approach is applicable to other condition types: languages, microphones, …

Introduction — Speech Partitioning: An Overview

[Diagram: train data ("w", "ah1", "n") and test data ("w", "ow", "d") are partitioned by phoneme; matching phoneme classes (I, II, III) are compared in feature space]

A Phonetic Analysis — Effect of Phonetic Mismatch

• How does the difference between the content in enrollment and test change the resulting performance?

| Enroll \ Test | Vowel           | Consonant       |
|---------------|-----------------|-----------------|
| Vowel         | 4.50% / 0.0208  | 12.47% / 0.0537 |
| Consonant     | 10.72% / 0.0521 | 7.03% / 0.0336  |
(each cell: EER / min. DCF)

• This result, albeit an extreme example, demonstrates the challenge of mismatched phonetic content.
• This phenomenon is especially present for short-duration utterances.

A Phonetic Analysis — Performance vs. % of Speech

| Phoneme | Type     | % of speech | DET1 EER (%) | DET1 DCF | DET3 EER (%) | DET3 DCF |
|---------|----------|-------------|--------------|----------|--------------|----------|
| E       | vowel    | 18.93       | 12.16        | 0.0567   | 8.62         | 0.0419   |
| O       | vowel    | 10.71       | 14.57        | 0.0645   | 12.30        | 0.0558   |
| i       | vowel    | 6.85        | 16.73        | 0.0749   | 15.49        | 0.0696   |
| A:      | vowel    | 5.89        | 23.31        | 0.0876   | 21.79        | 0.0852   |
| n       | nonvowel | 5.44        | 19.08        | 0.0779   | 17.23        | 0.0730   |
| e:      | vowel    | 4.73        | 25.31        | 0.0917   | 22.92        | 0.0866   |
| k       | stop     | 4.49        | 25.56        | 0.0926   | 22.26        | 0.0868   |
| z       | sibilant | 4.25        | 29.73        | 0.0980   | 28.22        | 0.0971   |
| o       | vowel    | 3.01        | 25.53        | 0.0924   | 25.24        | 0.0926   |
| t       | stop     | 2.76        | 27.04        | 0.0956   | 24.92        | 0.0936   |
| s       | sibilant | 2.74        | 30.73        | 0.0965   | 27.63        | 0.0908   |
| f       | sibilant | 2.41        | 34.43        | 0.0998   | 31.42        | 0.0984   |
| j       | nonvowel | 2.38        | 25.00        | 0.0918   | 22.41        | 0.0862   |
| v       | sibilant | 2.35        | 33.66        | 0.1000   | 30.78        | 0.0992   |
| m       | nonvowel | 2.29        | 21.18        | 0.0835   | 18.63        | 0.0782   |
| S       | sibilant | 2.21        | 31.97        | 0.0959   | 31.74        | 0.0981   |
| l       | nonvowel | 1.99        | 30.05        | 0.0974   | 29.91        | 0.0955   |
A Phonetic Analysis — Performance vs. % of Speech

[Scatter plot: per-phoneme EER (%) (0–40) vs. % of speech (0–20), grouped by vowel, nonvowel, sibilant and stop classes]

A Phonetic Analysis — Fusion Analysis

Vowel fused with other classes:

| Phonemes | % of speech | EER (%) |
|----------|-------------|---------|
| E+n      | 24.37       | 7.96    |
| E+k      | 23.42       | 8.40    |
| E+z      | 23.18       | 8.35    |
| E+t      | 21.69       | 8.72    |
| E+l      | 20.92       | 8.56    |
| O+n      | 16.15       | 9.64    |
| O+k      | 15.20       | 10.89   |
| O+z      | 14.96       | 11.76   |
| O+t      | 13.47       | 10.85   |
| i+n      | 12.29       | 11.93   |
| i+z      | 11.10       | 14.41   |
| n+e:     | 10.17       | 14.40   |
| i+s      | 9.59        | 14.35   |
| A:+t     | 8.65        | 17.60   |
| n+j      | 7.82        | 13.71   |
| z+s      | 6.99        | 23.03   |
| o+t      | 5.77        | 19.34   |
| t+s      | 5.50        | 21.88   |
| f+v      | 4.76        | 26.72   |
| S+l      | 4.20        | 25.63   |

Vowel fused with vowel:

| Phonemes | % of speech | EER (%) |
|----------|-------------|---------|
| E+O      | 29.64       | 7.04    |
| E+i      | 25.78       | 7.58    |
| E+A:     | 24.82       | 8.94    |
| E+e:     | 23.66       | 8.55    |
| E+o      | 21.94       | 8.29    |
| O+i      | 17.56       | 9.42    |
| O+A:     | 16.60       | 11.32   |
| O+e:     | 15.44       | 11.32   |
| O+o      | 13.72       | 11.38   |
| i+A:     | 12.74       | 13.05   |
| i+e:     | 11.58       | 13.91   |
| A:+e:    | 10.62       | 17.23   |
| i+o      | 9.86        | 14.07   |
| A:+o     | 8.90        | 17.59   |
| e:+o     | 7.74        | 18.46   |

• Fusing all phoneme systems (83.43% of speech) gives EER 5.68%.
[Figure: fusion EER vs. % of speech for "vowel with others" and "vowel with vowel" pairs]

Combination Strategies — Context

• Conditioned factor analysis:
  – Multiple systems for multiple conditions; multiple subspaces (e.g. microphones).
• Current solution:
  – Select the best system for each condition.
  – Perform score-level combination (our baseline).
• How can we robustly gather information from these systems?
• We explore combination strategies in the model space.
• Candidate for the study: broad-phone classes:
  – Work in the speaker space instead of the channel space.
  – A small set of events for FA conditioning.
• Smaller system configuration (512 Gaussians, 120 eigenvoices, 60 eigenchannels).

Combination Strategies — Baseline Results

Results for different phone sets (DET 1, SRE'06):

| Phone set                           | % data | EER (%) | Min. DCF |
|-------------------------------------|--------|---------|----------|
| Vowels                              | 60     | 6.17    | 0.296    |
| Consonants                          | 40     | 7.91    | 0.391    |
| NonVowels                           | 15     | 10.70   | 0.502    |
| Sibilants                           | 15     | 14.14   | 0.647    |
| Stops                               | 10     | 15.27   | 0.685    |
| Vow. + NV + Sib. + Stop (4 classes) | 100    | 5.42    | 0.272    |
| Vow. + Cons. (2 classes)            | 100    | 5.20    | 0.262    |
| Baseline                            | 100    | 5.12    | 0.241    |

Thanks to S. Kajarekar and C. Richey, SRI International.

Combination Strategies — Stacked Eigenvectors

• In training, estimate different subspaces modeling the same kind of variability, e.g. different utterance lengths or different microphone sets.
• In practice:
  – Merge the supervectors generated by each subspace.
  – The new rank is the sum of the individual subspace ranks.
  – This can generate very large (and redundant) subspaces.
• Advantages:
  – No retraining during enrollment/recognition.
  – No need for labeled data for system selection.
  – Increased robustness of the system in both scenarios (correlation between the two subspaces).
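In code, stacking is just concatenation of the condition-specific subspaces; a minimal numpy sketch with random stand-ins for the trained matrices:

```python
import numpy as np

# V1 and V2: eigenvoice subspaces trained on two different conditions
# (e.g. two microphone sets); shapes are (supervector_dim, rank).
D_sv, r1, r2 = 64 * 39, 120, 120
rng = np.random.default_rng(0)
V1 = rng.standard_normal((D_sv, r1))
V2 = rng.standard_normal((D_sv, r2))

# Stacked subspace: its rank is the sum of the individual ranks.
V_stacked = np.hstack([V1, V2])        # (D_sv, r1 + r2)

# A single factor vector y now spans both conditions:
y = rng.standard_normal(r1 + r2)
offset = V_stacked @ y                 # = V1 @ y[:r1] + V2 @ y[r1:]
```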
Combination Strategies — Combining U and/or V?

• Stacking U's (channel) was successfully demonstrated (at NIST) for a large set of microphones.
• Stacking V's (speaker) is suitable for phone conditioning because:
  – Phonetic models can represent the speaker.
  – Precedents are P-GMM, MLLR systems, …

2-class stacking results (DET 1, SRE'06):

| System         | EER (%) | Min. DCF |
|----------------|---------|----------|
| Baseline       | 5.20    | 0.262    |
| Stacked Us     | 5.72    | 0.279    |
| Stacked Vs     | 5.14    | 0.243    |
| Stacked Us+Vs  | 5.31    | 0.258    |

Combination Strategies — Augmented Eigenvectors

• Again, train several subspaces on the same kind of variability.
• In practice:
  – The subspace rank is unchanged.
  – The model size increases.
  – The joint variability needs to be retrained.
  – Not extendable to more than 2–3 classes.
• Close to tied factor analysis: produces a single y, independent of the class.

Combination Strategies — Factor Analysis: Un-tying

• Augmented eigenvectors produce a common y for all conditions.
• In practice:
  – There is always a between-class error.
  – The error is averaged out by the ML algorithm.
• Instead, keep each speaker factor (y) from each class, with its error:
  – More parameters to describe a speaker.
  – Feed this input to a classifier.
  – Experiments with Gaussians as classes are promising.

Combination Strategies — Results

Results for different factor configurations:

| # classes | Configuration    | Method           | EER (%) | DCF   |
|-----------|------------------|------------------|---------|-------|
| 1         | Baseline         | Single system    | 5.12    | 0.241 |
| 1         | Baseline (x2 EV) | Single system    | 4.83    | 0.239 |
| 2         | Baseline         | 2-system fusion  | 5.20    | 0.262 |
| 4         | Baseline         | 4-system fusion  | 5.42    | 0.272 |
| 2         | Stacked V        | 2-system fusion  | 5.09    | 0.247 |
| 4         | Stacked V        | 4-system fusion  | 5.03    | 0.250 |
| 2         | Stacked V        | Single system    | 5.14    | 0.243 |
| 4         | Stacked V        | Single system    | 4.76    | 0.234 |
| 2         | Augmented        | Single system    | 13.40   | 0.573 |
| 2         | Augmented        | Retrained (tied) | 5.39    | 0.266 |
| 16        | Un-tied          | Gaussian         | 4.54    | 0.233 |

Within-Session Variability Modeling

• The characteristics of inter-session variability depend on session duration.
• This doesn't fit well with the JFA model:
  – U is capturing more than channel effects!
  – Speech content (phonetic information) averages out for long utterances but becomes significant for short utterances.
• Proposed solution: model within-session variability as well. Break utterances into smaller segments, each described by
  M = m + Vy + U_I x + U_W w + dz
  – U is split into inter-session (U_I) and within-session (U_W) parts.
  – x is held constant for a whole utterance, but we have many w's!
• In this work we chose to align our segments with OLPR transcripts, i.e. one w per phonetic event — approximately 10 per second, approximately 1000 in a NIST conversation side.
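A sketch of the extended model in the same toy-dimension style as the earlier JFA example; the point is that x is drawn once per utterance while every short segment gets its own w (all matrices are random stand-ins):

```python
import numpy as np

D_sv, R_v, R_i, R_w = 64 * 39, 300, 50, 10
rng = np.random.default_rng(0)
m = rng.standard_normal(D_sv)
V = rng.standard_normal((D_sv, R_v))
U_I = rng.standard_normal((D_sv, R_i))   # inter-session subspace
U_W = rng.standard_normal((D_sv, R_w))   # within-session subspace
d = np.abs(rng.standard_normal(D_sv))

y = rng.standard_normal(R_v)    # speaker factors (per speaker)
z = rng.standard_normal(D_sv)   # residual factors (per speaker)
x = rng.standard_normal(R_i)    # one x for the whole utterance

n_segments = 1000               # ~10 phonetic events/s over a conversation side
W = rng.standard_normal((n_segments, R_w))   # one w per segment

# One supervector per segment: M_s = m + Vy + U_I x + U_W w_s + dz
M_segments = (m + V @ y + U_I @ x + d * z)[None, :] + W @ U_W.T
```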
[Figures: the within-session model M = m + Vy + U_I x + U_W w + dz, built up subspace by subspace (V, then U_I, then U_W)]

Within-Session Variability Modeling — Single Phonetic Event

[Figures: supervector contribution of a single phonetic event across dimensions]

Within-Session Variability Modeling — Contribution over Varying Utterance Lengths

[Figure: within-session contribution across dimensions for varying utterance lengths]

Within-Session Variability Modeling — Results

| JFA model     | Subspace dims | 1conv EER / MinDCF | 20 s EER / MinDCF | 10 s EER / MinDCF |
|---------------|---------------|--------------------|-------------------|-------------------|
| U+V+D         | 50            | 3.10% / .0159      | 12.79% / .0561    | 20.21% / .0819    |
| U+V+D         | 60            | 3.03% / .0156      | 13.01% / .0562    | 20.31% / .0820    |
| U_Matched+V+D | 50            | 3.10% / .0159      | 12.20% / .0531    | 19.71% / .0814    |
| U_I+U_W+V+D   | 50 I + 10 W   | 2.97% / .0170      | 11.98% / .0541    | 19.67% / .0807    |

• Similar performance for full conversations.
• Modest gains with reduced utterance lengths, mostly in EER:
  – Better than matching U to the utterance length in most cases.
  – Good flexibility across utterance lengths for a single model!

Hierarchical Factor Analysis

[Diagram: low-complexity coarse-grain model vs. high-complexity fine-grain model]

Hierarchical Factor Analysis — Multi-grained Hybrid Model

• Such a model may compensate for session effects that cause both large regional variability and localized distortions.
• A multi-grained model may be structured so that the nuisance kernel subspace has reduced complexity (a reduced number of parameters) while preserving compensation impact.

Hierarchical Factor Analysis — Multi-grained GMM/Phone Model

NIST 2008 minimum DCF results:

| System                                   | Condition 7 | Condition 8 |
|------------------------------------------|-------------|-------------|
| Base system with NAP                     | 0.179       | 0.182       |
| Base system with multigrained NAP        | 0.175       | 0.166       |
| Broad-phone system with NAP              | 0.212       | 0.209       |
| Broad-phone system with multigrained NAP | 0.206       | 0.190       |

Thanks to Jiri Navratil (IBM) for the phonetic results.
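The NAP compensation used in these systems removes an estimated nuisance subspace from the SVM feature vectors. A minimal sketch, assuming the nuisance basis U is orthonormal:

```python
import numpy as np

def nap_project(phi, U):
    """Nuisance Attribute Projection (sketch).

    phi: (D,) SVM feature vector (e.g. a scaled GMM supervector)
    U:   (D, k) orthonormal basis of the nuisance (session) subspace
    Applies phi' = (I - U U^T) phi without forming the DxD projector.
    """
    return phi - U @ (U.T @ phi)
```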
Hierarchical Factor Analysis — Multi-stage FA Broad-Phone Model

NIST 2006 minimum DCF / EER results (DET 3):

| Phone type | Baseline, base  | Baseline, ZT-norm | Hierarchical, base | Hierarchical, ZT-norm |
|------------|-----------------|-------------------|--------------------|-----------------------|
| NonVowel   | 0.0888 / 24.04% | 0.0413 / 9.05%    | 0.0852 / 23.24%    | 0.0420 / 9.53%        |
| Sibilant   | 0.0988 / 30.28% | 0.0584 / 13.05%   | 0.0994 / 28.93%    | 0.0585 / 14.20%       |
| Stop       | 0.0993 / 33.33% | 0.0631 / 13.81%   | 0.0991 / 33.27%    | 0.0655 / 14.63%       |
| Vowel      | 0.0604 / 11.26% | 0.0201 / 3.97%    | 0.0482 / 10.29%    | 0.0206 / 3.91%        |
| Consonant  | 0.0839 / 20.48% | 0.0323 / 6.28%    | 0.0777 / 18.26%    | 0.0312 / 6.45%        |

• Fusion of the hierarchical systems with the baseline system gives modest improvements.

Review

• Results and outcomes:
  – A conditioned FA model can provide improved performance, but score-level combination may not be the best way, and automatic system selection may not be feasible.
  – Including within-session factors in an FA model can reduce the sensitivity to utterance-duration and phonetic-content variability.
  – Stacking factors across conditions or data subsets can provide additional robustness.
  – Hierarchical modeling for factor analysis shows promise.
  – The approach is applicable to other condition types: languages, microphones, …

Support Vector Machines and Joint Factor Analysis
Najim Dehak, Reda Dehak, Zahi Karam, and John Noecker Jr.

Outline

• Introduction
• SVM-JFA: GMM supervector space
• SVM-JFA: speaker-factor space
• Intersession compensation in the speaker-factor space: within-class covariance normalization; handling variability with SVMs
• SVM-JFA: combined factor space
• Importance of speaker and channel factors
• Conclusion

Introduction

• Joint Factor Analysis is the state of the art in speaker verification.
• Idea: combine discriminative and generative models.
• SVM-JFA operates in:
  – the speaker GMM supervector space,
  – the speaker-factor space,
  – combinations of factors.
• Intersession variability compensation in the speaker factors.

SVM in the GMM Supervector Space: s = m + Vy + Dz

SVM-JFA: Supervector Space

• Our starting point: a kernel between two GMM supervectors (Campbell '06):
  K(g_a, g_b) = Σ_i λ_i μ_{a,i}^T Σ_i^{-1} μ_{b,i}
• Project each utterance into a high-dimensional space by stacking the mean vectors from the JFA-adapted GMM, and use a KL-based kernel.
[Diagram: target and non-target utterances are JFA-adapted from the UBM to supervectors; an SVM is trained on target vs. non-target supervectors and used to score test utterances]
(Thanks to Douglas Reynolds for this slide.)
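This KL-based kernel reduces to an inner product between supervectors scaled by the UBM weights and covariances. A minimal sketch, assuming diagonal UBM covariances:

```python
import numpy as np

def kl_supervector_kernel(mu_a, mu_b, weights, sigma2):
    """KL-divergence-motivated linear kernel between adapted GMMs (sketch).

    mu_a, mu_b: (C, F) adapted component means of the two utterances
    weights:    (C,)   UBM mixture weights lambda_i
    sigma2:     (C, F) UBM diagonal covariances Sigma_i
    Computes k(a, b) = sum_i lambda_i * mu_ai^T Sigma_i^{-1} mu_bi.
    """
    # Equivalent to an inner product of scaled, stacked supervectors:
    scale = np.sqrt(weights[:, None] / sigma2)   # sqrt(lambda_i) * Sigma_i^(-1/2)
    phi_a = (scale * mu_a).ravel()
    phi_b = (scale * mu_b).ravel()
    return float(phi_a @ phi_b)
```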
JFA Configuration

• Gender-independent JFA.
• 2048 Gaussians, 60-dimensional features: 19 Gaussianized MFCCs + energy + delta + double delta.
• 300 speaker factors.
• 100 channel factors for telephone speech.
• Decoupled estimation of the eigenvoice matrix and the diagonal matrix (D).
• JFA hyperparameters are obtained on the MIXER and Switchboard databases.

SVM-JFA: Supervector Space — Results

Results on NIST 2006 and 2008 SRE, core condition, telephone–telephone data (EER):

| System             | 2006 English | 2006 All | 2008 English | 2008 All |
|--------------------|--------------|----------|--------------|----------|
| JFA: s=m+Vy        | 1.95%        | 3.01%    | 2.81%        | 5.58%    |
| JFA: s=m+Vy+dz     | 1.80%        | 2.96%    | 2.81%        | 5.69%    |
| SVM-JFA: s=m+Vy    | 4.24%        | 4.98%    | 5.10%        | 7.92%    |
| SVM-JFA: s=m+Vy+dz | 4.23%        | 4.92%    | 5.27%        | 8.13%    |

• JFA scoring is computed frame by frame.

SVM in the JFA Speaker-Factor Space: s = m + Vy

SVM-JFA: Speaker-Factor Space

• Use the speaker factors y rather than the GMM supervectors s (s = m + Vy; channel offset c = Ux).
• A low-dimensional space, so we can perform quick experiments.
• First we used only eigenvoice adaptation (D = 0).

SVM-JFA: Speaker-Factor Space — Kernels

• Inner-product kernel: k(y1, y2) = <y1, y2>
• Gaussian kernel: k(y1, y2) = exp(−||y1 − y2||² / (2σ²))
• Cosine kernel: k(y1, y2) = <y1, y2> / (||y1|| · ||y2||)

Results on NIST 2006 SRE, core condition (EER):

| System                  | English no-norm | English T-norm | English ZT-norm | All no-norm | All T-norm | All ZT-norm |
|-------------------------|-----------------|----------------|-----------------|-------------|------------|-------------|
| JFA                     | –               | –              | 1.95%           | –           | –          | 3.01%       |
| KL kernel, supervectors | 4.24%           | –              | –               | 4.98%       | –          | –           |
| Linear kernel           | 3.47%           | 2.93%          | –               | 4.64%       | 4.04%      | –           |
| Gaussian kernel         | 3.03%           | 2.98%          | –               | 4.59%       | 4.46%      | –           |
| Cosine kernel           | 3.08%           | 2.92%          | –               | 4.18%       | 4.15%      | –           |

Results on NIST 2008 SRE, core condition, telephone–telephone data (EER, all trials):

| System                  | No-norm | T-norm | ZT-norm |
|-------------------------|---------|--------|---------|
| JFA                     | –       | –      | 5.58%   |
| KL kernel, supervectors | –       | 7.92%  | –       |
| Linear kernel           | 7.06%   | 7.10%  | –       |
| Gaussian kernel         | 7.84%   | 7.42%  | –       |
| Cosine kernel           | 7.24%   | 7.24%  | –       |

Intersession Compensation in the Speaker-Factor Space — Within-Class Covariance Normalization (WCCN)

• Within-class covariance:
  W = (1/S) Σ_{s=1..S} (1/n_s) Σ_{i=1..n_s} (y_i^s − ȳ_s)(y_i^s − ȳ_s)^T
  where ȳ_s = (1/n_s) Σ_i y_i^s is the mean of the utterances of speaker s, S is the number of speakers, and n_s is the number of utterances of speaker s.
• The within-class covariance is calculated on the MIXER and Switchboard databases.

SVM-JFA: Speaker-Factor Space with WCCN

• Linear kernel: k(y1, y2) = y1^T W^{-1} y2
• Cosine kernel: k(y1, y2) = y1^T W^{-1} y2 / ( sqrt(y1^T W^{-1} y1) · sqrt(y2^T W^{-1} y2) )
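A minimal numpy sketch of both steps: estimating W from a labeled development set and scoring with the WCCN cosine kernel. The Cholesky factorization mirrors the chol(·) projection used for cosine-distance scoring later in this talk:

```python
import numpy as np

def wccn_matrix(Y, labels):
    """Within-class covariance of speaker-factor vectors (sketch).

    Y:      (N, R) one speaker-factor vector per utterance
    labels: (N,)   speaker identity of each utterance
    """
    R = Y.shape[1]
    speakers = np.unique(labels)
    W = np.zeros((R, R))
    for s in speakers:
        Ys = Y[labels == s]
        dev = Ys - Ys.mean(axis=0)
        W += dev.T @ dev / len(Ys)
    return W / len(speakers)

def wccn_cosine_kernel(y1, y2, W):
    """Cosine kernel with WCCN: k = y1^T W^-1 y2, length-normalized."""
    B = np.linalg.cholesky(np.linalg.inv(W))   # W^-1 = B B^T
    a, b = B.T @ y1, B.T @ y2                  # so that a.b = y1^T W^-1 y2
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```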
SVM-JFA: Speaker-Factor Space (WCCN) — Results

Results on NIST 2006 SRE, core condition, English trials (EER):

| Kernel        | Without WCCN, no-norm | Without WCCN, T-norm | With WCCN, no-norm | With WCCN, T-norm |
|---------------|-----------------------|----------------------|--------------------|-------------------|
| Linear kernel | 3.47%                 | 2.93%                | 2.87%              | 2.44%             |
| Cosine kernel | 3.03%                 | 2.98%                | 2.60%              | 2.45%             |

• WCCN gives roughly a 17% relative improvement; JFA scoring achieves 1.95%.

Handling Variability with SVMs — Good and Bad Variability

• Types of variability:
  – Good variability (inter-speaker): speaker information.
  – Bad variability (nuisance): session, channel.
• Approaches to handling variability: joint factor analysis, GMM + LFA, SVM + NAP, SVM + WCCN.

Handling Variability with SVMs — Motivation: Handling Nuisance

[Diagram: the SVM direction w is adjusted away from the estimated nuisance subspace U]

Handling Variability with SVMs — SVM Formulation

[Diagram: SVM margin with slack values ξ = 0, 1, 2, ∞ relative to the nuisance subspace U]

Handling Variability with SVMs — Results

• Using only 300 speaker factors (s = m + Vy); dimension of the nuisance subspace = 50.
• The results plot (lost in extraction) was annotated with relative improvements of 11% and 18%.

Handling Variability with SVMs — Future Work

• Beyond nuisance compensation: handle all variability, with a bias towards using inter-speaker variability and away from nuisance.
• Extend the formulation to full supervectors.

SVM-JFA: Speaker and Common Factor Space (s = m + Vy + Dz)

• Full joint factor analysis: s = m + Vy + Dz
  – y: speaker factors.
  – z: common factors (EER 6.23% on NIST 2006 SRE, English trials).
• How can we use speaker and common factors together with SVMs?
  – Score fusion.
  – Kernel combination.

Score Fusion vs. Kernel Combination

• Score fusion: a linear weighted score, S_F(x) = w_0 + Σ_{l=1..M} w_l S_l(x); the weights are computed on a development score dataset.
• Kernel combination: a linear weighted kernel, k_F(x, y) = Σ_{l=1..M} β_l k_l(x, y); no development dataset is needed for weight estimation.

Kernel Combination — Space

• A kernel in the speaker-factor (y) space and a kernel in the common-factor (z) space are combined into a new kernel function.

Kernel Combination — Training

• The linear kernel combination k_F(x, y) = Σ_l β_l k_l(x, y) is learned by multiple-kernel SVM training: find the weights β that maximize the margin of the large-margin classifier.

Results (EER):

| System             | 2006 English | 2006 All | 2008 English | 2008 All |
|--------------------|--------------|----------|--------------|----------|
| Cosine kernel on y | 2.34%        | 3.59%    | 3.86%        | 6.55%    |
| Cosine kernel on z | 6.26%        | 8.68%    | 10.34%       | 13.45%   |
| Linear score fusion| 2.11%        | 3.62%    | 3.23%        | 6.86%    |
| Kernel combination | 2.08%        | 3.62%    | 3.20%        | 6.60%    |
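A minimal sketch of the combination itself, with the weights assumed given; in the real system they are learned by the multiple-kernel SVM training described above, and the 0.8/0.2 values below are purely illustrative:

```python
import numpy as np

def combined_kernel(kernels, betas):
    """Linear kernel combination: k_F(u, v) = sum_l beta_l * k_l(u, v).

    kernels: list of callables k_l(u, v) -> float
    betas:   non-negative combination weights
    """
    def k_F(u, v):
        return sum(b * k(u, v) for k, b in zip(kernels, betas))
    return k_F

# Example: combine cosine kernels on the y (speaker) and z (common) factors.
def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

k_y = lambda u, v: cos(u["y"], v["y"])
k_z = lambda u, v: cos(u["z"], v["z"])
k = combined_kernel([k_y, k_z], [0.8, 0.2])   # illustrative weights
```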
Importance of Speaker and Channel Factors

• Gender-dependent JFA (female part).
• 2048 Gaussians, 60-dimensional features: 19 Gaussianized MFCCs + energy + delta + double delta.
• 300 speaker factors, 0 common factors, 100 channel factors for telephone speech.
• JFA hyperparameters are obtained on the MIXER and Switchboard databases.

Importance of Speaker and Channel Factors (cont.)

• M = m + Vy + Dz + Ux: apply intersession compensation in the speaker-factor space rather than the supervector space.
• Result: EER = 20% — oops: the channel factors contain information about the speaker.
• Three systems:
  – M = m + Vy + Ux
  – M = m + Vy
  – M = m + Tt
• V: low-rank matrix, eigenvoices (speaker variability, 300 dims).
• U: low-rank matrix, eigenchannels (channel variability, 100 dims).
• T: low-rank matrix containing the total variability (speaker and channel variability, 400 dims).

EER results on NIST 2006 SRE, core condition, English trials, female part (using the cosine kernel):

| Model           | EER   |
|-----------------|-------|
| M = m + Vy + Ux | 2.56% |
| M = m + Vy      | 2.74% |
| M = m + Tt      | 2.19% |

Conclusion

• SVM scoring in the speaker-factor space rather than the GMM supervector space:
  – Fairly good linear separation in the speaker-factor space.
  – Improves over SVM-JFA on supervectors.
  – Performance comparable to other scoring methods.
  – Allows for faster scoring.
• Generalized using combinations of factors.
• Further improvement using intersession compensation in the speaker-factor space.
• JFA as feature extraction.

SVM vs. Cosine Distance (CD) Scoring

• SVMs find a linear separation in the speaker-factor space.
• The score can be computed as a nearest-neighbor-style distance.
• Can we omit the SVM and use a different distance?

SVM vs. CD Scoring

• The idea is to compute the trial score as a cosine distance between the enrollment and test speaker models:
  score = (y_train^T y_test) / (||y_train|| · ||y_test||)
• Inspired by the SVM, normalization by the within-class covariance (WCCN) can be applied by projecting y ← chol(Σ_wc)^{-1} y before computing the score.

SVM vs. CD Scoring — Results in the y Space

• CD scoring applies gender-dependent ZT-norm; the SVM does not need any score normalization.

| System                       | 2006 det1 EER / DCF | 2006 det3 EER / DCF | 2008 det6 EER / DCF |
|------------------------------|---------------------|---------------------|---------------------|
| SVM w. cosine kernel, WCCN   | 2.38 / 1.27         | 3.62 / 1.80         | 6.59 / 3.23         |
| SVM w. cosine kernel         | 2.98 / 1.55         | 4.09 / 2.02         | 7.20 / 3.24         |
| Cosine dist w. WCCN, ZT-norm | 2.00 / 1.19         | 3.82 / 1.93         | 6.43 / 3.60         |
| Cosine dist w. ZT-norm       | 2.82 / 1.50         | 4.07 / 2.11         | 6.95 / 3.55         |

SVM vs. CD Scoring — Extending to the x, y, z Space

• Motivation:
  – z carries residual speaker information, which we want to use.
  – Ideally, there should be no speaker information in the x vector, as it expresses the channel shift.
  – We tried substituting the y vectors by x and z.
  – z was much worse than y, but still gave a reasonable result (approx. 9% EER on 2006 det1).
  – Surprisingly, using x gave around 25% EER on 2006 det1.
  – So let's train a linear fusion of all these systems.

SVM vs. CD Scoring — Results in the x, y, z Space

• CD scoring applies gender-dependent ZT-norm and a gender-dependent linear logistic regression fusion (for 2006, trained on 2008 and vice versa), using x, y and z.
• The SVM uses kernel combination with y and z only (no improvement from using x).

| System                         | 2006 det1 EER / DCF | 2006 det3 EER / DCF | 2008 det6 EER / DCF |
|--------------------------------|---------------------|---------------------|---------------------|
| SVM w. cosine kernel           | 2.98 / 1.55         | 4.09 / 2.02         | 7.20 / 3.24         |
| SVM w. cosine kernel — y,z     | 2.08 / 1.27         | 3.62 / 2.00         | 6.60 / 3.41         |
| Cosine dist w. ZT-norm         | 2.82 / 1.50         | 4.07 / 2.11         | 6.95 / 3.55         |
| Cosine dist w. ZT-norm — x,y,z | 2.11 / 1.26         | 3.62 / 1.87         | 6.24 / 3.29         |
SVM vs. CD Scoring — Conclusion

• CD scoring:
  – Positives: the scoring problem is symmetric; no training steps.
  – Negatives: ZT-norm is needed; poorer relative improvement on all trials (det1); possibly needs calibration.
• SVM:
  – Positives: generalizes well on all trials (det1); no need for score normalization.
  – Negatives: the SVM training procedure.

Discriminative Optimization of Speaker Recognition Systems
Lukas Burget & Niko Brummer, with lots of help from other team members, especially Ondrej Glembek, Najim Dehak and Valja Hubeika.

Discriminative Training — What Is New Here?

• Discriminative training of speaker models has been around for more than a decade, and SVM speaker modeling has been a constant feature at the NIST SRE evaluations since 2003. So what is new in this work?
• We propose to discriminatively optimize the whole speaker recognition system, rather than individual speaker models.

Traditional Discriminative Training

[Diagram: enrollment and test speech are feature-extracted; the per-speaker model estimation is discriminatively optimized; matching produces the score]

Current State of the Art

• Generative modeling via Joint Factor Analysis (ML optimization).
[Diagram: the system hyperparameters feed the model-estimation and matching stages]

Proposed Discriminative System Optimization

• Discriminatively optimize the system hyperparameters themselves.
• This methodology directly measures and optimizes the quality of the output of the whole system.
[Diagram: the discriminative optimization loop is closed over the match score, back to the system hyperparameters]

Discriminative Training — What Is New Here? (cont.)

• Typically we have a small amount of enrollment data for the target speaker, which disallows the use of standard discriminative techniques.
• We need to consider inter-session variability — an important problem in SRE.
• Only the recent data collections with the same speaker recorded over various channels allowed us to start work in this direction.

Discriminative System Optimization — Outline

• Motivation
• Envisioned advantages
• Challenges
• A few techniques to address these challenges
• Some preliminary experiments
Discriminative System Optimization — Motivation

• Several participants of this workshop have previous successful experience with similar training:
  – Discriminative training of weighted linear combinations of the outputs of multiple sub-systems (a.k.a. fusion) has been very successful in the last few NIST Speaker Recognition Evaluations (neural networks, SVMs, logistic regression).
  – Lukas and BUT were very successful with discriminative (MMI) training of GMMs in the similar task of language recognition in the last two NIST LREs.

Discriminative System Optimization — Envisioned Advantages

• Discriminative training can compensate for unrealistic generative modeling assumptions:
  – It could find hyperparameter estimates that give better accuracy than the ML estimates.
• Discriminative training can optimize smaller, simpler, faster systems to rival the accuracy of larger generatively trained systems:
  – In this workshop we concentrated on this aspect, with a few encouraging results.

Discriminative System Optimization — Challenges

• This is a difficult problem! In large LVCSR systems it took years for discriminative methods to catch up with generative ones.
• Complexity:
  – Computation of derivatives for optimization (gradient, Hessian) of complex systems.
  – Finding and coding good numerical optimization algorithms.
• Scale (CPU, memory):
  – Our current state-of-the-art systems can have tens of millions of parameters.
  – 1500 hours of training speech, or 250 million training examples.
• Overtraining (with up to millions of parameters).

Techniques — Computing Derivatives

• We tried a coding technique that automatically implements the chain rule for partial derivatives of function compositions:
  – Similar to back-propagation in neural networks; computationally equivalent to reverse-mode automatic differentiation.
  – It did not scale well for our problem: it involved multiplication of multiple Jacobian matrices of very large dimension.
• Our solution was to restrict our efforts to very simple system designs, for which the derivatives could be hand-coded and optimized.

Derivatives — Hand-Optimized

• Lukas hand-optimized a gradient calculation of 6 million components over 440 million training examples to run in 15 minutes on a single machine.
• This was made possible by:
  – Replacing the GMM log-likelihood calculation with a linear approximation (without significant performance loss).
  – Not doing ZT-norm (at some performance loss).

Techniques — Optimization Algorithms

• We investigated stochastic gradient descent (after an inspiring invited talk here at JHU by Yann LeCun):
  – It did not scale well in our computing environment; the hyperparameters were difficult to set; it is not obvious how to parallelize it over machines.
• We investigated MATLAB's optimization toolbox:
  – We tried the 'large scale' trust-region optimization algorithm; it did not scale well in time and space; it needs further investigation.
• Lukas was successful in his experiments with the Extended Baum-Welch algorithm.
• I was successful in my experiments with the RPROP algorithm (see http://en.wikipedia.org/wiki/Rprop).
• In both cases, we coded our own optimization algorithms in MATLAB for economy of scale.
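For reference, a sketch of one common RPROP variant; the exact variant and constants used in the workshop code are not specified in the slides:

```python
import numpy as np

def rprop_step(grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=1.0):
    """One RPROP update: adapt per-parameter step sizes from gradient signs.

    Returns (delta, new_step); parameters are then updated as theta -= delta.
    Only the sign of the gradient is used, which keeps the method cheap and
    robust for very large hyperparameter sets.
    """
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    # Where the gradient sign flipped, skip the update for this iteration.
    effective_grad = np.where(sign_change < 0, 0.0, grad)
    delta = np.sign(effective_grad) * step
    return delta, step
```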
Objective Function

• Our discriminative optimization objective function has many names: Maximum Mutual Information (MMI), minimum cross-entropy, logistic regression, …
  – This criterion optimizes classification error rates over wide ranges of priors and cost functions.
  – For linear systems, it gives a nice convex optimization objective.
  – It gives some protection against over-training.
  – It has been very successfully applied to fusion of sub-systems in the NIST SRE evaluations.

Overtraining

• I was optimizing 90 000 parameters and Lukas 6 million.
• This allows the training to learn irrelevant detail of the training data (even though we used hundreds of millions of training examples).
• We both managed to optimize EER << 1% on the development data (Switchboard, SRE'04+05) if we allowed the training to go too far.
• These overtrained systems did not generalize to good performance on independent test data (SRE'06+08).

Regularization to Combat Overtraining

• We used early stopping to combat overtraining: just stop training when performance on a cross-validation set stops improving.
• We hope to apply more principled approaches in the future:
  – adding SVM-style regularization penalties, or
  – more general Bayesian methods with appropriate priors on the hyperparameters.
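A minimal sketch of the criterion written as a prior-weighted binary cross-entropy over trial scores (one standard formulation; up to scaling constants this is the logistic-regression objective referred to above):

```python
import numpy as np

def cross_entropy_objective(scores, labels, prior=0.5):
    """Prior-weighted cross-entropy / logistic-regression criterion (sketch).

    scores: (N,) detection log-likelihood ratios for the training trials
    labels: (N,) 1 for same-speaker trials, 0 for different-speaker trials
    Lower is better.
    """
    llr = scores + np.log(prior / (1.0 - prior))
    tgt = labels == 1
    softplus = lambda x: np.logaddexp(0.0, x)   # log(1 + e^x), computed stably
    c_tgt = softplus(-llr[tgt]).mean()          # cost on target trials
    c_non = softplus(llr[~tgt]).mean()          # cost on non-target trials
    return prior * c_tgt + (1.0 - prior) * c_non
```

Early stopping then just tracks this objective (or EER) on a held-out cross-validation set and halts when it stops improving.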
Proof-of-Concept Experiments

• Niko: a smaller-scale experiment using 300-dimensional y-vectors for train and test, training 90 000 parameters.
• Lukas: larger-scale experiments using a 300-dimensional y-vector for train and a 20 000-dimensional statistic for test, training 6 million parameters.

Small-Scale Experiment

• Within-class covariance-normalized dot product between the y-vectors for train and test.
• A generative (ML) covariance estimate gives, on a subset (English females) of SRE 2006: EER = 2.61%.
• Discriminative retraining of the covariance gave an 11% relative improvement: EER = 2.33%.

Large-Scale Experiment 1

• Pure eigenvoice system (only V; no U and D); GMM with 512 components; 39-D features.
• The V matrix is trained discriminatively (300 × 20k parameters); the original speaker factors y are kept fixed.

SRE 2006, all trials (det1), EER [%]:

| System                                      | No norm | ZT-norm |
|---------------------------------------------|---------|---------|
| Generative V                                | 15.44   | 11.42   |
| Discriminative V                            | 7.19    | 5.06    |
| Discriminative V with channel-compensated y | 6.80    | 4.81    |
| Generative V and U                          | 6.99    | 4.07    |

Large-Scale Experiment 2

• Channel-compensated system (V and U; no D); only the V matrix is trained discriminatively; the original speaker factors y are kept fixed.

SRE 2006, all trials (det1), EER [%]:

| System                          | No norm | ZT-norm |
|---------------------------------|---------|---------|
| Generative V and U              | 6.99    | 4.07    |
| Discriminative V (generative U) | 6.00    | 3.87    |

Next Steps

• Re-estimation of other hyperparameters (e.g. U).
• Iterative re-estimation of both the hyperparameters and the factors.
• Direct optimization of the ZT-normalized system (the derivatives are difficult to compute).

Conclusion

• This is a large and difficult problem, but it has the potential of worthwhile gains: the possibility of more accurate, yet faster and smaller, systems.
• We have managed to show some proof of concept, but so far without improving on the state of the art.
• The remaining problems are practical and theoretical: the complexity of optimization, and principled methods for combating overtraining.

Robust Speaker Recognition — Summary

• Diarization: examined the application of JFA and Bayesian methods to diarization; produced 3–4% DER on summed telephone speech; working on challenging interview speech.
• Factor Analysis Conditioning: explored ways to use JFA to account for non-session variability (phonetic content); showed robustness using within-session, stacking and hierarchical modeling.
• SVM-JFA: developed techniques to use JFA elements in SVM classifiers; results comparable to the full JFA system but with fast scoring and no score normalization; better performance using all JFA factors.
• Discriminative System Optimization: focused on means to discriminatively optimize the whole speaker recognition system; demonstrated proof-of-concept experiments.

Robust Speaker Recognition

• An extremely productive and enjoyable workshop.
• The aim is to continue collaborating on these problem areas going forward.
• Cross-site joint efforts will provide big gains in future speaker recognition evaluations and experiments.
• Possible special session at ICASSP on the team's workshop efforts.