Speaker verification using i-Vectors
Andreas Nautsch, Hochschule Darmstadt

Student: Andreas Nautsch, [email protected] · Examiner: Prof. Dr. Christoph Busch, [email protected] · Second examiner: Dr. Christian Rathgeb, [email protected] · Supervisor: Prof. Dr. Herbert Reininger, [email protected] · Associated company: atip GmbH, http://www.atip.de

1. Motivation

Speaker verification is becoming more important as a biometric key security solution in industrial, forensic, and governmental contexts, e.g. for data encryption on mobile devices or user validation in contact centres. Users tend to be annoyed by maintaining several PINs and passwords; biometrics, which cannot be lost or forgotten, therefore bring substantial benefits with respect to usability [1].

- Current speaker recognition research covers
  - Text-independent scenarios, i.e. free speech
  - Various-duration samples, e.g. call-centre speech
  - Short-duration samples, e.g. random PINs
- Statistical comparison approaches
  - Model-based: sound unit models for each subject ⇒ accurate on short samples, but high computational effort on long samples
  - Template-based: offset to a universal background model (UBM) ⇒ accurate on long samples, but ambiguous behaviour on short samples

2. From Speech to i-vectors

Characteristic information about the voice of a subject can be obtained by extracting speech signal features. The acoustical features depend on both the speaker and the spoken text. Different speakers can be distinguished by comparing patterns across the whole frequency band, which are obtained by a segment-wise applied Fourier transform and the extraction of Mel-Frequency Cepstrum Coefficients (MFCCs).

Figure: Speech sample, left: raw values (changes in air pressure over time in s), right: spectrogram (frequency in kHz) after segment-wise Fourier transform.

Different speakers occupy different subspaces within a universal acoustical cluster, the UBM. The subspace offset from the universal cluster describes the direction vector of a sample, which depends on the verbalised text of a speaker. To obtain speaker-only features, these vectors are analysed for characteristic factors; for this purpose, a factor analysis model is trained iteratively. The factor-analysed features are referred to as identity vectors (i-vectors) [2]. Reference and probe i-vectors can then be compared, e.g. by the cosine distance between the two vectors.

Figure: Speakers within acoustical space clusters (schematic, MFCC-1 vs. MFCC-2), left: Gaussian Mixture Model (GMM) cluster with 4 components of the figure 1 sample, right: intended differences between the UBM cluster and speaker-dependent subspaces leading to characteristic offsets.

3. Research questions

- Is the i-vector approach extensible to short duration scenarios with applicable performance?
- Do i-vector systems deliver new information to approaches that are known to perform well on short duration samples?
- Are duration-dependent performance mismatches compensable?

4. Short duration experiments & analysis

The short duration scenario comprises sequences of 5 spoken German digits between 0 and 9 (approx. 5 seconds) by 356 subjects, who were separated into 56 calibration and 300 evaluation subjects. Configuration evaluations on the number of UBM components, the number of characteristic factors, and the iterations of the factor analysis model were first performed on the calibration subset with respect to the Equal Error Rate (EER) and then validated on the evaluation subset.

To preserve reproducibility, an ISO/IEC standardised general biometric system design [3] was adapted to speaker verification purposes and implemented in Matlab. Further, public speaker recognition community tools were used for speech signal processing, i-vector extraction, and evaluation measurements.

Figure: Speaker verification framework design based on ISO/IEC 19795-1:2006 [3].

Figure: (a-d): EER performances (in %) of i-vector system configurations over the number of characteristic factors (i-vector dimension) and factor analysis model training iterations, for (a) C = 128, (b) C = 256, (c) C = 512, and (d) C = 1024 UBM components; (e): detection error tradeoff diagram comparing the False Match Rate (FMR) and the False Non-Match Rate (FNMR) for the three best configurations on the development (dashed) and evaluation (solid) data sets.

Table: Real-time comparison of the model-based Hidden Markov Model (HMM) system as baseline and the i-vector systems

System            | Enrolment (33×2.5 s/subject) | Verification (5 s/attempt) | EER
HMM               | 206.2%                       | 5.5%                       | 0.9%
i-vector-128/400  | 6.1%                         | 3.2%                       | 2.1%
i-vector-256/400  | 9.9%                         | 3.1%                       | 2.5%
i-vector-512/300  | 16.2%                        | 3.3%                       | 2.3%

- Applicable i-vector configurations positively validated
  - 128 & 256 UBM components: 5 iterations & 400 factors
  - 512 UBM components: 10 iterations & 300 factors
- Applicable performance on the short duration scenario with content constraints, i.e. random PINs
- i-vector systems outperform the HMM baseline in real-time with up to 97% relative gains on subject enrolments and 44% relative gains on verifications, while the HMM retains the lower EER
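The cosine comparison of i-vectors and the EER-based configuration evaluation described above can be illustrated with a short sketch. This is not the thesis' Matlab implementation: all vectors and scores below are made-up toy values, and the EER is approximated by a simple threshold sweep.

```python
import math

def cosine_score(ref, probe):
    """Cosine similarity between a reference and a probe i-vector
    (higher score = more likely the same speaker)."""
    dot = sum(r * p for r, p in zip(ref, probe))
    norm = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(p * p for p in probe))
    return dot / norm

def equal_error_rate(genuine, impostor):
    """Sweep the decision threshold over all observed scores and return
    the operating point where FMR (impostors accepted) and FNMR
    (genuine attempts rejected) are closest, i.e. an EER approximation."""
    best = None
    for t in sorted(genuine + impostor):
        fmr = sum(s >= t for s in impostor) / len(impostor)
        fnmr = sum(s < t for s in genuine) / len(genuine)
        if best is None or abs(fmr - fnmr) < best[0]:
            best = (abs(fmr - fnmr), (fmr + fnmr) / 2)
    return best[1]

# Toy comparison trials: genuine scores tend towards 1, impostor scores lower.
genuine_scores = [cosine_score([1.0, 0.2], [0.9, 0.3]),
                  cosine_score([0.5, 1.0], [0.6, 0.9])]
impostor_scores = [cosine_score([1.0, 0.2], [-0.4, 1.0]),
                   cosine_score([0.5, 1.0], [1.0, -0.7])]
print("EER on toy trials:", equal_error_rate(genuine_scores, impostor_scores))
```

In the experiments above this measure drives both the (a-e) configuration sweeps and the detection error tradeoff curves.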
5. Fusion of baseline and i-vector systems

Score-level fusions were performed on the three i-vector systems and between the HMM baseline system and the most promising i-vector system on the calibration subset, the 256 UBM component i-vector system. By fusing only i-vector systems, significant gains could be observed, and the HMM+i-vector-256 fusion significantly outperformed all other systems in high-security applications, which are indicated by low false match rates (FMRs), where lower false non-match rates (FNMRs) mean fewer genuine attempt rejections.

Figure: Detection error tradeoff comparison of the HMM baseline, i-vector, and fused systems (HMM, i-vector-256, i-vector-128+256+512, HMM+i-vector-256).

- By using additional information from i-vector systems, the biometric performance in terms of EER relatively improves by 44%

6. Score cross-entropy evaluation

For forensic purposes, recognition systems not only need high performance; it is also important to know the evidence that these systems provide. For a binary recognition system, evidence is measured by score cross-entropy, where Bayesian thresholds are denoted with respect to the prior odds of genuine attempts, or likewise the cost of falsely accepting impostors. For comparability across various prior odds, Bayesian entropy error rates are normalised by the corresponding default entropy, at which the evidence equals a random guess. Further, the system entropy, i.e. the total entropy H_tot_norm, can be separated into the minimum entropy of a well-calibrated system H_min_norm and a calibration loss [4].

7. Duration-invariant i-vector score normalisation

The National Institute of Standards and Technology (NIST) hosted the 2014 i-vector machine learning challenge, in which i-vectors of 414 subjects were supplied together with the corresponding sample durations (1-300 seconds). Varying sample durations are emphasised by NIST for the purpose of examining durational effects on speaker verification robustness.

Figure: The occurrence quantity of observed phones decreases exponentially with duration [5].

- Hasan et al. [5] introduced duration groups of 5, 10, 20, and 40 seconds and full (>40 seconds)
- Adaptive Symmetric (AS) score normalisation compares reference and probe i-vectors to a development set of i-vectors
  - Computing mean and deviation over the top-100 comparisons for each side: reference and probe
  - S' = 1/2 [ (S − µ_reference)/σ_reference + (S − µ_probe)/σ_probe ]
- New approach: Duration-invariant AS normalisation (dAS)
  - Idea: only compare i-vectors of the same duration group
  - On the reference side: development i-vectors with the probe's duration
  - On the probe side: development i-vectors with duration >60 seconds
- Offline evaluation: 10×5-fold cross-validation

Table: System performances: avg. EER (in %), FMR100, minDCF

System   | EER  | FMR100 | minDCF | Challenge
Baseline | 2.56 | 5.15   | 0.428  | 0.336
AS       | 2.49 | 4.48   | 0.378  | 0.331
dAS      | 2.06 | 3.47   | 0.364  | 0.312

Table: Average cross-entropy comparison: all scores & duration groups

System   | all  | 5s   | 10s  | 20s  | 40s  | full
Baseline | 0.89 | 0.95 | 0.93 | 0.92 | 0.89 | 0.86
AS       | 2.49 | 0.17 | 1.18 | 0.41 | 0.18 | 0.05
dAS      | 0.10 | 0.35 | 0.20 | 0.11 | 0.07 | 0.07

Figure: Detection error tradeoff diagram and evaluation on varying durations: baseline i-vector system, baseline+AS normalisation, and baseline+dAS normalisation, where NIST refers to the minimum detection error cost function (minDCF), the normalised Bayesian error rate at the NIST threshold.
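The AS normalisation formula and the dAS cohort restriction above can be sketched as follows. This is a simplified illustration, not the evaluated implementation: the data layout (plain score lists, a duration-group label per development i-vector) is an assumption made for the example.

```python
from statistics import mean, stdev

def as_norm(raw_score, ref_cohort_scores, probe_cohort_scores, top_n=100):
    """Adaptive Symmetric (AS) score normalisation:
    S' = 1/2 * ((S - mu_ref)/sigma_ref + (S - mu_probe)/sigma_probe),
    where mean and deviation are taken over the top-N cohort scores of the
    reference and of the probe against a development set of i-vectors.
    Each cohort needs at least two scores for a deviation estimate."""
    top_ref = sorted(ref_cohort_scores, reverse=True)[:top_n]
    top_probe = sorted(probe_cohort_scores, reverse=True)[:top_n]
    return 0.5 * ((raw_score - mean(top_ref)) / stdev(top_ref)
                  + (raw_score - mean(top_probe)) / stdev(top_probe))

def das_cohort(dev_ivectors, duration_group):
    """dAS idea: restrict the cohort to development i-vectors of one
    duration group, e.g. the probe's group on the reference side and
    the >60 s group on the probe side."""
    return [iv for iv, group in dev_ivectors if group == duration_group]

# Toy usage: normalise one raw cosine score against small made-up cohorts.
raw = 0.42
normalised = as_norm(raw,
                     ref_cohort_scores=[0.10, 0.30, 0.20],
                     probe_cohort_scores=[0.40, 0.20, 0.30])
print("AS-normalised score:", normalised)
```

With cohorts filtered through `das_cohort`, the same `as_norm` call realises the duration-invariant variant.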
Figure: Entropy evaluation of the HMM baseline, i-vector, and fused systems with the Bayesian threshold η of the 2014 NIST i-vector machine learning challenge.

- In general, fused systems provide more evidence
- The HMM+i-vector-256 fusion was observed to have the lowest score cross-entropy, and hence contains the highest amount of information: the lowest integral under the H_tot_norm curve, i.e. the cost of Log-Likelihood Ratio (LLR) scores, Cllr = 0.028

8. Future perspectives

- Examining other i-vector comparators, e.g. probabilistic linear discriminant analysis (PLDA)
- Effect analysis of other speech signal features, e.g. MHECs, ProsPols, or RASTA-PLPs
- Evaluating feature-level fusions, i.e. concatenating i-vectors stemming from different features

References

[1] A. Krupp et al.: Social Acceptance of Biometric Technologies in Germany: A Survey, IEEE BIOSIG, 2013.
[2] N. Dehak et al.: Front-End Factor Analysis for Speaker Verification, IEEE/ACM TASLP, 2011.
[3] ISO/IEC: Information technology — Biometric performance testing and reporting — Part 1: Principles and framework, ISO/IEC 19795-1:2006(E), 2011.
[4] N. Brümmer: Measuring, refining and calibrating speaker and language information extracted from speech, Ph.D. dissertation, University of Stellenbosch, 2010.
[5] T. Hasan et al.: Duration Mismatch Compensation for i-Vector based Speaker Recognition Systems, IEEE ICASSP, 2013.