An Overview of Automatic Speaker Recognition
JHU 2010 Workshop Summer School
Douglas Reynolds, Senior Member of Technical Staff, MIT Lincoln Laboratory

This work was sponsored by the Department of Defense under Air Force contract F19628-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the Department of Defense.

Extracting Information from Speech
Goal: automatically extract information transmitted in the speech signal:
• Speech recognition → words ("How are you?")
• Language recognition → language name (English)
• Speaker recognition → speaker name (James Wilson)

Speech & Language Processing: Taxonomy & Applications
• Analysis/Synthesis: coding (compaction, enabler), speech synthesis, enhancement
• Recognition — for machines: speech detection (enabler); sex recognition (indexing, enabler); speaker recognition (biometrics, personalization, indexing, law enforcement); language recognition (routing, indexing, enabler); speech recognition (command & control, dictation, transcription, enabler)
• Text processing and machine translation — for humans: speech translation (translation, gisting, transcription, communication)

Outline
• Introduction
  – Definitions
  – Applications
• Structure of speaker detection systems
  – Features: spectral and high-level
  – Models: GMMs, SVMs
  – Compensation
• Evaluation and performance
  – Evaluation tools and metrics
  – Evaluation design and corpora
  – Performance survey

Evolution of Speaker Recognition
• 1930–1960: aural and spectrogram matching
• 1960–1980: template matching — dynamic time-warping, vector quantization
• 1980–2000: hidden Markov models, Gaussian mixture models
• 2000–present: support vector machines, GMM supervectors, session/channel compensation, …
• Speech technology research accelerated with the increase in
  – common corpora and evaluations
  – compute and storage capacity
• The field has moved from small databases of clean, controlled speech to large databases of realistic, unconstrained speech, with common corpora and evaluations and commercial application of speaker recognition technology.

Speaker Recognition Applications
• Access control: physical facilities, computer networks and websites
• Transaction authentication: telephone banking, remote credit card purchases
• Law enforcement: forensics, surveillance
• Speech data mining
• Personalization: voice mail browsing, speech skimming, intelligent answering machines, voice-web / device customization

Speaker Recognition Tasks
• Identification: whose voice is this?
• Verification/authentication/detection: is this Bob's voice?
• Segmentation and clustering (diarization): where are the speaker changes?
Which segments (Speaker A, Speaker B, …) are from the same speaker?

Speech Modalities
The application dictates the speech modality:
• Text-dependent: the recognition system knows the text spoken by the person (e.g., a fixed phrase or prompted phrase). Used in applications with strong control over user input; knowledge of the spoken text can improve system performance.
• Text-independent: the recognition system does not know the text spoken (e.g., a user-selected phrase or conversational speech). Used in applications with less control over user input; the system is more flexible, but the problem is more difficult. Speech recognition can provide knowledge of the spoken text.

Voice Biometric
• Speaker verification is often referred to as a voice biometric.
• Biometric: a human-generated signal or attribute used to authenticate a person's identity.
• Voice is a popular biometric:
  – a natural signal to produce
  – does not require a specialized input device
  – ubiquitous: telephones and microphone-equipped PCs
• A voice biometric can be combined with other forms of security. In order of increasing strength: something you have (e.g., a badge), something you know (e.g., a password), something you are (e.g., your voice — the strongest).

Outline (recap): Structure of speaker detection systems — features (spectral and high-level), models (GMMs, SVMs), compensation.

Phases of a Speaker Detection System
There are two distinct phases to any speaker verification system:
• Enrollment phase: enrollment speech from each speaker (e.g., Bob, Sally) passes through feature extraction and model training, producing a model for each speaker.
• Detection phase: the input speech passes through feature extraction and, together with the hypothesized identity (e.g., Sally), drives a detection decision — detected or not.
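As a rough illustration of this two-phase structure, here is a deliberately tiny, runnable Python toy (not from the slides): the feature extractor, model, and scorer are trivial stand-ins for the cepstral features and GMM/SVM models described in the following sections.

```python
import numpy as np

def extract_features(speech: np.ndarray) -> np.ndarray:
    """Toy features: per-frame mean and standard deviation (160-sample frames)."""
    frames = speech[: len(speech) // 160 * 160].reshape(-1, 160)
    return np.stack([frames.mean(axis=1), frames.std(axis=1)], axis=1)

def enroll(speech: np.ndarray) -> np.ndarray:
    """Enrollment phase: the 'model' here is just the mean feature vector."""
    return extract_features(speech).mean(axis=0)

def detect(speech: np.ndarray, model: np.ndarray, theta: float) -> bool:
    """Detection phase: score the claim and compare against a threshold."""
    score = -np.linalg.norm(extract_features(speech).mean(axis=0) - model)
    return score > theta

rng = np.random.default_rng(0)
sally = enroll(rng.normal(0.0, 1.0, 16000))              # enrollment phase
print(detect(rng.normal(0.0, 1.0, 16000), sally, -0.5))  # detection phase
```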
Features for Speaker Recognition
• Humans use several levels of perceptual cues for speaker recognition, ranging from high-level cues (learned traits) to low-level cues (physical traits):
  – Semantics, diction, pronunciations, idiosyncrasies (socio-economic status, education, place of birth) — difficult to extract automatically
  – Prosodics, rhythm, speed, intonation, volume modulation (personality type, parental influence)
  – Acoustic aspects of speech — nasal, deep, breathy, rough (anatomical structure of the vocal apparatus) — easy to extract automatically
• There are no exclusive speaker identity cues.
• Low-level acoustic cues are the most applicable for automatic systems.

Features for Speaker Recognition
• Desirable attributes of features for an automatic system (Wolf '72):
  – Practical: occur naturally and frequently in speech; easily measurable
  – Robust: do not change over time or with the speaker's health; not affected by reasonable background noise; do not depend on specific transmission characteristics
  – Secure: not subject to mimicry
• No feature has all of these attributes.
• Features derived from the spectrum of speech have proven to be the most effective in automatic systems.

Speech Production
• Speech production is modeled as a source-filter interaction: glottal pulses (the source) excite the vocal tract (the filter) to produce the speech signal. The anatomical structure of the vocal tract and glottis is conveyed in the speech spectrum.

Vocal-tract Information
• Different speakers will have different spectra for similar sounds. [Figure: vocal-tract cross sections and magnitude spectra (dB vs. frequency, 0–6000 Hz) for a male and a female speaker producing /AE/ and /I/.]
• The differences are in the location and magnitude of the peaks in the spectrum. The peaks are known as formants and represent resonances of the vocal cavity.
• The spectrum captures the formant locations and, to some extent, pitch, without explicit formant or pitch tracking.

Time-Varying Vocal-tract Information
• Speech is a continuous evolution of the vocal tract, so we need to extract a time series of spectra.
• Use a sliding window — a 20 ms window with a 10 ms shift — and take the magnitude of a Fourier transform of each frame. This produces the time-frequency evolution of the spectrum.
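To make the sliding-window analysis concrete, here is a minimal, runnable NumPy sketch (illustrative code, not from the slides) that computes the magnitude spectrum of 20 ms frames at a 10 ms shift:

```python
import numpy as np

def short_time_spectra(signal, sample_rate, win_ms=20, shift_ms=10):
    """Slide a 20 ms window in 10 ms steps and return the magnitude
    spectrum of each frame -- the time-frequency evolution above."""
    win = int(sample_rate * win_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    window = np.hamming(win)                    # taper to reduce spectral leakage
    n_frames = 1 + (len(signal) - win) // shift
    frames = np.stack([signal[i * shift : i * shift + win] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # one spectrum per 10 ms

# Example on a synthetic 8 kHz tone.
sr = 8000
t = np.arange(sr) / sr
spectra = short_time_spectra(np.sin(2 * np.pi * 440 * t), sr)
print(spectra.shape)  # (n_frames, n_frequency_bins)
```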
Spectral Features
• The number of discrete Fourier transform samples representing the spectrum is reduced by averaging frequency bins together, typically with a simulated filterbank.
• A perceptually based filterbank is used, such as a Mel- or Bark-scale filterbank: linearly spaced filters at low frequencies, logarithmically spaced filters at high frequencies.

Spectral Features
• The primary features used in speaker recognition systems are cepstral feature vectors, produced by the chain: Fourier transform → magnitude → filterbank → log() → cosine transform, yielding one feature vector every 10 ms (e.g., [3.4, 3.6, 2.1, 0.0, -0.9, 0.3, 0.1]).
• The log() function turns linear convolutional effects into additive biases, which are easy to remove using blind-deconvolution techniques.
• The cosine transform helps decorrelate the elements of the feature vector, placing less burden on the model and giving empirically better performance.

Spectral Features: Additional Robustness Measures
• To help capture some temporal information about the spectra, delta cepstra are often computed and appended to the cepstral feature vector — a first-order linear fit over a 5-frame (50 ms) span:

$$\frac{\partial c_i(t)}{\partial t} \approx \Delta c_i(n) = \frac{\sum_{k=-K}^{K} k\, c_i(n+k)}{\sum_{k=-K}^{K} k^2}$$

• Example from the slide: the cepstra frames [3.4, 3.6, 2.1, 0.0, -0.9, 0.3, 0.1], [3.3, 3.7, 0.7, 0.0, -0.1, 0.3, 0.2], [3.1, 3.6, 0.3, 0.0, 0.3, 0.3, 0.6] yield a delta cepstra vector [-0.3, 0.0, -1.7, 0.0, 1.2, 0.0, 0.5].
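The delta computation is easy to state in code. A minimal NumPy sketch of the regression formula above (illustrative; K = 2 gives the 5-frame, 50 ms span from the slide):

```python
import numpy as np

def delta(cepstra, K=2):
    """First-order regression over a (2K+1)-frame span:
    delta_c(n) = sum_{k=-K..K} k * c(n+k) / sum_{k=-K..K} k^2."""
    padded = np.pad(cepstra, ((K, K), (0, 0)), mode="edge")  # repeat edge frames
    denom = 2 * sum(k * k for k in range(1, K + 1))          # sum of k^2
    return sum(k * padded[K + k : K + k + len(cepstra)]
               for k in range(-K, K + 1)) / denom

# Three 3-dimensional cepstral frames -> one delta vector per frame.
c = np.array([[3.4, 3.6, 2.1], [3.3, 3.7, 0.7], [3.1, 3.6, 0.3]])
print(delta(c))
```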
Exploiting Higher Levels of Information
• Above the low-level spectral cues (physical traits) sit progressively higher-level cues (learned traits): prosodic (pitch and energy dynamics), phonetic (phone sequences such as /S/ /oU/ /m/ /i:/ /D/ /&/ /m/ /Λ/ /n/ /i:/ …), idiolectal (word patterns, e.g., "how shall i say this", "yeah i know"), dialogic, and semantic.

Prosodic Pitch/Energy Slope Features
• Estimation procedure:
  1. Compute the rate of change of the F0 and energy contours (delta features over 50 ms).
  2. Detect the points where the dynamics of the contours change.
  3. Generate new segments using the detected points.
  4. Convert the segments into a sequence of symbols representing the dynamics of both contours.
  5. Add a label (short/long) representing the segment duration: voiced (classes 1–4), short < 70 ms; unvoiced (class 5), short < 130 ms.
• Slope classes (temporal trajectory): class 1 — rising F0 and rising energy; class 2 — rising F0 and falling energy; class 3 — falling F0 and rising energy; class 4 — falling F0 and falling energy; class 5 — unvoiced segment. An utterance is thus encoded as a symbol sequence such as 5S 4S 2S 1S 3L 4L 5S 1S 2S 3S 4S 3S 2L 4S 5S.
[A. Adami, R. Mihaescu, D. Reynolds, and J. Godfrey, "Modeling Prosodic Dynamics for Speaker Recognition," ICASSP 2003]

Phonetic and Idiolect Information
• These features require speech or phone recognition systems to extract token features:
  – Phonetic: use phone sequences (from a phone recognizer, e.g., "S oU m i: D & m Λ n i:") to characterize speaker-specific pronunciations and speaking patterns.
  – Idiolect: use word sequences (from a speech recognizer, e.g., "Show me the money …") to characterize speaker-specific use of word patterns.
• People have also looked at linguistic patterns to characterize speaker-specific conversation style.
• Phone/word-based features are limited by recognizer errors, sparse information, and correlation with non-speaker information (topic, language, etc.).

Higher-Level Information
• Speaker information is carried by word bigrams; a bigram is simply the occurrence of two tokens in a sequence.

Phases of a Speaker Detection System: Speaker Models (recap)
• Recall the two phases: training speech for each speaker is converted by feature extraction and model training into a model per speaker; at detection time, features plus a hypothesized identity drive an accept/reject decision.

Speaker Models
• Speaker models are used to represent the speaker-specific information conveyed in the feature vectors.
• Desirable attributes of a speaker model: a theoretical underpinning; generalizability to new data; a parsimonious representation (in size and computation).
• Many different modeling techniques have been applied to speaker recognition problems: probability (generative) models and discriminative models.
• Before describing the modeling techniques, let's first look at how we want to use them in recognition.

Detection Decision: Bayes' Decision
• The detection task is fundamentally a two-class hypothesis test:
  – H0: the speech X is not from the hypothesized speaker
  – H1: the speech X is from the hypothesized speaker
• We select the most likely hypothesis (the Bayes test for minimum error):

$$\Pr(H_1 \mid X) \gtrless \Pr(H_0 \mid X)$$

Using Bayes' rule,

$$\frac{p(X \mid H_1)\Pr(H_1)}{p(X)} \gtrless \frac{p(X \mid H_0)\Pr(H_0)}{p(X)} \quad\Longrightarrow\quad \frac{p(X \mid H_1)}{p(X \mid H_0)} \gtrless \frac{\Pr(H_0)}{\Pr(H_1)}$$

• This is known as the likelihood ratio test:

$$LR = \frac{p(X \mid H_1)}{p(X \mid H_0)}, \qquad LR > \theta \Rightarrow \text{accept } H_1, \quad LR < \theta \Rightarrow \text{accept } H_0$$

Detection Decision
• The classic log-likelihood ratio test from signal detection theory:

$$LLR = \Lambda = \log p(X \mid H_1) - \log p(X \mid H_0), \qquad \Lambda > \theta \Rightarrow \text{accept}, \quad \Lambda < \theta \Rightarrow \text{reject}$$

• In the system diagram, feature extraction feeds both a speaker model (+) and an alternative model (−); their difference is the score Λ.
• We therefore need two models: a hypothesized speaker model for H1 and an alternative (background) model for H0. These can be explicit or implicit models; we will later look at how this is done for the different modeling techniques.
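A toy numeric illustration of the Bayes/likelihood-ratio test, with all values invented purely for illustration:

```python
import numpy as np

# Made-up likelihoods and priors for a single trial.
p_x_h1, p_x_h0 = 2.0e-4, 5.0e-5   # p(X|H1), p(X|H0)
pr_h1, pr_h0 = 0.01, 0.99         # prior belief that the claim is true / false

lr = p_x_h1 / p_x_h0              # likelihood ratio
theta = pr_h0 / pr_h1             # Bayes threshold Pr(H0)/Pr(H1)
print("LR =", lr, " theta =", theta,
      "-> accept H1" if lr > theta else "-> accept H0")

# Equivalent log-domain test used in practice:
llr = np.log(p_x_h1) - np.log(p_x_h0)
print("LLR =", llr, "  log-threshold =", np.log(theta), " accept:", llr > np.log(theta))
```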
Speaker Models
• We now describe the most common modeling techniques used in speaker recognition:
  – Gaussian mixture models (GMM) — probability (generative) models
  – Support vector machines (SVM) — discriminative models

Speaker Models: Probability Models
• Treat the speaker as a hidden random source generating the observed feature vectors x1, x2, x3, ….
• How can we mathematically characterize the speaker using a set of training observations?
• Suppose we use a one-dimensional feature vector. From training examples we obtain a histogram of feature values; a good model for a single-mode histogram is a Gaussian distribution, e.g., N(1.0, 0.25). Multi-modal histograms call for more flexible models.
• If we assume the source produces independent, identically distributed (iid) observations and we know the distribution, then a parametric pdf (e.g., a Gaussian with a mean and a variance) is used. This is a suitable approach for "static" sources.
• But we know speech is not a "static" source: the source has hidden "states" corresponding to different speech sounds.

Speaker Models: Gaussian Mixture Models
• Assume the feature vectors generated from each state follow a Gaussian distribution, and further assume all states are equally likely to occur.
• Each state i has parameters: a prior probability $p_i$, a mean vector $\vec{\mu}_i$, and a covariance matrix $\Sigma_i$, describing the feature distribution for that state.
• The total distribution of the source (speaker) is then characterized by a Gaussian mixture model:

$$\lambda_s = \{p_i, \vec{\mu}_i, \Sigma_i\}_{i=1}^{M}, \qquad p(\vec{x} \mid \lambda_s) = \sum_{i=1}^{M} p_i\, b_i(\vec{x})$$

where $b_i(\vec{x})$ is the Gaussian density of state $i$.

Speaker Models: Hidden Markov Models
• More precisely, the feature vectors generated from each state follow a Gaussian mixture distribution, with transition probabilities between states based on the modality of the speech: the text-dependent case will have ordered states, while the text-independent case allows all transitions.
• Model parameters: transition probabilities and per-state mixture parameters.
• This is a hidden Markov model (HMM); the GMM is the special case with equal transitions between all states.
• The form of the HMM depends on the application: word/phrase models for a fixed phrase ("Open sesame"); phoneme models (/e/, /t/, /n/) for prompted phrases/passwords; a single-state HMM (a GMM) for text-independent, general speech.
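As a concrete sketch of fitting and evaluating the GMM density above, here is a short Python example using scikit-learn's GaussianMixture (my choice of library, not the slides'): it fits λ_s on synthetic training frames and computes the per-frame average log-likelihood used in scoring.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train_frames = rng.normal(size=(2000, 12))   # stand-in cepstral feature vectors
test_frames = rng.normal(size=(300, 12))

# Diagonal-covariance GMM "speaker model" lambda_s with M = 8 components.
speaker_gmm = GaussianMixture(n_components=8, covariance_type="diag",
                              random_state=0).fit(train_frames)

# Average log-likelihood per frame: (1/N) sum_t log p(x_t | lambda_s).
print(speaker_gmm.score(test_frames))
```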
GMM Speaker Models: Background Model
• The likelihood ratio test requires an alternative model for H0, and there are two main approaches to creating one:
  – Cohorts / likelihood sets / background sets (Higgins, DSPJ '91): use a collection of other speaker models; the likelihood of the alternative hypothesis is some function, such as the average, of the individual impostor model likelihoods:

$$p(X \mid H_0) = f\big(p(X \mid \mathrm{Bkg}_b),\ b = 1, \dots, B\big)$$

  – General/world/universal background model (Carey, ICASSP '91; Reynolds, ICSLP '95): use a single speaker-independent model, trained on speech from a large number of speakers, to represent general speech patterns; the speaker and background models are often coupled:

$$p(X \mid H_0) = p(X \mid \mathrm{UBM})$$

GMM Speaker Models: Background Model
• The background model is crucial to good performance: it acts as a normalization that helps minimize non-speaker-related variability in the decision score.
• Using only the speaker model's likelihood does not perform well: it is too unstable for setting decision thresholds and is influenced by too many non-speaker-dependent factors.
• The background model should be trained on speech representing the expected impostor speech: the same type of speech as the speaker enrollment (modality, language, channel), with representation of the impostor and microphone types to be encountered.

Likelihood Ratio Scoring with Cohorts
• The cohort approach works well but has some drawbacks:
  1. It requires marked training data for the cohort speakers.
  2. Several models must be scored for each input vector.
  3. Cohorts must be selected to cover all regions of the feature space.
  4. There is no coupling between the target and background scores.
• To address these issues, the approach of using an adapted GMM was developed.

Adapted GMMs
• The basic idea is to start with a single background model (UBM) that represents general speech.
• Using the target speaker's training data, "tune" the general speech model to the specifics of the target speaker. This tuning is done via unsupervised Bayesian adaptation: UBM components near the target training data are shifted toward it, producing the target model.
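A minimal sketch of the mean-only "relevance MAP" form of this adaptation, in the spirit of the GMM-UBM approach (scikit-learn is my choice here, and the relevance factor r = 16 is a conventional value, not taken from the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, X: np.ndarray, r: float = 16.0):
    """Mean-only relevance-MAP adaptation of a UBM toward target data X."""
    post = ubm.predict_proba(X)                      # responsibilities, (N, M)
    n = post.sum(axis=0)                             # soft counts per component
    ex = post.T @ X / np.maximum(n, 1e-10)[:, None]  # E[x] per component
    alpha = (n / (n + r))[:, None]                   # data-dependent adaptation weight
    return alpha * ex + (1 - alpha) * ubm.means_     # adapted means

# Toy usage: train a small "UBM", then adapt it on a handful of target frames.
rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(rng.normal(size=(500, 12)))
target_frames = rng.normal(loc=0.5, size=(50, 12))
adapted_means = map_adapt_means(ubm, target_frames)
print(adapted_means.shape)  # (4, 12): one adapted mean per component
```

Components that see little target data (small soft count n) keep alpha near 0 and stay close to the UBM means, which is the coupling between target and background models that the cohort approach lacks.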
Speaker Models (recap): having covered GMMs (generative), we now turn to support vector machines (discriminative).

Generative vs. Discriminative Models: Support Vector Machines
• The GMM-UBM is considered a generative model: it focuses on representing the total distribution of the speaker's data; parameters are estimated with maximum-likelihood or maximum a posteriori criteria; competition with other models comes through the likelihood ratio.
• Support vector machines (SVMs) are an example of a discriminative model: the model focuses on representing the boundary between competing speakers' data (class 0 vs. class 1); parameters are estimated with a maximum-margin (separation boundary) criterion; competition with other classes is directly optimized in model training.

Speaker Detection Systems: Score Fusion
• Scores from different types of features/models can be combined using a simple fuser, which requires scores from development data to train. A generalized linear regression fuser works well.
• An added benefit is that we can obtain calibrated scores, e.g., posterior probability estimates in [0, 1].
• Example architecture: feature extraction → GMM-UBM (with UBM λubm, weight a1) and feature extraction → SVM-GSV (weight a2), combined with bias b by the fusion/calibration stage into a single detection score.

Outline (recap): Structure of speaker detection systems — compensation; then evaluation and performance.

Speaker Detection Systems: Channel/Session Effects
• The largest challenge to practical use of speaker detection systems is channel/session variability: changes in channel effects between training and successive detection attempts.
• Channel/session effects encompass several factors:
  – the microphones: carbon-button, electret, hands-free, array, etc.
  – the acoustic environment: office, car, airport, etc.
  – the transmission channel: landline, cellular, VoIP, etc.
• Intrinsic variability (stress, illness, speaking style, etc.) can also affect systems and is much more difficult to characterize than extrinsic variability.
• Anything that affects the spectrum can cause problems: speaker and channel effects are bound together in the spectrum, and hence in the features used by speaker verifiers.

Channel Compensation Examples — [figure slide]

Channel/Session Compensation
• Channel/session compensation occurs at several levels in a speaker detection system (front-end processing feeds an adapted target model and the background model, whose combined score Λ passes through LR score normalization):
  – Signal domain: noise removal, tone removal
  – Feature domain: cepstral mean subtraction, RASTA filtering, mean & variance normalization, feature warping, feature mapping, eigenchannel adaptation in the feature domain
  – Model domain: speaker model synthesis, eigenchannel compensation, joint factor analysis, nuisance attribute projection
  – Score domain: Z-norm, T-norm, ZT-norm

Outline (recap): Evaluation and performance — evaluation tools and metrics; evaluation design and corpora; performance survey.

Evaluation Metrics
• Recall the basic detection question: are two audio samples spoken by the same person? The pair of samples is known as a trial.
• Two types of errors can occur:
  – Miss: incorrectly rejecting a true trial (also known as a false reject, or a Type-I error)
  – False accept: incorrectly accepting a false trial (also known as a Type-II error)
• The performance of a detection system is a measure of the trade-off between these two errors, usually controlled by adjusting the decision threshold.

Evaluation Metrics
• Evaluation errors are estimates of the true errors from a finite number of trials:

$$\Pr(\mathrm{fa} \mid \theta) = \int_{\Lambda > \theta} p(\Lambda \mid \text{false trial})\, d\Lambda, \qquad \Pr(\mathrm{miss} \mid \theta) = \int_{\Lambda < \theta} p(\Lambda \mid \text{true trial})\, d\Lambda$$

$$\Pr(\mathrm{miss} \mid \theta) \approx \frac{\#\{\text{true trial scores} < \theta\}}{N_{\text{true}}}, \qquad \Pr(\mathrm{fa} \mid \theta) \approx \frac{\#\{\text{false trial scores} > \theta\}}{N_{\text{false}}}$$
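These empirical estimates are one line each in code. A sketch on synthetic trial scores (the score distributions are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(2.0, 1.0, 1000)     # Lambda on true (target) trials
false_scores = rng.normal(0.0, 1.0, 10000)   # Lambda on false (impostor) trials

def error_rates(theta):
    p_miss = (true_scores < theta).mean()    # #{true scores < theta} / N_true
    p_fa = (false_scores > theta).mean()     # #{false scores > theta} / N_false
    return p_miss, p_fa

# Raising the threshold trades false accepts for misses.
for theta in (0.0, 1.0, 2.0):
    p_miss, p_fa = error_rates(theta)
    print(f"theta={theta}: miss={p_miss:.3f}, fa={p_fa:.3f}")
```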
Evaluation Metrics: ROC and DET Curves
• A plot of Pr(miss) vs. Pr(fa) as the threshold decreases shows system performance.
• The receiver operating characteristic (ROC) plots the two error probabilities directly; the detection error tradeoff (DET) plot shows Pr(miss) and Pr(fa) on a normal-deviate scale, where curves closer to the lower-left origin indicate better performance.

Evaluation Metrics: DET Curve
• The application operating point depends on the relative costs of the two errors:
  – Wire transfer (high security): false acceptance is very costly, and users may tolerate rejections for security.
  – Toll fraud (high convenience): false rejections alienate customers, and any fraud rejection is beneficial.
• The equal error rate (EER) — the balanced operating point where the two error rates are equal (e.g., 1%) — is often quoted as a summary performance measure.
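A sketch of computing an EER and DET-curve coordinates from pooled trial scores (synthetic scores again; SciPy's norm.ppf provides the normal-deviate warping of the DET axes):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
true_scores = rng.normal(2.0, 1.0, 1000)     # target-trial scores
false_scores = rng.normal(0.0, 1.0, 10000)   # impostor-trial scores

# Sweep the threshold over all observed scores.
thresholds = np.sort(np.concatenate([true_scores, false_scores]))
p_miss = np.array([(true_scores < t).mean() for t in thresholds])
p_fa = np.array([(false_scores > t).mean() for t in thresholds])

# EER: the operating point where the two error rates cross.
i = np.argmin(np.abs(p_miss - p_fa))
print("EER ~", (p_miss[i] + p_fa[i]) / 2)

# DET coordinates: error probabilities on the normal-deviate scale, where
# Gaussian-like score distributions trace approximately straight lines.
eps = 1e-6
det_x = norm.ppf(np.clip(p_fa, eps, 1 - eps))
det_y = norm.ppf(np.clip(p_miss, eps, 1 - eps))
```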
Evaluation Design: Data Selection Factors
• Performance numbers are only meaningful when the evaluation conditions are known:
  – Speech quality: channel and microphone characteristics; ambient noise level and type; variability between enrollment and verification speech
  – Speech modality: fixed/prompted/user-selected phrases; free text
  – Speech duration: duration and number of sessions of enrollment and verification speech
  – Speaker population: size, composition, experience
• The evaluation data and design should match the target application domain of interest.

Evaluation Design: Data Providers
• Linguistic Data Consortium: http://www.ldc.upenn.edu/
• European Language Resources Association: http://www.icp.inpg.fr/ELRA/home.html

Performance Survey: Range of Performance
[Figure: probability of false reject vs. probability of false accept, with performance improving as constraints increase. Approximate operating points: text-dependent (combinations) on clean, single-microphone data with large amounts of train/test speech, ~0.1%; text-dependent (digit strings) on telephone data with multiple microphones and moderate training data, ~1%; text-independent (conversational) on telephone data with multiple microphones and moderate training data, ~10%; text-independent (read sentences) on military radio data with multiple radios and microphones and small amounts of training data, ~25%.]

Performance Survey: Closed-Set Identification
[Figure: closed-set identification error (%) vs. population size (10–600 speakers) for the TIMIT (clean), NTIMIT (telephone), and Switchboard-I corpora; error grows with population size and degrades substantially from clean to telephone speech.]

Performance Survey: NIST Speaker Recognition Evaluations
• Annual NIST evaluations of speaker verification technology (since 1995).
• Aim: provide a common paradigm for comparing technologies. Focus: conversational telephone speech (text-independent).
• The evaluation cycle links technology consumers (application domain and parameters), the data provider (Linguistic Data Consortium), the evaluation coordinator (NIST), and technology developers in an evaluate/improve loop, with comparison of technologies on a common task.
• http://www.nist.gov/speech/tests/spk/index.htm

Performance Survey: Human vs. Machine
• Motivation for comparing humans to machines: evaluating speech coders and potential forensic applications.
• Schmidt-Nielsen and Crystal used the NIST evaluation framework (DSP Journal, January 2000): the same amount of training data; matched and mismatched handset-type tests; 3-second conversational utterances from telephone speech.
• Results: on matched handsets, human error rates were 15% worse than the computer's; on mismatched handsets, humans were 44% better.
• Humans have more robustness to channel variability and use different levels of information.

Performance Survey: Human Forensic Performance
• In 1986, the Federal Bureau of Investigation published a survey of two thousand voice identification comparisons made by FBI examiners: forensic comparisons completed over a period of fifteen years, under actual law enforcement conditions. The examiners had a minimum of two years' experience, had completed over 100 actual cases, and used both aural and spectrographic methods.
• Outcomes (recorded threat vs. suspect): no decision, 65.2% (1304); non-match, 18.8% (378); match, 15.9% (318). Among the decisions made, false rejects were 0.53% (2) and false accepts 0.31% (1).
• From Bruce E. Koenig, "Spectrographic voice identification: A forensic survey," J. Acoust. Soc. Am., 79(6), June 1986. http://www.owlinvestigations.com/forensic_articles/aural_spectrographic/fulltext.html#research

Summary
• A broad overview of automatic speaker recognition.
• The major concepts behind modern speaker recognition systems: features and models.
• The key elements in evaluating the performance of a speaker recognition system, and the range of expected performance.
• The challenges of successfully applying speaker recognition.