An Overview of Automatic Speaker Recognition
JHU 2010 Workshop Summer School

Douglas Reynolds
Senior Member of Technical Staff
MIT Lincoln Laboratory

This work was sponsored by the Department of Defense under Air Force contract F19628-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the Department of Defense.
Extracting Information from Speech

Goal: Automatically extract information transmitted in the speech signal.

• Speech recognition: speech signal → words (“How are you?”)
• Language recognition: speech signal → language name (English)
• Speaker recognition: speech signal → speaker name (James Wilson)
Speech & Language Processing
Taxonomy & Applications

• Speech processing
  – Analysis/synthesis: speech synthesis; enhancement; coding (compaction, enabler)
  – Recognition (for machines): speech detection (indexing, enabler); sex recognition; speaker recognition (biometrics, personalization, indexing, law enforcement); language recognition (interfaces, routing, enabler); speech recognition (command & control, dictation, transcription, enabler)
• Text processing: machine translation
• For humans: speech translation (translation, gisting, transcription, communication)
Outline
Speaker Recognition

• Introduction
  – Definitions
  – Applications
• Structure of Speaker Detection Systems
  – Features: Spectral and high-level
  – Models: GMMs, SVMs
  – Compensation
• Evaluation and Performance
  – Evaluation tools and metrics
  – Evaluation design and corpora
  – Performance survey
Evolution of Speaker Recognition

• 1930–1960: Aural and spectrogram matching
• 1960–1980: Template matching
• 1980s: Dynamic time-warping; vector quantization
• 1990s: Hidden Markov models; Gaussian mixture models
• 2000–present: Support vector machines; GMM supervectors; session/channel compensation; …
• Speech technology research accelerated with increases in:
  – Common corpora and evaluations
  – Compute and storage capacity
• The field progressed from small databases of clean, controlled speech, through common corpora and evaluations on large databases of realistic, unconstrained speech, to commercial application of speaker recognition technology.
Speaker Recognition Applications

• Access control: physical facilities; computer networks and websites
• Transaction authentication: telephone banking; remote credit card purchases
• Law enforcement: forensics; surveillance
• Speech data mining: voice mail browsing; speech skimming
• Personalization: intelligent answering machines; voice-web / device customization
Speaker Recognition Tasks

• Identification: Whose voice is this?
• Verification/authentication/detection: Is this Bob’s voice?
• Segmentation and clustering (diarization): Where are the speaker changes? Which segments are from the same speaker (Speaker A vs. Speaker B)?
Speech Modalities

The application dictates different speech modalities:

Text-dependent
• Recognition system knows the text spoken by the person
• Examples: fixed phrase, prompted phrase
• Used for applications with strong control over user input
• Knowledge of the spoken text can improve system performance

Text-independent
• Recognition system does not know the text spoken by the person
• Examples: user-selected phrase, conversational speech
• Used for applications with less control over user input
• More flexible system but also a more difficult problem
• Speech recognition can provide knowledge of the spoken text
Voice Biometric

• Speaker verification is often referred to as a voice biometric
• Biometric: a human-generated signal or attribute for authenticating a person’s identity
• Voice is a popular biometric:
  – natural signal to produce
  – does not require a specialized input device
  – ubiquitous: telephones and microphone-equipped PCs
• A voice biometric can be combined with other forms of security; using all three factors gives the strongest security:
  – Something you have (e.g., badge)
  – Something you know (e.g., password)
  – Something you are (e.g., voice)
Outline
Speaker Recognition

• Introduction
  – Definitions
  – Applications
• Structure of Speaker Detection Systems
  – Features: Spectral and high-level
  – Models: GMMs, SVMs
  – Compensation
• Evaluation and Performance
  – Evaluation tools and metrics
  – Evaluation design and corpora
  – Performance survey
Phases of a Speaker Detection System

There are two distinct phases to any speaker verification system:

• Enrollment phase: enrollment speech from each speaker (e.g., Bob, Sally) → feature extraction → model training → a model for each speaker
• Detection phase: test speech → feature extraction → detection decision against the model for the hypothesized identity (e.g., Sally) → “Detected!”
Features for Speaker Recognition

• Humans use several levels of perceptual cues for speaker recognition

Hierarchy of perceptual cues, from high-level cues (learned traits, difficult to automatically extract) to low-level cues (physical traits, easy to automatically extract):

• Semantics, diction, pronunciations, idiosyncrasies (reflecting socio-economic status, education, place of birth)
• Prosodics, rhythm, speed, intonation, volume modulation (reflecting personality type, parental influence)
• Acoustic aspects of speech: nasal, deep, breathy, rough (reflecting the anatomical structure of the vocal apparatus)

There are no exclusive speaker identity cues. Low-level acoustic cues are the most applicable for automatic systems.
Features for Speaker Recognition

• Desirable attributes of features for an automatic system (Wolf ’72):
  – Practical: occur naturally and frequently in speech; easily measurable
  – Robust: do not change over time or with the speaker’s health; not affected by reasonable background noise nor dependent on specific transmission characteristics
  – Secure: not subject to mimicry
• No feature has all these attributes
• Features derived from the spectrum of speech have proven to be the most effective in automatic systems
Speech Production

• Speech production model: source-filter interaction
  – Anatomical structure (vocal tract/glottis) is conveyed in the speech spectrum
  – [Figure: glottal pulses (source) shaped by the vocal tract (filter) produce the speech signal, shown as waveforms over time]
Vocal-tract Information

• Different speakers will have different spectra for similar sounds
  – [Figure: vocal-tract cross sections and magnitude spectra (dB vs. frequency, 0–6000 Hz) for a male and a female speaker producing /AE/ and /I/]
• Differences are in the location and magnitude of the peaks in the spectrum
  – The peaks are known as formants and represent resonances of the vocal cavity
• The spectrum captures the formant locations and, to some extent, pitch without explicit formant or pitch tracking
Time-Varying Vocal-tract Information

• Speech is a continuous evolution of the vocal tract
  – Need to extract a time series of spectra
  – Use a sliding window: 20 ms window, 10 ms shift
• Taking the Fourier transform magnitude of each window produces the time-frequency evolution of the spectrum
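As a concrete illustration, here is a minimal sketch of the sliding-window analysis described above (NumPy; the 20 ms/10 ms parameters match the slide, while the Hamming taper is an assumed but common choice):

```python
import numpy as np

def sliding_spectra(signal, sample_rate, win_ms=20, shift_ms=10):
    """Short-time magnitude spectra: 20 ms windows advanced by 10 ms."""
    win = int(sample_rate * win_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = [signal[i:i + win] * np.hamming(win)   # taper each frame
              for i in range(0, len(signal) - win + 1, shift)]
    return np.abs(np.fft.rfft(frames, axis=1))      # shape: (frames, win//2 + 1)
```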
Spectral Features

• The number of discrete Fourier transform samples representing the spectrum is reduced by averaging frequency bins together
  – Typically done by a simulated filterbank
• A perceptually based filterbank is used, such as a Mel or Bark scale filterbank
  – Linearly spaced filters at low frequencies
  – Logarithmically spaced filters at high frequencies
Spectral Features

• The primary features used in speaker recognition systems are cepstral feature vectors
• The log() function turns linear convolutional effects into additive biases
  – Easy to remove using blind-deconvolution techniques
• The cosine transform helps decorrelate the elements of the feature vector
  – Less burden on the model and empirically better performance

Pipeline: Fourier transform → magnitude → filterbank → log() → cosine transform, producing one feature vector every 10 ms (e.g., [3.4, 3.6, 2.1, 0.0, -0.9, 0.3, 0.1])
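A minimal sketch of this cepstral pipeline (mel-style triangular filterbank, log, and DCT); the filter count and 13 retained coefficients are assumed typical values rather than anything mandated by the slides:

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced linearly on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    return fbank

def cepstra(mag_spectra, fbank, n_ceps=13):
    """Magnitude spectra -> filterbank -> log -> DCT (cepstral vectors)."""
    log_energies = np.log(mag_spectra @ fbank.T + 1e-10)
    return dct(log_energies, type=2, norm='ortho', axis=1)[:, :n_ceps]
```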
Spectral Features

[Figure: the pipeline applied to a spectrogram: Fourier transform → magnitude → log() → cosine transform]
Spectral Features
Additional Robustness Measures

• To help capture some temporal information about the spectra, delta cepstra are often computed and appended to the cepstral feature vector
  – A 1st-order linear fit is used over a 5-frame (50 ms) span:

  ∂c_i(t)/∂t ≈ Δc_i(n) = [ Σ_{k=−K}^{K} k · c_i(n+k) ] / [ Σ_{k=−K}^{K} k² ]

• Example: successive cepstra [3.4, 3.6, 2.1, 0.0, -0.9, 0.3, 0.1], [3.3, 3.7, 0.7, 0.0, -0.1, 0.3, 0.2], [3.1, 3.6, 0.3, 0.0, 0.3, 0.3, 0.6] give delta cepstra [-0.3, 0.0, -1.7, 0.0, 1.2, 0.0, 0.5]
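A minimal sketch of the delta computation above (NumPy; the edge padding is an assumed boundary-handling choice):

```python
import numpy as np

def delta(cepstra, K=2):
    """First-order regression deltas over a (2K+1)-frame span (K=2 -> 5 frames)."""
    n_frames = len(cepstra)
    padded = np.pad(cepstra, ((K, K), (0, 0)), mode='edge')  # repeat edge frames
    denom = sum(k * k for k in range(-K, K + 1))
    return sum(k * padded[K + k:K + k + n_frames]
               for k in range(-K, K + 1)) / denom
```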
Exploiting Higher Levels of Information

Cues span a hierarchy from high-level (learned traits) to low-level (physical traits):

• Semantic
• Dialogic (speaker turn patterns between A and B)
• Idiolectal (word sequences, e.g., “<s> how shall i say this <e> <s> yeah i know …”)
• Phonetic (phone sequences, e.g., /S/ /oU/ /m/ /i:/ /D/ /&/ /m/ /Λ/ /n/ /i:/ …)
• Prosodic
• Spectral
Prosodic Features
Pitch/Energy Slope

• Slope class estimation:
  1. Compute the rate of change of the F0 and energy contours (delta features over 50 ms)
  2. Detect the points where the dynamics of the contours change
  3. Generate new segments using the detected points
  4. Convert the segments into a sequence of symbols that represent the dynamics of both contours
  5. Add a label (Short/Long) representing the segment duration
     – Voiced (classes 1, 2, 3, and 4): short < 70 ms
     – Unvoiced (class 5): short < 130 ms

Class | Temporal trajectory
1 | rising F0 and rising energy
2 | rising F0 and falling energy
3 | falling F0 and rising energy
4 | falling F0 and falling energy
5 | unvoiced segment

Example symbol sequence: 5S 4S 2S 1S 3L 4L 5S 1S 2S 3S 4S 3S 2L 4S 5S

A. Adami, R. Mihaescu, D. Reynolds, and J. Godfrey, “Modeling Prosodic Dynamics for Speaker Recognition,” ICASSP 2003.
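A minimal sketch of steps 4 and 5 (the class table and duration thresholds come from the slide; the sign convention for “rising” is an assumption):

```python
def slope_class(d_f0, d_energy, voiced):
    """Map contour slopes to classes 1-4 (voiced) or 5 (unvoiced)."""
    if not voiced:
        return 5
    if d_f0 >= 0:                        # rising F0
        return 1 if d_energy >= 0 else 2
    return 3 if d_energy >= 0 else 4     # falling F0

def symbol(d_f0, d_energy, voiced, dur_ms):
    """Append the Short/Long duration label, e.g. '3L' or '5S'."""
    cls = slope_class(d_f0, d_energy, voiced)
    short = dur_ms < (130 if cls == 5 else 70)
    return f"{cls}{'S' if short else 'L'}"
```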
Phonetic and Idiolect Information

• Requires the use of speech or phone recognition systems to extract token features
  – Phonetic: use phone sequences to characterize speaker-specific pronunciations and speaking patterns
  – Idiolect: use word sequences to characterize speaker-specific word-usage patterns
• Example: a speech recognizer produces “Show me the money …”; a phone recognizer produces “S oU m i: D & m Λ n i:”
• People have also looked at linguistic patterns to characterize speaker-specific conversational style
• Phone/word-based features are limited by:
  – Recognizer errors
  – Sparse information
  – Correlation with non-speaker information (topic, language, etc.)
Higher-Level Information
Speaker Information in Word Bigrams

• A bigram is simply the occurrence of two tokens in sequence
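A minimal sketch of extracting bigram counts from a recognizer transcript (the sample sentence is purely illustrative):

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent token pairs; such counts feed speaker-specific n-gram models."""
    return Counter(zip(tokens, tokens[1:]))

print(bigram_counts("you know i mean you know".split()))
# Counter({('you', 'know'): 2, ('know', 'i'): 1, ('i', 'mean'): 1, ('mean', 'you'): 1})
```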
Phases of a Speaker Detection System
Speaker Models

There are two distinct phases to any speaker detection system:

• Training phase: training speech from each speaker (e.g., Bob, Sally) → feature extraction → model training → a model for each speaker
• Detection phase: test speech → feature extraction → detection decision against the model for the hypothesized identity (e.g., Sally) → “Accepted!”
Speaker Models

• Speaker models are used to represent the speaker-specific information conveyed in the feature vectors
• Desirable attributes of a speaker model:
  – Theoretical underpinning
  – Generalizable to new data
  – Parsimonious representation (size and computation)
• Many different modeling techniques have been applied to speaker recognition problems:
  – Probability (generative) models
  – Discriminative models
• Before we describe modeling techniques, let’s first look at how we want to use them in recognition
Detection Decision
Bayes’ Decision

• The detection task is fundamentally a two-class hypothesis test:
  – H0: the speech X is not from the hypothesized speaker
  – H1: the speech X is from the hypothesized speaker
• We select the more likely hypothesis (Bayes test for minimum error):

  Pr(H1 | X) ≷ Pr(H0 | X)

  Using Bayes’ rule:

  p(X | H1) Pr(H1) / p(X) ≷ p(X | H0) Pr(H0) / p(X)

  p(X | H1) / p(X | H0) ≷ Pr(H0) / Pr(H1)

• This is known as the likelihood ratio test:

  LR = p(X | H1) / p(X | H0)
  LR > θ: accept H1
  LR < θ: accept H0
Detection Decision

• Classic log-likelihood ratio test from signal detection theory:

  LLR = Λ = log p(X | H1) − log p(X | H0)

  [Figure: feature extraction feeds the speaker model (+) and the alternative model (−); their difference Λ is thresholded: Λ > θ accept, Λ < θ reject]

• We need two models:
  – A hypothesized speaker model for H1
  – An alternative (background) model for H0
• These can be explicit or implicit models
  – We will later look at how this is done for different modeling techniques
Speaker Models

We will now describe the most common modeling techniques used in speaker recognition:

• Gaussian mixture models (GMM): probability (generative) models
• Support vector machines (SVM): discriminative models
Speaker Models
Probability Models

• Treat the speaker as a hidden random source generating the observed feature vectors x1, x2, x3, …
• How can we mathematically characterize the speaker using a set of training observations?
Speaker Models
Probability Models

• Assume we are using a 1-dimensional feature vector
• From some training examples we get a histogram of features
• What is a good model for this histogram? A Gaussian distribution, e.g., N(1.0, 0.25)
Speaker Models
Probability Models

• What are good models for these histograms? [Figure: two more complex feature histograms]
Speaker Models
Probability Models

• If we assume the source is producing independent, identically distributed (iid) observations, and we know the distribution, then a parametric pdf is used (e.g., a Gaussian with mean and variance parameters)
• This is a suitable approach for “static” sources
Speaker Models
Probability Models

• But we know speech is not a “static” source
• The source has hidden “states” corresponding to different speech sounds
Speaker Models
Gaussian Mixture Models

• Assume the feature vectors generated from each state follow a Gaussian distribution
• Further assume all states are equally likely to occur
• Each state i has the following parameters:
  – p_i = prior probability
  – μ_i = mean vector
  – Σ_i = covariance matrix
• The total distribution of the source (speaker) is then characterized by a Gaussian mixture model λs = (p_i, μ_i, Σ_i):

  p(x | λs) = Σ_{i=1}^{M} p_i b_i(x)

  where b_i(x) is the Gaussian density of state i
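A minimal sketch of training and scoring such a model (scikit-learn’s GaussianMixture as a stand-in; the 64 diagonal-covariance components and the random stand-in features are assumptions for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train = rng.standard_normal((5000, 13))   # stand-in cepstral vectors, one per frame
test = rng.standard_normal((300, 13))

# Speaker GMM: diagonal covariances are the common choice in speaker recognition
gmm = GaussianMixture(n_components=64, covariance_type='diag',
                      random_state=0).fit(train)

avg_log_lik = gmm.score(test)   # mean per-frame log p(x | lambda_s)
```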
Speaker Models
Hidden Markov Models

• More precisely, the feature vectors generated from each state follow a Gaussian mixture distribution
• Transitions between states (with transition probabilities π_{s1,s2}) depend on the modality of the speech:
  – The text-dependent case has ordered states
  – The text-independent case allows all transitions
• Model parameters:
  – Transition probabilities
  – State mixture parameters
• This is a hidden Markov model
  – The GMM λs = (p_i, μ_i, Σ_i), with p(x | λs) = Σ_{i=1}^{M} p_i b_i(x), is a special case in which all transitions between states are equally likely
Speaker Models
Hidden Markov Models

The form of the HMM depends on the application:

• Fixed phrase (e.g., “Open sesame”): word/phrase models
• Prompted phrases/passwords: phoneme models (e.g., /e/, /t/, /n/)
• Text-independent (general speech): a single-state HMM, i.e., a GMM
Detection Decision (recap)

• Classic log-likelihood ratio test: LLR = Λ = log p(X | H1) − log p(X | H0), with Λ > θ accept and Λ < θ reject
• We need a hypothesized speaker model for H1 and an alternative (background) model for H0; these can be explicit or implicit models
GMM Speaker Models
Background Model

There are two main approaches for creating an alternative model for the likelihood ratio test:

• Cohorts / likelihood sets / background sets (Higgins, DSPJ 1991)
  – Use a collection of other speaker models
  – The likelihood of the alternative is some function, such as the average, of the individual impostor model likelihoods:

    p(X | H0) = f( p(X | Bkg_b), b = 1, …, B )

• General/world/universal background model (Carey, ICASSP 1991; Reynolds, ICSLP 1995)
  – Use a single speaker-independent model
  – Trained on speech from a large number of speakers to represent general speech patterns
  – Speaker and background models are often coupled

    p(X | H0) = p(X | UBM)
GMM Speaker Models
Background Model

• The background model is crucial to good performance
  – It acts as a normalization that helps minimize non-speaker-related variability in the decision score
• Using only the speaker model’s likelihood does not perform well
  – Too unstable for setting decision thresholds
  – Influenced by too many non-speaker-dependent factors
• The background model should be trained on speech representing the expected impostor speech
  – The same type of speech as speaker enrollment (modality, language, channel)
  – Representative of the impostor and microphone types to be encountered
Likelihood Ratio Scoring with Cohorts

• This approach works well but has some drawbacks:
  1. It requires marked training data for the cohort speakers
  2. Several models must be scored for each input vector
  3. Cohorts must be selected to cover all regions of the feature space
  4. There is no coupling between the target and background scores
• To address these issues, the adapted GMM approach was developed
Adapted GMMs

• The basic idea is to start with a single background model (UBM) that represents general speech
• Using the target speaker’s training data, “tune” the general speech model to the specifics of the target speaker
  – This “tuning” is done via unsupervised Bayesian adaptation
  – [Figure: target training data points pull nearby UBM mixture components toward the target model]
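A minimal sketch of mean-only Bayesian (MAP) adaptation of a UBM, in the style of GMM-UBM systems; the relevance factor of 16 and adapting only the means are common simplifications assumed here, not requirements from the slides:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, X, r=16.0):
    """Shift UBM means toward target data X; r is the relevance factor."""
    post = ubm.predict_proba(X)                        # responsibilities (T, M)
    n = post.sum(axis=0)                               # soft counts per component
    ex = (post.T @ X) / np.maximum(n, 1e-10)[:, None]  # per-component data means
    alpha = (n / (n + r))[:, None]                     # adaptation strength

    target = GaussianMixture(n_components=ubm.n_components,
                             covariance_type=ubm.covariance_type)
    target.weights_ = ubm.weights_.copy()              # weights/covariances kept
    target.covariances_ = ubm.covariances_.copy()
    target.precisions_cholesky_ = ubm.precisions_cholesky_.copy()
    target.means_ = alpha * ex + (1 - alpha) * ubm.means_
    return target

# Detection score: LLR = log p(X|target) - log p(X|UBM), averaged per frame:
# llr = target_model.score(test_frames) - ubm.score(test_frames)
```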
Speaker Models

We now turn from generative models to the second common modeling technique used in speaker recognition:

• Support vector machines (SVM): discriminative models
Generative vs. Discriminative Models
Support Vector Machines

• The GMM-UBM is considered a generative model
  – The model focuses on representing the total distribution of the speaker’s data
  – Parameters are estimated with maximum likelihood or maximum a posteriori criteria
  – Competition with other models comes through the likelihood ratio
• Support vector machines (SVMs) are an example of discriminative models
  – The model focuses on representing the boundary between competing speakers’ data
  – Parameters are estimated with a maximum-margin (separation boundary) criterion
  – Competition with other classes is directly optimized in model training
• [Figure: class 0 and class 1 data, modeled either by per-class distributions (generative) or by a maximum-margin separating boundary (discriminative)]
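A minimal sketch of a discriminative speaker model trained on GMM supervectors (the SVM-GSV idea referenced in the next slide); the supervector dimension (64 components × 13 cepstra) and the random stand-in data are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
dim = 64 * 13                                      # stacked adapted GMM means
target_sv = rng.standard_normal((5, dim)) + 0.5    # a few target utterances
impostor_sv = rng.standard_normal((200, dim))      # background utterances

X = np.vstack([target_sv, impostor_sv])
y = np.r_[np.ones(len(target_sv)), np.zeros(len(impostor_sv))]

svm = LinearSVC(C=1.0).fit(X, y)                   # maximum-margin boundary
score = svm.decision_function(rng.standard_normal((1, dim)))  # detection score
```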
Speaker Detection Systems
Score Fusion

• Scores from different types of features/models can be combined using a simple fuser
  – Requires scores from some development data to train the fuser
• A generalized linear regression fuser works well
• An added benefit is that we can get calibrated scores
  – E.g., [0-1] posterior probability estimates
• [Figure: GMM-UBM and SVM-GSV subsystem scores are combined as a1·s1 + a2·s2 + b by a fusion/calibration stage to produce the detection score]
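A minimal sketch of such a fuser using logistic regression (a generalized linear model) on two subsystem scores; the synthetic development scores are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)                # 1 = target trial, 0 = impostor
s1 = 1.5 * labels + rng.standard_normal(1000)    # GMM-UBM dev scores (toy)
s2 = 1.0 * labels + rng.standard_normal(1000)    # SVM-GSV dev scores (toy)

fuser = LogisticRegression().fit(np.column_stack([s1, s2]), labels)

# Fused score a1*s1 + a2*s2 + b, mapped through a sigmoid to a [0,1] posterior
posteriors = fuser.predict_proba(np.column_stack([s1, s2]))[:, 1]
```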
Outline
Speaker Recognition

• Introduction
  – Definitions
  – Applications
• Structure of Speaker Detection Systems
  – Features: Spectral and high-level
  – Models: GMMs, SVMs
  – Compensation
• Evaluation and Performance
  – Evaluation tools and metrics
  – Evaluation design and corpora
  – Performance survey
Speaker Detection Systems
Channel/Session Effects

The largest challenge to practical use of speaker detection systems is channel/session variability.

• Variability refers to changes in channel effects between training and successive detection attempts
• Channel/session effects encompass several factors:
  – The microphone: carbon-button, electret, hands-free, array, etc.
  – The acoustic environment: office, car, airport, etc.
  – The transmission channel: landline, cellular, VoIP, etc.
• Intrinsic variability can also affect systems
  – Stress, illness, speaking style, etc.
  – Much more difficult than extrinsic variability to characterize
• Anything which affects the spectrum can cause problems
  – Speaker and channel effects are bound together in the spectrum and hence in the features used in speaker verifiers
Channel Compensation
Examples

[Figure: channel compensation examples]
Channel/Session Compensation

Channel/session compensation occurs at several levels in a speaker detection system (front-end processing → features → target/background models → LR score normalization → Λ):

• Signal domain: noise removal; tone removal
• Feature domain: cepstral mean subtraction; RASTA filtering; mean & variance normalization; feature warping; feature mapping; eigenchannel adaptation in the feature domain
• Model domain: speaker model synthesis; eigenchannel compensation; joint factor analysis; nuisance attribute projection
• Score domain: Z-norm; T-norm; ZT-norm
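As a concrete feature-domain example, a minimal sketch of cepstral mean subtraction extended with variance normalization (per-utterance statistics; the small epsilon is a numerical-safety assumption):

```python
import numpy as np

def cmvn(cepstra):
    """Remove the per-utterance cepstral mean (additive channel bias) and scale variance."""
    mean = cepstra.mean(axis=0)       # convolutional channel -> additive log bias
    std = cepstra.std(axis=0) + 1e-10
    return (cepstra - mean) / std
```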
Outline
Speaker Recognition

• Introduction
  – Definitions
  – Applications
• Structure of Speaker Detection Systems
  – Features: Spectral and high-level
  – Models: GMMs, SVMs
  – Compensation
• Evaluation and Performance
  – Evaluation tools and metrics
  – Evaluation design and corpora
  – Performance survey
Evaluation Metrics

• Recall the basic detection question:
  – Are two audio samples spoken by the same person?
  – The pair of samples is known as a trial
• There are two types of errors that can occur:
  – Miss: incorrectly rejecting a true trial (also known as a false reject or a Type-I error)
  – False accept: incorrectly accepting a false trial (also known as a Type-II error)
• The performance of a detection system is a measure of the trade-off between these two errors
  – The trade-off is usually controlled by adjusting the decision threshold
Evaluation Metrics

• Evaluation errors are estimates of the true errors made using a finite number of trials:

  Pr(fa | θ) = ∫_{Λ>θ} p(Λ | false trial) dΛ ≈ (# false trial scores > θ) / N_false

  Pr(miss | θ) = ∫_{Λ<θ} p(Λ | true trial) dΛ ≈ (# true trial scores < θ) / N_true
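A minimal sketch of these empirical estimates, plus the equal error rate used later as a summary measure (the EER search by threshold sweep is an assumed, straightforward method):

```python
import numpy as np

def error_rates(true_scores, false_scores, theta):
    """Empirical Pr(miss | theta) and Pr(fa | theta)."""
    p_miss = np.mean(np.asarray(true_scores) < theta)
    p_fa = np.mean(np.asarray(false_scores) > theta)
    return p_miss, p_fa

def equal_error_rate(true_scores, false_scores):
    """Sweep candidate thresholds; return the point where miss ~= false accept."""
    thetas = np.sort(np.concatenate([true_scores, false_scores]))
    rates = [error_rates(true_scores, false_scores, t) for t in thetas]
    return min(rates, key=lambda r: abs(r[0] - r[1]))
```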
Evaluation Metrics
ROC and DET Curves

• A plot of Pr(miss) vs. Pr(fa), traced out by decreasing the threshold, shows system performance
  – Receiver Operating Characteristic (ROC): probability of false reject vs. probability of false accept on linear scales
  – Detection Error Tradeoff (DET): the same probabilities plotted on a normal-deviate scale; curves closer to the lower-left corner indicate better performance
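A minimal sketch of computing DET-curve coordinates on the normal-deviate scale (the probit transform via scipy.stats.norm.ppf; clipping avoids infinities at 0% and 100%):

```python
import numpy as np
from scipy.stats import norm

def det_curve(true_scores, false_scores):
    """Return (x, y) = probit(Pr(fa)), probit(Pr(miss)) over swept thresholds."""
    true_scores = np.asarray(true_scores)
    false_scores = np.asarray(false_scores)
    thetas = np.sort(np.concatenate([true_scores, false_scores]))
    p_miss = np.array([np.mean(true_scores < t) for t in thetas])
    p_fa = np.array([np.mean(false_scores > t) for t in thetas])
    clip = lambda p: np.clip(p, 1e-6, 1 - 1e-6)
    return norm.ppf(clip(p_fa)), norm.ppf(clip(p_miss))
```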
Evaluation Metrics
DET Curve

The application operating point depends on the relative costs of the two errors:

• High security (e.g., wire transfer): false acceptance is very costly, and users may tolerate rejections for security; operate at a low false-accept rate
• Balance: the Equal Error Rate (EER), where the two error rates are equal (e.g., EER = 1%), is often quoted as a summary performance measure
• High convenience (e.g., toll fraud): false rejections alienate customers, and any fraud rejection is beneficial; operate at a low false-reject rate
Evaluation Design
Data Selection Factors

Performance numbers are only meaningful when the evaluation conditions are known:

• Speech quality
  – Channel and microphone characteristics
  – Ambient noise level and type
  – Variability between enrollment and verification speech
• Speech modality
  – Fixed/prompted/user-selected phrases
  – Free text
• Speech duration
  – Duration and number of sessions of enrollment and verification speech
• Speaker population
  – Size and composition
  – Experience

The evaluation data and design should match the target application domain of interest.
Evaluation Design
Data Providers

• Linguistic Data Consortium: http://www.ldc.upenn.edu/
• European Language Resources Association: http://www.icp.inpg.fr/ELRA/home.html
Performance Survey
Range of Performance

[Figure: DET curves (probability of false reject vs. probability of false accept, in %) for a range of task conditions; error increases as constraints on the task are relaxed]

• ~0.1% error: text-dependent (combinations), clean data, single microphone, large amount of train/test speech
• ~1% error: text-dependent (digit strings), telephone data, multiple microphones, moderate amount of training data
• ~10% error: text-independent (conversational), telephone data, multiple microphones, moderate amount of training data
• ~25% error: text-independent (read sentences), military radio data, multiple radios & microphones, small amount of training data
Performance Survey
Closed-Set Identification

[Figure: closed-set identification error (%) vs. number of speakers (10 to 600) for the TIMIT, NTIMIT, and SWBI corpora; error grows with population size and with degraded channel conditions]
Performance Survey
NIST Speaker Recognition Evaluations

• Annual NIST evaluations of speaker verification technology (since 1995)
• Aim: provide a common paradigm for comparing technologies
• Focus: conversational telephone speech (text-independent)
• The evaluation cycle connects technology consumers (who supply the application domain and parameters), the data provider (Linguistic Data Consortium), the evaluation coordinator (NIST), and technology developers in an evaluate-and-improve loop, comparing technologies on a common task
• http://www.nist.gov/speech/tests/spk/index.htm
Performance Survey
Human vs. Machine

• Motivation for comparing human to machine: evaluating speech coders and potential forensic applications
• Schmidt-Nielsen and Crystal used the NIST evaluation (DSP Journal, January 2000)
  – Same amount of training data
  – Matched handset-type tests
  – Mismatched handset-type tests
  – Used 3-second conversational utterances from telephone speech
• Error rates: on matched handset tests, humans were 15% worse than the computer; on mismatched tests, humans were 44% better
• Humans have more robustness to channel variability
  – They use different levels of information
Performance Survey
Human Forensic Performance

• In 1986, the Federal Bureau of Investigation published a survey of two thousand voice identification comparisons made by FBI examiners
  – Forensic comparisons completed over a period of fifteen years, under actual law enforcement conditions
  – The examiners had a minimum of two years’ experience and had completed over 100 actual cases
  – The examiners used both aural and spectrographic methods
  – http://www.owlinvestigations.com/forensic_articles/aural_specetrographic/fulltext.html#research
• Outcomes of comparing a recorded threat against a suspect:
  – No decision: 65.2% (1304)
  – Non-match: 18.8% (378)
  – Match: 15.9% (318)
  – False rejects = 0.53% (2 cases); false accepts = 0.31% (1 case)

From “Spectrographic voice identification: A forensic survey,” J. Acoust. Soc. Am., 79(6), June 1986, Bruce E. Koenig.
Summary

• A broad overview of automatic speaker recognition
• The major concepts behind modern speaker recognition systems
  – Features and models
• The key elements in evaluating the performance of a speaker recognition system
• The range of expected performance
• The challenges of successfully applying speaker recognition