
© Springer Science+Business Media New York 2015
Stan Z. Li and Anil K. Jain (eds.), Encyclopedia of Biometrics
DOI 10.1007/978-1-4899-7488-4_70
Anti-spoofing, Voice
Michael Wagner1, 2, 3
(1)College of Engineering and Computer Science, Australian National University, Canberra, Australia
(2)National Centre for Biometric Studies Pty Ltd, Canberra, Australia
(3)Faculty of ESTeM, University of Canberra, Canberra, Australia
Michael Wagner
Email: [email protected]
Synonyms
Liveness assurance; Liveness verification; One-to-one speaker recognition; Speaker verification;
Voice authentication; Voice verification
Definition
The process of verifying whether the voice sample presented to an authentication system is real (i.e.,
alive), or whether it is replayed or synthetic, and thus fraudulent. When authentication through a voice
authentication system is requested, it is important to be sure that the person seeking the authentication
actually provides the required voice sample at the time and place of the authentication request. The
voice must be presented live, like that of a radio presenter during a live broadcast, as distinct from a recording. In contrast, an impostor who seeks authentication fraudulently could try to play an audio
recording of a legitimate client or synthesized speech that is manufactured to resemble the speech of a
legitimate client. Such threats to the system are known as replay attack and synthesis attack,
respectively. Liveness assurance uses a range of measures to reduce the vulnerability of a voice
authentication system to the threats of replay and synthesis attack.
Introduction
The security of a voice authentication system depends on several factors. Primarily it is important that
the system is capable of distinguishing people by their voices, so that clients who are enrolled in, say,
a telephone banking system are admitted to their account reliably, while an “impostor” who attempts
to access the same account is rejected equally reliably. A good voice authentication system will thwart
an impostor irrespective of whether the access to the other person’s account is inadvertent or
deliberate and irrespective of whether the impostor uses their natural voice or tries to improve their
chances by mimicking the voice of the client.
However, one vulnerability common to all voice authentication systems is the possibility that
attackers, instead of speaking to the system directly and with their own voice, fraudulently use the
recorded voice of a true client in order to be admitted by the system. In principle, such a “replay
attack” can be carried out by means of any sound recording device, analog or digital, through which
the recorded voice of the client is played back to the system, say, to a microphone at a system access
point or remotely into a telephone handset connected to the authentication system. The security issue
in this case is that the voice used for authentication is not the “live” voice of the person seeking access to the system at the time and place of the access request.
A technically sophisticated attacker may also use suitable computer hardware and software to create a
facsimile of the client’s voice by means of speech synthesis, without having to record specific voice
samples of the client. Such an attack will be referred to as a “synthesis attack” in the following. Figure
1 shows how replayed or synthesized voice signals can be substituted for the live voice of a client at
the sensor input of the authentication system.
Anti-spoofing, Voice, Fig. 1
Prerecording of client voice either for later replay or for generating a client model, which can be used
later to synthesize the client’s voice
Replay Attack
Since voice authentication is always implemented within the context of a computer system, it is
important to consider the vulnerabilities of the entire system generally (see Security and Liveness,
Overview). Figure 2 shows the structure of a typical voice authentication system. During the
enrolment or training phase, the client’s voice is captured by the microphone, salient features are
extracted from the speech signals, and finally, a statistical “client model” or template is computed,
which represents the client-specific voice characteristics according to the speech data collected during
enrolment. During the operational or testing phase, when the system needs to decide whether a speech
sample belongs to the client, the signal is also captured by the sensor, and features are extracted in the
same way as they are in the enrolment phase. Then, the features of the unknown speech sample are
compared statistically with the model of the client that was established during enrolment. Depending
on how close the unknown sample is to the client model, the system will issue either an “accept” or a
“reject” decision: the person providing the voice sample is either authenticated or considered an
impostor.
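As a concrete illustration of this enrol-and-verify cycle, the following minimal Python sketch fits a Gaussian-mixture client model (one common choice; see Table 1 below) to enrolment features and thresholds an average log-likelihood ratio at verification time. The function names, the background (impostor) model, and the threshold value are illustrative assumptions rather than anything prescribed by this entry.

import numpy as np
from sklearn.mixture import GaussianMixture

def enrol_client(enrolment_features: np.ndarray, n_components: int = 16) -> GaussianMixture:
    """Fit a Gaussian-mixture "client model" to the enrolment feature vectors."""
    model = GaussianMixture(n_components=n_components, covariance_type="diag")
    model.fit(enrolment_features)
    return model

def verify_sample(test_features: np.ndarray,
                  client_model: GaussianMixture,
                  background_model: GaussianMixture,
                  threshold: float = 0.0) -> bool:
    """Issue "accept" if the average log-likelihood ratio of the client model
    over a background (impostor) model exceeds the decision threshold."""
    llr = (client_model.score_samples(test_features)
           - background_model.score_samples(test_features)).mean()
    return llr > threshold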
Figure 2 shows various ways in which attackers could manipulate the outcome of the authentication,
if any of the software or hardware components of an insecure computer system could be accessed. If it
were possible, for example, to manipulate the database of client models, attackers could potentially
replace the voice model of a client with their own voice model and subsequently gain fraudulent
access to the system by having substituted their own identity for that of the client. Or, even more
simply, if it were possible to manipulate the decision module of the system, an attacker could
essentially bypass the entire authentication process and manufacture an “accept” decision of the
system without having provided any matching voice data. Such considerations fall into the domain of
the system engineer who needs to ensure, much as with any other secure system, that there are no
bugs, trap doors, or entry points for Trojan horses, which could allow an attacker to manipulate or
bypass the authentication mechanisms of the system. Since such vulnerabilities are not specific to
voice authentication systems, they are not dealt with in this entry (see Biometric Vulnerabilities,
Overview).
Anti-spoofing, Voice, Fig. 2
Potential points of vulnerability of a voice biometric authentication system: (a) replay or synthesize the client voice into the input sensor; (b) insert the replayed or synthesized client voice into vulnerable system-internal points; (c) override detected features at vulnerable system-internal points; (d) override the client model at vulnerable system-internal points; and (e) override the accept/reject decision at vulnerable system-internal points
The remainder of this entry discusses how a secure voice authentication system can provide the
assurance that the voice used for an access request to the system is “live” at the time and place of the
access request and is neither a playback of a voice recording nor a synthesized facsimile of a client voice.
Hence, liveness assurance is an essential aspect of the security of any voice authentication system.
Liveness Assurance for Different Authentication Protocols
Voice authentication systems operate under different protocols, and assurance of liveness is affected
differently by the various authentication protocols. The three main protocols used for voice
authentication are text-dependent speaker verification, text-independent speaker verification, and text-prompted speaker verification, as shown in Fig. 3. The earliest authentication protocol was text-dependent [1]. In this protocol, the client uses a fixed authentication phrase, which is repeated several
times during enrolment. The repetitions are necessary so that the system “learns” about the range of
pronunciation of the authentication phrase by the client. Generally, speaker verification works best if
the natural variation of a speaker’s voice is well captured during enrolment. Hence, ideally, enrolment
should be distributed over several recording sessions that may be spread over several days or even
weeks. The same phrase, for example, a sequence of digits (“three-five-seven-nine”), or a password or
passphrase (“Open Sesame”) is then used again by the client during the operational phase in order to
be authenticated by the system.
Anti-spoofing, Voice, Fig. 3
(a) Text-dependent voice authentication: (E) at enrolment the client repeats the authentication phrase several times; (V) for verification the client speaks the same authentication phrase. (b) Text-independent voice authentication: (E) at enrolment the client reads a 2–3-min phonetically rich text; (V) for verification any utterance can be used by the client. (c) Text-prompted voice authentication: (E) at enrolment the client reads a 2–3-min phonetically rich text; (V) for verification the client is prompted to say a given phrase, which is verified both for the correct content and for the client’s voice characteristics
Text-dependent systems have the advantage that the client model only needs to represent the acoustic
information related to the relatively few speech sounds of the passphrase. Enrolment, therefore, is
shorter and quicker than for other protocols, which typically require the representation of the entire
collection of speech sounds that the client could possibly produce. However, the text-dependent
protocol has the distinct disadvantage that clients will repeat the same phrase every time while using
the system. Consequently, there may be ample opportunity for an attacker, especially if the system
microphone is situated in a public area, to plan and carry out a surreptitious recording of the
passphrase, uttered by the client, and to replay the recorded client passphrase fraudulently in order to
be authenticated by the system.
In contrast, text-independent voice authentication systems [2] will authenticate a client – and reject
an impostor – irrespective of any particular utterance used during enrolment. Client enrolment for
text-independent systems invariably takes longer than enrolment for a text-dependent system and
usually involves a judiciously designed enrolment text, which contains all, or at least most, of the
speech sounds of the language. This will ensure that the client models, which are constructed from the
enrolment speech data, will represent to the largest extent possible the idiosyncrasies of the client
when an arbitrary sentence or other utterance is provided for authentication later. Text-independent
protocols offer the advantage that authentication can be carried out without the need for a particular
passphrase, for example, as part of an ordinary interaction between a client and a customer-service
agent or automated call center agent, as shown in this fictitious dialog:
Client: phones XYZ Bank.
Agent: Good morning, this is XYZ Bank. How can I help you?
Client: I would like to inquire about my account balance.
Agent: What is your account number?
Client: It’s 123-4567-89
Agent: Good morning, Ms. Applegate, the balance of your account number 123-4567-89 is
$765.43. Is there anything else …?
The example shows a system that combines speech recognition with voice authentication. The
speech recognizer understands what the customer wants to know and recognizes the account number,
while the authentication system uses the text-independent protocol to ascertain the identity of the
client from the first two responses the client gives over the telephone. These responses would not
normally have been encountered by the system during enrolment, but the coverage of the different
speech sounds during enrolment would be sufficient for the authentication system to verify the client
from the new phrases. The text-independent protocol offers an attacker the opportunity to record any
client utterances, either in the context of the client using the authentication system or elsewhere, and
to replay the recorded client speech in order to fraudulently achieve authentication by the system.
A more secure variant of the text-independent protocol is the text-prompted protocol [3]. Enrolment
under this protocol is similar to the text-independent protocol in that it aims to achieve a
comprehensive coverage of the different possible speech sounds of a client so that later on any
utterance can be used for client authentication. However, during authentication the text-prompted
protocol asks the user to say a specific, randomly chosen phrase, for example, by prompting the user
“please say the number sequence ‘two-four-six.”’ When the client repeats the prompted text, the
system uses automatic speech recognition to verify that the client has spoken the correct phrase. At
the same time it verifies the client’s voice by means of the text-independent voice authentication
paradigm. The text-prompted protocol makes a replay attack more difficult because an attacker would
be unlikely to have all possible prompted texts from the client recorded in advance. However, such an
attack would still be feasible for an attacker with a digital playback device that could construct the
prompted text at the press of a button.For example, an attacker who has managed surreptitiously to
record the ten digits “zero” to “nine” from a client – either on a single occasion or on several separate
occasions – could store those recorded digits on a notebook computer and then combine them to any
prompted digit sequence by simply pressing buttons on the computer.
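A hedged sketch of the text-prompted protocol is given below: the system draws a fresh random digit prompt and accepts only if a speech recognizer confirms the spoken content and a text-independent speaker verifier confirms the voice. The helpers recognise_text and verify_speaker are hypothetical placeholders for ASR and speaker-verification back ends, which this entry does not specify.

import secrets

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def make_prompt(length: int = 3) -> str:
    """Draw a fresh random digit prompt, e.g. 'two-four-six'."""
    return "-".join(secrets.choice(DIGITS) for _ in range(length))

def authenticate(audio, prompt: str, recognise_text, verify_speaker) -> bool:
    """Accept only if the utterance matches the prompted text AND the voice
    matches the client model (text-independent comparison)."""
    content_ok = recognise_text(audio).strip().lower() == prompt
    voice_ok = verify_speaker(audio)
    return content_ok and voice_ok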
Synthesis Attack
Even a text-prompted authentication system is vulnerable to an attacker who uses a text-to-speech
(TTS) synthesizer. A TTS system allows a user to input any desired text, for example, by means of a
computer keyboard, and to have that text rendered automatically into a spoken utterance and output
through a loudspeaker or another analog or digital output channel. The basic principle is that an
attacker would program a TTS synthesizer in such a way that it produces speech patterns similar to those of
the target speaker. If that is achieved, the attacker would only need to type the text that is required or
prompted by the authentication system in order for the TTS synthesizer to play the equivalent
synthetic utterance to the authentication system in the voice of the target speaker. In practice,
however, current state-of-the-art text-to-speech synthesis is not quite capable of producing such
natural sounding utterances. In other words, synthetic speech produced by current TTS systems still
sounds far from natural and is easily distinguished from genuine human speech by the human ear.
Does this mean, however, that TTS speech could not deceive an authentication system based on
automatic speaker recognition? To answer this question, it needs to be examined how different
speaker recognition systems actually work.
As shown in Table 1, there are three types of speaker recognition systems, distinguished by the types of speech patterns each examines in order to determine the similarity of the unknown speech and the
target speech. The most common type of speaker recognition system looks at speaker differences at
the individual sound level. A second type of speaker recognition system examines the sequences of
speech sounds, which form words, and a third type also analyzes higher-level information such as
intonation, choice of words, choice of sentence structure, or even semantic or pragmatic content of the
utterances in question [4].

Anti-spoofing, Voice, Table 1
Types of speaker authentication methods

1. Recognizes individual speech sounds (context-free)
   Training/enrolment: A set of speech sounds typical for the target speaker is collected and becomes the “model” for the target speaker
   Testing: Each speech sound is individually compared with the target speaker “model”
   Typical method: Gaussian mixture model (GMM)

2. Recognizes sequences of speech sounds (context-sensitive)
   Training/enrolment: In addition to the individual sounds, the speaker model represents the sequences of speech sounds that are typical for the target speaker
   Testing: The entire utterance is compared with the target speaker model for both individual sounds and sound sequences
   Typical method: Hidden Markov model (HMM)

3. Recognizes higher-level features (intonation, word choice, syntax, etc.)
   Training/enrolment: In addition to sound sequences, the speaker model represents words, sentence structures, and intonation patterns typical for the target speaker
   Testing: Similarity of sounds and sound sequences is combined with similarity of word sequences and intonation patterns
   Typical method: Information fusion of GMM and/or HMM with higher-level information sources
Speech processing invariably segments a speech signal into small chunks, or “frames,” of about 10–30 ms duration, which corresponds approximately to the average duration of speech sounds. For each frame, features are extracted from the speech signal, such as a spectrum, a cepstrum, or a mel-frequency cepstrum (MFC) [5]. These extracted features serve as the basis for the comparison
between the unknown speech and the target speech. The first type of speaker recognition system
independently compares the features of each frame of the unknown speech signal with the model of
the target speaker. This is done independently for each frame and without considering the speech
sounds immediately preceding or succeeding the given frame.
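The sketch below illustrates this first, frame-independent type of comparison, assuming librosa for MFCC extraction and a Gaussian mixture model as the client model; the 16 kHz sampling rate and the 13 MFCC coefficients are arbitrary choices, not values taken from this entry.

import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return an (n_frames x n_mfcc) matrix of per-frame MFCC feature vectors."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # one row per short analysis frame

def frame_independent_score(frames: np.ndarray, client_gmm: GaussianMixture) -> float:
    """Average the per-frame log-likelihoods under the client model; each frame
    is judged on its own, ignoring the frames before and after it."""
    return client_gmm.score_samples(frames).mean()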
The second type of speaker recognition system takes into account the likelihood of sequences of
speech sounds, rather than individual speech sounds, when comparing the unknown speech signal
with the model of the target speaker. For example, the sound sequence /the/ would be more likely for
a speaker of English than the sound sequence /eth/. The third type of system takes into account
higher-level features, i.e., the variation of features over time such as the intonation pattern of a
sentence, as it manifests itself through the functions of loudness and pitch over time. Such
authentication systems typically operate on much longer passages of speech, for example, to segment
a two-way telephone conversation or the proceedings in a court of law into the turns belonging to the
different speakers. Figure 4 shows an example of two speakers pronouncing the same sentence with
quite different intonation.
Anti-spoofing, Voice, Fig. 4
Two male speakers from the Australian National Database of Spoken Language (ANDOSL), speaking
the same sentence with distinctly different intonation: audio signal and power and fundamental
frequency (F0) contours. Speaker S017 produced the word John with falling F0, while Speaker S029
produced the same with rising F0
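As an illustration of such a higher-level feature, the sketch below extracts the fundamental-frequency (F0) contour of an utterance, which a system of the third type could compare against the intonation patterns typical of the target speaker; the use of librosa’s pYIN tracker and the assumed pitch range are not specified by this entry.

import librosa
import numpy as np

def f0_contour(wav_path: str) -> np.ndarray:
    """Return the F0 contour in Hz (NaN for unvoiced frames) of an utterance."""
    signal, sr = librosa.load(wav_path, sr=16000)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        signal,
        fmin=librosa.note_to_hz("C2"),   # assumed lower pitch bound (~65 Hz)
        fmax=librosa.note_to_hz("C5"),   # assumed upper pitch bound (~523 Hz)
        sr=sr)
    return f0  # e.g. rising vs. falling F0 on the word "John" in Fig. 4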
It is easy to see that a context-free authentication system is prone to be attacked successfully by a very
simple synthesizer, namely, one that produces a few seconds of only a single speech sound. For
example, an attacker could reproduce a single frame of, say, the sound “a” of the target speaker and
play this frame repeatedly for a second or two in order to “convince” an authentication system of this
type that the “aaaaaaa…” sound represents the natural voice of the target speaker. This is because each
frame is assessed independently as being similar to the “a” sound of the target speaker, irrespective of
the fact that the sequence of “a” sounds does not represent a likely speech pattern of the target voice.
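The following toy fragment makes this weakness explicit, reusing the frame_independent_score sketch above: tiling one stolen client frame into a pseudo-utterance leaves the average frame score unchanged, so a purely context-free scorer has no way to notice the unnatural monotone.

import numpy as np

def repeated_frame_attack(stolen_frame: np.ndarray, n_frames: int = 100) -> np.ndarray:
    """Tile a single stolen feature vector into a one-to-two-second pseudo-utterance."""
    return np.tile(stolen_frame, (n_frames, 1))

# Because every frame is scored in isolation,
#   frame_independent_score(repeated_frame_attack(frame), client_gmm)
# equals the score of the single stolen frame itself, whereas a sequence model
# (HMM) would assign the monotone "aaaa..." pattern a very low likelihood.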
A context-sensitive authentication system, on the other hand, requires a speech synthesizer to
reproduce entire sound sequences that are sufficiently similar to sound sequences produced by the
target speaker. This means that the individual sounds produced by the synthesizer must be similar to
sounds of the target speaker and the sound sequences must be structured in a similar way to those of
the target speaker. This is a proposition that is far more difficult, although not impossible, to achieve
with current state-of-the-art speech synthesizers. Furthermore, if the speaker authentication system
also considers the intonation pattern and higher-level features such as choice of words and
grammatical constructs, an attacker who tries to impersonate a target speaker using a TTS synthesizer
would require a system that is beyond the capabilities of the technology at the time of writing.
Multimodal Liveness Assurance
The assurance that a voice biometric is delivered live at the time and place of authentication can be
enhanced considerably by complementing the voice modality with a second modality. In the simplest
case, this could be the visual modality provided by a human observer who can assure that the voice
biometric is actually provided by the person seeking authentication and that person is not using any
device to play back a recorded or synthesized voice sample.
In an automatic voice authentication system, similar assurance of liveness can be achieved by
combining the voice modality with a face recognition system. Such a system has a number of
advantages. Firstly, the bimodal face-voice approach to authentication provides two largely
independent feature sets, which, when combined appropriately, can be expected to yield better
authentication than either of the two modalities by itself. Secondly, the bimodal approach will add
robustness to the system when either modality is affected by difficult environmental conditions. In the
case of bimodal face-voice authentication, it is particularly useful to fall back on the complementary
face recognition facility when the voice recognition modality breaks down due to high levels of
surrounding noise, competing speakers, or channel variability such as that caused by weak cell phone
reception. In such situations, the face recognition modality will be able to take over and hence provide
enhanced robustness for the combined system.
A similar consideration applies, of course, when the combined face-voice authentication system is
viewed from the perspective of the face recognition modality, which may equally break down in
difficult environmental conditions such as adverse lighting. In this case, too, the overall robustness of
the authentication system is preserved by the combination of the two modalities, voice and face, each
of which is affected differently and largely independently by environmental factors.
However, the most important advantage of a bimodal face-voice authentication system for the
assurance of liveness is the fact that the articulator movements, mainly of the lips, but also of the tip
of the tongue, jaw, and cheeks, are mostly observable and correspond closely to the particular speech
sounds produced. Therefore, it is possible when observing a bimodal audio-video signal of the
speaking face to ascertain whether the facial dynamics and the sequence of speech sounds are
mutually compatible and synchronous. To a human observer it is quite disconcerting when this is not
the case, for example, with an out-of-sync television signal or with a static facial image when the
speaker is heard saying something, but the lips are not seen to be moving. In the field of audiovisual
speech recognition, the term “viseme” has been coined as the visual counterpart of the “phoneme,”
which denotes a single speech sound. The visemes /m/, /u/, and /d/ (as in the word “mood”), for
example, first show the speaker’s lips spread and closed (for /m/), then protruded and rounded (for
/u/), and finally spread and slightly open (for /d/). It is therefore possible to detect whether the
corresponding sequences of visemes and phonemes of an utterance are observed in a bimodal audio-video signal and whether the observed viseme and phoneme sequences are synchronous.
In order for the synchrony of the audio and video streams to be ascertained, the two modalities must
be combined appropriately. Multimodal authentication systems employ different paradigms to
combine, or “fuse,” information from the different modalities. Modality fusion can happen at different
stages of the authentication process. Fusing the features of the different channels immediately after
the feature extraction phase is known as “feature fusion” or “early fusion.” In this paradigm, all
comparisons between the unknown sample and the client model as well as the decision making are
based on the combined feature vectors. The other possibility is to fuse information from the two
modalities after independent comparisons have been made for each modality. Such paradigms are
known as score fusion, decision fusion, or late fusion.
For liveness assurance by means of bimodal face-voice authentication, it is necessary to apply an
early fusion strategy, i.e., to fuse the two modalities at the feature level [6]. If the two modalities
were fused late, i.e., at the score or decision level, analysis of the video of the speaking face would
yield one decision on the speaker’s identity and analysis of the audio of the utterance would yield
another decision on the speaker’s identity. The two processes would run independently of each other
with no connection between them that would allow the checking for the correspondence and
synchrony of visemes and phonemes [7].
Therefore, the features that are extracted from the audio signal on a frame-by-frame basis – usually at
an audio frame rate of about 40–100 frames per second – must be combined with the features that are
extracted from the video signal, usually at the video frame rate of 25 or 30 frames per second. An
example of how the differing frame rates for the audio and video signals can be accommodated is
shown in Fig. 5, where the audio frame rate is 50 frames per second, the video frame rate is 25 frames
per second, and the combined audiovisual feature vector comprises the audio feature vectors of two
consecutive audio frames, combined with the single video vector of the synchronous video frame.
Anti-spoofing, Voice, Fig. 5
Feature fusion of two consecutive 20 ms audio feature vectors with the corresponding 40 ms video
feature vector. Before fusion, the audio vectors have been reduced to 8 dimensions each, and the
video vector has been reduced to 20 dimensions. The combined feature vector has 36 dimensions
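A minimal sketch of this feature-level fusion step, mirroring the shapes in Fig. 5 (two 8-dimensional audio vectors at 50 frames per second concatenated with one 20-dimensional video vector at 25 frames per second, giving 36-dimensional fused vectors), is shown below; the dimensionality-reduction step itself is not shown, and the array layout is an assumption.

import numpy as np

def fuse_audio_video(audio_feats: np.ndarray, video_feats: np.ndarray) -> np.ndarray:
    """audio_feats: (2*N, 8) at 50 frames/s; video_feats: (N, 20) at 25 frames/s.
    Returns an (N, 36) matrix of fused audiovisual feature vectors."""
    n_video = video_feats.shape[0]
    # Pair each video frame with its two synchronous audio frames.
    audio_pairs = audio_feats[:2 * n_video].reshape(n_video, 2 * audio_feats.shape[1])
    return np.concatenate([audio_pairs, video_feats], axis=1)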
The combined audiovisual feature vectors will then reveal whether the audio and video streams are
synchronous, for example, when the combined audiovisual feature vectors contain the sequence of
visemes /m/, /u/, and /d/ and likewise the sequence of phonemes /m/, /u/, and /d/. In contrast, if one of
the combined audiovisual feature vectors were to contain the visual information for the viseme /m/
and at the same time the audio information for the phoneme /u/, the combined feature vector would
indicate that the audio and video streams do not represent a corresponding synchronous representation
of any speech sound.
The proper sequencing of visemes and phonemes is usually ascertained by representing the
audiovisual speech by hidden Markov models (HMM), which establish the likelihoods of the different
combined audiovisual vectors and their sequences over time [8]. It is therefore possible to ascertain whether the audio and video components of a combined audio-video stream represent a likely live utterance. As a result, an attacker who attempts to impersonate a target speaker by means of a recorded
speech utterance and a still photograph of the target speaker will be thwarted because the system will
recognize the failure of the face to form the corresponding visemes that should be observed
synchronously with the phonemes of the utterance. Similarly, such a system will thwart an attack by
an audiovisual speech synthesis system, unless the synthesizer can generate the synthetic face and the
synthetic voice in nearly perfect synchrony.
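A hedged sketch of such a synchrony check follows: an HMM is trained on fused audiovisual vectors from genuine speaking-face recordings, and a test stream whose per-frame log-likelihood falls below a threshold (for example, a replayed voice over a still photograph) is rejected as non-live. The hmmlearn toolkit, the number of states, and the threshold are assumptions; the entry does not prescribe a particular implementation.

import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_liveness_hmm(genuine_sequences, n_states: int = 8) -> GaussianHMM:
    """Fit an HMM to a list of fused audiovisual feature sequences from genuine recordings."""
    lengths = [seq.shape[0] for seq in genuine_sequences]
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag")
    hmm.fit(np.vstack(genuine_sequences), lengths)
    return hmm

def looks_live(hmm: GaussianHMM, fused_stream: np.ndarray, threshold: float) -> bool:
    """Accept the stream as live if its per-frame log-likelihood under the
    genuine audiovisual model exceeds the threshold."""
    return hmm.score(fused_stream) / fused_stream.shape[0] > threshold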
Related Entries
Biometric Vulnerabilities, Overview
Security and Liveness, Overview
References
1.
S. Furui, Cepstral analysis techniques for automatic speaker verification. IEEE Trans. Acoust. Speech
Signal Process. (ASSP) 29, 254–272 (1981)
2.
F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin,
J. Ortega-García, D. Petrovska-Delacrétaz, D.A. Reynolds, A tutorial on text-independent speaker
verification. EURASIP J. Appl. Signal Process. 2004(4), 430–451 (2004)
3.
T. Matsui, S. Furui, Speaker adaptation of tied-mixture-based phoneme models for text-prompted
speaker recognition, in Proceedings of International Conference on Acoustics, Speech and Signal
Processing, Adelaide (IEEE, New York, 1994), pp. I-125–128
4.
D. Reynolds, W. Andrews, J. Campbell, J. Navratil, B. Peskin, A. Adami, Q. Jin, D. Klusacek, J.
Abramson, R. Mihaescu, J. Godfrey, D. Jones, B. Xiang, The SuperSID project: exploiting high-level
information for high-accuracy speaker recognition, in Proceedings of International Conference on
Acoustics, Speech and Signal Processing, Hong Kong (IEEE, New York, 2003), pp. IV-784–787
5.
X. Huang, A. Acero, H.-W. Hon, Spoken Language Processing (Prentice Hall, Upper Saddle River,
2001)
6.
G. Chetty, M. Wagner, Investigating feature-level fusion for checking liveness in face-voice
authentication, in Proceedings of Eighth IEEE Symposium on Signal Processing and Its Applications,
Sydney (IEEE, New York, 2005), pp. 66–69
7.
H. Bredin, G. Chollet, Audiovisual speech synchrony measure: application to biometrics. EURASIP J.
Adv. Signal Process. 2007(1), 1–11 (2007)
8.
G. Chetty, M. Wagner, Speaking faces for face-voice speaker identity verification, in Proceedings of Interspeech 2006 – International Conference on Spoken Language Processing, Pittsburgh (International Speech Communication Association, 2006), Paper Mon3A1O-6