© Springer Science+Business Media New York 2015
Stan Z. Li and Anil K. Jain (eds.), Encyclopedia of Biometrics, 10.1007/978-1-4899-7488-4_70

Anti-spoofing, Voice

Michael Wagner (1, 2, 3)
(1) College of Engineering and Computer Science, Australian National University, Canberra, Australia
(2) National Centre for Biometric Studies Pty Ltd, Canberra, Australia
(3) Faculty of ESTeM, University of Canberra, Canberra, Australia
Email: [email protected]

Synonyms

Liveness assurance; Liveness verification; One-to-one speaker recognition; Speaker verification; Voice authentication; Voice verification

Definition

The process of verifying whether the voice sample presented to an authentication system is real (i.e., live) or whether it is replayed or synthetic, and thus fraudulent. When authentication through a voice authentication system is requested, it is important to be sure that the person seeking authentication actually provides the required voice sample at the time and place of the authentication request. The voice must be presented live, like that of a radio presenter during a live broadcast, as distinct from a recording. In contrast, an impostor who seeks authentication fraudulently could play an audio recording of a legitimate client or synthesized speech that is manufactured to resemble the speech of a legitimate client. Such threats to the system are known as replay attack and synthesis attack, respectively. Liveness assurance uses a range of measures to reduce the vulnerability of a voice authentication system to the threats of replay and synthesis attacks.

Introduction

The security of a voice authentication system depends on several factors. Primarily, the system must be capable of distinguishing people by their voices, so that clients who are enrolled in, say, a telephone banking system are admitted to their accounts reliably, while an impostor who attempts to access the same account is rejected equally reliably. A good voice authentication system will thwart an impostor irrespective of whether the access to the other person's account is inadvertent or deliberate and irrespective of whether the impostors use their natural voice or try to improve their chances by mimicking the voice of the client.

However, one vulnerability common to all voice authentication systems is the possibility that attackers, instead of speaking to the system directly with their own voice, fraudulently use the recorded voice of a true client in order to be admitted by the system. In principle, such a "replay attack" can be carried out by means of any sound recording device, analog or digital, through which the recorded voice of the client is played back to the system, say, to a microphone at a system access point or remotely into a telephone handset connected to the authentication system. The security issue in this case is that the voice used for authentication is not the "live" voice of the person seeking access to the system at the time and place of the access request. A technically sophisticated attacker may also use suitable computer hardware and software to create a simile of the client's voice by means of speech synthesis, without having to record specific voice samples of the client. Such an attack is referred to as a "synthesis attack" in the following. Figure 1 shows how replayed or synthesized voice signals can be substituted for the live voice of a client at the sensor input of the authentication system.
Anti-spoofing, Voice, Fig. 1 Prerecording of the client's voice, either for later replay or for generating a client model that can be used later to synthesize the client's voice

Replay Attack

Since voice authentication is always implemented within the context of a computer system, it is important to consider the vulnerabilities of the entire system (see Security and Liveness, Overview). Figure 2 shows the structure of a typical voice authentication system. During the enrolment or training phase, the client's voice is captured by the microphone, salient features are extracted from the speech signals, and finally a statistical "client model" or template is computed, which represents the client-specific voice characteristics according to the speech data collected during enrolment. During the operational or testing phase, when the system needs to decide whether a speech sample belongs to the client, the signal is again captured by the sensor, and features are extracted in the same way as in the enrolment phase. Then, the features of the unknown speech sample are compared statistically with the model of the client that was established during enrolment. Depending on how close the unknown sample is to the client model, the system issues either an "accept" or a "reject" decision: the person providing the voice sample is either authenticated or considered an impostor.

Figure 2 also shows various ways in which attackers could manipulate the outcome of the authentication if any of the software or hardware components of an insecure computer system could be accessed. If it were possible, for example, to manipulate the database of client models, attackers could replace the voice model of a client with their own voice model and subsequently gain fraudulent access to the system by having substituted their own identity for that of the client. Or, even more simply, if it were possible to manipulate the decision module of the system, an attacker could essentially bypass the entire authentication process and manufacture an "accept" decision without having provided any matching voice data. Such considerations fall into the domain of the system engineer, who needs to ensure, much as with any other secure system, that there are no bugs, trap doors, or entry points for Trojan horses that could allow an attacker to manipulate or bypass the authentication mechanisms of the system. Since such vulnerabilities are not specific to voice authentication systems, they are not dealt with in this entry (see Biometric Vulnerabilities, Overview).

Anti-spoofing, Voice, Fig. 2 Potential points of vulnerability of a voice biometric authentication system: (a) replay or synthesize the client voice into the input sensor; (b) insert the replayed or synthesized client voice at vulnerable system-internal points; (c) override the detected features at vulnerable system-internal points; (d) override the client model at vulnerable system-internal points; and (e) override the accept/reject decision at vulnerable system-internal points

The remainder of this entry discusses how a secure voice authentication system can provide the assurance that the voice used for an access request is "live" at the time and place of the access request and is neither a playback of a voice recording nor a synthesized simile of a client voice. Liveness assurance is thus an essential aspect of the security of any voice authentication system.
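As an illustration of the enrolment and testing phases shown in Fig. 2, the following minimal sketch builds a Gaussian mixture client model from MFCC features and scores an unknown sample against it (the first system type in Table 1 below). The file names, model size, threshold, and helper functions are illustrative assumptions rather than a prescribed implementation; the sketch uses the librosa and scikit-learn libraries, and a deployed system would additionally normalise the score against a universal background model.

```python
# A minimal sketch of the enrol/verify flow of Fig. 2, assuming a Gaussian
# mixture model (GMM) over MFCC features as the client model (cf. Table 1).
# File names, model size and threshold are illustrative only.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_features(wav_path, sr=16000):
    """Frame the signal and extract MFCC features (one row per frame)."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    return mfcc.T                                  # shape: (n_frames, 13)

def enrol(enrolment_wavs, n_components=32):
    """Fit the statistical client model from several enrolment recordings."""
    feats = np.vstack([extract_features(w) for w in enrolment_wavs])
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag", random_state=0).fit(feats)

def verify(test_wav, client_model, threshold=-45.0):
    """Accept if the average per-frame log-likelihood exceeds the threshold."""
    score = client_model.score(extract_features(test_wav))
    return "accept" if score > threshold else "reject"

# Example usage (hypothetical file names):
# model = enrol(["client_session1.wav", "client_session2.wav"])
# print(verify("unknown_sample.wav", model))
```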
Liveness Assurance for Different Authentication Protocols

Voice authentication systems operate under different protocols, and assurance of liveness is affected differently by the various authentication protocols. The three main protocols used for voice authentication are text-dependent speaker verification, text-independent speaker verification, and text-prompted speaker verification, as shown in Fig. 3.

The earliest authentication protocol was text dependent [1]. In this protocol, the client uses a fixed authentication phrase, which is repeated several times during enrolment. The repetitions are necessary so that the system "learns" the range of the client's pronunciations of the authentication phrase. Generally, speaker verification works best if the natural variation of a speaker's voice is well captured during enrolment. Hence, ideally, enrolment should be distributed over several recording sessions that may be spread over several days or even weeks. The same phrase, for example, a sequence of digits ("three-five-seven-nine") or a password or passphrase ("Open Sesame"), is then used again by the client during the operational phase in order to be authenticated by the system.

Anti-spoofing, Voice, Fig. 3 (a) Text-dependent voice authentication: (E) at enrolment the client repeats the authentication phrase several times; (V) for verification the client speaks the same authentication phrase. (b) Text-independent voice authentication: (E) at enrolment the client reads a 2–3-min phonetically rich text; (V) for verification any utterance can be used by the client. (c) Text-prompted voice authentication: (E) at enrolment the client reads a 2–3-min phonetically rich text; (V) for verification the client is prompted to say a given phrase, which is verified both for the correct content and for the client's voice characteristics

Text-dependent systems have the advantage that the client model only needs to represent the acoustic information related to the relatively few speech sounds of the passphrase. Enrolment, therefore, is shorter and quicker than for other protocols, which typically require the representation of the entire collection of speech sounds that the client could possibly produce. However, the text-dependent protocol has the distinct disadvantage that clients repeat the same phrase every time they use the system. Consequently, there may be ample opportunity for an attacker, especially if the system microphone is situated in a public area, to record the client's passphrase surreptitiously and to replay the recorded passphrase fraudulently in order to be authenticated by the system.

In contrast, text-independent voice authentication systems [2] will authenticate a client, and reject an impostor, irrespective of the particular utterances used during enrolment. Client enrolment for text-independent systems invariably takes longer than enrolment for a text-dependent system and usually involves a judiciously designed enrolment text, which contains all, or at least most, of the speech sounds of the language. This ensures that the client models constructed from the enrolment speech data represent, to the largest extent possible, the idiosyncrasies of the client when an arbitrary sentence or other utterance is provided for authentication later.
Text-independent protocols offer the advantage that authentication can be carried out without the need for a particular passphrase, for example, as part of an ordinary interaction between a client and a customer-service agent or an automated call-center agent, as shown in this fictitious dialog:

Client: phones XYZ Bank.
Agent: Good morning, this is XYZ Bank. How can I help you?
Client: I would like to inquire about my account balance.
Agent: What is your account number?
Client: It's 123-4567-89.
Agent: Good morning, Ms. Applegate, the balance of your account number 123-4567-89 is $765.43. Is there anything else ...?

The example shows a system that combines speech recognition with voice authentication. The speech recognizer understands what the customer wants to know and recognizes the account number, while the authentication system uses the text-independent protocol to ascertain the identity of the client from the first two responses the client gives over the telephone. These responses would not normally have been encountered by the system during enrolment, but the coverage of the different speech sounds during enrolment would be sufficient for the authentication system to verify the client from the new phrases. The text-independent protocol offers an attacker the opportunity to record any client utterances, either in the context of the client using the authentication system or elsewhere, and to replay the recorded client speech in order to fraudulently achieve authentication by the system.

A more secure variant of the text-independent protocol is the text-prompted protocol [3]. Enrolment under this protocol is similar to the text-independent protocol in that it aims to achieve a comprehensive coverage of the different possible speech sounds of a client so that later on any utterance can be used for client authentication. However, during authentication the text-prompted protocol asks the user to say a specific, randomly chosen phrase, for example, by prompting the user "please say the number sequence 'two-four-six.'" When the client repeats the prompted text, the system uses automatic speech recognition to verify that the client has spoken the correct phrase. At the same time, it verifies the client's voice by means of the text-independent voice authentication paradigm. The text-prompted protocol makes a replay attack more difficult because an attacker would be unlikely to have all possible prompted texts from the client recorded in advance. However, such an attack would still be feasible for an attacker with a digital playback device that could construct the prompted text at the press of a button. For example, an attacker who has managed to record the ten digits "zero" to "nine" surreptitiously from a client, either on a single occasion or on several separate occasions, could store those recorded digits on a notebook computer and then combine them into any prompted digit sequence by simply pressing buttons on the computer.
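The text-prompted check just described can be sketched as follows: a random digit prompt is generated, the response is checked for the correct content by a speech recognizer, and the voice is verified text-independently. The recognize_text and verify_speaker helpers are hypothetical placeholders (the latter could be the GMM-based verify function sketched earlier); a real system would also constrain the recognizer to the prompted vocabulary and calibrate both decisions carefully.

```python
# A minimal sketch of the text-prompted protocol: a random digit prompt is
# issued, the response is checked for content (speech recognition) and for
# speaker identity (text-independent verification). The helpers
# recognize_text() and verify_speaker() are hypothetical placeholders.
import random

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def make_prompt(length=4):
    """Choose a random digit sequence so it cannot be recorded in advance."""
    return random.sample(DIGITS, length)

def text_prompted_check(response_wav, prompt, recognize_text, verify_speaker):
    """Accept only if both the spoken content and the voice match."""
    spoken_words = recognize_text(response_wav)      # e.g., ["two", "four", "six"]
    content_ok = (spoken_words == list(prompt))
    speaker_ok = (verify_speaker(response_wav) == "accept")
    return "accept" if (content_ok and speaker_ok) else "reject"

# Example usage (with hypothetical recognizer/verifier implementations):
# prompt = make_prompt()
# print("Please say:", "-".join(prompt))
# decision = text_prompted_check("response.wav", prompt, my_asr, my_verifier)
```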
Synthesis Attack

Even a text-prompted authentication system is vulnerable to an attacker who uses a text-to-speech (TTS) synthesizer. A TTS system allows a user to input any desired text, for example, by means of a computer keyboard, and to have that text rendered automatically into a spoken utterance and output through a loudspeaker or another analog or digital output channel. The basic principle is that an attacker would program a TTS synthesizer in such a way that it produces speech patterns similar to those of the target speaker. If that is achieved, the attacker would only need to type the text that is required or prompted by the authentication system in order for the TTS synthesizer to play the equivalent synthetic utterance to the authentication system in the voice of the target speaker.

In practice, however, current state-of-the-art text-to-speech synthesis is not quite capable of producing such natural-sounding utterances. In other words, synthetic speech produced by current TTS systems still sounds far from natural and is easily distinguished from genuine human speech by the human ear. Does this mean, however, that TTS speech could not deceive an authentication system based on automatic speaker recognition? To answer this question, it needs to be examined how different speaker recognition systems actually work. As shown in Table 1, there are three types of speaker recognition systems, distinguished by the types of speech patterns each examines in order to determine the similarity of the unknown speech and the target speech. The most common type of speaker recognition system looks at speaker differences at the individual sound level. A second type of speaker recognition system examines the sequences of speech sounds which form words, and a third type also analyzes higher-level information such as intonation, choice of words, choice of sentence structure, or even the semantic or pragmatic content of the utterances in question [4].

Anti-spoofing, Voice, Table 1 Types of speaker authentication methods

Type of speaker recognition system: Recognizes individual speech sounds (context-free)
Training/enrolment: A set of speech sounds typical for the target speaker is collected and becomes the "model" for the target speaker
Testing: Each speech sound is individually compared with the target speaker "model"
Typical method: Gaussian mixture model (GMM)

Type of speaker recognition system: Recognizes sequences of speech sounds (context-sensitive)
Training/enrolment: In addition to the individual sounds, the speaker model represents the sequences of speech sounds that are typical for the target speaker
Testing: The entire utterance is compared with the target speaker model for both individual sounds and sound sequences
Typical method: Hidden Markov model (HMM)

Type of speaker recognition system: Recognizes higher-level features (intonation, word choice, syntax, etc.)
Training/enrolment: In addition to sound sequences, the speaker model represents words, sentence structures, and intonation patterns typical for the target speaker
Testing: Similarity of sounds and sound sequences is combined with similarity of word sequences and intonation patterns
Typical method: Information fusion of GMM and/or HMM with higher-level information sources

Speech processing invariably segments a speech signal into small chunks, or "frames," of about 10–30 ms duration, which corresponds approximately to the average duration of speech sounds. For each frame, features are extracted from the speech signal, such as a spectrum, a cepstrum, or a mel-frequency cepstrum (MFC) [5]. These extracted features serve as the basis for the comparison between the unknown speech and the target speech. The first type of speaker recognition system independently compares the features of each frame of the unknown speech signal with the model of the target speaker. This is done independently for each frame and without considering the speech sounds immediately preceding or succeeding the given frame. The second type of speaker recognition system takes into account the likelihood of sequences of speech sounds, rather than individual speech sounds, when comparing the unknown speech signal with the model of the target speaker.
For example, the sound sequence /the/ would be more likely for a speaker of English than the sound sequence /eth/. The third type of system takes into account higher-level features, i.e., the variation of features over time, such as the intonation pattern of a sentence as it manifests itself in the functions of loudness and pitch over time. Such authentication systems typically operate on much longer passages of speech, for example, to segment a two-way telephone conversation or the proceedings in a court of law into the turns belonging to the different speakers. Figure 4 shows an example of two speakers pronouncing the same sentence with quite different intonation.

Anti-spoofing, Voice, Fig. 4 Two male speakers from the Australian National Database of Spoken Language (ANDOSL), speaking the same sentence with distinctly different intonation: audio signal, power, and fundamental frequency (F0) contours. Speaker S017 produced the word "John" with falling F0, while Speaker S029 produced the same word with rising F0

It is easy to see that a context-free authentication system is prone to be attacked successfully by a very simple synthesizer, namely, one that produces a few seconds of only a single speech sound. For example, an attacker could reproduce a single frame of, say, the sound "a" of the target speaker and play this frame repeatedly for a second or two in order to "convince" an authentication system of this type that the "aaaaaaa..." sound represents the natural voice of the target speaker. This is because each frame is assessed independently as being similar to the "a" sound of the target speaker, irrespective of the fact that the sequence of "a" sounds does not represent a likely speech pattern of the target voice. A context-sensitive authentication system, on the other hand, requires a speech synthesizer to reproduce entire sound sequences that are sufficiently similar to sound sequences produced by the target speaker. This means that the individual sounds produced by the synthesizer must be similar to the sounds of the target speaker and the sound sequences must be structured in a similar way to those of the target speaker. This is a proposition that is far more difficult, although not impossible, to achieve with current state-of-the-art speech synthesizers. Furthermore, if the speaker authentication system also considers intonation patterns and higher-level features such as the choice of words and grammatical constructs, an attacker who tries to impersonate a target speaker using a TTS synthesizer would require a system that is beyond the capabilities of the technology at the time of writing.
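The weakness of the context-free (first) type of system can be made concrete with a short numerical sketch: because a GMM scores every frame independently, a single well-matched frame repeated for the length of an utterance, or genuine frames shuffled into an order no human vocal tract could produce, can receive the same kind of high average score as natural speech. The toy client model and feature values below are synthetic stand-ins, not real speech data.

```python
# Sketch: why a context-free (GMM, frame-by-frame) verifier cannot detect an
# unnatural frame sequence. All numbers and the toy client model are made up.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy "client model": a GMM fitted to synthetic 13-dimensional feature frames.
client_frames = rng.normal(loc=0.0, scale=1.0, size=(2000, 13))
client_model = GaussianMixture(n_components=8, covariance_type="diag",
                               random_state=0).fit(client_frames)

# A genuine test utterance: 100 frames drawn from the same distribution.
genuine = rng.normal(loc=0.0, scale=1.0, size=(100, 13))

# Attack 1: a single well-matched frame repeated 100 times ("aaaaaaa...").
repeated = np.tile(genuine[0], (100, 1))

# Attack 2: the genuine frames shuffled into an order no human could produce.
shuffled = rng.permutation(genuine)

for name, sample in [("genuine", genuine), ("repeated frame", repeated),
                     ("shuffled frames", shuffled)]:
    # score() averages per-frame log-likelihoods and ignores frame order.
    print(f"{name:16s} average log-likelihood: {client_model.score(sample):.2f}")

# The shuffled sample scores identically to the genuine one, and the repeated
# frame can score even higher; a sequence model (HMM) is needed to penalize
# such unnatural frame orderings.
```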
Multimodal Liveness Assurance

The assurance that a voice biometric is delivered live at the time and place of authentication can be enhanced considerably by complementing the voice modality with a second modality. In the simplest case, this could be the visual modality provided by a human observer, who can assure that the voice biometric is actually provided by the person seeking authentication and that the person is not using any device to play back a recorded or synthesized voice sample. In an automatic voice authentication system, similar assurance of liveness can be achieved by combining the voice modality with a face recognition system. Such a system has a number of advantages.

Firstly, the bimodal face-voice approach to authentication provides two largely independent feature sets, which, when combined appropriately, can be expected to yield better authentication than either of the two modalities by itself. Secondly, the bimodal approach adds robustness to the system when either modality is affected by difficult environmental conditions. In the case of bimodal face-voice authentication, it is particularly useful to fall back on the complementary face recognition facility when the voice recognition modality breaks down due to high levels of surrounding noise, competing speakers, or channel variability such as that caused by weak cell phone reception. In such situations, the face recognition modality is able to take over and hence provide enhanced robustness for the combined system. A similar consideration applies, of course, when the combined face-voice authentication system is viewed from the perspective of the face recognition modality, which may equally break down in difficult environmental conditions such as adverse lighting. In this case, too, the overall robustness of the authentication system is preserved by the combination of the two modalities, voice and face, each of which is affected differently and largely independently by environmental factors.

However, the most important advantage of a bimodal face-voice authentication system for the assurance of liveness is the fact that the articulator movements, mainly of the lips, but also of the tip of the tongue, the jaw, and the cheeks, are mostly observable and correspond closely to the particular speech sounds produced. Therefore, it is possible, when observing a bimodal audio-video signal of the speaking face, to ascertain whether the facial dynamics and the sequence of speech sounds are mutually compatible and synchronous. To a human observer it is quite disconcerting when this is not the case, for example, with an out-of-sync television signal or with a static facial image when the speaker is heard saying something but the lips are not seen to be moving. In the field of audiovisual speech recognition, the term "viseme" has been coined as the visual counterpart of the "phoneme," which denotes a single speech sound. The visemes /m/, /u/, and /d/ (as in the word "mood"), for example, first show the speaker's lips spread and closed (for /m/), then protruded and rounded (for /u/), and finally spread and slightly open (for /d/). It is therefore possible to detect whether the corresponding sequences of visemes and phonemes of an utterance are observed in a bimodal audio-video signal and whether the observed viseme and phoneme sequences are synchronous.

In order for the synchrony of the audio and video streams to be ascertained, the two modalities must be combined appropriately. Multimodal authentication systems employ different paradigms to combine, or "fuse," information from the different modalities. Modality fusion can happen at different stages of the authentication process. Fusing the features of the different channels immediately after the feature extraction phase is known as "feature fusion" or "early fusion." In this paradigm, all comparisons between the unknown sample and the client model, as well as the decision making, are based on the combined feature vectors. The other possibility is to fuse information from the two modalities after independent comparisons have been made for each modality. Such paradigms are known as score fusion, decision fusion, or late fusion.
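By way of contrast with the feature-level fusion discussed next, the following minimal sketch shows score-level (late) fusion: each modality is scored independently and the scores are merely combined at the end, so no term in the combination relates the two streams at the same instant in time. The scoring helpers, weight, and threshold are illustrative assumptions.

```python
# Sketch of score-level ("late") fusion: the audio and video scores are
# computed completely independently and only combined at the end, which is
# why this paradigm cannot check viseme-phoneme synchrony. The two scoring
# helpers and the fusion weight are hypothetical.
def late_fusion_decision(audio_sample, video_sample,
                         score_voice, score_face,
                         weight=0.5, threshold=0.0):
    """Weighted sum of independent per-modality scores."""
    s_audio = score_voice(audio_sample)   # e.g., GMM log-likelihood ratio
    s_video = score_face(video_sample)    # e.g., face-matcher similarity score
    fused = weight * s_audio + (1.0 - weight) * s_video
    # A replayed recording plus a photograph can pass both checks separately,
    # because no term here depends on both streams at the same instant.
    return "accept" if fused > threshold else "reject"
```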
For liveness assurance by means of bimodal face-voice authentication, it is necessary to apply an early fusion strategy, i.e., to fuse the two modalities at the feature level [6]. If the two modalities were fused late, i.e., at the score or decision level, analysis of the video of the speaking face would yield one decision on the speaker's identity and analysis of the audio of the utterance would yield another. The two processes would run independently of each other, with no connection between them that would allow checking for the correspondence and synchrony of visemes and phonemes [7]. Therefore, the features that are extracted from the audio signal on a frame-by-frame basis, usually at an audio frame rate of about 40–100 frames per second, must be combined with the features that are extracted from the video signal, usually at a video frame rate of 25 or 30 frames per second. An example of how the differing frame rates for the audio and video signals can be accommodated is shown in Fig. 5, where the audio frame rate is 50 frames per second, the video frame rate is 25 frames per second, and the combined audiovisual feature vector comprises the audio feature vectors of two consecutive audio frames, combined with the single video feature vector of the synchronous video frame.

Anti-spoofing, Voice, Fig. 5 Feature fusion of two consecutive 20 ms audio feature vectors with the corresponding 40 ms video feature vector. Before fusion, the audio vectors have been reduced to 8 dimensions each, and the video vector has been reduced to 20 dimensions. The combined feature vector has 36 dimensions
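The frame-rate alignment of Fig. 5 can be sketched as follows: two consecutive 8-dimensional audio vectors (50 frames per second) are concatenated with the synchronous 20-dimensional video vector (25 frames per second) to form one 36-dimensional audiovisual vector per video frame. The dimensionality-reduced feature arrays are assumed to be given; the reduction itself (for example, by PCA or a transform of the mouth region) is not shown.

```python
# Sketch of the feature-level fusion in Fig. 5: audio features at 50 fps
# (8 dimensions per frame) and video features at 25 fps (20 dimensions per
# frame) are combined into 36-dimensional audiovisual vectors, one per video
# frame. The input arrays are assumed to be already extracted and reduced.
import numpy as np

def fuse_audio_video(audio_feats, video_feats):
    """Concatenate two consecutive audio frames with one synchronous video frame.

    audio_feats: array of shape (2 * n_video_frames, 8), at 50 frames/s
    video_feats: array of shape (n_video_frames, 20), at 25 frames/s
    returns:     array of shape (n_video_frames, 36)
    """
    n_video = video_feats.shape[0]
    audio_feats = audio_feats[:2 * n_video]             # drop any trailing frame
    audio_pairs = audio_feats.reshape(n_video, 2 * audio_feats.shape[1])
    return np.hstack([audio_pairs, video_feats])

# Example with random stand-in features for a 2-second utterance:
audio = np.random.randn(100, 8)     # 2 s of audio frames at 50 fps
video = np.random.randn(50, 20)     # 2 s of video frames at 25 fps
fused = fuse_audio_video(audio, video)
print(fused.shape)                   # (50, 36)
```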
The combined audiovisual feature vectors will then reveal whether the audio and video streams are synchronous, for example, when the combined audiovisual feature vectors contain the sequence of visemes /m/, /u/, and /d/ and likewise the sequence of phonemes /m/, /u/, and /d/. In contrast, if one of the combined audiovisual feature vectors were to contain the visual information for the viseme /m/ and at the same time the audio information for the phoneme /u/, the combined feature vector would indicate that the audio and video streams do not represent a corresponding, synchronous representation of any speech sound. The proper sequencing of visemes and phonemes is usually ascertained by representing the audiovisual speech by hidden Markov models (HMMs), which establish the likelihoods of the different combined audiovisual vectors and their sequences over time [8]. It is therefore possible to ascertain whether the audio and video components of a combined audio-video stream represent a likely live utterance. An attacker who attempts to impersonate a target speaker by means of a recorded speech utterance and a still photograph of the target speaker will therefore be thwarted, because the system will recognize the failure of the face to form the corresponding visemes that should be observed synchronously with the phonemes of the utterance. Similarly, such a system will thwart an attack by an audiovisual speech synthesis system, unless the synthesizer can generate the synthetic face and the synthetic voice in nearly perfect synchrony.

Related Entries

Biometric Vulnerabilities, Overview
Security and Liveness, Overview

References

1. S. Furui, Cepstral analysis techniques for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Process. (ASSP) 29, 254–272 (1981)
2. F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, D.A. Reynolds, A tutorial on text-independent speaker verification. EURASIP J. Appl. Signal Process. 2004(4), 430–451 (2004)
3. T. Matsui, S. Furui, Speaker adaptation of tied-mixture-based phoneme models for text-prompted speaker recognition, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Adelaide (IEEE, New York, 1994), pp. I-125–128
4. D. Reynolds, W. Andrews, J. Campbell, J. Navratil, B. Peskin, A. Adami, Q. Jin, D. Klusacek, J. Abramson, R. Mihaescu, J. Godfrey, D. Jones, B. Xiang, The SuperSID project: exploiting high-level information for high-accuracy speaker recognition, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Hong Kong (IEEE, New York, 2003), pp. IV-784–787
5. X. Huang, A. Acero, H.-W. Hon, Spoken Language Processing (Prentice Hall, Upper Saddle River, 2001)
6. G. Chetty, M. Wagner, Investigating feature-level fusion for checking liveness in face-voice authentication, in Proceedings of the Eighth IEEE Symposium on Signal Processing and Its Applications, Sydney (IEEE, New York, 2005), pp. 66–69
7. H. Bredin, G. Chollet, Audiovisual speech synchrony measure: application to biometrics. EURASIP J. Adv. Signal Process. 2007(1), 1–11 (2007)
8. G. Chetty, M. Wagner, Speaking faces for face-voice speaker identity verification, in Proceedings of Interspeech 2006 – International Conference on Spoken Language Processing, Paper Mon3A1O-6, Pittsburgh (International Speech Communication Association, 2006)