
AUTOMATIC ACQUISITION DEVICE IDENTIFICATION FROM SPEECH RECORDINGS
Daniel Garcia-Romero and Carol Y. Espy-Wilson
Department of Electrical and Computer Engineering, University of Maryland, College Park, MD
[email protected], [email protected]
ABSTRACT
In this paper we present a study on the automatic identification of acquisition devices when only the output speech recordings are available. A statistical characterization of the frequency response of the device, contextualized by the speech content, is proposed. In particular, the intrinsic characteristics of the device are captured by a template constructed by concatenating the means of a Gaussian mixture model trained on speech recordings from that device. This study focuses on two classes of
acquisition devices, namely, landline telephone handsets and
microphones. Three publicly available databases are used to
assess the performance of linear- and mel-scaled cepstral
coefficients. A Support Vector Machine classifier was used
to perform closed-set identification experiments. The results
show classification accuracies higher than 90 percent among
the eight telephone handsets and eight microphones tested.
Index Terms— Digital speech forensics, intrinsic fingerprint, non-intrusive forensics, Gaussian supervectors.
1. INTRODUCTION
The widespread availability of low-cost and sophisticated
digital media editing software allows amateur users to
perform imperceptible alterations to digital content. This
fact poses a serious threat to a wide variety of fields such as
intellectual property, criminal investigation, and law enforcement. In an attempt to minimize this threat, this work
addresses the problem of automatic acquisition device
identification (Dev-ID). In particular, our goal is to
automatically extract forensic evidence about the
mechanism involved in the generation of the speech
recording by analysis of the acoustic signal. Our focus is on
blind-passive strategies – as opposed to active embedding of
watermarks or having access to input-output pairs – since
most realistic scenarios only allow for this kind of approach.
The underlying hypothesis of our approach is that the
physical devices–along with their associated signal
processing chain–leave behind intrinsic fingerprint traces in
the speech signal. Moreover, these digital traces can be
characterized and detected by statistical methods and
automatic pattern recognition techniques. An important
observation in this regard is that in the field of speaker
recognition (SR), the artifacts left in the speech signal by the
acquisition device are highly detrimental to the performance
of recognition systems [1]. Hence, most SR systems try to
remove or compensate for these artifacts by using some general
knowledge about the acquisition device (e.g., GSM handset
as opposed to CDMA). A great amount of research has been
dedicated to this issue (see [1] for current approaches).
Since Dev-ID faces the dual problem – here it is the speech content variability that makes the task difficult – we rely heavily on findings from the SR field when building our algorithms.
Regarding prior work, a small study concerning the
classification of 4 microphones was presented in [2].
Motivated by a steganalysis perspective, a combination of time-domain features based on short-term statistics of the audio signal, together with mel-cepstral coefficients, was used for feature extraction. A Naïve Bayes classifier was
used to perform closed set identification experiments at the
frame level. Accuracies on the order of 60-75% were
reported. Although its experimental setup was very limited in size, it is the only prior work the authors have been able to identify, and it nonetheless sets a precedent in the automatic identification of microphones.
2. DATABASES
2.1. Landline telephone handsets
The Handset-TIMIT (HTIMIT) and Lincoln-Labs Handset
Database (LLHDB) provide speech recordings through four
carbon-button {CB1-CB4} and four electret {EL1-EL4}
landline handsets [3]. In particular, HTIMIT was created by
playing a subset of TIMIT through a dummy head into the
different handsets. Ten short segments of around 3 seconds from each of 384 speakers (half of them male) resulted in 3,840 speech segments played through each handset. The specific make and model of each handset are given in [3].
LLHDB comprises 53 speakers (24 males and 29 females)
that uttered the 10 short sentences from HTIMIT plus a read
passage and a picture description using the same handsets
from HTIMIT. Both databases were acquired at an 8 kHz sampling rate with 16-bit resolution.
Figure 1. Magnitude squared of the frequency response (in dB) of the landline handsets of HTIMIT/LLHDB, computed from white noise using Welch's method with a 20 ms Hamming window and 50% overlap.
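For illustration, a minimal sketch of how such an estimate could be obtained with SciPy is given below; it is not part of the HTIMIT distribution tooling, and the variable `x` is assumed to hold a white-noise recording made through the device.

```python
# Sketch: estimating a handset's magnitude response from its recorded
# response to white noise, following the setup in Figure 1.
import numpy as np
from scipy.signal import welch

fs = 8000                      # HTIMIT/LLHDB sampling rate
nperseg = int(0.020 * fs)      # 20 ms Hamming window -> 160 samples
noverlap = nperseg // 2        # 50% overlap

def magnitude_response_db(x):
    """Welch PSD estimate of the device output, in dB."""
    f, pxx = welch(x, fs=fs, window="hamming",
                   nperseg=nperseg, noverlap=noverlap)
    return f, 10.0 * np.log10(pxx)
```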
2.2. Microphones
The ICSI subset of the NIST 2006 Speaker Recognition Evaluation database [4], comprising 8 different microphones and 61 speakers (28 male and 33 female), provides a total of 2223 speech segments of around 2.5 minutes each. The recordings are distributed almost evenly across the microphones. The speech segments were acquired simultaneously with all microphones in an interview-style setup. Table 1 shows the specific microphone types.
M1   AT3035 (Audio Technica Studio Mic)
M2   MX418S (Shure Gooseneck Mic)
M3   Crown PZM Soundgrabber II
M4   AT Pro45 (Audio Technica Hanging Mic)
M5   Jabra Cellphone Earwrap Mic
M6   Motorola Cellphone Earbud
M7   Olympus Pearlcorder
M8   Radio Shack Computer Desktop Mic
Table 1. Microphone types from the NIST 2006 database.
3. DEVICE CHARACTERIZATION
A typical methodology to characterize handsets is to model
them as a linear system and estimate their frequency
response by measuring the output of the system when
excited with white noise. Figure 1 shows the results of
estimating the power spectral density of white noise
captured with each telephone handset using Welch’s
method. This data is part of the HTIMIT distribution. A
simple inspection of the frequency responses of each device
shows the potential discriminative power of the frequency
domain information. Unfortunately, for blind-passive speech
forensics approaches an input-output characterization of the
device is not available. Consequently, we need to devise
mechanisms that solely rely on the output signal, which in
our case are speech recordings. This fact makes the problem
more challenging since the signals available to us contain information not only about the device but also about the speech content variability (i.e., different speakers, linguistic
content). Our approach to this problem is to alleviate the
effects of the speech content variability by using a statistical
characterization of the frequency response of the device
contextualized by the speech content. In the following we
present a practical implementation of this formulation.
3.1. Statistical modeling by UBM-GMM
Influenced by the SR field, we used Gaussian Mixture Models (GMM). In particular, an only-means adapted UBM-GMM architecture with 2048 mixtures and diagonal covariance matrices was used [5]. The frequency content information was represented by a parametrization of short-time speech segments using 20 ms Hamming windows with 50% overlap. Either 24 mel-scaled or 39 linearly-scaled filters were used to compute 23 MFCCs or 38 LFCCs by removing c0. To select an optimal parametrization, closed-set identification experiments were conducted on HTIMIT. Half of the database was used to train the 8 GMM handset models and the remaining part to perform classification using log-likelihood ratio scores [5] (a sketch of this step follows). Table 2 shows the average device identification results for different types of parametrizations.
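The following minimal sketch illustrates this closed-set identification step; `train_frames` (a hypothetical mapping from handset labels to cepstral frame arrays) is an assumption, and for brevity the device models are trained independently rather than MAP-adapted from a UBM as in our system.

```python
# Sketch: closed-set handset ID by maximum average log-likelihood over
# per-device GMMs (2048 diagonal-covariance mixtures, as in Sec. 3.1).
# Assumes train_frames: device label -> (num_frames, num_ceps) array.
from sklearn.mixture import GaussianMixture

models = {dev: GaussianMixture(n_components=2048,
                               covariance_type="diag").fit(frames)
          for dev, frames in train_frames.items()}

def identify(test_frames):
    """Return the device whose GMM scores the test frames highest."""
    scores = {dev: gmm.score(test_frames) for dev, gmm in models.items()}
    return max(scores, key=scores.get)
```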
Param. type        Num. coeffs.   ID rate (%)
23 MFCC            23             99.84
23 MFCC + DELTA    46             99.75
38 LFCC            38             99.98
38 LFCC + DELTA    76             99.97
Table 2. Average handset identification accuracy for various types of parametrizations on a development set of HTIMIT.
Two important remarks are in order. First, augmenting the mel-frequency (MFCC) and linear-frequency cepstral coefficients (LFCC) with deltas does not improve performance, so there is no justification for doubling the dimensionality of our acoustic space. Second, MFCCs and LFCCs yield very similar results, which suggests using MFCCs due to their smaller dimensionality. This last observation will be further tested in a different setting in section 4 to strengthen its validity.
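To make the two parametrizations concrete, here is a minimal sketch of mel- vs. linear-scaled filterbank cepstra; pre-emphasis, liftering, and other common refinements are omitted, and the FFT size and helper names are our own choices rather than details from the paper.

```python
# Sketch: 23 MFCCs (24 mel filters) vs. 38 LFCCs (39 linear filters),
# with c0 removed, computed from magnitude-squared spectra.
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def filterbank(n_filters, n_fft=256, fs=8000, scale="mel"):
    """Triangular filters with centers on a mel or linear frequency scale."""
    if scale == "mel":
        edges = mel_to_hz(np.linspace(0, hz_to_mel(fs / 2), n_filters + 2))
    else:
        edges = np.linspace(0, fs / 2, n_filters + 2)
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

def cepstra(power_frames, n_filters, scale):
    """power_frames: (T, n_fft//2 + 1). Log filterbank energies + DCT; drop c0."""
    fb = filterbank(n_filters, scale=scale)
    log_energies = np.log(power_frames @ fb.T + 1e-10)
    return dct(log_energies, type=2, axis=1, norm="ortho")[:, 1:n_filters]

# 24 mel filters -> 23 MFCCs; 39 linear filters -> 38 LFCCs:
# mfcc = cepstra(P, 24, "mel"); lfcc = cepstra(P, 39, "linear")
```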
3.2. Intrinsic fingerprint computation
In an only-means adapted UBM-GMM architecture, all the discriminant information of the model (in our case, telephone handsets or microphones) is captured by the means of the GMM. This is due to the fact that the MAP adaptation process only updates the means of the adapted GMM with respect to the UBM. This observation led to the construction of a Gaussian supervector (GSV) by stacking the means of the
mixture components [6]. In this way, a speech recording is
represented by a point in a high-dimensional vector space
(i.e., dimension ~ 50k). Formally, given a training data set $X = \{x_t\}_{t=1}^{T}$ and a diagonal-covariance UBM with $K$ mixtures defined by $\{\pi, m, R\}$, where $\pi = [\pi_1, \ldots, \pi_K]^T$, $m = [m_1^T, \ldots, m_K^T]^T$, and $R = \mathrm{diag}(\{R_k\}_{k=1}^{K})$, the GSV $\theta = [\theta_1^T, \theta_2^T, \ldots, \theta_K^T]^T$ is obtained as the MAP adaptation of the means of the UBM by using one iteration of the EM algorithm with prior $N(\theta; m, \tau^{-1}R)$. After the E step, the responsibility of Gaussian $k$ for data point $x_t$ is denoted by $\gamma_{kt}$, and $\gamma_k = \sum_t \gamma_{kt}$. The application of the M step then results in the MAP estimate of the GSV [5]:

$$\theta = (\tau I + \Lambda)^{-1}(\tau m + \Lambda \mu), \qquad (1)$$

with matrix $\Lambda = \mathrm{diag}(\{\gamma_k I\}_{k=1}^{K})$ and $\mu = [\mu_1^T, \ldots, \mu_K^T]^T$. For mixture $k$, eq. (1) reduces to the convex combination

$$\theta_k = \alpha_k m_k + (1 - \alpha_k)\mu_k \qquad (2)$$

between the UBM mean $m_k$ and the data mean $\mu_k$, with the mixing coefficient $\alpha_k$ given by $\alpha_k = \tau/(\tau + \gamma_k)$ and the data mean

$$\mu_k = \frac{1}{\gamma_k}\sum_t \gamma_{kt} x_t. \qquad (3)$$

The scalar $\tau$ (known as the "relevance factor") controls the trade-off between what the data "says" and our prior belief contained in the UBM means [5].
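A minimal sketch of this computation is given below; the use of a scikit-learn `GaussianMixture` as the UBM and the particular relevance factor value are our assumptions.

```python
# Sketch of eqs. (1)-(3): one EM iteration of means-only MAP adaptation,
# stacked into a Gaussian supervector. `ubm` is a trained diagonal-covariance
# GaussianMixture (e.g., GaussianMixture(2048, covariance_type="diag")),
# X is a (T, D) array of cepstral frames, tau the relevance factor.
import numpy as np

def gaussian_supervector(ubm, X, tau=16.0):
    gamma_kt = ubm.predict_proba(X)                 # responsibilities (T, K)
    gamma_k = gamma_kt.sum(axis=0)                  # soft occupancy counts (K,)
    # data means mu_k, eq. (3); guard mixtures with no soft counts
    mu = (gamma_kt.T @ X) / np.maximum(gamma_k, 1e-10)[:, None]
    alpha = tau / (tau + gamma_k)                   # mixing coefficients
    theta = alpha[:, None] * ubm.means_ + (1.0 - alpha)[:, None] * mu  # eq. (2)
    return theta.reshape(-1)                        # K mean vectors stacked
```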
Based on this formulation, the intrinsic fingerprint of an acquisition device $d$ is defined as the GSV $\theta(d)$ computed from speech recordings acquired with the device of interest
and a UBM. This procedure results in a fixed-length
template to represent variable-length speech recordings.
This is a desirable property, since in principle, the intrinsic
fingerprint of a device should be independent of the amount
of data acquired with the device. Moreover, regarding the
speech content contextualization, an experimental study
presented in [7] indicated that a phonetic context can be
attached to partitions of the GSV (i.e., grouping of subsets
of mean vectors). To support this claim, the bottom panel of
Figure 2 shows the difference between the GSV of the EL4
handset and a reference microphone SENH (both from
HTIMIT) using the visualization procedure presented in [7].
Twenty clusters were used in the visualization process. The
color coding is such that dark blue means low values and
dark red high values. Also, the top panel of Figure 2 shows
the magnitude response of both devices estimated with
white noise. The blue band in the low frequency range of the
difference between the GSVs is justified by the fact that
below 1 kHz the SENH magnitude response is above the EL4's response. On the other hand, above 1 kHz the
relation is the opposite, and therefore, it makes sense that
the GSV difference is mostly represented with red colors.
Figure 2. Top panel: magnitude responses of the EL4 telephone handset and the reference SENH microphone from the HTIMIT database, estimated by computing the power spectral density of the output using white noise as input. Bottom panel: visualization of the difference between the clustered means of the EL4 GSV and the SENH GSV. Dark blue colors indicate low values and dark red high values.
4. EXPERIMENTS
This section presents an experimental validation of the use
of GSV as intrinsic fingerprints of acquisition devices. Two
separate closed-set identification experiments were carried
out: one for landline telephone handsets and another one for
microphones. This separation was primarily motivated by
the distinct nature of the devices as well as the difference in
the duration of the speech recordings for each type of
device. Whereas for the telephone handsets the average recording length is about 3 seconds, around 2.5 minutes of speech are available for the microphones. A linear Support Vector
Machine (SVM) classifier was trained for each device using
the GSVs computed from each file. MFCCs were used to
parametrize the files. In particular, for the telephone
handsets, half of HTIMIT was used to train a UBM.
Additionally, LLHDB was partitioned into 2 sets (balanced in the number of files as well as speakers and gender). Each
file in the database was MAP-adapted from the UBM to
compute the GSVs. A two-fold cross-validation setup
resulted in approximately 300 positive GSV exemplars, comprising 26 speakers, and seven times as many negative GSV exemplars to train the SVM models for each partition. A
total of 5079 identification trials were obtained. Table 3
shows the corresponding confusion matrix where the entries
represent percentages. The average identification accuracy
across telephones is 93.2%. Most of the errors remain within the same transducer class (i.e., electret and carbon-button). As
indicated in section 3.1, to further validate the use of 23
MFCCs instead of 38 LFCCs, the same experiment was
repeated using LFCCs. The average identification rate was
identical to that of MFCCs. Thus, no improvement was
obtained by using a linear-frequency scale.
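A minimal sketch of this classification stage is shown below, assuming the GSVs and device labels have already been computed; the array names and the regularization constant are placeholders, not details from our setup.

```python
# Sketch: closed-set device identification with a linear SVM over GSVs.
# gsv_train/gsv_test: (N, K*D) supervectors; y_train/y_test: device labels.
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

clf = LinearSVC(C=1.0)                    # one-vs-rest linear models
clf.fit(gsv_train, y_train)
y_hat = clf.predict(gsv_test)

cm = confusion_matrix(y_test, y_hat)      # counts per (true, predicted) pair
cm_pct = 100.0 * cm / cm.sum(axis=1, keepdims=True)  # row percentages, as in Table 3
```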
       CB1    CB2    CB3    CB4    EL1    EL2    EL3    EL4
CB1    94.3   0.6    0.3    1.1    0.9    0.2    0.0    2.5
CB2    1.6    97.0   0.3    0.2    0.5    0.0    0.5    0.0
CB3    0.2    0.0    99.2   0.6    0.0    0.0    0.0    0.0
CB4    0.2    0.0    0.9    98.7   0.0    0.0    0.0    0.2
EL1    1.1    2.5    0.0    0.2    86.9   2.5    3.3    3.5
EL2    0.2    0.3    0.0    0.0    3.3    92.9   2.8    0.5
EL3    0.3    1.9    0.0    0.0    7.4    6.2    83.4   0.8
EL4    1.3    0.0    0.3    0.5    3.5    0.6    0.6    93.2
Table 3. Confusion matrix (entries in %) for telephone handsets based on the GSV-SVM architecture. Rows indicate files and columns models; e.g., entry (2,1) indicates that 1.6% of the CB2 files were wrongly identified as coming from CB1.
For the microphone identification experiments, the
same experimental setup was used with some small changes
due to constraints imposed by the database. Namely, the
UBM was trained with all speech segments from the ICSI
subset. The frames were downsampled by a factor of 100, reducing the data to about 2 hours of speech for UBM training. Moreover, the two-fold cross-validation resulted in an average of 280 positive GSV exemplars, comprising 30 speakers, and seven times as many negative GSV exemplars to train the SVM models. A
total of 2223 identification trials were obtained. Table 4
shows the confusion matrix of the identification
experiments. The average identification accuracy across
microphones is 99.0%, with very uniform behavior. No
apparent confusion patterns are observable among the small
number of errors. The worst performance was obtained for
recordings coming from M1 for which 2.5% were
mistakenly identified as M2. The fact that the biggest source
of errors for recordings from M2 is M1 suggests that these
two microphones have similar characteristics. Although a formal study of the influence of the amount of data on performance is a topic for future research, the 6% gap in average identification accuracy between telephone handsets (3-second recordings) and microphones (2.5-minute recordings) suggests that recording duration plays a role.
       M1     M2     M3     M4     M5     M6     M7     M8
M1     96.8   2.5    0.4    0.0    0.0    0.4    0.0    0.0
M2     1.1    98.2   0.0    0.0    0.4    0.4    0.0    0.0
M3     0.0    0.4    99.3   0.4    0.0    0.0    0.0    0.0
M4     0.0    0.0    0.0    100    0.0    0.0    0.0    0.0
M5     0.0    0.0    0.0    0.0    100    0.0    0.0    0.0
M6     1.1    0.4    0.0    0.0    0.0    98.6   0.0    0.0
M7     0.0    0.0    0.0    0.4    0.0    0.0    99.6   0.0
M8     0.0    0.0    0.0    0.0    0.0    0.0    0.0    100
Table 4. Confusion matrix (entries in %) for microphones based on the GSV-SVM architecture. Rows indicate files and columns models.
Finally, to extend the telephone handset experiments presented above, an explicit mechanism to remove undesired variability from the device intrinsic fingerprint, known as Nuisance Attribute Projection (NAP) [8], was added. The subset of HTIMIT not used in the construction of the UBM was used to compute an undesired variability subspace using the procedure detailed in [8]. This variability was removed by projecting the GSVs onto the orthogonal complement subspace. A slight improvement of 0.3% over the average identification accuracy of 93.2% was obtained when 64 dimensions were projected away. This result indicates that most of the variability in our experimental setup is well accounted for by our GSV-SVM system. However, we are currently searching for publicly available databases with more sources of variability (e.g., acoustic environments), and it is our belief that these techniques will play an important role in those more challenging setups.
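A minimal sketch of one plausible reading of this procedure follows; taking the nuisance directions as the principal directions of within-device variability on the held-out set is our interpretation of [8], not a verbatim reimplementation.

```python
# Sketch: Nuisance Attribute Projection over GSVs. `gsvs` is an (N, D)
# array of held-out supervectors, `devices` an (N,) label array; 64
# nuisance dimensions as in the experiment reported above.
import numpy as np

def nap_projector(gsvs, devices, n_dims=64):
    centered = gsvs.astype(float).copy()
    for d in np.unique(devices):
        idx = devices == d
        centered[idx] -= centered[idx].mean(axis=0)  # keep within-device variation
    # top principal directions of the nuisance (within-device) variability
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    U = vt[:n_dims].T                                # (D, n_dims) basis
    return lambda x: x - (x @ U) @ U.T               # x -> (I - U U^T) x
```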
5. CONCLUSIONS
A study on the automatic identification of eight telephone
handsets and eight microphones was presented. Several types of parametrizations were evaluated, and MFCCs provided the best trade-off between performance and
dimensionality. The use of Gaussian supervectors as a
statistical characterization of frequency domain information
of a device contextualized by speech content was proposed.
Thus, a template that captures the intrinsic characteristics of
a device was obtained. A simple visualization procedure applied to this template supported its discriminative power. A Support
Vector Machine classifier was used to perform closed-set
identification experiments. Classification accuracies higher
than 90 percent were obtained. This result indicates that
most of the variability in our experimental setup was well
accounted for by our system.
6. REFERENCES
[1] L. Burget, P. Matejka, P. Schwarz, O. Glembek, and J.
Cernocký, "Analysis of Feature Extraction and Channel
Compensation in a GMM Speaker Recognition System," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 1979-1986, September 2007.
[2] C. Kraetzer, A. Oermann, J. Dittmann, and A. Lang, "Digital
Audio Forensics: A First Practical Evaluation on Microphone and
Environment Classification," in ACM MM&Sec'07, pp. 63-74, 2007.
[3] D. A. Reynolds, "HTIMIT and LLHDB: Speech Corpora for
the Study of Handset Transducer Effects," in ICASSP, pp. 1535-1538, 1997.
[4] NIST Speech Group. [Online]. http://www.nist.gov/speech/
[5] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker
Verification Using Adapted Gaussian Mixture Models," Digital
Signal Processing, vol. 10, pp. 19-41, 2000.
[6] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, “Support
Vector Machines Using GMM Supervectors for Speaker
Verification,” IEEE Signal Processing Letters, vol. 13, no. 5, pp.
308–311, May 2006.
[7] D. Garcia-Romero and C. Espy-Wilson, "Intersession
Variability in Speaker Recognition: A Behind the Scene Analysis,"
in Interspeech, pp. 1413-1416, 2008.
[8] A. Solomonoff, W. M. Campbell and I. Boardman, “Advances
in Channel Compensation for SVM Speaker Recognition,” in
ICASSP, pp. 629–632, 2005.