Psychoacoustic Evaluation of Systems for
Delivering Spatialized Augmented-Reality Audio*
AENGUS MARTIN, CRAIG JIN, AES Member, AND ANDRÉ VAN SCHAIK
Computing and Audio Research Laboratory, University of Sydney, Sydney, Australia
Two new lightweight systems for delivering spatialized, augmented-reality audio (SARA)
are presented. Each comprises a set of earphone drivers coupled with “acoustically transparent” earpieces and a digital filter. Using the first system, subjects were able to localize virtual
auditory space (VAS) stimuli with the same accuracy as when using earphones that are
standard for presentation of VAS, while free-field localization performance was reduced only
slightly. The only disadvantage of this system is that it has a poor low-frequency response.
VAS localization performance using the second system is also as good as that with standard
VAS presentation earphones, though free-field localization performance is degraded to a
greater extent. This system has good low-frequency response, however, so its range of uses
complements that of the first SARA system. Both systems are light and easily constructed
from unmodified, commercially available products. They require little digital signal processing overhead and no special preamplifier, so they are ideally suited to mobile applications.
0 INTRODUCTION
Spatialized augmented-reality audio (SARA) can be
defined as the superposition of virtual sound sources on
the real acoustic environment, where the sounds from
virtual sources are processed so that the listener perceives
them to be coming from particular directions in space.
Therefore SARA involves the superposition of a virtual
auditory space (VAS, see [1]) on a real acoustic space.
Applications of SARA include assistive systems for
vision-impaired users, guidance systems, teleconferencing, attention-focusing systems, and audio-only electronic
games. It has been predicted that in the future SARA will
play an increasing role in the diversification of audio user
interfaces, particularly in the context of mobile devices
[2]. An ideal system for delivering SARA would be capable of rendering a VAS in which the virtual sound sources
are indistinguishable from real acoustic ones, without
impairing a user’s normal free-field hearing. In this paper
we present a novel technique for delivering SARA in
which we use earphones comprising “acoustically transparent” earpieces coupled with earphone drivers. Our primary contribution is to enable readers to quickly and
easily construct a cost-effective, lightweight SARA delivery system for experimentation or practical use, having
clear and proven expectations of its performance. The
system hardware is easily assembled using commercially
available products, and the filters required for use (see Section 1) are freely available on request from the authors. To begin, we give a brief overview of the approaches taken in previously published SARA systems.

*Manuscript received 2009 March 31; revised 2009 October 28.
A variety of techniques have previously been used to
deliver SARA. Many of these involve a device with some
earphone–microphone combination. Härmä et al. [3] give
a comprehensive review of such systems and describe one
of their own design. Theirs is a binaural system with an earpiece worn in each ear, where each earpiece combines an earphone and a microphone. The real acoustic environment is picked up by the microphones and reproduced directly via the earphones, and VAS sounds can be mixed with the microphone signals to deliver SARA. This system has since been improved by using analog electronics to pass the signal from the real acoustic environment to the earphones, thereby eliminating the delay between the real acoustic environment being picked up by the microphones and its reproduction at the earphones [4]. However, there is still the problem known as the occlusion effect [5], which refers to the amplification of low-frequency sounds within the head when the ear is blocked,
and has a number of unpleasant manifestations. In their
report on a usability study of a SARA headset that
blocked the ear canals Tikander et al. [6] state that “eating
and drinking was reported to be one of the most irritating
situations with the headset.” Another technique for delivering SARA is to use bone conduction headphones
[7], [8]. The advantage of these is that they can be worn
without affecting normal hearing at all. However, it is
difficult to present well-spatialized virtual audio sources
using this technique [9].
Our SARA system uses open earpieces, so the occlusion effect does not arise, and no microphone is required
to transmit the real acoustic environment. In this paper we
first describe the system in detail. Then we characterize
the acoustic properties of the earphones in order to
calibrate the system to present a VAS of the highest possible fidelity. We then describe a set of psychoacoustic
experiments which examine subjects’ abilities to localize
real, free-field acoustic sounds with the earphones in
place, and also their ability to localize virtual sound
sources presented using the earpieces, before discussing
the performance and usage scenarios of the SARA delivery system.
1 SPATIAL AUGMENTED-REALITY
AUDIO SYSTEM
In this section we first describe our SARA earphones
and give a brief overview of some topics in human spatial
hearing which motivate much of the remainder of this
paper. We then describe the acoustic characterization of
these earphones. Our new earphones consist of a set of
Etymotic Research ER4P MicroPro earphones (ER4P) with the supplied ear tips replaced by acoustically transparent earpieces manufactured by Surefire, LLC (see Fig. 1). The ER4Ps are marketed as reference-quality earphones suitable for use with portable devices without the need for an additional amplifier. They have a manufacturer-cited magnitude frequency response of 20 Hz to 16 kHz ±4 dB using the supplied ear tips (these are not used), a 1-kHz sensitivity of 108 dB SPL for a 0.2-V input, and a nominal impedance of 27 Ω.
The acoustically transparent earpieces are designed for
discreet monitoring of radio communications while
allowing ambient sounds and conversation to be heard.
Two SARA earphones were investigated, one with the
Surefire CommEar™ Comfort EP1 earpieces (EP1) and the other with the Surefire CommEar™ Boost EP2 earpieces (EP2). The EP2s differ from the EP1s in that they
intrude further into the ear canal and have a flange on
the end which makes them less acoustically transparent
(refer to Fig. 1).

Fig. 1. SARA earphones. (a) Etymotic Research ER4P MicroPro earphone drivers with supplied ear tips replaced by acoustically transparent earpieces. (b) Surefire CommEar™ Comfort EP1 earpiece. (c) Surefire CommEar™ Boost EP2 earpiece.

Both earpieces are designed to fit snugly
within the conchal cavity of the ear and are made from a
resilient polymer. The left and right earpieces are mirror
symmetric and are available in three sizes (small (S), medium (M), and large (L)) for the EP1s and two
sizes (medium and large) for the EP2s. No modifications
were required to fit them to the ER4P earphone drivers.
We will refer to the SARA earphones comprising the
ER4P drivers and the EP1 earpieces as ER4P-EP1X
and to those comprising the ER4P drivers and the EP2
earpieces as ER4P-EP2X, where X indicates the size
(S, M, or L) of the earpiece that was used. Where X is
omitted, the statement is applicable to the earphones with
all earpiece sizes.
We now briefly introduce some of the topics in human
spatial hearing, as they underlie much of the discussion
that follows (for a thorough treatment, see [10]). In our
terminology, to “localize” a sound means to identify the
direction in space from which the sound is arriving, but
not the distance from the source. Humans localize sound
sources using a number of spatial auditory cues. First,
when a sound radiates from a source, it reaches the two
ears by different paths. The path lengths vary with the
direction and give rise to an interaural time difference
(ITD) cue. In addition, a sound coming from a direction
to one side of the head will propagate directly to the ear
on that side, but will be attenuated somewhat by the head
on its path to the other ear. This is referred to as head
shadowing and gives rise to an interaural level difference
(ILD) cue, which has a complex dependence on frequency. Finally the head, the upper body, and the pinna,
together, form an acoustic filter which is highly personalized and direction dependent, and which provides the
“spectral” sound localization cues. The ITD and ILD are
binaural cues since they originate as a difference between
the sounds arriving at the two ears, and they are related to
localization in the horizontal plane, whereas the spectral
cues originate as monaural cues since they exist even if
sound only reaches one ear. Spectral cues are primarily
used to resolve cone of confusion (COC) errors, which
can arise when sound sources are located at different
points on the surface of an imaginary cone which has as
its axis of symmetry the interaural axis (the line that
passes through both ears; see Section 2.2). Sounds from
these sources will have roughly constant ITDs and low-frequency ILDs, so their directions are disambiguated
primarily using monaural spectral cues. With regard to
our earphones, we note that the ILD cue can be distorted
if the sound arriving at one ear is obstructed, and that
spectral localization cues are easily disrupted by any
change in the shape of the pinna. This disruption can lead
to an increase in the number of COC errors.
The preceding discussion relates to the localization of
free-field acoustic sound sources. However, if a sound is
delivered to the ears by transducers which are inserted
into the ear canals, the cues arising in natural binaural
hearing are not available unless they are introduced artificially. To make the sound appear to come from a
particular direction in space, the sounds from each transducer must be filtered electronically to mimic the effects
of the acoustic phenomena that give rise to the
auditory localization cues. The head-related impulse
responses (HRIRs) describe such filters and there is a
separate HRIR for each ear and each direction in space.
The frequency domain representation of an HRIR is
referred to as a head-related transfer function (HRTF)
and is obtained by taking the discrete Fourier transform
of an HRIR. HRIRs can be measured by placing small
microphones in the ear canals and recording sound stimuli arriving from different directions in space (see
Section 2.1 for more details). As such the HRIRs include
the ITD, ILD, and spectral localization cues. After filtering a sound with the appropriate HRIR, it is often necessary to apply an additional filter to compensate for the
acoustic characteristics of the transducer used to deliver
the audio.
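To make this chain concrete, the following sketch (in Python with NumPy and SciPy; our illustration, not code from the paper) renders a mono signal for one direction by convolving it with the left- and right-ear HRIRs and then with an earphone compensation filter of the kind described in Section 1.1. The function and variable names are assumptions.

    import numpy as np
    from scipy.signal import fftconvolve

    def render_virtual_source(mono, hrir_left, hrir_right, comp_fir):
        """Render a mono signal as a two-channel VAS stimulus for one direction.

        mono                  : 1-D array, the dry sound (e.g., a noise burst)
        hrir_left, hrir_right : measured HRIRs for the target direction
        comp_fir              : earphone compensation filter (see Section 1.1)
        """
        left = fftconvolve(fftconvolve(mono, hrir_left), comp_fir)
        right = fftconvolve(fftconvolve(mono, hrir_right), comp_fir)
        return np.stack([left, right], axis=-1)   # (samples, 2), ready for playback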
1.1 Characterization of the Earphones
In this section we describe the creation and verification
of compensation filters for the SARA earphones in order
to improve their VAS presentation (again, see [10]).
Compensation filters were designed with respect to the
“reference” magnitude frequency response of an Etymotic
Research ER1 earphone (ER1). This model was selected
because ER1s are standard earphones used to present
VAS stimuli, and they are designed to have a constant-gain frequency response at the eardrum, except for a
simulated ear-canal resonance. The impulse response of
an ER1 was measured using a log-sine sweep signal of
10-s duration and a frequency range of 50 Hz to 22 kHz
(for details, see [11]), and a Brüel and Kjær head and
torso simulator (HATS 4128C) mannequin. This mannequin has an ear simulator which comprises a removable
silicone-rubber pinna joined to an ear canal and a
Zwislocki coupler. The ER1 was fitted to the mannequin
so that the outer surface of the foam ear tip was flush
with the entrance to the ear canal. An RME Multiface
sound card was used to drive the ER1 and record the
signal from the mannequin at a sample rate of 48 kHz.
The magnitude of the transfer function of the ER1 earphone was obtained from a 512-point discrete Fourier
transform of the measured IR. It was then smoothed by
applying a linear one-sample leading, five-sample lagging moving-average filter, and the smoothed magnitude
frequency response was used as the reference magnitude
frequency spectrum. After the reference magnitude spectrum had been measured, the transfer functions were
measured for an ER4P-EP1 earphone with each of the three
sizes of the EP1 earpiece and for the ER4P-EP2 earphone
with each of the two sizes of the EP2 earpiece. The IRs of
the SARA earphones were measured using the same
method as described for the ER1 earphones. We note that
preliminary investigations of the ER4P-EP2 earphones
revealed that both normal free-field hearing and VAS
presentation were strongly affected by the quality of the
seal being made between the flange on the earpiece and
the inner wall of the ear canal. For this reason special care
was taken to ensure that there was a good seal during the
measurements.
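For readers reproducing the measurement, the following Python sketch (ours, not the authors' code) outlines the swept-sine deconvolution after Farina [11] and the spectral smoothing described above. The inverse-filter normalization and the orientation of the one-sample-leading, five-sample-lagging window are our assumptions.

    import numpy as np
    from scipy.signal import fftconvolve

    fs = 48000                        # sample rate used in the measurements
    f1, f2, T = 50.0, 22000.0, 10.0   # sweep range (Hz) and duration (s) from the text

    # Exponential (log) sine sweep, after Farina [11]
    t = np.arange(int(T * fs)) / fs
    R = np.log(f2 / f1)
    sweep = np.sin(2.0 * np.pi * f1 * T / R * (np.exp(t * R / T) - 1.0))

    # Inverse filter: time-reversed sweep with an exponential amplitude decay
    # that compensates the sweep's 1/f energy distribution
    inv_filter = sweep[::-1] * np.exp(-t * R / T)

    # `recorded` would be the signal captured from the mannequin's ear simulator;
    # deconvolution then yields the impulse response:
    #   ir = fftconvolve(recorded, inv_filter)

    def smoothed_magnitude(ir, nfft=512):
        """512-point magnitude spectrum smoothed with a moving average spanning
        one bin leading and five bins lagging (our reading of the description)."""
        mag = np.abs(np.fft.rfft(ir[:nfft], nfft))
        padded = np.pad(mag, (5, 1), mode='edge')
        return np.convolve(padded, np.ones(7) / 7.0, mode='valid')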
The magnitude frequency spectra of the IRs of the
ER4P-EP1 and ER4P-EP2 earphones deviate significantly
from each other and from the reference transfer function
(see Fig. 2). Below 1594 Hz the transfer function of the ER4P-EP1 earphone shows a rolloff of approximately 15 dB per octave, whereas that of the ER4P-EP2 earphone, which does seal the ear canal with a thin flange, shows no low-frequency rolloff. Both the ER4P-EP1 and
the ER4P-EP2 transfer functions show spectral peaks and
troughs above 5 kHz which are not present in the reference transfer function. These measurements were made
using the medium-size EP1 and EP2 earpieces. Measurements of the earphones using the other earpiece sizes
showed the same features and are omitted from the figure
for clarity. These magnitude frequency spectra were used
to create filters to compensate for the acoustic properties
of the earphones.
A compensation filter was computed for each of the
SARA earphones as follows. An inverse magnitude spectrum was computed by dividing the reference magnitude
spectrum by the magnitude spectrum of the measured IR.
It was important to truncate the low-frequency compensation so that the dynamic range of the SARA earphone was
not reduced too much. To this end a particular low-valued
frequency bin lb was chosen and the values of the first
(lb − 1) bins were set to the same value as this bin. For the
ER4P-EP1 earphones it was not known what the optimum
value of lb would be, so four compensation filters were
created using lb values of 4, 8, 10, and 15, corresponding
to 281.25, 656.25, 843.75, and 1312.5 Hz, respectively.
Since the ER4P-EP2 earphones do not attenuate low
frequencies, lb = 2 was used, corresponding to 93.75 Hz,
since the first frequency bin corresponds to 0 Hz. Also,
no compensation was performed for frequency bins
with indices greater than hb = 187, corresponding to
17.438 kHz. The values of bins (hb + 1) to 256 were
set to the same value as this bin. Compensation filters
were created as minimum-phase finite-impulse-response
(FIR) filters, using the inverse magnitude frequency
spectrum.
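The construction just described might be sketched as follows (our illustration; the 256-tap filter length and the real-cepstrum minimum-phase step are assumptions, the latter being one standard way to obtain a minimum-phase FIR from a magnitude specification).

    import numpy as np

    def minimum_phase_fir(mag, n_taps=256):
        """Minimum-phase FIR whose magnitude approximates `mag` (one-sided,
        nfft/2 + 1 bins), built with the real-cepstrum (homomorphic) method."""
        full = np.concatenate([mag, mag[-2:0:-1]])       # two-sided spectrum
        cep = np.fft.ifft(np.log(np.maximum(full, 1e-8))).real
        n = len(cep)
        folded = np.zeros(n)
        folded[0] = cep[0]
        folded[1:n // 2] = 2.0 * cep[1:n // 2]           # fold the cepstrum to make it causal
        folded[n // 2] = cep[n // 2]
        h = np.fft.ifft(np.exp(np.fft.fft(folded))).real
        return h[:n_taps]

    def compensation_filter(ref_mag, earphone_mag, lb, hb=187, n_taps=256):
        """Inverse-magnitude compensation filter; lb and hb are 1-indexed bins
        (bin 1 = 0 Hz), e.g., lb = 10 (843.75 Hz) and hb = 187 (~17.438 kHz)."""
        inv = ref_mag / earphone_mag
        inv[:lb - 1] = inv[lb - 1]        # flatten the compensation below bin lb
        inv[hb:] = inv[hb - 1]            # no compensation above bin hb
        return minimum_phase_fir(inv, n_taps)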
Fig. 2. Magnitude frequency spectra of earphones.
The IR of each system comprising a SARA earphone
and its compensation filter was measured. A number of
repeat IR measurements were made, reseating the earphone in the ear of the mannequin before each one, to test
the robustness of the system to the small but inevitable
changes in the placement of the earpiece. For the ER4P-EP1M earphones with a compensation filter applied from 281.25 Hz to 17.438 kHz the maximum absolute difference
for a single-frequency bin in the compensated range, between any measured transfer function of the ER4P-EP1M
earphone and the reference transfer function, was 1.8 dB.
The mean absolute difference over this frequency range
was 0.5 dB. Similar results were found for the ER4P-EP1S and ER4P-EP1L earphones and for the other
compensation ranges tested. For the ER4P-EP2M earphones with a compensation filter applied from 93.75
Hz to 17.438 kHz the maximum absolute difference for a
single-frequency bin in the compensated range, between
any measured transfer function of the ER4P-EP2M earphone and the reference transfer function, was 1.5 dB.
The mean absolute difference over this frequency range
was 0.7 dB. Similar results were found for the ER4P-EP2L earphones.
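For completeness, the deviation figures quoted above can be computed from a reference and a repeat measurement roughly as follows (our sketch; variable names and the default bin range are assumptions).

    import numpy as np

    def deviation_db(ref_mag, measured_mag, lb=10, hb=187):
        """Maximum and mean absolute difference (dB) over the compensated bins
        lb..hb (1-indexed, as above)."""
        idx = slice(lb - 1, hb)
        diff_db = 20.0 * np.log10(measured_mag[idx] / ref_mag[idx])
        return np.max(np.abs(diff_db)), np.mean(np.abs(diff_db))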
2 LOCALIZATION TESTING
Two localization experiments were performed with human subjects to test 1) their ability to localize real free-field acoustic sound sources while wearing the SARA
earphones and 2) their ability to localize virtual sound
sources presented using the SARA system. The purpose
of the first experiment was to examine the influence of the
SARA earphones on free-field sound localization. To do
this, subjects were asked to complete a free-field localization task under three different experimental conditions.
The first was the control condition, in which the localization task was performed without any interference with
normal hearing (FF-CON). In the second condition the
task was performed while wearing a set of ER4P-EP1
earphones (FF-EP1), and in the third condition the task
was performed while wearing a set of ER4P-EP2 earphones (FF-EP2). For the third condition subjects were
asked to ensure that the ER4P-EP2 earphones were
making a good seal in both ears. The level of the free-field stimuli was kept constant throughout the three conditions. No sound was delivered through the earphones in
either of the two test conditions.
The purpose of the second experiment was to investigate the fidelity of VAS presentation using the SARA
earphones. Subjects were asked to complete a VAS localization task under three different experimental conditions.
The first was the control condition, in which the VAS was
presented using a set of Etymotic Research ER1 earphones (VAS-CON). In the second condition the VAS
was presented using the SARA system consisting of the
ER4P-EP1 earphones and compensation filters (VAS-EP1). In the third condition the VAS was presented using
the SARA system comprising the ER4P-EP2 earphones
and compensation filters (VAS-EP2). For this condition
the subjects were asked to ensure that the ER4P-EP2 earphones were making a good seal in both ears. For both of
these experiments subjects were allowed to choose the
earpiece size that was most comfortable for them in each
of the two test conditions, and in the second experiment
the compensation filters appropriate for the chosen size
were used. These two experiments are now described in
detail.
2.1 Methods
The same localization testing paradigm was used for
both the free-field and the VAS localization experiments.
Localization testing was conducted with the subject standing in darkness in a triple-walled anechoic chamber, with
his head at the center of the chamber (see Fig. 3). A single
trial begins with the subject aligning his head with a
calibrated start position of (0°, 0°), directed by feedback
from an array of LEDs. Once aligned, the subject presses
a response button to indicate his readiness, and a 150-ms
broad-band noise burst is presented from one of 76 random locations distributed on an imaginary sphere surrounding his head. As in [12], the duration of the noise
burst was chosen to ensure that the subject could not
move his head during stimulus presentation. For the free-field localization testing the noise burst is delivered by a
loudspeaker (Vifa D26TG-35) mounted on a robotic arm,
and for the VAS localization testing it is delivered over
earphones (see below for more details). Once the sound
has finished playing, the subject performs the localization
task by turning around and tilting his head so that his
nose points toward the perceived direction of the sound
source before pushing a response button. An inertial head-orientation tracker (InertiaCube3, manufactured by InterSense Inc.) mounted firmly on top of the subject’s head is
used to measure the subject’s head orientation and thus
provide an objective measure of the perceived sound direction. The data from a single trial comprise the target
azimuth and elevation angles, measured from the calibrated start position to the direction of the sound source
(virtual or not), and the response azimuth and elevation
angles, measured from the same reference to the direction
in which the subject’s head is pointing when the response
button is pressed. The subjects performed five sets of
76 localization trials for each experimental condition.
A validation of this localization testing paradigm is
provided in [12].
Fig. 3. Subject in an anechoic chamber, holding a response button and wearing an orientation tracker on his head. He points his nose
toward the perceived direction of the sound source. In the case of the free-field localization experiments, the sound source is a loudspeaker
mounted on the movable arm.
Five male subjects, aged 26 to 40 years, participated in
the experiments. We refer to them as S1, S2, S3, S4, and
S5. Two subjects (S4 and S5) had substantial previous
experience of auditory localization while the others were
relatively new to the testing paradigm. All subjects
reported having normal hearing.
The subjects’ HRTFs were measured in the same anechoic chamber as was used for localization testing by
means of a blocked-ear recording technique. This approach involves embedding a small recording microphone
in an earplug secured flush with the distal end of the ear
canal. The recordings were performed at 393 locations
around the sphere in the anechoic chamber with the subject’s head at the center. More details of HRTF recording
techniques can be found in [13], [14].
The free-field sound stimuli used were 150-ms bursts
of Gaussian white noise with 10-ms raised cosine onset
and offset ramps. A new stimulus was generated for
each trial. For the VAS localization experiments the
sound stimuli for a given subject and direction consisted
of freshly generated 150-ms Gaussian white noise with
15-ms raised-cosine onset and offset ramps convolved
with the subject’s measured HRIR filters for that direction. For the VAS-CON sound condition the stimuli
were delivered through the ER1 earphones with no further processing. For the VAS-EP1 and VAS-EP2 sound
conditions the stimuli were delivered through the SARA
earphones with compensation filters applied. The compensation filters used in the localization experiments
with both earpieces were all created using lb = 10,
corresponding to 843.75 Hz. Before presenting the
results of the localization experiments, we introduce
some of the data analysis and visualization techniques
used in the remainder of this paper.
2.2 Data Analyses and Visualization
This section gives a brief summary of the data analysis
and visualization techniques used to analyze and study the
results of the localization experiments. Many of the data
analysis techniques we employ use the lateral–polar coordinate system, rather than the spherical coordinate system.
In the lateral–polar coordinate system the lateral angle
indicates the angle of incidence with respect to the midsagittal plane and the polar angle indicates the angle
around the interaural axis (see Fig. 4). The range of the
lateral angle is (−90°, 90°) and that of the polar angle is [0°, 360°).
For the analysis of the polar angle data, results that
constitute COC errors are removed before the statistics
are carried out. Since COC errors may include front–back
confusions, they distort the analysis of polar angle errors,
so they are removed and analyzed separately. A COC
error is identified when the target and response lateral
angles are within 25° of each other, the polar angle error is greater than 35°, and the target lateral angle is more than 15° from the interaural axis.
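As an illustration (ours, not the authors' analysis code), the following sketch converts a direction from azimuth and elevation to lateral-polar coordinates and applies the cone-of-confusion criterion just stated; the axis conventions assumed here may differ from those used in the study.

    import numpy as np

    def to_lateral_polar(azimuth_deg, elevation_deg):
        """Convert azimuth/elevation (deg) to lateral/polar angles (deg).
        Assumed convention: x forward, y along the interaural axis (to the left),
        z up; azimuth measured from straight ahead toward +y, elevation from the
        horizontal plane."""
        az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
        y = np.cos(el) * np.sin(az)                  # component along the interaural axis
        lateral = np.degrees(np.arcsin(np.clip(y, -1.0, 1.0)))
        polar = np.degrees(np.arctan2(np.sin(el), np.cos(el) * np.cos(az))) % 360.0
        return lateral, polar

    def is_coc_error(tgt_lat, tgt_pol, rsp_lat, rsp_pol):
        """Cone-of-confusion error according to the criterion in the text."""
        polar_err = abs((rsp_pol - tgt_pol + 180.0) % 360.0 - 180.0)   # wrapped difference
        return (abs(rsp_lat - tgt_lat) <= 25.0 and
                polar_err > 35.0 and
                abs(tgt_lat) < 75.0)   # more than 15 deg from the interaural axis (at +/-90 deg)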
The overall localization performance of the subjects in
the different experimental conditions was measured using the spherical correlation coefficient (SCC, see [15]).
Its use with localization data is described in detail in
[16], but in brief it describes the global degree of correlation between the target and response locations, where
unity corresponds to perfect correlation and zero to no
correlation. Along with the SCC values we calculate the
percentage of COC errors in the localization results for
each subject and sound condition.
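For reference, one standard determinant-based estimator of the SCC for paired unit vectors (after Fisher et al. [15]) can be sketched as follows; this is our illustration, and the exact estimator used in [16] may differ in detail.

    import numpy as np

    def spherical_correlation(targets, responses):
        """Determinant-based spherical correlation for paired unit vectors.
        targets, responses: (n, 3) arrays, one unit direction vector per trial."""
        n = len(targets)
        sxy = targets.T @ responses / n
        sxx = targets.T @ targets / n
        syy = responses.T @ responses / n
        return np.linalg.det(sxy) / np.sqrt(np.linalg.det(sxx) * np.linalg.det(syy))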
To visualize the raw localization data, the distributions
of lateral and polar angle responses can be conveniently
viewed as scatter plots. In these plots the target angles
are shown on the horizontal axes and the response angles
on the vertical axes. The size of a dot indicates the
number of responses for a given target and response
angle combination. If all responses corresponded perfectly to the targets, all the dots would lie on the diagonal line (bottom left corner to top right corner of the
plot), so the extent of the spread of dots around this
diagonal line gives a good indication of the localization
accuracy for a given subject and condition. A markedly
asymmetric distribution of dots around the upward diagonal can indicate systematic errors or severe difficulty
with the localization task.
To determine whether there is a statistically significant
difference between the localization performances of two
sound conditions, we use a Kruskal–Wallis nonparametric
one-way analysis of variance (KW ANOVA). For the KW
ANOVA a critical p value of 0.05 is used, below which
the difference between two sound conditions is statistically significant. Where two conditions are statistically
significantly different, we calculate Cliff’s d [17], which
is an effect size measure suitable for nonparametric data
and provides an estimate of the difference between the
probability that a sample of the random variable X is greater than a sample of the random variable Y, and the probability that it is less than a sample of the random variable Y. It ranges from −1.0, where all samples of X are less than those of Y, to 1.0, where all samples of X are greater than those of Y.

Fig. 4. Lateral–polar coordinate system. Interaural axis—line passing through both ears, the y axis; midcoronal plane—plane dividing the sphere into front and back hemispheres, the yz plane. ∠XOB is the lateral angle of point P; ∠BDP is the polar angle of point P.
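To illustrate the statistics described in this section, a brief Python sketch (ours): scipy.stats.kruskal provides the KW ANOVA, and Cliff's d is computed directly over all sample pairs. The error arrays named here are placeholders.

    import numpy as np
    from scipy.stats import kruskal

    def cliffs_d(x, y):
        """Cliff's d: P(X > Y) - P(X < Y), estimated over all sample pairs."""
        x = np.asarray(x)[:, None]
        y = np.asarray(y)[None, :]
        return float((x > y).mean() - (x < y).mean())

    # errors_a, errors_b: e.g., absolute lateral angle errors for two sound
    # conditions (with COC-error trials removed for the polar angle analysis)
    # chi2, p = kruskal(errors_a, errors_b)
    # if p < 0.05:
    #     print("significant; Cliff's d =", cliffs_d(errors_a, errors_b))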
2.3 Localization Results
In this section we present analyses of the results of
the localization experiments. We begin with the SCC
since it can be used to gauge overall localization performance. The SCC of the localization data pooled across
subjects, for all sound conditions, was greater than 0.89,
with the exception of the FF-EP2 sound condition, for
which the SCC was 0.72 [see Fig. 5(a)]. The maximum
intersubject spread of SCC values for any sound condition is less than 0.10, also with the exception of the
FF-EP2 sound condition, where the SCC varies widely.
Preliminary investigations of the ER4P-EP2 earpieces
presaged the poor localization performance for the
FF-EP2 sound condition. Subjects reported considerable
attenuation of free-field sounds, which was sensitive to
the precise placement of the earpiece and the quality of
the seal being made with the inner wall of the ear canal.
We suspect that the reason for the large variation between subjects in localization performance for this sound
condition is due at least in part to variations in ear-canal
diameters between subjects. Smaller ear canals may give
rise to greater sensitivity to precise placement of the
earphone in the ear. If this is the case, slight asymmetries in earphone placement could disrupt ILD cues and
distort the spectral cues differently in each ear. No measurements of ear-canal sizes were made, but it was noted
that when using the ER1 earphones, subjects S1 and S2
both found the smaller sized earpieces more comfortable, whereas subjects S3, S4, and S5 all preferred to
use the larger size earpieces.
The percentage of COC errors in the localization data
pooled across subjects is greatest for the FF-EP2 sound
condition [see Fig. 5(b)]. The value is 30.6%, compared
to values of between 8.0 and 16.5% for the other sound
conditions. This indicates a disruption in spectral sound
localization cues by the ER4P-EP2 earphones and is
consistent with the subjects’ reports that the earphones
interfered significantly with normal hearing.
The distributions of lateral and polar angle responses
are shown using scatter plots. The lateral angle data were
similar across subjects, so they were pooled across subjects and are shown in Fig. 6.

Fig. 5. (a) Spherical correlation coefficients. (b) Percentages of cone-of-confusion errors.

Fig. 6. Scatter plots of lateral angle localization data. (a) Free-field sound condition. (b) VAS sound condition.

In the free-field results the
responses are generally clustered close to and symmetrically around the upward diagonal, though the spread is
greater for the FF-EP2 sound condition. The lateral angle
results for the three VAS sound conditions are similar to
one another.
The polar angle data are shown in Fig. 7 for each
subject and each condition. The polar angle scatter plots
show a large effect of the earpieces on the free-field
polar angle response accuracy. The dispersion around
the upward diagonal for the FF-EP1 sound condition is
much greater than that for the FF-CON sound condition, and that for the FF-EP2 sound condition is greater
still. For S1 there are very few dots on the upward
diagonal for the FF-EP2 sound condition; most of the
responses lie in the lower half of the front hemisphere.
This subject reported extreme localization difficulty
while wearing the ER4P-EP2 earphones. There is no
clear pattern in the polar angle scatter plots for the
VAS sound conditions.
The results shown so far indicate that the ER4P-EP1
earphones interfere to some small extent with free-field
localization and that the ER4P-EP2 earphones interfere
significantly with free-field localization, while both
systems can deliver a VAS comparable, in terms of
localization accuracy, to that produced using the ER1
earphones. We now support this point of view by presenting statistics performed on the mean absolute lateral
angle (MALA) error data and the mean absolute polar
angle (MAPA, see Section 2.2) error data (Fig. 8 and
Table 1). To begin with we note that the VAS-CON
sound condition is statistically significantly different
from the FF-CON sound condition for both MALA
and MAPA errors, but in both cases the effect size is
small. The VAS-CON sound condition represents a
benchmark of the quality of VAS that we can present
using the techniques described in Section 2.1. This
means that the best SARA delivery system that could
be expected would have a free-field localization performance equivalent to the FF-CON sound condition
and a VAS localization performance equivalent to the VAS-CON sound condition, as opposed to the aforementioned ideal case, in which both free-field and VAS localization performance would be equivalent to that for the FF-CON sound condition.

Fig. 7. Scatter plots of polar angle localization data. (a) Free-field sound condition. (b) VAS sound condition.

Therefore in the
results that follow we compare the FF-EP1 and FF-EP2
sound conditions only to the FF-CON sound condition,
and the VAS-EP1 and VAS-EP2 sound conditions only
to the VAS-CON sound condition.
The MALA error data for the free-field sound conditions, pooled across subjects, show that the error is
largest for the FF-EP2 sound condition [see Fig. 8(a)].
Based on informal reports from the subjects, we attribute the reduced lateral angle localization accuracy to
1) overall attenuation of the stimuli and 2) disruption of
ILD cues caused by slight asymmetries in earphone
placement. Sabin et al. [18] have shown that low listening levels can lead to reduced localization performance.
The MALA error data pooled across subjects show that
there is also a statistically significant difference between
the FF-EP1 and FF-CON sound conditions, but the effect size is very small. In fact, when a KW ANOVA is
performed on the data from each subject individually,
there is a statistically significant difference between
these two sound conditions for S1 only [χ² = 19.36, p < 0.0001], and the effect size is small (d = 0.18). A
slight disruption of ILD cues by the earphones may
have caused the reduced localization accuracy for this
subject.
The MALA error data for the VAS sound conditions,
pooled across subjects, show no significant effect of the
condition between the VAS-EP1 and VAS-CON sound
conditions. They do show a significant effect of the condition between the VAS-EP2 and VAS-CON sound conditions, and the average MALA error is reduced slightly
for the VAS-EP2 sound condition. However, since the
effect size is very small, we have not investigated this
further.
The MAPA error data pooled across subjects show
that the error is again largest for the FF-EP2 sound
condition [see Fig. 8(b)]. This is again consistent with
the results shown so far, which have indicated that the
ER4P-EP2 earpieces interfere significantly with ILD
and in particular with the spectral localization cues in
the free field. The same data show that there is a
statistically significant difference between the FF-EP1
sound condition and the FF-CON sound condition, but
here the effect size is small. For the VAS sound conditions the MAPA error data pooled across subjects show
that there is no statistically significant effect of the
condition between the VAS-EP1 and VAS-CON sound
conditions, or between the VAS-EP2 and VAS-CON
sound conditions.
Fig. 8. Mean absolute angle error, its 95% confidence interval, and average angle error across the subject population. (a) Lateral angle data. (b) Polar angle data.

3 DISCUSSION
Martin et al. have found that VAS localization performance can be as good as free-field localization performance, as measured both by percentage of front–back
confusions and by average localization error [19]. They
also used a blocked-ear recording technique to measure
HRTFs, and presented the VAS stimuli using circumaural
headphones calibrated in situ. This means that the VAS
stimuli in their experiment were presented with an individualized ear-canal resonance, whereas in our VAS-CON
sound condition stimuli were presented with a generic
simulation of ear-canal resonance. The small difference
in localization performance between our FF-CON and
VAS-CON sound conditions can be attributed to the presentation of virtual stimuli with nonindividualized ear-canal resonances.
Härmä et al. [3] (see also [20]) studied the externalization of the virtual sound sources using their SARA
earphones, and the extent to which subjects could differentiate between real and virtual sound sources. They proposed an augmented-reality Turing test, in which a SARA
system would pass if users could not differentiate between
real and virtual sound sources. In our study no experiments were performed to measure the quality of the externalization of the VAS sounds delivered over our SARA
earphones, but all subjects informally reported good
externalization, in particular when using the ER4P-EP1
earphones. In addition there are no occlusion effects associated with the ER4P-EP1 earphones, and subjects
reported only minor effects when using the ER4P-EP2
earphones. However, we expect that for broad-band
stimuli the ER4P-EP1 earphones would not pass the
augmented-reality Turing test because their poor low-frequency response would enable users to differentiate
easily between real and virtual sound sources. Langendijk
and Bronkhorst [21] found that real and virtual sounds
(noise bandpass filtered between 500 Hz and 16 kHz)
could not be differentiated in a direct A/B comparison
between real sounds presented by a loudspeaker and virtual sounds presented using small earphones suspended
close to the ear in a way that did not significantly interfere
with free-field localization. Indeed their method for presenting the virtual sounds might form the basis of a SARA
delivery system, but we suspect that there would be
difficulties caused by the inevitable compromise between
the loudness of the sounds that could be presented and
the extent to which they could be heard by nearby listeners. Kulkarni and Colburn [22] also conducted an augmented-reality Turing test, though like Langendijk and
Bronkhorst, their aim was not to develop a system for
delivering SARA. Again they showed that virtual sounds
were indistinguishable from real sounds using their methods, but for randomized stimulus spectra, so that timbre
could not be used for discrimination. They used “tubephones” to present the virtual sounds, which were characterized by a 40-dB per decade spectral rolloff below
2 kHz. This is comparable to our ER4P-EP1 earphones,
but with a much bulkier transducer, which is not designed
to be driven by consumer mobile devices.
Bone conduction (BC) technology has been used for
SARA delivery (see, for examples [8], [23]), but there
seems to be a lack of comprehensive studies of localization performance using BC headphones in the literature.
MacDonald et al. [7] have compared lateral localization
performance using a pair of BC headphones to that using
a pair of semiopen circumaural headphones. The stimulus
used was a train of eight 250-ms bursts of Gaussian noise
filtered for each trial by the subject’s HRTFs for a particular direction and by a filter that compensated for the
frequency response of the transducer. Eight source directions were used, evenly distributed around the head in the
horizontal plane. However, the stimulus was bandpass filtered (300 Hz to 5 kHz) because of the band-limited
frequency response of the BC headphones. This means
that only ITD and ILD cues played a significant role in
the localization. Under these conditions they found the
BC headphones to be as good as the circumaural headphones. It is not clear how the subjects discriminated
between directions in the front and back hemispheres
when there were no spectral cues. The authors describe
the accurate localization using the BC headphones as
“surprising” since the stimuli were reduced in bandwidth.
However, the environment in which the tests were conducted is not reported. If it was not anechoic, then sound
reflections in the room may have played a role, since there
is generally audible leakage from both BC headphones
and semiopen circumaural headphones. Since the experimental conditions were significantly different from those
reported here, it is difficult to compare the results in a
meaningful way. However, an unpublished study previously undertaken by our research group indicated that
the upper limit for the SCC using BC headphones was
0.85. These localization experiments were performed using the same experimental paradigm as the research
reported in this paper, so this number can be compared
directly to the highest obtained SCC values for the VAS-EP1 and VAS-EP2 sound conditions of 0.91 and 0.92,
respectively. We also consider the extremely band-limited response of current BC headphones to restrict their
usefulness.
4 CONCLUSIONS
We began by stating that an ideal system for delivering SARA would present a high-fidelity VAS without
interfering with normal hearing. Based on those criteria,
the ER4P-EP1 earphones are a very promising candidate.
The results showed that their interference with normal
free-field localization was statistically significant but very small, and that subjects could localize VAS sounds delivered using the ER4P-EP1 earphones as well as they could using the standard ER1 earphones.

Table 1. KW ANOVA results for comparisons between FF-CON and VAS-CON and other conditions for MALA and MAPA error data.*

                          Lateral                         Polar
                 χ²        p          d         χ²        p          d
Compared to FF-CON
  VAS-CON       46.43    < 0.0001    0.13       16.56    < 0.0001    0.08
  FF-EP1        14.48      0.0001    0.07       41.84    < 0.0001    0.13
  FF-EP2       107.27    < 0.0001    0.19      148.02    < 0.0001    0.26
Compared to VAS-CON
  VAS-EP1        0.23      0.63      –           0.36      0.55      –
  VAS-EP2        5.09      0.024     0.04        2.09      0.15      –

* Where the difference is statistically significant, the effect size (Cliff's d) is also given.

They are also
comfortable to wear and do not require any expertise to
use. The only disadvantage of these earphones is that
they have a poor low-frequency response, so they may
not be suitable for the presentation of certain types of
audio content. The ER4P-EP2 earphones are at least as
good for presenting VAS and they do not suffer from a
poor low-frequency response, but they do interfere significantly with normal free-field hearing. This indicates that
these earphones would be more useful for delivering
SARA in situations where the environmental sounds are
quite loud.
We envisage many possible usage scenarios for the
ER4P-EP1 and ER4P-EP2 SARA earphones. Since they
are no more bulky or heavy than the earbuds supplied
with most portable music players, and they do not require any special amplifier, they are suitable for use with
any portable audio device. They are lightweight, and
high-fidelity SARA can be discreetly presented by
driving them with a mobile computer or personal digital
assistant (PDA), loaded with appropriate software for
dynamic auralization and hardware for location and orientation tracking. Lastly these SARA systems are widely
accessible, since they can be assembled easily from unmodified commercially available components, and the
compensation filters are freely available on request from
the authors.
5 REFERENCES
[1] S. Carlile, Virtual Auditory Space: Generation and
Applications (Neuroscience Intelligence Unit, R. G.
Landes, Austin, TX, 1996).
[2] J. Huopaniemi, “Future of Personal Audio—Smart
Applications and Immersive Communication,” in Proc.
AES 30th Int. Conf. on Intelligent Audio Environments
(Saariselkä, Finland, 2007 March 15–17).
[3] A. Härmä, J. Jakka, M. Tikander, M. Karjalainen,
T. Lokki, J. Hiipakka, and G. Lorho, “Augmented Reality
Audio for Mobile and Wearable Appliances,” J. Audio
Eng. Soc., vol. 52, pp. 618–639 (2004 June).
[4] V. Riikonen, M. Tikander, and M. Karjalainen,
“An Augmented Reality Audio Mixer and Equalizer,” presented at the 124th Convention of the Audio
Engineering Society, (Abstracts) www.aes.org/events/124/124thWrapUp.pdf, (2008 May), convention paper
7372.
[5] M. S. Dean and F. N. Martin, “Insert Earphone
Depth and the Occlusion Effect,” Am. J. Audiol., vol. 9, no. 2, p. 131 (2000).
[6] M. Tikander, M. Karjalainen, and V. Riikonen, “An
Augmented Reality Audio Headset,” in Proc. Digital
Audio Effects (DAFx-08), (Espoo, Finland, 2008).
[7] J. A. MacDonald, P. P. Henry, and T. R. Letowski,
“Spatial Audio through a Bone Conduction Interface,”
Int. J. Audiol., vol. 45, pp. 595–599 (2006).
[8] J. Wilson, B. N. Walker, J. Lindsay, C. Cambias,
and F. Dellaert, “SWAN: System for Wearable Audio
Navigation,” in Proc. 11th IEEE Int. Symp. on Wearable
Computers (2007), pp. 91–98.
[9] D. Sun, “Bone-Conduction Spatial Audio,” B. E.
dissertation, Sydney University, Sydney, Australia (2005).
[10] J. Blauert, Spatial Hearing: The Psychophysics
of Human Sound Localization, rev. ed. (MIT Press, Cambridge, MA, 1997).
[11] A. Farina, “Simultaneous Measurement of Impulse
Response and Distortion with a Swept-Sine Technique,”
presented at the 108th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 48, p. 350
(2000 Apr.), preprint 5093.
[12] S. Carlile, P. Leong, and S. Hyams, “The Nature
and Distribution of Errors in Sound Localization by Human Listeners,” Hear. Research, vol. 114, pp. 179–196
(1997).
[13] H. Møller, “Fundamentals of Binaural Technology,” Appl. Acoust., vol. 36, pp. 171–218 (1992).
[14] H. Møller, M. F. Sorensen, D. Hammershoi, and
C. B. Jensen, “Head-Related Transfer Functions of Human Subjects,” J. Audio Eng. Soc., vol. 43, pp. 300–321
(1995 May).
[15] N. I. Fisher, T. Lewis, and B. J. J. Embleton, Statistical Analysis of Spherical Data (Cambridge University Press, Cambridge, UK, 1987).
[16] P. Leong and S. Carlile, “Methods for Spherical
Data Analysis and Visualization,” J. Neurosci. Meth., vol.
80, pp. 191–200 (1998).
[17] N. Cliff, “Dominance Statistics—Ordinal Analyses to Answer Ordinal Questions,” Psychol. Bull.,
vol. 114, pp. 494–509 (1993).
[18] A. Sabin, E. Macpherson, and J. Middlebrooks,
“Human Sound Localization at Near-Threshold Listening Levels,” Hear. Research, vol. 199, pp. 124–134
(2005).
[19] R. L. Martin, K. I. McAnally, and M. A. Senova,
“Free-Field Equivalent Localization of Virtual Audio,”
J. Audio Eng. Soc., vol. 49, pp. 14–22 (2001 Jan./
Feb.).
[20] A. Härmä, J. Jakka, M. Tikander, M. Karjalainen,
T. Lokki, H. Nironen, and S. Vesa, “Techniques and
Applications of Wearable Augmented Reality Audio,”
presented at the 114th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 51, p. 419
(2003 May), convention paper 5768.
[21] E. H. A. Langendijk and A. W. Bronkhorst, “Fidelity of Three-Dimensional-Sound Reproduction Using a
Virtual Auditory Display,” J. Acoust. Soc. Am., vol. 107,
pp. 528–537 (2000).
[22] A. Kulkarni and H. S. Colburn, “Role of Spectral
Detail in Sound-Source Localization,” Nature, vol. 396,
pp. 747–749 (1998, Dec.).
[23] A. Väljamäe, A. Tajadura-Jiménez, P. Larsson, D. Västfjäll, and M. Kleiner, “Binaural Bone-Conducted
Sound in Virtual Environments: Evaluation of a Portable,
Multimodal Motion Simulator Prototype,” Acoust. Sci.
and Technol., vol. 29, pp. 149–155 (2008).
THE AUTHORS
A. Martin
C. Jin
Aengus Martin was born in Ireland in 1979. He
received a B.A. degree in computational physics, an
M.Sc. degree in physics, and an M.Phil. degree in music
and media technology from Trinity College, Dublin, in
2001, 2003, and 2005, respectively.
Since 2005 he has been a research assistant in the
Computing and Audio Research Laboratory at the University of Sydney, Sydney, Australia, where he began a
Ph.D. program in 2007. His main research activities involve interactive sound synthesis and spatial audio.
Craig Jin received an M.S. degree in applied physics
from the California Institute of Technology, Pasadena,
CA, in 1991 and a Ph.D. degree in electrical engineering
from the University of Sydney, Sydney, Australia, in
2001.
He is a senior lecturer in the School of Electrical and
Information Engineering at the University of Sydney and
also a Queen Elizabeth II Fellow. He is the director of the
Computing and Audio Research Laboratory at the University of Sydney and a cofounder of three startup companies: VAST Audio Pty Ltd, Personal Audio Pty Ltd,
and Heard Systems Pty Ltd. His research focuses on spatial audio and neuromorphic engineering.
Dr. Jin is the author or coauthor of more than 70 journal or conference papers in these areas and he holds six
patents. He has received recognition in Australia for his
invention of a spatial hearing aid. He is a member of the
A. van Schaik
Audio Engineering Society, the Acoustical Society of
America, and the Institute of Electrical and Electronics
Engineers.
André van Schaik received an M.Sc. degree in electrical engineering from the University of Twente, Enschede,
The Netherlands, in 1990, and a Ph.D. degree in electrical
engineering from the Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland, in 1998.
He is a reader in electrical engineering in the School of
Electrical and Information Engineering, University of
Sydney, Sydney, Australia, and an Australian Research
Council Queen Elizabeth II Research Fellow. His research focuses on three main areas: neuromorphic engineering, bioelectronics, and spatial audio.
Dr. van Schaik has authored or coauthored more than
100 papers in these research areas and is the holder of
more than 30 patents. He is the director of the Computing and Audio Research Laboratory at the University of
Sydney and a cofounder of three start-up companies. He
is a member of the EPSRC College and a board member of the Institute of Neuromorphic Engineering. He is
a member of the Analog, BioCAS, and Neural Network
Technical Committees of the IEEE Circuits and Systems Society and a past chair of its Sensory Systems
Technical Committee. He is an associate editor for the
IEEE Transactions on Circuits and Systems—I: Regular
Papers.