Psychoacoustic Evaluation of Systems for Delivering Spatialized Augmented-Reality Audio*

AENGUS MARTIN, CRAIG JIN, AES Member, AND ANDRÉ VAN SCHAIK
Computing and Audio Research Laboratory, University of Sydney, Sydney, Australia
*Manuscript received 2009 March 31; revised 2009 October 28.

Two new lightweight systems for delivering spatialized, augmented-reality audio (SARA) are presented. Each comprises a set of earphone drivers coupled with "acoustically transparent" earpieces and a digital filter. Using the first system, subjects were able to localize virtual auditory space (VAS) stimuli with the same accuracy as when using earphones that are standard for the presentation of VAS, while free-field localization performance was reduced only slightly. The only disadvantage of this system is its poor low-frequency response. VAS localization performance using the second system is also as good as that with standard VAS presentation earphones, though free-field localization performance is degraded to a greater extent. This system has good low-frequency response, however, so its range of uses complements that of the first SARA system. Both systems are light and easily constructed from unmodified, commercially available products. They require little digital signal processing overhead and no special preamplifier, so they are ideally suited to mobile applications.

0 INTRODUCTION

Spatialized augmented-reality audio (SARA) can be defined as the superposition of virtual sound sources on the real acoustic environment, where the sounds from virtual sources are processed so that the listener perceives them to be coming from particular directions in space. SARA therefore involves the superposition of a virtual auditory space (VAS, see [1]) on a real acoustic space. Applications of SARA include assistive systems for vision-impaired users, guidance systems, teleconferencing, attention-focusing systems, and audio-only electronic games. It has been predicted that SARA will play an increasing role in the diversification of audio user interfaces, particularly in the context of mobile devices [2].

An ideal system for delivering SARA would render a VAS in which the virtual sound sources are indistinguishable from real acoustic ones, without impairing the user's normal free-field hearing. In this paper we present a novel technique for delivering SARA in which we use earphones comprising "acoustically transparent" earpieces coupled with earphone drivers. Our primary contribution is to enable readers to quickly and easily construct a cost-effective, lightweight SARA delivery system for experimentation or practical use, with clear and experimentally verified expectations of its performance. The system hardware is easily assembled from commercially available products, and the filters required for use (see Section 1) are freely available on request from the authors.

To begin, we give a brief overview of the approaches taken in previously published SARA systems. Many of these involve a device with some earphone-microphone combination. Härmä et al. [3] give a comprehensive review of such systems and describe one of their own design. Their system is binaural, with an earpiece worn in each ear; each earpiece combines an earphone and a microphone.
The real acoustic environment is picked up by the microphone and reproduced directly via the earphone, and VAS sounds can be mixed with the microphone signal for SARA. This system has since been improved by using analog electronics to pass the signal from the real acoustic environment to the earphones, thereby eliminating problems associated with the delay between the real acoustic environment being received by the microphones and its transmission to the earphones [4]. However, there remains the problem known as the occlusion effect [5], which refers to the amplification of low-frequency sounds within the head when the ear is blocked and which has a number of unpleasant manifestations. In their report on a usability study of a SARA headset that blocked the ear canals, Tikander et al. [6] state that "eating and drinking was reported to be one of the most irritating situations with the headset." Another technique for delivering SARA is to use bone-conduction headphones [7], [8]. The advantage of these is that they can be worn without affecting normal hearing at all; however, it is difficult to present well-spatialized virtual audio sources using this technique [9].

Our SARA system uses open earpieces, so the occlusion effect does not arise, and no microphone is required to transmit the real acoustic environment. In this paper we first describe the system in detail. We then characterize the acoustic properties of the earphones in order to calibrate the system to present a VAS of the highest possible fidelity. We then describe a set of psychoacoustic experiments which examine subjects' ability to localize real, free-field acoustic sounds with the earphones in place, as well as their ability to localize virtual sound sources presented using the earpieces, before discussing the performance and usage scenarios of the SARA delivery system.

1 SPATIAL AUGMENTED-REALITY AUDIO SYSTEM

In this section we first describe our SARA earphones and give a brief overview of some topics in human spatial hearing which motivate much of the remainder of this paper. We then describe the acoustic characterization of these earphones.

Our new earphones consist of a set of Etymōtic Research ER4P MicroPro earphones (ER4P) with the supplied ear tips replaced by acoustically transparent earpieces manufactured by Surefire, LLC (see Fig. 1). The ER4Ps are marketed as reference-quality earphones suitable for use with portable devices without the need for an additional amplifier. The manufacturer cites a magnitude frequency response of 20 Hz to 16 kHz ±4 dB using the supplied ear tips (which are not used here), a 1-kHz sensitivity of 108 dB SPL for a 0.2-V input, and a nominal impedance of 27 Ω. The acoustically transparent earpieces are designed for discreet monitoring of radio communications while allowing ambient sounds and conversation to be heard. Two SARA earphones were investigated, one with the Surefire CommEar™ Comfort EP1 earpieces (EP1) and the other with the Surefire CommEar™ Boost EP2 earpieces (EP2). The EP2s differ from the EP1s in that they intrude further into the ear canal and have a flange on the end, which makes them less acoustically transparent (refer to Fig. 1).

Fig. 1. SARA earphones. (a) Etymōtic Research ER4P MicroPro earphone drivers with the supplied ear tips replaced by acoustically transparent earpieces. (b) Surefire CommEar™ Comfort EP1 earpiece. (c) Surefire CommEar™ Boost EP2 earpiece.
Both earpieces are designed to fit snugly within the conchal cavity of the ear and are made from a resilient polymer. The left and right earpieces are mirror symmetric and are available in three sizes (small (S), medium (M), and large (L)) for the EP1s and two sizes (medium and large) for the EP2s. No modifications were required to fit them to the ER4P earphone drivers. We will refer to the SARA earphones comprising the ER4P drivers and the EP1 earpieces as ER4P-EP1X, and to those comprising the ER4P drivers and the EP2 earpieces as ER4P-EP2X, where X indicates the size (S, M, or L) of the earpiece used. Where X is omitted, the statement applies to the earphones with all earpiece sizes.

We now briefly introduce some topics in human spatial hearing, as they underlie much of the discussion that follows (for a thorough treatment, see [10]). In our terminology, to "localize" a sound means to identify the direction in space from which the sound arrives, but not the distance to the source. Humans localize sound sources using a number of spatial auditory cues. First, when a sound radiates from a source, it reaches the two ears by different paths. The path lengths vary with direction and give rise to an interaural time difference (ITD) cue. In addition, a sound coming from a direction to one side of the head propagates directly to the ear on that side but is attenuated somewhat by the head on its path to the other ear. This is referred to as head shadowing and gives rise to an interaural level difference (ILD) cue, which has a complex dependence on frequency. Finally, the head, the upper body, and the pinna together form an acoustic filter which is highly personalized and direction dependent, and which provides the "spectral" sound localization cues.

The ITD and ILD are binaural cues, since they originate as a difference between the sounds arriving at the two ears, and they relate to localization in the horizontal plane, whereas the spectral cues are monaural, since they exist even if sound reaches only one ear. Spectral cues are primarily used to resolve cone-of-confusion (COC) errors, which can arise when sound sources are located at different points on the surface of an imaginary cone whose axis of symmetry is the interaural axis (the line passing through both ears; see Section 2.2). Sounds from such sources have roughly constant ITDs and low-frequency ILDs, so their directions are disambiguated primarily using monaural spectral cues. With regard to our earphones, we note that the ILD cue can be distorted if the sound arriving at one ear is obstructed, and that spectral localization cues are easily disrupted by any change in the effective shape of the pinna. This disruption can lead to an increase in the number of COC errors.

The preceding discussion relates to the localization of free-field acoustic sound sources. However, if a sound is delivered to the ears by transducers inserted into the ear canals, the cues arising in natural binaural hearing are not available unless they are introduced artificially. To make the sound appear to come from a particular direction in space, the signal sent to each transducer must be filtered electronically to mimic the acoustic phenomena that give rise to the auditory localization cues.
The head-related impulse responses (HRIRs) describe such filters; there is a separate HRIR for each ear and each direction in space. The frequency-domain representation of an HRIR is referred to as a head-related transfer function (HRTF) and is obtained by taking the discrete Fourier transform of the HRIR. HRIRs can be measured by placing small microphones in the ear canals and recording sound stimuli arriving from different directions in space (see Section 2.1 for more details). As such, the HRIRs include the ITD, ILD, and spectral localization cues. After filtering a sound with the appropriate HRIR, it is often necessary to apply an additional filter to compensate for the acoustic characteristics of the transducer used to deliver the audio.

1.1 Characterization of the Earphones

In this section we describe the creation and verification of compensation filters for the SARA earphones in order to improve their VAS presentation (again, see [10]). Compensation filters were designed with respect to the "reference" magnitude frequency response of an Etymōtic Research ER1 earphone (ER1). This model was selected because ER1s are standard earphones for presenting VAS stimuli, and they are designed to have a constant-gain frequency response at the eardrum, apart from a simulated ear-canal resonance. The impulse response (IR) of an ER1 was measured using a log-sine sweep signal of 10-s duration covering 50 Hz to 22 kHz (for details, see [11]) and a Brüel and Kjær head and torso simulator (HATS 4128C) mannequin. This mannequin has an ear simulator comprising a removable silicone-rubber pinna joined to an ear canal and a Zwislocki coupler. The ER1 was fitted to the mannequin so that the outer surface of the foam ear tip was flush with the entrance to the ear canal. An RME Multiface sound card was used to drive the ER1 and to record the signal from the mannequin at a sample rate of 48 kHz. The magnitude of the transfer function of the ER1 earphone was obtained from a 512-point discrete Fourier transform of the measured IR. It was then smoothed by applying a linear one-sample-leading, five-sample-lagging moving-average filter, and the smoothed magnitude frequency response was used as the reference magnitude frequency spectrum.
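To make this smoothing step concrete, the following Python sketch computes a 512-point magnitude spectrum from a measured impulse response and applies a one-sample-leading, five-sample-lagging moving average, as described above. It is a minimal illustration rather than the authors' code; the function and variable names, and the clamping of the averaging window at the spectrum edges, are our own assumptions.

```python
import numpy as np

def reference_magnitude_spectrum(ir, n_fft=512, lead=1, lag=5):
    """Smoothed magnitude spectrum used as the reference (illustrative sketch).

    ir    : measured impulse response (1-D array of 48-kHz samples)
    n_fft : DFT length (512 points, as in the paper)
    lead  : number of leading (higher-frequency) bins in the average
    lag   : number of lagging (lower-frequency) bins in the average
    """
    mag = np.abs(np.fft.rfft(ir, n_fft))      # bins 0 .. n_fft/2 (0 Hz .. Nyquist)
    smoothed = np.empty_like(mag)
    for k in range(len(mag)):
        lo = max(0, k - lag)                  # clamp the window at the spectrum edges
        hi = min(len(mag), k + lead + 1)
        smoothed[k] = mag[lo:hi].mean()
    return smoothed

# Example usage with a hypothetical file name:
# ir_er1 = np.loadtxt("er1_ir.txt")
# ref_mag = reference_magnitude_spectrum(ir_er1)
```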
After the reference magnitude spectrum had been measured, the transfer functions of the SARA earphones were measured: the ER4P-EP1 earphone with each of the three sizes of the EP1 earpiece, and the ER4P-EP2 earphone with each of the two sizes of the EP2 earpiece. The IRs of the SARA earphones were measured using the same method as described for the ER1 earphones. We note that preliminary investigations of the ER4P-EP2 earphones revealed that both normal free-field hearing and VAS presentation were strongly affected by the quality of the seal made between the flange on the earpiece and the inner wall of the ear canal. For this reason special care was taken to ensure a good seal during the measurements.

The magnitude frequency spectra of the IRs of the ER4P-EP1 and ER4P-EP2 earphones deviate significantly from each other and from the reference transfer function (see Fig. 2). Below 1594 Hz the transfer function of the ER4P-EP1 earphone shows a rolloff of approximately 15 dB per octave, whereas that of the ER4P-EP2 earphone, which seals the ear canal with a thin flange, shows no low-frequency rolloff. Both the ER4P-EP1 and the ER4P-EP2 transfer functions show spectral peaks and troughs above 5 kHz which are not present in the reference transfer function. These measurements were made using the medium-size EP1 and EP2 earpieces. Measurements of the earphones with the other earpiece sizes showed the same features and are omitted from the figure for clarity.

Fig. 2. Magnitude frequency spectra of the earphones.

These magnitude frequency spectra were used to create filters to compensate for the acoustic properties of the earphones. A compensation filter was computed for each of the SARA earphones as follows. An inverse magnitude spectrum was computed by dividing the reference magnitude spectrum by the magnitude spectrum of the measured IR. It was important to truncate the low-frequency compensation so that the dynamic range of the SARA earphone was not reduced too much. To this end a low-frequency bin lb was chosen and the values of the first (lb − 1) bins were set to the value of this bin. For the ER4P-EP1 earphones it was not known what the optimum value of lb would be, so four compensation filters were created using lb values of 4, 8, 10, and 15, corresponding to 281.25, 656.25, 843.75, and 1312.5 Hz, respectively. Since the ER4P-EP2 earphones do not attenuate low frequencies, lb = 2 was used, corresponding to 93.75 Hz (the first frequency bin corresponds to 0 Hz). No compensation was performed for frequency bins with indices greater than hb = 187, corresponding to 17.438 kHz; the values of bins (hb + 1) to 256 were set to the value of bin hb. Compensation filters were created as minimum-phase finite-impulse-response (FIR) filters, using the inverse magnitude frequency spectrum.

The IR of each system comprising a SARA earphone and its compensation filter was measured. A number of repeat IR measurements were made, reseating the earphone in the ear of the mannequin before each one, to test the robustness of the system to the small but inevitable changes in the placement of the earpiece. For the ER4P-EP1M earphones with a compensation filter applied from 281.25 Hz to 17.438 kHz, the maximum absolute difference in any single frequency bin in the compensated range, between any measured transfer function of the ER4P-EP1M earphone and the reference transfer function, was 1.8 dB; the mean absolute difference over this frequency range was 0.5 dB. Similar results were found for the ER4P-EP1S and ER4P-EP1L earphones and for the other compensation ranges tested. For the ER4P-EP2M earphones with a compensation filter applied from 93.75 Hz to 17.438 kHz, the corresponding maximum absolute difference was 1.5 dB and the mean absolute difference was 0.7 dB. Similar results were found for the ER4P-EP2L earphones.
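The filter-design procedure described above can be sketched in a few lines of Python. The sketch forms the inverse magnitude spectrum, flattens it below bin lb and above bin hb, and derives a minimum-phase FIR filter. The paper does not state how the minimum-phase filters were constructed; the real-cepstrum method used here is a standard choice, and the filter length and variable names are our own assumptions.

```python
import numpy as np

def design_compensation_filter(ref_mag, earphone_mag, lb, hb, n_taps=256):
    """Minimum-phase FIR compensation filter (sketch of the Sec. 1.1 procedure).

    ref_mag, earphone_mag : smoothed magnitude spectra on bins 0..N/2 (N = 512 here)
    lb, hb                : first and last compensated bins (1-indexed, as in the text)
    n_taps                : length of the returned FIR filter (our choice, not stated
                            in the paper)
    """
    inv = ref_mag / np.maximum(earphone_mag, 1e-8)   # inverse (compensation) magnitude
    inv[: lb - 1] = inv[lb - 1]                      # flatten below the low-frequency limit
    inv[hb:] = inv[hb - 1]                           # flatten above the high-frequency limit

    # Build the full, conjugate-symmetric magnitude grid for a 512-point DFT.
    full = np.concatenate([inv, inv[-2:0:-1]])

    # Minimum-phase construction via the real cepstrum (a standard method; the paper
    # does not specify which construction was used).
    n = len(full)
    cep = np.fft.ifft(np.log(np.maximum(full, 1e-8))).real
    fold = np.zeros(n)
    fold[0] = 1.0
    fold[1 : n // 2] = 2.0
    fold[n // 2] = 1.0
    h = np.fft.ifft(np.exp(np.fft.fft(cep * fold))).real
    return h[:n_taps]

# Example with hypothetical spectra, using lb = 10 and hb = 187 as for the ER4P-EP1 filters:
# comp = design_compensation_filter(ref_mag, ep1_mag, lb=10, hb=187)
```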
2 LOCALIZATION TESTING

Two localization experiments were performed with human subjects to test 1) their ability to localize real free-field acoustic sound sources while wearing the SARA earphones and 2) their ability to localize virtual sound sources presented using the SARA system.

The purpose of the first experiment was to examine the influence of the SARA earphones on free-field sound localization. To do this, subjects were asked to complete a free-field localization task under three experimental conditions. The first was the control condition, in which the localization task was performed without any interference with normal hearing (FF-CON). In the second condition the task was performed while wearing a set of ER4P-EP1 earphones (FF-EP1), and in the third condition the task was performed while wearing a set of ER4P-EP2 earphones (FF-EP2). For the third condition subjects were asked to ensure that the ER4P-EP2 earphones were making a good seal in both ears. The level of the free-field stimuli was kept constant throughout the three conditions, and no sound was delivered through the earphones in either of the two test conditions.

The purpose of the second experiment was to investigate the fidelity of VAS presentation using the SARA earphones. Subjects were asked to complete a VAS localization task under three experimental conditions. The first was the control condition, in which the VAS was presented using a set of Etymōtic Research ER1 earphones (VAS-CON). In the second condition the VAS was presented using the SARA system consisting of the ER4P-EP1 earphones and compensation filters (VAS-EP1). In the third condition the VAS was presented using the SARA system comprising the ER4P-EP2 earphones and compensation filters (VAS-EP2). For this condition the subjects were again asked to ensure that the ER4P-EP2 earphones were making a good seal in both ears. For both experiments subjects were allowed to choose the earpiece size most comfortable for them in each of the two test conditions, and in the second experiment the compensation filters appropriate for the chosen size were used. These two experiments are now described in detail.

2.1 Methods

The same localization testing paradigm was used for both the free-field and the VAS localization experiments. Localization testing was conducted with the subject standing in darkness in a triple-walled anechoic chamber, with his head at the center of the chamber (see Fig. 3). A single trial begins with the subject aligning his head with a calibrated start position of (0°, 0°), directed by feedback from an array of LEDs. Once aligned, the subject presses a response button to indicate his readiness, and a 150-ms broad-band noise burst is presented from one of 76 random locations distributed on an imaginary sphere surrounding his head. As in [12], the duration of the noise burst was chosen to ensure that the subject could not move his head during stimulus presentation. For the free-field localization testing the noise burst is delivered by a loudspeaker (Vifa D26TG-35) mounted on a robotic arm, and for the VAS localization testing it is delivered over earphones (see below for more details). Once the sound has finished playing, the subject performs the localization task by turning and tilting his head so that his nose points toward the perceived direction of the sound source, and then pushing a response button. An inertial head-orientation tracker (InertiaCube3, manufactured by InterSense Inc.) mounted firmly on top of the subject's head is used to measure the subject's head orientation and thus provide an objective measure of the perceived sound direction.
The data from a single trial comprise the target azimuth and elevation angles, measured from the calibrated start position to the direction of the sound source (virtual or real), and the response azimuth and elevation angles, measured from the same reference to the direction in which the subject's head points when the response button is pressed. The subjects performed five sets of 76 localization trials for each experimental condition. A validation of this localization testing paradigm is provided in [12].

Fig. 3. Subject in the anechoic chamber, holding a response button and wearing an orientation tracker on his head. He points his nose toward the perceived direction of the sound source. In the free-field localization experiments the sound source is a loudspeaker mounted on the movable arm.

Five male subjects, aged 26 to 40 years, participated in the experiments. We refer to them as S1, S2, S3, S4, and S5. Two subjects (S4 and S5) had substantial previous experience of auditory localization testing, while the others were relatively new to the paradigm. All subjects reported having normal hearing. The subjects' HRTFs were measured in the same anechoic chamber as was used for localization testing, by means of a blocked-ear recording technique. This approach involves embedding a small recording microphone in an earplug secured flush with the distal end of the ear canal. The recordings were made at 393 locations on the sphere surrounding the subject's head. More details of HRTF recording techniques can be found in [13], [14].

The free-field sound stimuli were 150-ms bursts of Gaussian white noise with 10-ms raised-cosine onset and offset ramps. A new stimulus was generated for each trial. For the VAS localization experiments the sound stimulus for a given subject and direction consisted of freshly generated 150-ms Gaussian white noise with 15-ms raised-cosine onset and offset ramps, convolved with the subject's measured HRIR filters for that direction. For the VAS-CON sound condition the stimuli were delivered through the ER1 earphones with no further processing. For the VAS-EP1 and VAS-EP2 sound conditions the stimuli were delivered through the SARA earphones with compensation filters applied. The compensation filters used in the localization experiments with both earpieces were all created using lb = 10, corresponding to 843.75 Hz.
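As an illustration of this signal chain, the sketch below generates one binaural VAS stimulus: a 150-ms Gaussian noise burst with raised-cosine ramps, convolved with a left/right HRIR pair and, for the SARA conditions, with the earphone compensation filter. Only the sample rate, stimulus parameters, and processing order are taken from the description above; the function names and array handling are our own assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

FS = 48_000  # sample rate used throughout the paper

def raised_cosine_burst(duration_s=0.150, ramp_s=0.015, fs=FS, rng=None):
    """Gaussian white-noise burst with raised-cosine onset and offset ramps."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(duration_s * fs)
    burst = rng.standard_normal(n)
    n_ramp = int(ramp_s * fs)
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    burst[:n_ramp] *= ramp            # fade in
    burst[-n_ramp:] *= ramp[::-1]     # fade out
    return burst

def vas_stimulus(hrir_left, hrir_right, comp_filter=None):
    """Binaural VAS stimulus for one direction: noise -> HRIR -> compensation."""
    burst = raised_cosine_burst()
    left = fftconvolve(burst, hrir_left)
    right = fftconvolve(burst, hrir_right)
    if comp_filter is not None:       # SARA earphone conditions (VAS-EP1 / VAS-EP2)
        left = fftconvolve(left, comp_filter)
        right = fftconvolve(right, comp_filter)
    return np.stack([left, right], axis=-1)   # (samples, 2) array for playback
```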
Before presenting the results of the localization experiments, we introduce the data analysis and visualization techniques used in the remainder of this paper.

2.2 Data Analyses and Visualization

This section gives a brief summary of the techniques used to analyze and visualize the results of the localization experiments. Many of the analyses use the lateral–polar coordinate system rather than the spherical coordinate system. In the lateral–polar coordinate system the lateral angle indicates the angle of incidence with respect to the midsagittal plane, and the polar angle indicates the angle around the interaural axis (see Fig. 4). The range of the lateral angle is (−90°, 90°) and that of the polar angle is [0°, 360°).

Fig. 4. Lateral–polar coordinate system. Interaural axis: the line passing through both ears (y axis); midcoronal plane: the plane dividing the sphere into front and back hemispheres (yz plane). ∠XOB is the lateral angle of point P; ∠BDP is the polar angle of point P.

For the analysis of the polar angle data, results that constitute COC errors are removed before the statistics are computed. Since COC errors include front–back confusions, they distort the analysis of polar angle errors, so they are removed and analyzed separately. A COC error is identified when the target and response lateral angles are within 25° of each other, the polar angle error is greater than 35°, and the target lateral angle is more than 15° from the interaural axis.

The overall localization performance of the subjects in the different experimental conditions was measured using the spherical correlation coefficient (SCC, see [15]). Its use with localization data is described in detail in [16]; in brief, it describes the global degree of correlation between the target and response locations, where unity corresponds to perfect correlation and zero to no correlation. Along with the SCC values we calculate the percentage of COC errors in the localization results for each subject and sound condition.

To visualize the raw localization data, the distributions of lateral and polar angle responses are conveniently viewed as scatter plots. In these plots the target angles are shown on the horizontal axes and the response angles on the vertical axes, and the size of a dot indicates the number of responses for a given combination of target and response angle. If all responses corresponded perfectly to the targets, all dots would lie on the upward diagonal (from the bottom left corner to the top right corner of the plot), so the spread of dots around this diagonal gives a good indication of the localization accuracy for a given subject and condition. A markedly asymmetric distribution of dots around the upward diagonal can indicate systematic errors or severe difficulty with the localization task.

To determine whether there is a statistically significant difference between the localization performances in two sound conditions, we use a Kruskal–Wallis nonparametric one-way analysis of variance (KW ANOVA) with a critical p value of 0.05, below which the difference between two sound conditions is considered statistically significant. Where two conditions are statistically significantly different, we calculate Cliff's d [17], an effect-size measure suitable for nonparametric data. It estimates the difference between the probability that a sample of the random variable X is greater than a sample of the random variable Y and the probability that it is less. It ranges from −1.0, where all samples of X are less than those of Y, to 1.0, where all samples of X are greater than those of Y.
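For readers who wish to apply the same analysis to their own data, the sketch below converts azimuth/elevation angles to lateral–polar coordinates and applies the COC-error criteria given above. The azimuth and elevation sign conventions assumed in the conversion are our own choice, since the paper does not state them; the threshold values come directly from the text.

```python
import numpy as np

def to_lateral_polar(az_deg, el_deg):
    """Convert azimuth/elevation (degrees) to lateral/polar angles (degrees).

    Convention assumed here: azimuth measured from straight ahead (positive to
    the left), elevation positive upward. These conventions are illustrative.
    """
    az, el = np.radians(az_deg), np.radians(el_deg)
    x = np.cos(el) * np.cos(az)      # front component
    y = np.cos(el) * np.sin(az)      # interaural (left) component
    z = np.sin(el)                   # up component
    lateral = np.degrees(np.arcsin(np.clip(y, -1.0, 1.0)))
    polar = np.degrees(np.arctan2(z, x)) % 360.0   # [0, 360): 0 front, 90 above, 180 behind
    return lateral, polar

def is_coc_error(t_lat, t_pol, r_lat, r_pol):
    """Cone-of-confusion error test for one trial, using the Section 2.2 criteria."""
    polar_err = abs(t_pol - r_pol)
    polar_err = min(polar_err, 360.0 - polar_err)        # circular difference
    return (abs(t_lat - r_lat) <= 25.0                   # similar lateral angles
            and polar_err > 35.0                         # large polar-angle error
            and abs(t_lat) < 75.0)                       # target > 15 deg from interaural axis
```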
2.3 Localization Results

In this section we present analyses of the results of the localization experiments. We begin with the SCC, since it can be used to gauge overall localization performance. The SCC of the localization data pooled across subjects was greater than 0.89 for all sound conditions, with the exception of the FF-EP2 sound condition, for which the SCC was 0.72 [see Fig. 5(a)]. The maximum intersubject spread of SCC values for any sound condition is less than 0.10, again with the exception of the FF-EP2 sound condition, where the SCC varies widely. Preliminary investigations of the ER4P-EP2 earpieces presaged the poor localization performance for the FF-EP2 sound condition. Subjects reported considerable attenuation of free-field sounds, which was sensitive to the precise placement of the earpiece and the quality of the seal made with the inner wall of the ear canal. We suspect that the large variation between subjects in localization performance for this sound condition is due, at least in part, to variations in ear-canal diameter between subjects. Smaller ear canals may give rise to greater sensitivity to the precise placement of the earphone in the ear. If this is the case, slight asymmetries in earphone placement could disrupt ILD cues and distort the spectral cues differently in each ear. No measurements of ear-canal size were made, but it was noted that when using the ER1 earphones, subjects S1 and S2 both found the smaller-sized earpieces more comfortable, whereas subjects S3, S4, and S5 all preferred the larger size.

The percentage of COC errors in the localization data pooled across subjects is greatest for the FF-EP2 sound condition [see Fig. 5(b)]: 30.6%, compared with values of between 8.0 and 16.5% for the other sound conditions. This indicates a disruption of spectral sound localization cues by the ER4P-EP2 earphones and is consistent with the subjects' reports that the earphones interfered significantly with normal hearing.

Fig. 5. (a) Spherical correlation coefficients. (b) Percentages of cone-of-confusion errors.

The distributions of lateral and polar angle responses are shown using scatter plots. The lateral angle data were similar across subjects, so they were pooled and are shown in Fig. 6. In the free-field results the responses are generally clustered close to and symmetrically around the upward diagonal, though the spread is greater for the FF-EP2 sound condition. The lateral angle results for the three VAS sound conditions are similar to one another.

Fig. 6. Scatter plots of lateral angle localization data. (a) Free-field sound conditions. (b) VAS sound conditions.

The polar angle data are shown in Fig. 7 for each subject and each condition. The polar angle scatter plots show a large effect of the earpieces on free-field polar angle response accuracy. The dispersion around the upward diagonal for the FF-EP1 sound condition is much greater than that for the FF-CON sound condition, and that for the FF-EP2 sound condition is greater still. For S1 there are very few dots on the upward diagonal for the FF-EP2 sound condition; most of the responses lie in the lower half of the front hemisphere. This subject reported extreme localization difficulty while wearing the ER4P-EP2 earphones. There is no clear pattern in the polar angle scatter plots for the VAS sound conditions.

Fig. 7. Scatter plots of polar angle localization data. (a) Free-field sound conditions. (b) VAS sound conditions.

The results shown so far indicate that the ER4P-EP1 earphones interfere to a small extent with free-field localization and that the ER4P-EP2 earphones interfere significantly with free-field localization, while both systems can deliver a VAS comparable, in terms of localization accuracy, to that produced using the ER1 earphones. We now support this view with statistics computed on the mean absolute lateral angle (MALA) error data and the mean absolute polar angle (MAPA, see Section 2.2) error data (Fig. 8 and Table 1).
Fig. 8. Mean absolute angle error, its 95% confidence interval, and average angle error across the subject population. (a) Lateral angle data. (b) Polar angle data.

Table 1. KW ANOVA results for comparisons between the FF-CON and VAS-CON conditions and the other conditions, for the MALA and MAPA error data.*

                             Lateral                         Polar
                     χ²        p         d          χ²        p         d
Compared to FF-CON
  VAS-CON          46.43    <0.0001     0.13       16.56    <0.0001     0.08
  FF-EP1           14.48     0.0001     0.07       41.84    <0.0001     0.13
  FF-EP2          107.27    <0.0001     0.19      148.02    <0.0001     0.26
Compared to VAS-CON
  VAS-EP1           0.23     0.63       –            0.36     0.55      –
  VAS-EP2           5.09     0.024      0.04         2.09     0.15      –

*Where the difference is statistically significant, the effect size (Cliff's d) is also given.

To begin with, we note that the VAS-CON sound condition is statistically significantly different from the FF-CON sound condition for both MALA and MAPA errors, but in both cases the effect size is small. The VAS-CON sound condition represents a benchmark for the quality of VAS that we can present using the techniques described in Section 2.1. This means that the best SARA delivery system that could be expected would have free-field localization performance equivalent to the FF-CON sound condition and VAS localization performance equivalent to the VAS-CON sound condition, as opposed to the ideal case described earlier, in which both free-field and VAS localization performance would be equivalent to that for the FF-CON sound condition. Therefore in the results that follow we compare the FF-EP1 and FF-EP2 sound conditions only to the FF-CON sound condition, and the VAS-EP1 and VAS-EP2 sound conditions only to the VAS-CON sound condition.

The MALA error data for the free-field sound conditions, pooled across subjects, show that the error is largest for the FF-EP2 sound condition [see Fig. 8(a)]. Based on informal reports from the subjects, we attribute the reduced lateral angle localization accuracy to 1) overall attenuation of the stimuli and 2) disruption of ILD cues caused by slight asymmetries in earphone placement. Sabin et al. [18] have shown that low listening levels can lead to reduced localization performance. The pooled MALA error data also show a statistically significant difference between the FF-EP1 and FF-CON sound conditions, but the effect size is very small. In fact, when a KW ANOVA is performed on the data from each subject individually, there is a statistically significant difference between these two sound conditions for S1 only (χ² = 19.36, p < 0.0001), and the effect size is small (d = 0.18). A slight disruption of ILD cues by the earphones may have caused the reduced localization accuracy for this subject.

The MALA error data for the VAS sound conditions, pooled across subjects, show no significant effect between the VAS-EP1 and VAS-CON sound conditions. They do show a significant effect between the VAS-EP2 and VAS-CON sound conditions, and the average MALA error is slightly lower for the VAS-EP2 sound condition; however, since the effect size is very small, we have not investigated this further.

The MAPA error data pooled across subjects show that the error is again largest for the FF-EP2 sound condition [see Fig. 8(b)]. This is consistent with the results shown so far, which indicate that the ER4P-EP2 earpieces interfere significantly with the ILD cues and, in particular, with the spectral localization cues in the free field. The same data show a statistically significant difference between the FF-EP1 and FF-CON sound conditions, but here too the effect size is small. For the VAS sound conditions the MAPA error data pooled across subjects show no statistically significant effect between the VAS-EP1 and VAS-CON sound conditions, or between the VAS-EP2 and VAS-CON sound conditions.
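The statistical comparisons summarized in Table 1 can be reproduced along the following lines. The sketch applies the Kruskal–Wallis test and Cliff's d (Section 2.2) to two arrays of per-trial absolute angle errors; the use of scipy.stats.kruskal and the variable names are our own choices, not the authors' code.

```python
import numpy as np
from scipy.stats import kruskal

def cliffs_d(x, y):
    """Cliff's d: P(x > y) - P(x < y), estimated over all sample pairs."""
    x = np.asarray(x)[:, None]
    y = np.asarray(y)[None, :]
    return np.mean(x > y) - np.mean(x < y)

def compare_conditions(errors_a, errors_b, alpha=0.05):
    """KW ANOVA on absolute angle errors from two conditions, with Cliff's d
    reported only when the difference is significant (mirroring Table 1)."""
    stat, p = kruskal(errors_a, errors_b)
    d = cliffs_d(errors_a, errors_b) if p < alpha else None
    return stat, p, d

# Example with hypothetical per-trial error arrays:
# chi2, p, d = compare_conditions(mala_ff_con, mala_ff_ep1)
```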
3 DISCUSSION

Martin et al. [19] found that VAS localization performance can be as good as free-field localization performance, as measured both by the percentage of front–back confusions and by the average localization error. They also used a blocked-ear recording technique to measure HRTFs, but presented the VAS stimuli using circumaural headphones calibrated in situ. This means that the VAS stimuli in their experiment were presented with an individualized ear-canal resonance, whereas in our VAS-CON sound condition the stimuli were presented with a generic simulation of the ear-canal resonance. The small difference in localization performance between our FF-CON and VAS-CON sound conditions can be attributed to the presentation of virtual stimuli with nonindividualized ear-canal resonances.

Härmä et al. [3] (see also [20]) studied the externalization of virtual sound sources using their SARA earphones, and the extent to which subjects could differentiate between real and virtual sound sources. They proposed an augmented-reality Turing test, which a SARA system would pass if users could not differentiate between real and virtual sound sources. In our study no experiments were performed to measure the quality of the externalization of the VAS sounds delivered over our SARA earphones, but all subjects informally reported good externalization, in particular when using the ER4P-EP1 earphones. In addition, there are no occlusion effects associated with the ER4P-EP1 earphones, and subjects reported only minor occlusion effects when using the ER4P-EP2 earphones. However, we expect that for broad-band stimuli the ER4P-EP1 earphones would not pass the augmented-reality Turing test, because their poor low-frequency response would enable users to differentiate easily between real and virtual sound sources.

Langendijk and Bronkhorst [21] found that real and virtual sounds (noise bandpass filtered between 500 Hz and 16 kHz) could not be differentiated in a direct A/B comparison between real sounds presented by a loudspeaker and virtual sounds presented using small earphones suspended close to the ear in a way that did not significantly interfere with free-field localization. Their method for presenting the virtual sounds might indeed form the basis of a SARA delivery system, but we suspect that there would be difficulties caused by the inevitable compromise between the loudness of the sounds that could be presented and the extent to which they could be heard by nearby listeners. Kulkarni and Colburn [22] also conducted an augmented-reality Turing test, though, like Langendijk and Bronkhorst, their aim was not to develop a system for delivering SARA. They too showed that virtual sounds were indistinguishable from real sounds using their methods, but only for randomized stimulus spectra, so that timbre could not be used for discrimination. They used "tubephones" to present the virtual sounds, which were characterized by a 40-dB-per-decade spectral rolloff below 2 kHz. This is comparable to our ER4P-EP1 earphones, but with a much bulkier transducer that is not designed to be driven by consumer mobile devices.

Bone-conduction (BC) technology has been used for SARA delivery (see, for example, [8], [23]), but there seems to be a lack of comprehensive studies of localization performance using BC headphones in the literature.
MacDonald et al. [7] compared lateral localization performance using a pair of BC headphones with that using a pair of semiopen circumaural headphones. The stimulus was a train of eight 250-ms bursts of Gaussian noise, filtered for each trial by the subject's HRTFs for a particular direction and by a filter that compensated for the frequency response of the transducer. Eight source directions were used, evenly distributed around the head in the horizontal plane. However, the stimulus was bandpass-filtered (300 Hz to 5 kHz) because of the band-limited frequency response of the BC headphones, which means that only ITD and ILD cues played a significant role in the localization. Under these conditions they found the BC headphones to be as good as the circumaural headphones. It is not clear how the subjects discriminated between directions in the front and back hemispheres when there were no spectral cues. The authors describe the accurate localization using the BC headphones as "surprising," since the stimuli were reduced in bandwidth. However, the environment in which the tests were conducted is not reported; if it was not anechoic, sound reflections in the room may have played a role, since there is generally audible leakage from both BC headphones and semiopen circumaural headphones. Since the experimental conditions were significantly different from those reported here, it is difficult to compare the results in a meaningful way. However, an unpublished study previously undertaken by our research group indicated that the upper limit of the SCC using BC headphones was 0.85. Those localization experiments were performed using the same experimental paradigm as the research reported in this paper, so this number can be compared directly to the highest SCC values obtained for the VAS-EP1 and VAS-EP2 sound conditions, 0.91 and 0.92, respectively. We also consider the extremely band-limited response of current BC headphones to restrict their usefulness.

4 CONCLUSIONS

We began by stating that an ideal system for delivering SARA would present a high-fidelity VAS without interfering with normal hearing. Based on these criteria, the ER4P-EP1 earphones are a very promising candidate. The results showed that their interference with normal free-field localization was statistically significant but very small, and that subjects could localize VAS sounds delivered using the ER4P-EP1 earphones as well as they could using the standard ER1 earphones. They are also comfortable to wear and do not require any expertise to use. The only disadvantage of these earphones is their poor low-frequency response, so they may not be suitable for the presentation of certain types of audio content. The ER4P-EP2 earphones are at least as good for presenting VAS and do not suffer from a poor low-frequency response, but they do interfere significantly with normal free-field hearing.
This indicates that these earphones would be more useful for delivering SARA in situations where the environmental sounds are quite loud.

We envisage many possible usage scenarios for the ER4P-EP1 and ER4P-EP2 SARA earphones. Since they are no more bulky or heavy than the earbuds supplied with most portable music players, and they do not require any special amplifier, they are suitable for use with any portable audio device. They are lightweight, and high-fidelity SARA can be discreetly presented by driving them with a mobile computer or personal digital assistant (PDA) loaded with appropriate software for dynamic auralization and hardware for location and orientation tracking. Lastly, these SARA systems are widely accessible, since they can be assembled easily from unmodified, commercially available components, and the compensation filters are freely available on request from the authors.

5 REFERENCES

[1] S. Carlile, Virtual Auditory Space: Generation and Applications (Neuroscience Intelligence Unit, R. G. Landes, Austin, TX, 1996).
[2] J. Huopaniemi, "Future of Personal Audio – Smart Applications and Immersive Communication," in Proc. AES 30th Int. Conf. on Intelligent Audio Environments (Saariselkä, Finland, 2007 March 15–17).
[3] A. Härmä, J. Jakka, M. Tikander, M. Karjalainen, T. Lokki, J. Hiipakka, and G. Lorho, "Augmented Reality Audio for Mobile and Wearable Appliances," J. Audio Eng. Soc., vol. 52, pp. 618–639 (2004 June).
[4] V. Riikonen, M. Tikander, and M. Karjalainen, "An Augmented Reality Audio Mixer and Equalizer," presented at the 124th Convention of the Audio Engineering Society (2008 May), convention paper 7372.
[5] M. S. Dean and F. N. Martin, "Insert Earphone Depth and the Occlusion Effect," Am. J. Audiol., vol. 9, no. 2, p. 131 (2000).
[6] M. Tikander, M. Karjalainen, and V. Riikonen, "An Augmented Reality Audio Headset," in Proc. Digital Audio Effects (DAFx-08) (Espoo, Finland, 2008).
[7] J. A. MacDonald, P. P. Henry, and T. R. Letowski, "Spatial Audio through a Bone Conduction Interface," Int. J. Audiol., vol. 45, pp. 595–599 (2006).
[8] J. Wilson, B. N. Walker, J. Lindsay, C. Cambias, and F. Dellaert, "SWAN: System for Wearable Audio Navigation," in Proc. 11th IEEE Int. Symp. on Wearable Computers (2007), pp. 91–98.
[9] D. Sun, "Bone-Conduction Spatial Audio," B.E. dissertation, University of Sydney, Sydney, Australia (2005).
[10] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, rev. ed. (MIT Press, Cambridge, MA, 1997).
[11] A. Farina, "Simultaneous Measurement of Impulse Response and Distortion with a Swept-Sine Technique," presented at the 108th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 48, p. 350 (2000 Apr.), preprint 5093.
[12] S. Carlile, P. Leong, and S. Hyams, "The Nature and Distribution of Errors in Sound Localization by Human Listeners," Hear. Res., vol. 114, pp. 179–196 (1997).
[13] H. Møller, "Fundamentals of Binaural Technology," Appl. Acoust., vol. 36, pp. 171–218 (1992).
[14] H. Møller, M. F. Sørensen, D. Hammershøi, and C. B. Jensen, "Head-Related Transfer Functions of Human Subjects," J. Audio Eng. Soc., vol. 43, pp. 300–321 (1995 May).
[15] N. I. Fisher, T. Lewis, and B. J. J. Embleton, Statistical Analysis of Spherical Data (Cambridge University Press, Cambridge, UK, 1987).
[16] P. Leong and S. Carlile, "Methods for Spherical Data Analysis and Visualization," J. Neurosci. Methods, vol. 80, pp. 191–200 (1998).
[17] N. Cliff, "Dominance Statistics: Ordinal Analyses to Answer Ordinal Questions," Psychol. Bull., vol. 114, pp. 494–509 (1993).
[18] A. Sabin, E. Macpherson, and J. Middlebrooks, "Human Sound Localization at Near-Threshold Listening Levels," Hear. Res., vol. 199, pp. 124–134 (2005).
[19] R. L. Martin, K. I. McAnally, and M. A. Senova, "Free-Field Equivalent Localization of Virtual Audio," J. Audio Eng. Soc., vol. 49, pp. 14–22 (2001 Jan./Feb.).
[20] A. Härmä, J. Jakka, M. Tikander, M. Karjalainen, T. Lokki, H. Nironen, and S. Vesa, "Techniques and Applications of Wearable Augmented Reality Audio," presented at the 114th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 51, p. 419 (2003 May), convention paper 5768.
[21] E. H. A. Langendijk and A. W. Bronkhorst, "Fidelity of Three-Dimensional-Sound Reproduction Using a Virtual Auditory Display," J. Acoust. Soc. Am., vol. 107, pp. 528–537 (2000).
[22] A. Kulkarni and H. S. Colburn, "Role of Spectral Detail in Sound-Source Localization," Nature, vol. 396, pp. 747–749 (1998 Dec.).
[23] A. Väljamäe, A. Tajadura-Jiménez, P. Larsson, D. Västfjäll, and M. Kleiner, "Binaural Bone-Conducted Sound in Virtual Environments: Evaluation of a Portable, Multimodal Motion Simulator Prototype," Acoust. Sci. Technol., vol. 29, pp. 149–155 (2008).

THE AUTHORS

Aengus Martin was born in Ireland in 1979. He received a B.A. degree in computational physics, an M.Sc. degree in physics, and an M.Phil. degree in music and media technology from Trinity College, Dublin, in 2001, 2003, and 2005, respectively. Since 2005 he has been a research assistant in the Computing and Audio Research Laboratory at the University of Sydney, Sydney, Australia, where he began a Ph.D. program in 2007. His main research activities involve interactive sound synthesis and spatial audio.

Craig Jin received an M.S. degree in applied physics from the California Institute of Technology, Pasadena, CA, in 1991 and a Ph.D. degree in electrical engineering from the University of Sydney, Sydney, Australia, in 2001. He is a senior lecturer in the School of Electrical and Information Engineering at the University of Sydney and a Queen Elizabeth II Fellow. He is the director of the Computing and Audio Research Laboratory at the University of Sydney and a cofounder of three startup companies: VAST Audio Pty Ltd, Personal Audio Pty Ltd, and Heard Systems Pty Ltd. His research focuses on spatial audio and neuromorphic engineering. Dr. Jin is the author or coauthor of more than 70 journal or conference papers in these areas and holds six patents. He has received recognition in Australia for his invention of a spatial hearing aid. He is a member of the Audio Engineering Society, the Acoustical Society of America, and the Institute of Electrical and Electronics Engineers.

André van Schaik received an M.Sc. degree in electrical engineering from the University of Twente, Enschede, The Netherlands, in 1990, and a Ph.D. degree in electrical engineering from the Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland, in 1998. He is a reader in electrical engineering in the School of Electrical and Information Engineering, University of Sydney, Sydney, Australia, and an Australian Research Council Queen Elizabeth II Research Fellow.
His research focuses on three main areas: neuromorphic engineering, bioelectronics, and spatial audio. Dr. van Schaik has authored or coauthored more than 100 papers in these research areas and holds more than 30 patents. He is the director of the Computing and Audio Research Laboratory at the University of Sydney and a cofounder of three start-up companies. He is a member of the EPSRC College and a board member of the Institute of Neuromorphic Engineering. He is a member of the Analog, BioCAS, and Neural Network Technical Committees of the IEEE Circuits and Systems Society and a past chair of its Sensory Systems Technical Committee. He is an associate editor of the IEEE Transactions on Circuits and Systems I: Regular Papers.