Room-dependent preference of Virtual Surround Sound

Audio Engineering Society
Convention Paper
Presented at the 124th Convention
2008 May 17–20 Amsterdam, The Netherlands
The papers at this Convention have been selected on the basis of a submitted abstract and extended precis that have been peer
reviewed by at least two qualified anonymous reviewers. This convention paper has been reproduced from the author's advance
manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents.
Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New
York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof,
is not permitted without direct permission from the Journal of the Audio Engineering Society.
Frederick S. Scott (1) and Agnieszka Roginska (2)
(1) Music Technology, New York University, New York, NY, USA
[email protected]
(2) Music Technology, New York University, New York, NY, USA
[email protected]
ABSTRACT
A common method for simulating surround sound over headphones, so-called virtual surround sound, is the
convolution of content information with binaural cues. Often, room information is included. This paper examines
whether using HRTFs with room impulse responses customized to the room the listener is in enhances the listening
experience. Perceptual experiments were conducted to evaluate the effect of the room in which listeners were
seated on their preference for the processing technology.
1. INTRODUCTION
In the past, digital signal processing was challenging and
expensive, and required specialized chips or dedicated
processors. Now nearly everyone has a computer that can
handle numerous computationally taxing processes. The
ownership of devices that can handle DSP operations has
reached critical mass, and mp3 players such as the iPod
have created a generation of listeners for whom earbuds
are an everyday listening medium. Though headphone
listening is very unlikely to totally supplant speaker
listening in the near future, more attention is being
given to enhancing and improving it.
There are numerous headphone enhancement technologies
on the market. These range from hardware-based
solutions, to binaural cue replication, to fast
convolution algorithms aimed specifically at simulating
reverberation in real time. Previous work demonstrates
several different methods to simulate surround sound
over headphones [1] [2] [3]. Some specific examples are
Dolby Headphone, Yamaha Silent Cinema, and Creative
Labs CMSS-3D. The goal of most of these systems is to
replicate a surround sound speaker setup. In the ideal
case, a listener would not be able to tell the
difference between a speaker configuration and a
virtual surround sound imitation. Among the key
elements for this are head-related transfer functions,
interaural time differences, interaural level
differences, and reverberation. All of these features
can be individually measured and combined. Existing
technologies typically employ fixed processing, with no
presets or only a limited number of them for listener
(HRTF) and room response selection.
The goal of this paper is to investigate whether there
exists a correlation between the evaluated preference
for a virtual surround sound processing method and the
room a person is physically located in. Listening in a
large room to virtual surround sound that uses the room
acoustics of a small room may lead to an unnatural
perception of the environment. Customization of
binaural room responses may lead to an improved virtual
surround sound representation. This paper focuses
specifically on evaluating subject preference for
virtual surround sound presented over headphones, using
binaural room impulse responses (BRIRs) that either
match or do not match the room a subject was seated in.
2. METHODOLOGY
Psychoacoustic experiments were designed and performed
to test subjective preference among virtual surround
sound reproduction technologies in three studios in the
Music Technology program at New York University. The
three studios were selected for their general
acoustical quality and proximity to one another. Room A
is 2.6 m by 4.3 m by 2.43 m and is used for mastering
and mixing. Room B is 5.3 m by 3.9 m by 2.59 m and is
used for mixing and creation of multimedia projects.
Room C is 9.4 m by 9.4 m by 2.59 m and is used as a
classroom. The RT60s of the three rooms are 0.48
seconds for room A, 0.44 seconds for room B, and 0.63
seconds for room C.
Figure 1. Room Diagrams (panels: Room A, Room B, Room C)
Five binaural room impulse responses were measured for
each of the three rooms. The existing surround speaker
setups were used in rooms A and B: Genelec 1030As in
the former and 1031As in the latter. Impulse response
measurements in room C were taken with AuSIM's AuProbe
speaker, which has a 3-inch driver. The speaker was
placed in locations representing a surround sound
setup: the center location, 30° for the front channels,
and 110° for the rear channels. A five-second sweep was
used and repeated five times for a higher
signal-to-noise ratio. The responses for each speaker
were recorded binaurally with a Neumann KU 100 and an
RME Fireface 400 at 96 kHz. The dummy head was placed
in the sweet spot of each speaker setup in each room.
The impulse responses were deconvolved from the
recordings in MATLAB using scripts written by the
author based on the work of Farina [4].
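The deconvolution step can be illustrated with a small Python sketch of a swept-sine measurement in the spirit of Farina [4]. It is not the MATLAB code used for the paper. The function names, the 8 kHz sample rate, the 100 Hz to 3 kHz sweep range, and the two-tap "room" are all assumptions chosen so that the example runs quickly in plain Python.

```python
import math

def exp_sweep(f1, f2, duration, fs):
    """Exponential sine sweep from f1 to f2 Hz lasting `duration` seconds."""
    n = int(round(duration * fs))
    rate = math.log(f2 / f1)
    k = 2 * math.pi * f1 * duration / rate
    return [math.sin(k * (math.exp(i / fs * rate / duration) - 1.0))
            for i in range(n)]

def inverse_filter(sweep, f1, f2):
    """Time-reversed sweep with a decaying envelope (a factor of about
    f1/f2 overall) that offsets the sweep's falling spectrum."""
    n = len(sweep)
    rate = math.log(f2 / f1)
    return [sweep[n - 1 - i] * math.exp(-i / n * rate) for i in range(n)]

def convolve(a, b):
    """Direct-form linear convolution; adequate for a short demonstration."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        if x == 0.0:
            continue
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

# Hypothetical measurement: a "room" with a direct sound and one reflection.
fs = 8000
f1, f2 = 100.0, 3000.0
sweep = exp_sweep(f1, f2, 0.064, fs)
true_ir = [0.0] * 40
true_ir[10] = 1.0   # direct sound, 10 samples after the sweep starts
true_ir[30] = 0.5   # one reflection at half the amplitude

recording = convolve(true_ir, sweep)   # what the microphone would capture
estimate = convolve(recording, inverse_filter(sweep, f1, f2))

# The recovered response is delayed by len(sweep) - 1 samples.
peak = max(range(len(estimate)), key=lambda i: abs(estimate[i]))
delay = peak - (len(sweep) - 1)
print("estimated direct-sound delay in samples:", delay)
```

A real measurement would use a much longer sweep, a higher sample rate, and FFT-based convolution; the direct-form loop here only keeps the sketch free of dependencies.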
For the listening test, two audio samples were selected
with surround ambiance in mind. The first sample was an
excerpt of a Haydn string quartet taken from the THX
demo disc II. The second sample was from the tune "The
Great Pagoda of Funn" by Donald Fagen, from his 2006
album Morph the Cat. The 5.1 AC3 files from the DVDs
were decoded and demultiplexed with Apple's Compressor
software and then cut down to 10-second selections. The
three BRIR sets were downsampled to 48 kHz and
convolved with the two selections using MATLAB. The LFE
audio channel was kept and convolved with the center
channel IR in both cases. Another pair of listening
samples was created by summing the 5.1 channels to
stereo.
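The rendering described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' MATLAB processing: the channel names (L, R, C, LFE, Ls, Rs), the function names, and the toy impulse responses are hypothetical, and the downmix gain is an assumed parameter because the paper does not state which coefficients were used for the summed-to-stereo samples.

```python
def convolve(signal, ir):
    """Direct-form linear convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out

def add_into(acc, x):
    """Sum x into acc in place, growing acc if x is longer."""
    if len(x) > len(acc):
        acc.extend([0.0] * (len(x) - len(acc)))
    for i, v in enumerate(x):
        acc[i] += v

def render_binaural(channels, brirs):
    """channels: name -> samples; brirs: speaker -> (left_ear_ir, right_ear_ir)."""
    left, right = [], []
    for name, samples in channels.items():
        # As in the paper, the LFE channel uses the center-speaker response.
        ir_left, ir_right = brirs["C" if name == "LFE" else name]
        add_into(left, convolve(samples, ir_left))
        add_into(right, convolve(samples, ir_right))
    return left, right

def sum_to_stereo(channels, gain=0.7071):
    """Assumed downmix: center, LFE, and surround scaled by `gain`."""
    n = len(channels["L"])
    def shared(i, surround):
        return gain * (channels["C"][i] + channels["LFE"][i] + channels[surround][i])
    left = [channels["L"][i] + shared(i, "Ls") for i in range(n)]
    right = [channels["R"][i] + shared(i, "Rs") for i in range(n)]
    return left, right

# Toy data: an impulse on L and a delayed impulse on LFE.
silent = [0.0, 0.0]
channels = {"L": [1.0, 0.0], "R": silent, "C": silent,
            "LFE": [0.0, 1.0], "Ls": silent, "Rs": silent}
flat = ([0.0], [0.0])
brirs = {"L": ([1.0], [0.5]), "R": flat,
         "C": ([0.25, 0.25], [0.25, 0.25]), "Ls": flat, "Rs": flat}

left, right = render_binaural(channels, brirs)
stereo_left, stereo_right = sum_to_stereo(channels, gain=0.5)
```

Each of the three BRIR sets would be passed through `render_binaural` once per music sample, giving the three virtual-room versions that are compared against the summed version.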
3. PROCEDURE
Ten subjects, aged 23 to 35, participated in the
listening experiment. The mean age was 28.7 years with
a standard deviation of 4.6 years. Musical experience
ranged from 2 to 30 years, with a mean of 16.9 years
and a standard deviation of 8.2 years. Eight of the ten
subjects were graduate students in the Music Technology
program at New York University. We consider this narrow
population advantageous due to their finer listening
skills.
The experiment was designed as a two-interval
forced-choice task between four different simulations:
three sets of measured impulse responses and a 5.1
summed-to-stereo version. Subjects listened to the two
music samples and were asked to state their preference
between two simulations. Each simulation pair was
repeated twice, for a total of 24 trials. After
listening to a comparison, the subject selected the
preferred surround sound simulation using a GUI in
MATLAB. The code behind the GUI presented the
comparisons in a random order with the two musical
styles intermixed. Each subject took the same listening
test in each of the three rooms in which the BRIRs were
measured, starting in room A, continuing in room C, and
finishing in room B. Subjects took approximately 30
minutes to complete the test. Listening was done
through a pair of Sennheiser HD650 headphones connected
to the headphone output of a MacBook Pro. The laptop
was placed so that the listener was in the same
location as the dummy head had been during the
recording of the impulse responses.
Figure 2. Listening Test GUI
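The trial count follows from the design: four simulations give six unordered pairs, and six pairs times two music samples times two repeats gives 24 trials. A minimal Python sketch of building and shuffling such a trial list is shown below. The labels and the function name are assumptions for illustration, and the original test ran in a MATLAB GUI.

```python
import itertools
import random

SIMULATIONS = ["V. A", "V. B", "V. C", "Summed"]  # three virtual rooms plus downmix
SAMPLES = ["classical", "rock"]
REPEATS = 2

def build_trials(seed=None):
    """Return every (sample, pair) comparison REPEATS times, in random order."""
    pairs = list(itertools.combinations(SIMULATIONS, 2))  # 6 unordered pairs
    trials = [(sample, pair)
              for sample in SAMPLES
              for pair in pairs
              for _ in range(REPEATS)]
    random.Random(seed).shuffle(trials)  # intermixes the two musical styles
    return trials

trials = build_trials(seed=1)
print(len(trials), "trials per listening room")
```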
4. RESULTS
To compensate for each comparison being played only
twice per listening room and per style, any pair that
received contradictory answers was disregarded. From
figure 3 it can be seen that, when averaged across all
subjects, there is no case in which a virtual room is
the preferred choice in its own real space, but this
result is obscured by a large amount of variance, as
shown in figure 4. Any case in which the subject picked
the virtualization customized to the listening room
they were in at the time was labeled as a “correct”
room pick. As shown in figure 6, the mean of “correct”
room picks across all listeners and all rooms was 29.4%,
with a small variance of 2.6% among the rooms and 11.1%
among the subjects.
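The two bookkeeping steps above, discarding contradictory pairs and counting “correct” room picks, can be sketched in Python as follows. The data layout, the labels, and the function names are hypothetical, and the sketch assumes that the rate is taken over the retained comparisons in which the matching room was one of the two options.

```python
from collections import defaultdict

def consistent_choices(responses):
    """responses: iterable of (listening_room, sample, pair, choice).
    Keep only comparisons answered the same way on every repeat."""
    grouped = defaultdict(list)
    for room, sample, pair, choice in responses:
        grouped[(room, sample, pair)].append(choice)
    return {key: picks[0] for key, picks in grouped.items()
            if len(set(picks)) == 1}

def correct_pick_rate(choices):
    """Share of retained comparisons offering the matching room in which
    that room's virtualization was the one chosen."""
    offered = picked = 0
    for (room, _sample, pair), choice in choices.items():
        if room in pair:
            offered += 1
            picked += (choice == room)
    return picked / offered if offered else float("nan")

# Hypothetical answers from one subject seated in room A.
responses = [
    ("A", "rock", ("A", "B"), "A"), ("A", "rock", ("A", "B"), "A"),
    ("A", "rock", ("A", "Summed"), "Summed"), ("A", "rock", ("A", "Summed"), "A"),
    ("A", "rock", ("B", "C"), "B"), ("A", "rock", ("B", "C"), "B"),
    ("A", "rock", ("A", "C"), "C"), ("A", "rock", ("A", "C"), "C"),
]
kept = consistent_choices(responses)
rate = correct_pick_rate(kept)
```

In this toy data the A versus Summed pair is dropped as contradictory, and the B versus C pair does not count toward the rate because room A was not on offer.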
Another important statistic was the amount by which a
subject's preference changed between listening rooms.
This was examined in two ways. The first was the
maximum difference between listening rooms for a single
comparison, e.g. the simulation of B versus the summed
version for the rock sample, per subject, shown in
figure 7. The second was the mean of the changes
between two rooms for all comparisons, shown in figure
8. The maximum single change had a mean of 63.3% across
all subjects, while the mean across all subjects of
their own individual means over all comparisons and all
rooms was only 20.83%. In the comparison of mean change
between rooms, the maximum of the means was a 25%
change, illustrated in figure 9.
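One possible reading of these two change statistics is sketched below in Python. The paper does not give the exact formulas, so this is an assumption: each comparison in each listening room is given a preference score between 0 and 1 (for example the fraction of repeats in which the first-listed simulation was chosen), the maximum change is the largest spread of that score across rooms, and the mean change averages the absolute differences over all room pairs and comparisons. The names and data are hypothetical.

```python
from itertools import combinations

def room_changes(scores):
    """scores: {comparison: {room: preference in [0, 1]}} for one subject.
    Returns (max_change, mean_change) between listening rooms."""
    max_change = 0.0
    diffs = []
    for by_room in scores.values():
        values = list(by_room.values())
        max_change = max(max_change, max(values) - min(values))
        diffs.extend(abs(by_room[a] - by_room[b])
                     for a, b in combinations(sorted(by_room), 2))
    return max_change, sum(diffs) / len(diffs)

# Hypothetical scores for one subject and two comparisons.
scores = {
    ("V. B", "Summed", "rock"): {"A": 1.0, "B": 0.0, "C": 0.5},
    ("V. A", "V. C", "classical"): {"A": 0.5, "B": 0.5, "C": 0.5},
}
max_change, mean_change = room_changes(scores)
```

Under this reading a maximum change of 1.0 corresponds to a complete reversal of opinion on one comparison between two listening rooms.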
Figure 3. Mean Preferences for all Subjects (listening rooms: In A, In B, In C; series: V. A, V. B, V. C, and Summed, each for Classical and Rock)
Figure 4. STD of Mean Preferences for all Subjects (listening rooms: In A, In B, In C)
Figure 5. Mean Preferences for all Subjects w/o Summed (listening rooms: In A, In B, In C)
Figure 6. “Correct” Room Picks per Subject (subjects 1 to 10, mean, STD)
Figure 7. Maximum Change Between Choice Between Rooms (subjects 1 to 10, mean, STD)
Figure 8. Mean Change For All Comparisons per Subject (subjects 1 to 10, mean, STD)
Figure 9. Mean Change Between Rooms (room pairs: In AvB, In AvC, In BvC; series: classical, rock)
5. CONCLUSIONS
The data suggest that simulated surround sound is not
significantly enhanced by IRs customized to the
listening room. In fact, for the mean of preferences
across all subjects, the simple summed-to-stereo
samples were generally preferred over most of the
simulated rooms. This is further suggested by the fact
that “correct” room picks made up only 29.4% of the
choices in which the matching room was available.
There is evidence, however, that supports the idea that
a listener's preference among simulated rooms can
change according to their physical location. Although
the mean change across all choices was only 20.8%, the
mean of the biggest change for all subjects was 63.3%.
When looking at individual subjects, there were 3 cases
out of 10 in which there was a 100% reversal of opinion
on one of the simulations. All of those reversal cases
involved the summed audio, but there was no significant
pattern in the listening rooms between which they
occurred. The summed samples would be drier than any of
the simulations, which may explain this. It could also
be an artifact of the testing procedure itself, in
which the subjects listened in all three rooms in rapid
succession, and it could disappear if the testing were
spread out over time. Also, the order of progression
through the listening rooms was kept static. Ideally,
every subject would run the test an additional five
times, once for each remaining room order.
Familiarity of the subjects with the listening rooms
was not investigated. With 8 of the subjects being
students in the program at NYU, it is probable that
they had spent a large amount of time in many of the
listening rooms, possibly influencing their results.
6. REFERENCES
[1] Cheng, C. I., Wakefield, G. H., "Introduction to
Head-Related Transfer Functions (HRTFs):
Representations of HRTFs in Time, Frequency, and
Space," Journal of the Audio Engineering Society,
2001
[2] Begault, D., "3D Sound For Virtual Reality and
Multimedia," Academic Press Inc, 1994
[3] Lorho, G., Isherwood, D., Zacharov, N.,
Huopaniemi, J., "Round Robin Subjective
Evaluation of Stereo Enhancement Systems for
Headphones," AES 22nd International Conference
on Virtual, Synthetic and Entertainment Audio,
2002
[4] Farina, A., "Simultaneous measurement of impulse
response and distortion with a swept-sine
technique," 108th AES Convention, Paris, 2000
February 18-22