
Virtual Acoustics: Paper ICA2016-826
An immersive teleconferencing system using spherical
microphones and wave field synthesis
Jonas Braasch(a), Jeff Carter(b), Samuel Chabot(c), Jonathan Mathews(d)
(a) Rensselaer Polytechnic Institute, Troy, United States, [email protected]
(b) Rensselaer Polytechnic Institute, Troy, United States, [email protected]
(c) Rensselaer Polytechnic Institute, Troy, United States, [email protected]
(d) Rensselaer Polytechnic Institute, Troy, United States, [email protected]
Abstract
Although the use of videoconferencing systems has become very common, only a few attempts
have been made to transmit spatially correct audio. One reason for this is that traditional stereophonic
microphone systems cannot be used with a bi-directional transmission scheme. Because such systems
are based on capturing sound sources from the far field, their use is prone to acoustic feedback. To
avoid the latter, the sound has to be captured with closely positioned microphones or a beamforming
microphone system. A solution based on a spherical microphone is proposed that allows the preservation
of spatial cues while avoiding acoustic feedback. The custom-built microphone consists of 16 capsules
embedded in a sphere. Higher-order ambisonics is used to analyze the sound spatially and to produce
beamforming patterns. The Microphone-aided Computational Auditory Scene Analysis (MaCASA)
algorithm is used to track and capture sound sources in real time. The spherical microphone can be used
either as a beamformer or as a sound localization system to track participants with wearable microphones.
In both cases, a wave field synthesis (WFS) system is used to reproduce the sound in a spatially correct
manner. The Collaborative-Research Augmented Immersive Virtual Environment Laboratory (CRAIVE-Lab)
serves as the main site for this research. The lab includes a 134-channel sound system that is used for WFS
and a seamless 360-deg video projection over a floor area of 12 m x 10 m. Two satellite labs, one containing
a 64-channel WFS system and another with a 24-channel ambisonics system, serve as remote sites.
Keywords: Virtual acoustics, telepresence, wave-field synthesis, spherical microphone
Figure 1: Sketch of a microphone-based recording and reproduction set-up.
1 Introduction
Live networked musical performances have gained in popularity over the last few years. In
these concerts, musicians are distributed over at least two remote venues and connected via
the internet. Some of the challenging technical requirements that these projects have imposed
on the underlying research have been addressed in previous work [11, 12, 10, 13]. One of
the main challenges is the accurate spatial reproduction of the transmitted sound field at the
remote end, as all sound sources have to be captured from a close distance to avoid echoes.
For this reason, the main microphone techniques that operate from a distance to capture spatial
sound do not work for this application. The work presented here is an extension of previous
work [6, 1, 2]; the focus of this paper is to describe the integration of our technology into our
new Virtual Reality infrastructure, the CRAIVE-Lab [7].
The idea for such research goes back to the 1930s, when Steinberg and Snow described a
system that enabled the world-renowned conductor Leopold Stokowski and the Philadelphia
Orchestra to broadcast music live from Philadelphia to Washington, D.C. The authors used
their then-newly invented main-microphone techniques to produce a stereophonic image from
the recorded sound. Figure 1 shows a general diagram of how microphones and loudspeakers
have to be set up and how the signals have to be routed for stereophonic imagery. The spatial
positions of sound sources in the recording space are encoded by placing and orienting two or
more microphones – the main microphone array – strategically, capturing spatial information by
utilizing time and level differences between the different microphone channels. Each channel
is then independently transmitted, amplified, and fed to the matching loudspeaker of an array
of at least two speakers, for example the classic stereo set-up shown in Fig. 1. The box in
this figure that connects the microphones to the loudspeakers can either be an amplifier, a
broadcasting unit or a sound recording/reproduction system. Steinberg and Snow used two
to three parallel telephone lines to transmit the spatially encoded sound from Philadelphia to
Washington, D.C. [14].
Figure 2: Feedback loop in a telematic transmission.
While we now experience music broadcasts via radio, satellite, and the internet in our daily
life, music collaborations in which ensemble members are distributed over long distances are
still in the experimental stage because of technical difficulties associated with two-way or multicast connections. One such challenge is the susceptibility of bidirectional set-ups to feedback
loops, which are prone to audible colorations and echoes. Figure 2 demonstrates the general
problem: the microphone signal recorded at Site A is reproduced through a loudspeaker at Site
B, where it is picked up by a second microphone. This microphone signal is then transmitted
back to the original Site A, where it is re-captured by the first microphone. Because of the
transmission latency, the feedback becomes audible as an echo at much lower gains than in the
feedback situation known from local public address systems. Many popular audio/videoconferencing
systems such as iChat or Skype use echo-cancellation systems to suppress feedback. In speech
communication, echo-cancellation systems work well, since the back-and-forth nature of spoken
dialogue usually allows the transmission channel to be suppressed temporarily in one direction.
In simultaneous music communication, however, this procedure
tends to cut off part of the performance. Spectral alterations are a common side effect if the
echo-cancellation system operates with a filter bank.
For the given reasons, the authors suggest avoiding echo-cancellation systems completely. Instead,
it is proposed to capture all instruments from a close distance (e.g., with lavalier microphones) to
minimize the gain and therefore the risk of feedback loops. Unfortunately, the exclusive use of
near-field microphones contradicts the original idea of Steinberg and Snow, since the main
microphones have to be placed at a greater distance to capture the sound field stereophonically.
To resolve this conflict, this paper describes an alternative approach that simulates main microphone
signals from closely captured microphone signals and geometric data. The system – called Virtual
Microphone Control (ViMiC) – includes room simulation software to construct a multichannel audio
signal from a dry recording as if it had been recorded in a particular room [3]. The position data of
the sound sources, which are needed to compute the main microphone signals, are estimated using a
microphone array. The array, which is optimized to locate multiple sound sources, is installed at each
co-located venue to track the positions of the sound sources. The recorded position data are transmitted
to the remote venue(s) along with the acoustic signals that were recorded in the near field of the
instruments. At the remote end, the sound can then be projected with a correct spatial image using
the ViMiC system. The low-latency audio transmission software JackTrip [8] is used for audio
transmission.
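To make this data flow concrete, the sketch below shows one way a per-block payload could be organized on the sending side: the dry near-field audio channels bundled with the most recent position estimates from the tracking array, which the remote ViMiC renderer consumes. The paper only states that JackTrip [8] carries the audio, so the bundling, classes, and field names shown here are hypothetical illustrations rather than the actual transmission format.

```python
# Hypothetical sketch of what travels from the recording space to the
# reproduction space: near-field audio plus per-source position metadata.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrackedSource:
    source_id: int           # index of the corresponding near-field (lavalier) channel
    azimuth_deg: float       # direction estimated by the main microphone array
    elevation_deg: float
    distance_m: float        # distance estimate relative to the array

@dataclass
class TransmissionFrame:
    timestamp: float                     # capture time, used to align audio and position data
    audio_block: List[List[float]]       # one block of samples per near-field microphone
    positions: List[TrackedSource] = field(default_factory=list)  # latest estimates for this block
```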
[Figure 3 legend: loudspeaker, column, directional microphone, computer workstation, video projector, equipment rack; floor area approx. 12 m x 10 m.]
Figure 3: Floor plan of the CRAIVE-Lab. The spherical microphone is located in the center
between the video projectors, with additional support from shotgun microphones.
Figure 4: Landscape panorama photo (Ithaca, NY) displayed at the CRAIVE-Lab.
2 The CRAIVE-Lab
The Collaborative-Research Augmented Immersive Virtual Environment Laboratory (CRAIVE-Lab),
which is used as the main host for our telepresence research, was built to address the need for a
specialized virtual-reality (VR) system for the study and enabling of communication-driven tasks
with groups of users immersed in a high-fidelity, multi-modal environment. For the visual domain,
a front-projection display consisting of eight independent projectors creates scenes on a seamless
screen. For the acoustic domain, a 134-loudspeaker-channel system has been designed and installed
for Wave Field Synthesis (WFS) with the support of Higher-Order Ambisonic (HoA) sound projection
to render inhomogeneous acoustic fields. An intelligent position tracking system estimates current
user locations and head orientations as well as positioning data for other objects.
3 Audio reproduction system
The audio system in the CRAIVE-Lab consists of 134 active loudspeakers (JBL SR308) – see
Fig. 3. The majority of the loudspeakers, 128 units, are mounted at ear height on a shelf system
that is part of the projection screen frame, where they are located behind the micro-perforated
screen material. The electrical outlets for the speakers have been mounted directly on the shelves
carrying the speakers. Eight different electrical circuits power the loudspeakers. Six additional
loudspeakers are mounted on the ceiling at the locations shown in Fig. 3. The audio computer
(Apple Mac Pro) hosts two RME MADI-to-ADAT cards, which are connected to 16 ADAT-to-analog
converters. All gear is mounted in a rack as shown in the left photo of Fig. 5. The six ceiling
speakers are driven by a separate audio card (M-Audio 1814), because the two RME MADI cards
are fully loaded with the 128-channel system. The rack also holds the video computer and the
computer for the 6-camera array. All 134 loudspeakers have been cabled with individual XLR
cables, with a total cable length of over 2 miles – see the center photo of Fig. 5.
Figure 5: Left: CRAIVE-Lab audio rack; center: CRAIVE cabling; right: custom-built 32-channel
microphone array [9].
4 Sound spatialization using Wave Field Synthesis (WFS)
Within the CRAIVE-Lab, we have been using Wave Field Synthesis (WFS) technology, Higher-order
Ambisonics (HoA), and Virtual Microphone Control (ViMiC) to render sound fields. WFS is based on
the Kirchhoff-Helmholtz Integral (KHI), which states that the sound-pressure and particle-velocity
fields in a source-free volume can be determined if the sound pressure and particle velocities are
known on a closed surface surrounding the source-free volume. In practice, the WFS algorithm
calculates the sound pressures along this surface for a virtual external sound source. Loudspeakers
that are densely positioned on this theoretical surface then reproduce the signals according to the
calculated values, by adjusting the gains and delays for each speaker. A wave field synthesis system
can easily be simulated using ViMiC by following Steinberg and Snow's original approach [14].
Instead of placing a curtain of real microphones in a concert hall, an array of virtual microphones
can be set up in the ViMiC environment. Ideally, the virtual microphone positions should correspond
to the loudspeaker positions of the sound projection setup to capture the virtual wave front of a
point source. The microphone signal of
the nth virtual microphone is determined through this equation:
\[
y_n(t, r) = g \cdot x(t - \tau) = g_d(r_n) \cdot x\!\left(t - \frac{r_n}{c_s}\right) \tag{1}
\]
where r_n is the distance between the nth microphone and the sound source and c_s is the speed of
sound. In principle, the results achievable with the ViMiC WFS approach are identical to those of
traditional WFS implementations, and corrections for truncated and cornered arrays can be simulated
through changes in the positions and in the directional and frequency-dependent sensitivities of the
virtual microphones. Since WFS can be integrated into the general framework of ViMiC, no additional
software has to be utilized to create a WFS system or subsystem.
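As a rough illustration of Eq. (1), the following Python sketch computes one delayed and attenuated copy of a dry source signal per virtual microphone, with the virtual microphones placed at the intended loudspeaker positions. The delay is rounded to whole samples and the distance gain g_d(r_n) is modeled as a simple 1/r law clipped at a reference distance; both are simplifying assumptions, and all function and variable names are illustrative rather than taken from the ViMiC software.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # c_s in m/s, assumed value for room temperature

def render_virtual_microphones(x, fs, source_pos, mic_positions, ref_distance=1.0):
    """Return a (n_mics, n_samples) array: one delayed, attenuated copy of x per virtual microphone."""
    source_pos = np.asarray(source_pos, dtype=float)
    distances = [np.linalg.norm(np.asarray(p, dtype=float) - source_pos) for p in mic_positions]
    delays = [int(round(fs * r / SPEED_OF_SOUND)) for r in distances]  # tau_n = r_n / c_s in samples
    y = np.zeros((len(mic_positions), len(x) + max(delays)))
    for n, (r_n, d_n) in enumerate(zip(distances, delays)):
        g_n = ref_distance / max(r_n, ref_distance)   # simple 1/r distance gain, clipped near the source
        y[n, d_n:d_n + len(x)] = g_n * x              # y_n(t) = g_d(r_n) * x(t - r_n / c_s)
    return y

# Example: a virtual point source 3 m behind a short line of nine virtual microphones
# that coincide with loudspeaker positions of the reproduction array.
fs = 48000
x = np.random.randn(fs)                                      # 1 s of a dry test signal
mics = [(dx, 0.0, 0.0) for dx in np.linspace(-2.0, 2.0, 9)]
speaker_feeds = render_virtual_microphones(x, fs, source_pos=(0.0, -3.0, 0.0), mic_positions=mics)
```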
[Figure 6 block diagram components: recording space with lavalier microphones, main microphone array, preamplifier, and analysis computer; live transmission or data storage; reproduction space with spatialization control data, relative position of the microphone array, audio signals, virtual sound sources, audio processing with ViMiC, and D/A converter.]
Figure 6: Setup of the acoustic tracking system with a main microphone system and near-field
microphones. Also shown is the reproduction side to accurately position the near-field recordings,
for example in a telepresence scenario.
5 MaCASA-based sound source tracking system
A custom-built 16-channel (or alternatively 32-channel) spherical microphone serves as the main
sound localization system – see the right photo of Fig. 5. A Sennheiser wireless system with four
microphone units is used to capture voice signals and musical instruments from a close distance.
The Microphone-aided Computational Auditory Scene Analysis (MaCASA) system builds on a
combination of user-worn near-field microphones (lavalier microphones) and a main microphone
array (see Fig. 6) to localize individual sound sources from a mixture. Our previous main microphone
array design [4, 5, 3] utilized arrival time differences between five spatially configured, omnidirectional
microphones (inter-channel time differences) to determine the sound source positions. The major
problem with this design was the sensitivity of the cross-correlation algorithm to room reverberation.
While the method worked flawlessly in near-anechoic conditions, the individual microphone channels
proved to be too decorrelated in the presence of reverberation to obtain the robust time-of-arrival
differences between the different microphone channels necessary for accurately determining the
directions of the sound sources. As a consequence, we decided to replace the original 5-microphone
array with the aforementioned spherical microphone.
Tracking multiple simultaneous sound sources is still a challenge in collaborative environments.
Our solution to this problem is to use near-field microphone signals in conjunction with a traditional
microphone-array-based localization system. It is quite common to employ this type of microphone
for other tasks as well, such as speech recognition and telecommunication. The near-field microphone
signals are used to determine the signal-to-noise ratios (SNRs) between several sound sources, such
as concurrently playing musicians, while still serving the main purpose of capturing the audio signals.
The running SNR is calculated frequency-wise from the acoustic energy recorded in a given time interval:
\[
\mathrm{SNR}_{i,m} = 10 \log_{10}\!\left( \frac{1}{a} \int_{t_m}^{t_m + \Delta t} p_i^2 \, dt \right) \tag{2}
\]
with:
\[
a = \sum_{n=1}^{i-1} \int_{t_m}^{t_m + \Delta t} p_n^2 \, dt \;+\; \sum_{n=i+1}^{N} \int_{t_m}^{t_m + \Delta t} p_n^2 \, dt \tag{3}
\]
where p_i is the sound pressure captured with the ith near-field microphone, t_m the beginning of
the measured time interval m, Δt its duration, and N the number of near-field microphones.
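A minimal discrete-time sketch of Eqs. (2) and (3), assuming sampled near-field signals: the integral over each interval Δt becomes a sum of squared samples per frame, and the interference term a is the summed frame energy of all other near-field channels. This is the broadband version; the frequency-wise variant described below would run the same computation per octave band. The function name and the 50 ms frame duration are illustrative assumptions.

```python
import numpy as np

def running_snr(signals, fs, frame_duration=0.05, eps=1e-12):
    """signals: (n_mics, n_samples) near-field recordings; returns SNR in dB, shape (n_mics, n_frames)."""
    signals = np.asarray(signals, dtype=float)
    frame_len = int(round(fs * frame_duration))                    # Delta t in samples
    n_frames = signals.shape[1] // frame_len
    framed = signals[:, :n_frames * frame_len].reshape(signals.shape[0], n_frames, frame_len)
    energy = np.sum(framed ** 2, axis=2)                           # integral of p_i^2 over each frame
    interference = np.sum(energy, axis=0, keepdims=True) - energy  # a: energy of all other microphones
    return 10.0 * np.log10((energy + eps) / (interference + eps))  # Eq. (2); eps avoids log(0)
```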
Figure 7: Estimation of the signal-to-noise ratios for each sound source.
The SNRs are measured for each time interval between each observed sound source and the
remaining sound sources. The data can then be used to select and weight those time slots
in which the sound source dominates the scene, assuming that in this case the SNR is high
enough for the microphone array to provide stable localization cues. Figure 7 depicts the core
idea. In this example, a good time slot for Sound Source 1 is found in the third time frame, where
it has a large amount of energy while the recorded energy for Sound Source 2 is very low. Time
Slot 6 depicts an example where a high SNR is found for the second sound source.
To improve the quality of the algorithm, all data are analyzed frequency-wise. For this purpose,
the signals are sent through an octave-band filter bank before the SNR is determined. The
SNR is now a function of frequency f , time interval t, and the index of the sound source.
The sound source position is determined for each time/frequency slot by analyzing spherical
harmonics captured by the microphone array. Since this technique cannot resolve two sound
sources within one time-frequency bin, the estimated position is assigned to the sound source
with the highest SNR. Alternatively, the information in each band can be weighted with the SNR
in this band. To minimize computational load, a minimum SNR threshold can be determined,
below which the localization algorithm will not be activated for the corresponding time/frequency
slot.
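The slot-wise assignment just described can be sketched as follows: given per-source SNRs from the near-field channels (for instance from the running-SNR sketch above, computed per octave band) and one direction estimate per time/frequency slot from the spherical array, each slot is credited to the source with the highest SNR, and slots below a minimum SNR are skipped. The threshold value and all names are illustrative assumptions rather than the published MaCASA implementation.

```python
import numpy as np

def assign_directions(snr_db, directions, min_snr_db=6.0):
    """
    snr_db: (n_sources, n_frames, n_bands) per-source SNRs from the lavalier channels.
    directions: (n_frames, n_bands, 2) azimuth/elevation estimates from the spherical array.
    Returns one list of (direction, frame, band) tuples per source.
    """
    n_sources, n_frames, n_bands = snr_db.shape
    assignments = [[] for _ in range(n_sources)]
    best = np.argmax(snr_db, axis=0)             # dominant source per time/frequency slot
    peak = np.max(snr_db, axis=0)
    for m in range(n_frames):
        for b in range(n_bands):
            if peak[m, b] < min_snr_db:          # below threshold: skip localization for this slot
                continue
            assignments[best[m, b]].append((directions[m, b], m, b))
    return assignments
```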
6 Conclusion and outlook
In this paper, we have presented a framework to enable bi-directional telepresence communication for live music performances. The system builds on capturing the audio signals from
a close distance with a wireless microphone system and tracks the locations of the sources
using a combined near-field/main microphone approach that continuously compares the close
microphone signals to the main microphone signals to determine the location of each source.
The signals are then transmitted over the internet and reproduced using an immersive audio
system based on wave field synthesis or virtual microphone control.
Acknowledgments
The CRAIVE-Lab has been developed and erected with support from the National Science
Foundation Grant No. 1229391.
References
[1] J. Braasch, C. Chafe, P. Oliveros, and D. Van Nort. Mixing console design considerations for telematic
music applications. In Proc. of the 127th Convention of the Audio Eng. Soc., 2009. Paper Number 7942.
[2] J. Braasch, N. Peters, P. Oliveros, D. Van Nort, and C. Chafe. A spatial auditory display for telematic
music performances. In Principles and Applications of Spatial Hearing, pages 436–451. 2011.
[3] J. Braasch, N. Peters, and D. Valente. A loudspeaker-based projection technique for spatial music
application using virtual microphone control. Computer Music Journal, 32(3):55–71, 2008.
[4] J. Braasch and N. Tranby. A sound-source tracking device to track multiple talkers from microphone
array and lavalier microphone data. In 19th International Congress on Acoustics, Madrid, Spain,
Sept. 2–7, 2007. ELE-03-009.
[5] J. Braasch, D. Valente, and N. Peters. An immersive audio environment with source positioning based
on virtual microphone control (ViMiC). In Proc. of the 123rd Convention of the Audio Eng. Soc., New
York, NY, 2007. Paper Number 7209.
[6] J. Braasch, D. Valente, and N. Peters. Sharing acoustic spaces over telepresence using virtual
microphone control. In Proc. of the 123rd Convention of the Audio Eng. Soc., 2007. Paper Number 7209.
[7] J. Braasch (PI), R. Radke (Co-PI), B. Cutler (Co-PI), J. Goebel (Co-PI), and B. Chang (Co-PI).
MRI: Development of the Collaborative-Research Augmented Immersive Virtual Environment Laboratory
(CRAIVE-Lab), 2012–2016. NSF #1229391.
[8] J. Cáceres and C. Chafe. JackTrip: Under the hood of an engine for network audio. In Proceedings
of the International Computer Music Conference, Montreal, QC, Canada, Aug. 2009.
[9] S. Clapp, J. Botts, A. Guthrie, J. Braasch, and N. Xiang. Using spherical microphone array beamforming
and Bayesian inference to evaluate room acoustics (conference abstract). J. Acoust. Soc. Am.,
132:2058, 2012.
[10] J. Cooperstock, J. Roston, and W. Woszczyk. Broadband networked audio: Entering the era of
multisensory data distribution. In 18th International Congress on Acoustics, Kyoto, April 2004.
[11] P. Oliveros, J. Watanabe, and B. Lonsway. A collaborative Internet2 performance. Technical report,
Offering Research In Music and Art, Orima Inc., Oakland, CA, 2003.
[12] R. Rowe and N. Rolnick. The technophobe and the madman: An Internet2 distributed musical. In
Proc. of the Int. Computer Music Conf., Miami, Florida, November 2004.
[13] F. Schroeder, A. Renaud, P. Rebelo, and F. Gualdas. Addressing the network: Performative strategies
for playing apart. In Proc. of the 2007 International Computer Music Conference (ICMC 07), pages
133–140, 2007.
[14] J. C. Steinberg and W. B. Snow. Auditory perspective – physical factors. Electrical Engineering,
Jan.:12–17, 1934.