Virtual Acoustics: Paper ICA2016-826

An immersive teleconferencing system using spherical microphones and wave field synthesis

Jonas Braasch (a), Jeff Carter (b), Samuel Chabot (c), Jonathan Mathews (d)

(a) Rensselaer Polytechnic Institute, Troy, United States, [email protected]
(b) Rensselaer Polytechnic Institute, Troy, United States, [email protected]
(c) Rensselaer Polytechnic Institute, Troy, United States, [email protected]
(d) Rensselaer Polytechnic Institute, Troy, United States, [email protected]

Abstract

Although the use of videoconferencing systems has become very common, only a few attempts have been made to transmit spatially correct audio. One reason for this is that traditional stereophonic microphone systems cannot be used with a bi-directional transmission scheme. Because such systems are based on capturing sound sources from the far field, their use is prone to acoustic feedback. To avoid feedback, the sound has to be captured with closely positioned microphones or a beamforming microphone system. A solution based on a spherical microphone is proposed that preserves spatial cues while avoiding acoustic feedback. The custom-built microphone consists of 16 capsules embedded in a sphere. Higher-order ambisonics is used to analyze the sound spatially and to produce beamforming patterns. The Microphone-aided Computational Auditory Scene Analysis (MaCASA) algorithm is used to track and capture sound sources in real time. The spherical microphone can either be used as a beamformer or as a sound localization system to track participants with wearable microphones. In both cases, a wave field synthesis (WFS) system is used to reproduce the sound in a spatially correct manner. The Collaborative-Research Augmented Immersive Virtual Environment Laboratory (CRAIVE-Lab) serves as the main site for this research. The lab includes a 134-channel sound system that is used for WFS and a seamless 360-degree video projection over a floor area of 12 m x 10 m. Two satellite labs, one containing a 64-channel WFS system and another with a 24-channel ambisonics system, serve as remote sites.

Keywords: Virtual acoustics, telepresence, wave-field synthesis, spherical microphone

Figure 1: Sketch of a microphone-based recording and reproduction set-up.

1 Introduction

Live networked musical performances have gained in popularity over the last few years. In these concerts, musicians are distributed over at least two remote venues and connected via the internet. Some of the challenging technical requirements that these projects impose on the underlying research have been addressed in previous work [11, 12, 10, 13]. One of the main challenges is the accurate spatial reproduction of the transmitted sound field at the remote end, because all sound sources have to be captured from a close distance to avoid echoes. For this reason, the main-microphone techniques that operate from a distance to capture spatial sound do not work for this application. The work presented here is an extension of previous work [6, 1, 2]; the focus of this paper is to describe the integration of our technology into our new virtual-reality infrastructure, the CRAIVE-Lab [7].
The idea for such research goes back to the 1930s, when Steinberg and Snow described a system that enabled the world-renowned conductor Leopold Stokowski and the Philadelphia Orchestra to broadcast music live from Philadelphia to Washington, D.C. The authors used their then newly invented main-microphone techniques to produce a stereophonic image from the recorded sound. Figure 1 shows a general diagram of how microphones and loudspeakers have to be set up and how the signals have to be routed for stereophonic imagery. The spatial positions of sound sources in the recording space are encoded by placing and orienting two or more microphones – the main microphone array – strategically, capturing spatial information through the time and level differences between the microphone channels. Each channel is then independently transmitted, amplified, and fed to the matching loudspeaker of an array of at least two speakers, for example the classic stereo set-up shown in Fig. 1. The box in this figure that connects the microphones to the loudspeakers can be an amplifier, a broadcasting unit, or a sound recording/reproduction system. Steinberg and Snow used two to three parallel telephone lines to transmit the spatially encoded sound from Philadelphia to Washington, D.C. [14].

Figure 2: Feedback loop in a telematic transmission.

While we now experience music broadcasts via radio, satellite, and the internet in our daily life, music collaborations in which ensemble members are distributed over long distances are still in the experimental stage because of technical difficulties associated with two-way or multicast connections. One such challenge is the susceptibility of bidirectional set-ups to feedback loops, which are prone to audible colorations and echoes. Figure 2 demonstrates the general problem: the microphone signal recorded at Site A is reproduced through a loudspeaker at Site B, where it is picked up by a second microphone. This microphone signal is then transmitted back to the original Site A, where it is re-captured by the first microphone. Because of the transmission latency, the feedback becomes audible as an echo at much lower gains than in the feedback situation known from local public address systems.
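To make the role of transmission latency concrete, the following minimal sketch simulates the loop of Fig. 2 (Python with NumPy; the sample rate, the 50 ms one-way latency, and the 0.5 loop gain are illustrative assumptions, not values from the paper). Even with a loop gain well below unity, the re-captured copies of the source are separated by the round-trip delay and are therefore heard as discrete echoes rather than as the coloration familiar from local public address feedback.

```python
import numpy as np

# Illustrative values (not taken from the paper)
fs = 48000                 # sample rate in Hz
one_way_delay_s = 0.050    # 50 ms network latency per direction
loop_gain = 0.5            # acoustic/electrical coupling per round trip (< 1, so stable)
n_echoes = 4               # number of round trips to render

delay_samples = int(round(2 * one_way_delay_s * fs))   # round-trip delay A -> B -> A

# Dry source at Site A: a single click, so the echoes are easy to see
x = np.zeros(fs)
x[0] = 1.0

# Each round trip adds a copy attenuated by loop_gain and delayed by the round-trip time
y = x.copy()
for k in range(1, n_echoes + 1):
    shift = k * delay_samples
    if shift < len(y):
        y[shift:] += (loop_gain ** k) * x[:len(y) - shift]

# Times (in ms) at which the source reappears at Site A as an audible echo
echo_times_ms = [1000.0 * k * delay_samples / fs for k in range(1, n_echoes + 1)]
print(echo_times_ms)   # e.g. [100.0, 200.0, 300.0, 400.0] for 50 ms one-way latency
```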
Many popular audio/videoconferencing systems such as iChat or Skype use echo cancellation to suppress feedback. In speech communication, echo cancellation works well, since the back-and-forth nature of spoken dialogue usually allows the transmission channel to be suppressed temporarily in one direction. In simultaneous music communication, however, this procedure tends to cut off part of the performance. Spectral alterations are a common side effect if the echo-cancellation system operates with a filter bank. For these reasons, the authors suggest avoiding echo-cancellation systems completely. Instead, it is proposed to capture all instruments from a close distance (e.g., with lavalier microphones) to minimize the gain and therefore the risk of feedback loops. Unfortunately, the exclusive use of near-field microphones contradicts the original idea of Steinberg and Snow, since the main microphones have to be placed at a greater distance to capture the sound field stereophonically. To resolve this conflict, this paper describes an alternative approach that simulates main-microphone signals from closely captured microphone signals and geometric data.

The system – called Virtual Microphone Control (ViMiC) – includes room simulation software to construct a multichannel audio signal from a dry recording as if it had been recorded in a particular room [3]. The position data of the sound sources, which are needed to compute the main-microphone signals, are estimated using a microphone array. The array, which is optimized to locate multiple sound sources, is installed at each co-located venue to track the positions of the sound sources. The recorded position data are transmitted to the remote venue(s) along with the acoustic signals that were recorded in the near field of the instruments. At the remote end, the sound can then be projected with a correct spatial image using the ViMiC system. The low-latency audio transmission software JackTrip [8] is used for audio transmission.

Figure 3: Floor plan of the CRAIVE-Lab. The spherical microphone is located in the center between the video projectors, with additional support from shotgun microphones.

Figure 4: Landscape panorama photo (Ithaca, NY) displayed at the CRAIVE-Lab.

2 The CRAIVE-Lab

The Collaborative-Research Augmented Immersive Virtual Environment Laboratory (CRAIVE-Lab), which is used as the main host for our telepresence research, was built to address the need for a specialized virtual-reality (VR) system for studying and enabling communication-driven tasks with groups of users immersed in a high-fidelity, multi-modal environment. For the visual domain, a front-projection display consisting of eight independent projectors creates scenes on a seamless screen. For the acoustic domain, a 134-loudspeaker-channel system has been designed and installed for Wave Field Synthesis (WFS), with support for Higher-Order-Ambisonics (HoA) sound projection to render inhomogeneous acoustic fields. An intelligent position-tracking system estimates current user locations and head orientations as well as positioning data for other objects.

3 Audio reproduction system

The audio system in the CRAIVE-Lab consists of 134 active loudspeakers (JBL SR308) – see Fig. 3. The majority of the loudspeakers, 128 units, are mounted at ear height on a shelf system that is part of the projection screen frame, where they are located behind the micro-perforated screen material. The electrical outlets for the speakers are mounted directly on the shelves carrying the speakers, and eight separate electrical circuits power the loudspeakers. Six additional loudspeakers are mounted at the ceiling at the locations shown in Fig. 3. The audio computer (Apple Mac Pro) hosts two RME MADI-to-ADAT cards, which are connected to 16 ADAT-to-analog converters. All gear is mounted in a rack as shown in the left photo of Fig. 5. The six ceiling speakers are driven by a separate audio card (M-Audio 1814), because the two RME MADI cards are fully loaded with the 128-channel system. The rack also holds the video computer and the computer for the 6-camera array. All 134 loudspeakers have been cabled with individual XLR cables, with a total cable length of over 2 miles – see the center photo of Fig. 5.

Figure 5: Left: CRAIVE-Lab audio rack; center: CRAIVE cabling; right: custom-built 32-channel microphone array [9].
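As an illustration of how the 134 output channels described above might be addressed in software, the following sketch maps a loudspeaker index to a physical output. The function name, device labels, and channel ordering are hypothetical; the paper only states that 128 channels run over the two RME MADI cards feeding the ADAT-to-analog converters and that the six ceiling speakers use a separate interface, so a 64-channels-per-MADI-card split is assumed here.

```python
# Hypothetical channel map for the 134-channel system: 128 ear-height speakers on two
# 64-channel MADI streams, plus 6 ceiling speakers on a separate audio interface.
# Device names and the exact ordering are illustrative, not taken from the paper.

def speaker_to_output(speaker_index: int):
    """Map a loudspeaker index (0-133) to (device, channel) on that device."""
    if not 0 <= speaker_index < 134:
        raise ValueError("speaker index out of range")
    if speaker_index < 64:
        return ("madi_card_1", speaker_index)              # first RME MADI card
    if speaker_index < 128:
        return ("madi_card_2", speaker_index - 64)         # second RME MADI card
    return ("ceiling_interface", speaker_index - 128)      # six ceiling speakers

# Example: the last ear-height speaker and the first ceiling speaker
print(speaker_to_output(127))   # ('madi_card_2', 63)
print(speaker_to_output(128))   # ('ceiling_interface', 0)
```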
4 Sound spatialization using Wave Field Synthesis (WFS)

Within the CRAIVE-Lab, we have been using Wave Field Synthesis (WFS), Higher-Order Ambisonics (HoA), and Virtual Microphone Control (ViMiC) to render sound fields. WFS is based on the Kirchhoff-Helmholtz Integral (KHI), which states that the sound-pressure and particle-velocity fields in a source-free volume can be determined if the sound pressure and particle velocity are known on a closed surface surrounding that volume. In practice, the WFS algorithm calculates the sound pressures along this surface for a virtual external sound source. Loudspeakers that are densely positioned on this theoretical surface then reproduce the signals according to the calculated values, by adjusting the gains and delays for each speaker.

A wave field synthesis system can easily be simulated using ViMiC by following Steinberg and Snow's original approach [14]. Instead of placing a curtain of real microphones in a concert hall, an array of virtual microphones can be set up in the ViMiC environment. Ideally, the virtual microphone positions should correspond to the loudspeaker positions of the sound projection setup to capture the virtual wave front of a point source. The signal of the nth virtual microphone is determined through

y_n(t, r_n) = g \cdot x(t - \tau) = g_d(r_n) \cdot x\!\left(t - \frac{r_n}{c_s}\right),   (1)

with the distance r_n between the nth microphone and the sound source, the distance-dependent gain g_d(r_n), and the speed of sound c_s. In principle, the results achievable with the ViMiC WFS approach are identical to traditional WFS implementations, and corrections for truncated and cornered arrays can be simulated through changes in the positions and in the directional and frequency-dependent sensitivities of the virtual microphones. Since WFS can be integrated into the general framework of ViMiC, no additional software has to be utilized to create a WFS system or subsystem.
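As a concrete illustration of Eq. (1), the sketch below (Python with NumPy) renders a dry source signal to a set of virtual microphones co-located with the loudspeakers: each channel is the source delayed by r_n/c_s and scaled by a distance-dependent gain. The coordinates and the 1/r gain law are illustrative assumptions, since the paper does not specify g_d; fractional-delay interpolation and the truncation corrections mentioned above are omitted for brevity.

```python
import numpy as np

C_S = 343.0   # speed of sound in m/s

def render_virtual_microphones(x, fs, source_pos, mic_positions):
    """Delay-and-gain rendering of Eq. (1): one output channel per virtual microphone.

    x            : dry (near-field) source signal, 1-D array
    fs           : sample rate in Hz
    source_pos   : (x, y) position of the virtual point source in meters
    mic_positions: iterable of (x, y) virtual-microphone (= loudspeaker) positions
    """
    outputs = []
    for mic_pos in mic_positions:
        r_n = np.linalg.norm(np.asarray(mic_pos) - np.asarray(source_pos))
        delay_samples = int(round(fs * r_n / C_S))       # tau = r_n / c_s
        gain = 1.0 / max(r_n, 0.1)                       # illustrative 1/r gain for g_d(r_n)
        y_n = np.zeros(len(x) + delay_samples)
        y_n[delay_samples:] = gain * x                   # y_n(t) = g_d(r_n) * x(t - r_n/c_s)
        outputs.append(y_n)
    # Pad all channels to a common length and stack into a (num_mics, num_samples) array
    max_len = max(len(y) for y in outputs)
    return np.stack([np.pad(y, (0, max_len - len(y))) for y in outputs])

# Example: a short click rendered to a small line of virtual microphones
fs = 48000
x = np.zeros(fs // 10)
x[0] = 1.0
mics = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]             # illustrative positions in meters
channels = render_virtual_microphones(x, fs, source_pos=(0.0, 3.0), mic_positions=mics)
print(channels.shape)
```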
Figure 6: Setup of the acoustic tracking system with a main microphone system and near-field microphones. Also shown is the reproduction side, which accurately positions the near-field recordings, for example in a telepresence scenario.

5 MaCASA-based sound source tracking system

A custom-built 16-channel (or alternatively 32-channel) spherical microphone serves as the main sound localization system – see the right photo of Fig. 5. A Sennheiser wireless system with four microphone units is used to capture voice signals and musical instruments from a close distance. The Microphone-aided Computational Auditory Scene Analysis (MaCASA) system builds on a combination of user-worn near-field microphones (lavalier microphones) and a main microphone array (see Fig. 6) to localize individual sounds within a mixture. Our previous main-microphone array design [4, 5, 3] utilized arrival-time differences between five spatially configured, omnidirectional microphones (inter-channel time differences) to determine the sound source positions. The major problem with this design was the sensitivity of the cross-correlation algorithm to room reverberation: while the method worked flawlessly in near-anechoic conditions, the individual microphone channels proved to be too decorrelated in the presence of reverberation to obtain the robust time-of-arrival differences necessary for accurately determining the directions of the sound sources. As a consequence, we decided to replace the original 5-microphone array with the aforementioned spherical microphone.

Tracking multiple simultaneous sound sources is still a challenge in collaborative environments. Our solution to this problem is to use near-field microphone signals in conjunction with a traditional microphone-array-based localization system. It is quite common to employ these types of microphones for other tasks as well, such as speech recognition and telecommunication. The near-field microphone signals are used to determine the signal-to-noise ratios (SNRs) between several sound sources, such as concurrently playing musicians, while still serving the main purpose of capturing the audio signals. The running SNR is calculated frequency-wise from the acoustic energy recorded in a given time interval:

SNR_{i,m} = 10 \log_{10} \left( \frac{1}{a} \int_{t_m}^{t_m + \Delta t} p_i^2 \, dt \right),   (2)

with:

a = \sum_{n=1}^{i-1} \int_{t_m}^{t_m + \Delta t} p_n^2 \, dt + \sum_{n=i+1}^{N} \int_{t_m}^{t_m + \Delta t} p_n^2 \, dt,   (3)

where p_i is the sound pressure captured with the ith near-field microphone, t_m the beginning of the measured time interval m, \Delta t its duration, and N the number of near-field microphones.

Figure 7: Estimation of the signal-to-noise ratios for each sound source.

The SNRs are measured for each time interval between each observed sound source and the remaining sound sources. The data can then be used to select and weight those time slots in which a sound source dominates the scene, assuming that in this case the SNR is high enough for the microphone array to provide stable localization cues. Figure 7 depicts the core idea. In this example, a good time slot is found for Sound Source 1 in the third time frame, which contains a large amount of energy for this source while the recorded energy for Sound Source 2 is very low. Time Slot 6 depicts an example where a high SNR is found for the second sound source.

To improve the quality of the algorithm, all data are analyzed frequency-wise. For this purpose, the signals are sent through an octave-band filter bank before the SNR is determined. The SNR is then a function of frequency f, time interval t, and the index of the sound source. The sound source position is determined for each time/frequency slot by analyzing the spherical harmonics captured by the microphone array. Since this technique cannot resolve two sound sources within one time/frequency bin, the estimated position is assigned to the sound source with the highest SNR. Alternatively, the information in each band can be weighted with the SNR in this band. To minimize computational load, a minimum SNR threshold can be set, below which the localization algorithm is not activated for the corresponding time/frequency slot.
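For illustration, a minimal sketch of the running SNR of Eqs. (2) and (3) is given below (Python with NumPy). The 50 ms frame length is an illustrative assumption, and the octave-band filter bank is omitted, so the result is broadband rather than frequency-wise; applying the same function to each filtered band would yield the frequency-dependent SNR described above.

```python
import numpy as np

def running_snr(signals, fs, frame_s=0.05, eps=1e-12):
    """Frame-wise SNR of each near-field channel against the sum of the others, Eqs. (2)-(3).

    signals : array of shape (N, num_samples), one row per near-field microphone
    fs      : sample rate in Hz
    frame_s : analysis interval Delta t in seconds
    Returns an array of shape (N, num_frames) with SNR values in dB.
    """
    signals = np.asarray(signals, dtype=float)
    n_mics, n_samples = signals.shape
    frame_len = int(frame_s * fs)
    n_frames = n_samples // frame_len

    # Energy of each channel in each frame: integral of p_i^2 over the interval
    frames = signals[:, :n_frames * frame_len].reshape(n_mics, n_frames, frame_len)
    energy = np.sum(frames ** 2, axis=2)                  # shape (N, num_frames)

    # a = energy of all *other* channels in the same frame (Eq. 3)
    other_energy = energy.sum(axis=0, keepdims=True) - energy
    return 10.0 * np.log10((energy + eps) / (other_energy + eps))   # Eq. (2)

# Example with two synthetic channels: channel 0 active in the first half, channel 1 in the second
fs = 16000
t = np.arange(fs) / fs
ch0 = np.where(t < 0.5, np.sin(2 * np.pi * 220 * t), 0.01 * np.random.randn(fs))
ch1 = np.where(t >= 0.5, np.sin(2 * np.pi * 330 * t), 0.01 * np.random.randn(fs))
snr_db = running_snr(np.stack([ch0, ch1]), fs)
print(np.round(snr_db[:, [0, -1]], 1))   # channel 0 dominates early frames, channel 1 late frames
```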
6 Conclusion and outlook

In this paper, we have presented a framework to enable bi-directional telepresence communication for live music performances. The system captures the audio signals from a close distance with a wireless microphone system and tracks the locations of the sources using a combined near-field/main-microphone approach that continuously compares the close microphone signals to the main-microphone signals. The signals are then transmitted over the internet and reproduced using an immersive audio system based on wave field synthesis or virtual microphone control.

Acknowledgments

The CRAIVE-Lab has been developed and erected with support from the National Science Foundation, Grant No. 1229391.

References

[1] J. Braasch, C. Chafe, P. Oliveros, and D. Van Nort. Mixing console design considerations for telematic music applications. In Proc. of the 127th Convention of the Audio Eng. Soc., 2009. Paper Number 7942.

[2] J. Braasch, N. Peters, P. Oliveros, D. Van Nort, and C. Chafe. A spatial auditory display for telematic music performances. In Principles and Applications of Spatial Hearing, pages 436–451. 2011.

[3] J. Braasch, N. Peters, and D. Valente. A loudspeaker-based projection technique for spatial music application using virtual microphone control. Computer Music Journal, 32(3):55–71, 2008.

[4] J. Braasch and N. Tranby. A sound-source tracking device to track multiple talkers from microphone array and lavalier microphone data. In 19th International Congress on Acoustics, Madrid, Spain, Sept. 2-7, 2007. ELE-03-009.

[5] J. Braasch, D. Valente, and N. Peters. An immersive audio environment with source positioning based on virtual microphone control (ViMiC). In Proc. of the 123rd Convention of the Audio Eng. Soc., New York, NY, 2007. Paper Number 7209.

[6] J. Braasch, D. Valente, and N. Peters. Sharing acoustic spaces over telepresence using virtual microphone control. In Proc. of the 123rd Convention of the Audio Eng. Soc., New York, NY, 2007. Paper Number 7209.

[7] J. Braasch (PI), R. Radke (Co-PI), B. Cutler (Co-PI), J. Goebel (Co-PI), and B. Chang (Co-PI). MRI: Development of the collaborative-research augmented immersive virtual environment laboratory (CRAIVE-Lab), 2012–2016. NSF #1229391.

[8] J. Cáceres and C. Chafe. JackTrip: Under the hood of an engine for network audio. In Proceedings of the International Computer Music Conference, Montreal, QC, Canada, Aug. 2009.

[9] S. Clapp, J. Botts, A. Guthrie, J. Braasch, and N. Xiang. Using spherical microphone array beamforming and Bayesian inference to evaluate room acoustics (conference abstract). J. Acoust. Soc. Am., 132:2058, 2012.

[10] J. Cooperstock, J. Roston, and W. Woszczyk. Broadband networked audio: Entering the era of multisensory data distribution. In 18th International Congress on Acoustics, Kyoto, April 2004.

[11] P. Oliveros, J. Watanabe, and B. Lonsway. A collaborative Internet2 performance. Technical report, Offering Research In Music and Art, Orima Inc., Oakland, CA, 2003.

[12] R. Rowe and N. Rolnick. The technophobe and the madman: An Internet2 distributed musical. In Proc. of the Int. Computer Music Conf., Miami, Florida, November 2004.

[13] F. Schroeder, A. Renaud, P. Rebelo, and F. Gualdas. Addressing the network: Performative strategies for playing apart. In Proc. of the 2007 International Computer Music Conference (ICMC 07), pages 133–140, 2007.

[14] J. C. Steinberg and W. B. Snow. Auditory perspective – physical factors. Electrical Engineering, pages 12–17, Jan. 1934.