SMPTE Meeting Presentation

Scalable Format and Tools to Extend the Possibilities of Cinema Audio

Charles Q Robinson
Dolby Laboratories, Inc., 100 Potrero Ave., San Francisco, CA 94103
[email protected]

Sripal Mehta
Dolby Laboratories, Inc., 100 Potrero Ave., San Francisco, CA 94103
[email protected]

Nicolas Tsingos
Dolby Laboratories, Inc., 100 Potrero Ave., San Francisco, CA 94103
[email protected]

Written for presentation at the 2012 SMPTE Annual Technical Conference

Abstract. Surround sound has been making cinematic storytelling more compelling and immersive for over 30 years. The first widely deployed surround systems used magnetic recording. Later, optical recording became standard, enabling up to 7.1 channels of audio. With the transition from film to digital distribution there is an opportunity for the next generational step forward. In this paper we describe a new surround sound format that dramatically advances the capabilities of cinema sound. The format was developed in close cooperation with industry stakeholders. It was designed to provide the most desired new capabilities and a path for future enhancements, while respecting and leveraging the strengths and know-how of the current sound format and pipeline. In particular, the new system maintains and advances the ability to deliver impeccable audio quality, and flexibly extends the creative possibilities to meet the needs and aspirations of both content creators and exhibitors.

Keywords. Cinema, Surround, Sound, Spatial Audio.

The authors are solely responsible for the content of this technical presentation. The technical presentation does not necessarily reflect the official position of the Society of Motion Picture and Television Engineers (SMPTE), and its printing and distribution does not constitute an endorsement of views which may be expressed.
This technical presentation is subject to a formal peer-review process by the SMPTE Board of Editors, upon completion of the conference. Citation of this work should state that it is a SMPTE meeting paper. EXAMPLE: Author's Last Name, Initials. 2011. Title of Presentation, Meeting name and location.: SMPTE. For information about securing permission to reprint or reproduce a technical presentation, please contact SMPTE at [email protected] or 914-761-1100 (3 Barker Ave., White Plains, NY 10601).

Copyright © 2012 Society of Motion Picture and Television Engineers. All rights reserved.

Introduction

In this paper we present a new audio format that achieves unprecedented levels of audience immersion and engagement. It does so by offering powerful new authoring tools to mixers and a flexible rendering engine that optimizes the audio quality and spatial impression for each room's loudspeaker layout and characteristics. At the same time, the system has been designed to maintain backwards compatibility and minimize the impact on production, distribution and exhibition.

The introduction of a new audio format allows for changes in the design of sound systems without breaking compatibility with existing practices. At the heart of the new format is a novel Spatial Audio Description model. To maximize the benefits of the new audio description model, and to address existing shortcomings of cinema sound reproduction, a new speaker layout and new speaker capabilities have been derived. To support these features, new authoring, distribution and exhibition tools and systems were developed. In this paper we describe each aspect in turn.

Spatial Audio Description

Dolby has long recognized the potential benefits of moving beyond "speaker feeds" as a means for distributing spatial audio. Recently there has been considerable interest in alternative spatial audio description methods. This interest is manifest in both the academic community and industry, with systems proposed by Dolby and others.
At a high level, there are three main spatial audio description formats:

• Speaker feed - the audio is described as signals intended for loudspeakers at nominal speaker positions. Binaural audio is a special case where the speakers are located at the left and right ears.
• Model- or Object-based description - the audio is described in terms of a sequence of audio events at specified positions.
• Sound field description - describes the acoustic sound field, not a set of sound sources (e.g. objects or speakers). For example, an acoustic sound field can be described within a region using spherical harmonics.

The speaker-feed format is the most common because it is simple and effective. If the playback system is known in advance, mixing, monitoring and distributing a speaker feed description that identically matches the target configuration provides the highest fidelity. However, in most cases the playback system is not known and can only be assumed to conform to a general standard (e.g. stereo or 5.1). In cinema, the precise location and even the number of loudspeakers vary substantially. Deviation from nominal speaker placement distorts the spatial information; timbre, however, is generally well preserved. For content where spatial accuracy is not critical, the speaker-feed format is effective. There is a large body of excellent stereo and multi-channel audio programs that support this statement.

The object-based description is the most adaptable because it makes no assumptions about the rendering technology and can therefore be applied to any of them. This adaptability gives the listener or exhibitor the freedom to select a playback configuration that suits their individual needs or budget, with the audio rendered specifically for their chosen configuration.
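To make the adaptability of an object-based description concrete, the following minimal sketch renders one positioned audio object to two different loudspeaker layouts. Everything in it is an illustrative assumption rather than the renderer described in this paper: the `AudioObject` type, the normalized room coordinates, the two layouts, and the distance-weighted panning law are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class AudioObject:
    """Hypothetical object: a name plus a position in normalized room coordinates."""
    name: str
    x: float  # 0.0 = left wall, 1.0 = right wall
    y: float  # 0.0 = screen, 1.0 = back wall


def render_gains(obj, speakers, rolloff=2.0):
    """Return power-normalized gains per speaker using distance weighting.

    This is a generic illustrative panning law, not the one used by any product.
    """
    weights = {}
    for label, (sx, sy) in speakers.items():
        distance = math.hypot(obj.x - sx, obj.y - sy)
        weights[label] = 1.0 / (distance + 1e-3) ** rolloff
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {label: w / norm for label, w in weights.items()}


def loudest(obj, speakers):
    """Label of the speaker receiving the highest gain."""
    gains = render_gains(obj, speakers)
    return max(gains, key=gains.get)


# Two assumed layouts; positions are placeholders, not recommended placements.
LAYOUT_5 = {"L": (0.0, 0.0), "C": (0.5, 0.0), "R": (1.0, 0.0),
            "Ls": (0.0, 1.0), "Rs": (1.0, 1.0)}
LAYOUT_7 = {"L": (0.0, 0.0), "C": (0.5, 0.0), "R": (1.0, 0.0),
            "Lss": (0.0, 0.5), "Rss": (1.0, 0.5),
            "Lrs": (0.25, 1.0), "Rrs": (0.75, 1.0)}

fly = AudioObject("fly", x=0.95, y=0.6)
print(loudest(fly, LAYOUT_5))  # Rs
print(loudest(fly, LAYOUT_7))  # Rss
```

The same object description yields a different speaker assignment in each room, which is the property the object-based format relies on: the position travels with the audio and the gains are computed only once the layout is known.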
The model-based description efficiently captures high resolution spatial information and enables accurate and lifelike reproduction that is particularly effective for discrete audio images. The object-based model includes much information beyond position, including size. The new audio format described in this paper combines the speaker-feed and object-based scene description methods.

Hybrid format

A hybrid channel- and object-based audio system provides all the benefits of a traditional speaker-based format:

• High timbre control and fidelity,
• Direct control of speaker signals when desired (e.g. screen channels),
• Efficient transmission of dense audio ambiences and textures,
• Traditional authoring options that allow mixers to make use of their experience and expertise, while incorporating the new capabilities at their own pace,

and extends the capabilities to include the following benefits:

• More immersion and envelopment,
• Increased spatial resolution, e.g. an audio object can be dynamically assigned to any one or more loudspeakers within a traditional surround array,
• Ability to effectively bring sound images off screen,
• Single inventory distribution compatible with effective adaptation to alternative rendering modes including 5.1 and 7.1,
• Familiar surround mixing paradigm.

The front end of the mixing process is identical to that of existing tools, as shown in Figure 1, which compares the speaker-feed and object-based audio pipelines. In the object-based pipeline, the rendering step is delayed until after distribution.

[Figure 1: block diagrams with stages Authoring, Distribution and Exhibition. Panel (a), Speaker Feed Spatial Audio Description: Raw Audio Elements, Mix Metadata, Panner, Channel Config., Renderer, DCP. Panel (b), Object-based Spatial Audio Description: Raw Audio Elements, Mix Metadata, Panner, DCP, Speaker Positions, Renderer.]

Figure 1. Comparison of audio pipeline for (a) speaker-feed or channel-based audio and (b) object-based spatial audio descriptions. The audio production tools are similar in both cases, and in both cases the audio is carried within the D-Cinema Package (DCP). In the latter case, final rendering occurs at exhibition, based on the actual speaker configuration.

Speaker and Channel configuration

In this section we present the results of investigations into the optimal speaker configuration. Optimality included practical as well as performance considerations. The preferred solution is deliberately an extension of existing recommendations. The detailed recommendation can be found in [1].

The derivation of the new recommendation was driven by direct input from the creative community, as well as psycho-acoustic principles and experiments. The solution provides the enhancements most desired by mixers, maximizes the impact on the listener, provides a uniform experience throughout the listening area, and integrates well into existing theaters.

Top Surround

Existing cinema recommendations include speakers mounted on all four walls, providing an excellent "surround" capability. What is missing, however, is elevation. The addition of ceiling loudspeakers literally provides a new dimension for sound. The benefits include increased immersion through the presence of overhead sounds, as well as giving mixers the ability to move sounds into the room, away from the walls.

Consideration of elevation for audio reproduction is not a new idea [2]. The challenge is finding the best reproduction format for cinema which, in addition to the goals of listener envelopment and imaging, must support a wide dynamic range and a large listening area. To solve this problem, let's begin with a brief review of the typical implementation of surround sound.
The wall-mounted side and back surround speakers are typically placed well above the heads of the audience (to improve coverage and keep the speakers safe from tampering), often approaching the ceiling. The side and back (in 7.1 systems) channels fill the theater with sound using arrays of loudspeakers to create a sufficient and uniform sound pressure level over a wide listening area. In particular, the left and right side surround arrays in a 7.1 surround format provide uniform coverage through the use of a speaker array spanning the listening area front to back. To achieve the same benefit for height channels, top surround arrays oriented front to back are used.

As noted above, the side surround speakers are generally located high on the side walls. To provide an impression of "elevation" or "overhead" relative to the side surrounds, it is clear that ceiling mounted speakers will have a much greater effect than speakers mounted above the existing surround speakers on the walls. Ceiling mounting is often more difficult than wall mounting, but the benefits are clearly audible.

The details on array count, position and density were determined through rigorous listening experiments. A commercial cinema was used for the experiments. With 205 seats, the room is representative of a medium-sized theater in a modern cineplex. It has a partial stadium seating configuration with a length of 56', a width of 47' and a maximum height of 21'. For testing, additional sound absorption material was installed to bring the reverberation time (RT60) down to approximately 350 ms from 500 Hz to 10,000 Hz, rising to 580 ms at 120 Hz and dropping to 260 ms at 16,000 Hz.

Three top surround arrays were installed on the ceiling as represented in Fig 2. A variety of listeners, listening positions, and stimuli were used to evaluate potential speaker configurations with attention to listener envelopment and spatial imaging.
The listeners were presented with audio (double blind) from a set of alternative speaker configurations derived as subsets of the speakers available. Listeners then graded each configuration.

Fig 2: Experimental Speaker Installation. Top down view; screen is to the left. The 19 blue squares near the centerline of the room represent ceiling mounted speakers.

Results from the array count investigation are described here. The null hypothesis was "all systems equivalent," but it was expected that we would find that "more is better," approaching an asymptotic limit. The experiment was split into two parts. Figure 3 shows the results for a test measuring listener preference for ambient signals rendered in 5 different configurations.

Fig 3: Listener preference for ambient sounds rendered in 5 different channel configurations: mono surround (ms), stereo surround (ss), ss + mono top surround, ss + stereo top surround, and ss + 3 top surround arrays. Stereo surround (ss) was the reference.

A second test measured listener preference for panned point sources. Results averaged over all listeners, listening positions and stimuli are shown in Figure 4.

Fig 4: Listener preference for panned point-source sounds rendered using 4 different speaker configurations: no top surround speakers, one top surround array, 2 top surround arrays, and 3 top surround arrays. In this test the reference was a text description of the intended pan trajectory, with no audio reference.

These experiments indicate the following: Two or three top surround arrays are significantly better than one or none. A third, center array does not improve, and may even decrease, envelopment in the case of ambient signals.
The center array (while useful for some pan trajectories) does not provide much "bang for the buck." On this basis it was determined that two top surround arrays would be used for spatial audio reproduction.

Additional tests to evaluate the length and density of arrays were conducted. The resulting recommendation is shown in Figure 5. Full details on recommended speaker position and aiming can be found in [1].

Individually addressable speakers

The two additional sets of overhead arrays described above can be directly addressed as two new channels in the speaker-feed representation. The full 9.1 channel set is as follows: L, R, C, LFE, Lss, Rss, Lrs, Rrs, Lts and Rts.

While there are now 10 channels, there are many more speakers. For example, Fig 5 shows 42 surround loudspeakers within the six surround zones. The channel renderer will "fan out" channel signals to all the appropriate speakers.

Using the object-based representation, an audio object can be precisely positioned. The object renderer will determine the best speaker or speakers to play back the object audio stream. The speaker(s) used could be within a single array, or span multiple arrays.

Bass Management

A significant shortcoming of traditional cinema surround sound is the lack of "full range" surrounds. Specifically, while typical screen channels have a frequency response extending down to 40 Hz and lower, surround speakers often begin to roll off at 100 Hz. If a full range sound is panned off the screen, the timbre will shift dramatically because the surrounds lack low frequency capability. As a result of this timbre shift (as well as the lack of spatial resolution), mixers hesitate to bring sound objects off the screen. To address this issue, the concept of a left/right "surround direct" loudspeaker pair has been standardized [3], and in some IMAX configurations the Ls and Rs loudspeaker arrays are replaced by a pair of full range speakers.
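A common general remedy for limited low-frequency extension is to redirect bass from the surround feeds to a subwoofer. The sketch below illustrates the idea with a complementary one-pole crossover; it is an assumption-laden illustration, not the filter used by any cinema processor. The function name `redirect_bass`, the first-order filter, and the 100 Hz default crossover are all hypothetical choices made here for clarity.

```python
import math


def redirect_bass(samples, sample_rate=48000, crossover_hz=100.0):
    """Split a surround feed into (to_surround, to_subwoofer).

    Uses a one-pole low-pass and its complement, so the two outputs always
    sum back to the input. Filter order and crossover are assumed values.
    """
    alpha = 1.0 - math.exp(-2.0 * math.pi * crossover_hz / sample_rate)
    low = 0.0
    to_surround, to_subwoofer = [], []
    for x in samples:
        low += alpha * (x - low)      # low-pass state update
        to_subwoofer.append(low)      # low band goes to the subwoofer
        to_surround.append(x - low)   # remainder stays in the surround feed
    return to_surround, to_subwoofer


def sine(freq_hz, seconds=1.0, sample_rate=48000):
    """Helper: unit-amplitude sine test tone."""
    n = int(seconds * sample_rate)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def rms(values):
    """Helper: root-mean-square level."""
    return math.sqrt(sum(v * v for v in values) / len(values))


for freq in (40.0, 1000.0):
    surround, sub = redirect_bass(sine(freq))
    print(freq, round(rms(surround), 3), round(rms(sub), 3))
```

With these assumed settings a 40 Hz tone ends up mostly in the subwoofer feed and a 1 kHz tone mostly in the surround feed, while the sum of the two feeds reproduces the original signal sample for sample.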
The recommendation for Dolby Atmos includes a pair of surround subwoofers. The channel and object renderers redirect low frequency content from the left and right arrays (side, rear and top) to the subwoofers, taking advantage of the limited directionality of low frequency sound. In effect, every surround loudspeaker becomes a surround direct loudspeaker. The appropriate crossover frequency is established during installation based on the capabilities of the surround loudspeakers. Bass redirection is optional; the goal is full range surrounds. If the surround loudspeakers have sufficient low frequency extension, bass redirection is not needed.

Fig. 5: Recommended speaker configuration.

Authoring

Cinema authoring is getting increasingly complex, time-consuming and expensive as content creators strive to get more from cinema sound. New mixing technology should enable new creative options, but it must integrate into existing post production workflows without adding excessive time and money to the process. The hybrid model of channels and objects allows most sound design, editing, pre-mixing, and final mixing to be performed in the same manner as they are today. Plug-in applications for digital audio workstations allow existing panning techniques within sound design and editing to remain unchanged. In this way, it is possible to lay down both channels and objects within the workstation in 5.1-equipped editing rooms. Object audio and metadata are recorded in the session in preparation for the pre- and final-mix stages in the dubbing theatre. Metadata is integrated into the dubbing theatre's console surface, allowing the channel strips' faders, panning and audio processing to work with channels, channel sets ("beds") and audio objects.
The metadata can be edited using either the console surface or the workstation user interface, and the sound is monitored using the Dolby Rendering and Mastering Unit (RMU).

Figure 6. Authoring Workflow, showing combination of Beds and Objects.

A specific panning tool has been developed to allow mixers to freely move sound sources in 3D space while taking advantage of all the speakers present in the environment. Figure 7 illustrates the panner plugin UI for Pro Tools™. The interface is similar to common panning tools existing today but extends the creative palette of the mixer by introducing elevation controls for the sound objects. Sound objects can therefore be freely moved in the 3D space of the room and either panned between several loudspeakers or snapped to the single speaker closest to the intended location. In addition to the object location, the perceived width of the object can also be controlled. Several elevation constraints can also be used so that an object's elevation is automatically adjusted as it crosses the room, following different profiles (sphere, wedge, etc.). Finally, zone exclusion controls (e.g. no side wall, no back wall, etc.) can be enabled, allowing the mixer to finely control the set of speakers involved in rendering a particular pan.

Figure 7. Panner plugin UI for Pro Tools™

The channel and object audio data and associated metadata are recorded during the mastering session to create a "print master", which includes a Dolby Atmos mix and any other rendered deliverables (such as a Dolby Surround 7.1 or 5.1 theatrical mix). This print master file is wrapped using industry-standard MXF wrapping procedures, hashed and optionally encrypted in order to ensure the integrity of the audio content for delivery to the digital cinema packaging facility.
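The panner behaviours described in this section (an elevation constraint, snapping to the nearest speaker, and zone exclusion) can be sketched as follows. This is a hypothetical illustration only: the speaker coordinates, the zone names, and in particular the formula used for the "sphere" profile are assumptions made here, since the paper does not define them.

```python
import math

# Assumed speaker positions (x, y, z) in normalized room coordinates:
# x from left to right wall, y from screen to back wall, z from ear level to ceiling.
SPEAKERS = {
    "Lss": (0.0, 0.5, 0.0), "Rss": (1.0, 0.5, 0.0),
    "Lrs": (0.25, 1.0, 0.0), "Rrs": (0.75, 1.0, 0.0),
    "Lts": (0.25, 0.5, 1.0), "Rts": (0.75, 0.5, 1.0),
}

# Assumed grouping of speakers into excludable zones.
ZONES = {"side wall": {"Lss", "Rss"}, "back wall": {"Lrs", "Rrs"}}


def sphere_elevation(x, y):
    """Hypothetical 'sphere' profile: a dome that peaks at the room center."""
    r2 = (2.0 * x - 1.0) ** 2 + (2.0 * y - 1.0) ** 2
    return math.sqrt(max(0.0, 1.0 - r2))


def snap_to_speaker(position, excluded_zones=()):
    """Return the label of the nearest speaker not in an excluded zone."""
    banned = set().union(*(ZONES[zone] for zone in excluded_zones))
    candidates = {k: v for k, v in SPEAKERS.items() if k not in banned}
    return min(candidates, key=lambda k: math.dist(position, candidates[k]))


# An object near the right wall, with its elevation taken from the profile.
x, y = 0.95, 0.5
position = (x, y, sphere_elevation(x, y))
print(snap_to_speaker(position))                                # Rss
print(snap_to_speaker(position, excluded_zones=["side wall"]))  # Rts
```

Excluding the side wall zone moves the snapped object from the nearest side surround to the nearest overhead speaker, which mirrors the kind of control the zone exclusion feature gives the mixer.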
Distribution

D-Cinema Package (DCP) Integration

Dolby Atmos is carried within a dedicated track file within a D-Cinema Package. The Dolby Atmos Track is kept independent of the Audio Track file within the DCP to allow for backwards compatibility with existing exhibition systems. The Atmos Track file consists of frame-wrapped elements, where each frame element has the same duration as a video frame to allow for easy editing of packages. Each frame contains all the audio and metadata information necessary to render one frame of audio.

In order to maintain synchronization during playback, a timing signal is embedded into a channel of the main Audio Track file containing the UUID of the Audio Track file and a frame count for the given frame of audio. A dedicated universal label is given to the synchronization track to prevent the integrated media block from sending it to an audio processor that is not Dolby Atmos aware.

Audio Packing

Like standard D-Cinema Audio Tracks, the audio format supports transmission of 24 bit PCM audio at sample rates of 96 kHz and 48 kHz. The distribution format supports an arbitrarily large number of tracks (initial product hardware is limited to 128 tracks). The resulting data expansion is mitigated through the use of lossless data compression.

The packaging tool accepts input PCM audio at 48 kHz or 96 kHz. The input format is losslessly packed, and a high quality sample rate converter is used to generate a signal at the other sample rate. The audio is efficiently packed as a 48 kHz layer and a 96 kHz difference layer. The cinema processor can unpack just the 48 kHz layer or both layers. In this way a single package can support low-cost 48 kHz cinema installations as well as full 96 kHz cinemas.

The lossless packing minimizes the size of the Atmos Audio Track, enabling tens or even hundreds of audio tracks (including distinct audio objects) to be efficiently distributed.
Internal testing has shown that the packaging tool can compress non-trivial audio signals at approximately a 2:1 ratio. In addition, the compression tool is able to compress low level and digital-zero signals much more efficiently. In the case of the first full feature mix, the packaged Atmos audio track was smaller than the standard 7.1 audio track.

Figure 8. Lossless Sample Scalable audio packing and unpacking. Figure shows scenario with lossless packing of 96 kHz audio. 48 kHz input audio is also supported.

Exhibition

Integration (Server, CP, Sync)

The Dolby Atmos system was designed to integrate within the existing digital cinema networking infrastructure. An additional 1000BaseT Ethernet connection is established between the digital cinema server and the Dolby Atmos cinema processor to provide a dedicated link for the Dolby Atmos data. This allows for greater reliability in the theater environment and does not take bandwidth away from the other digital cinema audio and video assets, since they are sent over a separate, dedicated link to the integrated media block.

Figure 9. Exhibition integration

Audio is routed from the integrated media block to the Dolby Atmos cinema processor over 8 pairs of AES-3id. The cinema processor can supply signal to a large number of amplifiers via a Dolby Atmos Connect audio-over-Ethernet link that can be upgraded to industry standards such as Ethernet AVB when available.

Audio and Video Synchronization

The Dolby Atmos playback system was designed for seamless integration with existing digital cinema systems and content. The Dolby Atmos cinema processor serves as both the Dolby Atmos renderer and the B-chain processor for all audio within the theater environment.
In order to perform both functions for both Dolby Atmos and traditional cinema content, the cinema processor must detect and seamlessly switch between Dolby Atmos playback and legacy playback. To support seamless switching, the cinema processor searches for a synchronization signal embedded within the main audio of the DCP. If this synchronization signal is present, Dolby Atmos data is present, and the synchronization information is aligned, the processor will play Dolby Atmos. If the signal is not present, the system will switch to playing back the traditional digital cinema audio. This method allows the system to switch to traditional cinema playback without user intervention. Since the picture, main audio, and Dolby Atmos audio must be synchronized, using this method of audio to audio synchronization guarantees that all three assets are in sync.

Conclusions

We have described the key elements of a new spatial audio format, including a new Spatial Audio Description model, loudspeaker configuration, authoring tools, distribution method and exhibition tools, all designed to leverage and extend the current state of the art for cinema audio.

References

[1] Dolby Laboratories, "Dolby Atmos Cinema Technical Guidelines," http://www.dolby.com/uploadedFiles/Assets/US/Doc/Professional/Dolby-Atmos-CinemaTechnical-Guidelines.pdf
[2] M.A. Gerzon, "With-height sound reproduction," J. Audio Eng. Soc. (JAES), 21:2–10, 1973.
[3] SMPTE 428-3, "D-Cinema Distribution Master Audio Channel Mapping and Channel Labeling," SMPTE ST428-3:2006.