Scalable Format and Tools to extend the possibilities of Cinema Audio

SMPTE Meeting Presentation
Charles Q Robinson
Dolby Laboratories, Inc., 100 Potrero Ave., San Francisco, CA 94103
[email protected]
Sripal Mehta
Dolby Laboratories, Inc., 100 Potrero Ave., San Francisco, CA 94103
[email protected]
Nicolas Tsingos
Dolby Laboratories, Inc., 100 Potrero Ave., San Francisco, CA 94103
[email protected]
Written for presentation at the
2012 SMPTE Annual Technical Conference
Abstract. Surround sound has been making cinematic storytelling more compelling and immersive
for over 30 years. The first widely deployed surround systems used magnetic recording. Later,
optical recording became standard, enabling up to 7.1 channels of audio. With the transition from film
to digital distribution there is an opportunity for the next generational step forward. In this paper we
describe a new surround sound format that dramatically advances the capabilities of cinema sound.
The format was developed in close cooperation with industry stakeholders and was specifically
designed to provide the most desired new capabilities and provide a path for future enhancements,
while respecting and leveraging the strengths and know-how of the current sound format and
pipeline. In particular, the new system maintains and advances the ability to deliver impeccable audio
quality, and flexibly extends the creative possibilities to meet the needs and aspirations of both
content creators and exhibitors.
Keywords. Cinema, Surround, Sound, Spatial Audio.
The authors are solely responsible for the content of this technical presentation. The technical presentation does not necessarily reflect the
official position of the Society of Motion Picture and Television Engineers (SMPTE), and its printing and distribution does not constitute an
endorsement of views which may be expressed. This technical presentation is subject to a formal peer-review process by the SMPTE
Board of Editors, upon completion of the conference. Citation of this work should state that it is a SMPTE meeting paper. EXAMPLE:
Author's Last Name, Initials. 2011. Title of Presentation, Meeting name and location.: SMPTE. For information about securing permission
to reprint or reproduce a technical presentation, please contact SMPTE at [email protected] or 914-761-1100 (3 Barker Ave., White
Plains, NY 10601).
Copyright © 2012 Society of Motion Picture and Television Engineers. All rights reserved.
Introduction
In this paper we present a new audio format that achieves unprecedented levels of audience
immersion and engagement by offering powerful new authoring tools to mixers and a flexible
rendering engine that optimizes the audio quality and spatial audio impression for each room’s
loudspeaker layout and characteristics. At the same time, the system has been designed to
maintain backwards compatibility and minimize the impact on production, distribution and
exhibition. The introduction of a new audio format allows for changes in the design of sound
systems without breaking compatibility with existing practices.
At the heart of the new format is a novel Spatial Audio Description model. To maximize the
benefits of the new audio description model, and to address existing shortcomings of cinema
sound reproduction, a new speaker layout and new capabilities have been derived. To support these
features, new authoring, distribution and exhibition tools and systems were developed. In this
paper we describe each aspect in turn.
Spatial Audio Description
Dolby has long recognized the potential benefits of moving beyond "speaker feeds" as a means
for distributing spatial audio. Recently there has been considerable interest in alternative spatial
audio description methods. This interest is manifest in the academic community and industry
with systems proposed by Dolby and others.
At a high level, there are three main spatial audio description formats:
• Speaker feed - the audio is described as signals intended for loudspeakers at nominal
speaker positions. Binaural audio is a special case where the speakers are located at the
left and right ears.
• Model- or Object-based description - the audio is described in terms of a sequence of audio
events at specified positions.
• Sound field description - describes the acoustic sound field, not a set of sound sources
(e.g. objects or speakers). For example, an acoustic sound field can be described within a
region using spherical harmonics.
The speaker-feed format is the most common because it is simple and effective. If the
playback system is known in advance, mixing, monitoring and distributing a speaker feed
description that identically matches the target configuration provides the highest fidelity.
However, in most cases the playback system is not known and can only be assumed to conform
to a general standard (e.g. stereo, 5.1). In cinema, the precise location and even the number of
loudspeakers vary substantially. Deviation from nominal speaker placement results in distortions
of the spatial information; however, timbre is generally well preserved. For content where spatial
accuracy is not critical, the speaker-feed format is effective. There is a large body of excellent
stereo and multi-channel audio programs that supports this statement.
The object-based description is the most adaptable because it makes no assumptions about
the rendering technology and can therefore be applied to any playback system. This
adaptability allows the listener or exhibitor the freedom to select a playback configuration that
suits their individual needs or budget -- with the audio rendered specifically for their chosen
configuration. The model-based description efficiently captures high resolution spatial
information and enables accurate and lifelike reproduction that is particularly effective for
discrete audio images. The object-based model also includes information beyond position,
such as the size of the object.
The new audio format described in this paper combines two of these scene description methods:
speaker feeds (channels) and the object-based description.
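As an illustration of what carrying both descriptions in one mix could look like, the following minimal sketch models a mix as a set of channel beds plus a set of audio objects with time-stamped position and size metadata. It is not the actual Dolby Atmos data model: the class names (Bed, ObjectEvent, AudioObject, HybridMix), the field names and the normalized room coordinates are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical convention: x, y, z in normalized room coordinates.
Position = Tuple[float, float, float]


@dataclass
class Bed:
    """Channel-based element: one audio track per named channel."""
    channels: Tuple[str, ...]
    track_ids: Tuple[int, ...]


@dataclass
class ObjectEvent:
    """Metadata that applies to an object from time_s onwards."""
    time_s: float
    position: Position
    size: float = 0.0


@dataclass
class AudioObject:
    """Object-based element: one audio track plus time-stamped metadata."""
    track_id: int
    events: List[ObjectEvent] = field(default_factory=list)

    def position_at(self, time_s: float) -> Position:
        """Return the most recent position at time_s (no interpolation)."""
        past = [e for e in self.events if e.time_s <= time_s]
        if not past:
            raise ValueError("no metadata at or before the requested time")
        return max(past, key=lambda e: e.time_s).position


@dataclass
class HybridMix:
    """A mix that carries both beds and objects."""
    beds: List[Bed] = field(default_factory=list)
    objects: List[AudioObject] = field(default_factory=list)

    def track_count(self) -> int:
        """Total number of audio tracks needed to carry the mix."""
        return sum(len(b.track_ids) for b in self.beds) + len(self.objects)
```

In this sketch the bed tracks are tied to named channels, while each object track carries its own position metadata, so the choice of loudspeakers can be left to the renderer.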
Hybrid format
A hybrid channel- and object-based audio system provides all the benefits of a traditional
speaker-based format:
• High timbre control and fidelity,
• Direct control of speaker signals when desired (e.g. screen channels),
• Efficient transmission of dense audio ambiences and textures,
• Traditional authoring options that allow mixers to make use of their experience and
expertise, while incorporating the new capabilities at their own pace,
and extends the capabilities to include the following benefits:
• More immersion and envelopment,
• Increased spatial resolution, e.g. an audio object can be dynamically assigned to any
one or more loudspeakers within a traditional surround array,
• Ability to effectively bring sound images off screen,
• Single inventory distribution compatible with effective adaptation to alternative rendering
modes including 5.1 and 7.1,
• Familiar surround mixing paradigm. The front end of the mixing process is identical to
existing tools, as shown in Figure 1, which compares the speaker-feed and object-based
audio pipelines. The rendering step is delayed until after distribution.
[Figure 1 diagram: two block diagrams, (a) Speaker Feed Spatial Audio Description and (b) Object-based Spatial Audio Description, each divided into Authoring, Distribution and Exhibition stages. Blocks in both panels: Raw Audio Elements, Panner, Mix, Metadata, DCP, Renderer. Panel (a) also shows a Channel Config. input; panel (b) shows a Speaker Positions input.]
Figure 1. Comparison of audio pipeline for (a) speaker-feed or channel-based audio and (b)
object-based spatial audio descriptions. The audio production tools are similar in both cases,
and in both cases the audio is carried within the D-Cinema Package (DCP). In the latter case,
final rendering occurs at exhibition, based on actual speaker configuration.
Speaker and Channel configuration
In this section we present the results of investigations into optimal speaker configuration.
Optimality included practical as well as performance considerations. The preferred solution is
deliberately an extension of existing recommendations. The detailed recommendation can be
found in [1]. The derivation of the new recommendation was driven by direct input from the
creative community, as well as psycho-acoustic principles and experiments. The solution
provides the enhancements most desired by mixers, maximizes the impact on the listener,
provides a uniform experience throughout the listening area, and integrates well into existing
theaters.
Top Surround
Existing cinema recommendations include speakers mounted on all four walls providing an
excellent “surround” capability. What is missing however is elevation. The addition of ceiling
loudspeakers literally provides a new dimension for sound. The benefits include increased
immersion through the presence of overhead sounds, as well as giving mixers the ability to move
sounds into the room, away from the walls. Consideration of elevation for audio reproduction is
not a new idea [2]. The challenge is finding the best reproduction format for cinema which, in
addition to the goals of listener envelopment and imaging, must support a wide dynamic range
and a large listening area. To solve this problem, let’s begin with a brief review of a typical
implementation of surround sound.
The wall-mounted side and back surround speakers are typically placed well above the heads of
the audience (to improve coverage and keep the speakers safe from tampering), often
approaching the ceiling. The side and back (in 7.1 systems) channels fill the theater with sound
using arrays of loudspeakers to create sufficient and uniform sound pressure level over a wide
listening area. In particular, left and right side surround arrays in a 7.1 surround format provide
uniform coverage through the use of a speaker array spanning the listening area front to back.
To achieve the same benefit for height channels, top surround arrays oriented front to back are
used. As noted above, the side surround speakers are generally located high on the side walls.
To provide an impression of “elevation” or “overhead” relative to the side surrounds, it is clear
that ceiling mounted speakers will have a much greater effect than speakers mounted above the
existing surround speakers on the walls. Ceiling mounting is often more difficult than wall
mounting, but the benefits are clearly audible.
The details on array count, position and density were determined through rigorous listening
experiments. A commercial cinema was used for the experiments. With 205 seats, the room is
representative of a medium sized theater in a modern cineplex. It has partial stadium seat
configuration with a length of 56’, a width of 47’ and a maximum height of 21’. For testing,
additional sound absorption material was installed to bring the reverberation time (RT60) down to
approximately 350 ms from 500 Hz to 10,000 Hz, rising to 580 ms at 120 Hz and dropping to 260 ms at
16,000 Hz. Three top surround arrays were installed on the ceiling as represented in Fig 2. A
variety of listeners, listening positions, and stimuli were used to evaluate potential speaker
configurations with attention to listener envelopment and spatial imaging. The listeners were
presented audio (double blind) from a set of alternative speaker configurations derived as a
subset of the speakers available. Listeners then graded each configuration.
Fig 2: Experimental Speaker Installation. Top down view; screen is to the left. The 19 blue
squares near the centerline of the room represent ceiling mounted speakers.
The results of the array count investigation are described below. The null hypothesis was “all
systems equivalent” but it was expected that we would find that “more is better,” approaching an
asymptotic limit. The experiment was split into two parts. Figure 3 shows the results for a test
measuring listener preference for Ambient signals rendered in 5 different configurations.
Fig 3: Listener preference for ambient sounds rendered in 5 different channel configurations:
mono surround (ms), stereo surround (ss), ss + mono top surround, ss + stereo top surround,
and ss + 3 top surround arrays. Stereo surround (ss) was the reference.
A second test measured listener preference for panned, point sources. Results averaged over
all listeners, listening positions and stimuli are shown in Figure 4.
Fig 4: Listener preference for panned point-source sounds rendered using 4 different speaker
configurations: no top surround speakers, one top surround array, 2 top surround arrays, and 3
top surround arrays. In this test the reference was a text description of the intended pan
trajectory, but no audio reference.
These experiments indicate the following: two or three top surround arrays are significantly
better than one or none. A third, center array does not improve envelopment, and may even
decrease it, in the case of ambient signals. The center array (while useful for some pan
trajectories) does not provide much “bang for the buck.” On this basis it was determined that
two top surround arrays are the preferred configuration for spatial audio reproduction.
Additional tests to evaluate the length and density of arrays were conducted. The resulting
recommendation is shown in Figure 5.
Full details on recommended speaker position and aiming can be found in [1].
Individually addressable speakers
The two additional sets of overhead arrays described above can be directly addressed as two
new channels in the speaker-feed representation. The full channel set is as follows:
9.1 channel: L, R, C, LFE, Lss, Rss, Lrs, Rrs, Lts and Rts.
While there are now 10 channels, there are many more speakers. For example, Fig 5 below
shows 42 surround loudspeakers within the six surround zones. The channel renderer will “fan
out” channel signals to all the appropriate speakers.
Using the object-based representation, an audio object can be precisely positioned. The object
renderer will determine the best speaker or speakers to play back the object audio stream. The
speaker(s) used could be within an array, or span multiple speakers across arrays.
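The difference between the two rendering paths can be sketched as follows. The channel path feeds one signal to every speaker of an array, while the object path computes gains for specific speakers. The paper does not specify the gain laws used by the renderers, so the 1/sqrt(N) fan-out scaling and the sine/cosine pan law below are common textbook choices assumed for illustration, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def fan_out_gains(array_size: int) -> List[float]:
    """Gains for feeding one channel to every speaker in an array.

    Assumption: equal gains scaled by 1/sqrt(N), so the summed power
    of the array stays constant regardless of its size.
    """
    if array_size < 1:
        raise ValueError("an array needs at least one speaker")
    return [1.0 / math.sqrt(array_size)] * array_size


def pan_between(position: float) -> Tuple[float, float]:
    """Constant-power gains for an object between two adjacent speakers.

    position is 0.0 at the first speaker and 1.0 at the second.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie in [0, 1]")
    angle = position * math.pi / 2.0
    return math.cos(angle), math.sin(angle)
```

With these assumptions, a side surround channel fanned out to four speakers drives each at a gain of 0.5, whereas an object placed halfway between two of those speakers drives only that pair, at equal gains.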
Bass Management
A significant shortcoming of traditional cinema surround sound is the lack of “full range”
surrounds. Specifically, while typical screen channels have a frequency response extending
down to 40 Hz and lower, surround speakers often begin to roll off at 100 Hz. If a full range sound
is panned off the screen, the timbre will shift dramatically as a result of the lack of low frequency
capability of the surrounds. As a result of this timbre shift (as well as the lack of spatial
resolution) mixers hesitate to bring sound objects off the screen. To address this issue, the
concept of a left/right “surround direct” loudspeaker pair has been standardized [3], and in some
IMAX configurations, the Ls and Rs loudspeaker arrays are replaced by a pair of full range
speakers.
The recommendation for Dolby Atmos includes a pair of surround subwoofers. The channel and
object renderers redirect low frequency content from the left and right arrays (side, rear and top)
to the subwoofers, taking advantage of the limited directionality of low frequency sound. In
effect, every surround loudspeaker becomes a surround direct loudspeaker. The appropriate
crossover frequency is established during installation based on the capabilities of the surround
loudspeakers. Bass redirection is optional. The goal is full range surrounds. If the surround
loudspeakers have sufficient low frequency extension, bass redirection is not needed.
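The principle of bass redirection can be illustrated with a minimal sketch for one side of the room. Each surround feed is split at the crossover frequency, the low band is summed into the subwoofer feed, and the high band stays with the surround speaker. The first-order filter used here is only a stand-in chosen for brevity (the paper does not describe the crossover design, and practical crossovers are usually higher order), and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def split_bands(samples: Sequence[float], sample_rate_hz: float,
                crossover_hz: float) -> Tuple[List[float], List[float]]:
    """Split a signal into low and high bands that sum back to the input."""
    # One-pole low-pass coefficient for the chosen crossover frequency.
    alpha = 1.0 - math.exp(-2.0 * math.pi * crossover_hz / sample_rate_hz)
    low, high = [], []
    state = 0.0
    for x in samples:
        state += alpha * (x - state)
        low.append(state)
        high.append(x - state)  # complementary high band
    return low, high


def redirect_bass(surround_feeds: Sequence[Sequence[float]],
                  sample_rate_hz: float,
                  crossover_hz: float) -> Tuple[List[List[float]], List[float]]:
    """Return (high-passed surround feeds, summed subwoofer feed).

    Assumes all feeds have the same length.
    """
    if not surround_feeds:
        return [], []
    sub = [0.0] * len(surround_feeds[0])
    high_passed = []
    for feed in surround_feeds:
        low, high = split_bands(feed, sample_rate_hz, crossover_hz)
        high_passed.append(high)
        sub = [s + l for s, l in zip(sub, low)]
    return high_passed, sub
```

Because the two bands are complementary, the sum of the surround and subwoofer feeds equals the sum of the original feeds, which is the electrical counterpart of the "full range surrounds" goal described above.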
Fig. 5: Recommended speaker configuration.
Authoring
Cinema authoring is getting increasingly complex, time-consuming and expensive as content
creators strive to get more from cinema sound. New mixing technology must enable new
creative options, but it must integrate into existing post production workflows without adding
excessive time and money to the process. The hybrid model of channels and objects allows
most sound design, editing, pre-mixing, and final mixing to be performed in the same manner as
they are today.
Plug-in applications for digital audio workstations allow existing panning techniques within
sound design and editing to remain unchanged. In this way, it is possible to lay down both
channels and objects within the workstation in 5.1-equipped editing rooms.
Object audio and metadata are recorded in the session in preparation for the pre- and final-mix
stages in the dubbing theatre.
Metadata is integrated into the dubbing theatre’s console surface, allowing the channel strips’
faders, panning and audio processing to work with channels, channel sets (“beds”) and audio
objects. The metadata can be edited using either the console surface or the workstation user
interface, and the sound is monitored using the Dolby Rendering and Mastering Unit (RMU).
Figure 6. Authoring Workflow, showing combination of Beds and Objects.
A specific panning tool has been developed to allow the mixers to freely move sound sources in
3D space while taking advantage of all the speakers present in the environment. Figure 7
illustrates the panner plug-in UI for Pro Tools™. The interface is similar to common panning tools
existing today but extends the creative palette of the mixer by introducing elevation controls for
the sound objects. Sound objects can therefore be freely moved in the 3D space of the room
and either panned between several loudspeakers or snapped to a single speaker closest to the
intended location. In addition to the object location, the perceived width of the object can also be
controlled. Several elevation constraints can also be used so that an object’s elevation is
automatically adjusted as it crosses the room using different profiles (sphere, wedge, etc.).
Finally, zone exclusion controls (e.g. no side wall, no back wall, etc.) can be enabled allowing
the mixer to finely control the set of speakers involved in rendering a particular pan.
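The snap and zone exclusion controls described above can be illustrated with a short sketch. It selects the speaker closest to the intended object position after removing any excluded zones. The zone labels, the speaker table layout and the use of plain Euclidean distance are assumptions for this example and are not taken from the panner itself.

```python
import math
from typing import Dict, Iterable, Tuple

Position = Tuple[float, float, float]


def snap_to_speaker(position: Position,
                    speakers: Dict[str, Tuple[str, Position]],
                    excluded_zones: Iterable[str] = ()) -> str:
    """Return the name of the closest speaker outside the excluded zones.

    speakers maps a speaker name to (zone label, speaker position).
    """
    excluded = set(excluded_zones)
    candidates = {name: pos for name, (zone, pos) in speakers.items()
                  if zone not in excluded}
    if not candidates:
        raise ValueError("all speakers are excluded")
    return min(candidates, key=lambda name: math.dist(position, candidates[name]))
```

For instance, an object near a side wall snaps to a side surround speaker by default, but snaps to the nearest remaining speaker (such as a top surround) when the side zone is excluded.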
Figure 7. Panner plug-in UI for Pro Tools™
The channel and object audio data and associated metadata are recorded during the mastering
session to create a ‘print master’, which includes a Dolby Atmos mix and any other rendered
deliverables (such as a Dolby Surround 7.1 or 5.1 theatrical mix). This print master file is
wrapped using industry-standard MXF wrapping procedures, hashed and optionally encrypted in
order to ensure integrity of the audio content for delivery to the digital cinema packaging facility.
Distribution
D-Cinema Package (DCP) Integration
Dolby Atmos is carried within a dedicated track file within a D-Cinema Package. The Dolby
Atmos Track is kept independent of the Audio Track file within the DCP to allow for backwards
compatibility with existing exhibition systems. The Atmos Track file consists of frame-wrapped
elements, where each frame element has the same duration as a video frame to allow for easy editing of
packages. Each frame contains all necessary audio and metadata information to render one
frame of audio.
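Because each frame element spans one picture frame, the number of audio samples it carries per track follows directly from the sample rate and the picture frame rate. The sketch below works this out, taking 24 frames per second as an example frame rate; the function name is hypothetical and exact fractions are used so that non-integer results would be visible rather than silently rounded.

```python
from fractions import Fraction


def samples_per_frame(sample_rate_hz: int, frame_rate: Fraction) -> Fraction:
    """Audio samples per track that span exactly one picture frame."""
    if frame_rate <= 0:
        raise ValueError("frame rate must be positive")
    return Fraction(sample_rate_hz) / frame_rate


# Example: 24 frames per second picture (assumed for illustration).
FRAME_RATE = Fraction(24)
SAMPLES_48K = samples_per_frame(48_000, FRAME_RATE)   # 2000 samples
SAMPLES_96K = samples_per_frame(96_000, FRAME_RATE)   # 4000 samples
```

Under this assumption each frame element holds 2000 samples per track at 48 kHz, or 4000 at 96 kHz, so a cut at any picture frame boundary never splits an audio frame.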
In order to maintain synchronization during playback, a timing signal is embedded into a
channel of the main Audio Track file containing the UUID of the Audio Track file and a frame
count for the given frame of audio. A dedicated universal label is given to the synchronization
track to prevent the integrated media block from sending it to an audio processor that is not
Dolby Atmos aware.
Audio Packing
Like standard D-Cinema Audio Tracks, the audio format supports transmission of 24-bit PCM
audio at sample rates of 96 kHz and 48 kHz. The distribution format supports an arbitrarily large
number of tracks (initial product hardware is limited to 128 tracks). This data expansion is
mitigated through the use of lossless data compression. The packaging tool accepts input PCM
audio at 48 kHz or 96 kHz. The input format is losslessly packed, and a high quality sample rate
converter is used to generate a signal at the other sample rate. The audio is efficiently packed
as a 48 kHz layer and a 96 kHz difference layer. The cinema processor can unpack just the 48
kHz layer or both layers. In this way a single package can support low-cost 48 kHz cinema
installations as well as full 96 kHz cinemas. The lossless packing minimizes the size of the
Atmos Audio Track enabling tens or even hundreds of audio tracks (including distinct audio objects)
to be efficiently distributed. Internal testing has shown that the packaging tool can compress
non-trivial audio signals at approximately a 2:1 ratio. In addition, the compression tool is able to
compress low level and digital-zero signals much more efficiently. In the case of the first full
feature mix, the packaged Atmos audio track was smaller than the standard 7.1 audio track.
Figure 8. Lossless Sample Scalable audio packing and unpacking. Figure shows scenario with
lossless packing of 96 kHz audio. 48 kHz input audio is also supported.
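The layered idea can be illustrated with a toy sketch on integer PCM samples: a 48 kHz base layer plus a difference layer that restores the 96 kHz signal exactly. This is not the packing algorithm used by the format. Plain decimation and integer linear interpolation stand in for the high quality sample rate converter, no entropy coding is applied, and all function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def predict_96k(base_48k: Sequence[int], length: int) -> List[int]:
    """Predict a 96 kHz signal from the 48 kHz layer by integer interpolation."""
    out = []
    for i in range(length):
        k = i // 2
        if i % 2 == 0:
            out.append(base_48k[k])
        else:
            # Midpoint of neighbouring base samples (repeat the last one at the end).
            nxt = base_48k[k + 1] if k + 1 < len(base_48k) else base_48k[k]
            out.append((base_48k[k] + nxt) // 2)
    return out


def pack(pcm_96k: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (48 kHz base layer, 96 kHz difference layer)."""
    base = list(pcm_96k[::2])  # stand-in for a proper sample rate converter
    prediction = predict_96k(base, len(pcm_96k))
    diff = [x - p for x, p in zip(pcm_96k, prediction)]
    return base, diff


def unpack(base_48k: Sequence[int], diff: Optional[Sequence[int]] = None) -> List[int]:
    """Decode only the 48 kHz layer, or both layers for bit-exact 96 kHz."""
    if diff is None:
        return list(base_48k)
    prediction = predict_96k(base_48k, len(diff))
    return [p + d for p, d in zip(prediction, diff)]
```

A 48 kHz installation would call unpack with the base layer only, while a 96 kHz installation would pass both layers and recover the original samples exactly. In this toy version the even-indexed differences are always zero; a real packer would exploit that kind of redundancy and compress the residual.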
Exhibition Integration (Server, CP, Sync)
The Dolby Atmos system was designed to integrate within the existing digital cinema networking
infrastructure. An additional 1000BASE-T Ethernet connection is established between the digital
cinema server and the Dolby Atmos cinema processor to provide a dedicated link for the Dolby
Atmos data. This allows for greater reliability in the theater environment and does not reduce
the bandwidth available to the other digital cinema audio and video assets, since they
are sent over a separate, dedicated link to the integrated media block.
Figure 9. Exhibition integration
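A rough calculation shows why a dedicated gigabit link is adequate. The sketch below computes the raw PCM payload for the largest case mentioned in this paper (128 tracks of 24-bit audio) before any lossless packing. Metadata, packet overhead and the exact transport framing are not described in the paper and are ignored here; the function name is hypothetical.

```python
def pcm_bitrate_mbps(tracks: int, sample_rate_hz: int, bits_per_sample: int = 24) -> float:
    """Raw PCM payload in Mbit/s, ignoring metadata and packet overhead."""
    return tracks * sample_rate_hz * bits_per_sample / 1e6


LINK_MBPS = 1000.0  # nominal 1000BASE-T line rate

RAW_48K = pcm_bitrate_mbps(128, 48_000)   # about 147.5 Mbit/s
RAW_96K = pcm_bitrate_mbps(128, 96_000)   # about 294.9 Mbit/s
PACKED_48K = RAW_48K / 2.0                # with the roughly 2:1 packing ratio reported above
```

Even the uncompressed 96 kHz case uses less than a third of the nominal link rate under these assumptions, which leaves margin for metadata and protocol overhead.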
Audio is routed from the integrated media block to the Dolby Atmos cinema processor over 8
pairs of AES-3id. The cinema processor can supply signal to a large number of amplifiers via a
Dolby Atmos Connect audio-over-Ethernet link that can be upgraded to industry standards such
as Ethernet AVB when available.
Audio and Video Synchronization
The Dolby Atmos playback system was designed for seamless integration with existing digital
cinema systems and content.
The Dolby Atmos cinema processor serves as both the Dolby Atmos renderer and B-chain
processor for all audio within the theater environment. In order to perform both functions for
both Dolby Atmos and traditional cinema content, the cinema processor must detect and
seamlessly switch between Dolby Atmos playback and legacy playback. To support seamless
switching, the cinema processor searches for a synchronization signal embedded within the
main audio of the DCP. If this synchronization signal is present and Dolby Atmos data is
present and the synchronization information is aligned, the processor will play Dolby Atmos. If
the signal is not present, the system will switch to playing back the traditional digital cinema
audio. This method allows the system to switch to traditional cinema playback without user
intervention. Since the picture, main audio, and Dolby Atmos audio must be synchronized,
using this method of audio-to-audio synchronization guarantees that all three assets are in sync.
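The switching rule described in this section can be written as a small decision function. The paper specifies two cases: all three conditions hold (play Dolby Atmos) and the synchronization signal is absent (play traditional audio). The sketch assumes that every other combination also falls back to traditional playback; that assumption, and the names used, are not taken from the paper.

```python
from enum import Enum


class PlaybackMode(Enum):
    ATMOS = "atmos"
    LEGACY = "legacy"


def select_playback_mode(sync_signal_present: bool,
                         atmos_data_present: bool,
                         sync_aligned: bool) -> PlaybackMode:
    """Choose the playback path for the current content."""
    if sync_signal_present and atmos_data_present and sync_aligned:
        return PlaybackMode.ATMOS
    # Assumption: any missing or misaligned input falls back to the main audio.
    return PlaybackMode.LEGACY
```

Evaluating this rule continuously during playback is what would let the processor move between Dolby Atmos and traditional content without user intervention.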
Conclusions
We have described the key elements of a new spatial audio format, including a new Spatial
Audio Description model, loudspeaker configuration, authoring tools, distribution method and
exhibition tools, all designed to leverage and extend the current state of the art for cinema audio.
References
[1] Dolby Laboratories, “Dolby Atmos Cinema Technical Guidelines,”
http://www.dolby.com/uploadedFiles/Assets/US/Doc/Professional/Dolby-Atmos-CinemaTechnical-Guidelines.pdf
[2] M.A. Gerzon, “With-height sound reproduction,” J. Audio Eng. Soc. (JAES), 21:2–10, 1973.
[3] SMPTE 428-3, “D-Cinema Distribution Master Audio Channel Mapping and Channel
Labeling,” SMPTE ST428-3:2006.