Automatic Mixing and Tracking of On-Pitch
Football Action for Television Broadcasts
R. G. Oldfield¹, B. G. Shirley¹
¹ Acoustics Research Centre, The University of Salford, Salford, UK
[email protected], [email protected]
ABSTRACT
For the television broadcast of football in Europe, the sound engineer will typically have an arrangement of 12
shotgun microphones around the pitch to pick up on-pitch sounds such as whistle blows, players talking and ball
kicks. During a match, the sound engineer typically raises and lowers the levels of these microphones
manually according to where the action is on the pitch at a given time, to prevent the final mix being awash
with crowd noise. As part of the EU-funded project FascinatE, we have developed an automatic mixing algorithm
that intelligently seeks key events on the pitch and turns on the corresponding microphones. The algorithm picks out
the key events and automatically tracks the action, eliminating the need for manual tracking.
1. INTRODUCTION
In the broadcast of football on the television, the main
problem that befalls the sound engineer is the control of
the crowd noise. Whilst there are designated
microphones used for recording the crowd and
ambience (typically, stereo pairs and Soundfield®
microphones), a standard setup (see Figure 1) would
also include 12 shotgun microphones placed around the
pitch used for on-pitch sounds only. These microphones
are chosen to be highly directional but they still pick up
much of the crowd noise, either from the rear of the
microphone or from the crowd on the opposite side of
the pitch. If all of these microphones were to be left
high in the mix, the result would be an audio mix that is
awash with crowd noise; if they were left low in the mix,
there would be no pitch sounds, which would be unrealistic.
Consequently the sound engineer will track the action
on the pitch and will only raise the level of a
microphone when the action is nearby and there are
likely to be some on-pitch sounds to pick up. This
process can be rather laborious for the sound engineer
and also means that the only sounds that are picked up
are the ones around the main action which may not tell
the whole story of the match. If, for example, a player
at one end of the pitch shouts for a team-mate to pass
the ball, the shout would not be picked up by the
microphones as the action would not yet have reached
that point on the pitch. Similarly, if there were to be an
altercation between two players or another event
auxiliary to the play during the game, the corresponding
sound would not be picked up. A further problem can be
the sound engineer’s reaction time: if the play switches
quickly from one end of the pitch to the other, he/she
will have to quickly move one set of faders down and
the others up, and if this is not done in time the audio will
not get picked up. Only raising the levels of the faders
when there is action nearby also results in a fluctuation
of level during the match but this is masked by the
crowd noise from the designated ambient/crowd
microphones so is not perceptible to viewers. These
microphones are usually placed high in the stadium so
that they capture the ambience rather than
individuals in the crowd, which makes the crowd noise
easier for the sound engineer to control.
In this paper we present a method that not only allows
multiple events from different areas of the pitch to be
included in the mix at any one time but also will mix the
audio automatically and will therefore alleviate the need
for any manual mixing or tracking the action as is the
current practice. The key element to the automatic
mixing process is the separation of the unwanted crowd
noise from the on-pitch sounds; this can be rather
problematic as the pitch microphones are often
swamped by crowd noise. However the nature of this
crowd noise is that it contains few sharp transients of
significant level; this means that if a significant transient
is detected in any of the microphones, it is likely that it
is picking up an auditory event that is not crowd noise
and is therefore an on-pitch sound.
Once these sounds have been detected, the location of
the sound source can also be approximately determined
using a time delay estimation (TDE) technique [1]. If
this location is on the pitch, the microphone can be made
active in the mix. This technique, however, can only be
applied if the sound source is detected in more than one
microphone. Separating and positioning on-pitch
sources like this also means that they can be spatially
mixed and output over stereo, ambisonics, 5.1 etc. to
give a more immersive viewing experience. It
also allows for a better reconstruction of the
audio scene, giving consistent source
positions on the pitch even as the user
navigates or interacts with the broadcast scene, which is one of
the goals of the EU-funded FascinatE project [2], of
which this research is a part.
FascinatE stands for Format-Agnostic SCript-based
INterAcTive Experience. The objective of the project is
to allow a completely customisable viewing experience
in which the viewer can choose which area of interest
on the pitch to view and is free to navigate around the
visual scene by zooming, panning etc. The audio
has to match the visual content, and
it is therefore important that a realistic and robust audio
scene be recorded for quality spatial reproduction
whatever the user’s navigational decision [3].
2. METHODOLOGY
In order to automatically choose when each microphone
should be made active, it is possible to track the game,
either with the faders on a desk or using some
automatic tracking device [4] which then communicates
with the mixing desk to do an automatic mix. It is also
possible to perform an automatic mix based on the
contents of the audio in each microphone. These signals
can be analyzed and, depending on the results, the
microphone can be made active/inactive in the mix. The
downside of this approach is that the signal must be
received before the algorithm can decide whether to add the
microphone into the mix; this requires a
slight delay in broadcast to allow for the processing and
detection time of the algorithm. In most cases this is not
a problem as, even with live football, the actual broadcast
is several seconds behind the actual events.
2.1. Detecting the key audio events
The algorithm presented here analyses the signals from
all of the pitch-side microphones, filtering the signals
with two specially designed filters depending on the
nature of the audio event type to be detected (ball kicks
or whistle blows from the referee). It then determines
the amplitude envelope of the filtered output using the
Hilbert transform. The key audio events can then be
extracted when the gradient of the envelope exceeds a
given threshold, i.e. when a significant transient has occurred,
in a manner similar to a traditional noise gate. Once
these sound sources have been isolated they can be
positioned on the pitch based on the TDE algorithm or
can be assigned an approximate zone if only picked up
by one microphone.
Figure 2 Flow diagram of audio extraction algorithm (microphone input, filter the signal, extract envelope of filtered signal, calculate gradient of envelope, pick out points where gradient exceeds threshold)
Each of the microphone signals is fed into the
algorithm. The microphone signals are then filtered
depending on the type of event to be extracted. Initially,
just two filters were implemented. A low-pass filter with
a cut-off frequency of 280 Hz is used to extract ball
kicks, and a band-pass filter with a pass band between
3.6 kHz and 4.0 kHz, which corresponds approximately to
the fundamental frequency of an average whistle blow,
is used to extract whistle blows.
Once the signal has been filtered to be particularly
sensitive to particular event types, the envelope of the
signal is extracted. The method used for the envelope
extraction is based on the Hilbert transform of the signal
[5]. For this method the envelope of the signal $y(t)$ is
given by equation (1)
$$\mathrm{env}(t) = \sqrt{y^{2}(t) + \hat{y}^{2}(t)} \qquad (1)$$
where $\hat{y}(t)$ is the Hilbert transform of the signal, given
by (2).
Figure 1 Typical microphone setup for an English Premier League match (12 shotgun microphones, numbered 1 to 12, around the pitch; stereo pair; Soundfield® microphone)
$$\hat{y}(t) = \frac{1}{\pi}\,\mathrm{P.V.}\!\int_{-\infty}^{\infty} \frac{y(\tau)}{t-\tau}\,d\tau \qquad (2)$$
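The filtering and envelope extraction steps can be sketched as follows. This is a minimal illustration rather than the implementation used for the results: the Butterworth design, the filter order of 4, the zero-phase filtering and the function name `kick_and_whistle_envelopes` are all assumptions; only the 280 Hz cut-off and the 3.6 kHz to 4.0 kHz pass band come from the text.

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt


def kick_and_whistle_envelopes(x, fs):
    """Return Hilbert envelopes of the ball-kick and whistle bands of x."""
    # Low-pass at 280 Hz for ball kicks (Butterworth, order 4: assumed).
    sos_kick = butter(4, 280.0, btype="lowpass", fs=fs, output="sos")
    # Band-pass 3.6 kHz to 4.0 kHz for whistle blows (same assumed design).
    sos_whistle = butter(4, [3600.0, 4000.0], btype="bandpass", fs=fs,
                         output="sos")
    envelopes = {}
    for name, sos in (("kick", sos_kick), ("whistle", sos_whistle)):
        y = sosfiltfilt(sos, x)  # zero-phase filtering (assumed choice)
        # hilbert() returns the analytic signal y + j*y_hat, so its
        # magnitude is sqrt(y^2 + y_hat^2), i.e. equation (1).
        envelopes[name] = np.abs(hilbert(y))
    return envelopes
```

Taking the magnitude of the analytic signal avoids computing the integral in (2) explicitly; a causal filter would be needed instead of `sosfiltfilt` in a streaming implementation.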
From this envelope of the filtered signal the gradient is
taken, which shows how quickly the envelope is changing.
Points are then picked out
where the gradient exceeds a given threshold. Each such point
corresponds to a transient in the signal of significant
level and thus should be extracted. For a ball kick, the
envelope of the low-pass filtered signal
changes quickly, so the gradient of the envelope
is high. The audio can then be extracted from the
original microphone signal accordingly. In practice,
each microphone is silenced until a significant
transient is detected in either the low-pass or band-pass filtered signal. When a transient is detected the
microphone is switched on with an attack, sustain and
release envelope applied, as shown in Figure 3. If any
additional transients are detected during the sustain or
release phase of the amplitude envelope, the sustain
phase is extended to include them.
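The gradient-threshold step described above can be sketched as below. The function name `detect_transients`, the use of a central-difference gradient and the units of the threshold (envelope units per second) are assumptions made for illustration; the paper does not state the threshold values used.

```python
import numpy as np


def detect_transients(envelope, fs, threshold):
    """Return sample indices where the envelope gradient first exceeds threshold."""
    # Gradient of the envelope in envelope units per second.
    grad = np.gradient(envelope) * fs
    above = grad > threshold
    # Keep only the first sample of each run above the threshold,
    # so that one transient produces one onset.
    rising = above & ~np.r_[False, above[:-1]]
    return np.flatnonzero(rising)
```

Raising the threshold makes the detector less sensitive, which trades missed low-level events for fewer false detections from crowd noise.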
Figure 3 Amplitude envelope (attack, sustain and release phases) to be applied to
microphone signals when a transient event is detected
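One way to build the gain curve of Figure 3 from a list of detected transients is sketched here. The linear ramps, the sustain and release durations and the name `gate_gain` are assumptions; the 1 second attack is the value quoted in section 3.2. The attack is placed before the transient, which is why the algorithm needs to look ahead.

```python
import numpy as np


def gate_gain(n_samples, onsets, fs, attack=1.0, sustain=1.0, release=1.0):
    """Build a 0..1 gain curve that opens the microphone around each transient."""
    a, s, r = (int(round(v * fs)) for v in (attack, sustain, release))
    # Merge transients into hold intervals [start, end]: a transient that
    # falls in the sustain or release phase of the previous one extends
    # the sustain phase instead of starting a new envelope.
    holds = []
    for onset in sorted(int(o) for o in onsets):
        if holds and onset <= holds[-1][1] + r:
            holds[-1][1] = onset + s
        else:
            holds.append([onset, onset + s])
    idx = np.arange(n_samples)
    gain = np.zeros(n_samples)
    for start, end in holds:
        # Trapezoid: 0 at start - a, 1 on [start, end], back to 0 at end + r.
        trapezoid = np.interp(idx, [start - a, start, end, end + r],
                              [0.0, 1.0, 1.0, 0.0])
        gain = np.maximum(gain, trapezoid)
    return gain
```

The gated output is then the original (unfiltered) microphone signal multiplied sample by sample by this gain.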
The use of this technique can also eliminate unwanted
crowd noise; for example during a corner when the
corner microphones are on, it is common to hear
individuals in the crowd (whose speech may well
include expletives). However if the microphones are
only on when there is an on-pitch auditory event, these
occurrences can be minimized and the crowd noise can
be recorded with the designated microphones and
controlled more easily by the sound engineer.
2.1.1. Results
A basic test was carried out to analyze the effectiveness
of the automatic audio event extraction algorithm by
comparing its output with the actual mix that was
broadcast by the BBC. The recordings came from an
English Premier League match that was recorded in
conjunction with an outside broadcaster (SIS Live). The
match took place on 23rd October 2010 and featured
Chelsea versus Wolverhampton Wanderers.
The microphone setup was as shown in Figure 1. There
are tight regulations on broadcasts of Premier League
football such that additional microphones could not be
added. This poses one of the biggest challenges to
extracting on-pitch sources because there are not
enough shotgun microphones to perform any accurate
array processing/beamforming operations.
Table 1 Manual and automatic detection of ball kicks
         Manual Ball Kicks               Auto Ball Kicks
Mic      Definite  Ambiguous  Tot        Definite  Ambiguous  Tot
1        5         1          6          5         1          6
2        1         2          3          1         0          1
3        3         1          4          3         0          3
4        2         2          4          0         0          0
5        0         1          1          0         0          0
6        1         2          3          1         1          2
7        1         4          5          0         1          1
8        2         1          3          0         1          1
9        1         2          3          1         0          1
10       1         2          3          1         0          1
11       5         0          5          2         0          2
12       2         1          3          2         0          2
Total    24        19         43         16        4          20
The number of events in the final
broadcast mix and in the automatic mix was also counted, as shown in Table
2:
Table 2 Manual and automatic detection of whistle blows
         Manual Whistle Blows            Auto Whistle Blows
Mic      Definite  Ambiguous  Tot        Definite  Ambiguous  Tot
1        5         0          5          0         0          0
2        4         1          5          0         0          0
3        5         0          5          2         0          2
4        3         2          5          0         0          0
5        0         2          2          0         0          0
6        4         1          5          5         0          5
7        4         0          4          0         0          0
8        2         3          5          0         0          0
9        0         4          4          0         0          0
10       4         1          5          0         0          0
11       5         0          5          2         0          2
12       5         0          5          0         0          0
Total    41        14         55         9         0          9
To perform the test, a random one-minute section of the
match was chosen. This section was cut out of
each of the 12 shotgun microphone signals and listened
to individually, counting the number of ball kicks and
whistle blows in each. The algorithm was then run over
the same set of files and the number of ball kicks and
whistle blows that were extracted was counted. The
results can be seen in Tables 1 and 2. In each case,
whether ball kicks or whistle blows, a qualitative
distinction is drawn between ambiguous events, where it is
likely but uncertain that the sound is such an event because of its
low level, and definite events, where the event type is
clear. This was done because it is often difficult for a
human listener to identify the event: some events occur
on the other side of the pitch and are consequently only
faintly picked up by the microphone in question.
Figure 3 Number of ball kicks counted manually and
automatically (number of ball kicks against microphone number; series: definite manual, definite automatic, ambiguous manual and ambiguous automatic ball kicks)
Table 3 Percentage of events correctly extracted by
the automatic extraction algorithm
                 Definite   Ambiguous
Ball Kicks       66.70%     21.10%
Whistle Blows    22.00%     0%
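The percentages in Table 3 follow from the totals rows of Tables 1 and 2. The short check below reproduces them under the assumption that each rate is the automatic total divided by the manual total for the same category; the variable and function names are illustrative only.

```python
# (manual total, automatic total) taken from the Total rows of Tables 1 and 2.
totals = {
    ("ball kicks", "definite"): (24, 16),
    ("ball kicks", "ambiguous"): (19, 4),
    ("whistle blows", "definite"): (41, 9),
    ("whistle blows", "ambiguous"): (14, 0),
}


def extraction_rate(manual, automatic):
    """Percentage of manually counted events that were extracted automatically."""
    return 100.0 * automatic / manual


rates = {key: round(extraction_rate(*counts), 1) for key, counts in totals.items()}
for (event, certainty), rate in rates.items():
    print(f"{event:14s} {certainty:10s} {rate:5.1f}%")
```

Working from totals hides per-microphone differences: for example, microphone 6 in Table 2 has more automatic than manual definite whistle blows, so the ratio of totals is only an approximate measure of correct extraction.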
Figure 4 Number of whistle blows counted manually
and automatically (number of whistle blows against microphone number; series: definite manual, definite automatic, ambiguous manual and ambiguous automatic whistle blows)
Table 2 Number of ball kicks and whistle blows
counted in the broadcast mix and automix
                 Ball Kicks   Whistle Blows
Broadcast Mix    7            5
Automatic Mix    8            5
2.1.2. Discussion
The results show that the automatic event extraction
manages to extract 66.7% of definite ball kicks for this
one-minute section. The percentages are lower for the
ambiguous events because of the lower signal-to-noise
ratio in the microphone signals. The whistle blows in
particular are harder to extract because the background
noise itself contains more high-frequency transients,
so it is more difficult to discriminate between these and the whistle blows
without misinterpreting these sounds as
whistle blows. However, even with the
sensitivity set low enough to avoid such errors, the comparison
with the final mix shows that each whistle blow was
discerned, even if it was not detected in all of the microphone
signals.
The audio event extraction algorithm looks for two
principal audio event types: whistle blows and ball kicks.
It is also possible to extract players’
communications from the pitch microphones using a
third filter; however, this can be very problematic as it
is difficult to automatically distinguish between
individuals in the crowd and players on the pitch.
Currently this problem cannot be resolved easily, but
different microphone techniques could be
employed to help better discern which sounds are
from the crowd and which are from the pitch, as
described in section 3.1.
It can be seen from the results that the automatic
algorithm does not pick up all of either the ball kicks or
the whistle blows when analyzing the individual
microphone feeds. However, when the mix-down from
all the microphone feeds is compared with the actual
mix that was broadcast, it can be seen that the automatic
algorithm has picked out an additional ball kick which
was not picked up by the sound engineer moving the
faders. This could be due to the ball switching rapidly
from one end of the pitch to the other and the engineer
not having enough time to react and increase the level of
the faders for the next area of action.
The coefficients of the algorithm were chosen such that
it was less sensitive to transient events, thus not
allowing any erroneous events (such as crowd members
shouting or chanting) to be detected. The algorithm
can be made more sensitive to either ball kicks or
whistle blows but this increases the likelihood of
erroneous events being interpreted as a key event. Of
course any mistakes that are made in the automatic
detection are likely to be masked by the crowd noise as
recorded from the designated ambient microphones.
This situation then is not too dissimilar to the current
situation where the sound engineer may turn the level of
the fader up because the play is in that particular section
of the pitch but if there is no significant on-pitch audio
it will only be the crowd noise that is picked up. So in
the time domain, errors in event extraction may not be
too problematic, however when positioning the sources
in the sound field for a spatial audio representation of
the scene as will be done in the FascinatE project, this
could be very problematic as it would lead to the sound
of individuals in the crowd being placed on the pitch
which would obviously be incorrect.
2.2. Positioning the source on the field
Panning the on-pitch auditory events may not be
desirable for a standard TV broadcast and is not the
current state of affairs. If for example the active
broadcast camera changes, this would require the
panning to change so that the audio event appears to
come from the correct direction with respect to the
camera view point. This could be achieved with a
simple syncing in the OB truck such that the position of
the audio event/object in the stereo, 5.1 etc field would
be controlled by which camera the producer makes
active at that point in time but would mean constantly
moving audio sources with different cuts which may be
disturbing/distracting and consequently undesirable for
the viewer. For the FascinatE project the aim is that the
user will have complete control over his/her visual
scene, i.e. they will be able to select the viewing
position; this will require the audio to be updated based
on the individual’s decisions. With this in mind, it will
not be an audio mix that is broadcast but rather a
selection of audio objects and their location. It will
consequently be at the user end where the audio mix
will happen, allowing for a format agnostic solution
where almost any audio system setup can be
accommodated and providing a rendering which is
unique to the individual user, based on their viewing
and listening preferences.
Positioning the sources on the pitch in the sound field
can be done by simply looking at which microphone has
energy at that moment in time and positioning its signal
as the audio object in the centre of the zone it covers as
shown in Figure 1. This has the undesirable effect that
there can be quite a large mismatch in visual source
location and the corresponding audio object which may
be noticed in the FascinatE rendering, especially if the
user zooms into the pitch where the relative difference
in position will be greater. This problem can be
overcome to some extent if more than one microphone
picks up the same audio object, as a more precise
position of the source could then be inferred from the
relative delays between, and strengths of, the two
microphone signals. This is not a problem in non-FascinatE
applications as the sources are not panned, so
it is permissible to have two microphones active at any
one time. For FascinatE applications it is imperative that
any audio object only comes from one position on the
pitch.
It is also important that noise from the crowd doesn’t
get positioned on the pitch (this requires that the audio
extraction works well, although much of this will be
masked by the crowd noise from the Soundfield®
microphone(s)). If more than one microphone picks up a
signal within a certain allowed time window, it is
possible to determine whether or not they are
picking up the same audio event by looking
at the coherence between the microphone signals: if it is
within certain bounds, it can be assumed that the sound
source is the same. The temporal difference between the
two signals arriving at the microphones can then be
found by calculating the cross-correlation between
them. This time difference can be used to roughly
position the source on the pitch, although it can only give
the azimuth direction; in terms of depth, the
object will have to be positioned somewhere on a line
running through the centre of the active zones. Problems
could arise with this technique if the algorithm were to
detect some noises from the crowd, such as drum beats
or people whistling; these could get incorrectly
positioned on the pitch in the FascinatE rendering.
There is the possibility of positioning two microphones
at each location slightly offset with respect to each other
so it would be possible to determine the depth of the
sound source and also could determine whether the
source was coming from the crowd and could in that
case be rejected as an audio object.
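The coherence check and the cross-correlation delay estimate described above can be sketched as follows. This is a plain cross-correlation, not the generalized correlation method of [1]. The function names, the coherence bound of 0.5, the segment length and the speed of sound are assumptions, since the paper does not give the bounds it used.

```python
import numpy as np
from scipy.signal import coherence, correlate, correlation_lags

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def same_event(sig_a, sig_b, fs, min_coherence=0.5):
    """Guess whether two microphone windows contain the same sound source."""
    _, cxy = coherence(sig_a, sig_b, fs=fs, nperseg=256)
    # min_coherence is a hypothetical bound chosen for illustration.
    return float(np.mean(cxy)) >= min_coherence


def estimate_delay(sig_a, sig_b, fs):
    """Delay of sig_b relative to sig_a in seconds (positive: b is later)."""
    xcorr = correlate(sig_b, sig_a, mode="full")
    lags = correlation_lags(len(sig_b), len(sig_a), mode="full")
    return lags[np.argmax(xcorr)] / fs


def path_difference(sig_a, sig_b, fs):
    """Extra distance in metres travelled to microphone b compared with a."""
    return estimate_delay(sig_a, sig_b, fs) * SPEED_OF_SOUND
```

A single path difference constrains the source to a curve rather than a point, which is consistent with the observation above that only the direction can be recovered from one microphone pair.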
This describes the case when more than one microphone
picks up the same audio source, however this is very
often not the case and it is more usual that the audio
event isn’t picked up by any of the microphones or is at
a low level and is almost completely masked by the
crowd noise. This is a difficult case and could result in,
for example, a player kicking a ball with no
corresponding audio. This is currently the case for
standard television broadcasts and often is not a
problem as the viewers don’t expect such a high level of
accuracy in the audio rendering but in FascinatE this
could be highly problematic as the viewer will have
higher expectations, especially when zoomed right into
the pitch which exacerbates the problem, making the
crowd quieter and hence the pitch noises are less
masked and differences will be more noticeable. This is
a difficult problem to solve without the use of more
microphones or ones that have a better rejection from
the rear. More advanced processing techniques could be
implemented in the algorithm for the purpose of
reducing further the noise from the crowd as described
in section 3.
3. FURTHER WORK
Whilst the algorithm can be shown to be useful and
effective as an automatic mixing and tracking algorithm,
there are several improvements that could be made to
make the algorithm both more efficient and error free
and several problems need to be overcome.
3.1. Transients from the crowd
Problems could occur if there were to be a transient
event in the sound from the crowd (e.g. someone hitting
a drum), as the algorithm would detect
it and then position it incorrectly on the pitch. There is a
possibility of using a secondary microphone at each
position which is not broadcast but only used for the
positioning of sound objects. Using a TDE algorithm, this could allow
the determination of whether the sources were in front of
or behind the microphone (if behind, they are from the
crowd and not the pitch). Both
microphones would be pointing in the same direction
but with one slightly behind the other; using the
time delay and level difference between the signals, it
would then be possible to work out whether the sound is in
front of or behind the microphone. This would mean an
extra microphone at each position and more intensive
processing but would not take up any more room, so
wouldn’t contravene any rules imposed on the
broadcaster.
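The front/back decision proposed above could look something like the sketch below. It uses only the sign of the time delay between the two microphones and ignores the level difference; the function name, the microphone naming and the assumption that the windows are short enough for a direct cross-correlation are all illustrative.

```python
import numpy as np


def is_from_pitch(front, rear, fs):
    """True if the sound reached the front microphone before the rear one.

    `front` is the microphone nearer the pitch and `rear` the one slightly
    behind it; both point towards the pitch (hypothetical arrangement).
    """
    xcorr = np.correlate(rear, front, mode="full")
    # Index len(front) - 1 of the full correlation corresponds to zero lag.
    lag = int(np.argmax(xcorr)) - (len(front) - 1)
    # Positive lag: rear is a delayed copy of front, so the source is in front.
    return lag > 0
```

With a small spacing the delay is only a few samples, so a high sampling rate (or interpolation of the correlation peak) would be needed for a reliable decision.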
3.2. Broadcast latency
Another problem is that the algorithm needs to ‘look
ahead’ to know when a significant transient has
occurred so that the microphone can be turned on/off (be
made active or inactive). This means that the audio
transmission would be slightly late. This is possibly not
a problem as long as the necessary syncing can be done
because a standard broadcast is slightly behind real-time
anyway. The limit of this broadcast latency is the
duration of the attack phase of the amplitude envelope
which in this case was 1 second. This time could be
reduced but if it is made too short the switching on of
the microphone would become noticeable and possibly
distracting.
3.3. Distant sources
It is difficult to pick up sound objects that are in the
centre of the pitch or in between the sparsely placed
shotgun microphones. Although transient events can be
discerned by human ear, they are ambiguous and often
not picked up by the automatic detection algorithm due
to the low signal-to-noise ratio. There is a possibility of
using auxiliary microphones further from the pitch, such
as the Eigenmike® [6], to help determine
the content and position of more audio objects, as is
proposed for the FascinatE project.
Several techniques can be used to improve the accuracy
of the audio event extraction and allow for better, more
accurate positioning of sources in the sound field.
3.4. Using camera data for better source
positioning
Using audio data alone for the localization and
positioning of audio sources can work well but in some
circumstances, the lack of audio data available from
more than one microphone may make it difficult to
position the source on the pitch. With this in mind, it is
possible to use the data from the cameras to locate the
position of the sources on the pitch and from this data
either to turn on or point the microphones in the right
direction for better capturing or to position the source
more accurately in the sound field for rendering.
Figure 5 Using camera data to better position audio
sources (labels: focal length, pan angle)
The camera data that can be used includes camera
zoom, pan/tilt position and focal length. More and more
cameras coming on to the market nowadays utilize
camera heads that are able to provide this kind of
metadata with respect to time. For the main broadcast
camera (whose job it is to follow the play around) this
gives an approximate position of the main on-pitch
action and therefore the main audio source(s). It would
provide only a rough position but, when combined with
data from the other cameras, it is possible to perform
some triangulation to get better source positions as
shown in Figure 5. This is particularly important for the
FascinatE project if the rendering is to be done using
wave field synthesis, where each sound source will need
an accurate coordinate.
3.4.1. Using player tracking for better
source positioning
Another part of the FascinatE project is implementing a
player tracking system. For FascinatE, this will allow
the user to keep their view on one particular player or
even to track the ball and follow it around the
pitch automatically (Figure 6). If this technique is
applied it will provide the necessary information to
position the main source on the pitch. One can even
imagine tracking the referee as well as the players, which
would enable whistle blows to be positioned more accurately.
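The triangulation from camera data mentioned in section 3.4 can be illustrated with two cameras whose pan angles are known. This is a minimal two-dimensional sketch: the pitch coordinate system, the convention that pan angles are measured in radians from the x-axis and the function name `triangulate` are assumptions, and tilt, zoom and focal length are ignored.

```python
import numpy as np


def triangulate(cam_a, pan_a, cam_b, pan_b):
    """Intersect two bearing rays on the pitch plane.

    cam_a, cam_b: (x, y) camera positions in metres (hypothetical coordinates).
    pan_a, pan_b: pan angles in radians, measured from the x-axis.
    """
    cam_a = np.asarray(cam_a, dtype=float)
    cam_b = np.asarray(cam_b, dtype=float)
    dir_a = np.array([np.cos(pan_a), np.sin(pan_a)])
    dir_b = np.array([np.cos(pan_b), np.sin(pan_b)])
    # Solve cam_a + t * dir_a = cam_b + u * dir_b for t and u.
    matrix = np.column_stack([dir_a, -dir_b])
    t, _ = np.linalg.solve(matrix, cam_b - cam_a)
    return cam_a + t * dir_a
```

If the two cameras look along parallel directions the system is singular and `np.linalg.solve` raises `LinAlgError`, so in practice camera pairs with well-separated viewing directions would be preferred.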
Figure 6 Using player tracking in FascinatE to position
audio sources in sound field
3.4.2. Preprocessing the audio files
It might also be beneficial to pre-process the audio files
before doing the audio extraction; this could be done by
looking at a discrete set of parameters of interest, for example
using Mel Frequency Cepstral Coefficients (MFCCs)
or some musical information extraction algorithm. This
could be particularly useful when trying to extract
players’ communications.
4. CONCLUSIONS
An algorithm has been developed that extracts the key
audio events such as ball kicks and whistle blows from
the 12 shotgun microphones placed around a football
pitch in a standard broadcast. The algorithm allows the
sources to be extracted and also positioned in space for
a spatial audio rendering.
The algorithm has been developed within the context of
the European-funded project FascinatE, but it is also
useful for non-FascinatE applications. For FascinatE it
will be used to determine the content and position of
audio objects that can be rendered using almost any
rendering technique such as stereo, binaural, 5.1,
ambisonics and wave field synthesis. For non-FascinatE
applications it is a useful alternative to manually tracking the action
around the pitch and adjusting the faders accordingly.
5. ACKNOWLEDGEMENTS
This research project work is part of the FascinatE
project which has received funding from the European
Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no: 248138.
6. REFERENCES
[1] C. H. Knapp and G. C. Carter, “The generalized
correlation method for estimation of time delay,”
IEEE Trans. Acoust., Speech, Signal Process.
ASSP-24, 320–327 (1976).
[2] http://www.fascinate-project.eu
[3] JM. Batke, J. Spille et al, “Spatial Audio
Processing for Interactive TV Services”, 130th
Conv. Audio Eng. Soc., London, UK (2011).
[4] G. Cengarle, T. Mateos, N. Olaiz, and P. Arumí, “A
New Technology for the Assisted Mixing of Sport
Events: Application to Live Football
Broadcasting,” Proc. 128th Conv. Audio Eng. Soc.,
London, UK (2010).
[5] H. Kuttruff, Room Acoustics, Spon Press, 2000.
[6] http://www.mhacoustics.com