COMP 546, Winter 2017
lecture 19 - sound 1

Introduction to sound
A few lectures from now, we will consider problems of spatial hearing which I loosely define as
problems of using sounds to compute where objects are in 3D space. We’ll look at two kinds of
spatial hearing problems. One involves sounds that are emitted by objects in the world, which is
the spatial hearing problem we are used to. The other involves sounds that are reflected off objects
in the world. This is the spatial hearing problem used by echolocating animals, e.g., bats.
Emitted sounds are produced when forces are applied to an object that make the object vibrate
(oscillate). For example, when one object hits another object, e.g. hands clapping, kinetic energy
is transformed into potential energy by an elastic compression.[1] This elastic compression results
in vibrations of the object(s) which dampen out over time depending on the material of the object.
These object vibrations produce tiny pressure changes in the air surrounding the object, since as
the object vibrates it bumps into the air molecules next to it. These air pressure changes then
propagate as waves into the surrounding air.
There are three general factors that determine a sound. The first is the energy source which
drives some object into oscillatory behavior. The second is the object that is oscillating – its shape
and material properties. The third factor is the space into which or through which the sound is
emitted. This space cavity can attenuate or enhance certain frequencies through resonance. The
varying shapes of musical instruments are an obvious example. Voice is another, and we will return
to both.
Emitted sounds are important for hearing. They inform us about events occurring around us,
such as footsteps, a person talking, an approaching vehicle, etc. Here we have a major difference
between hearing and vision. Nearly all visible surfaces reflect light rather than emit light.[2]
However, light sources themselves are generally not informative for vision; rather, their "role"
is to illuminate other objects, that is, to shine light on other objects so that the visual system
can use this reflected light to estimate 3D scene properties or recognize the object by the spatial
configuration of the light patterns. By contrast, emitting sound sources are informative. They tell
us about the location of objects and their material properties.
What about reflected sounds? To what extent are they useful? Blind people seem to use reflected
sounds (echos) to navigate. Blind people can walk through an environment without bumping into
walls. To do so, they use the echos of their footsteps and the echos of the tapping of their cane.
They hear the reflections of these sounds off walls and other obstacles. What about people like
us who have normal vision? Do we use reflected sound? Probably, although not nearly as much as
blind people. This hasn't been studied much.
Animals such as bats and dolphins actively use reflected sounds to decide both where objects
are in space and what their material properties are. For example, bats can easily distinguish a flying moth
from a blowing leaf. I will tell you about this in the final lecture of the course.
[1] Another component of the kinetic energy is transformed into non-elastic mechanical energy, namely a permanent
shape change. This happens when the object cracks, chips, breaks, or is dented.
[2] Typically only hot objects emit light. Electronic displays/lights, e.g. LEDs, are obvious exceptions to this
statement. Not only are they non-hot light emitters, but the light patterns they emit are often meant to be informative
– perhaps the light pattern you are seeing right now.
last updated: 1st Apr, 2017
Pressure vs. intensity
Sound is a set of air pressure waves that are measured by the ear. Sound can travel through liquids
and solids as well as air. We will restrict our discussion to air. Air pressure is always positive. It
oscillates about some mean value I_a which we call atmospheric pressure. The units of air pressure
are atmospheres and the mean air pressure around us is approximately “one atmosphere”.
Sounds are small variations in air pressure about this mean value I_a. These pressure variations
can be either positive (compression) or negative (rarefaction). We will treat the pressure at a point
in 3D space as a function of time,

$$ P(X, Y, Z, t) = I_a + I(X, Y, Z, t) $$

where I(X, Y, Z, t) is small compared to atmospheric pressure I_a. Note that we are using "big"
X, Y, Z rather than little x, y, z, since we are talking about points in 3D space.
I emphasize that sounds are quite small perturbations of the atmospheric pressure. The quietest
sound that we can hear is a perturbation of 10^{-9} atmospheres. The loudest sound that we can
tolerate without pain is 10^{-3} atmospheres. Thus, we are sensitive to 6 orders of magnitude of
pressure changes. (An order of magnitude is a factor of 10.)
Loudness: SPL and dB
To refer to the loudness of a sound, one can refer either to the sound pressure I(X, Y, Z, t) or to
the square of this value, which I will loosely refer to as intensity. Sound pressure oscillates about 0,
whereas intensity is of course always positive. Think of intensity as the energy per unit volume,
namely, the work done to compress (positive I( )) or expand (negative I( )) a unit volume of air to
produce the deviation from I_a.
When describing the loudness of a sound, we typically don't care about the instantaneous pressure
or intensity, but rather the average over some time. Averaging the sound pressure I(X, Y, Z, t)
over time makes no sense, since the average is approximately zero, which has nothing to do with
the loudness of a sound. Instead one averages the intensity over some time T. Let I be the root
mean square (RMS) of the sound pressure:
$$ I \equiv \sqrt{ \frac{1}{T} \sum_{t=1}^{T} I(t)^2 } $$
The loudness will then be defined in terms of the ratio of the RMS sound pressure to the RMS
sound pressure of some standard I_0, a very quiet sound called the "threshold of hearing". We
can think of I_0 as some specific extremely soft sound which has been determined by scientists
in white coats in acoustically isolated rooms, and which is the quietest sound for which one can
determine a JND (just noticeable difference).
The range of sound pressures that we can hear comfortably is very large. Moreover, psychophysical studies have shown that people are sensitive to ratios of RMS sound pressures rather
than differences. For these reasons, one defines the loudness by the log of this ratio:
$$ \text{Bels} = \log_{10} \frac{I^2}{I_0^2} = 2 \log_{10} \frac{I}{I_0} $$
It is common to use a slightly different unit, namely ten times Bels. That is, the common unit for
defining the loudness of a sound is the sound pressure level (SPL) in decibels (dB):

$$ \text{SPL} = 10 \log_{10} \frac{I^2}{I_0^2} = 20 \log_{10} \left| \frac{I}{I_0} \right| $$
Multiplying by a factor of 10 is convenient because the human auditory system is limited in its
ability to discriminate sounds of different loudnesses: we can just discriminate sounds that differ
from each other by about 1/10 of a Bel, i.e. 1 dB. So, we can think of 1 dB as a just noticeable
difference (JND).
As an example, suppose you were to double the RMS sound pressure from I to 2I. What would
be the increase in SPL (dB) ? To answer this, note
$$ 20 \log_{10} \frac{2I}{I_0} = 20 \log_{10} 2 + 20 \log_{10} \frac{I}{I_0} $$

So the increase in dB is 20 log_{10} 2 ≈ 6 dB.
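This calculation is easy to check numerically. Below is a minimal Python sketch; the function name spl_db is mine, and I assume the conventional reference RMS pressure of 20 micropascals for I_0:

```python
import math

def spl_db(p_rms, p_ref=2e-5):
    """Sound pressure level in dB. The default reference is the conventional
    20 micropascal RMS pressure (roughly the threshold of hearing in air)."""
    return 20.0 * math.log10(p_rms / p_ref)

# Doubling the RMS pressure adds 20*log10(2), about 6 dB, at any level.
increase = spl_db(2.0) - spl_db(1.0)
```

Because the reference I_0 cancels in the difference, the ~6 dB increase holds no matter what the starting level is.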
Here are a few examples of sound pressure levels:
    Sound                          dB
    jet plane taking off (60 m)    120
    noisy traffic                   90
    conversation (1 m)              60
    middle of night quiet           30
    recording studio                10
    threshold of hearing             0
Speed of sound
Sound travels through air at speed about v = 340 meters per second. This is quite slow. If you go
to a baseball game and you sit behind the outfield fence over 100 m away, you can easily perceive
the delay between when you see the ball hit the bat, and when you hear the ball hit the bat.
A scale that is more relevant to the human is the time it takes a sound to travel a distance of,
say, 17 cm, which is roughly the distance between the ears (340 m/s = 34 cm/ms). So, a sound that
arrives at the head from the left will reach the left ear about 1/2 ms before the right ear. The brain
is quite sensitive to this time difference. (You are not consciously aware of this difference, but your
brain does use it.)
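The 1/2 ms figure follows directly from the numbers above; a tiny sketch (variable names are mine):

```python
v_sound = 340.0    # speed of sound in air, m/s (approximate)
head_width = 0.17  # assumed ear-to-ear distance, m

# Maximum interaural time difference: the extra path length for a sound
# arriving along the interaural axis, divided by the speed of sound.
itd_ms = head_width / v_sound * 1000.0  # delay in milliseconds
```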
The phenomenon is analogous to stereo vision, where you are not consciously aware of the small
spatial disparities between the left and right eyes. In stereo vision, your brain fuses the left and
right eyes' images and converts the disparities to depths. In binaural hearing, your brain fuses the
left and right ears' signals and converts the temporal disparities to a direction in 3D space. More
on this later...
Wave equation
We are considering sound to be a pressure function I(X, Y, Z, t). However, sound is not just any
function of those four variables. Rather, sound obeys the wave equation:
$$ \frac{\partial^2 I(X,Y,Z,t)}{\partial X^2} + \frac{\partial^2 I(X,Y,Z,t)}{\partial Y^2} + \frac{\partial^2 I(X,Y,Z,t)}{\partial Z^2} = \frac{1}{v^2} \frac{\partial^2 I(X,Y,Z,t)}{\partial t^2} $$
where v is the speed of sound. Notice that this equation is linear in I(X, Y, Z, t). If you have several
sources of sound, then the pressure function I that results is identical to the sum of the pressure
functions produced by the individual sources in isolation.
One function that satisfies the wave equation is

$$ \sin( 2\pi (k_X X + k_Y Y + k_Z Z + \omega t) ) $$

where the spatial frequencies k are measured in cycles per meter and the temporal frequency ω is
measured in cycles per second, provided that ω² = v² (k_X² + k_Y² + k_Z²).
    frequency     wavelength
    --------      ----------
    1 Hz          343 m
    10 Hz         34.3 m
    100 Hz        3.43 m
    1000 Hz       34.3 cm
    10,000 Hz     3.43 cm
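The table follows from wavelength = v / frequency. A small Python check, assuming v = 343 m/s as in the table:

```python
v = 343.0  # speed of sound in m/s, as used in the table above

def wavelength_m(freq_hz):
    """Wavelength in meters of a pure tone at the given frequency."""
    return v / freq_hz

# Reproduce the table (wavelengths printed in meters).
for f in (1, 10, 100, 1000, 10000):
    print(f"{f:>6} Hz : {wavelength_m(f):.4g} m")
```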
Sound impulse
Consider an isolated perturbation of air pressure at 3D point (X_0, Y_0, Z_0) and at time t = t_0, for
example, due to some impact. Or, you can imagine a digital sound generation system with a speaker
that generates a pulse. Intuitively, we suppose that pressure is constant (complete silence) and then
suddenly there is an instantaneous impulse in pressure at some particular spatial location. What
happens after that pulse?
Mathematically, we could model this sound pressure perturbation as an impulse function

$$ I(X, Y, Z) = \delta(X - X_0, Y - Y_0, Z - Z_0) $$

at t = t_0. But how does this impulse evolve for t > t_0? Think of a stone dropped in the water.
After the impact, you get an expanding circle. Because sound also obeys a wave equation, the same
phenomenon occurs, except now we are in 3D and so we get a wave of expanding spheres. The
speed of the wavefronts is the speed of sound. After one millisecond, the sphere is of radius 34 cm.
After two milliseconds, the sphere is of radius 68 cm, etc.
In fact, the expanding sphere will have a finite thickness, since the sound impulse will not be
instantaneous but rather will be a pulse of very short duration. This thickness will not change as
the sphere expands, however: the leading and trailing wavefronts (separated by the small thickness
of the shell) both travel at the speed v of sound.
The area of a sphere of radius r is 4πr², where

$$ r = \sqrt{ (X - X_0)^2 + (Y - Y_0)^2 + (Z - Z_0)^2 } $$

and

$$ r = v (t - t_0) $$
where v is the speed of sound. At any t, the pressure is approximately constant on the sphere. (We
ignore loss of energy due to friction/attenuation in the air, which in fact can be substantial for high
frequencies).
As mentioned earlier, the total energy of a sound within some volume is proportional to the
sum of the squared pressure over that volume. Since the energy of an expanding impulse wave is
distributed over a thin spherical shell whose volume grows as r² (recall the thickness of the shell
is small and fixed), the squared pressure I² at each point on the sphere of radius r will be
proportional to 1/r² in order for the total energy to be fixed. So, the RMS pressure I will be
proportional to 1/r.
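Since RMS pressure falls as 1/r, each doubling of distance from the source costs about 6 dB of SPL. A minimal sketch of this spherical-spreading loss (the function name is mine):

```python
import math

def spreading_loss_db(r1, r2):
    """Drop in SPL (dB) when moving from distance r1 to distance r2 from a
    point source, assuming ideal spherical spreading (RMS pressure ~ 1/r)."""
    return 20.0 * math.log10(r2 / r1)

loss_per_doubling = spreading_loss_db(1.0, 2.0)  # about 6 dB
```

This ignores frequency-dependent attenuation in the air, as the notes do.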
Let I_src(t_0) be a constant that indicates the strength of the impulse. Then at a distance r away
from the source, for example at the origin, the impulse is a function,

$$ \frac{I_{src}(t_0)}{r} \, \delta( r - v(t - t_0) ). $$

If the point (X_0, Y_0, Z_0) where the impulse occurs is far from the origin, then the impulse will
reach the origin at time t = t_0 + r/v and the wavefront will be roughly planar in the neighborhood
of the origin.
Finally, note that a real sound source won’t be just a single impulse, but rather will have a finite
time duration. Think of a person talking or shaking keys, etc. Even an impact that seems to have
quite a short duration will in fact have a duration over tens of milliseconds, as the vibrations of
the object dampen to zero. We can model a more general sound source that originates at some 3D
position as a sum of impulses

$$ I(t) = \sum_{t_0} \frac{I_{src}(t_0)}{r} \, \delta( r - v(t - t_0) ) $$
Here I(t) is the sound measured at a distance r from the source.
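In discrete time, this model says the received signal is a delayed, 1/r-attenuated copy of the source signal. A minimal sketch, with an assumed sampling rate and illustrative names:

```python
fs = 8000     # assumed sampling rate, samples per second
v = 340.0     # speed of sound, m/s
r = 17.0      # assumed source-to-listener distance, m

delay = round(r / v * fs)  # propagation delay in samples (50 ms -> 400)

src = [0.0] * 1000
src[0] = 1.0               # a unit impulse emitted at t0 = 0

# Received signal: the source delayed by r/v seconds and attenuated by 1/r.
received = [0.0] * 1000
for t in range(len(received)):
    if t - delay >= 0:
        received[t] = src[t - delay] / r
```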
Interaural Timing Differences
To compare the arrival time difference for the two ears, we begin with a simple model that relates
the pressure signals measured by the left and right ears:
$$ I_l(t) = \alpha I_r(t - \tau) + n(t) $$
where τ is the time delay, α is a scale factor that accounts for the shadowing of the head, and n(t)
is noise.
The auditory system is not given α, τ explicitly, of course. Rather it has to estimate them. We
can formulate this estimation problem as finding α, τ that minimizes:
$$ \sum_{t=1}^{T} ( I_l(t) - \alpha I_r(t - \tau) )^2 . \qquad (1) $$
Intuitively, we wish to shift and scale the right ear’s signal by τ so that it matches the left ear’s
signal. If the signals could be matched perfectly, then the sum of square differences would be zero.
Note that, to find the minimum over τ, the auditory system only needs to consider τ in the range
[−1/2, 1/2] ms, which is the time it takes sound to travel the distance between the ears.
Minimizing (1) is equivalent to minimizing
$$ \sum_{t=1}^{T} I_l(t)^2 \;+\; \alpha^2 \sum_{t=1}^{T} I_r(t - \tau)^2 \;-\; 2 \alpha \sum_{t=1}^{T} I_l(t) \, I_r(t - \tau) . $$
The summations in the first two terms are over slightly different intervals because of the shift τ in
the second term. However, if τ is small relative to the time duration of T samples (if T samples
cover several milliseconds), then most of the sample points are shared by the two summations.
Moreover, if the mean square intensities of I_r(t) and I_l(t) are each roughly constant over an interval
of T samples ± 1/2 ms, then we can ignore the explicit dependence on τ in the second term. With
these assumptions, one can find the τ that maximizes
$$ \sum_{t=1}^{T} I_l(t) \, I_r(t - \tau) . $$
This summation is essentially the cross-correlation of Il (t) and Ir (t), so one can find the τ that
maximizes the cross-correlation of sound pressures measured in the two ears over a small time
interval.
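A brute-force version of this cross-correlation search is easy to write down for discretely sampled signals. The sketch below uses illustrative names and a noise-free toy signal following the model I_l(t) = α I_r(t − τ); none of it is from the notes:

```python
import math

def estimate_itd(left, right, max_shift):
    """Return the integer-sample delay tau in [-max_shift, max_shift] that
    maximizes the cross-correlation sum_t left[t] * right[t - tau]."""
    n = len(left)
    best_tau, best_corr = 0, -math.inf
    for tau in range(-max_shift, max_shift + 1):
        # Sum over t so that t - tau always stays within the signal.
        corr = sum(left[t] * right[t - tau]
                   for t in range(max_shift, n - max_shift))
        if corr > best_corr:
            best_tau, best_corr = tau, corr
    return best_tau

# Toy example: the left ear's signal is a scaled copy of the right ear's,
# delayed by 3 samples (noise-free version of the model above).
alpha, true_tau = 0.8, 3
right = [math.sin(2 * math.pi * 0.05 * t) for t in range(200)]
left = [alpha * right[t - true_tau] for t in range(200)]
tau_hat = estimate_itd(left, right, max_shift=5)
```

In this noise-free example the search recovers the 3-sample delay exactly; with real signals, one would restrict the sums to a short window of samples, as discussed later in the notes.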
To estimate α, one could simply use the model that ignores the noise:
$$ I_l(t) \approx \alpha I_r(t - \tau) $$

and so

$$ \alpha^2 = \frac{ \sum_{t=1}^{T} I_l(t)^2 }{ \sum_{t=1}^{T} I_r(t - \tau)^2 } . $$
As we will see in the coming lectures, this model is overly simplistic because the difference between
sounds in the left and right ears is more complicated than just a shift and rescaling.
One other point to note concerns the time indexing. Here we are just indexing from t = 1 to T,
and so we are estimating the model for an interval centered at about t = T/2. Of course, if there
are many sounds in different positions, then we would like to estimate α and τ at many different
times. To estimate at some time centered at t_0, we could use instead
$$ \alpha^2 = \frac{ \sum_{t=-T/2}^{T/2} I_l(t_0 + t)^2 }{ \sum_{t=-T/2}^{T/2} I_r(t_0 + t - \tau)^2 } $$

and find the τ that maximizes

$$ \sum_{t=-T/2}^{T/2} I_l(t_0 + t) \, I_r(t_0 + t - \tau) . $$
Cone of confusion
Note that timing differences do not uniquely specify direction. Consider the line passing through
the two ears. This line and the center of the head, together with an angle θ ∈ [0, π], defines a cone
of confusion. If we treat the head as an isolated sphere floating in space, then all directions along
a single cone of confusion produce the same intensity difference and the same timing difference.
[Figure: cone of confusion, with angle θ about the interaural axis]
Does the cone of confusion place an ultimate limit on our ability to detect where sounds are
coming from? No, it doesn't, and the reason is that the head is not a sphere floating in space. The
head is attached to the body (in particular the shoulders), which reflects sound in an asymmetric
way, and the head has outer ears (the pinnae), which shape the sound wave in a manner that depends
on the direction from which the sound is coming. As we will see in an upcoming lecture, there is
an enormous amount of information available which breaks the cone of confusion.