The Digital Representation of Sound, Part One: Sound and Timbre

©Burk/Polansky/Repetto/Roberts/Rockmore. All rights reserved.
http://music.columbia.edu/cmc/musicandcomputers/
Chapter 1: The Digital Representation of Sound, Part One: Sound and Timbre
  1.1 What Is Sound?
  1.2 Amplitude
  1.3 Frequency, Pitch and Intervals
  1.4 Timbre

Chapter 2: The Digital Representation of Sound, Part Two: Playing by the Numbers
  2.1 The Digital Representation of Sound
  2.2 Analog v. Digital
  2.3 Sampling Theory
  2.4 Binary Numbers
  2.5 Bit Width
  2.6 Digital Copying
  2.7 Storage Concerns
  2.8 Compression

Chapter 3: The Frequency Domain
  3.1 The Frequency Domain
  3.2 Phasors
  3.3 Fourier and the Sum of Sines
  3.4 The DFT, FFT and IFFT
  3.5 Problems with the FFT/IFFT
  3.6 Some Alternatives to the FFT

Chapter 4: The Synthesis of Sound by Computer
  4.1 Introduction to Synthesis
  4.2 Additive Synthesis
  4.3 Filters
  4.4 Formant Synthesis
  4.5 Introduction to Modulation
  4.6 Waveshaping
  4.7 Frequency Modulation
  4.8 Granular Synthesis
  4.9 Physical Modeling

Chapter 5: The Transformation of Sound by Computer
  5.1 Sampling
  5.2 Reverb
  5.3 Localization/Spatialization
  5.4 The Phase Vocoder
  5.5 Convolution
  5.6 Morphing
  5.7 Graphical Manipulation
Chapter 1: The Digital Representation of Sound,
Part One: Sound and Timbre
Section 1.1: What Is Sound?
Sound is a complex phenomenon involving physics and perception. Perhaps the simplest way
to explain it is to say that sound involves at least three things:
• something moves
• something transmits the results of that movement
• something (or someone) hears the results of that movement (though this is philosophically debatable)
All things that make sound move, and in some very metaphysical sense, all things that move
(if they don’t move too slowly or too quickly) make sound.
As things move, they "push" and "pull" at the surrounding air (or water or whatever medium
they occupy), causing pressure variations (compressions and rarefactions). Those pressure
variations, or sound waves, are what we hear as sound.
Sound is often represented visually by figures, as in Figures 1.1 and 1.2.
Figure 1.1 Most sound waveforms end up looking pretty much alike. It’s hard to tell much about the nature of
this sound from this sort of time-domain plot of a sound wave.
This illustration is a plot of the immortal line "I love you, Elmo!" Try to figure out which part
of the image corresponds to each word in the spoken phrase.
Figure 1.2 Minute portion of the sound wave file from Figure 1.1, zoomed in many times.
In this image we can see the almost sample-by-sample movement of the waveform (we’ll
learn later what samples are). You can see that sound is pretty much a symmetrical type of
affair—compression and rarefaction, or what goes up usually comes down. This is more or
less a direct result of Newton’s third law of motion, which states that for every action there is
an equal and opposite reaction.
The graphs in Figures 1.1 and 1.2 are plots of functions. The concept of function is the simplest glue
between mathematical and musical ideas. Soundfiles for Section 1.1:
Soundfile 1.1: Speaking
Soundfile 1.2: Bird song
Soundfile 1.3: Gamelan
Soundfile 1.4: Tuba sounds
Soundfile 1.5: More bird sounds
Soundfile 1.6: Trumpet sound
Soundfile 1.7: Loud sound that gets softer
Soundfile 1.8: Whooshing
Soundfile 1.9: Chirping
Soundfile 1.10: Phat groove
Sound as a Function
Most of you probably have a favorite song, something that reminds you of a favorite place or
person. But how about a favorite function? No, not something like a black-tie affair or a
tailgate party; we mean a favorite mathematical function.
In fact, songs and functions aren’t so different. Music, or more generally sound, can be
described as a function. Mathematical functions are like machines that take in numbers as raw
material and, from this input, produce another number, which is the output.
There are lots of different kinds of functions. Sometimes functions operate by some easily
specified rule, like squaring. When a number is input into a squaring function, the output is
that number squared, so the input 2 produces an output of 4, the input 3 produces an output of
9, and so on. For shorthand, we’ll call this function s.
s(2) = 2² = 4
s(3) = 3² = 9
s(x) = x²
The last expression is really just an abbreviation that says for any number given as input to s,
the number squared is the output. If the input is x, then the output is x². Sometimes the
input/output relation may be easy to describe, but often the actual cause and effect may be
more complicated. For example, review the following function.
Assume there’s a thermometer on the wall. Starting at 8 A.M., for any number t of minutes
that have elapsed since 8 A.M., our function gives an output of the temperature at that time.
So, for an input of 5, the output of our temperature function is the room temperature at 5
minutes after 8 A.M. The input 10 gives as output the room temperature at 10 minutes after
8 A.M., and so on. Once again, for shorthand we can abbreviate this and call the function f.
f(5) = room temperature at 5 minutes after 8 A.M.
f(10) = room temperature at 10 minutes after 8 A.M.
f(t) = room temperature at t minutes after 8 A.M.
You can see how this temperature function is a little like our previous sound amplitude
graphs. The easiest way to understand the temperature function is according to its graph, the
picture that helps us visualize the function. The two axes are the input and output. If an input
is some number x units from 0 and the output is f(x) units (which could be a positive or
negative number), then we place a mark at f(x) units
above x.
Assume the following:
f(0) = 30
f(5) = 35
f(10) = 38
Figure 1.3 shows what happens when we graph these
three temperatures. (Note that we’ll leave the x-axis in
real time, but to be more precise we probably should
have written 0, 5, and 10 there!) We’ll join these marks
by a straight line.
Figure 1.3
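As a rough illustration (our own, not part of the original text), here is how the squaring function s and a lookup version of the temperature function f might look in Python; the temperature values are the hypothetical readings assumed above.

```python
def s(x):
    """The squaring function: s(x) = x squared."""
    return x * x

# Hypothetical readings taken 0, 5, and 10 minutes after 8 A.M. (from the example above).
readings = {0: 30, 5: 35, 10: 38}

def f(t):
    """Room temperature t minutes after 8 A.M. (defined only where we measured)."""
    return readings[t]

print(s(2), s(3))        # 4 9
print(f(0), f(5), f(10)) # 30 35 38
```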
So how do we get a function out of sound or music?
A Kindergarten Example
Imagine an entire kindergarten class piled on top of a trampoline in your neighbor’s backyard
(yes, we know this would be dangerous!). The kids are jumping up and down like maniacs,
and the surface of the trampoline is moving up and down in a way that is seemingly
impossible to analyze.
Suppose that before the kids jump on the trampoline, we paint a fluorescent yellow dot on the
trampoline and then ask the kids not to jump on that dot so that we can watch how it moves
up and down. The surface of the trampoline is initially at rest. The class climbs on. We take a
stopwatch out of our pocket and yell "Go!" while simultaneously pressing the start button. As
the kids go crazy, our job is to measure at each possible instant how far the yellow dot has
moved from its rest position. If the dot is above the initial position, we measure it as positive
(so a displacement of 3 cm up is recorded as +3). If the displacement is below the rest
position, we measure it as negative (so a displacement of 3 cm down is recorded as -3).
So follow the bouncing dot! It rises, then falls, sometimes a lot, sometimes a little, again and
again. If we chart this bouncing dot on a moving piece of paper, we get the kind of function
(of pressure, or deformation or perturbation) that we’ve been talking about.
Let’s return to the idea of writing down a list of numbers corresponding to a set of times. Now
we’re going to turn that list into the graph of a mathematical function! We’ll call that function
F.
On the horizontal line (the x-axis), we mark off the equally spaced numbers 1, 2, 3, and so on.
Then we mark off on the vertical axis (the y-axis) the numbers 1, 2, 3, and so on, going up,
and -1, -2, -3, and so on, going down. The numbers on the x-axis stand for time, and on the y-axis the numbers represent displacement. If at time N we recorded a displacement of 4, we put
a dot at 4 units above N and we say that F(N) = 4. If we recorded a displacement of -2, we put
a dot at the position 2 units below N and we say F(N) = -2. Each of the values F(N) is called a
sample of the function F.
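A minimal sketch of this bookkeeping in Python. The displacement function here is made up purely for illustration; a real trampoline would move far less predictably.

```python
import math

def displacement(t):
    # Made-up, continuously varying displacement (in cm) of the yellow dot at time t (seconds).
    return 4.0 * math.sin(2 * math.pi * 0.5 * t) + 1.5 * math.sin(2 * math.pi * 1.3 * t)

# F(N) is the displacement we record at time N: one sample per second for 10 seconds.
F = {N: displacement(N) for N in range(11)}

for N, value in F.items():
    print(N, round(value, 2))
```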
We’ll learn later (in Section 2.1, when we talk about sampling a waveform) that this process
of "every now and then" recording the value of a displacement in time is referred to as
sampling, and it’s fundamental to computer music and the storage of digital data. Sampling is
actually pretty simple. We regularly inspect some continuous movement and record its
position. It’s like watching a marathon on television: you don’t really need to see the whole
thing from start to finish—checking in every minute or so gives you a good sense of how the
race was run.
But suppose you could take a measurement at absolutely every instant in time—that is, take
these measurements continuously. That would give you a lot of numbers (infinitely many, in
fact, because who’s to say how small a moment in time can be?). Then you would have
numbers above and below every point and get a picture something like Figures 1.1 and 1.2,
which appear to be continuous.
Actually, calling these axes x and y is not so instructive. It is better to call the y-axis
"amplitude" and the x-axis "time." The following
Figure 1.4
Figure 1.5
Figure 1.6
When you hear something, this is in fact the end result of a very complicated sequence of
events in your brain that was initiated by vibrations of your eardrum. The vibrations are
caused by air molecules hitting the eardrum. Together they act a bit like waves crashing
against a big rubber seawall (or those kids on the trampoline).
These waves are the result of things like speaking, plucking a guitar string, hitting a key of the
piano, the wind rustling leaves, or blowing into a saxophone. Each of these actions causes the
air molecules near the sound source to be disturbed, like dropping many pebbles into a pond
all at once. The resulting waves are sent merrily on their way toward you, the listener, and
your eagerly awaiting eardrum. The corresponding function takes as input the number
representing the time elapsed since the sound was initiated and returns a number that
measures how far and in what direction your eardrum has moved at that instant. But what is
your eardrum actually measuring? That’s what we’ll talk about next.
Amplitude and Pressure
In the graphs of sound waves shown in Figures 1.1 and 1.2, time was represented on the x-axis and amplitude on the y-axis. So as a function, time is the input and amplitude (or
pressure) is the output, just like in the temperature example.
As we'll point out again and again in this chapter, one way to think about sound is as a
sequence of time-varying amplitudes or pressures, or, more succinctly, as a function of time.
The amplitude (y-)axis of the graphs of sound represents the amount of air compression
(above zero) or rarefaction (below zero) caused by a moving object, like vocal cords. Note
that zero is the "rest" position, or pressure equilibrium (silence). Looking at the changes in
amplitude over time gives a good idea of the amplitude shape or envelope of the sound wave.
Actually, this amplitude shape might correspond closely to a number of things, including:
• the actual vibration of the object
• the changes in pressure of the air, or water, or other medium
• the deformation (in or out) of the eardrum
This picture of a sound wave, as with amplitudes in time, provides a nice visual metaphor for
the idea of sound as a continuous sequence of pressure variations. When we talk about
computers, this graph of pressure versus time becomes a picture of a list of numbers plotted
against some variable (again, time). We’ll see in Chapter 2 how these numbers are stored and
manipulated.
Frequency: A Preview
Amplitude is just one mathematical, or acoustical, characteristic of sound, just as loudness is
only one of the perceptual characteristics of sounds. But, as we know, sounds aren’t only loud
and soft.
People often describe musical sounds as being "high" or "low." A bird tweeting may sound
"high" to us, or a tuba may sound "low."
But what are we really saying when we classify a sound as "high" or "low"? There’s a
fundamental characteristic of these graphs of pressure in time that is less obvious to the eye
but very obvious to the ear: namely, whether there is (or is not) a repeating pattern, and if so
how quickly it repeats. That’s frequency!
When we say that the tuba sounds are low and the bird sounds are high, what we are really
talking about is the result of the frequency of these particular sounds—how fast a pattern in
the sound’s graph repeats. In terms of waveforms, like what you saw and heard in the
previous sound files, we can, for the moment, somewhat concisely state that the rate at which
the air pressure fluctuates (moves in and out) is the frequency of the sound wave. We’ll learn
a lot more about frequency and its related cognitive phenomenon, pitch, in Section 1.3.
How Our Ears Work
Mathematical functions and kids jumping on a trampoline are one thing, but what’s the
connection to sound and music? Just moving an eardrum in and out can’t be the whole story!
Well, it isn't. The ear is a complex mechanism that tries to make sense out of these arbitrary
functions of pressure in time and sends the information to the brain.
We’ve already used the physical analogy of the trampoline as our eardrum and the kids as the
air molecules set in motion by a sound source. But to cover the topic more completely, we
need to discuss how sounds interact, via the eardrum, with the rest of our auditory system
(including the brain).
Our eardrums, like microphones and speakers, are in a sense transducers—they turn one form
of information or energy into another.
When sound waves reach our ears, they vibrate our eardrums, transferring the sound energy
through the middle ear to the inner ear, where the real magic of human hearing takes place in
a snail-shaped organ called the cochlea. The cochlea is filled with fluid and is bisected by an
elastic partition called the basilar membrane, which is covered with hair cells. When sound
energy reaches the cochlea, it produces fluid waves that form a series of peaks in the basilar
membrane, the position and size of which depend on the frequency content of the sound.
Different sections of the basilar membrane resonate (form peaks) at different frequencies:
high frequencies cause peaks toward the front of the cochlea, while low frequencies cause
peaks toward the back. These peaks match up with and excite certain hair cells, which send
nerve impulses to the brain via the auditory nerve. The brain interprets these signals as sound,
but as an interesting thought experiment, imagine extraterrestrials who might "see" sound
waves (and maybe "hear" light). In short, the cochlea transforms sounds from their physical,
time domain (amplitude versus time) form to the frequency domain (amplitude versus
frequency) form that our brains understand. Pretty impressive stuff for a bunch of goo and
some hairs!
Figure 1.7 Diagram of the inner ear showing how sound waves that enter through the auditory canal are
transformed into peaks, according to their frequencies, on the basilar membrane. In other words, the basilar
membrane serves as a time-to-frequency converter, in order to prepare sonic information for its eventual
cognition by the higher functions of the brain.
Who’d have thought sound was this complicated! But keep in mind that the sound wave
pressure picture is just raw data; it contains no frequency, timbral, or any other kind of
information. It needs a lot of processing, organization, and consideration to provide any sort
of meaning to us higher species.
We’ve made the hearing process seem pretty simple, but actually there’s a lot of controversy
in current auditory cognition research about the specifics of this remarkable organ and how it
works. As we understand more and more about the ear, musicians and scientists gain a better
understanding of how we perceive sound, and even, some believe, how we perceive music.
It's an exciting field of research, and an active one!
How Do We Describe Sound?
Sound can be described in many ways. We have a lot of different words for sounds, and
different ways of speaking about them. For example, we can call a sound "groovy," "dark,"
"bright," "intense," "low and rumbly," and so on. In fact, our colloquial language for talking
about sound, from a scientific viewpoint, is pretty imprecise. Part of what we’re trying to do
in computer music is to try to formulate more formal ways of describing sonic phenomena.
That doesn’t mean that there’s anything wrong with our usual ways of talking about sounds:
our current vocabulary actually works pretty well.
But to manipulate digital signals with a computer, it is useful to have access to a different sort
of description. We need to ask (and answer!) the following kinds of questions about the sound
in question:
• How loud is it?
• What is its pitch?
• What is its spectrum?
• What frequencies are present?
• How loud are the frequencies?
• How does the sound change over time?
• Where is the sound coming from?
• What's a good guess as to the characteristics of the physical object that made the sound?
Even some of these questions can be broken down into lots of smaller questions. For example,
what specifically is meant by "pitch"? Taken together, the answers to these questions and
others help describe the various characteristics and features that for many years have been
referred to collectively as the timbre (or "color") of a sound. But before we talk about timbre,
let’s start with more basic concepts: amplitude and loudness (Section 1.2).
Chapter 1: The Digital Representation of Sound,
Part One: Sound and Timbre
Section 1.2: Amplitude and Loudness
In the previous section we talked briefly about how a function of amplitude in time could be
thought of as a kind of sampling of a sound. (Remember, a sample is essentially a
measurement of the amplitude of a sound at a point in time.)
But knowing that at 200.056 milliseconds the amplitude of a sound is 0.2 doesn’t really help
us understand much about the sound in most cases. What we need is some way of measuring
some form of average amplitude of a number of samples (we sometimes call this a frame).
We need a way of understanding how these amplitudes, which are a physical measurement
(like frequency), correspond to our perception of loudness, which is a psychophysical
(anything that we perceive about the physical world is called "psychophysical") or, more
precisely, psychoacoustic or cognitive measure (like pitch).
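One common way to turn a frame of samples into a single "average amplitude" is the root-mean-square (RMS). This sketch is our own illustration (the sample values and frame size are made up), not a formula from the text.

```python
import math

def rms(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

samples = [0.0, 0.2, 0.5, 0.3, -0.1, -0.4, -0.2, 0.1, 0.6, 0.2]  # made-up amplitudes
frame_size = 5

# One average amplitude per frame: a crude amplitude envelope.
envelope = [rms(samples[i:i + frame_size]) for i in range(0, len(samples), frame_size)]
print(envelope)
```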
We’ll learn in Section 1.3 that amplitude and frequency are not independent—they both
contribute to our perception of loudness; that is, we use them together (in a way described by
something called the Fletcher-Munson curves, Section 1.3). But to describe that complex
psychoacoustic or cognitive aggregate called loudness, we first need to understand something
about amplitude and another related quantity called intensity. Then, at the end of our
discussion on frequency (Section 1.3), we’ll return to an important way that frequency affects
loudness (we’ll give you a little preview of this in this section as well).
In fact, it’s very important to realize that certain terms refer to physical or acoustic measures
and others refer to cognitive ones. The cognitive measures are much higher level and often
incorporate related effects from several acoustic phenomena. See Figure 1.8 to sort out this
little terminological jungle!
Acoustic and Cognitive (Psychoacoustic) Correlates

Frequency (how fast something vibrates) --> Pitch (how high or low we perceive it).

Amplitude (how much something vibrates) --> Intensity (how much the medium is displaced) --> Loudness (how loud we perceive it, affected by pitch and timbre).

Waveshape, attack/decay (transients), modulation, spectral characteristics, the spectral, pitch, and loudness trajectory, and everything else... --> Timbre (???).
Figure 1.8 Acoustic phenomena resulting from pressure variations and their psychophysical, or psychoacoustic (perceptual), counterparts.
This chart gives some sense of how the terminology for sound varies depending on whether
we talk about direct physical measures (frequency, amplitude) or cognitive ones (pitch,
loudness). It’s true that pitch is largely a result of frequency, but be careful: they’re not the
same thing. Soundfiles and examples for Section 1.2:
Xtra bit 1.2: Another sonic universe
Applet 1.2: Changing amplitudes
Soundfile 1.11: Two sine waves
Soundfile 1.12: A sine wave and a triangular wave
Soundfile 1.13: Chirp
Xtra bit 1.3: More on decibels
Applet 1.3: Amplitude versus decibel level
Amplitude and Pitch Independence
If you gently pluck a string on a guitar and then pluck it again, this time harder, what is the
difference in the sounds you hear? You’ll hear the same pitch, only louder. That illustrates
something interesting about the relationship between pitch and loudness—they are generally
independent of one another.
You can have two different loudnesses at the same frequency, or the same loudness at two different
frequencies. You can visualize this by drawing a series of sine waves, each with the same
period but different amplitude. Pure tones with the same period will generally be heard to
have the same pitch—so all of the sine waves must be at the same frequency. We’ll see that
pure tones correspond to variations in that old favorite function, the sine function. Remember
that amplitude is not loudness (one is physical, the other is psychophysical), but for the
moment let’s not make that distinction too rigid.
Figure 1.9 Two sine waves with the same frequency, different amplitude.
Figure 1.10 Two sounds may have the same frequency but different waveforms (resulting in a different sense of
timbre).
Figure 1.11 We can draw an envelope over a sound file, which is the average, or smoothed, amplitude of the
sound wave.
This is roughly equivalent to what we perceive as the changes in loudness of the sound. (If we
just take one average, we might call that the loudness of the whole sound.)
Figure 1.12 Two sine waves starting at different phase points. These sine waves have different starting points—
like the trampoline we talked about in the previous section. It can start flexed, either up or down. Remember that
these starting points are very close together in time (tiny fractions of a second). We’ll talk more about phase and
frequency later.
This simple picture directly above shows us something interesting about amplitude. The point
where the colors overlap is where the waveforms will combine, either summing to a combined
signal at a higher level or summing to a combined signal at a lower level. That is, when one
waveform goes negative (rarefaction, perhaps), it will counteract the other's positive
(compression, perhaps). This is called phase cancellation, but the complexity with which this
phenomenon occurs in the real sound world is obviously very great—lots and lots of sound
waves going positive and negative all over the place.
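A quick numerical sketch of phase cancellation (our own example, with an arbitrary sample rate): two equal sine waves that are 180 degrees out of phase sum to essentially silence.

```python
import math

sample_rate = 1000   # samples per second (arbitrary for this illustration)
freq = 5.0           # hertz

wave_a = [math.sin(2 * math.pi * freq * n / sample_rate) for n in range(sample_rate)]
wave_b = [math.sin(2 * math.pi * freq * n / sample_rate + math.pi) for n in range(sample_rate)]

mixed = [a + b for a, b in zip(wave_a, wave_b)]
print(max(abs(x) for x in mixed))   # effectively 0: the two waves cancel each other out
```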
Intensity
You can think of Soundfile 1.13 as a function that begins life as a sine function, f(x) = sin(x),
and then over time morphs into a really fast oscillating sine function, like f(x) = sin(20,000x).
As it morphs, we’ll continually listen to it.
Soundfile 1.13, sometimes called a chirp in acoustic discussions, sweeps a sine wave over the
frequency range from 0 Hz to 20,000 Hz. (Just so you know, this sonic phenomenon is not
named for a bird chirp, but in fact for a radar chirp, which is a special function used in some
radar work.)
The amplitude of the sine wave in Soundfile 1.13 does not change, but the perceived loudness
changes as it moves through areas of greater sensitivity in the Fletcher-Munson curve (Section
1.3). In other words, how loud we perceive something to be is mostly a result of amplitude,
but it is also a result of frequency. This simple example shows how complicated our
perception of even these simple phenomena is.
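Here is one way such a chirp could be generated digitally; this is our own sketch, and the sample rate and duration are assumptions. The idea is to advance the phase at each sample by an amount corresponding to the instantaneous frequency.

```python
import math

sample_rate = 44100          # samples per second (assumed)
duration = 5.0               # seconds (assumed)
f_start, f_end = 0.0, 20000.0

n_total = int(sample_rate * duration)
samples = []
phase = 0.0
for n in range(n_total):
    f = f_start + (f_end - f_start) * n / n_total   # frequency sweeps up linearly
    phase += 2 * math.pi * f / sample_rate
    samples.append(math.sin(phase))                 # constant amplitude, rising frequency
```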
We measure amplitude in volts, pressure, or even just sample numbers: it doesn’t really
matter. We just graph a function of something moving, and then if we want, we can say what
its average displacement was. If it was generally a big displacement (for example, sixth
graders on the trampoline), we say it has a big amplitude. If, on the other hand, things didn’t
move very much (say we drop a bunch of mice on the trampoline), we say that the sound
function had a small amplitude.
In the real world, things vibrate and send their vibrations through some medium (usually gas)
to our eardrums. When we become interested in how amplitude actually affects a medium, we
speak of the intensity of a sound in the medium. This is a little more specific than discussing
amplitude, which is more a purely relative term.
The medium we're usually interested in is air (at sea level, 72°F), and we measure intensity as
the amount of sound power passing through a given unit of area, the square meter. Power is
measured in watts, so intensity is measured in watts per square meter (W/m²).
As is the case with the perception of frequency as pitch, our perception of intensity as
loudness is logarithmic. But what the heck does that mean? Logarithmic perception means
that equal ratios of intensity, not equal differences, produce equal perceived changes in
loudness: it takes a bigger and bigger absolute change to produce the same perceived change.
Bubba and You
Think of it this way. Let’s say you work at a burger joint flipping burgers, and you make
$8.00 an hour. Let’s say your supervisor, Bubba, makes $9.00 an hour. Now let’s say the
burger corporation hits it big with their new broccoli sandwich, and they decide to put every
employee on a monthly raise schedule. They decide to give Bubba a dollar a month raise, and
you a 7% raise each month. Bubba thinks this is great!
That means the first month you only get $8.56, and Bubba gets $10.00. The next month,
Bubba gets $11.00, and you get about $9.16. This means that you now got a 60¢ raise, or that your
raise went up while his remained the same. The equation for your salary for any given month
is:
new salary = old salary + (0.07 × old salary)
Bubba’s is:
new salary = old salary + 1
You’re getting an increase by a fixed ratio of your salary, which itself is increasing, while
Bubba’s raise/salary ratio is actually decreasing. (The first month he got 1/9, the next month
1/10—at this rate he’ll approach a zero percent raise if he works at the burger place long
enough.) Figure 1.13 shows what the salary raises look like as functions.
Figure 1.13 Bubba & U.
This fundamental difference between ratiometric change and fixed arithmetic change is very
important. We tend to perceive most changes not in terms of absolute quantities (in this case,
$1.00), but in terms of relative quantities (percentage of a quantity).
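The two raise schedules from the example are easy to tabulate; a tiny sketch using the same starting salaries:

```python
you, bubba = 8.00, 9.00

for month in range(1, 13):
    you *= 1.07      # 7% raise: a fixed ratio of the current (growing) salary
    bubba += 1.00    # $1 raise: a fixed difference
    print(month, round(you, 2), round(bubba, 2))
# 'you' grows exponentially, Bubba grows linearly; eventually Bubba is overtaken.
```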
Changes in both amplitude and frequency are also perceived in terms of ratios. In the case of
amplitude, we have a standard measure, called the decibel (dB), which can describe how loud
a sound is perceived. As a convention, silence, which is 0 dB, is set to an intensity of 10⁻¹² W/m². This is not
really silence in the absolute sense, because things are still vibrating, but it is more or less
what you would hear in a very quiet recording studio with nobody making any sound. There is
still air movement and other sound-producing activity. (There are rooms called anechoic
chambers, used to study sound, that try to get things much quieter than 0 dB—but they're
very unusual places to be.) Any change of 10 dB corresponds roughly to a doubling of
perceived loudness. So, for example, going from 10 dB to 20 dB, or from 12 dB to 22 dB, is
perceived as roughly a doubling of loudness.
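Here is how an intensity maps to decibels, using the 10⁻¹² W/m² reference just mentioned (the example intensities are our own):

```python
import math

I0 = 1e-12   # reference intensity in W/m^2, defined as 0 dB

def intensity_to_db(intensity):
    return 10.0 * math.log10(intensity / I0)

print(intensity_to_db(1e-12))   # 0.0 dB ("silence")
print(intensity_to_db(1e-11))   # 10.0 dB: ten times the intensity...
print(intensity_to_db(1e-10))   # 20.0 dB: ...but each 10 dB step sounds only about twice as loud
```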
Source (average dB):
silence: 0
whisper, rustling of leaves on a romantic summer evening: 30
phone ringing, normal conversation: 60
car engine: 70
diesel truck, heavy city traffic, vacuum cleaner, factory noise: 80
power lawn mower, subway train: 90
chain saw, rock concert: 110
jet takeoff, gunfire, Metallica (your head near the guitar player's amplifier): 120+
Table 1.1 Average decibel levels for environmental sounds. 120 dB is often called the threshold of pain. Good
name for a rock band, huh?
Note that even brief exposure to very loud (90 dB or greater) sounds or constant exposure to
medium-level (60 dB to 80 dB) sounds can cause hearing loss. Be careful with your ears—
invest in some earplugs!
"Audience participation: During tests in the Royal Festival Hall, a note played mezzoforte on the horn measured approximately 65 decibels of sound. A single uncovered
cough gave the same reading. A handkerchief placed over the mouth when coughing
assists in obtaining a pianissimo."
Chapter 1: The Digital Representation of Sound,
Part One: Sound and Timbre
Section 1.3: Frequency, Pitch, and Intervals
Soundfiles and examples for Section 1.3:
Xtra bit 1.4: Tapping a frequency
Applet 1.4: Hearing frequency and amplitude
Soundfile 1.14: Two sine waves, different frequencies
Applet 1.5: When ticks become tones
Soundfile 1.15: Low gong
Applet 1.6: Transpose using multiply and add
Applet 1.7: Octave quiz
Applet 1.8: Fletcher-Munson example
What is frequency? Essentially, it’s a measurement of how often a given event repeats in
time. If you subscribe to a daily paper, then the frequency of paper delivery could be
described as once per day, seven times per week. When we talk about the frequency of a
sound, we’re referring to how many times a particular pattern of amplitudes repeats during
one second.
Not all waveforms or physical vibrations repeat exactly (in fact almost none do!). But many
vibratory phenomena, especially those in which we perceive some sort of pitch, repeat
approximately regularly. If we assume that in fact they are repeating, we can measure the rate
of repetition, and we call that the waveform’s frequency.
Figure 1.14 Two sine waves. The frequency of the red wave is twice that of the blue one, but their amplitudes
are the same. It would be difficult or impossible to actually hear this as two distinct tones, since the octaves fuse
into one sound.
As we’ll discuss later, we measure frequencies in cycles per second, or hertz (Hz). Click on
the soundfile icon to hear two sine tones: one at 400 Hz and one at 800 Hz.
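Those two tones are easy to sketch digitally. This is our own example; the sample rate is an assumption that we explain properly in Chapter 2.

```python
import math

sample_rate = 44100   # samples per second (assumed)
duration = 1.0        # seconds

def sine_tone(freq):
    n_samples = int(sample_rate * duration)
    return [math.sin(2 * math.pi * freq * n / sample_rate) for n in range(n_samples)]

tone_400 = sine_tone(400.0)   # 400 Hz
tone_800 = sine_tone(800.0)   # 800 Hz: twice the frequency, one octave higher
```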
Sine Waves
A sine wave is a good example of a repeating pattern of amplitudes, and in some ways it is the
simplest. That’s why sine waves are sometimes referred to as simple harmonic motions. Let’s
arbitrarily fix an amplitude scale to be from –1 to 1, so the sine wave goes from 0 to 1 to 0 to
–1 to 0. If the complete cycle of the sine wave’s curve takes one second to occur, then we say
that it has a frequency of one cycle per second (cps), or one hertz (Hz, or kHz for 1,000 Hz).
The frequency range of sound (or, more accurately, of human hearing) is usually given as
0 Hz to 20 kHz, but our ears don’t fuse very low frequency oscillations (0 Hz to 20 Hz, called
the infrasonic range) into a pitch. Low frequencies just sound like beats. These numbers are
just averages: some people hear pitches as low as 15 Hz; others can hear frequencies
significantly higher than 20 kHz. A lot depends on the amplitude, the timbre, and other
factors. The older you get (and the more rock n’ roll you listened to!), the more your ears
become insensitive to high frequencies (a natural biological phenomenon called presbycusis).
Source: lowest frequency (Hz) to highest frequency (Hz)
piano: 27.5 to 4,186
female speech: 140 to 500
male speech: 80 to 240
compact disc: 0 to 22,050
human hearing: 20 to 20,000
Table 1.2 Some frequency ranges.
Period
When we talk about frequency in music, we are referring to how often a sonic event happens
in its entirety over the course of a specified time segment. For a sound to have a perceived
frequency, it must be periodic (repeat in time). Since the period of an event is the length of
time it takes the event to occur, it’s clear that the two concepts (periodicity and frequency) are
related, if not pretty much equivalent.
The period of a repeating waveform is the length of time it takes to go through one cycle. The
frequency is sort of the inverse: how many times the waveform repeats that cycle per unit
time. We can understand the periodicity of sonic events just like we understand that the period
of a daily newspaper delivery is one day.
Since a 20 Hz tone by definition is a cycle that repeats 20 times a second, then in 1/20th of a
second one cycle goes by, so a 20 Hz tone has a period of 1/20 or 0.05 second. Now the
"thing" that repeats is one basic unit of this regularly repeating wave—such as a sine wave (at
the beginning of this section there’s a picture of two sine waves together). It’s not hard to see
that the time it takes for one copy of the basic wave to recur (or move through whatever
medium it is in) is proportional to the distance from crest to crest (or any two successive
corresponding points, for that matter) of the sine wave.
This distance is called the wavelength of the wave (or of the periodic function). In fact, if you
know how fast the wave is moving, then it is easy to figure out the wavelength from the
period.
Physically, the wavelength is proportional to the period (and inversely proportional to the
frequency). Wavelength is a spatial measure that says how far the wave travels in space in one period.
We measure it in distance, not time. The speed of sound (s) is about 345 meters/second. To find
the wavelength (w) for a sound of a given frequency (f), first we invert the frequency (1/f) to get
its period (p), and then we use the following simple formula:

w = s × p = s / f
Figure 1.15 Very low musical sounds can have very long wavelengths: some central Javanese (from
Indonesia) gongs vibrate at around 8 Hz to 10 Hz, and as such their wavelengths are on the order of
35 to 43 m. Look at the size of the gong in this photo! It makes some very low sounds.
(The musicians above are from STSI Bandung, a music conservatory in West Java, Indonesia,
participating in a recording session for a piece called "mbuh," by the contemporary composer
Suhendi. This recording can be heard on the CD Asmat Dream: New Music Indonesia,
Lyrichord Compact Disc #7415.)
Using the formula, we find that the wavelength of a 1 Hz tone is 345 meters, which makes
sense, since a 1 Hz tone has a period of 1 second, and sound travels 345 meters in one second!
That’s pretty far, until you realize that, since these waveforms are usually symmetrical, if you
were standing, say, at 172.5 meters from a vibrating object making a 1 Hz tone and right
behind you was a perfectly reflective surface, it’s entirely possible that the negative portion of
the waveform might cancel out the positive and you’d hear nothing! While this is a rather
extreme and completely hypothetical example, it is true that wave cancellation is a common
physical occurrence, though it depends on a great many parameters.
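Putting the period and wavelength formulas together in a short sketch (our own, with the speed of sound taken as 345 m/s, as above):

```python
SPEED_OF_SOUND = 345.0   # meters per second, approximate value in air

def period(freq):
    return 1.0 / freq                        # seconds per cycle

def wavelength(freq):
    return SPEED_OF_SOUND * period(freq)     # w = s * p = s / f

for f in (1.0, 20.0, 440.0, 10000.0):
    print(f, "Hz ->", period(f), "s,", round(wavelength(f), 4), "m")
# 1 Hz -> 345 m, 20 Hz -> 17.25 m, 440 Hz -> about 0.78 m, 10 kHz -> about 0.035 m
```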
Pitch
Musicians usually talk to each other about the frequency content of their music in terms of
pitch, or sets of pitches called scales. You’ve probably heard someone mention a G-minor
chord, a blues scale, or a symphony in C, but has anyone ever told you about the new song
they wrote with lots of 440 cycles per second in it? We hope not!
Humans tend to recognize relative relationships, not absolute physical values. And when we
do, those relationships (especially in the aural domain) tend to be logarithmic. That is, we
don’t perceive the difference (subtraction) of two frequencies, but rather the ratio (division).
This means that it is much easier for most humans to hear or describe the relationship or ratio
between two frequencies than it is to name the exact frequencies they are hearing. And in fact,
for most of us, the exact frequencies aren’t even very important—we recognize "Row, Row,
Row Your Boat" regardless of what frequency it is sung at, as long as the relationships
between the notes are more or less correct. The common musical term for this is
transposition: we hear the tune correctly no matter what key it’s sung in.
Although pitch is directly related to frequency, they’re not the same. As we pointed out
earlier, and similar to what we saw in Section 1.2 when we discussed amplitude and loudness,
frequency is a physical or acoustical phenomenon. Pitch is perceptual (or psychoacoustic,
cognitive, or psychophysical). The way we organize frequencies into pitches will be no
surprise if you understood what we talked about in the previous section: we require more and
more change in frequency to produce an equal perceptual change in pitch.
Once again, this is all part of that logarithmic perception "thing" we’ve been yammering on
about, because the way we describe that increase is by logarithms and exponentials. Here’s a
simple example: the difference to our ears between 101 Hz and 100 Hz is much greater than
the difference between 1,001 Hz and 1,000 Hz. We don’t hear a change of 1 Hz for each;
instead we hear a change of 1,001/1,000 (= 1.001) as compared to a much bigger change of
101/100 (= 1.01).
Intervals and Octaves
So we don’t really care about the linear, or arithmetic, differences between frequencies; we
are almost solely interested in the ratio of two frequencies. We call those ratios intervals, and
almost every musical culture in the world has some term for this concept. In Western music,
the 2:1 ratio is given a special importance, and it’s called an octave.
It seems clear (though not totally unarguable) that most humans tend to organize the
frequency spectrum between 20 Hz and 20 kHz roughly into octaves, which means powers of
2. That is, we perceive the same pitch difference between 100 Hz and 200 Hz as we do
between 200 Hz and 400 Hz, 400 Hz and 800 Hz, and so on. In each case, the ratio of the two
frequencies is 2:1. We sometimes call this base-2 logarithmic perception. Many theorists
believe that the octave is somehow fundamental to, or innate and hard-wired in, our
perception, but this is difficult to prove. It’s certainly common throughout the world, though a
great deal of approximation is tolerated, and often preferred!
In almost all musical cultures, pitches are named not by their actual frequencies, but as
general categories of frequencies in relationship to other frequencies, all a power of 2 apart.
For example, A is the name given to the pitch on the piano or clarinet with a frequency of
440 Hz as well as 55 Hz, 110 Hz, 220 Hz, 880 Hz, 1760 Hz, and so on. The important thing is
the ratio between the frequencies, not the distance; for example, 55 Hz to 110 Hz is an octave
that happens to span 55 Hz, yet 50 Hz to 100 Hz is also an octave, even though it only covers
50 Hz. But if an orchestra tunes to a different A (as most do nowadays, for example, to middle
A = 441 Hz or 442 Hz to sound higher and brighter), those frequencies will all change to be
multiples/divisors of the new absolute A.
Figure 1.16 The red graph shows a series of octaves starting at 110 Hz (an A); each new octave is twice as high
as the last one. The blue graph shows linearly increasing frequencies, also starting at 110 Hz.
We say that the frequencies in the blue graph are rising linearly, because to get each new
frequency we simply add 110 Hz to the last frequency: the change in frequency is always the
same. However, to get octaves we must double the frequency, meaning that the difference
(subtractively) in frequency between two adjacent octaves is always increasing.
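The two series in Figure 1.16 can be generated like this (our own sketch):

```python
start = 110.0   # an A, in Hz

octaves = [start * 2 ** n for n in range(6)]    # 110, 220, 440, 880, 1760, 3520
linear = [start + 110.0 * n for n in range(6)]  # 110, 220, 330, 440, 550, 660

print(octaves)  # each step doubles: equal ratios, ever-growing differences
print(linear)   # each step adds 110 Hz: equal differences, ever-shrinking ratios
```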
One thing is clear, however: to have pitch, we need frequency, and thus periodic waveforms.
Each of these three concepts implies the other two. This relationship is very important when
we discuss how frequency is not used just for pitch, but also in determining timbre. To get
some sense of this, consider that the highest note of a piano is around 4 kHz. What about the
rest of the range, the almost 18 kHz of available sound? It turns out that this larger frequency
range is used by the ear to determine a sound’s timbre. We will discuss timbre in Section 1.4.
Before we move on to timbre, though, we should mention that pitch and amplitude are also
related. When we hear sounds, we tend to compare them, and think of their amplitudes, in
terms of loudness. The perceived loudness of a sound depends on a combination of factors,
including the sound’s amplitude and frequency content. For example, given two sounds of
very different frequencies but at exactly the same amplitude, the lower-frequency sound will
often seem softer. Our ear tends to amplify certain frequencies and attenuate others.
Figure 1.17 Equal-loudness contours (often referred to as Fletcher-Munson curves) are curves that tell us how
much intensity is needed at a certain frequency in order to produce the same perceived loudness as a tone at a
different frequency. They’re sort of "loudness isobars." If you follow one line (which meanders across intensity
levels and frequencies), you’re following an equal loudness contour.
For example, for a 50 Hz tone to sound as loud as a 1000 Hz tone at 20 dB, it needs to sound
at about 55 dB. These curves are surprising, and they tell us some important things about how
our ears evolved. While these curves are the result of rather gross data generalization, and
while they will of course vary depending on the different sonic environments to which
listeners are accustomed, they also seem to be rather surprisingly accurate across cultures.
Perhaps they represent something that is hardwired rather than learned. These curves are
widely used by audio manufacturers to make equipment more efficient and sound more
realistic.
When looking at Figure 1.17 for the Fletcher-Munson curves, note how the curves start high
in the low frequencies, dip down in the mid-frequencies, and swing back up again. What does
this mean? Well, humans need to be very sensitive to the mid-frequency range. That’s how,
for instance, you can tell immediately if your mom’s upset when she calls you on the phone.
(Phones, by the way, cut off everything above around 7 kHz.)
Most of the sounds we need to recognize for survival purposes occur in the mid-frequency
range. Low frequencies are not too important for survival. The nuances and tiny inflections in
speech, and in most of the sounds we care about, tend to happen in the 500 Hz to 2 kHz range, and we have
evolved to be extremely sensitive in that range (though it's hard to say which came first, the
evolution of speech or the evolution of our sensitivity to speech sounds).
This mid-frequency range sensitivity is probably a universal human trait and not all that
culturally dependent. So, if you’re traveling far from home, you may not be able to
understand the person at the table in the café next to you, but if you whistle the Fletcher-Munson curves, you'll have a great time together.
Chapter 1: The Digital Representation of Sound,
Part One: Sound and Timbre
Section 1.4: Timbre
Soundfiles and examples for Section 1.4:
Applet 1.9: Draw a waveform
Soundfile 1.16: Tuning fork at 256 Hz
Soundfile 1.17: Sine wave at 256 Hz
Xtra bit 1.6: Tuning forks
Soundfile 1.18: Composer Warren Burt’s piece for tuning forks
Soundfile 1.19: Clarinet sound
Soundfile 1.20: Clarinet with attack lopped off
Soundfile 1.21: Flute sound
Soundfile 1.22: Flute with attack lopped off
Soundfile 1.23: Piano sound
Soundfile 1.24: Piano with attack lopped off
Soundfile 1.25: Trombone (not played underwater!)
Soundfile 1.26: Trombone with attack lopped off
Soundfile 1.27: Violin
Soundfile 1.28: Violin with attack lopped off
Soundfile 1.29: Voice
Soundfile 1.30: Voice with attack lopped off
What’s the difference between a tuba and a flute (or, more accurately, the sounds of each)?
How do we tell the difference between two people singing the same song, when they’re
singing exactly the same notes? Why do some guitars "sound" better than others (besides the
fact that they’re older or cost more or have Eric Clapton’s autograph on them)? What is it that
makes things "sound" like themselves?
It’s not necessarily the pitch of the sound (how high or low it is)—if everyone in your family
sang the same note, you could almost surely tell who was who, even with your eyes closed.
It’s also not just the loudness—your voice is still your voice whether you talk softly or scream
at the top of your lungs. So what’s left? The answer is found in a somewhat mysterious and
elusive thing we call, for lack of a better word, "timbre," and that’s what this section is all
about.
"Timbre" (pronounced "tam-ber") is a kind of sloppy word, inherited from previous eras, that
lumps together lots of things that we don’t fully understand. Some think we should abandon
the word and concept entirely! But it’s one of those words that gets used a lot, even if it
doesn’t make much sense, so we’ll use it here too—we’re sort of stuck with it for the time
being.
What Makes Up Timbre?
Timbre can be roughly defined as those qualities of a sound that aren’t just frequency or
amplitude. These qualities might include:
• spectra: the aggregate of simpler waveforms (usually sine waves) that make up what we recognize as a particular sound. This is what Fourier analysis gives us (we'll discuss that in Chapter 3).
• envelope: the attack, sustain, and decay portions of a sound (often referred to as transients).
Envelope and spectra are very complicated concepts, encompassing a lot of subcategories. For
example, spectral features describe the different ways that the spectral aggregates are
organized statistically in terms of shape and form (e.g., the relative "noisiness" of a sound is
largely a result of its spectral relationships). Many facets of envelope (onset time,
harmonic decay, spectral evolution, steady-state modulations, etc.) are not easily explained by
just looking at the envelope of a sound. Researchers spend a great deal of time on very
specific aspects of these ideas, and it’s an exciting and interesting area for computer
musicians to research.
Figure 1.18 shows a simplified picture of the envelope of a trumpet tone.
Figure 1.18 This image illustrates the attack, sustain, and decay portions of a standard amplitude envelope. This
is a very simple, idealized picture, called a trapezoidal envelope. We are not aware of any actual, natural
occurrence of this kind of straight-lined sound!
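A minimal sketch of such an idealized trapezoidal envelope, applied sample by sample to a sine tone. The durations, frequency, and sample rate here are all our own assumptions.

```python
import math

sample_rate = 44100
attack, sustain, decay = 0.1, 0.5, 0.2   # seconds (assumed)
freq = 440.0                             # Hz (assumed)

def envelope(t):
    """Trapezoid: ramp up during the attack, hold at 1.0, ramp back down during the decay."""
    if t < attack:
        return t / attack
    if t < attack + sustain:
        return 1.0
    if t < attack + sustain + decay:
        return 1.0 - (t - attack - sustain) / decay
    return 0.0

n_samples = int(sample_rate * (attack + sustain + decay))
signal = [envelope(n / sample_rate) * math.sin(2 * math.pi * freq * n / sample_rate)
          for n in range(n_samples)]
```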
It's helpful here to bring another descriptive term into our vocabulary: spectrum. A spectrum
is a waveform's distribution of energy across frequencies. The combination of
spectra (plural of spectrum) and envelope helps us to define the "color" of a sound. Timbre is
difficult to talk about, because it’s hard to measure something subjective like the "quality" of
a sound. This concept gives music theorists, computer musicians, and psychoacousticians a lot
of trouble. However, computers have helped us make great progress in the exploration and
understanding of the various components of what’s traditionally been called "timbre."
Basic Elements of Sound
As we’ve shown, the average piece of music can be a pretty complicated function.
Nevertheless, it’s possible to think of it as a combination of much simpler sounds (and hence
simpler functions)—even simpler than individual instruments. The basic atoms of sound, the
sinusoids (sine waves) we talked about in the previous sections, are sometimes called pure
tones, like those produced when a tuning fork vibrates. We use the tuning fork to talk about
these tones because it is one of the simplest physical vibrating systems.
Although you might think that a discussion of tuning forks belongs more in a discussion of
frequency, we’re going to use them to introduce the notion of sinusoids: Fourier components
of a sound.
Figure 1.19 This tuning fork rings at 256 Hz. You can hear the sound it makes by clicking on Soundfile 1.16.
Compare the tuning fork sound to that of a pure sine wave at the same frequency of 256 Hz.
When you hit the tines of a tuning fork, it vibrates and emits a very pure note or tone. Tuning
forks are able to vibrate at very precise frequencies. The frequency of a tuning fork is the
number of times the tip goes back and forth in a second. And this number won’t change, no
matter how hard you hit that fork. As we mentioned, the human ear is capable of hearing
sounds that vibrate all the way from 20 times a second to 20,000 times a second. Low-frequency sounds are like bass notes, and high-frequency sounds are like treble notes. (Low
frequency means that the tines vibrate slowly, and high frequency means that they vibrate
quickly.)
Figure 1.20 Click on the different tuning forks to see the different audiograms. Notice what they have in
common! They are all roughly the same shape—simple waves that differ only in the width of the regularly
repeating peaks. The higher tones give more peaks over the same interval. In other words, the peaks occur more
frequently. (Get it? Higher frequency!)
When you whack the tines of a tuning fork, the fork vibrates. The number of times the tines
go back and forth in one second determines the frequency of a particular tuning fork.
Click Soundfile 1.18. You’ll hear composer Warren Burt’s piece for tuning forks,
"Improvisation in Two Ancient Greek Modes."
Now, why do tuning fork functions have their simple, sinusoidal shape? Think about how the
tip of the tuning fork is moving over time. We see that it is moving back and forth, from its
greatest displacement in one direction all the way back to just about the same displacement in
the opposite direction.
Imagine that you are sitting on the end of the tine (hold on tight!). When you move to the left,
that will be a negative displacement; and when you move to the right, that will be a positive
displacement. Once again, as time progresses we can graph the function that at each moment
in time outputs your position. Your back-and-forth motion yields the functions many of you
remember from trigonometry: sines and cosines.
Figure 1.21 Sine and cosine waves. Thanks to Wayne Mathews for this image.
Any sound can be represented as a combination of different amounts of these sines and
cosines of varying frequencies. The mathematical topic that explains sounds and other wave
phenomena is called Fourier analysis, named after its discoverer, the great French
mathematician Jean Baptiste Joseph Fourier (1768–1830).
Figure 1.22 Spectra of (a) sawtooth wave and (b) square wave.
Figure 1.22 shows the relative amplitudes of sinusoidal components of simple waveforms. For
example, Figure 1.22(a) indicates that a sawtooth wave can be made by addition in the
following way: one part of a sine wave at the fundamental frequency (say, 1 Hz), then half as
much of a sine wave at 2 Hz, and a third as much at 3 Hz, and so on, infinitely.
In Section 4.2, we’ll talk about using the Fourier technique in synthesizing sound, called
additive synthesis. If you want to jump ahead a bit, try the applet in Section 4.2, that lets you
build simple waveforms from sinusoidal components. Notice that when you try to build a
square wave, there are little ripples on the edges of the square. This is called Gibbs ringing,
and it has to do with the fact that the sum of any finite number of these decreasing amounts of
sine waves of increasing frequency is never exactly a square wave.
What the charts in Figure 1.22 mean is that if you add up all those sinusoids whose
frequencies are integer multiples of the fundamental frequency of the sound and whose
amplitudes are described in the charts by the heights of the bars, you’ll get the sawtooth and
square waves.
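A sketch of that recipe in code (our own example): sum sine partials at 1, 2, 3, ... times the fundamental with amplitudes 1, 1/2, 1/3, ... for a sawtooth, or odd partials only for a square wave. With a finite number of partials you will also see the Gibbs ripples mentioned above.

```python
import math

sample_rate = 44100
fundamental = 220.0      # Hz (assumed)
n_partials = 20          # more partials gets closer to the ideal waveform
n_samples = sample_rate  # one second

def sawtooth(n):
    t = n / sample_rate
    return sum(math.sin(2 * math.pi * k * fundamental * t) / k
               for k in range(1, n_partials + 1))

def square(n):
    t = n / sample_rate
    return sum(math.sin(2 * math.pi * k * fundamental * t) / k
               for k in range(1, n_partials + 1, 2))   # odd partials only

saw = [sawtooth(n) for n in range(n_samples)]
sq = [square(n) for n in range(n_samples)]
```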
This is what Fourier analysis is all about: every periodic waveform (which is the same, more
or less, as saying every pitched sound) can be expressed as a sum of sines whose frequencies
are integer multiples of the fundamental, each with some particular amplitude (which the analysis determines). The sawtooth
and square wave charts in Figure 1.22 are called spectral histograms (they don’t show any
evolution over time, since these waveforms are periodic).
These sine waves are sometimes referred to as the spectral components, partials, overtones, or
harmonics of a sound, and they were long thought to be primarily responsible for our sense
of timbre. So when we refer to the tenth partial of a sound, we mean a sinusoid at 10 times
the sound's fundamental frequency (without saying anything about its amplitude).
The sounds in some of the following soundfiles are conventional instruments with their
attacks lopped off, so that we can hear each instrument as a different periodic waveform and
listen to each instrument’s different spectral configurations. Notice that, strangely enough, the
clarinet (whose sound wave is a lot like a sawtooth wave) and the flute, without their attacks,
are not all that different (in the grand scheme of things).
Figure 1.23 Clarinet.
Figure 1.24 Flute.
Figure 1.25 Piano.
Figure 1.26 Trombone. Picture courtesy of Bob Hovey www.TROMBONISTICALISMS.bigstep.com.
Figure 1.27 Violin.
Chapter 2: The Digital Representation of Sound,
Part Two: Playing by the Numbers
Section 2.1: Digital Representation of Sound
The world is continuous. Time marches on and on, and there are plenty of things that we
can measure at any instant. For example, weather forecasters might keep an ongoing
record of the temperature or barometric pressure. If you are in the hospital, the nurses
might be keeping a record of your temperature, your heart rate, or your brain waves. Any
of these records gives you a function f(t) where, at a given time t, f(t) is the value of the
particular statistic that interests you. These sorts of functions are called time series.
Figure 2.1 Genetic algorithms (GA) are evolutionary computing systems, which have been applied to
things like optimization and machine learning. These are examples of continuous functions. What we mean
by "continuous" is that at any instant of time the functions take on a well-defined value, so that they make
squiggly line graphs that could be traced without the pencil ever leaving the paper. This might also be called
an analog function. Copyright Juha Haataja/Center for Scientific Computing, Finland
Of course, the time series that interest us are those that represent sound. In particular, we
want to take these time series, stick them on the computer, and play with them!
Now, if you’ve been paying attention, you may realize that at this moment we’re in a bit
of a bind: the type of time series that we’ve been describing is a continuous function. That
is, at every instant in time, we could write down a number that is the value of the function
at that instant—whether it be how much your eardrum has been displaced, what your
temperature is, what your heart rate is, and so on. But such a continuous function would
provide an infinite list of numbers (any one of which may have an infinite decimal expansion, like
π = 3.14159...), and no matter how big your computer is, you're going to have a pretty
tough time fitting an infinite collection of numbers on your hard drive.
So how do we do it? How can we represent sound as a finite collection of numbers that
can be stored efficiently, in a finite amount of space, on a computer, and played back and
manipulated at will? In short, how do we represent sound digitally? That’s the problem
that we’ll start to investigate in this chapter.
Somehow we have to come up with a finite list of numbers that does a good job of
representing our continuous function. We do it by sampling the original function every few instants (at some predetermined rate, called the sampling rate) and recording the value of the function at that moment. For example, maybe we only record the temperature every 5
minutes. For sound we need to go a lot faster, and we often use a special device that grabs
instantaneous amplitudes at rapid, audio rates (called an analog to digital converter, or
ADC).
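To make that concrete, here is a minimal Python sketch (ours, not part of the original text) of what sampling at a rate means: read off the value of a continuous function a fixed number of times per second. The function names and the 440 Hz test tone are just illustrative assumptions; a real ADC does this in hardware, measuring a voltage instead of evaluating a formula.

    import math

    def sample(signal, sample_rate, duration):
        """Return a list of samples, taken sample_rate times per second."""
        n_samples = int(sample_rate * duration)
        return [signal(n / sample_rate) for n in range(n_samples)]

    if __name__ == "__main__":
        sine_440 = lambda t: math.sin(2 * math.pi * 440 * t)   # a stand-in "continuous" signal
        samples = sample(sine_440, sample_rate=44100, duration=0.01)
        print(len(samples), samples[:5])   # 441 samples describe 10 milliseconds of sound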
A continuous function is also called an analog function, and, to restate the problem, we
have to convert analog functions to lists of samples, or digital functions, which is the
fundamental way that computers store information. In computers, think of a digital
function as a function of location in computer memory. That is, these functions are stored
as lists of numbers, and as we read through them we are basically creating a discrete
function of time of individual amplitude values.
Figure 2.2 A pictorial description of the recording and playback of sounds through an ADC/DAC.
In analog to digital conversions, continuous functions (for example, air pressures, sound
waves, or voltages) are sampled and stored as numerical values. In digital to analog
conversions, numerical values are interpolated by the converter to force some continuous
system (such as amplifiers, speakers, and subsequently the air and our ears) into a
continuous vibration. Interpolation just means smoothly transitioning across the discrete
numerical values.
When we digitally record a sound, an image, or even a temperature reading, we
numerically represent that phenomenon. To convert sounds between our analog world and
the digital world of the computer, we use a device called an analog to digital converter
(ADC). A digital to analog converter (DAC) is used to convert these numbers back to
sound (or to make the numbers usable by an analog device, like a loudspeaker). An ADC
takes smooth functions (of the kind found in the physical world) and returns a list of
discrete values. A DAC takes a list of discrete values (like the kind found in the computer
world) and returns a smooth, continuous function, or more accurately the ability to create
such a function from the computer memory or storage medium.
Figure 2.3 Two graphical representations of sound. The top is our usual time domain graph, or audiogram,
of the waveform created by a five-note whistled melody. Time is on the x-axis, and amplitude is on the y-axis.
The bottom graph is the same melody, but this time we are looking at a time-frequency
representation. The idea here is that if we think of the whistle as made up of contiguous
small chunks of sound, then over each small time period the sound is composed of
differing amounts of various pieces of frequency. The amount of frequency y at time t is
encoded by the brightness of the pixel at the coordinate (t, y). The darker the pixel, the
more of that frequency at that time. For example, if you look at time 0.4 you see a band of
white, except near 2,500, showing that around that time we mainly hear a pure tone of
about 2,500 Hz, while at 0.8 second, there are contributions all around from about 0 Hz to
3,000 Hz, but stronger ones at about 2,500 Hz and 200 Hz.
We’ll be giving a much more precise description of the frequency domain in Chapter 3,
but for now we can simply think of sounds as combinations of more basic sounds that are
distinguished according to their "brightness." We then assign numbers to these basic
sounds according to their brightness. The brighter a sound, the higher the number.
As we learned in Chapter 1, this number is called the frequency, and the basic sound is
called a sinusoid, the general term for sinelike waveforms. So, high frequency means high
brightness, and low frequency means low brightness (like a deep bass rumble), and in
between is, well, simply in between.
All sound is a combination of these sinusoids, of varying amplitudes. It’s sort of like
making soup and putting in a bunch of basic spices: the flavor will depend on how much
of each spice you include, and you can change things dramatically when you alter the
proportions. The sinusoids are our basic sound spices! The complete description of how
much of each of the frequencies is used is called the spectrum of the sound.
Since sounds change over time, the proportions of the sinusoids change too. It’s just like the soup: as it boils, the balance of spices keeps shifting as things evaporate.
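Here is a small Python sketch (ours, with made-up frequencies and amplitudes) of the spice idea: each (frequency, amplitude) pair is one sinusoidal ingredient, and the list of amplitudes is a crude stand-in for the spectrum. Change the amplitudes and you change the flavor of the resulting waveform.

    import math

    def mix(partials, t):
        """Add up sinusoids: partials is a list of (frequency_in_hz, amplitude) pairs."""
        return sum(amp * math.sin(2 * math.pi * freq * t) for freq, amp in partials)

    if __name__ == "__main__":
        recipe = [(220, 1.0), (440, 0.5), (660, 0.25)]            # a fundamental and two partials
        samples = [mix(recipe, n / 44100) for n in range(441)]    # 10 ms of the blend at 44.1 kHz
        print(samples[:3])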
Chapter 2: The Digital Representation of Sound,
Part Two: Playing by the Numbers
Section 2.2: Analog Versus Digital
http://music.columbia.edu/cmc/musicandcomputers/applets/2_2_sampled_fader.php
Applet 2.1: Sampled fader
The distinction between analog and digital information is not an easy one to understand, but
it’s fundamental to the realm of computer music (and in fact computer anything!). In this
section, we’ll offer some analogies, explanations, and other ways to understand the difference
more intuitively.
Imagine watching someone walking along, bouncing a ball. Now imagine that the ball leaves
a trail of smoke in its path. What does the trail look like? Probably some sort of continuous
zigzag pattern, right?
Figure 2.4 The path of a bouncing ball.
OK, keep watching, but now blink repeatedly. What does the trail look like now? Because we
are blinking our eyes, we’re only able to see the ball at discrete moments in time. It’s on the
way up, we blink, now it’s moved up a bit more, we blink again, now it may be on the way
down, and so on. We’ll call these snapshots samples because we’ve been taking visual
samples of the complete trajectory that the ball is following. The rate at which we obtain
these samples (blink our eyes) is called the sampling rate.
It’s pretty clear, though, that the faster we sample, the better chance we have of getting an
accurate picture of the entire continuous path along which the ball has been traveling.
Figure 2.5 The same path, but sampled by blinking.
What’s the difference between the two views of the bouncing ball: the blinking and the
nonblinking? Each view pretty much gives the same picture of the ball’s path. We can tell
how fast the ball is moving and how high it’s bouncing. The only real difference seems to be
that in the first case the trail is continuous, and in the second it is broken, or discrete. That’s
the main distinction between analog and digital representations of information: analog
information is continuous, while digital information is not.
Analog and Digital Waveform Representations
Now let’s take a look at two time domain representations of a sound wave, one analog and
one digital, in Figure 2.6.
Figure 2.6 An analog waveform and its digital cousin: the analog waveform has smooth and continuous
changes, and the digital version of the same waveform has a stairstep look. The black squares are the actual
samples taken by the computer. The grey lines suggest the "staircasing" that is an inevitable result of converting
an analog signal to digital form. Note that the grey lines are only for show—all that the computer knows about
are the discrete points marked by the black squares. There is nothing in between those points.
The analog waveform is nice and smooth, while the digital version is kind of chunky. This
"chunkiness" is called quantization or staircasing—for obvious reasons! Where do the
"stairsteps" come from? Go back and look at the digital bouncing ball figure again, and see
what would happen if you connected each of the samples with two lines at a right angle.
Voila, a staircase!
Staircasing is an artifact of the digital recording process, and it illustrates how digitally
recorded waveforms are only approximations of analog sources. They will always be
approximations, in some sense, since it is theoretically impossible to store truly continuous
data digitally. However, by increasing the number of samples taken each second (the sample
rate), as well as increasing the accuracy of those samples (the resolution), an extremely
accurate recording can be made. In fact, we can prove mathematically that we can get so
accurate that, theoretically, there is no difference between the analog waveform and its digital
representation, at least to our ears.
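Here is a rough Python sketch (ours, not from the text) of why a higher sample rate gives a closer approximation: hold each sample until the next one arrives (the staircase) and measure how far that staircase ever strays from the original smooth sine. The 440 Hz tone, the 10-millisecond window, and the two sample rates are arbitrary test values.

    import math

    def max_staircase_error(freq, sample_rate, checks=10000):
        """Largest gap between a sine and its sample-and-hold (staircase) version."""
        signal = lambda t: math.sin(2 * math.pi * freq * t)
        worst = 0.0
        for k in range(checks):
            t = k / checks * 0.01                                     # scan the first 10 ms
            held = signal(math.floor(t * sample_rate) / sample_rate)  # value of the last sample taken
            worst = max(worst, abs(signal(t) - held))
        return worst

    if __name__ == "__main__":
        print(max_staircase_error(440, 8000))    # coarse staircase, larger error
        print(max_staircase_error(440, 96000))   # finer staircase, much smaller error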
Chapter 2: The Digital Representation of Sound,
Part Two: Playing by the Numbers
Section 2.3: Sampling Theory
Contents of C:\bolum\bahar2014-2015\ce476\Section 2.3:
Xtra bit 2.1: Free sample: a tonsorial tale
Applet 2.2: Oscillators
Soundfile 2.1: Undersampling
Soundfile 2.2: Standard sampling at 44,100 samples per second
Applet 2.3: Scrubber applet
Soundfile 2.3: Chirping
So now we know that we need to sample a continuous waveform to represent it digitally. We
also know that the faster we sample it, the better. But this is still a little vague. How often do
we need to sample a waveform in order to achieve a good representation of it?
The answer to this question is given by the Nyquist sampling theorem, which states that to represent a signal well, the sampling rate (or sampling frequency, not to be confused with the frequency content of the sound) needs to be at least twice the highest frequency contained in the signal.
For example, look back at our time-frequency picture in Figure 2.3 from Section 2.1. It looks
like it only contains frequencies up to 8,000 Hz. If this were the case, we would need to
sample the sound at a rate of 16,000 Hz (16 kHz) in order to accurately reproduce the sound.
That is, we would need to take sound bites (bytes?!) 16,000 times a second.
In the next chapter, when we talk about representing sounds in the frequency domain (as a
combination of various amplitude levels of frequency components, which change over time)
rather than in the time domain (as a numerical list of sample values of amplitudes), we’ll learn
a lot more about the ramifications of the Nyquist theorem for digital sound. But for our
current purposes, just remember that since the human ear only responds to sounds up to about
20,000 Hz, we need to sample sounds at least 40,000 times a second, or at a rate of 40,000 Hz,
to represent these sounds for human consumption. You may be wondering why we even need
to represent sonic frequencies that high (when the piano, for instance, only goes up to the high
4,000 Hz range). The answer is timbral, particularly spectral. Remember that we saw in
Section 1.4 that those higher frequencies fill out the descriptive sonic information.
Just to review: we measure frequency in cycles per second (cps) or Hertz (Hz). The frequency
range of human hearing is usually given as 20 Hz to 20,000 Hz, meaning that we can hear
sounds in that range. Knowing that, if we decide that the highest frequency we’re interested in
is 20 kHz, then according to the Nyquist theorem, we need a sampling rate of at least twice
that frequency, or 40 kHz.
Figure 2.7 Undersampling: What happens if we sample too slowly for the frequencies we’re trying to represent?
We take samples (black dots) of a sine wave (in blue) at a certain interval (the sample rate). If
the sine wave is changing too quickly (its frequency is too high), then we can’t grab enough
information to reconstruct the waveform from our samples. The result is that the high-frequency waveform masquerades as a lower-frequency waveform (how sneaky!), or that the
higher frequency is aliased to a lower frequency.
Soundfile 2.1 demonstrates undersampling of the same sound source as Soundfile 2.2. In this
example, the file was sampled at 1,024 samples per second. Note that the sound sounds
"muddy" at a 1,024 sampling rate—that rate does not allow us any frequencies above about
500 Hz, which is sort of like sticking a large canvas bag over your head, and putting your
fingers in your ears, while listening.
Figure 2.8 Picture of an undersampled waveform. This sound was sampled 512 times per second. This was way too slow.
Figure 2.9 This is the same sound file as above, but now sampled 44,100 (44.1 kHz) times per second. Much better.
Aliasing
The most common standard sampling rate for digital audio (the one used for CDs) is 44.1
kHz, giving us a Nyquist frequency (defined as half the sampling rate) of 22.05 kHz. If we
use lower sampling rates, for example, 20 kHz, we can’t represent a sound whose frequency is
above 10 kHz. In fact, if we try, we’ll get usually undesirable artifacts, called foldover or
aliasing, in the signal.
In other words, if a sine wave is changing too quickly for our sampling rate, we get the same set of samples that we would have obtained had we been taking samples from a sine wave of lower frequency! The higher-frequency contributions now act as impostors of lower-frequency information, adding extra, unanticipated low-frequency contributions to the sound. Sometimes we can use this in cool, interesting ways, and other times it just messes up the original sound.
So in a sense, these impostors are aliases for the low frequencies, and we say that the result of
our undersampling is an aliased waveform at a lower frequency.
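A short Python sketch (ours) shows the impostor effect directly. At an assumed 1,000 Hz sampling rate, a 900 Hz cosine and a 100 Hz cosine produce the same list of samples (up to rounding), so once the samples are stored there is no way to tell which wave we started with.

    import math

    sample_rate = 1000   # far too slow for a 900 Hz tone (the Nyquist frequency is 500 Hz)
    samples_900 = [math.cos(2 * math.pi * 900 * n / sample_rate) for n in range(8)]
    samples_100 = [math.cos(2 * math.pi * 100 * n / sample_rate) for n in range(8)]

    for a, b in zip(samples_900, samples_100):
        print(round(a, 6), round(b, 6))   # the two columns match: 900 Hz aliases to 100 Hz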
Figure 2.10 Foldover aliasing. This picture shows what happens when we sweep a sine wave
up past the Nyquist rate. It’s a picture in the frequency domain (which we haven’t talked
about much yet), so what you’re seeing is the amplitude of specific component frequencies
over time. The x-axis is frequency, the z-axis is amplitude, and the y-axis is time (read from
back to front).
As the sine wave (which starts at 0 Hz and ends at 44,100 Hz over 10 seconds) sweeps up into frequencies above the Nyquist frequency, an aliased wave is reflected below the Nyquist frequency of 22,050 Hz. The sound can be heard in Soundfile 2.3, a 10-second soundfile sweeping a sine wave from 0 Hz to 44,100 Hz. Notice that the sound seems to disappear after it reaches the Nyquist rate of 22,050 Hz, but then it wraps around as aliased sound back into the audible domain.
Anti-Aliasing Filters
Fortunately it’s fairly easy to avoid aliasing—we simply make sure that the signal we’re
recording doesn’t contain any frequencies above the Nyquist frequency. To accomplish this
task, we use an anti-aliasing filter on the signal. Audio filtering is a technique that allows us to
selectively keep or throw out certain frequencies in a sound—just as light filters (like ones
you might use on a camera) only allow certain frequencies of light (colors) to pass. For now,
just remember that a filter lets us color a sound by changing its frequency content.
An anti-aliasing filter is called a low-pass filter because it only allows frequencies below a
certain cutoff frequency to pass. Anything above the cutoff frequency gets removed. By
setting the cutoff frequency of the low-pass filter to the Nyquist frequency, we can throw out
the offending frequencies (those high enough
to cause aliasing) while retaining all of the
lower frequencies that we want to record.
Figure 2.11 An anti-aliasing low-pass filter. Only the
frequencies within the passband, which stops at the
Nyquist frequency (and "rolls off" after that), are
allowed to pass. This diagram is typical of the way we
draw what is called the frequency response of a filter. It
shows the amplitude that will come out of the filter in
response to different frequencies (of the same
amplitude).
Anti-aliasing filters can be analogized to coffee
filters. The desired components (frequencies or
liquid coffee) are preserved, while the filter
(coffee filter or anti-aliasing filter) catches all
the undesirable components (the coffee
grounds or the frequencies that the system
cannot handle).
Perfect anti-aliasing filters cannot be constructed, so we almost always get some aliasing error
in an ADC/DAC conversion.
Anti-aliasing filters are a standard component in digital sound recording, so aliasing is not
usually of serious concern to the average user or computer musician. But because many of the
sounds in computer music are not recorded (and are instead created digitally inside the
computer itself), it’s important to fully understand aliasing and the Nyquist theorem. There’s
nothing to stop us from using a computer to create sounds with frequencies well above the
Nyquist frequency. And while the computer has no problem dealing with such sounds as data,
as soon as we mere humans want to actually hear those sounds (as opposed to just
conceptualizing or imagining them), we need to deal with the physical realities of aliasing, the
Nyquist theorem, and the analog-to-digital conversion process.
Chapter 2: The Digital Representation of Sound,
Part Two: Playing by the Numbers
Section 2.4: Binary Numbers
Applet 2.4: Binary counter
Now we know more than we ever wanted to know about how often to store numbers in order
to digitally represent a signal. But surely speed isn’t the only thing that matters. What about
size? In this section, we’ll tell you something about how big those numbers are and what they
"look" like.
All digital information is stored as binary numbers. A binary number is simply a way of
representing our regular old numbers as a list, or sequence, of zeros and ones. This is also
called a base-2 representation.
In order to explain things in base 2, let’s think for a minute about those ordinary numbers that
we use. You know that we use a decimal representation of numbers. This means that we write
numbers as finite sequences whose symbols are taken from a collection of ten symbols: our
digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. A number is really a shorthand for an arithmetic
expression. For example, the decimal number 2,051 is shorthand for:
2 x 10^3 + 0 x 10^2 + 5 x 10^1 + 1 x 10^0 = 2,000 + 0 + 50 + 1 = 2,051
Now we bet you can see a pattern here. For every place in the number, we just multiply by a
higher power of ten. Representing a number in base-8, say (which is called an octal
representation), would mean that we would only use eight distinct symbols, and multiply
those by powers of 8 (instead of powers of 10). For example (in base 8), the number 2,051
means:
2 x 8^3 + 0 x 8^2 + 5 x 8^1 + 1 x 8^0 = 1,024 + 0 + 40 + 1 = 1,065 (in base 10)
There is even hexadecimal notation in which we work in base 16, so we need 16 symbols.
These are our usual 0 through 9, augmented by A, B, C, D, E, and F (counting for 10 through
15). So, in base 16 we might write:
3AF = 3 x 16^2 + 10 x 16^1 + 15 x 16^0 = 768 + 160 + 15 = 943 (in base 10)
But back to binary: this is what we use for digital (that is, computer) representations. With
only two symbols, 0 and 1, numbers look like this:
1101 = 1 x 2^3 + 1 x 2^2 + 0 x 2^1 + 1 x 2^0 = 8 + 4 + 0 + 1 = 13 (in base 10)
Each of the places is called a bit (for Binary digIT). The leftmost bit is called the most
significant bit (MSB), and the rightmost is the least significant bit (because the digit in the
leftmost position, the highest power of 2, makes the most significant contribution to the total
value represented by the number, while the rightmost makes the least significant
contribution). If the digit at a given bit is equal to 1, we say it is set. We also label the bits by
what power of 2 they represent. How many different numbers can we represent with 4 bits?
There are sixteen such combinations, and here they are, along with their octal, decimal, and
hexadecimal counterparts:
Binary   Octal   Decimal   Hexadecimal
0000     00      00        00
0001     01      01        01
0010     02      02        02
0011     03      03        03
0100     04      04        04
0101     05      05        05
0110     06      06        06
0111     07      07        07
1000     10      08        08
1001     11      09        09
1010     12      10        0A
1011     13      11        0B
1100     14      12        0C
1101     15      13        0D
1110     16      14        0E
1111     17      15        0F
Table 2.1 Number base chart.
Some mathematicians and philosophers argue that the reason we use base 10 is that we have
ten fingers (digits)—those extraterrestrials we met in Chapter 1 who hear light might also
have an extra finger or toe (or sqlurmphragk or something), and to them the millennial year
might be the year 1559 (in base 11)! Boy, that will just shock them right out of their
fxrmmp!qts!
What’s important to keep in mind is that it doesn’t much matter which system we use; they’re
all pretty much equivalent (1010 base 2 = 10 base 10 = 12 base 8 = A base 16, and so on). We
pick numeric bases for convenience: binary systems are useful for switches and logical
systems, and computers are essentially composed of lots of switches.
Figure 2.12 A four-bit binary number (called a nibble). Fifteen is the biggest decimal number we can represent in 4 bits (sixteen different values, counting zero).
Numbering Those Bits
Consider the binary number 0101 from Figure 2.12. It has 4 bits, numbered 0 to 3 (computers
generally count from 0 instead of 1). The rightmost bit (bit zero, the LSB) is the "ones" bit. If
it is set (equal to 1), we add 1 to our number. The next bit is the "twos" bit, and if set adds 2 to
our number. Next comes the "fours" bit, and finally the "eights" bit, which, when set, add 4
and 8 to our number, respectively. So another way of thinking of the binary number 0101
would be to say that we have zero eights, one four, zero twos, and one 1.
Bit number   2^(bit number)   Bit value
0            2^0              1
1            2^1              2
2            2^2              4
3            2^3              8
Table 2.2 Bits and their values. Note that every bit in base 2 is the next power of 2. More generally, every place
in a base-n system is n raised to that place number.
Don’t worry about remembering the value of each bit. There’s a simple trick: to find the value of a bit, just raise 2 to its bit number (and remember to start counting at 0!). What would be
the value of bit 4 in a five-bit number? If you get confused about this concept, just remember
that this is simply what you learned in grade school with base 10 (the 1s column, the 10s
column, the 100s column, and so on).
This applet sonifies, or lets you hear, a binary counter. Each of the 8 bits is assigned to a
different note in the harmonic series (pitches that are integer multiples of the fundamental
pitch, corresponding to the spectral components of a periodic waveform). As the binary
counter counts, it turns the notes on and off depending on whether or not the bit is set.
The crucial question is, though, how many different numbers can we represent with four bits?
Or more important, how many different numbers can we represent with n bits? That will
determine the resolution of our data (and for sound, how accurately we can record minute
amplitude changes). Look back at Table 2.1: using all possible combinations of the 4 bits, we
were able to represent sixteen numbers (but notice that, starting at 0, they only go to 15).
What if we only used 2 bits? How about 16?
Again, thinking of the analogy with base 10, how many numbers can we represent in three
"places" (0 through 999 is the answer, or 1,000 different numbers). A more general question
is this: for a given base, how many values can be represented with n places (can’t call them
bits unless it’s binary)? The answer is the base to the nth power.
The largest number we can represent (since we need 0) is the base to the nth power minus 1.
For example, the largest number we can represent in binary with n bits is 2n – 1.
Number of bits   2^(number of bits)   Number of numbers
8                2^8                  256
16               2^16                 65,536
24               2^24                 16,777,216
32               2^32                 4,294,967,296
Table 2.3 How many numbers can an n-bit number represent? They get big pretty fast, don’t they? In fact, by definition, these numbers get big exponentially as the number of bits increases linearly (like we discussed when we talked about pitch and loudness perception in Section 1.2).
Modern computers use 64-bit numbers! If 2^32 is over 4 billion, try to imagine what 2^64 is (2^32 x 2^32). It’s a number almost unimaginably large: over 18 quintillion. As an example, if you counted very fast, say five times a second, from 0 to this number, it would take you well over 100 billion years to get there (you’d need the lifetimes of several solar systems to accomplish this).
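Here is a tiny Python sketch (ours) of the arithmetic behind Table 2.3 and the counting example: how many values n bits can hold, and roughly how long counting to the 64-bit maximum would take at five counts per second.

    for n in (8, 16, 24, 32, 64):
        print(n, "bits ->", 2**n, "values, largest value", 2**n - 1)

    seconds = 2**64 / 5                        # counting five times a second
    years = seconds / (60 * 60 * 24 * 365)
    print("counting to 2**64 takes roughly", round(years / 1e9), "billion years")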
OK, we’ve got that figured out, right? Now back to sound. Remember that a sample is
essentially a "snapshot" of the instantaneous amplitude of a sound, and that snapshot is stored
as a number. What sort of number are we talking about? 3? 0.00000000017? 16,000,000,126?
The answer depends on how accurately we capture the sound and how much space we have to
store our data.
Chapter 2: The Digital Representation of Sound,
Part Two: Playing by the Numbers
Section 2.5: Bit Width
When we talked about sampling, we made the point that the faster you sample, the better the
quality. Fast is good: it gives us higher resolution (in time), just like in the old days when the
faster your tape recorder moved, the more (horizontal) space it had to put the sound on the
tape. But when you had fast tape recorders, moving at 30 or 60 inches per second, you used
up a lot of tape. The faster our moving storage, the more media it’s going to consume. While
this may be bad ecologically, it’s good sonically. Accuracy in recording demands space.
However, if we only have a limited amount of storage space, we need to do something about
that space issue. One thing that eats up space digitally (in the form of memory or disk space)
is bits. The more bits you use, the more hard disk space or memory size you need. That’s also
true with sampling rates. If we sample at high rates, we’ll use up more space.
We could, in principle, use 64-bit numbers (capable of extraordinary detail) and sample at 100
kHz—big numbers, fast speeds. But our sounds, as digitally stored numbers, will be huge.
Somehow we have to make some decisions balancing our need for accuracy and sonic quality
with our space and storage limitations.
For example, suppose we only use the values 0, 1, 2, and 3 as sample values. This would
mean that every sample measurement would be "rounded off" to one of these four values. On
one hand, this would probably be pretty inaccurate, but on the other hand, each sample would
then be encoded using only a 2-bit number. Not too consumptive, and pretty simple,
technologically! Unfortunately, using only these four numbers would probably mean that
sample values won’t be distinguished all that much! That is, most of our functions in the
digital world would look pretty much alike. This would create very low resolution data, and
the audio ramification is that they would sound terrible. Think of the difference between, for
example, 8 mm and 16 mm film: now pretend you are using 1 mm film! That’s what a 4-bit
sample size would be like.
So, while speed is important—the more snapshots we take of a continuous function, the more
accurately we can represent it in discrete form—there’s another factor that seriously affects
resolution: the resolution of the actual number system we use to store the data. For example,
with only three numbers available (say, 0, 1, 2), every value we store has to be one of those
three numbers. That’s bad. We’ll basically be storing a bunch of simple square waves. We’ll
be turning highly differentiated, continuous data into nondifferentiated, overly discrete data.
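Here is a minimal Python sketch (ours, not from the text) of what low bit width does to the numbers themselves: the same sine samples rounded to a 2-bit scale (only four levels) and to a 16-bit scale (65,536 levels). The quantize function and its mapping of the range -1 to 1 onto evenly spaced levels are our own simplifying assumptions.

    import math

    def quantize(x, bits):
        """Round x (between -1 and 1) to the nearest of 2**bits evenly spaced levels."""
        levels = 2**bits
        step = 2.0 / (levels - 1)
        return round((x + 1.0) / step) * step - 1.0

    if __name__ == "__main__":
        for n in range(8):
            x = math.sin(2 * math.pi * 440 * n / 44100)
            print(round(x, 5), round(quantize(x, 2), 5), round(quantize(x, 16), 5))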
Figure 2.13 An example of what a 3-bit sound file might look like (8 possible values).
Figure 2.14 An example of what a 6-bit sound file might look like (64 possible values).
In computers, the way we describe numerical resolution is by the size, or number of bits, used
for number storage and manipulation. The number of bits used to represent a number is
referred to as its bit width or bit depth. Bit width (or depth) and sample speed more or less
completely describe the resolution and accuracy of our digital recording and synthesis
systems. Another way to think of it is as the word-length, in bits, of the binary data.
Common bit widths used for digital sound representation are 8, 16, 24, and 32 bits. As we
said, more is better: 16 bits gives you much more accuracy than 8 bits, but at a cost of twice
the storage space. (Note that, except for 24, they’re all powers of 2. Of course you could think of 24 as halfway between 2^4 and 2^5, but we like to think of it as 2^4.584962500721156.)
We’ll take a closer look at storage in Section 2.7, but for now let’s consider some standard
number sizes.
4 bits        1 nibble
8 bits        1 byte (2 nibbles)
16 bits       1 word (2 bytes)
1,024 bytes   1 kilobyte (K)
1,000 K       1 megabyte (MB)
1,000 MB      1 gigabyte (GB)
1,000 GB      1 terabyte (TB)
Table 2.4 Some convenient units for dealing with bits. If you just remember that 8 bits is the same as a byte,
you can pretty much figure out the rest.
Bit widths: 16 bits, 8 bits
Sample rates (in Hz): 44,100; 22,050; 11,025; 5,512.5
Table 2.5 Some sound files at various bit widths and sampling rates. Fast and wide are better (and
more expensive in terms of technology, computer time, and computer storage). So the ideal format in
this table is 44,100 Hz at 16 bits (which is standard audio CD quality). Listen to the sound files and
see if you can hear what effect sampling rate and bit width have on the sound. See the sound files on
C:\bolum\bahar2014-2015\ce476\Section 2.5
You may notice that lower sampling rates make the sound "duller," or lacking in high
frequencies. Lower bit widths make the sound, to use an imprecise word, "flatter": small amplitude nuances (at the waveform level) cannot be heard as well, so different sounds’ timbres tend to become more alike and less well defined.
Let’s sum it up: what effect does bit width have on the digital storage of sounds? Remember
that the more bits we use, the more accurate our recording. But each time our accuracy
increases, so do our storage requirements. Sometimes we don’t need to be that accurate—
there are lots of options open to us when playing with digital sounds.
Chapter 2: The Digital Representation of Sound,
Part Two: Playing by the Numbers
Section 2.6: Digital Copying
Xtra bit 2.2: The peanut butter conundrum
Applet 2.5: Melody copier
Xtra bit 2.3: Errors in digital copying: parity
Soundfile 2.4: Our rendition of Alvin Lucier’s "I am Sitting in a Room"
Soundfile 2.5: Pretender, from John Oswald’s Plunderphonics
Soundfile 2.6(a): Original
Soundfile 2.6(b): Degraded
Soundfile 2.7: David Mahler
Xtra bit 2.4: Digital watermarking
One of the most important things about digital audio is the ability, theoretically, to make
perfect copies. Since all we’re storing are lists of numbers, it shouldn’t be too hard to rewrite
them somewhere else without error. This means that unlike, for example, photocopying a
photocopy, there is no information loss in the transfer of data from one copy to another, nor is
there any noise added. Noise, in this context, is any information that is added by the
imperfections of the recording or playback technology.
Figure 2.15 Each generation of analog copying is a little noisier than the last. Digital copies are (generally!)
perfect, or noiseless.
Not-So-Perfect Copying
What does perfect copying really mean? In the good old days, if you copied your favorite LP
vinyl (remember those?) onto cassette and gave it to your buddy, you knew that your original
sounded better than your buddy’s copy. He knew the same: if he made a copy for his buddy, it
would be even worse (these copies are called generations). It was like the children’s game of
telephone:
"My cousin once saw a Madonna concert!"
"My cousin Vince saw Madonna’s monster!"
"My mother Vince sat onna mobster!"
"My mother winces when she eats lobster!"
In the melody copier applet (Applet 2.5), you start with a simple tune and set a noise parameter that determines how bad each successive copy of the file will be. The melody degrades over time.
Note that when we use the term degrade, we are giving a purely technical description, not an
aesthetic one. The melodies don’t necessarily get worse, they just get further from the
original.
The noise added to successive generations blurs the detail of the original. But we still try to
reinterpret the new, blurred message in a way that makes sense to us. In audio, this blurring of
detail usually manifests itself as a loss of high-frequency information (in some sense, sonic
detail) as well as an addition of the noise of the transmission mechanism itself (hiss, hum,
rumble, etc.).
Alvin Lucier’s classic electronic work, "I Am Sitting in a Room," "redone" with our own text.
"This is a cheap imitation of a great and classic work of electronic music by the composer
Alvin Lucier. You can easily hear the degradations of the copied sound; this was part of the
composer’s idea for this composition."
We strongly encourage all of you to listen to the original, which is a beautiful, innovative
work.
Is It Real, or Is It...?
Digital sound technology changes this copying situation entirely and raises some new and
interesting questions. What’s the original, and what’s the copy?
When we copy a CD to some other digital medium (like our hard drive), all we’re doing is
copying a list of numbers, and there’s no reason to expect that any, or significantly many,
errors will be incurred. That means the copy of the president’s speech that we sample for our
web site contains the same data—is the same signal—as the original. There’s no way to trace
where the original came from, and in a sense no way to know who owns it (if anybody does).
This makes questions of copyright and royalties, and the more general issue of intellectual
property, complicated to say the least, and has become, through the musical technique of
sampling (as in rap, techno, and other musical styles), an important new area of technological,
aesthetic, and legal research.
"If creativity is a field, copyright is the fence."
—John Oswald
In the mid-1980s, composer John Oswald made a famous CD in which every track was an
electronic transformation, in some unique way, of existing, copyrighted material. This
controversial CD was eventually litigated out of existence, largely on the basis of its Michael
Jackson track (called "Dab").
Digital copying has produced a crisis in the commercial music world, as anyone who has
downloaded music from the internet knows. The government, the commercial music industry,
and even organizations like BMI (Broadcast Music Inc.) and ASCAP (American Society of
Composers, Authors, and Publishers), which distribute royalties to composers and artists, are
struggling with the law and the technology, and the two are forever out of synch (guess who's
ahead!). Things change every day, but one thing never changes: technology moves faster than
our society's ability to deal with it legally and ethically. Of course, that makes things
interesting.
In Pretender, John Oswald gradually changes the pitch and speed of Dolly Parton’s version of
the song "The Great Pretender" in a wry and musically beautiful commentary on gender,
intellectual property, and sound itself.
Soundfiles 2.6(a) and 2.6(b) show an analog copy that was made over and over again,
exemplifying how the signal degrades to noise after numerous copies. Soundfile 2.6(a) is the
original digital file; Soundfile 2.6(b) is the copied analog file. Note that unlike the example in
Soundfile 2.4, where we tried to re-create a famous piece by Alvin Lucier, this example
doesn’t involve the acoustics of space, just the noise of
electronics, tape to tape.
Sampling, and using pre-existing materials, can be a lot of fun and
artistically interesting. Soundfile 2.7 is an excerpt from David
Mahler’s composition "Singing in the Style of The Voice of the
Poet." This is a playful work that in some ways parodies textsound composition, radio interviewing, and electronic music by
using its own techniques. In this example, Mahler makes a joke of
the fact that when speech is played backward, it sounds like
"Swedish," and he combines that effect (backward talking) with
composer Ingram Marshall’s talking about his interest in Swedish
text-sound composers.
Figure 2.20 David Mahler
The entire composition can be heard on David Mahler’s CD The Voice of the Poet; Works on
Tape 1972–1986, on Artifact Recordings.
Chapter 2: The Digital Representation of Sound,
Part Two: Playing by the Numbers
Section 2.7: Storage Concerns: The Size of Sound
Xtra bit 2.5: Hard drives
OK, now we know about bits and how to copy them from one place to another. But where
exactly do we put all these bits in the first place?
One familiar digital storage medium is the compact disc (CD). Bits are encoded on a CD as a
series of pits in a metallic surface. The length of a pit corresponds to the state of the bit (on or
off). As we’ve seen, it takes a lot of bits to store high-quality digital audio—a standard 74-minute CD can have more than 6 billion pits on it!
Figure 2.21 This photo shows what standard CD pits look like under high magnification. A CD stores data
digitally, using long and short pits to encode a binary representation of the sound for reading by the laser
mechanism of a CD player. Thanks to Evolution Audio and Video in Agoura Hills, California, for this photo from
their A/V Newsletter (Jan 1996). The magnification is 20k.
Figure 2.22 Physically, a CD is composed of a thin film of aluminum embedded between two discs of
polycarbonate plastic. Information is recorded on the CD as a series of microscopic pits in the aluminum film
arranged along a continuous spiral track. If expanded linearly, the track would span over 3 miles.
Using a low-power infrared laser (with a wavelength of 780 nm), the data are retrieved from
the CD using photosensitive sensors that measure the intensity of the reflected light as the
laser traverses the track. Since the recovered bit stream is simply a bit pattern, any digitally
encoded information can be stored on a CD.
Putting Everything Together
Now that we know about sampling rates, bit width, number systems, and a lot of other stuff,
how about a nice practical example that ties it all together?
Assume we’re composers working in a digital medium. We’ve got some cool sounds, and we
want to store them. We need to figure out how much storage we need.
Let’s assume we’re working with a stereo (two independent channels of sound) signal and 16-bit samples. We’ll use a sampling rate of 44,100 times/second.
One 16-bit sample takes 2 bytes of storage space (remember that 8 bits equal 1 byte). Since
we’re in stereo, we need to double that number (there’s one sample for each channel) to 4
bytes per sample. For each second of sound, we will record 44,100 four-byte stereo samples,
giving us a data rate of 176.4 kilobytes (176,400 bytes) per second.
Let’s review this, because we know it can get a bit complicated. There are 60 seconds in a
minute, so 1 minute of high-quality stereo digital sound takes 176.4 * 60 KB or 10.584
megabytes (10,584 KB) of storage space. In order to store 1 hour of stereo sound at this
sampling rate and resolution, we need 60 * 10.584 MB, or about 600 MB. This is more or less
the amount of sound information on a standard commercial audio CD (actually, it can store
closer to 80 minutes comfortably). One gigabyte is equal to 1,000 megabytes, so a standard
CD is around two-thirds of a gigabyte.
Figure 2.23
One good rule of thumb is that CD-quality sound currently requires about 10 megabytes per
minute.
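The whole calculation fits in a few lines of Python; this sketch (ours) just repeats the arithmetic above.

    sample_rate = 44100        # samples per second, per channel
    bytes_per_sample = 2       # 16 bits
    channels = 2               # stereo

    bytes_per_second = sample_rate * bytes_per_sample * channels
    print(bytes_per_second)                          # 176,400 bytes each second
    print(bytes_per_second * 60 / 1_000_000)         # about 10.6 megabytes per minute
    print(bytes_per_second * 3600 / 1_000_000_000)   # about 0.64 gigabytes per hour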
Chapter 2: The Digital Representation of Sound,
Part Two: Playing by the Numbers
Section 2.8: Compression
Xtra bit 2.6: MP3
Soundfile 2.8: 128 Kbps
Soundfile 2.9: 64 Kbps
Soundfile 2.10: 32 Kbps
Xtra bit 2.7: Delta modulation
When we start talking about taking 44,100 samples per second (44.1 kHz), each one of those samples holding a 16-bit value, we’re building up a whole heck of a lot of bits. In fact, it’s too many bits
for most purposes. While it’s not too wasteful if you want an hour of high-quality sound on a
CD, it’s kind of unwieldy if we need to download or send it over the Internet, or store a bunch
of it on our home hard drive. Even though high-quality sound data aren’t anywhere near as
large as image or video data, they’re still too big to be practical. What can we do to reduce the
data explosion?
If we keep in mind that we’re representing sound as a kind of list of symbols, we just need to
find ways to express the same information in a shorter string of symbols. That’s called data
compression, and it’s a rapidly growing field dealing with the problems involved in moving
around large quantities of bits quickly and accurately.
The goal is to store the most information in the smallest amount of space, without
compromising the quality of the signal (or at least, compromising it as little as possible).
Compression techniques and research are not limited to digital sound—data compression
plays an essential part in the storage and transmission of all types of digital information, from
word-processing documents to digital photographs to full-screen, full-motion videos. As the
amount of information in a medium increases, so does the importance of data compression.
What is compression exactly, and how does it work? There is no one thing that is "data
compression." Instead, there are many different approaches, each addressing a different aspect
of the problem. We’ll take a look at just a couple of ways to compress digital audio
information. What’s important about these different ways of compressing data is that they
tend to illustrate some basic ideas in the representation of information, particularly sound, in
the digital world.
Eliminating Redundancy
There are a number of classic approaches to data compression. The first, and most
straightforward, is to try to figure out what’s redundant in a signal, leave it out, and put it back
in when needed later. Something that is redundant could be as simple as something we
already know. For example, examine the following messages:
YNK DDL WNT T TWN, RDNG N PNY
or
DNT CNT YR CHCKNS BFR THY HTCH
It’s pretty clear that leaving out the vowels makes the phrases shorter, unambiguous, and
fairly easy to reconstruct. Other phrases may not be as clear and may need a vowel or two.
However, clarity of the intended message occurs only because, in these particular messages,
we already know what it says, and we’re simply storing something to jog our memory. That’s
not too common. Now say we need to store an arbitrary series of colors:
blue blue blue blue green green green red blue red blue yellow
This is easy to shorten to:
4 blue 3 green red blue red blue yellow
In fact, we can shorten that even more by saying:
4 blue 3 green 2 (red blue) yellow
We could shorten it even more, if we know we’re only talking about colors, by:
4b3g2(rb)y
We can reasonably guess that "y" means yellow. The "b" is more problematic, since it might
mean "brown" or "black," so we might have to use more letters to resolve its ambiguity. This
simple example shows that a reduced set of symbols will suffice in many cases, especially if
we know roughly what the message is "supposed" to be. Many complex compression and
encoding schemes work in this way.
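The color trick above is essentially run-length encoding. Here is a minimal Python sketch (ours) of that idea; it only collapses runs of a single repeated symbol, so it does not catch the repeated (red blue) pair the way the hand-made version above does.

    def run_length_encode(symbols):
        """Replace each run of identical symbols with a [count, symbol] pair."""
        encoded = []
        for s in symbols:
            if encoded and encoded[-1][1] == s:
                encoded[-1][0] += 1
            else:
                encoded.append([1, s])
        return encoded

    if __name__ == "__main__":
        colors = ["blue"] * 4 + ["green"] * 3 + ["red", "blue", "red", "blue", "yellow"]
        print(run_length_encode(colors))
        # [[4, 'blue'], [3, 'green'], [1, 'red'], [1, 'blue'], [1, 'red'], [1, 'blue'], [1, 'yellow']]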
Perceptual Encoding
A second approach to data compression is similar. It also tries to get rid of data that do not
"buy us much," but this time we measure the value of a piece of data in terms of how much it
contributes to our overall perception of the sound.
Here’s a visual analogy: if we want to compress a picture for people or creatures who are
color-blind, then instead of having to represent all colors, we could just send black-and-white
pictures, which as you can well imagine would require less information than a full-color
picture. However, now we are attempting to represent data based on our perception of it.
Notice here that we’re not using numbers at all: we’re simply trying to compress all the
relevant data into a kind of summary of what’s most important (to the receiver). The tricky
part of this is that in order to understand what’s important, we need to analyze the sound into
its component features, something that we didn’t have to worry about when simply shortening
lists of numbers.
Figure 2.24 We humans use perception-based encoding all the time. If we didn't, we’d have very tedious conversations. (In the figure, the full weather report "temperature: 76°F, humidity: 35%, wind: north-east at 5 MPH, barometer: falling, clouds: none" is summed up as "It’s a nice day out!")
MP3 is the current standard for data compression of sound on the web. But keep in mind that
these compression standards change frequently as people invent newer and better methods.
Soundfiles 2.8, 2.9, and 2.10 were all compressed into the MP3 format but at different bit
rates. The lower the bit rate, the more degradation. (Kbps means kilobits per second.)
Perceptually based sound compression algorithms usually work by eliminating numerical
information that is not perceptually significant and just keeping what’s important.
µ-law ("mu-law") encoding is a simple, common, and important perception-based
compression technique for sound data. It’s an older technique, but it’s far easier to explain
here than a more sophisticated algorithm like MP3, so we’ll go into it in a bit of detail.
Understanding it is a useful step toward understanding compression in general.
µ-law is based on the principle that our ears are far more sensitive to amplitude changes at low amplitudes than at high ones. That is, we notice a small change between two soft sounds much more easily than the same change between two very loud sounds. µ-law compression takes
advantage of this phenomenon by mapping 16-bit values onto an 8-bit µ-law table like Table
2.6.
0     8     16    24    32    40    48    56
64    72    80    88    96    104   112   120
132   148   164   180   196   212   228   244
260   276   292   308   324   340   356   372
Table 2.6 Entries from a typical µ-law table. The complete table consists of 256 entries spanning the 16-bit
numerical range from –32,124 to 32,124. Half the range is positive, and half is negative. This is often the way
sound values are stored.
Notice how the range of numbers is divided logarithmically rather than linearly, giving more precision at lower amplitudes. In other words, at the loud end, loud sounds are just loud sounds; we don't need to distinguish them as finely.
To encode a µ-law sample, we start with a 16-bit sample value, say 330. We then find the
entry in the table that is closest to our sample value. In this case, it would be 324, which is the
28th entry (starting with entry 0), so we store 28 as our µ-law sample value. Later, when we
want to decode the µ-law sample, we simply read 28 as an index into the table, and output the
value stored there: 324.
You might be thinking, "Wait a minute, our original sample value was 330, but now we have
a value of 324. What good is that?" While it’s true that we lose some accuracy when we
encode µ-law samples, we still get much better sound quality than if we had just used regular
8-bit samples.
Here’s why: in the low-amplitude range of the µ-law table, our encoded values are only going
to be off by a small margin, since the entries are close together. For example, if our sample
value is 3 and it’s mapped to 0, we’re only off by 3. But since we’re dealing with 16-bit
samples, which have a total range of 65,536, being off by 3 isn’t so bad. As amplitude
increases we can miss the mark by much greater amounts (since the entries get farther and
farther apart), but that’s OK too—the whole point of µ-law encoding is to exploit the fact that
at higher amplitudes our ears are not very sensitive to amplitude changes. Using that fact, µ-law compression offers near-16-bit sound quality in an 8-bit storage format.
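Here is a small Python sketch (ours) of the table-lookup idea just described. The 32 entries are only the positive values shown in Table 2.6; a real µ-law table has 256 entries and covers negative sample values too, and real codecs compute the mapping with a formula rather than a search.

    MU_LAW_TABLE = [0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120,
                    132, 148, 164, 180, 196, 212, 228, 244, 260, 276, 292, 308, 324, 340, 356, 372]

    def encode(sample):
        """Return the index of the table entry closest to the 16-bit sample value."""
        return min(range(len(MU_LAW_TABLE)), key=lambda i: abs(MU_LAW_TABLE[i] - sample))

    def decode(index):
        """Look the stored index back up in the table."""
        return MU_LAW_TABLE[index]

    if __name__ == "__main__":
        i = encode(330)
        print(i, decode(i))   # prints "28 324", just like the worked example above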
Prediction Algorithms
A third type of compression technique involves attempting to predict what a signal is going to
do (usually in the frequency domain, not in the time domain) and only storing the difference
between the prediction and the actual value. When a prediction algorithm is well tuned for the
data on which it’s used, it’s usually possible to stay pretty close to the actual values. That
means that the difference between your prediction and the real value is very small and can be
stored with just a few bits.
Let’s say you have a sample value range of 0 to 65,535 (a 16-bit range, all positive integers) and you invent a magical prediction algorithm that is never more than 128 units above or below the actual value. You now only need 8 bits (enough to hold a signed difference from –128 to +127) to store the difference between your predicted value and the actual value. You might even keep a
running average of the actual differences between sample values, and use that adaptively as
the range of numbers you need to represent at any given time. Pretty neat stuff! In actual
practice, coming up with such a good prediction algorithm is tricky, and what we’ve
presented here is an extremely simplified presentation of how prediction-based compression
techniques really work.
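As a deliberately simple illustration (ours, and in the time domain rather than the frequency domain the text mentions), here is a Python sketch of the core idea: predict that each sample equals the previous one and store only the differences, which stay small when the signal changes slowly.

    def encode(samples):
        """Store each sample as its difference from the previous one."""
        predicted, diffs = 0, []
        for s in samples:
            diffs.append(s - predicted)   # the prediction error
            predicted = s                 # next prediction: "same as last time"
        return diffs

    def decode(diffs):
        """Undo the encoding by adding the differences back up."""
        predicted, samples = 0, []
        for d in diffs:
            predicted += d
            samples.append(predicted)
        return samples

    if __name__ == "__main__":
        original = [0, 3, 7, 12, 14, 13, 11, 8]
        print(encode(original))            # small numbers: [0, 3, 4, 5, 2, -1, -2, -3]
        print(decode(encode(original)))    # exactly the original list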
The Pros and Cons of Compression Techniques
Each of the techniques we’ve talked about has advantages and disadvantages. Some are time-consuming to compute but accurate; others are simple to compute (and understand) but less
powerful. Each tends to be most effective on certain kinds of data. Because of this, many of
the actual compression implementations are adaptive—they employ some variable
combination of all three techniques, based on qualities of the data to be encoded.
A good example of a currently widespread adaptive compression technique is the MPEG
(Moving Picture Expert Group) standard now used on the Internet for the transmission of both
sound and video data. MPEG (which in audio is currently referred to as MP3) is now the
standard for high-quality sound on the Internet and is rapidly becoming an audio standard for
general use. A description of how MPEG audio really works is well beyond the scope of this
book, but it might be an interesting exercise for the reader to investigate further.
Delta Modulation
Bit width, sampling rate, compression, and even prediction are well illustrated by an
extremely simple, interesting, and important technique called delta modulation or single-bit
encoding. This algorithm is commonly used in commercial digital sound equipment.
We know that if we sample very fast, the change in the y-axis or amplitude of the signal is
going to be extremely small. In fact, we can more or less assume that it’s going to be nearly
continuous. This means that each successive sample will change by at most 1. Using this fact,
we can simply store the signal as a list of 1s and 0s, where 1 means "the amplitude increased"
and 0 means "the amplitude decreased." The good thing about delta modulation is that we can
use only one bit for the representation: we’re not actually representing the signal, but rather a
kind of map of its contour. We can, in fact, play it back in the same way. The disadvantage of
this algorithm is that we need to sample at a much higher rate in order to ensure that we’re
catching all the minute changes.
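Here is a minimal Python sketch (ours) of single-bit encoding as described above: a 1 means the signal went up one step, a 0 means it went down, and playback just walks those steps back up and down. The step size of 1 and the toy signal are assumptions for illustration.

    def delta_encode(samples):
        bits, current = [], samples[0]
        for s in samples[1:]:
            bits.append(1 if s >= current else 0)    # 1 = went up, 0 = went down
            current += 1 if s >= current else -1
        return bits

    def delta_decode(bits, start):
        current, out = start, [start]
        for b in bits:
            current += 1 if b else -1
            out.append(current)
        return out

    if __name__ == "__main__":
        signal = [0, 1, 2, 3, 3, 2, 1, 0, -1]          # slowly changing, as delta modulation assumes
        bits = delta_encode(signal)
        print(bits)                                     # [1, 1, 1, 1, 0, 0, 0, 0]
        print(delta_decode(bits, start=signal[0]))      # an approximate reconstruction of the contour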
Chapter 3: The Frequency Domain
Section 3.1: Frequency Domain
Soundfile 3.1: Monochord sound
Soundfile 3.2: Trumpet sound
Xtra bit 3.1: MatLab code to plot amplitude envelopes
Soundfile 3.3: Mystery sound
Soundfile 3.4: The song of the hooded warbler. Can you follow it with Figure 3.5?
Time-domain representations show us a lot about the amplitude of a signal at different points
in time. Amplitude is a word that means, more or less, "how much of something," and in this
case it might represent pressure, voltage, some number that measures those things, or even the
in-out deformation of the eardrum.
For example, the time-domain picture of the waveform in Figure 3.1 starts with the attack of
the note, continues on to the steady-state portion (sustain) of the note, and ends with the cutoff
and decay (release). We sometimes call the attack and decay transients because they only
happen once and they don’t stay around! We also use the word transient, perhaps more
typically, to describe timbral fluctuations during the sound that are irregular or singular and to
distinguish between those kinds of sounds and the steady state.
From the typical sound event shown in Figure 3.1, we can tell something about how the
sound’s amplitude develops over time (what we call its amplitude envelope). But from this
picture we can’t really tell much of anything about what is usually referred to as the timbre or
"sound" of the sound: What instrument is it? What note is it? Is it bright or dark? Is it
someone singing or a dog barking, or maybe a Back Street Boys bootleg? Who knows! These
time domain pictures all look pretty much alike.
Figure 3.1 A time-domain waveform. It’s easy to see the attack, steady-state, and decay portions of the "note" or
sound event, because these are all pretty much variations of amplitude, which time-domain representations show
us quite well. The amplitude envelope is a kind of average of this picture.
We can even be a little more precise and mathematical. If the amplitude at the nth sample in
the above is A[n] and we make a new signal with amplitude, say, S[n], then the nth sample (of
S[n], our envelope) would be:
S[n] = (A[n–1] + A[n] + A[n+1])/3
This would look like the envelope. This averaging operation is sometimes called smoothing or
low-pass filtering. We’ll talk more about it later.
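Here is a minimal Python sketch (ours, not the book's MatLab code from Xtra bit 3.1) of that three-point running average; leaving the two endpoints unsmoothed is just a simplifying choice.

    def smooth(a):
        """Return S where S[n] = (A[n-1] + A[n] + A[n+1]) / 3, copying the endpoints."""
        s = list(a)
        for n in range(1, len(a) - 1):
            s[n] = (a[n - 1] + a[n] + a[n + 1]) / 3.0
        return s

    if __name__ == "__main__":
        bumpy = [0.0, 0.9, 0.1, 0.8, 0.2, 0.7, 0.3]
        print(smooth(bumpy))   # the wild sample-to-sample jumps are averaged out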
Figure 3.2 Monochord sound: signal, average signal envelope, peak signal envelope.
Figure 3.3 Trumpet sound: signal, average signal envelope, peak signal envelope.
Two Sounds, and Two Different Kinds of Amplitude Envelopes
The two figures and sounds in Soundfiles 3.1 and 3.2 (one of a trumpet, one of a one-stringed
instrument called a monochord, made for us by Sam Torrisi) illustrate different ways of
looking at amplitude and the time domain. In each, the time-domain signal itself is given by
the light blue area. This is exactly the same as what we showed you at the beginning of this
section in Figure 3.1. But we’ve added two more envelopes to these figures, to illustrate two
useful ways to think of a sound event.
The magenta line more or less follows the peaks of the signal, or its highest amplitudes. Note
that it doesn’t matter whether these amplitudes are very positive or very negative; all we
really care about is their absolute value, which is more or less like saying how much energy or
displacement (in either direction). Sometimes, we even simplify this further by measuring the
peak-to-peak amplitude of a signal, just looking at the maximum range of amplitudes (this
will tell us, for example, if our speakers/ears will be able to withstand the maximum of the
signal). In Figure 3.2, we look at some number of samples and more or less remain on the
highest value in that window (that’s why it has a kind of staircase look to it).
The dark blue line is a running average of the absolute value of the signal, which in effect
smooths the sound out tremendously, and also attenuates it. There’s a similar measure, called
RMS (root-mean-squared) amplitude, that tries to give an overall average of energy. Once
again, we used a running window technique to average the last n number of samples (where n
is the length of the window). Different values for n would give very different pictures.
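The book's own code for these figures is the MatLab program in Xtra bit 3.1; as a rough stand-in, here is a short Python sketch (ours) of the two windowed measurements described above, a running peak and a running RMS, with the window length n left as a free parameter.

    import math

    def peak_envelope(samples, n):
        """Highest absolute value seen in the last n samples."""
        return [max(abs(x) for x in samples[max(0, i - n + 1): i + 1])
                for i in range(len(samples))]

    def rms_envelope(samples, n):
        """Root-mean-square average of the last n samples."""
        out = []
        for i in range(len(samples)):
            window = samples[max(0, i - n + 1): i + 1]
            out.append(math.sqrt(sum(x * x for x in window) / len(window)))
        return out

    if __name__ == "__main__":
        sig = [math.sin(2 * math.pi * 440 * k / 44100) * (1 - k / 441) for k in range(441)]
        print(peak_envelope(sig, 64)[::100])   # both envelopes fall as this test tone decays
        print(rms_envelope(sig, 64)[::100])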
Just to give you some idea how we generate these kinds of graphs and measurements, in Xtra
bit 3.1 we’ve included the computer code, written in a popular mathematical modeling
program called MatLab, that made these pictures. By studying this code and the
accompanying comments, you can get some idea of what computer music software often
looks like and how you might go about making similar kinds of measurements.
The Frequency/Amplitude/Time Plot
Distinguishing between sounds is one place where the frequency domain comes in. Figure 3.4
is a frequency/amplitude/time plot of the same sound as the time-domain picture in Figure
3.1. This new kind of sound-image is called a sonogram. Time still goes from left to right
along the x-axis, but now the y-axis is frequency, not amplitude. Amplitude is encoded in the
intensity of a point on the image: the darker the point, the more energy present at that
frequency at around that time.
For example, the semi-dark line around 7,400 Hz shows that from about 0.05 second to 0.125
second, there is some contribution to the sound at that frequency. This is occurring in the
attack portion. It pretty much dies after a short period of time.
Figure 3.4 This picture shows the same sound as that of the time domain in Figure 3.1, but now in the frequency
domain, as a sonogram. Here, the y-axis is frequency (or, more accurately, frequency components). The darkness
of the line indicates how much energy is present in that frequency component. The x-axis is, as usual, time.
What sorts of information do the two pictures give us about the sound? Can you make some
guesses about what sort of sound this might be?
What does this sonogram tell us about the sound? Remember that we said before that we use
the entire frequency range to determine timbre as well as pitch. As it turns out, any sound
contains many smaller component sounds at a wide variety of frequencies (we’ll learn more
about this later; it’s really important!). What you’re seeing in this sonogram is a
representation of how all those component sounds change in frequency and amplitude over
time. Now listen to Soundfile 3.3 at left.
The sonogram shows that the mystery sound starts with a burst of energy that is spread out
across the frequency spectrum—notice the spikes that reach all the way up to the top
frequencies in the image. Then it settles down into a fairly constant, more concentrated, and
lower energy state, where it remains until the end, when it quickly fades out. This is a pretty
common description of a vibrating system: start it vibrating out of its rest state (chaotic, loud),
listen to it settle into some sort of regular vibratory behavior, and then, if the energy source is
removed (for example, you stop blowing the horn or take your e-bow off your electric guitar
string), listen to it decay (again, chaotic).
The presence of a band of high-amplitude, low-frequency energy coupled with some lower-amplitude, high-frequency energy implies that we're looking at some sort of pitched sound
with a number of strong harmonics. The darkest low band is probably the fundamental note of
the sound. By studying the sonogram, can you get a mental idea of what sort of sound it might
be?
Listen to the sound a few times while watching the waveform and sonogram images. Can you
follow along? Is there a clear correlation between what you see and what you hear? Does the
sound look the way it sounds? Do you agree that the sonogram gives you a more informative
visual representation of the sound? Isn’t the frequency domain cool?
Figure 3.5 Song of the hooded warbler.
This is another kind of sonogram, kind of like a negative image of the sound moving in pitch
(the y-axis) over time. The thickness of the line shows a lot about the pitch range. What this
old-style sonogram did was try to find the maximum energy concentration and give a picture
of the moving pitch of a sound, natural or otherwise.
Sometimes pictures like this, which were very common a long time ago, are called
melograms, or melographs, because they graph pitch in time. We got this wonderful picture
out of an old book about recording natural sounds!
Figure 3.6 Just for historical interest, the picture above is an example of an old process called
phonophotography, an early (1920s) method for capturing a graphic image of a sound. It’s essentially a
melographic technique.
What we are looking at is a picture of a "recording" of a performance of the gospel song
"Swing Low, Sweet Chariot." This color image came from the work of a brilliant researcher
named Metfessel.
This kind of highly descriptive analysis greatly influenced music theorists in the first part of
the 20th century. Many people saw it as a kind of revolutionary mechanism for describing
sound and music, potentially removing music analysis from the realm of the aesthetic, the
emotional, and the transcendental into a more modernist, scientific, and objective domain.
Chapter 3: The Frequency Domain
Section 3.2: Phasors
Applet 3.1: Sampling a phasor
Applet 3.2: Building a sawtooth wave partial by partial
In Chapter 1 we talked about the basic atoms of sound—the sine wave, and about the function
that describes the sound generated by a vibrating tuning fork. In this chapter we’re talking a
lot about the frequency domain. If you remember, our basic units, sine waves, only had two
parameters: amplitude and frequency. It turns out that these dull little sine waves are going to
give us the fundamental tool for the analysis and description of sound, and especially for the
digital manipulation of sound. That’s the frequency domain: a place where lots of little sine
waves are our best friends.
But before we go too far, it’s important to fully understand what a sine wave is, and it’s also
wonderful to know that we can make these simple little curves ridiculously complicated, too.
And it’s useful to have another model for generating these functions. That model is called a
phasor.
Description of a Phasor
Think of a bicycle wheel suspended at its hub. We’re going to paint one of the spokes bright
red, and at the end of the spoke we’ll put a red arrow. We now put some axes around the
wheel—the x-axis going horizontally through the hub, and the y-axis going vertically. We’re
interested in the height of the arrowhead relative to the x-axis as the wheel—our phasor—
spins around counterclockwise.
Figure 3.7 Sine waves and phasors. As the sine wave moves forward in time, the arrow goes around the circle at
the same rate. The height of the arrow (that is, how far it is above or below the x-axis) as it spins around in a
circle is described by the sine wave.
In other words, if we trace the arrow's location on the circle (from 0 to 2π) and measure the height of the arrow on the y-axis as our phasor goes around the circle, the resulting curve is a
sine wave! Thanks to: George Watson, Dept. of Physics & Astronomy, University of
Delaware, [email protected], for this animation.
Figure 3.8 Phase as angle.
Figure 3.9 Phasor: circle to sine.
As time goes on, the phasor goes round and round. At each instant, we measure the height of
the dot over the x-axis. Let’s consider a small example first. Suppose the wheel is spinning at
a rate of one revolution per second. This is its frequency (and remember, this means that the
period is 1 second/revolution). This is the same as saying that the phasor spins at a rate of 360
degrees per second, or better yet, 2π radians per second (if we're going to be mathematicians, then we have to measure angles in terms of radians). So 2π radians per second is the angular velocity of the phasor.
This means that after 0.25 second the phasor has gone π/2 radians (90 degrees), and after 0.5 second it's gone π radians (180 degrees), and so on. So, we can describe the amount of angle that the phasor has gone around at time t as a function, which we call θ(t). For one revolution per second, θ(t) = 2πt.
Now, let’s look at the function given by the height of the arrow as time goes on. The first
thing that we need to remember is a little trigonometry.
The sine and cosine of an angle are measured using a right triangle. For our right triangle, with hypotenuse c, vertical side a, and horizontal side b, the sine of θ, written sin(θ), is given by the equation:

sin(θ) = a/c

This means that:

a = c sin(θ)

We'll make use of this in a minute, because in this example a is the height of our triangle. Similarly, the cosine, written cos(θ), is:

cos(θ) = b/c

This means that:

b = c cos(θ)

This will come in handy later, too.
Now back to our phasor. We're interested in measuring the height at time t, which we'll denote as h(t). At time t, the phasor's arrow is making an angle of θ(t) with the x-axis. Our basic phasor has a radius of 1, so we get the following relationship:

h(t) = sin(θ(t)) = sin(2πt)

We also get this nice graph of a function, which is our favorite old sine curve.
Figure 3.10 Basic sinusoid.
Now, how could we change this curve? Well, we could change the amplitude—this is the same as changing the length of our arrow on the phasor. We'll keep the frequency the same and make the radius of our phasor equal to 3. Then we get:

h(t) = 3 sin(2πt)
Then we get this nice curve, which is another kind of sinusoid (bigger!).
Figure 3.11 Bigger sine curve.
Now let’s start messing with the frequency, which is the rate of revolution of the phasor.
Let’s ramp it up a notch and instead start spinning at a rate of five revolutions per second.
Now:

θ(t) = 2π · 5 · t = 10πt

This is easy to see since after 1 second we will have gone five revolutions, which is a total of 10π radians. Let's suppose that the radius of the phasor is 3. Again, at each moment we measure the height of our arrow (which we call h(t)), and we get:

h(t) = 3 sin(10πt)

Now we get this sinusoid:
Figure 3.12 Bigger, faster sine curve.
In general, if our phasor is moving at a frequency of ν revolutions per second and has radius A, then plotting the height of the phasor is the same as graphing this sinusoid:

h(t) = A sin(2πνt)
Now we're almost done, but there is one last thing we could vary: we could change the place where we start our phasor spinning. For example, we could start the phasor moving at a rate of five revolutions per second with a radius of 3, but start the phasor at an angle of π/4 radians, instead.
Now, what kind of function would this be? Well, at time t = 0 we want to be taking the measurement when the phasor is at an angle of π/4, but other than that, all is as before. So the function we are graphing is the same as the one above, but with a phase shift of π/4. The corresponding sinusoid is:

h(t) = 3 sin(10πt + π/4)
Figure 3.13 Changing the phase.
Our most general sinusoid of amplitude A, frequency ν, and phase shift φ has the form:

h(t) = A sin(2πνt + φ)

A particularly interesting example is what happens when we take the phase shift equal to 90 degrees, or π/2 radians. Let's make it nice and simple, with ν equal to one revolution per second and amplitude equal to 1 as well. Then we get our basic sinusoid, but shifted ahead by π/2. Does this look familiar? This is the graph of the cosine function!
Figure 3.14 90-degree phase shift (cosine).
You can do some checking on your own and see that this is also the graph that you would
get if you plotted the displacement of the arrow from the y-axis. So now we know that a
cosine is a phase-shifted sine!
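If you'd like to check this numerically, here is a small C sketch of our own that samples the general sinusoid A sin(2πνt + φ). The amplitude, frequency, sample rate, and number of printed values are arbitrary choices, and setting the phase shift to π/2 makes the printed values trace out a cosine.

#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979

int main(void)
{
    double A    = 1.0;        /* amplitude (radius of the phasor)          */
    double freq = 1.0;        /* frequency in revolutions per second       */
    double phi  = PI / 2.0;   /* phase shift: pi/2 turns sine into cosine  */
    double sampleRate = 8.0;  /* just a few samples per cycle, to print    */

    for (int i = 0; i < 16; i++) {
        double t = i / sampleRate;
        double h = A * sin(2.0 * PI * freq * t + phi);
        printf("t = %5.3f  h(t) = %6.3f\n", t, h);
    }
    return 0;
}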
Adding Phasors
Fourier’s theorem tells us that any periodic function can be expressed as a sum (possibly
with an infinite number of terms!) of sinusoids. (We’ll discuss Fourier’s theorem more in
depth later.) Remember, a periodic function is any function that looks like the infinite
repetition of some fixed pattern. The length of that basic pattern is called the period of the
function. We’ve seen a lot of examples of these in Chapter 1.
In particular, if the function has period T, then this sum looks like:

f(t) = A0 + A1 cos(2πt/T) + B1 sin(2πt/T) + A2 cos(4πt/T) + B2 sin(4πt/T) + ...
If T is the period of our periodic function, then we now know that its frequency is 1/T—this
is also called the fundamental (frequency) of the periodic function, and we see that all other
frequencies that occur (called the partials) are simply integer multiples of the fundamental.
If you read other books on acoustics and DSP, you will find that partials are sometimes called overtones (from the German Oberton) and harmonics. There's often
confusion about whether the first overtone is the second partial, and so on. So, to be specific,
and also to be more in keeping with modern terminology, we’re always going to call the first
partial the one with the frequency of the fundamental.
Example: Suppose we have a triangle wave that repeats once every 1/100 second. Then the
corresponding fundamental frequency is 100 Hz (it repeats 100 times per second). Triangle
waves only contain partials at odd multiples of the fundamental. (The even multiples have
no energy—in fact, this is generally true of wave shapes with half-wave symmetry, where the second half of each cycle is an inverted copy of the first, as in the triangle wave.) Click on Applet 3.2 and see a triangle wave built by adding one partial
after another.
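Here is roughly what that partial-by-partial build-up looks like in C (a sketch of our own, not the applet's code; the 100 Hz fundamental comes from the example above, while the sample rate argument and the choice of eight partials are arbitrary):

#include <math.h>

#define PI 3.14159265358979

/* Fill buf with a 100 Hz triangle wave built from its first few odd
   partials (1, 3, 5, ...), each scaled by 1/N^2 with alternating sign. */
void triangle_by_partials(float *buf, int numSamples, float sampleRate)
{
    float fundamental = 100.0f;   /* Hz */
    int   numPartials = 8;        /* how many odd partials to add */

    for (int i = 0; i < numSamples; i++) {
        double t = i / sampleRate;
        double sum = 0.0;
        for (int k = 0; k < numPartials; k++) {
            int    n    = 2 * k + 1;                 /* odd partial number */
            double sign = (k % 2 == 0) ? 1.0 : -1.0; /* alternate the sign */
            sum += sign * sin(2.0 * PI * n * fundamental * t) / (double)(n * n);
        }
        buf[i] = (float)(8.0 / (PI * PI) * sum);
    }
}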
This adding up of partials to make a complex waveform might make sense acoustically, but in
order to really understand how to add phasors from a mathematical standpoint, we first need
to understand how to add vectors, or arrows.
How should we define an arithmetic of arrows? It sounds funny, but in fact it’s a pretty
natural generalization of what we already know about adding regular old numbers. When we
add a negative number, we go backward, and when we add a positive number, we go forward.
Our regular old numbers can be thought of as arrows on a number line. Adding any two
numbers, then, simply means taking the two corresponding arrows and placing them one after
the other, tip to tail. The sum is then the arrow from the origin pointing to the place where
"adding" the two arrows landed you.
Really, what we are doing here is thinking of numbers as vectors. They have a magnitude (length) and a direction (in this case, positive or negative, or better yet 0 radians or π radians).
Now, to add phasors, we need to enlarge our worldview and allow our arrows to take on not just two directions, but a whole 2π radians' worth of directions! In other words, we allow our arrows to point anywhere in the plane. We add, then, just as before: place the arrows tip to tail, and draw an arrow from the origin to the final destination.
So, to recap: to add phasors, at each instant as our phasors are spinning around, we add the
two arrows. In this way, we get a new arrow spinning around (the sum) at some frequency—a
new phasor. Now it’s easy to see that the sum of two phasors of the same frequency yields a
new phasor of the same frequency. We can also see that the sum of a cosine and sine of the same frequency is simply a phase-shifted sine of the same frequency, with a new amplitude given by the square root of the sum of the squares of the two original amplitudes. That's the Pythagorean theorem!
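You can convince yourself of this with a quick, throwaway C program (ours; the amplitudes 3 and 4 and the frequency are arbitrary) that compares the sample-by-sample sum of a cosine and a sine with the single phase-shifted sine predicted by that square-root-of-the-sum-of-squares rule:

#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979

int main(void)
{
    double A = 3.0, B = 4.0;            /* amplitudes of the cosine and sine */
    double C   = sqrt(A * A + B * B);   /* predicted amplitude: 5            */
    double phi = atan2(A, B);           /* predicted phase shift             */
    double freq = 2.0;                  /* any frequency will do             */

    for (int i = 0; i < 8; i++) {
        double t   = i / 16.0;
        double sum = A * cos(2 * PI * freq * t) + B * sin(2 * PI * freq * t);
        double one = C * sin(2 * PI * freq * t + phi);
        printf("t = %5.3f  sum = %8.5f  single sinusoid = %8.5f\n", t, sum, one);
    }
    return 0;
}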
Sampling and Fourier Expansion
Figure 3.15
The decomposition of a complex waveform into its component phasors (which is pretty much
the same as saying the decomposition of an acoustic waveform into its component partials) is
called Fourier expansion.
In practice, the main thing that happens is that analog waveforms are sampled, creating a
time-domain representation inside the computer. These samples are then converted (using
what is called a fast Fourier transform, or FFT) into what are called Fourier coefficients.
Figure 3.16 FFT of sampled phasors exp(2*j*π*x/64), x = 1, 1.01, 1.02, ..., 1.90, 2.
Figure 3.17 FFT plot of a gamelan instrument.
Figure 3.17 shows a common way to show timbral information, especially the way that
harmonics add up to produce a waveform. However, it can be slightly confusing. By running
an FFT on a small time-slice of the sound, the FFT algorithm gives us the energy in various
frequency bins. (A bin is a discrete slice, or band, of the frequency spectrum. Bins are
explained more fully in Section 3.4.) The x-axis (bottom axis) shows the bin numbers, and the
y-axis shows the strength (energy) of each partial.
The slightly strange thing to keep in mind about these bins is that they are not based on the
frequency of the sound itself, but on the sampling rate. In other words, the bins evenly divide
the sampling frequency (linearly, not exponentially, which can be a problem, as we’ll explain
later). Also, this plot shows just a short fraction of time of the sound: to make it time-variant,
we need a waterfall 3D plot, which shows frequency and amplitude information over a span
of time. Although theoretically we could use the FFT data shown in Figure 3.17 in its raw
form to make a lovely, synthetic gamelan sound, the complexity and idiosyncrasies of the
FFT itself make this a bit difficult (unless we simply use the data from the original, but that’s
cheating).
Figure 3.18 shows a better graphical representation of sound in the frequency domain. Time is
running from front to back, height is energy, and the x-axis is frequency. This picture also
takes the essentially linear FFT and shows us an exponential image of it, so that most of the
"action" happens in the lower 2k, which is correct. (Remember that the FFT divides the
frequency spectrum into linear, equal divisions, which is not really how we perceive sound—
it’s often better to graph this exponentially so that there’s not as much wasted space "up top.")
The waterfall plot in Figure 3.18 is stereo, and each channel of sound has its own slightly
different timbre.
Figure 3.18 Waterfall plot.
Here's a fact that will help a great deal: if the highest frequency is B times the fundamental, then you only need 2B + 1 samples to determine the Fourier coefficients. (It's easy to see that you should need at least 2B, since you are trying to get 2B pieces of information (B amplitudes and B phase shifts).)
Figure 3.19 Aliasing, foldover. This is a phenomenon that happens when we try to sample a frequency that is
more than half the sampling rate, or the Nyquist frequency. As the frequency we want to sample gets higher than
half the sampling rate, we start "undersampling" and get unwanted, lower-frequency artifacts (that is, low
frequencies created by the sampling process itself).
Chapter 3: The Frequency Domain
Section 3.3: Fourier and the Sum of Sines
Soundfile 3.5: Adding sine waves
Soundfile 3.6: Trumpet
Applet 3.3: FFT demo
Soundfile 3.7: Filter examples. We start with a sampled file of bugs. This recording was
made by composer and sound artist David Dunn using very powerful hydrophones
(microphones that work underwater) to amplify the "microsound" of bugs in a pond.
Soundfile 3.8: High-pass filtered. This is David Dunn’s bugs sound file filtered so that we
hear only the frequencies above 1,000 Hz. In other words, we have high-pass filtered the
sound.
Soundfile 3.9: Low-pass filtered. This is the bugs sound file filtered so that we hear only the
frequencies below 1,000 Hz. Here we have low-pass filtered the sound.
In this section, we’ll try to really explain the notion of a Fourier expansion by building on the
ideas of phasors, partials, and sinusoidal components that we introduced in the previous
section. A long time ago, French scientist and mathematician Jean Baptiste Fourier (1768–
1830) proved the mathematical fact that any periodic waveform can be expressed as the sum
of an infinite set of sine waves. The frequencies of these sine waves must be integer multiples
of some fundamental frequency.
In other words, if we have a trumpet sound at middle A (440 Hz), we know by Fourier’s
theorem that we can express this sound as a summation of sine waves: 440 Hz, 880 Hz, 1,320 Hz, 1,760 Hz..., or 1, 2, 3, 4... times the fundamental, each at various amplitudes. This is
rather amazing, since it says that for every periodic waveform (one, by the way, that has
pitch), we basically know everything about its partials except their amplitudes.
The spectrum of the sine wave has energy only at one frequency. The triangle wave has energy at odd-numbered harmonics (meaning odd multiples of the fundamental), with the energy of each harmonic decreasing as 1 over the square of the harmonic number (1/N²). In other words, at the frequency that is N times the fundamental, we have 1/N² as much energy as in the fundamental.
The partials in the sawtooth wave decrease in energy in proportion to the inverse of the
harmonic number (1/N). Pulse (or rectangle or square) waveforms have energy over a broad
area of the spectrum, but only for a brief period of time.
Fourier Series
What exactly is a Fourier series, and how does it relate to phasors? We use phasors to
represent our basic tones. The amazing fact is that any sound can be represented as a
combination of phase-shifted, amplitude-modulated tones of differing frequencies. Remember
that we got a hint of this concept when we discussed adding phasors in Section 3.2.
A phasor is essentially a way of representing a sinusoidal function. What this means,
mathematically, is that any sound can be represented as a sum of sinusoids. This sum is called
a Fourier series.
Note that we haven’t limited these sounds to periodic sounds (if we did, we’d have to add that
last qualifier about integer multiples of a fundamental frequency). Nonperiodic, or aperiodic,
sounds are just as interesting—maybe even more interesting—than periodic ones, but we have
to do some special computer tricks to get a nice "harmonic" series out of them for the
purposes of analysis and synthesis.
But let's get down to the nitty-gritty. First, let's take a look at what happens when we add two sinusoids of the same frequency. Adding a sine and a cosine of the same frequency gives a phase-shifted sine of the same frequency:

A cos(2πνt) + B sin(2πνt) = C sin(2πνt + φ)

In fact, the amplitude of the sum, C, is given by:

C = √(A² + B²)

The phase shift φ is given by the angle whose tangent is equal to A/B. The shorthand for this is:

φ = arctan(A/B)

We can visualize this with a phasor. Remember that the cosine is just a phase-shifted sine. Since the sine and cosine are moving at the same frequency, they are always "out of sync" by π/2, so when we add them it looks like this:
And we get another sinusoid of that frequency.
Any periodic function of period 1 can be written as follows:

f(t) = A0 + A1 cos(2πt) + B1 sin(2πt) + A2 cos(4πt) + B2 sin(4πt) + A3 cos(6πt) + B3 sin(6πt) + ...

Notice that these sums can be infinite!
We have a nice shorthand for those possibly infinite sums (also called an infinite series):

f(t) = A0 + Σ (An cos(2πnt) + Bn sin(2πnt)), summing over n = 1, 2, 3, ...

The coefficients An and Bn are called the Fourier coefficients of the function f(t).
The Fourier coefficient A0 has a special name: it is called the DC term, or the DC offset. It tells you the average value of the function. The Fourier coefficients make up a set of numbers
called the spectrum of the sound. Now, when you think of the word "spectrum," you might
think of colors, like the spectrum of colors of the rainbow. In a way it’s the same: the
spectrum tells you how much of each frequency (color) is in the sound.
The values of An and Bn for "small" values of n make up the low-frequency information, and we call these the low-order Fourier coefficients. Similarly, the big values of n index the high-frequency information. Since most sounds are made up of a lot of low-frequency information, the low-frequency Fourier coefficients tend to have larger absolute values than the high-frequency Fourier coefficients.
What this means is that it is theoretically possible to take a complex sound, like a person’s
voice, and decompose it into a bunch of sine waves, each at a different frequency, amplitude,
and phase. These are called the sinusoidal or spectral components of a sound. To find them,
we do a Fourier analysis. Fourier synthesis is the inverse process, where we take varying
amounts of a bunch of sine waves and add them together (play them at the same time) to
reconstruct a sound. Sounds a bit fantastic, doesn’t it? But it works. This process of analyzing
or synthesizing a sound based on its component sine waves is called performing a Fourier
transform on the sound. When the computer does it, it uses a very efficient technique called
the fast Fourier transform (or FFT) for analysis and the inverse FFT (IFFT) for synthesis.
What happens if we add a number of sine waves together? We end up with a complicated
waveform that is the summation of the individual waves. This picture is a simple example: we just added up
two sine waves. For a complex sound, hundreds or even thousands of sine waves are needed to accurately
build up the complex waveform. By looking at the illustration from the bottom up, you can see that the inverse
is also true—the complex waveform can be broken down into a collection of independent sine waves.
Figure: Adding sine waves.
A trumpet note in an FFT (fast Fourier transform) analysis—two views. The
trumpet sound can be heard by clicking on Soundfile 3.6. Both of these pictures show the
evolution of the amplitude of spectral components in time.
The advantage of representing a sound in terms of its Fourier series is that it allows us to
manipulate the frequency content directly. If we want to accentuate the high-frequency
effects in a sound (make a sound brighter), we could just make all the high-frequency Fourier
coefficients bigger in amplitude. If we wanted to turn a sawtooth wave into a square wave,
we could just set to zero the Fourier coefficients of the even partials.
In fact, we often modify sounds by removing certain frequencies. This corresponds to making
a new function where certain Fourier coefficients are set equal to zero while all others are left
alone. When we do this we say that we filter the function or sound. These sorts of filters are
called bandpass filters, and the frequencies that we leave unaltered in this sort of situation are
said to be in the passband. A low-pass filter puts all the low frequencies (up to some
bandwidth) in the passband, while a high-pass filter puts all high frequencies (down to some
cutoff) in the passband. When we do this, we talk about high-passing and low-passing the
sound. In the following soundfiles, we listen to a sound and its high-passed and low-passed
versions. We’ll talk a lot more about filters in Chapter 4.
Once you have the spectral content of a sound, there is a lot you can do with it, but how do
you get it?! That’s what the FFT does, and we’ll talk about it in the next section.
Chapter 3: The Frequency Domain
Section 3.4: The DFT, FFT, and IFFT
Xtra bit 3.2: Dan’s history of the FFT
Xtra bit 3.3: The mathematics of magnitude and phase in the FFT
The most common tools used to perform Fourier analysis and synthesis are called the fast
Fourier transform (FFT) and the inverse fast Fourier transform (IFFT). The FFT and IFFT are
optimized (very fast) computer-based algorithms that perform a generalized mathematical
process called the discrete Fourier transform (DFT). The DFT is the actual mathematical
transformation that the data go through when converted from one domain to another (time to
frequency). Basically, the DFT is just a slow version of the FFT—too slow for our impatient
ears and brains!
FFTs, IFFTs, and DFTs became really important to a lot of disciplines when engineers figured
out how to take samples quickly enough to generate enough data to re-create sound and other
analog phenomena digitally. Remember, they don’t just work on sounds; they work on any
continuous signal (images, radio waves, seismographic data, etc.).
An FFT of a time domain signal takes the samples and gives us a new set of numbers
representing the frequencies, amplitudes, and phases of the sine waves that make up the sound
we’ve analyzed. It is these data that are displayed in the sonograms we looked at in Section
1.2.
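To make that conversion a little less mysterious, here is a deliberately slow, textbook-style DFT written in C (our own sketch, not an FFT and not taken from any particular library): for each bin it correlates the frame of samples with a cosine and a sine at that bin's frequency and reports the resulting magnitude.

#include <math.h>

#define PI 3.14159265358979

/* Naive DFT: for each bin k (0 .. N/2), compute the real and imaginary
   parts and, from them, the magnitude.  This is O(N^2); the FFT gets the
   same answer in O(N log N). */
void naive_dft_magnitudes(const float *x, int N, float *mag)
{
    for (int k = 0; k <= N / 2; k++) {
        double re = 0.0, im = 0.0;
        for (int n = 0; n < N; n++) {
            double angle = 2.0 * PI * k * n / N;
            re += x[n] * cos(angle);
            im -= x[n] * sin(angle);
        }
        mag[k] = (float)(sqrt(re * re + im * im) / N);
    }
}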
Figure 3.23 Graph and table of spectral components.
Figure 3.23 shows the first 16 bins of a typical FFT analysis after the conversion is made from
real and imaginary numbers to amplitude/phase pairs. We left out the phases, because, well, it
was too much trouble to just make up a bunch of arbitrary phases between 0 and 2π. In a lot
of cases, you might not need them (and in a lot of cases, you would!). In this case, the sample
rate is 44.1 kHz and the FFT size is 1,024, so the bin width (in frequency) is the Nyquist
frequency (44,100/2 = 22,050) divided by the FFT size, or about 22 Hz.
Amplitude values are assumed to be between 0 and 1, and notice that they’re quite small
because they all must sum to 1 (and there are a lot of bins!).
We confess that we just sort of made up the numbers; but notice that we made them up to
represent a sound that has a simple, more or less harmonic structure with a fundamental
somewhere in the 66 Hz to 88 Hz range (you can see its harmonics at around 2, 3, 4, 5, and 6
times its frequency, and note that the harmonics decrease in amplitude more or less like they
would in a sawtooth wave).
How the FFT Works
The way the FFT works is fairly straightforward. It takes a chunk of time called a frame (a
certain number of samples) and considers that chunk to be a single period of a repeating
waveform. The reason that this works is that most sounds are "locally stationary"; meaning
that over any short period of time, the sound really does look like a regularly repeating
function. The following is a way to consider this mathematically—taking a window over
some portion of some signal that we want to consider as a periodic function.
The Fast Fourier Transform in a Nutshell:
Computing Fourier Coefficients
Here’s a little three-step procedure for digital sound processing.
1. Window
2. Periodicize
3. Fourier transform (this also requires sampling, at a rate of at least twice the highest frequency present). We do this with the FFT. Following is an illustration of steps 1 and 2.
Here's the graph of a function, f(t). (Note that f(t) need not be a periodic function.)
Figure 3.24
Suppose we’re only interested in the portion of the graph between 0 ≤ t ≤ 1. Following is a
graph of the window function we need to use. We’ll call the function w(t). Note that w(t)
equals 1 only in the interval 0 ≤ t ≤ 1 and it’s 0 everywhere else.
Figure 3.25
In step 1, we need to window the function. In Figure 3.25 we’ve plotted both the window
function, w(t) (which is nonzero in the region we’re interested in) and function f(t) in the same
picture.
Figure 3.26
In Figure 3.26 we've plotted f(t)*w(t), which is the original function multiplied by the windowing function. From this figure, it's obvious what part of f(t) we're interested in.
Figure 3.27
In step 2, we need to periodically extend the windowed function, f(t)*w(t), all along the t-axis.
Figure 3.28
Great! We now have a periodic function, and the Fourier theorem says we can represent this
function as a sum of sines and cosines. This is step 3.
Remember, we can also use other, nonsquare windows. This is done to ameliorate the effect
of the square windows on the frequency content of the original signal.
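Step 1, windowing, is nothing more than a sample-by-sample multiply. Here is a small C sketch (our own, using a Hann window as one common nonsquare choice; the frame size is whatever you pass in) of how a frame might be windowed before the transform:

#include <math.h>

#define PI 3.14159265358979

/* Multiply one frame of samples by a Hann window: the frame fades in
   from 0, peaks in the middle, and fades back to 0, which softens the
   effect of chopping the signal off at the frame edges. */
void apply_hann_window(float *frame, int frameSize)
{
    for (int n = 0; n < frameSize; n++) {
        double w = 0.5 * (1.0 - cos(2.0 * PI * n / (frameSize - 1)));
        frame[n] *= (float)w;
    }
}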
Now, once we’ve got a periodic function, all we need to do is figure out, using the FFT, what
the component sine waves of that waveform are.
As we’ve seen, it is possible to represent any periodic waveform as a sum of phase-shifted
sine waves. In theory, the number of component sine waves is infinite—there is no limit to
how many frequency components a sound might have. In practice, we need to limit ourselves
to some predetermined number. This limit has a serious effect on the accuracy of our analysis.
Here’s how that works: rather than looking for the frequency content of the sound at all
possible frequencies (an infinitely large number—100.000000001 Hz, 100.000000002 Hz,
100.000000003 Hz, etc.), we divide up the frequency spectrum into a number of frequency
bands and call them bins. The size of these bins is determined by the number of samples in
our analysis frame (the chunk of time mentioned above). The number of bins is given by the
formula:
number of bins = frame size/2
Frame Size
So let’s say that we decide on a frame size of 1,024 samples. This is a common choice
because most FFT algorithms in use for sound processing require a number of samples that is
a power of two, and it’s important not to get too much or too little of the sound.
A frame size of 1,024 samples gives us 512 frequency bands. If we assume that we’re using a
sample rate of 44.1 kHz, we know that we have a frequency range (remember the Nyquist
theorem) of 0 kHz to 22.05 kHz. To find out how wide each of our frequency bins is, we use
the following formula:
bin width = frequency range/number of bins
This formula gives us a bin width of about 43 Hz. Remember that frequency perception is
logarithmic, so 43 Hz gives us worse resolution at the low frequencies and better resolution at
higher frequencies.
By selecting a certain frame size and its corresponding bandwidth, we avoid the problem of
having to compute an infinite number of frequency components in a sound. Instead, we just
compute one component for each frequency band.
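The arithmetic is simple enough to spell out in a few lines of C (a throwaway sketch of ours; the sample rate and frame sizes are just the values discussed in this section):

#include <stdio.h>

int main(void)
{
    double sampleRate = 44100.0;
    int    frameSizes[] = { 256, 512, 1024, 2048 };

    for (int i = 0; i < 4; i++) {
        int    frameSize = frameSizes[i];
        int    numBins   = frameSize / 2;                 /* number of bins       */
        double binWidth  = (sampleRate / 2.0) / numBins;  /* Nyquist / bins       */
        double duration  = frameSize / sampleRate;        /* time per frame (sec) */
        printf("frame %5d: %4d bins, bin width %6.1f Hz, frame %6.4f s\n",
               frameSize, numBins, binWidth, duration);
    }
    return 0;
}

Running this shows the trade-off discussed below: bigger frames give narrower bins but cover longer stretches of time.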
Software That Uses the FFT
Figure 3.29 Example of a commonly used FFT-based program: the phase vocoder menu from Tom Erbe’s SoundHack.
Note that the user is allowed to select (among several other parameters) the number of bands
in the analysis. This means that the user can customize what is called the time/frequency
resolution trade-off of the FFT. Don’t ask us what the other options on this screen are—
download the program and try it yourself!
There are many software packages available that will do FFTs and IFFTs of your data for you
and then let you mess around with the frequency content of a sound. In Chapter 4 we’ll talk
about some of the many strange and wonderful things that can be done to a sound in the
frequency domain.
The details of how the FFT works are well beyond the scope of this book. What is important
for our purposes is that you understand the general idea of analyzing a sound by breaking it
into its frequency components and, conversely, by using a bunch of frequency components to
synthesize a new sound. The FFT has been understood for a long time now, and most
computer music platforms have tools for Fourier analysis and synthesis.
Figure 3.30 Another way to look at the frequency spectrum is
to remove time as an axis and just consider a sound as a
histogram of frequencies. Think of this as averaging the
frequencies over a long time interval. This kind of picture
(where there's no time axis) is useful for looking at a short-term snapshot of a sound (often just one frame), or perhaps
even for trying to examine the spectral features of a sound that
doesn’t change much over time (because all we see are the
"averages").
The y-axis tells us the amplitude of each component frequency. Since we’re looking at just
one frame of an FFT, we usually assume a periodic, unchanging signal. A histogram is
generally most useful for investigating the steady-state portion of a sound. (Figure 3.30 is a
screen grab from SoundHack.)
Chapter 3: The Frequency Domain
Section 3.5: Problems with the FFT/IFFT
Soundfile 3.10: Sine wave lobes. Soundfile 3.10 is an example of a sine wave swept from 50 Hz to 10
kHz, processed through an FFT. Figure 3.32 illustrates the sine wave sweep in an FFT analysis. The lobes that
you see are the result of the energy of the sine wave "centering" in the successive FFT bands and then fading
slightly as the width of the band forces the FFT into less accurate representations of the moving frequency
(until it centers in the next band). In other words, one of the hardest things for an FFT to represent is a simply
moving sine wave!
Soundfile 3.11 In the first illustration, we used an FFT size of 512 samples, giving us pretty good time
resolution. In the second, we used 2,048 samples, giving us pretty good frequency resolution. As a result,
frequencies are smeared vertically in the first analysis, while time is smeared horizontally in the second. What’s
the solution to the time/frequency uncertainty dilemma? Compromise.
Soundfile 3.12: Beat sound. A fairly normal-sounding beat soundfile.
Soundfile 3.13: Time smeared. In this example the beat soundfile has been processed to
provide accurate frequency resolution but inaccurate time resolution, or rhythmic smearing.
Soundfile 3.14: Frequency smeared. In this example the beat soundfile has been processed to
provide accurate rhythmic resolution but inaccurate frequency resolution, or spectral
smearing.
Figure 3.31 There is a lot of "wasted" space in an FFT analysis—most of the frequencies that are of concern to
us tend to be below 5 kHz. Shown are the approximate ranges of some musical instruments and the human voice.
It’s also a big, big problem that the FFT divides the frequency range into linear segments (each frequency bin is
the same "width") while, as we well know, our perception of frequency is logarithmic.
Charts courtesy of Geoff Husband and tnt-audio.com. Used with permission.
The FFT often sounds like the perfect tool for exploring the frequency domain and timbre,
right? Well, it does work very well for many things, but it’s not without its problems. One of
the main drawbacks is that the frequency bins are linear. For example, if we have a bin width
of 43 Hz (the result of dividing the Nyquist frequency by the number of bins, or half the FFT frame size), then
we have bins from 0 Hz to 43 Hz, 43 Hz to 86 Hz, 86 Hz to 129 Hz, and so on.
The problem with this, as we learned earlier, is that the human ear responds to frequency
logarithmically, not linearly. At low frequencies, 43 Hz is quite a wide interval—the jump
from 43 Hz to 86 Hz is a whole octave! But at higher frequencies, 43 Hz is a tiny interval
(perceptually)—less than a minor second. So the FFT has very fine high-frequency pitch
resolution, but very poor low-frequency resolution.
The effect of the FFT’s linearity is that, for us, much of the FFT data is "wasted" on recording
high-frequency information very accurately, at the expense of the low-frequency information
that is generally more useful in a musical context. Wavelets, which we’ll look at in Section
3.6, are one approach to solving this problem.
Figure 3.32 Sweeping a sine wave through an FFT.
Frequency and Time Resolution Trade-Off
A related drawback of the FFT is the trade-off that must be made between frequency and time
resolution. The more accurately we want to measure the frequency content of a signal, the
more samples we have to analyze in each frame of the FFT. Yet there is a cost to expanding
the frame size—the larger the frame, the less we know about the temporal events that take
place within that frame.
In other words, more samples require more time; but the longer the time, the less the sound
over that interval looks like a sine wave, or something periodic—so the less well it is
represented by the FFT. We simply can’t have it both ways!
Figure 3.33 Selecting an FFT size involves making trade-offs in terms of time and frequency accuracy.
Basically it boils down to this: The more accurate the analysis is in one domain, the less accurate it will be in the
other. This figure illustrates what happens when we choose different frame sizes.
In the first illustration, we used an FFT size of 512 samples, giving us pretty good time
resolution. In the second, we used 2,048 samples, giving us pretty good frequency resolution.
As a result, frequencies are smeared vertically in the first analysis, while time is smeared
horizontally in the second. What’s the solution to the time/frequency uncertainty dilemma?
Compromise.
Time Smearing
We mentioned that 1,024 samples (1k) is a pretty common frame size for an audio FFT. At a sample rate of 44.1 kHz, 1,024 samples is about 0.023 second of sound. What that means is that all the sonic events that take place within that 0.023 second will be lumped together and
analyzed as one event. Because of the nature of the FFT, this "event" is actually treated as if it
were an infinitely repeating periodic waveform. The amplitudes of the frequency components
of all the sonic events in that time frame will be averaged, and these averages will end up in
the frequency bins.
This is known as time smearing. Now let’s say that we need more than the 43 Hz frequency
resolution that a 1k FFT gives us. To get better frequency resolution, we need to use a bigger
frame size. But a bigger frame size means that even more samples will be lumped together,
giving us even worse time resolution. At a frame size of 2k we get a frequency resolution of
about 21.5 Hz, but our time resolution goes down to about 0.05 (1/20) of a second. And,
believe it or not, a great deal can happen in 1/20 of a second!
Good Time Resolution
Conversely, if we need good time resolution (say we’re analyzing some percussive sounds
and we want to know exactly when they happen), we need to shrink the frame size. The ideal
frame size for the time domain would of course be one sample—that way we would know at
exactly which sample something happened.
Unfortunately, with only one sample to analyze, we would get no useful frequency
information out of the FFT at all. A more reasonable frame size and one that is considered
small for audio, such as 256 samples (a 0.006-second chunk of time), gives us 128 analysis
bands, for a bin width of about 172 Hz. While a 0.006-second time resolution is reasonable,
172 Hz is a pretty dreadful frequency resolution. That would put several bottom octaves of the
piano into one averaged bin.
A Compromise
So what’s the answer to this time/frequency dilemma? There really isn’t one. If we use the
FFT to do our analysis, we’re stuck with the fact that higher resolution in one domain results
in lower resolution in the other. The trick is to find a useful balance, based on the types of
sounds we are analyzing. No single frame size will work well for all sounds.
Chapter 3: The Frequency Domain
Section 3.6: Some Alternatives to the FFT
We know that in previous sections we’ve made it sound like the FFT is the only game in
town, but that's not entirely true. Historically, the FFT was the first, most important, and most
widely used algorithm for frequency domain analysis. Musicians, especially, have taken a
liking to it because there are some well-known references on its implementation, and
computer musicians especially have gained a lot of experience in its nuances. It’s fairly
simple to program, and there are a lot of existing software programs to use if you get lazy.
Also, it works pretty well (emphasis on the word "pretty").
The FFT breaks up a sound into sinusoids, but there may be reasons why you’d like to break a
sound into other sorts of basic sounds. These alternate transforms can work better, for some
purposes, than the FFT. (A transform is just something that takes a list of the sample values
and turns the values into a new list of numbers that describes a possible new way to add up
different basic sounds.)
For example, wavelet transforms (sometimes called dynamic base transforms) can modify the
resolution for different frequency ranges, unlike the FFT, which has a constant bandwidth
(and thus is, by definition, less sensitive to lower frequencies—the important ones— than to
higher ones!). Wavelets also use a variety of different analysis waveforms (as opposed to the
FFT, which only uses sinusoids) to get a better representation of a signal. Unfortunately,
wavelet transforms are still a bit uncommon in computer software, since they are, in general,
harder to implement than FFTs. One of the big problems is deciding which wavelet to use,
and in what frequency bandwidth. Often, that decision-making process can be more important
(and time-consuming) than the actual transform! Wavelets are a pretty hot topic though, so
we’ll probably be seeing some wavelet-based techniques emerge in the near future.
Figure 3.34 Some common wavelet analysis waveforms. Each of these is a prototype, or "mother wavelet," that provides a basic template for all the related analysis waveforms.
A wavelet analysis will start with one of these waveforms and break up the sound into a sum of translated and stretched (dilated) versions of these.
McAulay-Quatieri (MQ) Analysis
Another interesting approach to a more organized, information-rich set of transforms is the
extended McAulay-Quatieri (MQ) analysis algorithm. This is used in the popular Macintosh
program Lemur Pro, written by the brilliant computer music researcher Kelly Fitz.
MQ analysis works a lot like normal FFTs—it tries to figure out how a sound can be recreated using a number of sine waves. However, Lemur Pro is an enhancement of more
traditional FFT-based software in that it uses the resulting sine waves to extract frequency
tracks from a sound. These tracks attempt to follow particular components of the sound as
they change over time. That is, MQ not only represents the amplitude trajectories of partials,
but also tries to describe the ways that spectral components change in frequency over time.
This is a fundamentally different way of describing spectra, and a powerful one.
MQ is an excellent idea, as it allows the analyst a much higher level of data representation
and provides a more perceptually significant view of the sound. A number of more advanced
FFT-based programs also implement some form of frequency tracking in order to improve the
results of the analysis, similar to that of MQ analysis. For example, even though a sine wave
might sweep from 1 kHz to 2 kHz, it would still be contained in one MQ track (rather than
fading in and out of FFT bins, forming what are called lobes).
An MQ-type analysis is often more useful for sound work than regular FFT analysis, since it
has sonic cognition ideas built into it. For example, the ear tends to follow frequencies in
time, much like an MQ analysis, and in our perception loud frequencies tend to mask
neighboring quieter frequencies (as do MQ analyses). These kinds of sophisticated analysis
techniques, where frequency and amplitude are isolated, suggest purely graphical
manipulation of sounds, and we’ll show some examples of that in Section 5.7.
Figure 3.35 Frequency tracks in Lemur Pro. The information looks much like one of our old sonograms, but in
fact it is the result of a great deal of high-level processing of FFT-type data.
Figure 3.36 MQ plot of 0'00" to 5'49" of Movement II: from the electro-acoustic work "Sud" by the great
French composer and computer music pioneer Jean-Claude Risset. Photo by David Hirst, an innovative
composer and computer music researcher from Australia.
Chapter 4: The Synthesis of Sound by Computer
Section 4.1: Introduction to Sound Synthesis
What Is Sound Synthesis?
We’ve learned that a digitally recorded sound is stored as a sequence of numbers. What would
happen if, instead of using recorded sounds to generate those numbers, we simply generated
the numbers ourselves, using the computer? For instance, what if we just randomly select
44,100 numbers (1 second’s worth) and call them samples? What might the result sound like?
The answer is noise. No surprise here—it makes sense that a signal generated randomly
would sound like noise. Philosophically (and mathematically), that’s more or less the
definition of noise: random or, more precisely, unpredictable information. While noise can be
sonically interesting (at least at first!) and useful for lots of things, it’s clear that we’ll want to
be able to generate other types of sounds too. We’ve seen how fundamentally important sine
waves are; how about generating one of those? The formula for generating a sine wave is very
straightforward:
y = sin(x)
That’s simple, but controlling frequency, amplitude, phase, or anything more complicated
than a sine wave can get a bit trickier.
Applet 4.1: Bring in da funk, bring in da noise
This applet allows you to hear some basic types of noise: pink, white, and red. We use different colors to describe different types of noise, based on the average frequency content of the noiseband. White noise contains all frequencies, at random amplitudes (some of which may be zero). Pink noise has a heavier dose of lower frequencies, and red noise even more.
Sine Wave Example
If we take a sequence of numbers (the usual range is 0 to 2π, perhaps with a phase increment of 0.1—meaning that we're adding 0.1 for each value) and plug them into the sine formula, what results is another sequence of numbers that describe a sine wave. By carefully selecting the numbers, we can control the frequency of the generated sine wave.
Most basic computer synthesis methods follow this same general scheme: a formula or
function is defined that accepts a sequence of values as input. Since waveforms are defined in
terms of amplitudes in time, the input sequence of numbers to a synthesis function is usually
an ongoing list of time values. A synthesis function usually outputs a sequence of sample
values describing a waveform. These sample values are, then, a function of time—the result
of the specific synthesis method. For example, assume the following sequence of time values.
(t1, t2, t3,....., tn)
Then we have some function of those values (we can call it f(t)):
(f(t1), f(t2), f(t3),....., f(tn)) = (s1, s2, s3,....., sn)
(s1, s2, s3,....., sn) are the sample values.
This sine wave function is a very simple example. For some sequence of time values (t1, t2,
t3,....., tn), sin(t) gives us a new set of numbers {sin(t1), sin(t2), sin(t3),....., sin(tn)}, which we
call the signal’s sample values.
Xtra bit 4.1: Computer code for generating an array of a sine wave
#include <math.h>

#define TABLE_SIZE 512
#define TWO_PI (3.14159 * 2)

float samples[TABLE_SIZE];    /* holds one cycle of a sine wave */

void fillSineTable(void)
{
    float phaseIncrement = TWO_PI / TABLE_SIZE;  /* phase step between samples */
    float currentPhase = 0.0;                    /* runs from 0 up to 2 pi     */
    int i;

    /* step through one full cycle, taking the sine of the phase at each step */
    for (i = 0; i < TABLE_SIZE; i++) {
        samples[i] = sin(currentPhase);
        currentPhase += phaseIncrement;
    }
}
As we’ll see later, more sophisticated functions can be used to generate increasingly complex
waveforms. Often, instead of generating samples from a raw function in time, we take in the
results of applying a previous function (a list of numbers, not a list of time values) and do
something to that list with our new function. In fact, we have a more general name for sound
functions that take one sequence of numbers (samples) as an input and give you another list
back: they’re called filters.
You may never have to get quite this "down and dirty" when dealing with sound synthesis—
most computer music software takes care of things like generating noise or sine waves for
you. But it’s important to understand that there’s no reason why the samples we use need to
come from the real world. By using the many available synthesis techniques, we can produce
a virtually unlimited variety of purely synthetic sounds, ranging from clearly artificial,
machine-made sounds to ones that sound almost exactly like their "real" counterparts!
There are many different approaches to sound synthesis, each with its own strengths,
weaknesses, and suitability for creating certain types of sound. In the sections that follow,
we’ll take a look at some simple and powerful synthesis techniques.
Chapter 4: The Synthesis of Sound by Computer
Section 4.2: Additive Synthesis
Additive synthesis refers to a number of related synthesis techniques, all based on the idea that
complex tones can be created by the summation, or addition, of simpler tones. As we saw in
Chapter 3, it is theoretically possible to break up any complex sound into a number of simpler
ones, usually in the form of sine waves. In additive synthesis, we use this theory in reverse.
Figure 4.1 Two waves joined by a plus sign.
Figure 4.2 This organ has a great many pipes, and together they function exactly like an
additive synthesis algorithm.
Each pipe essentially produces a sine wave (or something like it), and by selecting different
combinations of harmonically related pipes (as partials), we can create different combinations
of sounds, called (on the organ) stops. This is how organs get all those different sounds:
organists are experts on Fourier series and additive synthesis (though they may not know
that!).
The technique of mixing simple sounds together to get more complex sounds dates back a
very long time. In the Middle Ages, huge pipe organs had a great many stops that could be
"pulled out" to combine and recombine the sounds from several pipes. In this way, different
"patches" could be created for the organ. More recently, the telharmonium, a giant electrical
synthesizer from the early 1900s, added together the sounds from dozens of electromechanical tone generators to form complex tones. This wasn’t very practical, but it has an
important place in the history of electronic and computer music.
Applet 4.2: Mixed sounds
This applet demonstrates how sounds are mixed together.
The Computer and Additive Synthesis
While instruments like the pipe organ were quite effective for some sounds, they were limited
by the need for a separate pipe or oscillator for each tone that is being added. Since complex
sounds can require anywhere from a couple dozen to several thousand component tones, each
needing its own pipe or oscillator, the physical size and complexity of a device capable of
producing these sounds would quickly become prohibitive. Enter the computer!
Soundfile 4.1: Excerpt from Kenneth Gaburo's composition "Lemon Drops"
A short excerpt from Kenneth Gaburo's composition "Lemon Drops," a classic of electronic music made in the early 1960s.
This piece and another extraordinary Gaburo work, "For Harry," were
made at the University of Illinois at Urbana-Champaign on an early
electronic music instrument called the harmonic tone generator, which
allowed the composer to set the frequencies and amplitudes of a number
of sine wave oscillators to make their own timbres. It was extremely
cumbersome to use, but it was essentially a giant Fourier synthesizer,
and, theoretically, any periodic waveform was possible on it!
It’s a tribute to Gaburo’s genius and that of other early electronic music
pioneers that they were able to produce such interesting music on such
primitive instruments. Kind of makes it seem like we’re almost
cheating, with all our fancy software!
If there is one thing computers are good at, it’s adding things together. By using digital
oscillators instead of actual physical devices, a computer can add up any number of simple
sounds to create extremely complex waveforms. Only the speed and power of the computer
limit the number and complexity of the waveforms. Modern systems can easily generate and
mix thousands of sine waves in real time. This makes additive synthesis a powerful and
versatile performance and synthesis tool. Additive synthesis is not used so much anymore
(there are a great many other, more efficient techniques for getting complex sounds), but it’s
definitely a good thing to know about.
A Simple Additive Synthesis Sound
Let’s design a simple sound with additive
synthesis. A nice example is the generation of a
square wave.
You can probably imagine what a square wave
would look like. We start with just one sine wave,
called the fundamental. Then we start adding odd
partials to the fundamental, the amplitudes of
which are inversely proportional to their partial
number. That means that the third partial is 1/3 as
strong as the first, the fifth partial is 1/5 as strong,
and so on. (Remember that the fundamental is the
first partial; we could also call it the first
harmonic.) Figure 4.3 shows what we get after
adding seven harmonics. Looks pretty square,
doesn’t it?
Now, we should admit that there’s an easier way to
synthesize square waves: just flip from a high
sample value to a low sample value every n
samples. The lower the value of n, the higher the
frequency of the square wave that’s being
generated. Although this technique is clearer and
easier to understand, it has its problems too;
directly generating waveforms in this way can cause unwanted frequency aliasing.
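To make this concrete, here is a minimal Python sketch (ours, not code from the book) that builds an approximately square wave by summing odd partials at amplitudes of 1/n; the frequency, number of partials, and sample rate are arbitrary choices.

import numpy as np

def additive_square(freq=220.0, num_partials=7, sr=44100, dur=1.0):
    # Sum odd partials (1, 3, 5, ...) at amplitude 1/n, the recipe described above.
    t = np.arange(int(sr * dur)) / sr
    wave = np.zeros_like(t)
    for n in range(1, 2 * num_partials, 2):
        wave += np.sin(2 * np.pi * n * freq * t) / n
    return wave / np.max(np.abs(wave))   # normalize to the range -1..1

square_ish = additive_square()

Because every partial is generated explicitly, it is easy to keep them all below the Nyquist frequency, which is how additive synthesis sidesteps the aliasing problem mentioned above.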
Figure 4.4 The Synclavier was an early digital
electronic music instrument that used a large
oscillator bank for additive synthesis. You can see
this on the front panel of the instrument—many of
the LEDs indicate specific partials! On the
Synclavier (as was the case with a number of other
analog and digital instruments), the user can tune
the partials, make them louder, even put envelopes
on each one.
Figure 4.3 The waveform that results from adding the odd harmonics described above.
Applet 4.3: Additive synthesis. This applet lets you add sine waves together at various amplitudes, to see how additive synthesis works.
Applet 4.4: Spectral envelopes. This applet lets you add spectral envelopes to a number of partials. This means that you can impose a different amplitude trajectory for each partial, independently making each louder and softer over time.
This is really more like the way things work in the real world: partial
amplitudes evolve over time—sometimes independently, sometimes in
conjunction with other partials (in a phenomenon called common fate).
This is called spectral evolution, and it’s what makes sounds live.
A More Interesting Example
OK, now how about a more interesting example of additive synthesis? The quality of a
synthesized sound can often be improved by varying its parameters (partial frequencies,
amplitudes, and envelope) over time. In fact, time-variant parameters are essential for any
kind of "lifelike" sound, since all naturally occurring sounds vary to some extent.
Soundfile 4.2: Sine wave speech
Soundfile 4.3: Regular speech
Soundfile 4.4: Sine wave speech
Soundfile 4.5: Regular speech
These soundfiles are examples of sentences reconstructed with sine waves. Soundfile 4.2 is
the sine wave version of the sentence spoken in Soundfile 4.3, and Soundfile 4.4 is the sine
wave version of the sentence spoken in Soundfile 4.5.
Sine wave speech is an experimental technique that tries to simulate speech with just a few
sine waves, in a kind of primitive additive synthesis. The idea is to pick the sine waves
(frequencies and amplitudes) carefully. It’s an interesting notion, because sine waves are
pretty easy to generate, so if we can get close to "natural" speech with just a few of them, it
follows that we don’t require that much information when we listen to speech.
Sine wave speech has long been a popular idea for experimentation by psychologists and
researchers. It teaches us a lot about speech—what’s important in it, both perceptually and
acoustically.
These files are used with the permission of Philip Rubin, Robert Remez,
and Haskins Laboratories.
Attacks, Decays, and Time Evolution in Sounds
As we’ve said, additive synthesis is an important tool, and we can do a lot with it. It does,
however, have its drawbacks. One serious problem is that while it’s good for periodic sounds,
it doesn’t do as well with noisy or chaotic ones.
For instance, creating the steady-state part (the sustain) of a flute note is simple with additive
synthesis (just a couple of sine waves), but creating the attack portion of the note, where there
is a lot of breath noise, is nearly impossible. For that, we have to synthesize a lot of different
kinds of information: noise, attack transients, and so on.
And there’s a worse problem that we’d love to sweep under the old psychoacoustical rug, too,
but we can’t: it’s great that we know so much about steady-state, periodic, Fourier-analyzable
sounds, but from a cognitive and perceptual point of view, we really couldn’t care less about
them! The ear and brain are much more interested in things like attacks, decays, and changes
over time in a sound (modulation). That’s bad news for all that additive synthesis software,
which doesn’t handle such things very well.
That’s not to say that if we play a triangle wave and a sawtooth wave, we couldn’t tell them
apart; we certainly could. But that really doesn’t do us much good in most circumstances. If
angry lions roared in square waves, and cute cuddly puppy dogs barked in triangle waves,
maybe this would be useful, but we have evolved (or learned) to hear attacks, decays, and
other transients as being more crucial. What we need to be able to synthesize are transients,
spectral evolutions, and modulations. Additive synthesis is not really the best technique for
those.
Another problem is that additive synthesis is very computationally expensive. It’s a lot of
work to add all those sine waves together for each output sample of sound! Compared to some
other synthesis methods, such as frequency modulation (FM) synthesis, additive synthesis
needs lots of computing power to generate relatively simple sounds.
But despite its drawbacks, additive synthesis is conceptually simple, and it corresponds very
closely to what we know about how sounds are constructed mathematically. For this reason
it’s been historically important in computer sound synthesis.
Figure 4.5 A typical ADSR (attack, decay, sustain, release) envelope. This is a standard amplitude envelope shape used in sound synthesis.
The ability to change a sound’s amplitude envelope over time plays an important part in the
perceived "naturalness" of the sound.
Shepard Tones
One cool use of additive synthesis is in the generation of a very interesting phenomenon
called Shepard tones. Sometimes called "endless glissandi," Shepard tones are created by
specially configured sets of oscillators that add their tones together to create what we might
call a constantly rising tone. Certainly the Shepard tone phenomenon is one of the more
interesting topics in additive synthesis.
In the 1960s, experimental psychologist Roger Shepard, along with composers James Tenney
and Jean-Claude Risset, began working with a phenomenon that scientifically demonstrates
an independent dimension in pitch perception called chroma, confirming the circularity of
relative pitch judgments.
What circularity means is that pitch is perceived in kind of a circular way: it keeps going up
until it hits an octave, and then it sort of starts over again. You might say pitch wraps around
(think of a piano, where the C notes are evenly spaced all the way up and down). By chroma,
we mean an aspect of pitch perception in which we group together the same pitches that are
related as frequencies by multiples of 2. These are an octave apart. In other words, 55 Hz is
the same chroma as 110 Hz as 220 Hz as 440 Hz as 880 Hz. It’s not exactly clear whether this
is "hard-wired" or learned, or ultimately how important it is, but it’s an extraordinary idea and
an interesting aural illusion.
We can construct such a circular series of pitches in a laboratory setting using synthesized
Shepard tones. These complex tones are composed of partials separated by octaves; in other words, all partials whose numbers are not powers of two are omitted.
These tones slide gradually from the bottom of the frequency range to the top. The amplitudes
of the component frequencies follow a bell-shaped spectral envelope (see Figure 4.6) with a
maximum near the middle of the standard musical range. In other words, they fade in and out
as they get into the most common frequency range. This creates an interesting illusion: a
circular Shepard tone scale can be created that varies only in tone chroma and collapses the
second dimension of tone height by combining all octaves. In other words, what you hear is a
continuous pitch change through one octave, but not bigger than one octave (that’s a result of
the special spectra and the amplitude curve). It’s kind of like a barber pole: the pitches sound
as if they just go around for a while, and then they’re back to where they started (even though,
actually, they’re continuing to rise!).
Figure 4.6 Bell-shaped spectral envelope for making Shepard tones.
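Here is one way to sketch a rising Shepard glissando in Python (our illustration, with arbitrary choices for duration, number of octaves, and base frequency): octave-spaced partials sweep upward and wrap around, while a fixed bell-shaped envelope fades each one in and out.

import numpy as np

def shepard_gliss(dur=10.0, sr=44100, octaves=7, base=27.5):
    n = int(dur * sr)
    t = np.arange(n) / sr
    out = np.zeros(n)
    for k in range(octaves):
        # fractional octave position, wrapping around once per dur seconds
        pos = (k + t / dur) % octaves
        freq = base * 2.0 ** pos                             # instantaneous frequency
        phase = 2 * np.pi * np.cumsum(freq) / sr             # integrate frequency to get phase
        amp = 0.5 - 0.5 * np.cos(2 * np.pi * pos / octaves)  # bell-shaped spectral envelope
        out += amp * np.sin(phase)
    return out / np.max(np.abs(out))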
Shepard wrote a famous paper in 1964 in which he explains, to some extent, our notion of
octave equivalence using this auditory illusion: a sequence of these Shepard tones that shifts
only in chroma as it is played. The apparent fundamental frequency increases step by step,
through repeated cycles. Listeners hear the pitch steps as climbing continuously upward, even
though the pitches are actually moving only around the chroma circle. Absolute pitch height
(that is, how "high" or "low" it sounds) is removed from our perception of the sequence.
Soundfile 4.6: Shepard tone
Figure 4.7 Try clicking on Soundfile 4.6. After you listen to the soundfile once, click again to listen to the frequencies continue on their upward spiral. Used with permission from Susan R. Perry, M.A., Dept. of Psychology, University of Tennessee.
The Shepard tone contains a large number of octave-related harmonics across the frequency spectrum, all of which rise (or fall) together. The harmonics toward the low and high ends of the spectrum are attenuated gradually, while those in the middle have maximum amplification. This creates a spiraling or barber pole effect. (Information from Doepfer Musikelektronik GmbH.)
Soundfile 4.7: Shepard tone
Soundfile 4.7 is an example of the spiraling Shepard tone effect.
James Tenney is an important computer music composer and pioneer who
worked at Bell Laboratories with Roger Shepard in the early 1960s.
Soundfile 4.8: "For Ann (rising)," by James Tenney.
This piece was composed in 1969. This composition is based on a set of
continuously rising tones, similar to the effect created by Shepard tones.
The compositional process is simple: each glissando, separated by some
fixed time interval, fades in from its lowest note and fades out as it nears
the top of its audible range. It is nearly impossible to follow, aurally, the
path of any given glissando, so the effect is that the individual tones never
reach their highest pitch.
Chapter 4: The Synthesis of Sound by Computer
Section 4.3: Filters
The most common way to think about filters is as functions that take in a signal and give back
some sort of transformed signal. Usually, what comes out is "less" than what goes in. That’s
why the use of filters is sometimes referred to as subtractive synthesis.
It probably won’t surprise you to learn that subtractive synthesis is in many ways the opposite
of additive synthesis. In additive synthesis, we start with simple sounds and add them together
to form more complex ones. In subtractive synthesis, we start with a complex sound (like
noise) and subtract, or filter out, parts of it. Subtractive synthesis can be thought of as sound
sculpting—you start out with a thick chunk of sound containing many possibilities
(frequencies), and then you carve out (filter) parts of it. Filters are one of the sound sculptor’s
most versatile and valued tools.
Older telephones imposed a low-pass filter with a cutoff of only a few kilohertz on their audio signal, mostly for noise reduction and to keep the equipment a bit cheaper.
Soundfile 4.9: Telephone simulations
White noise (every frequency below the Nyquist rate at equal
level) is filtered so we hear only frequencies above 5 kHz.
Soundfile 4.10: High-pass filtered noise
Here we hear only frequencies up to 500 Hz.
Soundfile 4.11: Low-pass filtered noise
Four Basic Types of Filters
Figure 4.8 Four common filter types (clockwise from upper left): low-pass, high-pass, band-reject, band-pass.
Figure 4.8 illustrates four basic types of filters: low-pass, high-pass, band-pass, and band-reject. Low-pass and high-pass filters should already be familiar to you—they are exactly like the "tone" knobs on a car stereo or boombox. A low-pass (also known as high-stop) filter stops, or attenuates, high frequencies while letting through low ones, while a high-pass (low-stop) filter does just the opposite.
This applet is a good example of how filters, combined with
something like noise, can produce some common and useful musical
effects with very few operations.
Applet 4.5: Using filters
Band-Pass and Band-Reject Filters
Band-pass and band-reject filters are basically combinations of low-pass and high-pass filters.
A band-pass filter lets through only frequencies above a certain point and below another, so
there is a band of frequencies that get through. A band-reject filter is the opposite: it stops a
band of frequencies. Band-reject filters are sometimes called notch filters, because they can
notch out a particular part of a sound.
Applet 4.6: Comb filters. Comb filters are a very specific type of digital process in which a short delay (where some number of samples are actually delayed in time) and a simple feedback algorithm (where outputs are sent back to be reprocessed and recombined) are used to create a rather extraordinary effect. Sounds can be "tuned" to specific harmonics (based on the length of the delay and the sample rate).
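A feedback comb filter can be sketched in a few lines of Python (ours, not the book's); with sample rate sr and a delay of D samples, the output is "tuned" to sr/D Hz and its harmonics. The 0.9 feedback value is an arbitrary choice.

import numpy as np

def comb_filter(x, delay_samples, feedback=0.9):
    # y(n) = x(n) + feedback * y(n - delay_samples)
    y = np.zeros_like(x)
    for n in range(len(x)):
        delayed = y[n - delay_samples] if n >= delay_samples else 0.0
        y[n] = x[n] + feedback * delayed
    return y

sr = 44100
noise = np.random.uniform(-1.0, 1.0, sr)
combed = comb_filter(noise, delay_samples=100)   # tuned to 44100/100 = 441 Hz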
Low-Pass and High-Pass Filters
Low-pass and high-pass filters have a value associated with them called the cutoff frequency,
which is the frequency where they begin "doing their thing." So far we have been talking
about ideal, or perfect, filters, which cut off instantly at their cutoff frequency. However, real
filters are not perfect, and they can’t just stop all frequencies at a certain point. Instead,
frequencies die out according to a sort of curve around the corner of their cutoff frequency.
Thus, the filters in Figure 4.8 don’t have right angles at the cutoff frequencies—instead they
show general, more or less realistic response curves for low-pass and high-pass filters.
Cutoff Frequency
The cutoff frequency of a filter is defined as the point at which the signal is attenuated to
0.707 of its maximum value (which is 1.0). No, the number 0.707 was not just picked out of a
hat! It turns out that the power of a signal is determined by squaring the amplitude: 0.707^2 ≈ 0.5. So when the amplitude of a signal is at 0.707 of its maximum value, it is at half-power.
The cutoff frequency of a filter is sometimes called its half-power point.
Transition Band
The area between where a filter "turns the corner" and where it "hits the bottom" is called the
transition band. The steepness of the slope in the transition band is important in defining the
sound of a particular filter. If the slope is very steep, the filter is said to be "sharp";
conversely, if the slope is more gradual, the filter is "soft" or "gentle."
Things really get interesting when you start combining low-pass and high-pass filters to form
band-pass and band-reject filters. Band-pass and band-reject filters also have transition bands
and slopes, but they have two of them: one on each side. The area in the middle, where
frequencies are either passed or stopped, is called the passband or the stopband. The
frequency in the middle of the band is called the center frequency, and the width of the band
is called the filter’s bandwidth.
You can plainly see that filters can get pretty complicated, even these simple ones. By varying
all these parameters (cutoff frequencies, slopes, bandwidths, etc.), we can create an enormous
variety of subtractive synthetic timbres.
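As a small illustration of subtractive synthesis (our sketch, assuming NumPy and a recent SciPy), white noise can be carved with a band-pass and a band-reject filter; the center frequency, bandwidth, and filter order here are arbitrary choices.

import numpy as np
from scipy import signal

sr = 44100
noise = np.random.uniform(-1.0, 1.0, sr * 2)   # two seconds of white noise

# Band-pass: keep roughly 900-1100 Hz (center 1000 Hz, bandwidth 200 Hz)
b, a = signal.butter(4, [900, 1100], btype='bandpass', fs=sr)
band_passed = signal.lfilter(b, a, noise)

# Band-reject (notch): remove the same band
b, a = signal.butter(4, [900, 1100], btype='bandstop', fs=sr)
band_rejected = signal.lfilter(b, a, noise)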
A Little More Technical: IIR and FIR Filters
Filters are often talked about as being one of two types: finite impulse response (FIR) and
infinite impulse response (IIR). This sounds complicated (and can be!), so we’ll just try to
give a simple explanation as to the general idea of these kinds of filters.
Finite impulse response filters are those in which delays are used along with some sort of
averaging. Delays mean that the sound that comes out at a given time uses some of the
previous samples. They’ve been delayed before they get used.
We’ve talked about these filters in earlier chapters. What comes out of an FIR is never more than what goes in (in terms of amplitude). Sounds reasonable, right? FIRs tend to be simpler,
easier to use, and easier to design than IIRs, and they are very handy for a lot of simple
situations. An averaging low-pass filter, in which some number of samples are averaged and
output, is a good example of an FIR.
Infinite impulse response filters are a little more complicated, because they have an added
feature: feedback. You’ve all probably seen how a microphone and speaker can have
feedback: by placing the microphone in front of a speaker, you amplify what comes out and
then stick it back into the system, which is amplifying what comes in, creating a sort of
infinite amplification loop. Ouch! (If you’re Jimi Hendrix, you can control this and make
great music out of it.)
Well, IIRs are similar. Because the feedback path of these filters consists of some number of
delays and averages, they are not always what are called unity gain transforms. They can
actually output a higher signal than that which is fed to them. But at the same time, they can
be many times more complex and subtler than FIRs. Again, think of electric guitar
feedback—IIRs are harder to control but are also very interesting.
Figure 4.9 FIR and IIR filters.
!"# $#
%# &'()* +,-./012 3/4561/ 785. 49:/1 96. 61/1 5 +,;,./ ;6:3/0 9+ 15:<-/12
5;= 5 15:<-/ 9;-> 851 5 +,;,./ /++/4.?
(+ 7/ =/-5>2 5@/05A/2 5;= .8/; +//= .8/ 96.<6. 9+ .85. <094/11 354B ,;.9 .8/ 1,A;5-2 7/
40/5./ 785. 50/ 45--/= !"# $#
%# &(()* +,-./01? C8/ +//=354B <094/11 54.65-->
5--971 .8/ 96.<6. .9 3/ :648 A0/5./0 .85; .8/ ,;<6.? C8/1/ +,-./01 45;2 51 7/ -,B/ .9 15>2
D3-97 6<?D
These are typical block diagrams for FIR and IIR filters. Note how in the IIR diagram the output of the filter’s delay is summed back into the input, causing the infinite response characteristic. That’s the main difference between the two filters.
Thanks to Fernando Pablo Lopez-Lezcano for these graphics.
Designing filters is a difficult but key activity in the field of digital signal processing, a rich
area of study that is well beyond the range of this book. It is interesting to point out that,
surprisingly, even though filters change the frequency content of a signal, a lot of the
mathematical work done in filter design is done in the time domain, not in the frequency
domain. By using things like sample averaging, delays, and feedback, one can create an
extraordinarily rich variety of digital filters.
For example, the following is a simple equation for a low-pass filter. This equation just
averages the last two samples of a signal (where x(n) is the current sample) to produce a new
sample. This equation is said to have a one-sample delay. You can see easily that quickly
changing (that is, high-frequency) time domain values will be "smoothed" (removed) by this
equation.
y(n) = (x(n) + x(n - 1))/2
In fact, although it may look simple, this kind of filter design can be quite difficult (although
extremely important). How do you know which frequencies you’re removing? It’s not
intuitive, unless you’re well schooled in digital signal processing and filter theory, have some
background in mathematics, and know how to move from the time domain (what you have) to
the frequency domain (what you want) by averaging, delaying, and so on.
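To make the FIR/IIR distinction concrete, here is a small Python sketch (ours, not the book's) of the two-point averaging FIR filter above, next to a one-pole IIR filter that feeds a fraction of its own output back in; the 0.9 feedback coefficient is an arbitrary choice.

import numpy as np

def fir_average(x):
    # y(n) = (x(n) + x(n - 1)) / 2  -- the two-point averaging low-pass above
    y = np.zeros_like(x)
    for n in range(len(x)):
        prev = x[n - 1] if n > 0 else 0.0
        y[n] = 0.5 * (x[n] + prev)
    return y

def iir_one_pole(x, feedback=0.9):
    # y(n) = (1 - feedback) * x(n) + feedback * y(n - 1)  -- the output is fed back in
    y = np.zeros_like(x)
    for n in range(len(x)):
        prev_out = y[n - 1] if n > 0 else 0.0
        y[n] = (1.0 - feedback) * x[n] + feedback * prev_out
    return y

noise = np.random.uniform(-1.0, 1.0, 44100)
smoothed_a_little = fir_average(noise)
smoothed_a_lot = iir_one_pole(noise)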
Chapter 4: The Synthesis of Sound by Computer
Section 4.4: Formant Synthesis
Formant synthesis is a special but important case of subtractive synthesis. Part of what makes
the timbre of a voice or instrument consistent over a wide range of frequencies is the presence
of fixed frequency peaks, called formants.
These peaks stay in the same frequency range, independent of the actual (fundamental) pitch
being produced by the voice or instrument. While there are many other factors that go into
synthesizing a realistic timbre, the use of formants is one way to get reasonably accurate
results.
Figure 4.10 A trumpet plays two different notes, a perfect fourth apart, but the formants (fixed resonances) stay
in the same places.
Resonant Structure
The location of formants is based on the resonant physical structure of the sound-producing
medium. For example, the body of a certain violin exhibits a particular set of formants,
depending upon how it is constructed. Since most violins share a similar shape and internal
construction, they share a similar set of formants and thus sound alike. In the human voice,
the vocal tract and nasal cavity act as the resonating body. By manipulating the shape and size
of that resonant space (i.e., by changing the shape of the mouth and throat), we change the
location of the formants in our voice. We recognize different vowel sounds mainly by their
formant placement. Knowing that, we can generate some fairly convincing synthetic vowels
by manipulating formants in a synthesized set of tones. A number of books list actual formant
frequency values for various voices and vowels (including Charles Dodge’s highly
recommended standard text, Computer Music—Dodge is a great pioneer in computer music
voice synthesis).
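As a very rough sketch of the idea (ours, not the book's; the formant center frequencies and bandwidths below are illustrative values only, loosely in the range of an "ah"-like vowel), a buzzy source can be passed through a few band-pass filters to impose formants:

import numpy as np
from scipy import signal

sr = 44100
t = np.arange(sr) / sr
source = signal.sawtooth(2 * np.pi * 110 * t)   # a buzzy, harmonically rich source

# (center frequency in Hz, bandwidth in Hz) -- illustrative values only
formants = [(700, 110), (1220, 120), (2600, 160)]

vowel = np.zeros_like(source)
for center, bw in formants:
    b, a = signal.butter(2, [center - bw / 2, center + bw / 2],
                         btype='bandpass', fs=sr)
    vowel += signal.lfilter(b, a, source)   # sum the formant regions

vowel /= np.max(np.abs(vowel))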
Composing with Synthetic Speech
Generating really good and convincing synthetic speech and singing voices is more complex
than simply moving around a set of formants—we haven’t mentioned anything about
generating consonants, for example. And no speech synthesis system relies purely on formant
synthesis. But, as these examples illustrate, even very basic formant manipulation can
generate sounds that are undoubtedly "vocal" in nature.
Figure 4.11 A spectral picture of the voice, showing formants. Graphic courtesy of the
alt.usage.english newsgroup.
"Notjustmoreidlechatter" was made on a DEC MicroVaxII computer
in 1988. All the "chatter" pieces (there are three in the set) use
techniques known as linear predictive coding, granular synthesis,
and a variety of stochastic mixing techniques.
Soundfile 4.12: "Notjustmoreidlechatter," by Paul Lansky.
Paul Lansky is a well-known composer and researcher of computer music who teaches at Princeton University. He has been a leading pioneer in software design, voice synthesis, and compositional techniques.
Used with permission from Paul Lansky.
Soundfile 4.13: "idlechatterjunior," by Paul Lansky, from 1999.
Paul Lansky writes: "Over ten years ago I wrote three 'chatter' pieces, and then decided to quit while I was ahead. The urge to strike again recently overtook me, however, and after my lawyer assured me that the statute of limitations had run out on this particular offense, I once again leapt into the fray. My hope is that the seasoning provided by my labors in the intervening years results in something new and different. If not, then look out for 'Idle Chatter III'... ."
Used with permission from Paul Lansky.
Composer Sarah Myers used an interview with her friend Gili Rei as
the source material for her composition. "Trajectory of Her Voice" is
a ten-part canon that explores the musical qualities of speech. As the
verbal content becomes progressively less comprehensible as
language, the focus turns instead to the sonorities inherent in her
voice.
Soundfile 4.14: "Trajectory of Her Voice," a composition by Sarah Myers.
This piece was composed in 1998 using the Cmix computer music language (Cmix was written by Paul Lansky).
Soundfile 4.15: Synthetic speech example, the "Fred" voice from the Macintosh computer
Soundfile 4.16: Carter Sholz's 1-minute piece "Mannagram"
Soundfile 4.17: The trump
Over the years, computer voice simulations have become better and
better. They still sound a bit robotic, but advances in voice synthesis
and acoustic technology make voices more and more realistic. Bell
Telephone Laboratories has been one of the leading research
facilities for this work, which is expected to become extremely
important in the near future.
In this piece, based on a reading by Australian sound-poet Chris
Mann, the composer tries to separate vowels and consonants, moving
them each to a different speaker. This was inspired by an idea of
Mann's, who always wanted to do a "headphone piece" in which he
spoke and the consonants appeared in one ear, the vowels in another.
One of the most interesting examples of formant usage is in playing
the trump, sometimes called the jaw-harp. Here, a metal tine is
plucked and the shape of the vocal cavity is used to create different
pitches.
Section 4.5: Amplitude Modulation
Introduction to Modulation
We might be more familiar with the term modulation in relationship to radio. Radio
transmission utilizes amplitude modulation (AM) and frequency modulation (FM), but we too
can create complex waveforms by using these techniques.
Modulated signals are those that are changed regularly in time, usually by other signals. They
can get pretty complicated. For example, modulated signals can modulate other signals! To
create a modulated signal, we begin with two or more oscillators (or anything that produces a
signal) and combine the output signals of the oscillators in such a way as to modulate the
amplitude, frequency, and/or phase of one of the oscillators.
In the amplitude modulation equation, y(t) = Ac(1 + ka·m(t))·sin(2π·fc·t), a DC offset (a signal that is essentially a straight line) is added to the modulating signal m(t), and the sum is multiplied by a sinusoid with frequency fc. Ac is the carrier amplitude, and ka is the modulation index.
Applet 4.8 shows what happens, in the case of frequency modulation, if the modulating signal
is low frequency. In that case, we’ll hear something like vibrato (a regular change in
frequency, or perceived pitch). We can also modulate amplitude in this way (tremolo), or even
formant frequencies if we want. Low-frequency modulations (that is, modulators that
themselves are low-frequency signals) can produce interesting sonic effects.
Applet 4.8: LFO modulation
But for making really complex sounds, we are generally interested in high-frequency
modulation. We take two audio frequency signals and multiply them together. More precisely,
we start with a carrier oscillator and attach a modulating oscillator to modify and distort the
signal that the carrier oscillator puts out. The output of the carrier oscillator can include its
original signal and the sidebands or added spectra that are generated by the modulation
process.
Amplitude Modulation
Figure 4.13 shows how we might construct a computer music instrument to do amplitude
modulation. The two half-ovals are often called unit generators, and they refer to some
software device like an oscillator, a mixer, a filter, or an envelope generator that has inputs
and outputs and makes and transforms digital signals.
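Here is a minimal Python sketch (ours, not the book's) of the two-operator amplitude modulation just described: a modulating oscillator plus a DC offset scales the amplitude of a carrier oscillator. The frequencies and modulation index are arbitrary choices.

import numpy as np

sr = 44100
t = np.arange(sr * 2) / sr     # two seconds

fc, fm = 500.0, 110.0          # carrier and modulator frequencies (Hz)
Ac, ka = 0.8, 0.5              # carrier amplitude and modulation index

modulator = np.sin(2 * np.pi * fm * t)              # m(t)
carrier = np.sin(2 * np.pi * fc * t)
am_signal = Ac * (1.0 + ka * modulator) * carrier   # DC offset + m(t), times the carrier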
Figure 4.13 Amplitude modulation, two operator case.
Soundfile 5.1 Jon Appleton: "Chef d'Oeuvre"
Soundfile 4.18: Low-pass moving filter (modulated by sine). A low-pass moving filter that uses a sine wave to control a sweep between 0 Hz and 500 Hz.
Soundfile 4.19: High-pass moving filter (modulated by sine). A high-pass moving filter that uses a sine wave to control a sweep between 5,000 Hz and 15,000 Hz.
Soundfile 4.20: Low-pass moving filter (modulated by sawtooth). A low-pass moving filter that uses a sawtooth wave to control a sweep between 0 Hz and 500 Hz.
Soundfile 4.21: High-pass moving filter (modulated by sawtooth). A high-pass moving filter that uses a sawtooth wave to control a sweep between 5,000 Hz and 15,000 Hz.
Figure 4.14 James Tenney’s "Phases," one of the earliest and still most interesting pieces of computer-assisted composition.
The pictures above are his "notes" for the piece, which constitute a kind of score.
Tenney made use of some simple modulation trajectories to control timbral parameters over time (like amplitude modulation
rate, spectral upper limit, note-duration, and so on). By simply coding these functions in the computer and linking the output
to the synthesis engine, Tenney was able to realize a number of highly original works in which he controlled the overall,
large-scale process, but the micro-structure was largely determined by the computer making use of his curves. "Phases" was
released on an Artifact CD of James Tenney’s computer music.
Section 4.6: Waveshaping
Waveshaping is a popular synthesis-and-transformation technique that turns simple sounds
into complex sounds. You can take a pure tone, like a sine wave, and transform it into a
harmonically rich sound by changing its shape. A guitar fuzz box is an example of a
waveshaper. The unamplified electric guitar sound is fairly close to a sine wave. But the fuzz
box amplifies it and gives it sharp corners. We have seen in earlier chapters that a signal with
sharp corners has lots of high harmonics. Sounds that have passed through a waveshaper
generally have a lot more energy in their higher-frequency harmonics, which gives them a
"richer" sound.
Simple Waveshaping Formulae
A waveshaper can be described as a function that takes the original signal x as input and
produces a new output signal y. This function is called the transfer function.
y = f(x)
This is simple, right? In fact, it’s much simpler than any other function we’ve seen so far.
That’s because waveshaping, in its most general form, is just any old function. But there’s a
lot more to it than that. In order to change the shape of the function (and not just make it
bigger or smaller), the function must be nonlinear, which means it has exponents greater than
1, or transcendental (like sines, cosines, exponentials, logarithms, etc.). You can use almost
any function you want as a waveshaper. But the most useful ones output zero when the input
is zero (that’s because you usually don’t want any output when there is no input).
0 = f(0)
Let’s look at a very simple waveshaping function:
y = f(x) = x * x * x = x^3
What would it look like to pass a simple sine wave that varied from -1.0 to 1.0 through this
waveshaper? If our input x is sin(wt), then:
y = x^3 = sin^3(wt)
If we plot both functions (sin(x) and the output signal), we can see that the original input
signal is very round, but the output signal has a narrower peak. This will give the output a
richer sound.
Figure 4.15 Waveshaping by x^3.
This example gives some idea of the power of this technique. A simple function (sine wave)
gets immediately transformed, using simple math and even simpler computation, into
something new.
One problem with the y = x^3 waveshaper is that for x-values outside the range –1.0 to +1.0, y
can get very large. Because computer music systems (especially sound cards) generally only
output sounds between –1.0 and +1.0, it is handy to have a waveshaper that takes any input
signal and outputs a signal in the range –1.0 to +1.0. Consider this function:
y = x / (1 + |x|)
When x is zero, y is zero. Plug in a few numbers for x, like 0.5, 7.0, 1,000.0, –7.0, and see
what you get. As x gets larger (approaches positive infinity), y approaches +1.0 but never
reaches it. As x approaches negative infinity, y approaches –1.0 but never reaches it. This kind
of curve is sometimes called soft clipping because it does not have any hard edges. It can give
a nice "tubelike" distortion sound to a guitar. So this function has some nice properties, but
unfortunately it requires a divide, which takes a lot more CPU power than a multiply. On
older or smaller computers, this can eat up a lot of CPU time (though it’s not much of a
problem nowadays).
Here is another function that is a little easier to calculate. It is designed for input signals
between –1.0 and +1.0.
y = 1.5x - 0.5x^3
Applet 4.9: Changing the shape of a waveform. This applet plays a sine wave through various waveshaping formulae.
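For concreteness, here is a small Python sketch (ours, not the book's) that passes a sine wave through the three waveshapers discussed above:

import numpy as np

sr = 44100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220 * t)         # input: a pure 220 Hz sine wave

cubed = x ** 3                          # y = x^3
soft_clipped = x / (1.0 + np.abs(x))    # y = x / (1 + |x|), soft clipping
polynomial = 1.5 * x - 0.5 * x ** 3     # y = 1.5x - 0.5x^3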
Chebyshev Polynomials
A transfer function is often expressed as a polynomial, which looks like:
y = f(x) = d0 + d1·x + d2·x^2 + d3·x^3 + ... + dN·x^N
The highest exponent N of this polynomial is called the "order" of the polynomial. In Applet
4.9, we saw that the x^2 formula resulted in a doubling of the pitch. So a polynomial of order 2
produced strong second harmonics in the output. It turns out that a polynomial of order N will
only generate frequencies up to the Nth harmonic of the input sine wave. This is a useful
property when we want to limit the high harmonics to a value below the Nyquist rate to avoid
aliasing.
Back in the 19th century, Pafnuty Chebyshev discovered a set of polynomials known as the
Chebyshev polynomials. Mathematicians like them for lots of different reasons, but computer
musicians like them because they can be used to make weird noises, er, we mean
music. These Chebyshev polynomials have the property that if you input a sine wave of
amplitude 1.0, you get out a sine wave whose frequency is N times the frequency of the input
wave. So, they are like frequency multipliers. If the amplitude of the input sine wave is less
than 1.0, then you get a complex mix of harmonics. Generally, the lower the amplitude of the
input, the lower the harmonic content. This gives musicians a single number, sometimes
called the distortion index, that they can tweak to change the harmonic content of a sound. If
you want a sound with a particular mixture of harmonics, then you can add together several
Chebyshev polynomials multiplied by the amount of the harmonic that you desire. Is this
cool, or what?
Here are the first few Chebyshev polynomials:
T0(x) = 1
T1(x) = x
T2(x) = 2x^2 – 1
T3(x) = 4x^3 – 3x
T4(x) = 8x^4 – 8x^2 + 1
You can generate more Chebyshev polynomials using this recursive formula (a recursive
formula is one that takes in its own output as the next input):
Tk+1(x) = 2xTk(x) – Tk–1(x)
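As an illustration (our sketch, not the book's code), the recursion above can be used to build a waveshaper that targets a particular mix of harmonics; the input amplitude and the harmonic weights below are arbitrary choices.

import numpy as np

def chebyshev(k, x):
    # Evaluate T_k(x) using the recursive formula above
    t_prev, t_curr = np.ones_like(x), x      # T0 and T1
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_curr = t_curr, 2 * x * t_curr - t_prev
    return t_curr

sr = 44100
t = np.arange(sr) / sr
x = 0.9 * np.sin(2 * np.pi * 220 * t)        # input sine, amplitude below 1.0

# desired harmonic weights: mostly fundamental, some 2nd, 3rd, and 5th harmonics
weights = {1: 1.0, 2: 0.4, 3: 0.25, 5: 0.1}
y = sum(w * chebyshev(k, x) for k, w in weights.items())
y /= np.max(np.abs(y))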
Table-Based Waveshapers
Doing all these calculations in realtime at audio rates can be a lot of work, even for a
computer. So we generally precalculate these polynomials and put the results in a table. Then
when we are synthesizing sound, we just take the value of the input sine wave and use it to
look up the answer in the table. If you did this during an exam it would be called cheating, but
in the world of computer programming it is called optimization.
One big advantage of using a table is that regardless of how complex the original equations
were, it always takes the same amount of time to look up the answer. You can even draw a
function by hand without using an equation and use that hand-drawn function as your transfer
function.
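A table-based waveshaper might look like this in Python (a sketch under our own assumptions about table size and transfer function; it uses simple truncating lookup rather than interpolation):

import numpy as np

TABLE_SIZE = 4096

# Precompute the transfer function once, over inputs in the range -1..1
inputs = np.linspace(-1.0, 1.0, TABLE_SIZE)
table = 1.5 * inputs - 0.5 * inputs ** 3     # any transfer function (or a hand-drawn one)

def waveshape(x, table):
    # Map each input sample in -1..1 to a table index and look up the output
    idx = ((x + 1.0) * 0.5 * (len(table) - 1)).astype(int)
    return table[idx]

sr = 44100
t = np.arange(sr) / sr
sine = np.sin(2 * np.pi * 220 * t)
shaped = waveshape(sine, table)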
Applet 4.10: Waveshaping. This applet plays sine waves through polynomials and hand-drawn waves.
Don Buchla’s Synthesizers
Don Buchla, a pioneering designer of synthesizers, created a series of instruments based on
digital waveshaping. One such instrument, known as the Touche, was released in 1978. It had
16 digital oscillators that could be combined into eight voices. The Touche had extensive
programming capabilities and had the ability to morph one sound into another. Composer
David Rosenboom worked with Buchla and developed much of the software for the Touche.
Rosenboom produced an album in 1981 called Future Travel using primarily the Touche and
the Buchla 300 Series Music System.
Now that you know a lot about waveshaping, Chebyshev polynomials, and transfer
functions, we’ll show you what happens when the information gets into the wrong
hands!
Soundfile 4.22a and Soundfile 4.22b: Experimental waveshaping, "Toyoji Patch"
These soundfiles are two recordings done in the mid-1980s, at the Mills College Center for Contemporary Music, by one of our authors (Larry Polansky). They use a highly unusual live, interactive computer music waveshaping system.
"Toyoji Patch" was a piece/software/installation written originally for trombonist and composer Toyoji Tomita to play with. It was a real-time feedback system, which fed live audio into the transfer function of a set of six digital waveshaping oscillators. The hardware was an old S-100 68000-based computer system running the original prototype of a computer music language called HMSL, including a GUI and a set of instrument drivers and utilities for controlling synthesizer designer Don Buchla's 400 series digital waveshaping oscillators.
The system could be used with or without a live input, since it made use of an external microphone to feed back its output to itself. The audio time domain signal used a transfer function to modify itself, but you could also alter Chebyshev coefficients in realtime, redraw the waveform, and control a number of other parameters.
Both of these sound excerpts feature the amazing contemporary flutist Anne
LaBerge. In the first version, LaBerge is playing but is not recorded. She is in another
room, and the output of the system is fed back into itself through a microphone. By
playing, she could drastically affect the sound (since her flute went immediately into
the transfer function). However, although she’s causing the changes to occur, we
don’t actually hear her flute. In the second version, LaBerge is in front of the same
microphone that’s used for feedback and recording.
In both versions, Polansky was controlling the mix and the feedback gain as well as
playing with the computer.
Section 4.7: FM Synthesis
One goal of synthesis design is to find efficient algorithms, which don’t take a lot of
computation to generate rich sound palettes. Frequency modulation synthesis (FM) has
traditionally been one of the most popular techniques in this regard, and it provides a good
way to communicate some basic concepts about sound synthesis. You’ve probably heard of
frequency modulation—it is the technique used in FM radio broadcasting. We took a look at
amplitude modulation (AM) in Section 4.5.
Applet 4.11 Frequency modulation
History of FM Synthesis
FM techniques have been around since the early 20th century, and by the 1930s FM theory for
radio broadcasting was well documented and understood. It was not until the 1970s, though,
that a certain type of FM was thoroughly researched as a musical synthesis tool. In the early
1970s, John Chowning, a composer and researcher at Stanford University, developed some
important new techniques for music synthesis using FM.
Chowning’s research paid off. In the early 1980s, the Yamaha Corporation introduced their
extremely popular DX line of FM synthesizers, based on Chowning’s work. The DX-7
keyboard synthesizer was the top of their line, and it quickly became the digital synthesizer
for the 1980s, making its mark on both computer music and synthesizer-based pop and rock.
It’s the most popular synthesizer in history.
FM turned out to be good for creating a wide variety of sounds, although it is not as flexible
as some other types of synthesis. Why has FM been so useful? Well, it’s simple, easy to
understand, and allows users to tweak just a few "knobs" to get a wide range of sonic
variation. Let’s take a look at how FM works and listen to some examples.
Figure 4.16 The Yamaha DX-7 synthesizer was an extraordinarily popular instrument in the
1980s and was partly responsible for making electronic and computer music a major industry.
Thanks to Joseph Rivers and The Audio Playground Synthesizer Museum for this photo.
Simple FM
In its simplest form, FM involves two sine waves. One is called the modulating wave, the
other the carrier wave. The modulating wave changes the frequency of the carrier wave. It
can be easiest to visualize, understand, and hear when the modulator is low frequency.
Figure 4.17 Frequency modulation, two operator case.
Vibrato
Soundfile 4.23: Vibrato sound. Carrier: 500 Hz; modulator frequency: 1 Hz; modulator index: 100.
FM can create vibrato when the modulating frequency is less than 30 Hz. Okay, so it’s still
not that exciting—that’s just because everything is moving slowly. We’ve created a very slow,
weird vibrato! That’s because we were doing low-frequency modulation. In Soundfile 4.23,
the frequency (fc) of the carrier wave is 500 Hz and the modulating frequency (fm) is 1 Hz. 1
Hz means one complete cycle each second, so you should hear the frequency of the carrier
rise, fall, and return to its original pitch once each second.
In the simple FM equation, y(t) = Ac·sin(2π·fc·t + m(t)), fc is the carrier frequency, m(t) is the modulating signal, and Ac is the carrier amplitude.
Note that the frequency of the modulating wave is the rate of change in the carrier’s
frequency. Although you can’t tell from the above equation, it also turns out that the
amplitude of the modulator is the degree of change of the carrier’s frequency, and the
waveform of the modulator is the shape of change of the carrier’s frequency.
In Figure 4.17 showing the unit generator diagram for frequency modulation (remember, we
showed you one of these in Section 4.5), note that each of the sine wave oscillators has two
inputs: one for frequency and one for amplitude. For our modulating oscillator we are using 1
Hz as the frequency, which becomes fm to the carrier (that is, the frequency of the carrier is
changed 1 time per second). The modulator’s amplitude is 100, which will determine how
much the frequency of the carrier gets changed (at a rate of 1 time per second).
The amplitude of the modulator is often called the modulation depth, since this value
determines how high and low the frequency of the carrier wave will go. In the sound example,
the fc ranges from 400 Hz to 600 Hz (500 Hz – 100 Hz to 500 Hz + 100 Hz). If we change the
depth to 500 Hz, then our fc would range from 0 Hz to 1,000 Hz. Humans can only hear
sounds down to about 30 Hz, so there should be a moment of "silence" each time the
frequency dips below that point.
Soundfile 4.24: Vibrato sound. Carrier: 500 Hz; modulator frequency: 1 Hz; modulator index: 500.
Generating Spectra with FM
If we raise the frequency of the modulating oscillator above 30 Hz, we can start to hear more
complex sounds. We can make an analogy to being able to see the spokes of a bike wheel if it
rotates slowly, but once the wheel starts to rotate faster a visual blur starts to occur.
So it is with FM: when the modulating frequency starts to speed up, the sound becomes more
complex. The tones you heard in Soundfile 4.24 sliding around are called sidebands and are
extra frequencies located on either side of the carrier frequency. Sidebands are the secret to
FM synthesis. The frequencies of the sidebands (called, as a group, the spectra) depend on the ratio of fc to fm, and their strengths depend on the modulation index (the ratio of the frequency deviation to fm). John Chowning, in a famous article, showed how to predict where those sidebands would be, and how strong each would be, using a mathematical tool called Bessel functions. By controlling the fc-to-fm ratio and the index, and using Bessel functions to determine the spectra, you can create a wide variety of sounds, from noisy jet engines to a sweet-sounding Fender Rhodes.
Figure 4.18 FM sidebands.
Soundfiles 4.25 through 4.28 show some simple two-operator FM sounds with modulating
frequencies above 30 Hz.
Soundfile 4.25: Bell-like sound. Carrier: 100 Hz; modulator frequency: 280 Hz; FM index: 6.0 -> 0.
Soundfile 4.26: Bass clarinet-type sound. Carrier: 250 Hz; modulator frequency: 175 Hz; FM index: 1.5 -> 0.
Soundfile 4.27: Trumpet-like sound. Carrier: 700 Hz; modulator frequency: 700 Hz; FM index: 5.0 -> 0.
Soundfile 4.28: FM sound. Carrier: 500 Hz; modulator frequency: 500 -> 5,000 Hz; FM index: 10.
An Honest-to-Goodness Computer Music Example of FM
(Using Csound)
One of the most common computer languages for synthesis and sound processing is called
Csound, developed by Barry Vercoe at MIT. Csound is popular because it is powerful, easy to
use, public domain, and runs on a wide variety of platforms. It has become a kind of lingua
franca for computer music. Csound divides the world of sound into orchestras, consisting of
instruments that are essentially unit-generator designs for sounds, and scores (or note lists)
that tell how long, loud, and so on a sound should be played from your orchestra.
The code in Figure 4.19 is an example of a simple FM instrument in Csound. You don’t need
to know what most of it means (though by now a lot of it is probably self-explanatory, like sr,
which is just the sampling rate). Take a close look at the following line:
asig foscil p4*env, cpspch(p5), 1,3,2,1 ;simple FM
Commas separate the various parts of this command.
• asig is simply a name for this line of code, so we can use it later (out asig).
• foscil is a predefined Csound unit generator, which is just a pair of oscillators that implement simple frequency modulation (in a compact way).
• p4*env states that the amplitude of the oscillator will be multiplied by an envelope (which, as seen in Figure 4.19, is defined with the Csound linseg function).
• cpspch(p5) looks to the fifth value in the score (8.00, which will be a middle C) to find what the fundamental of the sound will be. The next three values determine the carrier and modulating frequencies of the two oscillators, as well as the modulation index (we won't go into what they mean, but these values will give us a nice, sweet sound). The final value (1) points to a sine wave table (look at the first line of the score).
Yes, we know, you might be completely confused, but we thought you’d like to see a
common language that actually uses some of the concepts we’ve been discussing!
Computer Code for an FM Instrument and a Score to Play
the Instrument in Csound
Figure 4.19 This is a Csound Orchestra and Score blueprint for a simple FM synthesis
instrument. Any number of basic unit generators, such as oscillators, adders, multipliers,
filters, and so on, can be combined to create complex instruments.
Some music languages, like Csound, make extensive use of the unit generator model.
Generally, unit generators are used to create instruments (the orchestra), and then a set of
instructions (a score) is created that tells the instruments what to do.
Now that you understand the basics of FM synthesis, go back to the beginning of this section
and play with Applet 4.11. FM is kind of interesting theoretically, but it's far more fun and
educational to just try it out.
Section 4.8: Granular Synthesis
When we discussed additive synthesis, we learned that complex sounds can be created by
adding together a number of simpler ones, usually sets of sine waves. Granular synthesis uses
a similar idea, except that instead of a set of sine waves whose frequencies and amplitudes
change over time, we use many thousands of very short (usually less than 100 milliseconds)
overlapping sound bursts or grains. The waveforms of these grains are often sinusoidal,
although any waveform can be used. (One alternative to sinusoidal waveforms is to use grains
of sampled sounds, either pre-recorded or captured live.) By manipulating the temporal
placement of large numbers of grains and their frequencies, amplitude envelopes, and
waveshapes, very complex and time-variant sounds can be created.
Applet 4.12: Granular synthesis. This applet lets you granulate a signal and alter some of the typical parameters of granular synthesis, a popular synthesis technique.
A grain is created by taking a waveform, in this case a sine wave, and multiplying it by an
amplitude envelope.
How would a different amplitude envelope, say a square one, affect the shape of the grain? What would it do
to the sound of the grain?
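A single grain, and a simple cloud of grains, might be sketched in Python like this (our illustration; the grain length, grain count, frequency range, and Hann-window envelope are arbitrary choices):

import numpy as np

sr = 44100

def make_grain(freq, dur=0.05):
    # One grain: a sine wave multiplied by a smooth (Hann) amplitude envelope
    n = int(sr * dur)
    t = np.arange(n) / sr
    return np.hanning(n) * np.sin(2 * np.pi * freq * t)

def grain_cloud(total_dur=3.0, num_grains=2000, lo=300.0, hi=3000.0):
    # Scatter many short grains at random times and frequencies, overlap-adding them
    out = np.zeros(int(sr * total_dur))
    for _ in range(num_grains):
        grain = make_grain(np.random.uniform(lo, hi))
        start = np.random.randint(0, len(out) - len(grain))
        out[start:start + len(grain)] += grain
    return out / np.max(np.abs(out))

cloud = grain_cloud()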
Clouds of Sound
Granular synthesis is often used to create what can be thought of as "sound clouds"—shifting
regions of sound energy that seem to move around a sonic space. A number of composers,
like Iannis Xenakis and Barry Truax, thought of granular synthesis as a way of shaping large
masses of sound by using granulation techniques. These two composers are both considered
pioneers of this technique (Truax wrote some of the first special-purpose software for granular
synthesis). Sometimes, cloud terminology is even used to talk about ways of arranging grains
into different sorts of configurations.
Visualization of a granular synthesis "score." Each dot represents a grain at a particular frequency
and moment in time. An image such as this one can give us a good idea of how this score might sound, even
though there is some important information left out (such as the grain amplitudes, waveforms, amplitude
envelopes, and so on).
What sorts of sounds does this image imply? If you had three vocal performers, one for each "cloud," how
would you go about performing this piece? Try it!
This is an excerpt of a composition by computer music composer and
researcher Mara Helmuth titled "Implements of Actuation." The sounds
of an electric mbira and bicycle wheels are transformed via granular
synthesis.
Soundfile 4.29: "Implements of Actuation"
There are a great many commercial and public domain applications to do granular synthesis because it is relatively easy to implement and the sounds can be very interesting and attractive.
Section 4.9: Physical Modeling
We’ve already covered a bit of material on physical modeling without even telling you—the
ideas behind formant synthesis are directly derived from our knowledge of the physical
construction and behavior of certain instruments. Like all of the synthesis methods we’ve
covered, physical modeling is not one specific technique, but rather a variety of related
techniques. Behind them all, however, is the basic idea that by understanding how
sound/vibration/air/string behaves in some physical system (like an instrument), we can
model that system in a computer and thus synthetically generate realistic sounds.
Karplus-Strong Algorithm
Let’s take a look at a really simple but very effective physical model of a plucked string,
called the Karplus-Strong algorithm (so named for its principal inventors, Kevin Karplus and
Alex Strong). One of the first musically useful physical models (dating from the early 1980s),
the Karplus-Strong algorithm has proven quite effective at generating a variety of plucked-string sounds (acoustic and electric guitars, banjos, and kotos) and even drumlike timbres.
Applet 4.13: Fun with the Karplus-Strong plucked string algorithm.
Here’s a simplified view of what happens when we pluck a string: at first the string is highly
energized and it vibrates like mad, creating a fairly complex (meaning rich in harmonics)
sound wave whose fundamental frequency is determined by the mass and tension of the
string. Gradually, thanks to friction between the air and the string, the string’s energy is
depleted and the wave becomes less complex, resulting in a "purer" tone with fewer
harmonics. After some amount of time all of the energy from the pluck is gone, and the string
stops vibrating.
If you have access to a stringed instrument, particularly one with some very low notes, give
one of the strings a good pluck and see if you can see and hear what’s happening per the
description above.
How a Computer Models a Plucked String with the
Karplus-Strong Algorithm
Now that we have a physical idea of what’s happening in a plucked string, how can we model
it with a computer? The Karplus-Strong algorithm does it like this: first we start with a buffer
full of random values—noise. (A buffer is just some computer memory (RAM) where we can
store a bunch of numbers.) The numbers in this buffer represent the initial energy that is
transferred to the string by the pluck. The Karplus-Strong algorithm works like this:
To generate a waveform, we start reading through the buffer and using the values in it as
sample values. If we were to just keep reading through the buffer over and over again, what
we’d get would be a complex, pitched waveform. It would be complex because we started out
with noise, but pitched because we would be repeating the same set of random numbers.
(Remember that any time we repeat a set of values, we end up with a pitched (periodic)
sound. The pitch we get is directly related to the size of the buffer (the number of numbers it
contains) we’re using, since each time through the buffer represents one complete cycle (or
period) of the signal.)
Now here’s the trick to the Karplus-Strong algorithm: each time we read a value from the
buffer, we average it with the last value we read. It is this averaged value that we use as our
output sample. We then take that averaged sample and feed it back into the buffer. That way,
over time, the buffer gets more and more averaged (this is a simple filter, like the averaging
filter described in Section 3.1). Let’s look at the effect of these two actions separately.
Averaging and Feedback
First, what happens when we average two values? Averaging acts as a low-pass filter on the
signal. Because we’re averaging the signal, the signal changes less with each sample, and by
limiting how quickly it can change we’re limiting the number of high frequencies it can
contain (since high frequencies have a high rate of change). So, averaging a signal effectively
gets rid of high frequencies, which according to our string description we need to do—once
the string is plucked, it should start losing harmonics over time.
The "over time" part is where feeding the averaged samples back into the buffer comes in. If
we were to just keep averaging the values from the buffer but never actually changing them
(that is, sticking the average back into the buffer), then we would still be stuck with a static
waveform. We would keep averaging the same set of random numbers, so we would keep
getting the same results.
Instead, each time we generate a new sample, we stick it back into the buffer. That way our
waveform evolves as we move through it. The effect of this low-pass filtering accumulates
over time, so that as the string "rings," more and more of the high frequencies are filtered out
of it. The filtered waveform is then fed back into the buffer, where it is filtered again the next
time through, and so on. After enough times through the process, the signal has been averaged
so many times that it reaches equilibrium—the waveform is a flat line, and the string has died out.
Figure 4.22 Applying the Karplus-Strong algorithm to a random waveform. After 60 passes
through the filter/feedback cycle, all that’s left of the wild random noise is a gently curving
wave.
The result is much like what we described in a plucked string: an initially complex, periodic
waveform that gradually becomes less complex over time and ultimately fades away.
Figure 4.23 Schematic view of a computer software implementation of the basic Karplus-Strong algorithm.
For each note, the switch is flipped and the computer memory buffer is filled with random
values (noise). To generate a sample, values are read from the buffer and averaged. The newly
calculated sample is both sent to the output stream and fed back into the buffer. When the end
of the buffer is reached, we simply wrap around and continue reading at the beginning. This
sort of setup is often called a circular buffer. After many iterations of this process, the
buffer’s contents will have been transformed from noise into a simple waveform.
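To make the process concrete, here is a minimal sketch of the circular-buffer version just described, written in Python. The function name, the default pitch and duration, and the simple 0.5/0.5 averaging weights are our own illustrative choices, not code from any particular implementation.

```python
import random

def karplus_strong(frequency=220.0, duration=2.0, sample_rate=44100):
    """A minimal Karplus-Strong pluck: noise buffer + averaging + feedback."""
    # The buffer length sets the pitch: one trip through the buffer is one
    # period, so the pitch is roughly sample_rate / buffer_length.
    buffer_length = int(sample_rate / frequency)
    buffer = [random.uniform(-1.0, 1.0) for _ in range(buffer_length)]  # the "pluck"

    output = []
    index = 0
    previous = 0.0
    for _ in range(int(duration * sample_rate)):
        current = buffer[index]
        sample = 0.5 * (current + previous)  # average the value with the last one read
        buffer[index] = sample               # feed the averaged sample back into the buffer
        output.append(sample)                # ...and use it as the output sample
        previous = current
        index = (index + 1) % buffer_length  # circular buffer: wrap around at the end
    return output

samples = karplus_strong()  # a decaying, plucked-string-like tone
```

Each pass through the buffer filters it a little more, so the tone starts bright and noisy and gradually decays toward silence, just as described above.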
If you think of the random noise as a lot of energy and the averaging of the buffer as a way of
lessening that energy, this digital explanation is not all that dissimilar from what happens in
the real, physical case.
Thanks to Matti Karjalainen for this graphic.
Physical models generally offer clear, "real world" controls that can be used to play an
instrument in different ways, and the Karplus-Strong algorithm is no exception: we can relate
the buffer size to pitch, the initial random numbers in the buffer to the energy given to the
string by plucking it, and the low-pass buffer feedback technique to the effect of air friction
on the vibrating string.
Many researchers and composers have worked on the plucked string sound as a kind of basic mode of physical modeling. One researcher, engineer Charlie Sullivan (who we're proud to say is one of our Dartmouth colleagues!), built a "super" guitar in software. Soundfile 4.30 is his heavy-metal version of "The Star Spangled Banner."

Soundfile 4.30: Super guitar
Understand the Building Blocks of Sound
Physical modeling has become one of the most powerful and important current techniques in
computer music sound synthesis. One of its most attractive features is that it uses a very small
number of easy-to-understand building blocks—delays, filters, feedback loops, and
commonsense notions of how instruments work—to model sounds. By offering the user just a
few intuitive knobs (with names like "brightness," "breathiness," "pick hardness," and so on),
we can use existing sound-producing mechanisms to create new, often fantastic, virtual
instruments.
Soundfile 4.31: An example of Perry Cook's SPASM

Part of the interface from Perry R. Cook's SPASM singing voice software. Users of SPASM can make Sheila, a computerized singer, sing. Perry Cook has been one of the primary investigators of musically useful physical models. He's released lots of great physical modeling software and source code.
Section 5.1: Introduction to the Transformation of Sound by
Computer
Although direct synthesis of sound is an important area of computer music, it can be equally
interesting (or even more so!) to take existing sounds (recorded or synthesized) and transform,
mutate, deconstruct—in general, mess around with them. There are as many basic sound
transformation techniques as there are synthesis techniques, and in this chapter we’ll describe
a few important ones. For the sake of convenience, we will separate them into time-domain
and frequency-domain techniques.
Tape Music/Cut and Paste
The most obvious way to transform a sound is to chop it up, edit it, turn it around, and collage
it. These kinds of procedures, which have been used in electronic music since the early days
of tape recorders, are done in the time domain. There are a number of sophisticated software
tools for manipulating sounds in this way, called digital sound editors. These editors allow for
the most minute, sample-by-sample changes of a sound. These techniques are used in almost
all fields of audio, from avant-garde music to Hollywood soundtracks, from radio station
advertising spots to rap.
Soundfile 5.1: Jon Appleton, "Chef d'Oeuvre"
This classic of analog electronic music was composed using only tape recorders, filters, and other precomputer devices and was based on a recording of an old television commercial.

Soundfile 5.2: "BS Variation 061801"
"BS Variation 061801" by Huk Don Phun was created using Phun's experimental "bleeding eyes" filtering technique. He used only free software and sounds he downloaded from the web.
Time-Domain Restructuring
Composers have experimented a lot with unusual time-domain restructuring of sound. By
chopping up waveforms into very small segments and radically reordering them, some noisy
and unusual effects can be created. As in collage visual art, the ironic and interesting
juxtaposition of very familiar materials can be used to create new works that are perhaps
greater than the sum of their constituent parts.
Soundfile 5.3: Jane Dowe's "Beck Deconstruction" piece
Playing with the perception of the listener, "Puzzels & Pagans" takes the first 2 minutes and 26 seconds of "Jackass" (from Odelay) and cuts it up into 2,500 pieces. These pieces are then reshuffled, taking into account probability functions (that change over the length of the track) that determine whether pieces remain in their original position or don't sound at all. This was realized using James McCartney's SuperCollider computer music programming language.
A unique, experimental, and rather strange program for deconstructing and reconstructing
sounds in the time domain is Argeïphontes Lyre, written by the enigmatic Akira Rabelais. It
provides a number of techniques for radical decomposition/recomposition of sounds—
techniques that often preclude the user from making specific decisions in favor of larger, more
probabilistic decisions.
Figure 5.1 Sample GUI from Argeïphontes Lyre, for sound deconstruction. This is time-domain mutation.

Soundfile 5.4: Argeïphontes Lyre
Sound example using Argeïphontes Lyre (random cut-up). The original piano piece "Aposiopesi" was written by Vincent Carté in 1999. This recording was made by Akira Rabelais at Wave Equation Studios in Hollywood, California. The piece was filtered with Argeïphontes Lyre 2.0b1.
Sampling
Sampling refers to taking small bits of sound, often recognizable ones, and recontextualizing
them via digital techniques. By digitally sampling, we can easily manipulate the pitch and
time characteristics of ordinary sounds and use them in any number of ways.
We’ve talked a lot about samples and sampling in the preceding chapters. In popular music
(especially electronic dance and beat-oriented music), the term "sampling" has acquired a
specialized meaning. In this context, a sample refers to a (usually) short excerpt from some
previously recorded source, such as a drum loop from a song or some dialog from a film
soundtrack, that is used as an element in a new work. A sampler is the hardware used to
record, store, manipulate, and play back samples. Originally, most samplers were stand-alone
pieces of gear. Today sampling tends to be integrated into a studio’s computer-based digital
audio system.
Sampling was pioneered by rap artists in the mid-1980s, and by the early 1990s it had become
a standard studio technique used in virtually all types of music. Issues of copyright violation
have plagued many artists working with sample-based music, notably John Oswald of
"Plunderphonics" fame and the band Negativland, although the motives of the "offended"
parties (generally large record companies) have tended to be more financial than artistic. One
result of this is that the phrase "Contains a sample from xxx, used by permission" has become
ubiquitous on CD cases and in liner notes.
Although the idea of using excerpts from various sources in a new work is not new (many composers, from Béla Bartók, who used Balkan folk songs, to Charles Ives, who used American popular and folk music, have done so), digital technology has radically changed the possibilities.
Figure 5.2 Herbert Brün said of his program SAWDUST: "The computer program which I
called SAWDUST allows me to work with the smallest parts of waveforms, to link them and
to mingle or merge them with one another. Once composed, the links and mixtures are
treated, by repetition, as periods, or by various degrees of continuous change, as passing
moments of orientation in a process of transformations." Also listen to Soundfile 5.5, Brün’s
1978 SAWDUST composition "Dustiny."
Soundfile 5.5: "Dustiny"
Drum Machines
Soundfile 5.6: Drum machine sounds
Drum machines and samplers are close cousins. Many drum machines are just specialized samplers—their samples just happen to be all percussion/drum-oriented. Other drum machines feature electronic or digitally synthesized drum sounds. As with sampling, drum machines started out as stand-alone pieces of hardware but now have largely been integrated into computer-based systems.
Applet 5.2: Drum machine

DAW Systems
Digital-audio workstations (DAWs) in the 1990s and 2000s have had the same effect on
digital sound creation as desktop publishing software had on the publishing industry in the
1980s: they’ve brought digital sound creation out of the highly specialized and expensive
environments in which it grew up and into people’s homes. A DAW usually consists of a
computer with some sort of sound card or other hardware for analog and digital input/output;
sound recording/editing/playback/multi-track software; and a mixer, amplifier, and other
sound equipment traditionally found in a home studio. Even the most modest of DAW
systems can provide from eight to sixteen tracks of CD-quality sound, making it possible for
many artists to self-produce and release their work for much less than it would traditionally
cost. This ability, in conjunction with similar marketing and publicizing possibilities opened
up by the spread of the Internet, has contributed to the explosion of tiny record labels and
independently released CDs we’ve seen recently.
In 1989, Digidesign came out with Sound Tools, the first professional tapeless recording system. With the popularization of personal computers, numerous software and hardware manufacturers have entered the market for computer-based digital audio. Starting in the mid-1980s, personal computer–based production systems have allowed individuals and institutions to make the highest-quality recordings, and DAW systems have also revolutionized the professional music, broadcast, multimedia, and film industries with audio systems that are more flexible, more accessible, and more creatively oriented than ever before.
Today DAWs come in all shapes and sizes and interface with most computer operating
systems, from Mac to Windows to LINUX. In addition, many DAW systems involve a
"breakout box"—a piece of hardware that usually provides four to eight channels of digital
audio I/O (inputs and outputs); and as we write this a new technology called FireWire looks
like a revolutionary way to hook up the breakout boxes to personal computers. Nowadays
many DAW systems also have "control surfaces" —pieces of hardware that look like a
standard mixer but are really controllers for parameters in the digital audio system.
Figure 5.3 The Mark of the Unicorn (MOTU) 828 breakout box, part of a DAW system with
FireWire interface.
Thanks to MOTU.com for their permission to use this graphic.
Figure 5.4 Motor Mix™ stands alone as the only DAW mixer control surface to offer eight-bank, switchable motorized faders. It also features eight rotary pots, a rotary encoder, 68 switches, and a 40-by-2 LCD.
Section 5.2: Reverb
One of the most important and widely used techniques in computer music is reverberation and
the addition of various types and speeds of delays to dry (unreverberated) sounds.
Reverberation and delays can be used to simulate room and other environmental acoustics or
even to create new sounds of their own that are not necessarily related to existing physical
spaces.
There are a number of commonly used techniques for simulating and modeling different
reverberations and physical environments. One interesting technique is to actually record the
ambience of a room and then superimpose that onto a sound recorded elsewhere. This
technique is called convolution.
A Mathematical Excursion: Convolution in the Time
Domain
Convolution can be accomplished in two equivalent ways: either in the time domain or in the
frequency domain. We’ll talk about the time-domain version first. Convolution is like a
"running average"—that is, we take the original function representing the music (we’ll call
that one m(n), where m(n) is the amplitude of the music function at time n) and then we make
a new, smoothed function by making the amplitude a at time n of our new function equal to:
a = (1/3)[m(n – 2) + m(n – 1) + m(n)]
Let’s call this new function r(n)—for running average—and let’s look at it a little more
closely.
Just to make things work out, we’ll start off by making r(0) = 0 and r(1) = 0. Then:
r(2) = 1/3[m(0) + m(1) + m(2)]
r(3) = 1/3[m(1) + m(2) + m(3)]
We continue on, running along and taking the average of every three values of m— hence the
name "running average." Another way to look at this is as taking the music m(n) and
"convolving it against the filter a(n)" where a(n) is another function, defined by:
a(0) = 1/3, a(1) = 1/3, a(2) = 1/3
All the rest of the a(n) are 0.
Now, what is this mysterious "convolving against" thing? Well, first off we write it like:
m*a(n)
then:
m*a(n) = m(n)a(0) + m(n – 1)a(1) + m(n – 2)a(2) +....+ m(0)a(n)
next:
m*a(n) = m(n)(1/3) + m(n – 1)(1/3) + m(n – 2)(1/3)
Now, there is nothing special about the filter a(n)—in fact it could be any kind of function. If
it were a function that reflects the acoustics of your room, then what we are doing—look back
at the formula—is shaping the input function m(n) according to the filter function a(n). A
little more terminology: the number of nonzero values that a(n) takes on is called the number
of taps for a(n). So, our running average is a three-tap filter.
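Here is the three-tap running average written out as a time-domain convolution, as a small Python sketch. The function and variable names are ours, and the sample values are made up purely for illustration.

```python
def convolve(m, a):
    """Time-domain convolution: c(n) = sum over k of m(n - k) * a(k)."""
    c = [0.0] * (len(m) + len(a) - 1)
    for n in range(len(c)):
        for k in range(len(a)):
            if 0 <= n - k < len(m):
                c[n] += m[n - k] * a[k]
    return c

m = [1.0, 2.0, 6.0, 4.0, 2.0]   # a few samples of our "music" function m(n)
a = [1/3, 1/3, 1/3]             # the three-tap running-average filter a(n)
r = convolve(m, a)              # r[2] = (m(0) + m(1) + m(2)) / 3 = 3.0
```

Note that r[2] comes out to the average of m(0), m(1), and m(2), exactly as in the running-average formula above.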
Reverb in the Time Domain
Convolution is sure fun, and we’ll see more about it in the following section, but a far easier
way to create reverb in the time domain is to simply delay the signal some number of times
(by some very small time value) and feed it back onto itself, simulating the way a sound
bounces around a room. There are a large variety of commercially available and inexpensive
devices that allow for many different reverb effects. One interesting possibility is to change
reverb over the course of a sound, in effect making a room grow and shrink over time (see
Soundfile 5.7).
Changing the reverberant characteristics of a sound over time.
Here is a nail being hammered. The room size goes from very small to
very large over the course of about 15 seconds. This can’t happen in the
Soundfile 5.7 real, physical world, and it’s a good example of the way the computer
Reverberant musician can create imaginary soundscapes.
characteristics
Before there were digital systems, engineers created various ways to get reverb-type effects.
Any sort of situation that could cause delay was used. Engineers used reverberant chambers—
sending sound via a loudspeaker into a tiled room and rerecording the reverb-ed sound. When
one of us was a kid, we ran a microphone into our parents' shower to record a trumpet solo!
Reverb units made with springs and metal plates make interesting effects, and many digital
signal processing units have algorithms that can be dialed up to create similar effects. Of
course we like to think that with a combination of delayed copies of signals we can simulate
any type of enclosure, whether it be the most desired recording studio room or a parking
garage.
As you’ll see in Figure 5.5, there are some essential elements for creating a reverberant sound.
Much of the time a good reverb "patch," or set of digital algorithms, will make use of a
number of copies of the original signal—just as there are many versions of a sound bouncing
around a room. There might be one copy of the signal dedicated to making the first reflection
(which is the very first reflected sound we hear when a sound is introduced into a reverberant
space). Others might be algorithms dedicated to making the early reflections—the first sounds
we hear after the initial reflection. The rest of the software for a good digital reverb is likely
designed to blend the reverberant sound. Filters are usually used to make the reverb tails sound as if they are far away. Any filter that attenuates the higher frequencies (say, 5 kHz and up) makes a signal sound farther away from us, since high frequencies are absorbed by the air and by soft surfaces and so don't travel very far.
When we are working to simulate rooms with digital systems, we have to take a number of
things into consideration:
• How large is the room? How long will it take for the first reflection to get back to our ears?
• What are the surfaces like in the room? Not just the walls, ceiling, and floor, but also any objects in the room that can reflect sound waves; for example, are there any huge ice sculptures in the middle of the room?
• Are there surfaces in the room that can absorb sound? These are called absorptive surfaces. For example, movie theaters usually have curtains hanging on the walls to suck up sound waves. Maybe there are people in the room: most bodies are soft, and soft surfaces remove energy from reflected sounds.
Figure 5.5 Time-domain reverb is conceptually similar to the idea of physical modeling—we
take a signal (say a hand clap) and feed it into a model of a real-world environment (like a
large room). The signal is modified according to our model room and then output as a new
signal.
In this illustration you see the graphic representation of a direct sound being introduced into a
reverberant space. The RT60 refers to reverb time, or how long it takes the sound in the space to decay by 60 dB.
Photo courtesy of Garry Clennell <[email protected]>.
Figure 5.6 In this illustration you can see how sound from a sound source (like a
loudspeaker) can bounce around a reverberant space to create what we call reverb.
A typical reverb algorithm can have lots of control variables such as room size (gymnasium,
small club, closet, bathroom), brightness (hard walls, soft walls or curtains, jello), and
feedback or absorption coefficient (are there people, sand, rugs, llamas in the room—how
quickly does the sound die out?).
Photo courtesy of Garry Clennell <[email protected]>.
Figure 5.7 Here’s a basic diagram showing how a signal is delayed (in the delay box), then
fed back and added to the original signal with some attenuated (that is, not as loud) version of
the original signal added. This would create an effect known as comb filtering (a short delay
with feedback that emphasizes specific harmonics) as well as a delay.
What does a delay do? It takes a function (a signal), shifts it "backward" in time, then
combines it with another "copy" of the same signal. This is, once again, a kind of averaging.
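As a rough Python sketch of the delay-plus-feedback idea in Figure 5.7 (not any particular commercial unit's algorithm), the delay length, feedback amount, and mix level below are arbitrary illustrative values:

```python
def comb_filter(signal, delay_samples, feedback=0.5, mix=0.5):
    """Delay the signal and feed the delayed output back onto itself."""
    output = []
    delay_line = [0.0] * delay_samples          # circular delay buffer
    index = 0
    for x in signal:
        delayed = delay_line[index]             # the signal from delay_samples ago
        y = x + mix * delayed                   # add an attenuated delayed copy
        delay_line[index] = x + feedback * delayed  # feed the sum back into the delay
        index = (index + 1) % delay_samples
        output.append(y)
    return output
```

Because the delayed copies reinforce some frequencies and cancel others, this kind of short feedback delay produces the comb-filtering effect described in the caption.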
Figure 5.8 When we both feed back and feed forward the same signal at the same time, phase inverted, we get what is called an all-pass filter. (Phase inverting a signal means changing its phase by 180°.) With this type of filter/delay, we don't get the comb filtering effect, because the phase inversion of the feed-forward signal fills in the missing frequencies created by the comb filtering of the fed-back signal.
This type of unit is used to create blended reverb. A set of delay algorithms would probably have some combination of a number of comb-type delays as in this figure, as well as some all-pass filters to create a blend.
Making Reverb in the Frequency Domain: A Return to
Convolution
As we mentioned, convolution is a great way to get reverberant effects. But we mainly talked
about the mathematics of it and why convolving a signal is basically putting it through a type
of filter.
But now we’re going to talk about how to actually convolve a sound. The process is fairly
straightforward:
1. Locate a space whose reverberant characteristics we like.
2. Set up a microphone and some digital media for recording.
3. Make a short, loud sound in this room, like hitting two drumsticks together (popping a
balloon and firing a starter’s pistol are other standard ways of doing this—sometimes
it’s called "shooting the room").
4. Record this sound, called the impulse.
5. Analyze the impulse in the room; this gives us the impulse response of the room,
which is the characteristics of the room’s response to sound.
6. Use the technique of convolution to combine this impulse response with other sound
sources, more or less bringing the room to the sound.
Convolution can be powerful, and it’s a fairly complicated software process. It takes each
sample in the impulse response file (the one we recorded in the room, which should be short)
and multiplies that sample by each sample in the sound file that we want to "put in" that room.
So, each sample in the input sound file, like a vocal sound file to which we want to add reverb, is multiplied by each sample in the impulse response file.
That's a lot of multiplies! Let's pretend, for the moment, that we have a 3-second impulse file and a 1-minute sound file to which we want to add the reverberant characteristics of some space. At 44.1 kHz, that's:
(3 * 44,100) * (60 * 44,100)
= 132,300 * 2,646,000
= 350,065,800,000 multiplies
If you've been paying close attention, you might raise an interesting question here: isn't this the time domain? We're just multiplying signals together (well, actually, we're multiplying each point in each function by every other point in the other function—called a cross-multiply). This cross-multiply in the time domain is exactly what we mean by convolution, and its effect is equivalent to a simple point-by-point multiplication of the two signals' spectra in the frequency domain.
Now, we’re not math geniuses, but even we know that this is a whole mess of multiplies!
That’s why convolution, which is very computationally expensive, had not been a popular
technique until your average computer got fast enough to do it. It was also a completely
unknown sound manipulation idea until digital recording technology made it feasible.
But nowadays we can do it, and the reason we can do it is our old friend, the FFT! You see, it
turns out that for filters with lots of taps (remember that this means one with lots of nonzero
values), it is easier to compute the convolution in the spectral domain.
Suppose we want to convolve our music function m(n) against our filter function a(n). We can
tell you immediately that the convolution of these functions (which, remember, is another
sound, a function that we call c(n)) has a spectrum equal to the pointwise product of the spectra of the music function and the filter function. The pointwise product is the frequency content at any point in the convolution and is calculated by multiplying the spectra of the music function and the filter function at that particular frequency.
Another way of saying this is to say that the Fourier coefficients of the convolution can be
computed by simply multiplying together each of the Fourier coefficients of m(n) and a(n).
The zero coefficient (the DC term) is the product of the DC terms of a(n) and m(n); the first
coefficient is the product of the first Fourier coefficient of a(n) and m(n); and so on.
So, here’s a sneaky algorithm for making the convolution:
Step 1: Compute the Fourier coefficients of both m(n) and a(n).
Step 2: Compute the pointwise products of the Fourier coefficients of a(n) and m(n).
Step 3: Compute the inverse Fourier transform (IFFT) of the result of Step 2.
And we're done! This is one of the great properties of the FFT: it makes convolution just about as easy as multiplication!
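Here is what those three steps might look like as a short numpy sketch (our own illustration; padding the FFT size up to a power of two is just a common convenience, not part of the algorithm itself):

```python
import numpy as np

def fft_convolve(m, a):
    """Convolve m and a by multiplying their spectra (Steps 1-3 above)."""
    size = len(m) + len(a) - 1                # length of the full convolution
    n_fft = 1 << (size - 1).bit_length()      # next power of two, for the FFT
    M = np.fft.rfft(m, n_fft)                 # Step 1: spectra of both signals
    A = np.fft.rfft(a, n_fft)
    C = M * A                                 # Step 2: pointwise product
    c = np.fft.irfft(C, n_fft)                # Step 3: back to the time domain
    return c[:size]

# The result matches the slow time-domain cross-multiply:
m = np.array([1.0, 2.0, 6.0, 4.0, 2.0])
a = np.array([1/3, 1/3, 1/3])
print(np.allclose(fft_convolve(m, a), np.convolve(m, a)))   # True
```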
So, once again, if we have a big library of great impulse responses— the best-sounding
cathedrals, the best recording studios, concert halls, the Grand Canyon, Grand Central Station,
your shower—we can simulate any space for any sound. And indeed this is how many digital
effects processors and reverb plugins for computer programs work. This is all very cool when
you’re working hard to get that "just right" sound.
Figure 5.9 The impulse on the left, and the room’s response on the right.
Thanks to Nigel Redmon of earlevel.com for this graphic.
Applet 5.3: Flanges, delays, reverb: effect box fun
Applet 5.4: Multitap delay
Section 5.3: Localization/Spatialization
Applet 5.5: Filter-based localization
This applet lets you pan, fade, and simulate binaural listening (best if listened to with headphones).

Soundfile 5.8: Binaural location
Using the program SoundHack by Tom Erbe, we can place a sound anywhere in the binaural (two-ear) space. SoundHack does this by convolving the sound with known filter functions that simulate the ITD (interaural time delay) between the two ears.
Close your eyes and listen to the sounds around you. How well can you tell where they’re
coming from? Pretty well, hopefully! How do we do that? And how could we use a computer
to simulate moving sound so that, for example, we can make a car go screaming across a
movie screen or a bass player seem to walk over our heads?
Humans have a pretty complicated system for perceptually locating sounds, involving, among
other factors, the relative loudness of the sound in each ear, the time difference between the
sound’s arrival in each ear, and the difference in frequency content of the sound as heard by
each ear. How would a "cyclaural" (the equivalent of a "cyclops") hear? Most attempts at
spatializing, or localizing, recorded sounds make use of some combination of factors
involving the two ears on either side of the head.
Simulating Sound Placement
Simulating a loudness difference is pretty simple—if someone standing to your right says
your name, their voice is going to sound louder in your right ear than in your left. The
simplest way to simulate this volume difference is to increase the volume of the signal in one
channel while lowering it in the other—you’ve probably used the pan or balance knob on a
car stereo or boombox, which does exactly this. Panning is a fast, cheap, and fairly effective
means of localizing a signal, although it can often sound artificial.
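A minimal panning sketch in Python, assuming a mono input and an equal-power pan law (one common way to keep the overall loudness roughly constant as the sound moves between the speakers); the function name and defaults are ours:

```python
import math

def pan(signal, position):
    """Pan a mono signal into stereo. position: 0.0 = hard left, 1.0 = hard right."""
    # Equal-power panning: the two gains trace a quarter circle, so the
    # combined energy stays roughly constant across the stereo field.
    left_gain = math.cos(position * math.pi / 2)
    right_gain = math.sin(position * math.pi / 2)
    return [(left_gain * x, right_gain * x) for x in signal]

stereo = pan([0.0, 0.5, 1.0, 0.5], position=0.75)   # mostly in the right channel
```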
Interaural Time Delay (ITD)
Simulating a time difference is a little trickier, but it adds a lot to the realism of the
localization. Why would a sound reach your ears at different times? After all, aren’t our ears
pretty close together? We’re generally not even aware that this is true: snap your finger on
one side of your head, and you’ll think that you hear the sound in both ears at exactly the
same time.
But you don’t. Sound moves at a specific speed, and it’s not all that fast (compared to light,
anyway): about 345 meters/second. Since your fingers are closer to one ear than the other, the
sound waves will arrive at your ears at different times, if only by a small fraction of a second.
Since most of us have ears that are quite close together, the time difference is very slight—too
small for us to consciously "perceive."
Let's say your head is a bit wide: roughly 25 cm, or a quarter of a meter. It takes sound
around 1/345 of a second to go 1 meter, which is approximately 0.003 second (3 thousandths
of a second). It takes about a quarter of that time to get from one ear of your wide head to the
other, which is about 0.0007 second (0.7 thousandths of a second). That’s a pretty small
amount of time! Do you believe that our brains perceive that tiny interval and use the
difference to help us localize the sound? We hope so, because if there’s a frisbee coming at
you, it would be nice to know which direction it’s coming from! In fact, though, the delay is
even smaller because your head’s smaller than 0.25 meter (we just rounded it off for
simplicity). The technical name for this delay is interaural time delay (ITD).
To simulate ITD by computer, we simply need to add a delay to one channel of the sound.
The longer the delay, the more the sound will seem to be panned to one side or the other
(depending on which channel is delayed). The delays must be kept very short so that, as in
nature, we don’t consciously perceive them as delays, just as location cues. Our brains take
over and use them to calculate the position of the sound. Wow!
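A sketch of ITD-only localization in Python, using the roughly 0.0007-second figure from above (the function name and defaults are our own illustrative choices):

```python
def apply_itd(signal, sample_rate=44100, itd_seconds=0.0007, delay_right=True):
    """Delay one channel by the interaural time difference to shift the image."""
    delay_samples = int(itd_seconds * sample_rate)    # about 30 samples at 44.1 kHz
    delayed = [0.0] * delay_samples + list(signal)    # this channel "hears" it later
    padded = list(signal) + [0.0] * delay_samples     # keep both channels the same length
    if delay_right:
        return list(zip(padded, delayed))   # (left, right): sound appears to the left
    else:
        return list(zip(delayed, padded))   # sound appears to the right
```

Delaying the right channel by about 30 samples makes the sound seem to come from the left, since the left ear "hears" it first.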
Modeling Our Ears and Our Heads
That the ears perceive and respond to a difference in volume and arrival time of a sound
seems pretty straightforward, albeit amazing. But what’s this about a difference in the
frequency content of the sound? How could the position of a bird change the spectral makeup
of its song? The answer: your head!
Imagine someone speaking to you from another room. What does the voice sound like? It’s
probably a bit muffled or hard to understand. That’s because the wall through which the
sound is traveling—besides simply cutting down the loudness of the sound—acts like a low-pass filter. It lets the low frequencies in the voice pass through while attenuating or muffling
the higher ones.
Your head does the same thing. When a sound comes from your right, it must first pass
through, or go around, your head in order to reach your left ear. In the process, your head
absorbs, or blocks, some of the high-frequency energy in the sound. Since the sound didn’t
have to pass through your head to get to your right ear, there is a difference in the spectral
makeup of the sound that each ear hears. As with ITD, this is a subtle effect, although if
you’re in a quiet room and you turn your head from side to side while listening to a steady
sound, you may start to perceive it.
Modeling this by computer is easy, provided you know something about how the head filters
sounds (what frequencies are attenuated and by how much). If you’re interested in the
frequency response of the human head, there are a number of published sources available for
the data, since they are used by, among others, the government for all sorts of things (like
flight simulators, for example). Researcher and author Durand Begault has been a leading pioneer in the design and implementation of what are called head-related transfer functions—frequency response curves for different locations of sound.
What Are Head-Related Transfer Functions (HRTFs)?
Figure 5.10 This illustration shows how the spectral contents of a sound change depending on which direction the sound is coming from. The body (head and shoulders) and the time-of-arrival difference that occurs between the left and right ears create a filtering effect.
Figure 5.11 The binaural dummy head recording system includes an acoustic baffle with the
approximate size, shape, and weight of a human head. Small microphones are mounted where
our ears are located.
This recording system is designed to emulate the acoustic effects of the human head (just as
our ears might hear sounds) and then capture the information on recording media.
A number of recording equipment manufacturers make these "heads," and they often have
funny names (Sven, etc.).
Thanks to Sonic Studios for this photo.
Not surprisingly, humans are extremely adept at locating sounds in two dimensions, or the
plane. We’re great at figuring out the source direction of a sound, but not the height. When a
lion is coming at us, it’s nice of evolution to have provided us with the ability to know,
quickly and without much thought, which way to run. It’s perhaps more of a surprise that
we’re less adept at locating sounds in the third dimension, or more accurately, in the
"up/down" axis. But we don’t really need this ability. We can’t jump high enough for that
perception to do us much good. Barn owls, on the other hand, have little filters on their
cheeks, making them extraordinarily good at sensing their sonic altitude distances. You would
be good at sensing your sonic altitude distance, too, if you had to catch and eat, from the air,
rapidly running field mice. So if it's not a frisbee heading at you more or less in the two-dimensional plane, but a softball headed straight down toward your head, we'd suggest a
helmet!
Section 5.4: Introduction to Spectral Manipulation
There are two different approaches to manipulating the frequency content of sounds: filtering,
and a combination of spectral analysis and resynthesis. Filtering techniques, at least
classically (before the FFT became commonly used by most computer musicians), attempted
to describe spectral change by designing time-domain operations. More recently, a great deal
of work in filter design has taken place directly in the spectral domain.
Spectral techniques allow us to represent and manipulate signals directly in the frequency
domain, often providing a much more intuitive and user-friendly way to work with sound.
Fourier analysis (especially the FFT) is the key to many current spectral manipulation
techniques.
Phase Vocoder
Soundfile 5.9: Unmodified speech

Perhaps the most commonly used implementation of Fourier analysis in computer music is a technique called the phase vocoder. What is called the phase vocoder actually comprises a number of techniques for taking a time-domain signal, representing it as a series of amplitudes, phases, and frequencies, manipulating this information, and returning it to the time domain. (Remember, Fourier analysis is the process of turning the list of samples of our music function into a list of Fourier coefficients, which are complex numbers that have phase and amplitude, each corresponding to a frequency.)
Soundfile 5.10: Speech made half as long with a phase vocoder
Soundfile 5.11: Speech made twice as long with a phase vocoder

Two of the most important ways that musicians have used the phase vocoder technique are to use a sound's Fourier representation to manipulate its length without changing its pitch and, conversely, to change its pitch without affecting its length. This is called time stretching and pitch shifting.

Why should this even be difficult? Well, consider trying it in the time domain: play back, say, a 33 1/3 RPM record at 45 RPM. What happens? You play the record faster, the needle moves through the grooves at a higher rate, and the sound is higher pitched (often called the "chipmunk" effect, possibly after the famous 1960s novelty records featuring Alvin and his friends). The sound is also much shorter: in this case, pitch is directly tied to playback speed—they're both controlled by the same mechanism. A creative and virtuosic use of this technique is scratching, as practiced by hip-hop, rap, and dance DJs.
Soundfile 5.12: Speech transposed up an octave
Soundfile 5.13: Speech transposed down an octave
Soundfile 5.14: Speeding up a sound
Soundfile 5.15: Slowing down a sound

An example of time-domain pitch shifting/speed changing. In this case, pitch and time transformations are related. The faster the sound is played, the higher the pitch becomes, as heard in Soundfile 5.14. In Soundfile 5.15, the opposite effect is heard: the slower file sounds lower in pitch.
The Pitch/Speed Relationship in the Digital World
Now think of altering the speed of a digital signal. To play it back faster, you might raise the
sampling rate, reading through the samples for playback more quickly. Remember that
sometimes we refer to the sampling rate as the rate at which we stored (sampled) the sounds,
but it also can refer to the kind of internal clock that the computer uses with reference to a
sound (for playback and other calculations). We can vary that rate, for example playing back a
sound sampled at 22.05 kHz at 44.1 kHz. With more samples (read) per second, the sound
gets shorter. Since frequency is closely related to sampling rate, the sound also changes pitch.
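A simple varispeed sketch in Python: we just read through the sample list at a different rate, interpolating between neighboring samples (the linear interpolation is our own simplification, and the function name is ours):

```python
def varispeed(samples, speed=2.0):
    """Read through the samples at a different rate: speed = 2.0 plays twice as
    fast, which makes the sound half as long and an octave higher in pitch."""
    output = []
    position = 0.0
    while position < len(samples) - 1:
        i = int(position)
        frac = position - i
        # simple linear interpolation between neighboring samples
        output.append((1 - frac) * samples[i] + frac * samples[i + 1])
        position += speed
    return output
```

Notice that, just as with the record player, pitch and length change together: there is no way to alter one without the other using this technique alone.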
Xtra bit 5.1: "Slow Motion Sound"

Even with the basic pitch/speed problem, manipulating the speed of sound has always attracted creative experiment. Consider an idea proposed by composer Steve Reich in 1967, thought to be a kind of impossible dream for electronic music: to slow down a sound without changing its pitch (and vice versa).

Soundfile 5.16: Varispeed
A SoundHack varispeed of some standard speech. Note how the speech's speed is changed over time. Varispeed is a general term for "fooling around" with the sampling rate of a sound file.
Using the Phase Vocoder
Using the phase vocoder, we can realize Steve Reich’s piece (see Xtra bit 5.1), and a great
many others. The phase vocoder allows us independent control over the time and the pitch of
a sound.
How does this work? Actually, in two different ways: by changing the speed and changing the
pitch.
To change the speed, or length, of a sound without changing its pitch, we need to know
something about what is called windowing. Remember that when doing an FFT on a sound,
we use what are called frames—time-delimited segments of sound. Over each frame we
impose a window: an amplitude envelope that allows us to cross-fade one frame into another,
avoiding problems that occur at the boundaries of the two frames.
What are these problems? Well, remember that when we take an FFT of some portion of the
sound, that FFT, by definition, assumes that we’re analyzing a periodic, infinitely repeating
signal. Otherwise, it wouldn’t be Fourier analyzable. But if we just chop up the sound into
FFT-frames, the points at which we do the chopping will be hard-edged, and we’ll in effect be
assuming that our periodic signal has nasty edges on both ends (which will typically show up
as strong high frequencies). So to get around this, we attenuate the beginning and ending of
our frame with a window, smoothing out the assumed periodical signal. Typically, these
windows overlap at a certain rate (1/8, 1/4, 1/2 overlap), creating even smoother transitions
between one FFT frame and another.
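Here is a small numpy sketch of that windowing step, assuming a Hann window and a 1/2 overlap; the frame size and hop are arbitrary illustrative values:

```python
import numpy as np

def windowed_frames(signal, frame_size=1024, hop=512):
    """Split a signal into overlapping, Hann-windowed frames and take their FFTs."""
    window = np.hanning(frame_size)            # smooth taper to zero at both ends
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = np.asarray(signal[start:start + frame_size]) * window
        frames.append(np.fft.rfft(frame))      # spectrum of one windowed frame
    return frames

# With hop = frame_size // 2 (a 1/2 overlap), the overlapping Hann windows sum
# to a (nearly) constant value, so the frames can later be overlap-added back
# together into a smooth time-domain signal.
```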
Figure 5.12 Why do we window FFT frames? The image on the left shows the waveform
that our FFT would analyze without windowing—notice the sharp edges where the frame
begins and ends. The image in the middle is our window. The image on the right shows the
windowed waveform. By imposing a smoothing window on the time domain signal and doing
an FFT of the windowed signal, we de-emphasize the high-frequency artifacts created by
these sharp vertical drops at the beginning and end of the frame.
Thanks to Jarno Seppänen <[email protected]> Nokia Research Center, Tampere,
Finland, for these images.
Figure 5.13 After we window a signal for the FFT, we overlap those windowed signals so that the original
signal is reconstructed without the sharp edges.
By changing the length of the overlap when we resynthesize the signal, we can change the
speed of the sound without affecting its frequency content (that is, the FFT information will
remain the same, it’ll just be resynthesized at a "larger" frame size). That’s how the phase
vocoder typically changes the length of a sound.
What about changing the pitch? Well, it’s easy to see that with an FFT we get a set of
amplitudes that correspond to a given set of frequencies. But it’s clear that if, for example, we
have very strong amplitudes at 100 Hz, 200 Hz, 300 Hz, 400 Hz, and so on, we will perceive
a strong pitch at 100 Hz. What if we just take the amplitudes at all frequencies and move them
"up" (or down) to frequencies twice as high (or as low)? What we’ve done then is re-create
the frequency/amplitude relationships starting at a higher frequency—changing the perceived
pitch without changing the sound's length.
Two columns of FFT bins. These bins divide the Nyquist frequency evenly. In other words, if we were sampling at 10 kHz and we had 100 FFT bins (both these numbers are rather silly, but they're arithmetically simple), our Nyquist frequency would be 5 kHz, and the bin width would be 50 Hz. Each of these frequency bins has its own amplitude, which is the strength or energy of the spectrum at that frequency (or more precisely, the average energy in that frequency range). To implement a pitch shift in a phase vocoder, the amplitudes in the left column are shifted up to higher frequencies in the right column: the energy in each bin on the left is moved to a higher bin on the right.
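Here is a deliberately crude numpy sketch of that bin-shifting idea, transposing up one octave by moving each bin's energy to the bin at twice its frequency. A real phase vocoder also has to adjust the phases and usually interpolates between bins; we ignore all of that here.

```python
import numpy as np

def shift_bins_up_one_octave(spectrum):
    """Move the energy in each FFT bin to the bin at twice its frequency.
    (A real phase vocoder must also correct the phases; this sketch does not.)"""
    shifted = np.zeros_like(spectrum)
    for k in range(len(spectrum) // 2):
        shifted[2 * k] += spectrum[k]   # bin k (frequency f) -> bin 2k (frequency 2f)
    return shifted
```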
The phase vocoder technique actually works just fine, though for radical pitch/time
deformations we get some problems (usually called "phasiness"). These techniques work
better for slowly changing harmonic sounds and for simpler pitch/time relationships (integer
multiples). Still, the phase vocoder works well enough, in general, for it to be a widely used
technique in both the commercial and the artistic sound worlds.
Soundfile 5.17: "Study: Anna"
Larry Polansky's 1-minute piece, "Study: Anna, the long and the short of it." All the sounds are created using phase vocoder pitch and time shifts of a recording of a very short cry (with introductory inhale) of the composer's daughter when she was 6 months old.
Section 5.5: More on Convolution
Another interesting use of the phase vocoder is to perform convolution. We talked about
convolution a lot in our section on reverb (both in the time domain and in the frequency
domain).
Soundfile 5.18: Convolution: Schwarzhund
A simple convolution: a guitar power chord (the impulse response), a vocal excerpt, and the convolution of one with the other. We have sonically turned the singer's head into the guitar chord: he's singing "through" the chord.
Remember that a convolution multiplies every value in one sound by every value in another (sometimes called a cross-multiply). This is different from simply multiplying each value in one sound by its one corresponding value in the other. In fact, as we mentioned, there is a well-known and surprisingly simple relationship between these two concepts: multiplication in the time domain is the same as convolution in the frequency domain (and vice versa). As you probably saw in Section 5.2, the mathematics of convolution
can get a little hairy. But the uses of convolution for transformations of sound are pretty
straightforward, so we’ll explain them in this section.
Cross-Synthesis
Convolution is a type of cross-synthesis: a general, less technical term which means that some
aspect of one sound is imposed onto another. A simple example of cross-synthesis returns us
to the subject of reverb. We described the way that, by recording what is called the impulse
response of a room (its resonant characteristics) using a very short, loud sound, we can place
another sound in that room (at whatever position we "shot") by convolving the impulse
response with that sound. Surprisingly, by convolving any sound with white noise, we can
simulate simple reverb.
Using Convolution
Although reverberation is a common application of the convolution technique, convolution
can be used creatively to produce unusual sounds as well. Simple-to-use convolution tools
(like that in the program SoundHack) have only recently become available to a large
community of musicians because up until recently, they only ran on quite large computers and
in rather arcane environments. So we are likely to hear some amazing things in the near future
using these techniques!
Soundfile 5.19: Noise
Soundfile 5.20: Inhale
Soundfile 5.21: Noise and inhale convolved
In this convolution example, Soundfile 5.19 and Soundfile 5.20 are convolved into Soundfile
5.21. The two sound sources are noise and an inhale.
Section 5.6: Morphing
In recent years the idea of morphing, or turning one sound (or image) into another, has
become quite popular. What is especially interesting, besides the idea of having a lion roar
change gradually and imperceptibly into a meow, is the broader idea that there are sounds "in
between" other sounds.
Figure 5.15 Image morphing: several stages of a morph.
What does it mean to change one sound into another? Well, how would you graphically
change a picture into another? Would you replace, over time, little bits of one picture with
those of another? Would you gradually change the most important shapes of one into those of
the other? Would you look for important features (background, foreground, color, brightness,
saturation, etc.), isolate them, and cross-fade them independently? You can see that there are
lots of ways to morph a picture, and each way produces a different set of effects. The same is
true for sound.
Soundfile 5.22: "morphing piece"

Larry Polansky's "51 Melodies" is a terrific example of a computer-assisted composition that uses morphing to generate novel (and kind of insane!) melodies. Polansky specified the source and target melodies, and then wrote a computer program to generate the melodies in between. From the liner notes to the CD "Change":

"51 Melodies is based on two melodies, a source and a target, and is in three sections. The piece begins with the source, a kind of pseudo-anonymous rock lick. The target melody, an octave higher and more chromatic, appears at the beginning of the third section, played in unison by the guitars and bass. The piece ends with the source. In between, the two guitars morph, in a variety of different, independent ways, from the source to the target (over the course of Sections 1 and 2) and back again (Section 3). A number of different morphing functions, both durational and melodic, are used to distinguish the three sections."

(Soundfile 5.22 is a three-minute edited version of the complete 12-minute piece, composed of one-minute sections from the beginning, middle, and end of the recording.)
Simple Morphing
The simplest sonic morph is essentially an amplitude cross-fade. Clearly, this doesn’t do
much (you could do it on a little audio mixer).
Soundfile 5.23: Cross-fade

Figure 5.16 An amplitude cross-fade of a number of different data points.
What would constitute a more interesting morph, even limiting us to the time domain? How
about this: let’s take a sound and gradually replace little bits of it with another sound. If we
overlap the segments that we’re "replacing," we will avoid horrible clicks that will result from
samples jumping drastically at the points of insertion.
Interpolation and Replacement Morphing
The two ways of morphing described above might be called interpolation and replacement morphing, respectively. In a replacement morph, intact values are gradually substituted from
one sound into another. In an interpolation morph, we compare the values between two
sounds and select values somewhere between them for the new sound. In the former, we are
morphing completely some part of the time; in the latter, we are morphing somewhat all of the
time.
In general, we can specify a degree of morphing, by convention called Ω, that tells how far
one sound is from the other. A general formula for (linear) interpolation is:
I = A + (Ω*(B – A))
In this equation, A is the starting value, B is the ending value, and Ω is the interpolation index,
or "how far" you want to go. Thus, when Ω = 0, I = A; when Ω = 1, I = B, and when Ω = 0.5, I
= the average of A and B.
This equation is a complicated way of saying: take some sound (SourceSound) and add to it
some percentage of the difference between it and another sound (TargetSound –
SourceSound), to get the new sound.
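Applied sample by sample to two equal-length sounds, the interpolation formula looks like this in Python (omega stands in for Ω; the function name and the tiny example values are ours):

```python
def interpolation_morph(source, target, omega):
    """I = A + omega * (B - A), applied to every sample of two equal-length sounds.
    omega = 0 gives the source, omega = 1 gives the target, 0.5 their average."""
    return [a + omega * (b - a) for a, b in zip(source, target)]

halfway = interpolation_morph([0.0, 1.0, -1.0], [1.0, 0.0, 0.0], omega=0.5)
# -> [0.5, 0.5, -0.5]
```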
Sonic morphing can be more interesting in the frequency domain, in the creation of sounds
whose spectral content is some kind of hybrid of two other sounds. (Convolution, by the way,
could be thought of as a kind of morph!)
An interesting approach to morphing is to take some feature of a sound and morph that feature
onto another sound, trying to leave everything else the same. This is called feature morphing.
Theoretically, one could take any mathematical or statistical feature of the sound, even
perceptually meaningless ones—like the standard deviation of every 13th bin—and come up
with a simple way to morph that feature. This can produce interesting effects. But most
researchers have concentrated their efforts on features, or some organized representation of
the data, that are perceptually, cognitively, or even musically salient, such as attack time,
brightness, roughness, harmonicity, and so on, finding that feature morphing is most effective
on such perceptually meaningful features.
Feature Morphing Example: Morphing the Centroid
Music cognition researchers and computer musicians commonly use a measure of sounds
called the spectral centroid. The spectral centroid is a measure of the "brightness" of a sound,
and it turns out to be extremely important in the way we compare different sounds. If two
sounds have a radically different centroid, they are generally perceived to be timbrally distant
(sometimes this is called a spectral metric).
Basically, the centroid can be considered the average frequency component (taking into
consideration the amplitude of all the frequency components). The formula for the spectral centroid of a sound, averaged over its FFT frames, is:

C = (C1 + C2 + ... + Ci) / i

Ci is the centroid for one spectral frame, and i is the number of frames for the sound. A spectral frame is some number of samples that is equal to the size of the FFT.

The (individual) centroid of a spectral frame is defined as the average frequency weighted by amplitudes, divided by the sum of the amplitudes, as follows:

Ci = (f1a1 + f2a2 + ... + fNaN) / (a1 + a2 + ... + aN)

where fk is the frequency of bin k, ak is its amplitude, and N is the number of bins in the frame.
We add up all the frequencies multiplied by their amplitudes (the numerator) and add up all
the amplitudes (the denominator), and then divide. The "strongest" frequency wins! In other
words, it’s the average frequency weighted by amplitude: where the frequency concentration
of a sound is.
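Here is one way to compute the centroid of a single frame with numpy, a sketch under the usual assumption that "amplitude" means the magnitude of each FFT bin (the function name and default sample rate are ours):

```python
import numpy as np

def spectral_centroid(frame, sample_rate=44100):
    """Centroid of one frame: the average frequency, weighted by amplitude."""
    amplitudes = np.abs(np.fft.rfft(frame))                          # amplitude of each bin
    frequencies = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)   # frequency of each bin
    if amplitudes.sum() == 0:
        return 0.0            # silent frame: no meaningful centroid
    return (frequencies * amplitudes).sum() / amplitudes.sum()
```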
Soundfile 5.24: Chris Mann
Soundfile 5.25: Single violin

The centroid curve of a sound over time. Note that centroids tend to be surprisingly high and never the "fundamental" (unless our sound is a pure sine wave). One of these curves is of a violin tone; the other is of a rapidly changing voice (Australian sound poet Chris Mann). The soundfile for Chris Mann is included as well.
Now let’s take things one step further, and try to morph the centroid of one sound onto that of
another. Our goal is to take the time-variant centroid from one sound and graft that onto a
second sound, preserving as much of the second sound’s amplitude/spectra relationship as
possible. In other words, we’re trying to morph one feature while leaving others constant.
To do this, we can think of the centroid in an unusual way: as (roughly) the frequency that divides the total spectral energy into two parts (above and below), which is more or less what a weighted average does. For some
time-variant centroid (ci) extracted from one sound and some total amplitude from another
(ampsum), we simply "plop" the new centroid onto the sound and scale the amplitude of the
frequency bins above and below the new centroid frequency to (0.5 * ampsum). This will
produce a sort of "brightness morph." Notice that on either side of the centroid in the new
sound, the spectral amplitude relationships remain the same. We’ve just forced a new
centroid.
Section 5.7: Graphical Manipulation of Sound
We’ve seen how it’s possible to take sounds and turn them into pictures by displaying their
spectral data in sonograms, waterfall plots, and so on. But how about going the other way?
What about creating a picture of a sound, and then synthesizing it into an actual sound? Or
how about starting with a picture of something else—Dan Rockmore’s dog Digger, for
instance? What would he sound like? Or how about editing a sound as an image—what if you
could draw a box around some region of a sound and simply drag it to some other place, or
erase part of it, or apply a "blur" filter to it?
Graphical manipulation of sound is still a relatively underdeveloped and emerging field, and
the last few years have seen some exciting developments in the theory and tools needed to do
such work. One of the interesting issues about it is that the graphical manipulations may not
have any obvious relationship to the sonic effects. This could be looked upon as either a
drawback or an advantage. Although graphical techniques are used for both the synthesis and
transformation of sounds, much of the current work in this area seems geared more toward
sonic manipulation than synthesis.
Pre–Digital Era Graphic Manipulation
Composers have always been interested in exploring the relationships between color, shape,
and sound. In fact, people in general are fascinated with this. Throughout history, certain
people have been synaesthetic—they see sound or hear color. Color organs, sound pictures,
even just visual descriptions of sound have been an important part of the way people try to
understand sound and music, and most importantly their own experience of it. Timbre is often
called "sound color," even though sound color should more appropriately be analogized to
frequency/pitch.
Computer musicians have often been interested in working with sounds from a purely
graphical perspective—a "let’s see what would happen if" kind of approach. Creating and
editing sounds graphically is not a new idea, although it’s only recently that we’ve had tools
flexible enough to do it well. Even before digital computers, there were a number of graphics-to-sound systems in use. In fact, some of the earliest film sound technology was optical—a waveform was printed, or even drawn by hand (as in the wonderfully imaginative work of Canadian filmmaker Norman McLaren), on a thin stripe of film running along the sprocket
back the sound.
Canadian inventor and composer Hugh LeCaine took a different approach. In the late 1950s,
he created an instrument called the Spectrogram, consisting of a bank of 108 analog sine
wave oscillators controlled by curves drawn on long rolls of paper. The paper was fed into the
instrument, which sensed the curves and used them to determine the frequency and volume of
each oscillator. Sound familiar? It should—LeCaine’s Spectrogram was essentially an analog
additive synthesis instrument!
Canadian composer and instrument builder, and one of the great pioneers of electronic and
computer music, Hugh LeCaine. LeCaine was especially interested in physical and visual descriptions of
electronic music.
On the right is one of LeCaine’s inventions, an electronic musical instrument called the Spectrogram.
UPIC System
One of the first digital graphics-to-sound schemes, Iannis Xenakis’s UPIC (Unité
Polyagogique Informatique du CEMAMu) system, was similar to LeCaine’s invention in that
it allowed composers to draw lines and curves that represent control information for a bank of
oscillators (in this case, digital oscillators). In addition, it allowed the user to perform
graphical manipulations (cut and paste, copy, rearrange, etc.) on what had been drawn.
Another benefit of the digital nature of the UPIC system was that any waveform (including
sampled ones) could be used in the synthesis of the sound. By the early 1990s, UPIC was able
to do all of its synthesis and processing live, enabling it to be used as a real-time performance
instrument. Newer versions of the UPIC system are still being developed and are currently in
use at CEMAMu (Centre des Etudes Mathématiques Automatiques Musicales) in Paris, an
important center for research in computer music.
AudioSculpt and SoundHack
More recently, a number of FFT/IFFT-based graphical sound manipulation techniques have
been developed. One of the most advanced is AudioSculpt from IRCAM in France.
AudioSculpt allows you to operate on spectral data as you would an image in a painting
program—you can paint, erase, filter, move around, and perform any number of other
operations on the sonograms that AudioSculpt presents.
FFT data displayed as a sonogram in the computer music program AudioSculpt. Partials detected in
the sound are indicated by the red lines. On the right are a number of tools for selecting and manipulating the
sound/image data.
Another similar, and in some ways more sophisticated, approach is Tom Erbe’s QT-coder, a
part of his SoundHack program. The QT-coder allows you to save the results of an FFT of a
sound as a color image that contains all of the data (magnitude and phase) associated with the
sound you’ve analyzed (as opposed to AudioSculpt, which only presents you with the
magnitude information). It saves the images as successive frames of a QuickTime movie,
which can then be opened by most image/video editing software. The result is that you can
process and manipulate your sound using not only specialized audio tools, but also a large
number of programs meant primarily for traditional image/video processing. The movie can
then be brought back into SoundHack for resynthesis. It is also possible to go the other way,
that is, to use SoundHack to synthesize an actual movie into sound, manipulate that sound,
and then transform it back into a movie. As you may imagine, using this technique can cause
some pretty strange effects!
An original image created by the QT-coder in SoundHack (left), and the image after alterations (right). Listen to the original sound (Soundfile 5.26) and examine the original image. Now examine the altered image. Can you guess what the alterations (Soundfile 5.27) will sound like?
Soundfile 5.28: Chris Penrose's composition "American Jingo"

Chris Penrose's Hyperupic. This computer software allows for a wide variety of ways to transform images into sound. See Soundfile 5.28 for Penrose's Hyperupic composition.
squiggy
squiggy, a project developed by one of the authors (repetto), combines some of the benefits of
both the UPIC system and the FFT-based techniques. It allows for the real-time creation,
manipulation, and playback of sonograms. squiggy can record, store, and play back a number
of sonograms at once, each of which can be drawn on, filtered, shifted, flipped, erased,
looped, combined in various ways, scrubbed, mixed, panned, and so on—all live. The goal of
squiggy is to create an instrument for live performance that combines some of the
functionality of a traditional time-domain sampler with the intuitiveness and timbral
flexibility of frequency-domain processing.
Figure 5.22 Screenshot from squiggy, a real-time spectral manipulation tool by douglas
repetto.
On the left is the spectral data display window, and on the right are controls for independent
volume, pan, loop speed, and loop length settings for each sound. In addition there are a
number of drawing tools and processes (not shown) that allow direct graphical manipulation
of the spectral data.