©Burk/Polansky/Repetto/Roberts/Rockmore. All rights reserved. http://music.columbia.edu/cmc/musicandcomputers/ 1.1 The Digital Representation of Sound, Part One: Sound and Timbre What is Sound? Chapter 1 1.2 Amplitude 1.3 Frequency, Pitch and Intervals 1.4 Timbre 2.1 The Digital Representation of Sound, Part Two: Playing by the Numbers The Digital Representation of Sound 2.2 Analog v. Digital 2.3 Sampling Theory 2.4 Binary Numbers Chapter 2 2.5 Bit Width 2.6 Digital Copying 2.7 Storage Concerns 2.8 Compression Chapter 3 3.1 The Frequency Domain The Frequency Domain 3.2 Phasors 3.3 Fourier and the Sum of Sines 3.4 The DFT, FFT and IFFT 3.5 Problems with the FFT/IFFT 3.6 Some Alternatives to the FFT Chapter 4 4.1 The Synthesis of Sound by Computer Introduction to Synthesis 4.2 Additive Synthesis 4.3 Filters 4.4 Formant Synthesis 4.5 Introduction to Modulation 4.6 Waveshaping 4.7 Frequency Modulation 4.8 Granular Synthesis 4.9 Physical Modeling Chapter 5 5.1 The Transformation of Sound by Computer Sampling 5.2 Reverb 5.3 Localization/Spatialization 5.4 The Phase Vocoder 5.5 Convolution 5.6 Morphing 5.7 Graphical Manipulation http://music.columbia.edu/cmc/musicandcomputers/ -1- Chapter 1: The Digital Representation of Sound, Part One: Sound and Timbre Section 1.1: What Is Sound? Sound is a complex phenomenon involving physics and perception. Perhaps the simplest way to explain it is to say that sound involves at least three things: • • • something moves something transmits the results of that movement something (or someone) hears the results of that movement (though this is philosophically debatable) All things that make sound move, and in some very metaphysical sense, all things that move (if they don’t move too slowly or too quickly) make sound. As things move, they "push" and "pull" at the surrounding air (or water or whatever medium they occupy), causing pressure variations (compressions and rarefactions). Those pressure variations, or sound waves, are what we hear as sound. Sound is often represented visually by figures, as in Figures 1.1 and 1.2. Figure 1.1 Most sound waveforms end up looking pretty much alike. It’s hard to tell much about the nature of this sound from this sort of time-domain plot of a sound wave. This illustration is a plot of the immortal line "I love you, Elmo!" Try to figure out which part of the image corresponds to each word in the spoken phrase. http://music.columbia.edu/cmc/musicandcomputers/ -2- Figure 1.2 Minute portion of the sound wave file from Figure 1.1, zoomed in many times. In this image we can see the almost sample-by-sample movement of the waveform (we’ll learn later what samples are). You can see that sound is pretty much a symmetrical type of affair—compression and rarefaction, or what goes up usually comes down. This is more or less a direct result of Newton’s third law of motion, which states that for every action there is an equal and opposite reaction. Figures 1.1 and 1.2 are often called functions. The concept of function is the simplest glue between mathematical and musical ideas. Sound files are in C:\bolum\bahar2014-2015\ce476\Section 1.1 Soundfile 1.1: Speaking Soundfile 1.2: Bird song Soundfile 1.3: Gamelan Soundfile 1.4: Tuba sounds Soundfile 1.5: More bird sounds Soundfile 1.6: Trumpet sound Soundfile 1.7: Loud sound that gets softer Soundfile 1.8: Whooshing Soundfile 1.9: Chirping Soundfile 1.10: Phat groove Sound as a Function Most of you probably have a favorite song, something that reminds you of a favorite place or person. 
But how about a favorite function? No, not something like a black-tie affair or a tailgate party; we mean a favorite mathematical function. http://music.columbia.edu/cmc/musicandcomputers/ -3- In fact, songs and functions aren’t so different. Music, or more generally sound, can be described as a function. Mathematical functions are like machines that take in numbers as raw material and, from this input, produce another number, which is the output. There are lots of different kinds of functions. Sometimes functions operate by some easily specified rule, like squaring. When a number is input into a squaring function, the output is that number squared, so the input 2 produces an output of 4, the input 3 produces an output of 9, and so on. For shorthand, we’ll call this function s. s(2) = 22 = 4 s(3) = 32 = 9 s(x) = x2 The last expression is really just an abbreviation that says for any number given as input to s, the number squared is the output. If the input is x, then the output is x2. Sometimes the input/output relation may be easy to describe, but often the actual cause and effect may be more complicated. For example, review the following function. Assume there’s a thermometer on the wall. Starting at 8 A.M., for any number t of minutes that have elapsed since 8 A.M., our function gives an output of the temperature at that time. So, for an input of 5, the output of our temperature function is the room temperature at 5 minutes after 8 A.M. The input 10 gives as output the room temperature at 10 minutes after 8 A.M., and so on. Once again, for shorthand we can abbreviate this and call the function f. f(5) = room temperature at 5 minutes after 8 A.M. f(10) = room temperature at 10 minutes after 8 A.M. f(t) = room temperature at t minutes after 8 A.M. You can see how this temperature function is a little like our previous sound amplitude graphs. The easiest way to understand the temperature function is according to its graph, the picture that helps us visualize the function. The two axes are the input and output. If an input is some number x units from 0 and the output is f(x) units (which could be a positive or negative number), then we place a mark at f(x) units above x. Assume the following: f(0) = 30 f(5) = 35 f(10) = 38 Figure 1.3 shows what happens when we graph these three temperatures. (Note that we’ll leave the x-axis in real time, but to be more precise we probably should have written 0, 5, and 10 there!) We’ll join these marks by a straight line. Figure 1.3 So how do we get a function out of sound or music? http://music.columbia.edu/cmc/musicandcomputers/ -4- A Kindergarten Example Imagine an entire kindergarten class piled on top of a trampoline in your neighbor’s backyard (yes, we know this would be dangerous!). The kids are jumping up and down like maniacs, and the surface of the trampoline is moving up and down in a way that is seemingly impossible to analyze. Suppose that before the kids jump on the trampoline, we paint a fluorescent yellow dot on the trampoline and then ask the kids not to jump on that dot so that we can watch how it moves up and down. The surface of the trampoline is initially at rest. The class climbs on. We take a stopwatch out of our pocket and yell "Go!" while simultaneously pressing the start button. As the kids go crazy, our job is to measure at each possible instant how far the yellow dot has moved from its rest position. If the dot is above the initial position, we measure it as positive (so a displacement of 3 cm up is recorded as +3). 
If the displacement is below the rest position, we measure it as negative (so a displacement of 3 cm down is recorded as -3). So follow the bouncing dot! It rises, then falls, sometimes a lot, sometimes a little, again and again. If we chart this bouncing dot on a moving piece of paper, we get the kind of function (of pressure, or deformation or perturbation) that we’ve been talking about. Let’s return to the idea of writing down a list of numbers corresponding to a set of times. Now we’re going to turn that list into the graph of a mathematical function! We’ll call that function F. On the horizontal line (the x-axis), we mark off the equally spaced numbers 1, 2, 3, and so on. Then we mark off on the vertical axis (the y-axis) the numbers 1, 2, 3, and so on, going up, and -1, -2, -3, and so on, going down. The numbers on the x-axis stand for time, and on the yaxis the numbers represent displacement. If at time N we recorded a displacement of 4, we put a dot at 4 units above N and we say that F(N) = 4. If we recorded a displacement of -2, we put a dot at the position 2 units below N and we say F(N) = -2. Each of the values F(N) is called a sample of the function F. We’ll learn later (in Section 2.1, when we talk about sampling a waveform) that this process of "every now and then" recording the value of a displacement in time is referred to as sampling, and it’s fundamental to computer music and the storage of digital data. Sampling is actually pretty simple. We regularly inspect some continuous movement and record its position. It’s like watching a marathon on television: you don’t really need to see the whole thing from start to finish—checking in every minute or so gives you a good sense of how the race was run. But suppose you could take a measurement at absolutely every instant in time—that is, take these measurements continuously. That would give you a lot of numbers (infinitely many, in fact, because who’s to say how small a moment in time can be?). Then you would have numbers above and below every point and get a picture something like Figures 1.1 and 1.2, which appear to be continuous. Actually, calling these axes x and y is not so instructive. It is better to call the y-axis "amplitude" and the x-axis "time." The following http://music.columbia.edu/cmc/musicandcomputers/ -5- Figure 1.4 Figure 1.5 Figure 1.6 When you hear something, this is in fact the end result of a very complicated sequence of events in your brain that was initiated by vibrations of your eardrum. The vibrations are caused by air molecules hitting the eardrum. Together they act a bit like waves crashing against a big rubber seawall (or those kids on the trampoline). These waves are the result of things like speaking, plucking a guitar string, hitting a key of the piano, the wind rustling leaves, or blowing into a saxophone. Each of these actions causes the air molecules near the sound source to be disturbed, like dropping many pebbles into a pond all at once. The resulting waves are sent merrily on their way toward you, the listener, and your eagerly awaiting eardrum. The corresponding function takes as input the number representing the time elapsed since the sound was initiated and returns a number that measures how far and in what direction your eardrum has moved at that instant. But what is your eardrum actually measuring? That’s what we’ll talk about next. 
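If you like to think in code, here is a minimal sketch (in Python, using numpy) of exactly this bookkeeping: a made-up displacement function stands in for the dot’s motion, and we sample it at regular stopwatch ticks to produce the values F(N). The particular function, the tick spacing, and the names are our own illustrative choices, not anything prescribed by the physics.

import numpy as np

def displacement(t):
    # A made-up stand-in for the dot's motion: a wobble that slowly dies away.
    return np.exp(-0.2 * t) * np.sin(2 * np.pi * 1.5 * t)

# "Every now and then" we look at the dot: here, once every 0.1 second for 5 seconds.
sample_times = np.arange(0.0, 5.0, 0.1)
F = displacement(sample_times)        # F[N] is the Nth sample of the function F

for N in range(5):
    # Positive values mean the dot is above its rest position, negative means below.
    print(f"F({N}) = {F[N]:+.3f}")

Print more samples, or sample more often, and the list of numbers starts to trace out the smooth curve that the dot actually follows.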
http://music.columbia.edu/cmc/musicandcomputers/ -6- Amplitude and Pressure In the graphs of sound waves shown in Figures 1.1 and 1.2, time was represented on the xaxis and amplitude on the y-axis. So as a function, time is the input and amplitude (or pressure) is the output, just like in the temperature example. As we’ll point out again and again in this chapter, one way to think about sound is as a sequence of time-varying amplitudes or pressures, or, more succintly, as a function of time. The amplitude (y-)axis of the graphs of sound represents the amount of air compression (above zero) or rarefaction (below zero) caused by a moving object, like vocal chords. Note that zero is the "rest" position, or pressure equilibrium (silence). Looking at the changes in amplitude over time gives a good idea of the amplitude shape or envelope of the sound wave. Actually, this amplitude shape might correspond closely to a number of things, including: • • • the actual vibration of the object the changes in pressure of the air, or water, or other medium the deformation (in or out) of the eardrum This picture of a sound wave, as with amplitudes in time, provides a nice visual metaphor for the idea of sound as a continuous sequence of pressure variations. When we talk about computers, this graph of pressure versus time becomes a picture of a list of numbers plotted against some variable (again, time). We’ll see in Chapter 2 how these numbers are stored and manipulated. Frequency: A Preview Amplitude is just one mathematical, or acoustical, characteristic of sound, just as loudness is only one of the perceptual characteristics of sounds. But, as we know, sounds aren’t only loud and soft. People often describe musical sounds as being "high" or "low." A bird tweeting may sound "high" to us, or a tuba may sound "low." But what are we really saying when we classify a sound as "high" or "low"? There’s a fundamental characteristic of these graphs of pressure in time that is less obvious to the eye but very obvious to the ear: namely, whether there is (or is not) a repeating pattern, and if so how quickly it repeats. That’s frequency! When we say that the tuba sounds are low and the bird sounds are high, what we are really talking about is the result of the frequency of these particular sounds—how fast a pattern in the sound’s graph repeats. In terms of waveforms, like what you saw and heard in the previous sound files, we can, for the moment, somewhat concisely state that the rate at which the air pressure fluctuates (moves in and out) is the frequency of the sound wave. We’ll learn a lot more about frequency and its related cognitive phenomenon, pitch, in Section 1.3. http://music.columbia.edu/cmc/musicandcomputers/ -7- How Our Ears Work Mathematical functions and kids jumping on a trampoline are one thing, but what’s the connection to sound and music? Just moving an eardrum in and out can’t be the whole story! Well, it isn't. The ear is a complex mechanism that tries to make sense out of these arbitrary functions of pressure in time and sends the information to the brain. We’ve already used the physical analogy of the trampoline as our eardrum and the kids as the air molecules set in motion by a sound source. But to cover the topic more completely, we need to discuss how sounds interact, via the eardrum, with the rest of our auditory system (including the brain). Our eardrums, like microphones and speakers, are in a sense transducers—they turn one form of information or energy into another. 
When sound waves reach our ears, they vibrate our eardrums, transferring the sound energy through the middle ear to the inner ear, where the real magic of human hearing takes place in a snail-shaped organ called the cochlea. The cochlea is filled with fluid and is bisected by an elastic partition called the basilar membrane, which is covered with hair cells. When sound energy reaches the cochlea, it produces fluid waves that form a series of peaks in the basilar membrane, the position and size of which depend on the frequency content of the sound. Different sections of the basilar membrane resonate (form peaks) at different frequencies: high frequencies cause peaks toward the front of the cochlea, while low frequencies cause peaks toward the back. These peaks match up with and excite certain hair cells, which send nerve impulses to the brain via the auditory nerve. The brain interprets these signals as sound, but as an interesting thought experiment, imagine extraterrestrials who might "see" sound waves (and maybe "hear" light). In short, the cochlea transforms sounds from their physical, time domain (amplitude versus time) form to the frequency domain (amplitude versus frequency) form that our brains understand. Pretty impressive stuff for a bunch of goo and some hairs! Figure 1.7 Diagram of the inner ear showing how sound waves that enter through the auditory canal are transformed into peaks, according to their frequencies, on the basilar membrane. In other words, the basilar membrane serves as a time-to-frequency converter, in order to prepare sonic information for its eventual cognition by the higher functions of the brain. Who’d have thought sound was this complicated! But keep in mind that the sound wave pressure picture is just raw data; it contains no frequency, timbral, or any other kind of http://music.columbia.edu/cmc/musicandcomputers/ -8- information. It needs a lot of processing, organization, and consideration to provide any sort of meaning to us higher species. We’ve made the hearing process seem pretty simple, but actually there’s a lot of controversy in current auditory cognition research about the specifics of this remarkable organ and how it works. As we understand more and more about the ear, musicians and scientists gain an increasing sense of understanding how we perceive sound, and even, some believe, how we perceive music. It’s an exciting field of research, and an active one! How Do We Describe Sound? Sound can be described in many ways. We have a lot of different words for sounds, and different ways of speaking about them. For example, we can call a sound "groovy," "dark," "bright," "intense," "low and rumbly," and so on. In fact, our colloquial language for talking about sound, from a scientific viewpoint, is pretty imprecise. Part of what we’re trying to do in computer music is to try to formulate more formal ways of describing sonic phenomena. That doesn’t mean that there’s anything wrong with our usual ways of talking about sounds: our current vocabulary actually works pretty well. But to manipulate digital signals with a computer, it is useful to have access to a different sort of description. We need to ask (and answer!) the following kinds of questions about the sound in question: • • • • • • • • How loud is it? What is its pitch? What is its spectrum? What frequencies are present? How loud are the frequencies? How does the sound change over time? Where is the sound coming from? 
• What’s a good guess as to the characteristics of the physical object that made the sound?

Even some of these questions can be broken down into lots of smaller questions. For example, what specifically is meant by "pitch"? Taken together, the answers to these questions and others help describe the various characteristics and features that for many years have been referred to collectively as the timbre (or "color") of a sound. But before we talk about timbre, let’s start with more basic concepts: amplitude and loudness (Section 1.2).

Part One: Sound and Timbre

Section 1.2: Amplitude and Loudness

In the previous section we talked briefly about how a function of amplitude in time could be thought of as a kind of sampling of a sound. (Remember, a sample is essentially a measurement of the amplitude of a sound at a point in time.) But knowing that at 200.056 milliseconds the amplitude of a sound is 0.2 doesn’t really help us understand much about the sound in most cases. What we need is some way of measuring the average amplitude of a group of samples (we sometimes call such a group a frame). We also need a way of understanding how these amplitudes, which are a physical measurement (like frequency), correspond to our perception of loudness, which is a psychophysical measure (anything that we perceive about the physical world is called "psychophysical"), or, more precisely, a psychoacoustic or cognitive measure (like pitch).

We’ll learn in Section 1.3 that amplitude and frequency are not independent: they both contribute to our perception of loudness; that is, we use them together (in a way described by something called the Fletcher-Munson curves, Section 1.3). But to describe that complex psychoacoustic or cognitive aggregate called loudness, we first need to understand something about amplitude and another related quantity called intensity. Then, at the end of our discussion of frequency (Section 1.3), we’ll return to an important way that frequency affects loudness (we’ll give you a little preview of this in this section as well).

In fact, it’s very important to realize that certain terms refer to physical or acoustic measures and others refer to cognitive ones. The cognitive measures are much higher level and often incorporate related effects from several acoustic phenomena. See Figure 1.8 to sort out this little terminological jungle!

Acoustic                                           Cognitive (psychoacoustic correlate)

Frequency: how fast something vibrates.       -->  Pitch: how high or low we perceive it.

Amplitude/Intensity: how much something       -->  Loudness: how loud we perceive it,
vibrates; how much the medium is displaced.        affected by pitch and timbre.

Waveshape; attack/decay (transients);         -->  Timbre (???)
modulation; spectral characteristics;
spectral, pitch, and loudness trajectory;
everything else...

Figure 1.8 Acoustic phenomena resulting from pressure variations and their psychophysical, or psychoacoustic (perceptual), counterparts. This chart gives some sense of how the terminology for sound varies depending on whether we talk about direct physical measures (frequency, amplitude) or cognitive ones (pitch, loudness). It’s true that pitch is largely a result of frequency, but be careful: they’re not the same thing.
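To make the idea of a frame’s "average amplitude" concrete, here is a minimal sketch in Python. It uses the RMS (root-mean-square) average, one common choice among several, and converts the result to decibels relative to an arbitrary reference level; the frame size, the test signal, and the reference are all our own assumptions, not anything fixed by the text.

import numpy as np

def frame_rms(samples, frame_size=512):
    # Average amplitude of each frame of samples, using the RMS (root-mean-square) average.
    n_frames = len(samples) // frame_size
    frames = samples[:n_frames * frame_size].reshape(n_frames, frame_size)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def to_db(amplitude, reference=1.0):
    # Express an amplitude in decibels relative to some chosen reference level.
    return 20.0 * np.log10(np.maximum(amplitude, 1e-12) / reference)

# A fake one-second "sound": a 440 Hz sine wave that fades away.
sr = 44100
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440.0 * t) * np.exp(-3.0 * t)

envelope = to_db(frame_rms(signal))
print(envelope[:5])    # a frame-by-frame amplitude contour, in dB

The frame-by-frame values form a rough amplitude envelope of the sound; they are still a physical measure, not loudness itself, for the reasons discussed above.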
Contents of C:\bolum\bahar2014-2015\ce476\Section 1.2: Xtra bit 1.2: Another sonic universe Applet 1.2: Changing amplitudes Soundfile 1.11: Two sine waves Soundfile 1.12: A sine wave and a triangular wave Soundfile 1.13: Chirp Xtra bit 1.3: More on decibels Applet 1.3: Amplitude versus decibel level Amplitude and Pitch Independence If you gently pluck a string on a guitar and then pluck it again, this time harder, what is the difference in the sounds you hear? You’ll hear the same pitch, only louder. That illustrates something interesting about the relationship between pitch and loudness—they are generally independent of one another. You can have two of the same frequencies at one loudness, or two different loudnesses at one frequency. You can visualize this by drawing a series of sine waves, each with the same period but different amplitude. Pure tones with the same period will generally be heard to have the same pitch—so all of the sine waves must be at the same frequency. We’ll see that pure tones correspond to variations in that old favorite function, the sine function. Remember that amplitude is not loudness (one is physical, the other is psychophysical), but for the moment let’s not make that distinction too rigid. Figure 1.9 Two sine waves with the same frequency, different amplitude. http://music.columbia.edu/cmc/musicandcomputers/ - 11 - Figure 1.10 Two sounds may have the same frequency but different waveforms (resulting in a different sense of timbre). Figure 1.11 We can draw an envelope over a sound file, which is the average, or smoothed, amplitude of the sound wave. This is roughly equivalent to what we perceive as the changes in loudness of the sound. (If we just take one average, we might call that the loudness of the whole sound.) Figure 1.12 Two sine waves starting at different phase points. These sine waves have different starting points— like the trampoline we talked about in the previous section. It can start flexed, either up or down. Remember that these starting points are very close together in time (tiny fractions of a second). We’ll talk more about phase and frequency later. http://music.columbia.edu/cmc/musicandcomputers/ - 12 - This simple picture directly above shows us something interesting about amplitude. The point where the colors overlap is where the waveforms will combine, either summing to a combined signal at a higher level or summing to a combined signal at a lower level. That is, when one waveform goes negative (compression, perhaps), it will counteract the other’s positive (rarefaction, perhaps). This is called phase cancellation, but the complexity with which this phenomenon occurs in the real sound world is obviously very great—lots and lots of sound waves going positive and negative all over the place. Intensity You can think of Soundfile 1.13 as a function that begins life as a sine function, f(x) = sin(x), and then over time morphs into a really fast oscillating sine function, like f(x) = sin(20,000x). As it morphs, we’ll continually listen to it. Soundfile 1.3, sometimes called a chirp in acoustic discussions, sweeps a sine wave over the frequency range from 0 Hz to 20,000 Hz. (Just so you know, this sonic phenomenon is not named for a bird chirp, but in fact for a radar chirp, which is a special function used in some radar work.) The amplitude of the sine wave in Soundfile 1.13 does not change, but the perceived loudness changes as it moves through areas of greater sensitivity in the Fletcher-Munson curve (Section 1.3). 
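For the curious, here is roughly how a test chirp like that can be synthesized: keep the amplitude fixed and let the instantaneous frequency climb from 0 Hz to 20,000 Hz, integrating it to get the sine’s phase. This is only a sketch (Python, with numpy and scipy); the ten-second duration, the 44,100 samples-per-second rate, and the output file name are our own choices, not details taken from Soundfile 1.13.

import numpy as np
from scipy.io import wavfile   # only used to write the result to disk

sr = 44100          # sample rate, our assumption
duration = 10.0     # length of the sweep in seconds, also our assumption
t = np.arange(int(sr * duration)) / sr

# The instantaneous frequency climbs linearly from 0 Hz to 20,000 Hz...
freq = 20000.0 * t / duration

# ...and the phase of the sine is the running sum (integral) of that frequency.
phase = 2 * np.pi * np.cumsum(freq) / sr
chirp = 0.5 * np.sin(phase)    # the amplitude stays fixed the whole time

wavfile.write("chirp.wav", sr, chirp.astype(np.float32))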
In other words, how loud we perceive something to be is mostly a result of amplitude, but it is also a result of frequency. This simple example shows how complicated our perception of even these simple phenomena is. We measure amplitude in volts, pressure, or even just sample numbers: it doesn’t really matter. We just graph a function of something moving, and then if we want, we can say what its average displacement was. If it was generally a big displacement (for example, sixth graders on the trampoline), we say it has a big amplitude. If, on the other hand, things didn’t move very much (say we drop a bunch of mice on the trampoline), we say that the sound function had a small amplitude. In the real world, things vibrate and send their vibrations through some medium (usually gas) to our eardrums. When we become interested in how amplitude actually affects a medium, we speak of the intensity of a sound in the medium. This is a little more specific than discussing amplitude, which is more a purely relative term. The medium we’re usually interested in is air (at sea level, 72°F), and we measure intensity as the amount of energy in a given air unit, the cubic meter. In this case, energy (or work done) is measured in watts, and we can then measure intensity in watts per meter2 (or wm2). As is the case with the perception of frequency as pitch, our perception of intensity as loudness is logarithmic. But what the heck does that mean? Logarithmic perception means that it takes more of a change in the amplitude to produce the same perceived change in loudness. http://music.columbia.edu/cmc/musicandcomputers/ - 13 - Bubba and You Think of it this way. Let’s say you work at a burger joint flipping burgers, and you make $8.00 an hour. Let’s say your supervisor, Bubba, makes $9.00 an hour. Now let’s say the burger corporation hits it big with their new broccoli sandwich, and they decide to put every employee on a monthly raise schedule. They decide to give Bubba a dollar a month raise, and you a 7% raise each month. Bubba thinks this is great! That means the first month you only get $8.56, and Bubba gets $10.00. The next month, Bubba gets $11.00, and you get $9.15. This means that you now got a 59¢ raise, or that your raise went up while his remained the same. The equation for your salary for any given month is: new salary = old salary + (0.07 X old salary) Bubba’s is: new salary = old salary + 1 You’re getting an increase by a fixed ratio of your salary, which itself is increasing, while Bubba’s raise/salary ratio is actually decreasing. (The first month he got 1/9, the next month 1/10—at this rate he’ll approach a zero percent raise if he works at the burger place long enough.) Figure 1.13 shows what the salary raises look like as functions. Figure 1.13 Bubba & U. This fundamental difference between ratiometric change and fixed arithmetic change is very important. We tend to perceive most changes not in terms of absolute quantities (in this case, $1.00), but in terms of relative quantities (percentage of a quantity). Changes in both amplitude and frequency are also perceived in terms of ratios. In the case of amplitude, we have a standard measure, called the decibel (dB), which can describe how loud http://music.columbia.edu/cmc/musicandcomputers/ - 14 - a sound is perceived. As a convention, silence, which is 0 dB, is set to 10-12 wm2. 
This is not really silence in the absolute sense, because things are still vibrating, but it is more or less what you would hear in a very quiet recording studio with nobody making any sound. There is still air movement and other sound-producing activity. (There are rooms called anechoic chambers, used to study sound, that try to get things much quieter than 0 dB—but they’re very unusual places to be.) Any change of 10 dB corresponds roughly to a doubling of perceived loudness. So, for example, going from 10 dB to 20 dB or from 12 dB to 22 dB is perceived as a doubling of perceived sound pressure level. Source Average dB silence 0 whisper, rustling of leaves on a romantic summer evening 30 phone ringing, normal conversation 60 car engine 70 diesel truck, heavy city traffic, vacuum cleaner, factory noise 80 power lawn mower, subway train 90 chain saw, rock concert 110 jet takeoff, gunfire, Metallica (your head near the guitar player’s amplifier) 120+ Table 1.1 Average decibel levels for environmental sounds. 120 dB is often called the threshold of pain. Good name for a rock band, huh? Note that even brief exposure to very loud (90 dB or greater) sounds or constant exposure to medium-level (60 dB to 80 dB) sounds can cause hearing loss. Be careful with your ears— invest in some earplugs! "Audience participation: During tests in the Royal Festival Hall, a note played mezzoforte on the horn measured approximately 65 decibels of sound. A single uncovered cough gave the same reading. A handkerchief placed over the mouth when coughing assists in obtaining a pianissimo." http://music.columbia.edu/cmc/musicandcomputers/ - 15 - Chapter 1: The Digital Representation of Sound, Part One: Sound and Timbre Section 1.3: Frequency, Pitch, and Intervals Contents of C:\bolum\bahar2014-2015\ce476\Section 1.3: Xtra bit 1.4: Tapping a frequency Applet 1.4: Hearing frequency and amplitude Soundfile 1.14: Two sine waves, different frequencies Applet 1.5: When ticks become tones Soundfile 1.15: Low gong Applet 1.6: Transpose using multiply and add Applet 1.7: Octave quiz Applet 1.8: Fletcher-Munson example What is frequency? Essentially, it’s a measurement of how often a given event repeats in time. If you subscribe to a daily paper, then the frequency of paper delivery could be described as once per day, seven times per week. When we talk about the frequency of a sound, we’re referring to how many times a particular pattern of amplitudes repeats during one second. Not all waveforms or physical vibrations repeat exactly (in fact almost none do!). But many vibratory phenomena, especially those in which we perceive some sort of pitch, repeat approximately regularly. If we assume that in fact they are repeating, we can measure the rate of repetition, and we call that the waveform’s frequency. http://music.columbia.edu/cmc/musicandcomputers/ - 16 - Figure 1.14 Two sine waves. The frequency of the red wave is twice that of the blue one, but their amplitudes are the same. It would be difficult or impossible to actually hear this as two distinct tones, since the octaves fuse into one sound. As we’ll discuss later, we measure frequencies in cycles per second, or hertz (Hz). Click on the soundfile icon to hear two sine tones: one at 400 Hz and one at 800 Hz. Sine Waves A sine wave is a good example of a repeating pattern of amplitudes, and in some ways it is the simplest. That’s why sine waves are sometimes referred to as simple harmonic motions. 
Let’s arbitrarily fix an amplitude scale to be from –1 to 1, so the sine wave goes from 0 to 1 to 0 to –1 to 0. If the complete cycle of the sine wave’s curve takes one second to occur, then we say that it has a frequency of one cycle per second (cps), or one hertz (Hz); 1,000 Hz is one kilohertz (kHz). The frequency range of sound (or, more accurately, of human hearing) is usually given as 0 Hz to 20 kHz, but our ears don’t fuse very low frequency oscillations (0 Hz to 20 Hz, called the infrasonic range) into a pitch. Low frequencies just sound like beats. These numbers are just averages: some people hear pitches as low as 15 Hz; others can hear frequencies significantly higher than 20 kHz. A lot depends on the amplitude, the timbre, and other factors. The older you get (and the more rock n’ roll you listened to!), the more your ears become insensitive to high frequencies (a natural biological phenomenon called presbycusis).

Source            Lowest frequency (Hz)    Highest frequency (Hz)
piano             27.5                     4,186
female speech     140                      500
male speech       80                       240
compact disc      0                        22,050
human hearing     20                       20,000

Table 1.2 Some frequency ranges.

Period

When we talk about frequency in music, we are referring to how often a sonic event happens in its entirety over the course of a specified time segment. For a sound to have a perceived frequency, it must be periodic (repeat in time). Since the period of an event is the length of time it takes the event to occur, it’s clear that the two concepts (periodicity and frequency) are closely related, indeed pretty much equivalent. The period of a repeating waveform is the length of time it takes to go through one cycle. The frequency is the inverse: how many times the waveform repeats that cycle per unit time. We can understand the periodicity of sonic events just as we understand that the period of a daily newspaper delivery is one day.

Since a 20 Hz tone by definition is a cycle that repeats 20 times a second, in 1/20th of a second one cycle goes by, so a 20 Hz tone has a period of 1/20, or 0.05, second. Now the "thing" that repeats is one basic unit of this regularly repeating wave, such as a sine wave (at the beginning of this section there’s a picture of two sine waves together). It’s not hard to see that the time it takes for one copy of the basic wave to recur (or move through whatever medium it is in) is proportional to the distance from crest to crest (or between any two successive corresponding points, for that matter) of the sine wave. This distance is called the wavelength of the wave (or of the periodic function). In fact, if you know how fast the wave is moving, it is easy to figure out the wavelength from the period. Physically, the wavelength is proportional to the period (and therefore inversely proportional to the frequency). Wavelength is a spatial measure that says how far the wave travels in space in one period. We measure it in distance, not time. The speed of sound (s) is about 345 meters/second. To find the wavelength (w) for a sound of a given frequency f, first we invert the frequency (1/f) to get its period (p), and then we use the following simple formula:

w = s × p = s / f

Figure 1.15 Very low musical sounds can have very long wavelengths: some central Javanese (from Indonesia) gongs vibrate at around 8 Hz to 10 Hz, and as such their wavelengths are on the order of 35 to 43 m. Look at the size of the gong in this photo! It makes some very low sounds.
(The musicians above are from STSI Bandung, a music conservatory in West Java, Indonesia, participating in a recording session for a piece called "mbuh," by the contemporary composer Suhendi. This recording can be heard on the CD Asmat Dream: New Music Indonesia, Lyrichord Compact Disc #7415.) Using the formula, we find that the wavelength of a 1 Hz tone is 345 meters, which makes sense, since a 1 Hz tone has a period of 1 second, and sound travels 345 meters in one second! That’s pretty far, until you realize that, since these waveforms are usually symmetrical, if you were standing, say, at 172.5 meters from a vibrating object making a 1 Hz tone and right behind you was a perfectly reflective surface, it’s entirely possible that the negative portion of the waveform might cancel out the positive and you’d hear nothing! While this is a rather extreme and completely hypothetical example, it is true that wave cancellation is a common physical occurrence, though it depends on a great many parameters. http://music.columbia.edu/cmc/musicandcomputers/ - 18 - Pitch Musicians usually talk to each other about the frequency content of their music in terms of pitch, or sets of pitches called scales. You’ve probably heard someone mention a G-minor chord, a blues scale, or a symphony in C, but has anyone ever told you about the new song they wrote with lots of 440 cycles per seconds in it? We hope not! Humans tend to recognize relative relationships, not absolute physical values. And when we do, those relationships (especially in the aural domain) tend to be logarithmic. That is, we don’t perceive the difference (subtraction) of two frequencies, but rather the ratio (division). This means that it is much easier for most humans to hear or describe the relationship or ratio between two frequencies than it is to name the exact frequencies they are hearing. And in fact, for most of us, the exact frequencies aren’t even very important—we recognize "Row, Row, Row Your Boat" regardless of what frequency it is sung at, as long as the relationships between the notes are more or less correct. The common musical term for this is transposition: we hear the tune correctly no matter what key it’s sung in. Although pitch is directly related to frequency, they’re not the same. As we pointed out earlier, and similar to what we saw in Section 1.2 when we discussed amplitude and loudness, frequency is a physical or acoustical phenomenon. Pitch is perceptual (or psychoacoustic, cognitive, or psychophysical). The way we organize frequencies into pitches will be no surprise if you understood what we talked about in the previous section: we require more and more change in frequency to produce an equal perceptual change in pitch. Once again, this is all part of that logarithmic perception "thing" we’ve been yammering on about, because the way we describe that increase is by logarithms and exponentials. Here’s a simple example: the difference to our ears between 101 Hz and 100 Hz is much greater than the difference between 1,001 Hz and 1,000 Hz. We don’t hear a change of 1 Hz for each; instead we hear a change of 1,001/1,000 (= 1.001) as compared to a much bigger change of 101/100 (= 1.01). Intervals and Octaves So we don’t really care about the linear, or arithmetic, differences between frequencies; we are almost solely interested in the ratio of two frequencies. We call those ratios intervals, and almost every musical culture in the world has some term for this concept. 
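A few lines of Python make the "ratios, not differences" point concrete. Below we express an interval both in octaves (the log, base 2, of the frequency ratio) and in cents, the conventional fine-grained unit of 1,200 per octave; the example frequencies are ours, and cents are standard music-theory bookkeeping rather than something introduced earlier in this chapter.

import math

def interval_in_octaves(f1, f2):
    # How far apart two frequencies sound: the log (base 2) of their ratio.
    return math.log2(f2 / f1)

def interval_in_cents(f1, f2):
    # The same interval in cents: 1,200 cents per octave.
    return 1200.0 * math.log2(f2 / f1)

# The same 1 Hz difference is a very different interval in different registers:
print(interval_in_cents(100.0, 101.0))     # about 17.2 cents, clearly audible
print(interval_in_cents(1000.0, 1001.0))   # about 1.7 cents, essentially imperceptible

# The same 2:1 ratio is always the same interval, no matter the register:
print(interval_in_octaves(110.0, 220.0))   # 1.0
print(interval_in_octaves(220.0, 440.0))   # 1.0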
In Western music, the 2:1 ratio is given a special importance, and it’s called an octave. It seems clear (though not totally unarguable) that most humans tend to organize the frequency spectrum between 20 Hz and 20 kHz roughly into octaves, which means powers of 2. That is, we perceive the same pitch difference between 100 Hz and 200 Hz as we do between 200 Hz and 400 Hz, 400 Hz and 800 Hz, and so on. In each case, the ratio of the two frequencies is 2:1. We sometimes call this base-2 logarithmic perception. Many theorists believe that the octave is somehow fundamental to, or innate and hard-wired in, our perception, but this is difficult to prove. It’s certainly common throughout the world, though a great deal of approximation is tolerated, and often preferred! http://music.columbia.edu/cmc/musicandcomputers/ - 19 - In almost all musical cultures, pitches are named not by their actual frequencies, but as general categories of frequencies in relationship to other frequencies, all a power of 2 apart. For example, A is the name given to the pitch on the piano or clarinet with a frequency of 440 Hz as well as 55 Hz, 110 Hz, 220 Hz, 880 Hz, 1760 Hz, and so on. The important thing is the ratio between the frequencies, not the distance; for example, 55 Hz to 110 Hz is an octave that happens to span 55 Hz, yet 50 Hz to 100 Hz is also an octave, even though it only covers 50 Hz. But if an orchestra tunes to a different A (as most do nowadays, for example, to middle A = 441 Hz or 442 Hz to sound higher and brighter), those frequencies will all change to be multiples/divisors of the new absolute A. Figure 1.16 The red graph shows a series of octaves starting at 110 Hz (an A); each new octave is twice as high as the last one. The blue graph shows linearly increasing frequencies, also starting at 110 Hz. We say that the frequencies in the blue graph are rising linearly, because to get each new frequency we simply add 110 Hz to the last frequency: the change in frequency is always the same. However, to get octaves we must double the frequency, meaning that the difference (subtractively) in frequency between two adjacent octaves is always increasing. One thing is clear, however: to have pitch, we need frequency, and thus periodic waveforms. Each of these three concepts implies the other two. This relationship is very important when we discuss how frequency is not used just for pitch, but also in determining timbre. To get some sense of this, consider that the highest note of a piano is around 4 kHz. What about the rest of the range, the almost 18 kHz of available sound? It turns out that this larger frequency range is used by the ear to determine a sound’s timbre. We will discuss timbre in Section 1.4. Before we move on to timbre, though, we should mention that pitch and amplitude are also related. When we hear sounds, we tend to compare them, and think of their amplitudes, in terms of loudness. The perceived loudness of a sound depends on a combination of factors, including the sound’s amplitude and frequency content. For example, given two sounds of http://music.columbia.edu/cmc/musicandcomputers/ - 20 - very different frequencies but at exactly the same amplitude, the lower-frequency sound will often seem softer. Our ear tends to amplify certain frequencies and attenuate others. 
Figure 1.17 Equal-loudness contours (often referred to as Fletcher-Munson curves) are curves that tell us how much intensity is needed at a certain frequency in order to produce the same perceived loudness as a tone at a different frequency. They’re sort of "loudness isobars." If you follow one line (which meanders across intensity levels and frequencies), you’re following an equal loudness contour. For example, for a 50 Hz tone to sound as loud as a 1000 Hz tone at 20 dB, it needs to sound at about 55 dB. These curves are surprising, and they tell us some important things about how our ears evolved. While these curves are the result of rather gross data generalization, and while they will of course vary depending on the different sonic environments to which listeners are accustomed, they also seem to be rather surprisingly accurate across cultures. Perhaps they represent something that is hardwired rather than learned. These curves are widely used by audio manufacturers to make equipment more efficient and sound more realistic. When looking at Figure 1.17 for the Fletcher-Munson curves, note how the curves start high in the low frequencies, dip down in the mid-frequencies, and swing back up again. What does this mean? Well, humans need to be very sensitive to the mid-frequency range. That’s how, for instance, you can tell immediately if your mom’s upset when she calls you on the phone. (Phones, by the way, cut off everything above around 7 kHz.) Most of the sounds we need to recognize for survival purposes occur in the mid-frequency range. Low frequencies are not too important for survival. The nuances and tiny inflections in speech and most of sonic sound tend to happen in the 500 Hz to 2 kHz range, and we have evolved to be extremely sensitive in that range (though it’s hard to say which came first, the evolution of speech or the evolution of our sensitivity to speech sounds. This mid-frequency range sensitivity is probably a universal human trait and not all that culturally dependent. So, if you’re traveling far from home, you may not be able to understand the person at the table in the café next to you, but if you whistle the FletcherMunson curves, you’ll have a great time together. http://music.columbia.edu/cmc/musicandcomputers/ - 21 - Chapter 1: The Digital Representation of Sound, Part One: Sound and Timbre Section 1.4: Timbre Contents of C:\bolum\bahar2014-2015\ce476\Section 1.4: Applet 1.9: Draw a waveform Soundfile 1.16: Tuning fork at 256 Hz Soundfile 1.17: Sine wave at 256 Hz Xtra bit 1.6: Tuning forks Soundfile 1.18: Composer Warren Burt’s piece for tuning forks Soundfile 1.19: Clarinet sound Soundfile 1.20: Clarinet with attack lopped off Soundfile 1.21: Flute sound Soundfile 1.22: Flute with attack lopped off Soundfile 1.23: Piano sound Soundfile 1.24: Piano with attack lopped off Soundfile 1.25: Trombone (not played underwater!) Soundfile 1.26: Trombone with attack lopped off Soundfile 1.27: Violin Soundfile 1.28: Violin with attack lopped off Soundfile 1.29: Voice Soundfile 1.30: Voice with attack lopped off What’s the difference between a tuba and a flute (or, more accurately, the sounds of each)? How do we tell the difference between two people singing the same song, when they’re singing exactly the same notes? Why do some guitars "sound" better than others (besides the fact that they’re older or cost more or have Eric Clapton’s autograph on them)? What is it that makes things "sound" like themselves? 
It’s not necessarily the pitch of the sound (how high or low it is)—if everyone in your family sang the same note, you could almost surely tell who was who, even with your eyes closed. http://music.columbia.edu/cmc/musicandcomputers/ - 22 - It’s also not just the loudness—your voice is still your voice whether you talk softly or scream at the top of your lungs. So what’s left? The answer is found in a somewhat mysterious and elusive thing we call, for lack of a better word, "timbre," and that’s what this section is all about. "Timbre" (pronounced "tam-ber") is a kind of sloppy word, inherited from previous eras, that lumps together lots of things that we don’t fully understand. Some think we should abandon the word and concept entirely! But it’s one of those words that gets used a lot, even if it doesn’t make much sense, so we’ll use it here too—we’re sort of stuck with it for the time being. What Makes Up Timbre? Timbre can be roughly defined as those qualities of a sound that aren’t just frequency or amplitude. These qualities might include: • • spectra: the aggregate of simpler waveforms (usually sine waves) that make up what we recognize as a particular sound. This is what Fourier analysis gives us (we’ll discuss that in Chapter 3). envelope: the attack, sustain, and decay portions of a sound (often referred to as transients). Envelope and spectra are very complicated concepts, encompassing a lot of subcategories. For example, spectral features are very important, different ways that the spectral aggregates are organized statistically in terms of shape and form (e.g., the relative "noisiness" of a sound is a result, in large part, of its spectral relationships). Many facets of envelope (onset time, harmonic decay, spectral evolution, steady-state modulations, etc.) are not easily explained by just looking at the envelope of a sound. Researchers spend a great deal of time on very specific aspects of these ideas, and it’s an exciting and interesting area for computer musicians to research. Figure 1.18 shows a simplified picture of the envelope of a trumpet tone. Figure 1.18 This image illustrates the attack, sustain, and decay portions of a standard amplitude envelope. This is a very simple, idealized picture, called a trapezoidal envelope. We are not aware of any actual, natural occurrence of this kind of straight-lined sound! http://music.columbia.edu/cmc/musicandcomputers/ - 23 - It’s helpful here to bring another descriptive term into our vocabulary: spectrum. Spectrum is defined by a waveform’s distribution of energy at certain frequencies. The combination of spectra (plural of spectrum) and envelope helps us to define the "color" of a sound. Timbre is difficult to talk about, because it’s hard to measure something subjective like the "quality" of a sound. This concept gives music theorists, computer musicians, and psychoacousticians a lot of trouble. However, computers have helped us make great progress in the exploration and understanding of the various components of what’s traditionally been called "timbre." Basic Elements of Sound As we’ve shown, the average piece of music can be a pretty complicated function. Nevertheless, it’s possible to think of it as a combination of much simpler sounds (and hence simpler functions)—even simpler than individual instruments. The basic atoms of sound, the sinusoids (sine waves) we talked about in the previous sections, are sometimes called pure tones, like those produced when a tuning fork vibrates. 
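Since pure tones come up again and again, here is a small sketch (Python, with numpy and scipy) that generates one: a 256 Hz sinusoid like the tuning fork in the soundfiles that follow, plus a crudely "struck" version whose amplitude dies away exponentially. The sample rate, duration, decay time, and file names are all invented for the example; a real tuning fork’s decay is of course more complicated.

import numpy as np
from scipy.io import wavfile

sr = 44100                           # sample rate, an assumption
freq = 256.0                         # the tuning fork frequency used below
t = np.arange(int(sr * 2.0)) / sr    # two seconds of time points

pure = 0.5 * np.sin(2 * np.pi * freq * t)   # a steady pure tone (sinusoid)
struck = pure * np.exp(-1.5 * t)            # a crude imitation of a struck, ringing fork

wavfile.write("sine_256.wav", sr, pure.astype(np.float32))
wavfile.write("fork_256.wav", sr, struck.astype(np.float32))

Both files contain the same single sinusoid; only the amplitude envelope differs, which is part of what the envelope discussion above is getting at.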
We use the tuning fork to talk about these tones because it is one of the simplest physical vibrating systems. Although you might think that a discussion of tuning forks belongs more in a discussion of frequency, we’re going to use them to introduce the notion of sinusoids: Fourier components of a sound. Figure 1.19 This tuning fork rings at 256 Hz. You can hear the sound it makes by clicking on Soundfile 1.16. Compare the tuning fork sound to that of a pure sine wave at the same frequency of 256 Hz. When you hit the tines of a tuning fork, it vibrates and emits a very pure note or tone. Tuning forks are able to vibrate at very precise frequencies. The frequency of a tuning fork is the number of times the tip goes back and forth in a second. And this number won’t change, no matter how hard you hit that fork. As we mentioned, the human ear is capable of hearing sounds that vibrate all the way from 20 times a second to 20,000 times a second. Lowfrequency sounds are like bass notes, and high-frequency sounds are like treble notes. (Low frequency means that the tines vibrate slowly, and high frequency means that they vibrate quickly.) http://music.columbia.edu/cmc/musicandcomputers/ - 24 - Figure 1.20 Click on the different tuning forks to see the different audiograms. Notice what they have in common! They are all roughly the same shape—simple waves that differ only in the width of the regularly repeating peaks. The higher tones give more peaks over the same interval. In other words, the peaks occur more frequently. (Get it? Higher frequency!) When you whack the tines of a tuning fork, the fork vibrates. The number of times the tines go back and forth in one second determines the frequency of a particular tuning fork. Click Soundfile 1.18. You’ll hear composer Warren Burt’s piece for tuning forks, "Improvisation in Two Ancient Greek Modes." Now, why do tuning fork functions have their simple, sinusoidal shape? Think about how the tip of the tuning fork is moving over time. We see that it is moving back and forth, from its greatest displacement in one direction all the way back to just about the same displacement in the opposite direction. Imagine that you are sitting on the end of the tine (hold on tight!). When you move to the left, that will be a negative displacement; and when you move to the right, that will be a positive displacement. Once again, as time progresses we can graph the function that at each moment in time outputs your position. Your back-and-forth motion yields the functions many of you remember from trigonometry: sines and cosines. Figure 1.21 Sine and cosine waves. Thanks to Wayne Mathews for this image. Any sound can be represented as a combination of different amounts of these sines and cosines of varying frequencies. The mathematical topic that explains sounds and other wave phenomena is called Fourier analysis, named after its discoverer, the great 18th century mathematician Jean Baptiste Joseph Fourier. http://music.columbia.edu/cmc/musicandcomputers/ - 25 - Figure 1.22 Spectra of (a) sawtooth wave and (b) square wave. Figure 1.22 shows the relative amplitudes of sinusoidal components of simple waveforms. For example, Figure 1.22(a) indicates that a sawtooth wave can be made by addition in the following way: one part of a sine wave at the fundamental frequency (say, 1 Hz), then half as much of a sine wave at 2 Hz, and a third as much at 3 Hz, and so on, infinitely. 
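That recipe translates directly into a few lines of code. The sketch below (Python with numpy) adds up the first N harmonics of a sawtooth, placing the kth harmonic at k times the fundamental with amplitude 1/k; the fundamental frequency and the values of N are arbitrary choices of ours. However many terms you add, small ripples remain near the jumps, which is the Gibbs ringing discussed just below.

import numpy as np

def sawtooth_partial_sum(fundamental, n_harmonics, t):
    # Add up sine waves: the kth harmonic sits at k times the fundamental,
    # with amplitude 1/k, following the recipe described for Figure 1.22(a).
    wave = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        wave += np.sin(2 * np.pi * k * fundamental * t) / k
    return wave

sr = 44100
t = np.arange(sr) / sr                          # one second of time points
rough = sawtooth_partial_sum(110.0, 10, t)      # 10 harmonics: already buzzy
closer = sawtooth_partial_sum(110.0, 100, t)    # 100 harmonics: much closer to a sawtooth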
In Section 4.2, we’ll talk about using the Fourier technique in synthesizing sound, called additive synthesis. If you want to jump ahead a bit, try the applet in Section 4.2, that lets you build simple waveforms from sinusoidal components. Notice that when you try to build a square wave, there are little ripples on the edges of the square. This is called Gibbs ringing, and it has to do with the fact that the sum of any finite number of these decreasing amounts of sine waves of increasing frequency is never exactly a square wave. What the charts in Figure 1.22 mean is that if you add up all those sinusoids whose frequencies are integer multiples of the fundamental frequency of the sound and whose amplitudes are described in the charts by the heights of the bars, you’ll get the sawtooth and square waves. This is what Fourier analysis is all about: every periodic waveform (which is the same, more or less, as saying every pitched sound) can be expressed as a sum of sines whose frequencies are integer multiples of the fundamental and whose amplitudes are unknown. The sawtooth and square wave charts in Figure 1.22 are called spectral histograms (they don’t show any evolution over time, since these waveforms are periodic). These sine waves are sometimes referred to as the spectral components, partials, overtones, or harmonics of a sound, and they are what was thought to be primarily responsible for our sense of timbre. So when we refer to the tenth partial of a timbre, we mean a sinusoid at 10 times the frequency of the sound’s fundamental frequency (but we don’t know its amplitude). The sounds in some of the following soundfiles are conventional instruments with their attacks lopped off, so that we can hear each instrument as a different periodic waveform and listen to each instrument’s different spectral configurations. Notice that, strangely enough, the clarinet (whose sound wave is a lot like a sawtooth wave) and the flute, without their attacks, are not all that different (in the grand scheme of things). http://music.columbia.edu/cmc/musicandcomputers/ - 26 - Figure 1.25 Piano. Figure 1.23 Clarinet. Figure 1.24 Flute. Figure 1.26 Picture courtesy of Bob Hovey www.TROMBONISTICALISMS.bigstep.com. Figure 1.27 Violin. http://music.columbia.edu/cmc/musicandcomputers/ - 27 - Chapter 2: The Digital Representation of Sound, Part Two: Playing by the Numbers Section 2.1: Digital Representation of Sound The world is continuous. Time marches on and on, and there are plenty of things that we can measure at any instant. For example, weather forecasters might keep an ongoing record of the temperature or barometric pressure. If you are in the hospital, the nurses might be keeping a record of your temperature, your heart rate, or your brain waves. Any of these records gives you a function f(t) where, at a given time t, f(t) is the value of the particular statistic that interests you. These sorts of functions are called time series. Figure 2.1 Genetic algorithms (GA) are evolutionary computing systems, which have been applied to things like optimization and machine learning. These are examples of continuous functions. What we mean by "continuous" is that at any instant of time the functions take on a well-defined value, so that they make squiggly line graphs that could be traced without the pencil ever leaving the paper. This might also be called an analog function. Copyright Juha Haataja/Center for Scientific Computing, Finland Of course, the time series that interest us are those that represent sound. 
In particular, we want to take these time series, stick them on the computer, and play with them! Now, if you’ve been paying attention, you may realize that at this moment we’re in a bit of a bind: the type of time series that we’ve been describing is a continuous function. That is, at every instant in time, we could write down a number that is the value of the function at that instant—whether it be how much your eardrum has been displaced, what your temperature is, what your heart rate is, and so on. But such a continuous function would provide an infinite list of numbers (any one of which may have an infinite expansion, like = 3.1417...), and no matter how big your computer is, you’re going to have a pretty http://music.columbia.edu/cmc/musicandcomputers/ - 28 - tough time fitting an infinite collection of numbers on your hard drive. So how do we do it? How can we represent sound as a finite collection of numbers that can be stored efficiently, in a finite amount of space, on a computer, and played back and manipulated at will? In short, how do we represent sound digitally? That’s the problem that we’ll start to investigate in this chapter. Somehow we have to come up with a finite list of numbers that does a good job of representing our continuous function. We do it by sampling the original function, at every few instants (at some predetermined rate, called the sampling rate), recording the value of the function at that moment. For example, maybe we only record the temperature every 5 minutes. For sound we need to go a lot faster, and we often use a special device that grabs instantaneous amplitudes at rapid, audio rates (called an analog to digital converter, or ADC). A continuous function is also called an analog function, and, to restate the problem, we have to convert analog functions to lists of samples, or digital functions, which is the fundamental way that computers store information. In computers, think of a digital function as a function of location in computer memory. That is, these functions are stored as lists of numbers, and as we read through them we are basically creating a discrete function of time of individual amplitude values. Figure 2.2 A pictorial description of the recording and playback of sounds through an ADC/DAC. In analog to digital conversions, continuous functions (for example, air pressures, sound waves, or voltages) are sampled and stored as numerical values. In digital to analog conversions, numerical values are interpolated by the converter to force some continuous system (such as amplifiers, speakers, and subsequently the air and our ears) into a continuous vibration. Interpolation just means smoothly transitioning across the discrete numerical values. When we digitally record a sound, an image, or even a temperature reading, we numerically represent that phenomenon. To convert sounds between our analog world and the digital world of the computer, we use a device called an analog to digital converter (ADC). A digital to analog converter (DAC) is used to convert these numbers back to sound (or to make the numbers usable by an analog device, like a loudspeaker). An ADC takes smooth functions (of the kind found in the physical world) and returns a list of discrete values. A DAC takes a list of discrete values (like the kind found in the computer world) and returns a smooth, continuous function, or more accurately the ability to create such a function from the computer memory or storage medium. 
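Here is a toy model of that conversion in Python. A "continuous" signal is played by an ordinary function that can be evaluated at any instant; our pretend ADC simply evaluates it at evenly spaced sample times and keeps the resulting finite list of numbers. The particular signal, the 8,000-samples-per-second rate, and the names are stand-ins for illustration, not a description of real converter hardware.

import numpy as np

def analog_signal(time):
    # A stand-in for a continuous, real-world signal (here, two sine waves added together).
    return 0.6 * np.sin(2 * np.pi * 220.0 * time) + 0.3 * np.sin(2 * np.pi * 330.0 * time)

def adc(signal, sampling_rate, duration):
    # A toy ADC: look at the signal only at evenly spaced instants
    # and keep the resulting finite list of numbers.
    times = np.arange(int(sampling_rate * duration)) / sampling_rate
    return signal(times)

samples = adc(analog_signal, sampling_rate=8000, duration=0.01)
print(len(samples))      # 80 numbers now stand in for the continuous curve
print(samples[:4])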
http://music.columbia.edu/cmc/musicandcomputers/ - 29 - Figure 2.3 Two graphical representations of sound. The top is our usual time domain graph, or audiogram, of the waveform created by a five-note whistled melody. Time is on the x-axis, and amplitude is on the yaxis. The bottom graph is the same melody, but this time we are looking at a time-frequency representation. The idea here is that if we think of the whistle as made up of contiguous small chunks of sound, then over each small time period the sound is composed of differing amounts of various pieces of frequency. The amount of frequency y at time t is encoded by the brightness of the pixel at the coordinate (t, y). The darker the pixel, the more of that frequency at that time. For example, if you look at time 0.4 you see a band of white, except near 2,500, showing that around that time we mainly hear a pure tone of about 2,500 Hz, while at 0.8 second, there are contributions all around from about 0 Hz to 3,000 Hz, but stronger ones at about 2,500 Hz and 200 Hz. http://music.columbia.edu/cmc/musicandcomputers/ - 30 - We’ll be giving a much more precise description of the frequency domain in Chapter 3, but for now we can simply think of sounds as combinations of more basic sounds that are distinguished according to their "brightness." We then assign numbers to these basic sounds according to their brightness. The brighter a sound, the higher the number. As we learned in Chapter 1, this number is called the frequency, and the basic sound is called a sinusoid, the general term for sinelike waveforms. So, high frequency means high brightness, and low frequency means low brightness (like a deep bass rumble), and in between is, well, simply in between. All sound is a combination of these sinusoids, of varying amplitudes. It’s sort of like making soup and putting in a bunch of basic spices: the flavor will depend on how much of each spice you include, and you can change things dramatically when you alter the proportions. The sinusoids are our basic sound spices! The complete description of how much of each of the frequencies is used is called the spectrum of the sound. Since sounds change over time, the proportions of each of the sinusoids changes too. Just like when you spice up that soup, as you let it boil the spice proportion may be changing as things evaporate. http://music.columbia.edu/cmc/musicandcomputers/ - 31 - Chapter 2: The Digital Representation of Sound, Part Two: Playing by the Numbers Section 2.2: Analog Versus Digital http://music.columbia.edu/cmc/musicandcomputers/applets/2_2_sampled_fader.php Applet 2.1: Sampled fader The distinction between analog and digital information is not an easy one to understand, but it’s fundamental to the realm of computer music (and in fact computer anything!). In this section, we’ll offer some analogies, explanations, and other ways to understand the difference more intuitively. Imagine watching someone walking along, bouncing a ball. Now imagine that the ball leaves a trail of smoke in its path. What does the trail look like? Probably some sort of continuous zigzag pattern, right? Figure 2.4 The path of a bouncing ball. OK, keep watching, but now blink repeatedly. What does the trail look like now? Because we are blinking our eyes, we’re only able to see the ball at discrete moments in time. It’s on the way up, we blink, now it’s moved up a bit more, we blink again, now it may be on the way down, and so on. 
We’ll call these snapshots samples because we’ve been taking visual samples of the complete trajectory that the ball is following. The rate at which we obtain these samples (blink our eyes) is called the sampling rate. It’s pretty clear, though, that the faster we sample, the better chance we have of getting an accurate picture of the entire continuous path along which the ball has been traveling. Figure 2.5 The same path, but sampled by blinking. What’s the difference between the two views of the bouncing ball: the blinking and the nonblinking? Each view pretty much gives the same picture of the ball’s path. We can tell how fast the ball is moving and how high it’s bouncing. The only real difference seems to be that in the first case the trail is continuous, and in the second it is broken, or discrete. That’s the main distinction between analog and digital representations of information: analog information is continuous, while digital information is not. http://music.columbia.edu/cmc/musicandcomputers/ - 32 - Analog and Digital Waveform Representations Now let’s take a look at two time domain representations of a sound wave, one analog and one digital, in Figure 2.6. Figure 2.6 An analog waveform and its digital cousin: the analog waveform has smooth and continuous changes, and the digital version of the same waveform has a stairstep look. The black squares are the actual samples taken by the computer. The grey lines suggest the "staircasing" that is an inevitable result of converting an analog signal to digital form. Note that the grey lines are only for show—all that the computer knows about are the discrete points marked by the black squares. There is nothing in between those points. The analog waveform is nice and smooth, while the digital version is kind of chunky. This "chunkiness" is called quantization or staircasing—for obvious reasons! Where do the "stairsteps" come from? Go back and look at the digital bouncing ball figure again, and see what would happen if you connected each of the samples with two lines at a right angle. Voila, a staircase! Staircasing is an artifact of the digital recording process, and it illustrates how digitally recorded waveforms are only approximations of analog sources. They will always be approximations, in some sense, since it is theoretically impossible to store truly continuous data digitally. However, by increasing the number of samples taken each second (the sample rate), as well as increasing the accuracy of those samples (the resolution), an extremely accurate recording can be made. In fact, we can prove mathematically that we can get so accurate that, theoretically, there is no difference between the analog waveform and its digital representation, at least to our ears. http://music.columbia.edu/cmc/musicandcomputers/ - 33 - Chapter 2: The Digital Representation of Sound, Part Two: Playing by the Numbers Section 2.3: Sampling Theory Contents of C:\bolum\bahar2014-2015\ce476\Section 2.3: Xtra bit 2.1: Free sample: a tonsorial tale Applet 2.2: Oscillators Soundfile 2.1: Undersampling Soundfile 2.2: Standard sampling at 44,100 samples per second Applet 2.3: Scrubber applet Soundfile 2.3: Chirping So now we know that we need to sample a continuous waveform to represent it digitally. We also know that the faster we sample it, the better. But this is still a little vague. How often do we need to sample a waveform in order to achieve a good representation of it? 
The answer to this question is given by the Nyquist sampling theorem, which states that to well represent a signal, the sampling rate (or sampling frequency—not to be confused with the frequency content of the sound) needs to be at least twice the highest frequency contained in the sound of the signal. For example, look back at our time-frequency picture in Figure 2.3 from Section 2.1. It looks like it only contains frequencies up to 8,000 Hz. If this were the case, we would need to sample the sound at a rate of 16,000 Hz (16 kHz) in order to accurately reproduce the sound. That is, we would need to take sound bites (bytes?!) 16,000 times a second. In the next chapter, when we talk about representing sounds in the frequency domain (as a combination of various amplitude levels of frequency components, which change over time) rather than in the time domain (as a numerical list of sample values of amplitudes), we’ll learn a lot more about the ramifications of the Nyquist theorem for digital sound. But for our current purposes, just remember that since the human ear only responds to sounds up to about 20,000 Hz, we need to sample sounds at least 40,000 times a second, or at a rate of 40,000 Hz, to represent these sounds for human consumption. You may be wondering why we even need to represent sonic frequencies that high (when the piano, for instance, only goes up to the high 4,000 Hz range). The answer is timbral, particularly spectral. Remember that we saw in Section 1.4 that those higher frequencies fill out the descriptive sonic information. Just to review: we measure frequency in cycles per second (cps) or Hertz (Hz). The frequency range of human hearing is usually given as 20 Hz to 20,000 Hz, meaning that we can hear sounds in that range. Knowing that, if we decide that the highest frequency we’re interested in is 20 kHz, then according to the Nyquist theorem, we need a sampling rate of at least twice that frequency, or 40 kHz. http://music.columbia.edu/cmc/musicandcomputers/ - 34 - Figure 2.7 Undersampling: What happens if we sample too slowly for the frequencies we’re trying to represent? We take samples (black dots) of a sine wave (in blue) at a certain interval (the sample rate). If the sine wave is changing too quickly (its frequency is too high), then we can’t grab enough information to reconstruct the waveform from our samples. The result is that the highfrequency waveform masquerades as a lower-frequency waveform (how sneaky!), or that the higher frequency is aliased to a lower frequency. Soundfile 2.1 demonstrates undersampling of the same sound source as Soundfile 2.2. In this example, the file was sampled at 1,024 samples per second. Note that the sound sounds "muddy" at a 1,024 sampling rate—that rate does not allow us any frequencies above about 500 Hz, which is sort of like sticking a large canvas bag over your head, and putting your fingers in your ears, while listening. Figure 2.8 Picture of an undersampled waveform. This sound was sampled 512 times per second. This was way too slow. Figure 2.9 This is the same sound file as above, but now sampled 44,100 (44.1 kHz) times per second. Much better. http://music.columbia.edu/cmc/musicandcomputers/ - 35 - Aliasing The most common standard sampling rate for digital audio (the one used for CDs) is 44.1 kHz, giving us a Nyquist frequency (defined as half the sampling rate) of 22.05 kHz. If we use lower sampling rates, for example, 20 kHz, we can’t represent a sound whose frequency is above 10 kHz. 
In fact, if we try, we’ll get usually undesirable artifacts, called foldover or aliasing, in the signal. In other words, if a sine wave is changing quickly, we would get the same set of samples that we would have obtained had we been taking samples from a sine wave of lower frequency! The effect of this is that the higher-frequency contributions now act as impostors of lowerfrequency information. The effect of this is that there are extra, unanticipated, and new lowfrequency contributions to the sound. Sometimes we can use this in cool, interesting ways, and other times it just messes up the original sound. So in a sense, these impostors are aliases for the low frequencies, and we say that the result of our undersampling is an aliased waveform at a lower frequency. Figure 2.10 Foldover aliasing. This picture shows what happens when we sweep a sine wave up past the Nyquist rate. It’s a picture in the frequency domain (which we haven’t talked about much yet), so what you’re seeing is the amplitude of specific component frequencies over time. The x-axis is frequency, the z-axis is amplitude, and the y-axis is time (read from back to front). As the sine wave sweeps up into frequencies above the Nyquist frequency, an aliased wave (starting at 0 Hz and ending at 44,100 Hz over 10 seconds) is reflected below the Nyquist frequency of 22,050 Hz. The sound can be heard in Soundfile 2.3. Soundfile 2.3 is a 10second soundfile sweeping a sine wave from 0 Hz to 44,100 Hz. Notice that the sound seems to disappear after it reaches the Nyquist rate of 22,050 Hz, but then it wraps around as aliased sound back into the audible domain. Anti-Aliasing Filters Fortunately it’s fairly easy to avoid aliasing—we simply make sure that the signal we’re recording doesn’t contain any frequencies above the Nyquist frequency. To accomplish this http://music.columbia.edu/cmc/musicandcomputers/ - 36 - task, we use an anti-aliasing filter on the signal. Audio filtering is a technique that allows us to selectively keep or throw out certain frequencies in a sound—just as light filters (like ones you might use on a camera) only allow certain frequencies of light (colors) to pass. For now, just remember that a filter lets us color a sound by changing its frequency content. An anti-aliasing filter is called a low-pass filter because it only allows frequencies below a certain cutoff frequency to pass. Anything above the cutoff frequency gets removed. By setting the cutoff frequency of the low-pass filter to the Nyquist frequency, we can throw out the offending frequencies (those high enough to cause aliasing) while retaining all of the lower frequencies that we want to record. Figure 2.11 An anti-aliasing low-pass filter. Only the frequencies within the passband, which stops at the Nyquist frequency (and "rolls off" after that), are allowed to pass. This diagram is typical of the way we draw what is called the frequency response of a filter. It shows the amplitude that will come out of the filter in response to different frequencies (of the same amplitude). Anti-aliasing filters can be analogized to coffee filters. The desired components (frequencies or liquid coffee) are preserved, while the filter (coffee filter or anti-aliasing filter) catches all the undesirable components (the coffee grounds or the frequencies that the system cannot handle). Perfect anti-aliasing filters cannot be constructed, so we almost always get some aliasing error in an ADC DAC conversion. 
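If you'd like to convince yourself of the aliasing arithmetic without any audio hardware, here's a small Python sketch (the sampling rate and frequencies are arbitrary choices, picked to keep the numbers easy):

import math

# Sketch (not from the book): at a 1,000 Hz sampling rate (Nyquist = 500 Hz),
# a 900 Hz sine gives exactly the samples of a 100 Hz sine, because 900 Hz is
# folded down to 1,000 - 900 = 100 Hz (with its phase inverted, which we can't hear).
sample_rate = 1000

def sampled_sine(freq, num_samples):
    return [math.sin(2 * math.pi * freq * n / sample_rate)
            for n in range(num_samples)]

high = sampled_sine(900, 20)   # above the Nyquist frequency
low = sampled_sine(100, 20)    # its low-frequency alias

# Each high-frequency sample equals minus the corresponding low-frequency one:
# once sampled, the 900 Hz tone is an impostor for a 100 Hz tone.
print(all(abs(h + l) < 1e-9 for h, l in zip(high, low)))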
Anti-aliasing filters are a standard component in digital sound recording, so aliasing is not usually of serious concern to the average user or computer musician. But because many of the sounds in computer music are not recorded (and are instead created digitally inside the computer itself), it’s important to fully understand aliasing and the Nyquist theorem. There’s nothing to stop us from using a computer to create sounds with frequencies well above the Nyquist frequency. And while the computer has no problem dealing with such sounds as data, as soon as we mere humans want to actually hear those sounds (as opposed to just conceptualizing or imagining them), we need to deal with the physical realities of aliasing, the Nyquist theorem, and the analog-to-digital conversion process.

Chapter 2: The Digital Representation of Sound, Part Two: Playing by the Numbers

Section 2.4: Binary Numbers

Applet 2.4: Binary counter

Now we know more than we ever wanted to know about how often to store numbers in order to digitally represent a signal. But surely speed isn’t the only thing that matters. What about size? In this section, we’ll tell you something about how big those numbers are and what they "look" like.

All digital information is stored as binary numbers. A binary number is simply a way of representing our regular old numbers as a list, or sequence, of zeros and ones. This is also called a base-2 representation.

In order to explain things in base 2, let’s think for a minute about those ordinary numbers that we use. You know that we use a decimal representation of numbers. This means that we write numbers as finite sequences whose symbols are taken from a collection of ten symbols: our digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. A number is really a shorthand for an arithmetic expression. For example, the decimal number 2,051 means:

2 x 10^3 + 0 x 10^2 + 5 x 10^1 + 1 x 10^0 = 2,000 + 0 + 50 + 1

Now we bet you can see a pattern here. For every place in the number, we just multiply by a higher power of ten. Representing a number in base-8, say (which is called an octal representation), would mean that we would only use eight distinct symbols, and multiply those by powers of 8 (instead of powers of 10). For example (in base 8), the number 2,051 means:

2 x 8^3 + 0 x 8^2 + 5 x 8^1 + 1 x 8^0 = 1,024 + 0 + 40 + 1, or 1,065 in base 10

There is even hexadecimal notation in which we work in base 16, so we need 16 symbols. These are our usual 0 through 9, augmented by A, B, C, D, E, and F (counting for 10 through 15). So, in base 16 we might write:

2,051 = 2 x 16^3 + 0 x 16^2 + 5 x 16^1 + 1 x 16^0 = 8,192 + 0 + 80 + 1, or 8,273 in base 10

But back to binary: this is what we use for digital (that is, computer) representations. With only two symbols, 0 and 1, numbers look like this:

1101 = 1 x 2^3 + 1 x 2^2 + 0 x 2^1 + 1 x 2^0 = 8 + 4 + 0 + 1, or 13 in base 10

Each of the places is called a bit (for Binary digIT). The leftmost bit is called the most significant bit (MSB), and the rightmost is the least significant bit (because the digit in the leftmost position, the highest power of 2, makes the most significant contribution to the total value represented by the number, while the rightmost makes the least significant contribution). If the digit at a given bit is equal to 1, we say it is set. We also label the bits by what power of 2 they represent.

How many different numbers can we represent with 4 bits?
There are sixteen such combinations, and here they are, along with their octal, decimal, and hexadecimal counterparts:

Binary   Octal   Decimal   Hexadecimal
0000     00      00        00
0001     01      01        01
0010     02      02        02
0011     03      03        03
0100     04      04        04
0101     05      05        05
0110     06      06        06
0111     07      07        07
1000     10      08        08
1001     11      09        09
1010     12      10        0A
1011     13      11        0B
1100     14      12        0C
1101     15      13        0D
1110     16      14        0E
1111     17      15        0F

Table 2.1 Number base chart.

Some mathematicians and philosophers argue that the reason we use base 10 is that we have ten fingers (digits)—those extraterrestrials we met in Chapter 1 who hear light might also have an extra finger or toe (or sqlurmphragk or something), and to them the millennial year might be the year 1559 (in base 11)! Boy, that will just shock them right out of their fxrmmp!qts!

What’s important to keep in mind is that it doesn’t much matter which system we use; they’re all pretty much equivalent (1010 base 2 = 10 base 10 = 12 base 8 = A base 16, and so on). We pick numeric bases for convenience: binary systems are useful for switches and logical systems, and computers are essentially composed of lots of switches.

Figure 2.12 A four-bit binary number (called a nibble). Fifteen is the biggest decimal number we can represent in 4 bits (sixteen different values in all, counting 0).

Numbering Those Bits

Consider the binary number 0101 from Figure 2.12. It has 4 bits, numbered 0 to 3 (computers generally count from 0 instead of 1). The rightmost bit (bit zero, the LSB) is the "ones" bit. If it is set (equal to 1), we add 1 to our number. The next bit is the "twos" bit, and if set adds 2 to our number. Next comes the "fours" bit, and finally the "eights" bit, which, when set, add 4 and 8 to our number, respectively. So another way of thinking of the binary number 0101 would be to say that we have zero eights, one four, zero twos, and one 1.

Bit number   2^(bit number)   Bit value
0            2^0              1
1            2^1              2
2            2^2              4
3            2^3              8

Table 2.2 Bits and their values. Note that every bit in base 2 is the next power of 2. More generally, every place in a base-n system is n raised to that place number.

Don’t worry about remembering the value of each bit. There’s a simple trick: to find the value of a bit, just raise 2 to its bit number (and remember to start counting at 0!). What would be the value of bit 4 in a five-bit number? If you get confused about this concept, just remember that this is simply what you learned in grade school with base 10 (the 1s column, the 10s column, the 100s column, and so on).

This applet sonifies, or lets you hear, a binary counter. Each of the 8 bits is assigned to a different note in the harmonic series (pitches that are integer multiples of the fundamental pitch, corresponding to the spectral components of a periodic waveform). As the binary counter counts, it turns the notes on and off depending on whether or not the bit is set.

The crucial question is, though, how many different numbers can we represent with four bits? Or more important, how many different numbers can we represent with n bits? That will determine the resolution of our data (and for sound, how accurately we can record minute amplitude changes). Look back at Table 2.1: using all possible combinations of the 4 bits, we were able to represent sixteen numbers (but notice that, starting at 0, they only go to 15). What if we only used 2 bits? How about 16?
Again, thinking of the analogy with base 10, how many numbers can we represent in three "places"? (0 through 999 is the answer, or 1,000 different numbers.) A more general question is this: for a given base, how many values can be represented with n places (we can’t call them bits unless it’s binary)? The answer is the base to the nth power. The largest number we can represent (since we need 0) is the base to the nth power minus 1. For example, the largest number we can represent in binary with n bits is 2^n – 1.

Number of bits   2^(number of bits)   Number of numbers
8                2^8                  256
16               2^16                 65,536
24               2^24                 16,777,216
32               2^32                 4,294,967,296

Table 2.3 How many numbers can an n-bit number represent?

They get big pretty fast, don’t they? In fact, by definition, these numbers get big exponentially as the number of bits increases linearly (like we discussed when we talked about pitch and loudness perception in Section 1.2). Modern computers use 64-bit numbers! If 2^32 is over 4 billion, try to imagine what 2^64 is (2^32 x 2^32). It’s a number almost unimaginably large. As an example, if you counted very fast, say five times a second, from 0 to this number, it would take you well over 100 billion years to get there (you’d need the lifetimes of several solar systems to accomplish this).

OK, we’ve got that figured out, right? Now back to sound. Remember that a sample is essentially a "snapshot" of the instantaneous amplitude of a sound, and that snapshot is stored as a number. What sort of number are we talking about? 3? 0.00000000017? 16,000,000,126? The answer depends on how accurately we capture the sound and how much space we have to store our data.

Chapter 2: The Digital Representation of Sound, Part Two: Playing by the Numbers

Section 2.5: Bit Width

When we talked about sampling, we made the point that the faster you sample, the better the quality. Fast is good: it gives us higher resolution (in time), just like in the old days when the faster your tape recorder moved, the more (horizontal) space it had to put the sound on the tape. But when you had fast tape recorders, moving at 30 or 60 inches per second, you used up a lot of tape. The faster our moving storage, the more media it’s going to consume. While this may be bad ecologically, it’s good sonically. Accuracy in recording demands space.

However, if we only have a limited amount of storage space, we need to do something about that space issue. One thing that eats up space digitally (in the form of memory or disk space) is bits. The more bits you use, the more hard disk space or memory size you need. That’s also true with sampling rates. If we sample at high rates, we’ll use up more space. We could, in principle, use 64-bit numbers (capable of extraordinary detail) and sample at 100 kHz—big numbers, fast speeds. But our sounds, as digitally stored numbers, will be huge. Somehow we have to make some decisions balancing our need for accuracy and sonic quality with our space and storage limitations.

For example, suppose we only use the values 0, 1, 2, and 3 as sample values. This would mean that every sample measurement would be "rounded off" to one of these four values. On one hand, this would probably be pretty inaccurate, but on the other hand, each sample would then be encoded using only a 2-bit number. Not too consumptive, and pretty simple, technologically! Unfortunately, using only these four numbers would probably mean that sample values won’t be distinguished all that much!
That is, most of our functions in the digital world would look pretty much alike. This would create very low resolution data, and the audio ramification is that they would sound terrible. Think of the difference between, for example, 8 mm and 16 mm film: now pretend you are using 1 mm film! That’s what a 2-bit sample size would be like.

So, while speed is important—the more snapshots we take of a continuous function, the more accurately we can represent it in discrete form—there’s another factor that seriously affects resolution: the resolution of the actual number system we use to store the data. For example, with only three numbers available (say, 0, 1, 2), every value we store has to be one of those three numbers. That’s bad. We’ll basically be storing a bunch of simple square waves. We’ll be turning highly differentiated, continuous data into nondifferentiated, overly discrete data.

Figure 2.13 An example of what a 3-bit sound file might look like (8 possible values).

Figure 2.14 An example of what a 6-bit sound file might look like (64 possible values).

In computers, the way we describe numerical resolution is by the size, or number of bits, used for number storage and manipulation. The number of bits used to represent a number is referred to as its bit width or bit depth. Bit width (or depth) and sample speed more or less completely describe the resolution and accuracy of our digital recording and synthesis systems. Another way to think of it is as the word-length, in bits, of the binary data.

Common bit widths used for digital sound representation are 8, 16, 24, and 32 bits. As we said, more is better: 16 bits gives you much more accuracy than 8 bits, but at a cost of twice the storage space. (Note that, except for 24, they’re all powers of 2. Of course you could think of 24 as halfway between 2^4 and 2^5, but we like to think of it as 2^4.584962500721156.)

We’ll take a closer look at storage in Section 2.7, but for now let’s consider some standard number sizes.

4 bits         1 nibble
8 bits         1 byte (2 nibbles)
16 bits        1 word (2 bytes)
1,024 bytes    1 kilobyte (K)
1,000 K        1 megabyte (MB)
1,000 MB       1 gigabyte (GB)
1,000 GB       1 terabyte (TB)

Table 2.4 Some convenient units for dealing with bits. If you just remember that 8 bits is the same as a byte, you can pretty much figure out the rest.

Table 2.5 Some sound files at various bit widths (8 and 16 bits) and sampling rates (44,100 Hz, 22,050 Hz, 11,025 Hz, and 5,512.5 Hz). Fast and wide are better (and more expensive in terms of technology, computer time, and computer storage). So the ideal format in this table is 44,100 Hz at 16 bits (which is standard audio CD quality). Listen to the soundfiles for this section and see if you can hear what effect sampling rate and bit width have on the sound.

You may notice that lower sampling rates make the sound "duller," or lacking in high frequencies. Lower bit widths make the sound, to use an imprecise word, "flatter"—small amplitude nuances (at the waveform level) cannot be heard as well, so the sound’s timbre tends to become more similar, and less defined.

Let’s sum it up: what effect does bit width have on the digital storage of sounds? Remember that the more bits we use, the more accurate our recording. But each time our accuracy increases, so do our storage requirements.
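Here's a small Python sketch of that bit-width tradeoff (the quantization scheme is a generic rounding to evenly spaced levels, which real converters refine in various ways, and the 3-bit versus 16-bit comparison is just an illustrative choice):

import math

# Sketch (not from the book): round a sine wave to n-bit sample values to see
# how bit width limits amplitude resolution. With 3 bits there are only
# 2**3 = 8 possible values, so the waveform turns into coarse stairsteps;
# with 16 bits there are 65,536 values and the steps are far too small to hear.
def quantize(x, bits):
    """Map x in [-1.0, 1.0] to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits
    step = 2.0 / (levels - 1)
    return round((x + 1.0) / step) * step - 1.0

wave = [math.sin(2 * math.pi * n / 64) for n in range(64)]   # one test cycle
coarse = [quantize(x, 3) for x in wave]     # 8 levels: audibly "flattened"
fine = [quantize(x, 16) for x in wave]      # 65,536 levels: CD-style resolution

print(sorted(set(coarse)))                           # only a handful of distinct values
print(max(abs(a - b) for a, b in zip(wave, fine)))   # tiny rounding error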
Sometimes we don’t need to be that accurate— there are lots of options open to us when playing with digital sounds. http://music.columbia.edu/cmc/musicandcomputers/ - 44 - Chapter 2: The Digital Representation of Sound, Part Two: Playing by the Numbers Section 2.6: Digital Copying Xtra bit 2.2:The peanut butter conundrum Applet 2.5: Melody copier Xtra bit 2.3: Errors in digital copying: parity Soundfile 2.4: Our rendition of Alvin Lucier’s "I am Sitting in a Room" Soundfile 2.5: Pretender, from John Oswald’s Plunderphonics Soundfile 2.6(a) :Original Soundfile 2.6(b): Degraded Soundfile 2.7: David Mahler Xtra bit 2.4: Digital watermarking One of the most important things about digital audio is the ability, theoretically, to make perfect copies. Since all we’re storing are lists of numbers, it shouldn’t be too hard to rewrite them somewhere else without error. This means that unlike, for example, photocopying a photocopy, there is no information loss in the transfer of data from one copy to another, nor is there any noise added. Noise, in this context, is any information that is added by the imperfections of the recording or playback technology. Figure 2.15 Each generation of analog copying is a little noisier than the last. Digital copies are (generally!) perfect, or noiseless. http://music.columbia.edu/cmc/musicandcomputers/ - 45 - Not-So-Perfect Copying What does perfect copying really mean? In the good old days, if you copied your favorite LP vinyl (remember those?) onto cassette and gave it to your buddy, you knew that your original sounded better than your buddy’s copy. He knew the same: if he made a copy for his buddy, it would be even worse (these copies are called generations). It was like the children’s game of telephone: "My cousin once saw a Madonna concert!" "My cousin Vince saw Madonna’s monster!" "My mother Vince sat onna mobster!" "My mother winces when she eats lobster!" Start with a simple tune and set a noise parameter that determines how bad the next copy of the file will be. The melody degrades over time. Note that when we use the term degrade, we are giving a purely technical description, not an aesthetic one. The melodies don’t necessarily get worse, they just get further from the original. The noise added to successive generations blurs the detail of the original. But we still try to reinterpret the new, blurred message in a way that makes sense to us. In audio, this blurring of detail usually manifests itself as a loss of high-frequency information (in some sense, sonic detail) as well as an addition of the noise of the transmission mechanism itself (hiss, hum, rumble, etc.). Alvin Lucier’s classic electronic work, "I Am Sitting in a Room," "redone" with our own text. "This is a cheap imitation of a great and classic work of electronic music by the composer Alvin Lucier. You can easily hear the degradations of the copied sound; this was part of the composer’s idea for this composition." We strongly encourage all of you to listen to the original, which is a beautiful, innovative work. Is It Real, or Is It...? Digital sound technology changes this copying situation entirely and raises some new and interesting questions. What’s the original, and what’s the copy? When we copy a CD to some other digital medium (like our hard drive), all we’re doing is copying a list of numbers, and there’s no reason to expect that any, or significantly many, errors will be incurred. 
That means the copy of the president’s speech that we sample for our web site contains the same data—is the same signal—as the original. There’s no way to trace where the original came from, and in a sense no way to know who owns it (if anybody does). http://music.columbia.edu/cmc/musicandcomputers/ - 46 - This makes questions of copyright and royalties, and the more general issue of intellectual property, complicated to say the least, and has become, through the musical technique of sampling (as in rap, techno, and other musical styles), an important new area of technological, aesthetic, and legal research. "If creativity is a field, copyright is the fence." —John Oswald In the mid-1980s, composer John Oswald made a famous CD in which every track was an electronic transformation, in some unique way, of existing, copyrighted material. This controversial CD was eventually litigated out of existence, largely on the basis of its Michael Jackson track (called "Dab"). Digital copying has produced a crisis in the commercial music world, as anyone who has downloaded music from the internet knows. The government, the commercial music industry, and even organizations like BMI (Broadcast Music Inc.) and ASCAP (American Society of Composers, Authors, and Publishers), which distribute royalties to composers and artists, are struggling with the law and the technology, and the two are forever out of synch (guess who's ahead!). Things change every day, but one thing never changes-technology moves faster than our society's ability to deal with it legally and ethically. Of course, that makes things interesting. In Pretender, John Oswald gradually changes the pitch and speed of Dolly Parton’s version of the song "The Great Pretender" in a wry and musically beautiful commentary on gender, intellectual property, and sound itself. Soundfiles 2.6(a) and 2.6(b) show an analog copy that was made over and over again, exemplifying how the signal degrades to noise after numerous copies. Soundfile 2.6(a) is the original digital file; Soundfile 2.6(b) is the copied analog file. Note that unlike the example in Soundfile 2.4, where we tried to re-create a famous piece by Alvin Lucier, this example doesn’t involve the acoustics of space, just the noise of electronics, tape to tape. Sampling, and using pre-existing materials, can be a lot of fun and artistically interesting. Soundfile 2.7 is an excerpt from David Mahler’s composition "Singing in the Style of The Voice of the Poet." This is a playful work that in some ways parodies textsound composition, radio interviewing, and electronic music by using its own techniques. In this example, Mahler makes a joke of the fact that when speech is played backward, it sounds like "Swedish," and he combines that effect (backward talking) with composer Ingram Marshall’s talking about his interest in Swedish text-sound composers. Figure 2.20 David Mahler The entire composition can be heard on David Mahler’s CD The Voice of the Poet; Works on Tape 1972–1986, on Artifact Recordings. http://music.columbia.edu/cmc/musicandcomputers/ - 47 - Chapter 2: The Digital Representation of Sound, Part Two: Playing by the Numbers Section 2.7: Storage Concerns: The Size of Sound Xtra bit 2.5: Hard drives OK, now we know about bits and how to copy them from one place to another. But where exactly do we put all these bits in the first place? One familiar digital storage medium is the compact disc (CD). Bits are encoded on a CD as a series of pits in a metallic surface. 
The length of a pit corresponds to the state of the bit (on or off). As we’ve seen, it takes a lot of bits to store high-quality digital audio—a standard 74minute CD can have more than 6 billion pits on it! Figure 2.21 This photo shows what standard CD pits look like under high magnification. A CD stores data digitally, using long and short pits to encode a binary representation of the sound for reading by the laser mechanism of a CD player. Thanks to Evolution Audio and Video in Agoura Hills California for this photo from their A/V Newsletter (Jan 1996). The magnification is 20k. Figure 2.22 Physically, a CD is composed of a thin film of aluminum embedded between two discs of polycarbonate plastic. Information is recorded on the CD as a series of microscopic pits in the aluminum film arranged along a continuous spiral track. If expanded linearly, the track would span over 3 miles. Using a low-power infrared laser (with a wavelength of 780 nm), the data are retrieved from the CD using photosensitive sensors that measure the intensity of the reflected light as the laser traverses the track. Since the recovered bit stream is simply a bit pattern, any digitally encoded information can be stored on a CD. http://music.columbia.edu/cmc/musicandcomputers/ - 48 - Putting Everything Together Now that we know about sampling rates, bit width, number systems, and a lot of other stuff, how about a nice practical example that ties it all together? Assume we’re composers working in a digital medium. We’ve got some cool sounds, and we want to store them. We need to figure out how much storage we need. Let’s assume we’re working with a stereo (two independent channels of sound) signal and 16bit samples. We’ll use a sampling rate of 44,100 times/second. One 16-bit sample takes 2 bytes of storage space (remember that 8 bits equal 1 byte). Since we’re in stereo, we need to double that number (there’s one sample for each channel) to 4 bytes per sample. For each second of sound, we will record 44,100 four-byte stereo samples, giving us a data rate of 176.4 kilobytes (176,400 bytes) per second. Let’s review this, because we know it can get a bit complicated. There are 60 seconds in a minute, so 1 minute of high-quality stereo digital sound takes 176.4 * 60 KB or 10.584 megabytes (10,584 KB) of storage space. In order to store 1 hour of stereo sound at this sampling rate and resolution, we need 60 * 10.584 MB, or about 600 MB. This is more or less the amount of sound information on a standard commercial audio CD (actually, it can store closer to 80 minutes comfortably). One gigabyte is equal to 1,000 megabytes, so a standard CD is around two-thirds of a gigabyte. Figure 2.23 One good rule of thumb is that CD-quality sound currently requires about 10 megabytes per minute. http://music.columbia.edu/cmc/musicandcomputers/ - 49 - Chapter 2: The Digital Representation of Sound, Part Two: Playing by the Numbers Section 2.8: Compression Xtra bit 2.6: MP3 Soundfile 2.8: 128 Kbps Soundfile 2.9: 64 Kbps Soundfile 2.10: 32 Kbps Xtra bit 2.7: Delta modulation When we start talking about taking 44.1 kHz samples per second, each one of those samples has a 16-bit value, so we’re building up a whole heck of a lot of bits. In fact, it’s too many bits for most purposes. While it’s not too wasteful if you want an hour of high-quality sound on a CD, it’s kind of unwieldy if we need to download or send it over the Internet, or store a bunch of it on our home hard drive. 
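Just to put numbers on that "whole heck of a lot of bits," here is the storage arithmetic from Section 2.7 wrapped up in a few lines of Python (a sketch assuming plain uncompressed samples; the function name and defaults are our own):

# Sketch (not from the book): uncompressed storage needed for sampled audio.
def audio_bytes(seconds, sample_rate=44100, bits_per_sample=16, channels=2):
    """Bytes needed to store `seconds` of uncompressed audio."""
    bytes_per_sample = bits_per_sample // 8
    return seconds * sample_rate * bytes_per_sample * channels

one_second = audio_bytes(1)          # 176,400 bytes: the data rate per second
one_minute = audio_bytes(60)         # about 10.6 million bytes per minute
one_hour = audio_bytes(60 * 60)      # roughly 635 MB: about one audio CD

print(one_second, one_minute, one_hour)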
Even though high-quality sound data aren’t anywhere near as large as image or video data, they’re still too big to be practical. What can we do to reduce the data explosion? If we keep in mind that we’re representing sound as a kind of list of symbols, we just need to find ways to express the same information in a shorter string of symbols. That’s called data compression, and it’s a rapidly growing field dealing with the problems involved in moving around large quantities of bits quickly and accurately. The goal is to store the most information in the smallest amount of space, without compromising the quality of the signal (or at least, compromising it as little as possible). Compression techniques and research are not limited to digital sound—data compression plays an essential part in the storage and transmission of all types of digital information, from word-processing documents to digital photographs to full-screen, full-motion videos. As the amount of information in a medium increases, so does the importance of data compression. What is compression exactly, and how does it work? There is no one thing that is "data compression." Instead, there are many different approaches, each addressing a different aspect of the problem. We’ll take a look at just a couple of ways to compress digital audio information. What’s important about these different ways of compressing data is that they tend to illustrate some basic ideas in the representation of information, particularly sound, in the digital world. Eliminating Redundancy There are a number of classic approaches to data compression. The first, and most straightforward, is to try to figure out what’s redundant in a signal, leave it out, and put it back in when needed later. Something that is redundant could be as simple as something we already know. For example, examine the following messages: http://music.columbia.edu/cmc/musicandcomputers/ - 50 - YNK DDL WNT T TWN, RDNG N PNY or DNT CNT YR CHCKNS BFR THY HTCH It’s pretty clear that leaving out the vowels makes the phrases shorter, unambiguous, and fairly easy to reconstruct. Other phrases may not be as clear and may need a vowel or two. However, clarity of the intended message occurs only because, in these particular messages, we already know what it says, and we’re simply storing something to jog our memory. That’s not too common. Now say we need to store an arbitrary series of colors: blue blue blue blue green green green red blue red blue yellow This is easy to shorten to: 4 blue 3 green red blue red blue yellow In fact, we can shorten that even more by saying: 4 blue 3 green 2 (red blue) yellow We could shorten it even more, if we know we’re only talking about colors, by: 4b3g2(rb)y We can reasonably guess that "y" means yellow. The "b" is more problematic, since it might mean "brown" or "black," so we might have to use more letters to resolve its ambiguity. This simple example shows that a reduced set of symbols will suffice in many cases, especially if we know roughly what the message is "supposed" to be. Many complex compression and encoding schemes work in this way. Perceptual Encoding A second approach to data compression is similar. It also tries to get rid of data that do not "buy us much," but this time we measure the value of a piece of data in terms of how much it contributes to our overall perception of the sound. 
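Before going on with perceptual approaches, here is the run-length idea from the color example above written out as a tiny Python sketch (a generic illustration, not any particular codec; notice that spotting the repeated "red blue" pair would take a cleverer scheme than this one):

# Sketch (not from the book): run-length encoding. Runs of repeated values are
# stored as (count, value) pairs; everything else passes through with a count of 1.
def run_length_encode(values):
    encoded = []
    for v in values:
        if encoded and encoded[-1][1] == v:
            encoded[-1] = (encoded[-1][0] + 1, v)   # extend the current run
        else:
            encoded.append((1, v))                  # start a new run
    return encoded

colors = ["blue"] * 4 + ["green"] * 3 + ["red", "blue", "red", "blue", "yellow"]
print(run_length_encode(colors))
# [(4, 'blue'), (3, 'green'), (1, 'red'), (1, 'blue'), (1, 'red'), (1, 'blue'), (1, 'yellow')]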
Here’s a visual analogy: if we want to compress a picture for people or creatures who are color-blind, then instead of having to represent all colors, we could just send black-and-white pictures, which as you can well imagine would require less information than a full-color picture. However, now we are attempting to represent data based on our perception of it. Notice here that we’re not using numbers at all: we’re simply trying to compress all the relevant data into a kind of summary of what’s most important (to the receiver). The tricky part of this is that in order to understand what’s important, we need to analyze the sound into its component features, something that we didn’t have to worry about when simply shortening lists of numbers. http://music.columbia.edu/cmc/musicandcomputers/ - 51 - temperature: 76oF humidity: 35% = wind: north-east at 5 MPH It’s a nice day out! barometer: falling clouds: none Figure 2.24 We humans use perception-based encoding all the time. If we didn't, we’d have very tedious conversations. MP3 is the current standard for data compression of sound on the web. But keep in mind that these compression standards change frequently as people invent newer and better methods. Soundfiles 2.8, 2.9, and 2.10 were all compressed into the MP3 format but at different bit rates. The lower the bit rate, the more degradation. (Kbps means kilobits per second.) Perceptually based sound compression algorithms usually work by eliminating numerical information that is not perceptually significant and just keeping what’s important. µ-law ("mu-law") encoding is a simple, common, and important perception-based compression technique for sound data. It’s an older technique, but it’s far easier to explain here than a more sophisticated algorithm like MP3, so we’ll go into it in a bit of detail. Understanding it is a useful step toward understanding compression in general. µ-law is based on the principle that our ears are far more sensitive to low amplitude changes than to high ones. That is, if sounds are soft, we tend to notice the change in amplitude more easily than between very loud and other nearly equally loud sounds. µ-law compression takes advantage of this phenomenon by mapping 16-bit values onto an 8-bit µ-law table like Table 2.6. 0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 132 148 164 180 196 212 228 244 260 276 292 308 324 340 356 372 Table 2.6 Entries from a typical µ-law table. The complete table consists of 256 entries spanning the 16-bit numerical range from –32,124 to 32,124. Half the range is positive, and half is negative. This is often the way sound values are stored. Notice how the range of numbers is divided logarithmically rather than linearly, giving more precision at lower amplitudes. In other words, loud sounds are just loud sounds. To encode a µ-law sample, we start with a 16-bit sample value, say 330. We then find the entry in the table that is closest to our sample value. In this case, it would be 324, which is the 28th entry (starting with entry 0), so we store 28 as our µ-law sample value. Later, when we http://music.columbia.edu/cmc/musicandcomputers/ - 52 - want to decode the µ-law sample, we simply read 28 as an index into the table, and output the value stored there: 324. You might be thinking, "Wait a minute, Our original sample value was 330, but now we have a value of 324. What good is that?" 
While it’s true that we lose some accuracy when we encode µ-law samples, we still get much better sound quality than if we had just used regular 8-bit samples. Here’s why: in the low-amplitude range of the µ-law table, our encoded values are only going to be off by a small margin, since the entries are close together. For example, if our sample value is 3 and it’s mapped to 0, we’re only off by 3. But since we’re dealing with 16-bit samples, which have a total range of 65,536, being off by 3 isn’t so bad. As amplitude increases we can miss the mark by much greater amounts (since the entries get farther and farther apart), but that’s OK too—the whole point of µ-law encoding is to exploit the fact that at higher amplitudes our ears are not very sensitive to amplitude changes. Using that fact, µlaw compression offers near-16-bit sound quality in an 8-bit storage format. Prediction Algorithms A third type of compression technique involves attempting to predict what a signal is going to do (usually in the frequency domain, not in the time domain) and only storing the difference between the prediction and the actual value. When a prediction algorithm is well tuned for the data on which it’s used, it’s usually possible to stay pretty close to the actual values. That means that the difference between your prediction and the real value is very small and can be stored with just a few bits. Let’s say you have a sample value range of 0 to 65,536 (a 16-bit range, in all positive integers) and you invent a magical prediction algorithm that is never more than 256 units above or below the actual value. You now only need 8 bits (with a range of 0 to 255) to store the difference between your predicted value and the actual value. You might even keep a running average of the actual differences between sample values, and use that adaptively as the range of numbers you need to represent at any given time. Pretty neat stuff! In actual practice, coming up with such a good prediction algorithm is tricky, and what we’ve presented here is an extremely simplified presentation of how prediction-based compression techniques really work. The Pros and Cons of Compression Techniques Each of the techniques we’ve talked about has advantages and disadvantages. Some are timeconsuming to compute but accurate; others are simple to compute (and understand) but less powerful. Each tends to be most effective on certain kinds of data. Because of this, many of the actual compression implementations are adaptive—they employ some variable combination of all three techniques, based on qualities of the data to be encoded. A good example of a currently widespread adaptive compression technique is the MPEG (Moving Picture Expert Group) standard now used on the Internet for the transmission of both sound and video data. MPEG (which in audio is currently referred to as MP3) is now the standard for high-quality sound on the Internet and is rapidly becoming an audio standard for http://music.columbia.edu/cmc/musicandcomputers/ - 53 - general use. A description of how MPEG audio really works is well beyond the scope of this book, but it might be an interesting exercise for the reader to investigate further. Delta Modulation Bit width, sampling rate, compression, and even prediction are well illustrated by an extremely simple, interesting, and important technique called delta modulation or single-bit encoding. This algorithm is commonly used in commercial digital sound equipment. 
We know that if we sample very fast, the change in the y-axis or amplitude of the signal is going to be extremely small. In fact, we can more or less assume that it’s going to be nearly continuous. This means that each successive sample will change by at most 1. Using this fact, we can simply store the signal as a list of 1s and 0s, where 1 means "the amplitude increased" and 0 means "the amplitude decreased." The good thing about delta modulation is that we can use only one bit for the representation: we’re not actually representing the signal, but rather a kind of map of its contour. We can, in fact, play it back in the same way. The disadvantage of this algorithm is that we need to sample at a much higher rate in order to ensure that we’re catching all the minute changes. http://music.columbia.edu/cmc/musicandcomputers/ - 54 - Chapter 3: The Frequency Domain Section 3.1: Frequency Domain Soundfile 3.1: Monochord sound Soundfile 3.2: Trumpet sound Xtra bit 3.1: MatLab code to plot amplitude envelopes Soundfile 3.3: Mystery sound Soundfile 3.4: The song of the hooded warbler. Can you follow it with Figure 3.5? Time-domain representations show us a lot about the amplitude of a signal at different points in time. Amplitude is a word that means, more or less, "how much of something," and in this case it might represent pressure, voltage, some number that measures those things, or even the in-out deformation of the eardrum. For example, the time-domain picture of the waveform in Figure 3.1 starts with the attack of the note, continues on to the steady-state portion (sustain) of the note, and ends with the cutoff and decay (release). We sometimes call the attack and decay transients because they only happen once and they don’t stay around! We also use the word transient, perhaps more typically, to describe timbral fluctuations during the sound that are irregular or singular and to distinguish between those kinds of sounds and the steady state. From the typical sound event shown in Figure 3.1, we can tell something about how the sound’s amplitude develops over time (what we call its amplitude envelope). But from this picture we can’t really tell much of anything about what is usually referred to as the timbre or "sound" of the sound: What instrument is it? What note is it? Is it bright or dark? Is it someone singing or a dog barking, or maybe a Back Street Boys bootleg? Who knows! These time domain pictures all look pretty much alike. Figure 3.1 A time-domain waveform. It’s easy to see the attack, steady-state, and decay portions of the "note" or sound event, because these are all pretty much variations of amplitude, which time-domain representations show us quite well. The amplitude envelope is a kind of average of this picture. http://music.columbia.edu/cmc/musicandcomputers/ - 55 - We can even be a little more precise and mathematical. If the amplitude at the nth sample in the above is A[n] and we make a new signal with amplitude, say, S[n], then the nth sample (of S[n], our envelope) would be: S[n] = (A[n–1] + A[n] + A[n+1])/3 This would look like the envelope. This averaging operation is sometimes called smoothing or low-pass filtering. We’ll talk more about it later. Figure 3.2 Monochord sound: signal, average signal envelope, peak signal envelope. http://music.columbia.edu/cmc/musicandcomputers/ - 56 - Figure 3.3 Trumpet sound: signal, average signal envelope, peak signal envelope. 
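Here's a small Python sketch of the running-average idea behind S[n] (the book's own envelope follower is the MatLab code in Xtra bit 3.1; the decaying test tone and the 40-sample window below are arbitrary choices of ours):

import math

# Sketch (not from the book): the averaging in S[n] = (A[n-1] + A[n] + A[n+1]) / 3,
# generalized to a window of any length. Averaging the absolute value of the
# signal over a window gives the kind of "average signal envelope" drawn in
# Figures 3.2 and 3.3.
def running_average(signal, window):
    """Average each sample with its window-1 predecessors."""
    out = []
    for n in range(len(signal)):
        chunk = signal[max(0, n - window + 1):n + 1]
        out.append(sum(chunk) / len(chunk))
    return out

sample_rate = 1000
# A decaying 50 Hz sine stands in for a plucked or struck note.
signal = [math.exp(-3.0 * n / sample_rate) * math.sin(2 * math.pi * 50 * n / sample_rate)
          for n in range(sample_rate)]

envelope = running_average([abs(x) for x in signal], window=40)  # about 2 cycles wide
# The envelope falls off smoothly, tracing the decay of the note.
print(envelope[100], envelope[500], envelope[900])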
Two Sounds, and Two Different Kinds of Amplitude Envelopes The two figures and sounds in Soundfiles 3.1 and 3.2 (one of a trumpet, one of a one-stringed instrument called a monochord, made for us by Sam Torrisi) illustrate different ways of looking at amplitude and the time domain. In each, the time-domain signal itself is given by the light blue area. This is exactly the same as what we showed you at the beginning of this section in Figure 3.1. But we’ve added two more envelopes to these figures, to illustrate two useful ways to think of a sound event. The magenta line more or less follows the peaks of the signal, or its highest amplitudes. Note that it doesn’t matter whether these amplitudes are very positive or very negative; all we really care about is their absolute value, which is more or less like saying how much energy or displacement (in either direction). Sometimes, we even simplify this further by measuring the peak-to-peak amplitude of a signal, just looking at the maximum range of amplitudes (this will tell us, for example, if our speakers/ears will be able to withstand the maximum of the signal). In Figure 3.2, we look at some number of samples and more or less remain on the highest value in that window (that’s why it has a kind of staircase look to it). The dark blue line is a running average of the absolute value of the signal, which in effect smooths the sound out tremendously, and also attenuates it. There’s a similar measure, called RMS (root-mean-squared) amplitude, that tries to give an overall average of energy. Once again, we used a running window technique to average the last n number of samples (where n is the length of the window). Different values for n would give very different pictures. http://music.columbia.edu/cmc/musicandcomputers/ - 57 - Just to give you some idea how we generate these kinds of graphs and measurements, in Xtra bit 3.1 we’ve included the computer code, written in a popular mathematical modeling program called MatLab, that made these pictures. By studying this code and the accompanying comments, you can get some idea of what computer music software often looks like and how you might go about making similar kinds of measurements. The Frequency/Amplitude/Time Plot Distinguishing between sounds is one place where the frequency domain comes in. Figure 3.4 is a frequency/amplitude/time plot of the same sound as the time-domain picture in Figure 3.1. This new kind of sound-image is called a sonogram. Time still goes from left to right along the x-axis, but now the y-axis is frequency, not amplitude. Amplitude is encoded in the intensity of a point on the image: the darker the point, the more energy present at that frequency at around that time. For example, the semi-dark line around 7,400 Hz shows that from about 0.05 second to 0.125 second, there is some contribution to the sound at that frequency. This is occurring in the attack portion. It pretty much dies after a short period of time. Figure 3.4 This picture shows the same sound as that of the time domain in Figure 3.1, but now in the frequency domain, as a sonogram. Here, the y-axis is frequency (or, more accurately, frequency components). The darkness of the line indicates how much energy is present in that frequency component. The x-axis is, as usual, time. What sorts of information do the two pictures give us about the sound? Can you make some guesses about what sort of sound this might be? What does this sonogram tell us about the sound? 
Remember that we said before that we use the entire frequency range to determine timbre as well as pitch. As it turns out, any sound contains many smaller component sounds at a wide variety of frequencies (we’ll learn more about this later; it’s really important!). What you’re seeing in this sonogram is a http://music.columbia.edu/cmc/musicandcomputers/ - 58 - representation of how all those component sounds change in frequency and amplitude over time. Now listen to Soundfile 3.3 at left. The sonogram shows that the mystery sound starts with a burst of energy that is spread out across the frequency spectrum—notice the spikes that reach all the way up to the top frequencies in the image. Then it settles down into a fairly constant, more concentrated, and lower energy state, where it remains until the end, when it quickly fades out. This is a pretty common description of a vibrating system: start it vibrating out of its rest state (chaotic, loud), listen to it settle into some sort of regular vibratory behavior, and then, if the energy source is removed (for example, you stop blowing the horn or take your e-bow off your electric guitar string), listen to it decay (again, chaotic). The presence of a band of high-amplitude, low-frequency energy coupled with some loweramplitude, high-frequency energy implies that we’re looking at some sort of pitched sound with a number of strong harmonics. The darkest low band is probably the fundamental note of the sound. By studying the sonogram, can you get a mental idea of what sort of sound it might be? Listen to the sound a few times while watching the waveform and sonogram images. Can you follow along? Is there a clear correlation between what you see and what you hear? Does the sound look the way it sounds? Do you agree that the sonogram gives you a more informative visual representation of the sound? Isn’t the frequency domain cool? Figure 3.5 Song of the hooded warbler. This is another kind of sonogram, kind of like a negative image of the sound moving in pitch (the y-axis) over time. The thickness of the line shows a lot about the pitch range. What this old-style sonogram did was try to find the maximum energy concentration and give a picture of the moving pitch of a sound, natural or otherwise. Sometimes pictures like this, which were very common a long time ago, are called melograms, or melographs, because they graph pitch in time. We got this wonderful picture out of an old book about recording natural sounds! http://music.columbia.edu/cmc/musicandcomputers/ - 59 - Figure 3.6 Just for historical interest, the picture above is an example of an old process called phonophotography, an early (1920s) method for capturing a graphic image of a sound. It’s essentially a melographic technique. What we are looking at is a picture of a "recording" of a performance of the gospel song "Swing Low, Sweet Chariot." This color image came from the work of a brilliant researcher named Metfessel. This kind of highly descriptive analysis greatly influenced music theorists in the first part of the 20th century. Many people saw it as a kind of revolutionary mechanism for describing sound and music, potentially removing music analysis from the realm of the aesthetic, the emotional, and the transcendental into a more modernist, scientific, and objective domain. 
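Peeking ahead to the Fourier machinery of Sections 3.3 and 3.4, here is a rough Python sketch of how a sonogram like Figure 3.4 gets built: chop the signal into short chunks and measure how much of each frequency each chunk contains. (This is a deliberately naive, slow version; real tools use the FFT, and the test signal and chunk size here are arbitrary choices.)

import cmath
import math

# Sketch (not from the book): a crude sonogram. Each chunk of the signal gets a
# row of numbers, one per analysis frequency, measuring how much energy that
# chunk has at that frequency (a direct, slow Fourier sum).
def sonogram(signal, chunk_size=256):
    rows = []
    for start in range(0, len(signal) - chunk_size + 1, chunk_size):
        chunk = signal[start:start + chunk_size]
        row = [abs(sum(x * cmath.exp(-2j * math.pi * k * n / chunk_size)
                       for n, x in enumerate(chunk)))
               for k in range(chunk_size // 2)]
        rows.append(row)            # one column of the time-frequency picture
    return rows

# A tone that jumps from 500 Hz to 2,000 Hz halfway through.
sr = 8000
sig = [math.sin(2 * math.pi * (500 if n < 2048 else 2000) * n / sr)
       for n in range(4096)]
picture = sonogram(sig)
# Each analysis bin is sr/256 = 31.25 Hz wide, so the loudest bin moves from
# bin 16 (500 Hz) early on to bin 64 (2,000 Hz) later, when the pitch jumps.
print(picture[0].index(max(picture[0])), picture[-1].index(max(picture[-1])))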
http://music.columbia.edu/cmc/musicandcomputers/ - 60 - Chapter 3: The Frequency Domain Section 3.2: Phasors Applet 3.1: Sampling a phasor Applet 3.2: Building a sawtooth wave partial by partial In Chapter 1 we talked about the basic atoms of sound—the sine wave—and about the function that describes the sound generated by a vibrating tuning fork. In this chapter we're talking a lot about the frequency domain. If you remember, our basic units, sine waves, only had two parameters: amplitude and frequency. It turns out that these dull little sine waves are going to give us the fundamental tool for the analysis and description of sound, and especially for the digital manipulation of sound. That's the frequency domain: a place where lots of little sine waves are our best friends. But before we go too far, it's important to fully understand what a sine wave is, and it's also wonderful to know that we can make these simple little curves ridiculously complicated, too. And it's useful to have another model for generating these functions. That model is called a phasor. Description of a Phasor Think of a bicycle wheel suspended at its hub. We're going to paint one of the spokes bright red, and at the end of the spoke we'll put a red arrow. We now put some axes around the wheel—the x-axis going horizontally through the hub, and the y-axis going vertically. We're interested in the height of the arrowhead relative to the x-axis as the wheel—our phasor—spins around counterclockwise. Figure 3.7 Sine waves and phasors. As the sine wave moves forward in time, the arrow goes around the circle at the same rate. The height of the arrow (that is, how far it is above or below the x-axis) as it spins around in a circle is described by the sine wave. In other words, if we trace the arrow's location on the circle (from 0 to 2π) and measure the height of the arrow on the y-axis as our phasor goes around the circle, the resulting curve is a sine wave! Thanks to: George Watson, Dept. of Physics & Astronomy, University of Delaware, [email protected], for this animation. http://music.columbia.edu/cmc/musicandcomputers/ - 61 - Figure 3.8 Phase as angle. Figure 3.9 Phasor: circle to sine. As time goes on, the phasor goes round and round. At each instant, we measure the height of the dot over the x-axis. Let's consider a small example first. Suppose the wheel is spinning at a rate of one revolution per second. This is its frequency (and remember, this means that the period is 1 second/revolution). This is the same as saying that the phasor spins at a rate of 360 degrees per second, or better yet, 2π radians per second (if we're going to be mathematicians, then we have to measure angles in terms of radians). So 2π radians per second is the angular velocity of the phasor. This means that after 0.25 second the phasor has gone π/2 radians (90 degrees), and after 0.5 second it's gone π radians, or 180 degrees, and so on. So, we can describe the amount of angle that the phasor has gone around at time t as a function, which we call θ(t); in this example, θ(t) = 2πt. Now, let's look at the function given by the height of the arrow as time goes on. The first thing that we need to remember is a little trigonometry. The sine and cosine of an angle are measured using a right triangle. For our right triangle, the sine of θ, written sin(θ), is given by the equation:

sin(θ) = a/c

where a is the length of the side opposite the angle θ and c is the length of the hypotenuse. http://music.columbia.edu/cmc/musicandcomputers/ - 62 - This means that:

a = c sin(θ)

We'll make use of this in a minute, because in this example a is the height of our triangle.
Similarly, the cosine, written cos(θ), is:

cos(θ) = b/c

where b is the length of the side adjacent to the angle θ. This means that:

b = c cos(θ)

This will come in handy later, too. Now back to our phasor. We're interested in measuring the height at time t, which we'll denote as h(t). At time t, the phasor's arrow is making an angle of θ(t) with the x-axis. Our basic phasor has a radius of 1, so we get the following relationship:

h(t) = sin(θ(t)) = sin(2πt)

We also get this nice graph of a function, which is our favorite old sine curve. Figure 3.10 Basic sinusoid. Now, how could we change this curve? Well, we could change the amplitude—this is the same as changing the length of our arrow on the phasor. We'll keep the frequency the same and make the radius of our phasor equal to 3. Then we get:

h(t) = 3 sin(2πt)

http://music.columbia.edu/cmc/musicandcomputers/ - 63 - Then we get this nice curve, which is another kind of sinusoid (bigger!). Figure 3.11 Bigger sine curve. Now let's start messing with the frequency, which is the rate of revolution of the phasor. Let's ramp it up a notch and instead start spinning at a rate of five revolutions per second. Now:

θ(t) = 10πt

This is easy to see since after 1 second we will have gone five revolutions, which is a total of 10π radians. Let's suppose that the radius of the phasor is 3. Again, at each moment we measure the height of our arrow (which we call h(t)), and we get:

h(t) = 3 sin(10πt)

Now we get this sinusoid: http://music.columbia.edu/cmc/musicandcomputers/ - 64 - Figure 3.12 Bigger, faster sine curve. In general, if our phasor is moving at a frequency of ν revolutions per second and has radius A, then plotting the height of the phasor is the same as graphing this sinusoid:

h(t) = A sin(2πνt)

Now we're almost done, but there is one last thing we could vary: we could change the place where we start our phasor spinning. For example, we could start the phasor moving at a rate of five revolutions per second with a radius of 3, but start the phasor at an angle of π/4 radians, instead. Now, what kind of function would this be? Well, at time t = 0 we want to be taking the measurement when the phasor is at an angle of π/4, but other than that, all is as before. So the function we are graphing is the same as the one above, but with a phase shift of π/4. The corresponding sinusoid is:

h(t) = 3 sin(10πt + π/4)

http://music.columbia.edu/cmc/musicandcomputers/ - 65 - Figure 3.13 Changing the phase. Our most general sinusoid of amplitude A, frequency ν, and phase shift φ has the form:

h(t) = A sin(2πνt + φ)

A particularly interesting example is what happens when we take the phase shift equal to 90 degrees, or π/2 radians. Let's make it nice and simple, with ν equal to one revolution per second and amplitude equal to 1 as well. Then we get our basic sinusoid, but shifted ahead by π/2:

h(t) = sin(2πt + π/2)

Does this look familiar? This is the graph of the cosine function! Figure 3.14 90-degree phase shift (cosine). You can do some checking on your own and see that this is also the graph that you would get if you plotted the displacement of the arrow from the y-axis. So now we know that a cosine is a phase-shifted sine! http://music.columbia.edu/cmc/musicandcomputers/ - 66 - Adding Phasors Fourier's theorem tells us that any periodic function can be expressed as a sum (possibly with an infinite number of terms!) of sinusoids. (We'll discuss Fourier's theorem in more depth later.) Remember, a periodic function is any function that looks like the infinite repetition of some fixed pattern. The length of that basic pattern is called the period of the function. We've seen a lot of examples of these in Chapter 1.
In particular, if the function has period T, then this sum looks like:

f(t) = A0 + A1 cos(2πt/T) + B1 sin(2πt/T) + A2 cos(2π(2t)/T) + B2 sin(2π(2t)/T) + A3 cos(2π(3t)/T) + B3 sin(2π(3t)/T) + ...

If T is the period of our periodic function, then we now know that its frequency is 1/T—this is also called the fundamental (frequency) of the periodic function, and we see that all other frequencies that occur (called the partials) are simply integer multiples of the fundamental. If you read other books on acoustics and DSP, you will find that partials are sometimes called overtones (from the German word "Oberton") and harmonics. There's often confusion about whether the first overtone is the second partial, and so on. So, to be specific, and also to be more in keeping with modern terminology, we're always going to call the first partial the one with the frequency of the fundamental. Example: Suppose we have a triangle wave that repeats once every 1/100 second. Then the corresponding fundamental frequency is 100 Hz (it repeats 100 times per second). Triangle waves only contain partials at odd multiples of the fundamental. (The even multiples have no energy—in fact, this is generally true of wave shapes that have the "odd" symmetry, like the triangle wave.) Click on Applet 3.2 and see a triangle wave built by adding one partial after another. This adding up of partials to make a complex waveform might make sense acoustically, but in order to really understand how to add phasors from a mathematical standpoint, we first need to understand how to add vectors, or arrows. How should we define an arithmetic of arrows? It sounds funny, but in fact it's a pretty natural generalization of what we already know about adding regular old numbers. When we add a negative number, we go backward, and when we add a positive number, we go forward. Our regular old numbers can be thought of as arrows on a number line. Adding any two numbers, then, simply means taking the two corresponding arrows and placing them one after http://music.columbia.edu/cmc/musicandcomputers/ - 67 - the other, tip to tail. The sum is then the arrow from the origin pointing to the place where "adding" the two arrows landed you. Really, what we are doing here is thinking of numbers as vectors. They have a magnitude (length) and a direction (in this case, positive or negative, or better yet 0 radians or π radians). Now, to add phasors, we need to enlarge our worldview and allow our arrows to get not just 2 directions, but instead a whole 2π radians' worth of directions! In other words, we allow our arrows to point anywhere in the plane. We add, then, just as before: place the arrows tip to tail, and draw an arrow from the origin to the final destination. So, to recap: to add phasors, at each instant as our phasors are spinning around, we add the two arrows. In this way, we get a new arrow spinning around (the sum) at some frequency—a new phasor. Now it's easy to see that the sum of two phasors of the same frequency yields a new phasor of the same frequency. We can also see that the sum of a cosine and sine of the same frequency is simply a phase-shifted sine of the same frequency, with a new amplitude given by the square root of the sum of the squares of the amplitudes of the two original phasors. That's the Pythagorean theorem! Sampling and Fourier Expansion Figure 3.15 The decomposition of a complex waveform into its component phasors (which is pretty much the same as saying the decomposition of an acoustic waveform into its component partials) is called Fourier expansion.
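Before moving from the mathematics to practice, it may help to see that there is nothing magical about the coefficients An and Bn in the sum above. The sketch below is our own toy example (a made-up waveform containing a fundamental and a weaker third partial, sampled 64 times per period); it recovers the coefficients the slow, brute-force way, by multiplying the samples against each cosine and sine and averaging. The FFT, discussed next, computes the same numbers far more efficiently.

#include <math.h>
#include <stdio.h>

#define N 64                        /* samples in one period of the waveform */
#define PI 3.14159265358979

int main(void)
{
    double x[N], A[6], B[6];

    /* one period of a test waveform: a fundamental plus a weaker
       3rd partial (arbitrary amplitudes, just for illustration) */
    for (int k = 0; k < N; k++)
        x[k] = 1.0 * sin(2 * PI * k / N) + 0.3 * sin(2 * PI * 3 * k / N);

    /* brute-force Fourier coefficients: average the samples against
       each cosine and sine */
    for (int n = 1; n <= 5; n++) {
        A[n] = B[n] = 0.0;
        for (int k = 0; k < N; k++) {
            A[n] += x[k] * cos(2 * PI * n * k / N);
            B[n] += x[k] * sin(2 * PI * n * k / N);
        }
        A[n] *= 2.0 / N;
        B[n] *= 2.0 / N;
        printf("partial %d:  A = %6.3f   B = %6.3f\n", n, A[n], B[n]);
    }
    return 0;
}

Running it prints B1 close to 1.0 and B3 close to 0.3, with everything else near zero, which is exactly the recipe we put in.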
In practice, the main thing that happens is that analog waveforms are sampled, creating a time-domain representation inside the computer. These samples are then converted (using what is called a fast Fourier transform, or FFT) into what are called Fourier coefficients. http://music.columbia.edu/cmc/musicandcomputers/ - 68 - Figure 3.16 FFT of sampled phasors exp(2*pi*j*x/64), x = 1, 1.01, 1.02, ..., 1.99, 2. Figure 3.17 FFT plot of a gamelan instrument. http://music.columbia.edu/cmc/musicandcomputers/ - 69 - Figure 3.17 shows a common way to show timbral information, especially the way that harmonics add up to produce a waveform. However, it can be slightly confusing. By running an FFT on a small time-slice of the sound, the FFT algorithm gives us the energy in various frequency bins. (A bin is a discrete slice, or band, of the frequency spectrum. Bins are explained more fully in Section 3.4.) The x-axis (bottom axis) shows the bin numbers, and the y-axis shows the strength (energy) of each partial. The slightly strange thing to keep in mind about these bins is that they are not based on the frequency of the sound itself, but on the sampling rate. In other words, the bins evenly divide the sampling frequency (linearly, not exponentially, which can be a problem, as we'll explain later). Also, this plot shows just a short fraction of time of the sound: to make it time-variant, we need a waterfall 3D plot, which shows frequency and amplitude information over a span of time. Although theoretically we could use the FFT data shown in Figure 3.17 in its raw form to make a lovely, synthetic gamelan sound, the complexity and idiosyncrasies of the FFT itself make this a bit difficult (unless we simply use the data from the original, but that's cheating). Figure 3.18 shows a better graphical representation of sound in the frequency domain. Time is running from front to back, height is energy, and the x-axis is frequency. This picture also takes the essentially linear FFT and shows us an exponential image of it, so that most of the "action" happens in the lower 2k, which is correct. (Remember that the FFT divides the frequency spectrum into linear, equal divisions, which is not really how we perceive sound—it's often better to graph this exponentially so that there's not as much wasted space "up top.") The waterfall plot in Figure 3.18 is stereo, and each channel of sound has its own slightly different timbre. Figure 3.18 Waterfall plot. Here's a fact that will help a great deal: if the highest frequency is B times the fundamental, then you only need 2B + 1 samples to determine the Fourier coefficients. (It's easy to see that you should need at least 2B, since you are trying to get 2B pieces of information (B amplitudes and B phase shifts).) http://music.columbia.edu/cmc/musicandcomputers/ - 70 - Figure 3.19 Aliasing, foldover. This is a phenomenon that happens when we try to sample a frequency that is more than half the sampling rate, or the Nyquist frequency. As the frequency we want to sample gets higher than half the sampling rate, we start "undersampling" and get unwanted, lower-frequency artifacts (that is, low frequencies created by the sampling process itself). http://music.columbia.edu/cmc/musicandcomputers/ - 71 - Chapter 3: The Frequency Domain Section 3.3: Fourier and the Sum of Sines Soundfile 3.5: Adding sine waves Soundfile 3.6: Trumpet Applet 3.3: FFT demo Soundfile 3.7: Filter examples. We start with a sampled file of bugs.
This recording was made by composer and sound artist David Dunn using very powerful hydrophones (microphones that work underwater) to amplify the "microsound" of bugs in a pond. Soundfile 3.8: High-pass filtered. This is David Dunn's bugs sound file filtered so that we hear only the frequencies above 1,000 Hz. In other words, we have high-pass filtered the sound. Soundfile 3.9: Low-pass filtered. This is the bugs sound file filtered so that we hear only the frequencies below 1,000 Hz. Here we have low-pass filtered the sound. In this section, we'll try to really explain the notion of a Fourier expansion by building on the ideas of phasors, partials, and sinusoidal components that we introduced in the previous section. A long time ago, French scientist and mathematician Jean-Baptiste Fourier (1768–1830) proved the mathematical fact that any periodic waveform can be expressed as the sum of an infinite set of sine waves. The frequencies of these sine waves must be integer multiples of some fundamental frequency. In other words, if we have a trumpet sound at middle A (440 Hz), we know by Fourier's theorem that we can express this sound as a summation of sine waves: 440 Hz, 880 Hz, 1,320 Hz, 1,760 Hz, ..., or 1, 2, 3, 4, ... times the fundamental, each at various amplitudes. This is rather amazing, since it says that for every periodic waveform (one, by the way, that has pitch), we basically know everything about its partials except their amplitudes (and phases). http://music.columbia.edu/cmc/musicandcomputers/ - 72 - The spectrum of the sine wave has energy only at one frequency. The triangle wave has energy at odd-numbered harmonics (meaning odd multiples of the fundamental), with the energy of each harmonic decreasing as 1 over the square of the harmonic number (1/N²). In other words, at the frequency that is N times the fundamental, we have 1/N² as much energy as in the fundamental. The partials in the sawtooth wave decrease in energy in proportion to the inverse of the harmonic number (1/N). Pulse (or rectangle or square) waveforms have energy over a broad area of the spectrum, but only for a brief period of time. Fourier Series What exactly is a Fourier series, and how does it relate to phasors? We use phasors to represent our basic tones. The amazing fact is that any sound can be represented as a combination of phase-shifted, amplitude-modulated tones of differing frequencies. Remember that we got a hint of this concept when we discussed adding phasors in Section 3.2. http://music.columbia.edu/cmc/musicandcomputers/ - 73 - A phasor is essentially a way of representing a sinusoidal function. What this means, mathematically, is that any sound can be represented as a sum of sinusoids. This sum is called a Fourier series. Note that we haven't limited these sounds to periodic sounds (if we did, we'd have to add that last qualifier about integer multiples of a fundamental frequency). Nonperiodic, or aperiodic, sounds are just as interesting—maybe even more interesting—than periodic ones, but we have to do some special computer tricks to get a nice "harmonic" series out of them for the purposes of analysis and synthesis. But let's get down to the nitty-gritty. First, let's take a look at what happens when we add two sinusoids of the same frequency. Adding a sine and cosine of the same frequency gives a phase-shifted sine of the same frequency:

A cos(2πνt) + B sin(2πνt) = C sin(2πνt + φ)

In fact, the amplitude of the sum, C, is given by:

C = √(A² + B²)

The phase shift, φ, is given by the angle whose tangent is equal to A/B.
The shorthand for this is φ = arctan(A/B). We can visualize this with a phasor. Remember that the cosine is just a phase-shifted sine. Since the sine and cosine are moving at the same frequency, they are always "out of sync" by π/2, so when we add them (tip to tail, instant by instant) we get another sinusoid of that frequency. http://music.columbia.edu/cmc/musicandcomputers/ - 74 - Any periodic function of period 1 can be written as follows:

f(t) = A0 + A1 cos(2πt) + B1 sin(2πt) + A2 cos(2π(2t)) + B2 sin(2π(2t)) + A3 cos(2π(3t)) + B3 sin(2π(3t)) + ...

Notice that these sums can be infinite! We have a nice shorthand for those possibly infinite sums (also called an infinite series):

f(t) = A0 + Σ An cos(2πnt) + Σ Bn sin(2πnt), where the sums run over n = 1, 2, 3, ...

The numbers An and Bn (the amounts of each cosine and sine) are called the Fourier coefficients of the function f(t). The Fourier coefficient A0 has a special name: it is called the DC term, or the DC offset. It tells you the average value of the function. The Fourier coefficients make up a set of numbers called the spectrum of the sound. Now, when you think of the word "spectrum," you might think of colors, like the spectrum of colors of the rainbow. In a way it's the same: the spectrum tells you how much of each frequency (color) is in the sound. The values of An and Bn for "small" values of n make up the low-frequency information, and we call these the low-order Fourier coefficients. Similarly, the big values of n index the high-frequency information. Since most sounds are made up of a lot of low-frequency information, the low-frequency Fourier coefficients have larger absolute value than the high-frequency Fourier coefficients. What this means is that it is theoretically possible to take a complex sound, like a person's voice, and decompose it into a bunch of sine waves, each at a different frequency, amplitude, and phase. These are called the sinusoidal or spectral components of a sound. To find them, we do a Fourier analysis. Fourier synthesis is the inverse process, where we take varying amounts of a bunch of sine waves and add them together (play them at the same time) to reconstruct a sound. Sounds a bit fantastic, doesn't it? But it works. This process of analyzing or synthesizing a sound based on its component sine waves is called performing a Fourier transform on the sound. When the computer does it, it uses a very efficient technique called the fast Fourier transform (or FFT) for analysis and the inverse FFT (IFFT) for synthesis. http://music.columbia.edu/cmc/musicandcomputers/ - 75 - What happens if we add a number of sine waves together? We end up with a complicated waveform that is the summation of the individual waves. This picture is a simple example: we just added up two sine waves. For a complex sound, hundreds or even thousands of sine waves are needed to accurately build up the complex waveform. By looking at the illustration from the bottom up, you can see that the inverse is also true—the complex waveform can be broken down into a collection of independent sine waves. Figure: Adding sine waves http://music.columbia.edu/cmc/musicandcomputers/ - 76 - A trumpet note in an FFT (fast Fourier transform) analysis—two views. The trumpet sound can be heard by clicking on Soundfile 3.6. Both of these pictures show the evolution of the amplitude of spectral components in time. The advantage of representing a sound in terms of its Fourier series is that it allows us to manipulate the frequency content directly. If we want to accentuate the high-frequency effects in a sound (make a sound brighter), we could just make all the high-frequency Fourier coefficients bigger in amplitude.
If we wanted to turn a sawtooth wave into a square wave, we could just set to zero the Fourier coefficients of the even partials. In fact, we often modify sounds by removing certain frequencies. This corresponds to making a new function where certain Fourier coefficients are set equal to zero while all others are left alone. When we do this we say that we filter the function or sound. These sorts of filters are called bandpass filters, and the frequencies that we leave unaltered in this sort of situation are said to be in the passband. A low-pass filter puts all the low frequencies (up to some cutoff) in the passband, while a high-pass filter puts all the high frequencies (down to some cutoff) in the passband. When we do this, we talk about high-passing and low-passing the sound. In the following soundfiles, we listen to a sound and its high-passed and low-passed versions. We'll talk a lot more about filters in Chapter 4. Once you have the spectral content of a sound, there is a lot you can do with it, but how do you get it?! That's what the FFT does, and we'll talk about it in the next section. http://music.columbia.edu/cmc/musicandcomputers/ - 77 - Chapter 3: The Frequency Domain Section 3.4: The DFT, FFT, and IFFT Xtra bit 3.2: Dan's history of the FFT Xtra bit 3.3: The mathematics of magnitude and phase in the FFT The most common tools used to perform Fourier analysis and synthesis are called the fast Fourier transform (FFT) and the inverse fast Fourier transform (IFFT). The FFT and IFFT are optimized (very fast) computer-based algorithms that perform a generalized mathematical process called the discrete Fourier transform (DFT). The DFT is the actual mathematical transformation that the data go through when converted from one domain to another (time to frequency). Basically, the FFT is just a fast way of computing the DFT, which on its own is too slow for our impatient ears and brains! FFTs, IFFTs, and DFTs became really important to a lot of disciplines when engineers figured out how to take samples quickly enough to generate enough data to re-create sound and other analog phenomena digitally. Remember, they don't just work on sounds; they work on any continuous signal (images, radio waves, seismographic data, etc.). An FFT of a time-domain signal takes the samples and gives us a new set of numbers representing the frequencies, amplitudes, and phases of the sine waves that make up the sound we've analyzed. It is these data that are displayed in the sonograms we looked at in Section 3.1. http://music.columbia.edu/cmc/musicandcomputers/ - 78 - Figure 3.23 Graph and table of spectral components. Figure 3.23 shows the first 16 bins of a typical FFT analysis after the conversion is made from real and imaginary numbers to amplitude/phase pairs. We left out the phases, because, well, it was too much trouble to just make up a bunch of arbitrary phases between 0 and 2π. In a lot of cases, you might not need them (and in a lot of cases, you would!). In this case, the sample rate is 44.1 kHz and the FFT size is 1,024, so the bin width (in frequency) is the Nyquist frequency (44,100/2 = 22,050 Hz) divided by the number of bins (1,024/2 = 512), or about 43 Hz. Amplitude values are assumed to be between 0 and 1, and notice that they're quite small because they all must sum to 1 (and there are a lot of bins!).
We confess that we just sort of made up the numbers; but notice that we made them up to represent a sound that has a simple, more or less harmonic structure with a fundamental somewhere in the 66 Hz to 88 Hz range (you can see its harmonics at around 2, 3, 4, 5, and 6 times its frequency, and note that the harmonics decrease in amplitude more or less like they would in a sawtooth wave). How the FFT Works The way the FFT works is fairly straightforward. It takes a chunk of time called a frame (a certain number of samples) and considers that chunk to be a single period of a repeating waveform. The reason that this works is that most sounds are "locally stationary"; meaning that over any short period of time, the sound really does look like a regularly repeating function. The following is a way to consider this mathematically—taking a window over some portion of some signal that we want to consider as a periodic function. http://music.columbia.edu/cmc/musicandcomputers/ - 79 - The Fast Fourier Transform in a Nutshell: Computing Fourier Coefficients Here’s a little three-step procedure for digital sound processing. 1. Window 2. Periodicize 3. Fourier transform (this also requires sampling, at a rate equal to 2 times the highest frequency required). We do this with the FFT. Following is an illustration of steps 1 and 2. Here’s the graph of a (periodic) function, f(t). (Note that f(t) need not be a periodic function.) Figure 3.24 Suppose we’re only interested in the portion of the graph between 0 ≤ t ≤ 1. Following is a graph of the window function we need to use. We’ll call the function w(t). Note that w(t) equals 1 only in the interval 0 ≤ t ≤ 1 and it’s 0 everywhere else. Figure 3.25 In step 1, we need to window the function. In Figure 3.25 we’ve plotted both the window function, w(t) (which is nonzero in the region we’re interested in) and function f(t) in the same picture. http://music.columbia.edu/cmc/musicandcomputers/ - 80 - Figure 3.26 In Figure 3.26 we’ve plotted f(t)*w(t), which is the periodic function multiplied by the windowing function. From this figure, it’s obvious what part of f(t) we’re interested in. Figure 3.27 In step 2, we need to periodically extend the windowed function, f(t)*w(t), all along the t-axis. Figure 3.28 Great! We now have a periodic function, and the Fourier theorem says we can represent this function as a sum of sines and cosines. This is step 3. Remember, we can also use other, nonsquare windows. This is done to ameliorate the effect http://music.columbia.edu/cmc/musicandcomputers/ - 81 - of the square windows on the frequency content of the original signal. Now, once we’ve got a periodic function, all we need to do is figure out, using the FFT, what the component sine waves of that waveform are. As we’ve seen, it is possible to represent any periodic waveform as a sum of phase-shifted sine waves. In theory, the number of component sine waves is infinite—there is no limit to how many frequency components a sound might have. In practice, we need to limit ourselves to some predetermined number. This limit has a serious effect on the accuracy of our analysis. Here’s how that works: rather than looking for the frequency content of the sound at all possible frequencies (an infinitely large number—100.000000001 Hz, 100.000000002 Hz, 100.000000003 Hz, etc.), we divide up the frequency spectrum into a number of frequency bands and call them bins. The size of these bins is determined by the number of samples in our analysis frame (the chunk of time mentioned above). 
The number of bins is given by the formula: number of bins = frame size/2 Frame Size So let’s say that we decide on a frame size of 1,024 samples. This is a common choice because most FFT algorithms in use for sound processing require a number of samples that is a power of two, and it’s important not to get too much or too little of the sound. A frame size of 1,024 samples gives us 512 frequency bands. If we assume that we’re using a sample rate of 44.1 kHz, we know that we have a frequency range (remember the Nyquist theorem) of 0 kHz to 22.05 kHz. To find out how wide each of our frequency bins is, we use the following formula: bin width = frequency/number of bins This formula gives us a bin width of about 43 Hz. Remember that frequency perception is logarithmic, so 43 Hz gives us worse resolution at the low frequencies and better resolution at higher frequencies. By selecting a certain frame size and its corresponding bandwidth, we avoid the problem of having to compute an infinite number of frequency components in a sound. Instead, we just compute one component for each frequency band. http://music.columbia.edu/cmc/musicandcomputers/ - 82 - Software That Uses the FFT Figure 3.29 Example of a commonly used FFT-based program: the phase vocoder menu from Tom Erbe’s SoundHack. Note that the user is allowed to select (among several other parameters) the number of bands in the analysis. This means that the user can customize what is called the time/frequency resolution trade-off of the FFT. Don’t ask us what the other options on this screen are— download the program and try it yourself! There are many software packages available that will do FFTs and IFFTs of your data for you and then let you mess around with the frequency content of a sound. In Chapter 4 we’ll talk about some of the many strange and wonderful things that can be done to a sound in the frequency domain. The details of how the FFT works are well beyond the scope of this book. What is important for our purposes is that you understand the general idea of analyzing a sound by breaking it into its frequency components and, conversely, by using a bunch of frequency components to synthesize a new sound. The FFT has been understood for a long time now, and most computer music platforms have tools for Fourier analysis and synthesis. Figure 3.30 Another way to look at the frequency spectrum is to remove time as an axis and just consider a sound as a histogram of frequencies. Think of this as averaging the frequencies over a long time interval. This kind of picture (where there’s no time axis) is useful for looking at a shortterm snapshot of a sound (often just one frame), or perhaps even for trying to examine the spectral features of a sound that doesn’t change much over time (because all we see are the "averages"). The y-axis tells us the amplitude of each component frequency. Since we’re looking at just one frame of an FFT, we usually assume a periodic, unchanging signal. A histogram is generally most useful for investigating the steady-state portion of a sound. (Figure 3.30 is a screen grab from SoundHack.) http://music.columbia.edu/cmc/musicandcomputers/ - 83 - Chapter 3: The Frequency Domain Section 3.5: Problems with the FFT/IFFT Soundfile 3.10: Sine wave lobes. Soundfile 3.10 is an example of a sine wave swept from 50 Hz to 10 kHz, processed through an FFT. Figure 3.32 illustrates the sine wave sweep in an FFT analysis. 
The lobes that you see are the result of the energy of the sine wave "centering" in the successive FFT bands and then fading slightly as the width of the band forces the FFT into less accurate representations of the moving frequency (until it centers in the next band). In other words, one of the hardest things for an FFT to represent is simply a moving sine wave! Soundfile 3.11: The same sound analyzed with two different FFT frame sizes (see Figure 3.33). Soundfile 3.12: Beat sound. A fairly normal-sounding beat soundfile. Soundfile 3.13: Time smeared. In this example the beat soundfile has been processed to provide accurate frequency resolution but inaccurate time resolution, or rhythmic smearing. Soundfile 3.14: Frequency smeared. In this example the beat soundfile has been processed to provide accurate rhythmic resolution but inaccurate frequency resolution, or spectral smearing. http://music.columbia.edu/cmc/musicandcomputers/ - 84 - Figure 3.31 There is a lot of "wasted" space in an FFT analysis—most of the frequencies that are of concern to us tend to be below 5 kHz. Shown are the approximate ranges of some musical instruments and the human voice. It's also a big, big problem that the FFT divides the frequency range into linear segments (each frequency bin is the same "width") while, as we well know, our perception of frequency is logarithmic. Charts courtesy of Geoff Husband and tnt-audio.com. Used with permission. The FFT often sounds like the perfect tool for exploring the frequency domain and timbre, right? Well, it does work very well for many things, but it's not without its problems. One of the main drawbacks is that the frequency bins are linear. For example, if we have a bin width of 43 Hz (the result of dividing the Nyquist frequency, 22,050 Hz, by the 512 bins of a 1,024-sample frame), then we have bins from 0 Hz to 43 Hz, 43 Hz to 86 Hz, 86 Hz to 129 Hz, and so on. The problem with this, as we learned earlier, is that the human ear responds to frequency logarithmically, not linearly. At low frequencies, 43 Hz is quite a wide interval—the jump from 43 Hz to 86 Hz is a whole octave! But at higher frequencies, 43 Hz is a tiny interval (perceptually)—less than a minor second. So the FFT has very fine high-frequency pitch resolution, but very poor low-frequency resolution. The effect of the FFT's linearity is that, for us, much of the FFT data is "wasted" on recording high-frequency information very accurately, at the expense of the low-frequency information that is generally more useful in a musical context. Wavelets, which we'll look at in Section 3.6, are one approach to solving this problem. Figure 3.32 Sweeping a sine wave through an FFT. http://music.columbia.edu/cmc/musicandcomputers/ - 85 - Frequency and Time Resolution Trade-Off A related drawback of the FFT is the trade-off that must be made between frequency and time resolution. The more accurately we want to measure the frequency content of a signal, the more samples we have to analyze in each frame of the FFT. Yet there is a cost to expanding the frame size—the larger the frame, the less we know about the temporal events that take place within that frame.
In other words, more samples require more time; but the longer the time, the less the sound over that interval looks like a sine wave, or something periodic—so the less well it is represented by the FFT. We simply can’t have it both ways! Figure 3.33 Selecting an FFT size involves making trade-offs in terms of time and frequency accuracy. Basically it boils down to this: The more accurate the analysis is in one domain, the less accurate it will be in the other. This figure illustrates what happens when we choose different frame sizes. In the first illustration, we used an FFT size of 512 samples, giving us pretty good time resolution. In the second, we used 2,048 samples, giving us pretty good frequency resolution. As a result, frequencies are smeared vertically in the first analysis, while time is smeared horizontally in the second. What’s the solution to the time/frequency uncertainty dilemma? Compromise. Time Smearing We mentioned that 1,024 samples (1k) is a pretty common frame size for an audio FFT. At a sample rate of 44.1 kHz, 1,024 samples is about 0.022 second of sound. What that means is that all the sonic events that take place within that 0.022 second will be lumped together and analyzed as one event. Because of the nature of the FFT, this "event" is actually treated as if it were an infinitely repeating periodic waveform. The amplitudes of the frequency components http://music.columbia.edu/cmc/musicandcomputers/ - 86 - of all the sonic events in that time frame will be averaged, and these averages will end up in the frequency bins. This is known as time smearing. Now let’s say that we need more than the 43 Hz frequency resolution that a 1k FFT gives us. To get better frequency resolution, we need to use a bigger frame size. But a bigger frame size means that even more samples will be lumped together, giving us even worse time resolution. At a frame size of 2k we get a frequency resolution of about 21.5 Hz, but our time resolution goes down to about 0.05 (1/20) of a second. And, believe it or not, a great deal can happen in 1/20 of a second! Good Time Resolution Conversely, if we need good time resolution (say we’re analyzing some percussive sounds and we want to know exactly when they happen), we need to shrink the frame size. The ideal frame size for the time domain would of course be one sample—that way we would know at exactly which sample something happened. Unfortunately, with only one sample to analyze, we would get no useful frequency information out of the FFT at all. A more reasonable frame size and one that is considered small for audio, such as 256 samples (a 0.006-second chunk of time), gives us 128 analysis bands, for a bin width of about 172 Hz. While a 0.006-second time resolution is reasonable, 172 Hz is a pretty dreadful frequency resolution. That would put several bottom octaves of the piano into one averaged bin. A Compromise So what’s the answer to this time/frequency dilemma? There really isn’t one. If we use the FFT to do our analysis, we’re stuck with the fact that higher resolution in one domain results in lower resolution in the other. The trick is to find a useful balance, based on the types of sounds we are analyzing. No single frame size will work well for all sounds. http://music.columbia.edu/cmc/musicandcomputers/ - 87 - Chapter 3: The Frequency Domain Section 3.6: Some Alternatives to the FFT We know that in previous sections we’ve made it sound like the FFT is the only game in town, but that’s not entirely true. 
Historically, the FFT was the first, most important, and most widely used algorithm for frequency domain analysis. Musicians, especially, have taken a liking to it because there are some well-known references on its implementation, and computer musicians have gained a lot of experience in its nuances. It's fairly simple to program, and there are a lot of existing software programs to use if you get lazy. Also, it works pretty well (emphasis on the word "pretty"). The FFT breaks up a sound into sinusoids, but there may be reasons why you'd like to break a sound into other sorts of basic sounds. These alternate transforms can work better, for some purposes, than the FFT. (A transform is just something that takes a list of the sample values and turns the values into a new list of numbers that describes a possible new way to add up different basic sounds.) For example, wavelet transforms (sometimes called dynamic base transforms) can modify the resolution for different frequency ranges, unlike the FFT, which has a constant bandwidth (and thus is, by definition, less sensitive to lower frequencies—the important ones—than to higher ones!). Wavelets also use a variety of different analysis waveforms (as opposed to the FFT, which only uses sinusoids) to get a better representation of a signal. Unfortunately, wavelet transforms are still a bit uncommon in computer software, since they are, in general, harder to implement than FFTs. One of the big problems is deciding which wavelet to use, and in what frequency bandwidth. Often, that decision-making process can be more important (and time-consuming) than the actual transform! Wavelets are a pretty hot topic though, so we'll probably be seeing some wavelet-based techniques emerge in the near future. Figure 3.34 Some common wavelet analysis waveforms. Each of these is a prototype, or "mother wavelet," that provides a basic template for all the related analysis waveforms. A wavelet analysis will start with one of these waveforms and break up the sound into a sum of translated and stretched (dilated) versions of these. http://music.columbia.edu/cmc/musicandcomputers/ - 88 -
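Full wavelet analysis is well beyond a code snippet, but to make "translated and stretched versions" a little more concrete, here is a sketch of one level of the simplest wavelet transform there is, the Haar transform (which, to be clear, is not one of the mother wavelets pictured in Figure 3.34, and the tiny input signal is made up). Each pair of neighboring samples becomes an average (a smoothed, half-length view of the signal) and a difference (the fine detail at that moment); running the same step again on the averages gives coarser and coarser views, each tuned to a different time scale.

#include <stdio.h>

/* One level of the Haar wavelet transform: each pair of samples (a, b)
   becomes an average (a+b)/2 and a difference (a-b)/2. The averages are a
   half-length, smoothed version of the signal; the differences hold the
   detail needed to reconstruct it exactly. n must be even. */
void haar_level(const float *in, int n, float *avg, float *diff)
{
    for (int i = 0; i < n / 2; i++) {
        avg[i]  = (in[2 * i] + in[2 * i + 1]) / 2.0f;
        diff[i] = (in[2 * i] - in[2 * i + 1]) / 2.0f;
    }
}

int main(void)
{
    float x[8] = { 1, 3, 5, 11, 12, 12, 6, 2 };   /* a tiny made-up signal */
    float avg[4], diff[4];

    haar_level(x, 8, avg, diff);
    for (int i = 0; i < 4; i++)
        printf("avg = %5.2f   diff = %5.2f\n", avg[i], diff[i]);
    /* each original pair can be recovered as avg + diff and avg - diff */
    return 0;
}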
McAulay-Quatieri (MQ) Analysis Another interesting approach to a more organized, information-rich set of transforms is the extended McAulay-Quatieri (MQ) analysis algorithm. This is used in the popular Macintosh program Lemur Pro, written by the brilliant computer music researcher Kelly Fitz. http://music.columbia.edu/cmc/musicandcomputers/ - 89 - MQ analysis works a lot like normal FFTs—it tries to figure out how a sound can be re-created using a number of sine waves. However, Lemur Pro is an enhancement of more traditional FFT-based software in that it uses the resulting sine waves to extract frequency tracks from a sound. These tracks attempt to follow particular components of the sound as they change over time. That is, MQ not only represents the amplitude trajectories of partials, but also tries to describe the ways that spectral components change in frequency over time. This is a fundamentally different way of describing spectra, and a powerful one. MQ is an excellent idea, as it allows the analyst a much higher level of data representation and provides a more perceptually significant view of the sound. A number of more advanced FFT-based programs also implement some form of frequency tracking in order to improve the results of the analysis, similar to that of MQ analysis. For example, even though a sine wave might sweep from 1 kHz to 2 kHz, it would still be contained in one MQ track (rather than fading in and out of FFT bins, forming what are called lobes). An MQ-type analysis is often more useful for sound work than regular FFT analysis, since it has sonic cognition ideas built into it. For example, the ear tends to follow frequencies in time, much like an MQ analysis, and in our perception loud frequencies tend to mask neighboring quieter frequencies (as do MQ analyses). These kinds of sophisticated analysis techniques, where frequency and amplitude are isolated, suggest purely graphical manipulation of sounds, and we'll show some examples of that in Section 5.7. Figure 3.35 Frequency tracks in Lemur Pro. The information looks much like one of our old sonograms, but in fact it is the result of a great deal of high-level processing of FFT-type data. Figure 3.36 MQ plot of 0'00" to 5'49" of Movement II of the electro-acoustic work "Sud" by the great French composer and computer music pioneer Jean-Claude Risset. Photo by David Hirst, an innovative composer and computer music researcher from Australia.
http://music.columbia.edu/cmc/musicandcomputers/ - 90 - Chapter 4: The Synthesis of Sound by Computer Section 4.1: Introduction to Sound Synthesis What Is Sound Synthesis? We've learned that a digitally recorded sound is stored as a sequence of numbers. What would happen if, instead of using recorded sounds to generate those numbers, we simply generated the numbers ourselves, using the computer? For instance, what if we just randomly select 44,100 numbers (1 second's worth) and call them samples? What might the result sound like? The answer is noise. No surprise here—it makes sense that a signal generated randomly would sound like noise. Philosophically (and mathematically), that's more or less the definition of noise: random or, more precisely, unpredictable information. While noise can be sonically interesting (at least at first!) and useful for lots of things, it's clear that we'll want to be able to generate other types of sounds too. We've seen how fundamentally important sine waves are; how about generating one of those? The formula for generating a sine wave is very straightforward: y = sin(x) That's simple, but controlling frequency, amplitude, phase, or anything more complicated than a sine wave can get a bit trickier. Applet 4.1: Bring in da funk, bring in da noise. This applet allows you to hear some basic types of noise: pink, white, and red. We use different colors to describe different types of noise, based on the average frequency content of the noiseband. White noise contains all frequencies, at random amplitudes (some of which may be zero). Pink noise has a heavier dose of lower frequencies, and red noise even more. Sine Wave Example If we take a sequence of numbers (the usual range is 0 to 2π, perhaps with a phase increment of 0.1—meaning that we're adding 0.1 for each value) and plug them into the sine formula, what results is another sequence of numbers that describe a sine wave. By carefully selecting the numbers, we can control the frequency of the generated sine wave. Most basic computer synthesis methods follow this same general scheme: a formula or function is defined that accepts a sequence of values as input. Since waveforms are defined in terms of amplitudes in time, the input sequence of numbers to a synthesis function is usually an ongoing list of time values. A synthesis function usually outputs a sequence of sample values describing a waveform. These sample values are, then, a function of time—the result of the specific synthesis method. For example, assume the following sequence of time values: (t1, t2, t3, ..., tn) http://music.columbia.edu/cmc/musicandcomputers/ - 91 - Then we have some function of those values (we can call it f(t)): (f(t1), f(t2), f(t3), ..., f(tn)) = (s1, s2, s3, ..., sn), where (s1, s2, s3, ..., sn) are the sample values. This sine wave function is a very simple example. For some sequence of time values (t1, t2, t3, ..., tn), sin(t) gives us a new set of numbers {sin(t1), sin(t2), sin(t3), ..., sin(tn)}, which we call the signal's sample values. Xtra bit 4.1: Computer code for generating an array of a sine wave

#include <math.h>   /* for sin() */

#define TABLE_SIZE 512
#define TWO_PI (3.14159 * 2)

int main(void)
{
    float samples[TABLE_SIZE];                    /* one cycle of a sine wave */
    float phaseIncrement = TWO_PI / TABLE_SIZE;   /* phase step between samples */
    float currentPhase = 0.0;
    int i;

    for (i = 0; i < TABLE_SIZE; i++) {
        samples[i] = sin(currentPhase);   /* evaluate the sine at the current phase */
        currentPhase += phaseIncrement;   /* advance the phase toward 2*pi */
    }
    return 0;
}

As we'll see later, more sophisticated functions can be used to generate increasingly complex waveforms.
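One thing the code in Xtra bit 4.1 does not show is how to control frequency. A common trick is to treat the stored cycle as a wavetable and step through it faster or slower: to play it back at a given frequency, you advance the read position by tableSize × frequency ÷ sampleRate positions per output sample and wrap around at the end of the table. The sketch below only illustrates that idea; the sample rate, the truncating table lookup, and the function name are our own choices, not something prescribed by the text.

#include <math.h>

#define TABLE_SIZE  512
#define SAMPLE_RATE 44100.0f
#define TWO_PI      (3.14159f * 2)

/* Generate numSamples of a sine wave at the given frequency and amplitude by
   reading repeatedly through a one-cycle wavetable at the appropriate rate. */
void wavetable_sine(float *out, int numSamples, float frequency, float amplitude)
{
    float table[TABLE_SIZE];
    float tableIndex = 0.0f;
    /* how far to move through the table for each output sample */
    float indexIncrement = TABLE_SIZE * frequency / SAMPLE_RATE;
    int i, n;

    for (i = 0; i < TABLE_SIZE; i++)             /* fill one cycle, as in Xtra bit 4.1 */
        table[i] = sinf(TWO_PI * i / TABLE_SIZE);

    for (n = 0; n < numSamples; n++) {
        out[n] = amplitude * table[(int)tableIndex];   /* crude (truncating) lookup */
        tableIndex += indexIncrement;
        if (tableIndex >= TABLE_SIZE)                  /* wrap around to start the next cycle */
            tableIndex -= TABLE_SIZE;
    }
}

A more careful oscillator would interpolate between neighboring table entries instead of truncating the index, but the principle is the same.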
Often, instead of generating samples from a raw function in time, we take in the results of applying a previous function (a list of numbers, not a list of time values) and do something to that list with our new function. In fact, we have a more general name for sound functions that take one sequence of numbers (samples) as an input and give you another list back: they’re called filters. You may never have to get quite this "down and dirty" when dealing with sound synthesis— most computer music software takes care of things like generating noise or sine waves for you. But it’s important to understand that there’s no reason why the samples we use need to come from the real world. By using the many available synthesis techniques, we can produce a virtually unlimited variety of purely synthetic sounds, ranging from clearly artificial, machine-made sounds to ones that sound almost exactly like their "real" counterparts! There are many different approaches to sound synthesis, each with its own strengths, weaknesses, and suitability for creating certain types of sound. In the sections that follow, we’ll take a look at some simple and powerful synthesis techniques. http://music.columbia.edu/cmc/musicandcomputers/ - 92 - Chapter 4: The Synthesis of Sound by Computer Section 4.2: Additive Synthesis Additive synthesis refers to a number of related synthesis techniques, all based on the idea that complex tones can be created by the summation, or addition, of simpler tones. As we saw in Chapter 3, it is theoretically possible to break up any complex sound into a number of simpler ones, usually in the form of sine waves. In additive synthesis, we use this theory in reverse. Figure 4.1 Two waves joined by a plus sign. Figure 4.2 This organ has a great many pipes, and together they function exactly like an additive synthesis algorithm. Each pipe essentially produces a sine wave (or something like it), and by selecting different combinations of harmonically related pipes (as partials), we can create different combinations of sounds, called (on the organ) stops. This is how organs get all those different sounds: organists are experts on Fourier series and additive synthesis (though they may not know that!). http://music.columbia.edu/cmc/musicandcomputers/ - 93 - The technique of mixing simple sounds together to get more complex sounds dates back a very long time. In the Middle Ages, huge pipe organs had a great many stops that could be "pulled out" to combine and recombine the sounds from several pipes. In this way, different "patches" could be created for the organ. More recently, the telharmonium, a giant electrical synthesizer from the early 1900s, added together the sounds from dozens of electromechanical tone generators to form complex tones. This wasn’t very practical, but it has an important place in the history of electronic and computer music. This applet demonstrates how sounds are mixed together. Applet 4.2 Mixed sounds The Computer and Additive Synthesis While instruments like the pipe organ were quite effective for some sounds, they were limited by the need for a separate pipe or oscillator for each tone that is being added. Since complex sounds can require anywhere from a couple dozen to several thousand component tones, each needing its own pipe or oscillator, the physical size and complexity of a device capable of producing these sounds would quickly become prohibitive. Enter the computer! 
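In code, additive synthesis really is just a running sum of sine waves, one per partial, as the sketch below shows (the partial frequencies and amplitudes in the commented example are arbitrary and are not taken from any real instrument). The square wave example discussed below works exactly this way, just with a particular recipe of odd partials.

#include <math.h>

#define SAMPLE_RATE 44100.0
#define PI 3.14159265358979

/* Additive synthesis: mix numPartials sine waves, each with its own
   frequency (in Hz) and amplitude, into numSamples of output. */
void additive(double *out, int numSamples,
              const double *freqs, const double *amps, int numPartials)
{
    for (int n = 0; n < numSamples; n++) {
        double t = n / SAMPLE_RATE;              /* time of this sample, in seconds */
        double sum = 0.0;
        for (int p = 0; p < numPartials; p++)
            sum += amps[p] * sin(2.0 * PI * freqs[p] * t);
        out[n] = sum;                            /* the partials simply add up */
    }
}

/* Example (arbitrary values): a 220 Hz fundamental plus two weaker partials.
      double freqs[3] = { 220.0, 440.0, 660.0 };
      double amps[3]  = { 1.0, 0.5, 0.33 };
      additive(buffer, 44100, freqs, amps, 3);   fills one second of sound   */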
A short excerpt from Kenneth Gaburo’s composition "Lemon Drops," a classic of electronic music made in the early 1960s. Soundfile 4.1 Excerpt from Kenneth Gaburo’s composition "Lemon Drops" This piece and another extraordinary Gaburo work, "For Harry," were made at the University of Illinois at Urbana-Champaign on an early electronic music instrument called the harmonic tone generator, which allowed the composer to set the frequencies and amplitudes of a number of sine wave oscillators to make their own timbres. It was extremely cumbersome to use, but it was essentially a giant Fourier synthesizer, and, theoretically, any periodic waveform was possible on it! It’s a tribute to Gaburo’s genius and that of other early electronic music pioneers that they were able to produce such interesting music on such primitive instruments. Kind of makes it seem like we’re almost cheating, with all our fancy software! If there is one thing computers are good at, it’s adding things together. By using digital oscillators instead of actual physical devices, a computer can add up any number of simple sounds to create extremely complex waveforms. Only the speed and power of the computer limit the number and complexity of the waveforms. Modern systems can easily generate and mix thousands of sine waves in real time. This makes additive synthesis a powerful and versatile performance and synthesis tool. Additive synthesis is not used so much anymore (there are a great many other, more efficient techniques for getting complex sounds), but it’s definitely a good thing to know about. http://music.columbia.edu/cmc/musicandcomputers/ - 94 - A Simple Additive Synthesis Sound Let’s design a simple sound with additive synthesis. A nice example is the generation of a square wave. You can probably imagine what a square wave would look like. We start with just one sine wave, called the fundamental. Then we start adding odd partials to the fundamental, the amplitudes of which are inversely proportional to their partial number. That means that the third partial is 1/3 as strong as the first, the fifth partial is 1/5 as strong, and so on. (Remember that the fundamental is the first partial; we could also call it the first harmonic.) Figure 4.3 shows what we get after adding seven harmonics. Looks pretty square, doesn’t it? Now, we should admit that there’s an easier way to synthesize square waves: just flip from a high sample value to a low sample value every n samples. The lower the value of n, the higher the frequency of the square wave that’s being generated. Although this technique is clearer and easier to understand, it has its problems too; directly generating waveforms in this way can cause unwanted frequency aliasing. http://music.columbia.edu/cmc/musicandcomputers/ - 95 - Figure 4.4 The Synclavier was an early digital electronic music instrument that used a large oscillator bank for additive synthesis. You can see this on the front panel of the instrument—many of the LEDs indicate specific partials! On the Synclavier (as was the case with a number of other analog and digital instruments), the user can tune the partials, make them louder, even put envelopes on each one. Figure 4.3 This applet lets you add sine waves together at various amplitudes, to see how additive synthesis works. Applet 4.3 Additive synthesis This applet lets you add spectral envelopes to a number of partials. This means that you can impose a different amplitude trajectory for each partial, independently making each louder and softer over time. 
Applet 4.4: Spectral envelopes. This is really more like the way things work in the real world: partial amplitudes evolve over time—sometimes independently, sometimes in conjunction with other partials (in a phenomenon called common fate). This is called spectral evolution, and it's what makes sounds come alive. http://music.columbia.edu/cmc/musicandcomputers/ - 96 - A More Interesting Example OK, now how about a more interesting example of additive synthesis? The quality of a synthesized sound can often be improved by varying its parameters (partial frequencies, amplitudes, and envelope) over time. In fact, time-variant parameters are essential for any kind of "lifelike" sound, since all naturally occurring sounds vary to some extent. Soundfile 4.2: Sine wave speech. Soundfile 4.3: Regular speech. Soundfile 4.4: Sine wave speech. Soundfile 4.5: Regular speech. These soundfiles are examples of sentences reconstructed with sine waves. Soundfile 4.2 is the sine wave version of the sentence spoken in Soundfile 4.3, and Soundfile 4.4 is the sine wave version of the sentence spoken in Soundfile 4.5. Sine wave speech is an experimental technique that tries to simulate speech with just a few sine waves, in a kind of primitive additive synthesis. The idea is to pick the sine waves (frequencies and amplitudes) carefully. It's an interesting notion, because sine waves are pretty easy to generate, so if we can get close to "natural" speech with just a few of them, it follows that we don't require that much information when we listen to speech. Sine wave speech has long been a popular idea for experimentation by psychologists and researchers. It teaches us a lot about speech—what's important in it, both perceptually and acoustically. These files are used with the permission of Philip Rubin, Robert Remez, and Haskins Laboratories. Attacks, Decays, and Time Evolution in Sounds As we've said, additive synthesis is an important tool, and we can do a lot with it. It does, however, have its drawbacks. One serious problem is that while it's good for periodic sounds, it doesn't do as well with noisy or chaotic ones. For instance, creating the steady-state part (the sustain) of a flute note is simple with additive synthesis (just a couple of sine waves), but creating the attack portion of the note, where there is a lot of breath noise, is nearly impossible. For that, we have to synthesize a lot of different kinds of information: noise, attack transients, and so on. And there's a worse problem that we'd love to sweep under the old psychoacoustical rug, too, but we can't: it's great that we know so much about steady-state, periodic, Fourier-analyzable sounds, but from a cognitive and perceptual point of view, we really couldn't care less about them! The ear and brain are much more interested in things like attacks, decays, and changes over time in a sound (modulation). That's bad news for all that additive synthesis software, which doesn't handle such things very well. http://music.columbia.edu/cmc/musicandcomputers/ - 97 - That's not to say that if we play a triangle wave and a sawtooth wave, we couldn't tell them apart; we certainly could. But that really doesn't do us much good in most circumstances. If angry lions roared in square waves, and cute cuddly puppy dogs barked in triangle waves, maybe this would be useful, but we have evolved—or learned—to hear attacks, decays, and other transients as being more crucial. What we need to be able to synthesize are transients, spectral evolutions, and modulations.
Additive synthesis is not really the best technique for those. Another problem is that additive synthesis is very computationally expensive. It’s a lot of work to add all those sine waves together for each output sample of sound! Compared to some other synthesis methods, such as frequency modulation (FM) synthesis, additive synthesis needs lots of computing power to generate relatively simple sounds. But despite its drawbacks, additive synthesis is conceptually simple, and it corresponds very closely to what we know about how sounds are constructed mathematically. For this reason it’s been historically important in computer sound synthesis. Figure 4.5 A typical ADSR (attack, decay, sustain, release) steady-state modulation. This is a standard amplitude envelope shape used in sound synthesis. The ability to change a sound’s amplitude envelope over time plays an important part in the perceived "naturalness" of the sound. Shepard Tones One cool use of additive synthesis is in the generation of a very interesting phenomenon called Shepard tones. Sometimes called "endless glissandi," Shepard tones are created by specially configured sets of oscillators that add their tones together to create what we might call a constantly rising tone. Certainly the Shepard tone phenomenon is one of the more interesting topics in additive synthesis. In the 1960s, experimental psychologist Roger Shepard, along with composers James Tenney and Jean-Claude Risset, began working with a phenomenon that scientifically demonstrates an independent dimension in pitch perception called chroma, confirming the circularity of relative pitch judgments. What circularity means is that pitch is perceived in kind of a circular way: it keeps going up until it hits an octave, and then it sort of starts over again. You might say pitch wraps around (think of a piano, where the C notes are evenly spaced all the way up and down). By chroma, we mean an aspect of pitch perception in which we group together the same pitches that are http://music.columbia.edu/cmc/musicandcomputers/ - 98 - related as frequencies by multiples of 2. These are an octave apart. In other words, 55 Hz is the same chroma as 110 Hz as 220 Hz as 440 Hz as 880 Hz. It’s not exactly clear whether this is "hard-wired" or learned, or ultimately how important it is, but it’s an extraordinary idea and an interesting aural illusion. We can construct such a circular series of pitches in a laboratory setting using synthesized Shepard tones. These complex tones are comprised of partials separated by octaves. They are complex tones where all the non-power-of-two numbered partials are omitted. These tones slide gradually from the bottom of the frequency range to the top. The amplitudes of the component frequencies follow a bell-shaped spectral envelope (see Figure 4.6) with a maximum near the middle of the standard musical range. In other words, they fade in and out as they get into the most common frequency range. This creates an interesting illusion: a circular Shepard tone scale can be created that varies only in tone chroma and collapses the second dimension of tone height by combining all octaves. In other words, what you hear is a continuous pitch change through one octave, but not bigger than one octave (that’s a result of the special spectra and the amplitude curve). It’s kind of like a barber pole: the pitches sound as if they just go around for a while, and then they’re back to where they started (even though, actually, they’re continuing to rise!). 
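Here is a rough sketch of that barber-pole construction in Python (ours, not the original authors'; the sample rate, sweep time, number of components, and the Gaussian used for the bell-shaped envelope are all arbitrary choices). Octave-spaced components glide upward together, while a bell-shaped amplitude curve over the frequency range fades each one in at the bottom and out at the top:

import numpy as np

SR = 44100
SWEEP = 10.0                      # seconds for each component to rise one octave
N_COMPONENTS = 8                  # octave-spaced components
F_LOW = 27.5                      # lowest starting frequency (Hz)

t = np.arange(int(SR * SWEEP)) / SR
sweep = t / SWEEP                 # goes from 0 to 1 over the sweep

signal = np.zeros_like(t)
for k in range(N_COMPONENTS):
    freq = F_LOW * 2.0 ** (k + sweep)               # each component rises one octave
    phase = 2 * np.pi * np.cumsum(freq) / SR        # integrate frequency to get phase
    pos = (k + sweep) / N_COMPONENTS                # position (0..1) in the whole range
    amp = np.exp(-0.5 * ((pos - 0.5) / 0.18) ** 2)  # bell-shaped spectral envelope
    signal += amp * np.sin(phase)

signal /= np.max(np.abs(signal))

When the sweep ends, component k has arrived exactly where component k + 1 started, at the same amplitude, so repeating the sweep gives the endless rising effect.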
Figure 4.10 Bell-shaped spectral envelope for making Shepard tones. Shepard wrote a famous paper in 1964 in which he explains, to some extent, our notion of octave equivalence using this auditory illusion: a sequence of these Shepard tones that shifts only in chroma as it is played. The apparent fundamental frequency increases step by step, through repeated cycles. Listeners hear the pitch steps as climbing continuously upward, even though the pitches are actually moving only around the chroma circle. Absolute pitch height (that is, how "high" or "low" it sounds) is removed from our perception of the sequence. http://music.columbia.edu/cmc/musicandcomputers/ - 99 - Soundfile 4.6 Shepard tone Figure 4.7 Try clicking on Soundfile 4.6. After you listen to the soundfile once, click again to listen to the frequencies continue on their upward spiral. Used with permission from Susan R. Perry, M.A., Dept. of Psychology, University of Tennessee. The Shepard tone contains a large amount of octave-related harmonics across the frequency spectrum, all of which rise (or fall) together. The harmonics toward the low and high ends of the spectrum are attenuated gradually, while those in the middle have maximum amplification. This Soundfile 4.7 creates a spiraling or barber pole effect. (Information from Doepfer Shepard tone Musikelektronik GmbH.) Soundfile 4.7 is an example of the spiraling Shepard tone effect. James Tenney is an important computer music composer and pioneer who worked at Bell Laboratories with Roger Shepard in the early 1960s. Soundfile 4.8 "For Ann" (rising), by James Tenney. This piece was composed in 1969. This composition is based on a set of continuously rising tones, similar to the effect created by Shepard tones. The compositional process is simple: each glissando, separated by some fixed time interval, fades in from its lowest note and fades out as it nears the top of its audible range. It is nearly impossible to follow, aurally, the path of any given glissando, so the effect is that the individual tones never reach their highest pitch. http://music.columbia.edu/cmc/musicandcomputers/ - 100 - Chapter 4: The Synthesis of Sound by Computer Section 4.3: Filters The most common way to think about filters is as functions that take in a signal and give back some sort of transformed signal. Usually, what comes out is "less" than what goes in. That’s why the use of filters is sometimes referred to as subtractive synthesis. It probably won’t surprise you to learn that subtractive synthesis is in many ways the opposite of additive synthesis. In additive synthesis, we start with simple sounds and add them together to form more complex ones. In subtractive synthesis, we start with a complex sound (like noise) and subtract, or filter out, parts of it. Subtractive synthesis can be thought of as sound sculpting—you start out with a thick chunk of sound containing many possibilities (frequencies), and then you carve out (filter) parts of it. Filters are one of the sound sculptor’s most versatile and valued tools. Older telephones had around an 8k low-pass filter imposed on their audio signal, mostly for noise reduction and to keep the equipment a bit cheaper. Soundfile 4.9 Telephone simulations White noise (every frequency below the Nyquist rate at equal level) is filtered so we hear only frequencies above 5 kHz. Soundfile 4.10 High-pass filtered noise Here we hear only frequencies up to 500 Hz. 
Soundfile 4.11 Low-pass filtered noise

Four Basic Types of Filters

Figure 4.8 Four common filter types (clockwise from upper left): low-pass, high-pass, band-reject, band-pass.

Figure 4.8 illustrates four basic types of filters: low-pass, high-pass, band-pass, and band-reject. Low-pass and high-pass filters should already be familiar to you—they are exactly like the "tone" knobs on a car stereo or boombox. A low-pass (also known as high-stop) filter stops, or attenuates, high frequencies while letting through low ones, while a high-pass (low-stop) filter does just the opposite.

This applet is a good example of how filters, combined with something like noise, can produce some common and useful musical effects with very few operations.

Applet 4.5 Using filters

Band-Pass and Band-Reject Filters

Band-pass and band-reject filters are basically combinations of low-pass and high-pass filters. A band-pass filter lets through only frequencies above a certain point and below another, so there is a band of frequencies that get through. A band-reject filter is the opposite: it stops a band of frequencies. Band-reject filters are sometimes called notch filters, because they can notch out a particular part of a sound.

Comb filters are a very specific type of digital process in which a short delay (where some number of samples are actually delayed in time) and a simple feedback algorithm (where outputs are sent back to be reprocessed and recombined) are used to create a rather extraordinary effect. Sounds can be "tuned" to specific harmonics (based on the length of the delay and the sample rate).

Applet 4.6 Comb filters

Low-Pass and High-Pass Filters

Low-pass and high-pass filters have a value associated with them called the cutoff frequency, which is the frequency where they begin "doing their thing." So far we have been talking about ideal, or perfect, filters, which cut off instantly at their cutoff frequency. However, real filters are not perfect, and they can't just stop all frequencies at a certain point. Instead, frequencies die out according to a sort of curve around the corner of their cutoff frequency. Thus, the filters in Figure 4.8 don't have right angles at the cutoff frequencies—instead they show general, more or less realistic response curves for low-pass and high-pass filters.

Cutoff Frequency

The cutoff frequency of a filter is defined as the point at which the signal is attenuated to 0.707 of its maximum value (which is 1.0). No, the number 0.707 was not just picked out of a hat! It turns out that the power of a signal is determined by squaring the amplitude: 0.707² ≈ 0.5. So when the amplitude of a signal is at 0.707 of its maximum value, it is at half-power. The cutoff frequency of a filter is sometimes called its half-power point.

Transition Band

The area between where a filter "turns the corner" and where it "hits the bottom" is called the transition band. The steepness of the slope in the transition band is important in defining the sound of a particular filter. If the slope is very steep, the filter is said to be "sharp"; conversely, if the slope is more gradual, the filter is "soft" or "gentle."

Things really get interesting when you start combining low-pass and high-pass filters to form band-pass and band-reject filters. Band-pass and band-reject filters also have transition bands and slopes, but they have two of them: one on each side.
The area in the middle, where frequencies are either passed or stopped, is called the passband or the stopband. The frequency in the middle of the band is called the center frequency, and the width of the band is called the filter's bandwidth. You can plainly see that filters can get pretty complicated, even these simple ones. By varying all these parameters (cutoff frequencies, slopes, bandwidths, etc.), we can create an enormous variety of subtractive synthetic timbres.

A Little More Technical: IIR and FIR Filters

Filters are often talked about as being one of two types: finite impulse response (FIR) and infinite impulse response (IIR). This sounds complicated (and can be!), so we'll just try to give a simple explanation of the general idea behind these kinds of filters.

Finite impulse response filters are those in which delays are used along with some sort of averaging. Delays mean that the sound that comes out at a given time uses some of the previous samples. They've been delayed before they get used. We've talked about these filters in earlier chapters. What comes out of an FIR is never more than what went in (in terms of amplitude). Sounds reasonable, right? FIRs tend to be simpler, easier to use, and easier to design than IIRs, and they are very handy for a lot of simple situations. An averaging low-pass filter, in which some number of samples are averaged and output, is a good example of an FIR.

Infinite impulse response filters are a little more complicated, because they have an added feature: feedback. You've all probably seen how a microphone and speaker can have feedback: by placing the microphone in front of a speaker, you amplify what comes out and then stick it back into the system, which is amplifying what comes in, creating a sort of infinite amplification loop. Ouch! (If you're Jimi Hendrix, you can control this and make great music out of it.) Well, IIRs are similar. Because the feedback path of these filters consists of some number of delays and averages, they are not always what are called unity gain transforms. They can actually output a higher signal than that which is fed to them. But at the same time, they can be many times more complex and subtler than FIRs. Again, think of electric guitar feedback—IIRs are harder to control but are also very interesting.

Figure 4.9 FIR and IIR filters. These are called FIR (finite impulse response) filters because what comes out uses a finite number of samples, and a sample only has a finite effect. If we delay, average, and then feed the output of that process back into the signal, we create what are called IIR (infinite impulse response) filters. The feedback process actually allows the output to be much greater than the input. These filters can, as we like to say, "blow up."

These are typical block diagrams for FIR and IIR filters. Note how in the IIR diagram the output of the filter's delay is summed back into the input, causing the infinite response characteristic. That's the main difference between the two filters. Thanks to Fernando Pablo Lopez-Lezcano for these graphics.

Designing filters is a difficult but key activity in the field of digital signal processing, a rich area of study that is well beyond the range of this book.
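Still, the basic flavor of the two families is easy to demonstrate. Here is a small Python sketch (our illustration; the tap count and feedback amount are arbitrary). The first function is a plain averaging FIR; the second adds the feedback path that makes a filter IIR, and with a longer delay in that feedback path the same idea gives the comb filters mentioned earlier:

import numpy as np

def fir_average(x, n_taps=3):
    # FIR: each output sample is an average of the last n_taps input samples.
    # There is no feedback, so an impulse dies out after n_taps samples.
    y = np.zeros(len(x))
    for n in range(len(x)):
        for k in range(min(n_taps, n + 1)):
            y[n] += x[n - k] / n_taps
    return y

def iir_onepole(x, feedback=0.9):
    # IIR: each output is fed back into the next one, so a single impulse
    # keeps ringing, decaying geometrically (in theory, forever).
    y = np.zeros(len(x))
    prev = 0.0
    for n in range(len(x)):
        y[n] = (1.0 - feedback) * x[n] + feedback * prev
        prev = y[n]
    return y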
It is interesting to point out that, surprisingly, even though filters change the frequency content of a signal, a lot of the mathematical work done in filter design is done in the time domain, not in the frequency domain. By using things like sample averaging, delays, and feedback, one can create an extraordinarily rich variety of digital filters. For example, the following is a simple equation for a low-pass filter:

y(n) = (x(n) + x(n - 1))/2

This equation just averages the current sample x(n) and the previous sample x(n - 1) to produce a new output sample y(n). It is said to have a one-sample delay. You can see easily that quickly changing (that is, high-frequency) time domain values will be "smoothed" (removed) by this equation.

In fact, although it may look simple, this kind of filter design can be quite difficult (although extremely important). How do you know which frequencies you're removing? It's not intuitive, unless you're well schooled in digital signal processing and filter theory, have some background in mathematics, and know how to move from the time domain (what you have) to the frequency domain (what you want) by averaging, delaying, and so on.

Chapter 4: The Synthesis of Sound by Computer
Section 4.4: Formant Synthesis

Formant synthesis is a special but important case of subtractive synthesis. Part of what makes the timbre of a voice or instrument consistent over a wide range of frequencies is the presence of fixed frequency peaks, called formants. These peaks stay in the same frequency range, independent of the actual (fundamental) pitch being produced by the voice or instrument. While there are many other factors that go into synthesizing a realistic timbre, the use of formants is one way to get reasonably accurate results.

Figure 4.10 A trumpet plays two different notes, a perfect fourth apart, but the formants (fixed resonances) stay in the same places.

Resonant Structure

The location of formants is based on the resonant physical structure of the sound-producing medium. For example, the body of a certain violin exhibits a particular set of formants, depending upon how it is constructed. Since most violins share a similar shape and internal construction, they share a similar set of formants and thus sound alike.

In the human voice, the vocal tract and nasal cavity act as the resonating body. By manipulating the shape and size of that resonant space (i.e., by changing the shape of the mouth and throat), we change the location of the formants in our voice. We recognize different vowel sounds mainly by their formant placement. Knowing that, we can generate some fairly convincing synthetic vowels by manipulating formants in a synthesized set of tones. A number of books list actual formant frequency values for various voices and vowels (including Charles Dodge's highly recommended standard text, Computer Music—Dodge is a great pioneer in computer music voice synthesis).

Composing with Synthetic Speech

Generating really good and convincing synthetic speech and singing voices is more complex than simply moving around a set of formants—we haven't mentioned anything about generating consonants, for example. And no speech synthesis system relies purely on formant synthesis.
But, as these examples illustrate, even very basic formant manipulation can generate sounds that are undoubtedly "vocal" in nature.

Figure 4.11 A spectral picture of the voice, showing formants. Graphic courtesy of the alt.usage.english newsgroup.

Soundfile 4.12 "Notjustmoreidlechatter" of Paul Lansky
"Notjustmoreidlechatter" was made on a DEC MicroVaxII computer in 1988. All the "chatter" pieces (there are three in the set) use techniques known as linear predictive coding, granular synthesis, and a variety of stochastic mixing techniques. Paul Lansky is a well-known composer and researcher of computer music who teaches at Princeton University. He has been a leading pioneer in software design, voice synthesis, and compositional techniques. Used with permission from Paul Lansky.

Soundfile 4.13 "idlechatterjunior" of Paul Lansky from 1999
Paul Lansky writes: "Over ten years ago I wrote three 'chatter' pieces, and then decided to quit while I was ahead. The urge to strike again recently overtook me, however, and after my lawyer assured me that the statute of limitations had run out on this particular offense, I once again leapt into the fray. My hope is that the seasoning provided by my labors in the intervening years results in something new and different. If not, then look out for 'Idle Chatter III'... ." Used with permission from Paul Lansky.

Soundfile 4.14 Composition by composer Sarah Myers entitled "Trajectory of Her Voice," in 1998
Composer Sarah Myers used an interview with her friend Gili Rei as the source material for her composition. "Trajectory of Her Voice" is a ten-part canon that explores the musical qualities of speech. As the verbal content becomes progressively less comprehensible as language, the focus turns instead to the sonorities inherent in her voice. This piece was composed using the Cmix computer music language (Cmix was written by Paul Lansky).

Soundfile 4.15 Synthetic speech example, "Fred" voice from the Macintosh computer
Over the years, computer voice simulations have become better and better. They still sound a bit robotic, but advances in voice synthesis and acoustic technology make voices more and more realistic. Bell Telephone Laboratories has been one of the leading research facilities for this work, which is expected to become extremely important in the near future.

Soundfile 4.16 Carter Sholz's 1-minute piece "Mannagram"
In this piece, based on a reading by Australian sound-poet Chris Mann, the composer tries to separate vowels and consonants, moving them each to a different speaker. This was inspired by an idea of Mann's, who always wanted to do a "headphone piece" in which he spoke and the consonants appeared in one ear, the vowels in another.

Soundfile 4.17 The trump
One of the most interesting examples of formant usage is in playing the trump, sometimes called the jaw-harp. Here, a metal tine is plucked and the shape of the vocal cavity is used to create different pitches.

Section 4.5: Amplitude Modulation

Introduction to Modulation

We might be more familiar with the term modulation in relationship to radio. Radio transmission utilizes amplitude modulation (AM) and frequency modulation (FM), but we too can create complex waveforms by using these techniques. Modulated signals are those that are changed regularly in time, usually by other signals. They can get pretty complicated. For example, modulated signals can modulate other signals!
To create a modulated signal, we begin with two or more oscillators (or anything that produces a signal) and combine the output signals of the oscillators in such a way as to modulate the amplitude, frequency, and/or phase of one of the oscillators.

In the amplitude modulation equation, which we can write as s(t) = Ac[1 + ka·m(t)]·sin(2π·fc·t), the DC offset (a signal that is essentially a straight line) is added to the signal m(t) and multiplied by a sinusoid with frequency fc. Ac is the carrier amplitude, and ka is the modulation index.

Applet 4.8 shows what happens, in the case of frequency modulation, if the modulating signal is low frequency. In that case, we'll hear something like vibrato (a regular change in frequency, or perceived pitch). We can also modulate amplitude in this way (tremolo), or even formant frequencies if we want. Low-frequency modulations (that is, modulators that themselves are low-frequency signals) can produce interesting sonic effects.

Applet 4.8 LFO modulation

But for making really complex sounds, we are generally interested in high-frequency modulation. We take two audio frequency signals and multiply them together. More precisely, we start with a carrier oscillator and attach a modulating oscillator to modify and distort the signal that the carrier oscillator puts out. The output of the carrier oscillator can include its original signal and the sidebands or added spectra that are generated by the modulation process.

Amplitude Modulation

Figure 4.13 shows how we might construct a computer music instrument to do amplitude modulation. The two half-ovals are often called unit generators, and they refer to some software device like an oscillator, a mixer, a filter, or an envelope generator that has inputs and outputs and makes and transforms digital signals.

Figure 4.13 Amplitude modulation, two operator case.

A low-pass moving filter that uses a sine wave to control a sweep between 0 Hz and 500 Hz.

Soundfile 4.19 High-pass moving filter (modulated by sine)
A high-pass moving filter that uses a sine wave to control a sweep between 5,000 Hz and 15,000 Hz.

Soundfile 4.20 Low-pass moving filter (modulated by sawtooth)
A low-pass moving filter that uses a sawtooth wave to control a sweep between 0 Hz and 500 Hz.

Soundfile 4.21 High-pass moving filter (modulated by sawtooth)
A high-pass moving filter that uses a sawtooth wave to control a sweep between 5,000 Hz and 15,000 Hz.

Figure 4.14 James Tenney's "Phases," one of the earliest and still most interesting pieces of computer-assisted composition. The pictures above are his "notes" for the piece, which constitute a kind of score. Tenney made use of some simple modulation trajectories to control timbral parameters over time (like amplitude modulation rate, spectral upper limit, note-duration, and so on). By simply coding these functions in the computer and linking the output to the synthesis engine, Tenney was able to realize a number of highly original works in which he controlled the overall, large-scale process, but the micro-structure was largely determined by the computer making use of his curves. "Phases" was released on an Artifact CD of James Tenney's computer music.

Section 4.6: Waveshaping

Waveshaping is a popular synthesis-and-transformation technique that turns simple sounds into complex sounds.
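Before digging into waveshaping, here is a minimal sketch of the two-operator amplitude modulation instrument of Figure 4.13, written in Python rather than as unit generators (our illustration; the frequencies, index, and amplitude are arbitrary choices):

import numpy as np

SR = 44100
t = np.arange(SR * 2) / SR        # two seconds

fc = 440.0                        # carrier frequency (Hz)
fm = 3.0                          # low-frequency modulator: a slow tremolo
ka = 0.5                          # modulation index (depth)
Ac = 0.8                          # carrier amplitude

modulator = np.sin(2 * np.pi * fm * t)            # m(t)
carrier = np.sin(2 * np.pi * fc * t)
signal = Ac * (1.0 + ka * modulator) * carrier    # DC offset plus modulator, times carrier

Raise fm into the audio range (say 200 Hz) and the slow tremolo disappears; in its place you hear sidebands at fc - fm and fc + fm, which is exactly the high-frequency modulation effect described above.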
You can take a pure tone, like a sine wave, and transform it into a harmonically rich sound by changing its shape. A guitar fuzz box is an example of a waveshaper. The unamplified electric guitar sound is fairly close to a sine wave. But the fuzz box amplifies it and gives it sharp corners. We have seen in earlier chapters that a signal with sharp corners has lots of high harmonics. Sounds that have passed through a waveshaper generally have a lot more energy in their higher-frequency harmonics, which gives them a "richer" sound. Simple Waveshaping Formulae A waveshaper can be described as a function that takes the original signal x as input and produces a new output signal y. This function is called the transfer function. y = f(x) This is simple, right? In fact, it’s much simpler than any other function we’ve seen so far. That’s because waveshaping, in its most general form, is just any old function. But there’s a lot more to it than that. In order to change the shape of the function (and not just make it bigger or smaller), the function must be nonlinear, which means it has exponents greater than 1, or transcendental (like sines, cosines, exponentials, logarithms, etc.). You can use almost any function you want as a waveshaper. But the most useful ones output zero when the input is zero (that’s because you usually don’t want any output when there is no input). 0 = f(0) Let’s look at a very simple waveshaping function: y = f(x) = x * x * x = x3 What would it look like to pass a simple sine wave that varied from -1.0 to 1.0 through this waveshaper? If our input x is sin(wt), then: y = x3 = sin3(wt) If we plot both functions (sin(x) and the output signal), we can see that the original input signal is very round, but the output signal has a narrower peak. This will give the output a richer sound. http://music.columbia.edu/cmc/musicandcomputers/ - 113 - Figure 4.15 Waveshaping by x3. This example gives some idea of the power of this technique. A simple function (sine wave) gets immediately transformed, using simple math and even simpler computation, into something new. One problem with the y = x3 waveshaper is that for x-values outside the range –1.0 to +1.0, y can get very large. Because computer music systems (especially sound cards) generally only output sounds between –1.0 and +1.0, it is handy to have a waveshaper that takes any input signal and outputs a signal in the range –1.0 to +1.0. Consider this function: y = x / (1 + |x|) When x is zero, y is zero. Plug in a few numbers for x, like 0.5, 7.0, 1,000.0, –7.0, and see what you get. As x gets larger (approaches positive infinity), y approaches +1.0 but never reaches it. As x approaches negative infinity, y approaches –1.0 but never reaches it. This kind of curve is sometimes called soft clipping because it does not have any hard edges. It can give a nice "tubelike" distortion sound to a guitar. So this function has some nice properties, but unfortunately it requires a divide, which takes a lot more CPU power than a multiply. On older or smaller computers, this can eat up a lot of CPU time (though it’s not much of a problem nowadays). Here is another function that is a little easier to calculate. It is designed for input signals between –1.0 and +1.0. y = 1.5x - 0.5x3 This applet plays a sine wave through various waveshaping formulae. 
Applet 4.9 Changing the shape of a waveform Chebyshev Polynomials http://music.columbia.edu/cmc/musicandcomputers/ - 114 - A transfer function is often expressed as a polynomial, which looks like: y = f(x) = d0 + d1x + d2x2 + d3x3 + ... + dNxN The highest exponent N of this polynomial is called the "order" of the polynomial. In Applet 4.9, we saw that the x2 formula resulted in a doubling of the pitch. So a polynomial of order 2 produced strong second harmonics in the output. It turns out that a polynomial of order N will only generate frequencies up to the Nth harmonic of the input sine wave. This is a useful property when we want to limit the high harmonics to a value below the Nyquist rate to avoid aliasing. Back in the 19th century, Pafnuty Chebyshev discovered a set of polynomials known as the Chebyshev polynomials. Mathematicians like them for lots of different reasons, but computer musicians like them because they can be used to make weird noises, er, we mean music. These Chebyshev polynomials have the property that if you input a sine wave of amplitude 1.0, you get out a sine wave whose frequency is N times the frequency of the input wave. So, they are like frequency multipliers. If the amplitude of the input sine wave is less than 1.0, then you get a complex mix of harmonics. Generally, the lower the amplitude of the input, the lower the harmonic content. This gives musicians a single number, sometimes called the distortion index, that they can tweak to change the harmonic content of a sound. If you want a sound with a particular mixture of harmonics, then you can add together several Chebyshev polynomials multiplied by the amount of the harmonic that you desire. Is this cool, or what? Here are the first few Chebyshev polynomials: T0(x) = 1 T1(x) = x T2(x) = 2x2 – 1 T3(x) = 4x3 – 3x T4(x) = 8x4 – 8x2 + 1 You can generate more Chebyshev polynomials using this recursive formula (a recursive formula is one that takes in its own output as the next input): Tk+1(x) = 2xTk(x) – Tk–1(x) Table-Based Waveshapers Doing all these calculations in realtime at audio rates can be a lot of work, even for a computer. So we generally precalculate these polynomials and put the results in a table. Then when we are synthesizing sound, we just take the value of the input sine wave and use it to look up the answer in the table. If you did this during an exam it would be called cheating, but in the world of computer programming it is called optimization. One big advantage of using a table is that regardless of how complex the original equations were, it always takes the same amount of time to look up the answer. You can even draw a function by hand without using an equation and use that hand-drawn function as your transfer function. http://music.columbia.edu/cmc/musicandcomputers/ - 115 - This applet plays sine waves through polynomials and hand-drawn waves. Applet 4.10 Waveshaping Don Buchla’s Synthesizers Don Buchla, a pioneering designer of synthesizers, created a series of instruments based on digital waveshaping. One such instrument, known as the Touche, was released in 1978. It had 16 digital oscillators that could be combined into eight voices. The Touche had extensive programming capabilities and had the ability to morph one sound into another. Composer David Rosenboom worked with Buchla and developed much of the software for the Touche. Rosenboom produced an album in 1981 called Future Travel using primarily the Touche and the Buchla 300 Series Music System. 
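As a quick illustration of the ideas in this section, here is a small Python sketch (ours; the input frequency and the 0.4 "distortion index" are arbitrary) that uses the Chebyshev polynomial T3 as a transfer function and also includes the soft-clipping curve discussed earlier:

import numpy as np

SR = 44100
t = np.arange(SR) / SR
x = np.sin(2 * np.pi * 220 * t)      # input: a 220 Hz sine wave

def cheby3(x):
    # Chebyshev polynomial T3 used as a waveshaping transfer function
    return 4 * x**3 - 3 * x

y_full = cheby3(x)                   # input amplitude 1.0: a pure 660 Hz sine comes out
y_soft = cheby3(0.4 * x)             # lower "distortion index": fundamental plus third harmonic

def soft_clip(x):
    # the smooth y = x / (1 + |x|) waveshaper: output always stays between -1 and 1
    return x / (1.0 + np.abs(x))

A table-based version would simply precompute cheby3 (or a hand-drawn curve) over a few thousand evenly spaced input values and look each sample up in that table instead of recomputing the polynomial.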
Now that you know a lot about waveshaping, Chebyshev polynomials, and transfer functions, we'll show you what happens when the information gets into the wrong hands!

Soundfile 4.22a Experimental waveshaping: "Toyoji Patch"
Soundfile 4.22b Experimental waveshaping: "Toyoji Patch"

These soundfiles are two recordings done in the mid-1980s, at the Mills College Center for Contemporary Music, by one of our authors (Larry Polansky). They use a highly unusual live, interactive computer music waveshaping system. "Toyoji Patch" was a piece/software/installation written originally for trombonist and composer Toyoji Tomita to play with. It was a real-time feedback system, which fed live audio into the transfer function of a set of six digital waveshaping oscillators. The hardware was an old S-100 68000-based computer system running the original prototype of a computer music language called HMSL, including a GUI and set of instrument drivers and utilities for controlling synthesizer designer Don Buchla's 400 series digital waveshaping oscillators.

The system could be used with or without a live input, since it made use of an external microphone to feed back its output to itself. The audio time domain signal used a transfer function to modify itself, but you could also alter Chebyshev coefficients in realtime, redraw the waveform, and control a number of other parameters.

Both of these sound excerpts feature the amazing contemporary flutist Anne LaBerge. In the first version, LaBerge is playing but is not recorded. She is in another room, and the output of the system is fed back into itself through a microphone. By playing, she could drastically affect the sound (since her flute went immediately into the transfer function). However, although she's causing the changes to occur, we don't actually hear her flute. In the second version, LaBerge is in front of the same microphone that's used for feedback and recording. In both versions, Polansky was controlling the mix and the feedback gain as well as playing with the computer.

Section 4.7: FM Synthesis

One goal of synthesis design is to find efficient algorithms that don't take a lot of computation to generate rich sound palettes. Frequency modulation synthesis (FM) has traditionally been one of the most popular techniques in this regard, and it provides a good way to communicate some basic concepts about sound synthesis. You've probably heard of frequency modulation—it is the technique used in FM radio broadcasting. We took a look at amplitude modulation (AM) in Section 4.5.

Applet 4.11 Frequency modulation

History of FM Synthesis

FM techniques have been around since the early 20th century, and by the 1930s FM theory for radio broadcasting was well documented and understood. It was not until the 1970s, though, that a certain type of FM was thoroughly researched as a musical synthesis tool. In the early 1970s, John Chowning, a composer and researcher at Stanford University, developed some important new techniques for music synthesis using FM.

Chowning's research paid off. In the early 1980s, the Yamaha Corporation introduced their extremely popular DX line of FM synthesizers, based on Chowning's work. The DX-7 keyboard synthesizer was the top of their line, and it quickly became the digital synthesizer for the 1980s, making its mark on both computer music and synthesizer-based pop and rock. It's the most popular synthesizer in history.
FM turned out to be good for creating a wide variety of sounds, although it is not as flexible as some other types of synthesis. Why has FM been so useful? Well, it’s simple, easy to understand, and allows users to tweak just a few "knobs" to get a wide range of sonic variation. Let’s take a look at how FM works and listen to some examples. Figure 4.16 The Yamaha DX-7 synthesizer was an extraordinarily popular instrument in the 1980s and was partly responsible for making electronic and computer music a major industry. Thanks to Joseph Rivers and The Audio Playground Synthesizer Museum for this photo. http://music.columbia.edu/cmc/musicandcomputers/ - 117 - Simple FM In its simplest form, FM involves two sine waves. One is called the modulating wave, the other the carrier wave. The modulating wave changes the frequency of the carrier wave. It can be easiest to visualize, understand, and hear when the modulator is low frequency. Figure 4.17 Frequency modulation, two operator case. http://music.columbia.edu/cmc/musicandcomputers/ - 118 - Vibrato Carrier: 500 Hz; modulator frequency: 1 Hz; modulator index: 100. Soundfile 4.23 Vibrato sound FM can create vibrato when the modulating frequency is less than 30 Hz. Okay, so it’s still not that exciting—that’s just because everything is moving slowly. We’ve created a very slow, weird vibrato! That’s because we were doing low-frequency modulation. In Soundfile 4.23, the frequency (fc) of the carrier wave is 500 Hz and the modulating frequency (fm) is 1 Hz. 1 Hz means one complete cycle each second, so you should hear the frequency of the carrier rise, fall, and return to its original pitch once each second. fc = carrier frequency, m(t) = modulating signal, and Ac= carrier amplitude. Note that the frequency of the modulating wave is the rate of change in the carrier’s frequency. Although you can’t tell from the above equation, it also turns out that the amplitude of the modulator is the degree of change of the carrier’s frequency, and the waveform of the modulator is the shape of change of the carrier’s frequency. In Figure 4.17 showing the unit generator diagram for frequency modulation (remember, we showed you one of these in Section 4.5), note that each of the sine wave oscillators has two inputs: one for frequency and one for amplitude. For our modulating oscillator we are using 1 Hz as the frequency, which becomes fm to the carrier (that is, the frequency of the carrier is changed 1 time per second). The modulator’s amplitude is 100, which will determine how much the frequency of the carrier gets changed (at a rate of 1 time per second). The amplitude of the modulator is often called the modulation depth, since this value determines how high and low the frequency of the carrier wave will go. In the sound example, the fc ranges from 400 Hz to 600 Hz (500 Hz – 100 Hz to 500 Hz + 100 Hz). If we change the depth to 500 Hz, then our fc would range from 0 Hz to 1,000 Hz. Humans can only hear sounds down to about 30 Hz, so there should be a moment of "silence" each time the frequency dips below that point. Carrier: 500 Hz, modulator frequency: 1 Hz, modulator index: 500. Soundfile 4.24 Vibrato sound Generating Spectra with FM http://music.columbia.edu/cmc/musicandcomputers/ - 119 - If we raise the frequency of the modulating oscillator above 30 Hz, we can start to hear more complex sounds. We can make an analogy to being able to see the spokes of a bike wheel if it rotates slowly, but once the wheel starts to rotate faster a visual blur starts to occur. 
So it is with FM: when the modulating frequency starts to speed up, the sound becomes more complex. The tones you heard in Soundfile 4.24 sliding around are called sidebands and are extra frequencies located on either side of the carrier frequency. Sidebands are the secret to FM synthesis. The frequencies of the sidebands (called, as a group, the spectra) depend on the ratio of fc to fm. John Chowning, in a famous article, showed how to predict where those sidebands would be using a simple mathematical idea called Bessel functions. By controlling that ratio (called the FM index) and using Bessel functions to determine the spectra, you can create a wide variety of sounds, from noisy jet engines to a sweet-sounding Fender Rhodes. Figure 4.18 FM sidebands. Soundfiles 4.25 through 4.28 show some simple two-carrier FM sounds with modulating frequencies above 30 Hz. Carrier: 100 Hz; modulator frequency: 280 Hz; FM index: 6.0 -> 0. Soundfile 4.25 Bell-like sound http://music.columbia.edu/cmc/musicandcomputers/ - 120 - Carrier: 250 Hz; modulator frequency: 175 Hz; FM index: 1.5 -> 0. Soundfile 4.26 Bass clarinet-type sound Carrier: 700 Hz; modulator frequency: 700 Hz; FM index: 5.0 -> 0. Soundfile 4.27 Trumpet-like sound Carrier: 500 Hz; modulator frequency: 500 -> 5,000 Hz; FM index: 10. Soundfile 4.28 FM sound An Honest-to-Goodness Computer Music Example of FM (Using Csound) One of the most common computer languages for synthesis and sound processing is called Csound, developed by Barry Vercoe at MIT. Csound is popular because it is powerful, easy to use, public domain, and runs on a wide variety of platforms. It has become a kind of lingua franca for computer music. Csound divides the world of sound into orchestras, consisting of instruments that are essentially unit-generator designs for sounds, and scores (or note lists) that tell how long, loud, and so on a sound should be played from your orchestra. The code in Figure 4.19 is an example of a simple FM instrument in Csound. You don’t need to know what most of it means (though by now a lot of it is probably self-explanatory, like sr, which is just the sampling rate). Take a close look at the following line: asig foscil p4*env, cpspch(p5), 1,3,2,1 ;simple FM Commas separate the various parts of this command. Asig is simply a name for this line of code, so we can use it later (out asig). foscil is a predefined Csound unit generator, which is just a pair of oscillators that implement simple frequency modulation (in a compact way). p4*env states that the amplitude of the oscillator will be multiplied by an envelope (which, as seen in Figure 4.19, is defined with the Csound linseg function). cpspch(p5) looks to the fifth value in the score (8.00, which will be a middle C) to find what the fundamental of the sound will be. The next three values determine the carrier and modulating frequencies of the two oscillators, as well as the modulation index (we won’t go into what they mean, but these values will give us a nice, sweet sound). The final value (1) points to a sine wave table (look at the first line of the score). Yes, we know, you might be completely confused, but we thought you’d like to see a common language that actually uses some of the concepts we’ve been discussing! http://music.columbia.edu/cmc/musicandcomputers/ - 121 - Computer Code for an FM Instrument and a Score to Play the Instrument in Csound Figure 4.19 This is a Csound Orchestra and Score blueprint for a simple FM synthesis instrument. 
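If Csound isn't handy, the same two-oscillator idea can be sketched directly in Python (our illustration, using the recipe of Soundfile 4.25: carrier 100 Hz, modulator 280 Hz, index sweeping from 6 down to 0). Like most "FM" instruments, it actually modulates the phase of the carrier, which is a common way of implementing Chowning-style FM:

import numpy as np

SR = 44100
DUR = 3.0
t = np.arange(int(SR * DUR)) / SR

fc = 100.0                         # carrier frequency (Hz)
fm = 280.0                         # modulator frequency (Hz)
index = 6.0 * (1.0 - t / DUR)      # FM index sweeps from 6.0 down to 0
amp_env = np.exp(-4.0 * t / DUR)   # simple decaying amplitude envelope

modulator = np.sin(2 * np.pi * fm * t)
signal = amp_env * np.sin(2 * np.pi * fc * t + index * modulator)

Because fc and fm are not in a simple integer ratio, the sidebands land at inharmonic places and the result is bell-like; try fc = fm = 700.0 to get something closer to the trumpet-like example of Soundfile 4.27.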
Any number of basic unit generators, such as oscillators, adders, multipliers, filters, and so on, can be combined to create complex instruments. Some music languages, like CSound, make extensive use of the unit generator model. Generally, unit generators are used to create instruments (the orchestra), and then a set of instructions (a score) is created that tells the instruments what to do. Now that you understand the basics of FM synthesis, go back to the beginning of this section and play with Applet 4.12. FM is kind of interesting theoretically, but it’s far more fun and educational to just try it out. http://music.columbia.edu/cmc/musicandcomputers/ - 122 - Section 4.8: Granular Synthesis When we discussed additive synthesis, we learned that complex sounds can be created by adding together a number of simpler ones, usually sets of sine waves. Granular synthesis uses a similar idea, except that instead of a set of sine waves whose frequencies and amplitudes change over time, we use many thousands of very short (usually less than 100 milliseconds) overlapping sound bursts or grains. The waveforms of these grains are often sinusoidal, although any waveform can be used. (One alternative to sinusoidal waveforms is to use grains of sampled sounds, either pre-recorded or captured live.) By manipulating the temporal placement of large numbers of grains and their frequencies, amplitude envelopes, and waveshapes, very complex and time-variant sounds can be created. This applet lets you granulate a signal and alter some of the typical parameters of granular synthesis, a popular synthesis technique. Applet 4.12 Granular synthesis A grain is created by taking a waveform, in this case a sine wave, and multiplying it by an amplitude envelope. How would a different amplitude envelope, say a square one, affect the shape of the grain? What would it do to the sound of the grain? Clouds of Sound Granular synthesis is often used to create what can be thought of as "sound clouds"—shifting regions of sound energy that seem to move around a sonic space. A number of composers, like Iannis Xenakis and Barry Truax, thought of granular synthesis as a way of shaping large http://music.columbia.edu/cmc/musicandcomputers/ - 123 - masses of sound by using granulation techniques. These two composers are both considered pioneers of this technique (Truax wrote some of the first special-purpose software for granular synthesis). Sometimes, cloud terminology is even used to talk about ways of arranging grains into different sorts of configurations. Visualization of a granular synthesis "score." Each dot represents a grain at a particular frequency and moment in time. An image such as this one can give us a good idea of how this score might sound, even though there is some important information left out (such as the grain amplitudes, waveforms, amplitude envelopes, and so on). What sorts of sounds does this image imply? If you had three vocal performers, one for each "cloud," how would you go about performing this piece? Try it! This is an excerpt of a composition by computer music composer and researcher Mara Helmuth titled "Implements of Actuation." The sounds of an electric mbira and bicycle wheels are transformed via granular synthesis. Soundfile 4.29 There are a great many commercial and public domain applications to "Implements do granular synthesis because it is relatively easy to implement and the of Actuation" sounds can be very interesting and attractive. 
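Here is a bare-bones granular synthesis sketch in Python (ours; the grain length, grain count, and frequency range are arbitrary choices). Each grain is a short sine burst shaped by a Hanning-window amplitude envelope and dropped at a random point in time; thousands of them, overlap-added together, form the kind of "cloud" described above:

import numpy as np

SR = 44100
out = np.zeros(SR * 5)                         # five seconds of output
rng = np.random.default_rng(0)

GRAIN_DUR = 0.05                               # 50-millisecond grains
n = int(SR * GRAIN_DUR)
window = np.hanning(n)                         # smooth per-grain amplitude envelope
tg = np.arange(n) / SR

for _ in range(2000):                          # number of grains
    freq = rng.uniform(200.0, 2000.0)          # random grain frequency
    start = rng.integers(0, len(out) - n)      # random placement in time
    grain = window * np.sin(2 * np.pi * freq * tg)
    out[start:start + n] += 0.05 * grain       # overlap-add the grain into the output

out /= max(1.0, np.max(np.abs(out)))           # keep the result within -1..1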
http://music.columbia.edu/cmc/musicandcomputers/ - 124 - Section 4.9: Physical Modeling We’ve already covered a bit of material on physical modeling without even telling you—the ideas behind formant synthesis are directly derived from our knowledge of the physical construction and behavior of certain instruments. Like all of the synthesis methods we’ve covered, physical modeling is not one specific technique, but rather a variety of related techniques. Behind them all, however, is the basic idea that by understanding how sound/vibration/air/string behaves in some physical system (like an instrument), we can model that system in a computer and thus synthetically generate realistic sounds. Karplus-Strong Algorithm Let’s take a look at a really simple but very effective physical model of a plucked string, called the Karplus-Strong algorithm (so named for its principal inventors, Kevin Karplus and Alex Strong). One of the first musically useful physical models (dating from the early 1980s), the Karplus-Strong algorithm has proven quite effective at generating a variety of pluckedstring sounds (acoustic and electric guitars, banjos, and kotos) and even drumlike timbres. Fun with the Karplus-Strong plucked string algorithm Applet 4.13 Karplus-Strong plucked string algorithm Here’s a simplified view of what happens when we pluck a string: at first the string is highly energized and it vibrates like mad, creating a fairly complex (meaning rich in harmonics) sound wave whose fundamental frequency is determined by the mass and tension of the string. Gradually, thanks to friction between the air and the string, the string’s energy is depleted and the wave becomes less complex, resulting in a "purer" tone with fewer harmonics. After some amount of time all of the energy from the pluck is gone, and the string stops vibrating. If you have access to a stringed instrument, particularly one with some very low notes, give one of the strings a good pluck and see if you can see and hear what’s happening per the description above. How a Computer Models a Plucked String with the Karplus-Strong Algorithm Now that we have a physical idea of what’s happening in a plucked string, how can we model it with a computer? The Karplus-Strong algorithm does it like this: first we start with a buffer full of random values—noise. (A buffer is just some computer memory (RAM) where we can store a bunch of numbers.) The numbers in this buffer represent the initial energy that is transferred to the string by the pluck. The Karplus-Strong algorithm looks like this: http://music.columbia.edu/cmc/musicandcomputers/ - 125 - To generate a waveform, we start reading through the buffer and using the values in it as sample values. If we were to just keep reading through the buffer over and over again, what we’d get would be a complex, pitched waveform. It would be complex because we started out with noise, but pitched because we would be repeating the same set of random numbers. (Remember that any time we repeat a set of values, we end up with a pitched (periodic) sound. The pitch we get is directly related to the size of the buffer (the number of numbers it contains) we’re using, since each time through the buffer represents one complete cycle (or period) of the signal.) Now here’s the trick to the Karplus-Strong algorithm: each time we read a value from the buffer, we average it with the last value we read. It is this averaged value that we use as our output sample. We then take that averaged sample and feed it back into the buffer. 
That way, over time, the buffer gets more and more averaged (this is a simple filter, like the averaging filter described in Section 3.1). Let’s look at the effect of these two actions separately. Averaging and Feedback First, what happens when we average two values? Averaging acts as a low-pass filter on the signal. Because we’re averaging the signal, the signal changes less with each sample, and by limiting how quickly it can change we’re limiting the number of high frequencies it can contain (since high frequencies have a high rate of change). So, averaging a signal effectively gets rid of high frequencies, which according to our string description we need to do—once the string is plucked, it should start losing harmonics over time. The "over time" part is where feeding the averaged samples back into the buffer comes in. If we were to just keep averaging the values from the buffer but never actually changing them (that is, sticking the average back into the buffer), then we would still be stuck with a static waveform. We would keep averaging the same set of random numbers, so we would keep getting the same results. Instead, each time we generate a new sample, we stick it back into the buffer. That way our waveform evolves as we move through it. The effect of this low-pass filtering accumulates over time, so that as the string "rings," more and more of the high frequencies are filtered out of it. The filtered waveform is then fed back into the buffer, where it is filtered again the next time through, and so on. After enough times through the process, the signal has been averaged so many times that it reaches equilibrium—the waveform is a flat line the string has died out. http://music.columbia.edu/cmc/musicandcomputers/ - 126 - Figure 4.22 Applying the Karplus-Strong algorithm to a random waveform. After 60 passes through the filter/feedback cycle, all that’s left of the wild random noise is a gently curving wave. The result is much like what we described in a plucked string: an initially complex, periodic waveform that gradually becomes less complex over time and ultimately fades away. Figure 4.23 Schematic view of a computer software implementation of the basic KarplusStrong algorithm. http://music.columbia.edu/cmc/musicandcomputers/ - 127 - For each note, the switch is flipped and the computer memory buffer is filled with random values (noise). To generate a sample, values are read from the buffer and averaged. The newly calculated sample is both sent to the output stream and fed back into the buffer. When the end of the buffer is reached, we simply wrap around and continue reading at the beginning. This sort of setup is often called a circular buffer. After many iterations of this process, the buffer’s contents will have been transformed from noise into a simple waveform. If you think of the random noise as a lot of energy and the averaging of the buffer as a way of lessening that energy, this digital explanation is not all that dissimilar from what happens in the real, physical case. Thanks to Matti Karjalainen for this graphic. Physical models generally offer clear, "real world" controls that can be used to play an instrument in different ways, and the Karplus-Strong algorithm is no exception: we can relate the buffer size to pitch, the initial random numbers in the buffer to the energy given to the string by plucking it, and the low-pass buffer feedback technique to the effect of air friction on the vibrating string. 
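Here is a compact Python version of the procedure just described (our sketch; real implementations vary in the details of the buffer handling). A buffer of noise is read circularly; each value read is averaged with the previous one, sent to the output, and written back into the buffer:

import numpy as np

SR = 44100

def pluck(freq, dur=2.0, sr=SR):
    rng = np.random.default_rng()
    n = int(sr / freq)                 # buffer length sets the (approximate) pitch
    buf = rng.uniform(-1.0, 1.0, n)    # the "pluck": a buffer full of noise
    out = np.zeros(int(sr * dur))
    prev = 0.0
    for i in range(len(out)):
        cur = buf[i % n]               # read the next value from the circular buffer
        new = 0.5 * (cur + prev)       # average it with the last value read
        out[i] = new                   # the averaged value is the output sample...
        buf[i % n] = new               # ...and is fed back into the buffer
        prev = cur
    return out

note = pluck(220.0)                    # roughly the A below middle C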
Many researchers and composers have worked on the plucked string sound as a kind of basic mode of physical modeling. One researcher, engineer Charlie Sullivan (who we're proud to say is one of our Dartmouth Soundfile 4.30 Super guitar colleagues!) built a "super" guitar in software. Here’s the heavy metal version of "The Star Spangled Banner." Understand the Building Blocks of Sound Physical modeling has become one of the most powerful and important current techniques in computer music sound synthesis. One of its most attractive features is that it uses a very small number of easy-to-understand building blocks—delays, filters, feedback loops, and commonsense notions of how instruments work—to model sounds. By offering the user just a few intuitive knobs (with names like "brightness," "breathiness," "pick hardness," and so on), we can use existing sound-producing mechanisms to create new, often fantastic, virtual instruments. Soundfile 4.31 An example of Perry Cook’s SPASM Part of the interface from Perry R. Cook’s SPASM singing voice software. Users of SPASM can make Sheila, a computerized singer, sing. Perry Cook has been one of the primary investigators of musically useful physical models. He’s released lots of great physical modeling software and source code. http://music.columbia.edu/cmc/musicandcomputers/ - 128 - ' ' Section 5.1: Introduction to the Transformation of Sound by Computer Although direct synthesis of sound is an important area of computer music, it can be equally interesting (or even more so!) to take existing sounds (recorded or synthesized) and transform, mutate, deconstruct—in general, mess around with them. There are as many basic sound transformation techniques as there are synthesis techniques, and in this chapter we’ll describe a few important ones. For the sake of convenience, we will separate them into time-domain and frequency-domain techniques. Tape Music/Cut and Paste The most obvious way to transform a sound is to chop it up, edit it, turn it around, and collage it. These kinds of procedures, which have been used in electronic music since the early days of tape recorders, are done in the time domain. There are a number of sophisticated software tools for manipulating sounds in this way, called digital sound editors. These editors allow for the most minute, sample-by-sample changes of a sound. These techniques are used in almost all fields of audio, from avant-garde music to Hollywood soundtracks, from radio station advertising spots to rap. This classic of analog electronic music was composed using only tape recorders, filters, and other precomputer devices and was based on a recording of an old television commercial. Soundfile 5.1 Jon Appleton: "Chef d'Oeuvre" "BS Variation 061801" by Huk Don Phun was created using Phun’s experimental "bleeding eyes" filtering technique. He used only free software and sounds he downloaded from the web. Soundfile 5.2 "BS Variation 061801" Time-Domain Restructuring Composers have experimented a lot with unusual time-domain restructuring of sound. By chopping up waveforms into very small segments and radically reordering them, some noisy and unusual effects can be created. As in collage visual art, the ironic and interesting juxtaposition of very familiar materials can be used to create new works that are perhaps greater than the sum of their constituent parts. 
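A time-domain cut-up of the kind just described takes only a few lines. Here is a hypothetical sketch in Python (ours; the piece length and the probability of leaving a piece in place are arbitrary knobs):

import numpy as np

def cut_up(signal, sr, piece_dur=0.05, keep_prob=0.3, seed=1):
    # chop the signal into short pieces and radically reorder them
    rng = np.random.default_rng(seed)
    n = int(sr * piece_dur)
    pieces = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    order = np.arange(len(pieces))
    for i in range(len(pieces)):
        if rng.random() > keep_prob:               # usually: swap with a random piece
            j = rng.integers(0, len(pieces))
            order[i], order[j] = order[j], order[i]
    return np.concatenate([pieces[k] for k in order])

The "Puzzels & Pagans" example below does essentially this, but with 2,500 pieces and probability functions that change over the length of the track.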
http://music.columbia.edu/cmc/musicandcomputers/ - 129 - Playing with the perception of the listener, "Puzzels & Pagans" takes the first 2 minutes and 26 seconds of "Jackass" (from Odelay) and cuts it up into 2,500 pieces. These pieces are then reshuffled, taking into account probability functions (that change over the length of the track) Soundfile 5.3 that determine if pieces remain in their original position or if they don’t Jane Dowe’s sound at all. "Beck Deconstruction" This was realized using James McCartney’s SuperCollider computer piece music programming language. A unique, experimental, and rather strange program for deconstructing and reconstructing sounds in the time domain is Argeïphontes Lyre, written by the enigmatic Akira Rabelais. It provides a number of techniques for radical decomposition/recomposition of sounds— techniques that often preclude the user from making specific decisions in favor of larger, more probabilistic decisions. Figure 5.1 Sample GUI from Argeïphontes Lyre, for sound deconstruction. This is timedomain mutation. Sound example using Argeïphontes Lyre (random cut-up). The original piano piece "Aposiopesi" was written by Vincent Carté in 1999. This recording was made by Akira Rabelais at Wave Equation Soundfile 5.4 Studios in Hollywood, California. The piece was filtered with Argeïphontes Argeïphontes Lyre 2.0b1. Lyre Sampling Sampling refers to taking small bits of sound, often recognizable ones, and recontextualizing them via digital techniques. By digitally sampling, we can easily manipulate the pitch and time characteristics of ordinary sounds and use them in any number of ways. http://music.columbia.edu/cmc/musicandcomputers/ - 130 - We’ve talked a lot about samples and sampling in the preceding chapters. In popular music (especially electronic dance and beat-oriented music), the term "sampling" has acquired a specialized meaning. In this context, a sample refers to a (usually) short excerpt from some previously recorded source, such as a drum loop from a song or some dialog from a film soundtrack, that is used as an element in a new work. A sampler is the hardware used to record, store, manipulate, and play back samples. Originally, most samplers were stand-alone pieces of gear. Today sampling tends to be integrated into a studio’s computer-based digital audio system. Sampling was pioneered by rap artists in the mid-1980s, and by the early 1990s it had become a standard studio technique used in virtually all types of music. Issues of copyright violation have plagued many artists working with sample-based music, notably John Oswald of "Plunderphonics" fame and the band Negativland, although the motives of the "offended" parties (generally large record companies) have tended to be more financial than artistic. One result of this is that the phrase "Contains a sample from xxx, used by permission" has become ubiquitous on CD cases and in liner notes. Although the idea of using excerpts from various sources in a new work is not new (many composers, from Béla Bartók, who used Balkan folk songs, to Charles Ives, who used American popular music folk songs, have done so), digital technology has radically changed the possibilities. Figure 5.2 Herbert Brün said of his program SAWDUST: "The computer program which I called SAWDUST allows me to work with the smallest parts of waveforms, to link them and to mingle or merge them with one another. 
Once composed, the links and mixtures are treated, by repetition, as periods, or by various degrees of continuous change, as passing moments of orientation in a process of transformations." Also listen to Soundfile 5.5, Brün's 1978 SAWDUST composition "Dustiny."

Soundfile 5.5 "Dustiny"

http://music.columbia.edu/cmc/musicandcomputers/ - 131 -

Drum Machines

Drum machines and samplers are close cousins. Many drum machines are just specialized samplers—their samples just happen to be all percussion/drum-oriented. Other drum machines feature electronic or digitally synthesized drum sounds. As with sampling, drum machines started out as stand-alone pieces of hardware but now have largely been integrated into computer-based systems.

Soundfile 5.6 Drum machine sounds

Applet 5.2 Drum machine

DAW Systems

Digital-audio workstations (DAWs) in the 1990s and 2000s have had the same effect on digital sound creation as desktop publishing software had on the publishing industry in the 1980s: they've brought digital sound creation out of the highly specialized and expensive environments in which it grew up and into people's homes. A DAW usually consists of a computer with some sort of sound card or other hardware for analog and digital input/output; sound recording/editing/playback/multi-track software; and a mixer, amplifier, and other sound equipment traditionally found in a home studio. Even the most modest of DAW systems can provide from eight to sixteen tracks of CD-quality sound, making it possible for many artists to self-produce and release their work for much less than it would traditionally cost. This ability, in conjunction with similar marketing and publicizing possibilities opened up by the spread of the Internet, has contributed to the explosion of tiny record labels and independently released CDs we've seen recently.

In 1989, Digidesign came out with Sound Tools, the first professional tapeless recording system. With the popularization of personal computers, numerous software and hardware manufacturers have entered the market for computer-based digital audio. Starting in the mid-1980s, personal computer–based production systems have allowed individuals and institutions to make the highest-quality recordings, and DAW systems have also revolutionized the professional music, broadcast, multimedia, and film industries with audio systems that are more flexible, more accessible, and more creatively oriented than ever before.

Today DAWs come in all shapes and sizes and interface with most computer operating systems, from Mac to Windows to LINUX. In addition, many DAW systems involve a "breakout box"—a piece of hardware that usually provides four to eight channels of digital audio I/O (inputs and outputs)—and as we write this a new technology called FireWire looks like a revolutionary way to hook up the breakout boxes to personal computers. Nowadays http://music.columbia.edu/cmc/musicandcomputers/ - 132 - many DAW systems also have "control surfaces"—pieces of hardware that look like a standard mixer but are really controllers for parameters in the digital audio system.

Figure 5.3 The Mark of the Unicorn (MOTU) 828 breakout box, part of a DAW system with FireWire interface. Thanks to MOTU.com for their permission to use this graphic.

Figure 5.4 Motor Mix™ stands alone as the only DAW mixer control surface to offer eight bank-switchable motorized faders. It also features eight rotary pots, a rotary encoder, 68 switches, and a 40-by-2 LCD.
http://music.columbia.edu/cmc/musicandcomputers/ - 133 -

Section 5.2: Reverb

One of the most important and widely used techniques in computer music is reverberation and the addition of various types and speeds of delays to dry (unreverberated) sounds. Reverberation and delays can be used to simulate room and other environmental acoustics or even to create new sounds of their own that are not necessarily related to existing physical spaces.

There are a number of commonly used techniques for simulating and modeling different reverberations and physical environments. One interesting technique is to actually record the ambience of a room and then superimpose that onto a sound recorded elsewhere. This technique is called convolution.

A Mathematical Excursion: Convolution in the Time Domain

Convolution can be accomplished in two equivalent ways: either in the time domain or in the frequency domain. We'll talk about the time-domain version first.

Convolution is like a "running average"—that is, we take the original function representing the music (we'll call that one m(n), where m(n) is the amplitude of the music function at time n) and then we make a new, smoothed function by making the amplitude a at time n of our new function equal to:

a = (1/3)[m(n – 2) + m(n – 1) + m(n)]

Let's call this new function r(n)—for running average—and let's look at it a little more closely. Just to make things work out, we'll start off by making r(0) = 0 and r(1) = 0. Then:

r(2) = 1/3[m(0) + m(1) + m(2)]
r(3) = 1/3[m(1) + m(2) + m(3)]

We continue on, running along and taking the average of every three values of m—hence the name "running average." Another way to look at this is as taking the music m(n) and "convolving it against the filter a(n)," where a(n) is another function, defined by:

a(0) = 1/3, a(1) = 1/3, a(2) = 1/3

All the rest of the a(n) are 0. Now, what is this mysterious "convolving against" thing? Well, first off we write it like:

m*a(n)

then:

m*a(n) = m(n)a(0) + m(n – 1)a(1) + m(n – 2)a(2) + .... + m(0)a(n)

http://music.columbia.edu/cmc/musicandcomputers/ - 134 -

next:

m*a(n) = m(n)(1/3) + m(n – 1)(1/3) + m(n – 2)(1/3)

Now, there is nothing special about the filter a(n)—in fact it could be any kind of function. If it were a function that reflects the acoustics of your room, then what we are doing—look back at the formula—is shaping the input function m(n) according to the filter function a(n). A little more terminology: the number of nonzero values that a(n) takes on is called the number of taps for a(n). So, our running average is a three-tap filter.
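If you want to try this yourself, here is a minimal sketch (ours, not part of the original text) of the three-tap running average in Python with NumPy. The little array m just stands in for a soundfile, and the loop spells out the m*a(n) sum exactly as defined above; unlike the text, we don't force r(0) and r(1) to zero, we just treat the missing earlier samples as silence.

```python
import numpy as np

# The three-tap "running average" filter from the text:
# a(0) = a(1) = a(2) = 1/3, all other taps zero.
a = np.array([1/3, 1/3, 1/3])

# m(n): a made-up "music" signal, just for illustration.
m = np.array([0.0, 3.0, 6.0, 3.0, 0.0, -3.0, -6.0, -3.0])

# Direct time-domain convolution, spelled out the way the text defines it:
# r(n) = m(n)a(0) + m(n-1)a(1) + m(n-2)a(2)
r = np.zeros(len(m))
for n in range(len(m)):
    for k in range(len(a)):
        if n - k >= 0:
            r[n] += m[n - k] * a[k]

# numpy's built-in convolution does the same cross-multiply-and-sum,
# which is handy once a filter has many more taps.
print(r)
print(np.convolve(m, a)[:len(m)])
```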
Reverb in the Time Domain

Convolution is sure fun, and we'll see more about it in the following section, but a far easier way to create reverb in the time domain is to simply delay the signal some number of times (by some very small time value) and feed it back onto itself, simulating the way a sound bounces around a room. There are a large variety of commercially available and inexpensive devices that allow for many different reverb effects. One interesting possibility is to change reverb over the course of a sound, in effect making a room grow and shrink over time (see Soundfile 5.7).

Soundfile 5.7 Reverberant characteristics: Changing the reverberant characteristics of a sound over time. Here is a nail being hammered. The room size goes from very small to very large over the course of about 15 seconds. This can't happen in the real, physical world, and it's a good example of the way the computer musician can create imaginary soundscapes.

Before there were digital systems, engineers created various ways to get reverb-type effects. Any sort of situation that could cause delay was used. Engineers used reverberant chambers—sending sound via a loudspeaker into a tiled room and rerecording the reverb-ed sound. When one of us was a kid, we ran a microphone into our parents' shower to record a trumpet solo! Reverb units made with springs and metal plates make interesting effects, and many digital signal processing units have algorithms that can be dialed up to create similar effects. Of course we like to think that with a combination of delayed copies of signals we can simulate any type of enclosure, whether it be the most desired recording studio room or a parking garage.

As you'll see in Figure 5.5, there are some essential elements for creating a reverberant sound. Much of the time a good reverb "patch," or set of digital algorithms, will make use of a number of copies of the original signal—just as there are many versions of a sound bouncing around a room. There might be one copy of the signal dedicated to making the first reflection (which is the very first reflected sound we hear when a sound is introduced into a reverberant space). Others might be algorithms dedicated to making the early reflections—the first sounds we hear after the initial reflection. The rest of the software for a good digital reverb is likely designed to blend the reverberant sound. Filters are usually used to make the reverb tails sound as if they are far away. Any filter that attenuates the higher frequencies (like 5 kHz or up) makes a signal sound farther away from us, since high frequencies have very little energy and don't travel very far.

http://music.columbia.edu/cmc/musicandcomputers/ - 135 -

When we are working to simulate rooms with digital systems, we have to take a number of things into consideration:

• How large is the room? How long will it take for the first reflection to get back to our ears?
• What are the surfaces like in the room? Not just the walls, ceiling, and floor, but also any objects in the room that can reflect sound waves; for example, are there any huge ice sculptures in the middle of the room?
• Are there surfaces in the room that can absorb sound? These are called absorptive surfaces. For example, movie theaters usually have curtains hanging on the walls to suck up sound waves. Maybe there are people in the room: most bodies are soft, and soft surfaces remove energy from reflected sounds.

Figure 5.5 Time-domain reverb is conceptually similar to the idea of physical modeling—we take a signal (say a hand clap) and feed it into a model of a real-world environment (like a large room). The signal is modified according to our model room and then output as a new signal. In this illustration you see the graphic representation of a direct sound being introduced into a reverberant space. The RT 60 refers to reverb time, or how long it takes the sound in the space to decay by 60 dB. Photo courtesy of Garry Clennell <[email protected]>.

Figure 5.6 In this illustration you can see how sound from a sound source (like a loudspeaker) can bounce around a reverberant space to create what we call reverb.
A typical reverb algorithm can have lots of control variables such as room size (gymnasium, small club, closet, bathroom), brightness (hard walls, soft walls or curtains, jello), and feedback or absorption coefficient (are there people, sand, rugs, llamas in the room—how quickly does the sound die out?). Photo courtesy of Garry Clennell <[email protected]>.

http://music.columbia.edu/cmc/musicandcomputers/ - 136 -

Figure 5.7 Here's a basic diagram showing how a signal is delayed (in the delay box), then fed back and added to the original signal, with some attenuated (that is, not as loud) version of the original signal added. This would create an effect known as comb filtering (a short delay with feedback that emphasizes specific harmonics) as well as a delay. What does a delay do? It takes a function (a signal), shifts it "backward" in time, then combines it with another "copy" of the same signal. This is, once again, a kind of averaging.

Figure 5.8 When we both feed back and feed forward the same signal at the same time, phase inverted, we get what is called an all-pass filter. (Phase inverting a signal means changing its phase by 180°.) With this type of filter/delay, we don't get the comb filtering effect because of the phase inversion of the feed-forward signal, which fills in the missing frequencies created by the comb filter of the fed-back signal. This type of unit is used to create blended reverb. A set of delay algorithms would probably have some combination of a number of comb-type delays as in this figure, as well as some all-pass filters to create a blend.
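As a rough illustration of these building blocks (and emphatically not the algorithm inside any particular commercial reverb unit), here is a sketch of a feedback comb filter and a Schroeder-style all-pass filter in Python/NumPy. The delay lengths and gains are arbitrary values chosen for the example.

```python
import numpy as np

def comb(x, delay, g):
    """Feedback comb filter: the delayed output is attenuated by g
    and added back to the input (the structure of Figure 5.7)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (g * y[n - delay] if n >= delay else 0.0)
    return y

def allpass(x, delay, g):
    """Schroeder all-pass: feed-forward and feed-back paths with opposite
    signs, so the overall frequency response stays flat (Figure 5.8)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        x_d = x[n - delay] if n >= delay else 0.0
        y_d = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + x_d + g * y_d
    return y

# A toy "hand clap": a single impulse at 44.1 kHz.
sr = 44100
clap = np.zeros(sr)          # one second of silence...
clap[0] = 1.0                # ...with one impulse at the start

# A crude reverb: a few parallel combs with different delays,
# then an all-pass to blend, roughly the structure described above.
wet = sum(comb(clap, d, 0.8) for d in (1557, 1617, 1491, 1422)) / 4.0
wet = allpass(wet, 225, 0.7)
```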
Making Reverb in the Frequency Domain: A Return to Convolution

http://music.columbia.edu/cmc/musicandcomputers/ - 137 -

As we mentioned, convolution is a great way to get reverberant effects. But we mainly talked about the mathematics of it and why convolving a signal is basically putting it through a type of filter. Now we're going to talk about how to actually convolve a sound. The process is fairly straightforward:

1. Locate a space whose reverberant characteristics we like.
2. Set up a microphone and some digital media for recording.
3. Make a short, loud sound in this room, like hitting two drumsticks together (popping a balloon and firing a starter's pistol are other standard ways of doing this—sometimes it's called "shooting the room").
4. Record this sound, called the impulse.
5. Analyze the impulse in the room; this gives us the impulse response of the room, which characterizes the room's response to sound.
6. Use the technique of convolution to combine this impulse response with other sound sources, more or less bringing the room to the sound.

Convolution can be powerful, and it's a fairly complicated software process. It takes each sample in the impulse response file (the one we recorded in the room, which should be short) and multiplies that sample by each sample in the sound file that we want to "put in" that room. So, each sample in the input sound file, like a vocal sound file to which we want to add reverb, is multiplied by each sample in the impulse response file. That's a lot of multiplies! Let's pretend, for the moment, that we have a 3-second impulse file and a 1-minute sound file to which we want to add the reverberant characteristics of some space. At 44.1 kHz, that's:

(3 * 44,100) * (60 * 44,100) = 132,300 * 2,646,000 = 350,065,800,000 multiplies

If you've been paying close attention, you might raise an interesting question here: isn't this the time domain? We're just multiplying signals together (well, actually, we're multiplying each point in each function by every other point in the other function—called a cross-multiply). This operation in the time domain is convolution, and it turns out to correspond to a simple pointwise multiplication in the frequency domain.

Now, we're not math geniuses, but even we know that this is a whole mess of multiplies! That's why convolution, which is very computationally expensive, had not been a popular technique until your average computer got fast enough to do it. It was also a completely unknown sound manipulation idea until digital recording technology made it feasible. But nowadays we can do it, and the reason we can do it is our old friend, the FFT! You see, it turns out that for filters with lots of taps (remember that this means one with lots of nonzero values), it is easier to compute the convolution in the spectral domain.

http://music.columbia.edu/cmc/musicandcomputers/ - 138 -

Suppose we want to convolve our music function m(n) against our filter function a(n). We can tell you immediately that the convolution of these functions (which, remember, is another sound, a function that we call c(n)) has a spectrum equal to the pointwise product of the spectrum of the music function and the spectrum of the filter function. The pointwise product is the frequency content at any point in the convolution and is calculated by multiplying the spectra of the music function and the filter function at that particular point. Another way of saying this is to say that the Fourier coefficients of the convolution can be computed by simply multiplying together each of the Fourier coefficients of m(n) and a(n). The zero coefficient (the DC term) is the product of the DC terms of a(n) and m(n); the first coefficient is the product of the first Fourier coefficients of a(n) and m(n); and so on.

So, here's a sneaky algorithm for making the convolution:

Step 1: Compute the Fourier coefficients of both m(n) and a(n).
Step 2: Compute the pointwise products of the Fourier coefficients of a(n) and m(n).
Step 3: Compute the inverse Fourier transform (IFFT) of the result of Step 2.

And we're done! This is one of the great properties of the FFT: it makes convolution just about as easy as multiplication! So, once again, if we have a big library of great impulse responses—the best-sounding cathedrals, the best recording studios, concert halls, the Grand Canyon, Grand Central Station, your shower—we can simulate any space for any sound. And indeed this is how many digital effects processors and reverb plug-ins for computer programs work. This is all very cool when you're working hard to get that "just right" sound.

Figure 5.9 The impulse on the left, and the room's response on the right. Thanks to Nigel Redmon of earlevel.com for this graphic.

Applet 5.3 Flanges, delays, reverb: effect box fun

Applet 5.4 Multitap delay
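Here is a minimal NumPy sketch of that sneaky three-step recipe. One practical detail the text glosses over: we zero-pad both signals out to the length of the full result, so that the FFT's circular convolution matches ordinary convolution.

```python
import numpy as np

def fft_convolve(m, a):
    """Convolve signal m with impulse response a by multiplying spectra.
    Step 1: FFT of both; Step 2: pointwise product; Step 3: inverse FFT.
    Zero-padding to the full output length makes the result equal to
    ordinary (linear) convolution."""
    n = len(m) + len(a) - 1
    M = np.fft.rfft(m, n)          # Step 1
    A = np.fft.rfft(a, n)          # Step 1
    C = M * A                      # Step 2: pointwise product
    return np.fft.irfft(C, n)      # Step 3

# Quick check against direct convolution on small random signals.
m = np.random.randn(1000)
a = np.random.randn(300)
print(np.allclose(fft_convolve(m, a), np.convolve(m, a)))
```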
http://music.columbia.edu/cmc/musicandcomputers/ - 139 -

http://music.columbia.edu/cmc/musicandcomputers/ - 140 -

Section 5.3: Localization/Spatialization

Applet 5.5 Filter-based localization: This applet lets you pan, fade, and simulate binaural listening (best if listened to with headphones).

Soundfile 5.8 Binaural location: Using the program SoundHack by Tom Erbe, we can place a sound anywhere in the binaural (two ears) space. SoundHack does this by convolving the sound with known filter functions that simulate the ITD (interaural time delay) between the two ears.

Close your eyes and listen to the sounds around you. How well can you tell where they're coming from? Pretty well, hopefully! How do we do that? And how could we use a computer to simulate moving sound so that, for example, we can make a car go screaming across a movie screen or a bass player seem to walk over our heads?

Humans have a pretty complicated system for perceptually locating sounds, involving, among other factors, the relative loudness of the sound in each ear, the time difference between the sound's arrival in each ear, and the difference in frequency content of the sound as heard by each ear. How would a "cyclaural" (the equivalent of a "cyclops") hear? Most attempts at spatializing, or localizing, recorded sounds make use of some combination of factors involving the two ears on either side of the head.

Simulating Sound Placement

Simulating a loudness difference is pretty simple—if someone standing to your right says your name, their voice is going to sound louder in your right ear than in your left. The simplest way to simulate this volume difference is to increase the volume of the signal in one channel while lowering it in the other—you've probably used the pan or balance knob on a car stereo or boombox, which does exactly this. Panning is a fast, cheap, and fairly effective means of localizing a signal, although it can often sound artificial.

Interaural Time Delay (ITD)

Simulating a time difference is a little trickier, but it adds a lot to the realism of the localization. Why would a sound reach your ears at different times? After all, aren't our ears pretty close together? We're generally not even aware that this is true: snap your finger on one side of your head, and you'll think that you hear the sound in both ears at exactly the same time. But you don't. Sound moves at a specific speed, and it's not all that fast (compared to light, anyway): about 345 meters/second. Since your fingers are closer to one ear than the other, the sound waves will arrive at your ears at different times, if only by a small fraction of a second.

http://music.columbia.edu/cmc/musicandcomputers/ - 141 -

Since most of us have ears that are quite close together, the time difference is very slight—too small for us to consciously "perceive." Let's say your head is a bit wide: roughly 25 cm, or a quarter of a meter. It takes sound around 1/345 of a second to go 1 meter, which is approximately 0.003 second (3 thousandths of a second). It takes about a quarter of that time to get from one ear of your wide head to the other, which is about 0.0007 second (0.7 thousandths of a second). That's a pretty small amount of time! Do you believe that our brains perceive that tiny interval and use the difference to help us localize the sound? We hope so, because if there's a frisbee coming at you, it would be nice to know which direction it's coming from! In fact, though, the delay is even smaller, because your head is smaller than 0.25 meter (we just rounded it off for simplicity). The technical name for this delay is interaural time delay (ITD).

To simulate ITD by computer, we simply need to add a delay to one channel of the sound. The longer the delay, the more the sound will seem to be panned to one side or the other (depending on which channel is delayed). The delays must be kept very short so that, as in nature, we don't consciously perceive them as delays, just as location cues. Our brains take over and use them to calculate the position of the sound. Wow!
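A bare-bones sketch of these two cues (a level difference plus a sub-millisecond interaural delay) might look like the following. The 0.0007-second delay comes from the rough estimate above; the pan value and the simple linear gain law are our own arbitrary choices, not a psychoacoustically calibrated model.

```python
import numpy as np

def localize(mono, sr, delay_s=0.0007, pan=0.7):
    """Crude localization: delay one channel by an interaural time
    difference (about 0.7 ms, as estimated in the text) and make it a
    bit quieter. pan=0.7 here just means 'mostly to the right'."""
    d = int(round(delay_s * sr))                 # delay in samples
    left = np.concatenate([np.zeros(d), mono])   # far ear: delayed...
    left *= (1.0 - pan)                          # ...and attenuated
    right = np.concatenate([mono, np.zeros(d)])  # near ear: on time, louder
    right *= pan
    return np.stack([left, right], axis=1)       # stereo, shape (n, 2)

# Example: a half-second burst of noise placed toward the right ear.
sr = 44100
snap = np.random.randn(sr // 2) * 0.1
stereo = localize(snap, sr)
```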
Modeling Our Ears and Our Heads That the ears perceive and respond to a difference in volume and arrival time of a sound seems pretty straightforward, albeit amazing. But what’s this about a difference in the frequency content of the sound? How could the position of a bird change the spectral makeup of its song? The answer: your head! Imagine someone speaking to you from another room. What does the voice sound like? It’s probably a bit muffled or hard to understand. That’s because the wall through which the sound is traveling—besides simply cutting down the loudness of the sound—acts like a lowpass filter. It lets the low frequencies in the voice pass through while attenuating or muffling the higher ones. Your head does the same thing. When a sound comes from your right, it must first pass through, or go around, your head in order to reach your left ear. In the process, your head absorbs, or blocks, some of the high-frequency energy in the sound. Since the sound didn’t have to pass through your head to get to your right ear, there is a difference in the spectral makeup of the sound that each ear hears. As with ITD, this is a subtle effect, although if you’re in a quiet room and you turn your head from side to side while listening to a steady sound, you may start to perceive it. Modeling this by computer is easy, provided you know something about how the head filters sounds (what frequencies are attenuated and by how much). If you’re interested in the frequency response of the human head, there are a number of published sources available for the data, since they are used by, among others, the government for all sorts of things (like flight simulators, for example). Researcher and author Durand Begault has been a leading pioneer in the design and implementation of what are called head transfer functions— frequency response curves for different locations of sound. What Are Head-Related Transfer Functions (HRTFs)? http://music.columbia.edu/cmc/musicandcomputers/ - 142 - Figure 5.10 This illustration shows how the spectral contents of a sound change depending on which direction the sound is coming from. The body (head and shoulders) and the time-ofarrival difference that occurs between the left and right ears create a filtering effect. Figure 5.11 The binaural dummy head recording system includes an acoustic baffle with the approximate size, shape, and weight of a human head. Small microphones are mounted where our ears are located. This recording system is designed to emulate the acoustic effects of the human head (just as our ears might hear sounds) and then capture the information on recording media. A number of recording equipment manufacturers make these "heads," and they often have funny names (Sven, etc.). Thanks to Sonic Studios for this photo. http://music.columbia.edu/cmc/musicandcomputers/ - 143 - Not surprisingly, humans are extremely adept at locating sounds in two dimensions, or the plane. We’re great at figuring out the source direction of a sound, but not the height. When a lion is coming at us, it’s nice of evolution to have provided us with the ability to know, quickly and without much thought, which way to run. It’s perhaps more of a surprise that we’re less adept at locating sounds in the third dimension, or more accurately, in the "up/down" axis. But we don’t really need this ability. We can’t jump high enough for that perception to do us much good. 
Barn owls, on the other hand, have little filters on their cheeks, making them extraordinarily good at sensing their sonic altitude. You would be good at sensing your sonic altitude, too, if you had to catch and eat, from the air, rapidly running field mice. So if it's not a frisbee heading at you more or less in the two-dimensional plane, but a softball headed straight down toward your head, we'd suggest a helmet!

http://music.columbia.edu/cmc/musicandcomputers/ - 144 -

Section 5.4: Introduction to Spectral Manipulation

There are two different approaches to manipulating the frequency content of sounds: filtering, and a combination of spectral analysis and resynthesis. Filtering techniques, at least classically (before the FFT became commonly used by most computer musicians), attempted to describe spectral change by designing time-domain operations. More recently, a great deal of work in filter design has taken place directly in the spectral domain. Spectral techniques allow us to represent and manipulate signals directly in the frequency domain, often providing a much more intuitive and user-friendly way to work with sound. Fourier analysis (especially the FFT) is the key to many current spectral manipulation techniques.

Phase Vocoder

Perhaps the most commonly used implementation of Fourier analysis in computer music is a technique called the phase vocoder. What is called the phase vocoder actually comprises a number of techniques for taking a time-domain signal, representing it as a series of amplitudes, phases, and frequencies, manipulating this information, and returning it to the time domain. (Remember, Fourier analysis is the process of turning the list of samples of our music function into a list of Fourier coefficients, which are complex numbers that have phase and amplitude, and each corresponds to a frequency.)

Two of the most important ways that musicians have used the phase vocoder technique are to use a sound's Fourier representation to manipulate its length without changing its pitch and, conversely, to change its pitch without affecting its length. This is called time stretching and pitch shifting.

Why should this even be difficult? Well, consider trying it in the time domain: play back, say, a 33 1/3 RPM record at 45 RPM. What happens? You play the record faster, the needle moves through the grooves at a higher rate, and the sound is higher pitched (often called the "chipmunk" effect, possibly after the famous 1960s novelty records featuring Alvin and his friends). The sound is also much shorter: in this case, pitch is directly tied to playback speed—they're both controlled by the same mechanism. A creative and virtuosic use of this technique is scratching as practiced by hip-hop, rap, and dance DJs.

Soundfile 5.9 Unmodified speech

Soundfile 5.10 Speech made half as long with a phase vocoder

Soundfile 5.11 Speech made twice as long with a phase vocoder

Soundfile 5.12 Speech transposed up an octave

Soundfile 5.13 Speech transposed down an octave

http://music.columbia.edu/cmc/musicandcomputers/ - 145 -

Soundfile 5.14 Speeding up a sound: An example of time-domain pitch shifting/speed changing. In this case, pitch and time transformations are related. The faster the sound is played, the higher the pitch becomes, as heard in Soundfile 5.14. In Soundfile 5.15, the opposite effect is heard: the slower file sounds lower in pitch.
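Both of those soundfiles demonstrate the same time-domain coupling, which is easy to sketch: resample a signal so that it plays back faster and its pitch goes up while its duration shrinks, all at once. This is our own toy linear-interpolation varispeed, not SoundHack's implementation.

```python
import numpy as np

def varispeed(x, speed):
    """Play a signal back at a different speed by resampling it with
    linear interpolation. speed=2.0 -> twice as fast, an octave higher;
    speed=0.5 -> half as fast, an octave lower. Pitch and duration are
    locked together, which is exactly the "chipmunk" problem."""
    n_out = int(len(x) / speed)
    read_pos = np.arange(n_out) * speed            # fractional read positions
    return np.interp(read_pos, np.arange(len(x)), x)

# A 440 Hz test tone: at speed=2 it comes out at 880 Hz and half the length.
sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
chipmunk = varispeed(tone, 2.0)
```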
Soundfile 5.15 Slowing down a sound The Pitch/Speed Relationship in the Digital World Now think of altering the speed of a digital signal. To play it back faster, you might raise the sampling rate, reading through the samples for playback more quickly. Remember that sometimes we refer to the sampling rate as the rate at which we stored (sampled) the sounds, but it also can refer to the kind of internal clock that the computer uses with reference to a sound (for playback and other calculations). We can vary that rate, for example playing back a sound sampled at 22.05 kHz at 44.1 kHz. With more samples (read) per second, the sound gets shorter. Since frequency is closely related to sampling rate, the sound also changes pitch. Xtra bit 5.1 "Slow Motion Sound" Even with the basic pitch/speed problem, manipulating the speed of sound has always attracted creative experiment. Consider an idea proposed by composer Steve Reich in 1967, thought to be a kind of impossible dream for electronic music: to slow down a sound without changing its pitch (and vice versa). A SoundHack varispeed of some standard speech. Note how the speech’s speed is changed over time. Soundfile 5.16 Varispeed Varispeed is a general term for "fooling around" with the sampling rate of a sound file. Using the Phase Vocoder Using the phase vocoder, we can realize Steve Reich’s piece (see Xtra bit 5.1), and a great many others. The phase vocoder allows us independent control over the time and the pitch of a sound. How does this work? Actually, in two different ways: by changing the speed and changing the pitch. To change the speed, or length, of a sound without changing its pitch, we need to know something about what is called windowing. Remember that when doing an FFT on a sound, http://music.columbia.edu/cmc/musicandcomputers/ - 146 - we use what are called frames—time-delimited segments of sound. Over each frame we impose a window: an amplitude envelope that allows us to cross-fade one frame into another, avoiding problems that occur at the boundaries of the two frames. What are these problems? Well, remember that when we take an FFT of some portion of the sound, that FFT, by definition, assumes that we’re analyzing a periodic, infinitely repeating signal. Otherwise, it wouldn’t be Fourier analyzable. But if we just chop up the sound into FFT-frames, the points at which we do the chopping will be hard-edged, and we’ll in effect be assuming that our periodic signal has nasty edges on both ends (which will typically show up as strong high frequencies). So to get around this, we attenuate the beginning and ending of our frame with a window, smoothing out the assumed periodical signal. Typically, these windows overlap at a certain rate (1/8, 1/4, 1/2 overlap), creating even smoother transitions between one FFT frame and another. Figure 5.12 Why do we window FFT frames? The image on the left shows the waveform that our FFT would analyze without windowing—notice the sharp edges where the frame begins and ends. The image in the middle is our window. The image on the right shows the windowed waveform. By imposing a smoothing window on the time domain signal and doing an FFT of the windowed signal, we de-emphasize the high-frequency artifacts created by these sharp vertical drops at the beginning and end of the frame. Thanks to Jarno Seppänen <[email protected]> Nokia Research Center, Tampere, Finland, for these images. 
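Here is a small NumPy sketch of the windowing and overlap just described: chop a signal into half-overlapping frames, apply a (periodic) Hann window, take an FFT of each frame, then inverse-FFT and overlap-add. With 50% overlap these particular windows sum to one, so if we leave the spectra alone the original signal comes back out; changing the hop size or the spectra at resynthesis time is where the phase vocoder's stretching and shifting come in. The frame size and the noise test signal are arbitrary choices.

```python
import numpy as np

frame_size = 1024
hop = frame_size // 2                      # 50% overlap
# "Periodic" Hann window: with 50% overlap these windows sum to exactly 1,
# so the overlap-add below reconstructs the original signal.
window = 0.5 * (1 - np.cos(2 * np.pi * np.arange(frame_size) / frame_size))

x = np.random.randn(frame_size * 20)       # stand-in for a real soundfile

# Analysis: chop the signal into overlapping, windowed frames
# (these are the frames we hand to the FFT).
frames = [window * x[i:i + frame_size]
          for i in range(0, len(x) - frame_size, hop)]
spectra = [np.fft.rfft(f) for f in frames]  # one FFT per frame

# Resynthesis: inverse-FFT each frame and overlap-add at the same hop size.
y = np.zeros(len(x))
for k, S in enumerate(spectra):
    start = k * hop
    y[start:start + frame_size] += np.fft.irfft(S, frame_size)

# Away from the very beginning and end, y matches x.
print(np.allclose(x[frame_size:-2*frame_size], y[frame_size:-2*frame_size]))
```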
http://music.columbia.edu/cmc/musicandcomputers/ - 147 -

Figure 5.13 After we window a signal for the FFT, we overlap those windowed signals so that the original signal is reconstructed without the sharp edges.

By changing the length of the overlap when we resynthesize the signal, we can change the speed of the sound without affecting its frequency content (that is, the FFT information will remain the same; it'll just be resynthesized at a "larger" frame size). That's how the phase vocoder typically changes the length of a sound.

What about changing the pitch? Well, it's easy to see that with an FFT we get a set of amplitudes that correspond to a given set of frequencies. But it's clear that if, for example, we have very strong amplitudes at 100 Hz, 200 Hz, 300 Hz, 400 Hz, and so on, we will perceive a strong pitch at 100 Hz. What if we just take the amplitudes at all frequencies and move them "up" (or down) to frequencies twice as high (or as low)? What we've done then is re-create the frequency/amplitude relationships starting at a higher frequency—changing the perceived pitch without changing the length of the sound.

http://music.columbia.edu/cmc/musicandcomputers/ - 148 -

Two columns of FFT bins. These bins divide the Nyquist frequency evenly. In other words, if we were sampling at 10 kHz and we had 100 FFT bins (both these numbers are rather silly, but they're arithmetically simple), our Nyquist frequency would be 5 kHz, and the bin width would be 50 Hz. Each of these frequency bins has its own amplitude, which is the strength or energy of the spectrum at that frequency (or more precisely, the average energy in that frequency range). To implement a pitch shift in a phase vocoder, the amplitudes in the left column are shifted up to higher frequencies in the right column: the energy in each bin on the left is moved to a higher-frequency bin on the right.

The phase vocoder technique actually works just fine, though for radical pitch/time deformations we get some problems (usually called "phasiness"). These techniques work better for slowly changing harmonic sounds and for simpler pitch/time relationships (integer multiples). Still, the phase vocoder works well enough, in general, for it to be a widely used technique in both the commercial and the artistic sound worlds.

Soundfile 5.17 "Study: Anna": Larry Polansky's 1-minute piece, "Study: Anna, the long and the short of it." All the sounds are created using phase vocoder pitch and time shifts of a recording of a very short cry (with introductory inhale) of the composer's daughter when she was 6 months old.
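As a cartoon of the bin-shifting idea described above (and only a cartoon: a real phase vocoder also keeps track of the phase in each bin, which is exactly what fends off the "phasiness" problems mentioned), here is a sketch that moves the energy in each FFT bin of a single frame up by a given ratio.

```python
import numpy as np

def shift_bins(spectrum, ratio):
    """Move the energy in each FFT bin to a bin 'ratio' times higher
    (ratio=2.0 -> up an octave, 0.5 -> down an octave). This is a toy
    version of phase vocoder pitch shifting: a real implementation also
    adjusts the phases, which we skip here for clarity."""
    shifted = np.zeros_like(spectrum)
    for i in range(len(spectrum)):
        j = int(round(i * ratio))
        if j < len(shifted):
            shifted[j] += spectrum[i]
    return shifted

# Shift one windowed frame of a 440 Hz tone up an octave.
sr, n = 44100, 2048
t = np.arange(n) / sr
frame = np.hanning(n) * np.sin(2 * np.pi * 440 * t)
spectrum = np.fft.rfft(frame)
up_octave = np.fft.irfft(shift_bins(spectrum, 2.0), n)
```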
http://music.columbia.edu/cmc/musicandcomputers/ - 149 -

Section 5.5: More on Convolution

Another interesting use of the phase vocoder is to perform convolution. We talked about convolution a lot in our section on reverb (both in the time domain and in the frequency domain).

Soundfile 5.18 Convolution: Schwarzhund: A simple convolution: a guitar power chord (the impulse response), a vocal excerpt, and the convolution of one with the other. We have sonically turned the singer's head into the guitar chord: he's singing "through" the chord.

Remember that a convolution multiplies every sample in one sound by every sample in another (sometimes called a cross-multiply). This is different from simply multiplying each value in one sound by its one corresponding value in the other. In fact, as we mentioned, there is a well-known and surprisingly simple relationship between these two concepts: multiplication in the time domain is the same as convolution in the frequency domain (and vice versa). As you probably saw in Section 5.2, the mathematics of convolution can get a little hairy. But the uses of convolution for transformations of sound are pretty straightforward, so we'll explain them in this section.

Cross-Synthesis

Convolution is a type of cross-synthesis: a general, less technical term which means that some aspect of one sound is imposed onto another. A simple example of cross-synthesis returns us to the subject of reverb. We described the way that, by recording what is called the impulse response of a room (its resonant characteristics) using a very short, loud sound, we can place another sound in that room (at whatever position we "shot") by convolving the impulse response with that sound. Surprisingly, by convolving any sound with white noise, we can simulate simple reverb.

Using Convolution

Although reverberation is a common application of the convolution technique, convolution can be used creatively to produce unusual sounds as well. Simple-to-use convolution tools (like the one in the program SoundHack) have only recently become available to a large community of musicians because, until recently, they ran only on quite large computers and in rather arcane environments. So we are likely to hear some amazing things in the near future using these techniques!

Soundfile 5.19 Noise
Soundfile 5.20 Inhale
Soundfile 5.21 Noise and inhale convolved

In this convolution example, Soundfile 5.19 and Soundfile 5.20 are convolved into Soundfile 5.21. The two sound sources are noise and an inhale.

http://music.columbia.edu/cmc/musicandcomputers/ - 150 -

Section 5.6: Morphing

In recent years the idea of morphing, or turning one sound (or image) into another, has become quite popular. What is especially interesting, besides the idea of having a lion roar change gradually and imperceptibly into a meow, is the broader idea that there are sounds "in between" other sounds.

Figure 5.15 Image morphing: several stages of a morph.

What does it mean to change one sound into another? Well, how would you graphically change a picture into another? Would you replace, over time, little bits of one picture with those of another? Would you gradually change the most important shapes of one into those of the other? Would you look for important features (background, foreground, color, brightness, saturation, etc.), isolate them, and cross-fade them independently? You can see that there are lots of ways to morph a picture, and each way produces a different set of effects. The same is true for sound.

Larry Polansky's "51 Melodies" is a terrific example of a computer-assisted composition that uses morphing to generate novel (and kind of insane!) melodies. Polansky specified the source and target melodies, and then wrote a computer program to generate the melodies in between. From the liner notes to the CD "Change":

Soundfile 5.22 "morphing piece"

"51 Melodies is based on two melodies, a source and a target, and is in three sections. The piece begins with the source, a kind of pseudo-anonymous rock lick. The target melody, an octave higher and more chromatic, appears at the beginning of the third section, played in unison by the guitars and bass. The piece ends with the source.
In between, the two guitars morph, in a variety of different, independent ways, from the source to the target (over the course of Sections 1 and 2) and back again (Section 3). A number of different morphing functions, both durational and melodic, are used to distinguish the three sections." (Soundfile 5.22 is a three-minute edited version of the complete 12-minute piece, composed of one-minute sections from the beginning, middle, and end of the recording.)

http://music.columbia.edu/cmc/musicandcomputers/ - 151 -

Simple Morphing

The simplest sonic morph is essentially an amplitude cross-fade. Clearly, this doesn't do much (you could do it on a little audio mixer).

Soundfile 5.23 Cross-fade

Figure 5.16 An amplitude cross-fade of a number of different data points.

What would constitute a more interesting morph, even limiting us to the time domain? How about this: let's take a sound and gradually replace little bits of it with another sound. If we overlap the segments that we're "replacing," we will avoid the horrible clicks that would result from samples jumping drastically at the points of insertion.

Interpolation and Replacement Morphing

The two ways of morphing described above might be called interpolation and replacement morphing, respectively. In a replacement morph, intact values are gradually substituted from one sound into another. In an interpolation morph, we compare the values between two sounds and select values somewhere between them for the new sound. In the former, we are morphing completely some part of the time; in the latter, we are morphing somewhat all of the time.

In general, we can specify a degree of morphing, by convention called Ω, that tells how far one sound is from the other. A general formula for (linear) interpolation is:

I = A + (Ω * (B – A))

In this equation, A is the starting value, B is the ending value, and Ω is the interpolation index, or "how far" you want to go. Thus, when Ω = 0, I = A; when Ω = 1, I = B; and when Ω = 0.5, I = the average of A and B. This equation is a complicated way of saying: take some sound (SourceSound) and add to it some percentage of the difference between it and another sound (TargetSound – SourceSound), to get the new sound.

Sonic morphing can be more interesting in the frequency domain, in the creation of sounds whose spectral content is some kind of hybrid of two other sounds. (Convolution, by the way, could be thought of as a kind of morph!)

http://music.columbia.edu/cmc/musicandcomputers/ - 152 -

An interesting approach to morphing is to take some feature of a sound and morph that feature onto another sound, trying to leave everything else the same. This is called feature morphing. Theoretically, one could take any mathematical or statistical feature of the sound, even perceptually meaningless ones—like the standard deviation of every 13th bin—and come up with a simple way to morph that feature. This can produce interesting effects. But most researchers have concentrated their efforts on features, or some organized representation of the data, that are perceptually, cognitively, or even musically salient, such as attack time, brightness, roughness, harmonicity, and so on, finding that feature morphing is most effective on such perceptually meaningful features.
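Applied to raw samples, the interpolation formula above is just a time-varying cross-fade, but it is worth seeing it spelled out; the same function could be applied to spectral frames, or to extracted features, instead of samples. The two test tones and the linear ramp for Ω are arbitrary choices for the example.

```python
import numpy as np

def interpolation_morph(source, target, omega):
    """I = A + omega * (B - A), applied element by element.
    omega can be a single number or an array (one value per sample)."""
    return source + omega * (target - source)

# Morph from a 220 Hz tone to a 330 Hz tone, with omega ramping 0 -> 1.
sr = 44100
t = np.arange(sr * 2) / sr                 # two seconds
a = np.sin(2 * np.pi * 220 * t)            # SourceSound
b = np.sin(2 * np.pi * 330 * t)            # TargetSound
omega = np.linspace(0.0, 1.0, len(t))      # "how far" we are at each moment
morph = interpolation_morph(a, b, omega)
```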
Feature Morphing Example: Morphing the Centroid

Music cognition researchers and computer musicians commonly use a measure of sounds called the spectral centroid. The spectral centroid is a measure of the "brightness" of a sound, and it turns out to be extremely important in the way we compare different sounds. If two sounds have radically different centroids, they are generally perceived to be timbrally distant (sometimes this is called a spectral metric). Basically, the centroid can be considered the average frequency component (taking into consideration the amplitude of all the frequency components).

The spectral centroid is computed frame by frame: Ci is the centroid for the ith spectral frame of the sound. A spectral frame is some number of samples that is equal to the size of the FFT. The (individual) centroid of a spectral frame is defined as the average frequency weighted by amplitudes, divided by the sum of the amplitudes:

Ci = (f1*a1 + f2*a2 + ... + fN*aN) / (a1 + a2 + ... + aN)

where fk is the frequency of the kth bin of the frame and ak is its amplitude. We add up all the frequencies multiplied by their amplitudes (the numerator) and add up all the amplitudes (the denominator), and then divide. The "strongest" frequency wins! In other words, it's the average frequency weighted by amplitude: where the frequency concentration of a sound is.

http://music.columbia.edu/cmc/musicandcomputers/ - 153 -

Soundfile 5.24 Chris Mann
Soundfile 5.25 Single violin

The centroid curve of a sound over time. Note that centroids tend to be surprisingly high and never the "fundamental" (unless our sound is a pure sine wave). One of these curves is of a violin tone; the other is of a rapidly changing voice (Australian sound poet Chris Mann). The soundfile for Chris Mann is included as well.

Now let's take things one step further, and try to morph the centroid of one sound onto that of another. Our goal is to take the time-variant centroid from one sound and graft that onto a second sound, preserving as much of the second sound's amplitude/spectra relationship as possible. In other words, we're trying to morph one feature while leaving others constant. To do this, we can think of the centroid in an unusual way: as the frequency that divides the total sound file energy into two parts (above and below). That's what an average is. For some time-variant centroid (ci) extracted from one sound and some total amplitude from another (ampsum), we simply "plop" the new centroid onto the sound and scale the amplitude of the frequency bins above and below the new centroid frequency to (0.5 * ampsum). This will produce a sort of "brightness morph." Notice that on either side of the centroid in the new sound, the spectral amplitude relationships remain the same. We've just forced a new centroid.
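Here is a minimal sketch of the per-frame centroid calculation just described: sum of frequency times amplitude over sum of amplitude, once per FFT frame. The frame size, hop size, and Hann window are arbitrary choices, not part of the definition.

```python
import numpy as np

def spectral_centroid(x, sr, frame_size=2048, hop=1024):
    """Per-frame spectral centroid: sum(freq * amp) / sum(amp),
    computed from the magnitudes of each FFT frame."""
    window = np.hanning(frame_size)
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sr)   # frequency of each bin
    centroids = []
    for start in range(0, len(x) - frame_size, hop):
        amps = np.abs(np.fft.rfft(window * x[start:start + frame_size]))
        if amps.sum() > 0:
            centroids.append((freqs * amps).sum() / amps.sum())
    return np.array(centroids)

# A pure 440 Hz sine should give a centroid of (roughly) 440 Hz in every frame.
sr = 44100
t = np.arange(sr) / sr
print(spectral_centroid(np.sin(2 * np.pi * 440 * t), sr)[:5])
```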
Section 5.7: Graphical Manipulation of Sound

We've seen how it's possible to take sounds and turn them into pictures by displaying their spectral data in sonograms, waterfall plots, and so on. But how about going the other way? What about creating a picture of a sound, and then synthesizing it into an actual sound? Or how about starting with a picture of something else—Dan Rockmore's dog Digger, for instance? What would he sound like? Or how about editing a sound as an image—what if you could draw a box around some region of a sound and simply drag it to some other place, or erase part of it, or apply a "blur" filter to it?

http://music.columbia.edu/cmc/musicandcomputers/ - 154 -

Graphical manipulation of sound is still a relatively underdeveloped and emerging field, and the last few years have seen some exciting developments in the theory and tools needed to do such work. One of the interesting issues about it is that the graphical manipulations may not have any obvious relationship to the sonic effects. This could be looked upon as either a drawback or an advantage. Although graphical techniques are used for both the synthesis and transformation of sounds, much of the current work in this area seems geared more toward sonic manipulation than synthesis.

Pre–Digital Era Graphic Manipulation

Composers have always been interested in exploring the relationships between color, shape, and sound. In fact, people in general are fascinated with this. Throughout history, certain people have been synaesthetic—they see sound or hear color. Color organs, sound pictures, even just visual descriptions of sound have been an important part of the way people try to understand sound and music, and most importantly their own experience of it. Timbre is often called "sound color," even though sound color should more appropriately be analogized to frequency/pitch.

Computer musicians have often been interested in working with sounds from a purely graphical perspective—a "let's see what would happen if" kind of approach. Creating and editing sounds graphically is not a new idea, although it's only recently that we've had tools flexible enough to do it well. Even before digital computers, there were a number of graphics-to-sound systems in use. In fact, some of the earliest film sound technology was optical—a waveform was printed, or even drawn by hand (as in the wonderfully imaginative work of Canadian filmmaker Norman McLaren), on a thin stripe of the film running along the sprocket holes. A light shining through the waveform allowed electronic circuitry to sense and play back the sound.

Canadian inventor and composer Hugh LeCaine took a different approach. In the late 1950s, he created an instrument called the Spectrogram, consisting of a bank of 108 analog sine wave oscillators controlled by curves drawn on long rolls of paper. The paper was fed into the instrument, which sensed the curves and used them to determine the frequency and volume of each oscillator. Sound familiar? It should—LeCaine's Spectrogram was essentially an analog additive synthesis instrument!

http://music.columbia.edu/cmc/musicandcomputers/ - 155 -

Canadian composer and instrument builder, and one of the great pioneers of electronic and computer music, Hugh LeCaine. LeCaine was especially interested in physical and visual descriptions of electronic music. On the right is one of LeCaine's inventions, an electronic musical instrument called the Spectrogram.

UPIC System

One of the first digital graphics-to-sound schemes, Iannis Xenakis's UPIC (Unité Polyagogique Informatique du CEMAMu) system, was similar to LeCaine's invention in that it allowed composers to draw lines and curves that represent control information for a bank of oscillators (in this case, digital oscillators). In addition, it allowed the user to perform graphical manipulations (cut and paste, copy, rearrange, etc.) on what had been drawn. Another benefit of the digital nature of the UPIC system was that any waveform (including sampled ones) could be used in the synthesis of the sound. By the early 1990s, UPIC was able to do all of its synthesis and processing live, enabling it to be used as a real-time performance instrument. Newer versions of the UPIC system are still being developed and are currently in use at CEMAMu (Centre des Etudes Mathématiques Automatiques Musicales) in Paris, an important center for research in computer music.
AudioSculpt and SoundHack More recently, a number of FFT/IFFT-based graphical sound manipulation techniques have been developed. One of the most advanced is AudioSculpt from IRCAM in France. AudioSculpt allows you to operate on spectral data as you would an image in a painting program—you can paint, erase, filter, move around, and perform any number of other operations on the sonograms that AudioSculpt presents. http://music.columbia.edu/cmc/musicandcomputers/ - 156 - FFT data displayed as a sonogram in the computer music program AudioSculpt. Partials detected in the sound are indicated by the red lines. On the right are a number of tools for selecting and manipulating the sound/image data. Another similar, and in some ways more sophisticated, approach is Tom Erbe’s QT-coder, a part of his SoundHack program. The QT-coder allows you to save the results of an FFT of a sound as a color image that contains all of the data (magnitude and phase) associated with the sound you’ve analyzed (as opposed to AudioSculpt, which only presents you with the magnitude information). It saves the images as successive frames of a QuickTime movie, which can then be opened by most image/video editing software. The result is that you can process and manipulate your sound using not only specialized audio tools, but also a large number of programs meant primarily for traditional image/video processing. The movie can then be brought back into SoundHack for resynthesis. It is also possible to go the other way, that is, to use SoundHack to synthesize an actual movie into sound, manipulate that sound, and then transform it back into a movie. As you may imagine, using this technique can cause some pretty strange effects! ) Original image ) Altered image An original image created by the QT-coder in SoundHack (left), and the image after alterations (right). Listen to the original sound (Soundfile 5.26) and examine the original image. Now examine the altered image. Can you guess what the alterations (Soundfile 5.27) http://music.columbia.edu/cmc/musicandcomputers/ - 157 - will sound like? Soundfile 5.28 Chris Penrose’s composition "American Jingo" Chris Penrose’s Hyperupic. This computer software allows for a wide variety of ways to transform images into sound. See Soundfile 5.28 for Penrose’s Hyperupic composition. squiggy squiggy, a project developed by one of the authors (repetto), combines some of the benefits of both the UPIC system and the FFT-based techniques. It allows for the real-time creation, manipulation, and playback of sonograms. squiggy can record, store, and play back a number of sonograms at once, each of which can be drawn on, filtered, shifted, flipped, erased, looped, combined in various ways, scrubbed, mixed, panned, and so on—all live. The goal of squiggy is to create an instrument for live performance that combines some of the functionality of a traditional time-domain sampler with the intuitiveness and timbral flexibility of frequency-domain processing. http://music.columbia.edu/cmc/musicandcomputers/ - 158 - Figure 5.22 Screenshot from squiggy, a real-time spectral manipulation tool by douglas repetto. On the left is the spectral data display window, and on the right are controls for independent volume, pan, loop speed, and loop length settings for each sound. In addition there are a number of drawing tools and processes (not shown) that allow direct graphical manipulation of the spectral data. http://music.columbia.edu/cmc/musicandcomputers/ - 159 -
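To make the "edit a sound as an image" idea concrete, here is a toy sketch; it is not AudioSculpt's, SoundHack's, or squiggy's actual method, just the general recipe. Compute an STFT, treat the array of FFT frames like an image, "erase" a rectangular time-frequency region by zeroing it, and resynthesize by windowed overlap-add. The frame size, hop, and the particular rectangle are all arbitrary.

```python
import numpy as np

frame_size, hop = 2048, 512
window = np.hanning(frame_size)

x = np.random.randn(44100)                     # stand-in for a real soundfile
starts = range(0, len(x) - frame_size, hop)

# Analysis: the STFT as a 2-D array (one row per frame, one column per bin),
# essentially the data behind a sonogram image.
stft = np.array([np.fft.rfft(window * x[s:s + frame_size]) for s in starts])

# The "graphic" edit: erase a rectangle of the image.
# Here, frames 20-59 and bins 100-299 are simply set to zero.
stft[20:60, 100:300] = 0.0

# Resynthesis: inverse-FFT each edited frame and overlap-add.
y = np.zeros(len(x))
norm = np.zeros(len(x))
for k, s in enumerate(starts):
    y[s:s + frame_size] += window * np.fft.irfft(stft[k], frame_size)
    norm[s:s + frame_size] += window ** 2
y = y / np.maximum(norm, 1e-8)                 # undo the window weighting
```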