Introduction to praat for speech segmentation (with an introduction to acoustic phonetics)

First, you should have downloaded praat from http://www.fon.hum.uva.nl/praat. The program is small enough to put on a flash drive in case you have to work in a computer lab or somewhere you can't install it permanently. Then you can just copy it to the desktop and go to town.

Some terms you will need to know:

Objects window – this is where files that you have opened in praat appear, including sound files and TextGrids.

Sound – this is what you get if you choose Read > Read from file… and select a wav file. You can also use the Read from file… command to open a TextGrid object.

TextGrid – a text file (this means it's really small) that you can use inside praat along with a sound file to demarcate boundaries and points and write text, such as a transcription, notes, or whatever.

LongSound – what shows up in the objects window if you choose Read > Open long sound file. This is the option you want if your wav file is longer than a couple of minutes. You get fewer options with a LongSound, but praat can crash if you try to open too big a file as a regular Sound object.

(There are many other objects, but these are the three most basic types of object, and enough to begin learning how to use praat.)

Editing a Sound and TextGrid

Select both the Sound (or LongSound) object and the TextGrid object, then click on the button to the right that says Edit. What is called an editor window will open, with the sound and TextGrid combined. (You can also open sounds by themselves or TextGrids by themselves by selecting just the one object and clicking Edit, but it is more useful to look at them both together.)

In the sound part of the editor window, there will be a waveform on top and a spectrogram on the bottom. If the visible window contains more than about a minute of sound, you have to zoom in before you can see the spectrogram. Below that is the TextGrid, which is made up of one or more tiers. A tier is one row in which you may mark intervals or points and write text. In the image below, the top two tiers are interval tiers, which contain text between the boundaries. The fourth tier from the top, marked events, is a point tier. In a point tier, any text is part of the point itself.

In order to zoom in to see a smaller section, highlight the section you are interested in, in the waveform or spectrogram part of the editor window, then click on the button in the lower left that says “sel”, which zooms in to the selection. After zooming, you can see the spectrogram, a clearer waveform, and the transcription in the first tier begins to make sense.

Quick physics lesson:

Sound is composed of three parts: transmitter/source (the thing making the sound), medium (the stuff that the sound wave is carried through, like air or water), and receiver (in the case of speech, your ears or recording equipment). So the technical answer to the old mantra, "if a tree falls in the woods and no living entity is around to hear it, does it still make a sound?" is no. Without a receiver to pick up the displacement of particles caused by the movement of the transmitter and to interpret it as a sound, the cycle is not complete, and the displaced particles are just displaced particles. This is also why there is no sound in space: there is no medium. If you scream into a vacuum, there are no displaced particles, thus no sound.

When a movement causes a displacement of particles, it creates an oscillation.
This means that particles are pushed away, but then they spring back again. Essentially what happens is that 1) the particles closest to the source are pushed away, but then 2) they bump into the particles nearest them, and then 3) the force causes those next particles to move in the direction of the first particles, while it also causes the first particles to bounce off and move back into, or near, their original resting position. This is why we speak of sound waves: this is a cyclical pattern.

(Key to the figure: the big star represents the source of the sound, or the original displacement; the arrows represent the direction of the force; and the circles represent air particles. The little stars in step 2 represent the crowding, or collision, of the first wave of particles bumping into the next, where kinetic energy is transferred.)

The main point is that sound travels in a wave, and there are two main parts of the wave: rarefaction and compression. Rarefaction is what happens when two waves of particles are pushed further apart, and compression is when two waves of particles are pushed closer together. A complete cycle is made up of one compression and one rarefaction.

Waveforms

A waveform is a two-dimensional representation of a sound, with time on the x-axis and amplitude on the y-axis. Amplitude is essentially a measure of how much the air particles are displaced by the force of the sound wave. The top part of the wave is called the peak and represents the point of maximal compression. The bottom part of the wave is called the trough and represents the point of maximal rarefaction.

In the sound wave below, which is a sine wave, the waveform is symmetric, meaning there is the same amount of displacement on either side of the baseline, so equal compression and rarefaction. It is periodic, which means it repeats at regular intervals, so each cycle takes more or less the same amount of time. It is also a simple wave, meaning only one frequency is represented. Frequency is a measure of how many cycles occur over a given period of time; it is usually measured in Hertz (Hz), the number of cycles per second.

The sounds of speech, however, never create a simple wave. Speech sounds are complex waves, meaning that multiple frequencies are represented in the sound wave. Because air particles bounce off of multiple places along our ever-changing vocal tracts, they are moving at many speeds. If you have ever played pool or billiards, you can think about how the many colored balls bounce around the table in different directions and at different speeds, depending on where and how fast they were struck by the cue ball or an adjoining ball, and on their starting positions. Some move slowly and cover a great distance, while others move quickly and bounce rapidly from rail to rail.

Well, in our vocal tract there are many sources of regular frequency differences, as well as some random noise caused by the turbulence of air moving through the passage. The main source of speech sounds is air being expelled from the lungs. After it leaves the lungs, it may pass through the larynx without much effort, or the vocal folds may rapidly open and close, creating the effect of voicing, which is the sound of the vocal folds vibrating. The frequency at which the vocal folds vibrate is called the fundamental frequency, or f0. In a voiced sound, you can generally see f0 as the largest repeating cycle of the waveform, as in the figure below. These cycles correspond to glottal pulses.

(Figure: waveform of a voiced sound, with f0 and one cycle labeled.)
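If you want to see the difference between a simple wave and a complex periodic wave for yourself, the following minimal sketch builds one of each. Python with numpy and matplotlib is assumed here only for the sake of the example (the handout itself doesn't require any scripting), and the 100 Hz fundamental and the harmonic amplitudes are arbitrary choices. In the plotted complex wave, the largest repeating cycle is the fundamental, just as f0 shows up in a voiced waveform.

    # A sketch: a simple wave (one frequency) vs. a complex periodic wave
    # (a fundamental plus weaker harmonics). numpy/matplotlib assumed available.
    import numpy as np
    import matplotlib.pyplot as plt

    fs = 44100                       # samples per second
    t = np.arange(0, 0.03, 1 / fs)   # 30 ms of time points
    f0 = 100                         # fundamental frequency in Hz

    simple = np.sin(2 * np.pi * f0 * t)                       # one frequency only
    complex_wave = (np.sin(2 * np.pi * f0 * t)                # f0
                    + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)    # 2nd harmonic, weaker
                    + 0.25 * np.sin(2 * np.pi * 3 * f0 * t))  # 3rd harmonic, weaker still

    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
    ax1.plot(t, simple)
    ax1.set_title("Simple wave: 100 Hz only")
    ax2.plot(t, complex_wave)
    ax2.set_title("Complex wave: 100 Hz fundamental plus two harmonics")
    ax2.set_xlabel("Time (s)")
    plt.tight_layout()
    plt.show()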
Other frequencies are captured by the sound wave as smaller bumps in the waveform. For example, if a sound were composed of only two frequencies, one at 50 Hz and one at 100 Hz, the lower frequency would be the larger repeating cycle, and the higher frequency would show up as two smaller peaks and troughs riding on each cycle of the larger wave.

Speech sounds have many frequencies, though. Among the most recognizable of these are the formants. Formants are created by the different tongue postures that are used to create different vowels. The first formant (F1) tells us how high or low the tongue posture is, as it lengthens or shortens the longest part of the vocal tract, which runs from the larynx up to the roof of your mouth. The second formant (F2), and its relationship to F1, tells us how front the articulation is: a fronter articulation has a greater difference between F1 and F2. Essentially, a higher F2 means a fronter vowel, as the front cavity is shortened. F3 tells us about rounding, that is, whether we extend the front-most part of the cavity by protruding our lips.

Aperiodic noise is also important in speech. Fricatives and stops have large amounts of aperiodic noise, as the articulators come closer together and force the air through a narrower channel. You can imagine this is like water getting turbulent as it passes through a narrow gorge, producing a rapid current and lots of whitewater. The air speeds up, just as the water does, and bounces off the articulators, giving off lots of high-frequency, aperiodic noise, as in the figure below. But you can also get voicing coupled with turbulence, or frication, as in a voiced fricative. In this case the waveform is periodic, with f0, but with aperiodic noise riding on each cycle, as below.

Spectrograms

A spectrogram is a three-dimensional representation of sound. On the x-axis is time, on the y-axis is frequency, and on the z-axis is amplitude. The z-axis is shown by the darkness of the bands of frequency: darker areas have greater amplitude, while lighter areas have less.

In order to really understand a spectrogram, you should know about sampling and spectra. Sampling is what has to happen in order to convert an analogue event into a digital recording. In this day and age we take digital signals for granted, on the telephone, in cameras and recordings, and in any kind of broadcast. The speech stream, as uttered by a human talker, is continuous. But in order to make a digital recording, we have to take snapshots of that continuous stream, just as we do with movies. The movies we watch are collections of thousands or millions of pictures, pasted together, that give us the illusion of motion. Digital recordings are snapshots of a sound wave, taken so close together that the result sounds continuous (or nearly continuous, for some lower-resolution recordings).

The upper limit of human hearing does not extend beyond 20,000 Hz. This means we cannot perceive sounds with a pitch higher than 20,000 cycles per second, and many of us have an upper limit that is a great deal lower. High-quality recordings are therefore made at a rate of 44,100 samples per second, slightly more than double the upper limit of human hearing. The reason for doubling is that you need at least two samples per cycle to capture a frequency faithfully; for frequencies above half the sampling rate, the software that compiles the samples into a wave will instead read them as lower frequencies. You can see in the figure below that there are periods of overlap between the two waveforms.
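You can also check this numerically. The sketch below (again assuming Python with numpy, and with the specific frequencies chosen only for illustration) samples a 7 Hz sine wave at 10 samples per second, well under two samples per cycle, and shows that the resulting samples are identical to the samples of a 3 Hz sine, so only the lower frequency could ever be reconstructed from them.

    # A sketch: undersampling a 7 Hz sine at 10 samples/second makes it
    # indistinguishable from a 3 Hz sine (numpy assumed available).
    import numpy as np

    fs = 10                    # samples per second
    n = np.arange(40)          # 4 seconds' worth of sample indices
    t = n / fs                 # the moments at which the snapshots are taken

    high = np.sin(2 * np.pi * 7 * t)    # the "real" 7 Hz wave, sampled too sparsely
    low = -np.sin(2 * np.pi * 3 * t)    # a 3 Hz wave (phase flipped), sampled the same way

    # The two sets of samples are identical, so nothing downstream can tell them apart.
    print(np.allclose(high, low))       # True

    # The spectrum of the sampled 7 Hz wave peaks at 3 Hz, not 7 Hz.
    spectrum = np.abs(np.fft.rfft(high))
    freqs = np.fft.rfftfreq(len(high), d=1 / fs)
    print(freqs[np.argmax(spectrum)])   # 3.0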
If we took snapshots of the higher-frequency waveform only at the moments where the lower-frequency waveform crosses the baseline, the software would have to interpolate what lay between those points. It would not suppose that there were extra baseline crossings in between; instead it would read the higher-frequency waveform as the lower frequency, which is called aliasing. So, a sample is a snapshot of a waveform, taken at regular intervals, and the sampling rate is how many samples are taken per second.

A spectrum is like a sample: it is a snapshot of the waveform. The useful thing about a spectrum is that it is averaged over some short period of time, so it can give us information about frequency and amplitude. A spectrum shows frequency along the x-axis and amplitude on the y-axis. The peaks in a spectrum show the frequencies where the most energy, or intensity, is. Frequencies with greater intensity correspond to f0, the formants, and other resonances. If you take a whole series of spectra, turn them 90 degrees counter-clockwise, and rotate them 90 degrees along the z-axis so that they are facing you, you get a spectrogram.

OK, now that you know what a waveform and a spectrogram are, you can get started in praat. Where were we? Oh, yes, zooming into the spectrogram and waveform – it's starting to make sense now, isn't it?

If you are merely documenting what is between two boundaries on an interval tier, you have it easy: just put your cursor in the interval you need to annotate, click, and type. Don't forget to save periodically.

If you are segmenting speech sounds, you should know how to add and delete interval boundaries and points. To add an interval boundary or a point, simply click in the spectrogram or waveform window where you would like to place your marker (make sure you are zoomed in enough to clearly find the exact place). You will then see a line extending down into the TextGrid. Click on the circle at the top of the tier you are marking, and voila, you have an interval boundary or point. If you need to remove a boundary or point (and it's too late to undo), click on the boundary or point, go to the Boundary menu, and choose Remove (or option+delete on a Mac).

Now, grasshopper, you are ready to begin an exciting adventure using signal analysis software.
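If you would like to convince yourself that a spectrogram really is just a stack of spectra stood on end, here is one more minimal sketch. Python with numpy and matplotlib is again an assumption of the example, and the synthetic signal (a 120 Hz fundamental with a few harmonics) is an arbitrary stand-in for real speech: each windowed chunk of the signal gives one spectrum, and the spectra, lined up as columns with darkness standing for amplitude, form the spectrogram.

    # A sketch: building a spectrogram as a stack of short-time spectra
    # (numpy/matplotlib assumed; the synthetic signal is only an illustration).
    import numpy as np
    import matplotlib.pyplot as plt

    fs = 44100
    t = np.arange(0, 1.0, 1 / fs)
    # A rough vowel-like signal: 120 Hz fundamental plus some stronger harmonics.
    signal = sum(a * np.sin(2 * np.pi * f * t)
                 for f, a in [(120, 1.0), (240, 0.6), (720, 0.8), (1200, 0.5)])

    win = 1024      # samples per analysis window (about 23 ms)
    hop = 512       # step between successive windows
    spectra = []
    for start in range(0, len(signal) - win, hop):
        frame = signal[start:start + win] * np.hanning(win)   # one windowed snapshot
        spectra.append(np.abs(np.fft.rfft(frame)))            # its spectrum: amplitude by frequency

    # Each column is one spectrum turned on its side; together they are the spectrogram.
    spectrogram = np.column_stack(spectra)
    plt.imshow(20 * np.log10(spectrogram + 1e-9), origin="lower", aspect="auto",
               extent=[0, len(signal) / fs, 0, fs / 2], cmap="Greys")
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.show()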
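Finally, once the point-and-click workflow makes sense, you may eventually want to script parts of it. The sketch below is one possible way to do that from Python using the praat-parselmouth package; the package itself, the file name recording.wav, the tier names, and the times are all assumptions of this example rather than anything the handout prescribes, and the GUI workflow above is all you actually need.

    # A sketch: creating and saving a TextGrid from a script via praat-parselmouth.
    # The file name, tier names, and times below are made up for illustration.
    import parselmouth
    from parselmouth.praat import call

    snd = parselmouth.Sound("recording.wav")     # hypothetical recording

    # Two interval tiers and one point tier, echoing the handout's example TextGrid.
    tg = call(snd, "To TextGrid", "words phones events", "events")

    # Add an interval boundary on tier 1 at 0.50 s and label the first interval
    # (this assumes the recording is at least half a second long).
    call(tg, "Insert boundary", 1, 0.50)
    call(tg, "Set interval text", 1, 1, "hello")

    # Add a labelled point on the point tier (tier 3).
    call(tg, "Insert point", 3, 0.25, "burst")

    # Save the annotation; a TextGrid really is just a small text file.
    call(tg, "Save as text file", "recording.TextGrid")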