Praat Intro

Introduction to praat for speech segmentation
(with an introduction to acoustic phonetics)
First, you should have downloaded praat from http://www.fon.hum.uva.nl/praat
The program is small enough to put on a small flash drive in case you have to work at a computer lab or
somewhere you can’t put it permanently. Then you can just copy it to the desktop and go to town.
Some terms you will need to know:
Objects window – this is where files that you
have opened in praat appear.
This includes sound files and TextGrids.
Sound – this is what you get if you choose
Read > Read from file… and select a wav file.
You can also use the Read from file…
command to open a TextGrid object. A
TextGrid is a text file (this means it’s really
small) that you can use inside praat along with
a sound file to demarcate boundaries and
points and write text, such as a transcription or
notes or whatever.
A LongSound is what shows up in the objects
window if you choose Read > Open long
sound file. This is the option you want if your
wav file is longer than a couple of minutes. You
get fewer options with a long sound, but it
crashes praat if you try to open too big a file as
a regular Sound object.
(There are many other objects, but these are the
three most basic types of object, and enough to
begin learning how to use praat.)
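If you are not sure whether a wav file is short enough to open as a regular Sound, you can check its length before opening it in praat. Here is a minimal sketch in Python, using only the standard wave module. The function names and the 120-second cut-off are assumptions made up for this illustration (the text above only says "a couple of minutes"); praat itself has no such setting.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the length of a wav file in seconds, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def suggested_praat_object(path, limit_seconds=120.0):
    """Suggest "Sound" or "LongSound". limit_seconds is an assumed cut-off."""
    if wav_duration_seconds(path) > limit_seconds:
        return "LongSound"
    return "Sound"


def write_silent_wav(path, seconds, sampling_rate=8000):
    """Hypothetical helper: write a silent mono 16-bit wav file to try this on."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16 bit
        wav.setframerate(sampling_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sampling_rate))


# Try it on a throwaway two-second file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_silent_wav(demo_path, seconds=2)
print(wav_duration_seconds(demo_path), suggested_praat_object(demo_path))
```

With your own recording, you would skip the helper and just call suggested_praat_object with the path to your file.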
Editing a Sound and TextGrid
Select both the Sound (or LongSound) object and the
TextGrid object, then click on the button to the right
that says Edit.
Then what is called an editor window will open,
with the sound and TextGrid combined. (You can
also open sounds by themselves or TextGrids by
themselves by selecting just the one object and
clicking Edit. But it is more useful to look at them
both together.)
In the sound part of the editor window, there will be
a waveform on top, and a spectrogram on the
bottom. If the window contains more than 1 minute,
you have to zoom in to be able to see the
spectrogram. On the bottom is the TextGrid, which
is made up of one or more tiers. A tier is one row
in which you may mark intervals or points and write
text. In the image below, the top two tiers are
interval tiers, which contain text between the
boundaries. The fourth tier from the top, marked
events, is a point tier. In a point tier, any text is a
part of the point itself.
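If it helps to think of tiers as data, here is a rough sketch in Python of the difference between the two kinds of tier. This is only an illustration: the class names are made up, it is not how praat stores a TextGrid on disk, and keeping the existing text on the left-hand piece when an interval is split is a choice made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Interval:
    start: float  # time of the left boundary, in seconds
    end: float    # time of the right boundary, in seconds
    text: str = ""  # the text sits between the two boundaries


@dataclass
class Point:
    time: float
    text: str = ""  # the text is part of the point itself


@dataclass
class IntervalTier:
    name: str
    intervals: list = field(default_factory=list)

    def add_boundary(self, time):
        """Split the interval that contains `time` into two intervals."""
        for i, iv in enumerate(self.intervals):
            if iv.start < time < iv.end:
                left = Interval(iv.start, time, iv.text)
                right = Interval(time, iv.end, "")
                self.intervals[i:i + 1] = [left, right]
                return
        raise ValueError("time is outside the tier or already a boundary")


@dataclass
class PointTier:
    name: str
    points: list = field(default_factory=list)

    def add_point(self, time, text=""):
        """Add a labelled point and keep the points in time order."""
        self.points.append(Point(time, text))
        self.points.sort(key=lambda p: p.time)


# "words" is a hypothetical tier name; "events" is the point tier named above.
words = IntervalTier("words", [Interval(0.0, 2.0)])
words.add_boundary(0.5)
words.intervals[0].text = "hello"

events = PointTier("events")
events.add_point(1.2, "click")
```

Notice that adding one boundary to an interval tier always leaves you with one more interval, while adding a point to a point tier does not affect anything else on the tier.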
In order to zoom in to see a smaller section, highlight the section you are interested in (in the waveform or
spectrogram part of the editor window), then click on the button in the lower left that says “sel”, which stands
for zoom to selection.
After zooming, you can see the spectrogram, a clearer waveform, and the transcription in the first tier begins
to make sense.
Quick physics lesson: Sound is composed of three parts: transmitter/source (the thing making the sound),
medium (the stuff that the sound wave is carried through, like air or water), and receiver (in the case of speech,
your ears or recording equipment). So the technical answer to the old riddle, “if a tree falls in the woods and
no living entity is around to hear it, does it still make a sound?” is no. Without a receiver to pick up the
displacement of particles caused by the movement of the transmitter and to interpret it as a sound, the cycle
is not complete, and the displaced particles are just displaced particles. This is also why there is no sound in
space: there is no medium. If you scream into a vacuum, there are no particles to displace, thus no
sound.
When a movement causes a displacement of particles, it creates an oscillation. This means that particles are
pushed away, but then they spring back again. Essentially what happens is that 1) the particles closest to the
source are pushed away, but then 2) they bump up into the particles nearest them, and then 3) the force
causes those next particles to move in the direction of the first particles, but it also causes the first particles to
bounce off and move back into, or near their original resting position. This is why we speak of sound waves,
because this is a cyclical pattern.
[Figure: three panels, 1), 2) and 3), one for each of the steps described above]
Key: the big star represents the source of the sound, or the original displacement. The arrows represent the
direction of the force, and the circles represent air particles. The little stars in (2) represent the crowding or
collision of the first wave of particles bumping into the next, where kinetic energy is transferred. The main
point is that sound travels in a wave, and there are two main parts of the wave, rarefaction and compression.
Rarefaction is what happens when two waves of particles are pushed further apart, and compression is
when two waves of particles are pushed closer together. A complete cycle is made up of one compression
and one rarefaction.
Waveforms
A waveform is a two-dimensional representation of a sound, with time on the x-axis, and amplitude on the
y-axis. Amplitude is essentially a measure of how much air particles are displaced by the force of the sound
wave. The top part of the wave is called the peak and represents the point of maximal compression. The
bottom part of the wave is called the trough, and represents the point of maximal rarefaction. In the sound
wave below, which is a sine wave, the waveform is symmetric, meaning there is the same amount of
displacement on either side of the baseline, so equal compression and rarefaction. It is periodic, which
means it repeats at regular intervals, about the same time apart, so each cycle has more or less equal periods.
It is also a simple wave, meaning only one frequency is represented. Frequency is a measure of how many
cycles occur over a given period of time. Frequency is usually measured in hertz (Hz), which is the number of
cycles per second.
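To make frequency and period concrete, here is a minimal sketch in Python that builds a simple (sine) wave and counts its cycles. The function names are made up for this illustration, and the sampling rate of 44,100 samples per second is just an assumed default.

```python
import math


def sine_wave(frequency_hz, duration_s, sampling_rate=44100, amplitude=1.0):
    """Return the samples of a simple wave with a single frequency."""
    n = int(round(duration_s * sampling_rate))
    return [amplitude * math.sin(2 * math.pi * frequency_hz * i / sampling_rate)
            for i in range(n)]


def period_seconds(frequency_hz):
    """The period is the time one cycle takes: 1 divided by the frequency."""
    return 1.0 / frequency_hz


def count_cycles(samples):
    """Count the cycles by counting upward crossings of the baseline."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev <= 0 < cur)


one_second = sine_wave(frequency_hz=100, duration_s=1.0)
print(count_cycles(one_second), "cycles in one second")
print(period_seconds(100), "seconds per cycle")
```

A 100 Hz wave goes through 100 cycles in one second, so each cycle lasts one hundredth of a second.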
The sounds of speech, however, never create a simple wave. Speech sounds are complex waves, meaning
that multiple frequencies are represented in the sound wave. Because air particles bounce off of multiple
places along our ever-changing vocal tracts, they are moving at many speeds. If you have ever played pool or
billiards, you can think about how the many colored balls bounce around the table in different directions and
at different speeds depending on where and how fast they were struck by the cue ball, or an adjoining ball,
and their starting position. Some move slowly and cover a great distance, while some move quickly and
bounce rapidly from rail to rail. Well, in our vocal tract, there are many sources of regular frequency
differences, as well as some random noise caused by the turbulence of air moving through the passage.
The main source of speech sounds generally comes from air being expelled from the lungs. After it leaves
the lungs, it may pass through the larynx without much effort, or the vocal folds may rapidly open and close,
creating the effect of voicing, which is the sound of the vocal folds vibrating. The frequency at which the
vocal folds vibrate is called the fundamental frequency, or f0. In a voiced sound, you can see f0 as
generally the largest cycle of the waveform, as in the figure below. These cycles correspond with glottal
pulses. Other frequencies are captured by the sound wave as smaller bumps in the waveform. For example, if
a sound were composed of only two frequencies, one at 50 Hz and one at 100 Hz, the lower frequency would
be the larger repeating cycle, and the higher frequency would add two peaks and two troughs within each
cycle of the larger wave. Speech sounds have many frequencies, though. The most recognizable of these are
the formants, which are resonances of the vocal tract. Formants are shaped by the different tongue postures
that are used to create different vowels. The first formant (F1) tells us how high or low the tongue posture is,
as it extends or shortens the longest part of the vocal tract, which is your throat from the larynx to the roof
of your mouth. The second
formant (F2), and its relationship to F1, tell us how front the articulation is. A fronter articulation will have
greater difference between F1 and F2. Essentially, a higher F2 means a fronter vowel, as the front cavity is
shortened. F3 tells us about rounding, that is whether we extend the front-most part of the cavity by
protruding our lips.
[Figure: waveform of a voiced sound, with one cycle of f0 marked]
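The two-frequency example above (50 Hz plus 100 Hz) can be sketched in a few lines of Python. The sampling rate and the two amplitudes are assumptions chosen for this illustration. The point is that the combined wave repeats once per cycle of the lower frequency, with the higher frequency riding on top of it.

```python
import math

SAMPLING_RATE = 8000  # assumed for this sketch


def two_tone(low_hz=50, high_hz=100, low_amp=1.0, high_amp=0.5, duration_s=0.1):
    """Add two simple waves together, sample by sample, to get a complex wave."""
    n = int(round(duration_s * SAMPLING_RATE))
    return [low_amp * math.sin(2 * math.pi * low_hz * i / SAMPLING_RATE)
            + high_amp * math.sin(2 * math.pi * high_hz * i / SAMPLING_RATE)
            for i in range(n)]


def repeats_every(samples, lag):
    """True if every sample equals the sample `lag` positions later."""
    return all(abs(samples[i] - samples[i + lag]) < 1e-9
               for i in range(len(samples) - lag))


signal = two_tone()
low_cycle = SAMPLING_RATE // 50    # 160 samples in one cycle of 50 Hz
high_cycle = SAMPLING_RATE // 100  # 80 samples in one cycle of 100 Hz

print(repeats_every(signal, low_cycle))   # the larger repeating cycle
print(repeats_every(signal, high_cycle))  # the faster cycle alone does not repeat the whole wave
```

Two cycles of the 100 Hz component fit inside each cycle of the 50 Hz component, which is why the larger cycle is the one you see repeating in the waveform.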
Aperiodic noise is also important in speech. Fricatives and stops have large amounts of aperiodic noise, as
the articulators come closer together and force the air through a narrower channel. You can imagine this is
like when water gets turbulent going through a narrow gorge and gives a rapid current and lots of
whitewater. The air speeds up, as does the water, and bounces off the articulators, giving off lots of high
frequency, aperiodic noise, as in the figure below.
But you can get voicing coupled with turbulence or frication, as in a voiced fricative. In this case, the
waveform is periodic, with f0, but with aperiodic noise in each cycle, as below.
Spectrograms
A spectrogram is a 3-dimensional representation of sound. On the x-axis is time. The y-axis in this case is
frequency, and the z-axis is amplitude. The z-axis is essentially shown by the darkness of the bands of
frequency. Darker areas have greater amplitude, while lighter areas have less amplitude. In order to really
understand a spectrogram, you should know about sampling and spectra. Sampling is what has to happen in
order to convert an analogue event into a digital recording. In this day and age, we tend to take digital
signals for granted, on the telephone, cameras, recordings, and any kind of broadcast. The speech stream, as
uttered by a human talker, is continuous. But in order to make a digital recording, we have to take snapshots
of that continuous stream, just as we do with movies. The movies that we watch are collections of thousands
or millions of pictures, pasted together, that give us the illusion of motion. Digital recordings are snapshots
of a soundwave, taken so close together that it sounds continuous (or nearly continuous for some lower
resolution recordings). The upper limit of human hearing does not extend beyond about 20,000 Hz. This means
we cannot perceive sounds that have a higher pitch than 20,000 cycles per second, and many of us have an upper
limit that is a great deal lower. So, high quality recordings are taken at a rate of 44,100 samples per second.
This is slightly more than double the upper limit of human hearing. The reason for doubling it is that a recording
can only represent frequencies up to half its sampling rate (a limit called the Nyquist frequency), because it
takes at least two samples per cycle, one for the peak and one for the trough, to register that a cycle happened
at all. The problem is that a frequency above that limit does not simply disappear: the software that compiles
the samples will mistake it for a lower frequency. You can see in the figure below that there are
periods of overlap for the two waveforms. If we took snapshots of the higher frequency waveform at a rate
that would just capture the baseline crossings of the lower frequency waveform, the software would have to
interpolate what was between those points. And it would not suppose that there were extra baseline crossings
between those points, but rather draw the lower frequency waveform through them, which in this example
reduces the frequency of the higher frequency waveform by half. This is called aliasing.
So, a sample is a snapshot of a waveform, taken at regular intervals, and the sampling rate is how many
samples are taken per second.
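Here is a minimal sketch of aliasing in Python. The numbers are assumptions chosen so that the arithmetic is easy to follow: with a sampling rate of 6,000 samples per second, nothing above 3,000 Hz can be represented, and a 4,000 Hz wave produces exactly the same samples as a 2,000 Hz wave, which is half its frequency. The function names are made up for this illustration, and the folding rule in alias_frequency is the standard textbook one rather than anything specific to praat.

```python
import math


def sample_cosine(frequency_hz, sampling_rate, n_samples):
    """Take n_samples snapshots of a cosine wave at regular intervals."""
    return [math.cos(2 * math.pi * frequency_hz * i / sampling_rate)
            for i in range(n_samples)]


def alias_frequency(frequency_hz, sampling_rate):
    """Frequency that a sampled wave appears to have in the recording.

    Anything above half the sampling rate is folded back below it.
    """
    folded = frequency_hz % sampling_rate
    if folded > sampling_rate / 2:
        return sampling_rate - folded
    return folded


SAMPLING_RATE = 6000  # assumed; the limit is therefore 3000 Hz

high = sample_cosine(4000, SAMPLING_RATE, 60)  # above the limit
low = sample_cosine(2000, SAMPLING_RATE, 60)   # below the limit

# The two sets of snapshots are identical, so the recording cannot tell them apart.
same_samples = all(abs(a - b) < 1e-9 for a, b in zip(high, low))
print(same_samples, alias_frequency(4000, SAMPLING_RATE))
```

The halving is particular to this example. At 44,100 samples per second, a 30,000 Hz wave would come out as 14,100 Hz, which is what alias_frequency(30000, 44100) returns.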
A spectrum is like a sample. It is a snapshot of the waveform. The cool thing about a spectrum is that it is
averaged over some short period of time, to be able to give us information about frequency and amplitude. A
spectrum shows frequency along the x-axis, and amplitude on the y-axis. The peaks in a spectrum show at
what frequency the most energy or intensity is. Frequencies with greater intensity represent f0, formants and
resonances. If you take a whole bunch of spectra and turn them 90 degrees counter-clockwise and rotate
them 90 degrees along the z-axis, so that they are facing you, you will get a spectrogram.
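If you would like to see where the peaks in a spectrum come from, here is a small sketch in Python that computes a spectrum for a stretch of samples using a plain discrete Fourier transform. This is not how praat does it internally (praat's own analysis settings are not reproduced here); the function name, the sampling rate and the test signal are all assumptions made for this illustration.

```python
import cmath
import math


def amplitude_spectrum(samples, sampling_rate):
    """Return (frequency in Hz, amplitude) pairs for one stretch of samples."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        # Every frequency except 0 Hz and the top one is shared between
        # two halves of the transform, so its amplitude is doubled here.
        scale = 1 if (k == 0 or 2 * k == n) else 2
        spectrum.append((k * sampling_rate / n, scale * abs(total) / n))
    return spectrum


SAMPLING_RATE = 1000  # assumed for this sketch
# 0.2 seconds of a complex wave: 50 Hz at amplitude 1.0 plus 100 Hz at amplitude 0.5
signal = [1.0 * math.sin(2 * math.pi * 50 * i / SAMPLING_RATE)
          + 0.5 * math.sin(2 * math.pi * 100 * i / SAMPLING_RATE)
          for i in range(200)]

spectrum = amplitude_spectrum(signal, SAMPLING_RATE)
peaks = [(freq, round(amp, 3)) for freq, amp in spectrum if amp > 0.1]
print(peaks)
```

The only two frequencies with any real energy are the two that went into the signal, and the height of each peak matches the amplitude of that component. A spectrogram is a long series of spectra like this one, each computed over a short stretch of the recording.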
OK, now that you know what a waveform and spectrogram are, you can get started in praat. Where were we?
Oh, yes, zooming into the spectrogram and waveform – it’s starting to make sense now, isn’t it?
If you are merely documenting what is between two boundaries on an interval tier, you have it easy. Just put
your cursor in the interval you need to annotate and click and type. Don’t forget to save periodically.
If you are segmenting speech sounds, you should know how to add and delete interval boundaries and points.
To add an interval boundary or a point, simply click in the spectrogram or waveform window where you
would like to place your marker (make sure you are zoomed in enough to clearly find the exact place). Then
you will see a line extending down into the TextGrid. Click on the circle at the top of the tier you are
marking, and voila, you have an interval boundary or point. If you need to remove a boundary or point (and
it’s too late to undo), you can click on the boundary or point and go to the Boundary menu and choose
remove (or option+delete on a Mac).
Now, grasshopper, you are ready to begin an exciting adventure using signal analysis software.