
CS 188: Artificial Intelligence
Fall 2006
Lecture 21: Speech / Viterbi
11/09/2006
Dan Klein – UC Berkeley
Announcements
• Optional midterm
  • On Tuesday 11/21 in class
  • Review session 11/19, 7-9pm, in 306 Soda
• Projects
  • 3.2 due 11/9
  • 3.3 due 11/15
  • 3.4 due 11/27
• Contest
  • Pacman contest details on web site this week
  • Entries due 12/3
Hidden Markov Models
• Hidden Markov models (HMMs)
  • Underlying Markov chain over states X
  • You observe outputs (effects) E at each time step
  • As a Bayes’ net:
[Bayes’ net figure: a chain of hidden states X1 → X2 → X3 → X4 → X5, with each Xi emitting an observation Ei]
• Several questions you can answer for HMMs:
  • Last time: filtering to track the belief about the current X given the evidence (a small sketch of the update follows below)
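As a reminder of that filtering computation, here is a minimal sketch of the exact forward update (elapse time through the transition model, then weight by the evidence and renormalize). The two-state model and all of its numbers are made up for illustration; they are not from the slides.

```python
# Hedged sketch of exact HMM filtering (the forward update from last lecture).
states = ["rain", "sun"]
trans = {"rain": {"rain": 0.7, "sun": 0.3}, "sun": {"rain": 0.3, "sun": 0.7}}            # P(X'|X)
emit = {"rain": {"umbrella": 0.9, "none": 0.1}, "sun": {"umbrella": 0.2, "none": 0.8}}    # P(E|X)

def filter_step(belief, evidence):
    """One time step: elapse time through P(X'|X), then weight by P(e|X') and renormalize."""
    predicted = {x2: sum(belief[x1] * trans[x1][x2] for x1 in states) for x2 in states}
    weighted = {x2: predicted[x2] * emit[x2][evidence] for x2 in states}
    total = sum(weighted.values())
    return {x2: weighted[x2] / total for x2 in states}

belief = {"rain": 0.5, "sun": 0.5}          # uniform prior belief
for e in ["umbrella", "umbrella", "none"]:  # an example evidence sequence
    belief = filter_step(belief, e)
    print(belief)
```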
Speech Recognition
• [demos]
Digitizing Speech
Speech in an Hour
• Speech input is an acoustic wave form
[Waveform figure: the utterance “speech lab” segmented into the phones s, p, ee, ch, l, a, b, with the “l” to “a” transition shown in close-up]
Graphs from Simon Arnfield’s web tutorial on speech, Sheffield:
http://www.psyc.leeds.ac.uk/research/cogn/speech/tutorial/
She just had a baby
• What can we learn from a wavefile?
  • Vowels are voiced, long, loud
  • Length in time = length in space in the waveform picture
  • Voicing: regular peaks in amplitude
  • When stops are closed: no peaks: silence
  • Peaks = voicing: .46 to .58 (vowel [i]), .65 to .74 (the next vowel), and so on
  • Silence of stop closure (1.06 to 1.08 for the first [b], 1.26 to 1.28 for the second [b])
  • Fricatives like [ʃ] show an intense irregular pattern; see .33 to .46
Spectral Analysis
• Frequency gives pitch; amplitude gives volume
  • Sampling at ~8 kHz for phone, ~16 kHz for mic (kHz = 1000 cycles/sec)
[Waveform figure: “speech lab” again, with amplitude on the vertical axis]
• Fourier transform of the wave displayed as a spectrogram
  • Darkness indicates energy at each frequency
[Spectrogram figure: frequency on the vertical axis]
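To make that concrete, here is a small sketch of turning a waveform into a spectrogram with NumPy’s FFT: short overlapping frames, a window, and the magnitude of each frame’s transform. The framing parameters (25 ms frames every 10 ms) are standard choices, not taken from the slides.

```python
import numpy as np

def spectrogram(wave, sample_rate=16000, frame_len=400, hop=160):
    """Short-time Fourier magnitudes: one column of frequency energies per 25 ms frame.

    A sketch, not the exact pipeline from lecture: 400 samples at 16 kHz is 25 ms,
    hopped every 10 ms, with a Hamming window applied before each FFT.
    """
    window = np.hamming(frame_len)
    frames = [wave[i:i + frame_len] * window
              for i in range(0, len(wave) - frame_len, hop)]
    # rfft keeps the non-negative frequencies; the magnitude (or its log) is what
    # gets plotted, with darker pixels meaning more energy at that frequency.
    return np.array([np.abs(np.fft.rfft(f)) for f in frames]).T  # (freq_bins, num_frames)

# Usage on a synthetic one-second 440 Hz tone:
t = np.arange(16000) / 16000.0
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)
```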
Adding 100 Hz + 1000 Hz Waves
[Figure: the summed waveform over 0 to 0.05 s (amplitude roughly -1 to 1), and its spectrum with the two frequency components, 100 Hz and 1000 Hz, marked on the frequency axis]
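A quick sketch that reproduces the idea in the figure: sum a 100 Hz and a 1000 Hz sine wave over 0.05 s and check that the spectrum has its two largest components at those frequencies. The sample rate and amplitudes are illustrative.

```python
import numpy as np

# Sum a 100 Hz and a 1000 Hz sine wave over 0.05 s, then look at the spectrum.
sr = 8000
t = np.arange(0, 0.05, 1.0 / sr)
wave = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 1000 * t)

spectrum = np.abs(np.fft.rfft(wave))
freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)

# The two largest components should sit at (approximately) 100 Hz and 1000 Hz.
top = freqs[np.argsort(spectrum)[-2:]]
print(sorted(top))
```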
Back to Spectra
• The spectrum represents these frequency components
• Computed by the Fourier transform, an algorithm that separates out each frequency component of a wave
• The x-axis shows frequency, the y-axis shows magnitude (in decibels, a log measure of amplitude)
• Peaks at 930 Hz, 1860 Hz, and 3020 Hz
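As a reminder (this is the standard decibel conversion, not something specific to the slide’s figure), magnitude in decibels is a logarithm of the amplitude ratio:

\[ \text{magnitude in dB} = 20 \log_{10}\frac{A}{A_{\text{ref}}} \]

so two peaks that differ by 20 dB differ by a factor of 10 in amplitude.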
Vowel Formants
Resonances of the vocal tract
• The human vocal tract as an open tube
  [Figure: a tube closed at the glottal end and open at the lip end, length 17.5 cm]
• Air in a tube of a given length will tend to vibrate at the resonance frequency of the tube
• Constraint: the pressure differential should be maximal at the (closed) glottal end and minimal at the (open) lip end
Figure from W. Barry’s Speech Science slides
[Spectrum figure from Mark Liberman’s website]
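As a back-of-the-envelope check (standard acoustics, not stated on the slide): a tube closed at one end and open at the other resonates at odd quarter-wavelength frequencies. Taking the speed of sound c ≈ 350 m/s and L = 17.5 cm = 0.175 m:

\[ f_n = \frac{(2n-1)\,c}{4L}, \qquad f_1 = \frac{350}{4 \times 0.175} = 500\ \text{Hz}, \quad f_2 = 1500\ \text{Hz}, \quad f_3 = 2500\ \text{Hz} \]

These are roughly the first three formants of a neutral vowel, which is why vowel spectra show peaks near these frequencies.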
Why these Peaks?
• Articulatory facts:
  • Vocal cord vibrations create harmonics
  • The mouth is a selective amplifier
  • Depending on the shape of the mouth, some harmonics are amplified more than others
[Figure: the vowel [i] sung at successively higher pitch (panels 1-7)]
Figures from Ratree Wayland’s slides
How to read spectrograms
• bab: closure of the lips lowers all formants, so there is a rapid increase in all formants at the beginning of "bab"
• dad: the first formant increases, but F2 and F3 fall slightly
• gag: F2 and F3 come together: this is a characteristic of velars. Formant transitions take longer in velars than in alveolars or labials
From Ladefoged, “A Course in Phonetics”
Acoustic Feature Sequence
• Time slices of the spectrogram are translated into acoustic feature vectors (~39 real numbers per slice)
  … e12 e13 e14 e15 e16 …
• These are the observations; now we need the hidden states X (a feature-extraction sketch follows below)
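One common way to get those ~39 numbers per slice (standard practice, not stated on the slide) is 13 MFCCs per frame plus their first and second temporal derivatives. Here is a sketch using the librosa library; the file name is hypothetical.

```python
import numpy as np
import librosa  # assumption: using librosa for feature extraction

# Load a wavefile (the path is hypothetical) and resample to 16 kHz.
y, sr = librosa.load("she_just_had_a_baby.wav", sr=16000)

# 13 MFCCs per ~25 ms frame, hopped every 10 ms.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# Append first and second temporal derivatives: 13 + 13 + 13 = 39 dimensions.
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, delta, delta2])  # shape: (39, num_frames)

print(features.shape)
```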
State Space
• P(E|X) encodes which acoustic vectors are appropriate for each phoneme (each kind of sound)
• P(X'|X) encodes how sounds can be strung together
• We will have one state for each sound in each word
• From some state x, we can only:
  • Stay in the same state (e.g. speaking slowly)
  • Move to the next position in the word
  • At the end of the word, move to the start of the next word
• We build a little state graph for each word and chain them together to form our state space X (see the sketch after this slide)
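A toy sketch of that state space: states are (word, position-in-word) pairs, and the only allowed moves are the three listed above. The lexicon and probabilities below are made up for illustration.

```python
import random

# States are (word, phone_index) pairs; the lexicon and numbers are illustrative.
LEXICON = {"speech": ["s", "p", "iy", "ch"], "lab": ["l", "ae", "b"]}
P_STAY = 0.6      # stay in the same phone state (speaking slowly)
P_ADVANCE = 0.4   # move on to the next phone in the word

def next_state(state):
    """Sample a successor of (word, phone_index) under the toy transition model."""
    word, i = state
    if random.random() < P_STAY:
        return (word, i)                      # stay in the same state
    if i + 1 < len(LEXICON[word]):
        return (word, i + 1)                  # next position in the word
    next_word = random.choice(list(LEXICON))  # end of word: start of the next word
    return (next_word, 0)

# Example: simulate a short state trajectory starting at the beginning of "speech".
state = ("speech", 0)
for _ in range(10):
    print(state, LEXICON[state[0]][state[1]])
    state = next_state(state)
```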
HMMs for Speech
ASR Lexicon: Markov Models
Markov Process with Bigrams
Figure from Huang et al., page 618
Decoding
• While there are some practical issues, finding the words given the acoustics is an HMM inference problem
• We want to know which state sequence x1:T is most likely given the evidence e1:T:
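The equation on the slide did not survive extraction; the standard form of this query is:

\[ x_{1:T}^{*} \;=\; \arg\max_{x_{1:T}} P(x_{1:T} \mid e_{1:T}) \;=\; \arg\max_{x_{1:T}} P(x_{1:T}, e_{1:T}) \]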
Viterbi Algorithm
• Question: what is the most likely state sequence given the observations?
• Slow answer: enumerate all possibilities
• Better answer: a cached, incremental version (the Viterbi algorithm; a sketch follows below)
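A minimal sketch of that cached, incremental computation, done in log space to avoid underflow. It assumes dictionary-valued start, transition, and emission tables; the toy numbers in the usage example are made up.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state sequence for an HMM (a generic sketch, not the lecture's code)."""
    # m[x] = log-probability of the best path ending in state x; back[t][x] = its predecessor.
    m = {x: math.log(start_p[x]) + math.log(emit_p[x][observations[0]]) for x in states}
    back = []
    for obs in observations[1:]:
        prev_m, scores = m, {}
        back.append({})
        for x in states:
            # Best predecessor x_prev maximizing m[x_prev] + log P(x | x_prev).
            best_prev = max(states, key=lambda xp: prev_m[xp] + math.log(trans_p[xp][x]))
            back[-1][x] = best_prev
            scores[x] = (prev_m[best_prev] + math.log(trans_p[best_prev][x])
                         + math.log(emit_p[x][obs]))
        m = scores
    # Follow back-pointers from the best final state.
    last = max(states, key=lambda x: m[x])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Tiny usage example with made-up numbers (two phone-like states, two observation symbols):
states = ["s1", "s2"]
print(viterbi(["a", "b", "b"], states,
              start_p={"s1": 0.8, "s2": 0.2},
              trans_p={"s1": {"s1": 0.6, "s2": 0.4}, "s2": {"s1": 0.3, "s2": 0.7}},
              emit_p={"s1": {"a": 0.7, "b": 0.3}, "s2": {"a": 0.1, "b": 0.9}}))
```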
Viterbi with 2 Words + Unif. LM
Figure from Huang et al., page 612
Next Class
• Final part of the course: machine learning
• We’ll start talking about how to learn model parameters (like probabilities) from data
• One of the most heavily used technologies in all of AI