Arthur Kunkle
ECE 5526
HW #4
Problem 1
Each HMM was used to generate and visualize a sample sequence, X. These are the outputs from each HMM.
[Figures: generated sample sequences for HMM1 through HMM6]
Questions:
1. The following characterize a correct transition matrix (see the sketch after this question list):
a. It is square, with one row and one column per state (including the non-emitting initial and final states)
b. The first column is all 0 (the initial state cannot be transitioned into)
c. The last row is all 0 except the final entry, which is 1 (the final state is absorbing)
d. Every row sums to 1, since each state's outgoing transition probabilities form a distribution (columns need not sum to 1)
2. The transition matrix affects the “duration” of emissions within particular classes or groups of classes. In the
above output visualizations, especially for HMMs 4-6, the sample chains tend to occur in clumps of the same class.
3. Without a final state, the observation sequence length would be unbounded.
4. A single HMM is specified by:
a. A D-dimensional mean vector for each state
b. A DxD covariance matrix for each state
c. An NxN transition matrix covering all transitions
d. An N-dimensional initial state probability vector
Total parameters: N*D + N*D^2 + N^2 + N (also counted in the sketch below)
5. A word would use a left-right model. The sequence of phones is fixed, and self-transitions allow the same
phone to repeat, which models longer utterances of that phone; both properties are supported by this model type.
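To make questions 1 and 4 concrete, here is a small sketch in Python (the matrix values and the sizes N and D are my own illustrative choices, not taken from the assignment's HMMs) that builds a left-right transition matrix, checks properties a-d, and counts the parameters:

    import numpy as np

    # Hypothetical left-right transition matrix with a non-emitting initial
    # state (index 0) and an absorbing final state (index 4).
    A = np.array([
        [0.0, 1.0, 0.0, 0.0, 0.0],   # initial state: must enter state 1
        [0.0, 0.6, 0.4, 0.0, 0.0],   # state 1: self-loop or advance
        [0.0, 0.0, 0.7, 0.3, 0.0],   # state 2: self-loop or advance
        [0.0, 0.0, 0.0, 0.5, 0.5],   # state 3: self-loop or exit
        [0.0, 0.0, 0.0, 0.0, 1.0],   # final state: absorbing
    ])

    assert A.shape[0] == A.shape[1]                        # (a) square: one row/column per state
    assert np.all(A[:, 0] == 0.0)                          # (b) nothing transitions into the initial state
    assert np.all(A[-1, :-1] == 0.0) and A[-1, -1] == 1.0  # (c) last row is 0 except the final entry
    assert np.allclose(A.sum(axis=1), 1.0)                 # (d) every row sums to 1

    # Parameter count from question 4, for N emitting states and
    # D-dimensional Gaussian emissions with full covariance.
    N, D = 3, 2                                            # illustrative sizes
    total = N * D + N * D**2 + N**2 + N                    # means + covariances + transitions + initial probs
    print(total)                                           # 3*2 + 3*4 + 3^2 + 3 = 30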
More Questions:
1. log(a + b) = log(a) + log(1 + e^(log(b) - log(a))), verified by expanding the right-hand side:
log(a) + log(1 + e^(log(b) - log(a)))
= log(a * (1 + e^(log(b) - log(a))))
= log(a + a * e^(log(b/a)))
= log(a + a * (b/a))
= log(a + b)
If log(a) > log(b), the second implementation (the one whose exponent is log(b) - log(a)) is better suited. The
exponent is then negative, so the exponential stays in (0, 1] and cannot overflow, and the argument of the outer
log stays well away from zero, where ln(x) becomes very steep and numerically sensitive.
[Figure: plot of ln(x)]
2. log(alpha_t(j)) = log(b_j(x_t)) + log(sum_i(alpha_t-1(i) * a_ij))
= log(b_j(x_t)) + logadd_i(log(alpha_t-1(i) * a_ij))
= log(b_j(x_t)) + logadd_i(log(alpha_t-1(i)) + log(a_ij))
where logadd_i denotes the log-addition from question 1 applied across i (the log of a sum cannot simply be
distributed as a sum of logs).
The biggest performance gain from this conversion is the ability to perform repeated additions instead of
multiplications (plus one log-add per sum). Because the number of state transitions can be very large for some
HMMs, this is a critical gain; working in the log domain also avoids numerical underflow from multiplying many
small probabilities (sketched below).
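A minimal sketch of both answers (function and variable names are mine, not from the course toolkit): a numerically safe log-add that factors out the larger term, and a log-domain forward recursion built on it.

    import numpy as np

    def log_add(log_a, log_b):
        """log(a + b) computed from log(a) and log(b), factoring out the larger term."""
        if log_a < log_b:                    # make log_a the larger term
            log_a, log_b = log_b, log_a
        if log_b == -np.inf:                 # adding a zero probability
            return log_a
        return log_a + np.log1p(np.exp(log_b - log_a))

    def log_forward(log_pi, log_A, log_B):
        """Log-domain forward algorithm; returns log P(X | model).

        log_pi : (N,)   log initial-state probabilities
        log_A  : (N, N) log transition probabilities, log_A[i, j] = log a_ij
        log_B  : (T, N) log emission likelihoods, log_B[t, j] = log b_j(x_t)
        """
        T, N = log_B.shape
        log_alpha = log_pi + log_B[0]        # initialization
        for t in range(1, T):
            new = np.empty(N)
            for j in range(N):
                acc = -np.inf                # log sum_i alpha_{t-1}(i) * a_ij, accumulated with log_add
                for i in range(N):
                    acc = log_add(acc, log_alpha[i] + log_A[i, j])
                new[j] = acc + log_B[t, j]
            log_alpha = new
        total = -np.inf                      # termination: log-add over the final alphas
        for j in range(N):
            total = log_add(total, log_alpha[j])
        return total

A vectorized version could use scipy.special.logsumexp over each column instead of the explicit log_add loop; the loop form is kept here because it mirrors the recursion written above.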
Problem 2
1. Bayesian classification is based upon calculating the posterior (a posteriori) probability. In order to use it (as
was done in Problem 6 of HW #3), the prior probabilities of all possible classes (here, models or state
sequences) need to be known. Previously, the probabilities of the classes q_k were given or calculated ahead of
time. In the case of the forward algorithm, we are generating the joint probability of each observation sequence
under each model. The most significant assumption that must be made for Bayesian classification is the
independence of the feature vectors. Also, the general form of the PDF of the distribution of classes (or state
sequences) must be known.
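A small sketch of the resulting decision rule, assuming the per-model log-likelihoods come from a forward routine like the one sketched above and the class priors are known (the function name is mine; the example numbers are the X1 row of the Problem 3 table, scaled back by 10^3, with equal priors):

    import numpy as np

    def classify(log_likelihoods, log_priors):
        """Return the index k maximizing log P(X | model_k) + log P(model_k).

        The evidence P(X) is common to all k and can be ignored.
        """
        return int(np.argmax(log_likelihoods + log_priors))

    # X1 forward log-likelihoods for HMM1-HMM6 (values from Problem 3, x 10^3)
    scores = np.array([-559.4, -621.2, -1439.8, -1421.1, -1294.5, -993.2])
    priors = np.log(np.full(6, 1.0 / 6.0))   # equal priors: Bayes rule reduces to maximum likelihood
    print(classify(scores, priors) + 1)      # -> 1, i.e. HMM1, matching Problem 3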
Problem 3
The plotted sample points for sequences X1-X6:
Log probabilities obtained using the logfwd routine. The most likely model for each sequence was chosen as the
one with the greatest log-probability. The values shown are divided by 10^3.
Sequence   logP(X|1)   logP(X|2)   logP(X|3)   logP(X|4)   logP(X|5)   logP(X|6)   Most Likely Model
X1         -0.5594     -0.6212     -1.4398     -1.4211     -1.2945     -0.9932     1
X2         -0.1160     -0.1175     -0.1114     -0.1145     -0.2462     -0.1402     3
X3         -0.8265     -0.7878     -1.1802     -1.1543     -0.7410     -1.3291     5
X4         -0.8789     -0.8233     -0.8551     -0.8203     -1.1583     -1.0779     4
X5         -0.7765     -0.7609     -0.8116     -0.7889     -0.9978     -0.6035     6
X6         -1.3968     -1.3228     -1.9053     -1.8520     -2.1105     -2.0234     2
Problem 4
1. In the previously discussed forward algorithm, the alpha vector is updated by multiplying the existing alpha
entries by the transition probability to each state and summing. Obviously, if an ergodic model with a large
number of states is being used, this operation is very expensive and must be performed at every time step t.
Performing the log equivalent of this operation reduces it to less expensive additions, but the number of
computations needed is still very large because every sum requires a log-add. With the Viterbi best-path
approximation, each delta entry is updated using only the maximum over incoming transitions, i.e. the single
best predecessor; the sum over all predecessors is replaced by a max, so only one multiplication (or addition in
log form) contributes per state at each recursion step.
2. log(delta_t+1(j)) = log(max_i(delta_t(i) * a_ij)) + log(b_j(x_t+1))
= max_i(log(delta_t(i)) + log(a_ij)) + log(b_j(x_t+1))
-- using log(a*b) = log(a) + log(b), and the fact that log is monotonic, so the log of a max is the max of the
logs; see the log-Viterbi sketch at the end of this problem.
3. The most likely model found earlier for each sequence was used to find the best path for X1-X6, which was
then plotted:
X1: HMM1
X2: HMM3
X3: HMM5
X4: HMM4
X5: HMM6
X6: HMM2
As shown by the yellow outlines in the plots, there are very few differences between the original and Viterbi
alignments. To show the opposite case, the following shows the alignment differences for an HMM with a lower
probability. Notice the significant differences:
X6: HMM6
4. Log-likelihoods along the best path using the Viterbi algorithm (values again divided by 10^3). Notice that the
most likely model chosen is the same for the forward algorithm and for Viterbi.
Sequence   logP(X|1)   logP(X|2)   logP(X|3)   logP(X|4)   logP(X|5)   logP(X|6)   Most Likely Model
X1         -0.5594     -0.6222     -1.4398     -1.4211     -1.2945     -0.9932     1
X2         -0.1160     -0.1175     -0.1114     -0.1145     -0.2462     -0.1402     3
X3         -0.8267     -0.7879     -1.3478     -1.3169     -0.7410     -1.3760     5
X4         -0.8789     -0.8233     -0.8551     -0.8203     -1.6207     -1.0779     4
X5         -0.7768     -0.7610     -0.8116     -0.7890     -0.9978     -0.6035     6
X6         -1.3968     -1.3228     -3.1193     -3.0588     -3.3987     -2.0438     2
Difference between the log-likelihoods along the best path and those from the forward algorithm. The largest
differences occur for X6 under HMM3-HMM5 and for X4 under HMM5.
Sequence   HMM1     HMM2     HMM3     HMM4     HMM5     HMM6
X1         0.0001   0.0009   0.0000   0.0000   0.0000   0.0000
X2         0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
X3         0.0002   0.0001   0.1676   0.1626   0.0000   0.0469
X4         0.0000   0.0000   0.0000   0.0000   0.4624   0.0000
X5         0.0003   0.0001   0.0000   0.0001   0.0000   0.0000
X6         0.0001   0.0000   1.2140   1.2068   1.2882   0.0204
5. Given the extremely low error between the “real” likelihood computed by the forward algorithm and the
likelihood along the best path given by the Viterbi algorithm, the best-path score is a very good approximation
of the real likelihood.
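A log-domain Viterbi sketch matching the recursion in part 2 (variable names are mine, using the same array conventions as the log_forward sketch earlier). The gap between log_forward(...) and the score returned here is the quantity tabulated in part 4.

    import numpy as np

    def log_viterbi(log_pi, log_A, log_B):
        """Best-path log score and state sequence in the log domain.

        log_pi : (N,)   log initial-state probabilities
        log_A  : (N, N) log transition probabilities, log_A[i, j] = log a_ij
        log_B  : (T, N) log emission likelihoods, log_B[t, j] = log b_j(x_t)
        """
        T, N = log_B.shape
        log_delta = log_pi + log_B[0]                      # initialization
        backptr = np.zeros((T, N), dtype=int)
        for t in range(1, T):
            scores = log_delta[:, None] + log_A            # scores[i, j] = log delta_t(i) + log a_ij
            backptr[t] = np.argmax(scores, axis=0)         # best predecessor of each state j
            log_delta = np.max(scores, axis=0) + log_B[t]  # a max replaces the forward algorithm's sum
        best_last = int(np.argmax(log_delta))              # termination
        path = np.empty(T, dtype=int)
        path[-1] = best_last
        for t in range(T - 1, 0, -1):                      # backtrace the best state sequence
            path[t - 1] = backptr[t, path[t]]
        return float(log_delta[best_last]), path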
Problem 5
1. A left-right type HMM would be the best choice for the word /aiy/. This is because the sequence of phones
that make up this word is fixed, and it does not make sense to revisit an earlier phone. For instance, if an
ergodic model were chosen, a sequence such as “ay-eh-ay-eh” could be modeled, which does not make sense for
the word /aiy/. The duration of each phone (ay and eh) is modeled by a transition to self. There are two
additional non-emitting start and end states:
Start -> ay -> eh -> End (with self-loops on ay and eh)
2. To generate the HMM parameters, we have to estimate the emission probabilities in each state as well as
the transition probabilities for each state (a brief training sketch follows question 3 below).
a. Because the phone labeling for each sample value in the data is already done, we can simply use the
existing data to estimate the mean and the covariance for the two classes (or more, depending on how
the word /aiy/ is broken down). This was the procedure used in Problem 2 of HW #3, and it
determines the B matrix for our HMM.
b. Choose an initial A matrix defining the probabilities of transitioning to each state. Three different
probabilities will be chosen (a01 and a23 are predefined to 1). We can then use Baum-Welch
Re-estimation with Multiple Observations (discussed in Chapter 5), which iteratively refines the
HMM parameters. After each iteration and recalculation of the parameters, compare against the
previous model to determine whether a critical point has been reached. If so, we have a
maximum-likelihood HMM. We can use all of the observations given in the training data for this
(or only some, if the maximum is reached early).
3. If the training data is not labeled prior to training, some type of model-generation method will have to be
applied first, as was the case in Problem 8 of HW #3. Any of the three methods, K-means, Viterbi-EM, or EM,
can be used on the training data. If we know the number of classes we are using (assumed here to be 2: ay and
eh), one of these methods will provide the Gaussian model parameters for each state. As above, we can then
choose initial state transition probabilities and apply the Baum-Welch Re-estimation with Multiple
Observations method to refine the parameters over all of the training data.
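A sketch of this training recipe, assuming the hmmlearn library's GaussianHMM as a stand-in for the course's own Baum-Welch routines (the data, the labels, and all variable names below are placeholders of mine, not from the assignment):

    import numpy as np
    from hmmlearn import hmm        # assumed toolkit; the assignment may use different routines

    # Placeholder training data: one (T_k, 2) feature array per utterance of /aiy/,
    # with frame labels 0 = "ay", 1 = "eh" (first half / second half of each utterance).
    rng = np.random.default_rng(0)
    utterances = [rng.standard_normal((rng.integers(20, 40), 2)) for _ in range(10)]
    labels = [np.concatenate([np.zeros(len(u) // 2, dtype=int),
                              np.ones(len(u) - len(u) // 2, dtype=int)]) for u in utterances]
    X = np.concatenate(utterances)
    frame_labels = np.concatenate(labels)
    lengths = [len(u) for u in utterances]

    # Two emitting states (ay, eh) in a left-right topology with self-loops.
    model = hmm.GaussianHMM(n_components=2, covariance_type="full",
                            n_iter=50, tol=1e-3, init_params="", params="stmc")
    model.startprob_ = np.array([1.0, 0.0])            # always enter at "ay"
    model.transmat_ = np.array([[0.9, 0.1],            # initial guess for the A matrix
                                [0.0, 1.0]])           # "eh" never returns to "ay"

    # Question 2a: seed the emission Gaussians from the labeled frames (as in HW #3, Problem 2).
    model.means_ = np.stack([X[frame_labels == k].mean(axis=0) for k in (0, 1)])
    model.covars_ = np.stack([np.cov(X[frame_labels == k].T) + 1e-6 * np.eye(2)
                              for k in (0, 1)])

    # Question 2b: Baum-Welch re-estimation over multiple observation sequences.
    model.fit(X, lengths)
    print(model.transmat_)

For the unlabeled case in question 3, the seeding of means_ and covars_ from labels would be dropped; instead the Gaussians could be initialized by K-means (or by letting init_params include "mc" so the library initializes them itself), after which the same fit call performs the re-estimation over all of the training data.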