Sample Midterm

CS534 — Midterm
Name:
1. (Perceptron)
a. [6pts] Define the hinge loss objective function that is optimized by the perceptron. What
is the reason for using the hinge loss instead of the 0/1 loss?
Hinge loss:
J(w) = \frac{1}{N} \sum_{i=1}^{N} \max(0, -y_i \, w \cdot x_i)
The 0/1 loss function is piecewise constant, so gradient descent optimization cannot be
applied to optimize the 0/1 loss.
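For concreteness, here is a minimal NumPy sketch of this objective; the names perceptron_hinge_loss, X, y, and w are illustrative choices, not part of the exam.

import numpy as np

def perceptron_hinge_loss(w, X, y):
    # X: (N, d) feature matrix, y: (N,) labels in {-1, +1}, w: (d,) weight vector
    margins = y * (X @ w)                        # y_i * (w . x_i) for each example
    return np.mean(np.maximum(0.0, -margins))    # average of max(0, -y_i w . x_i)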
b. [6pts] Consider two different weight vectors w and w′ that are consistent with the training
data, where w has a large margin and w′ has a very small margin. Which weight vector
will achieve a better hinge loss? Explain.
Both weight vectors will have a hinge loss of zero. This shows a potentially undesirable
aspect of the perceptron algorithm: it does not care about the margin of weight
vectors on examples that are correctly predicted. Thus, the perceptron algorithm
is just as happy with a weight vector that is “just barely” consistent with the data as with
a weight vector that has a large margin.
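As a toy numerical check of this point (assuming the perceptron_hinge_loss sketch above), two consistent separators, one with a large margin and one with a tiny margin, give the same zero loss:

# Two linearly separable points and two consistent separators
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
w_large = np.array([10.0, 0.0])    # large margin
w_small = np.array([0.01, 0.0])    # tiny margin
print(perceptron_hinge_loss(w_large, X, y))   # 0.0
print(perceptron_hinge_loss(w_small, X, y))   # 0.0 -- same perceptron hinge loss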
c. [6pts] When the training data is not linearly separable, we can apply the voted perceptron algorithm, where we store all intermediate linear separators (the w_i's) and their survival
time counts c_i's. How does the voted perceptron algorithm make the final prediction for
a test example x? In other words, provide the decision function used by the voted perceptron.
The decision function is:
h(x) = \mathrm{sign}\left( \sum_{i=1}^{N} c_i \cdot \mathrm{sign}(w_i \cdot x) \right)
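A minimal sketch of this decision rule; the argument names weights and counts for the stored w_i's and c_i's are assumed, not given in the exam.

import numpy as np

def voted_perceptron_predict(x, weights, counts):
    # weights: list of stored separators w_i, counts: their survival counts c_i
    vote = sum(c * np.sign(w @ x) for w, c in zip(weights, counts))
    return np.sign(vote)   # h(x) = sign( sum_i c_i * sign(w_i . x) )
                           # a tie (vote == 0) would need a tie-breaking convention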
5. (Naive Bayes)
a. [5pts] What is the potential problem with using a Naive Bayes classifier that was learned
without Laplace smoothing?
Without Laplace smoothing, we will encounter extreme probability estimates p(x|y) = 0,
leading to extreme values for p(y|x). In some cases, due to the rarity of the data, we
could have p(x|y = 0) = p(x|y = 1) = 0, which will wipe out the influence of all the other
features and drive p(y|x) to 0 for all possible y values.
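A small sketch of the zero-count issue and of a Laplace (add-alpha) smoothed estimate; the function name and the counts plugged in below are hypothetical.

def estimate_likelihood(count_xy, count_y, num_values, alpha=1.0):
    # Laplace-smoothed estimate of p(x_j = v | y):
    #   (count(x_j = v, y) + alpha) / (count(y) + alpha * num_values)
    return (count_xy + alpha) / (count_y + alpha * num_values)

# Unsmoothed (alpha = 0): a feature value never seen with class y gives p = 0,
# which zeroes out the whole product prod_j p(x_j | y) regardless of the other features.
print(estimate_likelihood(0, 50, 2, alpha=0.0))   # 0.0
print(estimate_likelihood(0, 50, 2, alpha=1.0))   # small but nonzero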
b. [5pts] What is the Naive Bayes assumption?
The features are conditionally independent given the class label.
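Written out for a feature vector x = (x_1, \ldots, x_d), this assumption says
p(x_1, \ldots, x_d \mid y) = \prod_{j=1}^{d} p(x_j \mid y),
i.e. the class-conditional distribution factorizes over the individual features.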
6. (Learning Theory)
a. [5pts] In class we showed that if a consistent learning algorithm for a finite hypothesis
space H is provided with
m \geq \frac{1}{\epsilon} \left( \ln |H| + \ln \frac{1}{\delta} \right)
randomly drawn training instances, then we can state a certain guarantee. What is that
guarantee? Make sure to clearly indicate the roles of ε and δ.
The guarantee is that, with probability at least 1 − δ over draws of training sets of size
m, any consistent hypothesis will have true error no more than ε.
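As an illustration, a small helper that evaluates this sample-size bound; the values of |H|, ε, and δ plugged in below are made up.

import math

def sample_complexity(H_size, eps, delta):
    # m >= (1/eps) * ( ln|H| + ln(1/delta) )
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

print(sample_complexity(H_size=4 * 20, eps=0.1, delta=0.05))   # e.g. |H| = 4n with n = 20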
b. [10pts] Consider the concept class C of decision stumps for an input space containing
n binary features. That is, each target concept is a decision tree that contains exactly
one binary test on one of the binary features. Prove that the concept class C is PAC-learnable. You may use the inequality from part (a) if you would like.
We can show that the consistent learner is a PAC-learning algorithm for decision stumps.
First, note that the number of decision stumps is 4n, i.e. one stump for each possible
root test and way of labeling the two leaves. Note that many of these stumps represent the
same underlying function (in particular, the stumps where both leaves carry the same
label). The above bound tells us that a training set of size
m \geq \frac{1}{\epsilon} \left( \ln(4n) + \ln \frac{1}{\delta} \right)
is sufficient to guarantee the accuracy and reliability requirements of PAC learnability.
We also argue that we can find a consistent decision stump, when one exists, in time
polynomial in the size of the training set, which we see from above need only be polynomially
large. The algorithm is trivial: simply test each of the 4n decision stumps to see if one
of them is consistent. The total running time is then
O(nm) = O\left( \frac{n}{\epsilon} \left[ \ln(4n) + \ln \frac{1}{\delta} \right] \right)
which meets the polynomial runtime requirement of PAC learnability.
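A sketch of that trivial learner, which enumerates the 4n stumps and returns a consistent one if it exists; the data layout and function name are assumptions.

import numpy as np

def learn_consistent_stump(X, y):
    # X: (m, n) binary feature matrix with entries in {0, 1}, y: (m,) labels in {0, 1}
    n = X.shape[1]
    # 4n stumps: choose a root feature j and a label for each of the two leaves
    for j in range(n):
        for label0 in (0, 1):
            for label1 in (0, 1):
                pred = np.where(X[:, j] == 0, label0, label1)
                if np.array_equal(pred, y):      # each consistency check costs O(m)
                    return j, label0, label1
    return None   # no consistent stump exists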