CIS 419/519 Introduction to Machine Learning
Fall 2015 Midterm
Instructions:
• Please turn off your cell phone.
• This examination is closed-book and closed-notes. Calculators and any other external resources are
prohibited.
• All answers must be recorded on the bubble sheet. Fill in the bubble of the correct answer for each
question. Be certain to fill in the bubble completely.
• To correct a mistake, be certain to erase the bubble completely.
• For true/false questions, fill in (A) for TRUE and fill in (B) for FALSE.
• For multiple choice questions, fill in the bubble(s) of the correct answer. Some multiple choice questions
may have multiple correct answers; in this case, you can fill in multiple bubbles for a single question.
• Only the bubble sheet will be counted as your official answers; any answers you write on the exam will
not be considered.
• Partial credit will not be given for individual questions, including questions with multiple correct
answers. (Grading will use zero-one loss. ;-) )
• If you find yourself spending too long on a problem, skip it and move on to the next one. Scrap paper
is available upon request. Be certain to write your name on all materials you turn in.
• If you do not understand a problem or require clarification, please ask the instructor.
• There are 100 points total. Students in CIS 419 will be graded out of 92 points; points earned beyond
92 will not count as extra credit. Students in CIS 519 will be graded out of 100 points. The instructor
reserves the right to adjust these denominators lower to curve the scores in your favor.
• I wish each of you the best of luck.
Sign this statement after you have completed the examination. Your exam will not be graded without
your signature.
I certify that my responses in this examination are solely the product of my own work and that I
have fully abided by the University of Pennsylvania Academic integrity policy while taking this
exam.
Signature:
Printed Name:
I  Preliminaries
There are multiple versions of this exam. Fill in the version number of the exam on your answer sheet:
• If you are enrolled in CIS 419, fill in the bubble for Version 2
• If you are enrolled in CIS 519, fill in the bubble for Version 4
Write your name on the bubble sheet, the first page of the exam, and on any scrap paper you use.
Be sure to bubble in your PennID on the answer sheet.
II  General Knowledge (12 pts total)
1.) (1 pts) Check that (1) your exam has all 12 pages, (2) you’ve written your name, and (3) you’ve written
and bubbled in your PennID on the answer sheet. Answer this question to let me know you did it:
what is your favorite course this semester?
(A) CIS 419/519
(B) CIS 419/519
(C) CIS 419/519
(D) CIS 419/519
2.) (1 pts) Is your answer to question 1 a lie?
(A) Yes
(B) No
3.) (2 pts) Which model will always yield a linear decision boundary? (Choose one or more)
(A) Polynomial regression
(B) Decision tree with depth 2
(C) k-Nearest Neighbor
(D) Logistic Regression
(E) None of the above
4.) (2 pts) Which value of C in an SVM has the most bias?
(A) C = 0
(B) C = 1
(C) C = 10
(D) C = 1000
(E) All have equal bias
The next two questions concern the following scenario: While attending a concert, you remember that ML
can be used to separate the singer’s voice from the instrumental track.
5.) (1 pts) What ML technique can do this? (Choose one)
(A) Logistic regression
(B) Support vector regression
(C) Independent component analysis
(D) Reinforcement learning
(E) Inverse reinforcement learning
6.) (1 pts) Is this algorithm supervised, semi-supervised, or unsupervised?
(A) Supervised
(B) Semi-supervised
(C) Unsupervised
(D) None of the above
Truth or LIES
For each question below, answer whether it is a True statement (A) or a bloody LIE (B)!
7.) (1 pts) Suppose that A1, ..., Ad are categorical input attributes. The maximum depth of the decision
tree must be less than d + 1.
8.) (1 pts) For a data set of n instances, the maximum depth of the decision tree must be less than n.
9.) (1 pts) The lower we are in a decision tree, the more likely we are to be modeling noise.
10.) (1 pts) Using the quadratic kernel in an SVM with very large numbers of training instances is much more
computationally efficient than precomputing the equivalent basis expansion of each training instance
and using a linear kernel.
III  Properties of Learning Algorithms (8 pts total)
For each machine learning algorithm listed below, choose the correct properties of that algorithm.
Does each learning algorithm below yield the globally optimal solution or a locally optimal solution?
(Assume that each algorithm does converge to a solution.)
11.) (1 pts) ID3 Decision Tree: (A) global optimum  (B) local optimum
12.) (1 pts) Batch perceptron: (A) global optimum  (B) local optimum
13.) (1 pts) Logistic regression: (A) global optimum  (B) local optimum
14.) (1 pts) SVM: (A) global optimum  (B) local optimum
What loss function does each algorithm use? Choose the best option from the following choices:
(A) zero-one loss
(B) sum of squared error
(C) log loss
(D) hinge loss
(E) exponential loss
15.) (1 pts) ID3 Decision Tree
16.) (1 pts) Batch perceptron
17.) (1 pts) Logistic regression
18.) (1 pts) SVM
IV  Computational Learning Theory (8 pts total)
19.) (2 pts) Which of the following is a correct expression for Expected Loss?
(A) expectedLoss = (bias)² + variance + noise
(B) expectedLoss = bias + (variance)² + noise
(C) expectedLoss = bias + variance + noise
(D) expectedLoss = (bias)² + (variance)² + noise
20.) (2 pts) True (A) or False (B): There is at least one set of 4 points in R³ that can be shattered by the
hypothesis set of all 2D planes in R³.
21.) (4 pts) What is the VC-dimension of the 1-Nearest Neighbor classifier?
(A) 2  (B) 3  (C) 4  (D) 5  (E) ∞
V  Experimental Protocol (9 pts total)
Imagine we are using 10-fold cross-validation to tune a parameter p of an ML algorithm. Recall that this
can be done via cross-validation over the training set, using the held-out fold to evaluate different values
of p. In the end, this process yields 10 models, {h_1, ..., h_10}; each model h_i has its own value p_i for that
parameter, and corresponding error ε_i on the held-out fold. Let k = argmin_i ε_i be the index of the model
with the lowest error.
22.) (3 pts) What is the best procedure for going from these 10 individual models to a single model that
we can apply to the test data?
(A) Use hk as the single model.
(B) Create an unweighted ensemble from all 10 models, averaging the members’ predictions.
(C) Create a weighted majority ensemble from all 10 models, weighting the models with lower error
more heavily.
(D) Set p = (1/10) Σ_i p_i and train a new classifier on the entire training set.
(E) Set p = p_k and train a new classifier on the entire training set.
Consider the confusion matrix for a multi-class classifier given below. Round answers to two decimal places.
                         Predicted
                         X    Y    Z
    Actual   Class X     5    3    1
             Class Y     1    4    0
             Class Z     0    2    4
23.) (2 pts) What is the accuracy of the classifier?
(A) 0.75
(B) 0.65
(C) 0.50
(D) 0.35
(E) None of the above
24.) (2 pts) What is the precision of the classifier in predicting class X?
(A) 0.83
(B) 0.56
(C) 0.42
(D) 0.37
(E) None of the above
25.) (2 pts) What is the recall of the classifier in predicting class Z?
(A) 0.92
(B) 0.83
(C) 0.67
(D) 0.50
(E) None of the above
VI  Algorithm Evaluation (10 pts total)
The eccentric inventor Emerett T. Vandershmoot I has invented a new machine learning algorithm, Algorithm A, to diagnose a rare disease and wants your help in assessing its performance. However, he’s nervous
about competitors stealing his invention, and so won’t give any details on the algorithm or data set. The only
information he provides is the log file below from training his algorithm and testing on a held-out portion of
the data. Vandershmoot also compared his invention to two other well-known machine learning algorithms,
Algorithms B & C, but won’t tell you what those algorithms are either. Although he is quirky, you know
that Vandershmoot is diligent and so likely followed proper experimental procedure.
LOG FILE - PROJECT ALPHA, VERSION 9381
AUTHOR: Emerett T. Vandershmoot I

TRAINING PHASE
                  Accuracy    Error
    Algorithm A:  99.2%       0.8% (best)
    Algorithm B:  96.0%       4.0%
    Algorithm C:  97.1%       2.9%

    Confusion Matrix - Algorithm A
                               True   False   <----- predicted label
    T (actual class label)       0       8
    F (actual class label)       0     992

    Confusion Matrix - Algorithm B
                               True   False   <----- predicted label
    T (actual class label)       8       0
    F (actual class label)      40     952

    Confusion Matrix - Algorithm C
                               True   False   <----- predicted label
    T (actual class label)       4       4
    F (actual class label)      25     967

TESTING PHASE (ON HELD-OUT DATA)
                  Accuracy    Error
    Algorithm A:  98.6%       1.4% (best)
    Algorithm B:  96.0%       4.0%
    Algorithm C:  97.0%       3.0%

    Confusion Matrix - Algorithm A
                               True   False   <----- predicted label
    T (actual class label)       0       7
    F (actual class label)       0     493

    Confusion Matrix - Algorithm B
                               True   False   <----- predicted label
    T (actual class label)       7       0
    F (actual class label)      20     473

    Confusion Matrix - Algorithm C
                               True   False   <----- predicted label
    T (actual class label)       4       3
    F (actual class label)      12     481
26.) (2 pts) Which algorithm(s) have the highest bias? (Choose one or more)
(A) Algorithm A
(B) Algorithm B
(C) Algorithm C
(D) Not enough information to tell
27.) (2 pts) Which algorithm(s) have the highest precision? (Choose one or more)
(A) Algorithm A
(B) Algorithm B
(C) Algorithm C
(D) Not enough information to tell
28.) (2 pts) Which algorithm(s) have the highest recall? (Choose one or more)
(A) Algorithm A
(B) Algorithm B
(C) Algorithm C
(D) Not enough information to tell
29.) (2 pts) Of the algorithms listed below, Algorithm A is most likely a ______. (Choose one)
(A) decision stump  (B) 1-NN  (C) linear regression  (D) SVM with RBF kernel  (E) boosted perceptron
30.) (1 pts) Is Algorithm A effective for this application?
(A) Yes
(B) No
(C) Not enough information to tell
31.) (1 pts) Which algorithm is best for this application? (Choose one)
(A) Algorithm A
(B) Algorithm B
(C) Algorithm C
(D) Not enough information to tell
VII  Decision Trees (10 pts total)
At Monsters University, Sulley needs help deciding which final-year monsters should be awarded Full
Scary awards. He wants to use the following data set to learn an unpruned decision tree to predict whether
the students are diligent (D) or lazy (L) based on their scariness level (Normal or Low), number of eyes (2, 3,
or 4), and hair color (Green or Blue).
    Scariness Level   Hair Color   # Eyes   Class
    Normal            Green        2        L
    Normal            Blue         2        L
    Normal            Blue         2        L
    Low               Blue         3        L
    Low               Blue         3        L
    Low               Green        4        D
    Normal            Green        4        D
    Normal            Blue         4        D
    Low               Green        3        D
    Low               Green        3        D
The following numbers may be helpful since you are not using a calculator:
    log2(0.1) = -3.32    log2(0.6) = -0.74
    log2(0.2) = -2.32    log2(0.7) = -0.51
    log2(0.3) = -1.73    log2(0.8) = -0.32
    log2(0.4) = -1.32    log2(0.9) = -0.15
    log2(0.5) = -1.00    log2(1.0) =  0
32.) (4 pts) What is the conditional entropy H(HairColor | ScarinessLevel = Normal)?
(A) 0.00
(B) 0.49
(C) 0.50
(D) 0.97
(E) 1.00
33.) (2 pts) Which attribute will the ID3 algorithm pick as the root of the tree?
(A) Scariness Level
(B) Number of Eyes
(C) Hair Color
(D) ID3 will pick randomly between the three attributes
34.) (1 pts) What will be the height of the final unpruned decision tree? (Recall that a decision stump has
a height of 1.)
(A) 1
(B) 2
(C) 3
(D) 4
(E) None of the above
35.) (1 pts) Will the resulting unpruned decision tree obtain 100% training accuracy on this dataset?
(A) Yes
(B) No
36.) (1 pts) True(A) or False(B): Unpruned decision trees will always have 100% training accuracy.
37.) (1 pts) True(A) or False(B): Pruning a decision tree is guaranteed to reduce its training accuracy.
VIII  Linear Methods (16 pts)
38.) (2 pts) Suppose we have a regularized linear regression model: $\arg\min_\theta \|Y - X\theta\|_2^2 + \lambda \|\theta\|_1$
What is the effect of increasing λ on bias and variance?
(A) Increases bias, increases variance
(B) Increases bias, decreases variance
(C) Decreases bias, increases variance
(D) Decreases bias, decreases variance
(E) Not enough information to tell
39.) (2 pts) Suppose we have a regularized linear regression model: $\arg\min_\theta \|Y - X\theta\|_2^2 + \lambda \|\theta\|_p^p$
What is the effect of increasing p on bias and variance, for p ≥ 1?
(A) Increases bias, increases variance
(B) Increases bias, decreases variance
(C) Decreases bias, increases variance
(D) Decreases bias, decreases variance
(E) Not enough information to tell
40.) (2 pts) Which of the following statements are true for linear discriminants? (Choose one or more)
(A) Regularizing the model always results in equal or better performance on the training set.
(B) Regularizing the model always results in equal or better performance on examples that are not in
the training set.
(C) Adding many new features to the model makes it more likely to overfit the training set.
(D) Adding many new features to the model always results in equal or better performance on examples
that are not in the training set.
(1 pt each) Consider two machine learning models: logistic regression and DT2 – a binary decision tree with
depth 2 (up to two levels of binary splitting nodes and 4 leaves). For each of the data sets below, identify
whether they can be:
(A) Perfectly classified by DT2 only
(B) Perfectly classified by logistic regression only
(C) Perfectly classified by both DT2 and logistic regression
(D) Cannot be perfectly classified by either algorithm
41.) – 44.) [Each of these four questions shows a scatter plot of a labeled 2D data set; the figures are not reproduced here.]
CIS 419/519 – Fall 2015 Midterm
Name
8
45.) (2 pts) For logistic regression, the gradient is given by $\frac{\partial}{\partial \theta_j} J(\theta) = \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)}$. Which
gradient descent update rule below is correct for logistic regression with a learning rate of α? (Choose
one or more).
(A) $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)}$   (simultaneously update for all j)
(B) $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left(\frac{1}{1+\exp(-\theta^T x^{(i)})} - y^{(i)}\right) x_j^{(i)}$   (simultaneously update for all j)
(C) $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}$   (simultaneously update for all j)
(D) $\theta \leftarrow \theta - \alpha \frac{1}{m} \sum_{i=1}^m (\theta^T x - y^{(i)})\, x^{(i)}$
(E) None of the above are correct
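For reference (added note; not part of the original exam): the per-coordinate rule in the stem can be written in vectorized form. The sketch below is a minimal NumPy illustration of one batch gradient descent step of that form; the names sigmoid, batch_gradient_step, alpha, X, y, and theta are placeholders, not exam notation.

    import numpy as np

    def sigmoid(z):
        # h_theta(x) = 1 / (1 + exp(-theta^T x))
        return 1.0 / (1.0 + np.exp(-z))

    def batch_gradient_step(theta, X, y, alpha):
        # X: (m, d) design matrix; y: (m,) labels in {0, 1}; theta: (d,)
        m = X.shape[0]
        grad = X.T @ (sigmoid(X @ theta) - y) / m   # (1/m) * sum_i (h_theta(x_i) - y_i) * x_i
        return theta - alpha * grad                 # simultaneous update of all theta_j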
46.) (2 pts) Which implementation of the predict() function, which predicts class labels, is correct for
logistic regression? Assume all utility functions called by the code work correctly and that the Python
syntax is correct; focus only on semantics. (Choose one or more).
(A) def predict(self, X):
        X = basisExpansion(standardize(X));
        X = addOnesColumn(X);
        return X * self.theta >= .5;
(B) def predict(self, X):
        X = basisExpansion(standardize(X));
        X = addOnesColumn(X);
        return X * self.theta;
(C) def predict(self, X):
        X = basisExpansion(standardize(X));
        X = addOnesColumn(X);
        return sigmoid(X * self.theta) >= .5;
(D) def predict(self, X):
        X = basisExpansion(standardize(X));
        X = addOnesColumn(X);
        return sigmoid(X * self.theta);
(E) None of the above are correct
47.) (2 pts) Suppose that you have trained a logistic regression classifier, and it outputs on a new example
x a prediction h_θ(x) = 0.8. Which of the following are true? (Choose one or more)
(A) Our estimate for P(y = 1 | x; θ) is 0.8.
(B) Our estimate for P(y = 0 | x; θ) is 0.2.
(C) Our estimate for P(y = 0 | x; θ) is 0.8.
(D) Our estimate for P(y = 1 | x; θ) is 0.2.
IX  SVMs and Kernels (17 pts total)
Consider the four training instances in R² that are given in the figure to the right. There are positive
examples at x1 = [0, 0] and x2 = [2, 2], and negative examples at x3 = [h, 1] and x4 = [0, 3]. Note that
x3 can move horizontally; its horizontal position h is a variable such that 0 ≤ h ≤ 3.
48.) (2 pts) How large can h be so that the training points are still linearly separable?
(A) h < 0.5  (B) h < 1  (C) h < 1.5  (D) h ≤ 3  (E) None of the above
49.) (2 pts) When the points are linearly separable, does the orientation (i.e., angle) of the maximum margin
decision boundary change as h varies?
(A) Yes (B) No
50.) (3 pts) Assume that we can only observe the 2nd dimension of the input vectors. Without the other
component, the labeled training points reduce to (0,+), (2,+), (1,-), and (3,-). What is the lowest-order
degree d of the polynomial kernel that would allow us to correctly classify these points?
(A) d = 1
(B) d = 2
(C) d = 3
(D) d = 4
(E) It is not possible to separate these points with the polynomial kernel
For an SVM, consider what might happen if we remove one of the support vectors from the training set.
(Hint: You may find it helpful to draw a few simple 2D datasets in which you identify the support vectors,
draw the location of the maximum margin hyperplane, remove one of the support vectors, and draw the
location of the resulting maximum margin hyperplane.)
51.) (1 pts) If we remove one of the support vectors, could the size of the maximum margin decrease?
(A) Yes
(B) No
52.) (1 pts) If we remove one of the support vectors, could the size of the maximum margin stay the same?
(A) Yes
(B) No
53.) (1 pts) If we remove one of the support vectors, could the size of the maximum margin increase?
(A) Yes
(B) No
54.) (2 pts) Which of the following is not a kernel? (Choose one or more)
(A) $k(x, y) = \exp(-\|x - y\|^2)$
(B) $k(x, y) = \tanh(x^T y + 1)$
(C) $k(x, y) = \sum_{j=1}^d \max(x_j, y_j)$
(D) $k(x, y) = \sqrt{x^T y + 0.5}$
(E) None of the above (all are kernels)
55.) (2 pts) Which kernel is most likely to overfit? (Choose one)
(A) Polynomial kernel of degree 2
(B) Linear kernel
(C) Gaussian kernel
(D) A kernel that maps the d-dimensional input space into R^{2d}
56.) (3 pts) In the figure below, which mapping will make the problem linear? (Choose one)
(Hint: the data was generated via the following equations.
• For class 1: x1 = t cos(t); x2 = t sin(t)
• For class 2: x1 = −t cos(t); x2 = −t sin(t) )
[Figure: the two classes of points plotted in the (x1, x2) plane; not reproduced here.]
(A) x: [x1, x2] ↦ z: [x1 + x2]
(B) x: [x1, x2] ↦ z: [x1 − x2]
(C) x: [x1, x2] ↦ z: [x1² + x2²]
(D) x: [x1, x2] ↦ z: [x1² − x2²]
(E) x: [x1, x2] ↦ z: [x1·|x1| + x2·|x2|]
X  Ensemble Methods (10 pts total)
57.) (2 pts) Which of the following statements are FALSE for ensemble learning? (Choose one or more)
(A) Individual learners should have low error rates
(B) Ensembles can combine different types of learners
(C) All learners are required to train on the same input points
(D) Cross-validation can be used to tune the weights of individual learners in the ensemble
58.) (1 pts) True (A) or False (B): In AdaBoost, you should stop iterating if the error rate of the combined
classifier on the original training data is 0.
59.) (1 pts) True (A) or False (B): In AdaBoost, weak classifiers added in later boosting rounds tend to
focus on the more difficult instances in the training set.
60.) (2 pts) What is the primary effect of early boosting iterations on the ensemble? (Choose one or more)
(A) To reduce the bias of the ensemble classifier.
(B) To reduce the variance of the ensemble classifier.
(C) Early iterations have little effect on the bias or variance of the ensemble classifier.
(D) To increase the bias of the ensemble classifier.
(E) To increase the variance of the ensemble classifier.
61.) (2 pts) What is the effect of later boosting iterations on the ensemble? (Choose one or more)
(A) To reduce the bias of the ensemble classifier.
(B) To reduce the variance of the ensemble classifier.
(C) Later iterations have little effect on the bias or variance of the ensemble classifier.
(D) To increase the bias of the ensemble classifier.
(E) To increase the variance of the ensemble classifier.
Consider an ensemble constructed by a weighted majority vote of its members and applied to data with
binary labels {−1, 1}. Given a data point x ∈ R^d, the final classifier is $H_T(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$,
where $h_t(x): \mathbb{R}^d \mapsto \{-1, 1\}$. Let the ensemble classifier consist of the four binary classifiers shown to
the right (the arrows point to the side of the discriminant labeled as positive (+)). [Figure not reproduced here.]
62.) (2 pts) What values of α would make the ensemble classifier consistent with the data?
(A) [1, 1, 1, 1]
(B) [1, 2, 2, 1]
(C) [1, 2, 1, 2]
(D) [1, 2, 1, 1]
(E) None of the above
That’s it! Relax a bit, check your answers, and check that your name and PennID are written (and bubbled
in!) on your bubble sheet. Then, quietly turn in your exam. Please be mindful of others around you who
are still completing the exam.
We still have a few students who need to take a makeup midterm, so please DO NOT post questions or
comments about the exam on Piazza.
Reference Page
Decision Trees   Let D be the data, C be the class attribute, and A be an attribute (which could be C).
The accessor A.values denotes the values of attribute A.

$$\text{Info}(D) = -\sum_{i \in C.\text{values}} \frac{|D_i|}{|D|} \cdot \lg\!\left(\frac{|D_i|}{|D|}\right)$$

$$\text{Info}(A, D) = \sum_{j \in A.\text{values}} \frac{|D_j|}{|D|} \cdot \text{Info}(D_j)$$

$$\text{Gain}(A, D) = \text{Info}(D) - \text{Info}(A, D)$$

$$\text{GainRatio}(A, D) = \frac{\text{Gain}(A, D)}{\text{SplitInfo}(A, D)}$$

$$\text{SplitInfo}(A, D) = \text{Info}(D) \text{ considering } A \text{ as the class attribute } C = -\sum_{i \in A.\text{values}} \frac{|D_i|}{|D|} \cdot \lg\!\left(\frac{|D_i|}{|D|}\right)$$
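As a minimal illustration of the Info and Gain definitions above (added for reference; not part of the original exam), the Python sketch below computes them from lists of attribute values and class labels; all function and variable names are illustrative.

    from collections import Counter
    from math import log2

    def info(labels):
        # Info(D) = -sum_i (|D_i|/|D|) * lg(|D_i|/|D|), summed over class values i
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain(attr_values, labels):
        # Gain(A, D) = Info(D) - sum_j (|D_j|/|D|) * Info(D_j), summed over values j of A
        n = len(labels)
        info_a = 0.0
        for v in set(attr_values):
            subset = [lab for a, lab in zip(attr_values, labels) if a == v]
            info_a += (len(subset) / n) * info(subset)
        return info(labels) - info_a

On the Section VII data set, for example, the gain for the # Eyes attribute works out to 1 − 0.4 = 0.6, the largest of the three attributes, which is consistent with the answer key's choice of root.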
Linear Regression

$$\min_\theta \; \frac{1}{2n} \sum_{i=1}^n (h_\theta(x_i) - y_i)^2 + \lambda \sum_{j=1}^d \theta_j^2$$

Perceptron Update Rule

$$\theta \leftarrow \theta + y_i x_i \quad \text{if } x_i \text{ is misclassified}$$
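A minimal sketch of the update rule above applied example-by-example (added for reference; not part of the original exam). Labels are assumed to be in {-1, +1}, and all names are illustrative.

    import numpy as np

    def perceptron(X, y, n_epochs=10):
        # X: (n, d) inputs; y: (n,) labels in {-1, +1}
        theta = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for x_i, y_i in zip(X, y):
                if y_i * (theta @ x_i) <= 0:      # x_i is misclassified
                    theta = theta + y_i * x_i     # theta <- theta + y_i * x_i
        return theta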
Support Vector Machines

Primal: $$\min_{w,b} \; \frac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad \forall i \;\; y_i (w \cdot x_i - b) \ge 1$$

Dual: $$\min_{\alpha} \; \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^n \alpha_i \quad \text{s.t.} \quad \sum_{i=1}^n y_i \alpha_i = 0 \;\text{ and }\; \forall i \;\; \alpha_i \ge 0$$
Logistic Regression

$$P(y = 1 \mid x) = \frac{1}{1 + \exp(-\theta^T x)} \qquad P(y = 0 \mid x) = 1 - P(y = 1 \mid x)$$

$$\min_\theta \; -\sum_{i=1}^n \left[\, y_i \lg P(y_i = 1 \mid x_i) + (1 - y_i) \lg P(y_i = 0 \mid x_i) \,\right]$$

AdaBoost

$$\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} \quad \text{where} \quad \epsilon_t = \sum_{i=1}^n w_t(x_i)\, \mathbb{1}[y_i \ne h_t(x_i)]$$

$$w_{t+1}(x_i) = \frac{w_t(x_i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t} \quad \text{where} \quad Z_t = \sum_{i=1}^n w_t(x_i) \exp(-\alpha_t y_i h_t(x_i))$$

$$H(x) = \text{sign}\!\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$$

Norms

$$\|x\|_p = \left(\sum_{i=1}^d |x_i|^p\right)^{1/p}$$
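For reference (added note; not part of the original exam), a minimal sketch of one AdaBoost round and the final vote using the formulas above. Weak hypotheses are assumed to map an example to {-1, +1}, and all names are illustrative.

    import numpy as np

    def adaboost_round(h_t, X, y, w_t):
        # y: labels in {-1, +1}; w_t: current normalized instance weights
        preds = np.array([h_t(x) for x in X])
        eps_t = np.sum(w_t * (preds != y))               # epsilon_t
        alpha_t = 0.5 * np.log((1.0 - eps_t) / eps_t)    # alpha_t = (1/2) ln((1 - eps_t) / eps_t)
        w_next = w_t * np.exp(-alpha_t * y * preds)      # unnormalized w_{t+1}
        return alpha_t, w_next / np.sum(w_next)          # dividing by Z_t renormalizes the weights

    def ensemble_predict(alphas, hypotheses, x):
        # H(x) = sign(sum_t alpha_t * h_t(x))
        return np.sign(sum(a * h(x) for a, h in zip(alphas, hypotheses)))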
Answer Key
1.) Any answer
2.) Any answer
3.) D - Logistic Regression
4.) A - SVM with C = 0
5.) C - ICA
6.) C - Unsupervised
7.) A - True
8.) A - True
9.) A - True
10.) B - False
11.) B - local optimum
12.) A - global optimum
13.) A - global optimum
14.) A - global optimum
15.) A - zero-one loss
16.) D - hinge loss
17.) C - log loss
18.) D - hinge loss
19.) A - expectedLoss = (bias)² + variance + noise
20.) A - True
21.) E - ∞, since 1-NN can shatter an arbitrary set of points.
22.) E - Set p = p_k and train a new classifier on the entire training set.
23.) B - 0.65
24.) A - 0.83
25.) C - 0.67
26.) A - Algorithm A
27.) B - Algorithm B
28.) B - Algorithm B
29.) A - decision stump
30.) B - No
31.) B - Algorithm B
32.) D - 0.97: −p(C=G|S=N) lg p(C=G|S=N) − p(C=B|S=N) lg p(C=B|S=N) = −(2/5) lg(2/5) − (3/5) lg(3/5) ≈ 0.972
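For reference, the values in answers 23-25 follow from the Section V confusion matrix; the worked arithmetic below is added here for clarity and is not part of the original key.

$$\text{accuracy} = \frac{5 + 4 + 4}{20} = 0.65, \qquad \text{precision}_X = \frac{5}{5 + 1 + 0} \approx 0.83, \qquad \text{recall}_Z = \frac{4}{0 + 2 + 4} \approx 0.67$$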
33.) B - Number of Eyes
34.) B - 2
35.) A - Yes
36.) B - False
37.) B - False
38.) B
39.) E - Not enough information to tell [NOTE: QUESTION
WAS DISCARDED – EVERYONE GIVEN CREDIT]
40.) C - possible overfitting
41.) C - Both
42.) A - DT2 only
43.) B - LR only
44.) D - neither
45.) A,B
46.) E - None (standardization must happen after basis expansion)
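A sketch of what a semantically correct version might look like given that note (added for reference; it reuses the question's hypothetical utility functions and only changes the order of basis expansion and standardization relative to option (C)):

    def predict(self, X):
        X = standardize(basisExpansion(X))       # expand features first, then standardize the expanded features
        X = addOnesColumn(X)                     # prepend the intercept column
        return sigmoid(X * self.theta) >= 0.5    # threshold P(y = 1 | x) at 0.5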
47.) A, B
48.) B - h < 1
49.) B - no, since H1, H2, and H3 stay the support vectors
50.) C - 3. We will need a cubic function
51.) B - No
52.) A - Yes
53.) A - Yes
54.) C - $k(x, y) = \sum_{j=1}^d \max(x_j, y_j)$ is not a kernel
55.) C
56.) E
57.) A, C
58.) B - False
59.) A - True
60.) A - Reduce bias
61.) A, B - Reduce bias and variance
62.) E - None of the above