CIS 419/519 – Fall 2015 Midterm    Name: ____________

CIS 419/519 Introduction to Machine Learning
Fall 2015 Midterm

Instructions:

• Please turn off your cell phone.
• This examination is closed-book and closed-notes. Calculators and any other external resources are prohibited.
• All answers must be recorded on the bubble sheet. Fill in the bubble of the correct answer for each question. Be certain to fill in the bubble completely.
• To correct a mistake, be certain to erase the bubble completely.
• For true/false questions, fill in (A) for TRUE and fill in (B) for FALSE.
• For multiple choice questions, fill in the bubble(s) of the correct answer. Some multiple choice questions may have multiple correct answers; in this case, you can fill in multiple bubbles for a single question.
• Only the bubble sheet will be counted as your official answers; any answers you write on the exam will not be considered.
• Partial credit will not be given for individual questions, including questions with multiple correct answers. (Grading will use zero-one loss. ;-) )
• If you find yourself spending too long on a problem, skip it and move on to the next one. Scrap paper is available upon request. Be certain to write your name on all materials you turn in.
• If you do not understand a problem or require clarification, please ask the instructor.
• There are 100 points total. Students in CIS 419 will be graded out of 92 points; points earned beyond 92 will not count as extra credit. Students in CIS 519 will be graded out of 100 points. The instructor reserves the right to adjust these denominators lower to curve the scores in your favor.
• I wish each of you the best of luck.

Sign this statement after you have completed the examination. Your exam will not be graded without your signature.

I certify that my responses in this examination are solely the product of my own work and that I have fully abided by the University of Pennsylvania Academic Integrity policy while taking this exam.
Signature: ____________    Printed Name: ____________

I Preliminaries

There are multiple versions of this exam. Fill in the version number of the exam on your answer sheet:
• If you are enrolled in CIS 419, fill in the bubble for Version 2
• If you are enrolled in CIS 519, fill in the bubble for Version 4

Write your name on the bubble sheet, the first page of the exam, and on any scrap paper you use. Be sure to bubble in your PennID on the answer sheet.

II General Knowledge (12 pts total)

1.) (1 pts) Check that (1) your exam has all 12 pages, (2) you've written your name, and (3) you've written and bubbled in your PennID on the answer sheet. Answer this question to let me know you did it: what is your favorite course this semester?
(A) CIS 419/519  (B) CIS 419/519  (C) CIS 419/519  (D) CIS 419/519

2.) (1 pts) Is your answer to question 1 a lie?
(A) Yes  (B) No

3.) (2 pts) Which model will always yield a linear decision boundary? (Choose one or more)
(A) Polynomial regression
(B) Decision tree with depth 2
(C) k-Nearest Neighbor
(D) Logistic Regression
(E) None of the above

4.) (2 pts) Which value of C in an SVM has the most bias?
(A) C = 0  (B) C = 1  (C) C = 10  (D) C = 1000  (E) All have equal bias

The next two questions concern the following scenario: While attending a concert, you remember that ML can be used to separate the singer's voice from the instrumental track.

5.) (1 pts) What ML technique can do this? (Choose one)
(A) Logistic regression
(B) Support vector regression
(C) Independent component analysis
(D) Reinforcement learning
(E) Inverse reinforcement learning

6.) (1 pts) Is this algorithm supervised, semi-supervised, or unsupervised?
(A) Supervised  (B) Semi-supervised  (C) Unsupervised  (D) None of the above

Truth or LIES

For each question below, answer whether it is a True statement (A) or a bloody LIE (B)!

7.) (1 pts) Suppose that A1, ..., Ad are categorical input attributes. The maximum depth of the decision tree must be less than d + 1.

8.)
(1 pts) For a data set of n instances, the maximum depth of the decision tree must be less than n.

9.) (1 pts) The lower we are in a decision tree, the more likely we are to be modeling noise.

10.) (1 pts) Using the quadratic kernel in an SVM with very large numbers of training instances is much more computationally efficient than precomputing the equivalent basis expansion of each training instance and using a linear kernel.

III Properties of Learning Algorithms (8 pts total)

For each machine learning algorithm listed below, choose the correct properties of that algorithm. Does each learning algorithm below yield the globally optimal solution or a locally optimal solution? (Assume that each algorithm does converge to a solution.)

11.) (1 pts) ID3 Decision Tree: (A) global optimum  (B) local optimum
12.) (1 pts) Batch perceptron: (A) global optimum  (B) local optimum
13.) (1 pts) Logistic regression: (A) global optimum  (B) local optimum
14.) (1 pts) SVM: (A) global optimum  (B) local optimum

What loss function does each algorithm use? Choose the best option from the following choices:
(A) zero-one loss  (B) sum of squared error  (C) log loss  (D) hinge loss  (E) exponential loss

15.) (1 pts) ID3 Decision Tree
16.) (1 pts) Batch perceptron
17.) (1 pts) Logistic regression
18.) (1 pts) SVM

IV Computational Learning Theory (8 pts total)

19.) (2 pts) Which of the following is a correct expression for Expected Loss?
(A) expectedLoss = (bias)^2 + variance + noise
(B) expectedLoss = bias + (variance)^2 + noise
(C) expectedLoss = bias + variance + noise
(D) expectedLoss = (bias)^2 + (variance)^2 + noise

20.) (2 pts) True (A) or False (B): There is at least one set of 4 points in R^3 that can be shattered by the hypothesis set of all 2D planes in R^3.

21.) (4 pts) What is the VC-dimension of the 1-Nearest Neighbor classifier?
(A) 2  (B) 3  (C) 4  (D) 5  (E) ∞

V Experimental Protocol (9 pts total)

Imagine we are using 10-fold cross-validation to tune a parameter p of an ML algorithm. Recall that this can be done via cross-validation over the training set, using the held-out fold to evaluate different values of p. In the end, this process yields 10 models, {h_1, ..., h_10}; each model h_i has its own value p_i for that parameter, and corresponding error ε_i on the held-out fold. Let k = argmin_i ε_i be the index of the model with the lowest error.

22.) (3 pts) What is the best procedure for going from these 10 individual models to a single model that we can apply to the test data?
(A) Use h_k as the single model.
(B) Create an unweighted ensemble from all 10 models, averaging the members' predictions.
(C) Create a weighted majority ensemble from all 10 models, weighting the models with lower error more heavily.
(D) Set p = (1/10) Σ_i p_i and train a new classifier on the entire training set.
(E) Set p = p_k and train a new classifier on the entire training set.

Consider the confusion matrix for a multi-class classifier given below. Round answers to two decimal places.

                     Predicted
                     X    Y    Z
  Actual  Class X    5    3    1
          Class Y    1    4    0
          Class Z    0    2    4

23.) (2 pts) What is the accuracy of the classifier?
(A) 0.75  (B) 0.65  (C) 0.50  (D) 0.35  (E) None of the above

24.) (2 pts) What is the precision of the classifier in predicting class X?
(A) 0.83  (B) 0.56  (C) 0.42  (D) 0.37  (E) None of the above

25.) (2 pts) What is the recall of the classifier in predicting class Z?
(A) 0.92  (B) 0.83  (C) 0.67  (D) 0.50  (E) None of the above

VI Algorithm Evaluation (10 pts total)

The eccentric inventor Emerett T. Vandershmoot I has invented a new machine learning algorithm, Algorithm A, to diagnose a rare disease and wants your help in assessing its performance.
However, he's nervous about competitors stealing his invention, and so won't give any details on the algorithm or data set. The only information he provides is the log file below from training his algorithm and testing on a held-out portion of the data. Vandershmoot also compared his invention to two other well-known machine learning algorithms, Algorithms B & C, but won't tell you what those algorithms are either. Although he is quirky, you know that Vandershmoot is diligent and so likely followed proper experimental procedure.

LOG FILE - PROJECT ALPHA, VERSION 9381
AUTHOR: Emerett T. Vandershmoot I

TRAINING PHASE
  Algorithm A:  Accuracy 99.2%   Error 0.8% (best)
  Algorithm B:  Accuracy 96.0%   Error 4.0%
  Algorithm C:  Accuracy 97.1%   Error 2.9%

  Confusion Matrix - Algorithm A
     True  False   <----- predicted label
        0      8   T (actual class label)
        0    992   F (actual class label)

  Confusion Matrix - Algorithm B
     True  False   <----- predicted label
        8      0   T (actual class label)
       40    952   F (actual class label)

  Confusion Matrix - Algorithm C
     True  False   <----- predicted label
        4      4   T (actual class label)
       25    967   F (actual class label)

TESTING PHASE (ON HELD-OUT DATA)
  Algorithm A:  Accuracy 98.6%   Error 1.4% (best)
  Algorithm B:  Accuracy 96.0%   Error 4.0%
  Algorithm C:  Accuracy 97.0%   Error 3.0%

  Confusion Matrix - Algorithm A
     True  False   <----- predicted label
        0      7   T (actual class label)
        0    493   F (actual class label)

  Confusion Matrix - Algorithm B
     True  False   <----- predicted label
        7      0   T (actual class label)
       20    473   F (actual class label)

  Confusion Matrix - Algorithm C
     True  False   <----- predicted label
        4      3   T (actual class label)
       12    481   F (actual class label)

26.) (2 pts) Which algorithm(s) have the highest bias? (Choose one or more)
(A) Algorithm A  (B) Algorithm B  (C) Algorithm C  (D) Not enough information to tell

27.) (2 pts) Which algorithm(s) have the highest precision? (Choose one or more)
(A) Algorithm A  (B) Algorithm B  (C) Algorithm C  (D) Not enough information to tell

28.) (2 pts) Which algorithm(s) have the highest recall?
(Choose one or more)
(A) Algorithm A  (B) Algorithm B  (C) Algorithm C  (D) Not enough information to tell

29.) (2 pts) Of the algorithms listed below, Algorithm A is most likely a ______. (Choose one)
(A) decision stump  (B) 1-NN  (C) linear regression  (D) SVM with RBF kernel  (E) boosted perceptron

30.) (1 pts) Is Algorithm A effective for this application?
(A) Yes  (B) No  (C) Not enough information to tell

31.) (1 pts) Which algorithm is best for this application? (Choose one)
(A) Algorithm A  (B) Algorithm B  (C) Algorithm C  (D) Not enough information to tell

VII Decision Trees (10 pts total)

At Monsters University, Sulley needs help with deciding which final-year monsters should be awarded Full Scary awards. He wants to use the following data set to learn an unpruned decision tree to predict whether the students are diligent (D) or lazy (L) based on their scariness level (Normal or Low), number of eyes (2, 3, or 4), and hair color (Green or Blue).

  Scariness Level   Hair Color   # Eyes   Class
  Normal            Green        2        L
  Normal            Blue         2        L
  Normal            Blue         2        L
  Low               Blue         3        L
  Low               Blue         3        L
  Low               Green        4        D
  Normal            Green        4        D
  Normal            Blue         4        D
  Low               Green        3        D
  Low               Green        3        D

The following numbers may be helpful since you are not using a calculator:

  log2(0.1) = -3.32    log2(0.6) = -0.74
  log2(0.2) = -2.32    log2(0.7) = -0.51
  log2(0.3) = -1.73    log2(0.8) = -0.32
  log2(0.4) = -1.32    log2(0.9) = -0.15
  log2(0.5) = -1       log2(1.0) = 0

32.) (4 pts) What is the conditional entropy H(HairColor | ScarinessLevel = Normal)?
(A) 0.00  (B) 0.49  (C) 0.50  (D) 0.97  (E) 1.00

33.) (2 pts) Which attribute will the ID3 algorithm pick as the root of the tree?
(A) Scariness Level  (B) Number of Eyes  (C) Hair Color  (D) ID3 will pick randomly between the three attributes

34.) (1 pts) What will be the height of the final unpruned decision tree? (Recall that a decision stump has a height of 1.)
(A) 1  (B) 2  (C) 3  (D) 4  (E) None of the above

35.)
(1 pts) Will the resulting unpruned decision tree obtain 100% training accuracy on this dataset?
(A) Yes  (B) No

36.) (1 pts) True (A) or False (B): Unpruned decision trees will always have 100% training accuracy.

37.) (1 pts) True (A) or False (B): Pruning a decision tree is guaranteed to reduce its training accuracy.

VIII Linear Methods (16 pts)

38.) (2 pts) Suppose we have a regularized linear regression model:
  argmin_θ ‖Y − Xθ‖_2^2 + λ‖θ‖_1
What is the effect of increasing λ on bias and variance?
(A) Increases bias, increases variance
(B) Increases bias, decreases variance
(C) Decreases bias, increases variance
(D) Decreases bias, decreases variance
(E) Not enough information to tell

39.) (2 pts) Suppose we have a regularized linear regression model:
  argmin_θ ‖Y − Xθ‖_2^2 + λ‖θ‖_p^p
What is the effect of increasing p on bias and variance, for p ≥ 1?
(A) Increases bias, increases variance
(B) Increases bias, decreases variance
(C) Decreases bias, increases variance
(D) Decreases bias, decreases variance
(E) Not enough information to tell

40.) (2 pts) Which of the following statements are true for linear discriminants? (Choose one or more)
(A) Regularizing the model always results in equal or better performance on the training set.
(B) Regularizing the model always results in equal or better performance on examples that are not in the training set.
(C) Adding many new features to the model makes it more likely to overfit the training set.
(D) Adding many new features to the model always results in equal or better performance on examples that are not in the training set.

(1 pt each) Consider two machine learning models: logistic regression and DT2 – a binary decision tree with depth 2 (up to two levels of binary splitting nodes and 4 leaves).
For each of the data sets below, identify whether they can be:
(A) Perfectly classified by DT2 only
(B) Perfectly classified by logistic regression only
(C) Perfectly classified by both DT2 and logistic regression
(D) Cannot be perfectly classified by either algorithm

41.)  42.)  43.)  44.)  [Each question refers to one of four 2D data sets shown in the accompanying figures, not reproduced here.]

45.) (2 pts) For logistic regression, the gradient is given by ∂J(θ)/∂θ_j = Σ_{i=1}^m (h_θ(x^(i)) − y^(i)) x_j^(i). Which gradient descent update rule below is correct for logistic regression with a learning rate of α? (Choose one or more)
(A) θ_j ← θ_j − α (1/m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i)) x_j^(i)   (simultaneously update for all j)
(B) θ_j ← θ_j − α (1/m) Σ_{i=1}^m (1/(1 + exp(−θᵀx^(i))) − y^(i)) x_j^(i)   (simultaneously update for all j)
(C) θ_j ← θ_j − α (1/m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i)) x^(i)   (simultaneously update for all j)
(D) θ ← θ − α (1/m) Σ_{i=1}^m (θᵀx^(i) − y^(i)) x^(i)
(E) None of the above are correct

46.) (2 pts) Which implementation of the predict() function, which predicts class labels, is correct for logistic regression? Assume all utility functions called by the code work correctly and that the Python syntax is correct; focus only on semantics. (Choose one or more)

(A) def predict(self, X):
        X = basisExpansion(standardize(X))
        X = addOnesColumn(X)
        return X * self.theta >= .5

(B) def predict(self, X):
        X = basisExpansion(standardize(X))
        X = addOnesColumn(X)
        return X * self.theta

(C) def predict(self, X):
        X = basisExpansion(standardize(X))
        X = addOnesColumn(X)
        return sigmoid(X * self.theta) >= .5

(D) def predict(self, X):
        X = basisExpansion(standardize(X))
        X = addOnesColumn(X)
        return sigmoid(X * self.theta)

(E) None of the above are correct

47.) (2 pts) Suppose that you have trained a logistic regression classifier, and it outputs on a new example x a prediction h_θ(x) = 0.8. Which of the following are true? (Choose one or more)
(A) Our estimate for P(y = 1 | x; θ) is 0.8.
(B) Our estimate for P(y = 0 | x; θ) is 0.2.
(C) Our estimate for P(y = 0 | x; θ) is 0.8.
(D) Our estimate for P(y = 1 | x; θ) is 0.2.

IX SVMs and Kernels (17 pts total)

Consider the four training instances in R^2 that are given in the figure to the right. There are positive examples at:
  x1 = [0, 0]    x2 = [2, 2]
and negative examples at:
  x3 = [h, 1]    x4 = [0, 3]
Note that x3 can move horizontally; its horizontal position h is a variable such that 0 ≤ h ≤ 3.

48.) (2 pts) How large can h be so that the training points are still linearly separable?
(A) h < 0.5  (B) h < 1  (C) h < 1.5  (D) h ≤ 3  (E) None of the above

49.) (2 pts) When the points are linearly separable, does the orientation (i.e., angle) of the maximum margin decision boundary change as h varies?
(A) Yes  (B) No

50.) (3 pts) Assume that we can only observe the 2nd dimension of the input vectors. Without the other component, the labeled training points reduce to (0, +), (2, +), (1, −), and (3, −). What is the lowest-order degree d of the polynomial kernel that would allow us to correctly classify these points?
(A) d = 1  (B) d = 2  (C) d = 3  (D) d = 4  (E) It is not possible to separate these points with the polynomial kernel

For an SVM, consider what might happen if we remove one of the support vectors from the training set. (Hint: You may find it helpful to draw a few simple 2D datasets in which you identify the support vectors, draw the location of the maximum margin hyperplane, remove one of the support vectors, and draw the location of the resulting maximum margin hyperplane.)

51.) (1 pts) If we remove one of the support vectors, could the size of the maximum margin decrease? (A) Yes  (B) No

52.) (1 pts) If we remove one of the support vectors, could the size of the maximum margin stay the same? (A) Yes  (B) No

53.) (1 pts) If we remove one of the support vectors, could the size of the maximum margin increase? (A) Yes  (B) No

54.) (2 pts) Which of the following is not a kernel?
(Choose one or more)
(A) k(x, y) = exp(−‖x − y‖^2)
(B) k(x, y) = tanh(xᵀy + 1)
(C) k(x, y) = Σ_{j=1}^d max(x_j, y_j)
(D) k(x, y) = √(xᵀy + 0.5)
(E) None of the above (all are kernels)

55.) (2 pts) Which kernel is most likely to overfit? (Choose one)
(A) Polynomial kernel of degree 2
(B) Linear kernel
(C) Gaussian kernel
(D) A kernel that maps the d-dimensional input space into R^2d

56.) (3 pts) In the figure below, which mapping will make the problem linear? (Choose one)
(Hint: the data was generated via the following equations.
  For class 1: x1 = t cos(t); x2 = t sin(t)
  For class 2: x1 = −t cos(t); x2 = −t sin(t) )
(A) x : [x1, x2] ↦ z : [x1 + x2]
(B) x : [x1, x2] ↦ z : [x1 − x2]
(C) x : [x1, x2] ↦ z : [x1^2 + x2^2]
(D) x : [x1, x2] ↦ z : [x1^2 − x2^2]
(E) x : [x1, x2] ↦ z : [x1|x1| + x2|x2|]

X Ensemble Methods (10 pts total)

57.) (2 pts) Which of the following statements are FALSE for ensemble learning? (Choose one or more)
(A) Individual learners should have low error rates
(B) Ensembles can combine different types of learners
(C) All learners are required to train on the same input points
(D) Cross-validation can be used to tune the weights of individual learners in the ensemble

58.) (1 pts) True (A) or False (B): In AdaBoost, you should stop iterating if the error rate of the combined classifier on the original training data is 0.

59.) (1 pts) True (A) or False (B): In AdaBoost, weak classifiers added in later boosting rounds tend to focus on the more difficult instances in the training set.

60.) (2 pts) What is the primary effect of early boosting iterations on the ensemble? (Choose one or more)
(A) To reduce the bias of the ensemble classifier.
(B) To reduce the variance of the ensemble classifier.
(C) Early iterations have little effect on the bias or variance of the ensemble classifier.
(D) To increase the bias of the ensemble classifier.
(E) To increase the variance of the ensemble classifier.

61.)
(2 pts) What is the effect of later boosting iterations on the ensemble? (Choose one or more)
(A) To reduce the bias of the ensemble classifier.
(B) To reduce the variance of the ensemble classifier.
(C) Later iterations have little effect on the bias or variance of the ensemble classifier.
(D) To increase the bias of the ensemble classifier.
(E) To increase the variance of the ensemble classifier.

Consider an ensemble constructed by a weighted majority vote of its members and applied to data with binary labels {−1, 1}. Given a data point x ∈ R^d, the final classifier is H_T(x) = sign(Σ_{t=1}^T α_t h_t(x)), where h_t(x) : R^d ↦ {−1, 1}. Let the ensemble classifier consist of the four binary classifiers shown to the right (the arrows point to the side of the discriminant labeled as positive (+)).

62.) (2 pts) What values of α would make the ensemble classifier consistent with the data?
(A) [1, 1, 1, 1]
(B) [1, 2, 2, 1]
(C) [1, 2, 1, 2]
(D) [1, 2, 1, 1]
(E) None of the above

That's it! Relax a bit, check your answers, and check that your name and PennID are written (and bubbled in!) on your bubble sheet. Then, quietly turn in your exam. Please be mindful of others around you who are still completing the exam. We still have a few students who need to take a makeup midterm, so please DO NOT post questions or comments about the exam on Piazza.

Reference Page

Decision Trees
Let D be the data, C be the class attribute, and A be an attribute (which could be C). The accessor A.values denotes the values of attribute A.

  Info(D) = − Σ_{i ∈ C.values} (|D_i| / |D|) · lg(|D_i| / |D|)

  Info(A, D) = Σ_{j ∈ A.values} (|D_j| / |D|) · Info(D_j)

  Gain(A, D) = Info(D) − Info(A, D)

  GainRatio(A, D) = Gain(A, D) / SplitInfo(A, D)

  SplitInfo(A, D) = Info(D) considering A as the class attribute C
                  = − Σ_{i ∈ A.values} (|D_i| / |D|) · lg(|D_i| / |D|)

Linear Regression

  min_θ (1/2n) Σ_{i=1}^n (h_θ(x_i) − y_i)^2 + λ Σ_{j=1}^d θ_j^2

Perceptron Update Rule

  θ ← θ + y_i x_i   if x_i is misclassified

Support Vector Machines

  Primal:  min_{w,b} (1/2) ‖w‖_2^2   s.t. ∀i  y_i(w · x_i − b) ≥ 1
  Dual:    min_α (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j (x_i · x_j) − Σ_{i=1}^n α_i   s.t. Σ_{i=1}^n y_i α_i = 0 and ∀i  α_i ≥ 0

Logistic Regression

  P(y = 1 | x) = 1 / (1 + exp(−θᵀx))
  P(y = 0 | x) = 1 − P(y = 1 | x)
  min_θ − Σ_{i=1}^n [ y_i lg P(y_i = 1 | x_i) + (1 − y_i) lg P(y_i = 0 | x_i) ]

AdaBoost

  α_t = (1/2) ln((1 − ε_t) / ε_t)   where ε_t = Σ_{i=1}^n w_t(x_i) 1[y_i ≠ h_t(x_i)]
  w_{t+1}(x_i) = w_t(x_i) exp(−α_t y_i h_t(x_i)) / Z_t   where Z_t = Σ_{i=1}^n w_t(x_i) exp(−α_t y_i h_t(x_i))
  H(x) = sign( Σ_{t=1}^T α_t h_t(x) )

Norms

  ‖x‖_p = ( Σ_{i=1}^d |x_i|^p )^(1/p)

Answer Key

1.) Any answer
2.) Any answer
3.) D - Logistic Regression
4.) A - SVM with C = 0
5.) C - ICA
6.) C - Unsupervised
7.) A - True
8.) A - True
9.) A - True
10.) B - False
11.) B - local optimum
12.) A - global optimum
13.) A - global optimum
14.) A - global optimum
15.) A - zero-one loss
16.) D - hinge loss
17.) C - log loss
18.) D - hinge loss
19.) A - expectedLoss = (bias)^2 + variance + noise
20.) A - True
21.) E - ∞, since it can shatter an arbitrary set of points.
22.) E - Set p = p_k and train a new classifier on the entire training set.
23.) B - 0.65
24.) A - 0.833
25.) C - 0.67
26.) A - Algorithm A
27.) B - Algorithm B
28.) B - Algorithm B
29.) A - decision stump
30.) B - No
31.) B - Algorithm B
32.) D - 0.97, since −p(C=G|S=N) lg p(C=G|S=N) − p(C=B|S=N) lg p(C=B|S=N) = −(2/5) lg(2/5) − (3/5) lg(3/5) ≈ 0.972
33.) B - Number of Eyes
34.) B - 2
35.) A - Yes
36.) B - False
37.) B - False
38.) B - Increases bias, decreases variance
39.) E - Not enough information to tell [NOTE: QUESTION WAS DISCARDED – EVERYONE GIVEN CREDIT]
40.) C - possible overfitting
41.) C - Both
42.) A - DT2 only
43.) B - LR only
44.) D - Neither
45.) A, B
46.) E - None (standardization must happen after basis expansion)
47.) A, B
48.) B - h < 1
49.)
B - No, since H1, H2, and H3 stay the support vectors
50.) C - 3. We will need a cubic function.
51.) B - No
52.) A - Yes
53.) A - Yes
54.) C - k(x, y) = Σ_{j=1}^d max(x_j, y_j) is not a kernel
55.) C
56.) E
57.) A, C
58.) B - False
59.) A - True
60.) A - Reduce bias
61.) A, B - Reduce bias and variance
62.) E - None of the above
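Several of the keyed answers above are short numeric computations that can be double-checked mechanically. The following is an editor's sketch in Python (the variable and helper names are illustrative, not from the exam): it recomputes the confusion-matrix metrics for questions 23-25 and the conditional entropy for question 32.

```python
import math

# Confusion matrix from Section V (rows = actual X, Y, Z; columns = predicted X, Y, Z).
cm = [[5, 3, 1],
      [1, 4, 0],
      [0, 2, 4]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total        # question 23
precision_x = cm[0][0] / sum(cm[i][0] for i in range(3))  # question 24: TP_X / predicted-X
recall_z = cm[2][2] / sum(cm[2])                          # question 25: TP_Z / actual-Z

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Question 32: among the five Scariness = Normal instances, hair colors are
# {Green, Blue, Blue, Green, Blue}, i.e. P(Green) = 2/5 and P(Blue) = 3/5.
h_cond = entropy([2 / 5, 3 / 5])

print(round(accuracy, 2), round(precision_x, 2), round(recall_z, 2), round(h_cond, 2))
# 0.65 0.83 0.67 0.97 -- matching keyed answers 23 (B), 24 (A), 25 (C), and 32 (D)
```

The same two-decimal rounding the exam asks for is applied before comparing against the keyed answers.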