Assignment 4 (Due on 2015-08-23, 23:55 IST)

1) Ridge or Lasso regression is used to:
Answer: Shrink the coefficients of the fitted model as compared to regular OLS multiple regression.

2) If you have 5 input variables, how many unique models would best subset selection evaluate?
Answer: 2^5 = 32

3) In decision trees, there can be only one decision split per attribute. Say True or False.
Answer: False

Questions 4 and 5 refer to the following data. Which of the two prediction equations better describes the data?

y = 3 + 4B    --- model (i)
y = 4 + AB    --- model (ii)

 A     B     Y    AB    Predicted Y, model (i)    Predicted Y, model (ii)
-1    -1     3     1            -1                          5
-1     1     3    -1             7                          3
 1    -1    -1    -1            -1                          3
 1     1     7     1             7                          5

4) Using a squared error loss, which model do you prefer?
Answer: Model (ii)

5) For the same data and models as in Question 4, which model do you prefer if you were using mean absolute deviation (MAD)?
Answer: Indifferent between model (i) and model (ii)

Squared error:
Model (i):  (-1 - 3)^2 + (7 - 3)^2 + (-1 + 1)^2 + (7 - 7)^2 = 32
Model (ii): (5 - 3)^2 + (3 - 3)^2 + (3 + 1)^2 + (5 - 7)^2 = 24
When squared error is considered, model (ii) is better.

MAD:
Model (i):  |-1 - 3| + |7 - 3| + |-1 + 1| + |7 - 7| = 8
Model (ii): |5 - 3| + |3 - 3| + |3 + 1| + |5 - 7| = 8
When MAD is considered, you are indifferent between the two models.
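As a quick check of the arithmetic in Questions 4 and 5, the following minimal Python sketch (an addition, not part of the original assignment) recomputes both loss values for both models:

# Verify Questions 4 and 5: compare the two models under squared
# error loss and (total) absolute deviation.

A = [-1, -1, 1, 1]
B = [-1, 1, -1, 1]
Y = [3, 3, -1, 7]

pred_i = [3 + 4 * b for b in B]              # model (i):  y = 3 + 4B
pred_ii = [4 + a * b for a, b in zip(A, B)]  # model (ii): y = 4 + AB

def squared_error(pred, actual):
    return sum((p - y) ** 2 for p, y in zip(pred, actual))

def abs_deviation(pred, actual):
    # Sum of absolute deviations; dividing by n = 4 to get the mean
    # would not change which model is preferred.
    return sum(abs(p - y) for p, y in zip(pred, actual))

print(squared_error(pred_i, Y), squared_error(pred_ii, Y))  # 32 24
print(abs_deviation(pred_i, Y), abs_deviation(pred_ii, Y))  # 8 8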
Questions 6 through 8 pertain to the following description: A leading fashion store wants to predict the willingness of a customer to buy a shirt of a particular price category, based on customer data. The company strongly believes that the willingness of a customer to buy depends essentially on 3 factors: the gender of the customer (Male/Female), the type of car used by the customer (Sports/Family) and the shirt price category (Cheap/Expensive). From past history, the company has the data of 12 customers (gender, car type and shirt price category), along with whether each of them bought a shirt of that category.

6) What is the best feature to split on at the root level, if the splitting criterion is entropy?
Answer: Gender

7) Assume that you stop growing the tree when there are 2 or fewer data points in a leaf node. If you use entropy as the splitting criterion, how many leaf nodes will you end up with?
Answer: 4 or 5

8) Under the same stopping rule and splitting criterion, what is the prediction accuracy?
Answer: (11/12)*100 = 91.67%

Solution:

Here E(a, b) denotes the entropy of a node containing a points of one class and b points of the other:
E(a, b) = -(a/(a+b)) log2(a/(a+b)) - (b/(a+b)) log2(b/(a+b))

Step 1: Choosing the root split.

Attribute: Gender
               Male    Female
Will buy         5       0
Will not buy     3       4
Total            8       4      (12)
Entropy(Gender, Will buy) = (8/12)*E(5,3) + (4/12)*E(0,4) = 0.636

Attribute: Car Type
               Sports  Family
Will buy         3       2
Will not buy     3       4
Total            6       6      (12)
Entropy(Car Type, Will buy) = (6/12)*E(3,3) + (6/12)*E(2,4) = 0.959

Attribute: Shirt Price Category
               Cheap   Expensive
Will buy         3        2
Will not buy     3        4
Total            6        6     (12)
Entropy(Shirt Price Category, Will buy) = (6/12)*E(3,3) + (6/12)*E(2,4) = 0.959

Since Entropy(Gender, Will buy) is the lowest, the best feature to split on is Gender. Since "Gender = Female" has all 4 data points with class label "no" and 0 data points with class label "yes", you can assign it the class label "no". For "Gender = Male" you have to keep splitting further, as shown in Step 2 below. (A short sketch verifying these entropy values follows.)
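The following minimal Python sketch (an addition, not part of the original solution) recomputes the weighted entropy of each candidate split from the 12-customer data; the tuple encoding of the data is assumed for illustration:

# Weighted entropy of the "will buy" label after splitting on each attribute.
from collections import Counter
from math import log2

# The 12 customers: (gender, car type, shirt price category, will buy?)
data = [
    ("Male", "Sports", "Cheap", "no"),   ("Male", "Sports", "Expensive", "yes"),
    ("Male", "Family", "Cheap", "yes"),  ("Male", "Family", "Expensive", "no"),
    ("Male", "Sports", "Cheap", "yes"),  ("Male", "Sports", "Expensive", "yes"),
    ("Male", "Family", "Cheap", "yes"),  ("Male", "Family", "Expensive", "no"),
    ("Female", "Sports", "Cheap", "no"), ("Female", "Family", "Expensive", "no"),
    ("Female", "Sports", "Cheap", "no"), ("Female", "Family", "Expensive", "no"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, attr_index):
    # Weighted average entropy of the class label within each branch.
    total = len(rows)
    result = 0.0
    for v in set(r[attr_index] for r in rows):
        branch = [r[-1] for r in rows if r[attr_index] == v]
        result += (len(branch) / total) * entropy(branch)
    return result

for name, idx in [("Gender", 0), ("Car Type", 1), ("Shirt Price", 2)]:
    print(name, round(split_entropy(data, idx), 3))
# Gender 0.636, Car Type 0.959, Shirt Price 0.959 -> split on Gender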
Step 2: Further building the tree with Gender = Male.

Tree at root level:
Gender
  Male -> (split further below)
  Female -> Will not buy

Data with Gender = Male (customers 1 through 8):

ID   Car Type   Shirt Price Category   Will Buy?
1    Sports     Cheap                  no
2    Sports     Expensive              yes
3    Family     Cheap                  yes
4    Family     Expensive              no
5    Sports     Cheap                  yes
6    Sports     Expensive              yes
7    Family     Cheap                  yes
8    Family     Expensive              no

Attribute: Car Type (within Gender = Male)
               Sports  Family
Will buy         3       2
Will not buy     1       2
Total            4       4      (8)
Entropy(Car Type, Will buy) = (4/8)*E(3,1) + (4/8)*E(2,2) = 0.9056

Attribute: Shirt Price Category (within Gender = Male)
               Cheap   Expensive
Will buy         3        2
Will not buy     1        2
Total            4        4     (8)
Entropy(Shirt Price Category, Will buy) = (4/8)*E(3,1) + (4/8)*E(2,2) = 0.9056

Since both attributes have the same entropy value, you can choose either one to split on.

Building Decision Tree 1, splitting on Car Type:
Gender
  Male -> Car Type
    Sports -> (split further)
    Family -> (split further)
  Female -> Will not buy

Further building with Gender = Male, Car Type = Sports (customers 1, 2, 5, 6):
(Sports, Cheap, no), (Sports, Expensive, yes), (Sports, Cheap, yes), (Sports, Expensive, yes)

               Cheap   Expensive
Will buy         1        2
Will not buy     1        0
Total            2        2     (4)

Since "Shirt Price Category = Expensive" has both of its 2 data points with class label "yes", you can assign it the class label "yes". But "Shirt Price Category = Cheap" has 1 data point with class label "yes" and 1 with class label "no"; this leaf has a 50% chance of being class "yes" and a 50% chance of being class "no", so you can assign either class. If you assign "yes" to the node "Shirt Price Category = Cheap", the whole Sports branch predicts "yes" and needs no split, so you end up with 4 leaf nodes; if you assign "no", you keep the split and end up with 5 leaf nodes. Because of the stopping condition given in the question, we stop growing the tree further.

Further building with Gender = Male, Car Type = Family (customers 3, 4, 7, 8):
(Family, Cheap, yes), (Family, Expensive, no), (Family, Cheap, yes), (Family, Expensive, no)

               Cheap   Expensive
Will buy         2        0
Will not buy     0        2
Total            2        2     (4)

Since "Shirt Price Category = Cheap" has both of its 2 data points labelled "yes", assign "yes"; since "Shirt Price Category = Expensive" has both of its 2 data points labelled "no", assign "no". Because of the stopping condition given in the question, we stop growing the tree further.

Decision Tree 1 with the node "Gender = Male, Car Type = Sports, Shirt Price Category = Cheap" taking class "yes" (4 leaves):
Gender
  Male -> Car Type
    Sports -> Will buy
    Family -> Shirt Price Category
      Cheap -> Will buy
      Expensive -> Will not buy
  Female -> Will not buy

Decision Tree 1 with that node taking class "no" (5 leaves):
Gender
  Male -> Car Type
    Sports -> Shirt Price Category
      Cheap -> Will not buy
      Expensive -> Will buy
    Family -> Shirt Price Category
      Cheap -> Will buy
      Expensive -> Will not buy
  Female -> Will not buy

Building Decision Tree 2, splitting on Shirt Price Category:
Gender
  Male -> Shirt Price Category
    Cheap -> (split further)
    Expensive -> (split further)
  Female -> Will not buy

Further building with Gender = Male, Shirt Price Category = Cheap (customers 1, 3, 5, 7):
(Sports, Cheap, no), (Family, Cheap, yes), (Sports, Cheap, yes), (Family, Cheap, yes)

               Sports  Family
Will buy         1       2
Will not buy     1       0
Total            2       2      (4)

Since "Car Type = Family" has both of its 2 data points labelled "yes", you can assign it the class label "yes". But "Car Type = Sports" has 1 data point labelled "yes" and 1 labelled "no", so you can assign either class. If you assign class label "yes" to the node "Car Type = Sports", you end up with 4 leaf nodes; if you assign "no", you end up with 5 leaf nodes. Because of the stopping condition given in the question, we stop growing the tree further.

Further building with Gender = Male, Shirt Price Category = Expensive (customers 2, 4, 6, 8):
(Sports, Expensive, yes), (Family, Expensive, no), (Sports, Expensive, yes), (Family, Expensive, no)

               Sports  Family
Will buy         2       0
Will not buy     0       2
Total            2       2      (4)

Since "Car Type = Sports" has both of its 2 data points labelled "yes", assign "yes"; since "Car Type = Family" has both of its 2 data points labelled "no", assign "no". Because of the stopping condition given in the question, we stop growing the tree further.

Decision Tree 2 with the node "Gender = Male, Shirt Price Category = Cheap, Car Type = Sports" taking class "yes" (4 leaves):
Gender
  Male -> Shirt Price Category
    Cheap -> Will buy
    Expensive -> Car Type
      Sports -> Will buy
      Family -> Will not buy
  Female -> Will not buy

Decision Tree 2 with that node taking class "no" (5 leaves):
Gender
  Male -> Shirt Price Category
    Cheap -> Car Type
      Sports -> Will not buy
      Family -> Will buy
    Expensive -> Car Type
      Sports -> Will buy
      Family -> Will not buy
  Female -> Will not buy

Predictions of all four trees on the 12 customers (T1/T2 = Decision Tree 1/2; "tie=yes"/"tie=no" is the class assigned to the tied node):

ID   Gender   Car Type   Shirt Price   Will Buy?   T1, tie=yes   T1, tie=no   T2, tie=yes   T2, tie=no
1    Male     Sports     Cheap         no          yes           no           yes           no
2    Male     Sports     Expensive     yes         yes           yes          yes           yes
3    Male     Family     Cheap         yes         yes           yes          yes           yes
4    Male     Family     Expensive     no          no            no           no            no
5    Male     Sports     Cheap         yes         yes           no           yes           no
6    Male     Sports     Expensive     yes         yes           yes          yes           yes
7    Male     Family     Cheap         yes         yes           yes          yes           yes
8    Male     Family     Expensive     no          no            no           no            no
9    Female   Sports     Cheap         no          no            no           no            no
10   Female   Family     Expensive     no          no            no           no            no
11   Female   Sports     Cheap         no          no            no           no            no
12   Female   Family     Expensive     no          no            no           no            no

(Note that Tree 2 implements the same decision rule as Tree 1 for the corresponding tie assignment, so their predictions coincide.) In all 4 decision trees, exactly one customer is misclassified: customer 1 when the tied node takes "yes", and customer 5 when it takes "no". Therefore:

Prediction accuracy = (number of data points correctly classified / total number of data points)*100 = (11/12)*100 = 91.67%

(The sketch below checks this accuracy for the 5-leaf version of Decision Tree 1.)
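As a sanity check (again an addition, not part of the original solution), this minimal Python sketch encodes Decision Tree 1 with the tied node assigned "no" and confirms the 11/12 accuracy; the same tuple encoding of the data as in the previous sketch is assumed:

# Decision Tree 1 with the tied "Gender=Male, Car Type=Sports,
# Shirt Price Category=Cheap" node assigned "no"; measure its accuracy.

data = [  # (gender, car type, shirt price category, will buy?)
    ("Male", "Sports", "Cheap", "no"),   ("Male", "Sports", "Expensive", "yes"),
    ("Male", "Family", "Cheap", "yes"),  ("Male", "Family", "Expensive", "no"),
    ("Male", "Sports", "Cheap", "yes"),  ("Male", "Sports", "Expensive", "yes"),
    ("Male", "Family", "Cheap", "yes"),  ("Male", "Family", "Expensive", "no"),
    ("Female", "Sports", "Cheap", "no"), ("Female", "Family", "Expensive", "no"),
    ("Female", "Sports", "Cheap", "no"), ("Female", "Family", "Expensive", "no"),
]

def predict(gender, car, price):
    if gender == "Female":
        return "no"                     # all 4 female customers did not buy
    if car == "Sports":
        return "yes" if price == "Expensive" else "no"  # tied node -> "no"
    return "yes" if price == "Cheap" else "no"          # Family branch

correct = sum(predict(g, c, p) == label for g, c, p, label in data)
print(correct, "/", len(data))          # 11 / 12  ->  91.67% accuracy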
Questions 9 and 10 pertain to the following description: A university wants to predict the first-semester GPA of a student based on his performance in the various rounds of the MBA admission process. The admission process takes into consideration 3 scores: the CAT (Common Admission Test) score, the written exam score and the interview score, all given in percentiles. From the previous batch, the university has the scores (CAT, written test and interview) of 10 students, along with the GPA each of them scored at the end of their first semester.

Student ID   CAT Score   Written Exam Score   Interview Score   GPA
1            88          87                   90                9.7
2            88          87                   94                9.6
3            84          87                   85                9.6
4            86          85                   87                7.9
5            88          91                   96                9.4
6            82          89                   90                7.8
7            84          92                   86                9.5
8            86          89                   84                6.8
9            86          89                   88                9.0
10           87          89                   88                9.2

9) Perform a K-NN prediction at the point [CAT Score = 86, Written Exam Score = 89, Interview Score = 88]. Use a K value of 3. To determine the nearest neighbours, use Euclidean distance.
Hint: In a 3-dimensional input space, the Euclidean distance between point 1 with coordinates [a1, b1, c1] and point 2 with coordinates [a2, b2, c2] is sqrt((a2 - a1)^2 + (b2 - b1)^2 + (c2 - c1)^2). (The table below reports the squared distances; since the square root is monotonic, the nearest neighbours are the same.)
Answer: 9.3 (see the table below)

10) Perform a K-NN prediction at the same point [CAT Score = 86, Written Exam Score = 89, Interview Score = 88], again with K = 3, but use Manhattan distance to determine the nearest neighbours.
Hint: In a 3-dimensional input space, the Manhattan distance between point 1 with coordinates [a1, b1, c1] and point 2 with coordinates [a2, b2, c2] is |a1 - a2| + |b1 - b2| + |c1 - c2|.
Answer: 8.3 (see the table below)

Distances from the query point [86, 89, 88]:

Student ID   Manhattan Distance   Squared Euclidean Distance   GPA
1            6                    12                           9.7
2            10                   44                           9.6
3            7                    17                           9.6
4            5                    17                           7.9
5            12                   72                           9.4
6            6                    20                           7.8
7            7                    17                           9.5
8            4                    16                           6.8
9            0                    0                            9.0
10           1                    1                            9.2

Euclidean: the 3 nearest neighbours are students 9, 10 and 1 (squared distances 0, 1 and 12), with GPAs 9.0, 9.2 and 9.7.
Predicted GPA = (9.0 + 9.2 + 9.7)/3 = 9.3

Manhattan: the 3 nearest neighbours are students 9, 10 and 8 (distances 0, 1 and 4), with GPAs 9.0, 9.2 and 6.8.
Predicted GPA = (9.0 + 9.2 + 6.8)/3 = 8.33
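The following minimal Python sketch (an addition, not part of the original solution) reproduces both K-NN predictions from the table above:

# 3-NN regression at the query point, with both distance metrics.
from math import sqrt

# (CAT, written exam, interview, GPA) for the 10 students
students = [
    (88, 87, 90, 9.7), (88, 87, 94, 9.6), (84, 87, 85, 9.6),
    (86, 85, 87, 7.9), (88, 91, 96, 9.4), (82, 89, 90, 7.8),
    (84, 92, 86, 9.5), (86, 89, 84, 6.8), (86, 89, 88, 9.0),
    (87, 89, 88, 9.2),
]
query = (86, 89, 88)

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def knn_predict(metric, k=3):
    # Sort students by distance to the query, average the k nearest GPAs.
    nearest = sorted(students, key=lambda s: metric(s[:3], query))[:k]
    return sum(s[3] for s in nearest) / k

print(round(knn_predict(euclidean), 2))  # 9.3
print(round(knn_predict(manhattan), 2))  # 8.33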