Assignment 4

(Due on 2015-08-23, 23:55 IST)
1) Ridge or Lasso regression is used to:
Answer: Shrink the coefficients of the fitted model as compared to regular OLS multiple regression.
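As a quick illustration (not part of the original answer key), the sketch below fits OLS and ridge regression to the same synthetic data and compares coefficient norms; it assumes NumPy and scikit-learn are available, and the data is made up purely for the demonstration.

# Minimal sketch: ridge coefficients are shrunk toward zero
# relative to ordinary least squares on the same data.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                    # 50 samples, 5 inputs
y = X @ np.array([3.0, -2.0, 1.5, 0.5, 0.0]) + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)             # alpha controls the penalty

print("OLS coefficient norm:  ", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))  # smaller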
2) If you have 5 input variables, best subset selection would evaluate how many unique models?
Answer: 2⁵ = 32
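A quick way to convince yourself of the 2⁵ count is to enumerate all subsets of the 5 variables; the short Python sketch below (illustrative, not part of the original solution) does exactly that, counting the empty (intercept-only) model as well.

# Enumerate every subset of 5 candidate variables.
from itertools import combinations

variables = ["x1", "x2", "x3", "x4", "x5"]
subsets = [c for k in range(len(variables) + 1)
           for c in combinations(variables, k)]
print(len(subsets))  # 32, including the intercept-only (empty) model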
3) In decision trees, there can be only 1 decision split per attribute. Say True or False.
Answer: False
For the following data (tabulated below), which of the following two prediction equations better describes the data?
y = 3 + 4B --- model (i)
y = 4 + AB --- model (ii)
4) Using a squared error loss, which model do you prefer?
Answer: Model (ii)
5) For the same data and models as in Question 4, which model do you prefer if you were using mean absolute deviation (MAD)?
Answer: Indifferent between Model (i) and Model (ii)
 A    B    Y   AB   Predicted Y,   Predicted Y,
                    model (i)      model (ii)
-1   -1    3    1   -1             5
-1    1    3   -1    7             3
 1   -1   -1   -1   -1             3
 1    1    7    1    7             5

Squared error:
Model (i):  (−1 − 3)² + (7 − 3)² + (−1 + 1)² + (7 − 7)² = 32
Model (ii): (5 − 3)² + (3 − 3)² + (3 + 1)² + (5 − 7)² = 24
When squared error is considered, Model (ii) is better.

Mean absolute deviation:
Model (i):  |−1 − 3| + |7 − 3| + |−1 + 1| + |7 − 7| = 8
Model (ii): |5 − 3| + |3 − 3| + |3 + 1| + |5 − 7| = 8
When MAD is considered, both models are equally good (indifferent).
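The arithmetic above can be checked with a few lines of Python; the sketch below (illustrative, not part of the original solution) recomputes the squared-error and absolute-deviation totals for both models.

# Verify the squared-error and absolute-deviation totals above.
A = [-1, -1, 1, 1]
B = [-1, 1, -1, 1]
Y = [3, 3, -1, 7]

pred_i = [3 + 4 * b for b in B]               # model (i):  y = 3 + 4B
pred_ii = [4 + a * b for a, b in zip(A, B)]   # model (ii): y = 4 + AB

for name, pred in [("model (i) ", pred_i), ("model (ii)", pred_ii)]:
    sse = sum((p - y) ** 2 for p, y in zip(pred, Y))  # squared error
    sad = sum(abs(p - y) for p, y in zip(pred, Y))    # sum of |deviations|
    print(name, "squared error =", sse, " absolute deviation =", sad)
# model (i)  squared error = 32  absolute deviation = 8
# model (ii) squared error = 24  absolute deviation = 8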
Questions 6 through 8 pertain to the following description:
A leading fashion store wants to predict the willingness of a customer to buy a shirt of a particular price category based on customer data. The company strongly believes that the willingness of a customer to buy depends essentially on 3 factors: the gender of the customer (Male/Female), the type of car used by the customer (Sports/Family), and the shirt price category (Cheap/Expensive).
From past history, the company has the data of 12 customers (gender, type of car used, and shirt price category), along with whether they bought the shirt of that category. The full data table is given at the end of this solution.
6) What is the best feature to split on at the root level, if the splitting criterion is entropy?
Answer: Gender
7) Assume that you stop growing the tree when there are 2 or fewer data points in a leaf node. If you use entropy as the splitting criterion, how many leaf nodes will you end up with?
Answer: 4 or 5
8) Assume that you stop growing the tree when there are 2 or fewer data points in a leaf node. If you use entropy as the splitting criterion, what is the prediction accuracy?
Answer: (11/12)*100 = 91.67 %
Solution:
Step 1
Attribute: Gender
              Male  Female
Will buy        5      0
Will not buy    3      4
Total           8      4    (12)
Entropy(Gender, Will buy) = (8/12) E(5,3) + (4/12) E(0,4) = 0.636

Attribute: Car Type
              Sports  Family
Will buy         3       2
Will not buy     3       4
Total            6       6    (12)
Entropy(Car Type, Will buy) = (6/12) E(3,3) + (6/12) E(2,4) = 0.959

Attribute: Shirt Price Category
              Cheap  Expensive
Will buy        3        2
Will not buy    3        4
Total           6        6    (12)
Entropy(Shirt Price Category, Will buy) = (6/12) E(3,3) + (6/12) E(2,4) = 0.959

Summary:
Entropy(Gender, Will buy) = 0.636
Entropy(Car Type, Will buy) = 0.959
Entropy(Shirt Price Category, Will buy) = 0.959
Since Entropy(Gender, Will buy) is the lowest, the best feature to split on is Gender.
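The entropy numbers above can be reproduced with a short Python sketch (illustrative, not part of the original solution); E(p, n) is the usual binary entropy of a node with p "will buy" and n "will not buy" points.

# Reproduce the weighted-entropy values for the three candidate splits.
from math import log2

def E(p, n):
    """Binary entropy of a node with p positives and n negatives."""
    total = p + n
    result = 0.0
    for c in (p, n):
        if c:  # a count of 0 contributes nothing
            result -= (c / total) * log2(c / total)
    return result

gender = (8 / 12) * E(5, 3) + (4 / 12) * E(0, 4)
car = (6 / 12) * E(3, 3) + (6 / 12) * E(2, 4)
price = (6 / 12) * E(3, 3) + (6 / 12) * E(2, 4)
print(round(gender, 3), round(car, 3), round(price, 3))  # 0.636 0.959 0.959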
Since "Gender = Female" has all 4 data points with class label "no" and 0 data points with class label "yes", you can assign it the class label "no". But for the other value of the attribute Gender ("Male"), you have to keep splitting further, as shown below.
Tree at root level:
Gender
├─ Male → (split further)
└─ Female → Will not buy
Step 2: Further building the tree with Gender = Male
ID  Gender  Car Type  Shirt Price Category  Will Buy?
1   Male    Sports    Cheap                 no
2   Male    Sports    Expensive             yes
3   Male    Family    Cheap                 yes
4   Male    Family    Expensive             no
5   Male    Sports    Cheap                 yes
6   Male    Sports    Expensive             yes
7   Male    Family    Cheap                 yes
8   Male    Family    Expensive             no
Attribute: Car Type (within Gender = Male)
              Sports  Family
Will buy         3       2
Will not buy     1       2
Total            4       4    (8)
Entropy(Car Type, Will buy) = (4/8) E(3,1) + (4/8) E(2,2) = 0.9056

Attribute: Shirt Price Category (within Gender = Male)
              Cheap  Expensive
Will buy        3        2
Will not buy    1        2
Total           4        4    (8)
Entropy(Shirt Price Category, Will buy) = (4/8) E(3,1) + (4/8) E(2,2) = 0.9056
Since both attributes have the same entropy value, you can choose either attribute to split on.
Building Decision Tree 1, taking the attribute Car Type:
Gender
├─ Male → Car Type
│   ├─ Sports → (split further)
│   └─ Family → (split further)
└─ Female → Will not buy
Further building the tree with Gender = Male, Car Type = Sports:
Gender  Car Type  Shirt Price Category  Will Buy?
Male    Sports    Cheap                 no
Male    Sports    Expensive             yes
Male    Sports    Cheap                 yes
Male    Sports    Expensive             yes

              Cheap  Expensive
Will buy        1        2
Will not buy    1        0
Total           2        2    (4)
Since "Shirt Price Category = Expensive" has both data points with class label "yes" and none with class label "no", you can assign it the class label "yes". But "Shirt Price Category = Cheap" has 1 data point with class label "yes" and 1 with class label "no"; this leaf node has a 50% chance of being class "yes" and a 50% chance of being class "no", so you can assign either "yes" or "no".
If you assign the class label "yes" to the node "Shirt Price Category = Cheap", both children of the Sports node predict "Will buy", so the Sports node need not be split and you end up with 4 leaf nodes. If you assign the class label "no", you end up with 5 leaf nodes.
Because of the stopping condition given in the question (2 or fewer data points per leaf), we stop growing the tree further.
Further building the tree with Gender = Male, Car Type = Family:
Gender  Car Type  Shirt Price Category  Will Buy?
Male    Family    Cheap                 yes
Male    Family    Expensive             no
Male    Family    Cheap                 yes
Male    Family    Expensive             no

              Cheap  Expensive
Will buy        2        0
Will not buy    0        2
Total           2        2    (4)
Since "Shirt Price Category = Cheap" has both data points with class label "yes" and none with class label "no", you can assign it the class label "yes". And since "Shirt Price Category = Expensive" has both data points with class label "no" and none with class label "yes", you can assign it the class label "no". Because of the stopping condition given in the question, we stop growing the tree further.
Decision Tree 1 with node "Gender = Male, Car Type = Sports, Shirt Price Category = Cheap" taking class "yes" (4 leaves):
Gender
├─ Male → Car Type
│   ├─ Sports → Will buy
│   └─ Family → Shirt Price Category
│       ├─ Cheap → Will buy
│       └─ Expensive → Will not buy
└─ Female → Will not buy
Decision Tree 1 with node "Gender = Male, Car Type = Sports, Shirt Price Category = Cheap" taking class "no" (5 leaves):
Gender
├─ Male → Car Type
│   ├─ Sports → Shirt Price Category
│   │   ├─ Cheap → Will not buy
│   │   └─ Expensive → Will buy
│   └─ Family → Shirt Price Category
│       ├─ Cheap → Will buy
│       └─ Expensive → Will not buy
└─ Female → Will not buy
Building Decision Tree 2, taking the attribute Shirt Price Category:
Gender
├─ Male → Shirt Price Category
│   ├─ Cheap → (split further)
│   └─ Expensive → (split further)
└─ Female → Will not buy
Further building the tree with Gender = Male, Shirt Price Category = Cheap:
Gender  Car Type  Shirt Price Category  Will Buy?
Male    Sports    Cheap                 no
Male    Family    Cheap                 yes
Male    Sports    Cheap                 yes
Male    Family    Cheap                 yes

              Sports  Family
Will buy         1       2
Will not buy     1       0
Total            2       2    (4)
Since "Car Type = Family" has both data points with class label "yes" and none with class label "no", you can assign it the class label "yes". But "Car Type = Sports" has 1 data point with class label "yes" and 1 with class label "no"; this leaf node has a 50% chance of being class "yes" and a 50% chance of being class "no", so you can assign either "yes" or "no".
If you assign the class label "yes" to the node "Car Type = Sports", you will end up with 4 leaf nodes. If you assign the class label "no", you will end up with 5 leaf nodes.
Because of the stopping condition given in the question, we stop growing the tree further.
Further building the tree with Gender = Male, Shirt Price Category = Expensive:
Gender  Car Type  Shirt Price Category  Will Buy?
Male    Sports    Expensive             yes
Male    Family    Expensive             no
Male    Sports    Expensive             yes
Male    Family    Expensive             no

              Sports  Family
Will buy         2       0
Will not buy     0       2
Total            2       2    (4)
Since "Car Type = Sports" has both data points with class label "yes" and none with class label "no", you can assign it the class label "yes". And since "Car Type = Family" has both data points with class label "no" and none with class label "yes", you can assign it the class label "no". Because of the stopping condition given in the question, we stop growing the tree further.
Decision Tree 2 with node "Gender = Male, Shirt Price Category = Cheap, Car Type = Sports" taking class "yes" (4 leaves):
Gender
├─ Male → Shirt Price Category
│   ├─ Cheap → Will buy
│   └─ Expensive → Car Type
│       ├─ Sports → Will buy
│       └─ Family → Will not buy
└─ Female → Will not buy
Decision Tree 2 with node "Gender = Male, Shirt Price Category = Cheap, Car Type = Sports" taking class "no" (5 leaves):
Gender
├─ Male → Shirt Price Category
│   ├─ Cheap → Car Type
│   │   ├─ Sports → Will not buy
│   │   └─ Family → Will buy
│   └─ Expensive → Car Type
│       ├─ Sports → Will buy
│       └─ Family → Will not buy
└─ Female → Will not buy
The full customer data:
Customer ID  Gender  Car Type  Shirt Price Category  Will Buy?
1            Male    Sports    Cheap                 no
2            Male    Sports    Expensive             yes
3            Male    Family    Cheap                 yes
4            Male    Family    Expensive             no
5            Male    Sports    Cheap                 yes
6            Male    Sports    Expensive             yes
7            Male    Family    Cheap                 yes
8            Male    Family    Expensive             no
9            Female  Sports    Cheap                 no
10           Female  Family    Expensive             no
11           Female  Sports    Cheap                 no
12           Female  Family    Expensive             no
Class predictions for the 12 customers under the four trees (Tree 1 with "Shirt Price Category = Cheap" taking class "yes" or "no"; Tree 2 with "Car Type = Sports" taking class "yes" or "no"):

Customer  Actual  Tree 1,        Tree 1,       Tree 2,         Tree 2,
ID                "Cheap" = yes  "Cheap" = no  "Sports" = yes  "Sports" = no
1         no      yes            no            yes             no
2         yes     yes            yes           yes             yes
3         yes     yes            yes           yes             yes
4         no      no             no            no              no
5         yes     yes            no            yes             no
6         yes     yes            yes           yes             yes
7         yes     yes            yes           yes             yes
8         no      no             no            no              no
9         no      no             no            no              no
10        no      no             no            no               no
11        no      no             no            no              no
12        no      no             no            no              no

In all 4 decision trees, exactly one customer is misclassified (customer 1 for the "yes" trees, customer 5 for the "no" trees), so
Prediction accuracy = (number of data points correctly classified / total number of data points)*100
= (11/12)*100 = 91.67%
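As a check (illustrative, not part of the original solution), the sketch below hard-codes one of the four trees, Decision Tree 1 with the "Sports" node taking class "yes", as a plain Python function and confirms the 11/12 accuracy; the other three trees give the same count.

# Encode Decision Tree 1 ("Sports" node -> "yes") and score it on the data.
data = [
    ("Male", "Sports", "Cheap", "no"),
    ("Male", "Sports", "Expensive", "yes"),
    ("Male", "Family", "Cheap", "yes"),
    ("Male", "Family", "Expensive", "no"),
    ("Male", "Sports", "Cheap", "yes"),
    ("Male", "Sports", "Expensive", "yes"),
    ("Male", "Family", "Cheap", "yes"),
    ("Male", "Family", "Expensive", "no"),
    ("Female", "Sports", "Cheap", "no"),
    ("Female", "Family", "Expensive", "no"),
    ("Female", "Sports", "Cheap", "no"),
    ("Female", "Family", "Expensive", "no"),
]

def tree1_yes(gender, car, price):
    if gender == "Female":
        return "no"
    if car == "Sports":
        return "yes"  # the 50/50 Sports/Cheap node resolved to "yes"
    return "yes" if price == "Cheap" else "no"

correct = sum(tree1_yes(g, c, p) == label for g, c, p, label in data)
print(correct, "/", len(data))  # 11 / 12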
Questions 9 and 10 pertain to the following description:
A university wants to predict the first-semester GPA of a student based on his performance in the various rounds of its MBA admission process. The admission process takes into consideration 3 scores essentially: the CAT (Common Admission Test) score, the written exam score, and the interview score. The scores are given in percentiles.
From the previous batch, the university has the scores (CAT, written test, and interview) of 10 students, along with the GPA they scored at the end of their first semester.
9) Perform a K-NN prediction at the point [CAT score = 86, written exam score = 89, interview score = 88]. Use a K value of 3.
To determine the nearest neighbours, use Euclidean distance. Hint: in a 3-dimensional input space, if point 1 has coordinates [a1, b1, c1] and point 2 has coordinates [a2, b2, c2], the squared Euclidean distance between them is (a2 − a1)² + (b2 − b1)² + (c2 − c1)²; since the square root is monotonic, the squared distance gives the same neighbour ranking as the true Euclidean distance.
Answer: 9.3 (see the detailed working below)
10) Perform a K-NN prediction at the point [CAT score = 86, written exam score = 89, interview score = 88]. Use a K value of 3.
To determine the nearest neighbours, use Manhattan distance. Hint: in a 3-dimensional input space, if point 1 has coordinates [a1, b1, c1] and point 2 has coordinates [a2, b2, c2], the Manhattan distance between them is |a1 − a2| + |b1 − b2| + |c1 − c2|.
Answer: 8.33 (see the detailed working below)
Query point: CAT score = 86, written exam score = 89, interview score = 88.

Student  CAT    Written     Interview  GPA  Manhattan  Squared Euclidean
ID       Score  Exam Score  Score           distance   distance
1        88     87          90         9.7    6        12
2        88     87          94         9.6   10        44
3        84     87          85         9.6    7        17
4        86     85          87         7.9    5        17
5        88     91          96         9.4   12        72
6        82     89          90         7.8    6        20
7        84     92          86         9.5    7        17
8        86     89          84         6.8    4        16
9        86     89          88         9      0         0
10       87     89          88         9.2    1         1

Manhattan: the 3 nearest neighbours are students 9, 10 and 8 (distances 0, 1, 4), with GPAs 9, 9.2 and 6.8.
Predicted value (Manhattan) = (6.8 + 9 + 9.2)/3 = 8.33

Euclidean: the 3 nearest neighbours are students 9, 10 and 1 (squared distances 0, 1, 12), with GPAs 9, 9.2 and 9.7.
Predicted value (Euclidean) = (9.7 + 9 + 9.2)/3 = 9.3
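The sketch below (illustrative, not part of the original solution) recomputes both 3-NN predictions; it uses the squared Euclidean distance from the hint, which ranks neighbours identically to the true Euclidean distance.

# 3-NN regression at the query point with both distance measures.
query = (86, 89, 88)
students = [  # (CAT, written, interview, GPA)
    (88, 87, 90, 9.7), (88, 87, 94, 9.6), (84, 87, 85, 9.6),
    (86, 85, 87, 7.9), (88, 91, 96, 9.4), (82, 89, 90, 7.8),
    (84, 92, 86, 9.5), (86, 89, 84, 6.8), (86, 89, 88, 9.0),
    (87, 89, 88, 9.2),
]

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def sq_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn_predict(dist, k=3):
    # Average the GPAs of the k students closest to the query point.
    neighbours = sorted(students, key=lambda s: dist(s[:3], query))[:k]
    return sum(s[3] for s in neighbours) / k

print(round(knn_predict(sq_euclidean), 2))  # 9.3
print(round(knn_predict(manhattan), 2))     # 8.33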