
Machine Learning (40717)
Fall 2008
From: 1387/7/25
Due: 1387/8/11
Assignment 1
Computer Engineering, Sharif University of Technology

Notes:
Take your paper solution to the teacher, or upload your document file in .doc, .pdf, or .docx format at Sharif courseware. Your file name should be as follows:
"HWnumber_StudentNumber.pdf", i.e. HW1_87111111.pdf and so on.
Send your questions to rmnasiri@gmail.com.
Part 1: Concept Learning
1. Do these problems from the text book: 2.4, 2.7, 2.9
Part 2: Decision Tree
1. Do these problems from the text book: 3.2, 3.4(a, b, c)
2. Imagine that we have a memoryless source W of English text, emitting words from an alphabet A = {w1, w2, ..., wn} according to a distribution P (independent random selection from A with distribution P). The entropy of the source W is

H(W) = − Σi P(wi) log2 P(wi)

This is the average number of bits required to specify each word in a sequence of words output by W, assuming P is known. Show that H(W) ≤ log2 n.
Now suppose we don't know P. By observing the output of W for a long time, we can come up with a guess at P, which we'll call Q. The cross-entropy of P given Q is defined as

HQ(W) = − Σi P(wi) log2 Q(wi)

which is the average number of bits required to specify each word in the output from W, under the assumption that the source emits according to the distribution Q. Please show that H(W) ≤ HQ(W).
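To make the two definitions concrete (this is only an illustration, not the requested proof), here is a minimal Python sketch that computes H(W) and HQ(W) for a small, arbitrarily assumed 4-word alphabet and checks numerically that H(W) ≤ log2 n and H(W) ≤ HQ(W):

import math

# Hypothetical distributions over a 4-word alphabet (assumed for illustration only).
P = [0.50, 0.25, 0.15, 0.10]   # true emission distribution of the source W
Q = [0.40, 0.30, 0.20, 0.10]   # our guess at P, estimated from observed output

def entropy(p):
    # H(W) = -sum_i p_i * log2(p_i): average bits per word when P is known
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # HQ(W) = -sum_i p_i * log2(q_i): average bits per word when coding with Q
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

n = len(P)
H = entropy(P)
HQ = cross_entropy(P, Q)
print(f"H(W)  = {H:.4f} bits   (log2 n = {math.log2(n):.4f})")
print(f"HQ(W) = {HQ:.4f} bits")
assert H <= math.log2(n) + 1e-12 and H <= HQ + 1e-12

Running the sketch for any valid P and Q will satisfy both inequalities; the proofs themselves are what the problem asks for.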
3. A student said that he could build a discrete data set with two classes, from which I would be allowed to choose some (but not all) of the instances to train a Decision Tree. He claimed that no matter how I choose the data and train the tree, the classification error that my Decision Tree makes on the instances not contained in the training set will be no less than 50%. Do you think he is correct? Explain why, or give an example. Is this obviously true (or false)?
4. Assume we are investigating the relationship between the weight and age of a graduate student and the degree he or she is studying for (Masters or Ph.D.). The following table shows some data we collected at Sharif (weight in kg).
Degree   Weight (kg)   Age
P        60            24
M        72            23
M        87            25
P        70            27
P        72            23
M        75            24
M        90            22
M        80            22
P        78            26
Is it reasonable to design a decision tree that uses Weight as the splitting attribute, with one branch for each distinct value (effectively one branch per person)? If it is not, what is wrong with such a decision tree? If it is, describe your strategy for limiting the tree size.
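As a computational aid (not the answer to the question), the following minimal Python sketch applies the usual entropy and information-gain formulas to the table above, treating each distinct attribute value as its own branch; the function names are my own:

import math
from collections import Counter, defaultdict

# The table above, as (Degree, Weight, Age) rows.
data = [("P", 60, 24), ("M", 72, 23), ("M", 87, 25), ("P", 70, 27), ("P", 72, 23),
        ("M", 75, 24), ("M", 90, 22), ("M", 80, 22), ("P", 78, 26)]

def entropy(labels):
    # entropy of the class labels (Degree) in a subset of rows
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr_index):
    # information gain of splitting with one branch per distinct attribute value
    labels = [r[0] for r in rows]
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr_index]].append(r[0])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

print("Gain(Weight) =", round(info_gain(data, 1), 3))
print("Gain(Age)    =", round(info_gain(data, 2), 3))

Comparing these gains with what such a tree would do on unseen students is the point of the question.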
5. Does the ID3 algorithm always produce an optimal decision tree?
If your answer is "Yes", prove the claim mathematically. Otherwise, give a counterexample and discuss the pros and cons of such greedy heuristics. (A decision tree with minimum height and minimum number of comparisons is considered an optimal decision tree.)
6. Linear Classifier & Decision Tree for continuous attributes
Assume training data with two continuous attributes x1 and x2. For the data plotted in Figure 1.a we can design a decision tree with a single node (a single-node, or depth-1, decision tree), for example a tree whose root splits the data with the test x1 > 3 (Figure 1.b). These degenerate trees, consisting of only one node and therefore using only one variable to split the data, are called decision stumps (a minimal code sketch of such a stump is given after Figure 1).
Figure 1: (a) training data, (b) single-node decision tree.
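Here is the promised minimal Python sketch of a decision stump; the function names, the exhaustive threshold search, and the toy points standing in for Figure 1.a are my own assumptions for illustration:

# Minimal decision-stump sketch: one attribute, one threshold, two leaves.
def train_stump(X, y):
    # Exhaustively pick the (feature, threshold) pair with the fewest training errors.
    # X is a list of (x1, x2) points, y a list of 0/1 labels.
    best = None
    for f in range(len(X[0])):                      # candidate feature to test
        for t in sorted({x[f] for x in X}):         # candidate thresholds taken from the data
            for pos_side in (0, 1):                 # which class the '>' branch predicts
                pred = [pos_side if x[f] > t else 1 - pos_side for x in X]
                errors = sum(p != yi for p, yi in zip(pred, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, pos_side)
    return best[1:]                                 # (feature, threshold, label on '>' side)

def predict_stump(stump, x):
    f, t, pos_side = stump
    return pos_side if x[f] > t else 1 - pos_side

# Toy data separable by a test on x1, in the spirit of Figure 1.
X = [(1, 2), (2, 5), (4, 1), (5, 4)]
y = [0, 0, 1, 1]
stump = train_stump(X, y)
print("learned stump:", stump)                      # a split on feature 0 (x1)
print([predict_stump(stump, x) for x in X])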
Q3.1
With stump nodes in a decision tree, discuss what kinds of distributions can be classified by a decision tree of depth 1 and of depth 2, for training data with two continuous attributes x1, x2.
Q3.2
Discuss the above question in the n-dimensional case (training data with n attributes x1, …, xn).
A single-node decision tree cannot classify every distribution, as seen in Figure 2.a. Now, assuming training data as in Figure 2.a, we can classify the points with a single line; we call this property "linearly separable". As you can see, the distribution in Figure 2.b cannot be classified by a 2D parametric surface (a separating line, or decision surface). A minimal linear-classifier sketch is given after Figure 2.
Figure 2: (a) linearly separable data, (b) linearly non-separable data.
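Here is the promised sketch: a plain perceptron (a standard linear classifier, used here only as an illustration) that finds a separating line w·x + b = 0 on linearly separable data in the spirit of Figure 2.a; the toy points are assumed, since the actual figure is not reproduced here. On data like Figure 2.b the same loop never reaches zero mistakes.

# Minimal perceptron sketch: learns a separating line w.x + b = 0 for 2D data.
def train_perceptron(X, y, epochs=100, lr=1.0):
    # X: list of (x1, x2); y: labels in {-1, +1}.
    # Returns (w, b), or None if no separating line is found within the epoch budget.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), yi in zip(X, y):
            if yi * (w[0] * x1 + w[1] * x2 + b) <= 0:   # misclassified point
                w[0] += lr * yi * x1
                w[1] += lr * yi * x2
                b += lr * yi
                mistakes += 1
        if mistakes == 0:                               # all points on the correct side
            return w, b
    return None

# Linearly separable toy data (assumed, standing in for Figure 2.a).
X = [(1, 1), (2, 1), (4, 4), (5, 3)]
y = [-1, -1, +1, +1]
print(train_perceptron(X, y))   # one (w, b) defining a separating line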
Q3.3
Decision trees of arbitrary depth can capture more complex decision surfaces (such as those shown in Figure 2.b). Sketch or plot a small 2D data set which is completely separable using a decision tree of arbitrary depth (using decision stumps at each node) but cannot be completely separated using a single linear classifier. Include in your sketch the decision surface obtained by the decision tree.
Q3.4
Now sketch an algorithm that trains a linear-classifier-augmented decision tree: at each node, instead of testing a particular feature to split the data, we use the output of a linear classifier (i.e., a separating-line equation as the decision rule at each node). Positively classified examples go to one branch of the tree, while negatively classified ones follow the other branch.
- Apply this algorithm to a sample 2D training set as in the previous question.
- Discuss the classification power of this decision tree.