ECE 8527: INTRODUCTION TO MACHINE LEARNING AND PATTERN
RECOGNITION
Exam NO. 1
Meysam Golmohammadi
Problem No. 1: Consider a two-class discrete distribution problem:
πœ”1 : {[0,0], [2,0], [2,2], [0,2]}
πœ”2 : {[1,1], [2,1], [1,2], [3,3]}
(20 pts) (a) Compute the minimum achievable error rate by a linear machine (hint: draw a picture of the
data). Assume the classes are equiprobable.
(10 pts) (b) Assume the priors for each class are: 𝑃(πœ”1) = 𝛼 and 𝑃(πœ”2) = 1 βˆ’ 𝛼. Sketch 𝑃(𝐸) as a
function of 𝛼 for a maximum likelihood classifier based on the assumption that each class is drawn from
a multivariate Gaussian distribution. Compare and contrast your answer with your answer to (a). Be very
specific in your sketch and label all critical points. Unlabeled plots will receive no partial credit.
(5 pts) (c) Assume you are not constrained to a linear machine. What is the minimum achievable error
rate that can be achieved for this data? Is this value different than (a)? If so, why? How might you
achieve such a solution? Compare and contrast this solution to (a).
Solution:
1.a)
Since we have no additional knowledge about the data, we invoke Occam's razor and use the simplest reasonable model: a Gaussian model for each class. We first compute the maximum likelihood mean vector and covariance matrix for every class:
Class 1:
πœ”1: {[0,0], [2,0], [2,2], [0,2]}
𝛍1 = [1, 1]
𝚺1 = [1 0; 0 1]
Class 2:
πœ”2: {[1,1], [2,1], [1,2], [3,3]}
𝛍2 = [1.75, 1.75]
𝚺2 = [0.69 0.44; 0.44 0.69]
A linear machine is a classifier that uses linear discriminant functions. Since the covariance matrices of the two classes differ, the Gaussian discriminant functions for this data are inherently quadratic and can be expressed as:
𝑔𝑖(𝐱) = 𝐱ᡀ 𝐖𝑖 𝐱 + 𝐰𝑖ᡀ 𝐱 + 𝑀𝑖0        (1.1)
where
𝐖𝑖 = βˆ’(1/2) πšΊπ‘–β»ΒΉ        (1.2)
𝐰𝑖 = πšΊπ‘–β»ΒΉ 𝛍𝑖        (1.3)
and
𝑀𝑖0 = βˆ’(1/2) 𝛍𝑖ᡀ πšΊπ‘–β»ΒΉ 𝛍𝑖 βˆ’ (1/2) ln|πšΊπ‘–| + ln 𝑃(πœ”π‘–)        (1.4)
The decision surfaces of this classifier are hyperquadrics; they are obtained from
𝑔𝑖(𝐱) = 𝑔𝑗(𝐱)        (1.5)
Using the MATLAB code presented below and equations (1.1) to (1.5), for class 1 we obtain:
𝐖1 = [βˆ’0.5 0; 0 βˆ’0.5]
𝐰1 = [1; 1]
𝑀10 = βˆ’1.69
g1 = - x1^2/2 + x1 - x2^2/2 + x2 - 1.69
Using the same approach for class 2 we have:
π–π’Š = [
βˆ’πŸ. 𝟐𝟐 𝟎. πŸ•πŸ–
]
𝟎. πŸ•πŸ– βˆ’πŸ. 𝟐𝟐
1.56
𝐰𝑖 = [
]
1.56
𝑀𝑖0 = βˆ’2.78
g2 = (14*x1)/9 + (14*x2)/9 - x1*((11*x1)/9 - (7*x2)/9) + x2*((7*x1)/9 - (11*x2)/9) – 2.78
Setting 𝑔1(𝐱) = 𝑔2(𝐱) as in (1.5), the decision boundary is:
x1*((11*x1)/9 - (7*x2)/9) - (5*x2)/9 - x1^2/2 - x2^2/2 - (5*x1)/9 - x2*((7*x1)/9 - (11*x2)/9) + 1.09=0
The plot of this boundary is shown in Fig. 1. The minimum error rate achievable by a linear machine on this data is 25%.
Fig. 1
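As a quick check of the 25% figure, consider the linear boundary x1 + x2 = 2.5 (an illustrative choice, not the unique optimum): it misclassifies only [2,2] from πœ”1 and [1,1] from πœ”2, i.e. 2 of the 8 points. The count can be verified directly:

% Count the errors of one linear boundary that achieves the 25% rate
w1 = [0 0; 2 0; 2 2; 0 2];              % class 1 samples
w2 = [1 1; 2 1; 1 2; 3 3];              % class 2 samples
f  = @(X) sum(X, 2) - 2.5;              % decide class 1 when f(x) < 0
errors = sum(f(w1) >= 0) + sum(f(w2) < 0);
errors / (size(w1,1) + size(w2,1))      % = 0.25

Two errors are also hard to avoid: [1,1] is the midpoint of both diagonals of the πœ”1 square, so any linear boundary that assigns [1,1] to πœ”2 must misclassify at least one πœ”1 point on each diagonal.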
The MATLAB code for the discriminant functions and the decision boundary is given below.
% ------------------------------------------------------------------------
% Author: Meysam Golmohammadi
% Date : 10/01/2015
% ------------------------------------------------------------------------
% Exam NO. 1
%% Problem 1. Part 1
% ------------------------------------------------------------------------
close all
clc
clear
% ------------------------------------------------------------------------
% define symbols for the two classes
%
syms x1 x2
x=[x1;x2];
% ------------------------------------------------------------------------
% calculate discriminant function for class 1
%
p_class1=0.5;
class1=[0 0; 2 0;0 2;2 2];
mean_class1=transpose(mean(class1));
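% maximum likelihood (biased) covariance: scale MATLAB's unbiased cov() by (n-1)/n = 3/4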
cov_class1=3/4*cov(class1);
wu1=-0.5*inv(cov_class1);
wl1=inv(cov_class1)*mean_class1;
wi01=-0.5*transpose(mean_class1)*inv(cov_class1)*mean_class1 ...
    -0.5*log(det(cov_class1))+log(p_class1);
g1=transpose(x)*wu1*x+transpose(wl1)*x+wi01;
% ------------------------------------------------------------------------
% calculate discriminant function for class 2
%
p_class2=0.5;
class2=[1 1; 2 1; 1 2; 3 3];
mean_class2=transpose(mean(class2));
cov_class2=3/4*cov(class2);
wu2=-0.5*inv(cov_class2)
wl2=inv(cov_class2)*mean_class2
wi02=-0.5*transpose(mean_class2)*inv(cov_class2)*mean_class2 ...
    -0.5*log(det(cov_class2))+log(p_class2)
g2=transpose(x)*wu2*x+transpose(wl2)*x+wi02
% ------------------------------------------------------------------------
% plot the curve and the data points
%
curv=g1-g2
ezplot(curv,[0 5 0 5])
hold on
x=[0 2 0 2];
y=[0 0 2 2];
plot(x,y,'*')
hold on
x=[1 2 1 3];
y=[1 1 2 3];
plot(x,y,'*')
axis square
Additionally, this problem was solved using a Java applet; the resulting boundary is shown in Fig. 2.
Fig. 2
1.b)
There are several possible answers to this problem depending on the assumptions we make. One possible sketch is shown in Fig. 3.
In general, when 𝑃(πœ”1) > 𝑃(πœ”2), i.e. the prior probability of class 1 is greater than that of class 2, the decision boundary moves toward the mean of class 2, and the posterior probability of class 1 increases. Conversely, when 𝑃(πœ”1) < 𝑃(πœ”2), the boundary moves toward the mean of class 1 and the posterior probability of class 2 increases.
Fig. 3
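This behavior follows directly from equation (1.4): the only term in each discriminant that depends on the priors is ln 𝑃(πœ”π‘–), so with 𝑃(πœ”1) = 𝛼 and 𝑃(πœ”2) = 1 βˆ’ 𝛼,
𝑔1(𝐱) βˆ’ 𝑔2(𝐱) = (terms independent of the priors) + ln(𝛼/(1 βˆ’ 𝛼))
Increasing 𝛼 adds a larger constant to 𝑔1 βˆ’ 𝑔2, which enlarges the region assigned to πœ”1 and pushes the decision boundary toward the mean of πœ”2, as described above.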
1.c)
Many different nonlinear boundaries can separate these two classes completely. Using the SVM algorithm in the Java applet, one such boundary is illustrated in Fig. 4, and the minimum achievable error rate on this data becomes zero. This is lower than the 25% of part (a) because the classifier is no longer restricted to a hyperplane. However, this amounts to overtraining: as more data are drawn from the same source, the error rate of such a classifier will tend to increase.
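A similar zero-training-error separation can be reproduced in MATLAB (a sketch, assuming the Statistics and Machine Learning Toolbox; the kernel settings below are illustrative, not the ones used by the applet):

% RBF-kernel SVM on the eight training points
X = [0 0; 2 0; 2 2; 0 2; 1 1; 2 1; 1 2; 3 3];
y = [1; 1; 1; 1; 2; 2; 2; 2];
model = fitcsvm(X, y, 'KernelFunction', 'rbf', 'KernelScale', 0.5, 'BoxConstraint', 1e6);
resubLoss(model)   % resubstitution (training) error; expected to be 0 here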
Problem No. 2: Suppose we have a random sample X1, X2,..., Xn where:
β€’ Xi = 0 if a randomly selected student does not own a laptop, and
β€’ Xi = 1 if a randomly selected student does own a laptop.
(35 pts) (a) Assuming that the Xi are independent Bernoulli random variables with unknown
parameter p:
𝑝(π‘₯; 𝑝) = (𝑝)π‘₯𝑖 (1 βˆ’ 𝑝)1βˆ’π‘₯𝑖
where π‘₯𝑖 = 0 π‘œπ‘Ÿ 1 and 0 < 𝑝 < 1. Find the maximum likelihood estimator of p, the proportion of
students who own a laptop.
Solution:
𝑃(π‘₯; 𝑝) = (𝑝)π‘₯𝑖 (1 βˆ’ 𝑝)1βˆ’π‘₯𝑖
The likelihood for a particular sequence of n samples is
𝑃(π‘₯1, …, π‘₯𝑛; 𝑝) = ∏_{𝑖=1}^{𝑛} 𝑝^(π‘₯𝑖) (1 βˆ’ 𝑝)^(1βˆ’π‘₯𝑖)
and the log-likelihood function is then
𝑙(𝑝) = βˆ‘_{𝑖=1}^{𝑛} [π‘₯𝑖 ln 𝑝 + (1 βˆ’ π‘₯𝑖) ln(1 βˆ’ 𝑝)]
To find the maximum of 𝑙(𝑝), we set βˆ‡π‘ 𝑙(𝑝) = 0 and get
βˆ‡π‘ 𝑙(𝑝) = (1/𝑝) βˆ‘_{𝑖=1}^{𝑛} π‘₯𝑖 βˆ’ (1/(1 βˆ’ 𝑝)) βˆ‘_{𝑖=1}^{𝑛} (1 βˆ’ π‘₯𝑖) = 0
This implies that
(1/𝑝) βˆ‘_{𝑖=1}^{𝑛} π‘₯𝑖 = (1/(1 βˆ’ 𝑝)) βˆ‘_{𝑖=1}^{𝑛} (1 βˆ’ π‘₯𝑖)
which can be rewritten as
(1 βˆ’ 𝑝) βˆ‘_{𝑖=1}^{𝑛} π‘₯𝑖 = 𝑝 (𝑛 βˆ’ βˆ‘_{𝑖=1}^{𝑛} π‘₯𝑖)
The final solution is then
𝑝̂ = (1/𝑛) βˆ‘_{𝑖=1}^{𝑛} π‘₯𝑖
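As a quick numerical sanity check of this result (a sketch; the sample below is hypothetical):

% Verify that the sample mean maximizes the Bernoulli log-likelihood
x = [1 0 1 1 0 1 0 1 1 1];        % hypothetical responses from n = 10 students
p_grid = 0.001:0.001:0.999;       % candidate values of p
loglik = sum(x)*log(p_grid) + (numel(x) - sum(x))*log(1 - p_grid);
[~, idx] = max(loglik);
[p_grid(idx) mean(x)]             % both equal 0.7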
Problem No. 3: Let’s assume you have a 2D Gaussian source which generates random vectors of the form
[π‘₯1 , π‘₯2 ]. You observe the following data: [1,1], [2,2], [3,3]. You were told the mean of this source was 0
and the standard deviation was 1.
(25 pts) (a) Using Bayesian estimation techniques, what is your best estimate of the mean based on
these observations?
(5 pts) (b) Now, suppose you observe a 4th value: [0,0]. How does this impact your estimate of the
mean? Explain, being as specific as possible. Support your explanation with calculations and equations.
Solution:
3.a)
Assuming a 2D Gaussian source, we first compute the sample mean vector and maximum likelihood covariance matrix of the observed data:
πœ”: {[1,1], [2,2], [3,3]}
𝛍̂𝒏 = [2, 2]
𝚺 = [0.67 0.67; 0.67 0.67]
In Bayesian estimation, the prior information is combined with the empirical information in the samples to obtain the posterior density p(ΞΌ|D). For 1D data the update equations are:
πœ‡π‘› = (
πœŽπ‘›2 =
π‘›πœŽ02
𝜎2
πœ‡Μ‚
+
)
(
) πœ‡0
𝑛
π‘›πœŽ02 + 𝜎 2
π‘›πœŽ02 + 𝜎 2
𝜎02 𝜎 2
π‘›πœŽ02 + 𝜎 2
For the multivariate case the corresponding equations are:
𝛍𝒏 = 𝚺𝟎 (𝚺𝟎 + (1/𝒏)𝚺)⁻¹ 𝛍̂𝒏 + (1/𝒏)𝚺 (𝚺𝟎 + (1/𝒏)𝚺)⁻¹ π›πŸŽ
πšΊπ’ = 𝚺𝟎 (𝚺𝟎 + (1/𝒏)𝚺)⁻¹ (1/𝒏)𝚺
Here we have:
π›πŸŽ = [0 0]
𝚺𝟎 = [
1 0
]
0 1
Since this is a multivariate problem, we use the multivariate equations above. The MATLAB code for this part can be found at the end of this solution; using these equations and the code we obtain:
𝛍𝒏 = [1.38, 1.38]
πšΊπ’ = [0.15 0.15; 0.15 0.15]
The covariance of the resulting predictive density p(x|D) is then
𝚺 + πšΊπ’ = [0.67 0.67; 0.67 0.67] + [0.15 0.15; 0.15 0.15] = [0.82 0.82; 0.82 0.82]
Our best estimate of the mean after these three observations is therefore 𝛍𝒏 = [1.38, 1.38], i.e. the sample mean [2, 2] shrunk toward the prior mean π›πŸŽ = [0, 0].
3.b)
πœ”: {[0,0], [1,1], [2,2], [3,3]}
𝛍̂𝒏 = [1.5, 1.5]
𝚺 = [1.25 1.25; 1.25 1.25]
Using the same equations and MATLAB code we now obtain:
𝛍𝒏 = [0.92, 0.92]
πšΊπ’ = [0.19 0.19; 0.19 0.19]
𝚺 + πšΊπ’ = [1.25 1.25; 1.25 1.25] + [0.19 0.19; 0.19 0.19] = [1.44 1.44; 1.44 1.44]
The additional observation [0,0] lowers the sample mean from [2, 2] to [1.5, 1.5], and the Bayesian estimate of the mean correspondingly drops from [1.38, 1.38] to [0.92, 0.92], moving toward the prior mean π›πŸŽ = [0, 0].
Roughly speaking, πœ‡π‘› represents our best guess for πœ‡ after observing n samples, and πœŽπ‘›2 measures our
uncertainty about this guess. Since πœŽπ‘›2 decreases monotonically with n β€” approaching 𝜎 2 /𝑛 as n
approaches infinity β€” each additional observation decreases our uncertainty about the true value of πœ‡.
As n increases, p(ΞΌ|D) becomes more and more sharply peaked, approaching a Dirac delta function as n
approaches infinity. This behavior is commonly known as Bayesian learning.
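For example, with 𝜎0 = 𝜎 = 1 the 1D expression above gives πœŽπ‘›Β² = 1/(𝑛 + 1), so the uncertainty about πœ‡ drops from 𝜎0Β² = 1 before any data are seen to 1/5 after the four observations of part (b).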
% ------------------------------------------------------------------------
% Author: Meysam Golmohammadi
% Date : 10/03/2015
% ------------------------------------------------------------------------
% Exam NO. 1
%% Problem 3. Part 1
% ------------------------------------------------------------------------
close all
clc
clear
% ------------------------------------------------------------------------
class1=[1 1; 2 2; 3 3];
n=size(class1,1)
mean_class1=transpose(mean(class1))
cov_class1=(n-1)/n*cov(class1)
mean0=transpose([0 0])
cov0=[1 0; 0 1];
meann=(cov0*inv(cov0+(1/n).*(cov_class1))*mean_class1)+ ...
    (1/n).*cov_class1*inv(cov0+(1/n).*(cov_class1))*mean0
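% covn is the posterior covariance of the mean (Sigma_n); cov = covn + cov_class1
% is the covariance of the predictive density p(x|D). (The name 'cov' shadows the
% built-in cov() function until the next 'clear'.)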
covn=1/n.*cov0*inv(cov0+(1/n).*(cov_class1))*cov_class1
cov=covn+cov_class1
%% Problem 3. Part 2
% ------------------------------------------------------------------------
close all
clc
clear
class1=[0 0; 1 1; 2 2; 3 3];
n=size(class1,1)
mean_class1=transpose(mean(class1))
cov_class1=(n-1)/n*cov(class1)
mean0=transpose([0 0])
cov0=[1 0; 0 1];
meann=(cov0*inv(cov0+(1/n).*(cov_class1))*mean_class1)+ ...
    (1/n).*cov_class1*inv(cov0+(1/n).*(cov_class1))*mean0
covn=1/n.*cov0*inv(cov0+(1/n).*(cov_class1))*cov_class1
cov=covn+cov_class1