Solutions to extra exercises STK2100
Vinnie Ko
May 22, 2017
Exercise 1
(a)
i) Least squares
The least squares method: find the parameter values that minimize the residual sum of squares (RSS).
$$\begin{aligned}
\mathrm{RSS} &= \sum_{i=1}^{n} r_i^2\\
&= \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 && \text{(by the definition of residual)}\\
&= \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2
\end{aligned}$$
So, the least squares estimator of β is
$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2. \tag{1}$$
Or, by using matrix algebra,
$$\begin{aligned}
\mathrm{RSS} &= \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2\\
&= \big\lVert y - X\beta \big\rVert^2 && \text{where } X \in \mathbb{R}^{n\times(p+1)},\ y \in \mathbb{R}^{n},\ \beta \in \mathbb{R}^{p+1}\\
&= (y - X\beta)^T (y - X\beta),
\end{aligned}$$
which leads us to
$$\hat{\beta} = \operatorname*{arg\,min}_{\beta}\,(y - X\beta)^T (y - X\beta). \tag{2}$$
ii) Maximum likelihood
The error term in linear regression is defined as
$$\varepsilon_1, \dots, \varepsilon_n \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2).$$
By adding $x_i^T\beta \ \big(= \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j\big)$ to each $\varepsilon_i$, we obtain
$$x_1^T\beta + \varepsilon_1, \dots, x_n^T\beta + \varepsilon_n \sim N(X\beta, \sigma^2 I).$$
We can now write the likelihood function by using independence:
$$L = \prod_{i=1}^{n} f(y_i \mid x_i, \beta, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y_i - x_i^T\beta)^2}{2\sigma^2}\right).$$
Log-likelihood:
$$\begin{aligned}
l = \log(L) &= \sum_{i=1}^{n}\left(-\frac{1}{2}\log\big(2\pi\sigma^2\big) - \frac{(y_i - x_i^T\beta)^2}{2\sigma^2}\right)\\
&= -\frac{n}{2}\log\big(2\pi\sigma^2\big) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - x_i^T\beta\big)^2\\
&= -\frac{n}{2}\log\big(2\pi\sigma^2\big) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^2
\end{aligned} \tag{3}$$
To find the maximum likelihood estimator of β, we have to maximize equation (3) with respect to β. It is not difficult to see that maximizing (3) with respect to β is the same as minimizing $\sum_{i=1}^{n}\big(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\big)^2$ with respect to β, as defined in (1) and (2).
Therefore, the maximum likelihood estimator is the same as the least squares estimator in this case.
(b)
From (a), we have
$$\hat{\beta}_{\mathrm{MLE}} = \hat{\beta}_{\mathrm{RSS}} = \operatorname*{arg\,min}_{\beta}\,(y - X\beta)^T(y - X\beta).$$
Do some matrix algebra:
$$\mathrm{RSS} = (y - X\beta)^T(y - X\beta) = (y^T - \beta^T X^T)(y - X\beta) = y^T y - y^T X\beta - \beta^T X^T y + \beta^T X^T X\beta.$$
Differentiate this with respect to β:
$$\frac{\partial\,\mathrm{RSS}}{\partial\beta} = \frac{\partial\big(y^T y - y^T X\beta - \beta^T X^T y + \beta^T X^T X\beta\big)}{\partial\beta} = 0 - X^T y - X^T y + \big(X^T X + (X^T X)^T\big)\beta = -2X^T y + 2X^T X\beta$$
This first derivative should equal 0. So,
$$-2X^T y + 2X^T X\beta = 0 \;\Longrightarrow\; X^T X\beta = X^T y \;\Longrightarrow\; \beta = (X^T X)^{-1}X^T y. \tag{4}$$
Therefore, the maximum likelihood estimate for β is
$$\hat{\beta} = (X^T X)^{-1}X^T y,$$
which is also the least squares estimate.
To obtain expression (4), the following matrix calculus rules are used. Let the scalar α be defined by $\alpha = b^T A x$, where b and A are not functions of x. Then
$$\frac{\partial\, b^T A x}{\partial x} = A^T b, \qquad \frac{\partial\, x^T A b}{\partial x} = A b, \qquad \frac{\partial\, x^T A x}{\partial x} = (A + A^T)x.$$
These three rules are actually special cases of a more general rule. Let the scalar α be defined by $\alpha = u^T A v$, where $u = u(x) \in \mathbb{R}^m$ and $v = v(x) \in \mathbb{R}^n$. Then
$$\frac{\partial\, u^T A v}{\partial x} = \frac{\partial u}{\partial x}A v + \frac{\partial v}{\partial x}A^T u.$$
Note that there are several conventions in matrix calculus. In this solution, we stick to the denominator layout (a.k.a. Hessian formulation).
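As a quick numerical check of (4) (a minimal sketch; the simulated data and variable names below are my own and not part of the exercise), the closed-form estimate can be compared with the coefficients returned by lm():

# Simulate a small data set and compare the closed-form OLS estimate with lm().
set.seed(1)
n = 50
p = 3
X = cbind(1, matrix(rnorm(n * p), n, p))        # design matrix with intercept column
beta = c(1, 2, -1, 0.5)
y = X %*% beta + rnorm(n)
beta.hat = solve(t(X) %*% X) %*% t(X) %*% y     # closed-form estimate from (4)
fit = lm(y ~ X[, -1])                           # lm() adds its own intercept
cbind(closed.form = as.vector(beta.hat), lm = coef(fit))   # the two columns agree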
(c)
In the previous exercise, we obtained the log-likelihood function:
$$\begin{aligned}
l &= -\frac{n}{2}\log\big(2\pi\sigma^2\big) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^2\\
&= -\frac{n}{2}\log(2\pi) - n\log(\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^2
\end{aligned}$$
Differentiate this with respect to σ²:
$$\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2$$
This first derivative should equal 0. So,
$$\frac{n}{2\sigma^2} = \frac{1}{2\sigma^4}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2$$
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 = \frac{1}{n}\big\lVert y - X\beta\big\rVert^2 = \frac{1}{n}(y - X\beta)^T(y - X\beta)$$
Therefore, the maximum likelihood estimate for σ² is
$$\hat{\sigma}^2 = \frac{1}{n}(y - X\beta)^T(y - X\beta).$$
Note that σ̂² is a biased estimator of σ². The unbiased estimator can be obtained by replacing n with n − p − 1, the residual degrees of freedom when an intercept and p predictors are fitted.
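A small sketch (again with simulated data of my own) comparing the biased maximum likelihood estimate of σ² with the unbiased version that divides by n − p − 1; the latter is what summary() of an lm fit reports as the squared residual standard error:

set.seed(1)
n = 40
x1 = rnorm(n); x2 = rnorm(n)
y = 1 + 0.5 * x1 - 0.5 * x2 + rnorm(n, sd = 2)
fit = lm(y ~ x1 + x2)
rss = sum(residuals(fit)^2)
p = 2                                            # number of predictors
c(mle = rss / n,                                 # biased ML estimate
  unbiased = rss / (n - p - 1),                  # unbiased estimate
  lm.sigma2 = summary(fit)$sigma^2)              # equals the unbiased estimate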
(d)
i)
I prove the more general case $E[XY] = E[X]E[Y]$, where $X \in \mathbb{R}^{n\times p}$, $Y \in \mathbb{R}^{p\times m}$, $X \perp\!\!\!\perp Y$ and $1 \le i \le n$, $1 \le k \le p$, $1 \le j \le m$.
$$E[XY] = E\!\left[\begin{pmatrix}x_{1,1} & \cdots & x_{1,p}\\ \vdots & & \vdots\\ x_{n,1} & \cdots & x_{n,p}\end{pmatrix}\begin{pmatrix}y_{1,1} & \cdots & y_{1,m}\\ \vdots & & \vdots\\ y_{p,1} & \cdots & y_{p,m}\end{pmatrix}\right] = E\!\left[\begin{pmatrix}\sum_{k=1}^{p}x_{1,k}y_{k,1} & \cdots & \sum_{k=1}^{p}x_{1,k}y_{k,m}\\ \vdots & & \vdots\\ \sum_{k=1}^{p}x_{n,k}y_{k,1} & \cdots & \sum_{k=1}^{p}x_{n,k}y_{k,m}\end{pmatrix}\right]$$
We can see that the (i, j) entry of XY is $(XY)_{i,j} = \sum_{k=1}^{p}x_{i,k}y_{k,j}$. From here on, it is easier to work coordinate-wise:
$$E[(XY)_{i,j}] = E\Big[\sum_{k=1}^{p}x_{i,k}y_{k,j}\Big] = \sum_{k=1}^{p}E[x_{i,k}y_{k,j}] = \sum_{k=1}^{p}E[x_{i,k}]E[y_{k,j}] = \big(E[X]E[Y]\big)_{i,j}.$$
That is, E[XY] = E[X]E[Y].
Now, I prove $E[X + Y] = E[X] + E[Y]$, where $X, Y \in \mathbb{R}^{n\times m}$ and $1 \le i \le n$, $1 \le j \le m$. Note that X and Y don't have to be independent. The (i, j) entry of X + Y is $(X + Y)_{i,j} = x_{i,j} + y_{i,j}$, so
$$E[(X + Y)_{i,j}] = E[x_{i,j} + y_{i,j}] = E[x_{i,j}] + E[y_{i,j}] = \big(E[X] + E[Y]\big)_{i,j}.$$
That is, E[X + Y] = E[X] + E[Y].
Finally, by combining the two properties that we just proved, we obtain
E[AZ + b] = E[A]E[Z] + E[b] = AE[Z] + b.
ii)
Consider non-random $b, d \in \mathbb{R}^m$, random $X, Y \in \mathbb{R}^n$ and non-random $A, C \in \mathbb{R}^{m\times n}$.
The scalar version of covariance is defined as
$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big].$$
Consider the matrix $(X - E[X])(Y - E[Y])^T$. Its (i, j) entry is $(X_i - E[X_i])(Y_j - E[Y_j])$. Thus, the (i, j) entry of the matrix $E\big[(X - E[X])(Y - E[Y])^T\big]$ is $\mathrm{Cov}(X_i, Y_j) = E\big[(X_i - E[X_i])(Y_j - E[Y_j])\big]$. That is,
$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])^T\big].$$
Now we write
$$\begin{aligned}
\mathrm{Cov}(AX + b, CY + d) &= E\big[(AX + b - E[AX + b])(CY + d - E[CY + d])^T\big]\\
&= E\big[(AX + b - AE[X] - b)(CY + d - CE[Y] - d)^T\big]\\
&= E\big[(AX - AE[X])(CY - CE[Y])^T\big]\\
&= E\big[A(X - E[X])\big(C(Y - E[Y])\big)^T\big]\\
&= E\big[A(X - E[X])(Y - E[Y])^T C^T\big]\\
&= A\,E\big[(X - E[X])(Y - E[Y])^T\big]\,C^T\\
&= A\,\mathrm{Cov}(X, Y)\,C^T
\end{aligned}$$
That is,
$$\mathrm{Cov}(AX + b, CY + d) = A\,\mathrm{Cov}(X, Y)\,C^T. \tag{5}$$
When X = Y, A = C and b = d, (5) has the special case
$$\mathrm{Var}(AX + b) = \mathrm{Cov}(AX + b, AX + b) = A\,\mathrm{Var}(X)\,A^T.$$
Note that I assumed that A and b are not random matrices/vectors and that their dimensions are such that the matrix operations (+, −, ×, …) are well defined.
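As a sanity check of (5) (a sketch only; the matrices A, C, b, d below are arbitrary choices of mine, not taken from the exercise), we can compare the empirical covariance of (AX + b, CY + d) with A Cov(X, Y) C^T in a simulation where X and Y are correlated:

set.seed(1)
n.sim = 1e5
X = matrix(rnorm(2 * n.sim), ncol = 2)            # draws of X in R^2 (one per row)
Y = X + matrix(rnorm(2 * n.sim), ncol = 2)        # Y correlated with X
A = matrix(c(1, 0, 2, 1), 2, 2); b = c(1, -1)
C = matrix(c(0.5, 1, 0, 3), 2, 2); d = c(2, 0)
lhs = cov(X %*% t(A) + rep(1, n.sim) %o% b,       # empirical Cov(AX + b, CY + d)
          Y %*% t(C) + rep(1, n.sim) %o% d)
rhs = A %*% cov(X, Y) %*% t(C)                    # A Cov(X, Y) C^T
round(lhs - rhs, 3)                               # approximately the zero matrix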
(e)
Consider an arbitrary fixed vector $a \in \mathbb{R}^n$ and a random vector $X \in \mathbb{R}^n$. By using (5) from the previous exercise,
$$\mathrm{Var}(a^T X) = \mathrm{Cov}(a^T X, a^T X) = a^T\,\mathrm{Cov}(X)\,a.$$
Notice that $a^T X$ is a scalar; for convenience we call it α, so $\alpha = a^T X$. By definition, variance is a non-negative real number, which implies
$$a^T\,\mathrm{Cov}(X)\,a = \mathrm{Var}(\alpha) \ge 0.$$
That is, a covariance matrix is always positive semi-definite.
(f )
i)
Use the expression y = Xβ + ε:
$$E[y] = E[X\beta] + E[\varepsilon] = X\beta \qquad \text{(because } X\text{ and }\beta\text{ are not a random matrix/vector, and } E[\varepsilon] = 0\text{)}$$
ii)
By using the results from the previous exercises, we can write
$$\hat{\beta} = (X^T X)^{-1}X^T y = (X^T X)^{-1}X^T(X\beta + \varepsilon) = (X^T X)^{-1}X^T X\beta + (X^T X)^{-1}X^T\varepsilon = \beta + (X^T X)^{-1}X^T\varepsilon.$$
Take the expectation of the obtained expression for $\hat{\beta}$:
$$E[\hat{\beta}] = E\big[\beta + (X^T X)^{-1}X^T\varepsilon\big] = \beta + (X^T X)^{-1}X^T E[\varepsilon] = \beta.$$
(g)
By using the results from the previous exercises, we can write:
$$\begin{aligned}
\mathrm{Var}(\hat{\beta}) &= \mathrm{Var}\big((X^T X)^{-1}X^T y\big)\\
&= (X^T X)^{-1}X^T\,\mathrm{Var}(y)\,\big((X^T X)^{-1}X^T\big)^T\\
&= (X^T X)^{-1}X^T\,\sigma^2 I\,\big((X^T X)^{-1}X^T\big)^T\\
&= \sigma^2(X^T X)^{-1}X^T X(X^T X)^{-1}\\
&= \sigma^2(X^T X)^{-1}
\end{aligned}$$
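A small numerical sketch (my own simulated data): with the unbiased σ̂², the matrix σ̂²(X^T X)^{-1} is exactly what vcov() returns for a fitted lm object:

set.seed(1)
n = 60
x1 = rnorm(n); x2 = rnorm(n)
y = 1 + 2 * x1 - x2 + rnorm(n)
fit = lm(y ~ x1 + x2)
X = model.matrix(fit)
sigma2.hat = sum(residuals(fit)^2) / (n - ncol(X))    # unbiased estimate of sigma^2
max(abs(sigma2.hat * solve(t(X) %*% X) - vcov(fit)))  # essentially zero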
(h)
# Clean up the memory before we start.
rm(list = ls(all = TRUE))
# For replicability.
set.seed(2017)
# Set parameter values.
n.vec = seq(10, 100, by = 5)
p = 5
sigma.beta = 1
# Make a frame to write down results.
var.beta1 = as.data.frame(matrix(NA, ncol = 2, nrow = length(n.vec)))
colnames(var.beta1) = c("n", "var.beta1")
for (i in 1:length(n.vec)) {
  # Select the value of n.
  n = n.vec[i]
  # Make a frame.
  X = matrix(NA, nrow = n, ncol = p)
  # 1st column contains only 1.
  X[, 1] = 1
  # Generate random values from the standard normal distribution.
  for (j in 2:p) {
    X[, j] = rnorm(n, mean = 0, sd = 1)
  }
  # Create the covariance matrix of beta.
  cov.mat.beta = sigma.beta * solve(t(X) %*% X)
  # Write down the result.
  var.beta1[i, 1] = n
  var.beta1[i, 2] = cov.mat.beta[1, 1]
}
# Plot the result.
plot(x = var.beta1[, 1], y = var.beta1[, 2],
     xlab = "n", ylab = expression(paste("Var(", hat(beta)[0], ")")),
     main = "", font.main = 1)
Figure 1: Result of exercise 1 (h).
(i)
# Clean up the memory before we start.
rm(list=ls(all=TRUE))
# For replicability.
set.seed(2017)
# Set parameter values.
p.vec = seq(20, 32, by = 1)
n = 31
sigma.beta = 1
# Make a frame to write down results.
var.beta1 = as.data.frame(matrix(NA, ncol = 2, nrow = length(p.vec)))
colnames(var.beta1) = c("p", "var.beta1")
for (i in 1:length(p.vec)) {
  # Select the value of p.
  p = p.vec[i]
  # Make a frame.
  X = matrix(NA, nrow = n, ncol = p)
  # 1st column contains only 1.
  X[, 1] = 1
  # Generate random values from the standard normal distribution.
  for (j in 2:p) {
    X[, j] = rnorm(n, mean = 0, sd = 1)
  }
  # Create the covariance matrix of beta.
  cov.mat.beta = sigma.beta * solve(t(X) %*% X)
  # Write down the result.
  var.beta1[i, 1] = p
  var.beta1[i, 2] = cov.mat.beta[1, 1]
}
# Plot the result.
plot(x = var.beta1[, 1], y = var.beta1[, 2],
     xlab = "p", ylab = expression(paste("Var(", hat(beta)[0], ")")),
     main = "", font.main = 1)
Figure 2: Result of exercise 1 (i).
(j)
Consider a linear regression setting with n data points and p predictors. In this situation, we have to estimate p + 1 parameters (β₀, …, β_p) based on n observations.
When n is small (relative to p), $\hat{\beta}_j$ is easily affected by the randomness of an individual data point. But when n is large (relative to p), this individual effect on $\hat{\beta}_j$ becomes smaller. Therefore, as n increases (relative to p), $\mathrm{Var}(\hat{\beta}_0)$ decreases.
When n < p + 1 (fewer observations than parameters), there is no unique least squares solution and we will get an error in R.
The relationship between p/n and $\mathrm{Var}(\hat{\beta}_0)$ might be difficult to see in the plots above because n and p are not that big. So, we generate the same plots again with larger n and p:
Figure 3: Exercise 1 (h) with p = 5 and n = 10, 15, …, 995, 1000.
Figure 4: Exercise 1 (i) with n = 1000 and p = 20, 25, …, 985, 990.
Exercise 2
(a)
$$\begin{aligned}
\mathrm{EPE}(f) &= E[L(Y, f(X))] = E\big[(Y - f(X))^2\big]\\
&= \int_{x,y} (y - f(x))^2\, p(x, y)\,dy\,dx\\
&= \int_x\int_y (y - f(x))^2\, p(x, y)\,dy\,dx\\
&= \int_x\int_y (y - f(x))^2\, p(x)p(y\mid x)\,dy\,dx\\
&= \int_x\left(\int_y (y - f(x))^2\, p(y\mid x)\,dy\right)p(x)\,dx\\
&= \int_x E_{Y|X}\big[(Y - f(X))^2\mid X = x\big]\,p(x)\,dx\\
&= E_X\Big[E_{Y|X}\big[(Y - f(X))^2\mid X = x\big]\Big]
\end{aligned}$$
We are looking for a function f that minimizes EPE(f) given the data (i.e. X = x). EPE(f) becomes
$$\mathrm{EPE}(f) = E_X\Big[E_{Y|X=x}\big[(Y - f(x))^2\mid X = x\big]\Big].$$
Since all X are replaced by the given data x, we can ignore $E_X[\cdot]$. So,
$$\mathrm{EPE}(f) = E_{Y|X=x}\big[(Y - f(x))^2\mid X = x\big].$$
We are looking for a function f that minimizes this expression, which is by definition
$$f(x) = \operatorname*{arg\,min}_{c}\, E_{Y|X=x}\big[(Y - c)^2\mid X = x\big].$$
(b)
We want to find the value of c that minimizes L.
$$L = E_{Y|X=x}\big[(Y - c)^2\mid X = x\big] = E_{Y|X=x}\big[Y^2 - 2Yc + c^2\mid X = x\big] = E_{Y|X=x}\big[Y^2\mid X = x\big] - 2c\,E_{Y|X=x}[Y\mid X = x] + c^2$$
Take the first derivative:
$$\frac{\partial L}{\partial c} = -2E_{Y|X=x}[Y\mid X = x] + 2c.$$
This first derivative should equal 0:
$$-2E_{Y|X=x}[Y\mid X = x] + 2c = 0 \;\Longrightarrow\; c = E_{Y|X=x}[Y\mid X = x].$$
Take the second derivative:
$$\frac{\partial^2 L}{\partial c^2} = 2 > 0.$$
Therefore, $c = E_{Y|X=x}[Y\mid X = x]$ is the minimizer of L.
(c)
In the previous exercise, we showed that $c = E_{Y|X=x}[Y\mid X = x]$ is the minimizer of EPE(f). We plug the given expression for Y into this solution:
$$c = E_{Y|X=x}[Y\mid X = x] = E_{Y|X=x}[g(x) + \varepsilon\mid X = x] = g(x) + E_{Y|X=x}[\varepsilon\mid X = x] = g(x).$$
So, f(·) is the optimal predictor when f(·) = g(·).
(d)
$$\begin{aligned}
\mathrm{EPE}(f) &= E\big[(Y - f(X))^2\big]\\
&= E\big[(Y - E[Y] + E[Y] - f(X))^2\big]\\
&= E\big[(Y - E[Y])^2\big] + E\big[(E[Y] - f(X))^2\big] + E\big[2(Y - E[Y])(E[Y] - f(X))\big]\\
&= E\big[(Y - E[Y])^2\big] + E\big[(E[Y] - f(X))^2\big] + 2\big(E[Y] - E[E[Y]]\big)E\big[E[Y] - f(X)\big]\\
&= E\big[(Y - E[Y])^2\big] + E\big[(E[Y] - f(X))^2\big] + 2\big(E[Y] - E[Y]\big)E\big[E[Y] - f(X)\big]\\
&= E\big[(Y - E[Y])^2\big] + E\big[(E[Y] - f(X))^2\big]\\
&= \mathrm{Var}(Y) + E\big[(E[Y] - f(X))^2\big]\\
&= \mathrm{Var}(f(X) + \varepsilon) + E\big[(E[Y] - f(X))^2\big]\\
&= \mathrm{Var}(f(X)) + \mathrm{Var}(\varepsilon) + E\big[(E[Y] - f(X))^2\big]\\
&= \mathrm{Var}(f(X)) + \sigma^2 + E\big[(E[Y] - f(X))^2\big]
\end{aligned}$$
The last term will be 0 when E[Y] = f(X). So, the lower bound is $\mathrm{Var}(f(X)) + \sigma^2$.
Exercise 3
(a)
This is quite straightforward.
$$\begin{aligned}
\mathrm{EPE}(f) &= E[L(Y, f(X))] = E\big[1 - I_{\{f(x)\}}(y)\big]\\
&= \int_x\int_y \big(1 - I_{\{f(x)\}}(y)\big)\,p(x, y)\,dy\,dx\\
&= \int_x\int_y \big(1 - I_{\{f(x)\}}(y)\big)\,p(x)p(y\mid x)\,dy\,dx\\
&= \int_x\left(\int_y \big(1 - I_{\{f(x)\}}(y)\big)\,p(y\mid x)\,dy\right)p(x)\,dx\\
&= \int_x\big(1 - \Pr(Y = f(x)\mid X = x)\big)\,p(x)\,dx
\end{aligned}$$
(b)
$$\mathrm{EPE}(f) = \int_x\big\{1 - \Pr(Y = f(x)\mid X = x)\big\}\,p(x)\,dx$$
We are looking for a function f that minimizes this expression, which is by definition
$$f(x) = \operatorname*{arg\,min}_{k}\big[1 - \Pr(Y = k\mid X = x)\big] = \operatorname*{arg\,max}_{k}\Pr(Y = k\mid X = x), \qquad k \in \{0, 1\}.$$
Since f(x) is a binary predictor, we have only two options for the value of f(x): 0 and 1. We are maximizing Pr(Y = k | X = x). So, if Pr(Y = 0 | X = x) < Pr(Y = 1 | X = x), then k = 1, and if Pr(Y = 0 | X = x) > Pr(Y = 1 | X = x), then k = 0.
Notice that Pr(Y = 0 | X = x) + Pr(Y = 1 | X = x) = 1. So, the decision boundary is at Pr(Y = 0 | X = x) = Pr(Y = 1 | X = x) = 0.5.
Therefore,
$$f(x) = \begin{cases}1 & \text{if } \Pr(Y = 1\mid X = x) > 0.5\\ 0 & \text{otherwise.}\end{cases}$$
(c)
Intuitively,
$$f(x) = \begin{cases}
K - 1 & \text{if } K - 1 = \operatorname*{arg\,max}_{k}\Pr(Y = k\mid X = x)\\
K - 2 & \text{if } K - 2 = \operatorname*{arg\,max}_{k}\Pr(Y = k\mid X = x)\\
\;\;\vdots & \\
1 & \text{if } 1 = \operatorname*{arg\,max}_{k}\Pr(Y = k\mid X = x)\\
0 & \text{otherwise.}
\end{cases}$$
(d)
Let $k_{\mathrm{opt}} = \operatorname*{arg\,max}_{k}\Pr(Y = k\mid X = x)$. We get an error when $Y \neq k_{\mathrm{opt}}$. The probability that this happens is $1 - \Pr(Y = k_{\mathrm{opt}}\mid X = x)$, which corresponds to $1 - \max_k\Pr(Y = k\mid x)$.
Exercise 4
(a)
See extra4.r on the course webpage.
(b)
It’s given that
$$X \sim N(0, 1), \qquad \eta \sim N(0, 1), \qquad X \perp\!\!\!\perp \eta.$$
Now we assume
$$Z = 0.9X + \sqrt{1 - 0.9^2}\,\eta,$$
then
$$\mathrm{Var}(Z) = \mathrm{Var}\big(0.9X + \sqrt{1 - 0.9^2}\,\eta\big) = 0.9^2\,\mathrm{Var}(X) + (1 - 0.9^2)\,\mathrm{Var}(\eta) = 0.9^2\cdot 1 + (1 - 0.9^2)\cdot 1 = 1,$$
$$\mathrm{Cov}(X, Z) = \mathrm{Cov}\big(X, 0.9X + \sqrt{1 - 0.9^2}\,\eta\big) = \mathrm{Cov}(X, 0.9X) + \mathrm{Cov}\big(X, \sqrt{1 - 0.9^2}\,\eta\big) = 0.9\,\mathrm{Cov}(X, X) + \sqrt{1 - 0.9^2}\,\mathrm{Cov}(X, \eta) = 0.9\cdot 1 + \sqrt{1 - 0.9^2}\cdot 0 = 0.9.$$
(The following rule is used: Cov(aX + bY, cU + dV) = ac Cov(X, U) + ad Cov(X, V) + bc Cov(Y, U) + bd Cov(Y, V).)
$$\mathrm{Cor}(X, Z) = \frac{\mathrm{Cov}(X, Z)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Z)}} = 0.9.$$
So, defining Z with $\{Z = 0.9X + \sqrt{1 - 0.9^2}\,\eta$ and $\eta \sim N(0, 1)\}$ is the same as defining Z with $\{Z \sim N(0, 1)$ and $\mathrm{Cor}(X, Z) = 0.9\}$.
When we simulate (X, Z) in R, both generating algorithms will give the same result, except for the differences created by the random number generator.
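A minimal sketch of this equivalence (my own code, not part of extra4.r): generate Z once via the explicit construction above and once directly as a bivariate normal with correlation 0.9, and compare the empirical correlations.

set.seed(2017)
n = 1e5
x = rnorm(n)
eta = rnorm(n)
z1 = 0.9 * x + sqrt(1 - 0.9^2) * eta                  # construction used in this exercise
library(MASS)
xz = mvrnorm(n, mu = c(0, 0),
             Sigma = matrix(c(1, 0.9, 0.9, 1), 2))    # direct bivariate normal draw
c(cor(x, z1), cor(xz[, 1], xz[, 2]))                  # both close to 0.9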
(c)
See extra4 extended.r on the course webpage.
[Figure: rejection rate as a function of beta1 for the tests based on "x with only x", "x", "z", and "x or z".]
(d)
[Figure: rejection rate as a function of beta1 for the tests based on "x with only x", "x", "z", and "x or z".]
(e)
z has a high correlation with x. When z is added to the model, it takes over a part of the variance of y that was previously explained by x. So, the rejection rate for β_x decreases.
Exercise 5
(a)


$$E[\hat{\theta}] = E\big[(x^*)^T\hat{\beta}\big] = (x^*)^T E[\hat{\beta}] = (x^*)^T\beta = \theta, \qquad \text{where } x^* = (1, x_1^*, \dots, x_p^*)^T.$$
So, $\hat{\theta}$ is an unbiased estimator of θ.
(b)
$$\sigma_{\hat{\theta}}^2 = \mathrm{Var}(\hat{\theta}) = \mathrm{Var}\big((x^*)^T\hat{\beta}\big) = (x^*)^T\,\mathrm{Var}(\hat{\beta})\,x^* = (x^*)^T\,\sigma^2(X^T X)^{-1}x^* = \sigma^2(x^*)^T(X^T X)^{-1}x^*$$
Here, X is the design matrix that is used to fit the model (i.e. to estimate β), σ² = Var(ε), and x* is the new data point for prediction.
(c)
i)
From the previous exercise, we have $\sigma_{\hat{\theta}}^2 = \sigma^2(x^*)^T(X^T X)^{-1}x^*$. So, $s_{\hat{\theta}}^2 = \hat{\sigma}_{\hat{\theta}}^2 = \hat{\sigma}^2(x^*)^T(X^T X)^{-1}x^*$ and
$$T = \frac{\hat{\theta} - \theta}{s_{\hat{\theta}}} = \frac{\hat{\theta} - \theta}{\sqrt{\hat{\sigma}^2(x^*)^T(X^T X)^{-1}x^*}} = \frac{\dfrac{\hat{\theta} - \theta}{\sqrt{\sigma^2(x^*)^T(X^T X)^{-1}x^*}}}{\sqrt{\dfrac{\hat{\sigma}^2}{\sigma^2}}} = \frac{\dfrac{\hat{\theta} - \theta}{\sqrt{\sigma^2(x^*)^T(X^T X)^{-1}x^*}}}{\sqrt{\dfrac{\hat{\sigma}^2(n-p-1)}{\sigma^2}\cdot\dfrac{1}{n-p-1}}} = \frac{Z}{\sqrt{\dfrac{X}{n-p-1}}}$$
Now we need to show that $Z \sim N(0, 1)$, $X \sim \chi^2_{n-p-1}$ and $Z \perp\!\!\!\perp X$.
We know that $\hat{\theta} \sim N\big(\theta,\ \sigma^2(x^*)^T(X^T X)^{-1}x^*\big)$. So, $Z = \dfrac{\hat{\theta} - \theta}{\sqrt{\sigma^2(x^*)^T(X^T X)^{-1}x^*}} \sim N(0, 1)$.
As a direct result of the given property, we obtain $X = \dfrac{\hat{\sigma}^2}{\sigma^2}(n-p-1) \sim \chi^2_{n-p-1}$.
It’s given that $\hat{\beta} \perp\!\!\!\perp \hat{\sigma}^2$. So, $\dfrac{\hat{\theta} - \theta}{\sqrt{\sigma^2(x^*)^T(X^T X)^{-1}x^*}} \perp\!\!\!\perp \dfrac{\hat{\sigma}^2}{\sigma^2}(n-p-1)$.
Therefore,
$$T = \frac{\hat{\theta} - \theta}{s_{\hat{\theta}}} = \frac{Z}{\sqrt{\dfrac{X}{n-p-1}}} \sim t_{n-p-1}.$$
ii)
$$T = \frac{\hat{\theta} - \theta}{s_{\hat{\theta}}} \sim t_{n-p-1}$$
So,
$$\begin{aligned}
1 - \alpha &= P\!\left(t_{\frac{\alpha}{2},n-p-1} \le \frac{\hat{\theta} - \theta}{s_{\hat{\theta}}} \le t_{1-\frac{\alpha}{2},n-p-1}\right)\\
&= P\!\left(\hat{\theta} - t_{1-\frac{\alpha}{2},n-p-1}\cdot s_{\hat{\theta}} \le \theta \le \hat{\theta} + t_{1-\frac{\alpha}{2},n-p-1}\cdot s_{\hat{\theta}}\right)\\
&= P\!\left(\hat{\theta} - t_{1-\frac{\alpha}{2},n-p-1}\cdot\hat{\sigma}\sqrt{(x^*)^T(X^T X)^{-1}x^*} \le \theta \le \hat{\theta} + t_{1-\frac{\alpha}{2},n-p-1}\cdot\hat{\sigma}\sqrt{(x^*)^T(X^T X)^{-1}x^*}\right)
\end{aligned}$$
To sum up, the 100(1 − α)% confidence interval for θ is $\Big[\hat{\theta} - t_{1-\frac{\alpha}{2},n-p-1}\cdot s_{\hat{\theta}},\ \hat{\theta} + t_{1-\frac{\alpha}{2},n-p-1}\cdot s_{\hat{\theta}}\Big]$.
(d)
$$E[Y^* - \hat{\theta}] = E\big[(x^*)^T\beta + \varepsilon^* - (x^*)^T\hat{\beta}\big] = (x^*)^T\beta - (x^*)^T E[\hat{\beta}] + E[\varepsilon^*] = (x^*)^T\beta - (x^*)^T\beta + 0 = 0$$
The result that we obtain here is $E[\hat{\theta}] = E[Y^*]$ and not $E[\hat{\theta}] = Y^*$:
$$E[\hat{\theta}] - Y^* = (x^*)^T\beta - (x^*)^T\beta - \varepsilon^* = -\varepsilon^* \neq 0.$$
(e)
$$\sigma^2_{Y^*-\hat{\theta}} = \mathrm{Var}\big((x^*)^T\beta - (x^*)^T\hat{\beta} + \varepsilon^*\big) = \mathrm{Var}\big((x^*)^T\hat{\beta}\big) + \mathrm{Var}(\varepsilon^*) = \sigma^2(x^*)^T(X^T X)^{-1}x^* + \sigma^2 = \sigma^2\big((x^*)^T(X^T X)^{-1}x^* + 1\big)$$
(f )
i)
First, show that $Y^* - \hat{\theta}$ follows a normal distribution.
$$\begin{aligned}
Y^* - \hat{\theta} &= (x^*)^T\beta - (x^*)^T\hat{\beta} + \varepsilon^*\\
&= (x^*)^T\beta - (x^*)^T(X^T X)^{-1}X^T y + \varepsilon^*\\
&= (x^*)^T\beta - (x^*)^T(X^T X)^{-1}X^T(X\beta + \varepsilon) + \varepsilon^*\\
&= (x^*)^T\beta - (x^*)^T(X^T X)^{-1}X^T X\beta - (x^*)^T(X^T X)^{-1}X^T\varepsilon + \varepsilon^*\\
&= -(x^*)^T(X^T X)^{-1}X^T\varepsilon + \varepsilon^*\\
&\sim N\Big(0,\ \sigma^2(x^*)^T(X^T X)^{-1}X^T\big((x^*)^T(X^T X)^{-1}X^T\big)^T\Big) + N(0, \sigma^2) && \text{(because } -(x^*)^T(X^T X)^{-1}X^T\varepsilon \perp\!\!\!\perp \varepsilon^*\text{)}\\
&= N\big(0,\ \sigma^2(x^*)^T(X^T X)^{-1}X^T X(X^T X)^{-1}x^*\big) + N(0, \sigma^2)\\
&= N\big(0,\ \sigma^2(x^*)^T(X^T X)^{-1}x^*\big) + N(0, \sigma^2)\\
&= N\big(0,\ \sigma^2(x^*)^T(X^T X)^{-1}x^* + \sigma^2\big) && \text{(by the additivity of independent normal distributions)}\\
&= N\big(0,\ \sigma^2((x^*)^T(X^T X)^{-1}x^* + 1)\big)
\end{aligned}$$
We have $\sigma^2_{Y^*-\hat{\theta}} = \sigma^2\big((x^*)^T(X^T X)^{-1}x^* + 1\big)$. So, $s^2_{Y^*-\hat{\theta}} = \hat{\sigma}^2_{Y^*-\hat{\theta}} = \hat{\sigma}^2\big((x^*)^T(X^T X)^{-1}x^* + 1\big)$ and
$$T = \frac{Y^* - \hat{\theta}}{s_{Y^*-\hat{\theta}}} = \frac{Y^* - \hat{\theta}}{\sqrt{\hat{\sigma}^2\big((x^*)^T(X^T X)^{-1}x^* + 1\big)}} = \frac{\dfrac{Y^* - \hat{\theta}}{\sqrt{\sigma^2\big((x^*)^T(X^T X)^{-1}x^* + 1\big)}}}{\sqrt{\dfrac{\hat{\sigma}^2(n-p-1)}{\sigma^2}\cdot\dfrac{1}{n-p-1}}} = \frac{Z}{\sqrt{\dfrac{X}{n-p-1}}}$$
Now we need to show that $Z \sim N(0, 1)$, $X \sim \chi^2_{n-p-1}$ and $Z \perp\!\!\!\perp X$.
We know that $Y^* - \hat{\theta} \sim N\big(0,\ \sigma^2((x^*)^T(X^T X)^{-1}x^* + 1)\big)$. So, $Z = \dfrac{Y^* - \hat{\theta}}{\sqrt{\sigma^2\big((x^*)^T(X^T X)^{-1}x^* + 1\big)}} \sim N(0, 1)$.
As a direct result of the given property, we obtain $X = \dfrac{\hat{\sigma}^2}{\sigma^2}(n-p-1) \sim \chi^2_{n-p-1}$.
It’s given that $\hat{\beta} \perp\!\!\!\perp \hat{\sigma}^2$. So, $\dfrac{Y^* - \hat{\theta}}{\sqrt{\sigma^2\big((x^*)^T(X^T X)^{-1}x^* + 1\big)}} \perp\!\!\!\perp \dfrac{\hat{\sigma}^2}{\sigma^2}(n-p-1)$.
Therefore,
$$T = \frac{Y^* - \hat{\theta}}{s_{Y^*-\hat{\theta}}} = \frac{Z}{\sqrt{\dfrac{X}{n-p-1}}} \sim t_{n-p-1}.$$
ii)
$$T = \frac{Y^* - \hat{\theta}}{s_{Y^*-\hat{\theta}}} \sim t_{n-p-1}$$
So,
$$\begin{aligned}
1 - \alpha &= P\!\left(t_{\frac{\alpha}{2},n-p-1} \le \frac{Y^* - \hat{\theta}}{s_{Y^*-\hat{\theta}}} \le t_{1-\frac{\alpha}{2},n-p-1}\right)\\
&= P\!\left(\hat{\theta} - t_{1-\frac{\alpha}{2},n-p-1}\cdot s_{Y^*-\hat{\theta}} \le Y^* \le \hat{\theta} + t_{1-\frac{\alpha}{2},n-p-1}\cdot s_{Y^*-\hat{\theta}}\right)\\
&= P\!\left(\hat{\theta} - t_{1-\frac{\alpha}{2},n-p-1}\cdot\hat{\sigma}\sqrt{(x^*)^T(X^T X)^{-1}x^* + 1} \le Y^* \le \hat{\theta} + t_{1-\frac{\alpha}{2},n-p-1}\cdot\hat{\sigma}\sqrt{(x^*)^T(X^T X)^{-1}x^* + 1}\right)
\end{aligned}$$
To sum up, the 100(1 − α)% prediction interval for Y* is $\Big[\hat{\theta} - t_{1-\frac{\alpha}{2},n-p-1}\cdot s_{Y^*-\hat{\theta}},\ \hat{\theta} + t_{1-\frac{\alpha}{2},n-p-1}\cdot s_{Y^*-\hat{\theta}}\Big]$.
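As a sketch (the simulated data and variable names are mine, not from the exercise), the interval formulas from (c) and (f) can be checked against predict() for a fitted linear model:

set.seed(1)
n = 30
x = rnorm(n)
y = 1 + 2 * x + rnorm(n)
fit = lm(y ~ x)
X = model.matrix(fit)
x.star = c(1, 0.5)                                     # new point, including the intercept
theta.hat = sum(x.star * coef(fit))
sigma.hat = summary(fit)$sigma
se.conf = sigma.hat * sqrt(drop(t(x.star) %*% solve(t(X) %*% X) %*% x.star))
se.pred = sigma.hat * sqrt(1 + drop(t(x.star) %*% solve(t(X) %*% X) %*% x.star))
t.q = qt(0.975, df = n - 2)                            # t quantile with n - p - 1 df (p = 1)
rbind(manual.conf = theta.hat + c(-1, 1) * t.q * se.conf,
      manual.pred = theta.hat + c(-1, 1) * t.q * se.pred)
predict(fit, newdata = data.frame(x = 0.5), interval = "confidence")   # matches manual.conf
predict(fit, newdata = data.frame(x = 0.5), interval = "prediction")   # matches manual.pred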
Exercise 6
(a)
$$\begin{aligned}
\mathrm{EPE}(f) &= \int_x\int_y L(y, f(x))\,p(x, y)\,dy\,dx\\
&= \int_x\int_y L(y, f(x))\,p(x)p(y\mid x)\,dy\,dx\\
&= \int_x\left(\int_y L(y, f(x))\,p(y\mid x)\,dy\right)p(x)\,dx\\
&= \int_{x;\,f(x)=0}\left(\int_y L(y, f(x))\,p(y\mid x)\,dy\right)p(x)\,dx + \int_{x;\,f(x)=1}\left(\int_y L(y, f(x))\,p(y\mid x)\,dy\right)p(x)\,dx
\end{aligned} \tag{6}$$
We first examine the inner integral:
$$\int_y L(y, f(x))\,p(y\mid x)\,dy = \int_{y;\,y=0} c_0\,p(y\mid x, f(x)=1)\,dy + \int_{y;\,y=1} c_1\,p(y\mid x, f(x)=0)\,dy = c_0\int_{y;\,y=0} p(y\mid x, f(x)=1)\,dy + c_1\int_{y;\,y=1} p(y\mid x, f(x)=0)\,dy$$
We plug this back into (6).
$$\begin{aligned}
\mathrm{EPE}(f) &= \int_{x;\,f(x)=0} c_1\left(\int_{y;\,y=1} p(y\mid x, f(x)=0)\,dy\right)p(x)\,dx + \int_{x;\,f(x)=1} c_0\left(\int_{y;\,y=0} p(y\mid x, f(x)=1)\,dy\right)p(x)\,dx\\
&= \int_{x;\,f(x)=0} c_1\left(\int_{y;\,y=1} p(y\mid x)\,dy\right)p(x)\,dx + \int_{x;\,f(x)=1} c_0\left(\int_{y;\,y=0} p(y\mid x)\,dy\right)p(x)\,dx\\
&= \int_{x;\,f(x)=0} c_1\Pr(Y = 1\mid X = x)\,p(x)\,dx + \int_{x;\,f(x)=1} c_0\Pr(Y = 0\mid X = x)\,p(x)\,dx\\
&= \int_x I_{\{0\}}(f(x))\,c_1\Pr(Y = 1\mid X = x)\,p(x)\,dx + \int_x I_{\{1\}}(f(x))\,c_0\Pr(Y = 0\mid X = x)\,p(x)\,dx
\end{aligned}$$
(b)
$$\begin{aligned}
\mathrm{EPE}(f) &= \int_x I_{\{0\}}(f(x))\,c_1\Pr(Y = 1\mid X = x)\,p(x)\,dx + \int_x I_{\{1\}}(f(x))\,c_0\Pr(Y = 0\mid X = x)\,p(x)\,dx\\
&= \int_x I_{\{0\}}(f(x))\,c_1\Pr(Y = 1\mid X = x)\,p(x)\,dx + \int_x\big(1 - I_{\{0\}}(f(x))\big)\,c_0\Pr(Y = 0\mid X = x)\,p(x)\,dx\\
&= \int_x I_{\{0\}}(f(x))\big(c_1\Pr(Y = 1\mid X = x) - c_0\Pr(Y = 0\mid X = x)\big)\,p(x)\,dx + c_0\int_x\Pr(Y = 0\mid X = x)\,p(x)\,dx\\
&= \int_x I_{\{0\}}(f(x))\big(c_1\Pr(Y = 1\mid X = x) - c_0\Pr(Y = 0\mid X = x)\big)\,p(x)\,dx + c_0\Pr(Y = 0)\\
&= \int_x I_{\{0\}}(f(x))\big(Q_1(x) - Q_0(x)\big)\,p(x)\,dx + \text{constant}
\end{aligned}$$
The optimal predictor should minimize EPE(f). So, the optimal predictor is
$$f(x) = \operatorname*{arg\,min}_{k}\,\mathrm{EPE}(k) = \operatorname*{arg\,min}_{k}\int_x I_{\{0\}}(k)\big(c_1\Pr(Y = 1\mid X = x) - c_0\Pr(Y = 0\mid X = x)\big)\,p(x)\,dx.$$
Note that $I_{\{0\}}(k)$ is either 0 or 1. This implies that when $c_1\Pr(Y = 1\mid X = x) - c_0\Pr(Y = 0\mid X = x) > 0$, $I_{\{0\}}(k) = 0$ will minimize EPE(k), and when $c_1\Pr(Y = 1\mid X = x) - c_0\Pr(Y = 0\mid X = x) \le 0$, $I_{\{0\}}(k) = 1$ will minimize EPE(k).
Therefore,
$$f(x) = \begin{cases}1 & \text{if } \Pr(Y = 1\mid X = x) > \dfrac{c_0}{c_1}\Pr(Y = 0\mid X = x)\\ 0 & \text{otherwise}\end{cases}$$
is the optimal predictor for the given L(Y, f(X)).
Exercise 7
(a)
(i)
 
p = 0, so the model is $Y_i = \beta_0 + \varepsilon_i$, and $X = (1, \dots, 1)^T$.
The least squares estimate is given by
$$\hat{\beta} = (X^T X)^{-1}X^T y,$$
which in this case leads to
$$\hat{\beta}_0 = \frac{\sum_{i=1}^{n} y_i}{n} = \bar{y}.$$
So, $\hat{y}_i = \bar{y}$ for $1 \le i \le n$.
(ii)
Same procedure as in (i), but you have to replace X and y with X₋ᵢ and y₋ᵢ by removing the i-th data point. The resulting prediction:
$$\hat{y}_i^{-i} = \frac{1}{n-1}\sum_{j\neq i} y_j.$$
(iii)
$$H = X(X^T X)^{-1}X^T = \begin{pmatrix}1\\ \vdots\\ 1\end{pmatrix}\left(\begin{pmatrix}1 & \cdots & 1\end{pmatrix}\begin{pmatrix}1\\ \vdots\\ 1\end{pmatrix}\right)^{-1}\begin{pmatrix}1 & \cdots & 1\end{pmatrix} = n^{-1}\begin{pmatrix}1\\ \vdots\\ 1\end{pmatrix}\begin{pmatrix}1 & \cdots & 1\end{pmatrix} = \frac{1}{n}\begin{pmatrix}1 & \cdots & 1\\ \vdots & \ddots & \vdots\\ 1 & \cdots & 1\end{pmatrix}$$
Thus,
$$h_{ii} = \frac{1}{n}.$$
(iv)
$$\begin{aligned}
y_i - \hat{y}_i^{-i} &= y_i - \frac{\sum_{j\neq i} y_j}{n-1}\\
&= y_i - \frac{\sum_{i'=1}^{n} y_{i'} - y_i}{n-1}\\
&= y_i - \frac{\frac{1}{n}\sum_{i'=1}^{n} y_{i'} - \frac{y_i}{n}}{1 - \frac{1}{n}}\\
&= y_i - \frac{\hat{y}_i - \frac{y_i}{n}}{1 - \frac{1}{n}}\\
&= \frac{\big(1 - \frac{1}{n}\big)y_i + \frac{y_i}{n} - \hat{y}_i}{1 - \frac{1}{n}}\\
&= \frac{y_i - \hat{y}_i}{1 - \frac{1}{n}}\\
&= \frac{y_i - \hat{y}_i}{1 - h_i} && \text{(by using the result from (iii))}
\end{aligned}$$
(b)
(i)
$$M_n = X_n^T X_n = \begin{pmatrix}x_1 & \cdots & x_n\end{pmatrix}\begin{pmatrix}x_1^T\\ \vdots\\ x_n^T\end{pmatrix} = \sum_{i=1}^{n} x_i x_i^T,$$
where $x_i$ denotes the i-th row of $X_n$ written as a column vector.
(ii)
$$\big(A + uv^T\big)^{-1} = A^{-1} - \frac{A^{-1}uv^T A^{-1}}{1 + v^T A^{-1}u}$$
if and only if
$$\big(A + uv^T\big)\left(A^{-1} - \frac{A^{-1}uv^T A^{-1}}{1 + v^T A^{-1}u}\right) = I \quad\text{and}\quad \left(A^{-1} - \frac{A^{-1}uv^T A^{-1}}{1 + v^T A^{-1}u}\right)\big(A + uv^T\big) = I.$$
For convenience, let us write $c = \dfrac{1}{1 + v^T A^{-1}u}$.
First condition:
$$\begin{aligned}
\big(A + uv^T\big)\left(A^{-1} - \frac{A^{-1}uv^T A^{-1}}{1 + v^T A^{-1}u}\right) &= \big(A + uv^T\big)\big(A^{-1} - cA^{-1}uv^T A^{-1}\big)\\
&= AA^{-1} - cAA^{-1}uv^T A^{-1} + uv^T A^{-1} - cu\big(v^T A^{-1}u\big)v^T A^{-1}\\
&= I - cuv^T A^{-1} + uv^T A^{-1} - c\big(v^T A^{-1}u\big)uv^T A^{-1}\\
&= I + \big(-c + 1 - cv^T A^{-1}u\big)uv^T A^{-1}\\
&= I + \left(\frac{-1 + 1 + v^T A^{-1}u - v^T A^{-1}u}{1 + v^T A^{-1}u}\right)uv^T A^{-1}\\
&= I
\end{aligned}$$
Second condition:
$$\begin{aligned}
\left(A^{-1} - \frac{A^{-1}uv^T A^{-1}}{1 + v^T A^{-1}u}\right)\big(A + uv^T\big) &= \big(A^{-1} - cA^{-1}uv^T A^{-1}\big)\big(A + uv^T\big)\\
&= A^{-1}A + A^{-1}uv^T - cA^{-1}uv^T - cA^{-1}u\big(v^T A^{-1}u\big)v^T\\
&= I + \big(1 - c - cv^T A^{-1}u\big)A^{-1}uv^T\\
&= I + \left(\frac{1 + v^T A^{-1}u - 1 - v^T A^{-1}u}{1 + v^T A^{-1}u}\right)A^{-1}uv^T\\
&= I
\end{aligned}$$
Thus,
$$\big(A + uv^T\big)^{-1} = A^{-1} - \frac{A^{-1}uv^T A^{-1}}{1 + v^T A^{-1}u}.$$
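A quick numerical sketch (with an arbitrary A, u and v of my own choosing) confirming the Sherman–Morrison identity:

set.seed(1)
A = crossprod(matrix(rnorm(16), 4, 4)) + diag(4)    # a well-conditioned 4 x 4 matrix
u = rnorm(4)
v = rnorm(4)
A.inv = solve(A)
lhs = solve(A + u %*% t(v))
rhs = A.inv - (A.inv %*% u %*% t(v) %*% A.inv) / drop(1 + t(v) %*% A.inv %*% u)
max(abs(lhs - rhs))                                 # essentially zero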
(iii)
Let $\tilde{x}_n = M_{n-1}^{-1}x_n$. Then
$$\begin{aligned}
M_n^{-1} &= \left(\sum_{i=1}^{n} x_i x_i^T\right)^{-1} = \left(\sum_{i=1}^{n-1} x_i x_i^T + x_n x_n^T\right)^{-1} = \big(M_{n-1} + x_n x_n^T\big)^{-1}\\
&= M_{n-1}^{-1} - \frac{M_{n-1}^{-1}x_n x_n^T M_{n-1}^{-1}}{1 + x_n^T M_{n-1}^{-1}x_n} && \text{(by using the Sherman–Morrison formula)}\\
&= M_{n-1}^{-1} - \frac{\tilde{x}_n x_n^T M_{n-1}^{-1}}{1 + x_n^T\tilde{x}_n}
\end{aligned}$$
(iv)
$$\begin{aligned}
\hat{\beta} = \hat{\beta}_n &= (X_n^T X_n)^{-1}X_n^T y_n = M_n^{-1}X_n^T y_n\\
&= \left(M_{n-1}^{-1} - \frac{\tilde{x}_n x_n^T M_{n-1}^{-1}}{1 + x_n^T\tilde{x}_n}\right)\sum_{i=1}^{n} x_i y_i\\
&= \left(M_{n-1}^{-1} - \frac{\tilde{x}_n x_n^T M_{n-1}^{-1}}{1 + x_n^T\tilde{x}_n}\right)\left(\sum_{i=1}^{n-1} x_i y_i + x_n y_n\right)\\
&= \left(M_{n-1}^{-1} - \frac{\tilde{x}_n x_n^T M_{n-1}^{-1}}{1 + x_n^T\tilde{x}_n}\right)\big(X_{n-1}^T y_{n-1} + x_n y_n\big)\\
&= M_{n-1}^{-1}X_{n-1}^T y_{n-1} + M_{n-1}^{-1}x_n y_n - \frac{\tilde{x}_n x_n^T M_{n-1}^{-1}}{1 + x_n^T\tilde{x}_n}X_{n-1}^T y_{n-1} - \frac{\tilde{x}_n x_n^T M_{n-1}^{-1}}{1 + x_n^T\tilde{x}_n}x_n y_n\\
&= \hat{\beta}_{n-1} + M_{n-1}^{-1}x_n y_n - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\hat{\beta}_{n-1} - \frac{\tilde{x}_n x_n^T M_{n-1}^{-1}}{1 + x_n^T\tilde{x}_n}x_n y_n\\
&= \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\hat{\beta}_{n-1} + \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)M_{n-1}^{-1}x_n y_n\\
&= \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\hat{\beta}_{n-1} + \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\tilde{x}_n y_n\\
&= \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\big(\hat{\beta}_{n-1} + \tilde{x}_n y_n\big)
\end{aligned}$$
Or alternatively,
$$\begin{aligned}
\left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\hat{\beta}_{n-1} + \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\tilde{x}_n y_n
&= \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\hat{\beta}_{n-1} + \tilde{x}_n y_n - \frac{\tilde{x}_n x_n^T\tilde{x}_n y_n}{1 + x_n^T\tilde{x}_n}\\
&= \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\hat{\beta}_{n-1} + \left(1 - \frac{x_n^T\tilde{x}_n}{1 + x_n^T\tilde{x}_n}\right)\tilde{x}_n y_n\\
&= \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\hat{\beta}_{n-1} + \frac{1 + x_n^T\tilde{x}_n - x_n^T\tilde{x}_n}{1 + x_n^T\tilde{x}_n}\,\tilde{x}_n y_n\\
&= \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\hat{\beta}_{n-1} + \frac{1}{1 + x_n^T\tilde{x}_n}\,\tilde{x}_n y_n
\end{aligned}$$
(v)
To prove $\left(I - \dfrac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)^{-1} = I + \tilde{x}_n x_n^T$, show
$$\left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\big(I + \tilde{x}_n x_n^T\big) = I \quad\text{and}\quad \big(I + \tilde{x}_n x_n^T\big)\left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right) = I,$$
or simply use the Sherman–Morrison formula.
Use the result from (iv):
$$\hat{\beta}_n = \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\big(\hat{\beta}_{n-1} + \tilde{x}_n y_n\big)$$
$$\hat{\beta}_{n-1} = \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)^{-1}\hat{\beta}_n - \tilde{x}_n y_n.$$
Use the equality that we just proved:
$$\hat{\beta}_{n-1} = \big(I + \tilde{x}_n x_n^T\big)\hat{\beta}_n - \tilde{x}_n y_n.$$
(vi)
$$\begin{aligned}
y_n - \hat{y}_n^{-n} &= y_n - x_n^T\hat{\beta}_{n-1}\\
&= y_n - x_n^T\big(\big(I + \tilde{x}_n x_n^T\big)\hat{\beta}_n - \tilde{x}_n y_n\big)\\
&= y_n - x_n^T\hat{\beta}_n - x_n^T\tilde{x}_n x_n^T\hat{\beta}_n + x_n^T\tilde{x}_n y_n\\
&= y_n + x_n^T\tilde{x}_n y_n - \big(1 + x_n^T\tilde{x}_n\big)x_n^T\hat{\beta}_n\\
&= \big(1 + x_n^T\tilde{x}_n\big)y_n - \big(1 + x_n^T\tilde{x}_n\big)\hat{y}_n\\
&= \big(1 + x_n^T\tilde{x}_n\big)\big(y_n - \hat{y}_n\big)
\end{aligned}$$
(vii)
$$H = X(X^T X)^{-1}X^T = \begin{pmatrix}x_1^T\\ \vdots\\ x_n^T\end{pmatrix}M_n^{-1}\begin{pmatrix}x_1 & \cdots & x_n\end{pmatrix}$$
From this, we can directly see that $(H)_{i,j} = x_i^T M_n^{-1}x_j$.
$$\begin{aligned}
(H)_{n,n} &= x_n^T M_n^{-1}x_n\\
&= x_n^T\left(M_{n-1}^{-1} - \frac{\tilde{x}_n x_n^T M_{n-1}^{-1}}{1 + x_n^T\tilde{x}_n}\right)x_n\\
&= x_n^T\tilde{x}_n - \frac{x_n^T\tilde{x}_n\, x_n^T\tilde{x}_n}{1 + x_n^T\tilde{x}_n}\\
&= x_n^T\tilde{x}_n\left(1 - \frac{x_n^T\tilde{x}_n}{1 + x_n^T\tilde{x}_n}\right)\\
&= x_n^T\tilde{x}_n\,\frac{1 + x_n^T\tilde{x}_n - x_n^T\tilde{x}_n}{1 + x_n^T\tilde{x}_n}\\
&= \frac{x_n^T\tilde{x}_n}{1 + x_n^T\tilde{x}_n}
\end{aligned}$$
First, we use this result:
$$h_n = \frac{x_n^T\tilde{x}_n}{1 + x_n^T\tilde{x}_n} \;\Longrightarrow\; h_n + h_n x_n^T\tilde{x}_n = x_n^T\tilde{x}_n \;\Longrightarrow\; h_n = x_n^T\tilde{x}_n - h_n x_n^T\tilde{x}_n \;\Longrightarrow\; \frac{h_n}{1 - h_n} = x_n^T\tilde{x}_n.$$
We plug this result into the equation we just obtained in (vi):
$$y_n - \hat{y}_n^{-n} = \big(1 + x_n^T\tilde{x}_n\big)\big(y_n - \hat{y}_n\big) = \left(1 + \frac{h_n}{1 - h_n}\right)\big(y_n - \hat{y}_n\big) = \frac{y_n - \hat{y}_n}{1 - h_n}.$$
This verifies equation (5.2) in the textbook for i = n.
(viii)
Changing the order of data points in the dataset doesn't affect the model. This means that we can set any data point to be $x_n$. Therefore, equation (5.2) is valid for all i = 1, …, n.
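A short sketch (with simulated data of my own) that verifies equation (5.2) numerically: the leave-one-out residuals computed by refitting n times agree with (y_i − ŷ_i)/(1 − h_i) obtained from a single fit.

set.seed(1)
n = 25
d = data.frame(x = rnorm(n))
d$y = 1 + 2 * d$x + rnorm(n)
fit = lm(y ~ x, data = d)
h = hatvalues(fit)
loo.formula = residuals(fit) / (1 - h)               # (y_i - y.hat_i) / (1 - h_i)
loo.brute = sapply(1:n, function(i) {
  fit.i = lm(y ~ x, data = d[-i, ])                  # refit without observation i
  d$y[i] - predict(fit.i, newdata = d[i, , drop = FALSE])
})
max(abs(loo.formula - loo.brute))                    # essentially zero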
(c)
Consider a situation where we fitted a model based on n data points (i.e. we estimated $\hat{\beta}_n$). If we get m extra data points (after we already estimated $\hat{\beta}_n$), we don't have to fit the whole model again; we can just 'update' our model by using the formulas we obtained (i.e. we can update $\hat{\beta}_n$ to $\hat{\beta}_{n+m}$).
Exercise 8
(a)
First, realize that a natural spline is a cubic spline with a constraint, namely: g(x) is a linear function in the intervals x ∈ (−∞, c₁) and x ∈ [c_K, ∞).
The constraint we want to impose in this exercise (on a cubic spline) is clearly a nested case of this natural-spline constraint. So, this constraint can also be expressed as a natural spline with an extra constraint, namely: the linear function in x ∈ (−∞, c₁) and x ∈ [c_K, ∞) has a slope of 0.
(b)
The constraint requires that g(x) is a constant for x ∈ (−∞, c₁) and x ∈ [c_K, ∞).
Let's look at the first interval. Since x ∈ (−∞, c₁), all $(x - c_k)_+^3 = 0$, which means that all $n_k(x) = 0$. So, g(x) becomes g(x) = θ₀ + θ₁x. For g(x) to be a constant, θ₁ should be 0.
(c)
A cubic spline is a set of 'stitched' cubic polynomials, and the stitching points are called 'knots'. So, each interval created by the knots carries a single cubic polynomial curve. Therefore, g(x) in the last interval is also a cubic polynomial.
Now, let's write this cubic polynomial in the interval x ∈ [c_K, ∞) as g(x) = α₀ + α₁x + α₂x² + α₃x³. The constraint from the exercise requires that this polynomial is a constant. So, α₁ = α₂ = α₃ = 0 and g(x) = α₀. Thus, g′(x) = 0.
(d)
From (b), we have θ₁ = 0. So g(x) in the interval x ∈ [c_K, ∞) is $g(x) = \theta_0 + \sum_{k=1}^{K-2}\theta_{k+1}n_k(x)$.
From (c), we know that g′(x) = 0 in this interval. So,
$$\begin{aligned}
g'(x) &= \sum_{k=1}^{K-2}\theta_{k+1}n_k'(x)\\
&= \sum_{k=1}^{K-2}\theta_{k+1}\big(d_k'(x) - d_{K-1}'(x)\big)\\
&= \sum_{k=1}^{K-2}\theta_{k+1}\left(\left(\frac{(x - c_k)^3 - (x - c_K)^3}{c_K - c_k}\right)' - \left(\frac{(x - c_{K-1})^3 - (x - c_K)^3}{c_K - c_{K-1}}\right)'\right)\\
&= \sum_{k=1}^{K-2}\theta_{k+1}\left(\frac{3(x - c_k)^2 - 3(x - c_K)^2}{c_K - c_k} - \frac{3(x - c_{K-1})^2 - 3(x - c_K)^2}{c_K - c_{K-1}}\right)\\
&= \sum_{k=1}^{K-2}\theta_{k+1}\left(\frac{3(c_K - c_k)(2x - c_k - c_K)}{c_K - c_k} - \frac{3(c_K - c_{K-1})(2x - c_{K-1} - c_K)}{c_K - c_{K-1}}\right)\\
&= \sum_{k=1}^{K-2}\theta_{k+1}\big(3(2x - c_k - c_K) - 3(2x - c_{K-1} - c_K)\big)\\
&= 3\sum_{k=1}^{K-2}\theta_{k+1}\big(c_{K-1} - c_k\big)\\
&= 0
\end{aligned}$$
(e)
Now, we reparametrize $g(x) = \theta_0 + \sum_{k=1}^{K-2}\theta_{k+1}n_k(x)$. Let
$$\eta_k = \begin{cases}\theta_0 & \text{if } k = 0\\ \theta_{k+1} & \text{if } k \in \{1, \dots, K-2\}\end{cases}$$
and let the new basis functions be
$$m_k(x) = \begin{cases}1 & \text{if } k = 0\\ n_k(x) & \text{if } k \in \{1, \dots, K-2\}.\end{cases}$$
This gives
$$g(x) = \sum_{k=0}^{K-2}\eta_k m_k(x).$$
Exercise 9
(a)
$$\begin{aligned}
P(Y = k) &= \frac{P(Y = k)}{\sum_{l=0}^{K-1}P(Y = l)}\\
&= \frac{\exp\big[\theta_{k,0} + \sum_{j=1}^{p}\theta_{k,j}x_j\big]}{\sum_{l=0}^{K-1}\exp\big[\theta_{l,0} + \sum_{j=1}^{p}\theta_{l,j}x_j\big]}\\
&= \frac{\exp\big[\theta_{k,0} + \sum_{j=1}^{p}\theta_{k,j}x_j\big]\big/\exp\big[\theta_{0,0} + \sum_{j=1}^{p}\theta_{0,j}x_j\big]}{\sum_{l=0}^{K-1}\exp\big[\theta_{l,0} + \sum_{j=1}^{p}\theta_{l,j}x_j\big]\big/\exp\big[\theta_{0,0} + \sum_{j=1}^{p}\theta_{0,j}x_j\big]}\\
&= \frac{\exp\big[\theta_{k,0} - \theta_{0,0} + \sum_{j=1}^{p}(\theta_{k,j} - \theta_{0,j})x_j\big]}{\sum_{l=0}^{K-1}\exp\big[\theta_{l,0} - \theta_{0,0} + \sum_{j=1}^{p}(\theta_{l,j} - \theta_{0,j})x_j\big]}\\
&= \frac{\exp\big[\theta_{k,0} - \theta_{0,0} + \sum_{j=1}^{p}(\theta_{k,j} - \theta_{0,j})x_j\big]}{1 + \sum_{l=1}^{K-1}\exp\big[\theta_{l,0} - \theta_{0,0} + \sum_{j=1}^{p}(\theta_{l,j} - \theta_{0,j})x_j\big]}
\end{aligned}$$
By defining $\beta_{l,j} = \theta_{l,j} - \theta_{0,j}$, we get
$$P(Y = k) = \frac{\exp\big[\beta_{k,0} + \sum_{j=1}^{p}\beta_{k,j}x_j\big]}{1 + \sum_{l=1}^{K-1}\exp\big[\beta_{l,0} + \sum_{j=1}^{p}\beta_{l,j}x_j\big]}.$$
By definition, $\sum_{l=0}^{K-1}P(Y = l) = 1$. So,
$$P(Y = 0) = \sum_{l=0}^{K-1}P(Y = l) - \sum_{l=1}^{K-1}P(Y = l) = 1 - \sum_{l=1}^{K-1}P(Y = l).$$
The β-model imposes the restriction that Y = 0 is set as the reference case. So, the β-model has a smaller number of parameters than the θ-model.
(b)
$$\begin{aligned}
P(Z_{ik} = 1) &= P(Y_i = k\mid Y_i \in \{0, k\})\\
&= \frac{P(Y_i = k)}{P(Y_i = 0) + P(Y_i = k)}\\
&= \frac{\dfrac{\exp\big[\beta_{k,0} + \sum_{j=1}^{p}\beta_{k,j}x_j\big]}{1 + \sum_{l=1}^{K-1}\exp\big[\beta_{l,0} + \sum_{j=1}^{p}\beta_{l,j}x_j\big]}}{1 - \sum_{l=1}^{K-1}P(Y = l) + \dfrac{\exp\big[\beta_{k,0} + \sum_{j=1}^{p}\beta_{k,j}x_j\big]}{1 + \sum_{l=1}^{K-1}\exp\big[\beta_{l,0} + \sum_{j=1}^{p}\beta_{l,j}x_j\big]}}\\
&= \frac{\dfrac{\exp\big[\beta_{k,0} + \sum_{j}\beta_{k,j}x_j\big]}{1 + \sum_{l=1}^{K-1}\exp\big[\beta_{l,0} + \sum_{j}\beta_{l,j}x_j\big]}}{\dfrac{1 + \sum_{l=1}^{K-1}\exp\big[\beta_{l,0} + \sum_{j}\beta_{l,j}x_j\big] - \sum_{l'=1}^{K-1}\exp\big[\beta_{l',0} + \sum_{j}\beta_{l',j}x_j\big]}{1 + \sum_{l=1}^{K-1}\exp\big[\beta_{l,0} + \sum_{j}\beta_{l,j}x_j\big]} + \dfrac{\exp\big[\beta_{k,0} + \sum_{j}\beta_{k,j}x_j\big]}{1 + \sum_{l=1}^{K-1}\exp\big[\beta_{l,0} + \sum_{j}\beta_{l,j}x_j\big]}}\\
&= \frac{\exp\big[\beta_{k,0} + \sum_{j=1}^{p}\beta_{k,j}x_j\big]}{1 + \exp\big[\beta_{k,0} + \sum_{j=1}^{p}\beta_{k,j}x_j\big]}
\end{aligned}$$
This is equal to logistic regression. So, we can use the theory of logistic regression to estimate β.
Exercise 11
(a)
We are asked to assume that there exists a separating hyperplane. This means that there exists β such that $y_i\beta^T x_i > 0$ for all i.
Since there are finitely many data points, this consequently means that there exists an ε > 0 such that $y_i\beta^T x_i \ge \varepsilon\,\lVert x_i\rVert$ for all i.
We can now rewrite:
$$y_i\beta^T x_i \ge \varepsilon\,\lVert x_i\rVert \;\Longleftrightarrow\; y_i\,\frac{\beta^T}{\varepsilon}\,\frac{x_i}{\lVert x_i\rVert} \ge 1.$$
Let $\beta_{\mathrm{sep}}^T = \dfrac{\beta^T}{\varepsilon}$ and $z_i = \dfrac{x_i}{\lVert x_i\rVert}$; then we have
$$y_i\beta_{\mathrm{sep}}^T z_i \ge 1.$$
(b)
$$\begin{aligned}
\lVert\beta_{\mathrm{new}} - \beta_{\mathrm{sep}}\rVert^2 &= \lVert\beta + y_i z_i - \beta_{\mathrm{sep}}\rVert^2\\
&= \lVert\beta - \beta_{\mathrm{sep}}\rVert^2 + \lVert y_i z_i\rVert^2 + 2(\beta - \beta_{\mathrm{sep}})^T y_i z_i\\
&= \lVert\beta - \beta_{\mathrm{sep}}\rVert^2 + 1 + 2\underbrace{y_i\beta^T z_i}_{\le 0} - 2\underbrace{y_i\beta_{\mathrm{sep}}^T z_i}_{\ge 1}\\
&\le \lVert\beta - \beta_{\mathrm{sep}}\rVert^2 + 1 + 0 - 2\\
&= \lVert\beta - \beta_{\mathrm{sep}}\rVert^2 - 1
\end{aligned}$$
To sum up, $\lVert\beta_{\mathrm{new}} - \beta_{\mathrm{sep}}\rVert^2 \le \lVert\beta - \beta_{\mathrm{sep}}\rVert^2 - 1$.
(c)
In (b), we showed that $\lVert\beta_{\mathrm{new}} - \beta_{\mathrm{sep}}\rVert^2 \le \lVert\beta - \beta_{\mathrm{sep}}\rVert^2 - 1$. In the context of an iterative algorithm, this can be rewritten as $\lVert\beta_{t+1} - \beta_{\mathrm{sep}}\rVert^2 \le \lVert\beta_t - \beta_{\mathrm{sep}}\rVert^2 - 1$.
So, the squared Euclidean distance between the separating coefficient vector ($\beta_{\mathrm{sep}}$) and our estimate ($\beta_t$) decreases by at least 1 at every update. Since a squared distance cannot become negative, the updates must stop, i.e. the algorithm finds a β that classifies every point correctly within a finite number of iterations.
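A minimal sketch of the update rule analysed above (my own implementation, not course code): repeatedly pick a misclassified point and set β ← β + y_i x_i until every point is classified correctly. Here x_i plays the role of z_i; the normalisation is skipped, which does not affect convergence for separable data.

set.seed(1)
n = 100
x = cbind(1, matrix(rnorm(2 * n), n, 2))          # rows are the feature vectors x_i
beta.true = c(-0.5, 2, -1)
y = ifelse(x %*% beta.true > 0, 1, -1)            # linearly separable labels in {-1, 1}
beta = rep(0, 3)
repeat {
  miss = which(y * (x %*% beta) <= 0)             # currently misclassified points
  if (length(miss) == 0) break                    # a separating beta has been found
  i = miss[1]
  beta = beta + y[i] * x[i, ]                     # perceptron update
}
all(y * (x %*% beta) > 0)                         # TRUE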
Exercise 14
(a)
The Bayes classifier is the classifier that minimizes the probability of misclassification (i.e. the error rate).
By using Bayes' theorem, we have:
$$\Pr(Y\mid X) = \frac{\Pr(Y, X)}{\Pr(X)} = \frac{\Pr(X\mid Y)\Pr(Y)}{\Pr(X)}.$$
We are given that
$$\Pr(X = x\mid Y = k) = \mathrm{Poisson}(\lambda_k) = \frac{(5 + 5k)^x e^{-(5+5k)}}{x!} \qquad\text{and}\qquad \pi_k = \Pr(Y = k) = \frac{1}{K}.$$
So,
$$\Pr(X = x) = \sum_{k=1}^{3}\Pr(X = x\mid Y = k)\Pr(Y = k) = \frac{1}{3}\sum_{k=1}^{3}\Pr(X = x\mid Y = k) = \frac{1}{3}\left(\frac{10^x e^{-10}}{x!} + \frac{15^x e^{-15}}{x!} + \frac{20^x e^{-20}}{x!}\right) = \frac{e^{-10}}{3(x!)}\big(10^x + 15^x e^{-5} + 20^x e^{-10}\big),$$
$$\Pr(Y = k\mid X = x) = \frac{\Pr(X = x\mid Y = k)\Pr(Y = k)}{\Pr(X = x)} = \frac{\frac{(5+5k)^x e^{-(5+5k)}}{x!}\cdot\frac{1}{3}}{\frac{e^{-10}}{3(x!)}\big(10^x + 15^x e^{-5} + 20^x e^{-10}\big)} = \frac{(5+5k)^x e^{(5-5k)}}{10^x + 15^x e^{-5} + 20^x e^{-10}} = \frac{(1+k)^x e^{(5-5k)}}{2^x + 3^x e^{-5} + 4^x e^{-10}}.$$
Minimizing the probability of misclassification is the same as maximizing the probability of correct classification. Thus, the Bayes classifier is
$$\operatorname*{arg\,max}_{k}\Pr(Y = k\mid X = x) = \operatorname*{arg\,max}_{k}\left\{\frac{(1+k)^x e^{(5-5k)}}{2^x + 3^x e^{-5} + 4^x e^{-10}}\right\} = \operatorname*{arg\,max}_{k}\big\{(1+k)^x e^{5(1-k)}\big\}.$$
(b)
The Bayes classifier minimizes the probability of misclassification (i.e. the error rate). So, the error rate of the Bayes classifier is
$$\Pr(Y \neq \hat{Y}\mid X) = 1 - \Pr(Y = \hat{Y}\mid X) = 1 - \max_k\Pr(Y = k\mid X).$$
# Theoretical error rate of Bayes classifier
theoretical.Bayes.error.rate = function(x, K) {
  prob.mat = data.frame(k = 1:K, prob = NA)
  for (k in 1:K) {
    prob.mat[k, "prob"] = ((1 + k)^x)*exp(5 - 5*k)/(2^x + 3^x*exp(-5) + 4^x*exp(-10))
  }
  theo.error.rate = 1 - prob.mat[which.max(prob.mat[, "prob"]), "prob"]
  return(theo.error.rate)
}
theoretical.Bayes.error.rate.vec = Vectorize(theoretical.Bayes.error.rate, vectorize.args = c("x"))

# Plot the theoretical error rate of Bayes classifier.
x.grid = 0:50
y.grid = theoretical.Bayes.error.rate.vec(x.grid, 3)
plot(x = x.grid, y = y.grid, type = "l", xlab = "x", ylab = "Error rate of Bayes classifier")

[Figure: theoretical error rate of the Bayes classifier as a function of x, for x = 0, ..., 50.]
(c)
set.seed(2017)
# Simulate y
simulated.data = data.frame(y = sample(x = 1:3, size = 1000, replace = T))
# Simulate X
simulated.data[(simulated.data[,"y"] == 1), "x"] = rpois(sum(simulated.data[,"y"] == 1), 10)
simulated.data[(simulated.data[,"y"] == 2), "x"] = rpois(sum(simulated.data[,"y"] == 2), 15)
simulated.data[(simulated.data[,"y"] == 3), "x"] = rpois(sum(simulated.data[,"y"] == 3), 20)
# Bayes classifier
Bayes.classifier = function(x, K) {
  prob.mat = data.frame(k = 1:K, prob = NA)
  for (k in 1:K) {
    prob.mat[k, "prob"] = ((1 + k)^x)*exp(5 - 5*k)/(2^x + 3^x*exp(-5) + 4^x*exp(-10))
  }
  y.hat = prob.mat[which.max(prob.mat[,"prob"]), "k"]
  return(y.hat)
}
# Compute y.hat based on Bayes classifier.
Bayes.classifier.vec = Vectorize(Bayes.classifier, vectorize.args = c("x"))
simulated.data[,"y.hat.Bayes"] = Bayes.classifier.vec(simulated.data[,"x"], 3)
simulated.data[,"is.pred.correct"] =
  as.numeric(simulated.data[,"y.hat.Bayes"] == simulated.data[,"y"])
# Overall error rate
error.rate = 1 - sum(simulated.data[,"y"] == simulated.data[,"y.hat.Bayes"])/nrow(simulated.data)
show(error.rate)
# Error rate per x value
empirical.error.rate.mat = data.frame(
  x = sort(unique(simulated.data[,"x"])),
  n = as.numeric(table(simulated.data[,"x"])),
  n.correct.pred = NA)
for (i in 1:nrow(empirical.error.rate.mat)) {
  x.target = empirical.error.rate.mat[i, "x"]
  empirical.error.rate.mat[i, "n.correct.pred"] =
    sum(simulated.data[(simulated.data[,"x"] == x.target), "is.pred.correct"])
}
empirical.error.rate.mat[,"error.rate"] =
  1 - empirical.error.rate.mat[,"n.correct.pred"]/empirical.error.rate.mat[,"n"]
# Plot the error rate as a function of x.
plot(x = empirical.error.rate.mat[,"x"], y = empirical.error.rate.mat[,"error.rate"],
     type = "l", xlab = "x", ylab = "Error rate of Bayes classifier")
points(x = x.grid, y = y.grid, type = "l", lty = 2, col = "red")
legend("topright", c("Theoretical error rate","Error rate from simulation"),
       lty = c(2,1), col = c("red","black"))

[Figure: error rate from the simulation (solid black) and the theoretical error rate (dashed red) as functions of x.]
Exercise 15
(a)
Use the same approach as in exercise 14 (a).
$$\Pr(X = x) = \sum_{k=1}^{2}\Pr(X = x\mid Y = k)\Pr(Y = k) = \frac{1}{2}\sum_{k=1}^{2}\Pr(X = x\mid Y = k) = \frac{1}{2}\left(\frac{1}{\sqrt{2\pi}}e^{-\frac{(x+1)^2}{2}} + \frac{1}{\sqrt{2\pi}}e^{-\frac{(x-1)^2}{2}}\right) = \frac{1}{2\sqrt{2\pi}}\left(e^{-\frac{(x+1)^2}{2}} + e^{-\frac{(x-1)^2}{2}}\right)$$
$$\Pr(Y = k\mid X = x) = \frac{\Pr(X = x\mid Y = k)\Pr(Y = k)}{\Pr(X = x)} = \frac{\frac{1}{2}\cdot\frac{1}{\sqrt{2\pi}}e^{-\frac{(x-\mu_k)^2}{2}}}{\frac{1}{2\sqrt{2\pi}}\left(e^{-\frac{(x+1)^2}{2}} + e^{-\frac{(x-1)^2}{2}}\right)} = \frac{e^{-\frac{(x-\mu_k)^2}{2}}}{e^{-\frac{(x+1)^2}{2}} + e^{-\frac{(x-1)^2}{2}}}$$
Bayes classifier:
$$\operatorname*{arg\,max}_{k}\Pr(Y = k\mid X = x) = \operatorname*{arg\,max}_{k}\left\{\frac{e^{-\frac{(x-\mu_k)^2}{2}}}{e^{-\frac{(x+1)^2}{2}} + e^{-\frac{(x-1)^2}{2}}}\right\}$$
We can simplify this classifier further. We examine the decision boundary:
$$\Pr(Y = 1\mid X = x) > \Pr(Y = 2\mid X = x) \iff -(x+1)^2 > -(x-1)^2 \iff x < 0.$$
So, we have
$$k^{\mathrm{Bayes}} = \operatorname*{arg\,min}_{k}\big\{1 - \Pr(Y = k\mid X = x)\big\} = \operatorname*{arg\,max}_{k}\Pr(Y = k\mid X = x) = \begin{cases}1 & \text{if } x < 0\\ 2 & \text{otherwise.}\end{cases}$$
(b)
We plot
$$\Pr(Y = 1\mid X = x) = \frac{e^{-\frac{(x+1)^2}{2}}}{e^{-\frac{(x+1)^2}{2}} + e^{-\frac{(x-1)^2}{2}}}$$

prob.y1.cond.x.func = function(x) {
  prob.y1.cond.x = exp(-((x+1)^2)/2) / (exp(-((x+1)^2)/2) + exp(-((x-1)^2)/2))
  return(prob.y1.cond.x)
}
x.grid = seq(from = -10, to = 10, by = 0.1)
y.grid = prob.y1.cond.x.func(x.grid)
plot(x = x.grid, y = y.grid, type = "l", xlab = "x", ylab = "Pr(Y=1|X)")

[Figure: Pr(Y = 1 | X = x) plotted for x from -10 to 10.]
(c)
$$f_X(x) = \sum_{k=1}^{2}\Pr(X = x\mid Y = k)\Pr(Y = k) = f(x\mid y = 1)f(y = 1) + f(x\mid y = 2)f(y = 2) = \frac{1}{2}f(x\mid y = 1) + \frac{1}{2}f(x\mid y = 2) = \frac{1}{2\sqrt{2\pi}}\left(e^{-\frac{(x+1)^2}{2}} + e^{-\frac{(x-1)^2}{2}}\right)$$
Null hypothesis testing: reject H₀ if $F_X(x) < \frac{\alpha}{2}$ or $F_X(x) > 1 - \frac{\alpha}{2}$, where
$$F_X(x) = \int_{-\infty}^{x}f_X(u)\,du = \frac{1}{2}\Phi(x + 1) + \frac{1}{2}\Phi(x - 1).$$
(d)
When α = 0, the acceptance region is (−∞, ∞) and we will always accept the null hypothesis. In this case, the given classifier is equal to the Bayes classifier.
Bayes.classifier = function(x) {
  y.hat = as.numeric(x < 0)*1 + as.numeric(x >= 0)*2
  return(y.hat)
}
custom.classifier = function(x, alpha) {
  # Test null hypothesis
  if (
    (1/2*pnorm(x+1) + 1/2*pnorm(x-1) < alpha/2) | (1/2*pnorm(x+1) + 1/2*pnorm(x-1) > 1 - alpha/2)
  ) {
    # Null hypothesis is rejected
    y.hat = c("outlier")
  } else {
    # Null hypothesis is accepted
    y.hat = Bayes.classifier(x)
  }
  return(y.hat)
}
custom.classifier.vec = Vectorize(custom.classifier, vectorize.args = c("x"))
(e)
> set.seed(2017)
> # Simulate y
> simulated.data = data.frame(y = sample(x = 1:2, size = 1000, replace = T))
> # Simulate X
> simulated.data[(simulated.data[,"y"] == 1), "x"] = rnorm(sum(simulated.data[,"y"] == 1),
+                                                          mean = -1, sd = 1)
> simulated.data[(simulated.data[,"y"] == 2), "x"] = rnorm(sum(simulated.data[,"y"] == 2),
+                                                          mean = 1, sd = 1)
> # Perform classification
> # alpha = 0.05
> y.hat.1 = custom.classifier.vec(x = simulated.data[,"x"], alpha = 0.05)
> # alpha = 0.01
> y.hat.2 = custom.classifier.vec(x = simulated.data[,"x"], alpha = 0.01)
> # alpha = 0
> y.hat.3 = custom.classifier.vec(x = simulated.data[,"x"], alpha = 0)
> # Error rate
> error.rate.1 = 1 - sum(simulated.data[,"y"] == y.hat.1)/nrow(simulated.data)
> error.rate.2 = 1 - sum(simulated.data[,"y"] == y.hat.2)/nrow(simulated.data)
> error.rate.3 = 1 - sum(simulated.data[,"y"] == y.hat.3)/nrow(simulated.data)
> cat("Error rate with alpha = 0.05: ", error.rate.1, sep = "", "\n")
Error rate with alpha = 0.05: 0.205
> cat("Error rate with alpha = 0.01: ", error.rate.2, sep = "", "\n")
Error rate with alpha = 0.01: 0.166
> cat("Error rate with alpha = 0: ", error.rate.3, sep = "", "\n")
Error rate with alpha = 0: 0.158
Exercise 16
(a)


$$\frac{1}{n}\sum_{g=1}^{G} n_g\bar{y}_g = \frac{1}{n}\sum_{g=1}^{G} n_g\left(\frac{1}{n_g}\sum_{i\in g}y_i\right) = \frac{1}{n}\sum_{g=1}^{G}\sum_{i\in g}y_i = \frac{1}{n}\sum_{i=1}^{n}y_i = \hat{\mu},$$
i.e. the weighted average $\sum_{g=1}^{G}\frac{n_g}{n}\bar{y}_g$ of the group means equals the overall mean $\hat{\mu}$.
$$\begin{aligned}
\hat{\sigma}^2 &= \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\\
&= \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \bar{y}_g + \bar{y}_g - \bar{y}\big)^2\\
&= \frac{1}{n}\sum_{i=1}^{n}\Big((y_i - \bar{y}_g)^2 + (\bar{y}_g - \bar{y})^2 + 2(y_i - \bar{y}_g)(\bar{y}_g - \bar{y})\Big)\\
&= \frac{1}{n}\sum_{g=1}^{G} n_g\left(\frac{1}{n_g}\sum_{i\in g}(y_i - \bar{y}_g)^2 + \frac{1}{n_g}\sum_{i\in g}(\bar{y}_g - \bar{y})^2\right) + \frac{2}{n}\sum_{g=1}^{G}\sum_{i\in g}(y_i - \bar{y}_g)(\bar{y}_g - \bar{y})\\
&= \frac{1}{n}\sum_{g=1}^{G} n_g\Big(\hat{\sigma}_g^2 + (\bar{y}_g - \bar{y})^2\Big) + \frac{2}{n}\sum_{g=1}^{G}(\bar{y}_g - \bar{y})\sum_{i\in g}(y_i - \bar{y}_g)\\
&= \frac{1}{n}\sum_{g=1}^{G} n_g\Big(\hat{\sigma}_g^2 + (\bar{y}_g - \bar{y})^2\Big) + \frac{2}{n}\sum_{g=1}^{G}(\bar{y}_g - \bar{y})\cdot 0\\
&= \frac{1}{n}\sum_{g=1}^{G} n_g\Big(\hat{\sigma}_g^2 + (\bar{y}_g - \bar{y})^2\Big)
\end{aligned}$$
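A quick sketch (arbitrary simulated groups of my own) checking both identities numerically:

set.seed(1)
g = sample(1:3, size = 200, replace = TRUE)                 # group labels
y = rnorm(200, mean = g)
n = length(y)
ng = table(g)
ybar.g = tapply(y, g, mean)
var.g = tapply(y, g, function(v) mean((v - mean(v))^2))     # within-group sigma.hat_g^2
c(mean(y), sum(ng / n * ybar.g))                            # overall mean, two ways
c(mean((y - mean(y))^2),
  sum(ng / n * (var.g + (ybar.g - mean(y))^2)))             # overall variance, two ways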
(b)
i)
Use the results from (a) with g1 = {1, · · · , n − 1} and g2 = {n}.
ii)
$$\begin{aligned}
\hat{\sigma}_n^2 &= \frac{n_{g_1}}{n}\big(\hat{\sigma}_{g_1}^2 + (\bar{y}_{g_1} - \bar{y})^2\big) + \frac{n_{g_2}}{n}\big(\hat{\sigma}_{g_2}^2 + (\bar{y}_{g_2} - \bar{y})^2\big)\\
&= \frac{n-1}{n}\big(\hat{\sigma}_{n-1}^2 + (\bar{y}_{n-1} - \bar{y})^2\big) + \frac{1}{n}\big((y_n - y_n)^2 + (y_n - \bar{y})^2\big)\\
&= \frac{n-1}{n}\hat{\sigma}_{n-1}^2 + \frac{n-1}{n}(\bar{y}_{n-1} - \bar{y})^2 + \frac{1}{n}(y_n - \bar{y})^2
\end{aligned}$$
Use $\bar{y} = \frac{n-1}{n}\bar{y}_{n-1} + \frac{1}{n}y_n$:
$$\begin{aligned}
\hat{\sigma}_n^2 &= \frac{n-1}{n}\hat{\sigma}_{n-1}^2 + \frac{n-1}{n}\Big(\bar{y}_{n-1} - \frac{n-1}{n}\bar{y}_{n-1} - \frac{1}{n}y_n\Big)^2 + \frac{1}{n}\Big(y_n - \frac{n-1}{n}\bar{y}_{n-1} - \frac{1}{n}y_n\Big)^2\\
&= \frac{n-1}{n}\hat{\sigma}_{n-1}^2 + \frac{n-1}{n}\cdot\frac{1}{n^2}\big(\bar{y}_{n-1} - y_n\big)^2 + \frac{1}{n}\cdot\frac{(n-1)^2}{n^2}\big(y_n - \bar{y}_{n-1}\big)^2\\
&= \frac{n-1}{n}\hat{\sigma}_{n-1}^2 + \frac{(n-1) + (n-1)^2}{n^3}\big(y_n - \bar{y}_{n-1}\big)^2\\
&= \frac{n-1}{n}\hat{\sigma}_{n-1}^2 + \frac{n-1}{n^2}\big(y_n - \bar{y}_{n-1}\big)^2
\end{aligned}$$
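A small sketch (my own loop) verifying the running update of σ̂² against the direct formula:

set.seed(1)
y = rnorm(100)
mu = y[1]                                          # mean estimate based on the first point
s2 = 0                                             # sigma.hat^2 based on the first point
for (n in 2:length(y)) {
  s2 = (n - 1) / n * s2 + (n - 1) / n^2 * (y[n] - mu)^2    # update from ii), uses the old mean
  mu = (n - 1) / n * mu + y[n] / n                          # then update the mean
}
c(running = s2, direct = mean((y - mean(y))^2))             # the two agree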
(c)
We already showed this in (iv) of exercise 7 (b). A recap of the result:
$$\begin{aligned}
\hat{\beta}_n &= (X_n^T X_n)^{-1}X_n^T y_n = M_n^{-1}X_n^T y_n\\
&= \left(M_{n-1}^{-1} - \frac{\tilde{x}_n x_n^T M_{n-1}^{-1}}{1 + x_n^T\tilde{x}_n}\right)\big(X_{n-1}^T y_{n-1} + x_n y_n\big)\\
&= \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\hat{\beta}_{n-1} + \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\tilde{x}_n y_n\\
&= \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\big(\hat{\beta}_{n-1} + \tilde{x}_n y_n\big)
\end{aligned}$$
This equation allows us to update the coefficients (instead of recalculating them from scratch) when we add or remove a data point.
Instead of doing matrix multiplication with a design matrix of size n × p, we do matrix multiplication between $\big(I - \tilde{x}_n x_n^T/(1 + x_n^T\tilde{x}_n)\big)$ and $\hat{\beta}_{n-1} + \tilde{x}_n y_n$, which are of size p × p and p × 1 respectively. So, when p < n, we use less memory.
We can start this algorithm with the ordinary least squares estimate based on (at least) the first p + 1 data points; this is to ensure that the matrix to be inverted is nonsingular.
(d)
The 'empty' linear model (i.e. no predictors) has design matrix $X = (1, \dots, 1)^T$ and estimator $\hat{y}_i = \hat{\beta} = \hat{\beta}_0 = (X^T X)^{-1}X^T y = \frac{1}{n}\sum_{i=1}^{n}y_i = \bar{y}_n$. We plug this into the result from (c):
$$M_n = X_n^T X_n = n, \qquad \tilde{x}_n = M_{n-1}^{-1}x_n = \frac{1}{n-1},$$
$$\bar{y}_n = \hat{\beta}_n = \left(I - \frac{\tilde{x}_n x_n^T}{1 + x_n^T\tilde{x}_n}\right)\big(\hat{\beta}_{n-1} + \tilde{x}_n y_n\big) = \left(1 - \frac{\frac{1}{n-1}\cdot 1}{1 + 1\cdot\frac{1}{n-1}}\right)\left(\bar{y}_{n-1} + \frac{1}{n-1}y_n\right) = \frac{n-1}{n}\left(\bar{y}_{n-1} + \frac{1}{n-1}y_n\right) = \frac{n-1}{n}\bar{y}_{n-1} + \frac{1}{n}y_n.$$
So, (*) is a special case of (**) when $X = (1, \dots, 1)^T$.
Exercise 17
(a)
βj follows the normal distribution. So, we expect that 1,000,000 · 0.01 = 10,000 null hypotheses will be rejected.
(b)


The probability of making at least one error when we use significance level α/q is $\Pr\big(\bigcup_{j}\{\text{reject } H_{0,j}\}\big)$. We apply Boole's inequality to this:
$$\Pr\Big(\bigcup_{j=1}^{q}\{\text{reject } H_{0,j}\}\Big) \le \sum_{j=1}^{q}\Pr(\text{reject } H_{0,j}) = \sum_{j=1}^{q}\frac{\alpha}{q} = \alpha.$$
Thus, if all $H_{0,j}$'s are true, the probability of making at least one error is less than or equal to α.
$$\text{power} = \Pr(\text{reject } H_{0,j}\mid H_{1,j}\text{ is true}) = \Pr\Big(p_j < \frac{\alpha}{q}\Big)$$
Obviously, $\Pr\big(p_j < \frac{\alpha}{1\,000\,000}\big) \le \Pr(p_j < \alpha)$, so the correction reduces the power.
(c)
V, S, U, T, R are stochastic.
Without the Bonferroni correction and given that all $H_{0,j}$'s are true, the probability of wrongly rejecting $H_{0,j}$ is α. Assuming the q tests are independent, the type I error rate is
$$\Pr(V > 0) = 1 - \Pr(V = 0) = 1 - (1 - \alpha)^q.$$
If we apply the Bonferroni correction, the type I error rate becomes
$$\Pr(V > 0) = 1 - \Big(1 - \frac{\alpha}{q}\Big)^q.$$
Since $1 - \frac{\alpha}{q} > 1 - \alpha$, the Bonferroni correction decreases the type I error rate. (But it also reduces the power.)
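A short simulation sketch (parameters chosen by me) illustrating the two familywise error rates from (c) for q independent true null hypotheses:

set.seed(1)
q = 100
alpha = 0.05
n.sim = 2000
p.vals = matrix(runif(q * n.sim), nrow = n.sim)         # p-values under true nulls
fwer.raw = mean(apply(p.vals < alpha, 1, any))          # no correction
fwer.bonf = mean(apply(p.vals < alpha / q, 1, any))     # Bonferroni correction
c(fwer.raw, 1 - (1 - alpha)^q,                          # close to each other
  fwer.bonf, 1 - (1 - alpha / q)^q)                     # also close to each other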
(d)
q₀ = q implies S = T = 0. Thus,
$$E\left[\frac{V}{R}\right] = E\left[\frac{V}{V + S}\right] = E\left[\frac{V}{V}\right] = E[1] = 1.$$
So, E[V/R] = 1 and we can never achieve E[V/R] ≤ α.
(e)
When R > 0, q0 = q still implies S = T = 0. So, we have the same problem as in (d).
(f )
[Figure: estimated FDR plotted against q0, for q0 = 0, ..., 100; the FDR values lie roughly between 0 and 0.06.]