
10-702 Assignment 5 Solution
Jiyan Pan
[email protected]
1. Classification
1, 𝑓(π‘₯) β‰₯ 0
a) Let β„Ž(π‘₯) = sign(𝑓(π‘₯)) = {
, then
0, 𝑓(π‘₯) < 0
𝑅(𝑓) = β„™(π‘Œ β‰  β„Ž(𝑋)) = ∫ β„™(π‘Œ β‰  β„Ž(𝑋)|𝑋 = π‘₯) 𝑑𝑃(π‘₯)
= ∫[β„™(π‘Œ = 0, β„Ž(π‘₯) = 1|𝑋 = π‘₯) + β„™(π‘Œ = 1, β„Ž(π‘₯) = 0|𝑋 = π‘₯)] 𝑑𝑃(π‘₯)
= ∫[β„™(π‘Œ = 0|𝑋 = π‘₯)β„Ž(π‘₯) + β„™(π‘Œ = 1|𝑋 = π‘₯)(1 βˆ’ β„Ž(π‘₯))] 𝑑𝑃(π‘₯)
= ∫[(1 βˆ’ π‘š(π‘₯))β„Ž(π‘₯) + π‘š(π‘₯)(1 βˆ’ β„Ž(π‘₯))] 𝑑𝑃(π‘₯)
= ∫[β„Ž(π‘₯) + π‘š(π‘₯) βˆ’ 2π‘š(π‘₯)β„Ž(π‘₯)] 𝑑𝑃(π‘₯).
If we denote β„Žβˆ— (π‘₯) = sign(π‘š(π‘₯) βˆ’ 1⁄2), the excess risk is
$$R(f) - R^* = \int \big[h(x) + m(x) - 2m(x)h(x)\big]\,dP(x) - \int \big[h^*(x) + m(x) - 2m(x)h^*(x)\big]\,dP(x)$$
$$= \int (2m(x) - 1)\big(h^*(x) - h(x)\big)\,dP(x) = \begin{cases} \int |2m(x) - 1|\,(1 - h(x))\,dP(x), & m(x) \ge 1/2, \\ \int |2m(x) - 1|\,h(x)\,dP(x), & m(x) < 1/2. \end{cases}$$
When π‘š(π‘₯) β‰₯ 1⁄2, we have β„Žβˆ— (π‘₯) = 1, then no matter if β„Ž(π‘₯) = 1 or 0, 1 βˆ’ β„Ž(π‘₯) =
1[β„Ž(π‘₯) β‰  β„Žβˆ— (π‘₯)]; when π‘š(π‘₯) < 1⁄2, we have β„Žβˆ— (π‘₯) = 0, then no matter if β„Ž(π‘₯) = 1 or 0,
β„Ž(π‘₯) = 1[β„Ž(π‘₯) β‰  β„Žβˆ— (π‘₯)]. Therefore,
𝑅(𝑓) βˆ’ 𝑅 βˆ— = ∫ 1[β„Ž(π‘₯) β‰  β„Žβˆ— (π‘₯)]|2π‘š(π‘₯) βˆ’ 1| 𝑑𝑃(π‘₯)
= 𝔼(1[β„Ž(𝑋) β‰  β„Žβˆ— (𝑋)]|2π‘š(𝑋) βˆ’ 1|)
= 𝔼(1[sign(𝑓(𝑋)) β‰  sign(π‘š(𝑋) βˆ’ 1⁄2)]|2π‘š(𝑋) βˆ’ 1|).
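As a quick numerical sanity check of this identity (not part of the original solution), the following R sketch compares the two sides on a toy discrete distribution for 𝑋, with arbitrary made-up values of π‘š(π‘₯) and 𝑓(π‘₯):

# Toy check of R(f) - R* = E[ 1{sign(f(X)) != sign(m(X)-1/2)} * |2m(X)-1| ]
px <- c(0.2, 0.3, 0.1, 0.25, 0.15)   # P(X = x) on five support points
m  <- c(0.9, 0.4, 0.7, 0.2, 0.55)    # m(x) = P(Y = 1 | X = x)
f  <- c(1.0, 0.3, -0.2, -1.5, -0.1)  # an arbitrary classifier score f(x)
h     <- as.numeric(f >= 0)          # h(x) = sign(f(x)), coded as 1/0
hstar <- as.numeric(m >= 1/2)        # Bayes rule h*(x)
risk  <- function(h) sum(px * ((1 - m) * h + m * (1 - h)))
excess_direct  <- risk(h) - risk(hstar)
excess_formula <- sum(px * (h != hstar) * abs(2 * m - 1))
c(excess_direct, excess_formula)     # both equal 0.115

Both expressions give the same number, as the derivation requires.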
b) The optimal πœ™-risk is
$$R_\phi^* = \inf_{f\in\mathcal{M}} \mathbb{E}[\phi(Yf(X))] = \inf_{f\in\mathcal{M}} \mathbb{E}\big[\mathbb{E}[\phi(Yf(X)) \mid X]\big] = \inf_{f\in\mathcal{M}} \mathbb{E}\big[m(X)\phi(f(X)) + (1 - m(X))\phi(-f(X))\big]$$
$$= \mathbb{E}\Big[\inf_{f\in\mathcal{M}} \big(m(X)\phi(f(X)) + (1 - m(X))\phi(-f(X))\big)\Big] = \mathbb{E}\Big[\inf_{\alpha\in\mathbb{R}} \big(m(X)\phi(\alpha) + (1 - m(X))\phi(-\alpha)\big)\Big] = \mathbb{E}[H_\phi(m(X))].$$
c)
(1) The conditional πœ™-risk is
π‘šπœ™(𝛼) + (1 βˆ’ π‘š)πœ™(βˆ’π›Ό)
= π‘š(𝑒 βˆ’π›Ό βˆ’ 𝛼 βˆ’ 1) + (1 βˆ’ π‘š)(𝑒 𝛼 + 𝛼 βˆ’ 1)
= π‘šπ‘’ βˆ’π›Ό + (1 βˆ’ π‘š)𝑒 𝛼 + (1 βˆ’ 2π‘š)𝛼 βˆ’ 1.
Taking the derivative w.r.t. 𝛼 and setting it to zero at 𝛼 = 𝛼*, we have
$$-m e^{-\alpha^*} + (1 - m)e^{\alpha^*} + 1 - 2m = 0;$$
multiplying through by $e^{\alpha^*}$, this factors as
$$\big(e^{\alpha^*} + 1\big)\big((1 - m)e^{\alpha^*} - m\big) = 0.$$
As $e^{\alpha^*} + 1 > 0$, $(1 - m)e^{\alpha^*} - m$ must be 0. Therefore, $\alpha^* = \log\frac{m}{1 - m}$. Substituting it back into the conditional πœ™-risk, we obtain the optimal conditional πœ™-risk:
$$H_\phi(m) = m e^{-\alpha^*} + (1 - m)e^{\alpha^*} + (1 - 2m)\alpha^* - 1 = (1 - 2m)\log\frac{m}{1 - m}.$$
To derive π»πœ™βˆ’(π‘š), we note that the conditional πœ™-risk is convex in 𝛼. If π‘š > 1⁄2, then the allowable range of 𝛼 in π»πœ™βˆ’(π‘š) is 𝛼 ≀ 0; on the other hand, we have 𝛼* > 0. Therefore, the 𝛼 attaining the infimum in π»πœ™βˆ’(π‘š) must be 0. If π‘š ≀ 1⁄2, then the allowable range of 𝛼 in π»πœ™βˆ’(π‘š) is 𝛼 β‰₯ 0; on the other hand, we have 𝛼* ≀ 0. Therefore, the 𝛼 attaining the infimum in π»πœ™βˆ’(π‘š) must also be 0. In either case, π»πœ™βˆ’(π‘š) = πœ™(0) = 0. As a result, we have
1+πœƒ
1+πœƒ
πœ“Μƒπœ™ (πœƒ) = π»πœ™βˆ’ (
) βˆ’ π»πœ™ (
)
2
2
1+πœƒ
= 0 βˆ’ π»πœ™ (
)
2
1+πœƒ
= πœƒ log
.
1βˆ’πœƒ
As the second derivative of $\tilde\psi_\phi(\theta)$ is $\frac{4}{(1-\theta^2)^2} > 0$, $\tilde\psi_\phi(\theta)$ is convex in πœƒ. Therefore,
$$\psi_\phi(\theta) = \tilde\psi_\phi(\theta) = \theta\log\frac{1+\theta}{1-\theta}.$$
The π»πœ™ and πœ“πœ™ are plotted in Figure 1.
[Figure 1: π»πœ™ versus π‘š (left panel) and πœ“πœ™ versus πœƒ (right panel).]
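Figure 1 can be regenerated from the closed forms above with a short R sketch (my own plotting code, not part of the original solution):

# H_phi(m) = (1 - 2m) log(m/(1-m)) and psi_phi(theta) = theta log((1+theta)/(1-theta))
m     <- seq(0.001, 0.999, length.out = 400)
theta <- seq(-0.999, 0.999, length.out = 400)
Hphi  <- (1 - 2 * m) * log(m / (1 - m))
psi   <- theta * log((1 + theta) / (1 - theta))
par(mfrow = c(1, 2))
plot(m, Hphi, type = "l", xlab = "m", ylab = "H_phi")
plot(theta, psi, type = "l", xlab = "theta", ylab = "psi_phi")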
(2) The conditional πœ™-risk is
π‘šπœ™(𝛼) + (1 βˆ’ π‘š)πœ™(βˆ’π›Ό)
0.5
1.0
= π‘š(1 βˆ’ 𝛼)+ + (1 βˆ’ π‘š)(1 + 𝛼)+ .
It is evident that the conditional πœ™-risk could reach its minimum only when βˆ’1 ≀ 𝛼 ≀ 1.
In this interval, the conditional πœ™-risk is 1 + (1 βˆ’ 2π‘š)𝛼. Clearly, when π‘š > 1⁄2, the
conditional πœ™-risk reaches its minimum 2(1 βˆ’ π‘š) when 𝛼 = 1; when π‘š < 1⁄2, the
conditional πœ™-risk reaches its minimum 2π‘š when 𝛼 = βˆ’1. Therefore, π»πœ™ (π‘š) = 1 βˆ’
|2π‘š βˆ’ 1|.
It is also evident that when π‘š > 1⁄2, the allowable range of 𝛼 in π»πœ™βˆ’ (π‘š) is 𝛼 ≀ 0, and
the conditional πœ™-risk reaches its minimum 1 under this allowable range when 𝛼 = 0;
when π‘š < 1⁄2, the allowable range of 𝛼 in π»πœ™βˆ’ (π‘š) is 𝛼 β‰₯ 0, and the conditional πœ™-risk
reaches its minimum 1 under this allowable range when 𝛼 = 0. Therefore, π»πœ™βˆ’ (π‘š) = 1.
Therefore,
1+πœƒ
1+πœƒ
πœ“Μƒπœ™ (πœƒ) = π»πœ™βˆ’ (
) βˆ’ π»πœ™ (
)
2
2
= 1 βˆ’ (1 βˆ’ |1 + πœƒ βˆ’ 1|)
= |πœƒ|.
As |πœƒ| is convex,
πœ“πœ™ (πœƒ) = πœ“Μƒπœ™ (πœƒ) = |πœƒ|.
The π»πœ™ and πœ“πœ™ are plotted in Figure 2.
[Figure 2: π»πœ™ versus π‘š (left panel) and πœ“πœ™ versus πœƒ (right panel).]
(3) The conditional πœ™-risk is
π‘šπœ™(𝛼) + (1 βˆ’ π‘š)πœ™(βˆ’π›Ό)
= π‘š(1 βˆ’ 𝛼)2+ + (1 βˆ’ π‘š)(1 + 𝛼)2+ .
It is evident that the conditional πœ™-risk could reach its minimum only when βˆ’1 ≀ 𝛼 ≀ 1.
In this interval, the conditional πœ™-risk is π‘š(1 βˆ’ 𝛼)2 + (1 βˆ’ π‘š)(1 + 𝛼)2 . Taking its
derivative w.r.t. 𝛼 and setting it to be zero, we have 𝛼 βˆ— = 2π‘š βˆ’ 1. Substituting it back
into the conditional πœ™-risk, we have
π»πœ™ (π‘š) = π‘š(1 βˆ’ 𝛼 βˆ— )2 + (1 βˆ’ π‘š)(1 + 𝛼 βˆ— )2
= π‘š(2 βˆ’ 2π‘š)2 + (1 βˆ’ π‘š)(2π‘š)2
= 4π‘š(1 βˆ’ π‘š).
To derive π»πœ™βˆ’(π‘š), we note that the conditional πœ™-risk is convex in 𝛼. If π‘š > 1⁄2, then the allowable range of 𝛼 in π»πœ™βˆ’(π‘š) is 𝛼 ≀ 0; on the other hand, we have 𝛼* > 0. Therefore, the 𝛼 attaining the infimum in π»πœ™βˆ’(π‘š) must be 0. If π‘š < 1⁄2, then the allowable range of 𝛼 in π»πœ™βˆ’(π‘š) is 𝛼 β‰₯ 0; on the other hand, we have 𝛼* < 0. Therefore, the 𝛼 attaining the infimum in π»πœ™βˆ’(π‘š) must also be 0. In either case, π»πœ™βˆ’(π‘š) = 1. As a result, we have
1+πœƒ
1+πœƒ
πœ“Μƒπœ™ (πœƒ) = π»πœ™βˆ’ (
) βˆ’ π»πœ™ (
)
2
2
1+πœƒ1βˆ’πœƒ
= 1βˆ’4
2
2
= πœƒ2.
As πœƒ 2 is convex,
πœ“πœ™ (πœƒ) = πœ“Μƒπœ™ (πœƒ) = πœƒ 2 .
The π»πœ™ and πœ“πœ™ are plotted in Figure 3.
[Figure 3: π»πœ™ versus π‘š (left panel) and πœ“πœ™ versus πœƒ (right panel).]
(4) The conditional πœ™-risk is
π‘šπœ™(𝛼) + (1 βˆ’ π‘š)πœ™(βˆ’π›Ό)
π‘šπ‘’ βˆ’π›Ό + (1 βˆ’ π‘š)(2 βˆ’ 𝑒 βˆ’π›Ό ), 𝛼 > 0
={
.
π‘š(2 βˆ’ 𝑒 𝛼 ) + (1 βˆ’ π‘š)𝑒 𝛼 ,
𝛼≀0
The derivative of the conditional πœ™-risk w.r.t. 𝛼 is
$$\begin{cases} (1 - 2m)e^{-\alpha}, & \alpha > 0, \\ (1 - 2m)e^{\alpha}, & \alpha \le 0. \end{cases}$$
When π‘š < 1⁄2, the derivative of the conditional πœ™-risk is positive for all 𝛼. Therefore,
𝛼 βˆ— = βˆ’βˆž, and the conditional πœ™-risk has the minimum 2π‘š; when π‘š > 1⁄2, the
derivative of the conditional πœ™-risk is negative for all 𝛼. Therefore, 𝛼 βˆ— = +∞, and the
conditional πœ™-risk has the minimum 2(1 βˆ’ π‘š). Combining the results, we have
π»πœ™ (π‘š) = 1 βˆ’ |2π‘š βˆ’ 1|.
It is evident that when π‘š > 1⁄2, the allowable range of 𝛼 in π»πœ™βˆ’ (π‘š) is 𝛼 ≀ 0, and the
conditional πœ™-risk reaches its minimum 1 under this allowable range when 𝛼 = 0; when
π‘š < 1⁄2, the allowable range of 𝛼 in π»πœ™βˆ’ (π‘š) is 𝛼 β‰₯ 0, and the conditional πœ™-risk
reaches its minimum 1 under this allowable range when 𝛼 = 0. Therefore, π»πœ™βˆ’ (π‘š) = 1.
Therefore,
1+πœƒ
1+πœƒ
πœ“Μƒπœ™ (πœƒ) = π»πœ™βˆ’ (
) βˆ’ π»πœ™ (
)
2
2
= 1 βˆ’ (1 βˆ’ |1 + πœƒ βˆ’ 1|)
= |πœƒ|.
As |πœƒ| is convex,
πœ“πœ™ (πœƒ) = πœ“Μƒπœ™ (πœƒ) = |πœƒ|.
The π»πœ™ and πœ“πœ™ are plotted in Figure 4.
[Figure 4: π»πœ™ versus π‘š (left panel) and πœ“πœ™ versus πœƒ (right panel).]
2.
a) The boxplots of the data are shown in Figure 5.
[Figure 5: Boxplots of the 𝑋 and π‘Œ samples.]
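Figure 5 can be produced with a single R command (a sketch, assuming the data are stored in vectors X and Y as in the Appendix code):

boxplot(list(X = X, Y = Y), ylab = "value")  # side-by-side boxplots of the two samples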
b) The mle for πœ‡1 and πœ‡2 are
$$\hat\mu_1 = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \hat\mu_2 = \frac{1}{m}\sum_{i=1}^{m} Y_i.$$
The mle for 𝜎1 and 𝜎2 are
$$\hat\sigma_1 = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - \hat\mu_1)^2}, \qquad \hat\sigma_2 = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(Y_i - \hat\mu_2)^2}.$$
The log-likelihood of πœ‡1 and 𝜎1 is
$$\ell(\mu_1, \sigma_1) = -n\log\big(\sqrt{2\pi}\,\sigma_1\big) - \frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2.$$
The Fisher information matrix is
$$I_n(\mu_1, \sigma_1) = -\begin{bmatrix} \mathbb{E}\big(\frac{\partial^2 \ell}{\partial \mu_1^2}\big) & \mathbb{E}\big(\frac{\partial^2 \ell}{\partial \mu_1 \partial \sigma_1}\big) \\ \mathbb{E}\big(\frac{\partial^2 \ell}{\partial \sigma_1 \partial \mu_1}\big) & \mathbb{E}\big(\frac{\partial^2 \ell}{\partial \sigma_1^2}\big) \end{bmatrix} = \begin{bmatrix} \frac{n}{\sigma_1^2} & 0 \\ 0 & \frac{2n}{\sigma_1^2} \end{bmatrix}.$$
The inverse of the Fisher information matrix is
$$J_n(\mu_1, \sigma_1) = I_n^{-1}(\mu_1, \sigma_1) = \begin{bmatrix} \frac{\sigma_1^2}{n} & 0 \\ 0 & \frac{\sigma_1^2}{2n} \end{bmatrix}.$$
Therefore, the 1 βˆ’ 𝛼 confidence interval for πœ‡1 is
$$C_{\mu_1} = \Big(\hat\mu_1 - z_{\alpha/2}\frac{\hat\sigma_1}{\sqrt{n}},\; \hat\mu_1 + z_{\alpha/2}\frac{\hat\sigma_1}{\sqrt{n}}\Big).$$
Similarly, the 1 βˆ’ 𝛼 confidence interval for πœ‡2 is
$$C_{\mu_2} = \Big(\hat\mu_2 - z_{\alpha/2}\frac{\hat\sigma_2}{\sqrt{m}},\; \hat\mu_2 + z_{\alpha/2}\frac{\hat\sigma_2}{\sqrt{m}}\Big).$$
The mle of 𝛿 = πœ‡1 βˆ’ πœ‡2 is
𝛿̂ = πœ‡Μ‚ 1 βˆ’ πœ‡Μ‚ 2 .
Since πœ‡Μ‚ 1 and πœ‡Μ‚ 2 are independent, the variance of 𝛿̂ is
$$\mathrm{Var}(\hat\delta) = \mathrm{Var}(\hat\mu_1 - \hat\mu_2) = \mathrm{Var}(\hat\mu_1) + \mathrm{Var}(\hat\mu_2) = \frac{\hat\sigma_1^2}{n} + \frac{\hat\sigma_2^2}{m}.$$
Therefore, the 1 βˆ’ 𝛼 confidence interval for 𝛿 is
$$C_\delta = \Big(\hat\delta - z_{\alpha/2}\sqrt{\frac{\hat\sigma_1^2}{n} + \frac{\hat\sigma_2^2}{m}},\; \hat\delta + z_{\alpha/2}\sqrt{\frac{\hat\sigma_1^2}{n} + \frac{\hat\sigma_2^2}{m}}\Big).$$
Using the actual data and setting 𝛼 = 0.05, we have
πœ‡Μ‚ 1 = 1.820011, πœ‡Μ‚ 2 = 1.389454;
πœŽΜ‚1 = 0.8042076, πœŽΜ‚2 = 0.8060367;
𝛿̂ = 0.4305566; 𝐢𝛿 = (0.07127103, 0.7898422).
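These numbers can be reproduced with a short R sketch (my own code; the Appendix only lists code for parts (d)-(h)), again assuming the data vectors X and Y:

# mle's and the 95% Wald-type confidence interval for delta = mu1 - mu2
n <- length(X); m <- length(Y)
mu1hat <- mean(X); mu2hat <- mean(Y)
sigma1hat <- sqrt(mean((X - mu1hat)^2))  # mle of sigma_1 (divides by n, not n-1)
sigma2hat <- sqrt(mean((Y - mu2hat)^2))
deltahat <- mu1hat - mu2hat
z  <- qnorm(0.975)                       # z_{alpha/2} for alpha = 0.05
se <- sqrt(sigma1hat^2 / n + sigma2hat^2 / m)
c(deltahat - z * se, deltahat + z * se)  # C_delta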
c) The joint distribution is
𝑝(πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 , 𝑋, π‘Œ) = 𝑝(𝑋, π‘Œ|πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 )πœ‹(πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 )
βˆ’
= (2πœ‹)
𝑛
2 𝜎1βˆ’π‘› exp (βˆ’
βˆ’
= π‘˜ βˆ™ (2πœ‹)
𝑛
𝑛
𝑖=1
𝑗=1
π‘š
1
1
π‘˜
2
)2 (2πœ‹)βˆ’ 2 𝜎2βˆ’π‘š exp (βˆ’ 2 βˆ‘(π‘Œπ‘— βˆ’ πœ‡2 ) ) βˆ™
2 βˆ‘(𝑋𝑖 βˆ’ πœ‡1 ) βˆ™
𝜎1 𝜎2
2𝜎1
2𝜎2
𝑛+π‘š
2 𝜎1βˆ’π‘›βˆ’1 𝜎2βˆ’π‘šβˆ’1 exp (βˆ’
Here, π‘˜ is a constant in the prior.
𝑛
𝑛
𝑖=1
𝑗=1
1
1
2
)2 βˆ’ 2 βˆ‘(π‘Œπ‘— βˆ’ πœ‡2 ) ).
2 βˆ‘(𝑋𝑖 βˆ’ πœ‡1
2𝜎1
2𝜎2
The marginal distribution of data is
$$p(X, Y) = \int p(X, Y \mid \mu_1,\mu_2,\sigma_1,\sigma_2)\,\pi(\mu_1,\mu_2,\sigma_1,\sigma_2)\, d\mu_1\, d\mu_2\, d\sigma_1\, d\sigma_2$$
$$= k\cdot \int (2\pi)^{-\frac{n}{2}} \sigma_1^{-n-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\, d\mu_1\, d\sigma_1 \cdot \int (2\pi)^{-\frac{m}{2}} \sigma_2^{-m-1} \exp\Big(-\frac{1}{2\sigma_2^2}\sum_{j=1}^{m}(Y_j - \mu_2)^2\Big)\, d\mu_2\, d\sigma_2,$$
where
$$\int (2\pi)^{-\frac{n}{2}} \sigma_1^{-n-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\, d\mu_1\, d\sigma_1$$
$$= (2\pi)^{-\frac{n}{2}} \int_0^\infty \sigma_1^{-n-1} \Big[\int_{-\infty}^{\infty} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\, d\mu_1\Big]\, d\sigma_1$$
$$= (2\pi)^{-\frac{n}{2}} \int_0^\infty \sigma_1^{-n-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}X_i^2\Big) \Big[\int_{-\infty}^{\infty} \exp\Big(-\frac{1}{2\sigma_1^2}\Big(n\mu_1^2 - 2\mu_1\sum_{i=1}^{n}X_i\Big)\Big)\, d\mu_1\Big]\, d\sigma_1$$
$$= (2\pi)^{-\frac{n}{2}} \int_0^\infty \sigma_1^{-n-1} \exp\Big(-\frac{n}{2\sigma_1^2}\Big(\frac{1}{n}\sum_{i=1}^{n}X_i^2 - \bar X^2\Big)\Big) \Big[\int_{-\infty}^{\infty} \exp\Big(-\frac{n}{2\sigma_1^2}(\mu_1 - \bar X)^2\Big)\, d\mu_1\Big]\, d\sigma_1$$
$$= (2\pi)^{-\frac{n}{2}} \int_0^\infty \sigma_1^{-n-1} \exp\Big(-\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big)\, \sqrt{2\pi}\,\frac{\sigma_1}{\sqrt{n}}\, d\sigma_1 = (2\pi)^{-\frac{n}{2}+\frac{1}{2}}\, n^{-\frac{1}{2}} \int_0^\infty \sigma_1^{-n} \exp\Big(-\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big)\, d\sigma_1,$$
where $\bar X = \frac{1}{n}\sum_{i=1}^{n}X_i$ and $\hat\sigma_1^2 = \frac{1}{n}\sum_{i=1}^{n}X_i^2 - \bar X^2$. Letting $\lambda_1 = \sigma_1^{-2}$, we have
$$\int (2\pi)^{-\frac{n}{2}} \sigma_1^{-n-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\, d\mu_1\, d\sigma_1 = (2\pi)^{-\frac{n}{2}+\frac{1}{2}}\, n^{-\frac{1}{2}} \int_0^\infty \sigma_1^{-n} \exp\Big(-\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big)\, d\sigma_1$$
$$= -(2\pi)^{-\frac{n}{2}+\frac{1}{2}}\, n^{-\frac{1}{2}}\,\frac{1}{2} \int_\infty^0 \lambda_1^{\frac{n-3}{2}} \exp\Big(-\frac{n\hat\sigma_1^2}{2}\lambda_1\Big)\, d\lambda_1 = (2\pi)^{-\frac{n}{2}+\frac{1}{2}}\, n^{-\frac{1}{2}}\,\frac{1}{2}\, \frac{\Gamma\big(\frac{n-1}{2}\big)}{\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}.$$
Similarly, we have
$$\int (2\pi)^{-\frac{m}{2}} \sigma_2^{-m-1} \exp\Big(-\frac{1}{2\sigma_2^2}\sum_{j=1}^{m}(Y_j - \mu_2)^2\Big)\, d\mu_2\, d\sigma_2 = (2\pi)^{-\frac{m}{2}+\frac{1}{2}}\, m^{-\frac{1}{2}}\,\frac{1}{2}\, \frac{\Gamma\big(\frac{m-1}{2}\big)}{\big(\frac{m\hat\sigma_2^2}{2}\big)^{\frac{m-1}{2}}}.$$
Therefore,
𝑝(𝑋, π‘Œ) = π‘˜ βˆ™
𝑛
∫(2πœ‹)βˆ’2 𝜎1βˆ’π‘›βˆ’1 exp (βˆ’
𝑛
1
βˆ‘(𝑋𝑖 βˆ’ πœ‡1 )2 ) π‘‘πœ‡1 π‘‘πœŽ1
2𝜎12
𝑖=1
𝑛
π‘š
βˆ™ ∫(2πœ‹)βˆ’ 2 𝜎2βˆ’π‘šβˆ’1 exp (βˆ’
1
2
2 βˆ‘(π‘Œπ‘— βˆ’ πœ‡2 ) ) π‘‘πœ‡2 π‘‘πœŽ2
2𝜎2
𝑗=1
π‘›βˆ’1
π‘šβˆ’1
) Ξ“( 2 )
𝑛+π‘š
1 Ξ“(
1
βˆ’
βˆ’
2
= π‘˜ βˆ™ (2πœ‹) 2 2πœ‹(π‘›π‘š) 2
π‘›βˆ’1
π‘šβˆ’1 .
4
2 2
2
2
π‘›πœŽΜ‚
π‘šπœŽΜ‚
( 21 )
( 22)
The posterior density is
$$p(\mu_1,\mu_2,\sigma_1,\sigma_2 \mid X, Y) = \frac{p(X, Y \mid \mu_1,\mu_2,\sigma_1,\sigma_2)\,\pi(\mu_1,\mu_2,\sigma_1,\sigma_2)}{p(X, Y)}$$
$$= \frac{k\,(2\pi)^{-\frac{n+m}{2}}\,\sigma_1^{-n-1}\sigma_2^{-m-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i-\mu_1)^2 - \frac{1}{2\sigma_2^2}\sum_{j=1}^{m}(Y_j-\mu_2)^2\Big)}{k\,(2\pi)^{-\frac{n+m}{2}}\, \frac{2\pi(nm)^{-1/2}}{4}\, \frac{\Gamma(\frac{n-1}{2})\,\Gamma(\frac{m-1}{2})}{(\frac{n\hat\sigma_1^2}{2})^{\frac{n-1}{2}}(\frac{m\hat\sigma_2^2}{2})^{\frac{m-1}{2}}}}$$
$$= \frac{2\sqrt{nm}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}\big(\frac{m\hat\sigma_2^2}{2}\big)^{\frac{m-1}{2}}}{\pi\,\Gamma\big(\frac{n-1}{2}\big)\,\Gamma\big(\frac{m-1}{2}\big)} \cdot \frac{\exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i-\mu_1)^2 - \frac{1}{2\sigma_2^2}\sum_{j=1}^{m}(Y_j-\mu_2)^2\Big)}{\sigma_1^{n+1}\,\sigma_2^{m+1}}.$$
d) The posterior distribution of (πœ‡1, 𝜎1) is
$$p(\mu_1, \sigma_1 \mid D) = \frac{p(X \mid \mu_1, \sigma_1)\,\pi(\mu_1, \sigma_1)}{\int p(X \mid \mu_1, \sigma_1)\,\pi(\mu_1, \sigma_1)\, d\mu_1\, d\sigma_1} = \frac{(2\pi)^{-\frac{n}{2}}\,\sigma_1^{-n-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)}{(2\pi)^{-\frac{n}{2}+\frac{1}{2}}\, n^{-\frac{1}{2}}\,\frac{1}{2}\,\Gamma\big(\frac{n-1}{2}\big)\Big/\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}$$
$$= \frac{2\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)} \cdot \frac{\exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)}{\sigma_1^{n+1}}.$$
The posterior distribution of 𝜎1 is
$$p(\sigma_1 \mid D) = \int_{-\infty}^{\infty} p(\mu_1, \sigma_1 \mid D)\, d\mu_1 = \frac{2\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n+1}} \int_{-\infty}^{\infty} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\, d\mu_1$$
$$= \frac{2\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n+1}}\, \exp\Big(-\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big)\,\sqrt{2\pi}\,\frac{\sigma_1}{\sqrt{n}} = \frac{2\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}{\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n}}\, \exp\Big(-\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big).$$
Therefore, we have
$$p(\mu_1 \mid \sigma_1, D) = \frac{p(\mu_1, \sigma_1 \mid D)}{p(\sigma_1 \mid D)} = \frac{2\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\,\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n+1}\, \exp\Big(-\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big)\, 2\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}} = \frac{\sqrt{n}\,\exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)}{\sqrt{2\pi}\,\sigma_1\, \exp\Big(-\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big)}$$
$$= \frac{1}{\sqrt{2\pi}\,\frac{\sigma_1}{\sqrt{n}}}\, \exp\Big(\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big)\, \exp\Big(-\frac{n}{2\sigma_1^2}\Big(\frac{1}{n}\sum_{i=1}^{n}X_i^2 - \bar X^2\Big)\Big)\, \exp\Big(-\frac{n}{2\sigma_1^2}(\mu_1 - \bar X)^2\Big) = \frac{1}{\sqrt{2\pi}\,\frac{\sigma_1}{\sqrt{n}}}\, \exp\Big(-\frac{(\mu_1 - \bar X)^2}{2\,\sigma_1^2/n}\Big) = N\Big(\mu_1;\, \bar X,\, \frac{\sigma_1^2}{n}\Big).$$
Let πœ†1 = 𝜎1βˆ’2 , we can obtain 𝑝(πœ†1 |𝐷) from 𝑝(𝜎1 |𝐷) as follows:
𝑑
𝑑
1
𝑑
1
π‘πœ†1 (πœ†1 |𝐷) =
β„™(πœ†β€²1 ≀ πœ†1 |𝐷) =
β„™ ( β€² 2 ≀ πœ†1 |𝐷) =
β„™ (𝜎1β€² β‰₯
|𝐷)
π‘‘πœ†1
π‘‘πœ†1
π‘‘πœ†1
𝜎1
βˆšπœ†1
=
=
=
1
𝑑
1
𝑑
1
𝜎13 𝑑
1
(1 βˆ’ β„™ (𝜎1β€² ≀
|𝐷)) = βˆ’
β„™ (𝜎1β€² ≀
|𝐷) =
β„™ (𝜎1β€² ≀
|𝐷)
π‘‘πœ†1
π‘‘πœ†1
2 π‘‘πœŽ1
βˆšπœ†1
βˆšπœ†1
βˆšπœ†1
π‘›βˆ’1
𝑛
3
π‘›πœŽΜ‚12 2 2
βˆ’
22(
)
πœ†
πœ†1
1
2
𝜎13
1
π‘πœŽ1 (
|𝐷) =
2
2
βˆšπœ†1
π‘›βˆ’1
Ξ“( 2 )
βˆ™ exp (βˆ’
π‘›πœŽΜ‚12
πœ† )
2 1
π‘›βˆ’1
π‘›πœŽΜ‚ 2 2
π‘›βˆ’1
( 1)
π‘›πœŽΜ‚12
𝑛 βˆ’ 1 π‘›πœŽΜ‚12
βˆ’1
2
=
πœ†1 2
exp (βˆ’
πœ†1 ) = 𝑔 (πœ†1 ;
,
),
π‘›βˆ’1
2
2
2
Ξ“( 2 )
where 𝑔(π‘₯; 𝛼, 𝛽) is the gamma distribution with shape parameter 𝛼 and rate parameter 𝛽.
Therefore,
$$p(\sigma_1 \mid D) = g\Big(\sigma_1^{-2};\, \frac{n-1}{2},\, \frac{n\hat\sigma_1^2}{2}\Big).$$
That is, to sample from $p(\sigma_1 \mid D)$, we can first sample $z$ from the gamma distribution $g\big(z;\, \frac{n-1}{2},\, \frac{n\hat\sigma_1^2}{2}\big)$ and then compute $\sigma_1 = z^{-1/2}$.
Similarly,
$$p(\sigma_2 \mid D) = g\Big(\sigma_2^{-2};\, \frac{m-1}{2},\, \frac{m\hat\sigma_2^2}{2}\Big),$$
and
$$p(\mu_2 \mid \sigma_2, D) = N\Big(\mu_2;\, \bar Y,\, \frac{\sigma_2^2}{m}\Big).$$
Therefore,
𝑝(πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 |𝐷) = 𝑝(πœ‡1 |𝜎1 , 𝐷)𝑝(𝜎1 |𝐷)𝑝(πœ‡2 |𝜎2 𝐷)𝑝(𝜎2 |𝐷)
𝜎12
𝑛 βˆ’ 1 π‘›πœŽΜ‚12
𝜎22
π‘š βˆ’ 1 π‘šπœŽΜ‚22
= 𝑁 (πœ‡1 ; 𝑋̅ , ) βˆ™ 𝑔 (𝜎1βˆ’2 ;
,
,
) βˆ™ 𝑁 (πœ‡2 ; π‘ŒΜ…, ) βˆ™ 𝑔 (𝜎2βˆ’2 ;
).
𝑛
2
2
π‘š
2
2
To get the Bayes estimate of 𝛿 = πœ‡1 βˆ’ πœ‡2 , we need to compute
𝛿̅ = ∫(πœ‡1 βˆ’ πœ‡2 )𝑝(πœ‡1 , πœ‡2 |𝐷)π‘‘πœ‡1 π‘‘πœ‡2 = ∫(πœ‡1 βˆ’ πœ‡2 ) (∫ 𝑝(πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 |𝐷)π‘‘πœŽ1 π‘‘πœŽ2 ) π‘‘πœ‡1 π‘‘πœ‡2
= ∫(πœ‡1 βˆ’ πœ‡2 )𝑝(πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 |𝐷) π‘‘πœ‡1 π‘‘πœ‡2 π‘‘πœŽ1 π‘‘πœŽ2 .
(π‘˜)
(π‘˜)
(π‘˜)
(π‘˜)
The Monte Carlo method with direct sampling draws samples (πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 ) , π‘˜ = 1 … 𝑁
according to the factorized expression of 𝑝(πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 |𝐷), and approximates 𝛿̅ as
𝑁
1
(π‘˜)
(π‘˜)
βˆ‘ (πœ‡1 βˆ’ πœ‡2 ).
𝑁
π‘˜=1
I generated 100,000 samples with direct sampling, and the estimate of 𝛿̅ is 0.4302737, very close to the mle estimate. The 95 percent posterior interval is obtained by sorting the simulated samples and taking the 0.025 and 0.975 quantiles. The resulting 95% posterior interval is (0.05256353, 0.8063011), also close to the 95% confidence interval from the mle.
e) The posterior distribution of πœ‡1 is
$$p(\mu_1 \mid D) = \int_0^\infty p(\mu_1, \sigma_1 \mid D)\, d\sigma_1 = \frac{2\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)} \int_0^\infty \frac{1}{\sigma_1^{n+1}} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\, d\sigma_1.$$
Letting $\lambda_1 = \sigma_1^{-2}$, we have
$$p(\mu_1 \mid D) = \frac{\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)} \int_0^\infty \lambda_1^{\frac{n}{2}-1} \exp\Big(-\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\lambda_1\Big)\, d\lambda_1 = \frac{\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)} \cdot \frac{\Gamma\big(\frac{n}{2}\big)}{\big(\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)^{\frac{n}{2}}}.$$
Therefore,
$$p(\sigma_1 \mid \mu_1, D) = \frac{p(\mu_1, \sigma_1 \mid D)}{p(\mu_1 \mid D)} = \frac{2\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\, \sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)\,\big(\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)^{\frac{n}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n+1}\, \sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}\, \Gamma\big(\frac{n}{2}\big)}$$
$$= \frac{2\,\big(\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)^{\frac{n}{2}}}{\Gamma\big(\frac{n}{2}\big)}\, \sigma_1^{-n-1}\, \exp\Big(-\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2\sigma_1^2}\Big).$$
Similar to the derivation of $p(\sigma_1 \mid D)$ in the previous sub-problem, we can obtain $p(\sigma_1 \mid \mu_1, D)$ by substituting $\lambda_1 = \sigma_1^{-2}$:
$$p_{\lambda_1}(\lambda_1 \mid \mu_1, D) = \frac{\sigma_1^3}{2}\, p_{\sigma_1}\Big(\frac{1}{\sqrt{\lambda_1}} \,\Big|\, \mu_1, D\Big) = \frac{\lambda_1^{-\frac{3}{2}}}{2}\cdot \frac{2\,\big(\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)^{\frac{n}{2}}}{\Gamma\big(\frac{n}{2}\big)}\, \lambda_1^{\frac{n+1}{2}}\, \exp\Big(-\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\lambda_1\Big)$$
$$= \frac{\big(\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)^{\frac{n}{2}}}{\Gamma\big(\frac{n}{2}\big)}\, \lambda_1^{\frac{n}{2}-1}\, \exp\Big(-\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\lambda_1\Big) = g\Big(\lambda_1;\, \frac{n}{2},\, \frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\Big).$$
Therefore,
$$p(\sigma_1 \mid \mu_1, D) = g\Big(\sigma_1^{-2};\, \frac{n}{2},\, \frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\Big).$$
Similarly, we have
$$p(\sigma_2 \mid \mu_2, D) = g\Big(\sigma_2^{-2};\, \frac{m}{2},\, \frac{\sum_{j=1}^{m}(Y_j - \mu_2)^2}{2}\Big).$$
Having obtained $p(\mu_1 \mid \sigma_1, D) = N\big(\mu_1;\, \bar X,\, \frac{\sigma_1^2}{n}\big)$, $p(\sigma_1 \mid \mu_1, D) = g\big(\sigma_1^{-2};\, \frac{n}{2},\, \frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)$, $p(\mu_2 \mid \sigma_2, D) = N\big(\mu_2;\, \bar Y,\, \frac{\sigma_2^2}{m}\big)$, and $p(\sigma_2 \mid \mu_2, D) = g\big(\sigma_2^{-2};\, \frac{m}{2},\, \frac{\sum_{j=1}^{m}(Y_j - \mu_2)^2}{2}\big)$, we can simulate samples $(\mu_1^{(k)}, \mu_2^{(k)}, \sigma_1^{(k)}, \sigma_2^{(k)})$, $k = 1, \dots, N$, using Gibbs sampling.
The results of simulating 100,000 samples using Gibbs sampling are as follows: the estimate of 𝛿̅
is 0.4301146, very close to the mle estimate. The 95% posterior interval is (0.0558426,
0.8073434), also close to the 95% confidence interval in mle.
f) The proposal distribution that I used in the Metropolis-Hastings sampling is a Gaussian distribution:
$$q(\mu_1', \mu_2', \sigma_1', \sigma_2' \mid \mu_1, \mu_2, \sigma_1, \sigma_2) = N\Big(\mu_1';\, \mu_1,\, \frac{\hat\sigma_1^2}{n}\Big)\cdot N\Big(\mu_2';\, \mu_2,\, \frac{\hat\sigma_2^2}{m}\Big)\cdot N\Big(\sigma_1';\, \sigma_1,\, \frac{\hat\sigma_1^2}{2n}\Big)\cdot N\Big(\sigma_2';\, \sigma_2,\, \frac{\hat\sigma_2^2}{2m}\Big).$$
The plot of 100,001 samples from the Metropolis-Hastings algorithm is shown in Figure 6. We can see that the Markov chain mixed pretty well. The estimate of 𝛿̅ is 0.4338128, very close to the mle estimate. The 95% posterior interval is (0.045316, 0.8312325), also close to the 95% confidence interval in mle.
[Figure 6: Trace plot of delta over the 100,000 Metropolis-Hastings iterations.]
g) The estimates of the densities of 𝑋 and π‘Œ using kernel density estimation are shown in Figure 7. I used a Gaussian kernel with bandwidth equal to 1, which gives a good normal-like shape. The range of acceptable bandwidths is roughly from 0.7 to 1; setting the bandwidth too small would overfit the data, resulting in multiple modes.
[Figure 7: Kernel density estimates p_X and p_Y of the 𝑋 and π‘Œ samples.]
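As a cross-check of the hand-rolled estimator in the Appendix, base R's density() function produces essentially the same curves (a sketch; bw = 1 matches the bandwidth used above):

par(mfrow = c(1, 2))
plot(density(X, bw = 1, kernel = "gaussian"), main = "", xlab = "X", ylab = "p_X")
plot(density(Y, bw = 1, kernel = "gaussian"), main = "", xlab = "Y", ylab = "p_Y")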
h) I use a Gaussian to model both the conditional distribution of the data and the prior distribution of the parameters. The conditional distribution of the data for cluster π‘˜ is
$$p(x_{1:n} \mid \mu, \tau) = \frac{\sqrt{\tau}}{\sqrt{2\pi}} \exp\Big(-\frac{\tau \sum_{i=1}^{n}(x_i - \mu)^2}{2}\Big),$$
where 𝜏 is fixed at 1. In the discussion below, 𝑝(π‘₯1:𝑛 |πœ‡, 𝜏) is abbreviated as 𝑝(π‘₯1:𝑛 |πœ‡).
The prior distribution of parameter πœ‡ is
$$p(\mu \mid \mu_0, \tau_0) = \frac{\sqrt{\tau_0}}{\sqrt{2\pi}} \exp\Big(-\frac{\tau_0 (\mu - \mu_0)^2}{2}\Big).$$
It can be shown that the posterior distribution of parameter πœ‡ is
$$p(\mu \mid x_{1:n}, \mu_0, \tau_0) \propto p(x_{1:n} \mid \mu)\, p(\mu \mid \mu_0, \tau_0) \propto \frac{\sqrt{\tau_k}}{\sqrt{2\pi}} \exp\Big(-\frac{\tau_k (\mu - \mu_k)^2}{2}\Big),$$
where
$$\tau_k = \tau_0 + n_k \tau, \qquad \mu_k = \frac{\mu_0 \tau_0 + \tau n_k \bar X_k}{\tau_0 + n_k \tau},$$
with $n_k$ the number of training samples assigned to cluster π‘˜ and $\bar X_k$ the mean of those samples.
Therefore, in Gibbs sampling, the likelihood of $x_i$ being assigned to cluster π‘˜ given the cluster assignment of the other samples is
𝑝(π‘₯𝑖 |π‘₯βˆ’π‘– , π‘βˆ’π‘– , 𝑐𝑖 = π‘˜, πœ‡0 , 𝜏0 ) = ∫ 𝑝(π‘₯𝑖 |πœ‡)𝑝(πœ‡|π‘₯βˆ’π‘– , πœ‡0 , 𝜏0 )π‘‘πœ‡
=∫
√𝜏
√2πœ‹
exp (βˆ’
𝜏(π‘₯𝑖 βˆ’ πœ‡)2 βˆšπœπ‘˜
πœπ‘˜ (πœ‡ βˆ’ πœ‡π‘˜ )2
exp (βˆ’
)βˆ™
) π‘‘πœ‡
2
2
√2πœ‹
2
=
√𝜏
√2πœ‹
𝜏π‘₯𝑖2
2
π‘’βˆ’
(πœπ‘˜β€² )βˆ’1⁄2 exp (
βˆ™
(πœπ‘˜
πœπ‘˜β€² πœ‡π‘˜β€²
2 )
𝜏 πœ‡
)βˆ’1⁄2 exp ( π‘˜ π‘˜
2
,
)
2
where πœπ‘˜ and πœ‡π‘˜ are computed as before except that sample π‘₯𝑖 is excluded from π‘›π‘˜ and π‘‹Μ…π‘˜ , and
πœπ‘˜β€² = πœπ‘˜ + 𝜏,
and
πœ‡π‘˜ πœπ‘˜ + 𝜏π‘₯𝑖
πœ‡π‘˜β€² =
.
πœπ‘˜ + 𝜏
If π‘₯𝑖 is assigned to a new cluster, the likelihood is
𝑝(π‘₯𝑖 |πœ‡0 , 𝜏0 ) = ∫ 𝑝(π‘₯𝑖 |πœ‡)𝑝(πœ‡|πœ‡0 , 𝜏0 )π‘‘πœ‡
=∫
=
√𝜏
√2πœ‹
√𝜏
√2πœ‹
𝑒
exp (βˆ’
𝜏π‘₯ 2
βˆ’ 𝑖
2
𝜏(π‘₯𝑖 βˆ’ πœ‡)2 √𝜏0
𝜏0 (πœ‡ βˆ’ πœ‡0 )2
exp (βˆ’
)βˆ™
) π‘‘πœ‡
2
2
√2πœ‹
(𝜏 β€² )βˆ’1⁄2 exp (
βˆ™
(𝜏0 )βˆ’1⁄2 exp (
2
𝜏 β€² πœ‡β€²
2 )
,
𝜏0 πœ‡0 2
)
2
where
$$\tau' = \tau_0 + \tau, \qquad \mu' = \frac{\mu_0 \tau_0 + \tau x_i}{\tau_0 + \tau}.$$
The class assignment probability $p(c_i = k \mid c_{-i})$ is the same as in the notes.
In my simulation, I set the new-cluster probability 𝛼 to 0.05 and averaged 10 trials to get the final density. The estimated densities for 𝑋 and π‘Œ are shown in Figure 8. As we can see, the densities in Figure 8 are closer to Gaussian than those in Figure 7, thanks to relaxing the assumption on the number of clusters.
[Figure 8: DPM-based density estimates p_X and p_Y of the 𝑋 and π‘Œ samples.]
APPENDIX – R codes
2(d)
Xbar = mean(X)
sigmaX = mean((X-Xbar)^2)^0.5
Ybar = mean(Y)
sigmaY = mean((Y-Ybar)^2)^0.5
n = length(X)
m = length(Y)
shapeX = (n-1)/2
rateX = (n*sigmaX^2)/2
shapeY = (m-1)/2
rateY = (m*sigmaY^2)/2
delta = c()
N = 100000
for (indSam in 1:N) {
  sigma1 = (rgamma(1, shape=shapeX, rate=rateX))^(-0.5)
  mu1 = rnorm(1, mean=Xbar, sd=((sigma1^2)/n)^0.5)
  sigma2 = (rgamma(1, shape=shapeY, rate=rateY))^(-0.5)
  mu2 = rnorm(1, mean=Ybar, sd=((sigma2^2)/m)^0.5)
  delta[indSam] = mu1-mu2
}
delta_bar = mean(delta)
delta_sort = sort(delta)
C_low = delta_sort[N*0.025]
C_high = delta_sort[N*0.975]
print(delta_bar)
print(C_low)
print(C_high)
2(e)
Xbar = mean(X)
sigmaX = mean((X-Xbar)^2)^0.5
Ybar = mean(Y)
sigmaY = mean((Y-Ybar)^2)^0.5
n = length(X)
m = length(Y)
shapeX = n/2
shapeY = m/2
mu1 = Xbar
mu2 = Ybar
sigma1 = sigmaX
sigma2 = sigmaY
delta = mu1-mu2
N = 100000
for (indSam in 1:N) {
  rateX = sum((X-mu1)^2)/2
  sigma1 = (rgamma(1, shape=shapeX, rate=rateX))^(-0.5)
  mu1 = rnorm(1, mean=Xbar, sd=((sigma1^2)/n)^0.5)
  rateY = sum((Y-mu2)^2)/2
  sigma2 = (rgamma(1, shape=shapeY, rate=rateY))^(-0.5)
  mu2 = rnorm(1, mean=Ybar, sd=((sigma2^2)/m)^0.5)
  delta[indSam+1] = mu1-mu2
}
delta_bar = mean(delta)
delta_sort = sort(delta)
C_low = delta_sort[N*0.025]
C_high = delta_sort[N*0.975]
print(delta_bar)
print(C_low)
print(C_high)
2(f)
Xbar = mean(X)
sigmaX = mean((X-Xbar)^2)^0.5
Ybar = mean(Y)
sigmaY = mean((Y-Ybar)^2)^0.5
n = length(X)
m = length(Y)
sd_mu1 = sigmaX/(n^0.5)
sd_mu2 = sigmaY/(m^0.5)
sd_sigma1 = sigmaX/((2*n)^0.5)
sd_sigma2 = sigmaY/((2*m)^0.5)
shapeX = (n-1)/2
shapeY = (m-1)/2
rateX = n*(sigmaX^2)/2
rateY = m*(sigmaY^2)/2
mu1 = Xbar
mu2 = Ybar
sigma1 = sigmaX
sigma2 = sigmaY
delta = mu1-mu2
N = 100000
for (indSam in 1:N) {
  mu1_t = rnorm(1, mean=mu1, sd=sd_mu1)
  sigma1_t = rnorm(1, mean=sigma1, sd=sd_sigma1)
  mu2_t = rnorm(1, mean=mu2, sd=sd_mu2)
  sigma2_t = rnorm(1, mean=sigma2, sd=sd_sigma2)
  density = dnorm(mu1, mean=Xbar, sd=sigma1/(n^0.5), log=TRUE) +
    dnorm(mu2, mean=Ybar, sd=sigma2/(m^0.5), log=TRUE)
  density = density + dgamma(sigma1^(-2), shape=shapeX, rate=rateX, log=TRUE) +
    dgamma(sigma2^(-2), shape=shapeY, rate=rateY, log=TRUE)
  density_t = dnorm(mu1_t, mean=Xbar, sd=sigma1_t/(n^0.5), log=TRUE) +
    dnorm(mu2_t, mean=Ybar, sd=sigma2_t/(m^0.5), log=TRUE)
  density_t = density_t + dgamma(sigma1_t^(-2), shape=shapeX, rate=rateX, log=TRUE) +
    dgamma(sigma2_t^(-2), shape=shapeY, rate=rateY, log=TRUE)
  r = min(exp(density_t-density), 1)
  if (runif(1, min=0, max=1) < r) {
    mu1 = mu1_t
    sigma1 = sigma1_t
    mu2 = mu2_t
    sigma2 = sigma2_t
  }
  delta[indSam+1] = mu1-mu2
}
plot(1:(N+1), delta, type="l", xlab="iteration", ylab="delta")
delta_bar = mean(delta)
delta_sort = sort(delta)
C_low = delta_sort[N*0.025]
C_high = delta_sort[N*0.975]
print(delta_bar)
print(C_low)
print(C_high)
2(g)
minX = min(X)
maxX = max(X)
minY = min(Y)
maxY = max(Y)
stepX = (maxX-minX)/99
stepY = (maxY-minY)/99
hx = 1
hy = 1
dX = c()
dY = c()
for (ind in 1:100) {
  x = (ind-1)*stepX + minX
  dX[ind] = mean(exp(-(X-x)^2/2/(hx^2))/((2*pi)^0.5)/hx)
  y = (ind-1)*stepY + minY
  dY[ind] = mean(exp(-(Y-y)^2/2/(hy^2))/((2*pi)^0.5)/hy)
}
par(mfrow=c(1,2))
plot(((1:100)-1)*stepX+minX,dX,type="l",col="blue",xlab="X",ylab="p_X")
plot(((1:100)-1)*stepY+minY,dY,type="l",col="red",xlab="Y",ylab="p_Y")
2(h)
DPM <- function(X) {
minX = min(X)
maxX = max(X)
stepX = (maxX-minX)/99
dX = rep(0,100)
n = length(X)
numTrial = 10
thd_mu = 0.005
thd_tau = 3
for (indTrial in 1:numTrial) {
alpha = 0.05
tau = 1
Xbar = mean(X)
Xvar = mean((X-Xbar)^2)
mu0 = Xbar
tau0 = n/Xvar
ClstAsgn = rep(1,n)
numClst = 1
denom = tau0^(-0.5)*exp(tau0*(mu0^2)/2)
ClstXCnt = c()
ClstMu = c()
ClstTau = c()
for (indClst in 1:numClst) {
numX = sum(ClstAsgn==indClst)
sumX = sum(X[ClstAsgn==indClst])
ClstXCnt[indClst] = numX
ClstTau[indClst] = tau0 + tau*numX
ClstMu[indClst] = (mu0*tau0+tau*sumX)/ClstTau[indClst]
}
while (1) {
ClstXCnt0 = ClstXCnt
ClstMu0 = ClstMu
ClstTau0 = ClstTau
numClst0 = numClst
for (indSam in 1:n) {
p_old = c()
for (indClst in 1:numClst) {
numX = sum(ClstAsgn==indClst)
sumX = sum(X[ClstAsgn==indClst])
if (ClstAsgn[indSam]==indClst) {
numX = numX - 1
sumX = sumX - X[indSam]
}
p_old[indClst] = numX/(n-1)*(1-alpha)
tau1 = tau0 + tau*numX
mu1 = (mu0*tau0+tau*sumX)/tau1
tau2 = tau1 + tau
mu2 = (tau1*mu1+tau*X[indSam])/tau2
p_old[indClst] = p_old[indClst] * (tau2^(-0.5)*exp(tau2*(mu2^2)/2)) / (tau1^(-0.5)*exp(tau1*(mu1^2)/2))
}
tau_new = tau0 + tau
mu_new = (tau0*mu0+tau*X[indSam])/tau_new
p_new = alpha * tau_new^(-0.5)*exp(tau_new*(mu_new^2)/2) /
denom
temp = sort(rmultinom(1, size = 1,
prob=c(p_old,p_new)),decreasing=TRUE,index.return=TRUE)
clst = temp$ix[1]
ClstAsgn[indSam] = clst
if (clst == numClst+1) {
numClst = numClst + 1
}
}
ClstList = unique(ClstAsgn)
numClst = length(ClstList)
ClstAsgn_t = ClstAsgn
for (indClst in 1:numClst) {
ClstAsgn_t[ClstAsgn==ClstList[indClst]] = indClst
}
ClstAsgn = ClstAsgn_t
ClstXCnt = c()
ClstMu = c()
ClstTau = c()
for (indClst in 1:numClst) {
numX = sum(ClstAsgn==indClst)
sumX = sum(X[ClstAsgn==indClst])
ClstXCnt[indClst] = numX
ClstTau[indClst] = tau0 + tau*numX
ClstMu[indClst] = (mu0*tau0+tau*sumX)/ClstTau[indClst]
}
temp = sort(ClstMu, index.return=TRUE)
ClstMu = temp$x
ClstXCnt = ClstXCnt[temp$ix]
ClstTau = ClstTau[temp$ix]
if (numClst==numClst0) {
if (mean(abs(ClstMu-ClstMu0))<thd_mu && mean(abs(ClstTau-ClstTau0))<thd_tau) {
break
}
}
}
for (ind in 1:100) {
x = (ind-1)*stepX + minX
hx = (tau^0.5)/((2*pi)^0.5)*exp(-tau*(x^2)/2)
p_old = c()
for (indClst in 1:numClst) {
p_old[indClst] = ClstXCnt[indClst]/n*(1-alpha)
tau2 = ClstTau[indClst] + tau
mu2 = (ClstTau[indClst]*ClstMu[indClst]+tau*x)/tau2
p_old[indClst] = p_old[indClst] * hx * (tau2^(-0.5)*exp(tau2*(mu2^2)/2)) / (ClstTau[indClst]^(-0.5)*exp(ClstTau[indClst]*(ClstMu[indClst]^2)/2))
}
tau_new = tau0 + tau
mu_new = (tau0*mu0+tau*x)/tau_new
p_new = alpha * hx * tau_new^(-0.5)*exp(tau_new*(mu_new^2)/2) /
denom
dX[ind] = dX[ind] + sum(c(p_old, p_new))
}
}
dX = dX / numTrial
out <- list(density=dX)
return(out)
}
out <- DPM(X)
dX = out$density
out <- DPM(Y)
dY = out$density
minX = min(X)
maxX = max(X)
minY = min(Y)
maxY = max(Y)
stepX = (maxX-minX)/99
stepY = (maxY-minY)/99
par(mfrow=c(1,2))
plot(((1:100)-1)*stepX+minX,dX,type="l",col="blue",xlab="X",ylab="p_X")
plot(((1:100)-1)*stepY+minY,dY,type="l",col="red",xlab="Y",ylab="p_Y")