
10-702 Assignment 5 Solution
Jiyan Pan
[email protected]
1. Classification
1, 𝑓(π‘₯) β‰₯ 0
a) Let β„Ž(π‘₯) = sign(𝑓(π‘₯)) = {
, then
0, 𝑓(π‘₯) < 0
𝑅(𝑓) = β„™(π‘Œ β‰  β„Ž(𝑋)) = ∫ β„™(π‘Œ β‰  β„Ž(𝑋)|𝑋 = π‘₯) 𝑑𝑃(π‘₯)
= ∫[β„™(π‘Œ = 0, β„Ž(π‘₯) = 1|𝑋 = π‘₯) + β„™(π‘Œ = 1, β„Ž(π‘₯) = 0|𝑋 = π‘₯)] 𝑑𝑃(π‘₯)
= ∫[β„™(π‘Œ = 0|𝑋 = π‘₯)β„Ž(π‘₯) + β„™(π‘Œ = 1|𝑋 = π‘₯)(1 βˆ’ β„Ž(π‘₯))] 𝑑𝑃(π‘₯)
= ∫[(1 βˆ’ π‘š(π‘₯))β„Ž(π‘₯) + π‘š(π‘₯)(1 βˆ’ β„Ž(π‘₯))] 𝑑𝑃(π‘₯)
= ∫[β„Ž(π‘₯) + π‘š(π‘₯) βˆ’ 2π‘š(π‘₯)β„Ž(π‘₯)] 𝑑𝑃(π‘₯).
If we denote β„Žβˆ— (π‘₯) = sign(π‘š(π‘₯) βˆ’ 1⁄2), the excess risk is
$$R(f) - R^* = \int \big[h(x) + m(x) - 2m(x)h(x)\big]\,dP(x) - \int \big[h^*(x) + m(x) - 2m(x)h^*(x)\big]\,dP(x)$$
$$= \int (2m(x) - 1)\big(h^*(x) - h(x)\big)\,dP(x) = \begin{cases} \int |2m(x) - 1|\,(1 - h(x))\,dP(x), & m(x) \ge 1/2, \\ \int |2m(x) - 1|\,h(x)\,dP(x), & m(x) < 1/2. \end{cases}$$
When π‘š(π‘₯) β‰₯ 1⁄2, we have β„Žβˆ— (π‘₯) = 1, then no matter if β„Ž(π‘₯) = 1 or 0, 1 βˆ’ β„Ž(π‘₯) =
1[β„Ž(π‘₯) β‰  β„Žβˆ— (π‘₯)]; when π‘š(π‘₯) < 1⁄2, we have β„Žβˆ— (π‘₯) = 0, then no matter if β„Ž(π‘₯) = 1 or 0,
β„Ž(π‘₯) = 1[β„Ž(π‘₯) β‰  β„Žβˆ— (π‘₯)]. Therefore,
𝑅(𝑓) βˆ’ 𝑅 βˆ— = ∫ 1[β„Ž(π‘₯) β‰  β„Žβˆ— (π‘₯)]|2π‘š(π‘₯) βˆ’ 1| 𝑑𝑃(π‘₯)
= 𝔼(1[β„Ž(𝑋) β‰  β„Žβˆ— (𝑋)]|2π‘š(𝑋) βˆ’ 1|)
= 𝔼(1[sign(𝑓(𝑋)) β‰  sign(π‘š(𝑋) βˆ’ 1⁄2)]|2π‘š(𝑋) βˆ’ 1|).
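As a quick numerical sanity check of this identity (not part of the original solution), the following R sketch compares the two sides on a toy discrete distribution for 𝑋, with arbitrary made-up values of π‘š(π‘₯) and 𝑓(π‘₯):

# Toy check of R(f) - R* = E[ 1{sign(f(X)) != sign(m(X)-1/2)} * |2m(X)-1| ]
px <- c(0.2, 0.3, 0.1, 0.25, 0.15)   # P(X = x) on five support points
m  <- c(0.9, 0.4, 0.7, 0.2, 0.55)    # m(x) = P(Y = 1 | X = x)
f  <- c(1.0, 0.3, -0.2, -1.5, -0.1)  # an arbitrary classifier score f(x)
h     <- as.numeric(f >= 0)          # h(x) = sign(f(x)), coded as 1/0
hstar <- as.numeric(m >= 1/2)        # Bayes rule h*(x)
risk  <- function(h) sum(px * ((1 - m) * h + m * (1 - h)))
excess_direct  <- risk(h) - risk(hstar)
excess_formula <- sum(px * (h != hstar) * abs(2 * m - 1))
c(excess_direct, excess_formula)     # both equal 0.115

Both expressions give the same number, as the derivation requires.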
b) The optimal πœ™-risk is
$$R_\phi^* = \inf_{f\in\mathcal{M}} \mathbb{E}[\phi(Yf(X))] = \inf_{f\in\mathcal{M}} \mathbb{E}\big[\mathbb{E}[\phi(Yf(X)) \mid X]\big] = \inf_{f\in\mathcal{M}} \mathbb{E}\big[m(X)\phi(f(X)) + (1 - m(X))\phi(-f(X))\big]$$
$$= \mathbb{E}\Big[\inf_{f\in\mathcal{M}} \big(m(X)\phi(f(X)) + (1 - m(X))\phi(-f(X))\big)\Big] = \mathbb{E}\Big[\inf_{\alpha\in\mathbb{R}} \big(m(X)\phi(\alpha) + (1 - m(X))\phi(-\alpha)\big)\Big] = \mathbb{E}[H_\phi(m(X))].$$
c)
(1) The conditional πœ™-risk is
π‘šπœ™(𝛼) + (1 βˆ’ π‘š)πœ™(βˆ’π›Ό)
= π‘š(𝑒 βˆ’π›Ό βˆ’ 𝛼 βˆ’ 1) + (1 βˆ’ π‘š)(𝑒 𝛼 + 𝛼 βˆ’ 1)
= π‘šπ‘’ βˆ’π›Ό + (1 βˆ’ π‘š)𝑒 𝛼 + (1 βˆ’ 2π‘š)𝛼 βˆ’ 1.
Taking the derivative w.r.t. 𝛼 and setting it to zero at 𝛼 = 𝛼*, we have
$$-m e^{-\alpha^*} + (1 - m)e^{\alpha^*} + 1 - 2m = 0;$$
multiplying through by $e^{\alpha^*}$, this factors as
$$\big(e^{\alpha^*} + 1\big)\big((1 - m)e^{\alpha^*} - m\big) = 0.$$
As $e^{\alpha^*} + 1 > 0$, $(1 - m)e^{\alpha^*} - m$ must be 0. Therefore, $\alpha^* = \log\frac{m}{1 - m}$. Substituting it back into the conditional πœ™-risk, we obtain the optimal conditional πœ™-risk:
$$H_\phi(m) = m e^{-\alpha^*} + (1 - m)e^{\alpha^*} + (1 - 2m)\alpha^* - 1 = (1 - 2m)\log\frac{m}{1 - m}.$$
To derive π»πœ™βˆ’(π‘š), we note that the conditional πœ™-risk is convex in 𝛼. If π‘š > 1⁄2, then the allowable range of 𝛼 in π»πœ™βˆ’(π‘š) is 𝛼 ≀ 0; on the other hand, we have 𝛼* > 0. Therefore, the 𝛼 attaining the infimum in π»πœ™βˆ’(π‘š) must be 0. If π‘š ≀ 1⁄2, then the allowable range of 𝛼 in π»πœ™βˆ’(π‘š) is 𝛼 β‰₯ 0; on the other hand, we have 𝛼* ≀ 0. Therefore, the 𝛼 attaining the infimum in π»πœ™βˆ’(π‘š) must also be 0. In either case, π»πœ™βˆ’(π‘š) = πœ™(0) = 0. As a result, we have
1+πœƒ
1+πœƒ
πœ“Μƒπœ™ (πœƒ) = π»πœ™βˆ’ (
) βˆ’ π»πœ™ (
)
2
2
1+πœƒ
= 0 βˆ’ π»πœ™ (
)
2
1+πœƒ
= πœƒ log
.
1βˆ’πœƒ
As the second derivative of $\tilde\psi_\phi(\theta)$ is $\frac{4}{(1-\theta^2)^2} > 0$, $\tilde\psi_\phi(\theta)$ is convex in πœƒ. Therefore,
$$\psi_\phi(\theta) = \tilde\psi_\phi(\theta) = \theta\log\frac{1+\theta}{1-\theta}.$$
The π»πœ™ and πœ“πœ™ are plotted in Figure 1.
[Figure 1: π»πœ™ versus π‘š (left panel) and πœ“πœ™ versus πœƒ (right panel).]
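Figure 1 can be regenerated from the closed forms above with a short R sketch (my own plotting code, not part of the original solution):

# H_phi(m) = (1 - 2m) log(m/(1-m)) and psi_phi(theta) = theta log((1+theta)/(1-theta))
m     <- seq(0.001, 0.999, length.out = 400)
theta <- seq(-0.999, 0.999, length.out = 400)
Hphi  <- (1 - 2 * m) * log(m / (1 - m))
psi   <- theta * log((1 + theta) / (1 - theta))
par(mfrow = c(1, 2))
plot(m, Hphi, type = "l", xlab = "m", ylab = "H_phi")
plot(theta, psi, type = "l", xlab = "theta", ylab = "psi_phi")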
(2) The conditional πœ™-risk is
π‘šπœ™(𝛼) + (1 βˆ’ π‘š)πœ™(βˆ’π›Ό)
0.5
1.0
= π‘š(1 βˆ’ 𝛼)+ + (1 βˆ’ π‘š)(1 + 𝛼)+ .
It is evident that the conditional πœ™-risk could reach its minimum only when βˆ’1 ≀ 𝛼 ≀ 1.
In this interval, the conditional πœ™-risk is 1 + (1 βˆ’ 2π‘š)𝛼. Clearly, when π‘š > 1⁄2, the
conditional πœ™-risk reaches its minimum 2(1 βˆ’ π‘š) when 𝛼 = 1; when π‘š < 1⁄2, the
conditional πœ™-risk reaches its minimum 2π‘š when 𝛼 = βˆ’1. Therefore, π»πœ™ (π‘š) = 1 βˆ’
|2π‘š βˆ’ 1|.
It is also evident that when π‘š > 1⁄2, the allowable range of 𝛼 in π»πœ™βˆ’ (π‘š) is 𝛼 ≀ 0, and
the conditional πœ™-risk reaches its minimum 1 under this allowable range when 𝛼 = 0;
when π‘š < 1⁄2, the allowable range of 𝛼 in π»πœ™βˆ’ (π‘š) is 𝛼 β‰₯ 0, and the conditional πœ™-risk
reaches its minimum 1 under this allowable range when 𝛼 = 0. Therefore, π»πœ™βˆ’ (π‘š) = 1.
Therefore,
1+πœƒ
1+πœƒ
πœ“Μƒπœ™ (πœƒ) = π»πœ™βˆ’ (
) βˆ’ π»πœ™ (
)
2
2
= 1 βˆ’ (1 βˆ’ |1 + πœƒ βˆ’ 1|)
= |πœƒ|.
As |πœƒ| is convex,
πœ“πœ™ (πœƒ) = πœ“Μƒπœ™ (πœƒ) = |πœƒ|.
The π»πœ™ and πœ“πœ™ are plotted in Figure 2.
[Figure 2: π»πœ™ versus π‘š (left panel) and πœ“πœ™ versus πœƒ (right panel).]
(3) The conditional πœ™-risk is
π‘šπœ™(𝛼) + (1 βˆ’ π‘š)πœ™(βˆ’π›Ό)
= π‘š(1 βˆ’ 𝛼)2+ + (1 βˆ’ π‘š)(1 + 𝛼)2+ .
It is evident that the conditional πœ™-risk could reach its minimum only when βˆ’1 ≀ 𝛼 ≀ 1.
In this interval, the conditional πœ™-risk is π‘š(1 βˆ’ 𝛼)2 + (1 βˆ’ π‘š)(1 + 𝛼)2 . Taking its
derivative w.r.t. 𝛼 and setting it to be zero, we have 𝛼 βˆ— = 2π‘š βˆ’ 1. Substituting it back
into the conditional πœ™-risk, we have
π»πœ™ (π‘š) = π‘š(1 βˆ’ 𝛼 βˆ— )2 + (1 βˆ’ π‘š)(1 + 𝛼 βˆ— )2
= π‘š(2 βˆ’ 2π‘š)2 + (1 βˆ’ π‘š)(2π‘š)2
= 4π‘š(1 βˆ’ π‘š).
To derive π»πœ™βˆ’(π‘š), we note that the conditional πœ™-risk is convex in 𝛼. If π‘š > 1⁄2, then the allowable range of 𝛼 in π»πœ™βˆ’(π‘š) is 𝛼 ≀ 0; on the other hand, we have 𝛼* > 0. Therefore, the 𝛼 attaining the infimum in π»πœ™βˆ’(π‘š) must be 0. If π‘š < 1⁄2, then the allowable range of 𝛼 in π»πœ™βˆ’(π‘š) is 𝛼 β‰₯ 0; on the other hand, we have 𝛼* < 0. Therefore, the 𝛼 attaining the infimum in π»πœ™βˆ’(π‘š) must also be 0. In either case, π»πœ™βˆ’(π‘š) = 1. As a result, we have
1+πœƒ
1+πœƒ
πœ“Μƒπœ™ (πœƒ) = π»πœ™βˆ’ (
) βˆ’ π»πœ™ (
)
2
2
1+πœƒ1βˆ’πœƒ
= 1βˆ’4
2
2
= πœƒ2.
As πœƒ 2 is convex,
πœ“πœ™ (πœƒ) = πœ“Μƒπœ™ (πœƒ) = πœƒ 2 .
The π»πœ™ and πœ“πœ™ are plotted in Figure 3.
[Figure 3: π»πœ™ versus π‘š (left panel) and πœ“πœ™ versus πœƒ (right panel).]
(4) The conditional πœ™-risk is
π‘šπœ™(𝛼) + (1 βˆ’ π‘š)πœ™(βˆ’π›Ό)
π‘šπ‘’ βˆ’π›Ό + (1 βˆ’ π‘š)(2 βˆ’ 𝑒 βˆ’π›Ό ), 𝛼 > 0
={
.
π‘š(2 βˆ’ 𝑒 𝛼 ) + (1 βˆ’ π‘š)𝑒 𝛼 ,
𝛼≀0
The derivative of the conditional πœ™-risk w.r.t. 𝛼 is
$$\begin{cases} (1 - 2m)e^{-\alpha}, & \alpha > 0, \\ (1 - 2m)e^{\alpha}, & \alpha \le 0. \end{cases}$$
When π‘š < 1⁄2, the derivative of the conditional πœ™-risk is positive for all 𝛼. Therefore,
𝛼 βˆ— = βˆ’βˆž, and the conditional πœ™-risk has the minimum 2π‘š; when π‘š > 1⁄2, the
derivative of the conditional πœ™-risk is negative for all 𝛼. Therefore, 𝛼 βˆ— = +∞, and the
conditional πœ™-risk has the minimum 2(1 βˆ’ π‘š). Combining the results, we have
π»πœ™ (π‘š) = 1 βˆ’ |2π‘š βˆ’ 1|.
It is evident that when π‘š > 1⁄2, the allowable range of 𝛼 in π»πœ™βˆ’ (π‘š) is 𝛼 ≀ 0, and the
conditional πœ™-risk reaches its minimum 1 under this allowable range when 𝛼 = 0; when
π‘š < 1⁄2, the allowable range of 𝛼 in π»πœ™βˆ’ (π‘š) is 𝛼 β‰₯ 0, and the conditional πœ™-risk
reaches its minimum 1 under this allowable range when 𝛼 = 0. Therefore, π»πœ™βˆ’ (π‘š) = 1.
Therefore,
1+πœƒ
1+πœƒ
πœ“Μƒπœ™ (πœƒ) = π»πœ™βˆ’ (
) βˆ’ π»πœ™ (
)
2
2
= 1 βˆ’ (1 βˆ’ |1 + πœƒ βˆ’ 1|)
= |πœƒ|.
As |πœƒ| is convex,
πœ“πœ™ (πœƒ) = πœ“Μƒπœ™ (πœƒ) = |πœƒ|.
The π»πœ™ and πœ“πœ™ are plotted in Figure 4.
[Figure 4: π»πœ™ versus π‘š (left panel) and πœ“πœ™ versus πœƒ (right panel).]
2.
a) The boxplots of the data are shown in Figure 5.
[Figure 5: Boxplots of the 𝑋 and π‘Œ samples.]
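Figure 5 can be produced with a single R command (a sketch, assuming the data are stored in vectors X and Y as in the Appendix code):

boxplot(list(X = X, Y = Y), ylab = "value")  # side-by-side boxplots of the two samples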
b) The mle for πœ‡1 and πœ‡2 are
$$\hat\mu_1 = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \hat\mu_2 = \frac{1}{m}\sum_{i=1}^{m} Y_i.$$
The mle for 𝜎1 and 𝜎2 are
$$\hat\sigma_1 = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - \hat\mu_1)^2}, \qquad \hat\sigma_2 = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(Y_i - \hat\mu_2)^2}.$$
The log-likelihood of πœ‡1 and 𝜎1 is
$$\ell(\mu_1, \sigma_1) = -n\log\big(\sqrt{2\pi}\,\sigma_1\big) - \frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2.$$
The Fisher information matrix is
$$I_n(\mu_1, \sigma_1) = -\begin{bmatrix} \mathbb{E}\big(\frac{\partial^2 \ell}{\partial \mu_1^2}\big) & \mathbb{E}\big(\frac{\partial^2 \ell}{\partial \mu_1 \partial \sigma_1}\big) \\ \mathbb{E}\big(\frac{\partial^2 \ell}{\partial \sigma_1 \partial \mu_1}\big) & \mathbb{E}\big(\frac{\partial^2 \ell}{\partial \sigma_1^2}\big) \end{bmatrix} = \begin{bmatrix} \frac{n}{\sigma_1^2} & 0 \\ 0 & \frac{2n}{\sigma_1^2} \end{bmatrix}.$$
The inverse of the Fisher information matrix is
$$J_n(\mu_1, \sigma_1) = I_n^{-1}(\mu_1, \sigma_1) = \begin{bmatrix} \frac{\sigma_1^2}{n} & 0 \\ 0 & \frac{\sigma_1^2}{2n} \end{bmatrix}.$$
Therefore, the 1 βˆ’ 𝛼 confidence interval for πœ‡1 is
$$C_{\mu_1} = \Big(\hat\mu_1 - z_{\alpha/2}\frac{\hat\sigma_1}{\sqrt{n}},\; \hat\mu_1 + z_{\alpha/2}\frac{\hat\sigma_1}{\sqrt{n}}\Big).$$
Similarly, the 1 βˆ’ 𝛼 confidence interval for πœ‡2 is
$$C_{\mu_2} = \Big(\hat\mu_2 - z_{\alpha/2}\frac{\hat\sigma_2}{\sqrt{m}},\; \hat\mu_2 + z_{\alpha/2}\frac{\hat\sigma_2}{\sqrt{m}}\Big).$$
The mle of 𝛿 = πœ‡1 βˆ’ πœ‡2 is
𝛿̂ = πœ‡Μ‚ 1 βˆ’ πœ‡Μ‚ 2 .
Since πœ‡Μ‚ 1 and πœ‡Μ‚ 2 are independent, the variance of 𝛿̂ is
$$\mathrm{Var}(\hat\delta) = \mathrm{Var}(\hat\mu_1 - \hat\mu_2) = \mathrm{Var}(\hat\mu_1) + \mathrm{Var}(\hat\mu_2) = \frac{\hat\sigma_1^2}{n} + \frac{\hat\sigma_2^2}{m}.$$
Therefore, the 1 βˆ’ 𝛼 confidence interval for 𝛿 is
$$C_\delta = \Big(\hat\delta - z_{\alpha/2}\sqrt{\frac{\hat\sigma_1^2}{n} + \frac{\hat\sigma_2^2}{m}},\; \hat\delta + z_{\alpha/2}\sqrt{\frac{\hat\sigma_1^2}{n} + \frac{\hat\sigma_2^2}{m}}\Big).$$
Using the actual data and setting 𝛼 = 0.05, we have
πœ‡Μ‚ 1 = 1.820011, πœ‡Μ‚ 2 = 1.389454;
πœŽΜ‚1 = 0.8042076, πœŽΜ‚2 = 0.8060367;
𝛿̂ = 0.4305566; 𝐢𝛿 = (0.07127103, 0.7898422).
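These numbers can be reproduced with a short R sketch (my own code; the Appendix only lists code for parts (d)-(h)), again assuming the data vectors X and Y:

# mle's and the 95% Wald-type confidence interval for delta = mu1 - mu2
n <- length(X); m <- length(Y)
mu1hat <- mean(X); mu2hat <- mean(Y)
sigma1hat <- sqrt(mean((X - mu1hat)^2))  # mle of sigma_1 (divides by n, not n-1)
sigma2hat <- sqrt(mean((Y - mu2hat)^2))
deltahat <- mu1hat - mu2hat
z  <- qnorm(0.975)                       # z_{alpha/2} for alpha = 0.05
se <- sqrt(sigma1hat^2 / n + sigma2hat^2 / m)
c(deltahat - z * se, deltahat + z * se)  # C_delta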
c) The joint distribution is
𝑝(πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 , 𝑋, π‘Œ) = 𝑝(𝑋, π‘Œ|πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 )πœ‹(πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 )
βˆ’
= (2πœ‹)
𝑛
2 𝜎1βˆ’π‘› exp (βˆ’
βˆ’
= π‘˜ βˆ™ (2πœ‹)
𝑛
𝑛
𝑖=1
𝑗=1
π‘š
1
1
π‘˜
2
)2 (2πœ‹)βˆ’ 2 𝜎2βˆ’π‘š exp (βˆ’ 2 βˆ‘(π‘Œπ‘— βˆ’ πœ‡2 ) ) βˆ™
2 βˆ‘(𝑋𝑖 βˆ’ πœ‡1 ) βˆ™
𝜎1 𝜎2
2𝜎1
2𝜎2
𝑛+π‘š
2 𝜎1βˆ’π‘›βˆ’1 𝜎2βˆ’π‘šβˆ’1 exp (βˆ’
Here, π‘˜ is a constant in the prior.
𝑛
𝑛
𝑖=1
𝑗=1
1
1
2
)2 βˆ’ 2 βˆ‘(π‘Œπ‘— βˆ’ πœ‡2 ) ).
2 βˆ‘(𝑋𝑖 βˆ’ πœ‡1
2𝜎1
2𝜎2
The marginal distribution of data is
$$p(X, Y) = \int p(X, Y \mid \mu_1,\mu_2,\sigma_1,\sigma_2)\,\pi(\mu_1,\mu_2,\sigma_1,\sigma_2)\, d\mu_1\, d\mu_2\, d\sigma_1\, d\sigma_2$$
$$= k\cdot \int (2\pi)^{-\frac{n}{2}} \sigma_1^{-n-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\, d\mu_1\, d\sigma_1 \cdot \int (2\pi)^{-\frac{m}{2}} \sigma_2^{-m-1} \exp\Big(-\frac{1}{2\sigma_2^2}\sum_{j=1}^{m}(Y_j - \mu_2)^2\Big)\, d\mu_2\, d\sigma_2,$$
where
$$\int (2\pi)^{-\frac{n}{2}} \sigma_1^{-n-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\, d\mu_1\, d\sigma_1$$
$$= (2\pi)^{-\frac{n}{2}} \int_0^\infty \sigma_1^{-n-1} \Big[\int_{-\infty}^{\infty} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\, d\mu_1\Big]\, d\sigma_1$$
$$= (2\pi)^{-\frac{n}{2}} \int_0^\infty \sigma_1^{-n-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}X_i^2\Big) \Big[\int_{-\infty}^{\infty} \exp\Big(-\frac{1}{2\sigma_1^2}\Big(n\mu_1^2 - 2\mu_1\sum_{i=1}^{n}X_i\Big)\Big)\, d\mu_1\Big]\, d\sigma_1$$
$$= (2\pi)^{-\frac{n}{2}} \int_0^\infty \sigma_1^{-n-1} \exp\Big(-\frac{n}{2\sigma_1^2}\Big(\frac{1}{n}\sum_{i=1}^{n}X_i^2 - \bar X^2\Big)\Big) \Big[\int_{-\infty}^{\infty} \exp\Big(-\frac{n}{2\sigma_1^2}(\mu_1 - \bar X)^2\Big)\, d\mu_1\Big]\, d\sigma_1$$
$$= (2\pi)^{-\frac{n}{2}} \int_0^\infty \sigma_1^{-n-1} \exp\Big(-\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big)\, \sqrt{2\pi}\,\frac{\sigma_1}{\sqrt{n}}\, d\sigma_1 = (2\pi)^{-\frac{n}{2}+\frac{1}{2}}\, n^{-\frac{1}{2}} \int_0^\infty \sigma_1^{-n} \exp\Big(-\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big)\, d\sigma_1,$$
where $\bar X = \frac{1}{n}\sum_{i=1}^{n}X_i$ and $\hat\sigma_1^2 = \frac{1}{n}\sum_{i=1}^{n}X_i^2 - \bar X^2$. Letting $\lambda_1 = \sigma_1^{-2}$, we have
$$\int (2\pi)^{-\frac{n}{2}} \sigma_1^{-n-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\, d\mu_1\, d\sigma_1 = (2\pi)^{-\frac{n}{2}+\frac{1}{2}}\, n^{-\frac{1}{2}} \int_0^\infty \sigma_1^{-n} \exp\Big(-\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big)\, d\sigma_1$$
$$= -(2\pi)^{-\frac{n}{2}+\frac{1}{2}}\, n^{-\frac{1}{2}}\,\frac{1}{2} \int_\infty^0 \lambda_1^{\frac{n-3}{2}} \exp\Big(-\frac{n\hat\sigma_1^2}{2}\lambda_1\Big)\, d\lambda_1 = (2\pi)^{-\frac{n}{2}+\frac{1}{2}}\, n^{-\frac{1}{2}}\,\frac{1}{2}\, \frac{\Gamma\big(\frac{n-1}{2}\big)}{\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}.$$
Similarly, we have
$$\int (2\pi)^{-\frac{m}{2}} \sigma_2^{-m-1} \exp\Big(-\frac{1}{2\sigma_2^2}\sum_{j=1}^{m}(Y_j - \mu_2)^2\Big)\, d\mu_2\, d\sigma_2 = (2\pi)^{-\frac{m}{2}+\frac{1}{2}}\, m^{-\frac{1}{2}}\,\frac{1}{2}\, \frac{\Gamma\big(\frac{m-1}{2}\big)}{\big(\frac{m\hat\sigma_2^2}{2}\big)^{\frac{m-1}{2}}}.$$
Therefore,
𝑝(𝑋, π‘Œ) = π‘˜ βˆ™
𝑛
∫(2πœ‹)βˆ’2 𝜎1βˆ’π‘›βˆ’1 exp (βˆ’
𝑛
1
βˆ‘(𝑋𝑖 βˆ’ πœ‡1 )2 ) π‘‘πœ‡1 π‘‘πœŽ1
2𝜎12
𝑖=1
𝑛
π‘š
βˆ™ ∫(2πœ‹)βˆ’ 2 𝜎2βˆ’π‘šβˆ’1 exp (βˆ’
1
2
2 βˆ‘(π‘Œπ‘— βˆ’ πœ‡2 ) ) π‘‘πœ‡2 π‘‘πœŽ2
2𝜎2
𝑗=1
π‘›βˆ’1
π‘šβˆ’1
) Ξ“( 2 )
𝑛+π‘š
1 Ξ“(
1
βˆ’
βˆ’
2
= π‘˜ βˆ™ (2πœ‹) 2 2πœ‹(π‘›π‘š) 2
π‘›βˆ’1
π‘šβˆ’1 .
4
2 2
2
2
π‘›πœŽΜ‚
π‘šπœŽΜ‚
( 21 )
( 22)
The posterior density is
$$p(\mu_1,\mu_2,\sigma_1,\sigma_2 \mid X, Y) = \frac{p(X, Y \mid \mu_1,\mu_2,\sigma_1,\sigma_2)\,\pi(\mu_1,\mu_2,\sigma_1,\sigma_2)}{p(X, Y)}$$
$$= \frac{k\,(2\pi)^{-\frac{n+m}{2}}\,\sigma_1^{-n-1}\sigma_2^{-m-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i-\mu_1)^2 - \frac{1}{2\sigma_2^2}\sum_{j=1}^{m}(Y_j-\mu_2)^2\Big)}{k\,(2\pi)^{-\frac{n+m}{2}}\, \frac{2\pi(nm)^{-1/2}}{4}\, \frac{\Gamma(\frac{n-1}{2})\,\Gamma(\frac{m-1}{2})}{(\frac{n\hat\sigma_1^2}{2})^{\frac{n-1}{2}}(\frac{m\hat\sigma_2^2}{2})^{\frac{m-1}{2}}}}$$
$$= \frac{2\sqrt{nm}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}\big(\frac{m\hat\sigma_2^2}{2}\big)^{\frac{m-1}{2}}}{\pi\,\Gamma\big(\frac{n-1}{2}\big)\,\Gamma\big(\frac{m-1}{2}\big)} \cdot \frac{\exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i-\mu_1)^2 - \frac{1}{2\sigma_2^2}\sum_{j=1}^{m}(Y_j-\mu_2)^2\Big)}{\sigma_1^{n+1}\,\sigma_2^{m+1}}.$$
d) The posterior distribution of (πœ‡1, 𝜎1) is
$$p(\mu_1, \sigma_1 \mid D) = \frac{p(X \mid \mu_1, \sigma_1)\,\pi(\mu_1, \sigma_1)}{\int p(X \mid \mu_1, \sigma_1)\,\pi(\mu_1, \sigma_1)\, d\mu_1\, d\sigma_1} = \frac{(2\pi)^{-\frac{n}{2}}\,\sigma_1^{-n-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)}{(2\pi)^{-\frac{n}{2}+\frac{1}{2}}\, n^{-\frac{1}{2}}\,\frac{1}{2}\,\Gamma\big(\frac{n-1}{2}\big)\Big/\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}$$
$$= \frac{2\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)} \cdot \frac{\exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)}{\sigma_1^{n+1}}.$$
The posterior distribution of 𝜎1 is
$$p(\sigma_1 \mid D) = \int_{-\infty}^{\infty} p(\mu_1, \sigma_1 \mid D)\, d\mu_1 = \frac{2\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n+1}} \int_{-\infty}^{\infty} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\, d\mu_1$$
$$= \frac{2\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n+1}}\, \exp\Big(-\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big)\,\sqrt{2\pi}\,\frac{\sigma_1}{\sqrt{n}} = \frac{2\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}{\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n}}\, \exp\Big(-\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big).$$
Therefore, we have
$$p(\mu_1 \mid \sigma_1, D) = \frac{p(\mu_1, \sigma_1 \mid D)}{p(\sigma_1 \mid D)} = \frac{2\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\,\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n+1}\, \exp\Big(-\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big)\, 2\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}} = \frac{\sqrt{n}\,\exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)}{\sqrt{2\pi}\,\sigma_1\, \exp\Big(-\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big)}$$
$$= \frac{1}{\sqrt{2\pi}\,\frac{\sigma_1}{\sqrt{n}}}\, \exp\Big(\frac{n\hat\sigma_1^2}{2\sigma_1^2}\Big)\, \exp\Big(-\frac{n}{2\sigma_1^2}\Big(\frac{1}{n}\sum_{i=1}^{n}X_i^2 - \bar X^2\Big)\Big)\, \exp\Big(-\frac{n}{2\sigma_1^2}(\mu_1 - \bar X)^2\Big) = \frac{1}{\sqrt{2\pi}\,\frac{\sigma_1}{\sqrt{n}}}\, \exp\Big(-\frac{(\mu_1 - \bar X)^2}{2\,\sigma_1^2/n}\Big) = N\Big(\mu_1;\, \bar X,\, \frac{\sigma_1^2}{n}\Big).$$
Let πœ†1 = 𝜎1βˆ’2 , we can obtain 𝑝(πœ†1 |𝐷) from 𝑝(𝜎1 |𝐷) as follows:
𝑑
𝑑
1
𝑑
1
π‘πœ†1 (πœ†1 |𝐷) =
β„™(πœ†β€²1 ≀ πœ†1 |𝐷) =
β„™ ( β€² 2 ≀ πœ†1 |𝐷) =
β„™ (𝜎1β€² β‰₯
|𝐷)
π‘‘πœ†1
π‘‘πœ†1
π‘‘πœ†1
𝜎1
βˆšπœ†1
=
=
=
1
𝑑
1
𝑑
1
𝜎13 𝑑
1
(1 βˆ’ β„™ (𝜎1β€² ≀
|𝐷)) = βˆ’
β„™ (𝜎1β€² ≀
|𝐷) =
β„™ (𝜎1β€² ≀
|𝐷)
π‘‘πœ†1
π‘‘πœ†1
2 π‘‘πœŽ1
βˆšπœ†1
βˆšπœ†1
βˆšπœ†1
π‘›βˆ’1
𝑛
3
π‘›πœŽΜ‚12 2 2
βˆ’
22(
)
πœ†
πœ†1
1
2
𝜎13
1
π‘πœŽ1 (
|𝐷) =
2
2
βˆšπœ†1
π‘›βˆ’1
Ξ“( 2 )
βˆ™ exp (βˆ’
π‘›πœŽΜ‚12
πœ† )
2 1
π‘›βˆ’1
π‘›πœŽΜ‚ 2 2
π‘›βˆ’1
( 1)
π‘›πœŽΜ‚12
𝑛 βˆ’ 1 π‘›πœŽΜ‚12
βˆ’1
2
=
πœ†1 2
exp (βˆ’
πœ†1 ) = 𝑔 (πœ†1 ;
,
),
π‘›βˆ’1
2
2
2
Ξ“( 2 )
where 𝑔(π‘₯; 𝛼, 𝛽) is the gamma distribution with shape parameter 𝛼 and rate parameter 𝛽.
Therefore,
$$p(\sigma_1 \mid D) = g\Big(\sigma_1^{-2};\, \frac{n-1}{2},\, \frac{n\hat\sigma_1^2}{2}\Big).$$
That is, to sample from $p(\sigma_1 \mid D)$, we can first sample $z$ from the gamma distribution $g\big(z;\, \frac{n-1}{2},\, \frac{n\hat\sigma_1^2}{2}\big)$ and then compute $\sigma_1 = z^{-1/2}$.
Similarly,
$$p(\sigma_2 \mid D) = g\Big(\sigma_2^{-2};\, \frac{m-1}{2},\, \frac{m\hat\sigma_2^2}{2}\Big),$$
and
$$p(\mu_2 \mid \sigma_2, D) = N\Big(\mu_2;\, \bar Y,\, \frac{\sigma_2^2}{m}\Big).$$
Therefore,
𝑝(πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 |𝐷) = 𝑝(πœ‡1 |𝜎1 , 𝐷)𝑝(𝜎1 |𝐷)𝑝(πœ‡2 |𝜎2 𝐷)𝑝(𝜎2 |𝐷)
𝜎12
𝑛 βˆ’ 1 π‘›πœŽΜ‚12
𝜎22
π‘š βˆ’ 1 π‘šπœŽΜ‚22
= 𝑁 (πœ‡1 ; 𝑋̅ , ) βˆ™ 𝑔 (𝜎1βˆ’2 ;
,
,
) βˆ™ 𝑁 (πœ‡2 ; π‘ŒΜ…, ) βˆ™ 𝑔 (𝜎2βˆ’2 ;
).
𝑛
2
2
π‘š
2
2
To get the Bayes estimate of 𝛿 = πœ‡1 βˆ’ πœ‡2 , we need to compute
𝛿̅ = ∫(πœ‡1 βˆ’ πœ‡2 )𝑝(πœ‡1 , πœ‡2 |𝐷)π‘‘πœ‡1 π‘‘πœ‡2 = ∫(πœ‡1 βˆ’ πœ‡2 ) (∫ 𝑝(πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 |𝐷)π‘‘πœŽ1 π‘‘πœŽ2 ) π‘‘πœ‡1 π‘‘πœ‡2
= ∫(πœ‡1 βˆ’ πœ‡2 )𝑝(πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 |𝐷) π‘‘πœ‡1 π‘‘πœ‡2 π‘‘πœŽ1 π‘‘πœŽ2 .
(π‘˜)
(π‘˜)
(π‘˜)
(π‘˜)
The Monte Carlo method with direct sampling draws samples (πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 ) , π‘˜ = 1 … 𝑁
according to the factorized expression of 𝑝(πœ‡1 , πœ‡2 , 𝜎1 , 𝜎2 |𝐷), and approximates 𝛿̅ as
𝑁
1
(π‘˜)
(π‘˜)
βˆ‘ (πœ‡1 βˆ’ πœ‡2 ).
𝑁
π‘˜=1
I generated 100,000 samples with direct sampling, and the estimate of 𝛿̅ is 0.4302737, very close to the mle estimate. The 95 percent posterior interval is obtained by sorting the simulated samples and taking the 0.025 and 0.975 quantiles. The resulting 95% posterior interval is (0.05256353, 0.8063011), also close to the 95% confidence interval from the mle.
e) The posterior distribution of πœ‡1 is
$$p(\mu_1 \mid D) = \int_0^\infty p(\mu_1, \sigma_1 \mid D)\, d\sigma_1 = \frac{2\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)} \int_0^\infty \frac{1}{\sigma_1^{n+1}} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\, d\sigma_1.$$
Letting $\lambda_1 = \sigma_1^{-2}$, we have
$$p(\mu_1 \mid D) = \frac{\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)} \int_0^\infty \lambda_1^{\frac{n}{2}-1} \exp\Big(-\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\lambda_1\Big)\, d\lambda_1 = \frac{\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)} \cdot \frac{\Gamma\big(\frac{n}{2}\big)}{\big(\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)^{\frac{n}{2}}}.$$
Therefore,
$$p(\sigma_1 \mid \mu_1, D) = \frac{p(\mu_1, \sigma_1 \mid D)}{p(\mu_1 \mid D)} = \frac{2\sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big)\, \sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)\,\big(\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)^{\frac{n}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n+1}\, \sqrt{n}\,\big(\frac{n\hat\sigma_1^2}{2}\big)^{\frac{n-1}{2}}\, \Gamma\big(\frac{n}{2}\big)}$$
$$= \frac{2\,\big(\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)^{\frac{n}{2}}}{\Gamma\big(\frac{n}{2}\big)}\, \sigma_1^{-n-1}\, \exp\Big(-\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2\sigma_1^2}\Big).$$
Similar to the derivation of $p(\sigma_1 \mid D)$ in the previous sub-problem, we can obtain $p(\sigma_1 \mid \mu_1, D)$ by substituting $\lambda_1 = \sigma_1^{-2}$:
$$p_{\lambda_1}(\lambda_1 \mid \mu_1, D) = \frac{\sigma_1^3}{2}\, p_{\sigma_1}\Big(\frac{1}{\sqrt{\lambda_1}} \,\Big|\, \mu_1, D\Big) = \frac{\lambda_1^{-\frac{3}{2}}}{2}\cdot \frac{2\,\big(\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)^{\frac{n}{2}}}{\Gamma\big(\frac{n}{2}\big)}\, \lambda_1^{\frac{n+1}{2}}\, \exp\Big(-\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\lambda_1\Big)$$
$$= \frac{\big(\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)^{\frac{n}{2}}}{\Gamma\big(\frac{n}{2}\big)}\, \lambda_1^{\frac{n}{2}-1}\, \exp\Big(-\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\lambda_1\Big) = g\Big(\lambda_1;\, \frac{n}{2},\, \frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\Big).$$
Therefore,
$$p(\sigma_1 \mid \mu_1, D) = g\Big(\sigma_1^{-2};\, \frac{n}{2},\, \frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\Big).$$
Similarly, we have
$$p(\sigma_2 \mid \mu_2, D) = g\Big(\sigma_2^{-2};\, \frac{m}{2},\, \frac{\sum_{j=1}^{m}(Y_j - \mu_2)^2}{2}\Big).$$
Having obtained $p(\mu_1 \mid \sigma_1, D) = N\big(\mu_1;\, \bar X,\, \frac{\sigma_1^2}{n}\big)$, $p(\sigma_1 \mid \mu_1, D) = g\big(\sigma_1^{-2};\, \frac{n}{2},\, \frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)$, $p(\mu_2 \mid \sigma_2, D) = N\big(\mu_2;\, \bar Y,\, \frac{\sigma_2^2}{m}\big)$, and $p(\sigma_2 \mid \mu_2, D) = g\big(\sigma_2^{-2};\, \frac{m}{2},\, \frac{\sum_{j=1}^{m}(Y_j - \mu_2)^2}{2}\big)$, we can simulate samples $(\mu_1^{(k)}, \mu_2^{(k)}, \sigma_1^{(k)}, \sigma_2^{(k)})$, $k = 1, \dots, N$, using Gibbs sampling.
The results of simulating 100,000 samples using Gibbs sampling are as follows: the estimate of 𝛿̅
is 0.4301146, very close to the mle estimate. The 95% posterior interval is (0.0558426,
0.8073434), also close to the 95% confidence interval in mle.
f) The proposal distribution that I used in the Metropolis-Hastings sampling is a Gaussian distribution:
$$q(\mu_1', \mu_2', \sigma_1', \sigma_2' \mid \mu_1, \mu_2, \sigma_1, \sigma_2) = N\Big(\mu_1';\, \mu_1,\, \frac{\hat\sigma_1^2}{n}\Big)\cdot N\Big(\mu_2';\, \mu_2,\, \frac{\hat\sigma_2^2}{m}\Big)\cdot N\Big(\sigma_1';\, \sigma_1,\, \frac{\hat\sigma_1^2}{2n}\Big)\cdot N\Big(\sigma_2';\, \sigma_2,\, \frac{\hat\sigma_2^2}{2m}\Big).$$
The plot of 100,001 samples from the Metropolis-Hastings algorithm is shown in Figure 6. We can see that the Markov chain mixed pretty well. The estimate of 𝛿̅ is 0.4338128, very close to the mle estimate. The 95% posterior interval is (0.045316, 0.8312325), also close to the 95% confidence interval in mle.
[Figure 6: Trace plot of delta over the 100,000 Metropolis-Hastings iterations.]
g) The estimates of the densities of 𝑋 and π‘Œ using kernel density estimation are shown in Figure 7. I used a Gaussian kernel with bandwidth equal to 1, which gives a good normal-like shape. The range of acceptable bandwidths is roughly from 0.7 to 1; setting the bandwidth too small would overfit the data, resulting in multiple modes.
[Figure 7: Kernel density estimates p_X and p_Y of the 𝑋 and π‘Œ samples.]
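As a cross-check of the hand-rolled estimator in the Appendix, base R's density() function produces essentially the same curves (a sketch; bw = 1 matches the bandwidth used above):

par(mfrow = c(1, 2))
plot(density(X, bw = 1, kernel = "gaussian"), main = "", xlab = "X", ylab = "p_X")
plot(density(Y, bw = 1, kernel = "gaussian"), main = "", xlab = "Y", ylab = "p_Y")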
h) I use a Gaussian to model both the conditional distribution of the data and the prior distribution of the parameters. The conditional distribution of the data for cluster π‘˜ is
$$p(x_{1:n} \mid \mu, \tau) = \frac{\sqrt{\tau}}{\sqrt{2\pi}} \exp\Big(-\frac{\tau \sum_{i=1}^{n}(x_i - \mu)^2}{2}\Big),$$
where 𝜏 is fixed at 1. In the discussion below, 𝑝(π‘₯1:𝑛 |πœ‡, 𝜏) is abbreviated as 𝑝(π‘₯1:𝑛 |πœ‡).
The prior distribution of parameter πœ‡ is
$$p(\mu \mid \mu_0, \tau_0) = \frac{\sqrt{\tau_0}}{\sqrt{2\pi}} \exp\Big(-\frac{\tau_0 (\mu - \mu_0)^2}{2}\Big).$$
It can be shown that the posterior distribution of parameter πœ‡ is
$$p(\mu \mid x_{1:n}, \mu_0, \tau_0) \propto p(x_{1:n} \mid \mu)\, p(\mu \mid \mu_0, \tau_0) \propto \frac{\sqrt{\tau_k}}{\sqrt{2\pi}} \exp\Big(-\frac{\tau_k (\mu - \mu_k)^2}{2}\Big),$$
where
$$\tau_k = \tau_0 + n_k \tau, \qquad \mu_k = \frac{\mu_0 \tau_0 + \tau n_k \bar X_k}{\tau_0 + n_k \tau},$$
with $n_k$ the number of training samples assigned to cluster π‘˜ and $\bar X_k$ the mean of those samples.
Therefore, in Gibbs sampling, the likelihood of $x_i$ being assigned to cluster π‘˜ given the cluster assignment of the other samples is
𝑝(π‘₯𝑖 |π‘₯βˆ’π‘– , π‘βˆ’π‘– , 𝑐𝑖 = π‘˜, πœ‡0 , 𝜏0 ) = ∫ 𝑝(π‘₯𝑖 |πœ‡)𝑝(πœ‡|π‘₯βˆ’π‘– , πœ‡0 , 𝜏0 )π‘‘πœ‡
=∫
√𝜏
√2πœ‹
exp (βˆ’
𝜏(π‘₯𝑖 βˆ’ πœ‡)2 βˆšπœπ‘˜
πœπ‘˜ (πœ‡ βˆ’ πœ‡π‘˜ )2
exp (βˆ’
)βˆ™
) π‘‘πœ‡
2
2
√2πœ‹
2
=
√𝜏
√2πœ‹
𝜏π‘₯𝑖2
2
π‘’βˆ’
(πœπ‘˜β€² )βˆ’1⁄2 exp (
βˆ™
(πœπ‘˜
πœπ‘˜β€² πœ‡π‘˜β€²
2 )
𝜏 πœ‡
)βˆ’1⁄2 exp ( π‘˜ π‘˜
2
,
)
2
where πœπ‘˜ and πœ‡π‘˜ are computed as before except that sample π‘₯𝑖 is excluded from π‘›π‘˜ and π‘‹Μ…π‘˜ , and
πœπ‘˜β€² = πœπ‘˜ + 𝜏,
and
πœ‡π‘˜ πœπ‘˜ + 𝜏π‘₯𝑖
πœ‡π‘˜β€² =
.
πœπ‘˜ + 𝜏
If π‘₯𝑖 is assigned to a new cluster, the likelihood is
𝑝(π‘₯𝑖 |πœ‡0 , 𝜏0 ) = ∫ 𝑝(π‘₯𝑖 |πœ‡)𝑝(πœ‡|πœ‡0 , 𝜏0 )π‘‘πœ‡
=∫
=
√𝜏
√2πœ‹
√𝜏
√2πœ‹
𝑒
exp (βˆ’
𝜏π‘₯ 2
βˆ’ 𝑖
2
𝜏(π‘₯𝑖 βˆ’ πœ‡)2 √𝜏0
𝜏0 (πœ‡ βˆ’ πœ‡0 )2
exp (βˆ’
)βˆ™
) π‘‘πœ‡
2
2
√2πœ‹
(𝜏 β€² )βˆ’1⁄2 exp (
βˆ™
(𝜏0 )βˆ’1⁄2 exp (
2
𝜏 β€² πœ‡β€²
2 )
,
𝜏0 πœ‡0 2
)
2
where
$$\tau' = \tau_0 + \tau, \qquad \mu' = \frac{\mu_0 \tau_0 + \tau x_i}{\tau_0 + \tau}.$$
The class assignment probability $p(c_i = k \mid c_{-i})$ is the same as in the notes.
In my simulation, I set the new-cluster probability 𝛼 to 0.05 and averaged 10 trials to get the final density. The estimated densities for 𝑋 and π‘Œ are shown in Figure 8. As we can see, the densities in Figure 8 are closer to Gaussian than those in Figure 7, thanks to relaxing the assumption on the number of clusters.
[Figure 8: DPM-based density estimates p_X and p_Y of the 𝑋 and π‘Œ samples.]
APPENDIX – R codes
2(d)
Xbar = mean(X)
sigmaX = mean((X-Xbar)^2)^0.5
Ybar = mean(Y)
sigmaY = mean((Y-Ybar)^2)^0.5
n = length(X)
m = length(Y)
shapeX = (n-1)/2
rateX = (n*sigmaX^2)/2
shapeY = (m-1)/2
rateY = (m*sigmaY^2)/2
delta = c()
N = 100000
for (indSam in 1:N) {
  sigma1 = (rgamma(1, shape=shapeX, rate=rateX))^(-0.5)
  mu1 = rnorm(1, mean=Xbar, sd=((sigma1^2)/n)^0.5)
  sigma2 = (rgamma(1, shape=shapeY, rate=rateY))^(-0.5)
  mu2 = rnorm(1, mean=Ybar, sd=((sigma2^2)/m)^0.5)
  delta[indSam] = mu1-mu2
}
delta_bar = mean(delta)
delta_sort = sort(delta)
C_low = delta_sort[N*0.025]
C_high = delta_sort[N*0.975]
print(delta_bar)
print(C_low)
print(C_high)
2(e)
Xbar = mean(X)
sigmaX = mean((X-Xbar)^2)^0.5
Ybar = mean(Y)
sigmaY = mean((Y-Ybar)^2)^0.5
n = length(X)
m = length(Y)
shapeX = n/2
shapeY = m/2
mu1 = Xbar
mu2 = Ybar
sigma1 = sigmaX
sigma2 = sigmaY
delta = mu1-mu2
N = 100000
for (indSam in 1:N) {
  rateX = sum((X-mu1)^2)/2
  sigma1 = (rgamma(1, shape=shapeX, rate=rateX))^(-0.5)
  mu1 = rnorm(1, mean=Xbar, sd=((sigma1^2)/n)^0.5)
  rateY = sum((Y-mu2)^2)/2
  sigma2 = (rgamma(1, shape=shapeY, rate=rateY))^(-0.5)
  mu2 = rnorm(1, mean=Ybar, sd=((sigma2^2)/m)^0.5)
  delta[indSam+1] = mu1-mu2
}
delta_bar = mean(delta)
delta_sort = sort(delta)
C_low = delta_sort[N*0.025]
C_high = delta_sort[N*0.975]
print(delta_bar)
print(C_low)
print(C_high)
2(f)
Xbar = mean(X)
sigmaX = mean((X-Xbar)^2)^0.5
Ybar = mean(Y)
sigmaY = mean((Y-Ybar)^2)^0.5
n = length(X)
m = length(Y)
sd_mu1 = sigmaX/(n^0.5)
sd_mu2 = sigmaY/(m^0.5)
sd_sigma1 = sigmaX/((2*n)^0.5)
sd_sigma2 = sigmaY/((2*m)^0.5)
shapeX = (n-1)/2
shapeY = (m-1)/2
rateX = n*(sigmaX^2)/2
rateY = m*(sigmaY^2)/2
mu1 = Xbar
mu2 = Ybar
sigma1 = sigmaX
sigma2 = sigmaY
delta = mu1-mu2
N = 100000
for (indSam in 1:N) {
  mu1_t = rnorm(1, mean=mu1, sd=sd_mu1)
  sigma1_t = rnorm(1, mean=sigma1, sd=sd_sigma1)
  mu2_t = rnorm(1, mean=mu2, sd=sd_mu2)
  sigma2_t = rnorm(1, mean=sigma2, sd=sd_sigma2)
  density = dnorm(mu1, mean=Xbar, sd=sigma1/(n^0.5), log=TRUE) +
    dnorm(mu2, mean=Ybar, sd=sigma2/(m^0.5), log=TRUE)
  density = density + dgamma(sigma1^(-2), shape=shapeX, rate=rateX, log=TRUE) +
    dgamma(sigma2^(-2), shape=shapeY, rate=rateY, log=TRUE)
  density_t = dnorm(mu1_t, mean=Xbar, sd=sigma1_t/(n^0.5), log=TRUE) +
    dnorm(mu2_t, mean=Ybar, sd=sigma2_t/(m^0.5), log=TRUE)
  density_t = density_t + dgamma(sigma1_t^(-2), shape=shapeX, rate=rateX, log=TRUE) +
    dgamma(sigma2_t^(-2), shape=shapeY, rate=rateY, log=TRUE)
  r = min(exp(density_t-density), 1)
  if (runif(1, min=0, max=1) < r) {
    mu1 = mu1_t
    sigma1 = sigma1_t
    mu2 = mu2_t
    sigma2 = sigma2_t
  }
  delta[indSam+1] = mu1-mu2
}
plot(1:(N+1), delta, type="l", xlab="iteration", ylab="delta")
delta_bar = mean(delta)
delta_sort = sort(delta)
C_low = delta_sort[N*0.025]
C_high = delta_sort[N*0.975]
print(delta_bar)
print(C_low)
print(C_high)
2(g)
minX = min(X)
maxX = max(X)
minY = min(Y)
maxY = max(Y)
stepX = (maxX-minX)/99
stepY = (maxY-minY)/99
hx = 1
hy = 1
dX = c()
dY = c()
for (ind in 1:100) {
  x = (ind-1)*stepX + minX
  dX[ind] = mean(exp(-(X-x)^2/2/(hx^2))/((2*pi)^0.5)/hx)
  y = (ind-1)*stepY + minY
  dY[ind] = mean(exp(-(Y-y)^2/2/(hy^2))/((2*pi)^0.5)/hy)
}
par(mfrow=c(1,2))
plot(((1:100)-1)*stepX+minX,dX,type="l",col="blue",xlab="X",ylab="p_X")
plot(((1:100)-1)*stepY+minY,dY,type="l",col="red",xlab="Y",ylab="p_Y")
2(h)
DPM <- function(X) {
minX = min(X)
maxX = max(X)
stepX = (maxX-minX)/99
dX = rep(0,100)
n = length(X)
numTrial = 10
thd_mu = 0.005
thd_tau = 3
for (indTrial in 1:numTrial) {
alpha = 0.05
tau = 1
Xbar = mean(X)
Xvar = mean((X-Xbar)^2)
mu0 = Xbar
tau0 = n/Xvar
ClstAsgn = rep(1,n)
numClst = 1
denom = tau0^(-0.5)*exp(tau0*(mu0^2)/2)
ClstXCnt = c()
ClstMu = c()
ClstTau = c()
for (indClst in 1:numClst) {
numX = sum(ClstAsgn==indClst)
sumX = sum(X[ClstAsgn==indClst])
ClstXCnt[indClst] = numX
ClstTau[indClst] = tau0 + tau*numX
ClstMu[indClst] = (mu0*tau0+tau*sumX)/ClstTau[indClst]
}
while (1) {
ClstXCnt0 = ClstXCnt
ClstMu0 = ClstMu
ClstTau0 = ClstTau
numClst0 = numClst
for (indSam in 1:n) {
p_old = c()
for (indClst in 1:numClst) {
numX = sum(ClstAsgn==indClst)
sumX = sum(X[ClstAsgn==indClst])
if (ClstAsgn[indSam]==indClst) {
numX = numX - 1
sumX = sumX - X[indSam]
}
p_old[indClst] = numX/(n-1)*(1-alpha)
tau1 = tau0 + tau*numX
mu1 = (mu0*tau0+tau*sumX)/tau1
tau2 = tau1 + tau
mu2 = (tau1*mu1+tau*X[indSam])/tau2
p_old[indClst] = p_old[indClst] * (tau2^(-0.5)*exp(tau2*(mu2^2)/2)) / (tau1^(-0.5)*exp(tau1*(mu1^2)/2))
}
tau_new = tau0 + tau
mu_new = (tau0*mu0+tau*X[indSam])/tau_new
p_new = alpha * tau_new^(-0.5)*exp(tau_new*(mu_new^2)/2) /
denom
temp = sort(rmultinom(1, size = 1,
prob=c(p_old,p_new)),decreasing=TRUE,index.return=TRUE)
clst = temp$ix[1]
ClstAsgn[indSam] = clst
if (clst == numClst+1) {
numClst = numClst + 1
}
}
ClstList = unique(ClstAsgn)
numClst = length(ClstList)
ClstAsgn_t = ClstAsgn
for (indClst in 1:numClst) {
ClstAsgn_t[ClstAsgn==ClstList[indClst]] = indClst
}
ClstAsgn = ClstAsgn_t
ClstXCnt = c()
ClstMu = c()
ClstTau = c()
for (indClst in 1:numClst) {
numX = sum(ClstAsgn==indClst)
sumX = sum(X[ClstAsgn==indClst])
ClstXCnt[indClst] = numX
ClstTau[indClst] = tau0 + tau*numX
ClstMu[indClst] = (mu0*tau0+tau*sumX)/ClstTau[indClst]
}
temp = sort(ClstMu, index.return=TRUE)
ClstMu = temp$x
ClstXCnt = ClstXCnt[temp$ix]
ClstTau = ClstTau[temp$ix]
if (numClst==numClst0) {
if (mean(abs(ClstMu-ClstMu0))<thd_mu && mean(abs(ClstTau-ClstTau0))<thd_tau) {
break
}
}
}
for (ind in 1:100) {
x = (ind-1)*stepX + minX
hx = (tau^0.5)/((2*pi)^0.5)*exp(-tau*(x^2)/2)
p_old = c()
for (indClst in 1:numClst) {
p_old[indClst] = ClstXCnt[indClst]/n*(1-alpha)
tau2 = ClstTau[indClst] + tau
mu2 = (ClstTau[indClst]*ClstMu[indClst]+tau*x)/tau2
p_old[indClst] = p_old[indClst] * hx * (tau2^(-0.5)*exp(tau2*(mu2^2)/2)) / (ClstTau[indClst]^(-0.5)*exp(ClstTau[indClst]*(ClstMu[indClst]^2)/2))
}
tau_new = tau0 + tau
mu_new = (tau0*mu0+tau*x)/tau_new
p_new = alpha * hx * tau_new^(-0.5)*exp(tau_new*(mu_new^2)/2) /
denom
dX[ind] = dX[ind] + sum(c(p_old, p_new))
}
}
dX = dX / numTrial
out <- list(density=dX)
return(out)
}
out <- DPM(X)
dX = out$density
out <- DPM(Y)
dY = out$density
minX = min(X)
maxX = max(X)
minY = min(Y)
maxY = max(Y)
stepX = (maxX-minX)/99
stepY = (maxY-minY)/99
par(mfrow=c(1,2))
plot(((1:100)-1)*stepX+minX,dX,type="l",col="blue",xlab="X",ylab="p_X")
plot(((1:100)-1)*stepY+minY,dY,type="l",col="red",xlab="Y",ylab="p_Y")