10-702 Assignment 5 Solution
Jiyan Pan, [email protected]

1. Classification

a) Let $h(x) = \mathrm{sign}(f(x)) = \begin{cases} 1, & f(x) \ge 0 \\ 0, & f(x) < 0 \end{cases}$. Then
$$R(f) = \mathbb{P}(Y \ne h(X)) = \int \mathbb{P}(Y \ne h(X) \mid X = x)\, dP(x) = \int \big[\mathbb{P}(Y = 0, h(x) = 1 \mid X = x) + \mathbb{P}(Y = 1, h(x) = 0 \mid X = x)\big]\, dP(x)$$
$$= \int \big[\mathbb{P}(Y = 0 \mid X = x)\,h(x) + \mathbb{P}(Y = 1 \mid X = x)\,(1 - h(x))\big]\, dP(x) = \int \big[(1 - \eta(x))h(x) + \eta(x)(1 - h(x))\big]\, dP(x) = \int \big[h(x) + \eta(x) - 2\eta(x)h(x)\big]\, dP(x).$$
If we denote $h^*(x) = \mathrm{sign}(\eta(x) - 1/2)$, the excess risk is
$$R(f) - R^* = \int \big[h(x) + \eta(x) - 2\eta(x)h(x)\big]\, dP(x) - \int \big[h^*(x) + \eta(x) - 2\eta(x)h^*(x)\big]\, dP(x) = \int (2\eta(x) - 1)\big(h^*(x) - h(x)\big)\, dP(x)$$
$$= \begin{cases} \int |2\eta(x) - 1|\,(1 - h(x))\, dP(x), & \eta(x) \ge 1/2 \\ \int |2\eta(x) - 1|\, h(x)\, dP(x), & \eta(x) < 1/2 \end{cases}.$$
When $\eta(x) \ge 1/2$ we have $h^*(x) = 1$, so whether $h(x)$ is 1 or 0, $1 - h(x) = \mathbf{1}[h(x) \ne h^*(x)]$; when $\eta(x) < 1/2$ we have $h^*(x) = 0$, so whether $h(x)$ is 1 or 0, $h(x) = \mathbf{1}[h(x) \ne h^*(x)]$. Therefore
$$R(f) - R^* = \int \mathbf{1}[h(x) \ne h^*(x)]\,|2\eta(x) - 1|\, dP(x) = \mathbb{E}\big(\mathbf{1}[h(X) \ne h^*(X)]\,|2\eta(X) - 1|\big) = \mathbb{E}\big(\mathbf{1}[\mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2)]\,|2\eta(X) - 1|\big).$$

b) The optimal $\phi$-risk is
$$R_\phi^* = \inf_{f \in \mathcal{M}} \mathbb{E}[\phi(Y f(X))] = \inf_{f \in \mathcal{M}} \mathbb{E}\big[\mathbb{E}[\phi(Y f(X)) \mid X]\big] = \inf_{f \in \mathcal{M}} \mathbb{E}\big[\eta(X)\phi(f(X)) + (1 - \eta(X))\phi(-f(X))\big]$$
$$= \mathbb{E}\Big[\inf_{f \in \mathcal{M}}\big(\eta(X)\phi(f(X)) + (1 - \eta(X))\phi(-f(X))\big)\Big] = \mathbb{E}\Big[\inf_{\alpha \in \mathbb{R}}\big(\eta(X)\phi(\alpha) + (1 - \eta(X))\phi(-\alpha)\big)\Big] = \mathbb{E}[H_\phi(\eta(X))].$$

c) (1) The conditional $\phi$-risk is
$$\eta\phi(\alpha) + (1 - \eta)\phi(-\alpha) = \eta(e^{-\alpha} - \alpha - 1) + (1 - \eta)(e^{\alpha} + \alpha - 1) = \eta e^{-\alpha} + (1 - \eta)e^{\alpha} + (1 - 2\eta)\alpha - 1.$$
Taking the derivative w.r.t. $\alpha$ and setting it to zero, we have
$$-\eta e^{-\alpha} + (1 - \eta)e^{\alpha} + 1 - 2\eta = 0, \qquad \text{that is,} \qquad e^{-\alpha}\,(e^{\alpha} + 1)\big((1 - \eta)e^{\alpha} - \eta\big) = 0.$$
As $e^{\alpha} + 1 > 0$, $(1 - \eta)e^{\alpha} - \eta$ must be 0. Therefore $\alpha^* = \log\frac{\eta}{1 - \eta}$. Substituting it back into the conditional $\phi$-risk, we obtain the optimal conditional $\phi$-risk:
$$H_\phi(\eta) = \eta e^{-\alpha^*} + (1 - \eta)e^{\alpha^*} + (1 - 2\eta)\alpha^* - 1 = (1 - 2\eta)\log\frac{\eta}{1 - \eta}.$$
To derive $H_\phi^-(\eta)$, we note that the conditional $\phi$-risk is convex in $\alpha$. If $\eta > 1/2$, the allowable range of $\alpha$ in $H_\phi^-(\eta)$ is $\alpha \le 0$, while $\alpha^* > 0$; therefore the $\alpha$ that minimizes $H_\phi^-(\eta)$ must be 0. If $\eta \le 1/2$, the allowable range is $\alpha \ge 0$, while $\alpha^* \le 0$; therefore the minimizing $\alpha$ must also be 0. In either case $H_\phi^-(\eta) = 0$. As a result,
$$\tilde{\psi}_\phi(\theta) = H_\phi^-\Big(\frac{1+\theta}{2}\Big) - H_\phi\Big(\frac{1+\theta}{2}\Big) = 0 - H_\phi\Big(\frac{1+\theta}{2}\Big) = \theta \log\frac{1+\theta}{1-\theta}.$$
As the second derivative of $\tilde{\psi}_\phi(\theta)$ is $\frac{4}{(1-\theta^2)^2} > 0$, $\tilde{\psi}_\phi(\theta)$ is convex in $\theta$. Therefore $\psi_\phi(\theta) = \tilde{\psi}_\phi(\theta) = \theta \log\frac{1+\theta}{1-\theta}$. The $H_\phi$ and $\psi_\phi$ are plotted in Figure 1.
[Figure 1: $H_\phi(\eta)$ (left panel) and $\psi_\phi(\theta)$ versus $\theta$ (right panel).]
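As a quick sanity check on the closed form of $H_\phi(\eta)$ in (1) (my own addition, not part of the original solution), the conditional $\phi$-risk can be minimized numerically over $\alpha$ and compared with $(1 - 2\eta)\log\frac{\eta}{1-\eta}$; the value $\eta = 0.7$ below is arbitrary.

# Numerical check of H_phi for phi(alpha) = exp(-alpha) - alpha - 1
phi <- function(a) exp(-a) - a - 1
cond_risk <- function(a, eta) eta * phi(a) + (1 - eta) * phi(-a)
eta <- 0.7
num <- optimize(cond_risk, interval = c(-20, 20), eta = eta)   # numerical minimum over alpha
closed <- (1 - 2 * eta) * log(eta / (1 - eta))                 # derived H_phi(eta)
print(c(numerical = num$objective, closed_form = closed))      # the two values should agree

The same check can be repeated for the other losses below by swapping the definition of phi.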
(2) The conditional $\phi$-risk is
$$\eta\phi(\alpha) + (1 - \eta)\phi(-\alpha) = \eta(1 - \alpha)_+ + (1 - \eta)(1 + \alpha)_+.$$
It is evident that the conditional $\phi$-risk can reach its minimum only when $-1 \le \alpha \le 1$. In this interval the conditional $\phi$-risk is $1 + (1 - 2\eta)\alpha$. Clearly, when $\eta > 1/2$ it reaches its minimum $2(1 - \eta)$ at $\alpha = 1$; when $\eta < 1/2$ it reaches its minimum $2\eta$ at $\alpha = -1$. Therefore $H_\phi(\eta) = 1 - |2\eta - 1|$.
It is also evident that when $\eta > 1/2$ the allowable range of $\alpha$ in $H_\phi^-(\eta)$ is $\alpha \le 0$, and the conditional $\phi$-risk reaches its minimum 1 over this range at $\alpha = 0$; when $\eta < 1/2$ the allowable range is $\alpha \ge 0$, and the minimum over this range is again 1 at $\alpha = 0$. Therefore $H_\phi^-(\eta) = 1$. Hence
$$\tilde{\psi}_\phi(\theta) = H_\phi^-\Big(\frac{1+\theta}{2}\Big) - H_\phi\Big(\frac{1+\theta}{2}\Big) = 1 - \big(1 - |1 + \theta - 1|\big) = |\theta|.$$
As $|\theta|$ is convex, $\psi_\phi(\theta) = \tilde{\psi}_\phi(\theta) = |\theta|$. The $H_\phi$ and $\psi_\phi$ are plotted in Figure 2.
[Figure 2: $H_\phi(\eta)$ (left panel) and $\psi_\phi(\theta)$ versus $\theta$ (right panel).]

(3) The conditional $\phi$-risk is
$$\eta\phi(\alpha) + (1 - \eta)\phi(-\alpha) = \eta(1 - \alpha)_+^2 + (1 - \eta)(1 + \alpha)_+^2.$$
It is evident that the conditional $\phi$-risk can reach its minimum only when $-1 \le \alpha \le 1$. In this interval the conditional $\phi$-risk is $\eta(1 - \alpha)^2 + (1 - \eta)(1 + \alpha)^2$. Taking its derivative w.r.t. $\alpha$ and setting it to zero gives $\alpha^* = 2\eta - 1$. Substituting it back into the conditional $\phi$-risk, we have
$$H_\phi(\eta) = \eta(1 - \alpha^*)^2 + (1 - \eta)(1 + \alpha^*)^2 = \eta(2 - 2\eta)^2 + (1 - \eta)(2\eta)^2 = 4\eta(1 - \eta).$$
To derive $H_\phi^-(\eta)$, we note that the conditional $\phi$-risk is convex in $\alpha$. If $\eta > 1/2$, the allowable range of $\alpha$ in $H_\phi^-(\eta)$ is $\alpha \le 0$, while $\alpha^* > 0$; therefore the minimizing $\alpha$ must be 0. If $\eta < 1/2$, the allowable range is $\alpha \ge 0$, while $\alpha^* < 0$; therefore the minimizing $\alpha$ must also be 0. In either case $H_\phi^-(\eta) = 1$. As a result,
$$\tilde{\psi}_\phi(\theta) = H_\phi^-\Big(\frac{1+\theta}{2}\Big) - H_\phi\Big(\frac{1+\theta}{2}\Big) = 1 - 4\,\frac{1+\theta}{2}\,\frac{1-\theta}{2} = \theta^2.$$
As $\theta^2$ is convex, $\psi_\phi(\theta) = \tilde{\psi}_\phi(\theta) = \theta^2$. The $H_\phi$ and $\psi_\phi$ are plotted in Figure 3.
[Figure 3: $H_\phi(\eta)$ (left panel) and $\psi_\phi(\theta)$ versus $\theta$ (right panel).]

(4) The conditional $\phi$-risk is
$$\eta\phi(\alpha) + (1 - \eta)\phi(-\alpha) = \begin{cases} \eta e^{-\alpha} + (1 - \eta)(2 - e^{-\alpha}), & \alpha > 0 \\ \eta(2 - e^{\alpha}) + (1 - \eta)e^{\alpha}, & \alpha \le 0 \end{cases}.$$
The derivative of the conditional $\phi$-risk w.r.t. $\alpha$ is
$$\begin{cases} (1 - 2\eta)e^{-\alpha}, & \alpha > 0 \\ (1 - 2\eta)e^{\alpha}, & \alpha \le 0 \end{cases}.$$
When $\eta < 1/2$, the derivative is positive for all $\alpha$; therefore $\alpha^* = -\infty$ and the conditional $\phi$-risk has infimum $2\eta$. When $\eta > 1/2$, the derivative is negative for all $\alpha$; therefore $\alpha^* = +\infty$ and the conditional $\phi$-risk has infimum $2(1 - \eta)$. Combining the results, $H_\phi(\eta) = 1 - |2\eta - 1|$.
It is evident that when $\eta > 1/2$ the allowable range of $\alpha$ in $H_\phi^-(\eta)$ is $\alpha \le 0$, and the conditional $\phi$-risk reaches its minimum 1 over this range at $\alpha = 0$; when $\eta < 1/2$ the allowable range is $\alpha \ge 0$, and the minimum over this range is again 1 at $\alpha = 0$. Therefore $H_\phi^-(\eta) = 1$. Hence
$$\tilde{\psi}_\phi(\theta) = H_\phi^-\Big(\frac{1+\theta}{2}\Big) - H_\phi\Big(\frac{1+\theta}{2}\Big) = 1 - \big(1 - |1 + \theta - 1|\big) = |\theta|.$$
As $|\theta|$ is convex, $\psi_\phi(\theta) = \tilde{\psi}_\phi(\theta) = |\theta|$. The $H_\phi$ and $\psi_\phi$ are plotted in Figure 4.
[Figure 4: $H_\phi(\eta)$ (left panel) and $\psi_\phi(\theta)$ versus $\theta$ (right panel).]

2.

a) The boxplots of the data are shown in Figure 5.
[Figure 5: boxplots of the X and Y samples.]

b) The MLEs of $\mu_1$ and $\mu_2$ are
$$\hat{\mu}_1 = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \hat{\mu}_2 = \frac{1}{m}\sum_{i=1}^{m} Y_i.$$
The MLEs of $\sigma_1$ and $\sigma_2$ are
$$\hat{\sigma}_1 = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu}_1)^2}, \qquad \hat{\sigma}_2 = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(Y_i - \hat{\mu}_2)^2}.$$
The log-likelihood of $\mu_1$ and $\sigma_1$ is
$$\ell(\mu_1, \sigma_1) = -n \log\big(\sqrt{2\pi}\,\sigma_1\big) - \frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2.$$
The Fisher information matrix is
$$I_n(\mu_1, \sigma_1) = -\begin{bmatrix} \mathbb{E}\Big(\frac{\partial^2 \ell}{\partial \mu_1^2}\Big) & \mathbb{E}\Big(\frac{\partial^2 \ell}{\partial \mu_1 \partial \sigma_1}\Big) \\ \mathbb{E}\Big(\frac{\partial^2 \ell}{\partial \sigma_1 \partial \mu_1}\Big) & \mathbb{E}\Big(\frac{\partial^2 \ell}{\partial \sigma_1^2}\Big) \end{bmatrix} = \begin{bmatrix} \frac{n}{\sigma_1^2} & 0 \\ 0 & \frac{2n}{\sigma_1^2} \end{bmatrix},$$
and its inverse is
$$J_n(\mu_1, \sigma_1) = I_n^{-1}(\mu_1, \sigma_1) = \begin{bmatrix} \frac{\sigma_1^2}{n} & 0 \\ 0 & \frac{\sigma_1^2}{2n} \end{bmatrix}.$$
Therefore the $1-\alpha$ confidence interval for $\mu_1$ is
$$C_{\mu_1} = \Big(\hat{\mu}_1 - z_{\alpha/2}\frac{\hat{\sigma}_1}{\sqrt{n}},\ \hat{\mu}_1 + z_{\alpha/2}\frac{\hat{\sigma}_1}{\sqrt{n}}\Big),$$
and similarly the $1-\alpha$ confidence interval for $\mu_2$ is
$$C_{\mu_2} = \Big(\hat{\mu}_2 - z_{\alpha/2}\frac{\hat{\sigma}_2}{\sqrt{m}},\ \hat{\mu}_2 + z_{\alpha/2}\frac{\hat{\sigma}_2}{\sqrt{m}}\Big).$$
The MLE of $\delta = \mu_1 - \mu_2$ is $\hat{\delta} = \hat{\mu}_1 - \hat{\mu}_2$. Since $\hat{\mu}_1$ and $\hat{\mu}_2$ are independent, the variance of $\hat{\delta}$ is
$$\mathrm{Var}(\hat{\delta}) = \mathrm{Var}(\hat{\mu}_1 - \hat{\mu}_2) = \mathrm{Var}(\hat{\mu}_1) + \mathrm{Var}(\hat{\mu}_2) = \frac{\hat{\sigma}_1^2}{n} + \frac{\hat{\sigma}_2^2}{m}.$$
Therefore the $1-\alpha$ confidence interval for $\delta$ is
$$C_\delta = \Big(\hat{\delta} - z_{\alpha/2}\sqrt{\frac{\hat{\sigma}_1^2}{n} + \frac{\hat{\sigma}_2^2}{m}},\ \hat{\delta} + z_{\alpha/2}\sqrt{\frac{\hat{\sigma}_1^2}{n} + \frac{\hat{\sigma}_2^2}{m}}\Big).$$
Using the actual data and setting $\alpha = 0.05$, we have
$$\hat{\mu}_1 = 1.820011, \quad \hat{\mu}_2 = 1.389454; \qquad \hat{\sigma}_1 = 0.8042076, \quad \hat{\sigma}_2 = 0.8060367;$$
$$\hat{\delta} = 0.4305566; \qquad C_\delta = (0.07127103,\ 0.7898422).$$
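The appendix lists R code for parts (d) through (h) but not for this part; a minimal sketch of the computation above, assuming the two samples are already loaded as numeric vectors X and Y, could look like this:

# Part (b): MLEs and 95% Wald confidence interval for delta = mu1 - mu2
n <- length(X); m <- length(Y)
mu1_hat <- mean(X); mu2_hat <- mean(Y)
sigma1_hat <- sqrt(mean((X - mu1_hat)^2))   # MLE uses 1/n, not 1/(n-1)
sigma2_hat <- sqrt(mean((Y - mu2_hat)^2))
delta_hat <- mu1_hat - mu2_hat
se_delta <- sqrt(sigma1_hat^2 / n + sigma2_hat^2 / m)
z <- qnorm(0.975)
print(c(delta_hat, delta_hat - z * se_delta, delta_hat + z * se_delta))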
c) The joint distribution is
$$f(\mu_1, \mu_2, \sigma_1, \sigma_2, X, Y) = f(X, Y \mid \mu_1, \mu_2, \sigma_1, \sigma_2)\, f(\mu_1, \mu_2, \sigma_1, \sigma_2)$$
$$= (2\pi)^{-\frac{n}{2}} \sigma_1^{-n} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big) \cdot (2\pi)^{-\frac{m}{2}} \sigma_2^{-m} \exp\Big(-\frac{1}{2\sigma_2^2}\sum_{i=1}^{m}(Y_i - \mu_2)^2\Big) \cdot \frac{c}{\sigma_1 \sigma_2}$$
$$= c\,(2\pi)^{-\frac{n+m}{2}} \sigma_1^{-n-1} \sigma_2^{-m-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2 - \frac{1}{2\sigma_2^2}\sum_{i=1}^{m}(Y_i - \mu_2)^2\Big).$$
Here $c$ is the constant in the prior $f(\mu_1, \mu_2, \sigma_1, \sigma_2) = c/(\sigma_1 \sigma_2)$.
The marginal distribution of the data is
$$f(X, Y) = \int f(X, Y \mid \mu_1, \mu_2, \sigma_1, \sigma_2)\, f(\mu_1, \mu_2, \sigma_1, \sigma_2)\, d\mu_1\, d\mu_2\, d\sigma_1\, d\sigma_2$$
$$= c \int (2\pi)^{-\frac{n}{2}} \sigma_1^{-n-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big) d\mu_1\, d\sigma_1 \cdot \int (2\pi)^{-\frac{m}{2}} \sigma_2^{-m-1} \exp\Big(-\frac{1}{2\sigma_2^2}\sum_{i=1}^{m}(Y_i - \mu_2)^2\Big) d\mu_2\, d\sigma_2,$$
where
$$\int (2\pi)^{-\frac{n}{2}} \sigma_1^{-n-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big) d\mu_1\, d\sigma_1 = (2\pi)^{-\frac{n}{2}} \int_0^{\infty} \sigma_1^{-n-1} \Big[\int_{-\infty}^{\infty} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big) d\mu_1\Big] d\sigma_1$$
$$= (2\pi)^{-\frac{n}{2}} \int_0^{\infty} \sigma_1^{-n-1} \exp\Big(-\frac{1}{2\sigma_1^2}\Big(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\Big)\Big) \Big[\int_{-\infty}^{\infty} \exp\Big(-\frac{n}{2\sigma_1^2}(\mu_1 - \bar{X})^2\Big) d\mu_1\Big] d\sigma_1$$
$$= (2\pi)^{-\frac{n}{2}} \int_0^{\infty} \sigma_1^{-n-1} \exp\Big(-\frac{n\hat{\sigma}_1^2}{2\sigma_1^2}\Big)\, \sigma_1 \sqrt{\frac{2\pi}{n}}\, d\sigma_1 = (2\pi)^{-\frac{n}{2}+\frac{1}{2}}\, n^{-\frac{1}{2}} \int_0^{\infty} \sigma_1^{-n} \exp\Big(-\frac{n\hat{\sigma}_1^2}{2\sigma_1^2}\Big) d\sigma_1,$$
with $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\hat{\sigma}_1^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2$. Let $\tau_1 = \sigma_1^{-2}$. Then
$$\int (2\pi)^{-\frac{n}{2}} \sigma_1^{-n-1} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big) d\mu_1\, d\sigma_1 = (2\pi)^{-\frac{n}{2}+\frac{1}{2}}\, n^{-\frac{1}{2}}\, \frac{1}{2} \int_0^{\infty} \tau_1^{\frac{n-3}{2}} \exp\Big(-\frac{n\hat{\sigma}_1^2}{2}\tau_1\Big) d\tau_1 = (2\pi)^{-\frac{n}{2}+\frac{1}{2}}\, n^{-\frac{1}{2}}\, \frac{1}{2}\, \frac{\Gamma\big(\frac{n-1}{2}\big)}{\big(\frac{n\hat{\sigma}_1^2}{2}\big)^{\frac{n-1}{2}}}.$$
Similarly,
$$\int (2\pi)^{-\frac{m}{2}} \sigma_2^{-m-1} \exp\Big(-\frac{1}{2\sigma_2^2}\sum_{i=1}^{m}(Y_i - \mu_2)^2\Big) d\mu_2\, d\sigma_2 = (2\pi)^{-\frac{m}{2}+\frac{1}{2}}\, m^{-\frac{1}{2}}\, \frac{1}{2}\, \frac{\Gamma\big(\frac{m-1}{2}\big)}{\big(\frac{m\hat{\sigma}_2^2}{2}\big)^{\frac{m-1}{2}}}.$$
Therefore,
$$f(X, Y) = c\,(2\pi)^{-\frac{n+m}{2}}\, 2\pi\,(nm)^{-\frac{1}{2}}\, \frac{1}{4}\, \frac{\Gamma\big(\frac{n-1}{2}\big)\,\Gamma\big(\frac{m-1}{2}\big)}{\big(\frac{n\hat{\sigma}_1^2}{2}\big)^{\frac{n-1}{2}} \big(\frac{m\hat{\sigma}_2^2}{2}\big)^{\frac{m-1}{2}}}.$$
The posterior density is
$$f(\mu_1, \mu_2, \sigma_1, \sigma_2 \mid X, Y) = \frac{f(X, Y \mid \mu_1, \mu_2, \sigma_1, \sigma_2)\, f(\mu_1, \mu_2, \sigma_1, \sigma_2)}{f(X, Y)}$$
$$= \frac{2\sqrt{nm}\, \big(\frac{n\hat{\sigma}_1^2}{2}\big)^{\frac{n-1}{2}} \big(\frac{m\hat{\sigma}_2^2}{2}\big)^{\frac{m-1}{2}} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2 - \frac{1}{2\sigma_2^2}\sum_{i=1}^{m}(Y_i - \mu_2)^2\Big)}{\sigma_1^{n+1}\, \sigma_2^{m+1}\, \pi\, \Gamma\big(\frac{n-1}{2}\big)\, \Gamma\big(\frac{m-1}{2}\big)}.$$
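For concreteness, the posterior density just derived can be evaluated numerically; the sketch below is my own illustration (it assumes the data vectors X and Y are loaded) and returns the log posterior at a given parameter value, working on the log scale to avoid underflow for moderate n and m.

# Log of f(mu1, mu2, sigma1, sigma2 | X, Y) as derived in part (c)
log_post <- function(mu1, mu2, sigma1, sigma2, X, Y) {
  n <- length(X); m <- length(Y)
  s1sq <- mean(X^2) - mean(X)^2     # sigma1-hat squared
  s2sq <- mean(Y^2) - mean(Y)^2     # sigma2-hat squared
  log(2) + 0.5 * log(n * m) - log(pi) - lgamma((n - 1) / 2) - lgamma((m - 1) / 2) +
    ((n - 1) / 2) * log(n * s1sq / 2) + ((m - 1) / 2) * log(m * s2sq / 2) -
    (n + 1) * log(sigma1) - (m + 1) * log(sigma2) -
    sum((X - mu1)^2) / (2 * sigma1^2) - sum((Y - mu2)^2) / (2 * sigma2^2)
}
# Example: evaluate at the MLEs
# log_post(mean(X), mean(Y), sqrt(mean((X - mean(X))^2)), sqrt(mean((Y - mean(Y))^2)), X, Y)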
d) The posterior distribution of $(\mu_1, \sigma_1)$ is
$$f(\mu_1, \sigma_1 \mid D) = \frac{f(X \mid \mu_1, \sigma_1)\, f(\mu_1, \sigma_1)}{\int f(X \mid \mu_1, \sigma_1)\, f(\mu_1, \sigma_1)\, d\mu_1\, d\sigma_1} = \frac{(2\pi)^{-\frac{n}{2}} \sigma_1^{-n-1} \exp\big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\big)}{(2\pi)^{-\frac{n}{2}+\frac{1}{2}}\, n^{-\frac{1}{2}}\, \frac{1}{2}\, \Gamma\big(\frac{n-1}{2}\big)\big/\big(\frac{n\hat{\sigma}_1^2}{2}\big)^{\frac{n-1}{2}}} = \frac{2\sqrt{n}\,\big(\frac{n\hat{\sigma}_1^2}{2}\big)^{\frac{n-1}{2}} \exp\big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\big)}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n+1}}.$$
The posterior distribution of $\sigma_1$ is
$$f(\sigma_1 \mid D) = \int_{-\infty}^{\infty} f(\mu_1, \sigma_1 \mid D)\, d\mu_1 = \frac{2\sqrt{n}\,\big(\frac{n\hat{\sigma}_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n+1}} \int_{-\infty}^{\infty} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big) d\mu_1$$
$$= \frac{2\sqrt{n}\,\big(\frac{n\hat{\sigma}_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n+1}} \exp\Big(-\frac{n\hat{\sigma}_1^2}{2\sigma_1^2}\Big)\, \sigma_1\sqrt{\frac{2\pi}{n}} = \frac{2\big(\frac{n\hat{\sigma}_1^2}{2}\big)^{\frac{n-1}{2}}}{\Gamma\big(\frac{n-1}{2}\big)\,\sigma_1^{n}} \exp\Big(-\frac{n\hat{\sigma}_1^2}{2\sigma_1^2}\Big).$$
Therefore we have
$$f(\mu_1 \mid \sigma_1, D) = \frac{f(\mu_1, \sigma_1 \mid D)}{f(\sigma_1 \mid D)} = \frac{\sqrt{n}}{\sqrt{2\pi}\,\sigma_1} \exp\Big(\frac{n\hat{\sigma}_1^2}{2\sigma_1^2}\Big) \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big) = \frac{1}{\sqrt{2\pi}\,\sigma_1/\sqrt{n}} \exp\Big(-\frac{n}{2\sigma_1^2}(\mu_1 - \bar{X})^2\Big) = N\Big(\mu_1;\ \bar{X},\ \frac{\sigma_1^2}{n}\Big),$$
using $\sum_{i=1}^{n}(X_i - \mu_1)^2 = n\hat{\sigma}_1^2 + n(\mu_1 - \bar{X})^2$.
Let $\tau_1 = \sigma_1^{-2}$. We can obtain $f_{\tau_1}(\tau_1 \mid D)$ from $f(\sigma_1 \mid D)$ as follows:
$$f_{\tau_1}(\tau_1 \mid D) = \frac{d}{d\tau_1}\mathbb{P}(\tau_1' \le \tau_1 \mid D) = \frac{d}{d\tau_1}\mathbb{P}\Big(\frac{1}{\sigma_1'^2} \le \tau_1 \,\Big|\, D\Big) = \frac{d}{d\tau_1}\mathbb{P}\Big(\sigma_1' \ge \frac{1}{\sqrt{\tau_1}} \,\Big|\, D\Big) = \frac{d}{d\tau_1}\Big(1 - \mathbb{P}\Big(\sigma_1' \le \frac{1}{\sqrt{\tau_1}} \,\Big|\, D\Big)\Big) = \frac{1}{2}\,\tau_1^{-\frac{3}{2}}\, f_{\sigma_1}\Big(\frac{1}{\sqrt{\tau_1}} \,\Big|\, D\Big)$$
$$= \frac{1}{2}\,\tau_1^{-\frac{3}{2}} \cdot \frac{2\big(\frac{n\hat{\sigma}_1^2}{2}\big)^{\frac{n-1}{2}}}{\Gamma\big(\frac{n-1}{2}\big)}\, \tau_1^{\frac{n}{2}} \exp\Big(-\frac{n\hat{\sigma}_1^2}{2}\tau_1\Big) = \frac{\big(\frac{n\hat{\sigma}_1^2}{2}\big)^{\frac{n-1}{2}}}{\Gamma\big(\frac{n-1}{2}\big)}\, \tau_1^{\frac{n-1}{2}-1} \exp\Big(-\frac{n\hat{\sigma}_1^2}{2}\tau_1\Big) = \Gamma\Big(\tau_1;\ \frac{n-1}{2},\ \frac{n\hat{\sigma}_1^2}{2}\Big),$$
where $\Gamma(x; \alpha, \beta)$ is the gamma distribution with shape parameter $\alpha$ and rate parameter $\beta$. Therefore,
$$f(\sigma_1 \mid D) = \Gamma\Big(\sigma_1^{-2};\ \frac{n-1}{2},\ \frac{n\hat{\sigma}_1^2}{2}\Big).$$
That is, to sample from $f(\sigma_1 \mid D)$ we can draw $z$ from the gamma distribution $\Gamma\big(z;\ \frac{n-1}{2},\ \frac{n\hat{\sigma}_1^2}{2}\big)$ and then compute $\sigma_1 = z^{-1/2}$. Similarly,
$$f(\sigma_2 \mid D) = \Gamma\Big(\sigma_2^{-2};\ \frac{m-1}{2},\ \frac{m\hat{\sigma}_2^2}{2}\Big), \qquad f(\mu_2 \mid \sigma_2, D) = N\Big(\mu_2;\ \bar{Y},\ \frac{\sigma_2^2}{m}\Big).$$
Therefore,
$$f(\mu_1, \mu_2, \sigma_1, \sigma_2 \mid D) = f(\mu_1 \mid \sigma_1, D)\, f(\sigma_1 \mid D)\, f(\mu_2 \mid \sigma_2, D)\, f(\sigma_2 \mid D)$$
$$= N\Big(\mu_1;\ \bar{X},\ \frac{\sigma_1^2}{n}\Big) \cdot \Gamma\Big(\sigma_1^{-2};\ \frac{n-1}{2},\ \frac{n\hat{\sigma}_1^2}{2}\Big) \cdot N\Big(\mu_2;\ \bar{Y},\ \frac{\sigma_2^2}{m}\Big) \cdot \Gamma\Big(\sigma_2^{-2};\ \frac{m-1}{2},\ \frac{m\hat{\sigma}_2^2}{2}\Big).$$
To get the Bayes estimate of $\delta = \mu_1 - \mu_2$, we need to compute
$$\hat{\delta} = \int (\mu_1 - \mu_2)\, f(\mu_1, \mu_2 \mid D)\, d\mu_1\, d\mu_2 = \int (\mu_1 - \mu_2)\Big(\int f(\mu_1, \mu_2, \sigma_1, \sigma_2 \mid D)\, d\sigma_1\, d\sigma_2\Big) d\mu_1\, d\mu_2 = \int (\mu_1 - \mu_2)\, f(\mu_1, \mu_2, \sigma_1, \sigma_2 \mid D)\, d\mu_1\, d\mu_2\, d\sigma_1\, d\sigma_2.$$
The Monte Carlo method with direct sampling draws samples $(\mu_1^{(i)}, \mu_2^{(i)}, \sigma_1^{(i)}, \sigma_2^{(i)})$, $i = 1, \ldots, N$, according to the factorized expression of $f(\mu_1, \mu_2, \sigma_1, \sigma_2 \mid D)$ and approximates $\hat{\delta}$ as
$$\frac{1}{N}\sum_{i=1}^{N}\big(\mu_1^{(i)} - \mu_2^{(i)}\big).$$
I generated 100,000 samples with direct sampling, and the estimate of $\hat{\delta}$ is 0.4302737, very close to the MLE. The 95 percent posterior interval is obtained by sorting the simulated samples and taking the 0.025 and 0.975 quantiles; the resulting interval is (0.05256353, 0.8063011), also close to the 95% confidence interval from the MLE.

e) The posterior distribution of $\mu_1$ is
$$f(\mu_1 \mid D) = \int_0^{\infty} f(\mu_1, \sigma_1 \mid D)\, d\sigma_1 = \frac{2\sqrt{n}\,\big(\frac{n\hat{\sigma}_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)} \int_0^{\infty} \frac{1}{\sigma_1^{n+1}} \exp\Big(-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n}(X_i - \mu_1)^2\Big) d\sigma_1.$$
Letting $\tau_1 = \sigma_1^{-2}$, we have
$$f(\mu_1 \mid D) = \frac{\sqrt{n}\,\big(\frac{n\hat{\sigma}_1^2}{2}\big)^{\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)} \int_0^{\infty} \tau_1^{\frac{n}{2}-1} \exp\Big(-\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\tau_1\Big) d\tau_1 = \frac{\sqrt{n}\,\big(\frac{n\hat{\sigma}_1^2}{2}\big)^{\frac{n-1}{2}}\, \Gamma\big(\frac{n}{2}\big)}{\sqrt{2\pi}\,\Gamma\big(\frac{n-1}{2}\big)\,\big(\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)^{\frac{n}{2}}}.$$
Therefore,
$$f(\sigma_1 \mid \mu_1, D) = \frac{f(\mu_1, \sigma_1 \mid D)}{f(\mu_1 \mid D)} = \frac{\big(\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)^{\frac{n}{2}}}{\Gamma\big(\frac{n}{2}\big)}\, \sigma_1^{-n-1} \exp\Big(-\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2\sigma_1^2}\Big).$$
Similar to the derivation of $f(\sigma_1 \mid D)$ in the last sub-problem, we can express $f(\sigma_1 \mid \mu_1, D)$ through the substitution $\tau_1 = \sigma_1^{-2}$:
$$f_{\tau_1}(\tau_1 \mid \mu_1, D) = \frac{1}{2}\,\tau_1^{-\frac{3}{2}}\, f_{\sigma_1}\Big(\frac{1}{\sqrt{\tau_1}} \,\Big|\, \mu_1, D\Big) = \frac{\big(\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)^{\frac{n}{2}}}{\Gamma\big(\frac{n}{2}\big)}\, \tau_1^{\frac{n}{2}-1} \exp\Big(-\frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\tau_1\Big) = \Gamma\Big(\tau_1;\ \frac{n}{2},\ \frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\Big).$$
Therefore,
$$f(\sigma_1 \mid \mu_1, D) = \Gamma\Big(\sigma_1^{-2};\ \frac{n}{2},\ \frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\Big), \qquad \text{and similarly} \qquad f(\sigma_2 \mid \mu_2, D) = \Gamma\Big(\sigma_2^{-2};\ \frac{m}{2},\ \frac{\sum_{i=1}^{m}(Y_i - \mu_2)^2}{2}\Big).$$
Having obtained $f(\mu_1 \mid \sigma_1, D) = N\big(\mu_1;\ \bar{X},\ \frac{\sigma_1^2}{n}\big)$, $f(\sigma_1 \mid \mu_1, D) = \Gamma\big(\sigma_1^{-2};\ \frac{n}{2},\ \frac{\sum_{i=1}^{n}(X_i - \mu_1)^2}{2}\big)$, $f(\mu_2 \mid \sigma_2, D) = N\big(\mu_2;\ \bar{Y},\ \frac{\sigma_2^2}{m}\big)$, and $f(\sigma_2 \mid \mu_2, D) = \Gamma\big(\sigma_2^{-2};\ \frac{m}{2},\ \frac{\sum_{i=1}^{m}(Y_i - \mu_2)^2}{2}\big)$, we can simulate samples $(\mu_1^{(i)}, \mu_2^{(i)}, \sigma_1^{(i)}, \sigma_2^{(i)})$, $i = 1, \ldots, N$, using Gibbs sampling. Simulating 100,000 samples with Gibbs sampling gives an estimate of $\hat{\delta}$ of 0.4301146, very close to the MLE, and a 95% posterior interval of (0.0558426, 0.8073434), also close to the 95% confidence interval from the MLE.
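The change of variables behind "draw $z$ from the gamma distribution, then set $\sigma = z^{-1/2}$" used in parts (d) and (e) is easy to verify by simulation. The sketch below is my own check (the values of n and sigma1hat are arbitrary): it transforms gamma draws and overlays the closed-form density $f(\sigma_1 \mid D)$ from part (d) on their histogram.

# Monte Carlo check: sigma1 = z^(-1/2) with z ~ Gamma((n-1)/2, n*sigma1hat^2/2)
set.seed(1)
n <- 25; sigma1hat <- 0.8                      # arbitrary illustration values
z <- rgamma(50000, shape = (n - 1) / 2, rate = n * sigma1hat^2 / 2)
sigma1 <- z^(-0.5)
f_sigma1 <- function(s) {                      # closed-form f(sigma1 | D) from part (d)
  2 * (n * sigma1hat^2 / 2)^((n - 1) / 2) / gamma((n - 1) / 2) *
    s^(-n) * exp(-n * sigma1hat^2 / (2 * s^2))
}
hist(sigma1, breaks = 100, freq = FALSE, main = "", xlab = "sigma1")
curve(f_sigma1(x), add = TRUE, col = "blue")   # curve should track the histogram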
f) The proposal distribution I used in the Metropolis-Hastings sampler is Gaussian:
$$q(\mu_1', \mu_2', \sigma_1', \sigma_2' \mid \mu_1, \mu_2, \sigma_1, \sigma_2) = N\Big(\mu_1';\ \mu_1,\ \frac{\hat{\sigma}_1^2}{n}\Big) \cdot N\Big(\mu_2';\ \mu_2,\ \frac{\hat{\sigma}_2^2}{m}\Big) \cdot N\Big(\sigma_1';\ \sigma_1,\ \frac{\hat{\sigma}_1^2}{2n}\Big) \cdot N\Big(\sigma_2';\ \sigma_2,\ \frac{\hat{\sigma}_2^2}{2m}\Big).$$
The trace of the 100,001 samples from the Metropolis-Hastings algorithm is shown in Figure 6; the Markov chain mixed quite well. The estimate of $\hat{\delta}$ is 0.4338128, very close to the MLE, and the 95% posterior interval is (0.045316, 0.8312325), also close to the 95% confidence interval from the MLE.
[Figure 6: trace plot of $\delta$ over 100,001 Metropolis-Hastings iterations.]

g) The kernel density estimates of the densities of $X$ and $Y$ are shown in Figure 7. I used a Gaussian kernel with bandwidth equal to 1; this bandwidth gives a good normal-looking shape. The range of acceptable bandwidths is roughly 0.7 to 1. Setting the bandwidth too small would overfit the data, resulting in multiple modes.
[Figure 7: kernel density estimates $p_X$ (left) and $p_Y$ (right).]
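As a cross-check on the hand-rolled kernel density estimator in the appendix (my own addition, not part of the original solution), R's built-in density() with a Gaussian kernel and the same bandwidths should produce essentially the same curves and makes the bandwidth comparison easy, again assuming X and Y are loaded.

# Built-in KDE with the bandwidths discussed above (1 solid, 0.7 dashed)
par(mfrow = c(1, 2))
plot(density(X, bw = 1, kernel = "gaussian"), col = "blue", main = "", xlab = "X")
lines(density(X, bw = 0.7, kernel = "gaussian"), col = "blue", lty = 2)
plot(density(Y, bw = 1, kernel = "gaussian"), col = "red", main = "", xlab = "Y")
lines(density(Y, bw = 0.7, kernel = "gaussian"), col = "red", lty = 2)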
h) I use a Gaussian to model both the conditional distribution of the data and the prior distribution of the parameters. The conditional distribution of the data for cluster $c$ is
$$f(x_{1:n} \mid \mu, \tau) = \Big(\frac{\sqrt{\tau}}{\sqrt{2\pi}}\Big)^{n} \exp\Big(-\frac{\tau}{2}\sum_{i=1}^{n}(x_i - \mu)^2\Big),$$
where $\tau$ is fixed at 1. In the discussion below, $f(x_{1:n} \mid \mu, \tau)$ is abbreviated as $f(x_{1:n} \mid \mu)$. The prior distribution of the parameter $\mu$ is
$$f(\mu \mid \mu_0, \tau_0) = \frac{\sqrt{\tau_0}}{\sqrt{2\pi}} \exp\Big(-\frac{\tau_0}{2}(\mu - \mu_0)^2\Big).$$
It can be shown that the posterior distribution of the parameter $\mu$ is
$$f(\mu \mid x_{1:n}, \mu_0, \tau_0) \propto f(x_{1:n} \mid \mu)\, f(\mu \mid \mu_0, \tau_0) \propto \exp\Big(-\frac{\tau_c}{2}(\mu - \mu_c)^2\Big),$$
where
$$\tau_c = \tau_0 + n_c \tau \qquad \text{and} \qquad \mu_c = \frac{\mu_0 \tau_0 + n_c \tau \bar{x}_c}{\tau_0 + n_c \tau},$$
with $n_c$ the number of training samples assigned to cluster $c$ and $\bar{x}_c$ the mean of those samples. Therefore, in Gibbs sampling, the likelihood of $x_i$ being assigned to cluster $c$ given the cluster assignments of the other samples is
$$f(x_i \mid x_{-i}, c_{-i}, c_i = c, \mu_0, \tau_0) = \int f(x_i \mid \mu)\, f(\mu \mid x_{-i}, \mu_0, \tau_0)\, d\mu = \int \frac{\sqrt{\tau}}{\sqrt{2\pi}} \exp\Big(-\frac{\tau(x_i - \mu)^2}{2}\Big) \cdot \frac{\sqrt{\tau_c}}{\sqrt{2\pi}} \exp\Big(-\frac{\tau_c(\mu - \mu_c)^2}{2}\Big) d\mu$$
$$= \frac{\sqrt{\tau}}{\sqrt{2\pi}}\, e^{-\frac{\tau x_i^2}{2}}\, \frac{(\tau_c')^{-1/2} \exp\big(\frac{\tau_c' {\mu_c'}^2}{2}\big)}{(\tau_c)^{-1/2} \exp\big(\frac{\tau_c \mu_c^2}{2}\big)},$$
where $\tau_c$ and $\mu_c$ are computed as before except that sample $x_i$ is excluded from $n_c$ and $\bar{x}_c$, and
$$\tau_c' = \tau_c + \tau, \qquad \mu_c' = \frac{\tau_c \mu_c + \tau x_i}{\tau_c + \tau}.$$
If $x_i$ is assigned to a new cluster, the likelihood is
$$f(x_i \mid \mu_0, \tau_0) = \int f(x_i \mid \mu)\, f(\mu \mid \mu_0, \tau_0)\, d\mu = \frac{\sqrt{\tau}}{\sqrt{2\pi}}\, e^{-\frac{\tau x_i^2}{2}}\, \frac{(\tau')^{-1/2} \exp\big(\frac{\tau' {\mu'}^2}{2}\big)}{(\tau_0)^{-1/2} \exp\big(\frac{\tau_0 \mu_0^2}{2}\big)},$$
where
$$\tau' = \tau_0 + \tau \qquad \text{and} \qquad \mu' = \frac{\tau_0 \mu_0 + \tau x_i}{\tau_0 + \tau}.$$
The class assignment probability $\mathbb{P}(c_i = c \mid c_{-i})$ is the same as in the notes. In my simulation, I set the new-cluster probability $\alpha$ to 0.05 and average 10 trials to get the final density. The estimated densities for $X$ and $Y$ are shown in Figure 8. As we can see, the densities in Figure 8 are closer to Gaussian than those in Figure 7, because the assumption on the number of clusters has been relaxed.
[Figure 8: Dirichlet process mixture density estimates $p_X$ (left) and $p_Y$ (right).]

APPENDIX - R codes

2(d)

# Part (d): direct sampling from the factorized posterior
Xbar = mean(X)
sigmaX = mean((X-Xbar)^2)^0.5
Ybar = mean(Y)
sigmaY = mean((Y-Ybar)^2)^0.5
n = length(X)
m = length(Y)
shapeX = (n-1)/2
rateX = (n*sigmaX^2)/2
shapeY = (m-1)/2
rateY = (m*sigmaY^2)/2
delta = c()
N = 100000
for (indSam in 1:N) {
  sigma1 = (rgamma(1, shape=shapeX, rate=rateX))^(-0.5)
  mu1 = rnorm(1, mean=Xbar, sd=((sigma1^2)/n)^0.5)
  sigma2 = (rgamma(1, shape=shapeY, rate=rateY))^(-0.5)
  mu2 = rnorm(1, mean=Ybar, sd=((sigma2^2)/m)^0.5)
  delta[indSam] = mu1-mu2
}
delta_bar = mean(delta)
delta_sort = sort(delta)
C_low = delta_sort[N*0.025]
C_high = delta_sort[N*0.975]
print(delta_bar)
print(C_low)
print(C_high)

2(e)

# Part (e): Gibbs sampling
Xbar = mean(X)
sigmaX = mean((X-Xbar)^2)^0.5
Ybar = mean(Y)
sigmaY = mean((Y-Ybar)^2)^0.5
n = length(X)
m = length(Y)
shapeX = n/2
shapeY = m/2
mu1 = Xbar
mu2 = Ybar
sigma1 = sigmaX
sigma2 = sigmaY
delta = mu1-mu2
N = 100000
for (indSam in 1:N) {
  rateX = sum((X-mu1)^2)/2
  sigma1 = (rgamma(1, shape=shapeX, rate=rateX))^(-0.5)
  mu1 = rnorm(1, mean=Xbar, sd=((sigma1^2)/n)^0.5)
  rateY = sum((Y-mu2)^2)/2
  sigma2 = (rgamma(1, shape=shapeY, rate=rateY))^(-0.5)
  mu2 = rnorm(1, mean=Ybar, sd=((sigma2^2)/m)^0.5)
  delta[indSam+1] = mu1-mu2
}
delta_bar = mean(delta)
delta_sort = sort(delta)
C_low = delta_sort[N*0.025]
C_high = delta_sort[N*0.975]
print(delta_bar)
print(C_low)
print(C_high)

2(f)

# Part (f): Metropolis-Hastings sampling
Xbar = mean(X)
sigmaX = mean((X-Xbar)^2)^0.5
Ybar = mean(Y)
sigmaY = mean((Y-Ybar)^2)^0.5
n = length(X)
m = length(Y)
sd_mu1 = sigmaX/(n^0.5)
sd_mu2 = sigmaY/(m^0.5)
sd_sigma1 = sigmaX/((2*n)^0.5)
sd_sigma2 = sigmaY/((2*m)^0.5)
shapeX = (n-1)/2
shapeY = (m-1)/2
rateX = n*(sigmaX^2)/2
rateY = m*(sigmaY^2)/2
mu1 = Xbar
mu2 = Ybar
sigma1 = sigmaX
sigma2 = sigmaY
delta = mu1-mu2
N = 100000
for (indSam in 1:N) {
  mu1_t = rnorm(1, mean=mu1, sd=sd_mu1)
  sigma1_t = rnorm(1, mean=sigma1, sd=sd_sigma1)
  mu2_t = rnorm(1, mean=mu2, sd=sd_mu2)
  sigma2_t = rnorm(1, mean=sigma2, sd=sd_sigma2)
  density = dnorm(mu1, mean=Xbar, sd=sigma1/(n^0.5), log=TRUE) +
    dnorm(mu2, mean=Ybar, sd=sigma2/(m^0.5), log=TRUE)
  density = density + dgamma(sigma1^(-2), shape=shapeX, rate=rateX, log=TRUE) +
    dgamma(sigma2^(-2), shape=shapeY, rate=rateY, log=TRUE)
  density_t = dnorm(mu1_t, mean=Xbar, sd=sigma1_t/(n^0.5), log=TRUE) +
    dnorm(mu2_t, mean=Ybar, sd=sigma2_t/(m^0.5), log=TRUE)
  density_t = density_t + dgamma(sigma1_t^(-2), shape=shapeX, rate=rateX, log=TRUE) +
    dgamma(sigma2_t^(-2), shape=shapeY, rate=rateY, log=TRUE)
  r = min(exp(density_t-density), 1)
  if (runif(1, min=0, max=1) < r) {
    mu1 = mu1_t
    sigma1 = sigma1_t
    mu2 = mu2_t
    sigma2 = sigma2_t
  }
  delta[indSam+1] = mu1-mu2
}
plot(1:(N+1), delta, type="l", xlab="iteration", ylab="delta")
delta_bar = mean(delta)
delta_sort = sort(delta)
C_low = delta_sort[N*0.025]
C_high = delta_sort[N*0.975]
print(delta_bar)
print(C_low)
print(C_high)

2(g)

# Part (g): kernel density estimation
minX = min(X)
maxX = max(X)
minY = min(Y)
maxY = max(Y)
stepX = (maxX-minX)/99
stepY = (maxY-minY)/99
hx = 1
hy = 1
dX = c()
dY = c()
for (ind in 1:100) {
  x = (ind-1)*stepX + minX
  dX[ind] = mean(exp(-(X-x)^2/2/(hx^2))/((2*pi)^0.5)/hx)
  y = (ind-1)*stepY + minY
  dY[ind] = mean(exp(-(Y-y)^2/2/(hy^2))/((2*pi)^0.5)/hy)
}
par(mfrow=c(1,2))
plot(((1:100)-1)*stepX+minX,dX,type="l",col="blue",xlab="X",ylab="p_X")
plot(((1:100)-1)*stepY+minY,dY,type="l",col="red",xlab="Y",ylab="p_Y")

2(h)

# Part (h): Dirichlet process mixture density estimate, averaged over several Gibbs runs
DPM <- function(X) {
  minX = min(X)
  maxX = max(X)
  stepX = (maxX-minX)/99
  dX = rep(0,100)
  n = length(X)
  numTrial = 10
  thd_mu = 0.005
  thd_tau = 3
  for (indTrial in 1:numTrial) {
    alpha = 0.05
    tau = 1
    Xbar = mean(X)
    Xvar = mean((X-Xbar)^2)
    mu0 = Xbar
    tau0 = n/Xvar
    ClstAsgn = rep(1,n)
    numClst = 1
    denom = tau0^(-0.5)*exp(tau0*(mu0^2)/2)
    ClstXCnt = c()
    ClstMu = c()
    ClstTau = c()
    for (indClst in 1:numClst) {
      numX = sum(ClstAsgn==indClst)
      sumX = sum(X[ClstAsgn==indClst])
      ClstXCnt[indClst] = numX
      ClstTau[indClst] = tau0 + tau*numX
      ClstMu[indClst] = (mu0*tau0+tau*sumX)/ClstTau[indClst]
    }
    # Gibbs sweeps over cluster assignments until the cluster parameters stabilize
    while (1) {
      ClstXCnt0 = ClstXCnt
      ClstMu0 = ClstMu
      ClstTau0 = ClstTau
      numClst0 = numClst
      for (indSam in 1:n) {
        p_old = c()
        for (indClst in 1:numClst) {
          numX = sum(ClstAsgn==indClst)
          sumX = sum(X[ClstAsgn==indClst])
          if (ClstAsgn[indSam]==indClst) {
            numX = numX - 1
            sumX = sumX - X[indSam]
          }
          p_old[indClst] = numX/(n-1)*(1-alpha)
          tau1 = tau0 + tau*numX
          mu1 = (mu0*tau0+tau*sumX)/tau1
          tau2 = tau1 + tau
          mu2 = (tau1*mu1+tau*X[indSam])/tau2
          p_old[indClst] = p_old[indClst] * (tau2^(-0.5)*exp(tau2*(mu2^2)/2)) / (tau1^(-0.5)*exp(tau1*(mu1^2)/2))
        }
        tau_new = tau0 + tau
        mu_new = (tau0*mu0+tau*X[indSam])/tau_new
        p_new = alpha * tau_new^(-0.5)*exp(tau_new*(mu_new^2)/2) / denom
        temp = sort(rmultinom(1, size = 1, prob=c(p_old,p_new)), decreasing=TRUE, index.return=TRUE)
        clst = temp$ix[1]
        ClstAsgn[indSam] = clst
        if (clst == numClst+1) {
          numClst = numClst + 1
        }
      }
      # Relabel clusters and recompute per-cluster counts and posterior parameters
      ClstList = unique(ClstAsgn)
      numClst = length(ClstList)
      ClstAsgn_t = ClstAsgn
      for (indClst in 1:numClst) {
        ClstAsgn_t[ClstAsgn==ClstList[indClst]] = indClst
      }
      ClstAsgn = ClstAsgn_t
      ClstXCnt = c()
      ClstMu = c()
      ClstTau = c()
      for (indClst in 1:numClst) {
        numX = sum(ClstAsgn==indClst)
        sumX = sum(X[ClstAsgn==indClst])
        ClstXCnt[indClst] = numX
        ClstTau[indClst] = tau0 + tau*numX
        ClstMu[indClst] = (mu0*tau0+tau*sumX)/ClstTau[indClst]
      }
      temp = sort(ClstMu, index.return=TRUE)
      ClstMu = temp$x
      ClstXCnt = ClstXCnt[temp$ix]
      ClstTau = ClstTau[temp$ix]
      if (numClst==numClst0) {
        if (mean(abs(ClstMu-ClstMu0))<thd_mu && mean(abs(ClstTau-ClstTau0))<thd_tau) {
          break
        }
      }
    }
    # Accumulate the predictive density on a grid of 100 points
    for (ind in 1:100) {
      x = (ind-1)*stepX + minX
      hx = (tau^0.5)/((2*pi)^0.5)*exp(-tau*(x^2)/2)
      p_old = c()
      for (indClst in 1:numClst) {
        p_old[indClst] = ClstXCnt[indClst]/n*(1-alpha)
        tau2 = ClstTau[indClst] + tau
        mu2 = (ClstTau[indClst]*ClstMu[indClst]+tau*x)/tau2
        p_old[indClst] = p_old[indClst] * hx * (tau2^(-0.5)*exp(tau2*(mu2^2)/2)) / (ClstTau[indClst]^(-0.5)*exp(ClstTau[indClst]*(ClstMu[indClst]^2)/2))
      }
      tau_new = tau0 + tau
      mu_new = (tau0*mu0+tau*x)/tau_new
      p_new = alpha * hx * tau_new^(-0.5)*exp(tau_new*(mu_new^2)/2) / denom
      dX[ind] = dX[ind] + sum(c(p_old, p_new))
    }
  }
  dX = dX / numTrial
  out <- list(density=dX)
  return(out)
}
out <- DPM(X)
dX = out$density
out <- DPM(Y)
dY = out$density
minX = min(X)
maxX = max(X)
minY = min(Y)
maxY = max(Y)
stepX = (maxX-minX)/99
stepY = (maxY-minY)/99
par(mfrow=c(1,2))
plot(((1:100)-1)*stepX+minX,dX,type="l",col="blue",xlab="X",ylab="p_X")
plot(((1:100)-1)*stepY+minY,dY,type="l",col="red",xlab="Y",ylab="p_Y")