On the Distribution of Posterior Probability in Bayesian Inference with a Large Number of Observations

Akira Date
Faculty of Engineering, University of Miyazaki, Miyazaki 889-2192
[email protected]

Abstract: Probabilistic generative models are used in many applications of image analysis and speech recognition. In general, there is an observation vector ~y, a state vector ~x, and a joint dependency structure among them. The object of interest is, given ~y, the most likely configuration ~xMAP and its posterior probability. In practice, the exact value of the posterior probability of ~xMAP is impossible to obtain, especially when there is a large number of observed variables. Here we analyzed the distribution of posterior probabilities of ~xMAP when there are n = 200 to 1000 observations. We used a probabilistic model with a simple linear dependency structure in which the exact value of the posterior probability of ~xMAP is obtainable. Computer experiments show that even an identical model generates a variety of posterior distributions, which suggests difficulties in understanding the meaning of posterior probability. Finally, we propose a method to assess the confidence of the estimator ~xMAP by computing P(~x'|~y) for configurations ~x' that are neighbors of ~xMAP.

Keywords: Bayesian inference, maximum a posteriori estimator, hidden Markov model, dependency graph

1 Introduction

In Bayesian formulations, we formalize the relevant prior information as a probability distribution, say P(~x). Then, given the prior distribution P(~x) and an "observation" ~y, we form the posterior distribution P(~x|~y): the conditional distribution given what is observed. Given an observation, we seek the most likely configuration ~xMAP that maximizes the posterior distribution.

Suppose we can compute P(~xMAP|~y), the posterior probability of ~xMAP, although in practice it is difficult to compute the exact value of the posterior probability when there is a large number of observed variables. What does P(~xMAP|~y) signal to us? Even if we know the posterior probability of ~xMAP, its meaning is problem-dependent. When, for example, P(~xMAP|~y) = 0.01, can we say that ~xMAP is not so believable? The answer depends on the problem. Even if we fix the problem setting, the question is difficult to answer: it seems that the larger the configuration space of ~x, the smaller P(~xMAP|~y). However, the value of the posterior probability, of course, signals something. When, for example, the posterior probability of ~xMAP is 0.98, it signals that ~xMAP is the most likely interpretation with high confidence. In general, how can we use the value of the posterior probability effectively for processing in the next stage?

In this paper, we used a probabilistic model with a simple linear dependency structure in which the exact value of the posterior probability of the most likely configuration ~xMAP is obtainable. We then analyzed the distribution of posterior probabilities when there are n = 200 to 1000 observations. We performed a large number of computer experiments to obtain the distribution of P(~xMAP|~y). The results suggest that there is a variety of distributions of P(~xMAP|~y). We computed not only P(~xMAP|~y) but also P(~x'|~y) for neighbors ~x' of ~xMAP. The presentation will be nontechnical and by example, highlighting the meaning of posterior probability: what the posterior probability signals, and whether a low posterior probability signals that the estimator is relatively not believable.

2 Bayesian Inference

2.1 Linear dependency graph

As a simple example, suppose that X_1, X_2, . . .
is a first-order Markov process with state space {0, 1}, initial probability distribution

p_0 \equiv \mathrm{Prob}(X_1 = 0) = 0.5,    (1)
p_1 \equiv \mathrm{Prob}(X_1 = 1) = 0.5,    (2)

and transition probability matrix

P \equiv \begin{pmatrix} p_{00} & p_{10} \\ p_{01} & p_{11} \end{pmatrix} \equiv \begin{pmatrix} \Pr(X_{i+1}=0 \mid X_i=0) & \Pr(X_{i+1}=0 \mid X_i=1) \\ \Pr(X_{i+1}=1 \mid X_i=0) & \Pr(X_{i+1}=1 \mid X_i=1) \end{pmatrix} = \begin{pmatrix} 0.99 & 0.03 \\ 0.01 & 0.97 \end{pmatrix}.

Suppose we observe a corrupted signal Y_1, Y_2, . . ., Y_n, where

Y_i = X_i + \eta_i    (3)

with \eta_i iid N(0, \sigma^2), \sigma = 0.7, so that

\mathrm{Prob}(Y_i < y_i \mid x_i) = \int_{-\infty}^{y_i} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(y - x_i)^2}{2\sigma^2} \right\} dy.    (4)

The goal is to estimate X, where the estimation will be based upon the corrupted observations and upon the model, i.e., the prior distribution. Y itself is not Markov. Nevertheless, the conditional distribution of X given Y remains simple: X given Y is still first-order Markov. In general, the combination of a rich marginal structure for Y and a simple posterior structure for X makes the hidden Markov process a common modeling tool [1].

Figure 1: a, Linear dependency graph. The X nodes represent a source sequence of 0's and 1's generated from a Markov chain, and the Y nodes represent a noisy observation. b, A source sequence ~x (represented as a line), an observation ~y (points), and the maximum a posteriori estimator ~xMAP (plotted as ~xMAP + 3), together with its posterior probability: P(~xMAP|~y) = 0.00003079 and P(~xTRUE|~y) = 0.00000015. Y_i = X_i + Z_i, Z_i ~ N(0, \sigma^2), \sigma = 0.7, 1 \le i \le n = 500. There were 17/500 discrepancies between ~xMAP and ~xTRUE.

2.2 Most likely configurations

Given an observation ~y = (y_1, . . ., y_n)^T, we seek the ~x = (x_1, . . ., x_n)^T that maximizes the posterior distribution (the so-called MAP estimator),

\vec{x}_{\mathrm{MAP}} = \arg\max_{\vec{x}} \mathrm{Prob}(\vec{x} \mid \vec{y}),    (5)

where

\mathrm{Prob}(\vec{x} \mid \vec{y}) = \frac{\mathrm{Prob}(\vec{x}, \vec{y})}{\mathrm{Prob}(\vec{y})}.    (6)

Since Prob(~y) is a positive constant, the goal is to compute

\vec{x}_{\mathrm{MAP}} = \arg\max_{\vec{x}} \mathrm{Prob}(\vec{x}, \vec{y}).    (7)

There are legendary practical problems with the actual implementation of Bayesian methods. We want to compute the most likely states with respect to these posterior distributions, but direct evaluation is usually already impossible with one hundred dimensions. Fortunately, for random fields with a linear graph, the posterior distribution is itself Markov, which has some striking computational implications. In particular, dynamic programming methods can be used: computationally feasible algorithms exist for estimating the most likely interpretation ~x of a given signal ~y. Equation (7) is equivalent to

\vec{x}_{\mathrm{MAP}} = \arg\max_{\vec{x}} \left\{ \log p_{x_1} + \sum_{t=2}^{n} \log p_{x_{t-1} x_t} + \sum_{t=1}^{n} \log q_{x_t y_t} \right\}.    (8)

Concretely, we first compute

C_1(i) = \log(p_i) + \log(q_{i y_1}).    (9)

Then we sequentially compute, for t = 1, . . ., n-1,

S_{t+1}(j) = \arg\max_i \left\{ C_t(i) + \log p_{ij} + \log q_{j y_{t+1}} \right\},    (10)
C_{t+1}(j) = C_t(S_{t+1}(j)) + \log p_{S_{t+1}(j)\, j} + \log q_{j y_{t+1}},    (11)

where i, j \in \{0, 1\} and

q_{x_t y_t} = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(y_t - x_t)^2}{2\sigma^2} \right\} \Delta y.    (12)

Finally, we obtain

\hat{x}_n = \arg\max_i C_n(i),    (13)

and the remaining estimates \hat{x}_t, t = n-1, . . ., 1, are obtained by backtracking:

\hat{x}_t = S_{t+1}(\hat{x}_{t+1}).    (14)
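Equations (9)-(14) form a standard Viterbi-style dynamic program. As an illustration only, a minimal Python sketch of the generative model of Sec. 2.1 and of this decoder might look as follows; the names sample_chain, viterbi_map, and log_q are ours, the transition matrix is written in row-stochastic form P_TRANS[i, j] = Pr(X_{t+1} = j | X_t = i) (the transpose of the matrix displayed above), and the constant \Delta y of Eq. (12) is omitted because it does not affect the maximization.

```python
import numpy as np

# Parameters from Sec. 2.1 (row-stochastic convention: P_TRANS[i, j] = Pr(X_{t+1} = j | X_t = i))
P_INIT = np.array([0.5, 0.5])
P_TRANS = np.array([[0.99, 0.01],
                    [0.03, 0.97]])
SIGMA = 0.7

def sample_chain(n, rng):
    """Draw x_true from the Markov chain and y = x + Gaussian noise (Eq. (3))."""
    x = np.empty(n, dtype=int)
    x[0] = rng.choice(2, p=P_INIT)
    for t in range(1, n):
        x[t] = rng.choice(2, p=P_TRANS[x[t - 1]])
    y = x + rng.normal(0.0, SIGMA, size=n)
    return x, y

def log_q(y_t):
    """log q_{j, y_t} for j = 0, 1 (Gaussian log-density; the Delta-y constant is dropped)."""
    states = np.array([0.0, 1.0])
    return -0.5 * np.log(2 * np.pi * SIGMA**2) - (y_t - states) ** 2 / (2 * SIGMA**2)

def viterbi_map(y):
    """MAP state sequence via the dynamic program of Eqs. (9)-(14)."""
    n = len(y)
    log_p = np.log(P_TRANS)
    C = np.log(P_INIT) + log_q(y[0])                          # Eq. (9)
    S = np.zeros((n, 2), dtype=int)                           # back-pointers S_{t+1}(j)
    for t in range(n - 1):
        scores = C[:, None] + log_p                           # scores[i, j] = C_t(i) + log p_{ij}
        S[t + 1] = np.argmax(scores, axis=0)                  # Eq. (10)
        C = scores[S[t + 1], np.arange(2)] + log_q(y[t + 1])  # Eq. (11)
    x_hat = np.empty(n, dtype=int)
    x_hat[-1] = np.argmax(C)                                  # Eq. (13)
    for t in range(n - 2, -1, -1):
        x_hat[t] = S[t + 1, x_hat[t + 1]]                     # Eq. (14)
    return x_hat

rng = np.random.default_rng(0)
x_true, y = sample_chain(500, rng)
x_map = viterbi_map(y)
print("discrepancies:", int(np.sum(x_map != x_true)), "/", len(x_true))
```

The back-pointer array S stores S_{t+1}(j) exactly as in Eq. (10), so the final loop is a direct transcription of the backtracking rule of Eq. (14).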
2.3 Posterior probability

We want to compute the conditional probability of the most likely state, given the observation:

\mathrm{Prob}(\vec{x} \mid \vec{y}) = \frac{\mathrm{Prob}(\vec{x}, \vec{y})}{\mathrm{Prob}(\vec{y})}.    (15)

Still, it is difficult to know the posterior probability when there is a large number n of observed variables. If we compute the inverse of the conditional probability, rather than the conditional probability itself, we can obtain the posterior probability for relatively large n [2]:

\frac{1}{p(\vec{x} \mid \vec{y})} = \frac{p(\vec{y})}{p(\vec{x}, \vec{y})} = \frac{\sum_{\tilde{x}} p(\tilde{x}, \vec{y})}{p(\vec{x}, \vec{y})}
  = \sum_{i_1, i_2, i_3, \dots, i_n} \frac{p_{i_1,i_2}\, p_{i_2,i_3} \cdots q_{i_1,y_1}\, q_{i_2,y_2}\, q_{i_3,y_3} \cdots q_{i_n,y_n}}{p_{x_1,x_2}\, p_{x_2,x_3} \cdots q_{x_1,y_1}\, q_{x_2,y_2}\, q_{x_3,y_3} \cdots q_{x_n,y_n}}
  = \frac{\sum_{i_n} q_{i_n,y_n} \cdots \sum_{i_3} p_{i_3,i_4} q_{i_3,y_3} \sum_{i_2} p_{i_2,i_3} q_{i_2,y_2} \sum_{i_1} p_{i_1,i_2} q_{i_1,y_1}}{q_{x_n,y_n} \cdots p_{x_3,x_4} q_{x_3,y_3}\, p_{x_2,x_3} q_{x_2,y_2}\, p_{x_1,x_2} q_{x_1,y_1}}
  = \frac{\sum_{i_n} q_{i_n,y_n} \cdots \sum_{i_3} p_{i_3,i_4} q_{i_3,y_3}\, r_3(i_3)}{q_{x_n,y_n} \cdots p_{x_3,x_4} q_{x_3,y_3}}
  = \cdots = \frac{\sum_{i_n} q_{i_n,y_n}\, r_n(i_n)}{q_{x_n,y_n}},    (16)

where we put

r_2(i_2) = \frac{\sum_{i_1} p_{i_1,i_2}\, q_{i_1,y_1}}{p_{x_1,x_2}\, q_{x_1,y_1}}

and, for t = 3, . . ., n,

r_t(i_t) = \frac{\sum_{i_{t-1}} p_{i_{t-1},i_t}\, q_{i_{t-1},y_{t-1}}\, r_{t-1}(i_{t-1})}{p_{x_{t-1},x_t}\, q_{x_{t-1},y_{t-1}}}.
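This recursion maps directly onto a short dynamic program. The following self-contained sketch (the name posterior_prob is ours, not from the paper) assumes the uniform initial distribution of Sec. 2.1, under which the initial-state factor cancels between numerator and denominator, and uses the same row-stochastic transition convention as the decoder sketch above.

```python
import numpy as np

SIGMA = 0.7
P_TRANS = np.array([[0.99, 0.01],   # P_TRANS[i, j] = p_{ij} = Pr(X_{t+1} = j | X_t = i)
                    [0.03, 0.97]])

def q(y_t):
    """q_{j, y_t} for j = 0, 1 (Gaussian density; the Delta-y factor cancels in the ratio)."""
    states = np.array([0.0, 1.0])
    return np.exp(-(y_t - states) ** 2 / (2 * SIGMA**2)) / np.sqrt(2 * np.pi * SIGMA**2)

def posterior_prob(x, y):
    """P(x | y) via the inverse-probability recursion of Eq. (16).

    x is a 0/1 array; intended for configurations close to the MAP path,
    where the ratios r_t stay numerically moderate."""
    n = len(y)
    # r_2(i_2) = sum_{i_1} p_{i_1,i_2} q_{i_1,y_1} / (p_{x_1,x_2} q_{x_1,y_1})
    r = (q(y[0]) @ P_TRANS) / (P_TRANS[x[0], x[1]] * q(y[0])[x[0]])
    for t in range(2, n):  # builds r_3, ..., r_n
        r = ((r * q(y[t - 1])) @ P_TRANS) / (P_TRANS[x[t - 1], x[t]] * q(y[t - 1])[x[t - 1]])
    inv_post = np.sum(q(y[-1]) * r) / q(y[-1])[x[-1]]   # 1 / P(x | y)
    return 1.0 / inv_post
```

Applied to the decoder output, e.g. posterior_prob(viterbi_map(y), y) with the earlier sketch, this gives the posterior probability of a given configuration without enumerating the 2^n alternatives.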
3 Computer Experiments

3.1 Posterior probability of the MAP estimator

Computer simulation was carried out for n = 200. Typical sequences of the source signal ~xTRUE and the observation ~y, together with the MAP estimator ~xMAP, are illustrated in Fig. 2 (left). The posterior probability of the most likely state, P(~xMAP|~y), and that of the true state, P(~xTRUE|~y) (~xTRUE is unknown to the observer), are shown in the upper-right corner for four typical cases. P(~xMAP|~y) seems to be distributed broadly; it was 0.113, 0.164, 0.019, and 0.466 in these four cases.

Figure 2: Posterior probability distribution of ~xMAP for four typical cases of ~xTRUE. Left: see Fig. 1 for explanation. Middle: distribution of P(~xMAP|~y) for 10,000 different ~y's generated from an identical ~xTRUE. Right: relationship between the posterior probability P(~xMAP|~y) and the reconstruction rate d(~xMAP, ~xTRUE)/n.

3.2 Distribution of posterior probability of ~xMAP

We generated a set of 10,000 different observations ~y from an identical signal ~xTRUE. Figure 2 (middle) shows the distribution of P(~xMAP|~y) for these 10,000 ~y's. The result shows that there is a variety of types of distribution of P(~xMAP|~y). We asked what these posterior probabilities signal to us in this problem setting by examining the relationship between the reconstruction rate and the posterior probability. The results are shown in Fig. 2 (right). As the reconstruction rate, we used the normalized Hamming distance d_ham(~xMAP, ~xTRUE)/n, where n (= 200) is the number of observations. These results show a tendency that the higher P(~xMAP|~y) is, the larger the reconstruction rate, although the relationship is not so simply interpreted.

Figure 3 shows the distribution of P(~xMAP|~y) for 10,000 totally different sample ~xTRUE's and different sample observations ~y. The value of P(~xMAP|~y) decreases as n becomes large (compare Fig. 3 (left) with Fig. 4, which shows the case of n = 500).

Figure 3: Posterior probability distribution of ~xMAP. Left: distribution of P(~xMAP|~y) for 10,000 different ~y's generated from 10,000 different samples of ~xTRUE. Right: relationship between the posterior probability P(~xMAP|~y) and the reconstruction rate d_ham(~xMAP, ~xTRUE)/n.

Figure 4: Reconstruction rate d_ham(~xMAP, ~xTRUE)/n and posterior probability P(~xMAP|~y) for n = 500.

4 Discussion

The value of the posterior probability of ~xMAP does not tell us much, except in the case of P(~xMAP|~y) being extremely large, since we do not know in advance the structure of the posterior probability distribution. To use P(~xMAP|~y) effectively in interpretation, we consider the distribution of the posterior probability of perturbed versions of ~xMAP: we generate a set of ~x's that are close to ~xMAP as vectors and make a histogram of P(~x|~y). For example, there are 5 transition points (0 → 1 or 1 → 0) in the ~xTRUE exemplified in Fig. 5. We generated a set of 3^5 = 243 ~x's, in each of which each transition point of ~xTRUE was systematically shifted by -1, 0, or 1, and we computed P(~x|~y) for each ~x thus generated (see Fig. 5). From this analysis, we see that the value P(~xMAP|~y) = 0.03 has something to tell us: ~xMAP is the most likely interpretation given the observation ~y, with high confidence.

Figure 5: Meaning of the posterior probability of ~xMAP. Upper: P(~xMAP|~y) = 0.03; see the caption of Fig. 1 for a detailed explanation. Lower: distribution of P(~x'|~y), where 242 ~x' similar to ~xMAP were generated.
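As a sketch of this perturbation analysis (the helper names transition_points, shift_transition, and perturbed_sequences are ours; the code assumes the change points of x are well separated, so that shifted change points do not collide), the neighborhood of a configuration can be enumerated as follows and each member scored with a routine such as the posterior_prob sketch of Sec. 2.3.

```python
import itertools
import numpy as np

def transition_points(x):
    """Indices t with x[t] != x[t-1], i.e. the 0->1 / 1->0 change points of a 0/1 array."""
    return np.flatnonzero(np.diff(x)) + 1

def shift_transition(x, t, delta):
    """Move the change point at index t by delta in {-1, 0, +1}."""
    x_new = x.copy()
    if delta == -1:
        x_new[t - 1] = x[t]      # the new value starts one step earlier
    elif delta == 1:
        x_new[t] = x[t - 1]      # the old value persists one step longer
    return x_new

def perturbed_sequences(x):
    """All 3^k variants of x obtained by shifting each of its k change points by -1, 0, or +1."""
    x = np.asarray(x)
    points = transition_points(x)
    variants = []
    for shifts in itertools.product((-1, 0, 1), repeat=len(points)):
        x_new = x.copy()
        for t, d in zip(points, shifts):
            x_new = shift_transition(x_new, int(t), d)
        variants.append(x_new)
    return variants

# Histogram of the neighborhood, cf. Fig. 5 (lower):
# probs = [posterior_prob(x_p, y) for x_p in perturbed_sequences(x_map)]
```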
Acknowledgments

This work was partly supported by a Grant-in-Aid (19500257) from the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT), and by an Inamori Grant from the Inamori Foundation.

References

[1] H. Künsch, S. Geman, and A. Kehagias, "Hidden Markov random fields," Annals of Applied Probability, vol. 5, pp. 557-602, 1995.

[2] S. Geman and K. Kochanek, "Dynamic programming and the graphical representation of error-correcting codes," IEEE Transactions on Information Theory, vol. 47, pp. 549-567, 2001.