
On the Distribution of Posterior Probability in Bayesian Inference
with a Large Number of Observations
Akira Date
Faculty of Engineering, University of Miyazaki, Miyazaki 889-2192
[email protected]
Abstract: Probabilistic generative models are used in many applications of image analysis and speech recognition. In general, there is an observation vector ~y, a state vector ~x, and a joint dependency structure among them. The object of interest is, given ~y, the most likely configuration ~xMAP and its posterior probability. In practice, the exact value of the posterior probability of ~xMAP is impossible to obtain, especially when there is a large number of observed variables. Here we analyzed the distribution of posterior probabilities of ~xMAP when there are n = 200-1000 observations. We used a probabilistic model with a simple linear dependency structure in which the exact value of the posterior probability of ~xMAP is obtainable. Computer experiments show that even an identical model generates a variety of posterior distributions, which suggests difficulties in understanding the meaning of the posterior probability. Finally, we propose a method for assessing the confidence of the estimator ~xMAP by computing P(~x'|~y) for configurations ~x' that are neighbors of ~xMAP.
Keywords: Bayesian inference, maximum a posteriori estimator, hidden Markov model, dependency graph
1 Introduction
In Bayesian formulations, we formalize the relevant prior information as a probability distribution, say P(~x). Then, given the prior distribution P(~x) and an "observation" ~y, we form the posterior distribution P(~x|~y): the conditional distribution given what is observed. Given an observation, we seek the most likely configuration ~xMAP, the one that maximizes the posterior distribution.

Suppose we can compute P(~xMAP|~y), the posterior probability of ~xMAP, although in practice it is difficult to compute the exact value of the posterior probability when there is a large number of observed variables. What does P(~xMAP|~y) signal to us? Even if we know the posterior probability of ~xMAP, its meaning is problem-dependent. When, for example, P(~xMAP|~y) = 0.01, can we say that ~xMAP is not so believable? The answer depends on the problem. Even if we fix the problem setting, the question is difficult to answer: it seems that the larger the configuration space of ~x, the smaller P(~xMAP|~y). However, the value of the posterior probability does, of course, signal something. When, for example, the posterior probability of ~xMAP is 0.98, it signals that ~xMAP is the most likely interpretation with high confidence. In general, how can we use the value of the posterior probability effectively for processing in the next stage?
In this paper, we used a probabilistic model with a simple linear dependency structure in which the exact value of the posterior probability of the most likely configuration ~xMAP is obtainable. We then analyzed the distribution of posterior probabilities when there are n = 200-1000 observations, performing a large number of computer experiments to obtain the distribution of P(~xMAP|~y). The results suggest that there is a variety of distributions of P(~xMAP|~y). We computed not only P(~xMAP|~y) but also P(~x'|~y) for configurations ~x' that are neighbors of ~xMAP. The presentation will be nontechnical and by example, highlighting the meaning of the posterior probability: what the posterior probability signals, and whether a low posterior probability signals that the estimator is relatively unbelievable.
2 Bayesian Inference

2.1 Linear dependency graph
As a simple example, suppose that X1, X2, ... is a first-order Markov process with state space X ∈ {0, 1}, initial probability distribution
$$p_0 \equiv \mathrm{Prob}(X_1 = 0) = 0.5, \qquad (1)$$
$$p_1 \equiv \mathrm{Prob}(X_1 = 1) = 0.5, \qquad (2)$$
and transition probability matrix
$$P \equiv \begin{bmatrix} p_{00} & p_{10} \\ p_{01} & p_{11} \end{bmatrix} \equiv \begin{bmatrix} \Pr(X_{i+1}=0 \mid X_i=0) & \Pr(X_{i+1}=0 \mid X_i=1) \\ \Pr(X_{i+1}=1 \mid X_i=0) & \Pr(X_{i+1}=1 \mid X_i=1) \end{bmatrix} = \begin{bmatrix} 0.99 & 0.03 \\ 0.01 & 0.97 \end{bmatrix}.$$

Suppose we observe a corrupted signal Y1, Y2, ..., Yn, where
$$Y_i = X_i + \eta_i \qquad (3)$$
with the ηi i.i.d. N(0, σ²), σ = 0.7; that is,
$$\mathrm{Prob}(Y_i < y_i \mid x_i) = \int_{-\infty}^{y_i} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(y - x_i)^2}{2\sigma^2}\right\} dy. \qquad (4)$$
The goal is to estimate X, where the estimation is based upon the corrupted observations and upon the model, i.e., the prior distribution.

Y itself is not Markov. Nevertheless, the conditional distribution of X given Y remains simple: X given Y is still first-order Markov. In general, the combination of a rich marginal structure for Y and a simple posterior structure for X makes the hidden Markov process a common modeling tool [1].
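As a concrete illustration of Eqs. (1)-(4), the following minimal Python sketch (not the paper's code; the function and variable names are ours) samples a source sequence from the binary Markov prior and corrupts it with Gaussian noise. P_TRANS is stored row-stochastically, i.e., P_TRANS[i, j] = Pr(X_{t+1} = j | X_t = i), which corresponds to the p_{ij} of Eq. (10) and is the transpose of the matrix layout displayed above.

```python
import numpy as np

P_INIT = np.array([0.5, 0.5])        # Eqs. (1)-(2): Prob(X_1 = 0), Prob(X_1 = 1)
P_TRANS = np.array([[0.99, 0.01],    # P_TRANS[i, j] = Pr(X_{t+1} = j | X_t = i)
                    [0.03, 0.97]])
SIGMA = 0.7                          # noise standard deviation of Eq. (3)

def sample_chain(n, rng):
    """Draw x_1, ..., x_n from the first-order Markov prior."""
    x = np.empty(n, dtype=int)
    x[0] = rng.choice(2, p=P_INIT)
    for t in range(1, n):
        x[t] = rng.choice(2, p=P_TRANS[x[t - 1]])
    return x

def observe(x, rng, sigma=SIGMA):
    """Corrupt the source with i.i.d. Gaussian noise, Eq. (3)."""
    return x + rng.normal(0.0, sigma, size=len(x))

rng = np.random.default_rng(0)       # illustrative seed, not one from the paper
x_true = sample_chain(500, rng)
y = observe(x_true, rng)
```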
Figure 1: a, Linear dependency graph: the nodes X1, X2, ..., Xn represent a source sequence of 0's and 1's generated from a Markov chain, and the nodes Y1, Y2, ..., Yn represent a noisy observation. b, A source sequence ~xTRUE (line), an observation ~y (points), and the maximum a posteriori estimator ~xMAP (plotted as ~xMAP + 3), together with its posterior probability. Yi = Xi + Zi, Zi ∼ N(0, σ²), σ = 0.7, 1 ≤ i ≤ n = 500. There were 17/500 discrepancies between ~xMAP and ~xTRUE. (Panel b annotations: P(~xMAP|~y) = 0.00003079, P(~xTRUE|~y) = 0.00000015, seed 77777.)
2.2 Most likely configurations

Given an observation ~y = (y1, ..., yn)^T, we seek the ~x = (x1, ..., xn)^T that maximizes the posterior distribution (the so-called MAP estimator),
$$\vec{x}_{\mathrm{MAP}} = \arg\max_{\vec{x}} \mathrm{Prob}(\vec{x} \mid \vec{y}), \qquad (5)$$
where
$$\mathrm{Prob}(\vec{x} \mid \vec{y}) = \frac{\mathrm{Prob}(\vec{x}, \vec{y})}{\mathrm{Prob}(\vec{y})}. \qquad (6)$$
Since Prob(~y) is a positive constant, the goal is to compute
$$\vec{x}_{\mathrm{MAP}} = \arg\max_{\vec{x}} \mathrm{Prob}(\vec{x}, \vec{y}). \qquad (7)$$

There are legendary practical problems with the actual implementation of Bayesian methods. We want to compute the most likely states with respect to these posterior distributions, but direct evaluation is usually already impossible with one hundred dimensions. Fortunately, for random fields with a linear graph, this posterior distribution is itself Markov, which has striking computational implications. With q_{x_t y_t} denoting the emission term defined in (12) below, taking logarithms in (7) gives
$$\vec{x}_{\mathrm{MAP}} = \arg\max_{\vec{x}} \left\{ \log p_{x_1} + \sum_{t=2}^{n} \log p_{x_{t-1} x_t} + \sum_{t=1}^{n} \log q_{x_t y_t} \right\}. \qquad (8)$$
In particular, dynamic programming methods can be used: computationally feasible algorithms exist for estimating the most likely interpretation ~x of a given signal ~y. Concretely, we first compute
$$C_1(i) = \log p_i + \log q_{i y_1}. \qquad (9)$$
Then, for t = 1, ..., n − 1, we sequentially compute
$$S_{t+1}(j) = \arg\max_i \left\{ C_t(i) + \log p_{ij} + \log q_{j y_{t+1}} \right\}, \qquad (10)$$
$$C_{t+1}(j) = C_t(S_{t+1}(j)) + \log p_{S_{t+1}(j)\,j} + \log q_{j y_{t+1}}, \qquad (11)$$
where i, j ∈ {0, 1} and
$$q_{x_t y_t} = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(y_t - x_t)^2}{2\sigma^2}\right\} \Delta y. \qquad (12)$$
Finally, we obtain
$$\hat{x}_n = \arg\max_i C_n(i), \qquad (13)$$
and the remaining x̂_t, t = n − 1, ..., 1, are obtained by backtracking:
$$\hat{x}_t = S_{t+1}(\hat{x}_{t+1}). \qquad (14)$$
2.3 Posterior probability

We want to compute the conditional probability of the most likely state given the observation,
$$\mathrm{Prob}(\vec{x} \mid \vec{y}) = \frac{\mathrm{Prob}(\vec{x}, \vec{y})}{\mathrm{Prob}(\vec{y})}. \qquad (15)$$
Still, it is difficult to know the posterior probability when there is a large number n of observed variables. If we compute the inverse of the conditional probability, rather than the conditional probability itself, we can obtain the posterior probability for relatively large n [2]:
$$\frac{1}{p(\vec{x}\mid\vec{y})} = \frac{p(\vec{y})}{p(\vec{x},\vec{y})} = \frac{\sum_{\tilde{x}} p(\tilde{x},\vec{y})}{p(\vec{x},\vec{y})} = \frac{\sum_{i_1,i_2,i_3,\ldots,i_n} p_{i_1,i_2}\, p_{i_2,i_3} \cdots q_{i_1,y_1}\, q_{i_2,y_2}\, q_{i_3,y_3} \cdots q_{i_n,y_n}}{p_{x_1,x_2}\, p_{x_2,x_3} \cdots q_{x_1,y_1}\, q_{x_2,y_2}\, q_{x_3,y_3} \cdots q_{x_n,y_n}}$$
$$= \frac{\sum_{i_n} q_{i_n,y_n} \cdots \sum_{i_3} p_{i_3,i_4}\, q_{i_3,y_3} \sum_{i_2} p_{i_2,i_3}\, q_{i_2,y_2} \sum_{i_1} p_{i_1,i_2}\, q_{i_1,y_1}}{q_{x_n,y_n} \cdots p_{x_3,x_4}\, q_{x_3,y_3}\; p_{x_2,x_3}\, q_{x_2,y_2}\; p_{x_1,x_2}\, q_{x_1,y_1}}$$
$$= \frac{\sum_{i_n} q_{i_n,y_n} \cdots \sum_{i_3} p_{i_3,i_4}\, q_{i_3,y_3} \sum_{i_2} p_{i_2,i_3}\, q_{i_2,y_2}\, r_2(i_2)}{q_{x_n,y_n} \cdots p_{x_3,x_4}\, q_{x_3,y_3}\; p_{x_2,x_3}\, q_{x_2,y_2}} = \cdots = \frac{\sum_{i_n} q_{i_n,y_n}\, r_n(i_n)}{q_{x_n,y_n}}, \qquad (16)$$
where we put
$$r_2(i_2) = \frac{\sum_{i_1} p_{i_1,i_2}\, q_{i_1,y_1}}{p_{x_1,x_2}\, q_{x_1,y_1}}$$
and, for t = 3, ..., n,
$$r_t(i_t) = \frac{\sum_{i_{t-1}} p_{i_{t-1},i_t}\, q_{i_{t-1},y_{t-1}}\, r_{t-1}(i_{t-1})}{p_{x_{t-1},x_t}\, q_{x_{t-1},y_{t-1}}}.$$
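The telescoping ratio (16) can be evaluated numerically with one pass over the data. The sketch below (our names; it assumes the P_TRANS and σ conventions of the earlier snippets) accumulates r_2, ..., r_n and returns P(~x|~y); the Δy factor of Eq. (12) and the uniform initial probability (1)-(2) cancel between numerator and denominator and are therefore omitted.

```python
import numpy as np

def q_dens(y_t, sigma=0.7):
    """Gaussian emission density for x_t = 0 and x_t = 1 (Delta_y omitted)."""
    x = np.array([0.0, 1.0])
    return np.exp(-(y_t - x) ** 2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def posterior_prob(x, y, p_trans, sigma=0.7):
    """Exact P(x | y) via the r_2, ..., r_n recursion of Eq. (16)."""
    n = len(y)
    q = q_dens(y[0], sigma)
    # r_2(i_2) = sum_{i_1} p_{i_1 i_2} q_{i_1 y_1} / (p_{x_1 x_2} q_{x_1 y_1})
    r = (q @ p_trans) / (p_trans[x[0], x[1]] * q[x[0]])
    for t in range(2, n):                       # r_t for t = 3, ..., n
        q = q_dens(y[t - 1], sigma)
        r = ((r * q) @ p_trans) / (p_trans[x[t - 1], x[t]] * q[x[t - 1]])
    q = q_dens(y[n - 1], sigma)
    return 1.0 / (np.dot(q, r) / q[x[n - 1]])   # invert 1 / P(x | y), Eq. (16)
```

Because each r_t is a ratio of comparable products, the intermediate values stay in a manageable range for configurations near ~xMAP even when n is in the hundreds, which is what makes the exact posterior obtainable here.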
3 Computer Experiments

3.1 Posterior probability of MAP estimator

Computer simulation was carried out for n = 200. Typical sequences of the source signal ~xTRUE and the observation ~y, together with the MAP estimator ~xMAP, are illustrated in Fig. 2 (left). The posterior probability of the most likely state, P(~xMAP|~y), and that of the true state, P(~xTRUE|~y) (~xTRUE is unknown to the observer), are shown in the upper-right corner for four typical cases. P(~xMAP|~y) appears to be distributed broadly; it was 0.113, 0.164, 0.019, and 0.466 in these four cases.

Figure 2: Posterior probability distribution of ~xMAP for four typical cases of ~xTRUE (N = 200, σ = 0.7; the panels are annotated with seeds 7, 3332877, 2772, and 56712, P(~xMAP|~y) = 0.11287, 0.16374, 0.01893, 0.46593, and P(~xTRUE|~y) = 0.04105, 0.13381, 0.01229, 0.46593, respectively; middle and right panels use 10,000 noisy samples per case). Left: see Fig. 1 for explanation. Middle: distribution of P(~xMAP|~y) for 10,000 different ~y's generated from an identical ~xTRUE. Right: relationship between the posterior probability P(~xMAP|~y) and the reconstruction rate d(~xMAP, ~xTRUE)/n.

3.2 Distribution of posterior probability of ~xMAP

We generated a set of 10,000 different observations ~y from an identical signal ~xTRUE. Figure 2 (middle) shows the distribution of P(~xMAP|~y) for these 10,000 ~y's. The result shows that the distribution of P(~xMAP|~y) can take a variety of forms. We asked what these posterior probabilities signal to us in this problem setting by examining the relationship between the reconstruction rate and the posterior probability. The results are shown in Fig. 2 (right). As the reconstruction rate, we used the normalized Hamming distance
$$\frac{d_{ham}(\vec{x}_{\mathrm{MAP}}, \vec{x}_{\mathrm{TRUE}})}{n},$$
where n (= 200) is the number of observations. These results show a tendency that the higher P(~xMAP|~y), the larger the reconstruction rate, although the relationship is not so simply interpreted.
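Putting the pieces together, the Sec. 3.2 experiment can be sketched as follows, reusing the sample_chain, observe, viterbi, and posterior_prob helpers defined in the snippets above (names and seed are ours, not the paper's): fix one ~xTRUE, draw many noisy observations, and record P(~xMAP|~y) together with the reconstruction rate.

```python
import numpy as np

rng = np.random.default_rng(0)               # illustrative seed
n, trials = 200, 10_000
x_true = sample_chain(n, rng)                # one fixed source sequence

post = np.empty(trials)
recon = np.empty(trials)
for k in range(trials):
    y = observe(x_true, rng)                 # a fresh noisy observation
    x_map = viterbi(y, P_INIT, P_TRANS)
    post[k] = posterior_prob(x_map, y, P_TRANS)
    recon[k] = np.mean(x_map == x_true)      # fraction of correctly recovered sites

hist, edges = np.histogram(post, bins=50, range=(0.0, 1.0))   # a histogram as in Fig. 2 (middle)
```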
Figure 3: Posterior probability distribution of ~xMAP (N = 200, σ = 0.7, 10,000 samples). Left: distribution of P(~xMAP|~y) for 10,000 different ~y's generated from 10,000 different samples of ~xTRUE. Right: relationship between the posterior probability P(~xMAP|~y) and the reconstruction rate dham(~xMAP, ~xTRUE)/n.
Figure 3 shows the distribution of P(~xMAP|~y) for 10,000 entirely different samples of ~xTRUE with correspondingly different sample observations ~y. The value of P(~xMAP|~y) decreases as n becomes large (compare Fig. 3 (left) with Fig. 4, which shows the case n = 500).

Figure 4: Reconstruction rate dham(~xMAP, ~xTRUE)/n and posterior probability P(~xMAP|~y) for n = 500 (10,000 samples, σ = 0.7).

Figure 5: Meaning of the posterior probability of ~xMAP. Upper: a case with P(~xMAP|~y) = 0.03 (panel annotations: P(~xMAP|~y) = 0.03118, P(~xTRUE|~y) = 0.00213, seed 20070919, σ = 0.7); see the caption of Fig. 1 for a detailed explanation. Lower: distribution of P(~x'|~y), where 242 configurations ~x' similar to ~xMAP were generated; the horizontal axis is the posterior probability and the vertical axis the number of elements.
4 Discussion

The value of the posterior probability of ~xMAP does not tell us much, except when P(~xMAP|~y) is extremely large, since we do not know in advance the structure of the posterior probability distribution. To use P(~xMAP|~y) effectively in interpretation, we consider the distribution of the posterior probability of perturbed versions of ~xMAP: we generate a set of ~x's that are close to ~xMAP as vectors and make a histogram of P(~x|~y). For example, there are 5 transition points (0 → 1 or 1 → 0) in the ~xTRUE exemplified in Fig. 5. We generated a set of 3^5 = 243 ~x's, in each of which every transition point of ~xTRUE was systematically shifted by −1, 0, or 1, and we computed P(~x|~y) for each ~x thus generated (see Fig. 5). From this analysis, we see that the value P(~xMAP|~y) = 0.03 has something to tell us: ~xMAP is the most likely interpretation given the observation ~y, with high confidence.
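One possible reading of this perturbation procedure is sketched below, reusing the posterior_prob helper from Sec. 2.3; transition_points, shifted_variant, and neighborhood_posteriors are hypothetical names of ours, and the sketch assumes that the shifted transition points remain in their original order.

```python
import itertools
import numpy as np

def transition_points(x):
    """Indices t at which x[t] != x[t - 1]."""
    return np.flatnonzero(np.diff(x)) + 1

def shifted_variant(x, shifts):
    """Rebuild x with each transition point moved by the given shift (-1, 0, or +1)."""
    points = transition_points(x) + np.asarray(shifts)   # assumed to stay sorted
    z = np.empty(len(x), dtype=int)
    state, prev = int(x[0]), 0
    for p in points:
        z[prev:p] = state
        state, prev = 1 - state, p
    z[prev:] = state
    return z

def neighborhood_posteriors(x_ref, y, p_trans):
    """P(x' | y) for every combination of per-transition shifts in {-1, 0, +1}."""
    k = len(transition_points(x_ref))
    return np.array([
        posterior_prob(shifted_variant(x_ref, s), y, p_trans)
        for s in itertools.product((-1, 0, 1), repeat=k)  # 3**k variants
    ])
```

A histogram of the returned posteriors, compared against P(~xMAP|~y), then plays the role of the lower panel of Fig. 5.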
Acknowledgments

This work was partly supported by Grant-in-Aid (19500257) from the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT), and by the Inamori Grant from the Inamori Foundation.
References
[1] H. Künsch, S. Geman, and A. Kehagias, “Hidden Markov
random fields,” Annals of Applied Probability, vol. 5,
pp. 557–602, 1995.
[2] S. Geman and K. Kochanek, “Dynamic programming and
the graphical representation of error-correcting codes,” IEEE
Transactions on Information Theory, vol. 47, pp. 549–567,
2001.