
Updating probabilities with Maximum Entropy
and with Bayes' Rule - a comparison
In 2013, Herbert E. Müller, Switzerland
Abstract
This article is about random experiments with M possible issues and unknown limiting frequencies $f_m$ ($m = 1 \dots M$) of each issue. A person with prior information I assigns a probability $p_m(I)$ to issue m in the next trial. Given additional information about the limiting frequencies, consisting of the equation $b_1 f_1 + b_2 f_2 + \dots + b_M f_M = \bar b$, this person will update the probability for issue m in the next trial to $p_m(\bar b, I)$. The updated probabilities can be calculated in two ways: with the Maximum Entropy principle, or with Bayes' rule. The results usually differ. In this article, the outcome of the two methods is compared for the simplest non-trivial case: a random experiment with three issues.
Contents
1. Statement of the premises and the goal
2. Updating probabilities with Maximum Entropy and with Bayes' rule
3. Random Experiments with three issues
4. Example 1: One frequency is given: $f_1 = c$
5. Example 2: The ratio of two frequencies is given: $f_3 = c \cdot f_2$
6. Updated probabilities for three issues and the general condition $\vec b \cdot \vec f = 0$
7. The mapping of the condition vector $\vec b$ onto the updated probability $\vec p(\bar b)$
1. Statement of the premises and the goal
This article is about random experiments of the "dice" type. Such an experiment has M possible issues, labelled $m = 1 \dots M$; it can be repeated arbitrarily often; the results of N repeated experiments are encoded in the vector $\vec n \equiv (n_1, n_2, \dots, n_M)$ (issue m occurred $n_m$ times, with $|\vec n| \equiv \sum_m n_m = N$); and we assume that a limiting frequency of each issue exists: $f_m \equiv \lim_{N \to \infty} (n_m / N)$.

$f_m$ is the probability of issue m in the next dice throw for a person who has analysed a very large number of dice throws. Since we usually don't know the exact limiting frequencies $\vec f$, we introduce a corresponding vector of probability variables $\vec x \equiv (x_1, x_2, \dots, x_M)$, with $x_m \geq 0$ and $|\vec x| \equiv \sum_m x_m = 1$. Our knowledge about the limiting frequencies, given some prior information I, is expressed mathematically by the probability distribution function (pdf) $P(\vec x\,|\,I)$. To obtain the probability of issue m in the next dice throw, we calculate the mean value of $x_m$:

$p_m(I) \equiv \int dx_1 \dots \int dx_M\; x_m\, P(\vec x\,|\,I)$.

New information about the dice may consist of the outcome of N more dice throws. In this case, we update the pdf and the probabilities with Bayes' rule:

$P(\vec x\,|\,\vec n, I) \sim P(\vec n\,|\,\vec x, I)\cdot P(\vec x\,|\,I) \sim \prod_m x_m^{n_m}\cdot P(\vec x\,|\,I)$, and $p_m(\vec n, I) \equiv \int dx_1 \dots \int dx_M\; x_m\, P(\vec x\,|\,\vec n, I)$.
Consider now a different kind of new evidence. Let $\vec b = (b_1, b_2, \dots, b_M)$ be a vector of numbers $b_m$ associated with the issues m. We are told the expectation value $\bar b \equiv \vec b \circ \vec f = b_1 f_1 + b_2 f_2 + \dots + b_M f_M$ of $\vec b$. (The person who gave us this information threw the dice a large number of times and calculated $\bar b$ to high precision.) In this case we can update our starting probabilities $\vec p$ ($= \vec p(I)$; from now on we drop the prior information I) in two ways: again by applying Bayes' rule, or, as is more commonly done in this situation, by applying the Maximum Entropy (MaxEnt) principle. The updated probabilities $\vec p(\bar b)$ will in general come out different for the two methods. The reason is simple: the MaxEnt-updated probabilities only depend on the prior probabilities $\vec p$ and on the average $\bar b$; the Bayes-updated probabilities additionally depend on the details of the prior pdf.
The following two questions naturally arise.
Question: Which updating method is better, MaxEnt or Bayes?
Answer: Bayes! (See e.g. the publications of John Skilling.)
Question: Is there a prior pdf $P(\vec x\,|\,I)$ for which the two methods give the same result?
Answer: ??? (My guess is: No.)
The aim of this article is not to give any deep insights, but to illustrate the different outcomes of the two methods in the simplest non-trivial example, which is a die with three faces (or a random experiment with three issues).
2. Updating probabilities with Maximum Entropy and with Bayes' rule
2.1 Updating probabilities with the MaxEnt principle
The MaxEnt principle provides the following formulas for the updated probabilities:

$p_m(\bar b) = p_m \exp(\beta b_m)/Z(\beta)$   (2.1)

with

$Z(\beta) = \sum_m p_m \exp(\beta b_m)$   (2.2)

and β given by

$\bar b = d \ln Z / d\beta$.   (2.3)

The prior and updated pdf are not needed.
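As an illustration of equations (2.1)–(2.3), here is a minimal numerical sketch (assuming Python with numpy; the function name maxent_update and the bisection bounds are mine, not part of the derivation above): it solves $\bar b = d\ln Z/d\beta$ for β and returns the updated probabilities.

```python
import numpy as np

def maxent_update(p, b, b_bar, beta_lo=-50.0, beta_hi=50.0):
    """MaxEnt update (2.1)-(2.3): p_m(b_bar) = p_m * exp(beta*b_m) / Z(beta),
    with beta fixed by b_bar = d ln Z / d beta (solved here by bisection)."""
    p, b = np.asarray(p, float), np.asarray(b, float)

    def mean_b(beta):
        w = p * np.exp(beta * b)          # p_m * exp(beta * b_m)
        return np.dot(b, w) / w.sum()     # d ln Z / d beta

    lo, hi = beta_lo, beta_hi             # mean_b is monotonic in beta
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mean_b(mid) < b_bar else (lo, mid)
    w = p * np.exp(0.5 * (lo + hi) * b)
    return w / w.sum()

# three issues, uniform prior, constraint <f_1> = 0.7:
print(maxent_update([1/3, 1/3, 1/3], [1, 0, 0], 0.7))   # -> [0.7, 0.15, 0.15]
```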
2.2 Updating probabilities with Bayes' rule
The updated probabilities now depend on the prior pdf. We use the mathematically most convenient form

$P(x_1, \dots, x_M) = Z(\vec x)/Z$   (2.4)

with

$Z(\vec x) = \prod_m x_m^{\nu_m - 1} \cdot \delta(1 - |\vec x|)$   (2.5)

and

$Z = \int_0^\infty dx_1 \dots \int_0^\infty dx_M\; Z(\vec x) = \Gamma(\nu_1)\cdots\Gamma(\nu_M)/\Gamma(|\vec\nu|) \equiv B(\nu_1, \dots, \nu_M)$   (2.6)

Here, $B(\nu_1, \dots, \nu_M)$ is Euler's Beta function.

The first prior probability is $p_1 = \int_0^\infty dx_1 \dots \int_0^\infty dx_M\; x_1\, Z(\vec x)/Z = B(\nu_1+1, \dots, \nu_M)/B(\nu_1, \dots, \nu_M) = \nu_1/|\vec\nu|$. It follows that $\nu_m = p_m\,\nu$, with $\nu = |\vec\nu| = \nu_1 + \dots + \nu_M$.

The parameter values $\nu_1 = \dots = \nu_M = 1$ lead to the uniform prior pdf and $p_m = 1/M$ (Laplace's principle of indifference).
The updated pdf is

$P(\vec x\,|\,\bar b) = Z(\vec x\,|\,\bar b)/Z(\bar b)$   (2.7)

with

$Z(\vec x\,|\,\bar b) = \prod_m x_m^{\nu_m - 1} \cdot \delta(1 - |\vec x|) \cdot \delta(\bar b - \vec b \circ \vec x)$   (2.8)

and

$Z(\bar b) = \int_0^\infty dx_1 \dots \int_0^\infty dx_M\; Z(\vec x\,|\,\bar b)$   (2.9)

The updated probabilities are

$p_m(\bar b) = \int dx_1 \dots \int dx_M\; x_m\, Z(\vec x\,|\,\bar b)/Z(\bar b)$.   (2.10)

The last two integrals can only be evaluated analytically for special values of $\vec b$ and $\bar b$.
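They can, however, always be estimated numerically. The following is a crude Monte Carlo sketch of (2.10) (my own illustration; the acceptance window eps stands in for the delta function, and the names are not from the article): sample the prior (2.4)–(2.6) and average $x_m$ over the samples that satisfy the constraint approximately.

```python
import numpy as np

def bayes_update_mc(nu, b, b_bar, eps=0.005, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of eq. (2.10): draw x from the Dirichlet prior with
    parameters nu, keep samples with |b.x - b_bar| < eps, and average them."""
    rng = np.random.default_rng(seed)
    x = rng.dirichlet(nu, size=n_samples)                 # samples of P(x | I)
    keep = np.abs(x @ np.asarray(b, float) - b_bar) < eps
    return x[keep].mean(axis=0)

# uniform prior (nu_m = 1), constraint <f_1> = 0.7:
print(bayes_update_mc([1, 1, 1], [1, 0, 0], 0.7))        # approx. (0.7, 0.15, 0.15)
```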
3. Random Experiments with three issues
3.1 Geometrical description
figure 1: Probability triangle for a random experiment with three possible issues. Also shown is a normalized condition vector $\vec b = (-0.5,\; 0.1,\; 1.4)$, and the intersection $|\vec x| = 1 \;\wedge\; \vec b\cdot\vec x = 0$.
The probability vectors $\vec x$ and $\vec p$ have length $|\vec x| = |\vec p| = 1$ and thus lie in the triangle 1-2-3 spanned by the three unit vectors (1,0,0), (0,1,0) and (0,0,1). Figure 1 shows the top view onto this probability triangle, along the diagonal (1,1,1).

The condition $\vec b\cdot\vec x = \bar b$ can always be reformulated as $\vec b\cdot\vec x = 0$, by subtracting $\bar b$ from all the components of $\vec b$. The length of $\vec b$ is then arbitrary, and may be normalized to $|\vec b| = 1$. The condition vector $\vec b$ now lies in the same plane as $\vec x$ and $\vec p$.

The equation $\vec b\cdot\vec x = 0$ describes a plane normal to $\vec b$ containing the origin of the three coordinate vectors. The intersection of this plane and the probability triangle is a secant, shown in blue in the figure. The limiting frequency $\vec f$ and the updated probability $\vec p(\bar b)$ lie on this secant.
In order that the intersection $|\vec x| = 1 \;\wedge\; \vec b\cdot\vec x = 0$ is non-empty, at least one, and at most two, of the components of $\vec b$ must be negative. As a consequence, $\vec b$ always lies outside the probability triangle. We have thus established a bijective mapping of all the points outside the probability triangle onto all the secants in the probability triangle. The secant end points lie either on the sides 3-1 and 1-2 (case I), or 1-2 and 2-3 (case II), or 2-3 and 3-1 (case III). Accordingly, the $\vec b$ domain of the plane (outside of the probability triangle) is decomposed into three domains I, II and III, as shown in figure 1.
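A small sketch of the reformulation described above (assuming Python with numpy; the helper names are mine): subtract $\bar b$ from the components of $\vec b$, and compute the end points of the secant on the triangle edges.

```python
import numpy as np

def normalize_condition(b, b_bar):
    """Turn b.x = b_bar (with |x| = 1) into b'.x = 0 by subtracting b_bar."""
    return np.asarray(b, float) - b_bar

def secant_endpoints(b):
    """End points of the secant {x >= 0, |x| = 1, b.x = 0} on the triangle edges."""
    b, points = np.asarray(b, float), []
    for i, j in [(0, 1), (1, 2), (0, 2)]:
        denom = b[j] - b[i]
        if denom == 0.0:
            continue
        xi, xj = b[j] / denom, -b[i] / denom   # solves b_i x_i + b_j x_j = 0, x_i + x_j = 1
        if 0 <= xi <= 1 and 0 <= xj <= 1:
            x = np.zeros(3)
            x[i], x[j] = xi, xj
            points.append(x)
    return points

b = normalize_condition([1, 0, 0], 0.3)        # section 4: f_1 = c = 0.3
print(secant_endpoints(b))                     # [(0.3, 0.7, 0), (0.3, 0, 0.7)]
```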
4. Example 1: One frequency is given: $f_1 = c$
The condition is $\vec b\cdot\vec x = \bar b$, with $\vec b = (1, 0, 0)$ and $\bar b = c$.
The normalized condition is $\vec b\cdot\vec x = 0$, with $\vec b = (1-c,\; -c,\; -c)/(1-3c)$.
4.1 Updating with MaxEnt
Updated p's:   $p_1(c) = p_1 \exp(\beta)/Z(\beta)$,   $p_2(c) = p_2/Z(\beta)$,   $p_3(c) = p_3/Z(\beta)$

State sum:   $Z(\beta) = p_1 \exp(\beta) + p_2 + p_3$

Condition:   $c = \dfrac{p_1 \exp(\beta)}{Z(\beta)}$   ⇒   $\exp(\beta) = \dfrac{(1-p_1)\,c}{p_1\,(1-c)}$

Updated p's:   $p_1(c) = c$,   $p_2(c) = p_2\,\dfrac{1-c}{1-p_1}$,   $p_3(c) = p_3\,\dfrac{1-c}{1-p_1}$   (4.1)
4.2 Updating with Bayes' Rule
pdf:   $Z(\vec x\,|\,c) = \prod_m x_m^{\nu_m - 1}\cdot\delta(1-|\vec x|)\cdot\delta(c - x_1)$

Normalization:   $Z(c) = c^{\nu_1 - 1}\,(1-c)^{\nu_2+\nu_3-1}\,B(\nu_2, \nu_3)$

Updated p's:

$p_1(c)\,Z(c) = c^{\nu_1}\,(1-c)^{\nu_2+\nu_3-1}\,B(\nu_2, \nu_3)$   ⇒   $p_1(c) = c$

$p_2(c)\,Z(c) = c^{\nu_1-1}\,(1-c)^{\nu_2+\nu_3}\,B(\nu_2+1, \nu_3)$   ⇒   $p_2(c) = p_2\,\dfrac{1-c}{1-p_1}$

$p_3(c)\,Z(c) = c^{\nu_1-1}\,(1-c)^{\nu_2+\nu_3}\,B(\nu_2, \nu_3+1)$   ⇒   $p_3(c) = p_3\,\dfrac{1-c}{1-p_1}$   (4.2)

The updated probabilities are independent of the free parameter ν.
4.3 Comparison of Bayes and MaxEnt
Both methods give the same updated probabilities!
figure 2: The updated probability vector $\vec p(c)$ (o) as predicted both by MaxEnt and by Bayes, for $\vec p = (1/3, 1/3, 1/3)$ and $f_1 = c$ given. The normalized condition vector $\vec b = (1-c, -c, -c)/(1-3c)$ is perpendicular to the triangle side $x_1 = 0$, and the secant $x_1 = c$ (shown for $c = 0.1, 0.3, \dots, 0.9$) is parallel to the triangle side $x_1 = 0$. Assuming prior probabilities $\vec p = (1/3, 1/3, 1/3)$, the updated probability $\vec p(c)$ lies in the centre of the secant.
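A quick numerical illustration of this agreement, and of the ν-independence of the Bayes result (my own check, assuming Python with numpy): the Monte Carlo Bayes update reproduces the MaxEnt answer (4.1) for every Dirichlet parameter ν.

```python
import numpy as np

c = 0.3
print("MaxEnt (4.1):", [c, (1 - c) / 2, (1 - c) / 2])    # prior p = (1/3, 1/3, 1/3)

rng = np.random.default_rng(1)
for nu in (3.0, 9.0, 30.0):                               # nu_m = p_m * nu = nu / 3
    x = rng.dirichlet([nu / 3] * 3, size=1_000_000)
    keep = np.abs(x[:, 0] - c) < 0.005                    # delta(c - x_1) as a narrow window
    print("Bayes, nu =", nu, ":", np.round(x[keep].mean(axis=0), 3))
# every line is close to (0.3, 0.35, 0.35)
```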
5. Example 2: The ratio of two frequencies is given: $f_3 = c \cdot f_2$
The condition is $\vec b\cdot\vec x = \bar b$, with $\vec b = (0, -c, 1)$ and $\bar b = 0$.
The normalized condition is $\vec b\cdot\vec x = 0$, with $\vec b = (0,\; -c/(1-c),\; 1/(1-c))$.
5.1 Updating with MaxEnt
Updated p's:   $p_1(c) = p_1/Z(\beta)$,   $p_2(c) = p_2\exp(-\beta c)/Z(\beta)$,   $p_3(c) = p_3\exp(\beta)/Z(\beta)$

State sum:   $Z(\beta) = p_1 + p_2\exp(-\beta c) + p_3\exp(\beta)$

Condition:   $0 = \dfrac{-c\,p_2\exp(-\beta c) + p_3\exp(\beta)}{Z(\beta)}$   ⇒   $\exp(\beta) = \left(c\,\dfrac{p_2}{p_3}\right)^{1/(1+c)}$

Updated p's:

$p_1(c) = \dfrac{p_1}{Z(\beta)}$,   $p_2(c) = \dfrac{p_2}{Z(\beta)}\left(c\,\dfrac{p_2}{p_3}\right)^{-c/(1+c)}$,   $p_3(c) = \dfrac{p_3}{Z(\beta)}\left(c\,\dfrac{p_2}{p_3}\right)^{1/(1+c)}$

with   $Z(\beta) = p_1 + p_2\left(c\,\dfrac{p_2}{p_3}\right)^{-c/(1+c)} + p_3\left(c\,\dfrac{p_2}{p_3}\right)^{1/(1+c)}$   (5.1)
5.2 Updating with Bayes' Rule
pdf:   $Z(\vec x\,|\,c) = \prod_m x_m^{\nu_m-1}\cdot\delta(1-|\vec x|)\cdot\delta(x_3 - c\,x_2)$

Normalization:   $Z(c) = \dfrac{c^{\nu_3-1}}{(1+c)^{\nu_2+\nu_3-1}}\,B(\nu_1, \nu_2+\nu_3-1)$

Updated p's:

$p_1(c)\,Z(c) = \dfrac{c^{\nu_3-1}}{(1+c)^{\nu_2+\nu_3-1}}\,B(\nu_1+1, \nu_2+\nu_3-1)$   ⇒   $p_1(c) = \dfrac{p_1}{1-1/\nu}$

$p_2(c)\,Z(c) = \dfrac{c^{\nu_3-1}}{(1+c)^{\nu_2+\nu_3}}\,B(\nu_1, \nu_2+\nu_3)$   ⇒   $p_2(c) = \dfrac{1}{1+c}\cdot\dfrac{p_2+p_3-1/\nu}{1-1/\nu}$

$p_3(c)\,Z(c) = \dfrac{c^{\nu_3}}{(1+c)^{\nu_2+\nu_3}}\,B(\nu_1, \nu_2+\nu_3)$   ⇒   $p_3(c) = \dfrac{c}{1+c}\cdot\dfrac{p_2+p_3-1/\nu}{1-1/\nu}$   (5.2)

The updated probability depends on the free parameter ν.
5.3 Comparison of Bayes and MaxEnt
This time, the two methods give quite different updated probabilities!
figure 3: The updated probability vector $\vec p(c)$ as predicted by MaxEnt (x) and Bayes (o), for $\vec p = (1/3, 1/3, 1/3)$ and $f_3/f_2 = c$ given. The normalized condition vector $\vec b = (0, -c/(1-c), 1/(1-c))$ lies on the prolonged triangle side 2-3. The secant $x_3 = c\cdot x_2$ ends in the corner 1 of the probability triangle. Assuming prior probabilities $\vec p = (1/3, 1/3, 1/3)$, the MaxEnt-updated probability vector $\vec p(c)$ lies on a curve connecting the mid-points of the sides 1-2 and 1-3, and the Bayes-updated probability vector $\vec p(c)$ lies on a straight line $p_1(c) = \text{const.}$ parallel to the triangle side 2-3. For $\nu = 3$ (uniform prior pdf), $p_1(c) = 0.5$.
The difference between the two updated probabilities is particularly striking for c = 1, i.e. for the condition $x_2 = x_3$. MaxEnt then leaves the probability $\vec p = (1/3, 1/3, 1/3)$ unchanged, while Bayes with $\nu = 3$ (uniform prior pdf) updates it to $\vec p(c) = (1/2, 1/4, 1/4)$.
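A short numerical side-by-side of (5.1) and (5.2) at c = 1 (an illustration of the statement above, assuming Python with numpy; the function names are mine):

```python
import numpy as np

def maxent_example2(c, p):                     # eq. (5.1), condition f_3 = c * f_2
    p1, p2, p3 = p
    eb = (c * p2 / p3) ** (1.0 / (1.0 + c))    # exp(beta)
    w = np.array([p1, p2 * eb ** (-c), p3 * eb])
    return w / w.sum()

def bayes_example2(c, p, nu):                  # eq. (5.2)
    p1, p2, p3 = p
    r = (p2 + p3 - 1.0 / nu) / (1.0 - 1.0 / nu)
    return np.array([p1 / (1.0 - 1.0 / nu), r / (1.0 + c), c * r / (1.0 + c)])

p = (1/3, 1/3, 1/3)
print(maxent_example2(1.0, p))                 # [1/3, 1/3, 1/3] -- unchanged
print(bayes_example2(1.0, p, nu=3))            # [0.5, 0.25, 0.25]
```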
6. Updated probabilities for three issues and the general condition $\vec b \cdot \vec f = 0$
The case of a general condition vector and three issues is mathematically more demanding.
6.1 Updating with MaxEnt
$Z(\beta) = p_1\exp(\beta b_1) + p_2\exp(\beta b_2) + p_3\exp(\beta b_3)$

$p_1(\bar b) = p_1\exp(\beta b_1)/Z(\beta)$,   $p_2(\bar b) = p_2\exp(\beta b_2)/Z(\beta)$,   $p_3(\bar b) = p_3\exp(\beta b_3)/Z(\beta)$

$0 = b_1\,p_1\exp(\beta b_1) + b_2\,p_2\exp(\beta b_2) + b_3\,p_3\exp(\beta b_3)$   (6.1)

In general, β must be evaluated numerically from the last formula.
Below we will only consider the case $p_1 = p_2 = p_3 = 1/3$.
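For a general $\vec b$, equation (6.1) is easily solved numerically; a minimal sketch follows (assuming Python with numpy; the left-hand side of (6.1) is monotonic in β, so bisection suffices), applied here to the condition vector drawn in figures 1 and 4.

```python
import numpy as np

def maxent_general(b, p=(1/3, 1/3, 1/3), lo=-50.0, hi=50.0):
    """Solve eq. (6.1) for beta by bisection and return p_m exp(beta b_m)/Z(beta)."""
    b, p = np.asarray(b, float), np.asarray(p, float)
    g = lambda beta: np.dot(b, p * np.exp(beta * b))      # condition (6.1): g(beta) = 0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    w = p * np.exp(0.5 * (lo + hi) * b)
    return w / w.sum()

print(maxent_general([-0.5, 0.1, 1.4]))        # approx. (0.50, 0.35, 0.15)
```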
6.2 Updating with Bayes' Rule
This time we calculate the updated probabilities with a Laplace transform:

$P(\vec k) \propto Z(\vec k) = \int dx_1 \dots \int dx_M\; Z(\vec x)\,\exp(-\vec k \circ \vec x)$

$P(\vec k\,|\,\bar b) \propto Z(\vec k\,|\,\bar b) = \int dx_1 \dots \int dx_M\; Z(\vec x\,|\,\bar b)\,\exp(-\vec k \circ \vec x)$

The Taylor coefficients of $\ln P(\vec k)$ and $\ln P(\vec k\,|\,\bar b)$ are the cumulants of $P(\vec x)$ and $P(\vec x\,|\,\bar b)$; the first cumulant is the prior or updated probability:

$p_m = -\dfrac{d\ln P}{dk_m}(\vec k = 0) = -\dfrac{d\ln Z}{dk_m}(\vec k = 0)$

$p_m(\bar b) = -\dfrac{d\ln P}{dk_m}(\vec k = 0\,|\,\bar b) = -\dfrac{d\ln Z}{dk_m}(\vec k = 0\,|\,\bar b)$

There follows the explicit calculation for the prior pdf $Z(\vec x)$ introduced in section 2 and the case of three issues (M = 3).
Prior pdf and probability:

Prior pdf:   $Z(\vec x) = \prod_m x_m^{\nu_m-1}\cdot\delta(1-|\vec x|)$

Use:   $\delta(1-|\vec x|) = \dfrac{1}{2\pi i}\int_C d\alpha\; \exp\big(\alpha\,(1-|\vec x|)\big)$

Laplace trafo:   $Z(\vec k) = \dfrac{1}{2\pi i}\int_C d\alpha\; \exp(\alpha)\prod_m (k_m+\alpha)^{-\nu_m}$

The integral can only be performed analytically for integer $\nu_m = 1, 2, \dots$.

For $\nu_1 = \nu_2 = \nu_3 = 1$, i.e. for the uniform prior pdf and $\vec p = (1/3, 1/3, 1/3)$, we obtain

Laplace trafo:   $Z(\vec k) = \dfrac{\exp(-k_1)}{(k_2-k_1)(k_3-k_1)} + \dfrac{\exp(-k_2)}{(k_1-k_2)(k_3-k_2)} + \dfrac{\exp(-k_3)}{(k_1-k_3)(k_2-k_3)}$

Test:   $Z(\vec k) \propto 1 - \dfrac{k_1+k_2+k_3}{3} + \dots$   ⇒   $p_1 = p_2 = p_3 = 1/3$ : okay.
Updated pdf and probability:

Updated pdf:   $Z(\vec x\,|\,\bar b) = \prod_m x_m^{\nu_m-1}\cdot\delta(1-|\vec x|)\cdot\delta(b_1 x_1 + b_2 x_2 + b_3 x_3)$

Use:   $\delta(\vec b\cdot\vec x) = \dfrac{1}{2\pi i}\int_C d\beta\; \exp\big(-\beta\,(\vec b\cdot\vec x)\big)$

Laplace trafo:   $Z(\vec k\,|\,\bar b = 0) = \dfrac{1}{2\pi i}\int_C d\beta\;\dfrac{1}{2\pi i}\int_C d\alpha\;\exp(\alpha)\prod_m (k_m + \alpha + \beta b_m)^{-\nu_m}$

Comparison:   $Z(\vec k\,|\,\bar b = 0) = \dfrac{1}{2\pi i}\int_C d\beta\; Z(\vec k + \beta\,\vec b)$
When replacing $\vec k$ by $\vec k + \beta\,\vec b$ in the expression for $Z(\vec k)$, we get terms with the exponential $\exp(-k_m - \beta b_m)$ decreasing in β if $b_m > 0$, and terms with the exponential growing in β if $b_m < 0$. When performing the β integration, the former terms will not contribute to the integral, since the Laplace contour (from $r - i\infty$ to $r + i\infty$, with r sufficiently large) can be moved to $r = +\infty$. We must therefore distinguish the cases of different signs of $b_1$, $b_2$ and $b_3$. But these are just the $\vec b$ domains we labelled I, II and III in figures 1 to 3! Specifically, we have

Domain I:   sign($\vec b$) = (−,+,+) or (+,−,−)
Domain II:   sign($\vec b$) = (+,−,+) or (−,+,−)
Domain III:   sign($\vec b$) = (+,+,−) or (−,−,+)

If we multiply $\vec b$-vectors with two negative components by −1, the condition $\vec b\cdot\vec x = 0$ remains unchanged, and the modified $\vec b$ has only one negative component. This means that the updated probability formulas obtained in the following calculation with $b_1 < 0$, $b_2 > 0$, $b_3 > 0$ are valid for the whole domain I.
The Laplace-transformed updated pdf is

$Z(\vec k\,|\,\bar b) = \dfrac{1}{2\pi i}\int_C d\beta\; \dfrac{\exp(-k_1 - \beta b_1)}{\big(k_2 - k_1 + \beta(b_2 - b_1)\big)\big(k_3 - k_1 + \beta(b_3 - b_1)\big)}$

Taylor developing up to first order in $\vec k$ gives

$Z(\vec k\,|\,\bar b) = \dfrac{-b_1}{(b_2-b_1)(b_3-b_1)}\left[1 - k_1\,\dfrac{b_2(b_3-b_1) + b_3(b_2-b_1)}{2(b_2-b_1)(b_3-b_1)} + k_2\,\dfrac{b_1}{2(b_2-b_1)} + k_3\,\dfrac{b_1}{2(b_3-b_1)}\right]$
Taking the logarithm and differentiating with respect to $\vec k$ gives

$p_1(\bar b) = \dfrac{b_2(b_3-b_1) + b_3(b_2-b_1)}{2(b_2-b_1)(b_3-b_1)}$,   $p_2(\bar b) = \dfrac{-b_1}{2(b_2-b_1)}$,   $p_3(\bar b) = \dfrac{-b_1}{2(b_3-b_1)}$   (6.2)

The corresponding formulas for domains II and III are obtained by cyclical permutation of the indices 1, 2, 3.
The special cases $x_1 = c$ (domain I) and $x_3 = c\cdot x_2$ (at the border of domains II and III) discussed earlier are contained in these formulas.
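As a numerical check of (6.2) (my own verification, assuming Python with numpy; the window eps again stands in for the delta function): for the condition vector of figures 1 and 4, the closed form agrees with a brute-force Monte Carlo evaluation of (2.10) for the uniform prior.

```python
import numpy as np

b = np.array([-0.5, 0.1, 1.4])                            # domain I: b_1 < 0 < b_2, b_3
b1, b2, b3 = b
p_formula = np.array([(b2*(b3 - b1) + b3*(b2 - b1)) / (2*(b2 - b1)*(b3 - b1)),
                      -b1 / (2*(b2 - b1)),
                      -b1 / (2*(b3 - b1))])               # eq. (6.2)

rng = np.random.default_rng(2)
x = rng.dirichlet([1, 1, 1], size=2_000_000)              # uniform prior on the triangle
p_mc = x[np.abs(x @ b) < 0.005].mean(axis=0)

print(np.round(p_formula, 3))    # [0.452, 0.417, 0.132]
print(np.round(p_mc, 3))         # agrees within Monte Carlo error
```

For the same $\vec b$, the MaxEnt sketch of section 6.1 gives approximately (0.50, 0.35, 0.15), so the two updates indeed differ.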
6.3 Comparison of Bayes and MaxEnt
There is no apparent similarity between the updated probabilities (6.1) and (6.2)!
For the geometrical interpretation of these formulas, we return to the example of section 3, figure 1.
figure 4: Probability triangle for a random experiment with three possible issues. Also shown is an arbitrary condition vector $\vec b = (-0.5, 0.1, 1.4)$, and the solution of $\vec b\cdot\vec x = 0$.
The Bayes-updated probability lies in the middle of the secant $\vec b\cdot\vec x = 0$.

To find the MaxEnt-updated probability, we draw lines of constant entropy (uncertainty) $H = -\sum_m p_m(\vec b)\,\ln\big(p_m(\vec b)/p_m\big)$ into the probability triangle. These isentropes are contour lines of an entropy mountain. The secant passes over the side of Mt Entropy. The updated probability lies at the highest point of the secant, where the secant is tangential to a contour line.
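The "Mt Entropy" picture can be reproduced numerically (an illustrative sketch, assuming Python with numpy; the secant end points below belong to the example vector $\vec b = (-0.5, 0.1, 1.4)$): walking along the secant and maximizing H recovers the MaxEnt update, while the midpoint of the secant is the Bayes update (6.2).

```python
import numpy as np

prior = np.array([1/3, 1/3, 1/3])
e1 = np.array([1/6, 5/6, 0.0])               # secant end points for b = (-0.5, 0.1, 1.4)
e2 = np.array([14/19, 0.0, 5/19])

t = np.linspace(1e-6, 1 - 1e-6, 100001)[:, None]
x = (1 - t) * e1 + t * e2                    # points on the secant
H = -np.sum(x * np.log(x / prior), axis=1)   # entropy relative to the prior

print(x[np.argmax(H)])                       # approx. (0.50, 0.35, 0.15): the MaxEnt update
print(0.5 * (e1 + e2))                       # (0.452, 0.417, 0.132): the Bayes update (6.2)
```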
7. The mapping of the condition vector $\vec b$ onto the updated probability $\vec p(\bar b)$
Updating a probability vector $\vec p \equiv (p_1, p_2, p_3)$ to $\vec p(\vec b)$ due to a new condition $\vec b\cdot\vec f = 0$ can be seen as a mapping of the condition vector $\vec b \equiv (b_1, b_2, b_3)$ onto the probability $\vec p(\vec b)$. Geometrically, the domain outside the probability triangle is mapped into the probability triangle.

Figure 5 shows that this mapping is bijective (at least for $\vec p = (1/3, 1/3, 1/3)$) when we update with MaxEnt.

Figure 6 shows that the mapping is surjective but not injective when we update with Bayes ($\nu = 3$). In this case, any point $\vec p$ of the probability triangle with $p_1, p_2, p_3 < 0.5$ is the image of three distinct condition vectors $\vec b$.
figure 5a&b:
MaxEnt mapping of the conditional domains I, II, III onto probability domains inside the triangle.
figure 6a&b:
Bayes mapping of the conditional domains I, II, III onto probability domains inside the triangle.