Updating probabilities with Maximum Entropy and with Bayes' Rule - a comparison

Herbert E. Müller, Switzerland, 2013

Abstract

This article is about random experiments with M possible issues and unknown limiting frequencies f_m (m = 1…M) of each issue. A person with prior information I assigns a probability p_m(I) to issue m in the next trial. Given additional information about the limiting frequencies, consisting of the equation b_1·f_1 + b_2·f_2 + … + b_M·f_M = b̄, this person will update the probability for issue m in the next trial to p_m(b̄, I). The updated probabilities can be calculated in two ways: with the Maximum Entropy principle, or with Bayes' rule. The results are usually different. In this article, the outcomes of the two methods are compared for the simplest non-trivial case: a random experiment with three issues.

Contents

1. Statement of the premises and the goal
2. Updating probabilities with Maximum Entropy and with Bayes' rule
3. Random experiments with three issues
4. Example 1: One frequency is given: f_1 = c
5. Example 2: The ratio of two frequencies is given: f_3 = c·f_2
6. Updated probabilities for three issues and the general condition b·f = 0
7. The mapping of the condition vector b onto the updated probability p(b̄)

1. Statement of the premises and the goal

This article is about random experiments of the "dice" type. Such an experiment has M possible issues, labelled m = 1…M; it can be repeated arbitrarily often; the results of N repeated experiments are encoded in the vector n ≡ (n_1, n_2, …, n_M) (issue m occurred n_m times, with |n| ≡ Σ_m n_m = N); and we admit that there exists a limiting frequency of each issue, f_m ≡ lim (n_m/N) as N → ∞. f_m is the probability of issue m in the next dice throw for a person who has analysed a very large number of dice throws.

Since we usually do not know the exact limiting frequencies f, we introduce a corresponding vector of probability variables x ≡ (x_1, x_2, …, x_M), with x_m ≥ 0 and |x| ≡ Σ_m x_m = 1. Our knowledge about the limiting frequencies, given some prior information I, is expressed mathematically by the probability distribution function (pdf) P(x|I). To obtain the probability of issue m in the next dice throw, we calculate the mean value of x_m:

  p_m(I) ≡ ∫ dx_1 … ∫ dx_M x_m·P(x|I).

New information about the dice may consist in the outcome of N further dice throws. In this case, we update the pdf and the probabilities with Bayes' rule:

  P(x|n, I) ∼ P(n|x, I)·P(x|I) ∼ ∏_m x_m^(n_m)·P(x|I),   and   p_m(n, I) ≡ ∫ dx_1 … ∫ dx_M x_m·P(x|n, I).

Consider now a different kind of new evidence. Let b = (b_1, b_2, …, b_M) be a vector of numbers b_m associated with the issues m. We are told the expectation value b̄ ≡ b·f = b_1·f_1 + b_2·f_2 + … + b_M·f_M of b. (The person who gave us this information threw the dice a large number of times and calculated b̄ to high precision.) In this case we can update our starting probabilities p (= p(I); from now on we drop the prior information I) in two ways: again by applying Bayes' rule, or, as is more commonly done in this situation, by applying the Maximum Entropy (MaxEnt) principle. The updated probabilities p(b̄) will in general come out different for the two methods. The reason is simple: the MaxEnt-updated probabilities depend only on the prior probabilities p and on the average b̄; the Bayes-updated probabilities additionally depend on the details of the prior pdf. The following two questions naturally arise.

Question: Which updating method is better, MaxEnt or Bayes? Answer: Bayes! (See e.g. the publications of John Skilling.)

Question: Is there a prior pdf P(x|I) for which the two methods give the same result? Answer: ??? (My guess is: No.)

The aim of this article is not to give any deep insights, but to illustrate the different outcomes of the two methods in the simplest non-trivial example, which is a die with three numbers (or a random experiment with three issues).
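As a numerical illustration of the count-based update, here is a minimal Python sketch. It assumes the prior pdf that will be introduced in section 2.2, which is a Dirichlet distribution with parameters ν_m = p_m·ν; the function name, the value ν = 3 and the example counts are illustrative only, not part of the argument.

import numpy as np

def bayes_update_counts(p_prior, counts, nu=3.0):
    """Posterior mean of x after observing the counts n, assuming the
    Dirichlet prior P(x|I) ~ prod_m x_m^(nu_m - 1) with nu_m = p_m * nu
    (the prior family of section 2.2).  The posterior is again Dirichlet,
    so the mean is available in closed form."""
    p_prior = np.asarray(p_prior, dtype=float)
    counts = np.asarray(counts, dtype=float)
    nu_m = p_prior * nu                    # prior Dirichlet parameters
    return (nu_m + counts) / (nu + counts.sum())

# Example: uniform prior over three issues, then ten throws with results (2, 3, 5)
print(bayes_update_counts([1/3, 1/3, 1/3], [2, 3, 5]))   # ~ (0.231, 0.308, 0.462)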
2. Updating probabilities with Maximum Entropy and with Bayes' rule

2.1 Updating probabilities with the MaxEnt principle

The MaxEnt principle provides the following formulas for the updated probabilities:

  p_m(b̄) = p_m·exp(β·b_m)/Z(β)                                          (2.1)

with

  Z(β) = Σ_m p_m·exp(β·b_m)                                             (2.2)

and β given by

  b̄ = d ln Z / dβ.                                                      (2.3)

The prior and updated pdf are not needed.

2.2 Updating probabilities with Bayes' rule

The updated probabilities now depend on the prior pdf. We use the mathematically most convenient form

  P(x_1, …, x_M) = Z(x)/Z                                               (2.4)

with

  Z(x) = ∏_m x_m^(ν_m - 1)·δ(1 - |x|)                                   (2.5)

and

  Z = ∫_0^∞ dx_1 … ∫_0^∞ dx_M Z(x) = Γ(ν_1)·…·Γ(ν_M)/Γ(|ν|) ≡ B(ν_1, …, ν_M).          (2.6)

Here B(ν_1, …, ν_M) is the (multivariate) Euler Beta function.

The first prior probability is

  p_1 = ∫_0^∞ dx_1 … ∫_0^∞ dx_M x_1·Z(x)/Z = B(ν_1 + 1, ν_2, …, ν_M)/B(ν_1, …, ν_M) = ν_1/|ν|.

It follows that ν_m = p_m·ν, with ν = |ν| = ν_1 + … + ν_M. The parameter values ν_1 = … = ν_M = 1 lead to the uniform prior pdf and p_m = 1/M (Laplace's principle of indifference).

The updated pdf is

  P(x|b̄) = Z(x|b̄)/Z(b̄)                                                 (2.7)

with

  Z(x|b̄) = ∏_m x_m^(ν_m - 1)·δ(1 - |x|)·δ(b̄ - b·x)                     (2.8)

and

  Z(b̄) = ∫_0^∞ dx_1 … ∫_0^∞ dx_M Z(x|b̄).                               (2.9)

The updated probabilities are

  p_m(b̄) = ∫ dx_1 … ∫ dx_M x_m·Z(x|b̄)/Z(b̄).                            (2.10)

The last two integrals can only be evaluated in closed form for special values of b and b̄.
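The MaxEnt update (2.1)-(2.3), by contrast, can be evaluated numerically for any b and b̄. The following minimal Python sketch (assuming NumPy and SciPy are available) obtains β from (2.3) by root finding; the function name and the bracketing strategy are illustrative choices, not part of the method.

import numpy as np
from scipy.optimize import brentq

def maxent_update(p, b, b_bar):
    """MaxEnt update (2.1)-(2.3): p_m(b_bar) = p_m exp(beta*b_m)/Z(beta),
    with beta fixed by d ln Z / d beta = b_bar.  A sketch only; b_bar must
    lie strictly between min(b) and max(b)."""
    p, b = np.asarray(p, float), np.asarray(b, float)

    def excess(beta):                      # d ln Z / d beta  minus  b_bar
        w = p * np.exp(beta * b)
        return np.dot(b, w) / w.sum() - b_bar

    lo, hi = -1.0, 1.0
    while excess(lo) > 0: lo *= 2          # widen the bracket until it
    while excess(hi) < 0: hi *= 2          # encloses the root
    beta = brentq(excess, lo, hi)
    w = p * np.exp(beta * b)
    return w / w.sum()

# Example 1 of section 4: b = (1, 0, 0), b_bar = c = 0.6, uniform prior
print(maxent_update([1/3, 1/3, 1/3], [1, 0, 0], 0.6))   # ~ (0.6, 0.2, 0.2), cf. (4.1)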
3. Random experiments with three issues

3.1 Geometrical description

figure 1: Probability triangle for a random experiment with three possible issues. Also shown is a normalized condition vector b = (-0.5, 0.1, 1.4) and the intersection |x| = 1 ∧ b·x = 0.

The probability vectors x and p satisfy |x| = |p| = 1 and thus lie in the triangle 1-2-3 spanned by the three unit vectors (1,0,0), (0,1,0) and (0,0,1). Figure 1 shows the top view onto this probability triangle, along the diagonal (1,1,1).

The condition b·x = b̄ can always be reformulated as b·x = 0, by subtracting b̄ from all components of b. The overall scale of b is then arbitrary and may be fixed by a convenient normalization. The condition vector b now lies in the same plane as x and p. The equation b·x = 0 describes a plane normal to b containing the origin of the three coordinate vectors. The intersection of this plane with the probability triangle is a secant, shown in blue in the figure. The limiting frequency f and the updated probability p(b̄) lie on this secant.

In order that the intersection |x| = 1 ∧ b·x = 0 is non-empty, at least one and at most two of the components of b must be negative. As a consequence, b always lies outside the probability triangle. We have thus established a bijective mapping of the points outside the probability triangle onto the secants in the probability triangle. The secant end points lie either on the sides 3-1-2 (case I), or 1-2-3 (case II), or 2-3-1 (case III). Accordingly, the b domain of the plane (outside of the probability triangle) is decomposed into three domains I, II and III, as shown in figure 1.

4. Example 1: One frequency is given: f_1 = c

The condition is b·x = b̄, with b = (1, 0, 0) and b̄ = c. The normalized condition is b·x = 0, with b = (1 - c, -c, -c)/(1 - 3c).

4.1 Updating with MaxEnt

  Updated p's:  p_1(c) = p_1·exp(β)/Z(β),   p_2(c) = p_2/Z(β),   p_3(c) = p_3/Z(β)
  State sum:    Z(β) = p_1·exp(β) + p_2 + p_3
  Condition:    c = p_1·exp(β)/Z(β)   ⟹   exp(β) = (1 - p_1)·c / (p_1·(1 - c))
  Updated p's:  p_1(c) = c,   p_2(c) = p_2·(1 - c)/(1 - p_1),   p_3(c) = p_3·(1 - c)/(1 - p_1)          (4.1)

4.2 Updating with Bayes' rule

  pdf:            Z(x|c) = ∏_m x_m^(ν_m - 1)·δ(1 - |x|)·δ(c - x_1)
  Normalization:  Z(c) = c^(ν_1 - 1)·(1 - c)^(ν_2 + ν_3 - 1)·B(ν_2, ν_3)
  Updated p's:    p_1(c)·Z(c) = c^(ν_1)·(1 - c)^(ν_2 + ν_3 - 1)·B(ν_2, ν_3)        ⟹   p_1(c) = c
                  p_2(c)·Z(c) = c^(ν_1 - 1)·(1 - c)^(ν_2 + ν_3)·B(ν_2 + 1, ν_3)    ⟹   p_2(c) = p_2·(1 - c)/(1 - p_1)
                  p_3(c)·Z(c) = c^(ν_1 - 1)·(1 - c)^(ν_2 + ν_3)·B(ν_2, ν_3 + 1)    ⟹   p_3(c) = p_3·(1 - c)/(1 - p_1)          (4.2)

The updated probability is independent of the free parameter ν.

4.3 Comparison of Bayes and MaxEnt

Both methods give the same updated probabilities!

figure 2: The updated probability vector p(c) (o) as predicted both by MaxEnt and Bayes, for p = (1/3, 1/3, 1/3) and f_1 = c given. The normalized condition vector b = (1 - c, -c, -c)/(1 - 3c) is perpendicular to the triangle side x_1 = 0, and the secant x_1 = c (shown for c = 0.1, 0.3, …, 0.9) is parallel to the triangle side x_1 = 0. Assuming prior probabilities p = (1/3, 1/3, 1/3), the updated probability p(c) lies in the centre of the secant.

5. Example 2: The ratio of two frequencies is given: f_3 = c·f_2

The condition is b·x = b̄, with b = (0, c, -1) and b̄ = 0. The normalized condition is b·x = 0, with b = (0, c, -1)/(1 + c).

5.1 Updating with MaxEnt

  Updated p's:  p_1(c) = p_1/Z(β),   p_2(c) = p_2·exp(β·c/(1 + c))/Z(β),   p_3(c) = p_3·exp(-β/(1 + c))/Z(β)
  State sum:    Z(β) = p_1 + p_2·exp(β·c/(1 + c)) + p_3·exp(-β/(1 + c))
  Condition:    0 = [c·p_2·exp(β·c/(1 + c)) - p_3·exp(-β/(1 + c))]/(1 + c)   ⟹   exp(β) = p_3/(c·p_2)
  Updated p's:  p_1(c) = p_1/Z(β),   p_2(c) = p_2·(c·p_2/p_3)^(-c/(1 + c))/Z(β),   p_3(c) = p_3·(c·p_2/p_3)^(1/(1 + c))/Z(β),
                with Z(β) = p_1 + p_2·(c·p_2/p_3)^(-c/(1 + c)) + p_3·(c·p_2/p_3)^(1/(1 + c))          (5.1)

5.2 Updating with Bayes' rule

  pdf:            Z(x|c) = ∏_m x_m^(ν_m - 1)·δ(1 - |x|)·δ(x_3 - c·x_2)
  Normalization:  Z(c) = c^(ν_3 - 1)/(1 + c)^(ν_2 + ν_3 - 1)·B(ν_1, ν_2 + ν_3 - 1)
  Updated p's:    p_1(c)·Z(c) = c^(ν_3 - 1)/(1 + c)^(ν_2 + ν_3 - 1)·B(ν_1 + 1, ν_2 + ν_3 - 1)   ⟹   p_1(c) = p_1/(1 - 1/ν)
                  p_2(c)·Z(c) = c^(ν_3 - 1)/(1 + c)^(ν_2 + ν_3)·B(ν_1, ν_2 + ν_3)               ⟹   p_2(c) = (p_2 + p_3 - 1/ν) / ((1 + c)·(1 - 1/ν))
                  p_3(c)·Z(c) = c^(ν_3)/(1 + c)^(ν_2 + ν_3)·B(ν_1, ν_2 + ν_3)                   ⟹   p_3(c) = c·(p_2 + p_3 - 1/ν) / ((1 + c)·(1 - 1/ν))          (5.2)

The updated probability depends on the free parameter ν.

5.3 Comparison of Bayes and MaxEnt

This time the two methods give quite different updated probabilities!

figure 3: The updated probability vector p(c) as predicted by MaxEnt (x) and Bayes (o), for p = (1/3, 1/3, 1/3) and f_3/f_2 = c given. The normalized condition vector b = (0, c, -1)/(1 + c) lies on the prolonged triangle side 2-3. The secant x_3 = c·x_2 ends in the corner 1 of the probability triangle. Assuming prior probabilities p = (1/3, 1/3, 1/3), the MaxEnt-updated probability vector p(c) lies on a curve connecting the mid-points of the sides 1-2 and 1-3, and the Bayes-updated probability vector p(c) lies on a straight line p_1(c) = const. parallel to the triangle side 2-3. For ν = 3 (uniform prior pdf), p_1(c) = 0.5.

The difference between the two updated probabilities is particularly striking for c = 1, i.e. for the condition x_2 = x_3. MaxEnt then leaves the probability p = (1/3, 1/3, 1/3) unchanged, while Bayes with ν = 3 (uniform prior pdf) updates it to p(c) = (1/2, 1/4, 1/4).
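This contrast can be checked numerically with a small sketch of the closed-form updates (5.1) and (5.2); the function names and the chosen value c = 1 are illustrative.

import numpy as np

def maxent_example2(p, c):
    """MaxEnt update (5.1) for the condition f3 = c*f2."""
    p1, p2, p3 = p
    r = c * p2 / p3                        # common factor, cf. (5.1)
    w = np.array([p1,
                  p2 * r ** (-c / (1 + c)),
                  p3 * r ** (1 / (1 + c))])
    return w / w.sum()

def bayes_example2(p, c, nu=3.0):
    """Bayes update (5.2) for the condition f3 = c*f2, with a Dirichlet prior
    of total weight nu (nu = 3 is the uniform prior for three issues)."""
    p1, p2, p3 = p
    q1 = p1 / (1 - 1 / nu)
    q2 = (p2 + p3 - 1 / nu) / ((1 + c) * (1 - 1 / nu))
    return np.array([q1, q2, c * q2])

p = [1/3, 1/3, 1/3]
print(maxent_example2(p, 1.0))   # ~ (1/3, 1/3, 1/3): the prior is unchanged
print(bayes_example2(p, 1.0))    # ~ (1/2, 1/4, 1/4)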
6. Updated probabilities for three issues and the general condition b·f = 0

The case of a general condition vector and three issues is mathematically more demanding.

6.1 Updating with MaxEnt

  Z(β) = p_1·exp(β·b_1) + p_2·exp(β·b_2) + p_3·exp(β·b_3)
  p_1(b̄) = p_1·exp(β·b_1)/Z(β),   p_2(b̄) = p_2·exp(β·b_2)/Z(β),   p_3(b̄) = p_3·exp(β·b_3)/Z(β)
  0 = b_1·p_1·exp(β·b_1) + b_2·p_2·exp(β·b_2) + b_3·p_3·exp(β·b_3)          (6.1)

In general, β must be evaluated numerically from the last equation. Below we will only consider the case p_1 = p_2 = p_3 = 1/3.

6.2 Updating with Bayes' rule

This time we calculate the updated probabilities with a Laplace transform:

  P(k) ∝ Z(k) = ∫ dx_1 … ∫ dx_M Z(x)·exp(k·x)
  P(k|b̄) ∝ Z(k|b̄) = ∫ dx_1 … ∫ dx_M Z(x|b̄)·exp(k·x)

The Taylor coefficients of ln P(k) and ln P(k|b̄) are the cumulants of P(x) and P(x|b̄); the first cumulant is the prior or updated probability:

  p_m = d ln P / dk_m (k = 0) = d ln Z / dk_m (k = 0)
  p_m(b̄) = d ln P(k|b̄) / dk_m (k = 0) = d ln Z(k|b̄) / dk_m (k = 0)

There follows the explicit calculation for the prior pdf Z(x) introduced in section 2 and the case of three issues (M = 3).

Prior pdf and probability:

  Prior pdf:      Z(x) = ∏_m x_m^(ν_m - 1)·δ(1 - |x|)
  Use:            δ(1 - |x|) = (1/2πi) ∫_C dα exp(α(1 - |x|))
  Laplace trafo:  Z(k) = (1/2πi) ∫_C dα exp(α)·∏_m Γ(ν_m)·(α - k_m)^(-ν_m)

The integral can only be performed analytically for integer ν_m = 1, 2, … . For ν_1 = ν_2 = ν_3 = 1, i.e. for the uniform prior pdf and p = (1/3, 1/3, 1/3), we obtain

  Z(k) = exp(k_1)/((k_2 - k_1)(k_3 - k_1)) + exp(k_2)/((k_1 - k_2)(k_3 - k_2)) + exp(k_3)/((k_1 - k_3)(k_2 - k_3))

  Test:  Z(k) ∝ 1 + (k_1 + k_2 + k_3)/3 + …   ⟹   p_1 = p_2 = p_3 = 1/3 : okay.

Updated pdf and probability:

  Updated pdf:    Z(x|b̄ = 0) = ∏_m x_m^(ν_m - 1)·δ(1 - |x|)·δ(b_1·x_1 + b_2·x_2 + b_3·x_3)
  Use:            δ(b·x) = (1/2πi) ∫_C dβ exp(-β(b·x))
  Laplace trafo:  Z(k|b̄ = 0) = (1/2πi) ∫_C dβ (1/2πi) ∫_C dα exp(α)·∏_m Γ(ν_m)·(α - k_m + β·b_m)^(-ν_m)
  Comparison:     Z(k|b̄ = 0) = (1/2πi) ∫_C dβ Z(k - β·b)

When replacing k by k - β·b in the expression for Z(k), we get terms whose exponential exp(-β·b_m) decreases with Re β if b_m > 0, and terms whose exponential grows with Re β if b_m < 0. When performing the β integration, the former terms do not contribute to the integral, since for them the Laplace contour (from r - i∞ to r + i∞, with r sufficiently large) can be moved to r = +∞. We must therefore distinguish the cases of different signs of b_1, b_2 and b_3. But these are just the b domains we labelled I, II and III in figures 1 to 3! Specifically, we have

  Domain I:    sign(b) = (-,+,+) or (+,-,-)
  Domain II:   sign(b) = (+,-,+) or (-,+,-)
  Domain III:  sign(b) = (+,+,-) or (-,-,+)

If we multiply b vectors with two negative components by -1, the condition b·x = 0 remains unchanged, and the modified b has only one negative component. This means that the updated probability formulas obtained in the following calculation with b_1 < 0, b_2 > 0, b_3 > 0 are valid for the whole domain I.

The Laplace-transformed updated pdf is

  Z(k|b̄) = (1/2πi) ∫_C dβ exp(k_1 - β·b_1) / ((k_2 - k_1 - β(b_2 - b_1))·(k_3 - k_1 - β(b_3 - b_1)))

Taylor developing up to first order in k gives

  Z(k|b̄) = -b_1/((b_2 - b_1)(b_3 - b_1)) · [ 1 + (b_2(b_3 - b_1) + b_3(b_2 - b_1))/(2(b_2 - b_1)(b_3 - b_1))·k_1 - b_1/(2(b_2 - b_1))·k_2 - b_1/(2(b_3 - b_1))·k_3 + … ]

Taking the logarithm and differentiating with respect to k gives

  p_1(b̄) = (b_2(b_3 - b_1) + b_3(b_2 - b_1)) / (2(b_2 - b_1)(b_3 - b_1)),   p_2(b̄) = -b_1 / (2(b_2 - b_1)),   p_3(b̄) = -b_1 / (2(b_3 - b_1))          (6.2)

The corresponding formulas for domains II and III are obtained by cyclic permutation of the indices 1, 2, 3. The special cases x_1 = c (domain I) and x_3 = c·x_2 (on the border between domains II and III) discussed earlier are contained in these formulas.
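Formula (6.2) can be cross-checked numerically: for the uniform prior, the two delta functions in (2.8) restrict the integral (2.10) to the secant, which can be parameterized and integrated directly. The following Python sketch (assuming NumPy; helper names are illustrative) compares such a brute-force integration with (6.2) for a domain-I condition vector.

import numpy as np

def bayes_update_secant(b, nu=(1.0, 1.0, 1.0), npts=20001):
    """Brute-force Bayes update (2.10) for three issues and the condition
    b.x = 0: integrate x weighted by prod_m x_m^(nu_m - 1) along the secant
    {x >= 0, |x| = 1, b.x = 0}.  The Jacobian of the two linear delta
    functions is constant along the secant, so it cancels in the mean."""
    b = np.asarray(b, float)
    d = np.cross([1.0, 1.0, 1.0], b)          # direction of the secant
    # a particular solution of  sum(x) = 1, b.x = 0  (minimum norm, x >= 0 not yet imposed)
    A = np.array([[1.0, 1.0, 1.0], b])
    x0 = np.linalg.lstsq(A, np.array([1.0, 0.0]), rcond=None)[0]
    # parameter range for which x0 + t*d stays inside the probability triangle
    t_bounds = -x0 / d                        # (assumes no component of d vanishes)
    t_lo = t_bounds[d > 0].max()
    t_hi = t_bounds[d < 0].min()
    t = np.linspace(t_lo, t_hi, npts)[1:-1]   # open interval avoids the end points
    x = x0 + np.outer(t, d)                   # points on the secant
    w = np.prod(x ** (np.asarray(nu) - 1.0), axis=1)
    return (x * w[:, None]).sum(axis=0) / w.sum()

def bayes_domain_I(b):
    """Closed formula (6.2), valid for b1 < 0 < b2, b3 and the uniform prior."""
    b1, b2, b3 = b
    return np.array([(b2*(b3 - b1) + b3*(b2 - b1)) / (2*(b2 - b1)*(b3 - b1)),
                     -b1 / (2*(b2 - b1)),
                     -b1 / (2*(b3 - b1))])

b = np.array([-0.5, 0.1, 1.4])                # a domain-I condition vector (cf. figure 1)
print(bayes_update_secant(b))                 # numerical integration  ~ (0.452, 0.417, 0.132)
print(bayes_domain_I(b))                      # formula (6.2): the same numbers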
6.3 Comparison of Bayes and MaxEnt

There is no apparent similarity between the updated probabilities (6.1) and (6.2)! For the geometrical interpretation of these formulas, we return to the example of section 3, figure 1.

figure 4: Probability triangle for a random experiment with three possible issues. Also shown is an arbitrary condition vector b = (-0.5, 0.1, 1.4) and the solution of b·x = 0.

The Bayes-updated probability (for the uniform prior pdf) lies in the middle of the secant b·x = 0. To find the MaxEnt-updated probability, we draw lines of constant entropy (uncertainty)

  H(x) = -Σ_m x_m·ln(x_m/p_m)

into the probability triangle. These isentropes are the contour lines of an entropy mountain. The secant passes over the flank of Mt Entropy. The updated probability lies at the highest point of the secant, where the secant is tangent to a contour line.

7. The mapping of the condition vector b onto the updated probability p(b̄)

Updating a probability vector p ≡ (p_1, p_2, p_3) to p(b̄) due to a new condition b·f = 0 can be seen as a mapping of the condition vector b ≡ (b_1, b_2, b_3) onto the updated probability p(b̄). Geometrically, the domain outside the probability triangle is mapped into the probability triangle.

Figure 5 shows that this mapping is bijective (at least for p = (1/3, 1/3, 1/3)) when we update with MaxEnt.

figure 5a&b: MaxEnt mapping of the condition domains I, II, III onto probability domains inside the triangle.

Figure 6 shows that the mapping is surjective but not injective when we update with Bayes (ν = 3). In this case, any point p of the probability triangle with p_1, p_2, p_3 < 0.5 is the image of three distinct condition vectors b.

figure 6a&b: Bayes mapping of the condition domains I, II, III onto probability domains inside the triangle.
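The non-injectivity of the Bayes map can be made explicit by inverting formula (6.2): for a target p with all components below 1/2, each domain contributes one condition vector (up to its irrelevant scale) that is mapped onto p. The following Python sketch (helper names and the target value are illustrative) constructs the three preimages and verifies them.

import numpy as np

def bayes_map(b):
    """Bayes map b -> p(b) of section 7 (uniform prior): identify the domain
    from the sign pattern of b and apply formula (6.2) with the indices
    permuted so that the single negative component plays the role of b1."""
    b = np.asarray(b, float)
    if (b < 0).sum() == 2:          # two negative components: use -b instead
        b = -b
    i = int(np.argmin(b))           # index of the single negative component
    j, k = (i + 1) % 3, (i + 2) % 3
    p = np.empty(3)
    p[j] = -b[i] / (2 * (b[j] - b[i]))
    p[k] = -b[i] / (2 * (b[k] - b[i]))
    p[i] = 1.0 - p[j] - p[k]
    return p

def preimages(p):
    """The three condition vectors (one per domain, up to scale) that the
    Bayes map sends to the same p; a direct inversion of (6.2), possible
    whenever all components of p are below 1/2."""
    p = np.asarray(p, float)
    out = []
    for i in range(3):
        j, k = (i + 1) % 3, (i + 2) % 3
        b = np.empty(3)
        b[i] = -1.0
        b[j] = 1.0 / (2 * p[j]) - 1.0
        b[k] = 1.0 / (2 * p[k]) - 1.0
        out.append(b)
    return out

target = np.array([0.40, 0.35, 0.25])
for b in preimages(target):
    print(b, '->', bayes_map(b))    # three different b, all mapped onto the target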