ON SHANNON-JAYNES ENTROPY AND FISHER INFORMATION
Vesselin I. Dimitrov
Idaho Accelerator Center, Idaho State University, Pocatello, ID
[email protected]
MaxEnt 2007, Saratoga Springs, July 8-12, 2007

The MAXENT Paradigm
• State of knowledge: a probability density $p(x)$.
• Data: constraints $\int dx\, p(x)\, C_i(x) = d_i$.
• Quantification: information content $S[p(x)]$; information distance $S[p_1(x), p_2(x)]$.
• MAXENT (constructive): minimize the information content subject to the available data/constraints,
$$S[p(x)] + \sum_i \lambda_i \int dx\, p(x)\, C_i(x) \;\to\; \min_{p(x)}.$$
• MAXENT (learning rule): minimize the information distance subject to the available data/constraints,
$$S[p_1(x), p_0(x)] + \sum_i \lambda_i \int dx\, p_1(x)\, C_i(x) \;\to\; \min_{p_1(x)}.$$

Motivation: Physics Laws as Rules of Inference
• Why would universal physics laws exist? Who is enforcing them? The more universal a law is, the less likely it is to actually exist. The laws of thermodynamics were once thought to have universal validity but turned out to express little more than handling one's ignorance in a relatively honest way (Jaynes '57). Moreover, the most fundamental aspects of the thermodynamic laws are insensitive to the particular choice of information distance!
• Mechanics: free particle (T. Toffoli '98); particle in a force field (*).
• Electrodynamics: free field (*); field with sources (*).
• Thermodynamics: entropy (E. Jaynes 1957); Fisher information (Frieden, Plastino², Soffer 1999); general criterion (Plastino² 1997, T. Yamano 2000).
• Quantum mechanics: non-relativistic (Frieden '89, ?); relativistic (?).
• Quantum field theory: (?).

Problems with the Standard Characterizations
• Shannon-Jaynes entropy: problems with continuous variables; in hindsight, similar problems even with discrete variables; the constraints are not likely to come in the form of true average values.
• Shannon-Jaynes distance (Shore & Johnson): Karbelkar '86 points out that there is a problem with the constraints as average values; Uffink '95 takes issue with S&J's additivity axiom, which I believe is misguided (Caticha '06). The real problem with S&J's derivation: they prove that the correct form of the relevant functional is
$$H[q(x), p(x)] = g\big(F[q(x), p(x)]\big), \qquad F[q(x), p(x)] = \int dx\, q(x)\, f\!\left(\frac{q(x)}{p(x)}\right),$$
where $g(\cdot)$ is a monotonic function. Then, however, they declare $H$ and $F$ equivalent and require additivity for $F$ only, thus excluding the possibility that $F$ factorizes and $g(xy) = g(x) + g(y)$. This ignored possibility is precisely the general Rényi case!
• Ván's derivation: apparently the first to consider dependence also on the derivatives of $p(x)$; derives the functional as a linear combination of a Shannon-Jaynes term and a Fisher-information term; the derivation is deficient, handling functionals as if they were functions.
• A new characterization from scratch is highly desirable!
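A minimal numerical sketch of the MAXENT learning rule, assuming a Gaussian prior and a quadratic constraint on a grid (the prior, constraint function, and target value are illustrative choices, not taken from the talk):

```python
# Minimize the Shannon-Jaynes distance S[p, m] = ∫ p ln(p/m) dx subject to
# ∫ p dx = 1 and <C> = ∫ p C dx = d.  The minimizer is p(x) ∝ m(x) exp(-λ C(x)),
# with λ fixed by the constraint value d.
import numpy as np
from scipy.optimize import brentq

x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

m = np.exp(-x**2 / 8.0)                  # assumed prior: zero-mean Gaussian, variance 4
m /= m.sum() * dx
C = x**2                                 # assumed constraint function
d = 1.5                                  # assumed constraint value <x^2> = 1.5

def posterior(lam):
    w = m * np.exp(-lam * C)
    return w / (w.sum() * dx)

def constraint_gap(lam):
    return (posterior(lam) * C).sum() * dx - d

lam = brentq(constraint_gap, 0.0, 50.0)  # prior variance 4 > 1.5, so a root with λ > 0 exists
p = posterior(lam)
print(f"lambda = {lam:.4f}   <C> = {(p * C).sum() * dx:.4f}")
```

The exponential form used here is exactly the Shannon-Jaynes updating rule derived later in the talk; the only numerical work is solving the one-dimensional equation for the multiplier.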
Inadmissible and Admissible Variations
[Figure: two sketches of densities $p_1(x)$, $p_2(x)$ and their difference $\delta p(x) = p_1(x) - p_2(x)$, one showing $|\delta p(x)|$ and the other an admissible variation $\delta_1 p(x)$.]
Variations of the first kind should not be allowed, because they are physically insignificant; yet they are included if we write $S[p] = S[p(x)]$. Excluding the former and allowing the latter variations is achieved by writing $S[p] = S[p(x), \nabla p(x)]$.

Axioms & Consequences
• Locality: $S[p] = g\!\left(\int dx\, p\, f(x, p, \nabla p)\right)$, equivalently $S[p] = g\!\left(\int dx\, f(x, p, \nabla p)\right)$.
• Expandability: $\lim_{p\to 0} p\, f(x, p, \nabla p) = 0$.
• Additivity (simple): $S[p_1 p_2] = S[p_1] + S[p_2]$, with either $g(x + y) = g(x) + g(y)$ or $g(xy) = g(x) + g(y)$.
• Consequence:
$$S[p] = c_1 \ln \int dx\, p(x) \left(\frac{p(x)}{m_1(x)}\right)^{\nu - 1} \exp\!\left[c_0\left(\nabla\ln\frac{p(x)}{m_2(x)}\right)^2\right].$$
• Information "distance" interpretation ($m_1(x), m_2(x) \to m(x)$):
$$S[p, m] = \int dc_0\, dc_1\, d\nu\; W(c_0, c_1, \nu)\; c_1 \ln \int dx\, p(x)\left(\frac{p(x)}{m(x)}\right)^{\nu - 1} \exp\!\left[c_0\left(\nabla\ln\frac{p(x)}{m(x)}\right)^2\right].$$

Axiom (ii): Conditional Probabilities
• Let the data fix $y$ sharply: $\int dx\, p(x, y) = \delta(y - y_0)$. Consistency with probability theory demands that updating $m(x, y)$ with this constraint reproduce Kolmogorov's rule,
$$p(x, y) = \delta(y - y_0)\,\frac{m(x, y_0)}{m(y_0)}, \qquad p(x \mid y_0) = \frac{m(x, y_0)}{m(y_0)}, \qquad m(y_0) = \int dx\, m(x, y_0).$$
• For $S[p] = g\!\left(\int dx\, dy\, p\, f(p/m)\right)$ the stationarity condition reads $f(p/m) + (p/m)\, f'(p/m) = \Lambda(y)$, while for $S[p] = g\!\left(\int dx\, dy\, f(p/m)\right)$ it reads $f'(p/m) = \Lambda(y)\, m(x, y)$, with $\Lambda(y)$ the multiplier of the $y$-dependent constraint.
• With derivative dependence, write $q(x) \equiv \ln\frac{p(x)}{m(x)}$ and $S[p] = g\!\left(\int dx\, p\, f(q, (\nabla q)^2)\right)$. The variation then produces, besides the local terms $f + f_{,1}$ and the divergence term $\nabla\!\cdot\!(f_{,2}\nabla q)$, a term proportional to $f_{,2}\,\nabla q\cdot\nabla\ln m$. The term with $\ln m$ can cause potential trouble with reproducing Kolmogorov's rule unless either $m = \mathrm{const}$ or else $f_{,2} = 0$, which implies $c_0 = 0$.

General Consequences of Additivity
With $q(x) \equiv \ln\frac{p(x)}{m(x)}$ and $S[p] = g\!\left(\int dx\, p\, f(q, (\nabla q)^2)\right)$, a product distribution gives $q(x_1, x_2) = q_1(x_1) + q_2(x_2)$ and $(\nabla q)^2 = (\nabla_1 q_1)^2 + (\nabla_2 q_2)^2$.
a) $g(x + y) = g(x) + g(y) \Rightarrow g(x) = c\,x$. Additivity then requires
$$f(q_1 + q_2,\, z_1 + z_2) = f(q_1, z_1) + f(q_2, z_2) \quad\Rightarrow\quad f(q, z) = q + \gamma z,$$
so that
$$S[p] = c\int dx\, p\left[\ln\frac{p}{m} + \gamma\left(\nabla\ln\frac{p}{m}\right)^2\right].$$
b) $g(xy) = g(x) + g(y) \Rightarrow g(x) = c\ln x$. Additivity then requires
$$f(q_1 + q_2,\, z_1 + z_2) = f(q_1, z_1)\, f(q_2, z_2) \quad\Rightarrow\quad f(q, z) = \exp\!\big[(\nu - 1)q + c_0 z\big],$$
so that
$$S[p] = c_1\ln\int dx\, p\left(\frac{p}{m}\right)^{\nu - 1}\exp\!\left[c_0\left(\nabla\ln\frac{p}{m}\right)^2\right].$$

Particular Parameter Choices
General form:
$$S[p, m] = c_1\ln\int dx\, p(x)\left(\frac{p(x)}{m(x)}\right)^{\nu - 1}\exp\!\left[c_0\left(\nabla\ln\frac{p(x)}{m(x)}\right)^2\right].$$
• $c_0 \ll 1$:
$$S[p, m] \approx c_1(\nu - 1)\, S_R[p, m] + c_0 c_1\,\frac{\int dx\, p(x)\left(\frac{p(x)}{m(x)}\right)^{\nu-1}\left(\nabla\ln\frac{p(x)}{m(x)}\right)^2}{\int dx\, p(x)\left(\frac{p(x)}{m(x)}\right)^{\nu-1}}.$$
• $c_0 = 0$, $c_1 = (\nu - 1)^{-1}$:
$$S[p, m] = \frac{1}{\nu - 1}\ln\int dx\, p(x)\left(\frac{p(x)}{m(x)}\right)^{\nu - 1} \qquad \text{(Rényi distance)}.$$
• $\nu = 1$, $c_0 \ll 1$:
$$S[p, m] \approx c_0 c_1\int dx\, p(x)\left(\nabla\ln\frac{p(x)}{m(x)}\right)^2 \qquad \text{(Fisher distance)}.$$
• $c_0 = b(\nu - 1)$, $c_1 = a(\nu - 1)^{-1}$, $\nu \to 1$:
$$S[p, m] = a\int dx\, p(x)\ln\frac{p(x)}{m(x)} + b\int dx\, p(x)\left(\nabla\ln\frac{p(x)}{m(x)}\right)^2 \qquad \text{(Ván-type distance)}.$$

Variational Equation
With $q(x) \equiv \ln\frac{p(x)}{m(x)}$ and $S[p] = g\!\left(\int dx\, p\, f(q, (\nabla q)^2)\right)$, varying with respect to $p(x)$ subject to normalization and the data constraint gives
$$f + f_{,1} - 2 f_{,2}\,\nabla q\cdot(\nabla q + \nabla\ln m) - 2\,\nabla\!\cdot\!\big(f_{,2}\,\nabla q\big) = \lambda + \mu\, C(x).$$
For the additive solution $f(q, (\nabla q)^2) = \exp\!\big[(\nu - 1)q + c_0(\nabla q)^2\big]$ one has
$$f_{,1} = (\nu - 1) f, \qquad f_{,2} = c_0 f, \qquad f_{,12} = (\nu - 1)\,c_0 f, \qquad f_{,22} = c_0^2 f,$$
and the stationarity condition becomes a non-linear second-order differential equation for $q(x)$ with the constraint $C(x)$ as a source; the updating rules on the following slides are its special cases.
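A small numerical sketch of the particular parameter choices, assuming two Gaussian densities on a grid: it evaluates the Rényi distance ($c_0 = 0$, $c_1 = (\nu-1)^{-1}$) for $\nu$ approaching 1, the Shannon-Jaynes distance it converges to, and the derivative-dependent Fisher term.

```python
# Evaluate members of the distance family between two assumed Gaussians p and m.
import numpy as np

x = np.linspace(-12.0, 12.0, 4001)
dx = x[1] - x[0]

def gauss(mu, sigma):
    g = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return g / (g.sum() * dx)

p = gauss(0.5, 1.0)   # "posterior"
m = gauss(0.0, 1.5)   # "prior"

def renyi(p, m, nu):
    # S_R[p,m] = (nu-1)^{-1} ln ∫ p (p/m)^{nu-1} dx   (c0 = 0, c1 = 1/(nu-1))
    return np.log(np.sum(p * (p / m) ** (nu - 1.0)) * dx) / (nu - 1.0)

def shannon_jaynes(p, m):
    # S[p,m] = ∫ p ln(p/m) dx   (the nu -> 1 limit of the Renyi distance)
    return np.sum(p * np.log(p / m)) * dx

def fisher_term(p, m):
    # ∫ p (d/dx ln(p/m))^2 dx   (the c0-dependent, derivative term)
    dq = np.gradient(np.log(p / m), dx)
    return np.sum(p * dq ** 2) * dx

for nu in (1.5, 1.1, 1.01, 1.001):
    print(f"nu = {nu:6.3f}   Renyi distance = {renyi(p, m, nu):.6f}")
print(f"Shannon-Jaynes (nu -> 1) = {shannon_jaynes(p, m):.6f}")
print(f"Fisher term              = {fisher_term(p, m):.6f}")
```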
Ván-Type Updating Rule
• Ván-type distance (writing $p_i(x) = \varphi_i^2(x)$):
$$S[p_1, p_2] = 2\int dx\,\varphi_1^2\left[a\ln\frac{\varphi_1(x)}{\varphi_2(x)} + 2b\left(\nabla\ln\frac{\varphi_1(x)}{\varphi_2(x)}\right)^2\right].$$
• Ván-type updating rule: with $p(x) = \varphi^2(x)$ and prior $m(x) = \mu^2(x)$,
$$\delta\left\{2\int dx\,\varphi^2\left[a\ln\frac{\varphi}{\mu} + 2b\left(\nabla\ln\frac{\varphi}{\mu}\right)^2\right] + \lambda\int dx\,\varphi^2\, C(x)\right\} = 0$$
gives
$$-\nabla^2\varphi + \left[\frac{\nabla^2\mu}{\mu} + \frac{a}{2b}\ln\frac{\varphi}{\mu} + \tilde\lambda\, C(x) + \tilde\lambda_0\right]\varphi = 0,$$
a non-linear Schrödinger equation of Bialynicki-Birula-Mycielski type (the Lagrange multipliers are absorbed into $\tilde\lambda$, $\tilde\lambda_0$).
• If $\mu(x)$ was itself determined at an earlier stage of updating from a prior $\mu_0(x)$ and a constraint $C_0(x)$, then $\mu$ satisfies the analogous equation with $(\mu_0, C_0)$, and any subsequent constraints are additive: the equation for $\varphi$ contains the prior $\mu_0$ together with the combined potential $\tilde\lambda_0 C_0(x) + \tilde\lambda\, C(x)$.

Fisher Updating Rule
• Fisher distance (with $p_i = \varphi_i^2$):
$$I[p_1, p_2] = 4\int dx\left(\nabla\varphi_1 - \frac{\varphi_1}{\varphi_2}\nabla\varphi_2\right)^2 = 4\int dx\,\varphi_2^2\left(\nabla\frac{\varphi_1}{\varphi_2}\right)^2.$$
• Fisher updating rule: varying $I[p, m] + \lambda\int dx\, p\, C(x)$ with $p = \varphi^2$, $m = \mu^2$ gives a Schrödinger equation in which the "quantum potential" of the prior adds to the constraint:
$$-\nabla^2\varphi + \left[\frac{\nabla^2\mu}{\mu} + \tilde\lambda\, C(x)\right]\varphi = 0.$$
• If $\mu(x)$ was determined at an earlier stage of updating from a prior $\mu_0(x)$ and a constraint $C_0(x)$, then $\nabla^2\mu/\mu = \nabla^2\mu_0/\mu_0 + \tilde\lambda_0 C_0(x)$, and any subsequent constraints are additive:
$$-\nabla^2\varphi + \left[\frac{\nabla^2\mu_0}{\mu_0} + \tilde\lambda_0\, C_0(x) + \tilde\lambda\, C(x)\right]\varphi = 0.$$

Shannon-Jaynes Updating Rule
• Shannon-Jaynes distance: $S[p_1, p_2] = \int dx\, p_1\ln\frac{p_1}{p_2}$.
• Shannon-Jaynes updating rule:
$$\delta\left\{\int dx\, p\ln\frac{p}{m} + \lambda\int dx\, p\, C(x)\right\} = 0 \quad\Longrightarrow\quad p(x) = m(x)\,\exp\!\big[-1 - \lambda\, C(x)\big],$$
the usual exponential-factor updating rule.
• If $m(x)$ was determined at an earlier stage of updating from a prior $m_0(x)$ and a constraint $C_0(x)$, then $m(x) = m_0(x)\exp[-1 - \lambda_0 C_0(x)]$, and any subsequent constraints are additive:
$$p(x) = m_0(x)\exp\!\big[-2 - \lambda_0 C_0(x) - \lambda\, C(x)\big].$$

Uniqueness: Consistency with Probability Theory (i)
• Having prior knowledge about the joint distribution of $x$ and $y$ in the form $m(x, y)$ includes knowing the conditional distribution of $x$ given $y$:
$$m(x \mid y) = \frac{m(x, y)}{\int dx\, m(x, y)} = \frac{m(x, y)}{m(y)}.$$
• If we now obtain some piece of information about $y$, e.g. $\langle C(y)\rangle = d$, we could proceed in two ways: a) start with the prior $m(y)$ and the data constraint and obtain $p(y)$; b) apply the learning rule to the joint distribution, updating $m(x, y)$ to $p(x, y)$. Consistency requires that the marginal of the updated joint distribution be the same as $p(y)$:
$$\int dx\, p(x, y) \stackrel{!}{=} p(y).$$
This is the same as requiring that the conditional distribution remain unchanged:
$$p(x \mid y) = m(x \mid y), \qquad p(x, y) = m(x \mid y)\, p(y).$$

Consistency upon Updating: Example 1
• Start from a prior $m(x, y)$. Researcher A learns $\langle f(x)\rangle = f_{\mathrm{av}}$ and updates the marginal $m(x) \to p_A(x)$, hence
$$p_A(x, y) = m(y \mid x)\, p_A(x) = m(x, y)\,\frac{p_A(x)}{m(x)}.$$
Researcher B learns $\langle g(y)\rangle = g_{\mathrm{av}}$ and updates $m(y) \to p_B(y)$, hence
$$p_B(x, y) = m(x \mid y)\, p_B(y) = m(x, y)\,\frac{p_B(y)}{m(y)}.$$
• Processing the two pieces of data one after the other, in either order, while keeping the conditionals fixed gives
$$p_{AB}(x, y) = p_{BA}(x, y) = m(x, y)\,\frac{p_A(x)\, p_B(y)}{m(x)\, m(y)}.$$
• Consistency upon updating requires that $p_{AB}(x, y) = p_{BA}(x, y)$.
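A sketch of consistency requirement (i), assuming a randomly generated discrete joint prior and a quadratic constraint on $y$ alone: updating with the exponential rule changes the marginal of $y$ but leaves the conditional $p(x \mid y)$ identical to $m(x \mid y)$.

```python
# Update a joint prior m(x,y) with data about y only, p ∝ m·exp(-λ C(y)),
# and check that the conditional distribution of x given y is unchanged.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
m = rng.random((40, 50)); m /= m.sum()            # assumed discrete joint prior m(x,y)
y = np.arange(m.shape[1], dtype=float)
C = (y - y.mean()) ** 2                           # assumed constraint function C(y)
d = 0.8 * (m.sum(axis=0) * C).sum()               # target <C(y)>: 20% below the prior value

def posterior(lam):
    w = m * np.exp(-lam * C)[None, :]             # multiply each column by exp(-λ C(y))
    return w / w.sum()

lam = brentq(lambda l: (posterior(l).sum(axis=0) * C).sum() - d, 0.0, 10.0)
p = posterior(lam)

cond_m = m / m.sum(axis=0, keepdims=True)         # m(x|y)
cond_p = p / p.sum(axis=0, keepdims=True)         # p(x|y)
print("constraint met:", np.isclose((p.sum(axis=0) * C).sum(), d))
print("max |p(x|y) - m(x|y)| =", np.abs(cond_p - cond_m).max())   # ~ machine precision
```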
Uniqueness: Consistency with Probability Theory (ii)
• A sufficient (but not necessary) condition is the "strong additivity" of $S$:
$$S[p(x, y), m(x, y)] \stackrel{!}{=} S[p(y), m(y)] + \int dy\, p(y)\, S[p(x \mid y), m(x \mid y)].$$
This is basically one of Khinchin's axioms; it discriminates Rényi's $\nu = 1$ against all other values, thus singling out the Shannon-Jaynes distance.
• A sufficient (and probably necessary) condition for the above is that the updating amounts to multiplying the prior by a function of the constraint:
$$p(x, y) = m(x, y)\, F\big(C(x, y)\big).$$
This condition rules out the gradient-dependent terms (i.e. pins down the value of $c_0$ to zero), but does not restrict Rényi's $\nu$.

Consistency upon Updating: Example 1 (continued)
• Add a researcher C who receives both pieces of data at once and updates the prior directly, $m(x, y) \to p_C(x, y)$. Consistency upon updating now requires
$$p_{AB}(x, y) = p_{BA}(x, y) = p_C(x, y).$$
• If updating multiplies the prior by a function of the constraint, then
$$p_{AB}(x, y) = m(x, y)\, F\big(\lambda_A f(x)\big)\, F\big(\lambda_B g(y)\big), \qquad p_{BA}(x, y) = m(x, y)\, F\big(\lambda_B g(y)\big)\, F\big(\lambda_A f(x)\big),$$
so $p_{AB} = p_{BA}$ holds automatically. For the Rényi distance with $c_0 = 0$ the updating factor is a power law,
$$p(x, y) = m(x, y)\big[1 - (\nu - 1)\,\lambda\, C(x, y)\big]^{\frac{1}{\nu - 1}}.$$
• The full requirement $p_{AB} = p_{BA} = p_C$, however, demands
$$F\big(\lambda_A f(x)\big)\, F\big(\lambda_B g(y)\big) = F\big(\lambda_A f(x) + \lambda_B g(y)\big),$$
i.e. $F$ must be an exponential. This excludes all $\nu \neq 1$ and leaves us with the Shannon-Jaynes distance!

Consistency upon Updating: Example 2
• A sample $\{f_i\}$, $i = 1, \ldots, N_2$, is processed in two parts. The first $N_1$ points give
$$\bar f_1 = \frac{1}{N_1}\sum_{i=1}^{N_1} f_i \;\to\; p_1(x),$$
the remaining points give
$$\bar f_2 = \frac{1}{N_2 - N_1}\sum_{i=N_1+1}^{N_2} f_i \;\to\; p_2(x),$$
and the whole sample gives
$$\bar f_{12} = w_1\,\bar f_1 + w_2\,\bar f_2 \;\to\; p_{12}(x), \qquad w_{1,2}\in[0, 1],\quad w_1 + w_2 = 1.$$
• Consistency upon updating requires that $p_{12}(x) = w_1\, p_1(x) + w_2\, p_2(x)$.

Intermediate Summary
• The Shannon-Jaynes relative entropy is singled out, after all, by consistency requirements, even when dependence on the derivatives is allowed, as THE information-distance measure to be used in a learning rule:
$$\int dx\, p(x)\ln\frac{p(x)}{m(x)} + \lambda_0\int dx\, p(x) + \lambda\int dx\, p(x)\, C(x) \;\to\; \min,$$
$$\lambda_0:\ \int dx\, p(x) = 1, \qquad \lambda:\ \int dx\, p(x)\, C(x) = d.$$
• In practical applications, however, one rarely has prior knowledge in the form of a prior distribution; it is more likely to come in the form of data. In such cases the updating rule is useless, but one can still use the derived unique information distance for identifying least-informative priors.

Least-Informative Priors – Constructive Approach
• Identify a generic operation invoking "information loss".
• Note that the information loss should be (at least for small information content) proportional to the information content, unless negative information is tolerated.
• Look for the least-informative prior (under given data constraints) as the one least sensitive to the operation above. The sensitivity should be defined consistently with the adopted information distance.
• The generic operation is coarse-graining:
$$p_\epsilon(x) = \frac{1}{\epsilon}\int dx'\, p(x - x')\, f\!\left(\frac{x'}{\epsilon}\right).$$
• The information distance is the Shannon-Jaynes one:
$$S[p, p_\epsilon] = \int dx\, p\ln\frac{p(x)}{p_\epsilon(x)} \;\stackrel{!}{\to}\; \min \qquad \text{(subject to data constraints)}.$$

Coarse-Graining = Adding Noise
• Coarse-graining is equivalent to adding noise: $x \to y = x + \sqrt{\epsilon}\, z$, turning $p(x)$ and $p_\epsilon(x)$ into joint distributions $p_0(y, z)$ and $p_\epsilon(y, z)$.
• "Distance" between the original and the "noisy" probability distributions:
$$S[p_0(y, z), p_\epsilon(y, z)] = S[p_0(z), p_\epsilon(z)] + \int dz\, p_\epsilon(z)\, S[p_0(y \mid z), p_\epsilon(y \mid z)].$$
• For the Shannon-Jaynes distance, expanding $p(x + \sqrt{\epsilon}\, z)$ in powers of $\sqrt{\epsilon}$,
$$S\big[p(x),\, p(x + \sqrt{\epsilon}\, z)\big] = \int dx\, p(x)\ln\frac{p(x)}{p(x + \sqrt{\epsilon}\, z)} = \frac{\epsilon z^2}{2}\int dx\,\frac{1}{p(x)}\left(\frac{dp(x)}{dx}\right)^2 + O(\epsilon^{3/2}),$$
just another instance of the de Bruijn identity. Essentially the same result holds for general Rényi $\nu$, as well as for the symmetrized SJ distance or the symmetric Bernardo-Rueda "intrinsic discrepancy". Averaging over the noise,
$$S[p_0(y, z), p_\epsilon(y, z)] = \frac{\epsilon\,\langle z^2\rangle}{2}\int dx\,\frac{1}{p(x)}\left(\frac{dp(x)}{dx}\right)^2 + O(\epsilon^2).$$
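A numerical sketch of the leading-order result above, assuming a bimodal test density: the Shannon-Jaynes distance between $p(x)$ and its shift $p(x + \delta)$ is compared with $(\delta^2/2)$ times the Fisher information, and the agreement improves as $\delta$ shrinks.

```python
# Check S[p(x), p(x+delta)] ≈ (delta^2/2) ∫ (p')^2/p dx for small shifts delta.
import numpy as np

x = np.linspace(-15.0, 15.0, 6001)
dx = x[1] - x[0]

p = np.exp(-0.25 * (x - 1.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)
p /= p.sum() * dx                                  # assumed bimodal test density

fisher = np.sum(np.gradient(p, dx) ** 2 / p) * dx  # I_F = ∫ (p')^2 / p dx

for delta in (0.2, 0.1, 0.05):
    shift = int(round(delta / dx))                 # delta is an exact multiple of dx here
    p_shift = np.roll(p, -shift)                   # p(x + delta) on the (periodic) grid
    kl = np.sum(p * np.log(p / p_shift)) * dx
    print(f"delta = {delta:4.2f}   KL = {kl:.6f}   (delta^2/2)*I_F = {0.5 * delta**2 * fisher:.6f}")
```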
Least-Informative Priors: Constructive MAXENT
• Least-informative prior: the probability density least sensitive to coarse-graining, subject to all available data constraints:
$$\int dx\,\frac{(\nabla p(x))^2}{p(x)} + \lambda\int dx\, p(x)\, C(x) \;\to\; \min.$$
• Putting $p(x) = \psi^2(x)$ and varying $\psi$, we obtain the Euler-Lagrange equation
$$\nabla^2\psi(x) - \frac{\lambda}{4}\, C(x)\,\psi(x) = 0 \qquad\longleftrightarrow\qquad \nabla^2\psi(x) + \frac{2m}{\hbar^2}\big[E - U(x)\big]\psi(x) = 0.$$
• This is a Schrödinger equation; non-relativistic quantum mechanics appears to be based on applying the constructive MAXENT to a system with a given average kinetic energy, i.e. temperature!

Least-Informative Priors: Consistency Condition
[Diagram: the information "distance" $S(p_1, p_2)$ links the two procedures.]
• MAXENT A (updating): prior + data → posterior. The posterior is the closest distribution to the prior subject to the data constraints; "closest" is with respect to the information distance.
• MAXENT B (constructive): data → least-informative prior. The LI prior is the one least sensitive to coarse-graining subject to the data constraints; "least sensitive" is with respect to the information distance.

Conclusion: The MAXENT Approach
• Updating rule (MAXENT A): given a prior $m(x)$ and a constraint $C(x)$,
$$\delta\left\{\int dx\, p\ln\frac{p}{m} + \lambda\int dx\, p\, C(x)\right\} = 0 \quad\Longrightarrow\quad p(x) = m(x)\,\exp\!\big[-1 - \lambda\, C(x)\big].$$
Based on the Shannon-Jaynes relative entropy.
• Constructive rule (MAXENT B): given a constraint $C(x)$,
$$\delta\left\{\int dx\,\frac{(\nabla p)^2}{p} + \lambda\int dx\, p\, C(x)\right\} = 0, \qquad p(x) = \psi^2(x) \quad\Longrightarrow\quad \nabla^2\psi(x) - \tilde\lambda\, C(x)\,\psi(x) = 0.$$
Based on the Fisher information for the "location" parameter as a measure of sensitivity to coarse-graining.

Summary
• An axiomatic characterization of both constructive and learning MAXENT is possible, based on the notion of information distance.
• The usual requirements of smoothness, consistency with probability theory, and additivity narrow the possible form of the information distance down to a Rényi-type one with an exponential Fisher-type modification factor.
• The additional requirement that different ways of feeding the same information into the method produce the same results pins down the value of Rényi's parameter to 1 and excludes the derivative terms, singling out the Shannon-Jaynes distance as the correct one.
• Since the information distance does not measure information content, I propose to take the information content to be proportional to the sensitivity of the probability distribution to coarse-graining. If this is accepted, a unique constructive procedure for priors subject to the available information results, involving the minimization of Fisher information under constraints.
• The constructive procedure, if taken seriously, has far-reaching implications for the nature of those physics laws which can be formulated as variational principles. They may turn out to have very little to do with physics per se.
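A finite-difference sketch of the constructive rule, assuming a harmonic constraint $C(x) = x^2$ and unit coefficients: the least-informative prior $p = \psi^2$ is the ground state of the corresponding Schrödinger problem, which for this constraint is a Gaussian and so can be checked analytically.

```python
# Minimizing ∫ (p')^2/p dx subject to <C> fixed becomes, with p = psi^2,
# the ground-state problem  -psi'' + (lam/4) C(x) psi = e psi.
import numpy as np

x = np.linspace(-8.0, 8.0, 801)
dx = x[1] - x[0]
C = x ** 2
lam = 4.0                                    # assumed value of the constraint multiplier

# Finite-difference Hamiltonian  H = -d^2/dx^2 + (lam/4) C(x)
main_diag = 2.0 / dx ** 2 + 0.25 * lam * C
off_diag = -np.ones(len(x) - 1) / dx ** 2
H = np.diag(main_diag) + np.diag(off_diag, 1) + np.diag(off_diag, -1)

evals, evecs = np.linalg.eigh(H)             # ground state = least-informative prior
psi = evecs[:, 0] / np.sqrt(np.sum(evecs[:, 0] ** 2) * dx)
p = psi ** 2

fisher = 4.0 * np.sum(np.gradient(psi, dx) ** 2) * dx   # ∫ (p')^2/p dx = 4 ∫ (psi')^2 dx
print(f"<C> = {np.sum(p * C) * dx:.4f}   Fisher information = {fisher:.4f}")
# With C = x^2 and lam = 4 the ground state is psi ∝ exp(-x^2/2), i.e. p = N(0, 1/2),
# so the expected output is <C> ≈ 0.5 and Fisher information ≈ 2.
```

Changing $C(x)$ turns the same few lines into a generator of least-informative priors for other constraints; the harmonic case is used here only because the answer is known in closed form.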