Information Theory and Statistics

Geometry of Information and Conditional Limit Theorem
André Carvalho Bittencourt
Linköping University, Sweden
A. C. Bittencourt et al.
InfoTheory, LiU 2013
AUTOMATIC CONTROL
REGLERTEKNIK
LINKÖPINGS UNIVERSITET
Outline
Summary
Conditional Limit Theorem
The Geometry of Information
Proof of Conditional Limit Theorem
Summary - Method of types
$\mathcal{X}^n$: all sequences of length $n$
$x^n$: a sequence of length $n$
$\mathcal{P}$: set of all distributions on $\mathcal{X}$
$Q$: true distribution of $x^n$
$P_{x^n}(a) = \frac{1}{n} N(a|x^n)$: type of $x^n$
$\mathcal{P}_n$: all types with denominator $n$
$T(P) = \{x \in \mathcal{X}^n : P_x = P\}$: type class of $P$
$|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|} = 2^{|\mathcal{X}| \log(n+1)}$: polynomial number of types
$Q^n(x) = 2^{-n(D(P_x\|Q) + H(P_x))}$: exact probability of a sequence $x$ of type $P_x$ under $Q$
$|T(P)| \doteq 2^{nH(P)}$: exponential number of sequences of each type
$Q^n(T(P)) \doteq 2^{-nD(P\|Q)}$: approximate probability of $T(P)$ under $Q$
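The identities of the method of types can be checked numerically on a small example. The sketch below is illustrative only: the alphabet, the distribution `Q` and the sequence `x` are made-up values, and the helper names are not from the slides. It compares the direct i.i.d. probability of a sequence with $2^{-n(D(P_x\|Q)+H(P_x))}$ and counts the size of the type class.

```python
import math
from collections import Counter

def type_of(seq, alphabet):
    """Empirical distribution (type) P_x of a sequence over a finite alphabet."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n for a in alphabet]

def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl(p, q):
    """D(P||Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example data (not from the slides).
alphabet = ["a", "b", "c"]
Q = [0.6, 0.3, 0.1]
x = list("aabacbbaca")
n = len(x)

P_x = type_of(x, alphabet)

# Direct i.i.d. probability of the sequence under Q.
q_of = dict(zip(alphabet, Q))
direct = math.prod(q_of[s] for s in x)

# The same probability through the method-of-types formula.
via_types = 2 ** (-n * (kl(P_x, Q) + entropy(P_x)))

# Size of the type class T(P_x): a multinomial coefficient.
counts = [x.count(a) for a in alphabet]
type_class_size = math.factorial(n) // math.prod(math.factorial(c) for c in counts)

print(P_x, direct, via_types, type_class_size, 2 ** (n * entropy(P_x)))
```

The type class size should fall between $(n+1)^{-|\mathcal{X}|} 2^{nH(P)}$ and $2^{nH(P)}$, which is what the $\doteq$ notation hides.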
Summary - Sanov’s theorem
Sanov's theorem: Let $x_i$ be i.i.d. $\sim Q(x)$ and $E \subseteq \mathcal{P}$. Then
$$Q^n(E) = Q^n(E \cap \mathcal{P}_n) \doteq 2^{-nD(E\|Q)} \doteq 2^{-nD(P^*\|Q)}$$
where $P^* = \arg\min_{P \in E} D(P\|Q)$,
i.e. the probability of $E$ is close to the probability of $P^*$, because the probabilities of all other types decay exponentially faster and the number of types is polynomial in $n$.
Example of set $E$:
$$\frac{1}{n}\sum_i g(x_i) \ge \alpha \iff \frac{1}{n}\sum_{a \in \mathcal{X}} N(a|x)\,g(a) \ge \alpha \iff \sum_{a \in \mathcal{X}} P_x(a)\,g(a) \ge \alpha \iff P_x^T g \ge \alpha$$
and
$$P^* = \arg\min_{P^T g \ge \alpha,\ \sum P = 1} D(P\|Q), \qquad P^*(x) = c\,Q(x)\,e^{\lambda^T g(x)}, \qquad c = \Big(\sum_{a \in \mathcal{X}} Q(a)\,e^{\lambda^T g(a)}\Big)^{-1}$$
Note: if $Q$ is uniform, this gives the maximum entropy distribution.
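Sanov's exponent can be illustrated on a binary alphabet, where $Q^n(E)$ is an exact binomial tail. The sketch below assumes $Q = \text{Bernoulli}(q)$ and $E = \{P : P(1) \ge \alpha\}$ with $\alpha > q$, so that $P^* = \text{Bernoulli}(\alpha)$. The values `q = 0.3` and `alpha = 0.5` and all function names are illustrative choices, not taken from the slides.

```python
import math

def log2_binom_pmf(n, k, q):
    """log2 of the Binomial(n, q) pmf at k, computed in the log domain."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb / math.log(2) + k * math.log2(q) + (n - k) * math.log2(1 - q)

def log2_sum(logs):
    """log2 of a sum of terms given by their log2 values."""
    m = max(logs)
    # Terms more than 60 bits below the largest one are negligible and skipped.
    return m + math.log2(sum(2.0 ** (l - m) for l in logs if l - m > -60))

def sanov_exponent(n, q, alpha):
    """-(1/n) log2 Q^n(E) for E = {P : P(1) >= alpha} and Q = Bernoulli(q)."""
    k_min = math.ceil(alpha * n)
    return -log2_sum([log2_binom_pmf(n, k, q) for k in range(k_min, n + 1)]) / n

def kl_bernoulli(p, q):
    """D(Bernoulli(p) || Bernoulli(q)) in bits, for 0 < p, q < 1."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

q, alpha = 0.3, 0.5
d_star = kl_bernoulli(alpha, q)  # D(P*||Q), with P* = Bernoulli(alpha)
for n in (100, 500, 2000):
    print(n, sanov_exponent(n, q, alpha), d_star)
```

The printed exponent should approach $D(P^*\|Q)$ from above as $n$ grows; the remaining gap comes from the polynomial factors that $\doteq$ ignores.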
Conditional Limit Theorem
Conditional limit theorem: Let $E \subseteq \mathcal{P}$ be a closed convex set and $x_i$ be i.i.d. $\sim Q \notin E$. Then
$$\Pr\{x_1 = a \mid P_{x^n} \in E\} \to P^*(a) \quad \text{in probability as } n \to \infty$$
where $P^* = \arg\min_{P \in E} D(P\|Q)$,
i.e. the first few elements are asymptotically independent with distribution $P^*$.
Given that $P_{x^n} \in E$ is observed, $Q$ works as a prior and $P^*$ is the posterior distribution.
Strengthens the maximum entropy principle.
Example, biased dice
Let $Q$ be uniform (a fair die). Given that $E = \{P : \sum_a P(a)\,a \ge 4\}$ (i.e. the observed average is at least 4), then
$$P^*(x) = \frac{2^{\lambda x}}{\sum_{i=1}^{6} 2^{\lambda i}}, \quad \text{where } \lambda \text{ is chosen such that } \sum_i i\,P^*(i) = 4$$
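The dice example leaves $\lambda$ implicit. A minimal sketch of one way to find it, assuming a fair-die prior and using plain bisection (the function names, bracket and tolerance are illustrative choices, not from the slides):

```python
def tilted(lam, faces=range(1, 7)):
    """P*(x) proportional to 2^(lam * x), i.e. a fair die tilted by lam."""
    w = [2.0 ** (lam * x) for x in faces]
    z = sum(w)
    return [wi / z for wi in w]

def mean(p, faces=range(1, 7)):
    """Expected face value under distribution p."""
    return sum(x * px for x, px in zip(faces, p))

def solve_lambda(target, lo=0.0, hi=5.0, tol=1e-12):
    """Bisection on lam; the mean of the tilted distribution increases with lam."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(tilted(mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam = solve_lambda(4.0)
p_star = tilted(lam)
print(lam, p_star, mean(p_star))
```

Bisection is enough here because the constraint is one-dimensional; with several constraints $g_1, \dots, g_k$ a multivariate root finder would be needed instead.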
The Geometry of Information
Pythagorean theorem: Let $E \subseteq \mathcal{P}$ be a closed convex set, $Q \notin E$ and $P^* = \arg\min_{P \in E} D(P\|Q)$. Then for all $P \in E$
$$D(P\|Q) \ge D(P\|P^*) + D(P^*\|Q)$$
Proof: Define $P_\lambda = \lambda P + (1-\lambda)P^*$. Then $P_\lambda \in E$ (convexity) and, since $P^*$ is a minimum,
$$\begin{aligned}
0 &\le \frac{d}{d\lambda} D(P_\lambda\|Q)\Big|_{\lambda=0} \\
&= \sum_x \Big[ (P(x) - P^*(x)) \log\frac{P^*(x)}{Q(x)} + (P(x) - P^*(x)) \Big] \\
&= \sum_x (P(x) - P^*(x)) \log\frac{P^*(x)}{Q(x)} \\
&= \sum_x P(x) \log\Big(\frac{P(x)}{Q(x)} \cdot \frac{P^*(x)}{P(x)}\Big) - \sum_x P^*(x) \log\frac{P^*(x)}{Q(x)} \\
&= D(P\|Q) - D(P\|P^*) - D(P^*\|Q)
\end{aligned}$$
The second term in the second line vanishes because $\sum_x (P(x) - P^*(x)) = 0$.
1. Like an obtuse triangle
2. Equality iff $(P - P^*) \perp (P^* - Q)$, i.e. right triangle
3. If $P_x \in E$ and $D(P_x\|Q) \to \min D(P\|Q)$, then $P_x \to P^*$ (i.e. $D(P_x\|P^*) \to 0$)
4. Locally square-like (next slide)
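The Pythagorean inequality can be checked numerically. The sketch below assumes a fair-die $Q$ and the half-space $E = \{P : \sum_a a\,P(a) \ge 4\}$, for which $P^*$ is an exponentially tilted distribution; the two test distributions and all helper names are illustrative. For a constraint of this linear form the gap works out to $\lambda\,(\text{mean}(P) - 4)$, so it is zero on the boundary of $E$ and positive inside.

```python
import math

FACES = range(1, 7)
Q = [1 / 6] * 6  # fair die (assumed prior)

def kl(p, q):
    """D(P||Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean(p):
    return sum(x * px for x, px in zip(FACES, p))

def tilted(lam):
    """Distribution proportional to 2^(lam * x) on the six faces."""
    w = [2.0 ** (lam * x) for x in FACES]
    z = sum(w)
    return [wi / z for wi in w]

def solve_lambda(target, lo=0.0, hi=5.0, tol=1e-12):
    """Bisection for the tilt that gives the requested mean."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(tilted(mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def pythagorean_gap(p, p_star, q):
    """D(P||Q) - D(P||P*) - D(P*||Q); non-negative for P in E."""
    return kl(p, q) - kl(p, p_star) - kl(p_star, q)

lam = solve_lambda(4.0)
p_star = tilted(lam)

p_boundary = [0.1, 0.1, 0.1, 0.3, 0.2, 0.2]  # mean 4, on the boundary of E
p_interior = [0.0, 0.0, 0.2, 0.3, 0.3, 0.2]  # mean 4.5, inside E

gap_boundary = pythagorean_gap(p_boundary, p_star, Q)
gap_interior = pythagorean_gap(p_interior, p_star, Q)
print(gap_boundary, gap_interior, lam * (mean(p_interior) - 4.0))
```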
Relations of D(P||Q)
Relative entropy and 1-norm
$$D(P\|Q) \ge \frac{1}{2 \ln 2}\,\|P - Q\|_1^2, \qquad \|P - Q\|_1 \triangleq \sum_{a \in \mathcal{X}} |P(a) - Q(a)|$$
i.e. convergence in $D(P\|Q)$ implies convergence in $L_1$.
Local behavior
From a Taylor expansion around $Q$ (with $D$ measured in bits)
$$D(P\|Q) = \frac{1}{2 \ln 2}\,\chi^2(P, Q) + \cdots, \qquad \chi^2(P, Q) \triangleq \sum_x \frac{(P(x) - Q(x))^2}{Q(x)}$$
i.e. $D(P\|Q)$ locally behaves like a squared distance.
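Both relations are easy to probe numerically. The sketch below uses a made-up $Q$ and a zero-sum perturbation direction `v` (neither is from the slides) and compares $D(P\|Q)$, measured in bits, with the 1-norm lower bound and with the second-order $\chi^2$ term as $P$ moves toward $Q$.

```python
import math

def kl(p, q):
    """D(P||Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def l1(p, q):
    """1-norm distance between two distributions."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def chi2(p, q):
    """Chi-square distance sum (P - Q)^2 / Q."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

# Hypothetical example: Q and a perturbation direction that sums to zero.
Q = [0.5, 0.3, 0.2]
v = [1.0, -2.0, 1.0]

def perturbed(eps):
    """P = Q + eps * v, a valid distribution for small enough eps."""
    return [qi + eps * vi for qi, vi in zip(Q, v)]

for eps in (0.1, 0.01, 0.001):
    P = perturbed(eps)
    d = kl(P, Q)
    l1_bound = l1(P, Q) ** 2 / (2 * math.log(2))    # 1-norm lower bound
    local = chi2(P, Q) / (2 * math.log(2))          # second-order Taylor term
    print(eps, d, l1_bound, local, d / local)
```

The last printed column, the ratio of $D$ to its second-order term, should tend to 1 as `eps` shrinks, while $D$ stays above the 1-norm bound throughout.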
Proof of Conditional Limit Theorem
Definitions
$S_t = \{P \in \mathcal{P} : D(P\|Q) \le t\}$ ($S_t$ is convex)
$D^* = \min_{P \in E} D(P\|Q)$ ($P^*$ is unique)
$A = S_{D^* + 2\delta} \cap E$
$A^- = S_{D^* + \delta} \cap E$
$B = E - A$
By Sanov's theorem
(a): $Q^n(B) \doteq 2^{-nD(B\|Q)}$ with $D(P\|Q) > D^* + 2\delta$ for every $P \in B$, and so
$$Q^n(B) \le (n+1)^{|\mathcal{X}|}\, 2^{-n(D^* + 2\delta)}$$
(b): $$Q^n(A) \ge Q^n(A^-) \ge \frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{-n(D^* + \delta)} \quad \text{(worst case)}$$
1. Prob. of $B \to 0$ and so prob. of $A \to 1$:
$$\Pr(P_x \in B \mid P_x \in E) = \frac{Q^n(B \cap E)}{Q^n(E)} \le \frac{Q^n(B)}{Q^n(A)} \le (n+1)^{2|\mathcal{X}|}\, 2^{-n\delta} \to 0$$
2. All $P \in A$ are close to $P^*$. For $P \in A$, $D(P\|Q) \le D^* + 2\delta$. From the Pythagorean theorem
$$D(P\|P^*) + D(P^*\|Q) \le D(P\|Q) \le D^* + 2\delta$$
and, since $D(P^*\|Q) = D^*$, $D(P_x\|P^*) \le 2\delta$ for $P_x \in A$.
By the $L_1$ relation, the fact that $D(P_x\|P^*)$ is small implies that $\max_{a \in \mathcal{X}} |P_x(a) - P^*(a)|$ is small, and so
$$\Pr(x_1 = a \mid P_x \in E) \to P^*(a)$$
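The conclusion of the proof can be observed exactly on a binary alphabet. Given the type, the sequence is uniform over its type class, so $\Pr(x_1 = 1 \mid P_x \in E)$ equals the conditional expectation of $P_x(1)$. The sketch below assumes $Q = \text{Bernoulli}(0.3)$ and $E = \{P : P(1) \ge 0.5\}$, so that $P^*(1) = 0.5$; these values and the function names are illustrative, not from the slides.

```python
import math

def log2_binom_pmf(n, k, q):
    """log2 of the Binomial(n, q) pmf at k, computed in the log domain."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb / math.log(2) + k * math.log2(q) + (n - k) * math.log2(1 - q)

def conditional_first_symbol(n, q, alpha):
    """Pr(x_1 = 1 | P_x(1) >= alpha) for x_i i.i.d. Bernoulli(q)."""
    ks = range(math.ceil(alpha * n), n + 1)
    logs = [log2_binom_pmf(n, k, q) for k in ks]
    m = max(logs)
    # Rescale by the largest term; terms 60+ bits smaller are negligible.
    weights = [(k, 2.0 ** (l - m)) for k, l in zip(ks, logs) if l - m > -60]
    total = sum(w for _, w in weights)
    # Conditional mean of the type P_x(1) = k / n given the event E.
    return sum((k / n) * w for k, w in weights) / total

q, alpha = 0.3, 0.5
for n in (20, 200, 2000):
    print(n, conditional_first_symbol(n, q, alpha))
```

The printed values should decrease toward $P^*(1) = 0.5$, well away from the unconditional value $q = 0.3$.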