
Meta-Bayesian Analysis
Jun Yang
joint work with Daniel M. Roy
Department of Statistical Sciences
University of Toronto
NIPS Workshop on Bounded Optimality and Rational Metareasoning
Dec. 11th, Montreal, QC.
Meta-Bayesian Analysis (Jun Yang and Daniel Roy)
Motivation
“All models are wrong,
some are useful.”
— George Box
“truth [...] is much too complicated to
allow anything but approximations.”
— John von Neumann
- Subjective Bayesianism: alluring, but impossible to practice when the model is wrong.
- Prior probability = degree of belief... in what? Utility? What is a prior?
- Is there any role for (subjective) Bayesianism?
Our proposal: More inclusive and pragmatic definition for “prior”.
Our approach: Bayesian decision theory
Example: Grossly Misspecified Model
Setting: Machine learning
data: a collection of documents
- Model: Latent Dirichlet Allocation (LDA)
- Prior belief: π̃ ≡ 0, i.e., no setting of LDA is faithful to our true beliefs about the data.
- Conjugate prior: π(dθ) = Dirichlet(α)
What is the meaning of a prior on LDA parameters?
Pragmatic question: If we use an LDA model (for whatever
reason), how should we choose our “prior”?
Example: Accurate but still Misspecified Model
Setting: Careful Science
data: experimental measurements
- Model: (Qθ)θ∈Θ, painstakingly produced after years of effort
- Prior belief: π̃ ≡ 0, i.e., no Qθ is 100% faithful to our true beliefs about the data.
What is the meaning of a prior in a misspecified model?
(All models are misspecified.)
Pragmatic question: How should we choose a “prior”?
Problem Formulation
- P: the true belief on X × Y.
- (X, Y) ∼ P, where X ∈ X is the observed data and Y ∈ Y the future observations.
- (Qθ)θ∈Θ: the model, i.e., a family of distributions on X × Y.
- L : Y × Y → R+: the loss function.
The Task
1. Choose a prior π.
2. Observe X.
3. Use Bayesian inference with π and (Qθ) to predict Ŷ.
4. Suffer loss L(Ŷ, Y).
The Goal
Minimize expected loss
Review of Standard Bayesian Analysis
- Prior on θ: π(·)
- Model/likelihood on X × Y given θ: Qθ(·)
- Marginal distribution on X × Y: (πQ)(·) = ∫ Qθ(·) π(dθ)
- Posterior distribution on Y given X = x: πQ(·|x)
- Loss function: L : A × Y → R+
- Bayes optimal action assuming (X, Y) ∼ πQ: BayesOptAction(πQ, x)

The Bayes optimal action minimizes expected loss under the posterior distribution:

    BayesOptAction(πQ, x) = arg min_{a∈A} ∫ L(a, y) πQ(dy|x).

- Quadratic loss −→ posterior mean.
- Self-information loss (log loss) −→ the posterior πQ(·|x) itself.
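For concreteness, the quadratic-loss case can be sketched numerically in a Beta-Bernoulli model (a hypothetical setup, not from the slides): under quadratic loss, the Bayes optimal action for predicting the next flip is the posterior predictive mean.

```python
# Minimal sketch (hypothetical Beta-Bernoulli setup): under quadratic
# loss, BayesOptAction(piQ, x) is the posterior predictive mean of Y.

def bayes_opt_action(x, alpha1=1.0, alpha0=1.0):
    """Posterior predictive mean of the next flip given flips x."""
    s = sum(x)                       # heads observed
    n = len(x)
    # Beta(alpha1, alpha0) is conjugate to Bernoulli: the posterior is
    # Beta(alpha1 + s, alpha0 + n - s), whose mean is returned below.
    return (alpha1 + s) / (alpha1 + alpha0 + n)

x = [1, 0, 0, 0, 1, 0, 0, 1]         # observed coin flips
print(bayes_opt_action(x))           # → 0.4 under the uniform Beta(1, 1) prior
```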
Our Solution for Choosing the Optimal Prior
Meta-Bayesian analysis
- Idea: believe (X, Y) ∼ P, but predict using πQ(·|X = x).
- Consider π as our action; the induced loss function is

      L∗(π, (x, y)) = L(BayesOptAction(πQ, x), y).

- The Bayes risk under P of doing Bayesian analysis under πQ:

      R(P, π) = ∫ L∗(π, (x, y)) P(dx × dy).

- The meta-Bayesian optimal prior minimizes the expected loss under the true belief:

      inf_{π∈F} R(P, π),

  where F is some set of priors under consideration.
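When the variational problem has no closed form, R(P, π) can be estimated by Monte Carlo and minimized over a small family F by search. The sketch below (the mixture-of-Betas true belief, grid values, and sample sizes are all illustrative assumptions) does this for a Beta-Bernoulli model under quadratic loss.

```python
# Sketch: Monte Carlo estimate of the meta-Bayesian risk R(P, pi) under
# quadratic loss, with a grid search over a family F of Beta priors.
# All distributions and grid values here are hypothetical.
import random

random.seed(0)
n = 10  # number of observed flips; we predict one future flip

def sample_from_P():
    """True belief P: theta from a 50/50 mixture of Beta(2, 8) and Beta(8, 2)."""
    if random.random() < 0.5:
        theta = random.betavariate(2, 8)
    else:
        theta = random.betavariate(8, 2)
    x = [int(random.random() < theta) for _ in range(n)]
    y = int(random.random() < theta)  # one future flip
    return x, y

def risk(alpha1, alpha0, num_samples=4000):
    """Monte Carlo estimate of R(P, pi) = E_P[(BayesOptAction(piQ, X) - Y)^2]."""
    total = 0.0
    for _ in range(num_samples):
        x, y = sample_from_P()
        yhat = (alpha1 + sum(x)) / (alpha1 + alpha0 + n)  # posterior mean
        total += (yhat - y) ** 2
    return total / num_samples

# Grid search over the family F of Beta priors.
grid = [0.5, 1.0, 2.0, 4.0, 8.0]
best = min(((a1, a0) for a1 in grid for a0 in grid), key=lambda p: risk(*p))
print("meta-Bayesian optimal prior in F:", best)
```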
Meta-Bayesian Analysis
Recipe
- Step 1: State P and Qθ, and select a loss function L.
- Step 2: Solve the corresponding variational problem to get π.

Examples
- Log loss: minimize the conditional relative entropy

      inf_π ∫ KL(P2(x, ·) || πQ(·|x)) P1(dx),

  where P(dx, dy) = P1(dx) ⊗ P2(x, dy).
- Quadratic loss: minimize the expected quadratic distance between the posterior means of πQ(·|x) and P2(x, ·):

      inf_π ∫ ‖mπQ(x) − mP2(x)‖₂² P1(dx).
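Minimizing the log-loss objective is equivalent to minimizing E_P[−log πQ(Y | X)], since the conditional entropy of P does not depend on π. A Monte Carlo sketch of that comparison (the true prior Beta(3, 3) and the two candidate priors are illustrative assumptions):

```python
# Sketch (hypothetical setup): compare two Beta priors under log loss
# in a Beta-Bernoulli model via Monte Carlo.
import math
import random

random.seed(1)
n = 10  # number of observed flips

def sample_from_P():
    """True belief P: theta ~ Beta(3, 3); X is n flips, Y the next flip."""
    theta = random.betavariate(3, 3)
    x = [int(random.random() < theta) for _ in range(n)]
    y = int(random.random() < theta)
    return x, y

def log_loss_risk(alpha1, alpha0, num_samples=20000):
    """Monte Carlo estimate of E_P[-log piQ(Y | X)] for a Beta prior."""
    total = 0.0
    for _ in range(num_samples):
        x, y = sample_from_P()
        # Posterior predictive probability of Y = 1 under Beta(alpha1, alpha0).
        p1 = (alpha1 + sum(x)) / (alpha1 + alpha0 + n)
        total += -math.log(p1 if y == 1 else 1.0 - p1)
    return total / num_samples

# The well-matched prior Beta(3, 3) should incur lower log loss than Beta(0.5, 0.5).
print(log_loss_risk(3, 3), log_loss_risk(0.5, 0.5))
```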
Example: “Misspecified” priors
data are coin tosses: 10001001100001000100100
- Model: i.i.d. Bernoulli(θ) sequence, unknown θ
- Prior belief: π̃(dθ)
- Conjugate prior: π(dθ) = Beta(α1, α0)

Pragmatic question: If we are forced (or force ourselves) to use a Beta prior, how should we choose α1 and α0?
Meta-Bayesian Analysis for “misspecified” priors
Problem Setting
- X = {0, 1}^n, Y = {0, 1}^k.
- P: [Bernoulli(θ)]^(n+k) with θ ∼ π̃(dθ)
- Qθ: [Bernoulli(θ)]^(n+k) with prior θ ∼ π(dθ)

Results from Meta-Bayesian Analysis
- Quadratic loss: π should match the first n + 1 moments of π̃.
- Log loss: π should match the first n + k moments of π̃.
- If θ ∼ Beta(α1, α0): solve polynomial equations in the moments of π̃ to get α1 and α0.
The optimal prior depends on n and k!
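In the smallest quadratic-loss case (n = 1), π must match the first two moments of π̃, which exactly determines the two Beta parameters. A sketch of that moment matching, with π̃ taken (hypothetically) to be a 50/50 mixture of Beta(2, 8) and Beta(8, 2):

```python
# Sketch: choose Beta(alpha1, alpha0) by matching the first two moments
# of the true prior pi-tilde (the n = 1, quadratic-loss case). The
# mixture used for pi-tilde here is a hypothetical example.

def beta_from_moments(m1, m2):
    """Beta parameters whose mean is m1 and second moment E[theta^2] is m2."""
    var = m2 - m1 ** 2
    c = m1 * (1.0 - m1) / var - 1.0   # c = alpha1 + alpha0
    return m1 * c, (1.0 - m1) * c     # (alpha1, alpha0)

# Moments of the mixture: E[theta] = a/(a+b), E[theta^2] = a(a+1)/((a+b)(a+b+1)).
m1 = 0.5 * (2 / 10) + 0.5 * (8 / 10)                        # = 0.5
m2 = 0.5 * (2 * 3) / (10 * 11) + 0.5 * (8 * 9) / (10 * 11)  # = 39/110
print(beta_from_moments(m1, m2))      # → roughly (0.696, 0.696)
```

Note that the matched Beta is bimodal-ish (α < 1), mimicking the two-component mixture, and that for larger n a two-parameter Beta generally cannot match all n + 1 required moments, which is why the optimal prior depends on n and k.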
Conclusion and Future work
Conclusion
- The standard definition of a (subjective) prior is too restrictive.
- A more useful definition comes from Bayesian decision theory.
- The meta-Bayesian prior is the one you believe will lead to the best results.

Future Work
- Analysis of nested models Q ⊆ P.
- Analysis of the rationality of non-subjective procedures (e.g., empirical Bayes).