
Model Inference and Averaging
ESL Chapter 8
Lecturer: 张泽亚
2015-05-10
Outline
• Bootstrap
• Bootstrap vs Least Square, Maximum Likelihood and
Bayesian Inference
• Bagging
• Model Averaging
• EM Algorithm
7.11 Bootstrap Methods
• What is bootstrap?
♣ Suppose we have a model to fit to a set of training data.
♣ Denote the training set by $Z = (z_1, z_2, \ldots, z_N)$, where $z_i = (x_i, y_i)$.
♣ Randomly draw datasets with replacement from Z, each sample the same size as the original training set.
♣ Draw B such datasets: $Z^{*1}, Z^{*2}, \ldots, Z^{*B}$.
7.11 Bootstrap Methods
• What is bootstrap?
♣ $S(Z)$ is any quantity computed from the data Z, e.g. the prediction $\hat f(x_i)$.
♣ From the B bootstrap samples we can estimate any aspect of the distribution of $S(Z)$, e.g. its variance
$$\widehat{\mathrm{Var}}[S(Z)] = \frac{1}{B-1}\sum_{b=1}^{B}\bigl(S(Z^{*b}) - \bar S^*\bigr)^2, \qquad \bar S^* = \frac{1}{B}\sum_{b=1}^{B} S(Z^{*b})$$
(see the sketch below).
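As an illustration, here is a minimal numpy sketch of this estimate; the function name bootstrap_dist, the toy data, and the choice of statistic S (the median of y) are illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_dist(Z, S, B=200):
    """Return S(Z^{*1}), ..., S(Z^{*B}) for B bootstrap datasets drawn
    with replacement from the rows of Z, each of size N."""
    N = len(Z)
    return np.array([S(Z[rng.integers(0, N, size=N)]) for _ in range(B)])

# Toy example: estimate the sampling variance of the median of y.
Z = rng.normal(size=(50, 2))                 # rows are z_i = (x_i, y_i)
stats = bootstrap_dist(Z, S=lambda Zb: np.median(Zb[:, 1]), B=200)
var_hat = stats.var(ddof=1)                  # (1/(B-1)) sum_b (S(Z^{*b}) - S-bar*)^2
```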
8.2.1 A smoothing Example
• Fit a cubic spline to the dataset $Z = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, N = 50.
• Expand x in a basis of seven functions, $\mu(x) = \sum_{j=1}^{7}\beta_j h_j(x)$; let H be the $N \times 7$ matrix with ij-th element $h_j(x_i)$.
8.2.1 A smoothing Example
• Least squares estimate:
$$\hat\beta = (H^T H)^{-1} H^T y, \qquad \hat\mu(x) = h(x)^T\hat\beta = \sum_{j=1}^{7}\hat\beta_j h_j(x),$$
with estimated covariance and noise variance
$$\widehat{\mathrm{Var}}(\hat\beta) = (H^T H)^{-1}\hat\sigma^2, \qquad \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat\mu(x_i)\bigr)^2,$$
giving the pointwise standard error $\widehat{\mathrm{se}}[\hat\mu(x)] = \bigl[h(x)^T (H^T H)^{-1} h(x)\bigr]^{1/2}\hat\sigma$.
8.2.1 A smoothing Example
• Non-parametric bootstrap method (see the sketch below):
♣ Draw B (= 200) bootstrap datasets, each of size N = 50, with replacement from Z.
♣ To each bootstrap dataset $Z^{*b}$ we fit a cubic spline, giving $\hat\mu^{*b}(x)$.
♣ Form a pointwise 95% confidence band at each x by taking the 2.5% and 97.5% percentiles of the B fitted curves (with B = 200, the 5th smallest and 5th largest values).
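A sketch of the band construction, assuming for simplicity a plain cubic polynomial basis in place of the seven-dimensional spline basis, and synthetic data in place of the actual dataset:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the N = 50 training set
N, B = 50, 200
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(2 * x) + rng.normal(scale=0.3, size=N)

def basis(x):
    # simple cubic basis [1, x, x^2, x^3]; ESL uses 7 B-spline functions instead
    return np.vander(x, 4, increasing=True)

xgrid = np.linspace(0, 3, 100)
curves = np.empty((B, xgrid.size))
for b in range(B):
    idx = rng.integers(0, N, size=N)                      # bootstrap dataset Z^{*b}
    beta_b, *_ = np.linalg.lstsq(basis(x[idx]), y[idx], rcond=None)
    curves[b] = basis(xgrid) @ beta_b                     # fitted curve mu^{*b}(x)

# Pointwise 95% band: 2.5% and 97.5% percentiles of the 200 curves at each grid point
lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)
```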
8.2.1 A smoothing Example
• Parametric bootstrap method (see the sketch below):
♣ Suppose the model errors are Gaussian: $Y = \mu(X) + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, $\mu(x) = \sum_{j=1}^{7}\beta_j h_j(x)$.
♣ Simulate new responses by adding Gaussian noise to the fitted values, $y_i^* = \hat\mu(x_i) + \varepsilon_i^*$ with $\varepsilon_i^* \sim N(0, \hat\sigma^2)$, and form bootstrap samples $(x_1, y_1^*), (x_2, y_2^*), \ldots, (x_N, y_N^*)$.
♣ Bootstrap prediction: $\hat\mu^*(x) = h(x)^T (H^T H)^{-1} H^T y^*$. As $B \to \infty$ the resulting confidence bands agree with the least squares bands.
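The parametric version differs only in how the bootstrap responses are generated; continuing the same toy setup (synthetic data and a cubic polynomial basis as a stand-in):

```python
import numpy as np

rng = np.random.default_rng(2)

# Same toy setup as above: basis matrix H and responses y
N, B = 50, 200
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(2 * x) + rng.normal(scale=0.3, size=N)
H = np.vander(x, 4, increasing=True)

beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
mu_hat = H @ beta_hat
sigma2_hat = np.mean((y - mu_hat) ** 2)                   # noise variance estimate

boot_betas = np.empty((B, H.shape[1]))
for b in range(B):
    y_star = mu_hat + rng.normal(scale=np.sqrt(sigma2_hat), size=N)  # y_i* = mu_hat(x_i) + eps_i*
    boot_betas[b], *_ = np.linalg.lstsq(H, y_star, rcond=None)       # refit to (x_i, y_i*)

# As B grows, the spread of H @ boot_betas[b] reproduces the least-squares bands.
```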
8.2.2 Maximum Likelihood Inference
• In general, the parametric bootstrap agrees with maximum likelihood.
• In essence, the bootstrap is a computer implementation of nonparametric
or parametric maximum likelihood.
The derivation is left to the reader; see ESL Section 8.2.2.
8.3 Bayesian Methods
• Bayesian methods specify a prior $\Pr(\theta)$ and base inference on the posterior
$$\Pr(\theta \mid Z) = \frac{\Pr(Z \mid \theta)\,\Pr(\theta)}{\int \Pr(Z \mid \theta)\,\Pr(\theta)\,d\theta},$$
giving the predictive distribution $\Pr(z^{new} \mid Z) = \int \Pr(z^{new} \mid \theta)\,\Pr(\theta \mid Z)\,d\theta$.
• Versus maximum likelihood, which plugs in a point estimate: $\Pr(z^{new} \mid Z) = \Pr(z^{new} \mid \hat\theta)$.
8.3 Bayesian Methods
• For the smoothing example, assume $\sigma^2$ known and put a Gaussian prior on the coefficients, $\beta \sim N(0, \tau\Sigma)$; the posterior for $\mu(x) = h(x)^T\beta$ is then Gaussian with mean
$$E\bigl(\mu(x) \mid Z\bigr) = h(x)^T \Bigl(H^T H + \tfrac{\sigma^2}{\tau}\Sigma^{-1}\Bigr)^{-1} H^T y.$$
8.3 Bayesian Methods
• For the smoothing example, as $\tau \to \infty$ (a noninformative prior) the posterior mean tends to the least squares estimate $h(x)^T (H^T H)^{-1} H^T y$, and the posterior distribution of $\mu(x)$ matches the parametric bootstrap distribution of $\hat\mu^*(x)$.
• In Gaussian models, maximum likelihood and parametric bootstrap analyses tend to agree with Bayesian analyses that use noninformative priors.
8.7 Bagging
• What is Bagging? Bootstrap aggregation!
♣ Average the predictions over a collection of bootstrap samples.
• More formally (see the sketch below):
♣ For each bootstrap sample $Z^{*b}$ (b = 1, 2, ..., B), we fit our model and obtain the prediction $\hat f^{*b}(x)$.
♣ The bagging estimate is defined by
$$\hat f_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat f^{*b}(x).$$
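A sketch of the bagging estimate with scikit-learn decision trees as the base learner; the library choice and the toy data generator are assumptions, loosely mimicking the simulated example on the next slide.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)

# Toy two-class data: N = 30, five features, class probability depends on x1
N, B = 30, 200
X = rng.uniform(size=(N, 5))
y = (rng.uniform(size=N) < np.where(X[:, 0] <= 0.5, 0.2, 0.8)).astype(int)

trees = []
for b in range(B):
    idx = rng.integers(0, N, size=N)                       # bootstrap sample Z^{*b}
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def f_bag(Xnew):
    # Average the B bootstrap predictions, then take the majority (consensus) class
    votes = np.mean([t.predict(Xnew) for t in trees], axis=0)
    return (votes > 0.5).astype(int)

print("training accuracy of bagged trees:", np.mean(f_bag(X) == y))
```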
8.7 Bagging
• Example 1:
♣ Use a classification tree on a simulated two-class problem.
♣ Sample size N = 30, five features; $\Pr(Y = 1 \mid x_1 \le 0.5) = 0.2$, $\Pr(Y = 1 \mid x_1 > 0.5) = 0.8$.
♣ 200 bootstrap samples.
• Consensus: bag by majority vote over the B trees.
• Probability: bag by averaging the predicted class probabilities.
• Bagging decreases the test error because it lowers the variance of the unstable tree classifier.
8.7 Bagging
• Example 2 ("wisdom of crowds"; a simulation sketch follows below):
♣ 50 people answer 10 questions, each with 4 possible choices.
♣ For each question, only a random subset of 15 people has some knowledge of the answer; the rest guess.
♣ The consensus answer to each question is the majority vote of all 50 people.
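A small simulation in the same spirit; the way "some knowledge" is modeled here (an informed voter answers correctly with probability 0.5, otherwise guesses) is an assumption for illustration, not the book's exact setup.

```python
import numpy as np

rng = np.random.default_rng(4)

people, questions, choices, informed = 50, 10, 4, 15
correct = rng.integers(0, choices, size=questions)              # true answer per question

votes = np.empty((questions, people), dtype=int)
for q in range(questions):
    knows = set(rng.choice(people, size=informed, replace=False))  # random informed subset
    for p in range(people):
        if p in knows and rng.uniform() < 0.5:
            votes[q, p] = correct[q]                            # informed and correct
        else:
            votes[q, p] = rng.integers(0, choices)              # random guess

# Consensus: majority vote of all 50 people on each question
consensus = np.array([np.bincount(votes[q], minlength=choices).argmax()
                      for q in range(questions)])
print("consensus accuracy:", np.mean(consensus == correct))
print("average individual accuracy:", np.mean(votes == correct[:, None]))
```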
8.8 Model Averaging and Stacking
• Bayesian Model Averaging from Probability viewpoint:
♣ We have a set of candidate models $M_m$, m = 1, 2, ..., M, for the training data Z.
♣ Suppose ζ is some quantity of interest, e.g. the prediction f(x).
♣ The posterior mean of ζ is
$$E(\zeta \mid Z) = \sum_{m=1}^{M} E(\zeta \mid M_m, Z)\,\Pr(M_m \mid Z),$$
a weighted average of the individual model predictions, weighted by the posterior model probabilities.
8.8 Model Averaging and Stacking
• Bayesian model averaging from a frequentist viewpoint (see the sketch below):
♣ Given predictions $\hat f_1(x), \hat f_2(x), \ldots, \hat f_M(x)$ from M models,
♣ seek weights $w = (w_1, w_2, \ldots, w_M)$ minimizing $E_{\mathcal P}\bigl[Y - \sum_{m=1}^{M} w_m \hat f_m(x)\bigr]^2$.
♣ The solution is the population linear regression of Y on $\hat F(x)^T \equiv [\hat f_1(x), \hat f_2(x), \ldots, \hat f_M(x)]$: $\hat w = E_{\mathcal P}[\hat F(x)\hat F(x)^T]^{-1} E_{\mathcal P}[\hat F(x) Y]$.
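In practice the population expectations are approximated on data. A toy sketch (the three "model" predictions here are fixed functions of x, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Columns of F are the predictions f_1(x), ..., f_M(x) of M candidate models,
# evaluated on a large sample standing in for the population P.
n = 500
x = rng.uniform(-2, 2, n)
y = np.sin(x) + rng.normal(scale=0.2, size=n)
F = np.column_stack([x, x ** 2, np.sin(x)])        # three hypothetical model predictions

# w_hat = E[F F^T]^{-1} E[F Y], approximated by least squares of y on F
w_hat, *_ = np.linalg.lstsq(F, y, rcond=None)
f_avg = F @ w_hat                                   # model-averaged prediction
```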
8.8 Model Averaging and Stacking
• Stacking (stacked generalization); a sketch follows below:
♣ Let $\hat f_m^{-i}(x)$ be the prediction at x of model m, fit to the dataset with the i-th training observation removed.
♣ The stacking weights are given by
$$\hat w^{\mathrm{st}} = \operatorname*{argmin}_{w}\ \sum_{i=1}^{N}\Bigl[y_i - \sum_{m=1}^{M} w_m \hat f_m^{-i}(x_i)\Bigr]^2.$$
♣ There is a close connection between stacking and model selection via leave-one-out cross-validation.
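A leave-one-out sketch of the stacking weights, with two simple least-squares fits (linear and cubic) standing in for the M candidate models:

```python
import numpy as np

rng = np.random.default_rng(6)

n = 40
x = np.sort(rng.uniform(-2, 2, n))
y = np.sin(x) + rng.normal(scale=0.2, size=n)

# Two stand-in models, each defined by a basis expansion fit by least squares
bases = [lambda t: np.vander(t, 2, increasing=True),    # model 1: linear
         lambda t: np.vander(t, 4, increasing=True)]    # model 2: cubic
M = len(bases)

# loo[i, m] = f_m^{-i}(x_i): prediction of model m at x_i, fit without observation i
loo = np.empty((n, M))
for i in range(n):
    keep = np.arange(n) != i
    for m, h in enumerate(bases):
        beta, *_ = np.linalg.lstsq(h(x[keep]), y[keep], rcond=None)
        loo[i, m] = (h(x[i:i + 1]) @ beta)[0]

# Stacking weights: least squares of y on the leave-one-out predictions
w_st, *_ = np.linalg.lstsq(loo, y, rcond=None)
```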
8.9 Bumping
• What is Bumping:
♣ Instead of averaging the models, bumping is a technique for finding a better single
model.
♣ We draw bootstrap samples $Z^{*1}, Z^{*2}, \ldots, Z^{*B}$.
♣ We fit the model to each $Z^{*b}$, giving predictions $\hat f^{*b}(x)$, b = 1, ..., B (the original dataset is included as one of the candidates).
♣ We then choose the model from the bootstrap sample $\hat b$ that best fits the original training data:
$$\hat b = \operatorname*{argmin}_{b}\ \sum_{i=1}^{N}\bigl[y_i - \hat f^{*b}(x_i)\bigr]^2.$$
♣ For problems where the fitting method finds many local minima, bumping can help the method avoid getting stuck in poor solutions (a sketch follows below).
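A sketch of bumping, using a cubic least-squares fit as a stand-in fitting method (bumping is most useful when the real fitting method has many local minima, which this toy fit does not):

```python
import numpy as np

rng = np.random.default_rng(7)

N, B = 50, 20
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(2 * x) + rng.normal(scale=0.3, size=N)

def fit(xb, yb):
    # stand-in fitting method: cubic least-squares fit, returned as a prediction function
    beta, *_ = np.linalg.lstsq(np.vander(xb, 4, increasing=True), yb, rcond=None)
    return lambda xnew: np.vander(xnew, 4, increasing=True) @ beta

candidates = [fit(x, y)]                        # include the fit to the original data
for b in range(B):
    idx = rng.integers(0, N, size=N)            # bootstrap sample Z^{*b}
    candidates.append(fit(x[idx], y[idx]))

# Keep the single candidate with the smallest error on the ORIGINAL training data
errors = [np.mean((y - f(x)) ** 2) for f in candidates]
f_bump = candidates[int(np.argmin(errors))]
```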
8.5 EM Algorithm
• Expectation Maximization(EM)?
♣ A popular tool for simplifying difficult maximum likelihood problems
• Example: two-component Gaussian mixture model,
$$Y \sim (1 - \pi)\,N(\mu_1, \sigma_1^2) + \pi\,N(\mu_2, \sigma_2^2), \qquad \theta = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2).$$
• Maximum likelihood: the log-likelihood
$$\ell(\theta; Z) = \sum_{i=1}^{N}\log\bigl[(1 - \pi)\phi_{\theta_1}(y_i) + \pi\phi_{\theta_2}(y_i)\bigr]$$
has a sum inside the logarithm, so $\partial\ell(\theta; Z)/\partial\theta = 0$ is difficult to solve directly.
8.5 EM Algorithm
• Example: Two-Component Mixture Model
• Introduce a latent variable Δ:
♣ $\Delta_i = 0$ if $Y_i$ comes from component 1; $\Delta_i = 1$ if $Y_i$ comes from component 2.
♣ If we knew the $\Delta_i$, the log of a sum would become a sum of logs, and the complete-data log-likelihood
$$\ell_0(\theta; Z, \Delta) = \sum_{i=1}^{N}\bigl[(1 - \Delta_i)\log\phi_{\theta_1}(y_i) + \Delta_i\log\phi_{\theta_2}(y_i)\bigr] + \sum_{i=1}^{N}\bigl[(1 - \Delta_i)\log(1 - \pi) + \Delta_i\log\pi\bigr]$$
would be easy to maximize.
• But Δ is unobserved: how do we find the θ that maximizes $\ell_0(\theta; Z, \Delta)$?
8.5 EM Algorithm
• Example: Two-Component Mixture Model
• Let us rewrite the model more generally: J data points, each assumed to come from one of I components, with indicators $z_{ij} = 1$ if point j comes from component i, mixing proportions $\alpha_i$, and component parameters $\beta_i$, giving the complete-data likelihood
$$L_c(\theta) = \prod_{j=1}^{J}\prod_{i=1}^{I}\bigl[\alpha_i\, f_i(y_j \mid \beta_i)\bigr]^{z_{ij}}.$$
♣ For the two-component example, the $z_{ij}$ play the role of $(\Delta, 1 - \Delta)$, the $\alpha_i$ of $(\pi, 1 - \pi)$, and the $\beta_i$ of $(\mu_i, \sigma_i)$.
8.5 EM Algorithm
• Example: Two-Component Mixture Model
• Let us rewrite the model more generally, with $\theta = (\alpha, \beta)$.
• To calculate $\theta^{(t+1)}$ from the current estimate $\theta^{(t)}$:
♣ Expectation step: determine $E[z_{ij} \mid y, \theta^{(t)}]$.
♣ Maximization step: maximize $E[\log L_c(\theta) \mid y, \theta^{(t)}]$ with respect to θ, where $L_c$ is the complete-data likelihood.
8.5 EM Algorithm
• Example: Two-Component Mixture Model
• Back to the two-component problem (a numpy sketch follows below):
♣ E-step: compute the responsibilities
$$\hat\gamma_i = \frac{\hat\pi\,\phi_{\hat\theta_2}(y_i)}{(1 - \hat\pi)\,\phi_{\hat\theta_1}(y_i) + \hat\pi\,\phi_{\hat\theta_2}(y_i)} = E[\Delta_i \mid \hat\theta, Z].$$
♣ M-step: update the weighted means and variances,
$$\hat\mu_1 = \frac{\sum_i (1 - \hat\gamma_i) y_i}{\sum_i (1 - \hat\gamma_i)}, \qquad \hat\sigma_1^2 = \frac{\sum_i (1 - \hat\gamma_i)(y_i - \hat\mu_1)^2}{\sum_i (1 - \hat\gamma_i)},$$
similarly for $\hat\mu_2, \hat\sigma_2^2$ using the $\hat\gamma_i$, and $\hat\pi = \sum_i \hat\gamma_i / N$; iterate until convergence.
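A compact numpy/scipy sketch of these two steps for the two-component Gaussian mixture; the synthetic data, starting values, and fixed iteration count are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)

# Synthetic data from a two-component Gaussian mixture
y = np.concatenate([rng.normal(1.0, 0.5, 60), rng.normal(4.0, 1.0, 40)])

# Starting values: two random observations as means, overall variance, pi = 0.5
mu1, mu2 = rng.choice(y, size=2, replace=False)
s1 = s2 = y.var()
pi = 0.5

for _ in range(100):
    # E-step: responsibilities gamma_i = E[Delta_i | theta_hat, Z]
    p1 = (1 - pi) * norm.pdf(y, mu1, np.sqrt(s1))
    p2 = pi * norm.pdf(y, mu2, np.sqrt(s2))
    gamma = p2 / (p1 + p2)

    # M-step: weighted means, variances and mixing proportion
    mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
    mu2 = np.sum(gamma * y) / np.sum(gamma)
    s1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
    s2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
    pi = gamma.mean()
```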
8.5.2 EM Algorithm in General
• Why does the algorithm work?
♣ We want to maximize the observed-data log-likelihood $\ell(\theta'; Z)$, not the complete-data log-likelihood $\ell_0(\theta'; T)$, where $T = (Z, Z^m)$ is the complete data (Z observed, $Z^m$ latent).
♣ Since $\Pr(Z^m \mid Z, \theta') = \Pr(T \mid \theta') / \Pr(Z \mid \theta')$, taking conditional expectations with respect to $\Pr(T \mid Z, \theta)$ gives
$$\ell(\theta'; Z) = E[\ell_0(\theta'; T) \mid Z, \theta] - E[\ell_1(\theta'; Z^m \mid Z) \mid Z, \theta] \equiv Q(\theta', \theta) - R(\theta', \theta).$$
♣ The M-step maximizes Q with respect to $\theta'$, so $Q(\theta', \theta) - Q(\theta, \theta) \ge 0$; by Jensen's inequality (Exercise 8.1), $R(\theta', \theta) - R(\theta, \theta) \le 0$.
♣ Therefore $\ell(\theta'; Z) - \ell(\theta; Z) = [Q(\theta', \theta) - Q(\theta, \theta)] - [R(\theta', \theta) - R(\theta, \theta)] \ge 0$: each EM step never decreases the observed-data log-likelihood.
8.5.3 EM as Maximization-Maximization Procedure
• Consider the function
$$F(\theta', \tilde P) = E_{\tilde P}\bigl[\ell_0(\theta'; T)\bigr] - E_{\tilde P}\bigl[\log \tilde P(Z^m)\bigr],$$
where $\tilde P(Z^m)$ is any distribution over the latent data $Z^m$.
• EM is a joint maximization of F: the E-step maximizes F over $\tilde P$ with $\theta'$ fixed, and the maximizer is $\tilde P(Z^m) = \Pr(Z^m \mid Z, \theta')$; the M-step maximizes F over $\theta'$ with $\tilde P$ fixed.
• When we choose $\tilde P(Z^m) = \Pr(Z^m \mid Z, \theta')$, F agrees with the log-likelihood of the observed data:
♣ $F(\theta', \tilde P) = \ell(\theta'; Z)$.
• (In the accompanying figure, the red curve is the observed-data log-likelihood, i.e. the profile of F over $\tilde P$ with $\tilde P(Z^m) = \Pr(Z^m \mid Z, \theta')$.)
8.6 MCMC for Sampling from the Posterior
• MCMC: Markov chain Monte Carlo
• Gibbs sampling: an MCMC procedure.
♣ Random variables $U_1, \ldots, U_K$.
♣ It is difficult to compute or sample from the joint distribution $\Pr(U_1, \ldots, U_K)$,
♣ but easy to simulate from the conditionals $\Pr(U_j \mid U_1, \ldots, U_{j-1}, U_{j+1}, \ldots, U_K)$.
♣ How do we get a sample from the joint distribution? Cycle through j = 1, ..., K, drawing each $U_j$ from its conditional given the current values of the others; under mild conditions the draws converge in distribution to the joint (a sketch follows below).
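A minimal Gibbs sampler for a case where the conditionals are easy even if we pretend the joint is unavailable: a bivariate normal with correlation rho (the target distribution here is an assumption chosen for illustration).

```python
import numpy as np

rng = np.random.default_rng(9)

# Target: (U1, U2) bivariate normal with unit variances and correlation rho.
# Each full conditional U_j | U_{-j} is a univariate normal, easy to sample from.
rho, T = 0.8, 5000
u1 = u2 = 0.0
samples = np.empty((T, 2))
for t in range(T):
    u1 = rng.normal(rho * u2, np.sqrt(1 - rho ** 2))    # U1 | U2 ~ N(rho*U2, 1 - rho^2)
    u2 = rng.normal(rho * u1, np.sqrt(1 - rho ** 2))    # U2 | U1 ~ N(rho*U1, 1 - rho^2)
    samples[t] = (u1, u2)

# After a burn-in, the pairs are (dependent) draws from the joint distribution
print(np.corrcoef(samples[1000:].T)[0, 1])              # should be close to rho
```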
8.6 MCMC for Sampling from the Posterior
• Gibbs Sampling is closely related to EM algorithm:
♣ Gibbs sampling simulates from the conditional distributions,
♣ whereas EM maximizes over (takes expectations of) them rather than sampling.
Thanks