PAC-Bayes Analysis: Background and Applications

John Shawe-Taylor
University College London
Chicago/TTI Workshop
June, 2009
Including joint work with John Langford, Amiran Ambroladze and Emilio Parrado-Hernández, Cédric Archambeau, Matthew Higgs, Manfred Opper
Aims
Hope to give you:
PAC-Bayes framework
Core result
How to apply to Support Vector Machines
Application to maximum entropy classification
Application to Gaussian Processes and dynamical systems modeling
Outline
1. Background to Approach
2. PAC-Bayes Analysis
   Definitions
   PAC-Bayes Theorem
   Applications
3. Linear Classifiers
   General Approach
   Learning the prior
4. Maximum entropy classification
   Generalisation
   Optimisation
5. GPs and SDEs
   Gaussian Process regression
   Variational approximation
   Generalisation
General perspectives
The goal of different theories is to capture the key elements that enable an understanding and analysis of different phenomena
Several theories of machine learning: notably Bayesian and frequentist
Different assumptions, and hence different ranges of applicability and different ranges of results
The Bayesian approach is able to make more detailed probabilistic predictions
The frequentist approach makes only the i.i.d. assumption
Historical notes: Frequentist approach
Pioneered in Russia by Vapnik and Chervonenkis
Introduced in the west by Valiant under the name of 'probably approximately correct'
Typical results state that with probability at least 1 − δ (probably), any classifier from the hypothesis class which has low training error will have low generalisation error (approximately correct).
Has the status of a statistical test: the confidence is denoted by δ, the probability that the sample is misleading/unusual.
SVM bound using the luckiness framework by S-T et al. (1998)
Historical notes: Bayesian approach
Name derives from Bayes theorem: we assume a prior distribution over functions or classifiers and then use Bayes rule to update the prior based on the likelihood of the data for each function
This gives the posterior distribution: the Bayesian will classify according to the expected classification under the posterior – the best strategy given that the prior is correct
Can be used for model selection by evaluating the 'evidence' for a model (see for example David MacKay) – this is related to the volume of version space consistent with the data
Gaussian processes for regression are justified within this model
Version space: evidence
[Figure: version space illustration in weight space, showing the hyperplanes f(x1, w) = 0, f(x2, w) = 0, f(x3, w) = 0, f(x4, w) = 0, two consistent weight vectors w and w', and regions C1, C2, C3]
Evidence and generalisation
Link between evidence and generalisation hypothesised by MacKay
First formal link was obtained by S-T & Williamson (1997): PAC Analysis of a Bayes Estimator
Bound on generalisation in terms of the volume of the sphere that can be inscribed in the version space – included a dependence on the dimensionality of the space
Used the luckiness framework – a data-dependent style of frequentist bound also used to bound the generalisation of SVMs, for which no dependence on the dimensionality is needed, just on the margin
PAC-Bayes Theorem
First version proved by McAllester in 1999
Improved proof and bound due to Seeger in 2002, with application to Gaussian processes
Application to SVMs by Langford and S-T also in 2002
Excellent tutorial by Langford appeared in 2005 in JMLR
Definitions for main result
Prior and posterior distributions
The PAC-Bayes theorem involves a class of classifiers C together with a prior distribution P and a posterior Q over C
The distribution P must be chosen before learning, but the bound holds for all choices of Q, hence Q does not need to be the classical Bayesian posterior
The bound holds for all (prior) choices of P – hence its validity is not affected by a poor choice of P, though the quality of the resulting bound may be – contrast with a standard Bayesian analysis, which only holds if the prior assumptions are correct
Definitions for main result
Error measures
Being a frequentist (PAC) style result, we assume an unknown distribution D on the input space X.
D is used to generate the labelled training samples i.i.d., i.e. S ∼ D^m
It is also used to measure the generalisation error c_D of a classifier c:
  c_D = Pr_{(x,y)∼D}(c(x) ≠ y)
The empirical generalisation error is denoted ĉ_S:
  ĉ_S = (1/m) Σ_{(x,y)∈S} I[c(x) ≠ y],  where I[·] is the indicator function.
Definitions for main result
Assessing the posterior
The result is concerned with bounding the performance of a probabilistic classifier that, given a test input x, chooses a classifier c ∼ Q (the posterior) and returns c(x)
We are interested in the relation between two quantities:
  Q_D = E_{c∼Q}[c_D]
the true error rate of the probabilistic classifier, and
  Q̂_S = E_{c∼Q}[ĉ_S]
its empirical error rate
Definitions for main result
Generalisation error
Note that this does not bound the posterior average, but we have
  Pr_{(x,y)∼D}(sgn(E_{c∼Q}[c(x)]) ≠ y) ≤ 2 Q_D,
since for any point x misclassified by sgn(E_{c∼Q}[c(x)]) the probability of a random c ∼ Q misclassifying it is at least 0.5.
PAC-Bayes Theorem
Fix an arbitrary D, an arbitrary prior P, and a confidence δ; then with probability at least 1 − δ over samples S ∼ D^m, all posteriors Q satisfy
  KL(Q̂_S ‖ Q_D) ≤ (KL(Q‖P) + ln((m+1)/δ)) / m
where KL is the KL divergence between distributions,
  KL(Q‖P) = E_{c∼Q}[ln(Q(c)/P(c))],
with Q̂_S and Q_D considered as distributions on {0, 1}.
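The theorem is typically used in the inverted form that appears later: given the empirical Gibbs error Q̂_S and the right-hand side, the largest Q_D consistent with the KL inequality is an explicit bound on the true Gibbs error. A minimal numerical sketch in plain Python (the function names are illustrative, not from any particular library):

```python
import math

def kl_bernoulli(q, p):
    """KL(q||p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q_hat, bound, tol=1e-10):
    """Largest p >= q_hat with KL(q_hat||p) <= bound, found by bisection."""
    lo, hi = q_hat, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q_hat, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

def pac_bayes_rhs(kl_QP, m, delta):
    """Right-hand side (KL(Q||P) + ln((m+1)/delta)) / m of the theorem."""
    return (kl_QP + math.log((m + 1) / delta)) / m

# Illustrative numbers: empirical Gibbs error 0.05, KL(Q||P) = 10, m = 10000
print(kl_inverse(0.05, pac_bayes_rhs(10.0, 10000, 0.05)))  # bound on Q_D
```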
Finite Classes
If we take a finite class of functions h_1, ..., h_N with prior distribution p_1, ..., p_N and assume that the posterior is concentrated on a single function h_i, the generalisation is bounded by
  KL(êrr(h_i) ‖ err(h_i)) ≤ (−ln(p_i) + ln((m+1)/δ)) / m
This is the standard result for finite classes, with the slight refinement that it involves the KL divergence between empirical and true error, and the extra ln(m+1) term on the right-hand side.
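As a concrete instance with made-up numbers (reusing the kl_inverse helper sketched above), a uniform prior over N hypotheses gives −ln p_i = ln N:

```python
import math

N, m, delta = 1000, 5000, 0.05
emp_err = 0.02                                   # empirical error of the chosen h_i
rhs = (math.log(N) + math.log((m + 1) / delta)) / m
print(kl_inverse(emp_err, rhs))                  # upper bound on the true error of h_i
```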
Linear classifiers and SVMs
Focus on the linear function application (Langford & S-T)
How the application is made
Extensions to learning the prior
Some results on UCI datasets to give an idea of what can be achieved
Linear classifiers
We will choose the prior and posterior distributions to be Gaussians with unit variance.
The prior P will be centered at the origin.
The centre of the posterior Q(w, µ) will be specified by a unit vector w and a scale factor µ.
PAC-Bayes Bound for SVM (1/2)
Prior P is Gaussian N(0, 1), centred at the origin
Posterior Q is Gaussian in the direction w, at distance µ from the origin
[Figure: weight space W with the spherical prior P centred at 0 and the posterior Q centred at µw]
PAC-Bayes Bound for SVM (2/2)
The performance of linear classifiers may be bounded by
  KL(Q̂_S(w, µ) ‖ Q_D(w, µ)) ≤ (KL(P‖Q(w, µ)) + ln((m+1)/δ)) / m
Q_D(w, µ) is the true performance of the stochastic classifier
The SVM is a deterministic classifier that corresponds exactly to sgn(E_{c∼Q(w,µ)}[c(x)]), since the centre of the Gaussian gives the same classification as the halfspace carrying more weight.
Hence its error is bounded by 2 Q_D(w, µ), since, as observed above, if x is misclassified then at least half of the c ∼ Q err.
Q̂_S(w, µ) is the stochastic measure of the training error:
  Q̂_S(w, µ) = E_m[F̃(µ γ(x, y))]
  γ(x, y) = y⟨w, φ(x)⟩ / (‖φ(x)‖ ‖w‖)
  F̃(t) = 1 − (1/√(2π)) ∫_{−∞}^{t} e^{−x²/2} dx
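A minimal numpy sketch of this quantity (the function and variable names are illustrative): given data X, labels y in {−1, +1}, a weight vector w and a scale µ, compute the normalised margins and average the Gaussian upper tail, using F̃(t) = ½ erfc(t/√2):

```python
import numpy as np
from math import erfc, sqrt

def stochastic_train_error(w, X, y, mu):
    """Q_S_hat(w, mu): mean Gaussian upper tail of the scaled normalised margins."""
    margins = y * (X @ w) / (np.linalg.norm(X, axis=1) * np.linalg.norm(w))
    return float(np.mean([0.5 * erfc(mu * g / sqrt(2.0)) for g in margins]))
```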
The prior P is the Gaussian centred on the origin and the posterior Q is the Gaussian along w at a distance µ from the origin, so
  KL(P‖Q) = µ²/2
δ is the confidence: the bound holds with probability at least 1 − δ over the random i.i.d. selection of the training data.
Form of the SVM bound
Note that the bound holds for all posterior distributions, so we can choose µ to optimise the bound
If we define the inverse of the KL by
  KL⁻¹(q, A) = max{p : KL(q‖p) ≤ A}
then we have, with probability at least 1 − δ,
  Pr_{(x,y)∼D}(sgn⟨w, φ(x)⟩ ≠ y) ≤ 2 min_µ KL⁻¹( E_m[F̃(µ γ(x, y))], (µ²/2 + ln((m+1)/δ)) / m )
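Putting the pieces together, a sketch of evaluating this bound numerically (reusing kl_inverse and stochastic_train_error from the earlier sketches; the grid of µ values is an illustrative choice, and optimising µ after seeing the data is valid because the theorem holds simultaneously for all posteriors):

```python
import numpy as np
from math import log

def svm_pac_bayes_bound(w, X, y, delta=0.05, mus=np.linspace(0.1, 100.0, 200)):
    """2 * min over mu of KL^{-1}(Q_S_hat(w, mu), (mu^2/2 + ln((m+1)/delta)) / m)."""
    m = len(y)
    best = 1.0
    for mu in mus:
        q_hat = stochastic_train_error(w, X, y, mu)
        rhs = (mu ** 2 / 2.0 + log((m + 1) / delta)) / m
        best = min(best, kl_inverse(q_hat, rhs))
    return 2.0 * best
```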
Gives SVM Optimisation
Primal form:
  min_{w, ξ}  (1/2)‖w‖² + C Σ_{i=1}^m ξ_i
  s.t.  y_i w^T φ(x_i) ≥ 1 − ξ_i,  i = 1, ..., m
        ξ_i ≥ 0,  i = 1, ..., m
Dual form:
  max_α  Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j κ(x_i, x_j)
  s.t.  0 ≤ α_i ≤ C,  i = 1, ..., m
where κ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ and ⟨w, φ(x)⟩ = Σ_{i=1}^m α_i y_i κ(x_i, x).
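For instance, w can be obtained from any standard SVM solver and plugged into the bound sketch above (scikit-learn is used here purely for illustration; the toy data are made up):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = np.sign(X[:, 0] + 0.5 * rng.randn(200))          # toy labels in {-1, +1}

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.coef_.ravel()                                 # primal weight vector (linear kernel)
print(svm_pac_bayes_bound(w, X, y, delta=0.05))       # bound on the true error
```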
Slack variable conversion
[Figure: plot over the range −2 to 2 on the horizontal axis (0 to 3 on the vertical axis) illustrating the slack variable conversion]
Learning the prior (1/3)
The bound depends on the distance between prior and posterior
A better prior (closer to the posterior) would lead to a tighter bound
Learn the prior P with part of the data
Introduce the learnt prior in the bound
Compute the stochastic error with the remaining data
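A rough sketch of this recipe (the split proportion, the prior scaling and the µ grid are illustrative assumptions, not the authors' exact protocol): learn a prior direction on part of the sample, then evaluate the bound on the rest, using KL(P‖Q) = ‖µw − w_prior‖²/2 for unit-variance Gaussians.

```python
import numpy as np
from math import log
from sklearn.svm import SVC

def prior_pac_bayes_sketch(X, y, delta=0.05, prior_frac=0.5, eta=1.0,
                           mus=np.linspace(0.1, 100.0, 200)):
    """Learn a prior direction on part of the data, bound on the remainder."""
    r = int(prior_frac * len(y))
    w_prior = SVC(kernel="linear", C=1.0).fit(X[:r], y[:r]).coef_.ravel()
    w_prior = eta * w_prior / np.linalg.norm(w_prior)     # prior centre (scaling eta assumed)

    w = SVC(kernel="linear", C=1.0).fit(X, y).coef_.ravel()
    w = w / np.linalg.norm(w)                             # posterior direction

    X_rest, y_rest, m = X[r:], y[r:], len(y) - r
    best = 1.0
    for mu in mus:
        q_hat = stochastic_train_error(w, X_rest, y_rest, mu)
        kl_pq = 0.5 * np.linalg.norm(mu * w - w_prior) ** 2   # KL between unit-variance Gaussians
        best = min(best, kl_inverse(q_hat, (kl_pq + log((m + 1) / delta)) / m))
    return 2.0 * best
```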
Tightness of the new bound

Problem              PAC-Bayes Bound    Prior-PAC-Bayes Bound
Wdbc                 0.346 ± 0.006      0.284 ± 0.021
Waveform             0.197 ± 0.002      0.143 ± 0.005
Ringnorm             0.211 ± 0.001      0.093 ± 0.004
Pima                 0.399 ± 0.007      0.374 ± 0.020
Landsat              0.035 ± 0.001      0.023 ± 0.002
Handwritten-digits   0.159 ± 0.001      0.084 ± 0.003
Spam                 0.243 ± 0.002      0.161 ± 0.006
Average              0.227              0.166
Model Selection with the new bound: results

Problem              PAC-SVM           Prior-PAC-Bayes    Ten Fold XVal
Wdbc                 0.070 ± 0.024     0.070 ± 0.024      0.067 ± 0.024
Waveform             0.090 ± 0.008     0.091 ± 0.008      0.086 ± 0.008
Ringnorm             0.034 ± 0.003     0.024 ± 0.003      0.016 ± 0.003
Pima                 0.241 ± 0.031     0.236 ± 0.031      0.245 ± 0.040
Landsat              0.011 ± 0.002     0.007 ± 0.002      0.005 ± 0.002
Handwritten-digits   0.015 ± 0.002     0.016 ± 0.002      0.007 ± 0.002
Spam                 0.090 ± 0.009     0.088 ± 0.009      0.063 ± 0.008
Average              0.079             0.076              0.070

Test error achieved by the three settings.
Model selection with p-SVM

Problem        PAC-SVM           Prior-PAC-SVM     PriorSVM           η-PriorSVM
Wdbc           0.070 ± 0.024     0.070 ± 0.024     0.068 ± 0.0236     0.073 ± 0.023
Waveform       0.090 ± 0.008     0.091 ± 0.008     0.085 ± 0.0188     0.085 ± 0.007
Ringnorm       0.034 ± 0.003     0.024 ± 0.003     0.014 ± 0.0077     0.015 ± 0.003
Pima           0.241 ± 0.031     0.236 ± 0.031     0.237 ± 0.0323     0.242 ± 0.033
Landsat        0.011 ± 0.002     0.007 ± 0.002     0.006 ± 0.0019     0.006 ± 0.002
Hand-digits    0.015 ± 0.002     0.016 ± 0.002     0.011 ± 0.0028     0.011 ± 0.003
Spam           0.090 ± 0.009     0.088 ± 0.009     0.075 ± 0.0093     0.080 ± 0.009
Average        0.079             0.076             0.071              0.073
Tightness of the bound with p-SVM

Problem        PAC-SVM           Prior-PAC-SVM     PriorSVM           η-PriorSVM
Wdbc           0.346 ± 0.006     0.284 ± 0.021     0.308 ± 0.0252     0.271 ± 0.027
Waveform       0.197 ± 0.002     0.143 ± 0.005     0.156 ± 0.0054     0.136 ± 0.006
Ringnorm       0.211 ± 0.001     0.093 ± 0.004     0.054 ± 0.0038     0.049 ± 0.003
Pima           0.399 ± 0.007     0.374 ± 0.020     0.418 ± 0.0182     0.391 ± 0.021
Landsat        0.035 ± 0.001     0.023 ± 0.002     0.027 ± 0.0032     0.022 ± 0.002
Hand-digits    0.159 ± 0.001     0.084 ± 0.003     0.046 ± 0.0045     0.042 ± 0.004
Spam           0.243 ± 0.002     0.161 ± 0.006     0.171 ± 0.0065     0.145 ± 0.007
Average        0.227             0.166             0.169              0.151
Maximum entropy learning
Consider the function class, for X a subset of the ℓ∞ unit ball,
  F = { f_w : x ∈ X ↦ sgn( Σ_{i=1}^N w_i x_i )  :  ‖w‖₁ ≤ 1 }.
We want a posterior distribution Q(w) such that we can bound
  P_{(x,y)∼D}(f_w(x) ≠ y) ≤ 2 e_{Q(w)} (= 2 Q_D(w)) = 2 E_{(x,y)∼D, q∼Q(w)}[I[q(x) ≠ y]]
Given a training sample S = {(x_1, y_1), ..., (x_m, y_m)}, we similarly define
  ê_{Q(w)} (= Q̂_S(w)) = (1/m) Σ_{i=1}^m E_{q∼Q(w)}[I[q(x_i) ≠ y_i]].
Posterior distribution Q(w)
The classifier q involves a random weight vector W ∈ R^N plus a random threshold Θ:
  q_{W,Θ}(x) = sgn(⟨W, x⟩ − Θ).
The distribution Q(w) of W will be discrete with
  W = sgn(w_i) e_i with probability |w_i|, i = 1, ..., N,
where e_i is the i-th unit vector. The distribution of Θ is uniform on the interval [−1, 1].
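A minimal numpy sketch of drawing one stochastic classifier from this posterior (names are illustrative; assumes ‖w‖₁ = 1 so the |w_i| form a probability vector):

```python
import numpy as np

def sample_maxent_classifier(w, rng):
    """Draw one q ~ Q(w): pick coordinate i with prob |w_i|, threshold ~ U[-1, 1]."""
    i = rng.choice(len(w), p=np.abs(w))
    theta = rng.uniform(-1.0, 1.0)
    sign_i = np.sign(w[i])
    return lambda x: np.sign(sign_i * x[i] - theta)

rng = np.random.default_rng(0)
w = np.array([0.5, -0.3, 0.2])                     # ||w||_1 = 1
q = sample_maxent_classifier(w, rng)
print(q(np.array([0.4, 0.1, -0.9])))               # prediction of one sampled classifier
```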
Error expression
Proposition
With the above definitions, we have for w satisfying ‖w‖₁ = 1 that, for any (x, y) ∈ X × {−1, +1},
  P_{q∼Q(w)}(q(x) ≠ y) = 0.5(1 − y⟨w, x⟩).
Error expression proof
Proof.
P_{q∼Q(w)}(q(x) ≠ y) = Σ_{i=1}^N |w_i| P_Θ(sgn(sgn(w_i)⟨e_i, x⟩ − Θ) ≠ y)
                     = Σ_{i=1}^N |w_i| P_Θ(sgn(sgn(w_i) x_i − Θ) ≠ y)
                     = 0.5 Σ_{i=1}^N |w_i| (1 − y sgn(w_i) x_i)
                     = 0.5 (1 − y⟨w, x⟩),
where the third line uses P_Θ(sgn(a − Θ) ≠ y) = 0.5(1 − y a) for Θ uniform on [−1, 1] and a ∈ [−1, 1], and the last line uses ‖w‖₁ = 1.
Generalisation error
Corollary
  P_{(x,y)∼D}(f_w(x) ≠ y) ≤ 2 e_{Q(w)}.
Proof.
  P_{q∼Q(w)}(q(x) ≠ y) ≥ 0.5  ⇔  f_w(x) ≠ y.
Base result
Theorem
With probability at least 1 − δ over the draw of training sets of size m,
  KL(ê_{Q(w)} ‖ e_{Q(w)}) ≤ ( Σ_{i=1}^N |w_i| ln|w_i| + ln(2N) + ln((m+1)/δ) ) / m
Proof.
Use the prior P uniform on the unit vectors ±e_i. With the posterior described above, KL(Q(w)‖P) equals ln(2N) minus the entropy of w.
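A small sketch of the right-hand side (an illustrative helper; 0·ln 0 is treated as 0, and the result can be fed to the kl_inverse helper sketched earlier):

```python
import numpy as np
from math import log

def maxent_bound_rhs(w, m, delta):
    """RHS of the base result, for ||w||_1 = 1."""
    p = np.abs(w)
    neg_entropy = float(np.sum(p[p > 0] * np.log(p[p > 0])))   # sum_i |w_i| ln |w_i|
    return (neg_entropy + log(2 * len(w)) + log((m + 1) / delta)) / m
```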
Interpretation
Suggests maximising the entropy as a means of minimising the bound.
Problem: the empirical error ê_{Q(w)} is too large:
  ê_{Q(w)} = (1/m) Σ_{i=1}^m 0.5(1 − y_i⟨w, x_i⟩)
A function of the margin – but just a linear function.
Boosting the bound
The trick to boost the power of the bound is to take T independent samples from the distribution Q(w) and vote for the classification:
  q_{W,Θ}(x) = sgn( Σ_{t=1}^T sgn(⟨W_t, x⟩ − Θ_t) ).
Now the empirical error becomes
  ê_{Q^T(w)} = (0.5^T / m) Σ_{i=1}^m Σ_{t=0}^{⌊T/2⌋} C(T, t) (1 + y_i⟨w, x_i⟩)^t (1 − y_i⟨w, x_i⟩)^{T−t},
giving a sigmoid-like loss as a function of the margin.
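A small sketch of this boosted empirical error (names are illustrative; assumes ‖w‖₁ = 1 and inputs in the ℓ∞ unit ball, so the per-vote error probabilities lie in [0, 1]):

```python
import numpy as np
from math import comb

def boosted_empirical_error(w, X, y, T):
    """P(a T-vote of independent draws from Q(w) errs), averaged over the sample."""
    p_correct = 0.5 * (1.0 + y * (X @ w))        # per-draw probability of a correct vote
    errs = [sum(comb(T, t) * p ** t * (1 - p) ** (T - t) for t in range(T // 2 + 1))
            for p in p_correct]                  # at most floor(T/2) correct votes
    return float(np.mean(errs))
```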
Full result
Theorem
With probability at least 1 − δ over the draw of training sets of size m,
  P_{(x,y)∼D}(f_w(x) ≠ y) ≤ 2 KL⁻¹( ê_{Q^T(w)}, ( T Σ_{i=1}^N |w_i| ln(|w_i|) + T ln(2N) + ln((m+1)/δ) ) / m )
Note the penalty factor of T applied to the KL term
It behaves like the (inverse) margin in the usual bounds
Algorithmics
The bound motivates the optimisation:
\[
\min_{w,\rho,\xi}\;\; \sum_{j=1}^{N} |w_j|\ln|w_j| \;-\; C\rho \;+\; D\sum_{i=1}^{m}\xi_i
\]
subject to:
\[
y_i\langle w, x_i\rangle \;\ge\; \rho - \xi_i,\; 1 \le i \le m, \qquad \|w\|_1 \le 1, \quad \xi_i \ge 0,\; 1 \le i \le m.
\]
This follows the SVM route of approximating the sigmoid-like loss by the (convex) hinge loss.
Dual optimisation
\[
\max_{\alpha}\;\; L \;=\; -\sum_{j=1}^{N} \exp\!\left( \sum_{i=1}^{m} \alpha_i y_i x_{ij} - 1 - \lambda \right) \;-\; \lambda
\]
subject to:
\[
\sum_{i=1}^{m} \alpha_i = C, \qquad 0 \le \alpha_i \le D,\; 1 \le i \le m.
\]
Similar to the SVM dual, but with an exponential function.
Surprisingly, it also gives dual sparsity.
Coordinate-wise descent works very well (cf. the SMO algorithm).
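The coordinate-wise idea can be illustrated with a toy SMO-style optimiser that moves mass between pairs of dual variables so the equality constraint is preserved. This is only a sketch under simplifying assumptions: λ is treated as a fixed hyperparameter rather than optimised, a grid line search stands in for an exact coordinate update, and the final projection back into the ‖w‖₁ ≤ 1 ball is a pragmatic fix. None of this is claimed to be the algorithm actually used in the work.

```python
import numpy as np

def dual_objective(alpha, X, y, lam):
    """L(alpha) = -sum_j exp(sum_i alpha_i y_i x_ij - 1 - lam) - lam."""
    s = X.T @ (alpha * y)                      # s_j = sum_i alpha_i y_i x_ij
    return -np.sum(np.exp(s - 1.0 - lam)) - lam

def pairwise_coordinate_ascent(X, y, C=1.0, D=0.1, lam=0.0, iters=500, seed=0):
    """Move mass between pairs (i, k) keeping sum(alpha) = C and 0 <= alpha <= D
    (assumes C / m <= D so the uniform start is feasible)."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    alpha = np.full(m, C / m)
    for _ in range(iters):
        i, k = rng.choice(m, size=2, replace=False)
        total = alpha[i] + alpha[k]
        lo, hi = max(0.0, total - D), min(D, total)
        grid = np.linspace(lo, hi, 25)         # crude line search over alpha_i
        vals = []
        for a in grid:
            alpha[i], alpha[k] = a, total - a
            vals.append(dual_objective(alpha, X, y, lam))
        best = grid[int(np.argmax(vals))]
        alpha[i], alpha[k] = best, total - best
    w = np.exp(X.T @ (alpha * y) - 1.0 - lam)  # primal weights from stationarity
    w = w / max(1.0, w.sum())                  # project back into ||w||_1 <= 1
    return alpha, w
```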
Results: effect of varying T
Figure: Bound value on Ionosphere as a function of T, for T between 0 and 40 (bound values range roughly from 0.9 to 1.15).
Results
Bound and test errors:

Data         Bound   Error   SVM error
Ionosphere   0.63    0.28    0.24
Votes        0.78    0.35    0.35
Glass        0.69    0.46    0.47
Haberman     0.64    0.25    0.26
Credit       0.60    0.25    0.28
Gaussian Process Regression
A GP is a distribution over real-valued functions that is multivariate Gaussian when restricted to any finite subset of inputs.
It is characterised by a kernel that specifies the covariance function when marginalising on any finite subset.
If we have a finite set of input/output observations generated with additive Gaussian noise on the outputs, the posterior is also a Gaussian process.
The KL divergence between prior and posterior can be computed as (K = RR′ is a Cholesky decomposition of K):
\[
2\,\mathrm{KL}(Q\|P) \;=\; \log\det\!\Bigl(I + \tfrac{1}{\sigma^2}K\Bigr) \;-\; \mathrm{tr}\!\left(\bigl(\sigma^2 I + K\bigr)^{-1}K\right) \;+\; \bigl\|R'\bigl(K + \sigma^2 I\bigr)^{-1}y\bigr\|^2.
\]
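A direct numerical transcription of this formula is straightforward. The sketch below assumes K is the kernel matrix on the training inputs, y the target vector and sigma2 the noise variance; the jitter is added only for numerical stability.

```python
import numpy as np

def gp_kl_times_two(K, y, sigma2):
    """2 * KL(Q || P) for GP regression, following the formula above."""
    n = len(y)
    I = np.eye(n)
    A = K + sigma2 * I
    R = np.linalg.cholesky(K + 1e-10 * I)           # K = R R^T
    term1 = np.linalg.slogdet(I + K / sigma2)[1]    # log det(I + K / sigma^2)
    term2 = -np.trace(np.linalg.solve(A, K))        # -tr((sigma^2 I + K)^{-1} K)
    v = R.T @ np.linalg.solve(A, y)                 # R'(K + sigma^2 I)^{-1} y
    return term1 + term2 + v @ v
```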
Applying PAC-Bayes theorem
Suggests we can use the PB theorem if we can create appropriate classifiers indexed by real-valued functions.
Consider, for some ε > 0, the classifiers
\[
h_f(x, y) \;=\; \begin{cases} 1 & \text{if } |y - f(x)| \le \varepsilon; \\ 0 & \text{otherwise.} \end{cases}
\]
We can compute the expected value of h_f under the posterior over functions:
\[
\mathbb{E}_{f\sim Q}[h_f(x, y)] \;=\; \tfrac{1}{2}\,\mathrm{erf}\!\left(\frac{y + \varepsilon - m(x)}{\sqrt{2v(x)}}\right) \;-\; \tfrac{1}{2}\,\mathrm{erf}\!\left(\frac{y - \varepsilon - m(x)}{\sqrt{2v(x)}}\right),
\]
where m(x) and v(x) are the posterior mean and variance at x.
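Since the posterior marginal at x is Gaussian with mean m(x) and variance v(x), this expectation is just the probability mass of an ε-interval around y. A minimal sketch, with mean, var and eps as assumed inputs:

```python
from math import erf, sqrt

def expected_h(y, mean, var, eps):
    """E_{f~Q}[h_f(x, y)]: posterior probability that |y - f(x)| <= eps
    when f(x) ~ N(mean, var)."""
    z_plus = (y + eps - mean) / sqrt(2.0 * var)
    z_minus = (y - eps - mean) / sqrt(2.0 * var)
    return 0.5 * (erf(z_plus) - erf(z_minus))
```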
GP Result
Furthermore, we can lower bound the posterior density at a point (x, y) by
\[
2\varepsilon\, \mathcal{N}\bigl(y \,|\, m(x), v(x)\bigr) \;\ge\; \mathbb{E}_{f\sim Q}[h_f(x, y)] \;-\; \sup_{\tau\in[-\varepsilon,\varepsilon]} \frac{\tau^2}{2\,v(x)\sqrt{2e\pi}},
\]
enabling an application of the PB theorem to give
\[
\mathbb{E}\!\left[\mathcal{N}\bigl(y \,|\, m(x), v(x)\bigr) + \frac{\varepsilon}{2\,v(x)\sqrt{2e\pi}}\right] \;\ge\; \frac{1}{2\varepsilon}\,\mathrm{KL}^{-1}\!\left( E(\varepsilon),\; \frac{D + \ln\bigl((m+1)/\delta\bigr)}{m} \right),
\]
where E(ε) is the empirical average of E_{f∼Q}[h_f(x, y)] and D is the KL divergence between prior and posterior.
GP Experimental Results
The robot arm problem (R): 150 training points and 51 test points.
The Boston housing problem (H): 455 training points and 51 test points.
The forest fire problem (F): 450 training points and 67 test points.

Data   σ        ê        KL⁻¹     e_test
R      0.0494   0.8903   0.4782   0.8419
H      0.1924   0.8699   0.4645   0.7155
F      1.0129   0.5694   0.4557   0.5533

varGP e_test: 0.8401, 0.9416
GP Experimental Results
We can also plot the test accuracy and bound as a function of ε:
Figure: Gaussian noise: plot of E_{(x,y)∼D}[1 − α(x)] against ε for varying noise level η: (a) η = 1, (b) η = 3, (c) η = 5.
GP Experimental Results
With Laplace noise:
Figure: Laplace noise: plot of E_{(x,y)∼D}[1 − α(x)] against ε for varying η: (a) η = 1, (b) η = 3, (c) η = 5.
GP Experimental Results
Robot arm problem and Boston housing:
Figure: Confidence levels for (a) the robot arm problem and (b) Boston housing.
Stochastic Differential Equation Models
Consider modelling a time-varying process with a (non-linear) stochastic differential equation:
\[
dx \;=\; f(x, t)\,dt \;+\; \sqrt{\Sigma}\, dW.
\]
Here f(x, t) is a non-linear drift term and dW is a Wiener process.
This is the limit of the discrete-time equation
\[
\Delta x_k \;\equiv\; x_{k+1} - x_k \;=\; f(x_k)\,\Delta t \;+\; \sqrt{\Delta t}\,\sqrt{\Sigma}\,\varepsilon_k,
\]
where ε_k is zero-mean, unit-variance Gaussian noise.
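The discrete-time form above is exactly an Euler–Maruyama scheme, so sample paths can be simulated directly. A small sketch; the Lorenz drift and its parameter values are illustrative assumptions (chosen only because the talk later fits a Lorenz system) and are not the experimental settings.

```python
import numpy as np

def simulate_sde(f, Sigma, x0, dt, steps, seed=0):
    """Euler-Maruyama: x_{k+1} = x_k + f(x_k) dt + sqrt(dt) sqrt(Sigma) eps_k."""
    rng = np.random.default_rng(seed)
    sqrt_Sigma = np.linalg.cholesky(Sigma)   # matrix square root of the diffusion
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(steps):
        eps = rng.standard_normal(len(x))
        x = x + f(x) * dt + np.sqrt(dt) * sqrt_Sigma @ eps
        path.append(x.copy())
    return np.array(path)

def lorenz(x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Classical Lorenz drift (parameter values assumed, not from the experiment)."""
    return np.array([sigma * (x[1] - x[0]),
                     x[0] * (rho - x[2]) - x[1],
                     x[0] * x[1] - beta * x[2]])

path = simulate_sde(lorenz, Sigma=4.0 * np.eye(3), x0=[1.0, 1.0, 25.0], dt=0.01, steps=500)
```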
Variational approximation
We use the Bayesian approach to data modelling with a noise model given by
\[
p(y_n \,|\, x(t_n)) \;=\; \mathcal{N}\bigl(y_n \,|\, H x(t_n), R\bigr).
\]
We consider a variational approximation of the posterior using a time-varying linear SDE:
\[
dx \;=\; f_L(x, t)\,dt \;+\; \sqrt{\Sigma}\, dW, \qquad \text{where } f_L(x, t) = -A(t)\,x + b(t).
\]
Girsanov change of measure
Denote the measure for the drift f by P and the one for the drift f_L by Q.
The KL divergence in this infinite-dimensional setting is given by the Radon–Nikodym derivative of Q with respect to P:
\[
\mathrm{KL}[Q\|P] \;=\; \int dQ \,\ln\frac{dQ}{dP} \;=\; \mathbb{E}_Q\!\left[\ln\frac{dQ}{dP}\right],
\]
which can be computed as
\[
\frac{dQ}{dP} \;=\; \exp\!\left\{ -\int_{t_0}^{t_f} (f - f_L)^\top \Sigma^{-1/2}\, d\widehat{W}_t \;+\; \frac{1}{2}\int_{t_0}^{t_f} (f - f_L)^\top \Sigma^{-1} (f - f_L)\, dt \right\},
\]
where Ŵ is a Wiener process with respect to Q.
KL divergence
Hence, the KL divergence is
\[
\mathrm{KL}[Q\|P] \;=\; \frac{1}{2}\int_{t_0}^{t_f} \Bigl\langle \bigl(f(x(t), t) - f_L(x(t), t)\bigr)^\top \Sigma^{-1} \bigl(f(x(t), t) - f_L(x(t), t)\bigr) \Bigr\rangle_{q_t}\, dt,
\]
where ⟨ · ⟩_{q_t} denotes the expectation with respect to the marginal density at time t of the measure Q.
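Under the linear approximation the marginal q_t is Gaussian (next slide), so this integral can be estimated by Monte Carlo on a time grid. A rough sketch, where f, fL, the marginal means and covariances, Sigma and the grid are all assumed inputs:

```python
import numpy as np

def kl_q_p_estimate(f, fL, means, covs, Sigma, times, n_samples=200, seed=0):
    """Monte Carlo estimate of 0.5 * int < (f - fL)' Sigma^{-1} (f - fL) >_{q_t} dt
    on a uniform time grid, with q_t = N(means[k], covs[k])."""
    rng = np.random.default_rng(seed)
    Sigma_inv = np.linalg.inv(Sigma)
    dt = times[1] - times[0]
    total = 0.0
    for t, m, S in zip(times, means, covs):
        xs = rng.multivariate_normal(m, S, size=n_samples)
        diffs = np.array([f(x, t) - fL(x, t) for x in xs])
        total += np.mean(np.einsum('ni,ij,nj->n', diffs, Sigma_inv, diffs)) * dt
    return 0.5 * total
```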
Variational approximation
As the approximating SDE is linear, the marginal distribution q_t is Gaussian,
\[
q_t(x) \;=\; \mathcal{N}\bigl(x \,|\, m(t), S(t)\bigr),
\]
with the mean m(t) and covariance S(t) described by ordinary differential equations (ODEs):
\[
\frac{dm}{dt} \;=\; -A\,m + b, \qquad \frac{dS}{dt} \;=\; -A\,S - S\,A^\top + \Sigma.
\]
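These ODEs are easy to integrate numerically once A(t), b(t) and Σ are given. A forward-Euler sketch (all inputs assumed; a proper ODE solver would be preferable for stiff problems):

```python
import numpy as np

def propagate_moments(A_of_t, b_of_t, Sigma, m0, S0, dt, steps):
    """Forward-Euler integration of dm/dt = -A m + b and dS/dt = -A S - S A^T + Sigma."""
    m = np.array(m0, dtype=float)
    S = np.array(S0, dtype=float)
    ms, Ss = [m.copy()], [S.copy()]
    for k in range(steps):
        t = k * dt
        A, b = A_of_t(t), b_of_t(t)
        m = m + dt * (-A @ m + b)
        S = S + dt * (-A @ S - S @ A.T + Sigma)
        ms.append(m.copy())
        Ss.append(S.copy())
    return np.array(ms), np.array(Ss)
```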
Algorithmics
Using Lagrangian methods, one can derive an algorithm that finds the variational approximation by minimising the KL divergence between the posterior and the approximating distribution.
But the KL also appears in the PAC-Bayes bound – is it possible to define an appropriate loss over paths ω that captures the properties of interest?
Error estimation
For ω : [0, T] → R^D defining a trajectory ω(t) ∈ R^D, we define the classifier h_ω by
\[
h_\omega(y, t) \;=\; \begin{cases} 1 & \text{if } \|y - H\omega(t)\| \le \varepsilon; \\ 0 & \text{otherwise,} \end{cases}
\]
where the actual observations are linear functions of the state variable given by the operator H.
Prior and posterior distributions over functions are inherited from the distributions P and Q over paths ω.
Hence, P = p_sde and Q = q, defined by the linear approximating SDE.
Generalisation analysis
For the PAC-Bayes analysis we must compute KL(Q‖P), e_Q and ê_Q. We have, as above,
\[
\mathrm{KL}(Q\|P) \;=\; \int dq \,\ln\frac{dq}{dp_{\mathrm{sde}}}.
\]
If we now consider a fixed sample (y, t), we can estimate
\[
\mathbb{E}_{\omega\sim Q}[h_\omega(y, t)] \;=\; \int I\bigl[\|Hx - y\| \le \varepsilon\bigr]\, dq_t(x).
\]
For sufficiently small values of ε this can be approximated by
\[
\approx\; \frac{V_d\,\varepsilon^d}{(2\pi)^{d/2}\,\bigl|H S(t) H^\top\bigr|^{1/2}} \exp\!\left( -\tfrac{1}{2}\bigl(y - Hm(t)\bigr)^\top \bigl(H S(t) H^\top\bigr)^{-1} \bigl(y - Hm(t)\bigr) \right) \;=\; V_d\,\varepsilon^d\, \mathcal{N}\bigl(y \,|\, Hm(t), H S(t) H^\top\bigr),
\]
where V_d is the volume of a unit ball in R^d.
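The approximation is simple to evaluate once the marginal moments are available. A small sketch, where Hm stands for H m(t), HSH for H S(t) Hᵀ, and eps is the ball radius; all are assumed inputs.

```python
import numpy as np
from math import gamma, pi

def unit_ball_volume(d):
    """Volume V_d of the unit ball in R^d."""
    return pi ** (d / 2) / gamma(d / 2 + 1)

def expected_h_omega(y, Hm, HSH, eps):
    """Small-eps approximation: E_{omega~Q}[h_omega(y, t)] ~ V_d eps^d N(y | Hm, HSH)."""
    d = len(y)
    diff = y - Hm
    density = np.exp(-0.5 * diff @ np.linalg.solve(HSH, diff)) / np.sqrt(
        (2 * pi) ** d * np.linalg.det(HSH))
    return unit_ball_volume(d) * eps ** d * density
```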
Error estimates
Note that e_Q is simply
\[
e_Q \;=\; \mathbb{E}_{(y,t)\sim\mu}\,\mathbb{E}_{\omega\sim Q}[h_\omega(y, t)] \;\propto\; \int \mathcal{N}\bigl(y \,|\, Hm(t), H S(t) H^\top\bigr)\, d\mu(y, t),
\]
while ê_Q is the empirical average of this quantity.
A tension arises in setting ε: if ε is large the approximation is inaccurate.
If e_Q and ê_Q are both small, the bound implied by KL(e_Q ‖ ê_Q) ≤ C becomes weak.
Refining the distributions
We overcome this weakness by taking K-fold product distributions and defining h_{(ω_1,…,ω_K)} as
\[
h_{(\omega_1,\ldots,\omega_K)}(y, t) \;=\; \begin{cases} 1 & \text{if there exists } 1 \le i \le K \text{ such that } \|y - H\omega_i(t)\| \le \varepsilon; \\ 0 & \text{otherwise.} \end{cases}
\]
We now have
\[
\mathbb{E}_{(\omega_1,\ldots,\omega_K)\sim Q^K}\bigl[h_{(\omega_1,\ldots,\omega_K)}(y, t)\bigr] \;\approx\; 1 - \left( 1 - \int I\bigl[\|Hx - y\| \le \varepsilon\bigr]\, dq_t(x) \right)^{\!K} \;\approx\; K\, V_d\,\varepsilon^d\, \mathcal{N}\bigl(y \,|\, Hm(t), H S(t) H^\top\bigr).
\]
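Numerically, the refinement only replaces the single small-ball probability p by 1 − (1 − p)^K, which is close to Kp whenever Kp is small. A one-line sketch (the function name is illustrative):

```python
def refined_expectation(p_single, K):
    """Probability that at least one of K independent draws lands in the eps-ball."""
    return 1.0 - (1.0 - p_single) ** K
```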
Final result
Putting everything together gives the final bound:
\[
\mathbb{E}_{(y,t)\sim\mu}\bigl[\mathcal{N}\bigl(y \,|\, Hm(t), H S(t) H^\top\bigr)\bigr] \;\ge\; \frac{1}{V_d\,\varepsilon^d K}\, \mathrm{KL}^{-1}\!\left( K V_d\,\varepsilon^d\, \hat{\mathbb{E}}\bigl[\mathcal{N}\bigl(y \,|\, Hm(t), H S(t) H^\top\bigr)\bigr],\; \frac{K\int_0^T E_{\mathrm{sde}}(t)\,dt \;+\; \ln\bigl((m+1)/\delta\bigr)}{m} \right),
\]
where
\[
E_{\mathrm{sde}}(t) \;=\; \frac{1}{2}\Bigl\langle \bigl(f(x) - f_L(x, t)\bigr)^\top \Sigma^{-1} \bigl(f(x) - f_L(x, t)\bigr) \Bigr\rangle_{q_t}.
\]
Small scale experiment
We applied the analysis to the results of performing a variational Bayesian approximation to the Lorenz attractor in three dimensions. The quality of the fit with 49 examples was good.
Figure: 3D plot of the fitted trajectory.
Small scale experiment
We chose V_d ε^d to optimise the bound – a fairly small ball, implying that our approximation should be reasonable.
We compared the bound with the left-hand side estimated on a random draw of 99 test points. The corresponding values are:

m    dt      ê_Q     A       e_Q     KL⁻¹(·, ·)/V
49   0.005   0.137   3.536   0.128   0.004
Conclusions
Overview of the theory and main result.
Application to bound the performance of an SVM.
Experiments show the new bound can be tighter ...
... and reliable for low-cost model selection.
Extended to maximum entropy classification.
Also considered lower bounding the accuracy of a posterior distribution for Gaussian processes (GPs).
Applied the theory to bound the performance of estimates made using approximate Bayesian inference for dynamical systems:
the prior is determined by a non-linear stochastic differential equation (SDE);
the variational approximation results in a posterior given by an approximating linear SDE, and hence a Gaussian process posterior.