
Trinity College, March 2008
Robert B. Gramacy
Statistical Laboratory, University of Cambridge
http://www.statslab.cam.ac.uk/~bobby
[email protected]
with Herbert K. H. Lee
Applied Math & Statistics Dept., UC Santa Cruz
Designing Computer Experiments
Real-world Application: CFD simulations of a proposed reusable
NASA launch vehicle (Langley-Glide-Back Booster)
3 inputs:
side slip angle
Mach number
angle of attack
6 outputs:
lift
drag
pitching moment
side-force
yawing moment
rolling moment
• integrate inviscid Euler equations over a mesh of 1.4 × 10⁶ cells
• each input configuration takes 5-20 hours of CPU time
Motivation: example
[Figure: CFD data for beta = 0, plotting Z = lift against X[1] = alpha (angle of attack) and X[2] = Mach.]
• Exhibits typical characteristics of a computer experiment
– not: stationary, linear, differentiable, continuous
Augmenting the State of the Art
Canonical model: Gaussian process (Santner et al., 2003)
– conceptually simple non-parametric extension of LM
– flexible & non-linear
But:
– inference scales poorly with number of data points N
– strictly stationary
Our idea: use trees to partition the input space like Bayesian treed
LMs (Chipman et al., 2002) with GPs instead of LMs
– allows for modeling of non-stationary behavior
– predictive variance is location (region) dependent
– smaller covariance matrices ameliorate computational demands
Standard Linear Model (LM)
LM for n inputs (covariates) x of dimension p, and responses y
    y_i = x_i⊤β + e_i,   where e_i ∼ N(0, σ²),   for i = 1, ..., n

• Collect Y = (y_1, ..., y_n)⊤ and X = [x_1⊤ ··· x_n⊤], and define the likelihood:

    Y ∼ N_n(Xβ, σ²I)

• Jeffreys prior:  p(β, σ²) ∝ 1/σ²

• MLE:  β̂ = (X⊤X)⁻¹X⊤Y   and   s² = (Y − Xβ̂)⊤(Y − Xβ̂) / (n − p)

• Bayes rule p(θ|Y) ∝ p(Y|θ)p(θ) gives the posterior:

    β | σ², Y ∼ N_p(β̂, (X⊤X)⁻¹σ²)   and   σ² | Y ∼ Inv-χ²(n − p, s²)
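To make the notation concrete, here is a minimal numpy sketch (not from the talk; the simulated data and seed are arbitrary) that computes β̂ and s² and then draws from the posterior above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.uniform(size=n)])   # n x p design (intercept + slope)
beta_true, sigma_true = np.array([1.0, 2.0]), 0.5
Y = X @ beta_true + rng.normal(scale=sigma_true, size=n)

# MLEs
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
s2 = (resid @ resid) / (n - p)

# Posterior draws under the Jeffreys prior p(beta, sigma^2) propto 1/sigma^2:
# sigma^2 | Y ~ Inv-chi^2(n - p, s2), then beta | sigma^2, Y ~ N_p(beta_hat, (X'X)^-1 sigma^2)
sigma2 = (n - p) * s2 / rng.chisquare(df=n - p)
beta = rng.multivariate_normal(beta_hat, XtX_inv * sigma2)
```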
Gaussian Process (GP)
The Gaussian Process is also a LM (of sorts)
• Replacing I with K:

    Y | β, σ² ∼ N_n(Xβ, σ²I)   ⇒   Z | β, σ², K ∼ N_n(Xβ, σ²K)

• K_{i,j} = K(x_i, x_j) is, e.g., an isotropic correlation matrix:

    K(x_i, x_j) = exp{ −||x_i − x_j||² / d } + g δ_{i,j}

• with range (or length-scale) parameter d
• and nugget (or measurement-error) parameter g
– Range d → 0 gives K → (1 + g)I; i.e., the LM
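A small numpy sketch of this correlation function (illustrative inputs and hyperparameter values only); the second call shows how a tiny d collapses K to roughly (1 + g)I:

```python
import numpy as np

def iso_corr(X, d, g):
    """K_ij = exp(-||x_i - x_j||^2 / d) + g * delta_ij (isotropic correlation plus nugget)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / d) + g * np.eye(len(X))

X = np.random.default_rng(1).uniform(size=(5, 2))   # five 2-d inputs, made up
print(iso_corr(X, d=0.5, g=0.1))                    # smooth correlation across nearby inputs
print(iso_corr(X, d=1e-12, g=0.1))                  # d -> 0: essentially (1 + g) I, i.e. the LM
```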
GP Inference
• Conditional on K through d and g, we have MLEs
    β̂ = (X⊤K⁻¹X)⁻¹X⊤K⁻¹Z   and   s² = (Z − Xβ̂)⊤K⁻¹(Z − Xβ̂) / (n − p)
• Numerical optimization required for d and g
Bayesian inference is similar
    β | σ², d, g, Z ∼ N_p(β̂, (X⊤K⁻¹X)⁻¹σ²)
    σ² | d, g, Z ∼ Inv-χ²(n − p, s²)
• Metropolis–Hastings is used to sample the posterior p(d, g|Z)
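A numpy sketch of these conditional MLEs on simulated 1-d data (the inputs, responses, and the fixed d and g are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = rng.uniform(size=(n, 1))                              # inputs (p = 1), made up
Z = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=n)   # toy responses

d, g = 0.2, 0.01                                          # range and nugget held fixed
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Ki = np.linalg.inv(np.exp(-sq / d) + g * np.eye(n))

# Conditional MLEs given K (through d and g); d and g themselves would be found
# by numerical optimization, or sampled by Metropolis-Hastings in the Bayesian version.
beta_hat = np.linalg.solve(X.T @ Ki @ X, X.T @ Ki @ Z)
resid = Z - X @ beta_hat
s2 = (resid @ Ki @ resid) / (n - X.shape[1])
```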
GP Prediction (or Kriging)
The predicted value of z(x̃) is Normal with
    mean       ŷ(x̃) = x̃⊤β̂ + k(x̃)⊤K⁻¹(Z − Xβ̂)
    variance   σ̂(x̃)² = σ²[κ(x̃) − q(x̃)⊤(K + XX⊤)⁻¹q(x̃)]
where κ(x̃) = K(x̃, x̃) + x̃⊤x̃,  q(x̃) = k(x̃) + Xx̃,  k_j(x̃) = K(x̃, x_j), and x_j⊤ is the jth row of X
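A numpy sketch of the kriging equations as reconstructed above, again on made-up 1-d data with hand-picked d and g; the variance line uses the κ(x̃) and q(x̃) definitions just given:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, g = 30, 0.2, 0.01
X = rng.uniform(size=(n, 1))                              # made-up 1-d design
Z = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=n)

sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / d) + g * np.eye(n)
Ki = np.linalg.inv(K)
beta_hat = np.linalg.solve(X.T @ Ki @ X, X.T @ Ki @ Z)
s2 = (Z - X @ beta_hat) @ Ki @ (Z - X @ beta_hat) / (n - 1)   # n - p with p = 1

def predict(xt):
    """Kriging mean and variance at a new input xt (length-p array)."""
    k = np.exp(-np.sum((X - xt) ** 2, axis=1) / d)            # k_j(xt) = K(xt, x_j)
    mean = xt @ beta_hat + k @ Ki @ (Z - X @ beta_hat)
    kappa = 1.0 + g + xt @ xt                                  # K(xt, xt) (with nugget) + xt'xt
    q = k + X @ xt                                             # q(xt) = k(xt) + X xt
    var = s2 * (kappa - q @ np.linalg.solve(K + X @ X.T, q))
    return mean, var

print(predict(np.array([0.5])))
```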
[Figure: posterior predictive mean surfaces, "GP: z mean" vs. "LM: y mean".]
The GP range parameter
MLE range (d) for various data:
[Figure: six 1-d datasets of 50 samples each, shown with GP fits and the MLE values of d, g, and s² for each panel: y = 1 + 2x + e; y = 3x² + e; y = 10(x − 0.5)² + e; y = −20(x − 0.5)³ + 2x − 2 + e; y = 4(exp(−(4x − 2)²) − exp(−(7x − 4)²)); y = sin(πx/5) + 0.2 cos(4πx/5); each with e ∼ N(0, 0.25). The fitted ranges d span roughly 0.017 to 1.01.]
• However, recall that d → 0 gives K → (1 + g)I
Tree Example
How a tree T recursively partitions the input space:
[Figure: T as a diagram, with internal nodes i1 = {u1, s1} and i2 = {u2, s2} splitting on X[:, u1] < s1 vs. ≥ s1 and X[:, u2] < s2 vs. ≥ s2, and leaves ℓ1 = {X1, Z1}, ℓ2 = {X2, Z2}, ℓ3 = {X3, Z3}; and graphically, as the induced rectangular partition of the (u1, u2) input space.]
• prior for a split at node η ∈ T at depth q_η:  p_split(η, T) = a(1 + q_η)^(−b)
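For example, with illustrative values a = 0.5 and b = 2, the prior split probability is 0.5 at the root (depth 0), 0.125 at depth 1, and about 0.056 at depth 2, so deeper splits become increasingly unlikely a priori.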
Hierarchical Model
Conditioning on tree T we have R regions {r_ν}, ν = 1, ..., R:

    Z_ν | β_ν, σ_ν², K_ν ∼ N(F_ν β_ν, σ_ν² K_ν)
    β_ν | σ_ν², τ_ν², W, β_0 ∼ N(β_0, σ_ν² τ_ν² W)
    β_0 ∼ N(µ, B)
    σ_ν² ∼ IG(α_σ/2, q_σ/2)
    τ_ν² ∼ IG(α_τ/2, q_τ/2)
    W⁻¹ ∼ W((ρV)⁻¹, ρ)

with F_ν = (1, X_ν), and

    K_ν(x_j, x_k) = K_ν*(x_j, x_k) + g_ν δ_{j,k}

for nugget g_ν; the true correlation K_ν* is separable:

    K_ν*(x_j, x_k | d_ν) = exp{ −Σ_{i=1}^{m_X} |x_{ij} − x_{ik}|^{p₀} / d_{iν} }
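A numpy sketch of this separable correlation (illustrative length-scales d_{iν} and nugget only; p₀ = 2 gives the Gaussian case):

```python
import numpy as np

def sep_corr(X, d, g, p0=2.0):
    """K_nu(x_j, x_k) = exp{-sum_i |x_ij - x_ik|^p0 / d_i} + g * delta_jk."""
    diff = np.abs(X[:, None, :] - X[None, :, :]) ** p0          # n x n x m_X
    Kstar = np.exp(-np.sum(diff / np.asarray(d), axis=-1))      # separable true correlation
    return Kstar + g * np.eye(len(X))

X = np.random.default_rng(4).uniform(size=(6, 3))               # m_X = 3 inputs, made up
K = sep_corr(X, d=[0.5, 1.0, 2.0], g=0.05)                      # one range per input dimension
```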
Tree MCMC
Conditional on a particular tree T
• Gibbs samples for all GP parameters {θ_ν}, ν = 1, ..., R, conditional on T
– except K(·, ·|dν , gν ) requires MH
Sample from the joint posterior of (T , θ)
(Richardson & Green, 1997; Chipman et al., 2002)
• Average over T with reversible-jump MCMC (RJ-MCMC)
– Tree operations: grow, prune, change, swap
Bayesian Tree grow & prune
There is always a leaf to grow T_t:
• randomly select a leaf node η ∈ ℓ(T_t)
• randomly select a splitting rule (u, s)
• create T* from T_t
  – split η in dimension u at location s
  – η ∈ ℓ(T_t) becomes (η1, η2) ∈ ℓ(T*)
• accept or reject the proposed T*
  – via the MH acceptance ratio (a numeric sketch follows below):

        α = min{ 1, [p(T*|Y) q(T_t|T*)] / [p(T_t|Y) q(T*|T_t)] }

  – set T_{t+1} = T* with probability α; otherwise T_{t+1} = T_t
The opposite of grow is prune T_t.
[Figure: the example tree before and after: grow turns a leaf of i1 = {u1, s1} into a new internal node i2 = {u2, s2} with leaves ℓ1 and ℓ2 (leaving ℓ3 unchanged); prune reverses the move.]
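A tiny numeric sketch of the accept/reject step on the log scale; every number below is a hypothetical placeholder, just to show the arithmetic:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical values: log posteriors of the proposed and current trees,
# and log proposal probabilities for the reverse (prune) and forward (grow) moves.
log_post_new, log_post_old = -102.3, -104.1           # log p(T*|Y), log p(T_t|Y)
log_q_back, log_q_fwd = np.log(0.25), np.log(0.10)    # log q(T_t|T*), log q(T*|T_t)

log_alpha = min(0.0, log_post_new - log_post_old + log_q_back - log_q_fwd)
accept = np.log(rng.uniform()) < log_alpha            # T_{t+1} = T* with probability alpha
```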
Bayesian Tree change & swap
Splits are moved with change operations:
• randomly select an internal node η ∈ i(T)
  – η has split point (u, s), say
• propose a new η* ∈ T*
  – by modifying the split location: (u, s*)
Or, switch the order of splits with swap:
• randomly select a pair of internal nodes
  – η1, η2 ∈ i(T) such that P(η2) = η1
  – swap their order, thus proposing T*
Accept or reject T* with probability

    α = min{ 1, p(T*|Y) / p(T_t|Y) }

[Figure: the example tree and its partition of the (u1, u2) input space, with splits s1 and s2 and leaves ℓ1, ℓ2, ℓ3, before and after a change and a swap.]
Prediction (or Kriging)
The predicted value of Y (x) at x ∈ rν is Normal with
    mean       ŷ(x) = f⊤(x)β̃_ν + k_ν(x)⊤K_ν⁻¹(Z_ν − F_ν β̃_ν)
    variance   σ̂(x)² = σ_ν²[κ(x, x) − q_ν⊤(x)C_ν⁻¹q_ν(x)]

where k_ν(x) is an n_ν-vector with k_{ν,j}(x) = K_ν(x, x_j) for x_j ∈ X_ν, and

    C_ν⁻¹ = (K_ν + F_ν W F_ν⊤ / τ_ν²)⁻¹
    κ(x, y) = K_ν(x, y) + τ_ν² f⊤(x)Wf(y)
    q_ν(x) = k_ν(x) + τ_ν² F_ν W f(x)
    f⊤(x) = (1, x⊤)
Expected reduction in squared error:

    Δσ̂_y²(x̃) = σ̂_y² − σ̂_y²(x̃) = [q_N⊤(y)C_N⁻¹q_N(x̃) − κ(x̃, y)]² / [κ(x̃, x̃) − q_N⊤(x̃)C_N⁻¹q_N(x̃)]
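A numpy sketch of this reduction, under the simplifying assumption of a known zero mean, so that C_N = K_N, q_N(x) = k_N(x), and κ(x, y) = K_N(x, y); it illustrates the formula rather than reproducing the talk's implementation:

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, g = 25, 0.3, 0.01
X = rng.uniform(size=(N, 2))                              # current (made-up) design
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
CN_inv = np.linalg.inv(np.exp(-sq / d) + g * np.eye(N))   # C_N = K_N under the zero-mean assumption

def k(x):                                                 # k_N(x): correlations with the design
    return np.exp(-np.sum((X - x) ** 2, axis=1) / d)

def kappa(x, y):                                          # K_N(x, y), plus nugget when x == y
    return np.exp(-np.sum((x - y) ** 2) / d) + g * float(np.all(x == y))

def delta_var(xt, y):
    """Expected reduction at y in squared predictive error from adding xt to the design."""
    num = (k(y) @ CN_inv @ k(xt) - kappa(xt, y)) ** 2
    den = kappa(xt, xt) - k(xt) @ CN_inv @ k(xt)
    return num / den

xt, y = np.array([0.5, 0.5]), np.array([0.6, 0.4])        # candidate and reference locations
print(delta_var(xt, y))
```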
Treed GP on sine Data
[Figure: the 1-d sine data (Z against X on [0, 20]) fit with a treed LM, a stationary GP, and a treed GP.]
Treed GP on exp Example
[Figure: B-TGPLM posterior predictive mean (z mean) and predictive quantile difference (error) over (x1, x2), together with the MAP tree: a split on x1 at 2.4 followed by a split on x2 at 2, giving leaves with 189, 132, and 120 observations.]
Motorcycle data
Motorcycle data (Silverman, 1985):
• non-stationary
• input-dependent noise
[Figure: acceleration against time, showing a stationary GP fit (mean and error), the treed GP estimated surface, and the estimated error spread (95th minus 5th quantile).]
Adaptive Sampling
Bayesian adaptive sampling proceeds in trials ...
Current trial:
• Choose candidates: X̃
• estimate model parameters (θ, T) for data {X_i, z_i}, i = 1, ..., N
• Order candidates X̃ by (see the sketch after this list)
  – ALM: maximize predictive error σ̂²(x̃)
  – ALC: maximize average expected reduction in predictive error

        Δσ̂²(x̃) = ∫_D Δσ̂_y²(x̃) dy
• (repeat) begin next trial
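A minimal sketch of one trial using a plain zero-mean GP and the ALM criterion (ALC would instead average the Δσ̂_y²(x̃) formula over reference locations y); the design, candidates, and hyperparameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
d, g = 0.3, 0.01
X = rng.uniform(size=(20, 2))                               # inputs gathered in earlier trials

sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Ki = np.linalg.inv(np.exp(-sq / d) + g * np.eye(len(X)))

def pred_var(xt):
    """Predictive variance at xt for a zero-mean GP (stands in for the ALM criterion)."""
    k = np.exp(-np.sum((X - xt) ** 2, axis=1) / d)
    return (1.0 + g) - k @ Ki @ k

Xcand = rng.uniform(size=(100, 2))                          # candidate inputs X~
order = np.argsort([-pred_var(x) for x in Xcand])           # largest predictive error first
x_next = Xcand[order[0]]                                    # configuration to run next
```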
Adaptive Sampling Comparison
2d exponential data:  y(x) = x_1 exp(−x_1² − x_2²)
[Figure: the treed GP fit to the 2-d exponential data after 75 adaptive samples.]

RMSE by model and adaptive-sampling (AS) criterion:

    model   AS    RMSE
    btgp    alc   0.0035
    btgp    alm   0.0037
    bcart   alm   0.0090
    bcart   alc   0.0093
    btlm    alm   0.0099
    btlm    alc   0.0115
    bgp     alm   0.0352
    bgp     alc   0.0493
3-d CFD data
Adaptive sampling on the Langley-Glide-Back Booster (LGBB):
Interfacing with a NASA supercomputer:
• fit six independent treed GPs, one for each response
  (lift, drag, pitch, side-force, yaw, roll)
  – six parallel treed GP modules
• design via expected reduction in predictive error (ALC)
  – pooled predictive quantiles across the six models (see the sketch below)
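A small sketch of the pooling idea: each candidate gets a statistic combined across the six responses (here a sum of normalized per-response uncertainty measures; the arrays are random placeholders standing in for whatever each treed-GP module reports):

```python
import numpy as np

rng = np.random.default_rng(8)
n_cand = 200
responses = ["lift", "drag", "pitch", "side", "yaw", "roll"]

# Placeholder per-response uncertainty for each candidate (e.g. an ALC statistic or a
# predictive quantile spread); in practice each parallel treed-GP module supplies its own.
stats = {r: rng.gamma(shape=2.0, size=n_cand) for r in responses}

# Normalize each response's statistic and pool across the six responses,
# then pick the highest-ranked candidates to send to the CFD solver.
pooled = sum(s / s.max() for s in stats.values())
best = np.argsort(-pooled)[:10]
```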
Adaptive Sampling on LGBB: Lift
[Figure: mean posterior predictive for lift, with beta (side slip angle) fixed at zero, over alpha (angle of attack) and Mach (speed).]
Adaptive Sampling on LGBB: Lift
[Figure: sampled input configurations (beta = 0) and partitions, over alpha (angle of attack) and Mach (speed).]
Adaptive Sampling on LGBB: Drag
[Figure: mean posterior predictive for drag, with beta (side slip angle) fixed at zero, over alpha (angle of attack) and Mach (speed).]
Adaptive Sampling on LGBB: Pitch
[Figure: mean posterior predictive for pitching moment, with beta (side slip angle) fixed at zero, over alpha (angle of attack) and Mach (speed).]
Adaptive Sampling on LGBB: Side
[Figure: mean posterior predictive for side force, with beta (side slip angle) fixed at 2, over alpha (angle of attack) and Mach (speed).]
Adaptive Sampling on LGBB: Yaw
[Figure: mean posterior predictive for yawing moment, with beta (side slip angle) fixed at 2, over alpha (angle of attack) and Mach (speed).]
Adaptive Sampling on LGBB: Roll
[Figure: mean posterior predictive for rolling moment, with beta (side slip angle) fixed at 2, over alpha (angle of attack) and Mach (speed).]
Adaptive Sampling on LGBB
[Figure: adaptive samples, beta projection, over alpha (angle of attack) and Mach (speed).]
780 adaptive samples, compared to more than 3,250
Implementation & R package
Implementation & computing details
• C++ (trees) and C (GPs) with LAPACK/BLAS
• Pthreads for (shared-memory) parallelization
R package called tgp on CRAN
• Implements all model combinations and special cases
– blm, bgp, btlm, btgp, bgpllm, btgpllm
• Adaptive sampling: ALM, ALC, etc.