Designing Computer Experiments

Robert B. Gramacy
Statistical Laboratory, University of Cambridge
http://www.statslab.cam.ac.uk/~bobby
[email protected]

with Herbert K. H. Lee
Applied Math & Statistics Dept., UC Santa Cruz

Trinity College, March 2008


Real-world Application

CFD simulations of a proposed reusable NASA launch vehicle (Langley Glide-Back Booster)
• 3 inputs: side-slip angle, Mach number, angle of attack
• 6 outputs: lift, drag, pitching moment, side force, yawing moment, rolling moment
• integrate inviscid Euler equations over a mesh of 1.4 × 10^6 cells
• each input configuration takes 5-20 hours of CPU time


Motivation: example

[Figure: CFD data at beta = 0; lift (Z) as a function of angle of attack (alpha, X[1]) and Mach number (X[2])]

• Exhibits typical characteristics of a computer experiment
  – not: stationary, linear, differentiable, continuous


Augmenting the State of the Art

Canonical model: Gaussian process (Santner et al., 2003)
  – conceptually simple non-parametric extension of the LM
  – flexible & non-linear
But:
  – inference scales poorly with the number of data points N
  – strictly stationary

Our idea: use trees to partition the input space, like Bayesian treed LMs (Chipman et al., 2002), but with GPs instead of LMs
  – allows modeling of non-stationary behavior
  – predictive variance is location (region) dependent
  – smaller covariance matrices ameliorate computational demands


Standard Linear Model (LM)

LM for n inputs (covariates) x of dimension p, and responses y:

  y_i = x_i^⊤ β + e_i,  where e_i ~ N(0, σ²),  for i = 1, ..., n

• Collect Y = (y_1, ..., y_n)^⊤ and X = [x_1 ⋯ x_n]^⊤, and define the likelihood:

  Y ~ N_n(Xβ, σ²I)

• Jeffreys prior: p(β, σ²) ∝ 1/σ²

• MLEs:  β̂ = (X^⊤X)^{-1} X^⊤Y,  and  s² = (Y − Xβ̂)^⊤(Y − Xβ̂) / (n − p)

• Bayes' rule p(θ|Y) ∝ p(Y|θ) p(θ) gives the posterior:

  β | σ², Y ~ N_p(β̂, (X^⊤X)^{-1} σ²),  and  σ² | Y ~ Inv-χ²(n − p, s²)
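A minimal R sketch of this conjugate analysis under the Jeffreys prior above; the simulated data and variable names are illustrative only, not from the slides:

    ## Bayesian LM under the Jeffreys prior (illustrative, simulated data)
    set.seed(1)
    n <- 50; p <- 2
    X <- cbind(1, runif(n))                      # n x p design matrix
    Y <- X %*% c(1, 2) + rnorm(n, sd = 0.5)      # responses from a known truth

    ## MLEs
    XtXi <- solve(t(X) %*% X)
    bhat <- XtXi %*% t(X) %*% Y                  # beta-hat
    s2   <- sum((Y - X %*% bhat)^2) / (n - p)    # s^2

    ## One posterior draw:
    ##   sigma^2 | Y        ~ Inv-chi^2(n - p, s^2)
    ##   beta | sigma^2, Y  ~ N_p(beta-hat, (X'X)^{-1} sigma^2)
    sigma2 <- (n - p) * s2 / rchisq(1, df = n - p)
    beta   <- bhat + t(chol(XtXi * sigma2)) %*% rnorm(p)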
Gaussian Process (GP)

The Gaussian process is also a LM (of sorts):
• Replace I with K:

  Y | β, σ² ~ N_n(Xβ, σ²I)   ⇒   Z | β, σ², K ~ N_n(Xβ, σ²K)

• K_{i,j} = K(x_i, x_j) is, e.g., an isotropic correlation matrix

  K(x_i, x_j) = exp{ −||x_i − x_j||² / d } + g δ_{i,j}

• with range (or length-scale) parameter d
• and nugget (or measurement-error) parameter g
  – range d → 0 gives K → (1 + g)I; i.e., the LM


GP Inference

• Conditional on K through d and g, we have MLEs

  β̂ = (X^⊤K^{-1}X)^{-1} X^⊤K^{-1}Z,  and  s² = (Z − Xβ̂)^⊤ K^{-1} (Z − Xβ̂) / (n − p)

• Numerical optimization is required for d and g

Bayesian inference is similar:

  β | σ², d, g, Z ~ N_p(β̂, (X^⊤K^{-1}X)^{-1} σ²)
  σ² | d, g, Z ~ Inv-χ²(n − p, s²)

• Metropolis-Hastings is used to sample the posterior p(d, g | Z)


GP Prediction (or Kriging)

The predicted value of z(x̃) is Normal with

  mean      ŷ(x̃) = x̃^⊤β̂ + k(x̃)^⊤ K^{-1} (Z − Xβ̂)
  variance  σ̂(x̃)² = σ² [κ(x̃) − q(x̃)^⊤ (K + XX^⊤)^{-1} q(x̃)]

where κ(x̃) = K(x̃, x̃) + x̃^⊤x̃, q(x̃) = k(x̃) + Xx̃, k_j(x̃) = K(x̃, x_j), and x_j^⊤ is the j-th row of X.

[Figure: LM predictive mean (y) vs. GP predictive mean (z) over x1]


The GP range parameter

MLE range (d) for various data:

[Figure: six 1-d data sets of 50 samples each (linear, quadratic, shifted quadratic, cubic, sum-of-exponentials, and sinusoidal test functions), with the fitted MLEs of the range d, nugget g, and s² reported for each]

• However, recall that d → 0 gives K → (1 + g)I


Tree Example

How a tree T recursively partitions the input space:

[Figure: tree diagram and corresponding partition of the input plane. The root i1 = {u1, s1} splits on X[:, u1] < s1 vs. X[:, u1] ≥ s1; one child is the leaf ℓ3 = {X3, Z3}, the other is i2 = {u2, s2}, which splits on X[:, u2] < s2 vs. X[:, u2] ≥ s2 into leaves ℓ1 = {X1, Z1} and ℓ2 = {X2, Z2}]

• prior for a node η ∈ T at depth q_η:  p_split(η, T) = a(1 + q_η)^{−b}


Hierarchical Model

Conditioning on the tree T we have R regions {r_ν}, ν = 1, ..., R:

  Z_ν | β_ν, σ_ν², K_ν ~ N(F_ν β_ν, σ_ν² K_ν)
  β_ν | σ_ν², τ_ν², W, β_0 ~ N(β_0, σ_ν² τ_ν² W)
  β_0 ~ N(µ, B)
  σ_ν² ~ IG(α_σ/2, q_σ/2)
  τ_ν² ~ IG(α_τ/2, q_τ/2)
  W^{-1} ~ W((ρV)^{-1}, ρ)

with F_ν = (1, X_ν), and K_ν(x_j, x_k) = K*_ν(x_j, x_k) + g_ν δ_{j,k} for nugget g_ν; the true correlation K* is separable:

  K*_ν(x_j, x_k | d_ν) = exp{ −Σ_{i=1}^{m_X} |x_{ij} − x_{ik}|^{p_0} / d_{iν} }
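As a concrete illustration of this separable correlation with a nugget, a small R sketch; the function and variable names are illustrative, not taken from the tgp code:

    ## Separable power-exponential correlation with nugget (illustrative sketch)
    sep.corr <- function(X, d, p0 = 2, g = 0) {
      n <- nrow(X)
      K <- matrix(0, n, n)
      for (j in 1:n) for (k in 1:n)
        K[j, k] <- exp(-sum(abs(X[j, ] - X[k, ])^p0 / d))   # K*(x_j, x_k | d)
      K + diag(g, n)                                        # add the nugget on the diagonal
    }

    ## Example: 10 inputs in 2-d, one range parameter per input dimension
    X <- matrix(runif(20), ncol = 2)
    K <- sep.corr(X, d = c(0.5, 2), g = 0.01)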
Tree MCMC

Conditional on a particular tree T:
• Gibbs samples for all GP parameters {θ_ν}, ν = 1, ..., R, given T
  – except K(·, · | d_ν, g_ν), which requires MH

Sample from the joint posterior of (T, θ) (Richardson & Green, 1997; Chipman et al., 2002):
• average over T with reversible-jump MCMC (RJ-MCMC)
  – tree operations: grow, prune, change, swap


Bayesian Tree grow & prune

There is always a leaf to grow T_t:
• randomly select a leaf node η ∈ ℓ(T_t)
• randomly select a splitting rule (u, s)
• create T* from T_t
  – split η in dimension u at location s
  – η ∈ ℓ(T_t) becomes (η1, η2) ∈ ℓ(T*)
• accept or reject the proposed T*
  – via the MH acceptance ratio

    α = min{ 1, [p(T* | Y) q(T_t | T*)] / [p(T_t | Y) q(T* | T_t)] }

  – set T_{t+1} = T* with probability α; otherwise T_{t+1} = T_t

The opposite of grow is prune.

[Figure: grow splits a leaf of T_t under a new internal node i2 = {u2, s2}; prune reverses the move]


Bayesian Tree change & swap

Splits are moved with change operations:
• randomly select an internal node η ∈ i(T)
  – η has split point (u, s), say
• propose a new η* ∈ T*
  – by modifying the split location to (u, s*)

Or, switch the order of splits with swap:
• randomly select a pair of internal nodes η1, η2 ∈ i(T) so that P(η2) = η1 (η1 is the parent of η2)
• swap their order, thus proposing T*

Accept or reject T* with probability

  α = min{ 1, p(T* | Y) / p(T_t | Y) }

[Figure: partitions of the input plane before and after change and swap proposals]


Prediction (or Kriging)

The predicted value of Y(x) at x ∈ r_ν is Normal with

  mean      ŷ(x) = f^⊤(x) β̃_ν + k_ν(x)^⊤ K_ν^{-1} (Z_ν − F_ν β̃_ν)
  variance  σ̂(x)² = σ_ν² [κ(x, x) − q_ν^⊤(x) C_ν^{-1} q_ν(x)]

where k_ν(x) is the n_ν-vector with k_{ν,j}(x) = K_ν(x, x_j), x_j ∈ X_ν, and

  C_ν^{-1} = (K_ν + F_ν W F_ν^⊤ / τ_ν²)^{-1}
  κ(x, y) = K_ν(x, y) + τ_ν² f^⊤(x) W f(y)
  q_ν(x) = k_ν(x) + τ_ν² F_ν W f(x)
  f^⊤(x) = (1, x^⊤)

Expected reduction in squared error:

  Δσ̂_y²(x̃) = σ̂_y² − σ̂_y²(x̃) = [q_N^⊤(y) C_N^{-1} q_N(x̃) − κ(x̃, y)]² / [κ(x̃, x̃) − q_N^⊤(x̃) C_N^{-1} q_N(x̃)]


Treed GP on sine Data

[Figure: 1-d sine data (Z vs. X on [0, 20]) fit by a treed LM, a stationary GP, and a treed GP]


Treed GP on exp Example

[Figure: B-TGPLM posterior mean surface and quantile-difference (error) surface for the 2-d exponential data; the MAP tree splits on x1 <> 2.4 and x2 <> 2, giving regions of 189, 132, and 120 observations]


Motorcycle data

Motorcycle data (Silverman, 1985):
• non-stationary
• input-dependent noise

[Figure: stationary GP fit (accel mean and error vs. time), alongside the treed GP estimated surface and estimated error spread (95th − 5th quantile)]
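The tgp R package described at the end of these slides can produce a fit like this one; a minimal sketch, assuming the btgpllm interface (argument defaults and plotting behavior may differ by package version):

    ## Treed GP LLM on the motorcycle data (sketch; tgp interface assumed)
    library(tgp)
    library(MASS)                   # mcycle data (Silverman, 1985)

    X <- mcycle$times               # input: time
    Z <- mcycle$accel               # response: acceleration

    fit <- btgpllm(X = X, Z = Z)    # treed GP that can jump to the limiting LM
    plot(fit)                       # posterior predictive mean and quantiles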
Adaptive Sampling

Bayesian adaptive sampling proceeds in trials. In the current trial:
• choose candidates X̃
• estimate the model parameters (θ, T) for the data {x_i, z_i}, i = 1, ..., N
• order the candidates X̃ by
  – ALM: maximize predictive error σ̂²(x̃)
  – ALC: maximize the average expected reduction in predictive error

    Δσ̂²(x̃) = ∫_D Δσ̂_y²(x̃) dy

• (repeat) begin the next trial


Adaptive Sampling Comparison

2-d exponential data: y(x) = x1 exp(−x1² − x2²)

[Figure: treed GP fit to the 2-d exp data after 75 adaptive samples]

  model   method   rmse
  ------  -------  ------
  btgp    alc      0.0035
  btgp    alm      0.0037
  bcart   alm      0.0090
  bcart   alc      0.0093
  btlm    alm      0.0099
  btlm    alc      0.0115
  bgp     alm      0.0352
  bgp     alc      0.0493


3-d CFD data

Adaptive sampling on the Langley Glide-Back Booster (LGBB), interfacing with a NASA supercomputer:
• fit six independent treed GPs, one for each response (lift, drag, pitch, side-force, yaw, roll)
  – six parallel treed GP modules
• design via expected reduction in predictive error (ALC)
  – pooled predictive quantiles across the six models


Adaptive Sampling on LGBB: Lift

[Figure: mean posterior predictive for lift, fixing beta (side-slip angle) to zero, over Mach (speed) and alpha (angle of attack)]

[Figure: sampled input configurations (beta = 0) and partitions, over Mach (speed) and angle of attack (alpha)]


Adaptive Sampling on LGBB: Drag

[Figure: mean posterior predictive for drag, fixing beta to zero]


Adaptive Sampling on LGBB: Pitch

[Figure: mean posterior predictive for pitch, fixing beta to zero]


Adaptive Sampling on LGBB: Side

[Figure: mean posterior predictive for side force, fixing beta to 2]


Adaptive Sampling on LGBB: Yaw

[Figure: mean posterior predictive for yaw, fixing beta to 2]


Adaptive Sampling on LGBB: Roll

[Figure: mean posterior predictive for roll, fixing beta to 2]


Adaptive Sampling on LGBB

[Figure: adaptive samples, beta projection, over Mach (speed) and alpha (angle of attack)]

780 adaptive samples, compared to more than 3,250


Implementation & R package

Implementation & computing details:
• C++ (trees) and C (GPs) with LAPACK/BLAS
• Pthreads for (shared-memory) parallelization

R package called tgp on CRAN:
• implements all model combinations and special cases
  – blm, bgp, btlm, btgp, bgpllm, btgpllm
• adaptive sampling: ALM, ALC, etc.

See the usage sketch below.
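A minimal usage sketch with the 2-d exponential data, assuming the tgp interface; the Ds2x argument and the output field names are as I recall them and may differ across package versions:

    ## Treed GP on the 2-d exp data with ALM/ALC statistics (sketch; tgp interface assumed)
    library(tgp)

    exp.data <- exp2d.rand()             # a random draw of the 2-d exponential data
    X  <- exp.data$X;  Z <- exp.data$Z   # training inputs and responses
    XX <- exp.data$XX                    # candidate / prediction locations

    fit <- btgp(X = X, Z = Z, XX = XX, Ds2x = TRUE)   # Ds2x = TRUE requests ALC statistics
    plot(fit)                                         # predictive mean and error surfaces

    ## Rank candidates for the next adaptive-sampling trial:
    ## ALM by width of the predictive interval, ALC by expected reduction in error
    alm.order <- order(fit$ZZ.q2 - fit$ZZ.q1, decreasing = TRUE)
    alc.order <- order(fit$Ds2x, decreasing = TRUE)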