Nonlinear Regression

28.11.2012
Goals of Today’s Lecture
- Understand the difference between linear and nonlinear regression models.
- See that not all functions are linearizable.
- Get an understanding of the fitting algorithm in a statistical sense (i.e. fitting many linear regressions).
- Know that tests etc. are based on approximations and be able to interpret computer output, profile t-plots and profile traces.
Nonlinear Regression Model
The nonlinear regression model is

  Y_i = h(x_i^(1), x_i^(2), ..., x_i^(m); θ_1, θ_2, ..., θ_p) + E_i
      = h(x_i; θ) + E_i,

where
- E_i are the error terms, E_i ∼ N(0, σ²), independent
- x^(1), ..., x^(m) are the predictors
- θ_1, ..., θ_p are the parameters
- h is the regression function, "any" function; h is a function of the predictors and the parameters.
Comparison with linear regression model
- In contrast to the linear regression model we now have a general function h.
  In the linear regression model we had h(x_i; θ) = x_i^T θ (there we denoted the parameters by β).
- Note that in linear regression we required that the parameters appear in linear form.
- In nonlinear regression, we don't have that restriction anymore.
Example: Puromycin
- The speed of an enzymatic reaction depends on the concentration of a substrate.
- The initial speed is the response variable (Y). The concentration of the substrate is used as predictor (x). Observations are from different runs.
- Model with Michaelis-Menten function

    h(x; θ) = θ_1 x / (θ_2 + x).

- Here we have one predictor x (the concentration) and two parameters: θ_1 and θ_2.
- Moreover, we observe two groups: one where we treat the enzyme with Puromycin and one without treatment (control group).
Illustration: Puromycin (two groups)
[Figure: velocity vs. concentration]
Left: Data (• treated enzyme; △ untreated enzyme).
Right: Typical shape of the regression function.
Example: Biochemical Oxygen Demand (BOD)
Model the biochemical oxygen demand (Y) as a function of the incubation time (x):

  h(x; θ) = θ_1 (1 − e^{−θ_2 x}).

[Figure: oxygen demand vs. days]
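To make these two examples concrete, both regression functions can be written as short R functions (a sketch; the parameter values used below are made up purely to plot the typical shapes):

  ## Michaelis-Menten function (Puromycin example)
  h.mm <- function(x, theta) theta[1] * x / (theta[2] + x)

  ## saturating exponential (oxygen demand example)
  h.bod <- function(x, theta) theta[1] * (1 - exp(-theta[2] * x))

  ## typical shapes, with illustrative (made-up) parameter values
  curve(h.mm(x, c(200, 0.1)), from = 0, to = 1,
        xlab = "Concentration", ylab = "Velocity")
  curve(h.bod(x, c(20, 0.5)), from = 0, to = 7,
        xlab = "Days", ylab = "Oxygen Demand")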
Linearizable Functions
Sometimes (but not always), the function h is linearizable.
Example
- Let's forget about the error term E for a moment. Assume we have

    y = h(x; θ) = θ_1 exp{θ_2/x}
    ⟺  log(y) = log(θ_1) + θ_2 · (1/x).

- We can rewrite this as

    ỹ = θ̃_1 + θ̃_2 · x̃,

  where ỹ = log(y), θ̃_1 = log(θ_1), θ̃_2 = θ_2 and x̃ = 1/x.
- If we use this linear model, we assume additive errors E_i:

    Ỹ_i = θ̃_1 + θ̃_2 x̃_i + E_i.
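As a small sketch of such a linearized fit in R (with toy data generated for illustration; the parameter values here are made up), lm() on the transformed scale gives estimates of θ̃_1 and θ̃_2 that can be back-transformed:

  ## toy data from y = theta1 * exp(theta2 / x) with multiplicative errors
  set.seed(1)
  x <- seq(0.5, 5, length.out = 30)
  y <- 3 * exp(-1 / x) * exp(rnorm(30, sd = 0.1))

  ## linearized model: log(y) = log(theta1) + theta2 * (1/x)
  fit.lin <- lm(log(y) ~ I(1/x))
  theta1.hat <- exp(coef(fit.lin)[1])   # back-transform the intercept
  theta2.hat <- coef(fit.lin)[2]        # the slope is theta2 directly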
- This means that we have multiplicative errors on the original scale:

    Y_i = θ_1 exp{θ_2/x_i} · exp{E_i}.

- This is not the same as using a nonlinear model on the original scale (it would have additive errors!).
- Hence, transformations of Y modify the model with respect to the error term.
- In the Puromycin example: do not linearize, because the error term would fit worse (see next slide).
- Hence, for those cases where h is linearizable, it depends on the data whether it's advisable to do so or to perform a nonlinear regression.
Puromycin: Treated enzyme
[Figure: velocity vs. concentration for the treated enzyme]
Parameter Estimation
Let's now assume that we really want to fit a nonlinear model.
Again, we use least squares. Minimize

  S(θ) := Σ_{i=1}^{n} (Y_i − η_i(θ))²,

where

  η_i(θ) := h(x_i; θ)

is the fitted value for the ith observation (x_i is fixed, we only vary the parameter vector θ).
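In code, S(θ) is just a sum of squared residuals; as a crude sketch (not the algorithm used in practice, which follows below), one can hand it to a general-purpose optimizer. This assumes R's built-in Puromycin data set with columns conc, rate and state:

  ## least-squares criterion for the Michaelis-Menten model (treated enzyme only)
  d <- subset(Puromycin, state == "treated")
  S <- function(theta) sum((d$rate - theta[1] * d$conc / (theta[2] + d$conc))^2)

  ## minimize S(theta) with a generic optimizer, starting from a rough guess
  optim(c(200, 0.1), S)$par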
Geometrical Interpretation
First we recall the situation for linear regression.
- By applying least squares we are looking for the parameter vector θ such that

    ||Y − Xθ||₂² = Σ_{i=1}^{n} (Y_i − x_i^T θ)²

  is minimized.
- Or in other words: we are looking for the point on the plane spanned by the columns of X that is closest to Y ∈ R^n.
- This is nothing else than projecting Y onto that specific plane.
Linear Regression: Illustration of Projection
[Figure: projection of Y ∈ R³ onto the plane spanned by the columns of X (here the vectors [1,1,1] and x), shown in the coordinates η_1 | y_1, η_2 | y_2, η_3 | y_3]
Situation for nonlinear regression
- Conceptually, the same holds true for nonlinear regression.
- The difference is: all possible points now don't lie on a plane, but on a curved surface, the so-called response or model surface defined by

    η(θ) ∈ R^n

  when varying the parameter vector θ.
- This is a p-dimensional surface because we parameterize it with p parameters.
Nonlinear Regression: Projection on Curved Surface
[Figure: projection of Y onto the curved model surface; lines of constant θ_1 (20, 21, 22) and θ_2 (0.3, 0.4, 0.5) on the surface, shown in the coordinates η_1 | y_1, η_2 | y_2, η_3 | y_3]
Computation
- Unfortunately, we cannot derive a closed-form solution for the parameter estimate θ̂.
- Iterative procedures are therefore needed.
- We use a Gauss-Newton approach.
- Starting from an initial value θ^(0), the idea is to approximate the model surface by a plane, to perform a projection on that plane and to iterate many times.
- Remember η : R^p → R^n. Define the n × p matrix

    A_i^(j)(θ) = ∂η_i(θ) / ∂θ_j.

  This is the Jacobian matrix containing all partial derivatives.
Gauss-Newton Algorithm
More formally, the Gauss-Newton algorithm is as follows:
- Start with initial value θ̂^(0).
- For l = 1, 2, ...:
    Calculate the tangent plane of η(θ) at θ̂^(l−1):
      η(θ) ≈ η(θ̂^(l−1)) + A(θ̂^(l−1)) · (θ − θ̂^(l−1)).
    Project Y onto the tangent plane to obtain θ̂^(l).
    The projection is a linear regression problem, see blackboard.
- Iterate until convergence (a small code sketch follows below).
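Below is a minimal Gauss-Newton sketch in R, with a numerically approximated Jacobian and without step-size control or other safeguards; it only illustrates the idea and is not a replacement for nls(). The example at the end reuses the treated Puromycin data from the earlier sketch.

  ## numeric Jacobian A(theta): n x p matrix of partial derivatives of eta
  num_jacobian <- function(eta, theta, eps = 1e-6) {
    f0 <- eta(theta)
    A <- matrix(0, nrow = length(f0), ncol = length(theta))
    for (j in seq_along(theta)) {
      th <- theta
      th[j] <- th[j] + eps
      A[, j] <- (eta(th) - f0) / eps
    }
    A
  }

  ## Gauss-Newton iteration: repeatedly project Y onto the current tangent plane
  gauss_newton <- function(y, eta, theta0, maxit = 50, tol = 1e-8) {
    theta <- theta0
    for (l in 1:maxit) {
      A <- num_jacobian(eta, theta)
      r <- y - eta(theta)          # residuals at the current theta
      delta <- qr.solve(A, r)      # least-squares step: regress r on the columns of A
      theta <- theta + delta
      if (sqrt(sum(delta^2)) < tol) break
    }
    theta
  }

  ## example: Michaelis-Menten model for the treated Puromycin data
  d <- subset(Puromycin, state == "treated")
  eta.mm <- function(theta) theta[1] * d$conc / (theta[2] + d$conc)
  gauss_newton(d$rate, eta.mm, theta0 = c(200, 0.1))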
Initial Values
How can we get initial values?
- Available knowledge
- Linearized version (see Puromycin)
- Interpretation of parameters (asymptotes, half-life, ...), "fitting by eye"
- Combination of these ideas (e.g., conditionally linearizable functions)
Example: Puromycin (only treated enzyme)
[Figure: left, 1/velocity vs. 1/concentration; right, velocity vs. concentration]
Dashed line: solution of the linearized problem.
Solid line: solution of the nonlinear least squares problem.
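In code, the linearized fit (regressing 1/velocity on 1/concentration) can provide the starting values for nls(); a sketch, again assuming R's built-in Puromycin data:

  d <- subset(Puromycin, state == "treated")

  ## linearized problem: 1/y = 1/theta1 + (theta2/theta1) * (1/x)
  fit.lin <- lm(I(1/rate) ~ I(1/conc), data = d)
  b <- coef(fit.lin)
  start <- list(T1 = unname(1 / b[1]), T2 = unname(b[2] / b[1]))

  ## nonlinear least squares with these starting values
  fit.nls <- nls(rate ~ T1 * conc / (T2 + conc), data = d, start = start)
  summary(fit.nls)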
Approximate Tests and Confidence Intervals
- The algorithm "only" gives us θ̂.
- How accurate is this estimate in a statistical sense?
- In linear regression we knew the (exact) distribution of the estimated parameters (remember the animation!).
- In nonlinear regression the situation is more complex in the sense that we only have approximate results.
- It can be shown that

    θ̂_j  approx. ∼  N(θ_j, V_jj)

  for some matrix V (V_jj is the jth diagonal element).
- Tests and confidence intervals are then constructed as in the linear regression situation, i.e.

    (θ̂_j − θ_j) / sqrt(V̂_jj)  approx. ∼  t_{n−p}.

- The reason why we basically have the same result as in the linear regression case is that the algorithm is based on (many) linear regression problems.
- Once converged, the solution is not only the solution to the nonlinear regression problem but also to the linear one of the last iteration (otherwise we wouldn't have converged). In fact

    V̂ = σ̂² (Â^T Â)^{−1}.
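With an nls fit these quantities are directly available; a sketch, continuing with the fit.nls object from the Puromycin code above:

  summary(fit.nls)        # estimates, standard errors sqrt(V_jj), t values
  V <- vcov(fit.nls)      # estimated covariance matrix V = sigma^2 (A'A)^(-1)

  ## 95% confidence interval for theta1 by hand
  est <- coef(fit.nls)["T1"]
  se  <- sqrt(V["T1", "T1"])
  est + c(-1, 1) * qt(0.975, df = df.residual(fit.nls)) * se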
Example Puromycin (two groups)
Remember, we originally had two groups (treatment and control)
[Figure: velocity vs. concentration for both groups]

Question: Do the two groups need different regression parameters?
- To answer this question we set up a model of the form

    Y_i = (θ_1 + θ_3 z_i) x_i / (θ_2 + θ_4 z_i + x_i) + E_i,

  where z is the indicator variable for the treatment (z_i = 1 if treated, z_i = 0 otherwise).
- E.g., if θ_3 is nonzero we have a different asymptote for the treatment group (θ_1 + θ_3 vs. only θ_1 in the control group).
- Similarly for θ_2, θ_4.
Let's fit this model to data (a code sketch follows below).
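A sketch of this fit with nls(), using the built-in Puromycin data; the starting values below are rough guesses, not taken from the lecture:

  ## indicator z = 1 for the treated group, 0 for the control group
  d2 <- transform(Puromycin, z = as.numeric(state == "treated"))

  fit.full <- nls(rate ~ (T1 + T3 * z) * conc / (T2 + T4 * z + conc),
                  data = d2,
                  start = list(T1 = 160, T2 = 0.05, T3 = 50, T4 = 0.02))
  summary(fit.full)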
Computer Output
Formula: velocity ~ (T1 + T3 * (treated == T)) * conc / (T2 + T4 * (treated == T) + conc)

Parameters:
      Estimate  Std.Error  t value  Pr(>|t|)
  T1   160.280      6.896   23.242  2.04e-15
  T2     0.048      0.008    5.761  1.50e-05
  T3    52.404      9.551    5.487  2.71e-05
  T4     0.016      0.011    1.436     0.167

- We only get a significant test result for θ_3 (different asymptotes) and not for θ_4.
- A 95%-confidence interval for θ_3 (= difference between asymptotes) is

    52.404 ± q^{t_19}_{0.975} · 9.551 = [32.4, 72.4],

  where q^{t_19}_{0.975} ≈ 2.09.
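The interval can be reproduced quickly in R:

  ## 95% confidence interval for theta3 (19 residual degrees of freedom)
  52.404 + c(-1, 1) * qt(0.975, df = 19) * 9.551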
More Precise Tests and Confidence Intervals
- Tests etc. that we have seen so far are only "usable" if the linear approximation of the problem around the solution θ̂ is good.
- We can use another approach that is better (but also more complicated).
- In linear regression we had a quick look at the F-test for testing simultaneous null hypotheses. This is also possible here.
- Say we have the null hypothesis H0: θ = θ* (whole vector). Fact: under H0 it holds that

    T = (n − p)/p · (S(θ*) − S(θ̂)) / S(θ̂)  approx. ∼  F_{p, n−p}.
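As a sketch, this statistic is easy to compute once S(θ) is available; the function below is illustrative (its arguments eta, theta.hat and theta.star are placeholders for the model function, the least-squares estimate and the hypothesized vector):

  ## approximate F-test of H0: theta = theta.star, based on sums of squares
  f_test <- function(y, eta, theta.hat, theta.star) {
    S <- function(th) sum((y - eta(th))^2)
    n <- length(y)
    p <- length(theta.hat)
    T <- (n - p) / p * (S(theta.star) - S(theta.hat)) / S(theta.hat)
    c(statistic = T, p.value = pf(T, p, n - p, lower.tail = FALSE))
  }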
- We still have only an "approximate" result. But this approximation is (much) better (more accurate) than the one that is based on the linear approximation.
- This can now be used to construct confidence regions by searching for all vectors θ* that are not rejected using this test (as earlier).
- If we only have two parameters it's easy to illustrate these confidence regions.
- Using linear regression it's also possible to derive confidence regions (for several parameters). We haven't seen this in detail.
- This approach can also be used here (because we use a linear approximation in the algorithm, see also later).
Confidence Regions: Examples
[Figure: confidence regions in the (θ_1, θ_2)-plane; left panel Puromycin, right panel Biochem. Oxygen D.]
- Dashed: confidence regions (80% and 95%) based on the linear approximation.
- Solid: approach with the F-test from above (more accurate).
- "+" is the parameter estimate.
What if we only want to test a single component θ_k?
- Assume we want to test H0: θ_k = θ_k*.
- Now fix θ_k = θ_k* and minimize S(θ) with respect to θ_j, j ≠ k.
- Denote the minimum by S̃_k(θ_k*).
- Fact: under H0 it holds that

    T̃_k(θ_k*) = (n − p) · (S̃_k(θ_k*) − S(θ̂)) / S(θ̂)  approx. ∼  F_{1, n−p},

  or similarly

    T_k(θ_k*) = sign(θ̂_k − θ_k*) · sqrt(S̃_k(θ_k*) − S(θ̂)) / σ̂  approx. ∼  t_{n−p}.
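As a sketch for the Michaelis-Menten fit from before (the objects d and fit.nls are assumed from the earlier code), the profile t statistic for θ_2 can be obtained by refitting with θ_2 held fixed:

  ## profile t statistic for H0: theta2 = theta2.star
  profile_t2 <- function(theta2.star, data, fit) {
    ## refit with theta2 fixed, minimizing only over theta1
    fit0 <- nls(rate ~ T1 * conc / (theta2.star + conc), data = data,
                start = list(T1 = unname(coef(fit)["T1"])))
    S.tilde <- sum(resid(fit0)^2)
    S.hat   <- sum(resid(fit)^2)
    sigma   <- sqrt(S.hat / df.residual(fit))
    sign(unname(coef(fit)["T2"]) - theta2.star) * sqrt(S.tilde - S.hat) / sigma
  }

  profile_t2(0.08, data = d, fit = fit.nls)   # compare with qt(0.975, df.residual(fit.nls))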
- Our first approximation was based on the linear approximation and we got a test of the form

    δ_k(θ_k*) = (θ̂_k − θ_k*) / s.e.(θ̂_k)  approx. ∼  t_{n−p},

  where s.e.(θ̂_k) = sqrt(V̂_kk). This is what we saw in the computer output.
- The new approach with T_k(θ_k*) answers the same question (i.e., we do a test for a single component).
- The approximation of the new approach is (typically) much more accurate.
- We can compare the different approaches using plots.
Profile t-Plots and Profile Traces
The profile t-plot is defined as the plot of T_k(θ_k*) against δ_k(θ_k*) (for varying θ_k*).
- Remember: the two tests (T_k and δ_k) test the same thing.
- If they behave similarly, we would expect the same answers, hence the plot should show a diagonal (intercept 0, slope 1).
- Strong deviations from the diagonal indicate that the linear approximation at the solution is not suitable and that the problem is very nonlinear in a neighborhood of θ̂_k.
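In R, profile objects for nls fits can be computed with profile(), and confint() on an nls fit uses this profiling internally; plotting the resulting object (the plot method for profile.nls objects is, to my understanding, supplied by the MASS package) gives essentially these profile t-plots. A sketch, continuing with fit.nls from before:

  pr <- profile(fit.nls)    # profile the fit over each parameter
  confint(fit.nls)          # profile-based confidence intervals

  ## profile plots (assumes the plot method from the MASS package)
  library(MASS)
  plot(pr)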
Profile t-Plots: Examples
[Figure: profile t-plots, T_1(θ_1) against δ(θ_1); left panel Puromycin, right panel Biochem. Oxygen D.; reference levels 0.80 and 0.99 are marked]
Profile Traces
- Select a pair of parameters: θ_j, θ_k; j ≠ k.
- Keep θ_k fixed, estimate the remaining parameters: θ̃_j(θ_k).
- This means: when varying θ_k we can plot the estimated θ̃_j (and vice versa).
- Illustrate these two curves on a single plot.
- What can we learn from this?
  - The angle between the two curves is a measure of the correlation between estimated parameters. The smaller the angle, the higher the correlation.
  - In the linear case we would see straight lines. Deviations are an indication of nonlinearities.
  - Correlated parameter estimates influence each other strongly and make estimation difficult.
Profile Traces: Examples
[Figure: profile traces in the (θ_1, θ_2)-plane; left panel Puromycin, right panel Biochem. Oxygen D.]
Grey lines indicate confidence regions (80% and 95%).
Parameter Transformations
In order to improve the linear approximation (and therefore improve convergence behaviour) it can be useful to transform the parameters.
- Transformations of parameters do not change the model, but they do change
  - the quality of the linear approximation, influencing the difficulty of computation and the validity of approximate confidence regions, and
  - the interpretation of the parameters.
- Finding good transformations is hard.
- Results can be transformed back to the original parameters. Then, the transformation is just a technical step to solve the problem.
- Use parameter transformations to avoid side constraints, e.g.

    θ_j > 0        →  use θ_j = exp{φ_j},                       φ_j ∈ R
    θ_j ∈ (a, b)   →  use θ_j = a + (b − a) / (1 + exp{−φ_j}),  φ_j ∈ R
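A small sketch of such a reparameterization for the oxygen-demand model, fitting φ_2 = log(θ_2) so that θ_2 > 0 is automatic; this uses R's built-in BOD data set (columns Time and demand), and the starting values are rough guesses:

  ## fit h(x; theta) = theta1 * (1 - exp(-theta2 * x)) with theta2 = exp(phi2) > 0
  fit.phi <- nls(demand ~ T1 * (1 - exp(-exp(phi2) * Time)),
                 data = BOD,
                 start = list(T1 = 20, phi2 = log(0.5)))

  coef(fit.phi)                 # estimates on the transformed scale
  exp(coef(fit.phi)["phi2"])    # back-transformed estimate of theta2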
Summary
- Nonlinear regression models are widespread in chemistry.
- Computation needs an iterative procedure.
- The simplest tests and confidence intervals are based on linear approximations around the solution θ̂.
- If the linear approximation is not very accurate, problems can occur. Graphical tools for checking linearity are profile t-plots and profile traces.
- Tests and confidence intervals based on the F-test are more accurate.
- Parameter transformations can help reduce these problems.