Data Set 4: Environmental Effects on Moth Population Growth Rates

Statistical Setting
This is an example of an analysis with several possible explanatory variables, none of
which is of any special interest a priori. The purpose of the study is to determine which of these
variables seems most related to the response variable, and how, in order to suggest possible causal
hypotheses as well as to allow prediction. The analysis therefore aims to develop a model which
explains the data well — to select apparently important variables — rather than to make formal
inferences. This model building will be done in two ways: using all the data (with internal
validation), and splitting the data into “training” and “validation” sets. Given the small sample
size the latter method really is not appropriate but is shown to illustrate how it would be done.
Background
The data are from a paper by N. Broekhuizen, H. F. Evans and M. P. Hassell on “Site
characteristics and the population dynamics of the pine looper moth” (Journal of Animal Ecology,
62: 511-518, 1993). The pine looper moth (Bupalus piniaria, an inchworm also known as the
Bordered White) is a major pest of pine forests in Europe. Like many forest pests, its populations
go through fairly regular cycles, including outbreaks during which they can defoliate large areas
of forest. (For a more recent analysis of the moth’s population dynamics, see Kendall et al., 2005,
Ecological Monographs 75(2): 259-276.)
Broekhuizen et al. report only simple regressions rather than the multiple regressions
shown in this handout. The results of the latter are somewhat different from, but not seriously
inconsistent with, the results of the published analysis.
Data
The raw data used in the paper consisted of annual densities of pupae of the moth in different
forest plantations, over periods of at least 22 years in each forest. Each of these time series was fit
to a population-dynamics model which had been found previously to adequately describe
dynamics in two intensively studied populations. The estimated parameters of this model then
were used as response variables in multiple regression analyses exploring their relationships with
various environmental variables. (Note that this approach ignores the variability of the time-series
data around the fitted model and the uncertainty in the estimation of the parameters; that is, the
model aims to explain the estimated parameter value, not the true value). The dynamical
parameter considered in this Handout is λ (“lambda”), the net per capita reproductive rate of the
population.
The analysis aimed to explain the estimated parameters by six environmental factors
measured for each forest: latitude (degrees north: “latit”); mean age of the plantation when the
censuses began in 1953 (“age”); the percentage of the area covered by the moth’s main host
plant, Scots pine (“pct_scot”); the height of the plantation above sea level (in m: “altit”); the
mean monthly rainfall (in mm: “rain”); and the mean annual temperature (degrees C: “temp”).
The analysis used data from n = 27 plantations, but the age of one plantation was not
known so for all models including the age variable the sample size is n = 26.
Data Exploration
Distributions
Although regression analysis makes no assumptions about the distributions of any of the
variables (only about the distribution of the random deviations of the data from the true model),
examining the distributions can point to unusual observations which might be influential.
[Figure: histograms (vertical axis: frequency) of lambda, latit, age, pct_scot, altit, rain and temp.]
There are high outliers in age and rain. The distribution of latitudes is bimodal (perhaps
the plantations occur in two particular regions?) and that of pct_scot is quite skewed, most
plantations being mostly Scots pine but a few having very little of this species. Any or all of these
features could lead to observations being influential, which will need to be assessed later.

(rev. February 20, 2007)
Collinearity (Confounding)
Correlation matrix of independent variables:

             LATIT      AGE   PCT_SCOT    ALTIT     RAIN
AGE          0.311
PCT_SCOT     0.601    0.379
ALTIT        0.096    0.397      0.112
RAIN         0.062    0.261     -0.126    0.443
TEMP        -0.832   -0.429     -0.535   -0.603   -0.211
There are fairly strong correlations among various pairs of the explanatory variables; in
particular, temp is strongly negatively correlated with all the other variables except rain (and
even that correlation is moderately strong and negative). Other strong correlations are between
latit and pct_scot and between altit and rain.
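The pairwise correlations above can be condensed into one collinearity measure per variable, the variance inflation factor (VIF), which is the corresponding diagonal element of the inverse correlation matrix. The following is a minimal sketch in Python with NumPy (an assumed tool, not the software used for this handout). It treats the rounded correlations as if they came from one complete set of observations, which is an assumption: because one plantation's age is missing, some entries may rest on 26 rather than 27 plantations.

```python
import numpy as np

names = ["latit", "age", "pct_scot", "altit", "rain", "temp"]

# Lower triangle of the correlation matrix reported above, keyed as (row, column).
lower = {
    ("age", "latit"): 0.311,
    ("pct_scot", "latit"): 0.601, ("pct_scot", "age"): 0.379,
    ("altit", "latit"): 0.096, ("altit", "age"): 0.397, ("altit", "pct_scot"): 0.112,
    ("rain", "latit"): 0.062, ("rain", "age"): 0.261,
    ("rain", "pct_scot"): -0.126, ("rain", "altit"): 0.443,
    ("temp", "latit"): -0.832, ("temp", "age"): -0.429,
    ("temp", "pct_scot"): -0.535, ("temp", "altit"): -0.603, ("temp", "rain"): -0.211,
}

# Build the full symmetric matrix with ones on the diagonal.
R = np.eye(len(names))
for (a, b), r in lower.items():
    i, j = names.index(a), names.index(b)
    R[i, j] = R[j, i] = r

# VIF_j = j-th diagonal element of the inverse correlation matrix = 1 / (1 - R_j^2),
# where R_j^2 comes from regressing variable j on all the other explanatory variables.
vif = dict(zip(names, np.diag(np.linalg.inv(R))))
for name, value in vif.items():
    print(f"{name:9s} VIF = {value:6.1f}")
```

A large VIF for a variable means its coefficient would be estimated imprecisely in a model containing all six variables, which is one way to see why some reduction of the variable set is attractive here.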
A scatterplot matrix (below) illustrates these relationships, while also showing some
unusual observations and odd patterns. Most noteworthy is the one plantation with very high
rainfall despite being at low latitude and altitude. The latitudes fall into two fairly distinct groups.
In addition to the two high outliers for age seen in the histogram, the youngest plantation
(age = 4) also stands out in these plots, not following the relationships the other observations
show between age and some of the other variables, especially altit and rain. There also are
peculiar patterns in the plots of pct_scot vs. most of the other variables, and of altit vs.
several of the others.

[Figure: scatterplot matrix of latit, age, pct_scot, altit, rain and temp.]
Need for Model Reduction
If the aim of the analysis were simply to be able to describe the moth population growth
rates observed in these plantations the collinearity would not be a problem. It clearly does,
though, complicate the attempt to determine the importance of the various variables, i.e. to
distinguish their possible “effects.” It also is likely to weaken the predictive effectiveness of any
model containing all or most of the variables. For both purposes of the study, therefore, it is
appropriate to try to determine one or more subsets of the variables which are less collinear and
also explain the response variable accurately and parsimoniously.
In interpreting the results of this model selection, though, it is important to realize that the
collinearity also means that the selection of which variables to include in the final model should
not be taken as definitive evidence of which variables (if any) actually have important effects.
Relationships with lambda
The response variable λ is somewhat correlated with all the explanatory variables, but
most strongly with latitude (positively) and plantation age (negatively).
Correlation with lambda:

latit        0.400
age         -0.310
pct_scot     0.106
altit       -0.259
rain        -0.174
temp        -0.196
Scatterplots of these relationships (below) show several features of possible concern.
Several appear nonlinear, with the highest λs tending to occur at intermediate levels of the
variable; this is most apparent for temp. The relationships of λ to age and rain appear to be
strongly affected by a few observations with extreme values of these variables. Because of the
collinearity in these data the “partial” relationships with the response variable in the multiple
regression model could be quite different from these bivariate relationships, so I chose not to
do anything about these undesirable features at this point but simply to be aware of them while
assessing the adequacy of the full model.

[Figure: scatterplots of lambda against latit, age, pct_scot, altit, rain and temp.]
Analysis
Choice of the Full Model
The number of variables to be considered was greatly constrained by the sample size: with
only 27 observations, the six basic variables alone already leave fewer than the 6 to 10
observations per independent variable that the usual guideline calls for. It therefore would not
be reasonable to include interaction or polynomial terms in the initial analysis.
It nonetheless is important to consider the form in which the six variables are included, in
order primarily to produce a reasonably linear relationship, and possibly also to avoid excessive
leverage points; the histograms and bivariate scatterplots above suggest there could be both
nonlinearity and leverage problems. To assess these issues I examined partial-regression plots for
a model including linear terms for each of the six variables. These plots, shown below, did not
indicate any nonlinearity but did indicate several high-leverage points, particularly in age and
rain:
[Figure: added-variable (partial-regression) plots of lambda | others against (Intercept) | others,
latit | others, age | others, pct_scot | others, altit | others, rain | others and temp | others.]
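The construction of a partial-regression (added-variable) plot can be sketched as follows: both the response and the variable of interest are regressed on all the other terms, and the two sets of residuals are plotted against each other. This is a minimal Python/NumPy sketch on synthetic stand-in data (not the moth data); the function names are hypothetical. The slope of the line through such a plot equals the coefficient of that variable in the full multiple regression, which the last lines check.

```python
import numpy as np

def residuals(y, X):
    """Residuals from an ordinary least squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def added_variable(y, X, j):
    """Return (x_res, y_res): column j of X and y, each adjusted for the other columns."""
    others = np.delete(X, j, axis=1)
    return residuals(X[:, j], others), residuals(y, others)

# Synthetic stand-in data: an intercept column plus two correlated predictors.
rng = np.random.default_rng(0)
n = 26
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(scale=0.5, size=n)

# Coordinates of the added-variable plot for x1, and the slope through them.
x_res, y_res = added_variable(y, X, j=1)
av_slope = (x_res @ y_res) / (x_res @ x_res)

# Coefficient of x1 in the full model, for comparison.
full_beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(av_slope, full_beta[1])
```

Points far from zero on the horizontal axis of such a plot are the ones with high leverage for that coefficient, which is how the plots above were read.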
Examination of the histograms of the independent variables suggested that these leverage
points were due to the skewed distributions of some of the variables. I therefore square-root
transformed age and altit, log transformed rain, and arcsine-square root transformed
pct_scot. Partial-regression plots from a regression using these transformed variables were then
examined. These plots indicated, however, that the transformations had increased rather than
reduced the leverage problem, so the untransformed variables were used in the subsequent
analysis.
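For reference, the transformations tried here can be written out as below. This is a sketch in Python/NumPy under two assumptions that the text does not state: that the log is the natural log, and that pct_scot is converted from a percentage to a proportion before the arcsine-square root is taken. The input values are hypothetical and serve only to exercise the function.

```python
import numpy as np

def transform_predictors(age, altit, rain, pct_scot):
    """Apply the four transformations described in the text to arrays in original units."""
    return {
        "sqrt_age": np.sqrt(age),
        "sqrt_altit": np.sqrt(altit),
        "log_rain": np.log(rain),  # natural log (assumed)
        # pct_scot is a percentage, so rescale to a proportion first (assumed)
        "asin_sqrt_pct_scot": np.arcsin(np.sqrt(pct_scot / 100.0)),
    }

# Hypothetical values for three plantations, chosen only for illustration.
example = transform_predictors(
    age=np.array([4.0, 25.0, 58.0]),
    altit=np.array([0.0, 100.0, 225.0]),
    rain=np.array([600.0, 800.0, 1080.0]),
    pct_scot=np.array([0.0, 50.0, 100.0]),
)
for name, values in example.items():
    print(name, np.round(values, 3))
```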
Selection of Reduced Models
The following table (from SAS) lists the four best-fitting models for each size (number of
terms), and gives all the common model-selection criteria based on model SS as a measure of fit;
the fifth best three-variable model also is shown because its fit is very close to that of the four best
models of that size. These model-selection criteria are plotted against the number of variables in
the model in the figure below.
Number
in model      R2    R2adj     C(p)       AIC        SBC   Variables in Model
   1      0.1618   0.1269   4.6541   15.1600   17.67616   lat
   1      0.0960   0.0583   6.7494   17.1274   19.64363   age
   1      0.0661   0.0272   7.6991   17.9725   20.48865   alt
   1      0.0374  -0.0027   8.6108   18.7587   21.27485   tmp
   2      0.3710   0.3163   0.0033    9.6968   13.47109   lat age
   2      0.2703   0.2069   3.2039   13.5557   17.33003   (not recoverable)
   2      0.2587   0.1943   3.5732   13.9663   17.74055   (not recoverable)
   2      0.2294   0.1624   4.5050   14.9742   18.74849   (not recoverable)
   3      0.3941   0.3115   1.2676   10.7225   15.75489   age alt tmp
   3      0.3932   0.3105   1.2963   10.7611   15.79353   lat age alt
   3      0.3850   0.3011   1.5587   11.1124   16.14481   lat age tmp
   3      0.3814   0.2970   1.6728   11.2636   16.29595   lat age rn
   3      0.3734   0.2880   1.9262   11.5963   16.62871   lat age %sct
   4      0.3975   0.2827   3.1601   12.5770   18.86750   (not recoverable)
   4      0.3971   0.2822   3.1731   12.5947   18.88517   (not recoverable)
   4      0.3959   0.2809   3.2093   12.6437   18.93421   (not recoverable)
   4      0.3957   0.2806   3.2167   12.6537   18.94418   (not recoverable)
   5      0.4009   0.2511   5.0525   14.4306   21.97919   (not recoverable)
   5      0.4006   0.2508   5.0612   14.4425   21.99105   (not recoverable)
   5      0.3980   0.2475   5.1442   14.5554   22.10398   (not recoverable)
   5      0.3971   0.2464   5.1720   14.5931   22.14172   (not recoverable)
   6      0.4025   0.2139   7.0000   16.3589   25.16553   lat age %sct alt rn tmp
All the criteria are best for the two-variable model with latit and age: Cp, AIC and SBC
all are much smaller for this model than any of the others, and R2adj is slightly larger than for the
next best models. In addition this two-variable model is the first (i.e. the one with fewest
variables) to have Cp near or below p. This agreement among the criteria is somewhat unusual,
and gives considerable confidence in the selection.
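As a check on how these criteria relate to one another, the row for the two-variable model can be recomputed from sums of squares. The sketch below is in Python (an assumed tool; the table itself came from SAS) and uses SSE = 29.973 and SST = 47.650 from the regression output for this model, n = 26, and a full-model MSE derived from the six-variable R2 of 0.4025. The AIC and SBC formulas shown are the forms that appear to reproduce the tabulated values; they are an inference from the numbers, not something the handout states.

```python
import math

def selection_criteria(sse, sst, n, p, mse_full):
    """Model-selection criteria for a model with p parameters (intercept included)."""
    r2 = 1.0 - sse / sst
    return {
        "R2": r2,
        "R2adj": 1.0 - (1.0 - r2) * (n - 1) / (n - p),
        "Cp": sse / mse_full - (n - 2 * p),
        "AIC": n * math.log(sse / n) + 2 * p,
        "SBC": n * math.log(sse / n) + p * math.log(n),
    }

n, sst = 26, 47.650

# Full (six-variable) model: SSE recovered from its R2; it has 7 parameters.
sse_full = (1.0 - 0.4025) * sst
mse_full = sse_full / (n - 7)

# Two-variable model (latit, age): p = 3 parameters including the intercept.
crit = selection_criteria(sse=29.973, sst=sst, n=n, p=3, mse_full=mse_full)
for name, value in crit.items():
    print(f"{name:6s} {value:8.4f}")
```

Because Cp uses the full-model MSE as its estimate of the error variance, any overestimate in that MSE pushes Cp down for every reduced model at once.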
The five three-variable models are next best by all criteria; they are very similar to each
other. Four of these models include the best two-variable model, with latit and age; the other
(which happens to fit slightly better than the others) has altit in place of latit, and the
second-best of these models combines both latit and altit with age. The sixth best
three-variable model does not contain either of these variables and fits considerably less well than
does the fifth best. The similarity of these best three-variable models to the best two-variable
model lends further support to the importance of the variables in that model.
[Figure: adjusted R2, Cp, AIC and SBC plotted against the number of terms in the model (p-1).]
PRESS and R2prediction statistics then were obtained for these six models, as shown in the
following table:
Number
in model   Variables in Model     PRESS   R2prediction
   2       lat age              36.4115         0.2358
   3       age alt temp         38.2080         0.1981
   3       lat age alt          39.4795         0.1715
   3       lat age temp         39.2322         0.1767
   3       lat age rain         37.2626         0.2180
   3       lat age %scot        38.7459         0.1869
By these additional criteria the two-variable model again is better than any of the others.
Interestingly, the three-variable model that was only fourth best in its fit to the data was the best of
that size as measured by PRESS and R2prediction; this suggests that these three-variable models may
be slightly overfitting the data.
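PRESS does not require refitting the model n times: each deleted residual equals the ordinary residual divided by one minus that observation's leverage. The sketch below, in Python/NumPy on synthetic stand-in data (not the moth data, and with hypothetical function names), computes PRESS both ways so the shortcut can be compared with explicit leave-one-out refits. It also shows R2prediction as 1 - PRESS/SST, using PRESS = 36.4115 from the table above and SST = 47.650 from the regression output.

```python
import numpy as np

def press(X, y):
    """PRESS via the leverage shortcut: sum of (e_i / (1 - h_ii))^2."""
    H = X @ np.linalg.pinv(X)          # hat matrix
    e = y - H @ y                      # ordinary residuals
    return float(np.sum((e / (1.0 - np.diag(H))) ** 2))

def press_loo(X, y):
    """The same quantity, obtained by refitting with each observation left out."""
    total = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        total += (y[i] - X[i] @ beta) ** 2
    return float(total)

def r2_prediction(press_value, sst):
    """Predictive analogue of R2: 1 - PRESS / SST."""
    return 1.0 - press_value / sst

# Synthetic stand-in data: intercept plus two predictors, n = 26.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(26), rng.normal(size=(26, 2))])
y = X @ np.array([3.0, 0.4, -0.3]) + rng.normal(size=26)

print(press(X, y), press_loo(X, y))

# Value for the two-variable moth model, from the tabulated PRESS and SST.
print(round(r2_prediction(36.4115, 47.650), 4))
```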
Based on all these criteria I selected three models for diagnostic assessment and possible
refinement: the model with latit and age, the model adding rain to those two variables (fourth
best three-variable model by R2), and the somewhat different model with age, altit, and temp
(best three-variable model by R2). Assuming all were considered acceptable, the two-variable
model would be adopted as the best model, by consensus of all criteria.
One note of caution comes from the very low Cp values of the better models: all are
considerably below p, suggesting either that they involve substantial bias or that the MSE for the
full model overestimates the true error variance. The possibility of bias suggests considering
larger models; in this case any of the better five-variable models have Cp closer to p than do the
smaller models. I suspect, however, that the cause for these low values of Cp is instead that the
full model is not adequate (see section on model adequacy below), and so did not choose to
consider any of these five-variable models.
Estimation and Inference for Selected Models
To understand these models we of course need to obtain the parameter estimates, which
are not given in the output of the model-reduction procedures. It also is informative, in a
qualitative way, to examine the tests of significance of the model and of the individual
parameters, though when analyzing models chosen on the basis of the data (rather than with a
structure known or assumed a priori) formal inference by these tests is not valid.
Model with only latitude and age:
SOURCE          DF        SS       MS       F       p
Regression       2    17.677    8.839    6.78   0.005
Error           23    29.973    1.303
Total           25    47.650

s = 1.142     R-sq = 37.1%

Predictor          Coef     St.dev        t       p
Constant        -12.063      5.323    -2.27   0.033
latit           0.31702    0.09997     3.17   0.004
age            -0.06289    0.02274    -2.77   0.011
Each term in this model clearly is related to the population growth rate, λ, and the model
explains a very significant proportion of the variability in λ. Latitude is positively related to λ:
the population rate of increase is higher farther north. Age is negatively related: λ decreases in
older plantations. (Note that both relationships are “adjusted for” the other.)
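To make the fitted equation concrete, the sketch below (Python, an assumed tool) evaluates it for one plantation and recomputes the t ratios as coefficient divided by standard error. The coefficients and standard errors are copied from the output above; the example plantation (55 degrees north, 25 years old) is hypothetical and was chosen only because it lies inside the observed range of both variables.

```python
# Coefficients copied from the fitted two-variable model reported above.
INTERCEPT, B_LATIT, B_AGE = -12.063, 0.31702, -0.06289

def predict_lambda(latit, age):
    """Predicted net per capita reproductive rate for a plantation."""
    return INTERCEPT + B_LATIT * latit + B_AGE * age

# Hypothetical plantation: latitude 55 degrees north, mean age 25 years.
example = predict_lambda(55.0, 25.0)

# t ratios: coefficient divided by its standard error.
t_latit = 0.31702 / 0.09997
t_age = -0.06289 / 0.02274

print(f"predicted lambda = {example:.3f}")
print(f"t(latit) = {t_latit:.2f}, t(age) = {t_age:.2f}")
```

Read this way, one additional degree of latitude raises the predicted λ by about 0.32 at a fixed age, and ten additional years of plantation age lower it by about 0.63 at a fixed latitude.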
Model with age, altitude and temp:
SOURCE          DF        SS       MS       F       p
Regression       3    18.780    6.260    4.77   0.010
Error           22    28.870    1.312
Total           25    47.650

s = 1.146     R-sq = 39.4%

Predictor           Coef      St.dev        t       p
Constant          14.462       3.143     4.60   0.000
age             -0.05179     0.02443    -2.12   0.046
altit          -0.010714    0.004341    -2.47   0.022
temp             -0.9862      0.3111    -3.17   0.004
In this model λ again is negatively related to plantation age; the coefficient is about 20%
smaller than in the previous model. Altitude and temperature both also are negatively related to λ:
the moth population grows faster at lower altitudes, all else being equal, or at lower temperatures,
all else being equal. Of course these two variables are fairly strongly negatively correlated so in
fact there is little variation in one without offsetting variation in the other. As the
model-selection statistics above already indicated, this model explains only a small amount more
of the variation in λ than did the two-variable model, and as a result the F* statistic for the model
is smaller and the P-value larger.
Model with latitude, age, and rain:
Source            DF        SS       MS       F       P
Regression         3    18.172    6.057    4.52   0.013
Residual Error    22    29.477    1.340
Total             25    47.650

S = 1.15753     R-Sq = 38.1%

Predictor           Coef     SE Coef        T       P
Constant         -11.281       5.549    -2.03   0.054
latit             0.3181      0.1014     3.14   0.005
age             -0.05936     0.02378    -2.50   0.021
rain           -0.001288    0.002119    -0.61   0.549
In this case, the model remains quite significant, and the relationships of latit and age
with λ are largely unchanged from the two-variable model above, though addition of the rain
term somewhat reduces the slope for age. The rain term, however, clearly is not significant
(especially considering that this model, i.e. addition of this term to the two-variable model was
selected as one of the best three-variable models).
Adequacy of Selected Models
The three models selected in the preceding were subjected to the full battery of diagnostic
approaches. Only the interesting results will be described here, for all three models together.
Unusual Observations
Three observations (circled in blue in the three plots below) had high leverage in all of the
models because they were at the extremes of the age distribution: two were from unusually old
plantations (58 and 57 years) and the third from an unusually young plantation (4 years); the
remaining observations were between 15 and 35 years, and mostly in the 20’s. These points are
obvious in the partial-regression plot for age above. One additional observation (circled in red
in the bottom plot), with very much more rain than any other, had very high leverage in the third
model, which included rain.

[Figure: studentized-deleted residuals vs. leverage for the three models (TRES1*HI1, TRES2*HI2,
TRES3*HI3).]
One observation (circled in red in all three plots) had a very large negative residual. The
studentized-deleted residuals for this observation were in the extreme 1% to 2% of the pertinent t
distributions for each of the three models. This observation had λ = 1.54 (only one other λ was
less than 2.0), and was from a high latitude, at which the predicted λ was high.
Based on Cook’s D and dffits, the high-leverage points — even the one with very high
leverage in the third model due to its very high level of rain — did not appear to be influential,
but the point with the large residual was. Its values for dffits were 0.95 or higher, and much larger
than for any other observation, in each of the models. Similarly, its values for the dfbetas were
mostly large (0.5 to 0.7) and again larger than those for other observations. Its Cook’s D,
however, was fairly small: 0.18 to 0.24 for the three models (less than the 10% level in the
appropriate F distributions).
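The case diagnostics used in this section can all be computed from the residuals and leverages of a single fit. The following Python/NumPy sketch uses the standard textbook formulas on synthetic stand-in data (not the moth data) with one planted outlier; the function name and the data are hypothetical.

```python
import numpy as np

def influence_measures(X, y):
    """Leverage, studentized-deleted residuals, Cook's D and DFFITS for an OLS fit."""
    n, p = X.shape
    H = X @ np.linalg.pinv(X)                 # hat matrix
    h = np.diag(H)                            # leverages
    e = y - H @ y                             # ordinary residuals
    sse = float(e @ e)
    mse = sse / (n - p)
    # Studentized-deleted residual: residual scaled by an error estimate that omits case i.
    t_del = e * np.sqrt((n - p - 1) / (sse * (1.0 - h) - e**2))
    cooks_d = e**2 * h / (p * mse * (1.0 - h) ** 2)
    dffits = t_del * np.sqrt(h / (1.0 - h))
    return h, t_del, cooks_d, dffits

# Synthetic stand-in data with a deterministic small disturbance and one planted outlier.
n = 26
i = np.arange(n)
x1 = 51.0 + 7.0 * i / (n - 1)
x2 = 20.0 + 10.0 * np.cos(i)
X = np.column_stack([np.ones(n), x1, x2])
y = -12.0 + 0.3 * x1 - 0.06 * x2 + 0.3 * np.sin(3 * i)
y[20] -= 3.0   # planted large negative residual

h, t_del, cooks_d, dffits = influence_measures(X, y)
worst = int(np.argmax(np.abs(t_del)))
print(worst, round(float(t_del[worst]), 2), round(float(dffits[worst]), 2))
```

As in the analysis above, a case can have a large DFFITS driven mainly by its residual rather than by its leverage, so it is worth looking at the leverage and the deleted residual separately.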
[Figure: Cook's D vs. residuals (COOK1*RESI1, COOK2*RESI2, COOK3*RESI3) and dffits vs. residuals
(DFIT1*RESI1, DFIT2*RESI2, DFIT3*RESI3) for the three models.]
I therefore decided to repeat the variable-selection process without this one outlying
observation. The results are shown in the following table.
Vars     R2   R2.adj     Cp         S   Variables in Model
  1    25.8     22.6    6.3    1.1848   lat
  1     9.9      6.0   12.2    1.3052   age
  1     9.8      5.9   12.2    1.3062   temp
  1     3.6      0.0   14.5    1.3508   alt
  2    50.7     46.2   -0.8   0.98725   lat age
  2    35.4     29.5    4.8    1.1305   (not recoverable)
  2    30.9     24.6    6.5    1.1688   (not recoverable)
  2    30.7     24.4    6.5    1.1705   (not recoverable)
  3    51.0     44.0    1.1    1.0075   lat age rain
  3    50.8     43.7    1.1    1.0100   lat age alt
  3    50.8     43.7    1.1    1.0101   lat age temp
  3    50.7     43.7    1.2    1.0103   lat age %sct
  4    51.1     41.3    3.0    1.0314   (not recoverable)
  4    51.0     41.2    3.0    1.0323   (not recoverable)
  4    51.0     41.2    3.1    1.0324   (not recoverable)
  4    50.8     40.9    3.1    1.0348   (not recoverable)
  5    51.1     38.2    5.0    1.0582   (not recoverable)
  5    51.1     38.2    5.0    1.0582   (not recoverable)
  5    51.0     38.2    5.0    1.0588   (not recoverable)
  5    50.8     37.8    5.1    1.0616   (not recoverable)
  6    51.1     34.9    7.0    1.0867   lat age %sct alt rain temp
These results are similar to those for the entire data set in selecting the two-variable model
with latit and age first (and clearly best by all criteria), and several three-variable models as
also worth consideration. These results differ from those for the entire data set in that the best
models with three or more variables now all include latitude and age, and the best
three-variable model from before (with age, altitude and temperature, but without latitude)
now is not particularly good. The three-variable model with latit, age and rain, which gave
only the fourth best fit (of three-variable models) to the full data set but performed well at
prediction (PRESS and R2pred), now is the best fitting three-variable model.
Considering these analyses of influence, and combining the variable selection results both
with and without this one observation, I concluded that the chosen model should include latit
and age, and that the apparent effectiveness of models lacking those two variables probably was
due to their fitting the one unusual observation.
Residual Plots
Plots of residuals against fitted values (below) or the independent variables in the models
were not perfect but were basically acceptable. There were no indications of nonlinearity. The
variability appeared to increase at larger values of the estimated Y, but this was due at least in part
to the one outlying observation considered already; I did not feel this problem was definite or
extreme enough to warrant remedial action (e.g. weighted least squares).
[Figure: residuals vs. fits for the three models (RESI1*FITS1, RESI2*FITS2, RESI3*FITS3).]
The plot of the residuals from the latit, age model against the possible third
independent variable rain showed no trend, which supports not adding this variable.
[Figure: residuals from 2-variable model (RESI1) vs. rain.]
Conclusion from Diagnostics
While there are a few unusual observations that cause some concern about excessive
influence, it does not appear that the conclusions of the analysis depend only on one or a few
points. Otherwise there was nothing apparent in these diagnostic assessments which would call
into question the validity of these analyses.
One serious concern, however, was raised by the results for Cp in the “Selection of
Reduced Models”: the very small (even negative) values of Cp obtained for some of the models
suggest that the MSE for the full model was an overestimate of σ², and thus that the full model
was not properly formulated. Since there was no indication of nonlinearity in the residual
diagnostics, it may be that there are important interactions which were omitted from
consideration. I do not feel it would be useful to explore this possibility with this very small data
set, but it should be considered in future studies if data are available from more plantations.
Validation
The final step in choosing among these various models is some form of validation.
Obviously external validation – testing these models against new data – ultimately is needed. The
current sample is too small for a data-splitting approach to validation to be very robust; I
nonetheless will illustrate this approach, but feel the internal cross-validation is more reasonable
for such a small data set.
Internal Cross-validation
Internal validation consists of comparing each model’s SSE and PRESS. The PRESS
values were given above, as part of the model selection, and are repeated here along with the SSE
values:
Number
in model   Variables in Model     PRESS      SSE
   2       lat age              36.4115   29.973
   3       age alt temp         38.2080   28.870
   3       lat age alt          39.4795   28.913
   3       lat age temp         39.2322   29.306
   3       lat age rain         37.2626   29.477
   3       lat age %scot        38.7459   29.857
These comparisons suggest that the two-variable model is best: it actually gives better
predictive ability for the observations in the data set than do any of the three-variable models,
despite their fitting the full data set more closely. To put this another way, its SSE and PRESS are
more similar, indicating that its MSE is a more valid indicator of its predictive ability than are
those for the other models; this suggests that the MSEs of the latter, which already were larger
than for the smaller model, actually overstate their models’ effectiveness.
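One simple way to express how "similar" SSE and PRESS are is their ratio, which is 1 when leaving an observation out does not change its prediction error and grows as the fit depends more heavily on the individual points. The short Python sketch below uses the tabulated PRESS and SSE values; the ratio itself is an illustrative summary, not a statistic used in the handout, and the three-variable models whose variable lists are not needed for the comparison are labelled only by their rank in R2.

```python
# (PRESS, SSE) pairs copied from the table above.
models = {
    "latit, age": (36.4115, 29.973),
    "age, altit, temp (3-var, 1st by R2)": (38.2080, 28.870),
    "latit, age, altit (3-var, 2nd by R2)": (39.4795, 28.913),
    "3-var, 3rd by R2": (39.2322, 29.306),
    "latit, age, rain (3-var, 4th by R2)": (37.2626, 29.477),
    "3-var, 5th by R2": (38.7459, 29.857),
}

# Ratio of out-of-sample to in-sample squared error for each model.
ratios = {name: press / sse for name, (press, sse) in models.items()}
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{ratio:6.3f}  {name}")

closest = min(ratios, key=ratios.get)
```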
Data-splitting Validation
Because latitude appears to be an important variable, and is bimodal in distribution, I split
the data to give balanced distributions of latitudes in the two parts: I sorted the data by latitude
and then separated odd- and even-numbered observations. I then used each half of the data as both
a training set and a validation set: I fit each model to each half of the data, and used these fits to
predict the values of λ for the other half of the data. This approach also allows comparison of the
estimated models for the two halves (and of course the full data set).
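The splitting procedure just described can be sketched as follows in Python/NumPy, on synthetic stand-in data (not the moth data). The function names are hypothetical, and MSPR is taken here to be the mean squared prediction error over the validation half, which is the usual definition but is an assumption about what was computed for this handout.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares coefficients for y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def mse(X, y, beta):
    """Error mean square on the data used for fitting: SSE / (n - p)."""
    resid = y - X @ beta
    return float(resid @ resid) / (len(y) - X.shape[1])

def mspr(X_new, y_new, beta):
    """Mean squared prediction error on data not used for fitting."""
    err = y_new - X_new @ beta
    return float(err @ err) / len(y_new)

def odd_even_split(sort_key):
    """Sort by the key, then send the 1st, 3rd, ... cases to one half and the rest to the other."""
    order = np.argsort(sort_key, kind="stable")
    return order[0::2], order[1::2]

# Synthetic stand-in data loosely shaped like the two-variable model.
rng = np.random.default_rng(2)
n = 26
latit = rng.uniform(51, 58, size=n)
age = rng.uniform(4, 58, size=n)
lam = -12.0 + 0.3 * latit - 0.06 * age + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), latit, age])

odds, evens = odd_even_split(latit)
results = {}
for label, train, valid in [("fit odds, predict evens", odds, evens),
                            ("fit evens, predict odds", evens, odds)]:
    beta = fit_ols(X[train], lam[train])
    results[label] = (mse(X[train], lam[train], beta), mspr(X[valid], lam[valid], beta))

for label, (train_mse, valid_mspr) in results.items():
    print(f"{label}: MSE = {train_mse:.3f}, MSPR = {valid_mspr:.3f}")
```

Sorting before alternating is what keeps the two halves balanced on latitude; a purely random split of so few observations could easily put most of one latitude group in the same half.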
                                     fitting odds;         fitting evens;
                     full data set   predicting evens      predicting odds
model                MSE             MSE       MSPR        MSE       MSPR
latit, age           1.303           1.672     1.1256      1.200     1.4777
age, altit, temp     1.312           1.101     2.4732      1.379     1.5510
latit, age, rain     1.340           1.755     1.5708      1.260     1.5559
The preceding results show that the two-variable model with latit and age was the best
for prediction, with lower MSPRs than the other models. The three-variable model with age,
altit and temp was not very stable: when fit to the odd observations it fit them very well
(smallest MSE of all) but was very bad at predicting the even observations (far larger MSPR than
any other).
model               coefficient   full data set   odds                 evens
latit, age          intercept     -12.063         -13.194  (+9%)       -11.588  (−4%)
                    latit           0.3170          0.3453 (+9%)         0.2989 (−6%)
                    age            -0.0629         -0.0821 (+30%)       -0.0421 (−33%)
age, altit, temp    intercept      14.462          16.560  (+15%)       12.993  (−10%)
                    age            -0.0518         -0.0716 (+38%)       -0.0531 (+3%)
                    altit          -0.0107         -0.0200 (+87%)       -0.0045 (−58%)
                    temp           -0.9862         -1.1033 (+12%)       -0.8577 (−13%)
latit, age, rain    intercept     -11.281         -13.851  (+23%)       -9.209  (−18%)
                    latit           0.3181          0.4106 (+29%)        0.2757 (−13%)
                    age            -0.0594         -0.0838 (+41%)       -0.0321 (−46%)
                    rain           -0.0013         -0.0041 (+317%)      -0.0019 (+44%)
The results in the table above show that the parameter estimates for the two-variable
model were much more consistent across the full- and two half-data sets than were those of the
three-variable models. Even in the two-variable model the coefficient for age varied
considerably, presumably because the one very young plantation was not in the “evens” half of
the data, but even this coefficient was more consistent in the two-variable model than when rain
was added to the model.
Validation Conclusions
The simplest model clearly is most robust for predicting new data, and in obtaining similar
parameter estimates from different data sets.
Conclusion
The latitude and age of a plantation seem quite clearly related to the population growth
rate of the moths, with more rapid population growth farther north and in younger plantations.
The simple model using these two variables appears reliable for predicting moth population
growth. Given the clear superiority of this model for fitting and predicting these data, compared to
any of the other models considered, tentative hypotheses about causation seem reasonable: while
latitude cannot directly affect population dynamics, plantation age could well do so, and latitude
obviously could be correlated with climate or other important causal factors which were not in the
data set (or not well represented by the available variables).
The other four possible explanatory variables — altitude, the fraction of Scots pine,
rainfall, and temperature — all appear in models which fit the data reasonably well, so it should
not be concluded that they are unimportant.