Data Set 4: Environmental Effects on Moth Population Growth Rates

Statistical Setting

This is an example of an analysis with several possible explanatory variables, none of which is of any special interest a priori. The purpose of the study is to determine which of these variables seems most related to the response variable, and how, in order to suggest possible causal hypotheses as well as to allow prediction. The analysis therefore aims to develop a model which explains the data well (that is, to select apparently important variables) rather than to make formal inferences. This model building will be done in two ways: using all the data (with internal validation), and splitting the data into “training” and “validation” sets. Given the small sample size the latter method really is not appropriate, but it is shown to illustrate how it would be done.

Background

The data are from a paper by N. Broekhuizen, H. F. Evans and M. P. Hassell on “Site characteristics and the population dynamics of the pine looper moth” (Journal of Animal Ecology, 62: 511-518, 1993). The pine looper moth (Bupalus piniaria, an inchworm also known as the Bordered White) is a major pest of pine forests in Europe. Like many forest pests, its populations go through fairly regular cycles, including outbreaks during which they can defoliate large areas of forest. (For a more recent analysis of the moth’s population dynamics, see Kendall et al., 2005, Ecological Monographs 75(2): 259-276.) Broekhuizen et al. report only simple regressions rather than the multiple regressions shown in this handout. The results of the latter are somewhat different from, but not seriously inconsistent with, the results of the published analysis.

Data

The raw data used in the paper consisted of annual densities of pupae of the moth in different forest plantations, over periods of at least 22 years in each forest.
Each of these time series was fit to a population-dynamics model which had been found previously to describe adequately the dynamics in two intensively studied populations. The estimated parameters of this model then were used as response variables in multiple regression analyses exploring their relationships with various environmental variables. (Note that this approach ignores the variability of the time-series data around the fitted model and the uncertainty in the estimation of the parameters; that is, the model aims to explain the estimated parameter value, not the true value.) The dynamical parameter considered in this handout is λ (“lambda”), the net per capita reproductive rate of the population.

The analysis aimed to explain the estimated parameters by six environmental factors measured for each forest: latitude (degrees north: “latit”); mean age of the plantation when the censuses began in 1953 (“age”); the percentage of the area covered by the moth’s main host plant, Scots pine (“pct_scot”); the height of the plantation above sea level (in m: “altit”); the mean monthly rainfall (in mm: “rain”); and the mean annual temperature (degrees C: “temp”). The analysis used data from n = 27 plantations, but the age of one plantation was not known, so for all models including the age variable the sample size is n = 26.

Data Exploration

Distributions

Although regression analysis makes no assumptions about the distributions of any of the variables (only about the distribution of the random deviations of the data from the true model), examining the distributions can point to unusual observations which might be influential.

[Figure: histograms (frequency) of lambda, latit, age, pct_scot, altit, rain and temp.]

There are high outliers in age and rain.
The distribution of latitudes is bimodal (perhaps the plantations occur in two particular regions?) and that of pct_scot is quite skewed, most plantations being mostly Scots pine but a few having very little of this species. Any or all of these features could lead to observations being influential, which will need to be assessed later.

Data Set 4: Environmental Effects on Moth Population Growth Rates (rev. February 20, 2007)

Collinearity (Confounding)

Correlation matrix of independent variables:

            LATIT      AGE   PCT_SCOT    ALTIT     RAIN
AGE         0.311
PCT_SCOT    0.601    0.379
ALTIT       0.096    0.397      0.112
RAIN        0.062    0.261     -0.126    0.443
TEMP       -0.832   -0.429     -0.535   -0.603   -0.211

There are fairly strong correlations among various pairs of the explanatory variables; in particular, temp is strongly negatively correlated with all the other variables except rain (and even that correlation is moderately strong and negative). Other strong correlations are between latit and pct_scot and between altit and rain. A scatterplot matrix (below) illustrates these relationships, while also showing some unusual observations and odd patterns.

[Figure: scatterplot matrix of latit, age, pct_scot, altit, rain and temp.]

Most noteworthy is the one plantation with very high rainfall despite being at low latitude and altitude. The latitudes fall into two fairly distinct groups. In addition to the two high outliers for age seen in the histogram, the youngest plantation (age = 4) also stands out in these plots, not following the relationships the other observations show between age and some of the other variables, especially altit and rain. There also are peculiar patterns in the plots of pct_scot vs. most of the other variables, and of altit vs. several of the others.

Need for Model Reduction

If the aim of the analysis were simply to be able to describe the moth population growth rates observed in these plantations, the collinearity would not be a problem. It clearly does, though, complicate the attempt to determine the importance of the various variables, i.e. to distinguish their possible “effects.” It also is likely to weaken the predictive effectiveness of any model containing all or most of the variables. For both purposes of the study, therefore, it is appropriate to try to determine one or more subsets of the variables which are less collinear and also explain the response variable accurately and parsimoniously. In interpreting the results of this model selection, though, it is important to realize that the collinearity also means that the selection of which variables to include in the final model should not be taken as definitive evidence of which variables (if any) actually have important effects.

Relationships with lambda

The response variable λ is somewhat correlated with all the explanatory variables, but most strongly with latitude (positively) and plantation age (negatively).

Correlations with lambda:

latit        0.400
age         -0.310
pct_scot     0.106
altit       -0.259
rain        -0.174
temp        -0.196

Scatterplots of these relationships (below) show several features of possible concern. Several appear nonlinear, with the highest λs tending to occur at intermediate levels of the variable; this is most apparent for temp. The relationships of λ to age and rain appear to be strongly affected by a few observations with extreme values of these variables.

[Figure: scatterplots of lambda against latit, age, pct_scot, altit, rain and temp.]

Because of the collinearity in these data, the “partial” relationships with the response variable in the multiple regression model could be quite different from these bivariate relationships. I therefore chose not to do anything about these undesirable features at this point, but simply to be aware of them while assessing the adequacy of the full model.

Analysis

Choice of the Full Model

The number of variables to be considered was greatly constrained by the sample size: with only 27 observations, the six basic variables alone give about 4.5 observations per variable, short of the guideline of 6 to 10 observations per independent variable. It therefore would not be reasonable to include interaction or polynomial terms in the initial analysis. It nonetheless is important to consider the form in which the six variables are included, primarily to produce a reasonably linear relationship, and possibly also to avoid excessive leverage points; the histograms and bivariate scatterplots above suggest there could be both nonlinearity and leverage problems.

To assess these issues I examined partial-regression plots for a model including linear terms for each of the six variables. These plots, shown below, did not indicate any nonlinearity but did indicate several high-leverage points, particularly in age and rain:

[Figure: added-variable (partial-regression) plots of lambda | others against the intercept, latit, age, pct_scot, altit, rain and temp, each adjusted for the others.]

Examination of the histograms of the independent variables suggested that these leverage points were due to the skewed distributions of some of the variables. I therefore square-root transformed age and altit, log transformed rain, and arcsine-square root transformed pct_scot. Partial-regression plots from a regression using these transformed variables were then examined. These plots indicated, however, that the transformations had increased rather than reduced the leverage problem, so the untransformed variables were used in the subsequent analysis.

Selection of Reduced Models

The following table (from SAS) lists the four best-fitting models for each size (number of terms), and gives all the common model-selection criteria based on model SS as a measure of fit; the fifth best three-variable model also is shown because its fit is very close to that of the four best models of that size. These model-selection criteria are plotted against the number of variables in the model in the figure that follows the discussion of the table.
Vars   R2       R2adj     C(p)     AIC       SBC        Variables in model
1      0.1618   0.1269    4.6541   15.1600   17.67616   lat
1      0.0960   0.0583    6.7494   17.1274   19.64363   age
1      0.0661   0.0272    7.6991   17.9725   20.48865   alt
1      0.0374  -0.0027    8.6108   18.7587   21.27485   tmp
2      0.3710   0.3163    0.0033    9.6968   13.47109   lat age
2      0.2703   0.2069    3.2039   13.5557   17.33003
2      0.2587   0.1943    3.5732   13.9663   17.74055
2      0.2294   0.1624    4.5050   14.9742   18.74849
3      0.3941   0.3115    1.2676   10.7225   15.75489   age alt tmp
3      0.3932   0.3105    1.2963   10.7611   15.79353   lat age alt
3      0.3850   0.3011    1.5587   11.1124   16.14481   lat age (%sct or tmp)
3      0.3814   0.2970    1.6728   11.2636   16.29595   lat age rn
3      0.3734   0.2880    1.9262   11.5963   16.62871   lat age (%sct or tmp)
4      0.3975   0.2827    3.1601   12.5770   18.86750
4      0.3971   0.2822    3.1731   12.5947   18.88517
4      0.3959   0.2809    3.2093   12.6437   18.93421
4      0.3957   0.2806    3.2167   12.6537   18.94418
5      0.4009   0.2511    5.0525   14.4306   21.97919
5      0.4006   0.2508    5.0612   14.4425   21.99105
5      0.3980   0.2475    5.1442   14.5554   22.10398
5      0.3971   0.2464    5.1720   14.5931   22.14172
6      0.4025   0.2139    7.0000   16.3589   25.16553   lat age %sct alt rn tmp

[Variable lists are shown only for rows where they can be recovered from the source; the third and fifth best three-variable models add %sct and tmp to lat and age, in an order that cannot be recovered.]

All the criteria are best for the two-variable model with latit and age: Cp, AIC and SBC all are much smaller for this model than any of the others, and R2adj is slightly larger than for the next best models. In addition this two-variable model is the first (i.e. the one with fewest variables) to have Cp near or below p. This agreement among the criteria is somewhat unusual, and gives considerable confidence in the selection.

The five three-variable models are next best by all criteria; they are very similar to each other. Four of these models include the best two-variable model, with latit and age; the other (which happens to fit slightly better than the others) has altit in place of latit, and the second-best of these models combines both latit and altit with age.
The sixth best three-variable model does not contain either of these variables and fits considerably less well than does the fifth best. The similarity of these best three-variable models to the best two-variable model lends further support to the importance of the variables in that model.

[Figure: adjusted R2, Cp, AIC and SBC plotted against the number of terms in the model (p - 1).]

PRESS and R2prediction statistics then were obtained for these six models, as shown in the following table:

Vars   Variables in model        PRESS     R2prediction
2      lat age                   36.4115   0.2358
3      age alt temp              38.2080   0.1981
3      lat age alt               39.4795   0.1715
3      lat age (%scot or temp)   39.2322   0.1767
3      lat age rain              37.2626   0.2180
3      lat age (%scot or temp)   38.7459   0.1869

By these additional criteria the two-variable model again is better than any of the others. Interestingly, the three-variable model that was only fourth best in its fit to the data was the best of that size as measured by PRESS and R2prediction; this suggests that these three-variable models may be slightly overfitting the data.

Based on all these criteria I selected three models for diagnostic assessment and possible refinement: the model with latit and age, the model adding rain to those two variables (fourth best three-variable model by R2), and the somewhat different model with age, altit, and temp (best three-variable model by R2). Assuming all were considered acceptable, the two-variable model would be adopted as the best model, by consensus of all criteria.

One note of caution comes from the very low Cp values of the better models: all are considerably below p, suggesting either that they involve substantial bias or that the MSE for the full model overestimates the true error variance.
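The selection criteria tabulated above can be recomputed from the error sum of squares alone. The following Python sketch is an illustration, not part of the original analysis. It assumes the usual textbook definitions (adjusted R2, Mallows' Cp, AIC = n ln(SSE/n) + 2p, SBC = n ln(SSE/n) + p ln(n), and R2prediction = 1 - PRESS/SSTO), and it takes n = 26, SSTO = 47.650, the full-model R2 = 0.4025, and SSE = 29.973 and PRESS = 36.4115 for the latit, age model from the tables in this handout. The function and variable names are hypothetical. Because the inputs are rounded as printed, the results should agree with the tables only to about two or three decimal places.

```python
import math

# Values taken from this handout (rounded as printed there).
N = 26              # plantations with known age
SSTO = 47.650       # total sum of squares of lambda
R2_FULL = 0.4025    # R2 of the six-variable model
P_FULL = 7          # six slopes plus the intercept

# MSE of the full model, which Cp uses as its estimate of the error variance.
MSE_FULL = SSTO * (1 - R2_FULL) / (N - P_FULL)


def selection_criteria(sse, p, press=None):
    """Selection criteria for a model with p parameters (intercept included),
    assuming the standard textbook definitions."""
    r2 = 1 - sse / SSTO
    out = {
        "R2": r2,
        "R2adj": 1 - (N - 1) / (N - p) * (1 - r2),
        "Cp": sse / MSE_FULL - (N - 2 * p),
        "AIC": N * math.log(sse / N) + 2 * p,
        "SBC": N * math.log(sse / N) + p * math.log(N),
    }
    if press is not None:
        out["R2pred"] = 1 - press / SSTO
    return out


# Two-variable model (latit, age): p = 3, SSE and PRESS as tabulated.
two_var = selection_criteria(sse=29.973, p=3, press=36.4115)
for name, value in two_var.items():
    print(f"{name:7s} {value:8.4f}")
```

Under these definitions Cp is below p exactly when the reduced model's own MSE is smaller than the full model's MSE. Here the two-variable model has MSE = 1.303 while the full model's MSE is about 1.50, which is the pattern the note of caution refers to.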
The possibility of bias suggests considering larger models; in this case any of the better five-variable models have Cp closer to p than do the smaller models. I suspect, however, that the cause for these low values of Cp is instead that the full model is not adequate (see section on model adequacy below), and so chose not to consider any of these five-variable models.

Estimation and Inference for Selected Models

To understand these models we of course need to obtain the parameter estimates, which are not given in the output of the model-reduction procedures. It also is informative, in a qualitative way, to examine the tests of significance of the model and of the individual parameters, though when analyzing models chosen on the basis of the data (rather than with a structure known or assumed a priori) formal inference by these tests is not valid.

Model with only latitude and age:

SOURCE        DF   SS       MS      F      p
Regression     2   17.677   8.839   6.78   0.005
Error         23   29.973   1.303
Total         25   47.650

s = 1.142   R-sq = 37.1%

Predictor   Coef       St.dev    t       p
Constant   -12.063     5.323    -2.27    0.033
latit        0.31702   0.09997   3.17    0.004
age         -0.06289   0.02274  -2.77    0.011

Each term in this model clearly is related to the population growth rate, λ, and the model explains a very significant proportion of the variability in λ. Latitude is positively related to λ: the population rate of increase is higher farther north. Age is negatively related: λ decreases in older plantations. (Note that both relationships are “adjusted for” the other.)
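The entries of the output above are tied together by simple arithmetic, which can be checked directly. The Python sketch below is an illustration only: it recomputes MS, F, R-sq, s and the t statistics from the printed SS, DF, Coef and St.dev values, and then evaluates the fitted equation for a hypothetical plantation (latitude 55 degrees north, age 25 years; these inputs are made up for illustration and are not an observation from the data set). The names used are hypothetical.

```python
import math

# ANOVA entries for the latit, age model, as printed above.
ss_reg, df_reg = 17.677, 2
ss_err, df_err = 29.973, 23

ms_reg = ss_reg / df_reg            # mean square for regression
ms_err = ss_err / df_err            # mean square error (MSE)
f_stat = ms_reg / ms_err            # overall F statistic
r_sq = ss_reg / (ss_reg + ss_err)   # proportion of variation explained
s = math.sqrt(ms_err)               # residual standard deviation

# (coefficient, standard error) pairs, as printed above.
coefs = {
    "Constant": (-12.063, 5.323),
    "latit": (0.31702, 0.09997),
    "age": (-0.06289, 0.02274),
}
# Each t statistic is the coefficient divided by its standard error.
t_stats = {name: b / se for name, (b, se) in coefs.items()}


def predict_lambda(latit, age):
    """Fitted lambda from the two-variable model."""
    return (coefs["Constant"][0]
            + coefs["latit"][0] * latit
            + coefs["age"][0] * age)


# Hypothetical plantation, for illustration only.
example = predict_lambda(latit=55.0, age=25.0)

print(f"F = {f_stat:.2f}, R-sq = {100 * r_sq:.1f}%, s = {s:.3f}")
for name, t in t_stats.items():
    print(f"t({name}) = {t:.2f}")
print(f"fitted lambda at latit 55, age 25: {example:.2f}")
```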
Model with age, altitude and temp:

SOURCE        DF   SS       MS      F      p
Regression     3   18.780   6.260   4.77   0.010
Error         22   28.870   1.312
Total         25   47.650

s = 1.146   R-sq = 39.4%

Predictor   Coef        St.dev     t       p
Constant    14.462      3.143      4.60    0.000
age         -0.05179    0.02443   -2.12    0.046
altit       -0.010714   0.004341  -2.47    0.022
temp        -0.9862     0.3111    -3.17    0.004

In this model λ again is negatively related to plantation age; the coefficient is about 20% smaller than in the previous model. Altitude and temperature both also are negatively related to λ: the moth population grows faster at lower altitudes, all else being equal, or at lower temperatures, all else being equal. Of course these two variables are fairly strongly negatively correlated, so in fact there is little variation in one without offsetting variation in the other. As was known would happen from the model-selection statistics above, this model explains only a small amount more of the variation in λ than did the two-variable model, and as a result the F* statistic for the model is smaller and the P-value larger.

Model with latitude, age, and rain:

Source            DF   SS       MS      F      P
Regression         3   18.172   6.057   4.52   0.013
Residual Error    22   29.477   1.340
Total             25   47.650

S = 1.15753   R-Sq = 38.1%

Predictor   Coef        SE Coef    T       P
Constant   -11.281      5.549     -2.03    0.054
latit        0.3181     0.1014     3.14    0.005
age         -0.05936    0.02378   -2.50    0.021
rain        -0.001288   0.002119  -0.61    0.549

In this case, the model remains quite significant, and the relationships of latit and age with λ are largely unchanged from the two-variable model above, though addition of the rain term somewhat reduces the slope for age. The rain term, however, clearly is not significant (especially considering that this model, i.e. addition of this term to the two-variable model, was selected as one of the best three-variable models).
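The weak contribution of rain also can be seen by comparing the error sums of squares of the two nested models. The sketch below (an illustration, not part of the original analysis) computes the extra-sum-of-squares F statistic for adding rain to the latit, age model, using only the SSE and DF values printed above. For a single added term this F statistic equals the square of that term's t statistic, so the two routes should agree up to the rounding of the printed values. The variable names are hypothetical.

```python
# Error sums of squares and degrees of freedom, as printed above.
sse_reduced, df_reduced = 29.973, 23   # model with latit, age
sse_full, df_full = 29.477, 22         # model with latit, age, rain

# Extra sum of squares explained by adding rain to the two-variable model.
extra_ss = sse_reduced - sse_full

# Partial F statistic: extra SS per added term, relative to the larger model's MSE.
f_partial = (extra_ss / (df_reduced - df_full)) / (sse_full / df_full)

# t statistic for rain from the printed coefficient and its standard error.
t_rain = -0.001288 / 0.002119

print(f"extra SS for rain = {extra_ss:.3f}")
print(f"partial F = {f_partial:.3f}, t squared = {t_rain ** 2:.3f}")
```

A partial F well below 1 means that adding rain reduces the SSE by less than one would expect from a pure noise variable, which is consistent with the large P-value printed for that term.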
Adequacy of Selected Models

The three models selected in the preceding section were subjected to the full battery of diagnostic approaches. Only the interesting results will be described here, for all three models together.

Unusual Observations

[Figure: studentized-deleted residuals vs. leverage for the three models (TRES1*HI1, TRES2*HI2, TRES3*HI3).]

Three observations (circled in blue in the three plots) had high leverage in all of the models because they were at the extremes of the age distribution: two were from unusually old plantations (58 and 57 years) and the third from an unusually young plantation (4 years); the remaining observations were between 15 and 35 years, and mostly in the 20’s. These points are obvious in the partial-regression plot for age above. One additional observation (circled in red in the bottom plot), with very much more rain than any other, had very high leverage in the third model, which included rain.

One observation (circled in red in all three plots) had a very large negative residual. The studentized-deleted residuals for this observation were in the extreme 1% to 2% of the pertinent t distributions for each of the three models. This observation had λ = 1.54 (only one other λ was less than 2.0), and was from a high latitude, at which the predicted λ was high.

Based on Cook’s D and dffits, the high-leverage points (even the one with very high leverage in the third model due to its very high level of rain) did not appear to be influential, but the point with the large residual was. Its values for dffits were 0.95 or higher, and much larger than for any other observation, in each of the models. Similarly, its values for the dfbetas were mostly large (0.5 to 0.7) and again larger than those for other observations.
Its Cook’s D, however, was fairly small: 0.18 to 0.24 for the three models (less than the 10% level in the appropriate F distributions).

[Figure: Cook's D vs. residuals (COOK1*RESI1, COOK2*RESI2, COOK3*RESI3) and dffits vs. residuals (DFIT1*RESI1, DFIT2*RESI2, DFIT3*RESI3) for the three models.]

I therefore decided to repeat the variable-selection process without this one outlying observation. The results are shown in the following table.

Vars   R2     R2adj   Cp     S         Variables in model
1      25.8   22.6     6.3   1.1848    lat
1       9.9    6.0    12.2   1.3052    age
1       9.8    5.9    12.2   1.3062    temp
1       3.6    0.0    14.5   1.3508    alt
2      50.7   46.2    -0.8   0.98725   lat age
2      35.4   29.5     4.8   1.1305
2      30.9   24.6     6.5   1.1688
2      30.7   24.4     6.5   1.1705
3      51.0   44.0     1.1   1.0075    lat age rain
3      50.8   43.7     1.1   1.0100
3      50.8   43.7     1.1   1.0101
3      50.7   43.7     1.2   1.0103
4      51.1   41.3     3.0   1.0314
4      51.0   41.2     3.0   1.0323
4      51.0   41.2     3.1   1.0324
4      50.8   40.9     3.1   1.0348
5      51.1   38.2     5.0   1.0582
5      51.1   38.2     5.0   1.0582
5      51.0   38.2     5.0   1.0588
5      50.8   37.8     5.1   1.0616
6      51.1   34.9     7.0   1.0867    lat age %sct alt rain temp

[Variable lists are shown only for rows where they can be recovered from the source.]

These results are similar to those for the entire data set in selecting the two-variable model with latit and age first (and clearly best by all criteria), and several three-variable models as also worth consideration. These results differ from those for the entire data set in that the best models with three or more variables now all include latitude and age, and the best three-variable model from before (with age, altitude and temperature, but without latitude) now is not particularly good.
The three-variable model with latit, age and rain, which gave only the fourth best fit (of three-variable models) to the full data set but performed well at prediction (PRESS and R2pred), now is the best fitting three-variable model.

Considering these analyses of influence, and combining the variable selection results both with and without this one observation, I concluded that the chosen model should include latit and age, and that the apparent effectiveness of models lacking those two variables probably was due to them fitting the one unusual observation.

Residual Plots

Plots of residuals against fitted values (below) or the independent variables in the models were not perfect but were basically acceptable. There were no indications of nonlinearity. The variability appeared to increase at larger values of the estimated Y, but this was due at least in part to the one outlying observation considered already; I did not feel this problem was definite or extreme enough to warrant remedial action (e.g. weighted least squares).

[Figure: residuals vs. fits for the three models (RESI1*FITS1, RESI2*FITS2, RESI3*FITS3).]

The plot of the residuals from the latit, age model against the possible third independent variable rain showed no trend, which supports not adding this variable.

[Figure: residuals from the two-variable model (RESI1) vs. rain.]

Conclusion from Diagnostics

While there are a few unusual observations that cause some concern about excessive influence, it does not appear that the conclusions of the analysis depend only on one or a few points. Otherwise there was nothing apparent in these diagnostic assessments which would call into question the validity of these analyses.
One serious concern, however, was raised by the results for Cp in the “Selection of Reduced Models”: the very small (even negative) values of Cp obtained for some of the models suggest that the MSE for the full model was an overestimate of σ2, and thus that the full model was not properly formulated. Since there was no indication of nonlinearity in the residual diagnostics, it may be that there are important interactions which were omitted from consideration. I do not feel it would be useful to explore this possibility with this very small data set, but it should be considered in future studies if data are available from more plantations.

Validation

The final step in choosing among these various models is some form of validation. Obviously external validation (testing these models against new data) ultimately is needed. The current sample is too small for a data-splitting approach to validation to be very robust; I nonetheless will illustrate this approach, but feel the internal cross-validation is more reasonable for such a small data set.

Internal Cross-validation

Internal validation consists of comparing each model’s SSE and PRESS. The PRESS values were given above, as part of the model selection, and are repeated here along with the SSE values:

Vars   Variables in model        PRESS     SSE
2      lat age                   36.4115   29.973
3      age alt temp              38.2080   28.870
3      lat age alt               39.4795   28.913
3      lat age (%scot or temp)   39.2322   29.306
3      lat age rain              37.2626   29.477
3      lat age (%scot or temp)   38.7459   29.857

These comparisons suggest that the two-variable model is best: it actually gives better predictive ability for the observations in the data set than do any of the three-variable models, despite them fitting the full data set more closely.
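One simple way to summarize the comparison in the table above is the ratio PRESS/SSE for each model: the closer the ratio is to 1, the less the fit to the data flatters the model's predictive ability. The sketch below is illustrative only; the ratio is a convenience chosen here rather than a statistic reported in the original analysis, the SSE and PRESS values are copied from the table, and the labels are hypothetical (two three-variable models are labeled only by their rank in fit because their variable lists are not certain).

```python
# (label, SSE, PRESS) for the six models, copied from the table above.
models = [
    ("2 vars: latit, age", 29.973, 36.4115),
    ("3 vars: age, altit, temp", 28.870, 38.2080),
    ("3 vars: latit, age, altit", 28.913, 39.4795),
    ("3 vars: third best by R2", 29.306, 39.2322),
    ("3 vars: latit, age, rain", 29.477, 37.2626),
    ("3 vars: fifth best by R2", 29.857, 38.7459),
]

# PRESS is never smaller than SSE, so each ratio is at least 1.
ratios = {label: press / sse for label, sse, press in models}

# The model whose PRESS is closest to its SSE.
most_stable = min(ratios, key=ratios.get)

for label, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{label:28s} PRESS/SSE = {ratio:.3f}")
print("smallest ratio:", most_stable)
```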
To put this another way, the SSE and PRESS of the two-variable model are more similar, indicating that its MSE is a more valid indicator of its predictive ability than are those for the other models; this suggests that the MSEs of the latter, which already were larger than for the smaller model, actually overstate their models’ effectiveness.

Data-splitting Validation

Because latitude appears to be an important variable, and is bimodal in distribution, I split the data to give balanced distributions of latitudes in the two parts: I sorted the data by latitude and then separated odd- and even-numbered observations. I then used each half of the data as both a training set and a validation set: I fit each model to each half of the data, and used these fits to predict the values of λ for the other half of the data. This approach also allows comparison of the estimated models for the two halves (and of course the full data set).

                     full data set   fitting odds;        fitting evens;
                                     predicting evens     predicting odds
model                MSE             MSE      MSPR        MSE      MSPR
latit, age           1.303           1.672    1.1256      1.200    1.4777
age, altit, temp     1.312           1.101    2.4732      1.379    1.5510
latit, age, rain     1.340           1.755    1.5708      1.260    1.5559

The preceding results show that the two-variable model with latit and age was the best for prediction, with lower MSPRs than the other models. The three-variable model with age, altit and temp was not very stable: when fit to the odd observations it fit them very well (smallest MSE of all) but was very bad at predicting the even observations (far larger MSPR than any other).
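The data-splitting procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the handout: the least-squares fit uses the normal equations with a small hand-written elimination routine, all function names are hypothetical, and the eight observations are made-up toy values (NOT the moth data) whose two columns merely play the roles of latit and age.

```python
def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + [float(v) for v in r] for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]


def mspr(beta, X, y):
    """Mean squared prediction error for observations not used in the fit."""
    preds = [beta[0] + sum(b * x for b, x in zip(beta[1:], r)) for r in X]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


def split_odds_evens(X, y, key_col=0):
    """Sort by one predictor, then send alternate observations to each half."""
    order = sorted(range(len(y)), key=lambda i: X[i][key_col])

    def pick(idx):
        return [X[i] for i in idx], [y[i] for i in idx]

    return pick(order[0::2]), pick(order[1::2])  # odds (1st, 3rd, ...), evens


# Made-up toy data for illustration only: columns stand in for latit and age.
X = [[51.0, 20], [51.5, 30], [52.0, 25], [52.5, 15],
     [55.0, 40], [56.0, 22], [56.5, 35], [57.0, 18]]
y = [3.0, 2.6, 3.1, 3.9, 3.2, 4.6, 3.9, 5.1]

(X_odd, y_odd), (X_even, y_even) = split_odds_evens(X, y)
beta_odd = fit_ols(X_odd, y_odd)      # fit on odds
beta_even = fit_ols(X_even, y_even)   # fit on evens
print("fit odds, predict evens: MSPR =", mspr(beta_odd, X_even, y_even))
print("fit evens, predict odds: MSPR =", mspr(beta_even, X_odd, y_odd))
```

Sorting on the key predictor before alternating is what keeps the two halves balanced with respect to that variable, which is the reason given above for sorting by latitude.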
model              coefficient   full data set   odds               evens
latit, age         intercept     -12.063         -13.194 (+9%)      -11.588 (-4%)
                   latit           0.3170          0.3453 (+9%)       0.2989 (-6%)
                   age            -0.0629         -0.0821 (+30%)     -0.0421 (-33%)
age, altit, temp   intercept      14.462          16.560 (+15%)      12.993 (-10%)
                   age            -0.0518         -0.0716 (+38%)     -0.0531 (+3%)
                   altit          -0.0107         -0.0200 (+87%)     -0.0045 (-58%)
                   temp           -0.9862         -1.1033 (+12%)     -0.8577 (-13%)
latit, age, rain   intercept     -11.281         -13.851 (+23%)      -9.209 (-18%)
                   latit           0.3181          0.4106 (+29%)      0.2757 (-13%)
                   age            -0.0594         -0.0838 (+41%)     -0.0321 (-46%)
                   rain           -0.0013         -0.0041 (+317%)    -0.0019 (+44%)

The results in the table above show that the parameter estimates for the two-variable model were much more consistent across the full- and two half-data sets than were those of the three-variable models. Even in the two-variable model the coefficient for age varied considerably, presumably because the one very young plantation was not in the “evens” half of the data, but even this coefficient was more consistent in the two-variable model than when rain was added to the model.

Validation Conclusions

The simplest model clearly is most robust for predicting new data, and in obtaining similar parameter estimates from different data sets.

Conclusion

The latitude and age of a plantation seem quite clearly related to the population growth rate of the moths, with more rapid population growth farther north and in younger plantations. The simple model using these two variables appears reliable for predicting moth population growth.
Given the clear superiority of this model for fitting and predicting these data, compared to any of the other models considered, tentative hypotheses about causation seem reasonable: while latitude cannot directly affect population dynamics, plantation age could well do so, and latitude obviously could be correlated with climate or other important causal factors which were not in the data set (or not well represented by the available variables). The other four possible explanatory variables (altitude, the fraction of Scots pine, rainfall, and temperature) all appear in models which fit the data reasonably well, so it should not be concluded that they are unimportant.