
FISH 458 Lab 2: Fitting models to data using sums of squares
Aims: to explore fitting models to data using sums of squares, using the log(sum of squares), fitting the
exponential growth model to data, fitting the logistic growth model to data, learning to use the Table Function
in Excel, using Solver, finding sum of squares profiles, parameter confounding (as one parameter increases,
another decreases), uncertainty in parameters.
Files needed for this laboratory: “Part 1 template.xlsx”, “Part 2 template.xlsx”, “Part 3 template.xlsx”
Part 1: The following estimates of wildebeest abundance (thousands of individuals) in the Serengeti National
Park were obtained in various years:
Year      1961   1963   1965   1967   1971   1972
Census     263    357    439    483    693    773
Begin a new sheet in Excel (good practice), or work from “Part 1 template.xlsx”. We are going to fit the
exponential growth model to the data, $N_{t+1} = r N_t$, assuming that $N_{1961} = 263$. Leave some lines for the
parameter values (r and N1961) and the total log sum of squares (lnSSQ), and create a block of cells with
columns for the year, abundance estimate, model prediction, and log sum of squares. Set the 1961 abundance
equal to the parameter N1961. The formula for exponential growth will go in column C.
Plot the observed and predicted data. By convention, use solid circles for the observations and a solid line for
the predictions. Try various values of r to see which value provides the best (visual) fit to the data.
To find the “best fit” of the model to the data, a common approach is to calculate the sum of squared
differences between observed and predicted values, generally known as the sum of squares or SSQ:
$$\mathrm{SSQ} = \sum_t \left(\mathrm{observed}_t - \mathrm{predicted}_t\right)^2,$$

which for this model will be

$$\mathrm{SSQ} = \sum_t \left(N_t - \hat{N}_t\right)^2,$$
where the hat (ˆ) denotes the model prediction of abundance. For this particular problem, since we are dealing
with abundances, we will use the natural logs of the observed and predicted values:
$$\ln \mathrm{SSQ} = \sum_t \left(\ln N_t - \ln \hat{N}_t\right)^2.$$
We do this so that proportional differences in prediction error are weighted equally; in other words,
observing 2 and predicting 4 is penalized the same as observing 200 and predicting 400, because
$\ln 2 - \ln 4 = \ln 200 - \ln 400$.
Calculate the lnSSQ between the observed and predicted values in the D column, but don’t include a formula
for 1961, since we are not actually predicting N1961. In order to have a consistent formula in all years, including
years with no data, we can use an IF statement that calculates lnSSQ only in years with observations; the
formula in D11 would be: =IF(B11>0, (LN(B11)-LN(C11))^2, ""). The structure is IF, THEN, ELSE, and here
the ELSE part returns "", an empty string that leaves the cell looking blank.
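If you want to cross-check the spreadsheet (or get a head start on advanced exercise 3, which uses R), here is a
minimal R sketch of the same calculation. The object names (years, obs_years, obs) are our own choices, not part
of the lab files:

    # Exponential model and lnSSQ for Part 1 (R sketch; names are illustrative)
    years     <- 1961:1972                            # run the model every year
    obs_years <- c(1961, 1963, 1965, 1967, 1971, 1972)
    obs       <- c(263, 357, 439, 483, 693, 773)      # thousands of wildebeest

    r    <- 1.10                                      # a trial value of r
    N    <- numeric(length(years))
    N[1] <- 263                                       # N1961 is fixed in Part 1
    for (i in 2:length(years)) N[i] <- r * N[i - 1]   # N(t+1) = r * N(t)

    # lnSSQ over observed years, excluding 1961 (it is not predicted)
    idx   <- match(obs_years[-1], years)
    lnSSQ <- sum((log(obs[-1]) - log(N[idx]))^2)
    lnSSQ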
Now add up all of the lnSSQ terms for each year and put the result in the lnSSQ cell. Now we want to see how
much support the data provide for different values of r using the Table Function. Fill in cells A25 to A45 with
values of r going from 1.00 to 1.20 in steps of 0.01. These will be the alternative values of r we will explore.
Think of these as different model hypotheses. You could plug each of these values into the r cell (in B4) one at
a time and copy the resulting lnSSQ total to the right of the r value, but the Table Function in Excel does this
automatically and is a skill well worth learning. To use the (one-dimensional) Table Function, we put the value
of the output we want (“=B7”) one cell above and one cell to the right of the column of r values. Select the
entire table (A24:B45) and invoke Data-What If Analysis-Data Table. Leave the Row Input cell blank, and in the
Column Input cell enter the cell containing the r value (cell B4). Press OK.
The Table Function is very finicky, so practice going through these steps several times until you are satisfied
you have mastered the function. We will be using this function again and again in Excel.
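The Data Table step also has a natural R analogue: wrap the lnSSQ calculation in a function and evaluate it over a
grid of r values. This sketch reuses the data objects defined above; lnSSQ_for_r is a hypothetical helper name:

    # Profile of lnSSQ over a grid of r values (the Data Table equivalent)
    lnSSQ_for_r <- function(r) {
      N   <- 263 * r^(years - 1961)          # closed form of N(t+1) = r * N(t)
      idx <- match(obs_years[-1], years)
      sum((log(obs[-1]) - log(N[idx]))^2)
    }
    r_grid  <- seq(1.00, 1.20, by = 0.01)    # same grid as cells A25:A45
    profile <- sapply(r_grid, lnSSQ_for_r)
    plot(r_grid, profile, type = "b", xlab = "r", ylab = "lnSSQ")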
Now draw the graph of lnSSQ vs. r. What does this tell you about the relative support that the data give for
different values of r? Later in the course we will use likelihoods to make statistically rigorous pronouncements
about the validity of each of the different hypotheses. Just like sum of squares, likelihoods allow us to fit
models to data by minimizing a function of the data and the model predictions.
Part 2: The wildebeest census data up to the mid-1980s look like this:
Year        1961  1963  1965  1967  1971  1972  1977  1978  1980  1982  1984  1986
Abundance    263   357   439   483   693   773  1444  1248  1337  1208  1337  1146
These data show clear evidence that exponential growth has slowed and that the population has reached
equilibrium. Fit the logistic growth model to the wildebeest data using these equations:
$$\hat{N}_{1961} = N_0$$

$$\hat{N}_{t+1} = \hat{N}_t + r\hat{N}_t\left(1 - \frac{\hat{N}_t}{K}\right)$$

$$\ln \mathrm{SSQ} = \sum_t \left(\ln N_t - \ln \hat{N}_t\right)^2$$
Set up your sheet in a similar way to part 1 (or use “Part 2 template.xlsx”). This time, we will estimate three
parameters: N0, r, and K. We will treat the 1961 population size as a parameter to be estimated, unlike in part
1 where we assumed it was exactly 263. We are going to use the Excel feature Solver to find the values of the
parameters that will minimize the lnSSQ.
Some key hints for success with Solver and other non-linear function minimizers:
a. First get a good fit by eye by trial and error using combinations of parameter values.
b. Set automatic scaling ON in the Solver options.
c. Although you can constrain parameter values in Solver, convergence is often better when you
constrain parameters by yourself rather than relying on the built-in Solver features (more on this
later).
d. Try solving for one parameter at a time rather than all simultaneously.
e. Do not be satisfied with any single answer; try a variety of starting points to see whether you can find a
better fit to the data (smaller lnSSQ).
Now use Solver to estimate the parameters N0, r, and K.
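As a cross-check on Solver (and a preview of advanced exercise 3), here is a hedged R sketch of the Part 2 fit
using optim. The starting values are rough eye-fit guesses (hint a), and the penalty line is one way of constraining
parameters yourself rather than relying on the minimizer (hint c):

    # Logistic model fit by minimizing lnSSQ with optim (R sketch)
    yrs <- 1961:1986
    oy  <- c(1961, 1963, 1965, 1967, 1971, 1972,
             1977, 1978, 1980, 1982, 1984, 1986)
    ob  <- c(263, 357, 439, 483, 693, 773,
             1444, 1248, 1337, 1208, 1337, 1146)

    logistic_lnSSQ <- function(p) {                    # p = c(N0, r, K)
      N    <- numeric(length(yrs))
      N[1] <- p[1]                                     # N1961 = N0 is estimated
      for (i in 2:length(yrs))
        N[i] <- N[i - 1] + p[2] * N[i - 1] * (1 - N[i - 1] / p[3])
      if (any(!is.finite(N)) || any(N <= 0)) return(1e10)  # crude constraint
      idx <- match(oy, yrs)       # 1961 now contributes, since N0 is estimated
      sum((log(ob) - log(N[idx]))^2)
    }

    fit <- optim(c(250, 0.15, 1300), logistic_lnSSQ)   # start near an eye fit
    fit$par    # estimates of N0, r, K
    fit$value  # minimized lnSSQ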
Next calculate the sum of squares profile on r. This is similar to the profile you created in Part 1, but now there
are no shortcuts available using the Table Function since there are three inputs. For a series of fixed values of r
we will try to find the minimum lnSSQ possible by changing only the values of N0 and K. You will have to do
this manually by looping through values of r, following the pseudo-code below (generalized computer code not
written in any particular programming language); an R sketch of the same loop follows the steps:
Step 1: set the value of r to the target value (r = 0.10).
Step 2: use Solver to find the values of N0 and K that minimize lnSSQ.
Step 3: copy the values of r, N0, K, and lnSSQ to the columns to the right (starting in column E) using Paste
Special → Values (shortcut Alt-E-S-V). Copying values only ensures they do not change when you go to the
next step.
Step 4: go back to Step 1, incrementing r by 0.01.
Step 5: continue until you have a series of values for r (0.10 to 0.25).
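Here is that loop sketched in R, assuming the logistic_lnSSQ function and data objects from the previous sketch:

    # Sum of squares profile on r: fix r, minimize over N0 and K only
    profile_r <- seq(0.10, 0.25, by = 0.01)
    out <- data.frame(r = profile_r, N0 = NA_real_, K = NA_real_, lnSSQ = NA_real_)
    for (j in seq_along(profile_r)) {
      fixed_r <- profile_r[j]
      f       <- function(q) logistic_lnSSQ(c(q[1], fixed_r, q[2]))  # q = c(N0, K)
      fit_j   <- optim(c(250, 1300), f)
      out[j, c("N0", "K", "lnSSQ")] <- c(fit_j$par, fit_j$value)
    }
    out  # one row per r: best N0, K, and lnSSQ, as in the E column

Plotting out$lnSSQ, out$K, and out$N0 against out$r gives the three graphs requested below.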
You can now see how the values of the other parameters are related to the value of r.
In particular, as r becomes larger, the best fits to the data have a smaller K and a smaller initial population size N0.
To visualize this, create the following plots:
1. Graph the lnSSQ profile on r.
2. Graph the relation between r and K from the profile on r.
3. Graph the relation between r and N0 from the profile on r.
If time permits, also do a sum of squares profile on K.
Part 3: Beginning in the late 1970s, Tanzania, where most of the Serengeti ecosystem is found, underwent a
serious economic crisis, and no resources were available for wildlife protection such as anti-poaching patrols.
It is believed that there was a dramatic increase in the illegal harvest of wildebeest at that time, which
continued into the 1980s.
Modify your model to allow for harvesting beginning in 1977 using the following equations:
$$\hat{N}_{1961} = N_0$$

$$\hat{N}_{t+1} = \hat{N}_t + r\hat{N}_t\left(1 - \frac{\hat{N}_t}{K}\right) - C_t$$

$$C_t = \begin{cases} 0 & \text{if } t < 1977 \\ x & \text{if } t \ge 1977 \end{cases}$$

$$\ln \mathrm{SSQ} = \sum_t \left(\ln N_t - \ln \hat{N}_t\right)^2$$
Now you will have four parameters to estimate: N0, r, K, and x (the annual harvest in numbers from 1977
onward). You will need to add a column to your sheet for the annual catch, and modify the prediction
formulas in column C.
Use Solver to find the best estimates of these parameters by minimizing the lnSSQ as before.
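A matching R sketch of the Part 3 model adds the catch term to the logistic function above; x and the starting
values here are illustrative guesses, not estimates from the lab:

    # Logistic model with constant catch x from 1977 onward (R sketch)
    harvest_lnSSQ <- function(p) {                     # p = c(N0, r, K, x)
      N    <- numeric(length(yrs))
      N[1] <- p[1]
      for (i in 2:length(yrs)) {
        Ct   <- if (yrs[i - 1] >= 1977) p[4] else 0    # C(t) = x once t >= 1977
        N[i] <- N[i - 1] + p[2] * N[i - 1] * (1 - N[i - 1] / p[3]) - Ct
      }
      if (any(!is.finite(N)) || any(N <= 0)) return(1e10)  # crude constraint
      idx <- match(oy, yrs)
      sum((log(ob) - log(N[idx]))^2)
    }
    fit3 <- optim(c(250, 0.15, 2000, 40), harvest_lnSSQ)
    fit3$par  # estimates of N0, r, K, x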
Find the sum of squares profiles on r, K, and x, comparing the profiles on r and K with those obtained in Part 2.
Once we admit that poaching has occurred, we know much less about the plausible range of K.
Advanced exercises
1. Modify the exponential model in Part 1 to add harvesting as in Part 3. Is it necessary to invoke the
logistic model to explain the leveling off in the population, or could it be due entirely to increased
poaching?
2. Use the age-structured model for wildebeest that you are building in the homework to fit the time
series of abundance data with two parameters: the calf survival rate and the adult density-dependence
parameter. Add harvesting as above to create a three-parameter model.
3. Implement Part 1 in the programming language R. The function optim can be used for non-linear
function minimization in R.
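For example, the one-parameter Part 1 fit might look like the sketch below, reusing the lnSSQ_for_r helper
from Part 1 (Brent’s method suits one-dimensional problems; the bounds are our assumption):

    # Minimize lnSSQ over r alone (R sketch for advanced exercise 3)
    fit1 <- optim(1.10, lnSSQ_for_r, method = "Brent", lower = 1.00, upper = 1.30)
    fit1$par    # best-fit r
    fit1$value  # minimized lnSSQ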