
Mixed Modelling using Stata
Stefano Occhipinti – Griffith Health Institute
Workshop for GSBRC, July 2012
Stata is a general purpose statistical analysis package that, like SAS, SPSS, and R, can be used for
many different statistical tasks. It is particularly well suited to mixed modelling for reasons that will
become apparent. Stata is amenable to menu driven working but the syntax is so elegant and
straightforward that it is difficult to think of an occasion in which the menus are preferable (we will
not be using menus, in case my friendly hints have not been sufficiently clear). The syntax can be
considered to be like a series of Lego blocks. Each component of syntax is actually very simple but
the simple elements can be combined to form statements of increasing complexity. To begin, we can
double click on a Stata data file in its directory/folder to read the data into Stata and set the relevant
directory as the working directory. Then, the statistical procedures are very simple. For example,
regress math homework
provides a linear regression analysis with math as the criterion and homework as the predictor.
This line of syntax can be run by typing or pasting it into the command window at the bottom of the
Stata screen. No full stop is needed. As well, because we have a morbid fear of typing more
characters than are absolutely necessary, we can shorten the regress statement thus:
reg math homework
(Welcome to geekdom.) If we choose to add more predictors (as invariably we would), we simply list
them after homework, like so, assuming white is a dummy for race,
reg math homework white
You will notice that by default, Stata only provides the unstandardized regression coefficients.
Because we fetishise standardized coefficients (betas) in many fields such as psychology, we can obtain them by
using an option. In Stata, options are obtained by following the main statement with a comma.
reg math homework white, beta
If you must have partial and semipartial correlations and their respective squares, you may use,
pcorr math homework white
If for some reason, we wanted to restrict the analysis to kids from a particular school, we can use the
if operator. This is a slightly confusing point but repetition will soon drive it home. if is not an option for a single procedure; it is used to address the whole dataset and is common to all
procedures. Therefore, it does not join the procedure specific options following the comma.
Assuming the school ID variable in our data is called schid, we would type,
reg math homework white if schid == 1, beta
(Note the double = for the comparison operator [does this equal that?] versus the assignment
operator [make this equal to this other thing!].)
Welcome to Stata! For multilevel modelling, we will use the xtmixed procedure. The form of the
syntax will be exactly the same as above. There will be an extra part that allows us to specify random
effects (as discussed in the presentation).
Multilevel (i.e., mixed) modelling in Stata
There are multiple procedures that handle mixed modelling in Stata and these generally begin with
xt and follow the procedures for the “ordinary” version of the statistical model. For example,
xtmixed is for mixed regression (with normally distributed, continuous outcomes). By contrast,
xtmelogit is for multilevel logistic regression, and so forth. We will be focusing entirely on
xtmixed in this workshop.
The data that we will use for the cross-sectional analyses is the homework dataset provided by
Snijders and Bosker and used by a range of textbook and tutorial materials. It contains data from 519
school students from 23 schools in the United States. The variables include student level
characteristics (e.g., maths achievement; number of hours of homework per week) and school level
ones (e.g., ratio of students to teachers, mean SES of the school). The variables are fairly self-explanatory and will do for the purpose of learning.
The following section is designed to accompany the presentation (see the PPT file). I have set it up as
a tutorial that proceeds in steps. Where possible, I have included explanatory detail to assist you
when revising and practising this material. We will turn to each step as indicated in the presentation.
Steps for mixed modelling in Stata
1. Set datafile up
1.1. Make sure there is an ID variable for every intended level of analysis (e.g., participant,
patient etc; group, school, ward etc)
1.2. For longitudinal analyses, the most likely scenario is that you already have data in a so called
wide setup where each person’s data are in a single row and the different times of
observation/measurement are represented as separate variables with a similar stem and a
numerical suffix to indicate time (e.g., rses1, rses2…, rses5). This is good! If you have done
this, stick to it. Avoid full stops/dots/periods (i.e., .) in variable names. They will be changed
to underscores (i.e., _) in Stata and they are redundant anyway. You will use the reshape
procedure below (see 3).
2. Read data into Stata
2.1. SPSS can read and write Stata files
2.1.1. Save as… select the latest version of Stata offered and “Intercooled” Stata
2.2. User procedure usespss can read SPSS files directly
2.2.1. I.e., type findit usespss to locate the package and then click to install it directly from net-aware Stata
2.2.2. Then, usespss using filename, clear
2.2.3. You need to specify the .sav extension in filename and you will only need the clear
option if you already have data open, but it won’t hurt if you include it redundantly.
2.3. Stata can also read Excel files (as can SPSS and SAS etc). The easiest approach is to save the
data as a single worksheet in text/tab-delimited format and use the insheet procedure.
The procedure is very intuitive and, if you already use Excel data in SPSS, it is very simple to do (see the sketch in 2.7 below). Naturally, if you already have variable labels and so forth, this is not an attractive
option! I assume that most people at this workshop won’t be at this stage.
2.4. Often, the best approach vis-à-vis workflow is to create a new folder for each project you
run and to use the cd command (change directory—old Unix and/or DOS commands are
useful knowledge here) to set Stata to work in the project folder. Naturally, use the full
pathname (e.g., copy it from Windows Explorer) and enclose it in double quotes if it
contains spaces, as it invariably will.
2.4.1. E.g., cd "C:\Documents and Settings\psyocchi\My Documents\Consulting temp files\UQMultilevel2012"
2.4.2. Ordinarily, you will not have a line break in the middle of a command in the Command Window in Stata; any breaks you see in the commands in this handout are just formatting artefacts.
2.4.3. For the sake of mind-numbing completeness, I note that you can automatically land in
the appropriate directory by starting Stata by double clicking on a Stata data file in the
relevant directory.
2.4.3.1. If unsure, use pwd to see where in your directory/folder structure you are working in Stata.
2.5. Then we can read a Stata dataset (i.e., *.dta) with the use command. We don’t need the
extension here.
2.5.1. E.g., use homework
2.6. Extra tip: if we already have data open and we have made changes to it in some way, Stata
will not allow us to open another dataset without saving the original data. This is a failsafe
to help to prevent wailing and gnashing of teeth. We merely need to add the clear option
to the use command, as in the sketch in 2.7 below.
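2.7. Putting the above together, a minimal sketch of the read-in workflow might look like the lines below (the folder path is the one from 2.4.1; homework.txt is a purely illustrative filename for the Excel/tab-delimited route, so substitute your own):
cd "C:\Documents and Settings\psyocchi\My Documents\Consulting temp files\UQMultilevel2012"
use homework, clear
insheet using "homework.txt", tab clear
You would run either the use line or the insheet line, not both, depending on the format your data are in.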
3. Reshape data if necessary [optional]
3.1. If you have repeated measures data in wide format (see 1.2 above) then you will need to
reshape the data into so called long format.
3.2. If you have many variables, more than you need, think carefully about what you want to
have in the reshaped datafile. Remember, Stata Man says, “If in doubt, leave it out!”
3.3. E.g., a data file contains: a participant ID, a dummy variable indicating group/random assignment, and the Rosenberg Self-Esteem Scale and Attitudes to Climate Change scales, the latter two measured at 4 time points: pid txgroup rses1 rses2 rses3 rses4 attclim1 attclim2 attclim3 attclim4
A command that would reshape the data as required would look like:
reshape long rses attclim, i(pid) j(time)
3.3.1. We tell Stata what our ID variable (i.e., pid) is—this is always the individual-level ID.
3.3.2. Note that time is a new variable that will contain 1-4 according to the measurement occasion. We can call this variable anything we like: time, phase, occasion etc. We can also recode this variable into new metrics if we want to use a more advanced parameterisation of time or simply if we want to recentre time for some reason (see the sketch in 3.4 below).
3.3.3. The variables rses and attclim are formed from the stubs of the original variable
names you used (it’s important to be consistent and logical in your original naming
conventions). If you have a series of these, it’s just a question of listing them as in 3.3.
3.3.4. Note that the variables that aren’t measured via repeated measures are not
mentioned explicitly in the reshape. For almost all designs that typical psych researchers use, Stata is clever enough to figure out what is going on and will simply incorporate these variables as so-called time-invariant covariates.
Naturally, the cleaner the dataset to begin with, the better. Avoid lots of recodes that
you then thought better of, different versions of scale scores that you don’t use and
other inelegant things. Practise ruthless data hygiene and you will be happier for it!
3.3.5. If you are not used to constant data manipulation, these concepts will probably be easier to understand by trying them out.
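3.4. As a small, hedged illustration of the recoding mentioned in 3.3.2 (time0 is just an illustrative name), once the data are long you can put time on a new metric with an ordinary gen command:
gen time0 = time - 1
Here the first measurement occasion becomes the zero point, so an intercept in later models refers to scores at the first occasion rather than at a non-existent occasion zero.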
4. Re-centre your variables as appropriate. If you have used optional step 3 for reshaping, the time to re-centre (e.g., mean centring) is almost always now. There are some within-person re-centring strategies but they are for another time and place (such as a well-stocked bar during happy hour). If we do not re-centre, the regression parameters will all be the same as if we had centred, except for the intercept, which will reflect the predicted score for a person scoring zero (0) on all the predictors. This may or may not be meaningful. Unlike in OLS regression, most of the time you will want to interpret the intercept in mixed modelling. Like all regression intercepts, in mixed modelling the intercept gives us the predicted Y score when X is zero, and this generalises to more than one X as for multiple regression. Therefore, if your variables have meaningful zero
points in them, such as dummies, you can leave them alone. Mostly, we have a mixture of
variables/covariates in our data. Is there a substantively meaningful point around which to
centre (such as an age milestone in developmental science, a clinical norm/cutoff, a neutral
point in an attitude scale etc)? If so, centre around this point. The intercept will reflect the
predicted Y score when participants have the score around which you have centred. Failing this,
centre around the sample mean.
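4.1. A minimal sketch of mean-centring in Stata, assuming we want to centre homework (homework_c is just an illustrative name for the new variable):
summarize homework
gen homework_c = homework - r(mean)
summarize leaves the sample mean behind in r(mean), so the second line subtracts it from every score. Centring around a substantively meaningful value works the same way: just subtract that value instead of r(mean).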
5. Begin with a variance components model. This addresses the question, Is there sufficient
variance represented at a higher level to warrant the mixed approach? A rule of thumb widely
used is that about 10% of the total variance needs to be represented at a given level (n.b., for
simplicity we talk about two levels here but the same logic applies to three or more levels in a
hierarchical design). We use the xtmixed procedure as below to derive the variance estimates at
both levels of the example two-level design.
5.1. xtmixed math || schid:, mle
5.2. Note that this model appears only to list the outcome variable math with no predictors. As
well, there is an operator made up of two vertical lines or pipes (||; the pipe symbol is
found on the backslash key above Enter). The “two pipes” operator indicates the beginning
of the random effects specification. If we were to leave it out, we would get an “ordinary”
regression using ML estimation instead of least squares.
5.3. By listing only the outcome, we have created an empty model with nothing but the constant
(implicitly included) and this will give us variance estimates at level 2 (i.e., school; labelled sd(_cons) = 4.985047) and at level 1 (labelled sd(Residual) = 9.013178). Note, these are SDs
by default and must be squared to give variances. We can get variances with the var
option. It really doesn’t matter.
5.4. The total variance of math is 4.985047² + 9.013178² = 106.0881 but we rarely care!
5.5. ICC = school (level 2) variance ÷ (level 2 variance + level 1 variance)
5.6. ICC = 4.985047² ÷ (4.985047² + 9.013178²)
5.7. ICC = .2342
5.8. Approximately 23% of the total variance in maths scores is represented at the school level
(i.e., level 2) and this is well in excess of the minimum of 10% expected for further
modelling. The whole point of this seemingly dreary step was to gather the estimated
variance components (i.e., at each level) and use these to calculate the ICC.
5.8.1. If the ICC were less than .1, it would indicate that we need not consider the level in representing the variance of the outcome—there is no design effect. In this case, we could go on to use OLS regression as usual.
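5.9. A quick way to do the arithmetic in 5.4-5.7 is to use display as a calculator with the SDs from the output above:
display 4.985047^2 / (4.985047^2 + 9.013178^2)
(Depending on your version of Stata, a postestimation command for the ICC may also be available, but the hand calculation keeps the logic transparent.)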
6. At this point, you would add covariates and go on to model the outcome. You are then starting
with a random intercept model. This differs from an OLS regression in that individual scores are modelled as having a different overall level (or mean), represented by the random intercept, according to their level 2 unit (or cluster), but all clusters share the same regression slope. By contrast, in OLS regression, the intercept and slope are both fixed population parameters and
individuals’ variation from the predicted score is represented completely by the single error
term.
6.1. E.g., xtmixed math homework || schid:, mle
6.2. Random intercept models will account for the statistically problematic situation in which
there are known differences on the outcome across levels of a variable (such as school, or
ward, or therapist) that is implicitly (or explicitly) included in the sampling scheme. In this
example, note that the slope of math on homework has dropped slightly. This suggests
that some of the previously observed effect of homework hours on maths scores when
constraining intercepts to be the same across schools was due to the differences between
schools in general level of maths achievement. Having accounted for this difference across
schools by allowing intercepts to vary randomly (across schools), we have obtained a
slightly more accurate representation of the effect of homework on maths scores.
6.3. Immediately after having run the model in 6.1, run the following line, which stores the
estimates for later recall. We will use this step later to evaluate whether the random slope
(variance) we allow in 7.2 is significantly different from zero.
6.3.1. I.e., estimates store randint
6.3.2. You could call these anything. I chose randint for readability. Geeks would use ri.
7. Now designate a predictor as having a random slope. Random slope models (also known as random coefficient models) involve freeing the variance of the slope parameter from being constrained
to zero—i.e., allowing the slope parameter to have a variance. It means that we are allowing the
slope of the regression line to take on a different value across the values of the level 2 variable
(i.e., school). Of course, it may decline our invitation! Practically speaking, we are allowing hours
of homework to have a different effect on maths scores across different schools. This step will
later allow us to include predictors of the slope if these are warranted.
7.1. The first step is to include the relevant predictor in the random specification of the syntax so
that Stata knows we want the slope variance of that predictor to be unconstrained across
level 2 units.
7.2. E.g., xtmixed math homework || schid: homework, cov(uns) mle
7.2.1. We include homework after the colon attached to schid (i.e., the level two ID). This
allows the slope variance of homework to vary randomly rather than to be fixed at
zero.
7.2.2. We also allow the variance-covariance between the random effects to be unstructured rather than fixing the covariance at 0. We do this by adding cov(uns) to the options. Until you know better, do this step routinely.
7.2.3. The next step is to assess whether the variance of the slope is significantly different
from zero. If it is not, then it is evidence that we need not treat the slope as random
and can safely regard it as a fixed effect (i.e., we can assign the same slope parameter
to children from all schools). We immediately store the estimates again for use below:
7.2.3.1. estimates store randcoeff
7.2.4. This brings up a classic problem in stats. Almost all programs will provide a standard error
for the random variance. A typical Wald test approach is to take the parameter, divide
it by the standard error and interpret the result as a two-tailed Z test for the difference
between the obtained parameter value and zero. Bzzt! A variance can’t be negative
and so this approach has issues because it is testing against a value (zero) that is the
boundary of the parameter space. No problems. We implement a likelihood ratio test
in Stata and correct for the problem by dividing the obtained p value by 2 to get the
correct probability.
7.2.5. E.g., lrtest randint randcoeff
7.2.6. We have requested a likelihood ratio test for the random intercept model and the
random coefficient model. These are nested models and the test is similar to
difference-in-deviance approaches for other modelling such as SEM. The difference is
distributed as chisquare with degrees of freedom equal to the difference in DF across
the models. Stata helpfully notes:
Note: The reported degrees of freedom assumes the null hypothesis is not on the
boundary of the parameter space. If this is not true, then the reported test is
conservative.
7.2.7. Therefore, we correct the given p value (which is tiny already) by dividing by 2 (see the short sketch in 7.4 below).
7.3. We now know that the random variance of the homework slope is different from zero. At
the same time, we can include level 2 and level 1 variables as predictors in their own right.
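7.4. A small convenience, noted here as a hedged sketch: lrtest leaves its p value behind in r(p), so the halving described in 7.2.7 can be done immediately after the test:
lrtest randint randcoeff
display r(p)/2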
8. The inclusion of further covariates now follows just as for normal regression procedures.
8.1. E.g., xtmixed math homework white ratio meanses || schid: homework, cov(uns) mle var
8.1.1. Again, ignore any line breaks in this handout; type each command as a single line.
8.2. We have included: the race of the student; the student-teacher ratio; and the SES of the
school (i.e., operationalised as the mean SES of all the students in each school in the data).
These predictors provide a microcosm of the possibilities for us!
8.2.1. Level 1 predictors: measured at the individual level and vary at the individual level
8.2.1.1. homework, white
8.2.2. Level 2 predictors: a) operationalised by aggregating the corresponding individual-level predictor. They then vary at the cluster level (i.e., level 2); see the sketch in 8.4 below.
8.2.2.1. meanses
8.2.3. Level 2 predictors: b) measured at the cluster level and vary at the cluster level (e.g., ratio)
8.3. The software will provide fixed effects for all of these. The b-weights are interpreted as
usual—∆Y/∆X. The key is that the fixed effects refer to the population-averaged regression
line and they are estimated while specifying essentially separate error terms for level 1 and
level 2, rather than forcing everything into a single error term. We have chosen to allow the homework slope
to vary randomly but we could just as easily have chosen other variables to remove
constraints. Naturally, you need to make these choices with theory firmly in mind. You are
always cautioned against being too liberal with your assignment of random slopes.
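8.4. For completeness, a hedged sketch of how a level 2 predictor of type (a) can be built by aggregation, assuming the dataset contains a student-level ses variable (in the homework data meanses is already supplied, so meanses2 here is purely illustrative):
bysort schid: egen meanses2 = mean(ses)
Every student within a school then carries the same school-mean value, which is exactly how a level 2 predictor sits in the long/stacked dataset.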
9. The next questions will address the likely predictors of the random slope. Stop and think about
this for a moment. We are essentially talking about a type of moderation. We have an effect
(i.e., homework predicts maths achievement). We have potential variables as part of a multilevel
setup that may predict the size/sign of the homework slope. Therefore, we are talking about
variables that affect the relationship between another predictor and the criterion. As for
moderation in OLS regression, we use product terms to test these relationships. There are
different ways in which these statistical models can be expressed, but the practical outcome
remains very similar, if not the same.
10. Stata 11 and onwards has a very streamlined approach to factorial/product terms. Whereas in
Stata 10 and other programs you need to calculate the product term as a new variable explicitly
and then include it in the model, in newer versions of Stata you can do this much more
efficiently.
10.1. gen meanses_hmewrk = meanses * homework
10.2. xtmixed math homework meanses white ratio meanses_hmewrk || schid: homework, cov(uns) mle
10.3. In Stata 11 and onwards:
10.4. xtmixed math c.homework##c.meanses white ratio || schid: homework, cov(uns) mle var
10.4.1. c.* = continuous variable
10.5. The interpretation of the parameter estimates is contingent on the coding scheme
used for categorical variables. This, crucially, will apportion zeros to particular participants
or groups thereof. Almost every study that will use mixed regression will involve
categoricals. For many/most applications, dummy coding is fine. You will see that almost
every public health/epidemiology paper uses it. It is ideal when we have a control group or
a meaningful reference group against which we can make comparisons. Remember, the
significance of the parameter estimates is related to differences between coded groups and
the reference group (i.e., as opposed to tests against the grand mean or similar). This
applies to interactions as well. Virtually everything that you already know about dummy
coding and interactions in OLS regression is applied here in mixed regression as well.
10.6. Using the above results as an example, the homework effect would show the
homework slope for schools with a meanses level of zero. The product term then shows the
change in the homework slope for each extra meanses point.
10.6.1. If we were using dummy variables here, the main effect term would show the slope
for the 0 group and the product term would show the change in slope for the 1 group.
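10.7. For the dummy case in 10.6.1, the same factor-variable machinery applies. A hedged sketch using the race dummy from the homework data (assuming white is coded 0/1; Stata 11 and onwards):
xtmixed math c.homework##i.white ratio meanses || schid: homework, cov(uns) mle var
Here the homework coefficient is the homework slope for the white == 0 group, and the interaction term is the change in that slope for the white == 1 group.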
Analysing longitudinal designs (with continuous outcomes) with mixed effects regression
models using Stata
First of all, you will be pleased to know that we still use xtmixed for longitudinal designs, as long as
the outcome variable in each case is continuous. Recall that the longitudinal case is just a straight
extension of the general multilevel model as covered last week.
The data that we will use for the longitudinal analyses come from the now (in)famous alcohol use
dataset from the study by Pat Curran and colleagues (1997; see PPT for reference). The widely
available version is a relatively small subset (i.e., approximately 25%) of the complete study data
with fewer persons (i.e., only Caucasian adolescents) and variables. A benefit of this data is that
there are a range of fully documented analyses available online for various statistical packages.
Probably the best repository of these is the UCLA Stats Consulting site (Google it). Here you can read
through a series of worked and documented analyses of basic models and follow along on your own
computer.
Steps for mixed modelling in Stata
11. Set datafile up
11.1. Revise the material from the previous section. You will almost certainly need to use the reshape procedure.
12. Begin with a variance components model. This addresses the question, Is there sufficient
variance represented at the person level to warrant the mixed approach? For the alcuse data, we
use the following xtmixed syntax.
12.1. xtmixed alcuse || id:, mle var
12.2. ICC = person (level 2) variance ÷ (person variance + occasion-level residual variance)
12.3. ICC = .5639 ÷ (.5639 + .5617)
12.4. ICC = .5010
12.5. Approximately 50% of the total variance in alcohol use scores is represented at the person level (i.e., level 2).
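12.6. As in 5.9, display can do the arithmetic with the variance estimates from the output (no squaring is needed here because we requested variances with the var option):
display .5639 / (.5639 + .5617)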
13. Next is the random intercept model. The key predictor here is your time effect.
13.1. E.g., xtmixed alcuse age || id:, mle var
13.2. But wait! If we use age in years for a sample of adolescents, the intercept will refer to the drinking scores of newborns. Better to immediately rerun the analysis with an appropriately centred age predictor (see the sketch in 13.7 below). Normally you would anticipate this in your planning. Note that the only parameter that changes is the intercept (just as for OLS regression).
13.3. E.g., xtmixed alcuse age_14 || id:, mle var
13.4. Immediately after having run the model in 13.3, store the estimates for later use.
13.4.1. I.e., estimates store ri
13.4.2. You could call these ug instead of ri to indicate unconditional growth, as per ALDA.
13.5. Now designate time as having a random slope. Random slopes for time mean that we are allowing the slope of the outcome over time to take on a different value across the values of the level 2 variable (i.e., person).
13.6. E.g., xtmixed alcuse age_14 || id: age_14, cov(uns) mle var
13.6.1. As usual, we allow the variance-covariance between the random effects to be unstructured rather than fixing the covariance at 0. In some instances, we may wish to allow for
autocorrelation across the errors for longitudinal data. This is on the more advanced
end of the spectrum.
13.6.2. Next, store these estimates to assess whether the variance of the slope is
significantly different from zero. If it is not, then it is evidence that we need not treat
the slope as random and can safely regard it as a fixed effect. We immediately store
the estimates again for use below:
13.6.2.1. estimates store rc
13.6.3. The same form of likelihood ratio test is used.
13.6.4. E.g., lrtest ri rc
13.6.5. Don’t forget to correct the given p value.
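13.7. A hedged sketch of the centring step referred to in 13.2, in case your copy of the data does not already contain a centred age variable (the commonly distributed alcuse data already include age_14):
gen age_14 = age - 14
The intercept then refers to predicted alcohol use at age 14, the first measurement occasion, rather than at age 0.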
14. The inclusion of further covariates now follows just as in the cross-sectional case.
14.1. E.g., xtmixed alcuse age_14 male coa peer || id: age_14, mle var cov(uns)
14.2. male = dummy for gender; coa = child of alcoholic; peer = reported peer alcohol use.
15. Finally, using the efficient approach of Stata you can check gender as a predictor of the slope of
alcohol use over time.
15.1. xtmixed alcuse c.age_14##i.male coa peer || id: age_14, mle var cov(uns)
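15.2. Once you settle on a model, two handy xtmixed postestimation steps are sketched below (the new variable names are arbitrary):
predict yhat, xb
predict re*, reffects
The first line stores the fixed-portion (population-averaged) predictions; the second stores the empirical Bayes (BLUP) estimates of the random effects for each person, which are useful for plotting individual growth trajectories.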