Mixed Modelling using Stata
Stefano Occhipinti – Griffith Health Institute
Workshop for GSBRC, July 2012

Stata is a general purpose statistical analysis package that, like SAS, SPSS, and R, can be used for many different statistical tasks. It is particularly well suited to mixed modelling for reasons that will become apparent. Stata is amenable to menu-driven working, but the syntax is so elegant and straightforward that it is difficult to think of an occasion on which the menus are preferable (we will not be using menus, in case my friendly hints have not been sufficiently clear). The syntax can be thought of as a series of Lego blocks: each component of syntax is very simple, but the simple elements can be combined to form statements of increasing complexity.

To begin, we can double click on a Stata data file in its directory/folder to read the data into Stata and set the relevant directory as the working directory. Then, the statistical procedures are very simple. For example,

    regress math homework

provides a linear regression analysis with math as the criterion and homework as the predictor. This line of syntax can be run by typing or pasting it into the Command window at the bottom of the Stata screen. No full stop is needed. As well, because we have a morbid fear of typing more characters than are absolutely necessary, we can shorten the regress statement thus:

    reg math homework

(Welcome to geekdom.) If we choose to add more predictors (as invariably we would), we simply list them after homework, like so, assuming white is a dummy for race:

    reg math homework white

You will notice that, by default, Stata only provides the unstandardized regression coefficients. Because we fetishise standardized coefficients (betas) in many fields such as psychology, we can obtain them by using an option. In Stata, options are obtained by following the main statement with a comma:

    reg math homework white, beta

If you must have partial and semipartial correlations and their respective squares, you may use:

    pcorr math homework white

If, for some reason, we wanted to restrict the analysis to kids from a particular school, we can use the if operator. This is a slightly confusing point, but repetition will soon drive it home: if is not an option for a single procedure; it is used to address the whole dataset and is common to all procedures. Therefore, it does not join the procedure-specific options following the comma. Assuming the school ID variable in our data is called schid, we would type:

    reg math homework white if schid == 1, beta

(Note the double = for the comparison operator [does this equal that?] versus the single = for the assignment operator [make this equal to this other thing!].) Welcome to Stata!

For multilevel modelling, we will use the xtmixed procedure. The form of the syntax will be exactly the same as above. There will be an extra part that allows us to specify random effects (as discussed in the presentation).

Multilevel (i.e., mixed) modelling in Stata

There are multiple procedures that handle mixed modelling in Stata. These generally begin with xt and follow the procedures for the "ordinary" version of the statistical model. For example, xtmixed is for mixed regression (with normally distributed, continuous outcomes). By contrast, xtmelogit is for multilevel logistic regression, and so forth. We will be focusing entirely on xtmixed in this workshop.
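For reference, the example commands above can be collected into a short do-file. This is only a sketch using the variable names already introduced (math, homework, white, schid); substitute your own variables as needed.

    * simple OLS regression: math predicted by homework
    regress math homework

    * add a second predictor and request standardized (beta) coefficients
    regress math homework white, beta

    * partial and semipartial correlations (and their squares)
    pcorr math homework white

    * restrict the analysis to one school (note the double ==)
    regress math homework white if schid == 1, beta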
The data that we will use for the cross-sectional analyses come from the homework dataset provided by Snijders and Bosker and used in a range of textbook and tutorial materials. It contains data from 519 school students from 23 schools in the United States. The variables include student-level characteristics (e.g., maths achievement; number of hours of homework per week) and school-level ones (e.g., ratio of students to teachers, mean SES of the school). The variables are fairly self-explanatory and will do for the purpose of learning.

The following section is designed to accompany the presentation (see the PPT file). I have set it up as a tutorial that proceeds in steps. Where possible, I have included explanatory detail to assist you when revising and practising this material. We will turn to each step as indicated in the presentation.

Steps for mixed modelling in Stata

1. Set datafile up
1.1. Make sure there is an ID variable for every intended level of analysis (e.g., participant, patient etc.; group, school, ward etc.)
1.2. For longitudinal analyses, the most likely scenario is that you already have data in a so-called wide setup, where each person's data are in a single row and the different times of observation/measurement are represented as separate variables with a similar stem and a numerical suffix to indicate time (e.g., rses1, rses2…, rses5). This is good! If you have done this, stick to it. Avoid full stops/dots/periods (i.e., .) in variable names. They will be changed to underscores (i.e., _) in Stata and they are redundant anyway. You will use the reshape procedure below (see 3).
2. Read data into Stata
2.1. SPSS can read and write Stata files
2.1.1. Save as… select the latest version of Stata offered and "Intercooled" Stata
2.2. The user-written procedure usespss can read SPSS files directly
2.2.1. i.e., findit usespss to locate it and then click to install directly from net-aware Stata
2.2.2. then, usespss using filename, clear
2.2.3. You need to specify the .sav extension in filename, and you will only need the clear option if you already have data open, but it won't hurt if you include it redundantly.
2.3. Stata can also read Excel files (as can SPSS and SAS etc.). The easiest approach is to save the data as a single worksheet in text/tab-delimited format and use the insheet procedure. The procedure is very intuitive, and if you already use Excel data in SPSS it is very simple to do. Naturally, if you already have variable labels and so forth, this is not an attractive option! I assume that most people at this workshop won't be at this stage.
2.4. Often, the best approach vis-à-vis workflow is to create a new folder for each project you run and to use the cd command (change directory; old Unix and/or DOS commands are useful knowledge here) to set Stata to work in that project folder. Naturally, use the full pathname (e.g., copy it from Windows Explorer) and enclose it in double quotes if it contains spaces, as it invariably will.
2.4.1. E.g., cd "C:\Documents and Settings\psyocchi\My Documents\Consulting temp files\UQMultilevel2012"
2.4.2. Ordinarily, this is typed as a single line in the Command window; ignore any line breaks forced by the width of the handout.
2.4.3. For the sake of mind-numbing completeness, I note that you can automatically land in the appropriate directory by starting Stata by double clicking on a Stata data file in the relevant directory.
2.4.3.1. If unsure, use pwd to see where in your directory/folder structure you are working in Stata
2.5. Then we can read a Stata dataset (i.e., *.dta) with the use command. We don't need the extension here.
2.5.1. E.g., use homework
2.6. Extra tip: if we already have data open and we have made changes to it in some way, Stata will not allow us to open another dataset without saving the original data. This is a failsafe to help to prevent wailing and gnashing of teeth. We merely need to add the clear option to the use command.
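Pulled together, a typical session start might look like the following sketch. The .sav file name is a placeholder for illustration; the folder path is the example from 2.4.1; describe and summarize are just quick checks that the data arrived intact.

    * point Stata at the project folder
    cd "C:\Documents and Settings\psyocchi\My Documents\Consulting temp files\UQMultilevel2012"

    * open a Stata dataset (no extension needed); clear discards any data already in memory
    use homework, clear

    * or read an SPSS file directly with the user-written usespss (install via findit usespss)
    * usespss using "homework.sav", clear

    * quick sanity checks
    describe
    summarize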
3. Reshape data if necessary [optional]
3.1. If you have repeated measures data in wide format (see 1.2 above), then you will need to reshape the data into so-called long format.
3.2. If you have many variables, more than you need, think carefully about what you want to have in the reshaped datafile. Remember, Stata Man says, "If in doubt, leave it out!"
3.3. E.g., a data file has: participant ID, a dummy variable indicating group/random assignment, and the Rosenberg Self Esteem Scale and Attitudes to Climate Change, the latter two variables measured at 4 time points:

    pid txgroup rses1 rses2 rses3 rses4 attclim1 attclim2 attclim3 attclim4

A command that would reshape the data as required would look like:

    reshape long rses attclim, i(pid) j(time)

3.3.1. We tell Stata what our ID variable (i.e., pid) is; this is always the individual-level ID.
3.3.2. Note that time is a new variable that will contain 1-4 according to the measurement occasion. We can call this variable anything we like: time, phase, occasion etc. We can also recode this variable into new metrics if we want to use a more advanced parameterisation of time, or simply if we want to recentre time for some reason.
3.3.3. The variables rses and attclim are formed from the stubs of the original variable names you used (it's important to be consistent and logical in your original naming conventions). If you have a series of these, it's just a question of listing them as in 3.3.
3.3.4. Note that the variables that aren't measured repeatedly are not mentioned explicitly in the reshape. Stata is pretty clever in figuring these out, and for almost all designs that typical psych researchers use, the program will work out what is going on and simply incorporate these variables as so-called time-invariant covariates. Naturally, the cleaner the dataset to begin with, the better. Avoid lots of recodes that you then thought better of, different versions of scale scores that you don't use, and other inelegant things. Practise ruthless data hygiene and you will be happier for it!
3.3.5. If you are not used to constant data manipulation, these concepts will probably be easier to understand by trying them out.
4. Re-centre your variables as appropriate. If you have used optional step 3 for reshaping, the time to re-centre (e.g., mean centring) is almost always now. There are some within-person recentring strategies, but they are for another time and place (such as a well-stocked bar during happy hour). If we do not re-centre, the regression parameters will all be the same as if we had centred, except for the intercept, which will reflect a predicted score for a person scoring zero (0) on all the predictors. This may or may not be meaningful. Unlike OLS regression, most of the time you will want to interpret the intercept in mixed modelling. Like all regression intercepts, in mixed modelling the intercept gives us the predicted Y score when X is zero, and this generalises to more than one X as for multiple regression. Therefore, if your variables have meaningful zero points in them, such as dummies, you can leave them alone. Mostly, we have a mixture of variables/covariates in our data. Is there a substantively meaningful point around which to centre (such as an age milestone in developmental science, a clinical norm/cutoff, a neutral point in an attitude scale etc.)? If so, centre around this point. The intercept will reflect the predicted Y score when participants have the score around which you have centred. Failing this, centre around the sample mean.
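Step 4 describes the logic but not a command, so here is one minimal way to do it, using the homework data as an example. The new variable names and the illustrative cutoff of 2 hours are my own and purely hypothetical.

    * centre homework around the sample (grand) mean
    summarize homework, meanonly
    generate homework_c = homework - r(mean)

    * or centre around a substantively meaningful value (2 hours is illustrative only)
    generate homework_c2 = homework - 2

    * check the centred variables
    summarize homework_c homework_c2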
5. Begin with a variance components model. This addresses the question, Is there sufficient variance represented at a higher level to warrant the mixed approach? A rule of thumb widely used is that about 10% of the total variance needs to be represented at a given level (n.b., for simplicity we talk about two levels here, but the same logic applies to three or more levels in a hierarchical design). We use the xtmixed procedure as below to derive the variance estimates at both levels of the example two-level design.
5.1. xtmixed math || schid:, mle
5.2. Note that this model appears only to list the outcome variable math, with no predictors. As well, there is an operator made up of two vertical lines or pipes (||; the pipe symbol is found on the backslash key above Enter). The "two pipes" operator indicates the beginning of the random-effects specification. If we were to leave it out, we would get an "ordinary" regression using ML estimation instead of least squares.
5.3. By listing only the outcome, we have created an empty model with nothing but the constant (implicitly included), and this will give us variance estimates at level 2 (i.e., school; labelled sd(_cons) = 4.985047) and at level 1 (labelled sd(Residual) = 9.013178). Note, these are SDs by default and must be squared to give variances. We can get variances with the var option. It really doesn't matter.
5.4. The total variance of math is 4.985047^2 + 9.013178^2 = 106.0881, but we rarely care!
5.5. ICC = school (level 2) variance ÷ (level 2 variance + level 1 variance)
5.6. ICC = 4.985047^2 ÷ (4.985047^2 + 9.013178^2)
5.7. ICC = .2342
5.8. Approximately 23% of the total variance in maths scores is represented at the school level (i.e., level 2), and this is well in excess of the minimum of 10% expected for further modelling. The whole point of this seemingly dreary step was to gather the estimated variance components (i.e., at each level) and use these to calculate the ICC.
5.8.1. If the ICC were less than .1, it would indicate that we need not consider the level in representing the variance of the outcome: there is no design effect. In this case, we could go on to use OLS regression as usual.
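The arithmetic in 5.5-5.7 can be reproduced inside Stata with display, squaring the SDs that xtmixed reports:

    * ICC = level-2 variance / (level-2 variance + level-1 variance)
    display 4.985047^2 / (4.985047^2 + 9.013178^2)
    * displays approximately .2342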
6. At this point, you would add covariates and go on to model the outcome. You are then starting with a random intercept model. This differs from an OLS regression in that individual scores are modelled as having a different overall level (or mean), represented by the random intercept, according to the level 2 unit (or cluster), but they all share the same regression slope. By contrast, in OLS regression, the intercept and slope are both fixed, population parameters, and individuals' variation from the predicted score is represented completely by the single error term.
6.1. E.g., xtmixed math homework || schid:, mle
6.2. Random intercept models will account for the statistically problematic situation in which there are known differences on the outcome across levels of a variable (such as school, or ward, or therapist) that is implicitly (or explicitly) included in the sampling scheme. In this example, note that the slope of math on homework has dropped slightly. This suggests that some of the previously observed effect of homework hours on maths scores, when constraining intercepts to be the same across schools, was due to the differences between schools in general level of maths achievement. Having accounted for this difference across schools by allowing intercepts to vary randomly (across schools), we have obtained a slightly more accurate representation of the effect of homework on maths scores.
6.3. Immediately after having run the model in 6.1, run the following line, which stores the estimates for later recall. We will use this step later to evaluate whether the random slope (variance) we allow in 7.2 is significantly different from zero.
6.3.1. I.e., estimates store randint
6.3.2. You could call these anything. I chose randint for readability. Geeks would use ri.
7. Now designate a predictor as having a random slope. Random slopes (also known as random coefficient models) involve freeing the variance of the slope parameter from being constrained to zero, i.e., allowing the slope parameter to have a variance. It means that we are allowing the slope of the regression line to take on a different value across the values of the level 2 variable (i.e., school). Of course, it may decline our invitation! Practically speaking, we are allowing hours of homework to have a different effect on maths scores across different schools. This step will later allow us to include predictors of the slope if these are warranted.
7.1. The first step is to include the relevant predictor in the random specification of the syntax so that Stata knows we want the slope variance of that predictor to be unconstrained across level 2 units.
7.2. E.g., xtmixed math homework || schid: homework, cov(uns) mle
7.2.1. We include homework after the colon attached to schid (i.e., the level two ID). This allows the slope variance of homework to vary randomly rather than being fixed at zero.
7.2.2. We also specify that the variance-covariance matrix of the random effects is unstructured rather than the covariance being fixed at 0. We do this by adding cov(uns) to the options. Until you know better, do this step routinely.
7.2.3. The next step is to assess whether the variance of the slope is significantly different from zero. If it is not, then it is evidence that we need not treat the slope as random and can safely regard it as a fixed effect (i.e., we can assign the same slope parameter to children from all schools). We immediately store the estimates again for use below:
7.2.3.1. estimates store randcoeff
7.2.4. This brings up a classic problem in stats. Almost all programs will provide a standard error for the random variance. A typical Wald test approach is to take the parameter, divide it by the standard error, and interpret the result as a two-tailed Z test of the difference between the obtained parameter value and zero. Bzzt! A variance can't be negative, and so this approach has issues because it is testing against a value (zero) that lies on the boundary of the parameter space. No problem. We implement a likelihood ratio test in Stata and correct for the problem by dividing the obtained p value by 2 to get the correct probability.
7.2.5. E.g., lrtest randint randcoeff
7.2.6. We have requested a likelihood ratio test for the random intercept model and the random coefficient model. These are nested models, and the test is similar to difference-in-deviance approaches in other modelling, such as SEM. The difference is distributed as chi-square with degrees of freedom equal to the difference in df across the models. Stata helpfully notes: "Note: The reported degrees of freedom assumes the null hypothesis is not on the boundary of the parameter space. If this is not true, then the reported test is conservative."
7.2.7. Therefore, we correct the given p value (which is tiny already) by dividing it by 2.
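Scripted end to end, the comparison looks like the sketch below. lrtest leaves its results in r(), so the boundary-corrected p value can be printed directly; the stored-estimate names are the ones chosen above.

    * random intercept model
    xtmixed math homework || schid:, mle
    estimates store randint

    * random coefficient model: homework slope free to vary across schools
    xtmixed math homework || schid: homework, cov(uns) mle
    estimates store randcoeff

    * likelihood ratio test of the nested models
    lrtest randint randcoeff

    * boundary correction: halve the reported p value
    display "corrected p = " r(p)/2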
7.3. We now know that the random variance of the homework slope is different from zero. At the same time, we can include level 2 and level 1 variables as predictors in their own right.
8. The inclusion of further covariates now follows just as for normal regression procedures.
8.1. E.g., xtmixed math homework white ratio meanses || schid: homework, cov(uns) mle var
8.1.1. Again, ignore line breaks.
8.2. We have included: the race of the student; the student-teacher ratio; and the SES of the school (i.e., operationalised as the mean SES of all the students in each school in the data). These predictors provide a microcosm of the possibilities for us!
8.2.1. Level 1 predictors: measured at the individual level and vary at the individual level
8.2.1.1. homework, white
8.2.2. Level 2 predictors: (a) operationalised by aggregating the corresponding individual-level predictor. They then vary at the cluster level (i.e., level 2).
8.2.2.1. meanses
8.2.3. Level 2 predictors: (b) measured at the cluster level and vary at the cluster level
8.2.3.1. ratio
8.3. The software will provide fixed effects for all of these. The b-weights are interpreted as usual: ∆Y/∆X. The key is that the fixed effects refer to the population-averaged regression line, and they are estimated while specifying essentially separate error terms for level 1 and level 2, rather than forcing everything into a single error term. We have chosen to allow the homework slope to vary randomly, but we could just as easily have chosen other variables to free from constraints. Naturally, you need to make these choices with theory firmly in mind. You are always cautioned against being too liberal with your assignment of random slopes.
9. The next questions will address the likely predictors of the random slope. Stop and think about this for a moment. We are essentially talking about a type of moderation. We have an effect (i.e., homework predicts maths achievement). We have potential variables, as part of a multilevel setup, that may predict the size/sign of the homework slope. Therefore, we are talking about variables that affect the relationship between another predictor and the criterion. As for moderation in OLS regression, we use product terms to test these relationships. There are different ways in which these statistical models can be expressed, but the practical outcome remains very similar, if not the same.
10. Stata 11 and onwards has a very streamlined approach to factorial/product terms. Whereas in Stata 10 and other programs you need to calculate the product term as a new variable explicitly and then include it in the model, in newer versions of Stata you can do this much more efficiently.
10.1. gen meanses_hmewrk = meanses * homework
10.2. xtmixed math homework meanses white ratio meanses_hmewrk || schid: homework, cov(uns) mle
10.3. In Stata 11 and onwards:
10.4. xtmixed math c.homework##c.meanses white ratio || schid: homework, cov(uns) mle var
10.4.1. The c. prefix marks a variable as continuous; the ## operator includes both main effects and their product (interaction).
10.5. The interpretation of the parameter estimates is contingent on the coding scheme used for categorical variables. This, crucially, will apportion zeros to particular participants or groups thereof. Almost every study that will use mixed regression will involve categoricals. For many/most applications, dummy coding is fine. You will see that almost every public health/epidemiology paper uses it. It is ideal when we have a control group or a meaningful reference group against which we can make comparisons. Remember, the significance of the parameter estimates relates to differences between coded groups and the reference group (i.e., as opposed to tests against the grand mean or similar). This applies to interactions as well. Virtually everything that you already know about dummy coding and interactions in OLS regression applies here in mixed regression as well.
10.6. Using the above results as an example, the homework effect would show the homework slope for schools with a meanses level of zero. The product term then shows the change in the homework slope for each extra meanses point.
10.6.1. If we were using dummy variables here, the main effect term would show the slope for the 0 group and the product term would show the change in slope for the 1 group.
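If you want to probe the cross-level interaction described in 10.6, one option after the factor-variable model is margins, which can report the homework slope at chosen values of meanses. Treat this as a sketch: the meanses values are illustrative only, and margins here summarises the fixed portion of the model (check help xtmixed postestimation in your version of Stata).

    * fit the model with the interaction in factor-variable notation
    xtmixed math c.homework##c.meanses white ratio || schid: homework, cov(uns) mle var

    * homework slope (dY/dX) at illustrative values of meanses
    margins, dydx(homework) at(meanses = (-0.5 0 0.5))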
Analysing longitudinal designs (with continuous outcomes) with mixed effects regression models using Stata

First of all, you will be pleased to know that we still use xtmixed for longitudinal designs, as long as the outcome variable in each case is continuous. Recall that the longitudinal case is just a straight extension of the general multilevel model as covered last week.

The data that we will use for the longitudinal analyses come from the now (in)famous alcohol use dataset from the study by Pat Curran and colleagues (1997; see PPT for reference). The widely available version is a relatively small subset (i.e., approximately 25%) of the complete study data, with fewer persons (i.e., only Caucasian adolescents) and variables. A benefit of these data is that a range of fully documented analyses is available online for various statistical packages. Probably the best repository of these is the UCLA Stats Consulting site (Google it). Here you can read through a series of worked and documented analyses of basic models and follow along on your own computer.

Steps for mixed modelling in Stata

11. Set datafile up
11.1. Revise the material from the previous section. You will almost certainly need to use the reshape procedure.
12. Begin with a variance components model. This addresses the question, Is there sufficient variance represented at the person level to warrant the mixed approach? For the alcuse data, we use the following xtmixed syntax.
12.1. xtmixed alcuse || id:, mle var
12.2. ICC = person (level 2) variance ÷ (person variance + within-person residual variance)
12.3. ICC = .5639 ÷ (.5639 + .5617)
12.4. ICC = .5010
12.5. Approximately 50% of the total variance in alcohol use scores is represented at the person level (i.e., level 2).
13. Next is the random intercept model. The key predictor here is your time effect.
13.1. E.g., xtmixed alcuse age || id:, mle var
13.2. But wait! If we use age in years for a sample of adolescents, the intercept will refer to the drinking scores of newborns. Better to immediately rerun the analysis with an appropriately centred age predictor (one way to create such a variable is sketched after 13.4.2 below). Normally you would anticipate this in your planning. Note that the only parameter that changes is the intercept (just as for OLS regression).
13.3. E.g., xtmixed alcuse age_14 || id:, mle var
13.4. Immediately after having run the model in 13.3, store the estimates for later use.
13.4.1. I.e., estimates store ri
13.4.2. You could call these ug instead of ri to indicate unconditional growth, as per ALDA.
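As flagged at 13.2, the centred time variable has to be created first. A minimal sketch, assuming the dataset holds age in years and we centre at 14 (as in ALDA):

    * centre age at 14 so the intercept refers to alcohol use at age 14
    generate age_14 = age - 14

    * rerun the random intercept (unconditional growth) model with centred time
    xtmixed alcuse age_14 || id:, mle var
    estimates store ri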
13.5. Now designate time as having a random slope. Random slopes for time mean that we are allowing the slope of the outcome over time to take on a different value across the values of the level 2 variable (i.e., person).
13.6. E.g., xtmixed alcuse age_14 || id: age_14, cov(uns) mle var
13.6.1. As usual, we specify the variance-covariance matrix of the random effects as unstructured rather than fixing the covariance at 0. In some instances, we may wish to allow for autocorrelation across the errors for longitudinal data. This is on the more advanced end of the spectrum.
13.6.2. Next, store these estimates to assess whether the variance of the slope is significantly different from zero. If it is not, then it is evidence that we need not treat the slope as random and can safely regard it as a fixed effect. We immediately store the estimates again for use below:
13.6.2.1. estimates store rc
13.6.3. The same form of likelihood ratio test is used.
13.6.4. E.g., lrtest ri rc
13.6.5. Don't forget to correct the given p value.
14. The inclusion of further covariates now follows just as in the cross-sectional case.
14.1. E.g., xtmixed alcuse age_14 male coa peer || id: age_14, mle var cov(uns)
14.2. male = dummy for gender; coa = child of alcoholic; peer = reported peer alcohol use.
15. Finally, using the efficient factor-variable approach, you can check gender as a predictor of the slope of alcohol use over time.
15.1. xtmixed alcuse c.age_14##i.male coa peer || id: age_14, mle var cov(uns)
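As a recap, steps 12-15 can be strung together into a single do-file sketch. Everything here repeats commands given in the steps above; the only addition is the display line for the boundary-corrected p value.

    * 12. variance components (empty) model
    xtmixed alcuse || id:, mle var

    * 13. random intercept (unconditional growth) model with centred time
    xtmixed alcuse age_14 || id:, mle var
    estimates store ri

    * 13.5-13.6. random slope for time, unstructured covariance
    xtmixed alcuse age_14 || id: age_14, cov(uns) mle var
    estimates store rc

    * likelihood ratio test of the random slope; halve the p value
    lrtest ri rc
    display "corrected p = " r(p)/2

    * 14. add time-invariant covariates
    xtmixed alcuse age_14 male coa peer || id: age_14, mle var cov(uns)

    * 15. does gender moderate the change in alcohol use over time?
    xtmixed alcuse c.age_14##i.male coa peer || id: age_14, mle var cov(uns)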