Now we will look again, in more detail, at the textbook example on glucose and exercise. We will start with a discussion of the SAS code.

To begin, it is best to use formats to label our categorical variables. For now we are only looking at the YN format for EXERCISE, where we tell SAS that, for variables given this format, 0 in the raw data represents NO and 1 represents YES. Here we also go ahead and format the physical activity variable used later. Putting the number inside the format label helps keep most of these levels in order in our results; otherwise, SAS would order them alphabetically. When we format variables, we usually need to refer to the exact formatted value in our code. If you do not format values, you will need to alter those statements to refer to the numeric value and remove the quotation marks used in our examples. We will point this out when we see it in the code.

Here I created a temporary dataset called HERS from my permanent dataset in a library called TEMP. We have three ways to select only those without diabetes.
• if diabetes = 1 then delete; This removes those with diabetes and is usually the clearest method.
• where diabetes = 0; This selects only those without diabetes using a WHERE statement.
• if diabetes = 0; This selects only those without diabetes by using a subsetting IF without any THEN.
Finally, we format the variables exercise and drinkany using the YN format we created and the physact variable using the PHY format we created.

We will skip the SAS code for the exploratory data analysis and simply show some preliminary results. Here is the distribution of the outcome variable, glucose. It is slightly skewed right, but for now we will work with the raw glucose values instead of performing any transformations.

Here are the graphs we discussed earlier: the standard side-by-side boxplots on the right and the strange scatterplot on the left, with the mean values denoted by red stars.
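As a rough sketch of the format and data step setup described above, something like the following could be used. The PHY labels, the permanent dataset name temp.hersdata, and the exact spelling of 'Yes' are assumptions for illustration; only the names YN, PHY, HERS, TEMP and the 'No' label come from the discussion.

```sas
/* Sketch only: items marked "assumed" are illustrative, not taken from the course files */
proc format;
  value yn
    0 = 'No'
    1 = 'Yes';                       /* 'Yes' spelling assumed */
  value phy                          /* assumed labels; the leading number keeps levels in order */
    1 = '1: Much less active'
    2 = '2: Somewhat less active'
    3 = '3: About as active'
    4 = '4: Somewhat more active'
    5 = '5: Much more active';
run;

data hers;                           /* temporary dataset in WORK */
  set temp.hersdata;                 /* hypothetical permanent dataset name */
  if diabetes = 1 then delete;       /* option 1: drop those with diabetes */
  /* where diabetes = 0; */          /* option 2: WHERE statement */
  /* if diabetes = 0; */             /* option 3: subsetting IF */
  format exercise drinkany yn. physact phy.;
run;
```

One thing to keep in mind: option 1 keeps any observations where diabetes is missing, while options 2 and 3 drop them, so the three approaches only give the same dataset when diabetes has no missing values.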
We can see that the mean glucose is slightly lower in the YES group than in the NO group for our sample. Our theoretical model for the mean glucose at a given level of exercise is
• E[GLUCOSE | EXERCISE] = β0 + β1(EXERCISE)

Here we are using PROC GLM to model the outcome, glucose, against the binary predictor exercise. We will set this up to compare YES (as X = 1) to NO (as X = 0). We begin with PROC GLM using our HERS data. We add the PLOTS = ALL option, but for now we are not going to unpack any graphs. Our goal is to focus on interpreting parameter estimates and working with theoretical and estimated regression models. Later we will come back to model validation and selection. We add a CLASS statement for our categorical variable exercise. After the variable name, we add parentheses and the REF = option. The value we use here determines which level of our variable is used as the reference group. Here we want to use NO as our reference group. We need to put quotation marks around exactly the formatted value from our PROC FORMAT. It is case sensitive, so we have a capital N and a lowercase o. If you do not format values, you would use the actual number. Since NO is coded as 0 in the data, you would use REF = 0 with no quotation marks around the zero.

Earlier versions of GLM do not have the REF option in the CLASS statement. If your version is like this, let me know and I will help you with a trick to make things work the same. Notice that our reference group will correspond to X = 0 even if it is not coded that way in the original data! Then we have our MODEL statement with our outcome, glucose, on the left of the equals sign and our predictor, exercise, on the right. We add two options, SOLUTION and CLPARM:
• CLPARM gives confidence intervals for the parameters.
• SOLUTION gives the parameter estimates table. You may not always need this, but often you will.
Then we have our RUN statement. Technically, PROC GLM is an interactive procedure and does not terminate until it reaches a QUIT statement, although QUIT is not usually necessary to obtain the output. Now let's look at the results.

The class level information will now be important to us, as it tells us something about how SAS created the underlying dummy variables. The value listed LAST will be the reference group. From this we can check that we have correctly handled our binary variable. We know that the first level listed, in this case YES, is represented by exercise = 1 in the model, and the reference group, listed last, in this case NO, is represented by exercise = 0 in the model. Here this happens to be the same as the raw data, but that will not always be the case, as we will see. We also get the number of observations read and used, which can show us whether any observations had missing values for either of these variables. Really, though, we should already know that from our exploratory analysis.

Then we have our overall ANOVA table.
• We know the degrees of freedom for the model should be 1, since there is only one variable, exercise, in our model.
• Since there were 2032 observations, the degrees of freedom for total are n - 1, or 2031.
• Either by subtraction or from the equation, the degrees of freedom for error are n - 2, or 2030.
• The sums of squares for each term are provided, and you can see that the model and error sums of squares add to the total.
• The mean squares are the sums of squares divided by the degrees of freedom. You could verify that MSE is 94.3968 yourself by dividing the error (residual) sum of squares by 2030.
• The overall F-statistic is the mean square for the model divided by the mean square for error. In this case it is 14.97 and is highly statistically significant.
• This says that, overall, the model using exercise does explain a statistically significant portion of the variation in glucose.

Then we have a table containing
• R-square, which is 0.0073. Even though the model explains a statistically significant portion, it is not much. Overall, exercise only explains about 0.7% of the variation in glucose.
• Root MSE, which is 9.715. This is the square root of MSE and estimates the standard deviation of glucose around the regression line.

The last table of interest is the parameter estimates table. We need to use this to write our estimated regression model.
• Here our intercept is 97.36 and our slope is negative 1.69.
• The NO row, with a zero estimate and missing values, is not relevant and should be ignored. It only tells us that NO is the reference group and that we are comparing YES to NO with our slope.
• Our equation is Y-hat (our estimated mean glucose) = 97.36 - 1.69(EXERCISE), where EXERCISE = 1 for YES and 0 for NO.
• The slope can be interpreted by saying: the population mean glucose for those who exercise is estimated to be 1.69 units less than the population mean glucose for those who do not exercise. The 95% CI is -2.55 to -0.83.
• The p-value for the slope is 0.0001, so this difference in the population mean glucose is statistically significantly different from zero. There is a statistically significant difference in the mean glucose between exercisers and non-exercisers.

There are only two predicted values which result from this model:
• Estimated mean glucose for NO, using X = 0: Y-hat = 97.36 - 1.69(0) = 97.36. This is simply the y-intercept from the model.
• Estimated mean glucose for YES, using X = 1: Y-hat = 97.36 - 1.69(1) = 95.67.
For each level (NO/YES) we could also obtain confidence limits for the mean or prediction limits for a future observed value.

Next, suppose we reverse the reference groups.
All results would stay the same except:
• The intercept. It would now represent exercise = YES, and its entire row changes to reflect this.
• The slope and its confidence interval, which take the opposite sign of their previous values. Here we have a positive 1.69, and the confidence limits have also become positive.
• The interpretation here is that those who DO NOT exercise have a population mean glucose which is estimated to be 1.69 units LARGER than that for those who DO exercise. This is the same conclusion, just stated in the opposite direction for the comparison.
• Remember that the reference group is X = 0, and here this is NOT as in the original data!
• To use this model, YES would be X = 0 and NO would be X = 1. Try finding the two predicted values here and convince yourself that you get the same values for the two groups regardless of which group is used as the reference.
In practice, you should choose the reference group, and the resulting comparison, that makes the most sense to you in your situation.

Here are the results, referencing NO again, but using PROC REG. We don't illustrate this code directly, but you can see that SAS no longer provides information about the levels of exercise as YES or NO. Having that information is a nice benefit of using PROC GLM and one reason we no longer use PROC REG. For this discussion, we just need to compare the t-test and correlation to these results, which you can verify are the same as those from PROC GLM. The outlined values are shown to be the same in the following t-test and correlation results.

Here we have the t-test and correlation results. The regression is equivalent to the equal variances t-test, and we can see that the correlation is, in magnitude, the square root of the R-square from our simple linear regression.
• The mean difference (and CI) is the same as the parameter estimate for the slope (and CI).
• The pooled standard deviation is the same as Root MSE.
• The degrees of freedom, t-value, and p-value are the same for the equal variances t-test.
• The p-value for the correlation is the same as that from the t-test and from the regression.
• The value of the correlation squared is the R-square for our original model.
Also notice that the means for the two groups are the predicted values provided by our model, 97.36 and 95.67.
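For reference, the analyses discussed in this section might be run along the following lines. This is a sketch, assuming the HERS dataset and YN format described earlier, with formatted values 'No' and 'Yes'. The PROC TTEST and PROC CORR calls are one plausible way to produce the comparison output, not necessarily the code used for these results.

```sas
/* Regression of glucose on exercise, with No as the reference group (X = 0) */
proc glm data=hers plots=all;
  class exercise (ref='No');         /* must match the formatted value exactly, including case */
  model glucose = exercise / solution clparm;
run;
quit;

/* Same model with the reference group reversed (Yes becomes X = 0) */
proc glm data=hers;
  class exercise (ref='Yes');        /* 'Yes' spelling assumed */
  model glucose = exercise / solution clparm;
run;
quit;

/* Two-sample t-test; the pooled (equal variances) rows correspond to the regression */
proc ttest data=hers;
  class exercise;
  var glucose;
run;

/* Pearson correlation between glucose and the 0/1 exercise variable */
proc corr data=hers;
  var glucose exercise;
run;
```

When comparing the output, note that the sign of the mean difference from PROC TTEST depends on which group it lists first, so it may be the negative of the slope from PROC GLM.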