Now we will look at the example from the textbook on Glucose and Exercise again
in more detail.
We will start with a discussion of the SAS code.
To begin, it is best if we use formats to explain our categorical variables. For now
we are only looking at the YN format for EXERCISE where we tell SAS that in the
raw data 0 represents NO and 1 represents YES for variables given this format.
Here we also go ahead and format the physical activity variable used later. Putting
the number inside the format label helps keep most of these levels in order in our
results; otherwise, SAS would order them alphabetically.
When we format variables, we usually need to refer to the exact formatted value in
our code. If you do not format values you will need to alter those statements to refer
to a numeric value and remove the quotations used in our examples. We will point
this out when we see it in the code.
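A minimal sketch of what these format definitions might look like. The format names YN and PHY and the label No come from the discussion here; the exact spelling of the Yes label and all five physical activity labels are assumptions made only for illustration.

```sas
/* Sketch only: format names YN and PHY are from the text.           */
/* The 'Yes' spelling and every PHY label below are assumed values.  */
proc format;
  value yn
    0 = 'No'
    1 = 'Yes';
  /* The leading number keeps the levels in their natural order */
  /* when SAS sorts the formatted values alphabetically.        */
  value phy
    1 = '1: Much less active'
    2 = '2: Somewhat less active'
    3 = '3: About as active'
    4 = '4: Somewhat more active'
    5 = '5: Much more active';
run;
```

Whatever labels you choose, later statements that refer to a level must match the label exactly, including case.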
Here I created a temporary dataset called HERS from my permanent dataset in a
library called TEMP.
We have three ways to select only those without diabetes.
• if diabetes = 1 then delete; This removes those with diabetes and is usually the
clearest method.
• where diabetes = 0; This selects only those without diabetes using a WHERE
statement.
• if diabetes = 0; This selects only those without diabetes using IF without any
THEN (a subsetting IF).
Finally we format the variables exercise and drinkany using the YN format we
created and the physact variable using the PHY format we created.
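The data step described above can be sketched as follows. The name of the permanent dataset inside the TEMP library is not given in this discussion, so TEMP.HERS below is an assumption, and the sketch assumes the TEMP library and the YN and PHY formats have already been defined.

```sas
/* Sketch only: the permanent dataset name TEMP.HERS is an assumption; */
/* the text gives only the library (TEMP) and the new dataset (HERS).  */
data hers;
  set temp.hers;
  if diabetes = 1 then delete;   /* option 1: drop those with diabetes */
  /* where diabetes = 0; */      /* option 2: WHERE statement          */
  /* if diabetes = 0; */         /* option 3: subsetting IF            */
  format exercise drinkany yn. physact phy.;
run;
```

One difference worth keeping in mind: if diabetes has any missing values, option 1 keeps those observations, while options 2 and 3 drop them. When there are no missing values, all three give the same dataset.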
We will skip the SAS code for the exploratory data analysis conducted and simply
show some preliminary results.
Here is the distribution of the outcome variable, Glucose. It is slightly skewed right
but for now we will work with the raw Glucose values instead of performing any
transformations.
Here are the graphs we discussed earlier: the standard side-by-side boxplots on
the right and the strange scatterplot on the left, with the mean values marked by
red stars.
We can see that the mean glucose is slightly lower in the YES group than the NO
group for our sample.
Our Theoretical Model for the mean glucose for a given level of exercise is
• E [GLUCOSE | EXERCISE] = β0 + β1(EXERCISE)
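Because EXERCISE takes only the values 0 and 1, the model above gives exactly two population means. Written out, with NO as 0 and YES as 1:

```latex
\begin{align*}
E[\text{GLUCOSE} \mid \text{EXERCISE} = 0] &= \beta_0 \\
E[\text{GLUCOSE} \mid \text{EXERCISE} = 1] &= \beta_0 + \beta_1 \\
\beta_1 &= E[\text{GLUCOSE} \mid \text{EXERCISE} = 1]
         - E[\text{GLUCOSE} \mid \text{EXERCISE} = 0]
\end{align*}
```

So the intercept is the mean for the reference group and the slope is the difference in means between the two groups.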
Here we are using PROC GLM to model the outcome of Glucose vs. the binary
predictor exercise. We will set this up to compare Yes (as X = 1) to No (as X = 0).
We begin with PROC GLM using our HERS data. We add the PLOTS = ALL option
but for now we are not going to unpack any graphs. Our goal is to focus on
interpretations of parameter estimates and working with theoretical and estimated
regression models. Later we will come back to model validation and selection.
We add a CLASS statement for our categorical variable exercise. After the variable
name, we add parentheses and the REF = option. The value we use here will
determine which level of our variable is used as the reference group. Here we want
to use NO as our reference group. We need to put quotation marks around exactly
the formatted value from our PROC FORMAT. It is case sensitive, so we have a
capital N and a lowercase o.
If you do not format values you would use the actual number. Since NO is coded as
0 in the data you would use REF = 0 with no quotation marks around the zero.
Earlier versions of GLM do not have the REF option in the CLASS statement. If your
version is like this, let me know and I will help you with a trick to make things work
the same.
Notice that our Reference group will correspond to X = 0 even if it is not coded that
way in the original data!!!
Then we have our MODEL statement with our outcome, glucose, on the left of the
equals sign and our predictor, exercise, on the right of the equals sign. We add two
options, SOLUTION and CLPARM.
• CLPARM gives confidence intervals for the parameters.
• SOLUTION gives the parameter estimates table – you may not always need this
but often you will.
Then we have our run statement and technically PROC GLM doesn’t complete
without a QUIT statement although this is not usually necessary to obtain the
output.
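Putting the pieces described above together, the call might look like the sketch below. It assumes the temporary HERS dataset, the variable names glucose and exercise, and a YN format whose label for 0 is exactly No.

```sas
/* Sketch of the PROC GLM call described above.                    */
/* Assumes exercise carries a format whose label for 0 is 'No'.    */
proc glm data=hers plots=all;
  class exercise (ref='No');                   /* NO is the reference (X = 0) */
  model glucose = exercise / solution clparm;  /* estimates and their CIs     */
run;
quit;                                          /* ends the interactive procedure */
```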
Now let's look at the results.
The class level information will now be important to us as it tells us something about
how SAS created the underlying dummy variables.
The value listed LAST will be the reference group.
From this we can check that we have correctly handled our binary variable. We will
know that the first level listed, in this case YES is represented by exercise = 1 in the
model and reference group, listed last, in this case NO will be represented by
exercise = 0 in the model.
In this case, this happens to be the same as the raw data but that will not always be
the case as we will see.
We also get the number of observations read and used which can show us if there
were any observations with missing values for either of these variables but really,
we should already know if that were the case from our exploratory analysis.
Then we have our overall ANOVA table.
• We know the degrees of freedom for the model should be 1 since there is only
one variable, exercise, in our model.
• Since there were 2032 observations, we see the degrees of freedom for total are
n-1 or 2031.
• And either by subtraction or the equation, the degrees of freedom for error are
n-2 or 2030.
• The sums of squares for each term are provided and you can see that the model
and error sums of squares add to the total.
• The mean squares are the sum of squares divided by the degrees of freedom
and you could verify that MSE is 94.3968 yourself by dividing the error or residual
sum of squares by 2030.
• The overall F-statistic is the mean square for the model divided by the mean
square for error. In this case it is 14.97 and is highly statistically significant.
• This says that overall, the model using exercise does explain a statistically
significant portion of the variation in glucose.
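The arithmetic behind the ANOVA table can be written out as follows, using only the values quoted above. The abbreviations (SSE for the error sum of squares, MSM and MSE for the model and error mean squares) are my notation, not labels from the SAS output.

```latex
\begin{align*}
df_{\text{model}} &= 1, \qquad
df_{\text{total}} = n - 1 = 2032 - 1 = 2031, \qquad
df_{\text{error}} = n - 2 = 2030 \\
\text{MSE} &= \frac{\text{SSE}}{df_{\text{error}}} = \frac{\text{SSE}}{2030} = 94.3968 \\
F &= \frac{\text{MSM}}{\text{MSE}} = 14.97
\end{align*}
```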
Then we have a table containing
• R-square – which is 0.0073. Even though the model explains a statistically
significant portion, it is not much. Overall, exercise only explains about 0.7% of
the variation in glucose.
• Root MSE is 9.715 – the estimated standard deviation of the observations around
the regression line.
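Both values in this table can be checked against the ANOVA results. The sketch below uses only numbers already quoted (MSE = 94.3968, F = 14.97, error degrees of freedom = 2030); the shortcut for R-square holds here because the model has a single degree of freedom.

```latex
\begin{align*}
\text{Root MSE} &= \sqrt{\text{MSE}} = \sqrt{94.3968} \approx 9.716
  \quad \text{(reported as 9.715)} \\
R^2 &= \frac{\text{SSM}}{\text{SST}}
     = \frac{F \cdot \text{MSE}}{F \cdot \text{MSE} + df_{\text{error}} \cdot \text{MSE}}
     = \frac{F}{F + df_{\text{error}}}
     = \frac{14.97}{14.97 + 2030} \approx 0.0073
\end{align*}
```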
The last table of interest is the parameter estimates table. We need to use this to
write our estimated regression model.
• Here our intercept is 97.36 and our slope is negative 1.69.
• The NO row with a zero estimate and missing values is not relevant and should
be ignored. It only tells us that NO is the reference group and we are comparing
YES to NO with our slope.
• Our equation is Y-hat (our estimated mean glucose) = 97.36 – 1.69 times
Exercise. Where Exercise = 1 for YES and 0 for NO.
• The slope can be interpreted by saying: The population mean glucose for those
who exercise is expected to be 1.69 units less than the population mean glucose
for those who do not exercise. The 95% CI is -2.55 to -0.83.
• The p-value for the slope is 0.0001 and so this change in the population mean
glucose is statistically significantly different from zero. There is a statistically
significant difference in the mean glucose between exercisers and nonexercisers.
There are only two predicted values which result from this model:
• Estimated Mean Glucose for NO using X = 0:
• Y-hat = 97.36 – 1.69(0) = 97.36, this is simply the y-intercept from the
model.
• Estimated Mean Glucose for YES using X = 1:
• Y-hat = 97.36 – 1.69(1) = 95.67
For each level (No/Yes) we could also obtain confidence limits for the mean or
prediction limits for a future observed value.
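One possible way to obtain those limits is the OUTPUT statement in PROC GLM, sketched below. The output dataset names (PRED, PRED_LEVELS) and the new variable names are hypothetical choices for illustration; the sketch again assumes the HERS dataset and the No label used earlier.

```sas
/* Sketch only: PRED, PRED_LEVELS, and the new variable names are hypothetical. */
proc glm data=hers;
  class exercise (ref='No');
  model glucose = exercise / solution clparm;
  output out=pred
         p=yhat                          /* predicted mean glucose          */
         lclm=mean_lo uclm=mean_hi       /* confidence limits for the mean  */
         lcl=pred_lo  ucl=pred_hi;       /* prediction limits, future value */
run;
quit;

/* Keep one row per exercise level, since every observation */
/* in a group has the same predicted value and limits.      */
proc sort data=pred out=pred_levels nodupkey;
  by exercise;
run;

proc print data=pred_levels;
  var exercise yhat mean_lo mean_hi pred_lo pred_hi;
run;
```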
If we were to reverse the reference groups, all results would stay the same except:
• The intercept – it would now represent exercise = YES and its entire row
changes to reflect this
• The slope and its confidence interval will have the opposite sign of their previous
values. Here we have a positive 1.69, and the values for the confidence interval
have also become positive.
• The interpretation here is that those who DO NOT exercise have a population
mean glucose which is expected to be 1.69 units LARGER than that for those
who DO exercise. This is the same conclusion just stated in the opposite
direction for the comparison.
• Remember that the reference is X = 0 and here this is NOT as in the original
data!!!
• To use this model YES would be X = 0 and NO would be X = 1. Try finding the
two predicted values here and convince yourself you get the same values for the
two groups regardless of which group is used as the reference.
In practice, you should choose the reference group and resulting comparison that
makes the most sense to you in your situation.
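Reversing the reference group only requires changing the REF = value. A minimal sketch, assuming the YN format labels 1 as exactly Yes (the spelling and case of that label are an assumption):

```sas
/* Sketch only: assumes the formatted value for exercise = 1 is 'Yes'. */
proc glm data=hers;
  class exercise (ref='Yes');                  /* YES is now the reference (X = 0) */
  model glucose = exercise / solution clparm;
run;
quit;
```

With this coding the intercept is the estimated mean for YES, which from the earlier predicted values should be 95.67, and the estimated mean for NO is 95.67 + 1.69(1) = 97.36. These are the same two group means as before.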
Here are the results, referencing NO again, but using PROC REG – we don’t
illustrate this code directly but you can see that SAS no longer provides information
about the levels of exercise as YES or NO. This is a nice benefit to using PROC
GLM and one reason we no longer use PROC REG.
For this discussion, we just need to compare the t-test and correlation to these
results – which you can verify are the same as those from PROC GLM.
The outlined values are shown to be the same in the following results of the t-test
and correlation.
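Although the PROC REG code is not shown here, a call along the following lines would fit the same model. This is only a sketch: it assumes exercise is stored as numeric 0/1 in the HERS dataset, as described earlier, and the CLB option is my choice for requesting confidence limits on the estimates.

```sas
/* Sketch only: assumes exercise is numeric with 0 = NO and 1 = YES. */
/* PROC REG has no CLASS statement, so the raw 0/1 coding is used.   */
proc reg data=hers;
  model glucose = exercise / clb;   /* CLB adds confidence limits for estimates */
run;
quit;
```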
Here we have the t-test and correlation results. The regression is equivalent to the
equal variances t-test, and we can see that the absolute value of the correlation is
the square root of the R-square from our simple linear regression.
• The mean difference (and CI) is the same as the parameter estimate for the
slope (and CI).
• The pooled standard deviation is the same as Root MSE.
• The degrees of freedom, t-value, and p-value are the same for the equal
variances t-test.
• The p-value for the correlation is the same as that for the t-test and for the
regression methods.
• The value of the correlation squared is R-squared for our original model.
Also notice that the means for the two groups are the predicted values provided
from our model, 97.36 and 95.67.
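The comparison results above could be produced with calls like the following. This is a sketch under the same assumptions as before (the HERS dataset with glucose and a numeric 0/1 exercise variable); it is not necessarily the exact code used for these slides.

```sas
/* Two-sample t-test: the Pooled row is the equal variances test */
proc ttest data=hers;
  class exercise;
  var glucose;
run;

/* Pearson correlation between glucose and the 0/1 exercise variable */
proc corr data=hers;
  var glucose exercise;
run;
```

The sign of the mean difference reported by PROC TTEST depends on which exercise level it lists first, so it may appear with the opposite sign from the regression slope while having the same magnitude.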