hospital - BIOSTATISTICS Lectures

Lab 8: Linear Regression
Simple Linear Regression
Exercise:
Statistics that summarize personal health care expenditures by state for the years 1966 through
1982 have been examined in an attempt to understand the issues related to rising health care
costs. Suppose that you are interested in focusing on the relationship between expense per
admission into a community hospital and average length of stay in the facility. The data set
hospital contains this information for each state in the United States for 1982. The measures of
mean expense per admission are saved under the variable name expadm; the corresponding average
lengths of stay are saved under los.
a) Generate numerical summary statistics for the variables expense per admission and
length of stay in the hospital (means, standard deviations, minimum and maximum values).
Analyze→ Descriptive Statistics→ Descriptives
Descriptive Statistics
                                  N    Minimum    Maximum    Mean         Std. Deviation
mean expense per admission ($)    51   1772,00    4612,00    2716,8039    603,94708
average length of stay (days)     51   5,40       9,70       7,4902       1,01514
average salary ($)                51   11928,00   23594,00   14852,4118   1965,51403
Valid N (listwise)                51
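For readers who want to see what the Descriptives procedure computes, here is a minimal Python sketch. The helper name describe and the five values in toy_los are made up for illustration; they are not the hospital data. The sketch assumes the sample standard deviation with the n - 1 denominator, which is what the Std. Deviation column reports.

```python
import statistics

def describe(values):
    """Return N, minimum, maximum, mean and sample standard deviation."""
    return {
        "N": len(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Mean": statistics.mean(values),
        # statistics.stdev divides by n - 1 (sample standard deviation)
        "Std. Deviation": statistics.stdev(values),
    }

# Hypothetical lengths of stay in days, for illustration only (not the hospital data)
toy_los = [5.4, 6.8, 7.5, 8.1, 9.7]
summary = describe(toy_los)
for name, value in summary.items():
    print(name, value)
```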
b) Construct a two way scatter plot of mean expense per admission versus length of stay.
What does the scatter plot suggest about the nature of the relationship between these
variables?
Graphs→ Chart Builder→ Scatter/Dot→ Simple Scatter
c) Using expense per admission as the response and length of stay as the explanatory
variable, compute the regression line. Interpret the estimated slope and intercept.
Analyze → Regression→ Linear
ANOVA^a
Model            Sum of Squares   df   Mean Square    F       Sig.
1  Regression    1890784,672      1    1890784,672    5,668   ,021^b
   Residual      16346819,367     49   333608,559
   Total         18237604,039     50
a. Dependent Variable: mean expense per admission ($)
b. Predictors: (Constant), average length of stay (days)
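The entries of the ANOVA table are tied together by simple arithmetic, which the following Python sketch retraces. The numbers are copied from the SPSS output (decimal commas written as decimal points); the variable names are chosen here for illustration. The last lines check that the total sum of squares agrees with (N - 1) times the squared standard deviation of expense per admission from the Descriptive Statistics table.

```python
# Values copied from the SPSS ANOVA table (decimal commas written as points)
ss_regression = 1890784.672
ss_residual = 16346819.367
df_regression = 1
df_residual = 49

# Mean square = sum of squares / degrees of freedom
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F = regression mean square / residual mean square
f_statistic = ms_regression / ms_residual

# Total sum of squares and the share explained by the regression (R squared)
ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total

# Standard error of the estimate = square root of the residual mean square
std_error_estimate = ms_residual ** 0.5

# Cross-check against the descriptives: SS total = (N - 1) * (Std. Deviation)^2
ss_total_from_sd = (51 - 1) * 603.94708 ** 2

print(f"F = {f_statistic:.3f}, R squared = {r_squared:.3f}")
print(f"Std. error of the estimate = {std_error_estimate:.2f}")
```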
Coefficients^a
                                   Unstandardized           Standardized
                                   Coefficients             Coefficients
Model                              B          Std. Error    Beta           t       Sig.
1  (Constant)                      1281,959   608,104                      2,108   ,040
   average length of stay (days)   191,563    80,465        ,322           2,381   ,021
a. Dependent Variable: mean expense per admission ($)
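To make the interpretation of the coefficients concrete, the sketch below evaluates the fitted line, expense = 1281.959 + 191.563 × (length of stay), at a few values. The function name predicted_expense is chosen here for illustration. The slope says that one extra day of average stay goes with about $191.56 more in mean expense per admission; the intercept is the predicted expense at a length of stay of zero days, which lies outside the observed range (5.4 to 9.7 days) and should not be read literally.

```python
INTERCEPT = 1281.959  # B for (Constant), in dollars
SLOPE = 191.563       # B for average length of stay, in dollars per day

def predicted_expense(length_of_stay_days):
    """Predicted mean expense per admission ($) from the fitted line."""
    return INTERCEPT + SLOPE * length_of_stay_days

at_7_days = predicted_expense(7.0)
at_8_days = predicted_expense(8.0)

# One extra day changes the prediction by exactly the slope
difference_per_day = at_8_days - at_7_days

# A least-squares line passes through the point of means, so plugging in the
# mean length of stay (7.4902) should give back the mean expense (2716.8039)
at_mean_stay = predicted_expense(7.4902)

print(f"7 days: {at_7_days:.2f}, 8 days: {at_8_days:.2f}")
print(f"at the mean length of stay: {at_mean_stay:.2f}")
```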
d) Interpret the 95% confidence interval for the true slope of the population regression line.
Analyze→ Regression→ Linear→ Statistics→ Confidence intervals
Coefficients^a
                                   Unstandardized          Standardized                     95,0% Confidence
                                   Coefficients            Coefficients                     Interval for B
Model                              B          Std. Error   Beta           t       Sig.    Lower Bound   Upper Bound
1  (Constant)                      1281,959   608,104                     2,108   ,040    59,929        2503,990
   average length of stay (days)   191,563    80,465       ,322           2,381   ,021    29,862        353,264
a. Dependent Variable: mean expense per admission ($)
e) What is the coefficient of determination for the regression? (R squared)
Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       ,322^a   ,104       ,085                577,58857
a. Predictors: (Constant), average length of stay (days)
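The confidence bounds in part d) follow from estimate ± t × (standard error), with N - 2 = 49 degrees of freedom. The Python sketch below retraces that calculation. The critical value 2.0096 is an assumed value taken from a t table for 49 degrees of freedom and is not part of the SPSS output; the function name confidence_interval is chosen here for illustration.

```python
T_CRITICAL = 2.0096  # assumed t-table value for the 0.975 quantile, 49 df

def confidence_interval(estimate, std_error, t_critical=T_CRITICAL):
    """Return (lower, upper) bounds: estimate +/- t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin

# B and Std. Error copied from the Coefficients table
slope_lower, slope_upper = confidence_interval(191.563, 80.465)
const_lower, const_upper = confidence_interval(1281.959, 608.104)

print(f"slope:    {slope_lower:.2f} to {slope_upper:.2f}")
print(f"constant: {const_lower:.2f} to {const_upper:.2f}")
```

Because the interval for the slope does not contain zero, it agrees with the significance value of ,021 reported for length of stay.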
f) What is the correlation between expense per admission and length of stay? How is it related
to the coefficient of determination?
Analyze→ Correlate →Bivariate
Correlations
                                                        mean expense per   average length
                                                        admission ($)      of stay (days)
mean expense per admission ($)   Pearson Correlation    1                  ,322*
                                 Sig. (2-tailed)                           ,021
                                 N                      51                 51
average length of stay (days)    Pearson Correlation    ,322*              1
                                 Sig. (2-tailed)        ,021
                                 N                      51                 51
*. Correlation is significant at the 0.05 level (2-tailed).
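In simple linear regression the Pearson correlation, the slope and R squared are linked: r equals the slope multiplied by the ratio of the two standard deviations, and r squared equals R squared. The Python sketch below checks both relations with the figures reported in the SPSS output (decimal commas written as points); the variable names are chosen here for illustration.

```python
slope = 191.563        # B for length of stay, from the Coefficients table
sd_los = 1.01514       # Std. Deviation of average length of stay
sd_expadm = 603.94708  # Std. Deviation of mean expense per admission

# r = slope * (sd of x) / (sd of y); this is also the standardized Beta
r = slope * sd_los / sd_expadm

# Squaring the correlation gives the coefficient of determination
r_squared = r ** 2

print(f"r = {r:.3f}, r squared = {r_squared:.3f}")
```

Rounded to three decimals, these reproduce the correlation of ,322 and the R Square of ,104: about 10% of the variation in expense per admission is explained by length of stay.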
g) Construct a plot of residuals versus fitted values of expense per admission. In what three ways
does the residual plot help you to evaluate the fit of the regression model?
To save the fitted values and standardized residuals, use the Save button of the
regression model fitting window:
Analyze→ Regression→ Linear→ Save→
Predicted Values→ Unstandardized
Residuals→ Standardized
Your data should then look like the following:
Now plot fitted values versus standardized residuals. Also put a horizontal line at y=0 using
the graph editor menu.
A plot of the residuals serves three purposes.
1) It can help us to detect outlying observations in the sample.
(In our plot no residual is markedly larger than the others.)
2) It can suggest a failure in the assumption of homoscedasticity. Homoscedasticity means
that the standard deviation of the outcomes (y) is constant across all values of x. If the range of
the magnitudes of the residuals either increases or decreases as the fitted values become larger,
this implies that the standard deviation of the outcomes (y) does not take the same value for all
values of x.
(No such pattern is evident in our plot suggesting the assumption of homoscedasticity does
not appear to be violated)
3) If the residuals do not exhibit a random scatter but instead follow a trend, this would
suggest that the true relationship between x and y might not be linear, or that some important
explanatory variables might have been left out of the model.
(Our plot does not exhibit a random scatter)
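The quantities saved in part g) can be sketched in Python as follows. This is an illustration under stated assumptions, not the SPSS implementation: the standardized residual is taken to be the raw residual divided by the standard error of the estimate (residual sum of squares over n - 2, square-rooted), and the cut-off of 2 used to flag large residuals is a common rule of thumb rather than anything from the handout. The function name and the toy data are made up and are not the hospital data.

```python
def fit_and_standardize(x, y):
    """Least-squares fit of y on x; return fitted values and standardized residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    # Standard error of the estimate, with n - 2 degrees of freedom
    s = (sum(e ** 2 for e in residuals) / (n - 2)) ** 0.5
    standardized = [e / s for e in residuals]
    return fitted, standardized

# Toy data for illustration only (not the hospital data)
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
fitted, z = fit_and_standardize(toy_x, toy_y)

# Indices of observations whose standardized residual exceeds 2 in magnitude
flagged = [i for i, zi in enumerate(z) if abs(zi) > 2]

for f, zi in zip(fitted, z):
    print(f"fitted = {f:.2f}, standardized residual = {zi:.3f}")
```

Plotting z against fitted, with a horizontal line at zero, gives the residual plot discussed above.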
Multiple Linear Regression
Exercise:
Consider the data set hospital used above, which also contains the average salary per employee in 1982.
a) Summarize the average salary per employee both graphically (boxplot) and
numerically (mean, std. dev., min, max).
The 50th observation looks like an outlier.
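As a rough numerical companion to the boxplot, the largest and smallest salaries in the Descriptive Statistics table can be expressed in standard-deviation units. This is only a sketch: it is not the rule SPSS uses to mark points on a boxplot, and whether the maximum salary belongs to the 50th observation is an assumption, since the table does not say which state it comes from.

```python
# Values copied from the Descriptive Statistics table (decimal commas as points)
salary_mean = 14852.4118
salary_sd = 1965.51403
salary_max = 23594.00
salary_min = 11928.00

# Distance of each extreme from the mean, in standard deviations
z_max = (salary_max - salary_mean) / salary_sd
z_min = (salary_min - salary_mean) / salary_sd

print(f"maximum: {z_max:.2f} SD above the mean")
print(f"minimum: {abs(z_min):.2f} SD below the mean")
```

The maximum lies more than four standard deviations above the mean while the minimum lies about one and a half below it, which is consistent with a single unusually high value.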
b) Construct a two way scatter plot of mean expense per admission versus average salary.
What does the scatter plot suggest about the nature of the relationship between these
variables?
c) Fit a regression model taking mean expense per admission as the response and average
length of stay and average salary as the explanatory variables. Interpret the estimated
regression coefficients.
d) What happens to the estimated coefficient of length of stay when average salary is added to
the model?
e) Does the inclusion of salary in addition to average length of stay improve your ability to
predict mean expense per admission? (Check whether it is significant, check whether R
squared has increased etc.)
f) Examine a plot of the residuals versus the fitted values of expense per admission. What
does the plot tell you about the fit of the model?
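For part e), two standard quantities help to judge whether adding salary improves the model: the adjusted R squared, and the F statistic for the change in R squared. The Python sketch below defines both. The function names are chosen here for illustration, and only the simple-regression figures from the first exercise are plugged in; the R squared of the two-predictor model has to be read from your own SPSS output.

```python
def adjusted_r_squared(r_squared, n, n_predictors):
    """Adjusted R squared for a model with n observations and n_predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)

def r_squared_change_f(r2_reduced, r2_full, n, p_full, q_added):
    """F statistic for adding q_added predictors to reach a model with p_full predictors."""
    numerator = (r2_full - r2_reduced) / q_added
    denominator = (1 - r2_full) / (n - p_full - 1)
    return numerator / denominator

# Simple regression of expense on length of stay (sums of squares from the ANOVA table)
r2_simple = 1890784.672 / 18237604.039
adj_simple = adjusted_r_squared(r2_simple, n=51, n_predictors=1)

# Sanity check: comparing the simple model with an intercept-only model
# (R squared of 0) reproduces the overall F of the ANOVA table
f_overall = r_squared_change_f(0.0, r2_simple, n=51, p_full=1, q_added=1)

print(f"adjusted R squared = {adj_simple:.3f}, overall F = {f_overall:.3f}")
```

With the two-predictor model, call r_squared_change_f(r2_simple, r2_full, n=51, p_full=2, q_added=1), where r2_full is the R Square from your Model Summary, and compare the result with an F distribution on 1 and 48 degrees of freedom.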