Newsom
UPS 654 Data Analysis II
Fall 2015
Regression Diagnostic Examples with R
R Code
#clear active frame from previous analyses
rm(mydata)
library(lessR)
mydata = Read("c:/jason/spsswin/da2/ccwa3_2_1.sav",quiet=TRUE)
#lessR full Regression function has a number of diagnostics by default
#Simple regression output is slightly different: it includes a plot with confidence bands
Regression(SALARY ~ PUBS)
#multiple regression using lessR
Regression(SALARY ~ PUBS + TIME, res.sort="off")
#res.sort="off" turns off sorting by Cook's distance; res.sort can also sort by other diagnostics
#some diagnostics with car package
library(car)
model <- lm(SALARY ~ PUBS + TIME, data = mydata)
summary(model)
#single residual plot by fitted values
#plot(model, which = 1)
#unadjusted residual plots by IV and predicted (fitted) values
residualPlots(lm(SALARY ~ PUBS + TIME, data = mydata))
#partial residual plots
crPlots(lm(SALARY ~ PUBS + TIME, data = mydata))
#outlier test (may want to ignore significance tests and arbitrary definitions of outliers)
outlierTest(model)
#diagnostic values and their plots, id.n= specifies number of cases desired
influencePlot(lm(SALARY ~ PUBS + TIME, data = mydata),id.n=15)
#obtain dfbetas (standardized; dfbeta() gives the unstandardized values)
dfbetas(lm(SALARY ~ PUBS + TIME, data = mydata))
#multicollinearity diagnostic
vif(model)
#test of autocorrelation of residuals (nonindependence)
durbinWatsonTest(model)
#normal probability plot
qqPlot(lm(SALARY ~ PUBS + TIME, data = mydata))
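Most of the case diagnostics requested above are also available in base R, which can be handy for checking the package output. The following is a minimal sketch, not part of the original script: it assumes the 15 cases are typed in from the output tables in this handout rather than read from the .sav file, and the names `dat`, `fit`, and `diag_tab` are made up for illustration.

```r
# Data entered by hand from the residual tables in this handout (cases 1-15)
dat <- data.frame(
  PUBS   = c(18, 3, 2, 17, 11, 6, 38, 48, 9, 22, 30, 21, 10, 27, 37),
  TIME   = c(3, 6, 3, 8, 9, 6, 16, 10, 2, 5, 5, 6, 7, 11, 18),
  SALARY = c(51876, 54511, 53425, 61863, 52926, 47034, 66432, 61100,
             41934, 47454, 49832, 47047, 39115, 59677, 61458)
)

fit <- lm(SALARY ~ PUBS + TIME, data = dat)

# Case diagnostics from the stats package (loaded by default)
diag_tab <- data.frame(
  fitted = fitted(fit),
  resid  = resid(fit),
  rstdnt = rstudent(fit),       # externally studentized residuals
  hat    = hatvalues(fit),      # leverage
  dffits = dffits(fit),
  cooks  = cooks.distance(fit)
)

# Sort by Cook's distance, largest first (the lessR default ordering)
print(round(diag_tab[order(-diag_tab$cooks), ], 2))
```

The printed columns should line up with the lessR multiple regression table later in the handout, apart from the sort order and the added leverage column.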
lessR Package Output
Simple Regression
BACKGROUND
Data Frame:
mydata
Response Variable: SALARY, annual salary in dollars
Predictor Variable: PUBS, number of publications
Number of cases (rows) of data: 15
Number of cases retained for analysis: 15
BASIC ANALYSIS
Estimated Model
             Estimate   Std Err  t-value  p-value  Lower 95%  Upper 95%
(Intercept)  46357.45   3072.73   15.087    0.000   39719.22   52995.68
PUBS           335.53    128.07    2.620    0.021      58.85     612.20
Model Fit
Standard deviation of residuals: 6623.62 for 13 degrees of freedom
R-squared: 0.346    Adjusted R-squared: 0.295    PRESS R-squared: 0.157
Null hypothesis that all population slope coefficients are 0:
F-statistic: 6.864    df: 1 and 13    p-value: 0.021
Analysis of Variance
           df        Sum Sq       Mean Sq  F-value  p-value
Model       1  301137778.67  301137778.67     6.86    0.021
Residuals  13  570340400.93   43872338.53
SALARY     14  871478179.60   62248441.40
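The Model Fit values for the simple regression follow directly from the sums of squares in the Analysis of Variance table. A short sketch of the arithmetic, using only numbers copied from that table (the variable names are illustrative):

```r
# Values copied from the simple regression Analysis of Variance table
ss_model <- 301137778.67
ss_resid <- 570340400.93
ss_total <- 871478179.60
n <- 15   # number of cases
k <- 1    # number of predictors

r2      <- ss_model / ss_total
r2_adj  <- 1 - (ss_resid / (n - k - 1)) / (ss_total / (n - 1))
f_stat  <- (ss_model / k) / (ss_resid / (n - k - 1))
p_value <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)
s_resid <- sqrt(ss_resid / (n - k - 1))   # standard deviation of residuals

print(round(c(R2 = r2, R2adj = r2_adj, F = f_stat, p = p_value), 3))
print(round(s_resid, 2))
```

These should agree with the R-squared, adjusted R-squared, F-statistic, p-value, and residual standard deviation reported by lessR, up to rounding.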
RELATIONS AMONG THE VARIABLES
Correlation Matrix
        SALARY  PUBS
SALARY    1.00  0.59
PUBS      0.59  1.00
ANALYSIS OF RESIDUALS AND INFLUENCE
Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
[sorted by Cook's Distance]
[res.rows = 15, out of 15 ]
---------------------------------------------------------
     PUBS    SALARY    fitted      resid  rstdnt  dffits  cooks
 7  38.00  66432.00  59107.44    7324.56    1.25    0.60   0.18
13  10.00  39115.00  49712.71  -10597.71   -1.84   -0.62   0.16
 2   3.00  54511.00  47364.03    7146.97    1.21    0.55   0.15
 3   2.00  53425.00  47028.50    6396.50    1.08    0.52   0.13
 9   9.00  41934.00  49377.18   -7443.18   -1.21   -0.43   0.09
 4  17.00  61863.00  52061.39    9801.61    1.63    0.45   0.09
11  30.00  49832.00  56423.23   -6591.23   -1.06   -0.36   0.06
12  21.00  47047.00  53403.49   -6356.49   -0.99   -0.27   0.04
10  22.00  47454.00  53739.02   -6285.02   -0.98   -0.27   0.04
15  37.00  61458.00  58771.91    2686.09    0.43    0.20   0.02
14  27.00  59677.00  55416.65    4260.35    0.66    0.20   0.02
 8  48.00  61100.00  62462.70   -1362.70   -0.25   -0.19   0.02
 5  11.00  52926.00  50048.23    2877.77    0.44    0.14   0.01
 6   6.00  47034.00  48370.60   -1336.60   -0.21   -0.08   0.00
 1  18.00  51876.00  52396.92    -520.92   -0.08   -0.02   0.00
Data, Predicted, Standard Error of Forecast, 95% Prediction Intervals
[sorted by lower bound of prediction interval]
---------------------------------------------------------------
     PUBS    SALARY      pred       sf    pi:lwr    pi:upr     width
 3   2.00  53425.00  47028.50  7216.09  31439.10  62617.91  31178.81
 2   3.00  54511.00  47364.03  7176.35  31860.46  62867.59  31007.13
 6   6.00  47034.00  48370.60  7069.74  33097.35  63643.86  30546.50
 9   9.00  41934.00  49377.18  6982.67  34292.03  64462.33  30170.30
13  10.00  39115.00  49712.71  6958.12  34680.60  64744.82  30064.23
 5  11.00  52926.00  50048.23  6935.85  35064.24  65032.23  29968.00
 4  17.00  61863.00  52061.39  6851.15  37260.38  66862.40  29602.03
 1  18.00  51876.00  52396.92  6845.32  37608.49  67185.34  29576.85
12  21.00  47047.00  53403.49  6842.21  38621.80  68185.19  29563.39
10  22.00  47454.00  53739.02  6845.96  38949.22  68528.82  29579.61
14  27.00  59677.00  55416.65  6900.45  40509.14  70324.17  29815.03
11  30.00  49832.00  56423.23  6961.27  41384.33  71462.13  30077.80
15  37.00  61458.00  58771.91  7181.53  43257.16  74286.66  31029.51
 7  38.00  66432.00  59107.44  7221.54  43506.25  74708.62  31202.37
 8  48.00  61100.00  62462.70  7727.68  45768.05  79157.34  33389.28
------------------------------------------------------------
Plot 1: Distribution of Residuals
Plot 2: Residuals vs Fitted Values
Plot 3: Regression Line, Confidence and Prediction Intervals
------------------------------------------------------------
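Each 95% prediction interval in the simple regression output is the predicted value plus or minus a t critical value times the standard error of forecast (sf). A small sketch for case 3, using the pred and sf values printed in the table and 13 residual degrees of freedom (the variable names are illustrative):

```r
pred <- 47028.50   # predicted SALARY for case 3 (PUBS = 2)
sf   <- 7216.09    # standard error of forecast for case 3
df   <- 13         # residual degrees of freedom (n - 2)

t_crit <- qt(0.975, df)     # two-sided 95% critical value
lwr    <- pred - t_crit * sf
upr    <- pred + t_crit * sf
width  <- upr - lwr

print(round(c(lwr = lwr, upr = upr, width = width), 2))
```

Because sf is rounded to two decimals in the printout, the result can differ from the tabled pi:lwr, pi:upr, and width by a few hundredths.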
Multiple Regression
Response Variable: SALARY, annual salary in dollars
Predictor 1: PUBS, number of publications
Predictor 2: TIME, years since PhD
Number of cases (rows) of data: 15
Number of cases retained for analysis: 15
BASIC ANALYSIS
Estimated Model
             Estimate   Std Err  t-value  p-value  Lower 95%  Upper 95%
(Intercept)  43082.39   3099.49   13.900    0.000   36329.18   49835.61
PUBS           121.80    149.70    0.814    0.432    -204.36     447.97
TIME           982.87    452.06    2.174    0.050      -2.08    1967.81
Model Fit
Standard deviation of residuals: 5839.23 for 12 degrees of freedom
R-squared: 0.530    Adjusted R-squared: 0.452    PRESS R-squared: 0.325
Null hypothesis that all population slope coefficients are 0:
F-statistic: 6.780    df: 2 and 12    p-value: 0.011
Analysis of Variance
           df        Sum Sq       Mean Sq  F-value  p-value
PUBS        1  301137778.67  301137778.67     8.83    0.012
TIME        1  161181041.67  161181041.67     4.73    0.050
Model       2  462318820.34  231159410.17     6.78    0.011
Residuals  12  409159359.26   34096613.27
SALARY     14  871478179.60   62248441.40
RELATIONS AMONG THE VARIABLES
Correlation Matrix
        SALARY  PUBS  TIME
SALARY    1.00  0.59  0.71
PUBS      0.59  1.00  0.66
TIME      0.71  0.66  1.00
Collinearity
      Tolerance    VIF
PUBS      0.569  1.758
TIME      0.569  1.758
All Possible Subset Regressions
PUBS  TIME  R2adj  X's
   0     1  0.466    1
   1     1  0.452    2
   1     0  0.295    1
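With only two predictors, the all-possible-subsets table can be checked by fitting the three candidate models directly and comparing adjusted R-squared. A minimal sketch, assuming the data are entered by hand from the tables in this handout (`dat`, `models`, and `r2adj` are illustrative names; lessR itself uses the leaps package for this step):

```r
# Data entered by hand from the residual tables in this handout (cases 1-15)
dat <- data.frame(
  PUBS   = c(18, 3, 2, 17, 11, 6, 38, 48, 9, 22, 30, 21, 10, 27, 37),
  TIME   = c(3, 6, 3, 8, 9, 6, 16, 10, 2, 5, 5, 6, 7, 11, 18),
  SALARY = c(51876, 54511, 53425, 61863, 52926, 47034, 66432, 61100,
             41934, 47454, 49832, 47047, 39115, 59677, 61458)
)

# The three non-empty subsets of {PUBS, TIME}
models <- list(
  "TIME"        = SALARY ~ TIME,
  "PUBS + TIME" = SALARY ~ PUBS + TIME,
  "PUBS"        = SALARY ~ PUBS
)

# Adjusted R-squared for each subset
r2adj <- sapply(models, function(f) summary(lm(f, data = dat))$adj.r.squared)
print(round(sort(r2adj, decreasing = TRUE), 3))
```

The ordering illustrates why adjusted R-squared can favor the TIME-only model: adding PUBS raises R-squared slightly but not enough to offset the lost degree of freedom.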
ANALYSIS OF RESIDUALS AND INFLUENCE
Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
[res.rows = 15, out of 15 ]
---------------------------------------------------------------
     PUBS   TIME    SALARY    fitted      resid  rstdnt  dffits  cooks
 1  18.00   3.00  51876.00  48223.41    3652.59    0.67    0.31   0.03
 2   3.00   6.00  54511.00  49345.00    5166.00    0.99    0.49   0.08
 3   2.00   3.00  53425.00  46274.60    7150.40    1.42    0.69   0.15
 4  17.00   8.00  61863.00  53015.94    8847.06    1.69    0.48   0.07
 5  11.00   9.00  52926.00  53268.00    -342.00   -0.06   -0.03   0.00
 6   6.00   6.00  47034.00  49710.40   -2676.40   -0.48   -0.20   0.01
 7  38.00  16.00  66432.00  63436.69    2995.31    0.60    0.40   0.06
 8  48.00  10.00  61100.00  58757.50    2342.50    0.52    0.47   0.08
 9   9.00   2.00  41934.00  46144.33   -4210.33   -0.78   -0.36   0.04
10  22.00   5.00  47454.00  50676.34   -3222.34   -0.57   -0.22   0.02
11  30.00   5.00  49832.00  51650.75   -1818.75   -0.35   -0.20   0.01
12  21.00   6.00  47047.00  51537.41   -4490.41   -0.79   -0.25   0.02
13  10.00   7.00  39115.00  51180.47  -12065.47   -2.72   -0.99   0.21
14  27.00  11.00  59677.00  57182.55    2494.45    0.44    0.15   0.01
15  37.00  18.00  61458.00  65280.62   -3822.62   -0.86   -0.76   0.20
FORECASTING ERROR
Data, Predicted, Standard Error of Forecast, 95% Prediction Intervals
[sorted by lower bound of prediction interval]
--------------------------------------------------------------------
     PUBS   TIME    SALARY      pred       sf    pi:lwr    pi:upr     width
 9   9.00   2.00  41934.00  46144.33  6332.80  32346.35  59942.32  27595.96
 3   2.00   3.00  53425.00  46274.60  6370.98  32393.43  60155.76  27762.34
 1  18.00   3.00  51876.00  48223.41  6332.62  34425.82  62021.00  27595.18
 2   3.00   6.00  54511.00  49345.00  6391.78  35418.51  63271.49  27852.98
 6   6.00   6.00  47034.00  49710.40  6262.91  36064.69  63356.11  27291.43
10  22.00   5.00  47454.00  50676.34  6197.45  37173.25  64179.43  27006.18
11  30.00   5.00  49832.00  51650.75  6517.64  37450.03  65851.47  28401.44
13  10.00   7.00  39115.00  51180.47  6171.16  37734.67  64626.27  26891.59
12  21.00   6.00  47047.00  51537.41  6092.69  38262.58  64812.24  26549.66
 5  11.00   9.00  52926.00  53268.00  6291.26  39560.52  66975.49  27414.97
 4  17.00   8.00  61863.00  53015.94  6055.75  39821.59  66210.29  26388.71
 8  48.00  10.00  61100.00  58757.50  7022.46  43456.87  74058.12  30601.25
14  27.00  11.00  59677.00  57182.55  6137.26  43810.61  70554.48  26743.87
 7  38.00  16.00  66432.00  63436.69  6670.47  48902.99  77970.39  29067.40
15  37.00  18.00  61458.00  65280.62  7003.15  50022.07  80539.18  30517.11
[Collinearity analysis with Nilsson and Fox's vif function from the car package]
[Subsets analysis with Thomas Lumley's leaps function from the leaps package]
car Package Output
> #some diagnostics with car package
> library(car)
> model <- lm(SALARY ~ PUBS + TIME, data = mydata)
> summary(model)

Call:
lm(formula = SALARY ~ PUBS + TIME, data = mydata)

Residuals:
   Min     1Q Median     3Q    Max
-12066  -3522   -342   3324   8847

Coefficients:
            Estimate Std. Error t value      Pr(>|t|)
(Intercept)  43082.4     3099.5  13.900 0.00000000926
PUBS           121.8      149.7   0.814        0.4317
TIME           982.9      452.1   2.174        0.0504

Residual standard error: 5839 on 12 degrees of freedom
Multiple R-squared: 0.5305, Adjusted R-squared: 0.4522
F-statistic: 6.78 on 2 and 12 DF, p-value: 0.01071
> #single residual plot by fitted values
> #plot(model, which = 1)
> #unadjusted residual plots by IV and predicted (fitted) values
> residualPlots(lm(SALARY ~ PUBS + TIME, data = mydata))
           Test stat Pr(>|t|)
PUBS           0.795    0.443
TIME          -0.208    0.839
Tukey test     0.069    0.945
> #partial residual plots
> crPlots(lm(SALARY ~ PUBS + TIME, data = mydata))
> #outlier test (may want to ignore significance tests and arbitrary definitions of outliers)
> outlierTest(model)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
    rstudent unadjusted p-value Bonferonni p
13 -2.724396           0.019775      0.29663
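The two p-values in the outlier test output can be reproduced by hand. The sketch below assumes the usual approach for this kind of test: the studentized (deleted) residual is referred to a t distribution with one fewer degree of freedom than the model's residual df, and the Bonferroni p-value multiplies the unadjusted p-value by the number of cases. Variable names are illustrative.

```r
n        <- 15          # number of cases
rstud    <- -2.724396   # largest absolute studentized residual (case 13)
df_rstud <- n - 3 - 1   # residual df (12) minus 1 for the deleted case

# Two-sided p-value for the single largest residual
p_unadj <- 2 * pt(abs(rstud), df_rstud, lower.tail = FALSE)

# Bonferroni adjustment for having examined all n residuals
p_bonf <- min(1, n * p_unadj)

print(round(c(unadjusted = p_unadj, bonferroni = p_bonf), 5))
```

The adjustment shows why case 13, although its residual looks large, is not flagged: with 15 cases examined, a residual of this size is not unusual.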
> #diagnostic values and their plots, id.n= specifies number of cases desired
> influencePlot(lm(SALARY ~ PUBS + TIME, data = mydata),id.n=15)
       StudRes        Hat      CookD
5  -0.06122461 0.16081925 0.01615946
13 -2.72439562 0.11691982 0.46192542
1   0.67327258 0.17612995 0.18396827
6  -0.48107171 0.15038019 0.12078285
14  0.43597027 0.10468261 0.08912904
10 -0.57369635 0.12645875 0.12970198
7   0.59862113 0.30497194 0.23531697
11 -0.34525315 0.24586129 0.11823617
9  -0.78142323 0.17619687 0.21211722
12 -0.79300659 0.08869675 0.14509773
15 -0.86429067 0.43838654 0.44559080
2   0.98695808 0.19820783 0.28362010
3   1.41695579 0.19042193 0.38107818
4   1.69413249 0.07553642 0.26005852
8   0.52255208 0.44632988 0.27947379
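The CookD column here does not match the cooks column in the lessR table (case 13: 0.46 versus 0.21). The values appear to be on a square-root scale, since 0.4619 squared is about 0.213. A minimal sketch to check this with base R, assuming the data are entered by hand from the tables in this handout (`dat`, `fit`, and `cd` are illustrative names; whether a given car version reports the square root is an assumption to verify against its documentation):

```r
# Data entered by hand from the residual tables in this handout (cases 1-15)
dat <- data.frame(
  PUBS   = c(18, 3, 2, 17, 11, 6, 38, 48, 9, 22, 30, 21, 10, 27, 37),
  TIME   = c(3, 6, 3, 8, 9, 6, 16, 10, 2, 5, 5, 6, 7, 11, 18),
  SALARY = c(51876, 54511, 53425, 61863, 52926, 47034, 66432, 61100,
             41934, 47454, 49832, 47047, 39115, 59677, 61458)
)
fit <- lm(SALARY ~ PUBS + TIME, data = dat)

cd <- cooks.distance(fit)   # Cook's distance on its usual scale

# Side-by-side comparison: usual scale vs square-root scale
print(round(cbind(cooks = cd, sqrt_cooks = sqrt(cd)), 3))
```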
> avPlots(lm(SALARY ~ PUBS + TIME, data = mydata))
>
> #obtain dfbetas (standardized; dfbeta() gives the unstandardized values)
> dfbetas(lm(SALARY ~ PUBS + TIME, data = mydata))
    (Intercept)         PUBS        TIME
1   0.231797720  0.139207147 -0.24384059
2   0.316875745 -0.385101202  0.17198426
3   0.639714806 -0.350417532 -0.09351616
4   0.229474144 -0.162360695  0.13247809
5  -0.008538538  0.019836564 -0.01694985
6  -0.140872717  0.142209195 -0.05507805
7  -0.210934645  0.028386377  0.24485383
8  -0.088903415  0.422006697 -0.20495510
9  -0.346822183 -0.006696766  0.21923153
10 -0.131247020 -0.115733245  0.14807598
11 -0.066378185 -0.156498763  0.14945109
12 -0.151232594 -0.093099374  0.12210115
13 -0.605522978  0.640035327 -0.33517230
14 -0.018987989  0.005396021  0.06408802
15  0.418823562  0.101232379 -0.59126091
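Base R has two closely named functions: `dfbeta()` returns the raw change in each coefficient when a case is deleted (in the units of the coefficient), and `dfbetas()` returns that change divided by the coefficient's standard error, which is the scale of the values printed above. A short sketch contrasting the two, assuming the data are entered by hand from the tables in this handout (`dat`, `fit`, `raw`, and `std` are illustrative names):

```r
# Data entered by hand from the residual tables in this handout (cases 1-15)
dat <- data.frame(
  PUBS   = c(18, 3, 2, 17, 11, 6, 38, 48, 9, 22, 30, 21, 10, 27, 37),
  TIME   = c(3, 6, 3, 8, 9, 6, 16, 10, 2, 5, 5, 6, 7, 11, 18),
  SALARY = c(51876, 54511, 53425, 61863, 52926, 47034, 66432, 61100,
             41934, 47454, 49832, 47047, 39115, 59677, 61458)
)
fit <- lm(SALARY ~ PUBS + TIME, data = dat)

raw <- dfbeta(fit)    # unstandardized: change in the coefficient itself
std <- dfbetas(fit)   # standardized: change in standard-error units

# Case 13, the case with the largest studentized residual
print(round(rbind(unstandardized = raw[13, ], standardized = std[13, ]), 3))
```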
> #multicollinearity diagnostic
> vif(model)
    PUBS     TIME
1.758073 1.758073
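The VIF for a predictor is 1 / (1 - R-squared), where R-squared comes from regressing that predictor on the other predictors; tolerance is the denominator. With only two predictors the two VIFs are necessarily equal. A minimal sketch, assuming the data are entered by hand from the tables in this handout (`dat`, `r2_pubs`, `tolerance`, and `vif_pubs` are illustrative names):

```r
# Predictor values entered by hand from the residual tables (cases 1-15)
dat <- data.frame(
  PUBS = c(18, 3, 2, 17, 11, 6, 38, 48, 9, 22, 30, 21, 10, 27, 37),
  TIME = c(3, 6, 3, 8, 9, 6, 16, 10, 2, 5, 5, 6, 7, 11, 18)
)

# R-squared from regressing one predictor on the other
r2_pubs   <- summary(lm(PUBS ~ TIME, data = dat))$r.squared
tolerance <- 1 - r2_pubs
vif_pubs  <- 1 / tolerance

print(round(c(tolerance = tolerance, VIF = vif_pubs), 3))
```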
> #test of autocorrelation of residuals (nonindependence)
> durbinWatsonTest(model)
 lag Autocorrelation D-W Statistic p-value
   1       0.3622956      1.207088   0.048
Alternative hypothesis: rho != 0
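The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals, so it depends on the order of the cases in the data file. A minimal sketch of the statistic and the lag-1 autocorrelation, assuming the data are entered by hand in case order from the tables in this handout (`dat`, `fit`, `e`, `dw`, and `r1` are illustrative names; the p-value in the car output is not reproduced here):

```r
# Data entered by hand from the residual tables in this handout (cases 1-15)
dat <- data.frame(
  PUBS   = c(18, 3, 2, 17, 11, 6, 38, 48, 9, 22, 30, 21, 10, 27, 37),
  TIME   = c(3, 6, 3, 8, 9, 6, 16, 10, 2, 5, 5, 6, 7, 11, 18),
  SALARY = c(51876, 54511, 53425, 61863, 52926, 47034, 66432, 61100,
             41934, 47454, 49832, 47047, 39115, 59677, 61458)
)
fit <- lm(SALARY ~ PUBS + TIME, data = dat)

e <- resid(fit)                                   # residuals in case order
dw <- sum(diff(e)^2) / sum(e^2)                   # Durbin-Watson statistic
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)       # lag-1 autocorrelation

print(round(c(DW = dw, lag1 = r1), 4))
```

Values of the statistic near 2 indicate little lag-1 autocorrelation; values well below 2, as here, indicate positive autocorrelation.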
> #normal probability plot
> qqPlot(lm(SALARY ~ PUBS + TIME, data = mydata))
Partial residual plots
Residuals by each predictor and predicted (fitted) values
Normal probability (aka quantile-quantile or q-q plot) with cumulative normal reference line and confidence band