Detection of Model Violations

Regression Analysis Chapter 4 Regression Diagnostics: Detection of Model Violations
Regression Analysis
Chapter 4 Regression Diagnostics:
Detection of Model Violations
Hongcheng Li
April 13, 2014
Outline
1. Standard Regression Assumptions
2. Residuals
3. Graphical methods
Standard Regression Assumptions I
The properties of the least squares estimators and the statistical
analysis are based on the following assumptions:
1. Linearity assumption
• Simple regression: check with a scatter plot of the response against the predictor.
• Multiple regression: harder to check directly; examine the residuals after fitting the model.
• If linearity does not hold, a transformation of the data can sometimes lead to linearity (Ch. 6).
• Otherwise, nonlinear regression or other models might be needed.
Standard Regression Assumptions II
Figure: Scatter plot of response (vertical axis, 0 to 120) against predictor (horizontal axis, 0 to 40)
Assumption about the errors or residuals I
\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i \]
The errors $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are assumed to be independently and
identically distributed (iid) normal random variables, $N(0, \sigma^2)$,
which implies:
1. Normality assumption: each $\varepsilon_i$, $i = 1, 2, \ldots, n$, has a normal distribution.
2. Zero mean: each $\varepsilon_i$ has mean 0.
Assumption about the errors or residuals II
3. Constant variance assumption: $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ have the same
variance $\sigma^2$.
This is referred to as the homogeneity or homoscedasticity
assumption. If the assumption is violated, the problem is
called the heterogeneity or heteroscedasticity problem.
Assumption about the errors or residuals III
4. Independent-errors assumption: $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are independent
of each other. When this does not hold, we have the autocorrelation
problem.
In summary, the residuals should be randomly distributed without
any patterns. Violations can be detected from scatter
plots of the residuals.
Assumption about the errors or residuals IV
Figure: Patterns that suggest violation of assumptions
Assumptions about the predictors I
1. The predictor variables $X_1, X_2, \ldots, X_p$ are nonrandom.
2. The values $x_{1j}, x_{2j}, \ldots, x_{nj}$, $j = 1, 2, \ldots, p$, are measured
without error.
Note: If the measurement errors are not large compared to
the random errors, the effect of the measurement errors is slight.
The presence of errors in the predictors decreases the accuracy
of predictions.
Assumptions about the predictors II
3. The predictor variables are assumed to be linearly independent
of each other.
If this assumption is violated, the problem is referred to as the
collinearity problem.
Assumption about the observations I
All observations (in simple regression, the paired data $(x_i, y_i)$,
$i = 1, 2, \ldots, n$) are equally reliable and have an approximately equal
role in determining the regression results and in influencing
conclusions.
Various types of residuals I
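If the model is
\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i \]
and we fit it to the data by least squares, we obtain the fitted
values
\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_p x_{ip}. \]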
Various types of residuals II
The ordinary least squares residuals are
\[ e_i = y_i - \hat{y}_i, \quad i = 1, 2, \ldots, n. \]
Various types of residuals III
1. The fitted values can also be written in an alternative form:
\[ \hat{y}_i = p_{i1} y_1 + p_{i2} y_2 + \cdots + \underbrace{p_{ii}}_{\text{leverage}} y_i + \cdots + p_{in} y_n, \quad i = 1, 2, \ldots, n. \]
Various types of residuals IV
2. The $p_{ij}$ are quantities that depend only on the values of
the predictor variables. They are the elements of the hat (or projection) matrix
\[ P = X (X^T X)^{-1} X^T. \]
Various types of residuals V
In simple regression, $p_{ij}$ is given by
\[ p_{ij} = \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{\sum_{k=1}^{n} (x_k - \bar{x})^2}. \]
When $i = j$, $p_{ii}$ is the $i$-th diagonal element of the projection
matrix $P$. The value $p_{ii}$ is called the leverage value for the $i$-th
observation; it is the weight (leverage) given to $y_i$ in
determining the $i$-th fitted value $\hat{y}_i$.
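The simple regression formula for $p_{ij}$ can be checked numerically. The following is a minimal sketch in plain Python, assuming a small hypothetical data set (the values of `x` and `y` are made up for illustration, and the helper names `leverage_matrix` and `fit_simple` are not from the slides). It builds the matrix of $p_{ij}$ and confirms that the weighted sums of the $y_j$ reproduce the least squares fitted values.

```python
def leverage_matrix(x):
    """Return the n-by-n matrix [p_ij] for simple regression on x."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xk - x_bar) ** 2 for xk in x)
    return [[1.0 / n + (xi - x_bar) * (xj - x_bar) / sxx for xj in x] for xi in x]


def fit_simple(x, y):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 9.5]

P = leverage_matrix(x)
b0, b1 = fit_simple(x, y)

# Fitted values computed two ways: from the coefficients and from the p_ij weights.
fitted_direct = [b0 + b1 * xi for xi in x]
fitted_from_p = [sum(p_ij * yj for p_ij, yj in zip(row, y)) for row in P]

# The leverage values are the diagonal entries p_ii.
leverages = [P[i][i] for i in range(len(x))]
```

In this sketch the last observation sits far from the other x values, so its leverage is the largest; the leverages also sum to the number of fitted coefficients (two in simple regression).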
Residuals vs. standardized residuals I
Various types of residuals
1. The unstandardized residuals, namely the raw residuals
$e_i$, $i = 1, 2, \ldots, n$, do not all have the same variance:
\[ \mathrm{Var}(e_i) = \sigma^2 (1 - p_{ii}), \qquad \mathrm{s.e.}(e_i) = \hat{\sigma} \sqrt{1 - p_{ii}}. \]
Residuals vs. standardized residuals II
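2. The standardized residual, sometimes called the internally
studentized residual, is $e_i$ divided by its standard error:
\[ z_i = \frac{e_i}{\hat{\sigma} \sqrt{1 - p_{ii}}}, \]
where
\[ \hat{\sigma}^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - p - 1} = \frac{\mathrm{SSE}}{n - p - 1}. \]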
Residuals vs. standardized residuals III
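3. An alternative unbiased estimator of $\sigma^2$ is given by
\[ \hat{\sigma}_{(i)}^2 = \frac{\mathrm{SSE}_{(i)}}{(n - 1) - p - 1} = \frac{\mathrm{SSE}_{(i)}}{n - p - 2}, \]
where $\mathrm{SSE}_{(i)} = \sum_{k \ne i} (y_k - \hat{y}_{k(i)})^2$ is the residual sum of squares when the
model is fitted to the $n - 1$ observations that remain after the $i$-th observation is omitted,
and $\hat{y}_{k(i)}$ is the fitted value for observation $k$ from that fit.
The residual standardized with $\hat{\sigma}_{(i)}$ is sometimes called the externally studentized
residual.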
Residuals vs. standardized residuals IV
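The externally studentized residual is therefore
\[ r_i^* = \frac{e_i}{\hat{\sigma}_{(i)} \sqrt{1 - p_{ii}}}, \]
where $\hat{\sigma}_{(i)}$ is estimated with the $i$-th observation omitted.
Both $\hat{\sigma}_{(i)}^2$ and $\hat{\sigma}^2$ are unbiased estimators of $\sigma^2$.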
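To make the two kinds of studentized residual concrete, here is a minimal sketch in plain Python for simple regression ($p = 1$). The data are hypothetical, with one response value deliberately placed off the line, and the function names are illustrative only. The externally studentized residual is obtained by literally refitting the model with observation $i$ omitted, which is slow but follows the definition directly.

```python
import math


def fit_simple(x, y):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def studentized_residuals(x, y):
    """Return (internally, externally) studentized residuals for simple regression."""
    n, p = len(x), 1
    b0, b1 = fit_simple(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    lev = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    sigma2 = sum(ei ** 2 for ei in e) / (n - p - 1)

    internal = [ei / math.sqrt(sigma2 * (1.0 - pii)) for ei, pii in zip(e, lev)]

    external = []
    for i in range(n):
        # Refit without observation i to get SSE_(i) and sigma_(i)^2.
        x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        c0, c1 = fit_simple(x_del, y_del)
        sse_del = sum((yk - (c0 + c1 * xk)) ** 2 for xk, yk in zip(x_del, y_del))
        sigma2_del = sse_del / (n - p - 2)
        external.append(e[i] / math.sqrt(sigma2_del * (1.0 - lev[i])))
    return internal, external


# Hypothetical data; the sixth response (9.0) is deliberately off the line.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 2.3, 2.9, 4.2, 4.8, 9.0, 7.1, 8.2]

internal, external = studentized_residuals(x, y)
```

For the discordant observation the externally studentized residual is larger in magnitude than the internally studentized one, because the outlier no longer inflates the estimate of $\sigma$ used to scale it.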
Graphical methods I
Always plot your data before doing any analysis. Graphs help to:
1. Detect errors in the data (e.g., an outlying point)
2. Recognize patterns in the data (clusters, outliers, gaps)
3. Explore relationships among variables
4. Discover new phenomena
5. Confirm or negate assumptions
Graphical methods II
6. Assess the adequacy of a fitted model
7. Suggest remedial actions (e.g., transform the data, redesign the
experiment, collect more data)
8. Enhance numerical analyses in general
Graphs before fitting the model I
1. One-dimensional graphs
• Histogram
• Stem-and-leaf plot
• Dot plot
• Box plot
2. Two-dimensional graphs
3. Rotating plots
4. Dynamic graphs
Histogram I
Figure: Histogram
Stem and leaf plot I
> duration <- faithful$eruptions
> stem(duration)

  The decimal point is 1 digit(s) to the left of the |

  16 | 070355555588
  18 | 000022233333335577777777888822335777888
  20 | 00002223378800035778
  22 | 0002335578023578
  24 | 00228
  26 | 23
  28 | 080
  30 | 7
  32 | 2337
  34 | 250077
  36 | 0000823577
  38 | 2333335582225577
  40 | 0000003357788888002233555577778
  42 | 03335555778800233333555577778
  44 | 02222335557780000000023333357778888
  46 | 0000233357700000023578
  48 | 00000022335800333
  50 | 0370
Dot plot I
Figure: Dot plot
Box plot I
Figure: Box plot
Two dimensional Graphs I
1. Pairwise scatter plots (scatter plot matrix)
The pairwise correlation coefficients should always be
interpreted in conjunction with the corresponding scatter plots.
2. Hamilton's data (p. 95)
Information from Hamilton's data:
In multiple regression, the presence of a linear pattern in the pairwise plots is
reassuring, but the absence of such a pattern does not imply that
the linear model is incorrect.
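The point about missing pairwise patterns can be illustrated with a small synthetic example. The sketch below does not use Hamilton's data; the values of `x1` and `x2` are constructed by hand (an assumption made purely for illustration) so that the response is an exact linear function of the two predictors, yet its correlation with each predictor taken alone is weak.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


# Synthetic data constructed for this sketch (not Hamilton's data).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
offsets = [0.5, -0.5, 0.0, 0.0, -0.5, 0.5]
x2 = [-a + d for a, d in zip(x1, offsets)]  # x2 is strongly negatively related to x1
y = [a + b for a, b in zip(x1, x2)]         # exact linear model: y = x1 + x2

r_y_x1 = pearson(y, x1)
r_y_x2 = pearson(y, x2)
```

Here neither scatter plot of `y` against a single predictor would show a clear linear pattern, although the model with both predictors fits without error. This is why residual-based diagnostics after fitting are needed in multiple regression.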
2-D plot I
Figure: 2-D plot
3-D plot I
Figure: 3-D plot
Dynamic plot I
Figure: Dynamic plot
Graphs after fitting a Model I
1. Graphs for checking the linearity and normality assumptions
2. Graphs for the detection of outliers and influential observations
3. Diagnostic plots for the effect of variables
Diagnostic plot I
3D visualization device system (OpenGL)
Diagnostic plot II
3D visualization device system (OpenGL)
Checking Linearity and Normality Assumptions I
When the number of variables is large, one can check the linearity
and normality assumptions by examining the residuals after fitting
a given model to the data:
1. Normal probability plot of the standardized residuals (i.e., the
P-P plot or Q-Q plot)
2. Scatter plots of the standardized residuals against each of the
predictor variables
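The coordinates of a normal Q-Q plot can be computed with the Python standard library alone. The sketch below is one possible construction, not the procedure of any particular package: it assumes the plotting positions $(i - 0.5)/n$, which is one common convention among several, and the residual values in `z` are hypothetical.

```python
from statistics import NormalDist


def normal_qq_points(residuals):
    """Pair each ordered residual with the matching standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n: an assumed convention, others exist.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical standardized residuals, for illustration only.
z = [-1.3, 0.4, 0.1, -0.2, 2.1, -0.7, 0.9, -0.5]
points = normal_qq_points(z)
```

Plotting the second coordinate of each pair against the first gives the Q-Q plot; if the normality assumption is reasonable, the points should lie close to a straight line.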
Checking Linearity and Normality Assumptions II
Leverage, Influence and Outliers I
1. A point is an influential point if its deletion, singly or in
combination with others (two or three), causes substantial
changes in the fitted model (estimated coefficients, fitted
values, t-tests, etc.).
Leverage, Influence and Outliers II
Example: New York Rivers data (p. 99)
In a 1976 study exploring the relationship between water
quality and land use, Haith (1976) obtained measurements
on 20 river basins in New York State.
A question of interest here is how the land use around a river
basin contributes to water pollution as measured by the
mean nitrogen concentration (mg/liter).
New York Rivers Data I
Variable   Definition
Y          Mean nitrogen concentration (mg/liter) based on samples taken at regular intervals during the spring, summer, and fall months
X1         Agriculture: percentage of land area currently in agricultural use
X2         Forest: percentage of forest land
X3         Residential: percentage of land area in residential use
X4         Commercial/industrial: percentage of land area in either commercial or industrial use
Please run “New York Rivers Data” to regenerate Table 4.2
on page 99 I
                Observation deleted
Test    None    Neversink    Hackensack
t0      1.4     1.21         2.08
t1      0.39    0.92         0.25
t2      -0.93   -0.74        -1.45
t3      -0.21   -3.15        4.08
t4      1.86    4.45         0.66
Outliers I
1. Outliers in the response variable
Observations whose standardized residuals lie more than 2 or 3 standard
deviations from the mean, 0, are called outliers.
Outliers II
2. Outliers in the predictors
Observations that are outliers in the $X$-space are known as high leverage points.
SPSS can save the leverage values: [Regression] → Linear
Regression → Save → Leverage values.
If the leverage value $p_{ii}$ is greater than $2(p + 1)/n$, regard the
point as having high leverage.
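The cutoff rule is easy to apply once the leverage values are available. The following sketch, written for simple regression only ($p = 1$) and using hypothetical predictor values, flags the observations whose leverage exceeds $2(p + 1)/n$. The function name `high_leverage_points` is illustrative and not part of any package.

```python
def high_leverage_points(x):
    """Flag observations with p_ii > 2(p + 1)/n in simple regression (p = 1)."""
    n, p = len(x), 1
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    cutoff = 2.0 * (p + 1) / n
    lev = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    flagged = [i for i, pii in enumerate(lev) if pii > cutoff]
    return flagged, lev, cutoff


# Hypothetical predictor values; the last one lies far from the rest.
x = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0]
flagged, lev, cutoff = high_leverage_points(x)
```

Only the predictor values enter this calculation, which is why a high leverage point is an outlier in the $X$-space regardless of its response value.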
Outliers III
Analyses that are based on residuals alone may fail to detect
outliers and influential observations for the following reasons:
1. The presence of high leverage points. The ordinary residuals
$e_i$ and the leverage values $p_{ii}$ are related by
\[ p_{ii} + \frac{e_i^2}{\mathrm{SSE}} \le 1, \]
so an observation with leverage close to 1 must have a small residual.
2. Masking and swamping problems
Outliers IV
Masking (false negative):
the data contain outliers, but we fail to detect them.
Swamping (false positive):
we wrongly declare some of the non-outlying points to be outliers.
Measures of Influence I
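1. Cook's distance (p. 104, Equation 4.20)
\[ C_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{\hat{\sigma}^2 (p + 1)}, \]
where $\hat{y}_{j(i)}$ is the $j$-th fitted value when the $i$-th observation is omitted.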
Measures of Influence II
2. Welsch and Kuh measure
\[ \mathrm{DFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\hat{\sigma}_{(i)} \sqrt{p_{ii}}}, \quad i = 1, 2, \ldots, n. \]
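Both influence measures can be computed straight from their definitions by refitting the model with each observation left out in turn. The sketch below does this for simple regression ($p = 1$) in plain Python. The data are hypothetical, with the last point placed far out in $x$ and below the trend of the others, and the helper names are illustrative. Statistical packages use closed-form shortcuts instead of refitting, so this is meant only to mirror the formulas.

```python
import math


def fit_simple(x, y):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def influence_measures(x, y):
    """Return (Cook's distances, DFITS values) via leave-one-out refits."""
    n, p = len(x), 1
    b0, b1 = fit_simple(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    sigma2 = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p - 1)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)

    cooks, dfits = [], []
    for i in range(n):
        x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        c0, c1 = fit_simple(x_del, y_del)
        # Fitted values for ALL n observations from the fit without observation i.
        fitted_del = [c0 + c1 * xj for xj in x]
        sse_del = sum((yk - (c0 + c1 * xk)) ** 2 for xk, yk in zip(x_del, y_del))
        sigma_del = math.sqrt(sse_del / (n - p - 2))
        p_ii = 1.0 / n + (x[i] - x_bar) ** 2 / sxx

        shift2 = sum((fj - fdj) ** 2 for fj, fdj in zip(fitted, fitted_del))
        cooks.append(shift2 / (sigma2 * (p + 1)))
        dfits.append((fitted[i] - fitted_del[i]) / (sigma_del * math.sqrt(p_ii)))
    return cooks, dfits


# Hypothetical data; the last point is far out in x and below the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0]
y = [1.0, 2.2, 2.8, 4.1, 5.2, 5.9, 7.1, 9.0]

cooks, dfits = influence_measures(x, y)
```

In this sketch the last observation has by far the largest Cook's distance, and its DFITS value is negative because including it pulls its own fitted value downward.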
Cook’s Distance I
Cook’s Distance II
What to do with outliers I
1. Exponential growth data (p. 109)
2. Outliers and influential observations can be the most
informative observations in the data; they should not be
automatically discarded without justification.
What to do with outliers II
3. They should be examined to determine why they are outlying
or influential.
Based on this examination, appropriate corrective actions can
then be taken:
• Correcting errors in the data
• Deleting or down-weighting outliers
• Transforming the data
• Considering a different model
• Redesigning the experiment or the sample survey, or collecting
more data
Role of variables in a regression equation I
1. Added-variable plot
2. Residual plus component plot
3. The Scottish Hills Races data (p. 113)
Effects of an additional predictor I
Reference: p. 115.
• Is the regression coefficient of the new variable significant?
• Does the introduction of the new variable substantially change
the regression coefficients of the variables already in the
regression equation?
Effects of an additional predictor II
1. Case A:
The new variable has an insignificant regression coefficient,
and the remaining regression coefficients do not change
substantially from their previous values.
The new variable should NOT be included in the regression
equation!
2. Case B:
Effects of an additional predictor III
3. Case C:
The new variable is significant, and the coefficients of the
previously introduced variables do not change substantially.
The new variable is uncorrelated with the previously
introduced variables.
The new variable should be included in the regression
equation!
4. Case D:
Homework I
1. P116: 4.3, 4.4, 4.5
2. P118: 4.6 ∼ 4.10
3. P120: 4.12 ∼ 4.14