
Common Problems in Multiple Regression




Multicollinearity
Dummy variables
Interaction variables
Polynomial (higher-order) terms
Perfect Multicollinearity

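If the explanatory variables in a multiple regression model are related by an exact linear relationship, the model is said to have perfect multicollinearity.
That is, at least one explanatory variable can be written as a linear combination of the other explanatory variables. When perfect multicollinearity is present, the model contains a redundant variable, so the regression coefficients are not identified and cannot be computed.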
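The identification problem can be seen numerically. The following is a minimal sketch in Python with NumPy, using made-up data that are not from the slides: a hypothetical regressor `x3` is built as an exact linear combination of `x1` and `x2`, so the design matrix has fewer independent columns than coefficients to estimate.

```python
import numpy as np

# Made-up data for illustration only.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 + 3.0 * x2  # exact linear combination of x1 and x2 (assumed example)

# Design matrix with an intercept column and the three regressors.
X = np.column_stack([np.ones(n), x1, x2, x3])

# With perfect multicollinearity the rank is below the number of columns,
# so X'X is singular and the normal equations have no unique solution.
n_columns = X.shape[1]
rank = np.linalg.matrix_rank(X)
print("columns:", n_columns, "rank:", rank)
```

Dropping `x3` restores a full-rank design matrix, which corresponds to removing the redundant variable.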

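Note that perfect multicollinearity is defined in terms of linear relationships among the explanatory variables. A nonlinear relationship among them therefore does not constitute perfect multicollinearity.
Conversely, the relationship [equation missing in the source] does give rise to perfect multicollinearity (why?).
With perfect multicollinearity, Excel will not run the regression at all, so in practice it is not the real problem.
The real problem is high multicollinearity.
High multicollinearity means that several of the explanatory variables are very similar to one another.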

How to Deal with Multicollinearity



Remove redundant explanatory variables.
Re-express explanatory variables
Do nothing if the explanatory variables are significant
with sensible estimates.
Copyright © 2011 Pearson Education, Inc.
Example : RETAIL PROFITS
 Motivation

A chain of pharmacies is looking to expand into
a new community. It has data for 110 cities on
the following variables: income, disposable income,
birth rate, social security recipients, cardiovascular deaths
and percentage of local population aged 65 or more.
4M Example 24.2: RETAIL PROFITS
 Method

Use multiple regression. The response variable
is profit. Examine the correlation matrix and
the scatterplot matrix.
4M Example 24.2: RETAIL PROFITS
 Method

Several high correlations are present (shaded in table)
and indicate the presence of collinearity.
4M Example 24.2: RETAIL PROFITS
 Method
This partial scatterplot matrix identifies communities that are distinct from others.

Linearity and no lurking variables conditions are met.


4M Example 24.2: RETAIL PROFITS
 Mechanics
– Estimation Results

4M Example 24.2: RETAIL PROFITS
 Mechanics

– Examine Plots
These and other plots (not shown here) indicate that all
MRM conditions are satisfied.
4M Example 24.2: RETAIL PROFITS
 Mechanics

The F-statistic indicates that this collection of
explanatory variables explains statistically significant
variation in profits. The VIFs indicate that some
explanatory variables are redundant and should be
removed (one at a time) from the model.
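As a rough illustration of how a VIF is obtained, the sketch below uses the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing explanatory variable j on the other explanatory variables. The slides do not spell out this formula, and the data here are simulated, not the pharmacy data; the variable names `income`, `disposable`, and `birth_rate` only echo the example, and the helper `vif` is hypothetical.

```python
import numpy as np

def vif(X):
    """VIF of each column of X (n x k array of regressors, no intercept column)."""
    n, k = X.shape
    values = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # regress column j on the rest
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        centered = target - target.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        values.append(1.0 / (1.0 - r2))
    return np.array(values)

# Simulated data (assumption): disposable income is almost a copy of income.
rng = np.random.default_rng(1)
n = 110
income = rng.normal(50, 10, n)
disposable = 0.8 * income + rng.normal(0, 1, n)
birth_rate = rng.normal(14, 2, n)

X = np.column_stack([income, disposable, birth_rate])
vifs = vif(X)
print(dict(zip(["income", "disposable", "birth_rate"], vifs.round(1))))
```

In this simulated setup the two income variables get large VIFs while the unrelated variable stays near 1, which is the pattern that suggests removing one redundant variable at a time.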

4M Example 24.2: RETAIL PROFITS
 Mechanics

– Simplified Model
This multiple regression separates the effects of birth
rates from age (and income). It reveals that cities with
higher birth rates produce higher profits when
compared to cities with lower birth rates but
comparable income and local population above 65.
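Dummy Variables
So far, all the explanatory variables we have considered are continuous random variables.
Sometimes the explanatory variable of interest is discrete.
For example, return to A-Zhong's delivery example. If the hours spent on the road also depend on the weather, we add an explanatory variable that records the weather as an indicator.

Such a variable is called a dummy variable.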
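Dummy Variables

Our model becomes
[equation missing in the source]
Given a sunny day, the conditional expectation of hours on the road is
[equation missing in the source]
Given a rainy day, the conditional expectation of hours on the road is
[equation missing in the source]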
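Dummy Variables
The difference between the two is the effect of weather on the conditional mean of hours on the road, after controlling for the other variables (that is, for the same delivery distance and number of delivery stops).
In general, rainy days bring poor visibility and bad road conditions, so we expect hours on the road to increase on average; that is, the coefficient on the weather dummy is positive.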
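In symbols, the weather example can be sketched as follows. The notation is an assumption, since none of it survives in the slide text: Y_i is hours on the road, X_{1i} is delivery distance, X_{2i} is the number of delivery stops, and D_i is the weather dummy, coded 1 for a rainy day and 0 for a sunny day.

```latex
\begin{align*}
D_i &= \begin{cases} 1 & \text{if day } i \text{ is rainy} \\ 0 & \text{if day } i \text{ is sunny} \end{cases} \\
Y_i &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 D_i + \varepsilon_i \\
E(Y_i \mid X_{1i}, X_{2i}, D_i = 0) &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} \\
E(Y_i \mid X_{1i}, X_{2i}, D_i = 1) &= (\beta_0 + \beta_3) + \beta_1 X_{1i} + \beta_2 X_{2i} \\
E(Y_i \mid X_{1i}, X_{2i}, D_i = 1) - E(Y_i \mid X_{1i}, X_{2i}, D_i = 0) &= \beta_3
\end{align*}
```

Under this assumed coding, the expectation that rain lengthens the trip corresponds to \beta_3 > 0.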

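Dummy Variables
When the regression model includes an intercept term, there is an important rule for setting up dummy variables: if there are m different categories to consider, only m − 1 dummy variables may be included.
The reason is that the m dummy variables sum to one for every observation, which duplicates the intercept column. If we included all m dummy variables while the intercept is present, the result would be perfect multicollinearity.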

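Dummy Variables

Return to A-Zhong's example. If the company has four delivery trucks (trucks I, II, III, and IV) and differences in their condition also affect hours on the road, we can include only 3 dummy variables:
Setting Up the Dummy Variables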
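The m − 1 rule for the four trucks can be checked with a small sketch. The truck assignments below are invented for illustration, and treating truck I as the baseline category is an assumption, since the slide's own coding table is not preserved.

```python
import numpy as np

# Invented truck assignments for ten deliveries (illustration only).
trucks = np.array(["I", "II", "III", "IV", "I", "II", "III", "IV", "I", "III"])
levels = ["I", "II", "III", "IV"]
n = len(trucks)

# One indicator column per truck: m = 4 dummy variables.
all_dummies = np.column_stack([(trucks == lv).astype(float) for lv in levels])
intercept = np.ones((n, 1))

# Intercept plus all 4 dummies: the dummies sum to the intercept column.
X_trap = np.hstack([intercept, all_dummies])
# Intercept plus 3 dummies: truck I is the (assumed) baseline.
X_ok = np.hstack([intercept, all_dummies[:, 1:]])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print("with 4 dummies:", X_trap.shape[1], "columns, rank", rank_trap)
print("with 3 dummies:", X_ok.shape[1], "columns, rank", rank_ok)
```

With three dummies, each coefficient is read as the difference from the baseline truck, holding the other regressors fixed.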
Interaction Models
Interaction Model With
2 Independent Variables
• Hypothesizes interaction between pairs of x variables
— Response to one x variable varies at different levels of another x variable
• Contains two-way cross product terms
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
• Can be combined with other models
— Example: dummy-variable model
Effect of Interaction
Given:
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
• Without interaction term, effect of x1 on y is measured by $\beta_1$
• With interaction term, effect of x1 on y is measured by $\beta_1 + \beta_3 x_2$
— Effect increases as x2 increases (when $\beta_3 > 0$)
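The effect of x1 stated above follows from differentiating the interaction model with respect to x1, as this short derivation shows:

```latex
\begin{align*}
E(y) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 \\
\frac{\partial E(y)}{\partial x_1} &= \beta_1 + \beta_3 x_2
\end{align*}
```

Setting \beta_3 = 0 recovers the no-interaction case, where the effect of x1 is \beta_1 at every level of x2.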
Interaction Model Relationships
E(y) = 1 + 2x1 + 3x2 + 4x1x2
x2 = 1: E(y) = 1 + 2x1 + 3(1) + 4x1(1) = 4 + 6x1
x2 = 0: E(y) = 1 + 2x1 + 3(0) + 4x1(0) = 1 + 2x1
[Figure: E(y) plotted against x1 (0 to 1.5) for x2 = 0 and x2 = 1; vertical axis from 0 to 12.]
Effect (slope) of x1 on E(y) depends on x2 value
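The two lines on this slide can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function names `expected_y` and `slope_in_x1` are made up for the illustration.

```python
def expected_y(x1, x2):
    """E(y) = 1 + 2*x1 + 3*x2 + 4*x1*x2, the model on the slide."""
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

def slope_in_x1(x2):
    """Change in E(y) for a one-unit increase in x1, holding x2 fixed."""
    return expected_y(1, x2) - expected_y(0, x2)

slope_x2_0 = slope_in_x1(0)   # slope of the line for x2 = 0
slope_x2_1 = slope_in_x1(1)   # slope of the line for x2 = 1
print(slope_x2_0, slope_x2_1)  # prints "2 6"
```

The slope is 2 + 4·x2, so it changes with the level of x2, which is what the interaction term encodes.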
Interaction Model Worksheet
Case, i   yi   x1i   x2i   x1i x2i
1         1    1     3     3
2         4    8     5     40
3         1    3     2     6
4         3    5     6     30
:         :    :     :     :
Multiply x1 by x2 to get x1x2.
Run regression with y, x1, x2, x1x2.
Interaction Example
You work in advertising for the New York Times. You want to
find the effect of ad size (sq. in.), x1, and newspaper circulation
(in thousands), x2, on the number of ad responses (in hundreds), y.
Conduct a test for interaction. Use α = .05.
Interaction Model Worksheet
yi   x1i   x2i   x1i x2i
1    1     2     2
4    8     8     64
1    3     1     3
3    5     7     35
2    6     4     24
4    10    6     60
Multiply x1 by x2 to get x1x2.
Run regression with y, x1, x2, x1x2.
Excel Computer Output Solution
The global F-test (the F statistic and its P-value in the output) indicates that at least one parameter is not zero.
Interaction Test Solution
• H0: $\beta_3 = 0$
• Ha: $\beta_3 \neq 0$
• $\alpha = .05$
• df = 6 − 4 = 2
• Critical value(s): t = ±4.3027 (area .025 in each tail); reject H0 if t < −4.3027 or t > 4.3027
Excel Computer Output Solution
$t = \hat{\beta}_3 / s_{\hat{\beta}_3}$
Interaction Test Solution
Test Statistic: t = 1.8528
Decision: Do not reject H0 at α = .05
Conclusion: There is no evidence of interaction
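The interaction test can be sketched in Python with NumPy instead of Excel. The data are the six worksheet rows of the ad example, and the critical value 4.3027 is taken from the slide. The variable names and the use of `numpy.linalg.lstsq` are choices made for this illustration, not the course's method.

```python
import numpy as np

# Worksheet data from the ad example (y, x1, x2).
y = np.array([1, 4, 1, 3, 2, 4], dtype=float)
x1 = np.array([1, 8, 3, 5, 6, 10], dtype=float)
x2 = np.array([2, 8, 1, 7, 4, 6], dtype=float)
x1x2 = x1 * x2  # interaction column: multiply x1 by x2

# Design matrix: intercept, x1, x2, x1*x2.
X = np.column_stack([np.ones_like(y), x1, x2, x1x2])
n, p = X.shape
df = n - p  # 6 - 4 = 2

# Ordinary least squares and the usual coefficient standard errors.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = (resid @ resid) / df
cov = s2 * np.linalg.inv(X.T @ X)
t_b3 = beta[3] / np.sqrt(cov[3, 3])  # t = beta3_hat / s_beta3_hat

t_crit = 4.3027  # two-tailed critical value for alpha = .05, df = 2 (from the slide)
reject_h0 = abs(t_b3) > t_crit
print(f"t = {t_b3:.4f}, df = {df}, reject H0: {reject_h0}")
```

The printed t statistic can be compared with the value t = 1.8528 reported on the slide.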
Dummy Variables + Interaction Variables

Does Wal-Mart discriminate against female employees?
Are they paid less than men?

Use multiple regression with a categorical explanatory variable
representing gender to analyze pay data.

Regression analysis can adjust the comparison between men and
women to account for other variables that may affect pay.
Dummy Variables + Interaction Variables
 Example:

Mid-Level Managers’ Salaries
The average salary for women is $140,000 and the average salary
for men is $144,700.
Dummy Variables + Interaction Variables
 Example:
Mid-Level Managers’ Salaries

The 95% confidence interval for the difference in mean salaries is $740
to $8,591 (since 0 is not in this interval, the difference is
significant).

Assume conditions for inference are satisfied.
Dummy Variables + Interaction Variables

Without a randomized experiment, we must be careful about
lurking variables that would account for the significant difference
between average salaries (e.g., experience).

Experience is a confounding variable if it is correlated with salary
and the two groups (men and women) differ with regard to
experience.
Dummy Variables + Interaction Variables

Restrict analysis to a subset of cases with matching levels of the
confounding variable (e.g., compare men and women with 5
years of experience).
Dummy Variables + Interaction Variables

The 95% confidence interval for the difference in average
salaries between men and women within the subset of managers
with 5 years experience includes 0 (the difference is not
significant).

However, the standard error of the difference is much larger; the
cases in the subset do not produce a precise estimate.
Dummy Variables + Interaction Variables

What about the difference between average salaries for managers
with 2, 10 or 15 years experience?

Analysis of covariance: regression that combines categorical and
numerical explanatory variables; adjusts the comparison of
means for the effects of confounding variables.
Dummy Variables + Interaction Variables
Dummy Variables + Interaction Variables

Simple regressions fit separately to men and women show that
estimated salary rises faster with experience for women
compared to men.
Dummy Variables + Interaction Variables

Combining the separate regressions for men and women requires
a dummy variable identifying whether a manager is male or
female (Group = 1 for men; Group = 0 for women).

Also requires the interaction term Group × Years. An interaction
term is the product of two explanatory variables in a regression
model.

Dummy Variables + Interaction Variables
Combining Regressions
Dummy Variables + Interaction Variables
Combining Regressions
Dummy Variables + Interaction Variables

The equation for the group coded as 0 in the dummy variable
forms a baseline for comparison.

The slope of the dummy variable is the difference between
estimated intercepts in the simple regressions. The slope of the
interaction is the difference between estimated slopes in the
simple regressions.
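The relationship between the combined regression and the two simple regressions can be checked numerically. The sketch below uses simulated salaries, not the managers' data from the example; the coding Group = 1 for men and Group = 0 for women follows the slides, while the numbers used to generate the data and the helper `ols` are assumptions made for illustration.

```python
import numpy as np

# Simulated data (illustration only): salary in $000, years of experience.
rng = np.random.default_rng(2)
n = 60
years = rng.uniform(1, 20, n)
group = (np.arange(n) % 2).astype(float)  # 1 = men, 0 = women
salary = 130 + 1.2 * years + group * (5 - 0.5 * years) + rng.normal(0, 3, n)

def ols(X, y):
    """Least-squares coefficients for y regressed on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

ones = np.ones(n)
women = group == 0
men = group == 1

# Simple regressions fit separately to each group: [intercept, slope].
b_women = ols(np.column_stack([ones[women], years[women]]), salary[women])
b_men = ols(np.column_stack([ones[men], years[men]]), salary[men])

# Combined regression: intercept, Years, Group, Group x Years.
X = np.column_stack([ones, years, group, group * years])
b = ols(X, salary)

print("baseline (women) intercept and slope:", b[:2])
print("dummy coefficient:", b[2], "= intercept difference", b_men[0] - b_women[0])
print("interaction coefficient:", b[3], "= slope difference", b_men[1] - b_women[1])
```

The baseline group's line is reproduced exactly by the first two coefficients, and the dummy and interaction coefficients equal the differences in intercepts and slopes between the separate fits.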
Second–Order Models
Second-Order Model With
1 Independent Variable
• Relationship between 1 dependent and 1 independent variable is a quadratic function
• Useful 1st model if non-linear relationship suspected
• Model
$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$
where $\beta_1 x$ is the linear effect and $\beta_2 x^2$ is the curvilinear effect
Second-Order Model
Relationships
[Figure: four panels of y plotted against x1, two with $\beta_2 > 0$ and two with $\beta_2 < 0$.]
Second-Order Model Worksheet
Case, i   yi   xi   xi^2
1         1    1    1
2         4    8    64
3         1    3    9
4         3    5    25
:         :    :    :
Create the x^2 column.
Run regression with y, x, x^2.
2nd Order Model Example
The data show the number of weeks employed and the number
of errors made per day for a sample of assembly line workers.
Find a 2nd order model, conduct the global F-test, and test if β2 ≠ 0.
Use α = .05 for all tests.
Errors (y)   Weeks (x)
20           1
18           1
16           2
10           4
8            4
4            5
3            6
1            8
2            10
1            11
0            12
1            12
Second-Order Model Worksheet
yi   xi   xi^2
20   1    1
18   1    1
16   2    4
10   4    16
:    :    :
Create the x^2 column.
Run regression with y, x, x^2.
Excel Computer Output Solution
$\hat{y} = 23.728 - 4.784x + .242x^2$
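The fitted second-order equation can be reproduced outside Excel. The following is a minimal sketch with NumPy, using the twelve (errors, weeks) pairs from the example; the variable names and the use of `numpy.linalg.lstsq` are choices made for this illustration.

```python
import numpy as np

# Data from the example: errors per day (y) and weeks employed (x).
errors = np.array([20, 18, 16, 10, 8, 4, 3, 1, 2, 1, 0, 1], dtype=float)
weeks = np.array([1, 1, 2, 4, 4, 5, 6, 8, 10, 11, 12, 12], dtype=float)

# Create the x^2 column and regress y on x and x^2.
X = np.column_stack([np.ones_like(weeks), weeks, weeks ** 2])
b0, b1, b2 = np.linalg.lstsq(X, errors, rcond=None)[0]

print(f"y_hat = {b0:.3f} + ({b1:.3f}) x + ({b2:.3f}) x^2")
```

The negative linear coefficient and positive squared coefficient describe errors that fall quickly at first and then level off as experience grows.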
Overall Model Test Solution
The global F-test (the F statistic and its P-value in the output) indicates that at least one parameter is not zero.
β2 Parameter Test Solution
The β2 test (the t statistic and its P-value in the output) indicates that a curvilinear relationship exists.
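Both tests can be sketched without the Excel output, whose numbers are not preserved here. The code below computes the global F statistic and the t statistic for β2 from the example data. The critical values `F_CRIT` and `T_CRIT` are standard table values for α = .05 with 2 and 9 degrees of freedom; they are supplied as assumptions, since the slides report P-values instead.

```python
import numpy as np

errors = np.array([20, 18, 16, 10, 8, 4, 3, 1, 2, 1, 0, 1], dtype=float)
weeks = np.array([1, 1, 2, 4, 4, 5, 6, 8, 10, 11, 12, 12], dtype=float)

X = np.column_stack([np.ones_like(weeks), weeks, weeks ** 2])
n, p = X.shape
df_model, df_resid = p - 1, n - p  # 2 and 9

beta, *_ = np.linalg.lstsq(X, errors, rcond=None)
resid = errors - X @ beta
sse = resid @ resid
sst = ((errors - errors.mean()) ** 2).sum()
mse = sse / df_resid

# Global F-test: H0 is beta1 = beta2 = 0.
F = ((sst - sse) / df_model) / mse

# t-test for beta2: H0 is beta2 = 0.
se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))
t_b2 = beta[2] / se[2]

F_CRIT = 4.26   # table value, F(.05; 2, 9)   (assumption, not from the slides)
T_CRIT = 2.262  # table value, t(.025; 9)     (assumption, not from the slides)
print(f"F = {F:.2f}, reject H0: {F > F_CRIT}")
print(f"t(beta2) = {t_b2:.2f}, reject H0: {abs(t_b2) > T_CRIT}")
```

Rejecting both null hypotheses matches the slide's conclusions that at least one parameter is nonzero and that the relationship is curvilinear.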