Midterm 2 cheatsheet

Sample mean: X̄ = (X₁ + ⋯ + Xₙ)/n
Central Limit Thm: P{(X₁ + ⋯ + Xₙ − nμ)/(σ√n) < x} ≈ P{Z < x}
Sample Variance: S² = Σᵢ(Xᵢ − X̄)²/(n−1), E[S²] = σ²
Use n − 1 to make E[sample var] = population var.
Chi Square Use to find the dist of S² from a normal sample.
If Z₁ … Zₙ are iid standard normal and X = Z₁² + ⋯ + Zₙ², then X is chi-square with n deg. freedom, χ²_n. Also χ²_m + χ²_n = χ²_{m+n}.
Sampling from a Normal Population: X̄ ~ N(μ, σ²/n),
X₁ + ⋯ + Xₙ ~ N(nμ, nσ²), X̄ ⊥ S²
T-Distribution If Z ⊥ χ²_n, then T = Z/√(χ²_n/n) ~ t_n, and (X̄ − μ)/(S/√n) ~ t_{n−1}
Method of Moments E[X^m] = g_m(θ₁ … θ_k)
If k unknown parameters, let X̄^(m) = (X₁^m + ⋯ + Xₙ^m)/n (the m-th sample moment) and solve g_m(θ₁ … θ_k) = X̄^(m) for m = 1 … k.
Maximum Likelihood Estimator
Likelihood L(θ) = f(x₁, …, xₙ; θ) = f_{X₁}(x₁; θ) ⋯ f_{Xₙ}(xₙ; θ)
Maximize the log likelihood l(θ) = log L(θ).
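A minimal numerical-MLE sketch; it assumes iid Exponential(λ) data (a made-up sample) and maximizes l(λ) by minimizing −l(λ):

```python
# Sketch: MLE by maximizing the log likelihood (exponential data assumed)
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.8, 1.7, 0.3, 2.5, 1.1])      # hypothetical observations

def neg_log_lik(lam):
    # l(lambda) = n*log(lambda) - lambda*sum(x) for iid Exponential(lambda)
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50), method="bounded")
print(res.x, 1 / x.mean())   # numerical MLE matches closed form n / sum(x)
```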
Evaluating a point estimator
Hard to assess the credibility of a point estimate by itself.
Bias: shooting off-target. b(θ̂) = E[θ̂(X₁ … Xₙ)] − θ
Precision: a measure of spread.
Mean Squared Error: E[(θ̂ − θ)²] = Var(θ̂) + b²(θ̂),
i.e. Var(θ̂) = E[(θ̂ − θ)²] − (E[θ̂] − θ)²

Confidence Interval
Step 1: find a statistic involving the parameter.
  Mean with known variance: (X̄ − μ)/(σ/√n) ~ N(0,1)
  Mean with unknown variance: (X̄ − μ)/(S/√n) ~ t_{n−1}
  Variance with unknown mean: (n−1)S²/σ² ~ χ²_{n−1}
Step 2: find an initial interval with prob. 1 − α.
  Two-sided: P{−z_{α/2} ≤ (X̄ − μ)/(σ/√n) ≤ z_{α/2}} = 1 − α
  One-sided upper: P{−z_α ≤ (X̄ − μ)/(σ/√n) < ∞} = 1 − α
Step 3: find the equivalent interval for the parameter:
  −z_{α/2} ≤ (X̄ − μ)/(σ/√n) ≤ z_{α/2} ⇔ X̄ − z_{α/2}·σ/√n ≤ μ ≤ X̄ + z_{α/2}·σ/√n,
  so P{X̄ − z_{α/2}·σ/√n ≤ μ ≤ X̄ + z_{α/2}·σ/√n} = 1 − α.
Estimate Difference in Means
  σ₁, σ₂ known: X̄ − Ȳ ± z_{α/2}·√(σ₁²/n + σ₂²/m)
  σ₁, σ₂ unknown but equal: X̄ − Ȳ ± t_{α/2,n+m−2}·√((1/n + 1/m)·((n−1)S₁² + (m−1)S₂²)/(n+m−2))
Lower confidence interval
  σ₁, σ₂ known: (−∞, X̄ − Ȳ + z_α·√(σ₁²/n + σ₂²/m))
  σ₁, σ₂ unknown but equal: (−∞, X̄ − Ȳ + t_{α,n+m−2}·√((1/n + 1/m)·((n−1)S₁² + (m−1)S₂²)/(n+m−2)))
Approximate Confidence Interval (proportion): p̂ ± z_{α/2}·√(p̂(1−p̂)/n); sub p̂ with p* after testing k times.
Want a 1 − α CI with length no greater than 2Δp; what's the min sample size? Solve 2z_{α/2}·√(p̂(1−p̂)/n) ≤ 2Δp for n.
Hypothesis Testing
H₀: presume this; in control, conventional, not effective, no discrimination.
H₁: out of control, new finding, discrimination, more effective.
Level of significance α.
Critical Region Approach
Step 1: set up null and alternative hypotheses
Step 2: find a statistic Z(X₁ … Xₙ) with known distribution, called the test statistic TS
Step 3: choose an α and find the critical region (smaller α means smaller region)
Step 4: check if the observed TS falls into the region
Critical region: the set of all samples that will lead to rejection of H₀.
P-value: prob. of observing evidence as much/more in favor of H₁.
Step 1: set up null and alt hypotheses
Step 2: find TS T with known dist under H₀
Step 3: calculate the p-value; two-sided with known σ:
  p = P{|(X̄ − μ₀)/(σ/√n)| ≥ |(x̄ − μ₀)/(σ/√n)|} = 2(1 − Φ(|(x̄ − μ₀)/(σ/√n)|))
Step 4: check if p is small enough; if p is smaller than your α, reject H₀.
p tells us how cautious/easily convinced we should be.
Decision Errors Type I: reject H₀ when it's actually true; "false alarm," error α.
Type II: accept H₀ when it's not true; a miss.
Operating characteristic (OC) curve: β(μ) = P_μ{accept H₀}.
1 − β(μ) = P_μ{reject H₀} is the power function.
One-Sided Test H₀: μ = μ₀, H₁: μ > μ₀
TS: Z = (X̄ − μ₀)/(σ/√n), p = P{Z ≥ (x̄ − μ₀)/(σ/√n)} = 1 − Φ((x̄ − μ₀)/(σ/√n))
Alternately H₀: μ ≤ μ₀, a composite hypothesis. TS same as above,
  p = P{Z ≥ (x̄ − μ₀)/(σ/√n)} ≤ 1 − Φ((x̄ − μ₀)/(σ/√n))
(the bell curve only has one region that matters)
Unknown Variance H₀: μ = μ₀, TS T = (X̄ − μ₀)/(S/√n) ~ t_{n−1}
  p = P{|T| ≥ |(x̄ − μ₀)/(S/√n)|} = 2(1 − P{t_{n−1} ≤ |(x̄ − μ₀)/(S/√n)|})
Equality of Means σ_x² and σ_y² known, X, Y normal, 2 sets of indep samples.
H₀: μ_x = μ_y or (μ_x ≤ μ_y); if H₀ is true then μ_x − μ_y = 0.
TS: if μ_x = μ_y, |X̄ − Ȳ| is more likely to be small.
X̄ − Ȳ ~ N(μ_x − μ_y, σ_x²/n + σ_y²/m) since a sum of indep normals is still normal.
Standardize: TS = (X̄ − Ȳ − (μ_x − μ_y))/√(σ_x²/n + σ_y²/m) ~ N(0,1)
p = P{|TS| ≥ |observed|} = 2(1 − Φ(|x̄ − ȳ|/√(σ_x²/n + σ_y²/m))); reject if p is small.
Step 4: plug in observed numbers.
Unknown Variance, Assumed Equal
Pooled sample variance: S_p² = ((n−1)S₁² + (m−1)S₂²)/((n−1) + (m−1));
like a weighted average, with degrees of freedom as weights.
TS: (x̄ − ȳ)/√(S_p²/n + S_p²/m) ~ t_{n+m−2}, p = P{|t_{n+m−2}| ≥ |x̄ − ȳ|/√(S_p²/n + S_p²/m)}
If assumed not equal, replace S_p with S_x and S_y.
Test Equality of Proportions: L18.
Sample correlation coefficient: r = S_xY/√(S_xx·S_YY);
traditionally r = Σ(xᵢ − x̄)(Yᵢ − Ȳ)/√(Σ(xᵢ − x̄)²·Σ(Yᵢ − Ȳ)²). So r² = R².
Here s_x = √((1/(n−1))·Σᵢ(Xᵢ − X̄)²), and similarly for s_y.
r does not measure the slope of the regression line.
The slope is b̂ = r·S_Y/S_X, which is about r if S_Y ≈ S_X.
|r| ≤ 1, so usually you get regression to the mean.
Linear Regression
Sum of squared residuals: SS = Σᵢ(Yᵢ − a − bxᵢ)²; choose a and b that minimize this.
Take partial derivatives w/ respect to a, b, set to 0, get the normal equations
a + x̄b = Ȳ and nx̄a + b·Σᵢxᵢ² = ΣᵢxᵢYᵢ.
Solution: b = (ΣᵢxᵢYᵢ − nx̄Ȳ)/(Σᵢxᵢ² − nx̄²), a = Ȳ − x̄b.
Identities: Σᵢ(xᵢ − x̄)² = Σᵢxᵢ² − nx̄², Σᵢ(xᵢ − x̄)(Yᵢ − Ȳ) = ΣᵢxᵢYᵢ − nx̄Ȳ.
So b = Σᵢ(xᵢ − x̄)(Yᵢ − Ȳ)/Σᵢ(xᵢ − x̄)² = S_xY/S_xx.
S_xx = Σᵢ(xᵢ − x̄)² = (n−1)s_x², S_YY = Σᵢ(Yᵢ − Ȳ)² = (n−1)s_Y², S_xY = Σᵢ(xᵢ − x̄)(Yᵢ − Ȳ).
S_YY is TSS, the total sum of squares: the amount of variance in the response variables; part of this is caused by differences in the input variables.
SSR = Σᵢ(Yᵢ − Ŷᵢ)² = Σᵢ(Yᵢ − â − b̂xᵢ)² = (S_xx·S_YY − S_xY²)/S_xx
measures the remaining variance in the response after we take into account differences of the input variables; known as RSS, the residual sum of squares.
S_YY − SSR is the amount of variation explained by the input variable, known as ESS, the explained sum of squares.
Coefficient of determination: R² = ESS/TSS = 1 − RSS/TSS = S_xY²/(S_xx·S_YY), so r² = R².
This shows how well the model fits the data. Ex. R² = .75: 75% of the variation of y is from changes in x; the rest is from variance even when x is constant.

Inferential Linear Regression
Y = α + βx + e where e is error; assume E[e] = 0, so E[Y|x] = α + βx.
For each observation Yᵢ = α + βxᵢ + eᵢ; assume the random errors eᵢ are indep.
Estimate α, β, and σ² from the tuples of samples: α and β from the least squares formulas; SSR estimates σ².
B = Σᵢ(xᵢ − x̄)(Yᵢ − Ȳ)/Σᵢ(xᵢ − x̄)², A = Ȳ − Bx̄.
B is a normally distributed, unbiased estimator for β:
B ~ N(β, σ²/S_xx), A ~ N(α, σ²·Σᵢxᵢ²/(nS_xx)).
SSR = Σᵢ(Yᵢ − A − Bxᵢ)²; SSR/σ² ~ χ²_{n−2} (lose 2 deg. free since A and B are linear combinations of Y₁ … Yₙ), indep. of A and B.
E[SSR/(n−2)] = σ² since E[SSR/σ²] = E[χ²_{n−2}] = n − 2.
Confidence Intervals
β: B ± √(SSR/((n−2)S_xx))·t_{α/2,n−2}
α: A ± √(Σᵢxᵢ²·SSR/(n(n−2)S_xx))·t_{α/2,n−2}
α + βx₀: A + Bx₀ ± √(1/n + (x₀ − x̄)²/S_xx)·√(SSR/(n−2))·t_{α/2,n−2}
Distributional Results
β: √((n−2)S_xx/SSR)·(B − β) ~ t_{n−2}
α: √(n(n−2)S_xx/(Σᵢxᵢ²·SSR))·(A − α) ~ t_{n−2}
Mean response α + βx₀: (A + Bx₀ − α − βx₀)/√((1/n + (x₀ − x̄)²/S_xx)·(SSR/(n−2))) ~ t_{n−2}
Future response Y = α + βx₀ + e: (Y − A − Bx₀)/√(((n+1)/n + (x₀ − x̄)²/S_xx)·(SSR/(n−2))) ~ t_{n−2}
Mean Response given x₀: E[Y|x₀] = α + βx₀; A + Bx₀ estimates the mean response given input x₀. Unbiased, a linear combination of indep normal r.v.
Prediction Interval Find an interval that will contain the response with certain confidence; NOT the same as a confidence interval. Best "point prediction" is A + Bx₀, unbiased.
Analysis of Residuals
Assume independence, constant variance (homoscedasticity), normality.
Check these assumptions by looking at the residuals.
Standardized residuals: (Yᵢ − (A + Bxᵢ))/√(SSR/(n−2)) (the denominator estimates σ).
Linearity is violated if the residuals show a discernible pattern.
Nonconstant variance shows up as increasing spread/outliers.
Check Normality Plot sample quantiles of the standardized residuals vs standard normal quantiles.
Quantiles: points taken at regular intervals from the CDF of a r.v.
Q-Q plot: compare 2 distributions by plotting their quantiles against each other, to check normality.
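A sketch computing standardized residuals and a Q-Q check on the same toy data; scipy's probplot supplies the normal quantiles:

```python
# Sketch: standardized residuals and a normal Q-Q check (toy data from above)
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b = ((x - x.mean()) * (Y - Y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = Y.mean() - b * x.mean()
resid = Y - a - b * x
std_resid = resid / np.sqrt((resid ** 2).sum() / (len(x) - 2))

# probplot returns (theoretical quantiles, ordered values) plus a fit line;
# points near the line support the normality assumption
(osm, osr), (slope, intercept, r) = stats.probplot(std_resid, dist="norm")
print(r)   # correlation of the Q-Q points; near 1 -> roughly normal
```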
Transforming to Linearity
W(t) ≈ ce^{dt}; take logs → log W ≈ log c + dt,
where α is log c, βx is dt, and log W is Y.
Interpret the regression coefficients: regression equation W = â + b̂t;
the average number of defects at t = 0 is â, and as t increases by 1, the number of defects increases by b̂ on average.
By Taylor expansion, W(t+1)/W(t) ≈ 1 + b̂ (exponential fit) and W(1.01t)/W(t) ≈ 1 + 0.01b̂ (power fit).
Linear Transformations
Exponential: log Y ≈ log c + dx, i.e. Y ≈ ce^{dx}.
X increases by 1 → Y changes by a factor of about 100d%: Y(x+1)/Y(x) = e^d ≈ 1 + d for small d.
Logarithm: Y ≈ c + d·log x.
X increases by 1% → Y changes by about d/100: Y(1.01x) − Y(x) = d·log(1.01) ≈ .01d since log(1+x) ≈ x.
Power: log Y ≈ log c + d·log x, i.e. Y ≈ cx^d.
X increases by 1% → Y changes by a factor of about d%: Y(1.01x)/Y(x) = (1.01)^d ≈ 1 + .01d since (1+x)^a ≈ 1 + ax.
Multiple Linear Regression
Least squares: B = [B₀, …, B_k]' = (X'X)⁻¹X'Y, where X is the n × (k+1) matrix with rows (1, xᵢ₁, …, xᵢₖ) and Y = (Y₁, …, Yₙ)'.
βᵢ is the amount of change in the response if xᵢ is increased by 1, keeping the others unchanged.
Residuals rᵢ = Yᵢ − B₀ − B₁xᵢ₁ − ⋯ − B_k·xᵢₖ; matrix form r = Y − XB.
SSR = Y'Y − Y'XB, the residual sum of squares.
Coefficient of multiple determination: R² = 1 − SSR/S_YY, the proportion of variation explained by the model; R is the multiple correlation coefficient.
As input vars are added, R² goes up and SSR goes down…but this could mean overfitting.
Adjusted/corrected R²: R̄² = ((n−1)R² − k)/(n−k−1)
Inference B has a multivariate normal dist N(β, σ²(X'X)⁻¹); B is an unbiased estimator of β.
σ²(X'X)⁻¹ is the covariance matrix; its diagonal gives the variances:
for each i, Bᵢ ~ N(βᵢ, σ²/Sᵢᵢ), where 1/Sᵢᵢ is the ith element in the diagonal of (X'X)⁻¹.
SSR/σ² ~ χ²_{n−k−1}, independent of B; SSR/(n−k−1) is an unbiased estimator for σ².
(Bᵢ − βᵢ)/√(σ̂²/Sᵢᵢ) ~ t_{n−k−1}; the denominator is the estimate of the sd of Bᵢ, the standard error.
Test whether xᵢ has an impact on the response keeping the other input vars unchanged: H₀: βᵢ = 0.
Reading Excel
Multiple R: R
R Squared: R² = 1 − SSR/S_YY
Adjusted R Squared: R̄² = ((n−1)R² − k)/(n−k−1)
Standard Error: σ̂ = √(SSR/(n−k−1))
Polynomial Regression B = (X'X)⁻¹X'Y, with rows of X equal to (1, xᵢ, xᵢ², …, xᵢᵏ).
ANOVA table for multiple regression: H₀: Y doesn't depend on the input; p-value = P{F_{k,n−k−1} ≥ F ratio}.
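A sketch of B = (X'X)⁻¹X'Y with standard errors, on simulated inputs:

```python
# Sketch: multiple regression via B = (X'X)^{-1} X'Y (simulated data)
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 2
x = rng.normal(size=(n, k))
Y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), x])           # prepend the intercept column
B = np.linalg.solve(X.T @ X, X.T @ Y)          # (X'X)^{-1} X'Y, solved stably
SSR = ((Y - X @ B) ** 2).sum()
sigma2_hat = SSR / (n - k - 1)                 # unbiased estimator of sigma^2
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))  # standard errors
print(B, se)
```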
ANOVA: Analysis Of Variance
Study the impact of a certain factor. H₀: all means are equal.
n_T: total number of observations; Xᵢⱼ: jth observation in the ith group;
Xᵢ.: sample mean of the ith group; X..: sample mean of all observations.
Between-group sum of squares SS_b = Σᵢ nᵢ(Xᵢ. − X..)²:
the amount of variation that can be explained by the groups.
Within-group sum of squares SS_w = Σᵢ Σⱼ (Xᵢⱼ − Xᵢ.)²:
the amount of variation that can't be explained by the groups; SS_w/σ² ~ χ²_{n_T−m}, and SS_b and SS_w are indep.
TS: F = (SS_b/(m−1))/(SS_w/(n_T−m)) ~ F_{m−1,n_T−m} if H₀ is true.
p-value = P{F_{m−1,n_T−m} ≥ observed TS}

ANOVA table:
source of var  | sum of squares | d.f.  | mean square  | F ratio                     | p-value
between groups | SS_b           | m−1   | SS_b/(m−1)   | (SS_b/(m−1))/(SS_w/(n_T−m)) | P{F_{m−1,n_T−m} ≥ F ratio}
within groups  | SS_w           | n_T−m | SS_w/(n_T−m) |                             |
total          | SS_b + SS_w    | n_T−1 |              |                             |

Two-Factor ANOVA
One-way: a single factor. Ex. gas mileage: is there a difference between cars and between drivers?
Each row/column represents one level. Row averages X₁. … X_m., column averages X.₁ … X.ₙ, overall average X..
Assumptions: the Xᵢⱼ are independent normal r.v. and the variances are all the same.
μᵢⱼ = μ + αᵢ + βⱼ, where μ is the overall mean, αᵢ is the correction for the ith row, and βⱼ for the jth column; μ + αᵢ is the mean of the ith row (use βⱼ for the mean of the jth column).
Hypotheses H₀ʳ: the row factor has no impact; H₀ᶜ: the column factor has no impact.
Estimators: X.. is an unbiased estimator for μ (overall sample mean → overall mean);
Xᵢ. − X.. is an unbiased estimator for αᵢ; X.ⱼ − X.. is an unbiased estimator for βⱼ (sample mean difference → mean difference).
Row sum of squares SS_r = Σᵢ n(Xᵢ. − X..)²: variation between rows.
Column sum of squares SS_c = Σⱼ m(X.ⱼ − X..)²: variation between columns.
Error sum of squares SS_e = Σᵢ Σⱼ (Xᵢⱼ − Xᵢ. − X.ⱼ + X..)²: variation due to randomness (subtract the mean, αᵢ, and βⱼ from Xᵢⱼ).
SS_e/((m−1)(n−1)) is an unbiased estimator for σ².
Error: SS_e/σ² ~ χ²_{(m−1)(n−1)}.
Row: if H₀ʳ is true, SS_r/σ² ~ χ²_{m−1}, indep. of SS_e.
Column: if H₀ᶜ is true, SS_c/σ² ~ χ²_{n−1}, indep. of SS_e.

ANOVA Table (let N = (m−1)(n−1)):
source | sum of squares     | d.f. | mean square | F ratio               | p-value
row    | SS_r               | m−1  | SS_r/(m−1)  | (SS_r/(m−1))/(SS_e/N) | P{F_{m−1,N} ≥ F ratio}
column | SS_c               | n−1  | SS_c/(n−1)  | (SS_c/(n−1))/(SS_e/N) | P{F_{n−1,N} ≥ F ratio}
error  | SS_e               | N    | SS_e/N      |                       |
total  | SS_r + SS_c + SS_e | mn−1 |             |                       |
2-way ANOVA with Interaction
Without interaction, μ₁ⱼ − μ₂ⱼ is the same for all j, and μᵢ₁ − μᵢ₂ is the same for all i.
Interaction: μᵢⱼ = μ + αᵢ + βⱼ + γᵢⱼ
• μ = μ..: grand mean
• αᵢ = μᵢ. − μ..: effect of the ith row
• βⱼ = μ.ⱼ − μ..: effect of the jth column
• γᵢⱼ: interaction of the ith row and jth column, γᵢⱼ = μᵢⱼ − μ − αᵢ − βⱼ = μᵢⱼ − μᵢ. − μ.ⱼ + μ..
Data: m rows, n columns, m·n cells, l ≥ 2 observations per cell;
Xᵢⱼₗ denotes an individual observation, Xᵢⱼ. the sample mean of cell (i, j).
Assumptions: Xᵢⱼₖ = μᵢⱼ + εᵢⱼₖ = μ + αᵢ + βⱼ + γᵢⱼ + εᵢⱼₖ, with the εᵢⱼₖ independent random errors ~ N(0, σ²).
H₀^{r,c,int}: no {row, column, interaction} effect.
Estimation
• X…: grand mean μ; Xᵢ..: row mean μᵢ.; X.ⱼ.: column mean μ.ⱼ; Xᵢⱼ.: cell mean μᵢⱼ
• Row effect αᵢ = μᵢ. − μ, estimator α̂ᵢ = Xᵢ.. − X…
• Column effect βⱼ = μ.ⱼ − μ, estimator β̂ⱼ = X.ⱼ. − X…
• Interaction γᵢⱼ = μᵢⱼ − μ − αᵢ − βⱼ, estimator γ̂ᵢⱼ = Xᵢⱼ. − Xᵢ.. − X.ⱼ. + X…
• Error εᵢⱼₖ = Xᵢⱼₖ − μᵢⱼ, estimator Xᵢⱼₖ − Xᵢⱼ.
Sums of squares
• Error: SS_e = Σᵢ Σⱼ Σₖ (Xᵢⱼₖ − Xᵢⱼ.)²; SS_e/σ² ~ χ²_{mn(l−1)}, and SS_e/(mn(l−1)) is an unbiased estimator for σ².
• Interaction: SS_int = Σᵢ Σⱼ l(Xᵢⱼ. − Xᵢ.. − X.ⱼ. + X…)²; under H₀^{int}, SS_int/σ² ~ χ²_{(m−1)(n−1)}.
• Row: SS_r = Σᵢ nl(Xᵢ.. − X…)²; under H₀ʳ, SS_r/σ² ~ χ²_{m−1}.
• Column: SS_c = Σⱼ ml(X.ⱼ. − X…)²; under H₀ᶜ, SS_c/σ² ~ χ²_{n−1}.

Sample Midterm 2
(a) If simple linear regression of Y on X yields a line with slope equal to 2, then simple linear regression of X on Y will yield a line with slope equal to 1/2. False.
(b) A simple linear regression model (Y on x) has R² = 0.7, while an exponential model (log(Y) on x) has R² = 0.8. This means the exponential model can explain more variation in Y. False.
(c) In multiple regression, adding an additional regressor (input variable) will always increase the multiple R-square, and decrease the adjusted R-square. False.
(d) In linear regression, a 95% prediction interval for a future response given a certain input level always contains the 95% confidence interval for the mean response given the same input level. True.
(e) In one-way ANOVA, the null hypothesis states that the population mean within each group is equal to zero. False.

Having done poorly on their math final exams in June, some students repeat the course in summer school, then take another exam in August. Assume these students are representative of all students who might attend summer school. Given their scores in June and August, explain how you can test whether the summer program is worthwhile.
Use a paired t-test. μ_x is the mean score in June, μ_y the mean score in August; H₀: not effective, μ_y = μ_x; H₁: μ_y > μ_x.
Let Xᵢ and Yᵢ be the scores of the ith student and take the difference Dᵢ = Yᵢ − Xᵢ.
Sample mean D̄ = (1/n)·ΣᵢDᵢ, sample sd S_D = √((1/(n−1))·Σᵢ(Dᵢ − D̄)²).
TS: T = √n·D̄/S_D ~ t_{n−1}; p = P{t_{n−1} ≥ TS}.
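A sketch of this paired test with made-up scores; the alternative="greater" check against ttest_rel assumes scipy ≥ 1.6:

```python
# Sketch: paired t-test for the summer-school setup (made-up scores)
import numpy as np
from scipy import stats

june = np.array([61.0, 55, 70, 48, 66, 59])     # hypothetical June scores
august = np.array([68.0, 60, 69, 58, 71, 63])   # same students in August
D = august - june
T = np.sqrt(len(D)) * D.mean() / D.std(ddof=1)
p = stats.t.sf(T, df=len(D) - 1)                # one-sided: H1 is mu_y > mu_x
print(T, p)
print(stats.ttest_rel(august, june, alternative="greater"))  # library check
```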
3. In a study of the relationship between the starting salary (Y) and GMAT score (x) of n = 100 MBA students, the following results are obtained: average starting salary avg(Y) = $90,000, sample standard deviation s(Y) = $45,000, average GMAT avg(x) = 600, sample standard deviation s(x) = 100, sample correlation coefficient r = 0.3.
(a) (10 pts) Find and interpret the slope of the simple linear regression of Y on x.
b̂ = r·s_Y/s_x = 0.3·45000/100 = 135
â = Ȳ − b̂x̄ = 90000 − 135·600 = 9000
Y = 9000 + 135x: each extra GMAT point raises the predicted salary by $135 on average.
(b) (10 pts) Find a 95% confidence interval for the mean salary of people who got 700 in GMAT.
Get a 100(1−γ)% C.I. for α + βx₀ from
(A + Bx₀ − (α + βx₀))/(√(1/n + (x₀ − x̄)²/S_xx)·√(SSR/(n−2))) ~ t_{n−2},
so use A + Bx₀ ± √(1/n + (x₀ − x̄)²/S_xx)·√(SSR/(n−2))·t_{γ/2,n−2}.
(c) (10 pts) Test at 0.05 level whether a higher GMAT score means a higher salary.
H₀: β = 0, mean salary doesn't increase with GMAT.
TS is t_{n−2}: T_{n−2} = √((n−2)S_xx/SSR)·B
p-value = P(T_{n−2} > √((n−2)s_xx/SSR)·b̂)
Since t₉₈ is close to standard normal, p = 1 − Φ(3).
(d) (15 pts) Construct the ANOVA table for linear regression.
(e) (10 pts) Simple linear regression of log(Y) on x yields slope = 0.015%, and that of Y on log(x) yields slope = 8000. Interpret the results.
If GMAT increases by 1, starting salary increases by about .015% on average.
For slope = 8000: if the GMAT score increases by 1%, salary increases by about 8000·.01 = $80 on average.
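A numeric check of (a) and (c) from the summary statistics alone, using S_xx = (n−1)s_x² and SSR = S_YY(1 − r²):

```python
# Sketch: check (a) and (c) from the summary statistics alone
import numpy as np
from scipy import stats

n, ybar, sy, xbar, sx, r = 100, 90_000.0, 45_000.0, 600.0, 100.0, 0.3
b = r * sy / sx                       # 135.0
a = ybar - b * xbar                   # 9000.0
Sxx = (n - 1) * sx**2
SYY = (n - 1) * sy**2
SSR = SYY * (1 - r**2)                # since R^2 = r^2 = 1 - SSR/SYY
T = np.sqrt((n - 2) * Sxx / SSR) * b  # ~3.1
print(a, b, T, stats.t.sf(T, n - 2))  # one-sided p, close to 1 - Phi(3)
```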
An emergency room physician wanted to know whether there were any differences in the amount of time it takes for three different inhaled steroids to clear a mild asthmatic attack. Over a period of weeks she randomly administered these steroids to asthma sufferers, and noted the time it took for the patients' lungs to become clear. Afterward, she discovered that 12 patients had been treated with each type of steroid, with the following sample means (in minutes) and sample variances resulting. Construct the ANOVA table to test whether the mean time to clear a mild asthma attack is the same for all three steroids.

Steroid | Sample mean | Sample variance
A       | 32          | 145
B       | 40          | 138
C       | 30          | 150

m = 3, n = 12; μᵢ is the mean time to clear an attack for the ith steroid.
H₀: all means are equal; TS = (SS_b/(m−1))/(SS_w/(nm−m)) ~ F_{m−1,nm−m}.
Grand mean x̄.. = (32 + 40 + 30)/3 = 34.
SS_b = n·Σᵢ(x̄ᵢ − x̄..)² = 12·[(32−34)² + (40−34)² + (30−34)²] = 672
SS_w = Σᵢ Σⱼ(xᵢⱼ − x̄ᵢ)² = Σᵢ(n−1)sᵢ² = 11·(145 + 138 + 150) = 4763

source of variance | sum of squares    | d.f. | mean square | F ratio            | p-value
between groups     | 672               | 2    | 336         | 336/144.33 ≈ 2.33  | P(F_{2,33} ≥ F ratio)
within groups      | 4763              | 33   | 144.33      |                    |
total              | 672 + 4763 = 5435 | 35   |             |                    |
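A sketch reproducing the table from the summary statistics:

```python
# Sketch: reproduce the steroid ANOVA table from the summary statistics
import numpy as np
from scipy import stats

means = np.array([32.0, 40.0, 30.0])
variances = np.array([145.0, 138.0, 150.0])
n, m = 12, len(means)

SSb = n * ((means - means.mean()) ** 2).sum()        # 672.0 (equal group sizes)
SSw = ((n - 1) * variances).sum()                    # 4763.0
F = (SSb / (m - 1)) / (SSw / (n * m - m))            # ~2.33
print(SSb, SSw, F, stats.f.sf(F, m - 1, n * m - m))  # p ~ 0.11
```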