ST430: Introduction to Regression Analysis, Ch 1, Sec 1.7–1.10
Luo Xiao
August 24, 2015
Review of Basic Statistics
A statistic is a random variable
Recall: a statistic is a summary calculated from a sample.
Statistics vary from sample to sample.
If samples are chosen randomly, the variation of a statistic is also random.
That is, under random sampling, a statistic is a random variable.
Sampling distribution
Every random variable has a probability distribution, usually represented by either:
a probability density function, such as a normal density;
or a probability mass function, such as the binomial probability function.
In the special case of a statistic, its probability distribution is also called its sampling distribution.
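For example, base R evaluates a normal density with dnorm() and a binomial probability function with dbinom():
dnorm(0)                          # standard normal density at 0
## [1] 0.3989423
dbinom(3, size = 10, prob = 0.5)  # P(exactly 3 successes in 10 fair trials)
## [1] 0.1171875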
Berkeley girls’ heights example
For example, suppose we view the 54 girls' heights at 18 yrs as a population,
and draw a random sample of size 20:
library(fda)
x <- growth$hgtf[31,]
y <- sample(x, size = 20)
mean(y)
## [1] 166.94
If we draw more samples, we get a different sample mean each time: R
demonstration.
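A quick sketch of that demonstration (each run gives different values):
replicate(5, mean(sample(x, size = 20)))  # five new samples, five different means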
Sampling distribution
If we draw many samples, we begin to see the sampling distribution:
library(fda)
x <- growth$hgtf[31,]
sampleMeans <- rep(NA, 1000)
for (i in 1:length(sampleMeans))
  sampleMeans[i] <- mean(sample(x, 20))
par(mfrow = c(1, 2))
hist(x, xlim = c(150, 185), xlab = "Girls' heights (cm)",
     main = "Histogram of the population")  # left panel
hist(sampleMeans, xlim = c(150, 185))       # right panel
abline(v = mean(x), col = "red", lwd = 2)   # population mean in red
[Figure: two histograms — "Histogram of the population" of girls' heights (cm) and "Histogram of sampleMeans" — both plotted over roughly 150–180 cm, with the population mean marked by a red line.]
Note that the sample means are:
distributed around the population mean of 166 cm;
not as widely dispersed as the original 54 measurements.
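We can quantify both points by continuing the code above; note the simulated means are somewhat less spread out than σ/√n suggests, because each sample of 20 is drawn without replacement from only 54 values:
mean(sampleMeans)          # close to the population mean, mean(x)
c(sd(x), sd(sampleMeans))  # the means are much less dispersed than the heights
sd(x) / sqrt(20)           # the sigma/sqrt(n) benchmark of the next slide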
Some theoretical results
If Y1, Y2, . . . , Yn are randomly sampled from some population with mean µ and standard deviation σ, then the sampling distribution of their mean Ȳ satisfies:
for any n,
Mean: E(Ȳ) = µȲ = µ;
Standard error of estimate: σȲ = σ/√n;
for large n, Ȳ is approximately normally distributed (Central Limit Theorem).
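A small simulation sketch of these results, assuming a skewed exponential population with µ = σ = 1:
set.seed(1)
n <- 40
means <- replicate(5000, mean(rexp(n, rate = 1)))  # 5000 sample means
mean(means)  # close to mu = 1
sd(means)    # close to sigma/sqrt(n) = 1/sqrt(40), about 0.158
hist(means)  # roughly bell-shaped despite the skewed population (CLT)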
Inference about a parameter
Point estimation
Interval estimation
Hypothesis testing
Inference about a parameter: point estimation
For example, the population mean, µ
A good estimator of µ should have a sampling distribution that is:
centered around µ;
with little dispersion.
We often make these ideas specific by using the mean and standard error.
Consider the sample mean, Ȳ :
centering: µȲ = µ; Ȳ is unbiased;
dispersion: σȲ = σ/√n; Ȳ has a small standard error of estimate when n is large.
In fact, when the original data are normally distributed, Ȳ has the smallest
standard error of any unbiased estimator.
That is, the sample mean Ȳ is a Minimum Variance Unbiased
Estimator (MVUE).
In other cases, Ȳ is usually a good estimator of µ, but not the best.
For data with the uniform distribution, the midrange is better.
For data with the double exponential (Laplace) distribution, the sample
median is better.
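A simulation sketch of both comparisons; the Laplace draws are built as exponentials with random signs, since base R has no Laplace sampler:
set.seed(2)
n <- 25; B <- 5000
## Uniform(0, 1): the midrange has the smaller standard error
u <- matrix(runif(n * B), n, B)
c(sd(colMeans(u)), sd((apply(u, 2, min) + apply(u, 2, max)) / 2))
## Laplace: the sample median has the smaller standard error
z <- matrix(sample(c(-1, 1), n * B, replace = TRUE) * rexp(n * B), n, B)
c(sd(colMeans(z)), sd(apply(z, 2, median)))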
The sample mean Ȳ is always the Best Linear Unbiased Estimator (BLUE):
For any constants w1, w2, . . . , wn with w1 + w2 + · · · + wn = 1, if W is the estimator
W = w1 Y1 + w2 Y2 + · · · + wn Yn,
then W is unbiased:
µW = w1 µ + w2 µ + · · · + wn µ = µ;
but the standard error of estimate is
σW = σ √(w1² + w2² + · · · + wn²) ≥ σ/√n = σȲ.
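A numerical sketch: any unequal weights summing to 1 inflate the standard error relative to the equal weights 1/n.
n <- 10; sigma <- 1
w_equal   <- rep(1 / n, n)
w_unequal <- c(0.5, rep(0.5 / (n - 1), n - 1))  # also sums to 1
c(sqrt(sum(w_equal^2)) * sigma,    # = sigma/sqrt(10), about 0.316
  sqrt(sum(w_unequal^2)) * sigma)  # larger, about 0.527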
Inference about a parameter: interval estimation
Recall that, by the Central Limit Theorem, when n is large, Ȳ is
approximately normally distributed.
That is,
(Ȳ − µȲ)/σȲ = (Ȳ − µ)/(σ/√n)
approximately follows the standard normal distribution.
So the chance that
−1.96 ≤ (Ȳ − µ)/(σ/√n) ≤ 1.96
is approximately 95%.
Equivalently, the chance that
Ȳ − 1.96 σ/√n ≤ µ ≤ Ȳ + 1.96 σ/√n
is approximately 95%.
We say that
(Ȳ − 1.96 σ/√n, Ȳ + 1.96 σ/√n)
is an approximate 95% confidence interval (CI) for µ.
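A coverage-check sketch in R, assuming a standard normal population (µ = 0, σ = 1 known):
set.seed(3)
n <- 100
covered <- replicate(10000, abs(mean(rnorm(n))) <= 1.96 / sqrt(n))
mean(covered)  # proportion of intervals containing mu; close to 0.95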
To calculate the end-points of this approximate confidence interval, we need
to know the additional parameter σ.
Typically σ is unknown, so we cannot use the CI.
But we can estimate σ by the sample standard deviation s, and use the
alternative confidence interval
(Ȳ − 1.96 s/√n, Ȳ + 1.96 s/√n).
When n is large, the chance that
Ȳ − 1.96 s/√n ≤ µ ≤ Ȳ + 1.96 s/√n
is still approximately 95%.
What if n is not large?
In small samples, we can still construct a confidence interval, but it has the
correct coverage probability only if the original data are approximately
normally distributed.
The key is to replace ±1.96, the 2.5% and 97.5% points of the normal
distribution, with ±t.025,n−1 , the 2.5% and 97.5% points of Student’s
t-distribution with (n − 1) degrees of freedom: for normally distributed data,
the chance that
Ȳ − t.025,n−1 s/√n ≤ µ ≤ Ȳ + t.025,n−1 s/√n
is 95%.
Tables of the t-distribution show that when n is large, the percent points
are very close to those of the normal distribution.
So it’s reasonable to use the t-distribution percent points whenever the
confidence interval is based on the sample s instead of the population σ.
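We can see this directly in R:
qt(0.975, df = c(5, 10, 30, 100, 1000))
## [1] 2.570582 2.228139 2.042272 1.983972 1.962339
qnorm(0.975)
## [1] 1.959964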
M&S give formulas for a general 100(1 − α)% confidence interval:
(Ȳ − tα/2,n−1 s/√n, Ȳ + tα/2,n−1 s/√n);
here α = .05 for a 95% CI; in some situations, α = .01 for a 99% CI is preferred; other values are rarely used.
What if n is small and the original data cannot be approximated by a normal distribution?
Bootstrap the original data.
Berkeley girls' heights example: CI (normal quantiles)
library(fda)
y <- growth$hgtf[31,]
n <- length(y)
ybar <- mean(y)
s <- sd(y)
lower <- ybar - 1.96*s/sqrt(n)
upper <- ybar + 1.96*s/sqrt(n)
c(lower,upper) ## using normal distribution
## [1] 164.6115 167.9774
Berkeley girls' heights example: CI (t quantiles)
library(fda)
y <- growth$hgtf[31,]
n <- length(y)
ybar <- mean(y)
s <- sd(y)
qt(0.975,n-1)
## [1] 2.005746
lower <- ybar - qt(0.975,n-1)*s/sqrt(n)
upper <- ybar + qt(0.975,n-1)*s/sqrt(n)
c(lower,upper) ## using t distribution
## [1] 164.5722 168.0167
Berkeley girls' heights example: CI (bootstrap)
library(fda)
y <- growth$hgtf[31,]
bootMeans <- rep(NA, 1000)
for (i in 1:length(bootMeans))
  bootMeans[i] <- mean(sample(y, replace = TRUE))  # resample all n = 54 heights with replacement
lower <- quantile(bootMeans, 0.025)
upper <- quantile(bootMeans, 0.975)
c(lower, upper)  ## percentile bootstrap CI; varies run to run, close to the t-based interval
Inference about a parameter: testing a hypothesis
A point estimate is the most likely value of the parameter.
A confidence interval is a calibrated range of plausible values.
Sometimes we just want to know whether a particular value is plausible.
We assess its plausibility by testing statistical hypotheses.
Example: µ0 is an interesting value of the population mean µ.
Null hypothesis, H0 : µ = µ0
Alternative hypothesis, Ha : µ ≠ µ0.
Data are a sample of size n with mean ȳ and standard deviation s.
Basic idea: H0 is implausible if ȳ is far from µ0 .
To be precise:
t = (ȳ − µ0)/(s/√n)    (test statistic)
measures how far ȳ is from µ0, as a multiple of the standard error of estimate.
Note: if H0 is true, t follows the t-distribution with n − 1 degrees of freedom.
Basic idea: reject H0 if |t| is large.
To be precise: choose a level of significance α; again often α = .05.
Reject H0 if |t| > tα/2,n−1, where tα/2,n−1 is the cutoff value.
We can show that when H0 is true, that is, µ = µ0, and the data are
normally distributed, the chance of (incorrectly) rejecting H0 is α.
That is, α is the chance of making a Type I error.
If a statistician always follows this procedure, true null hypotheses will be
rejected only 100α% of the time.
So when a null hypothesis is rejected, either it was actually false, or one of
these infrequent errors occurred.
Note: We never accept H0 , we only fail to reject it.
This is a two-tailed test: we reject H0 if ȳ is too far from µ0 in either
direction.
In regression analysis, almost all tests are two-tailed.
M&S discuss one-tailed tests, and provide an example.
Deciding which hypothesis is H0 and which is Ha may not be easy.
Berkeley girls’ heights example: test
Suppose we are interested in whether the mean height of girls at 18 yrs is 160 centimeters, so µ0 = 160:
H0 : µ = 160; Ha : µ ≠ 160.
library(fda)
y <- growth$hgtf[31,]
n <- length(y)
t <- (mean(y)-160)/(sd(y)/sqrt(n))
c(abs(t), qt(0.975,n-1))
## [1] 7.330501 2.005746
Since |t| = 7.33 > t.025,53 = 2.01, we reject H0.
P-value
Instead of deciding whether or not to reject H0 , we can weigh the evidence
against it.
The conventional way to do this is using the P-value, which is:
the probability, if H0 were true, that the test statistic is as
extreme as observed.
In the Berkeley girls' heights example,
P(|t| ≥ 7.33) < 0.001.
A small P-value like this is strong evidence against H0 .
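In R, the two-sided P-value is twice a t tail probability; with the test statistic computed above:
2 * pt(-7.3305, df = 53)  # about 1.3e-09, matching the t.test output below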
Hypothesis test in R
library(fda)
y <- growth$hgtf[31,]
t.test(y,mu=160)
##
##  One Sample t-test
##
## data:  y
## t = 7.3305, df = 53, p-value = 1.325e-09
## alternative hypothesis: true mean is not equal to 160
## 95 percent confidence interval:
##  164.5722 168.0167
## sample estimates:
## mean of x
##  166.2944