Applied Data Analysis
Spring 2017
[Photo: Ukraine, 2016]
Karen Albert
[email protected]
Thursdays, 4-5 PM (Hark 302)
Lecture outline
1. Confidence intervals
2. Intro to hypothesis testing
Confidence intervals
Remember,
sample percentage = population percentage ± chance error
In a previous example,
49% = population percentage ± 2.7%
Confidence intervals
Provided the normal approximation is appropriate, how likely is
it that the parameter is within 1 standard error of the sample
percentage?
Provided the normal approximation is appropriate, how likely is
it that the parameter is within 2 standard errors of the sample
percentage?
Provided the normal approximation is appropriate, how likely is
it that the parameter is within 3 standard errors of the sample
percentage?
Confidence intervals
What we just calculated were confidence intervals for the
population percentage.
[43.6%, 54.4%]
95% confidence interval
Want more “confidence”?
[40.9%, 57.1%]
99.7% confidence interval
Note the tradeoff between confidence and precision.
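As a quick check in R (a minimal sketch using the sample
percentage of 49% and the standard error of 2.7% from the
earlier example):
phat <- 49   # sample percentage, from the earlier example
se <- 2.7    # standard error, in percentage points
phat + c(-2,2)*se   # 95% confidence interval
## [1] 43.6 54.4
phat + c(-3,3)*se   # 99.7% confidence interval
## [1] 40.9 57.1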
Interpretation
Chances are you want to interpret the interval thusly:
There is a 95% chance that the population percentage is
between 43.6% and 54.4%.
But that would be completely wrong.
The correct interpretation
Out there in the world, there exists a true percentage.
Consider our interval
[43.6%, 54.4%]
The true percentage is either in there, or it is not. There is no
probability about it.
Or more accurately, the probability is 0 or 1.
What does the 95% refer to?
It refers to the procedure.
Interpretation: take 2
If we repeated the procedure over and over again many times,
we would expect 95% of the confidence intervals generated to
contain the true percentage.
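To see what the 95% refers to, here is a minimal simulation
sketch in R (the population mean of 50, SD of 25, and sample
size of 400 are made-up values): draw many samples, build an
interval from each, and count how often the intervals cover the
true mean.
set.seed(1)                        # for reproducibility
mu <- 50; sigma <- 25; n <- 400    # hypothetical population and sample size
covered <- replicate(10000, {
  x <- rnorm(n, mu, sigma)                  # draw one sample
  ci <- mean(x) + c(-2,2)*sigma/sqrt(n)     # sample average ± 2 s.e.
  ci[1] <= mu & mu <= ci[2]                 # TRUE if the interval covers mu
})
mean(covered)   # comes out close to 0.95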
Example
An SRS of 1000 persons is taken to estimate the percentage of
Democrats in a large population. It turns out that 543 of the
people in the sample are Democrats.
First, find the sample percentage.
543/1000 = 0.543
Next, find the standard error.
s.e. = √(0.543 × (1 − 0.543)/1000) ≈ 0.0158
The interval
Going 2 standard errors either way:
54.3% ± 2 × 1.58% ≈ 54.3% ± 3.2%
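The same arithmetic in R (matching the code style used later in
these slides):
phat <- 543/1000
se <- sqrt(phat*(1 - phat)/1000)
se
## [1] 0.01575281
phat + c(-2,2)*se   # the interval, as proportions
## [1] 0.5114944 0.5745056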
Which is correct?
1. There is about a 95% chance for the percentage of
Democrats in the population to be in the range of 54.3%
plus or minus 3.2%.
2. 54.3% plus or minus 3.2% is a 95% confidence interval for
the sample percentage.
3. 54.3% plus or minus 3.2% is a 95% confidence interval for
the population percentage.
CI for Averages
The standard error for the average:
s.e. = σ/√n
The CI is then
x̄ ± z_{1−α/2} · σ/√n
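A sketch of this formula in R. The multiplier comes from
qnorm(); the values x̄ = 52.7, σ = 25, n = 400 are hypothetical
(borrowed from a later example):
alpha <- 0.05
z <- qnorm(1 - alpha/2)            # 1.959964 for a 95% interval
xbar <- 52.7; sigma <- 25; n <- 400
xbar + c(-1,1)*z*sigma/sqrt(n)     # roughly [50.25, 55.15]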
Where the equation comes from
1 − α = Pr(−z ≤ (x̄ − µ)/(σ/√n) ≤ z)
      = Pr(−z · σ/√n ≤ x̄ − µ ≤ z · σ/√n)
      = Pr(−x̄ − z · σ/√n ≤ −µ ≤ −x̄ + z · σ/√n)
      = Pr(x̄ − z · σ/√n ≤ µ ≤ x̄ + z · σ/√n)
Question time
If I gave you a coin and asked you whether it is fair, how many
times would you have to flip before you got a result that is
impossible given a fair coin?
Answer: an infinite number of times.
There is no way to disprove a probabilistic claim.
Instead, we will look for a result that is unlikely given a
particular assumption about the data.
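For instance (a made-up illustration in R): 60 or more heads in
100 flips of a fair coin is unlikely, but not impossible.
pbinom(59, size = 100, prob = 0.5, lower.tail = FALSE)   # P(60 or more heads)
## [1] 0.02844397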
Let’s start with an example....
A senator introduces a bill to simplify the tax code. She claims
that the bill is revenue-neutral — it won’t cost either the
government or the taxpayer money. Is this the case?
The Treasury Department uses a computer file of 100,000
representative tax returns and gauges the total tax payable
under the old rules and under the new rules. Then, they look at
the change:
change = tax under the new rules − tax under the old rules
Is the change positive or negative? The senator says that the
change is zero. Is she right?
Continuing the example
Let’s say that the work has been done on a pilot sample of 100
forms chosen at random. The sample average is,
x̄ = −$219
Why doesn’t this end the discussion?
We could have arrived at this answer by chance. We need to
know how likely it was to get an answer such as this one.
We need the standard error....
Let’s say the standard deviation is $725.
How do we get the standard error? We divide the standard
deviation by the square root of the sample size.
s.e. = $725/√100 = $72.50
So given that the senator believes that her tax bill is
revenue-neutral, how likely is it that we would observe a result
like −$219 by chance?
What are the chances?
Remember that the senator says that the change is 0. So,
z = (−$219 − $0)/$72.50 ≈ −3
That is, −$219 is about 3 standard errors below what we expected
given revenue-neutrality.
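In R (matching the style of the code later in these slides):
xbar <- -219          # sample average change, in dollars
se <- 725/sqrt(100)   # standard error from the previous slide
z <- (xbar - 0)/se    # observed minus expected under revenue-neutrality
z
## [1] -3.02069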
The chances
Using the normal approximation, the area to the left of -3 is
about 0.1%.
So there is 1 chance in 1000 of getting a value as extreme or
more extreme than -$219 if the true mean were 0.
Conclusion?
Either the senator is wrong, or something very rare has
occurred.
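As a quick check of that 0.1% figure in R:
pnorm(-3)   # area to the left of z = -3
## [1] 0.001349898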
Significance tests
What we just did is known as a significance test or a hypothesis
test.
These tests tell us whether a statistic we observe is due to
chance or something else.
In other words, significance tests can tell us whether an effect is
real or simply due to chance.
Let’s consider the pieces of a significance test.
Five pieces
• Assumptions
• Hypotheses
• Test statistic
• P-value
• Conclusion
Assumptions
All hypothesis tests make assumptions, and a test is only valid
when its assumptions are met.
• Most tests assume that the data are from a random
sample. That is, the tests assume that the box model is a
good analogy.
• Most tests assume that the underlying population has a
particular distribution, usually a normal distribution.
• Each test applies to either quantitative data or categorical
data.
Hypotheses
The null hypothesis expresses the idea that an observed
difference is due to chance.
In the last example, the null hypothesis is that the average of
“the box” equals $0.
Notation: H0 : µd = $0
The null hypothesis above is that the average difference is
equal to $0.
The alternative hypothesis
The alternative hypothesis expresses the idea that an observed
difference is not due to chance; i.e. it is “real.”
In the example we just did, the alternative hypothesis is that the
average of “the box” is less than $0.
Notation: H1 : µd < $0
The alternative hypothesis here is that the average difference is
less than $0.
Test statistic
We temporarily assume that the null hypothesis is true in order
to test it.
We assume the null is true, and then calculate the test statistic.
In our example, the statistic is
z = (−$219 − $0)/$72.50 ≈ −3
A test statistic is a measure of the difference between the data
we observed, and the data that we expected to see assuming
that the null is true.
More about test statistics
The test statistic in our case is a z, of course,
z = (observed − expected under the null)/standard error
Tests using the z-statistic are called...
wait for it...
z-tests!
P-values
The probability of seeing a z-statistic of -3 or smaller given the
null hypothesis was 1 in 1000.
The probability is known as an observed significance level or
often as a p-value.
• The observed significance level is the chance of getting a
test statistic as extreme, or more extreme than, the
observed one.
• The chance is computed on the assumption that the null
hypothesis is true.
• The smaller this chance is, the stronger the evidence
against the null.
Hmm....
Why does the p-value take into account the area to the left of
-3, as opposed to just z = −3?
There are two reasons:
1. What is the probability that z = −3?
2. It is a question of evidence.
The area to the left...
The data could have turned out differently.
Consider if the observed values had turned out to be:
z = (−$239 − $0)/$59 = −4.1 or z = (−$162 − $0)/$63 = −2.6
The value on the left is stronger evidence against the null, and
the value on the right is weaker evidence against the null.
But why? Because we are arguing by contradiction.
The logic
• We want to show that the null hypothesis leads to an
absurd conclusion and must be rejected.
• Say you get a p-value of 1 in 1000, as we did.
• Imagine many other investigators repeating the
experiment. Only 1 in 1000 would get a test statistic as
extreme or more extreme than we did given the null
hypothesis is true.
• So the null hypothesis has created an absurdity and should
be rejected.
• The smaller the p-value, the more you want to reject the
null.
Warning
Be careful...
the p-value is not the probability that the null hypothesis is true.
Just like confidence intervals, the null hypothesis is either true,
or it is not.
The p-value is essentially the probability of seeing a test statistic
as extreme as the one we observed...assuming that the null is true.
Conclusion
Small p-values are evidence against the null.
A small p-value indicates an observed value far from what was
expected under the null hypothesis. Too far, and we make a
decision to reject the null hypothesis.
We’ll discuss in the future what it means to be small.
Reviewing the pieces
• The null hypothesis expresses the idea that an observed
result is due to chance.
• The alternative hypothesis states that the observed result
is not due to chance.
• The test statistic measures the distance between the result
we observed, and what we expected under the null
hypothesis.
• The observed significance level or p-value tells us the
chance of getting a test statistic as extreme or more
extreme than the observed test statistic.
Another example
In one investigator’s model, the data are like 400 draws made
at random from a large box. The null hypothesis says that the
average of the box equals 50; the alternative says that the
average of the box is more than 50.
In fact, the data averaged out to 52.7 and the standard
deviation was 25. What do you conclude?
Example: the hypotheses
The null is
H0 : µ = 50
The alternative is
H1 : µ > 50
Example: the standard error
s.e. = s/√n = 25/√400 = 1.25
se <- 25/sqrt(400)
se
## [1] 1.25
Example: the test statistic
z = (x̄ − µ0)/s.e. = (52.7 − 50)/1.25 = 2.16
z <- (52.7-50)/se
z
## [1] 2.16
Example: the p-value and conclusion
pnorm(z,0,1,lower.tail=FALSE)
## [1] 0.01538633
The probability of seeing a result as extreme or more extreme
than 2.16, given that the true mean is 50, is roughly 1.5%.
This is a small probability, so we must conclude that either the
null hypothesis is false or something rare has occurred.
Which is it, by the way?
The procedure: short version
1. Set up the null and alternative hypotheses.
2. Pick a test statistic to measure the difference between the
data and what is expected under the null.
3. Compute the observed significance level (as sketched below).
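As a minimal sketch (the function name ztest and its arguments
are made up for illustration), the procedure for a one-sample
z-test with an upper-tail alternative fits in a few lines of R:
ztest <- function(xbar, mu0, s, n) {
  se <- s/sqrt(n)                     # standard error of the average
  z <- (xbar - mu0)/se                # distance from the null, in s.e.s
  p <- pnorm(z, lower.tail = FALSE)   # observed significance level
  c(z = z, p = p)
}
ztest(52.7, 50, 25, 400)   # reproduces the earlier example: z = 2.16, p ≈ 0.0154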
Why “pick a test statistic”?
The choice of the test statistic depends on the model and the
hypothesis being considered.
The test we just did is a “one-sample z-test.”
There are also:
• two-sample z-tests
• t-tests
• χ²-tests
And there are many, many other test statistics as well.
Another example
Someone approached the first 100 students he saw one day at
Berkeley and asked in which school or college they were
enrolled. The sample included 53 men and 47 women. From
the Registrar’s data, 25,000 students were registered at
Berkeley, and 67% of them were male.
Was his sample procedure like taking a random sample?
Answer
H0 : p = 0.67
H1 : p < 0.67
se <- sqrt((0.67*0.33)/100)   # s.e. under the null: p0 = 0.67, n = 100
se
## [1] 0.04702127
z <- (0.53-0.67)/se           # observed 53% men versus 67% expected
z
## [1] -2.977376
Answer
pnorm(z,0,1)            # area to the left of the z-statistic
## [1] 0.001453637
pnorm(0.53,0.67,.047)   # same area, without standardizing first
## [1] 0.00144726
Conclusion
The p-value is therefore about 1/1000.
Was it like taking a simple random sample? The p-value is
small, so the result cannot be explained by chance.
No, the investigator had too many women in his sample.
What did we learn?
• Confidence intervals
• Hypothesis testing