
Reasonable Hypothesis Testing and p-values?
Stat 342, Spring 2014
Prof. Guttorp - TA Aaron Zimmerman
Warmup
To get started, consider the following opening sentence from the Take a Number column of the New
York Times:
“When medical researchers report their findings, they need to know whether their result is a real effect
of what they are testing, or just a random occurrence. To figure this out, they most commonly use the
p-value.”
(This column, titled “Putting a Value to ‘Real’ in Medical Research,” was written by Nicholas Bakalar
and published on March 11, 2013.)
From what we know and have learned about statistics and p-values, what two problems do you have with
this sentence?
These comments actually come from Andrew Gelman’s blog: “First, whatever researchers might feel,
this is something they’ll never know. Second, results are a combination of real effects and chance, it’s
not either/or.
Perhaps the above is a forgivable simplification, but I don’t think so; I think it’s a simplification that
destroys the reason for writing the article in the first place.”
What he means in the first part is that researchers will never know with 100% certainty whether or
not they’ve observed a random occurrence or a real effect.
Hypothesis Testing Scenarios
For each of the following scenarios, discuss
1. Is the proposed hypothesis test reasonable for the scientific question of interest? Why? If the
answer is “no,” what could you change to make it better?
2. Is a p-value a reasonable tool to use in this setting? Why?
3. Regardless of the answers to (1) and (2), are there any additional statistical results that would be
particularly helpful in the setting?
Seattle Cloud Coverage Change
Researchers are interested in whether or not the annual number of cloudy days in Seattle is changing.
Using records starting in 1945, they’ve calculated the number of cloudy days for each of the years and
are performing a simple linear regression with E[# Cloudy Days] = β0 + β1 ∗ (Year − 1945), and with
additive Gaussian error terms. To answer their question, they’re testing the hypotheses: H0 : β1 = 0
vs. H1 : β1 ≠ 0.
The proposed hypothesis test seems reasonable, and given the sample size and the specific nature of the
question, a p-value seems to be a usable tool in this situation. As we’ve discussed, a confidence interval
in conjunction with a correct hypothesis setup is also useful.
Another issue is that the model they’re using may be incorrect. The number of cloudy days in
adjacent years may be correlated, and modelling the years as independent observations may be
inappropriate. Dealing with correlation among observations is outside the scope of the course, but
understanding what assumptions are needed to perform valid linear regression is an important part
of our course.
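To make this concrete, here is a minimal Python sketch (with simulated counts, not the researchers’
actual records) of how the slope test and a confidence interval for β1 could be computed:

# Illustrative sketch with simulated data: fit E[# cloudy days] = b0 + b1*(year - 1945)
# and test H0: b1 = 0 against H1: b1 != 0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
years = np.arange(1945, 2014)                    # hypothetical record, 1945-2013
x = years - 1945
y = 220 + 0.05 * x + rng.normal(0, 10, x.size)   # made-up truth plus Gaussian error

fit = stats.linregress(x, y)                     # least-squares slope and two-sided p-value
t_crit = stats.t.ppf(0.975, df=x.size - 2)       # 95% t-based interval for the slope
ci = (fit.slope - t_crit * fit.stderr, fit.slope + t_crit * fit.stderr)

print(f"slope = {fit.slope:.3f} days/year, p-value = {fit.pvalue:.3g}")
print(f"95% CI for slope: ({ci[0]:.3f}, {ci[1]:.3f})")

Reporting the interval alongside the p-value shows not just whether the slope is “significant” but how
large a change per year the data actually support.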
Solar Radiation
The earth is in an elliptical orbit around the sun and as a result the distance between the earth
and the sun changes throughout the year. Researchers have developed a new tool to measure solar
radiation and want to test it out. It’s known that solar radiation is proportional to the inverse square
of the distance to the sun. After letting their tool collect data for one year, they’re interested in seeing
how accurately it measures the solar radiation. They have reason to believe that they can model the
relationship between solar radiation and day of the year (at a fixed latitude) as
E[Solar Radiation] = β0 ∗ cos(β1 ∗ Day + β2) + β3. The measurements then come from a model with
this mean and additive Gaussian noise. To test if their machine is working well, they set up the test:
H0 : β0 = 0 vs. H1 : β0 = A, where A is the true change in solar radiation from the closest point in
earth’s orbit to the farthest point in earth’s orbit.
Just to clear things up, the β0 term corresponds to the amplitude of the cosine curve, which should be
exactly the difference between the maximum and minimum solar radiation if the cosine model is correct.
That said, the proposed hypothesis doesn’t seem at all reasonable. They set their null hypothesis to
be a known false result. In fact, a β0 value of zero corresponds to no change in measurement
throughout the year, which would indicate that the tool was working very poorly. If they instead set up
the hypotheses as H0 : β0 = A vs. H1 : β0 ≠ A, they could proceed and correctly use hypothesis tests
(and p-values). Again, reporting a confidence interval would be better and would give some indication
of how certain they were that their new tool was or wasn’t working correctly.
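As a rough illustration of the corrected test H0 : β0 = A vs. H1 : β0 ≠ A, the sketch below fits the
cosine model by nonlinear least squares and forms an approximate (Wald-style) interval and test for
β0; the data, starting values, and the value of A are all made up:

# Illustrative sketch with simulated data: fit E[radiation] = b0*cos(b1*day + b2) + b3
# and test H0: b0 = A against H1: b0 != A.
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

def model(day, b0, b1, b2, b3):
    return b0 * np.cos(b1 * day + b2) + b3

rng = np.random.default_rng(1)
day = np.arange(1, 366)
A = 45.0                                               # hypothetical "known" value of A
y = model(day, A, 2 * np.pi / 365, 0.3, 1360.0) + rng.normal(0, 5, day.size)

popt, pcov = curve_fit(model, day, y, p0=(40, 2 * np.pi / 365, 0, 1350))
b0_hat, b0_se = popt[0], np.sqrt(pcov[0, 0])

z = (b0_hat - A) / b0_se                               # approximate Wald test of H0: b0 = A
p_value = 2 * stats.norm.sf(abs(z))
ci = (b0_hat - 1.96 * b0_se, b0_hat + 1.96 * b0_se)
print(f"b0_hat = {b0_hat:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), p = {p_value:.3g}")

Comparing the confidence interval for β0 against A tells the researchers directly how far off their tool
could plausibly be.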
Unfair Penny
A group of 10 ambitious students have heard that the probability of heads and tails in a penny is not
truly 50-50 because the “heads” side is more embossed than the “tails” side. To figure this out, each of
the 10 students flips 300 pennies a day for a year. At the end of one year they have 1,095,000 penny flips
and they’re ready to perform their test: H0 : p = 0.5 vs. H1 : p ≠ 0.5. They write an article for the
school paper and report that they found a significant p-value using an α-level of 0.05 and claim that
the penny is indeed an unfair coin.
The hypothesis seems reasonable in this case. The large sample size is needed because they know that
the effect, if it exists at all, will be very small. The problem lies in reporting only the p-value. Reporting
an estimate of P(Heads) and a corresponding confidence interval would give the same information
and would give readers an idea of how much practical relevance or impact the results actually have.
A real-life study of coin-flipping bias can be found in a paper by Persi Diaconis and others called
“Dynamical Bias in the Coin Toss.”
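As a concrete (and entirely made-up) illustration of why the estimate and interval matter, consider
the following Python sketch using a hypothetical count of heads:

# Illustrative sketch with a made-up count of heads out of n = 1,095,000 flips:
# report an estimate of P(Heads) and a confidence interval, not only the p-value.
import math
from scipy import stats

n = 1_095_000
heads = 548_600                                  # hypothetical observed count

p_hat = heads / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)      # normal-approximation 95% interval

z = (p_hat - 0.5) / math.sqrt(0.25 / n)          # two-sided test of H0: p = 0.5
p_value = 2 * stats.norm.sf(abs(z))

print(f"p_hat = {p_hat:.5f}, 95% CI = ({ci[0]:.5f}, {ci[1]:.5f}), p-value = {p_value:.3g}")

With a count like this the p-value falls below 0.05, yet the interval shows the estimated bias is only
about a tenth of a percentage point: a “significant” but practically negligible departure from fairness.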
Diabetes Prevention
There was a National Institutes of Health study called the Diabetes Prevention Program, which studied
whether or not a lifestyle-alteration class (on eating healthier and exercising more) could help lower
the risk of becoming diabetic among a population of people identified to have “pre-diabetes.” They
used weight loss after 12 months as a proxy for reduced risk of diabetes. The study seemed successful,
but one problem with it was that more than 75% of the enrolled members were women. As a result, there
was some worry that offering free versions of these classes wouldn’t be an effective type of program
to target both men and women in a general population. To help resolve this question, a second followup
study was designed that only allowed men to enroll. In the second study the men received 6 months of
class and had their weight measured every month during the class and for 6 months afterwards. The
researchers in the second study tested whether there was significant weight loss at the end of
the twelve months using a paired t-test: H0 : µdiff = 0 vs. H1 : µdiff ≠ 0.
The main problem here is that the primary concern after the first study is whether men will enroll in
the intervention class. By designing a follow-up experiment that only allows men into the program,
they lose the ability to draw any conclusions about that main goal of the followup.
If we get past that main point, and assume that they’re actually interested in seeing how effective
their intervention class is among a group of men, the rest of the process seems reasonable. Again, a
confidence interval would be useful to report.
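Setting the design issue aside, here is a minimal Python sketch (with simulated weights, purely for
illustration) of the paired t-test and a confidence interval for the mean weight change at twelve months:

# Illustrative sketch with simulated data: paired t-test of H0: mu_diff = 0 for
# weight at enrollment vs. weight twelve months later, plus a CI for the mean change.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 120                                          # hypothetical number of men enrolled
baseline = rng.normal(100, 15, n)                # weight (kg) at enrollment
followup = baseline - rng.normal(2.0, 4.0, n)    # weight (kg) twelve months later

diff = followup - baseline
t_stat, p_value = stats.ttest_rel(followup, baseline)

se = diff.std(ddof=1) / np.sqrt(n)               # 95% t-interval for the mean difference
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (diff.mean() - t_crit * se, diff.mean() + t_crit * se)

print(f"mean change = {diff.mean():.2f} kg, t = {t_stat:.2f}, p = {p_value:.3g}")
print(f"95% CI for mean change: ({ci[0]:.2f}, {ci[1]:.2f}) kg")

The interval for the mean change is the piece most readers will care about: it says how much weight
loss the class is plausibly associated with, not just whether the change differs from zero.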