Matched tests

Objectives 7.1 Inference for comparing means of two populations
- Matched pair t confidence interval
- Matched pair t hypothesis test
- http://onlinestatbook.com/2/tests_of_means/correlated.html
Overview of what is to come
We have covered the difficult bits of the course: the part where we need to do extensive calculations.
In terms of inference, what have we done?
- Constructed confidence intervals for locating the population mean.
- Conducted statistical tests.
From now on we will do much the same thing. The only difference is that the data sets will `appear' more complex. But we will still be:
- Constructing confidence intervals
- Testing hypotheses.
The only differences are:
- We need to identify the appropriate methodology/procedure given the data and how it was collected.
- The calculations become more difficult, but we don't do them! We make the computer do them instead. Our role is to understand:
  - every single part of the computer output, and
  - the assumptions used to do all the calculations.
As before, the standard error is a vital ingredient and will be used to measure uncertainty:
- To construct CIs – just do as before.
- To do tests via the t-transform – just do as before. Recall

    t = (estimate − mean in the null hypothesis) / standard error
The t-transform measures the number of standard errors between the estimated mean and the null mean. It is a measure of distance: the greater the distance, the less plausible the null.

Important: continue to make plots of the normal and t-distributions, using the results from the output. This will help you check that what you are doing makes sense.
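As a minimal sketch of this recipe, all numbers below are made up for illustration (an estimated mean of 5.1, a null mean of 4.0, a standard error of 0.5 and 14 degrees of freedom):

```python
from scipy.stats import t

# Made-up numbers for illustration: estimate, null-hypothesis mean,
# standard error and degrees of freedom.
estimate, null_mean, std_err, df = 5.1, 4.0, 0.5, 14

# The t-transform: number of standard errors between estimate and null.
t_stat = (estimate - null_mean) / std_err

# One-sided p-value for HA: mu > null_mean (area to the right of t_stat).
p_value = t.sf(t_stat, df)

print(round(t_stat, 2))   # 2.2 standard errors from the null
print(round(p_value, 3))
```

This is exactly what the computer output reports for us; our job is to recognize the t-value and p-value when they appear.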
New types of data: Comparative inference
In most statistical procedures, the objective is to make comparisons. For example:
- Does tuition lead to higher grades?
- Does eating healthy food lead to longer life expectancy? Etc.
How does one design the experiment in order to test such hypotheses?
In matched pair studies, subjects are naturally matched and comparisons are made within each pairing. Examples:
- A patient before and after a treatment.
- Assessing whether a diet worked by weighing a person before and after.
- Studies involving twins, with each twin given a different regime.
Using matched studies to make comparisons is an extremely useful method for reducing confounding in a design.
Twins in space:
http://www.nasa.gov/press/2015/january/astronaut-twins-available-forinterviews-about-yearlong-space-station-mission/ - .VRriQGR4qlw
Matched pair data
We are given a data set.
- If there is a clear matching in the data, then we need to use a matched pair procedure.
- We need to determine whether there is matching by understanding how the data was collected.
- Once we determine there is matching, we need to understand the statistical output of the matched pair procedure.
Examples of matched data we will consider in this chapter:
- The effect red wine has on polyphenol levels (wait a minute, we already considered this data…)
- The influence that a full moon has on certain patients.
- The differences between running at high and low altitude.
- Does Friday 13th change behavior?
- The weights of baby calves.
The questions asked above are answered by collecting matched data. In an exam you will be asked to identify matched data.
Example 1: Red wine and polyphenol levels in blood
We have already come across this example in Chapters 6 and 7. There we used a one-sample method to analyze this data, but only after processing: the data we used were the differences we saw between before and after taking red wine.
The `raw data' is simply the polyphenol levels before taking red wine and after taking red wine. It is very natural to consider the difference, as it gives the increase/decrease in polyphenol after the treatment.
The matched pair methods we discuss in this chapter are identical to the one-sample methods discussed in Chapters 6 and 7. We only need to understand why taking differences is important for matched data.
First, if we want to understand whether mean levels of polyphenol are greater after drinking red wine, we need to articulate this as a hypothesis:

    H0 : µafter ≤ µbefore against HA : µafter > µbefore

or equivalently

    H0 : µafter − µbefore ≤ 0 against HA : µafter − µbefore > 0.
The statistical output: CI and tests
Above is the 95% CI and tests for the mean.
- The top plot is with the paired data (see demo).
- The lower plot is after manually taking differences.
- We observe that the outputs are identical. In the next slide we review what the output is telling us.
Review of polyphenol output: CI
- The output on the left is the confidence interval for where we believe the mean difference after taking red wine should lie.
- We can calculate it using the output. Note that Std. Err = standard error = 3.06/√15 = 0.79.
- Using this, the 95% confidence interval for the mean change in polyphenol levels is [4.3 − 2.145×0.79, 4.3 + 2.145×0.79] = [2.6, 5.99].
- However, we do not need to calculate this. It is automatically given in the output (free lunch)!
- L.Limit corresponds to the lower end of the interval, which is 2.6.
- U.Limit corresponds to the upper end of the interval, which is 5.99.
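The interval can be reproduced from the summary statistics alone. A minimal sketch (the numbers 4.3, 3.06 and n = 15 are the summary statistics described above):

```python
import math
from scipy.stats import t

# Summary statistics from the polyphenol output: n = 15 differences,
# mean difference 4.3, standard deviation of the differences 3.06.
n, mean_diff, sd_diff = 15, 4.3, 3.06

se = sd_diff / math.sqrt(n)        # standard error, about 0.79
t_crit = t.ppf(0.975, df=n - 1)    # about 2.145 for a 95% CI with 14 df

lower = mean_diff - t_crit * se
upper = mean_diff + t_crit * se
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```

Of course, the software hands us this interval directly (the free lunch above); the sketch just confirms where the numbers come from.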
Review of polyphenol output: Testing
- The output on the right is testing H0 : µA − µB ≤ 0 against HA : µA − µB > 0 (i.e. taking red wine increases polyphenol levels). The p-value is very small, less than 0.1%. As this is far smaller than 5%, it tells us that there is strong evidence to suggest that the mean level of polyphenol increases with wine consumption.
- The t-value is so large that the p-value is tiny!
- Observe, we can also deduce that if we were to test H0 : µA − µB ≥ 0 against HA : µA − µB < 0 (i.e. taking red wine decreases polyphenol levels), then the p-value would be greater than 99.9%; thus there is no evidence that polyphenol decreases with wine consumption. The two one-sided p-values sum to 100%.
Example 2: Does a full moon have an influence on behavior?
- We want to investigate whether aggressive dementia patients tend to be more aggressive when there is a full moon. The behavior of 15 disruptive dementia patients was studied to see if there is any evidence of this. For each patient, the average number of disruptive events on full moon days and on other days was counted. The data is on the right.
- The raw numbers do not contain the information on being `more disruptive'. Instead one should consider the difference between each of the pairs. It is the difference that actually contains the information on whether the individuals are more or less disruptive. In addition, by taking differences we are `factoring out' some of the natural variability in aggressive behavior between patients.
The hypothesis of interest and output
We want to test whether the full moon made people more disruptive. First we set notation:
Let µN = the mean number of disruptive events (no full moon).
Let µF = the mean number of disruptive events under a full moon.
We conjecture that there are more disruptive events during a full moon, so we are testing

    H0 : µF − µN ≤ 0 against HA : µF − µN > 0.

We see that the p-value is extremely small; thus there is strong evidence to reject the null.
Understanding the output: the t-value is calculated as

    t = (2.3 − 0)/0.34 = 6.71

Using the same output we can calculate the 95% CI:

    [2.3 ± 2.1 × 0.34] = [1.58, 3.02]
Lab Practice I (moon example)
- Load the moon data into Statcrunch.
- Go to Stats -> T Stats -> Paired -> With data.
- You will get two pages of output.
Checking reliability of output
Now we want to see whether the calculations are reliable:
- We always assume the sample is a simple random sample. In this example this assumption is a bit dubious, as the patients selected were the ones who were the most disruptive. Therefore we can only draw inference on the population of disruptive patients.
- As the sample size is quite small (18), we should check that the differences do not deviate too much from normality. A histogram and QQplot are given below.
Observations:
- The data is numerical continuous (average number of disruptive events per person).
- The histogram of the differences does not look very bell-shaped and the points on the QQplot don't fall on the line. The data is not very close to normal, but there isn't any clear skew, which is the main factor preventing the CLT from kicking in for relatively large sample sizes.
- The sample size of 18 is below the 30 rule of thumb.
- However, the simulation of the sampling distribution using the applet shows that the sample mean based on 18 observations will be quite close to normal. This means:
  - We really do have 95% confidence in the confidence interval.
  - The p-value really is far less than 0.01%.
- The above observations imply that the sample mean will be approximately normal. Therefore the p-value that was calculated using the t-distribution (remember, we only use the t because the standard deviation is unknown) is relatively close to the truth.
- Regardless of the normality of the sample mean, the sample mean is 6.71 standard errors from the null (which is a huge difference). The t-transform is so large that the p-value is very small regardless of what the actual distribution is. Based on this there is overwhelming evidence that the behavior on full moon days will be different from other days.
- Recommendations: As a consequence of this study the nursing home may want to bring in more staff on full moon days. The mean number of additional disruptive events on a full moon is between [1.57, 3.02]. Based on this interval, the nursing home may want to calculate the number of extra staff to bring on duty during the full moon.
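The analysis above can be sketched in code. The numbers below are hypothetical, not the real moon data set (which lives in Statcrunch); the point is that a matched pair test is identical to a one-sample t-test on the differences:

```python
import numpy as np
from scipy import stats

# Hypothetical matched data: average disruptive events per patient on
# full moon days and on other days (NOT the real moon data set).
full_moon = np.array([3.3, 3.7, 2.4, 3.0, 3.5, 4.1, 2.9, 3.2, 3.8, 2.6])
other     = np.array([0.6, 1.2, 0.9, 1.1, 0.5, 1.6, 0.8, 1.0, 1.3, 0.7])

# The paired t-test...
t_stat, p_two_sided = stats.ttest_rel(full_moon, other)

# ...is identical to a one-sample t-test on the differences.
diffs = full_moon - other
t_check, _ = stats.ttest_1samp(diffs, 0)

# One-sided p-value for H0: mu_F - mu_N <= 0 vs HA: mu_F - mu_N > 0.
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(round(t_stat, 2), p_one_sided)
```

The two t-values agree exactly, which is why this chapter's methods are the same as the one-sample methods from Chapters 6 and 7.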
Example 3: Running at different altitudes
- It is usually believed that people's running times at high altitude are worse than their running times at sea level. We want to check this assertion.
- Data collection: 12 runners are asked to run the same distance at both sea level and high altitude, and their running times are recorded. The data is given on the right. Since the same runner is used for both the high and low altitude, there is clear matching in the data.
- Also observe that most of the differences are positive: for most of the runners we see an increase in time.
The output and using it for testing
- The 95% CI for the difference is given below.
- We are interested in understanding whether running at a high altitude increases running time. So the hypothesis of interest is H0 : µH − µL ≤ 0 against HA : µH − µL > 0. Suppose we do the test at the 1% level.
- The t-transform is

    t = (1.2 − 0)/0.311 = 3.88

- The p-value is the area to the RIGHT of 3.88. Looking up the tables, we see that the area to the right of 3.88 is at most 0.25%. Thus the p-value is less than 0.25%.
- The p-value can be calculated exactly by doing the test. The results from three hypothesis test outputs are on the next slide.
One and two sided tests: The output
- The output corresponding to the hypothesis test of interest is the last one and gives the p-value 0.13%.
- However, for the purpose of an exam you should be able to deduce the correct p-value from any of the three outputs.
- The first output corresponds to the opposite hypothesis, where we are interested in seeing if there is evidence to suggest that we run faster at high altitude. The p-value for this test is 99.87% – this is the area to the left of 3.88. Therefore the p-value we are interested in is the area to the right of 3.88, which is 100 − 99.87 = 0.13%.
- The middle output is the two-sided test, that the mean running times at sea level and high altitude are different. The p-value is 0.25%, which is (up to rounding) double our one-sided p-value.
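These relationships between the three outputs can be checked numerically. A small sketch, assuming the t-statistic 3.88 and df = 11 (12 runners, so the differences have 12 − 1 degrees of freedom):

```python
from scipy.stats import t

# t-statistic and degrees of freedom from the altitude example
# (df = 12 - 1 = 11 is assumed from the 12 matched runners).
t_stat, df = 3.88, 11

p_left = t.cdf(t_stat, df)               # first output: area to the LEFT
p_right = 1 - p_left                     # our test: area to the RIGHT
p_two_sided = 2 * min(p_left, p_right)   # middle output: two-sided

print(f"left: {100*p_left:.2f}%, right: {100*p_right:.2f}%, "
      f"two-sided: {100*p_two_sided:.2f}%")
```

The left and right areas sum to 100%, and the two-sided p-value is double the smaller one-sided area, exactly as deduced from the outputs above.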
QQplot to check for reliability of the sample mean.
Example 4: Does Friday 13th increase accidents?
To answer this question, the number of accidents on 6 consecutive Friday 13ths (during the early 1990s) was collected. A comparison is required where all the factors are the same except for the 13th, so the data is compared with the number of accidents which happened on the preceding Friday the 6th.
It is not immediately obvious, but there is matching in this data. This is because Friday the 6th and the following Friday the 13th share similar factors except for the date (for example, more accidents tend to happen during July/August, which would increase both values). There is a dependence between them (driven by these common factors).
Hypothesis of interest and the test
- To see whether there is evidence that accidents have increased, our hypothesis of interest is H0 : µ13 − µ6 ≤ 0 against HA : µ13 − µ6 > 0.
- We see that the p-value is about 2.11%. This is less than 5%, so we can reject the null at the 5% level and conclude that Friday 13th tends to increase accidents.
- Notes of caution: The data is clearly not normally distributed (it is numerical discrete) and the sample size is very small (n = 6). Therefore the p-value is unlikely to be very reliable. As it is relatively close to the boundary of 5%, we need to be cautious in interpreting the full significance of this result.
More examples: Compare the weights of the calf data at different weeks
- Load the calf data into Statcrunch and compare the weights at different weeks.
- The data is clearly matched, because the same calf is followed over a few weeks. Moreover, in the scatterplot of week 0.5 against week 1 we see a clear linear trend. This shows that there is a clear matching between the weights (notice that calves that are heavier at week 0.5 also tend to be heavier at week 1).
- Based on the above, if we want to compare the weights at different weeks, we need to use a matched pair procedure.
- Do this!
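A sketch of what that comparison might look like, with made-up calf weights (the real values are in the Statcrunch calf data set):

```python
import numpy as np
from scipy import stats

# Made-up weights (kg) for six calves at week 0.5 and week 1;
# the same calf appears in both columns, so the data is matched.
week_half = np.array([42.0, 45.5, 39.8, 48.2, 44.1, 41.3])
week_one  = np.array([44.5, 48.0, 41.9, 51.0, 46.8, 43.2])

# The strong correlation is the matching seen in the scatterplot:
# heavier calves at week 0.5 tend to be heavier at week 1.
r = np.corrcoef(week_half, week_one)[0, 1]

# Matched pair test on the weight change between the two weeks.
t_stat, p_two_sided = stats.ttest_rel(week_one, week_half)
print(round(r, 2), round(t_stat, 2), p_two_sided)
```

With these illustrative numbers the correlation is close to 1, which is exactly the linear trend that tells us to use a matched pair procedure rather than treating the two weeks as independent samples.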
Summary: matched pair procedures
Sometimes we want to compare treatments or conditions at the individual level. These situations produce two samples that are not independent – they are related to each other. The subjects of one sample are identical to, or matched (paired) with, the subjects of the other sample.
- Example: Pre-test and post-test studies look at data collected on the same subjects before and after some "treatment" is performed.
- Example: Twin studies often try to sort out the influence of genetic factors by comparing a variable between sets of twins.
- Example: Using people matched for age, sex, and education in social studies helps to cancel out the effects of these potentially relevant variables.
Except for pre/post studies, subjects should be randomized: assigned at random (within each pair) to the samples.
For data from a matched pair design, we use the observed differences Xdifference = (X1 − X2) to test the difference in the two population means. The hypotheses can then be expressed as

    H0: µdifference = 0 ; HA: µdifference > 0 (or < 0, or ≠ 0)

You will need to decide what test to apply to the data. In Chapter 10 we will cover the independent sample t-test. This tests the same hypothesis, but there is no matching in the data, so a different procedure is used. Based on how the data was collected, you should be able to decide which test to use.
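As a preview of that decision, here is a sketch (with made-up before/after numbers) contrasting the matched pair test with the independent sample t-test on the same data:

```python
import numpy as np
from scipy import stats

# Made-up before/after measurements on the same six subjects.
before = np.array([10.1, 14.3, 9.8, 12.5, 11.0, 13.2])
after  = np.array([11.0, 15.1, 10.9, 13.2, 12.1, 14.0])

# Matched pair test: uses the pairing, i.e. works on the differences.
t_paired, p_paired = stats.ttest_rel(after, before)

# Independent sample test (Chapter 10): ignores the pairing.
t_indep, p_indep = stats.ttest_ind(after, before)

# The paired test factors out subject-to-subject variability, so its
# p-value is far smaller here, even though both compare the same means.
print(p_paired, p_indep)
```

Here the paired test detects the increase easily while the independent test does not, which is why identifying matching in the data matters so much.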
Accompanying problems associated with this chapter
- Quiz 13
- Homework 6 (Q3)
- Homework 8 (Q1)