Lecture Notes Module 2

D.G. Bonett (5/2017)
Module 2
Two-group Experimental Designs
The goal of most research is to assess a possible causal relation between the
response variable and another variable called the independent variable. In
experimental designs, the response variable is usually called a dependent variable.
Three basic conditions must be satisfied to demonstrate a causal relation between
a dependent variable and an independent variable. First, there must be a relation
between the dependent variable and the independent variable. Second, the effect
of the independent variable must be present prior to any observed change in the
dependent variable. Third, no variable other than the independent variable can be
responsible for the relation between the dependent variable and the independent
variable.
An experiment can be used to assess a causal relation. The simplest type of
experiment involves just two treatment conditions that represent the levels of the
independent variable. In a two-group experiment, a random sample of n
participants is selected from a study population. The random sample is then
randomized (i.e., randomly divided) into two groups, and each group receives one
of the two treatments with participants treated identically within each group. If
one group does not receive any treatment, it is called a control group. Following
treatment, a measurement on the dependent variable is obtained for each
participant.
In a two-group experiment with a quantitative dependent variable, a population
mean could be estimated from each group. In an experimental design, the
population means have interesting interpretations: μ1 is the population mean of
the dependent variable assuming everyone in the study population had received
level 1 of the independent variable, and μ2 is the population mean of the dependent
variable assuming everyone in the same study population had instead received
level 2 of the independent variable.
The difference in population means for the two treatment conditions, μ1 − μ2, is
called the effect size and describes the strength of the relation between the
dependent and independent variables. In an experiment, a nonzero effect size is
evidence that the independent variable has a causal effect on the dependent
variable because all three conditions required for a causal association will have
been satisfied: 1) a nonzero effect size implies a relation between the dependent
and independent variables, 2) the change in the dependent variable occurred after
exposure to the independent variable, and 3) because the participants were
randomized into the levels of the independent variable, no other variable could
have caused the nonzero effect size. A confidence interval for μ1 − μ2 provides
information about the direction and magnitude of the effect size.
Two-group Nonexperimental Designs
In a two-group nonexperimental design, participants are classified into two
groups according to some preexisting characteristic (e.g., male/female,
democrat/republican, sophomore/junior, etc.) rather than being randomized into
the treatment conditions. In nonexperimental designs, the magnitude of the
difference in population means describes the strength of a relation between the
response variable and independent variable. In nonexperimental designs, the
independent variable is often referred to as a predictor variable. In
nonexperimental designs where participants are not randomly assigned to groups,
an observed relation between the predictor variable and the response variable
cannot be interpreted as a causal relation because the relation could be a
consequence of one or more unmeasured variables – referred to as confounding
variables – that are related to both the response variable and the predictor
variable. Consider, for example, the many nonexperimental studies that have
compared moderate alcohol drinkers with non-drinkers and found that moderate
drinkers live longer. Moderate drinkers could differ from nondrinkers in education
level, income, access to health care, moderation in consumption of unhealthy
foods, and many other characteristics. It is possible that one or more of these
confounding variables is responsible for the observed relation between alcohol
consumption and life expectancy. Therefore, the nonexperimental finding that
alcohol consumption is related to life expectancy does not imply that a non-drinker
who begins to consume alcohol in moderation will live longer.
In a nonexperimental design, the parameters also have a different interpretation.
Specifically, πœ‡1 is the population mean of the dependent variable for all people in
one study population who belong to one category of the predictor variable (e.g.,
male, democrat, sophomore), and πœ‡2 is the population mean of the dependent
variable for all people in a second study population who belong to the other
category of the predictor variable (e.g., female, republican, junior). The members
of the study populations within each category are referred to as subpopulations.
The subtle but important parameter interpretation differences in experimental and
nonexperimental designs will affect how the researcher describes the results of a
confidence interval or test.
Confidence Interval for a Population Mean Difference
A 100(1 − α)% confidence interval for μ1 − μ2 is

   μ̂1 − μ̂2 ± t(α/2;df) SE(μ̂1 − μ̂2)                                  (2.1)

where t(α/2;df) is a critical t-value, SE(μ̂1 − μ̂2) = √(σ̂1²/n1 + σ̂2²/n2), and

   df = (σ̂1²/n1 + σ̂2²/n2)² / [σ̂1⁴/(n1²(n1 − 1)) + σ̂2⁴/(n2²(n2 − 1))].

If the sample sizes are approximately equal and the study population variances
can be assumed to be similar, then the separate-variance standard error could
cautiously be replaced with an equal-variance standard error
SE(μ̂1 − μ̂2) = √(σ̂p²/n1 + σ̂p²/n2), where σ̂p² = [(n1 − 1)σ̂1² + (n2 − 1)σ̂2²]/df and
df = n1 + n2 − 2. Contrary to the recommendations of most statisticians, many
researchers have been taught to always use the equal-variance method.
In an experiment, recall that all participants within a particular treatment group
should be treated identically. The within-group variance estimates, σ̂1² and σ̂2²,
represent unexplained variability in the dependent variable. The within-group
variance is also referred to as error variance.
Example 2.1. A researcher believes that it is important for 2nd grade students to overlearn
the multiplication tables so that these computations can be made rapidly and without
thought when students later begin working on more complex math problems. A
population of 1496 2nd grade students was identified in the Salinas, CA school district, and
80 students were randomly selected from this study population. The 80 students were
randomized into two groups of equal size. The first group was a control group and received
no additional multiplication table training. The second group received 15 minutes per day
of extra multiplication tables training for 60 days. At the end of the 60 day training period,
all 80 students were given a multiplication test and the time (in seconds) to complete the
test was recorded for each student. The sample means and standard deviations are given
below.
Group 1: μ̂1 = 273.6, σ̂1 = 27.2
Group 2: μ̂2 = 112.8, σ̂2 = 20.8
The separate-variance 95% confidence interval for μ1 − μ2 is

   273.6 − 112.8 ± t(.05/2;df) √(27.2²/40 + 20.8²/40) = [150.0, 171.6]

where df = (27.2²/40 + 20.8²/40)² / [27.2⁴/(40²(39)) + 20.8⁴/(40²(39))] = 72.9 and
t(.05/2;72.9) = 2.00. The researcher is 95% confident that in the study population of
1496 2nd grade students, the average time to complete the multiplication test
would be 150.0 to 171.6 seconds faster if they had all received the extra math
training for 60 days.
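The computations in Example 2.1 can be reproduced directly from Equation 2.1. Below is a minimal sketch in Python (the course's companion functions are in R; the function name welch_ci and the hard-coded critical value t(.05/2;72.9) = 2.00 from the example are ours):

```python
import math

def welch_ci(m1, m2, s1, s2, n1, n2, tcrit):
    """Separate-variance (Welch) confidence interval for mu1 - mu2
    (Equation 2.1). tcrit must be supplied from a t-table or software."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = math.sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return df, (m1 - m2 - tcrit * se, m1 - m2 + tcrit * se)

# Example 2.1: t(.05/2; 72.9) = 2.00 taken from the text
df, (lo, hi) = welch_ci(273.6, 112.8, 27.2, 20.8, 40, 40, 2.00)
print(round(df, 1), round(lo, 1), round(hi, 1))  # df about 73, CI about [150.0, 171.6]
```

Note that the degrees of freedom are computed from the data, so the critical t-value would normally be looked up after df is known.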
Confidence Interval for a Population Standardized Mean Difference
In the above math education experiment, the metric of the dependent variable
(seconds to complete the test) is easy to understand, and the confidence interval
for μ1 − μ2 clearly describes the magnitude and importance of the effect of the
independent variable. However, many dependent variables used by social and
behavioral researchers have metrics that may not be well understood by other
researchers. In these cases, the confidence interval for μ1 − μ2 may not provide
easily interpretable information about the magnitude and importance of the effect
of the independent variable. For instance, suppose a researcher compared two
different counseling approaches for test anxiety. Following counseling, the
researcher administered a test anxiety questionnaire to all student participants
and obtained a 95% confidence interval for μ1 − μ2 equal to [4.23, 7.37]. Is this an
important effect? The importance of this result is impossible to describe in the
absence of information about the test anxiety scale.
In applications where the magnitude of μ1 − μ2 is not easy to interpret, it is helpful
to report a confidence interval for a standardized mean difference. The population
standardized mean difference, also known as Cohen's d, is defined as

   δ = (μ1 − μ2)/√[(σ1² + σ2²)/2]
An approximate 100(1 − α)% confidence interval for δ is

   δ̂ ± z(α/2) SE(δ̂)                                                 (2.2)

where δ̂ = (μ̂1 − μ̂2)/√[(σ̂1² + σ̂2²)/2] and

   SE(δ̂) = √[δ̂²(1/(n1 − 1) + 1/(n2 − 1))/8 + (n1 + n2)/(n1n2)].
The interpretation of δ is difficult unless the population variances are similar. In
applications where the population variances are not expected to be similar, another
estimate of δ (also known as Glass's d) is defined as the difference in sample means
divided by the sample standard deviation in one group (usually but not necessarily
the control group).
Example 2.2. In the test anxiety example mentioned above, suppose the researcher
obtained the following sample means and variances with sample sizes of 𝑛1 = 𝑛2 = 40.
Treatment 1: μ̂1 = 21.9, σ̂1² = 14.8
Treatment 2: μ̂2 = 16.1, σ̂2² = 10.2
The estimate of δ is

   δ̂ = (21.9 − 16.1)/√[(σ̂1² + σ̂2²)/2] = 5.8/3.54 = 1.64

where √[(σ̂1² + σ̂2²)/2] = √[(14.8 + 10.2)/2] = 3.54, and a 95% confidence interval for δ is

   1.64 ± 1.96 √[1.64²(1/39 + 1/39)/8 + 1/40 + 1/40] = [1.13, 2.15].

The researcher is 95% confident that, in the study population of students with test anxiety,
the mean test anxiety score would be 1.13 to 2.15 standard deviations greater if all students
received counseling method 1 rather than counseling method 2.
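Equation 2.2 can be checked against Example 2.2 with a short script. This is an illustrative sketch in Python rather than the course's R functions; the function name cohens_d_ci is ours:

```python
import math

def cohens_d_ci(m1, m2, v1, v2, n1, n2, zcrit=1.96):
    """Approximate CI for delta (Equation 2.2) using the
    variance-averaged standardizer."""
    d = (m1 - m2) / math.sqrt((v1 + v2) / 2)
    se = math.sqrt(d**2 * (1/(n1 - 1) + 1/(n2 - 1)) / 8 + (n1 + n2) / (n1 * n2))
    return d, (d - zcrit * se, d + zcrit * se)

# Example 2.2
d, (lo, hi) = cohens_d_ci(21.9, 16.1, 14.8, 10.2, 40, 40)
print(round(d, 2), round(lo, 2), round(hi, 2))  # 1.64, [1.13, 2.15]
```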
To interpret the confidence interval for δ in Example 2.2, imagine two normal
(Gaussian) curves, one curve for a population distribution of test anxiety scores
under counseling method 1 and a second curve for a population distribution of test
anxiety scores under counseling method 2. Now visualize the distribution for
counseling method 1 shifted to the right of the distribution for counseling method
2 at least 1.13 standard deviations and at most 2.15 standard deviations. To
visualize the magnitude of this shift, use the fact that one standard deviation from
the mean is the point where the normal curve changes from concave down to
concave up (the inflection point). Knowing that the point of inflection on a normal
curve is one standard deviation from the mean, a researcher can then easily
visualize two normal distributions that are separated by a specified number of
standard deviations.
The value of δ can be transformed into another measure of effect size called
Cohen's U3. In an experimental design, U3 represents the proportion of people in
the study population who would have a score under Treatment 1 that would be
greater than their mean score under Treatment 2. In a nonexperimental design, U3
describes the proportion of scores in subpopulation 1 that are greater than the
mean of subpopulation 2. To transform δ into U3, simply find the area under the
standard unit normal curve that is less than δ. The pnorm function in R is useful
for this purpose. This transformation can be applied to the estimate of δ, the lower
confidence limit, and the upper confidence limit. For instance, in the above
example, pnorm(1.13) = .87 and pnorm(2.15) = .98. We can be 95% confident that,
if all students with test anxiety in the study population had received counseling method
1, between 87% and 98% of their test anxiety scores would be greater than their
mean score under counseling method 2.
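In Python, the same transformation is available in the standard library; NormalDist().cdf plays the role of R's pnorm:

```python
from statistics import NormalDist

# U3: area under the standard normal curve below delta (pnorm in R)
u3 = NormalDist().cdf
print(round(u3(1.13), 2), round(u3(2.15), 2))  # 0.87 and 0.98, as in the text
```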
Confidence Interval for a Ratio of Population Means
If the dependent variable is measured on a ratio scale, a ratio of population means
(μ1/μ2) is a unitless measure of effect size that could be more meaningful and easier
to interpret than a standardized mean difference. An approximate 100(1 − α)%
confidence interval for μ1/μ2 that does not assume equal population variances is

   exp[ln(μ̂1/μ̂2) ± t(α/2;df) √(σ̂1²/(μ̂1²n1) + σ̂2²/(μ̂2²n2))]           (2.3)

where

   df = [σ̂1²/(μ̂1²n1) + σ̂2²/(μ̂2²n2)]² / [σ̂1⁴/(μ̂1⁴n1²(n1 − 1)) + σ̂2⁴/(μ̂2⁴n2²(n2 − 1))].

If the sample sizes are approximately equal and the study population variances
can be assumed to be similar, then σ̂j² in Equation 2.3 could cautiously be replaced
with σ̂p² and then df simplifies to n1 + n2 − 2. Suppose a 95% confidence interval
for μ1/μ2 in a particular study is [1.51, 1.78]. This confidence interval has a simple
interpretation – the researcher can be 95% confident that μ1 is 1.51 to 1.78 times
as large as μ2.
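A sketch of Equation 2.3 in Python (the function name and the data are hypothetical, and the critical t-value is passed in rather than computed; in practice a t-table or software would supply t(α/2;df) for the df given with Equation 2.3):

```python
import math

def mean_ratio_ci(m1, m2, v1, v2, n1, n2, tcrit):
    """Approximate CI for mu1/mu2 (Equation 2.3): build the interval on
    the log scale, then exponentiate the endpoints."""
    se = math.sqrt(v1 / (m1**2 * n1) + v2 / (m2**2 * n2))
    ln_ratio = math.log(m1 / m2)
    return (math.exp(ln_ratio - tcrit * se), math.exp(ln_ratio + tcrit * se))

# hypothetical data: means 50 and 31, variances 90 and 60, n = 30 per group
lo, hi = mean_ratio_ci(50, 31, 90, 60, 30, 30, 2.00)
print(round(lo, 2), round(hi, 2))
```

The estimate 50/31 ≈ 1.61 always falls inside the interval because the interval is symmetric around ln(μ̂1/μ̂2) on the log scale.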
Hypothesis Testing
A confidence interval for μ1 − μ2 can be used to test hypotheses. For instance, a
confidence interval for μ1 − μ2 can be used to implement a three-decision rule for
the following hypotheses.

H0: μ1 = μ2
H1: μ1 > μ2
H2: μ1 < μ2

If the lower limit for μ1 − μ2 is greater than 0, reject H0 and accept H1: μ1 > μ2; if
the upper limit for μ1 − μ2 is less than 0, reject H0 and accept H2: μ1 < μ2. The
results are inconclusive if the confidence interval includes 0. Note that it is not
necessary to develop special hypothesis testing rules for the value of δ because
μ1 = μ2 implies δ = 0, μ1 > μ2 implies δ > 0, and μ1 < μ2 implies δ < 0.

When this three-decision rule is applied to the mean difference in a two-group
design, it is commonly referred to as an independent-samples t-test and the test
statistic t = (μ̂1 − μ̂2)/SE(μ̂1 − μ̂2) is used to select H1 or H2:

accept H1: μ1 > μ2 if t > t(α/2;df)
accept H2: μ1 < μ2 if t < −t(α/2;df)
fail to reject H0 (i.e., an inconclusive result) if |t| < t(α/2;df)
Computer programs will compute a p-value for the t statistic. In a significance
testing approach, if the p-value is small (e.g., less than .05) the researcher declares
the results to be β€œsignificant”; otherwise, the results are declared to be
β€œnonsignificant”. It is important to remember that a significant result does not
imply that an important difference in population means has been detected, and a
nonsignificant result does not imply that the null hypothesis is true.
Recall from Module 1 that accepting H1 when H2 is true or accepting H2 when H1
is true is called a directional error and the probability of making a directional error
is at most α/2. It is reasonable to assume that H0: μ1 = μ2 will be false in any real
application. A failure to reject H0 is called a Type II error, and the power of the test
is the probability of avoiding a Type II error.
Confidence intervals for μ1 − μ2 also can be used to determine if μ1 and μ2 are
similar. For instance, a confidence interval for μ1 − μ2 can be used to choose
between the following two hypotheses in an equivalence test:

H0: |μ1 − μ2| ≤ b
H1: |μ1 − μ2| > b

where b is some value specified by the researcher. Usually b represents the value of
μ1 − μ2 that would be considered by experts to be small or unimportant. The
interval −b to b is called the region of practical equivalence. If the confidence
interval for μ1 − μ2 is completely contained within the range −b to b, then H0 is
accepted; if the confidence interval for μ1 − μ2 is completely outside the interval
−b to b, then H1 is accepted; otherwise, the results are inconclusive. The probability
of falsely accepting H1: |μ1 − μ2| > b is at most α. In an equivalence test, accepting
H1 when H0 is true is called a Type I error, and accepting H0 when H1 is true is
called a Type II error.
In equivalence testing applications where it is difficult to specify a value of μ1 − μ2
that would be considered by experts to be small or unimportant, it might be easier
to specify a value of δ that would be considered small or unimportant. A confidence
interval for δ can be used to choose between the following two hypotheses

H0: |δ| ≤ b
H1: |δ| > b

where b represents a value of δ that would be considered by experts to be small or
unimportant. If the confidence interval for δ is completely contained within the
interval −b to b (the region of practical equivalence), then H0 is accepted; if the
confidence interval for δ is completely outside the interval −b to b, then H1 is
accepted; otherwise, the results are inconclusive.
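Both equivalence tests above use the same containment logic, which can be written as a single decision function. A minimal sketch (the function name and the numeric values are ours):

```python
def equivalence_decision(ci_lower, ci_upper, b):
    """Three-outcome decision for H0: |effect| <= b vs H1: |effect| > b,
    based on a confidence interval for the effect (mu1 - mu2 or delta)."""
    if -b <= ci_lower and ci_upper <= b:
        return "accept H0 (practical equivalence)"
    if ci_upper < -b or ci_lower > b:
        return "accept H1"
    return "inconclusive"

# interval [-0.12, 0.18] lies entirely inside the region [-0.25, 0.25]
print(equivalence_decision(-0.12, 0.18, 0.25))  # accept H0 (practical equivalence)
```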
Prediction Intervals
In some experiments, the researcher might want to estimate how the dependent
variable score for one randomly selected person would differ under the two
treatment conditions. A 100(1 − α)% prediction interval for this difference is

   μ̂1 − μ̂2 ± t(α/2;df) √(σ̂p²/n1 + σ̂p²/n2 + 2σ̂p²)                     (2.4)

where df = n1 + n2 − 2 and σ̂p² is the pooled variance estimate described previously.
Equation 2.4 assumes equal population variances. There is another version of
this prediction interval that does not assume equal population variances, but its
df formula is complicated. A prediction interval for the difference in scores for one
person will be wider than a confidence interval for the difference in population
means.
Example 2.3. In the two-group experiment described in a previous example where 2nd
grade students were randomized into a control group or a group that received an
additional 15 minutes of math instruction each day, the 95% confidence interval for
μ1 − μ2 was [150.0, 171.6]. This confidence interval suggests that the mean time to
complete the multiplication test in the study population would be 150 to 171.6 seconds
shorter if all students in the study population received the extra training. The researcher
wants to estimate how much better one randomly selected student from the study
population would perform if given the extra training compared to no extra training.
Applying Equation 2.4 gives a 95% prediction interval of [91.8, 229.8], indicating that any
one randomly selected student from the study population should complete the
multiplication test 91.8 to 229.8 seconds faster if given extra math training.
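Equation 2.4 reproduces the prediction interval in Example 2.3. A sketch in Python (the critical value t(.05/2;78) ≈ 1.991 is taken from a t-table rather than computed; the function name is ours):

```python
import math

def prediction_interval(m1, m2, s1, s2, n1, n2, tcrit):
    """Equation 2.4: prediction interval for one person's difference in
    scores under the two treatments (assumes equal population variances)."""
    vp = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
    se = math.sqrt(vp / n1 + vp / n2 + 2 * vp)
    return (m1 - m2 - tcrit * se, m1 - m2 + tcrit * se)

# Example 2.3: df = 78, t(.05/2; 78) is about 1.991
lo, hi = prediction_interval(273.6, 112.8, 27.2, 20.8, 40, 40, 1.991)
print(round(lo, 1), round(hi, 1))  # about [91.8, 229.8]
```

The dominant term under the square root is 2σ̂p², which is why this interval is so much wider than the confidence interval for μ1 − μ2.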
Assumptions for Confidence Intervals and Tests
The separate-variance and pooled-variance confidence intervals for μ1 − μ2
assume random sampling and independence of observations. These confidence
intervals for μ1 − μ2 also assume that the dependent variable has an approximate
normal distribution in the study population under each treatment condition in an
experimental design or within each subpopulation in a nonexperimental design.
The pooled-variance confidence interval for μ1 − μ2 requires an additional
assumption of equal population variances under each treatment condition (or
within each subpopulation), called the homoscedasticity assumption. The
homoscedasticity assumption is difficult to justify in most nonexperimental
designs. The pooled-variance confidence interval can perform poorly when the
population variances are unequal and the sample sizes are unequal. Specifically,
the percent of all possible random samples for which a pooled-variance
100(1 − α)% confidence interval for μ1 − μ2 will capture the value of μ1 − μ2 can
be much less than 100(1 − α)% when the population variances are unequal and
the smaller sample size is used in the treatment with the larger variance. Statistical
methods for assessing the homoscedasticity assumption require large sample sizes.
Consequently, most statisticians recommend using the separate-variance
confidence interval unless the sample sizes are approximately equal and there is
strong prior information suggesting that the population variances should be
similar.
The confidence interval and hypothesis test for a difference in population means
will perform properly (i.e., the true confidence level will be similar to the stated
confidence level) under violations of the normality assumption if the sample size
per group is not too small. With sample sizes of 20 or more per group, the
separate-variance and pooled-variance confidence intervals for μ1 − μ2 will
perform properly even when the dependent variable is markedly skewed.
The confidence interval for δ (Equation 2.2) and the prediction interval for a
difference in scores (Equation 2.4) are sensitive to a violation of the normality
assumption, and increasing the sample size will not mitigate the problem. The
confidence interval for δ and the prediction interval for a difference in scores
should not be used, regardless of sample size, unless the dependent variable (or
some transformation of the dependent variable) is at most mildly non-normal.
To assess the degree of non-normality in a two-group design, subtract μ̂1 from all
of the group 1 scores and subtract μ̂2 from all of the group 2 scores. Then estimate
the skewness and kurtosis coefficients from these n1 + n2 deviation scores. If the
deviation scores are skewed, it might be possible to reduce the skewness by
transforming (e.g., log, square-root, reciprocal) the dependent variable scores.
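The deviation-score procedure can be sketched as follows. The scores are made up, and the simple moment-based estimators used here are one of several common variants (a normal distribution has skewness 0 and kurtosis about 3):

```python
import statistics

def skew_kurtosis(group1, group2):
    """Moment-based skewness and kurtosis from pooled deviation scores."""
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    dev = [y - mean1 for y in group1] + [y - mean2 for y in group2]
    n = len(dev)
    m2 = sum(d**2 for d in dev) / n
    m3 = sum(d**3 for d in dev) / n
    m4 = sum(d**4 for d in dev) / n
    return m3 / m2**1.5, m4 / m2**2

# hypothetical scores: symmetric within each group, so skewness is 0
skew, kurt = skew_kurtosis([2, 3, 4, 5, 6], [5, 6, 7, 8, 9])
print(round(skew, 2), round(kurt, 2))
```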
The values of μ1 and μ2 could be difficult to interpret if the dependent variable
scores have been transformed in an effort to reduce non-normality. Consequently,
a confidence interval for μ1 − μ2 could be difficult to interpret and the researcher
may want to report only a hypothesis testing result, which does not require an
interpretation of the effect size magnitude. However, confidence intervals for δ
remain interpretable with transformed data because δ is a unitless measure of
effect size.
Distribution-free Methods
If the dependent variable is highly skewed, a difference in population medians
(τ1 − τ2) might be a more appropriate and meaningful measure of effect size than
a difference in population means. An approximate 100(1 − α)% confidence
interval for τ1 − τ2 is
πœΜ‚1 βˆ’ πœΜ‚ 2 ± 𝑧𝛼/2 βˆšπ‘†πΈπœΜ‚21 + π‘†πΈπœΜ‚22
(2.5)
where π‘†πΈπœΜ‚2𝑗 was defined in Equation 1.8 of Module 1. This confidence interval only
assumes random sampling and independence among participants. Equation 2.5
can be used for testing H0: 𝜏1 = 𝜏2 and to decide if 𝜏1 > 𝜏2 or 𝜏1 < 𝜏2 . Equation 2.5
also can be used to test H0: |𝜏1 βˆ’ 𝜏2 | ≀ 𝑏 against H1: |𝜏1 βˆ’ 𝜏2 | > 𝑏. If the dependent
variable is measured on a ratio scale, a ratio of population medians (𝜏1 /𝜏2 ) could
be an informative measure of effect size. To obtain a confidence interval for 𝜏1 /𝜏2 ,
compute Equation 2.5 from log-transformed dependent variable scores and then
exponentiate the lower and upper limits.
The Mann-Whitney test (also referred to as the Mann-Whitney-Wilcoxon test) is a
distribution-free test of H0: π = .5, where π is the proportion of people in the study
population who would have a larger y score if they had received Treatment 2 rather
than Treatment 1. The interpretation of π is more complicated in a two-group
nonexperimental design with subpopulation sizes N1 and N2. In a nonexperimental
design, π is the proportion of all N1N2 pairs of scores where the score for a person
from subpopulation 2 is greater than the score for a person from subpopulation 1.
Statistical packages will compute a p-value for the Mann-Whitney test that can be
used to decide if the null hypothesis can be rejected. The Mann-Whitney test is
usually a little less powerful than the independent-samples t-test, but it can be
more powerful than the t-test if the dependent variable is highly leptokurtic.
In applications where the dependent variable does not have, and cannot be
transformed to have, an approximate normal distribution, the standardized mean
difference will be difficult to interpret because we would not be able to visualize
one standard deviation from the mean. In these situations the Mann-Whitney
parameter (π) is a useful measure of effect size because it has a clear and simple
interpretation for any distribution shape. A confidence interval for π does not have
a simple formula, but it has been programmed in the provided R functions.
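The Mann-Whitney parameter has a simple point estimate: count the favorable pairs. A sketch with made-up data (counting ties as 1/2 is a common convention, not necessarily the one used in the provided R functions, which are also needed for the confidence interval itself):

```python
def mw_pi_hat(group1, group2):
    """Point estimate of the Mann-Whitney parameter: the proportion of all
    pairs where the group-2 score exceeds the group-1 score (ties count 1/2)."""
    wins = sum((y2 > y1) + 0.5 * (y2 == y1)
               for y1 in group1 for y2 in group2)
    return wins / (len(group1) * len(group2))

# hypothetical scores: 3 of the 4 pairs favor group 2
print(mw_pi_hat([1, 3], [2, 4]))  # 0.75
```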
Example 2.4. A sample of 8 male frequent marijuana users and a sample of 10 male
non-users were obtained from a University of Colorado research participant pool.
Amygdala activity levels of all participants were obtained while participants listened to
an audio tape with high emotional content. The activity scores are shown below.

Users:      14.6 5.1 8.1 22.7 6.4 4.4 19.0 3.2
Non-users:  9.4 10.3 58.3 106.0 31.0 46.2 12.0 19.0 135.0 159.0
The p-value for the Mann-Whitney test is .006 and the 95% confidence interval for π is
[.629, 1.00]. The sample medians are 7.25 and 38.6 for groups 1 and 2, respectively. A 95%
confidence interval for τ1 − τ2 is [-85.1, 22.4]. Note that the Mann-Whitney test rejects the
null hypothesis but a 95% confidence interval for τ1 − τ2 does not reject the null
hypothesis. The contradictory results are due to the fact that the Mann-Whitney test is
usually more powerful than a test of H0: τ1 = τ2 based on a confidence interval for τ1 − τ2.
The 95% confidence interval for π excludes .5 (which will always occur whenever the
Mann-Whitney p-value is less than .05). The researcher can be 95% confident that the
proportion of all user/non-user pairs in the two subpopulations where a non-user has a
higher activity score than a user is between 62.9% and 100%. The activity scores are highly
skewed but a log transformation effectively removes the skewness. An estimate of δ for
log-transformed scores is -1.61 and its 95% confidence interval is [-2.73, -0.49]. The
meaning of the log-transformed activity scores is not clear and a confidence interval for
τ1 − τ2 or μ1 − μ2 might be difficult to interpret. However, the confidence intervals for δ
and π are interpretable with transformed scores.
Equation 2.5 can be useful in time-to-event studies. If the study ends at time t before some
of the participants have exhibited the event, the scores for those participants are
set equal to t + 1. Equation 2.5 requires y(o2) ≤ t in both groups, where o2 is defined
in Equation 1.7 of Module 1.
Example 2.5. A random sample of 20 social science graduates and 20 engineering
graduates from UCSC agreed to participate in a 36 month study of post-graduation
employment. The number of months each participant stayed in their first job was
determined for each participant. Some participants had not left their first job at the end
of the 36-month study period and were given a censored score of 37 months. The
rank-ordered time-to-event scores (in months) are given below.

Social science: 2, 4, 6, 8, 10, 12, 12, 13, 15, 15, 20, 21, 24, 30, 30, 34, 34, 35, 36, 37
Engineering: 6, 15, 16, 17, 18, 18, 19, 21, 22, 22, 24, 25, 30, 31, 32, 35, 37, 37, 37, 37

Some scores have been censored and we must first verify that y(o2) ≤ 36 in both groups.
From Equation 1.7, we compute o2 = 15 and find that y(15) = 30 in group 1 and y(15) = 32 in
group 2, which satisfies the requirement. Applying Equation 2.5 gives a 95% confidence
interval of [-15.5, 6.5]. The confidence interval includes 0 and is too wide. The study needs
to be replicated using a larger sample size.
Sample Size Requirement for Desired Precision
The sample size requirement per group for estimating μ1 − μ2 with desired
confidence and precision is approximately

   nj = 8σ̃²(z(α/2)/w)² + z(α/2)²/4                                    (2.6)
where πœŽΜƒ 2 is a planning value of the average within-group variance of the dependent
variable for the two groups. Recall from Module 1 that this planning value can be
specified using information from published research reports, a pilot study, or the
opinions of experts. If prior estimates of the dependent variable variance are
unavailable but the maximum and minimum values of the dependent variable are
known, the planning value of the variance could be set to [(max – min)/4]2 .
Example 2.6. A researcher wants to conduct a study to determine the effect of
β€œachievement motivation” on the types of tasks one chooses to undertake. The study will
ask participants to play a ring-toss game where they try to throw a small plastic ring over
an upright post. The participants will choose how far away from the post they are when
they make their tosses. The chosen distance from the post is the dependent variable. The
independent variable is degree of achievement motivation (high or low) and will be
manipulated by the type of instructions given to the participants. The results of a pilot
study suggest that the standard deviation of the distance scores is about 0.75 foot within
each condition. The researcher wants the 99% confidence interval for μ1 − μ2 to have a
width of about 1 foot. The required sample size per group is approximately
nj = 8(0.75²)(2.58/1)² + 1.66 = 31.6 ≈ 32. A random sample of 64 participants is required
with 32 participants given low achievement motivation instructions and 32 participants
given high achievement motivation instructions.
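Equation 2.6 and Example 2.6 can be sketched as follows (the function name is ours, and the z-value is computed from the standard library rather than rounded to 2.58, which changes the result only slightly):

```python
import math
from statistics import NormalDist

def n_per_group_precision(var_plan, width, conf=0.95):
    """Equation 2.6: per-group n to estimate mu1 - mu2 with a confidence
    interval of about the desired width."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil(8 * var_plan * (z / width)**2 + z**2 / 4)

# Example 2.6: sd planning value 0.75, desired 99% CI width of 1 foot
print(n_per_group_precision(0.75**2, 1.0, conf=0.99))  # 32
```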
The sample size requirement per group for estimating δ with desired confidence
and precision is approximately
   nj = (δ̃² + 8)(z(α/2)/w)²                                           (2.7)
where δ̃ is a planning value of the standardized mean difference. The planning
value can be specified using information from published research reports, a pilot
study, or expert opinion.
Example 2.7. A researcher will compare two methods of treating homelessness-induced
PTSD in adolescents and will use a new measure of PTSD as the dependent variable. Given
the novelty of the new PTSD measure, it is difficult for the researcher to specify a desired
width of a confidence interval for πœ‡1 βˆ’ πœ‡2 . However, the researcher expects 𝛿 to be 1.0 and
would like a 95% confidence interval for 𝛿 to have a width of about 0.5. The required
sample size per group is approximately 𝑛𝑗 = (12 + 8)(1.96/0.5 )2 = 138.3 β‰ˆ 139. The researcher
needs to obtain a sample of 278 participants which will be randomly divided into two
groups with 139 participants receiving the one treatment and 139 participants receiving
the other treatment.
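Equation 2.7 can be checked against Example 2.7 with a short sketch (the function name is ours; the z-value comes from the standard library):

```python
import math
from statistics import NormalDist

def n_per_group_delta(delta_plan, width, conf=0.95):
    """Equation 2.7: per-group n to estimate delta with a confidence
    interval of about the desired width."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil((delta_plan**2 + 8) * (z / width)**2)

# Example 2.7: planning value delta = 1.0, desired 95% CI width of 0.5
print(n_per_group_delta(1.0, 0.5))  # 139
```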
With a ratio-scale dependent variable, the sample size requirement per group to
estimate μ1/μ2 with desired confidence and precision is approximately

   nj = 8σ̃²(1/μ̃1² + 1/μ̃2²)[z(α/2)/ln(r)]² + z(α/2)²/4                 (2.8)
where πœ‡Μƒπ‘— is a planning value of πœ‡π‘— , r is the desired upper to lower confidence
interval endpoint ratio, and ln(r) is the natural logarithm of r. For instance, if
πœ‡1 /πœ‡2 is expected to about 1.3, the researcher might want the lower and upper
confidence interval endpoints to be about 1.1 and 1.5 and r would then be set to
1.5/1.1 = 1.36.
Example 2.8. A researcher will compare two methods of encouraging parents to read to
their preschool children. The number of reading minutes per week is the response
variable. The researcher plans to compute a 95% confidence interval for πœ‡1 /πœ‡2 and would
like the upper to lower interval endpoint ratio to be about 1.5. After reviewing the
literature, the researcher set πœŽΜƒΒ² = 200, πœ‡Μƒ1 = 50, and πœ‡Μƒ2 = 70. The required sample size per
group is approximately 𝑛𝑗 = 8(200)(1/50Β² + 1/70Β²)[1.96/ln(1.5)]Β² + 0.96 = 23.5 β‰ˆ 24. The
researcher needs to obtain a random sample of 48 parents of preschool children which will
be randomly divided into two groups of equal size.
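As a check on Example 2.8, Equation 2.8 can be sketched in Python (the function name is illustrative; the rounded critical value 1.96 is used as in the text):

```python
import math

def n_per_group_ratio(var_plan, mu1_plan, mu2_plan, r_endpoints, z=1.96):
    """Sample size per group for a CI on mu1/mu2 with a desired
    upper-to-lower endpoint ratio r (Equation 2.8)."""
    n = (8 * var_plan * (1 / mu1_plan**2 + 1 / mu2_plan**2)
         * (z / math.log(r_endpoints))**2 + z**2 / 4)
    return math.ceil(n)

# Example 2.8: variance 200, mean planning values 50 and 70, ratio 1.5
print(n_per_group_ratio(200, 50, 70, 1.5))  # 24 per group
```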
Sample Size Requirement for Desired Power
The sample size requirement per group to test H0: πœ‡1 = πœ‡2 for a specified level of 𝛼
and with desired power is approximately
𝑛𝑗 = 2πœŽΜƒΒ²(𝑧𝛼/2 + 𝑧𝛽)Β²/(πœ‡Μƒ1 βˆ’ πœ‡Μƒ2)Β² + (𝑧𝛼/2)Β²/4
(2.9)
where 1 – Ξ² is the desired power of the test and πœ‡Μƒ1 βˆ’ πœ‡Μƒ2 is a planning value of the
anticipated effect size. Note that Equation 2.9 only requires a planning value for
the difference in population means (i.e., the effect size) and does not require a
planning value for each population mean. In applications where it is difficult to
specify πœ‡Μƒ1 βˆ’ πœ‡Μƒ2 or πœŽΜƒ 2 , Equation 2.9 can be expressed in terms of a standardized
mean difference planning value, as shown below.
𝑛𝑗 = 2(𝑧𝛼/2 + 𝑧𝛽)Β²/𝛿̃² + (𝑧𝛼/2)Β²/4
(2.10)
Example 2.9. Previous research has shown that team size is related to performance on
certain tasks. A researcher wants to compare the performance of 2-person and 4-person
teams on a writing task that must be completed within a time limit. The quality of the
written report will be scored on a 1 to 10 scale. The researcher sets πœŽΜƒΒ² = 5.0 and expects a
2-point difference in the population mean ratings. For Ξ± = .05 and power of 1 – 𝛽 = .95, the
required number of teams per group is approximately 𝑛𝑗 = 2(5.0)(1.96 + 1.65)Β²/2Β² + 0.96 =
33.5 β‰ˆ 34. A random sample of 204 participants is required with 68 participants working in
34 2-person teams and 136 participants working in 34 4-person teams.
Example 2.10. A researcher wants to compare two eating disorder treatments and wants
the power of the test to be .9 with Ξ± = .05. The researcher expects the standardized mean
difference to be 0.5. The required number of participants per group is approximately
𝑛𝑗 = 2(1.96 + 1.28)Β²/0.5Β² + 0.96 = 84.9 β‰ˆ 85.
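Equations 2.9 and 2.10 can be verified with a short Python sketch (the Z table holds the rounded critical values used in the text; function names are illustrative):

```python
import math

# Rounded critical z-values as used in the worked examples
Z = {.025: 1.96, .05: 1.65, .10: 1.28, .20: 0.84}

def n_power_means(var_plan, diff_plan, z_a2=Z[.025], z_b=Z[.05]):
    """Per-group n to test H0: mu1 = mu2 with desired power (Equation 2.9)."""
    return math.ceil(2 * var_plan * (z_a2 + z_b)**2 / diff_plan**2
                     + z_a2**2 / 4)

def n_power_delta(delta_plan, z_a2=Z[.025], z_b=Z[.05]):
    """Standardized-effect version (Equation 2.10)."""
    return math.ceil(2 * (z_a2 + z_b)**2 / delta_plan**2 + z_a2**2 / 4)

print(n_power_means(5.0, 2.0))         # Example 2.9: 34 teams per group
print(n_power_delta(0.5, z_b=Z[.10]))  # Example 2.10: 85 per group
```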
The sample size requirement per group to test H0: |πœ‡1 βˆ’ πœ‡2 | ≀ 𝑏 for a specified
level of 𝛼 and with desired power is approximately
𝑛𝑗 = 2πœŽΜƒΒ²(𝑧𝛼 + 𝑧𝛽)Β²/(|πœ‡Μƒ1 βˆ’ πœ‡Μƒ2| βˆ’ 𝑏)Β² + (𝑧𝛼)Β²/4
(2.11)
where 𝑧𝛼 is a one-sided critical 𝑧-value and |πœ‡Μƒ1 βˆ’ πœ‡Μƒ2 | is the expected effect size
which must be smaller than b. Equivalence tests usually require large sample sizes.
Example 2.11. A researcher wants to show that women and men have similar population
means on a newly developed test of analytical reasoning. The test is scored on a 50 to 150
scale and the researcher believes that a 5-point difference in means would be small and
unimportant. The required sample size per group to test H0: |πœ‡1 βˆ’ πœ‡2 | ≀ 5 with power of
.8, Ξ± = .05, an expected effect size of 1, and a standard deviation planning value of 15 is
approximately 𝑛𝑗 = 2(225)(1.65 + 0.84)Β²/(1 – 5)Β² + 0.68 = 175.1 β‰ˆ 176.
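A minimal sketch of Equation 2.11 in Python, reproducing Example 2.11 (the function name is illustrative; the defaults assume a one-sided Ξ± = .05 and power .8, matching the rounded critical values in the text):

```python
import math

def n_equivalence(var_plan, diff_plan, b, z_a=1.65, z_b=0.84):
    """Per-group n to test H0: |mu1 - mu2| <= b with desired power
    (Equation 2.11); z_a is the one-sided critical value."""
    return math.ceil(2 * var_plan * (z_a + z_b)**2
                     / (abs(diff_plan) - b)**2 + z_a**2 / 4)

# Example 2.11: variance 225, expected effect size 1, b = 5
print(n_equivalence(225, 1, 5))  # 176 per group
```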
The sample size requirement per group to test H0: πœ‹ = .5 for a specified level of 𝛼
with desired power using the Mann-Whitney test is approximately
𝑛𝑗 = (𝑧𝛼/2 + 𝑧𝛽)Β²/[6(πœ‹Μƒ βˆ’ .5)Β²]
(2.12)
where πœ‹Μƒ is a planning value of πœ‹. Recall that for experimental designs, πœ‹ is the
proportion of people in the study population who would have a larger y score if
they had received treatment 2 rather than treatment 1. In a nonexperimental
design, πœ‹ is the probability that a randomly selected person from the second
subpopulation would have a y score that is less than a randomly selected person
from the first subpopulation.
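Equation 2.12 can be sketched the same way. The planning value πœ‹Μƒ = .70 below is a hypothetical choice, not one from the text (the function name is also illustrative):

```python
import math

def n_mann_whitney(pi_plan, z_a2=1.96, z_b=1.28):
    """Per-group n for the Mann-Whitney test of H0: pi = .5
    (Equation 2.12); defaults give alpha = .05 and power .9."""
    return math.ceil((z_a2 + z_b)**2 / (6 * (pi_plan - .5)**2))

# Hypothetical planning value: pi expected to be about .70
print(n_mann_whitney(0.70))  # 44 per group
```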
Unequal Sample Sizes
Using equal sample sizes has three major benefits: 1) if the population variances
are approximately equal, confidence intervals are narrowest and hypothesis tests
are most powerful when the sample sizes are equal, 2) when the pooled-variance
confidence interval method is used, the negative effects of violating the equal
variance assumption are less severe when the sample sizes are equal, and 3) the
negative effects of population skewness on the confidence intervals and tests for a
difference in population means are minimized when the sample sizes are equal.
However, there are situations when equal sample sizes are less desirable. If one
treatment is more expensive or risky than another treatment, the researcher might
want to use fewer participants in the more expensive or risky treatment condition.
Also, in experiments that include a control group, it could be easy and inexpensive
to obtain a larger sample size for the control group.
The sample size formulas given above assume equal sample sizes per group.
Suppose the researcher requires 𝑛2 /𝑛1 = r. The approximate sample size
requirement for group 1 to estimate πœ‡1 βˆ’ πœ‡2 with desired precision is
𝑛1 = 4πœŽΜƒΒ²(1 + 1/π‘Ÿ)(𝑧𝛼/2 /𝑀)Β² + (𝑧𝛼/2)Β²/4
(2.16)
and the required sample size for group 1 to estimate 𝛿 with desired precision is
𝑛1 = 4[𝛿̃²(1 + 1/π‘Ÿ)/8 + (1 + 1/π‘Ÿ)](𝑧𝛼/2 /𝑀)Β²
(2.17)
with 𝑛2 set equal to 𝑛1 π‘Ÿ.
To test H0: πœ‡1 = πœ‡2 with desired power, the approximate sample size requirement
for group 1 is
𝑛1 = πœŽΜƒΒ²(1 + 1/π‘Ÿ)(𝑧𝛼/2 + 𝑧𝛽)Β²/(πœ‡Μƒ1 βˆ’ πœ‡Μƒ2)Β² + (𝑧𝛼/2)Β²/4
(2.18)
or equivalently, in the case where πœ‡Μƒ1 βˆ’ πœ‡Μƒ2 or πœŽΜƒΒ² is difficult to specify,
𝑛1 = (1 + 1/π‘Ÿ)(𝑧𝛼/2 + 𝑧𝛽)Β²/𝛿̃² + (𝑧𝛼/2)Β²/4
(2.19)
with 𝑛2 set equal to 𝑛1 π‘Ÿ.
Example 2.12. A researcher wants to estimate πœ‡1 βˆ’ πœ‡2 with 95% confidence and a desired
confidence interval width of 2.5 with a variance planning value of 4.0. The researcher also
wants 𝑛2 to be 2 times greater than 𝑛1 . The sample size requirement for group 1 is
approximately 𝑛1 = 4(4.0)(1 + 1/2)(1.96/2.5)Β² + 0.96 = 15.7 β‰ˆ 16 with 𝑛2 = 2(16) = 32
participants required in group 2.
Example 2.13. A researcher wants to test H0: πœ‡1 = πœ‡2 with Ξ± = .05 and power of .95. The
researcher also wants 𝑛2 to be one-fourth the size of 𝑛1 . The researcher expects the
standardized mean difference to be 0.75. The sample size requirement for group 1 is
approximately 𝑛1 = (1 + 1/0.25)(1.96 + 1.65)Β²/0.75Β² + 0.96 = 116.8 β‰ˆ 117 with 𝑛2 = (1/4)(117)
= 29.25 β‰ˆ 30 participants required in group 2.
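Equations 2.16 and 2.19 can be scripted like the equal-n formulas (illustrative function names; rounded critical values as in the text, with 𝑛2 = π‘Ÿπ‘›1 rounded up as needed):

```python
import math

def n1_precision_means(var_plan, width, r, z=1.96):
    """Group-1 n to estimate mu1 - mu2 with desired CI width when
    n2/n1 = r (Equation 2.16); n2 = r * n1."""
    return math.ceil(4 * var_plan * (1 + 1/r) * (z / width)**2 + z**2 / 4)

def n1_power_delta(delta_plan, r, z_a2=1.96, z_b=1.65):
    """Group-1 n to test H0: mu1 = mu2 with desired power when
    n2/n1 = r, standardized effect version (Equation 2.19)."""
    return math.ceil((1 + 1/r) * (z_a2 + z_b)**2 / delta_plan**2
                     + z_a2**2 / 4)

n1 = n1_precision_means(4.0, 2.5, 2)  # Example 2.12 inputs
print(n1, 2 * n1)                     # 16 32
print(n1_power_delta(0.75, 0.25))     # Example 2.13 inputs
```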
Graphing Results
The sample means for each group can be presented graphically using a bar chart.
A bar chart for two groups consists of two bars, one for each group, with the height
of each bar representing the value of the sample mean. Bar charts of sample means
can be misleading because the sample means contain sampling error of unknown
magnitude and direction. There is a tendency to incorrectly interpret a difference
in bar heights as representing a difference in population means. This
misinterpretation can be avoided by graphically presenting the imprecision of the
sample means with 95% confidence interval lines for each population mean, as
shown in the graph below. Non-overlapping 95% confidence interval lines imply
p < .05, but overlapping lines do not imply p > .05, and this is why it is important
to always report a confidence interval for πœ‡1 βˆ’ πœ‡2 along with a bar chart for the
individual means. A confidence interval for 𝜏1 βˆ’ 𝜏2 can be supplemented with a bar
chart of medians. The 95% confidence interval lines for πœπ‘— are obtained using
Equation 1.8 from Module 1. If the confidence interval lines for the individual
parameters do not overlap, then the confidence interval for the difference in
parameters will always exclude 0. However, if the confidence interval lines for the
individual parameters overlap, the confidence interval for the difference in
parameters might or might not exclude 0 depending on the amount of overlap.
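The overlap caveat can be demonstrated numerically. The sketch below uses made-up summary statistics and large-sample z intervals in place of t for simplicity: the two individual 95% intervals overlap, yet the 95% interval for the difference in means still excludes 0.

```python
import math

def ci_mean(mean, sd, n, z=1.96):
    """Large-sample 95% CI for a single mean (z used in place of t)."""
    half = z * sd / math.sqrt(n)
    return (mean - half, mean + half)

def ci_diff(m1, sd1, n1, m2, sd2, n2, z=1.96):
    """Large-sample 95% CI for mu1 - mu2 (separate variances)."""
    half = z * math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (m1 - m2 - half, m1 - m2 + half)

# Hypothetical two-group summary data
lo1, hi1 = ci_mean(7.0, 2.0, 50)
lo2, hi2 = ci_mean(6.2, 2.0, 50)
print(lo1 < hi2 and lo2 < hi1)              # True: individual intervals overlap
print(ci_diff(7.0, 2.0, 50, 6.2, 2.0, 50))  # lower limit is above 0
```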
Internal Validity
Recall that one of the fundamental requirements for declaring a relation between
two variables to be a causal relation is that the independent variable must be the
only variable affecting the dependent variable. In other words, there must be no
confounding variables. When this requirement is not satisfied, we say the internal
validity of the study has been compromised. In nonexperimental designs, there
will be many obvious confounding variables. For example, in a two-group study
that compares two teaching methods using students in two different classrooms
with one teacher using the first method and the other teacher using the second
method, a non-zero value of πœ‡1 βˆ’ πœ‡2 could be attributed to a difference in student
abilities in the two classrooms or a difference in teacher effectiveness. Confounding
variables can also be present in experimental designs. Consider a two-group
experiment for the treatment of anxiety in which one group receives a
widely-used medication and the second group receives a promising new drug.
Suppose a statistical analysis suggests that the new drug is more effective in
reducing anxiety than the old drug. However, the researchers cannot be sure that
the new drug will cause an improvement in anxiety because patients who received
the new drug also received extra safety precautions to monitor for possible negative
side effects. These extra precautions involved more supervision and patient
contact. It is possible that the additional supervision, and not the new drug, caused
the improvement in the patients.
Differential attrition is another problem that threatens the internal validity of a
study. Differential attrition occurs when the independent variable causes the
participants in one treatment condition to withdraw from treatment with higher
probability than participants in another treatment. With differential attrition,
participants who complete the study could differ across treatment conditions in
terms of some important attribute that would then be confounded with the
independent variable. Consider the following example. Suppose a researcher
conducts an experiment to evaluate two different methods of helping people
overcome their fear of public speaking. One method requires participants to
practice with an audience of size 20 and the other method requires participants to
practice with an audience of size 5. Fifty participants were randomly assigned to
each of these two training conditions, but ten dropped out of the first group and
only one dropped out of the second group. The results showed that public speaking
fear was lower under the first method (audience size of 20) of training. However, it
is possible that participants who stayed in the first group were initially less fearful
than those who dropped out and that this produced the lower fear scores in the
first training condition.
External Validity
External validity is the extent to which the results of a study can be generalized to
different types of participants and different types of research settings. In terms of
random sampling, it is usually easier to sample from a small homogeneous study
population than a larger and more heterogeneous study population. However, the
external validity of the study will be greater if the researcher samples from a larger
and more diverse study population. Researchers often go to great lengths to
minimize variability in the research setting for participants within a treatment
condition by, for instance, having the same researcher or lab assistant interact with
all participants, minimizing variability in laboratory lighting and temperature, or
testing participants at about the same time of the day. These efforts have a
desirable effect of reducing within-treatment (error) variability, which in turn
produces narrower confidence intervals and greater power of statistical tests.
However, these same efforts could simultaneously have the undesirable effect of
reducing the external validity of the study.
Nonrandom attrition occurs when certain types of participants drop out of the
study with a higher probability than other participants but drop out with about the
same probability across groups. With nonrandom attrition, the participants who
complete the study are no longer a random sample from the original study
population. The remaining participants could be assumed to be a random sample
from a smaller study population of participants who would have completed the
study. This change in the size and nature of the study population decreases the
external validity of the study. In contrast to nonrandom attrition, random attrition
will reduce the planned sample size, which in turn will decrease power of a
hypothesis test and increase the width of a confidence interval, but will have no
effect on the external or internal validity of the study.
Using Prior Information
Suppose a population mean for a particular response variable has been estimated
in a previous study and also in a new study. The previous study used a random
sample of size 𝑛1 to estimate πœ‡1 from one study population, and the new study used
a random sample of size 𝑛2 to estimate πœ‡2 from another study population. This is a
special type of two-group nonexperimental design where the two study
populations are assumed to be conceptually similar. If a confidence interval for
πœ‡1 – πœ‡2 suggests that πœ‡1 and πœ‡2 are not too dissimilar, then the researcher might
want to compute a confidence interval for (πœ‡1 + πœ‡2 )/2. A confidence interval for
(πœ‡1 + πœ‡2 )/2 could be substantially narrower and will have greater external validity
than the confidence interval for πœ‡1 or πœ‡2 .
A 100(1 βˆ’ 𝛼)% confidence interval for (πœ‡1 + πœ‡2)/2 is
(πœ‡Μ‚1 + πœ‡Μ‚2)/2 ± 𝑑𝛼/2;𝑑𝑓 𝑆𝐸(πœ‡Μ‚1+πœ‡Μ‚2)/2
(2.19)
where 𝑑𝑓 = (πœŽΜ‚1Β²/4𝑛1 + πœŽΜ‚2Β²/4𝑛2)Β²/[πœŽΜ‚1⁴/(16𝑛1Β²(𝑛1 βˆ’ 1)) + πœŽΜ‚2⁴/(16𝑛2Β²(𝑛2 βˆ’ 1))] and
𝑆𝐸(πœ‡Μ‚1+πœ‡Μ‚2)/2 = √(πœŽΜ‚1Β²/4𝑛1 + πœŽΜ‚2Β²/4𝑛2). If the sample sizes are approximately equal and the
two study population variances can be assumed to be similar, then the separate-variance
standard error could cautiously be replaced with a pooled-variance standard error
𝑆𝐸(πœ‡Μ‚1+πœ‡Μ‚2)/2 = √(πœŽΜ‚π‘Β²/4𝑛1 + πœŽΜ‚π‘Β²/4𝑛2) where πœŽΜ‚π‘Β² = [(𝑛1 βˆ’ 1)πœŽΜ‚1Β² + (𝑛2 βˆ’ 1)πœŽΜ‚2Β²]/𝑑𝑓 and
𝑑𝑓 = 𝑛1 + 𝑛2 – 2.
If a median has been estimated in the two studies, an approximate 100(1 βˆ’ 𝛼)%
confidence interval for (𝜏1 + 𝜏2)/2 is
(πœΜ‚1 + πœΜ‚ 2 )/2 ± 𝑧𝛼/2 √(π‘†πΈπœΜ‚21 + π‘†πΈπœΜ‚22 )/4
(2.20)
where π‘†πΈπœΜ‚2𝑗 was defined in Equation 1.8 of Module 1.
Example 2.14. A researcher at UC Davis obtained a random sample of
𝑛1 = 100 social
science seniors and asked them to rate on a 1 – 10 scale how concerned they were about
police brutality. The sample mean (πœ‡Μ‚1 = 8.14) and variance (πœŽΜ‚1Β² = 1.99) were reported in
a published article along with other results. A researcher at Cal State Fresno replicated the
UC Davis study using a sample size of 𝑛2 = 165 and obtained πœ‡Μ‚2 = 7.88 and πœŽΜ‚2Β² = 2.41.
The 95% confidence interval for πœ‡1 – πœ‡2 was [-0.1, 0.6] which suggests that πœ‡1 and πœ‡2 are
similar. The 95% confidence interval for (πœ‡1 + πœ‡2 )/2 was [7.83, 8.19] which is narrower
than the confidence interval for either πœ‡1 or πœ‡2 and describes the study populations of both
UC Davis and Cal State Fresno.
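The numbers in Example 2.14 can be reproduced from Equation 2.19. In the Python sketch below (the function name is illustrative), the Welch-type df works out to roughly 225, so z = 1.96 is substituted for the t critical value:

```python
import math

def ci_average_mean(m1, v1, n1, m2, v2, n2, z=1.96):
    """Approximate 95% CI for (mu1 + mu2)/2 using the
    separate-variance standard error (Equation 2.19)."""
    se = math.sqrt(v1 / (4 * n1) + v2 / (4 * n2))
    df = (v1 / (4 * n1) + v2 / (4 * n2))**2 / (
        v1**2 / (16 * n1**2 * (n1 - 1)) + v2**2 / (16 * n2**2 * (n2 - 1)))
    center = (m1 + m2) / 2
    return center - z * se, center + z * se, df

# Example 2.14 summary statistics from the two studies
lo, hi, df = ci_average_mean(8.14, 1.99, 100, 7.88, 2.41, 165)
print(round(lo, 2), round(hi, 2))  # 7.83 8.19, as reported in the example
print(round(df))                   # 225
```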
When estimating (πœ‡1 + πœ‡2 )/2, we assume that the two study population
parameters πœ‡1 and πœ‡2 are equally important. An alternative Bayesian approach
assumes πœ‡2 is the only parameter of interest and the information from the prior
study is used only to obtain a more precise estimate of πœ‡2 . A Bayesian estimate of
πœ‡2 that uses πœ‡Μ‚ 1 and the standard error of πœ‡Μ‚ 1 from a prior study is
πœ‡Μ…2 = [(1/π‘†πΈπœ‡Μ‚21 )πœ‡Μ‚ 1 + (1/π‘†πΈπœ‡Μ‚22 )πœ‡Μ‚ 2 ]/(1/π‘†πΈπœ‡Μ‚21 + 1/π‘†πΈπœ‡Μ‚22 )
(2.21)
with an approximation standard error of
π‘†πΈπœ‡Μ…2 = 1/√1/𝑆𝐸2πœ‡Μ‚1 + 1/𝑆𝐸2πœ‡Μ‚2 .
(2.22)
An approximate 100(1 βˆ’ 𝛼)% Bayesian confidence interval for πœ‡2 is
πœ‡Μ…2 ± 𝑧𝛼/2 π‘†πΈπœ‡Μ…2 .
(2.23)
For the above example, πœ‡Μ…2 = [(100/1.99)8.14 + (165/2.41)7.88]/(100/1.99 + 165/2.41) =
7.99, π‘†πΈπœ‡Μ…2 = 1/√(100/1.99 + 165/2.41) = 0.092, and the approximate 95% Bayesian
confidence interval for πœ‡2 is 7.99 ± 1.96(0.092) = [7.81, 8.17].
The Bayesian confidence interval for πœ‡2 is attractive because it will always be
narrower than the traditional confidence interval for πœ‡2 . In this example, the
traditional confidence interval for πœ‡2 is [7.65, 8.12]. However, unless the value of
πœ‡1 βˆ’ πœ‡2 is small (based on compelling theoretical arguments or as determined from
an equivalence test) or 𝑛2 is several times larger than 𝑛1 , a 100(1 βˆ’ 𝛼)% Bayesian
confidence interval for πœ‡2 can have a coverage probability that is far less than the
specified confidence level.
Ethical Issues
Any study that uses human subjects should advance knowledge and potentially
lead to improvements in the quality of life – but the researcher also has an
obligation to protect the rights and welfare of the participants in the study. These
two goals are often in conflict and lead to ethical dilemmas. The most widely used
approach to resolving ethical dilemmas is to weigh the potential benefits of the
research against the costs to the participants. Evaluating the costs and benefits of
a proposed research project that involves human subjects can be extremely
difficult and this task is assigned to the Institutional Review Board (IRB) at most
universities. Researchers who plan to use human subjects in their research must
submit a written proposal to the IRB for approval. The IRB will carefully examine
research proposals in terms of the following issues:
Informed Consent – Are participants informed of the nature of the study,
have they explicitly agreed to participate, and are they allowed to freely
decline to participate?
Coercion to participate – Were participants coerced into participating or
offered excessive inducements?
Confidentiality – Will the data collected from participants be used only for
research purposes and not divulged to others?
Physical and mental stress – Does the study involve more than minimal
risk? Minimal risk is defined as risk that is no greater in probability or
severity than ordinarily encountered in daily life or during a routine
physical or psychological exam.
Deception – Is deception needed in the study? If deception is used, are
participants debriefed after the study? Debriefing is used to clarify the
nature of the study to the participants and reduce any stress or anxiety to
the participants caused by the study.
In addition to principles governing the treatment of human subjects, researchers
are bound by a set of ethical standards. Violation of these standards is called
scientific misconduct. There are three basic types of scientific misconduct:
Scientific dishonesty – Examples include: the fabrication or falsification of
data, and plagiarism. Plagiarism is the use of another person's ideas,
processes, results, or words without giving appropriate credit.
Unethical behavior – Examples include: sexual harassment of research
assistants or research participants, abuse of authority, failure to follow
university or government regulations, and inappropriately including or
excluding authors on a research report or conference presentation.
Questionable research practices – Examples include: performing an
exploratory analysis of many dependent or independent variables and
reporting only the variables that yield a β€œsignificant” result; reporting
unexpected findings as if they had been expected; failure to assess critical
assumptions for statistical tests or confidence intervals, and deleting
legitimate data in an effort to obtain desired results; interpreting a
"significant" result as representing an important effect and reporting a
"nonsignificant" result as evidence that the effect is zero; intentionally not
reporting a confidence interval result when it suggests that the effect could
be trivial or unimportant.