D.G. Bonett (5/2017)

Module 2

Two-group Experimental Designs

The goal of most research is to assess a possible causal relation between the response variable and another variable called the independent variable. In experimental designs, the response variable is usually called a dependent variable. Three basic conditions must be satisfied to demonstrate a causal relation between a dependent variable and an independent variable. First, there must be a relation between the dependent variable and the independent variable. Second, the effect of the independent variable must be present prior to any observed change in the dependent variable. Third, no variable other than the independent variable can be responsible for the relation between the dependent variable and the independent variable.

An experiment can be used to assess a causal relation. The simplest type of experiment involves just two treatment conditions that represent the levels of the independent variable. In a two-group experiment, a random sample of n participants is selected from a study population. The random sample is then randomized (i.e., randomly divided) into two groups, and each group receives one of the two treatments with participants treated identically within each group. If one group does not receive any treatment, it is called a control group. Following treatment, a measurement on the dependent variable is obtained for each participant.

In a two-group experiment with a quantitative dependent variable, a population mean could be estimated from each group. In an experimental design, the population means have interesting interpretations: μ₁ is the population mean of the dependent variable assuming everyone in the study population had received level 1 of the independent variable, and μ₂ is the population mean of the dependent variable assuming everyone in the same study population had instead received level 2 of the independent variable.
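The randomization step described above can be sketched in code; a minimal Python illustration (the participant IDs and the seed are hypothetical, chosen only for the example):

```python
import random

def randomize_two_groups(participant_ids, seed=None):
    """Randomly divide a sample of participants into two equal-size
    treatment groups (simple randomization for a two-group experiment)."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)           # random order
    half = len(ids) // 2
    return ids[:half], ids[half:]

# e.g., 80 sampled participants randomized into two groups of 40
group1, group2 = randomize_two_groups(range(1, 81), seed=1)
```

Every participant ends up in exactly one group, which is what allows a nonzero effect size to be attributed to the treatment rather than to preexisting differences.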
The difference in population means for the two treatment conditions, μ₁ − μ₂, is called the effect size and describes the strength of the relation between the dependent and independent variables. In an experiment, a nonzero effect size is evidence that the independent variable has a causal effect on the dependent variable because all three conditions required for a causal association will have been satisfied: 1) a nonzero effect size implies a relation between the dependent and independent variables, 2) the change in the dependent variable occurred after exposure to the independent variable, and 3) because the participants were randomized into the levels of the independent variable, no other variable could have caused the nonzero effect size. A confidence interval for μ₁ − μ₂ provides information about the direction and magnitude of the effect size.

Two-group Nonexperimental Designs

In a two-group nonexperimental design, participants are classified into two groups according to some preexisting characteristic (e.g., male/female, democrat/republican, sophomore/junior, etc.) rather than being randomized into the treatment conditions. In nonexperimental designs, the magnitude of the difference in population means describes the strength of a relation between the response variable and the independent variable. In nonexperimental designs, the independent variable is often referred to as a predictor variable. In nonexperimental designs where participants are not randomly assigned to groups, an observed relation between the predictor variable and the response variable cannot be interpreted as a causal relation because the relation could be a consequence of one or more unmeasured variables, referred to as confounding variables, that are related to both the response variable and the predictor variable.
Consider, for example, the many nonexperimental studies that have compared moderate alcohol drinkers with non-drinkers and found that moderate drinkers live longer. Moderate drinkers could differ from non-drinkers in education level, income, access to health care, moderation in consumption of unhealthy foods, and many other characteristics. It is possible that one or more of these confounding variables is responsible for the observed relation between alcohol consumption and life expectancy. Therefore, the nonexperimental finding that alcohol consumption is related to life expectancy does not imply that a non-drinker who begins to consume alcohol in moderation will live longer.

In a nonexperimental design, the parameters also have a different interpretation. Specifically, μ₁ is the population mean of the dependent variable for all people in one study population who belong to one category of the predictor variable (e.g., male, democrat, sophomore), and μ₂ is the population mean of the dependent variable for all people in a second study population who belong to the other category of the predictor variable (e.g., female, republican, junior). The members of the study populations within each category are referred to as subpopulations. The subtle but important parameter interpretation differences in experimental and nonexperimental designs will affect how the researcher describes the results of a confidence interval or test.

Confidence Interval for a Population Mean Difference

A 100(1 − α)% confidence interval for μ₁ − μ₂ is

  ȳ₁ − ȳ₂ ± t_{α/2;df} SE_{ȳ₁−ȳ₂}    (2.1)

where t_{α/2;df} is a critical t-value, SE_{ȳ₁−ȳ₂} = sqrt(σ̂₁²/n₁ + σ̂₂²/n₂), and df = (σ̂₁²/n₁ + σ̂₂²/n₂)² / [σ̂₁⁴/(n₁²(n₁ − 1)) + σ̂₂⁴/(n₂²(n₂ − 1))]. If the sample sizes are approximately equal and the study population variances can be assumed to be similar, then the separate-variance standard error could cautiously be replaced with an equal-variance standard error SE_{ȳ₁−ȳ₂} = sqrt(σ̂ₚ²/n₁ + σ̂ₚ²/n₂), where σ̂ₚ² = [(n₁ − 1)σ̂₁² + (n₂ − 1)σ̂₂²]/df and df = n₁ + n₂ − 2. Contrary to the recommendations of most statisticians, many researchers have been taught to always use the equal-variance method.

In an experiment, recall that all participants within a particular treatment group should be treated identically. The within-group variance estimates, σ̂₁² and σ̂₂², represent unexplained variability in the dependent variable. The within-group variance is also referred to as error variance.

Example 2.1. A researcher believes that it is important for 2nd grade students to overlearn the multiplication tables so that these computations can be made rapidly and without thought when students later begin working on more complex math problems. A population of 1496 2nd grade students was identified in the Salinas, CA school district, and 80 students were randomly selected from this study population. The 80 students were randomized into two groups of equal size. The first group was a control group and received no additional multiplication table training. The second group received 15 minutes per day of extra multiplication table training for 60 days. At the end of the 60-day training period, all 80 students were given a multiplication test and the time (in seconds) to complete the test was recorded for each student. The sample means and standard deviations are given below.

  Group 1: ȳ₁ = 273.6, σ̂₁ = 27.2
  Group 2: ȳ₂ = 112.8, σ̂₂ = 20.8

The separate-variance 95% confidence interval for μ₁ − μ₂ is

  273.6 − 112.8 ± t_{.05/2;df} sqrt(27.2²/40 + 20.8²/40) = [150.0, 171.6]

where df = (27.2²/40 + 20.8²/40)² / [27.2⁴/(40²(39)) + 20.8⁴/(40²(39))] = 72.9 and t_{.05/2;72.9} = 2.00.
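The interval in Example 2.1 can be reproduced from the summary statistics alone; a Python sketch of the separate-variance (Welch) interval of Equation 2.1 (the critical t-value is hard-coded from a t-table, since base Python has no t quantile function):

```python
import math

def welch_ci(m1, m2, sd1, sd2, n1, n2, tcrit):
    """Separate-variance (Welch) confidence interval for mu1 - mu2
    from summary statistics (Equation 2.1)."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2
    se = math.sqrt(v1 + v2)
    # Satterthwaite degrees of freedom
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    diff = m1 - m2
    return diff - tcrit * se, diff + tcrit * se, df

# Example 2.1; t_{.025; 73} is about 1.993 (from a t-table)
lo, hi, df = welch_ci(273.6, 112.8, 27.2, 20.8, 40, 40, 1.993)
```

Rounded to one decimal place, the limits match the [150.0, 171.6] interval reported in Example 2.1.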
The researcher is 95% confident that in the study population of 1496 2nd grade students, the average time to complete the multiplication tables test would be 150.0 to 171.6 seconds faster if they had all received the extra math training for 60 days.

Confidence Interval for a Population Standardized Mean Difference

In the above math education experiment, the metric of the dependent variable (seconds to complete the test) is easy to understand, and the confidence interval for μ₁ − μ₂ clearly describes the magnitude and importance of the effect of the independent variable. However, many dependent variables used by social and behavioral researchers have metrics that may not be well understood by other researchers. In these cases the confidence interval for μ₁ − μ₂ may not provide easily interpretable information about the magnitude and importance of the effect of the independent variable. For instance, suppose a researcher compared two different counseling approaches for test anxiety. Following counseling, the researcher administered a test anxiety questionnaire to all student participants and obtained a 95% confidence interval for μ₁ − μ₂ equal to [4.23, 7.37]. Is this an important effect? The importance of this result is impossible to describe in the absence of information about the test anxiety scale.

In applications where the magnitude of μ₁ − μ₂ is not easy to interpret, it is helpful to report a confidence interval for a standardized mean difference. The population standardized mean difference, also known as Cohen's d, is defined as

  δ = (μ₁ − μ₂)/sqrt((σ₁² + σ₂²)/2).

An approximate 100(1 − α)% confidence interval for δ is

  δ̂ ± z_{α/2} SE_δ̂    (2.2)

where δ̂ = (ȳ₁ − ȳ₂)/sqrt((σ̂₁² + σ̂₂²)/2) and SE_δ̂ = sqrt(δ̂²[1/(n₁ − 1) + 1/(n₂ − 1)]/8 + (n₁ + n₂)/(n₁n₂)). The interpretation of δ is difficult unless the population variances are similar.
In applications where the population variances are not expected to be similar, another estimate of δ (also known as Glass's d) is defined as the difference in sample means divided by the sample standard deviation in one group (usually, but not necessarily, the control group).

Example 2.2. In the test anxiety example mentioned above, suppose the researcher obtained the following sample means and variances with sample sizes of n₁ = n₂ = 40.

  Treatment 1: ȳ₁ = 21.9, σ̂₁² = 14.8
  Treatment 2: ȳ₂ = 16.1, σ̂₂² = 10.2

The estimate of δ is δ̂ = (21.9 − 16.1)/sqrt((σ̂₁² + σ̂₂²)/2) = 5.8/3.54 = 1.64, where sqrt((σ̂₁² + σ̂₂²)/2) = sqrt((14.8 + 10.2)/2) = 3.54, and a 95% confidence interval for δ is

  1.64 ± 1.96 sqrt(1.64²[1/39 + 1/39]/8 + (40 + 40)/(40·40)) = [1.13, 2.15].

The researcher is 95% confident that, in the study population of students with test anxiety, the mean test anxiety score would be 1.13 to 2.15 standard deviations greater if all students received counseling method 1 rather than counseling method 2.

To interpret the confidence interval for δ in Example 2.2, imagine two normal (Gaussian) curves, one curve for a population distribution of test anxiety scores under counseling method 1 and a second curve for a population distribution of test anxiety scores under counseling method 2. Now visualize the distribution for counseling method 1 shifted to the right of the distribution for counseling method 2 at least 1.13 standard deviations and at most 2.15 standard deviations. To visualize the magnitude of this shift, use the fact that one standard deviation from the mean is the point where the normal curve changes from concave down to concave up (the inflection point). Knowing that the point of inflection on a normal curve is one standard deviation from the mean, a researcher can then easily visualize two normal distributions that are separated by a specified number of standard deviations.
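Equation 2.2 and the Example 2.2 arithmetic can be verified numerically; a Python sketch using only the standard library (the normal critical value comes from `statistics.NormalDist`):

```python
import math
from statistics import NormalDist

def cohen_d_ci(m1, m2, v1, v2, n1, n2, conf=0.95):
    """Estimate of delta and its approximate confidence interval
    (Equation 2.2), computed from summary statistics."""
    d = (m1 - m2) / math.sqrt((v1 + v2) / 2)
    se = math.sqrt(d**2 * (1/(n1 - 1) + 1/(n2 - 1)) / 8
                   + (n1 + n2) / (n1 * n2))
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # 1.96 for 95%
    return d, d - z * se, d + z * se

# Example 2.2
d, lo, hi = cohen_d_ci(21.9, 16.1, 14.8, 10.2, 40, 40)
```

Rounded to two decimal places the results match δ̂ = 1.64 and the interval [1.13, 2.15] reported in Example 2.2.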
The value of δ can be transformed into another measure of effect size called Cohen's U3. In an experimental design, U3 represents the proportion of people in the study population who would have a score under Treatment 1 that would be greater than their mean score under Treatment 2. In a nonexperimental design, U3 describes the proportion of scores in subpopulation 1 that are greater than the mean of subpopulation 2. To transform δ into U3, simply find the area under the standard unit normal curve that is less than δ. The pnorm function in R is useful for this purpose. This transformation can be applied to the estimate of δ, the lower confidence limit, and the upper confidence limit. For instance, in the above example pnorm(1.13) = .87 and pnorm(2.15) = .98. We can be 95% confident that, if all test anxiety students in the study population had received counseling method 1, between 87% and 98% of their test anxiety scores would be greater than their mean score under counseling method 2.

Confidence Interval for a Ratio of Population Means

If the dependent variable is measured on a ratio scale, a ratio of population means (μ₁/μ₂) is a unitless measure of effect size that could be more meaningful and easier to interpret than a standardized mean difference. An approximate 100(1 − α)% confidence interval for μ₁/μ₂ that does not assume equal population variances is

  exp[ln(ȳ₁/ȳ₂) ± t_{α/2;df} sqrt(σ̂₁²/(ȳ₁²n₁) + σ̂₂²/(ȳ₂²n₂))]    (2.3)

where df = [σ̂₁²/(ȳ₁²n₁) + σ̂₂²/(ȳ₂²n₂)]² / [σ̂₁⁴/(ȳ₁⁴n₁²(n₁ − 1)) + σ̂₂⁴/(ȳ₂⁴n₂²(n₂ − 1))]. If the sample sizes are approximately equal and the study population variances can be assumed to be similar, then σ̂ⱼ² in Equation 2.3 could cautiously be replaced with σ̂ₚ², and then df simplifies to n₁ + n₂ − 2. Suppose a 95% confidence interval for μ₁/μ₂ in a particular study is [1.51, 1.78].
This confidence interval has a simple interpretation: the researcher can be 95% confident that μ₁ is 1.51 to 1.78 times as large as μ₂.

Hypothesis Testing

A confidence interval for μ₁ − μ₂ can be used to test hypotheses. For instance, a confidence interval for μ₁ − μ₂ can be used to implement a three-decision rule for the following hypotheses.

  H0: μ₁ = μ₂
  H1: μ₁ > μ₂
  H2: μ₁ < μ₂

If the lower limit for μ₁ − μ₂ is greater than 0, reject H0 and accept H1: μ₁ > μ₂; if the upper limit for μ₁ − μ₂ is less than 0, reject H0 and accept H2: μ₁ < μ₂. The results are inconclusive if the confidence interval includes 0. Note that it is not necessary to develop special hypothesis testing rules for the value of δ because μ₁ = μ₂ implies δ = 0, μ₁ > μ₂ implies δ > 0, and μ₁ < μ₂ implies δ < 0.

When this three-decision rule is applied to the mean difference in a two-group design, it is commonly referred to as an independent-samples t-test, and the test statistic t = (ȳ₁ − ȳ₂)/SE_{ȳ₁−ȳ₂} is used to select H1 or H2:

  accept H1: μ₁ > μ₂ if t > t_{α/2;df}
  accept H2: μ₁ < μ₂ if t < −t_{α/2;df}
  fail to reject H0 (i.e., an inconclusive result) if |t| < t_{α/2;df}

Computer programs will compute a p-value for the t statistic. In a significance testing approach, if the p-value is small (e.g., less than .05) the researcher declares the results to be "significant"; otherwise, the results are declared to be "nonsignificant". It is important to remember that a significant result does not imply that an important difference in population means has been detected, and a nonsignificant result does not imply that the null hypothesis is true. Recall from Module 1 that accepting H1 when H2 is true or accepting H2 when H1 is true is called a directional error, and the probability of making a directional error is at most α/2. It is reasonable to assume that H0: μ₁ = μ₂ will be false in any real application.
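As an illustration of the three-decision rule, the t statistic for Example 2.1's summary data can be computed with the separate-variance standard error; a Python sketch (the critical value 1.99 is the approximate t_{.025;73} from the example):

```python
import math

def t_three_decision(m1, m2, sd1, sd2, n1, n2, tcrit):
    """Independent-samples t statistic (separate-variance standard error)
    and the three-decision rule: H1 (mu1 > mu2), H2 (mu1 < mu2),
    or an inconclusive result."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    t = (m1 - m2) / se
    if t > tcrit:
        return t, "accept H1: mu1 > mu2"
    if t < -tcrit:
        return t, "accept H2: mu1 < mu2"
    return t, "fail to reject H0 (inconclusive)"

# Example 2.1 summary data
t, decision = t_three_decision(273.6, 112.8, 27.2, 20.8, 40, 40, 1.99)
```

Here t is far above the critical value, consistent with the confidence interval for μ₁ − μ₂ in Example 2.1 excluding 0.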
A failure to reject H0 is called a Type II error, and the power of the test is the probability of avoiding a Type II error.

Confidence intervals for μ₁ − μ₂ also can be used to determine if μ₁ and μ₂ are similar. For instance, a confidence interval for μ₁ − μ₂ can be used to choose between the following two hypotheses in an equivalence test:

  H0: |μ₁ − μ₂| ≤ b
  H1: |μ₁ − μ₂| > b

where b is some value specified by the researcher. Usually b represents the value of μ₁ − μ₂ that would be considered by experts to be small or unimportant. The interval −b to b is called the region of practical equivalence. If the confidence interval for μ₁ − μ₂ is completely contained within the interval −b to b, then H0 is accepted; if the confidence interval for μ₁ − μ₂ is completely outside the interval −b to b, then H1 is accepted; otherwise, the results are inconclusive. The probability of falsely accepting H1: |μ₁ − μ₂| > b is at most α. In an equivalence test, accepting H1 when H0 is true is called a Type I error, and accepting H0 when H1 is true is called a Type II error.

In equivalence testing applications where it is difficult to specify a value of μ₁ − μ₂ that would be considered by experts to be small or unimportant, it might be easier to specify a value of δ that would be considered small or unimportant. A confidence interval for δ can be used to choose between the following two hypotheses

  H0: |δ| ≤ b
  H1: |δ| > b

where b represents a value of δ that would be considered by experts to be small or unimportant. If the confidence interval for δ is completely contained within the interval −b to b (the region of practical equivalence), then H0 is accepted; if the confidence interval for δ is completely outside the interval −b to b, then H1 is accepted; otherwise, the results are inconclusive.
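The equivalence-test decision rule described above can be written as a small function; a Python sketch (the interval endpoints used below are hypothetical, not taken from any example in the text):

```python
def equivalence_decision(ci_lower, ci_upper, b):
    """Equivalence-test decision from a confidence interval for mu1 - mu2,
    using the region of practical equivalence (-b, b)."""
    if -b < ci_lower and ci_upper < b:
        return "accept H0: |mu1 - mu2| <= b (practical equivalence)"
    if ci_upper < -b or ci_lower > b:
        return "accept H1: |mu1 - mu2| > b"
    return "inconclusive"

# hypothetical 95% CI of [-1.2, 2.4] with b = 5
result = equivalence_decision(-1.2, 2.4, 5)
```

Because the hypothetical interval lies entirely inside (−5, 5), the rule accepts H0; an interval straddling either boundary would be inconclusive.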
Prediction Intervals

In some experiments, the researcher might want to estimate how the dependent variable score for one randomly selected person would differ under the two treatment conditions. A 100(1 − α)% prediction interval for this difference is

  ȳ₁ − ȳ₂ ± t_{α/2;df} sqrt(σ̂ₚ²/n₁ + σ̂ₚ²/n₂ + 2σ̂ₚ²)    (2.4)

where df = n₁ + n₂ − 2 and σ̂ₚ² is the pooled variance estimate described previously. Equation 2.4 assumes equal population variances. There is another version of this prediction interval that does not assume equal population variances, but its df formula is complicated. A prediction interval for the difference in scores for one person will be wider than a confidence interval for the difference in population means.

Example 2.3. In the two-group experiment described in a previous example where 2nd grade students were randomized into a control group or a group that received an additional 15 minutes of math instruction each day, the 95% confidence interval for μ₁ − μ₂ was [150.0, 171.6]. This confidence interval suggests that the mean time to complete the multiplication test in the study population would be 150.0 to 171.6 seconds shorter if all students in the study population received the extra training. The researcher wants to estimate how much better one randomly selected student from the study population would perform if given the extra training compared to no extra training. Applying Equation 2.4 gives a 95% prediction interval of [91.8, 229.8], indicating that any one randomly selected student from the study population should complete the multiplication test 91.8 to 229.8 seconds faster if given extra math training.

Assumptions for Confidence Intervals and Tests

The separate-variance and pooled-variance confidence intervals for μ₁ − μ₂ assume random sampling and independence of observations.
These confidence intervals for μ₁ − μ₂ also assume that the dependent variable has an approximate normal distribution in the study population under each treatment condition in an experimental design, or within each subpopulation in a nonexperimental design. The pooled-variance confidence interval for μ₁ − μ₂ requires an additional assumption of equal population variances under each treatment condition (or within each subpopulation), called the homoscedasticity assumption. The homoscedasticity assumption is difficult to justify in most nonexperimental designs.

The pooled-variance confidence interval can perform poorly when the population variances are unequal and the sample sizes are unequal. Specifically, the percent of all possible random samples for which a pooled-variance 100(1 − α)% confidence interval for μ₁ − μ₂ will capture the value of μ₁ − μ₂ can be much less than 100(1 − α)% when the population variances are unequal and the smaller sample size is used in the treatment with the larger variance. Statistical methods for assessing the homoscedasticity assumption require large sample sizes. Consequently, most statisticians recommend using the separate-variance confidence interval unless the sample sizes are approximately equal and there is strong prior information suggesting that the population variances should be similar.

The confidence interval and hypothesis test for a difference in population means will perform properly (i.e., the true confidence level will be similar to the stated confidence level) under violations of the normality assumption if the sample size per group is not too small. With sample sizes of 20 or more per group, the separate-variance and pooled-variance confidence intervals for μ₁ − μ₂ will perform properly even when the dependent variable is markedly skewed.
The confidence interval for δ (Equation 2.2) and the prediction interval for a difference in scores (Equation 2.4) are sensitive to a violation of the normality assumption, and increasing the sample size will not mitigate the problem. The confidence interval for δ and the prediction interval for a difference in scores should not be used, regardless of sample size, unless the dependent variable (or some transformation of the dependent variable) is at most mildly non-normal. To assess the degree of non-normality in a two-group design, subtract ȳ₁ from all of the group 1 scores and subtract ȳ₂ from all of the group 2 scores. Then estimate the skewness and kurtosis coefficients from these n₁ + n₂ deviation scores. If the deviation scores are skewed, it might be possible to reduce the skewness by transforming (e.g., log, square-root, reciprocal) the dependent variable scores.

The values of μ₁ and μ₂ could be difficult to interpret if the dependent variable scores have been transformed in an effort to reduce non-normality. Consequently, a confidence interval for μ₁ − μ₂ could be difficult to interpret, and the researcher may want to report only a hypothesis testing result, which does not require an interpretation of the effect size magnitude. However, confidence intervals for δ remain interpretable with transformed data because δ is a unitless measure of effect size.

Distribution-free Methods

If the dependent variable is highly skewed, a difference in population medians (θ₁ − θ₂) might be a more appropriate and meaningful measure of effect size than a difference in population means. An approximate 100(1 − α)% confidence interval for θ₁ − θ₂ is

  θ̂₁ − θ̂₂ ± z_{α/2} sqrt(SE²_θ̂₁ + SE²_θ̂₂)    (2.5)

where SE_θ̂ⱼ was defined in Equation 1.8 of Module 1. This confidence interval only assumes random sampling and independence among participants. Equation 2.5 can be used for testing H0: θ₁ = θ₂ and to decide if θ₁ > θ₂ or θ₁ < θ₂.
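The deviation-score check for non-normality described above can be sketched in code; a stand-alone Python illustration using moment-based skewness and kurtosis estimates (one common definition among several; for a normal distribution these are near 0 and 3, respectively):

```python
def skew_kurtosis(group1, group2):
    """Pool deviation scores (each score minus its own group mean) and
    return moment-based skewness and kurtosis estimates."""
    dev = [y - sum(group1) / len(group1) for y in group1] + \
          [y - sum(group2) / len(group2) for y in group2]
    n = len(dev)
    m2 = sum(d**2 for d in dev) / n   # second central moment
    m3 = sum(d**3 for d in dev) / n   # third central moment
    m4 = sum(d**4 for d in dev) / n   # fourth central moment
    return m3 / m2**1.5, m4 / m2**2   # skewness, kurtosis

# illustrative (hypothetical) data; symmetric, so skewness is 0
skew, kurt = skew_kurtosis([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])
```

Markedly nonzero skewness in the pooled deviation scores would argue against reporting the confidence interval for δ or the prediction interval without first transforming the scores.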
Equation 2.5 also can be used to test H0: |θ₁ − θ₂| ≤ b against H1: |θ₁ − θ₂| > b. If the dependent variable is measured on a ratio scale, a ratio of population medians (θ₁/θ₂) could be an informative measure of effect size. To obtain a confidence interval for θ₁/θ₂, compute Equation 2.5 from log-transformed dependent variable scores and then exponentiate the lower and upper limits.

The Mann-Whitney test (also referred to as the Mann-Whitney-Wilcoxon test) is a distribution-free test of H0: π = .5, where π is the proportion of people in the study population who would have a larger y score if they had received Treatment 2 rather than Treatment 1. The interpretation of π is more complicated in a two-group nonexperimental design with subpopulation sizes N₁ and N₂. In a nonexperimental design, π is the proportion of all N₁N₂ pairs of scores where the score for a person from subpopulation 2 is greater than the score for a person from subpopulation 1. Statistical packages will compute a p-value for the Mann-Whitney test that can be used to decide if the null hypothesis can be rejected. The Mann-Whitney test is usually a little less powerful than the independent-samples t-test, but it can be more powerful than the t-test if the dependent variable is highly leptokurtic.

In applications where the dependent variable does not have, and cannot be transformed to have, an approximate normal distribution, the standardized mean difference will be difficult to interpret because we would not be able to visualize one standard deviation from the mean. In these situations the Mann-Whitney parameter (π) is a useful measure of effect size because it has a clear and simple interpretation for any distribution shape. A confidence interval for π does not have a simple formula, but it has been programmed in the provided R functions.

Example 2.4.
A sample of 8 male frequent marijuana users and a sample of 10 male non-users were obtained from a University of Colorado research participant pool. Amygdala activity levels of all participants were obtained while participants listened to an audio tape with high emotional content. The activity scores are shown below.

  Users:     14.6  5.1  8.1  22.7  6.4  4.4  19.0  3.2
  Non-users: 9.4  10.3  58.3  106.0  31.0  46.2  12.0  19.0  135.0  159.0

The p-value for the Mann-Whitney test is .006 and the 95% confidence interval for π is [.629, 1.00]. The sample medians are 7.25 and 38.6 for groups 1 and 2, respectively. A 95% confidence interval for θ₁ − θ₂ is [−85.1, 22.4]. Note that the Mann-Whitney test rejects the null hypothesis but a 95% confidence interval for θ₁ − θ₂ does not reject the null hypothesis. The contradictory results are due to the fact that the Mann-Whitney test is usually more powerful than a test of H0: θ₁ = θ₂ based on a confidence interval for θ₁ − θ₂. The 95% confidence interval for π excludes .5 (which will always occur whenever the Mann-Whitney p-value is less than .05). The researcher can be 95% confident that the proportion of all user/non-user pairs in the two subpopulations where a non-user has a higher activity score than a user is between 62.9% and 100%.

The activity scores are highly skewed, but a log transformation effectively removes the skewness. An estimate of δ for log-transformed scores is −1.61 and its 95% confidence interval is [−2.73, −0.49]. The meaning of the log-transformed activity scores is not clear, and a confidence interval for μ₁ − μ₂ or θ₁ − θ₂ might be difficult to interpret. However, the confidence intervals for δ and π are interpretable with transformed scores.

Equation 2.5 can be useful in time-to-event studies. If the study ends before some of the participants have exhibited the event, the scores for those participants are set equal to t + 1, where t is the length of the study period.
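Returning to Example 2.4, the point estimate of the Mann-Whitney parameter π can be computed directly by counting cross-group pairs; a minimal Python sketch (the confidence interval itself requires the provided R functions and is not reproduced here):

```python
def mw_proportion(sample1, sample2):
    """Estimate the Mann-Whitney parameter pi: the proportion of all
    n1*n2 cross-group pairs in which the group-2 score exceeds the
    group-1 score (ties count one half)."""
    wins = sum(1.0 if y2 > y1 else 0.5 if y2 == y1 else 0.0
               for y1 in sample1 for y2 in sample2)
    return wins / (len(sample1) * len(sample2))

users = [14.6, 5.1, 8.1, 22.7, 6.4, 4.4, 19.0, 3.2]
non_users = [9.4, 10.3, 58.3, 106.0, 31.0, 46.2, 12.0, 19.0, 135.0, 159.0]
pi_hat = mw_proportion(users, non_users)
```

The point estimate falls inside the reported 95% confidence interval of [.629, 1.00] for π.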
Equation 2.5 requires y₍ₒ₂₎ ≤ t in both groups, where o₂ is defined in Equation 1.7 of Module 1.

Example 2.5. A random sample of 20 social science graduates and 20 engineering graduates from UCSC agreed to participate in a 36-month study of post-graduation employment. The number of months each participant stayed in their first job was determined for each participant. Some participants had not left their first job at the end of the 36-month study period and were given a censored score of 37 months. The rank-ordered time-to-event scores (in months) are given below.

  Social science: 2, 4, 6, 8, 10, 12, 12, 13, 15, 15, 20, 21, 24, 30, 30, 34, 34, 35, 36, 37
  Engineering: 6, 15, 16, 17, 18, 18, 19, 21, 22, 22, 24, 25, 30, 31, 32, 35, 37, 37, 37, 37

Some scores have been censored, and we must first verify that y₍ₒ₂₎ ≤ 36 in both groups. From Equation 1.7, we compute o₂ = 15 and find that y₍₁₅₎ = 30 in group 1 and y₍₁₅₎ = 32 in group 2, which satisfies the requirement. Applying Equation 2.5 gives a 95% confidence interval of [−15.5, 6.5]. The confidence interval includes 0 and is too wide. The study needs to be replicated using a larger sample size.

Sample Size Requirement for Desired Precision

The sample size requirement per group for estimating μ₁ − μ₂ with desired confidence and precision is approximately

  n_j = 8σ̃²(z_{α/2}/w)² + z²_{α/2}/4    (2.6)

where σ̃² is a planning value of the average within-group variance of the dependent variable for the two groups and w is the desired width of the confidence interval. Recall from Module 1 that this planning value can be specified using information from published research reports, a pilot study, or the opinions of experts. If prior estimates of the dependent variable variance are unavailable but the maximum and minimum values of the dependent variable are known, the planning value of the variance could be set to [(max − min)/4]².

Example 2.6.
A researcher wants to conduct a study to determine the effect of "achievement motivation" on the types of tasks one chooses to undertake. The study will ask participants to play a ring-toss game where they try to throw a small plastic ring over an upright post. The participants will choose how far away from the post they are when they make their tosses. The chosen distance from the post is the dependent variable. The independent variable is degree of achievement motivation (high or low) and will be manipulated by the type of instructions given to the participants. The results of a pilot study suggest that the standard deviation of the distance scores is about 0.75 foot within each condition. The researcher wants the 99% confidence interval for μ₁ − μ₂ to have a width of about 1 foot. The required sample size per group is approximately

  n_j = 8(0.75²)(2.58/1)² + 1.66 = 31.6 ≈ 32.

A random sample of 64 participants is required, with 32 participants given low achievement motivation instructions and 32 participants given high achievement motivation instructions.

The sample size requirement per group for estimating δ with desired confidence and precision is approximately

  n_j = (δ̃² + 8)(z_{α/2}/w)²    (2.7)

where δ̃ is a planning value of the standardized mean difference. The planning value can be specified using information from published research reports, a pilot study, or expert opinion.

Example 2.7. A researcher will compare two methods of treating homelessness-induced PTSD in adolescents and will use a new measure of PTSD as the dependent variable. Given the novelty of the new PTSD measure, it is difficult for the researcher to specify a desired width of a confidence interval for μ₁ − μ₂. However, the researcher expects δ to be 1.0 and would like a 95% confidence interval for δ to have a width of about 0.5. The required sample size per group is approximately

  n_j = (1² + 8)(1.96/0.5)² = 138.3 ≈ 139.
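Equations 2.6 and 2.7 are easy to script; a Python sketch that reproduces the calculations in Examples 2.6 and 2.7 (rounding up to the next whole participant):

```python
import math

def n_per_group_mean_diff(var_plan, z, w):
    """Equation 2.6: per-group n to estimate mu1 - mu2 with a
    confidence interval of width w (var_plan is the variance
    planning value)."""
    return math.ceil(8 * var_plan * (z / w)**2 + z**2 / 4)

def n_per_group_delta(d_plan, z, w):
    """Equation 2.7: per-group n to estimate delta with a
    confidence interval of width w."""
    return math.ceil((d_plan**2 + 8) * (z / w)**2)

n_ex26 = n_per_group_mean_diff(0.75**2, 2.58, 1)   # Example 2.6
n_ex27 = n_per_group_delta(1.0, 1.96, 0.5)         # Example 2.7
```

These reproduce the per-group requirements of 32 and 139 computed in Examples 2.6 and 2.7.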
The researcher needs to obtain a sample of 278 participants, which will be randomly divided into two groups with 139 participants receiving the one treatment and 139 participants receiving the other treatment.

With a ratio-scale dependent variable, the sample size requirement per group to estimate μ₁/μ₂ with desired confidence and precision is approximately

  n_j = 8σ̃²(1/μ̃₁² + 1/μ̃₂²)[z_{α/2}/ln(r)]² + z²_{α/2}/4    (2.8)

where μ̃ⱼ is a planning value of μⱼ, r is the desired upper to lower confidence interval endpoint ratio, and ln(r) is the natural logarithm of r. For instance, if μ₁/μ₂ is expected to be about 1.3, the researcher might want the lower and upper confidence interval endpoints to be about 1.1 and 1.5, and r would then be set to 1.5/1.1 = 1.36.

Example 2.8. A researcher will compare two methods of encouraging parents to read to their preschool children. The number of reading minutes per week is the response variable. The researcher plans to compute a 95% confidence interval for μ₁/μ₂ and would like the upper to lower interval endpoint ratio to be about 1.5. After reviewing the literature, the researcher set σ̃² = 200, μ̃₁ = 50, and μ̃₂ = 70. The required sample size per group is approximately

  n_j = 8(200)(1/50² + 1/70²)[1.96/ln(1.5)]² + 0.96 = 23.5 ≈ 24.

The researcher needs to obtain a random sample of 48 parents of preschool children, which will be randomly divided into two groups of equal size.

Sample Size Requirement for Desired Power

The sample size requirement per group to test H0: μ₁ = μ₂ for a specified level of α and with desired power is approximately

  n_j = 2σ̃²(z_{α/2} + z_β)²/(μ̃₁ − μ̃₂)² + z²_{α/2}/4    (2.9)

where 1 − β is the desired power of the test and μ̃₁ − μ̃₂ is a planning value of the anticipated effect size. Note that Equation 2.9 only requires a planning value for the difference in population means (i.e., the effect size) and does not require a planning value for each population mean.
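Example 2.8's calculation can be checked the same way; a Python sketch of Equation 2.8:

```python
import math

def n_per_group_mean_ratio(var_plan, m1_plan, m2_plan, z, r):
    """Equation 2.8: per-group n to estimate mu1/mu2 with a desired
    upper-to-lower confidence interval endpoint ratio r."""
    a = 8 * var_plan * (1 / m1_plan**2 + 1 / m2_plan**2)
    return math.ceil(a * (z / math.log(r))**2 + z**2 / 4)

# Example 2.8 planning values
n_ex28 = n_per_group_mean_ratio(200, 50, 70, 1.96, 1.5)
```

This reproduces the per-group requirement of 24 (48 parents in total) computed in Example 2.8.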
In applications where it is difficult to specify μ̂₁ − μ̂₂ or σ̂², Equation 2.9 can be expressed in terms of a standardized mean difference planning value, as shown below.

   n_j = 2(z_{α/2} + z_β)²/δ̂² + z²_{α/2}/4     (2.10)

Example 2.9. Previous research has shown that team size is related to performance on certain tasks. A researcher wants to compare the performance of 2-person and 4-person teams on a writing task that must be completed within a time limit. The quality of the written report will be scored on a 1 to 10 scale. The researcher sets σ̂² = 5.0 and expects a 2-point difference in the population mean ratings. For α = .05 and power of 1 − β = .95, the required number of teams per group is approximately n_j = 2(5.0)(1.96 + 1.65)²/2² + 0.96 = 33.5 ≈ 34. A random sample of 204 participants is required, with 68 participants working in 34 2-person teams and 136 participants working in 34 4-person teams.

Example 2.10. A researcher wants to compare two eating disorder treatments and wants the power of the test to be .9 with α = .05. The researcher expects the standardized mean difference to be 0.5. The required number of participants per group is approximately n_j = 2(1.96 + 1.28)²/0.5² + 0.96 = 84.9 ≈ 85.

The sample size requirement per group to test H0: |μ₁ − μ₂| ≥ b for a specified level of α and with desired power is approximately

   n_j = 2σ̂²(z_α + z_β)²/(|μ̂₁ − μ̂₂| − b)² + z²_α/4     (2.11)

where z_α is a one-sided critical z-value and |μ̂₁ − μ̂₂| is the expected effect size, which must be smaller than b. Equivalence tests usually require large sample sizes.

Example 2.11. A researcher wants to show that women and men have similar population means on a newly developed test of analytical reasoning. The test is scored on a 50 to 150 scale, and the researcher believes that a 5-point difference in means would be small and unimportant.
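Equations 2.9 and 2.10 can be sketched in Python (function names are ours). Note that the text rounds z values to two decimals, so hand results can differ by one from results computed with exact quantiles, as in Example 2.10 below.

```python
import math
from statistics import NormalDist

def _z(p):
    """Standard normal quantile."""
    return NormalDist().inv_cdf(p)

def n_power_mean_diff(var, effect, alpha=0.05, power=0.95):
    """Equation 2.9: per-group n to test H0: mu1 = mu2 with desired power;
    effect is the planning value of mu1 - mu2. Returns the unrounded value."""
    za, zb = _z(1 - alpha / 2), _z(power)
    return 2 * var * (za + zb) ** 2 / effect ** 2 + za ** 2 / 4

def n_power_stdized(delta, alpha=0.05, power=0.95):
    """Equation 2.10: same requirement expressed with a standardized
    mean difference planning value delta."""
    za, zb = _z(1 - alpha / 2), _z(power)
    return 2 * (za + zb) ** 2 / delta ** 2 + za ** 2 / 4

# Example 2.9: var = 5.0, 2-point effect, alpha = .05, power = .95
print(math.ceil(n_power_mean_diff(5.0, 2)))          # 34
# Example 2.10: delta = 0.5, power = .90 -- exact z's give 85.02, which
# rounds up to 86; the text's 85 comes from using z = 1.96 and 1.28.
print(math.ceil(n_power_stdized(0.5, power=0.90)))   # 86
```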
The required sample size per group to test H0: |μ₁ − μ₂| ≥ 5 with power of .8, α = .05, an expected effect size of 1, and a standard deviation planning value of 15 is approximately n_j = 2(225)(1.65 + 0.84)²/(1 − 5)² + 0.68 = 175.1 ≈ 176.

The sample size requirement per group to test H0: π = .5 for a specified level of α and with desired power using the Mann-Whitney test is approximately

   n_j = (z_{α/2} + z_β)²/[6(π̂ − .5)²]     (2.12)

where π̂ is a planning value of π. Recall that for experimental designs, π is the proportion of people in the study population who would have a larger y score if they had received treatment 2 rather than treatment 1. In a nonexperimental design, π is the probability that a randomly selected person from the second subpopulation would have a y score that is less than that of a randomly selected person from the first subpopulation.

Unequal Sample Sizes

Using equal sample sizes has three major benefits: 1) if the population variances are approximately equal, confidence intervals are narrowest and hypothesis tests are most powerful when the sample sizes are equal; 2) when the pooled-variance confidence interval method is used, the negative effects of violating the equal-variance assumption are less severe when the sample sizes are equal; and 3) the negative effects of population skewness on the confidence intervals and tests for a difference in population means are minimized when the sample sizes are equal. However, there are situations in which equal sample sizes are less desirable. If one treatment is more expensive or risky than another, the researcher might want to use fewer participants in the more expensive or risky treatment condition. Also, in experiments that include a control group, it can be easy and inexpensive to obtain a larger sample size for the control group. The sample size formulas given above assume equal sample sizes per group. Suppose the researcher requires n₂/n₁ = r.
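The equivalence-test and Mann-Whitney requirements can be sketched as follows. Function names are ours, and the Mann-Whitney planning value π̂ = .65 is a hypothetical illustration, not from the text; as in the earlier examples, exact z quantiles can differ by one from the text's two-decimal computations.

```python
import math
from statistics import NormalDist

def _z(p):
    """Standard normal quantile."""
    return NormalDist().inv_cdf(p)

def n_equivalence(var, effect, b, alpha=0.05, power=0.80):
    """Per-group n to test H0: |mu1 - mu2| >= b (Equation 2.11);
    effect is the expected |mu1 - mu2|, which must be smaller than b."""
    za, zb = _z(1 - alpha), _z(power)      # z_alpha is one-sided
    return math.ceil(2 * var * (za + zb) ** 2 / (abs(effect) - b) ** 2
                     + za ** 2 / 4)

def n_mann_whitney(pi, alpha=0.05, power=0.80):
    """Per-group n for the Mann-Whitney test of H0: pi = .5 (Equation 2.12);
    pi is a planning value of the probability-of-superiority parameter."""
    za, zb = _z(1 - alpha / 2), _z(power)
    return math.ceil((za + zb) ** 2 / (6 * (pi - 0.5) ** 2))

# Example 2.11: var = 225, expected effect 1, b = 5, power .80.
# Exact z's give 174.6 -> 175; the text's 176 uses z = 1.65 and 0.84.
print(n_equivalence(225, 1, 5))      # 175
# Hypothetical Mann-Whitney planning value pi-hat = .65, power .80
print(n_mann_whitney(0.65))          # 59
```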
The approximate sample size requirement for group 1 to estimate μ₁ − μ₂ with desired precision is

   n₁ = 4σ̂²(1 + 1/r)(z_{α/2}/w)² + z²_{α/2}/4     (2.16)

and the required sample size for group 1 to estimate δ with desired precision is

   n₁ = 4[δ̂²(1 + 1/r)/8 + (1 + 1/r)](z_{α/2}/w)²     (2.17)

with n₂ set equal to n₁r. To test H0: μ₁ = μ₂ with desired power, the approximate sample size requirement for group 1 is

   n₁ = σ̂²(1 + 1/r)(z_{α/2} + z_β)²/(μ̂₁ − μ̂₂)² + z²_{α/2}/4     (2.18)

or equivalently, in the case where μ̂₁ − μ̂₂ or σ̂² is difficult to specify,

   n₁ = (1 + 1/r)(z_{α/2} + z_β)²/δ̂² + z²_{α/2}/4     (2.19)

with n₂ set equal to n₁r.

Example 2.12. A researcher wants to estimate μ₁ − μ₂ with 95% confidence and a desired confidence interval width of 2.5, using a variance planning value of 4.0. The researcher also wants n₂ to be 2 times greater than n₁. The sample size requirement for group 1 is approximately n₁ = 4(4.0)(1 + 1/2)(1.96/2.5)² + 0.96 = 15.7 ≈ 16, with n₂ = 2(16) = 32 participants required in group 2.

Example 2.13. A researcher wants to test H0: μ₁ = μ₂ with α = .05 and power of .95. The researcher also wants n₂ to be one-fourth the size of n₁. The researcher expects the standardized mean difference to be 0.75. The sample size requirement for group 1 is approximately n₁ = (1 + 1/0.25)(1.96 + 1.65)²/0.75² + 0.96 = 116.8 ≈ 117, with n₂ = (1/4)(117) = 29.25 ≈ 30 participants required in group 2.

Graphing Results

The sample means for each group can be presented graphically using a bar chart. A bar chart for two groups consists of two bars, one for each group, with the height of each bar representing the value of the sample mean. Bar charts of sample means can be misleading because the sample means contain sampling error of unknown magnitude and direction. There is a tendency to incorrectly interpret a difference in bar heights as representing a difference in population means.
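The unequal-sample-size formulas can be sketched in Python (function names are ours). The second call includes the z²/4 additive term from Equation 2.19, which gives a group-1 requirement of 117 rather than 116.

```python
import math
from statistics import NormalDist

def _z(p):
    """Standard normal quantile."""
    return NormalDist().inv_cdf(p)

def n1_precision_mean_diff(var, w, r, conf=0.95):
    """Group-1 n to estimate mu1 - mu2 with CI width w when n2 = r*n1
    (Equation 2.16), rounded up."""
    z = _z(1 - (1 - conf) / 2)
    return math.ceil(4 * var * (1 + 1 / r) * (z / w) ** 2 + z ** 2 / 4)

def n1_power_stdized(delta, r, alpha=0.05, power=0.95):
    """Group-1 n to test H0: mu1 = mu2 with n2 = r*n1 (Equation 2.19),
    rounded up; delta is the standardized mean difference planning value."""
    za, zb = _z(1 - alpha / 2), _z(power)
    return math.ceil((1 + 1 / r) * (za + zb) ** 2 / delta ** 2 + za ** 2 / 4)

# Example 2.12: var = 4.0, w = 2.5, r = 2
n1 = n1_precision_mean_diff(4.0, 2.5, 2)
print(n1, 2 * n1)                   # 16 32
# Example 2.13: delta = 0.75, r = 1/4
n1 = n1_power_stdized(0.75, 0.25)
print(n1, math.ceil(0.25 * n1))     # 117 30
```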
This misinterpretation can be avoided by graphically presenting the imprecision of the sample means with 95% confidence interval lines for each population mean, as shown in the graph below. Non-overlapping 95% confidence interval lines imply p < .05, but overlapping lines do not imply p > .05, which is why it is important to always report a confidence interval for μ₁ − μ₂ along with a bar chart of the individual means. A confidence interval for θ₁ − θ₂ can be supplemented with a bar chart of medians. The 95% confidence interval lines for θ_j are obtained using Equation 1.8 from Module 1. If the confidence interval lines for the individual parameters do not overlap, then the confidence interval for the difference in parameters will always exclude 0. However, if the confidence interval lines for the individual parameters overlap, the confidence interval for the difference in parameters might or might not exclude 0, depending on the amount of overlap.

Internal Validity

Recall that one of the fundamental requirements for declaring a relation between two variables to be causal is that the independent variable must be the only variable affecting the dependent variable. In other words, there must be no confounding variables. When this requirement is not satisfied, we say the internal validity of the study has been compromised. In nonexperimental designs, there will be many obvious confounding variables. For example, in a two-group study that compares two teaching methods using students in two different classrooms, with one teacher using the first method and the other teacher using the second method, a nonzero value of μ₁ − μ₂ could be attributed to a difference in student abilities in the two classrooms or a difference in teacher effectiveness. Confounding variables can also be present in experimental designs.
Consider a two-group experiment for the treatment of anxiety in which one group receives a widely used medication and the second group receives a promising new drug. Suppose a statistical analysis suggests that the new drug is more effective in reducing anxiety than the old drug. However, the researchers cannot be sure that the new drug caused the improvement in anxiety because patients who received the new drug also received extra safety precautions to monitor for possible negative side effects. These extra precautions involved more supervision and patient contact. It is possible that the additional supervision, and not the new drug, caused the improvement in the patients.

Differential attrition is another problem that threatens the internal validity of a study. Differential attrition occurs when the independent variable causes the participants in one treatment condition to withdraw from treatment with higher probability than participants in another treatment condition. With differential attrition, participants who complete the study could differ across treatment conditions in terms of some important attribute that would then be confounded with the independent variable. Consider the following example. Suppose a researcher conducts an experiment to evaluate two different methods of helping people overcome their fear of public speaking. One method requires participants to practice with an audience of size 20, and the other method requires participants to practice with an audience of size 5. Fifty participants were randomly assigned to each of these two training conditions, but ten dropped out of the first group and only one dropped out of the second group. The results showed that public speaking fear was lower under the first method (audience size of 20) of training.
However, it is possible that participants who stayed in the first group were initially less fearful than those who dropped out and that this produced the lower fear scores in the first training condition.

External Validity

External validity is the extent to which the results of a study can be generalized to different types of participants and different types of research settings. In terms of random sampling, it is usually easier to sample from a small homogeneous study population than from a larger and more heterogeneous study population. However, the external validity of the study will be greater if the researcher samples from a larger and more diverse study population. Researchers often go to great lengths to minimize variability in the research setting for participants within a treatment condition by, for instance, having the same researcher or lab assistant interact with all participants, minimizing variability in laboratory lighting and temperature, or testing participants at about the same time of day. These efforts have the desirable effect of reducing within-treatment (error) variability, which in turn produces narrower confidence intervals and greater power of statistical tests. However, these same efforts could simultaneously have the undesirable effect of reducing the external validity of the study.

Nonrandom attrition occurs when certain types of participants drop out of the study with a higher probability than other participants but drop out with about the same probability across groups. With nonrandom attrition, the participants who complete the study are no longer a random sample from the original study population. The remaining participants could be assumed to be a random sample from a smaller study population of participants who would have completed the study. This change in the size and nature of the study population decreases the external validity of the study.
In contrast to nonrandom attrition, random attrition will reduce the planned sample size, which in turn will decrease the power of a hypothesis test and increase the width of a confidence interval, but it will have no effect on the external or internal validity of the study.

Using Prior Information

Suppose a population mean for a particular response variable has been estimated in a previous study and also in a new study. The previous study used a random sample of size n₁ to estimate μ₁ from one study population, and the new study used a random sample of size n₂ to estimate μ₂ from another study population. This is a special type of two-group nonexperimental design in which the two study populations are assumed to be conceptually similar. If a confidence interval for μ₁ − μ₂ suggests that μ₁ and μ₂ are not too dissimilar, then the researcher might want to compute a confidence interval for (μ₁ + μ₂)/2. A confidence interval for (μ₁ + μ₂)/2 could be substantially narrower and will have greater external validity than the confidence interval for μ₁ or μ₂. A 100(1 − α)% confidence interval for (μ₁ + μ₂)/2 is

   (μ̂₁ + μ̂₂)/2 ± t_{α/2;df} SE_{(μ̂₁+μ̂₂)/2}     (2.19)

where

   df = (σ̂₁²/4n₁ + σ̂₂²/4n₂)² / [σ̂₁⁴/(16n₁²(n₁ − 1)) + σ̂₂⁴/(16n₂²(n₂ − 1))]

and SE_{(μ̂₁+μ̂₂)/2} = √(σ̂₁²/4n₁ + σ̂₂²/4n₂). If the sample sizes are approximately equal and the two study population variances can be assumed to be similar, then the separate-variance standard error could cautiously be replaced with a pooled-variance standard error SE_{(μ̂₁+μ̂₂)/2} = √(σ̂_p²/4n₁ + σ̂_p²/4n₂), where σ̂_p² = [(n₁ − 1)σ̂₁² + (n₂ − 1)σ̂₂²]/df and df = n₁ + n₂ − 2.

If a median has been estimated in the two studies, an approximate 100(1 − α)% confidence interval for (θ₁ + θ₂)/2 is

   (θ̂₁ + θ̂₂)/2 ± z_{α/2} √[(SE²_{θ̂₁} + SE²_{θ̂₂})/4]     (2.20)

where SE²_{θ̂_j} was defined in Equation 1.8 of Module 1.

Example 2.14.
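The averaging and precision-weighting computations in this section can be sketched in Python (function names are ours). The first function uses a normal critical value as a stand-in for the t value; with the Satterthwaite-type df in this section (about 225 for the data used below), the two are nearly identical. The demo uses the UC Davis and Cal State Fresno numbers from Example 2.14.

```python
import math
from statistics import NormalDist

def ci_average_of_means(m1, v1, n1, m2, v2, n2, conf=0.95):
    """Separate-variance CI for (mu1 + mu2)/2. The text uses a t critical
    value with Satterthwaite-type df; the normal quantile used here is a
    close approximation when both samples are large. Returns (lo, hi, df)."""
    se = math.sqrt(v1 / (4 * n1) + v2 / (4 * n2))
    df = se ** 4 / (v1 ** 2 / (16 * n1 ** 2 * (n1 - 1))
                    + v2 ** 2 / (16 * n2 ** 2 * (n2 - 1)))
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    est = (m1 + m2) / 2
    return est - z * se, est + z * se, df

def bayes_mean_update(m1, se1, m2, se2, conf=0.95):
    """Precision-weighted (Bayesian) estimate of mu2 (Equations 2.21-2.23):
    each estimate is weighted by its inverse squared standard error."""
    w1, w2 = 1 / se1 ** 2, 1 / se2 ** 2
    est = (w1 * m1 + w2 * m2) / (w1 + w2)
    se = 1 / math.sqrt(w1 + w2)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return est, se, (est - z * se, est + z * se)

# Example 2.14: UC Davis (n = 100, mean 8.14, var 1.99) and
# Cal State Fresno (n = 165, mean 7.88, var 2.41)
lo, hi, df = ci_average_of_means(8.14, 1.99, 100, 7.88, 2.41, 165)
print(round(lo, 2), round(hi, 2))    # 7.83 8.19

est, se, (blo, bhi) = bayes_mean_update(8.14, math.sqrt(1.99 / 100),
                                        7.88, math.sqrt(2.41 / 165))
print(round(est, 2), round(blo, 2), round(bhi, 2))   # 7.99 7.81 8.17
```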
A researcher at UC Davis obtained a random sample of n₁ = 100 social science seniors and asked them to rate on a 1–10 scale how concerned they were about police brutality. The sample mean (μ̂₁ = 8.14) and variance (σ̂₁² = 1.99) were reported in a published article along with other results. A researcher at Cal State Fresno replicated the UC Davis study using a sample size of n₂ = 165 and obtained μ̂₂ = 7.88 and σ̂₂² = 2.41. The 95% confidence interval for μ₁ − μ₂ was [-0.1, 0.6], which suggests that μ₁ and μ₂ are similar. The 95% confidence interval for (μ₁ + μ₂)/2 was [7.83, 8.19], which is narrower than the confidence interval for either μ₁ or μ₂ and describes the study populations of both UC Davis and Cal State Fresno.

When estimating (μ₁ + μ₂)/2, we assume that the two study population parameters μ₁ and μ₂ are equally important. An alternative Bayesian approach assumes μ₂ is the only parameter of interest, and the information from the prior study is used only to obtain a more precise estimate of μ₂. A Bayesian estimate of μ₂ that uses μ̂₁ and the standard error of μ̂₁ from a prior study is

   μ̃₂ = [(1/SE²_{μ̂₁})μ̂₁ + (1/SE²_{μ̂₂})μ̂₂] / (1/SE²_{μ̂₁} + 1/SE²_{μ̂₂})     (2.21)

with an approximate standard error of

   SE_{μ̃₂} = 1/√(1/SE²_{μ̂₁} + 1/SE²_{μ̂₂}).     (2.22)

An approximate 100(1 − α)% Bayesian confidence interval for μ₂ is

   μ̃₂ ± z_{α/2} SE_{μ̃₂}.     (2.23)

For the above example, μ̃₂ = [(100/1.99)8.14 + (165/2.41)7.88]/(100/1.99 + 165/2.41) = 7.99, SE_{μ̃₂} = 0.092, and the approximate 95% Bayesian confidence interval for μ₂ is 7.99 ± 1.96(0.092) = [7.81, 8.17]. The Bayesian confidence interval for μ₂ is attractive because it will always be narrower than the traditional confidence interval for μ₂. In this example, the traditional confidence interval for μ₂ is [7.65, 8.12]. However, unless the value of μ₁ − μ₂ is small (based on compelling theoretical arguments or as determined from
an equivalence test) or n₂ is several times larger than n₁, a 100(1 − α)% Bayesian confidence interval for μ₂ can have a coverage probability that is far less than the specified confidence level.

Ethical Issues

Any study that uses human subjects should advance knowledge and potentially lead to improvements in the quality of life, but the researcher also has an obligation to protect the rights and welfare of the participants in the study. These two goals are often in conflict and lead to ethical dilemmas. The most widely used approach to resolving ethical dilemmas is to weigh the potential benefits of the research against the costs to the participants. Evaluating the costs and benefits of a proposed research project that involves human subjects can be extremely difficult, and this task is assigned to the Institutional Review Board (IRB) at most universities. Researchers who plan to use human subjects in their research must submit a written proposal to the IRB for approval. The IRB will carefully examine research proposals in terms of the following issues:

Informed consent: Are participants informed of the nature of the study, have they explicitly agreed to participate, and are they allowed to freely decline to participate?

Coercion to participate: Were participants coerced into participating or offered excessive inducements?

Confidentiality: Will the data collected from participants be used only for research purposes and not divulged to others?

Physical and mental stress: Does the study involve more than minimal risk? Minimal risk is defined as risk that is no greater in probability or severity than that ordinarily encountered in daily life or during a routine physical or psychological exam.

Deception: Is deception needed in the study? If deception is used, are participants debriefed after the study? Debriefing is used to clarify the nature of the study to the participants and to reduce any stress or anxiety caused by the study.
In addition to principles governing the treatment of human subjects, researchers are bound by a set of ethical standards. Violation of these standards is called scientific misconduct. There are three basic types of scientific misconduct:

Scientific dishonesty: Examples include the fabrication or falsification of data and plagiarism. Plagiarism is the use of another person's ideas, processes, results, or words without giving appropriate credit.

Unethical behavior: Examples include sexual harassment of research assistants or research participants, abuse of authority, failure to follow university or government regulations, and inappropriately including or excluding authors on a research report or conference presentation.

Questionable research practices: Examples include performing an exploratory analysis of many dependent or independent variables and reporting only the variables that yield a "significant" result; reporting unexpected findings as if they had been expected; failing to assess critical assumptions of statistical tests or confidence intervals; deleting legitimate data in an effort to obtain desired results; interpreting a "significant" result as representing an important effect or reporting a "nonsignificant" result as evidence that the effect is zero; and intentionally not reporting a confidence interval result when it suggests that the effect could be trivial or unimportant.