D.G. Bonett (3/2017)

Module 3
One-factor Experiments

A between-subjects treatment factor is an independent variable with a ≥ 2 levels in which participants are randomized into a groups. It is common, but not necessary, to have an equal number of participants in each group. Each group receives one of the a levels of the independent variable, with participants treated identically in every other respect. The two-group experiment considered previously is a special case of this type of design.

In a one-factor experiment with a levels of the independent variable (also called a completely randomized design), the population parameters are μ1, μ2, …, μa, where μj (j = 1 to a) is the mean of the dependent variable if all members of the study population had received level j of the independent variable.

One way to assess the differences among the a population means is to compute confidence intervals for all possible pairs of differences. For instance, with a = 3 levels the following pairwise comparisons of population means could be examined.

μ1 − μ2     μ1 − μ3     μ2 − μ3

In a one-factor experiment with a levels there are a(a − 1)/2 pairwise comparisons. Equation 2.1 of Module 2 can be used to compute a confidence interval for each pairwise difference in population means. Equation 2.2 of Module 2 can be used to compute confidence intervals for pairwise differences in population standardized mean differences.

For any single 100(1 − α)% confidence interval, we can be 100(1 − α)% confident that the confidence interval contains the population parameter value if all assumptions have been satisfied. If v 100(1 − α)% confidence intervals are computed, it can be shown that we can be at least 100(1 − vα)% confident that all v confidence intervals have captured their population parameters. For instance, if six 95% confidence intervals are computed, we can be at least 100(1 − vα)% = 100(1 − .3)% = 70% confident that all six confidence intervals have captured their population parameter values.
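As a small Python sketch (not the exact notation of Equation 2.1 in Module 2), each pairwise interval can be computed from summary statistics with a Welch-type procedure. The helper name and the means, variances, and sample sizes below are made-up values for illustration.

```python
import math
from itertools import combinations
from scipy import stats

def welch_ci(m1, v1, n1, m2, v2, n2, alpha=0.05):
    """Welch-type CI for mu1 - mu2 from sample means, variances, and sizes."""
    se = math.sqrt(v1 / n1 + v2 / n2)
    # Satterthwaite degrees of freedom
    df = se**4 / ((v1 / n1)**2 / (n1 - 1) + (v2 / n2)**2 / (n2 - 1))
    t = stats.t.ppf(1 - alpha / 2, df)
    d = m1 - m2
    return d - t * se, d + t * se

# made-up summary data for a = 3 groups
means = [21.4, 24.0, 19.5]
variances = [26.1, 22.3, 24.9]
ns = [30, 30, 30]

for j, k in combinations(range(3), 2):   # a(a - 1)/2 = 3 pairs
    lo, hi = welch_ci(means[j], variances[j], ns[j], means[k], variances[k], ns[k])
    print(f"mu{j+1} - mu{k+1}: [{lo:.2f}, {hi:.2f}]")
```

Each interval here has per-interval 95% coverage; the Bonferroni adjustment described next would replace `alpha` with `alpha / v` to obtain simultaneous coverage.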
The researcher would like to be at least 100(1 − α)% confident, rather than at least 100(1 − vα)% confident, that all v confidence intervals have captured their population parameters. One way to achieve this is to use α* = α/v rather than α in the critical t-value (in Equation 2.1) or critical z-value (in Equation 2.2) for each confidence interval. The adjusted alpha level α/v is called a Bonferroni adjustment.

The Tukey-Kramer method yields a slightly narrower confidence interval than the Bonferroni method if all possible pairs of means are examined. The standard Tukey-Kramer method assumes equal population variances, but SAS also implements a version of the Tukey-Kramer method that does not require equal population variances. SPSS provides an option to compute Games-Howell confidence intervals for all pairwise comparisons, which are similar to the unequal-variance Tukey-Kramer confidence intervals. The Tukey-Kramer and Games-Howell methods are used only when the researcher is interested in examining every pairwise difference in population means. The Bonferroni method should be used if the researcher is interested in a subset of all possible pairwise comparisons. Multiple confidence intervals based on the Tukey-Kramer, Games-Howell, or Bonferroni methods are called simultaneous confidence intervals.

Using the three-decision rule in Module 2, simultaneous confidence intervals can be used to test multiple hypotheses and keep the familywise directional error rate (FWDER) at or below α/2. The FWDER is the probability of making at least one directional error when testing multiple null hypotheses. The Holm test is more powerful than tests based on simultaneous confidence intervals and also keeps the FWDER at or below α/2. To perform a Holm test of v null hypotheses, rank order the v p-values from smallest to largest.
If the smallest p-value is less than α/v, then reject H0 for that test and examine the next smallest p-value; otherwise, do not reject H0 for that test or any of the remaining v − 1 null hypotheses. If the second smallest p-value is less than α/(v − 1), then reject H0 for that test and examine the next smallest p-value; otherwise, do not reject H0 for that test or any of the remaining v − 2 null hypotheses. If the third smallest p-value is less than α/(v − 2), then reject H0 for that test and examine the next smallest p-value; otherwise, do not reject H0 for that test or any of the remaining v − 3 null hypotheses (and so on).

Example 3.1. There is considerable variability in measures of intellectual ability among college students. One psychologist believes that some of this variability can be explained by differences in how students expect to perform on these tests. Ninety undergraduates were randomly selected from a list of about 5,400 undergraduates. The 90 students were randomly divided into three groups of equal size, and all 90 students were given a nonverbal intelligence test (Raven's Progressive Matrices) under identical testing conditions. The raw scores for this test range from 0 to 60. The students in group 1 were told that they were taking a very difficult intelligence test. The students in group 2 were told that they were taking an interesting "puzzle". The students in group 3 were not told anything. Simultaneous Tukey-Kramer confidence intervals for all pairwise comparisons of population means are given below.

Comparison     95% Lower Limit     95% Upper Limit
μ1 − μ2             -5.4                -3.1
μ1 − μ3             -3.2                -1.4
μ2 − μ3              1.2                 3.5

The researcher is 95% confident that the mean intelligence score would be 3.1 to 5.4 greater if all 5,400 undergraduates had been told that the test was a puzzle instead of a difficult IQ test, 1.4 to 3.2 greater if they all had been told nothing instead of being told that the test is a difficult IQ test, and 1.2 to 3.5 greater if they all had been told the test was a puzzle instead of being told nothing. The simultaneous confidence intervals allow the researcher to be 95% confident regarding all three conclusions.

Linear Contrasts

Some research questions can be expressed in terms of a linear contrast of population means, Σ_{j=1}^{a} c_j μ_j, where c_j is called a contrast coefficient. For example, in an experiment that compares two costly treatments (Treatments 1 and 2) with a new inexpensive treatment (Treatment 3), a confidence interval for (μ1 + μ2)/2 − μ3 might provide valuable information regarding the relative costs and benefits of the new treatment. Statistical packages and various statistical formulas require linear contrasts to be expressed as Σ_{j=1}^{a} c_j μ_j, so the contrast coefficients must be specified. For instance, (μ1 + μ2)/2 − μ3 can be expressed as μ1/2 + μ2/2 − μ3, which can then be expressed as (½)μ1 + (½)μ2 + (-1)μ3 so that c1 = .5, c2 = .5, and c3 = -1.

Consider another example where Treatment 1 is delivered to groups 1 and 2 by experimenters A and B, and Treatment 2 is delivered to groups 3 and 4 by experimenters C and D. In this study we might want to estimate (μ1 + μ2)/2 − (μ3 + μ4)/2, which can be expressed as (½)μ1 + (½)μ2 + (-½)μ3 + (-½)μ4 so that c1 = .5, c2 = .5, c3 = -.5, and c4 = -.5.

A 100(1 − α)% unequal-variance confidence interval for Σ_{j=1}^{a} c_j μ_j is

Σ_{j=1}^{a} c_j ȳ_j ± t_{α/2;df} √(Σ_{j=1}^{a} c_j² σ̂_j²/n_j)                (3.1)

where df = [Σ_{j=1}^{a} c_j² σ̂_j²/n_j]² / [Σ_{j=1}^{a} c_j⁴ σ̂_j⁴/(n_j²(n_j − 1))]. When examining v linear contrasts, α can be replaced with α* = α/v in Equation 3.1 to give a set of Bonferroni simultaneous confidence intervals.
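A minimal Python sketch of Equation 3.1, computed from summary statistics; the helper name and the three-group data below are made-up values, with contrast coefficients c = (.5, .5, −1) as in the costly-treatments example.

```python
import math
from scipy import stats

def contrast_ci(c, means, variances, ns, alpha=0.05):
    """Unequal-variance CI for a linear contrast of means (Equation 3.1)."""
    est = sum(cj * mj for cj, mj in zip(c, means))
    var_est = sum(cj**2 * vj / nj for cj, vj, nj in zip(c, variances, ns))
    # Satterthwaite degrees of freedom from Equation 3.1
    df = var_est**2 / sum(cj**4 * vj**2 / (nj**2 * (nj - 1))
                          for cj, vj, nj in zip(c, variances, ns))
    t = stats.t.ppf(1 - alpha / 2, df)
    se = math.sqrt(var_est)
    return est - t * se, est + t * se

# made-up summary statistics for three groups
c = (0.5, 0.5, -1.0)                      # contrast: (mu1 + mu2)/2 - mu3
lo, hi = contrast_ci(c, [18.2, 20.5, 14.1], [25.0, 30.3, 22.4], [25, 30, 28])
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")
```

Passing `alpha=0.05 / v` instead of the default would give one member of a set of Bonferroni simultaneous intervals.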
If the sample sizes are approximately equal and there is convincing evidence from previous research that the population variances are similar, then the unequal-variance standard error in Equation 3.1 could be replaced with an equal-variance standard error √(σ̂_p² Σ_{j=1}^{a} c_j²/n_j), where σ̂_p² = [Σ_{j=1}^{a} (n_j − 1)σ̂_j²]/df and df = (Σ_{j=1}^{a} n_j) − a. Contrary to the recommendations of most statisticians, many researchers have been taught to always use the equal-variance method.

Standardized Linear Contrasts

In applications where the intended audience might be unfamiliar with the metric of the dependent variable, it could be helpful to report a confidence interval for a standardized linear contrast of population means, which is defined as

ψ = Σ_{j=1}^{a} c_j μ_j / √(Σ_{j=1}^{a} σ_j²/a)

and is a generalization of the standardized mean difference defined previously. The denominator of ψ is called the standardizer. Some alternative standardizers have been proposed for linear contrasts. One alternative standardizer averages variances across only those groups that have a non-zero contrast coefficient. Another standardizer uses only the variance from a control group. Although not recommended for routine use, the most popular standardizer is the square root of σ̂_p² defined above, which can be justified only when the population variances are approximately equal. An equal-variance 100(1 − α)% confidence interval for ψ is

ψ̂ ± z_{α/2} SE_ψ̂                (3.2)

where ψ̂ = Σ_{j=1}^{a} c_j ȳ_j / √(Σ_{j=1}^{a} σ̂_j²/a) and SE_ψ̂ = √[(ψ̂²/(2a²)) Σ_{j=1}^{a} 1/(n_j − 1) + Σ_{j=1}^{a} c_j²/n_j]. An unequal-variance confidence interval for ψ is available, but its standard error formula is more complicated than the one in Equation 3.2. When examining v linear contrasts, α can be replaced with α* = α/v in Equation 3.2 to give a set of Bonferroni simultaneous confidence intervals.

Example 3.2. Ninety students were randomly selected from a research participant pool and randomized into three groups.
All three groups were given the same set of boring tasks for 20 minutes. Then all students listened to an audio recording that listed, in random order, the names of 40 people who will be attending a party and the names of 20 people who will not be attending the party. The participants were told to simply write down the names of the people who will attend the party as they hear them. In group 1, the participants were asked to draw copies of complex geometric figures while they were listening to the audio recording and writing. In group 2, the participants were not told to draw anything while listening and writing. In group 3, the participants were told to draw squares while listening and writing. The number of correctly recorded attendees was obtained from each participant. The sample means and variances are given below.

          Complex Drawing    No Drawing    Simple Drawing
ȳ_j            24.9             23.1            31.6
σ̂_j²           27.2             21.8            24.8
n_j            30               30              30

The 95% confidence interval for (μ1 + μ2)/2 − μ3 is [-9.82, -5.38]. The researcher is 95% confident that the population mean number of correctly recorded attendees, averaged across the no-drawing and complex-drawing conditions, is 5.38 to 9.82 lower than the population mean number of correctly recorded attendees under the simple-drawing condition. The 95% confidence interval for ψ is [-2.03, -1.03]. The researcher is 95% confident that the population mean number of correctly recorded attendee names, averaged across the no-drawing and complex-drawing conditions, is 1.03 to 2.03 standard deviations below the population mean number of correctly recorded attendee names under the simple-drawing condition.

Hypothesis Tests for Linear Contrasts

The three-decision rule can be used to assess the following null and alternative hypotheses regarding the value of Σ_{j=1}^{a} c_j μ_j.

H0: Σ_{j=1}^{a} c_j μ_j = 0     H1: Σ_{j=1}^{a} c_j μ_j > 0     H2: Σ_{j=1}^{a} c_j μ_j < 0

A confidence interval for Σ_{j=1}^{a} c_j μ_j can be used to test the above hypotheses.
If the lower limit for Σ_{j=1}^{a} c_j μ_j is greater than 0, then reject H0 and accept H1. If the upper limit for Σ_{j=1}^{a} c_j μ_j is less than 0, then reject H0 and accept H2. The results are inconclusive if the confidence interval includes 0. Note that it is not necessary to develop special hypothesis testing rules for ψ because Σ_{j=1}^{a} c_j μ_j = 0 implies ψ = 0, Σ_{j=1}^{a} c_j μ_j > 0 implies ψ > 0, and Σ_{j=1}^{a} c_j μ_j < 0 implies ψ < 0.

In an equivalence test, the goal is to decide if Σ_{j=1}^{a} c_j μ_j is between -δ and δ or if Σ_{j=1}^{a} c_j μ_j is outside this range, where δ is a number that represents a small or unimportant value of Σ_{j=1}^{a} c_j μ_j. An equivalence test involves selecting one of the following two hypotheses.

H0: |Σ_{j=1}^{a} c_j μ_j| ≤ δ     H1: |Σ_{j=1}^{a} c_j μ_j| > δ

In applications where it is difficult to specify a small or unimportant value of Σ_{j=1}^{a} c_j μ_j, it might be easier to specify δ for a standardized linear contrast of means and choose between the following two hypotheses.

H0: |ψ| ≤ δ     H1: |ψ| > δ

One-way Analysis of Variance

The total variability in the dependent variable scores in a one-factor design can be decomposed into two sources of variability: the variance of scores within treatments (also called error variance) and the variance due to mean differences across treatments (also called between-group variance). The decomposition of variability in a one-factor study can be summarized in a one-way analysis of variance (one-way ANOVA) table, as shown below, where SS stands for sum of squares, MS stands for mean square, and n is the total sample size (n = n1 + n2 + … + na). The between-group factor (i.e., the independent variable) will be referred to as Factor A. The components of the ANOVA table for a one-factor design are shown below.
Source    SS     df            MS               F
__________________________________________________________
A         SSA    dfA = a − 1   MSA = SSA/dfA    MSA/MSE
ERROR     SSE    dfE = n − a   MSE = SSE/dfE
TOTAL     SST    dfT = n − 1
__________________________________________________________

The sum of squares (SS) formulas are

SSA = Σ_{j=1}^{a} n_j(ȳ_j − ȳ_+)²   where ȳ_+ = Σ_{j=1}^{a} Σ_{i=1}^{n_j} y_{ij} / Σ_{j=1}^{a} n_j

SSE = Σ_{j=1}^{a} Σ_{i=1}^{n_j} (y_{ij} − ȳ_j)² = Σ_{j=1}^{a} (n_j − 1)σ̂_j²

SST = Σ_{j=1}^{a} Σ_{i=1}^{n_j} (y_{ij} − ȳ_+)² = SSA + SSE.

SSA will equal zero if all sample means are equal and will be large if the sample means are highly unequal. MSE = SSE/dfE is called the mean squared error and is equal to the pooled within-group variance (σ̂_p²) that was defined previously for the equal-variance confidence interval. SST/dfT is the variance for the total set of n scores ignoring group membership.

The SS values in the ANOVA table can be used to estimate a standardized measure of effect size called eta-squared, which can be defined as η² = 1 − σ_E²/σ_T². In a nonexperimental design, σ_T² is the variance of the dependent variable for everyone in the study population and σ_E² is the variance of the dependent variable within each subpopulation of the study population (and σ_E² is assumed to be equal across all subpopulations). In an experimental design, σ_E² is the variance of the dependent variable for everyone in the study population assuming they all received a particular treatment, and σ_T² = σ_A² + σ_E², where σ_A² is the variance of the population means under the a treatment conditions. An estimate of η² is η̂² = SSA/SST or, equivalently (because SSA = SST − SSE), η̂² = 1 − SSE/SST.

The value of η² can range from 0 to 1 (because SSE has a possible range of 0 to SST) and describes the proportion of the dependent variable variance in the population that is predictable from the between-group factor. In designs with many groups, η² is a useful alternative to an examination of all possible pairwise comparisons.
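A minimal Python sketch of the one-way ANOVA decomposition, using made-up scores for three small groups:

```python
# One-way ANOVA decomposition from raw scores (made-up data for three groups)
groups = [
    [23.0, 25.5, 24.1, 26.2],
    [21.9, 22.4, 24.0, 23.3],
    [30.1, 31.7, 29.8, 32.0],
]

n = sum(len(g) for g in groups)
a = len(groups)
grand_mean = sum(sum(g) for g in groups) / n
group_means = [sum(g) / len(g) for g in groups]

ssa = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
sse = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
sst = sum((y - grand_mean) ** 2 for g in groups for y in g)

msa = ssa / (a - 1)          # MSA = SSA/dfA
mse = sse / (n - a)          # MSE = SSE/dfE (pooled within-group variance)
F = msa / mse
eta_sq = ssa / sst           # estimate of eta-squared

assert abs(sst - (ssa + sse)) < 1e-9   # SST = SSA + SSE
print(f"F({a - 1}, {n - a}) = {F:.2f}, eta-squared estimate = {eta_sq:.3f}")
```

The assertion checks the identity SST = SSA + SSE from the formulas above.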
The estimate of η² contains sampling error of unknown magnitude and direction, and therefore a confidence interval for η² should be reported along with η̂². In applications where the goal of the study is to show that all a population means have similar values, a small upper confidence interval limit for η² would provide the necessary evidence to make such a claim. The confidence interval for η² is complicated but can be obtained in SAS or R.

Example 3.3. Sixty undergraduates were randomly selected from a study population of 4,350 college students and then classified into three groups according to their political affiliation (Democrat, Republican, Independent). A stereotyping questionnaire was given to all 60 participants. A one-way ANOVA detected differences in the three population means (F(2, 57) = 5.02, p = .010, η̂² = .15, 95% CI = [.01, .30]). The researcher can be 95% confident that 1% to 30% of the variance in the stereotyping scores of the 4,350 students can be predicted from knowledge of their political affiliation.

The F statistic from the ANOVA table is used to test the null hypothesis H0: μ1 = μ2 = … = μa against an alternative hypothesis that at least one pair of population means is not equal. The null and alternative hypotheses also can be expressed as H0: η² = 0 and H1: η² > 0. Statistical packages will compute the p-value for the F statistic, which can be used to decide if H0 can be rejected. It is common practice to declare the ANOVA result to be "significant" when the p-value is less than .05, but it is important to remember that a significant result simply indicates a rejection of H0. The rejection of H0: η² = 0 is not a scientifically important finding because H0: η² = 0 is known to be false in almost every study. Furthermore, a "nonsignificant" result should not be interpreted as evidence that H0: η² = 0 is true.
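Because η̂² = SSA/SST and 1 − η̂² = SSE/SST, the F statistic can be recovered from η̂² and the degrees of freedom as F = (η̂²/dfA)/((1 − η̂²)/dfE). As a quick sketch, the values reported in Example 3.3 are mutually consistent up to rounding of η̂² (the helper name below is made up):

```python
# F = (eta2 / dfA) / ((1 - eta2) / dfE), since eta2 = SSA/SST and 1 - eta2 = SSE/SST
def f_from_eta_squared(eta2, df_a, df_e):
    return (eta2 / df_a) / ((1 - eta2) / df_e)

# Example 3.3 reported F(2, 57) = 5.02 with an eta-squared estimate of .15
print(round(f_from_eta_squared(0.15, 2, 57), 2))  # prints 5.03, matching 5.02 up to rounding
```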
Some researchers will conduct a preliminary test of H0: η² = 0 and only if the result is "significant" will they proceed with tests or confidence intervals for pairwise comparisons or linear contrasts. However, this preliminary test approach is not required or recommended when using simultaneous confidence intervals or tests that control the FWDER. The three-decision rule and the equivalence test do not have the same weakness as the test of H0: η² = 0 because the three-decision rule and equivalence test provide useful information about the direction or magnitude of an effect. In comparison, rejecting H0: η² = 0 in a one-factor design does not reveal anything about how the population means are ordered or about the magnitudes of the population mean differences. In studies where the test of H0: η² = 0 is "significant", a common mistake is to interpret the order and magnitudes of the population mean differences on the basis of the order and magnitudes of the sample mean differences.

Two-Factor Experiments

Human behavior is complex and is influenced in many different ways. In a one-factor experiment, the researcher is able to assess the causal effect of only one independent variable on the dependent variable. The effect of two independent variables on the dependent variable can be assessed in a two-factor experiment. The two factors will be referred to as Factor A and Factor B. The simplest type of two-factor experiment has two levels of Factor A and two levels of Factor B. We call this a 2 × 2 factorial experiment. If Factor A had 4 levels and Factor B had 3 levels, it would be called a 4 × 3 factorial experiment. In general, an a × b factorial experiment has a levels of Factor A and b levels of Factor B.

There are two types of two-factor between-subjects experiments. In one case, both factors are between-subjects treatment factors and participants are randomly assigned to the combinations of treatment conditions.
In the other case, one factor is a treatment factor and the other is a classification factor. A classification factor is a factor with levels to which participants are classified according to some existing characteristic such as sex, ethnicity, or political affiliation. In a two-factor experiment with one treatment factor, participants are randomly assigned to the treatment conditions within each level of the classification factor. A study could have two classification factors, but then it would be a nonexperimental design.

Example 3.4. An experiment with two treatment factors takes randomly sampled Coast Guard personnel and randomizes them to one of four treatment conditions: 24 hours of sleep deprivation and 15 hours without food; 36 hours of sleep deprivation and 15 hours without food; 24 hours of sleep deprivation and 30 hours without food; and 36 hours of sleep deprivation and 30 hours without food. One treatment factor is hours of sleep deprivation (24 or 36 hours) and the other treatment factor is hours of food deprivation (15 or 30 hours). The dependent variable is the score on a complex problem-solving task.

Example 3.5. An experiment with one classification factor and one treatment factor uses a random sample of men and a random sample of women from a volunteer list of students taking introductory chemistry. The samples of men and women are each randomized into two groups, with one group receiving 4 hours of chemistry review and the other group receiving 6 hours of chemistry review. The treatment factor is the amount of review (4 or 6 hours) and the classification factor is gender. The dependent variable is the score on the final comprehensive exam.

One advantage of a two-factor experiment is that the effects of both Factor A and Factor B can be assessed in a single study. Questions about the effects of Factor A and Factor B could be answered using two separate one-factor experiments.
However, two one-factor experiments would require at least twice the total number of participants to obtain confidence intervals with the same precision, or hypothesis tests with the same power, that could be obtained from a single two-factor experiment. Thus, a single two-factor experiment is more economical than two one-factor experiments. A two-factor experiment also can provide information that cannot be obtained from two one-factor experiments. Specifically, a two-factor experiment can provide unique information about the interaction effect between Factor A and Factor B. An interaction effect occurs when the effect of Factor A is not the same across the levels of Factor B (which is equivalent to saying that the effect of Factor B is not the same across the levels of Factor A).

The inclusion of a second factor can improve the external validity of an experiment. For instance, if there is a concern that participants might perform a particular task differently in the morning than in the afternoon, then time of day (e.g., morning vs. afternoon) could serve as a second 2-level factor in the experiment. If the interaction effect between Factor A and the time-of-day factor (Factor B) is small, then the effect of Factor A would generalize to both morning and afternoon testing conditions, thus increasing the external validity of the results for Factor A.

The external validity of an experiment also can be improved by including a classification factor. In stratified random sampling, random samples are taken from two or more different study populations that differ geographically or in other demographic characteristics. If the interaction between the classification factor and the treatment factor is small, then the effect of the treatment factor can be generalized to the multiple study populations, thereby increasing the external validity of the results for the treatment factor.
The inclusion of a classification factor also can reduce error variance (MSE), which will in turn increase the power of statistical tests and reduce the widths of confidence intervals. For instance, in a one-factor experiment with male and female subjects, if women tend to score higher than men, then this will increase the error variance (the variance of scores within treatments). If gender is added as a classification factor, the error variance will then be determined by the variability of scores within each treatment and within each gender, which will result in a smaller MSE.

Consider the special case of a 2 × 2 design. The population means for a 2 × 2 design are shown below.

                     Factor B
                  b1        b2
Factor A    a1    μ11       μ12
            a2    μ21       μ22

The main effects of Factor A and Factor B and the AB interaction effect are given below.

A:  (μ11 + μ12)/2 − (μ21 + μ22)/2
B:  (μ11 + μ21)/2 − (μ12 + μ22)/2
AB: (μ11 − μ12) − (μ21 − μ22) = (μ11 − μ21) − (μ12 − μ22) = μ11 − μ21 − μ12 + μ22

The simple main effects of A and B are given below.

A at b1: μ11 − μ21        B at a1: μ11 − μ12
A at b2: μ12 − μ22        B at a2: μ21 − μ22

The interaction effect can be expressed as a difference in simple main effects, specifically (μ11 − μ12) − (μ21 − μ22) = (B at a1) − (B at a2), or equivalently, (μ11 − μ21) − (μ12 − μ22) = (A at b1) − (A at b2). The main effects can be expressed as averages of simple main effects. The main effect of A is (A at b1 + A at b2)/2 = (μ11 − μ21 + μ12 − μ22)/2 = (μ11 + μ12)/2 − (μ21 + μ22)/2. The main effect of B is (B at a1 + B at a2)/2 = (μ11 − μ12 + μ21 − μ22)/2 = (μ11 + μ21)/2 − (μ12 + μ22)/2.

All of the above effects are special cases of a linear contrast of means, and confidence intervals for these effects can be obtained using Equation 3.1. The main effect of A (which is the average of A at b1 and A at b2) could be misleading because A at b1 and A at b2 will be highly dissimilar if the AB interaction is large.
Likewise, the main effect of B (which is the average of B at a1 and B at a2) could be misleading if the AB interaction is large because B at a1 and B at a2 will be highly dissimilar. If the AB interaction effect is large, then an analysis of simple main effects will be more meaningful than an analysis of main effects. If the AB interaction is small, then an analysis of the main effects of Factor A and Factor B will not be misleading and an analysis of simple main effects will be unnecessary.

Pairwise Comparisons in Two-factor Designs

In experiments where Factor A or Factor B has more than two levels, various pairwise comparisons can be made. Consider a 2 × 3 design where the main effects of Factor B are of interest. The population means are given below.

                     Factor B
                  b1        b2        b3
Factor A    a1    μ11       μ12       μ13
            a2    μ21       μ22       μ23

The following three pairwise main effects can be defined for Factor B

B12: (μ11 + μ21)/2 − (μ12 + μ22)/2
B13: (μ11 + μ21)/2 − (μ13 + μ23)/2
B23: (μ12 + μ22)/2 − (μ13 + μ23)/2

where the subscripts of B represent the levels of the factor being compared.

If one or both factors have more than two levels, then more than one interaction effect can be examined. An interaction effect can be defined for any two levels of Factor A and any two levels of Factor B. For instance, in the 2 × 3 design described above, the following three pairwise interaction effects can be defined

A12B12: μ11 − μ12 − μ21 + μ22
A12B13: μ11 − μ13 − μ21 + μ23
A12B23: μ12 − μ13 − μ22 + μ23

where the subscripts of AB represent the levels of Factor A and Factor B being compared. The number of pairwise interaction effects can be overwhelming in larger designs. For instance, in a 4 × 3 design, there are six pairs of Factor A levels and three pairs of Factor B levels, from which 6 × 3 = 18 pairwise interaction effects could be examined. Pairwise interaction effects are typically examined in designs where the number of levels of each factor is small.
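These pairwise effects are all linear contrasts of the six cell means. As a sketch, assuming the cell means are flattened row by row (a1b1, a1b2, a1b3, a2b1, a2b2, a2b3), they can be computed as follows; the cell-mean values below are made up.

```python
# Cell means of a 2 x 3 design, flattened row by row (made-up values)
mu = [10.0, 12.0, 15.0,    # a1b1, a1b2, a1b3
      11.0, 14.0, 16.0]    # a2b1, a2b2, a2b3

def contrast(c, means):
    """Value of the linear contrast sum(c_j * mu_j)."""
    return sum(cj * mj for cj, mj in zip(c, means))

# Pairwise main effect B12: (mu11 + mu21)/2 - (mu12 + mu22)/2
b12 = contrast([0.5, -0.5, 0.0, 0.5, -0.5, 0.0], mu)

# Pairwise interaction A12B12: mu11 - mu12 - mu21 + mu22
a12b12 = contrast([1.0, -1.0, 0.0, -1.0, 1.0, 0.0], mu)

# Pairwise simple main effect B12 at a1: mu11 - mu12
b12_at_a1 = contrast([1.0, -1.0, 0.0, 0.0, 0.0, 0.0], mu)

print(b12, a12b12, b12_at_a1)  # prints -2.5 1.0 -2.0
```

The coefficient vectors are exactly those described in the text that follows, so the same intervals from Equation 3.1 apply to all of these effects.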
If an AB interaction has been detected, then the simple main effects of Factor A or the simple main effects of Factor B provide useful information. Suppose the simple main effects of a factor with more than two levels are to be examined. In this situation, pairwise simple main effects can be examined. In the 2 × 3 design described above, Factor B has three levels and the pairwise simple main effects of Factor B are

B12 at a1: μ11 − μ12        B12 at a2: μ21 − μ22
B13 at a1: μ11 − μ13        B13 at a2: μ21 − μ23
B23 at a1: μ12 − μ13        B23 at a2: μ22 − μ23

Note that all of the above pairwise comparisons are linear contrasts of the population means and can be expressed as Σ_{j=1}^{ab} c_j μ_j, where ab is the total number of groups. For instance, the contrast coefficients that define the B12 pairwise main effect of Factor B (assuming the means in the 2 × 3 table are ordered left to right and then top to bottom) are ½, -½, 0, ½, -½, 0; the contrast coefficients that define the A12B12 pairwise interaction effect are 1, -1, 0, -1, 1, 0; and the contrast coefficients that define the pairwise simple main effect B12 at a1 are 1, -1, 0, 0, 0, 0.

Two-Way Analysis of Variance

Now consider a general a × b factorial design. The total variability of the quantitative dependent variable scores in a two-factor design can be decomposed into four sources of variability: the variance due to differences in means across the levels of Factor A, the variance due to differences in means across the levels of Factor B, the variance due to differences in the simple main effects of one factor across the levels of the other factor (the AB interaction), and the variance of scores within treatments (the error variance). The decomposition of the total variance in a two-factor design can be summarized in the following two-way analysis of variance (two-way ANOVA) table, where n is the total sample size.
Source    SS      df                      MS                  F
_________________________________________________________________________
A         SSA     dfA = a − 1             MSA = SSA/dfA       MSA/MSE
B         SSB     dfB = b − 1             MSB = SSB/dfB       MSB/MSE
AB        SSAB    dfAB = (a − 1)(b − 1)   MSAB = SSAB/dfAB    MSAB/MSE
ERROR     SSE     dfE = n − ab            MSE = SSE/dfE
TOTAL     SST     dfT = n − 1
_________________________________________________________________________

The TOTAL and ERROR sum of squares (SS) formulas in a two-way ANOVA, shown below, are conceptually similar to the one-way ANOVA formulas

SST = Σ_{j=1}^{a} Σ_{k=1}^{b} Σ_{i=1}^{n_jk} (y_{ijk} − ȳ_{++})²

SSE = Σ_{j=1}^{a} Σ_{k=1}^{b} Σ_{i=1}^{n_jk} (y_{ijk} − ȳ_{jk})²

where ȳ_{++} = Σ_{j=1}^{a} Σ_{k=1}^{b} Σ_{i=1}^{n_jk} y_{ijk} / Σ_{j=1}^{a} Σ_{k=1}^{b} n_{jk}. The formulas for SSA, SSB, and SSAB are complicated unless the sample sizes are equal. If all cell sample sizes are equal to n₀, the formulas for SSA, SSB, and SSAB are

SSA = bn₀ Σ_{j=1}^{a} (ȳ_{j+} − ȳ_{++})²   where ȳ_{j+} = Σ_{k=1}^{b} Σ_{i=1}^{n₀} y_{ijk}/bn₀

SSB = an₀ Σ_{k=1}^{b} (ȳ_{+k} − ȳ_{++})²   where ȳ_{+k} = Σ_{j=1}^{a} Σ_{i=1}^{n₀} y_{ijk}/an₀

SSAB = SST − SSE − SSA − SSB.

Partial eta-squared estimates are computed from the sum of squares estimates, as shown below.

η̂_A² = SSA/(SST − SSB − SSAB) = SSA/(SSA + SSE)
η̂_B² = SSB/(SST − SSA − SSAB) = SSB/(SSB + SSE)
η̂_AB² = SSAB/(SST − SSB − SSA) = SSAB/(SSAB + SSE)

These measures are called "partial" effect sizes because variability in the dependent variable due to the effects of other factors is removed. For example, SSB and SSAB are subtracted from SST to obtain η̂_A². The method of computing a confidence interval for a population partial eta-squared parameter is complicated, but such an interval can be obtained in SAS or R. In designs where a factor has many levels, a partial eta-squared estimate is a simple alternative to reporting all possible pairwise comparisons among the factor levels.

The F statistics for the main effect of Factor A, the main effect of Factor B, and the AB interaction effect test the null hypotheses H0: η_A² = 0, H0: η_B² = 0, and H0: η_AB² = 0, respectively.
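A minimal sketch of the balanced two-way decomposition and the partial eta-squared estimates, using made-up scores for a 2 × 2 design with n₀ = 3 per cell:

```python
# Balanced two-way ANOVA decomposition (made-up scores, a = 2, b = 2, n0 = 3 per cell)
cells = {
    (0, 0): [10.0, 11.0, 12.0],
    (0, 1): [14.0, 15.0, 13.0],
    (1, 0): [11.0, 12.0, 13.0],
    (1, 1): [20.0, 19.0, 21.0],
}
a, b, n0 = 2, 2, 3
n = a * b * n0

grand = sum(sum(ys) for ys in cells.values()) / n
row_mean = [sum(sum(cells[(j, k)]) for k in range(b)) / (b * n0) for j in range(a)]
col_mean = [sum(sum(cells[(j, k)]) for j in range(a)) / (a * n0) for k in range(b)]
cell_mean = {jk: sum(ys) / n0 for jk, ys in cells.items()}

ssa = b * n0 * sum((m - grand) ** 2 for m in row_mean)    # SSA
ssb = a * n0 * sum((m - grand) ** 2 for m in col_mean)    # SSB
sse = sum((y - cell_mean[jk]) ** 2 for jk, ys in cells.items() for y in ys)
sst = sum((y - grand) ** 2 for ys in cells.values() for y in ys)
ssab = sst - sse - ssa - ssb                              # SSAB by subtraction

partial_eta_a = ssa / (ssa + sse)      # SSA/(SSA + SSE)
partial_eta_ab = ssab / (ssab + sse)   # SSAB/(SSAB + SSE)
print(f"SSA={ssa:.2f} SSB={ssb:.2f} SSAB={ssab:.2f} SSE={sse:.2f}")
```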
Tests of these null hypotheses suffer from the same problem as the test of the null hypothesis in a one-way ANOVA, in that a "significant" result does not imply a scientifically important result and a "nonsignificant" result does not imply that the effect is zero. The new APA guidelines suggest that the F statistics and p-values for each effect be supplemented with confidence intervals for population eta-squared parameters, unstandardized linear contrasts of population means, or standardized linear contrasts of population means.

Although a "nonsignificant" (i.e., inconclusive) test for the AB interaction effect does not imply that the population interaction effect is zero, it is customary to examine main effects rather than simple main effects if the AB interaction test is inconclusive. If the test for the AB interaction effect is "significant", it is customary to analyze only simple main effects or pairwise simple main effects. However, a main effect could be interesting, even if the AB interaction effect is "significant", if the partial eta-squared estimate for the main effect is substantially larger than η̂_AB².

Three-factor Experiments

The effects of three independent variables on the dependent variable can be assessed in a three-factor design. The three factors will be referred to as Factor A, Factor B, and Factor C. Like a two-factor design, a three-factor design provides information about main effects and two-way interaction effects. Specifically, the main effects of Factors A, B, and C can be estimated, as well as the AB, AC, and BC two-way interactions. These main effects and two-way interaction effects could be estimated from three separate two-factor studies. A three-factor study has the advantage of providing all this information in a single study and also provides information about a three-way interaction (ABC) that could not be obtained from separate two-factor studies.
The factors in a three-factor design can be treatment factors or classification factors. If all factors are classification factors, then the study would be a nonexperimental design. The simplest type of three-factor study has two levels of each factor and is called a 2 × 2 × 2 factorial design. In general, a × b × c factorial designs have a levels of Factor A, b levels of Factor B, and c levels of Factor C. A table of population means is shown below for the 2 × 2 × 2 factorial design.

                        Factor C
                 c1                  c2
              Factor B            Factor B
              b1      b2          b1      b2
Factor A
   a1        μ111    μ121        μ112    μ122
   a2        μ211    μ221        μ212    μ222

The main effects of Factors A, B, and C are defined as

A: (μ111 + μ121 + μ112 + μ122)/4 − (μ211 + μ221 + μ212 + μ222)/4

B: (μ111 + μ211 + μ112 + μ212)/4 − (μ121 + μ221 + μ122 + μ222)/4

C: (μ111 + μ211 + μ121 + μ221)/4 − (μ112 + μ212 + μ122 + μ222)/4,

the three two-way interaction effects are defined as

AB: (μ111 + μ112)/2 − (μ121 + μ122)/2 − (μ211 + μ212)/2 + (μ221 + μ222)/2

AC: (μ111 + μ121)/2 − (μ112 + μ122)/2 − (μ211 + μ221)/2 + (μ212 + μ222)/2

BC: (μ111 + μ211)/2 − (μ112 + μ212)/2 − (μ121 + μ221)/2 + (μ122 + μ222)/2,

and the three-way interaction effect is defined as

ABC: μ111 − μ121 − μ211 + μ221 − μ112 + μ122 + μ212 − μ222.

The simple main effects of Factors A, B, and C are given below.
A at b1: (μ111 + μ112)/2 − (μ211 + μ212)/2
A at b2: (μ121 + μ122)/2 − (μ221 + μ222)/2
A at c1: (μ111 + μ121)/2 − (μ211 + μ221)/2
A at c2: (μ112 + μ122)/2 − (μ212 + μ222)/2
B at a1: (μ111 + μ112)/2 − (μ121 + μ122)/2
B at a2: (μ211 + μ212)/2 − (μ221 + μ222)/2
B at c1: (μ111 + μ211)/2 − (μ121 + μ221)/2
B at c2: (μ112 + μ212)/2 − (μ122 + μ222)/2
C at a1: (μ111 + μ121)/2 − (μ112 + μ122)/2
C at a2: (μ211 + μ221)/2 − (μ212 + μ222)/2
C at b1: (μ111 + μ211)/2 − (μ112 + μ212)/2
C at b2: (μ121 + μ221)/2 − (μ122 + μ222)/2

The simple-simple main effects of Factors A, B, and C are defined as

A at b1c1: μ111 − μ211     B at a1c1: μ111 − μ121     C at a1b1: μ111 − μ112
A at b1c2: μ112 − μ212     B at a1c2: μ112 − μ122     C at a1b2: μ121 − μ122
A at b2c1: μ121 − μ221     B at a2c1: μ211 − μ221     C at a2b1: μ211 − μ212
A at b2c2: μ122 − μ222     B at a2c2: μ212 − μ222     C at a2b2: μ221 − μ222,

and the simple two-way interaction effects are defined as

AB at c1: μ111 − μ121 − μ211 + μ221
AB at c2: μ112 − μ122 − μ212 + μ222
AC at b1: μ111 − μ211 − μ112 + μ212
AC at b2: μ121 − μ221 − μ122 + μ222
BC at a1: μ111 − μ121 − μ112 + μ122
BC at a2: μ211 − μ221 − μ212 + μ222.

The ABC interaction in a 2 × 2 × 2 design can be conceptualized as a difference in simple two-way interaction effects. Specifically, the ABC interaction is the difference between AB at c1 and AB at c2, the difference between AC at b1 and AC at b2, or the difference between BC at a1 and BC at a2. Although the meaning of a three-way interaction is not easy to grasp, its meaning becomes clearer when it is viewed as the difference in simple two-way interaction effects, with each simple two-way interaction viewed as a difference in simple-simple main effects. (Note that c1 and c2 are used in this section to represent levels of Factor C and should not be confused with the previous use of c_j to represent contrast coefficients.)
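The effect definitions above can be checked numerically. The sketch below (a minimal illustration with a hypothetical table of means, not the author's code) builds an additive 2 × 2 × 2 table, perturbs one cell, and verifies that the ABC interaction equals the difference between the simple AB interactions at c1 and c2.

```python
# Effects in a 2 x 2 x 2 design, computed directly from their definitions.
# Hypothetical means: an additive base (mu_jkl = 10j + 4k + l) with one cell
# perturbed so that the three-way interaction is nonzero.
mu = {(j, k, l): 10 * j + 4 * k + l
      for j in (1, 2) for k in (1, 2) for l in (1, 2)}
mu[2, 2, 2] += 8

# Main effect of Factor A: average of the a1 cells minus average of the a2 cells.
A_main = (sum(mu[1, k, l] for k in (1, 2) for l in (1, 2))
        - sum(mu[2, k, l] for k in (1, 2) for l in (1, 2))) / 4

# Simple two-way AB interaction at each level of Factor C.
AB_at_c1 = mu[1, 1, 1] - mu[1, 2, 1] - mu[2, 1, 1] + mu[2, 2, 1]
AB_at_c2 = mu[1, 1, 2] - mu[1, 2, 2] - mu[2, 1, 2] + mu[2, 2, 2]

# Three-way interaction from its definition.
ABC = (mu[1, 1, 1] - mu[1, 2, 1] - mu[2, 1, 1] + mu[2, 2, 1]
     - mu[1, 1, 2] + mu[1, 2, 2] + mu[2, 1, 2] - mu[2, 2, 2])
```

Because the base table is additive, every nonzero interaction here traces back to the single perturbed cell, and ABC comes out equal to (AB at c1) − (AB at c2), as the text states.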
The two-way interaction effects in a three-factor design are conceptually the same as in a two-factor design. Two-way interactions in a three-factor design are defined by collapsing the three-dimensional table of population means to create a two-dimensional table of means with cell means that have been averaged over the collapsed dimension. For instance, a table of averaged population means after collapsing Factor C gives the following 2 × 2 table from which the AB interaction can be defined in terms of the averaged population means.

                     Factor B
              b1                  b2
Factor A
   a1    (μ111 + μ112)/2    (μ121 + μ122)/2
   a2    (μ211 + μ212)/2    (μ221 + μ222)/2

Three-Way Analysis of Variance

The total variance of the dependent variable scores in a three-factor design can be decomposed into eight sources of variability — three main effects, three two-way interactions, one three-way interaction, and the within-group error variance. The decomposition of the total variance in a three-factor design can be summarized in the following three-way analysis of variance (three-way ANOVA) table where n is the total sample size.

Source      SS        df                     MS                       F
________________________________________________________________________
A           SSA       dfA = a − 1            MSA = SSA/dfA            MSA/MSE
B           SSB       dfB = b − 1            MSB = SSB/dfB            MSB/MSE
C           SSC       dfC = c − 1            MSC = SSC/dfC            MSC/MSE
AB          SSAB      dfAB = dfA·dfB         MSAB = SSAB/dfAB         MSAB/MSE
AC          SSAC      dfAC = dfA·dfC         MSAC = SSAC/dfAC         MSAC/MSE
BC          SSBC      dfBC = dfB·dfC         MSBC = SSBC/dfBC         MSBC/MSE
ABC         SSABC     dfABC = dfA·dfB·dfC    MSABC = SSABC/dfABC      MSABC/MSE
ERROR       SSE       dfE = n − abc          MSE = SSE/dfE
TOTAL       SST       dfT = n − 1
________________________________________________________________________

The SS formulas for a three-way ANOVA are conceptually similar to those for the two-way ANOVA and will not be presented. Partial eta-squared estimates are computed from the SS estimates in a three-way ANOVA in the same way they are computed in a two-way ANOVA.
For example, η̂²_A = SSA/(SSA + SSE) and η̂²_ABC = SSABC/(SSABC + SSE). The hypothesis tests in the three-way ANOVA suffer from the same problem as the hypothesis tests in the one-way and two-way ANOVA. These tests should be supplemented with confidence intervals for population eta-squared values, linear contrasts of population means, or standardized linear contrasts of population means to provide information regarding the magnitude of each effect.

If an ABC interaction has been detected in a three-way ANOVA, simple two-way interactions or simple-simple main effects should be examined. A two-way interaction could be examined even if an ABC interaction is "significant" if the partial eta-squared estimate for the two-way interaction is substantially larger than the partial eta-squared estimate for the ABC interaction. If the test for an ABC interaction is inconclusive, the AB, AC, and BC interactions should be examined. Using Factor A as an example, if AB and AC interactions are detected, then simple-simple main effects of A should be examined because Factor A interacts with both Factor B and Factor C. If an AB interaction is detected, but the test for the AC interaction is inconclusive, then the simple main effects of A should be examined at each level of Factor B. Similarly, if an AC interaction is detected, but the test of an AB interaction is inconclusive, then the simple main effects of A should be examined at each level of Factor C. Even if AB and AC interactions have been detected, the main effect of A could be examined if the partial eta-squared estimate for the main effect of A is substantially larger than the partial eta-squared estimates for the AB and AC interactions. If the tests for the ABC, AB, AC, and BC interactions are all inconclusive, then all three main effects should be examined.
An analysis of main effects can be justified even if interactions are "significant" if the partial eta-squared estimates for the main effects are substantially larger than the partial eta-squared estimates for the interaction effects.

Assumptions

In addition to the random sampling and independence assumptions, the ANOVA tests, the equal-variance Tukey-Kramer confidence intervals for pairwise comparisons, the equal-variance confidence interval for Σ_{j=1}^a c_j μ_j, the equal-variance confidence interval for δ, and the confidence interval for η² all assume equality of population variances across treatment conditions and normality of the dependent variable in the study population under any given treatment condition. The effects of violating these assumptions are identical to those for the equal-variance confidence interval for μ1 − μ2 described in Module 2.

The Games-Howell and unequal-variance Tukey-Kramer methods for pairwise comparisons, and the unequal-variance confidence intervals for Σ_{j=1}^a c_j μ_j and ψ, relax the equal population variance assumption and are preferred to the equal-variance methods unless the sample sizes are approximately equal and there is compelling prior information to suggest that the population variances are similar across treatment conditions. The Welch test is an alternative to the traditional one-way ANOVA test that relaxes the equal variance assumption and can be obtained in SAS, SPSS, and R.

The adverse effects of violating the normality assumption on the ANOVA tests and the confidence intervals for Σ_{j=1}^a c_j μ_j are usually not serious unless the dependent variable is highly skewed and the sample size per group is small (n_j < 20). However, leptokurtosis of the dependent variable is detrimental to the performance of the confidence intervals for η² and ψ. Furthermore, the adverse effects of leptokurtosis on these confidence intervals are not diminished in large sample sizes.
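The Welch test mentioned above can be sketched as follows. This is a from-scratch implementation of the standard Welch one-way statistic with invented data, not the SAS, SPSS, or R routine; a p-value would be obtained by referring the statistic to an F distribution with (df1, df2) degrees of freedom.

```python
# Welch's one-way ANOVA statistic, which does not assume equal population variances.
def welch_anova(groups):
    a = len(groups)
    n = [len(g) for g in groups]
    m = [sum(g) / len(g) for g in groups]                       # group means
    s2 = [sum((y - mj) ** 2 for y in g) / (len(g) - 1)          # group variances
          for g, mj in zip(groups, m)]
    w = [nj / s2j for nj, s2j in zip(n, s2)]                    # precision weights
    sw = sum(w)
    mw = sum(wj * mj for wj, mj in zip(w, m)) / sw              # weighted grand mean
    A = sum(wj * (mj - mw) ** 2 for wj, mj in zip(w, m)) / (a - 1)
    t = sum((1 - wj / sw) ** 2 / (nj - 1) for wj, nj in zip(w, n))
    B = 1 + 2 * (a - 2) * t / (a ** 2 - 1)
    F = A / B
    df1 = a - 1
    df2 = (a ** 2 - 1) / (3 * t)
    return F, df1, df2

# Hypothetical data: three groups with unequal variances.
F, df1, df2 = welch_anova([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [5.0, 7.0, 9.0]])
```

When every group mean is identical the numerator term A is zero, so the statistic is zero, as it should be.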
Data transformations are sometimes helpful in reducing leptokurtosis in distributions that are also skewed. To assess the degree of non-normality in a design with a ≥ 2 groups, subtract μ̂_j from all of the group j scores and then estimate the skewness and kurtosis coefficients from these n1 + n2 + ⋯ + na deviation scores. If the deviation scores are skewed, it might be possible to reduce the skewness by transforming (e.g., log, square-root, reciprocal) the dependent variable scores.

Distribution-free Methods

If the response variable is skewed, a confidence interval for a linear contrast of population medians might be more appropriate and meaningful than a confidence interval for a linear contrast of population means. An approximate 100(1 − α)% confidence interval for Σ_{j=1}^a c_j η_j is

Σ_{j=1}^a c_j η̂_j ± z_α/2 √(Σ_{j=1}^a c_j² SE²_η̂_j)    (3.7)

where SE²_η̂_j was defined in Equation 1.8. This confidence interval only assumes random sampling and independence among participants. Equation 3.7 can be used to test H0: Σ_{j=1}^a c_j η_j = 0 and decide if Σ_{j=1}^a c_j η_j > 0 or Σ_{j=1}^a c_j η_j < 0. Equation 3.7 also can be used to test H0: |Σ_{j=1}^a c_j η_j| ≤ h against H1: |Σ_{j=1}^a c_j η_j| > h.

The Kruskal-Wallis test is a distribution-free test of the null hypothesis that the dependent variable distribution is identical (same location, variance, and shape) in all a treatment conditions (or all a subpopulations in a nonexperimental design). A rejection of the null hypothesis implies differences in the location, variance, or shape of the dependent variable distribution in at least two of the treatment conditions or subpopulations. The Kruskal-Wallis test is used as a distribution-free alternative to the one-way ANOVA and suffers from the same problem as the one-way ANOVA because the null hypothesis is known to be false in virtually every study.
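Equation 3.7 involves only simple arithmetic once the per-group median standard errors from Equation 1.8 are in hand. The sketch below (not the author's code) assumes those standard errors have already been computed and are supplied as inputs; all numbers are hypothetical.

```python
import math

# Approximate 100(1 - alpha)% CI for a linear contrast of medians (Equation 3.7).
# medians: sample medians; se: their standard errors (Equation 1.8, assumed
# precomputed); c: contrast coefficients; z: critical value (1.96 for 95%).
def median_contrast_ci(medians, se, c, z=1.96):
    est = sum(cj * mj for cj, mj in zip(c, medians))
    margin = z * math.sqrt(sum(cj ** 2 * sej ** 2 for cj, sej in zip(c, se)))
    return est - margin, est + margin

# Hypothetical two-group comparison of medians.
lo, hi = median_contrast_ci(medians=[5.0, 3.0], se=[0.5, 0.5], c=[1, -1])
```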
In designs with more than two groups, useful information can be obtained by performing multiple Mann-Whitney tests for some or all pairwise comparisons using the Holm procedure. Simultaneous confidence intervals for pairwise differences or ratios of medians, the Mann-Whitney parameter for pairwise comparisons, or linear contrasts of medians are informative alternatives to the Kruskal-Wallis test. Some researchers use the Kruskal-Wallis test as a screening test to determine if multiple Mann-Whitney tests or simultaneous confidence intervals are necessary.

Sample Size Requirements for Desired Precision

The sample size requirement per group to estimate a linear contrast of the a population means with desired confidence and precision is approximately

n_j = 4σ̃²(Σ_{j=1}^a c_j²)(z_α/2 /w)² + z²_α/2 /(2m)    (3.8)

where σ̃² is the average within-group variance, w is the desired confidence interval width, and m is the number of nonzero c_j values. Note that Equation 3.8 reduces to Equation 2.5 for the special case of comparing two means. Equation 3.8 also can be used for factorial designs where a is the total number of treatment combinations. The MSE from previous research is often used as a planning value for the average within-group variance.

Example 3.7. A researcher wants to estimate (μ11 + μ12)/2 − (μ21 + μ22)/2 in a 2 × 2 factorial experiment with 95% confidence, a desired confidence interval width of 3.0, and a planning value of 8.0 for the average within-group error variance. The contrast coefficients are 1/2, 1/2, −1/2, and −1/2. The sample size requirement per group is approximately n_j = 4(8.0)(1/4 + 1/4 + 1/4 + 1/4)(1.96/3.0)² + 0.48 = 14.2 ≈ 15.

The sample size requirement per group to estimate a standardized linear contrast of the a population means (ψ) with desired confidence and precision is approximately

n_j = [2ψ̃²/a + 4(Σ_{j=1}^a c_j²)](z_α/2 /w)²    (3.9)

where ψ̃ is a planning value of ψ.
Note that this sample size formula reduces to Equation 2.6 for the special case of a standardized mean difference.

Example 3.8. A researcher wants to estimate ψ in a one-factor experiment (a = 3) with 95% confidence, a desired confidence interval width of 0.6, and ψ̃ = 0.8. The contrast coefficients are 1/2, 1/2, and −1. The sample size requirement per group is approximately n_j = [2(0.64)/3 + 4(1/4 + 1/4 + 1)](1.96/0.6)² = 68.6 ≈ 69.

A simple formula for approximating the sample size needed to obtain a confidence interval for η² having a desired width is currently not available. However, if sample data can be obtained in two stages, then the confidence interval width for η² obtained in the first-stage sample can be used in Equation 1.12 to approximate the additional number of participants needed in the second-stage sample to achieve the desired confidence interval width.

Example 3.9. A first-stage sample size of 12 participants per group in a one-factor experiment gave a 95% confidence interval for η² with a width of 0.51. The researcher would like to obtain a 95% confidence interval for η² that has a width of 0.30. To achieve this goal, [(0.51/0.30)² − 1]12 = 22.7 ≈ 23 additional participants per group are needed.

Sample Size Requirements for Desired Power

The sample size requirement per group to test H0: Σ_{j=1}^a c_j μ_j = 0 for a specified value of α and with desired power is approximately

n_j = σ̃²(Σ_{j=1}^a c_j²)(z_α/2 + z_β)²/(Σ_{j=1}^a c_j μ̃_j)² + z²_α/2 /(2m)    (3.10)

where σ̃² is the average within-group variance, Σ_{j=1}^a c_j μ̃_j is the anticipated effect size value, and m is the number of nonzero c_j values. This sample size formula reduces to Equation 2.7 when the contrast involves the comparison of two means.
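Equations 3.8 and 3.9 are easy to transcribe into small planning functions. The sketch below (a transcription of the formulas, not the author's code) reproduces the results of Examples 3.7 and 3.8, and the two-stage adjustment of Example 3.9.

```python
import math

def n_mean_contrast(var_avg, c, w, z=1.96):
    """Equation 3.8: per-group n to estimate a mean contrast with CI width w."""
    m = sum(1 for cj in c if cj != 0)       # number of nonzero coefficients
    n = 4 * var_avg * sum(cj ** 2 for cj in c) * (z / w) ** 2 + z ** 2 / (2 * m)
    return math.ceil(n)                     # round up to the next whole participant

def n_std_contrast(psi, c, w, z=1.96):
    """Equation 3.9: per-group n to estimate a standardized contrast psi."""
    a = len(c)
    n = (2 * psi ** 2 / a + 4 * sum(cj ** 2 for cj in c)) * (z / w) ** 2
    return math.ceil(n)

# Example 3.7: 2 x 2 design, planning variance 8.0, desired width 3.0.
n1 = n_mean_contrast(8.0, [0.5, 0.5, -0.5, -0.5], 3.0)
# Example 3.8: a = 3, psi planning value 0.8, desired width 0.6.
n2 = n_std_contrast(0.8, [0.5, 0.5, -1.0], 0.6)
# Example 3.9: additional per-group n to shrink a first-stage CI width of 0.51
# (from 12 participants per group) to a desired width of 0.30.
extra = math.ceil(((0.51 / 0.30) ** 2 - 1) * 12)
```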
In applications where Σ_{j=1}^a c_j μ̃_j or σ̃² is difficult for the researcher to specify, Equation 3.10 can be expressed in terms of a planning value for ψ, as shown below

n_j = (Σ_{j=1}^a c_j²)(z_α/2 + z_β)²/ψ̃² + z²_α/2 /(2m)    (3.11)

which simplifies to Equation 2.8 in Module 2 when the contrast involves the comparison of two means.

Example 3.10. A researcher wants to test H0: (μ1 + μ2 + μ3 + μ4)/4 − μ5 = 0 in a one-factor experiment with power of .90, α = .05, and an anticipated standardized linear contrast value of 0.5. The contrast coefficients are 1/4, 1/4, 1/4, 1/4, and −1. The sample size requirement per group is approximately n_j = 1.25(1.96 + 1.28)²/0.5² + 0.38 = 52.9 ≈ 53.

The sample size requirements for v simultaneous confidence intervals or tests are obtained by replacing α in Equations 3.8 – 3.11 with α* = α/v.

Using Prior Information

Suppose a population mean difference for a particular response variable has been estimated in a previous study and also in a new study. The previous study used a random sample to estimate μ1 − μ2 from one study population, and the new study used a random sample to estimate μ3 − μ4 from another study population. This is a 2 × 2 factorial design with a classification factor where Study 1 and Study 2 are the levels of the classification factor. The two study populations are assumed to be conceptually similar. If a confidence interval for (μ1 − μ2) − (μ3 − μ4) suggests that μ1 − μ2 and μ3 − μ4 are not too dissimilar, then the researcher might want to compute a confidence interval for (μ1 + μ3)/2 − (μ2 + μ4)/2. A confidence interval for (μ1 + μ3)/2 − (μ2 + μ4)/2 will have greater external validity and could be substantially narrower than the confidence interval for μ1 − μ2 or μ3 − μ4.
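Equation 3.11 can likewise be transcribed into a small planning function. The sketch below (not the author's code) reproduces the result of Example 3.10; z_β = 1.28 corresponds to power .90, as in the text.

```python
import math

# Equation 3.11: per-group n to test a standardized contrast with desired power.
# psi is the anticipated standardized contrast; z_b is z_beta (1.28 for power .90).
def n_power_std_contrast(psi, c, z_a=1.96, z_b=1.28):
    m = sum(1 for cj in c if cj != 0)       # number of nonzero coefficients
    n = sum(cj ** 2 for cj in c) * (z_a + z_b) ** 2 / psi ** 2 + z_a ** 2 / (2 * m)
    return math.ceil(n)

# Example 3.10: c = (1/4, 1/4, 1/4, 1/4, -1), anticipated psi = 0.5.
n = n_power_std_contrast(0.5, [0.25, 0.25, 0.25, 0.25, -1.0])
```

For v simultaneous tests, the same function would be called with the critical value for α* = α/v in place of z_a.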
A 100(1 − α)% confidence interval for (μ1 + μ3)/2 − (μ2 + μ4)/2 is obtained from Equation 3.1, and if medians have been computed in each study, an approximate 100(1 − α)% confidence interval for (η1 + η3)/2 − (η2 + η4)/2 is obtained from Equation 3.4 using c1 = .5, c2 = −.5, c3 = .5, and c4 = −.5.

If a standardized mean difference has been estimated in each study and a confidence interval for δ1 − δ2 suggests that these two parameter values are not too dissimilar, the researcher might want to compute the following approximate 100(1 − α)% confidence interval for (δ1 + δ2)/2

(δ̂1 + δ̂2)/2 ± z_α/2 √((SE²_δ̂1 + SE²_δ̂2)/4)    (3.12)

where SE²_δ̂j was defined in Equation 2.2 of Module 2.

Example 3.12. An eye-witness identification study with 20 participants per group at Kansas State University assessed participants' certainty in their selection of a suspect individual from a photo lineup after viewing a short video of a crime scene. Two treatment conditions were assessed in each study. In the first treatment condition the participants were told that the target individual "will be" in a 5-person photo lineup, and in the second treatment condition participants were told that the target individual "might be" in a 5-person photo lineup. The suspect was included in the lineup in both instruction conditions. The estimated means were 7.4 and 6.3 and the estimated standard deviations were 1.7 and 2.3 in the "will be" and "might be" conditions, respectively. This study was replicated at UCLA using 40 participants per group. In the UCLA study, the estimated means were 6.9 and 5.7, and the estimated standard deviations were 1.5 and 2.0 in the "will be" and "might be" conditions, respectively. A 95% confidence interval for (μ1 − μ2) − (μ3 − μ4) indicated that μ1 − μ2 and μ3 − μ4 do not appear to be substantially dissimilar.
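Equation 3.12 reduces to simple arithmetic once each study's standardized mean difference and its squared standard error (Equation 2.2 of Module 2, not reproduced here) are available. The sketch below treats those quantities as given inputs; all numbers are hypothetical, not the values from Example 3.12.

```python
import math

# Approximate CI for the average of two standardized mean differences
# (Equation 3.12). d1, d2: per-study estimates; se2_1, se2_2: their squared
# standard errors (assumed precomputed); z: critical value (1.96 for 95%).
def avg_delta_ci(d1, d2, se2_1, se2_2, z=1.96):
    est = (d1 + d2) / 2
    margin = z * math.sqrt((se2_1 + se2_2) / 4)
    return est - margin, est + margin

# Hypothetical inputs from two replications.
lo, hi = avg_delta_ci(d1=0.60, d2=0.50, se2_1=0.04, se2_2=0.04)
```

Averaging the squared standard errors and dividing by 4 is what makes the combined interval narrower than either single-study interval.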
The 95% confidence interval for (μ1 + μ3)/2 − (μ2 + μ4)/2, which describes the Kansas State and UCLA study populations, was [0.43, 1.87].

Graphing Results

Results of a two-factor design can be illustrated using a clustered bar chart where the means for the levels of one factor are represented by a cluster of contiguous bars (with different colors, shades, or patterns) and the levels of the second factor are represented by different clusters. An example of a clustered bar chart for a 2 × 2 design is shown below, where the levels of Factor A define the clusters.

[Clustered bar chart omitted: the four cell means of a 2 × 2 design, clustered by levels of Factor A.]

If one factor is more interesting than the other factor, the factor levels within each cluster should represent the more interesting factor because it is easier to visually compare means within a cluster than across clusters. In the above graph, it is easy to see that the mean for level 2 of Factor A is greater than the mean for level 1 of Factor A within each level of Factor B.

Data Transformations and Interaction Effects

Data transformations were described in Module 1 as a way to reduce nonnormality. In factorial designs, an interaction effect might be an artifact of the measurement process, and the magnitude of an interaction effect can sometimes be reduced considerably by a data transformation. Consider the following example of a 2 × 2 design with three participant scores per group.

                     Factor B
Factor A      b1              b2
   a1         49, 64, 81      100, 121, 144
   a2         1, 4, 9         16, 25, 36

The simple main effect of A at b1 is 64.67 − 4.67 = 60 and the simple main effect of A at b2 is 121.67 − 25.67 = 96, which indicates a nonzero interaction effect in this sample. After taking a square-root transformation of the data, the simple main effect of A at b1 is 8 − 2 = 6 and the simple main effect of A at b2 is 11 − 5 = 6, which indicates a zero interaction effect. In this example, the estimated interaction effect was reduced to zero by simply transforming the data.
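The square-root example above can be reproduced in a few lines. The sketch below uses the same cell scores as the text and shows that the estimated AB interaction is nonzero on the raw scale but exactly zero after the transformation.

```python
import math

# The 2 x 2 transformation example from the text: three scores per cell.
cells = {("a1", "b1"): [49, 64, 81], ("a1", "b2"): [100, 121, 144],
         ("a2", "b1"): [1, 4, 9],    ("a2", "b2"): [16, 25, 36]}

def interaction(table):
    m = {jk: sum(y) / len(y) for jk, y in table.items()}
    # AB interaction estimate: (A at b1) - (A at b2)
    return (m["a1", "b1"] - m["a2", "b1"]) - (m["a1", "b2"] - m["a2", "b2"])

raw = interaction(cells)
transformed = interaction({jk: [math.sqrt(y) for y in ys] for jk, ys in cells.items()})
```

On the raw scale the interaction estimate is 60 − 96 = −36; after the square-root transformation the simple main effects are both 6, so the estimate is zero.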
Interaction effects can be classified as removable or nonremovable. A removable interaction effect (also called an ordinal interaction effect) can be reduced to zero by some transformation of the data. A nonremovable interaction effect (also called a disordinal interaction effect) cannot be reduced to zero by a data transformation. In a 2 × 2 design, if the simple main effects of A have different signs, or the simple main effects of B have different signs, then the interaction effect is nonremovable. Otherwise, the interaction effect is removable by some data transformation. In studies where the existence of an interaction effect has an important theoretical implication, a more compelling theoretical argument can be made if it can be shown, based on confidence intervals for the simple main effects, that the population interaction effect is nonremovable.
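The sign criterion just stated can be expressed as a small check. The sketch below (hypothetical tables of cell means, not the author's code) classifies a 2 × 2 interaction as removable or nonremovable by testing whether the simple main effects of either factor change sign.

```python
# Classify a 2 x 2 interaction using the sign criterion: if the simple main
# effects of A (or of B) have different signs, the interaction is nonremovable.
def interaction_type(m11, m12, m21, m22):
    a_at_b1, a_at_b2 = m11 - m21, m12 - m22     # simple main effects of A
    b_at_a1, b_at_a2 = m11 - m12, m21 - m22     # simple main effects of B
    if a_at_b1 * a_at_b2 < 0 or b_at_a1 * b_at_a2 < 0:
        return "nonremovable"
    return "removable"

# Cell means from the transformation example above: same signs -> removable.
kind1 = interaction_type(64.67, 121.67, 4.67, 25.67)
# Hypothetical means where the simple main effects of A reverse sign.
kind2 = interaction_type(2.0, 6.0, 4.0, 1.0)
```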