Journal of Educational Statistics
Summer 1982, Volume 7, Number 1, pp. 119-137
FITTING CATEGORICAL MODELS TO EFFECT SIZES
FROM A SERIES OF EXPERIMENTS
LARRY V. HEDGES
The University of Chicago
Key words: Meta-analysis; research synthesis; effect size; test of homogeneity; analysis of
variance
ABSTRACT. One method of combining the results of a series of two-group experiments
involves the estimation of the effect size (population value of the standardized mean
difference) for each experiment. When each experiment has the same effect size, a
pooled estimate of effect size provides a summary of the results of the series of
experiments. However, when effect sizes are not homogeneous, a pooled estimate can
be misleading. A statistical test is provided for testing whether a series of experiments
have the same effect size. A general strategy is provided for fitting models to the results
of a series of experiments when the experiments do not share the same effect size and
the collection of experiments is divided into a priori classes. The overall fit statistic HT
is partitioned into a between-class fit statistic HB and a within-class fit statistic Hw.
The statistics HB and Hw permit the assessment of differences between effect sizes for
different classes and the assessment of the homogeneity of effect size within classes.
Large numbers of studies have accumulated in many areas of educational
research. Yet larger numbers of research studies have not always led to clearer
insights about the phenomena under investigation. One strategy for combining
the results of research studies has been the use of quantitative methods of
research synthesis. Glass (1976) proposed a method for combining the results
of a series of studies by calculating a quantitative estimate of effect magnitude
for each study. The estimates of effect magnitude from the series are then
averaged to obtain an overall estimate of effect magnitude. Hedges (1981) has
provided a formal statistical model for a series of studies when each study
makes a between-group comparison and the index of effect magnitude is a
standardized mean difference. Hedges (1982) also studied the properties of
estimators of effect size when the studies in a collection share a common
(population) effect size.
It is clear that representation of the results of a collection of studies by a
single estimate of effect magnitude can be misleading if the underlying
(population) effect sizes are not identical in all the studies. For example,
suppose a treatment produces large positive (population) effects in half a
collection of studies, and large negative (population) effects in the other half of
a collection of studies. Then representation of the overall effect of the
Downloaded from http://jebs.aera.net at PENNSYLVANIA STATE UNIV on September 17, 2016
treatment as zero is misleading, because all the studies actually have underlying
effects that are different from zero. Hedges (1982) developed tests of homogeneity of effect size to detect situations in which underlying (population)
effect sizes are not homogeneous. Hedges indicated that in many real data sets,
the assumption of homogeneity of effect size is not met.
Some writers in the area of research synthesis have cited substantive reasons
for the position that different studies of the effects of the same treatment might
yield quite different results. Light and Smith (1971) argued that many contradictions in research evidence can be resolved by grouping studies with
similar characteristics. They asserted that studies with the same characteristics
are more likely to yield similar results, and therefore many apparent contradictions among research results arise from differences in the characteristics of
studies. Pillemer and Light (1980) have argued that grouping of studies
according to their characteristics is an essential step in assessing the range of
generalizability of a research finding. For example, if a treatment produces
essentially the same effect in a wide variety of settings with a variety of people,
we are more confident in the generalizability of the finding of a treatment
effect.
Some investigators in quantitative research synthesis (e.g., Kulik, Kulik &
Cohen, 1979) have recognized the potential for heterogeneous effect sizes and
have grouped studies that share common characteristics into classes. The usual
approach is then to treat the effect size estimates as data and calculate an
analysis of variance to determine if these classes have different mean effect
sizes. There are two problems with this procedure. First, the assumptions of the
analysis of variance might not be met because the effect size estimates might
not have the same distribution within cells. The variance of an individual
observation (effect size estimate) is proportional to 1/n, where n is the number
of subjects in the study. When studies have different sample sizes, the individual "error" variances might differ by a factor of 10 to 20. Second, even if the
between-classes test were accurate, the use of ANOVA does not provide any
indication of whether or not studies within the classes share a common effect
size. Thus, even if ANOVA correctly detects that two classes of studies have a
different average effect size, there is no guarantee that the average effect size
within each class is a reflection of a common underlying effect size for that
class.
This paper presents an alternative technique for fitting models to effect sizes
from a series of studies. We assume the investigator has an a priori grouping of
studies, that is, a scheme for classifying studies that are likely to produce
similar results. Often this will take the form of a set of categories into which
studies can be placed. Studies can be cross classified by two or more sets of
categories. The technique presented in this paper is straightforward. Conceptually the investigator begins by asking whether all studies (regardless of category) share a common effect size. A statistical test (fit statistic) is provided. If
the hypothesis of fit to a single effect size is rejected, the experimenter then
breaks the series of studies into classes, and asks whether the model of a
different effect size for each class fits the data. It is interesting to note that the
fit statistic calculated at the first stage is partitioned into stochastically
independent parts corresponding to between-class and within-class fit, respectively. The between-class fit is an index of the extent to which effect sizes in the
classes are different. If the within-class fit (fit to a single effect size within each
class) is not rejected, the investigator can stop. If the within-class fit is rejected,
the investigator might want to further subdivide the classes. The process of
subdividing and testing for between- and within-class fit continues until an
acceptable level of within-class homogeneity is achieved. The procedure provides valid asymptotic tests for the effects of classifications as well as an
indication that the final classes are internally homogeneous with respect to
effect size.
The first section is an exposition of the specific model used in this paper.
Then, some basic results on estimation of effect size are presented. These
results are used in subsequent sections. Some tests of homogeneity based on
asymptotic theory for weighted estimators are discussed. An explanation
follows of the use of this paper's results for fitting models to a series of studies.
Next, the results of a simulation study of the small sample behavior of the
asymptotic tests given here are presented. The last section is an example of the
application of the techniques presented.
Model
The statistical procedures described in this paper depend on the structural
model for the results of a series of experiments. A conceptual requirement of
the structural model used is that each experiment measure a dependent
variable from a collection of congeneric measures, that is, each response scale
is a linear transformation of a response scale with unit variance within groups.
This requirement on the dependent variables is satisfied, for example, if all the
studies use psychological tests that are linearly equatable. We will also assume
that the studies are sorted into p disjoint classes that are determined a priori.
Let $m_i$, $i = 1, \ldots, p$, be the number of studies in the $i$th class.
Let $Y^E_{ijk}$ and $Y^C_{ijk}$ be the $k$th experimental and control group scores in the $j$th
experiment in the $i$th class. Denote the experimental and control group sample
sizes for the $j$th study in the $i$th class by $n^E_{ij}$ and $n^C_{ij}$, respectively. Assume that
for fixed $i$ and $j$, $Y^E_{ijk}$ and $Y^C_{ijk}$ are independently normally distributed with
means $\mu^E_{ij}$ and $\mu^C_{ij}$ and common variance $\beta^2_{ij}$, i.e.,

$$Y^E_{ijk} \sim \mathcal{N}(\mu^E_{ij}, \beta^2_{ij}), \qquad k = 1, \ldots, n^E_{ij},$$

and

$$Y^C_{ijk} \sim \mathcal{N}(\mu^C_{ij}, \beta^2_{ij}), \qquad k = 1, \ldots, n^C_{ij},$$

for $j = 1, \ldots, m_i$, $i = 1, \ldots, p$.
The effect size for the $j$th experiment in the $i$th class is the parameter

$$\delta_{ij} = \frac{\mu^E_{ij} - \mu^C_{ij}}{\beta_{ij}},$$

where $\beta_{ij}$ is assumed to be positive. When the response scale has unit variance
within groups, the effect size parameter $\delta_{ij}$ (the standardized mean difference)
is simply the treatment effect (mean difference) in the $j$th experiment in the $i$th
class. Note that the effect size is invariant under linear transformations. We
require the dependent variables to be linearly related so that effect sizes will
have the same interpretation across studies.
A more compact representation of the model presented above involves the
explicit use of $\delta_{ij}$, the within-group standard deviation $\beta_{ij}$, a location (scale
mean) parameter $\gamma_{ij}$, and a residual term $\varepsilon_{ijk}$. The structural model can be
written as

$$Y^E_{ijk} = \beta_{ij}\delta_{ij} + \beta_{ij}\gamma_{ij} + \varepsilon^E_{ijk}, \qquad k = 1, \ldots, n^E_{ij},$$
$$Y^C_{ijk} = \beta_{ij}\gamma_{ij} + \varepsilon^C_{ijk}, \qquad k = 1, \ldots, n^C_{ij}, \qquad\qquad (1)$$

for $j = 1, \ldots, m_i$, $i = 1, \ldots, p$, where $\varepsilon^E_{ijk} \sim \mathcal{N}(0, \beta^2_{ij})$ and $\varepsilon^C_{ijk} \sim \mathcal{N}(0, \beta^2_{ij})$.
Model (1) does not explicitly allow for measurement error in the dependent
variable. The standardized mean difference $\delta_{ij}$ in (1) is a measure of the
magnitude of the treatment effect compared to the variability within the two
groups of the experiment. The implicit assumption is that the variability within
the experimental and control groups arises from stable differences between
subjects (or more generally between experimental units). If the response
measure is not perfectly reliable, that is, if errors of measurement are present,
the measurement error also contributes to the within-group variability. Measurement error, therefore, decreases the population value of the standardized
mean difference. If the parameter of interest is $\delta_{ij}$, the standardized mean
difference when no errors of measurement are present, some procedure to
correct for the effects of measurement error is necessary. The effect of
measurement error is a particularly important consideration when different
studies use dependent measures of differing reliability. In this case the standardized mean differences for different studies are difficult to compare because
they will be attenuated to different degrees. Hedges (1981) studied the effects
of measurement error on estimators of effect size and gave a correction for the
effects of unreliability.
A structural model that Hedges (1981) used that includes a measurement
error $\eta_{ijk}$ for each observation is

$$Y^E_{ijk} = \beta_{ij}\delta_{ij} + \beta_{ij}\gamma_{ij} + (\varepsilon^E_{ijk} + \eta^E_{ijk}), \qquad k = 1, \ldots, n^E_{ij},$$
$$Y^C_{ijk} = \beta_{ij}\gamma_{ij} + (\varepsilon^C_{ijk} + \eta^C_{ijk}), \qquad k = 1, \ldots, n^C_{ij}, \qquad\qquad (2)$$

for $j = 1, \ldots, m_i$, $i = 1, \ldots, p$,
where $(\varepsilon^E_{ijk} + \eta^E_{ijk})$ and $(\varepsilon^C_{ijk} + \eta^C_{ijk})$ are distributed independently as
$\mathcal{N}(0, \beta^2_{ij}/\rho_{ij})$, and $\rho_{ij}$ is the reliability of the response measure used in the $j$th
experiment in the $i$th class. Note that $\delta_{ij}$ has the interpretation that it is the
standardized mean difference that would be obtained if the measurements were
error free, that is, if $\eta_{ijk} = 0$. Note also that the parameters $\varepsilon_{ijk}$ and $\eta_{ijk}$
always occur together and therefore cannot be distinguished without additional
information. One form of additional information is the reliability of the
response measure, which can be used to distinguish the variances of $\varepsilon_{ijk}$ and $\eta_{ijk}$.

Hedges (1981) pointed out that under the classical assumptions that $\eta_{ijk}$ and
$\varepsilon_{ijk}$ are independently normally distributed within the groups of each experiment,

$$\mathrm{Var}(\varepsilon_{ijk} + \eta_{ijk} \mid i, j) = \mathrm{Var}(\varepsilon_{ijk} \mid i, j) + \mathrm{Var}(\eta_{ijk} \mid i, j) = \mathrm{Var}(\varepsilon_{ijk} \mid i, j)/\rho_{ij},$$

where $\rho_{ij} = \mathrm{Var}(\varepsilon_{ijk} \mid i, j)/[\mathrm{Var}(\varepsilon_{ijk} \mid i, j) + \mathrm{Var}(\eta_{ijk} \mid i, j)]$ is the reliability
of the response measure in the $j$th experiment of the $i$th class. Therefore,
$\mathrm{Var}(\varepsilon_{ijk} + \eta_{ijk} \mid i, j) = \beta^2_{ij}/\rho_{ij}$ in model (2) because $\beta^2_{ij} = \mathrm{Var}(\varepsilon_{ijk} \mid i, j)$. We
will test hypotheses about the $\delta_{ij}$ (error-free effect sizes) assuming that the $\rho_{ij}$
are known. If the errors of measurement are negligible, we can set $\rho_{ij} = 1$.
In subsequent sections we will be concerned with testing hypotheses about
the $\delta_{ij}$. For example, we might want to test whether all the studies in different
classes share a common but unknown effect size $\delta$ by testing

$$H_0: \delta_{ij} = \delta, \qquad j = 1, \ldots, m_i, \quad i = 1, \ldots, p, \qquad\qquad (3)$$

versus the alternative

$$H_1: \delta_{ij} = \delta_i, \qquad j = 1, \ldots, m_i, \quad i = 1, \ldots, p, \qquad\qquad (4)$$

that is, that the effect size depends on the class but is otherwise unknown. We
might also want to test the hypothesis $H_1$ against the alternative

$$H_2: \delta_{ij} \ \text{unrestricted.} \qquad\qquad (5)$$

The test of $H_1$ versus $H_2$ is a test of homogeneity of effect size within classes.
Estimating Effect Size
This section includes some facts about estimators of effect size that will be
used in subsequent sections. Each of these facts is proven or is easily obtained
from results given in Hedges (1981, 1982). First define the estimator $g_{ij}$ of $\delta_{ij}$
by

$$g_{ij} = \frac{\bar{Y}^E_{ij} - \bar{Y}^C_{ij}}{S_{ij}\sqrt{\rho_{ij}}}, \qquad j = 1, \ldots, m_i, \quad i = 1, \ldots, p, \qquad\qquad (6)$$
where $\bar{Y}^E_{ij}$ and $\bar{Y}^C_{ij}$ are the experimental and control group sample means, $S_{ij}$ is
the pooled within-groups sample standard deviation, and $\rho_{ij}$ is the reliability of
the response measure for the $j$th experiment in the $i$th class.
Define $N = \sum_{i=1}^{p}\sum_{j=1}^{m_i}(n^E_{ij} + n^C_{ij})$, $\pi^E_{ij} = n^E_{ij}/N$, $\pi^C_{ij} = n^C_{ij}/N$, where the $\pi^E_{ij}$ and
$\pi^C_{ij}$ remain fixed as $N \to \infty$. Then the asymptotic distribution of $g_{ij}$ is given by

$$\sqrt{N}(g_{ij} - \delta_{ij}) \sim \mathcal{N}(0, \sigma^2_{ij}(\delta_{ij})),$$

where

$$\sigma^2_{ij}(\delta_{ij}) = \frac{\pi^E_{ij} + \pi^C_{ij}}{\pi^E_{ij}\,\pi^C_{ij}\,\rho_{ij}} + \frac{\delta^2_{ij}}{2(\pi^E_{ij} + \pi^C_{ij})}. \qquad\qquad (7)$$
Although the estimator $g_{ij}$ is not the maximum likelihood estimator of $\delta_{ij}$, $g_{ij}$
has the same asymptotic distribution as the maximum likelihood estimator.
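The computation in (6) and (7) is easily programmed. The following Python sketch is illustrative (the function name and example values are not from the paper); the variance is expressed on the per-study scale, that is, as an estimate of $\mathrm{Var}(g_{ij})$ itself, obtained by replacing the $\pi$'s in (7) with the group sample sizes, which absorbs the factor $1/N$:

```python
import math

def effect_size_and_variance(mean_e, mean_c, s_pooled, n_e, n_c, rho=1.0):
    """Effect size estimate g_ij (Eq. 6) and its large-sample variance,
    Eq. (7) rescaled to the per-study level (an estimate of Var(g_ij),
    not Var(sqrt(N) g_ij)).  rho is the reliability of the response
    measure; rho = 1 means no correction for measurement error."""
    g = (mean_e - mean_c) / (s_pooled * math.sqrt(rho))
    var_g = (n_e + n_c) / (n_e * n_c * rho) + g ** 2 / (2.0 * (n_e + n_c))
    return g, var_g
```

For example, two groups of 25 subjects with means 105 and 100 and pooled standard deviation 10 give $g = 0.5$ with estimated variance $50/625 + 0.25/100 = 0.0825$.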
For a series of experiments, define the weighted estimators $g_{i\cdot}$ and $g_{\cdot\cdot}$ by

$$g_{i\cdot} = \frac{\sum_{j=1}^{m_i} g_{ij}/\sigma^2_{ij}(g_{ij})}{\sum_{j=1}^{m_i} 1/\sigma^2_{ij}(g_{ij})}, \qquad i = 1, \ldots, p, \qquad\qquad (8)$$

and

$$g_{\cdot\cdot} = \frac{\sum_{i=1}^{p}\sum_{j=1}^{m_i} g_{ij}/\sigma^2_{ij}(g_{ij})}{\sum_{i=1}^{p}\sum_{j=1}^{m_i} 1/\sigma^2_{ij}(g_{ij})}, \qquad\qquad (9)$$

where $\sigma^2_{ij}(g_{ij})$ is given by (7).
If the $\pi^E_{ij}$ and $\pi^C_{ij}$ remain fixed as $N \to \infty$, and the hypothesis $H_1$ given in (4)
is true, the estimators $g_{i\cdot}$ have asymptotic distribution given by

$$\sqrt{N}(g_{i\cdot} - \delta_i) \sim \mathcal{N}(0, \sigma^2_i(\delta_i)), \qquad i = 1, \ldots, p,$$

where

$$\sigma^2_i(\delta_i) = \left[\sum_{j=1}^{m_i} \frac{1}{\sigma^2_{ij}(\delta_i)}\right]^{-1}. \qquad\qquad (10)$$
If the $\pi^E_{ij}$ and $\pi^C_{ij}$ remain fixed as $N \to \infty$ and the hypothesis $H_0$ given in (3) is
true, the estimator $g_{\cdot\cdot}$ has an asymptotic distribution given by

$$\sqrt{N}(g_{\cdot\cdot} - \delta) \sim \mathcal{N}(0, \sigma^2(\delta)),$$

where

$$\sigma^2(\delta) = \left[\sum_{i=1}^{p}\sum_{j=1}^{m_i} \frac{1}{\sigma^2_{ij}(\delta)}\right]^{-1}. \qquad\qquad (11)$$
The $g_{i\cdot}$, $i = 1, \ldots, p$, are not the maximum likelihood estimators of the $\delta_i$,
$i = 1, \ldots, p$, under $H_1$, but they have the same asymptotic distribution as the
maximum likelihood estimators. Similarly, though it is not the maximum
likelihood estimator of $\delta$ under $H_0$, $g_{\cdot\cdot}$ has the same asymptotic distribution as
the maximum likelihood estimator of $\delta$.
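The pooled estimate (9) and its variance (11) can be sketched in a few lines; under $H_0$ they yield a large-sample confidence interval for $\delta$. The function name is illustrative, and the inputs are assumed to be per-study variance estimates (Eq. 7 with the $\pi$'s replaced by sample sizes, which absorbs the $1/N$ scaling):

```python
import math

def pooled_estimate(g, var, z=1.96):
    """Weighted pooled estimate g.. of Eq. (9) and a large-sample 95%
    confidence interval for delta under H0.  `g` and `var` are flat
    lists of the study estimates and their per-study variance
    estimates; the weights are the reciprocal variances."""
    w = [1.0 / v for v in var]
    g_pooled = sum(wi * gi for wi, gi in zip(w, g)) / sum(w)
    se = math.sqrt(1.0 / sum(w))   # square root of Eq. (11), rescaled
    return g_pooled, (g_pooled - z * se, g_pooled + z * se)
```

For instance, two studies with estimates 0.4 and 0.6 and equal variance 0.01 pool to 0.5 with standard error $\sqrt{1/200}$.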
Some Tests of Homogeneity
In this section procedures are developed for testing the hypothesis $H_0$ given
in (3) versus $H_1$ given in (4) and for testing $H_1$ versus $H_2$ given in (5). These
procedures are intuitively appealing and involve easy calculations. The results
that follow are direct consequences of the lemma in the section Fundamental
Lemma.
Testing Homogeneity of Effect Size across Classes
The test of homogeneity of effect size $\delta_{i\cdot}$ across classes is a test of the
hypothesis $H_0$ given in (3) versus the hypothesis $H_1$ given in (4). Define the fit
statistic $H_B$ by

$$H_B = N\sum_{i=1}^{p}\sum_{j=1}^{m_i} \frac{(g_{i\cdot} - g_{\cdot\cdot})^2}{\sigma^2_{ij}(g_{ij})}, \qquad\qquad (12)$$

where $g_{i\cdot}$ is given in (8), $g_{\cdot\cdot}$ is given in (9), $\sigma^2_{ij}(g_{ij})$ is given in (7), and $N$ is
defined as above. If $\pi^E_{ij} = n^E_{ij}/N$ and $\pi^C_{ij} = n^C_{ij}/N$ remain fixed as $N \to \infty$, the
asymptotic distribution of the fit statistic $H_B$ under $H_0$ is given by

$$H_B \sim \chi^2_{p-1}. \qquad\qquad (13)$$
(13)
Thus, if effect sizes are homogeneous across groups, the test statistic HB is
distributed as a central chi-square on (p — 1) degrees of freedom. The test of
$H_0$ versus $H_1$ at a significance level $\alpha$ consists of comparing the obtained value
of $H_B$ to the $100(1 - \alpha)$ percent critical value of the chi-square distribution with
$(p - 1)$ degrees of freedom. If $H_B$ is greater than the critical value, we reject
$H_0$ in favor of $H_1$, and conclude that all classes do not share a common effect
size.
Testing Homogeneity of Effect Sizes within Classes
The test of homogeneity within classes is a test of the hypothesis $H_1$ given in
(4) versus the hypothesis $H_2$ given in (5). Define the test statistic $H_W$ by

$$H_W = N\sum_{i=1}^{p}\sum_{j=1}^{m_i} \frac{(g_{ij} - g_{i\cdot})^2}{\sigma^2_{ij}(g_{ij})}, \qquad\qquad (14)$$

where $g_{ij}$ is given in (6), $g_{i\cdot}$ is given in (8), $\sigma^2_{ij}(g_{ij})$ is given in (7), and $N$ is
defined as above. If $\pi^E_{ij} = n^E_{ij}/N$ and $\pi^C_{ij} = n^C_{ij}/N$ remain fixed as $N \to \infty$, the
asymptotic distribution of $H_W$ under $H_1$ is given by

$$H_W \sim \chi^2_{M-p}, \qquad\qquad (15)$$

where $M = \sum_{i=1}^{p} m_i$.
Thus, if effect sizes are homogeneous within classes, the test statistic $H_W$ is
distributed as a central chi-square on $(M - p)$ degrees of freedom. The test of
$H_1$ versus $H_2$ consists of comparing the obtained value of $H_W$ with the
$100(1 - \alpha)$ percent critical value of the chi-square distribution with $(M - p)$
degrees of freedom. If $H_W$ is greater than the critical value, we reject $H_1$ in
favor of $H_2$ and conclude that effect sizes are not homogeneous within classes.
In actual practice, it may be helpful to partition the within-class fit statistic
$H_W$ into $p$ statistics indicating fit within each of the $p$ classes. This helps to
isolate classes in which fit is particularly bad. If $H_{Wi}$ is the statistic for fit
within class $i$,

$$H_{Wi} = N\sum_{j=1}^{m_i} \frac{(g_{ij} - g_{i\cdot})^2}{\sigma^2_{ij}(g_{ij})}. \qquad\qquad (16)$$

In this case the asymptotic distribution of $H_{Wi}$ under $H_1$ is given by

$$H_{Wi} \sim \chi^2_{m_i - 1}.$$

It is clear that $H_W = \sum_{i=1}^{p} H_{Wi}$, and that $H_{Wi}$ and $H_{Wi'}$ are
independent whenever $i \neq i'$.
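The per-class statistic (16) can be sketched as follows (names are illustrative; as before, the variances are assumed to be per-study estimates, so the factor $N$ is absorbed):

```python
def within_class_fit(g_class, var_class):
    """H_Wi of Eq. (16) for one class: the weighted sum of squared
    deviations of the study estimates about the weighted class mean
    (Eq. 8), with weights equal to reciprocal variance estimates."""
    w = [1.0 / v for v in var_class]
    g_bar = sum(wi * gi for wi, gi in zip(w, g_class)) / sum(w)
    return sum(wi * (gi - g_bar) ** 2 for wi, gi in zip(w, g_class))
```

Summing this quantity over the $p$ classes reproduces $H_W$.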
Testing Whether all Studies Share a Common Effect Size
It may be desirable to test whether all studies, regardless of class, share the
same effect size versus the alternative that all studies do not have the same
effect size. Although the results of this section can be obtained from the results
of the previous section by treating each study as a separate class, the result is
stated for later reference. The test proposed in this section is formally a test of
$H_0$ versus $H_2$. In this case, define the test statistic

$$H_T = N\sum_{i=1}^{p}\sum_{j=1}^{m_i} \frac{(g_{ij} - g_{\cdot\cdot})^2}{\sigma^2_{ij}(g_{ij})}, \qquad\qquad (17)$$

where $g_{ij}$ is given by (6), $g_{\cdot\cdot}$ is given by (9), and $\sigma^2_{ij}(g_{ij})$ is given by (7). If
$\pi^E_{ij} = n^E_{ij}/N$ and $\pi^C_{ij} = n^C_{ij}/N$ remain fixed as $N \to \infty$, then the asymptotic
distribution of $H_T$ under $H_0$ is given by $H_T \sim \chi^2_{M-1}$, where $M = \sum_{i=1}^{p} m_i$.
An Analogy to the Analysis of Variance
There is a simple relationship among the fit statistics $H_B$, $H_W$, and $H_T$ that is
analogous to the partitioning of sums of squares in the analysis of variance. It
is easy to show that $H_T = H_B + H_W$, using only elementary algebra. One
interpretation of this formula involves the partitioning of the fit statistic $H_T$.
The "total fit" to the model of a single effect size is represented by $H_T$. The
"between-class fit" is represented by $H_B$, and the "within-class fit" is represented by $H_W$. Thus the total fit is partitioned into between-class and within-class components. We have stated that the statistics $H_B$, $H_W$, and $H_T$ are
distributed asymptotically as central chi-squares under appropriate null hypotheses with distributions given by $H_B \sim \chi^2_{p-1}$, $H_W \sim \chi^2_{M-p}$, and $H_T \sim \chi^2_{M-1}$,
where $M = \sum_{i=1}^{p} m_i$. Furthermore, the fundamental lemma given below states
that $H_B$ and $H_W$ are asymptotically stochastically independent. Therefore the
tests for between-class fit and within-class fit are asymptotically independent.
Computational Formulas for the Statistics HT, HB, and Hw
In practice, computational formulas can simplify calculation of the fit
statistics HT, HB, and Hw. These formulas are analogous to computational
formulas in the analysis of variance and enable the researcher to compute each
of the fit statistics in a single pass through the data (e.g., with a packaged
computer program). Each of the formulas can be verified by algebraic manipulation. The computational formulas are
$$H_T = \sum_{i=1}^{p}\sum_{j=1}^{m_i} \frac{g^2_{ij}}{\hat{\sigma}^2_{ij}(g_{ij})} - \frac{\left(\sum_{i=1}^{p}\sum_{j=1}^{m_i} g_{ij}/\hat{\sigma}^2_{ij}(g_{ij})\right)^2}{\sum_{i=1}^{p}\sum_{j=1}^{m_i} 1/\hat{\sigma}^2_{ij}(g_{ij})},$$

$$H_B = \sum_{i=1}^{p} \frac{\left(\sum_{j=1}^{m_i} g_{ij}/\hat{\sigma}^2_{ij}(g_{ij})\right)^2}{\sum_{j=1}^{m_i} 1/\hat{\sigma}^2_{ij}(g_{ij})} - \frac{\left(\sum_{i=1}^{p}\sum_{j=1}^{m_i} g_{ij}/\hat{\sigma}^2_{ij}(g_{ij})\right)^2}{\sum_{i=1}^{p}\sum_{j=1}^{m_i} 1/\hat{\sigma}^2_{ij}(g_{ij})},$$

$$H_W = H_T - H_B,$$

where $\hat{\sigma}^2_{ij}(g_{ij})$ and $\hat{g}_{i\cdot}$ are obtained by replacing $\pi^E_{ij}$ and $\pi^C_{ij}$ by $n^E_{ij}$ and $n^C_{ij}$,
respectively, in the definitions of $\sigma^2_{ij}(g_{ij})$ and $g_{i\cdot}$ given in (7) and (8).
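These single-pass formulas translate directly into code. A sketch in Python (names are illustrative; the inputs are the hatted per-study variance estimates described above, which absorb the factor $N$):

```python
def fit_statistics(g, var):
    """H_T, H_B, H_W via the computational formulas.  g[i][j] and
    var[i][j] are the estimate and estimated variance for study j of
    class i.  Weights are reciprocal variance estimates."""
    w = [[1.0 / v for v in row] for row in var]
    sw = sum(wij for row in w for wij in row)                 # sum of weights
    swg = sum(wij * gij for wr, gr in zip(w, g)
              for wij, gij in zip(wr, gr))                    # sum of w * g
    swg2 = sum(wij * gij ** 2 for wr, gr in zip(w, g)
               for wij, gij in zip(wr, gr))                   # sum of w * g^2
    h_t = swg2 - swg ** 2 / sw                                # total fit
    h_b = sum(sum(wij * gij for wij, gij in zip(wr, gr)) ** 2 / sum(wr)
              for wr, gr in zip(w, g)) - swg ** 2 / sw        # between classes
    return h_t, h_b, h_t - h_b                                # H_W = H_T - H_B
```

By construction the three values satisfy the partition $H_T = H_B + H_W$ noted above.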
A Fundamental Lemma
The lemma given in this section implies most of the results just presented.
Lemma 1: Let $H_B$, $H_W$, and $H_T$ be defined as in (12), (14), and (17). Define
$N = \sum_{i=1}^{p}\sum_{j=1}^{m_i}(n^E_{ij} + n^C_{ij})$, $\pi^E_{ij} = n^E_{ij}/N$, $\pi^C_{ij} = n^C_{ij}/N$, and assume that the $\pi^E_{ij}$,
$\pi^C_{ij}$ remain fixed as $N \to \infty$. If $H_1$ is true, the asymptotic distribution of $H_W$ is
given by

$$H_W \sim \chi^2_{M-p}.$$

If $H_0$ is true, then the asymptotic distributions of $H_B$ and $H_T$ are given by

$$H_B \sim \chi^2_{p-1}, \qquad H_T \sim \chi^2_{M-1},$$

where $M = \sum_{i=1}^{p} m_i$, and $H_B$ and $H_W$ are asymptotically independent.
Proof. The distributions of $H_B$, $H_W$, and $H_T$ are a direct consequence of a
large sample result on pooling independent estimators (see, e.g., Rao, 1973, pp.
389-390). The limiting distributions $\chi^2_B$, $\chi^2_W$, and $\chi^2_T$ of the test statistics $H_B$,
$H_W$, and $H_T$ satisfy the equations $\chi^2_T = \chi^2_B + \chi^2_W$ and $M - 1 = (p - 1) +
(M - p)$. Therefore $H_B$ and $H_W$ are asymptotically independent by Cochran's
theorem. $\|$
Fitting Effect Size Models to a Series of Studies
The statistical results of this paper can be used as part of a general strategy
for fitting models to the effect sizes from a series of studies. Start with a series
of studies where each study assesses the effect of a particular treatment via a
two-group experimental group/control group design. Suppose that the dependent variables measure the same construct and are (approximately) linearly
equatable. We assume that the studies are classified according to one of the
classification dimensions. The classes obtained by one partitioning can be
further partitioned according to a second classification dimension, and in turn
partitioned according to other dimensions.
One strategy for fitting models to effect sizes for each class is analogous to
the strategy used to fit hierarchical log-linear models to contingency tables.
The strategy can be described as follows:
Step 1. Ignore the classifications and fit the model of a single effect size to
all the studies. The estimate of this single effect size is g.. given by (9).
Calculate the fit statistic $H_T$. If the value of $H_T$ is not large or is statistically
insignificant at some preset $\alpha$ level, the investigator can stop, concluding that
the model of a single effect size fits the data adequately. The asymptotic
distribution of $g_{\cdot\cdot}$ may be used to calculate an asymptotic confidence interval
for $\delta$. If the fit statistic $H_T$ is large or statistically significant, go on to Step 2.
Step 2. A large value of the fit statistic HT indicates that effect sizes are not
homogeneous across all studies, so partition the studies into classes along one
dimension. One should choose the most important dimension first, that is, the
dimension believed to be most related to effect size.
Calculate the between-class fit statistic HB and the within-class fit statistic
Hw. If the value of the within-class statistic Hw is small or is statistically
insignificant, the investigator can stop, because the model of a different effect
size for each class is consistent with the data. In this case, $g_{i\cdot}$ given in (8) is the
estimate of effect size for the $i$th class and $H_B$ represents the extent to which
the effect sizes differ among classes. If Hw is large or statistically significant,
then go on to Step 3.
Step 3. A large value of the fit statistic Hw indicates that effect sizes are not
homogeneous within classes. At this point it may be useful to partition
within-class fit $H_W$ into $p$ (if there are $p$ classes) statistics $H_{Wi}$, $i = 1, \ldots, p$,
where $H_{Wi}$ indicates the fit within the $i$th class. Examining the values of $H_{Wi}$
might help identify classes with especially poor fit, that is, classes in which the
effect sizes are heterogeneous. This might lead the investigator to exclude some
classes or studies from further analyses. Examination of within-class fit might
also suggest which other classification dimensions are useful.
Step 4. Partition the existing classes according to a second classification
dimension. Repeat Step 2, that is, calculate the between- and within-class fit
statistics HB and Hw. Proceed through Steps 2, 3, and 4 until an acceptable
level of within-class fit is obtained or the classification dimensions are exhausted.
The procedure given is a practical method involving relatively simple calculations. It has the advantage that fit to the model can be assessed at each stage,
and it also provides a test of the relationship between the classification
dimension and effect size.
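To make the strategy concrete, here is a worked sketch of Steps 1 and 2 on hypothetical data: six studies in two a priori classes, with estimates and per-study variances invented for illustration. The 5 percent chi-square critical values used (11.07 for 5 df, 3.84 for 1 df, 9.49 for 4 df) are the standard tabled values.

```python
# Hypothetical estimates g[i][j] and per-study variances for 6 studies
# in 2 a priori classes (values are illustrative, not from the paper).
g   = [[0.80, 0.85, 0.75], [0.10, 0.15, 0.05]]
var = [[0.02] * 3, [0.02] * 3]

w = [[1.0 / v for v in row] for row in var]
sw = sum(x for row in w for x in row)
swg = sum(wij * gij for wr, gr in zip(w, g) for wij, gij in zip(wr, gr))
swg2 = sum(wij * gij ** 2 for wr, gr in zip(w, g) for wij, gij in zip(wr, gr))

# Step 1: fit a single effect size to all studies.
h_t = swg2 - swg ** 2 / sw
assert h_t > 11.07        # chi-square(5, .95): single-delta model rejected

# Step 2: partition the studies into the two classes.
h_b = sum(sum(wij * gij for wij, gij in zip(wr, gr)) ** 2 / sum(wr)
          for wr, gr in zip(w, g)) - swg ** 2 / sw
h_w = h_t - h_b
assert h_b > 3.84         # chi-square(1, .95): the classes differ
assert h_w < 9.49         # chi-square(4, .95): homogeneous within classes
```

Here the single-effect-size model is rejected, the between-class difference is significant, and the within-class fit is acceptable, so the analysis stops at Step 2.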
Comparisons Between Classes
If a priori knowledge or a formal hypothesis test (significant value of HB)
leads an investigator to believe that the effect sizes are not homogeneous across
classes, the investigator may wish to compare the effect sizes of different
classes. More generally, the investigator may wish to test hypotheses about
linear combinations of the effect sizes for the classes. Such comparisons are
analogous to contrasts in the analysis of variance.
We will consider comparisons of the form

$$C = \sum_{i=1}^{p} c_i g_{i\cdot}$$

as an estimator of $\sum_{i=1}^{p} c_i \delta_{i\cdot}$, where the $c_i$, $i = 1, \ldots, p$, are known constants and $\delta_{i\cdot}$
is the within-class, weighted, average population effect size given by

$$\delta_{i\cdot} = \frac{\sum_{j=1}^{m_i} \delta_{ij}/\sigma^2_{ij}(\delta_{ij})}{\sum_{j=1}^{m_i} 1/\sigma^2_{ij}(\delta_{ij})}.$$
There are two slightly different forms of the asymptotic distribution of C
according to whether or not the effect sizes $\delta_{ij}$ are identical within classes.
If the $\delta_{ij}$ are homogeneous within classes, that is, if $\delta_{ij} = \delta_i$, $j = 1, \ldots, m_i$,
$i = 1, \ldots, p$, then if the $\pi^E_{ij}$ and $\pi^C_{ij}$ remain fixed as $N \to \infty$, the asymptotic
distribution of $C$ is given by

$$\sqrt{N}\left(\sum_{i=1}^{p} c_i g_{i\cdot} - \sum_{i=1}^{p} c_i \delta_i\right) \sim \mathcal{N}(0, \sigma^2_C),$$

where

$$\sigma^2_C = \sum_{i=1}^{p} c_i^2 \, \sigma^2_i(\delta_i), \qquad\qquad (18)$$

and $\sigma^2_i(\delta_i)$ is obtained from (10).
If the $\delta_{ij}$ are heterogeneous within classes, that is, if $\delta_{ij} \neq \delta_{i\cdot}$ for some $i$ and
$j$, the asymptotic distribution of the comparison $C$ can still be obtained. In this
case, however, the comparison is a linear combination of the weighted averages
of the $\delta_{ij}$ rather than a linear combination of parameters that are homogeneous
across all studies in a class. The comparison may therefore have a different
interpretation than in the case where there are homogeneous effect sizes within
classes. In particular, suppose that two classes $i$ and $i'$ have different average
effect sizes, that is, $\delta_{i\cdot} > \delta_{i'\cdot}$. If the effect sizes are homogeneous within classes,
then $\delta_i > \delta_{i'}$ implies that $\delta_{ij} > \delta_{i'j'}$ for each $j = 1, \ldots, m_i$, $j' = 1, \ldots, m_{i'}$. If the
effect sizes are not homogeneous within classes, it is possible that $\delta_{i\cdot} > \delta_{i'\cdot}$, but
that $\delta_{ij} < \delta_{i'j'}$ for some $i$, $j$, $i'$, $j'$.

If the $\delta_{ij}$ are heterogeneous within classes and if the $\pi^E_{ij}$ and $\pi^C_{ij}$ remain fixed
as $N \to \infty$, then the asymptotic distribution of $C = \sum_{i=1}^{p} c_i g_{i\cdot}$ is given by

$$\sqrt{N}\left(\sum_{i=1}^{p} c_i g_{i\cdot} - \sum_{i=1}^{p} c_i \delta_{i\cdot}\right) \sim \mathcal{N}(0, \sigma^2_C),$$

where

$$\sigma^2_C = \sum_{i=1}^{p} c_i^2 \left[\sum_{j=1}^{m_i} \frac{1}{\sigma^2_{ij}(\delta_{ij})}\right]^{-1}. \qquad\qquad (19)$$

These results are used in practice by substituting the consistent estimator $g_{i\cdot}$
for $\delta_i$ in expression (18) or $g_{ij}$ for $\delta_{ij}$ in expression (19) for the variance of $C$.
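A comparison between classes can be sketched as follows, assuming within-class homogeneity so that (18) applies. The inputs are illustrative: the class estimates $g_{i\cdot}$ and per-study-scale estimates of their variances (e.g., the reciprocal of the summed study weights in each class):

```python
import math

def class_contrast(c, g_bar, var_bar):
    """Comparison C = sum_i c_i * g_i. and its standard error from
    Eq. (18).  c are the contrast coefficients; var_bar[i] estimates
    Var(g_i.) on the per-study scale."""
    c_hat = sum(ci * gi for ci, gi in zip(c, g_bar))
    se = math.sqrt(sum(ci ** 2 * vi for ci, vi in zip(c, var_bar)))
    return c_hat, se
```

For instance, $c = (1, -1)$ compares the first two class means, and the ratio of the comparison to its standard error can be referred to a standard normal table.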
Small Sample Behavior of the H Statistic
The statistical procedures described in this paper depend on large sample
approximations to the distributions of g,, and the H statistics. Although large
sample approximations are sometimes reasonably accurate in small samples,
the uncritical use of large sample statistical procedures is seldom justified.
Therefore simulation studies were conducted to assess the accuracy of the large
sample approximations used here. In each simulation the experimental and
control group sample sizes were set equal, that is, $n^E_{ij} = n^C_{ij} = n$, and four representative effect sizes were used: $\delta = .25$, $\delta = .50$, $\delta = 1.00$, and $\delta = 1.50$. The
values of $g_{ij}$ were generated using the identity $g = X/\sqrt{S/m}$, where $X$ is
normal with mean $\delta$ and variance $2/n$ and $S$ is a chi-square random number
with $m = 2n - 2$ degrees of freedom. The required standard normal and
chi-square random numbers were generated using the International Mathematical and Statistical Libraries, Inc. (1977) library subroutines GGNML and
GGCHS.
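The generation scheme described above can be reproduced with standard library tools. The following sketch substitutes Python's random number generators for the IMSL routines GGNML and GGCHS:

```python
import math
import random

def simulate_g(delta, n, rng):
    """One simulated effect size estimate via the identity
    g = X / sqrt(S / m), where X ~ N(delta, 2/n) and S is chi-square
    with m = 2n - 2 degrees of freedom (equal group sizes n).  The
    chi-square variate is built as a sum of squared standard normals."""
    m = 2 * n - 2
    x = rng.gauss(delta, math.sqrt(2.0 / n))
    s = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m))
    return x / math.sqrt(s / m)
```

Averaging many replicates recovers a value near $\delta$ (apart from the small positive small-sample bias of $g$).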
The statistics HT and Hw are very similar, as are their large sample
approximations. Thus we simulated only the distribution of $H_T$. Two thousand
sets of effect sizes were generated for a large number of different configurations of sample sizes. We then calculated the proportion of obtained $H_T$
statistics that exceeded various critical values of the appropriate chi-square
distribution. Table I presents some representative results for $M = 2$ and
$M = 5$. The chi-square approximation to the distribution of the $H_T$ statistic
appears to provide reasonably accurate significance levels whenever sample
sizes exceed 10 per group. Most applications of these methods are likely to
involve sample sizes much larger than 10 per group; therefore, the significance
levels obtained by the procedures used here are not likely to be grossly
inaccurate.
We studied the distribution of HB by examining the distribution of g_i·. This method has the advantage of providing information on the accuracy of the large sample approximation to the distribution of contrasts among the g_i·. Two thousand sets of effect sizes were generated for each of a large number of sample size combinations. The large sample approximation to the distribution of g_i· was used to calculate confidence intervals for δ_i. The proportion of these confidence intervals that contained δ_i was then calculated. Some representative results are given in Table II. These results suggest that the bias of g_i· is small and that the large sample approximation gives reasonably accurate significance levels.
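The coverage calculation can be sketched the same way. The variance expression is again the standard asymptotic one, assumed rather than quoted; the simulation records the proportion of nominal 95 percent intervals that contain δ:

```python
import numpy as np

rng = np.random.default_rng(2)

def ci_for_delta(g, n, z=1.960):
    """Large-sample confidence interval for delta given an estimate g from
    a study with n subjects per group (asymptotic variance assumed to be
    v = 2/n + g**2/(4*n))."""
    half = z * np.sqrt(2.0 / n + g**2 / (4.0 * n))
    return g - half, g + half

# 2,000 simulated studies with delta = .50 and n = 20 per group
delta, n, reps = 0.50, 20, 2000
m = 2 * n - 2
g = rng.normal(delta, np.sqrt(2.0 / n), reps) / np.sqrt(rng.chisquare(m, reps) / m)
lo, hi = ci_for_delta(g, n)
coverage = np.mean((lo < delta) & (delta < hi))  # near the nominal .95
```

The empirical coverage comes out close to the nominal 95 percent, in line with the Table II entries.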
One might argue that the small sample accuracy of the approximations to the distributions of g_i· and HT could be improved by replacing g_ij with Hedges' (1981) unbiased estimator or by using the exact variance of g_ij in place of the asymptotic variance. Simulation studies using these alternative methods did not yield demonstrably better results than the methods suggested here.
Example
It has been asserted that open education programs, which emphasize student interaction and self-direction, would enhance the cooperativeness of students (Horwitz, 1979). In a recent review of the research studies on the effectiveness of open education programs, Hedges, Giaconia, and Gage (1981) found several studies assessing the effects of open education on cooperativeness. Six of the studies that they examined provided summary statistics on measures of cooperativeness that were thought to be linearly equatable. Each of these studies compared the mean cooperativeness score of a group of children from an open education program with that of a group of children from a conventional educational setting. Table III is a summary of the effect size data from these six studies. Estimates of the reliability of the dependent variables were not available, so the value 1 was used for ρ_i for all studies.
One dimension on which the studies differ is the definition of the independent variable "openness." Three studies selected their "open" classrooms on the basis of systematic observation in these classrooms, using observation scales designed to assess open teaching. The other three studies did not use systematic observations to determine whether the teaching in their nominally open classrooms reflected an accepted definition of open teaching. In some cases it seemed that open space architecture, rather than open teaching, was the true independent variable. That is, classrooms without walls were defined as open classrooms. In the studies that did not use systematic observations, the actual
TABLE I
Small Sample Accuracy of Significance Levels for the Homogeneity Statistic HT

                                                     Proportion of Test Statistics Exceeding the Critical
                              Mean       Variance    Value of a Chi-Square at Nominal Significance Level
Sample Sizes           δ      of HT      of HT        .40    .30    .20    .10    .05    .01

10, 10                .25    1.082      2.61322      .400   .307   .211   .117   .064   .015
10, 20                .25    1.060      2.44510      .414   .310   .204   .104   .052   .017
10, 50                .25    1.050      2.36859      .408   .306   .204   .111   .058   .014
20, 20                .25     .986      1.99147      .395   .298   .195   .099   .044   .012
50, 50                .25     .965      1.99601      .386   .281   .186   .090   .051   .012
10, 10                .50    1.032      2.12215      .407   .307   .203   .107   .049   .011
10, 20                .50     .968      1.98753      .385   .279   .192   .094   .045   .009
10, 50                .50     .977      1.96954      .391   .288   .189   .105   .051   .009
20, 20                .50    1.010      2.00536      .397   .298   .197   .103   .051   .012
50, 50                .50    1.059      2.19370      .415   .307   .202   .116   .059   .014
10, 10               1.00    1.082      2.52032      .409   .316   .217   .113   .056   .017
10, 20               1.00    1.010      1.93902      .406   .306   .209   .102   .048   .010
10, 50               1.00    1.071      2.19436      .428   .321   .216   .102   .057   .013
20, 20               1.00     .988      1.91693      .405   .294   .190   .102   .047   .009
50, 50               1.00    1.003      2.11199      .391   .290   .197   .100   .057   .009
10, 10               1.50    1.101      2.36588      .418   .327   .224   .119   .065   .014
10, 20               1.50    1.040      2.12038      .409   .313   .214   .108   .054   .011
10, 50               1.50    1.035      1.96059      .411   .325   .220   .105   .048   .006
20, 20               1.50    1.093      2.51671      .415   .317   .219   .113   .056   .016
50, 50               1.50    1.072      2.43926      .407   .297   .213   .111   .059   .012

10, 10, 10, 10, 10    .25    4.147      8.50930      .417   .320   .223   .107   .059   .013
10, 10, 10, 50, 50    .25    4.022      8.48405      .407   .311   .212   .099   .049   .013
20, 20, 20, 20, 20    .25    4.049      8.82977      .408   .305   .203   .104   .054   .013
20, 20, 20, 50, 50    .25    3.995      7.96546      .402   .296   .205   .103   .049   .009
50, 50, 50, 50, 50    .25    3.917      7.51802      .403   .307   .199   .097   .038   .006
10, 10, 10, 10, 10    .50    4.159      8.67688      .431   .325   .211   .120   .059   .014
10, 10, 10, 50, 50    .50    4.131      8.75718      .419   .317   .218   .115   .054   .012
20, 20, 20, 20, 20    .50    4.050      7.87793      .411   .307   .205   .108   .055   .009
20, 20, 20, 50, 50    .50    4.054      8.09656      .409   .307   .208   .100   .053   .010
50, 50, 50, 50, 50    .50    4.052      8.31733      .404   .304   .207   .104   .049   .012
10, 10, 10, 10, 10   1.00    4.226      9.17542      .433   .328   .220   .113   .060   .017
10, 10, 10, 50, 50   1.00    4.151      8.65947      .424   .319   .212   .108   .060   .016
20, 20, 20, 20, 20   1.00    3.956      8.01181      .384   .286   .202   .103   .054   .009
20, 20, 20, 50, 50   1.00    3.979      7.68662      .397   .303   .198   .099   .053   .006
50, 50, 50, 50, 50   1.00    3.980      7.93969      .399   .296   .203   .093   .046   .011
10, 10, 10, 10, 10   1.50    4.189      8.20302      .432   .335   .231   .111   .051   .010
10, 10, 10, 50, 50   1.50    4.227      8.98480      .432   .329   .226   .119   .062   .012
20, 20, 20, 20, 20   1.50    4.142      8.50856      .416   .322   .210   .112   .060   .011
20, 20, 20, 50, 50   1.50    4.151      9.21155      .418   .318   .215   .111   .055   .015
50, 50, 50, 50, 50   1.50    4.170      8.99356      .417   .321   .230   .116   .057   .013

Note. These data are based on 2,000 replications. Sample sizes reflect the number of observations within each group of an experiment.
TABLE II
Small Sample Accuracy of Confidence Intervals Based on the Large Sample Distribution of g_i·

                                                      Proportion of Confidence Intervals Containing
                              Mean       Variance     δ with Nominal Significance Level
Sample Sizes           δ      of g_i·    of g_i·       .60    .70    .80    .90    .95    .99

10, 10                .25     .254      .10438        .590   .704   .802   .908   .956   .991
10, 20                .25     .259      .06704        .615   .716   .813   .908   .951   .991
10, 50                .25     .249      .03337        .609   .712   .813   .905   .948   .990
20, 20                .25     .258      .05133        .607   .698   .802   .897   .951   .988
50, 50                .25     .252      .01992        .593   .701   .797   .908   .956   .992
10, 10                .50     .505      .10531        .608   .709   .813   .903   .953   .988
10, 20                .50     .497      .07016        .604   .706   .805   .897   .956   .991
10, 50                .50     .493      .03585        .585   .680   .783   .895   .947   .989
20, 20                .50     .496      .05377        .590   .691   .802   .905   .949   .988
50, 50                .50     .494      .02099        .603   .706   .810   .902   .944   .989
10, 10               1.00    1.000      .12042        .592   .706   .799   .904   .945   .990
10, 20               1.00    1.002      .07588        .613   .718   .809   .902   .949   .989
10, 50               1.00     .998      .03878        .596   .685   .787   .902   .949   .991
20, 20               1.00     .998      .05440        .610   .716   .817   .910   .956   .991
50, 50               1.00     .998      .02220        .603   .712   .812   .905   .954   .991
10, 10               1.50    1.520      .13310        .592   .692   .798   .910   .960   .992
10, 20               1.50    1.514      .09173        .592   .686   .801   .907   .946   .988
10, 50               1.50    1.506      .04350        .596   .694   .799   .895   .952   .991
20, 20               1.50    1.509      .06638        .597   .703   .805   .899   .946   .991
50, 50               1.50    1.504      .02582        .587   .695   .796   .903   .953   .990

10, 10, 10, 10, 10    .25     .252      .04051        .599   .698   .806   .912   .958   .992
10, 10, 10, 50, 50    .25     .251      .01531        .615   .709   .817   .902   .949   .990
20, 20, 20, 20, 20    .25     .255      .02155        .598   .695   .790   .897   .941   .986
20, 20, 20, 50, 50    .25     .252      .01287        .613   .701   .798   .902   .948   .985
50, 50, 50, 50, 50    .25     .250      .00800        .605   .699   .803   .905   .956   .990
10, 10, 10, 10, 10    .50     .504      .04258        .614   .699   .798   .907   .950   .987
10, 10, 10, 50, 50    .50     .494      .01591        .591   .698   .806   .905   .948   .989
20, 20, 20, 20, 20    .50     .497      .02096        .595   .694   .793   .900   .956   .990
20, 20, 20, 50, 50    .50     .496      .01368        .593   .688   .781   .896   .939   .986
50, 50, 50, 50, 50    .50     .496      .00821        .610   .712   .814   .903   .944   .988
10, 10, 10, 10, 10   1.00     .992      .04571        .616   .711   .804   .907   .946   .989
10, 10, 10, 50, 50   1.00     .994      .01739        .614   .711   .796   .899   .946   .993
20, 20, 20, 20, 20   1.00     .994      .02170        .613   .712   .809   .912   .956   .991
20, 20, 20, 50, 50   1.00     .994      .01450        .595   .692   .793   .890   .945   .994
50, 50, 50, 50, 50   1.00     .997      .00910        .602   .701   .791   .900   .956   .988
10, 10, 10, 10, 10   1.50    1.498      .05365        .603   .696   .804   .899   .948   .987
10, 10, 10, 50, 50   1.50    1.499      .02000        .591   .691   .798   .902   .953   .992
20, 20, 20, 20, 20   1.50    1.501      .02611        .595   .710   .803   .900   .948   .990
20, 20, 20, 50, 50   1.50    1.497      .01556        .597   .710   .813   .899   .952   .994
50, 50, 50, 50, 50   1.50    1.500      .01000        .614   .714   .801   .909   .955   .992

Note. These data are based on 2,000 replications. Sample sizes reflect the number of observations within each group of an experiment.
TABLE III
Data from Six Studies on the Effects of Open Education on Cooperativeness

         Observations    Grade
Study    Used            Level    n_i^E    n_i^C     g_i      σ̂²(g_i)
1        No              4          30       30      .181     .06694
2        No              5          30       30     -.521     .06893
3        No              3         280      290     -.131     .00703
4        Yes             3           6       11      .959     .28462
5        Yes             3          44       40      .097     .04778
6        Yes             4          37       55      .425     .04619

Note. These data are from Hedges, Giaconia, & Gage (1981).
treatment was unclear. Thus the studies using systematic observations were
believed to have greater treatment fidelity than the other studies.
The methods described in this paper were applied to the data in Table III.
The first step was to calculate the overall fit statistic HT to determine whether the model of a single effect size fit the data well. The value of HT obtained was 13.843. Comparing this with the 95th percentile of the chi-square distribution with five degrees of freedom (11.07), we found that the fit of the data to the model of a single effect size was rejected.
The next step was to divide the studies into two classes: studies that defined openness on the basis of systematic observation and those that did not use systematic observation. The value of the fit statistic Hw was calculated as Hw = 6.368. Comparing 6.368 with the percentage points of the chi-square distribution with four degrees of freedom, we saw that a value of Hw this large would arise between 10 and 25 percent of the time under the hypothesis that effect sizes are homogeneous within classes. Therefore the fit of the data to the model of a separate effect size for each class was not rejected.
It was also noted that the values of the separate within-class fit statistics were Hw1 = 2.713 for the studies that used observations and Hw2 = 3.655 for the studies that did not. The between-class fit statistic HB was obtained as HB = HT − Hw = 7.475. Comparing the value of HB with the 95th percentile of the chi-square distribution with one degree of freedom (3.84), we saw that there was a significant difference between effect sizes for the classes.
The estimates of effect size for each class are g1· = .317 for the studies using observations, and g2· = −.137 for the studies that did not use observations. Asymptotic 95 percent confidence intervals for δ1 and δ2 are
.167 < δ1 < .467,
and
−.426 < δ2 < .152.
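The arithmetic of this example can be reproduced from the entries of Table III alone. The sketch below recomputes HT, the within-class statistics, and the partition HB = HT − HW from the tabled estimates and their estimated variances (the helper name fit_stat is mine, not the paper's):

```python
import numpy as np

# Effect size estimates and estimated variances from Table III;
# studies 4-6 used systematic classroom observations.
g = np.array([0.181, -0.521, -0.131, 0.959, 0.097, 0.425])
v = np.array([0.06694, 0.06893, 0.00703, 0.28462, 0.04778, 0.04619])
obs = np.array([False, False, False, True, True, True])
w = 1.0 / v  # reciprocal-variance weights

def fit_stat(g, w):
    """Return the weighted homogeneity statistic and the weighted mean."""
    g_bar = np.sum(w * g) / np.sum(w)
    return np.sum(w * (g - g_bar) ** 2), g_bar

H_T, _ = fit_stat(g, w)               # overall fit statistic, about 13.84
H_w1, g1 = fit_stat(g[obs], w[obs])   # about 2.71; g1 about .317
H_w2, g2 = fit_stat(g[~obs], w[~obs]) # about 3.65; g2 about -.137
H_W = H_w1 + H_w2                     # within-class fit, about 6.37
H_B = H_T - H_W                       # between-class fit, about 7.48
```

Each quantity matches the value reported in the text to within rounding of the tabled inputs.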
Therefore the effect of open education programs on cooperation was positive for the studies that identified open teaching via systematic classroom observations, but not for the studies that did not use systematic observations.
Acknowledgements
This research was supported by the Spencer Foundation. I thank Betsy Becker for
helpful comments and for programming the simulation study reported in this paper.
References
Glass, G. V. Primary, secondary, and meta-analysis of research. Educational Researcher, 1976, 5, 3-8.
Hedges, L. V. Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 1981, 6, 107-128.
Hedges, L. V. Estimating effect size from a series of independent experiments. Psychological Bulletin, 1982, 92, in press.
Hedges, L. V., Giaconia, R. M., & Gage, N. L. The empirical evidence on the effects of open education (Final report of the Stanford Research Synthesis Project, volume II). Stanford, Calif.: Stanford University School of Education, 1981.
Horwitz, R. A. Psychological effects of the "open classroom." Review of Educational Research, 1979, 49, 71-86.
International Mathematical and Statistical Libraries, Inc. IMSL Library 1 (7th ed.). Houston, Texas, 1977.
Kulik, J. A., Kulik, C. C., & Cohen, P. A. A meta-analysis of outcome studies of Keller's personalized system of instruction. American Psychologist, 1979, 34, 307-318.
Light, R. J., & Smith, P. V. Accumulating evidence: Procedures for resolving contradictions among different studies. Harvard Educational Review, 1971, 41, 429-471.
Pillemer, D. B., & Light, R. J. Synthesizing outcomes: How to use research evidence from many studies. Harvard Educational Review, 1980, 50, 176-195.
Rao, C. R. Linear statistical inference and its applications. New York: McGraw-Hill, 1973.
Author
HEDGES, LARRY V., University of Chicago, Department of Education, 5835
Kimbark Avenue, Chicago, IL 60637