36-Green.qxd 12/30/2005 12:51 PM Page 599

36

Generalizability Theory

Richard J. Shavelson
Stanford University

Noreen M. Webb
University of California, Los Angeles

Generalizability (G) theory is a statistical theory for evaluating the dependability (or reliability) of behavioral measurements (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; see also Brennan, 2001; Shavelson & Webb, 1991). G theory permits the researcher to address questions such as: Is the sampling of tasks or of judges the major source of measurement error? Can I improve the reliability of the measurement more by increasing the number of tasks or the number of judges, or is some combination of the two more effective? Are the test scores sufficiently reliable to support a certification decision about the level of a person's performance?

G theory grew out of the recognition that the undifferentiated error in classical test theory (Feldt & Brennan, 1989) provided too gross a characterization of the potential and/or actual sources of measurement error. In classical test theory, measurement error is undifferentiated random variation; the theory does not distinguish among its possible sources. G theory pinpoints the sources of systematic and unsystematic error variation, disentangles them, and estimates each one. Moreover, in contrast to the classical parallel-test assumptions of equal observed-score means, variances, and covariances, G theory assumes only randomly parallel tests sampled from the same universe. Finally, whereas classical test theory focuses on relative (rank-order) decisions (e.g., student admission to selective colleges), G theory distinguishes between relative ("norm-referenced") and absolute ("criterion-" or "domain-referenced") decisions for which a behavioral measurement is used.

In G theory, a behavioral measurement (e.g., a test score) is conceived of as a sample from a universe of admissible observations.
This universe consists of all possible observations that decision makers consider to be acceptable substitutes (e.g., scores sampled on Occasions 2 and 3) for the observation in hand (scores on Occasion 1). A measurement situation has characteristic features such as test form, test item, rater, and/or test occasion. Each characteristic feature is called a facet of a measurement. A universe of admissible observations, then, is defined by all possible combinations of the levels of the facets (e.g., items, occasions).

Consider a generalizability study of students' scores on a measure of academic self-concept. Suppose students (persons) responded to three self-concept items randomly selected from a large domain of such items on each of two randomly selected occasions (see Table 36-1).

TABLE 36-1
Crossed Person × Item × Occasion G Study of Self-Concept Scores

                  Occasion I                 Occasion II
Person    Item 1   Item 2   Item 3    Item 1   Item 2   Item 3
1            4        3        2         2        1        3
2            5        4        3         4        4        3
3            3        2        2         4        3        4
…
p            4        5        4         3        4        2
…
N            3        4        4         3        3        3

The items asked students to evaluate how well they do in academic settings (e.g., "I do well in school") on a Likert-type scale with scores ranging from 1 to 5. The scale was administered twice over roughly a 2-week interval. In this G study, students (persons) are the object of measurement¹ and both items and occasions are facets of the measurement. The universe of admissible observations includes all possible items and occasions that a decision maker would be equally willing to interpret as bearing on students' academic self-concept.

To pinpoint different sources of measurement error, G theory extends earlier analysis-of-variance approaches to reliability. It estimates the variation in scores due to each person, each facet, and their combinations (interactions).
More specifically, G theory estimates the components of observed-score variance contributed by the object of measurement, the facets, and their combinations. In this way, the theory isolates different sources of score variation in measurements. In practice, the analysis of variance is used to estimate the variance components; in contrast to experimental studies, it is not used to formally test hypotheses.

Continuing with the self-concept example, note that the student is the object of measurement and each student's observed score is decomposed into a component for student; item; occasion; and combinations (interactions) of student, item, and occasion. The student component of the score reflects systematic variation in students' self-appraisals of their academic ability, giving rise to variability among students (reflected by the student or person variance component). The other score components reflect sources of error. For example, a good occasion (e.g., one following a schoolwide announcement that the student body had received a community award) would tend to raise all students' self-evaluations, giving rise to mean differences from one occasion to the next (indexed by the occasion variance component), whereas the particular wording of an item might lead certain students to more-negative-than-typical self-evaluations relative to other students, giving rise to a nonzero person × item interaction (p × i variance component).

The theory describes the dependability (reliability) of generalizations made from a person's observed score on a test to the score he or she would obtain in the broad universe of admissible observations, his or her "universe score" (the counterpart of the true score in classical test theory). Hence the name, Generalizability Theory.

¹ In behavioral research, the person is typically considered the object of measurement.
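To make this decomposition concrete, here is a small numeric sketch. It is an illustration we supply, not part of the chapter: it uses numpy, arbitrary simulated scores, and arbitrary design sizes, and shows that any crossed person × item × occasion data array splits exactly into a grand mean, main effects, two-way interaction effects, and a residual.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores for 100 persons x 3 items x 2 occasions
# (sizes are arbitrary; any crossed p x i x o array works).
X = rng.integers(1, 6, size=(100, 3, 2)).astype(float)

mu = X.mean()                                # grand mean
p_eff = X.mean(axis=(1, 2)) - mu             # person effects
i_eff = X.mean(axis=(0, 2)) - mu             # item effects
o_eff = X.mean(axis=(0, 1)) - mu             # occasion effects

# Two-way interaction effects: cell mean minus main effects and grand mean
pi = X.mean(axis=2) - mu - p_eff[:, None] - i_eff[None, :]
po = X.mean(axis=1) - mu - p_eff[:, None] - o_eff[None, :]
io = X.mean(axis=0) - mu - i_eff[:, None] - o_eff[None, :]

# Residual: whatever remains after all lower-order effects are removed
resid = (X - mu
         - p_eff[:, None, None] - i_eff[None, :, None] - o_eff[None, None, :]
         - pi[:, :, None] - po[:, None, :] - io[None, :, :])

# The components add back up to the observed scores exactly
recon = (mu + p_eff[:, None, None] + i_eff[None, :, None] + o_eff[None, None, :]
         + pi[:, :, None] + po[:, None, :] + io[None, :, :] + resid)
assert np.allclose(recon, X)
```

In a G study, the variances of these score components (not the sample effects themselves) are the quantities of interest.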
G theory recognizes that an assessment might be adapted for particular decisions, and so it distinguishes a generalizability (G) study from a decision (D) study. In a G study, the universe of admissible observations is defined as broadly as possible (items, occasions, raters if appropriate, etc.) to provide variance-component estimates useful to a wide variety of decision makers. A D study typically selects only some facets for a particular purpose, thereby narrowing the score interpretation to a universe of generalization. A different generalizability (reliability) coefficient can then be calculated for each particular use of the assessment. In the self-concept example, we might decide to use only one occasion and perhaps six items for decision-making purposes, and the G coefficient could be calculated to reflect this proposed use.

In the remainder of this chapter we take up, in more detail, G studies, D studies, and the design of G and D studies. We then sketch the multivariate version of G theory and end with a section on additional topics. Before proceeding, one caveat is in order. At times you will find some complicated equations; we include them for completeness. We hope that the text provides sufficient explanation for readers who are less interested in the technical details to follow along conceptually.

GENERALIZABILITY STUDIES

A G study is designed specifically to isolate and estimate as many facets of measurement error as is reasonably and economically feasible. The study includes the most important facets that a variety of decision makers might wish to generalize over (e.g., items, forms, occasions, raters). Typically, "crossed" designs are used, in which all individuals are measured on all levels of all facets. In our example, all students (persons) in a sample (sample size N) responded to the same three self-concept items on two occasions (Table 36-1).
A crossed design provides maximal information about the contributions to the total observed-score variation of the object of measurement (universe-score or desirable variation, analogous to true-score variance), the facets, and their combinations.

Universe of Generalization

The universe of generalization is defined as the set of facets and their levels (e.g., items and occasions) to which a decision maker wants to generalize. A person's universe score (denoted µp) is defined as the long-run average or, more technically, the "expected value" of his or her observed scores over all observations in the universe of generalization (analogous to a person's "true score" in classical test theory).

Components of the Observed Score

After the decision maker specifies the universe of generalization, an observed measurement collected in a G study can be decomposed into a component (effect) for the universe score and one or more error components. Consider a two-facet crossed p × i × o (person by item by occasion) design in which items and occasions have been randomly selected (random-effects model). The object of measurement, here persons, is not a source of error and, therefore, is not a facet. In the p × i × o design with generalization over all admissible test items and occasions taken from an indefinitely large universe, the components of an observed score (Xpio) for a particular person (p) on a particular item (i) and occasion (o) are:

  Xpio = µ                                               grand mean
       + (µp − µ)                                        person effect
       + (µi − µ)                                        item effect
       + (µo − µ)                                        occasion effect          (36.1)
       + (µpi − µp − µi + µ)                             person × item effect
       + (µpo − µp − µo + µ)                             person × occasion effect
       + (µio − µi − µo + µ)                             item × occasion effect
       + (Xpio − µpi − µpo − µio + µp + µi + µo − µ)     residual

where µ = EpEiEoXpio and µp = EiEoXpio, with E denoting expectation; the other terms in (36.1) are defined analogously.
Except for the grand mean, µ, each observed-score component varies from one level to another; for example, items vary in difficulty on a test. Assuming a random-effects model, the distribution of each component or "effect," except for the grand mean, has a mean of zero and a variance σ² (called the variance component). The variance component for the person effect is σ²p = Ep(µp − µ)². This variance component is called the universe-score variance and is analogous to true-score variance in classical test theory. The variance components for the other effects are defined similarly. The residual variance component, σ²pio,e, reflects the person × item × occasion interaction confounded with residual error, because there is one observation per cell (see the scores in Table 36-1). The collection of observed scores, Xpio, has a variance, σ²Xpio = EpEiEo(Xpio − µ)², which equals the sum of the variance components:

  σ²Xpio = σ²p + σ²i + σ²o + σ²pi + σ²po + σ²io + σ²pio,e        (36.2)

Each variance component can be estimated from a traditional analysis of variance (or by other methods such as maximum likelihood; e.g., Searle, 1987). For our example (Table 36-1), we would run a person × item × occasion random-effects ANOVA and estimate the variance components from the mean squares (Table 36-2; see Shavelson & Webb, 1991, for how to do this). The relative magnitudes of the estimated variance components, except for σ̂²p, provide information about potential sources of error influencing a behavioral measurement. Statistical tests are not used in G theory; instead, standard errors for variance-component estimates provide information about the sampling variability of the estimates (e.g., Brennan, 2001). In our example, the estimated person (universe-score) variance, σ̂²p (1.108), is fairly large compared to the other components (30% of total variation).
This shows that, averaging over items and occasions, persons in the sample differed systematically in their self-concepts. Because persons constitute the object of measurement, not error, this variability represents systematic individual differences in self-concept, analogous to classical test theory's true-score variance. The other large estimated variance components concern the item facet more than the occasion facet. The nonnegligible² σ̂²i (3% of the total variation) shows that items varied somewhat in difficulty level. The large σ̂²pi (22%) reflects different relative standings of persons across items. The small σ̂²o (1% of the total variation) indicates that performance was stable across occasions, averaging over persons and items. The nonnegligible σ̂²po (6%) shows that the relative standing of persons differed somewhat across occasions. The zero σ̂²io indicates that the rank ordering of item difficulty was the same across occasions. Finally, the large σ̂²pio,e (38%) reflects the varying relative standing of persons across occasions and items and/or other sources of error not systematically incorporated into the G study.

TABLE 36-2
Estimated Variance Components in the Example p × i × o Design

Source            Variance Component    Estimate    Percent of Total Variability
Person (p)        σ²p                   1.108       30
Item (i)          σ²i                   0.102        3
Occasion (o)      σ²o                   0.030        1
p × i             σ²pi                  0.810       22
p × o             σ²po                  0.230        6
i × o             σ²io                  0.001        0
p × i × o, e      σ²pio,e               1.413       38

² Even small variance components can give rise to large confidence intervals. See Shavelson and Webb (1991, p. 13).

DECISION STUDIES

Generalizability theory distinguishes a D study from a G study. The D study uses information from a G study to design a measurement procedure that minimizes error for a particular purpose.
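The estimates in Table 36-2 come from solving the expected-mean-square (EMS) equations of the random-effects ANOVA. The sketch below is illustrative, not the chapter's own software: the function implements the standard EMS solutions for a fully crossed two-facet random design, and the mean-square values are hypothetical numbers chosen to be consistent with Table 36-2 under assumed design sizes of 100 persons, 3 items, and 2 occasions (the chapter does not report the mean squares or the number of persons).

```python
# Sketch: solve the expected-mean-square (EMS) equations of a fully crossed
# random-effects p x i x o ANOVA for the variance components.

def variance_components(ms, n_p, n_i, n_o):
    """ms: mean squares keyed 'p', 'i', 'o', 'pi', 'po', 'io', 'pio_e'."""
    v = {'pio_e': ms['pio_e']}                      # residual: direct estimate
    v['pi'] = (ms['pi'] - ms['pio_e']) / n_o
    v['po'] = (ms['po'] - ms['pio_e']) / n_i
    v['io'] = (ms['io'] - ms['pio_e']) / n_p
    v['p'] = (ms['p'] - ms['pi'] - ms['po'] + ms['pio_e']) / (n_i * n_o)
    v['i'] = (ms['i'] - ms['pi'] - ms['io'] + ms['pio_e']) / (n_p * n_o)
    v['o'] = (ms['o'] - ms['po'] - ms['io'] + ms['pio_e']) / (n_p * n_i)
    return v

# Hypothetical mean squares constructed to be consistent with Table 36-2
# under the assumed sizes n_p = 100, n_i = 3, n_o = 2.
ms = {'p': 10.371, 'i': 23.533, 'o': 11.203,
      'pi': 3.033, 'po': 2.103, 'io': 1.513, 'pio_e': 1.413}
vc = variance_components(ms, n_p=100, n_i=3, n_o=2)
# vc recovers the Table 36-2 estimates, e.g. vc['p'] = 1.108
```

Note how the formulas mirror the discussion later in the chapter: the residual is estimated directly, while main-effect components require combining several mean squares.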
In planning a D study, the decision maker defines the universe to which he or she wishes to generalize, called the universe of generalization, which may contain some or all of the facets and their levels in the universe of admissible observations (items, occasions, or both in our example). In the D study, decisions usually will be based on the mean over multiple observations (e.g., many self-concept items) rather than on a single observation. The mean score over a sample of n′i items and n′o occasions, for example, is denoted XpIO, in contrast to a score on a single item and occasion, Xpio. A two-facet, crossed D-study design in which decisions are to be made on the basis of XpIO is, then, denoted p × I × O.

Types of Decisions and Measurement Error

G theory recognizes that the decision maker might want to make two types of decisions based on a behavioral measurement: relative (norm-referenced) and absolute (criterion- or domain-referenced). A relative decision focuses on the rank order of persons; an absolute decision focuses on the level of performance, regardless of rank.

Measurement Error for Relative Decisions. A relative decision focuses on the rank ordering of individuals (e.g., norm-referenced interpretations of test scores). Decisions about college or job selection are relative, as are decisions based on correlational studies (correlations depend on the consistency with which variables X and Y rank order individuals). For relative decisions, the error in a random-effects p × I × O design is defined as:

  δpIO = (XpIO − µIO) − (µp − µ)        (36.3)

where µp = EIEOXpIO and µIO = EpXpIO.
The variance of the errors for relative decisions is:

  σ²δ = EpEIEOδ²pIO = σ²pI + σ²pO + σ²pIO,e = σ²pi/n′i + σ²po/n′o + σ²pio,e/(n′i n′o)        (36.4)

Notice that the "main effects" of item and occasion do not enter into error for relative decisions because, for example, all people respond on both occasions, so any difference between occasions affects all persons and does not change their rank order. In our study, suppose we decided to increase the number of items on the self-concept scale to 10 and to use the questionnaire on two occasions: n′i = 10 and n′o = 2. Substituting, we have:

  σ̂²δ = .810/10 + .230/2 + 1.413/(10 × 2) = 0.267

Simply put, in order to reduce σ²δ, n′i and n′o may be increased in a manner analogous to the Spearman-Brown prophecy formula in classical test theory and the standard error of the mean in sampling theory.

Measurement Error for Absolute Decisions. An absolute decision focuses on the level of an individual's performance independent of others' performance (cf. domain-referenced interpretations). For example, in California a minimum passing score on the drivers examination is 80% correct, regardless of how others perform on the test. For absolute decisions, the error in a random-effects p × I × O design is defined as:

  ∆pIO = XpIO − µp        (36.5)

and the variance of the errors is:

  σ²∆ = EpEIEO∆²pIO = σ²I + σ²O + σ²pI + σ²pO + σ²IO + σ²pIO,e
      = σ²i/n′i + σ²o/n′o + σ²pi/n′i + σ²po/n′o + σ²io/(n′i n′o) + σ²pio,e/(n′i n′o)        (36.6)

Note that, with absolute decisions, the main effects of items and occasions (how difficult an item is, or whether one occasion provided a more hospitable atmosphere for responding to self-concept items than another) do affect the level of self-concept measured even though neither changes the rank order. Consequently, they are included in the definition of measurement error. Also note that σ²∆ ≥ σ²δ.
Substituting values from Table 36-2, as was done earlier for relative decisions, provides a numerical estimate of measurement error for absolute decisions:

  σ̂²∆ = .102/10 + .030/2 + .810/10 + .230/2 + .001/(10 × 2) + 1.413/(10 × 2) = 0.292

RELIABILITY COEFFICIENTS

Although G theory stresses the interpretation of variance components and measurement error, it provides summary coefficients that are analogous to the reliability coefficient in classical test theory (recall: true-score variance divided by observed-score variance, i.e., an intraclass correlation). The theory distinguishes between a Generalizability Coefficient for relative decisions and an Index of Dependability for absolute decisions.

Generalizability Coefficient

The Generalizability (G) Coefficient is analogous to the reliability coefficient in classical test theory. It is the ratio of the universe-score variance to the expected observed-score variance, i.e., an intraclass correlation. For relative decisions and a p × I × O random-effects design, the generalizability coefficient is:

  Eρ²(XpIO, µp) = Eρ² = Ep(µp − µ)² / EpEIEO(XpIO − µIO)² = σ²p / (σ²p + σ²δ)        (36.7)

From Table 36-2, we can calculate an estimate of the G coefficient: Eρ̂² = 1.108/(1.108 + 0.267) = 0.806. In words, the estimated proportion of observed-score variance due to universe-score variance is 0.806.

Dependability Index

For absolute decisions with a p × I × O random-effects design, the index of dependability (Brennan, 2001; see also Kane & Brennan, 1977) is given in (36.8) below. Substituting estimates from Table 36-2, we can calculate the dependability index for a self-concept inventory with 10 items given on 2 occasions: Φ̂ = 1.108/(1.108 + .292) = 0.791. Notice that the dependability index is only slightly lower than the G coefficient because the variance components for the item and occasion main effects are quite small (Table 36-2).
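The error-variance and coefficient computations above can be collected into a few lines of code. This is an illustrative sketch we supply (not part of the chapter), plugging the Table 36-2 estimates into the D-study formulas with n′i = 10 items and n′o = 2 occasions:

```python
# Error variances and summary coefficients for the crossed p x I x O design,
# using the Table 36-2 variance-component estimates with n'_i = 10, n'_o = 2.
vc = {'p': 1.108, 'i': 0.102, 'o': 0.030,
      'pi': 0.810, 'po': 0.230, 'io': 0.001, 'pio_e': 1.413}
ni, no = 10, 2

# Relative error (36.4): only interactions with persons contribute.
rel_err = vc['pi'] / ni + vc['po'] / no + vc['pio_e'] / (ni * no)

# Absolute error (36.6): item and occasion main effects enter as well.
abs_err = rel_err + vc['i'] / ni + vc['o'] / no + vc['io'] / (ni * no)

g_coef = vc['p'] / (vc['p'] + rel_err)   # generalizability coefficient
phi = vc['p'] / (vc['p'] + abs_err)      # index of dependability

print(f"{rel_err:.3f} {abs_err:.3f}")   # 0.267 0.292
print(f"{g_coef:.3f} {phi:.3f}")        # 0.806 0.791
```

Because the absolute error adds the (small) item and occasion main-effect terms to the relative error, Φ is necessarily no larger than Eρ².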
  Φ = σ²p / (σ²p + σ²∆)        (36.8)

The right-hand sides of (36.7) and (36.8) are generic expressions that apply to any design and universe.

For domain-referenced decisions involving a fixed cutting score λ (often called criterion-referenced measurements), and assuming that λ is a constant specified a priori, the error of measurement is:

  ∆pIO = (XpIO − λ) − (µp − λ) = XpIO − µp        (36.9)

and the index of dependability is:

  Φλ = Ep(µp − λ)² / EpEIEO(XpIO − λ)² = (σ²p + (µ − λ)²) / (σ²p + (µ − λ)² + σ²∆)        (36.10)

An unbiased estimator of (µ − λ)² is (X̄ − λ)² − σ̂²X̄, where X̄ is the observed grand mean over sampled objects of measurement and sampled conditions of measurement in a D-study design, and σ̂²X̄ is the error variance involved in using the observed grand mean X̄ as an estimate of the grand mean (µ) over the population of persons and the universe of items and occasions. For the p × I × O random-effects design, σ̂²X̄ is:

  σ̂²X̄ = σ̂²p/n′p + σ̂²i/n′i + σ̂²o/n′o + σ̂²pi/(n′p n′i) + σ̂²po/(n′p n′o) + σ̂²io/(n′i n′o) + σ̂²pio,e/(n′p n′i n′o)        (36.11)

The estimate of Φλ is smallest when the cut score λ is equal to the observed grand mean X̄. In the data set presented in Table 36-1, X̄ = 3.500. For λ = 3.500, using n′p = 100 and the values in Table 36-2 gives Φ̂λ = 0.764. For λ = 2.000 (assuming self-concept scores above 2 fall within the "normal" range), Φ̂λ = 0.909.

STUDY DESIGN IN GENERALIZABILITY AND DECISION STUDIES

Generalizability theory allows the decision maker to use different designs in G and D studies. Typically, a crossed design is used in a G study. In a crossed design, all students are observed under each level of each facet; in our example, each student responds to each self-concept item on each occasion (see Table 36-1). The crossed design provides maximal information about the components of variation in observed self-concept scores.
In our example, seven different variance components can be estimated: one each for the main effects of person (σ̂²p), item (σ̂²i), and occasion (σ̂²o); the two-way interactions between person and item (σ̂²pi), person and occasion (σ̂²po), and item and occasion (σ̂²io); and a residual due to the person × item × occasion interaction confounded with random error (σ̂²pio,e).

In a D study, both crossed and nested designs should be considered. In a nested design, not all levels of one facet are paired with all levels of another facet. In our self-concept example, we might use one set of randomly sampled items (1-3) on Occasion 1 and another set of randomly sampled items (4-6) on Occasion 2. In this case we say that items are nested in occasions: Levels 1-3 of the item facet are paired with Occasion 1, and Levels 4-6 are paired with Occasion 2. In this way, six items rather than three are sampled for the D study; and the more items, the greater the reliability (generalizability), typically. Although G studies should use crossed designs whenever possible to avoid confounding of effects, D studies may use nested designs for convenience or to increase sample size, which typically reduces estimated error variance and, hence, increases estimated generalizability. For example, compare the error variance in a crossed p × I × O design with the error variance in a partially nested p × (I:O) design, where facet i is nested in facet o and n′ denotes the number of conditions of a facet under a decision maker's control.
In a crossed p × I × O design, the relative (σ²δ) and absolute (σ²∆) error variances are:

  σ²δ = σ²pI + σ²pO + σ²pIO,e = σ²pi/n′i + σ²po/n′o + σ²pio,e/(n′i n′o)        (36.12a)

and

  σ²∆ = σ²I + σ²O + σ²pI + σ²pO + σ²IO + σ²pIO,e
      = σ²i/n′i + σ²o/n′o + σ²pi/n′i + σ²po/n′o + σ²io/(n′i n′o) + σ²pio,e/(n′i n′o)        (36.12b)

In a nested p × (I:O) design,

  σ²δ = σ²pO + σ²pI:O = σ²po/n′o + σ²pi,pio,e/(n′i n′o)        (36.13a)

  σ²∆ = σ²O + σ²pO + σ²I:O + σ²pI:O = σ²o/n′o + σ²po/n′o + σ²i,io/(n′i n′o) + σ²pi,pio,e/(n′i n′o)        (36.13b)

In (36.12) and (36.13), σ²pi, σ²po, and σ²pio,e are directly available from a G study with design p × i × o; σ²i,io is the sum of σ²i and σ²io; and σ²pi,pio,e is the sum of σ²pi and σ²pio,e. To estimate σ²δ in a p × (I:O) design, for example, simply substitute estimated values for the variance components into (36.13a); similarly for (36.13b) to estimate σ²∆. Moreover, given cost, logistics, and other considerations, n′ can be manipulated to minimize error variance by trading off, in this example, items and occasions. Because of the difference in the designs, σ²δ is smaller in (36.13a) than in (36.12a), and σ²∆ is smaller in (36.13b) than in (36.12b).

From our example and Table 36-2, we find that the optimal D-study design need not be fully crossed. In this example, administering different items on each occasion (i:o) yields slightly higher estimated generalizability than does the fully crossed design; for 10 items and 2 occasions, Eρ̂² = 0.830 and Φ̂ = 0.818. The larger values of Eρ̂² and Φ̂ for the partially nested design than for the fully crossed design (Eρ̂² = 0.806 and Φ̂ = 0.791) are solely attributable to the differences between (36.12a) and (36.13a) and between (36.12b) and (36.13b).

Random and Fixed Facets

G theory is essentially a random-effects theory.
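Before taking up random and fixed facets, note that the crossed-versus-nested comparison above is easy to verify numerically. The sketch below is our illustration (not the chapter's code); it plugs the Table 36-2 estimates into the crossed and nested error-variance formulas with n′i = 10 and n′o = 2:

```python
# Crossed p x I x O versus partially nested p x (I:O) D-study designs,
# using the Table 36-2 estimates with n'_i = 10 and n'_o = 2.
vc = {'p': 1.108, 'i': 0.102, 'o': 0.030,
      'pi': 0.810, 'po': 0.230, 'io': 0.001, 'pio_e': 1.413}
ni, no = 10, 2

# Crossed design (36.12a, 36.12b).
rel_crossed = vc['pi'] / ni + vc['po'] / no + vc['pio_e'] / (ni * no)
abs_crossed = rel_crossed + vc['i'] / ni + vc['o'] / no + vc['io'] / (ni * no)

# Nested design (36.13a, 36.13b): i is confounded with i x o, and p x i with
# p x i x o, so the confounded sums are divided by n'_i * n'_o.
rel_nested = vc['po'] / no + (vc['pi'] + vc['pio_e']) / (ni * no)
abs_nested = rel_nested + vc['o'] / no + (vc['i'] + vc['io']) / (ni * no)

def coef(err):
    # Ratio of universe-score variance to itself plus error variance
    return vc['p'] / (vc['p'] + err)

print(f"crossed: {coef(rel_crossed):.3f} {coef(abs_crossed):.3f}")  # 0.806 0.791
print(f"nested:  {coef(rel_nested):.3f} {coef(abs_nested):.3f}")    # 0.830 0.818
```

The nested design's advantage here comes entirely from spreading σ²pi over n′i × n′o observations rather than n′i, exactly the difference between (36.12a) and (36.13a).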
Typically a random facet is created by randomly sampling levels of a facet (e.g., tasks from a job in observations of job performance). When the levels of a facet have not been sampled randomly from the universe of admissible observations but the intended universe of generalization is infinitely large, the concept of exchangeability may be invoked to consider the facet as random (Shavelson & Webb, 1981). A fixed facet (cf. a fixed factor in analysis of variance) arises when the decision maker (a) purposely selects certain conditions and is not interested in generalizing beyond them, (b) finds it unreasonable to generalize beyond the levels observed, or (c) includes the entire universe of levels in the measurement design because that universe is small. G theory typically treats fixed facets by averaging over the conditions of the fixed facet and examining the generalizability of the average over the random facets (Cronbach et al., 1972). When it does not make conceptual sense to average over the conditions of a fixed facet, a separate G study may be conducted within each condition of the fixed facet (Shavelson & Webb, 1991), or a full multivariate analysis may be performed with the levels of the fixed facet comprising a vector of dependent variables (Brennan, 2001; see below).

G theory recognizes that the universe of admissible observations in a G study may be broader than the universe of generalization of interest in a D study (e.g., a decision maker interested in only one occasion). The decision maker may reduce the levels of a facet (creating a fixed facet), select (and thereby control) one level of a facet, or ignore a facet. A facet is fixed in a D study when n′ = N′, where n′ is the number of levels for the facet in the D study and N′ is the total number of levels for the facet in the universe of generalization.
From a random-effects G study with design p × i × o in which the universe of admissible observations is defined by facets i and o of infinite size, fixing facet i in the D study and averaging over the ni conditions of facet i in the G study (ni = n′i) yields the following universe-score variance:

  σ²τ = σ²p + σ²pI = σ²p + σ²pi/n′i        (36.14)

where σ²τ denotes universe-score variance in generic terms. When facet i is fixed, the universe score is based on a person's average score over the levels of facet i, so the generic universe-score variance in (36.14) is the variance over persons' mean scores. Hence, (36.14) includes σ²pI as well as σ²p. Note that σ̂²τ is an unbiased estimate of universe-score variance for the mixed model only when the same levels of facet i are used in the G and D studies (Brennan, 2001). The relative and absolute error variances, respectively, are:

  σ²δ = σ²pO + σ²pIO,e = σ²po/n′o + σ²pio,e/(n′i n′o)        (36.15a)

and

  σ²∆ = σ²O + σ²pO + σ²IO + σ²pIO,e = σ²o/n′o + σ²po/n′o + σ²io/(n′i n′o) + σ²pio,e/(n′i n′o)        (36.15b)

And the generalizability coefficient and index of dependability, respectively, are:

  Eρ² = (σ²p + σ²pi/n′i) / (σ²p + σ²pi/n′i + σ²po/n′o + σ²pio,e/(n′i n′o))        (36.16a)

and

  Φ = (σ²p + σ²pi/n′i) / (σ²p + σ²pi/n′i + σ²o/n′o + σ²po/n′o + σ²io/(n′i n′o) + σ²pio,e/(n′i n′o))        (36.16b)

MULTIVARIATE GENERALIZABILITY

For behavioral measurements involving multiple scores describing individuals' personality, ability, or performance, multivariate generalizability can be used to (a) estimate the reliability of difference scores, observable correlations, or universe-score and error correlations for various D-study designs and sample sizes (Brennan, 2001); (b) estimate the reliability of a profile of scores using multiple regression of universe scores on the observed scores in the profile (Brennan, 2001; Cronbach et al., 1972); or (c) produce a composite of scores with maximum generalizability
(Shavelson & Webb, 1981). For all of these purposes, multivariate G theory decomposes both variances and covariances into components. In a two-facet, crossed p × i × o design with general self-concept divided into two dependent variables (academic and social self-concept), the observed scores for the two variables for person p observed under conditions i and o can be denoted 1Xpio and 2Xpio, respectively. The variances of the observed scores, σ²1Xpio and σ²2Xpio, are decomposed as in (36.2). The covariance, σ1Xpio,2Xpio, is decomposed in analogous fashion:

  σ1Xpio,2Xpio = σ1p,2p + σ1i,2i + σ1o,2o + σ1pi,2pi + σ1po,2po + σ1io,2io + σ1pio,e,2pio,e        (36.17)

In (36.17) the term σ1p,2p is the covariance between universe scores for academic and social self-concept. The remaining terms in (36.17) are error covariance components; the term σ1i,2i, for example, is the covariance between scores on academic and social self-concept due to the sampled levels of the item facet.

An important aspect of the development of multivariate G theory is the distinction between linked and unlinked conditions. The expected values of error covariance components are zero when the conditions for observing different variables are unlinked, that is, selected independently (e.g., the items used to obtain scores on one variable in a profile, academic self-concept, are selected independently of the items used to obtain scores on another variable, social self-concept). The expected values of error covariance components are nonzero when levels are linked or jointly sampled (e.g., scores on two variables in a profile come from the same items).
Joe and Woodward (1976) presented a G coefficient for a multivariate composite that maximizes the ratio of universe-score variation to universe-score plus error variation by using statistically derived weights for each dependent variable (academic and social self-concept in our example). Alternatives to maximizing the reliability of a composite are to determine variable weights on the basis of expert judgment or to use weights derived from a confirmatory factor analysis (Marcoulides, 1994).

ADDITIONAL TOPICS

We have only scratched the surface of G theory (believe it or not!). Here we treat a few additional topics that have practical consequences for using G theory; for details and advanced topics, see Brennan (2001). First, given the emphasis on estimated variance components in G theory, we consider the sampling variability of estimated variance components and how to estimate variance components, especially in unbalanced designs. Second, sometimes facets are "hidden" in a G study and are not accounted for in interpreting variance components. For example, in interpreting the substantial variability from one task to another (Shavelson, Baxter, & Gao, 1993), an occasion facet is "hidden" in that the tasks take place over time (Cronbach, Linn, Brennan, & Haertel, 1997; Shavelson, Ruiz-Primo, & Wiley, 1999). We briefly consider such a hidden facet below. Finally, it should come as no surprise that measurement error is not constant, as often assumed, but depends on the magnitude of a person's universe score; we treat this topic briefly as well.

Variance Component Estimates

Here we treat three concerns (among many) in estimating variance components. The first concern deals with the variability ("bounce") in variance-component estimates, the second with negative variance-component estimates (variances, σ², cannot be negative), and the third with unbalanced designs.

Variability in Variance-Component Estimates.
The first concern is that estimates of variance components may be unstable with the usual sample sizes (Cronbach et al., 1972). Here is why. To estimate variance components, we use mean squares from the analysis of variance; for example, we used mean squares from a person × item × occasion random-effects, repeated-measures analysis of variance to get the estimates in Table 36-2. The mean square for the residual (person × item × occasion interaction confounded with error) provides a direct estimate of σ²pio,e; i.e., σ̂²pio,e = MSpio,e. The mean square for the item × occasion interaction, however, is a bit more complex, containing information about both σ²pio,e and σ²io. Moving all the way up the analysis of variance table (or Table 36-2), we find that MSp contains information not only about σ²p but also about σ²pi, σ²po, and σ²pio,e, each of which is estimated from its corresponding mean squares. In general, as we move from the highest-order interaction (the residual) to the main effects, the number of mean squares involved in estimating a variance component increases. And the more mean squares that are involved, the larger the variability is likely to be in the estimates from one study to the next. Although exact confidence intervals for variance components are generally unavailable (exact distributions for variance-component estimates cannot be derived), approximate confidence intervals are available if one assumes normality or uses a resampling technique such as the bootstrap (illustrated in Brennan, 2001; for details, see Wiley, 2000).

Negative Estimated Variance Components. The second concern with variance-component estimation arises when a negative estimate occurs because of sampling error or model misspecification (Shavelson & Webb, 1981).
We can identify four possible solutions when negative estimates are small in relative magnitude. One is to substitute zero for the negative estimate and carry the zero through the other expected-mean-square equations from the analysis of variance, which produces biased estimates (Cronbach et al., 1972). A second is to set negative estimates to zero but use the negative values in the expected-mean-square equations for the other components (Brennan, 2001). A third is to use a Bayesian approach that sets a lower bound of zero on the estimated variance component (Shavelson & Webb, 1981). A fourth is to use maximum likelihood methods, which preclude negative estimates (Searle, 1987).

Variance Component Estimation with Unbalanced Designs.

An unbalanced design arises when the number of levels of a nested facet varies across levels of another facet or of the object of measurement. For example, if different judges observe the performance of each person, judge is nested within person; if, in addition, different numbers of judges are assigned to different persons, the unequal numbers make the design unbalanced. Although analysis of variance methods for estimating variance components are straightforward when applied to balanced data, have the advantage of requiring few distributional assumptions, and produce unbiased estimators, problems arise with unbalanced data.
The problems include the following: the total sums of squares can be decomposed in many different ways with no obvious basis for choosing among them (which leads to a variety of ways in which mean squares can be adjusted for other effects in the model); estimation is biased in mixed models (not a problem in G theory, because G theory averages over fixed facets in a mixed model and estimates only the variances of random effects, or handles mixed models via multivariate G theory); and the rules for deriving expected values of mean squares are algebraically and computationally complex. Brennan (2001) describes an analogous-ANOVA procedure for estimating variance components in G studies and illustrates estimation of error variances for some frequently encountered unbalanced D-study designs.

Hidden Facets

In some cases, two facets are linked such that as the levels of one facet vary, the levels of the other vary correspondingly. Because this may not be readily apparent, the linked facet is called a "hidden facet." The most notorious, and most easily understood, hidden facet is the occasion facet. Here is how it works. As a person proceeds through a test, for example, or performs a series of tasks, his or her performance occurs over time. Typically, variability in performance from task to task would be interpreted as task-sampling variability. While task is varying, however, the hidden facet, occasion, is varying too. What appears to be task-sampling variability may actually be occasion-sampling variability, and this alternative interpretation might change prescriptions for improving the dependability of the measurement. For example, Shavelson, Baxter, and Gao (1993) reported research on performance assessment in education and the military showing that task-sampling variability was consistently quite large and that a large sample of tasks was needed to obtain a reliable measure of performance. Cronbach, Linn, Brennan, and Haertel (1997), however, questioned this interpretation, pointing out the hidden facet of occasion.
The importance of this challenge is that, if the occasion facet is actually the cause, adding many tasks to address the task-sampling problem would not improve the dependability of the measurement. To resolve the issue, Shavelson, Ruiz-Primo, and Wiley (1999) re-examined some of the data from the 1993 report in a person × task × rater × occasion G study so that the effects of task and occasion could be separated. They found that both the task (person × task) and occasion (person × occasion) facets contributed variability, but the lion's share came from task sampling (person × task) and joint task-and-occasion sampling (person × task × occasion). Webb, Schlackman, and Sugrue (2000) reported similar results in a different study. The moral of the story is to be careful in interpreting variance components when occasion might be lurking in the background.

Nonconstant Error Variance for Different True Scores

The description of error variance given here, especially in (4) and (6), implicitly assumes that the variance of measurement error is constant for all persons, regardless of true score (universe score, here). The assumption of constant error variance for different true scores has been criticized for decades, including by Lord (1955), who derived a formula for conditional error variance that varies as a function of true score. His approach produces estimated error variances that are smaller for very high and very low true scores than for true scores closer to the mean, producing a concave-down quadratic form. Consider, for example, that persons who have very high true scores are likely to score highly across multiple items or across multiple tests (small error variance), whereas persons who have true scores close to the mean are likely to produce scores that fluctuate more from item to item or from test to test (larger error variance).
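The concave-down pattern can be sketched with the estimator of conditional absolute-error variance for the crossed p × i design, which for dichotomously scored items reduces to what we take to be Lord's form, X̄p(1 − X̄p)/(ni − 1), following Brennan's (2001) treatment. The function name and example item scores below are ours:

```python
def conditional_abs_error_variance(item_scores):
    """Estimated absolute-error variance conditional on a person's score,
    for a crossed p x i design: sum_i (X_pi - Xbar_p)^2 / (n_i * (n_i - 1)).
    For 0/1 items this equals Xbar_p * (1 - Xbar_p) / (n_i - 1)."""
    n = len(item_scores)
    mean = sum(item_scores) / n
    ss = sum((x - mean) ** 2 for x in item_scores)
    return ss / (n * (n - 1))

# A person answering almost all of 10 dichotomous items the same way
# (true score near an extreme) shows a smaller conditional error variance
# than a person whose true score sits near the middle of the scale.
extreme = conditional_abs_error_variance([1] * 9 + [0])      # Xbar = .9
middle = conditional_abs_error_variance([1] * 5 + [0] * 5)   # Xbar = .5
```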
Stated another way, for examinees with very high or very low true scores, there is little opportunity for errors to influence observed scores. In G theory, Lord's conditional error variance is the conditional error variance for absolute decisions in the p × i design with dichotomously scored items and n′i equal to ni (see Brennan, 2001). Brennan (2001) discusses conditional error variances in generalizability theory more broadly and presents estimation procedures for conditional error variances for relative and absolute decisions, for univariate and multivariate studies, and for balanced and unbalanced designs.

ACKNOWLEDGMENTS

A grant (#000000) from the U.S. Office of Education to CRESST supported, in part, the preparation of this chapter. We thank our reviewers, Bob Brennan and Dimiter Dimitrov, as well as our colleague Felipe Martinez, for their careful reading and constructive comments; errors of commission and omission are ours.

REFERENCES

Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. New York: Wiley.

Cronbach, L. J., Linn, R. L., Brennan, R. L., & Haertel, E. H. (1997). Generalizability analysis for performance assessments of student achievement or school effectiveness. Educational and Psychological Measurement, 57, 373–399.

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). Washington, DC: American Council on Education/Macmillan.

Kane, M. T., & Brennan, R. L. (1977). The generalizability of class means. Review of Educational Research, 47, 267–292.

Lord, F. M. (1955). Estimating test reliability. Educational and Psychological Measurement, 16, 325–336.

Marcoulides, G. A. (1994). Selecting weighting schemes in multivariate generalizability studies. Educational and Psychological Measurement, 54, 3–7.
Searle, S. R. (1987). Linear models for unbalanced data. New York: Wiley.

Shavelson, R. J., Baxter, G. P., & Gao, X. (1993). Sampling variability of performance assessments. Journal of Educational Measurement, 30, 215–232.

Shavelson, R. J., Ruiz-Primo, M. A., & Wiley, E. W. (1999). Note on sources of sampling variability in science performance assessments. Journal of Educational Measurement, 36, 61–71.

Shavelson, R. J., & Webb, N. M. (1981). Generalizability theory: 1973–1980. British Journal of Mathematical and Statistical Psychology, 34, 133–166.

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.

Webb, N. M., Nemer, K., Chizhik, A., & Sugrue, B. (1998). Equity issues in collaborative group assessment: Group composition and performance. American Educational Research Journal, 35, 607–651.

Webb, N. M., Schlackman, J., & Sugrue, B. (2000). The dependability and interchangeability of assessment methods in science. Applied Measurement in Education, 13, 277–301.

Webb, N. M., Shavelson, R. J., & Maddahian, E. (1983). Multivariate generalizability theory. In L. J. Fyans (Ed.), Generalizability theory: Inferences and practical applications (pp. 67–81). San Francisco, CA: Jossey-Bass.

Wiley, E. (2000). Bootstrap strategies for variance component estimation: Theoretical and empirical results. Unpublished doctoral dissertation, Stanford University.