APPLIED MEASUREMENT IN EDUCATION, 24: 1–21, 2011
Copyright © Taylor & Francis Group, LLC
ISSN: 0895-7347 print / 1532-4818 online
DOI: 10.1080/08957347.2011.532417
Generalizability Theory and Classical Test Theory
Robert L. Brennan
Center for Advanced Studies in Measurement and Assessment
University of Iowa
Broadly conceived, reliability involves quantifying the consistencies and inconsistencies in observed scores. Generalizability theory, or G theory, is particularly well
suited to addressing such matters in that it enables an investigator to quantify and
distinguish the sources of inconsistencies in observed scores that arise, or could
arise, over replications of a measurement procedure. Classical test theory is an historical predecessor to G theory and, as such, it is sometimes called a parent of
G theory. Important characteristics of both theories are considered in this article,
but primary emphasis is placed on G theory. In addition, the two theories are briefly
compared with item response theory.
An earlier version of this paper was presented at the 2008 annual meeting of the American Educational Research Association. The paper was one of two presented in a symposium sponsored by the Buros Center for Testing, the sponsor of this journal. The other paper enumerated the benefits of item response theory. We hope to be able to present this item response theory paper in a future issue of the journal.

Correspondence should be addressed to Robert L. Brennan, E. F. Lindquist Chair in Measurement and Testing and Director, Center for Advanced Studies in Measurement and Assessment (CASMA), 210D Lindquist, University of Iowa, Iowa City, IA 52242. E-mail: [email protected]

The pursuit of scientific endeavors necessitates careful attention to measurement procedures, the purpose of which is to acquire information about certain attributes or characteristics of objects. The data obtained from any measurement procedure include errors, however, since the measurements may vary depending on numerous conditions of measurement. From this perspective on measurement, "error" does not mean mistake in the conventional sense, and what constitutes error in scores from a measurement procedure is, in part, a matter of definition.

It is one thing to say that error is an inherent aspect of a measurement procedure; it is quite another thing to quantify error and specify which conditions of measurement contribute to it. Doing so necessitates specifying what would constitute an "ideal" measurement (i.e., over what conditions of measurement is generalization intended) and the conditions under which observed scores are obtained.
These and other measurement issues are of concern in virtually all areas of
science. Different fields may emphasize different issues, different objects, different characteristics of objects, and even different ways of addressing measurement
issues, but the issues themselves pervade scientific endeavors. In education and
psychology, historically these types of issues have been subsumed under the
heading of “reliability.”
Broadly conceived, reliability involves quantifying the consistencies and
inconsistencies in observed scores. It has been stated that “A person with one
watch knows what time it is; a person with two watches is never quite sure!” This
simple aphorism highlights how easily investigators can be deceived by having
information from only one element in a larger set of interest.
The above discussion is closely associated with the conceptual framework of
generalizability theory, or G theory, which is the principal focus of this article.
G theory enables an investigator to quantify and distinguish the sources of inconsistencies in observed scores that arise, or could arise, over replications of a
measurement procedure. Classical test theory (CTT) is an historical predecessor
to G theory. Indeed, CTT is sometimes called a parent of G theory.
Provided next is a brief overview of CTT that serves as a bridge to the subsequent overview of G theory.1 The focus here is on important aspects of the
theories that serve to illustrate similarities and differences between them, as well
as between them and other theories, particularly item response theory (IRT).
CLASSICAL TEST THEORY
To understand G theory, it is helpful to consider first some aspects of the CTT
model
X = T + E,    (1)
where X, T, and E are observed, true, and error score random variables, respectively. Although CTT is very useful, the simplicity of this model masks at least
four important considerations.
1 For more complete overviews of CTT see Lord and Novick (1968), Feldt and Brennan (1989),
and Haertel (2006). For more complete overviews of G theory see Cronbach, Gleser, Nanda, and
Rajaratnam (1972) and Brennan (1992, 2001b).
First, since T and E are both unobserved variables, to use this model one must
make some additional assumptions. There are at least two ways to proceed. First,
one can define T as the expected value of the observed scores X, which leads to
the expected value of E being zero. Second, one can define the expected value of
E as zero, which leads to T being the expected value of X. Clearly, both ways of
proceeding lead to the same result, but they differ with respect to what is assumed,
and what is a consequence of the assumptions. Whichever way one proceeds,
however, once T (or E) is defined, then E (or T) is derived unambiguously. That
is, the CTT model suggests that T and E are so tightly tied together that if one of
them were known, the other would be entirely evident.
Second, it is important to note that in the CTT model, T is definitely not platonic or “in the eye of God” true score. Lord and Novick (1968) emphasized
this over 40 years ago. More recently, Borsboom (2005, pp. 33–34) provided the
following interesting example. Currently, an autopsy is required for a definitive
diagnosis of Alzheimer’s disease. Let C be a nominal variable that takes two values: c = 0 for absence of Alzheimer’s based on an autopsy, and c = 1 for presence
of Alzheimer’s disease. This nominal variable can be viewed as a platonic true
score. (We neglect the possibility of autopsy errors.) Now, suppose there is some
observational test that results in a diagnosis of x = 0 if Alzheimer's is not suspected and x = 1 if Alzheimer's is suspected. If this diagnostic test (or different
forms of it) is repeated, clearly the expected value (i.e., true score) will be neither
0 nor 1; hence, platonic true score C and expected-value true score T will not be
the same.
Third, the form of the CTT model in Equation 1 is so clearly reminiscent of a
simple linear regression equation that it is easy to think of E as nothing more than
model fit error in the traditional statistical sense. Such a conception is misleading
at best, if not outright wrong. The CTT model is a tautology in which all variables
on the right-hand side are unobservable, and these unobservable variables have no
meaning beyond the assumptions we attach to them. In particular, T does not have
some status independent of the other variables in the model, which means that it
is misleading to characterize E as a residual or model fit error. Part of the problem
here is the multiple connotations associated with the word “model.” In traditional
statistical contexts, the word “model” often carries with it the connotation of a
relationship between dependent and fixed (i.e., known a priori) independent variables. This notion of the word “model” clearly does not apply to the CTT model;
nor does it apply to G theory.
Fourth, as mentioned above, the CTT model is a tautology. As such, it is true
by definition. It’s truth/falsity cannot be tested by comparing it or its results to
some “objective” reality. Physical scientists tend to reserve the word “theory” for
models that can be falsified. No such falsification is possible for the CTT model or
for G theory. In applications of CTT what shall count as true score and what shall
count as error are very much under the control of the investigator, although this
fact is frequently overlooked. In this sense “truth” and “error” are not realities to
be discovered—they are investigator-specific constructions to be studied. In CTT
“error” does not mean “mistake,” it does not mean lack of model fit, and “truth”
and “error” are defined by the investigator even if he or she does not realize it!
Reliability Coefficients and Error Variances
The canonical definition of reliability is usually taken to be that it is the squared
correlation between observed and true scores, ρ²(X, T). Other expressions for
reliability are given below:
ρ²(X, T) = ρ(X, X′) = σ²(T)/σ²(X) = σ²(T)/[σ²(T) + σ²(E)].    (2)
The last three expressions are typically derived by assuming that, for the indefinitely large population of examinees: (a) test forms (say X and X′) are classically
parallel, which means that they have equal observed score means, variances, and
covariances, and they covary equally with any other measure; (b) the covariance
between errors for parallel forms is 0; and (c) the covariance between true and
error scores is 0.2 Several traditional estimates of reliability are motivated by the
ρ(X, X′) expression for reliability. These estimates differ overtly with respect to
their data collection designs, and they also differ with respect to how error is
implicitly defined. For example, if reliability is estimated by computing the correlation between “parallel” forms, then the only errors that are taken into account
are those attributable to form differences. By contrast, if reliability is estimated
by computing a test–retest correlation, then form differences do not contribute to
error variance, but occasion differences do. Clearly, these two estimates of reliability are not estimates of the same parameter, but the CTT model is not rich
enough to distinguish clearly between them. These distinctions are much more
evident in G theory.
Other estimates of reliability are more closely linked to one or the other of the
last two expressions in Equation 2, both of which make explicit reference to true
score variance which, of course, is unknown. Typically, these estimates make use
of the fact that the covariance between scores for classically parallel forms is true
score variance, that is, σ(X, X′) = σ²(T). The best known of these coefficients is
Coefficient α.
Strictly speaking, Coefficient α can be derived using a parallelism assumption that is weaker than classically parallel forms, called essentially tau-equivalent
2 Equivalently, for any indefinitely large subpopulation of examinees, the expected value of the
errors is 0 provided examinees are not selected based on their observed scores.
forms, which are special cases of what are called congeneric forms. Two forms
are congeneric if their true scores are linearly related; further, their error variances
need not be equal, and, it follows that their observed score variances need not be
equal. Notationally, scores for forms i and j are congeneric if
Xi = (ai + bi T) + Ei    and    Xj = (aj + bj T) + Ej.    (3)
When bi = bj we say the forms are essentially tau-equivalent. Lord and Novick
(1968), Feldt and Brennan (1989), and Haertel (2006) provide extensive discussions of reliability coefficients based on these (and other) different definitions of
parallelism.
Reliability coefficients seldom play a role in other areas of scientific inquiry.
Why are they so prevalent in psychometrics? There are probably at least three
reasons. First, psychometrics is generally viewed as beginning with Spearman’s
(1904) study of what we now call corrections for attenuation, which adjust
observed score correlations using reliability coefficients. Corrections for attenuation are still of considerable interest in educational and psychological measurement. Second, the fact that reliability ranges between 0 and 1 is very appealing
to many. Unfortunately, the appeal is deceptive in that it suggests that all of reliability can be captured in a single dimensionless number. That is not true, but the
appeal persists, even though reliability coefficients are rather difficult to interpret
correctly.3 Third, under the assumptions of CTT, it can be shown that standard
error of measurement (SEM) is a function of reliability. Specifically,
σ(E) = σ(X)√[1 − ρ²(X, T)],    (4)

which is arguably more important than ρ²(X, T) itself.
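As a minimal numerical sketch of Equation 4, suppose (hypothetically) that a test has an observed-score standard deviation of 10 raw-score points; the SEM implied by several reliability values can then be computed directly.

```python
import math

def ctt_sem(sd_x: float, reliability: float) -> float:
    """CTT standard error of measurement: sigma(E) = sigma(X) * sqrt(1 - rho^2(X,T))."""
    return sd_x * math.sqrt(1.0 - reliability)

# Hypothetical observed-score SD of 10 raw-score points.
for rel in (0.50, 0.80, 0.90, 0.95):
    print(f"reliability = {rel:.2f}  ->  SEM = {ctt_sem(10.0, rel):.2f}")
```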
Coefficient α and its Misunderstandings
Without question, the most popular reliability coefficient is Coefficient α, which
is often call Cronbach’s α, since Cronbach (1951) popularized it and derived it
from several different perspectives. As valuable and useful as this coefficient may
be, unfortunately it is widely misunderstood and misused, in part because it is so
easy to compute.
One misunderstanding is the common attribution of Coefficient α to Cronbach.
As Cronbach (2003) himself noted, he did not invent Coefficient α; other equivalent coefficients were reported in the literature prior to Cronbach (1951). Indeed
3 One complexity is that reliability coefficients have nonlinear characteristics. That is why it is
much more difficult to raise a reliability coefficient from .90 to .95 than from .50 to .55.
derivations of one or more versions of Coefficient α (before and since 1951) might
be the all-time favorite psychometric parlor game!
As noted previously, from the perspective of CTT, the derivation of Coefficient
α requires that forms be essentially tau-equivalent.4 In the vast majority of cases,
Coefficient α is computed based on item scores; that is, items play the role of
forms. In most circumstances, however, it seems highly unlikely that item scores
satisfy the assumption of essential tau-equivalence.
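For concreteness, the following sketch applies the familiar computational form of Coefficient α, α = [k/(k − 1)][1 − Σσ²_i /σ²_X], to a small, entirely hypothetical persons-by-items matrix of dichotomous scores; it illustrates only the arithmetic, not the defensibility of the underlying tau-equivalence assumption.

```python
import numpy as np

def coefficient_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a persons-by-items score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    n_persons, k = scores.shape
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item across persons
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of total (sum) scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical 0/1 item scores for 6 examinees on 4 items.
x = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])
print(f"Coefficient alpha = {coefficient_alpha(x):.3f}")
```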
A particularly problematic misunderstanding is the frequently cited statement
that “Coefficient α is a lower limit to reliability.” Under a particular set of stringent
assumptions, this is a mathematically correct statement (see Lord & Novick, 1968,
pp. 87–88; Novick & Lewis, 1967), but these assumptions are rarely defensible in
real-world situations. In most cases, it is much more likely that Coefficient α is an
upper limit to reliability, as Cronbach (1951) noted over a half century ago. This
misinterpretation occurs when there is a disconnect between the data used to estimate Coefficient α and the definition of reliability intended by the investigator. For
example, if data are collected on a single occasion, but the investigator’s notion
of reliability involves generalizing to different occasions (as it usually does), then
it is almost certain that error variance will be underestimated.
Although Cronbach did not invent Coefficient α, he did name it, and his
choice of a name was not accidental. Consider the following quote from Cronbach
(1951):
A . . . reason for the symbol is that α is one of six analogous coefficients (to be
designated β, γ , δ, etc.) which deal with such other concepts as like-mindedness of
persons, stability of scores, etc. (pp. 299–300)
Essentially, this quote reinforces the fact that there are many reliability coefficients for any set of test scores. Cronbach did not publish subsequent papers
that specifically identified all of the other coefficients (i.e., β, γ , δ, etc.); rather,
these notions got incorporated into what came to be called G theory. In short,
Coefficient α is properly viewed as an historically important and often useful
estimator of reliability, but α should not be deified, and it is much overused.
Lord’s SEM
There are topics that are usually included in the CTT literature that are not quite
consonant with the assumptions noted above. For the purposes of this article,
a particularly important example is Lord’s (1955, 1957) SEM. Consider a test
consisting of k dichotomously scored items. Lord suggested that the SEM for
4 Classically parallel forms satisfy the assumptions of essential tau-equivalence, but this is not
necessarily true for congeneric forms.
an examinee can be viewed as the standard error of the mean for that examinee,
where each observable mean is the examinee’s mean score on a random sample of
k items drawn from an infinite universe of items. In terms of parameters, Lord’s
SEM is simply
σ(E*) = √[τ_p(1 − τ_p)/k],    (5)
where τ_p is the true score for the examinee in the mean-score metric (i.e., proportion-correct scores).5 It is worth noting that Lord's SEM is not a simple function
of reliability, whereas the CTT formula in Equation 4 is. Furthermore, it can be
shown that the average value of σ²(E*) is greater than σ²(E) when both are on the
same metric (see Brennan, 2001b, pp. 33, 160).
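A minimal sketch of Lord's SEM follows, using the parameter formula in Equation 5 together with the estimation formula given in footnote 5; the test length and proportion-correct values are hypothetical.

```python
import math

def lord_sem(tau_p: float, k: int) -> float:
    """Lord's SEM (Equation 5) in the mean-score (proportion-correct) metric."""
    return math.sqrt(tau_p * (1.0 - tau_p) / k)

def lord_sem_estimate(mean_score_p: float, k: int) -> float:
    """Estimated Lord's SEM based on the examinee's observed mean score (footnote 5)."""
    return math.sqrt(mean_score_p * (1.0 - mean_score_p) / (k - 1))

# Hypothetical 40-item test; the SEM varies with the examinee's proportion correct.
for tau in (0.5, 0.7, 0.9):
    print(f"tau_p = {tau:.1f}  ->  Lord's SEM = {lord_sem(tau, 40):.4f}")
print(f"Estimate for an observed mean score of 0.7: {lord_sem_estimate(0.7, 40):.4f}")
```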
Lord’s SEM is a kind of bridge between CTT and G theory in at least two
senses. First, Lord’s SEM uses a random sampling model to estimate error
variance rather than CTT notions of parallelism. Second, Lord’s SEM uses a
within-person design as opposed to the across-persons design that characterized
virtually all the reliability literature prior to the 1950s. As discussed next, G theory
replaces CTT notions of parallelism with randomly parallel forms, and G theory
explicitly incorporates different types of data collection designs.
UNIVARIATE GENERALIZABILITY THEORY
G theory offers an extensive conceptual framework and a powerful set of statistical procedures for addressing numerous measurement issues. Often, CTT and
analysis of variance (ANOVA) are viewed as the parents of G theory.
Parents and Some History
In CTT, there is only one E term, which does not mean there is necessarily only
one source of error; it does mean, however, that in a single application of CTT,
all sources of error are confounded in one E term. One of the most important and
simplest perspectives on the G theory model is that it disconfounds the multiple
sources of error that interest an investigator, say H of them; so, in a sense, the G
theory model can be viewed as
X = μ_p + E_1 + E_2 + · · · + E_H,    (6)

5 The more familiar estimation formula for Lord's SEM in the mean-score metric is σ̂(E*) = √[X̄_p(1 − X̄_p)/(k − 1)].
where μ_p is universe score, which is the G theory analogue of true score.
Importantly, in G theory the investigator must decide which sources of error are
of interest, which effectively defines the facets of measurement. Universe score is
then defined as the expected value of observed scores over replications of the measurement procedure (see Brennan, 2001a), where each such replication involves a
different random sample of conditions from each of the measurement facets.
In its essential features, the high-level model in Equation 6 is quite consistent with important aspects of the statistical framework for ANOVA. As noted by
Cronbach et al. (1972), when Fisher (1925) introduced ANOVA, he
revolutionized statistical thinking with the concept of the factorial experiment in
which the conditions of observations are classified in several respects. Investigators
who adopt Fisher’s line of thought must abandon the concept of undifferentiated error. The error formerly seen as amorphous is now attributed to multiple
sources, and a suitable experiment can estimate how much variation arises from
each controllable source. (p. 1)
The defining treatment of G theory is a monograph by Cronbach et al. (1972)
entitled The Dependability of Behavioral Measurements. A history of the theory
is provided by Brennan (1997). Brennan (2001b) provides an extensive exposition
of G theory. Shavelson and Webb (1991) provide a primer. Cardinet, Johnson,
and Pini (2010) provide a treatment of G theory based on a perspective that is
somewhat different from that of the previously cited authors. In discussing the
genesis of G theory, Cronbach (1991, pp. 391–392) states:
In 1957 I obtained funds from the National Institute of Mental Health to produce,
with Gleser’s collaboration, a kind of handbook of measurement theory. . . . “Since
reliability has been studied thoroughly and is now understood,” I suggested to the
team, “let us devote our first few weeks to outlining that section of the handbook,
to get a feel for the undertaking.” We learned humility the hard way—the enterprise
never got past that topic. Not until 1972 did the book appear . . . that exhausted
our findings on reliability reinterpreted as generalizability. Even then, we did not
exhaust the topic.
When we tried initially to summarize prominent, seemingly transparent, convincingly argued papers on test reliability, the messages conflicted.
To resolve these conflicts, Cronbach and his colleagues devised a rich conceptual
framework and married it to analysis of random effects variance components. The
net effect is “a tapestry that interweaves ideas from at least two dozen authors”
(Cronbach, 1991, p. 394). In particular, the work of Burt (1936), Ebel (1951),
and Lindquist (1953, chap. 16) appears to have anticipated various aspects of G
theory.
Framework and Machinery
Although CTT and ANOVA can be viewed as the parents of G theory, the child
is both more and less than the simple conjunction of its parents, and appreciating G theory requires an understanding of more than its lineage. For example,
although G theory liberalizes CTT, not all aspects of CTT are incorporated in
G theory. Also, the ANOVA issues emphasized in G theory are different from
those that predominate in many experimental design and ANOVA texts. In particular, G theory concentrates on variance components and their estimation, not
F tests.
Perhaps the most important aspect and unique feature of G theory is its conceptual framework. Among the concepts are universes of admissible observations and
G (generalizability) studies, as well as universes of generalization and D (decision) studies. Some of the more important concepts and methods of G theory are
introduced next using a hypothetical scenario.
Suppose a testing company ABC decides that it wants to begin offering a
writing proficiency testing program called WPT. ABC needs to identify, or otherwise characterize the types of essay prompts, t, that will be used and the types
of raters, r. Obviously, there are other considerations, too, but we will consider
only these two facets here. (A facet is simply a set of similar conditions of measurement, where the investigator decides what “similar” means.) Suppose that, in
theory, responses to any prompt could be evaluated by any rater, and the number
of potential prompts and raters is indefinitely large. Under these specifications,
we say that both facets are infinite in the universe of admissible observations, and
they are crossed, that is, t × r.
So far, no reference has been made to persons who respond to the essay
prompts. In G theory the word universe is reserved for conditions of measurement
(prompts and raters, here), while the word population is used for the objects of
measurement (persons, here). In the population and universe of admissible observations, any observable score for a single essay prompt evaluated by a single rater
can be represented as:
X_ptr = μ + ν_p + ν_t + ν_r + ν_pt + ν_pr + ν_tr + ν_ptr,    (7)
where μ is the grand mean in the population and universe and ν designates
any one of the seven uncorrelated effects, or components. We say that Equation
7 is the p × t × r (persons crossed with tasks crossed with raters) linear
model.
Assuming that the effects in Equation 7 are uncorrelated, the variance of the
observed scores is:
σ²(X_ptr) = σ²(p) + σ²(t) + σ²(r) + σ²(pt) + σ²(pr) + σ²(tr) + σ²(ptr).    (8)
The terms to the right of the equal sign are called random effects variance components. They can be estimated using expected mean square equations for a G
study in which a sample of n_p persons respond to n_t prompts that are evaluated by n_r raters.
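The following sketch illustrates this estimation step for a balanced, fully crossed p × t × r G study: the seven mean squares are computed from a persons × prompts × raters score array, and the usual random-model expected-mean-square equations are then solved for the variance components. The sample sizes and the effects used to simulate the scores are entirely hypothetical and serve only to exercise the estimator.

```python
import numpy as np

def g_study_ptr(x: np.ndarray) -> dict:
    """Estimate the seven random-effects variance components for a balanced,
    crossed p x t x r G study (one observation per cell)."""
    n_p, n_t, n_r = x.shape
    grand = x.mean()

    # Marginal and two-way means.
    m_p = x.mean(axis=(1, 2))
    m_t = x.mean(axis=(0, 2))
    m_r = x.mean(axis=(0, 1))
    m_pt = x.mean(axis=2)
    m_pr = x.mean(axis=1)
    m_tr = x.mean(axis=0)

    # Mean squares for each effect.
    ms = {}
    ms["p"] = n_t * n_r * np.sum((m_p - grand) ** 2) / (n_p - 1)
    ms["t"] = n_p * n_r * np.sum((m_t - grand) ** 2) / (n_t - 1)
    ms["r"] = n_p * n_t * np.sum((m_r - grand) ** 2) / (n_r - 1)
    ms["pt"] = (n_r * np.sum((m_pt - m_p[:, None] - m_t[None, :] + grand) ** 2)
                / ((n_p - 1) * (n_t - 1)))
    ms["pr"] = (n_t * np.sum((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2)
                / ((n_p - 1) * (n_r - 1)))
    ms["tr"] = (n_p * np.sum((m_tr - m_t[:, None] - m_r[None, :] + grand) ** 2)
                / ((n_t - 1) * (n_r - 1)))
    resid = (x - m_pt[:, :, None] - m_pr[:, None, :] - m_tr[None, :, :]
             + m_p[:, None, None] + m_t[None, :, None] + m_r[None, None, :] - grand)
    ms["ptr"] = np.sum(resid ** 2) / ((n_p - 1) * (n_t - 1) * (n_r - 1))

    # Solve the expected-mean-square equations for the variance components.
    vc = {}
    vc["ptr"] = ms["ptr"]
    vc["pt"] = (ms["pt"] - ms["ptr"]) / n_r
    vc["pr"] = (ms["pr"] - ms["ptr"]) / n_t
    vc["tr"] = (ms["tr"] - ms["ptr"]) / n_p
    vc["p"] = (ms["p"] - ms["pt"] - ms["pr"] + ms["ptr"]) / (n_t * n_r)
    vc["t"] = (ms["t"] - ms["pt"] - ms["tr"] + ms["ptr"]) / (n_p * n_r)
    vc["r"] = (ms["r"] - ms["pr"] - ms["tr"] + ms["ptr"]) / (n_p * n_t)
    return vc

# Hypothetical G study: 50 persons x 3 prompts x 2 raters, with scores simulated
# from made-up person, prompt, rater, and residual effects just to exercise the estimator.
rng = np.random.default_rng(0)
n_p, n_t, n_r = 50, 3, 2
x = (rng.normal(0, 1.0, (n_p, 1, 1))        # person effects
     + rng.normal(0, 0.5, (1, n_t, 1))      # prompt effects
     + rng.normal(0, 0.3, (1, 1, n_r))      # rater effects
     + rng.normal(0, 0.7, (n_p, n_t, n_r))) # residual variability
for effect, value in g_study_ptr(x).items():
    print(f"sigma^2({effect}) = {value:.3f}")
```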
Once estimated variance components are available, they can be used to estimate universe score variance, error variances, and reliability-like coefficients for
various universes of generalization and D study designs. A universe of generalization can be viewed as the universe of randomly parallel forms of WPT, where each such form uses n′_t prompts and n′_r raters.6 A D study design is the design
used operationally for a form of WPT.7
A crucial consideration in defining a universe of generalization is answering
the question, “Which facet(s) shall be considered random and which shall be considered fixed?” A facet is considered random when its conditions in the D study
are a sample from those in the universe of generalization.8 A facet is fixed when
its conditions in the D study exhaust its conditions in the universe of generalization. G theory does not specify which facets should be considered random and
which should be considered fixed; that is the prerogative and the responsibility of
the investigator. It should be noted, however, that fixing one or more facets generally lowers error variance and increases coefficients at the expense of narrowing
interpretations.
Infinite universe of generalization and crossed D study design
Suppose that ABC decides that both prompts and raters shall be viewed as
random for WPT, and the D study design will have the same crossed structure as
the G study design.9 Then, universe score variance is
σ²(τ) = σ²(p),    (9)

relative error variance is

σ²(δ) = σ²(pt)/n′_t + σ²(pr)/n′_r + σ²(ptr)/(n′_t n′_r),    (10)
absolute error variance is

σ²(Δ) = σ²(t)/n′_t + σ²(r)/n′_r + σ²(tr)/(n′_t n′_r) + σ²(pt)/n′_t + σ²(pr)/n′_r + σ²(ptr)/(n′_t n′_r),    (11)

a generalizability coefficient is

Eρ² = σ²(τ)/[σ²(τ) + σ²(δ)],    (12)

and a dependability coefficient is

Φ = σ²(τ)/[σ²(τ) + σ²(Δ)].    (13)

6 It need not be true that n′_t = n_t nor that n′_r = n_r; that is, the sample sizes used to estimate variance components need not equal the sample sizes used in an operational form of the test.
7 D study designs can differ with respect to structure and/or sample sizes.
8 Strictly speaking, for a random facet it is assumed that the number of conditions in the universe of generalization is indefinitely large.
9 That is, the D study design shall be p × T × R with n′_t prompts and n′_r raters.
Equations 9–13 are expressed in terms of the mean score metric, which is the
tradition in G theory; by contrast, CTT equations are almost always expressed in
terms of the total score metric.
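As a small sketch of the D study computations in Equations 9–13, suppose the following (hypothetical) estimated variance components in the mean-score metric and a D study with two prompts and two raters.

```python
def d_study_crossed(vc: dict, n_t: int, n_r: int) -> dict:
    """D study results for the p x T x R design with both facets random
    (Equations 9-13), given variance components and D study sample sizes."""
    tau = vc["p"]                                                        # universe score variance (Eq. 9)
    delta = vc["pt"] / n_t + vc["pr"] / n_r + vc["ptr"] / (n_t * n_r)    # relative error (Eq. 10)
    Delta = (vc["t"] / n_t + vc["r"] / n_r + vc["tr"] / (n_t * n_r)
             + vc["pt"] / n_t + vc["pr"] / n_r + vc["ptr"] / (n_t * n_r))  # absolute error (Eq. 11)
    return {
        "universe_score_var": tau,
        "rel_error_var": delta,
        "abs_error_var": Delta,
        "gen_coef": tau / (tau + delta),            # Eq. 12
        "dependability_coef": tau / (tau + Delta),  # Eq. 13
    }

# Hypothetical variance components (mean-score metric); 2 prompts and 2 raters per form.
vc = {"p": 0.40, "t": 0.05, "r": 0.02, "pt": 0.20, "pr": 0.04, "tr": 0.01, "ptr": 0.30}
print(d_study_crossed(vc, n_t=2, n_r=2))
```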
Relative error variance, σ²(δ), and a generalizability coefficient, Eρ², are analogous to σ²(E) and ρ²(X, T), respectively, in CTT in that they characterize error and reliability for decisions based on comparing examinees. It is important to note, however, that except in trivial cases σ²(δ) ≠ σ²(E) and Eρ² ≠ ρ²(X, T).10 By contrast, strictly speaking, CTT has no analogue for σ²(Δ), which is the error variance for making absolute (e.g., pass–fail) decisions about examinees. If we go beyond the strict realm of CTT and consider Lord's error variance, however, there are some clear similarities—most obviously, both σ²(Δ) and Lord's error variance are derived under random sampling assumptions (see Brennan, 1997, for more details).
If an investigator performs a CTT analysis (e.g., computes Coefficient α) when
there is more than one random facet, it is likely that error variance will be underestimated. Consider, for example, σ²(δ) in Equation 10, which is based on n′_t n′_r observations for each examinee. If Coefficient α is computed using the n′_t n′_r observations for each examinee, then the estimated error variance in the mean-score metric will be [σ̂²(pt) + σ̂²(pr) + σ̂²(ptr)]/(n′_t n′_r), which is clearly smaller than the estimate of σ²(δ) based on Equation 10. This illustrates that CTT estimated error variances are generally too small when there is more than one random
facet in the universe of generalization.
10 The most common "trivial" case is a design and universe with a single random facet.
Different universes of generalization and D study designs
For different universes of generalization and D study designs, the expressions
for Eρ² and Φ in Equations 12 and 13, respectively, still apply. Universe score
variance and error variances change, however, if the universe of generalization
changes. In addition, error variances change if the design changes and/or sample
sizes change.11
Suppose ABC decides to use the same tasks for all forms of WPT. If so, we
would say that tasks are fixed in the universe of generalization, and it can be
shown that universe score variance is
σ²(τ) = σ²(p) + σ²(pt)/n′_t,    (14)
relative error variance is

σ²(δ) = σ²(pr)/n′_r + σ²(ptr)/(n′_t n′_r),    (15)

and absolute error variance is

σ²(Δ) = σ²(r)/n′_r + σ²(tr)/(n′_t n′_r) + σ²(pr)/n′_r + σ²(ptr)/(n′_t n′_r).    (16)
Comparing these equations with Equations 9–11, it is evident that when tasks
are fixed, universe score variance increases and error variances decrease, which
leads to larger coefficients. Conceptually, fixing a facet restricts the universe of
generalization and, in doing so, decreases the gap between observed and universe
scores at the price of narrowing interpretations.
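A companion sketch applies Equations 14–16 with tasks fixed; using the same hypothetical variance components as in the earlier D study sketch, it shows the expected pattern of larger universe score variance, smaller error variances, and larger coefficients.

```python
def d_study_tasks_fixed(vc: dict, n_t: int, n_r: int) -> dict:
    """D study results when tasks are fixed in the universe of generalization
    (Equations 14-16) and raters remain random."""
    tau = vc["p"] + vc["pt"] / n_t                        # Eq. 14
    delta = vc["pr"] / n_r + vc["ptr"] / (n_t * n_r)      # Eq. 15
    Delta = (vc["r"] / n_r + vc["tr"] / (n_t * n_r)
             + vc["pr"] / n_r + vc["ptr"] / (n_t * n_r))  # Eq. 16
    return {"universe_score_var": tau,
            "rel_error_var": delta,
            "abs_error_var": Delta,
            "gen_coef": tau / (tau + delta),
            "dependability_coef": tau / (tau + Delta)}

# Same hypothetical components as before: fixing tasks raises universe score
# variance and lowers error variances, so both coefficients increase.
vc = {"p": 0.40, "t": 0.05, "r": 0.02, "pt": 0.20, "pr": 0.04, "tr": 0.01, "ptr": 0.30}
print(d_study_tasks_fixed(vc, n_t=2, n_r=2))
```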
Suppose ABC publishes a technical manual in which it claims that WPT is a
highly reliable testing program because inter-rater reliability coefficients are high.
Let us consider this claim from the perspective of G theory. Suppose each inter-rater coefficient is a Pearson correlation based on the responses of examinees to
a single task with each response rated by the same two raters. Even if there are
multiple coefficients reported, as long as each of them is based on a single task,
then task is effectively being treated as fixed, whether or not ABC realizes it.12
Furthermore, a correlation between two conditions or units (here, raters) is an
11 CTT deals with sample size changes through the Spearman-Brown formula (see Feldt & Brennan, 1989, and Haertel, 2006), which does not apply when there is more than one random facet.
See Brennan (2001b, pp. 116–117) for an example.
12 Averaging inter-rater coefficients does not obviate this problem; it merely masks it.
estimate of reliability for one of them. Therefore, the inter-rater coefficients are
interpretable as estimates of
Eρ² = [σ²(p) + σ²(pt)] / {σ²(p) + σ²(pt) + [σ²(pr) + σ²(ptr)]},    (17)
where σ²(δ) is enclosed in square brackets.
Compare this with Eρ² when both raters and tasks are random, and an examinee's score is the average over n′_t tasks and n′_r raters (see Equations 9, 10, and 12):

Eρ² = σ²(p) / {σ²(p) + [σ²(pt)/n′_t + σ²(pr)/n′_r + σ²(ptr)/(n′_t n′_r)]},    (18)
where σ²(δ) is enclosed in square brackets. Equation 18 almost always reflects the D study design and intended universe much better than Equation 17, but Eρ² in Equation 18 is likely to be much smaller than Eρ² in Equation 17, primarily because σ²(pt) moves from universe score variance in Equation 17 to error variance in Equation 18. This is an important matter. In most testing programs σ²(pt) is quite large, which more than offsets the decrease in error variance that results from division by sample sizes in Equation 18, especially since n′_t and n′_r tend to be quite small in writing assessments.
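The contrast between Equations 17 and 18 can be made numerical with a small sketch; the variance components below are hypothetical, but σ²(pt) is deliberately large relative to the rater-related components, as is typical of writing assessments.

```python
def interrater_coef(vc: dict) -> float:
    """Equation 17: expected value of an inter-rater correlation for a single
    task, which implicitly treats the task as fixed."""
    return (vc["p"] + vc["pt"]) / (vc["p"] + vc["pt"] + vc["pr"] + vc["ptr"])

def gen_coef_random(vc: dict, n_t: int, n_r: int) -> float:
    """Equation 18: generalizability coefficient when both tasks and raters are random."""
    delta = vc["pt"] / n_t + vc["pr"] / n_r + vc["ptr"] / (n_t * n_r)
    return vc["p"] / (vc["p"] + delta)

# Hypothetical components with a large sigma^2(pt): the inter-rater coefficient can
# look impressive while the generalizability coefficient is considerably smaller.
vc = {"p": 0.40, "pt": 0.35, "pr": 0.02, "ptr": 0.20}
print(f"Eq. 17 (task fixed, single task):  {interrater_coef(vc):.3f}")
print(f"Eq. 18 (tasks and raters random):  {gen_coef_random(vc, n_t=2, n_r=2):.3f}")
```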
Sometimes inter-rater coefficients are reported based on a side study, but operationally each response is rated by a single rater. If so, Equations 17 and 18 still apply, but n′_r = 1 in Equation 18. Importantly, however, σ²(pr) cannot be estimated unless a G study is conducted that has n_r ≥ 2.
The above discussion may be somewhat challenging, but it is still oversimplified relative to what often happens in practice. In particular, the assignment
of raters to prompts and/or examinees is often more complicated than implied
by the design considered above. Suppose, for example, that for the operational
assessment, a different set of raters will evaluate responses to each prompt or
task, t. This is a verbal description of the D study p × (R:T) design, where “:” is
read “nested within.” For this design, if both raters and tasks are random, it can
be shown that
Eρ² = σ²(p) / {σ²(p) + [σ²(pt)/n′_t + σ²(pr:t)/(n′_t n′_r)]},    (19)
where σ²(pr:t) represents the confounding of σ²(pr) and σ²(ptr). This means that if the G study were conducted using the p × t × r design, then σ²(pr:t) = σ²(pr) + σ²(ptr). It follows that σ²(pr) is divided by both n′_t and n′_r in Equation 19, whereas σ²(pr) is divided by n′_r only in Equation 18. Therefore, when n′_r > 1 and n′_t > 1, σ²(δ) is smaller and Eρ² is larger for the nested design than for the crossed design. (A similar statement holds for Φ.)
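A brief sketch of Equation 19, using the same hypothetical components as in the previous sketch, shows that the nested design yields a slightly larger coefficient than the crossed design with the same sample sizes.

```python
def gen_coef_nested(vc: dict, n_t: int, n_r: int) -> float:
    """Equation 19: Erho^2 for the p x (R:T) D study design with tasks and raters random;
    sigma^2(pr:t) = sigma^2(pr) + sigma^2(ptr) when the G study design was p x t x r."""
    pr_t = vc["pr"] + vc["ptr"]
    delta = vc["pt"] / n_t + pr_t / (n_t * n_r)
    return vc["p"] / (vc["p"] + delta)

# Same hypothetical components as in the previous sketch. With n_t = n_r = 2 the
# crossed design (Eq. 18) gave about .630; nesting raters within tasks divides
# sigma^2(pr) by n_t * n_r rather than n_r, so Eq. 19 is slightly larger.
vc = {"p": 0.40, "pt": 0.35, "pr": 0.02, "ptr": 0.20}
print(f"Nested p x (R:T) generalizability coefficient: {gen_coef_nested(vc, 2, 2):.3f}")
```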
In brief, this hypothetical WPT scenario illustrates that:
• universe score variance gets larger and error variances get smaller if a facet
shifts from being considered random to being considered fixed;
• larger D study sample sizes lead to smaller error variances; and
• nested D study designs usually lead to smaller error variances and larger
coefficients.
These conclusions are entirely predictable given the rich conceptual framework
of G theory.
MULTIVARIATE GENERALIZABILITY THEORY
The essential features of univariate G theory were largely completed with technical reports by the Cronbach team in 1960–1961. These were revised into three
journal articles, each with a different first author (Cronbach, Rajaratnam, &
Gleser, 1963; Gleser, Cronbach, & Rajaratnam, 1965; Rajaratnam, Cronbach, &
Gleser, 1965). In the mid-1960s, motivated by Harinder Nanda's studies on inter-battery reliability, the Cronbach team began their development of multivariate G
theory, which is incorporated in their 1972 monograph, and which they regarded
as the most unique aspect of G theory.13 Cronbach (1976) provides more historical details. The last four chapters in Brennan (2001b) provide an integrated and
extended treatment of multivariate G theory.
Multivariate G theory is multivariate primarily in the sense of multiple universes of generalization and, hence, multiple universe scores for each examinee.
In addition, there are corresponding multiple universes of admissible observations. Each one of the multiple universes is associated with a single fixed condition
of measurement. Statistically this implies that multivariate G theory analyses
involve not only variance components but also covariance components.
To continue with the WPT example, suppose each form involves both narrative and informative types of prompts. We will designate these prompt types as v1
and v2 , respectively. If, for each type, the population and universe of admissible
13 It can be argued that stratified alpha (Cronbach, Schönemann, & McKie, 1965) is a CTT
precursor to multivariate G theory.
observations is fully crossed (i.e., p × t × r), then there are seven variance
components for v1 and a different seven components for v2. For example, σ²_1(p) is the person variance component for v1, and σ²_2(p) is the person variance component for v2. In addition, for each pair of variance components there is the possibility of a covariance component. For this example, almost certainly persons would respond to prompts of both types, which means that the covariance component for persons, σ_12(p), would be non-zero. On the other hand, probably there would be different prompts (t) for v1 and v2. If so, σ_12(t) = σ_12(pt) = σ_12(tr) = σ_12(ptr) = 0. The same raters might or might not be used for the two types of prompts, which means that σ_12(r) and σ_12(pr) might or might not be zero. In short, the multivariate WPT example has seven variance–covariance matrices that replace the seven variance components for the univariate
example.
Univariate D study analyses for the WPT example can be performed for v1 and
v2 separately, which gives results specific to narrative and informative prompts,
respectively. In addition, for the WPT example it is likely that ABC would
perform analyses for one or more composite universe scores defined generally as
μ_pC = w1 μ_p1 + w2 μ_p2.
For example, if w1 + w2 = 1, then the analyses would be for weighted mean
scores over both narrative and informative prompts. If w1 = 1 and w2 = −1, then
the analyses would be for difference scores.
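As a small sketch of the composite, the universe score variance of μ_pC follows from the ordinary variance of a weighted sum of the person variance components and their covariance component; the values below are hypothetical.

```python
def composite_universe_score_variance(w1: float, w2: float,
                                      var1: float, var2: float, cov12: float) -> float:
    """Universe score variance of the composite mu_pC = w1*mu_p1 + w2*mu_p2:
    the variance of a weighted sum of the two universe scores, using the person
    variance components for the two prompt types and their covariance component."""
    return w1 ** 2 * var1 + w2 ** 2 * var2 + 2 * w1 * w2 * cov12

# Hypothetical person (co)variance components for narrative (v1) and informative (v2) prompts.
var1, var2, cov12 = 0.40, 0.30, 0.25
print("Composite  (w1 = w2 = .5):    ",
      composite_universe_score_variance(0.5, 0.5, var1, var2, cov12))
print("Difference (w1 = 1, w2 = -1): ",
      composite_universe_score_variance(1.0, -1.0, var1, var2, cov12))
```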
This relatively simple WPT example hints at the power and flexibility of multivariate G theory. Indeed, it can be said that multivariate G theory is the whole of
G theory, with univariate G theory simply being a special case. This multivariate
perspective on G theory illustrates that it is essentially a random effects theory.
The reader may quarrel with this last assertion by noting that the previous discussion of univariate G theory considered a mixed model in which there was a
fixed facet. True enough, but any univariate mixed model can always be reformulated as a multivariate model in which the levels of the fixed facet(s) become
levels of v. Indeed, doing so provides a more flexible representation of levels of a
fixed facet,14 and usually greatly simplifies estimation, especially for mixed models that have designs that are unbalanced with respect to nesting (see, for example,
Brennan, 2001b, pp. 268–273).
14 A mixed-model univariate analysis effectively makes a statistical “hidden” choice for the w
weights for each fixed level, whereas a multivariate analysis leaves the choice of weights to the
investigator.
COMPARING THEORIES
For ease of reference, in this section CTT and G theory are sometimes referred
to as expected value theories to contrast them with item response theory (IRT).
We begin with a comparison of CTT and G theory that includes consideration
of some of the strengths and weaknesses of these expected value theories. Then
these theories are briefly compared with IRT.
Expected Value Theories
CTT and G theory have a number of similarities. They are both tautologies in
which terms to the right of the equal sign are unobserved, both theories define
true (or universe) score as an expected value of observed scores, both theories
explicitly incorporate random errors of measurement, and both theories have well-defined (and similar) notions of reliability (or generalizability).
It has been said by Cronbach et al. (1972) and by Brennan (2001b) that G
theory “liberalizes” CTT. This is true in several senses. First, G theory permits disentangling the multiple sources of error that are confounded in the single E term
of CTT. Second, G theory has a much richer conceptual framework than CTT,
which leads to resolutions of a number of apparent contradictions in various CTT
discussions of reliability. The two most important characteristics of G theory that
facilitate resolving contradictions are: (a) G theory’s distinction between fixed and
random measurement facets and (b) G theory’s capability of dealing with different
D study designs. Third, multivariate G theory expands reliability considerations
to multiple universes of generalization, which have no corresponding status in
CTT. Fourth, as noted by Cronbach et al. (1972) and Brennan (2001b), G theory
blurs distinctions between reliability and validity. Kane (1982), for example, provides a particularly prescient discussion of the reliability–validity paradox from
the perspective of G theory.
To say that G theory liberalizes CTT does not mean, however, that all of CTT
is subsumed under G theory or that CTT can or should be completely replaced by
G theory. There are still some important differences between the two theories that
more than justify retaining both. Perhaps the most obvious difference is in definitions of parallelism. G theory incorporates a single notion of parallelism, namely,
the notion of randomly parallel forms. This is quite different from the notion of
classically parallel forms in CTT. Both types of parallelism are idealized and not
ever likely to be strictly true, although one or the other may be more sensible in
particular contexts. Furthermore, CTT has several well-developed, useful definitions of parallelism that are weaker than classically parallel forms (in particular,
essentially tau-equivalent forms and congeneric forms), whereas G theory has no
role, as yet, for different types of parallelism.
In considering models, it often seems that what is a strength from one
perspective is a weakness or limitation from another perspective. For example,
one of the strengths of CTT is that it is based on the very simple model X = T
+ E, but the simplicity of the model is also a weakness in that it does not permit
us to disentangle the multiple sources of error in E. By contrast, the capability of
disentangling error sources is an important strength of G theory, but that strength
is purchased at the price of conceptual complexity.
The complexity of G theory is often a stumbling block for those who seek to
find simple answers to measurement questions. In reality, however, most thoughtful consideration of such questions requires grappling with conceptual matters
that are often complex and not easily addressable with a template. An important
strength of G theory is that it is rich enough to guide investigators through such
measurement mazes, but that strength makes cognitive demands on investigators.
In the end, there is no psychometric “free lunch.”
Expected Value Theories and IRT
Given the popularity of item response theory (IRT) (see, for example, Lord, 1980
and Yen & Fitzpatrick, 2006), it seems obvious to consider some similarities and
differences between IRT and the two expected value theories discussed in this article. In both substantive and utilitarian senses, there is a rather obvious difference
between the two types of theories. Specifically, IRT focuses on item responses,
whereas CTT and G theory focus on test or form scores. Using IRT, investigators
can clearly distinguish among different items. By contrast, G theory cannot distinguish among items, since it is a random sampling model, just as different persons
are not distinguishable in survey sampling research. CTT can make distinctions
among items only if items are defined as forms, but if that is done, parallelism
assumptions are often suspect.15
Some may object to the above characterization of CTT by noting that there
is long history of using so-called classical item analysis statistics such as difficulty levels and point-biserial discrimination indices. True enough. Such statistics,
however, are not easily defended from a strict interpretation of CTT as discussed
in this article. The essential problem is that almost always item scores grossly
violate the assumption of classically parallel forms, and even the assumptions of
essentially tau-equivalent forms. Classical item analysis statistics have a longstanding demonstrated utility for test development, but that does not mean they
are well modelled by CTT.
A forest-trees metaphor is reasonably apt for considering IRT vis-à-vis
expected value theories. Consider individual items as trees and the universe of
items as the forest. If we focus on individual trees as we do in IRT, then we
are easily oblivious to the forest. If we focus on the forest, then the trees are
15 If items are considered as congeneric forms, then perhaps this problem can be circumvented (L.
S. Feldt, personal communication, March 3, 2010).
indistinguishable. To put it another way, in IRT items (more correctly item parameters) are effectively fixed, which means that a replication would consist of
identically the same items (or, more correctly, a set of items with identically the
same parameters). Call this “strictly” parallel forms. The notion of randomly parallel forms in G theory is much less restrictive, and even the various CTT notions
of parallel forms are much weaker than “strictly” parallel forms.
Traditional developments of IRT do not typically mention fixed items or
strictly parallel forms. These notions are implicit, however, in other aspects
of IRT. For example, in the derivation of the standard error of the maximum-likelihood estimate of θ, there is no consideration of sampling items; and if items
are not sampled, they must be fixed. Also, the expected number-correct (ENR) on
the vertical axis in a test characteristic curve (TCC) is typically viewed as number-correct true score. However, ENR is not an expected value over any set of items
different from those for the specific TCC, since the TCC itself is conditional on
a very specific set of items.16 Therefore, there is a discontinuity between the IRT
notion of true score (ENR) and the notion of true score in CTT and G theory. This
is particularly evident in comparing IRT and G theory: items are fixed in IRT,
whereas they are almost always treated as random in G theory.
Some of the above comments may appear to conflict with some old and current literature. For example, Lord and Novick (1968) show relationships between
certain classical item analysis statistics and normal ogive item parameters. True
enough, but the relationships are based on first assuming that a normal ogive
model fits. The fact that proposition A implies proposition B does not mean that B
implies A; that is, the Lord and Novick (1968) demonstration does not mean that
CTT and IRT are interchangeable for item analysis purposes. A similar type of
comment, although more nuanced, applies to Holland and Hoskens (2003), which
in no way mitigates the quality or importance of their research. It is true that Lord
and Novick (1968) and Holland and Hoskens (2003) have taken steps in the direction of integrating CTT and IRT from certain perspectives; it is not true that the
two theories are fully integrated, or that one is a subset of the other.
Current IRT models and G theory differ not only with respect to items being
fixed (IRT) or random (G theory), but also in the sense that G theory emphasizes the contributions of multiple facets to measurement error, whereas almost
all of the widely used IRT models have no explicit role for multiple facets. There
is some research, however, that seeks to integrate aspects of G theory and IRT.
For example, Bock, Brennan, and Muraki (2002) have proposed a procedure that
incorporates multiple sources of error directly into the information function and,
hence, into the IRT SEM. Also, Briggs and Wilson (2007) and Chien (2008) have
16 It might be argued that ENR is an expected value over a propensity distribution of performance
on the fixed items, but even then, the items (or item parameters) themselves are still fixed.
considered an approach that estimates variance components based on IRT estimates of expected number-correct scores rather than the actual observed scores used in G theory. Briggs and Wilson (2007) consider an items facet only; Chien (2008) considers two facets. In addition, there have been a number of informal, unpublished suggestions that Bayesian priors be used to turn the fixed items (more correctly, fixed item parameters) in IRT into random variables.17 None of these approaches have been studied much yet, but it is encouraging that researchers are making attempts at integrating G theory and IRT. Even if the attempts fall short, they may lead to beneficial insights.

TABLE 1
Comparisons Among CTT, G Theory, and IRT

Forms and parallelism
  CTT:      Classically parallel, essentially tau-equivalent, etc.
  G theory: Randomly parallel
  IRT:      Strictly parallel

True score
  CTT:      Expectation over forms
  G theory: Expectation over randomly parallel forms
  IRT:      Expected number right for fixed set of items

Assumptions
  CTT:      Relatively weak
  G theory: Relatively weak
  IRT:      Very strong

Primary strengths
  CTT:      Simplicity; widely used; has stood test of time
  G theory: Conceptual breadth; disentangles multiple sources of error; distinguishes between fixed and random facets
  IRT:      Mathematically elegant; solves many complex measurement problems if assumptions hold

Primary weaknesses
  CTT:      Undifferentiated error
  G theory: Conceptual complexity
  IRT:      Only fixed facet(s)

Use and understanding
  CTT:      Easy
  G theory: Sometimes challenging
  IRT:      Sometimes challenging
Table 1 provides a comparison among CTT, G theory, and IRT with respect
to many of the issues considered in this section. The comparative phrases in
Table 1 are necessarily succinct; they should be interpreted in the more extended
sense discussed in this section. The differences among models are substantive and
important, but each of these models is defensible and valuable, and no one of them
is a substitute for the other, at least not in their current instantiations. It is unfortunate that much of the current research and practice in educational measurement
does not give more attention to the differences among these models, and especially
the differences among their assumptions.
17 Bayesian priors are actually involved in the Briggs and Wilson (2007) and Chien (2008)
approaches, which employ MCMC methods.
REFERENCES
Bock, R. D., Brennan, R. L., & Muraki, E. (2002). The information in multiple ratings. Applied
Psychological Measurement, 26, 364–375.
Borsboom, D. (2005). Measuring the mind. Cambridge, UK: Cambridge University Press.
Brennan, R. L. (1992). Elements of generalizability theory (rev. ed.). Iowa City, IA: ACT, Inc.
Brennan, R. L. (1997). A perspective on the history of generalizability theory. Educational
Measurement: Issues and Practice, 16(4), 14–20.
Brennan, R. L. (2001a). An essay on the history and future of reliability from the perspective of
replications. Journal of Educational Measurement, 38, 285–317.
Brennan, R. L. (2001b). Generalizability theory. New York: Springer-Verlag.
Briggs, D. C., & Wilson, M. (2007). Generalizability in item response modeling. Journal of
Educational Measurement, 44, 131–155.
Burt, C. (1936). The analysis of examination marks. In P. Hartog & E. C. Rhodes (Eds.), The marks
of examiners. London: Macmillan.
Cardinet, J., Johnson, S., & Pini, G. (2010). Applying generalizability theory using EduG. New York:
Taylor and Francis.
Chien, Y. (2008). An investigation of testlet-based item response models with a random facets design
in generalizability theory. Unpublished doctoral dissertation, University of Iowa, Iowa City, Iowa.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16,
297–334.
Cronbach, L. J. (1976). On the design of educational measures. In D. N. M. de Gruijter & L. J. T. van
der Kamp (Eds.), Advances in psychological and educational measurement (pp. 199–208). New
York: Wiley.
Cronbach, L. J. (1991). Methodological studies—A personal retrospective. In R. E. Snow & D. E.
Wiley (Eds.), Improving inquiry in social science: A volume in honor of Lee J. Cronbach (pp.
385–400). Hillsdale, NJ: Erlbaum.
Cronbach, L. J. (2003). My current thoughts on coefficient alpha and successor procedures.
Educational and Psychological Measurement, 64, 391–418.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral
measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization
of reliability theory. British Journal of Statistical Psychology, 16, 137–163.
Cronbach, L. J., Schönemann, P., & McKie, T. D. (1965). Alpha coefficients for stratified-parallel
tests. Educational and Psychological Measurement, 25, 291–312.
Ebel, R. L. (1951). Estimation of the reliability of ratings. Psychometrika, 16, 407–424.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational Measurement
(3rd ed., pp. 105–146). New York: American Council on Education and MacMillan.
Fisher, R. A. (1925). Statistical methods for research workers. London: Oliver & Boyd.
Gleser, G. C., Cronbach, L. J., & Rajaratnam, N. (1965). Generalizability of scores influenced by
multiple sources of variance. Psychometrika, 30, 395–418.
Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp.
65–110). Westport, CT: American Council on Education/Praeger.
Holland, P. W., & Hoskens, M. (2003). Classical test theory as a first order item response theory:
Application to true-score prediction from a possibly nonparallel test. Psychometrika, 68, 123–149.
Kane, M. T. (1982). A sampling model for validity. Applied Psychological Measurement, 6, 125–160.
Lindquist, E. F. (1953). Design and analysis of experiments in psychology and education. Boston:
Houghton Mifflin.
Lord, F. M. (1955). Estimating test reliability. Educational and Psychological Measurement, 15,
325–336.
Lord, F. M. (1957). Do tests of the same length have the same standard error of measurement?
Educational and Psychological Measurement, 17, 510–521.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ:
Erlbaum.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: AddisonWesley.
Novick, M. R., & Lewis, C. (1967). Coefficient alpha and the reliability of composite measurements.
Psychometrika, 32, 1–13.
Rajaratnam, N., Cronbach, L. J., & Gleser, G. C. (1965). Generalizability of stratified-parallel tests.
Psychometrika, 30, 39–56.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Spearman, C. (1904). The proof and measurement of association between two things. American
Journal of Psychology, 15, 72–101.
Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational
measurement (4th ed., pp. 11–154). Westport, CT: American Council on Education/Praeger.