APPLIED MEASUREMENT IN EDUCATION, 24: 1–21, 2011
Copyright © Taylor & Francis Group, LLC
ISSN: 0895-7347 print / 1532-4818 online
DOI: 10.1080/08957347.2011.532417

Generalizability Theory and Classical Test Theory

Robert L. Brennan
Center for Advanced Studies in Measurement and Assessment, University of Iowa

Author Note: An earlier version of this paper was presented at the 2008 annual meeting of the American Educational Research Association. The paper was one of two presented in a symposium sponsored by the Buros Center for Testing, the sponsor of this journal. The other paper enumerated the benefits of item response theory. We hope to be able to present this item response theory paper in a future issue of the journal. Correspondence should be addressed to Robert L. Brennan, E. F. Lindquist Chair in Measurement and Testing and Director, Center for Advanced Studies in Measurement and Assessment (CASMA), 210D Lindquist, University of Iowa, Iowa City, IA 52242. E-mail: [email protected]

Broadly conceived, reliability involves quantifying the consistencies and inconsistencies in observed scores. Generalizability theory, or G theory, is particularly well suited to addressing such matters in that it enables an investigator to quantify and distinguish the sources of inconsistencies in observed scores that arise, or could arise, over replications of a measurement procedure. Classical test theory is an historical predecessor to G theory and, as such, it is sometimes called a parent of G theory. Important characteristics of both theories are considered in this article, but primary emphasis is placed on G theory. In addition, the two theories are briefly compared with item response theory.

The pursuit of scientific endeavors necessitates careful attention to measurement procedures, the purpose of which is to acquire information about certain attributes or characteristics of objects. The data obtained from any measurement procedure include errors, however, since the measurements may vary depending on numerous conditions of measurement. From this perspective on measurement, “error” does not mean mistake in the conventional sense, and what constitutes error in scores from a measurement procedure is, in part, a matter of definition.

It is one thing to say that error is an inherent aspect of a measurement procedure; it is quite another thing to quantify error and specify which conditions of measurement contribute to it. Doing so necessitates specifying what would constitute an “ideal” measurement (i.e., over what conditions of measurement is generalization intended) and the conditions under which observed scores are obtained.

These and other measurement issues are of concern in virtually all areas of science. Different fields may emphasize different issues, different objects, different characteristics of objects, and even different ways of addressing measurement issues, but the issues themselves pervade scientific endeavors. In education and psychology, historically these types of issues have been subsumed under the heading of “reliability.” Broadly conceived, reliability involves quantifying the consistencies and inconsistencies in observed scores.
It has been stated that “A person with one watch knows what time it is; a person with two watches is never quite sure!” This simple aphorism highlights how easily investigators can be deceived by having information from only one element in a larger set of interest.

The above discussion is closely associated with the conceptual framework of generalizability theory, or G theory, which is the principal focus of this article. G theory enables an investigator to quantify and distinguish the sources of inconsistencies in observed scores that arise, or could arise, over replications of a measurement procedure. Classical test theory (CTT) is an historical predecessor to G theory. Indeed, CTT is sometimes called a parent of G theory.

Provided next is a brief overview of CTT that serves as a bridge to the subsequent overview of G theory. (For more complete overviews of CTT see Lord and Novick, 1968; Feldt and Brennan, 1989; and Haertel, 2006. For more complete overviews of G theory see Cronbach, Gleser, Nanda, and Rajaratnam, 1972, and Brennan, 1992, 2001b.) The focus here is on important aspects of the theories that serve to illustrate similarities and differences between them, as well as between them and other theories, particularly item response theory (IRT).

CLASSICAL TEST THEORY

To understand G theory, it is helpful to consider first some aspects of the CTT model

X = T + E,   (1)

where X, T, and E are observed, true, and error score random variables, respectively. Although CTT is very useful, the simplicity of this model masks at least four important considerations.

First, since T and E are both unobserved variables, to use this model one must make some additional assumptions. There are at least two ways to proceed. First, one can define T as the expected value of the observed scores X, which leads to the expected value of E being zero. Second, one can define the expected value of E as zero, which leads to T being the expected value of X. Clearly, both ways of proceeding lead to the same result, but they differ with respect to what is assumed and what is a consequence of the assumptions. Whichever way one proceeds, however, once T (or E) is defined, then E (or T) is derived unambiguously. That is, the CTT model suggests that T and E are so tightly tied together that if one of them were known, the other would be entirely evident.

Second, it is important to note that in the CTT model, T is definitely not a platonic or “in the eye of God” true score. Lord and Novick (1968) emphasized this over 40 years ago. More recently, Borsboom (2005, pp. 33–34) provided the following interesting example. Currently, an autopsy is required for a definitive diagnosis of Alzheimer’s disease. Let C be a nominal variable that takes two values: c = 0 for absence of Alzheimer’s disease based on an autopsy, and c = 1 for presence of Alzheimer’s disease. This nominal variable can be viewed as a platonic true score. (We neglect the possibility of autopsy errors.) Now, suppose there is some observational test that results in a diagnosis of x = 0 if Alzheimer’s is not suspected and x = 1 if Alzheimer’s is suspected. If this diagnostic test (or different forms of it) is repeated, clearly the expected value (i.e., true score) will be neither 0 nor 1; hence, platonic true score C and expected-value true score T will not be the same.
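To make the distinction concrete, here is a minimal simulation sketch of this kind of example: repeated administrations of a fallible dichotomous test for a person whose platonic true score is known. The sensitivity and specificity values, and the code itself, are illustrative assumptions and are not part of Borsboom's original example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (illustrative) accuracy values for the observational test
sensitivity = 0.80   # P(x = 1 | c = 1), assumed for illustration
specificity = 0.90   # P(x = 0 | c = 0), assumed for illustration

def administer(c, n_reps):
    """Simulate n_reps independent administrations of the fallible test
    for a person whose platonic true score is c (0 or 1)."""
    p_positive = sensitivity if c == 1 else 1.0 - specificity
    return rng.binomial(1, p_positive, size=n_reps)

for c in (0, 1):
    x = administer(c, n_reps=100_000)
    # The expected-value true score T is approximated by the mean over replications
    print(f"platonic true score c = {c}: expected-value true score T is about {x.mean():.3f}")
```

With these assumed error rates, the long-run mean of the observed diagnoses is near .10 for c = 0 and near .80 for c = 1, so the expected-value true score is neither 0 nor 1.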
Third, the form of the CTT model in Equation 1 is so clearly reminiscent of a simple linear regression equation that it is easy to think of E as nothing more than model fit error in the traditional statistical sense. Such a conception is misleading at best, if not outright wrong. The CTT model is a tautology in which all variables on the right-hand side are unobservable, and these unobservable variables have no meaning beyond the assumptions we attach to them. In particular, T does not have some status independent of the other variables in the model, which means that it is misleading to characterize E as a residual or model fit error. Part of the problem here is the multiple connotations associated with the word “model.” In traditional statistical contexts, the word “model” often carries with it the connotation of a relationship between dependent and fixed (i.e., known a priori) independent variables. This notion of the word “model” clearly does not apply to the CTT model; nor does it apply to G theory.

Fourth, as mentioned above, the CTT model is a tautology. As such, it is true by definition. Its truth or falsity cannot be tested by comparing it or its results to some “objective” reality. Physical scientists tend to reserve the word “theory” for models that can be falsified. No such falsification is possible for the CTT model or for G theory. In applications of CTT, what shall count as true score and what shall count as error are very much under the control of the investigator, although this fact is frequently overlooked. In this sense “truth” and “error” are not realities to be discovered—they are investigator-specific constructions to be studied. In CTT “error” does not mean “mistake,” it does not mean lack of model fit, and “truth” and “error” are defined by the investigator even if he or she does not realize it!

Reliability Coefficients and Error Variances

The canonical definition of reliability is usually taken to be the squared correlation between observed and true scores, ρ²(X, T). Other expressions for reliability are given below:

ρ²(X, T) = ρ(X, X′) = σ²(T)/σ²(X) = σ²(T)/[σ²(T) + σ²(E)].   (2)

The last three expressions are typically derived by assuming that, for the indefinitely large population of examinees: (a) test forms (say X and X′) are classically parallel, which means that they have equal observed score means, variances, and covariances, and they covary equally with any other measure; (b) the covariance between errors for parallel forms is 0; and (c) the covariance between true and error scores is 0. (Equivalently, for any indefinitely large subpopulation of examinees, the expected value of the errors is 0 provided examinees are not selected based on their observed scores.)

Several traditional estimates of reliability are motivated by the ρ(X, X′) expression for reliability. These estimates differ overtly with respect to their data collection designs, and they also differ with respect to how error is implicitly defined. For example, if reliability is estimated by computing the correlation between “parallel” forms, then the only errors that are taken into account are those attributable to form differences. By contrast, if reliability is estimated by computing a test–retest correlation, then form differences do not contribute to error variance, but occasion differences do. Clearly, these two estimates of reliability are not estimates of the same parameter, but the CTT model is not rich enough to distinguish clearly between them. These distinctions are much more evident in G theory.
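The point that these two estimates target different parameters can be illustrated with a small simulation. The sketch below assumes a simple score model with person, person-by-form, and person-by-occasion effects; all variance values are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000   # examinees

# Hypothetical standard deviations, chosen only for illustration
sd_t, sd_pf, sd_po, sd_e = 1.0, 0.45, 0.55, 0.50   # true, person x form, person x occasion, residual

T = rng.normal(0, sd_t, n)                             # true scores
pf = {f: rng.normal(0, sd_pf, n) for f in (1, 2)}      # person-by-form effects
po = {o: rng.normal(0, sd_po, n) for o in (1, 2)}      # person-by-occasion effects

def score(form, occasion):
    """Observed scores for all examinees on a given form and occasion."""
    return T + pf[form] + po[occasion] + rng.normal(0, sd_e, n)

# Parallel forms administered on one occasion: only form differences act as error
parallel_forms = np.corrcoef(score(1, 1), score(2, 1))[0, 1]
# Same form administered on two occasions: only occasion differences act as error
test_retest = np.corrcoef(score(1, 1), score(1, 2))[0, 1]

print(f"parallel-forms estimate: {parallel_forms:.3f}")
print(f"test-retest estimate:    {test_retest:.3f}")
# The two estimates differ because each treats a different source of
# inconsistency as error; neither is wrong, but they are not the same parameter.
```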
Other estimates of reliability are more closely linked to one or the other of the last two expressions in Equation 2, both of which make explicit reference to true score variance which, of course, is unknown. Typically, these estimates make use of the fact that the covariance between scores for classically parallel forms is true score variance, that is, σ(X, X′) = σ²(T). The best known of these coefficients is Coefficient α.

Strictly speaking, Coefficient α can be derived using a parallelism assumption that is weaker than classically parallel forms, called essentially tau-equivalent forms, which are special cases of what are called congeneric forms. Two forms are congeneric if their true scores are linearly related; further, their error variances need not be equal, and it follows that their observed score variances need not be equal. Notationally, scores for forms i and j are congeneric if

X_i = (a_i + b_i T) + E_i   and   X_j = (a_j + b_j T) + E_j.   (3)

When b_i = b_j we say the forms are essentially tau-equivalent. Lord and Novick (1968), Feldt and Brennan (1989), and Haertel (2006) provide extensive discussions of reliability coefficients based on these (and other) different definitions of parallelism.

Reliability coefficients seldom play a role in other areas of scientific inquiry. Why are they so prevalent in psychometrics? There are probably at least three reasons. First, psychometrics is generally viewed as beginning with Spearman’s (1904) study of what we now call corrections for attenuation, which adjust observed score correlations using reliability coefficients. Corrections for attenuation are still of considerable interest in educational and psychological measurement. Second, the fact that reliability ranges between 0 and 1 is very appealing to many. Unfortunately, the appeal is deceptive in that it suggests that all of reliability can be captured in a single dimensionless number. That is not true, but the appeal persists, even though reliability coefficients are rather difficult to interpret correctly. (One complexity is that reliability coefficients have nonlinear characteristics. That is why it is much more difficult to raise a reliability coefficient from .90 to .95 than from .50 to .55.) Third, under the assumptions of CTT, it can be shown that the standard error of measurement (SEM) is a function of reliability. Specifically,

σ(E) = σ(X) √[1 − ρ²(X, T)],   (4)

which is arguably more important than ρ²(X, T) itself.

Coefficient α and its Misunderstandings

Without question, the most popular reliability coefficient is Coefficient α, which is often called Cronbach’s α, since Cronbach (1951) popularized it and derived it from several different perspectives. As valuable and useful as this coefficient may be, unfortunately it is widely misunderstood and misused, in part because it is so easy to compute.

One misunderstanding is the common attribution of Coefficient α to Cronbach. As Cronbach (2003) himself noted, he did not invent Coefficient α; other equivalent coefficients were reported in the literature prior to Cronbach (1951).
Indeed, derivations of one or more versions of Coefficient α (before and since 1951) might be the all-time favorite psychometric parlor game!

As noted previously, from the perspective of CTT, the derivation of Coefficient α requires that forms be essentially tau-equivalent. (Classically parallel forms satisfy the assumptions of essential tau-equivalence, but this is not necessarily true for congeneric forms.) In the vast majority of cases, Coefficient α is computed based on item scores; that is, items play the role of forms. In most circumstances, however, it seems highly unlikely that item scores satisfy the assumption of essential tau-equivalence.

A particularly problematic misunderstanding is the frequently cited statement that “Coefficient α is a lower limit to reliability.” Under a particular set of stringent assumptions, this is a mathematically correct statement (see Lord & Novick, 1968, pp. 87–88; Novick & Lewis, 1967), but these assumptions are rarely defensible in real-world situations. In most cases, it is much more likely that Coefficient α is an upper limit to reliability, as Cronbach (1951) noted over a half century ago. This misinterpretation occurs when there is a disconnect between the data used to estimate Coefficient α and the definition of reliability intended by the investigator. For example, if data are collected on a single occasion, but the investigator’s notion of reliability involves generalizing to different occasions (as it usually does), then it is almost certain that error variance will be underestimated.

Although Cronbach did not invent Coefficient α, he did name it, and his choice of a name was not accidental. Consider the following quote from Cronbach (1951):

    A . . . reason for the symbol is that α is one of six analogous coefficients (to be designated β, γ, δ, etc.) which deal with such other concepts as like-mindedness of persons, stability of scores, etc. (pp. 299–300)

Essentially, this quote reinforces the fact that there are many reliability coefficients for any set of test scores. Cronbach did not publish subsequent papers that specifically identified all of the other coefficients (i.e., β, γ, δ, etc.); rather, these notions got incorporated into what came to be called G theory. In short, Coefficient α is properly viewed as an historically important and often useful estimator of reliability, but α should not be deified, and it is much overused.

Lord’s SEM

There are topics that are usually included in the CTT literature that are not quite consonant with the assumptions noted above. For the purposes of this article, a particularly important example is Lord’s (1955, 1957) SEM. Consider a test consisting of k dichotomously scored items. Lord suggested that the SEM for an examinee can be viewed as the standard error of the mean for that examinee, where each observable mean is the examinee’s mean score on a random sample of k items drawn from an infinite universe of items. In terms of parameters, Lord’s SEM is simply

σ(E*) = √[τ_p(1 − τ_p)/k],   (5)

where τ_p is the true score for the examinee in the mean-score metric (i.e., proportion-correct scores). (The more familiar estimation formula for Lord’s SEM in the mean-score metric is σ(E*) = √[X̄_p(1 − X̄_p)/(k − 1)].) It is worth noting that Lord’s SEM is not a simple function of reliability, whereas the CTT formula in Equation 4 is.
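As an illustration of that difference, the sketch below computes Coefficient α from a simulated matrix of dichotomous item scores, the corresponding CTT SEM from Equation 4 (using α as the reliability estimate, which is a convenience assumption), and Lord's per-examinee SEM. The data and all numeric values are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated dichotomous item scores, purely illustrative: 1,000 persons, k = 40 items
n_persons, k = 1000, 40
ability = rng.normal(0, 1, n_persons)
difficulty = rng.normal(0, 1, k)
p_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
items = rng.binomial(1, p_correct)          # persons x items matrix of 0/1 scores

# Coefficient alpha computed from item scores (items playing the role of forms)
alpha = (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                         / items.sum(axis=1).var(ddof=1))

# CTT SEM (Equation 4) in the mean-score metric, using alpha as the reliability estimate
mean_scores = items.mean(axis=1)
ctt_sem = mean_scores.std(ddof=1) * np.sqrt(1 - alpha)

# Lord's SEM for each examinee (estimation formula in the mean-score metric)
lord_sem = np.sqrt(mean_scores * (1 - mean_scores) / (k - 1))

print(f"Coefficient alpha: {alpha:.3f}")
print(f"CTT SEM (one value for all examinees): {ctt_sem:.3f}")
print(f"Lord's SEM (varies by examinee): median {np.median(lord_sem):.3f}, "
      f"range {lord_sem.min():.3f} to {lord_sem.max():.3f}")
```

The CTT SEM is a single number driven by the reliability estimate, whereas Lord's SEM depends on each examinee's proportion-correct score rather than on reliability.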
Furthermore, it can be shown that the average value of σ²(E*) is greater than σ²(E) when both are on the same metric (see Brennan, 2001b, pp. 33, 160).

Lord’s SEM is a kind of bridge between CTT and G theory in at least two senses. First, Lord’s SEM uses a random sampling model to estimate error variance rather than CTT notions of parallelism. Second, Lord’s SEM uses a within-person design as opposed to the across-persons design that characterized virtually all the reliability literature prior to the 1950s. As discussed next, G theory replaces CTT notions of parallelism with randomly parallel forms, and G theory explicitly incorporates different types of data collection designs.

UNIVARIATE GENERALIZABILITY THEORY

G theory offers an extensive conceptual framework and a powerful set of statistical procedures for addressing numerous measurement issues. Often, CTT and analysis of variance (ANOVA) are viewed as the parents of G theory.

Parents and Some History

In CTT, there is only one E term, which does not mean there is necessarily only one source of error; it does mean, however, that in a single application of CTT, all sources of error are confounded in one E term. One of the most important and simplest perspectives on the G theory model is that it disconfounds the multiple sources of error that interest an investigator, say H of them; so, in a sense, the G theory model can be viewed as

X = μ_p + E_1 + E_2 + · · · + E_H,   (6)

where μ_p is universe score, which is the G theory analogue of true score. Importantly, in G theory the investigator must decide which sources of error are of interest, which effectively defines the facets of measurement. Universe score is then defined as the expected value of observed scores over replications of the measurement procedure (see Brennan, 2001a), where each such replication involves a different random sample of conditions from each of the measurement facets.

In its essential features, the high-level model in Equation 6 is quite consistent with important aspects of the statistical framework for ANOVA. As noted by Cronbach et al. (1972), when Fisher (1925) introduced ANOVA, he

    revolutionized statistical thinking with the concept of the factorial experiment in which the conditions of observations are classified in several respects. Investigators who adopt Fisher’s line of thought must abandon the concept of undifferentiated error. The error formerly seen as amorphous is now attributed to multiple sources, and a suitable experiment can estimate how much variation arises from each controllable source. (p. 1)

The defining treatment of G theory is a monograph by Cronbach et al. (1972) entitled The Dependability of Behavioral Measurements. A history of the theory is provided by Brennan (1997). Brennan (2001b) provides an extensive exposition of G theory. Shavelson and Webb (1991) provide a primer. Cardinet, Johnson, and Pini (2010) provide a treatment of G theory based on a perspective that is somewhat different from that of the previously cited authors.

In discussing the genesis of G theory, Cronbach (1991, pp. 391–392) states:

    In 1957 I obtained funds from the National Institute of Mental Health to produce, with Gleser’s collaboration, a kind of handbook of measurement theory. . . .
    “Since reliability has been studied thoroughly and is now understood,” I suggested to the team, “let us devote our first few weeks to outlining that section of the handbook, to get a feel for the undertaking.” We learned humility the hard way—the enterprise never got past that topic. Not until 1972 did the book appear . . . that exhausted our findings on reliability reinterpreted as generalizability. Even then, we did not exhaust the topic. When we tried initially to summarize prominent, seemingly transparent, convincingly argued papers on test reliability, the messages conflicted.

To resolve these conflicts, Cronbach and his colleagues devised a rich conceptual framework and married it to the analysis of random effects variance components. The net effect is “a tapestry that interweaves ideas from at least two dozen authors” (Cronbach, 1991, p. 394). In particular, the work of Burt (1936), Ebel (1951), and Lindquist (1953, chap. 16) appears to have anticipated various aspects of G theory.

Framework and Machinery

Although CTT and ANOVA can be viewed as the parents of G theory, the child is both more and less than the simple conjunction of its parents, and appreciating G theory requires an understanding of more than its lineage. For example, although G theory liberalizes CTT, not all aspects of CTT are incorporated in G theory. Also, the ANOVA issues emphasized in G theory are different from those that predominate in many experimental design and ANOVA texts. In particular, G theory concentrates on variance components and their estimation, not F tests.

Perhaps the most important aspect and unique feature of G theory is its conceptual framework. Among the concepts are universes of admissible observations and G (generalizability) studies, as well as universes of generalization and D (decision) studies. Some of the more important concepts and methods of G theory are introduced next using a hypothetical scenario.

Suppose a testing company ABC decides that it wants to begin offering a writing proficiency testing program called WPT. ABC needs to identify, or otherwise characterize, the types of essay prompts, t, that will be used and the types of raters, r. Obviously, there are other considerations, too, but we will consider only these two facets here. (A facet is simply a set of similar conditions of measurement, where the investigator decides what “similar” means.) Suppose that, in theory, responses to any prompt could be evaluated by any rater, and the number of potential prompts and raters is indefinitely large. Under these specifications, we say that both facets are infinite in the universe of admissible observations, and they are crossed, that is, t × r.

So far, no reference has been made to persons who respond to the essay prompts. In G theory the word universe is reserved for conditions of measurement (prompts and raters, here), while the word population is used for the objects of measurement (persons, here). In the population and universe of admissible observations, any observable score for a single essay prompt evaluated by a single rater can be represented as

X_ptr = μ + ν_p + ν_t + ν_r + ν_pt + ν_pr + ν_tr + ν_ptr,   (7)

where μ is the grand mean in the population and universe and ν designates any one of the seven uncorrelated effects, or components. We say that Equation 7 is the p × t × r (persons crossed with tasks crossed with raters) linear model.
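The p × t × r model can be made concrete with a small simulation sketch. The code below generates scores according to Equation 7 for a hypothetical G study; the grand mean and the variance components assigned to the seven effects are illustrative assumptions (with the person × task component deliberately made relatively large), not values from any real writing assessment.

```python
import numpy as np

rng = np.random.default_rng(3)
n_p, n_t, n_r = 200, 4, 3   # persons, prompts (tasks), and raters in a hypothetical G study

# Assumed variance components for the seven effects (illustrative values only)
var = {"p": 0.40, "t": 0.05, "r": 0.02, "pt": 0.30,
       "pr": 0.03, "tr": 0.01, "ptr": 0.20}
mu = 2.5   # grand mean, illustrative

# Draw each uncorrelated effect in Equation 7 and broadcast to a persons x tasks x raters array
nu_p   = rng.normal(0, np.sqrt(var["p"]),   (n_p, 1, 1))
nu_t   = rng.normal(0, np.sqrt(var["t"]),   (1, n_t, 1))
nu_r   = rng.normal(0, np.sqrt(var["r"]),   (1, 1, n_r))
nu_pt  = rng.normal(0, np.sqrt(var["pt"]),  (n_p, n_t, 1))
nu_pr  = rng.normal(0, np.sqrt(var["pr"]),  (n_p, 1, n_r))
nu_tr  = rng.normal(0, np.sqrt(var["tr"]),  (1, n_t, n_r))
nu_ptr = rng.normal(0, np.sqrt(var["ptr"]), (n_p, n_t, n_r))

X = mu + nu_p + nu_t + nu_r + nu_pt + nu_pr + nu_tr + nu_ptr   # X_ptr for every cell

print("simulated G-study score array (persons x tasks x raters):", X.shape)
print("overall variance of the simulated scores:", round(float(X.var(ddof=1)), 3))
print("sum of the assigned variance components: ", round(sum(var.values()), 3))
# The two numbers agree only roughly here because the task and rater samples are small.
```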
Assuming that the effects in Equation 7 are uncorrelated, the variance of the observed scores is

σ²(X_ptr) = σ²(p) + σ²(t) + σ²(r) + σ²(pt) + σ²(pr) + σ²(tr) + σ²(ptr).   (8)

The terms to the right of the equal sign are called random effects variance components. They can be estimated using expected mean square equations for a G study in which a sample of n_p persons respond to n_t prompts that are evaluated by n_r raters.

Once estimated variance components are available, they can be used to estimate universe score variance, error variances, and reliability-like coefficients for various universes of generalization and D study designs. A universe of generalization can be viewed as the universe of randomly parallel forms of WPT, where each such form uses n′_t prompts and n′_r raters. (It need not be true that n′_t = n_t nor that n′_r = n_r; that is, the sample sizes used to estimate variance components need not equal the sample sizes used in an operational form of the test.) A D study design is the design used operationally for a form of WPT. (D study designs can differ with respect to structure and/or sample sizes.)

A crucial consideration in defining a universe of generalization is answering the question, “Which facet(s) shall be considered random and which shall be considered fixed?” A facet is considered random when its conditions in the D study are a sample from those in the universe of generalization. (Strictly speaking, for a random facet it is assumed that the number of conditions in the universe of generalization is indefinitely large.) A facet is fixed when its conditions in the D study exhaust its conditions in the universe of generalization. G theory does not specify which facets should be considered random and which should be considered fixed; that is the prerogative and the responsibility of the investigator. It should be noted, however, that fixing one or more facets generally lowers error variance and increases coefficients at the expense of narrowing interpretations.

Infinite universe of generalization and crossed D study design

Suppose that ABC decides that both prompts and raters shall be viewed as random for WPT, and the D study design will have the same crossed structure as the G study design (that is, the D study design shall be p × T × R with n′_t prompts and n′_r raters). Then, universe score variance is

σ²(τ) = σ²(p),   (9)

relative error variance is

σ²(δ) = σ²(pt)/n′_t + σ²(pr)/n′_r + σ²(ptr)/(n′_t n′_r),   (10)

absolute error variance is

σ²(Δ) = σ²(t)/n′_t + σ²(r)/n′_r + σ²(tr)/(n′_t n′_r) + σ²(pt)/n′_t + σ²(pr)/n′_r + σ²(ptr)/(n′_t n′_r),   (11)

a generalizability coefficient is

Eρ² = σ²(τ)/[σ²(τ) + σ²(δ)],   (12)

and a dependability coefficient is

Φ = σ²(τ)/[σ²(τ) + σ²(Δ)].   (13)

Equations 9–13 are expressed in terms of the mean score metric, which is the tradition in G theory; by contrast, CTT equations are almost always expressed in terms of the total score metric. Relative error variance, σ²(δ), and a generalizability coefficient, Eρ², are analogous to σ²(E) and ρ²(X, T), respectively, in CTT in that they characterize error and reliability for decisions based on comparing examinees.
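A minimal sketch of these D study computations follows. It simply evaluates Equations 9–13 for the crossed p × T × R design with both facets random; the variance component values are the same illustrative assumptions used in the simulation sketch above, and the D study sample sizes are arbitrary.

```python
# Assumed G-study variance component estimates for the p x t x r design (illustrative only)
var = {"p": 0.40, "t": 0.05, "r": 0.02, "pt": 0.30,
       "pr": 0.03, "tr": 0.01, "ptr": 0.20}

def d_study_crossed_random(var, nt, nr):
    """Equations 9-13 for the crossed p x T x R D study, prompts and raters both random."""
    tau = var["p"]                                                    # Eq. 9
    delta = var["pt"]/nt + var["pr"]/nr + var["ptr"]/(nt*nr)          # Eq. 10
    Delta = (var["t"]/nt + var["r"]/nr + var["tr"]/(nt*nr)            # Eq. 11
             + var["pt"]/nt + var["pr"]/nr + var["ptr"]/(nt*nr))
    return {"sigma2(tau)": tau, "sigma2(delta)": delta, "sigma2(Delta)": Delta,
            "E(rho^2)": tau / (tau + delta),                          # Eq. 12
            "Phi": tau / (tau + Delta)}                               # Eq. 13

for nt, nr in [(1, 1), (2, 2), (4, 2)]:   # hypothetical D-study sample sizes n't and n'r
    res = d_study_crossed_random(var, nt, nr)
    print(f"n't={nt}, n'r={nr}: " + ", ".join(f"{k}={v:.3f}" for k, v in res.items()))
```

As expected, larger D study sample sizes shrink both error variances and raise both coefficients.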
It is important to note, however, that except in trivial cases σ²(δ) ≠ σ²(E) and Eρ² ≠ ρ²(X, T). (The most common “trivial” case is a design and universe with a single random facet.) By contrast, strictly speaking, CTT has no analogue for σ²(Δ), which is the error variance for making absolute (e.g., pass–fail) decisions about examinees. If we go beyond the strict realm of CTT and consider Lord’s error variance, however, there are some clear similarities—most obviously, both σ²(Δ) and Lord’s error variance are derived under random sampling assumptions (see Brennan, 1997, for more details).

If an investigator performs a CTT analysis (e.g., computes Coefficient α) when there is more than one random facet, it is likely that error variance will be underestimated. Consider, for example, σ²(δ) in Equation 10, which is based on n′_t n′_r observations for each examinee. If Coefficient α is computed using the n′_t n′_r observations for each examinee, then the estimated error variance in the mean score metric will be [σ̂²(pt) + σ̂²(pr) + σ̂²(ptr)]/(n′_t n′_r), which is clearly smaller than the estimate of σ²(δ) based on Equation 10. This illustrates that CTT estimated error variances are generally too small when there is more than one random facet in the universe of generalization.

Different universes of generalization and D study designs

For different universes of generalization and D study designs, the expressions for Eρ² and Φ in Equations 12 and 13, respectively, still apply. Universe score variance and error variances change, however, if the universe of generalization changes. In addition, error variances change if the design changes and/or sample sizes change. (CTT deals with sample size changes through the Spearman–Brown formula, described in Feldt & Brennan, 1989, and Haertel, 2006, which does not apply when there is more than one random facet; see Brennan, 2001b, pp. 116–117, for an example.)

Suppose ABC decides to use the same tasks for all forms of WPT. If so, we would say that tasks are fixed in the universe of generalization, and it can be shown that universe score variance is

σ²(τ) = σ²(p) + σ²(pt)/n′_t,   (14)

relative error variance is

σ²(δ) = σ²(pr)/n′_r + σ²(ptr)/(n′_t n′_r),   (15)

and absolute error variance is

σ²(Δ) = σ²(r)/n′_r + σ²(tr)/(n′_t n′_r) + σ²(pr)/n′_r + σ²(ptr)/(n′_t n′_r).   (16)

Comparing these equations with Equations 9–11, it is evident that when tasks are fixed, universe score variance increases and error variances decrease, which leads to larger coefficients. Conceptually, fixing a facet restricts the universe of generalization and, in doing so, decreases the gap between observed and universe scores at the price of narrowing interpretations.
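Continuing with the same hypothetical numbers, the sketch below evaluates Equations 14–16 for the fixed-tasks universe and compares the results with the fully random case; the variance components and sample sizes are illustrative assumptions only.

```python
# Same assumed variance components as in the earlier sketches (illustrative only)
var = {"p": 0.40, "t": 0.05, "r": 0.02, "pt": 0.30,
       "pr": 0.03, "tr": 0.01, "ptr": 0.20}
nt, nr = 2, 2   # hypothetical D-study sample sizes n't and n'r

# Tasks fixed, raters random (Equations 14-16)
tau_fixed = var["p"] + var["pt"]/nt
delta_fixed = var["pr"]/nr + var["ptr"]/(nt*nr)
Delta_fixed = var["r"]/nr + var["tr"]/(nt*nr) + var["pr"]/nr + var["ptr"]/(nt*nr)

# Both facets random (Equations 9-11), for comparison
tau_rand = var["p"]
delta_rand = var["pt"]/nt + var["pr"]/nr + var["ptr"]/(nt*nr)

print(f"tasks fixed : E(rho^2) = {tau_fixed/(tau_fixed + delta_fixed):.3f}, "
      f"Phi = {tau_fixed/(tau_fixed + Delta_fixed):.3f}")
print(f"tasks random: E(rho^2) = {tau_rand/(tau_rand + delta_rand):.3f}")
# Fixing tasks moves sigma^2(pt) into universe score variance and out of error,
# so the coefficients increase, at the price of a narrower universe of generalization.
```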
Suppose ABC publishes a technical manual in which it claims that WPT is a highly reliable testing program because inter-rater reliability coefficients are high. Let us consider this claim from the perspective of G theory. Suppose each inter-rater coefficient is a Pearson correlation based on the responses of examinees to a single task, with each response rated by the same two raters. Even if there are multiple coefficients reported, as long as each of them is based on a single task, then task is effectively being treated as fixed, whether or not ABC realizes it. (Averaging inter-rater coefficients does not obviate this problem; it merely masks it.) Furthermore, a correlation between two conditions or units (here, raters) is an estimate of reliability for one of them. Therefore, the inter-rater coefficients are interpretable as estimates of

Eρ² = (σ²(p) + σ²(pt)) / (σ²(p) + σ²(pt) + [σ²(pr) + σ²(ptr)]),   (17)

where σ²(δ) is enclosed in square brackets. Compare this with Eρ² when both raters and tasks are random, and an examinee’s score is the average over n′_t tasks and n′_r raters (see Equations 9, 10, and 12):

Eρ² = σ²(p) / (σ²(p) + [σ²(pt)/n′_t + σ²(pr)/n′_r + σ²(ptr)/(n′_t n′_r)]),   (18)

where σ²(δ) is enclosed in square brackets. Equation 18 almost always reflects the D study design and intended universe much better than Equation 17, but Eρ² in Equation 18 is likely to be much smaller than Eρ² in Equation 17, primarily because σ²(pt) moves from universe score variance in Equation 17 to error variance in Equation 18. This is an important matter. In most testing programs σ²(pt) is quite large, which more than offsets the decrease in error variance that results from division by sample sizes in Equation 18, especially since n′_t and n′_r tend to be quite small in writing assessments.

Sometimes inter-rater coefficients are reported based on a side study, but operationally each response is rated by a single rater. If so, Equations 17 and 18 still apply, but n′_r = 1 in Equation 18. Importantly, however, σ²(pr) cannot be estimated unless a G study is conducted that has n_r ≥ 2.

The above discussion may be somewhat challenging, but it is still oversimplified relative to what often happens in practice. In particular, the assignment of raters to prompts and/or examinees is often more complicated than implied by the design considered above. Suppose, for example, that for the operational assessment, a different set of raters will evaluate responses to each prompt or task, t. This is a verbal description of the D study p × (R:T) design, where “:” is read “nested within.” For this design, if both raters and tasks are random, it can be shown that

Eρ² = σ²(p) / (σ²(p) + σ²(pt)/n′_t + σ²(pr:t)/(n′_t n′_r)),   (19)

where σ²(pr:t) represents the confounding of σ²(pr) and σ²(ptr). This means that if the G study were conducted using the p × t × r design, then σ²(pr:t) = σ²(pr) + σ²(ptr). It follows that σ²(pr) is divided by both n′_t and n′_r in Equation 19, whereas σ²(pr) is divided by n′_r only in Equation 18. Therefore, when n′_r > 1 and n′_t > 1, σ²(δ) is smaller and Eρ² is larger for the nested design than for the crossed design. (A similar statement holds for Φ.)

In brief, this hypothetical WPT scenario illustrates that:

• universe score variance gets larger and error variances get smaller if a facet shifts from being considered random to being considered fixed;
• larger D study sample sizes lead to smaller error variances; and
• nested D study designs usually lead to smaller error variances and larger coefficients.

These conclusions are entirely predictable given the rich conceptual framework of G theory.
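The sketch below makes these comparisons numeric by evaluating Equations 17–19 with the same hypothetical variance components used in the earlier sketches. Because the assumed σ²(pt) is relatively large, consistent with the text's observation about most testing programs, the inter-rater coefficient comes out much larger than either generalizability coefficient.

```python
# Same assumed variance components as in the earlier sketches (illustrative only)
var = {"p": 0.40, "t": 0.05, "r": 0.02, "pt": 0.30,
       "pr": 0.03, "tr": 0.01, "ptr": 0.20}
nt, nr = 2, 2   # hypothetical D-study sample sizes n't and n'r

# Equation 17: inter-rater coefficient (single task, so the task facet is effectively fixed)
interrater = (var["p"] + var["pt"]) / (var["p"] + var["pt"] + var["pr"] + var["ptr"])

# Equation 18: crossed p x T x R design with tasks and raters both random
delta_crossed = var["pt"]/nt + var["pr"]/nr + var["ptr"]/(nt*nr)
crossed = var["p"] / (var["p"] + delta_crossed)

# Equation 19: nested p x (R:T) design; sigma^2(pr:t) = sigma^2(pr) + sigma^2(ptr)
delta_nested = var["pt"]/nt + (var["pr"] + var["ptr"])/(nt*nr)
nested = var["p"] / (var["p"] + delta_nested)

print(f"inter-rater coefficient (Eq. 17): {interrater:.3f}")
print(f"crossed design E(rho^2) (Eq. 18): {crossed:.3f}")
print(f"nested design E(rho^2)  (Eq. 19): {nested:.3f}")
# With sigma^2(pt) large, the inter-rater coefficient is by far the largest, and the
# nested design edges out the crossed design because sigma^2(pr) is divided by n't*n'r.
```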
MULTIVARIATE GENERALIZABILITY THEORY

The essential features of univariate G theory were largely completed with technical reports by the Cronbach team in 1960–1961. These were revised into three journal articles, each with a different first author (Cronbach, Rajaratnam, & Gleser, 1963; Gleser, Cronbach, & Rajaratnam, 1965; Rajaratnam, Cronbach, & Gleser, 1965). In the mid-1960s, motivated by Harinder Nanda’s studies on interbattery reliability, the Cronbach team began their development of multivariate G theory, which is incorporated in their 1972 monograph, and which they regarded as the most unique aspect of G theory. (It can be argued that stratified alpha, Cronbach, Schönemann, & McKie, 1965, is a CTT precursor to multivariate G theory.) Cronbach (1976) provides more historical details. The last four chapters in Brennan (2001b) provide an integrated and extended treatment of multivariate G theory.

Multivariate G theory is multivariate primarily in the sense of multiple universes of generalization and, hence, multiple universe scores for each examinee. In addition, there are corresponding multiple universes of admissible observations. Each one of the multiple universes is associated with a single fixed condition of measurement. Statistically, this implies that multivariate G theory analyses involve not only variance components but also covariance components.

To continue with the WPT example, suppose each form involves both narrative and informative types of prompts. We will designate these prompt types as v1 and v2, respectively. If, for each type, the population and universe of admissible observations is fully crossed (i.e., p × t × r), then there are seven variance components for v1 and a different seven components for v2. For example, σ₁²(p) is the person variance component for v1, and σ₂²(p) is the person variance component for v2. In addition, for each pair of variance components there is the possibility of a covariance component. For this example, almost certainly persons would respond to prompts of both types, which means that the covariance component for persons, σ₁₂(p), would be non-zero. On the other hand, probably there would be different prompts (t) for v1 and v2. If so, σ₁₂(t) = σ₁₂(pt) = σ₁₂(tr) = σ₁₂(ptr) = 0. The same raters might or might not be used for the two types of prompts, which means that σ₁₂(r) and σ₁₂(pr) might or might not be zero. In short, the multivariate WPT example has seven variance–covariance matrices that replace the seven variance components of the univariate example.

Univariate D study analyses for the WPT example can be performed for v1 and v2 separately, which gives results specific to narrative and informative prompts, respectively. In addition, for the WPT example it is likely that ABC would perform analyses for one or more composite universe scores defined generally as μ_pC = w₁ μ_p1 + w₂ μ_p2. For example, if w₁ + w₂ = 1, then the analyses would be for weighted mean scores over both narrative and informative prompts. If w₁ = 1 and w₂ = −1, then the analyses would be for difference scores. This relatively simple WPT example hints at the power and flexibility of multivariate G theory.
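One concrete illustration of what the covariance components provide is the universe score variance of a composite. Because μ_pC is a weighted sum, its variance is w₁²σ₁²(p) + w₂²σ₂²(p) + 2w₁w₂σ₁₂(p). The sketch below evaluates this for hypothetical person variance and covariance components; the values and weights are illustrative assumptions only.

```python
# Hypothetical person-level variance and covariance components for the two prompt types
var_p1, var_p2 = 0.40, 0.35   # sigma_1^2(p) and sigma_2^2(p), illustrative
cov_p12 = 0.30                # sigma_12(p), illustrative

def composite_universe_score_variance(w1, w2):
    """Variance of the composite universe score w1*mu_p1 + w2*mu_p2."""
    return w1**2 * var_p1 + w2**2 * var_p2 + 2 * w1 * w2 * cov_p12

# Weighted mean over the two prompt types (w1 + w2 = 1)
print("mean composite:      ", round(composite_universe_score_variance(0.5, 0.5), 4))
# Difference scores (w1 = 1, w2 = -1): the large positive covariance leaves
# relatively little universe score variance for differences
print("difference composite:", round(composite_universe_score_variance(1.0, -1.0), 4))
```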
Indeed, it can be said that multivariate G theory is the whole of G theory, with univariate G theory simply being a special case. This multivariate perspective on G theory illustrates that it is essentially a random effects theory. The reader may quarrel with this last assertion by noting that the previous discussion of univariate G theory considered a mixed model in which there was a fixed facet. True enough, but any univariate mixed model can always be reformulated as a multivariate model in which the levels of the fixed facet(s) become levels of v. Indeed, doing so provides a more flexible representation of levels of a fixed facet (a mixed-model univariate analysis effectively makes a statistical “hidden” choice for the w weights for each fixed level, whereas a multivariate analysis leaves the choice of weights to the investigator), and usually greatly simplifies estimation, especially for mixed models that have designs that are unbalanced with respect to nesting (see, for example, Brennan, 2001b, pp. 268–273).

COMPARING THEORIES

For ease of reference, in this section CTT and G theory are sometimes referred to as expected value theories to contrast them with item response theory (IRT). We begin with a comparison of CTT and G theory that includes consideration of some of the strengths and weaknesses of these expected value theories. Then these theories are briefly compared with IRT.

Expected Value Theories

CTT and G theory have a number of similarities. They are both tautologies in which terms to the right of the equal sign are unobserved, both theories define true (or universe) score as an expected value of observed scores, both theories explicitly incorporate random errors of measurement, and both theories have well-defined (and similar) notions of reliability (or generalizability).

It has been said by Cronbach et al. (1972) and by Brennan (2001b) that G theory “liberalizes” CTT. This is true in several senses. First, G theory permits disentangling the multiple sources of error that are confounded in the single E term of CTT. Second, G theory has a much richer conceptual framework than CTT, which leads to resolutions of a number of apparent contradictions in various CTT discussions of reliability. The two most important characteristics of G theory that facilitate resolving contradictions are: (a) G theory’s distinction between fixed and random measurement facets and (b) G theory’s capability of dealing with different D study designs. Third, multivariate G theory expands reliability considerations to multiple universes of generalization, which have no corresponding status in CTT. Fourth, as noted by Cronbach et al. (1972) and Brennan (2001b), G theory blurs distinctions between reliability and validity. Kane (1982), for example, provides a particularly prescient discussion of the reliability–validity paradox from the perspective of G theory.

To say that G theory liberalizes CTT does not mean, however, that all of CTT is subsumed under G theory or that CTT can or should be completely replaced by G theory. There are still some important differences between the two theories that more than justify retaining both. Perhaps the most obvious difference is in definitions of parallelism. G theory incorporates a single notion of parallelism, namely, the notion of randomly parallel forms. This is quite different from the notion of classically parallel forms in CTT. Both types of parallelism are idealized and not ever likely to be strictly true, although one or the other may be more sensible in particular contexts. Furthermore, CTT has several well-developed, useful definitions of parallelism that are weaker than classically parallel forms (in particular, essentially tau-equivalent forms and congeneric forms), whereas G theory has no role, as yet, for different types of parallelism.
In considering models, it often seems that what is a strength from one perspective is a weakness or limitation from another perspective. For example, one of the strengths of CTT is that it is based on the very simple model X = T + E, but the simplicity of the model is also a weakness in that it does not permit us to disentangle the multiple sources of error in E. By contrast, the capability of disentangling error sources is an important strength of G theory, but that strength is purchased at the price of conceptual complexity. The complexity of G theory is often a stumbling block for those who seek to find simple answers to measurement questions. In reality, however, most thoughtful consideration of such questions requires grappling with conceptual matters that are often complex and not easily addressable with a template. An important strength of G theory is that it is rich enough to guide investigators through such measurement mazes, but that strength makes cognitive demands on investigators. In the end, there is no psychometric “free lunch.”

Expected Value Theories and IRT

Given the popularity of item response theory (IRT) (see, for example, Lord, 1980, and Yen & Fitzpatrick, 2006), it seems obvious to consider some similarities and differences between IRT and the two expected value theories discussed in this article.

In both substantive and utilitarian senses, there is a rather obvious difference between the two types of theories. Specifically, IRT focuses on item responses, whereas CTT and G theory focus on test or form scores. Using IRT, investigators can clearly distinguish among different items. By contrast, G theory cannot distinguish among items, since it is a random sampling model, just as different persons are not distinguishable in survey sampling research. CTT can make distinctions among items only if items are defined as forms, but if that is done, parallelism assumptions are often suspect. (If items are considered as congeneric forms, then perhaps this problem can be circumvented; L. S. Feldt, personal communication, March 3, 2010.)

Some may object to the above characterization of CTT by noting that there is a long history of using so-called classical item analysis statistics such as difficulty levels and point-biserial discrimination indices. True enough. Such statistics, however, are not easily defended from a strict interpretation of CTT as discussed in this article. The essential problem is that almost always item scores grossly violate the assumption of classically parallel forms, and even the assumptions of essentially tau-equivalent forms. Classical item analysis statistics have a long-standing demonstrated utility for test development, but that does not mean they are well modeled by CTT.

A forest-trees metaphor is reasonably apt for considering IRT vis-à-vis expected value theories. Consider individual items as trees and the universe of items as the forest. If we focus on individual trees as we do in IRT, then we are easily oblivious to the forest. If we focus on the forest, then the trees are indistinguishable. To put it another way, in IRT items (more correctly, item parameters) are effectively fixed, which means that a replication would consist of identically the same items (or, more correctly, a set of items with identically the same parameters).
Call this “strictly” parallel forms. The notion of randomly parallel forms in G theory is much less restrictive, and even the various CTT notions of parallel forms are much weaker than “strictly” parallel forms.

Traditional developments of IRT do not typically mention fixed items or strictly parallel forms. These notions are implicit, however, in other aspects of IRT. For example, in the derivation of the standard error of the maximum-likelihood estimate of θ, there is no consideration of sampling items; and if items are not sampled, they must be fixed. Also, the expected number-correct (ENR) on the vertical axis in a test characteristic curve (TCC) is typically viewed as number-correct true score. However, ENR is not an expected value over any set of items different from those for the specific TCC, since the TCC itself is conditional on a very specific set of items. (It might be argued that ENR is an expected value over a propensity distribution of performance on the fixed items, but even then, the items, or item parameters, themselves are still fixed.) Therefore, there is a discontinuity between the IRT notion of true score (ENR) and the notion of true score in CTT and G theory. This is particularly evident in comparing IRT and G theory: items are fixed in IRT, whereas they are almost always treated as random in G theory.

Some of the above comments may appear to conflict with some old and current literature. For example, Lord and Novick (1968) show relationships between certain classical item analysis statistics and normal ogive item parameters. True enough, but the relationships are based on first assuming that a normal ogive model fits. The fact that proposition A implies proposition B does not mean that B implies A; that is, the Lord and Novick (1968) demonstration does not mean that CTT and IRT are interchangeable for item analysis purposes. A similar type of comment, although more nuanced, applies to Holland and Hoskens (2003), which in no way mitigates the quality or importance of their research. It is true that Lord and Novick (1968) and Holland and Hoskens (2003) have taken steps in the direction of integrating CTT and IRT from certain perspectives; it is not true that the two theories are fully integrated, or that one is a subset of the other.

Current IRT models and G theory differ not only with respect to items being fixed (IRT) or random (G theory), but also in the sense that G theory emphasizes the contributions of multiple facets to measurement error, whereas almost all of the widely used IRT models have no explicit role for multiple facets. There is some research, however, that seeks to integrate aspects of G theory and IRT. For example, Bock, Brennan, and Muraki (2002) have proposed a procedure that incorporates multiple sources of error directly into the information function and, hence, into the IRT SEM. Also, Briggs and Wilson (2007) and Chien (2008) have
considered an approach that estimates variance components based on IRT estimates of expected number-correct scores rather than the actual observed scores used in G theory. Briggs and Wilson (2007) consider an items facet only; Chien (2008) considers two facets. In addition, there have been a number of informal, unpublished suggestions that Bayesian priors be used to turn the fixed items (more correctly, fixed item parameters) in IRT into random variables. (Bayesian priors are actually involved in the Briggs and Wilson, 2007, and Chien, 2008, approaches, which employ MCMC methods.) None of these approaches have been studied much yet, but it is encouraging that researchers are making attempts at integrating G theory and IRT. Even if the attempts fall short, they may lead to beneficial insights.

Table 1 provides a comparison among CTT, G theory, and IRT with respect to many of the issues considered in this section. The comparative phrases in Table 1 are necessarily succinct; they should be interpreted in the more extended sense discussed in this section. The differences among models are substantive and important, but each of these models is defensible and valuable, and no one of them is a substitute for the other, at least not in their current instantiations. It is unfortunate that much of the current research and practice in educational measurement does not give more attention to the differences among these models, and especially the differences among their assumptions.

TABLE 1
Comparisons Among CTT, G Theory, and IRT

Forms and parallelism. CTT: classically parallel, essentially tau-equivalent, etc. G theory: randomly parallel. IRT: strictly parallel.
True score. CTT: expectation over forms. G theory: expectation over randomly parallel forms. IRT: expected number right for fixed set of items.
Assumptions. CTT: relatively weak. G theory: relatively weak. IRT: very strong.
Primary strengths. CTT: simplicity; widely used; has stood the test of time. G theory: conceptual breadth; disentangles multiple sources of error; distinguishes between fixed and random facets. IRT: mathematically elegant; solves many complex measurement problems if assumptions hold.
Primary weaknesses. CTT: undifferentiated error. G theory: conceptual complexity. IRT: only fixed facet(s).
Use and understanding. CTT: easy. G theory: sometimes challenging. IRT: sometimes challenging.

REFERENCES

Bock, R. D., Brennan, R. L., & Muraki, E. (2002). The information in multiple ratings. Applied Psychological Measurement, 26, 364–375.
Borsboom, D. (2005). Measuring the mind. Cambridge, UK: Cambridge University Press.
Brennan, R. L. (1992). Elements of generalizability theory (rev. ed.). Iowa City, IA: ACT, Inc.
Brennan, R. L. (1997). A perspective on the history of generalizability theory. Educational Measurement: Issues and Practice, 16(4), 14–20.
Brennan, R. L. (2001a). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 38, 285–317.
Brennan, R. L. (2001b). Generalizability theory. New York: Springer-Verlag.
Briggs, D. C., & Wilson, M. (2007). Generalizability in item response modeling. Journal of Educational Measurement, 44, 131–155.
Burt, C. (1936). The analysis of examination marks. In P. Hartog & E. C. Rhodes (Eds.), The marks of examiners. London: Macmillan.
Cardinet, J., Johnson, S., & Pini, G. (2010). Applying generalizability theory using EduG. New York: Taylor and Francis.
Chien, Y. (2008). An investigation of testlet-based item response models with a random facets design in generalizability theory. Unpublished doctoral dissertation, University of Iowa, Iowa City, Iowa.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
Cronbach, L. J. (1976). On the design of educational measures. In D. N. M. de Gruijter & L. J. T. van der Kamp (Eds.), Advances in psychological and educational measurement (pp. 199–208). New York: Wiley.
Cronbach, L. J. (1991). Methodological studies—A personal retrospective. In R. E. Snow & D. E. Wiley (Eds.), Improving inquiry in social science: A volume in honor of Lee J. Cronbach (pp. 385–400). Hillsdale, NJ: Erlbaum.
Cronbach, L. J. (2003). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64, 391–418.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. British Journal of Statistical Psychology, 16, 137–163.
Cronbach, L. J., Schönemann, P., & McKie, T. D. (1965). Alpha coefficients for stratified-parallel tests. Educational and Psychological Measurement, 25, 291–312.
Ebel, R. L. (1951). Estimation of the reliability of ratings. Psychometrika, 16, 407–424.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: American Council on Education and Macmillan.
Fisher, R. A. (1925). Statistical methods for research workers. London: Oliver & Boyd.
Gleser, G. C., Cronbach, L. J., & Rajaratnam, N. (1965). Generalizability of scores influenced by multiple sources of variance. Psychometrika, 30, 395–418.
Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65–110). Westport, CT: American Council on Education/Praeger.
Holland, P. W., & Hoskens, M. (2003). Classical test theory as a first-order item response theory: Application to true-score prediction from a possibly nonparallel test. Psychometrika, 68, 123–149.
Kane, M. T. (1982). A sampling model for validity. Applied Psychological Measurement, 6, 125–160.
Lindquist, E. F. (1953). Design and analysis of experiments in psychology and education. Boston: Houghton Mifflin.
Lord, F. M. (1955). Estimating test reliability. Educational and Psychological Measurement, 15, 325–336.
Lord, F. M. (1957). Do tests of the same length have the same standard error of measurement? Educational and Psychological Measurement, 17, 510–521.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Novick, M. R., & Lewis, C. (1967). Coefficient alpha and the reliability of composite measurements. Psychometrika, 32, 1–13.
Rajaratnam, N., Cronbach, L. J., & Gleser, G. C. (1965). Generalizability of stratified-parallel tests. Psychometrika, 30, 39–56.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 111–154). Westport, CT: American Council on Education/Praeger.