36-Green.qxd 12/30/2005 12:51 PM Page 599

36

Generalizability Theory

Richard J. Shavelson
Stanford University

Noreen M. Webb
University of California, Los Angeles

Generalizability (G) theory is a statistical theory for evaluating the dependability (or reliability) of behavioral measurements (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; see also Brennan, 2001; Shavelson & Webb, 1991). G theory permits the researcher to address questions such as: Is the sampling of tasks or of judges the major source of measurement error? Can I improve the reliability of the measurement more by increasing the number of tasks or the number of judges, or is some combination of the two more effective? Are the test scores sufficiently reliable to support a certification decision about the level of a person's performance?

G theory grew out of the recognition that the undifferentiated error in classical test theory (Feldt & Brennan, 1989) provided too gross a characterization of the potential and/or actual sources of measurement error. In classical test theory, measurement error is undifferentiated random variation; the theory does not distinguish among its possible sources. G theory pinpoints the sources of systematic and unsystematic error variation, disentangles them, and estimates each one. Moreover, in contrast to the classical parallel-test assumptions of equal observed-score means, variances, and covariances, G theory assumes only randomly parallel tests sampled from the same universe. Finally, whereas classical test theory focuses on relative (rank-order) decisions (e.g., student admission to selective colleges), G theory distinguishes between relative ("norm-referenced") and absolute ("criterion-" or "domain-referenced") decisions for which a behavioral measurement is used.

In G theory, a behavioral measurement (e.g., a test score) is conceived of as a sample from a universe of admissible observations.
This universe consists of all possible observations that decision makers consider to be acceptable substitutes (e.g., scores sampled on Occasions 2 and 3) for the observation in hand (scores on Occasion 1). A measurement situation has characteristic features such as test form, test item, rater, and/or test occasion. Each characteristic feature is called a facet of a measurement. A universe of admissible observations, then, is defined by all possible combinations of the levels of the facets (e.g., items, occasions).

Consider a generalizability study of students' scores on a measure of academic self-concept. Suppose students (persons) responded to three self-concept items randomly selected from a large domain of such items on each of two randomly selected occasions (see Table 36-1).

TABLE 36-1
Crossed Person × Item × Occasion G Study of Self-Concept Scores

                  Occasion I                 Occasion II
Person    Item 1   Item 2   Item 3    Item 1   Item 2   Item 3
1            4        3        2         2        1        3
2            5        4        3         4        4        3
3            3        2        2         4        3        4
…
p            4        5        4         3        4        2
…
N            3        4        4         3        3        3

The items asked students to evaluate how well they do in academic settings (e.g., "I do well in school") on a Likert-type scale with scores ranging from 1 to 5. The scale was administered twice over roughly a 2-week interval. In this G study, students (persons) are the object of measurement¹ and both items and occasions are facets of the measurement. The universe of admissible observations includes all possible items and occasions that a decision maker would be equally willing to interpret as bearing on students' academic self-concept.

To pinpoint different sources of measurement error, G theory extends earlier analysis-of-variance approaches to reliability. It estimates the variation in scores due to each person, each facet, and their combinations (interactions).
More specifically, G theory estimates the components of observed-score variance contributed by the object of measurement, the facets, and their combinations. In this way, the theory isolates different sources of score variation in measurements. In practice, the analysis of variance is used to estimate the variance components; in contrast to experimental studies, it is not used to formally test hypotheses.

Continuing with the self-concept example, note that the student is the object of measurement and each student's observed score is decomposed into a component for student; item; occasion; and combinations (interactions) of student, item, and occasion. The student component of the score reflects systematic variation in students' self-appraisals of their academic ability, giving rise to variability among students (reflected by the student or person variance component). The other score components reflect sources of error. For example, a good occasion (e.g., one following a schoolwide announcement that the student body had received a community award) would tend to raise all students' self-evaluations, giving rise to mean differences from one occasion to the next (indexed by the occasion variance component), whereas the particular wording of an item might lead certain students to more-negative-than-typical self-evaluations relative to other students, giving rise to a nonzero person × item interaction (p × i variance component).

The theory describes the dependability (reliability) of generalizations made from a person's observed score on a test to the score he or she would obtain in the broad universe of admissible observations, his or her "universe score" (the counterpart of the true score in classical test theory). Hence the name, Generalizability Theory.

¹ In behavioral research, the person is typically considered the object of measurement.
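To make this decomposition concrete, here is a small numeric sketch. It is an illustration we supply, not part of the chapter: it uses numpy, arbitrary simulated scores, and arbitrary design sizes, and shows that any crossed person × item × occasion data array splits exactly into a grand mean, main effects, two-way interaction effects, and a residual.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores for 100 persons x 3 items x 2 occasions
# (sizes are arbitrary; any crossed p x i x o array works).
X = rng.integers(1, 6, size=(100, 3, 2)).astype(float)

mu = X.mean()                                # grand mean
p_eff = X.mean(axis=(1, 2)) - mu             # person effects
i_eff = X.mean(axis=(0, 2)) - mu             # item effects
o_eff = X.mean(axis=(0, 1)) - mu             # occasion effects

# Two-way interaction effects: cell mean minus main effects and grand mean
pi = X.mean(axis=2) - mu - p_eff[:, None] - i_eff[None, :]
po = X.mean(axis=1) - mu - p_eff[:, None] - o_eff[None, :]
io = X.mean(axis=0) - mu - i_eff[:, None] - o_eff[None, :]

# Residual: whatever remains after all lower-order effects are removed
resid = (X - mu
         - p_eff[:, None, None] - i_eff[None, :, None] - o_eff[None, None, :]
         - pi[:, :, None] - po[:, None, :] - io[None, :, :])

# The components add back up to the observed scores exactly
recon = (mu + p_eff[:, None, None] + i_eff[None, :, None] + o_eff[None, None, :]
         + pi[:, :, None] + po[:, None, :] + io[None, :, :] + resid)
assert np.allclose(recon, X)
```

In a G study, the variances of these score components (not the sample effects themselves) are the quantities of interest.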
G theory recognizes that an assessment might be adapted for particular decisions, and so it distinguishes a generalizability (G) study from a decision (D) study. In a G study, the universe of admissible observations is defined as broadly as possible (items, occasions, raters if appropriate, etc.) to provide variance-component estimates useful to a wide variety of decision makers. A D study typically selects only some facets for a particular purpose, thereby narrowing the score interpretation to a universe of generalization. A different generalizability (reliability) coefficient can then be calculated for each particular use of the assessment. In the self-concept example, we might decide to use only one occasion and perhaps six items for decision-making purposes, and the G coefficient could be calculated to reflect this proposed use.

In the remainder of this chapter we take up, in more detail, G studies, D studies, and the design of G and D studies. We then sketch the multivariate version of G theory and end with a section on additional topics. Before proceeding, one caveat is in order. At times you will find some complicated equations; we include them for completeness. We hope that the text provides sufficient explanation for readers who are less interested in the technical details to follow along conceptually.

GENERALIZABILITY STUDIES

A G study is designed specifically to isolate and estimate as many facets of measurement error as is reasonably and economically feasible. The study includes the most important facets that a variety of decision makers might wish to generalize over (e.g., items, forms, occasions, raters). Typically, "crossed" designs are used, in which all individuals are measured on all levels of all facets. In our example, all students (persons) in a sample (sample size N) responded to the same three self-concept items on two occasions (Table 36-1).
A crossed design provides maximal information about the contributions to the total observed-score variation of the object of measurement (universe-score or desirable variation, analogous to true-score variance), the facets, and their combinations.

Universe of Generalization

The universe of generalization is defined as the set of facets and their levels (e.g., items and occasions) to which a decision maker wants to generalize. A person's universe score (denoted µp) is defined as the long-run average or, more technically, the "expected value" of his or her observed scores over all observations in the universe of generalization (analogous to a person's "true score" in classical test theory).

Components of the Observed Score

After the decision maker specifies the universe of generalization, an observed measurement collected in a G study can be decomposed into a component (effect) for the universe score and one or more error components. Consider a two-facet crossed p × i × o (person by item by occasion) design in which items and occasions have been randomly selected (random-effects model). The object of measurement, here persons, is not a source of error and, therefore, is not a facet. In the p × i × o design with generalization over all admissible test items and occasions taken from an indefinitely large universe, the components of an observed score (Xpio) for a particular person (p) on a particular item (i) and occasion (o) are:

  Xpio = µ                                               grand mean
       + (µp − µ)                                        person effect
       + (µi − µ)                                        item effect
       + (µo − µ)                                        occasion effect          (36.1)
       + (µpi − µp − µi + µ)                             person × item effect
       + (µpo − µp − µo + µ)                             person × occasion effect
       + (µio − µi − µo + µ)                             item × occasion effect
       + (Xpio − µpi − µpo − µio + µp + µi + µo − µ)     residual

where µ = EpEiEoXpio and µp = EiEoXpio, with E denoting expectation; the other terms in (36.1) are defined analogously.
Except for the grand mean, µ, each observed-score component varies from one level to another; for example, items vary in difficulty on a test. Assuming a random-effects model, the distribution of each component or "effect," except for the grand mean, has a mean of zero and a variance σ² (called the variance component). The variance component for the person effect is σ²p = Ep(µp − µ)². This variance component is called the universe-score variance and is analogous to true-score variance in classical test theory. The variance components for the other effects are defined similarly. The residual variance component, σ²pio,e, reflects the person × item × occasion interaction confounded with residual error, because there is one observation per cell (see the scores in Table 36-1). The collection of observed scores, Xpio, has a variance, σ²Xpio = EpEiEo(Xpio − µ)², which equals the sum of the variance components:

  σ²Xpio = σ²p + σ²i + σ²o + σ²pi + σ²po + σ²io + σ²pio,e        (36.2)

Each variance component can be estimated from a traditional analysis of variance (or by other methods such as maximum likelihood; e.g., Searle, 1987). For our example (Table 36-1), we would run a person × item × occasion random-effects ANOVA and estimate the variance components from the mean squares (Table 36-2; see Shavelson & Webb, 1991, for how to do this). The relative magnitudes of the estimated variance components, except for σ̂²p, provide information about potential sources of error influencing a behavioral measurement. Statistical tests are not used in G theory; instead, standard errors for variance-component estimates provide information about the sampling variability of the estimates (e.g., Brennan, 2001). In our example, the estimated person (universe-score) variance, σ̂²p (1.108), is fairly large compared to the other components (30% of total variation).
This shows that, averaging over items and occasions, persons in the sample differed systematically in their self-concepts. Because persons constitute the object of measurement, not error, this variability represents systematic individual differences in self-concept, analogous to classical test theory's true-score variance. The other large estimated variance components concern the item facet more than the occasion facet. The nonnegligible² σ̂²i (3% of the total variation) shows that items varied somewhat in difficulty level. The large σ̂²pi (22%) reflects different relative standings of persons across items. The small σ̂²o (1% of the total variation) indicates that performance was stable across occasions, averaging over persons and items. The nonnegligible σ̂²po (6%) shows that the relative standing of persons differed somewhat across occasions. The zero σ̂²io indicates that the rank ordering of item difficulty was the same across occasions. Finally, the large σ̂²pio,e (38%) reflects the varying relative standing of persons across occasions and items and/or other sources of error not systematically incorporated into the G study.

TABLE 36-2
Estimated Variance Components in the Example p × i × o Design

Source            Variance Component    Estimate    Percent of Total Variability
Person (p)        σ²p                   1.108       30
Item (i)          σ²i                   0.102        3
Occasion (o)      σ²o                   0.030        1
p × i             σ²pi                  0.810       22
p × o             σ²po                  0.230        6
i × o             σ²io                  0.001        0
p × i × o, e      σ²pio,e               1.413       38

² Even small variance components can give rise to large confidence intervals. See Shavelson and Webb (1991, p. 13).

DECISION STUDIES

Generalizability theory distinguishes a D study from a G study. The D study uses information from a G study to design a measurement procedure that minimizes error for a particular purpose.
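The estimates in Table 36-2 come from solving the expected-mean-square (EMS) equations of the random-effects ANOVA. The sketch below is illustrative, not the chapter's own software: the function implements the standard EMS solutions for a fully crossed two-facet random design, and the mean-square values are hypothetical numbers chosen to be consistent with Table 36-2 under assumed design sizes of 100 persons, 3 items, and 2 occasions (the chapter does not report the mean squares or the number of persons).

```python
# Sketch: solve the expected-mean-square (EMS) equations of a fully crossed
# random-effects p x i x o ANOVA for the variance components.

def variance_components(ms, n_p, n_i, n_o):
    """ms: mean squares keyed 'p', 'i', 'o', 'pi', 'po', 'io', 'pio_e'."""
    v = {'pio_e': ms['pio_e']}                      # residual: direct estimate
    v['pi'] = (ms['pi'] - ms['pio_e']) / n_o
    v['po'] = (ms['po'] - ms['pio_e']) / n_i
    v['io'] = (ms['io'] - ms['pio_e']) / n_p
    v['p'] = (ms['p'] - ms['pi'] - ms['po'] + ms['pio_e']) / (n_i * n_o)
    v['i'] = (ms['i'] - ms['pi'] - ms['io'] + ms['pio_e']) / (n_p * n_o)
    v['o'] = (ms['o'] - ms['po'] - ms['io'] + ms['pio_e']) / (n_p * n_i)
    return v

# Hypothetical mean squares constructed to be consistent with Table 36-2
# under the assumed sizes n_p = 100, n_i = 3, n_o = 2.
ms = {'p': 10.371, 'i': 23.533, 'o': 11.203,
      'pi': 3.033, 'po': 2.103, 'io': 1.513, 'pio_e': 1.413}
vc = variance_components(ms, n_p=100, n_i=3, n_o=2)
# vc recovers the Table 36-2 estimates, e.g. vc['p'] = 1.108
```

Note how the formulas mirror the discussion later in the chapter: the residual is estimated directly, while main-effect components require combining several mean squares.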
In planning a D study, the decision maker defines the universe to which he or she wishes to generalize, called the universe of generalization, which may contain some or all of the facets and their levels in the universe of admissible observations (items, occasions, or both in our example). In the D study, decisions usually will be based on the mean over multiple observations (e.g., many self-concept items) rather than on a single observation. The mean score over a sample of n′i items and n′o occasions, for example, is denoted XpIO, in contrast to a score on a single item and occasion, Xpio. A two-facet, crossed D-study design in which decisions are to be made on the basis of XpIO is, then, denoted p × I × O.

Types of Decisions and Measurement Error

G theory recognizes that the decision maker might want to make two types of decisions based on a behavioral measurement: relative (norm-referenced) and absolute (criterion- or domain-referenced). A relative decision focuses on the rank order of persons; an absolute decision focuses on the level of performance, regardless of rank.

Measurement Error for Relative Decisions. A relative decision focuses on the rank ordering of individuals (e.g., norm-referenced interpretations of test scores). Decisions about college or job selection are relative, as are decisions based on correlational studies (correlations depend on the consistency with which variables X and Y rank order individuals). For relative decisions, the error in a random-effects p × I × O design is defined as:

  δpIO = (XpIO − µIO) − (µp − µ)        (36.3)

where µp = EIEOXpIO and µIO = EpXpIO.
The variance of the errors for relative decisions is:

  σ²δ = EpEIEOδ²pIO = σ²pI + σ²pO + σ²pIO,e = σ²pi/n′i + σ²po/n′o + σ²pio,e/(n′i n′o)        (36.4)

Notice that the "main effects" of item and occasion do not enter into error for relative decisions because, for example, all people respond on both occasions, so any difference between occasions affects all persons and does not change their rank order. In our study, suppose we decided to increase the number of items on the self-concept scale to 10 and to use the questionnaire on two occasions: n′i = 10 and n′o = 2. Substituting, we have:

  σ̂²δ = .810/10 + .230/2 + 1.413/(10 × 2) = 0.267

Simply put, in order to reduce σ²δ, n′i and n′o may be increased in a manner analogous to the Spearman-Brown prophecy formula in classical test theory and the standard error of the mean in sampling theory.

Measurement Error for Absolute Decisions. An absolute decision focuses on the level of an individual's performance independent of others' performance (cf. domain-referenced interpretations). For example, in California a minimum passing score on the drivers examination is 80% correct, regardless of how others perform on the test. For absolute decisions, the error in a random-effects p × I × O design is defined as:

  ∆pIO = XpIO − µp        (36.5)

and the variance of the errors is:

  σ²∆ = EpEIEO∆²pIO = σ²I + σ²O + σ²pI + σ²pO + σ²IO + σ²pIO,e
      = σ²i/n′i + σ²o/n′o + σ²pi/n′i + σ²po/n′o + σ²io/(n′i n′o) + σ²pio,e/(n′i n′o)        (36.6)

Note that, with absolute decisions, the main effects of items and occasions (how difficult an item is, or whether one occasion provided a more hospitable atmosphere for responding to self-concept items than another) do affect the level of self-concept measured even though neither changes the rank order. Consequently, they are included in the definition of measurement error. Also note that σ²∆ ≥ σ²δ.
Substituting values from Table 36-2, as was done earlier for relative decisions, provides a numerical estimate of measurement error for absolute decisions:

  σ̂²∆ = .102/10 + .030/2 + .810/10 + .230/2 + .001/(10 × 2) + 1.413/(10 × 2) = 0.292

RELIABILITY COEFFICIENTS

Although G theory stresses the interpretation of variance components and measurement error, it provides summary coefficients that are analogous to the reliability coefficient in classical test theory (recall: true-score variance divided by observed-score variance, i.e., an intraclass correlation). The theory distinguishes between a Generalizability Coefficient for relative decisions and an Index of Dependability for absolute decisions.

Generalizability Coefficient

The Generalizability (G) Coefficient is analogous to the reliability coefficient in classical test theory. It is the ratio of the universe-score variance to the expected observed-score variance, i.e., an intraclass correlation. For relative decisions and a p × I × O random-effects design, the generalizability coefficient is:

  Eρ²(XpIO, µp) = Eρ² = Ep(µp − µ)² / EpEIEO(XpIO − µIO)² = σ²p / (σ²p + σ²δ)        (36.7)

From Table 36-2, we can calculate an estimate of the G coefficient: Eρ̂² = 1.108/(1.108 + 0.267) = 0.806. In words, the estimated proportion of observed-score variance due to universe-score variance is 0.806.

Dependability Index

For absolute decisions with a p × I × O random-effects design, the index of dependability (Brennan, 2001; see also Kane & Brennan, 1977) is given in (36.8) below. Substituting estimates from Table 36-2, we can calculate the dependability index for a self-concept inventory with 10 items given on 2 occasions: Φ̂ = 1.108/(1.108 + .292) = 0.791. Notice that the dependability index is only slightly lower than the G coefficient because the variance components for the item and occasion main effects are quite small (Table 36-2).
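The error-variance and coefficient computations above can be collected into a few lines of code. This is an illustrative sketch we supply (not part of the chapter), plugging the Table 36-2 estimates into the D-study formulas with n′i = 10 items and n′o = 2 occasions:

```python
# Error variances and summary coefficients for the crossed p x I x O design,
# using the Table 36-2 variance-component estimates with n'_i = 10, n'_o = 2.
vc = {'p': 1.108, 'i': 0.102, 'o': 0.030,
      'pi': 0.810, 'po': 0.230, 'io': 0.001, 'pio_e': 1.413}
ni, no = 10, 2

# Relative error (36.4): only interactions with persons contribute.
rel_err = vc['pi'] / ni + vc['po'] / no + vc['pio_e'] / (ni * no)

# Absolute error (36.6): item and occasion main effects enter as well.
abs_err = rel_err + vc['i'] / ni + vc['o'] / no + vc['io'] / (ni * no)

g_coef = vc['p'] / (vc['p'] + rel_err)   # generalizability coefficient
phi = vc['p'] / (vc['p'] + abs_err)      # index of dependability

print(f"{rel_err:.3f} {abs_err:.3f}")   # 0.267 0.292
print(f"{g_coef:.3f} {phi:.3f}")        # 0.806 0.791
```

Because the absolute error adds the (small) item and occasion main-effect terms to the relative error, Φ is necessarily no larger than Eρ².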
  Φ = σ²p / (σ²p + σ²∆)        (36.8)

The right-hand sides of (36.7) and (36.8) are generic expressions that apply to any design and universe.

For domain-referenced decisions involving a fixed cutting score λ (often called criterion-referenced measurements), and assuming that λ is a constant specified a priori, the error of measurement is:

  ∆pIO = (XpIO − λ) − (µp − λ) = XpIO − µp        (36.9)

and the index of dependability is:

  Φλ = Ep(µp − λ)² / EpEIEO(XpIO − λ)² = (σ²p + (µ − λ)²) / (σ²p + (µ − λ)² + σ²∆)        (36.10)

An unbiased estimator of (µ − λ)² is (X̄ − λ)² − σ̂²X̄, where X̄ is the observed grand mean over sampled objects of measurement and sampled conditions of measurement in a D-study design, and σ̂²X̄ is the error variance involved in using the observed grand mean X̄ as an estimate of the grand mean (µ) over the population of persons and the universe of items and occasions. For the p × I × O random-effects design, σ̂²X̄ is:

  σ̂²X̄ = σ̂²p/n′p + σ̂²i/n′i + σ̂²o/n′o + σ̂²pi/(n′p n′i) + σ̂²po/(n′p n′o) + σ̂²io/(n′i n′o) + σ̂²pio,e/(n′p n′i n′o)        (36.11)

The estimate of Φλ is smallest when the cut score λ is equal to the observed grand mean X̄. In the data set presented in Table 36-1, X̄ = 3.500. For λ = 3.500, using n′p = 100 and the values in Table 36-2 gives Φ̂λ = 0.764. For λ = 2.000 (assuming self-concept scores above 2 fall within the "normal" range), Φ̂λ = 0.909.

STUDY DESIGN IN GENERALIZABILITY AND DECISION STUDIES

Generalizability theory allows the decision maker to use different designs in G and D studies. Typically, a crossed design is used in a G study. In a crossed design, all students are observed under each level of each facet; in our example, each student responds to each self-concept item on each occasion (see Table 36-1). The crossed design provides maximal information about the components of variation in observed self-concept scores.
In our example, seven different variance components can be estimated: one each for the main effects of person (σ̂²p), item (σ̂²i), and occasion (σ̂²o); the two-way interactions between person and item (σ̂²pi), person and occasion (σ̂²po), and item and occasion (σ̂²io); and a residual due to the person × item × occasion interaction confounded with random error (σ̂²pio,e).

In a D study, both crossed and nested designs should be considered. In a nested design, not all levels of one facet are paired with all levels of another facet. In our self-concept example, we might use one set of randomly sampled items (1-3) on Occasion 1 and another set of randomly sampled items (4-6) on Occasion 2. In this case we say that items are nested in occasions: Levels 1-3 of the item facet are paired with Occasion 1, and Levels 4-6 are paired with Occasion 2. In this way, six items rather than three are sampled for the D study; and the more items, the greater the reliability (generalizability), typically. Although G studies should use crossed designs whenever possible to avoid confounding of effects, D studies may use nested designs for convenience or to increase sample size, which typically reduces estimated error variance and, hence, increases estimated generalizability. For example, compare the error variance in a crossed p × I × O design with the error variance in a partially nested p × (I:O) design, where facet i is nested in facet o and n′ denotes the number of conditions of a facet under a decision maker's control.
In a crossed p × I × O design, the relative (σ²δ) and absolute (σ²∆) error variances are:

  σ²δ = σ²pI + σ²pO + σ²pIO,e = σ²pi/n′i + σ²po/n′o + σ²pio,e/(n′i n′o)        (36.12a)

and

  σ²∆ = σ²I + σ²O + σ²pI + σ²pO + σ²IO + σ²pIO,e
      = σ²i/n′i + σ²o/n′o + σ²pi/n′i + σ²po/n′o + σ²io/(n′i n′o) + σ²pio,e/(n′i n′o)        (36.12b)

In a nested p × (I:O) design,

  σ²δ = σ²pO + σ²pI:O = σ²po/n′o + σ²pi,pio,e/(n′i n′o)        (36.13a)

  σ²∆ = σ²O + σ²pO + σ²I:O + σ²pI:O = σ²o/n′o + σ²po/n′o + σ²i,io/(n′i n′o) + σ²pi,pio,e/(n′i n′o)        (36.13b)

In (36.12) and (36.13), σ²pi, σ²po, and σ²pio,e are directly available from a G study with design p × i × o; σ²i,io is the sum of σ²i and σ²io; and σ²pi,pio,e is the sum of σ²pi and σ²pio,e. To estimate σ²δ in a p × (I:O) design, for example, simply substitute estimated values for the variance components into (36.13a); similarly for (36.13b) to estimate σ²∆. Moreover, given cost, logistics, and other considerations, n′ can be manipulated to minimize error variance by trading off, in this example, items and occasions. Because of the difference in the designs, σ²δ is smaller in (36.13a) than in (36.12a), and σ²∆ is smaller in (36.13b) than in (36.12b).

From our example and Table 36-2, we find that the optimal D-study design need not be fully crossed. In this example, administering different items on each occasion (i:o) yields slightly higher estimated generalizability than does the fully crossed design; for 10 items and 2 occasions, Eρ̂² = 0.830 and Φ̂ = 0.818. The larger values of Eρ̂² and Φ̂ for the partially nested design than for the fully crossed design (Eρ̂² = 0.806 and Φ̂ = 0.791) are solely attributable to the differences between (36.12a) and (36.13a) and between (36.12b) and (36.13b).

Random and Fixed Facets

G theory is essentially a random-effects theory.
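Before taking up random and fixed facets, note that the crossed-versus-nested comparison above is easy to verify numerically. The sketch below is our illustration (not the chapter's code); it plugs the Table 36-2 estimates into the crossed and nested error-variance formulas with n′i = 10 and n′o = 2:

```python
# Crossed p x I x O versus partially nested p x (I:O) D-study designs,
# using the Table 36-2 estimates with n'_i = 10 and n'_o = 2.
vc = {'p': 1.108, 'i': 0.102, 'o': 0.030,
      'pi': 0.810, 'po': 0.230, 'io': 0.001, 'pio_e': 1.413}
ni, no = 10, 2

# Crossed design (36.12a, 36.12b).
rel_crossed = vc['pi'] / ni + vc['po'] / no + vc['pio_e'] / (ni * no)
abs_crossed = rel_crossed + vc['i'] / ni + vc['o'] / no + vc['io'] / (ni * no)

# Nested design (36.13a, 36.13b): i is confounded with i x o, and p x i with
# p x i x o, so the confounded sums are divided by n'_i * n'_o.
rel_nested = vc['po'] / no + (vc['pi'] + vc['pio_e']) / (ni * no)
abs_nested = rel_nested + vc['o'] / no + (vc['i'] + vc['io']) / (ni * no)

def coef(err):
    # Ratio of universe-score variance to itself plus error variance
    return vc['p'] / (vc['p'] + err)

print(f"crossed: {coef(rel_crossed):.3f} {coef(abs_crossed):.3f}")  # 0.806 0.791
print(f"nested:  {coef(rel_nested):.3f} {coef(abs_nested):.3f}")    # 0.830 0.818
```

The nested design's advantage here comes entirely from spreading σ²pi over n′i × n′o observations rather than n′i, exactly the difference between (36.12a) and (36.13a).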
Typically a random facet is created by randomly sampling levels of a facet (e.g., tasks from a job in observations of job performance). When the levels of a facet have not been sampled randomly from the universe of admissible observations but the intended universe of generalization is infinitely large, the concept of exchangeability may be invoked to consider the facet as random (Shavelson & Webb, 1981). A fixed facet (cf. a fixed factor in analysis of variance) arises when the decision maker (a) purposely selects certain conditions and is not interested in generalizing beyond them, (b) finds it unreasonable to generalize beyond the levels observed, or (c) includes the entire universe of levels in the measurement design because that universe is small. G theory typically treats fixed facets by averaging over the conditions of the fixed facet and examining the generalizability of the average over the random facets (Cronbach et al., 1972). When it does not make conceptual sense to average over the conditions of a fixed facet, a separate G study may be conducted within each condition of the fixed facet (Shavelson & Webb, 1991), or a full multivariate analysis may be performed with the levels of the fixed facet comprising a vector of dependent variables (Brennan, 2001; see below).

G theory recognizes that the universe of admissible observations in a G study may be broader than the universe of generalization of interest in a D study (e.g., a decision maker interested in only one occasion). The decision maker may reduce the levels of a facet (creating a fixed facet), select (and thereby control) one level of a facet, or ignore a facet. A facet is fixed in a D study when n′ = N′, where n′ is the number of levels for the facet in the D study and N′ is the total number of levels for the facet in the universe of generalization.
From a random-effects G study with design p × i × o in which the universe of admissible observations is defined by facets i and o of infinite size, fixing facet i in the D study and averaging over the ni conditions of facet i in the G study (ni = n′i) yields the following universe-score variance:

  σ²τ = σ²p + σ²pI = σ²p + σ²pi/n′i        (36.14)

where σ²τ denotes universe-score variance in generic terms. When facet i is fixed, the universe score is based on a person's average score over the levels of facet i, so the generic universe-score variance in (36.14) is the variance over persons' mean scores. Hence, (36.14) includes σ²pI as well as σ²p. Note that σ̂²τ is an unbiased estimate of universe-score variance for the mixed model only when the same levels of facet i are used in the G and D studies (Brennan, 2001). The relative and absolute error variances, respectively, are:

  σ²δ = σ²pO + σ²pIO,e = σ²po/n′o + σ²pio,e/(n′i n′o)        (36.15a)

and

  σ²∆ = σ²O + σ²pO + σ²IO + σ²pIO,e = σ²o/n′o + σ²po/n′o + σ²io/(n′i n′o) + σ²pio,e/(n′i n′o)        (36.15b)

And the generalizability coefficient and index of dependability, respectively, are:

  Eρ² = (σ²p + σ²pi/n′i) / (σ²p + σ²pi/n′i + σ²po/n′o + σ²pio,e/(n′i n′o))        (36.16a)

and

  Φ = (σ²p + σ²pi/n′i) / (σ²p + σ²pi/n′i + σ²o/n′o + σ²po/n′o + σ²io/(n′i n′o) + σ²pio,e/(n′i n′o))        (36.16b)

MULTIVARIATE GENERALIZABILITY

For behavioral measurements involving multiple scores describing individuals' personality, ability, or performance, multivariate generalizability can be used to (a) estimate the reliability of difference scores, observable correlations, or universe-score and error correlations for various D-study designs and sample sizes (Brennan, 2001); (b) estimate the reliability of a profile of scores using multiple regression of universe scores on the observed scores in the profile (Brennan, 2001; Cronbach et al., 1972); or (c) produce a composite of scores with maximum generalizability
(Shavelson & Webb, 1981). For all of these purposes, multivariate G theory decomposes both variances and covariances into components. In a two-facet, crossed p × i × o design with general self-concept divided into two dependent variables (academic and social self-concept), the observed scores for the two variables for person p observed under conditions i and o can be denoted 1Xpio and 2Xpio, respectively. The variances of the observed scores, σ²1Xpio and σ²2Xpio, are decomposed as in (36.2). The covariance, σ1Xpio,2Xpio, is decomposed in analogous fashion:

  σ1Xpio,2Xpio = σ1p,2p + σ1i,2i + σ1o,2o + σ1pi,2pi + σ1po,2po + σ1io,2io + σ1pio,e,2pio,e        (36.17)

In (36.17) the term σ1p,2p is the covariance between universe scores for academic and social self-concept. The remaining terms in (36.17) are error covariance components; the term σ1i,2i, for example, is the covariance between scores on academic and social self-concept due to the sampled levels of the item facet.

An important aspect of the development of multivariate G theory is the distinction between linked and unlinked conditions. The expected values of error covariance components are zero when the conditions for observing different variables are unlinked, that is, selected independently (e.g., the items used to obtain scores on one variable in a profile, academic self-concept, are selected independently of the items used to obtain scores on another variable, social self-concept). The expected values of error covariance components are nonzero when levels are linked or jointly sampled (e.g., scores on two variables in a profile come from the same items).
Joe and Woodward (1976) presented a G coefficient for a multivariate composite that maximizes the ratio of universe-score variation to universe-score plus error variation by using statistically derived weights for each dependent variable (academic and social self-concept in our example). Alternatives to maximizing the reliability of a composite are to determine variable weights on the basis of expert judgment or to use weights derived from a confirmatory factor analysis (Marcoulides, 1994).

ADDITIONAL TOPICS

We have only scratched the surface of G theory (believe it or not!). Here we treat a few additional topics that have practical consequences for using G theory; for details and advanced topics, see Brennan (2001). First, given the emphasis on estimated variance components in G theory, we consider the sampling variability of estimated variance components and how to estimate variance components, especially in unbalanced designs. Second, sometimes facets are "hidden" in a G study and are not accounted for in interpreting variance components. For example, in interpreting the substantial variability from one task to another (Shavelson, Baxter, & Gao, 1993), an occasion facet is "hidden" in that the tasks take place over time (Cronbach, Linn, Brennan, & Haertel, 1997; Shavelson, Ruiz-Primo, & Wiley, 1999). We briefly consider such a hidden facet below. Finally, it should come as no surprise that measurement error is not constant, as often assumed, but depends on the magnitude of a person's universe score; we treat this topic briefly as well.

Variance Component Estimates

Here we treat three concerns (among many) in estimating variance components. The first concern deals with the variability ("bounce") in variance-component estimates, the second with negative variance-component estimates (variances, σ², cannot be negative), and the third with unbalanced designs.

Variability in Variance-Component Estimates.
The first concern is that estimates of variance components may be unstable with the usual sample sizes (Cronbach et al., 1972). Here is why. To estimate variance components, we use mean squares from the analysis of variance; for example, we used mean squares from a person × item × occasion random-effects, repeated-measures analysis of variance to get the estimates in Table 36-2. The mean square for the residual (person × item × occasion interaction confounded with error) provides a direct estimate of σ²pio,e; i.e., σ̂²pio,e = MSpio,e. The mean square for the item × occasion interaction, however, is a bit more complex, containing information about both σ²pio,e and σ²io. Moving all the way up the analysis of variance table (or Table 36-2), we find that MSp contains information not only about σ²p but also about σ²pi, σ²po, and σ²pio,e, each of which is estimated from its corresponding mean squares. In general, as we move from the highest-order interaction (the residual) to the main effects, the number of mean squares involved in estimating a variance component increases. And the more mean squares that are involved, the larger the variability is likely to be in the estimates from one study to the next. Although exact confidence intervals for variance components are generally unavailable (exact distributions for variance-component estimates cannot be derived), approximate confidence intervals are available if one assumes normality or uses a resampling technique such as the bootstrap (illustrated in Brennan, 2001; for details, see Wiley, 2000).

Negative Estimated Variance Components. The second concern with variance-component estimation arises when a negative estimate occurs because of sampling error or model misspecification (Shavelson & Webb, 1981).
We can identify four possible solutions when negative estimates are small in relative magnitude. One is to substitute zero for the negative estimate and carry the zero through the other expected-mean-square equations from the analysis of variance, which produces biased estimates (Cronbach et al., 1972). A second is to set negative estimates to zero but use the negative values in the expected-mean-square equations for the other components (Brennan, 2001). A third is to use a Bayesian approach that sets a lower bound of zero on the estimated variance component (Shavelson & Webb, 1981). A fourth is to use maximum likelihood methods, which preclude negative estimates (Searle, 1987).

Variance Component Estimation with Unbalanced Designs.

An unbalanced design arises when the number of levels of a nested facet varies across levels of another facet or of the object of measurement. For example, if different judges observe the performance of each person, judge is nested within person; if, in addition, different numbers of judges are assigned to different persons, the unequal numbers make the design unbalanced. Although analysis of variance methods for estimating variance components are straightforward when applied to balanced data, have the advantage of requiring few distributional assumptions, and produce unbiased estimators, problems arise with unbalanced data.
The problems include the following: the total sums of squares can be decomposed in many different ways with no obvious basis for choosing among them (which leads to a variety of ways in which mean squares can be adjusted for other effects in the model); estimation is biased in mixed models (not a problem in G theory, because G theory averages over fixed facets in a mixed model and estimates only the variances of random effects, or handles mixed models via multivariate G theory); and the rules for deriving expected values of mean squares are algebraically and computationally complex. Brennan (2001) describes an analogous-ANOVA procedure for estimating variance components in G studies and illustrates estimation of error variances for some frequently encountered unbalanced D-study designs.

Hidden Facets

In some cases, two facets are linked such that as the levels of one facet vary, the levels of the other vary correspondingly. Because this may not be readily apparent, the linked facet is called a "hidden facet." The most notorious, and most easily understood, hidden facet is the occasion facet. Here is how it works. As a person proceeds through a test, for example, or performs a series of tasks, his or her performance occurs over time. Typically, variability in performance from task to task would be interpreted as task-sampling variability. While task is varying, however, the hidden facet, occasion, is varying too. What appears to be task-sampling variability may actually be occasion-sampling variability, and this alternative interpretation might change prescriptions for improving the dependability of the measurement. For example, Shavelson, Baxter, and Gao (1993) reported research on performance assessment in education and the military showing that task-sampling variability was consistently quite large and that a large sample of tasks was needed to obtain a reliable measure of performance. Cronbach, Linn, Brennan, and Haertel (1997), however, questioned this interpretation, pointing out the hidden facet of occasion.
The importance of this challenge is that, if the occasion facet is actually the cause, adding many tasks to address the task-sampling problem would not improve the dependability of the measurement. To resolve the issue, Shavelson, Ruiz-Primo, and Wiley (1999) re-examined some of the data from the 1993 report in a person × task × rater × occasion G study so that the effects of task and occasion could be separated. They found that both the task (person × task) and occasion (person × occasion) facets contributed variability, but the lion's share came from task sampling (person × task) and joint task-and-occasion sampling (person × task × occasion). Webb, Schlackman, and Sugrue (2000) reported similar results in a different study. The moral of the story is to be careful in interpreting variance components when occasion might be lurking in the background.

Nonconstant Error Variance for Different True Scores

The description of error variance given here, especially in (4) and (6), implicitly assumes that the variance of measurement error is constant for all persons, regardless of true score (universe score, here). The assumption of constant error variance for different true scores has been criticized for decades, including by Lord (1955), who derived a formula for conditional error variance that varies as a function of true score. His approach produces estimated error variances that are smaller for very high and very low true scores than for true scores closer to the mean, producing a concave-down quadratic form. Consider, for example, that persons who have very high true scores are likely to score highly across multiple items or across multiple tests (small error variance), whereas persons who have true scores close to the mean are likely to produce scores that fluctuate more from item to item or from test to test (larger error variance).
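The concave-down pattern can be sketched with the estimator of conditional absolute-error variance for the crossed p × i design, which for dichotomously scored items reduces to what we take to be Lord's form, X̄p(1 − X̄p)/(ni − 1), following Brennan's (2001) treatment. The function name and example item scores below are ours:

```python
def conditional_abs_error_variance(item_scores):
    """Estimated absolute-error variance conditional on a person's score,
    for a crossed p x i design: sum_i (X_pi - Xbar_p)^2 / (n_i * (n_i - 1)).
    For 0/1 items this equals Xbar_p * (1 - Xbar_p) / (n_i - 1)."""
    n = len(item_scores)
    mean = sum(item_scores) / n
    ss = sum((x - mean) ** 2 for x in item_scores)
    return ss / (n * (n - 1))

# A person answering almost all of 10 dichotomous items the same way
# (true score near an extreme) shows a smaller conditional error variance
# than a person whose true score sits near the middle of the scale.
extreme = conditional_abs_error_variance([1] * 9 + [0])      # Xbar = .9
middle = conditional_abs_error_variance([1] * 5 + [0] * 5)   # Xbar = .5
```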
Stated another way, for examinees with very high or very low true scores, there is little opportunity for errors to influence observed scores. In G theory, Lord's conditional error variance is the conditional error variance for absolute decisions in the p × i design with dichotomously scored items and n′i equal to ni (see Brennan, 2001). Brennan (2001) discusses conditional error variances in generalizability theory more broadly and presents estimation procedures for conditional error variances for relative and absolute decisions, for univariate and multivariate studies, and for balanced and unbalanced designs.

ACKNOWLEDGMENTS

A grant (#000000) from the U.S. Office of Education to CRESST supported, in part, the preparation of this chapter. We thank our reviewers, Bob Brennan and Dimiter Dimitrov, as well as our colleague Felipe Martinez, for their careful reading and constructive comments; errors of commission and omission are ours.

REFERENCES

Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. New York: Wiley.

Cronbach, L. J., Linn, R. L., Brennan, R. L., & Haertel, E. H. (1997). Generalizability analysis for performance assessments of student achievement or school effectiveness. Educational and Psychological Measurement, 57, 373–399.

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). Washington, DC: American Council on Education/Macmillan.

Kane, M. T., & Brennan, R. L. (1977). The generalizability of class means. Review of Educational Research, 47, 267–292.

Lord, F. M. (1955). Estimating test reliability. Educational and Psychological Measurement, 16, 325–336.

Marcoulides, G. A. (1994). Selecting weighting schemes in multivariate generalizability studies. Educational and Psychological Measurement, 54, 3–7.
Searle, S. R. (1987). Linear models for unbalanced data. New York: Wiley.

Shavelson, R. J., Baxter, G. P., & Gao, X. (1993). Sampling variability of performance assessments. Journal of Educational Measurement, 30, 215–232.

Shavelson, R. J., Ruiz-Primo, M. A., & Wiley, E. W. (1999). Note on sources of sampling variability in science performance assessments. Journal of Educational Measurement, 36, 61–71.

Shavelson, R. J., & Webb, N. M. (1981). Generalizability theory: 1973–1980. British Journal of Mathematical and Statistical Psychology, 34, 133–166.

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.

Webb, N. M., Nemer, K., Chizhik, A., & Sugrue, B. (1998). Equity issues in collaborative group assessment: Group composition and performance. American Educational Research Journal, 35, 607–651.

Webb, N. M., Schlackman, J., & Sugrue, B. (2000). The dependability and interchangeability of assessment methods in science. Applied Measurement in Education, 13, 277–301.

Webb, N. M., Shavelson, R. J., & Maddahian, E. (1983). Multivariate generalizability theory. In L. J. Fyans (Ed.), Generalizability theory: Inferences and practical applications (pp. 67–81). San Francisco, CA: Jossey-Bass.

Wiley, E. (2000). Bootstrap strategies for variance component estimation: Theoretical and empirical results. Unpublished doctoral dissertation, Stanford University.