Psych 818 - Lecture 6

Generalizability Theory
DeShon - 2007

Overview
- Extends classical test theory, where reliability is defined as

  r_xx = σ²_t / σ²_O = σ²_t / (σ²_t + σ²_e)

- G-theory seeks to decompose the undifferentiated error into its constituents to better inform decisions
Overview
- Developed by Cronbach and friends in 1963
  - Attempted to clarify reliability concepts and distinguish G-theory from CTT
  - Resulted in much jargon
- Seeks to identify the most serious sources of inconsistency in responses over measurement conditions
- Assumes that the error component can be partitioned into multiple sources

Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. British Journal of Statistical Psychology, 16, 137-163.
Jargon
- Universe of admissible observations
  - Measurements are assumed to be a sample from a universe of potential measurement conditions (time, raters, items, etc.)
- Facets/conditions of measurement
- Blah, blah, blah...
- Key: How generalizable are measurements from some set of measurement conditions (e.g., items) to other measurement conditions?
Example
- 3 judges rate the creativity of essays written by college applicants
  - Judges are the facet of the measurement universe
- Can the ratings provided by any one judge be exchanged for any other judge and thus provide a good estimate of the true (universe) score?

Key Issues in G-theory
- G-study
  - Estimate variance components
  - Fully crossed designs are best
- D-study
  - Fixed vs. random conditions of measurement
  - Absolute vs. relative decisions
  - Examination of various combinations of the measurement conditions to determine the desired reliability (Spearman-Brown prophecy formula)
Single Facet G-study
- Simplest G-theory design
[Figure: Persons 1-5 each rated by Rater 1 and Rater 2]
- 4 sources of variance:
  - Object of measurement: systematic variability due to the focus of differentiation (persons)
  - Condition of measurement: systematic variability due to the rater facet
  - Interaction variance
  - Random error and unaccounted-for systematic variability (i.e., unmeasured facets)
- The last two sources of variability cannot be separated.
Single Facet G-study
- The object of measurement (p): the desirable source of variability
- Overall rater differences (r): the tendency for some raters to give generally higher or lower ratings
- Person x rater interaction (pr): the tendency for some raters to rank order the objects differently than other raters

Single Facet G-study
- The magnitude of the three sources of variability can be estimated and compared to make decisions about the adequacy of current measurement or the best way to redesign a measure.
- Measures are generalizable to the extent that variance due to the object of measurement is large relative to variance from the several sources of error.
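The single-facet decomposition can be sketched with the standard ANOVA expected-mean-squares equations for a fully crossed persons x raters design. This is a minimal illustration, not the lecture's SPSS approach, and the toy ratings below are hypothetical: rater 2 is uniformly one point more lenient than rater 1, so all error variance is a rater main effect.

```python
def single_facet_components(x):
    """ANOVA expected-mean-squares estimates for a fully crossed
    persons x raters design (one rating per cell): returns the
    person (p), rater (r), and interaction+error (pr,e) components."""
    n_p, n_r = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (n_p * n_r)
    row_means = [sum(row) / n_r for row in x]
    col_means = [sum(x[p][r] for p in range(n_p)) / n_p for r in range(n_r)]
    ss_p = n_r * sum((m - grand) ** 2 for m in row_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in col_means)
    ss_tot = sum((v - grand) ** 2 for row in x for v in row)
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_e = (ss_tot - ss_p - ss_r) / ((n_p - 1) * (n_r - 1))
    # Solve the expected-mean-squares equations for the components.
    return (ms_p - ms_e) / n_r, (ms_r - ms_e) / n_p, ms_e

# Hypothetical toy ratings (5 persons x 2 raters); rater 2 = rater 1 + 1.
ratings = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
var_p, var_r, var_pr_e = single_facet_components(ratings)
print(var_p, var_r, var_pr_e)   # 2.5 0.5 0.0
```

Because the raters rank the persons identically, the interaction+error component is exactly zero and the only error source is the rater leniency difference.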
Multiple Facet G-Studies
- The single-facet study is no different from an interrater reliability coefficient or an ICC from Shrout & Fleiss
- The real power of G-theory comes from extending the ICC to decompose the error into its multiple constituents
- This information can be used to substantially improve decision making
2 Facet G-study
- Raters and occasions as conditions of measurement
[Figure: Venn diagram of the p, r, and o components and their overlaps pr, po, ro, and pro]
- Now there are seven sources of variability:
  - People (p)
  - Raters (r)
  - Occasions (o)
  - People x Raters (pr)
  - People x Occasions (po)
  - Raters x Occasions (ro)
  - People x Raters x Occasions, error (pro, e)
2 Facet G-study
- This design allows us to determine the generalizability of ratings across different raters and different occasions.
[Figure: Venn diagram of the p, r, and o components and their overlaps pr, po, ro, and pro]

Variance Component Estimation
SPSS

            Judge 1      Judge 2      Judge 3
            AM    PM     AM    PM     AM    PM
Person 1     2     3      1     3      3     5
Person 2     1     2      2     4      4     6
Person 3     2     3      2     4      5     4
Person 4     3     4      3     3      4     6
Person 5     4     5      3     5      5     7
Person 6     4     6      3     3      5     4
Person 7     3     7      4     6      6     7
Person 8     4     7      4     6      5     6
Person 9     3     5      4     7      3     7
Person 10    4     4      4     5      4     4
Person 11    3     5      3     4      5     5
Person 12    3     4      3     2      3     5
Person 13    3     3      2     4      1     2
Person 14    1     2      2     3      2     4
Person 15    2     3      1     2      3     3
Mean       2.80  4.20   2.73  4.07   3.87  5.00
Var.       1.03  2.60   1.07  2.21   1.84  2.29
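Before a variance-components run, the flattened table has to be arranged in long format: one row per rating, with person, judge, and time as factor columns. A minimal Python sketch of that reshaping, using the ratings from the table above (the `scores` layout is an illustrative choice, not part of the lecture):

```python
# Ratings from the slide's table, keyed by (judge, time),
# listed in person order (Person 1 through Person 15).
scores = {
    (1, "AM"): [2, 1, 2, 3, 4, 4, 3, 4, 3, 4, 3, 3, 3, 1, 2],
    (1, "PM"): [3, 2, 3, 4, 5, 6, 7, 7, 5, 4, 5, 4, 3, 2, 3],
    (2, "AM"): [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 3, 3, 2, 2, 1],
    (2, "PM"): [3, 4, 4, 3, 5, 3, 6, 6, 7, 5, 4, 2, 4, 3, 2],
    (3, "AM"): [3, 4, 5, 4, 5, 5, 6, 5, 3, 4, 5, 3, 1, 2, 3],
    (3, "PM"): [5, 6, 4, 6, 7, 4, 7, 6, 7, 4, 5, 5, 2, 4, 3],
}

# Long format: one (person, judge, time, rating) row per observation.
long_rows = [
    (person + 1, judge, time, rating)
    for (judge, time), column in scores.items()
    for person, rating in enumerate(column)
]
assert len(long_rows) == 90  # 15 persons x 3 judges x 2 times

# Column means reproduce the table's Mean row (2.80, 4.20, ...).
means = {k: round(sum(v) / len(v), 2) for k, v in scores.items()}
print(means)
```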
SPSS
[Screenshots: SPSS Variance Components dialogs]
SPSS Syntax

VARCOMP Rating BY Person Rater Time
  /RANDOM = Person Rater Time
  /METHOD = MINQUE(1)
  /DESIGN = Person Rater Time Person*Rater Person*Time Rater*Time
  /INTERCEPT = INCLUDE .

Variance Estimates

Component               Estimate
Var(Person)               .863
Var(Rater)                .300
Var(Time)                 .821
Var(Person * Rater)       .300
Var(Person * Time)        .102
Var(Rater * Time)        -.029 a
Var(Error)                .573

Dependent Variable: Rating
Method: Minimum Norm Quadratic Unbiased Estimation (Weight = 1 for Random Effects and Residual)
a. For the ANOVA and MINQUE methods, negative variance component estimates may occur. Some possible reasons for their occurrence are: (a) the specified model is not the correct model, or (b) the true value of the variance equals zero.
SPSS Results

Source          df      MS     Var. Comp.     %
People          14    6.659      .863       29.17
Judges           2    9.744      .300       10.14
Occasions        1   37.378      .821       27.75
P x J           28    1.173      .300       10.14
P x O           14     .878      .102        3.45
J x O            2     .144     -.029        0.00
P x J x O, e    28     .573      .573       19.36
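The variance components in the table follow from the mean squares through the expected-mean-squares equations for a fully crossed p x r x o design with one observation per cell. A Python sketch of that conversion, using the MS values from the table (n_p = 15, n_r = 3, n_o = 2):

```python
# Mean squares from the SPSS results table.
ms = {"p": 6.659, "r": 9.744, "o": 37.378,
      "pr": 1.173, "po": 0.878, "ro": 0.144, "pro_e": 0.573}
n_p, n_r, n_o = 15, 3, 2

# Expected-mean-squares solutions for a fully crossed p x r x o
# design with one observation per cell.
var = {}
var["pro_e"] = ms["pro_e"]
var["pr"] = (ms["pr"] - ms["pro_e"]) / n_o
var["po"] = (ms["po"] - ms["pro_e"]) / n_r
var["ro"] = (ms["ro"] - ms["pro_e"]) / n_p   # comes out negative here
var["p"] = (ms["p"] - ms["pr"] - ms["po"] + ms["pro_e"]) / (n_r * n_o)
var["r"] = (ms["r"] - ms["pr"] - ms["ro"] + ms["pro_e"]) / (n_p * n_o)
var["o"] = (ms["o"] - ms["po"] - ms["ro"] + ms["pro_e"]) / (n_p * n_r)

for name, value in var.items():
    print(name, round(value, 3))
```

Each solved value matches the Var. Comp. column of the table to three decimals, including the slightly negative raters x occasions component.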
Interpreting Results
- Examine the variance components
  - Big effect of time; smaller effect of raters
  - So, focus effort on reducing inconsistency over time
    - More measurement over time
    - Identify factors that might be responsible for the inconsistency and get rid of them (e.g., food)

D-study Details
- Once the variance components are estimated, D-studies can be conducted to explore the implications of using the measure in different designs and for different kinds of decisions.
- Estimating the magnitude of error (lack of generalizability) requires attention to four important distinctions:
  - Generalizability versus decision studies
  - Random versus fixed effects
  - Relative versus absolute decisions
  - Number of measurement conditions

D-study Details
- Decision 1:
  - Absolute vs. relative error

D-study Details
- Decision 2:
  - Number of measurement conditions (e.g., raters, occasions, etc.) required to obtain the desired level of dependability/reliability
  - Use a variant of the Spearman-Brown formula to determine the number of measurement conditions
D-study Details
- Compute a G-coefficient based on these decisions to estimate dependability
- Absolute error:

  Φ = σ²_p / (σ²_p + σ²_r/n_r + σ²_o/n_o + σ²_pr/n_r + σ²_po/n_o + σ²_ro/(n_r·n_o) + σ²_pro/(n_r·n_o))

D-study Details
- Compute a G-coefficient based on these decisions to estimate dependability
- Relative error:

  Eρ² = σ²_p / (σ²_p + σ²_pr/n_r + σ²_po/n_o + σ²_pro/(n_r·n_o))
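With estimated components in hand, the absolute- and relative-error coefficients can be computed for any candidate design. A minimal Python sketch using the lecture's estimates (the negative ro component set to zero), evaluated at the observed design of n_r = 3 raters and n_o = 2 occasions:

```python
# Variance components from the G-study (negative ro estimate set to 0).
var = {"p": 0.863, "r": 0.300, "o": 0.821,
       "pr": 0.300, "po": 0.102, "ro": 0.0, "pro_e": 0.573}

def g_coefficients(var, n_r, n_o):
    """Return (relative, absolute) generalizability for a
    p x r x o D-study with n_r raters and n_o occasions."""
    # Relative error: only interactions with persons matter.
    rel_err = var["pr"] / n_r + var["po"] / n_o + var["pro_e"] / (n_r * n_o)
    # Absolute error adds the rater, occasion, and ro main effects.
    abs_err = rel_err + var["r"] / n_r + var["o"] / n_o + var["ro"] / (n_r * n_o)
    return var["p"] / (var["p"] + rel_err), var["p"] / (var["p"] + abs_err)

rel, absolute = g_coefficients(var, n_r=3, n_o=2)
print(round(rel, 3), round(absolute, 3))   # 0.778 0.533
```

Sweeping n_r and n_o over a grid with this function produces the generalizability surfaces shown on the following slides.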
Absolute Error
[Figure: Absolute generalizability for a P x J x O design, plotted as a surface over number of judges (1-25) and number of occasions (1-25); generalizability shown from 0 to 1 in bands of 0.1.]

Relative Error
[Figure: Relative generalizability for a P x J x O design, plotted over the same ranges of judges and occasions.]
Fixed Effects
- Someone might estimate a factor as random that you think of as fixed (e.g., raters).
- If one of the facets is fixed, then it makes no sense to speak of generalizing from a sample of facet levels to the universe of admissible facet levels: all facet levels are already present.
- Two methods to handle this:
  - An averaging approach
  - Separate estimation of variance components within levels of the fixed facet

Fixed Effects
- If occasion is fixed, then the averaging approach calculates the variance components as:

  Source      Var. Comp.      %
  People        .914        51.16
  Judges        .286        15.99
  P x J, e      .587        32.85
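The averaging-approach components in the table can be derived directly from the random-model components: averaging over the fixed facet folds each occasion interaction (divided by the number of occasion levels) into the corresponding remaining effect. A sketch of that relationship in Python, assuming occasion is fixed with n_o = 2 levels:

```python
# Random-model variance components from the G-study.
var = {"p": 0.863, "r": 0.300, "ro": -0.029,
       "pr": 0.300, "po": 0.102, "pro_e": 0.573}
n_o = 2

# Averaging over the fixed occasion facet absorbs each occasion
# interaction into the effect it interacts with.
people = var["p"] + var["po"] / n_o        # ≈ .914
judges = var["r"] + var["ro"] / n_o        # ≈ .286
pj_e = var["pr"] + var["pro_e"] / n_o      # ≈ .587
print(round(people, 3), round(judges, 3), round(pj_e, 3))
```

The three results reproduce the People, Judges, and P x J, e rows of the averaging-approach table.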
Fixed Effects
- Separate analyses within each occasion:

  Source       AM: Var. Comp.     %      PM: Var. Comp.     %
  People           .649         38.83       1.281         50.27
  Judges           .360         21.54        .183          7.18
  P x J, e         .662         39.61       1.084         42.54
Summary
- G-theory is an extension of classical measurement theory.
  - Assumes that an observed score is a linear combination of true score and error.
- The major difference between the two approaches is how they treat error.
  - CTT: error is considered to be a single entity that is random in its influence.
  - G-theory: error is multifaceted and can be systematic.

Summary
- Similar to domain-sampling theory in assuming that measurement conditions (items, times, judges, etc.) are randomly drawn from a population.
- The conditions of measurement define the universe of admissible observations.
- The relevant conditions, called facets, are the potential sources of systematic measurement error.
Summary
- Goal: estimate how well a given observation generalizes to the universe score, i.e., the average score that would be obtained over all observations in the universe of admissible observations.
- G-theory determines how exchangeable observations are and identifies the major obstacles to exchangeability.
  - Large sources of error hinder exchangeability.

Summary
- The variance of observed scores is the sum of the variances for the universe score (true score) and all of the separate sources of error.
  - The importance of different error sources is indexed by the relative size of the variance components.
  - Estimating the size of the variance components is called a G-study.
Summary
- D-study
  - Error is suppressed by adding more levels of a facet (like adding more items to a questionnaire).
  - Exploring the implications of adding levels to facets is called a D-study and is akin to the use of the Spearman-Brown formula in classical measurement theory.

Summary
- Reliability coefficients in generalizability theory are called generalizability coefficients.
  - Defined the same as in CTT: the ratio of universe score (true score) variance to observed score variance.
Summary
- The major complications in generalizability theory model specification are:
  (a) Is the design crossed or nested?
  (b) Are there any fixed facets?
  (c) Is the decision to be made from the measurement a relative or an absolute one?
- Model specification determines the nature of the variance components and the definition of error.

Summary
- Generalizability theory is symmetrical: anything can be the object of measurement.
- A target reliability can be achieved in many different ways. This means that cost, feasibility, and design implications (threats to inference) need to be considered carefully.