Generalizability Theory

RELIABILITY
The consistency or reproducibility of a test score (or measurement).
Common approaches to estimating reliability:
Classical True Score Theory
– test-retest, alternate forms, internal consistency
    • useful for estimating relative decisions
– intraclass correlation
    • useful for estimating absolute decisions
Generalizability Theory
– can estimate both relative & absolute
Reliability is a concept central to all
behavioral sciences. To some extent all
measures are unreliable. This is especially
true of psychological measures and of
measurements based on human observation.
Sources of Error

Random
– fluctuations in the measurement based purely on chance.
Systematic
– measurement error that affects a score because of some particular
  characteristic of the person or the test that has nothing to do with
  the construct being measured.
CTST

X = T + E
– Recognizes only two sources of variance
    • test-retest (stability)
    • alternate forms (equivalence in item sampling)
    • test-retest with alternate forms (stability & equivalence, but these are confounded)
– Cannot adequately estimate individual sources of error influencing a measurement
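As a worked form of the model above: if true scores and errors are assumed to be uncorrelated (the usual classical assumption, not stated on the slide), the observed-score variance splits into two parts and reliability is the true-score share of it. The symbol \rho_{XX'} for the reliability coefficient is notation assumed here for illustration.

```latex
\[
X = T + E, \qquad
\sigma^2_X = \sigma^2_T + \sigma^2_E, \qquad
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
\]
```

Because all error is lumped into the single term \sigma^2_E, this form cannot say how much of the error comes from raters, occasions, or items.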
ICC

Uses ANOVA to partition variance into between-subjects and
within-subjects components
– Has some ability to accommodate multiple sources of variance
– Does not provide an integrated approach to estimating reliability
  under multiple conditions
Generalizability Theory
The Dependability of Behavioral Measurements
(1972), Cronbach, Gleser, Nanda, & Rajaratnam
Dependability
The accuracy of generalizing from a person’s
observed score on a measure to the average
score that person would have received
under all possible testing conditions the
tester would be willing to accept.
The Decision Maker
The score on which the decision is to be
based is only one of many scores that might
serve the same purpose. The decision maker
is almost never interested in the response
given at the particular moment of testing.
Ideally the decision should be based on that
person’s mean score over all possible
measurement occasions.

Universe of Generalization

Definition & establishment of the universe of
admissible observations:
– observations that the decision maker is willing to treat as interchangeable.
– all sources of influence acting on the measurement of the trait under study.
What are the sources of ERROR
influencing your measurement?
Generalizability Issues

Facet of Generalization
– raters, trials, days, clinics, therapists
Facet of Determination
– usually people, but can vary (e.g., raters)
Types of Studies

Generalizability Study (G-Study)

Decision Study (D-Study)
G-Study
Purpose is to anticipate the multiple uses of
a measurement.
To provide as much information as possible
about the sources of variation in the
measurement.
The G-Study should attempt to identify and
incorporate into its design as many potential
sources of variation as possible.

D-Study


Makes use of the information provided by the G-Study to design the
best possible application of the measurement for a particular purpose.
Planning a D-Study:
– defines the Universe of Generalization
– specifies the proposed interpretation of the measurement.
– uses G-Study information to evaluate the effectiveness of alternative
  designs for minimizing error and maximizing reliability.
Design Considerations

Fixed Facets

Random Facets
Fixed Facet

When the levels of the facet exhaust all
possible conditions in the universe to which
the investigator wants to generalize.

When the levels of the facet represent a
convenient sub-sample of all possible
conditions in the universe.
Random Facets

When it is assumed that the levels of the
facet represent a random sample of all
possible levels described by the facet.

If you are willing to EXCHANGE the
conditions (levels) under study for any other
set of conditions of the same size from the
universe.
Types of Decisions

Relative
– establish a rank order of individuals (or groups).
– the comparison of a subject’s performance against others in the group.
Absolute
– to index an individual’s (or group’s) absolute level of measurement.
– measurement results are to be made independent from the performance
  of others in the group.
Statistical Modeling
ANOVA
– Just as ANOVA partitions a dependent variable into effects for the
  independent variable (main effects & interactions), G-theory uses ANOVA
  to partition an individual’s measurement score into an effect for the
  universe score and an effect for each source of error and their
  interactions in the design.
Statistical Modeling

In ANOVA we were driven to test specific
hypotheses about our independent variables
and thus sought out the F statistic and p-value.

In G-theory we will use ANOVA to partition
the different sources of variance and then to
estimate their amount (Variance
Component).
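For the simplest case, subjects (s) crossed with a single facet such as raters (r), the partition can be sketched as below. The notation (\mu for the grand mean, \mu_s and \mu_r for subject and rater means) is assumed here for illustration and follows the usual G-theory presentation rather than anything written on the slides.

```latex
\begin{align*}
X_{sr} &= \mu + (\mu_s - \mu) + (\mu_r - \mu) + (X_{sr} - \mu_s - \mu_r + \mu) \\
\sigma^2(X_{sr}) &= \sigma^2_s + \sigma^2_r + \sigma^2_{sr,e}
\end{align*}
```

The first line splits a score into the grand mean, a subject effect, a rater effect, and a residual; the second line gives the matching variance components, which are the quantities G-theory estimates instead of F tests.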
One Facet Design

4 Sources of Variability
– systematic differences among subjects (object of measurement)
– systematic differences among raters (occasions, items)
– subjects × raters interaction
– random error
  (the last two are confounded)
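The sources listed above can be estimated from a subjects × raters score table. The following is a minimal sketch, assuming a fully crossed design with one score per cell and the standard expected-mean-square equations for a random-effects two-way ANOVA. The function name and the scores are hypothetical, and setting negative estimates to zero is one common convention rather than something the slides specify.

```python
from statistics import mean


def one_facet_variance_components(x):
    """Estimate variance components for a crossed subjects x raters design.

    x[s][r] is the single score given to subject s by rater r.
    """
    n_s, n_r = len(x), len(x[0])
    grand = mean(v for row in x for v in row)
    m_s = [mean(row) for row in x]                         # subject means
    m_r = [mean(row[r] for row in x) for r in range(n_r)]  # rater means

    # Sums of squares for subjects, raters, and the residual.
    ss_s = n_r * sum((m - grand) ** 2 for m in m_s)
    ss_r = n_s * sum((m - grand) ** 2 for m in m_r)
    ss_total = sum((v - grand) ** 2 for row in x for v in row)
    ss_res = ss_total - ss_s - ss_r

    ms_s = ss_s / (n_s - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_s - 1) * (n_r - 1))

    # Expected mean squares:
    #   E[MS_s]   = var_res + n_r * var_s
    #   E[MS_r]   = var_res + n_s * var_r
    #   E[MS_res] = var_res  (interaction and random error, confounded)
    return {
        "subjects": max((ms_s - ms_res) / n_r, 0.0),
        "raters": max((ms_r - ms_res) / n_s, 0.0),
        "residual": max(ms_res, 0.0),
    }


# Hypothetical scores: 4 subjects rated by 3 raters.
scores = [[4, 5, 4], [6, 8, 7], [8, 8, 9], [10, 11, 9]]
for source, value in one_facet_variance_components(scores).items():
    print(f"{source:9s} {value:.3f}")
```

With only one score per cell, the residual term cannot be separated into interaction and random error, which is why the two are reported as a single component.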
Two Facet Design
Components of Variance
Example of a fully crossed two-facet design (Kroll et al.)
Seven sources of variance are estimated:
– subjects (s)
– raters (r)
– observations (o)
– s × r
– s × o
– r × o
– s × r × o, e
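The seven components can be estimated from the mean squares of a three-way ANOVA. The sketch below assumes a fully crossed subjects × raters × observations design with one score per cell, both facets random, and at least two levels of each factor. The function name and the example scores are hypothetical, and negative estimates are set to zero by convention.

```python
from itertools import product
from statistics import mean


def two_facet_variance_components(x):
    """x[s][r][o] holds one score per subject s, rater r, observation o."""
    n_s, n_r, n_o = len(x), len(x[0]), len(x[0][0])
    S, R, O = range(n_s), range(n_r), range(n_o)

    # Grand mean, marginal means, and two-way cell means.
    grand = mean(x[s][r][o] for s, r, o in product(S, R, O))
    m_s = [mean(x[s][r][o] for r, o in product(R, O)) for s in S]
    m_r = [mean(x[s][r][o] for s, o in product(S, O)) for r in R]
    m_o = [mean(x[s][r][o] for s, r in product(S, R)) for o in O]
    m_sr = [[mean(x[s][r][o] for o in O) for r in R] for s in S]
    m_so = [[mean(x[s][r][o] for r in R) for o in O] for s in S]
    m_ro = [[mean(x[s][r][o] for s in S) for o in O] for r in R]

    # Sums of squares for main effects and two-way interactions.
    ss = {
        "s": n_r * n_o * sum((m - grand) ** 2 for m in m_s),
        "r": n_s * n_o * sum((m - grand) ** 2 for m in m_r),
        "o": n_s * n_r * sum((m - grand) ** 2 for m in m_o),
        "sr": n_o * sum((m_sr[s][r] - m_s[s] - m_r[r] + grand) ** 2
                        for s, r in product(S, R)),
        "so": n_r * sum((m_so[s][o] - m_s[s] - m_o[o] + grand) ** 2
                        for s, o in product(S, O)),
        "ro": n_s * sum((m_ro[r][o] - m_r[r] - m_o[o] + grand) ** 2
                        for r, o in product(R, O)),
    }
    ss_total = sum((x[s][r][o] - grand) ** 2 for s, r, o in product(S, R, O))
    ss["sro,e"] = ss_total - sum(ss.values())  # residual: s x r x o plus error

    df = {"s": n_s - 1, "r": n_r - 1, "o": n_o - 1}
    df["sr"] = df["s"] * df["r"]
    df["so"] = df["s"] * df["o"]
    df["ro"] = df["r"] * df["o"]
    df["sro,e"] = df["s"] * df["r"] * df["o"]
    ms = {k: ss[k] / df[k] for k in ss}

    # Solve the expected-mean-square equations for the random model.
    e = ms["sro,e"]
    vc = {
        "s": (ms["s"] - ms["sr"] - ms["so"] + e) / (n_r * n_o),
        "r": (ms["r"] - ms["sr"] - ms["ro"] + e) / (n_s * n_o),
        "o": (ms["o"] - ms["so"] - ms["ro"] + e) / (n_s * n_r),
        "sr": (ms["sr"] - e) / n_o,
        "so": (ms["so"] - e) / n_r,
        "ro": (ms["ro"] - e) / n_s,
        "sro,e": e,
    }
    return {k: max(v, 0.0) for k, v in vc.items()}


# Hypothetical scores: 3 subjects x 2 raters x 2 observations.
scores = [
    [[10, 11], [12, 12]],
    [[14, 15], [15, 17]],
    [[20, 19], [22, 21]],
]
for source, value in two_facet_variance_components(scores).items():
    print(f"{source:6s} {value:.3f}")
```

Dividing each component by the sum of all seven gives the "Percent" figures of the kind reported alongside variance components in G-study tables.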
Variance Components
[Figure: Venn diagram of the variance components for Subjects (s), Raters (r), and Observations (o), their interactions (s×r), (s×o), (o×r), and (s×r×o) + Error]
TABLE 1 - Variance Components and Percentage of Variation for Measures of Pelvic Tilt (raters = 2, observations = 5)

Source of Variation    Resting Pelvic Tilt    Anterior Pelvic Tilt   Posterior Pelvic Tilt
                       VC        Percent      VC        Percent      VC        Percent
Persons                19.956    75.2         47.683    84.8         20.607    72.3
Raters                 1.726     6.5          0.000     0.0          2.508     8.8
Observations           0.148     0.6          0.000     0.0          0.011     0.0
P x R                  1.671     6.3          1.935     3.4          1.910     6.7
P x O                  0.042     0.2          0.972     1.7          1.077     3.8
R x O                  0.000     0.0          0.000     0.0          0.000     0.0
P x R x O, E           3.050     11.5         5.646     10.0         2.394     8.4

Abbreviations: VC = variance component; P x R = persons by raters; P x O = persons by observations; R x O = raters by observations; P x R x O, E = persons by raters by observations combined with residual error
TABLE 2 - Variance Components and Percentage of Variation for Modified Schober, Attraction Method, and Lower Abdominal Strength Measures (raters = 2, observations = 3)

Source                 Modified Schober       Attraction Method      Lower Abdominal Strength
                       VC        Percent      VC        Percent      VC        Percent
Persons                1.006     67.8         0.360     81.3         105.055   52.9
Raters                 0.000     0.0          0.000     0.0          0.000     0.0
Observations           0.008     0.5          0.000     0.0          0.000     0.0
P x R                  0.181     12.2         0.000     0.0          71.349    36.0
P x O                  0.029     2.0          0.083     18.7         3.695     1.9
R x O                  0.016     1.1          0.000     0.0          0.757     0.4
P x R x O, E           0.243     16.4         0.000     0.0          17.577    8.9

Abbreviations: VC = variance component; P x R = persons by raters; P x O = persons by observations; R x O = raters by observations; P x R x O, E = persons by raters by observations combined with residual error
Relative Error
Facet of Determination: Subjects
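[Figure: Venn diagram of the variance components for Subjects (s), Raters (r), and Observations (o), their interactions, and Error]

\[
\sigma^2_{rel} = \frac{\sigma^2_{sr}}{n_r} + \frac{\sigma^2_{so}}{n_o} + \frac{\sigma^2_{sro,e}}{n_r n_o}
\]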
Absolute Error
Facet of Determination: Subjects
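[Figure: Venn diagram of the variance components for Subjects (s), Raters (r), and Observations (o), their interactions, and Error]

\[
\sigma^2_{abs} = \frac{\sigma^2_{r}}{n_r} + \frac{\sigma^2_{o}}{n_o} + \frac{\sigma^2_{sr}}{n_r} + \frac{\sigma^2_{so}}{n_o} + \frac{\sigma^2_{or}}{n_o n_r} + \frac{\sigma^2_{sro,e}}{n_o n_r}
\]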
Generalizability Coefficients
AKA: Reliability Coefficients
Relative Generalizability Coefficient for Subjects:
\[
\rho^2 = \frac{\sigma^2_s}{\sigma^2_s + \sigma^2_{rel}}
\]
Absolute Generalizability Coefficient for Subjects:
\[
\phi = \frac{\sigma^2_s}{\sigma^2_s + \sigma^2_{abs}}
\]
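As a worked check of the error and coefficient formulas, the sketch below plugs in the Resting Pelvic Tilt variance components reported in Table 1 (raters = 2, observations = 5). The function name and the dictionary keys are illustrative choices, not part of the source.

```python
def g_coefficients(vc, n_r, n_o):
    """Return relative/absolute error variances and both G coefficients.

    vc maps each source (s, r, o, sr, so, ro, "sro,e") to its variance component.
    """
    # Relative error: only the interactions with subjects contribute.
    rel = vc["sr"] / n_r + vc["so"] / n_o + vc["sro,e"] / (n_r * n_o)
    # Absolute error: facet main effects and their interaction are added.
    abs_err = rel + vc["r"] / n_r + vc["o"] / n_o + vc["ro"] / (n_r * n_o)
    return {
        "rel": rel,
        "abs": abs_err,
        "rho2": vc["s"] / (vc["s"] + rel),
        "phi": vc["s"] / (vc["s"] + abs_err),
    }


# Resting Pelvic Tilt variance components, as reported in Table 1.
resting = {
    "s": 19.956, "r": 1.726, "o": 0.148,
    "sr": 1.671, "so": 0.042, "ro": 0.000, "sro,e": 3.050,
}

result = g_coefficients(resting, n_r=2, n_o=5)
print(f"rho2 = {result['rho2']:.3f}")  # prints rho2 = 0.946
print(f"phi  = {result['phi']:.3f}")   # prints phi  = 0.907
```

The absolute coefficient is lower because the rater and observation main effects (1.726 and 0.148) count as error only when the score itself, rather than the rank order, is being interpreted.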
TABLE 3 - Variance Components and Percentage of Variation for Right and Left Hamstring Flexibility Measures (raters = 2, observations = 3)

Source            Right Hamstring Flexibility   Left Hamstring Flexibility
                  VC        Percent             VC        Percent
Persons           398.526   93.1                382.639   91.9
Raters            0.000     0.0                 0.000     0.0
Observations      1.767     0.4                 2.123     0.5
P x R             20.656    4.8                 24.030    5.8
P x O             0.708     0.2                 1.235     0.3
R x O             0.001     0.0                 0.707     0.2
P x R x O, E      6.407     1.5                 5.727     1.4

Abbreviations: VC = variance component; P x R = persons by raters; P x O = persons by observations; R x O = raters by observations; P x R x O, E = persons by raters by observations combined with residual error
TABLE 4 - Variance Components and Percentage of Variation for Abdominal and Trunk Muscle Endurance Methods (raters = 2, observations = 2)

Source            Abdominal Muscle Endurance    Trunk Muscle Endurance
                  VC        Percent             VC        Percent
Persons           646.177   68.9                1160.656  83.6
Raters            43.936    4.7                 0.000     0.0
Observations      0.000     0.0                 0.000     0.0
P x R             0.000     0.0                 21.732    1.6
P x O             15.736    1.7                 24.559    1.8
R x O             0.000     0.0                 0.000     0.0
P x R x O, E      232.117   24.7                181.944   13.1

Abbreviations: VC = variance component; P x R = persons by raters; P x O = persons by observations; R x O = raters by observations; P x R x O, E = persons by raters by observations combined with residual error
TABLE 5 - Generalizability of Pelvic Tilt Measures

            Resting Pelvic Tilt    Anterior Pelvic Tilt   Posterior Pelvic Tilt
            G-study   D-study      G-study   D-study      G-study   D-study
nr =        2         1            2         1            2         1
no =        5         1            5         1            5         1
ρ²          0.946     0.809        0.967     0.848        0.936     0.793
φ           0.907     0.750        0.967     0.848        0.886     0.723

Abbreviations: nr = number of raters; no = number of observations; ρ² = generalizability (G) coefficient for relative decisions; φ = G-coefficient for absolute decisions
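The G-study versus D-study comparison in Table 5 can be extended to other candidate designs. The following sketch assumes the Resting Pelvic Tilt variance components from Table 1 and sweeps a small grid of rater and observation counts. The grid, the function name, and the dictionary keys are hypothetical choices made for illustration, and small differences from published values may arise from rounding of the variance components.

```python
def g_coefficients(vc, n_r, n_o):
    """Return (rho2, phi) for a design with n_r raters and n_o observations."""
    rel = vc["sr"] / n_r + vc["so"] / n_o + vc["sro,e"] / (n_r * n_o)
    abs_err = rel + vc["r"] / n_r + vc["o"] / n_o + vc["ro"] / (n_r * n_o)
    return vc["s"] / (vc["s"] + rel), vc["s"] / (vc["s"] + abs_err)


# Resting Pelvic Tilt variance components, as reported in Table 1.
resting = {
    "s": 19.956, "r": 1.726, "o": 0.148,
    "sr": 1.671, "so": 0.042, "ro": 0.000, "sro,e": 3.050,
}

# Candidate D-study designs (an arbitrary grid chosen for illustration).
designs = [(n_r, n_o) for n_r in (1, 2, 3) for n_o in (1, 3, 5)]
results = {design: g_coefficients(resting, *design) for design in designs}

print("n_r  n_o  rho2   phi")
for (n_r, n_o), (rho2, phi) in results.items():
    print(f"{n_r:3d}  {n_o:3d}  {rho2:.3f}  {phi:.3f}")
```

For this measure, adding a second rater should raise both coefficients more than adding observations does, because the rater-related components (raters and P x R) are much larger than the observation-related ones.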
TABLE 6 - Generalizability of Trunk Flexibility and Strength Measures

            Modified            Attraction          Lower Abdominal     Right Hamstring     Left Hamstring
            Schober             Method              Strength            Flexibility         Flexibility
study       G        D          G        D          G        D          G        D          G        D
nr =        2        1          2        1          2        1          2        1          2        1
no =        3        1          3        1          3        1          3        1          3        1
ρ²          0.877    0.690      0.928    0.813      0.752    0.531      0.972    0.935      0.966    0.925
φ           0.873    0.678      0.928    0.813      0.724    0.530      0.970    0.931      0.964    0.919

Abbreviations: G = G-study; D = D-study; nr = number of raters; no = number of observations; ρ² = generalizability (G) coefficient for relative decisions; φ = G-coefficient for absolute decisions
TABLE 7 - Generalizability of Trunk Endurance Measures

            Flexion                Extension
            G-study   D-study      G-study   D-study
nr =        2         1            2         1
no =        2         1            2         1
ρ²          0.908     0.723        0.944     0.836
φ           0.880     0.689        0.944     0.836

Abbreviations: nr = number of raters; no = number of observations; ρ² = generalizability (G) coefficient for relative decisions; φ = G-coefficient for absolute decisions