Classical true score measurement theory

Reliability
• performance on language tests is also affected by factors other than communicative
language ability.
• (1) test method facets, which are systematic to the extent that they are uniform from one test administration to the next
• (2) attributes of the test taker that are not considered part of the language abilities we want to measure, such as cognitive style and knowledge of particular content areas, and group characteristics such as sex, race, and ethnic background
• (3) random factors that are largely unpredictable and temporary.
• These include unpredictable and largely temporary conditions, such as his mental
alertness or emotional state, and uncontrolled differences in test method facets, such as
changes in the test environment from one day to the next, or idiosyncratic differences in
the way different test administrators carry out their responsibilities.
Classical true score measurement theory
• When we investigate reliability, it is essential to keep in mind the
distinction between unobservable abilities, on the one hand, and
observed test scores, on the other.
• Classical true score ( C T S ) measurement theory consists of a set of
assumptions about the relationships between actual, or observed
test scores and the factors that affect these scores:
• The first assumption of this model states that an observed score on a
test comprises two factors or components: a true score that is due to
an individual’s level of ability and an error score, that is due to
factors other than the ability being tested.
• A second set of assumptions has to do with the relationship between
true and error scores. Error scores are unsystematic, or random, and
are uncorrelated with true scores.
Parallel tests
• In order for two tests to be considered parallel, we assume that they are
measures of the same ability, that is, that an individual’s true score on one
test will be the same as his true score on the other.
• Two tests are parallel if, for every group of persons taking both tests, (1) the true score on one test is equal to the true score on the other, and (2) the error variances for the two tests are equal.
• parallel tests are two tests of the same ability that have the same means
and variances and are equally correlated with other tests of that ability.
• In summary, reliability is defined in the CTS theory in terms of true
score variance. Since we can never know the true scores of
individuals, we can never know what the reliability is, but can only
estimate it from the observed scores.
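A minimal simulation can make the CTS definitions concrete. All numbers below (means, standard deviations, sample size) are made up for illustration; the point is that reliability equals the proportion of observed-score variance due to true-score variance, and that the correlation between parallel forms estimates it:

```python
import numpy as np

# Hypothetical CTS sketch: observed score X = true score T + error E,
# with E random and uncorrelated with T. All numbers are invented.
rng = np.random.default_rng(0)
n = 10_000
true = rng.normal(50, 10, n)     # unobservable true scores
error = rng.normal(0, 5, n)      # random error component
observed = true + error          # observed test scores

# Reliability = true-score variance / observed-score variance.
reliability = true.var() / observed.var()

# A parallel form: same true scores, independent error of equal variance.
# The correlation between the two forms estimates reliability.
observed2 = true + rng.normal(0, 5, n)
estimate = np.corrcoef(observed, observed2)[0, 1]

print(round(reliability, 2), round(estimate, 2))
```

With these made-up variances the theoretical reliability is 100 / (100 + 25) = 0.8, and the parallel-form correlation lands close to it, mirroring the point that we can only estimate reliability from observed scores.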
Approaches to estimating reliability
• Internal consistency
Internal consistency is concerned with how consistent test takers’ performances on the different
parts of the test are with each other.
• Performance on the parts of a reading comprehension test, for example, might be inconsistent if
passages are of differing lengths and vary in terms of their syntactic, lexical, and organizational
complexity, or involve different topics.
• One approach to examining the internal consistency of a test is the
split-half method, in which we divide the test into two halves and then determine the extent to
which scores on these two halves are consistent with each other
This method rests on two assumptions: (1) both halves measure the same trait, and (2) individuals’ performance on one half does not depend on how they perform on the other.
• A convenient way of splitting a test into halves might be to simply divide it into the first and second halves; an alternative is the odd–even method, in which the odd-numbered and the even-numbered items form the two halves.
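A sketch of the odd–even split with the Spearman-Brown correction, using simulated item responses (item count, sample size, and the response model are all hypothetical):

```python
import numpy as np

# Split-half reliability sketch: 200 test takers, 20 dichotomous items.
# Responses are simulated so that the probability of a correct answer
# rises with ability (a simple logistic model; numbers are made up).
rng = np.random.default_rng(1)
ability = rng.normal(0, 1, 200)
p_correct = 1 / (1 + np.exp(-ability[:, None]))
items = (rng.random((200, 20)) < p_correct).astype(int)

odd = items[:, 0::2].sum(axis=1)    # score on the odd-numbered items
even = items[:, 1::2].sum(axis=1)   # score on the even-numbered items
r_half = np.corrcoef(odd, even)[0, 1]

# Spearman-Brown: project the half-test correlation to full length.
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```

The correction is needed because each half is only half as long as the real test, and longer tests are more reliable, so the raw half-test correlation understates the full test's internal consistency.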
Stability (test-retest reliability)
• There are also testing situations in which it may be necessary to administer a test more than once, for example, if a researcher were interested in measuring subjects’ language ability at several different points in time as part of a time-series design.
• In this approach, we administer the test twice to a group of individuals and then
compute the correlation between the two sets of scores.
• The primary concern in this approach is assuring that the individuals who take the test do
not themselves change differentially in any systematic way between test administrations.
• That is, we must assume that both practice and learning (or unlearning) effects are
either uniform across individuals or random
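Computationally, the stability coefficient is just the correlation between the two administrations. A tiny sketch with invented scores:

```python
import numpy as np

# Test-retest sketch: the same test given twice to ten individuals.
# The stability coefficient is the correlation between the score sets.
# Scores below are invented for illustration.
time1 = np.array([55, 62, 47, 71, 66, 58, 49, 74, 60, 68])
time2 = np.array([57, 60, 50, 73, 64, 59, 52, 70, 63, 65])
stability = np.corrcoef(time1, time2)[0, 1]
print(round(stability, 2))
```

Note that a high coefficient only supports the interpretation above if practice and learning effects between administrations were uniform or random, as the assumption in the text requires.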
Equivalence (parallel form reliability)
• It is of particular interest in testing situations where alternate forms of the test may actually be used, either for security reasons or to minimize the practice effect.
• In some situations it is not possible to administer the test to all
examinees at the same time, and the test user does not wish to take
the chance that individuals who take the test first will pass on
information about the test to later test takers.
• In other situations, the test user may wish to measure individuals’ language
abilities frequently over a period of time, and wants to be sure that any
changes in performance are not due to practice effect, and therefore
uses alternate forms.
Problems with the classical true score model
• In many testing situations these apparently straightforward
procedures for estimating the effects of different sources of error are
complicated by the fact that the different sources of error may
interact with each other, even when we carefully design our reliability
study.
• A second, related problem is that the CTS model considers all error
to be random, and consequently fails to distinguish systematic error
from random error.
Generalizability theory
• Generalizability theory (G-theory) investigates the relative effects of different sources of variance in test scores.
• On the basis of an individual’s performance on a test, we generalize to her performance in other contexts. The more reliable the sample of performance, or test score, is, the more generalizable it is.
• The application of G-theory to test development and use takes place in two
stages:
• First, the test developer designs and conducts a study to investigate the sources
of variance that are of concern or interest.
• This involves identifying the relevant sources of variance (including traits, method
facets, personal attributes, and random factors), designing procedures for
collecting data that will permit the test developer to clearly distinguish the
different sources of variance, administering the test according to this design, and
then conducting the appropriate analyses.
• On the basis of this generalizability study (‘G-study’), the test developer obtains
estimates of the relative sizes of the different sources of variance (‘variance
components’).
• Depending on the outcome of this G-study, the test developer may revise
the test or the procedures for administering it, and then conduct another
G-study. Or, if the results of the G-study are satisfactory (if sources of error
variance are minimized), the test developer proceeds to the second stage,
a decision study (‘D-study’).
• Second, in a D-study, the test developer administers the test under operational conditions, that is, under the conditions in which the test will be used to make the decisions for which it is designed, and uses G-theory procedures to estimate the magnitude of the variance components.
• The application of G-theory thus enables test developers and test users to
specify the different sources of variance that are of concern
for a given test use, to estimate the relative importance of these different
sources simultaneously, and to employ these estimates in the
interpretation and use of test scores.
In general:
• It takes into account all possible sources of error (due to individual factors, situational characteristics of the evaluator, and instrumental variables) and tries to differentiate among them by applying the classical procedures of analysis of variance (ANOVA).
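The ANOVA machinery behind a G-study can be sketched for the simplest persons × items design. The data below are simulated, and the effect sizes are made up; the variance-component formulas follow from the expected mean squares of a two-way crossed random-effects ANOVA:

```python
import numpy as np

# G-study sketch for a persons x items design (hypothetical data:
# scores built from person, item, and residual effects).
rng = np.random.default_rng(2)
n_p, n_i = 100, 8
person = rng.normal(0, 2.0, (n_p, 1))    # person (object of measurement)
item = rng.normal(0, 1.0, (1, n_i))      # item difficulty facet
resid = rng.normal(0, 1.5, (n_p, n_i))   # residual (pi interaction + error)
scores = 10 + person + item + resid

grand = scores.mean()
ms_p = n_i * ((scores.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
ms_i = n_p * ((scores.mean(axis=0) - grand) ** 2).sum() / (n_i - 1)
ss_res = ((scores - scores.mean(axis=1, keepdims=True)
           - scores.mean(axis=0, keepdims=True) + grand) ** 2).sum()
ms_res = ss_res / ((n_p - 1) * (n_i - 1))

# Variance components from the expected mean squares.
var_res = ms_res
var_p = (ms_p - ms_res) / n_i
var_i = (ms_i - ms_res) / n_p

# Generalizability coefficient for relative decisions with n_i items.
g_coef = var_p / (var_p + var_res / n_i)
print(round(var_p, 2), round(var_i, 2), round(var_res, 2), round(g_coef, 2))
```

The estimated components recover the simulated effect variances, and the generalizability coefficient shows how averaging over more items shrinks the error term, which is exactly the kind of information a D-study uses.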
Item Response theory
• A major limitation to CTS theory is that it does not provide a very satisfactory basis for predicting
how a given individual will perform on a given item.
• There are two reasons for this. First, CTS theory makes no assumptions about how an individual’s level of ability affects the way he performs on a test. Second, the only information that is available for predicting an individual’s performance on a given item is the index of difficulty, which is simply the proportion of individuals in a group that responded correctly to the item.
• Thus, the only information available in predicting how an individual will answer an item is the
average performance of a group on this item.
• Because of this and other limitations in CTS theory (and G-theory, as well), psychometricians have developed a number of mathematical models for relating an individual’s test performance to that individual’s level of ability.
• Item response theory makes stronger predictions about individuals’
performance on individual items, their levels of ability, and about the
characteristics of individual items.
• Item characteristic curves (the relationship between the test taker’s
ability and his performances on a given item)
The types of information about item characteristics may include:
• (1) the degree to which the item discriminates among individuals of differing levels of ability (the ‘discrimination’ parameter a)
• (2) the level of difficulty of the item (the ‘difficulty’ parameter b)
• (3) the probability that an individual of low ability can answer the item correctly (the ‘pseudo-chance’ or ‘guessing’ parameter c).
• An individual’s expected performance on a particular test question, or
item, is a function of both the level of difficulty of the item and the
individual’s level of ability.
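The three parameters combine in the three-parameter logistic (3PL) model, one common IRT formulation. A sketch with made-up item parameters:

```python
import math

# Three-parameter logistic (3PL) ICC: probability of a correct response
# as a function of ability theta, given item parameters a (discrimination),
# b (difficulty), and c (pseudo-chance).
def icc(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Illustrative (hypothetical) item: a = 1.2, b = 0.5, c = 0.2.
low = icc(-3.0, a=1.2, b=0.5, c=0.2)   # low ability: near the c floor
mid = icc(0.5, a=1.2, b=0.5, c=0.2)    # theta = b: halfway between c and 1
high = icc(3.0, a=1.2, b=0.5, c=0.2)   # high ability: approaches 1
print(round(low, 2), round(mid, 2), round(high, 2))
```

The formula makes the text's claim explicit: the expected performance depends jointly on the item's parameters and the individual's ability, and at theta = b the probability is exactly halfway between c and 1.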
Item Characteristic Curves
• Specific assumptions about the relationship between the test taker's
ability and his performance on a given item are explicitly stated in the
mathematical formula, or item characteristic curve (ICC).
• The form of the ICC is determined by the particular mathematical model on which it is based.
• One of the major considerations in the application of IRT models,
therefore, is the estimation of these item parameters.
[Figure: an ICC plotting the probability of a correct response against the ability scale, annotated with the three item parameters]
• pseudo-chance parameter c: p = 0.20 for two items
• difficulty parameter b: halfway between the pseudo-chance parameter and one
• discrimination parameter a: proportional to the slope of the ICC at the point of the difficulty parameter; the steeper the slope, the greater the discrimination parameter.