Introduction to Test Validity

Using statistics in small-scale language education research
Jean Turner
© Taylor & Francis 2014

Tests and other data collection tools must measure
accurately and appropriately given the nature of the
construct.

Test validity is associated with the extent to which:
◦ a tool measures the intended construct
◦ the tool scores/outcomes mean what they are intended to mean
◦ the tool scores/outcomes are useful for their intended purpose(s)

Test validity has an impact on both internal research
study validity and external research study validity.

There are different perspectives and techniques
associated with investigations of test validity.

Historically, these different perspectives and
techniques were referred to as different types of
validity.

(Though they aren’t really different types.)

Construct validity

Content validity

Criterion-related validities
◦ Concurrent validity
◦ Predictive validity

Face validity

Construct validity: the extent to which the constructs
measured by a test or data collection tool are clearly
and appropriately defined and measured
◦ (1) Are the definitions of the constructs clear and useful?
◦ (2) Does the data collection tool really tap these skills?
◦ (3) Is there convincing evidence supporting points 1 and 2?

Content validity: the extent to which the items or tasks
measure the constructs completely, without measuring
other, unrelated knowledge, skills, or abilities.
◦ Does the test measure all aspects of the construct?
◦ Is there very little measured by that test that’s unrelated to the construct?

There are two criterion-related approaches to
investigating validity. Both involve investigating
the relationship between the data collection tool
in question and another tool.
◦ Concurrent validity
◦ Predictive validity

The new tool is administered to a group of people who
also complete a well-established tool tapping the same
construct.

If the new tool taps what it’s designed to measure, the
correlation between the two sets of scores will be
high.

If the correlation is high, the concurrent validity is
considered good—evidence that the test measures the
intended construct.

Does a new test of Business English ability really measure that
construct?
◦ Give the new test to a large number of examinees; also give the same examinees the
English BULATS test (a recognized measure of Business English ability).
◦ Calculate the correlation between scores on the two tests. A high correlation serves as
evidence that the new test measures Business English, because it relates well to the
recognized measure of Business English.

This approach is called concurrent validity because the two
tests are taken concurrently.

This approach is only as useful as the comparison measure is
sound!
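
The correlation step in the Business English example can be sketched in a few lines of Python. This is only an illustration: the score lists are invented, the variable names are hypothetical, and the Pearson coefficient is assumed to be the correlation of interest (the slides do not name a specific coefficient).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    sum_products = sum(a * b for a, b in zip(dev_x, dev_y))
    ss_x = sum(a * a for a in dev_x)
    ss_y = sum(b * b for b in dev_y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return sum_products / sqrt(ss_x * ss_y)


# Invented scores for eight examinees who took both tests (not real data).
new_test_scores = [52, 61, 70, 45, 88, 77, 66, 59]
bulats_scores = [48, 60, 72, 40, 90, 75, 70, 55]

r_concurrent = pearson_r(new_test_scores, bulats_scores)
print(f"Correlation between the two tests: r = {r_concurrent:.2f}")
```

A coefficient near 1 would count as concurrent validity evidence. A real study would need far more than eight examinees.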

Admissions tests must have good predictive validity.

Ways to collect evidence of predictive validity:
◦ Give the test to a number of people starting a program of study.
◦ At the end of the term, collect information on their final exam
or final GPA.
◦ Find the correlation between the initial scores and the later
measure of success.
◦ A high correlation is evidence of high predictive validity.
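
The steps above can be sketched as follows. The records are invented, and dropping students with no later measure is an assumption about how incomplete cases might be handled, not a procedure taken from the slides.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    sum_products = sum(a * b for a, b in zip(dev_x, dev_y))
    return sum_products / sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))


# Invented records: (entry test score, final GPA). None marks a student
# who left the program before a final GPA was available.
records = [
    (55, 2.9), (62, 3.1), (70, 3.4), (48, None),
    (81, 3.8), (66, 3.0), (74, 3.6), (59, 2.7),
]

# Keep only students who have both the initial score and the later measure.
complete = [(score, gpa) for score, gpa in records if gpa is not None]
entry_scores = [score for score, _ in complete]
final_gpas = [gpa for _, gpa in complete]

r_predictive = pearson_r(entry_scores, final_gpas)
print(f"n = {len(complete)}, r = {r_predictive:.2f}")
```

Only admitted students ever receive a later measure, so the range of entry scores in such a sample is narrower than in the full applicant pool, which tends to lower the observed correlation.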

In the past, all students in a particular MATESOL/TFL
program had to take the GRE (though it wasn’t used for
admission).

The correlation between GRE performance and students’
scores on their comprehensive examination at the end of
their studies was found to be very low.

The GRE doesn't seem to have good predictive validity
for students in this program.

Face validity is the extent to which research study
participants and other users of a data collection tool’s
outcomes believe the tool is useful and the outcomes are
good indicators of the intended construct.

A data collection tool’s face validity varies according to
individuals’ backgrounds and experiences, so it’s
impressionistic.

Though impressionistic, it’s important because
participant performance may be affected by face
validity!

Correlational evidence
◦ Two tests (concurrent validity)
◦ A test and a future measure (predictive validity)

Experimental evidence
◦ Intervention study
◦ Differential group study

Expert review of content, format, and processes by:
◦ Language testing experts
◦ Teachers
◦ Employers
◦ Learners