Potthoff, R. F. (1966). Statistical aspects of the problem of biases in psychological tests.

STATISTICAL ASPECTS OF
THE PROBLEM OF BIASES
IN PSYCHOLOGICAL TESTS
by
Richard F. Potthoff
University of North Carolina
Institute of Statistics Mimeo Series No. 479

May 1966
This research was supported by the College Entrance Examination Board through Educational Testing Service, and was also supported in part by the National Institutes of Health Research Grant No. GM 12868-02.
DEPARTMENT OF STATISTICS
UNIVERSITY OF NORTH CAROLINA
Chapel Hill, N. C.
Summary
Questions are often raised as to whether certain psychological tests are biased against various kinds of groups. This report examines the problems of how to define bias, how to determine whether bias exists, and how to rectify bias which does exist. Two distinct situations are considered: (i) the situation where there is a criterion variable which the test is predicting, and (ii) the situation where there is no such criterion variable. The analysis for situation (i) is based largely on a wide variety of multiple regression techniques. In situation (ii) there is a serious problem of how to define "bias" in the first place, but significance tests are presented for certain hypotheses which are closely related to the general idea of bias.
TABLE OF CONTENTS

Introduction . . . 1

Part I: The Problem of Biases When There Is a Criterion Variable to Be Predicted . . . 4
    Section 1: Notation, assumptions, and understandings . . . 4
    Section 2: Definition of bias when there is a criterion variable . . . 6
    Section 3: Pitfalls in sampling from the g groups . . . 8
    Section 4: Speed tests and power tests . . . 15
    Section 5: Detection of bias when one uses just the total score on the test X . . . 17
    Section 6: The situation of more than one test . . . 23
    Section 7: The situation of more than one criterion variable . . . 25
    Section 8: Correcting for bias if it is found to exist when just the total score on the test X is being used . . . 35
    Section 9: Prediction and bias analysis when one uses a regression equation which is based on the m item scores, and which is linear in these item scores . . . 42
    Section 10: Prediction and bias analysis based on second-order regression in the m item scores . . . 53
    Section 11: Prediction and bias analysis based on which responses to the items are marked . . . 63

Part II: The Problem of Biases When There Is No Criterion Variable to Be Predicted . . . 73
    Section 12: Notation, assumptions, and understandings . . . 73
    Section 13: The problem of how to define bias when there is no criterion variable . . . 74
    Section 14: Testing for the equality of the groups on every item . . . 84
    Section 15: Testing for the equality of the groups with respect to mean total score only . . . 89
    Section 16: Bias analysis and item-group interaction . . . 90

Acknowledgment . . . 109
References . . . 110
INTRODUCTION
There has been increasing concern recently over the question of whether various educational and psychological tests are biased against certain groups. Do tests discriminate against Negroes, or against other racial groups? Are some tests biased in favor of girls, or in favor of boys? Are examinees from lower socio-economic groups at an unfair disadvantage in taking tests? Do tests ever discriminate against certain religious groups? Do they discriminate against persons whose native language is not English? Are intelligence tests biased against creative individuals? If a person is somewhat susceptible to eye-strain, is he under an unfair handicap when taking certain types of non-verbal tests which require prolonged visual concentration? If an individual is not "observant" or is not artistically inclined, is he discriminated against by a non-verbal intelligence test which requires identification of pictures? Are certain achievement tests biased in favor of students who followed a traditional curriculum and against students who followed a new, experimental curriculum, or vice versa? Are persons from a foreign culture at an unfair disadvantage in taking some tests? Questions such as these have been raised by various authors, including Eells et al. [9], Coffman [6, 7, 8], Cardall and Coffman [4], and Cleary and Hilton [5], among others.
The problem of possible biases in psychological tests has been a troublesome issue not only with respect to the use of tests for educational purposes, but also with respect to tests which are utilized in the selection of applicants for employment. The recent case involving Motorola, Inc., has received widespread publicity [3, 10, 21, 22], particularly in business-oriented publications. In this case, which engendered considerable controversy, Motorola refused to hire a Negro applicant for employment, and the applicant then claimed discrimination. Motorola claimed that it had rejected the applicant because he had (according to Motorola) made too low a score on a certain test of general aptitude and intelligence which the company administered as part of its selection procedure. The case went to the Illinois FEPC (Fair Employment Practices Commission). A hearing examiner for the Illinois FEPC (who happened to be a Negro) ruled, among other things, that Motorola should stop using the test because (according to the examiner) the test discriminated against "culturally deprived and disadvantaged groups". This part of the examiner's ruling was later annulled, so that Motorola was no longer prohibited from using the test. However, the underlying issue, as to whether tests such as the one in question are biased against Negroes or against persons of low socio-economic status, remained unresolved. The Federal civil rights legislation of 1964, which was passed a few months after the Motorola case first came to light, contained a provision which specifically permits employers to use "professionally developed ability tests". Of course, even professionally developed tests could still contain unintentional biases against certain groups, if sufficient effort is not made to identify, detect, and rectify such possible biases.
This report is concerned with some of the statistical questions which arise in investigating biases in psychological tests. Matters which we will need to consider are (a) defining what is meant by "bias" in the first place; (b) methods of determining whether or not biases exist in particular tests; and (c) methods of correcting or eliminating bias. We will consider the bias problem for two distinct situations: (i) the situation where there is a criterion variable which the test score is intended to predict; and (ii) the situation where there is no such criterion variable. In situation (i), it is presumed that the investigator has access to an appropriate sample of individuals for whom criterion scores (as well as test scores) are available. The two basic situations (i) and (ii) are dealt with respectively in Parts I and II of this report. It appears that results of a more definitive nature can be obtained for situation (i) than for situation (ii).
PART I. THE PROBLEM OF BIASES WHEN THERE IS A CRITERION VARIABLE TO BE PREDICTED

Group bias would seem to be easier to define and to detect when there is a criterion variable than when there is no criterion variable. Part I considers the question of bias in a test for which there is a criterion variable which the test is used to predict. Part II will deal with the more troublesome situation where there is no criterion variable.

1. Notation, assumptions, and understandings.
We let X stand for the test, and Y for the criterion variable. We will use the notation Y to denote either the criterion variable itself or the numerical score on the criterion variable. The score on the test X, however, will be denoted by X° rather than by X itself. We will need to allow for several possible ways of defining the test score X° for a given test X; we will allow X° to be either a single number or a vector. We assume that the test X is being used to predict Y, via a regression line of Y on X°.

We suppose we have g different groups, which may represent different racial groups, different socio-economic groups, or groups which might arise in various other ways as indicated above in the introduction. We are interested in the question of whether or not, for these g groups, the test X is biased with respect to predicting the criterion Y; in other words, does the test X give an unfair advantage or disadvantage to some groups? (A more precise definition of "bias" will be suggested shortly.) Frequently there are just two groups (i.e., g=2), and in some of what follows we will, for convenience, dwell on this important special case, although the results usually will be applicable for general g.
We assume we have available from the i-th group a suitably-chosen sample of N_i individuals (i=1,2,...,g) for whom both X°- and Y-scores are available. Section 3 below will examine some of the problems in sampling and some sampling pitfalls to be avoided.

For convenience, we are assuming in most of our development that there is just a single test X, although many of our results could be generalized in an obvious manner (essentially as outlined in Section 6) to handle a situation where there is more than one test. We will let X_ij denote the total score on the test X for the j-th individual in the i-th group (j=1,2,...,N_i; i=1,2,...,g).

We will assume generally that there is just one criterion variable, or at least just a single criterion variable being considered at a time. Our definition of "bias" will be with respect to this particular criterion variable. However, for the situation where there is more than one criterion variable, we will point out later (see Section 7) how one can test the hypothesis that there is no bias with respect to any of the criterion variables (i.e., test for bias with respect to all criterion variables simultaneously) by using a multivariate analysis of variance test. For a single criterion variable we will use Y_ij to denote the criterion score for the j-th individual in the i-th group.

Let there be m items on the test X. For k=1,2,...,m, we define a quantity X_ijk which is equal to 1 or 0 according as the j-th individual in the i-th group marks respectively the right answer or a wrong answer to the k-th item. Let n_k (k=1,2,...,m) be the total number of possible responses to the k-th item. In the case where the (i,j)-th individual does not mark any response to the k-th item, we may wish, for most purposes, to define X_ijk to be equal to 1/n_k (the average value of X_ijk if a response is selected at random).

For ℓ=1,2,...,n_k, we will define a quantity X_ijkℓ which is equal to 1 if the j-th individual in the i-th group marks the ℓ-th response to the k-th item, and which is equal to 0 if the individual in question marks some other response to the k-th item. One would normally assume that no more than one response to an item could be marked. If the (i,j)-th examinee does not mark any response to the k-th item, then one might, for most purposes, define X_ijkℓ to be equal to 1/n_k for ℓ=1,2,...,n_k.

In a few psychological tests, the items will not be of the multiple-choice type. For such tests, we would perhaps not attempt to define n_k or X_ijkℓ, although in some cases it might still be possible to classify the incorrect responses somehow and thereby establish definitions for the X_ijkℓ's.
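As a concrete illustration of these scoring conventions, the following sketch (ours, not the report's) encodes one examinee's response to one item as the quantity X_ijk and as the indicator vector (X_ijk1, ..., X_ijkn_k), representing an omitted item by None:

```python
def right_wrong_score(response, key, n_k):
    """X_ijk: 1 if the marked response is the keyed answer, 0 if it is
    a wrong answer, and 1/n_k if no response is marked (None)."""
    if response is None:
        return 1.0 / n_k
    return 1.0 if response == key else 0.0

def response_indicators(response, n_k):
    """X_ijk1, ..., X_ijkn_k: indicators of which response was marked,
    with every entry set to 1/n_k when the item is omitted."""
    if response is None:
        return [1.0 / n_k] * n_k
    return [1.0 if l == response else 0.0 for l in range(1, n_k + 1)]

# A five-choice item keyed 2: a right answer, a wrong answer, an omit.
print(right_wrong_score(2, 2, 5))      # 1.0
print(right_wrong_score(4, 2, 5))      # 0.0
print(right_wrong_score(None, 2, 5))   # 0.2
print(response_indicators(4, 5))       # [0.0, 0.0, 0.0, 1.0, 0.0]
```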
2. Definition of bias when there is a criterion variable.
Consider the conditional expectation of the criterion score Y given the test score X°. By "test score" we may mean the single number X_ij which represents the total score on the test, or we may mean a vector of one type or another, such as the vector

(2.1)    (X_ij1, X_ij2, ..., X_ijm)

or

(2.2)    (X_ij11, X_ij12, ..., X_ij1n_1, X_ij21, X_ij22, ..., X_ij2n_2, ..., X_ijm1, X_ijm2, ..., X_ijmn_m).

Thus, for the same test X, "test score" (X°) might be defined in several different ways. This will become clearer in later discussions here in Part I.

The conditional expectation of Y given X° will be a function of X°, and frequently one would simply assume that it would be linear in X°.
This conditional expectation represents the equation that would be used to predict the criterion score Y from the test score X°; in fact, the main purpose of the test X would generally be to predict the criterion Y. If the conditional expectation (i.e., the prediction equation) of Y given the test score X° is the same for all g groups, then we will say that X° is not biased with respect to predicting the criterion Y; otherwise we will say that, with respect to these g groups, X° is biased with respect to predicting the criterion Y.
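The definition can be made concrete with a small numerical sketch (ours, not the report's), assuming the prediction equation is a simple linear regression of Y on the total score: the test is unbiased exactly when the fitted prediction equations of the groups coincide. Noise-free data are used so that each group's regression is recovered exactly.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope of the regression of Y on X."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

xs = list(range(1, 11))

# Unbiased pair: both groups share E(Y | X°) = 2 + 0.5 X°.
a1, b1 = fit_line(xs, [2 + 0.5 * x for x in xs])
a2, b2 = fit_line(xs, [2 + 0.5 * x for x in xs])
print((a1, b1) == (a2, b2))   # True: one common prediction equation

# Biased pair: group 2 has E(Y | X°) = 1 + 0.5 X°, so individuals with
# the same test score have different expected criterion scores.
a3, b3 = fit_line(xs, [1 + 0.5 * x for x in xs])
print(abs(a1 - a3))           # 1.0: the intercepts differ
```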
Note that, with this definition of bias, the existence of bias is dependent on:

(i) the set of groups,
(ii) the criterion variable, and
(iii) the way "test score" (X°) is defined.

Thus the same test X could be biased if the test score X° is defined in one way but free of bias if X° is defined in a different way; X° could be biased with respect to predicting one criterion but free of bias with respect to predicting another criterion; and X° could be biased for predicting Y for one set of groups but free of bias for predicting Y for a slightly different set of groups.

Our definition of bias simply says that a test is not biased if individuals from different groups who have the same test scores also have the same expected criterion scores. This would seem to be an obvious way of defining bias when there is a criterion variable, and is in essence the same concept as Cardall and Coffman [4] mention when they state,

    "If the test is used for prediction, the ultimate question is whether or not a common regression equation is appropriate for the several groups to which it is being applied and whether or not the predictions are equally effective for all groups."
As we shall see in Part II, the definition of bias will not be nearly as clear-cut a matter when there is no criterion variable.

3. Pitfalls in sampling from the g groups.
We have assumed that we have a sample of individuals from each of the g groups, the sample from the i-th group being of size N_i. If each sample is a random sample from the respective group, then there should be no question about the legitimacy of any analysis for bias. For example, suppose that, in a particular city, every child who is in a certain grade in school is given the test X and then later obtains a score on the criterion Y. Suppose the g groups consist of g different racial and/or socio-economic groups which are represented in the city. Since the samples embrace all of the students, there has been no selection within the city. It would appear that the sample could be considered to be random in this case.

Often it will be difficult or impossible to obtain strictly random samples. Bias analysis can still be legitimate in certain situations even when the samples are not random. The essential requirement is that, in the sample from each group, the Y-scores of the individuals having any given X°-score should constitute a random sample of the Y-scores of all the individuals in the group who have that X°-score. Thus, if the samples are selected on the basis of X° and on the basis of X° alone, then each sample will still be representative of the respective population (group) as a whole with respect to the distribution of Y given X°, and any bias analysis will still be legitimate, since our definition of bias depends on the conditional expectation of Y given X°. The analysis will still be legitimate even if different selection standards are used for the different groups, so long as the selection is made on the basis of X° alone; thus, e.g., if all individuals with X°-scores above a certain minimum were selected, then the minimum X°-score could be different for the
different groups and the bias analysis still would not be distorted.
In many practical situations, some type of selection will affect the choice of the sample. Unless this selection is on the basis of X° alone, however, it is possible that some distortion may be introduced so as to render the bias analysis of questionable validity. We now point out certain ways in which such distortion might occur:

(a) In many cases the samples may be self-selected to a certain extent. This would be true if the samples are chosen from among applicants for employment or applicants for admission to college; even though ostensibly only the test score X° might be used by the employer or college to decide which candidates to accept, there was nevertheless a previous stage of selection (self-selection) in which individuals decided whether or not to apply in the first place. We indicate a possible way (perhaps a far-fetched one, perhaps not) in which this self-selection could distort the bias analysis. Suppose that individuals are able to form some rough idea of how well they would perform on the criterion Y before they receive either a Y-score or an X°-score. Suppose that individuals in "have-not" groups are relatively timid about making an application in the first place. Suppose that, in the main, applications from the have-not groups (but not from the other groups) come only from individuals who are exceptionally confident of scoring well on the criterion Y. Then it could turn out that, in a sample from a have-not group, the average Y-score given X° would tend to be higher than the expectation of Y given X° for this group as a whole. Thus the conditional expectation of Y given X° would tend to be over-estimated for the have-not groups. If X° were in fact not biased, the true expectation of Y given X° would be the same for all groups, but the tendency to over-estimate this expectation for the have-not groups would tend to make it appear that the expectation of Y given X° was actually higher for the have-not groups than for the other groups. Thus one would conclude falsely that the test was biased against the have-not groups, because exactly the same type of result would tend to occur (with a random sample) if the test were biased against the have-not groups.
(b) Candidates might exercise self-selection not only in their decisions of whether or not to apply in the first place, but also in their decisions of whether or not to take the job or attend the college once they are accepted (if they are accepted). This second type of self-selection might also be able to distort an analysis for bias.
(c) Suppose that selection is made on the basis of X° and of other variable(s) as well. For illustration, we consider the case where selection is based on X° and on one other variable, which we will call W. W might (e.g.) be a second test, or it might represent an evaluation of recommendations of the candidate, or it might be something even less tangible, such as a judgment based on a personal interview. We will suppose in what follows that the expectation of Y given X° and W is a linear function of X° and W, i.e., it is of the form

(3.1)    E(Y_ij | X_ij, W_ij) = α_i + β_Xi X_ij + β_Wi W_ij

for an individual in the i-th group. Here we are taking X° to be the single value X_ij (the total score on the test X). Likewise, W is represented by the single value W_ij, the W-score for the j-th individual in the i-th group. α_i, β_Xi, and β_Wi are regression coefficients pertaining to the i-th group. Suppose also that

(3.2)    E(W_ij | X_ij) = α_oi + β_oi X_ij,

where α_oi and β_oi are further regression coefficients. Combining (3.1) and (3.2), we have

(3.3)    E(Y_ij | X_ij) = (α_i + β_Wi α_oi) + (β_Xi + β_Wi β_oi) X_ij.
Thus it follows from (3.3) that X° will be free of bias with respect to the criterion Y if (and only if) the quantities (α_i + β_Wi α_oi) and (β_Xi + β_Wi β_oi) have the same value for all g groups. Suppose now that, in the choosing of the samples from all the groups, there has been selection on the basis of a specified linear combination of X_ij and W_ij (with exactly the same selection condition being used for all groups), so that, for all individuals in all groups,

(3.4)    c_x X_ij + c_w W_ij ≥ K,

where c_x and c_w are the coefficients of the linear combination and K is the minimum allowable score for the linear combination. Then, assuming that the sampling in each group is random from among the individuals who satisfy (3.4), we can let E_s denote expectation for the sample and obtain

(3.5)    E_s(Y_ij | X_ij) = α_i + β_Xi X_ij + β_Wi E_s(W_ij | X_ij; c_x X_ij + c_w W_ij ≥ K),

where the last factor on the right is the expectation of W_ij given X_ij and given (3.4). As was previously pointed out, there will in reality be no bias so long as the expectation (3.3) is the same for all groups. However, the condition that (3.3) (the expectation in the population) is the same for all groups is not, in general, sufficient to ensure that (3.5) (the expectation for the sample) is the same for all groups. If the expectations (3.3) are alike but the expectations (3.5) are different, then we would generally be concluding (with sufficiently large samples) that there is bias when, in fact, no bias exists. A further difficulty in connection with (3.5) is that (3.5) will generally not be linear in X_ij [whereas (3.3) is linear in X_ij]. Thus we conclude that, if selection is based on some variable W in addition to X, using a selection condition such as (3.4), then any bias analysis will be open to question.
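The nonlinearity of the sample regression (3.5) can be checked numerically. Assuming, for illustration only, that W given X_ij is standard normal and that α_i = 0 and β_Xi = β_Wi = c_x = c_w = 1 with K = 0 (values we have made up, not the report's), the truncated-normal mean E(W | W ≥ a) = φ(a)/(1 − Φ(a)) gives (3.5) in closed form, and its second difference over equally spaced X-values is far from zero, so (3.5) is curved in X_ij:

```python
import math

def phi(a):        # standard normal density
    return math.exp(-0.5 * a * a) / math.sqrt(2 * math.pi)

def Phi(a):        # standard normal c.d.f.
    return 0.5 * (1 + math.erf(a / math.sqrt(2)))

def trunc_mean(a): # E(W | W >= a) for standard normal W (Mills ratio)
    return phi(a) / (1 - Phi(a))

def e_s(x, K=0.0):
    """Sample regression (3.5) with alpha = 0 and all betas and c's
    equal to 1: x + E(W | W >= (K - x))."""
    return x + trunc_mean(K - x)

# A linear function has zero second differences; (3.5) does not.
second_diff = e_s(1.0) - 2 * e_s(0.0) + e_s(-1.0)
print(round(second_diff, 3))   # 0.217, clearly nonzero
```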
(d) Suppose selection is made on the basis of W alone. Then the same difficulty which was just indicated in (c) could occur. In fact, this situation would be a special case of that described in (c): if c_x = 0 in (3.4), that would correspond to selection on the basis of W alone.
(e) If selection is made on the basis of both X_ij and W_ij, and if the vector (X_ij, W_ij) (rather than X_ij alone) is considered as the relevant "test score" X°, then the selection on X_ij and W_ij could not cause any distortion in the bias analysis, because E(Y_ij | X_ij, W_ij) would not be affected by selection on (X_ij, W_ij). However, it might not be feasible to consider (X_ij, W_ij) rather than X_ij to be the test score X° which is being examined for bias, if the variable W_ij is of such a nature that it is not generally expressed in numerical form (even though for theoretical purposes we can conceive of it as being quantifiable). Thus, e.g., if candidates are selected on the basis of an aptitude test score (X_ij) and the results of a personal interview (W_ij), then, as we saw in (c), this could distort the results of an analysis for bias of the test score X° = X_ij; on the other hand, this would not distort an analysis for bias of the pair (X_ij, W_ij) [X° = (X_ij, W_ij) being now considered as the relevant "test score" which is being examined for bias], but evidently this latter type of bias analysis would not often be feasible in the first place, since W_ij (results of the interview) would not generally be expressed in quantitative form.
(f) If the trivariate distribution of (Y_ij, X_ij, W_ij) is the same for all groups, and if selection is made on the basis of both X_ij and W_ij with the same selection standards for all groups, then an analysis for bias of the test score X° = X_ij will not be distorted even though selection has been made on the basis of both X_ij and W_ij. This is because the g population expectations (3.3) will all be equal (indicating no bias), and the g sample expectations will also be all equal (although the latter will generally not be linear in X_ij even though the former are). However, if the trivariate distribution of (Y_ij, X_ij, W_ij) is the same for all groups and selection is made on the basis of both X_ij and W_ij but with selection standards which are not the same for all groups, then the analysis for bias of X° = X_ij will be distorted. To see this, let us suppose that the condition for selection is of the form

(3.6)    c_x X_ij + c_w W_ij ≥ K_i

instead of (3.4). Now the fact that the trivariate distribution of (Y_ij, X_ij, W_ij) is the same for all groups implies that the three parameters α_i, β_Xi, and β_Wi are the same for all i, so that (3.1) can be written in the form

(3.7)    E(Y_ij | X_ij, W_ij) = α + β_X X_ij + β_W W_ij.

The population expectations (3.3) will of course be the same for all groups (since the conditional distribution of W_ij given X_ij is the same for all groups, and this causes α_oi and β_oi to be the same for all i), thereby indicating that X° = X_ij is really not biased. Now we use (3.6) and (3.7) to find that the corresponding sample expectations [obtained in an analogous manner to (3.5)] are of the form

(3.8)    E_s(Y_ij | X_ij) = α + β_X X_ij + β_W E_s(W_ij | X_ij; W_ij ≥ (K_i − c_x X_ij)/c_w).

If β_W ≠ 0, then, for the normal distribution (and for virtually all other distributions as well), the expectations (3.8) will be the same for all groups if and only if the K_i's are the same for all groups. Thus the bias analysis for X° = X_ij will be legitimate if and only if the K_i's are the same for all groups (i.e., the selection standards are the same).
In practice, a situation where (Y_ij, X_ij, W_ij) might reasonably have the same trivariate distribution for all groups, but where differential selection of a form such as (3.6) might be used, could arise as follows. Suppose a college uses lower selection standards (K_i's) for applicants from distant areas, in order to obtain a student body which is geographically more diversified than it would otherwise be. Suppose the college then tries to determine if the test X (with X° = X_ij) is biased with respect to different geographical groups. Under the premises we have set up, X° would not in reality be biased, but it would tend to appear to be biased in favor of the groups from the more distant areas. This is because the term E_s(W_ij | X_ij; W_ij ≥ (K_i − c_x X_ij)/c_w) in (3.8) will be larger the larger K_i is (assuming β_W > 0 and c_w > 0), which will cause (3.8) to be lower for the groups from the more distant areas.
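A small simulation (ours; the distributions and coefficients are illustrative values we have chosen, not the report's) shows the mechanism: two groups with the identical trivariate distribution of (Y, X, W), selected by a rule of the form (3.6) with different cutoffs K_i, exhibit clearly different sample means of Y at comparable X-scores, even though neither population is biased.

```python
import random

random.seed(7)

def select_sample(n, K):
    """Draw (X, W) standard normal, keep cases with X + W >= K (rule
    (3.6) with c_x = c_w = 1), and attach Y = X + W + noise; the
    population regression of Y on (X, W) is identical for every group."""
    kept = []
    for _ in range(n):
        x, w = random.gauss(0, 1), random.gauss(0, 1)
        if x + w >= K:
            kept.append((x, x + w + random.gauss(0, 0.5)))
    return kept

def mean_y_near_zero(sample, width=0.25):
    """Average criterion score among selected cases with X near 0."""
    ys = [y for x, y in sample if abs(x) <= width]
    return sum(ys) / len(ys)

lenient = select_sample(200_000, K=0.0)   # lower standard K_1
strict  = select_sample(200_000, K=1.0)   # higher standard K_2
gap = mean_y_near_zero(strict) - mean_y_near_zero(lenient)
print(gap > 0.3)   # True: the strictly-selected group looks better at the same X
```

The strictly-selected group appears to out-perform its test score, so the test would falsely look biased in favor of the leniently-selected group, exactly the pattern described above.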
Although an earlier report [15] was concerned with an entirely different topic from that of the present report, Sections 6 and 8 of [15] point out some situations which bear a degree of similarity to certain situations which have just been mentioned here. Similar difficulties resulting from the effects of selection are involved.

In the remainder of Part I, we will assume that all of the sampling pitfalls just described have been avoided. More specifically, we will assume that the sampling from each group is either altogether random, or else random except for possible selection on the basis of X°. This will ensure that E_s(Y | X°) is the same as E(Y | X°).
4. Speed tests and power tests.

A pure power test is one in which every examinee has sufficient time to attempt every item; a pure speed test is one in which all items are so easy that no examinee will answer any item incorrectly, but there is insufficient time for any examinee to complete all items (see, e.g., [14], pp. 230-231). In practice, of course, most tests lie somewhere between the two extremes of a pure speed test and a pure power test.

In general, the discussion in this report is aimed at tests which are essentially power tests rather than speed tests. However, some of the techniques for bias analysis which we present will be applicable for speed tests as well as power tests.
Here in Part I, we will consider techniques for bias analysis which are based on the regression of the criterion Y on the "test score" X° defined in various ways. We begin in Section 5 with a technique which considers the test score (X°) to be simply the total score X_ij; it would appear that the material of Section 5 would be just as applicable to speed tests as to power tests. Then Section 9 considers bias analysis based on linear regression of Y_ij on (2.1), the vector (2.1) now being considered as the "test score" (X°). The analysis of Section 9 would appear to be usable for speed tests as well as for power tests. However, if a test is partially a speed test, then the order in which the items appear in the test would (in general) need to be always the same, or else the results of the bias analysis would be vitiated. This is because, if there is an element of speed in the test, the regression of Y_ij on the item scores (2.1) will generally be altered if the order of the items is altered. Thus, if the test X is partially a speed test and if the items in the test are for some reason not appearing in the same order for all examinees, then the use of the tools of Section 9 would be questionable.
Sections 10 and 11 consider bias analyses based on the regression of Y_ij on vectors which contain even more elements than the m elements of (2.1) [the test score vector X° which is used in Section 11 is the vector (2.2)]. For a pure speed test, the more complicated models of Sections 10-11 would apparently still yield valid bias analyses, but these analyses would be the same thing that would result from using the simpler model of Section 9. The reason why there would be no difference is that, in a pure speed test, there are only (m+1) possible values which the test score vector can assume, regardless of how the test score is defined or how many elements are in the test score vector. These (m+1) possible values would correspond to the (m+1) possible numbers of items completed; thus, e.g., the only (m+1) possible values of the m×1 vector (2.1) in a pure speed test with all n_k = 5 would be
    (.2, .2, .2, ..., .2, .2)
    ( 1, .2, .2, ..., .2, .2)
    ( 1,  1, .2, ..., .2, .2)
    . . . . . . . . . . . . .
    ( 1,  1,  1, ...,  1, .2)
    ( 1,  1,  1, ...,  1,  1)

(we are assuming that each examinee works the items in order).
Similarly, e.g., there are only (m+1) possible values which the vector (2.2) could assume in a pure speed test. Thus it is not hard to see that the regression of Y_ij on (2.2) (e.g.) can be basically no different from the regression of Y_ij on (2.1), and therefore cannot effect any increase in the accuracy of prediction via the regression equation nor any change in the bias situation.

We conclude that, for a pure speed test, there is no point in using the relatively complicated models which are presented in Sections 10-11 at the end of Part I. We would also suspect that, for a test which is not a pure speed test but which is primarily a speed test, there would be little to be gained from the models of Sections 10-11 (in comparison with the model of Section 9). For a test which is primarily a power test, on the other hand, we will see later that the use of the more complicated models might possibly result in the effective reduction or elimination of bias (in addition to bringing about a more accurate prediction of the criterion variable).
5.
Detection of bias when one uses just the total score on the test X.
We start by considering the simplest possible type of bias analysis, an analysis
based on the regression of Y on the single variable X •
i
ij
r
17
That is, the "test
1score" yf is taken to be simply the single number Xij ' which is the total score
on the test X.
Here in Section 5, we work under the assumption that the regression of Y_ij
on X_ij is linear for each of the g groups, so that we have a regression
equation of the form

(5.1)   E(Y_ij | X_ij) = α_i + β_i X_ij

for the i-th group. According to the definition of bias given in Section 2
above, the test X (using X⁰ = X_ij as the test score) will be free of bias if
and only if

(5.2a)   β_1 = β_2 = ... = β_g

and

(5.2b)   α_1 = α_2 = ... = α_g .
We are interested in how to detect or determine whether or not bias
exists under the model (5.1). What we have to do is to test the hypotheses
(5.2). We will indicate possible approaches to the testing of these hypotheses.
Although this involves nothing more than standard results in least-squares
theory, we nevertheless will briefly indicate here the essential formulas.
First, let us define

  Ȳ_i = (1/N_i) Σ_{j=1}^{N_i} Y_ij ,   X̄_i = (1/N_i) Σ_{j=1}^{N_i} X_ij ,

  S_yyi = Σ_{j=1}^{N_i} (Y_ij − Ȳ_i)² ,   S_xxi = Σ_{j=1}^{N_i} (X_ij − X̄_i)² ,

  S_xyi = Σ_{j=1}^{N_i} (Y_ij − Ȳ_i)(X_ij − X̄_i) ,

  S_1 = Σ_{i=1}^{g} ( S_yyi − S_xyi²/S_xxi ) ,

  S_2 = Σ_{i=1}^{g} S_yyi − ( Σ_{i=1}^{g} S_xyi )² / Σ_{i=1}^{g} S_xxi ,

  N = Σ_{i=1}^{g} N_i ,   Ȳ = (1/N) Σ_{i=1}^{g} Σ_{j=1}^{N_i} Y_ij ,   X̄ = (1/N) Σ_{i=1}^{g} Σ_{j=1}^{N_i} X_ij ,

  S_yy = Σ_i Σ_j (Y_ij − Ȳ)² ,   S_xx = Σ_i Σ_j (X_ij − X̄)² ,

  S_xy = Σ_i Σ_j (Y_ij − Ȳ)(X_ij − X̄) ,

  S_3 = S_yy − S_xy²/S_xx .

Then let

  F_1 = [ (S_3 − S_1)/(2g − 2) ] / [ S_1/(N − 2g) ] ,

  F_2 = [ (S_2 − S_1)/(g − 1) ] / [ S_1/(N − 2g) ] ,
and

  F_3 = [ (S_3 − S_2)/(g − 1) ] / [ S_2/(N − g − 1) ] .

To test the hypotheses (5.2a) and (5.2b) simultaneously, we refer F_1
to the F-distribution with d.f. = 2g−2, N−2g. If F_1 is significant, we conclude
that bias exists. We may then wish to test (5.2a) by itself, in which
case we refer F_2 to the F-distribution with d.f. = g−1, N−2g.
Alternatively, we may wish to start out by using F_2 to test (5.2a). If
F_2 is significant, we know at once that bias exists, because we cannot even
use a common β-coefficient for the g groups.

If F_2 is not significant, then we might wish to assume that there is a
common β-coefficient. Then the expectation model would be

(5.3)   E(Y_ij | X_ij) = α_i + β X_ij

instead of (5.1). If we assume that (5.3) holds, then, in order to test the
hypothesis (5.2b), we refer F_3 to the F-distribution with d.f. = g−1, N−g−1.
If F_3 is significant, then we conclude that bias exists, because we cannot use
a common α-coefficient for the g groups.
If we start out by testing (5.2a) and if we find that F_2 is not significant,
then, if we wish, we could still go ahead and calculate F_1 to get a
simultaneous test of (5.2a) and (5.2b).
If the hypotheses (5.2) are true, then the expectation model (5.1) becomes

(5.4)   E(Y_ij | X_ij) = α + β X_ij .

Of course, there is no bias if (5.4) is the expectation of Y_ij given X_ij.
If (5.3) but not (5.4) holds, then there is constant bias which can be
expressed in terms of the differences between the α_i's. By "constant", we mean
that the amount of bias is the same regardless of the value of X_ij. If i and I
denote two of the g groups, we may, in general, define the amount of bias for
group i relative to group I to be equal to E(Y_ij | X⁰) for group I minus
E(Y_ij | X⁰) for group i. Thus the amount of bias will generally be dependent on
X⁰, but not under (5.3), where we have constant bias.
If (5.1) but not (5.3) holds, then the amount of bias varies with X_ij.
In fact, under (5.1) the amount of bias for one group relative to another might
even change sign over the relevant range of the X_ij values, so that the bias is
against one group for higher values of X_ij and against the other group for
lower values of X_ij.
We make the usual assumptions of homoscedasticity (i.e., uniform variance)
and normality of the error terms if we use the F-tests (F_1, F_2, and F_3, as
indicated above) to detect or determine whether bias exists. For the case
g = 2, however, other tests are available when these usual assumptions are not
fully satisfied (see [17], [16]).
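The F-statistics above are easy to compute directly from raw data. The sketch below (an illustration in Python with NumPy; the function name and data layout are not from the report) evaluates S_1, S_2, S_3 and F_1, F_2, F_3 for g groups:

```python
import numpy as np

def bias_f_statistics(groups):
    """F-statistics of Section 5 for testing bias in the regression of Y on X.

    `groups` is a list of (x, y) pairs of 1-D arrays, one pair per group.
    Returns (S1, S2, S3, F1, F2, F3). A sketch under model (5.1), with the
    usual homoscedasticity and normality assumptions.
    """
    g = len(groups)
    N = sum(len(x) for x, _ in groups)

    def css(u, v):
        # corrected sum of cross-products, e.g. S_xyi
        return float(np.sum((u - u.mean()) * (v - v.mean())))

    Syyi = [css(y, y) for _, y in groups]
    Sxxi = [css(x, x) for x, _ in groups]
    Sxyi = [css(x, y) for x, y in groups]

    # S1: residual SS with separate slope and intercept in every group
    S1 = sum(syy - sxy ** 2 / sxx for syy, sxx, sxy in zip(Syyi, Sxxi, Sxyi))
    # S2: residual SS with a common slope but separate intercepts
    S2 = sum(Syyi) - sum(Sxyi) ** 2 / sum(Sxxi)
    # S3: residual SS with one common line for all groups
    x_all = np.concatenate([x for x, _ in groups])
    y_all = np.concatenate([y for _, y in groups])
    S3 = css(y_all, y_all) - css(x_all, y_all) ** 2 / css(x_all, x_all)

    F1 = ((S3 - S1) / (2 * g - 2)) / (S1 / (N - 2 * g))  # (5.2a) and (5.2b) jointly
    F2 = ((S2 - S1) / (g - 1)) / (S1 / (N - 2 * g))      # (5.2a): equal slopes
    F3 = ((S3 - S2) / (g - 1)) / (S2 / (N - g - 1))      # (5.2b) given a common slope
    return S1, S2, S3, F1, F2, F3
```

Note that S_1 ≤ S_2 ≤ S_3 always holds, since the three residual sums of squares come from successively more restricted models.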
Here in Section 5 we have dealt only with the matter of determining
whether or not bias exists.
We have not considered the question of how to
correct for or rectify bias once we find that it does exist. This latter topic
will be discussed in Section 8. Before we turn to this topic, however, we will
first consider the detection of bias when there are multiple tests (Section 6)
and when there are multiple criterion variables (Section 7), still assuming
that we are using just total score(s) on the test(s) for X⁰. Upon suitable
generalization, the discussion of Sections 5-8 will be applicable in an obvious
manner to the more refined models of Sections 9-11, but we will not go into as
great detail in Sections 9-11 as we do here in Sections 5-8 because the general
principles for the more refined models are essentially the same as those which
we are describing in some detail here in Sections 5-8.
6. The situation of more than one test.
If there are multiple tests instead of just a single test X being used to predict
the criterion Y, then we can easily generalize the material of Section 5 to obtain
means of determining whether or not the battery of tests (considered as a whole)
is biased with respect to predicting the criterion Y. For convenience, we
consider the case where there are just two tests in the battery. Let these two
tests be X and W. As usual, we let X_ij be the total score on test X; we let
W_ij be the total score on test W. We assume that we are predicting Y on the
basis of these total scores alone, so that the "test score" X⁰ is (X_ij, W_ij).
We assume that E(Y_ij | X_ij, W_ij) is linear in X_ij and W_ij, and write as our model

(6.1)   E(Y_ij | X_ij, W_ij) = α_i + β_xi X_ij + β_wi W_ij ,

which is analogous to (5.1).
According to our definition of bias, the battery
(X_ij, W_ij) will be free of bias with respect to predicting Y if and only if

(6.2a)   β_x1 = β_x2 = ... = β_xg ,   β_w1 = β_w2 = ... = β_wg

and

(6.2b)   α_1 = α_2 = ... = α_g .
We can test these hypotheses (6.2) via techniques analogous to those
presented in Section 5.
Thus (e.g.) we could test (6.2a) and (6.2b) simultan-
eously, or we could first test (6.2a) and then [if (6.2a) is not rejected]
test (6.2b) assuming (6.2a) to be true.
The formulas for the appropriate
F-statistics are analogous to the corresponding formulas given in Section 5;
since they are, in fact, standard regression analysis formulas, we will not
write them out here.
If (6.2a) is true, then (6.1) becomes

(6.3)   E(Y_ij | X_ij, W_ij) = α_i + β_x X_ij + β_w W_ij .

If (6.2b) and (6.2a) are both true, then we have

(6.4)   E(Y_ij | X_ij, W_ij) = α + β_x X_ij + β_w W_ij .

Under (6.4), of course, there is no bias. When (6.3) but not (6.4) holds, there
is constant bias. When (6.1) but not (6.3) holds, we have bias which varies
with (X_ij, W_ij).
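Since the battery F-statistics are standard nested-model comparisons, they can be computed from the residual sums of squares of a full and a reduced least-squares fit. The sketch below is illustrative (the function name and design-matrix layout are assumptions, not the report's notation); testing (6.2a), e.g., compares model (6.1), with 3g columns, against model (6.3), with g + 2 columns, giving numerator d.f. 2g − 2:

```python
import numpy as np

def nested_f(x_full, x_reduced, y):
    """General nested-model F-statistic, usable for the battery tests of
    Section 6. x_full and x_reduced are design matrices (including any
    intercept columns); the reduced model's columns must span a subspace
    of the full model's. Returns (F, numerator d.f., denominator d.f.)."""
    def rss(xmat):
        # residual sum of squares of the least-squares fit of y on xmat
        resid = y - xmat @ np.linalg.lstsq(xmat, y, rcond=None)[0]
        return float(resid @ resid)

    rss_full, rss_red = rss(x_full), rss(x_reduced)
    df_num = x_full.shape[1] - x_reduced.shape[1]
    df_den = len(y) - x_full.shape[1]
    F = ((rss_red - rss_full) / df_num) / (rss_full / df_den)
    return F, df_num, df_den
```

For g = 2 the numerator degrees of freedom equal 2, the two-test analogue of Section 5's test of equal slopes.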
7. The situation of more than one criterion variable.
If more than one criterion variable is being predicted, then X⁰ might be biased
for predicting one criterion variable but not for another. Of course, one could
separately test to detect bias with respect to each criterion variable individually.
However, one could, alternatively, use a single statistical test proce-
dure which tests for bias with respect to all criterion variables simultaneously.
A procedure of this latter type involves an application of multivariate analysis
of variance.
Here in Section 7 we outline the appropriate multivariate tests for
bias with multiple criterion variables for the case where X⁰ is just the total
score X_ij on the single test X.
We suppose we have p criterion variables. Generalizing (5.1), we can
write our model as

(7.1)   E(Y_ij^(h) | X_ij) = α_i^(h) + β_i^(h) X_ij

for h = 1, 2, ..., p (and j = 1, 2, ..., N_i ; i = 1, 2, ..., g), where Y_ij^(h) is
the score of the (i,j)-th individual on the h-th criterion variable. The test
X (with X⁰ = X_ij) will be free of bias with respect to all p criterion variables
if and only if the regression parameters [the α_i^(h)'s and β_i^(h)'s] satisfy
(7.2a)   β_1^(h) = β_2^(h) = ... = β_g^(h)   (h = 1, 2, ..., p)

and

(7.2b)   α_1^(h) = α_2^(h) = ... = α_g^(h)   (h = 1, 2, ..., p) .
The hypotheses (7.2), which represent a generalization of (5.2), can be tested
via a MANOVA (multivariate analysis of variance) test.
If (7.2a) is true, then (7.1) becomes

(7.3)   E(Y_ij^(h) | X_ij) = α_i^(h) + β^(h) X_ij .

If (7.2a) and (7.2b) are both true, then (7.1) becomes

(7.4)   E(Y_ij^(h) | X_ij) = α^(h) + β^(h) X_ij ,

and there is no bias. Note how (7.3) is a generalization of (5.3), and (7.4)
is a generalization of (5.4).
The theory of multivariate analysis of variance tests is discussed in
detail (e.g.) by Roy [19, Chapter 12]. A succinct summary of the MANOVA testing
procedure is presented (e.g.) in passing in [18]. (In [18], see especially
equations (1), (2), (5a - 5b), and (7a - 7c), along with the accompanying
material, as well as the material in the middle of p. 322.) We will make no
attempt here to give a general explanation of the theory of MANOVA testing;
rather we will simply identify the matrices which are to be plugged into the
appropriate formulas as presented in [18].
Note that (7.1) can be rewritten in matrix form as

(7.5a)
$$
E\begin{bmatrix}
Y_{11}^{(1)} & Y_{11}^{(2)} & \cdots & Y_{11}^{(p)} \\
Y_{12}^{(1)} & Y_{12}^{(2)} & \cdots & Y_{12}^{(p)} \\
\vdots & \vdots & & \vdots \\
Y_{1N_1}^{(1)} & Y_{1N_1}^{(2)} & \cdots & Y_{1N_1}^{(p)} \\
Y_{21}^{(1)} & Y_{21}^{(2)} & \cdots & Y_{21}^{(p)} \\
\vdots & \vdots & & \vdots \\
Y_{gN_g}^{(1)} & Y_{gN_g}^{(2)} & \cdots & Y_{gN_g}^{(p)}
\end{bmatrix}
=
\begin{bmatrix}
1 & X_{11} & 0 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & & & & & \\
1 & X_{1N_1} & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & 1 & X_{21} & \cdots & 0 & 0 \\
\vdots & & & & & \vdots & \\
0 & 0 & 0 & 0 & \cdots & 1 & X_{gN_g}
\end{bmatrix}
\begin{bmatrix}
\alpha_1^{(1)} & \alpha_1^{(2)} & \cdots & \alpha_1^{(p)} \\
\beta_1^{(1)} & \beta_1^{(2)} & \cdots & \beta_1^{(p)} \\
\vdots & \vdots & & \vdots \\
\alpha_g^{(1)} & \alpha_g^{(2)} & \cdots & \alpha_g^{(p)} \\
\beta_g^{(1)} & \beta_g^{(2)} & \cdots & \beta_g^{(p)}
\end{bmatrix}
$$

More briefly, (7.5a) can be written as

(7.5b)   E(Y) = A ξ ,   with Y of order N×p, A of order N×2g, and ξ of order 2g×p,

where the definition of the three matrices Y, A, and ξ is obvious upon comparing (7.5a) with (7.5b).
The hypothesis (7.2a) can be written in matrix form as

(7.6)   C_β ξ I = 0 ,

where C_β is (g−1)×2g, ξ is 2g×p, I is the p×p identity matrix, 0 is the
(g−1)×p null matrix, and

(7.7)
$$
C_\beta =
\begin{bmatrix}
0 & 1 & 0 & -1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & -1 & \cdots & 0 & 0 \\
\vdots & & & & & & & & \vdots \\
0 & 1 & 0 & 0 & 0 & 0 & \cdots & 0 & -1
\end{bmatrix} .
$$

The hypotheses (7.2a) and (7.2b) taken together can be written in matrix form
as

(7.8)   C_αβ ξ I = 0 ,

where C_αβ is (2g−2)×2g and the null matrix here is (2g−2)×p. With C_β
[(g−1)×2g] given by (7.7) and

(7.9)
$$
C_\alpha =
\begin{bmatrix}
1 & 0 & -1 & 0 & 0 & 0 & \cdots & 0 & 0 \\
1 & 0 & 0 & 0 & -1 & 0 & \cdots & 0 & 0 \\
\vdots & & & & & & & & \vdots \\
1 & 0 & 0 & 0 & 0 & 0 & \cdots & -1 & 0
\end{bmatrix} ,
$$

we may define

(7.10)   C_αβ = [ C_α over C_β ] ,   i.e., C_α stacked upon C_β .
The model (7.3) can be written in matrix form as

(7.11)   E(Y) = A_β ξ_β ,   with A_β of order N×(g+1) and ξ_β of order (g+1)×p,

where Y is as before,

(7.12)
$$
A_\beta =
\begin{bmatrix}
1 & 0 & \cdots & 0 & X_{11} \\
\vdots & & & & \vdots \\
1 & 0 & \cdots & 0 & X_{1N_1} \\
0 & 1 & \cdots & 0 & X_{21} \\
\vdots & & & & \vdots \\
0 & 0 & \cdots & 1 & X_{gN_g}
\end{bmatrix} ,
$$

and

(7.13)
$$
\xi_\beta =
\begin{bmatrix}
\alpha_1^{(1)} & \alpha_1^{(2)} & \cdots & \alpha_1^{(p)} \\
\alpha_2^{(1)} & \alpha_2^{(2)} & \cdots & \alpha_2^{(p)} \\
\vdots & \vdots & & \vdots \\
\alpha_g^{(1)} & \alpha_g^{(2)} & \cdots & \alpha_g^{(p)} \\
\beta^{(1)} & \beta^{(2)} & \cdots & \beta^{(p)}
\end{bmatrix} .
$$
The hypothesis (7.2b), expressed in terms of ξ_β, can be written in matrix form
as

(7.14)   C_α·β ξ_β I = 0 ,

where C_α·β is (g−1)×(g+1), ξ_β is (g+1)×p, I is p×p, 0 is (g−1)×p, and

(7.15)
$$
C_{\alpha \cdot \beta} =
\begin{bmatrix}
1 & -1 & 0 & \cdots & 0 & 0 \\
1 & 0 & -1 & \cdots & 0 & 0 \\
\vdots & & & & & \vdots \\
1 & 0 & 0 & \cdots & -1 & 0
\end{bmatrix} .
$$
(i) Suppose we wish to test the hypotheses (7.2a) and (7.2b) simultaneously
under the model (7.1), i.e., we wish to test (7.8) under the model
(7.5). Then we identify (7.5) and (7.8) respectively with equations (1) and
(2) of [18]. We calculate S_h and S_e respectively from equations (5a) and (5b)
of [18], using Y for X, I for V, A [see (7.5a) and (7.5b)] for A_1, and C_αβ
(7.10) for C_1 (X, V, A_1, and C_1 refer to matrices which appear in equations
(5a - 5b) of [18]). As explained in [19] (and also in [18]), MANOVA testing is
based on the characteristic roots of the matrix S_h S_e⁻¹; further details may
be obtained from these references.
(ii) Suppose we wish to test (7.2a) alone under the model (7.1), i.e.,
we wish to test (7.6) under the model (7.5). Then we identify (7.5) and (7.6)
respectively with equations (1) and (2) of [18]. Thus we see that, in formulas
(5a) and (5b) of [18], we use Y for X, I for V, A [see (7.5a) and (7.5b)] for
A_1, and C_β (7.7) for C_1.
(iii) If we test the hypothesis (7.2a) under the model (7.1) and do not
reject it, then we may wish to assume that (7.3) holds and test the hypothesis
(7.2b) under the model (7.3), i.e., test the hypothesis (7.14) under the model
(7.11). Hence we identify (7.11) and (7.14) respectively with equations (1)
and (2) of [18]. Thus, in formulas (5a) and (5b) of [18], we use Y for X, I
for V, A_β (7.12) for A_1, and C_α·β (7.15) for C_1.
For a MANOVA test based on the maximum characteristic root of S_h S_e⁻¹,
we need the three parameters given by equations (7) of [18]. For the tests
indicated in (i), (ii), and (iii) above, these parameters are respectively

(7.16)  s* = min(2g−2, p),  m* = ½(|2g−2−p|−1),  n* = ½(N−2g−p−1)  for (i);

(7.17)  s* = min(g−1, p),  m* = ½(|g−p−1|−1),  n* = ½(N−2g−p−1)  for (ii);

and

(7.18)  s* = min(g−1, p),  m* = ½(|g−p−1|−1),  n* = ½(N−g−p−2)  for (iii).

For g = 2 groups (but not for g > 2), the MANOVA tests for (ii) and (iii)
simply reduce to ordinary F-tests, with d.f. = p, N−p−3 for (ii) and d.f. =
p, N−p−2 for (iii). We will write out in full the F-statistics for these
special cases.
Generalizing the notation introduced above in Section 5, let us define

  Ȳ_i^(h) = (1/N_i) Σ_{j=1}^{N_i} Y_ij^(h) ,

  S_xyi^(h) = Σ_{j=1}^{N_i} (Y_ij^(h) − Ȳ_i^(h))(X_ij − X̄_i) ,

  S_yyi^(h,H) = Σ_{j=1}^{N_i} (Y_ij^(h) − Ȳ_i^(h))(Y_ij^(H) − Ȳ_i^(H)) ,

  S_1^(h,H) = Σ_{i=1}^{g} ( S_yyi^(h,H) − S_xyi^(h) S_xyi^(H) / S_xxi ) ,

and

  S_2^(h,H) = Σ_{i=1}^{g} S_yyi^(h,H) − ( Σ_{i=1}^{g} S_xyi^(h) )( Σ_{i=1}^{g} S_xyi^(H) ) / Σ_{i=1}^{g} S_xxi ,

for h, H = 1, 2, ..., p. We now define S_1 (p×p) and S_2 (p×p) to be matrices
containing the S_1^(h,H)'s and the S_2^(h,H)'s respectively. Let us write

  d_β^(h) = S_xy1^(h)/S_xx1 − S_xy2^(h)/S_xx2

and

  d_α^(h) = Ȳ_1^(h) − Ȳ_2^(h) − [ (S_xy1^(h) + S_xy2^(h)) / (S_xx1 + S_xx2) ] (X̄_1 − X̄_2)

for h = 1, 2, ..., p. We define d_β (p×1) and d_α (p×1) to be vectors containing
the d_β^(h)'s and the d_α^(h)'s respectively. When g = 2, we test the
hypothesis (7.6) under the model (7.5) via the statistic

(7.19)  F_{p, N−p−3} = [(N−p−3)/p] (1/S_xx1 + 1/S_xx2)⁻¹ d_β′ S_1⁻¹ d_β ,

and we test the hypothesis (7.14) under the model (7.11) via the statistic

(7.20)  F_{p, N−p−2} = [(N−p−2)/p] [1/N_1 + 1/N_2 + (X̄_1 − X̄_2)²/(S_xx1 + S_xx2)]⁻¹ d_α′ S_2⁻¹ d_α .

That is, when there are just g = 2 groups, the MANOVA tests for (ii) and (iii)
are simply F-tests based on the F-statistics (7.19) and (7.20) respectively.
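For g = 2, the statistic (7.19) can be computed directly from the group sums of squares and cross-products. The sketch below implements a Hotelling-type form of the statistic (the function name, data layout, and exact algebraic form are this sketch's own reconstruction, not code from the report); for p = 1 it reduces exactly to the univariate F_2 of Section 5, with d.f. 1, N − 4:

```python
import numpy as np

def slope_diff_f(group1, group2):
    """Two-group (g = 2) multivariate test of equal slopes, in the style of
    (7.19). Each group is (x, Y) with x an (n,) array and Y an (n, p) array
    of the p criterion scores. Returns (F, df1, df2); F is referred to the
    F-distribution with d.f. p, N - p - 3, as stated for case (ii)."""
    p = group1[1].shape[1]
    N = len(group1[0]) + len(group2[0])
    S1 = np.zeros((p, p))      # pooled error matrix, d.f. N - 4
    d_beta = np.zeros(p)       # difference of the two slope vectors
    c = 0.0                    # 1/S_xx1 + 1/S_xx2
    sign = 1.0
    for x, Y in (group1, group2):
        xc = x - x.mean()
        Yc = Y - Y.mean(axis=0)
        sxx = float(xc @ xc)
        sxy = Yc.T @ xc                                  # S_xyi^(h) vector
        S1 += Yc.T @ Yc - np.outer(sxy, sxy) / sxx
        d_beta += sign * sxy / sxx
        c += 1.0 / sxx
        sign = -1.0
    F = ((N - p - 3) / p) * (d_beta @ np.linalg.solve(S1, d_beta)) / c
    return F, p, N - p - 3
```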
8. Correcting for bias if it is found to exist, when just the total
score on the test X is being used.
Having completed our digressions in Sections 6 and 7, we now return to pick up
our discussion where we left it at the end of Section 5. In Section 5, we
indicated how to detect bias in a single test X (taking X⁰ to be simply the
total score X_ij) with respect to a single criterion variable Y. The question
of what to do in order to rectify bias if and when it is detected, however,
was deferred to the present section.
If bias is found to exist, then one obvious way of correcting for this
bias is to use different regression equations (prediction equations) for the
different groups. If the hypothesis (5.2a) has been rejected, then we cannot
even assume a common β-coefficient for the g groups, and a set of g regression
equations of the form (5.1) would have to be used for prediction in order to
allow for different values of the α_i's and β_i's. The regression coefficients
(the α_i's and β_i's) would be estimated by the standard least-squares formulas;
thus the estimators of β_i and α_i would be respectively

(8.1)  β̂_i = S_xyi / S_xxi

and

(8.2)  α̂_i = Ȳ_i − β̂_i X̄_i ,

assuming that we want each group to have different estimates of α_i and β_i
(i.e., assuming that we don't want to make any kind of consolidation of two or
more groups).
Suppose that (5.2a) is not rejected and that we are willing to assume
that (5.2a) is true, but suppose that our statistical testing indicates that
(5.2b) is not true.
Then, in order to correct for the constant bias which
evidently exists, we could use a set of g regression equations of the form
(5.3) in order to predict the criterion for individuals in the g groups. The
estimator of the common β-coefficient is of course

(8.3)  β̂ = Σ_{i=1}^{g} S_xyi / Σ_{i=1}^{g} S_xxi ,

while

(8.4)  α̂_i = Ȳ_i − β̂ X̄_i

estimates the coefficient α_i. We noted earlier that the amounts of (constant)
bias are determined by the α_i's.
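The estimators (8.1)-(8.4) can be computed as follows (an illustrative sketch; the function name and data layout are assumptions, not the report's):

```python
import numpy as np

def fit_group_lines(groups, common_slope=False):
    """Estimate (alpha_i, beta_i) for each group by least squares.

    With common_slope=False this applies (8.1) and (8.2); with
    common_slope=True it applies (8.3) and (8.4), so that every group
    shares one slope. `groups` is a list of (x, y) arrays."""
    Sxy = [float(np.sum((x - x.mean()) * (y - y.mean()))) for x, y in groups]
    Sxx = [float(np.sum((x - x.mean()) ** 2)) for x, _ in groups]
    if common_slope:
        beta = sum(Sxy) / sum(Sxx)                                      # (8.3)
        return [(y.mean() - beta * x.mean(), beta) for x, y in groups]  # (8.4)
    return [(y.mean() - (sxy / sxx) * x.mean(), sxy / sxx)              # (8.2), (8.1)
            for (x, y), sxy, sxx in zip(groups, Sxy, Sxx)]
```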
Thus we see that the proper use of different regression equations for
the different groups will automatically compensate for bias, and will thereby
eliminate the bias (to the extent that the regression coefficients are
accurately estimated). One interpretation would be as follows. Consider, e.g.,
the set of regression equations (5.3). If we consider X⁰ to be

(8.5)  X*_ij = α_i + β X_ij

rather than to be simply X_ij itself, then

(8.6)  E(Y_ij | X*_ij) = X*_ij .

Since the conditional expectation (8.6) is the same for all groups, the
definition of Section 2 tells us that X*_ij (8.5) is not biased for predicting Y.
Although the technique of using different regression equations for
different groups will obviously correct for bias if bias exists, the technique
is open to possible criticism on the grounds that it is discriminatory, or
at least appears to be discriminatory. The use of the set of regression
equations (5.3), for example, appears to give an arbitrary advantage to
individuals in the groups with the higher α_i's. This notion can be thrown into
sharper focus if we interpret the situation as follows. Suppose for the moment
that we have just g = 2 groups. Let us call the first group (i = 1) "in" and
the second group (i = 2) "out". Let us define a variable W_ij which is equal
to 0 if the (i,j) individual is an "in" and equal to 1 if he is an "out", i.e.,

(8.7)  W_ij = 0 if i = 1 ,   W_ij = 1 if i = 2 .

Then the set of regression equations (5.3) can be written alternatively in the
form

(8.8)  E(Y_ij | X_ij, W_ij) = α + β X_ij + γ W_ij ,

where we make the identifications

(8.9)  α = α_1 ,   γ = α_2 − α_1 .

From (8.8) it is clear that, if we consider X⁰ to be the pair (X_ij, W_ij),
then this X⁰ is free of bias for predicting Y. Thus (8.7)-(8.9) provides an
alternative interpretation to that given by (8.5)-(8.6).
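The equivalence between the set of equations (5.3) and a single equation with an indicator variable of the form (8.7) can be checked numerically: an ordinary least-squares fit of Y on X_ij and W_ij recovers the common slope together with the two intercepts. The coefficient names below follow the identifications (8.9) as reconstructed here; the code is an illustrative sketch:

```python
import numpy as np

def dummy_variable_fit(x, y, group):
    """Fit E(Y) = alpha + beta*X + gamma*W, with W = 0 for the "in" group
    (i = 1) and W = 1 for the "out" group (i = 2), per (8.7). `group` holds
    1s and 2s; under the identifications (8.9), gamma estimates the
    intercept difference alpha_2 - alpha_1."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    W = (np.asarray(group) == 2).astype(float)
    design = np.column_stack([np.ones_like(x), x, W])
    alpha, beta, gamma = np.linalg.lstsq(design, y, rcond=None)[0]
    return alpha, beta, gamma
```

Because the dummy-variable model spans exactly the same space of mean functions as (5.3), the fitted predictions agree with those of the common-slope, separate-intercept fit.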
With X⁰ taken to be (X_ij, W_ij), we are, in effect, basing our prediction
of Y on two "tests", the genuine test X and the "test" indicated by W_ij
(8.7). But although mathematically W_ij has a status identical to that of X_ij
in the prediction equation (8.8), from the psychologist's point of view the
total test score X_ij has a vastly different meaning from the variable W_ij. In
many cases (when W_ij represents race, e.g.) it could be argued that, for a priori
reasons, W_ij couldn't possibly have any direct relevance to the prediction of
Y, so how can we justify putting it in the prediction equation at all?
Thus, although on a priori grounds there might appear to be no justification
whatever for including W_ij in the prediction equation, we are nevertheless
faced with the fact that, with X⁰ equal to X_ij alone, we are confronted
with bias, whereas with X⁰ equal to (X_ij, W_ij), we are free of bias. The
explanation of this anomaly, of course, lies in the fact that, although W_ij in
and of itself can be of no relevance in predicting Y, there are other variable(s)
closely correlated with W_ij which are legitimately relevant to the prediction
of Y, but which are not as easily isolated, identified, and measured as W_ij,
and which therefore are not easily included in the prediction equation.
One answer to the dilemma would be to try to isolate and identify
these latter variables in order that they can be brought into the prediction
equation. If this can be successfully done, then W_ij will no longer contribute
anything to improving the prediction (i.e., there will no longer be any bias
even with W_ij left out of the prediction equation), so that W_ij can be omitted
without any loss. Different techniques for bringing in the relevant variables
are discussed in Sections 9-11. All these techniques, however, are based on
the use of the X_ijk's and the X_ijkℓ's, but the use of these variables (the
X_ijk's or X_ijkℓ's) may sometimes be sufficient to give us a bias-free X⁰
without having to include W_ij in X⁰. Of course, the possibility of using other
tests besides X, or even certain relevant types of variables besides test
results, could be investigated as a means of obtaining a bias-free X⁰ which
does not include W_ij (8.7). If we have to include W_ij in X⁰ (and if W_ij is
clearly not relevant to the prediction of Y), then this simply amounts to an
admission that we have failed to find the real culprits which are causing the
bias. There will undoubtedly be cases where we will not be able to have a
bias-free test unless we include W_ij in X⁰. In such cases, it appears that we
have no choice except to base our prediction in part on W_ij.
Of course, if W_ij is directly relevant to the prediction of Y, then
there would be no logical reason for excluding it from the prediction equation.
For example, if the "out" group comprises individuals whose native language is
not English while the "in" group consists of persons whose native language is
English, then it could be argued that this language variable is relevant to
predicting the criterion (which might be college grades, say) and therefore
properly belongs in the regression equation. In such cases we would have no
qualms about using W_ij.
On the other hand, if the two groups represent (e.g.) two different
races, then it is hard to see how skin pigmentation could be considered to be
relevant to the prediction of college grades or job performance. If amount of
skin pigmentation is included as a variable in a regression equation which is
used to predict a criterion and to serve as an instrument of selection, then,
even though skin pigmentation has been included for the commendable reason of
eliminating bias, critics would still be likely to claim that such a selection
system would create discrimination rather than eliminate it. Their criticism
would not be easy to answer, since skin pigmentation in and of itself could be
considered on a priori grounds to be obviously irrelevant to the prediction of
the criterion. In some cases, it is conceivable that legal problems might
result if a selection system uses a prediction equation which includes amount of
skin pigmentation as one of the predictors. In any case, a moral and ethical
question would arise as to the propriety of using skin pigmentation as a
predictor. The other side of the coin, of course, is that if skin pigmentation
is not included as a predictor, then we would not have eliminated our bias and
similar legal and ethical questions would arise. The only way out of this
dilemma seems to be to try to discover a set of variables X⁰ which does not
include skin pigmentation but which is still free of bias for predicting Y.
Apparently it is assumed in many quarters that a psychological test,
if it is biased at all, will be biased against "disadvantaged" groups and in
favor of "advantaged" groups. If this is the case, then in (8.5) the α_i's for
the disadvantaged groups would be higher than the α_i's for the advantaged
groups for an affected test X. This means that, if X*_ij (8.5) rather than X_ij
were used as a selection criterion, then the advantaged groups would be the
ones that might claim they were discriminated against.
We should point out, however, that such alleged "discrimination" could
operate in the opposite direction also. For one thing, it is entirely
conceivable that some tests could be biased against the advantaged groups
(according to our definition of Section 2), in which case the disadvantaged
groups would be the ones that might be alleging discrimination if (8.5) were
used. However, it might perhaps not be too common for a test to be biased
against advantaged groups, even though this possibility should certainly not be
ruled out altogether as some writers appear to be prone to do. In order to
bring the ethical questions related to the use of something like X*_ij (8.5)
into fuller perspective, we describe another kind of situation in which the use
of (8.5) could cause disadvantaged groups to claim that they were being
discriminated against.
Suppose that an auto insurance company uses some sort of regression
equation to predict the number of future accidents (or claims to be paid) of its
customers, and suppose the company charges each customer a premium which is
proportional to this prediction.
Suppose that race (along with other variables
which might or might not include a score on a psychological test) appears in
the prediction equation.
It would not be surprising if the regression analysis
yielded a higher regression coefficient (α_i) for Negroes than for whites, so
that a higher accident prediction for a Negro would be made on the basis of his
race alone and he would therefore be charged a higher premium.
Such a predic-
tion system would, of course, be "free of bias" according to our definition of
Section 2, but would it be an ethically proper system to use?
Questions such as this one pose a serious dilemma, regardless of what
the group is that feels it is discriminated against. Because the use of an X⁰
like (8.5) may sometimes lead to such a dilemma, it would appear to be
worthwhile to give detailed consideration to other techniques of obtaining a
bias-free X⁰. Sections 9-11 are devoted to several such techniques which
utilize information from the results of the test X only. Although our sole
motivation in presenting these techniques in Sections 9-11 is to provide means
for obtaining a bias-free X⁰ which doesn't have the drawback of (8.5), these
techniques may produce a second desirable effect in addition to the hoped-for
effect of resolving the bias problem. This second effect, which is purely
incidental for our purposes but which may in some cases even overshadow the
first effect, is that the techniques of Sections 9-11 may improve the over-all
accuracy of the prediction.

In Sections 9-11, we will not go into the same detail in our analysis
as we did here in Sections 5-8 when we considered the case where X⁰ is based
on total score(s) only. However, most of the basic principles which have been
brought out in Sections 5-8 are applicable with appropriate modification to
the material of Sections 9-11.
Before concluding this section, we should emphasize that, if the α_i's
of (8.5) are higher for the disadvantaged groups than for the advantaged groups
(as will be the case if the test is biased against the disadvantaged groups),
then these higher α_i's for the disadvantaged groups cannot in any sense be
interpreted as points which are arbitrarily added to the scores of the members
of the disadvantaged groups. That is, the higher α_i's are not "extra points"
which are added without regard to performance on the criterion, for the purpose
of compensating the members of the disadvantaged groups for the lack of
opportunity which they may previously have suffered. It might be argued in some
quarters that such "extra points" should somehow be added to scores of members
of disadvantaged groups, in order to make up for past handicaps and lack of
opportunity which members of these groups may have suffered. It is, of course,
outside the scope of this report to evaluate such arguments here. However,
we should make it clear that such "extra points", if they were used, would be
over and above any addition to the scores for members of a disadvantaged group
which might accrue as a result of a higher α_i in (8.5) for the group. The
higher α_i merely compensates for bias in the test, and puts the prediction of
the criterion on an equalized basis for all groups.
9. Prediction and bias analysis when one uses a regression equation
which is based on the m item scores, and which is linear in these item scores.

We now consider the possibility of basing our prediction of Y not on the
single number X_ij alone, but rather on the set of m item scores (X_ij1, X_ij2, ...,
X_ijm). In other words, we will take X⁰ to be this set of m X_ijk's rather than
to be the total test score X_ij. (For definition of X_ijk, see Section 1.)
For purposes of this section, we assume that the regression of Y_ij on
(X_ij1, X_ij2, ..., X_ijm) is linear in the X_ijk's for each of the g groups (but
in Section 10 we consider a more complicated model which relaxes this
assumption). Thus we have a regression equation of the form

(9.1)  E(Y_ij | X_ij1, ..., X_ijm) = α_i + Σ_{k=1}^{m} β_ik X_ijk

for the i-th group, where X⁰ = (X_ij1, X_ij2, ..., X_ijm). We may make
the interpretation that β_ik is the score for marking the k-th item right.
Now X⁰ will be free of bias if and only if

(9.2a)  β_1k = β_2k = ... = β_gk   (k = 1, 2, ..., m)

and

(9.2b)  α_1 = α_2 = ... = α_g .

If (9.2a) holds, then (9.1) reduces to the form

(9.3)  E(Y_ij | X_ij1, ..., X_ijm) = α_i + Σ_{k=1}^{m} β_k X_ijk .

If (9.2a) and (9.2b) both hold, then

(9.4)  E(Y_ij | X_ij1, ..., X_ijm) = α + Σ_{k=1}^{m} β_k X_ijk .

Note the similarity between (5.1 - 5.4) and (9.1 - 9.4) respectively. We
test the hypotheses (9.2) via methods similar to those indicated in Section 5;
we will not present the detailed formulas here, but these formulas are
analogous to those of Section 5 and are of course nothing but standard
regression analysis formulas. If the hypotheses (9.2) are true so that (9.4)
holds, then of course X⁰ is free of bias.

Our purpose in considering the model with X⁰ = (X_ij1, X_ij2, ..., X_ijm)
lies in the fact that this X⁰ might be free of bias for predicting Y even
though X⁰ = X_ij is biased. A similar purpose will underlie our consideration
of the models of Sections 10-11. (As previously pointed out, however, incidental
further benefits from these more refined models may accrue in the form of
increased accuracy of the prediction.)
We present a numerical example in order to illustrate clearly how a
test can be biased if X⁰ is taken to be X_ij but bias-free if X⁰ is taken to be
(X_ij1, X_ij2, X_ij3, X_ij4). Our example will be overly simple, in order that we
can illustrate the point easily. Suppose the test X consists of m = 4 items.
Suppose there are g = 2 groups, the "ins" (i = 1) and the "outs" (i = 2).
Suppose that the members of these two groups obtain item scores (on the test X)
and criterion (Y) scores as indicated in Table 9.1. Then, if prediction of
the criterion is to be based on the total score X_ij alone, we find easily from
Table 9.1 that if we make an analysis for bias we obtain

(9.5)  E(Y_ij | X_ij) = 5.5 + .5 X_ij

for the "in" group, and
TABLE 9.1

Example showing that a test can be biased if prediction is based on total score
(X_ij) alone, but bias-free if prediction is based on the item scores (X_ijk's).
(Figures in this example and succeeding examples are fictitious.)

[For each group, the table lists the item-score patterns (X_ij1, X_ij2, X_ij3,
X_ij4) on the 4 items in the test, the percentage of the group obtaining each
pattern, and the average Y score for each pattern, with subtotals by total
score X_ij; the over-all average Y score is 6.78 for the "in" group and 6.73
for the "out" group.]

Respective item difficulties are .61, .56, .70, .69 for the "in" group,
and .59, .55, .41, .41 for the "out" group. Average X_ij score is 2.56 for
the "in" group and 1.96 for the "out" group.
(9.6)  E(Y_ij | X_ij) = 5.75 + .5 X_ij

for the "out" group. This means that the test X (using X_ij for X⁰) is biased
against the "out" group; the bias is constant, and the amount of this constant
bias is .25.
Now note that, if we consider the regression of Y on the four
item scores, it is clear from Table 9.1 that

(9.7)  E(Y_ij | X_ij1, ..., X_ij4) = 5 + 2 X_ij1 + X_ij2

for the "in" group, and

(9.8)  E(Y_ij | X_ij1, ..., X_ij4) = 5 + 2 X_ij1 + X_ij2

for the "out" group. Thus the test X is bias-free if the item scores are
used for X⁰.
[It may help to point out the identification of (9.5-9.6) with (5.1, 5.3), and of (9.7-9.8) with (9.1, 9.3-9.4). Identifying (9.5-9.6) with (5.1, 5.3), we have β1 = β2 = β = .5, α1 = 5.5, and α2 = 5.75. Identifying (9.7-9.8) with (9.1, 9.3-9.4), we have β11 = β21 = β1 = 2, β12 = β22 = β2 = 1, β13 = β23 = β3 = β14 = β24 = β4 = 0, and α1 = α2 = 5.]
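The two sets of regression equations just identified can be checked numerically. The following sketch is an illustration added here, not a computation from the original report; it assumes numpy is available and takes the cells of Table 9.1 as rows, with the cell percentages as weights, and fits each group's regression by weighted least squares:

```python
# Weighted least-squares check of (9.5)-(9.8) on the cells of Table 9.1.
# Each row: four item scores, "in" %, "out" %, average criterion score Y
# (the average Y is the same for both groups in every cell of the table).
import numpy as np

rows = [
    (1,0,0,0,  2, 10, 7), (0,1,0,0,  2, 10, 6), (0,0,1,0,  1,  2, 5), (0,0,0,1,  1,  2, 5),
    (1,1,0,0,  6, 14, 8), (1,0,1,0,  6, 10, 7), (1,0,0,1,  5, 10, 7),
    (0,1,1,0,  4,  8, 6), (0,1,0,1,  4,  8, 6), (0,0,1,1,  7,  6, 5),
    (1,0,1,1, 22,  5, 7), (0,1,1,1, 20,  5, 6), (1,1,1,0, 10,  5, 8), (1,1,0,1, 10,  5, 8),
]
X = np.array([r[:4] for r in rows], float)
w_in  = np.array([r[4] for r in rows], float) / 100.0
w_out = np.array([r[5] for r in rows], float) / 100.0
y = np.array([r[6] for r in rows], float)

def wls(design, y, w):
    # Weighted least squares: scale each row by the square root of its weight.
    s = np.sqrt(w)[:, None]
    return np.linalg.lstsq(design * s, y * np.sqrt(w), rcond=None)[0]

ones = np.ones((len(rows), 1))
total = X.sum(axis=1, keepdims=True)            # the total score Xij

# Regression of Y on the total score: the intercepts differ by .25, the bias.
print(wls(np.hstack([ones, total]), y, w_in))   # approx [5.5, 0.5], as in (9.5)
print(wls(np.hstack([ones, total]), y, w_out))  # approx [5.75, 0.5], as in (9.6)

# Regression of Y on the four item scores: identical for both groups, (9.7)-(9.8).
print(wls(np.hstack([ones, X]), y, w_in))       # approx [5, 2, 1, 0, 0]
print(wls(np.hstack([ones, X]), y, w_out))      # approx [5, 2, 1, 0, 0]
```

The .25 difference in intercepts under the total-score regression is exactly the constant bias discussed in the text, while the item-score regression shows no group difference at all.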
From (9.7-9.8) and from Table 9.1 it is clear that items 3 and 4 are
(in the presence of items 1 and 2) of no value for predicting the criterion.
However, the "in" group scores much better than the "out" group on items 3
and 4 (and only slightly better on items 1 and 2).
It might be, e. g., that
items 3 and 4 pertain to matters which arise frequently in the culture of the
"in" group but less frequently in the culture of the "out" group, whereas
items 1 and 2 pertain to matters which arise about equally frequently in the
cultures of both groups.
But success on the criterion itself evidently is
basically related only to the characteristics which are measured by items 1
and 2.
The bias which exists when Xij alone is used for prediction results essentially from the fact that the irrelevant items (items 3 and 4) cause the "in" group to have a much higher average Xij score than the "out" group, but this superiority in Xij score does not carry along with it a corresponding superiority in performance on the criterion (thanks to the fact that the superiority with respect to Xij arose mainly from superiority on irrelevant items). Thus most members of the "out" group would be justified in complaining that a prediction and selection based on Xij would discriminate against them. Equally discriminated against, however, would be a minority of members of the "in" group.
Thus, to take an extreme case, individuals who score (1, 0, 0, 0) would be heavily discriminated against, and this would be just as true if these individuals are members of the "in" group as it would if they belong to the "out" group. However, a greater percentage of the "out" group would be affected by this discrimination, because there are relatively more "outs" than "ins" who score (1, 0, 0, 0).
In any case, in the example of Table 9.1, the discrimination is not uniformly against the "outs" as such, for there is a minority of "outs" who are discriminated in favor of as well as a majority of "outs" who are discriminated against. The discrimination is against all individuals, "out" or "in", who score poorly on the irrelevant items (3 and 4), and in favor of all individuals, "out" or "in", who score well on these items. For any given Xij score, there just happens to be a greater percentage of "outs" than "ins" who score poorly on the irrelevant items, and so it turns out that, on the average, there is bias against the "outs" and in favor of the "ins".
If, in our example, selection is based on X°ij (8.5) rather than on Xij, then such selection will of course be bias-free according to the definition of Section 2. However, even though it is "bias-free", such selection would still be open to charges of discrimination; these charges might come from the "in" group in particular. For our example of Table 9.1, X°ij (8.5) is

(9.9)    X°1j = 5.5 + .5 X1j

for the "ins" and

(9.10)    X°2j = 5.75 + .5 X2j

for the "outs".
Thus, any "in" individual with any given set of item scores (Xij1, Xij2, Xij3, Xij4) will have an X°ij score which is .25 lower than the X°ij score of an "out" individual who has the very same set of item scores, in spite of the fact that the two individuals have exactly the same expected criterion scores (based on their item scores). Hence, in this particular example, it would be difficult to convince the "in" group that the "bias-free" test is really non-discriminatory. We are faced with the anomaly that, if we use X°ij (9.9-9.10) for selection, then such selection is bias-free on the average (so to speak), but at the same time it is biased against the "in" group for every single possible set of values of the item scores if the subgroups associated with these 14 sets of values are each considered individually. Thus X°ij has this drawback, whereas (as we have already seen) Xij has a different drawback which also makes it vulnerable to charges of discrimination.
Incidentally, it might be noticed in Table 9.1 that the (simple) correlation of Y with Xij3 (or with Xij4) is negative for each group. Because
of this, it might be felt that the example of Table 9.1 is not a realistic one.
However, we may simply point out that these correlations would change from negative to positive if a sufficiently large bloc of individuals with (0,0,0,0) and (1,1,1,1) scores were added to each group. [The (0,0,0,0) and (1,1,1,1) individuals would have average criterion scores of 5 and 8 respectively.] Little else in the example would change, except that E(Yij|Xij) would no longer be exactly linear in Xij for either group, since E(Yij|Xij = 0) and E(Yij|Xij = 4) necessarily would not differ from one group to the other. The only reason why we put no (0,0,0,0) or (1,1,1,1) individuals into the example in the first place was that we wanted to have an example where E(Yij|Xij) was linear in Xij, of the form (5.3).
We have seen that if Xij, the total score on the test X, is used to predict Y in the example of Table 9.1, then there is bias against the "out" group. We will now show that, if the total score Xij on this very same test X of the example of Table 9.1 is used to predict a different criterion, then there can be bias against the "in" group rather than against the "out" group. Suppose that this different criterion Y is such that

(9.11)    E(Yij|Xij1, Xij2, Xij3, Xij4) = 9 + Xij3 + Xij4

for both groups. Then Table 9.1 would remain the same except for the two columns indicating average Y score. From (9.11) we see that each of these latter two columns will now read 9, 9, 10, 10; 9, 10, 10, 10, 10, 11; 11, 11, 10, 10 (excluding the total lines). The total lines will be
(9.12)    E(Y1j|X1j = 1) = 9.33,   E(Y1j|X1j = 2) = 10.03,   E(Y1j|X1j = 3) = 10.68

and

(9.13)    E(Y2j|X2j = 1) = 9.17,   E(Y2j|X2j = 2) = 9.86,   E(Y2j|X2j = 3) = 10.50.

The over-all total line will be

(9.14)    E(Y1j) = 10.39,   E(Y2j) = 9.82.
We conclude from (9.12-9.14) that there is bias against the "in" group. This bias is almost constant (about .17), although E(Yij|Xij) is not quite linear in Xij for either group. The bias against the "in" group this time is caused essentially by items 1 and 2, which are now the irrelevant items (with respect to the new criterion variable). Observe that the bias against the "in" group occurs in spite of the fact that the "in" group scores better on all four test items than the "out" group.
The bias against the "in" group seems to result from the fact that the new criterion variable evidently is closely related to characteristics measured by items 3 and 4. We recall that items 3 and 4 would perhaps pertain to matters which would arise more frequently in the culture of the "in" group than in the culture of the "out" group. Thus the new criterion itself would perhaps be related to matters arising more frequently in the culture of the "in" group than in the culture of the "out" group. This being the case, members of the "out" group (or sympathizers with the "out" group) might feel in some instances that the bias against the "in" group would be nothing to be greatly concerned about, on the grounds that the criterion itself was "loaded" in favor of the "in" group. However, the "in" group would not be likely to agree with such a feeling, particularly if the criterion variable represented something fully meaningful and important.
As was the case with the original criterion variable of Table 9.1, the bias problem for the second criterion variable can be resolved by considering the regression of Y on the Xijk's [see (9.11)]. We point out again that this refined technique not only resolves the bias problem, but also increases the accuracy of the prediction; to a certain extent, these two benefits go hand-in-hand with one another.
It should not be surmised, however, that use of the Xijk's will inevitably reduce the amount of bias (compared with the use of Xij alone). Certainly one would normally expect to be more successful (or at least no less successful) in eliminating bias with an m-variable regression (on the Xijk's) than with a one-variable regression (on Xij), and one would think that this is usually what would occur in actual practice. However, we exhibit an example here just for the purpose of demonstrating that it is at least possible for Xij to be bias-free but for X° = (Xij1, Xij2, ..., Xijm) to be biased. The example is presented in Table 9.2. We again have g = 2 groups, the "ins" and the "outs". The test has just m = 3 items. From Table 9.2 we see that

(9.15)    E(Y1j|X1j) = 4.25 + X1j

and

(9.16)    E(Y2j|X2j) = 4.25 + X2j,

so that we have no bias if Xij is used for X°. On the other hand, we also
TABLE 9.2

Example in which a test is bias-free if prediction is based on total score (Xij) alone, but is biased if prediction is based on the item scores.

Item scores for the      "In" group (i = 1)         "Out" group (i = 2)
3 items in the test    % of group    Average      % of group    Average
Xij1 Xij2 Xij3         with these    Y score      with these    Y score
                       scores                     scores
 1    0    0               20           6             24           6.25
 0    1    0               10           5             22           5.25
 0    0    1               10           4             24           4.25
Total for Xij = 1          40           5.25          70           5.25
 0    1    1               10           5             10           5.25
 1    0    1               25           6             10           6.25
 1    1    0               25           7             10           7.25
Total for Xij = 2          60           6.25          30           6.25
Over-all total            100           5.85         100           5.55

Respective item difficulties are .70, .45, .45 for the "in" group, and .44, .42, .44 for the "out" group. Average Xij score is 1.60 for the "in" group and 1.30 for the "out" group.
find from Table 9.2 that

(9.17)    E(Y1j|X1j1, X1j2, X1j3) = 4 + 2 X1j1 + X1j2,

whereas

(9.18)    E(Y2j|X2j1, X2j2, X2j3) = 4.25 + 2 X2j1 + X2j2.

Thus (9.17-9.18) tell us that we have bias (a constant bias of .25) if the Xijk's are used for X°. The use of the Xijk's (in comparison with the use of Xij) does improve the over-all accuracy of the prediction, but at the same time it brings out a bias problem which seemingly did not exist under the simpler model. (Actually, it could be contended that the bias problem effectively does exist under the simpler model, and that it is merely swept under the rug.) Although this sort of situation (where, e.g., bias is absent under the simpler model but does exist under the more refined model) might not occur frequently in practice, we should at least realize that it is far from being an impossibility.
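The reversed situation of Table 9.2 can be verified in the same numerical manner as before. The sketch below is an added illustration (assuming numpy) which fits both regressions by weighted least squares over the cells of Table 9.2:

```python
# Weighted least-squares check of (9.15)-(9.18) on the cells of Table 9.2.
import numpy as np

rows = [  # (x1, x2, x3, "in" %, "in" avg Y, "out" %, "out" avg Y)
    (1,0,0, 20, 6, 24, 6.25), (0,1,0, 10, 5, 22, 5.25), (0,0,1, 10, 4, 24, 4.25),
    (0,1,1, 10, 5, 10, 5.25), (1,0,1, 25, 6, 10, 6.25), (1,1,0, 25, 7, 10, 7.25),
]
X = np.array([r[:3] for r in rows], float)

def wls(design, y, w):
    # Weighted least squares via row scaling by sqrt(weight).
    s = np.sqrt(w)[:, None]
    return np.linalg.lstsq(design * s, y * np.sqrt(w), rcond=None)[0]

ones = np.ones((len(rows), 1))
total = X.sum(axis=1, keepdims=True)
for pct, ycol, label in ((3, 4, "in"), (5, 6, "out")):
    w = np.array([r[pct] for r in rows], float) / 100.0
    y = np.array([r[ycol] for r in rows], float)
    # Total score: approx [4.25, 1] for BOTH groups -- no bias, (9.15)-(9.16).
    print(label, wls(np.hstack([ones, total]), y, w))
    # Item scores: intercept 4 ("ins") vs 4.25 ("outs") -- bias, (9.17)-(9.18).
    print(label, wls(np.hstack([ones, X]), y, w))
```

Here the total-score regression coincides across groups while the item-score regression separates them, the mirror image of the Table 9.1 situation.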
10. Prediction and bias analysis based on second-order regression in the m item scores.

In the present section, we take X° to be the set of m item scores (Xij1, Xij2, ..., Xijm), just as we did in Section 9. However, in this section we no longer restrict ourselves with the assumption that the regression of Y on these item scores is linear, of the form (9.1). Instead of (9.1), we now suggest the possibility of a more general regression model which allows for second-order terms (and also for terms of a still higher order, if desired) in the Xijk's.
The second-order model, upon which we will concentrate our attention here, is of the form

(10.1)    E(Yij|Xijk's) = αi + (βi1 Xij1 + βi2 Xij2 + ... + βim Xijm)
                             + (βi(12) Xij1 Xij2 + βi(13) Xij1 Xij3 + ... + βi(m-1,m) Xij,m-1 Xijm).

Within the second parentheses on the right side of (10.1) are included a total of (1/2)m(m - 1) terms. No terms in X²ijk are included, because X²ijk will always be equal to Xijk if the situation is such that Xijk can assume only the two values 0 and 1.
Thus, under the model (10.1), there are 1 + m + (1/2)m(m - 1) parameters to be estimated for each of the g groups. This means that, in order for us to be able to estimate all these parameters and make significance tests, the sample size (Ni) for each group must be somewhat greater than 1 + m + (1/2)m(m - 1). Of course, if a common regression line is assumed for all groups (10.4), then there are fewer parameters and it is only necessary that the total sample size N be somewhat greater than 1 + m + (1/2)m(m - 1); or, if the regression lines are the same for all groups except that they may have differing αi's (10.3), then we simply need for N to be somewhat greater than g + m + (1/2)m(m - 1).

In any case, we see that we need much greater sample sizes to investigate second-order regression in the Xijk's than we needed to use the first-order regression models (9.1, 9.3-9.4). Furthermore, the matrices we have to invert are of course much larger for the second-order models than for the first-order models.
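The column layout of the second-order design matrix can be sketched as follows (an added illustration, assuming 0/1 item scores; the function name is ours):

```python
# Design matrix for the second-order model (10.1): an intercept, the m item
# scores, and the m(m-1)/2 pairwise products.  Squared terms are omitted
# because X**2 == X when each item score is 0 or 1.
from itertools import combinations
import numpy as np

def second_order_design(items):
    n, m = items.shape
    cols = [np.ones(n)]                                  # intercept
    cols += [items[:, k] for k in range(m)]              # first-order terms
    cols += [items[:, j] * items[:, k]                   # product terms
             for j, k in combinations(range(m), 2)]
    return np.column_stack(cols)

items = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]], float)   # m = 3 items
D = second_order_design(items)
print(D.shape)   # (3, 7): 1 + m + m(m-1)/2 = 1 + 3 + 3 = 7 columns per group
```

The column count grows quadratically in m, which is precisely the sample-size and matrix-inversion burden described above.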
These two drawbacks may often make the cost of using a second-order
model so high as to far outweigh any possible benefits, except perhaps for
tests in which the number of items (m) is small.
On the other hand, many
tests are administered to large numbers of subjects anyway (so that large
sample sizes would be readily available), and advances in computer technology
may cut the cost of inverting large matrices.
By and large, though, the tech-
niques based on (10.1-10.4) will probably be less practical than the techniques
of Sections 9 and 11.
At the end of this section, however, we present some
modifications of (10.1) which may be more practical than (10.1) itself.
Under the model (10.1), we will be free of bias if and only if

(10.2a)    β11 = β21 = ... = βg1,   β12 = β22 = ... = βg2,   ...,   β1m = β2m = ... = βgm,
           β1(12) = β2(12) = ... = βg(12),   ...,   β1(m-1,m) = β2(m-1,m) = ... = βg(m-1,m),

and

(10.2b)    α1 = α2 = ... = αg.

If (10.2a) holds, then (10.1) reduces to the form

(10.3)    E(Yij|Xijk's) = αi + (β1 Xij1 + β2 Xij2 + ... + βm Xijm)
                             + (β(12) Xij1 Xij2 + ... + β(m-1,m) Xij,m-1 Xijm).

If (10.2a) and (10.2b) both hold, then

(10.4)    E(Yij|Xijk's) = α + (β1 Xij1 + β2 Xij2 + ... + βm Xijm)
                            + (β(12) Xij1 Xij2 + ... + β(m-1,m) Xij,m-1 Xijm).

The hypotheses (10.2) are tested via standard regression methods.
We present a couple of examples of regression equations of the form (10.4), for simple situations which are easily interpretable. We assume we have a three-item test (m = 3). Suppose that

(10.5)    E(Yij|Xijk's) = 3 + Xij1 + Xij2 Xij3.

Identifying (10.5) with (10.4), we have α = 3, β1 = 1, β2 = β3 = 0, β(12) = β(13) = 0, and β(23) = 1. We can interpret (10.5) as follows. Success on item 1 of the test X will always increase the expected criterion score by 1. Success on item 2 or item 3, however, will not increase the expected criterion score at all unless the other of these two items is also answered correctly. In other words, the characteristics measured by items 2 and 3 are evidently of such a nature that both characteristics must be possessed together before performance on the criterion will be improved, i.e., one of them is of no value without the other (so far as criterion performance is concerned). Suppose (to take a somewhat artificial example) that Xij2 measures ability to speak English, Xij3 measures ability to speak Japanese, and the criterion Yij measures performance in a job of being an English-Japanese interpreter (Xij1 might measure some characteristic such as general educational level). For such a case, we would logically expect the regression equation to be something like (10.5).

Our second example of a regression equation of the form (10.4) is somewhat different. Suppose that
(10.6)    E(Yij|Xijk's) = Xij1 + 2 Xij2 + 2 Xij3 - 2 Xij2 Xij3.

Here α = 0, β1 = 1, β2 = β3 = 2, β(12) = β(13) = 0, and β(23) = -2. We can interpret (10.6) as follows.
As in (10.5), success on item 1 will always
increase the expected criterion score by 1, regardless of the value of the
other two item scores.
In (10.6), though, success on either item 2 or item 3 will increase the expected criterion score by 2, but success on both items 2 and 3 (i.e., Xij2 = 1, Xij3 = 1, Xij2 Xij3 = 1) will still increase the expected criterion score only by 2 and not by 4. Thus it is superfluous for the examinee to be successful on both items 2 and 3. The characteristics measured by these two items are apparently such that possession of either one or the other characteristic is sufficient to increase criterion performance, but possession of both doesn't do any further good. Suppose, e.g., that the criterion variable Yij measures success in reading certain literature which is available in either French or German (but in no other language), and that Xij2 measures ability to read French while Xij3 measures ability to read German (Xij1 could measure general educational level). Then we would expect the regression equation to be something like (10.6).
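The contrast between the "both required" surface (10.5) and the "either sufficient" surface (10.6) can be tabulated over all eight item-score patterns; the following small sketch (with illustrative function names of our own) does so:

```python
# Tabulating the two illustrative surfaces (10.5) and (10.6) over all eight
# item-score patterns.  (10.5) rewards possessing BOTH of items 2 and 3;
# (10.6) rewards possessing EITHER, with no extra credit for both.
from itertools import product

def e_y_105(x1, x2, x3):
    return 3 + x1 + x2 * x3                  # (10.5)

def e_y_106(x1, x2, x3):
    return x1 + 2*x2 + 2*x3 - 2*x2*x3        # (10.6)

for x in product((0, 1), repeat=3):
    print(x, e_y_105(*x), e_y_106(*x))
# e.g. (0,1,0) gives 3 and 2, while (0,1,1) gives 4 and 2:
# adding item 3 helps only under (10.5), where both items are needed.
```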
We now present an example in which there is bias if one tries to predict the criterion on the basis of simply a linear combination of the item scores (the method of Section 9), but in which bias no longer exists if one uses a second-order regression on the item scores. In this example, the details of which are exhibited in Table 10.1, we have g = 2 groups, the "ins" and the "outs". There are m = 3 items on the test. It is easily seen from Table 10.1 that E(Yij|Xijk's) is the same for both groups, and is a second-order expression in the Xijk's which is, in fact, given by (10.6). Thus
TABLE 10.1

Example showing that a test can be biased if prediction is based on a linear combination of the item scores, but bias-free if prediction is based on a second-order expression in these same item scores [see also (10.6-10.8)].

Item scores for the      "In" group (i = 1)         "Out" group (i = 2)
3 items in the test    % of group    Average      % of group    Average
Xij1 Xij2 Xij3         with these    Y score      with these    Y score
                       scores                     scores
 0    0    0                1           0              6           0
 1    0    0                1           1              2           1
 0    0    1                4           2             24           2
 1    0    1                4           3              8           3
 0    1    0                4           2             24           2
 1    1    0                4           3              8           3
 0    1    1               41           2             21           2
 1    1    1               41           3              7           3
Over-all total            100           2.46         100           2.09

Respective item difficulties are .50, .90, .90 for the "in" group, and .25, .60, .60 for the "out" group. Average Xij score is 2.30 for the "in" group and 1.45 for the "out" group.
there is no bias if we use this second-order regression equation (10.6) for prediction and selection. Consider what happens, though, if we attempt to predict Yij using just a linear combination of the Xijk's. Then the best such linear predictor of Yij (which we call Ŷij) will be

(10.7)    Ŷ1j = 1.312 + X1j1 + .36 X1j2 + .36 X1j3

for the "ins", but

(10.8)    Ŷ2j = 1.12 + X2j1 + .6 X2j2 + .6 X2j3

for the "outs" [(10.7) and (10.8) were obtained by applying standard regression formulas to the data of Table 10.1].
If, e.g., (10.7) is used to predict for the "outs" as well as the "ins", then we see from Table 10.1 that such prediction will, on the average, discriminate against the "outs", because, for the "outs", the under-predictions will more than offset the over-predictions [whereas, for the "ins", the under-predictions will exactly balance the over-predictions using (10.7)]. For both "outs" and "ins", the under-predictions [using (10.7)] will occur for individuals scoring (0,0,1), (0,1,0), (1,0,1), or (1,1,0). But individuals with these scores comprise a much higher percentage of the "out" group than of the "in" group. This essentially is what causes there to be discrimination (on the average) against the "out" group. [If (10.8) were used to predict for both groups, then there would be over-prediction (on the average) for the "ins", but for the "outs" the over-prediction and under-prediction would balance. If a prediction equation like (10.7-10.8) were derived on the basis of both groups combined, then the resulting predictions would tend to over-predict for the "ins" and under-predict for the "outs".]
All bias will be eliminated, of course, if we just use the second-order equation (10.6). This is better than using two different prediction equations, (10.7) and (10.8), for the two groups. Not only will bias be eliminated by using (10.6), but also the over-all accuracy of prediction will be improved. In fact, efforts to improve the general accuracy of prediction may often have as an automatic by-product the reduction or elimination of bias. In some cases where one is trying to eliminate bias, it may be wise to direct the major effort toward improving accuracy of prediction, in the belief that the resolution of the bias problem will follow along hand-in-hand.
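The "standard regression formulas" referred to above can be applied to the Table 10.1 figures mechanically; the following added sketch (assuming numpy) recovers the "in"-group predictor (10.7) by weighted least squares:

```python
# Recovering the "in"-group linear predictor (10.7) from the Table 10.1
# figures by weighted least squares (weights = % of group in each cell).
import numpy as np

rows = [  # (x1, x2, x3, "in" %, average Y)
    (0,0,0,  1, 0), (1,0,0,  1, 1), (0,0,1,  4, 2), (1,0,1,  4, 3),
    (0,1,0,  4, 2), (1,1,0,  4, 3), (0,1,1, 41, 2), (1,1,1, 41, 3),
]
X = np.array([r[:3] for r in rows], float)
w = np.array([r[3] for r in rows], float) / 100.0
y = np.array([r[4] for r in rows], float)

A = np.hstack([np.ones((len(rows), 1)), X]) * np.sqrt(w)[:, None]
coef = np.linalg.lstsq(A, y * np.sqrt(w), rcond=None)[0]
print(np.round(coef, 3))   # [1.312, 1., 0.36, 0.36], the coefficients of (10.7)
```

Substituting the "out"-group percentages in the same way yields the coefficients of (10.8).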
In the example of Table 10.1, it is the "outs" who suffer the brunt of the discrimination if a common prediction equation linear in the Xijk's is used for both groups. We now show that, if the criterion in the example is changed a bit but the test X remains the same, then it will be principally the "ins" who suffer from discrimination. Suppose we have a new criterion Y which is such that (10.5) holds. Table 10.1 will stay the same except for the two columns indicating average Y score, each of which will now read 3, 4, 3, 4, 3, 4, 4, 5 for the first eight lines. (The new total lines will read 4.32 for the "ins" and 3.53 for the "outs".) We can find that the best predictors linear in the Xijk's will be

(10.9)    Ŷ1j = 2.344 + X1j1 + .82 X1j2 + .82 X1j3

for the "ins" and

(10.10)    Ŷ2j = 2.44 + X2j1 + .7 X2j2 + .7 X2j3
60
I
I
I.
I
I
I
I
I
I
I
II
I
I
I
I
I
I.
I
I
for the "outs". From (10.9-10.10) and Table 10.1 we can verify that a linear prediction equation common to the two groups would tend to under-predict for the "ins" and/or over-predict for the "outs". Thus, although the test X (using a linear prediction equation in the item scores) discriminated against the "outs" (on the average) with respect to a criterion which required the possession of either of two abilities, the same test X (again using a linear prediction equation in the item scores) discriminates against the "ins" with respect to a criterion requiring the possession of both of these two abilities.
As we previously indicated, the large number of terms inside the
second parentheses on the right sides of (10.1,10.3,10.4) may cause the
techniques thus far proposed in this section to be of limited practical value.
In closing this section, therefore, we suggest a way of "compromising" whereby
we can realize some of the benefits of using second-order terms without being
burdened with the large numbers of parameters which (10.1, 10.3, 10.4) require.
To begin with, let us point out that the model (5.1) could be generalized by adding a single second-order term in Xij, so that we would have

(10.11)    E(Yij|Xij) = αi + βi Xij + γi X²ij.

In a very restricted way, this model (10.11) takes account of the terms Xijk Xijk', inasmuch as

(10.12)    X²ij = (Xij1 + Xij2 + ... + Xijm)² = Xij + 2 (Xij1 Xij2 + Xij1 Xij3 + ... + Xij,m-1 Xijm)

when each Xijk can be only 0 or 1. Now observe that we could also add a single term in X²ij to the model (9.1). This would give us

(10.13)    E(Yij|Xijk's) = αi + βi1 Xij1 + βi2 Xij2 + ... + βim Xijm + γi X²ij.
In view of (10.12), we would expect that, to a limited extent, the model (10.13)
might be effective in improving prediction in some of the same types of situations in which the more elaborate model (10.1) would produce improved prediction.
Suppose now that the items on the test X can be grouped naturally into certain categories, where the different categories might represent sets of items which measure (e.g.) different types of abilities or different subject-matter areas. Let there be c such categories, and let Xij(h) denote the sum of the item scores on the items in the h-th category (h = 1, 2, ..., c) for the (i, j) individual. Then it might be worthwhile to consider a model of the form

(10.14)    E(Yij|Xijk's) = αi + (βi1 Xij1 + βi2 Xij2 + ... + βim Xijm)
                              + (γi(11) X²ij(1) + γi(12) Xij(1) Xij(2) + ... + γi(cc) X²ij(c)),

in which the second parentheses contain the second-order terms in the Xij(h)'s. A model of this form (10.14) might be an acceptable compromise between (9.1) and (10.1), because it could use a much smaller number of parameters than (10.1) but might still effect substantially greater predictive accuracy than (9.1).
For the models (10.11), (10.13), and (10.14), we did not go into the
detail which we went into with the models (5.1), (9.1), and (10.1).
That is,
we omitted the detailed equations which are the analogues of (e.g.) (10.2-10.4),
but these equations should be readily apparent anyway.
Besides (5.1), (9.1), (10.1), (10.11), (10.13), and (10.14), one could write down other possible models where prediction is based on the Xij's and Xijk's. For instance, the idea of working with the Xij(h)'s could be used not only for the second-order terms but also for the first-order terms, so that (e.g.) in (9.1), (10.13), or (10.14), one might utilize just c first-order terms in the Xij(h)'s rather than m first-order terms in the Xijk's.
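Forming the category sums Xij(h) is a simple bookkeeping step; the sketch below is an added illustration with a hypothetical 5-item, 2-category layout:

```python
# Category-sum predictors X_ij(h): sum the item scores inside each of the
# c categories, so that c sums can replace (or supplement) the m items.
import numpy as np

items = np.array([[1, 0, 1, 1, 0],       # m = 5 item scores per examinee
                  [0, 1, 1, 0, 1]], float)
categories = [[0, 1], [2, 3, 4]]         # c = 2 categories (illustrative)

X_h = np.column_stack([items[:, idx].sum(axis=1) for idx in categories])
print(X_h)   # each row now carries just c = 2 category sums
```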
In any case where one moves from using a simpler model to using a more elaborate model, one does so in the hope that the more elaborate model will reduce or eliminate what bias existed under the simpler model. Any reduction of bias would usually be associated with general improvement in the accuracy of the prediction.
11. Prediction and bias analysis based on which responses to the items are marked.

In this final section of Part I, we consider the possibility of using the Xijkℓ's (i.e., the information as to which response the examinee marks for each item) for the prediction of Y. That is, we now consider X° to be the set of Xijkℓ's (2.2). (For definition of Xijkℓ, see Section 1.) We assume that the regression of Y on the Xijkℓ's is linear, of the form

(11.1)    E(Yij|Xijkℓ's) = αi + βi11 Xij11 + βi12 Xij12 + ... + βi1n1 Xij1n1
                              + βi21 Xij21 + βi22 Xij22 + ... + βi2n2 Xij2n2 + ...
                              + βim1 Xijm1 + βim2 Xijm2 + ... + βimnm Xijmnm.
This model (11.1) constitutes a generalization of (9.1): in effect, all (nk - 1) distractors (incorrect responses) for an item are given the same score under (9.1), whereas these distractors are all allowed to carry different scores under (11.1). It is entirely possible, of course, that the test X might be biased for predicting Y if we use any of the models indicated in previous sections, but bias-free if we use (11.1). A simple example below will show how (11.1) might serve to eliminate bias and improve prediction.
Under the model (11.1), we will be free of bias if and only if

(11.2a)    β111 = β211 = ... = βg11,   β112 = β212 = ... = βg12,   ...,   β11n1 = β21n1 = ... = βg1n1,
           ...,   β1m1 = β2m1 = ... = βgm1,   β1m2 = β2m2 = ... = βgm2,   ...,   β1mnm = β2mnm = ... = βgmnm,

and

(11.2b)    α1 = α2 = ... = αg.

If (11.2a) holds, then (11.1) reduces to the form

(11.3)    E(Yij|Xijkℓ's) = αi + β11 Xij11 + β12 Xij12 + ... + β1n1 Xij1n1
                              + β21 Xij21 + β22 Xij22 + ... + βmnm Xijmnm.

If (11.2a) and (11.2b) both hold, then we can write

(11.4)    E(Yij|Xijkℓ's) = α + β11 Xij11 + β12 Xij12 + ... + βmnm Xijmnm.
Testing of the hypotheses (11.2) is done by standard regression methods, except that one will end up trying to invert a singular matrix unless steps are taken to avert this difficulty. The problem of a singular matrix will arise essentially because

(11.5)    Xijk1 + Xijk2 + ... + Xijknk = 1

for all k. However, the problem is easily resolved. One can simply drop m terms, one for each item, from the right sides of (11.1), (11.3), and (11.4). Thus, e.g., the terms for Xij11, Xij21, ..., Xijm1 could be the ones to be dropped, or (perhaps more conveniently) the m dropped terms could be the m Xijkℓ's which correspond to the m right answers. In the latter case, e.g., a βikℓ or βkℓ would represent simply the difference between the score for marking the right answer and the score for marking the ℓ-th response to the item, and would generally be negative.
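The drop-one-response device can be sketched as follows (an added illustration; the function name and data layout are ours, not from the report):

```python
# Building a response-indicator design matrix while dropping, for each item,
# the column for the keyed (right) answer.  This avoids the singularity of
# (11.5), and each remaining coefficient becomes the score difference between
# marking that response and marking the right answer.
import numpy as np

def response_design(marked, n_resp, right):
    """marked[i][k] = response examinee i marked on item k (0-based);
    n_resp[k] = number of responses for item k; right[k] = keyed answer."""
    rows = np.asarray(marked)
    cols = []
    for k, nk in enumerate(n_resp):
        for r in range(nk):
            if r == right[k]:
                continue                          # dropped reference column
            cols.append((rows[:, k] == r).astype(float))
    return np.column_stack([np.ones(len(rows))] + cols)

# One three-response item; three examinees marking responses 1, 2, 3.
D = response_design([[0], [1], [2]], n_resp=[3], right=[0])
print(D)   # columns: intercept, "marked response 2", "marked response 3"
```

With the keyed column removed, the design matrix has full column rank and the standard regression computations of this section go through.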
We now exhibit an over-simplified numerical example which will show
TABLE 11.1

Example showing that a test can be biased if prediction is based only on the set of item scores (i.e., the Xijk's, or in this example just Xij1), but bias-free if prediction is based on the Xijkℓ's.

Possible values of       "In" group (i = 1)         "Out" group (i = 2)
the Xij1ℓ's            % of group    Average      % of group    Average
Xij11 Xij12 Xij13      with these    Y score      with these    Y score
                       scores                     scores
  0     0     1             6           9             20           9
  0     1     0            24          14             30          14
Total for Xij1 = 0         30          13             50          12
  1     0     0            70          16             50          16
Total for Xij1 = 1         70          16             50          16
Over-all total            100          15.1          100          14.0

Response 1 is the correct answer to the single item in this one-item test. The difficulty of the item (which is also the average Xij score) is .7 for the "in" group and .5 for the "out" group.
how the use of the Xijkℓ's will result in eliminating bias which would exist if prediction were based on the set of Xijk's alone. Actually, our example will be for a test X in which the number of items is just m = 1. We suppose there are g = 2 groups, the "ins" and the "outs". The details of the example appear in Table 11.1. The single item in the test has n1 = 3 possible responses, of which the first is the correct response. Of the last two responses, the examinees who mark response 2 score much better on the criterion than do the examinees who mark response 3. This might (e.g.) come about due to response 2 being a "nearly-correct" response and response 3 being nowhere close to a correct response, so that response 2 would tend to attract (distract) examinees with greater ability on the criterion than would response 3.
From Table 11.1 we easily find that

(11.6)    E(Y1j|X1j1) = 13 + 3 X1j1

for the "ins", whereas

(11.7)    E(Y2j|X2j1) = 12 + 4 X2j1

for the "outs". Thus, the test with X° = Xij1 is clearly biased, since the right sides of (11.6) and (11.7) are different. The bias favors the "outs". (The example is overly simple because there is just one item, but the principle should be clear.) Identifying (11.6) and (11.7) with (9.1), incidentally, we have α1 = 13, β11 = 3, α2 = 12, β21 = 4.
If prediction is based on the Xij1ℓ's rather than Xij1, then we easily see from Table 11.1 that we will no longer have any bias (and we also will have improved our over-all accuracy). Because of (11.5), there is not a unique way of representing E(Yij|Xijkℓ's) in the form (11.1, 11.3, 11.4); we will omit Xij11 and write (11.1, 11.3, 11.4) as

(11.8)    E(Yij|Xijkℓ's) = 16 - 2 Xij12 - 7 Xij13.

The bias-free and relatively accurate prediction based on (11.8) is clearly superior to prediction based on Xij1 alone.
If Xij1 itself is used for prediction or selection in the example of Table 11.1, then there will, of course, be bias against the "ins" and in favor of the "outs". This is basically because response 2 (the presumed "nearly-correct" response) is marked by a larger proportion of the "ins" who miss the item than of the "outs" who miss the item. In his analysis of which distractors were favored by pupils of high socio-economic status and which by pupils of low socio-economic status, Eells [9, p. 275 ff.] found that the high-status pupils who missed an item had a greater tendency to mark a nearly-correct distractor than did the low-status pupils who missed the item. This might suggest that, if we are confronted with bias under (9.1) but are able to eliminate this bias by using (11.1), then the direction of such bias would normally be against the high-status pupils rather than in their favor. (However, Eells might not necessarily agree with the idea that the results of his analysis could be used to support a conclusion such as this.)
We should realize, in the example of Table 11.1, that basically it is the examinees who mark response 2 who are being discriminated against and the examinees who mark response 3 who are being discriminated in favor of, rather than the "ins" who are being discriminated against and the "outs" who are being discriminated in favor of. However, there turns out to be bias (as we have defined it) against the "ins" and in favor of the "outs", because, among the examinees who mark a wrong answer, there are relatively more "ins" than "outs" who mark response 2.
Of course, in the example of Table 11.1, the direction of the bias would be reversed if, e.g., the percentages of examinees giving responses 3 and 2 were changed to 6 and 44 respectively for the "outs" but remained as they are for the "ins". With the percentages thus changed, there would be bias against the "outs". Note that, in this case, the fraction of the examinees giving response 3 (the "worst" response) would be 6% for both the "in" group and the "out" group. But, among the remaining 94% of each group, a higher proportion of the "outs" than of the "ins" come up with response 2 (the "nearly-correct" response). It might be, e.g., that some cultural factor causes relatively more "outs" than "ins" to slip up slightly and give the "not-quite-correct" response (response 2), even though 94% of both "outs" and "ins" give one or the other of the two "good" responses (responses 1 and 2). Thus we are able to hypothesize a plausible situation in which bias that can be eliminated by using the Xijkℓ's is directed against the "outs" rather than the "ins".
In the example of the preceding paragraph, we (in effect) changed
the test X in Table 11.1 in order to show how the direction of the bias could
be changed so that it would be against the "outs" rather than the "ins".
Theoretically, though, it is possible to retain the very same test X as in
Table 11.1, and have a different criterion for which the bias is against the
"outs" rather than the "ins". It is easily seen that, if (e.g.) we simply
trade the figures 9 and 14 in the average criterion-score columns in the first two
lines of the body of Table 11.1, then (with respect to this new criterion)
the bias using just X_{ij} will be against the "outs" rather than the "ins".
In practice, however, it is doubtful that this type of situation would often
arise, since the distractor which tends to attract the more able individuals
with respect to one criterion would also be expected ordinarily to attract the
more able individuals with respect to some other criterion.
In Eells' analysis of the distribution of wrong responses among high-
status pupils and among low-status pupils
[9, Chapter XXII.], he found many
items where there were sharp differences between the two groups (with respect
to the distributions among the different distractors of the examinees who
answered the item incorrectly).
In view of these differences in the distributions, it is possible that the
information contained in the X_{ijkl}'s might
sometimes be of considerable value in reducing or eliminating bias (and,
incidentally, improving over-all accuracy), if this information is used as
suggested in this section.
We now point out that, in order to use the models (11.1, 11.3, 11.4),
it is not necessary that the "right" answer to an item even be defined; in
fact, any effort to designate which response to an item is the "right" response
would, in a sense, be wasted effort.
Only the X_{ijkl}'s are used in (11.1, 11.3,
11.4), and (in contrast to the quantities X_{ij} and X_{ijk}) we do not need to know
which response is the "right" response in order to determine the value of an
X_{ijkl}. The estimates of the β-parameters in the multiple regression model
(11.4) will provide the scores to be awarded for the different responses on
the various items.
(A rather large sample would have to be used in order to
get accurate estimates of these β's.)
It could turn out that one of the "wrong"
answers to an item (presumably only an occasional item) might receive a higher
score than the "right" answer.
No doubt the constructor of the test would
not be pleased with such a result.
In some cases (perhaps most cases), though,
such a result might indicate that the item was a poor one to begin with, and

[...] other things being equal) between the two sets of examinees who mark
(respectively) these two responses, then we would certainly have to conclude
that it was improper to designate the first of these two responses as "correct"
and the second as "incorrect". Bias against the "outs" clearly results. If,
however, the techniques of this section are employed, then such bias as this
will be automatically eliminated.
Before concluding this section, we should mention that it would be
possible to formulate various models which combine the ideas of Sections 10
and 11, i.e., which use the X_{ijkl}'s and also use second-order terms. Such
models, however, could easily involve such a large number of β's as to be
impractical, unless one restricted the scope of the model. For example, if the
terms inside the second parentheses on the right side of (10.14) were appended
to the model (11.1), and if h were a small enough number, then such a
combination model could be practical.
PART II. THE PROBLEM OF BIASES WHEN THERE IS NO
CRITERION VARIABLE TO BE PREDICTED
Part II. considers the question of bias analysis when there is no
criterion variable.
This is a more difficult question than that considered
in Part I., mainly because we get stalled at the outset in trying to define
what is meant by "bias" when there is no criterion variable.
Although there
seems to be no generally suitable way to define "bias" in this situation, we
do examine several concepts which seem to be closely related thereto, and we
present means of testing hypotheses concerning these concepts.
Largely because of our difficulties in defining bias in the absence of a criterion
variable, Part II. will be shorter and less conclusive than Part I.
12. Notation, assumptions, and understandings.
Most of the notation which we will use here in Part II. was already introduced
in Part I. (see Section 1).
However, we will also require a bit of additional
notation with respect to the test X.
Let p_{ik} be the difficulty of the k-th
item for the i-th group, i.e., p_{ik} is the proportion of the i-th group which
answers item k correctly. Let p_{ikK} denote the proportion of the i-th group
which gets both item k and item K correct. The p_{ik}'s and p_{ikK}'s are true
(population) values, not sample estimates thereof. Define p_{i.} to be the sum
of all m p_{ik}'s for group i, i.e., p_{i.} is the mean score on the test X for
group i.
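As a concrete illustration (not part of the original report), the sample analogues of these three quantities can be computed in a few lines of Python; the 0/1 score matrix below is invented.

```python
import numpy as np

# Hypothetical 0/1 score matrix for one group i: rows = examinees, cols = items.
# What we compute are sample estimates of p_ik, p_ikK, and p_i. defined above.
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [1, 1, 1]])

p_k = X.mean(axis=0)           # estimate of p_ik: difficulty of each item k
p_kK = (X.T @ X) / X.shape[0]  # entry (k, K): proportion right on both k and K
                               # (diagonal reduces to p_k, since the scores are 0/1)
p_dot = p_k.sum()              # estimate of p_i. : mean total score for the group

print(p_k)       # item difficulties, here [0.75, 0.75, 0.5]
print(p_kK[0, 1])  # proportion correct on both of the first two items
print(p_dot)     # mean total score
```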
The definitions just given were formulated in the context of the test
X being a pure power test. Although our material here in Part II. will be
aimed mainly toward this situation of the pure power test, Sections 14 and 15
(but not all of 16) evidently will still be applicable when the test X is
partially a speed test, if we extend the definition of p_{ik} to take account of
individuals who do not attempt the item.
We may define p_{ik} to be the population
mean (true mean) of X_{ijk}, where X_{ijk} is considered to be the reciprocal of
the number of response choices if the item is
not attempted. This definition of p_{ik} is equivalent to the previous definition
if X is a pure power test.
Here in Part II., we require that the sample of individuals from each
of the g groups be strictly a random sample with respect to performance on the
test X.
Since we no longer have a criterion variable, the type of non-random
sampling which was permissible for Part I. (see Section 3) will no longer be
applicable here in Part II.
13. The problem of how to define bias when there is no criterion variable.
In Section 9, we exhibited an example (with g = 2 groups, the "ins" and the
"outs") showing how the very same test X (using X_{ij}) could be biased
against the "outs" for predicting one criterion but biased against the "ins"
for predicting a second criterion. This anomaly alone might cause one to
wonder whether it is actually possible to formulate a meaningful definition
of "bias" when there is no criterion variable being predicted.
In some cases, however, the investigator may feel he has adequate
a priori grounds for assuming that all g groups are alike with respect to
whatever is supposed to be measured by the X_{ijk}'s or by X_{ij}. Now we have

(13.1)   p_{11} = p_{21} = ... = p_{g1},   p_{12} = p_{22} = ... = p_{g2},   ...,   p_{1m} = p_{2m} = ... = p_{gm}

if the g groups are alike with respect to mean score on each item, and

(13.2)   p_{1.} = p_{2.} = ... = p_{g.}

if the g groups are alike simply with respect to mean total score. (These
means are of course for the population, not the sample.) We could define a
test X to be biased if (13.1) does not hold, or, as an alternative definition,
we could define X to be biased if (13.2) does not hold. Of course, these defi-
nitions are sensible only if we have sufficient a priori grounds for assuming
that all groups are the same with respect to whatever trait(s) the X_{ijk}'s or
X_{ij} are supposed to be measuring. Sections 14 and 15 discuss the testing of
the hypotheses (13.1) and (13.2) respectively. We test the hypothesis (13.1)
if we feel that all groups should be scoring alike on each item; if we don't
anticipate that all groups should necessarily be scoring alike on each item
but we do feel that all groups should be scoring alike on the test as a whole
if it is working as it should, then we test the hypothesis (13.2).
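To make the relation between the two hypotheses concrete, here is a small Python sketch (not from the report; the population difficulty matrix is invented) in which (13.2) holds but (13.1) fails, because the groups are strong on different items:

```python
import numpy as np

# Hypothetical population difficulty matrix: rows = g groups, cols = m items.
# The groups differ item-by-item but have equal mean total score, so (13.2)
# holds while (13.1) fails.
p = np.array([[0.8, 0.4, 0.6],    # group 1
              [0.6, 0.6, 0.6]])   # group 2

holds_131 = np.all(p == p[0])                             # every column constant
holds_132 = np.allclose(p.sum(axis=1), p.sum(axis=1)[0])  # equal mean total scores

print(holds_131, holds_132)   # (13.1) fails here; (13.2) holds
```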
In many practical situations, a definition of bias based on either
(13.1) or (13.2) would appear to be on rather shaky ground, because it would
be difficult to defend an a priori assumption of equality of the groups.
Suppose
the g groups are different racial groups, and suppose the test X is supposed
to measure some characteristic or ability which is not determined entirely by
heredity (i.e., which is at least partially acquired rather than hereditary).
If the different racial groups have differing opportunities to acquire the
characteristic in question, then one certainly could not assume a priori that
the groups should all score alike on a test X which is supposed to measure the
characteristic.
Thus a definition of bias based on failure of (13.1) or (13.2)
75
I"
.-I
I
I
I
I
I
I
_I
I
I
I
I
I
I
.-I
76
I
I
I.
I
I
I
I
I
I
I
I_
I
I
I
I
I
I
I.
I
I
and (through inheritance) the "intelligence" of the child. However, it should
be mentioned that Eells et al. ([9], see, e.g., pp. 26, 27-28) seem to be highly
skeptical of the idea that there are differences in the genetic endowments of
different SES groups.
If the test X is supposed to measure a characteristic which is purely
hereditary, and if it is desired to find out whether there is bias with respect
to different SES groups, then one could examine for such bias by giving the
test to a number of pairs of identical twins who have been separated and reared
in homes of differing SES.
In this way the heredity factor would be controlled,
and it would be proper to test (13.1) or (13.2).
The difficulty is, of course,
that identical twins who have been reared separately (and in homes of not
exactly the same SES) are not at all plentiful.
One could use brothers or
sisters (rather than identical twins) who have been reared apart, assuming
that intelligence was not a factor in determining which brother or sister went
to which home; however, such an experimental design would be much less sensitive
for detecting bias than a design using identical twins.
If one is interested in determining whether a test is biased with
respect to Negroes and whites, then one could compare Negroes and whites of
the same SES, and make the assumption that (for a bias-free test) Negroes
and whites should score alike if SES is held constant. This would involve
testing the hypothesis (13.1) or (13.2) for fixed SES. However, one could
question whether Negroes and whites really should be scoring alike (on a
bias-free test) even if SES is held fixed. Let us say there are two SES groups,
low and high. Suppose that 50% of the whites and 20% of the Negroes are
high SES, while 50% of the whites and 80% of the Negroes are low SES.
Suppose that the characteristic which the test is intended to measure is
partly acquired and partly determined by heredity, and suppose there are no
racial differences in genetic endowments.
Even though over-all there are no
genetic differences between the races, there could easily be genetic differences
between high-status Negroes and high-status whites, and also between low-status
Negroes and low-status whites, if high-status individuals in each race tend to
be better genetically than low-status individuals. One could expect high-status
Negroes to be better genetically (on the average) than high-status whites,
because the high-status Negroes are a more select portion of the entire Negro
group than high-status whites are of the entire white group (20% versus 50%).
By a similar line of reasoning, the low-status Negroes would be genetically
better (on the average) than the low-status whites.
Thus, if the test X is
bias-free, high-status Negroes should score better than high-status whites and
also low-status Negroes should score better than low-status whites. This is
because high(low)-status Negroes have the same environment as high(low)-status
whites (this we take to be a premise), but better heredity. (Of course, whites
as a whole will score better than Negroes as a whole if the test is bias-free,
because whites as a whole have the same heredity as Negroes as a whole, but a
better environment.)
At any rate, this example is intended to illustrate the
fact that a bias-free test may not necessarily produce the same average scores
for Negroes and whites even if SES is held constant.
In our example, the
appearance would be one of bias against whites if SES is held fixed and bias
against Negroes if SES is not taken account of, when in fact no bias exists.
Thus (13.1) or (13.2) would not be a suitable way to define a bias-free test
in this example, even if SES is held fixed.
We may sometimes be interested in examining for bias with respect to
boys and girls. In this case, we generally would not need to worry at all
about influences of race or SES.
Even in this case, though, it is not
always proper to postulate that the failure of (13.1) or of (13.2) to be
satisfied necessarily implies the existence of bias. Thus, e.g., Coffman [6]
found sex differences in responses to vocabulary items. Girls tended to do
better on words related to people, while boys tended to score better on words
related to things, so that (13.1) did not hold. Boys and girls were evidently
alike with respect to race and SES, and presumably heredity also, so that these
factors apparently could not cause the sex differences. An obvious explanation
for the sex differences would be the different environments in which boys and
girls grow up. In any event, the differences between boys and girls which are
reflected in the failure of (13.1) to be satisfied would be interpreted as
genuine differences and not as something caused by biased test items.
Thus we have seen that it is often inappropriate to base a definition
of bias on (13.1) or (13.2). Statistically, however, the hypotheses (13.1)
and (13.2) are not difficult to test, relatively speaking; because of this,
and because there may be cases where it would be suitable to define bias in
terms of (13.1) or (13.2) [or cases where we may wish to test the hypotheses
(13.1) or (13.2) anyway even though we may be dubious about defining bias on
this basis], we are covering the testing of (13.1) and (13.2) in Sections 14
and 15 respectively.
Some of the complications which crop up when one attempts to formu-
late a definition of "bias" for the situation where there is no criterion
variable have already been made readily apparent. In fact, it would appear
that, when one wishes to analyze for possible bias in a test, one should
ordinarily use a criterion variable if this is at all feasible (and if an
appropriate criterion variable naturally presents itself). One thereby avoids
the problem of how to define "bias" when there is no criterion variable, and
one can use the techniques developed in Part I. to analyze for bias.
Needless to say, none of the definitions of "bias" which we suggest
here in Section 13 are equivalent to the definitions used in Part I.
The
former neither imply nor are implied by the latter.
We recall that, for some purposes in the situation where there is no
criterion variable, we have suggested that one could define a bias-free test
as one which satisfies (13.1), or as one which satisfies (13.2). A third
possible way of defining a bias-free test when there is no criterion is in
terms of interaction between items and groups. That is, one defines the
test X to be bias-free if there is no interaction between items and groups.
This allows for the groups to be different [which (13.1) does not], but at the
same time requires that the difference between two groups be the same for all
m items in the test.
Unfortunately, the definition just given is not sufficiently
precise, because it fails to indicate how a "difference" between groups is to
be defined. The "difference" between group i and group I on item k could
be defined simply as (p_{Ik} - p_{ik}), but it could also be defined in other ways,
and it is not clear what way, if any, is best. If this issue could be resolved,
the definition of "bias" in terms of item-group interaction might be a use-
ful one for certain purposes, since such a definition of bias would be less
restrictive than the definition based on (13.1). In Section 16 we will be
concerned with the question of how to define or measure the "difference" be-
tween two groups on an item. The problem is essentially one of how to define
"interaction". Thus it may appear reasonable to define a bias-free test as
one for which there is no item-group interaction, but this does not settle the
question of how to define "interaction" in the first place. In addition to
examining this matter, Section 16 will be concerned more broadly with the
over-all topic of bias analysis when bias is defined in terms of item-group
interaction, and will indicate possible ways of testing hypotheses of no item-
group interaction.
If there is no item-group interaction (so that the test is bias-
free according to the third definition of this section), this does not
necessarily mean that the test would be bias-free according to the definitions
of Part I. if there were a criterion variable; for example, all the items on
the test could be uniformly biased against one group. On the other hand, the
existence of item-group interaction would not necessarily imply that the test
would be biased according to the definitions of Part I. if there were a criterion
variable, because different groups may be strong on different types of relevant
items. In practice, however, it is possible that absence of item-group inter-
action would, under some conditions, tend to go hand-in-hand with absence of
bias according to the definitions of Part I.
In some cases, item-group interaction might represent "balance"
rather than "bias", in which event one would not wish to use the third defi-
nition of "bias" indicated in this section [one might still find the definition
based on (13.2) to be suitable]. For example, consider again the matter of
sex differences on vocabulary tests. As we indicated previously in this
section, it has been found [6] that girls do better on vocabulary items re-
lated to people, while boys do better on items related to things. Thus there
is item-group interaction (no matter how one defines "interaction"). However,
if there is a proper ratio of "people" items to "things" items in the test,
one could conclude that the interaction reflects "balance" rather than "bias".
But what is a "proper" ratio? This matter boils down strictly to a question
of judgment as to what the test is supposed to assess, and there is no defini-
tive answer. If it is felt that "people" items are more important or more
representative than "things" items, then greater weight can be attached to the
former, with the result that girls will tend to get higher scores than boys.
Alternatively, one could devise a test giving about equal weight to "people"
items and "things" items so that boys and girls would tend to be equal in score.
Which test would be "biased" and which would be "bias-free"? It would appear
that any answer to this question would necessarily be arbitrary. Girls evidently
tend to score higher than boys on certain existing vocabulary tests, but it
can be contended that this is due to "bias" caused by too many "people" items
[6]. On the other hand, these existing tests might be defended as having the
right "balance" in their item content.
It should thus be clear at this point what kind of trouble one
might encounter if one attempts to define bias in terms of interaction. It
appears that whether an interaction represents "balance" or "bias" can only
be decided by the test constructor in terms of what he thinks the test is
supposed to assess. Thus, e.g., on a history test an item pertaining to the
Catholic Church might be answered most successfully by Catholic examinees, an
item pertaining to Martin Luther King might be answered exceptionally well by
Negroes, and an item pertaining to Abraham Lincoln might be answered best by
students from Illinois, but if the test constructor considers that these items
represent part of the subject matter which the test is supposed to cover and
if there is no criterion variable being used, then the items (even though
they would be causing interaction) would represent "balance" rather than "bias"
and would be left in the test. Evidently no statistical criterion can be
used to distinguish between "balance" and "bias".
It would appear that we are just about forced to conclude that
we are unable to find an objective statistical definition of bias which is
generally satisfactory for the situation where there is no criterion variable.
Of course, each of our three definitions of this section may be suitable for
certain specialized circumstances. Also, the hypothesis (13.1), the hypothesis
(13.2), or the hypothesis of no interaction may be of interest to test in its
own right, whether or not it is used for defining bias. In general, however,
it would appear (as we have already indicated) that some knotty problems
can be circumvented if one can use an appropriate criterion variable (or cri-
terion variables) when analyzing a test for possible bias. Unfortunately,
such a criterion variable may not be present or available if the test is used
entirely for assessment and not at all for prediction. But if a criterion
variable can be brought in for the bias analysis, then one avoids being faced
with what would seem in many cases to be an almost unanswerable question, viz.,
how to objectively define "bias" in the first place when there is no criterion
variable. If it is not possible to use a criterion variable and if one
cannot formulate a suitable definition of "bias", then what should one do?
One can still test the hypothesis (13.1), the hypothesis (13.2), and/or the
hypothesis of no item-group interaction, and the testing of these hypotheses
may yield valuable information; however, one has to be careful to refrain
from concluding that the truth or falsity of any of these hypotheses auto-
matically indicates that the test is (respectively) bias-free or biased. The
testing of these hypotheses, and the obtaining of estimates of the p_{ik}'s, may
produce some important clues about the test items. For example, if item-group
interaction is found to exist, then one certainly cannot conclude right away
that bias (whatever "bias" is) exists, but one can take a careful look at
which items are favoring which groups and make an examination of item content.
If (e.g.) a pattern emerges so that certain groups do especially poorly on
certain types of items, then one might consider eliminating these types of
items provided that one is now willing to consider such items as not relevant
to the assessment which the test furnishes. It should be emphasized, however,
that any decision along these lines would have to be based strictly on the
test constructor's (latest) judgment as to what does and does not constitute
relevant item content for the assessment, i.e., the decision would not and
could not be based on any further statistical evidence.
14. Testing for the equality of the groups on every item.
This section deals with the testing of the hypothesis (13.1). That is, we
consider how to test the hypothesis that, for every item, the item difficulties
for all groups are equal. As we indicated in Section 13, there may be some
cases where we can properly define a bias-free test to be one for which (13.1)
holds, and some cases where we may be otherwise interested in testing (13.1).
The test we will describe here is a multivariate analysis of
variance test, and is simply a Hotelling's T² when g = 2. It requires the
inversion of an m × m matrix.
Alternatively, a second way of testing (13.1) is also available.
Geisser and Greenhouse [13, Section 4] present a test which can be applied
either to the original data (the X_{ijk}'s), or to certain arc-sine transforma-
tions of the type utilized by Cardall and Coffman [4]. (To use this arc-sine
transformation technique, one arbitrarily divides the sample from each group
into a number of sub-samples with all sub-samples being of equal size, and
then one considers the basic data to be the figures obtained by calculating,
for each item and each sub-sample, twice the arc-sine of the square root of
the proportion of individuals in the sub-sample who pass the item. This
transformation causes the variance for all items to be approximately equal.)
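The sub-sample device just described might be sketched in Python as follows (not from the report; the data, the sub-sample count, and the item difficulty are all invented):

```python
import numpy as np

# Sketch of the Cardall-Coffman arc-sine device: split one group's examinees
# into equal-size sub-samples, then take, for each item and sub-sample,
# 2*arcsin(sqrt(proportion passing)).  The transform makes the variance
# roughly the same for every item regardless of its difficulty.
rng = np.random.default_rng(0)
X = (rng.random((120, 5)) < 0.7).astype(int)   # 120 examinees, 5 items, p = .7

n_sub = 6                                      # number of sub-samples (arbitrary)
sub = X.reshape(n_sub, -1, X.shape[1])         # 6 sub-samples of 20 examinees
props = sub.mean(axis=1)                       # pass proportion per sub-sample/item
Y = 2.0 * np.arcsin(np.sqrt(props))            # the transformed "basic data"

print(Y.shape)   # one transformed figure per sub-sample and item
```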
In order to utilize the Geisser-Greenhouse test, one first has to obtain a
value to use for ε' [13, p. 890]. One can choose the conservative value
ε' = 1/m (ε' = 1/k in the notation of their article), but this may lead to a test
which is much too conservative. If one is working with the arc-sine trans-
formations and if one is willing to assume a common correlation coefficient
84
.1
I
I
,I
I
I
I
I
_I
I
I
I
I
I
I
eI
I
I
I
I.
I
I
I
I
I
I
01
I_
I
I
I
I
I
I
I.
I
I
85
(= ρ) between all pairs of items (a questionable assumption if there is an ele-
ment of speed in the test), then the formula for ε' [13, p. 890] reduces to
1/[1 + (m-1)ρ²], but one still has to decide what value to use for ρ.
Although the Geisser-Greenhouse approach would thus have certain disadvantages,
it does have the advantage that it does not require the inversion of a large
(m × m) matrix as does the multivariate analysis of variance test.
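As a small numerical illustration (the values of m and ρ below are invented, not from the report), the two candidate choices of ε' can be compared directly:

```python
# The two candidate values of epsilon' discussed above: the conservative
# choice 1/m, and the common-correlation formula 1/[1 + (m-1)*rho^2].
m = 40      # number of items (hypothetical)
rho = 0.2   # assumed common inter-item correlation (hypothetical)

eps_conservative = 1.0 / m                       # always valid, often far too small
eps_common_rho = 1.0 / (1.0 + (m - 1) * rho**2)

print(eps_conservative, eps_common_rho)
```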
We are now ready to outline this latter test. It involves just
a standard application of multivariate analysis of variance, except that (as
is also the case with the Geisser-Greenhouse test using the X_{ijk}'s) we are in-
volved with an approximation inasmuch as the X_{ijk}'s do not satisfy the assump-
tion of being normally distributed (they are binomially distributed if all
examinees attempt all items). It would appear that this approximation should
cause us no trouble so long as the N_i's are fairly large.
In outlining the multivariate analysis of variance test of the
hypothesis (13.1), we will follow a procedure similar to that which we used
in Section 7, i.e., we will simply identify the matrices which should be
plugged into the appropriate formulas given in [18]. Note first that we can write

(14.1a)
$$E\begin{pmatrix}
X_{111} & X_{112} & \cdots & X_{11m}\\
X_{121} & X_{122} & \cdots & X_{12m}\\
\vdots & \vdots & & \vdots\\
X_{1N_{1}1} & X_{1N_{1}2} & \cdots & X_{1N_{1}m}\\
X_{211} & X_{212} & \cdots & X_{21m}\\
\vdots & \vdots & & \vdots\\
X_{gN_{g}1} & X_{gN_{g}2} & \cdots & X_{gN_{g}m}
\end{pmatrix}
=
\begin{pmatrix}
1 & 0 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
1 & 0 & \cdots & 0\\
0 & 1 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
0 & 0 & \cdots & 1
\end{pmatrix}
\begin{pmatrix}
p_{11} & p_{12} & \cdots & p_{1m}\\
p_{21} & p_{22} & \cdots & p_{2m}\\
\vdots & \vdots & & \vdots\\
p_{g1} & p_{g2} & \cdots & p_{gm}
\end{pmatrix}.$$

More briefly, (14.1a) can be written as

(14.1b)   E(X) = A ξ,

where X is N × m, A is N × g, and ξ is g × m, and where the definition of the
three matrices X, A, and ξ of (14.1b) is obvious upon comparing (14.1a) with
(14.1b). (The A and ξ above are of course different from the A and ξ of
Section 7.) The hypothesis (13.1) can be written in matrix form as

(14.2)   C ξ I = 0,

with C of order (g-1) × g, ξ of order g × m, I of order m × m, and 0 of order
(g-1) × m,
where I is the identity matrix, 0 is the null matrix, and

(14.3)
$$C = \begin{pmatrix}
1 & -1 & 0 & \cdots & 0\\
1 & 0 & -1 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 0 & \cdots & -1
\end{pmatrix}.$$
We identify (14.1) and (14.2) respectively with equations (1)
and (2) of [18]. In calculating S_h and S_e from equations (5) of [18], we use
X for X, I for V, A [see (14.1a) and (14.1b)] for A_1, and C (14.3) for C_1.
The three parameters given by equations (7) of [18] are respectively

(14.4)   s* = min(g-1, m),   m* = ½(|m-g+1| - 1),   n* = ½(N-g-m-1).

If there are just g = 2 groups, then our multivariate analysis of variance
test is a Hotelling T² test, so that we essentially have an ordinary F-test
(d.f. = m, N-m-1) when g = 2.
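The g = 2 case can be sketched in Python (a hypothetical illustration, not the report's own computation; the 0/1 data are invented): Hotelling's T² comparing the two groups' mean item-score vectors, referred to F with d.f. (m, N-m-1) as stated above, using the standard two-sample pooled-covariance form.

```python
import numpy as np

# Two-group Hotelling T^2 on the m item-score means (sketch; data invented).
def hotelling_two_group(X1, X2):
    N1, m = X1.shape
    N2 = X2.shape[0]
    N = N1 + N2
    d = X1.mean(axis=0) - X2.mean(axis=0)          # difference of item-difficulty estimates
    S = (np.cov(X1, rowvar=False) * (N1 - 1) +
         np.cov(X2, rowvar=False) * (N2 - 1)) / (N - 2)   # pooled covariance matrix
    T2 = (N1 * N2 / N) * d @ np.linalg.inv(S) @ d  # requires inverting an m x m matrix
    F = T2 * (N - m - 1) / ((N - 2) * m)           # refer to F with d.f. (m, N - m - 1)
    return T2, F

rng = np.random.default_rng(1)
X1 = (rng.random((60, 4)) < 0.6).astype(float)     # group 1: 60 examinees, 4 items
X2 = (rng.random((50, 4)) < 0.6).astype(float)     # group 2: 50 examinees
T2, F = hotelling_two_group(X1, X2)
print(round(F, 3))
```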
The multivariate analysis of variance test assumes that (under
the null hypothesis) the m × m variance matrix of an individual's m item scores
is the same for every individual, regardless of the group to which he belongs.
The assumption that the variance matrices for all groups are the same is
equivalent to the assumptions

(14.5)   p_{1k} - p_{1k}² = p_{2k} - p_{2k}² = ... = p_{gk} - p_{gk}²   for all k,

and

(14.6)   p_{1kK} - p_{1k}p_{1K} = p_{2kK} - p_{2k}p_{2K} = ... = p_{gkK} - p_{gk}p_{gK}   for all k ≠ K,

if every examinee attempts all items (as is the case in a pure power test).
Now (14.5) is obviously satisfied whenever the null hypothesis (13.1, 14.2)
is true, and (14.6) will hold (under the null hypothesis) if

(14.7)   p_{1kK} = p_{2kK} = ... = p_{gkK}

for all k ≠ K.
It would appear that (14.7) would not ordinarily be an unreasonable assumption
if the null hypothesis itself holds, so that we would not usually have any
reason to be concerned about the legitimacy of our multivariate analysis of
variance test. We might mention, however, that when g = 2, an alternative
test of the hypothesis (13.1) is available which does not require the assump-
tion (14.7) (or a corresponding assumption when the test X is not a pure power
test). This alternative test of (13.1) when g = 2 is the test which Bennett
[2] proposed as a multivariate generalization of Scheffé's test [20] for the
Behrens-Fisher problem, and it is covered by Anderson [1, p. 118 ff.] as well
as in [2]. For the case N_1 = N_2, Bennett's test is easily described: one
first pairs the two groups (either via random numbers, which brings in an
element of arbitrariness, or by some kind of matching which is done without
knowledge of the results of the test X), and one then calculates Hotelling's
T² based on the sets of m differences between the item scores (X_{ijk}'s) of the
Group 1 member and the Group 2 member of each pair. [Note that all m differences
will have mean 0 if and only if (13.1) is true.]
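A sketch of this paired device in Python (hypothetical data; the "pairing" here is just row order, as if assigned by random numbers): a one-sample Hotelling T² on the m within-pair differences, referred to F with d.f. (m, n-m) in the standard one-sample form.

```python
import numpy as np

# Bennett's paired device for N1 = N2 (sketch; data invented): form the m
# item-score differences within each pair and test that all m mean
# differences are zero.
def paired_T2(D):                    # D: n pairs x m item-score differences
    n, m = D.shape
    dbar = D.mean(axis=0)
    S = np.cov(D, rowvar=False)      # sample covariance of the differences
    T2 = n * dbar @ np.linalg.inv(S) @ dbar
    F = T2 * (n - m) / ((n - 1) * m) # refer to F with d.f. (m, n - m)
    return T2, F

rng = np.random.default_rng(2)
X1 = (rng.random((40, 3)) < 0.65).astype(float)   # Group 1 member of each pair
X2 = (rng.random((40, 3)) < 0.55).astype(float)   # Group 2 member of each pair
T2, F = paired_T2(X1 - X2)
print(round(T2, 3), round(F, 3))
```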
The technique which was just outlined would also be used if one
should have a number of pairs of identical twins who were reared separately
in homes of differing SES and one wishes to determine whether SES has any
effect on the average item scores (see Section 13). In this case, though, we
would be using the technique not so much because of any doubt about the
assumption of equal variance matrices being satisfied, but rather to take
advantage of the natural pairing which exists and thereby obtain a test with
superior power.
Actually, in order to use this technique with the test scores of
separately-reared identical twins, it would not even be necessary for the two
twins in a pair to be in two different SES "groups". It would suffice if one
88
.-I
I
I
I
I
I
I
_I
I
I
I
I
I
I
.-I
I,
I
I.
I
I
twin in each pair could be identified as being in a higher SES home than the
other twin (it wouldn't matter how little or how much "higher"), and then one
would always subtract the lower SES twin's score from the higher SES twin's
score in obtaining the differences. If SES had no effect on the test results,
then the test statistic would follow the proper null distribution (Hotelling's T²).
15. Testing for the equality of the groups with respect to mean total score only.

There may be some cases where (13.1) does not hold because different groups
are strong on different kinds of items, but where we might still expect the
different groups to perform the same on the test as a whole. In such cases,
we might consider for a priori reasons that the test would be biased unless
the various kinds of items are balanced in such proportions that the mean
total score on the test is the same for all groups, or we might simply wish
for other reasons to test the hypothesis that all g groups are the same with
respect to mean total score. The hypothesis which we are interested in testing
in these cases is specified by (13.2) (remember that p_{i.} is the population
mean of the total score for group i).
The testing of the hypothesis (13.2) can be accomplished by quite
elementary means, if one is willing to grant the assumption that the variances
of the X_ij's under (13.2) are the same for all groups (plus the usual assumption of normality).
One simply refers the statistic

\[ (15.1)\qquad F = \frac{\sum_{i=1}^{g} N_i (\bar X_{i\cdot} - \bar X_{\cdot\cdot})^2 / (g-1)}{\sum_{i=1}^{g} \sum_{j=1}^{N_i} (X_{ij} - \bar X_{i\cdot})^2 / (N-g)}\,, \]

where

\[ (15.2)\qquad \bar X_{i\cdot} = (1/N_i) \sum_{j=1}^{N_i} X_{ij}\,, \qquad \bar X_{\cdot\cdot} = (1/N) \sum_{i=1}^{g} \sum_{j=1}^{N_i} X_{ij}\,, \]

to the F-distribution with d.f. = g-1, N-g. In other words, one just uses the
ordinary F-test for one-way analysis of variance.
This would appear to be an
adequate way of testing statistically for the equality of all groups with
respect to over-all performance on the test X.
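As a concrete sketch (not part of the original report), the ordinary one-way analysis-of-variance F of (15.1)-(15.2) can be computed with standard software; the group total scores below are invented for illustration.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical total-score samples for g = 3 groups (invented data).
group_scores = [
    np.array([21.0, 24.0, 19.0, 26.0, 23.0]),
    np.array([22.0, 25.0, 27.0, 24.0, 21.0, 26.0]),
    np.array([18.0, 20.0, 22.0, 17.0, 21.0]),
]

# Ordinary one-way ANOVA F, referred to the F distribution with d.f. g-1, N-g.
F, p_value = f_oneway(*group_scores)

# The same statistic computed directly from (15.1) and (15.2).
N = sum(len(x) for x in group_scores)
g = len(group_scores)
grand_mean = np.concatenate(group_scores).mean()
between = sum(len(x) * (x.mean() - grand_mean) ** 2 for x in group_scores) / (g - 1)
within = sum(((x - x.mean()) ** 2).sum() for x in group_scores) / (N - g)
F_direct = between / within
```

The library routine and the direct computation from (15.1) agree exactly, which is a useful check when adapting the formulas.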
16. Bias analysis and item-group interaction.
Section 13 indicated not only that it would not always be reasonable to define bias to be the same thing as item-group interaction, but also that there
would be some question as to how to define item-group interaction in the first
place. In this section, we will consider different possible definitions of
item-group interaction. We will also discuss the testing of certain interaction
hypotheses in conjunction with the different definitions, since one might wish
to test such hypotheses whether or not one is able to define bias to be equivalent to interaction.
Most of our development in this section will be in the
context of the test X being a pure power test, although some of the material
may also be partially applicable if the test has an element of speed in it.
The problem of how to define item-group interaction is a knotty
one. One obvious way was already alluded to in Section 13. One can say that
no item-group interaction exists if

\[ (16.1)\qquad p_{I1} - p_{i1} = p_{I2} - p_{i2} = \cdots = p_{Im} - p_{im} \quad \text{for all } i \text{ and } I. \]
We now describe a multivariate analysis of variance test which
might be used to test the interaction hypothesis (16.1). Our expectation
model is the same as (14.1). The hypothesis (16.1) can be written the same
as is indicated by (14.2) and (14.3), except that the matrix I (mxm) in (14.2)
is replaced by
·.. 1
-1
·..
-1
·..
.. ..
••
••
..
..
·.. -1
1
1
0
0
0
(16.2)
V=
m x(m-l)
0
0
,
0
and the null matrix on the right side of (14.2) is (g -1) x (m -1) rather
than (g -1) x m. As we have done in specifying previous multivariate analysis
of variance tests, we simply indicate here the identification of the appropriate
matrices in the reference [18]. In calculating S_h and S_e from equations (5)
of [18], we use X [as given in (14.1a) and (14.1b)] for X, V (16.2) for V,
A [as given in (14.1a) and (14.1b)] for A_1, and C (14.3) for C_1. The three
parameters given by equations (7) of [18] are respectively

\[ (16.3)\qquad s^* = \min(g-1,\, m-1), \qquad m^* = \tfrac{1}{2}(|g-m| - 1), \qquad n^* = \tfrac{1}{2}(N-g-m). \]

If g = 2, the test reduces essentially to an F-test with d.f. = m-1, N-m.
Note that, as in Section 14, we are involved in an approximation with respect
to the normality of the X_ijk's, but again there should be no difficulty if the
N_i's are large.
We indicated in Section 14 [see (14.5)-(14.7) and accompanying
discussion] that it would not be unreasonable to assume equality of the
variance matrices of the g different groups under the null hypothesis (13.1,
14.2). Unfortunately, however, a similar statement does not apply with respect
to the null hypothesis (16.1), because not even the diagonal elements of the
variance matrices of the different groups will be exactly equal under the null
hypothesis (16.1) (except in certain highly specialized cases). However, if
the differences between groups under (16.1) are small, or if all the p_ik's are close
enough to 1/2 so that the quantities p_ik(1 - p_ik) are all close to 1/4, then the
assumption of equality of the variance matrices of the different groups under
the null hypothesis would be approximately satisfied, and the test described
in the last paragraph could properly be used. On the other hand, the assumption
of equal variance matrices under the null hypothesis would be severely
violated with certain values of the p_ik's.
To handle these latter cases, an alternative multivariate test
technique is available if there are only g = 2 groups. To use the technique,
one starts off just as with Bennett's test [2] (see Section 14). We specify
the details for the case N_1 = N_2. One pairs the members of the two groups,
either randomly or by some type of matching which involves no knowledge of
the X_ijk's. We then consider the quantities
\[ (16.4)\qquad D_{jk} = X_{2jk} - X_{1jk} \qquad (j = 1, \ldots, N_1;\; k = 1, \ldots, m), \]

which are the item score differences between the two members of a pair.
Let us define

\[ (16.5)\qquad \delta_k = p_{2k} - p_{1k} \qquad (k = 1, 2, \ldots, m)\,. \]

Then we have

\[ (16.6a)\qquad E \begin{bmatrix} D_{11} & D_{12} & \cdots & D_{1m} \\ D_{21} & D_{22} & \cdots & D_{2m} \\ \vdots & & & \vdots \\ D_{N_1 1} & D_{N_1 2} & \cdots & D_{N_1 m} \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \begin{bmatrix} \delta_1 & \delta_2 & \cdots & \delta_m \end{bmatrix} \]
or, in matrix form,

\[ (16.6b)\qquad E\,D = A\,\Delta\,, \]

where the definitions of D, A, and Δ in (16.6b) are obvious upon comparing
(16.6a) with (16.6b) (the A and Δ here are not the same as previous A's and Δ's).
Note that there is no problem of unequal variance matrices under (16.6). The
hypothesis (16.1) can be written

\[ (16.7)\qquad \underset{1\times 1}{I}\;\; \underset{1\times m}{\Delta}\;\; \underset{m\times(m-1)}{V} = \underset{1\times(m-1)}{0}\,, \]

where I is a 1x1 identity matrix (i.e., a scalar equal to 1), V is the matrix
(16.2), and 0 is a null vector. Now we can identify (16.6) and (16.7) respectively
with equations (1) and (2) of [18], and obtain a multivariate analysis
of variance test. In calculating S_h and S_e from equations (5) of [18], we
use D [see (16.6a) and (16.6b)] for X, V (16.2) for V, A [see (16.6a) and (16.6b)]
for A_1, and the scalar 1 for C_1. The test reduces to an F-test with d.f. = m-1,
N_1 - m + 1.
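A minimal sketch of this paired test for g = 2 (not from the original report): Hotelling's one-sample T² applied to contrasts of the paired item differences, which reduces to the F-test with d.f. m − 1, N₁ − m + 1 just described. The 0/1 item scores below are simulated under no interaction.

```python
import numpy as np
from scipy.stats import f

def interaction_test_paired(D):
    """Test that the mean paired item differences delta_1, ..., delta_m of
    (16.5) are all equal (hypothesis (16.1) for g = 2).  D is the N1 x m
    matrix of paired differences (16.4).  Hotelling's T^2 is computed on the
    successive contrasts D_k - D_{k+1}, then converted to an F statistic
    with d.f. = m-1, N1-m+1."""
    n, m = D.shape
    W = D[:, :-1] - D[:, 1:]              # n x (m-1) contrast differences
    wbar = W.mean(axis=0)
    S = np.cov(W, rowvar=False)           # unbiased sample covariance of W
    T2 = n * wbar @ np.linalg.solve(S, wbar)
    p = m - 1
    F_stat = (n - p) / ((n - 1) * p) * T2
    p_value = f.sf(F_stat, p, n - p)      # d.f. = m-1, N1 - m + 1
    return F_stat, p_value

rng = np.random.default_rng(0)
# Simulated 0/1 item scores for N1 = 40 pairs and m = 4 items.
D = (rng.integers(0, 2, (40, 4)) - rng.integers(0, 2, (40, 4))).astype(float)
F_stat, p_value = interaction_test_paired(D)
```

The successive-difference contrasts are one convenient choice; any full-rank set of item contrasts yields the same test.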
We now turn to a second possible way of defining item-group
interaction. Instead of the definition (16.1), we can say that no item-group
interaction exists if

\[ (16.8)\qquad \sin^{-1}\sqrt{p_{I1}} - \sin^{-1}\sqrt{p_{i1}} = \sin^{-1}\sqrt{p_{I2}} - \sin^{-1}\sqrt{p_{i2}} = \cdots = \sin^{-1}\sqrt{p_{Im}} - \sin^{-1}\sqrt{p_{im}} \quad \text{for all } (i, I). \]
Defining interaction via (16.8) is useful for the purpose of
certain types of tests of the no-interaction hypothesis based on the arc-sine
transformation.
The use of the arc-sine transformation is as explained in
the third paragraph of Section 14:
one arbitrarily divides the sample from
each group into a number of sub-samples, and then the arc-sine transformation
is calculated for each sub-sample on each item.
The division into the sub-
samples is necessarily arbitrary, and also it would seem that some information
would be lost through the lumping together of all the individuals within a
sub-sample.
Against these disadvantages, however, must be weighed the impor-
tant advantage that the use of the arc-sine transformation eliminates any
problem of unequal variances among the different groups.
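The sub-sample arc-sine computation can be sketched as follows (the division into sub-samples is arbitrary, as noted above; the item-score data here are invented):

```python
import numpy as np

def arcsine_subsample_transforms(scores, n_subsamples):
    """scores: 0/1 item-score matrix (individuals x items) for one group.
    Arbitrarily divides the individuals into n_subsamples sub-samples and
    returns sin^{-1}(sqrt(p)) for each sub-sample on each item, where p is
    the sub-sample proportion passing the item.  Illustrative sketch only."""
    parts = np.array_split(scores, n_subsamples, axis=0)
    p = np.array([part.mean(axis=0) for part in parts])   # proportions passing
    # The arc-sine transform stabilizes the variance (approx. 1/(4n),
    # free of p), which is the advantage weighed in the text.
    return np.arcsin(np.sqrt(p))

rng = np.random.default_rng(1)
scores = (rng.random((60, 5)) < 0.6).astype(float)   # 60 examinees, 5 items
Z = arcsine_subsample_transforms(scores, n_subsamples=6)
```

Each row of Z (one per sub-sample) then plays the role of one "observation" in the repeated-measures or multivariate analyses described next.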
After the calculations of the arc-sine transformations have been
effected, the procedure which is indicated in both [4] and [5] is to analyze
the resulting figures and test for item-group interaction by using the model
of the two-factor experiment with repeated measures on one factor (see [24],
p. 302ff.; or see [13], the statistic F_3 in particular). In thus using the
statistic F_3 of [13] to test the no-interaction hypothesis (16.8), one assumes
that the item inter-correlations are all equal. (Or, alternatively, one may
reduce the degrees of freedom for the F-statistic and obtain a conservative
test, as indicated in the references, but such a test would probably be much
too conservative in our applications here because the number of items m will
generally be large. On the other hand, the assumption of equal inter-item
correlations would seem to be a questionable one if there is an element of
speed in the test.)
If
one assumes equality of the correlation coefficients
between all pairs of items, one does not, in addition, need to specify a
value for this common correlation coefficient, as was the case for a similar
test described in Section 14.
Instead of proceeding as just described after the arc-sine
transformations have been calculated, one can, alternatively, use a multivariate
test procedure similar to that described in the third paragraph of this section.
With this multivariate approach we would no longer have to assume equality of
the inter-item covariances, but there would be two countervailing disadvantages.
First, we would have to invert a large (m x m) matrix. Second, the degrees of
freedom n* of (16.3) might be quite small [since the number N in (16.3) will
no longer represent the total number of individuals, but rather will now represent just the total number of sub-samples]; in fact, n* might even be negative,
in which case the multivariate test would of course not be possible.
At this point we remark that both the definition (16.1) and the
definition (16.8) possess a quirk which might for some purposes be considered
objectionable. Let us consider (16.1). Suppose there are just g = 2 groups,
and suppose that (16.1) holds with δ_k (16.5) equal to +0.2 for all m items
(k = 1, 2, ..., m).
Suppose now that we put into the test an additional item
whose difficulty is .18 for Group 2.
Then this would automatically cause us
to have interaction, because, in order for us to maintain our absence of
interaction, the difficulty of this same item for Group 1 would have to be
-.02, an impossible value.
By the same token, if we were to introduce into
the test an additional item with difficulty of, say, .87 for Group 1, this
also would automatically force us to have interaction, because the difficulty
of this same item for Group 2 would have to be the impossible value of 1.07
in order for (16.1) to continue to hold.
Thus, for certain possible values
of Pik there is no possible value of PIk under (16.1), and vice versa.
It is easily verified that the same kind of quirk exists with
respect to the definition (16.8). It might be felt, however, that these
quirks are not serious so long as all the item difficulties for all the groups
are sufficiently far away from the two extremes of 0 and 1.
On the other hand, if there are some item difficulties which are
rather close to 0 or 1, these quirks in the definitions (16.1) or (16.8) might
cause us to conclude that interaction exists when, in fact, we would not really
want to draw such a conclusion. Thus at this point we might want to consider
still further possible definitions of interaction which would not be burdened
with the type of quirk which affects (16.1) and (16.8).
Many such definitions
which are free of this quirk could undoubtedly be devised; here, however, we
will confine ourselves to suggesting just one possibility.
We will indicate
later that there are still more problems connected with the definition of
item-group interaction, even if one is able to resolve the matter of the quirk
which we have been dwelling upon.
The quirk associated with the extremes of 0 and 1 can be avoided
if one works with a transformation of the p_ik's which carries a p_ik-value of
0 into minus infinity and a p_ik-value of 1 into plus infinity. Thus, e.g.,
we can use a definition which says that no item-group interaction exists if

\[ (16.9)\qquad \Phi^{-1}(p_{I1}) - \Phi^{-1}(p_{i1}) = \Phi^{-1}(p_{I2}) - \Phi^{-1}(p_{i2}) = \cdots = \Phi^{-1}(p_{Im}) - \Phi^{-1}(p_{im}) \quad \text{for all } (i, I), \]

where Φ⁻¹(p) denotes the inverse of the normal cumulative distribution function,
i.e., x = Φ⁻¹(p) is the value such that

\[ (16.10)\qquad \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-z^2/2}\, dz = p\,. \]
For measuring item difficulty, an "index of difficulty", delta,
which we will denote by Δ = Δ(p), is sometimes defined as follows. Δ is
defined to be that value such that

\[ (16.11)\qquad \frac{1}{4\sqrt{2\pi}} \int_{\Delta}^{\infty} e^{-\frac{1}{2}\left(\frac{w-13}{4}\right)^2}\, dw = p\,. \]

We may mention that, if we use a definition of item-group interaction which
says that no interaction exists if

\[ (16.12)\qquad \Delta(p_{I1}) - \Delta(p_{i1}) = \Delta(p_{I2}) - \Delta(p_{i2}) = \cdots = \Delta(p_{Im}) - \Delta(p_{im}) \quad \text{for all } (i, I), \]

then this definition (16.12) based on the differences between the deltas is
equivalent to (16.9).
A discussion given by Eells [9, pp. 168-172] would seem to support
the notion that it would be more sensible to use (16.9) than (16.1) to define
item-group interaction. In order to measure the difference between two groups
on an item, he utilizes a difficulty index which is equivalent to Φ⁻¹(p) or
Δ(p), and argues that this is most logical because the underlying distribution
of the trait being measured is likely to be normally distributed or approximately
so. He thus claims that his index provides a more accurate measure of the
true difference in difficulty for the two groups than does the simple difference (p_Ik - p_ik) which appears in (16.1).
We may mention that there is a certain limited relationship
between the subject of probit analysis and the application of indexes such as
Φ⁻¹(p) or Δ(p). The area of probit analysis is covered, e.g., by Finney's
book [12].
The definition (16.9) is perhaps theoretically sounder than (16.1)
or (16.8), although one finds from looking at the tables that there would seem
to be little practical difference among these three definitions so long as all
of the p_ik's are fairly close to .5.
Even though one might decide that the
definition (16.9) is a desirable one, however, one would still have to solve
the problem of devising a significance test of the hypothesis (16.9). This
would not appear to be easy, except that, if we have the special case g = 2
and if N_1 and N_2 are large enough, we might proceed as follows. We could
arbitrarily divide each of our two samples (from the two groups) into a number
of sub-samples of equal and sufficiently large size; calculate Φ⁻¹ for each
item and each sub-sample, based on the proportion of individuals in the sub-sample who pass the item; and then use these values of Φ⁻¹ to calculate a
multivariate test analogous to that described in connection with equations (16.4)-(16.7).
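To illustrate the scales involved, here is a small numerical sketch (not from the original report). The closed form delta = 13 − 4·Φ⁻¹(p) is the one implied by (16.11) as reconstructed here, so delta differences are exactly −4 times the Φ⁻¹ differences of (16.9):

```python
from scipy.stats import norm

def delta_index(p):
    """Index of difficulty implied by (16.11): delta = 13 - 4 * Phi^{-1}(p).
    Hard items (small p) receive large deltas."""
    return 13.0 - 4.0 * norm.ppf(p)

# Two difficulties fairly close to .5, where the text notes the
# three definitions make little practical difference.
p_I, p_i = 0.55, 0.45
diff_simple = p_I - p_i                            # the scale of (16.1)
diff_probit = norm.ppf(p_I) - norm.ppf(p_i)        # the scale of (16.9)
diff_delta = delta_index(p_i) - delta_index(p_I)   # the scale of (16.12)
```

Since the delta scale is a linear function of the probit scale, (16.12) and (16.9) define identical classes of interaction-free configurations, as the text asserts.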
Even though we agree to the definition (16.9) and even though we
might be able to satisfactorily test the hypothesis (16.9), there is still a
basic difficulty concerning the matter of definition of item-group interaction
which we alluded to just before we presented (16.9), and which we are now
ready to explore in detail. We have presented three possible definitions of
item-group interaction (16.1, 16.8, 16.9), and could present many more. The
three definitions (16.1, 16.8, 16.9) are not equivalent to one another. Thus,
e.g., if the relations (16.9) hold, this will automatically ensure (except in
very specialized cases) that (16.1) and (16.8) will not hold; or (e.g.) if
(16.1) holds, then (16.8) and (16.9) generally will not hold.
This means
that, e.g., if (16.1) holds and if we use (16.8) to define item-group interaction,
then we may well end up rejecting the hypothesis of no interaction; it
would appear that such an outcome would reflect not any genuine "interaction",
but rather an artifact of our definition of interaction.
It would seem that the exact nature of what is meant by "no item-group
interaction" is sufficiently uncertain and indeterminable that, in many
cases, one should not attempt to designate a definition which is as highly
specific as (16.1), (16.8), or (16.9). Otherwise, one may get into the kind
of trouble just described, where one may end up rejecting a hypothesis of
"no interaction" simply because of the definition which one has chosen.
With a view to finding a possible way of extricating ourselves from the difficulties
of definitions which are excessively specific, let us note that there is one
feature which is common to all three of the definitions (16.1, 16.8, 16.9) and
to any other reasonable definition of item-group interaction as well: any
defensible definition would have to imply that we would say that item-group
interaction does exist if the condition

\[ (16.13)\qquad (p_{IK} - p_{Ik})(p_{iK} - p_{ik}) \ge 0 \quad \begin{array}{l}\text{for all pairs of different groups } (i, I)\\ \text{and all pairs of different items } (k, K)\end{array} \]

is not satisfied, or if the condition

\[ (16.14)\qquad (p_{IK} - p_{iK})(p_{Ik} - p_{ik}) \ge 0 \quad \begin{array}{l}\text{for all pairs of different groups } (i, I)\\ \text{and all pairs of different items } (k, K)\end{array} \]

is not satisfied.
Thus, if there is any pair of groups and any pair of items such
that the easier of the two items for one group is the more difficult of the two items
for the other group, then (16.13) would not hold; and if there is any pair of
items and any pair of groups such that the better-scoring of the two groups
on one item is the worse-scoring of the two groups on the other item, then
(16.14) would not hold. In either situation, we would certainly want to say
that interaction is present.
Thus we are now led to suggest the following approach to the
problem of how to define item-group interaction. Let us base our approach on
a concept which says that we will consider that item-group interaction does
exist if (16.13) or (16.14) is not satisfied. For the situation where (16.13)
and (16.14) are satisfied, we will not attempt under this approach to specify
any conditions on the p_ik's which would be defined as constituting either
interaction or absence of interaction, and in this way we will avoid the
problems which arise from overly specific definitions of interaction. Thus,
if we reject the null hypothesis (16.13) or the null hypothesis (16.14) upon
running an appropriate significance test, we will conclude that interaction
exists; if we do not reject the null hypothesis in question, we will simply
take the attitude that we have found no evidence that establishes that any
interaction exists.
We now examine the question of how to test the hypothesis (16.13).
It will turn out that it is not too hard to obtain a test of the hypothesis
for a single set of values of (i, I, k, K), i.e., of the hypothesis

\[ (16.15)\qquad (p_{IK} - p_{Ik})(p_{iK} - p_{ik}) \ge 0 \]

for a single pair of items and a single pair of groups; we start by considering
this problem. Since we
are assuming that the test X is a pure power test, every item is attempted by
every examinee. Let p_i(k)K denote the proportion of group i which answers item
k incorrectly and item K correctly, and let p_ik(K) denote the proportion of
group i which answers item k correctly and item K incorrectly (it is understood
that similar definitions apply to group I); these p's are of course true (population) values, not sample estimates. Then (16.15) is equivalent to

\[ (16.16)\qquad (p_{I(k)K} - p_{Ik(K)})(p_{i(k)K} - p_{ik(K)}) \ge 0\,. \]

Now if we define

\[ (16.17)\qquad P_i = \frac{p_{i(k)K}}{p_{i(k)K} + p_{ik(K)}}\,, \qquad P_I = \frac{p_{I(k)K}}{p_{I(k)K} + p_{Ik(K)}}\,, \]

then (16.16) is equivalent to

\[ (16.18)\qquad (P_I - \tfrac{1}{2})(P_i - \tfrac{1}{2}) \ge 0\,. \]

Thus the hypothesis (16.15) is equivalent to the hypothesis (16.18).
Note
that P_i (16.17) is simply the proportion of individuals who answer item K
correctly among those individuals who mark either item k or item K (but not
both) correctly. Let N_i(k)K be the number of examinees in the sample from the
i-th group who mark item k incorrectly and item K correctly, and let N_ik(K)
be the number who mark k correctly and K incorrectly. Let us define

\[ (16.19)\qquad N_{i*} = N_{i*}(k, K) = N_{i(k)K} + N_{ik(K)}\,, \qquad \hat P_i = N_{i(k)K} / N_{i*}\,. \]

Note that $\hat P_i$ will have mean P_i.
Our test of the hypothesis (16.18) [i.e., of the hypothesis
(16.15)] is as follows. If $\hat P_i$ and $\hat P_I$ are both ≤ 1/2 or both ≥ 1/2, then we
obviously cannot reject the hypothesis (16.18) (more specifically, the significance level will be no smaller than .5). If $\hat P_i < \frac12$ and $\hat P_I > \frac12$, then the
significance level for such a result is at least as small as

\[ (16.20)\qquad \max\left[ \left(\tfrac{1}{2}\right)^{N_{i*}} \sum_{h=0}^{N_{i(k)K}} \binom{N_{i*}}{h}\,,\;\; \left(\tfrac{1}{2}\right)^{N_{I*}} \sum_{h=0}^{N_{Ik(K)}} \binom{N_{I*}}{h} \right]; \]

i.e., we will have a conservative test of the hypothesis (16.15) if we quote
(16.20) as the significance level. If $\hat P_i > \frac12$ and $\hat P_I < \frac12$, then the significance
level is the expression (16.20) with i and I interchanged. [The expression
"max" in (16.20) indicates that we take the greater of the two quantities inside the square brackets.]
A numerical example may serve to clarify how to use the test of
the hypothesis (16.15). Suppose there are N_i = 100 examinees in the sample
from group i, of whom 61 mark both item k and item K correctly, 29 mark neither
item correctly, 2 get k wrong and K right, and 8 mark k right and K wrong.
Suppose the sample from group I consists of N_I = 101 examinees, of whom 54,
33, 11, and 3 respectively fall into the four categories just indicated. The
examinees who get both items right or both items wrong are "thrown away", so
to speak.
The relevant figures are

\[ (16.21)\qquad N_{i(k)K} = 2\,, \quad N_{ik(K)} = 8\,, \quad N_{I(k)K} = 11\,, \quad N_{Ik(K)} = 3\,, \]

from which we obtain also

\[ (16.22)\qquad N_{i*} = 10\,, \quad N_{I*} = 14\,, \quad \hat P_i = .2\,, \quad \hat P_I = .786\,. \]

Since $\hat P_i < \frac12$ and $\hat P_I > \frac12$, we use (16.21) and (16.22) to find that the significance
level (16.20) is

\[ (16.23)\qquad \max\left[ (1 + 10 + 45)/1024\,,\; (1 + 14 + 91 + 364)/16384 \right] = \max[.055,\; .029] = .055\,. \]
Thus we conclude that, if the null hypothesis (16.15) [i.e., (16.18)] is true,
then, no matter what may be the true values of P_i and P_I or of the other
parameters [so long as (16.15) is satisfied], a result as extreme as (or more
extreme than) (16.21) could occur by chance no oftener than 5.5% of the
time.
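The conservative single-pair test, including the numerical example just given, can be sketched as follows (the code is an illustration, not part of the original report):

```python
from math import comb

def pair_significance(N_i_kK, N_i_Kk, N_I_kK, N_I_Kk):
    """Conservative significance level (16.20) for the single-pair
    hypothesis (16.15)/(16.18).  Arguments are the counts of examinees
    marking (k wrong, K right) and (k right, K wrong) in groups i and I.
    Returns None when both sample proportions P-hat fall on the same side
    of 1/2, in which case the hypothesis cannot be rejected."""
    Ni, NI = N_i_kK + N_i_Kk, N_I_kK + N_I_Kk        # N_{i*} and N_{I*}
    Pi_hat, PI_hat = N_i_kK / Ni, N_I_kK / NI
    if (Pi_hat - 0.5) * (PI_hat - 0.5) >= 0:
        return None
    if Pi_hat < 0.5:
        a_i, a_I = N_i_kK, N_I_Kk    # tail counts on the extreme sides
    else:
        a_i, a_I = N_i_Kk, N_I_kK    # the case with i and I interchanged
    tail_i = sum(comb(Ni, h) for h in range(a_i + 1)) / 2 ** Ni
    tail_I = sum(comb(NI, h) for h in range(a_I + 1)) / 2 ** NI
    return max(tail_i, tail_I)

# The numerical example of the text: counts 2 and 8 for group i,
# 11 and 3 for group I; the quoted significance level is .055.
level = pair_significance(2, 8, 11, 3)
```

Here `level` reproduces (16.23): max[56/1024, 470/16384] ≈ .055.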
We must now indicate a formal proof that our test of the hypothesis
(16.15) does indeed have the (conservative) significance level (16.20) which
we have claimed for it. Given N_{i*} and N_{I*}, the variables N_{i(k)K} and N_{I(k)K}
are binomially distributed with respective parameters P_i and P_I, and they are
independent. (Our test is of course a conditional test, conditional with
respect to N_{i*} and N_{I*}.) Now consider the critical region which consists of
all points (N_{i(k)K}, N_{I(k)K}) such that

\[ (16.24)\qquad \{N_{i(k)K} \le a_i \text{ and } N_{I(k)K} \ge N_{I*} - a_I\} \quad\text{or}\quad \{N_{i(k)K} \ge N_{i*} - a_i \text{ and } N_{I(k)K} \le a_I\}\,, \]

where a_i and a_I are integers such that $a_i < \frac12 N_{i*}$ and $a_I < \frac12 N_{I*}$. The probability
(given N_{i*} and N_{I*}) that the point (N_{i(k)K}, N_{I(k)K}) will fall in the
critical region (16.24) is

\[ (16.25)\qquad \left[\sum_{h=0}^{a_i} \binom{N_{i*}}{h} P_i^{\,h} (1-P_i)^{N_{i*}-h}\right] \left[\sum_{h=0}^{a_I} \binom{N_{I*}}{h} P_I^{\,N_{I*}-h} (1-P_I)^{h}\right] + \left[\sum_{h=0}^{a_i} \binom{N_{i*}}{h} P_i^{\,N_{i*}-h} (1-P_i)^{h}\right] \left[\sum_{h=0}^{a_I} \binom{N_{I*}}{h} P_I^{\,h} (1-P_I)^{N_{I*}-h}\right]. \]
We find that the first derivative of (16.25) with respect to P_i can be written

\[ (16.26)\qquad N_{i*} \binom{N_{i*}-1}{a_i} P_i^{\,a_i} (1-P_i)^{a_i} \left\{ -(1-P_i)^{N_{i*}-2a_i-1} \left[\sum_{h=0}^{a_I} \binom{N_{I*}}{h} P_I^{\,N_{I*}-h} (1-P_I)^{h}\right] + P_i^{\,N_{i*}-2a_i-1} \left[\sum_{h=0}^{a_I} \binom{N_{I*}}{h} P_I^{\,h} (1-P_I)^{N_{I*}-h}\right] \right\}; \]
the expression for the first derivative of (16.25) with respect to P_I is
analogous to (16.26). From a detailed examination of these two first derivatives,
we are able to conclude that the maximum of the function (16.25) over
the region (16.18) must occur at one of the four points

\[ (16.27)\qquad (P_i, P_I) = (0, \tfrac12)\,,\quad (\tfrac12, 0)\,,\quad (\tfrac12, 1)\,,\quad (1, \tfrac12)\,. \]
[The region (16.18) is the region of the null hypothesis, and consists of two
squares whose respective sets of vertices are (0,0), (0,1/2), (1/2,0), (1/2,1/2) and
(1/2,1/2), (1/2,1), (1,1/2), (1,1).] Plugging the values (16.27) into (16.25), we
find that the resulting values of (16.25) are

\[ (16.28)\qquad \left(\tfrac{1}{2}\right)^{N_{I*}} \sum_{h=0}^{a_I} \binom{N_{I*}}{h} \qquad\text{and}\qquad \left(\tfrac{1}{2}\right)^{N_{i*}} \sum_{h=0}^{a_i} \binom{N_{i*}}{h}\,. \]
Therefore, the maximum value of the function (16.25) over the region (16.18),
i.e., the maximum probability that the null hypothesis will be rejected (16.24)
when it is true, is

\[ (16.29)\qquad \max\left[ \left(\tfrac{1}{2}\right)^{N_{i*}} \sum_{h=0}^{a_i} \binom{N_{i*}}{h}\,,\;\; \left(\tfrac{1}{2}\right)^{N_{I*}} \sum_{h=0}^{a_I} \binom{N_{I*}}{h} \right]. \]

Hence it follows that we can, in fact, conservatively quote (16.20) as the
significance level.
We have thus established a rather simple test of the hypothesis
(16.15) for a single pair of items (k, K) and a single pair of groups (i, I).
(Incidentally, note that there is, of course, only one possible pair of groups
if g = 2, so that we would need to be concerned only with different item pairs
in this case.) We must now consider the question of how to test the hypothesis
(16.13), i.e., how to test (16.15) simultaneously for all item pairs and all
group pairs. From a theoretical standpoint, this question poses some difficulties;
although the method we will suggest here is rather crude, it is a
simple and legitimate one and should be effective with sufficiently large
sample sizes.
Suppose that, in an initial sampling, we obtain the significance
level [equal to (16.20) if $(\hat P_i - \frac12)$ and $(\hat P_I - \frac12)$ have opposite signs, the result
being not significant otherwise] for each of the $\binom{g}{2}\binom{m}{2}$ possible (group pair)-(item pair)
combinations (i, I, k, K). Suppose now that we arrange the resulting values
of (16.20) in order, and suppose we pick out the (group pair)-(item pair) combinations
which have the M smallest values of (16.20), where M might be some
number in the vicinity of (say) 5 or 10 or 20 or 50. Next, let us take a
second (independent) sampling, and obtain the significance levels just for
those M combinations which were singled out from the first sampling. Let α_2
denote the smallest of these M significance levels. Then we will have a
conservative test of the over-all hypothesis (16.13) if we quote Mα_2 as the
significance level. (In case Mα_2 > 1, then of course the result is not significant.)
Incidentally, if we were to take just a single sampling and obtain the
significance levels for all $\binom{g}{2}\binom{m}{2}$ combinations, and if we denote the smallest
of these significance levels by α_1, then we could conservatively quote $\binom{g}{2}\binom{m}{2}\alpha_1$
as our significance level for a test of the over-all hypothesis (16.13), but
$\binom{g}{2}\binom{m}{2}\alpha_1$ would probably be too large a value ordinarily (and, in fact, might
easily be > 1). Thus it would appear to be better to take two samplings in
most situations.
The theoretical justification of the techniques just described
rests upon a Bonferroni inequality (see, e.g., [11], p. 100, second inequality
of (5.7) with m = 1). It would appear that a more sophisticated test of (16.13)
would be difficult to devise, because of complications arising from the fact
that results from the $\binom{g}{2}\binom{m}{2}$ different combinations are not independent.
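The two-sampling screening procedure might be sketched as follows; the combination keys and significance levels here are invented for illustration:

```python
def two_stage_level(first_stage_levels, second_stage_lookup, M):
    """Two-sampling screening test of the over-all hypothesis (16.13).
    first_stage_levels: dict mapping (i, I, k, K) combinations to the
    significance levels (16.20) obtained from the first sampling, with
    None meaning "not significant".  second_stage_lookup: function giving
    the level for a combination from the second, independent sampling.
    Quotes M times the smallest second-stage level among the M screened
    combinations (a conservative Bonferroni bound), capped at 1."""
    scored = [(lvl, combo) for combo, lvl in first_stage_levels.items()
              if lvl is not None]
    selected = [combo for lvl, combo in sorted(scored)[:M]]
    a2 = min(second_stage_lookup(c) for c in selected)
    return min(M * a2, 1.0)   # result not significant when M * a2 > 1

# Invented first-stage levels for three (group pair)-(item pair) combinations.
first = {(1, 2, 1, 3): 0.04, (1, 2, 2, 5): 0.20, (1, 2, 1, 4): 0.01}
# Invented second-stage levels for the M = 2 screened combinations.
second = {(1, 2, 1, 3): 0.03, (1, 2, 1, 4): 0.10}
level = two_stage_level(first, second.get, M=2)
```

With these invented numbers, the two screened combinations are the ones with first-stage levels .01 and .04, the smaller second-stage level is .03, and the quoted over-all level is 2 × .03 = .06.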
Finally, we examine briefly the question of obtaining a significance
test of the hypothesis (16.14); we recall that (16.14), along with
(16.13), is a necessary condition for absence of interaction. It might be
suspected that, in practice, interaction would exist more frequently as a
result of (16.13) being violated than as a result of (16.14) being violated,
and that if (16.14) is violated then normally (16.13) would also be violated.
(It might turn out in many cases that one group would be better than another
group on all m items, thereby causing (16.14) to hold, but in such cases (16.13)
might still be violated.) In any event, however, a test of (16.14) may sometimes
be useful, and, in fact, a violation of (16.14) might in some cases be
considered more worthy of note than a violation of (16.13).
We will consider how we might test the hypothesis (16.14) when
g = 2. We may start off in the same way as is indicated in the discussion
encompassing equations (16.4)-(16.6), i.e., we use the pairing approach of
Scheffé [20] and Bennett [2]. This time, however, we are interested in testing
the hypothesis (16.14), i.e., the hypothesis

\[ (16.30)\qquad \delta_1 \ge 0,\; \delta_2 \ge 0,\; \ldots,\; \delta_m \ge 0 \qquad\text{or}\qquad \delta_1 \le 0,\; \delta_2 \le 0,\; \ldots,\; \delta_m \le 0\,, \]

rather than the hypothesis (16.7, 16.1). Using the D_jk's [see (16.4)], we
can obtain an m-dimensional 100(1-α)% confidence region for (δ_1, δ_2, ..., δ_m),
and then reject the hypothesis (16.30) at level α if this confidence region
fails to intersect (i.e., has no points in common with) the region (16.30).
One possibility for obtaining such an m-dimensional 100(1-α)%
confidence region is to calculate m individual two-sided confidence intervals,
one for each δ_k, where each of these m intervals has a confidence coefficient
of 100[1-(α/m)]% and is based simply on the ordinary t-distribution. One
can then ascertain at once whether the resulting confidence region intersects
(16.30). In fact, one could easily determine the precise value of α for
which a confidence region of this sort would barely intersect the region (16.30);
one does this by calculating a t-statistic for each item and then finding
which positive t and which negative t are largest in absolute value, and one
finally quotes as the significance level the value of α which corresponds to
the lesser of these two maximum absolute values of t.
An alternative m-dimensional 100(1-α)% confidence region on the
δ_k's is the one given by (e.g.) Anderson [1, p. 108, Section 5.3.2]. To find
the precise value of α for which a confidence region of this sort would barely
intersect the region (16.30), one minimizes, with respect to the δ_k's but
subject to the restrictions (16.30), the quadratic form indicated by the left
side of formula (10) of [1, p. 108]; one then quotes as the significance level
the α-value corresponding to the value of T² which is represented by the
resulting minimum. The minimization problem which is encountered here can be
handled as two problems in quadratic programming, with the two problems
corresponding to the δ_k's being ≥ 0 and ≤ 0 in (16.30) respectively. Standard
techniques are available for solving quadratic programming problems (see, e.g.,
Vajda [23, p. 108ff.]), but they would require a computer.
It is not known which of the two types of confidence regions just
described can be expected to produce the more powerful test of (16.30). Also,
it is altogether possible that a third test of (16.30) could be developed
which would be more powerful than either of the tests which we have just presented.
If g > 2, then the problem of testing (16.14) becomes more
difficult. One crude but feasible approach would be to run one of the two tests
just described on all $\binom{g}{2}$ pairs of groups, find the lowest of the resulting
$\binom{g}{2}$ α-values (call it α', say), and then quote $\binom{g}{2}\alpha'$ as the over-all significance
level with respect to testing the hypothesis (16.14). The justification of
this approach rests upon a Bonferroni inequality.
ACKNOWLEDGMENT
The author wishes to thank Dr. William E. Coffman
of Educational Testing Service, who was responsible for
posing the bias problem to him.
REFERENCES
[1] Anderson, T. W., An Introduction to Multivariate Statistical
Analysis. John Wiley and Sons, New York (1958).

[2] Bennett, B. M., "Note on a Solution of the Generalized
Behrens-Fisher Problem". Annals of the Institute of Statistical Mathematics,
vol. 2, pp. 87-90 (1950-51).
[3] Business Week, "Hiring Tests Wait for the Score". Issue of
February 13, 1965, pp. 45-48.

[4] Cardall, Carolyn, and Coffman, William E., "A Method for
Comparing the Performance of Different Groups on the Items in a Test".
Research and Development Reports, 64-5, No. 9, College Entrance Examination Board
(November, 1964).

[5] Cleary, T. Anne, and Hilton, Thomas L., "A Proposal for an
Investigation of Test Bias" (Mimeographed). Educational Testing Service
(December, 1964).

[6] Coffman, William E., "Sex Differences in Responses to Items
in an Aptitude Test". The 18th Yearbook, National Council on Measurements
Used in Education, pp. 117-124 (1961).
[7] Coffman, William E., "Evidence of Cultural Factors in Responses
of African Students to Items in an American Test of Scholastic Aptitude".
Research and Development Reports, College Entrance Examination Board (June, 1963).

[8] Coffman, William E., "Principles of Developing Tests for the
Culturally Different". Proceedings, 1964 Invitational Conference on Testing
Problems, Educational Testing Service, pp. 82-92 (1965).

[9] Eells, Kenneth, Davis, Allison, Havighurst, Robert J., Herrick,
Virgil E., and Tyler, Ralph W., Intelligence and Cultural Differences. University
of Chicago Press, Chicago (1951).

[10] Fandell, Todd E., "Testing and Discrimination ... Issue in
Motorola Case: Do Job Tests Penalize Negroes?" Wall Street Journal, issue
of April 21, 1964, p. 18.

[11] Feller, William, An Introduction to Probability Theory and Its
Applications, Volume 1 (2nd edition). John Wiley and Sons, New York (1957).

[12] Finney, D. J., Probit Analysis (2nd edition). Cambridge University
Press, Cambridge (1952).

[13] Geisser, Seymour, and Greenhouse, Samuel W., "An Extension of
Box's Results on the Use of the F Distribution in Multivariate Analysis".
Annals of Mathematical Statistics, vol. 29, pp. 885-891 (1958).
[14] Gulliksen, Harold, Theory of Mental Tests. John Wiley and
Sons, New York (1950).

[15] Potthoff, Richard F., "The Prediction of College Grades from
College Board Scores and High School Grades". Mimeo Series No. 419,
Department of Statistics, University of North Carolina, Chapel Hill (1964).

[16] Potthoff, Richard F., "A Non-Parametric Test of Whether Two
Simple Regression Lines Are Parallel". Mimeo Series No. 445, Department of
Statistics, University of North Carolina, Chapel Hill (1965).

[17] Potthoff, Richard F., "Some Scheffé-Type Tests for Some
Behrens-Fisher-Type Regression Problems". Journal of the American Statistical
Association, vol. 60, pp. 1163-1190 (1965).

[18] Potthoff, Richard F., and Roy, S. N., "A Generalized Multivariate Analysis of Variance Model Useful Especially for Growth Curve Problems".
Biometrika, vol. 51, pp. 313-326 (1964).

[19] Roy, S. N., Some Aspects of Multivariate Analysis. John
Wiley and Sons, New York (1957).

[20] Scheffé, Henry, "On Solutions of the Behrens-Fisher Problem,
Based on the t-Distribution". Annals of Mathematical Statistics, vol. 14, pp.
35-44 (1943).
[21] U. S. News and World Report, "What Happens When Government
Polices Your Hiring". Issue of March 30, 1964, p. 87.

[22] U. S. News and World Report, "Aptitude Tests: Under a Cloud".
Issue of December 14, 1964, pp. 83-84.

[23] Vajda, S., Readings in Mathematical Programming (2nd edition).
John Wiley and Sons, New York (1962).

[24] Winer, B. J., Statistical Principles in Experimental Design.
McGraw-Hill Book Co., New York (1962).