Williams, J. S. (1961). An evaluation of the worth of some selection indices.

AN EVALUATION OF THE WORTH
OF
SOME SELECTION INDICES

BY

J. S. Williams, C. C. Cockerham, and S. N. Roy

This research was supported in part
by the Office of Naval Research under
Contract No. Nonr 486(04) (NR 042 202)

Institute of Statistics
Mimeo Series No. 281
April, 1961
NOTATION
The following forms of notation are used throughout this dissertation:
1. A lower case letter underlined, i.e., x, indicates a column vector.

2. An upper case letter, i.e., A, indicates a matrix. Unless specifically indicated all matrices are square-symmetric.

3. A prime (') following a matrix indicates the transpose of the matrix. In particular, x' is a row vector.

4. A subscript enclosed in parentheses and following the symbol for a variable indicates the numerical rank of the variable. For example, in an array of k x values, x_(i) has the i-th largest numerical value in the array.

5. The letter E with no other marking and preceding a function enclosed in parentheses or brackets stands for "the mathematical expectation of." This letter with a subscript directly below it, i.e., E_t, stands for "the mathematical expectation taken over the distribution of t of."

6. In order to conform to style, many formulae and equations had to be written on more than one line. If with two such lines, a dot precedes the first symbol on the second line, the formula is to be read "the last element of the first line multiplied by the first element of the second line."
TABLE OF CONTENTS

                                                              Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . .   vii
LIST OF ILLUSTRATIONS  . . . . . . . . . . . . . . . . . . .  viii

CHAPTER
1.0  INTRODUCTION  . . . . . . . . . . . . . . . . . . . . .     1
2.0  REVIEW OF LITERATURE  . . . . . . . . . . . . . . . . .     5
3.0  THE PROBABILITY OF SELECTING ONE OF THE FIRST m-RANKED
     VARIABLES FROM A SAMPLE OF k UNKNOWN VARIABLES BASED
     ON THE SELECTION OF THE MAXIMUM OBSERVABLE SUM OF THE
     UNKNOWN VARIABLES AND ERROR VARIABLES . . . . . . . . .    16
     3.1  Basic Formulation  . . . . . . . . . . . . . . . .    16
     3.2  Properties of P[1, 1, 2] when x and y Are Drawn
          from a Bivariate Normal Population . . . . . . . .    21
     3.3  Properties of P[1, 1, k], k > 2, When x and y
          Are Drawn from a Bivariate Normal Population . . .    27
     3.4  P[1, {m}, k] as a Monotonically Increasing Function
          of ρ_xy when x and y Are Drawn from a Bivariate
          Normal Population  . . . . . . . . . . . . . . . .    36
4.0  SAMPLE SELECTION INDICES DERIVED FROM THE PROBABILITY
     OF SELECTING ONE OF THE m LARGEST UNKNOWN VARIABLES
     BASED ON SELECTION OF THE LARGEST SAMPLE OBSERVATION  .    41
     4.1  General Method of Construction . . . . . . . . . .    41
     4.2  The Base Index and the Optimum Index . . . . . . .    42
     4.3  The Reduced Index  . . . . . . . . . . . . . . . .    44
     4.4  Coincidence of the Base Index and Optimum Index  .    49
5.0  METHODS OF COMPARING THREE SAMPLE INDICES . . . . . . .    53
6.0  AN ILLUSTRATION OF THE PROPERTIES OF SAMPLE ESTIMATES
     USED IN CONSTRUCTING A SELECTION INDEX
     6.1  Definition of Population and Proposed Estimates
     6.2  Moments Associated with a Two-Variate Estimated
          Index
     6.3  Variances of the Weights of the Two-Variate
          Estimated Index
     6.4  Correlation of Two-Variate Estimated Index with
          Worth
     6.5  p-Variate Properties for the Variance of the
          Estimated Weights and the Covariance of the
          Estimated Index and True Worth . . . . . . . . . .    84
7.0  SUMMARY AND CONCLUSIONS . . . . . . . . . . . . . . . .    89
     7.1  The Problem and Results  . . . . . . . . . . . . .    89
     7.2  Conclusions  . . . . . . . . . . . . . . . . . . .    91
     7.3  Suggestions for Further Research . . . . . . . . .    92
LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . .    99

LIST OF TABLES

TABLE                                                         Page
1.  Two-level, nested, multivariate analysis of variance . .    50
2.  Multivariate analysis of variance for estimation data  .    60

LIST OF ILLUSTRATIONS

Figure                                                        Page
1.  Graph of P[1, 1, 2]  . . . . . . . . . . . . . . . . . .    24
2.  An illustration of selection increase for k = 2  . . . .    26
3.  Graphs of P[1, 1, k] for k = 5, 10, 20, and 100  . . . .    37
4.  The effect of varying k on a comparison of two selection
    indices  . . . . . . . . . . . . . . . . . . . . . . . .    45
5.  A graphic comparison of expected gain and P[1, 1, k] . .    54
1.0 INTRODUCTION
The selection index as originally defined by Smith (1936) has been
applied widely in genetic research.
Theoretically, it is applicable also
to other fields of investigation where one wishes to select the best
m of a random sample of k linear functions of nonobservable, multivariate
normal variables,

$$\sum_{j=1}^{p} a_j g_{ij} \, ,$$

by observing k linear functions of these variables, each combined with
a normal element of error,

$$\sum_{j=1}^{p} c_j p_{ij} = \sum_{j=1}^{p} c_j (g_{ij} + e_{ij}) \, , \qquad i = 1, 2, \ldots, k \, .$$

The coefficients in both linear functions (the a_j and c_j) are known constants.
Smith's solution is in terms of the population parameters of the g_j
and e_j, which in practice must be estimated from sample data. For
more than two variates, these estimates are laborious to calculate
and in finite samples possess very complicated statistical properties.
These difficulties have prompted the following questions. Has the
selection index constructed with sample estimates of the population
parameters been proven reliable enough to merit its use? Would an
index, such as a linear combination of the observable variates using
the known weights of the nonobservable linear functions, be better
because the avoidance of estimated weights provides a simply constructed
tool with statistical properties which are easily determined?
To answer these questions, it is necessary first to develop a
criterion for comparing two or more proposed indices.
Smith's development is based on the average gain expected from use of a selection index.
However, if the program under study does not involve repeated selections
from the same population, this criterion is not necessarily the best
which can be used for comparisons. Expected gain is the mean of the
population of actual gains which are realized in random samples; it
does not include a measure of the sample-to-sample variation of the
realized gains.
If selection from one population is performed only
a few times, then it is important to combine variance and expected
gain into any single measure of the worth of an index.
The criterion
suggested in this dissertation is the probability of selecting one of
the first m of k nonobservable sample variables using a specified sample
index.
This criterion has not been studied seriously in the theory
of selection indices, although it is widely used in theoretical discussions of selection problems of univariate populations.
When, after comparing the proposed indices, it is apparent that
no one index is best for all populations, it becomes necessary to decide which one is best for a particular population.
Usually, this
is a difficult assignment because the current comparison criteria do
not reduce to parametric expressions which are statistically testable.
Also, the criteria are too complicated to permit the expression of
the worth of one index in terms of the worth of another.
At present
we can suggest only one solution; it is to describe sets of situations
where an index is considered best on the basis of its individual statistical properties and to describe the physical implications of these
properties.
As in all regression and prediction type problems, an important
question considered before the construction of a selection index is
how many variates should be measured on one sampling unit. Often the
element of error in a measured variate contributes more to the variation of the index than does the element of interest. Intuitively, in
this case, it seems that the measurement of such a variate might
profitably be omitted from the index, but this is not always correct
since the correlation pattern among the variates also must be considered. Although no statistical test for detecting all these variates
is known, and although it is unlikely that any single sample test of
this type exists, it is conceivable that a set of variates suspected,
a priori to selection, of being of little importance can be tested for
its contribution to the index.

An attempt has been made in this dissertation to develop a
theoretical discussion from which we can imply some of the answers to
the proposed questions. Briefly, the approach adopted is the following one.
The general problem of the probability of correct selection is
specialized in Chapter 3 to that of selecting one of the first m-ranked
unknown values of a random sample of k worth variables,

$$y_i = \sum_{j=1}^{p} a_j g_{ij} \, ,$$

where the sample index,

$$x_i = \sum_{j=1}^{p} c_j p_{ij} \, ,$$

and the unknown worth are constructed with random sampling units drawn
from p-variate, normal populations.

The indices considered in Chapter 4 are Smith's parametric solution, the index obtained by replacing the population parameters in
Smith's solution with sample estimates, and the sample index using
the known weights, a_j, of the nonobservable linear function for which
we wish to select. An example of the results obtainable using the third
index on a reduced set of proposed variates is presented to introduce
an exploratory discussion of how and when such variates can be omitted.
Some indication is given of how a suspected set of variates can be tested
for its contribution to the index.

A descriptive discussion is given in Chapter 5 for the comparison
of the first and third types of index. The second index is not directly
compared with either of the other two, but it is shown in some cases
possibly to produce intermediate results, in some cases definitely to
be inferior to the other indices. The variance of the index based on
sample estimates is attributable in part to these estimates. Although
the variances of the estimates have been considered in other discussions
of the selection index, there is no general, known, exact expression
for them. Approximations in the form of bounds which are more informative
than the approximations given in the literature are derived in Chapter 6
for estimates calculated for two variates from a specific type of
sampling scheme. This scheme is used because it does have wide application
and because it does provide estimates with simple distributional
properties. The most important results of the two-variate development
are extended to the general p-variate case.
2.0 REVIEW OF LITERATURE
Improvement of genetic material can be achieved in time by selection
programs which attempt to collect the best genotypes from each generation
for the purpose of propagating the following generation. Selection of
the best one of several lots of metal ore can be achieved by taking
that lot which is superior in the percent of element and minimum in
contaminants, or some acceptable balance of these. The designation
of a most valuable shipment of wool is achieved by selecting that one
of several shipments with the superior combination of long, clean,
white fibers.

Having decided on a system which weights each variate according
to its relative importance to the complete set of variates considered,
the best results in each of the foregoing examples are obtained by
selecting the largest sample sum of weighted variates. However, these
selections are complicated because most genotypes are measurable only
in terms of phenotypes and because ore content and wool shipments are
expressed in crude chemical and physical analyses and/or by judgments
of inspectors. This suggests that new linear combinations of phenotypes, of chemical and physical analytic results, and of judgments
should be found which reduce the variation due to errors in measurements, and therefore aid in selecting the superior individuals.

In general, let the vector g' = (g_1, g_2, ..., g_p) be the set of
variates determining the value of an individual unit in the population
sampled. The relative value of each variate g_i is a constant
a_i (a' = (a_1, a_2, ..., a_p)), and the cumulated value or worth of the
sampling unit is a'g. The variates g_i cannot be observed; instead,
p_i = g_i + e_i is recorded, where e_i is an element of error inherent in
the method of measuring the physical expression of g_i. The problem
is to find that set of coefficients c' = (c_1, c_2, ..., c_p) such that
the largest value of a'g in a sample of k is indicated with a high
degree of consistency by the largest value of the sample index c'p.
In the genetic literature, Smith (1936) gave the following
solution for a closely related but more general set of assumptions
and definitions. Let p be a vector of p phenotypic values of a
single plant variety, g be a vector of p genotypic values corresponding
to the phenotypic values, and a be a vector of relative weights
for each of the p genotypic values. Then a'g_j, j = 1, 2, ..., k, is
the genetic worth of the j-th variety of a sample of k different varieties
from which selections are made. If it is assumed that phenotype
is the expression of the sum of the genotype and environment (error),
each normally distributed, then p_j and g_j are drawn from the 2p-variate
normal distribution

$$\begin{pmatrix} p_j \\ g_j \end{pmatrix} \sim
N\left[ \begin{pmatrix} \mu_p \\ \mu_g \end{pmatrix},
\begin{pmatrix} \Sigma_p & \Sigma_{pg} \\ \Sigma_{pg}' & \Sigma_g \end{pmatrix} \right] \, . \eqno(2.1)$$

The notations Σ_p and Σ_g for partitions of the variance-covariance
matrix are used in place of the more conventional notations, Σ_pp
and Σ_gg, to help reduce the size of equations involving these matrices.
If the sample index weights c are constants, then the sample index
and genetic worth follow the bivariate normal distribution

$$\begin{pmatrix} c'p_j \\ a'g_j \end{pmatrix} \sim
N\left[ \begin{pmatrix} c'\mu_p \\ a'\mu_g \end{pmatrix},
\begin{pmatrix} c'\Sigma_p c & c'\Sigma_{pg} a \\ a'\Sigma_{pg}' c & a'\Sigma_g a \end{pmatrix} \right] \, . \eqno(2.2)$$

After having drawn a sample of k variables, c'p_j, the distribution
of the unknown a'g_j can be centered on the sample configuration
to create the linear regression equations

$$E(a'g_j \mid c'p_j) = a'\mu_g
+ \frac{a'\Sigma_{pg}' c}{c'\Sigma_p c} \, (c'p_j - c'\mu_p) \, . \eqno(2.3)$$

If we want to select those variables with phenotypic values falling
in the upper q fraction of the phenotypic population, then c'p_j must
be greater than or equal to some value (c'p)_0 to be among those selected.
(c'p)_0 is obtained from the incomplete normal distribution

$$q = \int_{(c'p)_0}^{\infty} (2\pi \, c'\Sigma_p c)^{-\frac{1}{2}}
\exp\left[ -\frac{(t - c'\mu_p)^2}{2 \, c'\Sigma_p c} \right] dt \, . \eqno(2.4)$$

Solving (2.4) for the lower limit of the integral provides the minimum
selection point, $(c'p)_0 = (c'\Sigma_p c)^{\frac{1}{2}} u_q + c'\mu_p$, where u_q is the lower limit
of the upper q fraction of the distribution of standard normal deviates.
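The selection point lends itself to a direct numerical sketch. The following modern illustration (not part of the original text) evaluates (c'p)_0 = (c'Σ_p c)^{1/2} u_q + c'μ_p for a hypothetical two-variate index; the covariance matrix, mean vector, and weights below are invented for the example.

```python
from statistics import NormalDist

def selection_point(c, Sigma_p, mu_p, q):
    """Minimum selection point (c'p)_0 = (c' Sigma_p c)^(1/2) u_q + c'mu_p,
    where u_q cuts off the upper q fraction of the standard normal
    distribution (see eq. (2.4)); example values are hypothetical."""
    p = len(c)
    # quadratic form c' Sigma_p c
    var_index = sum(c[i] * Sigma_p[i][j] * c[j]
                    for i in range(p) for j in range(p))
    # mean of the index, c' mu_p
    mean_index = sum(c[i] * mu_p[i] for i in range(p))
    u_q = NormalDist().inv_cdf(1.0 - q)  # lower limit of the upper q fraction
    return var_index ** 0.5 * u_q + mean_index

# Hypothetical two-variate example: Sigma_p, mu_p, and c are illustrative only.
Sigma_p = [[4.0, 1.0], [1.0, 2.0]]
mu_p = [10.0, 5.0]
c = [0.6, 0.4]
cutoff = selection_point(c, Sigma_p, mu_p, q=0.20)
```

For q = 1/2 the deviate u_q is zero and the cutoff is simply the mean of the index, which gives a quick check on the algebra; smaller q pushes the cutoff further into the upper tail.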
The expected gain in genetic worth by using the sample index,
c'p_j, and retaining all c'p_j ≥ (c'p)_0, is

$$\frac{a'\Sigma_{pg}' c}{(c'\Sigma_p c)^{\frac{1}{2}}} \cdot \frac{\phi(u_q)}{q} \, , \eqno(2.5)$$

where φ denotes the standard normal density. The expected gain can be
maximized by taking for the set of coefficients the maximizing solution
of c for

$$r = \frac{c'\Sigma_{pg} a}{(c'\Sigma_p c)^{\frac{1}{2}} (a'\Sigma_g a)^{\frac{1}{2}}} \, . \eqno(2.6)$$

Maximizing (2.6) is achieved by maximizing r², the square of (2.6),

$$r^2 = \frac{(c'\Sigma_{pg} a)^2}{(c'\Sigma_p c)(a'\Sigma_g a)} \, , \eqno(2.7)$$

$$\frac{\partial r^2}{\partial c} =
\frac{2(c'\Sigma_{pg} a)(c'\Sigma_p c)\,\Sigma_{pg} a
- 2(c'\Sigma_{pg} a)^2\,\Sigma_p c}
{(c'\Sigma_p c)^2 (a'\Sigma_g a)} = 0 \, . \eqno(2.8)$$
Solving (2.8) for c provides the set of weights,

$$c' = \frac{c'\Sigma_p c}{c'\Sigma_{pg} a} \, a'\Sigma_{pg}' \Sigma_p^{-1} \, . \eqno(2.9)$$

The scalar (c'Σ_p c)/(c'Σ_pg a) can be omitted from (2.9) because it uniformly
magnifies the differences among the sample indices, but does not affect
the rankings. Substituting the solution c' = a'Σ_pg' Σ_p^{-1} into (2.8) verifies
that it is the maximizing solution.
Smith adopted the usual assumption that the genotypic and environmental
elements of the phenotypic variable are distributed independently
and normally. Hence,

$$\Sigma_{pg} = \Sigma_g \, , \qquad \Sigma_p = \Sigma_g + \Sigma_e \, . \eqno(2.10)$$

The maximizing sample selection index then is

$$c'p_j = a'\Sigma_g (\Sigma_g + \Sigma_e)^{-1} p_j \, . \eqno(2.11)$$

Since (2.11) involves population parameters, Smith suggested that
Σ_g could be replaced by a sample estimate S_g, and Σ_e could be replaced
by a sample estimate S_e. Selection of genetically superior individuals
would be based on

$$\hat{c}'p_j = a' S_g (S_g + S_e)^{-1} p_j \, . \eqno(2.12)$$

Hazel (1943) extended these results to animal populations and expressed
the parametric and estimation matrices in physically meaningful combinations
of their elements. Other writers, notably Fisher (1954) and
Kempthorne (1957), have given reference to Smith's selection index
using these estimated coefficients as a useful tool in plant genetic
selection problems.
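The passage from (2.11) to (2.12) can be sketched numerically. The following modern example (not from the original text) computes the parametric weights c' = a'Σ_g(Σ_g + Σ_e)^{-1} for a hypothetical two-variate population, with the 2 × 2 inverse written out so the sketch stays self-contained.

```python
def smith_weights(a, Sigma_g, Sigma_e):
    """Parametric Smith index weights c' = a' Sigma_g (Sigma_g + Sigma_e)^(-1)
    for the two-variate case; the example matrices below are hypothetical."""
    # Sigma_p = Sigma_g + Sigma_e
    P = [[Sigma_g[i][j] + Sigma_e[i][j] for j in range(2)] for i in range(2)]
    det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    P_inv = [[ P[1][1] / det, -P[0][1] / det],
             [-P[1][0] / det,  P[0][0] / det]]
    # row vector a' Sigma_g
    aG = [sum(a[i] * Sigma_g[i][j] for i in range(2)) for j in range(2)]
    # (a' Sigma_g) Sigma_p^{-1}
    return [sum(aG[i] * P_inv[i][j] for i in range(2)) for j in range(2)]

# Hypothetical genotypic and environmental covariance matrices.
Sigma_g = [[2.0, 0.5], [0.5, 1.0]]
Sigma_e = [[1.0, 0.0], [0.0, 3.0]]
a = [1.0, 1.0]
c = smith_weights(a, Sigma_g, Sigma_e)
```

When Σ_e = 0 the phenotype is the genotype and the weights reduce to a itself, which gives a simple check on the algebra.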
By attaching new physical meanings to the vectors a and g in Smith's
development, we see that the results obtained here are applicable to a
wide variety of nongenetic problems. However, the use of this index
must be questioned for the following reasons.

The basis of the index is the selection of sample individuals
belonging to the upper q fraction of the population of a'g variables.
Is this requirement necessary for the theoretical development, or for
the practical application of the results? Except for the degenerate
cases q = 0 and q = 1, the solution for c is the same for all values of
q. The parameter q is determined only by the lower selection limit
(c'p)_0, which at best is estimated, not known. Even if (c'p)_0 is
known, there are two important reasons why it is not always applicable
in practice. If there is only one sample from which, for economic
reasons, some individuals must be selected, how many should be retained
when (c'p)_0 exceeds the maximum sample value of the c'p_j? Conversely,
if only m of the sample can be saved and m_0 > m variables exceed (c'p)_0,
what effect does saving the largest m of m_0 have on the expected gain?
These examples indicate that, for many problems, it is better to select
a predetermined fraction of the sample instead of selecting individuals
from a predetermined fraction of the population.

We could suggest that a way to circumvent these difficulties is
to determine in advance that m of a sample of k will be selected, and
to consider this as nearly equivalent to selecting q = m/k of the parent
population. This, however, always leads to an overstatement of the
expected gain by selection as we will show in the following development.
Denote the r-th ranked variable in a sample of k variables as
x_(r). The expectation of x_(r) is

$$E(x_{(r)}) = \frac{k!}{(k-r)!\,(r-1)!}
\int_{-\infty}^{\infty} x \left[ \Phi(x) \right]^{k-r}
\left[ 1 - \Phi(x) \right]^{r-1} \phi(x) \, dx \, , \eqno(2.13)$$

where φ(x) is the density function of the random variable x and Φ(x)
is its cumulative distribution function. It follows from (2.13) that

$$\frac{1}{m} \sum_{r=1}^{m} E(x_{(r)}) \eqno(2.14)$$

is the expectation of the average of the first m-ranked sample x's.
The expectation of the unranked variable x over the upper q = m/k
fraction of the parent population is

$$E\left( x \mid q = \frac{m}{k} \right)
= \frac{k}{m} \int_{x_q}^{\infty} x \, \phi(x) \, dx \, . \eqno(2.15)$$

Write

$$G(x) = \sum_{r=1}^{m} \frac{(k-1)!}{(k-r)!\,(r-1)!}
\left[ \Phi(x) \right]^{k-r} \left[ 1 - \Phi(x) \right]^{r-1} \eqno(2.16)$$

and notice that 0 ≤ G(x) ≤ 1 for all values of x. Now (2.14) can be
simplified to

$$\frac{k}{m} \int_{-\infty}^{\infty} x \, \phi(x) \, G(x) \, dx \, . \eqno(2.17)$$
By definition, let the linear transformation y = x - x_q transform
φ(x) and G(x) into φ*(y) and G*(y). Substituting these in (2.17), we
have

$$\frac{k}{m} \int_{-\infty}^{\infty} (y + x_q) \, \phi^*(y) \, G^*(y) \, dy
= \frac{k}{m} \int_{0}^{\infty} y \, \phi^*(y) \, G^*(y) \, dy
+ \frac{k}{m} \int_{-\infty}^{0} y \, \phi^*(y) \, G^*(y) \, dy
+ x_q \, . \eqno(2.18)$$
Next, applying the same transformation to (2.15) we observe that

$$\frac{k}{m} \int_{0}^{\infty} (y + x_q) \, \phi^*(y) \, dy
= \frac{k}{m} \int_{0}^{\infty} y \, \phi^*(y) \, dy + x_q \, . \eqno(2.19)$$
Finally, comparing (2.18) and (2.19) we can show

$$\frac{k}{m} \int_{0}^{\infty} y \, \phi^*(y) \, dy
\geq \frac{k}{m} \int_{0}^{\infty} y \, \phi^*(y) \, G^*(y) \, dy \, , \eqno(2.20)$$

and

$$\frac{k}{m} \int_{-\infty}^{0} y \, \phi^*(y) \, G^*(y) \, dy \leq 0 \, ; \eqno(2.21)$$

hence, the integral in (2.19) is greater than the integral in (2.18),
which is what we wished to prove.
This development, which is a generalization by Dr. R. N. Curnow
of my proof for the special case of m = 1, proves that the population
expected gain is an overstatement of the gain obtainable by selecting
the upper m/k fraction of each sample. Partial verification of this
result for the normal distribution can be found in the Fisher and
Yates tables (1949).
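The overstatement proved above is easy to exhibit numerically. This modern Monte Carlo sketch (not part of the original text) compares, for a standard normal population with m = 1 and k = 5, the expected average of the top m of a sample of k against the mean of the upper q = m/k fraction of the parent population.

```python
import random
from statistics import NormalDist

def top_m_sample_mean(m, k, reps, seed=1):
    """Monte Carlo estimate of E[average of the m largest of k std normals],
    i.e., expression (2.14) for the standard normal population."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs = sorted(rng.gauss(0.0, 1.0) for _ in range(k))
        total += sum(xs[-m:]) / m
    return total / reps

def truncated_population_mean(q):
    """Mean of the upper q fraction of a std normal, phi(u_q)/q (cf. (2.15))."""
    u_q = NormalDist().inv_cdf(1.0 - q)
    return NormalDist().pdf(u_q) / q

m, k = 1, 5
sample_gain = top_m_sample_mean(m, k, reps=20000)
population_gain = truncated_population_mean(m / k)  # overstates the sample gain
```

The population figure exceeds the sample figure, as (2.20) and (2.21) require.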
The most serious criticism of the use of c'p_j as a selection
index is that ĉ'p_j is the index actually used. Often writers tacitly
implied that sample size is usually large, and hence, the estimates
of Σ_g and Σ_e are good. This entirely disregards the difficulty that
these are p x p matrices of p(p + 1)/2 distinct components each.
These matrices do not have the same distributional properties as s²,
the case where p = 1. Further, the product of two sample covariance
matrices is not estimated, but instead, at best, we use the product
of one sample covariance matrix with the inverse of the sum of two
independent covariance matrices. The variance of a ratio s₁²/(s₁² + s₂²)
is undefined for some possible sample sizes. How much more frequently
will the variance of the elements of S₁(S₁ + S₂)⁻¹ be undefined? Problems
of the dependence of S_g, S_e, and p_j complicate the question of adequate
sample size still further. For the small sample selection program,
where the estimate of c should be provided by a larger amount of data
than that which is available, is it not better to apply an index not
involving unknown constants than to risk the errors introduced by using
S_g and S_e?

Bartlett (1939), Nanda (1949), and Cochran (1950) were well aware
of these questions. Bartlett and Nanda both attempted to obtain the
variances associated with the estimated coefficients. For a two-variate
index, Bartlett did approximate the variance of the estimated expected
gain. Nanda extended the approximation for two variates to p variates,
and provided approximate variances of the coefficients themselves.
The degrees of these approximations are not given, nor are the results
particularly informative since they are polynomial expressions of the
elements of the population covariance matrices. However, both sets
of results indicate that the larger the number of variates measured,
the larger the size of the estimation sample must be to provide usable
estimates. This is confirmed by the bounds for the variances of the
estimates obtained in this dissertation and seems to be a particularly
important result because it explains partially why the estimated index
has never been a wholly satisfactory tool.

The estimation question, quite clearly, is an argument about simple
versus complex statistical procedures. If the estimation of c
can be done easily and with assurance that the estimates deviate only
slightly from the desired parameters, then the use of ĉ'p_j is advisable
if it provides a worthwhile improvement over any alternate index.
However, an index which is simpler to construct, which requires no
estimated weights, and which indicates large values of the a'g_j nearly
as well as or better than does the index ĉ'p_j, certainly will be
regarded as more useful.

How then should alternate indices be compared? One method is
to compare expected gains. This could be termed the classical selection
index approach. However, the probability of selecting the desired
set of the a'g_j has been suggested by Cochran (1950) as a measure of
the usefulness of an index, and is well known to those students interested
in selection from univariate populations. In a recent paper,
Dunnett (1960) uses the probability measure for studying the selection
of the largest of k population means, each measured with a normally
distributed error. To aid the solution of this problem, he assumed
an a priori normal distribution of the population means which gives
the results as the same probability functions used in this work. The
solution of the multiple integrals involved, however, is not attempted
in Dunnett's work, except for the obvious boundary solutions and for
the case of a sample size of two. The generalization of this integral
is the n-variate, incomplete, normal distribution over the region
-∞ ≤ x_i ≤ 0, i = 1, 2, ..., n. The solution for these integrals
has been attempted by many workers, but only slowly convergent multiple
series are available for the general case. For certain special dependence
patterns, the multiple integral can be reduced to an exact solution
or thrown back on the evaluation of less complicated integrals.
The bulk of this line of research is due to Kendall (1941), Plackett
(1954), Moran (1956), Owen (1957), Dunnett (1960), and Ruben (1960a,
1960b).
3.0 THE PROBABILITY OF SELECTING ONE OF THE FIRST m-RANKED
VARIABLES FROM A SAMPLE OF k UNKNOWN VARIABLES BASED ON THE SELECTION OF
THE MAXIMUM OBSERVABLE SUM OF THE UNKNOWN VARIABLES
AND ERROR VARIABLES

The probability of the concurrence of the largest sample index value,
c'p_(1), with one of the m largest values of the true worth a'g_(i), i = 1,
2, ..., m, is the frequency with which the first-ranked value in one
array of the observations occurs with the i-th ranked (i = 1, 2, ..., m)
value of a second array (the members of the two arrays are drawn randomly
in pairs from a bivariate population). In a random sample of k multivariate
observational units, let the selection index of the j-th unit,
c'p_j, be denoted by x_j. If the true worth of the unit, a'g_j, is denoted
by y_j and if the error in observation, (c - a)'g_j + c'e_j, is denoted by
z_j, then x_j is the sum of two random, variable components: x_j = y_j + z_j.
The problem discussed in this chapter is how to evaluate the properties
of the probability of selecting one of the m largest sample values of y_j
by selecting the largest value of x_j when z_j is an element of error
which is nonseparable from y_j in x_j.
3.1 Basic Formulation

Symbolize the probability of the joint occurrence of the maximum
sample x_j value and the i-th ranked sample y_j value, j = 1, 2, ..., k,
by

$$P[1, i, k] \, . \eqno(3.1)$$

Initially, let us examine P[1, 1, k], the probability of selecting
jointly the maximum of x_j and y_j. It is a well-known result that (3.1)
for i = 1 is derived simply by averaging k times the probability that
k - 1 of the x values are less than a particular x, say x_j, and k - 1
of the y values are less than a particular y, say y_j, over the joint
distribution of x_j and y_j. By denoting

$$F(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(u, v) \, du \, dv \, , \eqno(3.2)$$

where f(x,y) is the joint density of x and y, the function P[1, 1, k]
can be written as

$$P[1, 1, k] = k \int_{-\infty}^{\infty} \int_{-\infty}^{\infty}
F^{k-1}(x,y) \, f(x,y) \, dx \, dy \, . \eqno(3.3)$$

Except for three special cases of dependence between x and y, (3.3)
does not reduce to a simpler form for the unspecified density function
f(x, y).
The first special case of interest is that of complete statistical
dependence where f(x,y) = f(x) and where there exists a monotonically
increasing functional relationship between the i-th ranked sample values,
x_(i) = g(y_(i));

$$P[1, 1, k] = 1 \, . \eqno(3.4)$$

Secondly, if x and y are statistically independent,

$$P[1, 1, k] = \frac{1}{k} \, . \eqno(3.5)$$

Finally, for complete statistical dependence of x and y, and for a
monotonically decreasing functional relationship, x_(i) = g(y_(k-i)),

$$P[1, 1, k] = 0 \eqno(3.6)$$

because the maximum of one array will always occur with the minimum of
the other array.
Now we proceed to derive more general results. If we note that

$$F(x_j) - F(x_j, y_j)
= \int_{-\infty}^{x_j} \int_{y_j}^{\infty} f(x_i, y_i) \, dy_i \, dx_i \, ,
\qquad i = 1, 2, \ldots, k \, , \; i \neq j \, , \eqno(3.7)$$

the general function in (3.1) can be written as

$$P[1, i, k] = \frac{k!}{(k-i)!\,(i-1)!}
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty}
F^{k-i}(x,y) \left[ F(x) - F(x,y) \right]^{i-1} f(x,y) \, dx \, dy \, . \eqno(3.8)$$

When 1 < i < k, it is evident that (3.8) must be zero if x and y are
completely dependent. If x and y are independent, we have the only other
reasonable case for the general solution,

$$P[1, i, k] = \frac{k!}{(k-i)!\,(i-1)!}
\left[ \int_{0}^{1} F^{k-1}(x) \, dF(x) \right]
\left[ \int_{0}^{1} F^{k-i}(y) \left[ 1 - F(y) \right]^{i-1} dF(y) \right]
= \frac{1}{k} \, . \eqno(3.9)$$
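The independence result (3.9) can be checked by simulation. In this modern sketch (not from the original text), x and y are drawn independently, and the maximum x is found to fall on the i-th ranked y with frequency near 1/k for each i.

```python
import random

def p1ik_independent(i, k, reps, seed=4):
    """Monte Carlo estimate of P[1, i, k] for independent x and y: the chance
    that the unit with the largest x has the i-th largest y."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.random() for _ in range(k)]
        ys = [rng.random() for _ in range(k)]
        best = max(range(k), key=lambda j: xs[j])
        rank = 1 + sum(y > ys[best] for y in ys)  # rank 1 = largest y
        if rank == i:
            hits += 1
    return hits / reps

estimates = [p1ik_independent(i, k=5, reps=20000) for i in (1, 3, 5)]
```

Under independence the rank of the selected unit's y is uniform on 1, ..., k, so every estimate sits near 1/k = 0.2.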
Now we define another function

$$P[1, \{m\}, k] \, , \qquad m < k \, , \eqno(3.10)$$

as the probability of the joint occurrence of the maximum x value and
one of the first m-ranked y values. The results in (3.1) to (3.10) can
be used to define a probability measurement of the worth of a selection
index,

$$P[1, \{m\}, k] = \sum_{i=1}^{m} P[1, i, k]
= \sum_{i=1}^{m} \frac{k!}{(k-i)!\,(i-1)!}
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty}
F^{k-i}(x,y) \left[ F(x) - F(x,y) \right]^{i-1} f(x,y) \, dx \, dy \, . \eqno(3.11)$$

When f(x,y) = f(x) and x_(i) = g(y_(i)), P[1, {m}, k] = 1; when f(x,y) = f(x)
and x_(i) = g(y_(k-i)), P[1, {m}, k] = 0; and when f(x,y) = f(x) f(y),
P[1, {m}, k] = m/k. For some special forms of f(x,y), it is useful to
note that P[1, {m}, k] is the sum of the coefficients of t⁰, t¹, ...,
t^(m-1) in the expansion of

$$k \int_{-\infty}^{\infty} \int_{-\infty}^{\infty}
\left[ tF(x) + (1-t)F(x,y) \right]^{k-1} f(x,y) \, dx \, dy \, . \eqno(3.12)$$

In some selection problems, m is small relative to k, and the probability
function P[1, 1, k] is a satisfactory substitute for P[1, {m}, k]
as a measure of the worth of a selection index. This function is investigated
in the following two sections of this chapter for the special
case of f(x,y) being the bivariate normal distribution centered at the
origin with unspecified variances and a correlation coefficient ρ in
the interval (-1, 1).
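A modern Monte Carlo sketch (not part of the original text) of P[1, {m}, k] under the x_j = y_j + z_j model, with y_j and z_j taken as independent standard-normal and N(0, σ_z²) draws purely for illustration: with no error the selection is certain, and as the error variance grows the probability falls toward the independence value m/k.

```python
import random

def prob_select(m, k, sigma_z, reps, seed=2):
    """Monte Carlo estimate of P[1,{m},k]: the chance that the unit with the
    largest index x_j = y_j + z_j is among the m largest true worths y_j.
    Hypothetical model: y_j ~ N(0,1), z_j ~ N(0, sigma_z^2) independent."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        ys = [rng.gauss(0.0, 1.0) for _ in range(k)]
        xs = [y + rng.gauss(0.0, sigma_z) for y in ys]
        best = max(range(k), key=lambda j: xs[j])
        # is the selected unit's true worth among the m largest?
        if sum(y > ys[best] for y in ys) < m:
            hits += 1
    return hits / reps

# No error (sigma_z = 0): selection is certain, P = 1.
exact = prob_select(m=1, k=10, sigma_z=0.0, reps=2000)
# Very large error: x carries little information about y, P approaches m/k.
noisy = prob_select(m=1, k=10, sigma_z=100.0, reps=20000)
```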
3.2 Properties of P[1, 1, 2] when x and y Are
Drawn from a Bivariate Normal Population

It has been assumed previously that x_j is the sum y_j + z_j where y_j
and z_j are observationally nonseparable and are jointly distributed in a
bivariate normal form. Define σ₁² for the variance of y_j, σ₂² for the
variance of z_j, and ρ₁₂ for the correlation coefficient of y_j and z_j.
The joint density of y_j and z_j is the bivariate normal density in these
parameters. This density is simplified by letting

$$\sigma_y^2 = \sigma_1^2 \, , \qquad
\sigma_x^2 = \sigma_1^2 + 2\rho_{12}\sigma_1\sigma_2 + \sigma_2^2 \, , \qquad
\rho_{xy} = \frac{\sigma_1 + \rho_{12}\sigma_2}{\sigma_x} \, ;$$

the transformed density is the bivariate normal density of x and y with
variances σ_x², σ_y² and correlation coefficient ρ_xy.
The simplest case of P[1, 1, k] is for k = 2, where the integral
representation can be solved directly (see Kendall and Stuart (1958) for
a lengthy treatment of this integral and one form of its k-fold generalization),

$$P[1, 1, 2] = 2 \int_{-\infty}^{\infty} \int_{-\infty}^{\infty}
\int_{-\infty}^{x} \int_{-\infty}^{y}
\left[ \frac{1}{2\pi\sigma_x\sigma_y(1-\rho_{xy}^2)^{\frac{1}{2}}} \right]^2
\exp\left[ -\frac{1}{2(1-\rho_{xy}^2)}
\left( \frac{x^2}{\sigma_x^2} - \frac{2\rho_{xy}xy}{\sigma_x\sigma_y} + \frac{y^2}{\sigma_y^2} \right) \right]
\cdot \exp\left[ -\frac{1}{2(1-\rho_{xy}^2)}
\left( \frac{x_1^2}{\sigma_x^2} - \frac{2\rho_{xy}x_1y_1}{\sigma_x\sigma_y} + \frac{y_1^2}{\sigma_y^2} \right) \right]
dx_1 \, dy_1 \, dx \, dy \, . \eqno(3.13)$$

Make the transformations

$$u_1 = \frac{x_1 - x}{\sigma_x(1-\rho_{xy}^2)^{\frac{1}{2}}} \, , \qquad
v_1 = \frac{y_1 - y}{\sigma_y(1-\rho_{xy}^2)^{\frac{1}{2}}} \, , \qquad
u = \frac{x}{\sigma_x(1-\rho_{xy}^2)^{\frac{1}{2}}} \, , \qquad
v = \frac{y}{\sigma_y(1-\rho_{xy}^2)^{\frac{1}{2}}} \, , \eqno(3.14)$$
which have the Jacobian σ_x²σ_y²(1 - ρ_xy²)², and integrate out the variables
u and v. The resulting reduction to a bivariate normal integral over the
third quadrant of u₁ and v₁ is

$$P[1, 1, 2] = \frac{2(1-\rho_{xy}^2)^{\frac{1}{2}}}{4\pi}
\int_{-\infty}^{0} \int_{-\infty}^{0}
\exp\left[ -\tfrac{1}{4}\left( u_1^2 + v_1^2 - 2\rho_{xy}u_1v_1 \right) \right]
du_1 \, dv_1 \, . \eqno(3.15)$$

Now let

$$r = -v_1 \, , \qquad s = \frac{u_1}{v_1} \, , \qquad \text{Jacobian} = r \, . \eqno(3.16)$$

Integrating the transformation of (3.16) over the domain 0 ≤ r, s < ∞
gives the exact expression for (3.13) as a function of the correlation
coefficient of x and y,

$$P[1, 1, 2] = \frac{1}{\pi}
\left[ \frac{\pi}{2} + \operatorname{Arctan} \frac{\rho_{xy}}{(1-\rho_{xy}^2)^{\frac{1}{2}}} \right] \, . \eqno(3.17)$$

In the interval 0 ≤ ρ_xy ≤ 1, (3.17) lies in the interval ½ ≤
P[1, 1, 2] ≤ 1, and in the interval -1 ≤ ρ_xy ≤ 0, (3.17) lies in the
interval 0 ≤ P[1, 1, 2] ≤ ½ (see Figure 1). The first derivative of
(3.17) with respect to ρ_xy is

$$\frac{d}{d\rho_{xy}} \, P[1, 1, 2] = \frac{1}{\pi(1-\rho_{xy}^2)^{\frac{1}{2}}} \, . \eqno(3.18)$$
Figure 1. Graph of P[1, 1, 2]
By setting ρ_xy² = μ and noting that dⁱμ/dρ_xyⁱ ≥ 0 for all i when ρ_xy ≥ 0, it follows that all derivatives of P[1, 1, 2] with respect to ρ_xy are greater than or equal to zero, because all dⁱ[1/(1 - μ)^{1/2}]/dμⁱ ≥ 0. This characteristic of the derivatives in the range ρ_xy > 0 is important when comparing P[1, 1, 2] for populations with different values of ρ_xy.
For if two such populations are specified by ρ₁ and ρ₂, the probability P[1, 1, 2 | ρ_i] for each can be expanded in the Maclaurin series

P[1, 1, 2 | ρ_i] = 1/2 + Σ_{j=1}^{∞} (ρ_i^j / j!) h_j ,   h_j = dʲP[1, 1, 2]/dρ_xyʲ |_{ρ_xy = 0} ≥ 0 .   (3.19)

The difference between the two populations in terms of the probability of selecting the respective maxima is

P[1, 1, 2 | ρ₁] - P[1, 1, 2 | ρ₂] = Σ_{j=1}^{∞} (h_j / j!)(ρ₁^j - ρ₂^j) .

The series difference depends on the set of constants {ρ₁^j - ρ₂^j}, which is a function not only of ρ₁ - ρ₂, but also of the size of ρ₂ relative to ρ₁. This is demonstrated graphically in Figure 2, where we note ρ₂ - ρ₁ = ρ₂* - ρ₁*, but the selection increase is unequal, I* > I.
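The dependence of the probability difference on where the pair (ρ₁, ρ₂) sits in the range, not merely on ρ₁ - ρ₂, can be seen numerically. A small sketch of my own, using the closed form (3.17):

```python
import numpy as np

def p112(rho):
    """Closed form (3.17): P[1, 1, 2] = 1/2 + arcsin(rho)/pi."""
    return 0.5 + np.arcsin(rho) / np.pi

# Equal steps in rho give unequal probability increases: the step taken
# high in the range yields a larger selection increase I* than the same
# sized step taken lower down (the I of Figure 2).
low  = p112(0.4) - p112(0.2)   # increase I  for rho: 0.2 -> 0.4
high = p112(0.9) - p112(0.7)   # increase I* for rho: 0.7 -> 0.9
print(low, high)
```

With the step size 0.2 in both cases, the upper-range increase is larger, exactly the I* > I behavior described above.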
From examining (3.17) and Figure 1, we see that

P[1, 1, 2] = (1/π) Arccos |ρ_xy| ,   -1 ≤ ρ_xy ≤ 0 ,

follows immediately from the basic trigonometric identities. It is interesting to note one of the physical implications of ρ_xy ≤ 0.
[Figure 2. An illustration of selection increase for k = 2.]
27
negative correlation occurs if and
~
if
The right side of (:5.21) can occur only if
This means if variance of the variable of interest, y, is greater than
the variance of the error variable, z, regardless of the type of correlation between y and z, selection for the maximum sample variable x to
obtain the ma.x1m:u:m y is certain to be better than a random d.:!=aw of one
of the x variables.
3.3 Properties of P[1, 1, k], k > 2, When x and y Are Drawn from a Bivariate Normal Population

The general form of P[1, 1, k] when x and y are drawn from a bivariate normal distribution is
P[1, 1, k] = k ∫_{-∞}^{∞} ∫_{-∞}^{∞} [ ∫_{-∞}^{x} ∫_{-∞}^{y} f(x₁, y₁) dx₁ dy₁ ]^{k-1} f(x, y) dx dy ,   (3.23)

where f is the bivariate normal density with exponent -{x²/σ_x² + y²/σ_y² - 2ρ_xy xy/(σ_x σ_y)}/{2(1 - ρ_xy²)}. If the variances σ_x² and σ_y² are scaled out, (3.23) becomes the same integral with f replaced by the standardized bivariate normal density

f(x, y) = {1/[2π(1 - ρ_xy²)^{1/2}]} exp[ -(x² + y² - 2ρ_xy xy) / {2(1 - ρ_xy²)} ] .
This integral can be expressed in a series expansion with the application of a result due to McFadden (1955). Write the incomplete distribution function raised to the (k-1)-th power as the product of k-1 incomplete distribution functions (see equation (3.3)) and integrate the resulting 2k-fold multiple integral over the ranges of x and y. Next scale out a factor (k-1)/k and group the remaining 2(k-1) variables into the vector form

u' = (x₁, y₁, x₂, y₂, ..., x_{k-1}, y_{k-1}) .

This procedure reduces the integral in (3.23) to the form (3.24), where

I_{2(k-1)}(0, A⁻¹) = ∫_{-∞}^{0} ⋯ ∫_{-∞}^{0} {|A|^{1/2}/(2π)^{k-1}} exp(-½ u'Au) Π_{j=1}^{2(k-1)} du_j .   (3.25)

McFadden's expansion of I_{2(k-1)}(0, A⁻¹) is (3.26), where the ρ_ij are the elements of A⁻¹.
Thus if k = 3,

P[1, 1, 3] = 1/3 + (1/2π)(Arcsin ρ_xy + Arcsin ρ_r) + (ρ_xy²/16π²) + O(ρ³) .

Obviously if ρ_xy > 1/2, the remainder can be written as O(ρ_xy³); if ρ_xy ≤ 1/2, the remainder can be written as O(1/8). Thus we see that the series converges too slowly to be of much practical use.
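Although the series converges slowly, P[1, 1, k] is straightforward to estimate by simulation for any k. The sketch below is my illustration, not the dissertation's; it assumes unit variances and generates y from x by the usual bivariate normal regression:

```python
import numpy as np

def p11k_mc(rho, k, n=100_000, seed=1):
    """Monte Carlo estimate of P[1, 1, k]: the probability that the
    x-maximum of k independent bivariate normal pairs also carries
    the y-maximum."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, k))
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal((n, k))
    return np.mean(np.argmax(x, axis=1) == np.argmax(y, axis=1))

print(p11k_mc(0.0, 3))   # near 1/3 when x carries no information
print(p11k_mc(0.9, 3))   # well above 1/3
```

At ρ_xy = 0 the estimate sits at 1/k, the first term of the Maclaurin expansion, and it rises with ρ_xy as the theorem of the next section asserts.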
The one property which must be established for all values of k, to make P[1, 1, k] a contender with expected gain as a measure of the worth of a selection index, is that the difference P[1, 1, k | ρ₁] - P[1, 1, k | ρ₂] has the form

Σ_{j=1}^{∞} (h_j / j!)(ρ₁^j - ρ₂^j) ,   h_j ≥ 0 .

This will establish the dependence of the probability difference of two indices on the difference of, and the relative positions in the range (-1, 1) of, the correlation coefficients of the individual indices with true worth. This result is assured by the following theorem.
Theorem: The Maclaurin expansion of

k ∫_{-∞}^{∞} ∫_{-∞}^{∞} [ ∫_{-∞}^{x} ∫_{-∞}^{y} {1/[2π(1 - ρ_xy²)^{1/2}]} exp{ -(x₁² + y₁² - 2ρ_xy x₁y₁) / [2(1 - ρ_xy²)] } dx₁ dy₁ ]^{k-1} f(x, y) dx dy ,   -1 < ρ_xy < 1 ,   (3.28)

is an infinite series,

1/k + Σ_{j=1}^{∞} (ρ_xy^j / j!) h_j ,

where h_j ≥ 0 for all j ≥ 1.
Proof: We can show that the integrand in (3.28) and all its derivatives with respect to ρ are continuous in the interval -1 < ρ < 1. Thus the integrand can be expanded in a Maclaurin series about ρ = 0 and, by a generalization of the conditions for differentiation under the integral sign given in Titchmarsh (1949), we can show

P[1, 1, k] = k Σ_j (ρ^j / j!) ∫_{-∞}^{∞} ∫_{-∞}^{∞} [ (dʲ/dρʲ) F^{k-1}(x, y) f(x, y) ]_{ρ=0} dx dy ,   -1 < ρ < 1 ,

when F(x, y) is normal. For the proof which follows, we must note also that the derivative with respect to ρ of a Fourier transform of this normal distribution function is the Fourier transform of the derivative. The first term in the Maclaurin expansion of (3.28) is 1/k.
To examine the remaining terms in the expansion, let us write the integrand of (3.28) as the inversion of the characteristic function of its 2k-fold multivariate representation, (3.30), where dt_i = Π_{j=1}^{k} dt_{ij}, dx = (Π_{j=1}^{k-1} dx_j) dx, and dy = (Π_{j=1}^{k-1} dy_j) dy. The m-th derivative, evaluated at ρ = 0, is the 2k-fold integral

{1/(2π)^{2k}} ∫_{-∞}^{∞} ⋯ ∫_{-∞}^{∞} ∫_{-∞}^{0} ⋯ ∫_{-∞}^{0} ( -Σ_{j=1}^{k} t_{1j} t_{2j} )^m exp(·) dt dx dy .   (3.31)

Expand the exponential term in (3.31) and, by a location transformation, replace t_{1j} + i(x + x_j), t_{1k} + ix and t_{2j} + i(y + y_j), t_{2k} + iy with t_{1j} and t_{2j}, j = 1, 2, ..., k. Integral (3.31) now becomes (3.32), whose exponential factor is

{1/(2π)^{k}} exp[ -½ Σ_{j=1}^{k-1} {(x + x_j)² + (y + y_j)²} - ½(x² + y²) ] .   (3.32)
Next, examine the expectation over the standard normal distribution of t of the binomial term (t - iw)ⁿ:

E_t (t - iw)ⁿ = Σ_{j=0}^{n} {n!/[(n-j)! j!]} (-iw)^{n-j} E_t(tʲ) = Σ_{j=0}^{n} {n!/[(n-j)! j!]} (-iw)^{n-j} μ_j ,

where μ_j = 0 for j = 1, 3, ..., 2p+1, and μ_j = j!/[2^{j/2}(j/2)!] for j = 0, 2, 4, ..., 2p. If n is even and E_t(t - iw)ⁿ is denoted by S_wⁿ(+), then

S_wⁿ(+) = μ_n - {n(n-1)/2} μ_{n-2} w² + ⋯ + (iw)ⁿ ;

S_wⁿ(+) is a real-valued polynomial in w. If n is odd and E_t(t - iw)ⁿ is denoted by iS_wⁿ(-), then

iS_wⁿ(-) = -n μ_{n-1}(iw) + {n(n-1)(n-2)/3!} μ_{n-3}(-iw)³ + ⋯ + (-iw)ⁿ ,

and S_wⁿ(-) is a real-valued polynomial in w. Next, make the identification

S_m = [ -Σ_{j=1}^{k-1} {t_{1j} - i(x + x_j)}{t_{2j} - i(y + y_j)} - (t_{1k} - ix)(t_{2k} - iy) ]^m ,   (3.33)

and find the expectation of S_m over the distribution of the t_{1j} and t_{2j}. The general term of S_m is given in (3.34).
About the constant C, we need remember only that it is positive if m is even and negative if m is odd. Let m_j* be the set of odd values m_j in (3.34), and let m_j** be the set of even values m_j in (3.34). Further define m* = Σ m_j*, m** = Σ m_j**, and note that m* + m** = m. The number m** is even; hence, if m is even, both m* and m** are even integers, while if m is odd, m* is an odd integer and m** is an even integer.

The general term of E_t(S_m), when m is even, is the product of a positive constant, C, and functions of x + x_j, y + y_j (j = 1, 2, ..., k-1) and x, y. These functions are the S_wⁿ(+) and S_wⁿ(-) polynomials; their product is premultiplied by (i²)^{m*} = (-1)^{m*} = 1. The general term of E_t(S_m), when m is odd, is the product of a negative constant and functions of x + x_j, y + y_j, x, and y. Again, these functions are the S_wⁿ(+) and S_wⁿ(-) polynomials; their product is premultiplied by (i²)^{m*} = -1. If this (-1) factor is absorbed in the constant term, we can write for the general term of E_t(S_m), regardless of m even or odd, the product of C and factors

φ_j(x + x_j) = S_{m_j}^{(x+x_j)}(+) or S_{m_j}^{(x+x_j)}(-) , etc.,   (3.35)

where C is always positive.
35
The m-th derivative of (3.28) is the expectation over! and l of
m
the sum of k terms of the form in (3.35).
= ° is
the m-th derivative evaluated at px:y
k
JOOJOOJo JO
...
-00
-00
-00
. exp [ -
-00
1
2"
k-l
E
j=l
C
(21t)
Thus, the general term of
k
«x+.x j
)
2
+ (Y-i'Y .)
J
2
) -
1
2 2
2" (x -i'Y)
J d!dl·
In (3.36) integrate out x j and Yj ) which are indepe~dentlY distributed
and located at x and Y, then integrate with respect tox and y.
= kC
J
00
1
2;( ~k(x)
-00
=K C ~
~
k-l
n
j=l
(
~j(x) exp - ~
2) dx
2
= K ct.
The solution in (3.37) is the product of three positive constants.
This establishes the main result, h j
>
°
which proves the theorem.
Two obvious and useful corollaries for graphically indicating the shape of P[1, 1, k] are:

Corollary 1: All derivatives of P[1, 1, k] with respect to ρ_xy are positive in the interval 0 ≤ ρ_xy < 1.

Corollary 2: All derivatives of P[1, 1, k] with respect to ρ_xy are monotonically increasing with ρ_xy in the interval 0 ≤ ρ_xy < 1.

The graphs of P[1, 1, k] for k = 5, 10, 20, and 100 and 0 ≤ ρ_xy < 1 are idealized in Figure 3. These graphs were constructed in the following manner. The function

(1/k) P[1, 1, k] = ∫_{-∞}^{∞} ∫_{-∞}^{∞} F^{k-1}(x, y) f(x, y) dx dy ,

where f(x, y) is the bivariate normal density, has the properties given by the theorem and two corollaries, and is monotonically decreasing in k. It lies within the region bounded by (1/2) P[1, 1, 2], 0 ≤ ρ_xy < 1; (1/k) P[1, 1, k] = 0 at ρ_xy = 0; and (1/k) P[1, 1, k] = 1 at ρ_xy = 1; and it tends toward the boundary (1/k) P[1, 1, k] = 0, 0 ≤ ρ_xy < 1, as k increases without limit. A graph of this type function was constructed and then scaled by k to arrive at Figure 3. This final representation may be slightly in error because, for some small k, it may be true that P[1, 1, k] > P[1, 1, k - 1] in the neighborhood of ρ_xy = 1. However, if this error exists, it has no practical implications in the comparison of selection indices.
3.4 P[1, {m}, k] as a Monotonically Increasing Function of ρ_xy When x and y Are Drawn from a Bivariate Normal Population

The properties of P[1, {m}, k] are quite difficult to ascertain for the bivariate normal distribution, but it can be shown that P[1, {m}, k] is an increasing function of ρ_xy, and that its general form is a nonlinear curve. To achieve this most simply, it is necessary to consider the
[Figure 3. Graphs of P[1, 1, k] for k = 5, 10, 20, and 100.]
integral expression of P[1, {m}, k] given in (3.8), where f(x, y) is the bivariate normal density with unit variances and zero means. Now apply to f(x, y) the transformations

u = x ,   v = (y - ρ_xy x)/(1 - ρ_xy²)^{1/2} .

The new domain of integration is -∞ < u, v < ∞, and the density and distribution functions of (3.8) are transformed into

F(u, v) = ∫_{-∞}^{u} ∫_{-∞}^{v + ρ_xy(1 - ρ_xy²)^{-1/2}(u - x₁)} (1/2π) exp[ -(x₁² + v₁²)/2 ] dv₁ dx₁ ,

F(u) - F(u, v) = ∫_{-∞}^{u} ∫_{v + ρ_xy(1 - ρ_xy²)^{-1/2}(u - x₁)}^{∞} (1/2π) exp[ -(x₁² + v₁²)/2 ] dv₁ dx₁ ,   -1 < ρ_xy < 1 .

Since u - x₁ ≥ 0 within these integrals, we see that F(u, v) increases with ρ_xy in the interval -1 < ρ_xy < 1.

The derivative of the right side of (3.8) with respect to ρ_xy,

∫_{-∞}^{∞} ∫_{-∞}^{∞} [ {k!/[(k-i-1)!(i-1)!]} F^{k-i-1}(u, v)[F(u) - F(u, v)]^{i-1} - {k!/[(k-i)!(i-2)!]} F^{k-i}(u, v)[F(u) - F(u, v)]^{i-2} ] (∂F(u, v)/∂ρ_xy) f(u, v) du dv ,

summed over i = 1, 2, ..., m, is the derivative of P[1, {m}, k] with respect to ρ_xy:

{k!/[(m-1)!(k-m-1)!]} ∫_{-∞}^{∞} ∫_{-∞}^{∞} F^{k-m-1}(u, v)[F(u) - F(u, v)]^{m-1} (∂F(u, v)/∂ρ_xy) f(u, v) du dv .   (3.41)
But F(u, v) ≥ 0 and F(u) - F(u, v) ≥ 0 for all u, v, and ρ_xy, and we have shown that ∂F(u, v)/∂ρ_xy ≥ 0 for all u, v, and ρ_xy; thus

(d/dρ_xy) P[1, {m}, k] ≥ 0 ,   -1 < ρ_xy < 1 .   (3.42)

The importance of this result in selection index studies is that maximizing P[1, {m}, k] with respect to the correlation coefficient of the selection index is most easily achieved by maximizing ρ_xy with respect to the index coefficients.
40
The curvilinearity of Pll, im\, k] 'With respect to Pxy is established
by noting that this function at' Pxy
= -1,
0, 1 takes on values 0, m/k, 1.
which do not fall on a straight line in the (p xy, pel, {mt, k] ) plane
except possibly when m
".
= k/2
•
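Both conclusions, the monotonicity in ρ_xy and the endpoint values 0, m/k, and 1, can be checked by simulation. The following sketch is my illustration (NumPy, unit variances assumed):

```python
import numpy as np

def p1mk_mc(rho, m, k, n=100_000, seed=2):
    """Monte Carlo estimate of P[1, {m}, k]: the probability that the
    variable with the largest x ranks among the m largest in y."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, k))
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal((n, k))
    pick = np.argmax(x, axis=1)
    picked_y = y[np.arange(n), pick]
    beaten_by = (y > picked_y[:, None]).sum(axis=1)  # variables exceeding the pick
    return np.mean(beaten_by < m)

m, k = 2, 10
vals = [p1mk_mc(r, m, k) for r in (-1.0, 0.0, 1.0)]
print(vals)   # near 0, m/k = 0.2, and 1
```

The three estimates reproduce the endpoint values 0, m/k, 1 and increase with ρ_xy, in line with (3.42).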
4.0 SAMPLE SELECTION INDICES DERIVED FROM THE PROBABILITY OF SELECTING ONE OF THE m LARGEST UNKNOWN VARIABLES BASED ON SELECTION OF THE LARGEST SAMPLE OBSERVATION

4.1 General Method of Construction
A set of p constants, c, is sought which will produce, for a given criterion, the best sample index, c'x_j, for detecting the largest, or near largest, value of the unknown a'g_j. For the usual additive assumptions,

x_j = g_j + e_j ,   (4.1)

it follows that once the vector c is applied, the sample index can be partitioned into

c'x_j = c'g_j + c'e_j .   (4.2)
If the criterion used for constructing the sample index is the probability of correct selection, and if normality with zero means and independence of g_j and e_j are assumed, the following association can be made. Let the index value, worth, and error variable be

x_j = c'x_j ,   y_j = a'g_j ,   z_j = x_j - y_j .   (4.3)

Then,

x_j ∼ N[0, c'(Σ_g + Σ_e)c] ,   (4.4)

y_j ∼ N[0, a'Σ_g a] ,   (4.5)

z_j ∼ N[0, (c - a)'Σ_g(c - a) + c'Σ_e c] .   (4.6)
The correlation coefficient of x and y is

ρ_xy = c'Σ_g a / [a'Σ_g a · c'(Σ_g + Σ_e)c]^{1/2} ,   (4.7)
so the selection criterion, P[1, {m}, k], for any fixed population parameters and set a, is a function of the c used in the sample selection index. Two sets of c's are of particular interest. One is the set which maximizes P[1, {m}, k]; this set will be denoted as the set which produces the optimum index. The other is the set c = a, which is completely known and produces the base index.

4.2 The Base Index and the Optimum Index
The underlying population relationship for the base index is

a'x_j = a'g_j + a'e_j .   (4.8)

The variables a'g_j and a'e_j are independent, and a'Σ_g a / a'(Σ_g + Σ_e)a = γ² is the squared correlation coefficient for a'x and a'g. The foremost attribute of this index is its simplicity of construction and of interpretation.
Maximum improvement over the base index is achieved by maximizing (4.7) with respect to c, thus maximizing P[1, {m}, k], an increasing function of ρ_xy. The scalar 1/(a'Σ_g a)^{1/2} is irrelevant to the maximization procedure, and thus (4.7) can be replaced by the expected gain criterion in (2.7). The coefficients of the optimum index must be the same as those of (2.9); H. F. Smith's selection index is the optimum index. The probability criterion in (3.23) is a function of k and ρ_xy.
The set of constants for the optimum index is a dependence analogue of the concept, under independence, of weighting each variate x_ij with the proportion of its variation contributed by the corresponding variate of g_j. Since it was stated previously that the main problem in selecting the correct x_j is the lack of knowledge of how much of the variation in each element is due to the corresponding element in g_j, such a weighting seems entirely proper.

An important property of the optimum index which has been noted frequently is that it does adjust the weighting pattern to correct for the case where important variates have small variances, less important variates have high variances, and the correlation pattern between these two sets reflects a strong degree of dependence. For an illustration, consider the following two-variate example:
Σ_g = [[1, 3], [3, 10]] ,   Σ_e = [[2, 1], [1, 2]] ,   a' = (4, 1) .   (4.10)

The first variate is four times as important as the second, but contributes only 1/6 of the index variance while the second contributes 5/12 of the variance. The high correlation, ρ_g12 = 3/(10)^{1/2}, is reflected in the set c' = a'Σ_g(Σ_g + Σ_e)⁻¹ = (-1/5, 19/10), which reverses the importance of the observational variates.
In this two-variate population, the parameters of P[1, {m}, k] for the base index and optimum index are

γ = .7372 ,   ρ_xy = .8989 .   (4.11)

The probability increase obtained by use of the optimum index instead of the base index can be illustrated by considering the difference P[1, 1, k | .8989] - P[1, 1, k | .7372] for increasing values of k. This difference is visualized in Figure 4, which was constructed empirically using the concepts involved in constructing Figure 3.
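The constants in this example follow directly from (4.7) and (4.10), and are easy to reproduce. A numerical check of my own (NumPy; not part of the text):

```python
import numpy as np

# Two-variate example (4.10): worth a'g with a' = (4, 1).
Sg = np.array([[1.0, 3.0], [3.0, 10.0]])
Se = np.array([[2.0, 1.0], [1.0, 2.0]])
a  = np.array([4.0, 1.0])

# Optimum weights c' = a' Sg (Sg + Se)^{-1}; note the reversed importance.
c = a @ Sg @ np.linalg.inv(Sg + Se)
print(c)   # [-0.2  1.9], i.e., (-1/5, 19/10)

def rho_xy(c, a, Sg, Se):
    """Correlation (4.7) of the index c'x with the worth a'g."""
    return (c @ Sg @ a) / np.sqrt((a @ Sg @ a) * (c @ (Sg + Se) @ c))

print(round(rho_xy(a, a, Sg, Se), 4))   # 0.7372, the base-index gamma
print(round(rho_xy(c, a, Sg, Se), 4))   # 0.8989, the optimum index
```

The computation confirms both entries of (4.11) and the weight vector quoted below (4.10).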
If there is a large number of variates, and if for some subset of these variates the contribution to the index variance from the elements of g is small, the optimum weights will place small emphasis on this subset. For this condition, it may be possible to construct an index, intermediate in quality between the base and optimum indices, by eliminating such a subset and building a base index from the remaining variates. The resulting index will be called a reduced index.
4.3 The Reduced Index

In the example just given, replace the base index by one constructed with the coefficients

c' = (0, 1) .   (4.12)

The correlation coefficient of c'x and a'g is then

ρ_xy = .8981 ,   (4.13)

which differs from (4.11) only in the last significant digit.
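The reduced-index correlation is again an instance of (4.7); the sketch below (my illustration) verifies the value in (4.13):

```python
import numpy as np

Sg = np.array([[1.0, 3.0], [3.0, 10.0]])
Se = np.array([[2.0, 1.0], [1.0, 2.0]])
a  = np.array([4.0, 1.0])
c  = np.array([0.0, 1.0])   # reduced index (4.12): drop the first variate

# Correlation (4.7) of c'x with the worth a'g.
rho = (c @ Sg @ a) / np.sqrt((a @ Sg @ a) * (c @ (Sg + Se) @ c))
print(round(rho, 4))   # 0.8981, barely below the optimum 0.8989
```

Dropping the first variate costs less than a thousandth in correlation here, which is the point of the reduced index.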
[Figure 4. The effect of varying k on a comparison of two selection indices.]

Realizing the sampling variation with which an index constructed with estimates of the optimum weights is plagued, it is evident that the weights in (4.12) are a very satisfactory replacement for the optimum.
To some extent, the improvement over the base index for which a reduction is responsible will depend on a knowledge of the population covariance matrices. Without this information, it would be necessary to partition the variates into all possible sets of combinations of t and p - t (t = 1, 2, ..., p/2 or (p - 1)/2) groupings. To test each possible partition theoretically requires a new sample of data; a quite unrealistic requirement.
However, if a subset of variates is suspected of contributing little to the selection process, this set can be tested (provided a test procedure exists) for its value in the index.
Suppose the last t of a set of p variates are suspected. Partition

Σ_g = [[G₁₁, G₁₂], [G₁₂', G₂₂]] ,   Σ_e = [[E₁₁, E₁₂], [E₁₂', E₂₂]] ,   (4.14)

where G₁₁ and E₁₁ are of order (p - t) and G₂₂ and E₂₂ of order t. If the reduced index is based on the first p - t variates, then

c' = (a₁', 0') ,   (4.15)

and the part-total correlation coefficient is

ρ_xy₀ = a₁'(G₁₁a₁ + G₁₂a₂) / [a'Σ_g a · a₁'(G₁₁ + E₁₁)a₁]^{1/2} .   (4.16)
For reduction to improve over the base index, it is necessary that ρ_xy₀ > γ, which implies the more basic requirement that initially (4.17) holds. Once (4.17) is established, one would proceed to examine the truth of the inequality (4.18).
Only one parametric relationship has been discovered which insures (4.18) and is testable. If the set of t variates is constant in g and variable in x, then a₂'G₂₂a₂ = 0 implies a₁'G₁₂a₂ = 0, which guarantees (4.17). The inequality in (4.18) then is replaced by

a₁'G₁₁a₁ / [a₁'(G₁₁ + E₁₁)a₁]^{1/2} > a₁'G₁₁a₁ / [a₁'(G₁₁ + E₁₁)a₁ + 2a₁'E₁₂a₂ + a₂'E₂₂a₂]^{1/2} .   (4.19)

The validity of (4.19) is in question until it can be established that

2a₁'E₁₂a₂ + a₂'E₂₂a₂ > 0 .   (4.20)
Even though the t final elements of g are fixed and independent of the first p - t elements, the final t elements of x should be included in the index if (4.20) is violated. This is required because the correlation pattern of the errors is such that the t final elements of e tend to counteract the effects of the first p - t elements of e. The selection process actually is aided by measuring variates which are expressions of error only.
Indication of the presence or absence of these relationships can be tested in sampling schemes in which a test exists for a₁'G₁₂a₂ = 0, and for the equivalent to (4.20), i.e., that the corresponding sample quantity is greater than zero.   (4.21)
That the broader requirement (4.18) is satisfied by conditions other than the ones just stated is verified by the example in (4.10), (4.12), and (4.13), where a₂'G₂₂a₂ = 16 and a₁'G₁₂a₂ = 12. However, we have not discovered how such examples can be reduced to testable parametric relationships.
Some remarks should be made about the advisability of using the reduced index instead of an index where the weights of the optimum index are estimated. If it is possible with sample tests to detect improvement due to the use of a reduced index rather than the base index, a set of observational variates is present which is not useful for estimating G₁₁ and G₁₂; in the estimation of G₁₁ and G₁₂ these variables introduce needless variation into the sample index. In particular, since the index using estimated weights can be at best only intermediate between the usefulness of the base and optimum indices, if a₁'E₁₂a₂ is large, we can assume safely that the estimated index cannot improve much on the reduced index.

The possibility has not been overlooked that a reduced set of variates could be located, and an estimated set of weights used for possible further improvement in the selection index, but the features of such a procedure will not be discussed here.
4.4 Coincidence of the Base Index and Optimum Index

There is a simple relationship between Σ_g and Σ_e for which a'x_j and a'Σ_g(Σ_g + Σ_e)⁻¹x_j reduce to the same index. If Σ_g = dΣ_e, then

a'Σ_g(Σ_g + Σ_e)⁻¹x_j = a'dΣ_e(dΣ_e + Σ_e)⁻¹x_j = {d/(d + 1)} a'x_j .

The constant scale factor, d/(d + 1), only magnifies the difference between two sample observations and can be replaced by 1. Thus the optimum and base indices are one and the same. That this is the only case for which the two are equal for all sets of relative weights, a', is easily established.
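The collapse of the optimum weights to a scalar multiple of a' when Σ_g = dΣ_e is immediate algebraically, and a numerical sketch (my illustration; d and the matrices are hypothetical) makes it concrete:

```python
import numpy as np

# If Sg = d*Se, the optimum weights a'Sg(Sg + Se)^{-1} collapse to
# d/(d + 1) * a', so base and optimum indices coincide up to scale.
d  = 3.0
Se = np.array([[2.0, 1.0], [1.0, 2.0]])
Sg = d * Se
a  = np.array([4.0, 1.0])

c = a @ Sg @ np.linalg.inv(Sg + Se)
print(c, d / (d + 1) * a)   # the two vectors are identical
```

Any choice of d and positive definite Σ_e gives the same proportionality, which is the content of this section.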
It is particularly interesting to test, from sample data, the hypothesis H₀: Σ_g = dΣ_e, d being regarded as unknown. Consider a multivariate analysis of variance performed on a nested sampling scheme of the p-element vector x_jk:

x_jk = g_j + e_jk ,   j = 1, 2, ..., n₁ ;  k = 1, 2, ..., n₂ .   (4.22)

The analysis of variance is given in Table 1.
Table 1. Two-level, nested, multivariate analysis of variance

Source of variation    D.f.           Mean square    Expected mean square
Among x_j.             n₁ - 1         M_g            Σ_e + n₂Σ_g
Within x_j.            n₁(n₂ - 1)     M_e            Σ_e
Denote by ch{A} a characteristic root of the matrix A, and let c_i = ch{[(n₁ - 1)/n₁(n₂ - 1)] M_g M_e⁻¹}, i = 1, 2, ..., p. Notice that

H₀: Σ_g = dΣ_e ⟺ H₀₁: n₂Σ_g = dn₂Σ_e ⟺ H₀₂: (Σ_e + n₂Σ_g)Σ_e⁻¹ = (1 + dn₂)I_p ⟺ H₀₃: all ch{(Σ_e + n₂Σ_g)Σ_e⁻¹} = (1 + dn₂) = m, say.

Now under H₀₃, (1/m)c_i = v_i has the joint density function

c(p, n₁, n₂) Π_{i>i'} (v_i - v_{i'}) Π_{i=1}^{p} v_i^{(n₁-p-2)/2} (1 + v_i)^{-(n₁+n₂)/2} dv_i ,   (4.23)

where c(p, n₁, n₂) is a constant independent of m (Roy (1957)). A test of H₀₃ can be constructed only if d is known; a statistic independent of d is needed. This suggests a test based on the ratio of the extreme roots, c₁/c_p.   (4.24)
The critical region for H₀* is

c₁/c_p ≥ μ_α ,   (4.25)

where α is the chosen rejection level and μ_α is defined by integrating (4.23) over the ordered domain

v_{j-1} ≥ v_j ≥ v_{j+1} ,  j = 2, 3, ..., p-1 ;   v_{p-1} ≥ v_p ≥ v₁/μ_α ;   0 ≤ v₁ ≤ ∞ ,   (4.26)

and setting the integral equal to 1 - α.
To be of practical value, the solution of (4.26) for μ_α should be available in tabular form. At present such tables have not been calculated, but an approximate, conservative test can be constructed when tables for tests based on the joint distribution of the maximum and minimum sample root become available. Let λ_L and λ_U satisfy

P[mλ_L ≤ c_p ≤ c₁ ≤ mλ_U] = 1 - α .

Now, for a given value of m, the set of events mλ_L ≤ c_p ≤ c₁ ≤ mλ_U is contained within the larger set of events c₁/c_p ≤ λ_U/λ_L; thus

P[c₁/c_p ≤ λ_U/λ_L] ≥ 1 - α .   (4.27)

The left side of (4.27) can be averaged over any a priori distribution of m, and the marginal probability of the ratio of roots being less than λ_U/λ_L will still satisfy P[c₁/c_p ≤ λ_U/λ_L] ≥ 1 - α. Thus, if we let μ_α' = λ_U/λ_L, the probability of incorrectly calling a sample test result significant is less than α. This approximate test will provide good protection against calling the base and optimum indices the same on the basis of spurious sample results, but it decidedly lacks power in the Neyman-Pearson sense. Blocking can be introduced into the sampling scheme in Table 1 without altering the above results other than to change the error degrees of freedom.
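The root statistic built from Table 1 is easy to sketch by simulation. The following is my illustration of the scheme (4.22) under H₀; the matrices, sample sizes, and d are hypothetical, and no critical value is computed since the tables discussed above are what the test would actually require:

```python
import numpy as np

# Simulate the nested scheme (4.22) with Sg = d*Se (H0 true) and form the
# roots c_i = ch{[(n1 - 1)/n1(n2 - 1)] Mg Me^{-1}}.
rng = np.random.default_rng(0)
p, n1, n2, d = 2, 30, 5, 2.0
Se = np.array([[2.0, 1.0], [1.0, 2.0]])
Sg = d * Se

g = rng.multivariate_normal(np.zeros(p), Sg, n1)
e = rng.multivariate_normal(np.zeros(p), Se, (n1, n2))
x = g[:, None, :] + e                          # x_jk = g_j + e_jk

xbar = x.mean(axis=1)
dev = xbar - xbar.mean(axis=0)
Mg = n2 * dev.T @ dev / (n1 - 1)               # among-lot mean square
r = (x - xbar[:, None, :]).reshape(-1, p)
Me = r.T @ r / (n1 * (n2 - 1))                 # within-lot mean square

roots = np.linalg.eigvals(((n1 - 1) / (n1 * (n2 - 1))) * Mg @ np.linalg.inv(Me))
ratio = roots.real.max() / roots.real.min()
print(ratio)   # moderate under H0; it grows when Sg departs from d*Se
```

The ratio of extreme roots is the statistic of (4.24)-(4.25); under H₀ every population root equals the same constant m, so only sampling variation separates c₁ from c_p.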
5.0 METHODS OF COMPARING THREE SAMPLE INDICES

In this section, we will assume that any advisable reduction in the number of variates measured has been achieved, and the problem is to choose among the base index, optimum index, and estimated index for a satisfactory sample indicator of the values of the a'g_j, j = 1, 2, ..., k (estimated index refers to an index where the optimum weights have been estimated).

Brief comments have been made already in the preceding chapters on the problem of comparing indices. We have seen that the criterion for comparison can be a function of the expected gain, or a function of the expected gain and the variance of the realized gain. By itself, expected gain is not completely satisfactory if the aim of the selection process is immediate increase from a few samples, rather than accumulated gain over many samples. In particular, if selection is from only one sample, the distribution of realized gain about expected gain frequently will be skewed with a large variance (a result of a large variance of the element of error in the index), so that achievement of the level of expected gain is very unlikely. For this reason, the probability of correctly selecting one of the first m of k unknown variables with the maximum sample value of the selection index can be a more suitable comparison criterion. This probabilistic representation of the worth of an index increases in a curvilinear manner as the variance of the error element decreases.

The conceptual difference between the expected gain and probability criteria is illustrated best graphically. In Figure 5, the expected gain is given for selection, in samples of size k, of the upper q = 1/k
[Figure 5. A graphic comparison of expected gain and P[1, 1, k].]
fraction of the population of observable variables. The actual graph is that of the ratio θ⁻¹E(gain), where E(gain) = θρ_xy and θ is a function of k and σ_y. The proportional improvement, (1/2) - ρ₀, in expected gain by using an index for which ρ_xy = 1/2 instead of one for which ρ_xy = ρ₀ is the same as the improvement over the better of these indices by use of an index for which ρ_xy = 1 - ρ₀. Expressed in P[1, 1, k], however, the probabilistic improvement in the upper range, I₂(k), is superior to that in the lower range, I₁(k). For fixed σ_y, as k increases the proportional increases in expected gain remain constant, (1/2) - ρ₀, in both regions, but the true expected gain, the multiple of (1/2) - ρ₀, rises rapidly with k. This pattern of increase does not hold for the improvement measures I₁(k) and I₂(k); as k increases (and ρ_xy < 1) both decrease to zero, I₁(k) more rapidly than I₂(k), until some limiting ratio of the improvements is reached.
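The contrast between the two criteria can be made concrete by simulation. In the sketch below (my illustration, not the dissertation's Figure 5; unit variances assumed), equal steps in ρ_xy produce equal expected-gain improvements but unequal probability improvements:

```python
import numpy as np

def p11k_mc(rho, k, n=100_000, seed=3):
    """Monte Carlo estimate of P[1, 1, k]."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, k))
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal((n, k))
    return np.mean(np.argmax(x, axis=1) == np.argmax(y, axis=1))

# Scaled expected gain is linear in rho, so the two rho steps of 0.2 below
# give equal gain improvements; the probability criterion weights the
# upper-range step more heavily.
k, rho0 = 20, 0.3
I1 = p11k_mc(0.5, k) - p11k_mc(rho0, k)       # lower-range improvement
I2 = p11k_mc(1 - rho0, k) - p11k_mc(0.5, k)   # upper-range improvement
print(I1, I2)   # I2 exceeds I1
```

This reproduces, for one k, the I₂(k) > I₁(k) relation read from Figure 5.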
These features are explained easily. Expected gain is an average of realized gain over all samples, and is unaffected by the sample-to-sample index variance. The selected proportion, q, decreases as k increases; hence, more and more, only very large g_j will be retained, resulting in increasing expected gains. P[1, 1, k], however, becomes more sensitive to the variance of the error elements as k increases. The larger the sample, the more likely that a small amount of variation in the error portion of the observations will cause the maximum worth variable to be missed. For large samples this feature is not desirable for determining the worth of a single index unless the condition of good selection is exactly that of obtaining the maximum. However, for comparison of two indices this is not an objectionable feature, since it indicates that the real difference between indices is small unless the use of the better index results in a large decrease in the variance of the error element.
When examining the base index and optimum index as possible choices for a working index, these concepts take on added importance. To begin with the most elementary situation, assume that the parameters in the population are known; a'Σ_g(Σ_g + Σ_e)⁻¹ and ρ_xy can be calculated. With each index is associated a parameter ρ_xy, which will be identified in the following manner: γ for the base index and γ₁ for the optimum index. To decide which index to use, the probabilistic improvement from use of the optimum index, P[1, 1, k | γ₁] - P[1, 1, k | γ] = I(γ, γ₁), say, must be balanced against the computational effort to obtain a'Σ_g(Σ_g + Σ_e)⁻¹.
If the parameters are known, the decision will be to use the optimum index. When Σ_g and Σ_e are not known, the decision of which to use, the base index or the estimated index, is not easy. If the weights a'S_g(S_g + S_e)⁻¹ are estimated from an initial sample and then are applied as constants to observations in succeeding independent samples, the worth of the index is dependent on

ρ* = a'S_g(S_g + S_e)⁻¹Σ_g a / [a'Σ_g a · a'S_g(S_g + S_e)⁻¹(Σ_g + Σ_e)(S_g + S_e)⁻¹S_g a]^{1/2} = γ* .
γ* is either in the range γ ≤ γ* ≤ γ₁, the result of an index intermediate in value between the base and optimum indices, or in the range γ* < γ, the result of an index inferior to the base index. The aim of the estimation procedure is to produce an intermediate index. This can be achieved only if the estimates are good, which for highly variable data intimates that a very large number of observations must be taken. If the probabilistic improvement of the optimum over the base index is small, then the chance of achieving an even smaller improvement over the base index with the estimated index may not be large enough to outweigh the risk that the estimates provide results worse than those for the base index. This is very important since the results of the next section give some indication that the variances of the estimates increase as the differences between the weights of the base index and optimum index increase.
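The behavior of γ* can be illustrated by perturbing the population matrices of the earlier two-variate example to mimic estimates. The perturbation below is purely hypothetical, a stand-in for real sampling error:

```python
import numpy as np

def gamma_star(Sg_hat, Se_hat, Sg, Se, a):
    """Worth correlation of an index whose weights were computed from the
    (estimated) matrices Sg_hat, Se_hat but applied in the true population."""
    c = a @ Sg_hat @ np.linalg.inv(Sg_hat + Se_hat)
    return (c @ Sg @ a) / np.sqrt((a @ Sg @ a) * (c @ (Sg + Se) @ c))

Sg = np.array([[1.0, 3.0], [3.0, 10.0]])
Se = np.array([[2.0, 1.0], [1.0, 2.0]])
a  = np.array([4.0, 1.0])

print(gamma_star(Sg, Se, Sg, Se, a))   # exact matrices recover the optimum 0.8989

rng = np.random.default_rng(4)
noise = 0.3 * rng.standard_normal((2, 2))
Sg_hat = Sg + (noise + noise.T) / 2    # crude stand-in for a sample estimate
print(gamma_star(Sg_hat, Se, Sg, Se, a))   # never exceeds 0.8989
```

Since the optimum weights maximize (4.7), γ* can never exceed γ₁ = .8989; whether it stays above the base value γ = .7372 depends on the quality of the estimates, which is exactly the risk discussed above.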
Cochran (1950) has suggested that the theoretical investigation of this procedure should be extended by considering the initial estimates as constant for a single series of selection trials, but variable from one series of trials to another. Essentially this requires finding the expectation of P[1, 1, k | γ*] over the distribution of the initial estimates. This is theoretically interesting because the results include a measure of the chance that γ* falls in the desired range γ ≤ γ* ≤ γ₁; however, the mathematical problems involved in reaching the result are staggering.
Although less than satisfactory, an approach can be made to the problem suggested by Cochran by obtaining the unconditional correlation of the estimated index with the worth,

γ̂ = ρ of a'g with a'S_g(S_g + S_e)⁻¹x .

This correlation is not applicable to the formulae developed for expected gain and P[1, 1, k], because a'g and a'S_g(S_g + S_e)⁻¹x are not jointly distributed in a bivariate normal form. However, there is some justification for the assumption, tacitly employed in most of the literature, that the ranking of the correlation coefficients of sample indices with worth is a reliable indicator of the ranking of the usefulness of the same indices. Again the point of interest is whether the estimated index can be considered as an intermediate index, γ ≤ γ̂ ≤ γ₁, or as an inferior index, γ̂ < γ. We will consider this problem in detail in the next chapter.
6.0 AN ILLUSTRATION OF THE PROPERTIES OF SAMPLE ESTIMATES USED IN CONSTRUCTING A SELECTION INDEX

6.1 Definition of Population and Proposed Estimates

To derive the properties of an estimated index, very simple sampling schemes must be assumed or the problems in integration will be extremely difficult. It is necessary that the estimation matrices derived from one sample be independent, and that the estimation sample be independent of the selection samples. We will insure the second of these requirements by assuming that all estimates will be derived from random samples other than the selection sample. The first requirement will be guaranteed by specializing the sampling scheme used to obtain data for the estimation sample. Since the selection index is applied easily to populations where there are lots of material to be selected, and each lot can be sampled repeatedly if necessary, a replicated, cross-classified design with subsamples has been chosen to illustrate the statistical properties of the estimated index.
Suppose in the estimation sample a single sample unit is

x_{i'jk} = g_{i'} + b_j + (gb)_{i'j} + e_{i'jk} ,   j = 1, 2, ..., b ;  k = 1, 2, ..., n₂ ,   (6.1)

where b_j is the j-th, fixed-block, parametric vector, (gb)_{i'j} is the mixed interaction vector for the j-th block and i'-th lot, g_{i'} and e_{i'jk} are independent, random, p-variate, normal vectors, and

Σ_{j=1}^{b} b_j = 0 ,   Σ_{j=1}^{b} (gb)_{i'j} = 0 .
Each vector contains p elements, so that the p × p covariance matrices each contain p(p + 1)/2 elements which must be estimated. Since the i'-th lot can be blocked and replicated as many times as physical limitations permit, the selection sample can consist of k means, x̄_i, over n block-replication combinations each. The k means are distributed independently as in (6.2).
The optimum index,

(6.3)

is to be approximated by the use of the best available estimates of a'Σ_g(Σ_e + nΣ_g)⁻¹ which can be derived from the estimation sample.
Table 2 gives the ordinary multivariate analysis of variance for the data in the estimation sample.

Table 2:  Multivariate analysis of variance for estimation data

Source of variation    D.f.               Mean square    Expected mean square
Among blocks           b − 1              M_b            Σ_e + n₂Σ_gβ + n₁n₂Θ(β)
Among g_i'             n₁ − 1             M_g            Σ_e + bn₂Σ_g
Interaction            (b − 1)(n₁ − 1)    M_gβ           Σ_e + n₂Σ_gβ
Error                  bn₁(n₂ − 1)        M_e            Σ_e
If b = 1 and the mean μ = 0, in which case there are n₁ degrees of freedom for M_g, or if b = 1 but μ ≠ 0 and the information in the one degree of freedom for the mean is disregarded, then

(6.4)   Σ̂_e = M_e ,   Σ̂_g = (1/n₂)(M_g − M_e)

form two sets of maximum likelihood estimators for the population parameters.
Even if b ≠ 1, the functions in (6.4) are the most readily available estimates of the covariance matrices.
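A minimal numerical sketch of the estimators in (6.4), with b = 1 as in the text; the 2 x 2 mean-square matrices below are made-up illustrative values, not data from the dissertation:

```python
# Sketch of the covariance-component estimates in (6.4), assuming b = 1:
#   Sigma_e_hat = M_e ,   Sigma_g_hat = (M_g - M_e) / n2 .
# The mean-square matrices are hypothetical.

def estimate_components(Mg, Me, n2):
    """Return (Sigma_e_hat, Sigma_g_hat) from the mean-square matrices."""
    p = len(Mg)
    Sigma_e = [row[:] for row in Me]                       # Sigma_e_hat = M_e
    Sigma_g = [[(Mg[i][j] - Me[i][j]) / n2 for j in range(p)]
               for i in range(p)]                          # (M_g - M_e)/n2
    return Sigma_e, Sigma_g

Mg = [[9.0, 3.0], [3.0, 7.0]]   # "among g" mean square (hypothetical)
Me = [[5.0, 1.0], [1.0, 3.0]]   # error mean square (hypothetical)
Se, Sg = estimate_components(Mg, Me, n2=2)
print(Se)   # [[5.0, 1.0], [1.0, 3.0]]
print(Sg)   # [[2.0, 1.0], [1.0, 2.0]]
```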
The set of weights

(6.5)   (1/(bn₂)) a'(M_g − M_e)[M_e + (n/(bn₂))(M_g − M_e)]⁻¹

is simplified if the number of blocks and replications in the estimation sample is the same as the number of blocks and replications in the selection sample.
These weights,

(6.6)   a'(I − M_e M_g⁻¹) ,

are composed of known constants and independent sample covariance matrices, and they exhaust all the usable information contained in the estimation sample about the parametric weights of the optimum index.
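A sketch of the weight computation in (6.6) for the two-variate case; the matrices M_g, M_e and the vector a are hypothetical, and only 2 x 2 matrices are handled:

```python
# Sketch of the weights in (6.6), w' = a'(I - Me Mg^{-1}), for p = 2.
# Mg, Me, and a are hypothetical illustrative values.

def inv2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

def matmul2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def index_weights(a, Me, Mg):
    MeMginv = matmul2(Me, inv2(Mg))
    IminusM = [[(1.0 if i == j else 0.0) - MeMginv[i][j] for j in range(2)]
               for i in range(2)]
    return [sum(a[i] * IminusM[i][j] for i in range(2)) for j in range(2)]

a  = [1.0, 1.0]                     # known relative weights (hypothetical)
Mg = [[9.0, 3.0], [3.0, 7.0]]
Me = [[5.0, 1.0], [1.0, 3.0]]
print(index_weights(a, Me, Mg))     # approximately [0.4444, 0.6667]
```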
The expectations and variances of the weights in (6.6) can be obtained for an unspecified μ by adopting M_g for n₁ − 1 degrees of freedom. Even with this precaution, for the correlation of the estimated index with the true worth, the moment evaluation problem is greatly simplified by adopting μ = 0 in the selection sample, and then noting that the resulting correlation coefficient is an upper positive or lower negative bound for the correlation coefficient when a general μ is the correct assumption.
This is most easily seen if we partition the variable x̄_i from the selection sample into the sum of a constant element, μ, and a variable element, v_i: μ + v_i = x̄_i (say). The covariance of the estimated index and the true worth is unaffected by adopting μ = 0, but the variance of the estimated index is reduced by this procedure. The desired result follows directly;

(6.9)
Notice that the sampling scheme given by (6.1) and Table 2 is quite useful. If blocking is unnecessary, b = 1, then an elementary, two-level nested design is obtained which provides possibly the best obtainable estimates for the index. If no interactions are assumed, the estimates of the weights are unchanged except that the error mean square contains bn₁n₂ − b − n₁ + 1 degrees of freedom. Any of these specializations of the general design requires only that the error degrees of freedom be altered in the moment evaluations which follow.

The evaluation of moments which is necessary to complete the discussion begun in Chapter 5 involves lengthy elementwise multiplications of matrices. To facilitate the ease of presentation, the moments in
Section 6.2 are derived only for a two-variate normal population, with the important properties of these moments extended in Section 6.5 to include p-variate normal populations.
6.2  Moments Associated with a Two-Variate Estimated Index

Evaluation of the first two moments of a'(I − M_e M_g⁻¹) and a'(I − M_e M_g⁻¹)v_i depends on the possibility of transforming the estimated weights to matrix functions of random elements which are mutually independent. Since M_e, M_g, and v_i are independent, expectations can be taken first with respect to v_i, then with respect to M_g and M_e. The expectations over v_i reduce all the moment problems to finding expectations of functions of M_g and M_e.
(6.10)  Var[a'(I − M_e M_g⁻¹)v_i] = a'E[(I − M_e M_g⁻¹)v_i v_i'(I − M_g⁻¹M_e)]a
        = a'E(v_i v_i')a − 2a'E(M_e M_g⁻¹ v_i v_i')a + a'E(M_e M_g⁻¹ v_i v_i' M_g⁻¹M_e)a

(6.12)  = (1/(bn₂)) a'(Σ_e + bn₂Σ_g)a − (2/(bn₂)) a'E[M_e M_g⁻¹](Σ_e + bn₂Σ_g)a
          + (1/(bn₂)) a'E[M_e M_g⁻¹(Σ_e + bn₂Σ_g)M_g⁻¹M_e]a .
(6.13)  Cov[a'(I − M_e M_g⁻¹)v_i ; a'g_i] = a'E(v_i g_i')a − a'E(M_e M_g⁻¹ v_i g_i')a
        = a'Σ_g a − a'E(M_e M_g⁻¹)Σ_g a ;

(6.14)  Var[a'(I − M_e M_g⁻¹)] = E[(I − M_g⁻¹M_e)a a'(I − M_e M_g⁻¹)]
        − E[(I − M_g⁻¹M_e)]a a'E[(I − M_e M_g⁻¹)] .
Equations (6.12), (6.13), and (6.14) depend on the expectations of

(6.15)  E(M_e M_g⁻¹)   and   E(M_e M_g⁻¹ Q M_g⁻¹ M_e) ,

where Q is a 2 x 2, symmetric, at least positive-semi-definite matrix of constants. In his text, Roy (1957) shows how (n₁ − 1)M_g can be expressed as a product of triangular matrices T and T', where T is 2 x 2;
(6.16)  (n₁ − 1)M_g = TT' .

The distribution of T is given as

(6.17)  const. t₁₁^(n₁−2) t₂₂^(n₁−3) exp[−(1/2) tr Σ⁻¹TT'] dT ,

where Σ is used to denote Σ_e + bn₂Σ_g and tr stands for the trace of a matrix. Since Σ is a symmetric positive-definite matrix, it too can be factored into triangular matrices;

(6.18)  Σ = EE' ,   Σ⁻¹ = E'⁻¹E⁻¹ .

In the exponential term of (6.17), the relationship

(6.19)  tr Σ⁻¹TT' = tr (E⁻¹T)(E⁻¹T)'

suggests the use of the transformation,

(6.20)  V = E⁻¹T .

The probability density of V is

(6.21)  const. v₁₁^(n₁−2) v₂₂^(n₁−3) exp[−(1/2) tr VV'] dV .
If in (6.15) Q = Σ, then one of the required results is the expectation of

(6.22)  M_g⁻¹ Σ M_g⁻¹ = (n₁ − 1)² E'⁻¹(VV')⁻¹(VV')⁻¹E⁻¹ ,

which leads to the simpler problem,

(6.23)  E[(VV')⁻¹(VV')⁻¹] .

But

(6.24)  (VV')⁻¹ = (1/(v₁₁²v₂₂²)) Adj(VV') ,

and

(6.25)  (VV')⁻¹(VV')⁻¹ = (1/(v₁₁⁴v₂₂⁴)) Adj(VV')Adj(VV')

is equal to

(6.26)  (1/(v₁₁⁴v₂₂⁴)) | (v₂₁² + v₂₂²)² + v₁₁²v₂₁²        −v₁₁v₂₁(v₁₁² + v₂₁² + v₂₂²) |
                        | −v₁₁v₂₁(v₁₁² + v₂₁² + v₂₂²)      v₁₁²v₂₁² + v₁₁⁴             | .
To find the expectation of (6.26), we must find the expectations of

(6.27)  v₂₁⁴/(v₁₁⁴v₂₂⁴) ,  v₂₁²/(v₁₁⁴v₂₂²) ,  1/v₁₁⁴ ,  v₂₁²/(v₁₁²v₂₂⁴) ,
        v₂₁/(v₁₁v₂₂⁴) ,  v₂₁³/(v₁₁³v₂₂⁴) ,  v₂₁/(v₁₁³v₂₂²) .

To achieve this, first expand the exponential term in (6.21), and then notice the ranges of variation. v₂₁ is a normal variable, independent of v₁₁, v₂₂, with mean zero and standard deviation one. v₁₁² is distributed as χ² with n₁ − 1 degrees of freedom, v₂₂² is distributed as χ² with n₁ − 2 degrees of freedom, and v₁₁² and v₂₂² are independent. Since the odd moments of v₂₁ are zero, we see that in (6.27)

(6.28)  E[v₂₁/(v₁₁v₂₂⁴)] = E[v₂₁³/(v₁₁³v₂₂⁴)] = E[v₂₁/(v₁₁³v₂₂²)] = 0 ,

and the expectation of the off-diagonal elements of (6.26) is zero. Notice also that the odd powers of v_ii, i = 1, 2, are associated only with the odd powers of v₂₁; the result is that expectations only of integral powers of χ² variables are required. The expectation of (6.26) now involves the evaluations:

(6.29)  E[1/v₁₁²] = 1/(n₁ − 3) ,   E[1/v₁₁⁴] = 1/((n₁ − 3)(n₁ − 5)) ,
        E[1/v₂₂²] = 1/(n₁ − 4) ,   E[1/v₂₂⁴] = 1/((n₁ − 4)(n₁ − 6)) ,
        E[v₂₁²] = 1 ,   E[v₂₁⁴] = 3 .
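The inverse moments in (6.29) follow from the standard formula E[(χ²_k)^(−r)] = Γ(k/2 − r)/(2^r Γ(k/2)), valid for k > 2r; a quick deterministic check of the values used here (the formula is a textbook fact, not code from the dissertation):

```python
import math

# Standard inverse moments of a chi-square variable with k degrees of freedom:
#   E[(chi2_k)^(-r)] = Gamma(k/2 - r) / (2**r * Gamma(k/2)),  valid for k > 2r.

def inv_chi2_moment(k, r):
    return math.gamma(k / 2 - r) / (2 ** r * math.gamma(k / 2))

k = 12                                    # e.g. v11^2 when n1 = 13
print(inv_chi2_moment(k, 1))              # 1/(k-2)          = 0.1
print(inv_chi2_moment(k, 2))              # 1/((k-2)(k-4))   = 0.0125
assert math.isclose(inv_chi2_moment(k, 1), 1 / (k - 2))
assert math.isclose(inv_chi2_moment(k, 2), 1 / ((k - 2) * (k - 4)))
```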
Hence,

(6.30)  E[((v₂₁² + v₂₂²)² + v₁₁²v₂₁²)/(v₁₁⁴v₂₂⁴)]
        = 3/((n₁−3)(n₁−5)(n₁−4)(n₁−6)) + 2/((n₁−3)(n₁−5)(n₁−4))
          + 1/((n₁−3)(n₁−5)) + 1/((n₁−3)(n₁−4)(n₁−6))
        = (n₁ − 2)/((n₁ − 3)(n₁ − 4)(n₁ − 6)) .

As

(6.31)  E[(v₁₁²v₂₁² + v₁₁⁴)/(v₁₁⁴v₂₂⁴)] = 1/((n₁−3)(n₁−4)(n₁−6)) + 1/((n₁−4)(n₁−6))
        = (n₁ − 2)/((n₁ − 3)(n₁ − 4)(n₁ − 6)) ,

the other diagonal element has the same expectation as (6.30). Substituting the results of (6.28), (6.30), and (6.31) in (6.22), the solution for E[M_g⁻¹(Σ_e + bn₂Σ_g)M_g⁻¹] follows directly:

(6.32)  E[M_g⁻¹(Σ_e + bn₂Σ_g)M_g⁻¹]
        = ((n₁ − 1)²(n₁ − 2)/((n₁ − 3)(n₁ − 4)(n₁ − 6))) (Σ_e + bn₂Σ_g)⁻¹ .
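The partial-fraction collapse in (6.30) can be verified exactly with rational arithmetic; this only checks the identity, it adds nothing new:

```python
from fractions import Fraction as F

# Check the sum in (6.30):
#   3/((n-3)(n-5)(n-4)(n-6)) + 2/((n-3)(n-5)(n-4)) + 1/((n-3)(n-5))
#     + 1/((n-3)(n-4)(n-6))  ==  (n-2)/((n-3)(n-4)(n-6))
for n in range(7, 40):
    lhs = (F(3, (n - 3) * (n - 5) * (n - 4) * (n - 6))
           + F(2, (n - 3) * (n - 5) * (n - 4))
           + F(1, (n - 3) * (n - 5))
           + F(1, (n - 3) * (n - 4) * (n - 6)))
    rhs = F(n - 2, (n - 3) * (n - 4) * (n - 6))
    assert lhs == rhs
print("identity (6.30) verified for n1 = 7..39")
```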
The expectations involving M_e are handled in a similar manner. Factor bn₁(n₂ − 1)M_e into T*T*' and replace bn₁(n₂ − 1) by m. The probability density of T* is

(6.33)  const. t*₁₁^(m−1) t*₂₂^(m−2) exp[−(1/2) tr Σ_e⁻¹T*T*'] dT* .

If Σ_e⁻¹ = E*'⁻¹E*⁻¹, then the change of variables

(6.34)  U = E*⁻¹T*
transforms (6.33) to

(6.35)  const. u₁₁^(m−1) u₂₂^(m−2) exp[−(1/2) tr UU'] dU .

Consider now some symmetric, positive-definite matrix Σ₁ and the matrix product

(6.36)  E(M_e Σ₁⁻¹ M_e) .

Simplify (6.36) by denoting E*'Σ₁⁻¹E* = A. A is symmetric positive-definite;

(6.37)  M_e Σ₁⁻¹ M_e = (1/m²) E*(UU')A(UU')E*' .

From (6.24), (UU') is obtained by interchanging u_ij and v_ij. The essential component of (6.36) is the product

(6.38)  (UU')A*(UU') = | a₁₁u₁₁⁴ + a₂₂u₁₁²u₂₁²            a₁₂(2u₁₁²u₂₁² + u₁₁²u₂₂²)       |
                       | a₁₂(2u₁₁²u₂₁² + u₁₁²u₂₂²)        a₁₁u₁₁²u₂₁² + a₂₂(u₂₁² + u₂₂²)² | .

The matrix in (6.38) differs from (UU')A(UU') in that the terms of this product containing u₂₁ and u₂₁³ have been omitted because we have shown that they have zero expectation. Since the distributional properties of the u_ij are similar to those of the v_ij with n₁ − 1 replaced by m, the expectation of (6.38) can be written down directly.

(6.39)  E[a₁₁u₁₁⁴ + a₂₂u₁₁²u₂₁²] = a₁₁m(m + 2) + a₂₂m ,

(6.40)  E[a₁₁u₁₁²u₂₁² + a₂₂(u₂₁⁴ + 2u₂₁²u₂₂² + u₂₂⁴)] = a₁₁m + a₂₂m(m + 2) ,

(6.41)  E[a₁₂(2u₁₁²u₂₁² + u₁₁²u₂₂²)] = a₁₂[2m + m(m − 1)] = a₁₂m(m + 1) .
By substituting these results in (6.36), we have

(6.42)  E(M_e Σ₁⁻¹ M_e) = E* | ((m+2)/m)a₁₁ + (1/m)a₂₂     ((m+1)/m)a₁₂               | E*' .
                             | ((m+1)/m)a₁₂                ((m+2)/m)a₂₂ + (1/m)a₁₁   |

This matrix of linear functions in the a_ij can be written as the sum of two parts,

(6.43)  E(M_e Σ₁⁻¹ M_e) = ((m + 2)/m) E*AE*' + (1/m) E*(Adj A)E*' .

Since A is symmetric positive-definite, for any vector b of real numbers,

(6.44)  ((m + 2)/m) b'Ab + (1/m) b'(Adj A)b ≥ ((m + 2)/m) b'Ab .
In (6.44), b' can be replaced by a'E* with the result that

(6.45)  a'E(M_e Σ₁⁻¹ M_e)a ≥ ((m + 2)/m) a'Σ_e Σ₁⁻¹ Σ_e a .

If Σ₁⁻¹ is replaced by (6.32) and m is replaced by bn₁(n₂ − 1), a lower bound is established for one of the important expectations,

(6.46)  a'E[M_e M_g⁻¹(Σ_e + bn₂Σ_g)M_g⁻¹M_e]a
        ≥ ((m + 2)/m)((n₁ − 1)²(n₁ − 2)/((n₁ − 3)(n₁ − 4)(n₁ − 6))) a'Σ_e(Σ_e + bn₂Σ_g)⁻¹Σ_e a .
Another expectation for which a bound similar to (6.46) is needed is

(6.47)  a'E[M_g⁻¹M_e Q M_e M_g⁻¹]a .

From the results in (6.46),

(6.48)  a'E[M_g⁻¹M_e Q M_e M_g⁻¹]a ≥ ((m + 2)/m) a'E[M_g⁻¹Σ_e Q Σ_e M_g⁻¹]a .

Replace M_g by its triangular factorization (1/(n₁ − 1))EVV'E', and denote E⁻¹Σ_e Q Σ_e E'⁻¹ by B; then the bound in (6.48) is

(6.49)  ((m + 2)/m)(n₁ − 1)² E[a'E'⁻¹(VV')⁻¹B(VV')⁻¹E⁻¹a] .

But

(6.50)  (VV')⁻¹B(VV')⁻¹ = (1/(v₁₁⁴v₂₂⁴)) Adj(VV') B Adj(VV')

has a form similar to (6.38) for a 2 x 2 matrix when the elements with zero expectation are omitted--

(6.51)  (1/(v₁₁⁴v₂₂⁴)) | b₁₁(v₂₁² + v₂₂²)² + b₂₂v₁₁²v₂₁²     b₁₂(2v₁₁²v₂₁² + v₁₁²v₂₂²) |
                        | b₁₂(2v₁₁²v₂₁² + v₁₁²v₂₂²)           b₁₁v₁₁²v₂₁² + b₂₂v₁₁⁴     | .

The matrix in (6.51) has expectation

(6.52)  (1/((n₁ − 4)(n₁ − 6))) B + (1/((n₁ − 3)(n₁ − 4)(n₁ − 6))) (Adj B) .
Matrix B is at least positive-semi-definite; hence, after replacing m by bn₁(n₂ − 1),

(6.53)  a'E[M_g⁻¹M_e Q M_e M_g⁻¹]a
        ≥ ((m + 2)/m)((n₁ − 1)²/((n₁ − 4)(n₁ − 6))) a'(Σ_e + bn₂Σ_g)⁻¹Σ_e Q Σ_e(Σ_e + bn₂Σ_g)⁻¹a .

Finally, we need E(M_e M_g⁻¹). Since M_e and M_g are independent, the expectation of M_e M_g⁻¹ obviously is

(6.54)  E(M_e M_g⁻¹) = (n₁ − 1) E[Σ_e E'⁻¹(VV')⁻¹E⁻¹] = (n₁ − 1) Σ_e E'⁻¹E[(VV')⁻¹]E⁻¹ .

Referring back to (6.25) and (6.29), we see that the evaluation of this expectation is

(6.55)  E(M_e M_g⁻¹) = (n₁ − 1) Σ_e E'⁻¹ (1/(n₁ − 4)) I E⁻¹ = ((n₁ − 1)/(n₁ − 4)) Σ_e(Σ_e + bn₂Σ_g)⁻¹ .
6.3  Variances of the Weights of the Two-Variate Estimated Index

If the weights for the estimated index are constructed from one initial sample and then applied as constants in all the selection samples, the main point of concern is how well r_γ̂ approximates r_Γ. The answer to this question involves an examination of the mean and variance of the vector of estimated weights. First, from the result in (6.55), we see

(6.56)  E[a'(I − M_e M_g⁻¹)] = a'I − a'E(M_e M_g⁻¹)
        = a'[I − ((n₁ − 1)/(n₁ − 4)) Σ_e(Σ_e + bn₂Σ_g)⁻¹] .
The weights of the index given by (6.6) are biased and should be replaced by the adjusted estimates

(6.57)  a'(I − ((n₁ − 4)/(n₁ − 1)) M_e M_g⁻¹) .

This adjustment for bias is very important because its omission can reverse the ranking of two sample index values. The difference between two sample index values is

(6.58)  a'(I − M_e M_g⁻¹)(x̄_i − x̄_i') = a'[I − ((n₁ − 4)/(n₁ − 1)) M_e M_g⁻¹](x̄_i − x̄_i')
        − (3/(n₁ − 1)) a'M_e M_g⁻¹(x̄_i − x̄_i') .

The remainder causing the bias, −[3/(n₁ − 1)] a'M_e M_g⁻¹(x̄_i − x̄_i'), can change the sign of the difference; at the end of Section 6.4, we will show that for populations where this is a frequent occurrence, this results in a negative covariance of the biased estimated index and worth.
The covariance matrix of the unbiased set of estimated weights is

(6.59)  E{[((n₁−4)/(n₁−1)) M_g⁻¹M_e − (Σ_e + bn₂Σ_g)⁻¹Σ_e] a a' [((n₁−4)/(n₁−1)) M_e M_g⁻¹ − Σ_e(Σ_e + bn₂Σ_g)⁻¹]}
        = ((n₁−4)/(n₁−1))² E[M_g⁻¹M_e a a' M_e M_g⁻¹] − (Σ_e + bn₂Σ_g)⁻¹Σ_e a a' Σ_e(Σ_e + bn₂Σ_g)⁻¹ .

Let a a' = A, and replace the symmetric matrices with triangular factorizations. The right side of equation (6.59) reduces to

(6.60)  ((n₁ − 4)/m)² E{E'⁻¹(VV')⁻¹E⁻¹E*[(UU')E*'AE*(UU')]E*'E'⁻¹(VV')⁻¹E⁻¹}
        − (Σ_e + bn₂Σ_g)⁻¹Σ_e A Σ_e(Σ_e + bn₂Σ_g)⁻¹ .

Identify E*'AE* with B, and apply to (6.60) the results in (6.43) and (6.52)--
(6.61)  ((n₁ − 4)/(n₁ − 6)) E'⁻¹{ D + (1/(n₁ − 3)) Adj D }E⁻¹
        − (Σ_e + bn₂Σ_g)⁻¹Σ_e a a'Σ_e(Σ_e + bn₂Σ_g)⁻¹ ,

where

        D = (1/(bn₁(n₂ − 1))) E⁻¹{ [bn₁(n₂ − 1) + 2] E*BE*' + E*(Adj B)E*' }E'⁻¹ .

Make the further identification

(6.62)  K = (n₁ − 4)/[bn₁(n₂ − 1)(n₁ − 6)] ;

then the covariance matrix of the estimated weights is

(6.63)  {K[bn₁(n₂ − 1) + 2] − 1}(Σ_e + bn₂Σ_g)⁻¹Σ_e a a'Σ_e(Σ_e + bn₂Σ_g)⁻¹
        + K(Σ_e + bn₂Σ_g)⁻¹ E* Adj(E*'a a'E*) E*'(Σ_e + bn₂Σ_g)⁻¹
        + (K[bn₁(n₂ − 1) + 2]/(n₁ − 3)) E'⁻¹ Adj(E⁻¹Σ_e a a'Σ_e E'⁻¹) E⁻¹ .
In the general form of (6.63), only the first term of the sum is readily comprehended; however, for a numerical example, it is possible to write down the exact variances and covariance of the two coefficients. Of more value are the lower bounds of the variances which are derived by use of the results in (6.44) and (6.53). Denote by δ_i, i = 1, 2, a vector with the i-th element equal to one and the other element zero. Then, from (6.63),

(6.64)  Var[a'(I − ((n₁ − 4)/(n₁ − 1)) M_e M_g⁻¹)δ_i]
        ≥ {K[bn₁(n₂ − 1) + 2] − 1} a'Σ_e(Σ_e + bn₂Σ_g)⁻¹δ_i δ_i'(Σ_e + bn₂Σ_g)⁻¹Σ_e a .

More conveniently, the matrix-vector products can be rearranged as a square;

(6.65)  Var[a'(I − ((n₁ − 4)/(n₁ − 1)) M_e M_g⁻¹)δ_i]
        ≥ {K[bn₁(n₂ − 1) + 2] − 1} [a'Σ_e(Σ_e + bn₂Σ_g)⁻¹δ_i]² .

The bound in (6.65) is an important result for the two-variate index. The greater the difference between the base index and optimum index weights, the more improvement the researcher would expect to make if a good set of estimates could be used to approximate the optimum weights. However, the lower bounds to the variances of the estimates have increasingly unfavorable properties as this possible improvement increases. The width of the plus and minus one standard deviation interval about the mean difference between the weights of the base and estimated indices is at least

(6.66)  2{K[bn₁(n₂ − 1) + 2] − 1}^(1/2) |bn₂ a'Σ_g(Σ_e + bn₂Σ_g)⁻¹δ_i − a'δ_i| .
78
This is particularly large when passing from the base to the optimum
index produces a change in the sign of the weights ~
The variance multipliers K (for (6.63»
and K b~ (~ - 1) + 2
(for
(6.65» are the best keys we have to how to estimate the weights of the
optimum index as well as possible.
The second multiplier is undefined
for ~ < 7; is greater than two for ~ :: 7 and b(~ - 1)
greater than one for ~
=8
and b(~ - 1) ~ 2.
If ~
2:
2; and is
2= 9, the multi-
plier decreases simultaneously from one with ~ and/or b(~ - 1).
This
indicates that it is most important, when planning to estimate, to allocate the sampling units so that most of the sample resources are used to
estimate the variation among the ~ I
•
{K[b~(~ - 1) + 2] - l} is miriimized in a balanced allocation with
fixed total, N = b~~, when ~
= N/4
and b, ~ = 2.
For this allocation
K bn.. (n.. _ 1) + 2 :: (N + 4)( N - 16)
.L c
N-<N - 24)
K [b~(~ -
. 1) + 2 ] - 1
,
4 f3N_ - 24)
16)
= NN
•
If blocking is unnecessary, then the allocation is ~ = N/2 and ~ = 2
. for the indicated best estimates.
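The allocation claims above can be checked by enumeration, taking K = (n₁ − 4)/[bn₁(n₂ − 1)(n₁ − 6)], the form consistent with the numerical values just given; the enumeration itself is my own sketch:

```python
# Enumerate allocations N = b*n1*n2 (n1 >= 7, n2 >= 2) and evaluate the
# variance multiplier K[b*n1*(n2-1) + 2] - 1, with
#   K = (n1 - 4) / (b*n1*(n2-1)*(n1-6))  (assumed form, consistent with text).

def multiplier(b, n1, n2):
    m = b * n1 * (n2 - 1)
    return (n1 - 4) / (m * (n1 - 6)) * (m + 2) - 1

def allocations(N, b_min):
    return [(b, n1, n2)
            for b in range(b_min, N + 1)
            for n2 in range(2, N + 1)
            for n1 in [N // (b * n2)]
            if b * n1 * n2 == N and n1 >= 7]

N = 64
blocked = min(allocations(N, 2), key=lambda t: multiplier(*t))
unblocked = min(allocations(N, 1), key=lambda t: multiplier(*t))
print(blocked)     # (2, 16, 2): b = n2 = 2, n1 = N/4
print(unblocked)   # (1, 32, 2): no blocking, n1 = N/2, n2 = 2
```

For N = 64 the blocked minimum reproduces 4(3N − 16)/(N(N − 24)) = 0.275.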
We should remember that the lower bounds, (6.65), can be very poor approximations to the variances unless n₁ is large. The lower bounds also fail to indicate the nature of the covariance of the estimated coefficients. Except for completely specified parent populations which permit exact evaluation of (6.63), the covariance is not expressible in a meaningful form.
6.4  Correlation of Two-Variate Estimated Index with Worth

For the estimated index where the weights are statistically independent of the sample means in the selection sample, the correlation between the sample index and worth is

(6.68)  r_γ̂ = Cov{a'[I − ((n₁ − 4)/(n₁ − 1)) M_e M_g⁻¹]v_i ; a'g_i}
              / {[a'Σ_g a]^(1/2) (Var{a'[I − ((n₁ − 4)/(n₁ − 1)) M_e M_g⁻¹]v_i})^(1/2)} .

The numerator in (6.68) can be evaluated exactly--

(6.69)  Cov{a'[I − ((n₁ − 4)/(n₁ − 1)) M_e M_g⁻¹]v_i ; a'g_i}
        = a'Σ_g a − a'Σ_e(Σ_e + bn₂Σ_g)⁻¹Σ_g a
        = bn₂ a'Σ_g(Σ_e + bn₂Σ_g)⁻¹Σ_g a .

From the result in (6.8),

        Var[a'(I − ((n₁ − 4)/(n₁ − 1)) M_e M_g⁻¹)x̄_i] ≥ Var[a'(I − ((n₁ − 4)/(n₁ − 1)) M_e M_g⁻¹)v_i] ,

and
(6.70)  Var[a'(I − ((n₁ − 4)/(n₁ − 1)) M_e M_g⁻¹)v_i]
        = (1/(bn₂)) a'(Σ_e + bn₂Σ_g)a − (2/(bn₂)) a'Σ_e a
          + ((n₁ − 4)²/((n₁ − 1)² bn₂)) a'E[M_e M_g⁻¹(Σ_e + bn₂Σ_g)M_g⁻¹M_e]a .

Applying the results in (6.46), the right side of (6.70) is greater than or equal to
(6.71)  (1/(bn₂)) a'(Σ_e + bn₂Σ_g)a − (2/(bn₂)) a'Σ_e a
        + (K[bn₁(n₂ − 1) + 2](n₁ − 2)/((n₁ − 3) bn₂)) a'Σ_e(Σ_e + bn₂Σ_g)⁻¹Σ_e a .

By rearranging these terms, a more convenient form of the lower bound to the variance is obtained. Let K* = K[bn₁(n₂ − 1) + 2](n₁ − 2)/(n₁ − 3); then

(6.72)  Var[a'(I − ((n₁ − 4)/(n₁ − 1)) M_e M_g⁻¹)v_i]
        ≥ K* bn₂ a'Σ_g(Σ_e + bn₂Σ_g)⁻¹Σ_g a + ((K* − 1)/(bn₂))(a'Σ_e a − bn₂ a'Σ_g a) .

As a consequence of the results in (6.69) and (6.72),

(6.73)  r_γ̂² ≤ r_Γ²/K*
when the bound in (6.72) is positive. A positive bound is guaranteed if a'Σ_e a > bn₂ a'Σ_g a, which implies that r_Γ < .7071. Unfortunately, the bound in (6.73) is affected more by the relative sizes of a'Σ_e a and bn₂ a'Σ_g a than by the allocation of the sampling units in the estimation sample. This is true because the bound cannot be expressed only in terms of a'Σ_g(Σ_e + bn₂Σ_g)⁻¹Σ_g a / a'Σ_g a and a constant, and because two measures of variation have been omitted to obtain (6.72), but it is evident still that r_γ̂ becomes increasingly smaller than r_Γ as r_Γ decreases from .7071.
As for the case of the variance of the estimated weights, the best estimated index, using a balanced allocation of sampling units, appears to be the one for which the bound in (6.73) is maximized. Like the bound in (6.65), this involves placing the maximum emphasis on the measurement of the variation among the g_i'; n₁ = N/4, b, n₂ = 2. Since (6.72) is a lower bound to the variance of the estimated index, it is also certain that n₁ ≥ 7 (which makes the bound finite) is the most essential requirement for a usable estimated index.

With a population like the one illustrated here, will it pay to estimate the optimum weights when selection is for the sample maximum? If it is acceptable to assume that success in different procedures of selection can be ranked as r_γ, r_γ̂, and r_Γ are ranked, then at least those cases where it is not wise to estimate can be determined: (1) if r_Γ ≤ .7071, the increase in selection power at best will be small, and (2) if a'Σ_e a > bn₂ a'Σ_g a, r_γ̂ will be considerably less than r_Γ, which also will be small. If the converses to (1) and (2) hold, r_γ̂ could approach r_Γ closely, but we have no assurance of this.
We have not been able to show that r_γ̂ is almost always considerably less than r_Γ. Does this tend to contradict the general feeling that the estimated index is not a worthwhile tool? The small illustration given here does not provide the answer to this question, but it does provide some clues. As has been mentioned, the upper bound in (6.73) presents a better (probably a much better) picture of r_γ̂ than that which truly exists. Since the estimates provided by the estimation sample are nearly optimal when b = 1, it seems likely that the general set of estimates in the illustration is superior to most of those which have been used in practice, particularly those which have been used for problems of genetic selection from animal populations. The requirement that n₁ ≥ 7 is for two variates; in the next section, we will show that the p-variate requirement is n₁ ≥ p + 5.
Finally, frequently in the literature no mention is made of the necessity of correcting the estimated weights for bias. This correction is very important when it is noted that

(6.74)  Cov[a'(I − M_e M_g⁻¹)v_i ; a'g_i]
        = ((n₁ − 1)/(n₁ − 4)) bn₂ a'Σ_g(Σ_e + bn₂Σ_g)⁻¹Σ_g a − (3/(n₁ − 4)) a'Σ_g a

will be negative if r_Γ² ≤ 3/(n₁ − 1). In this case, using the estimated index is almost certainly worse than randomly drawing any observation and calling it the largest.
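A small numeric sketch of the sign condition attached to (6.74); the population matrices below are hypothetical, chosen so that r_Γ² falls below 3/(n₁ − 1):

```python
# Sign check for the covariance in (6.74), two-variate case:
#   cov = ((n1-1)/(n1-4)) * b*n2 * a'Sg S^{-1} Sg a - (3/(n1-4)) * a'Sg a ,
# with S = Se + b*n2*Sg.  All population values below are hypothetical.

def quad(a, M):                       # a'Ma for 2-vectors
    return sum(a[i] * M[i][j] * a[j] for i in range(2) for j in range(2))

def biased_cov(a, Sg, Se, b, n2, n1):
    S = [[Se[i][j] + b * n2 * Sg[i][j] for j in range(2)] for i in range(2)]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    Sinv = [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]
    Sga = [sum(Sg[i][j] * a[j] for j in range(2)) for i in range(2)]
    core = sum(Sga[i] * Sinv[i][j] * Sga[j] for i in range(2) for j in range(2))
    return (n1 - 1) / (n1 - 4) * b * n2 * core - 3 / (n1 - 4) * quad(a, Sg)

a, b, n2 = [1.0, 1.0], 1, 2
Sg = [[0.1, 0.0], [0.0, 0.1]]         # weak signal relative to error
Se = [[4.0, 0.0], [0.0, 4.0]]
print(biased_cov(a, Sg, Se, b, n2, n1=8) < 0)   # True: r_Gamma^2 < 3/(n1-1)
```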
By further manipulation of the covariance and evaluation of the variance, we see that

(6.75)

and

(6.76)

Now we can show that, in the region of positive covariance,

(6.77)

when

(6.78)

But the bound in (6.78) is implied by

(6.79)

where

(6.80)

As a working rule, for sample sizes less than one hundred, bn₂ψ(n₁, K*) is less than 3. The bound in (6.77) is applicable then if a'Σ_e a > 3a'Σ_g a. When this bound holds, it is also obvious that a lower bound on the difference between r_γ̂² and r_Γ²,

(6.81)

increases as r_Γ² decreases.

If the relationship between the results of using the estimated index and the correlation coefficient r_γ̂ is curvilinear, similar to that given for P[1, 1, k], then even when the biased estimated index has a positive covariance, the results of its use can be consistently poorer than the results from using the base index are. The difference in (6.81) strongly indicates that this is so unless r_Γ is close to 1 and/or n₁ is very large.
6.5  p-Variate Properties for the Variance of the Estimated Weights and the Covariance of the Estimated Index and True Worth

In the general estimation problem, where M_g is a p x p matrix based on n₁ − 1 degrees of freedom, the triangular factorization, (n₁ − 1)M_g = TT', reduces the expectation problem to finding moments obtained from

(6.82)  const. Π_{i=1}^{p} t_ii^{(n₁−1)−i} exp[−(1/2) tr (Σ_e + bn₂Σ_g)⁻¹TT'] dT ,

where dT stands for Π_{i≥j} dt_ij. Factoring (Σ_e + bn₂Σ_g)⁻¹ into E'⁻¹E⁻¹, and setting T'E'⁻¹ = V' reduces the problem to finding moments from

(6.83)  const. Π_{i=1}^{p} v_ii^{(n₁−1)−i} exp(−(1/2) Σ_{i≥j} v_ij²) dV .

The diagonal elements, v_ii, of V are square roots of χ² variables with (n₁ − i) degrees of freedom. The off-diagonal elements are normal variables with zero means and unit variances. The element in the (p, p) position of (VV') is Σ_{i=1}^{p} v_pi², and it is special because it is the only element of the product matrix to contain v_pp. The determinant of (VV') is Π_{i=1}^{p} v_ii²; hence, the inverse of (VV') is

(6.84)  (VV')⁻¹ = (1/Π_{i=1}^{p} v_ii²) Adj(VV') .

The adjoint is constructed of cofactors with the result that the p-th row and column of the adjoint cannot contain any elements which are functions of v_pp. Hence, the (p, p) element of

(6.85)  (VV')⁻¹(VV')⁻¹ = (1/Π_{i=1}^{p} v_ii⁴) Adj(VV')Adj(VV')

is functionally and statistically independent of v_pp in the numerator; the numerator is a sum of p squared functions of the elements of V other than v_pp. The fraction, 1/Π_{i=1}^{p} v_ii⁴, can be written as (1/v_pp⁴)(1/Π_{i=1}^{p−1} v_ii⁴),
and the (p, p) element of (6.85) can be represented as

(6.86)  (1/v_pp⁴) Σ_{i=1}^{p} g_i² ,

where g_i is functionally and statistically independent of v_pp. Hence,

(6.87)  E[(1/v_pp⁴) Σ_{i=1}^{p} g_i²] = (1/((n₁ − p − 2)(n₁ − p − 4))) Σ_{i=1}^{p} E(g_i²) .

Since E(g_i²) ≥ 0, we would like to show that E(g_i²) > 0 for at least one i, so that (6.87) is undefined for n₁ ≤ p + 4. But this must be so because each g_i² is a squared function of independent normal variables and χ² variables with degrees of freedom greater than or equal to n₁ − p + 1. The assurance that E(g_i²) > 0 and the result in (6.87) constitute a proof for the following theorem, which is important in the study of the estimated index.

Theorem:  If M is the sample covariance matrix of n independent p-variate vectors, x_i distributed N[μ, Σ], i = 1, 2, ..., n, then the expectation E(M⁻¹ΣM⁻¹) is undefined for n ≤ p + 4.
From this theorem, it is obvious that the number of g_i' in the estimation sample must be at least the number of variates measured plus five.
This result is very important because it does not appear to be upset by the approximation to normality found in practice. The theorem is the consequence of the probability mass located at the point zero in the distribution of v_pp. This probability mass in the practical approximation should be greater than that in the theoretical distribution because the approximation usually constitutes a truncation of the tail regions of the theoretical joint distribution of the original variables.

The restriction on n₁ provides two caution signs to be observed if the researcher has decided to use the estimated index. When n₁ is fixed and small, the number of variates which can be measured is very restricted. This necessitates judicious choosing of the most informative set of variates. When n₁ is not fixed, it still is advisable to choose as small a set of variates as is considered reasonable and to measure as many different individuals as physical limitations permit. To obtain good estimates, it is more important to sample many individuals than to subsample one individual many times.
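The n₁ ≥ p + 5 requirement traces back to E[1/v_pp⁴] with v_pp² a χ² variable on n₁ − p degrees of freedom; a sketch of the existence check via the standard gamma-function condition (not code from the dissertation):

```python
import math

# E[1/v_pp^4] = Gamma((n1-p)/2 - 2) / (4 * Gamma((n1-p)/2)) exists only when
# (n1 - p)/2 - 2 > 0, i.e. n1 >= p + 5 for integer n1.

def second_inverse_moment(n1, p):
    k = n1 - p                          # degrees of freedom of v_pp^2
    if k <= 4:
        return None                     # moment undefined
    return math.gamma(k / 2 - 2) / (4 * math.gamma(k / 2))

p = 3
print([second_inverse_moment(n1, p) is None for n1 in range(p + 2, p + 6)])
# [True, True, True, False]: the moment first exists at n1 = p + 5
print(second_inverse_moment(10, 2))     # 1/((8-2)*(8-4)) = 1/24
```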
A second important generalization for the estimated index concerns the correction for bias in the weights. Since (VV')⁻¹ = V'⁻¹V⁻¹, the product representation (6.84) expands to give

(6.88)  (VV')⁻¹ = (1/Π_{i=1}^{p} v_ii²) Adj V' Adj V .

If the cofactor of the (i, j) position of V is denoted by c_ij, then the (p, p−1) and (p, p) elements of the adjoint product in (6.88) are

        (Π_{i=1}^{p−1} v_ii) c_{p p−1}   and   Π_{i=1}^{p−1} v_ii² .

Since (6.88) represents the inverse of a Wishart-type matrix from a population with the covariance matrix I, the expectations of all the diagonal elements are identical, and the expectations of all the off-diagonal elements are identical.

        E(diagonal element) = E[Π_{i=1}^{p−1} v_ii² / Π_{i=1}^{p} v_ii²] = E[1/v_pp²] = 1/(n₁ − p − 2) .

It is easy to check that the cofactor c_{p−1 p} is (Π_{i=1}^{p−2} v_ii)v_{p p−1}; hence,

        E(off-diagonal element) = E[(Π_{i=1}^{p−2} v_ii²)v_{p p−1} / Π_{i=1}^{p} v_ii²]
        = E[v_{p p−1} / (v_{p−1 p−1}² v_pp²)] = 0 .

Substituting these expectations into equations (6.55) and (6.74), it follows that the unbiased estimated weights for the p-variate index are

        a'(I − ((n₁ − p − 2)/(n₁ − 1)) M_e M_g⁻¹) ,

and that failure to correct for the bias can result in a negative correlation of the estimated index and true worth if r_Γ² ≤ (p + 1)/(n₁ − 1).
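A minimal sketch of applying the p-variate bias correction, scaling the M_e M_g⁻¹ term by (n₁ − p − 2)/(n₁ − 1); the matrix product C = M_e M_g⁻¹ is given made-up values rather than being computed from data:

```python
# Unbiased p-variate weights: a'(I - ((n1 - p - 2)/(n1 - 1)) * Me Mg^{-1}).
# C stands for the already-computed product Me Mg^{-1} (hypothetical values);
# the correction is just a scalar rescaling of C before subtraction.

def unbiased_weights(a, C, n1):
    p = len(a)
    shrink = (n1 - p - 2) / (n1 - 1)
    return [a[j] - shrink * sum(a[i] * C[i][j] for i in range(p))
            for j in range(p)]

a = [1.0, 1.0, 0.5]                       # known relative weights (made up)
C = [[0.5, 0.1, 0.0],                     # Me Mg^{-1} (made up)
     [0.1, 0.4, 0.1],
     [0.0, 0.1, 0.6]]
print(unbiased_weights(a, C, n1=12))      # shrink factor = 7/11
```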
7.0  SUMMARY AND CONCLUSIONS

7.1  The Problem and Results

An examination of statistical problems of selecting nonobservable linear functions of multivariate normal variables with the aid of a selection index is presented in this dissertation. The very general problem of selection is reduced, by limiting procedures and defining populations, to one concerning populations sampled in a particular manner and concerning selection with one of four indices. No attempt is made to examine non-time-stationary populations, such as those created by generating new populations with selected samples from existing populations. Instead, one population, unchanged by time, is considered, and from samples drawn at random from this population, only the sample maxima are selected.
Two criteria are presented for developing an optimum index and for comparing alternate indices. The approach of using the probability of selecting the unknown maximum is developed for use in comparing indices when it is necessary to emphasize the effect of the variance of realized gain. The more familiar approach of evaluating the worth of an index with the expected gain criterion is stated as being appropriate for comparison only when the variance of realized gain is unimportant, such as in long-term or large-sample selection programs. Necessary conditions of selection for the validity of expected gain are clarified.

The best selection indices derived by each criterion are shown to be the same index. This optimum index is a function of known relative weights and the population covariance matrices. The optimum index is
compared with three other indices: (1) the base index, an index employing the known relative weights, (2) the reduced index, a base index constructed with a reduced set of variates, and (3) the estimated index, an optimum-type index with the true optimum weights replaced by sample estimates. Comments about when each index might improve over the base index are given. Special emphasis is given to the assumptions necessary to provide a basis for comparing the estimated index with the base and optimum indices.

The estimated index is examined for a particular type of estimation and selection sample which has wide applicability in selection programs. Moments of the weights and the correlation of the index with nonobservable worth are given exactly or are approximated with bounds for the two-variate index. From the properties of the moments and the bound on the correlation coefficient: (1) a superior allocation of sampling units in the estimation sample is provided, and (2) parametric relationships which can result in poor results from the estimated index are described. Correction for a bias in the estimated weights is shown to be a critical factor in obtaining useful results with the estimated index.

An extension of the two most important properties of the estimation procedure, the necessity of using more than p + 4 entries in the estimation of the among-means sum of squares (p is the number of variates measured per individual) and the correction for bias in the estimates, is developed from the two-variate illustration. The practical precautions indicated by these properties are discussed briefly.
7.2  Conclusions

If, in a selection program, selection with an index is for an unknown sample maximum (or near maximum), the number of repetitions of the selection process can dictate the criterion to be used for measuring the possibility of achieving success. If selection is repeated often, it is usually the mean of the accumulated selections which is most important; the variation among the individual selections is of small importance. For this type of selection program, the expected gain from the use of an index is generally regarded as a satisfactory measure both of the worth of the index and of the worth relative to the worth of another index. Selection from a few samples, particularly one sample, should be measured, however, by a combination of expected gain and the variance of realized gain. The probability of correct selection appears to be a satisfactory combination of these characteristics for the purpose of comparing the worth of alternate indices.

Having accepted either criterion for comparison, it is always best to use the optimum index when it is available. If it is not available, either the base index or the reduced index, if its worth is successfully detected, would be recommended here as better indices than the estimated index. Estimation of the covariance matrices involves the use of much noninformative variation which creates the risk of missing the optimum weights by a large margin and hence of doing an extremely bad job of selection.
For the two-variate case, the hazards of estimation can be minimized by providing the largest possible estimation sample, properly allocated, and by correcting the estimated weights for bias. For more than two variates, the results are too scanty to make any definite statements other than that the estimation sample size must be increased as the number of variates is increased, and that the proper allocation and correction for bias must be made.

Extension of the estimation procedure to populations and sampling schemes other than those discussed is not promising. The estimates will have few of the optimum properties discussed--either of construction or of statistical independence.
7.3  Suggestions for Further Research

Within the boundaries defined for the problem in this dissertation, there are two important investigations which could be undertaken. The first is the development of a means for evaluating the difference, r_γ(R − 1), between the parameters r_γ and r_Γ. Although this may not be possible for a general set of weights, a, there is at least one approach which can lead to an approximation to the difference. Since the parameters r_γ and r_Γ are correlation coefficients, the square of each lies in the range (0, 1). The same set a* maximizes both parameters; the same set a** minimizes both parameters--

        max_a [a'Σ_g(Σ_g + Σ_e)⁻¹Σ_g a / a'Σ_g a] = max_a [a'Σ_g a / a'(Σ_g + Σ_e)a]
        = ch_max[Σ_g(Σ_g + Σ_e)⁻¹] ,

        min_a [a'Σ_g(Σ_g + Σ_e)⁻¹Σ_g a / a'Σ_g a] = min_a [a'Σ_g a / a'(Σ_g + Σ_e)a]
        = ch_min[Σ_g(Σ_g + Σ_e)⁻¹] .
...
A
O
aIL a
-
~
A + 1 ~ al(L· + L Ja
o
g
e-
,
<
,..,
2
which provides a bound, R, on the difference rt:.(:r
- 1),
The behavior of R as Σ_gΣ_e⁻¹ departs from I is very important because it will indicate when the true difference is small. Since confidence bounds, θ̲ and θ̄, for the characteristic roots have been given in the literature (see Roy (1957)), one approximation to the maximum possible increase is

$$\frac{\bar{\theta}}{\bar{\theta} + 1} - \frac{\underline{\theta}}{\underline{\theta} + 1} = \hat{R}.$$
Perhaps this estimate of R is unnecessarily large and could be replaced by one where the sample characteristic roots of the covariance matrix product S_gS_e⁻¹ are used in place of θ̲ and θ̄. Certainly, better methods of estimating R, or of testing a hypothesis of the type H₀: R < R₀, are the first approaches to a solution of the problem. The best approach is to express Γρ in terms of ρ, but this appears to be greatly complicated by the presence of a matrix inverse in Γρ.
A second important problem is to determine either the joint distribution of the estimated index and the unknown worth, or to establish how well P[1, 1, k] with parameter ρ̂ approximates the probability of selecting the unknown maximum. One suggested approach would be to try to determine if P[1, 1, k], with ρ̂, is an upper bound to the true probability function. If this is true, then the practical questions involving the estimated index can be answered satisfactorily.
The two major extensions to this work which should be attempted are the generalization of the results to the selection of the best m of k ranked variables and an examination of what constitutes a measure of good selection when the parent population changes in time. For the first of these problems, it seems reasonable that the general properties for m = 1 should hold for m > 1. Support for this assumption is obtained by considering the symmetry involved in selecting the maximum. The probability of selecting the maximum is the same as the probability of correctly selecting the minimum k − 1. This leads immediately to the general result, P[m, m, k] = P[k − m, k − m, k]. Even if, for all m, all derivatives with respect to the part-total correlation coefficients are not positive, it seems only reasonable that monotonicity should exist. To establish properties of the derivatives, it is necessary to begin with the integral expression for P[m, m, k].
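The symmetry argument can be checked by simulation. The sketch below, assuming i.i.d. bivariate-normal pairs (x_j, y_j) with correlation ρ as an illustrative stand-in for the model of (3.2), estimates P[m, m, k] and verifies, sample by sample, that selecting the best m and rejecting the worst k − m are the same event:

```python
import numpy as np

rng = np.random.default_rng(1)

def top_set(v, m):
    """Index set of the m largest entries of v."""
    return set(np.argsort(v)[-m:])

def p_mmk(rho, m, k, reps=5000):
    """Monte Carlo estimate of P[m, m, k]: the m variables ranked highest
    by the observable x are exactly the m highest in unknown worth y.
    The sampling model (i.i.d. bivariate-normal pairs, correlation rho)
    is an illustrative simplification, not the dissertation's (3.2)."""
    hits = 0
    for _ in range(reps):
        y = rng.normal(size=k)
        x = rho * y + np.sqrt(1.0 - rho**2) * rng.normal(size=k)
        correct = top_set(x, m) == top_set(y, m)
        # Complement symmetry: selecting the best m correctly is the same
        # event as rejecting the worst k - m correctly, so
        # P[m, m, k] = P[k - m, k - m, k] holds sample by sample.
        assert correct == (top_set(-x, k - m) == top_set(-y, k - m))
        hits += correct
    return hits / reps

p_half, p_nine = p_mmk(0.5, 2, 5), p_mmk(0.9, 2, 5)
print(p_half, p_nine)   # the estimate increases with rho
```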
For m > 1, an exact solution for P[m, m, k] is complicated by the great complexity of the domain of integration. Starting with the most general case for the development, the distributional properties of x_j, y_j, and z_j are those of (3.2). The density function of x and y will be denoted as f(x, y|m). If x₁, x₂, ..., x_m and y₁, y₂, ..., y_m represent the first m-tiles of the samples of x_j and y_j, and if x and y are the minima of these sets, then the domain of integration is

−∞ ≤ x_j ≤ x, −∞ ≤ y_j ≤ y, −∞ ≤ x, y ≤ ∞, j = m + 1, ..., k,

or

−∞ ≤ z_j ≤ x − y_j, −∞ ≤ x, y ≤ ∞, j = m + 1, ..., k.

The integral

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \cdots f(x, y|m)\,dx\,dy$$

requires an expansion of f(x, y|m). Starting from the joint density of x_j and y_j,
the pair x and y can be formed of one random pair (x_j, y_j) or of two pairs (x_j, y_j), (x_j', y_j'). If they occur in the same pair, then the frequency function of x and y is

$$m\left(1 - F(x, y)\right)^{m-1} f(x, y),$$

where f(x, y) is the density and F(x, y) is the distribution function of x_j and y_j. If they occur in separate pairs, then the frequency function of x and y is

$$m(m - 1)\left(1 - F(x, y)\right)^{m-2} f_y(x)\, f_x(y),$$

where

$$f_y(x) = \int_{y}^{\infty} f(x, y_j)\,dy_j.$$
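The order-statistic factor here is the familiar one for a minimum: the minimum of m i.i.d. draws has distribution function 1 − (1 − F)^m, whose derivative is the density m(1 − F)^{m−1}f. A quick univariate check by simulation (standard-normal draws, with illustrative values of m and t; the bivariate f(x, y|m) is the analogous two-coordinate expansion):

```python
import math
import numpy as np

rng = np.random.default_rng(2)

m = 4
Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))  # N(0,1) cdf

# Distribution function of the minimum of m i.i.d. draws: 1 - (1 - F(t))^m.
# Compare the exact value at one point with a simulated frequency.
t = -0.3
mins = rng.standard_normal((100_000, m)).min(axis=1)
empirical = (mins <= t).mean()
exact = 1.0 - (1.0 - Phi(t)) ** m
print(empirical, exact)   # agree to roughly two decimal places
```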
The density of x and y can be expanded and substituted into the expression for P[m, m, k]. Applying the transformations developed in Chapter 3, P[m, m, k] reduces to a slightly more tractable form,
$$P[m, m, k] = D_1\{\cdots\} + D_2\{\cdots\},$$

where each term is a product of partial integrals in (u₁, v₁), the limits of integration being split at u₁ = u and at

$$v_1 = v + \frac{\rho_{xy}}{(1 - \rho_{xy}^2)^{1/2}}(u - u_1),$$

that is, over regions of the form −∞ ≤ u₁ ≤ u, −∞ ≤ v₁ ≤ v + [ρ_xy/(1 − ρ²_xy)^{1/2}](u − u₁) and u ≤ u₁ ≤ ∞, v + [ρ_xy/(1 − ρ²_xy)^{1/2}](u − u₁) ≤ v₁ ≤ ∞, with the outer integration taken over −∞ ≤ u, v ≤ ∞.
The first term of the sum of integral products is monotonically increasing with respect to ρ_xy because in the first partial integral, u₁ ≤ u, in the second, u₁ ≥ u, and because d[ρ_xy/(1 − ρ²_xy)^{1/2}]/dρ_xy > 0. In the second term of the sum, there is a similar monotonic product multiplied by an integral which is monotonically increasing in ρ_xy if

$$\frac{\rho_{xy}}{(1 - \rho_{xy}^2)^{1/2}}(u_1 - u) \ge -v.$$

Examining only the case when ρ_xy ≥ 0, the integral limits insure u₁ ≥ u, so that the inequality always holds when v ≥ 0. But v is integrated over the full range, −∞ ≤ v ≤ ∞. This complication is not unexpected. The second term in the sum of the integrals corresponds to the probability of selecting the first m correctly when x and y belong to different random pairs. As ρ_xy increases, the probability of x and y belonging to different pairs decreases, producing the nonmonotonic effect. Finding the properties of this integral expression is a real challenge!
Many selection programs, such as genetic selection, generate stochastic processes by using the selected individuals to generate new parent populations. If the original population is normal, the succeeding populations are non-normal. Selection also creates a time-linked change in the parameters of the sample selection indices and variables of worth. Hence, it is desirable to predict the direction and magnitude of change in these parameters in order to evaluate the success of selection in time. This requires expressing the worth of an index as a function of time and the parameters, e.g., P_t[m, m, k] and/or E_t(gain). Even a small start in this direction would be a worthwhile contribution.
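The time-linked behavior described here is easy to exhibit in simulation. The following sketch assumes a toy model, not anything from this dissertation: truncation selection on an index with index-worth correlation ρ = 0.7, the top 20 percent kept each generation, and offspring formed by resampling selected worth and adding fresh normal variation. It shows both the cumulative response and the departure from normality in the selected populations:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative assumptions throughout: rho, keep, N, and the offspring rule
# (resampled parent worth plus unit-normal variation) are all hypothetical.
rho, keep, N = 0.7, 0.2, 100_000
y = rng.normal(size=N)                 # worth in the base population
gains, skews = [], []
for t in range(3):
    x = rho * y + np.sqrt(1.0 - rho**2) * rng.normal(size=N)  # observed index
    sel = y[x >= np.quantile(x, 1.0 - keep)]                  # selected parents
    gains.append(sel.mean())
    skews.append(((sel - sel.mean()) ** 3).mean() / sel.std() ** 3)
    # The selected individuals seed the next parent population.
    y = sel[rng.integers(sel.size, size=N)] + rng.normal(size=N)

print(gains)   # mean worth of the selected group rises generation by generation
print(skews)   # generally nonzero: the selected populations are not normal
```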
LIST OF REFERENCES

Bartlett, M. S. 1939. The standard errors of discriminant function coefficients. J. R. Statist. Soc. 6:169-173.

Cochran, W. G. 1950. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability:449-470.

Dunnett, C. W. 1960. On selecting the largest of k normal population means. J. R. Statist. Soc. 22:1-30.

Fisher, R. A. 1954. Statistical Methods for Research Workers. Hafner, Inc., New York:287-289.

Fisher, R. A. and Yates, F. 1949. Statistical Tables. Hafner, Inc., New York:21-22, 66.

Hazel, L. N. 1943. The genetic basis for constructing selection indexes. Genetics 28:476-490.

Kempthorne, O. 1957. An Introduction to Genetic Statistics. John Wiley and Sons, Inc., New York:506-516.

Kendall, M. G. 1941. Relations connected with the tetrachoric series and its generalisation. Biometrika 32:196-198.

Kendall, M. G. and Stuart, A. 1958. The Advanced Theory of Statistics, Vol. 1. Hafner, Inc., New York:350-354.

McFadden, J. A. 1955. Urn models of correlation and a comparison with the multivariate normal integral. Ann. Math. Stat. 26:478-489.

Moran, P. A. P. 1956. The numerical evaluation of a class of integrals. Proc. Camb. Phil. Soc. 52:230-231.

Nanda, D. N. 1949. The standard errors of discriminant function coefficients in plant-breeding experiments. J. R. Statist. Soc., Series B, 11:283-290.

Owen, D. B. 1956. Tables for computing bivariate normal probabilities. Ann. Math. Stat. 27:1075-1090.

Plackett, R. L. 1954. A reduction formula for normal multivariate integrals. Biometrika 41:351-360.

Roy, S. N. 1957. Some Aspects of Multivariate Analysis. John Wiley and Sons, Inc., New York:33-50, 95-109.

Ruben, H. 1960a. Probability content of regions under spherical normal distribution, I. Ann. Math. Stat. 31:598-618.

Ruben, H. 1960b. Probability content of regions under spherical normal distribution, II: the distribution of the range in normal samples. Ann. Math. Stat. 31:1113-1121.

Smith, H. F. 1936. A discriminant function for plant selection. Ann. Eugenics Lond. 7:240-250.

Titchmarsh, E. C. 1949. The Theory of Functions. Oxford University Press, London:2-59.