KENDALL'S ORDER STATISTIC METHOD
OF DISCRIMINANT ANALYSIS
Peyton Watson
Department of Biostatistics
University of North Carolina at Chapel Hill
Institute of Statistics Mimeo Series No. 1377
March 1982
KENDALL'S ORDER STATISTIC METHOD
OF DISCRIMINANT ANALYSIS
by
Peyton Watson
A Dissertation submitted to the
faculty of the University of North
Carolina in partial fulfillment
of the requirements for the degree
of Doctor of Philosophy in the
Department of Biostatistics.
Chapel Hill
1981
Approved by:
© 1981
BERNARD PEYTON WATSON
ALL RIGHTS RESERVED

BERNARD PEYTON WATSON. Kendall's Order Statistic Method of Discriminant Analysis.
(Under the direction of DANA QUADE)
ABSTRACT
A partial discriminant analysis rule is a discriminant analysis
rule which allows a withhold-judgment region.
Kendall (1976) suggests one such rule, based on order statistics.
We derive explicit expressions
for the cutpoints of his rule in the univariate case, and find the
asymptotic distribution of the cutpoints in both the univariate and
bivariate cases; our results generalize to the p(>2)-variate case.
We establish conditions under which Kendall's rule is asymptotically
optimal, but provide an example which demonstrates that neither Kendall's
least-to-most nor the most-to-least method for determining the order
of selection of component variables leads in general to a better
asymptotic rule.
For both the univariate and multivariate cases we show that the
probability of misclassification of Kendall's rule converges (with
probability one) to zero, but that in certain instances the probability
of nonclassification may converge to unity.
We give conditions under
which this is possible; in particular it occurs when the underlying
populations are multivariate normal.
In order to deal with this potential nonclassification problem
we offer several extensions and a modification of Kendall's original
rule.
We explore the asymptotic properties of the extensions and
modification, and in particular we find an asymptotic bound less than
unity for the probability of nonclassification of our modification.
We use Fisher's Iris data to illustrate Kendall's rule and our
extensions and modification of it, and to compare them to each other
and to the standard linear rule.
A result is that our modification
appears to perform at least as well as the linear rule for the Iris
data.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS ... iv

I.    A REVIEW OF THE LITERATURE OF NONPARAMETRIC DISCRIMINANT ANALYSIS ... 1
      1.0 - Introduction ... 1
      1.1 - Classical Theory - Normal Case ... 3
      1.2 - Discrete Case ... 8
      1.3 - Nonparametric Approach - Density Estimation ... 9
      1.4 - Three Specific Cases of Nonparametric Density Estimators ... 13
      1.5 - Difficulties with Practical Application ... 16
      1.6 - Nearest-Neighbor Rules ... 23
      1.7 - Empirical Best-of-Class Rules ... 31
      1.8 - Some Other Rules ... 34
      1.9 - Notational Considerations ... 41

II.   KENDALL'S METHOD ... 42

III.  DETAILS OF THE UNIVARIATE CASE ... 67

IV.   THE MULTIVARIATE CASE ... 89
      4.1 - Redefining "Least Overlap" ... 89
      4.2 - The Asymptotic Behavior of the Cutpoints ... 92
      4.3 - The Asymptotic Probabilities of Misclassification and Nonclassification ... 99
      4.4 - The Asymptotic Optimality of Kendall's Rule ... 102
      4.5 - The Order of Component Selection ... 107

V.    EXTENSIONS AND A MODIFICATION OF KENDALL'S RULE ... 111
      5.1 - Extensions ... 111
      5.2 - A Modification ... 116
      5.3 - Practical Implementation of d_3 ... 123

VI.   AN EXAMPLE OF THE DATA ANALYSIS ... 132

VII.  SUMMARY AND RECOMMENDATIONS FOR FURTHER RESEARCH ... 146
      7.1 - Summary ... 146
      7.2 - Suggestions for Further Research ... 147

APPENDIX ... 150

BIBLIOGRAPHY ... 172
ACKNOWLEDGEMENTS
I wish to express my gratitude to Professor Dana Quade for his
invaluable advice during the course of this research.
Also I thank
each of the other members of my advisory committee, Drs. Frank Harrell,
Norman J. Johnson, P. K. Sen and Herman A. Tyroler.
I dedicate this research to my father, Richard Watson, whose intellect is a source of inspiration to me; to Lee Watson, who provided the conducive environment in which a large portion of this research was produced; and to the memory of my mother, Evelyn Watson.
Lastly, I thank Dr. John Caldwell, without whom this research
would have remained, at best, a dream.
CHAPTER I
A REVIEW OF THE LITERATURE OF
NONPARAMETRIC DISCRIMINANT ANALYSIS
1.0  Introduction

In discriminant analysis one is concerned with determining the population of origin of a given multivariate observation X. Letting π_1, π_2, ..., π_ℓ be the populations under consideration, we are required to decide from which of these our observation has come. Statistically, we need to determine disjoint sets D = (D_1, D_2, ..., D_ℓ), where R^p = ∪_{i=1}^ℓ D_i, such that when X ∈ D_j we assign X to population j. Any such collection of sets will be called a decision rule.

Since we have only a finite number of populations, we are considering only nonrandomized rules; we follow the convention of Glick (1972) here. Let q_i be the a priori probability of population π_i, and let each of the distribution functions F_i admit a probability density f_i. Then the probability of correctly classifying an observation from π_i is

    P_{ii} = \int_{D_i} f_i(x)\,dx    (1.1)

and the probability of incorrectly classifying an observation from π_i as from π_j is

    P_{ji} = \int_{D_j} f_i(x)\,dx, \qquad i \ne j.    (1.2)

The total probability of misclassification for a decision rule D is

    R(D) = \sum_{i=1}^{\ell} q_i \sum_{j \ne i} \int_{D_j} f_i(x)\,dx.    (1.3)

In a sense R(D) reflects the "risk" (a decision theoretic term) associated with using the decision rule D.
In discriminant analysis the statistician faces the problem of
choosing a decision rule which makes the total probability of misclassification (or risk) as small as possible.
Given that the q_i and f_i are known for i = 1, ..., ℓ, the solution to this problem is straightforward (Welch (1939), Hoel and Peterson (1949)). The conditional probability of an observation X coming from π_i is

    P(\pi_i \mid X) = \frac{q_i f_i(X)}{\sum_{m=1}^{\ell} q_m f_m(X)}, \qquad i = 1, \ldots, \ell.    (1.4)

To minimize (1.3), let D* be the rule that assigns each X to that population with the largest conditional (on X) probability. Some ambiguity can exist for this definition, when for a given X, P(π_i | X) = P(π_j | X) = max_m P(π_m | X) for some i ≠ j. To avoid this ambiguity let D* = (D*_1, ..., D*_ℓ) be the rule such that

    X \in D^*_j \quad \text{iff} \quad j = \min\{\, i : q_i f_i(X) = \max_m q_m f_m(X) \,\}.    (1.5)

Using the rule given by (1.5) it is easy to show that

    R^* \equiv R(D^*) = \min_{D \in \mathscr{D}} R(D),    (1.6)

where 𝒟 is the class of all decision rules. If P(q_i f_i(X) = q_j f_j(X)) = 0 for i ≠ j, then the minimizing rule D* is unique (except for sets of probability zero). We will use the notation D* = (D*_1, ..., D*_ℓ) to denote any decision rule which satisfies (1.6), and any such rule will be called a Bayes rule.
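As an illustration only, the following sketch computes the assignment of rule (1.5) when the priors q_i and densities f_i are taken as known; the two univariate normal populations used here are purely hypothetical stand-ins, and SciPy is assumed to be available.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical known populations: two univariate normals with known priors.
priors = [0.5, 0.5]                       # q_1, q_2
densities = [norm(0.0, 1.0).pdf,          # f_1
             norm(2.0, 1.0).pdf]          # f_2

def bayes_rule(x):
    """Assign x by rule (1.5): the population j with the largest q_j f_j(x);
    ties go to the smallest index, as in (1.5)."""
    scores = [q * f(x) for q, f in zip(priors, densities)]
    return int(np.argmax(scores)) + 1     # population label 1 or 2

print(bayes_rule(0.3))   # -> 1
print(bayes_rule(1.7))   # -> 2
```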
A generalization of the above can be obtained by introducing a loss
function L, where L(i,j) is the loss incurred when assigning an observation from π_i to π_j. We have tacitly assumed L(i,i) = 0 and L(i,j) = 1 for i ≠ j. The results using a loss function are a straightforward extension of the above, and for notational simplicity a loss function will not be introduced at this point. Also, for notational convenience, throughout the remainder of this chapter we will consider the two population (ℓ = 2) case. If a result is mentioned which cannot be generalized to the ℓ (> 2) population case, a warning will be issued.

1.1  Classical Theory - Normal Case
In practice it is usually assumed that the f_i's are normal densities with different means μ^(i) and the same covariance matrix Σ. That is,

    f_i(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left[-\tfrac{1}{2}(x - \mu^{(i)})' \Sigma^{-1} (x - \mu^{(i)})\right], \qquad i = 1,2.    (1.7)

In this case the rule D* given by (1.5) is equivalent to:

    X \in D^*_1 \quad \text{iff} \quad X'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) \ge \log(q_2/q_1),
    X \in D^*_2 \quad \text{iff} \quad X'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) < \log(q_2/q_1).    (1.8)

Since the μ^(i) and Σ are usually unknown, training samples of size n_i must be collected from each of the populations to estimate these parameters.
Using
" (i)
lJ
-(i)
=X
i
= 1,2
(1. 9)
and
+
'~12
J=
(X.
~J
(2) _ x(2» (X. (2) _ ~(2»~
~
-J
J l~n1 + n 2 - 2), (1.10)
e
0* = (Di,Di)
these estimates are plugged into (1.8) to form the decision rule
given by
iff
~ ..~-1
(X (1)
x(2»
_ .!.(X (1) + X( 2 ) ) 'S-1 (X (1 )
2 ~
g(2»
iff
X '5- 1 (X (1)
- (2)
_ .!.(x(l) + x(2)(5- 1 (X(1)
2 -
X(2»
A
X e: 0*
1
~ log(Q2 Iq1)'
(1.11)
A
X e: D*
2
--
X
In general, the symbol
0*
<
log(q2!q1)·
will be used to denote any sample-based
decision rule which is intended to asymptotically mimic the Bayes rule in
A
terms of error probability.
0*
is really a sequence of decision rules so
perhaps a more appropriate, but cumbersome, notation for this sequence
A
',o1ou1d be
D*
n ,n .
1 2
Furthermore, for
n ,n
1 2
fixed,
D*
is a function of
'.
,e
5
To keep all this in mind, perhaps
the particular training samples drawn.
it would be best to use the symbol
•
where
At any ra t e, f" or no t a t'10na1 conven1ence,
.
the
(i).)·
Z (i) -_ (X 1(i) , .•. , X n
1
symbol
D*
will be employed in what follows.
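A minimal sketch of the plug-in linear rule (1.9)-(1.11), assuming NumPy; the training arrays and priors used in the example are hypothetical placeholders, not data from the text.

```python
import numpy as np

def fit_linear_rule(X1, X2, q1=0.5, q2=0.5):
    """Plug-in linear discriminant of (1.11); X1, X2 are (n_i, p) training samples."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                         # (1.9)
    S = (np.cov(X1, rowvar=False) * (len(X1) - 1) +
         np.cov(X2, rowvar=False) * (len(X2) - 1)) / (len(X1) + len(X2) - 2)   # (1.10)
    w = np.linalg.solve(S, m1 - m2)                                    # S^{-1}(Xbar1 - Xbar2)
    threshold = 0.5 * (m1 + m2) @ w + np.log(q2 / q1)
    def classify(x):
        return 1 if x @ w >= threshold else 2                          # (1.11)
    return classify

# Hypothetical training data
rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(30, 2))
X2 = rng.normal(1.5, 1.0, size=(30, 2))
rule = fit_linear_rule(X1, X2)
print(rule(np.array([0.2, -0.1])), rule(np.array([1.8, 1.4])))
```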
Assuming the specified normal distributions are actually those of
the underlying populations, the rule
D* given by (1.8) is optimal in the sense that no other rule has a smaller probability of misclassification than D*. But the claim of optimality cannot be made for D̂*. As implied above, the probability of error for this discriminant rule is dependent upon the particular training samples selected. From (1.3) and (1.11) this probability is given by

    R(\hat{D}^*) = \sum_{i=1}^{2} q_i \int_{\hat{D}^*_j} f_i(x)\,dx, \qquad j \ne i,    (1.12)

and is called the "actual error rate" (Hills, 1966) of the rule D̂*.
Another error rate of interest is the "expected actual error rate", which
is the expected value of R(D̂*), where the expected value is taken over all training samples of size n = (n_1, n_2). In general from (1.6) we have that R(D̂*) ≥ R(D*) and E(R(D̂*)) ≥ R(D*). However, assuming normality with equal covariance matrices, D̂* given by (1.11) seems to be a reasonable procedure. Further justification for its use can be given since

    R(\hat{D}^*) \xrightarrow{P} R(D^*) \quad \text{as} \quad n_1, n_2 \to \infty    (1.13)

and

    E(R(\hat{D}^*)) \to R(D^*).    (1.14)

That is, the rule D̂*
is asymptotically optimal in the sense that its
probability of incorrect classification (conditioned on the training
samples) converges in probability to the minimum possible probability
of misclassification among the class of all decision rules.
Also from
(1.14) we have that its unconditional probability of misclassification
converges to this minimum rate.
But this is true for the rule given by
(1.11) only when the underlying distributions are normal with identical
covariance matrices.
Using the rule given by (1.11) when in fact the
underlying population distributions are not normal with equal covariance
matrices will in general lead to neither a reasonable nor an asymptotically optimal decision rule. In practice, the rule given by (1.11) or its ℓ (> 2) population counterpart is the one that is almost invariably used. This decision
rule seems to perform fairly well when the underlying population distributions are "approximately" normal with equal covariance matrices; but
when they are far from normal, or the covariance matrices far from
equal, this rule can perform poorly.
If it is known that the underlying population distributions have
densities f_i given by

    f_i(x) = (2\pi)^{-p/2} |\Sigma_i|^{-1/2} \exp\left[-\tfrac{1}{2}(x - \mu^{(i)})' \Sigma_i^{-1} (x - \mu^{(i)})\right], \qquad i = 1,2,    (1.15)

where Σ_1 ≠ Σ_2, then the decision rule given by (1.5) is equivalent to the rule

    X \in D^*_1 \quad \text{iff} \quad \tfrac{1}{2}\left[\ln\frac{|\Sigma_2|}{|\Sigma_1|} - (X - \mu^{(1)})'\Sigma_1^{-1}(X - \mu^{(1)}) + (X - \mu^{(2)})'\Sigma_2^{-1}(X - \mu^{(2)})\right] \ge \log(q_2/q_1).    (1.16)

Using \hat{\mu}^{(i)} = \bar{X}^{(i)} and

    \hat{\Sigma}_i = S_i = \left[\sum_{j=1}^{n_i} (X_j^{(i)} - \bar{X}^{(i)})(X_j^{(i)} - \bar{X}^{(i)})'\right] \Big/ (n_i - 1), \qquad i = 1,2,

we plug these estimates into (1.16) and the resulting sample-based decision rule is given by

    X \in \hat{D}^*_1 \quad \text{iff} \quad \tfrac{1}{2}\left[\ln\frac{|S_2|}{|S_1|} - (X - \bar{X}^{(1)})'S_1^{-1}(X - \bar{X}^{(1)}) + (X - \bar{X}^{(2)})'S_2^{-1}(X - \bar{X}^{(2)})\right] \ge \log(q_2/q_1).    (1.17)
Under the assumption of underlying normal distributions the above
sample-based rule satisfies (1.13) and (1.14) but, unfortunately, the rule
is sensitive to departures from normality.
It also can perform poorly if
the size of the training samples is not large.
1.2  Discrete Case

Another case which has received attention in the literature is that of underlying population distributions which are discrete. This is an instance in which the assumptions behind the linear and quadratic rules, i.e. the rules given by (1.11) and (1.17), are violated. Consider p discrete random variables, the jth taking on a finite number s_j of distinct values. The sample space of X then consists of s = \prod_{j=1}^{p} s_j points. The usual assumption is that the random vector X has the multinomial distribution. Let x_1, ..., x_s be the above-mentioned s possible outcomes for X, and let q_{ij} = P(X = x_j | π_i), j = 1, ..., s, i = 1,2. Then the optimal assignment rule is given by

    (X = x_j) \in D^*_1 \quad \text{iff} \quad q_1 q_{1j} \ge q_2 q_{2j},
    (X = x_j) \in D^*_2 \quad \text{iff} \quad q_1 q_{1j} < q_2 q_{2j}.    (1.18)

Since the probabilities in (1.18) are usually unknown, we estimate them by use of our training samples. That is, we estimate P(X = x_j | π_i) by

    \hat{P}(X = x_j \mid \pi_i) = \frac{\#_i(x_j)}{n_i}, \qquad j = 1, \ldots, s, \quad i = 1,2,    (1.19)

where \#_i(x_j) is the number of observations in the training sample from π_i which assume the value x_j. Using these estimates, the rule given by (1.18) becomes

    (X = x_j) \in \hat{D}_1 \quad \text{iff} \quad \frac{\#_1(x_j)\, q_1}{n_1} \ge \frac{\#_2(x_j)\, q_2}{n_2},
    (X = x_j) \in \hat{D}_2 \quad \text{iff} \quad \frac{\#_1(x_j)\, q_1}{n_1} < \frac{\#_2(x_j)\, q_2}{n_2}.    (1.20)

The rule given by (1.20) can be shown to be asymptotically optimal in the sense of both (1.13) and (1.14). For proofs of even stronger asymptotic optimality properties of this rule see Glick (1973). Unfortunately, for small and moderate sized training samples, the rule given by (1.20) can perform poorly.
Another source of information about discrete
discriminant analysis is the recent book by Goldstein and Dillon (1978).
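A brief sketch of the frequency-count rule (1.19)-(1.20), assuming only the Python standard library; the discrete training samples shown are hypothetical.

```python
from collections import Counter

def fit_discrete_rule(sample1, sample2, q1=0.5, q2=0.5):
    """Plug-in rule (1.19)-(1.20): estimate each cell probability by the
    relative frequency #_i(x)/n_i and compare the q_i-weighted estimates."""
    counts1, counts2 = Counter(sample1), Counter(sample2)
    n1, n2 = len(sample1), len(sample2)
    def classify(x):
        s1 = q1 * counts1[x] / n1
        s2 = q2 * counts2[x] / n2
        return 1 if s1 >= s2 else 2      # ties go to population 1
    return classify

# Hypothetical training samples over a two-component discrete sample space
sample1 = [(0, 0), (0, 1), (0, 0), (1, 0), (0, 0)]
sample2 = [(1, 1), (1, 0), (1, 1), (0, 1), (1, 1)]
rule = fit_discrete_rule(sample1, sample2)
print(rule((0, 0)), rule((1, 1)))        # -> 1 2
```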
1.3
Nonparametric Approach - Density Estimation
During the past twenty-five years there has been considerable theo-
retical interest in approaching the classification problem in a less constrained fashion.
Assuming ignorance of the underlying population dis-
tributions, one natural way to approach the problem is to obtain nonparametric estimates of the densities
f_i, i = 1,2, at the point X to be classified and then plug these estimates into the rule given by (1.5). For the remainder of this section we will assume that the underlying distribution of π_i, i = 1,2, is absolutely continuous with density f_i.
There is an extensive literature on nonparametric density estimation; for
an excellent review of the work done in this area see Wegman (1972).
Letting f̂_i be the estimator of f_i, based on a training sample of size n_i, the resulting decision rule is given by

    X \in \hat{D}_1 \quad \text{iff} \quad q_1 \hat{f}_1(X) \ge q_2 \hat{f}_2(X),
    X \in \hat{D}_2 \quad \text{iff} \quad q_1 \hat{f}_1(X) < q_2 \hat{f}_2(X).    (1.21)
A fundamental paper studying the relationship between nonparametric
density estimation and discriminant analysis is that of Glick (1972). Let r(D) be the probability of correct classification of a discriminant rule D. Using the notation of (1.3) we have that

    r(D) = 1 - R(D),    (1.22)

so that a rule maximizes r(D) if and only if it minimizes R(D). The probability of correct classification of any rule D = (D_1, D_2) is given by

    r(D) = \sum_{j=1}^{2} q_j \int_{D_j} f_j(x)\,dx.    (1.23)

Letting D* be the Bayes rule given by (1.5) and (1.6), it follows from (1.22) that

    r(D^*) = \max_{D \in \mathscr{D}} r(D).    (1.24)

Let

    \hat{r}(D) = \sum_{j=1}^{2} q_j \int_{D_j} \hat{f}_j(x)\,dx    (1.25)

and

    r(\hat{D}^*) = \sum_{j=1}^{2} q_j \int_{\hat{D}^*_j} f_j(x)\,dx;    (1.26)

then r̂(D) is just a plug-in estimator of the true nonerror rate of a decision rule D and, again using the terminology of Hills, r̂(D̂*) is called the "apparent nonerror rate" of the decision rule D̂* given by (1.21). Glick proved that

    E[\hat{r}(\hat{D}^*)] \ge r(D^*) \ge r(\hat{D}^*),    (1.27)

or that the apparent nonerror rate is an optimistically biased estimator of the nonerror rate of the Bayes rule. (The last inequality follows by definition since D* is a Bayes rule.) Letting 𝒟 be the class of all discriminant rules, Glick also proved that

    \sup_{D \in \mathscr{D}} |\hat{r}(D) - r(D)| \xrightarrow{a.s.} 0,    (1.28)

    \hat{r}(\hat{D}^*) \xrightarrow{a.s.} r(D^*),    (1.29)

    r(\hat{D}^*) \xrightarrow{a.s.} r(D^*),    (1.30)

provided that \hat{f}_i(X) \xrightarrow{a.s.} f_i(X) for almost all X, for i = 1,2, and \sum_i q_i \int \hat{f}_i(x)\,dx \to 1 a.s. The latter condition is automatically satisfied if the estimators f̂_i are densities themselves. Also, if f̂_i(X) - f_i(X) →^P 0 for almost all X, then analogues of (1.28), (1.29) and (1.30) hold with the convergence being in probability.
In proving the above results Glick actually considered the mixed
population case, but his results can be extended to our setting.
The
mixed population case occurs when there is only one underlying population
which is a mixture of the populations π_1 and π_2 in the proportions q_1 and q_2. An observation from the mixed population may be regarded as a random vector (X, I), where I indexes the observation's original population and X is the p-dimensional classification variable. A training sample of size n from π takes the form Z_n = ((X_1, I_1), ..., (X_n, I_n)), where each observation has been correctly classified. A major advantage of working in the mixed population setting is that if the q_i are unknown they can be estimated by q̂_i = #_i(Z_n)/n, where #_i(Z_n) is the number of observations in the training sample for which I = i.
At this point it will prove useful to introduce a definition from
Glick's paper.
Definition (Van Ryzin (1966)): Any sample-based rule D̂ is Bayes risk consistent (or Bayes risk strongly consistent) if the nonerror rate r(D̂) → r(D*) in probability (or with probability one).
This definition is equivalent to the criterion given by (1.13), due
to the equality given by (1.22).
Thus Glick has shown that use of density estimators in the rule
given by (1.5) can lead to a rule with the same asymptotic nonerror rate
as the Bayes rule; or, in terms of our definition, to a rule that is Bayes risk consistent. Equivalently, decision rules based on density estimators
asymptotically have the smallest error rate that can be achieved by any rule.
1.4
Three Specific Cases of Nonparametric Density Estimators
In light of the above results, one might wonder why classification
rules derived from density estimators are rarely used in practice.
To
begin to answer this question let us consider three of the more popular
nonparametric density estimators whose properties have been studied in
connection with discriminant analysis.
1.4.1  Parzen (1962) proposed the univariate density estimator

    f_n(x) = \frac{1}{n\,h(n)} \sum_{j=1}^{n} K\!\left(\frac{x - X_j}{h(n)}\right),    (1.31)

where X_1, ..., X_n is a random sample from the random variable having density f, and K and h(n) are functions satisfying

    \sup_{-\infty < y < \infty} |K(y)| < \infty,    (1.32)

    \int |K(y)|\,dy < \infty,    (1.33)

    \lim_{y \to \infty} |y K(y)| = 0,    (1.34)

    \lim_{n \to \infty} h(n) = 0,    (1.35)

    \lim_{n \to \infty} n\,h(n) = \infty,    (1.36)

    \int K(y)\,dy = 1.    (1.37)

Assuming the above conditions Parzen proved that f_n is a mean square consistent estimator of f. Cacoullos (1966) considered the multivariate version of the above estimator and proved results analogous to
Parzen's in this setting.
Under various conditions on
K, h(n), and
f,
stronger convergence properties have been proved for the Parzen-Cacoullos
(PC)
estimator, e.g., Silverman (1978).
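As a sketch of (1.31), the following uses a standard normal kernel K (which satisfies (1.32)-(1.34) and (1.37)) and the illustrative, not prescribed, bandwidth h(n) = n^{-1/5}, which satisfies (1.35)-(1.36); the sample is hypothetical.

```python
import numpy as np

def parzen_density(sample, h=None):
    """Univariate Parzen estimator (1.31) with a standard normal kernel."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    if h is None:
        h = n ** (-0.2)          # illustrative choice of h(n)
    def f_n(x):
        u = (x - sample) / h
        K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
        return K.sum() / (n * h)
    return f_n

rng = np.random.default_rng(1)
f_n = parzen_density(rng.normal(size=200))
print(round(f_n(0.0), 3))        # roughly 0.4 for a standard normal sample
```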
1.4.2
An estimator usually attributed to Loftsgaarden and Quesenberry
(1965) is

    g_n(x) = \frac{k(n) - 1}{n\,V_n(x)},    (1.38)

where V_n(x) is the volume of the smallest hypersphere centered at x containing at least k(n) of the sample observations. They proved that g_n is a consistent estimator of f when

    \lim_{n \to \infty} k(n) = \infty    (1.39)

and

    \lim_{n \to \infty} k(n)/n = 0.    (1.40)
The Loftsgaarden-Quesenberry (LQ) estimator was obviously inspired
by the work of Fix and Hodges (1951), who suggested and studied the asymptotic properties of the well known K-nearest neighbor rules.
If one imposes stronger conditions on k(n) and f, stronger convergence properties can be proved for g_n, e.g., Moore and Henrichon (1969). The LQ estimator does not satisfy the conditions of Glick's theorems since

    \int g_n(x)\,dx = \infty,    (1.41)

but under slight restrictions Glick's results can be modified to include the LQ estimator. Because of the equality given by (1.41), use of g_n to compute estimated probabilities is not feasible.
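A minimal sketch of (1.38) for p-variate data, assuming NumPy; the choice k = 25 and the bivariate normal sample are purely illustrative.

```python
import numpy as np
from math import gamma, pi

def lq_density(sample, k):
    """Loftsgaarden-Quesenberry estimate (1.38):
    (k-1) divided by n times the volume of the smallest ball about x holding k points."""
    sample = np.atleast_2d(np.asarray(sample, dtype=float))
    n, p = sample.shape
    unit_ball = pi ** (p / 2) / gamma(p / 2 + 1)      # volume of the unit ball in R^p
    def g_n(x):
        r = np.sort(np.linalg.norm(sample - np.asarray(x, dtype=float), axis=1))[k - 1]
        return (k - 1) / (n * unit_ball * r ** p)
    return g_n

rng = np.random.default_rng(2)
g_n = lq_density(rng.normal(size=(500, 2)), k=25)
print(round(g_n([0.0, 0.0]), 3))   # near 1/(2*pi) for a bivariate standard normal sample
```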
1.4.3
Another type of density estimation is that based on orthogonal
series.
Cencov (1962) appears to have been the first to propose and study
orthogonal series (OS) density estimators.
One univariate version of this
type of estimator, proposed by Schwartz (1967), is

    h_n(x) = \sum_{j=0}^{q(n)} \hat{a}_{jn} C_j(x),    (1.42)

where

    C_j(x) = (2^j j!\, \pi^{1/2})^{-1/2} e^{-x^2/2} H_j(x),    (1.43)

    H_j(x) = (-1)^j e^{x^2} \frac{d^j}{dx^j} e^{-x^2} \quad \text{(the jth Hermite polynomial)},    (1.44)

and

    \hat{a}_{jn} = \frac{1}{n} \sum_{i=1}^{n} C_j(X_i).    (1.45)

Assuming f ∈ L_2, Schwartz showed that h_n is consistent in mean integrated square error (E\int (h_n - f)^2\,dx \to 0), and under additional conditions that h_n is consistent in mean square error (\lim_{n\to\infty} E(h_n - f)^2 = 0) uniformly in x.
His estimator and results extend to the multivariate case.
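A sketch of the truncated Hermite series estimator (1.42)-(1.45), assuming NumPy's Hermite polynomial utilities; the truncation point q = 6 and the sample are illustrative choices only.

```python
import numpy as np
from math import factorial, pi, sqrt

def hermite_series_density(sample, q):
    """Orthogonal-series estimate (1.42)-(1.45) truncated at q terms."""
    sample = np.asarray(sample, dtype=float)

    def C(j, x):
        # C_j(x) = (2^j j! sqrt(pi))^{-1/2} exp(-x^2/2) H_j(x)
        H = np.polynomial.hermite.Hermite.basis(j)(x)
        return H * np.exp(-0.5 * np.asarray(x) ** 2) / sqrt(2.0**j * factorial(j) * sqrt(pi))

    a_hat = [np.mean(C(j, sample)) for j in range(q + 1)]      # (1.45)

    def h_n(x):
        return sum(a * C(j, x) for j, a in enumerate(a_hat))   # (1.42)
    return h_n

rng = np.random.default_rng(3)
h_n = hermite_series_density(rng.normal(size=400), q=6)
print(round(float(h_n(0.0)), 3))
```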
A number of authors have proposed various specific orthogonal series
for this type of density estimation.
Kronmal and Tarter (1968) suggested
and studied the properties of Fourier series for density estimation.
They
noted an interesting connection between their density estimator and that
of Parzen-Cacoullos.
Prior to Glick, Van Ryzin (1966) studied the PC and OS estimators and showed that discriminant procedures based on these were Bayes risk consistent. He also studied the rates of convergence of these procedures.

1.5  Difficulties with Practical Application
More generally, in the mixed population case, Greblicki (1978) proved
that if a density estimator is weakly (strongly) pointwise consistent
almost surely in
RP
then a discriminant rule based on this density esti-
mator is weakly (strongly) Bayes risk consistent.
Letting D̂* be a decision rule based on a density estimator which satisfies the above assumptions, he also proved that E[R(D̂*)] → R*, where the expected value is taken over all possible samples of size n. Under mild assumptions all three of the above density estimators satisfy Greblicki's conditions and thus decision rules based on them are Bayes risk consistent. In light of the above results, we once again raise the question of why discriminant analysis procedures based on nonparametric density estimation are infrequently used in practice.
Consider the Parzen-Cacoullos estimator.
The functions K and h(n) must be selected by the statistician. For selection of suboptimal K and h(n) the convergence of f_n to f can be very slow. Of the two, the proper selection of h(n) seems to be the more crucial. Wahba (1975) gave an optimal convergence rate for f_n and conditions that K and h(n) must satisfy for f_n to achieve this optimal rate. Unfortunately for the practitioner, these conditions depend on the underlying probability distribution involved. That is, in order to select proper h(n) and K for f_n to achieve the optimal rate of convergence, one must have knowledge about the underlying distribution (or f), and this information is rarely if ever available.
Further, it seems that to select an
h(n)
which
works even reasonably well, one must base this selection in some way on
the underlying population distribution.
This is usually done by means of
the random sample with which the statistician is presented.
Various suggestions have been made for choosing an appropriate h(n) with the help of the observed data.
Silverman (1978) suggested what appears
to be a promising method for the univariate case, but his idea cannot be
practically generalized to a multivariate setting.
Another possibility
which has been suggested for discriminant analysis applications is to use
PC class conditional estimators of the form
PC class conditional estimators of the form

    f_{ni}(x) = \frac{1}{n_i\, \sigma^p (2\pi)^{p/2}} \sum_{j=1}^{n_i} \exp\left[-\frac{(X_j^{(i)} - x)'(X_j^{(i)} - x)}{2\sigma^2}\right], \qquad i = 1,2,    (1.46)

or

    f_{ni}(x) = \frac{1}{n_i\, (2\pi)^{p/2} |\Sigma_i|^{1/2}} \sum_{j=1}^{n_i} \exp\left[-\tfrac{1}{2}(X_j^{(i)} - x)' \Sigma_i^{-1} (X_j^{(i)} - x)\right], \qquad i = 1,2,    (1.47)

where

    \Sigma_i = \sigma_i^2\, \mathrm{diag}(s_{i1}^2, \ldots, s_{ip}^2)    (1.48)

and where s_{ij}^2 is the jth marginal sample variance for the sample from the i-th population.
In this instance one needs worry only about estimating the optimal
σ (or the σ_i's). Of course the optimal K and h(n) for a given problem may not be of the form given in (1.46) or (1.47). At any rate, a number
of simulation studies (Van Ness and Simpson (1976), Van Ness (1979), and
Remme, Habbema and Hermans (1980)) have shown that discriminant analysis
procedures based on density estimators of the above two types perform fairly
well using relatively small sample sizes when the underlying populations
are normal.
These results hold both in the equal covariance and unequal
covariance matrices cases as well as when the underlying populations are
a mixture of normal distributions.
When the underlying distributions
were lognormal Remme et al got poor results using the estimator given by
(1.47).
No doubt their poor results for the lognormal distributions are
due in part to the extremely small sample sizes used (n_1 = n_2 = 15 or 35).
They suggest a modification for use in this case, which by their own admission is impractical.
For the estimator given by (1.47) perhaps the most promising method
for estimating the σ_i's has been suggested by Habbema, Hermans and van den Broek (1974). It is based on a modified maximum likelihood technique, and the method is feasible in the multivariate case. They have written a computer program using this technique for discriminant analysis, and
this program is available for purchase by interested users.
But their
method, and hence their computer program, may not give good results when
the correlations between variables are all greater than .5 or when the
underlying population distributions are lognormal and the training samples
are not large.
Also, their program is greedy in terms of CPU time.
For discriminant analysis an obvious disadvantage of the PC estimator, as well as the LQ and OS estimators, is that all the data must
be stored and the density estimate must be computed using all the stored
data each time an observation
X is presented for classification.
Specht
(1967, 1971) attempted to overcome this disadvantage by using a PC estimator of the type given by (1.46) and using a truncated polynomial series
expansion of the exponential function involved.
This method involves
estimating a large number of coefficients, some of which are estimated
from very high powers of the sample observations.
The method obviously
presents problems in terms of instability of computer computations.
Nevertheless, it offers the advantage of not having to store the training samples, and in at least one application (Specht, 1967) it appears
to work reasonably well. Specht also discusses how to select an appropriate σ for use with his method.
For orthogonal series estimators the crucial quantity is q(n). This is the function that controls the tradeoff between the bias and variance of h_n. Large values of q(n) lead to a large variance but small bias for h_n, whereas small values of q(n) lead to a small variance but large bias. Using the Fourier series suggested by Kronmal and Tarter, Wahba (1975) gives conditions on q(n) such that the convergence rate of h_n to f is optimal. Her results hold only for the
univariate and bivariate cases, and the conditions for selection of
optimal q(n) require more information about the underlying probability distribution (or f) than is usually available. Improper selection of q(n) can lead to disastrous results. For an example of this see page 111, Figure 9, of Tarter and Kronmal (1976).
In trying to implement practical density estimators based on
orthogonal series, the idea is to use the data in an attempt to determine satisfactory
q(n).
Methods have been suggested (Tarter and Kronmal
(1976), Kronmal and Tarter (1968), Tarter and Raman (1971)) which seem
to work reasonably well in the univariate and bivariate cases, but a
method which has practical applications in higher dimensions is not yet
available.
Problems which arise with the implementation of classification
rules based on the LQ estimator are identical to those arising in the
use of the so-called K-nearest neighbor rules.
More will be said of
these problems in the section on K-nearest neighbor rules, but we will
say at this point that work by Wegman (1972) and Goldstein (1975) indicates that the proper selection of k(n) for the LQ estimator is possibly not so critical as proper selection of h(n) for the PC estimator.
Unfortunately, in using the LQ estimator for discriminant analysis
applications, computer problems can arise.
This is due in part to the
fact that this technique requires the storage of the training samples
for use in classifying future observations.
More troublesome is the fact that the computation of g_n can require prohibitive amounts of
CPU time when the size of the training samples is large.
Of course, the
instance of large training samples is the one most likely to produce a
procedure which will perform well.
Under normality assumptions Goldstein (1975) compared discriminant analysis techniques based on the LQ and PC estimators to the ℓ-population counterpart of the rule given by (1.17). He used moderate-sized training samples (n = n_i = 75,
100 and 150), and his simulation results indicated
that discriminant analysis procedures based on these density estimators
can work well when the underlying distributions are normal.
He also considered methods of selecting k(n) and h(n). Gessaman and Gessaman
(1972) considered the LQ and PC estimators as well as a number of other
nonparametric techniques for discriminant analysis.
In a simulation
study, under normality assumptions, they compared these procedures to
each other and to the rule given by (1.11) and the decision rules based
on the PC and LQ estimators performed adequately.
Their results will
be discussed in greater detail later.
To summarize, we have seen that for the density estimators considered there are certain pivotal quantities, called smoothing parameters,
which must be selected.
For the estimators to work reasonably well these
quantities must be appropriately chosen.
In using the PC density esti-
mator for discriminant analysis, certain suggestions have been made for
selecting the smoothing parameter
h(n)
which appear to lead to reason-
able results when the underlying populations are normal.
Of course,
assuming underlying normal distributions, better rules (the well known
linear and quadratic rules) are known to exist.
When the underlying
distributions are nonnormal there is no evidence to suggest that these
recommendations for the PC estimator lead to viable results and as yet
there are no other practical guidelines for use in selecting the smoothing parameter
h(n)
in the nonnormal case. For the OS estimator practical methods for selecting the smoothing parameter q(n) have been suggested only for the case of p ≤ 2. As mentioned above, proper selection of the smoothing parameter k(n) for the LQ estimator is possibly not so crucial as the analogous selections for the PC and OS estimators. The
major deficiencies of the LQ estimator are in terms of its practical
implementation by means of a computer and these deficiencies will be
discussed in greater detail in connection with the K-nearest neighbor
rules.
Thus it becomes apparent why nonparametric classification rules
based on density estimators have rarely been used in practice.
The
available estimators are not practical in terms of the selection of
the smoothing parameters associated with them and/or in terms of their
practical implementation by means of a computer.
To quote Lachenbruch
and Goldstein (1979), "until practitioners are more secure in how to
arrive at smoothing parameters so that the induced classification procedures have a fair chance at competing with existing methodology, the
area will probably still stay dormant.
Nonetheless, however, we believe
this is an area worthy of much additional study and use even if no analytic results become available to help in the smoothing problems."
1.6
Nearest-Neighbor Rules
The so-called K-nearest neighbor procedures were first studied by
Fix and Hodges (1951), and it was evidently their work which inspired
the subsequent large body of theoretical research on
nonparametric discriminant rules in general and the K-nearest neighbor rules in particular.
Consider the two population case with training samples X_1^{(1)}, ..., X_{n_1}^{(1)} from π_1 and X_1^{(2)}, ..., X_{n_2}^{(2)} from π_2, respectively. Let d(x, y) be some metric (e.g. the Euclidean distance in R^p) and let K be an integer. If X is the p-variate observation to be classified, then order the real numbers d(X, X_1^{(1)}), ..., d(X, X_{n_1}^{(1)}), d(X, X_1^{(2)}), ..., d(X, X_{n_2}^{(2)}) and let K_i, i = 1,2, be the number of observations from π_i among the K closest observations to X. Then, if q_1 = q_2,

    \text{assign } X \text{ to } \pi_1 \quad \text{iff} \quad K_1/n_1 \ge K_2/n_2,
    \text{assign } X \text{ to } \pi_2 \quad \text{iff} \quad K_1/n_1 < K_2/n_2;    (1.49)

but if q_1 ≠ q_2, then

    \text{assign } X \text{ to } \pi_1 \quad \text{iff} \quad K_1/n_1 \ge (q_2/q_1)(K_2/n_2),
    \text{assign } X \text{ to } \pi_2 \quad \text{iff} \quad K_1/n_1 < (q_2/q_1)(K_2/n_2).    (1.50)

Fix and Hodges showed that if the underlying population distributions have densities f_i, i = 1,2, which are almost surely continuous and if

    K \to \infty \quad \text{and} \quad K/n_i \to 0, \quad i = 1,2, \quad \text{as } n_1, n_2 \to \infty,    (1.51)
then the probabilities of misclassification for their rule converge to
those of the Bayes rules.
(They actually prove a slightly stronger result.) When K = 1 we determine the population of origin of the nearest sample observation to X and assign X to this population. In this case the above rule is often called a nearest neighbor rule.
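A minimal sketch of the K-nearest neighbor rule (1.49)-(1.50), assuming NumPy and Euclidean distance; the training samples and K = 5 are hypothetical choices.

```python
import numpy as np

def knn_rule(X1, X2, K, q1=0.5, q2=0.5):
    """K-nearest neighbor rule of (1.49)-(1.50): compare K_1/n_1 with
    (q_2/q_1)(K_2/n_2), where K_i counts the K nearest neighbors from population i."""
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    pooled = np.vstack([X1, X2])
    labels = np.array([1] * len(X1) + [2] * len(X2))
    def classify(x):
        d = np.linalg.norm(pooled - np.asarray(x, dtype=float), axis=1)
        nearest = labels[np.argsort(d)[:K]]
        K1, K2 = np.sum(nearest == 1), np.sum(nearest == 2)
        return 1 if K1 / len(X1) >= (q2 / q1) * K2 / len(X2) else 2
    return classify

rng = np.random.default_rng(4)
rule = knn_rule(rng.normal(0, 1, (40, 2)), rng.normal(2, 1, (40, 2)), K=5)
print(rule([0.1, 0.2]), rule([2.2, 1.9]))
```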
In effect, the K-nearest neighbor rule obtains an estimate of the
density for each of the populations at X and then uses the plug-in Bayes rule.
It is not hard to see that for discriminant analysis a
plug-in rule based on the LQ estimator is very similar to the K-nearest
neighbor rule.
In a later paper, Fix and Hodges (1953) study the small sample
performance of the K-nearest neighbor .rule under the assumption of
normality, with equal covariance matrices.
They obtain exact and asymptotic expressions for the probability of misclassification when p = 1, K = 1,3, and they compare the performance of their rule with that of the rule given by (1.11) when K = 1,2, p = 1,2.
What statisticians call discriminant analysis, engineers call
pattern recognition.
In what has become a famous paper in the area of
pattern recognition, Cover and Hart (1967) studied, in the mixed population setting, the properties of the nearest neighbor
(K = 1) rule. Let R_n(1) be the conditional probability (conditional on the training sample) of error for the nearest neighbor rule. [R_n(1) is a function of the particular training sample, Z_n, drawn, so perhaps the symbol R_n(1, Z_n) would be more appropriate. For notational simplicity, though, the symbol R_n(1) will be used hereafter.] Under mild conditions Cover and Hart proved that

    E[R_n(1)] \to R(1),    (1.52)

where

    R(1) = E[\,2\,P(\pi_1 \mid X)\,(1 - P(\pi_1 \mid X))\,],    (1.53)

and that

    R^* \le R(1) \le 2R^*(1 - R^*).    (1.54)
Thus they showed that the asymptotic expected probability of error
for the nearest neighbor rule is less than twice the probability of error
of the optimal Bayes rule.
Or as they put it, "In this sense, it may
be said that half the classification information in an infinite sample
set is contained in the nearest neighbor."
Letting X'_n denote the nearest neighbor to X in a random sample X_1, ..., X_n, Cover and Hart also showed, under mild conditions, that X'_n → X with probability one. They also established bounds for R_n(K) in terms of R*, where R_n(K) is the conditional probability of error for the K-nearest neighbor rule. (If K = 1, then of course R_n(K) = R_n(1).)

Most of the following results for K-nearest neighbor rules were proved for the mixed population case. The probability R_n(K) would seem to be of more interest to the practitioner than E[R_n(K)], since R_n(K) is the actual error rate, given the training sample, of the K-nearest neighbor rule. Wagner (1971) showed that there exists R(K) such that R_n(K) → R(K) in probability, and he gave additional conditions for which the convergence is with probability one.
He also studied rates of convergence.
Fritz (1975) gave an asymptotic upper bound for P[|R_n(1) - R(1)| ≥ ε]. This bound is of the form A exp(-B√n), and is distribution-free in the sense that A and B depend only on ε and p and not on the underlying population distribution.
An interesting idea using the K-nearest neighbor rule was explored by Hellman (1970). Letting K' be less than or equal to K, he suggested the following rule: if at least K' of the K nearest neighbors to X are from π_i, then assign X to π_i; otherwise withhold judgment on the observation. His rule falls within an area that is sometimes called partial discriminant analysis. In general, rules in this area take the following form. Let A_1 and A_2 be two sets in the sample space. If X ∈ A_1 ∩ A_2^c then assign X to π_1, but if X ∈ A_1^c ∩ A_2 then assign X to π_2; and if X ∈ (A_1 ∩ A_2) ∪ (A_1^c ∩ A_2^c) do not classify X. In the case of K' = K, Hellman's rule assigns X to π_i if and only if the K nearest neighbors to X are all from π_i. Let

    S_n^H = P\{\text{withhold judgment with Hellman's rule} \mid Z_n\}

and

    R_n^H = P\{\text{misclassify with Hellman's rule} \mid Z_n\},

where Z_n is the training sample. For K' = K Hellman showed that

    S_n^H \to S^H    (1.55)

and

    R_n^H \to R^H,    (1.56)

where for the given asymptotic rate, S^H, of withholding judgment the corresponding probability of error, R^H, satisfies

    R^H \le (1 + K/2)\, R^*_{S^H},    (1.57)

where 1 + K/2 is the smallest constant possible, and R*_{S^H} is the probability of error of the rule which is optimal among all rules with asymptotic probability of withholding judgment less than or equal to S^H.
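A sketch of a Hellman-type (K, K') rule with a withhold option, assuming NumPy; the samples, K = 5, and K' = 5 are hypothetical, and K' is taken larger than K/2 so at most one population can reach the threshold.

```python
import numpy as np

def hellman_rule(X1, X2, K, K_prime):
    """(K, K') rule: classify only when at least K' of the K nearest neighbors
    come from one population; otherwise withhold judgment (return 0)."""
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    pooled = np.vstack([X1, X2])
    labels = np.array([1] * len(X1) + [2] * len(X2))
    def classify(x):
        d = np.linalg.norm(pooled - np.asarray(x, dtype=float), axis=1)
        nearest = labels[np.argsort(d)[:K]]
        for pop in (1, 2):
            if np.sum(nearest == pop) >= K_prime:
                return pop
        return 0                      # withhold judgment
    return classify

rng = np.random.default_rng(5)
rule = hellman_rule(rng.normal(0, 1, (50, 1)), rng.normal(3, 1, (50, 1)), K=5, K_prime=5)
print(rule([0.0]), rule([1.5]), rule([3.0]))   # the middle point may be left unclassified
```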
One of the reasons that the K-nearest neighbor rules have not been
widely used in practice for K > 1 is that these procedures are costly in terms of CPU time.
is that these procedures are costly
Although progress has been made recently (Friedman,
Bentley and Finbel, 1975) with respect to this problem, the algorithm
still remains computationally expensive.
For applied problems more
interest has been expressed in the nearest neighbor (K = 1) procedures, but even those procedures can be costly in terms of the computer storage
requirements they impose.
It must be remembered that the nearest neighbor
rule is not asymptotically optimal, even though its asymptotic expected
probability of error is bounded by twice the probability of error of the Bayes rule.
A number of suggestions have been made in the hope of alle-
viating these computer problems for the K-nearest and nearest neighbor
rules.
Hart (1968) suggested the condensed nearest neighbor rule.
His
algorithm finds a consistent subset of the training sample and this consistent subset is then used with the nearest neighbor rule to classify
new observations.
A consistent subset is a subset of the observations
which, when used as the reference set for the nearest neighbor rule,
leads to the correct classification of all the remaining sample observations which are not in the reference set.
A minimal consistent subset
for the sample observations is a consistent subset of the smallest possible
size;
though a minimal consistent subset always exists, it is not
necessarily unique.
But Hart's algorithm does not necessarily find a
minimal consistent subset of the training sample -- only a consistent
subset.
The idea behind this algorithm is that if the area of overlap
for the populations is small then this algorithm will only retain those
observations in and near this area of overlap.
Thus the reference set
would contain only those points which are really essential for correct
classification.
Unfortunately, the statistical properties of the con-
densed nearest neighbor rule are unknown.
Hart studies this rule via
computer simulation and his results are unimpressive.
Another interesting idea was considered by Wilson (1972). His suggestion was to first edit the data by means of the K-nearest neighbor
rule and then use the edited dataset with the nearest neighbor rule to
classify new observations.
Let X_j^{(1)} be a member of the training sample from π_1. Wilson's algorithm uses the remaining n_1 - 1 + n_2 observations and the K-nearest neighbor rule to classify X_j^{(1)}. The same procedure is repeated with each of the other n_1 - 1 + n_2 observations.
All the observations which are not correctly classified are
edited (deleted).
The remaining edited dataset is used as the reference
set and new observations are classified using this dataset and the
nearest neighbor
(K = 1)
rule.
This procedure has intuitive appeal,
and Wilson apparently showed that tighter and more favorable bounds
exist for the asymptotic error rate (for each
K) of his rule than
exist for the asymptotic error rate of the comparable K-nearest neighbor
rule.
Unfortunately, an error has been found in Wilson's proof.
In
a univariate setting, using a slightly modified version of Wilson's
rule, Penrod and Wagner (1977) were able to show that this rule did
indeed yield tighter and more favorable bounds on asymptotic error
probability, though their improvements were not as impressive as Wilson's.
They also expressed the belief that the advantages of the edited nearest neighbor rule will be retained in higher dimensions, though they
were unable to prove it.
A number of other authors have made suggestions for modifying
nearest neighbor type rules to reduce storage requirements and computation time (Chang, 1974; Gates, 1972; Swonger, 1972; Ullman, 1974).
Again, almost nothing is known of the statistical properties of these
modified algorithms.
In recent years some bounds have been established on the difference
between the estimated probability of error and the actual probability
of error for the K-nearest neighbor rules.
Rogers and Wagner (1978) obtain distribution-free bounds for E[(R̂_n(K) - R_n(K))²], for some estimators, R̂_n(K), of the error probability R_n(K). In this setting, Devroye and Wagner (1979) obtain some tighter bounds, and they also show that P{(R̂_n(K) - R_n(K)) ≥ ε} ≤ A exp(-Bn), where A and B are positive constants depending only on K, p and ε. Penrod and Wagner (1979) study the tightness of some of these bounds by means of simulation.

1.7  Empirical Best-of-Class Rules
Another area of ongoing research in nonparametric discriminant
analysis is that of empirical best-of-class rules.
in this area is one by Glick (1976).
An important paper
As in his paper on discriminant
analysis via density estimation, Glick considers the mixed population
case, but his results obviously hold when the populations are not mixed
and q_1 and q_2 are known. Let D = (D_1, D_2) be an arbitrary discriminant rule and let

    \hat{r}(D) = \frac{\text{number of correct sample classifications by rule } D}{n}.    (1.58)

Then r̂(D) is often called the resubstitution estimate of the probability of correct classification. Letting 𝒟 be an arbitrary collection of classification rules, there always exists a D̂ such that

    \hat{r}(\hat{D}) = \sup_{D \in \mathscr{D}} \hat{r}(D),    (1.59)

though the maximizing D̂ is not necessarily unique. Any D̂ satisfying (1.59) is called an empirical best-of-class rule. Glick showed that

    E\left[\sup_{D \in \mathscr{D}} \hat{r}(D)\right] \ge \sup_{D \in \mathscr{D}} r(D) \ge r(\hat{D}),    (1.60)

where the expectation is taken over all training samples of size n. Glick called a classification rule D = (D_1, D_2) m-linear if the D_i are each sets in the finite field generated by some m open half-spaces (determined by m linear inequalities in x), and he called it m-convex if each D_i is a set in the finite field generated by some m measurable convex sets. Letting 𝒟_m^L and 𝒟_m^C be the class of all
decision rules which are m-linear and m-convex, respectively, Glick
showed that

    \sup_{D \in \mathscr{D}_m} |\hat{r}(D) - r(D)| \xrightarrow{a.s.} 0.    (1.61)

Using the above result he was able to show that

    \sup_{D \in \mathscr{D}} \hat{r}(D) \xrightarrow{a.s.} \sup_{D \in \mathscr{D}} r(D)    (1.62)

and

    r(\hat{D}) \xrightarrow{a.s.} \sup_{D \in \mathscr{D}} r(D),    (1.63)

when 𝒟 is any one of the sets of decision rules for which (1.61) holds. Thus, for instance, if we are considering the class of all linear
rules, 𝒟_m^L, and we choose a linear rule which maximizes the resubstitution estimate of correct classification, then this rule will be
asymptotically optimal among the class of all linear rules.
A similar
statement holds if we are considering the class of all quadratic rules.
It is interesting to note that the well-known linear rule given by
(1.11) also belongs to 𝒟_m^L, but it of course does not in general satisfy (1.63).
Unfortunately, algorithms for finding empirical best-of-class
rules in the general case are not yet available.
Greer (1979) has
found an algorithm for the class of all linear rules, 𝒟_m^L, for k = 2
only, and a computer program implementing this algorithm has been written
and is available upon request.
Unfortunately, his algorithm can require
prohibitive amounts of computer time.
To help solve this problem he
suggested an "approximate" algorithm, but its statistical properties
are unknown.
In an interesting paper, which evidently inspired some of Glick's
work, Stoler (1954) considered rules, for the univariate two population
case, of the form

    \text{assign } X \text{ to } \pi_1 \quad \text{iff} \quad X \le b,
    \text{assign } X \text{ to } \pi_2 \quad \text{iff} \quad X > b.    (1.64)

In particular Stoler considered the case for which it is known that there exists a unique b such that the optimal rule takes the form given by (1.64). Letting 𝒟_b be the class of all rules of this form, Stoler
proved (1.60) and (1.62) and (1.63) in probability only.
Results
similar to Stoler's were obtained by Hudimoto (1956, 1957).
Perhaps
the first paper in this area is that of Aoyama (1950).
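A sketch of an empirical best-of-class rule for the univariate threshold family (1.64), choosing the cut point b by maximizing the resubstitution count of correct classifications as in (1.58)-(1.59); NumPy is assumed and the two samples are hypothetical.

```python
import numpy as np

def best_threshold_rule(x1, x2):
    """Pick b in a rule of form (1.64) by maximizing resubstitution correct
    classifications; returns b and the apparent nonerror rate r_hat."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    candidates = np.sort(np.concatenate([x1, x2]))
    def correct(b):
        return np.sum(x1 <= b) + np.sum(x2 > b)
    b_hat = max(candidates, key=correct)
    return b_hat, correct(b_hat) / (len(x1) + len(x2))

rng = np.random.default_rng(6)
b_hat, r_hat = best_threshold_rule(rng.normal(0, 1, 100), rng.normal(2, 1, 100))
print(round(b_hat, 2), round(r_hat, 2))   # estimated cut point and apparent nonerror rate
```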
1.8
Some Other Rules
i)
An interesting rule in the area of partial nonparametric
discriminant analysis was proposed by Broffitt, Randles and Hogg (1976).
Their method is evidently applicable only to the two population case.
Let X be an observation to be classified and let X_1^{(1)}, ..., X_{n_1}^{(1)} and X_1^{(2)}, ..., X_{n_2}^{(2)} be the training samples from π_1 and π_2. First we group X with the data from π_1 by considering the two samples X_1^{(1)}, ..., X_{n_1}^{(1)}, X and X_1^{(2)}, ..., X_{n_2}^{(2)}. Let H_1 be a function from R^p to R which is selected after observing the training samples and which tends to be positive for observations from π_1 and negative for observations from π_2. The selection of H_1 must depend on the training samples only through statistics that are symmetric functions of X_1^{(1)}, ..., X_{n_1}^{(1)}, X and of X_1^{(2)}, ..., X_{n_2}^{(2)}. (The linear discriminant function given by (1.11) is one such H.) Let R_1(X) be the rank of H_1(X) among H_1(X_1^{(1)}), ..., H_1(X_{n_1}^{(1)}), H_1(X). Next group X with the data from π_2 and select another, possibly different, function H_2 (from R^p to R) which tends to be positive for observations from π_2 and negative for observations from π_1. The selection of H_2 must depend on the samples only through statistics that are symmetric functions of X_1^{(1)}, ..., X_{n_1}^{(1)} and of X_1^{(2)}, ..., X_{n_2}^{(2)}, X. Letting R_2(X) be the rank of -H_2(X) among -H_2(X_1^{(2)}), ..., -H_2(X_{n_2}^{(2)}), -H_2(X), their decision rule is given by

    X \in \pi_1 \quad \text{iff} \quad \frac{R_1(X)}{n_1 + 1} > \frac{R_2(X)}{n_2 + 1},
    X \in \pi_2 \quad \text{iff} \quad \frac{R_1(X)}{n_1 + 1} < \frac{R_2(X)}{n_2 + 1}.    (1.65)

If R_1(X)/(n_1 + 1) = R_2(X)/(n_2 + 1), then classify X using a nonrandom procedure (e.g. the rule given by (1.11)).
One important contribution of their rule is that the choice of the H_i functions is made after observing the data. Also, the procedure can be used with continuous, discrete, or mixed data. They suggest several H_i functions in this and a later paper (Randles, Broffitt, Ramberg, and Hogg, 1978). They contend that their procedure will tend to give a small average probability of misclassification while controlling the balance between P(misclassification | π_1) and P(misclassification | π_2) and reducing the large probabilities of
nonclassification exhibited by some other partial discriminant analysis
procedures.
They base their contention on some simulation studies and
heuristic arguments in the two papers.
ii)
Another discriminant analysis rule based on ranks has been
suggested by Conover and Iman (1980).
Using their method the two
training samples are combined and ranked component-wise.
That is, consider one of the p components of the p-variate training vectors. The n_1 + n_2 observations for this component are ranked from smallest to largest, and each observation is replaced by its rank. This is done for each of the p components of the n_1 + n_2 p-variate training vectors. The training samples are thus reduced to p-variate vectors of ranks and these are then used to estimate (by the usual formulas) the μ^(i) and Σ (or Σ_i) of the standard linear (or quadratic) rule. A new observation which is presented for classification has each of its components assigned ranks by comparison, component by component, with all the n_1 + n_2 original training observations. The standard linear (or quadratic) rule is then used with the above mentioned estimates of the μ^(i) and Σ (or Σ_i) and with the new observation of ranks.
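A sketch in the spirit of the rank-transform rule just described, assuming NumPy and SciPy; the rank assigned to the new observation here is a simple count against the pooled training values, a stand-in for the authors' component-by-component comparison, and the lognormal training samples are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata

def rank_transform_linear_rule(X1, X2):
    """Replace each component by its rank in the combined training sample,
    then apply the usual linear discriminant to the rank vectors."""
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    pooled = np.vstack([X1, X2])
    ranks = np.column_stack([rankdata(pooled[:, j]) for j in range(pooled.shape[1])])
    R1, R2 = ranks[:len(X1)], ranks[len(X1):]
    m1, m2 = R1.mean(axis=0), R2.mean(axis=0)
    S = (np.cov(R1, rowvar=False) * (len(R1) - 1) +
         np.cov(R2, rowvar=False) * (len(R2) - 1)) / (len(pooled) - 2)
    w = np.linalg.solve(S, m1 - m2)
    cut = 0.5 * (m1 + m2) @ w
    def classify(x):
        # rank the new observation, component by component, against the pooled data
        r = np.array([np.sum(pooled[:, j] <= x[j]) for j in range(pooled.shape[1])])
        return 1 if r @ w >= cut else 2
    return classify

rng = np.random.default_rng(7)
rule = rank_transform_linear_rule(np.exp(rng.normal(0, 1, (40, 2))),
                                  np.exp(rng.normal(1, 1, (40, 2))))
print(rule([1.0, 0.8]), rule([3.5, 4.0]))
```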
One might expect this procedure to work reasonably well if the
data were from underlying normal distributions or distributions which
were a monotonic transformation of normal distributions (since ranks
remain unchanged under monotonic (increasing) transformations).
Through
simulation work Conover and Iman present some evidence that supports
this expectation.
They compare their technique to the standard linear and quadratic
rules as well as to some of the other nonparametric rules mentioned
above in the circumstance of underlying normal distributions as well as
that of underlying distributions which are a monotonic transformation
of normal distributions.
Their technique performs fairly
well against the linear and quadratic rules when the underlying distributions are normal and in general appears to outperform these rules
when the underlying distributions are a monotonic transformation of
normal distributions.
For small sample sizes their rule outperforms
the nearest neighbor rule and a rule based on a LQ density estimator.
As the size of the training samples is increased Conover and Iman's
rule does not appear to perform any better than some of the other nonparametric rules they consider (e.g. those based on the LQ density
estimator or nearest neighbor).
As a matter of fact their rule seems to perform no better for
large sized training samples than it does for training samples of
smaller size.
Also, it does not perform well when the sample sizes
for the two training samples differ by as much as a ratio of three to
one.
Conover and Iman's rule seems best suited to the situation of
underlying normal distributions or distributions that are a monotonic
transformation of normal distributions, and these were the only cases
they considered for simulation.
How their rule might perform in the
instance of underlying distributions which are neither normal nor a
monotonic transformation of normal distributions is unknown. Also it must
be emphasized that all their conclusions are based on simulation work
only, and that evidently none of the theoretical properties of their
rules are known.
iii)
Of all the nonparametric discriminant analysis procedures
considered to date the most promising appears to be a set of rules
studied by Gordon and Olshen (1978).
Their work was motivated by a
paper by Friedman (1977) who suggested the following discriminant procedure.
Using the training samples, consider, for each of the
p
compo-
nents the Kolmogorov-smirnov distance between the marginal empirical
distribution functions of the two populations.
I,et
be a point
X*
j
which yields this distance for the j-th component and let
X*
j*
=
max
. 1 , •.• ,p
J=
(1.66)
D (X~)
)
where
D(X~)
)
= IF1(X~)
)
- F2(X~)
)
I
is the above-mentioned Kolmogorov-Smirnov distance
(1.67)
fo~
the j-th component.
Divide the sample space into two boxes by making a cut at
dicular to the j-th axis.
boxes.
Now consider one of the two
result~ng
daushter
Refine this box by using the marginal conditional empirical dis-
tribution functions of the observations in this box.
of the
perpen-
p
That is, for each
components, consider the Kolmogorov-Smirnov distance cet'Neen
..
39
the marginal conditional empirical distribution functions of the two
populations, and make another cut by using the criterion given by (1.66).
Repeat this procedure for the other daughter
bo~
and continue refining
each successive box in this fashion until all the observations in a given
box are from one population or until the number of observations in a box
falls below a preassigned integer L. The assignment of a given multivariate observation X is made by determining the box in which X lies and then assigning X to that population having the majority of the
members in this box.
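A sketch of one split step in the spirit of (1.66)-(1.67), assuming NumPy: for each component it finds the point of largest Kolmogorov-Smirnov distance between the two marginal empirical distribution functions (evaluated at the pooled sample points) and returns the best component and cut; the three-component samples are hypothetical.

```python
import numpy as np

def ks_cut(X1, X2):
    """Select the component and cut point maximizing the KS distance (1.66)-(1.67)."""
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    best = (None, None, -1.0)                     # (component j*, cut point, distance)
    for j in range(X1.shape[1]):
        grid = np.sort(np.concatenate([X1[:, j], X2[:, j]]))
        F1 = np.searchsorted(np.sort(X1[:, j]), grid, side="right") / len(X1)
        F2 = np.searchsorted(np.sort(X2[:, j]), grid, side="right") / len(X2)
        i = int(np.argmax(np.abs(F1 - F2)))
        if abs(F1[i] - F2[i]) > best[2]:
            best = (j, grid[i], abs(F1[i] - F2[i]))
    return best

rng = np.random.default_rng(8)
j_star, cut, dist = ks_cut(rng.normal(0, 1, (60, 3)),
                           np.column_stack([rng.normal(0, 1, 60),
                                            rng.normal(2, 1, 60),
                                            rng.normal(0, 1, 60)]))
print(j_star, round(cut, 2), round(dist, 2))   # the well-separated second component should be chosen
```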
This procedure generalizes to the ℓ-population case, and Friedman
claimed that it is Bayes risk consistent.
Gordon and Olshen point out
a pathological example (1978, page 519) for which the above algorithm is not Bayes risk consistent, but under the assumptions that q_1 = q_2 = 1/2
and nonatomic marginal distributions they offer a slight modification
of Friedman's algorithm which is Bayes risk consistent.
Friedman pointed out that the optimal value of L is problem-
dependent, but that his experience has indicated that the performance
of his algorithm is not particularly sensitive to its choice; Gordon and
Olshen suggest using L = n^{5/8}. A simulation study by Friedman suggests
A simulation study by Friedman suggests
that, for the distributions he considered, his successive partitioning
algorithm is superior to the nearest neighbor rule, both in amount of
computer time needed to reach a decision (there is considerable improvement
here) and in error rate.
In their paper Gordon and Olshen consider a general class of successive partitioning algorithms for use in discriminant analysis.
Their
algorithms involve the successive partitioning of the sample space into
boxes.
At a given point in the algorithm each partition is a refinement
of a previous partition and classification is accomplished by a majority
vote within each box of the final partition.
They proved that use of
their method for discriminant analysis leads to procedures which are
Bayes risk consistent.
Another group of discriminant rules for which their results are
applicable are those of Anderson (1966): under the assumption of nonatomic marginals, Gordon and Olshen have shown that Anderson's suggested
rules are Bayes risk consistent.
Gessarnan and Gessaman (1972) study a
rule similar to some of Anderson's, and in their simUlation work, under
assumptions of normality, their rule outperforms discriminant procedures
based on Parzen-Cacoullos and Quesenberry-Loftsgaarden density estimators.
In actuality, Gordon and Olshen's approach is a generalization of
the K-nearest neighbor and density estimation discriminant analysis procedures.
It is more appealing than either of these, though, since it
requires no regularity assumptions on the underlying population distributions, which can be discrete, continuous, mixed, or singular.
Lastly, it should be mentioned that logistic regression has become
an accepted alternative to discriminant analysis; particularly in epidemiological applications (Press and Wilson, 1978).
Using this alternative a regression function of the form

    P(\pi_1 \mid x) = \frac{\exp(\alpha + \beta' x)}{1 + \exp(\alpha + \beta' x)}    (1.68)

can be estimated by means of the training samples from both populations. A new observation, X, can then be classified into one population or the other by means of the estimates \hat{\alpha} and \hat{\beta}.
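A minimal sketch of classification via an estimated logistic regression, assuming scikit-learn is available; the training samples are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training samples from the two populations
rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([1] * 50 + [2] * 50)

model = LogisticRegression().fit(X, y)           # estimates alpha and beta of (1.68)
new_obs = np.array([[0.3, -0.2], [2.1, 1.8]])
print(model.predict(new_obs))                    # assigned population labels
print(model.predict_proba(new_obs).round(2))     # estimated P(pi_i | x)
```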
1.9  Notational Considerations

Since we shall be considering the two population case in most of what follows, we adopt a simpler notation for the following chapters. We denote the two underlying populations by π_X and π_Y, having a priori probabilities q_X and q_Y respectively. We represent the p-variate training vectors from π_X by X_i = (X_{1i}, X_{2i}, ..., X_{pi}), i = 1, ..., n_X, and those from π_Y by Y_j = (Y_{1j}, Y_{2j}, ..., Y_{pj}), j = 1, ..., n_Y. It should be kept in mind that the kth component, X_k, of X is the same variable as the kth component, Y_k, of Y, the only difference being that the former is a measurement from π_X and the latter from π_Y. The training sample from π_X, which we represented above by X_1^{(1)}, ..., X_{n_1}^{(1)}, is now given by X_1, X_2, ..., X_{n_X}, and that from π_Y, which was X_1^{(2)}, ..., X_{n_2}^{(2)}, is now given by Y_1, Y_2, ..., Y_{n_Y}. Thus our new notation obviates the use of superscripts when referring to the training samples.
CHAPTER II
KENDALL'S METHOD

The aim of the present research is to investigate a straightforward nonparametric rule, based on order statistics, which was
proposed by Kendall (1966).
This rule is formed by considering sequentially the univariate components of the p-variate training vectors. Let Z be a random vector presented for classification and let Z_1 be the univariate variable which is the first component of this random vector. If all the training sample observations with a first component value less than a_1 (say) are from π_X and all those with a first component value greater than b_1 are from π_Y, then a reasonable criterion for the first component would be to assign Z to π_X if Z_1 < a_1, to assign Z to π_Y if Z_1 > b_1, and to refer to the next component iff a_1 ≤ Z_1 ≤ b_1.

At this first step the sample observations might look like:

    xxx············xx
              yyy·············yy
where the
X's are from
configuration
a
l
If
a1
~
ZI
x
= Y(I)
values to choose for
~
a
b
and the
W
l
1
and
bI
and
bI •
Y's
from wy
For the above
•
would seem to be natural
= X(n)
we refer to the next component but consider
only those training observations which fall into the area of overlap
for the first conponent.
(Kendall does not actually define what he
means by overlap, but it is clear from his example that he means the
interval
[max(x(l)' Y(I»' min(X(n)' Y(n»].
max(x(l)' Y(I»
>
min(X(n)'
Yen»~
two populations on this component.
Of course; if
then there is no overlap for the
In terms of the sample configuration
given above, the area of overlap for the first component is given below
by the interval enclosed in brackets.
Let
Z2
be the second component of the observation,
has been presented for classification.
classified as belonging to
or
If
if
al~
ZI
~
~,
b l , then
which
~
is
falls outside the area
of overlap, for the second component of the remaining training observations.
An
Otherwise we refer to the next component, and so on.
example may prove helpful in illustrating Kendall's technique.
We follow Kendall by using some of Fisher's Iris data (see Table 1).
44
TABLE 1 - Some of Fisher's Iris Data
..
Iris versicolor
Sepal
length
7·0
6·4
6·9
5·5
6'5
5'7
6'3
4'9
6·6
5·2
5·0
5·9
6'0
6'1
5-6
6'7
5'6
s·a
6·2
5'6
5'9
6'1
6'3
6'1
6'4
6'6
6'8
6'7
6'0
5'7
5'5
5'5
5·S
6'0
S· 4
6'0
6·7
6'3
5'6
5·5
5-5
6'1
sea
5'0
5'6
·5·7
5·7
6'2
5'1
5"7
Iris virginica
Sapal
width
Petal
length
Petal
width
Sepal
length
Sepal
width
?etal
length
Petal
width
3·2
3·2
3·1
2·3
2·a
2'8
3'3
2'4
2·9
2·7
2-0
3·0
2'2
2'9
2'9
3'1
3'0
2·7
2'2
2·5
3'2
2'8
2-5
2'8
2-9
3'0
2·a
3'0
2'9
2'6
2'4
2'4
2'7
2'7
3'0
3'4
3'1
2·3
3-0
2'5
2-6
3'0
2'6
2'3
2·7
3'0
2·9
2'9
2'5
2'8
4·7
4·5
4·9
4·0
4-6
4'5
4-7
3'3
4·6
3'9
3'5
4·2
4·0
4·7
3'6
4·4
4·5
4·1
4·5
3-9
4·8
4'0
4·9
4'7
4-3
4'4
4·a
5·0
4'5
3'5
3·g
3'7
3'9
5'1
4' 5
4'5
4' 7
4·4
4'1
4'0
4·4
1·4
1·5
1·5
1·3
1·5
1·3
1·6
1·0
1·3
1·4
1·0
1·5
1·0
1·4
1'3
1·4
1·5
1·0
1·5
1'1
1'8
1'3
1·5
1·2
1·3
1·4
1·4
1·7
1·5
1'0
1·1
1'0
1·2
1·6
1·5
1'6
1· 5
1-3
1'3
1'3
1·2
1·4
1·2
1·0
1-3
1·2
1·3
1'3
1'1
1·3
6.3
5·a
7·1
6·3
6·5
7·6
4·9
7·3
6·7
7·2
6·5
6·4
6·a
5·7
5'8
6·4
6·5
7·7
7·7
6·0
6'9
5·6
7·7
6'3
6·7
7·2
6·2
6·1
6·4
7·2
7·4
7'9
6·4
6·3
6·1
7·7
6· 3
6'4
6'0
6'9
6'7
6·9
s·a
6·a
6·7
6'7
6'3
6'5
6"2
5'9
3.3
2·7
3·0
2·9
3·0
3·0
2·5
2'9
2·5
3·6
3·2
2·7
3·0
2·5
2·a
3·2
3·0
3·8
2·6
2·2
3·2
2·a
2·a
2·7
3·3
3·2
2·a
3-0
2-a
3·0
2·a
3·8
2·a
6.0
5.1
5·9
5·6
s·e
6·6
4·5
6·3
s·a
6·1
5·1
5·3
S·S
5·0
5-1
5·3
5·5
6·7
6·9
5·0
5·7
4·9
6·7
4'9.
5·7
6·0
4.a
4·9
5·6
s·e
6·1
6·4
5·6
5·1
5·6
6·1
S· 6
5 -.5
4·a
5·4
5-6
5-1
5·1
5·9
5·7
5·2
5·0
5·2
5·4
5·1
2.5
1-9
2.1
l·a
2·2
2·1
1-7
1·8
l·a
2·5
2.0
1·9
2-1
2·0
2·4
2·3
l·a
2·2
2·3
1·5
2-3
2·0
2·0
1·8
2·1
1-8
1.a
l·a
2·1
1·6
1·9
2·0
2·2
1·5
:;'·4
2·3
2.4
l·e
loa
2-1
2-4
2·3
1·9
2·3
2·5
;':·3
1·9
2·0
2·3
1.a
4'6
4'0
3'3
4'2
-1-2
4'2
4·3
3-0
4"1
2-8
2·6
3·0
3·4
3·1
3-0
3'1
3·1
3·1
2·7
3·2
3-3
3-0
2·5
3-0
3-4
3·0
..
45
Kendall proposes to start his procedure by using that component with
least overlap for the two training samples.
He defines the component
having least overlap to be that which has the largest total number of
training observations outside its overlap area.
Consider the samples from versicolor and virginica.
Each 4-
variate vector has the components Petal Length (PL), Petal Width (PW),
Sepal Length (SL) and Sepal Width (SW).
Inspection of the training
data reveals that the two populations differ more on Petal Length and
Petal Width than on the other two components.
Frequency distributions
[see Table 2] show that for PL the area of overlap is [4.5, 5.1] and
that there are 29 observations from versicolor and 34 observations
from virginica outside this area.
For PW the overlap area is [1.4,
1.8] and outside this interval there are 28 observations from versicolor
34 from virginica.
Thus there are 63 observations outside the area of
overlap for PL and 62 outside that for PW.
Thus Kendall chooses PL as
the first component for discrimination and the first step of his rule
is given as
assign to versicolor iff PL < 4.5
assign to virginica
refer to the next component
iff PL > 5.1
iff 4.5
~
PL
~
5.1
At the second step he considers only those training observations
46
TABLE 2
Frequency Distributions of Petal Length and Petal Width
Variate
values
4·3
4·4
4·5
4·6
4·7
4·8
4·9
5·0
5·1
5·2
5·3
5·4
5·5
5·6
5·7
5·8
5·9
Petal length
Virgo
Verso
Variate
values
Petal width
Verso
Virgo
25
4
7
1
3
-
5
2
2
1
1
-2
3
3
7
2
2
2
3
6
3
3
13
50
50
1·0
1·1
1·2
1·3
1·4
1·5
1·6
1·7
1·8
1·9
2·0
2·1
2·2
2·3
2·4
2·5
7
3
5
13
7
10
3
1
1
1
2
1
1
11
5
6
6
3
8
3
3_
50
50
47
which fall inside the area of overlap for PL.
There are 37 such obser-
vations and among these there is least overlap for PW.
tabulation for this
compo~ent
A frequency
(see Table 3) reveals that its area of
overlap is [1.5, 1.8] and proceeding as in step one, the assignment
rule at this stage is
assign to versicolor iff PW < 1.5
assign to virginica
iff PW > 1.8
refer to the next component iff 1.5
~
PW
~
1.8
There are 14 observations from versicolor and 8 from virginica
(for a total of 22) inside the interval [1.5, 1.8] and Table 4 gives
the frequency distributions for the remaining two components for these
22 observations.
For SL the overlap interval is [5.4, 6.3] and there
is a total of 6 observations outside this interval.
(Kendall has ap-
parently made a mistake here since he claims that there are only five
observations outside the common range for SL.)
For SW the overlap area
is [2.2, 3.0] and again there is a total of 6 observations (all from
versicolor) outside this area.
Since there is a total of 6 observations
outside each component's overlap area, he arbitrarily selects SW as the
next component for discrimination.
•
Thus he adds to the
assign to versicolor iff SW > 3.0
refer to the next component iff SW < 3.0
rul~
at
~his
step:
48
TABLE 3
Frequency Distribution of Petal Width for 37 Cases
Not Distinguished by Petal Length
Variate values
1·2
1·3
1·4
1·5
1·6
1·7
1·8
1·9
2·0
2·1
2·2
2·3
2·4
Petal width
Verso
Virgo
1
2
4
9
3
1
1
2
-1
5
3
3
1
1
21
16
49
TABLE 4
Frequency Distributions of SL and SW for 22 Cases Not
Distinguished by PL and PW
Variate
values
4·9
-
5·4
5·5
5·6
5·7
5·8
5·9
6·0
6·1
6·2
6·3
6·4
6·5
6·6
6·7
6·8
6·9
Sepal length
Verso
Virgo
1
1
1
3
-1
2
1
1
2
-
1
1
2
1
1
2
-
Variate
values
2·2
2·3
2·4
2·5
2·6
2·7
2·8
2·9
3·0
3.1
3·2
3·3
3.4
Sepal width
Verso
Virgo
1
1
-
1
1
1
3
2
2
1
-
1
1
2
-
3
1
1
-
1
-
14
8
14
8
50
At the last step there are 16 training observations left.
The
frequency distribution for SL is given in Table 5 for these remaining
observations.
The area of overlap for SL is [5.4, 6.3J and he adds
assign to versicolor iff SL > 6.3
assign to virginica
iff SL < 5.4
withhold to judgment iff 5.4
Let
~
SL
~
6.3
Z be an observation presented for classification.
Then the
discriminant rule based on Kendall's order statistics method is given
as
I.
assign
Z
to versicolor iff
ZpL < 4.5
or 4.5 -< ZPL< 5.1 and Zpw < 1.5
or 4.5 -< ZPL< 5.1 and 1. 5
or 4.5
II. assign
s..
ZpL
Z
s..
5.1 and 1. 5
~
Zpw
s..
1. 8 and ZSW > 3.0
s..
Zpw
s..
1. 8 and Zsw
s..
3.0 and ZSL > 6.3
s..
1. 8 and ZSW
s..
3.0 and ZSL < 5.4
to virginica iff
ZpL > 5.1
or 4.5
~
ZpL
~
5.1 and Zpw > 1.8
or 4.5
s..
ZpL
s..
5.1 and 1. 5
III. withhold jUdgment
s..
Zpw
(do not classify
Z)
iff
Kendall's rule obviously comes under the heading of partial discriminant analysis since, for a given observation, it can withhold judgment.
Of the original 100 training observations, the above rule leaves 13 unclassified.
It should be noted, though, that each training observation
51
TABLE 5
Frequency Distribution of SL for 16 Cases Not
Distinguished by PL, PW or SW
Variate values
4°9
-
5°4
5 0S
S06
S07
S08
S·9
6°0
6·1
6°2
6·3
6·4
6 0S
6°6
6·7
Sepal length
Verso
Virgo
1
-1
-
-2
-
1
-
-
1
1
-
1
2
1
1
2
1
-
8
8
1
-
52
which is classified is classified correctly.
At the third step of Kendall's rule with Fisher's Iris data, the
amount of training sample overlap is tied for the two remaining components -- SW and SL.
At a given step Kendall specifies that his rule
selects that component, among those remaining, with least overlap.
Obviously we need to develop a systematic means of handling this case
for which Kendall's rule does not select a unique component at a given
step.
There are several alternatives for doing this.
fixed sequence rule.
One is to use a
That is, instead of proceeding from the component
with least overlap to that with most, we just select the components in
the order they are presented, or in any other fixed order not determined
by the data.
Another possibility is to look ahead, for those components tied
for least overlap at a given step, in order to break the ties.
Using
this method for the above example, we choose SL for the component to be
used at the third step and SWat the fourth if the amount of sample
overlap is less at the fourth step than when using SL at the third step
and SWat the fourth.
Of course there is the possibility that for both
orderings we may end up with the same amount of overlap at the fourth
step, and since the fourth step is the last step for the above data,
we could not look ahead any furthe+ to resolve the tie.
53
These difficulties lead us to the consideration of a third alternative.
If two components are tied for least overlap at a given step,
we just toss a fair coin to determine which of the two will be selected
for discrimination at this step.
of more than two components.
This readily generalizes to the case
This alternative is simpler than the look
ahead method, for which there always exists the possibility that ultimately a coin must be tossed anyway in order to resolve ties in least
overlap.
Because of these reasons we find the coin tossing method more
satisfactory than the look ahead method and finally we rule out the
fixed sequence method as being too arbitrary.
Thus from this point on
we adopt the convention that if two or more components are tied for least
overlap at a given step, we pick one of them at random; as we did for the
above example, and continue on.
Barring ties, the remaining training sample data for a given component at a given step of the rule can appear in eight basic configurations.
Again letting the X's be the training sample values from
those from
these eight configurations are:
Configuration 1
x
XX
X
Y Y ••.•••.••..• Y Y
Configuration 2
x
X •••••••••••• X
Y Y ••..•....... Y Y
n
x and the Y's
54
Configuration 3
x
X
XX
y
•••••••• y
Configuration 4
x ••••.•..••..
X
y ••••••••••••••••••••• y
Configuration 5
x
X •••••••••••• X X
y •••••••••••• y
Configuration 6
Y •••••••••••• y
x
X
x ..........•.
X
Configuration 7
x .•.•.....•..
X
y
•••• y
Configuration 8
x
y •••••••••• y
X
y ••••••••••• y
Note that for Configurations 5, 6, 7 and 8 there is no sample overlap.
Using the above set of configurations as a point of reference, we
consider several ambiguities inherent in Kendall's method.
From his
example, the first component used for discrimination is Petal Length.
By means of the
training samples Kendall determines his rule for the
first component as
55
assign to versicolor iff PL
~
4.4
assign to virginica
~
5.2
iff PL
refer to the next variable iff 4.5
~
PL
~
5.1
Suppose we were presented with a new observation for classification with versicolor measurement 5.15.
For this observation we
would not refer to the next component nor would we classify.
In par-
ticular, if this were a univariate setting with PL the only component,
the observation would not be classified as coming from either population
nor would it fall into the nonclassification (withhold judgment) region.
This may seem to be quibbling over trivial matters, but the distinction
is important when one attempts to derive expressions for the distribution
of Kendall's criterion.
One logical solution to the problem, which we adopted above, is
as follows.
Suppose the data has Configuration 1 and we are presented
with an observation,
classify
coming from
Z
Z,
for classification.
as coming from
~Y
~X.
If
Then if
ZpL > X(n)
ZpL < Yell
classify
Z
as
refer to the next component
and if
(or withhold judgment in the univariate case).
One might wonder what form Kendall's rule takes for a sample such
as that given by Configuration 5.
Obviously this configuration makes
the component from which it was taken seem a very promising one for the
discrimination problem.
Kendall mentions an example like this and he
56
just chooses a point between
and
X(n)
Y(l)' though not the midpoint,
as the cutpoint for differentiation between the two populations.
Pre-
sented with such a sample one would not want to refer to the next component nor have a withhold judgment region.
Kendall remarks that such
a variable would be allperfectly good discriminator in itself. 1I
Let
Zk
be the first component selected by Kendall's procedure
for which the above situation occurs.
at the
.th
~
If this component is selected
stage we believe that a reasonable rule for this step is:
assign to
assign to
'Il'x
'Il'y
iff Zk
iff Zk
~ Xk(n) + Yk (1)
2
>
X
k (n) +
Y k (l)
2
Then at this point the rule terminates.
Similarly for Configuration 7
we
or
Zk > Yen)
and then the procedure terminates.
A note of caution is due here.
statistics notation, for thei
masks some complexities.
th
(>
The above use of the ordinary order
1) step of Kendall's procedure,
First, the sample sizes on which the lIorder
statistics II are based are not necessarily
n
X
and
As a matter of
fact these sample sizes are dependent on the values of the training ob-
57
servations for the first i-l components selected by Kendall's procedure •
Similarly, the training samples which remain at the
. th
~
step no
longer constitute a random sample for the component selected at this
stage.
Although the order statistic notation seems to be the most
convenient and natural notation to use at this point, the above mentioned complexities should be kept in mind.
Now consider Configuration 3.
If the remaining data for the
component (say Component K) used at the
.th
~
step appears in this
configuration the appropriate rule would seem to be
assign to
lT
assign to
lT
X
X
iff
iff
refer to the next component
Z >
k
y
ken)
iff
Lastly the specific rules for Configurations 2, 4, 6 and 8 are
the obvious analogues of those given for Configurations 1, 3, 5 and 7.
An extension of Kendall's order statistic method to the case of
ties, among the observations for a selected component, is straightforward.
Incidentally, at a given step, we need only concern ourselves with ties,
between the two populations, in the two largest (or two smallest) values.
Otherwise our rule remains unchanged.
Step 3 of Kendall's procedure
for Fisher's Iris data is one instance of such a tie.
sample data locked like:
In this case the
58
x
y
where
XSW(l)
= YSW(l)·
X
••••••••••• y
The obvious modification in this instance is
assign to versicolor
iff
refer to the next component
ZSW > Ysw(n)
iff
ZSW ~ Ysw(n)
Ties for the other configurations are handled similarly.
In summary, for Kendall's order statistic method we consider each
of the components individually for our p-variate training vectors.
Start-
ing with that component with least overlap for the training samples we
proceed at each step of the way by choosing that component with least
overlap, for the remaining training sample observations, among those
components yet to be used.
At each stage, barring ties, the remaining
training data must fall into one of the eight basic configurations given
above.
The assignment rule at a given stage is then completely deter-
mined by the configuration that occurs.
We proceed in this fashion until
we have exhausted either our supply of components or of training data.
[In order to exhaust the supply of training data Configuration 5,6,7 or
8 must occur.]
Once this process is complete we have a set of rules
which are then used for classification of all new observations.
One might wonder about the advantage of using one component at a
time in the nonparametric discrimination problem.
The following quotation
59
from Kendall throws some light on his motivation for doing this.
"All true distribution-free methods, I contend, must depend on order
statistics.
It follows that any distribution-free method of dealing
with the discrimination problem must rely on order properties.
One
of the distinguishing features of such properties, however, is that
they exist only in one dimension.
Thus we have to consider discrimi-
nation by one variable at a time.
This turns out to be an advantage."
Furthermore he states, "We may, indeed, ask ourselves a very pertinent
question at this point.
Apart from convenience why do we want a single
discriminant function?
Provided that we have a set of rules to perform
the discrimination, there. is, so far as I can see, no reason whatever
why that set should be reducible to a single algebraic function, still
less a linear one."
Very little research has been done with Kendall's order statistic
method.
Richards (1972) suggested both a refinement and an extension
of Kendall's procedure.
The refinement involves considering, for a
given step, all components except that used in the immediately preceding step.
This differs from Kendall's proposal since once a
component is used at a given step of Kendall's procedure, it is no
longer eligible for use at some later step.
vo1ves examining
bivariat~
Richards' extension in-
frequency distributions for those training
samples left unclassified by Kendall's procedure.
His extension yields
60
fewer training samples unclassified and as with Kendall's method,
it misclassifies none.
Unfortunately, Richards offers no theoretical
justification for either his refinement or his extension.
Gessaman and Gessaman (1972), using Monte Carlo techniques,
compare Kendall's order statistic method with a rule (actually two
nearly identical rules) based on statistically equivalent blocks.
Their rule, which they call "nearest neighbor with probability squares"
is a partial discriminant analysis analogue to a rule in the class
considered by Gordon and Olshen (see Chapter I).
They considered
only the case of underlying normal populations, and the distributions
they sampled were
I.
~l
= (0, 0),
~1
= [~
~l
e
versus
G~]
= G~]
~2 = (2,0)'~2 =
II.
~3
=
(0, 0),
~3
versus
~4 = (0, 0), ~4
III.
~5
=
(1, 1),
~5
= ~-11
-1J
4
= [~
~]
versus
~6
=
(-1, 0),
~6
= [1
'1
~]
e
61
Each discriminant analysis procedure was studied for three
training sample sizes:
64, 200 and 729.
Further
sa~ples
of size
250 were generated from each distribution and then classified by each
procedure.
Their results are presented in Table 6.
An
examination of this
table shows that Kendall's method uniformly (across all distributions
and training sample sizes) misclassifies a smaller percentage of observations than does either of the "nearest neighbor with probability
squares" methods.
Unfortunately, Kendall's rule also has a uniformly
higher rate of nonclassification.
Gessaman and Gessarnan point out some desirable properties of
Kendall's procedure:
it is "completely distribution-free", it does
not misclassify any of the training sample observations from which it
is formed, and it is easily constructed even for large samples of high
dimensionality.
They also point out some of its disadvantages:
for
variables of unlimited range or for distributions differing only in
covariance matrices it may not classify a large percentage of those
observations presented for classification.
Gessaman and Gessaman claim that their Monte Carlo results tend
to confirm the
above-~entioned
disadvantages.
Furthermore they call
attention to the fact that, for the underlying population distributions
which they study, the larger the training sample size the greater the
TABLE 6
Proportions of the 500 Observations for Each Pair Misclassified (MC) and
Not Classified (NC) by Procedures Based on Sample Sizes 729, 200, 64
Procedure
200
MC
NC
III
II
I
200
r-iC
64
-Me64
.... - NC
MC
.002 .50B .036 .290
.002
.996 .012
.966 .062
.876 .004
probability
S'tuares
Using (a)
tiquares
.078 .076
.098 .024 .152 .032
.092
.400 .194
.256 .242
Using (b)
squares
.032 .166
.072 .066 .098 .000
.09B
.382 .202
.200 .254
729
NC
Me
--,---
1) Kendall .002 .6B6
729
NC
NC
MC
NC
729
----=-=
MC
NC
2<J0
.--
64
MC
NC
l-lC
NC
.846
.004
.829
.024
.492
.126 .102
.102
.128
.056
.178
.000
.144 .084
.160
.122
.108
.184
.000
2) Nearest
NC!ighbor
with
*Adapted from Gessaman and Gessaman (1972) page 471.
'"
N
.e
e
e
63
percentage of observations not classified by Kendall's procedure.
The following quotation seems to summarize their conclusions about
Kendall's order statistic method:
"In other words, discriminatory
power decreases as sample size increases.
This can not help but be
distressing to a statistician who might want to use the method.
How-
ever, the advantages of this method which were mentioned earlier still
remain, and one must ask if there exists an alternative which holds
better promise."
From the preceding it is clear that one drawback of Kendall's
suggested method is that it offers no means of controlling the probability of nonclassification in the partial discriminant analysis problem.
Acknowledging this difficulty, Kendall (l976) states, "As we have explained it, only nonoverlapping regions are accepted for discrimination;
for variables of infinite range there tends to be more overlap as sample
size increases.
It might, therefore, be preferable to accept some
misclassification from the outset by permitting overlap up to a specified
arr~unt;
or to fit univariate distributions and estimate the cutoff points
to a specified degree of overlap.
Much more remains to be done in this
field."
One cannot help but notice from Table 6, though, that if some·of
its extremely low error rate could be sacrificed in a judicious manner
for a lower rate of nonclassification, Kendall's rule might compare
64
favorably to the probability squares rule in terms of both misclassification and nonclassification rates.
We want to emphasize that the
probability squares procedure is a partial discriminant analysis
analogue of a rule in the class of rules studied by Gordon and Olshen.
As pointed out in Chapter I, this class appears to be among the most
promising, from the standpoint of applications, of all the nonparametric
rules for discriminant analysis.
Thus we feel that, if the nonclassi-
fication rate of his procedure could be generally lowered in the
jUdicious manner mentioned above, Kendall's order statistic method
might become a more likely candidate for use by practitioners than
many of the other nonparametric rules, expecially since it is so easily
constructed under any circumstance.
At any rate, Gessaman and Gessaman's study considered only the
case of underlying normal distributions.
Additionally, their research
is based solely on Monte Carlo work, and they offer no theoretical
justification whatsoever for any of their conclusions concerning
Kendall's order statistic method.
It appears that no one has studied the theoretical properties
of Kendall's rule (as it stands) nor has anyone attempted to modify
his procedure in order to control its nonclassification rate.
We
study both the theoretical properties of Kendall's rule as well as
offer
(a~d
study the theoretical properties of) several modifications
65
which attempt to control the probability of nonclassification for his
rule.
Before proceeding, however, we give a definition and theorem
(with several versions), which will be used extensively in the following chapters.
x
For any random sample
with distribution function
F
X1'.'.'X
from a random variable
n
the empirical distribution function
is the step function defined at each point x by
NF(x)
= number
of sample observations having values X. < x.
J.
(With multidimensional X, "<,, denotes inequality in all coordinates.)
Any asymptotic result of the form
II
0
(2.1)
is called a Glivenko-Cantelli uniform convergence of sample measures.
The classic theorem (Chung, 1974) without restriction on the distribution function F, asserts this convergence in the univariate case for
the supremum overall half-lines S
x
= (-~,x).
Generalization to all
Euclidean half-spaces S in any finite dimension is due to Wolfowitz
(1954, 1960).
A perhaps more familiar version of these two theorems
is
The Glivenko-Cantelli Theorem (GeT) -- Version I:
IF -
sup
_00
< x <
F!
a.s.
~
0
00
where x is p-dimensional.
Another version of the Glivenko-Cantelli TheQrem, which we shall
66
call the GCT--Version II, is given by Rao's Theorem 7.1 (Rao, 1962),
which asserts convergence of (2.1) over all Lebesgue-measurable
convex sets if the distribution function F is absolutely continuous
with respect to Lebesgue measure.
CHAPTER III
DETAILS OF THE
u~IVARIATE
CASE
We consider the univariate case (p=l) first, and attempt to get
the distribution of Kendall's criterion into a convenient probabilistic
form.
Upon examining the eight basic configurations on pages 53 and 54, one
sees that the criterion can be expressed in terms of a set of bivariate
•
cutpoints (C ,C ) for Configurations 1,2,3,4,7 and 8, or a single cutpoint
1 2
for Configurations 5 and 6.
We adopt the convention that for Configura-
tions 5 and 6 we consider the single cutpoint C as bivariate, i.e., (C,C),
so that one can speak of "the (bivariate) random variable" which represents
Kendall's criterion.
Since we are excluding the possibility of ties (they
occur with probability zero), if we have for our criterion that C =C
l 2
then we have unambiguously that Configuration 5 or 6 has occurred.
To restate the problem, then, we would like to obtain an expression
•
.
for the bivariate distribution function of the cutpoints f=(C ,C ) on
1 2
which Kendall's procedure is based in the univariate case.
Since the
eight configuraticr.s consist of disjoint sets whose total probability is
unity, and thus partition the sample space
n,
we prcceed as follows:
68
8
= p ~u C1~c1 'C2~c2'
Configuration i occurs)
1.=1
8
= .~
P
~1~c1,c~c2'
Configuration i occurs)
1.=1
(3.1)
Of course, assuming continuous underlying distribution functions for X
and Y, the strict inequalities ("<") above can be changed to non-strict
("~")
without altering any of the probabilities.
Now
= p (Y (1)~c1'X (n)~c2'X (1)~'f (1) , Y (l)~X (n)~Y (n» ,
since by the definition of Kendall's procedure for Configuration 1 we
have that
i.e., the above two conditions are identical by the definition of our
•
rule for Configuration 1.
(It should be mentioned that it is at this
point that our quibbling over the inconsistency in the way Kendall defines
.e
69
his procedure makes a difference.)
Similarly
P(Cl~c1,C~c2'X(n)~Y(1» = prX(n)+Y(1)~
X
<y
]
t 2 l' X(n)+Y(l)<c
2
2' (n)- (1)
.=;
..
)
<Y]
_ p[x (n) +Y (1) < . (
2
_~~n c 1 ,c 2 'X(n)- (1)
and arguing analogously, i.e., by looking at the other different configurations, we have
= p (Y (l)~c1'X (n)~c2'X (l)~Y (1)
, Y (1).::..x (n) <Y (n) )
+p (X (l)~c1' Y (n)~c2' Y (1)~ (1) ,X (l)~Y (n).::..x (n) )
which we may write as
6
E p.
i=l ~
An important point to note is that Kendall's criterion does not use
all the information in the training samples, since its distribution is a
functon of X(l)' Y(l)' X(n) and Y(n) only.
From any standard text on
probability theory, e.g., Lindgren (1976), we have that if f
for
TI
' and f y for
X
TI
y
'
x
is a density
then a density for the joint distribution of
70
f
X(1) ,X(n)
(x ,x ) • g
(y ,y )
1 n
Y(1)'Y (n) 1 n
for xl~n'Y1.:s..Yn
o
(3.2)
.
otherwise,
where
.
f
X(l)'X(n)
=
(x ,x )
1
n
o
otherwise,
is a density for the joint distribution of (X(l) 'X(n»' and
g
a
(y ,y ) ..
y (1) , Y (n) 1 n
.,
o
otherwise,
is a density for the joint distribution of (Y(l) 'Y(n».
(Note that h=f·g
since the two samples are drawn independently of each other.)
Using some calculus and probability theory, we have that
p
p
1
2
:I
c
f _001
c
1
.. r_ClD
J
c
p
3
:I
1
f _ClD
Y1
f _00
Xl
f_
ClD
Y1
f _ClD
c
f 2
Yl
00
f
c
x
h(x 1 ,x n 'Y1'Yn )dyn dx n dx 1dY1
,
(3.3)
h(x1,xn'Y1'Yn)dxndYndY1dx1
,
(3.4)
h(X1,xn'Y1'Yn)dxndYndx1dY1
,
(3.5)
ClD
2
f
Xl
f
c
ClD
2
f
Y1
n
f
Yn
Yn
,e
71
I:
(3.6)
h(Xl,xn'Yl'Yn)dYndxndYldx1
n
[2min(c ,c.,)-x ]
1
n hx
y
(x ,y )dy dx
1 n
xn
(n)' (1) n 1
1
where h x
y
• (n)' (1)
(x 'Y ) = f
X (n)
n 1
(xn)·gy
(1)
(3.7)
(y ), with f
(x)
1
x (n) n
and
(3.8)
where h
y(n) ,x(l) is defined analogously to the h in (3.7) above.
Computation of the exact distribution of Kendall's criterion is a
tedious task even for underlying population distributions of the simplest
nature.
For example, assume that the two underlying population distributions
are Uniform (0,1) and Uniform (0,1+0) respectively, with 0<0<1.
n
(
~
-1) (1+0) i (-1)
Then
n -i-
y
(-1) Il X +n y - i - j - 3
[c_~-:-x_+_n_y_-_i_-_J_'-_1
~
j
0_n_X_+_n_y_-_i_-__-
(n x -i-1) (nx+ny-i-j-l)
J+
72
nx+ny-i-j
n -j
y
c nX-i
1
nx-l ny
_ nEE
xi=l j=l
n y -2
E
i=l
(n -lXl ~
x
i
n +n -i-j-l
y (-1) X y
n -j
o
y
j
~~-lnxny(ny_1)(n~-2\(n~-)_1)nx+"y-i-J-302i+J+1
)=1
J.) )
+
i+j+l
n y -2
E
i=1
n y -2
i:
i=l
+
-X -J
1
y
2
fl -2 n,t (
r
y
n n (n -1) r.
XY Y
. 1 . 1"
~=
J=
1
x
73
n +n -i-j-3 (
n +n
n +n )
-1) X y
c X y -0 X Y
J
_~1~
~
(i+j+1) (nx-j) (nx+n y >
+
n -j
ey
n y -j
We now derive some expressions for the unconditional probability of
misclassification for Kendall's procedure in the univariate case.
probability of misclassification is
The total
74
Consider p(misclassificationln )'
x
We have, again using our eight basic
configurations, p(misclassificationl~x) =
P (X (l)~Y (1) , Y(1).9C (n)~Y (n) 'Xn+l>X (n) )
[Configuration 1)
+P (Y (l).s..x (1) ,X (1).s.Y (n)~: (n) 'Xn+l <X (1»
[Configuration 2)
[Configuration 4)
(3.9)
[Configuration 5)
[Configuration 6)
[Configuration 7)
"
+P ..U
(
J.. Y's <}''(1) ,n-J.Y
J.=1
S
>X (n)' Xn+1 <X (1) or Xn+1 >X (n) ) [C on f'J.gura t'J.on 8)
Note that if the sample data are in Configuration 3, then an observation
from
nx
cannot be
mi~classified,
so Configuration 3 does not enter into the
computation of the probability given by (3.9).From Configuration 7 we have
n
P ~~l i X's<Y (1) ,n-iX' s>Y (n) , Y (l).9Cn+l~Y (n»
=
n
1:/ (i X's<Y (1) ,n-ix's>Y (n)'Y (l)9(n+l~Y (n»
and for Configuration 8
n
P~U1 i Y's<X (1) ,n-i Y's>X (n) 'X n +1 <X (1) or Xn+l>X (n»
J.=
n
=
n
. L P (i Y's<X (1) ,n-i Y's>X (n) , Xn +1 <X (1) ) + . L1P (i Y' ~X (1) , n-i Y' s>X (n) , Xn +1 >X (n) )
J.-1
J.=
Also from (3.9) note that
+P (Y (1)9\ (1) ,X (n)~Y (n) ,X n +1 >X (n»
.
75
In theory, each of the above probabilities, whose sum is
p(misClassificationln
P (X (l).s.Y (1) , Y
fX
(1)
(Xl)
x)'
is calculable.
(1)~ (n)~Y (n) 'X n+1>X (n»
·1/2f~(x/2)dxdy
,.
n
dX
For instance,
1-1x h:f: f:
1
n
h (xl ,xn'y1 ,yn)· (3.10)
n
1
(3.11)
where gy (n) (Yn) and f X(l) (xl) are densities for Y(n) and X(l) respectively.
Obviously the above computations are exceedingly tedious even for
the simplest underlying pcpulation distributions. It is fortunate that the
asymptotic distribution of Kendall's criterion as well as the asymptotic
probability of misclassification can be derived rather simply in the general
case.
Before doing this, though, we give a definition and some examples.
We have defined, in terms of our training samples, what overlap means
in the context of Kendall's procedure.
It will be convenient to have an
analogous definition in terms of the underlying population distributions.
Let F
x and Fy be the distribution functions for nx and ny respectively
(3.17)
(3.18)
76
(3.19)
(3.20 )
Note that if F
no overlap.
x
and Fy are continuous functions it is easy to prove that
Otherwise we define the set on which the two populations
overlap to be the closed interval
(3.21 )
We adopt the convention of stating "a
is from Fy " if a 1 "'"
.11·
(or Fy(Y»
An
anal~gous
1
is from F " if a =X and that "a
1 1
X
1
convention applies for b •
1
> 0 for all x(y) then we define xl (or Y1)=-~' and if Fx(X) (Fy(Y»< 1
Examples:
(1)
Assume that the underlying population distributions are Uniform
(0,1) and Uniform (6,1+6), 0<8<1, respectively.
Then the area
of overlap of the two populations is [8,1]
(2)
For the case of two underlying exponential distributions with
with parameters A and A , respectively, the area of overlap
1
2
(3)
For any two normal distributions the area of overlap is
[-~,~].
Why we claim that the above definition of overlap is the population
analogue of Kendall
~
training sample definition should become apparent as
.e
77
we proceed to derive some more of the theoretical properties of Kendall's
rule.
There are other definitions of population overlap which are possibly
more reasonable than our extension of Kendall's sample-based definition.
The problem with these other definitions is that they assume more information about the underlying population distributions than is usually known.
Suppose that the graph of the two population distribution functions looked
as follows:
(3.22)
e
a
Note that F
x
l
t
t
1
b
2
1
is constant for all tl~~t2' i.e., a density for F is identix
cally zero for
t1~~t2'
By our extension of Kendall's sample definition
of overlap, the interval [t ,t ) would be part of the area of overlap for
1 2
the two populations; the entire area of overlap by our definition would
be [a ,b ).
1 l
Some might contend that any good definition of overlap should,
in this case, exclude the interval [t ,t )
l 2
The reason that our definition
has the crudeness illustrated by the above example is that the· concept
behind the definition was taken from the analogous definition for the
training samples, and from the training samples one can derive only incomplete information about F
Fy have known densities f
x
x
and Fy '
At any rate, assuming that F
X
and
and f y ' and letting AX and Ay be their respective
78
supports, one reasonable definition of the population overlap area would
be the set Af'Ay •
Since the densities f
and f y are not unique, the area
x
of overlap would not be unique, although it would be unique except for
sets of probability zero.
Of course, if f
x
and f y were known, there
would be no need to consider a nonparametric method, since in this case
there would exist an optimal parametric discriminant analysis rule.
Though the distribution function for Kendall's criterion turns
out to be complicated in terms of computation, the asymptotic distribution of the criterion turns out to have a simple form.
Theorem 3.1
Let [al,b ) be the area of population overlap for the two underlying
l
population distributions.
for
TI
x
and
TI
y'
Let F
X
and Fy be the distribution functions
and assume that each of these distribution functions is a
continuous function.
Lastly, let
~=(Cl,C2)
be Kendall's criterion.
Then
(3.23)
Corollary:
Proof:
~
n
Dist
~
(a ,b ), a constant random variable.
1 1
Assume that a
l
is from Fy and b
we take care of some preliminaries.
l
is from F ' as per (3.22).
x
First
The term a.s. (almost surely) refers
to the fact that the above convergence holds for all wen except possibly
for a set N with P(N)=O.
The basic underlying probability measure P is
defined by the joint distribution function of the bivariate random variable
(X,y).
Since X and Yare independent, this distribution function is given
79
by FX(X).Fy(y).
It is with respect to this underlying probability measure
Px,y that the above almost sure convergence is asserted.
will be to show that there exists a set
for all wcG
C (w)
I
P
(F) = I
Y.,y
and for all
~
two sets each having
G for which
Our strategy
P
(G)
x,y
aI' and that there also exists a set
w~F
C (w)
2
+
b •
l
F
=I
and
for which
Then, since the intersection of
probability one must have probability one also, we
have P x,y (G(\F) = 1 and for all we:G(\F
~
(w)
=
-~
(C (w) ,C (w»
2
1
(a ,b ) •
1 1
(Implicit in this approach is the use of the fact from analysis that, for
any sequence
We shall require the
un~variate
Glivenko-Cante1li Theorem (GCT): see Chapter II.
Let
A
= {w
such that the univariate GCT holds for F y }
and
B = {w such that the univariate GCT holds for F }
X
and for any positive integer
Dn={W:
y(l) (w)<a
1
n
let
for a random sample of size n from n
Then
= I-P(YI (w)~a1'···'Yn(w)~al)
= 1co
n
TI P(Y. ~ a )= 1 - 1 = 0 •
1
i=l
J.
Now let D= UD n ; then P,(D)<
L P (D )=0
,y n
n=l
~
n=l
x} •
80
Note:
For wr.O, C (w) does not necessarily converge to a so we must
1
1
exclude this set from the set for which we assert convergence of C •
1
.
c
See belo\ol. Let G= (A(lO )(l B. We want to show that G has probability one.
with respect to our underlying probability measure
Px,y , since we assert
.
that, for all wEG, C (w)
1
P
x,y
(G)
=1
P
[ (A (l 0 c)
x,y
~
a •
1
(Lemma. )
n B) =P y (A(l 0c) . Px (3)
since the sets A and 0 are determined
by the random variable Y only, B is determined by X only, and X and Yare
independent random variables.
Since both A and OC
By the GCT P (B)=1.
x
have probability one with respect to the measure P , and since the intery
section of two sets, each with probability one, must also have probability
c
one, we have that P (Af10 ) =1. So we have
y
P
[(A(loc)nB]
x,y
= PY (AnOc).p x (B) = 1·1
All that remains to be shown is that for all
~>O
= l.
w~G
we have C
l
and let w belong to G; then there exists a constant d
the definition of a
have that Fy(d1»0.
1
and our initial assumption that a
1
1
(w)~a1.
Let
such that d
is from Fy
1
> a
1
we
Fy (a )=0 from the definition of a and the fact that
1
1
~
the set for which the GCT holds for Fy ' where Fy is the empirical distribution function for Y.
For this to be true, and by the definition of the
,e
81
empirical distribution functicn, we must have Y(l) (w)<d , for all n>N y •
l
Since W€oc it cannot be that y(l) (w)<a •
l
Since by assumption a
is from
l
Fy it must be that P(X~dl»O and so there exists an N such that, for all
X
~
n>N x ' FX(dl»O, since w€B, the set for which the GCT holds for F.
X
This
implies, by the definition of the empirical distribution function, that
X(l) (w)<d
1
for all n>N •
x
Let N=max(Nx,N y ) and let n>N.
Then
The same type of argument shows that C2a~s·bl' which completes
the proof.
Note that the assumption that a
l
is from Fy and b
l
is from
F is in no way restrictive and the other three cases pertaining to the
X
origin of a
l
and b
l
are handled similarly.
Lastly, the corollary to
Theorem 3.1 follows since for any random variable convergence almost
surely to a constant implies convergence in distribution to that constant
(random variable).
We assumed for Theorem 3.1 that there was an area of overlap [al,b )
l
for the two underlying populations.
assumption is not satisfied?
What about
th~
case where this
The appearance of Configuration 5, 6, 7
or 8 would lead one to doubt the validity of this assumption.
Let us
assume, then, that there is no area of overlap for the two underlying
popUlation distributions and that one distribution function lies entirely
to the left of the other.
An example of this might look as follows.
82
(3.24)
Assume, as in the diagram, that the distribution function for F
entirely to the left of that for Fy
'
X
lies
and let
(3.25)
and
(3.26)
Then by a proof paralleling that of Theorem 1 we have that
(3.27)
and then of course as a corollary that
(3.28)
An analogous result follows when Fy lies entirely to the left of FX.
expected result also follows for the case of max{xl,y )
l
~
min(x 'Y2)
2
The
but
Some interesting corollaries follow as a consequence of Theorem 3.1.
We can use it to construct a simple proof of the fact that:
The total conditional probability of misclassification
for Kendall's rule converges almost surely to zero.
Proof:
(3.29 )
Let [al,b ) be the population overlap area and assume a comes
l
1
.e
63
from Fy and b from F •
1
X
Then the underlying
dis~ribution
=uncticns might
look as follows.
(3.30)
Let A={w:
for which Theorem 3.1 is true}.
Then P(A)=l.
Let wcA.
Then
for nX,ny large enough the training sample configuration must look like
x (1)
X (n) (w)
(w)
(3. 31)
y (1) (w) •••••••••••• Yn (w)
That the training samples must have the above configuration for all
nX,ny>N where N is some finite integer follows from the GeT.
For the
above configuration the assignment rule for classifying a new observation,
Z, is.given by
assign Z to
TI
x iff Z<Y(l) (w)
assign Z to
TI y
withhold judgment iff Y(l)
So,
(w)~~(n)
p(miSc1assification!llx'w). P(X>X(n) (w»
and since
X(n)
(w)~bl
(3.32)
iff Z>X(n) (w)
(w)
= I-F x (X(nf(w»
by Theorem 3.1, and since F
X
,
is a continuous
function of x (by assumption), we have that FX(X (n) (w) )-+FX(b ).
I
p(misclassification\ll ,w)~l-F (b )=0.
X
and so
X
1
Thus
Similarly p(miSclassificationllly'w)~O
84
This completes the proof.
As in the case of Theorem 3.1, the assumption that a
and b
l
from F
X
is from F
y.
is in no way restrictive, and the proofs for the other cases
pertaining to the origin of a
above.
1
1
and b
1
are handled similarly to the proof
Also, it is easy to see from the remarks made after the proof of
Theorem 3.1 that if the underlying population distributions do not overlap
then the above result still holds.
In this case we additionally have that
the probability of correct classification for Kendall's rule converges to
unity.
To summarize, in the univariate case, under the mild assumption of
continuous underlying distribution functions, we have proven that the
conditional probability of misclassification of Kendall's rule converges
to zero.
(Unless otherwise stated, "conditional" means conditional on
the purticular training samples drawn.)
This asymptotic property of
Kendall's rule is of course a most desirable property for a discriminant
rule to have, and it can occur only for a rule in the class of partial
discriminant analysis rules.
We mentioned above that Gessaman and Gessaman pointed out that for
. variables of infinite range
Ke~dall's
procedure tends to classify only
a small percentage of the observations presented for classification and
that this tendency can become more pronounced as the size of the training
,e
85
samples increases.
Before commenting further on the probability of
nonclassificaticn for Kendall's rule we want to prove
Theorem 3.2
Let [a l ,b1 ] be the area of overlap of the two underlying population
distributions and assume that F
X
and Fy are each continuous functions.
Then we have for the total conditional probability of nonclassification
that
(3.34)
Proof:
Let w£A={w:
for which Theorem 3.1 is true} and let
X(l) (w) ' ..• 'X(n) (w)
and y(l) (w) , ... ,y(n) (w) be our two ordered training samples.
Since
~(w)=(C1 (w),C
2
Then
(w»+(a ,b ) by Theorem 3.1, and since F is continuous
l l
X
in x, we have that
A similar result holds for p(nonclassificationlny,w), which proves the
theorem.
In the case of no overlap of the two underlying populations, a proof
analogous to that of Theorem 3.2 shows that the asymptotic probability of
nonclassification of Kendall's rule is zero.
Upon a little reflection the following Corollary follows from Theorem
3.2.
86
Corollary:
Let {qxP(nonclassificationl~x,~)+qyp(nonclassificationl~y'~)}be
the conditional probability of nonclassification for Kendall's rule for
training samples of size
~=(nx,ny)'
and let r be the asymptotic rate of
nonclassification for Kendall's rule.
Then for all nX,n y
qxP(nOnClassificationl~x,~)+qyp(nonClassificationl~y'~)a~s·r
(3.36)
In the case of no overlap of the underlying population distributions,
the above corollary implies that the conditional probability of nonclassification is zero almost surely whatever the size of the training samples.
Also in this instance Theorem 3.2 implies that the asymptotic rate of
correct classification is unity (a.s.), since for any partial discriminant
analysis rule
P(nonclassification)+P(misclassification)+P(correct classification) = 1. (3.37)
Since we have exhibited both the asymptotic probability of nonclassification
and misclassification, the asymptotic probability of correct classification
for Kendall's rule follows directly from (3.37).
We offer a final Theorem which should help to clarify some of the
disadvantages which were mentioned by Gessaman and Gessaman as being associated with the nonclassification rate of Kendall's rule.
Theorem 3.3
The asymptotic conditional probability of nonclassification for
,e
87
arc defined by (3.17), (3.18), (3.19) and (3.20).
Proof:
For F
(similar to the proofs of Theorems 3.1 and 3.2)
X
and Fy which admit densities, another version of Theorem 3.3 is
Theorem 3.3 (alternate version)
The asymptotic conditional probability of nonclassification of
Kendall's rule is unity (a.s.) if f
x
and f y have identical supports
(except for sets of probability zero).
Obviously there are underlying population variables of infinite
range for which the nonclassification probability of Kendall's rule would
be very low, and remain very low, no matter how large the size of the
training samples.
The reason that Gessaman and Gessaman's
Mo~te
Carlo
work tended to confirm the disadvantages they mentioned as being associated
with the nonclassification rate of Kendall's rule should be apparent:
for
any two normal populations, x =)'1 and x =Y2 (see Theorem 3.3), and hence
2
1
the probability of nonclassification, for the case they studied, converges
to unity almost surely.
It should be kept in mind, however, that one of
the main reasons for using a nonparametric discriminant analysis technique
in the first place is a belief on the experimenter's part that the underlying populations are nonnorrnal.
For the univariate case we feel that the following statement about
the nonclassification rate of Kendall's rule is more accurate than some
of those in the literature:
"There is a number r (O-.::.r.:5.,l), dependent on
88
the uncerlying population distributions, which is the asymptotic rate of
~onclassification of
Kendall's rule.
the actual (i.e., based on finite n
x
Under no circumstance (a.s.) can
and n y ) probability of nonclassi-
fication of Kendall's rule exceed r, and r=l if and only if the conditions
of Theorem 3.3 hold."
We conclude this chapter with an interesting observation.
a
l
be the asymptotic probability of nonclassification and a
2
Let
be the
asymptotic probability of correct classification for Kendall's rule, and
let d be any other rule with a nonclassification rate equal to ale
Then
for d the probability of correct classification cannot be greater than
the asymptotic probability of correct classification for Kendall's rule;
i. e. ,
Pd(correct classification)
~
(3.38) follows directly from (3.29) and (3.37).
a
(3.38)
2
Unfortunately, it is
possible for a rule d to exist for which Pd(nonclassification)<a ,
l
Pd(misclassification)=O and Pd(correct classification»a 2 , and the
following rule based on (3.22) provides an example of this:
Assign Z to n
x
iff Z < a
l
Assign Z to ny iff Z > b or t
l
l
Otherwise withhold judgment.
~
Z
~
t
2
(3.39)
CHAPTER IV
THE r-1ULTIVARIATE CASE
4.0
Introduction
In this chapter we study some of the properties of Kendall's
rule in a general multivariate setting.
Since the notation is simpler
for the bivariate case, and since proofs for the p-dimensional case
proceed along the same lines as those for the two-dimensional case,
results will be presented in bivariate form.
All the results presented
in this chapter, though, readily generalize to the p-dimensional case.
4.1
Redefining "Least Overlap"
We offer a
overlap.
~light
refinement for choosing the component with least
Considering the bivariate case, let [aI' b ] be the population
l
overlap area for the first component and [a , b2l that for the second.
2
It would seem reasonable to judge that there is less population overlap
for component 1 if and only if
(4.1)
where
P(Y2 E (a 2 ,b 2 )
and qx and qy are the prior probabilities
fo~
"X and "y
90
then it would seem reasonable to judge
that there is less population overlap for component 2, and if
that the amount of pOFulation overlap is tiec.
That is, if the values in (4.1) were known we would simply choose
first whichever component had the smallest total probability of population overlap.
a coin would be tossed,
per the convention adopted in Chapter II, to determine the component
X Y
Y
X
The values of PI' PI' P2 and P2 are unknown, but they
X
Y
X
Y
K
K
K
K
X
2 , "
2
1
1
and
can be estimated by
wnere K1 is the number of
n
ny
n
ny
X
X
selected first.
sample observations from
'IT X
in (a ,b ) for the first component and
1 1
y
Y
K , KX and K
2 are defined analogously.
2
1
Of course in practice (a ,b )
1 1
and (a ,b ) are unknown also, but since
2 2
(max (X I (l) , Y1 (1) ) , min(X1 (n) 'Y 1 (n»)
a.s.
-l-
(a ,b )
1 l
(4.2)
(a ,b )
2 2
(4.3)
and
(max (X 2 (1) , y 2 (1) ) , min (X 2 (n) , y 2 (n) ) )
a.s
-l-
it is not hard to see that if k~, k~, k~ and k~ are the sample counterY
X
Y
parts o f KX, Kl' K and K
2
2
1
from
then
'IT
i.e., k
X
1
is the number of sample observations
in (max (Xl (1) , Y1 (l) ) , min(X (n) , Yl (n») for the first component - 1
X
Y
Y
X
X
Y
Y
X
X
k
K
k
K
k
K
k
K1,
2
2
2
2
1
1
1
all converge
and
ny
n
ny
n
n
n y - °ny
n
x
X
X
X
!
I
almost surely to zero.
Thus it would
ap~ear
that a reasonable sample-based criterion for
"0
,e
91
selection of the component with least overlap is to select component 1 if
and only if
(4.4)
(It should be remembered that if the remaining training data for a component has Configuration 5,6,7 or 8 (see Chapter II) then it has no (zero)
sample overlap.)
I f the inequality given by (4.4) is reversed, component 2 is selected
k
first and i f qx
n
X
k
1
Y
k
1
X
k
2
Y
2
+ qy n = q X n + qy n
y
y
x
x
mine the first component selected.
a fair coin is tossed to deter-
The reasonableness of this sample-based
criterion is complemented by the easy-to-prove fact that it converges with
probability one to the population based criterion given above.
If nx=ny=n and qx=qy=
21
then our criterion is identical to Kendall's,
since
1
k~
2n
1
k~
1
+- <2n-2
(4.5)
That is, in this instance, we choose that component with fewest
observations in its overlap area.
It does not take much thought to see
that is was this special case of qx=qy=
%'
nx=ny=n for which Kendall's
criterion (of selecting the component with least overlap) was designed.
For instance if qx=.9999 and Sy=.OOOl it would not be reasonable to let
the observations from each population contribute equally in determining
the component with least overlap.
Of course the above statement holds to
92
a greater or lesser degree
w~enever qx~.
Lastly, when n 'jOn we divide
X y
X
k l by fiX' etc. since it obviously makes more sense to measure amount of
overlap in terms of estimated probability rather than in terms of absolute
n~~ers
of observations in the overlap area.
4.2
The Asvrnptotic Behavior of the Cutpoints
Let [al,b ] and [a ,b ] be the population overlap areas for component
2 2
l
1 and component 2, respectively.
Consider the random variables
overlap area for their underlying distributions.
Similarly, let [a
2l
,b
2l
]
Using the above notation, we want to derive the asymptotic distribution of Kendall's criterion in the bivariate case.
For the moment
consider the simplified version of Kendall's rule which just uses components in the order they are presented (i.e., it uses the first component
be Kendall's criterion (the cutpoints for the two steps of the rule in the
bivariate case.)
Theorem 4.1
Assuming the continuity of the underlying distribution functions
involved, and using the above simplified version of Kendall's rule, we
have that
·e
93
and consequently
(4.6)
Proof:
(Straightforward extension of Theorem 3.1, using the multivariate version
of the GCT in the same fashion that we used the univariate GCT before.)
Now consider the standard version of Kendall's rule (without the
above simplification.)
Theorem 4.2
Assume that all the bivariate distribution functions involved are
i.e., that the total
probability of population
that for the second.
overl~p
for the first component is less than
Let Kendall's criterion (the cutpoints) be denoted
[Note that it is not necessarily true that
Corollary:
Proof:
-2
There exist a
>
a
2
and b
2
<
b
2
such that
(4.7)
Let AI'
A2 , A3 , and A4 denote the sets of all w such that the GCT holds
for Xl' Y , X ' and Y , respectively.
I
2
2
Also, let AS denote the set of all
94
w such that the strong law of large numbers holds for estimating the
probability of the event Xl t
[al,b ]; similarly define A in relation
l
6
Since each of the events AI, •.. ,A
S
has probability one, we have that for
S
B =
n
Let w be a member of B; then there exists HI such that
A., P (B) = 1.
i=l
J.
for all n
X
and n y
>
N
I
(4.8)
where P(X
2
[a 2 ,b 2 ]) =
t
(# of sample observations for X which lie in
2
[a 2 ,b 2 ]) Inx ' etc., since the expression on the left of the inequality sign
there exists N2 such that for all n x and n y > N2 max(x 2 (1)'Y2(1) (w»
and min(X (n) (w) 'Y2(n) (w»
2
a
2
<
a2
and
b
2
< b
2
>
< a2
b2 • This follows from the GeT and since
and from the definition of a 2 and b •
2
Let
Then
and
X
k
I
q Xn
+
X
and thus from (4.8)
,e
95
To this point we have shown that there exists N such that for all
n
x
and ny
>
N Kendall's criterion selects component 1 first.
That is,
wi.th probability one, once a large enough sample size is reached, Kendall's
rule will always select component 1 first.
From this point the proof of
this theorem follows that of Theorem 4.1 above.
This completes the proof
of Theorem 4.2.
For instance, for the bivariate
normal distribution we have that qxp~ + qyP~
=1 =
This is
because
(4.9)
and
(4.10)
since each of these intervals is
(_w,~].
It is not hard to see that,
whenever the two variables have identical first-stage overlap areas and
identical second-stage conditional overlap areas also, then the criterion
to the multivariate case and consequently for underlying normal distributions, Kendall's criterion converges almost surely to [_00,00, ••• ,_00,00].
(a
12
,b
12
]
~
(a
21
,b
nor (4.10) holds.
21
]?
For simplicity let us assume that neither (4.9)
In this instance it appears that the criterion might
jump back and forth between selection of component 1 first and selection
96
of component 2 first.
Consequently, as
~
=
(nx,ny)~ ~
the criterion would
be getting cleser and closer to, but jumping back and forth between,
converge with probability one to either random variable.
One can still get a grip on the asymptotic distribution of Kendall's
criterion by proceeding in the following fashion.
the criterion in the bivariate case.
For simplicity, assume
Then letting 5 .. ,i=1,2,j=1,2 be the
J.]
event that the i th component is selected first and the jth second we have
Thus we have
But the random variable (Sl,S2) 1512] converges almost surely to
(al,b1,a12,b12); and since almost sure convergence implies convergence
The question remains as to the existence and value of lim P(5
n-+<x>
12
)
,e
97
By Kendall's procedure we select component 1 first if there are fewer
observations in its sample overlap area than there are in component 2's
sample overlap area, and if the numbers of observations in the overlap
areas are tied for the two components then Kendall's rule begins with
component 1 with probability
%and
with component 2 with probability;
Let #. be the number of observations in component i's sample overlap area.
J.
Then
= P(#l
<#2) + P(S12!#l=#2)P(#1=#2)
= P(#l <#2) +
1
2
P(#1=#2)·
Similarly,
Now #1 can be represented as the sum of n
denoted by
#~i
and
#~j
x
+ ny indicator variables
where
X
#li = 1 i f Xli e: [max (Xl (1) , Y1 (1) ,min (Xl (n) , y 1 (n)], i=l, ... ,nx
(4.9)
y
and #lj = 1 i f Y1j e: [max(X l (1)'Yl(l),min(X 1 (n)'Yl(n»), j=l, •.. ,ny
Similarly, #2 can be expressed as the sum of n
x
•
+ n y indicator variables.
For the moment let us assume that the underlying population overlap
points
a~e
ove~lap
known and that the nunU)ers of observations in the population
areas are counted to determine which variable is selected first.
In this instance we would define
o
and #1 =
o
otherwise
~ ~X
~ w
~
l-
J
li +
for component 1.
'l
#lj where [a
l
,b
) is the population overlap area
l
X
Then #li' i=1, ... ,n
otherwise
'l
x
are i.i.d., as are #lj' j=1, •.. ,ny;
and using the analogous definition for #2 and the central limit theorem
we have that
_ N(O,l) asymptotically
where Cn = Jvar (#1-#2) •
Thus
lim P(#,>#2) = lim P(#,-#2 > 0)
where
~
(4.10)
P~#~:#2 > 0]= 1-~(0) = 1
= lim
is the standard normal distribution function.
(4.11)
Similarly
(4.12)
and
(4.13)
The above results are valid when we define #1 and #2 by means of the
population overlap points rather than the sample overlap points.
But due
to the fact that the sample overlap points converge with probability one
to the population overlap points it is also true that (4.10), (4.11), (4.12)
and (4.13) hold for #1 and #2 defined in terms of the sample overlap points
as per (4.9).
Thus we have shown that
lim FC C (f1'~2) = lim P(~1~q1'~2~~2IS12) lim PIS 12 ) + lim P(fl~~1'~2~~2iS21)'
~1'~2
lim P(S2l) = lim
P(~1~:1'~2~:2IS12)' ~
+ lim
P(~1~~1'~2~~2IS21)'~
(4.14)
.e
99
and consequently that

    lim F_{C_1,C_2}(xi_1, xi_2) =  1    for (xi_1, xi_2) >= (a_1, b_1, a_12, b_12) and (xi_1, xi_2) >= (a_2, b_2, a_21, b_21),
                                   1/2  for (xi_1, xi_2) for which one and only one of the above inequalities holds,        (4.15)
                                   0    otherwise.

4.3  The Asymptotic Probabilities of Misclassification and Nonclassification
As in the univariate case, the conditional probability of misclassification of Kendall's rule converges almost surely to zero in the multivariate case. This result relies only on the continuity of the underlying population distribution functions.

Theorem 4.3
For Kendall's procedure the conditional probability of misclassification converges to zero almost surely (given only that all distribution functions involved are continuous).
Proof:
Let A be the set of all w for which the GCT holds for all bivariate and marginal distribution functions involved; then P(A) = 1. And let w be in A. Suppose first that q_X p_1^X + q_Y p_1^Y < q_X p_2^X + q_Y p_2^Y, so that for large enough n component 1 is selected first; and suppose, for definiteness, that a_1 comes from X_1 (in the same sense described on page 3.7), that b_1 comes from Y_1, and that a_12 comes from X_2.

But P(X_1 <= C_1(w)) -> 0, since C_1(w) -> a_1 and F_{X_1}(a_1) = 0 by the definition of a_1; and P(C_1(w) <= X_1 <= C_2(w), X_2 <= C_3(w)) -> 0, since C_1(w) -> a_1, C_2(w) -> b_1 and C_3(w) -> a_12. Thus P(misclassification | pi_X, w) -> 0. A similar argument shows that P(misclassification | pi_Y, w) -> 0; thus we have that q_X P(misclassification | pi_X) + q_Y P(misclassification | pi_Y), the total probability of misclassification, converges to zero almost surely. All other cases of origin of a_1, b_1, a_12 and b_12 are handled similarly, as is the case where q_X p_1^X + q_Y p_1^Y > q_X p_2^X + q_Y p_2^Y.

Lastly we need to consider the case where q_X p_1^X + q_Y p_1^Y = q_X p_2^X + q_Y p_2^Y.
In this case, no matter how large the sample size, the rule may jump back and forth between starting with the first component and starting with the second component. We have

    P(misclassification | pi_X) = P(misclassification and S_12 | pi_X) + P(misclassification and S_21 | pi_X).        (4.16)

But

    P(misclassification and S_12 | pi_X) = P(misclassification | S_12 and pi_X) P(S_12 | pi_X),        (4.17)

since for arbitrary sets A, B and C,

    P(A, B | C) = P(A | B, C) P(B, C) / P(C).
Consider P(misclassification | pi_X and S_12). Given that component 1 is selected first, the criterion converges almost surely to (a_1, b_1, a_12, b_12). Thus, from an argument identical to that for the case where q_X p_1^X + q_Y p_1^Y < q_X p_2^X + q_Y p_2^Y, we have that P(misclassification | pi_X and S_12) -> 0 a.s., and hence by (4.17) that P(misclassification and S_12 | pi_X) -> 0 a.s. A similar argument shows that P(misclassification and S_21 | pi_X) -> 0 a.s., and thus from (4.16) we have that P(misclassification | pi_X) -> 0 a.s. A similar argument shows that P(misclassification | pi_Y) -> 0 a.s., and thus in this case too the total probability of misclassification converges to zero almost surely. This exhausts all the cases and hence completes the proof of Theorem 4.3.
In the above proof we have tacitly assumed that there was population overlap for the respective relevant univariate distributions. The above result obviously still holds when this is not the case.
To have an asymptotic conditional probability of misclassification
equal to zero is a very desirable property for a partial discriminant
analysis rule to have and it should be noted that it is by no means true
that all partial discriminant analysis rules possess this characteristic.
To this point we have proven multivariate analogues of Theorem 3.1 and Equation 3.29. Multivariate versions can also be given for most of the other results given in Chapter III (provided the amounts of population overlap are not tied, since otherwise the nonuniqueness of the asymptotic multivariate rule presents difficulties). Then multivariate analogues of Theorem 3.2 and the Corollary given by (3.36) are not hard to prove. The result given by (3.38) also holds in the multivariate case, but for Theorem 3.3 the implication goes only from right to left in the multivariate case.
4.4  The Asymptotic Optimality of Kendall's Rule

Let f_X and f_Y be bivariate densities (with A_X and A_Y their respective supports) for pi_X and pi_Y, and consider the following properties:

    A_X n A_Y is a rectangle (except for sets which have probability zero under both distribution functions) having sides parallel to the Cartesian coordinate axes.        (4.18)

    Let S1 (Side 1) be the leftmost boundary (extended to +-oo) of A_X n A_Y (see Figure 4.1). If there exists a set C_1 to the left of S1 such that the integral of f_X over C_1 is > 0, then there does not exist a set C_2 to the left of S1 such that the integral of f_Y over C_2 is > 0. Let similar statements hold for the other three extended boundaries of A_X n A_Y.        (4.19)
Figure 4.1: The rectangle A_X n A_Y with its extended boundaries S1, S2, S3 and S4.
Theorem 4.4
Let D_0 be the class of all decision rules with a zero probability of misclassification, and for all discriminant rules d in D_0 let n(d) = P(nonclassification using rule d). If (4.18) and (4.19) hold and q_X p_1^X + q_Y p_1^Y < q_X p_2^X + q_Y p_2^Y, then Kendall's rule converges almost surely to a rule d* with

    n(d*) = min over d in D_0 of n(d).        (4.20)
Proof:
Let d_{A_X n A_Y} denote a rule which withholds judgment for all observations belonging to A_X n A_Y and which correctly classifies all observations belonging to (A_X n A_Y)^c. Then it is obvious that d_{A_X n A_Y} is in D_0 and

    n(d_{A_X n A_Y}) = min over d in D_0 of n(d).        (4.21)

In this case we know Kendall's rule converges almost surely to a unique rule d*, and that d* is in D_0. Let K be the withhold-judgment region for d*. Then
Lemma: Let c be the first coordinate of the point which is the lower lefthand corner of A_X n A_Y. Then c <= a_1.

If c > a_1, then there exists e such that a_1 < e < c and F_{X_1}(e) > 0 and F_{Y_1}(e) > 0 (by the definition of a_1). But

    F_{X_1}(e) = integral of f_X over {x_1 <= e}        (4.22)

and

    F_{Y_1}(e) = integral of f_Y over {y_1 <= e},        (4.23)

and the half-plane {x_1 <= e} lies to the left of S1 since e < c. Then, since property (4.19) holds, (4.22) and (4.23) cannot both be positive; so we have a contradiction.
Thus c <= a_1. It is easy to prove analogous results for the other corners of A_X n A_Y; and since K and A_X n A_Y are both rectangles, we have proved that K is contained in A_X n A_Y.
Now

    P(c <= X_1 <= a_1, -oo < X_2 < oo) = F_{X_1}(a_1) - F_{X_1}(c)        (4.24)

and

    P(c <= Y_1 <= a_1, -oo < Y_2 < oo) = F_{Y_1}(a_1) - F_{Y_1}(c).        (4.25)

Either (4.24) or (4.25) is equal to zero by the definition of a_1, and thus, since (4.18) holds, both (4.24) and (4.25) must be zero. Analogous results hold for the other three corners of A_X n A_Y, and therefore we have shown that

    P(X in (A_X n A_Y) - K) = 0        (4.26)

and

    P(Y in (A_X n A_Y) - K) = 0.        (4.27)

Since A_X n A_Y = K union [(A_X n A_Y) - K], it follows that

    P(X in A_X n A_Y) = P(X in K)        (4.28)

and

    P(Y in A_X n A_Y) = P(Y in K),        (4.29)
or that the total probability of nonclassification for Kendall's asymptotic rule is equal to the probability of nonclassification for that rule which correctly classifies all observations falling in (A_X n A_Y)^c and withholds judgment for those falling in A_X n A_Y. That is,

    n(d*) = n(d_{A_X n A_Y}) = min over d in D_0 of n(d),

where the last equality follows from (4.21).
The proof for the case of q_X p_1^X + q_Y p_1^Y > q_X p_2^X + q_Y p_2^Y is identical to the above proof, and thus this completes the proof of Theorem 4.4.

Note: It is obvious that any rule belonging to D_0 which minimizes n(d) also maximizes the probability of correct classification over the set of rules belonging to D_0.
It is easy to generalize Theorem 4.4 to the case of q_X p_1^X + q_Y p_1^Y = q_X p_2^X + q_Y p_2^Y, since when (4.18) and (4.19) hold the asymptotic withhold-judgment region for Kendall's rule is unique.

Theorem 4.4 (generalized)
Let d_n be Kendall's sample-based rule. If (4.18) and (4.19) hold, then

    n(d_n) -> min over d in D_0 of n(d)  almost surely.

It should be noted, though, that d_n is not in general a member of D_0 for finite n.
If the amount of population overlap is not tied, consider the following weakened version of property (4.19) above:

    Let S1 and S2 be the sides determined by the variable with least population overlap, and S3 and S4 those determined by the variable with most population overlap. If there exists a set C_1 to the left of S1 such that the integral of f_X over C_1 is > 0, then there does not exist a set C_2 to the left of S1 such that the integral of f_Y over C_2 is > 0. Let a similar statement hold for S2. If there exists a set C_3 between S1 and S2 and above S3 (see Figure 4.1) such that the integral of f_X over C_3 is > 0, then there does not exist a set C_4 between S1 and S2 and above S3 such that the integral of f_Y over C_4 is > 0. Let a similar statement hold for S4.

Then it is not difficult to prove

Theorem 4.5
If the amount of population overlap is not tied, then (4.18) and the weakened version of (4.19) given above are necessary and sufficient conditions for

    n(d*) = min over d in D_0 of n(d).
A univariate analogue of Theorems 4.4 and 4.5 is:

Theorem 4.6
Consider the univariate case and let F_X and F_Y be univariate continuous distribution functions for pi_X and pi_Y. Then n(d*) = min over d in D_0 of n(d) if and only if F_X is not equal to a constant c_1 (0 < c_1 < 1) over any interval on which F_Y is increasing, and F_Y does not equal a constant c_2 (0 < c_2 < 1) over any interval on which F_X is increasing.
The three preceding theorems assert that under certain conditions the conditional probability of nonclassification for Kendall's sample-based rule converges to that of a rule which is "best" among all rules with zero probability of misclassification. Thus, when these conditions hold, no other sample-based rule which converges to a rule with zero probability of misclassification can have a lower asymptotic probability of nonclassification (or higher asymptotic probability of correct classification) than Kendall's rule.

Lastly, if there is no population overlap for any one of the components under consideration in the above three theorems, then all three theorems are trivially true.
4.5  The Order of Component Selection

One might wonder whether it makes any difference asymptotically whether one follows Kendall's "greedy" rule of starting with the component having least overlap and proceeding to that having most overlap, as compared to proceeding in the opposite order. Obviously for finite sample sizes either method could lead to a rule with a lower conditional probability of nonclassification. Since the asymptotic probability of misclassification is zero for Kendall's rule (for any order of selection of the components), the question we ask is whether proceeding from the component with least overlap to that with most leads in general to a rule with a lower asymptotic rate of nonclassification (or equivalently a higher asymptotic rate of correct classification) than proceeding in the opposite direction.
It turns out that neither method of ordering the component selection is in general asymptotically better than the other. Consider the bivariate case with q_X = q_Y = 1/2, where the two underlying population distributions are uniform within two distinct but overlapping circles, each of area c. Figure 4.2 depicts the supports for each of the two population densities. We can consider Kendall's rule based on the underlying population distributions, since once the order of selection of components is fixed his sample-based rule converges almost surely to the population-based rule. The area of the region shaded horizontally in Figure 4.2 is proportional to the total probability of nonclassification when one proceeds from least to most population overlap, whereas the area of the region shaded vertically is proportional to the total probability of nonclassification when proceeding from most to least population overlap.

Figure 4.2: The supports of the two uniform densities (overlapping circles), with the withhold-judgment region for the least-to-most order shaded horizontally and that for the most-to-least order shaded vertically.
That is, let h_X be the horizontally shaded area inside the circle representing the support of f_X and h_Y be the horizontally shaded area inside the support of f_Y. Then the total probability of nonclassification when proceeding from least to most overlap is

    (1/2)(h_X / c) + (1/2)(h_Y / c).        (4.31)

Let v_X be the vertically shaded area inside the support of f_X and v_Y the vertically shaded area inside the support of f_Y. Then the total probability of nonclassification when proceeding from most to least overlap is

    (1/2)(v_X / c) + (1/2)(v_Y / c).        (4.32)

For this example, then, we have that (4.31) is less than (4.32), and therefore that proceeding from the component with least overlap to that with most yields a "better" asymptotic rule.
For an example where the converse is true, consider underlying population distributions uniform on overlapping squares as shown in Figure 4.3. As before, an area proportional to the total probability of nonclassification for the most-to-least method is shaded vertically, whereas that for the least-to-most is shaded horizontally. For this example, then, proceeding from the component with most population overlap to that with least yields a lower asymptotic probability of nonclassification.

Figure 4.3: The supports of the two uniform densities (overlapping squares), with the corresponding withhold-judgment regions shaded.
The above two examples illustrate that neither method of order of
component selection leads in general to a better asymptotic rule.
It
does not appear feasible to judge from the observed data which order of
component selection is optimal for the populations from which they arose.
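Although the optimal order cannot be judged from the observed data, the two asymptotic rules can be compared directly by simulation once the populations are specified. The Python sketch below is not part of the original text: it assumes uniform distributions on two hypothetical overlapping disks (in the spirit of Figure 4.2, with centres and radius chosen arbitrarily) and estimates the nonclassification rates (4.31) and (4.32) for the two fixed orders of component selection.

    import numpy as np

    rng = np.random.default_rng(1)

    def uniform_disk(center, radius, n):
        # sample n points uniformly inside a disk
        theta = rng.uniform(0, 2 * np.pi, n)
        r = radius * np.sqrt(rng.uniform(0, 1, n))
        return np.column_stack([center[0] + r * np.cos(theta),
                                center[1] + r * np.sin(theta)])

    def overlap_interval(x, y):
        # sample overlap area of one component: [max of mins, min of maxes]
        return max(x.min(), y.min()), min(x.max(), y.max())

    def nonclassification_rate(X, Y, Znew, order):
        # fraction of new observations withheld by the two-step rule when the
        # components are used in the given (fixed) order; assumes each step
        # leaves some training data, which holds for these large samples
        keep_x = np.ones(len(X), bool)
        keep_y = np.ones(len(Y), bool)
        withheld = np.ones(len(Znew), bool)
        for j in order:
            lo, hi = overlap_interval(X[keep_x, j], Y[keep_y, j])
            withheld &= (Znew[:, j] >= lo) & (Znew[:, j] <= hi)
            keep_x &= (X[:, j] >= lo) & (X[:, j] <= hi)
            keep_y &= (Y[:, j] >= lo) & (Y[:, j] <= hi)
        return withheld.mean()

    # hypothetical overlapping-disk populations; component 0 has the smaller overlap
    n = 20000
    X = uniform_disk((0.0, 0.0), 1.0, n)
    Y = uniform_disk((1.2, 0.6), 1.0, n)
    Znew = np.vstack([uniform_disk((0.0, 0.0), 1.0, n),
                      uniform_disk((1.2, 0.6), 1.0, n)])   # q_X = q_Y = 1/2

    print("least-to-most:", nonclassification_rate(X, Y, Znew, order=(0, 1)))
    print("most-to-least:", nonclassification_rate(X, Y, Znew, order=(1, 0)))

Replacing the disks with suitably overlapping squares, as in Figure 4.3, reverses the comparison, in line with the two examples above.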
Kendall (personal communication) believes that either method will be satisfactory in practice. One might ask, therefore, why not just select the components in the order that they are presented? That is, perhaps neither the least-to-most nor the most-to-least method of order of component selection is in general better than just using the components in the order that they are presented (or in any arbitrary order). If, indeed, this practice were adopted it would make the procedure simpler, as well as render the asymptotic and finite sample size properties of Kendall's rule more mathematically manageable.
CHAPTER V

EXTENSIONS AND A MODIFICATION OF KENDALL'S RULE

5.1  Extensions
i)  In Chapter II we described Richards' refinement of Kendall's rule. This refinement allows an already used component to be used again at a given step if it is the component with the least overlap among the (p-1) which were not used at the immediately preceding step. Richards modifies one of Kendall's data points to show that his refinement can differ from Kendall's original rule.

One might wonder why Richards disallows the immediately preceding component from consideration, since it seems logical to use the same component twice in a row if it has less overlap in both instances than the other p-1 components. Suppose the training data looked like the following two-dimensional plot (not likely in practice, but theoretically possible):

    [Two-dimensional scatter of x and y training observations for which the first component has the least overlap at both the first and the second step.]        (5.1)

Clearly one would want to use the first component twice and then stop. Note that the practical application of our refinement is no more complicated than Kendall's original rule.
Therefore, as one extension of Kendall's method we propose to use this refinement, which we shall denote by d_1, of Richards' refinement.

In general, d_1 just keeps selecting components until it runs out of data. Clearly, after at most n steps, d_1 will have classified all the data (unless at some stage it finds ties on all components such that there is total overlap on all of them -- an event of probability zero under continuity assumptions). It seems reasonable, however, to stop classifying before the number of observations in either training sample becomes too small, say less than r. Thus, if this would happen at Step K(r), we terminate d_1 at Step K(r)-1. Also, K(r) should probably be kept below some prespecified integer no matter how large the training samples are, in order to prevent the number of steps from becoming infinite as n -> oo. (This requirement ensures that d_1 has an asymptotic conditional probability of misclassification equal to zero.)
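A minimal Python sketch of d_1 is given below; the author's actual implementation is the SAS PROC MATRIX program in the Appendix. The sketch assumes equal priors, breaks ties by taking the first component rather than by a coin toss, and stops either when a step would leave fewer than r observations in a training sample or after a prespecified maximum number of steps.

    import numpy as np

    def overlap(x, y):
        # sample overlap area of one component and the numbers of
        # observations of each sample falling inside it
        lo, hi = max(x.min(), y.min()), min(x.max(), y.max())
        return lo, hi, int(np.sum((x >= lo) & (x <= hi))), int(np.sum((y >= lo) & (y <= hi)))

    def d1_cutpoints(X, Y, r=1, max_steps=10):
        # greedy refinement d_1: at every step all p components are eligible,
        # including the one just used
        steps = []
        for _ in range(max_steps):
            counts = []
            for j in range(X.shape[1]):
                lo, hi, nx, ny = overlap(X[:, j], Y[:, j])
                counts.append((nx + ny, j, lo, hi))
            _, j, lo, hi = min(counts)            # least sample overlap (ties: first component)
            keep_x = (X[:, j] >= lo) & (X[:, j] <= hi)
            keep_y = (Y[:, j] >= lo) & (Y[:, j] <= hi)
            if keep_x.sum() < r or keep_y.sum() < r:
                break                             # terminate at Step K(r) - 1
            steps.append((j, lo, hi))
            X, Y = X[keep_x], Y[keep_y]
        return steps, len(X) + len(Y)             # cutpoints used and number left unclassified

With r = 1 and max_steps set large enough, the sketch keeps classifying until the training data are almost exhausted, as described above.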
It is rather surprising that our refinement does not always perform at least as well as Kendall's original rule in terms of the number of training observations left unclassified. We illustrate this fact by means of the data in Table 5.1.

At Step 1, Kendall's rule and our refinement are always identical. In this case, they each select Component 1 (see Table 5.2), since it has the least overlap of the three components. The remaining data for those observations left unclassified after Step 1 are given in Table 5.3. Among the remaining two unused components, Component 2 has less overlap, so this component is selected next by Kendall's rule. After the second step when using Kendall's rule, the remaining data for Component 3 have no sample overlap (see Table 5.4). Therefore, Kendall's rule leaves none of the training observations unclassified for this data set.

Referring back to Table 5.3, if all three components are eligible for consideration after Step 1, then Component 1 has least overlap and is therefore used again at Step 2 for d_1. The data remaining after its use are given in Table 5.5. From Table 5.5, we see that Component 3 is selected next, and the data remaining after Step 3 for d_1 are given in Table 5.6. From Table 5.6, we see that Component 2 is selected next, and if d_1 terminates after Step 4, it leaves three observations unclassified. Therefore, for the data given in Table 5.1, Kendall's original rule performs better than our refinement in terms of the number of training sample observations left unclassified.
It is also not hard to concoct examples with sample data for which Kendall's original rule outperforms Richards' refinement. We mentioned above that we found these results somewhat surprising. It turns out that the matter rests on the order in which the variables are selected by d_1. Suppose that Kendall's rule selects the components in the following order:

    i_1, i_2, ..., i_p,        (5.2)

where i_k is the subscript, within the training vector, of the k-th component selected. Now suppose that our refinement uses all of the p original components at least once. Then for d_1 we have a sequence of components as above, but possibly with repeats, since our refined rule may use a given component any number of times. Deleting all the repeats from this sequence, we are left with the sequence

    j_1, j_2, ..., j_p.        (5.3)

If i_1 = j_1, i_2 = j_2, ..., i_p = j_p, it is easy to see that our refinement always performs at least as well as Kendall's rule in terms of the number of training observations not classified. In words, we require that if the j-th component is selected for the first time at a given step of our refined rule and it is the i-th different component selected by our rule, then it must be the i-th component that Kendall's original rule would select. If this requirement is met, then our refinement always performs at least as well as Kendall's rule in terms of the number of training observations left unclassified.
Of course, if we drop the requirement that our rule must terminate after the training sample size is reduced to a certain number, r, and if our data are continuous, then d_1 will classify all the training sample observations. (In the case of discrete distributions, it is conceivable that the rule would terminate, due to ties, before it classifies all the data.) But if we allow our rule to continue reusing components until no more data remain, we can no longer claim that it has an asymptotic probability of misclassification equal to zero. That is, in gaining something in terms of the percent of observations which would be classified, we lose something in terms of the misclassification rate.
ii)  We suggest another extension of Kendall's rule, which we shall denote as d_2. We showed earlier that neither proceeding from most to least overlap nor proceeding from least to most leads in general to a rule with a lower probability of nonclassification. For this extension, we propose to try all possible orderings of the components and use that ordering which leaves the smallest percentage of observations unclassified. If we allow each component to be used only once, this amounts to computing Kendall's rule for all p! possible orderings of the components. By Theorem 4.1, d_2 has an asymptotic probability of misclassification of zero, and it is obvious that the asymptotic probability of nonclassification for d_2 can be no greater than that of Kendall's rule when proceeding from least to most overlap, or vice versa, or any other order of component selection. Since Kendall's rule is so easily derived from the sample data, computation of the rule for all p! orderings should be practical unless p is large.
Comparing Kendall's rule to this extension is somewhat analogous to comparing a stepwise regression procedure to an all-possible-regressions procedure. In regression a "better fitting model" can be found by examining all possible regressions, but at the cost of more computation and computer time.
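For small p the all-orderings extension is straightforward to program. The following Python sketch is an illustration only (it is not the SAS program of the Appendix): it applies Kendall's rule once for every ordering of the components and keeps the ordering that leaves the fewest training observations unclassified.

    from itertools import permutations
    import numpy as np

    def unclassified_after(X, Y, order):
        # number of training observations left unclassified when Kendall's rule
        # uses the components once each, in the given order
        for j in order:
            if len(X) == 0 or len(Y) == 0:
                break
            lo = max(X[:, j].min(), Y[:, j].min())
            hi = min(X[:, j].max(), Y[:, j].max())
            X = X[(X[:, j] >= lo) & (X[:, j] <= hi)]
            Y = Y[(Y[:, j] >= lo) & (Y[:, j] <= hi)]
        return len(X) + len(Y)

    def d2_order(X, Y):
        # examine all p! orderings and return the best one
        p = X.shape[1]
        return min(permutations(range(p)),
                   key=lambda order: unclassified_after(X, Y, order))

For p = 4, as in the Iris example of Chapter VI, only 24 orderings have to be examined.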
5.2  A Modification

For underlying population variables which can overlap completely, Kendall's rule classifies a smaller and smaller percentage of observations as the size of the training samples becomes larger. For his rule to be of general use, something must be done about this disturbing tendency. It is the purpose of this section to offer a modification for dealing with this problem.

We have previously proven that the conditional probability of misclassification of Kendall's rule converges almost surely to zero. In order to gain some control over the probability of nonclassification, it seems apparent that we shall have to allow a certain amount of misclassification at the outset. In doing this, we have several options. For instance, we could attempt to control the probability of nonclassification at a certain prescribed level, and this in general would yield a rule with a positive asymptotic rate of misclassification. Conversely, we could allow misclassification up to a certain amount, say alpha, thus ensuring that the asymptotic probability of nonclassification would be no greater than (1 - alpha), and hopefully much less. Since misclassification of any given observation is a more serious error than withholding judgment, which technically is not an error at all, we choose to follow the latter course. That is, we offer a modification of Kendall's rule which controls misclassification at a prescribed level in order to prevent the possibility of having an asymptotic probability of nonclassification equal to unity.
For the moment, consider the univariate case. Within the class of sample-based rules which

    a) have one or two cutpoints, and
    b) misclassify a total of at most 100 alpha % of the training data,

we seek a rule which minimizes the total proportion of observations left unclassified. Let n_{X2} be the number of training observations not classified from pi_X and n_{Y2} the number not classified from pi_Y. Then the total proportion not classified is given by

    q_X (n_{X2} / n_X) + q_Y (n_{Y2} / n_Y),

and a similar definition holds for the total proportion misclassified. Such a rule always exists, though it is not necessarily unique. Consider the following data:

    X:  x  xxx  xxxx  xx
    Y:  yy  y  y      y  y  y  y  y  y        (5.4)
For the data given by (5.4), and letting alpha = .2, a rule which satisfies our requirements is

    assign Z to pi_X  iff  Z <= (X_(8) + Y_(3)) / 2,
    assign Z to pi_Y  iff  Z >  (X_(8) + Y_(3)) / 2.

This rule misclassifies 20% of the observations and leaves none (0%) unclassified.
This modification, which we shall denote by d_3, readily generalizes to the other possible data configurations, as well as to a multivariate setting. In order to do this, though, we need to develop some notation. Let i index observations within a training sample, j index components, k index steps of the rule, and X and Y index the two samples. Also, let K equal the maximum allowable number of steps, alpha the maximum allowable total proportion of training observations misclassified, and n_{Xk} and n_{Yk} the numbers of training observations not yet classified at the beginning of the k-th step (with n_{X1} = n_X, n_{Y1} = n_Y).

At the k-th step, consider choosing the j-th component with cutpoints L and U (L <= U). Let

    a_X = sum I(X_i < L),        (5.9)
    b_X = sum I(L <= X_i <= U),        (5.10)
    c_X = sum I(U < X_i),        (5.11)

where the summation is over the unclassified observations. (Note that (5.9), (5.10) and (5.11) depend on L, U, j and k, but temporarily this has been suppressed.) We define a_Y, b_Y and c_Y similarly, and thus
    a_X + b_X + c_X = n_{Xk}        (5.12)

and

    a_Y + b_Y + c_Y = n_{Yk}.        (5.13)

Assuming the j-th component is being considered, L and U are chosen so as to minimize

    q_X (b_X / n_X) + q_Y (b_Y / n_Y),        (5.14)

the nonclassification rate at the k-th step, subject to

    (total misclassification rate through step k) <= k alpha / K.        (5.15)

Note that the left side of (5.15) is the misclassification rate up to and including the k-th step. Let the minimum value be w_j. Then the component is chosen which minimizes w_j. In case of ties in components, we toss a coin. In case of ties in choosing L and U for a given component, we minimize (U - L). If this does not resolve the tie, we toss a coin.
Having chosen the component, and the L and U for it, the k-th step of the rule is as follows:

    if q_X a_X / n_X < q_Y a_Y / n_Y, then new observations Z with Z_j < L are assigned to pi_Y; otherwise, they are assigned to pi_X;
    if q_X c_X / n_X < q_Y c_Y / n_Y, then new observations Z with Z_j > U are assigned to pi_Y; otherwise, they are assigned to pi_X;        (5.16)
    if L <= Z_j <= U, then we refer to the next step (if any).
At any step all components are considered, whether previously used or not. K may be equal to, smaller than, or larger than the number of components. Also, if at any step the remaining observations are all from the same sample, then the rule is modified appropriately. For instance, if at step k+1 all of the remaining training observations are from pi_X and the j-th component were selected at step k, then our rule terminates at step k, since those new observations Z with L <= Z_j <= U are assigned to pi_X.
As with Kendall's rule, our modified rule operates on a principle of greed. At the first step, for each component, it finds two cutpoints, L and U, which minimize the nonclassification rate among all two-cutpoint rules with misclassification rate less than or equal to alpha/K. The component with smallest nonclassification rate (amount of overlap) is selected along with its two-cutpoint rule. Those multivariate training observations with a value in [L, U] for the component selected are then retained. At the second step all components are again eligible for selection. (This is analogous to the method of component selection used for d_1.) Using the remaining observations we find, for each component, two cutpoints which minimize the nonclassification rate among all rules (which utilize the component and cutpoints selected at the first step) with misclassification rate after two steps of less than or equal to 2 alpha / K. The component with smallest nonclassification rate is selected, and those remaining observations with a value in [L, U] for this component are retained. We continue until we reach the K-th step, or until a one-cutpoint rule (L = U) is selected, or until the remaining observations are all from one sample.
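To fix ideas, here is a minimal Python sketch of a single step of the search just described. It is a simplification, not the SAS program used later: it assumes q_X = q_Y = 1/2 by default, assigns each tail region to the sample with the larger weighted count there (so that the step's misclassification contribution is as small as possible), takes the candidate cutpoints to be all midpoints between adjacent non-tied pooled values (the overlap-area restriction of Section 5.3 is omitted), and breaks ties by the smaller U - L rather than by a coin toss.

    import numpy as np
    from itertools import combinations_with_replacement

    def candidate_cutpoints(x, y):
        # midpoints between adjacent non-tied values of the pooled sample
        v = np.unique(np.concatenate([x, y]))
        return (v[:-1] + v[1:]) / 2.0

    def d3_step(X, Y, n_x, n_y, k, K, alpha, mis_so_far, qx=0.5, qy=0.5):
        # choose a component j and cutpoints L <= U minimizing the step's
        # nonclassification rate subject to the constraint (5.15);
        # n_x and n_y are the original training sample sizes
        budget = k * alpha / K
        best = None
        for j in range(X.shape[1]):
            cuts = candidate_cutpoints(X[:, j], Y[:, j])
            for L, U in combinations_with_replacement(cuts, 2):   # L <= U; L == U is a one-cutpoint rule
                ax = np.sum(X[:, j] < L)
                bx = np.sum((X[:, j] >= L) & (X[:, j] <= U))
                cx = np.sum(X[:, j] > U)
                ay = np.sum(Y[:, j] < L)
                by = np.sum((Y[:, j] >= L) & (Y[:, j] <= U))
                cy = np.sum(Y[:, j] > U)
                step_mis = min(qx * ax / n_x, qy * ay / n_y) + min(qx * cx / n_x, qy * cy / n_y)
                if mis_so_far + step_mis > budget:                 # would violate (5.15)
                    continue
                noncls = qx * bx / n_x + qy * by / n_y             # nonclassification rate at this step
                cand = (noncls, U - L, j, L, U, step_mis)
                if best is None or cand < best:
                    best = cand
        return best     # None signals "no candidate", as in Table 2 of Chapter VI

Repeated over steps k = 1, ..., K, with the retained observations (those falling in [L, U] on the chosen component) carried forward and mis_so_far updated, this reproduces the greedy behaviour described above.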
At this point we state (as theorems) several asymptotic properties of d_3.

Theorem 5.1
Assume that pi_X and pi_Y have absolutely continuous p-variate distribution functions. Then the asymptotic conditional probability of misclassification for d_3 is almost surely less than or equal to alpha.

Proof:
Since for all n_X, n_Y and w,

    P-hat_{d_3}(misclassification) <= alpha        (5.17)

(where P-hat_{d_3}(misclassification) is computed using the appropriate p-variate empirical distribution functions), this theorem is a simple consequence of the GCT -- Version II.
Theorem 5.2
Assume that pi_X and pi_Y have absolutely continuous p-variate distribution functions. Then the asymptotic conditional probability of nonclassification of d_3 is almost surely less than or equal to 1 - alpha.

Proof:
Again this result is a consequence of the GCT -- Version II.

Theorem 5.2 gives a crude upper bound on the probability of nonclassification for d_3, but it does illustrate that under no circumstance can our modified rule have an asymptotic probability of nonclassification equal to unity. Thus in utilizing d_3 we have gained some control over the probability of nonclassification of Kendall's rule, albeit at the expense of allowing some misclassification.
In the univariate case a stronger result can be proven for d_3.

Theorem 5.3
Let D denote the class of all univariate non-sample-based rules which have a probability of misclassification less than or equal to alpha and which are of the following form:

    assign Z to pi_U iff Z < t_1,
    assign Z to pi_V iff Z > t_2,
    otherwise withhold judgment,

where t_1 <= t_2 and U = X or Y, as does V. Let d_3 denote the sample-based rule defined above, and assume that F_X and F_Y are continuous functions. Then

    P_{d_3}(nonclassification) -> inf over d in D of P_d(nonclassification)  almost surely.        (5.18)

Proof:
A consequence of the GCT.

Thus, in the univariate case, among all rules of the one- or two-cutpoint type which control the probability of misclassification at less than or equal to alpha, d_3 is asymptotically the best in terms of its nonclassification rate.
5.3  Practical Implementation of d_3

In general, a computer must be used to find d_3. To make the search for d_3 simpler we consider only values of U and L between nontied data points, and in order to fix things we use midpoints between adjacent non-identical points.

Consider the first step, and let X_{alpha/K}, X_{2alpha/K}, X_{1-alpha/K}, X_{1-2alpha/K}, Y_{alpha/K}, Y_{2alpha/K}, Y_{1-alpha/K} and Y_{1-2alpha/K} be the (alpha/K)th, (2alpha/K)th, (1-alpha/K)th and (1-2alpha/K)th sample quantiles from X and Y for Component 1. Then obviously no one- or two-cutpoint rules with U or L in

    [max(X_{2alpha/K}, Y_{2alpha/K}), min(X_{1-2alpha/K}, Y_{1-2alpha/K})]        (5.19)

need be reviewed, since any rule with U or L in the closed interval given by (5.19) does not satisfy (5.15). Also, if

    max(X_{alpha/K}, Y_{alpha/K}) > min(X_{1-alpha/K}, Y_{1-alpha/K}),

then only one-cutpoint rules with U and L in

    [min(X_{1-alpha/K}, Y_{1-alpha/K}), max(X_{alpha/K}, Y_{alpha/K})]        (5.20)

need be reviewed, and at least one of these will satisfy (5.15). The method outlined above applies in general for the search for U and L for each of the p components at the first step.
Unfortunately, due to (5.15), the above method is not so easily implemented after the first step, and we therefore discard it after this step. However, there are other methods which can be utilized to shorten the search for U and L at a given step, and these are used at every step. In general it suffices to search for U and L only among those points in the overlap area. For instance, suppose at step k for Component 1 there are ten training observations left for pi_X and pi_Y, and two of these are tied. The data could appear as follows:

    [One-dimensional display of the ordered values X_(1), ..., X_(10) and Y_(1), ..., Y_(10) for Component 1.]
We search the one-cutpoint rules first, starting with (Y_(2) + X_(1))/2 and ending with (Y_(10) + X_(9))/2, omitting (X_(2) + Y_(4))/2 from consideration. If a one-cutpoint rule is found which satisfies (5.15), then there is no need to review the two-cutpoint rules (L < U) for this component; additionally, we review only one-cutpoint rules for the remaining p-1 components, and our rule terminates at this step. If no one-cutpoint rule is found for Component 1, we proceed to search the two-cutpoint rules (L <= U) for a rule satisfying (5.15), starting with

    L = (Y_(2) + X_(1))/2  and  U = (X_(1) + Y_(3))/2.

For the moment we fix L. After U = (X_(1) + Y_(3))/2 we consider U = (Y_(3) + X_(2))/2, and so on until we arrive at U = (Y_(10) + X_(9))/2. At this point we start again with L = (X_(1) + X_(3))/2 and continue our search. Once the optimal L and U have been found for Component 1, we continue on to Component 2, and so on.
Although we have illustrated our searching technique with a specific data configuration and n = 10, the general method should now be clear.
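The enumeration of candidate cutpoints described above can be summarized by the following Python fragment, offered only as an illustration of the search order; the screening against (5.15) and the minimization of the nonclassification rate are then applied to the pairs it yields, as in Section 5.2.

    import numpy as np

    def overlap_midpoints(x, y):
        # candidate cutpoints for one component at a given step: midpoints between
        # adjacent non-tied pooled values, restricted to the sample overlap area
        lo, hi = max(x.min(), y.min()), min(x.max(), y.max())
        pooled = np.unique(np.concatenate([x, y]))
        mids = (pooled[:-1] + pooled[1:]) / 2.0
        return mids[(mids > lo) & (mids < hi)]

    def candidate_rules(x, y):
        # enumerate (L, U) pairs in the order described in the text:
        # all one-cutpoint rules first, then two-cutpoint rules with L fixed
        # while U sweeps to the right
        cuts = overlap_midpoints(x, y)
        for t in cuts:
            yield t, t
        for i, L in enumerate(cuts):
            for U in cuts[i + 1:]:
                yield L, U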
The computer program used for finding d_3 for an arbitrary dataset was written in SAS. Ties in choosing U and L for a given component, as well as ties in smallest nonclassification rate between components, were resolved using a uniform random variable generator within SAS. The source statements for the general computer program used to find d_3 are included in Appendix I, as are those for the programs written to find Kendall's rule, d_1 and d_2.
Table 5.1
Training data: observations (X_1, X_2, X_3) from pi_X and (Y_1, Y_2, Y_3) from pi_Y

( 1, 30, 35)   ( 3, 21,  2)   ( 2, 20, 34)   ( 5, 19, 22)   ( 9, 25, 27)   ( 7, 15, 16)
(11, 22, 30)   ( 8, 17,  9)   (14, 27, 33)   (16, 23, 20)   (18, 18, 32)   (30,  9, 10)
(36, 16, 31)   (32, 28, 11)   (27, 14, 38)   (39, 26, 12)   (31, 12, 40)   (26, 24, 18)
(33, 10, 25)   (20, 13, 13)   (34,  2,  1)   (21, 29, 36)   (35,  4,  4)   (25, 31, 37)
(22,  6, 19)   (38,  3,  6)
Table 5.2
Data for Step 1 of Kendall's Rule and of Our Refinement

Component 1:  1 2 9 3 5 7 14 11 18 21 20 16 8 27 22 31 26 25 30 33 34 35 36 32 38 39
Component 2:  2 4 6 3 10 12 9 14 16 15 13 19 17 22 20 18 21 25 23 24 30 31 32 27 26 30 28 29 31
Component 3:  1 4 2 25 19 6 9 10 11 12 13 16 18 20 22 27 33 34 38 35 36 40 37
Table 5.3
Data Remaining After Step 1

Component 1:  9 3 5 7 11 20 16 8 22 18 14 21 31 27 25 30 26 33 34 35 36 28 29 31 38 40 32
Component 2:  2 4 6 10 12 9 16 14 15 13 22 18 19 17 21 25 23 24 31 32 27
Component 3:  1 4 25 19 9 2 10 11 13 16 18 20 22 27 30 33 36 37
Table 5.4
Data Remaining at Step 3 for Kendall's Rule

Component 3:  25 2 9 10 13 16 18 20 27 30 31 32 33 38 40 22
Table 5.5
Data Remaining at Step 3 of Our Refinement

Component 1:  9 11 14 22 18 20 16 21 27 25 26 31 30 32
Component 2:  14 12 6 9 18 22 13 25 23 24 30 32 27 28 29 31 38 40
Component 3:  27 19 10 11 13 18 20 33 36 37
Table 5.6
Data Remaining at Step 4 of Our Refinement

Component 1:  9 11 14 18 16 22 21 25
Component 2:  6 18 25 22 27 23 29 31 36 37
Component 3:  27 19 20 30 32 33
CHAPTER VI

AN EXAMPLE OF THE DATA ANALYSIS

In this chapter we analyze Fisher's Iris data (see Chapter II, Table I) by means of the extensions and modification suggested in Chapter V. These data, which originally appeared in a paper by Fisher (1936) on discriminant analysis, have since been analyzed and used to illustrate various multivariate techniques in many standard statistical textbooks: e.g., Anderson (1958) and Morrison (1976). In particular, Kendall and Stuart (1976) use the Iris data to illustrate the standard linear discriminant rule as well as some others, including Kendall's order statistic method, in their chapter on discrimination and classification.

Using the standard linear rule (1.11), which we shall denote by l, the misclassification rate for the Iris data, based on the training samples, is .03; the nonclassification rate is of course zero. We recall from Chapter II that Kendall's rule selects the four components in the order (PL, PW, SW, SL), yielding a nonclassification rate of .13, with of course a misclassification rate of zero.
We refer to Table 1 for the analysis with d_1. In order to save computer time and money, we set r = 1 in our computer program, which then gave in one run the results for all rules with 1 <= r <= 50. Note that with four steps the nonclassification rate for d_1 is .10, which is an improvement over Kendall's four-step rule. After seven steps the training data are exhausted; consequently at this stage the nonclassification rate is zero for d_1, and the rule terminates.
The extension d_2, based on looking at all possible orderings of the components, selects the components in the order (PW, SL, PL, SW) and has a nonclassification rate of .10. As we mentioned in Chapter V, d_2 always performs at least as well as Kendall's rule, both in terms of nonclassification of the training data and in terms of its asymptotic probability of nonclassification. It is interesting that for these data the order of selection is neither least-to-most nor most-to-least. It is also of interest to note that this rule has the same nonclassification rate as d_1 when d_1 has the same number (four) of steps.
The results for d_3, with K = 1 to 10, are given in Tables 2, 3 and 4. Table 2 gives the analysis for alpha = .01, Table 3 for alpha = .05, and Table 4 for alpha = .10. It was felt that these values for alpha and K would cover a fairly comprehensive class of rules while keeping the computing costs at a reasonable level.

Special attention needs to be given to Table 2.
For K = 5, 6, 7, 8, 9 and 10, d_3 terminates at Step 4 because at that stage no candidate (rule) can be found, by our algorithm, which satisfies (5.15). The data remaining at Step 4 for each of these instances -- where for each component the top line of data is from Versicolor and the bottom from Virginica -- are as follows:
SL:  5.4  5.6  5.9  6    6    6    6    5.9  6    6.2  6.2  6.1  6.3  6.3  6.3  6.3
SW:  2.2  2.7  2.5  2.2  2.9  2.7  2.8  2.8  3    3    3    3    3.2  3.3  3.4  3
PL:  4.5  4.5  4.5  4.5  4.7  4.9  4.8  4.8  4.8  4.9  5.1  4.9  5.1  5    5.1
PW:  1.5  1.5  1.5  1.5  1.5  1.5  1.6  1.5  1.6  1.6  1.8  1.8  1.8  1.8  1.8  1.8

The problem is caused by ties.
In particular, since either X_(1) and Y_(1) or X_(n) and Y_(n) are tied for each of the four components, and since our algorithm searches only the midpoints between adjacent nontied data points, no rule can be found to satisfy (5.15). Consider K = 5. At Step 4, in order to satisfy (5.15), we need a one- or two-cutpoint rule which misclassifies none of the data remaining. Obviously no one-cutpoint rule will work. Consider SL and the two-cutpoint search. Our algorithm starts with L = 5.75 and U = 5.95 and stops searching once it reaches L = 6.15 and U = 6.25. It stops searching SL at this point since there is no data point outside the overlap area to the right of 6.3; i.e., there is not another midpoint to the right of 6.3. Thus all two-cutpoint rules misclassify at least one observation, 6.3, and therefore none satisfy (5.15) for SL.
The same sort of situation arises for the other three components, again because of ties between either X_(1) and Y_(1) or X_(n) and Y_(n) or both.

Fortunately, there is a simple way out of this unpleasant situation. If at a given step we find by our algorithm no rule which satisfies (5.15), we simply use Kendall's rule at this step and proceed as before for the next step. Note that Kendall's rule is a one- or two-cutpoint rule which always satisfies (5.15), and since in this instance it is the only rule satisfying (5.15), it automatically minimizes (5.14) subject to (5.15). Adopting this convention, the results for alpha = .01 and K = 1 to 10 are as presented in Table 5.
Using Tables 3, 4 and 5, we point out several interesting facts about the analyses with d_3. Allowing a misclassification rate of .01, for K = 4 we are able to reduce the nonclassification rate to .06 by using our modification. This is an improvement over the other four-step rules, since for Kendall's rule the comparable rate is .13, for d_1 it is .10, and for d_2 it is .10. Even better results are achieved for K = 5 and K = 6. For alpha = .01 and K = 7, 8, 9 or 10, d_3 is equivalent to d_1 for the Iris data.

For alpha = .05 a nonclassification rate of zero is achieved for K = 2 to 10. For K = 4 the misclassification rate is .03; thus again, when using the same number of steps as the other rules, d_3 performs better than Kendall's rule, d_1 or d_2. Also, for alpha = .05 only one rule terminates with five steps (K = 7); the rest terminate with four or less.

For alpha = .10 no rule uses more than three steps, all rules have a nonclassification rate of zero, and every rule, except for K = 1, has a misclassification rate less than or equal to .05. For K = 4 the misclassification rate is .05, and if the cost of misclassification is no more than twice that of nonclassification, once again d_3 outperforms Kendall's rule, d_1 and d_2 in terms of classification of the training data.
It seems reasonable to suppose that it is more costly to misclassify an observation than to withhold judgment for that observation, since in withholding judgment no error is being made. For the sake of argument, let us assume that the loss (cost) associated with misclassifying an observation is twice as great as that associated with withholding judgment for an observation. Then, based on the training samples, we can compute the estimated loss associated with l, Kendall's rule, d_1, d_2 and d_3. For instance, the loss for l is proportional to

    2(.03) + 1(0) = .06,        (6.1)

since l leaves none of the training sample observations unclassified. Similarly, the loss for Kendall's rule is proportional to

    2(0) + 1(.13) = .13,        (6.2)

and for d_2 it is .10.
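The estimated losses reported here are simple weighted sums of the two error rates; the one-line Python helper below (an illustration only, with the rates taken from the text) reproduces (6.1) and (6.2).

    def estimated_loss(misclassification_rate, nonclassification_rate):
        # misclassifying an observation is assumed to cost twice as much
        # as withholding judgment for it
        return 2 * misclassification_rate + 1 * nonclassification_rate

    print(estimated_loss(.03, 0))     # standard linear rule: 0.06, as in (6.1)
    print(estimated_loss(0, .13))     # Kendall's rule: 0.13, as in (6.2)
    print(estimated_loss(0, .10))     # d_2: 0.10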
The results for d_1 and d_3 are presented in Tables 6 and 7. From Table 6 we see that for d_1 with six or more steps (1 <= r <= 2) the estimated loss is less than that for l. Similarly, for d_3 (see Table 7) with alpha = .01 and K = 5 to 10 the loss is less than for l. For alpha = .05, d_3 performs at least as well as l for K = 4 to 10. We feel that the results for d_3 and alpha = .10 are not particularly relevant, since in this case we allowed a misclassification rate up to .10, which is over three times that of l. Nevertheless, for alpha = .10 and K = 8, 9 and 10, d_3's loss is equal to l's.
In conclusion, for the Iris data it would seem that for alpha = .01 and alpha = .05 our modification, d_3, has led to a rule which performs as well as or better than the standard linear rule in terms of classification of the original training samples. As we mentioned in Chapter V, for fixed training sample sizes from continuous underlying distributions, if K is allowed to increase without bound, d_3 will always classify all of the training data, although the rule will not in general have a probability of nonclassification equal to zero. Therefore, one problem with the implementation of d_3 would appear to be the selection of an optimal K, and this would appear to be a major area for further research with d_3. The optimal K for a given alpha may be a function of n, p and alpha; without additional information on which to base a choice, one obvious selection would be K equal to p, the number of components. In this case d_3 would have the same number of steps as Kendall's rule. For large n and small p, however, a selection of K > p might be more reasonable.
Table 1
Results for d_1 with r = 1 to 50

    r          Final Step K(r)   Component Selected   Nonclassification Rate   Misclassification Rate
 17 to 50             1                 PL                    .37                      0*
  9 to 16             2                 PW                    .22                      0
     8                3                 SL                    .16                      0
  4 to 7              4                 PL                    .10                      0
     3                5                 SW                    .08                      0
     2                6                 SW                    .02                      0
     1                7                 PL                     0                       0

*The misclassification rate, based on the training samples, is zero at each step by the definition of d_1.
Table 2
Results for d_3 with alpha = .01 and K = 1 to 10

  K    Step 1   Step 2   Step 3   Step 4   Nonclassification Rate   Misclassification Rate
  1      PL                                         .21                    .01
  2      PL       PW                                .16                    .01
  3      PL       PW       SW                       .14                    .01
  4      PL       PW       SL       PL              .06                    .01
  5      PL       PW       SL       NC*             .16                     0
  6      PL       PW       SL       NC              .16                     0
  7      PL       PW       SL       NC              .16                     0
  8      PL       PW       SL       NC              .16                     0
  9      PL       PW       SL       NC              .16                     0
 10      PL       PW       SL       NC              .16                     0

*NC (no candidate) signifies that no rule can be found which satisfies (5.15) at this step, and consequently that d_3 terminates at the preceding step.
Table 3
Results for d_3 with alpha = .05 and K = 1 to 10

  K    Step 1   Step 2   Step 3   Step 4   Step 5   Nonclassification Rate   Misclassification Rate
  1      PW                                                  .02                    .05
  2      PL       SL                                          0                     .05
  3      PL       PW       SL                                 0                     .05
  4      PL       PW       SW                                 0                     .03
  5      PL       PW       SW                                 0                     .03
  6      PL       PW       SW       SL                        0                     .03
  7      PL       PW       SW       SL       SW               0                     .02
  8      PL       PW       SL       PL                        0                     .02
  9      PL       PW       SL       PL                        0                     .02
 10      PL       PW       SL       PL                        0                     .02
Table 4
Results for d_3 with alpha = .10 and K = 1 to 10

  K    Step 1   Step 2   Step 3   Nonclassification Rate   Misclassification Rate
  1      PL                                 0                     .08
  2      PW       PL                        0                     .05
  3      PL       SL                        0                     .05
  4      PL       SL                        0                     .05
  5      PL       SL                        0                     .04
  6      PL       PW       SL               0                     .05
  7      PL       PW       SL               0                     .04
  8      PL       PW       SW               0                     .03
  9      PL       PW       SW               0                     .03
 10      PL       PW       SW               0                     .03
Table 5
Results for Modified d_3 with alpha = .01 and K = 1 to 10

  K    Step 1  Step 2  Step 3  Step 4  Step 5  Step 6  Step 7   Nonclassification Rate   Misclassification Rate
  1      PL                                                              .21                    .01
  2      PL      PW                                                      .16                    .01
  3      PL      PW      SW                                              .14                    .01
  4      PL      PW      SL      PL                                      .06                    .01
  5      PL      PW      SL      PL      SW                              .03                    .01
  6      PL      PW      SL      PL      SW      SW                       0                     .01
  7      PL      PW      SL      PL      SW      SW      PL               0                      0
  8      PL      PW      SL      PL      SW      SW      SL               0                      0
  9      PL      PW      SL      PL      SW      SW      PL               0                      0
 10      PL      PW      SL      PL      SW      SW      PW               0                      0
Table 6
Estimated Loss Associated with d_1, Based on the Training Samples

    r         K(r)    Estimated Loss
 17 to 50      1          .37
  9 to 16      2          .22
     8         3          .16
  4 to 7       4          .10
     3         5          .08
     2         6          .02
     1         7           0
Table 7
Estimated Loss Associated with d_3, Based on the Training Samples

  alpha \ K     1     2     3     4     5     6     7     8     9    10
   .01         .23   .18   .16   .08   .05   .02    0     0     0     0
   .05         .12   .10   .10   .06   .06   .06   .04   .04   .04   .04
   .10         .16   .10   .10   .10   .08   .10   .08   .06   .06   .06
CHAPTER VII

SUMMARY AND RECOMMENDATIONS FOR FURTHER RESEARCH

7.1  Summary
We have paid considerable attention to the distribution of Kendall's criterion (i.e., of the cutpoints). We have given explicit expressions for the cutpoints in the univariate case, and have found the asymptotic distribution of the cutpoints in both the univariate and bivariate cases; our results generalize to the p(>2)-variate setting. We have established conditions under which Kendall's rule is asymptotically optimal, but have provided an example which demonstrates that neither the least-to-most nor the most-to-least method of selecting component variables leads in general to an asymptotically better rule.

For both the univariate and multivariate cases we have shown that the conditional probability of misclassification of Kendall's rule converges (with probability one) to zero, but this desirable property can, in certain instances, allow the probability of nonclassification to converge to unity. We have given conditions under which this is possible; in particular it occurs when the underlying populations are multivariate normal.
In order to deal with this potential nonclassification problem we have offered several extensions and a modification of Kendall's original rule. We explored the asymptotic properties of the extensions and modification, and in particular we found an asymptotic bound less than unity for the conditional probability of nonclassification of our modification.

We used Fisher's Iris data to illustrate Kendall's rule and our extensions and modification of it, and to compare them to each other and to the standard linear rule. A result was that our modification appears to perform at least as well as the linear rule for the Iris data.
7.2  Suggestions for Further Research

i)  The t(>2) Population Case

In Chapters II through VI we dealt with only the two-population case. For at least one of the partial discriminant analysis rules discussed in Chapter I there appears to be no generalization to the t(>2) population case, but fortunately this is not true for Kendall's rule. Consider the three-population case and suppose that the data appear as follows for the component (the j-th, say) having least overlap:
    w  w  w  x  w  x  w  x  w  x  w  x  w  x  w  x  w  x  x  x
                                                                  y  y  y  y  y  y  y  y  y  y
Then one rule for assigning a new observation (Z_1, Z_2, Z_3) would be as follows:

Step 1.  Assign to pi_W if and only if Z_j < Y_(1). Assign to pi_X if and only if Z_j > W_(n). Otherwise go to Step 2.

Step 2.  Discard the training sample observations which were assigned in Step 1. Then proceed with the remaining training data as if starting from scratch. (The component selected at Step 1 might be chosen again at Step 2.)

Step 3.  Repeat until K steps have been taken, or until all training data have been classified.
Obviously the above rule can be generalized to all other possible data configurations for the three-population case, and its generalization to more than 3 populations is straightforward. Note that the above rule is similar to d_1: as with d_1, if there are p components there might be more than p steps. Lastly, it is not hard to see that by using the above rule as a basis, one could also generalize our modification, d_3, to the t(>2) population case.
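A minimal Python sketch of Step 1 of the suggested rule, for the particular data configuration shown above, is given below. It is an illustration only: the choice of the component j and the handling of other data configurations are omitted.

    import numpy as np

    def three_population_step1(W, X, Y, j):
        # Step 1 on component j for the configuration shown above:
        # assign Z to pi_W iff Z_j < Y_(1), to pi_X iff Z_j > W_(n),
        # and otherwise defer to the next step
        y1 = Y[:, j].min()        # Y_(1)
        wn = W[:, j].max()        # W_(n)

        def classify(z):
            if z[j] < y1:
                return "W"
            if z[j] > wn:
                return "X"
            return None           # withhold; go to Step 2

        def remaining(A):
            # discard the training observations that Step 1 classifies; the
            # remaining data are then treated as if starting from scratch
            return A[(A[:, j] >= y1) & (A[:, j] <= wn)]

        return classify, remaining(W), remaining(X), remaining(Y)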
ii)  Some Other Suggestions

One way of determining how well our extensions and modification perform in relation to each other, as well as to some of the other partial discriminant analysis rules, is by a Monte Carlo study. We have mentioned several studies of this type (see Chapter I) and we feel that it would be interesting to conduct such a study at least for the case of underlying normal distributions with unequal means. Of course it must be remembered that some of the other nonparametric rules appear to be designed with the case of underlying normal distributions in mind.
Lastly, the discrete case should be considered.
Kendall (1966,
pp. 178-180) points out some of the difficulties involved and makes a
few tentative suggestions as to how to proceed.
APPENDIX

SOURCE STATEMENTS FOR KENDALL'S RULE

[SAS PROC MATRIX source listing of approximately 180 numbered statements. The program reads the training samples from the SAS data sets OUT.POPX and OUT.POPY (keeping the components SL, SW, PL and PW, with q_X = q_Y = .5) and defines macros -- among them GETMIN, GETMAX, FINDJB, PCOMP, FIXEM, DELETEM, FINDCUTS, COUNTEM and SETEM -- that locate the sample overlap area of each component, select the component with least overlap (ties broken with a uniform random number), print whether each cutpoint comes from X, from Y, or from both, delete the classified observations, and report the total percent of training observations left unclassified. The rule terminates when a component has no sample overlap or when a sample size would fall to zero.]
SOURCE STATEMENTS FOR d_1

[SAS PROC MATRIX source listing of approximately 180 numbered statements ("Extension I"). The program uses the same macros as the listing for Kendall's rule but, at every step, makes all components (including the one just used) eligible for selection; it loops over steps, stops with the message THE SAMPLE SIZES HAVE BECOME TOO SMALL when either remaining sample falls to the minimum allowed, and prints the total percent of observations left unclassified after each step.]
SOURCE STATEMENTS FOR d2

MACRO COMPS SL SW PL PW %
MACRO SET_P P=4; %
MACRO SET_Q QX=.5; QY=.5; %
MACRO PERMS
  PERM=1 2 3 4/1 2 4 3/1 4 2 3/1 4 3 2/1 3 4 2/1 3 2 4/
       2 1 3 4/2 1 4 3/2 4 1 3/2 4 3 1/2 3 4 1/2 3 1 4/
       3 1 2 4/3 1 4 2/3 4 1 2/3 4 2 1/3 2 1 4/3 2 4 1/
       4 1 2 3/4 1 3 2/4 3 1 2/4 3 2 1/4 2 3 1/4 2 1 3; %
MACRO P_FAC
  PFAC=1;
  DO N=2 TO P;
    PFAC=PFAC#N;
  END; %
MACRO GETMIN
  IF MINXJB=MAXMIN THEN DO;
    LOWERX PRINT MINXJB;
  END;
  ELSE DO;
    LOWERY PRINT MINYJB;
  END; %
MACRO GETMAX
  IF MAXXJB=MINMAX THEN DO;
    UPPERX PRINT MAXXJB;
  END;
  ELSE DO;
    UPPERY PRINT MAXYJB;
  END; %
MACRO UPPERX NOTE SKIP=2 THE UPPER CUTPOINT IS FROM X AND EQUALS; %
MACRO LOWERX NOTE SKIP=2 THE LOWER CUTPOINT IS FROM X AND EQUALS; %
MACRO LOWERY NOTE SKIP=2 THE LOWER CUTPOINT IS FROM Y AND EQUALS; %
MACRO UPPERY NOTE SKIP=2 THE UPPER CUTPOINT IS FROM Y AND EQUALS; %
MACRO UPPERXY NOTE SKIP=2 THE UPPER CUTPOINT IS FROM BOTH X AND Y AND EQUALS; %
MACRO LOWERXY NOTE SKIP=2 THE LOWER CUTPOINT IS FROM BOTH X AND Y AND EQUALS; %
MACRO NOTEXY
  NOTE NO FURTHER DISCRIMINATION IS POSSIBLE DUE TO;
  NOTE THE COMPLETE OVERLAP OF ALL REMAINING;
  NOTE COMPONENTS; %
MACRO FINDJB
  IF N^=1 THEN DO;
    U=UNIFORM(131313);
    DO I=1 TO N;
      IF U>(I-1)#/N AND U<I#/N THEN B=I;
    END;
  END;
  ELSE B=1;
  C=0;
  DO I=1 TO K;
    IF H(I,1)=1 THEN DO;
      C=C+1;
      IF C=B THEN JB=I;
    END;
  END;
  FREE H N; %
MACRO PCOMP
  NOTE SKIP=2 THE COMPONENT SELECTED AT STEP;
  PRINT STEP;
  NOTE IS;
  CNAME=NAMES(1,JB);
  PRINT JB COLNAME=CNAME; %
MACRO FIXEM
  ONE=J(M,1,1);
  C=T>=(ONE*MAXMIN');
  D=T<=(ONE*MINMAX');
  FREE ONE T;
  E=C#D;
  FREE C D; %
MACRO DELETEM
  IF KX=0 OR KY=0 THEN DO;
    NOTE SKIP=2 SINCE ONE OF THE SAMPLE SIZES WOULD BE 0 AFTER;
    NOTE THIS STEP, THE RULE TERMINATES;
    GO TO NEWPERM;
  END;
  ELSE RI=LOC(EX);
  X=X(RI,);
  RI=LOC(E);
  Y=Y(RI,); %
MACRO FINDCUTS
  MINXJB=XYMIN(1,1);
  MINYJB=XYMIN(1,2);
  MAXXJB=XYMAX(1,1);
  MAXYJB=XYMAX(1,2);
  FREE XYMIN XYMAX;
  IF MINXJB^=MINYJB AND MAXXJB^=MAXYJB THEN DO;
    GETMIN
    GETMAX
  END;
  ELSE IF MINXJB=MINYJB AND MAXXJB=MAXYJB THEN DO;
    LOWERXY PRINT MINXJB;
    UPPERXY PRINT MAXXJB;
    NOTEXY STOP;
  END;
  ELSE IF MINXJB=MINYJB THEN DO;
    LOWERXY PRINT MINXJB;
    GETMAX
  END;
  ELSE DO;
    UPPERXY PRINT MAXXJB;
    GETMIN
  END; %
MACRO COUNTEM
  KX=EX(+,);
  KY=E(+,);
  TT=(QX#KX#/NX)+(QY#KY#/NY);
  CR(II,1)=TT; %
MACRO FINDIT
  M=MX;
  T=X(,JB);
  FIXEM
  EX=E;
  M=MY;
  T=Y(,JB);
  FIXEM %
MACRO SETEM
  XYMIN=J(1,2,0); XYMAX=XYMIN;
  JB=_PERM(1,STEP);
  XYMIN(1,1)=MIN(X(,JB));
  XYMIN(1,2)=MIN(Y(,JB));
  XYMAX(1,1)=MAX(X(,JB));
  XYMAX(1,2)=MAX(Y(,JB));
  MAXMIN=MAX(XYMIN);
  MINMAX=MIN(XYMAX); %
*****
  KENDALL'S RULE
*****;
PROC MATRIX;
  SET_Q
  SET_P
  P_FAC
  CR=J(PFAC,1,0);
  PERMS
  DO II=1 TO PFAC;
    FETCH X DATA=OUT.POPX(KEEP=COMPS) COLNAME=NAMES;
    FETCH Y DATA=OUT.POPY(KEEP=COMPS);
    _PERM=PERM(II,);
    NOTE THE PERMUTATION OF THE COMPONENTS UNDER CONSIDERATION IS;
    PRINT _PERM;
    DO STEP=1 TO P;
      K=NCOL(X);
      MX=NROW(X);
      MY=NROW(Y);
      IF STEP=1 THEN DO;
        NX=MX; KX=NX;
        NY=MY; KY=NY;
      END;
      SETEM
      IF MAXMIN>MINMAX THEN DO;
        PCOMP
        NOTE SKIP=2 THE CUTPOINT FOR STEP;
        PRINT STEP;
        NOTE IS;
        CP=(MAXMIN+MINMAX)#/2;
        PRINT CP;
        FINDCUTS
        NOTE DUE TO THE FACT THAT THERE IS NO SAMPLE OVERLAP FOR THE;
        NOTE COMPONENT AT THIS STEP, THIS RULE TERMINATES.;
        TT=0;
        GO TO NEWPERM;
      END;
      ELSE DO;
        FINDIT
        PCOMP
        FINDCUTS
        COUNTEM
        DELETEM
      END;
    END;
    NOTE SKIP=2 THE TOTAL PERCENT OF OBSERVATIONS LEFT;
    NOTE UNCLASSIFIED IS;
    PRINT TT;
  NEWPERM:
  END;
  MCR=MIN(CR);
  H=CR=MCR;
  NOTE THE PERMUTATION WITH THE SMALLEST PERCENTAGE OF;
  NOTE OBSERVATIONS UNCLASSIFIED ALONG WITH THAT PERCENTAGE IS;
  H=PERM(LOC(H),)||CR(LOC(H),);
  PRINT H;
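The d2 program differs from the preceding one only in its outer structure: the order in which the components are examined is not chosen adaptively but is fixed in advance, the stepwise rule is run once for each of the P!=24 orderings generated by the PERMS macro, and the ordering leaving the smallest proportion of training observations unclassified is reported. The following Python sketch of that search is illustrative only; the helper names unclassified_proportion and best_ordering are introduced here, the printing of cutpoints is omitted, and ties are resolved by taking the minimum rather than at random as the listing does.

import itertools
import numpy as np

def unclassified_proportion(x, y, order, qx=0.5, qy=0.5):
    """Run the stepwise rule with the components taken in the fixed
    order given and return the weighted proportion of the training
    observations still unclassified when it stops (sketch only)."""
    nx, ny = len(x), len(y)
    for j in order:                               # one component per step
        lo = max(x[:, j].min(), y[:, j].min())    # larger minimum
        hi = min(x[:, j].max(), y[:, j].max())    # smaller maximum
        if lo > hi:                               # no sample overlap: all classified
            return 0.0
        # keep only the observations falling inside the overlap interval
        x = x[(x[:, j] >= lo) & (x[:, j] <= hi)]
        y = y[(y[:, j] >= lo) & (y[:, j] <= hi)]
        if len(x) == 0 or len(y) == 0:            # a sample would be exhausted
            break
    return qx * len(x) / nx + qy * len(y) / ny

def best_ordering(x, y, p=4):
    """Mimic the outer loop of d2: try every permutation of the p
    components and report the one leaving the fewest unclassified."""
    results = {perm: unclassified_proportion(x, y, perm)
               for perm in itertools.permutations(range(p))}
    return min(results.items(), key=lambda kv: kv[1])

With p=4 the exhaustive search over 24 orderings is trivial; for larger p its factorial growth is the practical obstacle that ordering heuristics such as least-to-most and most-to-least selection are, in effect, intended to avoid.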
SOURCE STATEMENTS FOR d3

MACRO COMPS SL SW PL PW %
MACRO SET_P P=4; %
MACRO SET_Q QX=.5; QY=.5; %
MACRO TOL .00001 %
MACRO GETMIN
  IF MINXJB=MAXMIN(JB,1) THEN DO;
    LOWERX PRINT MINXJB;
  END;
  ELSE DO;
    LOWERY PRINT MINYJB;
  END; %
MACRO GETMAX
  IF MAXXJB=MINMAX(JB,1) THEN DO;
    UPPERX PRINT MAXXJB;
  END;
  ELSE DO;
    UPPERY PRINT MAXYJB;
  END; %
MACRO UPPERX NOTE SKIP=2 THE UPPER CUTPOINT IS FROM X AND EQUALS; %
MACRO LOWERX NOTE SKIP=2 THE LOWER CUTPOINT IS FROM X AND EQUALS; %
MACRO LOWERY NOTE SKIP=2 THE LOWER CUTPOINT IS FROM Y AND EQUALS; %
MACRO UPPERY NOTE SKIP=2 THE UPPER CUTPOINT IS FROM Y AND EQUALS; %
MACRO UPPERXY NOTE SKIP=2 THE UPPER CUTPOINT IS FROM BOTH X AND Y AND EQUALS; %
MACRO LOWERXY NOTE SKIP=2 THE LOWER CUTPOINT IS FROM BOTH X AND Y AND EQUALS; %
MACRO NOTEXY
  NOTE NO FURTHER DISCRIMINATION IS POSSIBLE DUE TO;
  NOTE THE COMPLETE OVERLAP OF ALL REMAINING;
  NOTE COMPONENTS; %
MACRO FIND_ONE
  IF N^=1 THEN DO;
    U=UNIFORM(131313);
    DO I=1 TO N;
      IF U>(I-1)#/N AND U<I#/N THEN BB=I;
    END;
  END;
  ELSE BB=1; %
MACRO FINDJB
  FIND_ONE
  C=0;
  DO I=1 TO K;
    IF H(I,1)=1 THEN DO;
      C=C+1;
      IF C=BB THEN JB=I;
    END;
  END;
  FREE H N; %
MACRO PCOMP
  NOTE SKIP=2 THE COMPONENT SELECTED AT STEP;
  PRINT STEP;
  NOTE IS;
  CNAME=NAMES(1,JB);
  PRINT JB COLNAME=CNAME; %

The remaining macros of this program (FIXEM, COUNTEM, DELETEM, FCUTS, FINDALL, FINDCUTS, FZHAT, LOWUPC, SETEM, FINDOV, SET_ALL, LALL, CALC1, CALC2, SEARCH, INITCNTS, SETA1, SETA, SETB and their companions) together with the link routines ONECUT, TWOCUT and FINDAB carry out the cutpoint search for the modified rule. For the component under study the two samples are sorted, the observations lying in the sample overlap are extracted, and candidate pairs of cutpoints ZHATA<=ZHATB are examined; a pair is admissible only if the weighted proportion of training observations it would misclassify, added to CUM_MISS, does not exceed ALPHA#STEP#/MAX_STEP+TOL, and among the admissible pairs the one leaving the smallest proportion of observations between the cutpoints is retained. The additional diagnostics printed by these macros are:

  NOTE SKIP=2 THERE ARE NO OBSERVATIONS REMAINING IN THE;
  NOTE OVERLAP AREA FOR ONE OF THE POPULATIONS SO THIS RULE;
  NOTE TERMINATES;

  NOTE NO CANDIDATES WERE FOUND FOR THE FOLLOWING;
  PRINT ALPHA STEP;

The main program follows.

*****
  KENDALL'S RULE
*****;
PROC MATRIX;
  FETCH XX DATA=OUT.POPX(KEEP=COMPS) COLNAME=NAMES;
  FETCH YY DATA=OUT.POPY(KEEP=COMPS);
  PRINT XX YY COLNAME=NAMES;
  SET_P
  SET_Q
  ALL_ALP=.01 .05 .10;
  DO IA=1 TO 3;
    ALPHA=ALL_ALP(1,IA);
    DO MAX_STEP=1 TO 10;
      NOTE PAGE ALPHA AND THE MAXIMUM NUMBER OF STEPS ALLOWED ARE;
      PRINT ALPHA MAX_STEP;
      STEP=0; K=P; CUM_MISS=0; X=XX; Y=YY;
    LOOP: STEP=STEP+1;
      MX=NROW(X);
      MY=NROW(Y);
      IF STEP>MAX_STEP THEN DO;
        NOTE THE PROGRAM EXCEEDS THE MAXIMUM ALLOWABLE STEPS;
        GO TO NEXT_COM;
      END;
      KXL=J(K,1,0); KYL=KXL; KXU=KXL; KYU=KXL;
      BLX=KXL; BLY=KXL; BUX=KXL; BUY=KXL;
      LO=J(1,2,0); UP=LO; OLO=LO; OUP=LO;
      IF STEP=1 THEN DO;
        NX=MX;
        NY=MY;
      END;
      SETEM
      H=MAXMIN>MINMAX;
      N=SUM(H);
      IF N^=0 THEN DO;
        FINDJB
        PCOMP
        NOTE SKIP=2 THE CUTPOINT FOR STEP;
        PRINT STEP;
        NOTE IS;
        CP=(MAXMIN(JB,1)+MINMAX(JB,1))#/2;
        PRINT CP;
        FINDCUTS
        NOTE DUE TO THE FACT THAT THERE IS NO SAMPLE OVERLAP FOR THE;
        NOTE COMPONENT AT THIS STEP, KENDALL'S RULE TERMINATES.;
        GO TO NEXT_COM;
      END;
      ELSE DO;
        SFLAG=0;
        DO IP=1 TO K;
          FINDOV
          SEARCH
        END;
        COUNTEM
        FINDALL
        FINDJB
        PCOMP
        FCUTS
        DELETEM
      END;
      NOTE SKIP=2 THE TOTAL PROPORTION OF OBSERVATIONS LEFT;
      NOTE UNCLASSIFIED IS;
      PRINT MCR;
      MISS=XMISS(JB,1);
      CUM_MISS=CUM_MISS+MISS;
      NOTE SKIP=2 THE PROPORTION MISCLASSIFIED ALONG WITH ITS CUMULATIVE VALUE ARE;
      PRINT MISS CUM_MISS;
      GO TO LOOP;
    NEXT_COM:
    END;
  END;

(The link routines ONECUT, TWOCUT and FINDAB described above appear at the end of the PROC MATRIX step.)
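The d3 program therefore implements the modification of Kendall's rule as a constrained search: a small misclassification allowance ALPHA is spread over at most MAX_STEP steps, the cutpoints for a component may be moved inside the sample overlap as long as the cumulative weighted proportion of training observations misclassified stays within ALPHA#STEP#/MAX_STEP, and among the admissible cutpoint pairs the one leaving the smallest proportion unclassified is kept. The Python fragment below is a deliberately simplified univariate sketch of that idea rather than a rendering of the SEARCH, CALC1 and CALC2 machinery; the name budgeted_cutpoints, the brute-force double loop over pooled observations, and the assumption that population X tends toward the smaller values are all introduced here.

import numpy as np

def budgeted_cutpoints(x, y, budget, qx=0.5, qy=0.5):
    """Illustrative univariate sketch: among candidate (lower, upper)
    cutpoint pairs drawn from the pooled observations inside the sample
    overlap, keep the pair whose weighted misclassified proportion stays
    within `budget` and whose withheld proportion is smallest.
    Assumes population X tends toward the smaller values (a hypothetical
    helper, not the listing's cutpoint-search macros)."""
    lo0 = max(x.min(), y.min())               # Kendall's lower cutpoint
    hi0 = min(x.max(), y.max())               # Kendall's upper cutpoint
    pooled = np.sort(np.concatenate([x, y]))
    candidates = pooled[(pooled >= lo0) & (pooled <= hi0)]
    best_pair, best_withheld = (lo0, hi0), None
    for lo in candidates:                     # brute force: adequate for a sketch
        for hi in candidates[candidates >= lo]:
            # below lo -> assign to X, above hi -> assign to Y, between -> withhold
            miss = qy * np.mean(y < lo) + qx * np.mean(x > hi)
            if miss > budget:
                continue
            withheld = (qx * np.mean((x >= lo) & (x <= hi))
                        + qy * np.mean((y >= lo) & (y <= hi)))
            if best_withheld is None or withheld < best_withheld:
                best_pair, best_withheld = (lo, hi), withheld
    return best_pair, best_withheld

Larger values of ALPHA widen the admissible set of cutpoint pairs and so can only shrink the withheld region; this is the trade-off the program explores by looping over ALPHA = .01, .05, .10 and MAX_STEP = 1 to 10.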
BIBLIOGRAPHY
Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. New York: Wiley.

Anderson, T. W. (1966). Some nonparametric multivariate procedures based on statistically equivalent blocks. Proceedings of First International Symposium on Multivariate Analysis, Ed. P. R. Krishnaiah, Academic Press, New York, 5-27.

Aoyama, H. (1950). A note on the classification of observation data. Annals of the Institute of Statistical Mathematics, 2: 17-20.

Broffitt, J. D., Randles, R. H. and Hogg, R. V. (1976). Distribution-free partial discriminant analysis. Journal of the American Statistical Association, 71: 934-939.

Cacoullos, T. (1966). Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics, 18: 179-189.

Cencov, N. N. (1962). Evaluation of an unknown distribution density from observations. Soviet Mathematics, 3: 1559-1562.

Chang, C. (1974). Finding prototypes for nearest neighbors. IEEE Transactions on Computers, C-23: 1179-1184.

Chung, K. L. (1974). A Course in Probability Theory. New York: Academic Press, 133.

Conover, W. J. and Iman, R. L. (1980). The rank transformation as a method of discrimination with some examples. Communications in Statistics, A9: 465-487.

Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13: 21-27.

Devroye, L. P. and Wagner, T. J. (1979). Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, IT-25: 202-207.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7: 179-188.

Fix, E. and Hodges, J. L. (1951). Nonparametric discrimination: consistency properties. U. S. School of Aviation Medicine, Project 21-49-004, Rep. 4, Randolph Field, Texas.

Fix, E. and Hodges, J. L. (1953). Discriminatory analysis: nonparametric discrimination: small-sample performance. U. S. School of Aviation Medicine, Project 21-49-004, Rep. 11, Randolph Field, Texas.

Friedman, J. H. (1977). A recursive partitioning decision rule for nonparametric classification. IEEE Transactions on Computers, C-26: 404-408.

Friedman, J. H., Bentley, J. L. and Finkel, R. A. (1975). An algorithm for finding best matches in logarithmic time. Stanford Linear Accelerator Center, Rep. SLAC-PUB-1549.

Fritz, J. (1975). Distribution-free exponential error bound for nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-21: 552-557.

Gates, G. W. (1972). The reduced nearest neighbor rule. IEEE Transactions on Information Theory, IT-18: 431-433.

Gessaman, M. P. and Gessaman, P. H. (1972). A comparison of some multivariate discrimination procedures. Journal of the American Statistical Association, 67: 468-472.

Glick, N. (1972). Sample-based classification procedures derived from density estimators. Journal of the American Statistical Association, 67: 116-122.

Glick, N. (1973). Sample-based multinomial classification. Biometrics, 29: 241-256.

Glick, N. (1976). Sample-based classification procedures related to empiric distributions. IEEE Transactions on Information Theory, IT-22: 454-461.

Goldstein, M. (1975). Comparison of some density estimate classification procedures. Journal of the American Statistical Association, 70: 666-669.

Goldstein, M. and Dillon, W. R. (1978). Discrete Discriminant Analysis. New York: John Wiley and Sons.

Gordon, L. and Olshen, R. A. (1978). Asymptotically efficient solutions to the classification problem. Annals of Statistics, 6: 515-533.

Greblicki, W. (1978). Pattern recognition procedures with nonparametric density estimates. IEEE Transactions on Systems, Man, and Cybernetics, SMC-8: 809-812.

Greer, R. L. (1979). Consistent nonparametric estimation of best linear classification rules/solving inconsistent systems of linear inequalities. Tech. Rept. No. 129, Dept. of Statistics, Stanford University, Stanford, CA.

Habbema, J. D. F., Hermans, J. and van den Broek, K. (1974). A stepwise discriminant analysis program using density estimation. Compstat 1974, Proceedings in Computational Statistics, Wien, Physica Verlag, 101-110.

Hart, P. E. (1968). The condensed nearest neighbor rule. IEEE Transactions on Information Theory, IT-14: 515-516.

Hellman, M. E. (1970). The nearest neighbor classification rule with a reject option. IEEE Transactions on Systems Science and Cybernetics, SSC-6: 179-185.

Hills, M. (1966). Allocation rules and their error rates. Journal of the Royal Statistical Society, B28: 1-20.

Hoel, P. G. and Peterson, R. P. (1949). A solution to the problem of optimum classification. Annals of Mathematical Statistics, 20: 433-438.

Hudimoto, H. (1956). On the distribution-free classification of an individual into one of two groups. Annals of the Institute of Statistical Mathematics, 8: 105-112.

Hudimoto, H. (1957). A note on the probability of the correct classification when the distributions are not specified. Annals of the Institute of Statistical Mathematics, 9: 31-36.

Kendall, M. G. (1966). Discrimination and classification. In Multivariate Analysis: Proceedings of the International Symposium at Dayton, Ohio, New York: Academic Press, Inc., 165-185.

Kendall, M. G. and Stuart, A. (1976). The Advanced Theory of Statistics, Vol. III. London: Griffin, 350.

Kronmal, R. A. and Tarter, M. (1968). The estimation of probability densities and cumulatives by Fourier series methods. Journal of the American Statistical Association, 63: 925-952.

Lachenbruch, P. A. and Goldstein, M. (1979). Discriminant analysis. Biometrics, 35: 69-85.

Lindgren, Bernard W. (1976). Statistical Theory. New York: MacMillan Publishing Co., Inc.

Loftsgaarden, D. O. and Quesenberry, C. P. (1965). A nonparametric estimate of a multivariate density function. Annals of Mathematical Statistics, 36: 1049-1051.

Moore, D. S. and Henrichon, E. G. (1969). Uniform consistency of some estimates of a density function. Annals of Mathematical Statistics, 40: 1499-1502.

Morrison, D. F. (1976). Multivariate Statistical Methods. New York: McGraw-Hill.

Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33: 1065-1076.

Penrod, C. S. and Wagner, T. J. (1977). Another look at the edited nearest neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics, SMC-7: 92-94.

Penrod, C. S. and Wagner, T. J. (1979). Risk estimation for nonparametric discrimination and estimation rules: a simulation study. IEEE Transactions on Information Theory, IT-25: 753-758.

Press, S. J. and Wilson, S. (1978). Choosing between logistic regression and discriminant analysis. Journal of the American Statistical Association, 73: 699-705.

Randles, R. H., Broffitt, J. D., Ramberg, J. S. and Hogg, R. V. (1978). Discriminant analysis based on ranks. Journal of the American Statistical Association, 73: 379-384.

Rao, R. R. (1962). Relations between weak and uniform convergence of measures with applications. Annals of Mathematical Statistics, 33: 659-680.

Remme, J., Habbema, J. D. F., and Hermans, J. (1980). A simulative comparison of linear, quadratic and kernel estimation. Journal of Statistical Computation and Simulation, 10 (to appear).

Richards, L. E. (1972). Refinement and extension of distribution-free discriminant analysis. Applied Statistics, 21: 174-176.

Rogers, W. H. and Wagner, T. J. (1978). A finite sample distribution-free performance bound for local discrimination rules. Annals of Statistics, 6: 506-514.

Schwarz, S. C. (1967). Estimation of probability density by an orthogonal series. Annals of Mathematical Statistics, 38: 1261-1265.

Silverman, B. W. (1978). Choosing the window width when estimating a density. Biometrika, 65: 1-11.

Silverman, B. W. (1978). Weak and strong uniform consistency of the kernel estimate of a density and its derivatives. Annals of Statistics, 6: 177-184.

Specht, D. F. (1967). Generation of polynomial discriminant functions for pattern recognition. IEEE Transactions on Electronic Computers, EC-16: 308-319.

Specht, D. F. (1967). Vectorcardiographic diagnosis using the polynomial discriminant method of pattern recognition. IEEE Transactions on Bio-Medical Engineering, BME-14: 90-95.

Specht, D. F. (1971). Series estimation of a probability density function. Technometrics, 13: 409-424.

Stoller, D. S. (1954). Univariate two-population distribution-free discrimination. Journal of the American Statistical Association, 49: 770-777.

Swonger, C. W. (1972). Sample set condensation for a condensed nearest neighbor decision rule for pattern recognition. In Frontiers of Pattern Recognition, Ed. by S. Watanabe, Academic Press, New York, 511-519.

Tarter, M. E. and Kronmal, R. A. (1976). An introduction to the implementation and theory of nonparametric density estimation. The American Statistician, 30: 105-112.

Tarter, M. E. and Raman, S. (1971). A systematic approach to graphical methods in biometry. Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, IV. University of California Press, Berkeley, 199-222.

Ullman, J. R. (1974). Automatic selection of reference data for use in a nearest-neighbor method of pattern classification. IEEE Transactions on Information Theory, IT-20: 541-543.

Van Ness, J. (1979). On the effects of dimension in discriminant analysis for unequal covariance matrices. Technometrics, 21: 119-127.

Van Ness, J. and Simpson, C. (1976). On the effects of dimension in discriminant analysis. Technometrics, 18: 175-187.

Van Ryzin, J. (1966). Bayes risk consistency of classification procedures using density estimation. Sankhya, A-28: 261-270.

Wagner, T. J. (1971). Convergence of the nearest neighbor rule. IEEE Transactions on Information Theory, IT-17: 566-571.

Wahba, G. (1975). Optimal convergence properties of variable knot, kernel, and orthogonal series methods for density estimation. Annals of Statistics, 3: 15-29.

Wegman, E. J. (1972). Nonparametric probability density estimation. Technometrics, 14: 533-546.

Welch, B. L. (1939). Note on discriminant functions. Biometrika, 31: 218-220.

Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, SMC-2: 408-421.

Wolfowitz, J. (1954). Generalization of the theorem of Glivenko-Cantelli. Annals of Mathematical Statistics, 25: 131-138.

Wolfowitz, J. (1960). Convergence of the empiric distribution on half-spaces. Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. Stanford University Press, Stanford, California, 504-507.