UNIVERSITY OF NORTH CAROLINA
Department of Statistics
Chapel Hill, N. C.
ON THE OPTIMUM PROPERTIES OF SOME
CLASSIFICATION RULES
by
Somesh Das Gupta
August, 1962
Contract No. AF 49(638)-213
Contract No. NSF 0-5824
For the problem of classification into one of two normal
populations (univariate and also multivariate), minimax,
admissible, unbiased, consistent rules are obtained under
the homoscedasticity assumption and with a loss function
based on Mahalanobis-distance; lower bounds are obtained
for the probability of correct classification under the
maximum likelihood rule. Admissible rules are obtained
for classification into several normal populations, in
particular, when the populations are identified as different cells of a statistical design. Non-parametric
classification rules, based on distance functions between
distribution functions and also on Wilcoxon-statistic,
are proposed and their consistencies are shown.
This research was supported in part by the National Science Foundation and
in part by the Mathematics Division of the Air Force Office of Scientific Research.
Institute of Statistics
Mimeo Series No. 333
ACKNOWLEDGMENTS

It is a privilege and pleasure to acknowledge my sincerest gratitude to Professor S. N. Roy for his inspiring lectures, constant encouragement and valuable suggestions. I would also like to express my grateful appreciation to Dr. P. K. Bhattacharya for many encouraging discussions during the course of this work; and to Professor Wassily Hoeffding and Professor W. J. Hall for going through the manuscript and offering many helpful suggestions. I owe my sincere thanks to the U. S. Educational Foundation in India for the Fulbright award and to the Air Force Office of Scientific Research for financial support. I am grateful to Professor L. V. Jones, the Director of the Psychometric Laboratory, for providing financial assistance through the office of the National Science Foundation.

Finally, thankful appreciation is due to Miss Martha Jordan and Mrs. Doris Gardner for the careful work of typing the manuscript; and to the secretarial staffs of the Department of Statistics and of the Psychometric Laboratory at Chapel Hill for the help received from them at various stages of this work.
TABLE OF CONTENTS

ACKNOWLEDGMENTS

INTRODUCTION

Chapter I    DEFINITIONS AND ASSUMPTIONS IN THE CLASSIFICATION PROBLEM AND SOME RESULTS IN THE THEORY OF STATISTICAL DECISION FUNCTIONS
    1.0. Introduction
    1.1. Some definitions and some results in the theory of statistical decision functions
    1.2. Invariant decision rules
    1.3. The classification problem
    1.4. Minimax decision rules and the role of the triangular group

Chapter II   CLASSIFICATION INTO UNIVARIATE NORMAL POPULATIONS
    2.0. Summary
    2.1. Classification into two univariate normal populations
    2.2. Classification into two univariate normal populations with common variance when the individual to be classified does not necessarily belong to one of them
    2.3. Classification into more than two univariate normal populations
    2.4. Concluding remarks

Chapter III  CLASSIFICATION INTO MULTIVARIATE NORMAL POPULATIONS
    3.0. Summary
    3.1. Classification into two multivariate normal populations with known and common dispersion matrix
    3.2. Classification into more than two multivariate normal populations with known dispersion matrices
    3.3. Classification into one of two multivariate normal populations with common and unknown dispersion matrix
    3.4. Concluding remarks

Chapter IV   SOME SPECIAL CLASSIFICATION PROBLEMS
    4.0. Summary
    4.1. Classification into univariate structured normal populations
    4.2. Classification into structured multivariate normal populations
    4.3. Classification into k multivariate normal populations - a special case

Chapter V    NONPARAMETRIC CLASSIFICATION RULES
    5.0. Summary
    5.1. Minimum-distance classification rules
    5.2. Classification rule based on Wilcoxon-statistic
    5.3. Concluding remarks

BIBLIOGRAPHY
INTRODUCTION¹

The classification problem was originally formulated by R. A. Fisher [8]² in the language of discriminatory analysis. Later A. Wald [34] formulated a special case of the classification problem in the set-up of statistical decision theory. The problem, in its general form, can be posed as follows.

Let π_1, π_2, ..., π_k denote k physically distinguishable populations. We have n experimental units which are random outcomes from a mixture of these k populations. The problem is to classify each of these experimental units into one of these populations on the basis of a set of p responses on the experimental units. Let X (p x 1) be a random vector denoting the set of p responses on which the classification problem is based. Let the distribution function of X on π_j be F_j (j = 1, ..., k). The classification problems can be divided into two major categories.

(a) Simple classification problem:

In this case, it is known beforehand that all the n experimental units come from the same population. Hence, the problem can be posed in terms of a k-decision problem. In this case, we may assume n = 1, because for higher values of n the essential structure of the problem remains unaltered.

¹ This research was supported in part by the National Science Foundation and in part by the Mathematics Division of the Air Force Office of Scientific Research.

² The numbers in square brackets refer to the bibliography listed at the end.
(b) Compound or simultaneous classification problem:

In this case the n experimental units do not necessarily belong to the same population, and the problem can be formulated in terms of a nk-decision problem.

A major portion of the statistical literature on classification problems is devoted to case (a), especially to the case when the F_i's are different normal distributions. The present work is also mainly along this line. We shall now discuss different subcases of this case.
When the F_i's are completely known, a minimax classification rule and the minimal complete class of classification rules can easily be obtained [1] following the general principles of Wald [33]. This case has been discussed by Wald [34], Von Mises [32] and Anderson [1], when the F_i's are known normal distributions with the same covariance matrix. Recently, Anderson and Bahadur [2] obtained the complete class of rules for the classification into two multivariate normal distributions with different covariance matrices after restricting to linear classification rules. In most situations, the rules in this case depend on the ratios of the likelihoods.
When the F_i's are known except for a set of parameters, we must have random samples from each of the populations π_i to get partial information about the parameters. Wald [34] and Anderson [1] suggested a class of heuristic rules which is obtained by replacing the parameters in the class of Bayes rules, obtained assuming the parameters are known, by their respective sample estimates. Fix and Hodges [9] proved that if the parameters are replaced by their respective consistent (uniform) estimates then the resulting rules will also be consistent (uniform).
Roy and Anderson [1] suggested the following heuristic rule, which might be termed the maximum likelihood rule. Assuming that the observations to be classified come from π_i, the joint likelihood is maximized under the variation of the parameters; let this maximum value be L_i. Then the observations are classified into π_i if L_i is maximum among all L_j's (j = 1, ..., k). Rao [25] suggests, as one intuitive approach, the following procedure. Test the i-th composite hypothesis that the observation comes from π_i (i = 1, ..., k) separately, without regard to alternatives, and accept the hypothesis with the highest observed significance level. It may be noted that in many situations there are inter-relations between all these different heuristic approaches. Hoel and Peterson [13] found that if a Bayes procedure, derived when the parameters are known, is modified by replacing the parameters by their consistent estimates, then the performance of this modified rule tends, for large samples, to that of the Bayes rule.
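The maximum likelihood rule just described can be sketched numerically. The following is a minimal illustration only, assuming univariate normal populations with unknown means and a common unknown variance; the function and variable names are not from the thesis.

    import numpy as np

    def max_likelihood_classify(x0, samples):
        """Classify one observation x0 into one of k univariate normal populations
        (unknown means, common unknown variance) by maximizing, over i, the joint
        likelihood computed under the hypothesis that x0 came from population i."""
        k = len(samples)
        n_total = 1 + sum(len(s) for s in samples)
        log_liks = []
        for i in range(k):
            resid_sq = 0.0
            for j, s in enumerate(samples):
                # Under hypothesis i, x0 is pooled with sample i when estimating mean j.
                data_j = np.append(s, x0) if j == i else s
                mu_j = data_j.mean()                      # ML estimate of the j-th mean
                resid_sq += ((data_j - mu_j) ** 2).sum()
            sigma2 = resid_sq / n_total                   # ML estimate of the common variance
            # Maximized log-likelihood under hypothesis i (up to an additive constant).
            log_liks.append(-0.5 * n_total * np.log(sigma2))
        return int(np.argmax(log_liks))                   # 0-based index of the chosen population

    # Example: two populations with means 0 and 3.
    rng = np.random.default_rng(0)
    s1, s2 = rng.normal(0, 1, 20), rng.normal(3, 1, 20)
    print(max_likelihood_classify(2.6, [s1, s2]))         # typically 1, i.e. the second population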
A great bulk of work has been done on the problem of classification into two multivariate normal populations with unknown mean vectors and common dispersion matrix (known or unknown). In this situation, Rao [25] derived a rule having some optimal properties in the neighborhood of equality of the two populations, with errors of misclassification depending on the Mahalanobis-distance between the two populations. All the workers in this line attacked the problem from the standpoint of minimization of error probabilities. The present work is the first attempt to attack this problem from a decision-theoretic standpoint. When the common dispersion matrix is known, John [15, 16] obtained explicitly the distributions of the statistics involved in the rules suggested by Rao, Anderson and Wald, and gave simplified formulae to calculate the error probabilities. Wald [34] first attempted to obtain the distribution of the statistic involved in his heuristic rule when the common dispersion matrix is unknown. Later Sitgreaves [30] completed this work, and she also found the explicit distribution of Anderson's classification statistic [1] in this case. The explicit distribution involves a five-fold infinite series, but later Sitgreaves and Bowker [30] obtained an asymptotic series-expansion of this distribution function in powers of 1/N, where N is the sample size from each of the two populations.

In Chapter II of this work, it is found that the maximum likelihood rule is minimax and admissible for the problem of classification into two univariate normal distributions with unknown means and unknown but common variance under a general structure of the loss function. It is also found that this rule has some other optimum properties, e.g., unbiasedness, monotonicity, etc. These results are generalized in Chapter III to the multivariate case when the common dispersion matrix is known. It is
further shown in Chapter III that the same properties hold for the maximum likelihood rule in an invariant class when the common dispersion matrix is unknown. It is proved that, under a mild assumption on the loss function, the maximum likelihood rules in the above cases are unique minimax rules. When the dispersion matrix is known, two lower bounds are found (in Chapter III) for the probability of correct classification for the maximum likelihood rule. A more complicated lower bound was found by John [14]. Kudo [20] has shown, in the case of classification into two populations when the populations are univariate normal with common variance or multivariate normal with known and common dispersion matrix, that the maximum likelihood rule maximizes the probability of correct classification among a symmetric-invariant (defined in some sense) class of rules. This result has been extended in Chapter III to the case when the common dispersion matrix is unknown. It may be mentioned that, in each of the above cases, when the sample sizes from π_1 and π_2 are equal, the maximum likelihood rule is the same as the heuristic rule suggested by Rao (from the level of significance standpoint) [25], and it is also the same as the minimax rule, derived assuming the parameters as known, when it is modified by replacing the parameters by their respective maximum-likelihood estimates.

In Chapter II a class of admissible rules is derived for the problem of classification into k (> 2) univariate normal populations with the same variance. In Chapter III, a class of admissible rules is derived for the multivariate k-population classification problem when the distributions are normal with the same known dispersion matrix. Ellison [6] gave a class of admissible rules under a simple loss function, assuming that F_i = N(μ_i, Σ_i) (i = 1, ..., k) with known Σ_i's and also assuming that the individual to be classified comes from N(μ_0, Σ_0) with known Σ_0 and μ_0 = μ_i for only one i (i = 1, ..., k). In Chapter IV a class of admissible rules is derived in this situation with the modified assumption that (μ_0, Σ_0) = (μ_i, Σ_i) for only one i (i = 1, ..., k); moreover, it is shown that the maximum likelihood rule is ε-admissible [33] for this problem. In Chapter IV, it has been found that for the problem of classification into k univariate normal distributions with the same unknown variance and where the k means are linearly dependent (in a known way), the maximum likelihood rule is admissible. This result is also generalized to the multivariate case when the distributions have the same known dispersion matrix. This situation occurs when the k populations are identified as k cells of a statistical design, e.g., randomized block, B.I.B., P.B.I.B. and factorial designs, etc.
Not much work has been done on the non-parametric classification problem. For the classification into two univariate discrete distributions, Matusita [23] gave a minimum-distance rule based on the Matusita-distance and proved its consistency. Hudimoto [11] and Stoller [31] gave classification rules, for this problem, based on sample distribution functions and studied their large-sample performances.
xi
Ih'C~apter
V, it has been found that for the problem of
classification into k populations, a
minim~distance
classification
rule is consistent (uniform) if the distance function is consistent
(uniform).
A lower bound for the probability of correct classifica-
tion for the minimum Kolmogorov-distance rule is found.
For classi-
fication into two univariate populations, a rule based on the
Wilcoxon-statistic ~35_7 is suggested and its consistency is proved.
The compound classification problem is not discussed in this
work.
However, we state some results in this line due to Robbins
and Lehmann.
Robbins ~26_7 considered this problem for classifica-
tion into two known univariate normal populations and in this connection he introduced the concepts of "empirical Bayes rule ll and
"asymptotically subminimax rule".
Lehmann ~22_7 considered the
problem of classifying each of n independent vector-valued random
observations into one or the other of two populations having known
density functions.
The rules given by Robbins and Lehmann have good
large sample properties but they are not necessarily admissible.
M. V. Johns ~30_7 used the idea of Robbins to derive some nonparametric classification rules for a very specialized form of the
classification problem.
•
CHAPTER I

DEFINITIONS AND ASSUMPTIONS IN THE CLASSIFICATION PROBLEM
AND SOME RESULTS IN THE THEORY OF STATISTICAL DECISION FUNCTIONS

1.0 Introduction

In this chapter a special formulation of statistical decision theory has been posed so that the classification problem can be viewed from this standpoint. Admissible, Bayes, minimax and most-stringent decision rules are defined for an easy comprehension of this work, and some results in statistical decision theory are given which will be used in the subsequent chapters. Then, the classification problem is formulated and consistent and unbiased classification rules are defined; some loss functions and some classification rules are defined for the problem of classification into normal populations. In the last section, a few results on minimax decision rules are discussed with the hope that these results might help to solve some unsolved problems in the classification area and also in the area of normal multivariate inference problems.
1.1 Some definitions and some results in the theory of statistical decision functions

We use the set-up and notations of a fixed sample size decision problem. A random variable (or a vector) X takes on values in a space 𝒳, which is the underlying sample space with Borel field ℬ. The family of possible states of nature is a class of probability measures on (𝒳, ℬ). In most of the situations, we shall assume that this family is { P_θ : θ ∈ Θ }, where Θ is the parameter space. The action space A is assumed to be finite and is denoted by A = { a_1, a_2, ..., a_k }. The loss function is taken to be a bounded, non-negative, k-dimensional vector-valued function L on Θ, where L = (L_1, ..., L_k) and L_i(θ) is the loss for taking the action a_i when θ ∈ Θ obtains. Since the action space is finite, there is a 1-1 correspondence between mixed strategies and randomized strategies.

Let D be the class of decision functions φ, where φ is a k-dimensional vector-valued measurable function on 𝒳 such that

    0 ≤ φ_i(x) ≤ 1   (i = 1, ..., k),   Σ_{i=1}^k φ_i(x) = 1

for all x ∈ 𝒳, and φ_i(x) is the probability of taking the action a_i under the rule φ, given the observation x (i = 1, ..., k). The risk of a decision function φ ∈ D, when θ ∈ Θ obtains, is given by

    r(θ, φ) = Σ_{i=1}^k L_i(θ) E_θ φ_i(X).

Definition 1.1.1  A decision function φ_0 in D is said to be admissible in D if there does not exist any decision rule φ in D, other than φ_0, such that

    r(θ, φ) ≤ r(θ, φ_0)   for all θ ∈ Θ,

and the strict inequality holds for at least one θ ∈ Θ.

Definition 1.1.2  A decision function φ_0 in D is said to be minimax in D if

    sup_{θ∈Θ} r(θ, φ_0) ≤ sup_{θ∈Θ} r(θ, φ)   for all φ ∈ D.

Definition 1.1.3  Let ξ be a probability measure on Θ. Then a decision function φ_0 in D is said to be a Bayes rule in D against ξ if

    ∫ r(θ, φ_0) dξ(θ) ≤ ∫ r(θ, φ) dξ(θ)   for all φ ∈ D.

Definition 1.1.4  A decision rule φ_0 in D is said to be most stringent (or, minimax regret) in D if

    sup_{θ∈Θ} [ r(θ, φ_0) - r(θ, D) ] ≤ sup_{θ∈Θ} [ r(θ, φ) - r(θ, D) ]   for all φ ∈ D,

where

    r(θ, D) = inf_{φ∈D} r(θ, φ).
Next we state some theorems without proof.

Theorem 1.1.1  If, for every θ ∈ Θ, P_θ is absolutely continuous with respect to a σ-finite measure μ on (𝒳, ℬ), then for any probability measure ξ on Θ a Bayes rule φ^(ξ) in D against ξ is given by

    φ_i^(ξ)(x) = 1   for an i for which   f_i(x) = min_{1≤j≤k} f_j(x),

and

    φ_i^(ξ)(x) = 0   for all other i   (i = 1, ..., k),

where

    f_i(x) = ∫ L_i(θ) p_θ(x) dξ(θ)   (i = 1, ..., k)

and p_θ is the density corresponding to P_θ. If μ(C) = 0, where

    C = { x : f_i(x) = f_j(x) for some (i, j), i ≠ j },

then the Bayes rule φ^(ξ) is unique (a.e.) except for μ-equivalence (μ-equivalence of two decision functions implies their risk-equivalence). For a proof see [ ].

Corollary 1.1.1  If φ is a unique (a.e. [μ]) Bayes rule against a probability measure ξ on Θ, then φ is an admissible rule.
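Theorem 1.1.1 reduces the Bayes rule to a pointwise comparison of the integrals f_i(x). The sketch below is an illustration only: the discrete prior, loss matrix and parameter values are assumptions made for the example and are not from the thesis.

    import numpy as np
    from scipy.stats import norm

    # With a discrete prior xi over a finite set of parameter values, the Bayes rule
    # takes the action a_i minimizing f_i(x) = sum_theta L_i(theta) p_theta(x) xi(theta).
    thetas = np.array([-1.0, 0.0, 1.0])       # assumed means of N(theta, 1)
    xi = np.array([0.25, 0.5, 0.25])          # assumed prior probabilities
    L = 1.0 - np.eye(3)                       # loss 1 for declaring the wrong theta, 0 otherwise

    def bayes_action(x):
        p = norm.pdf(x, loc=thetas, scale=1.0)    # p_theta(x) for each theta
        f = L @ (p * xi)                          # f_i(x) = sum_theta L_i(theta) p_theta(x) xi(theta)
        return int(np.argmin(f))                  # index of the action with minimal weighted loss

    print(bayes_action(1.5))   # -> 2, the action corresponding to theta = 1.0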
$
Theorem 1.1.2 In a fixed experiment decision problem with finite action space, there exists a minimax strategy if
(i)
the loss function
L is bounded
(ii) the distributions [pg :
0} are absolutely con-
,,€
tinuous with respect to a finite dimensional Lebesgue measure
(X ,
93 ),
For a proof, see
Theorem 1.1.3
-
sets of
0
and
(i)
X
where
~
A(A
is an index set.
for each A E
~A. on ~A. against vhich
dependent of A,
(ii)
Then
Proof.
A,
«P*
cfl*
vhere
8'\~ 's
are disjoint sub-
there exists a prior distribution
is a Bayes rule in ~ and «P*
$*) is constant for
$* is a minimax rule in
Since
~
If
and
r (0,
on
is a finite-dimensional Euclidean sample space.
1'27 or 127.
Let /HI = LJ 0,\'
A
Il
Q
E ~A ' for all
Ii>.
is a Bayes rule in
q>
against
SA' we have
A.
is in-
•
5
~
reO,
®
$*)
d
e~(O)
~
<
-
r( 0,
d
4»
g~ (0),
for all
lj>
E
<P
@
By condition (ii),
~
reO,
$*) ds~(O) = ~ reO, ~*) d g~(O) =
~
Hence for any
sup
$*) ~ ~ r( 0,
reO,
$)
d
s~ (0 )
®
<
r( 0, $), for all
su.p
E w
$
e €@
$*
sup
is independent of
$*) =
reO,
e f€J
$*
Thus
1.2
$*) .
E J\.
e E0;A
Since
reO,
e € 0;\
®A
®
sup
have
~,we
sup
sup
!lEA
BE®
reO,
$*) <
.A
is a minimax rule in
$),
sup reO,
-e,of@
for all
E
$
CD
cP •
1.2 Invariant decision rules

Definition 1.2.1  A group of transformations G on 𝒳 is said to be admissible [21] if

(i)   each g ∈ G is a measurable transformation from 𝒳 into 𝒳;

(ii)  the class of probability measures { P_θ : θ ∈ Θ } is closed under G; i.e., if X has the probability measure P_θ (θ ∈ Θ), then gX has the probability measure P_θ' for some θ' ∈ Θ, denoted by θ' = ḡθ. Thus G induces a group of transformations Ḡ on Θ which is homomorphic to G;

(iii) L(ḡθ, a) = L(θ, a) for all θ ∈ Θ, g ∈ G, a ∈ A;

(iv)  G is a measurable group [12], [17].

Note: Every locally compact Hausdorff group is a measurable group [15].

Definition 1.2.2  A decision function φ in D is said to be invariant under a group G if

    φ(gx) = φ(x)   for all x ∈ 𝒳, g ∈ G.

A decision function φ in D is said to be almost invariant under G if

    φ(gx) = φ(x)   for all x ∈ 𝒳 - N_g, g ∈ G,

where the exceptional null set N_g depends on g. A decision function φ ∈ D is said to be equivalent to an invariant decision function if φ is invariant under G except possibly on a null set N. We shall denote by D_G the class of decision functions φ ∈ D which are invariant under G.

Definition 1.2.3  Consider a group of transformations G on 𝒳. A measurable function f on 𝒳 is said to be a maximal invariant under G if

(i)  f is invariant under G, i.e., f(gx) = f(x) for all x ∈ 𝒳, g ∈ G, and

(ii) f(x_1) = f(x_2) implies x_1 = g x_2 for some g ∈ G.

Next we quote two theorems from Lehmann [21].

Theorem 1.2.1  Let f be a maximal invariant function on 𝒳 under G. Then a necessary and sufficient condition for a measurable function φ on 𝒳 to be invariant under G is that φ depends on x only through f(x).

Theorem 1.2.2  If f is an invariant function on 𝒳 under G, and if v is a maximal invariant function on Θ under Ḡ, then the distribution of f(X) depends on θ only through v(θ).
1.3 The classification problem

Let π_0, π_1, ..., π_k be k+1 populations in which the p-dimensional random vector X (p x 1) is distributed with distribution functions F_0, F_1, ..., F_k respectively. In this problem, the action space A is denoted by A = (a_1, a_2, ..., a_k), where a_i denotes the decision F_0 = F_i (i = 1, ..., k). In most of the situations we shall assume that F_0 = F_i for some i (i = 1, ..., k), and that F_0 = F_i for only one value of i; but we shall also discuss the case when F_0 may be different from F_1, F_2, ..., F_k. We shall only consider the case when the F_i's are not completely known and we have a random sample of size n_i (i = 0, 1, ..., k) from π_i (i = 0, 1, ..., k). Let z denote the vector of all observations and Z the space of z. Let D be the class of classification rules φ, where φ ∈ D is a k-dimensional vector-valued measurable function such that

    0 ≤ φ_i(z) ≤ 1   (i = 1, ..., k),   Σ_{i=1}^k φ_i(z) = 1   for all z ∈ Z,

and φ_i(z) denotes the probability of taking the action a_i given z. Let F' = (F_1, F_2, ..., F_k) and ℱ be the space of F where F_1, F_2, ..., F_k are not all equal. Let Δ be a distance function between two p-dimensional distribution functions.
We define the loss function as follows: L_i is the loss for the action a_i, where

    L_i = ℓ( Δ(F_0, F_i) )   (i = 1, ..., k),

and ℓ is a bounded monotonic increasing function with ℓ(0) = 0. For any φ ∈ D, the risk is

(1.3.4)    r(F_0, F; φ) = Σ_{i=1}^k ℓ( Δ(F_0, F_i) ) E[ φ_i | F_0, F ].

When F_0 = F_i, we denote the risk by

    r(i, F; φ) = Σ_{j≠i} ℓ( Δ(F_i, F_j) ) E[ φ_j | F_0 = F_i, F ].

Definition 1.3.1  A classification rule φ in D is said to be an equal probability of correct classification rule (EPCC) if

    E[ φ_1 | F_0 = F_1, F ] = E[ φ_2 | F_0 = F_2, F ] = ... = E[ φ_k | F_0 = F_k, F ]

for all F ∈ ℱ.

Definitions 1.3.2  A classification rule φ in D is said to be unbiased (of type I) if

    E[ φ_i | F_0 = F_i, F ] ≥ E[ φ_i | F_0 = F_j, F ]   for all j ≠ i,

for all i (i = 1, ..., k) and F ∈ ℱ. A classification rule φ in D is said to be unbiased (of type II) if

    E[ φ_i | F_0 = F_i, F ] ≥ E[ φ_j | F_0 = F_i, F ]   for all j ≠ i   (i = 1, ..., k; F ∈ ℱ).

Note that for k = 2 these two definitions are equivalent.

Definitions 1.3.3  A classification rule φ in D is said to be risk-consistent if, for each i (i = 1, ..., k) and F ∈ ℱ,

    r(i, F; φ) ---> 0   as   n_i (i = 0, 1, ..., k) ---> ∞.

A classification rule φ is said to be error-consistent if, for each i (i = 1, ..., k) and F ∈ ℱ,

    E[ φ_j | F_0 = F_i, F ] ---> 0   for each j ≠ i,   as   n_i (i = 0, 1, ..., k) ---> ∞.

When the loss function is bounded, error-consistency implies risk-consistency.

Definitions 1.3.4  A classification rule is said to be monotonic increasing with respect to a distance Δ if, for each i, E[ φ_i | F_0 = F_i, F ] increases as Δ(F_i, F_j) increases (for j ≠ i). We shall say that φ is Δ-consistent if, for each i (i = 1, ..., k),

    E[ φ_i | F_0 = F_i, F ] ---> 1   as   Δ(F_i, F_j) ---> ∞   (j ≠ i).
Classification into normal populations

In this case, F_i (i = 0, 1, ..., k) is the p-variate normal distribution with the mean vector μ_i (p x 1) and the dispersion matrix Σ (p x p). We shall consider two cases, namely

(i)  Σ is known,
(ii) Σ is unknown.

Let

(1.3.6)    Δ²_{0i} = (μ_0 - μ_i)' Σ^{-1} (μ_0 - μ_i)   (i = 1, ..., k),

and

    δ_{0i} = (1/n_0 + 1/n_i)^{-1/2} Δ_{0i}.

Let ℓ be a real-valued function defined on the right half of the real line such that ℓ(0) = 0; moreover, we assume ℓ to be bounded above and monotonic increasing. Then we define two types of loss functions as follows.

Loss function of type 1:

    L_i(μ_0, μ_1, ..., μ_k, Σ) = ℓ( Δ_{0i} )   (i = 1, ..., k).

Loss function of type 2:

    L_i(μ_0, μ_1, ..., μ_k, Σ) = ℓ( δ_{0i} )   (i = 1, ..., k).

When Σ is known, we drop Σ from the arguments of L_i. In the univariate case, we modify (1.3.6) as follows:

    Δ_{0i} = | μ_0 - μ_i | / σ,

where σ² is the common variance. When we define ℓ as follows:

    ℓ(u) = 0   for u = 0,   ℓ(u) = 1   for u > 0,

then the loss function is called the simple loss function. It may be noted that the above types of loss functions include this particular case.
Let x̄_i (i = 0, 1, ..., k) denote the sample mean vector based on a sample of size n_i (i = 0, 1, ..., k) from π_i (i = 0, 1, ..., k); let S denote the maximum likelihood estimator (or, sometimes we shall consider the pooled unbiased estimator) of Σ. Define

(1.3.9)     d²_{0i} = (x̄_0 - x̄_i)' Σ^{-1} (x̄_0 - x̄_i)   for case (i),

(1.3.10)    d²_{0i} = (x̄_0 - x̄_i)' S^{-1} (x̄_0 - x̄_i)   for case (ii),

(i = 1, ..., k). In the univariate case, we may modify d_{0i} to | x̄_0 - x̄_i |; this may be done, without loss of generality, for the following rules. Define also the modified distances

    d'_{0i} = (1/n_0 + 1/n_i)^{-1/2} d_{0i}   (i = 1, ..., k).

Definitions 1.3.5  The decision rule φ* is called the minimum-distance rule (MD), where

    φ*_i(z) = 1   if   d_{0i} = min_{1≤j≤k} d_{0j},   and   φ*_i(z) = 0   otherwise   (i = 1, ..., k).

The decision rule φ** is called the modified minimum distance rule (MMD), where

    φ**_i(z) = 1   if   d'_{0i} = min_{1≤j≤k} d'_{0j},   and   φ**_i(z) = 0   otherwise   (i = 1, ..., k).

Let L(z, ω) denote the likelihood of the parameters ω = (μ_0, μ_1, ..., μ_k, Σ); we drop Σ when Σ is known. Define

    L_i = sup L(z, ω),   the supremum being taken under the restriction μ_0 = μ_i   (i = 1, ..., k).

Definition 1.3.6  The decision rule ψ is called the maximum likelihood rule (ML), where

    ψ_i(z) = 1   if   L_i = max_{1≤j≤k} L_j,   and   ψ_i(z) = 0   otherwise   (i = 1, ..., k).

Lemma 1.3.1  For the problem of classification into k normal distributions with common dispersion matrix, the ML rule and the MMD rule are equivalent. For a proof, see Anderson [1].
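As a concrete illustration of Definition 1.3.5 and Lemma 1.3.1, the following sketch computes the modified minimum distance (equivalently, maximum likelihood) rule for k normal populations with a known common dispersion matrix. It is an illustration only; the function name, argument names and example numbers are not from the thesis.

    import numpy as np

    def mmd_classify(x0, sample_means, n_sizes, sigma_inv):
        """x0: observation to classify, shape (p,); sample_means: list of k arrays x_bar_i;
        n_sizes: (n0, n1, ..., nk); sigma_inv: inverse of the common dispersion matrix."""
        n0 = n_sizes[0]
        d_mod = []
        for i, xbar in enumerate(sample_means, start=1):
            diff = x0 - xbar
            d2 = diff @ sigma_inv @ diff                       # squared Mahalanobis distance d_{0i}^2
            d_mod.append(d2 / (1.0 / n0 + 1.0 / n_sizes[i]))   # modified by the sample-size factor
        return int(np.argmin(d_mod)) + 1                       # population with the smallest modified distance

    # Example with p = 2, k = 2 and identity dispersion matrix.
    means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
    print(mmd_classify(np.array([1.6, 1.8]), means, (1, 25, 25), np.eye(2)))   # -> 2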
1.4 Minimax decision rules and the role of the triangular group

Some transformation groups.  Consider a space 𝒳 = { ξ = (ξ_1, ..., ξ_N) : ξ_i ∈ E^p, i = 1, ..., N }, where E^p is the p-dimensional Euclidean space.

(i) Additive group of E^p.  G_p is the additive group in 𝒳 if to each g ∈ G_p there corresponds one-to-one a p-dimensional vector b in E^p, the group operation being defined as vector addition. For ξ ∈ 𝒳,

    g ξ = ( ξ_1 + b, ξ_2 + b, ..., ξ_N + b ).

(ii) Linear group in E^p.  A_p is the linear group defined as

    A_p = { A : A is a non-singular p x p matrix with real elements },   A ξ = ( A ξ_1, A ξ_2, ..., A ξ_N ).

(iii) Lower-triangular group in E^p.  T_p is the lower-triangular group in E^p, where

    T_p = { T : T is a p x p matrix with real elements such that t_ii > 0 for all i and t_ij = 0 for i < j }.

Note that T_p is a subgroup of A_p.

(iv) Scale group in E^p.  The scale group consists of the matrices C = diag(c_1, ..., c_p) with c_i > 0 for all i.

Lemma 1.4.1  T_p is a solvable group.

Proof.  For simplicity, consider p = 4. Consider the following chain of subgroups of T_4: T_4 itself; the subgroup of matrices with unit diagonal; the subgroup of these in which, in addition, the entries on the first sub-diagonal are zero; the subgroup in which the entries on the first two sub-diagonals are zero; and finally the group consisting of the identity matrix alone. Then it is seen that each group in this chain is a normal divisor of the preceding one, and the successive quotient groups are Abelian groups. Hence T_p is solvable.
Lemma 1.4.2  A left invariant Haar measure in T_p is

    dμ(T) = dT / ∏_{i=1}^p (t_ii)^i,   T ∈ T_p,

where dT denotes Lebesgue measure on the non-zero elements of T. The proof of this is trivial.

Condition 1.4.1 (Kiefer)  A condition for a topological group.  Let G be a measurable group and let μ be a left invariant Haar measure on the σ-ring of subsets of G. There exists a sequence { G_n } of measurable subsets of G with

    0 < μ(G_n) < ∞   (n = 1, 2, ...),

and such that, for every g ∈ G,

    lim_{n→∞}  μ( g G_n Δ G_n ) / μ( G_n ) = 0.

For the group-theoretic concepts used above, we refer to [12, 24].
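As a small numeric sanity check of Lemma 1.4.2 (an illustration only, not part of the thesis), the sketch below verifies, at a few random group elements, the left invariance of the measure with density 1/(t_11 t_22^2) for p = 2: under T -> A T the density times the Jacobian of the translation reproduces the original density.

    import numpy as np

    rng = np.random.default_rng(1)

    def h(T):
        # Density of the left Haar measure for p = 2: 1 / (t11 * t22^2).
        return 1.0 / (T[0, 0] * T[1, 1] ** 2)

    def random_lower_triangular():
        t11, t22 = rng.uniform(0.5, 2.0, size=2)   # positive diagonal
        t21 = rng.uniform(-1.0, 1.0)
        return np.array([[t11, 0.0], [t21, t22]])

    for _ in range(3):
        A, T = random_lower_triangular(), random_lower_triangular()
        jacobian = A[0, 0] * A[1, 1] ** 2          # Jacobian of T -> A T equals a11^1 * a22^2
        print(np.isclose(h(A @ T) * jacobian, h(T)))   # True each time: left invariance holds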
An extension of the Hunt-Stein theorem

Theorem 1.4.1  Let G be an admissible group for a given decision problem. Let ψ be a minimax rule in D_G. If G satisfies the condition 1.4.1, then ψ will also be a minimax rule in D, provided such a minimax rule exists. For a proof of this theorem, we refer to [17]. A simple proof can easily be obtained following the arguments used in Lehmann's book [21]. It may be mentioned that when G is "σ-compact", condition 1.4.1 implies the Hunt-Stein condition [21].

Some comments

Since T_p is a solvable group, it follows from the lemma 1.4.2 and from the results of Kiefer [17] that T_p satisfies the condition 1.4.1. Hence, if a decision rule is minimax in D_T, then it will also be minimax in D, provided such a minimax rule exists. In most of the normal multivariate problems, the conditions of the theorem 1.1.2 will be satisfied and minimax rules will exist. In particular, for the normal multivariate classification problem with the present set-up, a minimax rule in D does exist.

In many normal multivariate problems, we find some optimum rules in the class invariant under the linear group. Since the linear group (p > 1) does not satisfy the condition 1.4.1 [21], we cannot apply the theorem 1.4.1 to find a minimax rule in D. But this may not be the case if we restrict to the class invariant under T_p.

Wald [33] gave general conditions (e.g. compact parameter space) under which a minimax rule is a Bayes rule. If we want to show that a given rule φ, invariant under T_p, is not minimax in D, then, subject to the conditions of Wald, it will be enough to show that there does not exist a prior distribution for which φ is a Bayes rule in D_T. These ideas are not pursued further in this thesis.
CHAPTER II

CLASSIFICATION INTO UNIVARIATE NORMAL POPULATIONS

2.0 Summary

For the classification into one of two univariate normal populations with unknown means and common, unknown variance, it is shown that the maximum likelihood rule is minimax, admissible, unbiased, monotone and consistent. It is also shown that the maximum likelihood rule is admissible when the individual to be classified does not necessarily belong to one of the two populations. A class of admissible rules has been obtained for classification into more than two univariate normal populations.

2.1 Classification into two univariate normal populations

Let π_0, π_1, π_2 denote three populations in which a random variable X is normally distributed with means μ_0, μ_1 and μ_2 respectively and with common variance σ²; the μ_i's and σ are unknown except that μ_0 = μ_i for i = 1 or 2 and μ_1 ≠ μ_2. Let x̄_i (i = 0, 1, 2) be the sample mean based on a random sample of size n_i (i = 0, 1, 2) from π_i (i = 0, 1, 2); we will assume, for simplicity, that n_0 = 1. Let S be the pooled unbiased estimator of σ².

It will be sufficient to consider the class of classification rules which are functions of the sufficient statistics x̄_0, x̄_1, x̄_2, S. We make a 1-1 transformation from (x̄_0, x̄_1, x̄_2) to (z_1, z_2, z_3) as follows.
(2.1.1)    z_i = (1 + 1/n_i)^{-1/2} (x̄_0 - x̄_i)   (i = 1, 2),

(2.1.2)    z_3 = (1 + n_1 + n_2)^{-1/2} (x̄_0 + n_1 x̄_1 + n_2 x̄_2).

Without any loss of generality, we can consider the class of rules as functions of (z_1, z_2, z_3, S). The sample space is denoted by

    Z = { z = (z_1, z_2, z_3, s) : s > 0 },

and the class of classification rules by D, where φ ∈ D is a 2-dimensional vector-valued measurable function of z such that

(2.1.3)    0 ≤ φ_i(z) ≤ 1   (i = 1, 2),   φ_1(z) + φ_2(z) = 1   for all z;

φ_i(z) is the probability of taking the action a_i [i.e., deciding μ_0 = μ_i] under the rule φ, given z. We can consider the parameters as (θ_1, θ_2, θ_3, σ), where

(2.1.4)    θ_i = (1 + 1/n_i)^{-1/2} (μ_0 - μ_i)   (i = 1, 2),

(2.1.5)    θ_3 = (1 + n_1 + n_2)^{-1/2} (μ_0 + n_1 μ_1 + n_2 μ_2).

The parameter space is denoted by

    Ω = { ω = (θ_1, θ_2, θ_3, σ) : σ > 0, and exactly one of θ_1, θ_2 is zero },

and we write Ω_i = { ω ∈ Ω : θ_i ≠ 0 } (i = 1, 2); thus ω ∈ Ω_i means that the individual to be classified comes from π_j (j ≠ i), so that a_i is the incorrect action. The density of z, when ω ∈ Ω obtains, is given by

(2.1.8)    p_ω(z) = p_{(θ_1,θ_2),σ}(z_1, z_2) · p_{θ_3,σ}(z_3) · p_σ(S),

where (z_1, z_2) are jointly normal with means (θ_1, θ_2), common variance σ² and correlation

    k = [ (1 + 1/n_1)(1 + 1/n_2) ]^{-1/2},

z_3 is normal with mean θ_3 and variance σ², and p_σ(S) is the density of the pooled estimator S (with m degrees of freedom and normalizing constant (m/2)^{m/2} / [Γ(m/2) σ^m]).

The risk of a classification rule φ in D, when ω ∈ Ω_i (i = 1, 2) obtains, is given by

(2.1.11)    r(ω, φ) = ℓ( | μ_0 - μ_i | / σ ) E_ω φ_i   under the loss function of type 1,
            r(ω, φ) = ℓ( | θ_i | / σ ) E_ω φ_i         under the loss function of type 2,

where E_ω denotes the expectation with respect to the density p_ω.

2.1.1 Minimax classification rule
We shall prove the following two theorems.

Theorem 2.1.1  The maximum likelihood rule is minimax and admissible for the problem of classification into one of two univariate normal populations with unknown means and common variance, under the loss function of type 2. If n_1 = n_2, the above result also holds for the loss function of type 1.

Theorem 2.1.2  If ℓ is a continuous function on the positive half of the real line, then the maximum likelihood rule for the problem given in the theorem 2.1.1 is also the unique (a.e.) minimax rule.

Proof of the theorem 2.1.1  We shall prove this theorem in the following steps.

1)  For any Δ > 0, σ > 0, let Ω_{Δ,σ} be a subset of Ω such that

(2.1.12)    Ω_{Δ,σ} = { ω ∈ Ω : σ fixed > 0; (θ_1, θ_2) = (Δ, 0), (-Δ, 0), (0, Δ), (0, -Δ) }.

It is easily seen that Ω = ∪_{Δ>0, σ>0} Ω_{Δ,σ}. Now, consider a prior probability measure ξ_{Δ,σ} on Ω_{Δ,σ} such that

(i)  θ_3 is distributed independently of (θ_1, θ_2) with the marginal density function ρ,

(ii) the marginal distribution of (θ_1, θ_2) is a point-distribution with equal probabilities (= 1/4) at (Δ, 0), (-Δ, 0), (0, Δ), (0, -Δ).

For any φ in D and with the loss function of type 2, we have

(2.1.13)    r[(Δ, 0, θ_3, σ); φ]  = ℓ(Δ/σ) ∫_Z B_1(z) exp[ Δ(a z_1 - b z_2)/σ² ] φ_1(z) dz,

(2.1.14)    r[(-Δ, 0, θ_3, σ); φ] = ℓ(Δ/σ) ∫_Z B_1(z) exp[ -Δ(a z_1 - b z_2)/σ² ] φ_1(z) dz,

(2.1.15)    r[(0, Δ, θ_3, σ); φ]  = ℓ(Δ/σ) ∫_Z B_1(z) exp[ Δ(a z_2 - b z_1)/σ² ] φ_2(z) dz,

(2.1.16)    r[(0, -Δ, θ_3, σ); φ] = ℓ(Δ/σ) ∫_Z B_1(z) exp[ -Δ(a z_2 - b z_1)/σ² ] φ_2(z) dz,

where

(2.1.17)    a = 1/(1 - k²),   b = k/(1 - k²),

and B_1(z) is a positive function of z not depending on (θ_1, θ_2). Let

(2.1.18)    B_2(z) = B_1(z) exp[ -a Δ²/(2σ²) ].

Then, for any φ in D,

(2.1.20)    r(ξ_{Δ,σ}; φ) = (1/4) ℓ(Δ/σ) ∫_Z B_2(z) [ exp{ Δ(a z_1 - b z_2)/σ² } φ_1(z)
                + exp{ -Δ(a z_1 - b z_2)/σ² } φ_1(z)
                + exp{ Δ(a z_2 - b z_1)/σ² } φ_2(z)
                + exp{ -Δ(a z_2 - b z_1)/σ² } φ_2(z) ] dz.

It can be seen easily that r(ξ_{Δ,σ}; φ) is minimized for φ = ψ, where

(2.1.21)    ψ_1(z) = 1   if   exp{ Δ(a z_1 - b z_2)/σ² } + exp{ -Δ(a z_1 - b z_2)/σ² }
                              < exp{ Δ(a z_2 - b z_1)/σ² } + exp{ -Δ(a z_2 - b z_1)/σ² },

and ψ_1(z) = 0 otherwise. The inequality in (2.1.21) is equivalent to

    [ exp{ Δ(a-b)(z_1+z_2)/2σ² } - exp{ -Δ(a-b)(z_1+z_2)/2σ² } ]
    · [ exp{ Δ(a+b)(z_1-z_2)/2σ² } - exp{ -Δ(a+b)(z_1-z_2)/2σ² } ]  <  0,

i.e.,   (z_1 + z_2)(z_1 - z_2) < 0,

i.e.,   | z_1 | < | z_2 |,

i.e.,

(2.1.22)    (1 + 1/n_1)^{-1/2} | x̄_0 - x̄_1 |  <  (1 + 1/n_2)^{-1/2} | x̄_0 - x̄_2 |.

Thus we see that ψ is the modified minimum distance rule, and (from lemma 1.3.1) it is also the maximum likelihood rule. Note that equality in (2.1.22) is attained in a set of measure zero. Since ψ does not depend on Δ and σ, we have proved that for any Δ > 0, σ > 0, the rule ψ is the Bayes rule against the prior distribution ξ_{Δ,σ}, defined as above.
2)  Next, we shall show that for any ω in Ω_{Δ,σ} the risk of ψ depends only on | θ_i |/σ (i = 1, 2), and thus the risk of ψ is constant in Ω_{Δ,σ} for any Δ > 0, σ > 0. Let

    u = z_1 + z_2,   v = z_1 - z_2.

Then u, v are independently normally distributed with means θ_1 + θ_2, θ_1 - θ_2 and variances 2(1+k)σ², 2(1-k)σ² respectively. Note that the inequality in (2.1.22) is equivalent to

(2.1.23)    u v < 0.

For ω in Ω_1,

(2.1.24)    E_ω ψ_1 = Prob_ω( u v < 0 ).

Let

(2.1.26)    N(x) = ∫_x^∞ (2π)^{-1/2} exp(-t²/2) dt.

For ω in Ω_1,

(2.1.27)    r(ω; ψ) = ℓ( |θ_1|/σ ) { N( -θ_1/[σ √(2(1+k))] ) N( θ_1/[σ √(2(1-k))] )
                                    + N( θ_1/[σ √(2(1+k))] ) N( -θ_1/[σ √(2(1-k))] ) },

and, for ω in Ω_2,

(2.1.28)    r(ω; ψ) = ℓ( |θ_2|/σ ) { N( -θ_2/[σ √(2(1+k))] ) N( θ_2/[σ √(2(1-k))] )
                                    + N( θ_2/[σ √(2(1+k))] ) N( -θ_2/[σ √(2(1-k))] ) }.

Thus, we see that for ω in Ω_i (i = 1, 2), r(ω; ψ) is a function of |θ_i|/σ. Hence the risk of ψ is constant in Ω_{Δ,σ} for any Δ > 0, σ > 0, where Ω_{Δ,σ} is defined as above.

3)  From the results in 1) and 2) and from the theorem 1.1.3, it follows that ψ is a minimax rule in D with the loss function of type 2.

4)  We have proved that for any Δ > 0, σ > 0, the rule ψ is the unique Bayes procedure in D against the prior distribution ξ_{Δ,σ}, where ξ_{Δ,σ} is defined as above. From the Corollary 1.1.1, it follows that ψ is an admissible rule in D with the loss function of type 2.

When n_1 = n_2 (= n, say) and with the loss function of type 1, the only change required in the above proof is to replace ℓ(Δ/σ) by ℓ[ (1 + 1/n)^{1/2} Δ/σ ]. Then the latter part of the theorem easily follows.
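The misclassification probability appearing in (2.1.27) and (2.1.28) can be checked numerically. The sketch below is an illustration only, and it assumes the reconstructed form of the formula given above: a Monte Carlo estimate of Prob(uv < 0) is compared with the product-of-tail-probabilities expression.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    theta, sigma, k = 1.2, 1.0, 0.7
    u = rng.normal(theta, np.sqrt(2 * (1 + k)) * sigma, 200_000)
    v = rng.normal(theta, np.sqrt(2 * (1 - k)) * sigma, 200_000)
    mc = np.mean(u * v < 0)                       # Monte Carlo estimate of the misclassification probability

    N = norm.sf                                   # N(x) = upper-tail standard normal probability
    t = theta / sigma
    formula = (N(-t / np.sqrt(2 * (1 + k))) * N(t / np.sqrt(2 * (1 - k)))
               + N(t / np.sqrt(2 * (1 + k))) * N(-t / np.sqrt(2 * (1 - k))))
    print(round(mc, 3), round(formula, 3))        # the two values agree to Monte Carlo accuracy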
Proof of the theorem 2.1.2

To prove this theorem, we first observe the following facts:

(i)   For any Δ > 0, σ > 0, the risk of ψ is constant in Ω_{Δ,σ}.

(ii)  For ω ∈ Ω_i (i = 1, 2), the risk of ψ depends only on |θ_i|/σ.

(iii) For any ω, the risk of ψ is a continuous function of |θ_i|/σ (i = 1, 2) (except, possibly, at the origin, which is excluded from the present discussion), and moreover it is bounded, since the function ℓ is bounded above.

Thus, there exist Δ_0 > 0 and σ_0 > 0, with Δ_0 < ∞, such that

    sup_{ω ∈ Ω} r(ω, ψ) = sup_{ω ∈ Ω_{Δ_0,σ_0}} r(ω, ψ) = r(ω, ψ)   for any ω in Ω_{Δ_0,σ_0}.

Let φ be another minimax rule in D which is different (a.e.) from ψ. Then

    sup_{ω ∈ Ω} r(ω, φ) = sup_{ω ∈ Ω} r(ω, ψ).

Thus,

    r(ω, φ) ≤ r(ω, ψ)   for all ω in Ω_{Δ_0,σ_0}.

The strict inequality cannot hold for any point ω in Ω_{Δ_0,σ_0}, since ψ is the unique (a.e.) Bayes rule against ξ_{Δ_0,σ_0} (with the density function of θ_3 at our choice) and hence admissible in Ω_{Δ_0,σ_0}. Thus

    r(ω, φ) = r(ω, ψ)   for all ω in Ω_{Δ_0,σ_0}.

But this is a contradiction, since ψ is the unique (a.e.) Bayes rule against ξ_{Δ_0,σ_0}. Thus ψ is the unique (a.e.) minimax rule in D.

Remark  The fact that ℓ is a monotonic increasing function is not used in the above proofs. This assumption will be used later.

2.1.2 Some more properties of the maximum likelihood rule

(i) Limiting behavior of r(ω, ψ)
It is easily seen from (2.1.27) and (2.1.28) that r(μ_0, μ_1, μ_2, σ; ψ) is a function of | μ_1 - μ_2 |/σ, and the risk tends to zero as | μ_1 - μ_2 |/σ tends to ∞. If we define Δ = | μ_1 - μ_2 |/σ, then ψ is Δ-consistent.

When we have n_0 observations from π_0, then we modify ψ as follows:

    ψ_1(z) = 1   if   (1/n_0 + 1/n_1)^{-1/2} | x̄_0 - x̄_1 |  <  (1/n_0 + 1/n_2)^{-1/2} | x̄_0 - x̄_2 |,
    ψ_1(z) = 0   otherwise,

where x̄_0 is the sample mean of the n_0 observations from π_0. It can be seen that, for i = 1, 2, E_ω ψ_i tends to 0 for ω ∈ Ω_i as n_0, n_1, n_2 tend to infinity. Thus ψ is error-consistent and also risk-consistent.

(ii) Monotonicity property of ψ

For ω ∈ Ω_i (i = 1, 2), the probability of misclassification under the rule ψ is

(2.1.31)    E_ω ψ_i = N( -θ_i/[σ √(2(1+k))] ) N( θ_i/[σ √(2(1-k))] )
                     + N( θ_i/[σ √(2(1+k))] ) N( -θ_i/[σ √(2(1-k))] ).

Differentiating (2.1.31) with respect to θ_i/σ, it follows that E_ω ψ_i decreases as |θ_i|/σ increases.

(iii) Next we shall show that the rule ψ is the best rule in a class of rules defined below.
Theorem 2.1.3  For any classification rule φ in D satisfying the conditions (i), (ii) below, we have for any ω_i in Ω_i (i = 1, 2)

    E_{ω_i} φ_i ≥ E_{ω_i} ψ_i   (i = 1, 2),

where ψ is the maximum likelihood rule.

(i)  E_ω φ_i (i = 1, 2) depends on ω only through |θ_1|/σ and |θ_2|/σ;
(ii) E_ω φ_1 = E_{ω'} φ_2 for all ω in Ω, where ω' is obtained from ω by interchanging θ_1 and θ_2.

Remark  The conditions (i), (ii) above can be replaced by the following conditions.

(i)'  When φ is expressed as a function of x̄_0, x̄_1, x̄_2, S, then φ remains unchanged when each x̄_i is replaced by a x̄_i + b and S is rescaled accordingly, for all real a > 0 and all real b. This implies that φ depends on the sufficient statistics only through (x̄_0 - x̄_1)/s, (x̄_0 - x̄_2)/s.

(ii)' When φ is expressed as a function of z_1, z_2, S, then

    φ_1(z_1, z_2, S) = φ_2(z_2, z_1, S)   and   φ_i(z_1, z_2, S) = φ_i(-z_1, -z_2, S)   (i = 1, 2).

Proof:  Let ω = (θ_1, θ_2, σ), where θ_1, θ_2 are defined in (2.1.4). Let p_ω be the probability density of (u_1, u_2), where

    u_i = (x̄_0 - x̄_i)/s   (i = 1, 2),

and define φ as a function of (u_1, u_2). For any Δ > 0, σ > 0, consider the four parameter points of Ω_{Δ,σ}. For any φ in D satisfying the conditions of the theorem, the probabilities of misclassification at these four points coincide. Also, the rule ψ satisfies the conditions of the theorem. It is seen from the proof of the theorem 2.1.1 that, if we fix the ω_i's by fixing Δ and σ, then ψ minimizes this common probability of misclassification. The theorem now follows easily. A detailed proof of this theorem is given by Kudo [20].
Corollary 2.1.3  The maximum likelihood rule is an unbiased rule (in this case the two definitions for unbiasedness are equivalent).

Proof:  Let φ be a rule such that φ(z) = (1/2, 1/2) for all z. Clearly, φ satisfies all the conditions of the above theorem, and

    E_{ω_i} φ_i = 1/2,   ω_i ∈ Ω_i   (i = 1, 2).

Thus

    E_{ω_i} ψ_i ≤ 1/2,   i.e.,   E_{ω_i} ψ_i ≤ E_{ω_i} ψ_j   (j ≠ i; i, j = 1, 2),

i.e., the probability of correct classification under the rule ψ is greater than the probability of misclassification.
2.2 Classification into two univariate normal populations with common variance when the individual to be classified does not necessarily belong to one of them

The problem:  Let X be a random variable which is normally distributed in the population π_i (i = 0, 1, 2) with mean μ_i (i = 0, 1, 2) and variance σ²; the μ_i's and σ are unknown. In this case μ_0 is completely unrestricted, but we are forced to make one of the decisions μ_0 = μ_1 or μ_0 = μ_2. Let x̄_i be the sample mean based on a random sample of size n_i (i = 0, 1, 2) from π_i (i = 0, 1, 2); S is the pooled unbiased estimator of σ². In this case, we shall assume that n_0 = 1 and n_1 = n_2 = n. The parameter space is denoted by

    Ω = { ω = (μ_0, μ_1, μ_2, σ) : σ > 0 }.

Next, we shall prove the following theorem.

Theorem 2.2.1  For the problem posed above, any rule φ in D is a minimax rule in D with the loss function of type 1. With the same loss function, the maximum likelihood rule ψ is an admissible rule in D for the above problem.

Proof:

(i)  For any φ in D and ω in Ω,

    r(ω, φ) = ℓ( |μ_0 - μ_1|/σ ) E_ω φ_1 + ℓ( |μ_0 - μ_2|/σ ) E_ω φ_2.

Moreover, for any fixed φ, by letting μ_0 move far from both μ_1 and μ_2 (μ_0 ≠ μ_1, μ_0 ≠ μ_2) we find

    sup_{ω ∈ Ω} r(ω, φ) = sup_{u > 0} ℓ(u).

Hence sup_ω r(ω; φ) is the same for all φ in D, and every rule in D is minimax.

(ii)  We shall show next that if we specify the parameter space as

    Ω* = { ω = (μ_0, μ_1, μ_2, σ) : σ > 0, μ_0, μ_1, μ_2 all different },

then also φ** is an admissible rule, where φ** is the maximum likelihood (modified minimum distance) rule. For simplicity we consider n_1 = n_2 (= n) and the loss function of type 1. As before we consider φ as a function of z_1, z_2, z_3, S, where

    z_i = (1 + 1/n)^{-1/2} (x̄_0 - x̄_i)   (i = 1, 2),
    z_3 = (1 + 2n)^{-1/2} (x̄_0 + n x̄_1 + n x̄_2),

and the parameter space as

    Ω* = { ω = (θ_1, θ_2, θ_3, σ) : σ > 0, θ_1 ≠ 0, θ_2 ≠ 0, θ_1 ≠ θ_2 },

where

    θ_i = (1 + 1/n)^{-1/2} (μ_0 - μ_i)   (i = 1, 2),
    θ_3 = (1 + 2n)^{-1/2} (μ_0 + n μ_1 + n μ_2).
Then the probability density of z = (z_1, z_2, z_3, S), when ω obtains, is

    p_ω(z) ∝ exp{ -[ (z_1-θ_1)² - 2k(z_1-θ_1)(z_2-θ_2) + (z_2-θ_2)² ] / [2σ²(1-k²)] }
             · exp{ -(z_3-θ_3)²/(2σ²) } · p_σ(S),

where k = (1 + 1/n)^{-1}. Consider a prior distribution ξ on Ω* such that

(i)   σ is fixed,
(ii)  θ_3 is independent of (θ_1, θ_2) with marginal distribution F,
(iii) the distribution of (θ_1, θ_2) is a point-distribution with equal probabilities (= 1/8) at

    (α, β), (α, -β), (-α, β), (-α, -β), (β, α), (β, -α), (-β, α), (-β, -α),

with α > 0, β > 0, α ≠ β. Write a = 1/(1-k²), b = k/(1-k²), z_1 = u + v, z_2 = u - v, and (dropping the factor 1/σ² and factors common to all eight points) let

    p_1(z) = exp[ b α β + α(a z_1 - b z_2) + β(a z_2 - b z_1) ] = exp[ b α β + (a-b)(α+β)u + (a+b)(α-β)v ],
    p_2(z) = exp[ -b α β + α(a z_1 - b z_2) - β(a z_2 - b z_1) ] = exp[ -b α β + (a-b)(α-β)u + (a+b)(α+β)v ],
    p_3(z) = exp[ -b α β - α(a z_1 - b z_2) + β(a z_2 - b z_1) ] = exp[ -b α β - (a-b)(α-β)u - (a+b)(α+β)v ],
    p_4(z) = exp[ b α β - α(a z_1 - b z_2) - β(a z_2 - b z_1) ] = exp[ b α β - (a-b)(α+β)u - (a+b)(α-β)v ],
    p_5(z) = exp[ b α β + β(a z_1 - b z_2) + α(a z_2 - b z_1) ] = exp[ b α β + (a-b)(α+β)u - (a+b)(α-β)v ],
    p_6(z) = exp[ -b α β + β(a z_1 - b z_2) - α(a z_2 - b z_1) ] = exp[ -b α β - (a-b)(α-β)u + (a+b)(α+β)v ],
    p_7(z) = exp[ -b α β - β(a z_1 - b z_2) + α(a z_2 - b z_1) ] = exp[ -b α β + (a-b)(α-β)u - (a+b)(α+β)v ],
    p_8(z) = exp[ b α β - β(a z_1 - b z_2) - α(a z_2 - b z_1) ] = exp[ b α β - (a-b)(α+β)u + (a+b)(α-β)v ].

It can be seen easily that r(ξ, φ) is minimized for φ = φ*, where φ*_1(z) = 1 if

    ℓ[ α(1+1/n)^{1/2}/σ ] [ p_1(z)+p_2(z)+p_3(z)+p_4(z) ] + ℓ[ β(1+1/n)^{1/2}/σ ] [ p_5(z)+p_6(z)+p_7(z)+p_8(z) ]
    <  ℓ[ α(1+1/n)^{1/2}/σ ] [ p_5(z)+p_6(z)+p_7(z)+p_8(z) ] + ℓ[ β(1+1/n)^{1/2}/σ ] [ p_1(z)+p_2(z)+p_3(z)+p_4(z) ],

and φ*_1(z) = 0 otherwise. The above inequality reduces to

    { ℓ[ α(1+1/n)^{1/2}/σ ] - ℓ[ β(1+1/n)^{1/2}/σ ] }
    · ( e^{bαβ} [ e^{(a-b)(α+β)u} - e^{-(a-b)(α+β)u} ] [ e^{(a+b)(α-β)v} - e^{-(a+b)(α-β)v} ]
      + e^{-bαβ} [ e^{(a-b)(α-β)u} - e^{-(a-b)(α-β)u} ] [ e^{(a+b)(α+β)v} - e^{-(a+b)(α+β)v} ] )  <  0.

Since ℓ is a monotonic increasing function, the above inequality is equivalent (taking α > β without loss of generality, and since a > b) to

    (a-b)u · (a+b)v < 0,   i.e.,   u v < 0,   i.e.,   z_1² < z_2²,   i.e.,   (x̄_0 - x̄_1)² < (x̄_0 - x̄_2)².

Thus φ* is a Bayes rule against ξ. If we replace the inequality sign in the above inequality by the equality sign, then the set of points satisfying this equality will be of measure zero. Moreover, note that φ* does not depend on α, β, σ. Hence φ* is the unique (a.e.) Bayes rule against ξ. Thus φ* is an admissible rule. Note that φ* is the maximum likelihood rule.
2.3 Classification into more than two univariate normal populations

Let π_0, π_1, ..., π_k be (k+1) populations in which a random variable X is distributed with F_0, F_1, ..., F_k as c.d.f.'s respectively, F_i = N(μ_i, σ²) (i = 0, 1, ..., k), where the μ_i's and σ are unknown and it is known that μ_0 = μ_i for at least one i (i = 1, ..., k) and μ_1, ..., μ_k are not all equal.

Case I  We take one random observation from π_0 and random samples from the π_i's (i = 1, ..., k), each of size n. Let x_0, x̄_1, ..., x̄_k denote the sample means and S be the pooled unbiased estimator of σ². Put

    z_i = (1 + 1/n)^{-1/2} (x_0 - x̄_i)   (i = 1, ..., k),

    z_{k+1} = (1 + kn)^{-1/2} (x_0 + n Σ_{i=1}^k x̄_i).
We can consider the rules φ as functions of z_1, ..., z_{k+1} and S, and the parameter space as

    Ω = { ω = (θ_1, ..., θ_{k+1}, σ) : σ > 0; θ_i = 0 for at least one i, 1 ≤ i ≤ k, and the θ_i's (i = 1, ..., k) are not all equal },

where θ_i = (1 + 1/n)^{-1/2}(μ_0 - μ_i) (i = 1, ..., k) and θ_{k+1} corresponds to z_{k+1}. Note that (z_1, ..., z_k) is independent of z_{k+1}. The variance-covariance matrix of (z_1, ..., z_k) is

    σ² [ (1-a) I + a J ],

where a = (1 + 1/n)^{-1}, I is the identity matrix of order k and J is the k x k matrix with all elements unity. Now

    [ (1-a) I + a J ]^{-1} = [ 1/(1-a) ] I - { a / [ (1-a)(a(k-1)+1) ] } J.

Let

    c = a / [ (1-a)(a(k-1)+1) ],   d = 1/(1-a).

Then the joint density of z_1, ..., z_k is

    (2π)^{-k/2} σ^{-k} [ (1-a)^{k-1} (1 + a(k-1)) ]^{-1/2} exp{ -Q/(2σ²) },

where Q is the quadratic form in (z_1 - θ_1, ..., z_k - θ_k) with matrix d I - c J; the part of the exponent linear in the θ_i's is

    (1/σ²) Σ_{i=1}^k θ_i ( d z_i - k c z̄ ),   where z̄ = (1/k) Σ_{i=1}^k z_i.

By taking a prior distribution ξ such that

(i)   σ is fixed,
(ii)  θ_{k+1} is independent of (θ_1, ..., θ_k),
(iii) the distribution of (θ_1, ..., θ_k) is a point-distribution with equal probabilities (= 1/2k) at the points

    (Δ, 0, 0, ..., 0), (0, Δ, 0, ..., 0), ..., (0, 0, ..., Δ),
    (-Δ, 0, ..., 0), (0, -Δ, 0, ..., 0), ..., (0, 0, ..., -Δ),

where Δ > 0, it can be seen by the method used before that the following rule φ is unique (a.e.) Bayes against ξ and hence admissible, using the loss function of type 1:

    φ_i(z) = 1   if   | d z_i - k c z̄ | = min_{1≤j≤k} | d z_j - k c z̄ |,
    φ_i(z) = 0   elsewhere   (i = 1, ..., k).

The set of points for which the above minimum is attained for more than one value of i is of measure zero. Moreover, note that d/(kc) --> 1 as n --> ∞.
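The admissible rule of Case I can be computed directly. The sketch below is an illustration only; the helper name and example numbers are not from the thesis. It evaluates the statistics | d z_i - k c z̄ | and picks the minimizing population.

    import numpy as np

    def case1_rule(x0, xbars, n):
        """Univariate Case I: one observation x0, k populations with sample means xbars,
        equal sample sizes n. Returns the index i of the decision mu_0 = mu_i."""
        xbars = np.asarray(xbars, dtype=float)
        k = len(xbars)
        z = (1.0 + 1.0 / n) ** (-0.5) * (x0 - xbars)          # z_i, i = 1, ..., k
        a = (1.0 + 1.0 / n) ** (-1)
        c = a / ((1.0 - a) * (a * (k - 1) + 1.0))
        d = 1.0 / (1.0 - a)
        stat = np.abs(d * z - k * c * z.mean())               # |d z_i - k c z_bar|
        return int(np.argmin(stat)) + 1

    print(case1_rule(1.9, [0.0, 2.0, 5.0], n=30))             # -> 2, the closest population after adjustment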
Case II  We take the same problem as posed in Case I with the zero-one (simple) loss function, i.e., ℓ(u) = 1 for u > 0. It can be shown, using the method of Ellison [6], that the following class of rules is admissible:

    φ_i(z) = 1   if   λ z_i² - log ξ_i = min_{1≤j≤k} ( λ z_j² - log ξ_j ),
    φ_i(z) = 0   otherwise   (i = 1, ..., k),

where λ > 0 and 0 < ξ_i < 1 (i = 1, ..., k) with Σ_{i=1}^k ξ_i = 1. Each such rule is the unique (a.e.) Bayes rule against a prior distribution ξ such that

(i)   σ is fixed,
(ii)  θ_{k+1} is independent of (θ_1, ..., θ_k),
(iii) Prob(θ_i = 0) = ξ_i (i = 1, ..., k), and the conditional distribution of (θ_1, θ_2, ..., θ_{i-1}, θ_{i+1}, ..., θ_k) is a normal distribution with zero means and with a properly chosen dispersion matrix.

In this case we could have assumed also that the sample sizes from the π_i's (i = 1, ..., k) are n_1, ..., n_k and z_i = (1 + 1/n_i)^{-1/2}(x_0 - x̄_i) (i = 1, ..., k), where x̄_i is the mean of the observations from π_i. By choosing ξ_1 = ... = ξ_k, it can be seen that the modified minimum distance rule belongs to the above class of admissible rules.

Case III  Here we quote a result obtained by Ellison [6]. Let x_0, x_1, ..., x_k be independent random variables with distributions N(μ_0, σ_0²), N(μ_1, σ_1²), ..., N(μ_k, σ_k²) respectively. The σ_i²'s are all known and it is also known that for some i (i = 1, ..., k), μ_0 = μ_i. Let x = (x_0, x_1, ..., x_k) and let φ_i(x) (i = 1, ..., k) be the probability of deciding μ_0 = μ_i, given x. Ellison gave the following class of admissible rules:

    φ_i(x) = 1   if   t_i(x) = max_{1≤j≤k} t_j(x),
    φ_i(x) = 0   otherwise   (i = 1, ..., k),

where the statistics t_i(x) are those given by Ellison, defined in terms of constants ξ_i with Σ_{i=1}^k ξ_i = 1 and λ > 0.

2.4 Concluding remarks
The problem of classification into two univariate normal populations has some similarities with the following decision problem. Let x_1, x_2 be two random variables, jointly distributed as a bivariate normal distribution whose means θ_1, θ_2 and covariance matrix are unknown, but for which it is known that either θ_1 = 0 or θ_2 = 0. Then the following rule is unique (a.e.) minimax and admissible: take the decision θ_1 = 0 if |x_1| < |x_2|; otherwise take the decision θ_2 = 0. The loss function for wrongly deciding θ_i = 0 is ℓ(|θ_i|), where ℓ is a bounded continuous monotonic increasing function with ℓ(0) = 0. Similar analogies can be made to the cases discussed in 2.3.

For classification into two univariate normal populations we are essentially restricting to those rules whose risks involve (μ_1 - μ_2)/σ. But for classification into k univariate normal populations, we could not apply the technique of 2.1 to derive a minimax rule. Even if we restrict to the class of rules invariant under the translation group and the scale group, the maximal invariant in the parameter space would be vector-valued; it seems that this is the main trouble in deriving a minimax rule. But from Wald's general theory, we know that a minimax rule exists. It seems that the minimum (or the modified minimum, as the case may be) distance rule will not be minimax in this situation. It may be interesting to carry out this investigation. Another interesting problem is to derive the complete class of classification rules.
CHAPTER III

CLASSIFICATION INTO MULTIVARIATE NORMAL POPULATIONS

3.0 Summary

In this chapter, it is shown that for the problem of classification into one of two multivariate normal populations with known and common dispersion matrix the maximum likelihood rule is minimax and admissible for the loss function of type 2; this rule is also shown to be unbiased. When the common dispersion matrix is unknown, the maximum likelihood rule is minimax and admissible in an invariant class; some other optimum properties of this rule are also given. A class of admissible rules is derived for the problem of classification into more than two multivariate normal populations.

3.1 Classification into two multivariate normal populations with known and common dispersion matrix

Let X (p x 1) be a random vector which is distributed in the population π_i (i = 0, 1, 2) as the p-variate normal distribution with the mean vector μ_i (i = 0, 1, 2) and the dispersion matrix Σ; the μ_i's are unknown and it is assumed that μ_0 = μ_i for i = 1 or 2, and μ_1 ≠ μ_2. The dispersion matrix Σ is known and it is non-singular.

Let x̄_i (i = 0, 1, 2) be the sample mean vector based on a random sample of size n_i (i = 0, 1, 2) from π_i (i = 0, 1, 2); we assume that n_0 = 1. It is enough to consider only the class of classification rules which are functions of the sufficient statistics x̄_0, x̄_1, x̄_2.
We make a 1-1 transformation from (x̄_0, x̄_1, x̄_2) to (z_1, z_2, z_3), where

(3.1.1)    z_i = (1 + 1/n_i)^{-1/2} T^{-1} (x̄_0 - x̄_i)   (i = 1, 2),

and where Σ = T T', T being a lower-triangular p x p matrix with positive diagonal elements (T is unique). Without any loss of generality, we can consider only the class of rules which are functions of z_i (i = 1, 2). The parameter space can be denoted by

    Θ = { θ = (θ_1, θ_2) : θ_i ≠ 0 and θ_j = 0 for exactly one i and j ≠ i; i, j = 1, 2 },

where

    θ_i = (1 + 1/n_i)^{-1/2} T^{-1} (μ_0 - μ_i)   (i = 1, 2),

and Θ_i = { θ ∈ Θ : θ_i ≠ 0 } (i = 1, 2). The joint density of (z_1, z_2), when θ ∈ Θ obtains, is given by

    p_θ(z_1, z_2) = (2π)^{-p} (1 - k²)^{-p/2}
        exp{ -[ (z_1-θ_1)'(z_1-θ_1) - 2k (z_1-θ_1)'(z_2-θ_2) + (z_2-θ_2)'(z_2-θ_2) ] / [2(1-k²)] },

where

(3.1.7)    k = [ (1 + 1/n_1)(1 + 1/n_2) ]^{-1/2}.

It is seen that

(3.1.8)    p_{(θ_1, θ_2, θ_3)}(z_1, z_2, z_3) = p_{(θ_1, θ_2)}(z_1, z_2) · p_{θ_3}(z_3),

where z_3 denotes the remaining statistic with parameter θ_3. Consider a prior distribution ξ on Θ such that

(i)  prob( θ ∈ Θ_i ) = ξ_i   (i = 1, 2),
(ii) given θ ∈ Θ_i, the conditional distribution of (θ_i, θ_3) is F_i.

Considering the loss function of type 2, we have for any rule φ in D

(3.1.9)    r(ξ; φ) = ∫_Z { ξ_1 φ_1(z) ∫ ℓ[ (θ_1'θ_1)^{1/2} ] p^{(1)}_{θ_1}(z_1, z_2) p_{θ_3}(z_3) dF_1(θ_1, θ_3)
                          + ξ_2 φ_2(z) ∫ ℓ[ (θ_2'θ_2)^{1/2} ] p^{(2)}_{θ_2}(z_1, z_2) p_{θ_3}(z_3) dF_2(θ_2, θ_3) } dz,

where

    p^{(1)}_{θ_1}(z_1, z_2) = p_{(θ_1, 0)}(z_1, z_2),   p^{(2)}_{θ_2}(z_1, z_2) = p_{(0, θ_2)}(z_1, z_2).

It is seen easily that r(ξ; φ) is minimized for φ = ψ*, where

(3.1.10)   ψ*_1(z) = 1   if   ξ_1 ∫ ℓ[ (θ_1'θ_1)^{1/2} ] p^{(1)}_{θ_1}(z_1, z_2) p_{θ_3}(z_3) dF_1(θ_1, θ_3)
                              < ξ_2 ∫ ℓ[ (θ_2'θ_2)^{1/2} ] p^{(2)}_{θ_2}(z_1, z_2) p_{θ_3}(z_3) dF_2(θ_2, θ_3),

and ψ*_1(z) = 0 otherwise.

3.1.1 Minimax and admissible rule

Next, we shall prove the following theorems.

Theorem 3.1.1  For the problem of classification into one of two multivariate normal populations with known and common dispersion matrix, the maximum likelihood rule is minimax and admissible, considering the loss function of type 2. When n_1 = n_2, the maximum likelihood rule is also minimax and admissible considering the loss function of type 1.

Theorem 3.1.2  When the function ℓ is continuous on the positive half of the real line, the maximum likelihood rule is the unique (a.e.) minimax rule for the problem posed in the theorem 3.1.1.
Proof of the theorem 3.1.1

Consider a particular prior distribution ξ_Δ (as a special case of ξ, defined earlier) such that

(i)   ξ_1 = ξ_2 = 1/2,
(ii)  F_1 = F_2 = G,
(iii) G is the distribution function corresponding to the uniform probability measure ν over the surface of a sphere in E^p with the origin as the center and of radius √Δ (Δ > 0).

Define

    Θ_Δ = Θ_{1Δ} ∪ Θ_{2Δ},

where, for i = 1, 2, Θ_{iΔ} is the subset of Θ_i for which θ_i'θ_i = Δ. Note that ξ_Δ assigns probability 1 to Θ_Δ. The inequality in (3.1.10) reduces to

(3.1.11)    ∫_{θ_1'θ_1 = Δ} exp[ θ_1'(a z_1 - b z_2) ] dν(θ_1)  <  ∫_{θ_2'θ_2 = Δ} exp[ θ_2'(a z_2 - b z_1) ] dν(θ_2),

where a = 1/(1-k²), b = k/(1-k²). Next we shall prove the following lemma.

Lemma 3.1.1  Let ν be the uniform probability measure over the surface of the sphere x'x = λ², and let q (p x 1) be a fixed vector. Then the integral

    ∫_{x'x = λ²} exp(q'x) dν(x)

is a monotonic increasing function of q'q for fixed λ.

Proof:  Let

    ∫_{x'x = λ²} dx = f(λ),

so that

    ∫_{x'x = λ²} exp(q'x) dν(x) = [1/f(λ)] ∫_{x'x = λ²} exp(q'x) dx.

Make an orthogonal transformation y = Bx such that y_1 = (q'x)/(q'q)^{1/2}, where y = (y_1, ..., y_p)'. Put c = (q'q)^{1/2}. Then

    ∫_{x'x = λ²} exp(q'x) dx = ∫_{y'y = λ²} exp(c y_1) dy
        = ∫_0^λ [ exp(c y_1) + exp(-c y_1) ] g(y_1, λ²) dy_1,

where g(y_1, λ²) is the (positive) surface content of the section Σ_{i=2}^p y_i² = λ² - y_1². Note that for y_1 > 0, exp(c y_1) + exp(-c y_1) is a monotonic increasing function of c. Thus the lemma is proved.

From the lemma 3.1.1, it follows that the inequality (3.1.11) is equivalent to

    (a z_1 - b z_2)'(a z_1 - b z_2)  <  (a z_2 - b z_1)'(a z_2 - b z_1),

i.e.,

    (a² - b²)(z_1'z_1 - z_2'z_2) < 0,

i.e.,   z_1'z_1 < z_2'z_2,

i.e.,

(3.1.12)    (1 + 1/n_1)^{-1} (x̄_0 - x̄_1)' Σ^{-1} (x̄_0 - x̄_1)  <  (1 + 1/n_2)^{-1} (x̄_0 - x̄_2)' Σ^{-1} (x̄_0 - x̄_2).

Thus, we see that the Bayes rule against ξ_Δ is the maximum likelihood rule ψ. The equality in (3.1.12) is attained in a set of measure zero.
5.7
Now we shall show that for any
depends only on -~:i. ~i
stant in
Let
06
(;)
E
(i
= 1,
(;)
p
the risk of
thus the risk of ~
2);
0 1,
~
is con-
Then :'1'!2 are jointly normally distributed
~1}.2
and the dispersion matrix
=
I
(8 i'
,
with the mean vectors
where
in
is the
pxp
identity matrix.
transformation C such that
a'
lxp
=
C ~l
=
~,
B }
There exists an orthogonal
where
(~~i ~l ' O) 0, •• '} 0)
Put
C ~l = ~!,
C ~2 = ~~.
(~t, ~~)
Then
(~, .9)
distributed with mean vectors
are jointly normally
and dispersion matrix B.
Moreover,
Zl Z
·-1 ..··1
Thus, we see that
Q
€
O2 ,
EQ '2
=
z*'z*
,
·1 -·1
EQ, 1
ZI Z
-2
,·2
= -2
z*' _z*2 •
depends only ~n ~i ~l'
~2 ~2'
depends only on
Similarly,
for
Also, it is easily seen
that
=
r(Ql; ,)
for all
Q €
l
@"1l:l.'
o /: ,.' @;
Q2 €
denotes the
r(Q2; V)
0~·
~i-th
Hence
r(Q, ,)
component of
(ij) i
is constant in
•
From the above results and from Theorem 1.1.3, it follows that the rule psi is a minimax rule in Phi. Since, for any D > 0, psi is the unique (a.e.) Bayes rule in Phi against xi_D, it follows from Corollary 1.1.1 that psi is admissible in Phi.

To prove the last part of the theorem, the only change to be made in the above proof is to replace l(delta_i'delta_i) by l((1 + 1/n) delta_i'delta_i), where n_1 = n_2 = n. The proof of Theorem 3.1.2 follows in the same way as that of Theorem 2.1.2.
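As a concrete illustration of the rule psi characterized by (3.1.12), the following sketch (not part of the original text; the variable names are illustrative and the common dispersion matrix Sigma is assumed known) classifies an observed mean by comparing the two scaled Mahalanobis distances.

    import numpy as np

    def ml_rule_known_sigma(xbar0, xbar1, xbar2, sigma, n1, n2):
        """Maximum likelihood rule (3.1.12): decide mu_0 = mu_1 iff
        (1+1/n1)^{-1}(x0-x1)' Sigma^{-1} (x0-x1)
            < (1+1/n2)^{-1}(x0-x2)' Sigma^{-1} (x0-x2).
        Returns the label 1 or 2."""
        def q(d, w):
            # scaled Mahalanobis distance  w * d' Sigma^{-1} d
            return w * d @ np.linalg.solve(sigma, d)
        q1 = q(xbar0 - xbar1, 1.0 / (1.0 + 1.0 / n1))
        q2 = q(xbar0 - xbar2, 1.0 / (1.0 + 1.0 / n2))
        return 1 if q1 < q2 else 2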
Remark  If we assume that X (p x 1) is distributed in pi_i as N(mu_i, k_i Sigma) (i = 0, 1, 2), where the k_i's (i = 0, 1, 2) and Sigma are known, then we can have the following result. For simplicity, assume that n_1 = n_2 = n. Define

    z_i = (k_0 + k_i/n)^{-1/2} T^{-1} (xbar_0 - xbar_i)   (i = 1, 2),   Sigma = T T'.

Then the following rule is unique minimax and admissible: take the decision mu_0 = mu_1 if z_1'z_1 < z_2'z_2, and mu_0 = mu_2 otherwise.

3.1.2  A lower bound for the probability of correct classification using the ML rule psi

In this context, we shall assume n_1 = n_2. Next we state the following lemma without proof.
Lemma 3.1.2  For a fixed symmetric positive definite matrix Sigma (p x p) and for any two p x 1 vectors x and y, define

    d(x, y) = [ (x - y)' Sigma^{-1} (x - y) ]^{1/2}.

Then d satisfies all the properties of a distance between x and y.

In the present context, assume n_1 = n_2 = n and write d_0 = d(mu_1, mu_2). Assume omega is in Omega_1, i.e., mu_0 = mu_1. By the triangle inequality, d(xbar_0, xbar_1) <= d(xbar_0, mu_0) + d(xbar_1, mu_1) and d(xbar_0, xbar_2) >= d_0 - d(xbar_0, mu_0) - d(xbar_2, mu_2). Hence the probability of correctly classifying into pi_1 using the ML rule psi satisfies

(3.1.13)   P_1(mu_1, mu_2; psi) >= Prob[ 2 d(xbar_0, mu_0) + d(xbar_1, mu_1) + d(xbar_2, mu_2) < d_0 ].

Since xbar_0, xbar_1, xbar_2 are mutually independent, the event in which each of d(xbar_0, mu_0), d(xbar_1, mu_1), d(xbar_2, mu_2) is less than d_0/4 implies the event in (3.1.13). Let Prob(chi^2_p < c) = f_p(c). Since n_0 d^2(xbar_0, mu_0), n d^2(xbar_1, mu_1) and n d^2(xbar_2, mu_2) are each distributed as chi^2_p, we get

(3.1.14)   P_1(mu_1, mu_2; psi) >= f_p(n_0 d_0^2/16) [ f_p(n d_0^2/16) ]^2.
We can get another lower bound for P_1(mu_1, mu_2; psi) from the relation (3.1.13) as follows. We have found that

    P_1(mu_1, mu_2; psi) >= Prob[ d(xbar_1, mu_1) + 2 d(xbar_0, mu_0) + d(xbar_2, mu_2) < d_0 ].

Now, note that n d^2(xbar_1, mu_1) + n_0 d^2(xbar_0, mu_0) + n d^2(xbar_2, mu_2) is distributed as chi^2 with 3p degrees of freedom. By the Cauchy-Schwarz inequality,

    [ d(xbar_1, mu_1) + 2 d(xbar_0, mu_0) + d(xbar_2, mu_2) ]^2  <=  (2/n + 4/n_0) [ n d^2(xbar_1, mu_1) + n_0 d^2(xbar_0, mu_0) + n d^2(xbar_2, mu_2) ].

Hence

(3.1.15)   P_1(mu_1, mu_2; psi) >= f_{3p}( d_0^2 / (2/n + 4/n_0) ).

The two lower bounds obtained in (3.1.14) and (3.1.15) can be compared to pick the better one. Similarly, for the probability of correctly classifying into pi_2, the same lower bounds can be found. If we take n_0 observations from pi_0 and denote the mean of these observations by xbar_0, then it follows from (3.1.14) that, for i = 1, 2, the probability of correct classification into pi_i tends to 1 as n_0, n -> infinity. Also, it follows from the above that, for i = 1, 2, the risk tends to zero as n_0, n -> infinity. Thus, we see that the ML rule is error-consistent, risk-consistent and D-consistent.
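Under the reconstruction of (3.1.14) and (3.1.15) given above, both bounds reduce to central chi-square probabilities and are easy to evaluate numerically. The sketch below is illustrative only (the function f is f_p(c) = Prob(chi^2_p < c); the constants follow the reconstructed bounds, not a formula confirmed by the original text).

    import numpy as np
    from scipy.stats import chi2

    def f(p, c):
        # f_p(c) = Prob(chi^2_p < c)
        return chi2.cdf(c, df=p)

    def lower_bounds(p, d0_sq, n, n0=1):
        """Two lower bounds for the probability of correct classification
        under the ML rule; d0_sq is d^2(mu_1, mu_2), the Mahalanobis
        distance between the two population means."""
        # reconstructed (3.1.14): each of the three distances below d_0/4
        b1 = f(p, n0 * d0_sq / 16.0) * f(p, n * d0_sq / 16.0) ** 2
        # reconstructed (3.1.15): Cauchy-Schwarz on the chi^2_{3p} variable
        b2 = f(3 * p, d0_sq / (4.0 / n0 + 2.0 / n))
        return max(b1, b2)   # pick the better (larger) bound

    print(lower_bounds(p=2, d0_sq=25.0, n=50))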
Next we state the following theorem, the proof of which will follow from the proof of Theorem 3.1.1.

Theorem 3.1.3  For any classification rule phi in Phi satisfying conditions (i) and (ii) below, E_omega phi_i <= E_omega psi_i for omega in Omega_i (i = 1, 2), where psi is the ML rule.

(i)  For omega in Omega_i (i = 1, 2), E_omega phi_i depends only on delta_1'delta_1 and delta_2'delta_2.

(ii)  For any D different from 0, the probability under phi of correct classification into pi_1 equals that into pi_2 when delta_1'delta_1 = delta_2'delta_2 = D^2.

Corollary 3.1.1  The maximum likelihood rule is an unbiased rule.
3.2  Classification into more than two multivariate normal populations with known dispersion matrices

Let X (p x 1) be a random vector which is distributed in the population pi_i (i = 0, 1, ..., k) with the distribution function F_i (i = 0, 1, ..., k). Let xbar_i (i = 0, 1, ..., k) be the sample mean vector based on a random sample of size n_i from pi_i (i = 0, 1, ..., k).

Case I  Let F_i = N(mu_i, Sigma) (i = 0, 1, ..., k), where Sigma is the known matrix. We also assume, for simplicity, n_1 = n_2 = ... = n_k (= n) and the loss function of type 1. The mu_i's are unknown, except that it is known that mu_0 = mu_i for some i (i = 1, ..., k) and mu_1, mu_2, ..., mu_k are not all equal.
Then, following the method used in the univariate case with a prior distribution of mu uniform over the surface of a sphere, and using Lemma 3.1.1, it can be shown that the following rule is admissible.

Take the decision mu_0 = mu_i (i = 1, ..., k) if b_i = min over 1 <= j <= k of b_j, where b_i is the quadratic form

    b_i = (d z_i - zbar)' Sigma^{-1} (d z_i - zbar),

with z_i = (1 + 1/n)^{-1/2} (xbar_0 - xbar_i), zbar = [a(k-1) + 1]^{-1} a (z_1 + ... + z_k), a = (1 + 1/n)^{-1}, and d a constant determined by a and k.
Case II  Let F_i = N(mu_i, Sigma) (i = 0, 1, ..., k), where Sigma is a known matrix. In this case, we assume the simple loss function, and mu_0 = mu_i for only one i (i = 1, ..., k). Then Ellison [6] has shown that the following class of rules is admissible.

Take the decision mu_0 = mu_i (i = 1, ..., k) if t_i = min over 1 <= j <= k of t_j, where

    t_i = -log xi_i + (1/2)(1 + 1/n_i)^{-1} (xbar_0 - xbar_i)' Sigma^{-1} (xbar_0 - xbar_i)

and 0 < xi_i < 1, with xi_1 + ... + xi_k = 1. By taking xi_i = 1/k for all i, it follows that the maximum likelihood rule is admissible under the simple loss function.

Case III  Let F_i = N(mu_i, Sigma_i) (i = 0, 1, ..., k), where the Sigma_i's are known matrices. Then, assuming the simple loss function, Ellison [6] has shown that the following class of rules is admissible.

Take the decision mu_0 = mu_i (i = 1, ..., k) if t_i = min over 1 <= j <= k of t_j, where

    t_i = -log xi_i + A (xbar_0 - xbar_i)' (Sigma_0 + Sigma_i/n_i)^{-1} (xbar_0 - xbar_i),

where A and the xi_i's are defined as in Case II.
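The Case II and Case III rules are easy to apply once the sample means are in hand. The sketch below is illustrative only: the argument names are not from the original, sigmas[0] stands for the dispersion matrix of the observation to be classified, and passing equal weights xi_i = 1/k reproduces the maximum likelihood rule of Case II.

    import numpy as np

    def classify_k_populations(x0bar, xbars, sigmas, ns, xi=None):
        """Take decision mu_0 = mu_i for the i minimizing
        t_i = -log xi_i + 0.5 * d' (Sigma_0 + Sigma_i/n_i)^{-1} d,  d = x0 - x_i
        (Case III form; with all Sigma_i equal to a common Sigma this
        reduces to the Case II rule)."""
        k = len(xbars)
        if xi is None:
            xi = np.full(k, 1.0 / k)        # equal weights -> ML rule in Case II
        t = np.empty(k)
        for i in range(k):
            d = x0bar - xbars[i]
            cov = sigmas[0] + sigmas[i + 1] / ns[i]   # Sigma_0 + Sigma_i / n_i
            t[i] = -np.log(xi[i]) + 0.5 * d @ np.linalg.solve(cov, d)
        return int(np.argmin(t)) + 1        # population label 1..k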
3.3  Classification into one of two multivariate normal populations with common and unknown dispersion matrix

Let X (p x 1) be a random vector which is distributed in the population pi_i (i = 0, 1, 2) as N(mu_i, Sigma) (i = 0, 1, 2); the mu_i's and Sigma are unknown except that it is known that mu_0 = mu_i for only one i (i = 1, 2), mu_1 is different from mu_2, and Sigma is non-singular. Let xbar_i be the sample mean vector based on a random sample of size n_i (i = 0, 1, 2) from pi_i (i = 0, 1, 2); we assume n_0 = 1. Let S be the pooled unbiased estimator of Sigma.

It will be enough to consider the class of rules based on the sufficient statistics xbar_0, xbar_1, xbar_2, S.
A classification rule phi in Phi is invariant under G_p (the additive group in E_p) if

    phi(xbar_0 + c, xbar_1 + c, xbar_2 + c, S) = phi(xbar_0, xbar_1, xbar_2, S)

for all (p x 1) vectors c. It is seen easily that a necessary and sufficient condition for this is that phi depends on (xbar_0, xbar_1, xbar_2, S) through (xbar_0 - xbar_1, xbar_0 - xbar_2, S). We denote the class of all such invariant rules by Phi^(1). A classification rule phi in Phi^(1) is invariant under L_p (the group of all p x p non-singular real matrices) if

    phi(xbar_0 - xbar_1, xbar_0 - xbar_2, S) = phi[ A(xbar_0 - xbar_1), A(xbar_0 - xbar_2), A S A' ]

for all A in L_p.
Lemma 3.3.1  A set of maximal invariant functions on the space of (xbar_0 - xbar_1, xbar_0 - xbar_2, S) under the transformation group L_p is (b_11, b_12, b_22), where

    b_ij = (xbar_0 - xbar_i)' S^{-1} (xbar_0 - xbar_j)   (i, j = 1, 2).

Proof  Define z_i = xbar_0 - xbar_i (i = 1, 2) and f(z_1, z_2, S) = (b_11, b_12, b_22). It can be seen easily that the b_ij's are invariant under L_p. We shall show that, if

    f(z_1*, z_2*, S*) = (b_11*, b_12*, b_22*) = (b_11, b_12, b_22),

then there exists an element A in L_p such that (z_1*, z_2*, S*) = (A z_1, A z_2, A S A').
There exists a unique lower-triangular p x p matrix T with positive diagonal elements such that S = T T'. Define u = T^{-1} z_1 and v = T^{-1} z_2. Then

    u'u = b_11,   u'v = b_12,   v'v = b_22.

Let

    u_0' = ( b_11^{1/2}, 0, 0, ..., 0 )   and   v_0' = ( b_12/b_11^{1/2}, (b_22 - b_12^2/b_11)^{1/2}, 0, ..., 0 )

be two 1 x p vectors. Then there exists an orthogonal transformation C such that (u_0, v_0) = C(u, v). Similarly, define S* = T* T*', u* = T*^{-1} z_1*, v* = T*^{-1} z_2*. There also exists another orthogonal transformation C* such that (u_0, v_0) = C*(u*, v*). Thus,

    (u*, v*) = C*^{-1} C (u, v) = D(u, v),

where D = C*^{-1} C is orthogonal. Then

    z_i* = T* D T^{-1} z_i = A z_i, say,   for i = 1, 2.

Also,

    A S A' = (T* D T^{-1}) T T' (T'^{-1} D' T*') = T* D D' T*' = T* T*' = S*.

Thus the lemma is proved.
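The invariants of Lemma 3.3.1 are simple to compute; the following sketch (illustrative only, using the Cholesky factor T of S exactly as in the proof) returns (b_11, b_12, b_22) from the sample means and the pooled matrix S.

    import numpy as np

    def maximal_invariants(xbar0, xbar1, xbar2, S):
        """Compute b_ij = (xbar0 - xbar_i)' S^{-1} (xbar0 - xbar_j), i, j = 1, 2,
        the maximal invariants of Lemma 3.3.1."""
        z1, z2 = xbar0 - xbar1, xbar0 - xbar2
        T = np.linalg.cholesky(S)          # S = T T', T lower triangular
        u = np.linalg.solve(T, z1)         # u = T^{-1} z1
        v = np.linalg.solve(T, z2)         # v = T^{-1} z2
        return u @ u, u @ v, v @ v         # (b11, b12, b22)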
Corollary  A maximal invariant function on the parameter space under the groups induced by G_p and L_p is (mu_1 - mu_2)' Sigma^{-1} (mu_1 - mu_2), together with the index i for which mu_0 = mu_i.

Proof  It follows from the above lemma that a set of maximal invariant functions on the parameter space under the groups induced by G_p and L_p is (beta_11, beta_12, beta_22), where beta_ij = (mu_0 - mu_i)' Sigma^{-1} (mu_0 - mu_j). The corollary follows from the fact that mu_0 is either equal to mu_1 or to mu_2.

Distribution of the maximal invariants b_ij

Define

    y_1 = c_1 [ k_1 (xbar_0 - xbar_1) - k_2 (xbar_0 - xbar_2) ],
    y_2 = c_2 [ k_1 (xbar_0 - xbar_1) + k_2 (xbar_0 - xbar_2) ],

where k_i = (1 + 1/n_i)^{-1/2} (i = 1, 2), c_1^2 = 1/[2(1 - k_1 k_2)] and c_2^2 = 1/[2(1 + k_1 k_2)]. Define

    m_ij = y_i' S^{-1} y_j   (i, j = 1, 2).

Then it can be seen that (b_11, b_12, b_22) is a non-singular linear transform of (m_11, m_12, m_22); the determinant of the transformation matrix is proportional to (c_1 c_2 k_1 k_2)^3 and hence is not zero.

Let Phi^(2) be the class of classification rules invariant under G_p and L_p. Without any loss of generality, we can consider the rules in Phi^(2) as functions of m_11, m_12, m_22 only. Note that y_1 and y_2 are independently normally distributed, each as a p-variate normal distribution with the same dispersion matrix Sigma. When mu_0 = mu_1,

    E y_1 = -c_1 k_2 (mu_1 - mu_2),   E y_2 = c_2 k_2 (mu_1 - mu_2);

when mu_0 = mu_2,

    E y_1 = -c_1 k_1 (mu_1 - mu_2),   E y_2 = -c_2 k_1 (mu_1 - mu_2).
We consider the parameter space Omega as Omega = Omega_1 union Omega_2, where, for i = 1, 2,

    Omega_i = { omega = (delta_1, delta_2, delta_3, Sigma) : delta_i is non-null, delta_j = 0 for j different from i; i, j = 1, 2 },

with delta_i = k_i (mu_0 - mu_i) (i = 1, 2) and delta_3 = mu_0 + n_1 mu_1 + n_2 mu_2.

For studying the rules in Phi^(2), we can consider the parameter space as Omega = union over i = 1, 2 and D > 0 of Omega_iD, where Omega_iD is the subset of Omega_i on which the non-null delta satisfies delta' Sigma^{-1} delta = D. We shall consider here the loss function of type 2. Then, from Theorem 1.2.2, it follows that any rule phi in Phi^(2) will have constant risk in Omega_D = Omega_1D union Omega_2D for any D > 0.
The joint density of (m_11, m_12, m_22) when omega is in Omega_1D is given in [28]; it has the form

    p_1D(m_11, m_12, m_22) = K exp[ -D(c_1^2 + c_2^2)/2 ] |M|^{(p-3)/2}
        sum over j >= 0 of (D/2)^j { c_1^2 m_11 + 2 c_1 c_2 m_12 + c_2^2 m_22 + (c_1^2 + c_2^2)|M| }^j
            / { Gamma(p/2 + j) j! |I + M|^{(m+2)/2 + j} },

where K is a constant, M is the 2 x 2 matrix with elements m_ij, |M| = m_11 m_22 - m_12^2 > 0, I is the 2 x 2 identity matrix, and m = n_1 + n_2 - 2. When omega is in Omega_2D, the joint density p_2D(m_11, m_12, m_22) has the same form with -2 c_1 c_2 m_12 in place of +2 c_1 c_2 m_12.

Consider a prior distribution xi_D on the space of the maximal invariant of omega under the groups induced by G_p and L_p such that

(i)   xi_D assigns probability 1 to Omega_1D union Omega_2D,
(ii)  Prob(omega in Omega_iD) = 1/2   (i = 1, 2).

Then, for any rule phi in Phi^(2), the Bayes risk is
    r(xi_D; phi) = (1/2) integral of [ phi_2 p_1D + phi_1 p_2D ] d m_11 d m_12 d m_22,

up to a factor not depending on phi. Substituting the series forms of p_1D and p_2D, the two integrands differ only in the sign of the term 2 c_1 c_2 m_12. Now, note that, for a > 0 and for any positive integer j,

    (a + x)^j < (a - x)^j   if and only if   x < 0.

Thus, we see that r(xi_D; phi) is minimized (for the variation of phi in Phi^(2)) for phi = psi, where psi_1 = 1 if m_12 < 0 and psi_2 = 1 otherwise. Also, psi is the unique (a.e.) Bayes rule in Phi^(2) against xi_D for any D > 0. Now,

    m_12 = y_1' S^{-1} y_2
         = c_1 c_2 [ (1 + 1/n_1)^{-1} (xbar_0 - xbar_1)' S^{-1} (xbar_0 - xbar_1) - (1 + 1/n_2)^{-1} (xbar_0 - xbar_2)' S^{-1} (xbar_0 - xbar_2) ],

and c_1 c_2 > 0. Thus psi is the maximum likelihood rule. Note that psi does not depend on D and its risk is constant in Omega_D for any D > 0. Thus, from Theorem 1.1.3, it follows that psi is a minimax rule in Phi^(2). Also, from Corollary 1.1.1, it follows that psi is an admissible rule in Phi^(2). Thus we have proved the following theorem.
Theorem 3.3.1  For the problem of classification into one of two multivariate normal populations with common and unknown dispersion matrix, the maximum likelihood rule is minimax and admissible in the class of rules invariant under the translation group and the group of non-singular linear transformations, considering the loss function of type 2.

Remark 1  When n_1 = n_2, the maximum likelihood rule psi is also minimax and admissible in Phi^(2) under the loss function of type 1.

Remark 2  If l is a continuous function on the positive half of the real line, then the maximum likelihood rule is the unique (a.e.) minimax rule in Phi^(2).
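The rule of Theorem 3.3.1 replaces the unknown Sigma of Section 3.1 by the pooled estimator S. A sketch (illustrative only; the pooling below assumes the two training samples are supplied as data matrices, which is one natural way to obtain S):

    import numpy as np

    def pooled_S(X1, X2):
        """Pooled unbiased estimator of the common dispersion matrix from
        the training samples X1 (n1 x p) and X2 (n2 x p)."""
        n1, n2 = X1.shape[0], X2.shape[0]
        R1 = X1 - X1.mean(axis=0)
        R2 = X2 - X2.mean(axis=0)
        return (R1.T @ R1 + R2.T @ R2) / (n1 + n2 - 2)

    def ml_rule_unknown_sigma(x0, X1, X2):
        """Decide pi_1 iff (1+1/n1)^{-1}(x0-xbar1)' S^{-1} (x0-xbar1)
                         < (1+1/n2)^{-1}(x0-xbar2)' S^{-1} (x0-xbar2)."""
        n1, n2 = X1.shape[0], X2.shape[0]
        S = pooled_S(X1, X2)
        d1, d2 = x0 - X1.mean(axis=0), x0 - X2.mean(axis=0)
        q1 = (d1 @ np.linalg.solve(S, d1)) / (1 + 1.0 / n1)
        q2 = (d2 @ np.linalg.solve(S, d2)) / (1 + 1.0 / n2)
        return 1 if q1 < q2 else 2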
Some properties of the ML rule

We state the following theorem, which follows from the proof of Theorem 3.3.1.

Theorem 3.3.2  For any classification rule phi in Phi^(2) satisfying the condition (i) below, we have E_omega phi_i <= E_omega psi_i for all omega in Omega_i (i = 1, 2), where psi is the ML rule.

(i)  For any fixed D > 0,

    E[ phi_1 | delta_1' Sigma^{-1} delta_1 = D, delta_2 = 0 ] = E[ phi_2 | delta_2' Sigma^{-1} delta_2 = D, delta_1 = 0 ].

Remark  When n_1 = n_2, the condition (i) implies that the probability of correct classification into pi_1 (i.e., mu_0 = mu_1) is equal to the probability of correct classification into pi_2 (i.e., mu_0 = mu_2).
Corollary 3.3.1  The maximum likelihood rule psi is an unbiased rule.

Proof  Compare the ML rule psi with the rule phi = (1/2, 1/2); both these rules satisfy the condition in the above theorem. Thus, for any omega_i in Omega_i (i = 1, 2),

    E_{omega_i} psi_i >= 1/2.

Hence, for the ML rule, the probability of correct classification is greater than or equal to the probability of misclassification.
Corollary 3.3.2  The ML rule psi is most stringent in the class of rules phi in Phi^(2) satisfying the condition (i) of Theorem 3.3.2.

Proof  Let Phi^(3) be the subclass of Phi^(2) consisting of all rules for which (i) holds. For all omega in Omega_D and all phi in Phi^(3),

    r(omega, psi) - inf over phi' in Phi^(3) of r(omega, phi')  <=  r(omega, phi) - inf over phi' in Phi^(3) of r(omega, phi').

The above corollary follows from the fact that psi does not depend on D and the above inequality holds for any D > 0.

If we have n_0 observations from pi_0, then it is easily seen that the ML rule is error-consistent.
3.4  Concluding remarks

The problem of classification into one of two multivariate normal populations with the same unknown dispersion matrix is entirely different from the problem where the common dispersion matrix is known. The present author fails to show that the ML rule is minimax in the unrestricted class when the common dispersion matrix is unknown. However, the discussion given at the end of Chapter I might help to solve this problem. Also, it might be interesting to derive the complete class for these problems.

The problem becomes complicated when there are more than two populations. The maximal invariant function, in the case when the common dispersion matrix is unknown, would be vector-valued, and this might create difficulties in finding a minimax rule.
CHAPTER IV

SOME SPECIAL CLASSIFICATION PROBLEMS

4.0  Summary

In this chapter, it is first shown that the maximum likelihood rule is an admissible rule for the problem of classification into k univariate normal populations with unknown common variance and when the k population means are linearly restricted. This situation arises when the k populations are identified with the k cells of a statistical design. This result is generalized to the multivariate case when the common dispersion matrix is known. Secondly, a class of admissible rules is derived for the problem of classification into k multivariate normal populations N(mu_i, Sigma_i), when the Sigma_i's are known and it is further known that the individual to be classified comes from N(mu_0, Sigma_0), where (mu_0, Sigma_0) = (mu_i, Sigma_i) for only one value of i (i = 1, ..., k); it is also shown that the ML rule is e-admissible.
4.1  Classification into univariate structured normal populations

Let X be a random variable which is distributed in the population pi_i (i = 0, 1, ..., k) as N(mu_i, sigma^2) (i = 0, 1, ..., k); the mu_i's and sigma are unknown except that it is known that

(i)   mu_0 = mu_i for only one i (i = 1, ..., k),
(ii)  mu_1, ..., mu_k are all different,
(iii) mu_1, mu_2, ..., mu_k are linearly restricted as follows:

(4.1.1)    mu (k x 1) = (mu_1, mu_2, ..., mu_k)' = A beta,   A (k x m),  beta (m x 1),

where beta is a vector of unknown parameters and A is a known matrix of rank r < k. In other words, the mu_i's lie in an r-dimensional Euclidean subspace. The matrix A is sometimes called a design matrix.
Let x_0, x_1, ..., x_k be random observations on X from the populations pi_0, pi_1, ..., pi_k respectively. Let x' = (x_0, x_1, ..., x_k). We define a classification function phi as a k-dimensional vector-valued measurable function of x and denote it by phi = (phi_1, ..., phi_k), where

    phi_i(x) = probability, given x, of taking the decision mu_0 = mu_i   (i = 1, ..., k).

Let

    Phi = { phi = (phi_1, ..., phi_k) : 0 <= phi_i(.) <= 1 for all i,  phi_1(.) + ... + phi_k(.) = 1 }.
A heuristic rule

We formulate below a heuristic rule suggested by Professor S. N. Roy. Later we shall show some of its properties. Let

    Omega_i = { omega = (mu, sigma) : mu_0 = mu_i, sigma > 0 }   (i = 1, ..., k),

and Omega = union of the Omega_i. Let p_omega(x) be the probability density of x when omega in Omega_i obtains. Define

    L_i = sup over omega in Omega_i of p_omega(x)   (i = 1, ..., k).

Then we call phi** the maximum likelihood rule, where

(4.1.2)    phi_i**(x) = 1  if  L_i = max over 1 <= j <= k of L_j,
           phi_i**(x) = 0  otherwise   (i = 1, ..., k).

It is clear that phi** is defined uniquely (a.e.).
The density of x under omega in Omega is

    p_omega(x) = (1/(sqrt(2 pi) sigma)^{k+1}) exp[ -(1/(2 sigma^2)) { (x_0 - mu_0)^2 + sum over i of (x_i - mu_i)^2 } ].

With these normality and homoscedasticity assumptions, we define a model m_i:

    mu_0 = mu_i,   mu = A beta   (i = 1, ..., k).

Let A_I (k x r) be a basis of A. Then it follows from Roy [27] that the error sum of squares under model m_i is of the form

    S_i = x'x - x_(i)' A_I (A_I' D_i^{-1} A_I)^{-1} A_I' x_(i)   (i = 1, ..., k),

where x_(i) is the vector of cell observations with x_0 assigned to the i-th cell, and D_i is a diagonal matrix with all diagonal elements unity except that the i-th diagonal element is 1/2.

It can be seen from the relation between maximum likelihood estimation and least squares estimation in the normal case that phi** can also be expressed as follows:

    phi_i**(x) = 1  if  S_i = min over 1 <= j <= k of S_j,
    phi_i**(x) = 0  otherwise.
phi** as an admissible rule

Next we shall show that phi** is an admissible rule in Phi with the zero-one loss function. Let Omega_i,sigma be the subspace of Omega_i in which sigma is fixed (i = 1, ..., k). Consider a prior distribution xi on Omega such that

(i)   Prob(omega in Omega_i,sigma) = 1/k   (i = 1, ..., k);
(ii)  the conditional distribution of omega in Omega_i (i = 1, ..., k) is such that omega lies in Omega_i,sigma with probability 1 and mu is distributed over Omega_i,sigma with c.d.f. F_i (i = 1, ..., k).

Then, for any phi in Phi, the risk r(xi, phi) is minimized by the rule phi^xi with

    phi_i^xi(x) = 1  if  I_i(x) = integral of p_omega(x) d F_i(mu)  is the maximum of [ I_1(x), ..., I_k(x) ],
    phi_i^xi(x) = 0  otherwise   (i = 1, ..., k).
Now we shall choose the F_i's to simplify the I_i's. mu is restricted by (4.1.1) and lies in a known r-dimensional subspace, so we have to take a prior distribution under which mu lies in that subspace with probability 1. There exists a known orthogonal matrix C_i, when m_i holds, such that the joint density of y^(i) = C_i x is

    (1/(sqrt(2 pi) sigma)^{k+1}) exp[ -(1/(2 sigma^2)) { sum from j = 1 to r of (y_j^(i) - eta_j^(i))^2 + sum from j = r+1 to k+1 of y_j^(i)2 } ],

where eta^(i) = (eta_1^(i), ..., eta_r^(i)), and Omega_i,sigma is transformed (with a 1-1 correspondence) to the space of eta^(i) (Lehmann [21]). We take the prior distribution F_i on Omega_i,sigma such that eta_1^(i), ..., eta_r^(i) are independently distributed, each as N(0, sigma^2). Then

    I_i(x) = (1/(sqrt(2 pi) sigma)^{k+1+r}) integral of exp[ -(1/(2 sigma^2)) { sum from j = r+1 to k+1 of y_j^(i)2 + sum from j = 1 to r of (y_j^(i) - eta_j^(i))^2 + sum from j = 1 to r of eta_j^(i)2 } ] d eta_1^(i) ... d eta_r^(i)

           = (1/((sqrt(2 pi) sigma)^{k+1} 2^{r/2})) exp[ -(1/(4 sigma^2)) { sum from j = 1 to r of y_j^(i)2 + 2 sum from j = r+1 to k+1 of y_j^(i)2 } ].

Now,

    x'x = sum from j = 1 to k+1 of y_j^(i)2   and   S_i = sum from j = r+1 to k+1 of y_j^(i)2   (Lehmann [21]).

Hence

    I_i(x) = (1/((sqrt(2 pi) sigma)^{k+1} 2^{r/2})) exp[ -(1/(4 sigma^2)) ( x'x + S_i ) ]   (i = 1, ..., k).

Thus, I_i(x) = max over 1 <= j <= k of I_j(x) is equivalent to S_i = min over 1 <= j <= k of S_j, and hence phi^xi is the maximum likelihood rule phi**. Thus phi** is the unique (a.e.) Bayes rule against xi (with the above choice of the F_i's), and hence phi** is an admissible rule in Phi.

Example:  Two-way classification
Consider k = bt populations pi_11, pi_12, ..., pi_bt such that X is distributed in pi_ij as N(mu_ij, sigma^2) (i = 1, ..., b; j = 1, ..., t); the mu_ij's and sigma are unknown but it is known that

    mu_ij = mu + beta_i + tau_j   (i = 1, ..., b; j = 1, ..., t),   with sum over i of beta_i = 0 and sum over j of tau_j = 0.

Let x_ij be a random observation on X from pi_ij (i = 1, ..., b; j = 1, ..., t), and let x_0 be a random observation on X from pi_0, in which X is distributed as N(mu_0, sigma^2). Let x = (x_0, x_11, x_12, ..., x_bt), and for a classification rule phi in Phi let

    phi_ij(x) = probability of deciding mu_0 = mu_ij, given x.

When mu_0 = mu_ij, the sum of squares due to error is

    S_ij = S + a ( x_0 - xbar_i. - xbar_.j + xbar_.. )^2,

where

    S = sum over i, j of x_ij^2 - t sum over i of xbar_i.^2 - b sum over j of xbar_.j^2 + bt xbar_..^2,

    xbar_i. = (1/t) sum over j of x_ij,   xbar_.j = (1/b) sum over i of x_ij,   xbar_.. = (1/(bt)) sum over i, j of x_ij,

and a = bt/(b + t + bt - 1). Define

    d_ij = x_0 - xbar_i. - xbar_.j + xbar_.. .

Then the maximum likelihood rule phi** is

    phi_ij**(x) = 1  if  d_ij^2 is the minimum of the d_i'j'^2 over all cells,
    phi_ij**(x) = 0  otherwise.

Remark
Let S denote the error sum of squares under model (4.1.1), disregarding x_0. Then it can be seen that

    S_i - S = d_i^2   (i = 1, ..., k),

where the d_i's are normally distributed. Then phi** can be expressed as follows:

(4.1.6)    phi_i**(x) = 1  if  d_i^2 = min over 1 <= j <= k of d_j^2,
           phi_i**(x) = 0  otherwise.
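In the two-way example the quantities d_ij are elementary functions of the cell, row, column and grand means; a sketch of the rule (illustrative only, assuming one observation per cell supplied as a b x t array):

    import numpy as np

    def classify_two_way(x0, X):
        """X is a b x t array of cell observations x_ij; returns the cell
        (i, j) (0-based) to which the maximum likelihood rule assigns x_0,
        i.e. the cell minimizing d_ij^2 = (x0 - xbar_i. - xbar_.j + xbar_..)^2."""
        row_means = X.mean(axis=1, keepdims=True)   # xbar_i.
        col_means = X.mean(axis=0, keepdims=True)   # xbar_.j
        grand = X.mean()                            # xbar_..
        d = x0 - row_means - col_means + grand      # d_ij for every cell
        i, j = np.unravel_index(np.argmin(d ** 2), d.shape)
        return i, j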
Upper bound for the probability of misclassification

For simplicity, let us assume that omega is in Omega_1. Then d_1, ..., d_k have a joint normal distribution and the expectation of d_1 is zero. Let E_i1 be the event d_i^2 <= d_1^2 (i = 2, ..., k). Then the probability of classifying x_0 into pi_i (i different from 1) satisfies

(4.1.7)    Pr( union over i = 2, ..., k of E_i1 ) <= sum over i = 2, ..., k of Pr(E_i1).

Pr(E_i1), for each i (i = 2, ..., k), can be calculated using incomplete probability integrals for univariate normal distributions.
4.2  Classification into structured multivariate normal populations

Consider a random vector X (p x 1) which is distributed in the population pi_i (i = 0, 1, ..., k) as N(mu_i, Sigma) (i = 0, 1, ..., k). Let x_i (i = 0, 1, ..., k) be a random observation on X from pi_i (i = 0, 1, ..., k). The mu_i's are unknown but it is known that

(i)  mu_0 = mu_i for only one i (i = 1, ..., k),

(ii) (4.2.1)   mu (k x p) = (mu_1, mu_2, ..., mu_k)' = A xi,   A (k x m),  xi (m x p),

where xi is a matrix of unknown parameters, and A is a known matrix of rank r < k.

Consider the following models:
(4.2.2)    m:   E x_0 = mu_0,   E X = mu = A xi,

where X (k x p) is the matrix with rows x_1', ..., x_k', and the models

(4.2.3)    m_i:   mu_0 = mu_i, together with (4.2.2)   (i = 1, ..., k).

Let S be the matrix of sums of products due to error under the model m, and let S_i (i = 1, ..., k) be the sums of products matrix due to error under the model m_i (i = 1, ..., k). It can be seen easily that

(4.2.4)    S_i = S + d_i d_i'   (i = 1, ..., k),

where the d_i's are distributed as p-variate normal. Explicit expressions for S, S_i, d_i can be obtained from the general results due to Roy [27].
Case I:  Sigma matrix is known

We first make a transformation as follows:

    y_i = T^{-1} x_i   (i = 0, 1, ..., k),   Sigma = T T'.

Consider the model

(4.2.5)    E y_0 = nu_0,   E Y = nu = A eta,

where Y (k x p) is the matrix with rows y_1', ..., y_k', nu (k x p) = mu (T')^{-1} and eta (m x p) = xi (T')^{-1}; y_i (i = 0, 1, ..., k) is distributed as p-variate normal with dispersion matrix the identity matrix. Also, consider the models

(4.2.6)    nu_0 = nu_i, together with (4.2.5)   (i = 1, ..., k).

Let S*, S_i* (i = 1, ..., k) denote the sums of products matrices due to error under the models (4.2.5), (4.2.6) respectively. Also,

(4.2.7)    S_i* = S* + d_i* d_i*'   (i = 1, ..., k).

It can be found easily that the maximum likelihood rule phi** can be expressed as follows:

(4.2.8)    phi_i**(x) = 1  if  tr(S_i*) = min over 1 <= j <= k of tr(S_j*),
           phi_i**(x) = 0  otherwise   (i = 1, ..., k).
It can be seen that the relation

    tr(S_i*) = min over 1 <= j <= k of tr(S_j*)

is equivalent to

    d_i*' d_i* = min over 1 <= j <= k of d_j*' d_j*,

and this is equivalent to

    d_i' Sigma^{-1} d_i = min over 1 <= j <= k of d_j' Sigma^{-1} d_j,

where the d_i's are defined in (4.2.4). It can be easily shown, following more or less the same line of proof as in the univariate case, that phi** is an admissible rule in Phi using a zero-one loss function. Bounds for the probability of misclassification can be obtained using the results in 4.1 and 3.2.
Case II:  Sigma matrix is unknown

It can be easily seen [1] that the maximum likelihood rule phi**, in this case, can be expressed as follows:

(4.2.9)    phi_i**(x_0, X) = 1  if  |S_i| = min over 1 <= j <= k of |S_j|,
           phi_i**(x_0, X) = 0  otherwise   (i = 1, ..., k).

Also note that the relation |S_i| = min over 1 <= j <= k of |S_j| is equivalent to

(4.2.10)    d_i' S^{-1} d_i = min over 1 <= j <= k of d_j' S^{-1} d_j,

where S and the d_i's are defined in (4.2.4).
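Once S and the d_i's have been computed from the design fit, the rule (4.2.9)-(4.2.10) is a one-line comparison. The sketch below is illustrative only; it takes S and the d_i's as given (e.g. obtained by the formulas of Roy [27]) and uses the equivalence (4.2.10), which amounts to the identity |S + d d'| = |S| (1 + d' S^{-1} d).

    import numpy as np

    def classify_by_determinant(S, d_list):
        """Assign x_0 to the population i minimizing |S_i| = |S + d_i d_i'|,
        which by (4.2.10) is the i minimizing d_i' S^{-1} d_i."""
        scores = [d @ np.linalg.solve(S, d) for d in d_list]
        return int(np.argmin(scores)) + 1    # population label 1..k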
The present author fails to show the admissibility of phi** in this case, even after restricting to an invariant class of rules (invariant under premultiplication by the p x p non-singular matrix group). A possible approach would be to follow the line of proof used to show the admissibility of Roy's maximum latent root test for MANOVA. It might be mentioned that this classification rule can be expressed as a union-intersection [27] classification rule as follows: (4.2.10) is equivalent to

    intersection over all non-null (p x 1) vectors l of the events [ (l' d_i d_i' l)/(l' S l)  <=  (l' d_j d_j' l)/(l' S l)  for all j, 1 <= j <= k ].

Now, for any non-null l, we can reduce this problem to the univariate case, and then the corresponding maximum likelihood rule will be admissible. These facts might be useful for proving the admissibility of phi**.
4.3  Classification into k multivariate normal populations - a special case

Problem:  Let X (p x 1) be a random vector which is distributed in the population pi_i (i = 0, 1, ..., k) with c.d.f. F_i (i = 0, 1, ..., k).

Assumptions:
(i)   F_i = N(mu_i, Sigma_i) (i = 1, ..., k), where the Sigma_i's are known.
(ii)  F_0 = F_i for only one value of i (i = 1, ..., k).

Let A be the action space, A = (a_1, ..., a_k), where a_i denotes the decision F_0 = F_i (i = 1, ..., k). We assume the loss function

    L(a_i, F_0 = F_j) = 1  if i is different from j,
                      = 0  if i = j   (i, j = 1, ..., k).

Let x_i (i = 0, 1, ..., k) be a random observation on X from pi_i (i = 0, 1, ..., k). Let x = (x_0, x_1, ..., x_k) and let Phi denote the class of classification rules. Any phi in Phi is a measurable k-dimensional vector function of x, phi = (phi_1, ..., phi_k), where phi_i(x) denotes the probability of taking action a_i given x. It can be easily seen that the maximum likelihood rule phi** can be expressed as follows:

    phi_i**(x) = 1  if  l_i**(x) = min over 1 <= j <= k of l_j**(x),
    phi_i**(x) = 0  otherwise,

where

    l_i**(x) = log|Sigma_i| + (1/2)(x_i - x_0)' Sigma_i^{-1} (x_i - x_0)   (i = 1, ..., k).
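A sketch of the maximum likelihood rule phi** just displayed (illustrative only; xs and sigmas are lists of the training observations x_i and the known dispersion matrices Sigma_i):

    import numpy as np

    def ml_rule_known_sigmas(x0, xs, sigmas):
        """Take action a_i for the i minimizing
        l_i(x) = log|Sigma_i| + 0.5 * (x_i - x_0)' Sigma_i^{-1} (x_i - x_0)."""
        scores = []
        for xi, Si in zip(xs, sigmas):
            d = xi - x0
            _, logdet = np.linalg.slogdet(Si)     # log|Sigma_i|
            scores.append(logdet + 0.5 * d @ np.linalg.solve(Si, d))
        return int(np.argmin(scores)) + 1         # action label 1..k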
A family of admissible rules

Let

    Omega = union over i = 1, ..., k of Omega_i,   Omega_i = { omega = (mu_1, ..., mu_k) : F_0 = F_i }   (i = 1, ..., k).

Next we shall derive a class of Bayes rules, each of which is the unique (a.e.) Bayes rule against some prior distribution, and hence all of them are admissible.

Consider a prior distribution xi on Omega as follows:

(i)   the probability that the parameter point is in Omega_i (i = 1, ..., k) is xi_i (i = 1, ..., k), where the xi_i's sum to 1;
(ii)  the conditional c.d.f. of mu when the parameter point is in Omega_i is G_i (i = 1, ..., k).

Let F_0 = N(mu_0, Sigma_0) and

    Omega_i = { omega = (mu_0, mu, Sigma_0) : mu_0 = mu_i, Sigma_0 = Sigma_i }   (i = 1, ..., k).

The only unknown in Omega_i is mu, and its space is the pk-dimensional Euclidean space E_pk. The probability density of x when omega in Omega_i obtains is

    p_i(x, mu) = (2 pi)^{-p(k+1)/2} |Sigma_i|^{-1/2} ( product over j of |Sigma_j|^{-1/2} ) exp(-H_i/2),

where

(4.3.4)    H_i = (x_0 - mu_i)' Sigma_i^{-1} (x_0 - mu_i) + sum over j = 1, ..., k of (x_j - mu_j)' Sigma_j^{-1} (x_j - mu_j).
Next we shall prove some lemmas.

Lemma 4.3.1  H_i can be written as the sum of the quadratic form (z_0i - delta_0i)' B_i^{-1} (z_0i - delta_0i) and quadratic forms in z_i and in z_(i) = (z_1', z_2', ..., z_{i-1}', z_{i+1}', ..., z_k')', where the vectors z_0i, z_i, z_(i), the mean vector delta_0i and the matrices B_i, C_i1, C_i2, C_i3 are defined in terms of x_0, x_1, ..., x_k, the mu_j's and the Sigma_j's.

Proof  When omega is in Omega_i (i = 1, ..., k), note that

(i)    z_0i is independent of z_i (i = 1, ..., k);
(ii)   the dispersion matrix of z_0i is B_i and its mean is delta_0i;
(iii)  the dispersion matrix of (z_i, z_1, z_2, ..., z_{i-1}, z_{i+1}, ..., z_k) is C_i and E z_j = 0 (j different from i);
(iv)   the dispersion matrix of z_i is C_i1, and the dispersion matrix of z_(i), given z_i, is C_i3 - C_i2' C_i1^{-1} C_i2;
(v)    the conditional mean of z_(i), given z_i, is delta_(i) + C_i2' C_i1^{-1} z_i.

The lemma follows easily from the above facts.
Lemma 4.3.2  Let xi and eta be p x 1 vectors and let B and C be two p x p symmetric positive definite matrices. Then

    (2 pi)^{-p/2} |C|^{-1/2} integral of exp[ -(1/2)(xi - eta)' B^{-1} (xi - eta) - (1/2) eta' C^{-1} eta ] d eta
        = |B|^{1/2} |B + C|^{-1/2} exp[ -(1/2) xi' (B + C)^{-1} xi ].

The proof of this lemma follows from the convolution of two normal density functions and the additive property of independent normally distributed random vectors.

Lemma 4.3.3  Let A_1 (q x q) be non-singular and let A be the partitioned matrix with blocks A_1, A_2 in the first row and A_2', A_3 in the second. Then, for lambda different from -1, the inverse and the determinant of the matrix obtained from A on replacing A_1 by (lambda + 1) A_1 can be expressed in terms of A_1^{-1}, A_3 - A_2' A_1^{-1} A_2 and lambda/(lambda + 1). For a proof see [36].
Now note that when omega is in Omega_i, there is a 1-1 correspondence between mu and (delta_0i, delta_(i)). Hence, instead of considering a prior distribution of mu when omega is in Omega_i, we shall consider a prior distribution of (delta_0i, delta_(i)).

Consider the prior distribution xi(g, G) as follows:

(i)   the probability assigned to Omega_i is g_i, with 0 < g_i < 1 and g_1 + ... + g_k = 1 (i = 1, ..., k);
(ii)  the conditional c.d.f. of mu, given omega in Omega_i, is G_i (i = 1, ..., k).

We shall consider G_i such that

(a)  delta_0i is independent of delta_(i);
(b)  delta_0i is distributed as a p-variate normal distribution with zero means and dispersion matrix a_i B_i (a_i > 0);
(c)  delta_(i) is distributed as a p(k-1)-variate normal distribution with zero means and dispersion matrix beta_i C_i3*, where C_i3* = C_i3 - C_i2' C_i1^{-1} C_i2.

For any phi in Phi, the Bayes risk is

    r(xi, phi) = sum over i = 1, ..., k of g_i integral of [ (1 - phi_i(x)) integral of p_i(x, mu) d G_i(mu) ] d x.

Hence, r(xi, phi) is minimized for phi = phi^xi, where

(4.3.5)    phi_i^xi(x) = 1  if  I_i(x) = max over 1 <= j <= k of I_j(x),
           phi_i^xi(x) = 0  otherwise   (i = 1, ..., k),

and

(4.3.6)    I_i(x) = g_i integral of p_i(x, mu) d G_i(mu).

Thus, for the prior distribution xi(g, G), the Bayes rule is determined by (4.3.5) and (4.3.6).
Substituting the normal prior densities of delta_0i and delta_(i) into (4.3.6), and using the explicit form of p_i(x, mu) together with Lemma 4.3.1, the integral I_i(x) factors into an integral over delta_0i and an integral over delta_(i). Applying Lemma 4.3.2 to the first and Lemmas 4.3.2 and 4.3.3 to the second, the integrations can be carried out in closed form; in particular they contribute the factors (1 + a_i)^{-p/2} and (1 + beta_i)^{-p(k-1)/2}, and the exponents carry the coefficients 1/(1 + a_i) and beta_i/(1 + beta_i).
Thus the Bayes rule phi^xi against xi(g, G) can be expressed as follows:

(4.3.9)    phi_i^xi(x) = 1  if  t_i = min over 1 <= j <= k of t_j,
           phi_i^xi(x) = 0  otherwise   (i = 1, ..., k),

where t_i equals -log g_i plus a positive definite quadratic form in z_0i, z_i and z_(i) whose coefficients involve 1/(1 + a_i) and beta_i/(1 + beta_i).

By varying the a_i's, the beta_i's and the g_i's we will get a family of admissible rules. In particular, consider the prior distribution xi*(a, beta) such that

    g_i = 1/k,   a_i = a,   beta_i = beta   (i = 1, ..., k).

Then the Bayes rule phi^{xi*(a, beta)} can be expressed as follows:

    phi_i(x) = 1  if  u_i = min over 1 <= j <= k of u_j,
    phi_i(x) = 0  otherwise   (i = 1, ..., k),

where u_i collects the quadratic forms above with the common coefficients 1/(1 + a) and beta/(1 + beta). Now, as a tends to infinity and beta tends to infinity, u_i tends (after dropping terms common to all i) to l_i**(x). Thus, we see that the maximum likelihood rule phi** is the limit of a sequence of admissible rules. Hence, the maximum likelihood rule is e-admissible.
CHAPTER V

NONPARAMETRIC CLASSIFICATION RULES

5.0  Summary

In the first part of this chapter, minimum distance classification rules based on distance functions between two distribution functions are proposed and their consistencies are shown. For the rule based on Kolmogorov's distance, a lower bound for the probability of correct classification is obtained. In the second part, a classification rule based on Wilcoxon's statistic is proposed and its consistency is shown.
5.1  Minimum distance classification rules

5.1.1  Definitions and notations

Definition 5.1.1  A distance function D between two distribution functions F and G must satisfy the following conditions:

(i)   D(F, G) = D(G, F) >= 0,
(ii)  D(F, F) = 0,
(iii) D(F, G) <= D(F, H) + D(H, G),

for all d.f.'s F, G, H.

Let X_1, X_2, ..., X_n be a random sample of size n from a population with d.f. F. We define the sample d.f. S_n as follows:

    S_n(x) = (1/n) [ number of X_i's <= x ].
Definition 5.1.2  A distance function D is said to be consistent if the following condition holds: given any epsilon > 0 and epsilon' > 0, there exists a number N such that for n > N,

(5.1.1)    Prob[ D(S_n, F) >= epsilon | F ] < epsilon',

where S_n is the sample d.f. based on a random sample of size n. If (5.1.1) holds uniformly for all F in a subclass of the class of all univariate distributions, then D is said to be uniformly consistent over that subclass.
Definition 5.1.3  A distance function D is called the Kolmogorov distance when

    D(F, G) = sup over -infinity < x < infinity of |F(x) - G(x)|.

When we consider distribution functions of a random vector X (p x 1), the above definitions can easily be modified in a natural way. Let X_1, X_2, ..., X_n be a random sample of size n. Then the sample distribution function S_n is defined as follows:

    S_n(x) = (1/n) [ number of X_i's such that X_i^(j) <= x_j for j = 1, ..., p ]   (i = 1, ..., n),

where x' = (x_1, ..., x_p). The Kolmogorov distance is defined in the multivariate case as

    D(F, G) = sup over x of |F(x) - G(x)|.
Classification problem:

Let X be a random variable which is distributed in the population pi_i (i = 0, 1, ..., k) with d.f. F_i (i = 0, 1, ..., k). The F_i's are unknown except that it is known that F_0 = F_i for only one value of i (i = 1, ..., k) and F_1, F_2, ..., F_k are all different.

Let S_i be the sample d.f. based on a random sample of size n_i (i = 0, 1, ..., k) from pi_i (i = 0, 1, ..., k). The action space A is denoted by A = (a_1, ..., a_k), where a_i denotes the decision F_0 = F_i (i = 1, ..., k). Let Z be the vector of all sample observations. A classification rule phi = (phi_1, ..., phi_k) is a measurable function of Z with the other usual assumptions; phi_i(Z) is the probability of taking action a_i given Z. The minimum distance rule phi^(D) based on a distance function D is defined as follows:

    phi_i^(D)(Z) = 1  if  d_0i = min over 1 <= j <= k of d_0j,
    phi_i^(D)(Z) = 0  otherwise   (i = 1, ..., k),

where

    d_0i = D(S_0, S_i)   (i = 1, ..., k).

phi^(D) is uniquely (a.e.) defined if the F_i's are continuous.
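A sketch of the minimum distance rule phi^(D) with D taken as the Kolmogorov distance (illustrative only; the empirical distribution functions are compared on the pooled set of sample points, which suffices for step functions):

    import numpy as np

    def kolmogorov_distance(a, b):
        """sup_x |S_a(x) - S_b(x)| for the empirical d.f.'s of samples a and b."""
        grid = np.sort(np.concatenate([a, b]))
        Fa = np.searchsorted(np.sort(a), grid, side="right") / len(a)
        Fb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
        return np.max(np.abs(Fa - Fb))

    def minimum_distance_rule(sample0, training_samples):
        """Decide F_0 = F_i for the i minimizing d_0i = D(S_0, S_i)."""
        d = [kolmogorov_distance(sample0, s) for s in training_samples]
        return int(np.argmin(d)) + 1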
5.1.2  Two lemmas

Let

    C_ii = Prob[ phi_i^(D)(Z) = 1 | F_0 = F_i ]   (i = 1, ..., k),

and let

    f_D(n, gamma, F) = Prob[ D(S_n, F) < gamma | F ],

where S_n is the sample d.f. based on a random sample of size n.

Lemma 5.1.1  In the notation of 5.1.1, consider k = 2. Then, for i = 1, 2,

(5.1.5)    C_ii >= f_D(n_0, D_12/4, F_i) f_D(n_1, D_12/4, F_1) f_D(n_2, D_12/4, F_2),

where D_12 = D(F_1, F_2).

Proof.  We shall prove (5.1.5) for i = 1. Assume F_0 = F_1. Now, by the triangle inequality,

(5.1.8)    d_02 - d_01 >= D(F_1, F_2) - D(S_1, F_1) - D(S_2, F_2) - 2 D(S_0, F_1).

Hence, if each of D(S_0, F_1), D(S_1, F_1), D(S_2, F_2) is less than D_12/4, then d_01 < d_02, i.e., phi_1^(D)(Z) = 1. Since S_0, S_1, S_2 are based on independent samples, (5.1.5) follows.
It can be similarly proved for i = 2.

Lemma 5.1.2  Let

    B_ij = Prob[ d_0i < d_0j | F_0 = F_i ]   (i different from j; i, j = 1, ..., k).

Then, for any i (i = 1, ..., k),

    1 - C_ii <= sum over j different from i of (1 - B_ij).

Proof.  Let E_ij be the event d_0i > d_0j (i different from j). Then

    1 - C_ii = Prob( union over j different from i of E_ij | F_0 = F_i ) <= sum over j different from i of Prob( E_ij | F_0 = F_i ) <= sum over j different from i of (1 - B_ij).
5.1.3  Some well-known results

The following results on the Kolmogorov distance [29], [7] are stated below without proof. Let S_n be the sample d.f. based on a random sample of size n from a population with d.f. F. Define

    D_n = sup over x of |S_n(x) - F(x)|.

Theorem 5.1.1  The Kolmogorov distance is uniformly consistent.

Theorem 5.1.2  For epsilon > 0,

    Prob( D_n >= epsilon | F ) < (4/epsilon^2) exp(-n epsilon^2/2).

Theorem 5.1.3  For every epsilon > 0 and epsilon' > 0, there exists a number N(epsilon, epsilon') such that

    Prob[ D_n > epsilon for some n > N | F ] < epsilon'

if N > N(epsilon, epsilon').

The following result on the Kolmogorov distance in the multivariate case is due to Kiefer and Wolfowitz [18]. We define

    D_n = sup over x of |S_n(x) - F(x)|,

where x is a p x 1 vector.

Theorem 5.1.4  For each p, there exist positive constants C_0 and c such that for all n, all F and all positive gamma,

    Prob[ D_n >= gamma/sqrt(n) | F ] < C_0 exp(-c gamma^2).

For p = 1, the best value of c is 2. Kiefer and Wolfowitz [18] discussed the possible values of C_0 and c. In general, c < 2 for p > 1.
5.1.4  Results on minimum distance classification rules

(i)  Theorem 5.1.5  The minimum distance classification rule phi^(D) is consistent (uniformly) if D is consistent (uniformly). By consistency of phi^(D) we mean C_ii -> 1 for all i as the n_i's -> infinity.

Proof.  This follows from the definition of a consistent distance function and Lemmas 5.1.1 and 5.1.2: for i different from j (i, j = 1, ..., k), the lower bounds supplied by the two lemmas tend to 1 as the sample sizes tend to infinity, since D is consistent.
(ii)  In the following results D stands for the Kolmogorov distance.

(a)  The minimum distance classification rule based on the Kolmogorov distance is uniformly consistent. It follows from Theorems 5.1.1 and 5.1.5. This is also true in the multivariate case.

(b)  For the case k = 2,

    C_ii >= product over j = 0, 1, 2 of [ 1 - (16/D_12^2) exp(-n_j D_12^2/32) ]   (i = 1, 2),

where D_12 = D(F_1, F_2). It follows from Lemma 5.1.1 and Theorem 5.1.2. For the general case, lower bounds for the C_ii's can be obtained using Lemmas 5.1.1, 5.1.2 and Theorem 5.1.2.
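The bound in (b) is immediate to evaluate; a sketch (illustrative only, using the bound exactly as displayed above, with delta12 standing for D_12):

    import numpy as np

    def lower_bound_correct_classification(delta12, n0, n1, n2):
        """Lower bound for C_ii in the case k = 2, from (b): the product over
        the three samples of [1 - (16/delta12^2) exp(-n delta12^2 / 32)]."""
        factor = lambda n: 1.0 - (16.0 / delta12 ** 2) * np.exp(-n * delta12 ** 2 / 32.0)
        return factor(n0) * factor(n1) * factor(n2)

    print(lower_bound_correct_classification(delta12=0.5, n0=200, n1=200, n2=200))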
(c)  For the case k = 2, as n_0, n_1, n_2 -> infinity, d_0i = min(d_01, d_02) when F_0 = F_i, with probability tending to 1. This follows from the relation (5.1.8) and Theorem 5.1.3.

(d)  In the multivariate case, for k = 2,

    C_ii >= product over j = 0, 1, 2 of [ 1 - C_0 exp(-c n_j K_12^2/16) ]   (i = 1, 2),

where

    K_12 = sup over x of |F_1(x) - F_2(x)|.

This can be easily generalized for k > 2.
5.2  Classification rule based on the Wilcoxon statistic

Let X be a random variable which is distributed in the population pi_i (i = 0, 1, 2) with d.f. F_i (i = 0, 1, 2); the F_i's are unknown but it is known that F_0 = F_i for only one value of i (i = 1, 2).

Let (x_1, ..., x_l), (y_1, ..., y_m), (z_1, ..., z_n) be random samples of sizes l, m, n from pi_0, pi_1, pi_2 respectively. Define

    u = (1/(lm)) [ number of pairs (x_i, y_j) with x_i < y_j ]   (i = 1, ..., l; j = 1, ..., m),

    v = (1/(ln)) [ number of pairs (x_i, z_j) with x_i < z_j ]   (i = 1, ..., l; j = 1, ..., n).

We propose the following heuristic classification rule based on u and v:

(5.2.3)    decide F_0 = F_1  if  |u - 1/2| < |v - 1/2|,
           decide F_0 = F_2  otherwise.

Now the inequality in (5.2.3) is equivalent to

    (u - v)(u + v - 1) < 0.
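A sketch of the rule (5.2.3) (illustrative only; x, y, z are the three samples as arrays):

    import numpy as np

    def wilcoxon_rule(x, y, z):
        """Decide F_0 = F_1 if |u - 1/2| < |v - 1/2|, and F_0 = F_2 otherwise,
        where u and v are the Wilcoxon-type statistics defined above."""
        u = np.mean(x[:, None] < y[None, :])   # fraction of pairs (x_i, y_j) with x_i < y_j
        v = np.mean(x[:, None] < z[None, :])   # fraction of pairs (x_i, z_j) with x_i < z_j
        return 1 if abs(u - 0.5) < abs(v - 0.5) else 2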
When F_0 = F_1,

    E(u - v) = 1/2 - integral of F_1(t) d F_2(t),   E(u + v - 1) = integral of F_1(t) d F_2(t) - 1/2.

From the results known about the Wilcoxon statistic [5] and from a theorem due to Cramer [4], it follows that (u - v)(u + v - 1) converges in probability to

    [ 1/2 - integral of F_1(t) d F_2(t) ] [ integral of F_1(t) d F_2(t) - 1/2 ],

which is negative provided the integral of F_1(t) d F_2(t) differs from 1/2. Thus,

    Prob[ (u - v)(u + v - 1) < 0 | F_0 = F_1 ] -> 1   as l, m, n -> infinity.

A similar result holds when F_0 = F_2. Hence the classification rule (5.2.3) is consistent.
5.3  Concluding remarks

This chapter is concerned with simple applications of the theory of nonparametric inference to the classification problem. However, it is interesting to note the similarities between these two branches. It would be interesting to investigate the following problems:

(i)   to make relative comparisons of the performances of different nonparametric classification rules;
(ii)  to obtain some optimum rules in a class of restricted rules, e.g., rules invariant under permutations, unbiased rules, etc.;
(iii) to obtain good lower bounds for the probability of correct classification.
BIBLIOGRAPHY

[1]  Anderson, T. W., Introduction to Multivariate Statistical Analysis, John Wiley and Sons, Inc., New York, 1958.

[2]  Anderson, T. W. and Bahadur, R. R., "Classification into two multivariate normal distributions with different covariance matrices," Annals of Mathematical Statistics, vol. 33 (1962), pp. 420-432.

[3]  Blackwell, D. and Girshick, M. A., Theory of Games and Statistical Decisions, John Wiley and Sons, Inc., New York, 1954.

[4]  Cramer, H., Mathematical Methods of Statistics, Princeton University Press, Princeton, 1946.

[5]  Dantzig, D. van, "On the consistency and the power of Wilcoxon's two sample test," Akademie van Wetenschappen, Amsterdam.

[6]  Ellison, B. E., "A classification problem in which information about alternative distributions is based on samples," Annals of Mathematical Statistics, vol. 33 (1962), pp. 213-223.

[7]  Feller, W., "On the Kolmogorov-Smirnov limit theorems for empirical distributions," Annals of Mathematical Statistics, vol. 19 (1948), pp. 177-189.

[8]  Fisher, R. A., "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7 (1938), pp. 179-188.

[9]  Fix, E. and Hodges, J. L., "Discriminatory analysis: nonparametric discrimination, consistency properties," Report No. 4 (1951), School of Aviation Medicine, Randolph Air Force Base, Texas.

[10]  Fix, E. and Hodges, J. L., "Discriminatory analysis: nonparametric discrimination, small-sample performance," Report No. 11 (1952), School of Aviation Medicine, Randolph Air Force Base, Texas.

[11]  Hudimoto, H., "On the distribution-free classification of an individual into one of two groups," Annals of the Institute of Statistical Mathematics, vol. 8 (1957), pp. 105-112.

[12]  Halmos, P., Measure Theory, Van Nostrand Co., New York, 1950.

[13]  Hoel, P. G. and Peterson, R. P., "A solution of the problem of optimum classification," Annals of Mathematical Statistics.

[14]  John, S., "On some classification problems - I," Sankhya, vol. 22 (1960), pp. 301-308.

[15]  John, S., "On some classification statistics," Sankhya, vol. 22 (1960), pp. 309-316.

[16]  John, S., "Errors in discrimination," Annals of Mathematical Statistics, vol. 32 (1961), pp. 1125-1144.

[17]  Kiefer, J., "Invariance, minimax sequential estimation and continuous time processes," Annals of Mathematical Statistics, vol. 28 (1957), pp. 573-601.

[18]  Kiefer, J. and Wolfowitz, J., "On the deviations of the empiric distribution function of vector chance variables," Transactions of the American Mathematical Society, vol. 87 (1958), pp. 173-186.

[19]  Kolmogorov, A., "Sulla determinazione empirica di una legge di distribuzione," Giornale dell'Istituto Italiano degli Attuari, vol. 4 (1933), pp. 83-91.

[20]  Kudo, A., "The classificatory problem viewed as a two-decision problem," Memoirs of the Faculty of Science, Kyushu University, Japan, Series A, vol. 13 (1959), pp. 96-125.

[21]  Lehmann, E. L., Testing Statistical Hypotheses, John Wiley and Sons, Inc., New York, 1959.

[22]  Lehmann, E. L., "Discriminatory analysis: on the simultaneous classification of several individuals," Report No. 6 (1951), School of Aviation Medicine, Randolph Air Force Base, Texas.

[23]  Matusita, K., "Decision rule, based on distance, for the classification problem," Annals of the Institute of Statistical Mathematics, vol. 8 (1957), pp. 67-77.

[24]  Montgomery, D. and Zippin, L., Topological Transformation Groups, Interscience Publishers, New York, 1955.

[25]  Rao, C. R., "A general theory of discrimination when the information about alternative population distributions is based on samples," Annals of Mathematical Statistics, vol. 25 (1954), pp. 651-670.

[26]  Robbins, H., "Asymptotically subminimax solutions of compound decision problems," Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, 1951, pp. 131-148.

[27]  Roy, S. N., Some Aspects of Multivariate Analysis, John Wiley and Sons, Inc., New York, 1958.

[28]  Sitgreaves, R., "On the distribution of two random matrices used in classification procedures," Annals of Mathematical Statistics, vol. 23 (1952), pp. 263-270.

[29]  Smirnov, N., "Table for estimating the goodness of fit of empirical distributions," Annals of Mathematical Statistics, vol. 19 (1948), pp. 279-281.

[30]  Solomon, H., et al., Studies in Item Analysis and Prediction, Stanford University Press, Stanford, California, 1961.

[31]  Stoller, D. C., "Univariate two-population distribution-free discrimination," Journal of the American Statistical Association, vol. 49 (1954), pp. 770-775.

[32]  Von Mises, R., "On the classification of observation data into two distinct groups," Annals of Mathematical Statistics, vol. 16 (1945), pp. 68-73.

[33]  Wald, A., Statistical Decision Functions, John Wiley and Sons, Inc., New York, 1950.

[34]  Wald, A., "On a statistical problem arising in the classification of an individual into one of two groups," Annals of Mathematical Statistics, vol. 15 (1944), pp. 145-163.

[35]  Wilcoxon, F., "Individual comparisons by ranking methods," Biometrics, vol. 1 (1945), pp. 80-82.

[36]  Waugh, F. V., "A note concerning Hotelling's method of inverting matrices," Annals of Mathematical Statistics, vol. 16 (1945), pp. 216-217.