
Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems
Author: Charles E. Antoniak
Source: The Annals of Statistics, Vol. 2, No. 6 (Nov., 1974), pp. 1152-1174
Published by: Institute of Mathematical Statistics
Stable URL: http://www.jstor.org/stable/2958336
MIXTURES OF DIRICHLET PROCESSES WITH APPLICATIONS TO BAYESIAN NONPARAMETRIC PROBLEMS¹

BY CHARLES E. ANTONIAK
University of California, Berkeley
A random process called the Dirichlet process, whose sample functions are almost surely probability measures, has been proposed by Ferguson as an approach to analyzing nonparametric problems from a Bayesian viewpoint. An important result obtained by Ferguson in this approach is that if observations are made on a random variable whose distribution is a random sample function of a Dirichlet process, then the conditional distribution of the random measure can be easily calculated, and is again a Dirichlet process.

This paper extends Ferguson's result to cases where the random measure is a mixing distribution for a parameter which determines the distribution from which observations are made. The conditional distribution of the random measure, given the observations, is no longer that of a simple Dirichlet process, but can be described as being a mixture of Dirichlet processes. This paper gives a formal definition for these mixtures and develops several theorems about their properties, the most important of which is a closure property for such mixtures. Formulas for computing the conditional distribution are derived, and applications to problems in bio-assay, discrimination, regression, and mixing distributions are given.
1. Introduction. In certain statistical problems, the distribution $F$ on the sample space $(X, \mathcal{B})$ is only known to belong to some collection of distributions $\mathcal{W} = \{F_\lambda\}$. This collection $\mathcal{W}$ may be treated as a parameter space in a Bayesian analysis of these problems, but the Bayesian approach requires the placing of a prior distribution on $\mathcal{W}$. If $\mathcal{W}$ is a parametric family, the prior distribution can be given for the parameters. However, in many so-called nonparametric or distribution-free problems the set $\mathcal{W}$ may be too large to permit such treatment. For example, $\mathcal{W}$ may be the class of all distribution functions on the real line. In this context, discussing what he calls the "empirical Bayes" problem, Robbins [16] has said:

A strictly Bayesian approach might be to start out with an a priori distribution of probability over the class $\mathcal{W}$ of all possible distributions $F$ ... this could possibly be converted into an a posteriori distribution after $X_1, X_2, \cdots, X_n$ have been observed. Whether anyone will espouse this view and recommend a procedure for carrying it out remains to be seen.

Received December 1972; revised September 1973.
¹ This research was supported partly by the U.S. Navy Electronics Lab, San Diego, California, and partly by the Office of Naval Research under Contract N00014-69-A-0200-1051 with the University of California.
AMS 1970 subject classifications. Primary 60K99; Secondary 60G35, 62C10, 62G99.
Key words and phrases. Dirichlet process, nonparametric Bayes, empirical Bayes, mixing distribution, random measures, bio-assay, discrimination.
Some properties that would be desirable for such a procedure can be paraphrased from Raiffa and Schlaifer [14], page 44, for this situation as follows:

1. The class $\mathcal{D}$ of random prior distributions on $\mathcal{W}$ should be analytically tractable in three respects:
(a) It should be reasonably easy to determine the posterior distribution on $\mathcal{W}$ given a "sample";
(b) It should be possible to express conveniently the expectations of simple loss functions; and
(c) The class $\mathcal{D}$ should be closed, in the sense that if the prior is a member of $\mathcal{D}$, then the posterior is a member of $\mathcal{D}$.
2. The class $\mathcal{D}$ should be "rich," so that there will exist a member of $\mathcal{D}$ capable of expressing any prior information or belief.
3. The class $\mathcal{D}$ should be parametrized in a manner which can be readily interpreted in relation to prior information and belief.
While these requirements are not mutually exclusive, they do seem, in the case we are considering, to be antagonistic in the sense that some may be obtained at the expense of others. For example, several authors, among them Dubins and Freedman [5], [6], Kraft and van Eeden [12], and Kraft [11], have described distribution function processes which satisfy the second requirement, but are deficient in the first and third.
More recently, Ferguson [9] has defined a process called the Dirichlet process which is particularly strong in satisfying the first and third requirements and is only slightly deficient with respect to the second requirement.
Although Ferguson was successful in using the Dirichlet process for Bayesian analyses of several nonparametric problems, there are some statistical models for which the closure property 1(c) does not hold. For example, the nature of sampling in mixing distribution problems and bio-assay problems can be such that the posterior distribution is not a simple Dirichlet process, but can be represented as a mixture of Dirichlet processes, i.e., as belonging in a sense to a linear manifold spanned by Dirichlet processes.

This paper reviews the basic properties of Ferguson's Dirichlet process in Section 2. Section 3 develops the additional structure necessary to define mixtures and derives some basic properties of mixtures, culminating in Theorem 3, which states, roughly, that mixtures of Dirichlet processes have the closure property 1(c).

The somewhat unusual behavior of samples from a Dirichlet process is examined in Section 4 and explicit formulas for the posterior distributions are derived. Section 5 illustrates the application of mixtures of Dirichlet processes
to estimation problems, including estimation of a mixing distribution and empirical Bayes estimation. The remaining Sections 6-8 give applications to a regression problem, a bio-assay problem, and a discrimination problem.
2. Ferguson's Dirichlet process. In a fundamental paper on a Bayesian approach to nonparametric problems, Ferguson [9] defines a random process, called the Dirichlet process, whose sample functions are almost surely probability measures, and he derives many important properties of this process. To save space, we will list below only the definition and those properties we need for this paper.
DEFINITION 1. Let $\Theta$ be a set, and $\mathcal{A}$ a $\sigma$-field of subsets of $\Theta$. Let $\alpha$ be a finite, nonnull, nonnegative, finitely additive measure on $(\Theta, \mathcal{A})$. We say a random probability measure $P$ on $(\Theta, \mathcal{A})$ is a Dirichlet process on $(\Theta, \mathcal{A})$ with parameter $\alpha$, denoted $P \in \mathcal{D}(\alpha)$, if for every $k = 1, 2, \cdots$ and measurable partition $B_1, \cdots, B_k$ of $\Theta$, the joint distribution of the random probabilities $(P(B_1), \cdots, P(B_k))$ is Dirichlet with parameters $(\alpha(B_1), \cdots, \alpha(B_k))$, denoted $(P(B_1), \cdots, P(B_k)) \in \mathcal{D}(\alpha(B_1), \cdots, \alpha(B_k))$. (When $\alpha(B_i) = 0$, $P(B_i) = 0$ with probability one.)
Ferguson shows that this definition satisfies the Kolmogorov criteria for the existence of a probability on the space of all functions from $\mathcal{A}$ into $[0, 1]$ with the $\sigma$-field generated by the cylinder sets. Certain properties of the Dirichlet process obtained by Ferguson will be needed later.

(i) If $P \in \mathcal{D}(\alpha)$ and $A \in \mathcal{A}$, then $E(P(A)) = \alpha(A)/\alpha(\Theta)$.
(ii) If $P \in \mathcal{D}(\alpha)$ and, conditional given $P$, $\theta_1, \theta_2, \cdots, \theta_n$ are i.i.d. $P$, then $P \mid \theta_1, \theta_2, \cdots, \theta_n \in \mathcal{D}(\alpha + \sum_{i=1}^n \delta_{\theta_i})$, where $\delta_x$ denotes the measure giving mass one to the point $x$.
(iii) If $P \in \mathcal{D}(\alpha)$, then $P$ is almost surely discrete. (See also Blackwell [2].)
We point out that property (ii) above satisfies desirable properties 1(a) and 1(c). The posterior distribution is easily computable and in the same class as the prior. Property (i) above fulfills desirable property 3 in two respects. Since $E(P(A)) = \alpha(A)/\alpha(\Theta)$, one can choose the "shape" of $\alpha$ to reflect his prior guess at the shape of the distribution. Moreover, it follows from (i) and (ii) that
$$E(P(\cdot) \mid \theta_1, \theta_2, \cdots, \theta_n) = (n + \alpha(\Theta))^{-1}\,[\alpha(\Theta)\, E(P(\cdot)) + n F_n(\cdot)],$$
where $F_n$ is the empirical df. Thus the magnitude of $\alpha(\Theta)$ represents, in a sense, the degree of faith in the prior guess, and appears in the formula as if it were a "prior sample size." The statistician can choose the magnitude of $\alpha(\Theta)$ to represent the strength of his conviction, independent of his opinion about the "shape" of the distribution.
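As a numerical illustration of this posterior-mean formula (our sketch, not part of the paper), the following computes $E(P((-\infty, t]) \mid \theta_1, \cdots, \theta_n)$ for a hypothetical prior whose shape $\alpha(\cdot)/\alpha(\Theta)$ is the Uniform$(0,1)$ df and whose total mass is $M = \alpha(\Theta)$:

```python
# Illustrative sketch, not from the paper: the posterior mean of the random df,
# E(P((-inf, t]) | data) = (M * prior_shape(t) + n * F_n(t)) / (M + n),
# with the prior shape taken (hypothetically) to be the Uniform(0, 1) df
# and M = alpha(Theta) the prior "sample size".

def posterior_mean_cdf(t, data, M):
    """Posterior mean of the random df at t under a DP prior of mass M."""
    prior_shape = min(max(t, 0.0), 1.0)           # Uniform(0, 1) df at t
    n = len(data)
    emp = sum(1 for x in data if x <= t) / n      # empirical df F_n(t)
    return (M * prior_shape + n * emp) / (M + n)  # convex combination

data = [0.1, 0.2, 0.2, 0.9]
print(posterior_mean_cdf(0.5, data, M=1.0))    # dominated by F_n(0.5) = 0.75
print(posterior_mean_cdf(0.5, data, M=100.0))  # dominated by the prior shape 0.5
```

The larger $M$ is, the closer the answer stays to the prior shape, exactly as the "prior sample size" interpretation suggests.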
On the other hand, the almost sure discreteness of a Dirichlet selection given in (iii) would seem to be inconsistent with desirable property 2 for situations where one wants a prior on a class of continuous distributions. In many cases of interest, however, this discreteness is no more troublesome than the discreteness of a sample cdf. The real concern of property 2 in this context is that distribution functions chosen by a Dirichlet process can be reasonably "close" to the kind of distribution functions one is interested in. In this regard, Ferguson has proved that the Dirichlet process is "rich" in the sense that there is positive probability that a sample function will approximate as closely as desired, in the measure given any specified collection of $\mathcal{A}$-measurable sets, any fixed distribution which is absolutely continuous with respect to the parameter measure $\alpha$. The mixtures of Dirichlet processes which we define below can also be shown to be "rich" in this same sense, but we will make no use of the property in this paper.
3. Mixtures of Dirichlet processes. Because the basic Dirichlet process defined above does not encompass enough of the situations encountered in Bayesian analysis, we proceed now to the concept of a mixture of Dirichlet processes, which, roughly, is a Dirichlet process where the parameter $\alpha$ is itself random. Before we can make this idea more precise we need a slight generalization of the usual definition of a transition probability, and later we will find it necessary to impose certain regularity conditions on the underlying spaces to assure the existence of some needed conditional distributions.
DEFINITION 2. Let $(\Theta, \mathcal{A})$ and $(U, \mathcal{B})$ be two measurable spaces. A transition measure on $U \times \mathcal{A}$ is a mapping $\alpha$ of $U \times \mathcal{A}$ into $[0, \infty)$ such that
(a) For every $u \in U$, $\alpha(u, \cdot)$ is a finite, nonnegative, nonnull measure on $(\Theta, \mathcal{A})$;
(b) For every $A \in \mathcal{A}$, $\alpha(\cdot, A)$ is measurable on $(U, \mathcal{B})$.

We note that this differs from the definition of a transition probability in that $\alpha(u, \Theta)$ need not be identically one. We make this change because we want $\alpha(u, \cdot)$ to be a parameter for a Dirichlet process.
DEFINITION 3. Let $(\Theta, \mathcal{A})$ be a measurable space, let $(U, \mathcal{B}, H)$ be a probability space, called the index space, and let $\alpha$ be a transition measure on $U \times \mathcal{A}$. We say $P$ is a mixture of Dirichlet processes on $(\Theta, \mathcal{A})$ with mixing distribution $H$ on the index space $(U, \mathcal{B})$ and transition measure $\alpha$, if for all $k = 1, 2, \cdots$ and any measurable partition $A_1, A_2, \cdots, A_k$ of $\Theta$ we have
$$P\{P(A_1) \le y_1, \cdots, P(A_k) \le y_k\} = \int_U D(y_1, \cdots, y_k \mid \alpha(u, A_1), \cdots, \alpha(u, A_k))\, dH(u),$$
where $D(y_1, \cdots, y_k \mid a_1, \cdots, a_k)$ denotes the distribution function of the Dirichlet distribution with parameters $(a_1, \cdots, a_k)$.

In concise symbols we use the heuristic notation
$$(P(A_1), \cdots, P(A_k)) \in \int_U \mathcal{D}(\alpha(u, A_1), \cdots, \alpha(u, A_k))\, dH(u),$$
or simply $P \in \int_U \mathcal{D}(\alpha(u, \cdot))\, dH(u)$.
Roughly, we may consider the index $u$ as a random variable with distribution $H$ and, conditional given $u$, $P$ is a Dirichlet process with parameter $\alpha(u, \cdot)$. In fact $u$ can be defined as the identity mapping random variable and we will use the notation $\mid u$ for "given $u$." In alternative notation, $u \in H$, $P \mid u \in \mathcal{D}(\alpha_u)$, where $\alpha_u = \alpha(u, \cdot)$.

An example of a mixture of Dirichlet processes derived from a simple Dirichlet process is given below.
EXAMPLE 1. Let $P$ be a Dirichlet process on $(\Theta, \mathcal{A})$ with parameter $\alpha$. Define $\alpha(u, A) = \alpha(A) + \delta_u(A)$, where $\delta_u(A) = 1$ if $u \in A$, 0 otherwise. Let $H$ be a fixed probability measure on $(\Theta, \mathcal{A})$. Then the process $P^*$ which chooses $u$ according to $H$, and $P$ from a Dirichlet process with parameter $\alpha(u, \cdot)$, is a mixture of Dirichlet processes as defined above. Moreover, for the mixture in this example, we note that if $(B_1, \cdots, B_k)$ is any measurable partition of $\Theta$,
$$(1)\qquad (P(B_1), P(B_2), \cdots, P(B_k)) \in \sum_{i=1}^k H(B_i)\, \mathcal{D}(\alpha(B_1), \cdots, \alpha(B_i) + 1, \cdots, \alpha(B_k)).$$
Relation (1), in fact, characterizes mixtures of this type, and we will encounter it frequently in this paper.
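As a quick sanity check on relation (1) (our illustration, with a hypothetical finite partition; not from the paper), the first moment it implies, $E\,P(B_j) = \sum_i H(B_i)(\alpha(B_j) + [i = j])/(\alpha(\Theta) + 1)$, collapses to the closed form $(\alpha(B_j) + H(B_j))/(\alpha(\Theta) + 1)$, which we can verify exactly:

```python
from fractions import Fraction as F

# Sketch with assumed numbers, not the paper's: under relation (1), component i
# of the mixture has mean (alpha(B_j) + [i == j]) / (alpha(Theta) + 1) for
# coordinate j, so the mixture mean is (alpha(B_j) + H(B_j)) / (alpha(Theta) + 1).
alpha = [F(1, 2), F(3, 2), F(1)]     # alpha(B_1), ..., alpha(B_3); total mass 3
H     = [F(1, 4), F(1, 4), F(1, 2)]  # mixing distribution H(B_i)
total = sum(alpha)

for j in range(3):
    mixture_mean = sum(H[i] * (alpha[j] + (i == j)) / (total + 1) for i in range(3))
    closed_form  = (alpha[j] + H[j]) / (total + 1)
    assert mixture_mean == closed_form
    print(j, mixture_mean)
```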
DEFINITION 4. Let $P$ be a mixture of Dirichlet processes on $(\Theta, \mathcal{A})$ with mixing distribution $H$ on index space $(U, \mathcal{B})$ and transition measure $\alpha$ on $U \times \mathcal{A}$. We say that $\theta_1, \theta_2, \cdots, \theta_n$ is a sample of size $n$ from $P$ if for any $m = 1, 2, \cdots$ and measurable sets $A_1, A_2, \cdots, A_m, C_1, C_2, \cdots, C_n$ we have:
$$(2)\qquad P\{\theta_1 \in C_1, \cdots, \theta_n \in C_n \mid u, P(A_1), \cdots, P(A_m), P(C_1), \cdots, P(C_n)\} = \prod_{i=1}^n P(C_i)\quad \text{a.s.}$$

This definition determines the joint distribution of $\theta_1, \cdots, \theta_n$, $P(A_1), \cdots, P(A_m)$, since the customary
$$P\{\theta_1 \in C_1, \cdots, \theta_n \in C_n,\ P(A_1) \le y_1, \cdots, P(A_m) \le y_m\}$$
may be found by integrating (2) with respect to the known joint conditional distribution of $P(A_1), \cdots, P(A_m), P(C_1), \cdots, P(C_n)$ given $u$ over the set $[0, y_1] \times \cdots \times [0, y_m] \times [0, 1] \times \cdots \times [0, 1]$, and then integrating the resulting function of $u$ with respect to $H(u)$ over $U$.

An immediate consequence of this definition is the following useful result.
PROPOSITION 1. If $P \in \int_U \mathcal{D}(\alpha(u, \cdot))\, dH(u)$ and $\theta$ is a sample of size one from $P$, then for any measurable set $A$,
$$(3)\qquad P(\theta \in A) = \int_U \frac{\alpha(u, A)}{\alpha(u, \Theta)}\, dH(u).$$

PROOF. $P(\theta \in A \mid u, P(A)) = P(A)$ a.s.; hence
$$P(\theta \in A \mid u) = E\{P(\theta \in A \mid u, P(A)) \mid u\} = E\{P(A) \mid u\} = \frac{\alpha(u, A)}{\alpha(u, \Theta)}\quad \text{a.s.}$$
Finally
$$P(\theta \in A) = \int_U \frac{\alpha(u, A)}{\alpha(u, \Theta)}\, dH(u). \qquad \square$$
THEOREM 1. Let $P$ be a Dirichlet process on $(\Theta, \mathcal{A})$ with parameter $\alpha$. Let $\theta$ be a sample of size 1 from $P$, and let $A \in \mathcal{A}$ be any measurable set such that $\alpha(A) > 0$. Then the conditional distribution of $P$, given $\theta \in A$, is a mixture of Dirichlet processes on $(\Theta, \mathcal{A})$, with index space $(A, \mathcal{A} \cap A)$, transition measure $\alpha(u, \cdot) = \alpha + \delta_u$ for $u \in A$ on $A \times (\mathcal{A} \cap A)$, and distribution $H_A$ on $(A, \mathcal{A} \cap A)$, where $H_A(\cdot) = \alpha(\cdot)/\alpha(A)$ on $A$.

PROOF. Let $(B_1, B_2, \cdots, B_n)$ be any measurable partition of $\Theta$. Since $A$ is measurable we obtain a refined measurable partition by letting $B_i' = A \cap B_i$ and $B_i^0 = A^c \cap B_i$, so that $A = \bigcup_{i=1}^n B_i'$. It follows from property (ii) of the Dirichlet process listed in Section 2 that the conditional distribution of $P(B_1'), \cdots, P(B_n'), P(B_1^0), \cdots, P(B_n^0)$, given $\theta \in B_i'$, is Dirichlet with parameters $(\alpha(B_1'), \alpha(B_2'), \cdots, \alpha(B_i') + 1, \cdots, \alpha(B_n^0))$. Integrating this with respect to the conditional probability that $\theta \in B_i'$, given $\theta \in A$, we obtain the conditional distribution of $P(B_1'), P(B_2'), \cdots, P(B_n^0)$, given $\theta \in A$, as
$$\sum_{i=1}^n \frac{\alpha(B_i')}{\alpha(A)}\, \mathcal{D}(\alpha(B_1'), \cdots, \alpha(B_i') + 1, \cdots, \alpha(B_n^0)),$$
which we recognize as relation (1) in Example 1 following Definition 3, with $H(\cdot) = \alpha(\cdot)/\alpha(A)$ on $A$. $\square$
COROLLARY 1.1. Let $P$ be a mixture of Dirichlet processes on $(\Theta, \mathcal{A})$ with index space also $(\Theta, \mathcal{A})$, and transition measure $\alpha_u = \alpha + \delta_u$. If the distribution $H$ on the index space $(\Theta, \mathcal{A})$ is given by $H(A) = \alpha(A)/\alpha(\Theta)$, then $P$ is in fact a simple Dirichlet process on $(\Theta, \mathcal{A})$ with parameter $\alpha$. In symbols,
$$\int_\Theta \mathcal{D}(\alpha + \delta_u)\, \frac{\alpha(du)}{\alpha(\Theta)} = \mathcal{D}(\alpha).$$

We omit the proof and simply point out that $H(\cdot) = \alpha(\cdot)/\alpha(\Theta)$ is what we would obtain in Theorem 1 if we let $A = \Theta$. But being given $\theta \in \Theta$ is being given no information at all, so the posterior distribution is the same as the prior, a simple Dirichlet process. The usefulness of Corollary 1.1 is that it enables one to reduce some mixtures of Dirichlet processes to simple Dirichlet processes.
PROPOSITION 2. Let $P$ be a Dirichlet process on $(\Theta, \mathcal{A})$ with parameter $\alpha$, and let $A \in \mathcal{A}$. Then given $P(A) = M$, the conditional distribution of $(1/M)P$ restricted to $(A, \mathcal{A} \cap A)$ is a Dirichlet process on $(A, \mathcal{A} \cap A)$ with parameter $\alpha$ restricted to $A$. That is, if $A_1, \cdots, A_k$ is any measurable partition of $A$, then the conditional distribution of $(P(A_1)/M, \cdots, P(A_k)/M)$, given $P(A) = M$, is a Dirichlet distribution with parameters $(\alpha(A_1), \cdots, \alpha(A_k))$.

PROOF. From the definition of a Dirichlet distribution in terms of the gamma distribution as given in Ferguson [9], we know the distribution of $(P(A_1)/\sum_{i=1}^k P(A_i), \cdots, P(A_k)/\sum_{i=1}^k P(A_i))$ is Dirichlet $(\alpha(A_1), \cdots, \alpha(A_k))$. But $\sum_{i=1}^k P(A_i) = P(A) = M$ by hypothesis. $\square$
Proposition 2 is an alternative way of stating that the Dirichlet process is "tailfree" as discussed by Doksum [4] and Fabius [7], who show further that the Dirichlet process is essentially the only process which is tailfree with respect to every tree of partitions. They also show that the Dirichlet process is the only one for which the posterior distribution of $P(A)$ given a sample $\theta_1, \theta_2, \cdots, \theta_n$ depends only on the number of observations falling in $A$, not where they fall in $A$.
We now combine the results of Theorem 1, Corollary 1.1, and Proposition 1 to get:

THEOREM 2. Let $P$ be a Dirichlet process on $(\Theta, \mathcal{A})$ with parameter $\alpha$, and let $\theta$ be a sample from $P$. Let $A \in \mathcal{A}$. Then the conditional distribution of $P$ given $P(A)$ and $\theta \in A$ is the same as the conditional distribution of $P$ given $P(A)$.

Roughly speaking, if $P(A)$ is known, then the event $\theta \in A$ tells us nothing more about the process. This is consistent with the definition of sampling given in Definition 4, and since $P(\Theta) = 1$ a.s., it confirms the interpretation given for Corollary 1.1.
Before we can proceed to the most interesting theorems about mixtures of Dirichlet processes, we must stop to examine the measure theoretic structure we have created and insure the existence of certain conditional distributions by adding appropriate regularity conditions to the underlying spaces.

Starting with the measurable space $(\Theta, \mathcal{A})$, index probability space $(U, \mathcal{B}, H)$, and transition measure $\alpha$ on $U \times \mathcal{A}$, we define a measure $\mu$ on the measurable product space $(\Theta \times U, \mathcal{A} \times \mathcal{B})$ as $\mu(A \times B) = \int_B \alpha(u, A)/\alpha(u, \Theta)\, dH(u)$. If $P$ is a mixture of Dirichlet processes on $(\Theta, \mathcal{A})$ with these parameters, and $\theta_1, \cdots, \theta_n$ is a sample of size $n$, we will need to be able to go from the known conditional distribution of $P$ given $u, \theta_1, \theta_2, \cdots, \theta_n$ to the conditional distribution of $u$ given $\theta_1, \cdots, \theta_n$. To assure the existence of this conditional distribution we will henceforth require that $(\Theta, \mathcal{A})$ and $(U, \mathcal{B})$ be standard Borel spaces, defined as follows:

DEFINITION 5. A standard Borel space is a measurable space $(E, \mathcal{E})$ in which $\mathcal{E}$ is countably generated, and for which there exists a bi-measurable mapping between $(E, \mathcal{E})$ and some complete separable metric space $(Y, \mathcal{Y})$.

If $(\Theta, \mathcal{A})$ and $(U, \mathcal{B})$ are standard Borel spaces, then the product space $(\Theta \times U, \mathcal{A} \times \mathcal{B})$ is a standard Borel space, and the required conditional distributions specified above are known to exist (see, for example, Parthasarathy [13], Chapter 5).
With these definitions we can now state the last and most important theorem in this section, which says, roughly, that if we sample from a mixture of Dirichlet processes, and the sample is distorted by random error, the posterior distribution of the process is again a mixture of Dirichlet processes. We will see in the sections on applications that this situation occurs frequently.
THEOREM 3. Let $P$ be a mixture of Dirichlet processes on a standard Borel space $(\Theta, \mathcal{A})$ with standard Borel index space $(U, \mathcal{B})$, distribution $H$ on $(U, \mathcal{B})$, and transition measure $\alpha$ on $U \times \mathcal{A}$. Let $(X, \mathcal{C})$ be a standard Borel sample space, and $F$ a transition probability from $\Theta \times \mathcal{C}$ to $[0, 1]$. If $\theta$ is a sample from $P$, i.e., $\theta \mid P, u \in P$, and $X \mid P, \theta, u \in F(\theta, \cdot)$, then the distribution of $P$ given $X = x$ is a mixture of Dirichlet processes on $(\Theta, \mathcal{A})$, with index space $(\Theta \times U, \mathcal{A} \times \mathcal{B})$, transition measure $\alpha_u + \delta_\theta$ on $(\Theta \times U) \times \mathcal{A}$, and mixing distribution $H_x$ on the index space $(\Theta \times U, \mathcal{A} \times \mathcal{B})$, where $H_x$ is the conditional distribution of $(\theta, u)$ given $X = x$. In symbols, if
$$u \in H,\quad P \mid u \in \mathcal{D}(\alpha_u),\quad \theta \mid P, u \in P,\quad X \mid P, \theta, u \in F(\theta, \cdot),$$
then $P \in \int_U \mathcal{D}(\alpha_u)\, dH(u)$ and
$$(P \mid X = x) \in \int_{\Theta \times U} \mathcal{D}(\alpha_u + \delta_\theta)\, dH_x(\theta, u).$$

PROOF. The distribution of $P$ given $(\theta, u, x)$ is $\mathcal{D}(\alpha_u + \delta_\theta)$, independent of $x$; the distribution of $X$ given $(\theta, u)$ is $F(\theta, \cdot)$, independent of $u$; the distribution of $\theta$ given $u$ is $\alpha(u, \cdot)/\alpha(u, \Theta)$; and the distribution of $u$ is $H$. The last three define the joint distribution of $(\theta, u, X)$ as given in the theorem, and the conditional distribution of $P$ given $X$ is obtained by integrating the known conditional distribution of $P$ given $(\theta, u, X)$, namely $\mathcal{D}(\alpha_u + \delta_\theta)$, with respect to the conditional distribution of $(\theta, u)$ given $X$, yielding the formula given in the theorem, which we recognize as a mixture of Dirichlet processes. $\square$
Lest the imposing notation obscure the basic principles involved, we illustrate Theorem 3 with an example where $U$ and $X$ are two-point spaces, and $\Theta$ is the unit interval.

EXAMPLE 2. Let $X$ be a Bernoulli random variable with $P(X = 1) = \theta$, and let $\theta$ have as prior distribution an equal mixture of Beta distributions, $g(\theta) = \frac{1}{2}\mathrm{Be}(1, 2) + \frac{1}{2}\mathrm{Be}(2, 2)$. Then the posterior distribution of $\theta$ given $X = 1$ is a mixture of Beta distributions, $g(\theta \mid X = 1) = \frac{2}{5}\mathrm{Be}(2, 2) + \frac{3}{5}\mathrm{Be}(3, 2)$.

Notice that the posterior mixture gives more weight to $\mathrm{Be}(3, 2)$ since, given $X = 1$, $\theta$ is more likely to have come from $\mathrm{Be}(2, 2)$ in the prior mixture. This is a specific example of a general property of mixtures of Dirichlet processes which will be stated formally as Corollary 3.2.
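The posterior weights in Example 2 follow from Bayes formula with the component marginal likelihoods $P(X = 1 \mid \mathrm{Be}(a, b)) = a/(a + b)$; the following sketch (ours, not the paper's) carries out the arithmetic exactly:

```python
from fractions import Fraction as F

# Posterior over the two Beta components in Example 2. Given X = 1,
# component Be(a, b) has marginal likelihood E(theta) = a / (a + b),
# and its parameters update to Be(a + 1, b).
prior = [(F(1, 2), (1, 2)), (F(1, 2), (2, 2))]   # (weight, (a, b))

evidence = sum(w * F(a, a + b) for w, (a, b) in prior)
posterior = [(w * F(a, a + b) / evidence, (a + 1, b)) for w, (a, b) in prior]

print(posterior)   # weight 2/5 on Be(2, 2) and 3/5 on Be(3, 2)
```

The component that assigned higher prior mean to $\theta$ gains posterior weight, which is exactly the reweighting phenomenon Corollary 3.2 formalizes.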
We now state two corollaries to Theorem 3 which treat cases that occur frequently in applications. No proof is given since they are simply special cases of Theorem 3.
COROLLARY 3.1. Let $P$ be a Dirichlet process on a standard Borel space $(\Theta, \mathcal{A})$ with parameter $\alpha$, and let $\theta$ be a sample from $P$. Let $(X, \mathcal{C})$ be a standard Borel sample space and $F$ a transition probability from $\Theta \times \mathcal{C}$ to $[0, 1]$. If the conditional distribution of $X$ given $P$ and $\theta$ is $F(\theta, \cdot)$, then the conditional distribution of $P$ given $X = x$ is a mixture of Dirichlet processes on $(\Theta, \mathcal{A})$ with $(\Theta, \mathcal{A})$ considered as the index space, transition measure $\alpha(\theta, \cdot) = \alpha(\cdot) + \delta_\theta(\cdot)$, and mixing distribution $H_x$ on $(\Theta, \mathcal{A})$ given by the conditional distribution of $\theta$ given $X = x$; in concise notation:
$$P \in \mathcal{D}(\alpha),\ \theta \in P,\ X \mid P, \theta \in F(\theta, \cdot) \quad\Longrightarrow\quad P \mid X = x \in \int_\Theta \mathcal{D}(\alpha + \delta_\theta)\, dH_x(\theta).$$

Roughly, if the sample from a simple Dirichlet process is distorted by random error, then the posterior distribution of the process given the distorted sample is a mixture of Dirichlet processes.
COROLLARY 3.2. Let $P$ be a mixture of Dirichlet processes on a standard Borel space $(\Theta, \mathcal{A})$, with standard Borel index space $(U, \mathcal{B})$, distribution $H$ on $(U, \mathcal{B})$, and transition measure $\alpha$ on $U \times \mathcal{A}$. If $\theta$ is a sample from $P$, then $P$ given $\theta$ is a mixture of Dirichlet processes on $(\Theta, \mathcal{A})$, with transition measure $\alpha_u + \delta_\theta$ and distribution $H_\theta$ on $(U, \mathcal{B})$, where $H_\theta$ is the conditional distribution of $u$ given $\theta$. In symbols, if $P \in \int_U \mathcal{D}(\alpha_u)\, dH(u)$ and $\theta \in P$, then
$$(P \mid \theta) \in \int_U \mathcal{D}(\alpha_u + \delta_\theta)\, dH_\theta(u).$$

The essential point of this corollary is that the observation $\theta$ affects each component of the mixture as one would expect it to, by adding $\delta_\theta$ to $\alpha_u$, and, in addition, changes the relative weightings of the components of the mixture to the conditional distribution of $u$ given $\theta$. An explicit expression for this conditional distribution is given in Lemma 1 following Proposition 4.
4. Properties of samples from Dirichlet processes. The preceding theorems were stated rather formally to reveal the underlying measure theoretic structure, but only for samples of size one, to avoid cumbersome notation. In what follows we will develop more useful formulas for samples of size $n$, with less regard for elaborate formalism. We will see that the joint distribution of multiple samples possesses some peculiar properties caused by a virtual "memory" of the process. Sometimes a sample of size two is sufficient to illustrate these peculiarities. For example, let $P$ be a Dirichlet process on a standard Borel space $(\Theta, \mathcal{A})$ with parameter $\alpha$, and assume that $\alpha$ is nonatomic. If $\theta_1$ and $\theta_2$ were a sample from $\alpha(\cdot)/\alpha(\Theta)$ in the usual independent identically distributed sense, we would expect $P\{\theta_1 = \theta_2\} = 0$. The following argument shows that, in fact, for such a Dirichlet process, $P\{\theta_1 = \theta_2\} = 1/(\alpha(\Theta) + 1)$. The reason for this is that although $\alpha$ may be nonatomic, the conditional distribution of $P$ given $\theta_1$ is a Dirichlet process with parameter $\alpha + \delta_{\theta_1}$, which is already atomic with an atom of measure 1 at $\theta_1$. Hence, the probability that $\theta_2 = \theta_1$, given $\theta_1$, is $1/(\alpha(\Theta) + 1)$, independent of $\theta_1$. If we proceed to calculate $P\{\theta_3 = \theta_2 = \theta_1 \mid \theta_2 = \theta_1\}$ we see that the conditional distribution of $P$ given $\theta_1, \theta_2$, $\theta_1 = \theta_2$, is Dirichlet with parameter $\alpha + 2\delta_{\theta_1}$. Hence $P\{\theta_3 = \theta_2 = \theta_1 \mid \theta_2 = \theta_1\} = 2/(\alpha(\Theta) + 2)$, and consequently the joint probability that $\theta_1 = \theta_2 = \theta_3$ is $2/[(\alpha(\Theta) + 1)(\alpha(\Theta) + 2)]$, and by induction,
$$P(\theta_1 = \theta_2 = \cdots = \theta_n) = (n - 1)!\,/\,[\alpha(\Theta) + 1]^{(n-1)},$$
where by definition $a^{(0)} = 1$ and $a^{(n)} = a(a + 1)\cdots(a + n - 1)$, $n > 0$. This property of the Dirichlet process is very similar to what happens in Polya urn schemes, and in one sense characterizes Dirichlet processes (see Feller [8] and Blackwell and MacQueen [3]).
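The inductive formula for total ties can be checked against the step-by-step conditional argument. A minimal sketch (ours; the mass $\alpha(\Theta) = a$ is a hypothetical value):

```python
from fractions import Fraction as F
from math import factorial

def rising(x, n):
    """Rising factorial x^(n) = x(x+1)...(x+n-1), with x^(0) = 1."""
    out = F(1)
    for i in range(n):
        out *= x + i
    return out

def p_all_equal(a, n):
    """P(theta_1 = ... = theta_n), built conditionally: the (i+1)st draw
    repeats the common value with probability i / (a + i)."""
    p = F(1)
    for i in range(1, n):
        p *= F(i) / (a + i)
    return p

a = F(3)   # hypothetical alpha(Theta)
for n in range(1, 6):
    closed = F(factorial(n - 1)) / rising(a + 1, n - 1)
    assert p_all_equal(a, n) == closed
print(p_all_equal(a, 3))   # 2 / ((a+1)(a+2)) = 1/10 for a = 3
```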
On the other hand, one can consider the probability that $\theta_n$ is a new value, distinct from any previous observations $\theta_1, \theta_2, \cdots, \theta_{n-1}$. By the same logic as before, one notes that $P(\theta_2 \ne \theta_1) = \alpha(\Theta)/(\alpha(\Theta) + 1)$. Similarly $P(\theta_3 \notin \{\theta_1, \theta_2\}) = \alpha(\Theta)/(\alpha(\Theta) + 2)$, regardless of whether $\theta_1 = \theta_2$ or not. If one defines $W_i$ as a random variable which equals 1 if $\theta_i$ is a new, distinct value, and zero otherwise, then $P(W_i = 1) = \alpha(\Theta)/(\alpha(\Theta) + i - 1)$. If we further define $Z_n = \sum_{i=1}^n W_i$, then $Z_n$ is simply the number of distinct values of $\theta$ which have occurred in the first $n$ observations.

Although $P(W_n = 1) = \alpha(\Theta)/(\alpha(\Theta) + n - 1)$ is monotone decreasing in $n$, nevertheless note that
$$E(Z_n) = \sum_{m=1}^n \frac{\alpha(\Theta)}{\alpha(\Theta) + m - 1} \approx \alpha(\Theta)\log\!\left(\frac{n + \alpha(\Theta)}{\alpha(\Theta)}\right).$$
Hence $E(Z_n) \to \infty$ as $n \to \infty$. In fact, Korwar and Hollander [10] show $Z_n \to \infty$ a.s. as $n \to \infty$. Thus although new distinct values are increasingly rare, we are assured nonetheless of a steadily increasing number of distinct values. Moreover, since the distribution of the distinct values is simply $\alpha(\cdot)/\alpha(\Theta)$, this property can be used in the usual way to obtain information about the shape of $\alpha(\cdot)$ if it is unknown.

On the other hand, the rate at which new distinct values appear depends only on the magnitude of $\alpha(\Theta)$, and not the shape of $\alpha(\cdot)$, and this property should enable us to obtain information about the magnitude of $\alpha(\Theta)$ if it is unknown.
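The exact sum for $E(Z_n)$ and its logarithmic approximation can be compared directly (our sketch; the mass $a = \alpha(\Theta)$ is a hypothetical value):

```python
from math import log

# Sketch (ours): exact E(Z_n) = sum_{m=1}^n a/(a + m - 1) versus the
# logarithmic approximation a * log((n + a)/a) quoted in the text.
def expected_distinct(a, n):
    return sum(a / (a + m - 1) for m in range(1, n + 1))

a = 5.0   # hypothetical alpha(Theta)
for n in (10, 100, 1000):
    exact = expected_distinct(a, n)
    approx = a * log((n + a) / a)
    print(n, round(exact, 3), round(approx, 3))
```

The growth is logarithmic in $n$ but unbounded, matching the claim that $E(Z_n) \to \infty$.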
We begin by obtaining an expression for $P(Z_n = k)$ as a function of $\alpha(\Theta)$. As an aid in this we define a sequence of polynomials $A_n(x)$ as
$$A_1(x) = x,$$
$$A_2(x) = (x + 1)A_1(x) = x(x + 1) = x^{(2)},$$
$$A_n(x) = (x + n - 1)A_{n-1}(x) = x(x + 1)\cdots(x + n - 1) = x^{(n)}.$$
Hence $A_n(x)$ is a polynomial of degree $n$ in $x$ with integer coefficients, which we write
$$A_n(x) = {}_na_1 x + {}_na_2 x^2 + \cdots + {}_na_n x^n.$$
If we substitute $x = \alpha(\Theta)$ in $A_n(x)$, then considerations similar to those in the discussion of repeated values enable us to identify the $k$th term of this polynomial with the event of observing exactly $k$ distinct values in a sample of size $n$:
$$P(Z_n = k) = {}_na_k\, \alpha(\Theta)^k / A_n(\alpha(\Theta)).$$
The coefficients ${}_na_k$ of the polynomial are the absolute values of Stirling numbers of the first kind, tabulated in Abramowitz and Stegun [1], page 833.
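The coefficients and the resulting distribution of $Z_n$ can be generated directly from the recurrence $A_n = (x + n - 1)A_{n-1}$; a sketch of ours, with exact rational arithmetic:

```python
from fractions import Fraction as F

# Sketch (ours): coefficients of A_n(x) = x(x+1)...(x+n-1) via the recurrence
# A_n = (x + n - 1) * A_{n-1}; coeffs[k] is the unsigned Stirling number |s(n, k)|.
def rising_coeffs(n):
    coeffs = [0, 1]                        # A_1(x) = x
    for m in range(2, n + 1):
        nxt = [0] * (m + 1)
        for k in range(1, m):
            nxt[k] += (m - 1) * coeffs[k]  # the (m - 1) * A_{m-1} term
            nxt[k + 1] += coeffs[k]        # the x * A_{m-1} term
        coeffs = nxt
    return coeffs

def p_distinct(a, n):
    """P(Z_n = k) = |s(n, k)| a^k / A_n(a), for k = 1..n."""
    c = rising_coeffs(n)
    A_n = sum(c[k] * a**k for k in range(1, n + 1))   # equals a^(n)
    return [c[k] * a**k / A_n for k in range(1, n + 1)]

probs = p_distinct(F(2), 4)
assert sum(probs) == 1        # the k-terms partition the sample space
print(probs)
```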
The significance of the foregoing is that if one knows he is sampling from some Dirichlet process with unknown parameter $\alpha$, then he can obtain, independently, consistent estimates for $\alpha(\cdot)/\alpha(\Theta)$ and $\alpha(\Theta)$ by making use of the properties given above. If, however, there were some doubt that the process was in fact a Dirichlet process, then the only discriminating feature left is the actual pattern of multiplicities observed. As a first step in such discrimination we show that we can, in fact, determine the fine structure of the probability of various patterns of multiplicities in a sample from a Dirichlet process. If we were interested, for example, in the probability that the first two observations were identical, the third was different from the first two, the fourth and fifth matched the third, and the sixth and seventh matched but were different from the previous values, then an argument similar to that given above yields
$$P\{\theta_1 = \theta_2 \ne \theta_3,\ \theta_3 = \theta_4 = \theta_5 \ne \theta_6,\ \theta_6 = \theta_7 \ne \theta_1\} = 2\,\alpha(\Theta)^3/\alpha(\Theta)^{(7)}.$$
Moreover, if one were only interested in characterizing the above event as one where, in a sample of size 7, only 3 distinct values of $\theta$ occurred, and that two were pairs and one a triplet, then the subscripts of the $\theta$'s are irrelevant, and we obtain the probability of the latter event by multiplying the previous expression by the appropriate combinatorial factor, in this case the familiar multinomial coefficient $\binom{7}{2,2,3}$ divided by $2!$ since the two pairs are indistinguishable. The result could be expressed
$$P\{\theta_i = \theta_j \ne \theta_k,\ \theta_k = \theta_l = \theta_r \ne \theta_s,\ \theta_s = \theta_t \ne \theta_i\} = \frac{2\,\alpha(\Theta)^3}{\alpha(\Theta)^{(7)}} \cdot \frac{1}{2!}\binom{7}{2, 2, 3},$$
where $\{i, j, k, l, r, s, t\} = \{1, 2, 3, 4, 5, 6, 7\}$ in some order. The generalization of this example and the simplification of its expression are the goals of the following definition and proposition.
DEFINITION 6. Let $\theta_1, \theta_2, \cdots, \theta_n$ be a sample of size $n$ from a Dirichlet process $P$. We will say that the sample belongs to the class $C(m_1, m_2, \cdots, m_n)$, and write $(\theta_1, \cdots, \theta_n) \in C(m_1, \cdots, m_n)$, if there are $m_1$ distinct values of $\theta$ that occur only once, $m_2$ that occur exactly twice, $\cdots$, $m_n$ that occur exactly $n$ times.

Two immediate consequences of this definition are that $n = \sum_{i=1}^n i\, m_i$, and that the total number of distinct values of $\theta$ that occur is $\sum_{i=1}^n m_i$. As an example of this notation we note that the sample in the preceding discussion belongs to the class $C(0, 2, 1, 0, 0, 0, 0)$.
PROPOSITION 3. Let $P$ be a Dirichlet process on a standard Borel space $(\Theta, \mathcal{A})$, with parameter $\alpha$, and let $\alpha$ be nonatomic. Let $(\theta_1, \cdots, \theta_n)$ be a sample of size $n$ from $P$. Then
$$(4)\qquad P\{(\theta_1, \cdots, \theta_n) \in C(m_1, \cdots, m_n)\} = \frac{n!}{\prod_{i=1}^n i^{m_i}(m_i!)} \cdot \frac{\alpha(\Theta)^{\sum_{i=1}^n m_i}}{\alpha(\Theta)^{(n)}}.$$

PROOF. We calculate the probability of a particular sequence in the class $C(m_1, \cdots, m_n)$, observe that all such sequences are equally likely, and multiply by the number of these sequences.

Let $C_0(m_1, \cdots, m_n)$ be the event that $\theta_1, \cdots, \theta_{m_1}$, in that order, are unique in the sample and occur only once; that $\theta_{m_1+1}, \cdots, \theta_{m_1+2m_2}$ occur twice each, in the order $\theta_{m_1+1} = \theta_{m_1+2}$, $\theta_{m_1+3} = \theta_{m_1+4}$, etc. Then by the same argument given in the previous discussion,
$$P\{(\theta_1, \cdots, \theta_n) \in C_0(m_1, \cdots, m_n)\} = [1^{(0)}\alpha(\Theta)]^{m_1}[1^{(1)}\alpha(\Theta)]^{m_2}\cdots[1^{(n-1)}\alpha(\Theta)]^{m_n}/\alpha(\Theta)^{(n)}.$$
Next we must count the number of ways we can permute the indices of the $\theta$'s to obtain all essentially different sequences. But this number is well known as the number of elements in a conjugate class of the symmetric permutation group on $n$ elements, and is equal to
$$\binom{n}{\underbrace{1, \cdots, 1}_{m_1}, \underbrace{2, \cdots, 2}_{m_2}, \cdots, n} \frac{1}{\prod_{i=1}^n (m_i!)},$$
where $\binom{n}{n_1, \cdots, n_r}$ denotes the well-known multinomial coefficient. For a derivation of this number see Scott [17]. Multiplying this combinatorial factor by the preceding probability, and dividing numerator and denominator by $\prod_{i=1}^n [1^{(i-1)}]^{m_i}$, which is identical to $\prod_{i=1}^n [(i-1)!]^{m_i}$, the expression reduces to:
$$\frac{n!}{\prod_{i=1}^n i^{m_i}(m_i!)} \cdot \frac{\alpha(\Theta)^{\sum_{i=1}^n m_i}}{\alpha(\Theta)^{(n)}}. \qquad \square$$
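Formula (4) can be checked mechanically: it should reproduce the worked 7-sample example above, and the probabilities of all classes for a given $n$ should sum to one. A sketch of ours, with a hypothetical mass $a = \alpha(\Theta)$:

```python
from fractions import Fraction as F
from math import factorial, prod

def rising(x, n):
    """Rising factorial x^(n), with x^(0) = 1."""
    out = F(1)
    for i in range(n):
        out *= x + i
    return out

def p_class(a, m):
    """Formula (4): P(sample lies in C(m_1,...,m_n)), n = sum i*m_i."""
    n = sum(i * mi for i, mi in enumerate(m, start=1))
    comb = F(factorial(n),
             prod(i**mi * factorial(mi) for i, mi in enumerate(m, start=1)))
    return comb * a**sum(m) / rising(a, n)

a = F(3)   # hypothetical alpha(Theta)
# Worked 7-sample example from the text: C(0,2,1,0,0,0,0) -> 210 a^3 / a^(7).
assert p_class(a, (0, 2, 1, 0, 0, 0, 0)) == 210 * a**3 / rising(a, 7)

def partitions(n, max_part=None):
    """Yield integer partitions of n as nonincreasing part tuples."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

# The classes of a given n partition the sample space, so probabilities sum to 1.
n = 5
total = sum(p_class(a, [parts.count(i) for i in range(1, n + 1)])
            for parts in partitions(n))
assert total == 1
print("check passed for n =", n)
```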
We see thus that the Dirichlet process induces a measure on the sample space $\Theta^n$ which gives positive mass to certain collections of sub-hyperplanes in $\Theta^n$. The relative magnitude of the mass concentrated on the sub-hyperplanes, compared to the mass distributed over the remainder of the sample space, is seen to be a function of the magnitude of $\alpha(\Theta)$. Consequently, for a given sample size, we would expect many more multiplicities if $\alpha(\Theta)$ is very small than if it is very large. Furthermore, if $\alpha(\Theta)$ is unknown, this property just described enables us to make an inference about the magnitude of $\alpha(\Theta)$ based on the number of multiplicities in a sample. We will see, in fact, that it is not the distribution of multiplicities, but the number of distinct values that occur in the sample, which is important in making inferences about $\alpha(\Theta)$. We begin by formalizing the results of the previous example.
PROPOSITION 4. Let $P \in \int_U \mathcal{D}(\alpha_u)\, dH(u)$, where $\alpha_u$ is nonatomic for all $u \in U$, and let $(\theta_1, \cdots, \theta_n)$ be a sample of size $n$ from $P$. Then the posterior distribution of $u$, given $(\theta_1, \cdots, \theta_n) \in C(m_1, \cdots, m_n)$, is determined by
$$P(u \in B \mid \theta \in C(m)) = \frac{\int_B \alpha(u, \Theta)^{\sum m_i}/\alpha(u, \Theta)^{(n)}\, dH(u)}{\int_U \alpha(u, \Theta)^{\sum m_i}/\alpha(u, \Theta)^{(n)}\, dH(u)}.$$

PROOF. From Proposition 3, we have
$$P(\theta \in C(m) \mid u) = \frac{n!}{\prod_{i=1}^n i^{m_i}(m_i!)} \cdot \frac{\alpha(u, \Theta)^{\sum m_i}}{\alpha(u, \Theta)^{(n)}}.$$
Integrating with respect to $H(u)$ over $u \in B$ gives the joint probability of $\theta \in C(m)$ and $u \in B$. Since $P(\theta \in C(m)) > 0$, we obtain the usual conditional probability by application of Bayes formula, and note that the combinatorial coefficients cancel. $\square$
1164 CHARLES E. ANTONIAK

It is important to notice that a consequence of the combinatorial coefficients cancelling is that the mi occur in the above expression only as Σ_{i=1}^n mi, that is, in a sum expressing the total number of distinct values, which we have previously denoted by Zn. This confirms our earlier remark, that Zn is sufficient for α(Θ).
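Since Zn is sufficient, the inference of Proposition 4 can be carried out on a grid. The sketch below (a hedged illustration with our own function names, assuming a finite grid prior on M = α(Θ)) uses only the factor M^z/M^{(n)} of the class probability, which is all that depends on M.

```python
from math import prod

def m_posterior(z, n, m_grid, prior):
    """Posterior over candidate values of M = alpha(Theta), given only the
    number z of distinct values in a sample of size n.  The likelihood is
    proportional to M^z / (M(M+1)...(M+n-1)); the combinatorial factors
    cancel, exactly as in Proposition 4."""
    like = [m ** z / prod(m + k for k in range(n)) for m in m_grid]
    joint = [l * p for l, p in zip(like, prior)]
    c = sum(joint)
    return [j / c for j in joint]
```

Many distinct values push the posterior toward large M; many multiplicities push it toward small M.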
We can conclude from Proposition 4 that if α_u is nonatomic and α_u(Θ) = M, a constant independent of u, then the event θ ∈ C(m) is independent of the event u ∈ B, and hence θ ∈ C(m) provides no new information about u. This is a consequence of the assumption that α_u be nonatomic. If α_u is atomic the event θ ∈ C(m) may still provide information about u even when α(u, Θ) is independent of u. An analysis for a special case of atomic α is given later in Lemma 1, but some remarks of general validity are possible now.
We note first that if α is atomic, then P assigns positive weight to certain specific points in the product space of observations Θⁿ, and these points are contained in the collections of sub-hyperplanes described earlier. If P1 ∈ 𝒟(α1) and P2 ∈ 𝒟(α2), where α1(Θ) = α2(Θ) but α1 is nonatomic and α2 has atoms, then there is a greater probability of duplicated values in a sample from P2 than in a sample from P1.

If P ∈ ∫_U 𝒟(α_u) dH(u), and some or all α_u are atomic, the posterior distribution of u given θ will depend on whether any of the observed samples θi match any of the atoms of any of the α_u, and if so, whether the matched atoms are common to more than one value of u, and if common to more than one α_u, whether the size of the atoms is the same, etc. To elucidate this idea consider the following example.
EXAMPLE 3. Let α_u be a geometric distribution on u, u + 1, u + 2, … and H(u) be uniform on [0, 1]. Then with only one sample θ from P, the distribution of u given θ is degenerate on θ (mod 1). Notice that here there is no dominating σ-finite measure for the α_u; in fact, they are all mutually singular. Nevertheless, the fact that the joint distribution of (θ, u) is concentrated on a subspace of the space Θ × U enables us to find rather easily the posterior distribution of u given θ.

We conclude this section with a partial extension of Corollary 3.2 to the case of a sample of size n.
LEMMA 1. Let P ∈ ∫_U 𝒟(α_u) dH(u) as in Theorem 3, let θ = (θ1, …, θn) be a sample of size n from P, and suppose there exists a σ-finite, σ-additive measure μ on (Θ, 𝒜) such that for each u ∈ U,

(i) α_u is σ-additive and absolutely continuous with respect to μ, and
(ii) the measure μ has mass one at each atom of α_u. Then

(5) dH_θ(u) = [(1/M_u^{(n)}) ∏_i α_u′(θi′)(m_u(θi′) + 1)^{(n(θi′)−1)}] dH(u) / ∫_U [(1/M_u^{(n)}) ∏_i α_u′(θi′)(m_u(θi′) + 1)^{(n(θi′)−1)}] dH(u),
where α_u′(·) denotes the Radon-Nikodym derivative of α_u(·) with respect to μ; θi′ is the ith distinct value in θ; n(θi′) is the number of times the value θi′ occurs in θ; M_u = α_u(Θ); m_u(θi′) = α_u′(θi′) if θi′ is an atom of α_u, zero otherwise; and x^{(k)} denotes the ascending factorial x(x + 1)⋯(x + k − 1).
PROOF. We obtain the joint distribution of (u, θ1, …, θn) by calculating the likelihood of θ and making the appropriate normalization. Referring to the proof of Proposition 3 we see that the likelihood of θ_{k+1} given u, θ1, …, θk is α_u′(θ_{k+1}) dμ/(M_u + k) for a value of θ_{k+1} which has not occurred previously in θ1, …, θk, and [m_u(θ_{k+1}) + j] dμ/(M_u + k) for a value of θ_{k+1} which has occurred previously j times in θ1, …, θk. Hence the likelihood of θ = (θ1, …, θn) given u is

(1/M_u^{(n)}) ∏_i α_u′(θi′)(m_u(θi′) + 1)^{(n(θi′)−1)} dμⁿ.

We obtain dH_θ(u) by multiplying the above by dH(u) and dividing by the unconditional distribution of θ. ∎
We have thus established the validity of the term dH_θ(u) in the following extension of Corollary 3.2.

COROLLARY 3.2′. Let P and α_u be as in Lemma 1 and let θ = (θ1, …, θn) be a sample of size n from P. Then

P | θ ∈ ∫_U 𝒟(α_u + Σ_{i=1}^n δ_{θi}) dH_θ(u),

where H_θ is the conditional distribution of u given θ.
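The sequential description in the proof of Lemma 1 can be checked against the closed product form. Below is a sketch under the simplest assumptions (a nonatomic base measure α_u = M · Normal(u, 1), so that m_u ≡ 0; the function names are ours): the two computations of the likelihood of θ given u must agree.

```python
from math import exp, factorial, pi, sqrt

def phi(x):
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def urn_likelihood(theta, u, big_m):
    """Sequential (Polya-urn) likelihood of theta given u, for the
    nonatomic base measure alpha_u = M * Normal(u, 1): a new value
    contributes alpha_u'(value)/(M + k); a value already seen j times
    contributes j/(M + k)."""
    like, seen = 1.0, {}
    for k, t in enumerate(theta):
        j = seen.get(t, 0)
        like *= (big_m * phi(t - u) if j == 0 else j) / (big_m + k)
        seen[t] = j + 1
    return like

def closed_form_likelihood(theta, u, big_m):
    """Closed form of Lemma 1 with m_u = 0:
    (1/M^((n))) prod over distinct values of alpha_u'(value) * (count-1)!."""
    n = len(theta)
    denom = 1.0
    for k in range(n):
        denom *= big_m + k
    counts = {}
    for t in theta:
        counts[t] = counts.get(t, 0) + 1
    num = 1.0
    for t, c in counts.items():
        num *= big_m * phi(t - u) * factorial(c - 1)
    return num / denom
```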
5. Estimation problems: parameters, mixing distributions, and empirical Bayes. In this section we present a general sampling model using mixtures of Dirichlet processes, and point out some cases where this Dirichlet model leads to estimates different from standard Bayesian analysis. Consider a mixture of Dirichlet processes where the index space, parameter space, and observation space are all the real line with the σ-field of Borel sets. Let G(θ) be a sample distribution function from a mixture with parameter α_u and mixing distribution H(u). Let θ1, θ2, …, θn be a sample of size n from G(θ) and let Xi1, …, Ximi be a sample of size mi from F_{θi}(x). We consider the following problems:

(a) Estimating the index of the parameter. If we wish to estimate u with squared error loss, then the Bayes estimate is simply û = E(u | θ1, θ2, …, θn) if the θi are observed directly, and û = E(u | X11, …, Xnmn) if we only observe the Xij. The theory is given in Theorem 3 and the required conditional distribution of u can be obtained using Lemma 1. The positive probability of duplications among the θi causes some complications that will be illustrated in an example at the end of this section.
(b) Estimating the (mixing) distribution function. Suppose we wish an estimate Ĝ for G with squared error loss, weighted according to some finite measure W on (−∞, ∞), i.e., L(G, Ĝ) = ∫(G − Ĝ)² dW. Then Ĝ = E(G | θ1, θ2, …, θn) is the Bayes estimate when the θi are observed, and Ĝ = E(G | X11, …, Xnmn) when only the Xij are observed, the latter being the case when G(θ) is an unknown mixing distribution.
(c) The empirical Bayes problem. When G(θ) is an unknown mixing distribution, we may wish to estimate θi given Xi1, …, Ximi with squared error loss. The Bayes estimate is θ̂i = E(θi | Xi1, …, Ximi). We illustrate the problems discussed above with an example where the prior guess is that the distribution is approximately normal. For conciseness of notation we let 𝒩(μ, σ²) denote the normal distribution function or measure as the context requires.
EXAMPLE 4. Let G(θ) be a sample distribution function from a mixture of Dirichlet processes on (−∞, ∞) with transition measure α_u = M𝒩(u, σ²), mixing distribution H = 𝒩(0, ρ²), and sampling distribution F_θ = 𝒩(θ, τ²). Before treating problems (a), (b), and (c) in turn, we list some of the conditional distributions we will need in the analysis, for a sample of size 2, which will be sufficient to illustrate the interesting features of this model. To save space we omit the derivations, and to simplify many of the expressions we define the following constants:

(6) α = (ρ² + σ² + τ²)^{−1}, α′ = (σ² + τ²)^{−1}, β = (2ρ² + σ² + τ²)^{−1}, γ = (2ρ² + 2σ² + τ²)^{−1}.
(7) G | θ1, θ2 ∈ ∫ 𝒟(M𝒩(u, σ²) + δ_{θ1} + δ_{θ2}) dH(u | θ1, θ2),

where H(u | θ1, θ2) = 𝒩(μ̄, σ̄²), with μ̄ = 2θ̄ρ²/(2ρ² + σ²) and σ̄² = ρ²σ²/(2ρ² + σ²) when θ1 ≠ θ2; μ̄ = θ1ρ²/(ρ² + σ²) and σ̄² = ρ²σ²/(ρ² + σ²) when θ1 = θ2.

G | X1 ∈ ∫ 𝒟(M𝒩(u, σ²) + δ_{θ1}) dH_{X1}(θ1, u),

where H_{X1}(θ1, u) is bivariate Normal with μ1 = X1α(ρ² + σ²), σ11 = ατ²(ρ² + σ²), μ2 = X1αρ², σ22 = αρ²(σ² + τ²), and σ12 = ατ²ρ².

(8) G | X1, X2 ∈ ∫ 𝒟(M𝒩(u, σ²) + δ_{θ1} + δ_{θ2}) dH_{X1,X2}(θ1, θ2, u),

where H_{X1,X2}(θ1, θ2, u) = p_d 𝒩(μ, Σ) + p_s 𝒩(μ*, Σ*), a mixture of trivariate Normal distributions. The mixing coefficient is

p_s = P(θ1 = θ2 | X1, X2) = {1 + M(α′βτ²/γ)^{1/2} exp[−σ²((X1 + X2)²βγ − (X1 − X2)²α′/τ²)/4]}^{−1}

and p_d = 1 − p_s. The means, variances, and covariances of the distribution 𝒩(μ, Σ) are (with X̄ = (X1 + X2)/2) μ1 = α′β[X1(σ⁴ + 2ρ²σ² + τ²(ρ² + σ²)) + X2ρ²τ²], μ2 the same with X1 and X2 interchanged, μ3 = 2X̄βρ², σ11 = σ22 = α′βτ²(σ⁴ + 2ρ²σ² + σ²τ² + ρ²τ²), σ12 = σ21 = α′βρ²τ⁴, σ31 = σ32 = βρ²τ², and σ33 = βρ²(σ² + τ²). The component 𝒩(μ*, Σ*) is a singular trivariate Normal where all the mass of the distribution is concentrated on the θ1 = θ2 plane. In this case, μ1* = μ2* = 2X̄γ(ρ² + σ²), μ3* = 2X̄γρ², σ11* = σ22* = σ12* = γτ²(ρ² + σ²), σ31* = σ32* = γρ²τ², and σ33* = γρ²(2σ² + τ²).
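The mixing coefficient p_s = P(θ1 = θ2 | X1, X2) can be computed two independent ways and cross-checked. A hedged sketch (function names are ours; the closed form is our reconstruction from the constants of (6)): under θ1 = θ2 the pair (X1, X2) is bivariate normal with variance ρ² + σ² + τ² and covariance ρ² + σ², under θ1 ≠ θ2 the covariance is ρ², and the prior odds are 1 : M.

```python
from math import exp, pi, sqrt

def ps_closed_form(x1, x2, m, rho2, sig2, tau2):
    """p_s via the constants of (6); a reconstruction, not the printed text."""
    ap = 1.0 / (sig2 + tau2)                 # alpha'
    b = 1.0 / (2 * rho2 + sig2 + tau2)       # beta
    g = 1.0 / (2 * rho2 + 2 * sig2 + tau2)   # gamma
    ratio = m * sqrt(ap * b * tau2 / g) * exp(
        -(x1 + x2) ** 2 * sig2 * b * g / 4
        + (x1 - x2) ** 2 * sig2 * ap / (4 * tau2))
    return 1.0 / (1.0 + ratio)               # ratio = p_d / p_s

def ps_direct(x1, x2, m, rho2, sig2, tau2):
    """Same quantity from first principles: Bayes factor between the two
    bivariate normal marginals of (X1, X2), with prior odds 1 : M."""
    def dens(v, c):
        det = v * v - c * c
        q = (v * (x1 * x1 + x2 * x2) - 2 * c * x1 * x2) / det
        return exp(-q / 2) / (2 * pi * sqrt(det))
    v = rho2 + sig2 + tau2
    f_same = dens(v, rho2 + sig2)   # theta_1 = theta_2
    f_diff = dens(v, rho2)          # theta_1 != theta_2
    return f_same / (f_same + m * f_diff)
```

As expected, p_s shrinks as the two observations move apart.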
We can now write down directly the solutions to the problems posed above. For (a) we get û = 2θ̄ρ²/(σ² + 2ρ²) when θ1 ≠ θ2, and û = θ1ρ²/(σ² + ρ²) when θ1 = θ2. Thus the estimate û is discontinuous at θ1 = θ2. The heuristic explanation for this is that when θ1 = θ2, it is almost surely due to the atom at θ1, and θ2 gives no information about u, hence the estimate for u reduces to that given only θ1. If we are given X1 and X2, but do not get to see θ1 and θ2, we do not know whether θ1 = θ2, so we must consider two estimates, û = 2X̄ρ²/(2ρ² + σ² + τ²), which is appropriate when θ1 ≠ θ2, and û* = 2X̄ρ²/(2ρ² + 2σ² + τ²) when θ1 = θ2. Since we do not observe whether θ1 = θ2 or not, we must weight these two estimates according to the posterior probability, given X1 and X2, that θ1 = θ2. This gives the estimate û′ = p_d û + p_s û*.

The solution to problem (b), estimating G(θ), requires that we compute E(G(θ) | θ1, θ2) and E(G(θ) | X1, X2). From Proposition 1 we get

E(G(θ) | θ1, θ2) = (M/(M + 2)) ∫ 𝒩_θ(u, σ²) dH(u | θ1, θ2) + (2/(M + 2)) F2(θ),

where F2 is the empirical distribution function of (θ1, θ2) and 𝒩_θ denotes the normal distribution function evaluated at θ.
For θ1 ≠ θ2 we get

(9) Ĝ(θ) = (M/(M + 2)) 𝒩_θ(2θ̄ρ²/(2ρ² + σ²), (σ⁴ + 3ρ²σ²)/(2ρ² + σ²)) + (2/(M + 2)) F2(θ).

If θ1 = θ2, then for the reason given in (a) above, we get

Ĝ(θ) = (M/(M + 2)) 𝒩_θ(θ1ρ²/(ρ² + σ²), (σ⁴ + 2ρ²σ²)/(ρ² + σ²)) + (2/(M + 2)) F2(θ).
If we observe only the X1, X2, then

(10) E(G(θ) | X1, X2) = (M/(M + 2))[p_d 𝒩_θ(μ3, σ² + σ33) + p_s 𝒩_θ(μ3*, σ² + σ33*)] + (1/(M + 2))[p_d 𝒩_θ(μ1, σ11) + p_d 𝒩_θ(μ2, σ22) + 2p_s 𝒩_θ(μ1*, σ11*)],

with the means and variances of the two components of H_{X1,X2} as listed following (8).
Finally, we write down the solution to (c), the empirical Bayes problem. We have immediately θ̂1 = E(θ1 | X1) = X1α(ρ² + σ²) and θ̂2 = E(θ2 | X1, X2) = p_d μ2 + p_s · 2X̄(ρ² + σ²)/(2ρ² + 2σ² + τ²), with μ2 as given following (8). Moreover, if the formulation of the problem allows "hindsight," then given X1 and X2 we can obtain an expression for a revised estimate of θ1 by replacing the subscript 2 by 1 in the expression for θ̂2. The effect of the possibility that θ1 = θ2 is to shift θ̂1 | X1, X2 away from θ̂1 | X1 and toward θ̂2.
Certain features of the preceding example illustrate an important difference between simple Dirichlet processes and mixtures of Dirichlet processes. Consider G ∈ 𝒟(M𝒩(u, σ²)) as above and assume for the moment that u is known, so that we have a simple Dirichlet process. Let Ĝ0 = E(G) and Ĝ2 = E(G | θ1, θ2), and let A denote the open interval (θ1, θ2) (assume without loss of generality that θ1 < θ2). Note that Ĝ2(A) = Ĝ0(A)M/(M + 2) and in general, for any set which does not include any of the observations, the expected value of the probability assigned that set under the posterior distribution is smaller by a factor M/(M + 2) than under the prior, even though the set may be "close" to the observations, as (θ1, θ2) is.

However, if we let u be an index with mixing distribution 𝒩(0, ρ²) as originally in the example, then although θ1 and θ2 still do not contribute directly anything to Ĝ2(A), they do cause the mean of the posterior distribution of u given θ1 and θ2 to be shifted away from zero toward A, and the variance to be decreased to ρ²σ²/(2ρ² + σ²). These changes may more than compensate for the factor M/(M + 2), especially if ρ is large compared to σ (ρ ≫ σ), as seen from (9).
Next consider the case where we only observe X1 and X2. Let B = (X1, X2) (assume X1 < X2), and Ĝ2* = E(G | X1, X2). Referring to (10), we can see several terms which contribute to Ĝ2*(B). There is a change in the posterior distribution of u similar to that described for θ1 and θ2 above. There is a component p_d/(M + 2) which would have been concentrated on θ1 if θ1 had been observed, but which is now spread out around the estimated value of θ1, and some of this mass falls in the interval B; similarly for θ2. Finally there is a component 2p_s/(M + 2) spread out around the midpoint of the estimates for θ1 and θ2. If ρ ≫ σ ≫ τ, then θ̂i is near Xi, i = 1, 2 for the p_d components, and θ̂ is near X̄ for the p_s component. Thus Ĝ2*(B) may be much larger than Ĝ0(B), even though B does not contain X1 or X2.
6. A regression problem. The problems considered in this section are similar to those in Section 5, in that the goal is a Bayes estimate of an unknown distribution function G on [0, 1], with loss function L(G, Ĝ) = ∫₀¹ (G(t) − Ĝ(t))² dW(t), where W(t) is some finite measure. Again we assume G to be chosen by a Dirichlet process, but this time the sampling technique is more like that used in regression problems. If G(t) represents the cdf of G, then various values of t, say 0 < t1 < ⋯ < tk ≤ 1, are chosen and the unknown value of G(ti) becomes a parameter in a distribution F(x | G(ti)). Samples from F(x | G(ti)) are used to make inferences about the value of G(ti).

If G is a sample function from a Dirichlet process with parameter α, let Y1 = G(t1), Y2 = G(t2) − G(t1), …, Y_{k+1} = 1 − G(tk), and β1 = α(t1), β2 = α(t2) − α(t1), …, β_{k+1} = α(1) − α(tk). Then the joint distribution of Y1, …, Y_{k+1} is Dirichlet with parameters β1, …, β_{k+1}. Hence the observations for different values of i will not be stochastically independent in general. We illustrate the effect of this dependence with an example where F(x | G(t)) has a simple density function, and where k = 2.
EXAMPLE 5. Let P be a Dirichlet process on ([0, 1], ℬ), ℬ the Borel sets, with strictly monotone parameter α, and let F(x | G(t)) have density

f(x | G(t)) = 2[xG(t) + (1 − x)(1 − G(t))], 0 < x < 1,
and zero elsewhere. For given t1 and t2 define Y1 and Y2 as above and let X1 and X2 given Y1 and Y2 be independent samples from F(x | G(t1)) and F(x | G(t2)) respectively. The joint density becomes

f(x1, x2, y1, y2) = 4[x1y1 + (1 − x1)(1 − y1)][x2(y1 + y2) + (1 − x2)(1 − y1 − y2)] · (Γ(M)/(Γ(β1)Γ(β2)Γ(β3))) y1^{β1−1} y2^{β2−1} (1 − y1 − y2)^{β3−1}.
A straightforward, but tedious, algebraic manipulation, using Bayes formula, yields

(Y1, Y2 | X1 = x1, X2 = x2) ∈ c^{−1}(x1, x2){x1x2β1(β1 + 1)𝒟(β1 + 2, β2, β3) + (1 − x1)x2β2(β2 + 1)𝒟(β1, β2 + 2, β3) + (1 − x1)(1 − x2)β3(β3 + 1)𝒟(β1, β2, β3 + 2) + x2β1β2𝒟(β1 + 1, β2 + 1, β3) + (x1 + x2 − 2x1x2)β1β3𝒟(β1 + 1, β2, β3 + 1) + (1 − x1)β2β3𝒟(β1, β2 + 1, β3 + 1)},

where c(x1, x2) = x1x2β1(β1 + 1) + ⋯ + (1 − x1)β2β3 is the sum of the six coefficients.
The algebraic manipulation referred to above makes frequent use of property (ii) of the Dirichlet distribution given in Section 2. The technique is not limited to linear densities, but the example shows that even with linear densities and small k, the posterior distribution is somewhat unwieldy for hand computations. Nevertheless, it is possible to transform any density which can be expressed as a finite polynomial in G(t) into a polynomial in the Yi's and absorb it into the Dirichlet density function. This leaves open the possibility of approximating density functions of interest by polynomials, and performing the required computations on a computer.
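The computer approach suggested above can be sketched for Example 5 itself. The function below (names are ours) evaluates E(Y1 | X1, X2) from the six-component Dirichlet mixture, using the coefficients x1x2β1(β1+1), (1−x1)x2β2(β2+1), …, (1−x1)β2β3 and the fact that a Dirichlet(p1, p2, p3) component contributes mean p1/(p1 + p2 + p3); it can be verified against brute-force integration over the simplex.

```python
def posterior_mean_y1(x1, x2, b1, b2, b3):
    """E(Y1 | X1 = x1, X2 = x2) in Example 5, from the mixture of six
    Dirichlet components; each entry pairs a weight with its parameters."""
    comps = [
        (x1 * x2 * b1 * (b1 + 1),             (b1 + 2, b2, b3)),
        ((1 - x1) * x2 * b2 * (b2 + 1),       (b1, b2 + 2, b3)),
        ((1 - x1) * (1 - x2) * b3 * (b3 + 1), (b1, b2, b3 + 2)),
        (x2 * b1 * b2,                        (b1 + 1, b2 + 1, b3)),
        ((x1 + x2 - 2 * x1 * x2) * b1 * b3,   (b1 + 1, b2, b3 + 1)),
        ((1 - x1) * b2 * b3,                  (b1, b2 + 1, b3 + 1)),
    ]
    total = sum(w for w, _ in comps)
    m = b1 + b2 + b3 + 2  # each component's parameters sum to M + 2
    return sum(w * p[0] / m for w, p in comps) / total
```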
7. A bio-assay problem. The next problem we treat is a type of bio-assay problem. We wish to estimate the dose response curve of some animals to a certain drug. We assume, for simplicity, that an animal's response to a given dose of the drug is either positive, or negative (no response), and that each animal has a threshold that the dose given him must exceed to produce a positive response. However, this threshold varies from one animal to the next, so we treat it as a random variable with unknown distribution G. Kraft and van Eeden [12] have treated this problem using a process of the Dubins and Freedman type [5] to choose the distribution G. Our approach will be to let G be a sample function from a Dirichlet process. Essentially the same model has been considered by Ramsey [15].
Let (Θ, 𝒜) = ([0, 1], ℬ) be the unit interval with Borel sets and assume, then, that G is chosen by a Dirichlet process with parameter α, and we select some dosage level t and administer this dosage to n animals. G(t) is the expected proportion of the animals whose response threshold is less than or equal to t, that is, we "expect" nG(t) animals to give a positive response. Of course, G(t) is unknown, since G was chosen at random, but since G was chosen by a Dirichlet process, the prior distribution of G(t) is Be(α(t), M − α(t)). And, as is well known, the posterior distribution of G(t) given k positive responses in n trials is Be(α(t) + k, M + n − k − α(t)), where Be(α1, α2) is the Beta distribution.
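The one-level computation is immediate; a minimal sketch (the function name is ours):

```python
def g_t_posterior_mean(alpha_t, big_m, k, n):
    """Bayes estimate of G(t) under squared error loss after k positive
    responses in n trials: the mean of the posterior
    Be(alpha(t) + k, M + n - k - alpha(t)), i.e. (alpha(t) + k)/(M + n)."""
    a = alpha_t + k
    b = big_m + n - k - alpha_t
    return a / (a + b)
```

The estimate shrinks the empirical proportion k/n toward the prior mean α(t)/M, with M acting as a prior sample size.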
This analysis for one dosage level was deceptively simple, since things become much more complicated with the use of more than one level, as may be seen in the case when two levels, t1 and t2, are used, with n1 and n2 trials respectively. Assume that t1 < t2, and [α(t2) − α(t1)] > 0. Let Ki be the number of successes among the ni trials at ti. Then the joint density of K1, K2, G(t1), G(t2) is most easily expressed in terms of Yi's, where Y1 = G(t1), Y2 = G(t2) − G(t1), and Y3 = 1 − G(t2); and βi's, where β1 = α(t1), β2 = α(t2) − α(t1), β3 = α(1) − α(t2). The joint density of (K1, K2, Y1, Y2) then becomes

(n1 choose k1)(n2 choose k2) y1^{k1}(1 − y1)^{n1−k1}(y1 + y2)^{k2} y3^{n2−k2} · (Γ(M)/(Γ(β1)Γ(β2)Γ(β3))) y1^{β1−1} y2^{β2−1} y3^{β3−1}

on S = {y1, y2 : y1 ≥ 0, y2 ≥ 0, y1 + y2 ≤ 1}, and where y3 = 1 − y1 − y2.
We can transform this into an expression which is recognizable as a mixture of Dirichlet distributions by making the substitution (1 − y1) = (y2 + y3) and expanding (y2 + y3)^{n1−k1} and (y1 + y2)^{k2} using the Binomial formula. This leads finally to an expression for the conditional distribution of Y1, Y2 given K1 = k1, K2 = k2 as

Σ_{i=0}^{k2} Σ_{j=0}^{n1−k1} aij 𝒟(β1 + k1 + i, β2 + n1 − k1 + k2 − i − j, β3 + n2 − k2 + j),

where aij = bij / Σ_{i≥0} Σ_{j≥0} bij and

bij = (n1 − k1 choose j)(k2 choose i) Γ(β1 + k1 + i) Γ(β2 + n1 − k1 + k2 − i − j) Γ(β3 + n2 − k2 + j).
Before proceeding to our goal of a Bayes estimate of G we examine the expressions above in more detail to determine the source of the increased complexity. It is helpful to consider the problem from a slightly different viewpoint. Saying that K1 of the n1 trials at t1 were successes is statistically equivalent to saying that a sample X1, …, Xn1 of size n1 from G was taken, but all that was recorded was that K1 of the Xi's were less than or equal to t1. In particular, of the n1 − K1 values of the Xi which were greater than t1, it is not known how many fell in the interval (t1, t2] and how many in (t2, 1]. Hence we let J denote the number of the Xi's falling in the interval (t2, 1]. Since the true value of J is unknown, we must enumerate and weight appropriately all possible values of J from 0 to n1 − K1. Similarly, if Y1, Y2, …, Yn2 is a sample of size n2 from G, and K2 values of Yi are less than or equal to t2, we let I denote the number of the Yi's that fell in the interval [0, t1]. Now if it were known that I = i and J = j, the joint distribution of G(t1), G(t2) − G(t1), and 1 − G(t2), given K1 = k1, K2 = k2, I = i, J = j, would be Dirichlet with parameters (β1 + k1 + i, β2 + n1 − k1 + k2 − i − j, β3 + j + n2 − k2). Since I and J are unknown, the proper expression is a mixture over all possible values of I and J as given by the double summation above.
By making a similar analysis for the case where there are three thresholds, one can show the corresponding summation runs over 6 indices, and in general, for m thresholds there would be m(m − 1) indices to be summed over. Generally such tasks are best left to the computer.

To resume our treatment of this problem we compute our estimate Ĝ of G. Recall that for squared error loss, the Bayes estimate is the mean of the posterior distribution of G. Moreover, since Ĝ is linear in the intervals [0, t1), [t1, t2), [t2, 1], we need only calculate Ĝ(t1) and Ĝ(t2).
Ĝ(t1) = E(G(t1)) = Σ_{i=0}^{k2} Σ_{j=0}^{n1−k1} aij (β1 + k1 + i)/(M + n1 + n2),

Ĝ(t2) = E(G(t2)) = Σ_{i=0}^{k2} Σ_{j=0}^{n1−k1} aij (β1 + β2 + n1 + k2 − j)/(M + n1 + n2),

Ĝ(t) = Ĝ(t1) · t/t1, 0 ≤ t < t1,
Ĝ(t) = Ĝ(t1) + [(t − t1)/(t2 − t1)]{Ĝ(t2) − Ĝ(t1)}, t1 ≤ t < t2,
Ĝ(t) = Ĝ(t2) + [(t − t2)/(1 − t2)]{1 − Ĝ(t2)}, t2 ≤ t ≤ 1.
We recognize Ĝ(t1) as a weighted average of the various estimates of G(t1) we would make if we knew I and J. It can be shown that for large M this estimate approaches the intuitively reasonable value (α1 + k1 + (α1/α2)k2)/(M + n1 + n2), where αi denotes α(ti). For small values of M, however, the situation is quite different, and contrary to intuition. Recall that M is, in a sense, a measure of our confidence in the parameter α. Very small values of M can be interpreted as meaning we have practically no prior information about G, and our estimate of the distribution will be heavily weighted in favor of the empirical distribution function. A less obvious consequence of a small M is the implication that the process chooses distributions with most of their mass concentrated at a few points.
EXAMPLE 6. Let (Θ, 𝒜) = ([0, 1], ℬ), α(t) = Mt, n1 = n2 = 100, k1 = 1, k2 = 99, t1 = 1/3, t2 = 2/3. In words, of 100 samples where thresholds were compared with t1 = 1/3, one was less than or equal to 1/3, and 99 were greater than 1/3. Of 100 samples compared with t2 = 2/3, 99 were less than or equal to 2/3, and 1 was greater than 2/3. Recalling that because M is small we expect the process to have chosen a distribution with most of its mass on a few points, we could readily explain the sample result by deciding there is a very large jump in G somewhere on (1/3, 2/3], and significantly smaller jumps on [0, 1/3] and (2/3, 1], a total of three jumps. When we evaluate the weights aij, however, we find that the product Γ(M/3 + 1 + i)Γ(M/3 + 198 − i − j)Γ(M/3 + 1 + j) is largest, as M → 0, for i = 99, j = 99, since then the center factor becomes Γ(M/3) and Γ(M/3) → ∞ as M → 0. But when we check the interpretation of these values for I and J, we discover they correspond to estimating Ĝ(1/3) = Ĝ(2/3) = 1/2. As M → 0, the dominant posterior distribution is the one that gives a jump of 1/2 to [0, 1/3), and a similar jump to (2/3, 1], and practically no weight to [1/3, 2/3)! Roughly speaking, the posterior Dirichlet process gives most of its weight to those distributions
which can "explain" the sample with the fewest number of jumps, in this case two jumps instead of the more reasonable sounding three.
We conclude our treatment of the bio-assay problem with a brief discussion of the design problem: Given the model of the bio-assay problem described above, and a budget, say, of n observations total, what are the "best" thresholds ti to test at, and what are the optimal numbers ni to test at each threshold? While this problem for general α and W is intractable, we nevertheless feel that for the special case α linear on [0, 1] and W uniform on [0, 1], the following conjecture is true.
Conjecture. Let Θ be [0, 1] and 𝒜 the Borel sets, and let the bio-assay response curve G(θ) be a sample function from a Dirichlet process on (Θ, 𝒜) with parameter α(t) = Mt. If we have a total n = Σ ni of observations, ni to be compared with threshold ti, and we wish to estimate G by Ĝ to minimize ∫₀¹ (G − Ĝ)² dW, W uniform on [0, 1], then

(i) If k is fixed beforehand, the Bayes design is to set ti = i/(k + 1) and make the ni as nearly equal as possible, i.e., |ni − nj| ≤ 1 for all i, j.
(ii) If k is not fixed, the Bayes design is to let k = n and take one observation each at the thresholds ti = i/(n + 1).
Ramsey [15] reaches this same conclusion independently for a similar model. A more intriguing question is what the optimal sequential design strategy is for fixed sample size, and for variable sample size, with the goal of achieving confidence bands for G(t). The author has not had much success in seeking answers to these questions.
8. A discrimination problem. Statistical discrimination problems may be described as follows. We are given samples Xi1, …, Xiki from unknown distributions pi, i = 1, …, k, and a sample Y1, …, Yn known only to come from one of the pi's. The problem is to decide which one.

We might model such a problem by assuming the pi's are independent samples from a Dirichlet process with parameter α, α(Θ) = M. A little preliminary analysis then reveals that the solution is very sensitive to the assumptions we make about α. If we assume α to be continuous, then the posterior distributions of the pi will have parameters αi with atoms of various sizes at each distinct value of the Xij. Moreover, with probability one, the samples from each pi will be disjoint from the samples from all the other p's, but with probability 1 − (M/(M + ki))ⁿ, at least one sample point Yi will match some sample point Xij when the Y's come from distribution pi, in which case the choice of i is clear. If, however, none of the Yi matches any of the values of the Xij, then the choice of the pi to ascribe the Yi to becomes biased in favor of the pi from which we have the smallest sample of X's, i.e., smallest ki. If, in this case, the ki are all equal, the sample of Y's has not given us any information, and a random decision is made.
For these reasons we will assume that α is purely discrete, and for more generality we assume P1 ∈ 𝒟(α1), …, Pk ∈ 𝒟(αk), where α1 through αk are all discrete with the same support.

In the notation we have been using, let Θ be the nonnegative integers and 𝒜 the σ-algebra generated by the singleton sets. Let αi be the finite nonnegative σ-additive measure defined on 𝒜 through its values on the singleton sets, αi = (αi0, αi1, …, αij, …), where αi({j}) = αij and |αi| = Σ_{j=0}^∞ αij. Let Pi ∈ 𝒟(αi) and Xi1, …, Xiki be a sample from Pi. Let Y1, …, Yn be a sample from some Pj, where the prior probability that Y1, …, Yn come from Pj is πj, j = 1, …, k. Let L(i, j) denote the loss associated with deciding the Y's come from Pi when they come from Pj. We seek a nonrandomized decision rule which minimizes our expected loss given the observations.
First we notice that the distribution of Pi given Xi1, …, Xiki is just 𝒟(αi″), where

αi″ = αi + mi = (αi0 + mi0, αi1 + mi1, …, αij + mij, …)

and mij is the number of Xi's equal to j, j = 0, 1, …. Hence the problem reduces to deciding which of k different Dirichlet processes with known parameters the samples Y1, …, Yn come from. The discrimination problem has been reduced to a classification problem. To treat this we calculate the Bayes risk ri for each i, where
ri = Σ_{j=1}^k L(i, j) w(j | Y1, …, Yn),

w(i | Y1, …, Yn) = πi P(Y | i)/P(Y), P(Y) = Σ_{i=1}^k πi P(Y | i),

P(Y | i) = ∏_j (αij″)^{(kj)} / (|αi″|)^{(n)},

where kj = number of Y's equal to j and x^{(k)} denotes the ascending factorial x(x + 1)⋯(x + k − 1). The Bayes decision rule chooses s where r_s = min_i r_i. If, for example, L(i, j) = 0 for i = j and 1 for i ≠ j, and πi = 1/k for all i, then the Bayes decision rule chooses that s for which P(Y | s) = max_i P(Y | i).
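The classification rule can be sketched as follows (a minimal illustration with our own function names, assuming the discrete posterior parameters αi″ are given as dictionaries mapping support points to masses):

```python
from math import lgamma, log

def log_rising(x, k):
    """log of the ascending factorial x^((k)) = x(x+1)...(x+k-1)."""
    return lgamma(x + k) - lgamma(x)

def log_p_y_given_i(y_counts, alpha_i):
    """log P(Y | i) for a sample with y_counts[j] observations equal to j,
    when P_i ~ Dirichlet process with discrete parameter alpha_i:
    P(Y | i) = prod_j alpha_i(j)^((k_j)) / |alpha_i|^((n))."""
    n = sum(y_counts.values())
    out = -log_rising(sum(alpha_i.values()), n)
    for j, k in y_counts.items():
        mass = alpha_i.get(j, 0.0)
        if mass == 0.0:
            return float("-inf")  # Y takes a value outside the support
        out += log_rising(mass, k)
    return out

def classify(y_counts, alphas, priors):
    """Bayes rule for 0-1 loss: choose i maximizing pi_i * P(Y | i)."""
    scores = {i: log(priors[i]) + log_p_y_given_i(y_counts, a)
              for i, a in alphas.items()}
    return max(scores, key=scores.get)
```

Matching atoms raise the ascending-factorial terms, so repeated Y-values that coincide with heavily weighted atoms of some αi″ quickly favor that process.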
9. Acknowledgments. The author is immeasurably indebted to Professor Thomas Ferguson for introducing him to the concept of Dirichlet processes and providing countless helpful suggestions during the course of the research, and to David Blackwell and Jim MacQueen for helpful discussions. A referee and an associate editor for the Annals of Statistics made many helpful suggestions for revisions of the manuscript. This paper is an extension of part of the author's dissertation submitted in partial fulfillment of the requirements for the Ph. D. degree at U.C.L.A., written under the supervision of Professor Ferguson.
REFERENCES
[1] ABRAMOWITZ, M. and STEGUN, I. A. (1964). Handbook of Mathematical Functions. National Bureau of Standards.
[2] BLACKWELL, DAVID (1973). Discreteness of Ferguson selections. Ann. Statist. 1 356-358.
[3] BLACKWELL, D. and MACQUEEN, J. (1973). Ferguson distributions via Polya urn schemes. Ann. Statist. 1 353-355.
[4] DOKSUM, K. (1974). Tailfree and neutral random probabilities and their posterior distributions. Ann. Prob. 2 183-201.
[5] DUBINS, LESTER E. and FREEDMAN, DAVID A. (1966). Random distribution functions. Proc. Fifth Berkeley Symp. Math. Statist. Prob. 3 183-214. Univ. of California Press.
[6] DUBINS, LESTER E. and FREEDMAN, DAVID A. (1963). Random distribution functions. Bull. Amer. Math. Soc. 69 548-551.
[7] FABIUS, J. (1973). Neutrality and Dirichlet distributions. Transactions of the Sixth Prague Conference on Information Theory, Statistical Decision Functions, and Random Processes. 175-181.
[8] FELLER, WILLIAM (1968). An Introduction to Probability Theory and its Applications, 1. Wiley, New York.
[9] FERGUSON, THOMAS S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 209-230.
[10] KORWAR, R. and HOLLANDER, M. (1973). Contributions to the theory of Dirichlet processes. Ann. Prob. 1 705-711.
[11] KRAFT, CHARLES H. (1964). A class of distribution function processes which have derivatives. J. Appl. Probability 1 385-388.
[12] KRAFT, CHARLES H. and VAN EEDEN, CONSTANCE (1964). Bayesian bio-assay. Ann. Math. Statist. 35 886-890.
[13] PARTHASARATHY, K. R. (1967). Probability Measures on Metric Spaces. Academic Press, New York.
[14] RAIFFA, H. and SCHLAIFER, R. (1961). Applied Statistical Decision Theory. MIT Press, Cambridge.
[15] RAMSEY, F. L. (1972). A Bayesian approach to Bio-assay. Biometrics 28 841-858.
[16] ROBBINS, HERBERT (1963). An empirical Bayes approach to testing statistical hypotheses. Rev. Internat. Statist. Inst.
[17] SCOTT, W. R. (1964). Group Theory. Prentice-Hall, Englewood Cliffs, New Jersey.
DEPARTMENT OF STATISTICS
UNIVERSITY OF CALIFORNIA
BERKELEY, CALIFORNIA 94720