Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems
Author: Charles E. Antoniak
Source: The Annals of Statistics, Vol. 2, No. 6 (Nov., 1974), pp. 1152-1174
Published by: Institute of Mathematical Statistics
Stable URL: http://www.jstor.org/stable/2958336

MIXTURES OF DIRICHLET PROCESSES WITH APPLICATIONS TO BAYESIAN NONPARAMETRIC PROBLEMS¹

BY CHARLES E. ANTONIAK

University of California, Berkeley

A random process called the Dirichlet process, whose sample functions are almost surely probability measures, has been proposed by Ferguson as an approach to analyzing nonparametric problems from a Bayesian viewpoint. An important result obtained by Ferguson in this approach is that if observations are made on a random variable whose distribution is a random sample function of a Dirichlet process, then the conditional distribution of the random measure can be easily calculated, and is again a Dirichlet process. This paper extends Ferguson's result to cases where the random measure is a mixing distribution for a parameter which determines the distribution from which observations are made.
The conditional distribution of the random measure, given the observations, is no longer that of a simple Dirichlet process, but can be described as being a mixture of Dirichlet processes. This paper gives a formal definition for these mixtures and develops several theorems about their properties, the most important of which is a closure property for such mixtures. Formulas for computing the conditional distribution are derived, and applications to problems in bio-assay, discrimination, regression, and mixing distributions are given.

Received December 1972; revised September 1973.
¹ This research was supported partly by the U.S. Navy Electronics Lab, San Diego, California, and partly by the Office of Naval Research under Contract N00014-69-A-0200-1051 with the University of California.
AMS 1970 subject classifications. Primary 60K99; Secondary 60G35, 62C10, 62G99.
Key words and phrases. Dirichlet process, nonparametric, Bayes, empirical Bayes, mixing distribution, random measures, bio-assay, discrimination.

1. Introduction. In certain statistical problems, the distribution F on the sample space (X, ℬ) is only known to belong to some collection of distributions ℱ = {F}. This collection ℱ may be treated as a parameter space in a Bayesian analysis of these problems, but the Bayesian approach requires the placing of a prior distribution on ℱ. If ℱ is a parametric family, the prior distribution can be given for the parameters. However, in many so-called nonparametric or distribution-free problems the set ℱ may be too large to permit such treatment. For example, ℱ may be the class of all distribution functions on the real line. In this context, discussing what he calls the "empirical Bayes" problem, Robbins [16] has said:

A strictly Bayesian approach might be to start out with an a priori distribution of probability over the class ℱ of all possible distributions F ... this could possibly be converted into an a posteriori distribution after X1, X2, ..., Xn have been observed.
Whether anyone will espouse this view and recommend a procedure for carrying it out remains to be seen.

Some properties that would be desirable for such a procedure can be paraphrased from Raiffa and Schlaifer [14], page 44, for this situation as follows:

1. The class 𝒬 of random prior distributions on ℱ should be analytically tractable in three respects: (a) it should be reasonably easy to determine the posterior distribution on ℱ given a "sample"; (b) it should be possible to express conveniently the expectations of simple loss functions; and (c) the class 𝒬 should be closed, in the sense that if the prior is a member of 𝒬, then the posterior is a member of 𝒬.

2. The class 𝒬 should be "rich," so that there will exist a member of 𝒬 capable of expressing any prior information or belief.

3. The class 𝒬 should be parametrized in a manner which can be readily interpreted in relation to prior information and belief.

While these requirements are not mutually exclusive, they do seem, in the case we are considering, to be antagonistic, in the sense that some may be obtained at the expense of others. For example, several authors, among them Dubins and Freedman [5], [6], Kraft and van Eeden [12], and Kraft [11], have described distribution function processes which satisfy the second requirement but are deficient in the first and third. More recently, Ferguson [9] has defined a process called the Dirichlet process which is particularly strong in satisfying the first and third requirements and is only slightly deficient with respect to the second requirement. Although Ferguson was successful in using the Dirichlet process for Bayesian analyses of several nonparametric problems, there are some statistical models for which the closure property 1(c) does not hold.
For example, the nature of sampling in mixing distribution problems and bio-assay problems can be such that the posterior distribution is not a simple Dirichlet process, but can be represented as a mixture of Dirichlet processes, i.e., as belonging in a sense to a linear manifold spanned by Dirichlet processes. This paper reviews the basic properties of Ferguson's Dirichlet process in Section 2. Section 3 develops the additional structure necessary to define mixtures and derives some basic properties of mixtures, culminating in Theorem 3, which states, roughly, that mixtures of Dirichlet processes have the closure property 1(c). The somewhat unusual behavior of samples from a Dirichlet process is examined in Section 4, and explicit formulas for the posterior distributions are derived. Section 5 illustrates the application of mixtures of Dirichlet processes to estimation problems, including estimation of a mixing distribution and empirical Bayes estimation. The remaining Sections 6-8 give applications to a regression problem, a bio-assay problem, and a discrimination problem.

2. Ferguson's Dirichlet process. In a fundamental paper on a Bayesian approach to nonparametric problems, Ferguson [9] defines a random process, called the Dirichlet process, whose sample functions are almost surely probability measures, and he derives many important properties of this process. To save space, we will list below only the definition and those properties we need for this paper.

DEFINITION 1. Let Θ be a set, and 𝒜 a σ-field of subsets of Θ. Let α be a finite, nonnull, nonnegative, finitely additive measure on (Θ, 𝒜). We say a random probability measure P on (Θ, 𝒜) is a Dirichlet process on (Θ, 𝒜) with parameter α, denoted P ∈ 𝒟(α), if for every k = 1, 2, ... and measurable partition B1, ..., Bk of Θ, the joint distribution of the random probabilities (P(B1), ..., P(Bk)) is Dirichlet with parameters (α(B1), ..., α(Bk)), denoted (P(B1), ..., P(Bk)) ∈ 𝒟(α(B1), ..., α(Bk)). (When α(Bi) = 0, P(Bi) = 0 with probability one.)
Ferguson shows that this definition satisfies the Kolmogorov criteria for the existence of a probability measure on the space of all functions from 𝒜 into [0, 1] with the σ-field generated by the cylinder sets. Certain properties of the Dirichlet process obtained by Ferguson will be needed later.

(i) If P ∈ 𝒟(α) and A ∈ 𝒜, then E(P(A)) = α(A)/α(Θ).

(ii) If P ∈ 𝒟(α) and θ1, θ2, ..., θn are i.i.d. P conditional given P, then

P | θ1, θ2, ..., θn ∈ 𝒟(α + Σᵢ δ_{θi}),

where δ_x denotes the measure giving mass one to the point x.

(iii) If P ∈ 𝒟(α), then P is almost surely discrete. (See also Blackwell [2].)

We point out that property (ii) above satisfies desirable properties 1(a) and 1(c): the posterior distribution is easily computable and in the same class as the prior. Property (i) above fulfills desirable property 3 in two respects. Since E(P(A)) = α(A)/α(Θ), one can choose the "shape" of α to reflect his prior guess at the shape of the distribution. Moreover, it follows from (i) and (ii) that

E(P(·) | θ1, θ2, ..., θn) = (n + α(Θ))⁻¹[α(Θ)E(P(·)) + nF_n(·)],

where F_n is the empirical df. Thus the magnitude of α(Θ) represents, in a sense, the degree of faith in the prior guess, and appears in the formula as if it were a "prior sample size." The statistician can choose the magnitude of α(Θ) to represent the strength of his conviction, independent of his opinion about the "shape" of the distribution. On the other hand, the almost sure discreteness of a Dirichlet selection given in (iii) would seem to be inconsistent with desirable property 2 for situations where one wants a prior on a class of continuous distributions. In many cases of interest, however, this discreteness is no more troublesome than the discreteness of a sample cdf. The real concern of property 2 in this context is that distribution functions chosen by a Dirichlet process can be reasonably "close" to the kind of distribution functions one is interested in.
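The posterior-mean formula above lends itself to a quick numerical check. The following sketch is not part of the paper; the N(0, 1)-shaped base measure, the interval A, and the constants are illustrative assumptions. It computes E(P(A) | θ1, ..., θn) = (α(A) + nF_n(A))/(α(Θ) + n):

```python
# Sketch (not from the paper): posterior mean of P(A) under a Dirichlet
# process prior, E(P(A) | data) = (alpha(A) + n*Fn(A)) / (alpha(Theta) + n),
# for a base measure alpha = M * N(0, 1), so alpha(Theta) = M.
from math import erf, sqrt

def std_normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def posterior_mean_P(a, b, M, data):
    """E(P((a, b]) | data) when alpha(.) is M times the N(0, 1) measure."""
    alpha_A = M * (std_normal_cdf(b) - std_normal_cdf(a))
    count_A = sum(1 for x in data if a < x <= b)
    return (alpha_A + count_A) / (M + len(data))

print(posterior_mean_P(-1.0, 1.0, 2.0, []))               # prior guess, ~0.6827
print(posterior_mean_P(-1.0, 1.0, 2.0, [0.1, 0.2, 5.0]))  # pulled toward Fn(A) = 2/3
```

With no observations the value is the prior guess α(A)/α(Θ); as n grows it is pulled toward the empirical df, with α(Θ) acting as the "prior sample size."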
In this regard, Ferguson has proved that the Dirichlet process is "rich" in the sense that there is positive probability that a sample function will approximate as closely as desired, on any specified collection of measurable sets, the measure given by any fixed distribution which is absolutely continuous with respect to the parameter measure α. The mixtures of Dirichlet processes which we define below can also be shown to be "rich" in this same sense, but we will make no use of the property in this paper.

3. Mixtures of Dirichlet processes. Because the basic Dirichlet process defined above does not encompass enough of the situations encountered in Bayesian analysis, we proceed now to the concept of a mixture of Dirichlet processes, which, roughly, is a Dirichlet process where the parameter α is itself random. Before we can make this idea more precise we need a slight generalization of the usual definition of a transition probability, and later we will find it necessary to impose certain regularity conditions on the underlying spaces to assure the existence of some needed conditional distributions.

DEFINITION 2. Let (Θ, 𝒜) and (U, 𝒰) be two measurable spaces. A transition measure on U × 𝒜 is a mapping α of U × 𝒜 into [0, ∞) such that
(a) for every u ∈ U, α(u, ·) is a finite, nonnegative, nonnull measure on (Θ, 𝒜);
(b) for every A ∈ 𝒜, α(·, A) is measurable on (U, 𝒰).

We note that this differs from the definition of a transition probability in that α(u, Θ) need not be identically one. We make this change because we want α(u, ·) to be a parameter for a Dirichlet process.

DEFINITION 3. Let (Θ, 𝒜) be a measurable space, let (U, 𝒰, H) be a probability space, called the index space, and let α be a transition measure on U × 𝒜. We say P is a mixture of Dirichlet processes on (Θ, 𝒜), with mixing distribution H on the index space (U, 𝒰) and transition measure α, if for all k = 1, 2, ... and any measurable partition A1, A2, ..., Ak of Θ we have

P{P(A1) ≤ y1, ..., P(Ak) ≤ yk} = ∫_U D(y1, ..., yk | α(u, A1), ..., α(u, Ak)) dH(u),
where D(y1, ..., yk | a1, ..., ak) denotes the distribution function of the Dirichlet distribution with parameters (a1, ..., ak). In concise symbols we use the heuristic notation

(P(A1), P(A2), ..., P(Ak)) ∈ ∫_U 𝒟(α(u, A1), ..., α(u, Ak)) dH(u),

or simply P ∈ ∫_U 𝒟(α(u, ·)) dH(u). Roughly, we may consider the index u as a random variable with distribution H; conditional given u, P is a Dirichlet process with parameter α(u, ·). In fact, the index can be defined as the identity mapping random variable v, and we will use the notation "| u" for "given v = u." In alternative notation, u ∈ H, P | u ∈ 𝒟(α_u), where α_u = α(u, ·). An example of a mixture of Dirichlet processes derived from a simple Dirichlet process is given below.

EXAMPLE 1. Let P be a Dirichlet process on (Θ, 𝒜) with parameter α. Define α(u, A) = α(A) + δ_u(A), where δ_u(A) = 1 if u ∈ A, 0 otherwise. Let H be a fixed probability measure on (Θ, 𝒜). Then the process P* which chooses u according to H, and P from a Dirichlet process with parameter α(u, ·), is a mixture of Dirichlet processes as defined above. Moreover, for the mixture in this example, we note that if (B1, ..., Bk) is any measurable partition of Θ,

(1) (P(B1), P(B2), ..., P(Bk)) ∈ Σᵢ₌₁ᵏ H(Bᵢ) 𝒟(α(B1), ..., α(Bᵢ) + 1, ..., α(Bk)).

Relation (1), in fact, characterizes mixtures of this type, and we will encounter it frequently in this paper.

DEFINITION 4. Let P be a mixture of Dirichlet processes on (Θ, 𝒜), with mixing distribution H on index space (U, 𝒰) and transition measure α on U × 𝒜. We say that θ1, θ2, ..., θn is a sample of size n from P if for any m = 1, 2, ... and measurable sets A1, A2, ..., Am, C1, C2, ..., Cn we have

(2) P{θ1 ∈ C1, ..., θn ∈ Cn | u, P(A1), ..., P(Am), P(C1), ..., P(Cn)} = Πᵢ₌₁ⁿ P(Cᵢ) a.s.

This definition determines the joint distribution of θ1, ..., θn and P(A1), ..., P(Am), since the customary

P{θ1 ∈ C1, ..., θn ∈ Cn, P(A1) ≤ y1, ..., P(Am) ≤ ym}

may be found by integrating (2) with respect to the known joint conditional distribution of P(A1), ..., P(Am), P(C1), ..., P(Cn) given u over the set [0, y1] × ··· × [0, ym] × [0, 1] × ··· × [0, 1], and then integrating the resulting function of u with respect to H over U.
An immediate consequence of this definition is the following useful result.

PROPOSITION 1. If P ∈ ∫_U 𝒟(α(u, ·)) dH(u) and θ is a sample of size one from P, then for any measurable set A,

(3) P(θ ∈ A) = ∫_U [α(u, A)/α(u, Θ)] dH(u).

PROOF. P(θ ∈ A | u, P(A)) = P(A) a.s.; hence

P(θ ∈ A | u) = E{P(θ ∈ A | u, P(A)) | u} = E{P(A) | u} = α(u, A)/α(u, Θ) a.s.

Finally, P(θ ∈ A) = ∫_U [α(u, A)/α(u, Θ)] dH(u). ∎

THEOREM 1. Let P be a Dirichlet process on (Θ, 𝒜) with parameter α. Let θ be a sample of size 1 from P, and let A ∈ 𝒜 be any measurable set such that α(A) > 0. Then the conditional distribution of P, given θ ∈ A, is a mixture of Dirichlet processes on (Θ, 𝒜), with index space (A, 𝒜 ∩ A), transition measure α_u = α + δ_u on A × (𝒜 ∩ A) for u ∈ A, and mixing distribution H_A on (A, 𝒜 ∩ A), where H_A(·) = α(·)/α(A) on A.

PROOF. Let (B1, B2, ..., Bn) be any measurable partition of Θ. Since A is measurable we obtain a refined measurable partition by letting Bᵢ' = A ∩ Bᵢ and Bᵢ⁰ = Aᶜ ∩ Bᵢ, so that A = ∪ᵢ₌₁ⁿ Bᵢ'. It follows from property (ii) of the Dirichlet process listed in Section 2 that the conditional distribution of P(B1'), ..., P(Bn'), P(B1⁰), ..., P(Bn⁰), given θ ∈ Bᵢ', is Dirichlet with parameters (α(B1'), α(B2'), ..., α(Bᵢ') + 1, ..., α(Bn⁰)). Integrating this with respect to the conditional probability that θ ∈ Bᵢ', given θ ∈ A, we obtain that the conditional distribution of P(B1'), P(B2'), ..., P(Bn⁰), given θ ∈ A, is

Σᵢ₌₁ⁿ [α(Bᵢ')/α(A)] D(α(B1'), ..., α(Bᵢ') + 1, ..., α(Bn⁰)),

which we recognize as relation (1) in Example 1 following Definition 3, with H(·) = α(·)/α(A) on A. ∎
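The mixture appearing in relation (1), and again in the conclusion of Theorem 1, can be checked on a finite partition. The sketch below is illustrative only (the partition masses and H-weights are made-up numbers, not from the paper); it uses the fact that a Dirichlet coordinate has mean equal to its parameter divided by the total:

```python
# Sketch: mean of P(B_j) under the mixture in relation (1),
#   E P(B_j) = sum_i H(B_i) * (alpha(B_j) + [i == j]) / (alpha(Theta) + 1),
# since coordinate j of a Dirichlet vector has mean parameter_j / total.
# The partition masses and H-weights below are illustrative only.
alpha = [0.5, 1.0, 1.5]        # alpha(B_1), alpha(B_2), alpha(B_3)
H = [0.2, 0.3, 0.5]            # H(B_1), H(B_2), H(B_3); must sum to 1
M = sum(alpha)                 # alpha(Theta)

def mixture_mean(j):
    """E P(B_j): H-weighted average of the component Dirichlet means."""
    return sum(h * (alpha[j] + (1 if i == j else 0)) / (M + 1)
               for i, h in enumerate(H))

means = [mixture_mean(j) for j in range(len(alpha))]
print(means)
print(sum(means))  # the means of a random probability vector sum to 1
```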
COROLLARY 1.1. Let P be a mixture of Dirichlet processes on (Θ, 𝒜) with index space also (Θ, 𝒜) and transition measure α_u = α + δ_u. If the distribution H on the index space (Θ, 𝒜) is given by H(A) = α(A)/α(Θ), then P is in fact a simple Dirichlet process on (Θ, 𝒜) with parameter α. In symbols,

∫_Θ 𝒟(α + δ_u) [α(du)/α(Θ)] = 𝒟(α).

We omit the proof and simply point out that H(·) = α(·)/α(Θ) is what we would obtain in Theorem 1 if we let A = Θ. But being given θ ∈ Θ is being given no information at all, so the posterior distribution is the same as the prior, a simple Dirichlet process. The usefulness of Corollary 1.1 is that it enables one to reduce some mixtures of Dirichlet processes to simple Dirichlet processes.

PROPOSITION 2. Let P be a Dirichlet process on (Θ, 𝒜) with parameter α, and let A ∈ 𝒜. Then given P(A) = M, the conditional distribution of (1/M)P restricted to (A, 𝒜 ∩ A) is a Dirichlet process on (A, 𝒜 ∩ A) with parameter α restricted to A. That is, if A1, ..., Ak is any measurable partition of A, then the conditional distribution of (P(A1)/M, ..., P(Ak)/M), given P(A) = M, is a Dirichlet distribution with parameters (α(A1), ..., α(Ak)).

PROOF. From the definition of a Dirichlet distribution in terms of the gamma distribution, as given in Ferguson [9], we know the distribution of P(A1)/Σᵢ₌₁ᵏ P(Aᵢ), ..., P(Ak)/Σᵢ₌₁ᵏ P(Aᵢ) is Dirichlet (α(A1), ..., α(Ak)). But Σᵢ₌₁ᵏ P(Aᵢ) = P(A) = M by hypothesis. ∎

Proposition 2 is an alternative way of stating that the Dirichlet process is "tailfree," as discussed by Doksum [4] and Fabius [7], who show further that the Dirichlet process is essentially the only process which is tailfree with respect to every tree of partitions. They also show that the Dirichlet process is the only one for which the posterior distribution of P(A) given a sample θ1, θ2, ..., θn depends only on the number of observations falling in A, not where they fall in A. We now combine the results of Theorem 1, Corollary 1.1, and Proposition 2 to get:

THEOREM 2. Let P be a Dirichlet process on (Θ, 𝒜) with parameter α, and let θ be a sample from P. Let A ∈ 𝒜. Then the conditional distribution of P given P(A) and θ ∈ A is the same as the conditional distribution of P given P(A).
Roughly speaking, if P(A) is known, then the event θ ∈ A tells us nothing more about the process. This is consistent with the definition of sampling given in Definition 4, and since P(Θ) = 1 a.s., it confirms the interpretation given for Corollary 1.1.

Before we can proceed to the most interesting theorems about mixtures of Dirichlet processes, we must stop to examine the measure theoretic structure we have created and insure the existence of certain conditional distributions by adding appropriate regularity conditions to the underlying spaces. Starting with the measure space (Θ, 𝒜), index probability space (U, 𝒰, H), and transition measure α on U × 𝒜, we define a measure μ on the measurable product space (Θ × U, 𝒜 × 𝒰) by

μ(A × B) = ∫_B [α(u, A)/α(u, Θ)] dH(u).

If P is a mixture of Dirichlet processes on (Θ, 𝒜) with these parameters, and θ1, ..., θn is a sample of size n, we will need to be able to go from the known conditional distribution of P given u, θ1, θ2, ..., θn to the conditional distribution of u given θ1, ..., θn. To assure the existence of this conditional distribution we will henceforth require that (Θ, 𝒜) and (U, 𝒰) be standard Borel spaces, defined as follows:

DEFINITION 5. A standard Borel space is a measurable space (E, ℰ) in which ℰ is countably generated, and for which there exists a bi-measurable mapping between (E, ℰ) and some complete separable metric space (Y, 𝒴).

If (Θ, 𝒜) and (U, 𝒰) are standard Borel spaces, then the product space (Θ × U, 𝒜 × 𝒰) is a standard Borel space, and the required conditional distributions specified above are known to exist (see, for example, Parthasarathy [13], Chapter 5).
With these definitions we can now state the last and most important theorem in this section, which says, roughly, that if we sample from a mixture of Dirichlet processes, and the sample is distorted by random error, the posterior distribution of the process is again a mixture of Dirichlet processes. We will see in the sections on applications that this situation occurs frequently.

THEOREM 3. Let P be a mixture of Dirichlet processes on a standard Borel space (Θ, 𝒜), with standard Borel index space (U, 𝒰), distribution H on (U, 𝒰), and transition measure α on U × 𝒜. Let (X, ℬ) be a standard Borel sample space, and F a transition probability from Θ × ℬ to [0, 1]. If θ is a sample from P, i.e., θ | P, u ∈ P, and X | P, θ, u ∈ F(θ, ·), then the distribution of P given X = x is a mixture of Dirichlet processes on (Θ, 𝒜), with index space (Θ × U, 𝒜 × 𝒰), transition measure α_u + δ_θ on (Θ × U) × 𝒜, and mixing distribution H_x on the index space (Θ × U, 𝒜 × 𝒰), where H_x is the conditional distribution of (θ, u) given X = x. In symbols, if

u ∈ H, P | u ∈ 𝒟(α_u), θ | P, u ∈ P, and X | P, θ, u ∈ F(θ, ·),

then

(P | X = x) ∈ ∫_{Θ×U} 𝒟(α_u + δ_θ) dH_x(θ, u).

PROOF. The distribution of P given (θ, u, x) is 𝒟(α_u + δ_θ), independent of x; the distribution of X given (θ, u) is F(θ, ·), independent of u; the distribution of θ given u is α(u, ·)/α(u, Θ); and the distribution of u is H. The last three define the joint distribution of (θ, u, X) as given in the theorem, and the conditional distribution of P given X is obtained by integrating the known conditional distribution of P given (θ, u, X), namely 𝒟(α_u + δ_θ), with respect to the conditional distribution of (θ, u) given X, yielding the formula given in the theorem, which we recognize as a mixture of Dirichlet processes. ∎

Lest the imposing notation obscure the basic principles involved, we illustrate Theorem 3 with an example where U and X are two-point spaces, and Θ is the unit interval.
EXAMPLE 2. Let X be a Bernoulli random variable with P(X = 1) = θ, and let θ have as prior distribution an equal mixture of Beta distributions, g(θ) = ½ Be(1, 2) + ½ Be(2, 2). Then the posterior distribution of θ given X = 1 is a mixture of Beta distributions, g(θ | X = 1) = (2/5) Be(2, 2) + (3/5) Be(3, 2).

Notice that the posterior mixture gives more weight to Be(3, 2) since, given X = 1, θ is more likely to have come from Be(2, 2) in the prior mixture. This is a specific example of a general property of mixtures of Dirichlet processes which will be stated formally as Corollary 3.2. We now state two corollaries to Theorem 3 which treat cases that occur frequently in applications. No proof is given since they are simply special cases of Theorem 3.

COROLLARY 3.1. Let P be a Dirichlet process on a standard Borel space (Θ, 𝒜) with parameter α, and let θ be a sample from P. Let (X, ℬ) be a standard Borel sample space and F a transition probability from Θ × ℬ to [0, 1]. If the conditional distribution of X given P and θ is F(θ, ·), then the conditional distribution of P given X = x is a mixture of Dirichlet processes on (Θ, 𝒜), with index space (Θ, 𝒜) and transition measure α_θ(·) = α(·) + δ_θ(·), where the mixing distribution H_x on (Θ, 𝒜), considered as the index space, is the conditional distribution of θ given X = x; in concise notation:

P ∈ 𝒟(α), θ ∈ P, X | P, θ ∈ F(θ, ·) implies P | X ∈ ∫_Θ 𝒟(α + δ_θ) dH_x(θ).

Roughly, if the sample from a simple Dirichlet process is distorted by random error, then the posterior distribution of the process given the distorted sample is a mixture of Dirichlet processes.
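The arithmetic behind Example 2 is the generic reweighting of a mixture prior by component marginal likelihoods. A sketch of that computation (Bernoulli observation, Beta components, as in the example; the function name is ours):

```python
# Sketch of the reweighting in Example 2: a mixture-of-Betas prior for a
# Bernoulli parameter.  P(X = 1 | Be(a, b)) = a / (a + b), the component mean.
def posterior_beta_mixture(weights, params, x):
    """Updated (weights, params) after a Bernoulli observation x in {0, 1}."""
    likes = [a / (a + b) if x == 1 else b / (a + b) for a, b in params]
    joint = [w * l for w, l in zip(weights, likes)]
    norm = sum(joint)
    new_weights = [j / norm for j in joint]
    new_params = [(a + x, b + 1 - x) for a, b in params]
    return new_weights, new_params

w, p = posterior_beta_mixture([0.5, 0.5], [(1, 2), (2, 2)], 1)
print(w)  # weights 2/5 and 3/5, up to float rounding
print(p)  # the posterior components Be(2, 2) and Be(3, 2)
```

The component weights are multiplied by the component marginal probabilities of X = 1 and renormalized, giving the weights 2/5 and 3/5 of Example 2.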
COROLLARY 3.2. Let P be a mixture of Dirichlet processes on a standard Borel space (Θ, 𝒜), with standard Borel index space (U, 𝒰), distribution H on (U, 𝒰), and transition measure α on U × 𝒜. If θ is a sample from P, then P given θ is a mixture of Dirichlet processes on (Θ, 𝒜), with transition measure α + δ_θ and mixing distribution H_θ on (U, 𝒰), where H_θ is the conditional distribution of u given θ. In symbols, if P ∈ ∫_U 𝒟(α_u) dH(u) and θ ∈ P, then

(P | θ) ∈ ∫_U 𝒟(α_u + δ_θ) dH_θ(u).

The essential point of this corollary is that the observation θ affects each component of the mixture as one would expect it to, by adding δ_θ to α_u, and in addition changes the relative weightings of the components of the mixture to the conditional distribution of u given θ. An explicit expression for this conditional distribution is given in Lemma 1 following Proposition 4.

4. Properties of samples from Dirichlet processes. The preceding theorems were stated rather formally to reveal the underlying measure theoretic structure, but only for samples of size one, to avoid cumbersome notation. In what follows we will develop more useful formulas for samples of size n, with less regard for elaborate formalism. We will see that the joint distribution of multiple samples possesses some peculiar properties caused by a virtual "memory" of the process. Sometimes a sample of size two is sufficient to illustrate these peculiarities. For example, let P be a Dirichlet process on a standard Borel space (Θ, 𝒜) with parameter α, and assume that α is nonatomic. If θ1 and θ2 were a sample from α(·)/α(Θ) in the usual independent identically distributed sense, we would expect P{θ1 = θ2} = 0. The following argument shows that in fact, for such a Dirichlet process, P{θ1 = θ2} = 1/(α(Θ) + 1). The reason for this is that although α may be nonatomic, the conditional distribution of P given θ1 is a Dirichlet process with parameter α + δ_{θ1}, which is already atomic, with an atom of measure 1 at θ1. Hence the probability that θ2 = θ1, given θ1, is 1/(α(Θ) + 1), independent of θ1. If we proceed to calculate P{θ3 = θ2 = θ1 | θ2 = θ1} we see that the conditional distribution of P given θ1, θ2, with θ1 = θ2, is Dirichlet with parameter α + 2δ_{θ1}.
Hence P{θ3 = θ2 = θ1 | θ2 = θ1} = 2/(α(Θ) + 2), and consequently the joint probability that θ1 = θ2 = θ3 is 2/[(α(Θ) + 1)(α(Θ) + 2)], and by induction,

P(θ1 = θ2 = ··· = θn) = 1^(n−1)/[α(Θ) + 1]^(n−1),

where by definition a^(0) = 1 and a^(n) = a(a + 1) ··· (a + n − 1) for n > 0. This property of the Dirichlet process is very similar to what happens in Polya urn schemes, and in one sense characterizes Dirichlet processes (see Feller [8] and Blackwell and MacQueen [3]).

On the other hand, one can consider the probability that θn is a new value, distinct from any previous observations θ1, θ2, ..., θ_{n−1}. By the same logic as before, one notes that P(θ2 ≠ θ1) = α(Θ)/(α(Θ) + 1). Similarly P(θ3 ∉ {θ1, θ2}) = α(Θ)/(α(Θ) + 2), regardless of whether θ1 = θ2 or not. If one defines Wᵢ as a random variable which equals 1 if θᵢ is a new, distinct value, and zero otherwise, then P(Wᵢ = 1) = α(Θ)/(α(Θ) + i − 1). If we further define Z_n = Σᵢ₌₁ⁿ Wᵢ, then Z_n is simply the number of distinct values of θ which have occurred in the first n observations. Although P(W_n = 1) = α(Θ)/(α(Θ) + n − 1) is monotone decreasing in n, nevertheless note that

E(Z_n) = Σ_{m=1}^n α(Θ)/(α(Θ) + m − 1) ≈ α(Θ) log((n + α(Θ))/α(Θ)).

Hence E(Z_n) → ∞ as n → ∞. In fact, Korwar and Hollander [10] show Z_n → ∞ a.s. as n → ∞. Thus although new distinct values are increasingly rare, we are assured nonetheless of a steadily increasing number of distinct values. Moreover, since the distribution of the distinct values is simply α(·)/α(Θ), this property can be used in the usual way to obtain information about the shape of α(·) if it is unknown. On the other hand, the rate at which new distinct values appear depends only on the magnitude of α(Θ), and not the shape of α(·), and this property should enable us to obtain information about the magnitude of α(Θ) if it is unknown. We begin by obtaining an expression for P(Z_n = k) as a function of α(Θ).
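Before turning to the exact distribution of Z_n, the growth just described can be checked by simulation. A sketch (the sequential scheme is exactly the "new value with probability α(Θ)/(α(Θ) + i − 1)" rule derived above; the particular constants are illustrative):

```python
# Sketch: Z_n, the number of distinct values among theta_1, ..., theta_n,
# drawn by the sequential rule derived above: theta_i is new with
# probability M / (M + i - 1), where M = alpha(Theta) (alpha nonatomic).
import math
import random

def expected_distinct(M, n):
    """Exact E(Z_n) = sum_{m=1}^{n} M / (M + m - 1)."""
    return sum(M / (M + m - 1) for m in range(1, n + 1))

def sample_Zn(M, n, rng):
    """One draw of Z_n."""
    z = 0
    for i in range(1, n + 1):
        if rng.random() < M / (M + i - 1):
            z += 1          # W_i = 1: a new, distinct value appears
    return z

M, n = 2.0, 1000
rng = random.Random(0)
avg = sum(sample_Zn(M, n, rng) for _ in range(200)) / 200
print(avg)                            # Monte Carlo estimate of E(Z_n)
print(expected_distinct(M, n))        # the exact sum
print(M * math.log((n + M) / M))      # the logarithmic approximation
```

The Monte Carlo average, the exact sum, and the approximation α(Θ) log((n + α(Θ))/α(Θ)) should all be close.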
As an aid in obtaining this expression we define a sequence of polynomials A_n(x) by

A_1(x) = x,
A_2(x) = (x + 1)A_1(x) = x(x + 1) = x^(2),
···
A_n(x) = (x + n − 1)A_{n−1}(x) = x(x + 1) ··· (x + n − 1) = x^(n).

Hence A_n(x) is a polynomial of degree n in x with integer coefficients, which we write

A_n(x) = a_{n,1}x + a_{n,2}x² + ··· + a_{n,n}xⁿ.

If we substitute x = α(Θ) in A_n(x), then considerations similar to those in the discussion of repeated values enable us to identify the kth term of this polynomial with the event of observing exactly k distinct values in a sample of size n:

P(Z_n = k) = a_{n,k} α(Θ)^k / A_n(α(Θ)).

The coefficients a_{n,k} of the polynomial are the absolute values of Stirling numbers of the first kind, tabulated in Abramowitz and Stegun [1], page 833.

The significance of the foregoing is that if one knows he is sampling from some Dirichlet process with unknown parameter α, then he can obtain, independently, consistent estimates for α(·)/α(Θ) and α(Θ) by making use of the properties given above. If, however, there were some doubt that the process was in fact a Dirichlet process, then the only discriminating feature left is the actual pattern of multiplicities observed. As a first step in such discrimination we show that we can, in fact, determine the fine structure of the probability of various patterns of multiplicities in a sample from a Dirichlet process. If we were interested, for example, in the probability that the first two observations were identical, the third was different from the first two, the fourth and fifth matched the third, and the sixth and seventh matched each other but were different from the previous values, then an argument similar to that given above yields

P{θ1 = θ2 ≠ θ3, θ3 = θ4 = θ5, θ6 = θ7, θ6 distinct from the earlier values} = α(Θ)³ 1^(1) 1^(2) 1^(1) / α(Θ)^(7) = 2α(Θ)³/α(Θ)^(7).
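The coefficients a_{n,k} of A_n(x) (the absolute values of Stirling numbers of the first kind) satisfy the recurrence a_{n,k} = a_{n−1,k−1} + (n − 1)a_{n−1,k}, so P(Z_n = k) can be computed exactly. A sketch in exact rational arithmetic (the function names are ours, not from the paper):

```python
# Sketch: exact computation of P(Z_n = k) = a_{n,k} * M**k / M^(n), where
# a_{n,k} are the coefficients of A_n(x) = x(x+1)...(x+n-1) (unsigned
# Stirling numbers of the first kind) and M^(n) is the rising factorial.
from fractions import Fraction

def stirling1_row(n):
    """Coefficients a_{n,k}, k = 0..n, via a_{n,k} = a_{n-1,k-1} + (n-1)a_{n-1,k}."""
    row = [1]                              # A_0(x) = 1 by convention
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            new[k] = row[k - 1] + (m - 1) * (row[k] if k < m else 0)
        row = new
    return row

def pmf_distinct(M, n):
    """[P(Z_n = 1), ..., P(Z_n = n)] for a Dirichlet process with alpha(Theta) = M."""
    rising = Fraction(1)
    for j in range(n):
        rising *= M + j                    # M^(n) = M(M+1)...(M+n-1)
    coeffs = stirling1_row(n)
    return [coeffs[k] * Fraction(M) ** k / rising for k in range(1, n + 1)]

pmf = pmf_distinct(Fraction(1), 3)
print(pmf)        # probabilities 1/3, 1/2, 1/6
print(sum(pmf))   # a probability distribution: sums to 1
```

For n = 3 and α(Θ) = 1 this gives P(Z_3 = 1, 2, 3) = 1/3, 1/2, 1/6, whose mean 11/6 matches the exact sum Σ_{m=1}^{3} 1/m.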
Moreover, if one were only interested in characterizing the event considered above as one where, in a sample of size 7, only 3 distinct values of θ occurred, and that two were pairs and one a triplet, then the subscripts of the θ's are irrelevant, and we obtain the probability of the latter event by multiplying the previous expression by the appropriate combinatorial factor, in this case the familiar multinomial coefficient (7 choose 2, 2, 3), divided by 2! since the two pairs are indistinguishable. The result could be expressed

P{θᵢ = θⱼ, θₖ = θₗ = θᵣ, θₛ = θₜ, with the three values distinct} = [2α(Θ)³/α(Θ)^(7)] (1/2!) (7 choose 2, 2, 3),

where {i, j, k, l, r, s, t} = {1, 2, 3, 4, 5, 6, 7} in some order. The generalization of this example and the simplification of its expression are the goals of the following definition and proposition.

DEFINITION 6. Let θ1, θ2, ..., θn be a sample of size n from a Dirichlet process P. We will say that the sample belongs to the class C(m1, m2, ..., mn), and write (θ1, ..., θn) ∈ C(m1, ..., mn), if there are m1 distinct values of θ that occur only once, m2 that occur exactly twice, ..., mn that occur exactly n times.

Two immediate consequences of this definition are that n = Σᵢ₌₁ⁿ i·mᵢ, and that the total number of distinct values of θ that occur is Σᵢ₌₁ⁿ mᵢ. As an example of this notation we note that the sample in the preceding discussion belongs to the class C(0, 2, 1, 0, 0, 0, 0).

PROPOSITION 3. Let P be a Dirichlet process on a standard Borel space (Θ, 𝒜) with parameter α, and let α be nonatomic. Let (θ1, ..., θn) be a sample of size n from P. Then

(4) P{(θ1, ..., θn) ∈ C(m1, ..., mn)} = n! α(Θ)^{Σmᵢ} / [α(Θ)^(n) Πᵢ₌₁ⁿ i^{mᵢ}(mᵢ!)].

PROOF. We calculate the probability of a particular sequence in the class C(m1, ..., mn), observe that all such sequences are equally likely, and multiply by the number of these sequences. Let C0(m1, ..., mn) be the event that θ1, ..., θ_{m1}, in that order, are unique in the sample and occur only once; that θ_{m1+1}, ..., θ_{m1+2m2} occur twice each, in the order θ_{m1+1} = θ_{m1+2}, θ_{m1+3} = θ_{m1+4}, etc.
Then by the same argument given in the previous discussion,

P{(θ1, ..., θn) ∈ C0(m1, ..., mn)} = [1^(0)α(Θ)]^{m1}[1^(1)α(Θ)]^{m2} ··· [1^(n−1)α(Θ)]^{mn} / α(Θ)^(n).

Next we must count the number of ways we can permute the indices of the θ's to obtain all essentially different sequences. But this number is well known as the number of elements in a conjugate class of the symmetric permutation group on n elements, and is equal to

(n choose 1, ..., 1, 2, ..., 2, ..., n) / Πᵢ₌₁ⁿ (mᵢ!),

where the multinomial coefficient has m1 parts equal to 1, m2 parts equal to 2, and so on. For a derivation of this number see Scott [17]. Multiplying this combinatorial factor by the preceding probability, and dividing numerator and denominator by Πᵢ₌₁ⁿ [1^(i−1)]^{mᵢ}, which is identical to Πᵢ₌₁ⁿ [(i − 1)!]^{mᵢ}, the expression reduces to

n! α(Θ)^{Σmᵢ} / [α(Θ)^(n) Πᵢ₌₁ⁿ i^{mᵢ}(mᵢ!)]. ∎

We see thus that the Dirichlet process induces a measure on the sample space Θⁿ which gives positive mass to certain collections of sub-hyperplanes in Θⁿ. The relative magnitude of the mass concentrated on the sub-hyperplanes, compared to the mass distributed over the remainder of the sample space, is seen to be a function of the magnitude of α(Θ). Consequently, for a given sample size, we would expect many more multiplicities if α(Θ) is very small than if it is very large. Furthermore, if α(Θ) is unknown, this property just described enables us to make an inference about the magnitude of α(Θ) based on the number of multiplicities in a sample. We will see, in fact, that it is not the distribution of multiplicities, but the number of distinct values that occur in the sample, which is important in making inferences about α(Θ). We begin by formalizing the results of the previous example.

PROPOSITION 4. Let P ∈ ∫_U 𝒟(α_u) dH(u), where α_u is nonatomic for all u ∈ U, and let (θ1, ..., θn) be a sample of size n from P. Then the posterior distribution of u, given (θ1, ..., θn) ∈ C(m1, ..., mn), is determined by

P(u ∈ B | θ ∈ C(m)) = ∫_B [α(u, Θ)^{Σmᵢ}/α(u, Θ)^(n)] dH(u) / ∫_U [α(u, Θ)^{Σmᵢ}/α(u, Θ)^(n)] dH(u).

PROOF. From Proposition 3, we have

P(θ ∈ C(m) | u) = n! α(u, Θ)^{Σmᵢ} / [Πᵢ₌₁ⁿ i^{mᵢ}(mᵢ!) α(u, Θ)^(n)].

Integrating with respect to H(u) over u ∈ B gives the joint probability of θ ∈ C(m) and u ∈ B. Since P(θ ∈ C(m)) > 0, we obtain the usual conditional probability by application of Bayes formula, and note that the combinatorial coefficients cancel. ∎
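Formula (4) can be verified numerically by summing over all multiplicity classes, which correspond to the integer partitions of n. A sketch in exact arithmetic (the enumeration scheme is an implementation choice of ours, not from the paper):

```python
# Sketch: formula (4) summed over all multiplicity classes C(m_1, ..., m_n),
# which correspond to the integer partitions of n; the total must be 1.
from fractions import Fraction
from math import factorial

def partitions(n, max_part=None):
    """Yield the partitions of n as non-increasing lists of parts."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield []
        return
    for p in range(min(n, max_part), 0, -1):
        for rest in partitions(n - p, p):
            yield [p] + rest

def class_prob(parts, M, n):
    """Formula (4): n! M^{sum m_i} / (M^(n) prod_i i^{m_i} m_i!)."""
    m = [parts.count(i) for i in range(1, n + 1)]   # m_i = #values of size i
    rising = Fraction(1)
    for j in range(n):
        rising *= M + j                             # M^(n)
    denom = rising
    for i, mi in enumerate(m, start=1):
        denom *= Fraction(i) ** mi * factorial(mi)
    return factorial(n) * Fraction(M) ** sum(m) / denom

M, n = Fraction(1), 4
probs = [class_prob(p, M, n) for p in partitions(n)]
print(probs)
print(sum(probs))  # exactly 1
```

With α(Θ) = 1 and n = 4 the five class probabilities come out, in the order generated, to 1/4, 1/3, 1/8, 1/4, and 1/24, and they sum to 1.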
It is important to notice that a consequence of the combinatorial coefficients cancelling is that the mᵢ occur in the above expression only as Σᵢ₌₁ⁿ mᵢ, that is, in a sum expressing the total number of distinct values, which we have previously denoted by Z_n. This confirms our earlier remark that Z_n is sufficient for α(Θ).

We can conclude from Proposition 4 that if α_u is nonatomic and α_u(Θ) = M, a constant independent of u, then the event θ ∈ C(m) is independent of the event u ∈ B, and hence θ ∈ C(m) provides no new information about u. This is a consequence of the assumption that α_u be nonatomic. If α_u is atomic, the event θ ∈ C(m) may still provide information about u even when α(u, Θ) is independent of u. An analysis for a special case of atomic α is given later in Lemma 1, but some remarks of general validity are possible now. We note first that if α is atomic, then P assigns positive weight to certain specific points in the product space of observations Θⁿ, and these points are contained in the collections of sub-hyperplanes described earlier. If P1 ∈ 𝒟(α1) and P2 ∈ 𝒟(α2), where α1(Θ) = α2(Θ) but α1 is nonatomic and α2 has atoms, then there is a greater probability of duplicated values in a sample from P2 than in a sample from P1. If P ∈ ∫_U 𝒟(α_u) dH(u), and some or all α_u are atomic, the posterior distribution of u given θ will depend on whether any of the observed samples θᵢ match any of the atoms of any of the α_u; and if so, whether the matched atoms are common to more than one value of u; and if common to more than one α_u, whether the size of the atoms is the same, etc. To elucidate this idea, consider the following example.
and H(u) be uniform on [0, 1]. Then with only one sample θ from P, the distribution of u given θ is degenerate at θ (mod 1). Notice that here there is no dominating σ-finite measure for the α_u; in fact, they are all mutually singular. Nevertheless, the fact that the joint distribution of (θ, u) is concentrated on a subspace of the space Θ × U enables us to find rather easily the posterior distribution of u given θ.

We conclude this section with a partial extension of Corollary 3.2 to the case of a sample of size n.

LEMMA 1. Let P ∈ ∫_U D(α_u) dH(u) as in Theorem 3, let θ = (θ₁, …, θ_n) be a sample of size n from P, and suppose there exists a σ-finite, σ-additive measure μ on (Θ, A) such that for each u ∈ U, (i) α_u is σ-additive and absolutely continuous with respect to μ, and (ii) the measure μ has mass one at each atom of α_u. Then

(5)    dH_θ(u) = (1/M_u^{(n)}) ∏_i α_u′(θ_i′)(m_u(θ_i′) + 1)^{(n(θ_i′)−1)} dH(u) / ∫_U (1/M_u^{(n)}) ∏_i α_u′(θ_i′)(m_u(θ_i′) + 1)^{(n(θ_i′)−1)} dH(u),

where α_u′(·) denotes the Radon–Nikodym derivative of α_u(·) with respect to μ; θ_i′ is the ith distinct value in θ; n(θ_i′) is the number of times the value θ_i′ occurs in θ; M_u = α_u(Θ); and m_u(θ_i′) = α_u′(θ_i′) if θ_i′ is an atom of α_u, zero otherwise.

PROOF. We obtain the joint distribution of (u, θ₁, …, θ_n) by calculating the likelihood of θ and making the appropriate normalization. Referring to the proof of Proposition 3, we see that the likelihood of θ_{k+1} given u, θ₁, …, θ_k is α_u′(θ_{k+1}) dμ/(M_u + k) for a value of θ_{k+1} which has not occurred previously in θ₁, …, θ_k, and [m_u(θ_{k+1}) + j] dμ/(M_u + k) for a value of θ_{k+1} which has occurred previously j times in θ₁, …, θ_k. Hence the likelihood of θ = (θ₁, …, θ_n) given u is

    (1/M_u^{(n)}) ∏_i α_u′(θ_i′)(m_u(θ_i′) + 1)^{(n(θ_i′)−1)} dμ^n.

We obtain dH_θ(u) by multiplying the above by dH(u) and dividing by the unconditional distribution of θ. □

We have thus established the validity of the term dH_θ(u) in the following extension of Corollary 3.2.
COROLLARY 3.2′. Let P and α_u be as in Lemma 1 and let θ = (θ₁, …, θ_n) be a sample of size n from P. Then

    P | θ ∈ ∫_U D(α_u + Σ_{i=1}^{n} δ_{θ_i}) dH_θ(u),

where H_θ is the conditional distribution of u given θ.

5. Estimation problems: parameters, mixing distributions, and empirical Bayes. In this section we present a general sampling model using mixtures of Dirichlet processes, and point out some cases where this Dirichlet model leads to estimates different from standard Bayesian analysis. Consider a mixture of Dirichlet processes where the index space, parameter space, and observation space are all the real line with the σ-field of Borel sets. Let G(θ) be a sample distribution function from a mixture with parameter α_u and mixing distribution H(u). Let θ₁, θ₂, …, θ_n be a sample of size n from G(θ) and let X_{i1}, …, X_{im_i} be a sample of size m_i from F_{θ_i}(x). We consider the following problems:

(a) Estimating the index of the parameter. If we wish to estimate u with squared error loss, then the Bayes estimate is simply û = E(u | θ₁, θ₂, …, θ_n) if the θ_i are observed directly, and û = E(u | X₁₁, …, X_{nm_n}) if we only observe the X_{ij}. The theory is given in Theorem 3 and the required conditional distribution of u can be obtained using Lemma 1. The positive probability of duplications among the θ_i causes some complications that will be illustrated in an example at the end of this section.

(b) Estimating the (mixing) distribution function. Suppose we wish an estimate Ĝ for G with squared error loss, weighted according to some finite measure W on (−∞, ∞), i.e., L(G, Ĝ) = ∫ (G − Ĝ)² dW. Then Ĝ = E(G | θ₁, θ₂, …, θ_n) is the Bayes estimate when the θ_i are observed, and Ĝ = E(G | X₁₁, …, X_{nm_n}) when only the X_{ij} are observed, the latter being the case when G(θ) is an unknown mixing distribution.

(c) The empirical Bayes problem. When G(θ) is an unknown mixing distribution, we may wish to estimate θ_i given X_{i1}, …, X_{im_i} with squared error loss. The Bayes estimate is θ̂_i = E(θ_i | X_{i1}, …, X_{im_i}).

We illustrate the problems discussed above with an example where the prior guess is that the distribution is approximately normal.
For conciseness of notation we let N(μ, σ²) denote the normal distribution function or measure as the context requires.

EXAMPLE 4. Let G(θ) be a sample distribution function from a mixture of Dirichlet processes on (−∞, ∞) with transition measure α_u = M·N(u, σ²), mixing distribution H = N(0, ρ²), and sampling distribution F_θ = N(θ, τ²). Before treating problems (a), (b), and (c) in turn, we list some of the conditional distributions we will need in the analysis, for a sample of size 2, which will be sufficient to illustrate the interesting features of this model. To save space we omit the derivations, and to simplify many of the expressions we define the following constants:

(6)    α = (ρ² + σ² + τ²)^{−1},  α′ = (σ² + τ²)^{−1},  β = (2ρ² + σ² + τ²)^{−1},  β′ = (2ρ² + 2σ² + τ²)^{−1}.

Given θ₁ and θ₂,

    G | θ₁, θ₂ ∈ ∫ D(M·N(u, σ²) + δ_{θ₁} + δ_{θ₂}) dH(u | θ₁, θ₂),

where H(u | θ₁, θ₂) = N(μ₁, σ₁²) with μ₁ = (θ₁ + θ₂)ρ²/(2ρ² + σ²) and σ₁² = ρ²σ²/(2ρ² + σ²) when θ₁ ≠ θ₂, and μ₁ = θ₁ρ²/(ρ² + σ²) and σ₁² = ρ²σ²/(ρ² + σ²) when θ₁ = θ₂. Given a single observation X₁,

(7)    G | X₁ ∈ ∫ D(M·N(u, σ²) + δ_θ) dH_{X₁}(θ, u),

where H_{X₁}(θ, u) is bivariate normal with μ_θ = X₁α(ρ² + σ²), σ_θ² = ατ²(ρ² + σ²), μ_u = X₁αρ², σ_u² = αρ²(σ² + τ²), and σ_{θu} = αρ²τ². Given X₁ and X₂,

(8)    G | X₁, X₂ ∈ ∫ D(M·N(u, σ²) + δ_{θ₁} + δ_{θ₂}) dH_{X₁,X₂}(θ₁, θ₂, u),

where H_{X₁,X₂}(θ₁, θ₂, u) = p_d·N(μ, Σ) + p_s·N(μ*, Σ*), a mixture of trivariate normal distributions. The mixing coefficient is

    p_s = P(θ₁ = θ₂ | X₁, X₂) = {1 + Mτ(α′β/β′)^{1/2} exp[σ²α′(X₁ − X₂)²/4τ² − σ²ββ′(X₁ + X₂)²/4]}^{−1},

and p_d = 1 − p_s. The means, variances, and covariances of the distribution N(μ, Σ), the third coordinate referring to u, are

    μ₁ = α′β[(σ⁴ + 2σ²ρ² + σ²τ² + ρ²τ²)X₁ + ρ²τ²X₂],  μ₂ = α′β[(σ⁴ + 2σ²ρ² + σ²τ² + ρ²τ²)X₂ + ρ²τ²X₁],  μ₃ = βρ²(X₁ + X₂),
    σ₁₁ = σ₂₂ = α′βτ²(σ⁴ + 2σ²ρ² + σ²τ² + ρ²τ²),  σ₁₂ = α′βρ²τ⁴,  σ₁₃ = σ₂₃ = α′βρ²τ²(σ² + τ²),  σ₃₃ = α′βρ²(σ² + τ²)².

The component N(μ*, Σ*) is a singular trivariate normal where all the mass of the distribution is concentrated on the θ₁ = θ₂ plane. In this case,

    μ₁* = μ₂* = β′(ρ² + σ²)(X₁ + X₂),  μ₃* = β′ρ²(X₁ + X₂),
    σ₁₁* = σ₂₂* = σ₁₂* = β′τ²(ρ² + σ²),  σ₁₃* = σ₂₃* = β′ρ²τ²,  σ₃₃* = β′ρ²(2σ² + τ²).

We can now write down directly the solutions to the problems posed above.
For (a) we get û = (θ₁ + θ₂)ρ²/(2ρ² + σ²) when θ₁ ≠ θ₂, and û = θ₁ρ²/(ρ² + σ²) when θ₁ = θ₂. Thus the estimate û is discontinuous at θ₁ = θ₂. The heuristic explanation for this is that when θ₁ = θ₂, it is almost surely due to the atom at θ₁, and θ₂ gives no information about u, hence the estimate for u reduces to that given θ₁ only. If we are given X₁ and X₂, but do not get to see θ₁ and θ₂, we do not know whether θ₁ = θ₂, so we must consider two estimates: û_d = (X₁ + X₂)ρ²/(2ρ² + σ² + τ²), which is appropriate when θ₁ ≠ θ₂, and û_s = (X₁ + X₂)ρ²/(2ρ² + 2σ² + τ²), appropriate when θ₁ = θ₂. Since we do not observe whether θ₁ = θ₂ or not, we must weight these two estimates according to the posterior probability, given X₁ and X₂, that θ₁ = θ₂. This gives the estimate û = p_d û_d + p_s û_s.

The solution to problem (b), estimating G(θ), requires that we compute E(G(θ) | θ₁, θ₂) and E(G(θ) | X₁, X₂). From Proposition 1 we get

    E(G(θ) | θ₁, θ₂) = [M/(M + 2)] ∫ N(θ; u, σ²) dH(u | θ₁, θ₂) + [2/(M + 2)] F₂(θ),

where N(θ; μ, v) denotes the N(μ, v) distribution function evaluated at θ and F₂ is the empirical distribution function of (θ₁, θ₂). For θ₁ ≠ θ₂ we get

(9)    Ĝ(θ) = [M/(M + 2)] N(θ; (θ₁ + θ₂)ρ²/(2ρ² + σ²), (σ⁴ + 3ρ²σ²)/(2ρ² + σ²)) + [2/(M + 2)] F₂(θ).

If θ₁ = θ₂, then for the reason given in (a) above, we get

    Ĝ(θ) = [M/(M + 2)] N(θ; θ₁ρ²/(ρ² + σ²), (σ⁴ + 2ρ²σ²)/(ρ² + σ²)) + [2/(M + 2)] F₂(θ).

If we observe only X₁ and X₂, then

(10)    E(G(θ) | X₁, X₂) = [M/(M + 2)][p_d N(θ; μ₃, σ² + σ₃₃) + p_s N(θ; μ₃*, σ² + σ₃₃*)] + [1/(M + 2)][p_d N(θ; μ₁, σ₁₁) + p_d N(θ; μ₂, σ₂₂) + 2p_s N(θ; μ₁*, σ₁₁*)],

with the moments as given in (8).

Finally, we write down the solution to (c), the empirical Bayes problem. We have immediately θ̂₁ = E(θ₁ | X₁) = X₁(ρ² + σ²)/(ρ² + σ² + τ²), and θ̂₂ = E(θ₂ | X₁, X₂) = p_d μ₂ + p_s β′(ρ² + σ²)(X₁ + X₂), with μ₂ as in (8). Moreover, if the formulation of the problem allows "hindsight," then given X₁ and X₂ we can obtain an expression for a revised estimate of θ₁ by replacing the subscript 2 by 1 in the expression for θ̂₂.
The effect of the possibility that θ₁ = θ₂ is to shift θ̂₁ | X₁, X₂ away from θ̂₁ | X₁ and toward θ̂₂.

Certain features of the preceding example illustrate an important difference between simple Dirichlet processes and mixtures of Dirichlet processes. Consider G ∈ D(M·N(u, σ²)) as above and assume for the moment that u is known, so that we have a simple Dirichlet process. Let G₀ = E(G) and Ĝ₂ = E(G | θ₁, θ₂), and let A denote the open interval (θ₁, θ₂) (assume without loss of generality that θ₁ < θ₂). Note that Ĝ₂(A) = G₀(A)·M/(M + 2), and in general, for any set which does not include any of the observations, the expected value of the probability assigned that set under the posterior distribution is smaller by a factor M/(M + 2) than under the prior, even though the set may be "close" to the observations, as (θ₁, θ₂) is.

However, if we let u be an index with mixing distribution N(0, ρ²) as originally in the example, then although θ₁ and θ₂ still do not contribute directly to Ĝ₂(A), they do cause the mean of the posterior distribution of u given θ₁ and θ₂ to be shifted away from zero toward A, and the variance to be decreased to ρ²σ²/(2ρ² + σ²). These changes may more than compensate for the factor M/(M + 2), especially if ρ is large compared to σ (ρ ≫ σ), as seen from (9).

Next consider the case where we only observe X₁ and X₂. Let B = (X₁, X₂) (assume X₁ < X₂), and Ĝ₂* = E(G | X₁, X₂). Referring to (10), we can see several terms which contribute to Ĝ₂*(B). There is a change in the posterior distribution of u similar to that described for θ₁ and θ₂ above. There is a component p_d/(M + 2) which would have been concentrated on θ₁ if θ₁ had been observed, but which is now spread out around the estimated value of θ₁, and some of this mass falls in the interval B; similarly for θ₂. Finally there is a component 2p_s/(M + 2) spread out around the midpoint of the estimates for θ₁ and θ₂. If ρ ≫ σ ≫ τ, then θ̂_i is near X_i, i = 1, 2, for the p_d components, and θ̂ is near (X₁ + X₂)/2 for the p_s component. Thus Ĝ₂*(B) may be much larger than G₀(B), even though B does not contain X₁ or X₂.
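The constants and the mixing weight p_s of Example 4 follow from the conjugate normal model, so a direct numerical cross-check is possible: p_s can also be computed from the two bivariate normal densities of (X₁, X₂) (covariance ρ² when θ₁ ≠ θ₂, covariance ρ² + σ² when θ₁ = θ₂, with prior odds M to 1 against duplication). A minimal sketch with illustrative variances, not taken from the paper:

```python
import math

def p_same(x1, x2, M, rho2, sig2, tau2):
    # closed form of p_s = P(theta1 = theta2 | X1, X2) from (8), with
    # alpha' = 1/(sig2+tau2), beta = 1/(2rho2+sig2+tau2), beta' = 1/(2rho2+2sig2+tau2)
    ap = 1.0 / (sig2 + tau2)
    b = 1.0 / (2 * rho2 + sig2 + tau2)
    bp = 1.0 / (2 * rho2 + 2 * sig2 + tau2)
    ratio = M * math.sqrt(tau2 * ap * b / bp) * math.exp(
        sig2 * ap * (x1 - x2) ** 2 / (4 * tau2)
        - sig2 * b * bp * (x1 + x2) ** 2 / 4)
    return 1.0 / (1.0 + ratio)

def dens2(x1, x2, v, c):
    # centered bivariate normal density, common variance v, covariance c
    det = v * v - c * c
    q = (v * x1 * x1 - 2 * c * x1 * x2 + v * x2 * x2) / det
    return math.exp(-q / 2) / (2 * math.pi * math.sqrt(det))

def p_same_direct(x1, x2, M, rho2, sig2, tau2):
    # duplication has prior probability 1/(M+1); (X1, X2) is bivariate normal
    # with covariance rho2 if theta1 != theta2 and rho2+sig2 if theta1 = theta2
    v = rho2 + sig2 + tau2
    fd = dens2(x1, x2, v, rho2)
    fs = dens2(x1, x2, v, rho2 + sig2)
    return fs / (fs + M * fd)
```

The two routes agree to machine precision, and p_s decreases as |X₁ − X₂| grows, as it should.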
6. A regression problem. The problems considered in this section are similar to those in Section 5, in that the goal is a Bayes estimate of an unknown distribution function G on [0, 1], with loss function L(G, Ĝ) = ∫₀¹ (G(t) − Ĝ(t))² dW(t), where W(t) is some finite measure. Again we assume G to be chosen by a Dirichlet process, but this time the sampling technique is more like that used in regression problems. If G(t) represents the cdf of G, then various values of t, say 0 ≤ t₁ < ⋯ < t_k ≤ 1, are chosen and the unknown value of G(t_i) becomes a parameter in a distribution F(x | G(t_i)). Samples from F(x | G(t_i)) are used to make inferences about the value of G(t_i).

If G is a sample function from a Dirichlet process with parameter α, let Y₁ = G(t₁), Y₂ = G(t₂) − G(t₁), …, Y_{k+1} = 1 − G(t_k), and β₁ = α(t₁), β₂ = α(t₂) − α(t₁), …, β_{k+1} = α(1) − α(t_k). Then the joint distribution of Y₁, …, Y_{k+1} is Dirichlet with parameters β₁, …, β_{k+1}. Hence the observations for different values of i will not be stochastically independent in general. We illustrate the effect of this dependence with an example where F(x | G(t)) has a simple density function, and where k = 2.

EXAMPLE 5. Let P be a Dirichlet process on ([0, 1], B), the Borel sets, with strictly monotone parameter α, and let F(x | G(t)) have density

    f(x | G(t)) = 2[xG(t) + (1 − x)(1 − G(t))],  0 ≤ x ≤ 1,

and zero elsewhere. For given t₁ and t₂ define Y₁ and Y₂ as above and let X₁ and X₂ given Y₁ and Y₂ be independent samples from F(x | G(t₁)) and F(x | G(t₂)) respectively. The joint density becomes

    f(x₁, x₂, y₁, y₂) = 4[x₁y₁ + (1 − x₁)(1 − y₁)][x₂(y₁ + y₂) + (1 − x₂)(1 − y₁ − y₂)] · [Γ(M)/(Γ(β₁)Γ(β₂)Γ(β₃))] y₁^{β₁−1} y₂^{β₂−1} (1 − y₁ − y₂)^{β₃−1}.

A straightforward, but tedious, algebraic manipulation, using Bayes formula, yields

    (Y₁, Y₂ | X₁ = x₁, X₂ = x₂) ∈ c^{−1}(x₁, x₂){x₁x₂β₁(β₁ + 1) D(β₁ + 2, β₂, β₃) + (1 − x₁)x₂β₂(β₂ + 1) D(β₁, β₂ + 2, β₃) + (1 − x₁)(1 − x₂)β₃(β₃ + 1) D(β₁, β₂, β₃ + 2) + x₂β₁β₂ D(β₁ + 1, β₂ + 1, β₃) + (x₁ + x₂ − 2x₁x₂)β₁β₃ D(β₁ + 1, β₂, β₃ + 1) + (1 − x₁)β₂β₃ D(β₁, β₂ + 1, β₃ + 1)},
where c(x₁, x₂) = x₁x₂β₁(β₁ + 1) + (1 − x₁)x₂β₂(β₂ + 1) + (1 − x₁)(1 − x₂)β₃(β₃ + 1) + x₂β₁β₂ + (x₁ + x₂ − 2x₁x₂)β₁β₃ + (1 − x₁)β₂β₃.

The algebraic manipulation referred to above makes frequent use of property (ii) of the Dirichlet distribution given in Section 2. The technique is not limited to linear densities, but the example shows that even with linear densities and small k, the posterior distribution is somewhat unwieldy for hand computations. Nevertheless, it is possible to transform any density which can be expressed as a finite polynomial in G(t) into a polynomial in the Y_i's and absorb it into the Dirichlet density function. This leaves open the possibility of approximating density functions of interest by polynomials, and performing the required computations on a computer.

7. A bio-assay problem. The next problem we treat is a type of bio-assay problem. We wish to estimate the dose response curve of some animals to a certain drug. We assume, for simplicity, that an animal's response to a given dose of the drug is either positive, or negative (no response), and that each animal has a threshold that the dose given him must exceed to produce a positive response. However, this threshold varies from one animal to the next, so we treat it as a random variable with unknown distribution G. Kraft and van Eeden [12] have treated this problem using a process of the Dubins and Freedman type [5] to choose the distribution G. Our approach will be to let G be a sample function from a Dirichlet process. Essentially the same model has been considered by Ramsey [15].

Let (Θ, A) = ([0, 1], B) be the unit interval with Borel sets and assume, then, that G is chosen by a Dirichlet process with parameter α, and we select some dosage level t and administer this dosage to n animals. G(t) is the expected proportion of the animals whose response threshold is less than or equal to t; that is, we "expect" nG(t) animals to give a positive response.
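Returning to Example 5 of the preceding section, the six-component posterior can be checked mechanically: the closed-form mixture mean E(Y₁ | x₁, x₂) should agree with brute-force integration of the unnormalized joint density over the simplex. A sketch (the particular values of x₁, x₂ and β₁, β₂, β₃ below are illustrative, not from the paper):

```python
import numpy as np

def mixture_weights(x1, x2, b1, b2, b3):
    # normalized weights of the six Dirichlet components of (Y1, Y2 | x1, x2);
    # each key records the amounts added to (b1, b2, b3)
    w = {
        (2, 0, 0): x1 * x2 * b1 * (b1 + 1),
        (0, 2, 0): (1 - x1) * x2 * b2 * (b2 + 1),
        (0, 0, 2): (1 - x1) * (1 - x2) * b3 * (b3 + 1),
        (1, 1, 0): x2 * b1 * b2,
        (1, 0, 1): (x1 + x2 - 2 * x1 * x2) * b1 * b3,
        (0, 1, 1): (1 - x1) * b2 * b3,
    }
    c = sum(w.values())
    return {k: v / c for k, v in w.items()}

def mean_y1(x1, x2, b1, b2, b3):
    # E(Y1 | x1, x2): a component D(b + a) has first-coordinate mean (b1 + a1)/(M + 2)
    M = b1 + b2 + b3
    w = mixture_weights(x1, x2, b1, b2, b3)
    return sum(wt * (b1 + a[0]) / (M + 2) for a, wt in w.items())

def mean_y1_grid(x1, x2, b1, b2, b3, n=800):
    # brute-force check: midpoint rule for the unnormalized posterior density
    h = 1.0 / n
    t = (np.arange(n) + 0.5) * h
    y1, y2 = np.meshgrid(t, t, indexing="ij")
    ok = y1 + y2 < 1
    y1, y2 = y1[ok], y2[ok]
    y3 = 1 - y1 - y2
    f = ((x1 * y1 + (1 - x1) * (1 - y1))
         * (x2 * (y1 + y2) + (1 - x2) * y3)
         * y1 ** (b1 - 1) * y2 ** (b2 - 1) * y3 ** (b3 - 1))
    return float((y1 * f).sum() / f.sum())
```

The weights sum to one by construction, and the two computations of E(Y₁ | x₁, x₂) agree to grid accuracy.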
Of course, G(t) is unknown, since G was chosen at random, but since G was chosen by a Dirichlet process, the prior distribution of G(t) is Be(α(t), M − α(t)). And, as is well known, the posterior distribution of G(t) given k positive responses in n trials is Be(α(t) + k, M + n − k − α(t)), where Be(a₁, a₂) is the Beta distribution.

This analysis for one dosage level was deceptively simple, since things become much more complicated with the use of more than one level, as may be seen in the case when two levels, t₁ and t₂, are used, with n₁ and n₂ trials respectively. Assume that t₁ < t₂ and [α(t₂) − α(t₁)] > 0. Let K_i be the number of successes among the n_i trials at t_i. Then the joint density of K₁, K₂, G(t₁), G(t₂) is most easily expressed in terms of Y_i's, where Y₁ = G(t₁), Y₂ = G(t₂) − G(t₁), and Y₃ = 1 − G(t₂); and β_i's, where β₁ = α(t₁), β₂ = α(t₂) − α(t₁), and β₃ = M − α(t₂). The joint density of (K₁, K₂, Y₁, Y₂) then becomes

    (n₁ choose k₁)(n₂ choose k₂) y₁^{k₁}(1 − y₁)^{n₁−k₁}(y₁ + y₂)^{k₂} y₃^{n₂−k₂} · [Γ(M)/(Γ(β₁)Γ(β₂)Γ(β₃))] y₁^{β₁−1} y₂^{β₂−1} y₃^{β₃−1}

on S = {y₁, y₂ : y₁ ≥ 0, y₂ ≥ 0, y₁ + y₂ ≤ 1}, where y₃ = 1 − y₁ − y₂. We can transform this into an expression which is recognizable as a mixture of Dirichlet distributions by making the substitution (1 − y₁) = y₂ + y₃, and expanding (y₂ + y₃)^{n₁−k₁} and (y₁ + y₂)^{k₂} using the binomial formula. This leads finally to an expression for the conditional distribution of Y₁, Y₂ given K₁ = k₁, K₂ = k₂ as

    Σ_{i=0}^{k₂} Σ_{j=0}^{n₁−k₁} a_{ij} D(β₁ + k₁ + i, β₂ + n₁ − k₁ + k₂ − i − j, β₃ + n₂ − k₂ + j),

where a_{ij} = b_{ij}/Σ_k Σ_l b_{kl} and

    b_{ij} = (n₁ − k₁ choose j)(k₂ choose i) Γ(β₁ + k₁ + i) Γ(β₂ + n₁ − k₁ + k₂ − i − j) Γ(β₃ + n₂ − k₂ + j).

Before proceeding to our goal of a Bayes estimate of G we examine the expressions above in more detail to determine the source of the increased complexity. It is helpful to consider the problem from a slightly different viewpoint. Saying that K₁ of the n₁ trials at t₁ were successes is statistically equivalent to saying that a sample X₁, …, X_{n₁} of size n₁ from G was taken, but all that was recorded was that K₁ of the X_i's were less than or equal to t₁.
In particular, for the n₁ − K₁ values of the X_i which were greater than t₁, it is not known how many fell in the interval (t₁, t₂] and how many in (t₂, 1]. Hence we let J denote the number of the X_i's falling in the interval (t₂, 1]. Since the true value of J is unknown, we must enumerate all possible values of J from 0 to n₁ − K₁ and weight appropriately. Similarly, if Y₁, Y₂, …, Y_{n₂} is a sample of size n₂ from G, and K₂ values of Y_i are less than or equal to t₂, we let I denote the number of the Y_i's that fell in the interval [0, t₁]. Now if it were known that I = i and J = j, the joint distribution of G(t₁), G(t₂) − G(t₁), and 1 − G(t₂), given K₁ = k₁, J = j, K₂ = k₂, I = i, would be Dirichlet with parameters (β₁ + k₁ + i, β₂ + n₁ − k₁ + k₂ − i − j, β₃ + j + n₂ − k₂). Since I and J are unknown, the proper expression is a mixture over all possible values of I and J as given by the double summation above.

By making a similar analysis for the case where there are three thresholds, one can show the corresponding summation runs over 6 indices, and in general, for m thresholds there would be m(m − 1) indices to be summed over. Generally such tasks are best left to the computer.

To resume our treatment of this problem we compute our estimate Ĝ of G. Recall that for squared error loss, the Bayes estimate is the mean of the posterior distribution of G. Moreover, since Ĝ is linear in the intervals [0, t₁), [t₁, t₂), [t₂, 1], we need only calculate Ĝ(t₁) and Ĝ(t₂):

    Ĝ(t₁) = E(G(t₁)) = Σ_{i=0}^{k₂} Σ_{j=0}^{n₁−k₁} a_{ij}(β₁ + k₁ + i)/(M + n₁ + n₂),
    Ĝ(t₂) = E(G(t₂)) = Σ_{i=0}^{k₂} Σ_{j=0}^{n₁−k₁} a_{ij}(β₁ + β₂ + n₁ + k₂ − j)/(M + n₁ + n₂),

and

    Ĝ(t) = Ĝ(t₁)·t/t₁,  0 ≤ t ≤ t₁,
         = Ĝ(t₁) + [(t − t₁)/(t₂ − t₁)]{Ĝ(t₂) − Ĝ(t₁)},  t₁ < t ≤ t₂,
         = Ĝ(t₂) + [(t − t₂)/(1 − t₂)]{1 − Ĝ(t₂)},  t₂ < t ≤ 1.

We recognize Ĝ(t₁) as a weighted average of the various estimates of G(t₁) we would make if we knew I and J. It can be shown that for large M this estimate approaches the intuitively reasonable value (α(t₁) + k₁ + (α(t₁)/α(t₂))k₂)/(M + n₁ + n₂).
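As the text notes, such summations are best left to the computer. The double sum for Ĝ(t₁) and Ĝ(t₂) takes only a few lines once the weights a_ij are computed in the log domain; the sketch below (an illustrative small design, not from the paper) also checks the answer against brute-force integration, and against the one-level Beta posterior mean (β₁ + k₁)/(M + n₁) when n₂ = 0.

```python
import math
import numpy as np

def weights(b, n1, k1, n2, k2):
    # a_ij = b_ij / sum b_kl, computed via log-gamma to avoid overflow
    b1, b2, b3 = b
    lw = np.empty((k2 + 1, n1 - k1 + 1))
    for i in range(k2 + 1):
        for j in range(n1 - k1 + 1):
            lw[i, j] = (math.lgamma(n1 - k1 + 1) - math.lgamma(j + 1) - math.lgamma(n1 - k1 - j + 1)
                        + math.lgamma(k2 + 1) - math.lgamma(i + 1) - math.lgamma(k2 - i + 1)
                        + math.lgamma(b1 + k1 + i)
                        + math.lgamma(b2 + n1 - k1 + k2 - i - j)
                        + math.lgamma(b3 + n2 - k2 + j))
    lw -= lw.max()
    a = np.exp(lw)
    return a / a.sum()

def G_hat(b, n1, k1, n2, k2):
    # Bayes estimates G_hat(t1), G_hat(t2) as weighted averages over (i, j)
    b1, b2, b3 = b
    M = b1 + b2 + b3
    a = weights(b, n1, k1, n2, k2)
    i = np.arange(k2 + 1)[:, None]
    j = np.arange(n1 - k1 + 1)[None, :]
    g1 = (a * (b1 + k1 + i) / (M + n1 + n2)).sum()
    g2 = (a * (b1 + b2 + n1 + k2 - j) / (M + n1 + n2)).sum()
    return float(g1), float(g2)

def G1_grid(b, n1, k1, n2, k2, n=800):
    # brute-force E(Y1) for the same (unnormalized) posterior density
    b1, b2, b3 = b
    t = (np.arange(n) + 0.5) / n
    y1, y2 = np.meshgrid(t, t, indexing="ij")
    ok = y1 + y2 < 1
    y1, y2 = y1[ok], y2[ok]
    y3 = 1 - y1 - y2
    f = (y1 ** (b1 + k1 - 1) * (1 - y1) ** (n1 - k1)
         * (y1 + y2) ** k2 * y2 ** (b2 - 1) * y3 ** (b3 + n2 - k2 - 1))
    return float((y1 * f).sum() / f.sum())
```

With n₂ = k₂ = 0 the sum collapses to the single-level answer exactly, since the weights sum to one.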
For small values of M, however, the situation is quite different, and contrary to intuition. Recall that M is, in a sense, a measure of our confidence in the parameter α. Very small values of M can be interpreted as meaning we have practically no prior information about G, and our estimate of the distribution will be heavily weighted in favor of the empirical distribution function. A less obvious consequence of a small M is the implication that the process chooses distributions with most of their mass concentrated at a few points.

EXAMPLE 6. Let (Θ, A) = ([0, 1], B), α(t) = Mt, n₁ = n₂ = 100, k₁ = 1, k₂ = 99, t₁ = 1/3, t₂ = 2/3. In words, of 100 samples where thresholds were compared with t₁ = 1/3, one was less than or equal to 1/3 and 99 were greater than 1/3. Of 100 samples compared with t₂ = 2/3, 99 were less than or equal to 2/3, and 1 was greater than 2/3. Recalling that because M is small we expect the process to have chosen a distribution with most of its mass on a few points, we could readily explain the sample result by deciding there is a very large jump in G somewhere on (1/3, 2/3], and significantly smaller jumps on [0, 1/3] and (2/3, 1], a total of three jumps. When we evaluate the weights a_{ij}, however, we find that the product Γ(M/3 + 1 + i)Γ(M/3 + 198 − i − j)Γ(M/3 + 1 + j) is largest, as M → 0, for i = 99, j = 99, since then the center factor becomes Γ(M/3) and Γ(M/3) → ∞ as M → 0. But when we check the interpretation of these values for I and J, we discover they correspond to estimating Ĝ(1/3) = Ĝ(2/3) = 1/2. As M → 0, the dominant posterior distribution is the one that gives a jump to [0, 1/3), and a similar jump to (2/3, 1], and practically no weight to [1/3, 2/3)! Roughly speaking, the posterior Dirichlet process gives most of its weight to those distributions which can "explain" the sample with the fewest number of jumps, in this case two jumps instead of the more reasonable sounding three.
We conclude our treatment of the bio-assay problem with a brief discussion of the design problem: Given the model of the bio-assay problem described above, and a budget, say, of n observations total, what are the "best" thresholds t_i to test at, and what are the optimal numbers n_i to test at each threshold? While this problem for general α and W is intractable, we nevertheless feel that for the special case of α linear on [0, 1] and W uniform on [0, 1], the following conjecture is true.

Conjecture. Let Θ be [0, 1] and A the Borel sets, and let the bio-assay response curve G(θ) be a sample function from a Dirichlet process on (Θ, A) with parameter α(t) = Mt. If we have a total n = Σ n_i of observations, n_i to be compared with threshold t_i, and we wish to estimate G by Ĝ to minimize E ∫ (G − Ĝ)² dW, W uniform on [0, 1], then:

(i) If k is fixed beforehand, the Bayes design is to set t_i = i/(k + 1) and make the n_i as nearly equal as possible, i.e., |n_i − n_j| ≤ 1 for all i ≠ j.

(ii) If k is not fixed, the Bayes design is to let k = n and take one observation each at the thresholds t_i = i/(n + 1).

Ramsey [15] reaches this same conclusion independently for a similar model. A more intriguing question is what is the optimal sequential design strategy for fixed sample size, and for variable sample size, with the goal of achieving confidence bands for G(t). The author has not had much success in seeking answers to these questions.

8. A discrimination problem. Statistical discrimination problems may be described as follows. We are given samples X_{i1}, …, X_{ik_i} from unknown distributions p_i, i = 1, …, k, and a sample Y₁, …, Y_n known only to come from one of the p_i's. The problem is to decide which one.

We might model such a problem by assuming the p_i's are independent samples from a Dirichlet process with parameter α, α(Θ) = M. A little preliminary analysis then reveals that the solution is very sensitive to the assumptions we make about α. If α is assumed to be continuous, then the posterior distributions
of the p_i will have parameters α_i with atoms of various sizes at each distinct value of the X_{ij}. Moreover, with probability one, the samples from each p_i will be disjoint from the samples from all the other p's, but with probability 1 − M^{(n)}/(M + k_i)^{(n)}, at least one sample point Y_j will match some sample point X_{ij} when the Y's come from distribution p_i, in which case the choice of i is clear. If, however, none of the Y_j matches any of the values of the X_{ij}, then the choice of the p_i to ascribe the Y's to becomes biased in favor of the p_i from which we have the smallest sample of X's, i.e., smallest k_i. If, in this case, the k_i are all equal, the sample of Y's has not given us any information, and a random decision is made.

For these reasons we will assume that α is purely discrete, and for more generality we assume P₁ ∈ D(α₁), …, P_k ∈ D(α_k), where α₁ through α_k are all discrete with the same support. In the notation we have been using, let Θ be the nonnegative integers and A the σ-algebra generated by the singleton sets. Let α_i be the finite nonnegative σ-additive measure defined on A through its values on the singleton sets, α_i = (α_{i0}, α_{i1}, …, α_{ij}, …), where α_i({j}) = α_{ij} and |α_i| = Σ_{j=0}^{∞} α_{ij}. Let P_i ∈ D(α_i) and X_{i1}, …, X_{ik_i} be a sample from P_i. Let Y₁, …, Y_n be a sample from some P_j, where the prior probability that Y₁, …, Y_n come from P_j is π_j, j = 1, …, k. Let L(i, j) denote the loss associated with deciding the Y's come from P_i when they come from P_j. We seek a nonrandomized decision rule which minimizes our expected loss given the observations.

First we notice that the distribution of P_i given X_{i1}, …, X_{ik_i} is just D(α_i′), where α_i′ = (α_{i0} + m_{i0}, α_{i1} + m_{i1}, …, α_{ij} + m_{ij}, …) and m_{ij} is the number of X_i's equal to j, j = 0, 1, …. Hence the problem reduces to deciding which of k different Dirichlet processes with known parameters the samples Y₁, …, Y_n come from. The discrimination problem has been reduced to a classification problem.
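The classification rule that results from this reduction can be sketched directly: the predictive probability of the Y-sample under D(α_i′) is ∏_j (α_ij′)^{(k_j)}/|α_i′|^{(n)} in ascending factorials, and it may equivalently be accumulated draw by draw from the Polya-urn predictive of Blackwell and MacQueen [3]. The measures and samples below are illustrative only:

```python
from collections import Counter

def rising(x, n):
    # ascending factorial x^(n) = x(x+1)...(x+n-1)
    r = 1.0
    for k in range(n):
        r *= x + k
    return r

def sample_prob(alpha, y):
    # P(Y | alpha') = prod_j (alpha'_j)^(k_j) / |alpha'|^(n),
    # where k_j = number of Y's equal to j
    total = sum(alpha.values())
    p = 1.0
    for j, kj in Counter(y).items():
        p *= rising(alpha.get(j, 0.0), kj)
    return p / rising(total, len(y))

def sample_prob_seq(alpha, y):
    # the same probability accumulated draw by draw (Polya urn predictive)
    total = sum(alpha.values())
    seen = Counter()
    p = 1.0
    for t, yt in enumerate(y):
        p *= (alpha.get(yt, 0.0) + seen[yt]) / (total + t)
        seen[yt] += 1
    return p

def classify(posteriors, priors, y):
    # 0-1 loss, equal or given priors: choose s maximizing pi_s * P(Y | s)
    scores = [pi * sample_prob(a, y) for a, pi in zip(posteriors, priors)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Exchangeability makes the sequential product order-invariant, which is exactly why only the counts k_j enter the closed form.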
To treat this we calculate the Bayes risk r_i for each i, where

    r_i = Σ_{j=1}^{k} L(i, j) w(j | Y₁, …, Y_n),
    w(j | Y₁, …, Y_n) = π_j P(Y | j)/P(Y),
    P(Y) = Σ_{j=1}^{k} π_j P(Y | j),
    P(Y | i) = ∏_j (α_{ij}′)^{(k_j)} / |α_i′|^{(n)},

where k_j is the number of Y's equal to j. The Bayes decision rule chooses s where r_s = min_i r_i. If, for example, L(i, j) = 0 for i = j and 1 for i ≠ j, and π_i = 1/k for all i, then the Bayes decision rule chooses that s for which P(Y | s) = max_i P(Y | i).

9. Acknowledgments. The author is immeasurably indebted to Professor Thomas Ferguson for introducing him to the concept of Dirichlet processes and providing countless helpful suggestions during the course of the research, and to David Blackwell and Jim MacQueen for helpful discussions. A referee and an associate editor for the Annals of Statistics made many helpful suggestions for revisions of the manuscript. This paper is an extension of part of the author's dissertation submitted in partial fulfillment of the requirements for the Ph. D. degree at U.C.L.A., written under the supervision of Professor Ferguson.

REFERENCES

[1] ABRAMOWITZ, M. and STEGUN, I. A. (1964). Handbook of Mathematical Functions. National Bureau of Standards.
[2] BLACKWELL, DAVID (1973). Discreteness of Ferguson selections. Ann. Statist. 1 356-358.
[3] BLACKWELL, D. and MACQUEEN, J. (1973). Ferguson distributions via Polya urn schemes. Ann. Statist. 1 353-355.
[4] DOKSUM, K. (1974). Tailfree and neutral random probabilities and their posterior distributions. Ann. Prob. 2 183-201.
[5] DUBINS, LESTER E. and FREEDMAN, DAVID A. (1966). Random distribution functions. Proc. Fifth Berkeley Symp. Math. Statist. Prob. 2 183-214. Univ. of California Press.
[6] DUBINS, LESTER E. and FREEDMAN, DAVID A. (1963). Random distribution functions. Bull. Amer. Math. Soc. 69 548-551.
[7] FABIUS, J. (1973). Neutrality and Dirichlet distributions. Transactions of the Sixth Prague Conference on Information Theory, Statistical Decision Functions, and Random Processes. 175-181.
[8] FELLER, WILLIAM (1968).
An Introduction to Probability Theory and its Applications, 1. Wiley, New York.
[9] FERGUSON, THOMAS S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 209-230.
[10] KORWAR, R. and HOLLANDER, M. (1973). Contributions to the theory of Dirichlet processes. Ann. Prob. 1 705-711.
[11] KRAFT, CHARLES H. (1964). A class of distribution function processes which have derivatives. J. Appl. Probability 1 385-388.
[12] KRAFT, CHARLES H. and VAN EEDEN, CONSTANCE (1964). Bayesian bio-assay. Ann. Math. Statist. 35 886-890.
[13] PARTHASARATHY, K. R. (1967). Probability Measures on Metric Spaces. Academic Press, New York.
[14] RAIFFA, H. and SCHLAIFER, R. (1961). Applied Statistical Decision Theory. MIT Press, Cambridge.
[15] RAMSEY, F. L. (1972). A Bayesian approach to bio-assay. Biometrics 28 841-858.
[16] ROBBINS, HERBERT (1963). An empirical Bayes approach to testing statistical hypotheses. Rev. Internat. Statist. Inst.
[17] SCOTT, W. R. (1964). Group Theory. Prentice-Hall, Englewood Cliffs, New Jersey.

DEPARTMENT OF STATISTICS
UNIVERSITY OF CALIFORNIA
BERKELEY, CALIFORNIA 94720