RC23645 (W0506-120) June 29, 2005
Mathematics

IBM Research Report

EXTENDED BAUM TRANSFORMATIONS FOR GENERAL FUNCTIONS, II

Dimitri Kanevsky*
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
(kanevsky@us.ibm.com)

* The work is partly supported by the DARPA Babylon Speech-to-Speech Translation Program.

ABSTRACT

The discrimination technique for estimating the parameters of Gaussian mixtures that is based on the Extended Baum transformations (EB) has had a significant impact on the speech recognition community. A proof that these transformations increase the value of an objective function from one iteration to the next (i.e., that they are so-called "growth transformations") was presented by the author two years ago for diagonal Gaussian mixture densities. In this paper that proof is extended to multidimensional multivariate Gaussian mixtures. The proof presented here is based on a linearization process and on an explicit growth estimate for linear forms of Gaussian mixtures.

1. INTRODUCTION

The EB procedure involves the following types of transformations. Let $F(z) = F(z_{ij})$ be some function in the variables $z = (z_{ij})$ and let $c_{ij} = z_{ij}\,\partial F(z)/\partial z_{ij}$.

I. Discrete probabilities:

$$\hat{z}_{ij} = \frac{c_{ij} + C z_{ij}}{\sum_j c_{ij} + C} \qquad (1)$$

where $z \in D = \{ z_{ij} \ge 0,\ \sum_{j=1}^{m_i} z_{ij} = 1 \}$.

II. Gaussian mixture densities:

$$\hat{\mu}_j = \hat{\mu}_j(C) = \frac{\sum_{i\in I} c_{ij} y_i + C\mu_j}{\sum_{i\in I} c_{ij} + C} \qquad (2)$$

$$\hat{\sigma}_j^2 = \hat{\sigma}_j^2(C) = \frac{\sum_{i\in I} c_{ij} y_i^2 + C(\mu_j^2 + \sigma_j^2)}{\sum_{i\in I} c_{ij} + C} - \hat{\mu}_j^2 \qquad (3)$$

where

$$z_{ij} = \frac{1}{(2\pi)^{1/2}\sigma_j}\, e^{-(y_i-\mu_j)^2/2\sigma_j^2} \qquad (4)$$

and $y_i$ is a sample of training data.

III. Multidimensional multivariate Gaussian mixture densities:

$$\hat{\mu}_j = \hat{\mu}_j(C) = \frac{\sum_{i\in I} c_{ij} y_i + C\mu_j}{\sum_{i\in I} c_{ij} + C} \qquad (5)$$

$$\hat{\Sigma}_j = \hat{\Sigma}_j(C) = \frac{\sum_{i\in I} c_{ij} y_i y_i^T + C(\mu_j\mu_j^T + \Sigma_j)}{\sum_{i\in I} c_{ij} + C} - \hat{\mu}_j\hat{\mu}_j^T \qquad (6)$$

where

$$z_{ij} = \frac{|\Sigma_j|^{-1/2}}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}(y_i-\mu_j)^T \Sigma_j^{-1} (y_i-\mu_j)} \qquad (7)$$

and $y_i^T = (y_{i1},\dots,y_{in})$ is a sample of training data.

It was shown in [4] that the transformations (1) are growth transformations for sufficiently large $C$ when $F$ is a rational function. The update formulae (5), (6) for rational functions $F$ were obtained through a discrete probability approximation of Gaussian densities [7] and have been widely used as an alternative to direct gradient-based optimization approaches ([9], [8]). Using the linearization technique that was originally presented in our IBM Research Report [5] and in [6] for diagonal Gaussian mixtures, we demonstrate in this paper that (5), (6) are growth transformations for sufficiently large $C$ if the function $F$ obeys certain smoothness constraints.
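As an illustration, the update rules (1), (5), (6) can be transcribed directly into code. The sketch below is ours, not part of the original EB literature: the function names are hypothetical, NumPy is assumed, and the weights c (the accumulated quantities c_ij = z_ij dF/dz_ij for one mixture component j) are assumed to be supplied by the caller. The diagonal updates (2), (3) are the one-dimensional specialization of the same function.

import numpy as np

def eb_update_discrete(c, z, C):
    """EB update (1) for one row of discrete probabilities (sum_j z_j = 1).
    c[j] = z_j * dF/dz_j; returns the transformed probabilities."""
    return (c + C * z) / (c.sum() + C)

def eb_update_gaussian(c, y, mu, Sigma, C):
    """EB updates (5), (6) for one multivariate Gaussian component.
    y has shape (m, n), one training vector per row; c has shape (m,)."""
    denom = c.sum() + C
    mu_new = (c @ y + C * mu) / denom                        # update (5)
    second = (y.T * c) @ y + C * (np.outer(mu, mu) + Sigma)  # sum_i c_i y_i y_i^T + C(mu mu^T + Sigma)
    Sigma_new = second / denom - np.outer(mu_new, mu_new)    # update (6)
    return mu_new, Sigma_new

For large C the transformed parameters stay close to the current ones, which is what drives the asymptotic arguments in the sections below.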
Axelrod [1] has recently proposed another proof of the existence of a constant $C$ that ensures the validity of the MMIE auxiliary function as formulated by Gunawardana et al. [3]. We also replicate in this paper, from [6], the proofs that the transformations for diagonal Gaussian mixtures (2), (3) and for discrete probabilities (1) are growth transformations.

2. LINEARIZATION

This principle is needed to reduce proofs of growth transformations for general functions to linear forms.

Lemma 1. Let

$$F(z) = \tilde{F}(\{u_j\}) = \tilde{F}(\{g_j(z)\}) = \tilde{F} \circ g(z) \qquad (8)$$

where $u_j = g_j(z)$, $j = 1,\dots,m$, and $z$ varies in some real vector space $R^n$ of dimension $n$. Let $g_j(z)$ for all $j = 1,\dots,m$ and $F(z)$ be differentiable at $z$. Assume also that $\partial\tilde{F}(\{u_j\})/\partial u_j$ exists at $u_j = g_j(z)$ for all $j = 1,\dots,m$. Let, further, $L(z') \equiv \nabla\tilde{F}\big|_{g(z)} \cdot g(z')$, $z' \in R^n$. Let $T_C$ be a family of transformations $R^n \to R^n$ such that for some $l = (l_1,\dots,l_n) \in R^n$ we have $T_C(z) - z = l/C + o(1/C)$ as $C \to \infty$. (Here $o(\epsilon)$ means that $o(\epsilon)/\epsilon \to 0$ as $\epsilon \to 0$.) Let, further,

$$T_C(z) = z \quad \text{if} \quad \nabla L\big|_z \cdot l = 0. \qquad (9)$$

Then for sufficiently large $C$, $T_C$ is a growth transformation for $F$ at $z$ if and only if it is a growth transformation for $L$ at $z$.

Proof. First, from the definition of $L$ we have

$$\frac{\partial L(z)}{\partial z_k} = \sum_j \frac{\partial\tilde{F}(\{u_j\})}{\partial u_j} \frac{\partial g_j(z)}{\partial z_k} = \frac{\partial F(z)}{\partial z_k}.$$

Next, for $z' = T_C(z)$ and sufficiently large $C$ we have

$$F(z') - F(z) = \sum_i \frac{\partial F(z)}{\partial z_i}(z'_i - z_i) + o(1/C) = \sum_i \frac{\partial F(z)}{\partial z_i}\,\frac{l_i}{C} + o(1/C) = \sum_i \frac{\partial L(z)}{\partial z_i}\,\frac{l_i}{C} + o(1/C) = \sum_i \frac{\partial L(z)}{\partial z_i}(z'_i - z_i) + o(1/C) = L(z') - L(z) + o(1/C).$$

Therefore, for sufficiently large $C$, $F(z') - F(z) > 0$ if and only if $L(z') - L(z) > 0$.

3. EB FOR DISCRETE PROBABILITIES

The following theorem is a generalization of [4].

Theorem 1. Let $F(z)$ be a function defined over $D = \{ z_{ij} \ge 0,\ \sum_{j=1}^{m_i} z_{ij} = 1 \}$. Let $F$ be differentiable at $z \in D$ and let $\hat{z} \ne z$ be defined as in (1). Then $F(\hat{z}) > F(z)$ for sufficiently large positive $C$ and $F(\hat{z}) < F(z)$ for sufficiently small negative $C$.

Proof. Following the linearization principle, we first assume that $F(z) = l(z) = \sum a_{ij} z_{ij}$ is a linear form. Then the transformation formula for $l(z)$ is the following:

$$\hat{z}_{ij} = \frac{a_{ij} z_{ij} + C z_{ij}}{l(z) + C} \qquad (10)$$

We need to show that $l(\hat{z}) \ge l(z)$. It is sufficient to prove this inequality for each linear subcomponent associated with $i$:

$$\sum_{j=1}^{n} a_{ij}\hat{z}_{ij} \ \ge\ \sum_{j=1}^{n} a_{ij} z_{ij}.$$

Therefore, without loss of generality we can assume that $i$ is fixed and drop the subscript $i$ in what follows (i.e., we assume that $l(z) = \sum a_j z_j$, where $z = \{z_j\}$, $z_j \ge 0$ and $\sum z_j = 1$). We have $l(\hat{z}) = \frac{l_2(z) + C\,l(z)}{l(z) + C}$, where $l_2(z) := \sum_j a_j^2 z_j$. The linear case of Theorem 1 will follow from the next two lemmas.

Lemma 2.

$$l_2(z) \ge l(z)^2 \qquad (11)$$

Proof. Let us assume that $a_j \ge a_{j+1}$ and substitute $z_0 = \sum_{j=1}^{n-1} z_j$ (so that $z_n = 1 - z_0$). We need to prove

$$\sum_{j=1}^{n-1} a_j^2 z_j + a_n^2(1 - z_0) \ \ge\ \sum_{j=1}^{n-1} (a_j - a_n)^2 z_j^2 + 2\sum_{j=1}^{n-1} (a_j - a_n) a_n z_j + a_n^2. \qquad (12)$$

We prove this formula by proving, for every fixed $j$,

$$(a_j^2 - a_n^2) z_j \ \ge\ (a_j - a_n)^2 z_j^2 + 2 (a_j - a_n) a_n z_j.$$

If $(a_j - a_n) z_j \ne 0$, the above inequality is equivalent to $a_j - a_n \ge (a_j - a_n) z_j$, which obviously holds since $0 \le z_j \le 1$. (Inequality (11) can also be seen directly: $l_2(z) - l(z)^2$ is the variance of the values $\{a_j\}$ under the probability distribution $\{z_j\}$, hence nonnegative.)

Lemma 3. For sufficiently large $|C|$ the following holds: $l(\hat{z}) > l(z)$ if $C$ is positive and $l(\hat{z}) < l(z)$ if $C$ is negative.

Proof. From (11) we have $l_2(z) + C\,l(z) \ge l(z)^2 + C\,l(z)$, hence

$$l(\hat{z}) = \frac{l_2(z) + C\,l(z)}{l(z) + C} \ \ge\ \frac{l(z)^2 + C\,l(z)}{l(z) + C} = l(z) \quad \text{if } l(z) + C > 0,$$

and

$$l(\hat{z}) = \frac{l_2(z) + C\,l(z)}{l(z) + C} \ \le\ l(z) \quad \text{if } l(z) + C < 0.$$

The general case of Theorem 1 follows immediately from the observation that, for large $C$, condition (9) is equivalent to $l_2(z) - l(z)^2 = 0$.
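The linear case of Theorem 1 is easy to check numerically. The following sketch is ours (NumPy assumed): it draws a random point of the simplex, applies the update (10), and verifies both growth directions. It is a sanity check, not part of the proof.

import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)              # coefficients of the linear form l(z)
z = rng.dirichlet(np.ones(8))       # a point of the simplex D
c = a * z                           # c_j = z_j * dl/dz_j

def l(v):                           # the objective l(z) = sum_j a_j z_j
    return a @ v

for C in (1e2, 1e4):
    z_pos = (c + C * z) / (c.sum() + C)   # update (10), large positive C
    assert l(z_pos) >= l(z)               # growth
    z_neg = (c - C * z) / (c.sum() - C)   # sufficiently negative C
    assert l(z_neg) <= l(z)               # the objective decreases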
4. EB FOR GAUSSIAN DENSITIES

For simplicity of notation we consider the transformations (2), (3) for a single pair of variables $\mu, \sigma$ only; i.e., we drop the subscript $j$ everywhere in (2), (3), (4) and set

$$\hat{z}_i = \frac{1}{(2\pi)^{1/2}\hat{\sigma}}\, e^{-(y_i - \hat{\mu})^2/2\hat{\sigma}^2}.$$

Theorem 2. Let $F(\{z_i\})$, $i = 1,\dots,m$, be differentiable with respect to $\mu, \sigma$ and let $\partial F(\{z_i\})/\partial z_i$ exist at $z_i$. Let either $\hat{\mu} \ne \mu$ or $\hat{\sigma} \ne \sigma$. Then for sufficiently large $C$

$$F(\{\hat{z}_i\}) - F(\{z_i\}) = T/C + o(1/C), \qquad (13)$$

where

$$T = \frac{1}{\sigma^2}\left\{ \frac{\{\sum_j c_j [(y_j - \mu)^2 - \sigma^2]\}^2}{2\sigma^2} + \Big[\sum_j c_j (y_j - \mu)\Big]^2 \right\} > 0. \qquad (14)$$

In other words, $F(\{\hat{z}_i\})$ grows proportionally to $1/C$ for sufficiently large $C$.

Proof. First, we assume that $F(\{z_i\}) = l(\mu, \sigma) := l(\{z_i\}) := \sum_{i=1}^m a_i z_i$. Let us set $l(\hat{\mu}, \hat{\sigma}) := l(\{\hat{z}_i\}) := \sum_{i=1}^m a_i \hat{z}_i$. Then $c_j = a_j z_j$ in (2), (3). We want to prove that for sufficiently large $C$, $l(\hat{\mu}, \hat{\sigma}) \ge l(\mu, \sigma)$. It suffices to prove this inequality to precision $1/C^2$. We have

$$\hat{\mu} = \hat{\mu}(C) = \frac{\sum_{j=1}^m c_j y_j + C\mu}{\sum_{j=1}^m c_j + C} = \frac{\frac{1}{C}\sum_j c_j y_j + \mu}{\frac{1}{C}\sum_j c_j + 1} \sim \Big(\frac{1}{C}\sum_j c_j y_j + \mu\Big)\Big(1 - \frac{\sum_j c_j}{C}\Big) \sim \mu + \frac{1}{C}\Big(\sum_j c_j y_j - \mu \sum_j c_j\Big) \qquad (15)$$

$$\hat{\mu} \sim \mu + \frac{\sum_j c_j (y_j - \mu)}{C} \qquad (16)$$

Next, we have

$$\hat{\sigma}^2 = \hat{\sigma}^2(C) = \frac{\sum_j c_j y_j^2 + C(\mu^2 + \sigma^2)}{\sum_j c_j + C} - \hat{\mu}^2. \qquad (17)$$

Let us compute $\hat{\sigma}^2$ using (17):

$$\frac{\sum_j c_j y_j^2 + C(\mu^2 + \sigma^2)}{\sum_j c_j + C} \sim \Big(\frac{\sum_j c_j y_j^2}{C} + \mu^2 + \sigma^2\Big)\Big(1 - \frac{\sum_j c_j}{C}\Big) \sim \mu^2 + \sigma^2 + \frac{1}{C}\Big[\sum_j c_j y_j^2 - (\mu^2 + \sigma^2)\sum_j c_j\Big] \qquad (18)$$

$$\hat{\mu}^2 \sim \mu^2 + \frac{2\mu}{C}\sum_{j=1}^m c_j (y_j - \mu) \qquad (19)$$

This gives

$$\hat{\sigma}^2 \sim \mu^2 + \sigma^2 + \frac{1}{C}\Big[\sum_j c_j y_j^2 - (\mu^2 + \sigma^2)\sum_j c_j\Big] - \Big[\mu^2 + \frac{2\mu}{C}\sum_j c_j (y_j - \mu)\Big] = \sigma^2 + \frac{1}{C}\sum_j c_j \big[(y_j - \mu)^2 - \sigma^2\big] \qquad (20)$$

and finally

$$\hat{\sigma} \sim \sigma + \frac{\sum_j c_j [(y_j - \mu)^2 - \sigma^2]}{2\sigma C}. \qquad (21)$$

Continuing, we have

$$\frac{(y_i - \hat{\mu})^2}{\hat{\sigma}^2} \sim \frac{1}{\sigma^2}\Big[(y_i - \mu)^2 - \frac{2(y_i - \mu)\sum_j c_j (y_j - \mu)}{C}\Big]\Big\{1 - \frac{\sum_j c_j [(y_j - \mu)^2 - \sigma^2]}{\sigma^2 C}\Big\} \sim \frac{(y_i - \mu)^2}{\sigma^2} - \frac{1}{C\sigma^2}\Big\{\frac{(y_i - \mu)^2}{\sigma^2}\sum_j [(y_j - \mu)^2 - \sigma^2] c_j + 2(y_i - \mu)\sum_j (y_j - \mu) c_j\Big\}. \qquad (22)$$

Setting

$$A_i = \frac{(y_i - \mu)^2}{2\sigma^2}\sum_j [(y_j - \mu)^2 - \sigma^2] c_j + (y_i - \mu)\sum_j (y_j - \mu) c_j, \qquad (23)$$

we obtain

$$\hat{z}_i \sim K e^{-(y_i - \mu)^2/2\sigma^2}\Big(1 + \frac{A_i}{C\sigma^2}\Big), \qquad (24)$$

where $K = \frac{1}{(2\pi)^{1/2}\hat{\sigma}}$. From (21),

$$\frac{1}{\hat{\sigma}} \sim \frac{1}{\sigma}\Big\{1 - \frac{\sum_j c_j [(y_j - \mu)^2 - \sigma^2]}{2\sigma^2 C}\Big\}, \qquad (25)$$

hence

$$\frac{\hat{z}_i}{z_i} \sim \Big(1 + \frac{A_i}{C\sigma^2}\Big)\Big\{1 - \frac{\sum_j c_j [(y_j - \mu)^2 - \sigma^2]}{2\sigma^2 C}\Big\} \sim 1 + \frac{1}{C\sigma^2}\Big\{\frac{(y_i - \mu)^2}{2\sigma^2}\sum_j [(y_j - \mu)^2 - \sigma^2] c_j + (y_i - \mu)\sum_j (y_j - \mu) c_j - \frac{1}{2}\sum_j c_j [(y_j - \mu)^2 - \sigma^2]\Big\} \sim 1 + \frac{B_i}{C\sigma^2}, \qquad (26)$$

where

$$B_i = \Big[\frac{(y_i - \mu)^2}{2\sigma^2} - \frac{1}{2}\Big]\sum_j [(y_j - \mu)^2 - \sigma^2] c_j + (y_i - \mu)\sum_j (y_j - \mu) c_j.$$

Using the last equalities we get

$$\hat{z}_i = z_i + \frac{B_i z_i}{C\sigma^2}. \qquad (27)$$

Since $l(\hat{\mu}, \hat{\sigma})$ is a linear form in the $z_i$, we have

$$l(\{\hat{z}_i\}) = l(\{z_i\}) + \frac{l(\{B_i z_i\})}{C\sigma^2} \qquad (28)$$

and

$$l(\{B_i z_i\}) = \sum_i a_i z_i \Big\{\Big[\frac{(y_i - \mu)^2}{2\sigma^2} - \frac{1}{2}\Big]\sum_j c_j [(y_j - \mu)^2 - \sigma^2] + (y_i - \mu)\sum_j c_j (y_j - \mu)\Big\} = \sum_i c_i \Big\{\Big[\frac{(y_i - \mu)^2}{2\sigma^2} - \frac{1}{2}\Big]\sum_j c_j [(y_j - \mu)^2 - \sigma^2] + (y_i - \mu)\sum_j c_j (y_j - \mu)\Big\} = \frac{\{\sum_j c_j [(y_j - \mu)^2 - \sigma^2]\}^2}{2\sigma^2} + \Big[\sum_j c_j (y_j - \mu)\Big]^2, \qquad (29)$$

so that

$$l(\{\hat{z}_i\}) - l(\{z_i\}) \sim \frac{T}{C}.$$

Since by assumption either $\hat{\mu} \ne \mu$ or $\hat{\sigma} \ne \sigma$, we have $T \ne 0$. Applicability of the linearization principle follows from the fact that if (14) holds then the left-hand side of (9) is not equal to zero. Q.E.D.
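Theorem 2 can be checked numerically for the linear form $F = \sum_i a_i z_i$ used in the first step of the proof. The sketch below is ours (NumPy assumed): it compares $C\,(F(\{\hat{z}\}) - F(\{z\}))$ with the constant $T$ of (14) for a large $C$; the two numbers should agree up to $O(1/C)$.

import numpy as np

rng = np.random.default_rng(1)
m = 6
y = rng.normal(size=m)                 # training samples y_i
a = rng.normal(size=m)                 # F = sum_i a_i z_i (linear case)
mu, sigma = 0.3, 1.2

def dens(yv, mu, sigma):               # the 1-D Gaussian density (4)
    return np.exp(-(yv - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

z = dens(y, mu, sigma)
c = a * z                              # c_j = a_j z_j
S1 = (c * (y - mu)).sum()
S2 = (c * ((y - mu)**2 - sigma**2)).sum()
T = (S2**2 / (2 * sigma**2) + S1**2) / sigma**2              # formula (14)

C = 1e6
mu_new = (c @ y + C * mu) / (c.sum() + C)                    # update (2)
var_new = (c @ y**2 + C * (mu**2 + sigma**2)) / (c.sum() + C) - mu_new**2   # update (3)
growth = a @ dens(y, mu_new, np.sqrt(var_new)) - a @ z
print(C * growth, T)                   # should agree to O(1/C)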
5. EB FOR MULTIDIMENSIONAL MULTIVARIATE GAUSSIAN DENSITIES

For simplicity of notation we consider the transformations (5), (6) for a single pair of variables $\mu, \Sigma$ only; i.e., we drop the subscript $j$ everywhere in (5), (6), (7) and set

$$\hat{z}_i = \frac{|\hat{\Sigma}|^{-1/2}}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}(y_i - \hat{\mu})^T \hat{\Sigma}^{-1} (y_i - \hat{\mu})}.$$

Theorem 3. Let $F(\{z_i\})$, $i = 1,\dots,m$, be differentiable with respect to $\mu, \Sigma$ and let $\partial F(\{z_i\})/\partial z_i$ exist at $z_i$. Let either $\hat{\mu} \ne \mu$ or $\hat{\Sigma} \ne \Sigma$. Then for sufficiently large $C$

$$F(\{\hat{z}_i\}) - F(\{z_i\}) = T/C + o(1/C), \qquad (30)$$

where $T > 0$; i.e., $F(\{\hat{z}_i\})$ grows proportionally to $1/C$ for sufficiently large $C$. If $\Sigma$ is represented as a diagonal matrix,

$$\Sigma^{-1} = \mathrm{diag}[\lambda_1,\dots,\lambda_n], \qquad (31)$$

then one can write $T$ explicitly as follows:

$$T = T_1 + T_2 + T_3 \qquad (32)$$

$$T_1 = \frac{1}{2}\sum_{k \ne l} (\lambda_k^2 + \lambda_l^2)\Big(\sum_i c_i a_{ki} a_{li}\Big)^2 \qquad (33)$$

$$T_2 = \frac{1}{2}\sum_{k=1}^n \Big(\lambda_k \sum_i c_i a_{ki}^2 - \sum_i c_i\Big)^2 \qquad (34)$$

$$T_3 = \sum_{k=1}^n \lambda_k \Big(\sum_i c_i a_{ki}\Big)^2 \qquad (35)$$

where the coordinates $a_{ki}$ are defined by $(y_i - \mu)^T = (a_{1i},\dots,a_{ni})$ in the diagonalizing basis (see Steps 3 to 5 below).

Proof. Our proof will be split into several steps.

Step 1: Linearization. First, we assume that $F(\{z_i\}) = l(\mu, \Sigma) := l(\{z_i\}) := \sum_{i=1}^m a_i z_i$. Let us set $l(\hat{\mu}, \hat{\Sigma}) := l(\{\hat{z}_i\}) := \sum_{i=1}^m a_i \hat{z}_i$. Then $c_j = a_j z_j$ in (5), (6). We want to prove that for sufficiently large $C$, $l(\hat{\mu}, \hat{\Sigma}) \ge l(\mu, \Sigma)$. It suffices to prove this inequality to precision $1/C^2$. We have

$$\hat{\mu} = \hat{\mu}(C) = \frac{\sum_{j=1}^m c_j y_j + C\mu}{\sum_{j=1}^m c_j + C} = \frac{\frac{1}{C}\sum_j c_j y_j + \mu}{\frac{1}{C}\sum_j c_j + 1} \sim \Big(\frac{1}{C}\sum_j c_j y_j + \mu\Big)\Big(1 - \frac{\sum_j c_j}{C}\Big) \sim \mu + \frac{1}{C}\Big(\sum_j c_j y_j - \mu\sum_j c_j\Big) \qquad (36)$$

$$\hat{\mu} \sim \mu + \frac{\sum_j c_j (y_j - \mu)}{C} \qquad (37)$$

Next, we have

$$\hat{\Sigma} = \hat{\Sigma}(C) = \frac{\sum_j c_j y_j y_j^T + C(\mu\mu^T + \Sigma)}{\sum_j c_j + C} - \hat{\mu}\hat{\mu}^T. \qquad (38)$$

Let us compute $\hat{\Sigma}$ using (38):

$$\frac{\sum_j c_j y_j y_j^T + C(\mu\mu^T + \Sigma)}{\sum_j c_j + C} \sim \Big(\frac{\sum_j c_j y_j y_j^T}{C} + \mu\mu^T + \Sigma\Big)\Big(1 - \frac{\sum_j c_j}{C}\Big) \sim \mu\mu^T + \Sigma + \frac{1}{C}\Big[\sum_j c_j y_j y_j^T - (\mu\mu^T + \Sigma)\sum_j c_j\Big] \qquad (39)$$

$$\hat{\mu}\hat{\mu}^T \sim \mu\mu^T + \frac{2\mu}{C}\sum_{j=1}^m c_j (y_j - \mu)^T \qquad (40)$$

This gives

$$\hat{\Sigma} \sim \Sigma + \frac{1}{C}\Big[\sum_j c_j y_j y_j^T - (\mu\mu^T + \Sigma)\sum_j c_j - 2\mu\sum_j c_j (y_j - \mu)^T\Big] \qquad (41)$$

and finally

$$\hat{\Sigma} \sim \Sigma + \frac{\sum_j c_j\big[(y_j - \mu)(y_j - \mu)^T - \Sigma\big]}{C} \qquad (42)$$

$$\hat{\Sigma}^{-1} \sim \Sigma^{-1} - \frac{\Sigma^{-2}\big\{\sum_j c_j\big[(y_j - \mu)(y_j - \mu)^T - \Sigma\big]\big\}}{C} \qquad (43)$$

$$\frac{1}{2}(y_i - \hat{\mu})^T \hat{\Sigma}^{-1} (y_i - \hat{\mu}) \sim \frac{1}{2}\Big[(y_i - \mu) - \frac{\sum_j c_j (y_j - \mu)}{C}\Big]^T \Big[\Sigma^{-1} - \frac{\Sigma^{-2}\{\sum_j c_j [(y_j - \mu)(y_j - \mu)^T - \Sigma]\}}{C}\Big] \Big[(y_i - \mu) - \frac{\sum_j c_j (y_j - \mu)}{C}\Big] \sim \frac{1}{2}(y_i - \mu)^T \Sigma^{-1} (y_i - \mu) - \frac{A_i}{C}, \qquad (44)$$

where

$$A_i = A_{i1} + A_{i2} \qquad (45)$$

$$A_{i1} = \frac{1}{2}\sum_j c_j (y_i - \mu)^T \Sigma^{-2} \big[(y_j - \mu)(y_j - \mu)^T - \Sigma\big](y_i - \mu) \qquad (46)$$

$$A_{i2} = \frac{1}{2}\sum_j c_j \big[(y_j - \mu)^T \Sigma^{-1} (y_i - \mu) + (y_i - \mu)^T \Sigma^{-1} (y_j - \mu)\big] \qquad (47)$$

Continuing, we have

$$\hat{z}_i \sim \frac{|\hat{\Sigma}|^{-1/2}}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}(y_i - \mu)^T \Sigma^{-1}(y_i - \mu)}\Big(1 + \frac{A_i}{C}\Big) \qquad (48)$$

with

$$|\hat{\Sigma}|^{-1/2} \sim |\Sigma|^{-1/2}\Big\{1 + \frac{\sum_j c_j \big[n - \mathrm{Tr}\,\Sigma^{-1}(y_j - \mu)(y_j - \mu)^T\big]}{2C}\Big\}. \qquad (49)$$

Hence

$$\frac{\hat{z}_i}{z_i} \sim \Big(1 + \frac{A_i}{C}\Big)\Big\{1 + \frac{\sum_j c_j [n - \mathrm{Tr}\,\Sigma^{-1}(y_j - \mu)(y_j - \mu)^T]}{2C}\Big\} \sim 1 + \frac{1}{C}\Big\{A_i + \frac{1}{2}\sum_j c_j \big[n - \mathrm{Tr}\,\Sigma^{-1}(y_j - \mu)(y_j - \mu)^T\big]\Big\} \sim 1 + \frac{B_i}{C}, \qquad (50)$$

where $B_i = A_i + D$ and

$$D = \frac{1}{2}\sum_j c_j \big[n - (y_j - \mu)^T \Sigma^{-1} (y_j - \mu)\big].$$

Here we use $\mathrm{Tr}\,\Sigma^{-1}(y_j - \mu)(y_j - \mu)^T = (y_j - \mu)^T \Sigma^{-1} (y_j - \mu)$. Using the last equalities we get

$$\hat{z}_i \sim z_i + \frac{B_i z_i}{C}. \qquad (51)$$

Since $l(\hat{\mu}, \hat{\Sigma})$ is a linear form in the $z_i$ we have

$$l(\{\hat{z}_i\}) \sim l(\{z_i\}) + \frac{l(\{B_i z_i\})}{C}. \qquad (52)$$

Step 2: Computation of T. We have

$$T = l(\{B_i z_i\}) = \sum_i c_i A_i + \frac{1}{2}\Big(\sum_i c_i\Big)\sum_j c_j \big[n - (y_j - \mu)^T \Sigma^{-1} (y_j - \mu)\big] \qquad (53)$$

$$\tilde{A}_1 = \sum_i c_i A_{i1} = \frac{1}{2}\sum_{ij} c_i c_j (y_i - \mu)^T \Sigma^{-2}\big[(y_j - \mu)(y_j - \mu)^T - \Sigma\big](y_i - \mu) \qquad (54)$$

$$\tilde{A}_2 = \sum_i c_i A_{i2} = \frac{1}{2}\sum_{ij} c_i c_j \big[(y_j - \mu)^T \Sigma^{-1}(y_i - \mu) + (y_i - \mu)^T \Sigma^{-1}(y_j - \mu)\big] = \Big[\sum_j c_j (y_j - \mu)\Big]^T \Sigma^{-1} \Big[\sum_i c_i (y_i - \mu)\Big], \qquad (55)$$

so that

$$l(\{\hat{z}_i\}) - l(\{z_i\}) \sim \frac{T}{C}.$$
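The qualitative content of Theorem 3, growth with a gain that scales like $1/C$, can be checked numerically without computing $T$ explicitly. The sketch below is ours (NumPy assumed): it applies the updates (5), (6) to a random full-covariance Gaussian and a linear objective, and prints $C\,(F(\{\hat{z}\}) - F(\{z\}))$ for increasing $C$; the products should stabilize near a positive constant, the $T$ of (30).

import numpy as np

rng = np.random.default_rng(2)
m, n = 10, 3
y = rng.normal(size=(m, n))              # training vectors y_i
a = rng.normal(size=m)                   # F = sum_i a_i z_i (linear case)
mu = rng.normal(size=n)
R = rng.normal(size=(n, n))
Sigma = R @ R.T + n * np.eye(n)          # a generic SPD covariance

def dens(Y, mu, Sigma):                  # the multivariate density (7)
    d = Y - mu
    quad = np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma), d)
    return np.exp(-quad / 2) / np.sqrt((2 * np.pi)**n * np.linalg.det(Sigma))

def growth(C):
    z = dens(y, mu, Sigma)
    c = a * z                            # c_i = a_i z_i
    denom = c.sum() + C
    mu_new = (c @ y + C * mu) / denom                                   # (5)
    Sigma_new = ((y.T * c) @ y + C * (np.outer(mu, mu) + Sigma)) / denom \
                - np.outer(mu_new, mu_new)                              # (6)
    return a @ dens(y, mu_new, Sigma_new) - a @ z

for C in (1e4, 1e5, 1e6):
    g = growth(C)
    print(C, g, C * g)   # g > 0 for large C; C*g approaches the constant T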
Step 3: Reduction to a diagonal case. Since $\Sigma$ is a symmetric matrix, there exists an orthogonal matrix $O$ such that $O \Sigma O^{-1}$ is a diagonal matrix. It is easy to see that $T$ is invariant under such an orthogonal change of coordinates. For example, the component $\tilde{A}_1$ of $T$ is invariant under an orthogonal change of coordinates, as one can see from the following computation:

$$\tilde{A}_1 = \frac{1}{2}\sum_{ij} c_i c_j (y_i - \mu)^T O^T \big(O \Sigma^{-2} O^T\big) \big[O(y_j - \mu)(y_j - \mu)^T O^T - O\Sigma O^T\big] O (y_i - \mu). \qquad (56)$$

Step 4: The special case of 2-dimensional Gaussians. For simplicity we first perform the computations in the 2-dimensional case. Without loss of generality we can assume that $\Sigma^{-1} = \mathrm{diag}[\lambda_1, \lambda_2]$ is a diagonal $2 \times 2$ matrix with diagonal elements $\lambda_1$ and $\lambda_2$. Let us set

$$(y_i - \mu)^T = (a_{1i}, a_{2i}) \qquad (57)$$

and compute

$$A'_1 = \frac{1}{2}\sum_{ij} c_i c_j (y_i - \mu)^T \Sigma^{-2} \big[(y_j - \mu)(y_j - \mu)^T\big] (y_i - \mu) \qquad (58)$$

$$= \frac{1}{2}\sum_{ij} c_i c_j (a_{1i}, a_{2i})\, \mathrm{diag}[\lambda_1^2, \lambda_2^2]\, (a_{1j}, a_{2j})^T (a_{1j}, a_{2j})(a_{1i}, a_{2i})^T \qquad (59)$$

$$= \frac{1}{2}\sum_{ij} c_i c_j (\lambda_1^2 a_{1i}, \lambda_2^2 a_{2i})(a_{1j}, a_{2j})^T (a_{1j} a_{1i} + a_{2j} a_{2i}) \qquad (60)$$

$$= \frac{1}{2}\sum_{ij} c_i c_j (\lambda_1^2 a_{1i} a_{1j} + \lambda_2^2 a_{2i} a_{2j})(a_{1j} a_{1i} + a_{2j} a_{2i}) \qquad (61)$$

$$= \frac{1}{2}(\lambda_1^2 + \lambda_2^2)\Big(\sum_i c_i a_{1i} a_{2i}\Big)^2 + \frac{1}{2}\lambda_1^2\Big(\sum_i c_i a_{1i}^2\Big)^2 + \frac{1}{2}\lambda_2^2\Big(\sum_i c_i a_{2i}^2\Big)^2. \qquad (62)$$

Next,

$$A''_1 = \tilde{A}_1 - A'_1 = -\frac{1}{2}\sum_{ij} c_i c_j (y_i - \mu)^T \Sigma^{-1}(y_i - \mu) = -\frac{1}{2}\Big(\sum_j c_j\Big)\Big(\sum_i c_i \lambda_1 a_{1i}^2 + \sum_i c_i \lambda_2 a_{2i}^2\Big) \qquad (63)$$

and

$$\tilde{A}_2 = \sum_{ij} c_i c_j (\lambda_1 a_{1i} a_{1j} + \lambda_2 a_{2i} a_{2j}) = \lambda_1\Big(\sum_i c_i a_{1i}\Big)^2 + \lambda_2\Big(\sum_i c_i a_{2i}\Big)^2. \qquad (64)$$

Next, for our 2-dimensional case we have

$$\Big(\sum_i c_i\Big) D = \Big(\sum_i c_i\Big)^2 - \frac{1}{2}\Big(\sum_j c_j\Big)\Big(\sum_i c_i \lambda_1 a_{1i}^2 + \sum_i c_i \lambda_2 a_{2i}^2\Big). \qquad (65)$$

Therefore

$$T = \frac{1}{2}(\lambda_1^2 + \lambda_2^2)\Big(\sum_i c_i a_{1i} a_{2i}\Big)^2 + \frac{1}{2}\Big(\lambda_1 \sum_i c_i a_{1i}^2 - \sum_i c_i\Big)^2 + \frac{1}{2}\Big(\lambda_2 \sum_i c_i a_{2i}^2 - \sum_i c_i\Big)^2 + \lambda_1\Big(\sum_i c_i a_{1i}\Big)^2 + \lambda_2\Big(\sum_i c_i a_{2i}\Big)^2. \qquad (66)$$

In the above equation

$$T_1 = \frac{1}{2}(\lambda_1^2 + \lambda_2^2)\Big(\sum_i c_i a_{1i} a_{2i}\Big)^2 \qquad (67)$$

$$T_2 = \frac{1}{2}\Big(\lambda_1 \sum_i c_i a_{1i}^2 - \sum_i c_i\Big)^2 + \frac{1}{2}\Big(\lambda_2 \sum_i c_i a_{2i}^2 - \sum_i c_i\Big)^2 \qquad (68)$$

$$T_3 = \lambda_1\Big(\sum_i c_i a_{1i}\Big)^2 + \lambda_2\Big(\sum_i c_i a_{2i}\Big)^2 \qquad (69)$$

Step 5: The general case of n-dimensional Gaussians. We now perform the computations for the $n$-dimensional case. Without loss of generality we can assume that $\Sigma^{-1} = \mathrm{diag}[\lambda_1, \lambda_2, \dots, \lambda_n]$ is a diagonal $n \times n$ matrix with diagonal elements $\lambda_1, \dots, \lambda_n$. Let us set

$$(y_i - \mu)^T = (a_{1i}, a_{2i}, \dots, a_{ni}) \qquad (70)$$

and compute

$$A'_1 = \frac{1}{2}\sum_{ij} c_i c_j (y_i - \mu)^T \Sigma^{-2}\big[(y_j - \mu)(y_j - \mu)^T\big](y_i - \mu) \qquad (71)$$

$$= \frac{1}{2}\sum_{ij} c_i c_j (a_{1i}, \dots, a_{ni})\, \mathrm{diag}[\lambda_1^2, \dots, \lambda_n^2]\, (a_{1j}, \dots, a_{nj})^T (a_{1j}, \dots, a_{nj})(a_{1i}, \dots, a_{ni})^T \qquad (72)$$

$$= \frac{1}{2}\sum_{ij} c_i c_j (\lambda_1^2 a_{1i}, \dots, \lambda_n^2 a_{ni})(a_{1j}, \dots, a_{nj})^T \Big(\sum_k a_{kj} a_{ki}\Big) \qquad (73)$$

$$= \frac{1}{2}\sum_{ij} c_i c_j \Big(\sum_k \lambda_k^2 a_{ki} a_{kj}\Big)\Big(\sum_k a_{kj} a_{ki}\Big) \qquad (74)$$

$$= \frac{1}{2}\sum_{k < l} (\lambda_k^2 + \lambda_l^2)\Big(\sum_i c_i a_{ki} a_{li}\Big)^2 + \frac{1}{2}\sum_k \lambda_k^2\Big(\sum_i c_i a_{ki}^2\Big)^2. \qquad (75)$$

Similarly to the 2-dimensional case one can compute the other components of $T$.

Step 6: Invariant transformation points. Here we prove the following.

Lemma 4. Let $\Sigma$ be a diagonal matrix. Then the following holds:
a) $T = 0$ implies that $\Sigma(C) = \Sigma$ and $\mu(C) = \mu$ for $C = 0$.
b) $\Sigma(C) = \Sigma$ and $\mu(C) = \mu$ for $C = 0$ implies that $\Sigma(C) = \Sigma$ and $\mu(C) = \mu$ for any $C$.
c) $\Sigma(C) = \Sigma$ and $\mu(C) = \mu$ for some $C$ implies $T = 0$.

Proof of the Lemma.
a) $T = 0 \Rightarrow T_3 = 0 \Rightarrow \mu = \frac{\sum_i c_i y_i}{\sum_i c_i} \Rightarrow \mu(0) = \mu$. Next, $T_2 = 0 \Rightarrow \lambda_k \sum_i c_i a_{ki}^2 - \sum_i c_i = 0 \Rightarrow \lambda_k^{-1} = \frac{\sum_i c_i y_{ik}^2}{\sum_i c_i} - \mu_k \mu_k = \Sigma(0)_{kk}$. Finally, $T_1 = 0 \Rightarrow \sum_i c_i (y_{ik} - \mu_k)(y_{il} - \mu_l) = 0 \Rightarrow \sum_i c_i y_{ik} y_{il} - \sum_i c_i y_{ik}\mu_l - \sum_i c_i y_{il}\mu_k + \sum_i c_i \mu_k\mu_l = 0 \Rightarrow \sum_i c_i (y_{ik} y_{il} - \mu_k \mu_l) = 0 \Rightarrow \Sigma_{kl}(0) = 0$. This proves a) for $C = 0$.

b) It follows from (5) that if $\mu(C) = \mu$ then

$$\mu\Big(\sum_i c_i + C\Big) = \sum_i c_i y_i + C\mu \ \Rightarrow\ \mu \sum_i c_i = \sum_i c_i y_i.$$

Adding $C'\mu$ to both sides of the last equation, for any $C'$, we get

$$\mu\Big(\sum_i c_i + C'\Big) = \sum_i c_i y_i + C'\mu \ \Rightarrow\ \mu(C') = \mu.$$

This proves b) for $\mu$. Similarly, from (6) and part b) of the lemma for $\mu$ we have that $\Sigma(C) = \Sigma$ implies

$$\Sigma \sum_i c_i = \sum_i c_i y_i y_i^T - \sum_i c_i\, \mu\mu^T.$$

Adding $C'\Sigma$ to both sides of the above equation, for any $C'$, we get

$$\Sigma \sum_i c_i + C'\Sigma = \sum_i c_i y_i y_i^T - \sum_i c_i\, \mu\mu^T + C'(\mu\mu^T + \Sigma) - C'\mu\mu^T \ \Rightarrow\ \Sigma = \frac{\sum_i c_i y_i y_i^T + C'(\mu\mu^T + \Sigma)}{\sum_i c_i + C'} - \mu\mu^T \ \Rightarrow\ \Sigma = \Sigma(C').$$

c) $\mu = \mu(0)$ implies that $T_3 = 0$. $\Sigma_{kk} = \Sigma_{kk}(0)$ implies that $T_2 = 0$. Finally, $\Sigma_{kl} = \Sigma_{kl}(0) = 0$ for $k \ne l$ implies that $\sum_i c_i (y_{ik} - \mu_k)(y_{il} - \mu_l) = 0$, i.e. $T_1 = 0$.

We can now finish the proof of the theorem. Since by assumption either $\hat{\mu} \ne \mu$ or $\hat{\Sigma} \ne \Sigma$, we have $T \ne 0$. Applicability of the linearization principle follows from the fact that if $T \ne 0$ then the left-hand side of (9) is not equal to zero. Q.E.D.

6. NEW GROWTH TRANSFORMATIONS

One can derive new updates for means and variances by applying the EB algorithm of Section 3 after introducing probability constraints for the means and variances, as follows. Let us assume that $0 \le \mu_j \le D_j$ and $0 \le \sigma_j \le E_j$. Then we can introduce slack variables $\mu'_j \ge 0$, $\sigma'_j \ge 0$ such that $\mu_j/D_j + \mu'_j/D_j = 1$ and $\sigma_j/E_j + \sigma'_j/E_j = 1$. Then we can compute updates as in (1), with $c_{ij}$ as in (5), (6):

$$\hat{\mu}_j = \frac{D_j\, \mu_j \Big(\sum_i c_{ij}\,\frac{y_i - \mu_j}{\sigma_j^2} + C\Big)}{\mu_j \sum_i c_{ij}\,\frac{y_i - \mu_j}{\sigma_j^2} + D_j C}$$

$$\hat{\sigma}_j = \frac{E_j\, \sigma_j \Big(\sum_i c_{ij}\Big[-1 + \frac{(y_i - \mu_j)^2}{\sigma_j^2}\Big] + C\Big)}{\sigma_j \sum_i c_{ij}\Big[-1 + \frac{(y_i - \mu_j)^2}{\sigma_j^2}\Big] + E_j C}$$

If some $\mu_j < 0$, one can make them positive by adding positive constants, compute the updates for the new variables in the new coordinate system, and then transform back to the old system of coordinates.
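A direct transcription of the constrained updates above is given below as a sketch (ours; NumPy assumed; the function name and calling convention are hypothetical). Note that as C grows, both updates tend to the current values mu_j and sigma_j, as one expects from an EB-style transformation.

import numpy as np

def eb_update_constrained(c_j, y, mu_j, sigma_j, D_j, E_j, C):
    """Constrained EB update of (mu_j, sigma_j) with 0 <= mu_j <= D_j and
    0 <= sigma_j <= E_j, via the slack-variable construction of Section 6.
    c_j[i] = c_ij; transcribes the two update formulas printed above."""
    S_mu = (c_j * (y - mu_j)).sum() / sigma_j**2
    S_sigma = (c_j * ((y - mu_j)**2 / sigma_j**2 - 1.0)).sum()
    mu_new = D_j * mu_j * (S_mu + C) / (mu_j * S_mu + D_j * C)
    sigma_new = E_j * sigma_j * (S_sigma + C) / (sigma_j * S_sigma + E_j * C)
    return mu_new, sigma_new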
7. REFERENCES

[1] S. Axelrod, V. Goel, R. Gopinath, P. Olsen, and K. Visweswariah, "Discriminative training of subspace constrained GMMs for speech recognition," to be submitted to IEEE Transactions on Speech and Audio Processing.

[2] L. E. Baum and J. A. Eagon, "An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology," Bull. Amer. Math. Soc., vol. 73, pp. 360-363, 1967.

[3] A. Gunawardana and W. Byrne, "Discriminative speaker adaptation with conditional maximum likelihood linear regression," in Proc. ICASSP, 2002.

[4] P. S. Gopalakrishnan, D. Kanevsky, D. Nahamoo, and A. Nadas, "An inequality for rational functions with applications to some statistical estimation problems," IEEE Trans. Information Theory, vol. 37, no. 1, January 1991.

[5] D. Kanevsky, "Growth transformations for general functions," IBM Research Report RC22919 (W0309-163), September 25, 2003.

[6] D. Kanevsky, "Extended Baum transformations for general functions," in Proc. ICASSP, 2004.

[7] Y. Normandin, "An improved MMIE training algorithm for speaker-independent, small vocabulary, continuous speech recognition," in Proc. ICASSP '91, pp. 537-540, 1991.

[8] R. Schlüter, W. Macherey, B. Müller, and H. Ney, "Comparison of discriminative training criteria and optimization methods for speech recognition," Speech Communication, vol. 34, pp. 287-310, 2001.

[9] V. Valtchev, P. C. Woodland, and S. J. Young, "Lattice-based discriminative training for large vocabulary speech recognition systems," Speech Communication, vol. 22, pp. 303-314, 1996.