
RC23645 (W0506-120) June 29, 2005
Mathematics
IBM Research Report
Extended Baum Transformations for General Functions, II
Dimitri Kanevsky
IBM Research Division
Thomas J. Watson Research Center
P.O. Box 218
Yorktown Heights, NY 10598
EXTENDED BAUM TRANSFORMATIONS FOR GENERAL FUNCTIONS, II
Dimitri Kanevsky∗
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
{ kanevsky@us.ibm.com }
∗ This work is partly supported by the DARPA Babylon Speech-to-Speech Translation Program.
ABSTRACT
The discriminative technique for estimating the parameters of Gaussian mixtures that is based on the Extended Baum transformations (EB) has had a significant impact on the speech recognition community. A proof that these transformations increase the value of an objective function with each iteration (i.e., that they are so-called "growth transformations") was presented by the author two years ago for diagonal Gaussian mixture densities. In this paper the proof is extended to multidimensional multivariate Gaussian mixtures. The proof presented here is based on a linearization process and an explicit growth estimate for linear forms of Gaussian mixtures.
1. INTRODUCTION
The EB procedure involves two types of transformations that can be described as follows. Let F(z) = F(z_ij) be a function of the variables z = (z_ij), and let c_ij = z_ij ∂F(z)/∂z_ij. (A short numerical sketch of all three updates I-III follows item III below.)

I. Discrete probabilities:

\hat{z}_{ij} = \frac{c_{ij} + z_{ij} C}{\sum_j c_{ij} + C}    (1)

where z ∈ D = \{z_{ij} \ge 0,\; \sum_j z_{ij} = \sum_{j=1}^{m_i} z_{ij} = 1\}.
II. Gaussian mixture densities:

\hat{\mu}_j = \hat{\mu}_j(C) = \frac{\sum_{i \in I} c_{ij} y_i + C\mu_j}{\sum_{i \in I} c_{ij} + C}    (2)

\hat{\sigma}_j^2 = \hat{\sigma}_j^2(C) = \frac{\sum_{i \in I} c_{ij} y_i^2 + C(\mu_j^2 + \sigma_j^2)}{\sum_{i \in I} c_{ij} + C} - \hat{\mu}_j^2    (3)

where

z_{ij} = \frac{1}{(2\pi)^{1/2}\sigma_j} e^{-(y_i - \mu_j)^2 / 2\sigma_j^2}    (4)

and y_i is a sample of training data.
III. Multidimensional multivariate Gaussian mixture densities:

\hat{\mu}_j = \hat{\mu}_j(C) = \frac{\sum_{i \in I} c_{ij} y_i + C\mu_j}{\sum_{i \in I} c_{ij} + C}    (5)

\hat{\Sigma}_j = \hat{\Sigma}_j(C) = \frac{\sum_{i \in I} c_{ij} y_i y_i^T + C(\mu_j\mu_j^T + \Sigma_j)}{\sum_{i \in I} c_{ij} + C} - \hat{\mu}_j \hat{\mu}_j^T    (6)

where

z_{ij} = \frac{|\Sigma_j|^{-1/2}}{(2\pi)^{n/2}} e^{-\frac{1}{2}(y_i - \mu_j)^T \Sigma_j^{-1} (y_i - \mu_j)}    (7)

and y_i^T = (y_{i1}, \ldots, y_{in}) is a sample of training data.
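As an illustration, all three updates transcribe directly into code. The following sketch is ours, not part of the original EB literature: the objective, the weights c, and the constant C are illustrative stand-ins, and in discriminative training the c_ij would come from the derivative of the chosen criterion.

import numpy as np

# I. Discrete probabilities, eq. (1): rows of z are distributions over j.
def eb_discrete(z, grad, C):
    c = z * grad                                   # c_ij = z_ij dF/dz_ij
    return (c + C * z) / (c.sum(axis=1, keepdims=True) + C)

# II. One diagonal (1-D) Gaussian component, eqs. (2)-(3).
def eb_gaussian_1d(y, c, mu, sigma2, C):
    denom = c.sum() + C
    mu_new = (c @ y + C * mu) / denom
    sigma2_new = (c @ (y * y) + C * (mu**2 + sigma2)) / denom - mu_new**2
    return mu_new, sigma2_new

# III. One full-covariance Gaussian component, eqs. (5)-(6).
def eb_gaussian_full(Y, c, mu, Sigma, C):
    denom = c.sum() + C
    mu_new = (Y.T @ c + C * mu) / denom
    S = (Y.T * c) @ Y                              # sum_i c_i y_i y_i^T
    Sigma_new = (S + C * (np.outer(mu, mu) + Sigma)) / denom \
        - np.outer(mu_new, mu_new)
    return mu_new, Sigma_new

# Tiny smoke test with made-up numbers.
rng = np.random.default_rng(0)
z = rng.dirichlet(np.ones(4), size=2)
z_new = eb_discrete(z, rng.normal(size=z.shape), C=100.0)
assert np.allclose(z_new.sum(axis=1), 1.0)         # rows stay distributions

Y = rng.normal(size=(5, 3))
c = rng.normal(size=5)                             # c_i may be negative
print(eb_gaussian_1d(Y[:, 0], c, mu=0.0, sigma2=1.0, C=100.0))
mu_new, Sigma_new = eb_gaussian_full(Y, c, np.zeros(3), np.eye(3), C=100.0)
print(np.linalg.eigvalsh(Sigma_new))               # PD for large enough C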
It was shown in [4] that (1) defines a growth transformation for sufficiently large C when F is a rational function. The update formulae (5), (6) for rational functions F were obtained through a discrete probability approximation of Gaussian densities [7] and have been widely used as an alternative to direct gradient-based optimization approaches ([9], [8]). Using the linearization technique that was originally presented in our IBM Research Report [5] and in [6] for diagonal Gaussian mixtures, we demonstrate in this paper that (5), (6) are growth transformations for sufficiently large C if the function F obeys certain smoothness constraints. Axelrod [1] has recently proposed another proof of the existence of a constant C that ensures the validity of the MMIE auxiliary function as formulated by Gunawardana et al. [3]. We also reproduce in this paper from [6] the proofs that the transformations for diagonal Gaussian mixtures (5) and for discrete probabilities (1) are growth transformations.
2. LINEARIZATION
This principle is needed to reduce proofs of the growth property for general functions to the case of linear forms.
Lemma 1 Let

F(z) = \tilde{F}(\{u_j\}) = \tilde{F}(\{g_j(z)\}) = \tilde{F} \circ g(z)    (8)

where u_j = g_j(z), j = 1, \ldots, m, and z varies in a real vector space R^n of dimension n. Let g_j(z) for all j = 1, \ldots, m and F(z) be differentiable at z. Let, also, ∂F̃({u_j})/∂u_j exist at ū_j = g_j(z) for all j = 1, \ldots, m. Let, further, L(z') ≡ ∇F̃|_{g(z)} · g(z'), z' ∈ R^n. Let T_C be a family of transformations R^n → R^n such that for some l = (l_1, \ldots, l_n) ∈ R^n we have T_C(z) − z = l/C + o(1/C) as C → ∞. (Here o(ε) means that o(ε)/ε → 0 as ε → 0.) Let, further, T_C(z) = z if

\nabla L|_z \cdot l = 0    (9)

Then for sufficiently large C, T_C is growth for F at z iff T_C is growth for L at z.

Proof First, from the definition of L we have

\frac{\partial L(z)}{\partial z_k} = \sum_j \frac{\partial \tilde{F}(\{u_j\})}{\partial u_j} \frac{\partial g_j(z)}{\partial z_k} = \frac{\partial F(z)}{\partial z_k}

Next, for z' = T_C(z) and sufficiently large C we have:

F(z') - F(z) = \sum_i \frac{\partial F(z)}{\partial z_i}(z'_i - z_i) + o(1/C) = \sum_i \frac{\partial F(z)}{\partial z_i}\, l_i/C + o(1/C) = \sum_i \frac{\partial L(z)}{\partial z_i}\, l_i/C + o(1/C) = \sum_i \frac{\partial L(z)}{\partial z_i}(z'_i - z_i) + o(1/C) = L(z') - L(z) + o(1/C)

Therefore, for sufficiently large C, F(z') − F(z) > 0 iff L(z') − L(z) > 0.
3. EB FOR DISCRETE PROBABILITIES

The following theorem is a generalization of [4].

Theorem 1 Let F(z) be a function that is defined over D = {z_ij ≥ 0, Σ_j z_ij = 1}. Let F be differentiable at z ∈ D and let ẑ ≠ z be defined as in (1). Then F(ẑ) > F(z) for sufficiently large positive C and F(ẑ) < F(z) for sufficiently small negative C.

Proof Following the linearization principle, we first assume that F(z) = l(z) = Σ a_ij z_ij is a linear form. Then the transformation formula for l(z) is the following:

\hat{z}_{ij} = \frac{a_{ij} z_{ij} + C z_{ij}}{l(z) + C}    (10)

We need to show that l(ẑ) ≥ l(z). It is sufficient to prove this inequality for each linear subcomponent associated with a fixed i:

\sum_{j=1}^{n} a_{ij} \hat{z}_{ij} \ge \sum_{j=1}^{n} a_{ij} z_{ij}

Therefore, without loss of generality, we can assume that i is fixed and drop the subscript i in the forthcoming proof (i.e., we assume that l(z) = Σ a_j z_j, where z = {z_j}, z_j ≥ 0 and Σ z_j = 1). We have l(ẑ) = (l_2(z) + C l(z))/(l(z) + C), where l_2(z) := Σ_j a_j^2 z_j. The linear case of Theorem 1 will follow from the next two lemmas.

Lemma 2

l_2(z) \ge l(z)^2    (11)

Proof Let us assume that a_j ≥ a_{j+1} and substitute z_n = 1 − z_0, where z_0 = Σ_{j=1}^{n-1} z_j. Expanding l(z)^2, we need to prove:

l_2(z) = \sum_{j=1}^{n-1} a_j^2 z_j + a_n^2 (1 - z_0) \ge \Big[\sum_{j=1}^{n-1} (a_j - a_n) z_j\Big]^2 + 2 \sum_{j=1}^{n-1} (a_j - a_n) a_n z_j + a_n^2 = l(z)^2    (12)

By the Cauchy-Schwarz inequality and Σ_j z_j ≤ 1, the first term on the right-hand side is bounded by Σ_{j=1}^{n-1} (a_j − a_n)^2 z_j. Hence it suffices to prove, for every fixed j, that (a_j^2 − a_n^2) z_j ≥ (a_j − a_n)^2 z_j + 2(a_j − a_n) a_n z_j; this holds (with equality) since a_j^2 − a_n^2 = (a_j − a_n)^2 + 2(a_j − a_n) a_n.

Lemma 3 For sufficiently large |C| the following holds: l(ẑ) > l(z) if C is positive and l(ẑ) < l(z) if C is negative.

Proof From (11) we have the following inequalities:

l_2(z) + C l(z) \ge l(z)^2 + C l(z)

l(\hat{z}) = \frac{l_2(z) + C l(z)}{l(z) + C} \ge \frac{l(z)^2 + C l(z)}{l(z) + C} = l(z) \quad \text{if } l(z) + C > 0

l(\hat{z}) = \frac{l_2(z) + C l(z)}{l(z) + C} \le \frac{l(z)^2 + C l(z)}{l(z) + C} = l(z) \quad \text{if } l(z) + C < 0

The general case of Theorem 1 follows immediately from the observation that (9) is equivalent to l_2(z) − l(z)^2 = 0 for large C.
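The following sketch checks Theorem 1 numerically on a smooth (non-rational) F over a product of simplices; the function, the point, and the constants are illustrative:

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
F = lambda z: float(z.ravel() @ A.ravel() + np.sum(z * z))  # smooth F
def grad_F(z):
    return A + 2.0 * z

z = rng.dirichlet(np.ones(3), size=3)       # rows are distributions z_i.
c = z * grad_F(z)                           # c_ij = z_ij dF/dz_ij

for C in (1e3, -1e3):
    z_hat = (c + C * z) / (c.sum(axis=1, keepdims=True) + C)   # eq. (1)
    # should be > 0 for large positive C and < 0 for large negative C
    print(C, F(z_hat) - F(z))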
4. EB FOR GAUSSIAN DENSITIES

For simplicity of notation we consider the transformations (5), (6) only for a single pair of variables µ, σ, i.e., we drop the subscript j everywhere in (5), (6), (7) and also set

\hat{z}_i = \frac{1}{(2\pi)^{1/2}\hat{\sigma}} e^{-(y_i - \hat{\mu})^2 / 2\hat{\sigma}^2}

Theorem 2 Let F({z_i}), i = 1, \ldots, m, be differentiable at µ, σ and let ∂F({z_i})/∂z_i exist at z_i. Let either µ̂ ≠ µ or σ̂ ≠ σ. Then for sufficiently large C

F(\{\hat{z}_i\}) - F(\{z_i\}) = T/C + o(1/C)    (13)

where

T = \frac{1}{\sigma^2}\left\{ \frac{\{\sum_j c_j [(y_j - \mu)^2 - \sigma^2]\}^2}{2\sigma^2} + \Big[\sum_j c_j (y_j - \mu)\Big]^2 \right\} > 0    (14)

In other words, F({ẑ_i}) grows proportionally to 1/C for sufficiently large C.

Proof First, we assume that F({z_i}) = l(µ, σ) := l({z_i}) := Σ_{i=1}^m a_i z_i. Let us set l(µ̂, σ̂) := l({ẑ_i}) := Σ_{i=1}^m a_i ẑ_i. Then c_j = a_j z_j in (5), (6). We want to prove that for sufficiently large C, l(µ̂, σ̂) ≥ l(µ, σ). It is sufficient to prove this inequality with precision 1/C^2.
\hat{\mu} = \hat{\mu}(C) = \frac{\sum_{j=1}^m c_j y_j + C\mu}{\sum_{j=1}^m c_j + C} = \frac{\frac{1}{C}\sum_{j=1}^m c_j y_j + \mu}{\frac{1}{C}\sum_{j=1}^m c_j + 1} \sim \Big(\frac{1}{C}\sum_j c_j y_j + \mu\Big)\Big(1 - \frac{\sum_j c_j}{C}\Big) \sim \mu + \frac{1}{C}\Big(\sum_j c_j y_j - \mu \sum_j c_j\Big)    (15)

\hat{\mu} \sim \mu + \frac{\sum_j c_j (y_j - \mu)}{C}    (16)

Next, we have

\hat{\sigma}^2 = \hat{\sigma}^2(C) = \frac{\sum_j c_j y_j^2 + C(\mu^2 + \sigma^2)}{\sum_j c_j + C} - \hat{\mu}^2    (17)

Continuing this we have

\frac{\sum_j c_j y_j^2 + C(\mu^2 + \sigma^2)}{\sum_j c_j + C} \sim \Big(\frac{\sum_j c_j y_j^2}{C} + \mu^2 + \sigma^2\Big)\Big(1 - \frac{\sum_j c_j}{C}\Big) \sim \mu^2 + \sigma^2 + \frac{1}{C}\Big[\sum_j c_j y_j^2 - (\mu^2 + \sigma^2)\sum_j c_j\Big]    (18)

\hat{\mu}^2 \sim \mu^2 + \frac{2\mu}{C}\sum_{j=1}^m c_j (y_j - \mu)    (19)

This gives

\hat{\sigma}^2 \sim \mu^2 + \sigma^2 + \frac{1}{C}\Big[\sum_j c_j y_j^2 - (\mu^2 + \sigma^2)\sum_j c_j\Big] - \Big[\mu^2 + \frac{2\mu}{C}\sum_j c_j (y_j - \mu)\Big] = \sigma^2 + \frac{1}{C}\Big[\sum_j c_j y_j^2 - (\mu^2 + \sigma^2)\sum_j c_j - 2\mu \sum_j c_j (y_j - \mu)\Big]    (20)

And finally

\hat{\sigma}^2 \sim \sigma^2 + \frac{\sum_j [(y_j - \mu)^2 - \sigma^2] c_j}{C}    (21)

Next,

(y_i - \hat{\mu})^2 / \hat{\sigma}^2 \sim \frac{1}{\sigma^2}\Big[(y_i - \mu)^2 - \frac{2(y_i - \mu)\sum_j c_j (y_j - \mu)}{C}\Big] \times \Big\{1 - \frac{\sum_j c_j [(y_j - \mu)^2 - \sigma^2]}{\sigma^2 C}\Big\} \sim \frac{(y_i - \mu)^2}{\sigma^2} - \frac{1}{C\sigma^2}\Big\{\frac{(y_i - \mu)^2}{\sigma^2}\sum_j [(y_j - \mu)^2 - \sigma^2] c_j + 2(y_i - \mu)\sum_j (y_j - \mu) c_j\Big\}    (22)

This gives

\hat{z}_i \sim K e^{-\frac{(y_i - \mu)^2}{2\sigma^2} + \frac{A_i}{C\sigma^2}}    (23)

where

A_i = \frac{(y_i - \mu)^2}{2\sigma^2}\sum_j [(y_j - \mu)^2 - \sigma^2] c_j + (y_i - \mu)\sum_j (y_j - \mu) c_j    (24)

and K = \frac{1}{(2\pi)^{1/2}\hat{\sigma}}. Next,

1/\hat{\sigma} \sim \frac{1}{\sigma}\Big\{1 - \frac{\sum_j c_j [(y_j - \mu)^2 - \sigma^2]}{2\sigma^2 C}\Big\}    (25)

Continuing this we have

\hat{z}_i \sim \frac{1}{(2\pi)^{1/2}\sigma} e^{-\frac{(y_i - \mu)^2}{2\sigma^2}}\Big(1 + \frac{A_i}{C\sigma^2}\Big)\Big\{1 - \frac{\sum_j c_j [(y_j - \mu)^2 - \sigma^2]}{2\sigma^2 C}\Big\}, \qquad \Big(1 + \frac{A_i}{C\sigma^2}\Big)\Big\{1 - \frac{\sum_j c_j [(y_j - \mu)^2 - \sigma^2]}{2\sigma^2 C}\Big\} \sim 1 + \frac{B_i}{C\sigma^2}    (26)

where B_i = \Big[\frac{(y_i - \mu)^2}{2\sigma^2} - \frac{1}{2}\Big]\sum_j [(y_j - \mu)^2 - \sigma^2] c_j + (y_i - \mu)\sum_j (y_j - \mu) c_j.

Using the last equalities we get

\hat{z}_i = z_i + \frac{B_i}{C\sigma^2} z_i    (27)

Since l(µ̂, σ̂) is a linear form in the z_i, we have

l(\{\hat{z}_i\}) = l(\{z_i\}) + \frac{l(\{B_i z_i\})}{C\sigma^2}    (28)

and

l(\{B_i z_i\}) = \sum_i a_i z_i \Big\{\Big[\frac{(y_i - \mu)^2}{2\sigma^2} - \frac{1}{2}\Big]\sum_j c_j [(y_j - \mu)^2 - \sigma^2] + (y_i - \mu)\sum_j c_j (y_j - \mu)\Big\} = \sum_i c_i \Big\{\Big[\frac{(y_i - \mu)^2}{2\sigma^2} - \frac{1}{2}\Big]\sum_j c_j [(y_j - \mu)^2 - \sigma^2] + (y_i - \mu)\sum_j c_j (y_j - \mu)\Big\} = \frac{\{\sum_j c_j [(y_j - \mu)^2 - \sigma^2]\}^2}{2\sigma^2} + \Big[\sum_j c_j (y_j - \mu)\Big]^2    (29)

Therefore l({ẑ_i}) − l({z_i}) ∼ T/C with T as in (14). Since by assumption either µ̂ ≠ µ or σ̂ ≠ σ, we have T ≠ 0. Applicability of the linearization principle follows from the fact that if (14) holds then the left part of equation (9) is not equal to zero. Q.E.D.
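The estimate (13)-(14) can be verified numerically for a linear objective l({z_i}) = Σ a_i z_i: the scaled increment C(F({ẑ}) − F({z})) should approach T as C grows. A sketch with illustrative data:

import numpy as np

y = np.array([0.4, -1.1, 0.9, 2.0])       # samples y_i
a = np.array([1.0, -0.5, 2.0, 0.3])       # coefficients of the linear form
mu, sigma2 = 0.2, 1.3
sigma = np.sqrt(sigma2)

z = np.exp(-(y - mu)**2 / (2 * sigma2)) / (np.sqrt(2 * np.pi) * sigma)
c = a * z                                  # c_i = z_i dF/dz_i = a_i z_i
F = lambda zz: a @ zz

# T from eq. (14)
S1 = np.sum(c * ((y - mu)**2 - sigma2))
S2 = np.sum(c * (y - mu))
T = (S1**2 / (2 * sigma2) + S2**2) / sigma2

for C in (1e4, 1e6):
    denom = c.sum() + C
    mu_h = (c @ y + C * mu) / denom                              # eq. (2)
    s2_h = (c @ (y * y) + C * (mu**2 + sigma2)) / denom - mu_h**2  # eq. (3)
    z_h = np.exp(-(y - mu_h)**2 / (2 * s2_h)) / np.sqrt(2 * np.pi * s2_h)
    print(C, C * (F(z_h) - F(z)), T)       # scaled increment approaches T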
5. EB FOR MULTIDIMENSIONAL MULTIVARIATE GAUSSIAN DENSITIES

For simplicity of notation we consider the transformations (5), (6) only for a single pair of variables µ, Σ, i.e., we drop the subscript j everywhere in (5), (6), (7) and also set

\hat{z}_i = \frac{|\hat{\Sigma}|^{-1/2}}{(2\pi)^{n/2}} e^{-\frac{1}{2}(y_i - \hat{\mu})^T \hat{\Sigma}^{-1} (y_i - \hat{\mu})}

Theorem 3 Let F({z_i}), i = 1, \ldots, m, be differentiable at µ, Σ and let ∂F({z_i})/∂z_i exist at z_i. Let either µ̂ ≠ µ or Σ̂ ≠ Σ. Then for sufficiently large C

F(\{\hat{z}_i\}) - F(\{z_i\}) = T/C + o(1/C)    (30)

where T > 0, i.e., F({ẑ_i}) grows proportionally to 1/C for sufficiently large C. If Σ is represented as a diagonal matrix,

\Sigma^{-1} = \mathrm{diag}[\lambda_1, \ldots, \lambda_n]    (31)

then one can write T explicitly as follows:

T = T_1 + T_2 + T_3    (32)

T_1 = \frac{1}{2}\sum_{k < l} (\lambda_k^2 + \lambda_l^2)\Big(\sum_i c_i a_{ki} a_{li}\Big)^2    (33)

T_2 = \frac{1}{2}\sum_{k=1}^n \Big(\lambda_k \sum_i c_i a_{ki}^2 - \sum_i c_i\Big)^2    (34)

T_3 = \sum_{k=1}^n \lambda_k \Big(\sum_i c_i a_{ki}\Big)^2    (35)

where (y_i − µ)^T = (a_{1i}, \ldots, a_{ni}).

Proof Our proof is split into several steps.

Step 1: Linearization

First, we assume that F({z_i}) = l(µ, Σ) := l({z_i}) := Σ_{i=1}^m a_i z_i. Let us set l(µ̂, Σ̂) := l({ẑ_i}) := Σ_{i=1}^m a_i ẑ_i. Then c_j = a_j z_j in (5), (6). We want to prove that for sufficiently large C, l(µ̂, Σ̂) ≥ l(µ, Σ). It is sufficient to prove this inequality with precision 1/C^2.

\hat{\mu} = \hat{\mu}(C) = \frac{\sum_{j=1}^m c_j y_j + C\mu}{\sum_{j=1}^m c_j + C} = \frac{\frac{1}{C}\sum_{j=1}^m c_j y_j + \mu}{\frac{1}{C}\sum_{j=1}^m c_j + 1} \sim \Big(\frac{1}{C}\sum_j c_j y_j + \mu\Big)\Big(1 - \frac{\sum_j c_j}{C}\Big) \sim \mu + \frac{1}{C}\Big(\sum_j c_j y_j - \mu \sum_j c_j\Big)    (36)

\hat{\mu} \sim \mu + \frac{\sum_j c_j (y_j - \mu)}{C}    (37)

Next, we have

\hat{\Sigma} = \hat{\Sigma}(C) = \frac{\sum_j c_j y_j y_j^T + C(\mu\mu^T + \Sigma)}{\sum_j c_j + C} - \hat{\mu}\hat{\mu}^T    (38)

Let us compute Σ̂ using (38):

\frac{\sum_j c_j y_j y_j^T + C(\mu\mu^T + \Sigma)}{\sum_j c_j + C} \sim \Big(\frac{\sum_j c_j y_j y_j^T}{C} + \mu\mu^T + \Sigma\Big)\Big(1 - \frac{\sum_j c_j}{C}\Big) \sim \mu\mu^T + \Sigma + \frac{1}{C}\Big[\sum_j c_j y_j y_j^T - (\mu\mu^T + \Sigma)\sum_j c_j\Big]    (39)

\hat{\mu}\hat{\mu}^T \sim \mu\mu^T + \frac{2\mu}{C}\sum_{j=1}^m c_j (y_j - \mu)^T    (40)

This gives

\hat{\Sigma} \sim \mu\mu^T + \Sigma + \frac{1}{C}\Big[\sum_j c_j y_j y_j^T - (\mu\mu^T + \Sigma)\sum_j c_j\Big] - \Big[\mu\mu^T + \frac{2\mu}{C}\sum_j c_j (y_j - \mu)^T\Big] = \Sigma + \frac{1}{C}\Big[\sum_j c_j y_j y_j^T - (\mu\mu^T + \Sigma)\sum_j c_j - 2\mu \sum_j c_j (y_j - \mu)^T\Big]    (41)
And finally

\hat{\Sigma} \sim \Sigma + \frac{\sum_j [(y_j - \mu)(y_j - \mu)^T - \Sigma] c_j}{C}    (42)

\hat{\Sigma}^{-1} \sim \Sigma^{-1} - \frac{\Sigma^{-2}\{\sum_j [(y_j - \mu)(y_j - \mu)^T - \Sigma] c_j\}}{C}    (43)

\frac{1}{2}(y_i - \hat{\mu})^T \hat{\Sigma}^{-1} (y_i - \hat{\mu}) \sim \frac{1}{2}\Big[(y_i - \mu) - \frac{\sum_j c_j (y_j - \mu)}{C}\Big]^T \Big[\Sigma^{-1} - \frac{\Sigma^{-2}\{\sum_j [(y_j - \mu)(y_j - \mu)^T - \Sigma] c_j\}}{C}\Big]\Big[(y_i - \mu) - \frac{\sum_j c_j (y_j - \mu)}{C}\Big] \sim \frac{1}{2}(y_i - \mu)^T \Sigma^{-1}(y_i - \mu) - \frac{A_i}{C}    (44)

where

A_i = A_{i1} + A_{i2}    (45)

A_{i1} = \frac{1}{2}\sum_j c_j (y_i - \mu)^T \Sigma^{-2}[(y_j - \mu)(y_j - \mu)^T - \Sigma](y_i - \mu)    (46)

A_{i2} = \frac{1}{2}\sum_j c_j [(y_j - \mu)^T \Sigma^{-1}(y_i - \mu) + (y_i - \mu)^T \Sigma^{-1}(y_j - \mu)]    (47)

Continuing this we have

\hat{z}_i \sim \frac{|\hat{\Sigma}|^{-1/2}}{(2\pi)^{n/2}} e^{-\frac{1}{2}(y_i - \mu)^T \Sigma^{-1}(y_i - \mu) + \frac{A_i}{C}} \sim K e^{-\frac{1}{2}(y_i - \mu)^T \Sigma^{-1}(y_i - \mu)}\Big(1 + \frac{A_i}{C}\Big)    (48)

where K = \frac{|\hat{\Sigma}|^{-1/2}}{(2\pi)^{n/2}} and

|\hat{\Sigma}|^{-1/2} \sim |\Sigma|^{-1/2}\Big\{1 + \frac{1}{2C}\sum_j c_j [n - \mathrm{Tr}\,\Sigma^{-1}(y_j - \mu)(y_j - \mu)^T]\Big\}    (49)

\Big(1 + \frac{A_i}{C}\Big)\Big\{1 + \frac{1}{2C}\sum_j c_j [n - \mathrm{Tr}\,\Sigma^{-1}(y_j - \mu)(y_j - \mu)^T]\Big\} \sim 1 + \frac{1}{C}\Big\{A_i + \frac{1}{2}\sum_j c_j [n - \mathrm{Tr}\,\Sigma^{-1}(y_j - \mu)(y_j - \mu)^T]\Big\} \sim 1 + \frac{B_i}{C}    (50)

where B_i = A_i + D and

D = \frac{1}{2}\sum_j c_j [n - (y_j - \mu)^T \Sigma^{-1}(y_j - \mu)]

Here we use Tr Σ^{-1}(y_j − µ)(y_j − µ)^T = (y_j − µ)^T Σ^{-1}(y_j − µ).

Using the last equalities we get

\hat{z}_i \sim z_i + \frac{B_i}{C} z_i    (51)

Since l(µ̂, Σ̂) is a linear form in the z_i, we have

l(\{\hat{z}_i\}) \sim l(\{z_i\}) + \frac{l(\{B_i z_i\})}{C}    (52)

Step 2: Computation of T

l(\{\hat{z}_i\}) - l(\{z_i\}) \sim \frac{T}{C}, \qquad T = l(\{B_i z_i\}) = \sum_i c_i A_i + \frac{1}{2}\Big(\sum_i c_i\Big)\sum_j c_j [n - (y_j - \mu)^T \Sigma^{-1}(y_j - \mu)]    (53)

\tilde{A}_1 = \sum_i c_i A_{i1} = \frac{1}{2}\sum_{ij} c_i c_j (y_i - \mu)^T \Sigma^{-2}[(y_j - \mu)(y_j - \mu)^T - \Sigma](y_i - \mu)    (54)

\tilde{A}_2 = \sum_i c_i A_{i2} = \frac{1}{2}\sum_{ij} c_i c_j [(y_j - \mu)^T \Sigma^{-1}(y_i - \mu) + (y_i - \mu)^T \Sigma^{-1}(y_j - \mu)] = \Big[\sum_j c_j (y_j - \mu)\Big]^T \Sigma^{-1}\Big[\sum_i c_i (y_i - \mu)\Big]    (55)
Step 3: Reduction to a diagonal case

Since Σ is a symmetric matrix, there exists an orthogonal matrix O such that OΣO^{-1} is a diagonal matrix. It is easy to see that T is invariant under such an orthogonal change of coordinates. For example, the component Ã_1 of T is invariant under an orthogonal change of coordinates, as one can see from the following computation:

\tilde{A}_1 = \frac{1}{2}\sum_{ij} c_i c_j (y_i - \mu)^T O^T O \Sigma^{-2} O^T [O(y_j - \mu)(y_j - \mu)^T O^T - O\Sigma O^T] O(y_i - \mu)    (56)

Step 4: Special case - 2-dimensional Gaussians

For simplicity we first perform the computations for the 2-dimensional case. Without loss of generality we can assume that Σ^{-1} = diag[λ_1, λ_2] is a diagonal 2 × 2 matrix with diagonal elements λ_1 and λ_2. Let us compute

A'_1 = \frac{1}{2}\sum_{ij} c_i c_j (y_i - \mu)^T \Sigma^{-2}[(y_j - \mu)(y_j - \mu)^T](y_i - \mu)    (57)

Let us set

(y_i - \mu)^T = (a_{1i}, a_{2i})    (58)

Then

A'_1 = \frac{1}{2}\sum_{ij} c_i c_j (a_{1i}, a_{2i})\,\mathrm{diag}[\lambda_1^2, \lambda_2^2]\,(a_{1j}, a_{2j})^T (a_{1j}, a_{2j})(a_{1i}, a_{2i})^T    (59)
= \frac{1}{2}\sum_{ij} c_i c_j (\lambda_1^2 a_{1i} a_{1j} + \lambda_2^2 a_{2i} a_{2j})(a_{1j} a_{1i} + a_{2j} a_{2i})    (60)

= \frac{1}{2}(\lambda_1^2 + \lambda_2^2)\sum_{ij} c_i c_j a_{1i} a_{1j} a_{2i} a_{2j} + \frac{1}{2}\lambda_1^2 \sum_{ij} c_i c_j a_{1i}^2 a_{1j}^2 + \frac{1}{2}\lambda_2^2 \sum_{ij} c_i c_j a_{2i}^2 a_{2j}^2    (61)

= \frac{1}{2}(\lambda_1^2 + \lambda_2^2)\Big(\sum_i c_i a_{1i} a_{2i}\Big)^2 + \frac{1}{2}\lambda_1^2 \Big(\sum_i c_i a_{1i}^2\Big)^2 + \frac{1}{2}\lambda_2^2 \Big(\sum_i c_i a_{2i}^2\Big)^2    (62)

A''_1 = \tilde{A}_1 - A'_1 = -\frac{1}{2}\sum_{ij} c_i c_j (y_i - \mu)^T \Sigma^{-1}(y_i - \mu) = -\frac{1}{2}\Big(\sum_j c_j\Big)\Big(\sum_i c_i \lambda_1 a_{1i}^2 + \sum_i c_i \lambda_2 a_{2i}^2\Big)    (63)

\tilde{A}_2 = \sum_{ij} c_i c_j (\lambda_1 a_{1i} a_{1j} + \lambda_2 a_{2i} a_{2j}) = \lambda_1 \Big(\sum_i c_i a_{1i}\Big)^2 + \lambda_2 \Big(\sum_i c_i a_{2i}\Big)^2    (64)

Next, for our 2-dimensional case we have

\sum_i c_i D = \Big(\sum_i c_i\Big)^2 - \frac{1}{2}\Big(\sum_j c_j\Big)\Big(\sum_i c_i \lambda_1 a_{1i}^2 + \sum_i c_i \lambda_2 a_{2i}^2\Big)    (65)

Therefore

T = \frac{1}{2}(\lambda_1^2 + \lambda_2^2)\Big(\sum_i c_i a_{1i} a_{2i}\Big)^2 + \frac{1}{2}\lambda_1^2 \Big(\sum_i c_i a_{1i}^2\Big)^2 + \frac{1}{2}\lambda_2^2 \Big(\sum_i c_i a_{2i}^2\Big)^2 - \Big(\sum_j c_j\Big)\Big(\sum_i c_i \lambda_1 a_{1i}^2 + \sum_i c_i \lambda_2 a_{2i}^2\Big) + \lambda_1 \Big(\sum_i c_i a_{1i}\Big)^2 + \lambda_2 \Big(\sum_i c_i a_{2i}\Big)^2 + \Big(\sum_i c_i\Big)^2    (66)

In the above equation:

T_1 = \frac{1}{2}(\lambda_1^2 + \lambda_2^2)\Big(\sum_i c_i a_{1i} a_{2i}\Big)^2    (67)

T_2 = \frac{1}{2}\Big(\lambda_1 \sum_i c_i a_{1i}^2 - \sum_i c_i\Big)^2 + \frac{1}{2}\Big(\lambda_2 \sum_i c_i a_{2i}^2 - \sum_i c_i\Big)^2    (68)

T_3 = \lambda_1 \Big(\sum_i c_i a_{1i}\Big)^2 + \lambda_2 \Big(\sum_i c_i a_{2i}\Big)^2    (69)

Step 5: General case - n-dimensional Gaussians

We now perform the computations for the n-dimensional case. Without loss of generality we can assume that Σ^{-1} = diag[λ_1, λ_2, \ldots, λ_n] is a diagonal n × n matrix with diagonal elements λ_1, \ldots, λ_n. Let us compute

A'_1 = \frac{1}{2}\sum_{ij} c_i c_j (y_i - \mu)^T \Sigma^{-2}[(y_j - \mu)(y_j - \mu)^T](y_i - \mu)    (70)

Let us set

(y_i - \mu)^T = (a_{1i}, a_{2i}, \ldots, a_{ni})    (71)

Then

A'_1 = \frac{1}{2}\sum_{ij} c_i c_j (a_{1i}, \ldots, a_{ni})\,\mathrm{diag}[\lambda_1^2, \ldots, \lambda_n^2]\,(a_{1j}, \ldots, a_{nj})^T (a_{1j}, \ldots, a_{nj})(a_{1i}, \ldots, a_{ni})^T    (72)
= \frac{1}{2}\sum_{ij} c_i c_j (\lambda_1^2 a_{1i}, \ldots, \lambda_n^2 a_{ni})(a_{1j}, \ldots, a_{nj})^T \Big(\sum_k a_{kj} a_{ki}\Big)    (73)

= \frac{1}{2}\sum_{ij} c_i c_j \Big(\sum_k \lambda_k^2 a_{ki} a_{kj}\Big)\Big(\sum_k a_{kj} a_{ki}\Big)    (74)

= \frac{1}{2}\sum_{k < l}(\lambda_k^2 + \lambda_l^2)\Big(\sum_i c_i a_{ki} a_{li}\Big)^2 + \frac{1}{2}\sum_k \lambda_k^2 \Big(\sum_i c_i a_{ki}^2\Big)^2    (75)

Similarly (as in the 2-dimensional case) one can compute the other components of T.
Step 6: Invariant transformation points

Here we prove the following

Lemma 4 Let Σ be a diagonal matrix. Then the following holds:
a) T = 0 implies that Σ(C) = Σ and µ(C) = µ for C = 0.
b) Σ(C) = Σ and µ(C) = µ for C = 0 implies that Σ(C) = Σ and µ(C) = µ for any C.
c) Σ(C) = Σ and µ(C) = µ for some C implies T = 0.

Proof of Lemma 4

a) T = 0 → T_3 = 0 → µ = Σ_i c_i y_i / Σ_i c_i → µ(0) = µ. Next, T_2 = 0 → λ_k Σ_i c_i a_{ki}^2 − Σ_i c_i = 0 → λ_k^{-1} = Σ_i c_i y_{ik}^2 / Σ_i c_i − µ_k^2 = Σ(0)_{kk}. Finally, T_1 = 0 → Σ_i c_i (y_{ik} − µ_k)(y_{il} − µ_l) = 0 → Σ_i (c_i y_{ik} y_{il} − c_i y_{ik} µ_l − c_i y_{il} µ_k + c_i µ_k µ_l) = 0 → Σ_i c_i (y_{ik} y_{il} − µ_k µ_l) = 0 → Σ_{kl}(0) = 0. This proves a).

b) It follows from (5) that if µ(C) = µ then

\mu\Big(\sum_i c_i + C\Big) = \sum_i c_i y_i + C\mu \;\Rightarrow\; \mu \sum_i c_i = \sum_i c_i y_i

Adding C'µ to both parts of the above equation for any C' we get

\mu\Big(\sum_i c_i + C'\Big) = \sum_i c_i y_i + C'\mu \;\Rightarrow\; \mu(C') = \mu

This proves b) for µ. Similarly, from (6) and from part b) of the lemma for µ, we have that Σ(C) = Σ implies

\Sigma \sum_i c_i = \sum_i c_i y_i y_i^T - \sum_i c_i \mu\mu^T

Adding C'Σ to both parts of the above equation for any C' we get

\Sigma \sum_i c_i + C'\Sigma = \sum_i c_i y_i y_i^T - \sum_i c_i \mu\mu^T + C'(\mu\mu^T + \Sigma) - C'\mu\mu^T \;\Rightarrow\; \Sigma = \frac{\sum_i c_i y_i y_i^T + C'(\mu\mu^T + \Sigma)}{\sum_i c_i + C'} - \mu\mu^T \;\Rightarrow\; \Sigma = \Sigma(C')

c) µ = µ(0) implies that T_3 = 0. Σ_{kk} = Σ_{kk}(0) implies that T_2 = 0. Finally, Σ_{kl} = Σ_{kl}(0) = 0 for k ≠ l implies that Σ_i c_i (y_{ik} − µ_k)(y_{il} − µ_l) = 0, i.e., T_1 = 0.

We can now finish the proof of the theorem. Since by assumption either µ̂ ≠ µ or Σ̂ ≠ Σ, we have T ≠ 0. Applicability of the linearization principle follows from the fact that if T ≠ 0 in (30) then the left part of equation (9) is not equal to zero. Q.E.D.
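As a numerical check of (30)-(35), the following sketch (illustrative data; diagonal Σ) compares the scaled first-order growth of a linear objective with T_1 + T_2 + T_3:

import numpy as np

rng = np.random.default_rng(2)
n, m = 2, 6
Y = rng.normal(size=(m, n))                  # samples y_i (rows)
a = rng.normal(size=m)                       # coefficients of the linear form
mu = np.zeros(n)
lam = np.array([0.8, 1.6])                   # Sigma^{-1} = diag(lam)
Sigma = np.diag(1.0 / lam)

def gauss(Y, mu, Sigma):
    d = Y - mu
    Si = np.linalg.inv(Sigma)
    q = np.einsum('ij,jk,ik->i', d, Si, d)   # (y_i-mu)^T Si (y_i-mu)
    return np.exp(-q / 2) / np.sqrt((2*np.pi)**n * np.linalg.det(Sigma))

z = gauss(Y, mu, Sigma)
c = a * z                                    # c_i = a_i z_i
A = Y - mu                                   # a_{ki} = A[i, k]

T1 = 0.5 * sum((lam[k]**2 + lam[l]**2) * (c @ (A[:, k] * A[:, l]))**2
               for k in range(n) for l in range(k + 1, n))
T2 = 0.5 * sum((lam[k] * (c @ A[:, k]**2) - c.sum())**2 for k in range(n))
T3 = sum(lam[k] * (c @ A[:, k])**2 for k in range(n))

for C in (1e4, 1e6):
    denom = c.sum() + C
    mu_h = (Y.T @ c + C * mu) / denom                                # eq. (5)
    S = (Y.T * c) @ Y
    Sig_h = (S + C * (np.outer(mu, mu) + Sigma)) / denom \
        - np.outer(mu_h, mu_h)                                       # eq. (6)
    z_h = gauss(Y, mu_h, Sig_h)
    print(C, C * (a @ (z_h - z)), T1 + T2 + T3)  # scaled growth approaches T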
6. NEW GROWTH TRANSFORMATIONS
One can derive new updates for means and variances by applying the EB algorithm of Section 3 after introducing probability constraints for the means and variances as follows. Let us assume that 0 ≤ µ_j ≤ D_j and 0 ≤ σ_j ≤ E_j. Then we can introduce slack variables µ'_j ≥ 0, σ'_j ≥ 0 such that µ_j/D_j + µ'_j/D_j = 1 and σ_j/E_j + σ'_j/E_j = 1. Then we can compute updates as in (1), with c_ij as in (5), (6):
\hat{\mu}_j = D_j\, \frac{\mu_j \sum_i c_{ij}\frac{(y_i - \mu_j)}{\sigma_j^2} + C\mu_j}{\mu_j \sum_i c_{ij}\frac{(y_i - \mu_j)}{\sigma_j^2} + D_j C}

\hat{\sigma}_j = E_j\, \frac{\sum_i c_{ij}\Big[-1 + \frac{(y_i - \mu_j)^2}{\sigma_j^2}\Big] + C\sigma_j}{\sum_i c_{ij}\Big[-1 + \frac{(y_i - \mu_j)^2}{\sigma_j^2}\Big] + E_j C}
If some µ_j < 0, one can make them positive by adding positive constants, compute the updates for the new variables in the new coordinate system, and then return to the old system of coordinates.
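A transcription of the constrained updates above into code (the bounds D_j, E_j, the weights c_i, and the constant C are illustrative, and the formulas follow the reconstruction given above):

import numpy as np

def eb_constrained_update(y, c, mu, sigma, D, E, C):
    """EB updates for one Gaussian under 0 <= mu <= D, 0 <= sigma <= E."""
    g_mu = np.sum(c * (y - mu)) / sigma**2                 # dF/dmu
    g_sig = np.sum(c * (-1.0 + (y - mu)**2 / sigma**2))    # sigma * dF/dsigma
    mu_new = D * (mu * g_mu + C * mu) / (mu * g_mu + D * C)
    sigma_new = E * (g_sig + C * sigma) / (g_sig + E * C)
    return mu_new, sigma_new

y = np.array([0.5, 1.4, 0.2])
c = np.array([0.7, -0.3, 0.4])
print(eb_constrained_update(y, c, mu=0.8, sigma=0.9, D=2.0, E=1.5, C=100.0))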
7. REFERENCES
[1] S. Axelrod, V. Goel, R. Gopinath, P. Olsen, and K. Visweswariah, "Discriminative Training of Subspace Constrained GMMs for Speech Recognition," to be submitted to IEEE Transactions on Speech and Audio Processing.
[2] L. E. Baum and J. A. Eagon, "An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology," Bull. Amer. Math. Soc., vol. 73, pp. 360-363, 1967.
[3] A. Gunawardana and W. Byrne, "Discriminative Speaker Adaptation with Conditional Maximum Likelihood Linear Regression," in Proc. ICASSP, 2002.
[4] P. S. Gopalakrishnan, D. Kanevsky, D. Nahamoo, and A. Nadas, "An inequality for rational functions with applications to some statistical estimation problems," IEEE Trans. Information Theory, vol. 37, no. 1, January 1991.
[5] D. Kanevsky, "Growth Transformations for General Functions," IBM Research Report RC22919 (W0309-163), September 25, 2003.
[6] D. Kanevsky, "Extended Baum transformations for general functions," in Proc. ICASSP, 2004.
[7] Y. Normandin, "An Improved MMIE Training Algorithm for Speaker-Independent, Small Vocabulary, Continuous Speech Recognition," in Proc. ICASSP, pp. 537-540, 1991.
[8] R. Schlüter, W. Macherey, B. Müller, and H. Ney, "Comparison of discriminative training criteria and optimization methods for speech recognition," Speech Communication, vol. 34, pp. 287-310, 2001.
[9] V. Valtchev, P. C. Woodland, and S. J. Young, "Lattice-based Discriminative Training for Large Vocabulary Speech Recognition Systems," Speech Communication, vol. 22, pp. 303-314, 1996.