Appendix A. Derivation of the $\mu_k$ update rule
In Eq. (21) we have expressed the scaled conditional distribution of $\mu_k$ in terms of $\kappa''$ and $\tau''$, where $\kappa'' = \kappa + \beta n_k$ and $\tau'' = \tau + \beta \sum_{i \in N_k} x_i$. As $\beta \to \infty$, note that $\kappa'' \to \infty$ and $\tau''/\kappa'' \to \frac{1}{n_k} \sum_{i \in N_k} x_i$. Therefore, as $\beta$ goes to infinity, the conditional distribution of $\mu_k$ concentrates on the empirical mean of the data in cluster $k$.
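The limit can be made explicit by dividing the numerator and the denominator of $\tau''/\kappa''$ by $\beta$, using only the definitions above:
\[
\frac{\tau''}{\kappa''}
 = \frac{\tau + \beta \sum_{i \in N_k} x_i}{\kappa + \beta n_k}
 = \frac{\tau/\beta + \sum_{i \in N_k} x_i}{\kappa/\beta + n_k}
 \;\longrightarrow\;
 \frac{1}{n_k} \sum_{i \in N_k} x_i
 \qquad (\beta \to \infty),
\]
since $\tau/\beta \to 0$ and $\kappa/\beta \to 0$ while $n_k$ stays fixed.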
Appendix B. Proof of Theorem 1

In this section we give a proof of Theorem 1, which states that each iteration of Algorithm 1 decreases (or leaves unchanged) the loss function $L_{sv}(z, \mu, \eta)$ in Eq. (35). The proof is similar to the analysis of the K-means algorithm.
Proof. Suppose there are $K$ components before an iteration, with parameters $\mu_{1:K}$ and classifiers $\eta_{1:K}$. We prove that after each update of $z$, $\mu$, $\omega$ and $\eta$ the loss function $L_{sv}$ does not increase.
Update of $z_i$: Suppose $z_i^{(g)} = k$ and $z_i^{(g+1)} = k'$. If $k' \le K$ (i.e., no new component is created), we have
\[
\Delta L_{sv} = Q_i(k') - Q_i(k),
\]
where $Q_i(k)$ is the cost function defined in Eqs. (31) and (32); otherwise (i.e., $k' = K + 1$), putting $\mu_{K+1} = x_i$ and $\eta_{K+1} = \eta^*$, we again have
\[
\Delta L_{sv} = Q_i(k') - Q_i(k).
\]
By the update rule of $z_i$, it is immediate that $Q_i(k') \le Q_i(k)$, and hence the loss does not increase.
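Written out, and under the reading that the update rule chooses $z_i^{(g+1)}$ to minimize $Q_i(\cdot)$ over all candidate components (including the newly created one), the inequality chain is
\[
Q_i(k') = \min_{1 \le j \le K+1} Q_i(j) \le Q_i(k)
\quad\Longrightarrow\quad
\Delta L_{sv} = Q_i(k') - Q_i(k) \le 0.
\]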
Update of $\mu_k$: Let $X_k = \{x_i \mid z_i = k\}$ denote all instances in component $k$. Suppose $\mu_k^{(g)} = \mu$ and $\mu_k^{(g+1)} = \mu'$. We then have
\[
\Delta L_{sv} = s \cdot \big( h(\mu) - h(\mu') \big),
\]
where
\[
h(\mu) = \sum_{x_i \in X_k} \big( -D_\phi(x_i, \mu) + \log f_\phi(x_i) \big) = \log p(X_k \mid \mu).
\]
Since $\mu' = \frac{1}{|X_k|} \sum_{x_i \in X_k} x_i$ is the MLE of $p(X_k \mid \mu)$, we have $h(\mu') \ge h(\mu)$ and hence the loss $L_{sv}$ does not increase.
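Spelled out, and assuming (as in the main text) that the scaling factor $s$ is positive, this step reads
\[
h(\mu') = \max_{\mu} \log p(X_k \mid \mu) \ge h(\mu)
\quad\Longrightarrow\quad
\Delta L_{sv} = s \cdot \big( h(\mu) - h(\mu') \big) \le 0,
\]
where the first equality holds because the empirical mean of $X_k$ maximizes the log-likelihood $\log p(X_k \mid \mu)$.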
Update of $\eta$ and $\omega$: For each component $k$, the update of $\eta_k$ and $\omega$ is the same as the EM algorithm used in (Polson and Scott 2011) to train an SVM model on the data $X_k$. As a result, after an update of $\eta_k$ and $\omega$, the learning loss
\[
\frac{\|\eta_k\|^2}{2\nu^2} + c \cdot \sum_{x_i \in X_k} (\zeta_{ik})_+
\]
does not increase, while the remaining part of the loss $L_{sv}$ remains the same.
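To connect this step with (Polson and Scott 2011), the following is a minimal Python/NumPy sketch of one such EM iteration for a single component's SVM. It is an illustration only: it uses the standard data-augmentation updates with hinge margin 1 and absorbs the cost weight $c$ into the prior scale $\nu$, so the constants differ from the paper's equations, and the function name and signature are our own.

    import numpy as np

    def em_svm_step(X, y, eta, nu=1.0, eps=1e-8):
        """One EM iteration for an L2-regularized SVM via data augmentation.

        X: (n, d) instances of component k; y: (n,) labels in {-1, +1};
        eta: (d,) current classifier. Margin and cost scaling are illustrative.
        """
        # E-step: expected inverse latent scales, omega_i = 1 / |1 - y_i * eta^T x_i|
        margin = 1.0 - y * (X @ eta)
        omega = 1.0 / np.maximum(np.abs(margin), eps)

        # M-step: weighted ridge solve,
        # (X^T diag(omega) X + I / nu^2) eta = X^T ((omega + 1) * y)
        A = (X.T * omega) @ X + np.eye(X.shape[1]) / nu**2
        b = X.T @ ((omega + 1.0) * y)
        return np.linalg.solve(A, b)

Each such iteration does not increase the corresponding MAP objective, which is the monotonicity property used in the proof.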
Appendix C. Extension to the Multi-class Scenario

In the multi-class scenario, each class $y$ in a particular component $k$ is associated with a classifier $\eta_{k,y}$, and we follow (Crammer and Singer 2001) to define the multi-class hinge loss $\phi_m(y_i \mid z_i, \eta)$ as
\[
\phi_m(y_i \mid z_i, \eta) = \exp\Big( -2c \, \max_{y} \big( \Delta_{y_i}^{y} + \eta_{z_i, y}^\top x_i - \eta_{z_i, y_i}^\top x_i \big) \Big),
\]
where $\Delta_{y_i}^{y} = l \cdot \delta_{y, y_i}$ is the loss imposed on predicting $y$. As a result, when the component $k$ and the class $y$ are fixed, the conditional distribution of $\eta_{k,y}$ can be expressed as
\[
q(\eta_{k,y} \mid z) \propto \exp\!\left( -\frac{\|\eta_{k,y}\|^2}{2\nu^2} - 2c \sum_i \delta_{z_i, k} \big( s_{ik}^{y} \zeta_{ik}^{y} - s_{ik}^{y} \eta_{k,y}^\top x_i \big)_+ \right), \tag{38}
\]
where $s_{ik}^{y} = 1$ if $y = y_i$, $s_{ik}^{y} = -1$ if $y \neq y_i$, and $\zeta_{ik}^{y} = \max_{y' \neq y} \big( \Delta_{y_i}^{y'} + \eta_{k,y'}^\top x_i \big) - \Delta_{y_i}^{y}$.
It can be seen that the conditional distribution of $\eta_{k,y}$ expressed in Eq. (38) is very similar to the one in Eq. (18), except that we replace $l$ and $y_i$ with $s_{ik}^{y} \zeta_{ik}^{y}$ and $s_{ik}^{y}$, respectively. The Gibbs updates and the updates in M$^2$DPM of $\eta$ and $\omega$ follow immediately. The updates for $z$ and $\mu$ also remain nearly the same, except that we replace the hinge loss $\phi(y_i \mid z_i, \eta)$ with its multi-class version $\phi_m(y_i \mid z_i, \eta)$.
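To make these definitions concrete, here is a small Python/NumPy sketch that evaluates $\phi_m$, $s_{ik}^{y}$ and $\zeta_{ik}^{y}$ for a single instance assigned to component $k$. The function name and the dense per-class layout of the classifiers are our own illustrative choices; the index conventions follow the definitions above.

    import numpy as np

    def multiclass_quantities(x_i, y_i, eta_k, l=1.0, c=1.0):
        """Evaluate phi_m, s_{ik}^y and zeta_{ik}^y for one instance in component k.

        x_i: (d,) feature vector; y_i: index of the true class;
        eta_k: (Y, d) matrix stacking the per-class classifiers eta_{k,y}.
        """
        Y = eta_k.shape[0]
        scores = eta_k @ x_i                        # eta_{k,y}^T x_i for every class y
        delta = l * (np.arange(Y) == y_i)           # Delta_{y_i}^y = l * delta_{y, y_i}

        # phi_m(y_i | z_i, eta) = exp(-2c * max_y (Delta_{y_i}^y + score_y - score_{y_i}))
        phi_m = np.exp(-2.0 * c * np.max(delta + scores - scores[y_i]))

        s = np.where(np.arange(Y) == y_i, 1.0, -1.0)    # s_{ik}^y
        zeta = np.empty(Y)
        for y in range(Y):
            others = np.arange(Y) != y
            # zeta_{ik}^y = max_{y' != y}(Delta_{y_i}^{y'} + eta_{k,y'}^T x_i) - Delta_{y_i}^y
            zeta[y] = np.max(delta[others] + scores[others]) - delta[y]
        return phi_m, s, zeta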