Appendix A. Derivation of the µk update rule
In Eq. (21) we have expressed the scaled conditional distribution of $\mu_k$ in terms of $\kappa''$ and $\tau''$, where $\kappa'' = \kappa + \beta n_k$ and $\tau'' = \tau + \beta \sum_{i \in N_k} x_i$. As $\beta \to \infty$, note that $\kappa'' \to \infty$ and $\tau''/\kappa'' \to \frac{1}{n_k} \sum_{i \in N_k} x_i$. Therefore, as $\beta$ goes to infinity, the conditional distribution of $\mu_k$ concentrates on the empirical mean of the data in cluster $k$.
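For completeness, the limit used in the last step follows directly from the two definitions above (this intermediate step is only a restatement of them, not an addition to the derivation):
\[
\frac{\tau''}{\kappa''} = \frac{\tau + \beta \sum_{i \in N_k} x_i}{\kappa + \beta n_k}
= \frac{\tau/\beta + \sum_{i \in N_k} x_i}{\kappa/\beta + n_k}
\;\xrightarrow{\beta \to \infty}\; \frac{1}{n_k} \sum_{i \in N_k} x_i .
\]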
Appendix B. Proof of Theorem 1
In this section, we give a proof of Theorem 1, which states that each iteration of Algorithm 1 decreases (or leaves unchanged) the loss function $L_{sv}(z, \mu, \eta)$ in Eq. (35). The proof is similar to the analysis of the K-means algorithm.
Proof. Suppose there are $K$ components before an iteration, with parameters $\mu_{1:K}$ and classifiers $\eta_{1:K}$. We prove that after each update of $z$, $\mu$, $\omega$ and $\eta$ the loss function $L_{sv}$ does not increase.
Update of $z_i$: Suppose $z_i^{(g)} = k$ and $z_i^{(g+1)} = k'$. If $k' \le K$ (i.e., no new component is created), we have
\[
\Delta L_{sv} = Q_i(k') - Q_i(k),
\]
where $Q_i(k)$ is the cost function defined in Eqs. (31) and (32); otherwise (i.e., $k' = K+1$), putting $\mu_{K+1} = x_i$ and $\eta_{K+1} = \eta^*$, we again have
\[
\Delta L_{sv} = Q_i(k') - Q_i(k).
\]
By the update rule of $z_i$, which assigns $x_i$ to the component with the smallest cost, we have $Q_i(k') \le Q_i(k)$, and hence the loss does not increase.
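As an illustration only (not the paper's implementation), the non-increase argument for the $z$ update can be phrased in code; costs_for_instance is a hypothetical helper that returns $Q_i(1), \dots, Q_i(K+1)$ computed from Eqs. (31) and (32), the last entry being the cost of opening a new component with $\mu_{K+1} = x_i$ and $\eta_{K+1} = \eta^*$:

import numpy as np

def update_assignment(i, z, costs_for_instance):
    # q[k] holds Q_i(k) for the K existing components plus the candidate
    # new component K+1 (0-indexed here).
    q = costs_for_instance(i)
    k_old = z[i]
    k_new = int(np.argmin(q))      # assign x_i to the cheapest component
    z[i] = k_new
    # Delta L_sv = Q_i(k_new) - Q_i(k_old) <= 0 by the choice of k_new
    assert q[k_new] <= q[k_old]
    return z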
Update of $\mu_k$: Let $X^k = \{x_i \mid z_i = k\}$ denote all instances in component $k$. Suppose $\mu_k^{(g)} = \mu$ and $\mu_k^{(g+1)} = \mu'$. We then have
\[
\Delta L_{sv} = s \cdot \big( h(\mu) - h(\mu') \big),
\]
where
\[
h(\mu) = \sum_{x_i \in X^k} \big[ -D_\varphi(x_i, \mu) + \log f_\varphi(x_i) \big] = \log p(X^k \mid \mu).
\]
Since $\mu' = \frac{1}{|X^k|} \sum_{x_i \in X^k} x_i$ is the maximum-likelihood estimate of $\mu$ given $X^k$, we have $h(\mu') \ge h(\mu)$, and hence the loss $L_{sv}$ does not increase.
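The inequality $h(\mu') \ge h(\mu)$ can also be seen directly from a standard property of Bregman divergences (stated here for completeness, not taken from the paper): the $\log f_\varphi(x_i)$ terms do not depend on $\mu$, and the empirical mean minimizes the total divergence, since
\[
\sum_{x_i \in X^k} D_\varphi(x_i, \mu) - \sum_{x_i \in X^k} D_\varphi(x_i, \mu')
= |X^k| \, D_\varphi(\mu', \mu) \ge 0
\quad \text{for } \mu' = \frac{1}{|X^k|} \sum_{x_i \in X^k} x_i .
\]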
Update of $\eta$ and $\omega$: For each component $k$, the update of $\eta_k$ and $\omega$ is the same as the EM algorithm used in (Polson and Scott 2011) to train an SVM model on the data $X^k$. As a result, after an update of $\eta_k$ and $\omega$, the learning loss
\[
\frac{\|\eta_k\|^2}{2\nu^2} + c \cdot \sum_{x_i \in X^k} (\zeta_i^k)_+
\]
does not increase, while the remaining part of the loss $L_{sv}$ remains the same.
Appendix C. Extension to the Multi-class scenario
In the multi-class scenario, each class $y$ in a particular component $k$ is associated with a classifier $\eta_{k,y}$, and we follow (Crammer and Singer 2001) to define the multi-class hinge loss $\phi_m(y_i \mid z_i, \eta)$ as
\[
\phi_m(y_i \mid z_i, \eta) = \exp\!\Big( -2c \, \max_y \big( \Delta_y^{y_i} + \eta_{z_i, y}^{\top} x_i - \eta_{z_i, y_i}^{\top} x_i \big) \Big),
\]
where $\Delta_y^{y_i} = l \cdot \delta_{y, y_i}$ is the loss imposed on predicting $y$. As a result, when the component $k$ and the class $y$ are fixed, the conditional distribution of $\eta_{k,y}$ can be expressed as
\[
q(\eta_{k,y} \mid z) \;\propto\; \exp\!\Bigg( -\frac{\|\eta_{k,y}\|^2}{2\nu^2} - 2c \sum_i \delta_{z_i, k} \big( s_{ik}^{y} \zeta_{ik}^{y} - s_{ik}^{y} \, \eta_{k,y}^{\top} x_i \big)_+ \Bigg), \qquad (38)
\]
where $s_{ik}^{y} = 1$ if $y = y_i$, $s_{ik}^{y} = -1$ if $y \neq y_i$, and $\zeta_{ik}^{y} = \max_{y' \neq y} \big( \Delta_{y'}^{y_i} + \eta_{k,y'}^{\top} x_i \big) - \Delta_{y}^{y_i}$.
It can be seen that the conditional distribution of $\eta_{k,y}$ expressed in Eq. (38) is very similar to the one in Eq. (18), except that we replace $l$ and $y_i$ with $s_{ik}^{y} \zeta_{ik}^{y}$ and $s_{ik}^{y}$, respectively. The Gibbs updates and the updates in M²DPM of $\eta$ and $\omega$ follow immediately. The updates for $z$ and $\mu$ also remain nearly the same, except that we replace the hinge loss $\phi(y_i \mid z_i, \eta)$ with its multi-class version $\phi_m(y_i \mid z_i, \eta)$.
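To make the notation concrete, here is a small NumPy sketch of $\phi_m$ and of $s_{ik}^{y}$, $\zeta_{ik}^{y}$ exactly as defined in this appendix (the helper names and the vector delta, with delta[y] standing for $\Delta_y^{y_i}$, are illustrative and not part of the paper):

import numpy as np

def multiclass_hinge_pseudolikelihood(x_i, y_i, eta_k, c, delta):
    # phi_m(y_i | z_i = k, eta): eta_k stores one classifier eta_{k,y} per row,
    # delta[y] is Delta_y^{y_i}, the loss imposed on predicting class y.
    scores = eta_k @ x_i                    # eta_{k,y}^T x_i for every class y
    margin = delta + scores - scores[y_i]   # Delta_y^{y_i} + eta_{k,y}^T x_i - eta_{k,y_i}^T x_i
    return np.exp(-2.0 * c * np.max(margin))

def s_and_zeta(x_i, y_i, y, eta_k, delta):
    # s^y_{ik} and zeta^y_{ik} as they appear in Eq. (38).
    s = 1.0 if y == y_i else -1.0
    scores = eta_k @ x_i
    rival = max(delta[yp] + scores[yp] for yp in range(len(scores)) if yp != y)
    zeta = rival - delta[y]
    return s, zeta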