Implicit Learning of Simpler Output Kernels for Multi-label Prediction
Hanchen Xiong, Sandor Szedmak, Justus Piater
University of Innsbruck, Austria
OVERVIEW

Research in multi-label prediction:
• exploiting and utilizing inter-label dependencies;
• increasingly sophisticated dependencies are used;
• output structural dependencies are "overfit" when the desired ones are simpler;
• a regularization should be added on output structural dependencies.
⇓
Our contributions: Joint SVM
• joint SVM ⇐⇒ structural SVM with linearly decomposable score functions;
• in joint SVM, (non-)linear kernels can be assigned on outputs to capture inter-label dependencies;
• joint SVM shares the same computational complexity as a single SVM;
• when linear output kernels are used, an output-kernel regularization is implicitly added;
• joint SVM yields promising results on 3 image annotation databases.

FROM STRUCTURAL SVM TO JOINT SVM
Structural SVM is an extension of SVM to structured outputs, in which, however, the margin to be maximized is defined as the score gap between the desired output and the first runner-up:
$$\arg\min_{\mathbf{W}\in\mathbb{R}^{\Psi}}\; \frac{1}{2}\|\mathbf{W}\|_F^2 + C\sum_{i=1}^{m}\,\max_{\mathbf{y}'\in\mathcal{Y}}\left\{ d(\mathbf{y}^{(i)},\mathbf{y}') - \Delta F(\mathbf{y}^{(i)},\mathbf{y}')\right\} \tag{1}$$
where $\Delta F(\mathbf{y}^{(i)},\mathbf{y}') = F(\mathbf{x}^{(i)},\mathbf{y}^{(i)};\mathbf{W}) - F(\mathbf{x}^{(i)},\mathbf{y}';\mathbf{W})$, the score function is $F(\mathbf{x}^{(i)},\mathbf{y}^{(i)};\mathbf{W}) = \langle \mathbf{W},\, \phi(\mathbf{x}^{(i)})\otimes\mathbf{y}^{(i)}\rangle_F$, $\langle\cdot,\cdot\rangle_F$ denotes the Frobenius product, $\|\mathbf{W}\|_F$ is the Frobenius norm of the matrix $\mathbf{W}$, and the Hamming distance $d(\mathbf{y}^{(i)},\mathbf{y}')$ is used on the outputs. Because of linear decomposability, (1) can be rewritten as:
$$\arg\min_{\mathbf{W}\in\mathbb{R}^{H_\phi\times T}}\; \frac{1}{2}\|\mathbf{W}\|_F^2 + C\sum_{i=1}^{m}\sum_{t=1}^{T}\,\max_{y_t'\in\{-1,+1\}}\left\{ d(y_t^{(i)},y_t') - \Delta F(y_t^{(i)},y_t')\right\}$$
Since, for each label $t$, the inner maximization is attained either at $y_t'=y_t^{(i)}$ (which yields 0) or at $y_t'=-y_t^{(i)}$, splitting $\mathbf{W}$ into per-label weight vectors $\mathbf{w}_1,\ldots,\mathbf{w}_T$ gives:
$$\arg\min_{\mathbf{w}_1,\cdots,\mathbf{w}_T\in\mathbb{R}^{H_\phi}}\; \sum_{t=1}^{T}\left\{ \frac{1}{2}\|\mathbf{w}_t\|^2 + C\sum_{i=1}^{m}\max\left\{0,\; d(y_t^{(i)},-y_t^{(i)}) - \Delta F(y_t^{(i)},-y_t^{(i)})\right\}\right\} \tag{2}$$
With the Hamming distance, $d(y_t^{(i)},-y_t^{(i)})=1$, and the linear score makes $\Delta F(y_t^{(i)},-y_t^{(i)})$ proportional to $y_t^{(i)}\mathbf{w}_t^{\top}\phi(\mathbf{x}^{(i)})$, so (2) reduces (after rescaling $\mathbf{w}_t$) to $T$ standard binary SVM objectives:

$$\arg\min_{\mathbf{w}_1,\cdots,\mathbf{w}_T\in\mathbb{R}^{H_\phi}}\; \sum_{t=1}^{T}\left\{ \frac{1}{2}\|\mathbf{w}_t\|^2 + 2C\sum_{i=1}^{m}\max\left\{0,\; 1 - y_t^{(i)}\mathbf{w}_t^{\top}\phi(\mathbf{x}^{(i)})\right\}\right\}$$
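To make this per-label form concrete, the following is a minimal sketch on synthetic data, using scikit-learn's LinearSVC as a stand-in for each per-label hinge-loss SVM (neither the data nor this tooling is prescribed by the poster):

import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical multi-label data: m samples, T binary labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                      # input features phi(x)
Y = np.where(rng.normal(size=(100, 5)) > 0, 1, -1)  # label matrix, one column per label t

# The decomposed objective above is T independent hinge-loss SVMs, one w_t per label.
per_label_svms = []
for t in range(Y.shape[1]):
    clf = LinearSVC(C=1.0, loss="hinge")
    clf.fit(X, Y[:, t])
    per_label_svms.append(clf)

# Multi-label prediction: stack the T per-label decisions.
Y_pred = np.column_stack([clf.predict(X) for clf in per_label_svms])

Joint SVM, introduced next, solves these $T$ problems through a single formulation, which is what later allows (non-)linear output kernels.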
It can be seen in (2) that, with linearly decomposable score functions and output distances, structural SVM on multi-label learning is equivalent to learning $T$ SVMs jointly. This yields the Joint SVM formulation:
$$\arg\min_{\mathbf{W}\in\mathbb{R}^{H_\psi\times H_\phi}}\; \frac{1}{2}\|\mathbf{W}\|_F^2 + C\sum_{i=1}^{m}\bar{\xi}^{(i)} \qquad \text{s.t.}\;\; \big\langle \psi(\mathbf{y}^{(i)}),\, \mathbf{W}\phi(\mathbf{x}^{(i)})\big\rangle \ge 1-\bar{\xi}^{(i)},\;\; \bar{\xi}^{(i)}\ge 0,\;\; i\in\{1,\ldots,m\} \tag{3}$$
where $\mathbf{y}^{(i)} = [y_1^{(i)},\ldots,y_T^{(i)}]$ and $\mathbf{W} = [\tfrac{\mathbf{w}_1^{\top}}{T};\ldots;\tfrac{\mathbf{w}_T^{\top}}{T}]^{\top}$. The corresponding dual form of Joint SVM is:
$$\arg\max_{\alpha_1,\cdots,\alpha_m}\; \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j\, K_\psi(\mathbf{y}^{(i)},\mathbf{y}^{(j)})\, K_\phi(\mathbf{x}^{(i)},\mathbf{x}^{(j)}) \qquad \text{s.t.}\;\; \forall i,\; 0\le\alpha_i\le C \tag{4}$$

The dual involves only the $m$ coefficients $\alpha_i$ and the $m\times m$ product kernel matrix; therefore, Joint SVM has the same computational complexity as a single regular SVM.
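As an illustration of that claim, here is a minimal sketch of solving the dual (4) numerically; the synthetic data, the choice of linear kernels for $K_\phi$ and $K_\psi$, and the projected-gradient solver are assumptions made for the example, not part of the poster:

import numpy as np

rng = np.random.default_rng(0)
m, d, T = 100, 20, 5
X = rng.normal(size=(m, d))                           # inputs x^(i)
Y = np.where(rng.normal(size=(m, T)) > 0, 1.0, -1.0)  # outputs y^(i)
C = 1.0

K_phi = X @ X.T    # input kernel K_phi(x^(i), x^(j))
K_psi = Y @ Y.T    # (linear) output kernel K_psi(y^(i), y^(j))
K = K_psi * K_phi  # element-wise product kernel appearing in (4)

# Projected gradient ascent on  sum_i alpha_i - 1/2 alpha^T K alpha
# subject to the box constraints 0 <= alpha_i <= C.
alpha = np.zeros(m)
step = 1.0 / (np.linalg.norm(K, 2) + 1e-12)
for _ in range(2000):
    grad = 1.0 - K @ alpha
    alpha = np.clip(alpha + step * grad, 0.0, C)

# Stationarity of the primal (3) gives W = sum_i alpha_i psi(y^(i)) phi(x^(i))^T
# (here psi and phi are identity maps).
W = (alpha[:, None] * Y).T @ X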
IMPLICIT REGULARIZATION ON LINEAR OUTPUT KERNELS
When a linear output kernel is used to capture pairwise dependencies of labels, the output vectors are linearly mapped as $\psi(\mathbf{y}) = \mathbf{P}\mathbf{y}$, so that $K_\psi^{\mathrm{Lin}}(\mathbf{y}^{(i)},\mathbf{y}^{(j)}) = \mathbf{y}^{(i)\top}\boldsymbol{\Omega}\,\mathbf{y}^{(j)}$ with $\boldsymbol{\Omega} = \mathbf{P}^{\top}\mathbf{P}$. By denoting $\mathbf{U} = \mathbf{P}^{\top}\mathbf{W}$, we have:
$$\arg\min_{\mathbf{W}\in\mathbb{R}^{H_\psi\times H_\phi}}\; \frac{1}{2}\|\mathbf{W}\|_F^2 + C\sum_{i=1}^{m}\bar{\xi}^{(i)} \qquad \text{s.t.}\;\; \big\langle \mathbf{y}^{(i)},\, \mathbf{U}\phi(\mathbf{x}^{(i)})\big\rangle \ge 1-\bar{\xi}^{(i)},\;\; \bar{\xi}^{(i)}\ge 0,\;\; i\in\{1,\ldots,m\}$$
We use a compact regularization for both $\mathbf{W}$ and $\boldsymbol{\Omega}$,
$$\frac{1}{2}\big\|\mathbf{W}^{\top}\boldsymbol{\Omega}^{\top}\mathbf{W}\big\|_F \tag{5}$$
resulting in:
$$\arg\min_{\mathbf{U}\in\mathbb{R}^{H_\psi\times H_\phi}}\; \frac{1}{2}\|\mathbf{U}\|_F^2 + C\sum_{i=1}^{m}\bar{\xi}^{(i)} \qquad \text{s.t.}\;\; \big\langle \mathbf{y}^{(i)},\, \mathbf{U}\phi(\mathbf{x}^{(i)})\big\rangle \ge 1-\bar{\xi}^{(i)},\;\; \bar{\xi}^{(i)}\ge 0,\;\; i\in\{1,\ldots,m\} \tag{6}$$
Remarkably, a linear output kernel is implicitly learned and absorbed into the weight matrix, and a regularization on the output kernel is implicitly added as well.
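A small numerical check of the identity behind this absorption; the shapes and the random $\mathbf{P}$, $\mathbf{W}$ are hypothetical, chosen only to illustrate $\langle \mathbf{P}\mathbf{y}, \mathbf{W}\phi(\mathbf{x})\rangle = \langle \mathbf{y}, \mathbf{U}\phi(\mathbf{x})\rangle$ with $\mathbf{U}=\mathbf{P}^{\top}\mathbf{W}$:

import numpy as np

rng = np.random.default_rng(0)
T, H_psi, H_phi = 5, 7, 20
P = rng.normal(size=(H_psi, T))      # linear output map psi(y) = P y
W = rng.normal(size=(H_psi, H_phi))  # Joint SVM weight matrix
U = P.T @ W                          # absorbed weights U = P^T W

y = np.where(rng.normal(size=T) > 0, 1.0, -1.0)
phi_x = rng.normal(size=H_phi)

# The constraint value is unchanged, so the linear output kernel
# (Omega = P^T P) never has to be formed or chosen explicitly.
lhs = (P @ y) @ (W @ phi_x)
rhs = y @ (U @ phi_x)
assert np.isclose(lhs, rhs)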
EXPERIMENTAL RESULTS AND COMPARISON
[Figure: histograms of label correlation (log scale) for training and test labels on the corel5k, espgame, and iaprtc12 data sets.]
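For reference, a minimal sketch of how such label-correlation histograms could be produced from a binary image-annotation label matrix; the use of Pearson correlation between label columns is an assumption here, since the poster does not spell out the exact statistic:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical binary annotation matrix: m images x T labels.
rng = np.random.default_rng(0)
Y = (rng.random(size=(500, 50)) < 0.1).astype(float)

# Pairwise correlations between label columns (upper triangle, no diagonal).
corr = np.corrcoef(Y.T)
iu = np.triu_indices_from(corr, k=1)
values = corr[iu]
values = values[~np.isnan(values)]   # drop pairs involving constant label columns

plt.hist(values, bins=50, log=True)  # log-scale counts, as in the figure
plt.xlabel("label correlation")
plt.ylabel("count (log scale)")
plt.show()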