Polynomial Networks and Factorization Machines:
New Insights and Efficient Training Algorithms
Mathieu Blondel
NTT Communication Science Laboratories
Kyoto, Japan
Joint work with M. Ishihata, A. Fujino and
N. Ueda
Presented at ICML 2016
2016/8/10
Supervised learning with polynomials
• From {(x_i, y_i)}_{i=1}^n, with x_i ∈ R^d and y_i ∈ R, learn a polynomial
  ŷ : R^d → R
• Motivation
  ◦ Universality: polynomials can approximate any ŷ : R^d → R arbitrarily
    well on a compact subset of R^d (Stone–Weierstrass theorem)
  ◦ Interpretability: feature combinations are meaningful in many
    applications (NLP, bioinformatics, etc.)
Polynomial regression
• Assign a weight to each feature combination:

  ŷ_PR(x; w, W) := ⟨w, x⟩ + Σ_{j'>j} W_{j,j'} x_j x_{j'}

  where w ∈ R^d and W ∈ R^{d×d}
• Pro: reduces to a simple linear model
• Con: does not scale well to high-dimensional data (see the sketch below)
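A minimal NumPy sketch of this prediction rule (the name predict_pr is ours, not from the talk); the double loop makes the O(d²) cost per prediction explicit:

```python
import numpy as np

def predict_pr(x, w, W):
    # ŷ_PR(x; w, W) = ⟨w, x⟩ + Σ_{j'>j} W[j, j'] x_j x_{j'} — O(d²) time
    d = len(x)
    pred = w @ x
    for j in range(d):
        for jp in range(j + 1, d):
            pred += W[j, jp] * x[j] * x[jp]
    return pred
```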
Kernel methods for polynomial regression
• Use a polynomial kernel so as to implicitly map the data to feature
  combinations via the kernel trick
• Predictions are computed by

  ŷ_KM(x; α) := Σ_{i=1}^n α_i K(x_i, x)

  where α ∈ R^n and K is set to P^m_γ(x_i, x) := (γ + ⟨x_i, x⟩)^m
• Note the linear dependence on the training set size!
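A sketch of this predictor under the same notation (predict_km is a name we introduce); the pass over all n training points shows the O(dn) prediction cost:

```python
import numpy as np

def predict_km(x, X_train, alpha, gamma=1.0, m=2):
    # ŷ_KM(x; α) = Σ_i α_i (γ + ⟨x_i, x⟩)^m — one pass over the n
    # training points, hence O(dn) per prediction
    return alpha @ (gamma + X_train @ x) ** m
```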
Factorization machines (Rendle 2010)
• Recall polynomial regression:

  ŷ_PR(x; w, W) := ⟨w, x⟩ + Σ_{j'>j} W_{j,j'} x_j x_{j'}

• In FMs, we replace W ∈ R^{d×d} by a factorized matrix:

  ŷ_FM(x; w, P) := ⟨w, x⟩ + Σ_{j'>j} (PPᵀ)_{j,j'} x_j x_{j'}

  where w ∈ R^d and P ∈ R^{d×k} with k ≪ d
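Thanks to the factorization, the quadratic term can be evaluated in O(dk) rather than O(d²), using the standard identity Σ_{j'>j} (PPᵀ)_{j,j'} x_j x_{j'} = ½ Σ_s (⟨p_s, x⟩² − Σ_j P_{js}² x_j²); a sketch (names are ours):

```python
import numpy as np

def predict_fm(x, w, P):
    # O(dk) FM prediction via the factorized quadratic term
    xp = P.T @ x                                        # ⟨p_s, x⟩ for each s
    quad = 0.5 * (xp @ xp - ((P ** 2).T @ (x ** 2)).sum())
    return w @ x + quad
```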
FMs: pros and cons
• Pro: reduced number of parameters to estimate:
  O(dk) instead of O(d²) (PR) or O(n) (KM)
• Pro: faster predictions:
  O(dk) instead of O(d²) (PR) or O(dn) (KM)
• Pro: ability to infer the weight of unobserved feature combinations
  (useful for recommender systems)
• Con: learning P involves a non-convex problem
Proposed framework
• We consider models of the form

  ŷ_K(x; λ, P) := Σ_{s=1}^k λ_s K(p_s, x)

  where λ ∈ R^k and P ∈ R^{d×k} with columns p_1, ..., p_k
  (a generic sketch follows below)
• We focus on two kernels:
  ◦ ANOVA kernel (recovers factorization machines)
  ◦ Homogeneous polynomial kernel (recovers “polynomial networks”)
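The model is simply a weighted sum of kernel evaluations against the k columns of P; a minimal generic sketch (our own helper, with the kernel passed in as a function):

```python
def predict_k(x, lam, P, kernel):
    # ŷ_K(x; λ, P) = Σ_s λ_s K(p_s, x), with p_s the s-th column of P
    return sum(l * kernel(p, x) for l, p in zip(lam, P.T))
```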
Polynomial and ANOVA kernels (m = 2)
• Homogeneous polynomial kernel

  H²(p, x) := ⟨p, x⟩² = Σ_{i,j=1}^d p_i x_i p_j x_j

  Uses all feature combinations: x_i² and x_i x_j for i ≠ j
• ANOVA kernel (Vapnik 1998)

  A²(p, x) := Σ_{j>i} p_i x_i p_j x_j

  Uses distinct feature combinations: x_i x_j for i ≠ j
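The two kernels differ only in the squared terms p_i² x_i²; a quick numerical check (our own sketch) using the identity A²(p, x) = ½(⟨p, x⟩² − Σ_i p_i² x_i²):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
p, x = rng.normal(size=5), rng.normal(size=5)

H2 = (p @ x) ** 2                                # keeps all pairs (i, j)
A2 = 0.5 * ((p @ x) ** 2 - (p * x) @ (p * x))    # drops the i = j terms

brute = sum(p[i] * x[i] * p[j] * x[j] for i, j in combinations(range(5), 2))
assert np.isclose(A2, brute)
```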
Polynomial and ANOVA kernels (m = 3)
• Homogeneous polynomial kernel

  H³(p, x) := ⟨p, x⟩³ = Σ_{i,j,k=1}^d p_i x_i p_j x_j p_k x_k

  Uses all feature combinations: x_i³, x_i² x_j, and x_i x_j x_k
• ANOVA kernel (Vapnik 1998)

  A³(p, x) := Σ_{k>j>i} p_i x_i p_j x_j p_k x_k

  Uses distinct feature combinations: x_i x_j x_k for distinct i, j, k
Polynomial and ANOVA kernels (m ≥ 2)
• Homogeneous polynomial kernel

  H^m(p, x) := ⟨p, x⟩^m = Σ_{j_1,...,j_m=1}^d p_{j_1} x_{j_1} ⋯ p_{j_m} x_{j_m}

  Uses all feature combinations (with replacement)
• ANOVA kernel (Vapnik 1998)

  A^m(p, x) := Σ_{j_m>⋯>j_1} p_{j_1} x_{j_1} ⋯ p_{j_m} x_{j_m}

  Uses distinct feature combinations (without replacement); see the
  evaluation sketch below
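A^m(p, x) is the m-th elementary symmetric polynomial of the element-wise products z_j = p_j x_j, so it can be evaluated in O(dm) time with a standard dynamic program; a sketch based on that observation (not the talk's own code):

```python
import numpy as np
from itertools import combinations

def anova_kernel(p, x, m):
    # a[t] holds A^t over the features seen so far; updating the highest
    # degrees first ensures each z_j is used at most once per combination
    z = p * x
    a = np.zeros(m + 1)
    a[0] = 1.0
    for zj in z:
        for t in range(m, 0, -1):
            a[t] += zj * a[t - 1]
    return a[m]

# brute-force check against the definition
rng = np.random.default_rng(0)
p, x, m = rng.normal(size=6), rng.normal(size=6), 3
brute = sum(np.prod([p[j] * x[j] for j in c]) for c in combinations(range(6), m))
assert np.isclose(anova_kernel(p, x, m), brute)
```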
Expressing FMs and PNs using kernels
• Recall that

  ŷ_K(x; λ, P) := Σ_{s=1}^k λ_s K(p_s, x)

• Expressing factorization machines:

  ŷ_FM(x; w, P) = ⟨w, x⟩ + ŷ_{A²}(x; 1, P)

• Expressing polynomial networks:

  ŷ_PN(x; w, λ, P) = ⟨w, x⟩ + ŷ_{H²}(x; λ, P)
Direct optimization
• Most natural approach: directly minimize

  D_K(λ, P) := Σ_{i=1}^n ℓ(y_i, Σ_{s=1}^k λ_s K(p_s, x_i)) + β Σ_{s=1}^k |λ_s| ‖p_s‖²

  where ℓ is a μ-smooth convex loss function
• Convexity?

                    K = H^m        K = A^m
  λ                 convex         convex
  P                 non-convex     non-convex
  rows of P         non-convex     convex   ← thanks to the multi-linearity of A^m
  columns of P      non-convex     non-convex
  elements of P     non-convex     convex
Direct optimization

[Figure: the objective function w.r.t. one row of P, plotted over two
coordinates (p_{j1}, p_{j2}): (a) when K = H², (b) when K = A²]
Multi-convex optimization
• When K = A^m, the objective is called multi-convex
• We can use alternating minimization (a sketch follows below)
  ◦ Popular in the matrix and tensor factorization literature
  ◦ Simple to implement
  ◦ Converges to a stationary point
  ◦ When ℓ is the squared loss, each sub-problem can be solved analytically
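For K = A² and the squared loss, the prediction is affine in each row of P (by multi-linearity), so each alternating-minimization sub-problem is a small ridge regression. A sketch of one row update (all names are ours; for simplicity we use a plain ridge penalty β rather than the |λ_s|-weighted one from the objective):

```python
import numpy as np

def predict_a2(X, lam, P):
    # Σ_s λ_s A²(p_s, x_i) for every row x_i of X
    XP = X @ P
    return 0.5 * ((XP ** 2 - (X ** 2) @ (P ** 2)) @ lam)

def update_row(X, y, lam, P, j, beta=1e-3):
    # With the other rows fixed, ŷ is affine in P[j]: the coefficient on
    # P[j, s] is λ_s x_j Σ_{j'≠j} P[j', s] x_{j'}, so row j is found by
    # solving a k-dimensional ridge problem in closed form.
    G = X @ P - np.outer(X[:, j], P[j])      # Σ_{j'≠j} P[j', s] x_{j'}
    Phi = lam * X[:, [j]] * G                # (n, k) design matrix
    target = y - predict_a2(X, lam, P) + Phi @ P[j]
    A = Phi.T @ Phi + beta * np.eye(P.shape[1])
    P[j] = np.linalg.solve(A, Phi.T @ target)
```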
A tensor approach
• When K = H^m, the direct objective is neither convex nor multi-convex
• We will now present an objective that is multi-convex for both K = H^m
  and K = A^m
• The main idea is to convert the estimation of λ and P into that of a
  low-rank symmetric tensor W
Rank-one symmetric tensor

  x^{⊗m} := x ⊗ ⋯ ⊗ x (m times) ∈ S^{d^m}

where S^{d^m} denotes the set of m-th order symmetric tensors of dimension d.

[Figure: a 4 × 4 × 4 cube illustrating x^{⊗3} = x ⊗ x ⊗ x, whose (i, j, k)
entry is x_i x_j x_k]
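A small NumPy illustration of a rank-one symmetric tensor (our own sketch):

```python
import numpy as np

x = np.arange(1.0, 5.0)                  # d = 4, as in the cube figure
x3 = np.einsum('i,j,k->ijk', x, x, x)    # x^{⊗3} = x ⊗ x ⊗ x
assert x3.shape == (4, 4, 4)
assert np.isclose(x3[0, 1, 3], x[0] * x[1] * x[3])
```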
Symmetric tensor decomposition

  W = Σ_{s=1}^k λ_s p_s^{⊗m},   W ∈ S^{d^m}

where k is the (symmetric) rank of W.

[Figure: a third-order tensor W ∈ S^{d^3} drawn as λ_1 p_1^{⊗3} + λ_2 p_2^{⊗3} + ...]
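Materializing such a decomposition is easy for small d (a sketch; the training algorithms later in the talk never need to form W explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
lam = rng.normal(size=k)
P = rng.normal(size=(d, k))

# W = Σ_s λ_s p_s^{⊗3}: a symmetric tensor of (symmetric) rank ≤ k
W = np.einsum('s,is,js,ks->ijk', lam, P, P, P)
assert np.allclose(W, W.transpose(1, 0, 2))   # invariant to axis swaps
```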
Link between tensors and poly. kernel
The homogeneous polynomial kernel can be rewritten as

  H^m(p, x) := ⟨p, x⟩^m = ⟨p^{⊗m}, x^{⊗m}⟩

[Figure: H³(p, x) = ⟨p^{⊗3}, x^{⊗3}⟩, an inner product between two
rank-one cubes]
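A numerical check of this identity for m = 3 (our sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
p, x = rng.normal(size=4), rng.normal(size=4)

lhs = (p @ x) ** 3                                  # H³(p, x)
rhs = np.sum(np.einsum('i,j,k->ijk', p, p, p) *
             np.einsum('i,j,k->ijk', x, x, x))      # ⟨p^{⊗3}, x^{⊗3}⟩
assert np.isclose(lhs, rhs)
```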
Link between tensors and ANOVA kernel
• For the ANOVA kernel, we need to ignore irrelevant feature combinations...
• We introduce the following notation, for W, X ∈ S^{d^m}:

  ⟨W, X⟩_> := Σ_{j_m>⋯>j_1} W_{j_1,...,j_m} X_{j_1,...,j_m}

• Then

  A^m(p, x) = ⟨p^{⊗m}, x^{⊗m}⟩_>
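⟨·,·⟩_> only sums over strictly ordered index tuples; a brute-force check for m = 3 (our sketch):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
d, m = 4, 3
p, x = rng.normal(size=d), rng.normal(size=d)

P3 = np.einsum('i,j,k->ijk', p, p, p)
X3 = np.einsum('i,j,k->ijk', x, x, x)

# keep only tuples with j1 < j2 < j3 (same set as j3 > j2 > j1)
inner_gt = sum(P3[i, j, k] * X3[i, j, k]
               for i, j, k in combinations(range(d), m))
A3 = sum(np.prod(p[list(c)] * x[list(c)]) for c in combinations(range(d), m))
assert np.isclose(inner_gt, A3)
```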
Link between tensors and kernel expansions
• Assume W is decomposed as W = Σ_{s=1}^k λ_s p_s^{⊗m}
  (a parametrization that is not multi-linear, unfortunately). Then

  ŷ_{H^m} = ⟨W, x^{⊗m}⟩ = Σ_{s=1}^k λ_s H^m(p_s, x)

  ŷ_{A^m} = ⟨W, x^{⊗m}⟩_> = Σ_{s=1}^k λ_s A^m(p_s, x)

• We can convert the estimation of λ and P to that of a low-rank tensor W
Key idea of the proposed method
• Expressing the loss as a function of W:

  L_{H^m}(W) := Σ_{i=1}^n ℓ(y_i, ⟨W, x_i^{⊗m}⟩)

  L_{A^m}(W) := Σ_{i=1}^n ℓ(y_i, ⟨W, x_i^{⊗m}⟩_>)

• Our idea: we set

  W = S(Σ_{s=1}^r u_s^1 ⊗ ⋯ ⊗ u_s^m)   ← now multi-linear

  where S(M) is the symmetrization of M (see the sketch below)
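Symmetrization averages a tensor over all permutations of its axes; a small helper sketch (our own, fine for small m):

```python
import math
import numpy as np
from itertools import permutations

def symmetrize(M):
    # S(M): average of M over all m! permutations of its axes;
    # the result is invariant under any axis permutation
    m = M.ndim
    return sum(M.transpose(pi) for pi in permutations(range(m))) / math.factorial(m)
```

Crucially, (U^1, ..., U^m) ↦ S(Σ_s u_s^1 ⊗ ⋯ ⊗ u_s^m) remains multi-linear in each factor, which is what makes the formulation on the next slide multi-convex.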
Multi-convex formulation
min
m
U 1 ,...,U
∈Rd×r
LK S
r
X
+
u 1s ⊗ · · · ⊗ u m
s
s=1
β
2
m
X
kU t k2F
t=1
where u ts is s th column of U t
• Convex in U 1 , . . . , U m separately due to multi-linearity
• When m = 2, this is equivalent to direct formulation
(and we can easily convert U 1 , U 2 to λ, P)
• Coordinate descent: costs O(mrnz (X)) per epoch
22 / 27
Direct vs. proposed approach

                     Direct                    Proposed
  Parameters         λ ∈ R^k, P ∈ R^{d×k}      U^1, ..., U^m ∈ R^{d×r}
  Multi-convex if    K = A^m                   K = A^m or H^m
  Multi-convex in    λ and rows of P           U^1, ..., U^m

In practice, we set r = k/m.
Direct vs. proposed (“lifted”)

[Figure: relative objective value reduction vs. CPU time (seconds), comparing
Lifted (CD), Direct (L-BFGS), Direct (SGD) and Direct (CD):
(a) K = A², (b) K = H². E2006-tfidf dataset: n = 16,087, d = 150,360]
Low-budget non-linear regression
We compared six methods:
1. Proposed with K = H³ (with xᵀ ← [1, xᵀ])
2. Proposed with K = A³ (with xᵀ ← [1, xᵀ])
3. Nyström method with K = P³_γ, where γ = 1
4. Random selection: choose bases uniformly at random from the training
   set, with K = P³_γ
5. Linear ridge regression
6. Kernel ridge regression with K = P³_γ
[Figure: coefficient of determination R² vs. number of bases for
Proposed K = H³, Proposed K = A³, Nyström, Random Selection, Ridge and
Kernel Ridge on three datasets: (a) abalone, (b) cadata, (c) cpusmall]
Conclusion
• We proposed a unified framework for factorization
machines (FM) and polynomial networks (PN)
• We proposed efficient training algorithms based on
tensor decomposition
Open-source implementation by Vlad Niculae:
http://contrib.scikit-learn.org/polylearn/