Distributed Optimization of Deeply Nested Systems
M. A. Carreira-Perpiñán and Weiran Wang
AISTATS 2014
Nested models

•  Training nested architectures
•  Deep nets, but also cascades like PCA before an SVM
•  How to do it:
   –  Quickly
   –  In parallel
   –  Do model selection

Deep nets have proven very successful at learning sophisticated tasks, such as recognizing faces or speech, when trained on data. A typical neural net defines a hierarchical, feedforward, parametric mapping from inputs to outputs. The parameters (weights) are learned given a dataset by numerically minimizing an objective function. The outputs of the hidden units at each layer are obtained by transforming the previous layer's outputs by a linear operation with the layer's weights, followed by a nonlinear elementwise mapping (e.g. a sigmoid). Deep, nonlinear neural nets are universal approximators, that is, they can approximate any target mapping.

Figure 1: Net with K = 3 hidden layers (Wk: weights, zk: auxiliary coordinates).
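The forward computation just described is simple to write down. Below is a minimal NumPy sketch (not from the paper; layer sizes, weights, and the input are made-up placeholders) of the net in fig. 1: each layer multiplies its input by the layer's weight matrix W_k and applies an elementwise sigmoid, and the net output is the nested composition of the K + 1 = 4 layer functions.

# Minimal sketch of the feedforward mapping of fig. 1 (K = 3 hidden layers).
# All sizes and values are hypothetical placeholders, not from the paper.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def layer(Wk, x):
    # f_k(x; W_k) = sigma(W_k x): linear map, then elementwise squashing
    return sigmoid(Wk @ x)

def f(x, W):
    # Nested composition f(x; W) = f_4(f_3(f_2(f_1(x; W_1); W_2); W_3); W_4)
    for Wk in W:
        x = layer(Wk, x)
    return x

rng = np.random.default_rng(0)
sizes = [5, 4, 4, 4, 3]   # input dim, three hidden layers, output dim (made up)
W = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal(sizes[0])
y = f(x, W)               # output of the deep net for one input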
Problem

1.1 The Nested Objective Function

For definiteness, we describe the approach for a deep net such as that of fig. 1; later sections will show other settings. Consider a regression problem of mapping inputs x to outputs y (both high-dimensional) with a deep net f(x), given a dataset of N pairs (x_n, y_n). A typical objective function to learn a deep net with K hidden layers has the form (to simplify notation, we ignore bias parameters):

E_1(W) = \frac{1}{2} \sum_{n=1}^{N} \| y_n - f(x_n; W) \|^2        (1)

f(x; W) = f_{K+1}(\dots f_2(f_1(x; W_1); W_2) \dots; W_{K+1})

where each layer function has the form f_k(x; W_k) = \sigma(W_k x), i.e., a linear mapping followed by a squashing nonlinearity (\sigma(t) applies a scalar function, such as the sigmoid 1/(1 + e^{-t}), elementwise to a vector argument, with output in [0, 1]). Our method applies to loss functions other than squared error (e.g. cross-entropy for classification), with fully or sparsely connected layers, each with a different number of hidden units, with weights shared across layers, and with regularization terms on the weights W_k. The basic issue is the deep nesting of the mapping f.
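To make eq. (1) concrete, here is a short NumPy sketch (data, weights, and sizes are random placeholders, not the paper's code) that evaluates the nested objective E_1(W): compose the layer functions f_k(x; W_k) = σ(W_k x) to get f(x_n; W), then sum the squared errors over the N training pairs.

# Sketch of the nested objective E_1(W) in eq. (1). Everything below
# (sizes, weights, data) is a hypothetical placeholder for illustration.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f(x, W):
    # f(x; W) = f_{K+1}(... f_2(f_1(x; W_1); W_2) ...; W_{K+1}), with f_k = sigma(W_k x)
    for Wk in W:
        x = sigmoid(Wk @ x)
    return x

def E1(W, X, Y):
    # E_1(W) = (1/2) * sum_n ||y_n - f(x_n; W)||^2
    return 0.5 * sum(np.sum((y - f(x, W)) ** 2) for x, y in zip(X, Y))

rng = np.random.default_rng(0)
sizes = [5, 4, 4, 4, 3]                    # K = 3 hidden layers (made-up sizes)
W = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
X = rng.standard_normal((100, sizes[0]))   # N = 100 inputs x_n
Y = rng.random((100, sizes[-1]))           # N targets y_n in [0, 1]
print(E1(W, X, Y))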
The traditional way to minimize (1) is by computing the gradient of E_1 over W with the chain rule (backpropagation) and feeding it to a gradient-based optimizer. This optimization is ill-conditioned; the problem worsens with the number of layers (Rögnvaldsson, 1994; Erhan et al., 2009). As a result, most methods take tiny steps, slowly zigzagging down a curved valley, and take a huge time to converge in practice. Also, the chain rule requires differentiable layers wrt W.
1.2 The Method of Auxiliary Coordinates (MAC)

Idea: decouple the nested functions. We introduce one auxiliary variable per data point and per hidden unit and define the following equality-constrained optimization problem:

E(W, Z) = \frac{1}{2} \sum_{n=1}^{N} \| y_n - f_{K+1}(z_{K,n}; W_{K+1}) \|^2        (2)

s.t.    z_{K,n} = f_K(z_{K-1,n}; W_K)
        ...
        z_{1,n} = f_1(x_n; W_1),        n = 1, \dots, N.

Is this easier?
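As a concrete reading of the constrained problem (2), the sketch below (NumPy; all sizes and data are made-up placeholders) stores one auxiliary coordinate vector z_{k,n} per data point and hidden layer and evaluates the constraint residuals z_{k,n} − f_k(z_{k−1,n}; W_k). Initializing Z by a forward pass through the net satisfies every constraint exactly, in which case the objective of (2) coincides with the nested E_1(W).

# Sketch of the MAC auxiliary coordinates for eq. (2): one vector z_{k,n}
# per point n and hidden layer k. Initializing Z by a forward pass makes
# every residual z_{k,n} - f_k(z_{k-1,n}; W_k) zero. Sizes/data are
# hypothetical placeholders.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fk(Wk, Zprev):
    # Layer function f_k(z; W_k) = sigma(W_k z), applied row-wise to a batch
    return sigmoid(Zprev @ Wk.T)

rng = np.random.default_rng(0)
sizes, N = [5, 4, 4, 4, 3], 100            # input, K = 3 hidden layers, output
W = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
X = rng.standard_normal((N, sizes[0]))

# Z[k-1] holds the coordinates z_{k,n} of hidden layer k (one row per point n).
Z, z = [], X
for Wk in W[:-1]:                          # hidden layers 1..K only
    z = fk(Wk, z)
    Z.append(z)

# Constraint residuals of eq. (2); all zero at this initialization.
residuals = [Z[0] - fk(W[0], X)]
residuals += [Z[k] - fk(W[k], Z[k - 1]) for k in range(1, len(Z))]
print(max(np.abs(r).max() for r in residuals))   # prints 0.0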
To solve the constrained problem (2), in the simplest case we use the quadratic-penalty (QP) method (Nocedal and Wright, 2006); using the augmented Lagrangian is also possible. We optimize the following function over (W, Z) for fixed µ > 0 and drive µ → ∞:

E_Q(W, Z; \mu) = \frac{1}{2} \sum_{n=1}^{N} \| y_n - f_{K+1}(z_{K,n}; W_{K+1}) \|^2 + \frac{\mu}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \| z_{k,n} - f_k(z_{k-1,n}; W_k) \|^2        (3)

where we define z_{0,n} ≡ x_n. This defines a continuous path (W*(µ), Z*(µ)) which, under mild assumptions, converges to a minimum of the constrained problem (2), and thus to a minimum of the original problem (1) (Carreira-Perpiñán and Wang, 2012). In practice, we follow this path loosely.

The trick: allow a discrepancy between the function outputs and the auxiliary variables. The QP objective function can be seen as breaking the functional dependences in the nested mapping f.
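To illustrate the quadratic-penalty objective (3), here is a short NumPy sketch (sizes, data, and the µ schedule are hypothetical placeholders, and the inner optimizer over (W, Z) is deliberately left out) that evaluates E_Q(W, Z; µ): the usual fit term on the output layer plus µ/2 times the total squared violation of the auxiliary-coordinate constraints.

# Sketch of the quadratic-penalty objective E_Q(W, Z; mu) of eq. (3).
# Sizes and data are hypothetical placeholders; the inner optimizer over
# (W, Z) at each value of mu is omitted.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fk(Wk, Zprev):                         # f_k(z; W_k) = sigma(W_k z), batched over rows
    return sigmoid(Zprev @ Wk.T)

def EQ(W, Z, X, Y, mu):
    fit = 0.5 * np.sum((Y - fk(W[-1], Z[-1])) ** 2)              # output-layer fit term
    cons = 0.5 * np.sum((Z[0] - fk(W[0], X)) ** 2)               # k = 1 (z_{0,n} = x_n)
    cons += sum(0.5 * np.sum((Z[k] - fk(W[k], Z[k - 1])) ** 2)   # k = 2..K
                for k in range(1, len(Z)))
    return fit + mu * cons                                       # fit + (mu/2) * violations

rng = np.random.default_rng(0)
sizes, N = [5, 4, 4, 4, 3], 100
W = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
X = rng.standard_normal((N, sizes[0]))
Y = rng.random((N, sizes[-1]))
Z = [rng.random((N, m)) for m in sizes[1:-1]]    # one z_{k,n} per point and hidden layer

for mu in [1.0, 10.0, 100.0]:                    # drive mu upward (schedule is made up)
    # ... optimize E_Q over (W, Z) for this mu (e.g. alternating a W-step and a Z-step) ...
    print(mu, EQ(W, Z, X, Y, mu))

Note that, with Z fixed, E_Q splits into K + 1 independent terms, one per layer's weights, and with W fixed it splits over the N data points; this decoupling is presumably what the parallel and minibatch MAC variants in the evaluation below exploit.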
Empirical Evaluation

[Figure: objective function vs. runtime (hours), comparing SGD (• = 20 epochs), CG (• = 100 its.), MAC (• = 1 it.), Parallel MAC, MAC (minibatches), and Parallel MAC (minibatches); annotations include µ = 1.]