Distributed Optimization of Deeply Nested Systems
M. A. Carreira-Perpiñán & Weiran Wang, AISTATS 2014

Problem

Deeply nested systems such as neural nets have proven very successful at learning sophisticated tasks, such as recognizing faces or speech, when trained on data. A typical neural net defines a hierarchical, feedforward, parametric mapping from inputs to outputs. The parameters (weights) are learned given a dataset by numerically minimizing an objective function. The outputs of the hidden units at each layer are obtained by transforming the previous layer's outputs by a linear operation with the layer's weights, followed by a nonlinear elementwise mapping (e.g. a sigmoid). Deep, nonlinear neural nets are universal approximators, that is, they can approximate any target mapping.

[Figure 1: net with K = 3 hidden layers (W_k: weights, z_k: auxiliary coordinates).]

• Training nested architectures is a difficult optimization problem; we define a general method of auxiliary coordinates (MAC).
• It applies to deep nets, but also to cascades like PCA before an SVM.
• How to do it:
  – Quickly
  – In parallel
  – Do model selection

Method of Auxiliary Coordinates (MAC)

1.1 The Nested Objective Function

For definiteness, we describe the approach for a deep net such as that of fig. 1; later sections will show other settings. Consider a regression problem of mapping inputs x to outputs y (both high-dimensional) with a deep net f(x), given a dataset of N pairs (x_n, y_n). A typical objective function to learn a deep net with K hidden layers has the form (to simplify notation, we ignore bias parameters):

    E_1(W) = \frac{1}{2} \sum_{n=1}^{N} \| y_n - f(x_n; W) \|^2                                              (1)

    f(x; W) = f_{K+1}(\dots f_2(f_1(x; W_1); W_2) \dots; W_{K+1})

where each layer function has the form f_k(x; W_k) = \sigma(W_k x), i.e., a linear mapping followed by a squashing nonlinearity (\sigma(t) applies a scalar function, such as the sigmoid 1/(1 + e^{-t}), elementwise to a vector argument, with output in [0, 1]). Our method applies to loss functions other than the squared error (e.g. cross-entropy for classification), to fully or sparsely connected layers each with a different number of hidden units, to weights shared across layers, and to regularization terms on the weights W_k. The basic issue is the deep nesting of the mapping f.
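For concreteness, the following is a minimal NumPy sketch (an illustration, not code from the paper) of evaluating the nested objective (1); the array shapes, the sigmoid output layer, and the absence of bias terms are assumptions chosen to match the simplified notation above.

import numpy as np

def sigmoid(t):
    # Squashing nonlinearity sigma(t), applied elementwise, with output in [0, 1].
    return 1.0 / (1.0 + np.exp(-t))

def layer(Z_prev, W_k):
    # One layer f_k(z; W_k) = sigma(W_k z), evaluated for all N points at once.
    # Z_prev: (N, d_{k-1}) previous-layer outputs; W_k: (d_k, d_{k-1}) weights.
    return sigmoid(Z_prev @ W_k.T)

def nested_objective(Ws, X, Y):
    # E_1(W) = 1/2 sum_n ||y_n - f(x_n; W)||^2, with the deeply nested mapping
    # f(x; W) = f_{K+1}(... f_2(f_1(x; W_1); W_2) ...; W_{K+1}).
    Z = X                                  # (N, d_0) inputs
    for W_k in Ws:                         # Ws = [W_1, ..., W_{K+1}]
        Z = layer(Z, W_k)                  # each layer feeds the next
    return 0.5 * np.sum((Y - Z) ** 2)

# Tiny usage example: K = 3 hidden layers mapping 5-dim inputs to 2-dim outputs.
rng = np.random.default_rng(0)
dims = [5, 8, 8, 8, 2]                     # d_0, ..., d_4
Ws = [rng.standard_normal((dims[k + 1], dims[k])) for k in range(len(dims) - 1)]
X, Y = rng.standard_normal((100, 5)), rng.standard_normal((100, 2))
print(nested_objective(Ws, X, Y))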
The traditional way to minimize (1) is to compute the gradient over all the weights with the chain rule (backpropagation) and feed it to an iterative optimizer. This gradient tends to be ill-conditioned, and the ill-conditioning worsens with the number of layers (Rögnvaldsson, 1994; Erhan et al., 2009). As a result, most methods take tiny steps, slowly zigzagging down a curved valley, and take a huge time to converge in practice. Also, the chain rule requires layers that are differentiable wrt W.

1.2 The Method of Auxiliary Coordinates (MAC)

Idea: decouple the nested functions. We introduce one auxiliary variable per data point and per hidden unit, and define the following equality-constrained optimization problem:

    E(W, Z) = \frac{1}{2} \sum_{n=1}^{N} \| y_n - f_{K+1}(z_{K,n}; W_{K+1}) \|^2                             (2)

    s.t.  z_{K,n} = f_K(z_{K-1,n}; W_K), \; \dots, \; z_{1,n} = f_1(x_n; W_1),   n = 1, \dots, N.

Is this easier? In the simplest approach, we use the quadratic-penalty (QP) method (Nocedal and Wright, 2006); using the augmented Lagrangian is also possible. We optimize the following function over (W, Z) for fixed µ > 0 and drive µ → ∞:

    E_Q(W, Z; \mu) = \frac{1}{2} \sum_{n=1}^{N} \| y_n - f_{K+1}(z_{K,n}; W_{K+1}) \|^2
                     + \frac{\mu}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \| z_{k,n} - f_k(z_{k-1,n}; W_k) \|^2     (3)

This defines a continuous path (W*(µ), Z*(µ)) which, under mild assumptions, converges to a minimum of the constrained problem (2), and thus to a minimum of the original problem (1) (Carreira-Perpiñán and Wang, 2012). In practice, we follow this path loosely. The trick is to allow a discrepancy between each layer's output and the corresponding auxiliary coordinates: the QP objective function can be seen as breaking the nested function into K + 1 single-layer terms that are coupled only through the auxiliary coordinates Z.
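To make the decoupling in (3) concrete, here is a minimal NumPy sketch of the quadratic-penalty objective E_Q, under the same assumptions as the previous sketch (it is an illustration, not the authors' code); the helper names, array shapes, and the schedule for increasing µ are assumptions.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def layer(Z_prev, W_k):
    # f_k(z; W_k) = sigma(W_k z), evaluated for all N points at once.
    return sigmoid(Z_prev @ W_k.T)

def mac_qp_objective(Ws, Zs, X, Y, mu):
    # E_Q(W, Z; mu) of eq. (3): the nesting in eq. (1) is broken into K+1
    # single-layer terms, coupled only through the auxiliary coordinates Z.
    # Ws = [W_1, ..., W_{K+1}];  Zs = [Z_1, ..., Z_K] with Z_k of shape (N, d_k).
    fit = 0.5 * np.sum((Y - layer(Zs[-1], Ws[-1])) ** 2)       # output-layer term
    penalty, Z_prev = 0.0, X                                   # z_{0,n} = x_n
    for Z_k, W_k in zip(Zs, Ws[:-1]):
        # Each constraint z_{k,n} = f_k(z_{k-1,n}; W_k) becomes a quadratic penalty.
        penalty += 0.5 * np.sum((Z_k - layer(Z_prev, W_k)) ** 2)
        Z_prev = Z_k
    return fit + mu * penalty

# Following the path (W*(mu), Z*(mu)) loosely: optimize E_Q over (W, Z) for a
# fixed mu (the paper alternates steps over W and Z), then increase mu and
# repeat. The values below are purely illustrative.
# for mu in [1.0, 10.0, 100.0]:
#     ...  # optimize mac_qp_objective(Ws, Zs, X, Y, mu) over Ws and Zs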
Empirical Evaluation

[Figure: training objective function vs. runtime (hours), comparing SGD (• = 20 epochs), CG (• = 100 its.), MAC (• = 1 it.), Parallel MAC, MAC (minibatches) and Parallel MAC (minibatches).]