Parallel Online Learning
Nikos Karampatziakis (Cornell University), John Langford (Yahoo! Research), Daniel Hsu (Rutgers University)

Online Learning
- The learner gets the next example xt, makes a prediction pt, receives the actual label yt, suffers loss ℓ(pt, yt), and updates itself.
- Predictions and updates are simple and fast:
  pt = w⊤xt,    wt+1 = wt − ηt ∇ℓ(pt, yt)
- Online gradient descent asymptotically attains optimal regret.
- Online learning scales well . . .

Parallel Online Learning
- . . . but it is a sequential algorithm.
- What if examples arrive very fast?
- What if we want to train on huge datasets?
- We investigate ways of distributing predictions and updates while minimizing communication.

Delay
- Parallelizing online learning leads to delay problems.
- This is exacerbated in a setting with temporally correlated or adversarial examples.
- We investigate no-delay and bounded-delay schemes.

Tree Architectures
[Figure: a tree of predictors; leaf nodes ŷ1,1–ŷ1,4 read feature subsets xF1–xF4, internal nodes ŷ2,1 and ŷ2,2 combine their children's outputs, and the root produces ŷ.]
- Each node has f + 1 weights, where f is the node's fan-in.
- Bottom nodes use subsets of the raw features; the others use the predictions of their children.
- The label travels together with the prediction, so it is available at each node.

Local Updates
Each node in the tree:
- computes its prediction pi,j based on its weights and inputs,*
- sends ŷi,j = σ(pi,j) to its parent,
- updates its weights based on ∇ℓ(pi,j, y).
There is no delay, but representation power is limited: between Naive Bayes and a centralized linear model.
* The nonlinearity introduced by σ has an interesting effect.

Some Experiments with Local Updates
[Plots: relative squared loss and relative time vs. shard count (1, 2, 4, 8), for "Sharding & Training" and for "Training & Combining"; the improvement is due to the nonlinearity σ.]

Global Updates
- Unfortunately, local updates can also hurt performance.
- Global training improves representation power, at the cost of slightly more communication and some delay.
- Delayed global training: each node predicts but does not immediately train on y. Later it receives the global prediction ŷ and trains as if it had predicted that.
- Delayed backprop: the tree can be thought of as a neural network, but lockstep backpropagation would be slow. Each node trains locally and sends its prediction after training. Later it receives the global gradient from its parent and uses the chain rule as in backpropagation.
- The delay is fixed (this helps stability, development, and debugging).

Experiments with Global (and Local) Updates
[Plots: accuracy on RCV1 and Webspam after 1 and 16 passes as a function of worker number, and with 1 and 16 workers as a function of pass number; curves compare Local, Backprop, and Backprop ×8.]

Vowpal Wabbit
This (and more) is implemented in Vowpal Wabbit.
http://hunch.net/~vw
Fork it from http://github.com/JohnLangford/vowpal_wabbit
Email: [email protected], [email protected], [email protected]
WWW: http://hunch.net/~vw
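
Below is a minimal sketch of the online gradient descent update from the Online Learning section (predict pt = w⊤xt, observe yt, take a step along −∇ℓ). It assumes squared loss and a decaying step size; the function names and the synthetic data stream are illustrative, not part of the poster or of Vowpal Wabbit.

```python
import numpy as np

def squared_loss_grad(p, y):
    """Gradient of the squared loss 0.5*(p - y)^2 with respect to the prediction p."""
    return p - y

def online_gradient_descent(examples, dim, eta0=0.1):
    """One pass of online learning: predict, observe the label, update.

    examples: iterable of (x, y) pairs with x a length-`dim` feature vector.
    """
    w = np.zeros(dim)
    for t, (x, y) in enumerate(examples, start=1):
        p = w.dot(x)                      # prediction p_t = w^T x_t
        eta = eta0 / np.sqrt(t)           # decaying step size eta_t
        g = squared_loss_grad(p, y) * x   # gradient of the loss wrt w (chain rule)
        w -= eta * g                      # w_{t+1} = w_t - eta_t * grad
    return w

# Toy usage: learn a noisy linear target from a stream of 1000 examples.
rng = np.random.default_rng(0)
true_w = rng.normal(size=5)
stream = [(x, x.dot(true_w) + 0.01 * rng.normal())
          for x in rng.normal(size=(1000, 5))]
w = online_gradient_descent(stream, dim=5)
```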
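
The next sketch illustrates the Local Updates rule for one node of the tree: it has f + 1 weights (fan-in plus a bias), predicts from its inputs, updates locally using the label that travels with the example, and only then forwards σ(pi,j) to its parent, so there is no delay. The poster does not fix ℓ or σ, so logistic σ and logistic loss are assumed here; the class and method names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LocalNode:
    """One tree node: f inputs (raw features or children's predictions)
    plus a bias, i.e. f + 1 weights, trained purely locally."""

    def __init__(self, fan_in, eta=0.1):
        self.w = np.zeros(fan_in + 1)     # f weights + 1 bias weight
        self.eta = eta

    def local_update(self, inputs, y):
        """Predict, update on the local gradient, and return sigma(p) for the parent."""
        a = np.append(inputs, 1.0)        # append a constant input for the bias
        p = self.w.dot(a)                 # linear prediction p_{i,j}
        y_hat = sigmoid(p)
        grad_p = y_hat - y                # logistic-loss gradient wrt p (assumed loss)
        self.w -= self.eta * grad_p * a   # no delay: update before passing y_hat upward
        return y_hat

# Toy usage: two leaves on disjoint feature subsets feeding one root node.
rng = np.random.default_rng(0)
leaves = [LocalNode(fan_in=3), LocalNode(fan_in=3)]
root = LocalNode(fan_in=2)
x, y = rng.normal(size=6), 1.0
child_preds = np.array([leaf.local_update(x[3*i:3*i + 3], y)
                        for i, leaf in enumerate(leaves)])
y_hat = root.local_update(child_preds, y)   # the tree's final prediction
```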
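
Finally, a sketch of the delayed-backprop correction from the Global Updates section: a node sends its prediction upward, and when the global gradient later arrives from its parent it applies the chain rule through σ to update its own weights and to produce gradients for its children. It assumes the node caches each delayed example's inputs and pre-sigmoid prediction, and that at the root the incoming gradient is ∂ℓ/∂ŷ; the immediate local update shown in the previous sketch is omitted for brevity, and all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DelayedBackpropNode:
    """Node that forwards its prediction immediately and applies a delayed
    global correction when dl/dy_hat arrives back from its parent."""

    def __init__(self, fan_in, eta=0.1):
        self.w = np.zeros(fan_in + 1)     # f weights + 1 bias weight
        self.eta = eta
        self.pending = {}                 # example id -> (inputs-with-bias, p)

    def forward(self, example_id, inputs):
        a = np.append(inputs, 1.0)
        p = self.w.dot(a)
        self.pending[example_id] = (a, p)   # cache state for the delayed gradient
        return sigmoid(p)                   # prediction sent to the parent

    def backward(self, example_id, grad_from_parent):
        """Chain rule: dl/dp = (dl/dy_hat) * sigma'(p); update w and
        return the gradients to pass down to this node's children."""
        a, p = self.pending.pop(example_id)
        s = sigmoid(p)
        grad_p = grad_from_parent * s * (1.0 - s)   # through the nonlinearity
        grad_children = grad_p * self.w[:-1]        # dl/d(input_j), pre-update weights
        self.w -= self.eta * grad_p * a             # global gradient step
        return grad_children
```

Because the round trip through the parent takes a bounded, fixed number of steps, the `pending` cache stays small, which is the fixed-delay property the poster notes as helpful for stability and debugging.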