Parallel Online Learning

Nikos Karampatziakis (Cornell University)
Daniel Hsu (Rutgers University)
John Langford (Yahoo! Research)

Online Learning
▶ The learner gets the next example x_t, makes a prediction p_t, receives the
  actual label y_t, suffers loss ℓ(p_t, y_t), and updates itself.
▶ Predictions and updates are simple and fast:

      p_t = w_t^⊤ x_t
      w_{t+1} = w_t − η_t ∇ℓ(p_t, y_t)

▶ Online gradient descent asymptotically attains optimal regret.
▶ Online learning scales well . . .
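
A minimal sketch of this update loop in Python, assuming squared loss
ℓ(p, y) = (p − y)² and a constant learning rate; the function name and both
assumptions are illustrative and not part of the poster.

import numpy as np

def online_gradient_descent(examples, dim, eta=0.1):
    """examples: iterable of (x_t, y_t) pairs, with x_t a dense numpy vector."""
    w = np.zeros(dim)
    total_loss = 0.0
    for x, y in examples:
        p = w.dot(x)                  # prediction p_t = w_t^T x_t
        total_loss += (p - y) ** 2    # suffer the loss ℓ(p_t, y_t)
        grad = 2.0 * (p - y) * x      # ∇_w ℓ(p_t, y_t) for squared loss
        w = w - eta * grad            # w_{t+1} = w_t − η_t ∇ℓ(p_t, y_t)
    return w, total_loss
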
Parallel Online Learning
▶ . . . but it's a sequential algorithm.
▶ What if examples arrive very fast?
▶ What if we want to train on huge datasets?
▶ We investigate ways of distributing predictions and updates while
  minimizing communication.

Delay
▶ Parallelizing online learning leads to delay problems.
▶ This is exacerbated in a setting with temporally correlated or adversarial
  examples.
▶ We investigate no-delay and bounded-delay schemes.

Tree Architectures
[Figure: a tree of predictors. Leaves ŷ_1,1, ŷ_1,2, ŷ_1,3, ŷ_1,4 read the raw
feature subsets x_F1, x_F2, x_F3, x_F4; internal nodes ŷ_2,1 and ŷ_2,2 combine
their children's predictions; the root outputs the global prediction ŷ.]
▶ Each node has f + 1 weights, where f is the node's fan-in.
▶ Bottom nodes use subsets of the raw features. Other nodes use the
  predictions of their children.
▶ The label travels together with the prediction, so it is available at every
  node.
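
To make the wiring concrete, here is a small Python sketch of how such a tree
could be assembled; the class and function names are invented for
illustration. Each leaf reads a subset of the raw features, each internal node
reads its children's predictions, and every node holds f + 1 weights.

import numpy as np

class Node:
    """One node of the prediction tree: f input weights plus a bias,
    i.e. f + 1 weights, where f is the node's fan-in."""
    def __init__(self, fan_in, children=None, feature_subset=None):
        self.w = np.zeros(fan_in + 1)          # f + 1 weights
        self.children = children or []         # internal node: child nodes
        self.feature_subset = feature_subset   # leaf: indices of raw features

def build_tree(feature_subsets, fan_in=2):
    """Leaves read disjoint subsets of the raw features; each higher level
    combines the predictions of its children, as in the figure above."""
    level = [Node(len(fs), feature_subset=fs) for fs in feature_subsets]
    while len(level) > 1:
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        level = [Node(len(g), children=g) for g in groups]
    return level[0]   # the root, whose output is the global prediction

# Example: 4 leaves over 4 disjoint feature blocks, combined pairwise.
root = build_tree([range(0, 5), range(5, 10), range(10, 15), range(15, 20)])
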
Local Updates
Each node in the tree:
▶ Computes its prediction p_i,j based on its weights and inputs.*
▶ Sends ŷ_i,j = σ(p_i,j) to its parent.
▶ Updates its weights based on ∇ℓ(p_i,j, y).

▶ No delay.
▶ Limited representation power: between Naive Bayes and a centralized linear
  model.

* The nonlinearity introduced by σ has an interesting effect.
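
A minimal sketch of the local update at a single node, assuming squared loss
and a logistic σ; both choices and all names are assumptions made for
illustration.

import numpy as np

def sigma(p):
    return 1.0 / (1.0 + np.exp(-p))    # the nonlinearity applied to p_i,j

class LocalNode:
    """A node doing purely local updates: no delay and no coordination."""
    def __init__(self, fan_in, eta=0.1):
        self.w = np.zeros(fan_in + 1)  # f + 1 weights (f inputs plus a bias)
        self.eta = eta

    def local_step(self, inputs, y):
        """Predict, update locally on the true label y, and return
        ŷ_i,j = σ(p_i,j) to be sent to the parent."""
        z = np.append(inputs, 1.0)     # raw features or children's ŷ, plus bias
        p = self.w.dot(z)              # p_i,j
        grad = 2.0 * (p - y) * z       # ∇_w ℓ(p_i,j, y) for squared loss
        self.w -= self.eta * grad      # local gradient step
        return sigma(p)                # goes up the tree
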
Some Experiments with Local Updates
[Figure: two panels, "Sharding & Training" and "Training & Combining", showing
relative squared loss and relative training time versus shard count
(1, 2, 4, 8). The improvement is due to the nonlinearity σ.]

Global Updates
▶ Unfortunately, local updates can also hurt performance.
▶ Global training improves representation power, at the cost of slightly more
  communication and some delay.
▶ Delayed global training:
  – Each node predicts but does not immediately train on y.
  – Later it receives the global prediction ŷ and trains as if it had
    predicted that.
▶ Delayed backprop:
  – The tree can be thought of as a neural network, but lockstep
    backpropagation would be slow.
  – Each node trains locally and sends its prediction after training.
  – Later it receives the global gradient from its parent and applies the
    chain rule as in backpropagation.
▶ The delay is fixed, which helps stability, development, and debugging.
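
The sketch below illustrates the bounded-delay bookkeeping for delayed
backprop: a node remembers the state of each example it has forwarded, and
when the parent's gradient dℓ/dŷ arrives a fixed number of steps later it
applies the chain rule to update its own weights and to produce gradients for
its children. The class and method names, the logistic σ, and the queue-based
bookkeeping are assumptions for illustration, not the poster's implementation.

import numpy as np
from collections import deque

def sigma(p):
    return 1.0 / (1.0 + np.exp(-p))

class DelayedBackpropNode:
    def __init__(self, fan_in, eta=0.1):
        self.w = np.zeros(fan_in + 1)   # f + 1 weights (f inputs plus a bias)
        self.eta = eta
        self.pending = deque()          # (inputs-with-bias, pre-activation p)

    def forward(self, inputs):
        """Predict and send ŷ up; remember the state until the parent's
        (delayed) gradient comes back."""
        z = np.append(inputs, 1.0)
        p = self.w.dot(z)
        self.pending.append((z, p))
        return sigma(p)

    def backward(self, dloss_dyhat):
        """Apply the chain rule when the parent's gradient dℓ/dŷ arrives;
        the fixed delay means it refers to the oldest pending example."""
        z, p = self.pending.popleft()
        s = sigma(p)
        dp = dloss_dyhat * s * (1.0 - s)   # chain rule through σ
        child_grads = dp * self.w[:-1]     # gradients to send to the children
        self.w -= self.eta * dp * z        # update this node's weights
        return child_grads
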
Experiments with Global (and Local) Updates
[Figure: results on RCV1 and Webspam for Local, Backprop, and Backprop ×8:
accuracy after 1 pass and after 16 passes versus the number of workers
(1, 2, 4, 8, 16), and accuracy with 1 worker and with 16 workers versus the
number of passes (1, 2, 4, 8, 16).]

Vowpal Wabbit
This (and more) is implemented in Vowpal Wabbit.
▶ WWW: http://hunch.net/~vw
▶ Fork it from http://github.com/JohnLangford/vowpal_wabbit
▶ Email: [email protected], [email protected], [email protected]