Reinforcement Learning on the Web
summit, San Francisco, 26 Jan 2017
Andrej Karpathy
Universe
Universe interface: pixels in, keyboard & mouse out.
Docker containers in the cloud.
[Diagram: agent connected to environments through the Universe interface]

Agents on the Web
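To make the interface concrete, here is a rough sketch of the Universe client loop from memory of the circa-2017 API (the configure arguments and event classes follow the old universe package and may not match later releases; the click coordinates are made up):

```python
# Sketch of the OpenAI Universe client loop (circa-2017 API, from memory).
import gym
import universe  # registers the wob.* and flashgames.* environments

env = gym.make('wob.mini.ClickButton-v0')
env.configure(remotes=1)  # connect to one Dockerized environment over VNC

observation_n = env.reset()
while True:
    # An action is a list (one per remote) of VNC events,
    # e.g. press and release the mouse button at pixel (80, 105).
    action_n = [[universe.spaces.PointerEvent(80, 105, 1),   # press
                 universe.spaces.PointerEvent(80, 105, 0)]]  # release
    observation_n, reward_n, done_n, info = env.step(action_n)
```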
Outline
- Wait but... why?
- MiniWoB benchmark
- Reinforcement Learning: MiniWoB example
- In practice: Reinforcement vs. Supervised Learning
- Ongoing work
Robotics:
- Iteration speed? Not ideal. Atoms, hardware, difficult to scale.
- Data? Possible. Teleoperation, YouTube videos (?).
- Reproducible? Hard. Need the same robot and setup.
- Ease of working with envs? Hard. Need physical space or 3D design tools.
- Useful? Definitely. Industrial applications, elderly care, ...

Games:
- Iteration speed? Good. Bits.
- Data? Great. Lots of games, easy to get demos.
- Reproducible? Good. Run in Docker containers.
- Ease of working with envs? Ok. Must work with game engines.
- Useful? Unclear.

Web Browsers:
- Iteration speed? Good. Bits.
- Data? Good. Can record anyone using keyboard/mouse.
- Reproducible? Good. Run in Docker containers.
- Ease of working with envs? Good. Just JavaScript; full access to the DOM.
- Useful? Quite likely. UI automation (e.g. AMT), AI digital assistants, putting AI to school.
Mini World of Bits (MiniWoB)
"The MNIST of World of Bits"; work with Jonathan Hernandez.
http://alpha.openai.com/miniwob/index.html
- Will be: 100 tasks from easy to hard.
- HTML/CSS/JS.
- Confined to 210x160 pixels (the top 50px is the task query).

MiniWoB: example human demonstrations.
MiniWoB webpage: http://alpha.openai.com/miniwob/index.html
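Since every task fits in a fixed 210x160 canvas with the query at the top, splitting an observation into the query strip and the interactive region is a one-liner. A minimal numpy sketch, assuming a height x width x channels array layout:

```python
import numpy as np

# Fake observation with MiniWoB's fixed dimensions: 210 (h) x 160 (w) x 3 (RGB).
obs = np.zeros((210, 160, 3), dtype=np.uint8)

query_region = obs[:50]        # top 50px: text describing the task
interactive_region = obs[50:]  # remaining 160x160: the page the agent acts on
```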
Training with Reinforcement Learning
Q: where should we click?
We have no labels :(
An amazing convolutional neural network maps the raw pixels to probabilities of clicking on one of 400 possible positions.

RL:
1. initialize a stochastic policy network
2. sample the actions
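A minimal sketch of steps 1 and 2, written here in PyTorch (the architecture, layer sizes, and input resolution are invented for illustration; the talk does not specify them):

```python
import torch
import torch.nn as nn

class ClickPolicy(nn.Module):
    """Pixels in, a distribution over 400 click positions out."""
    def __init__(self, num_positions=400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(num_positions),  # logits over click positions
        )

    def forward(self, pixels):
        return self.net(pixels)

policy = ClickPolicy()                         # step 1: a stochastic policy
pixels = torch.rand(1, 3, 160, 160)            # one fake screenshot
dist = torch.distributions.Categorical(logits=policy(pixels))
action = dist.sample()                         # step 2: sample, don't argmax
log_prob = dist.log_prob(action)               # needed later for the update
```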
"An episode":
t=0 → t=1 (reward 0) → t=2 (reward 0) → t=3 (reward 0) → episode ends with reward +1.
Our previous episode, miniaturized: 3 states encountered, 3 specific actions taken, and a +1 reward received.
An entire batch of episodes: some end with reward +1.0 (we won), others with reward -1.0 (we lost).
Treat all of the actions we took in the winning episodes as "fake" labels and increase their probability. For all of the actions we took in the losing episodes, flip the sign and instead decrease their probability.
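A sketch of exactly that bookkeeping in numpy (the episode outcomes and lengths are made-up data): every action simply inherits its episode's final ±1 outcome as its weight.

```python
import numpy as np

episode_outcomes = [+1.0, -1.0, +1.0]   # final reward of each episode (fake)
episode_lengths  = [3, 4, 2]            # actions taken in each episode (fake)

# Every action in an episode gets that episode's outcome as its advantage.
advantages = np.concatenate([
    np.full(n, r) for r, n in zip(episode_outcomes, episode_lengths)
])
print(advantages)  # [ 1.  1.  1. -1. -1. -1. -1.  1.  1.]
```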
Supervised Learning
maximize: Σ_i log p(y_i | x_i), for images x_i and their labels y_i.

Reinforcement Learning
1) we have no labels, so we sample: a_i ~ p(a | x_i)
2) once we collect a batch of episodes, maximize: Σ_i A_i log p(a_i | x_i)

We call A_i the advantage: a number, like +1.0 or -1.0, based on how the action eventually turned out. A positive advantage will make that action more likely in the future, for that state; a negative advantage will make it less likely.

Find more on https://karpathy.github.io/2016/05/31/rl/
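In code, the two objectives differ by a single multiplicative factor. A hedged PyTorch sketch of both losses, reusing the log_prob idea from the sampling sketch above (the optimizer and batching are assumptions, not from the talk):

```python
import torch

def supervised_loss(logits, labels):
    # maximize sum_i log p(y_i | x_i)  ==  minimize cross-entropy
    return torch.nn.functional.cross_entropy(logits, labels)

def policy_gradient_loss(log_probs, advantages):
    # maximize sum_i A_i * log p(a_i | x_i): same form, but the "labels" are
    # our own sampled actions and each term is weighted by its advantage.
    return -(advantages * log_probs).mean()

# After collecting a batch of episodes:
#   loss = policy_gradient_loss(batch_log_probs, batch_advantages)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```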
Example MiniWoB results:
CNN feedforward policy: pixels -> click @ 64 screen locations, running at 8 FPS with 16 workers.
- wob.mini.ClickButton-v0
- wob.mini.TicTacToe-v0
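One way to realize "click @ 64 screen locations" is to discretize the 160x160 interactive area into an 8x8 grid and map each sampled action index to the center of its cell. A small illustrative sketch (this particular grid layout is an assumption, not stated in the talk):

```python
def action_to_click(action_index, grid=8, cell=20, query_offset=50):
    """Map an action index in [0, 63] to (x, y) pixel coordinates.

    Assumes an 8x8 grid of 20px cells over the 160x160 interactive
    region, below the 50px query strip at the top of the screen.
    """
    col, row = action_index % grid, action_index // grid
    x = col * cell + cell // 2
    y = query_offset + row * cell + cell // 2
    return x, y

print(action_to_click(0))   # (10, 60)   -> top-left cell
print(action_to_click(63))  # (150, 200) -> bottom-right cell
```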
“I’d like to train a Tic Tac Toe AI.”
No problem, I’ll just…
set up a Kubernetes cluster on AWS,
spin up a Chrome browser on Ubuntu in a Docker container,
initialize a 10-million-parameter deep neural network
running asynchronous SGD with distributed TensorFlow,
connect it over VNC to the container,
feed it raw pixels of a rendered Tic Tac Toe grid,
and get it to click the cells with a mouse pointer.
RL is not enough...
E.g. a keyboard has ~80 keys; button mashing at random won't get us far.
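A back-of-the-envelope figure, using the slide's own rough key count: the chance of randomly producing one specific 5-keystroke sequence is (1/80)^5 ≈ 3 × 10^-10, so a button-mashing agent would essentially never stumble onto a correct typed answer, and the sparse reward alone would give it nothing to learn from.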
Human Demonstration Recordings
Supervised Learning on MiniWoB
Supervised Learning (SL) alone:
Wrong loss, but lots of bits.
=> the agent does not know which parts are important, and tends to "spiral out" into unknown situations.

Reinforcement Learning (RL) alone:
Correct loss, but very few bits.
=> hopeless search problem.

Not-too-original solution:
Initialize with SL, fine-tune with RL.
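A sketch of that two-phase recipe, reusing the hypothetical ClickPolicy, supervised_loss, and policy_gradient_loss from the sketches above (the demonstration tensors and the training schedule are placeholders):

```python
import torch

policy = ClickPolicy()                       # from the earlier sketch
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Phase 1: supervised pretraining on human demonstrations (fake tensors here).
demo_pixels  = torch.rand(32, 3, 160, 160)   # recorded screenshots
demo_actions = torch.randint(0, 400, (32,))  # recorded human clicks
for _ in range(100):
    loss = supervised_loss(policy(demo_pixels), demo_actions)
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: RL fine-tuning of the same network with the environment's reward.
# for each batch of episodes:
#     loss = policy_gradient_loss(batch_log_probs, batch_advantages)
#     opt.zero_grad(); loss.backward(); opt.step()
```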
Form Filling
E.g. booking flights; with Tim Shi, Percy Liang.
[Diagram: a JSON blob of the request in, keyboard/mouse actions out]

Making things a bit easier: OCR? (work with Jim Fan)
Making things a bit easier: analyze the DOM.
Sending AI to school
https://www.ixl.com/
http://alpha.openai.com/kalite_exercises/index.html
Curriculum
Thank you!
Universe
+ Almost any binary: Kerbal Space Program, FoldIt, GTA 5, ...

Docker Images
Packages baked in: ubuntu, ca-certificates, iptables, dnsutils, git, unzip, selenium, chromium, chromedriver, tensorflow, gym, golang, ...
Then spin up containers somewhere in the cloud.
Run the init script.
E.g.:
1. Start the Chrome browser
2. Navigate to a (local) URL that loads the flash file
3. Start the VNC server
4. Open ports and wait for connections
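A rough sketch of such an init script, written in Python for concreteness (the binaries, flags, display number, and port are assumptions; real Universe containers wired this up differently):

```python
# Hypothetical container init: virtual display + Chrome + VNC (details assumed).
import os
import subprocess
import time

subprocess.Popen(["Xvfb", ":1", "-screen", "0", "1024x768x24"])  # virtual display
time.sleep(1)

env = {**os.environ, "DISPLAY": ":1"}
subprocess.Popen(["google-chrome", "--no-sandbox",
                  "http://localhost/task.html"], env=env)        # steps 1-2
subprocess.Popen(["x11vnc", "-display", ":1", "-forever",
                  "-rfbport", "5900"])                           # steps 3-4

while True:          # keep the container alive, waiting for agent connections
    time.sleep(60)
```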
In Computer Vision land...
Develop ConvNets [1989-2012], then scale them up [2012-].