Reinforcement Learning on the Web summit, San Francisco, 26 Jan 2017
Andrej Karpathy

Universe
- Universe interface: pixels in, keyboard & mouse out.
- Environments run as Docker containers in the cloud; the agent connects to them over this interface.
- Agents on the Web.

Outline
- Wait but... why?
- MiniWoB benchmark
- Reinforcement Learning: MiniWoB example
- In practice: Reinforcement vs. Supervised Learning
- Ongoing work

Wait but... why?

Agents in ALL the environments! Comparing candidate environments: Robotics vs. Games vs. Web Browsers.

Iteration speed?
- Robotics: Not ideal. Atoms, hardware, difficult to scale.
- Games: Good. Bits.
- Web browsers: Good. Bits.

Data?
- Robotics: Possible. Teleoperation, YouTube videos (?).
- Games: Great. Lots of games, easy to get demos.
- Web browsers: Good. Can record anyone using keyboard/mouse.

Reproducible?
- Robotics: Hard. Need the same robot and setup.
- Games: Good. Run in Docker containers.
- Web browsers: Good. Run in Docker containers.

Ease of working with envs?
- Robotics: Hard. Need physical space or 3D design tools.
- Games: OK. Must work with game engines.
- Web browsers: Good. Just JavaScript, with full access to the DOM.

Useful?
- Robotics: Definitely. Industrial applications, elderly care, ...
- Games: Unclear.
- Web browsers: Quite likely. UI automation (e.g. AMT), AI digital assistants, putting AI to school.

MiniWoB benchmark

Mini World of Bits: "the MNIST of World of Bits" (work with Jonathan Hernandez).
http://alpha.openai.com/miniwob/index.html
- Will be: 100 tasks from easy to hard.
- Tasks are built in HTML/CSS/JS.
- Each task is confined to 210x160 pixels (the top 50 px is the query).
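To make the "pixels in, keyboard & mouse out" interface concrete, here is a minimal sketch of driving one MiniWoB task through the Universe gym wrapper. It assumes the 2017-era universe package and its VNC action space; the click coordinates and the number of steps are purely illustrative.

```python
import gym
import universe  # importing universe registers the wob.mini.* environments
from universe.spaces import PointerEvent

env = gym.make('wob.mini.ClickButton-v0')
env.configure(remotes=1)       # spin up one dockerized Chrome + VNC remote
observation_n = env.reset()    # entries may be None until the remote finishes resetting

for _ in range(100):
    # Illustrative coordinates; the 210x160 task area sits at a fixed offset
    # inside the larger VNC screen.
    x, y = 80, 120
    action_n = [[PointerEvent(x, y, 1),   # press the left mouse button at (x, y)
                 PointerEvent(x, y, 0)]]  # release it
    observation_n, reward_n, done_n, info = env.step(action_n)
    env.render()
```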
MiniWoB example (screenshots of sample tasks), plus human demonstrations recorded for each task.
MiniWoB webpage: http://alpha.openai.com/miniwob/index.html

Reinforcement Learning: MiniWoB example

Training with Reinforcement Learning
Q: where should we click? We have no labels :(
An "amazing convolutional neural network" maps the raw pixels to 400 probabilities, one for each of 400 possible click positions.

RL recipe:
1. Initialize a stochastic policy network.
2. Sample the actions from it.

"An episode": at t=1, t=2, and t=3 the reward is 0; the episode then ends with reward +1.
Our previous episode, miniaturized: 3 states were encountered, 3 specific actions were taken, and we got +1 reward.
An entire batch of episodes: some end with reward +1.0 (we won), others with reward -1.0 (we lost).
For the winning episodes, treat all of the actions we took as "fake" labels and increase their probability.
For the losing episodes, flip the sign and instead decrease the probability of the actions we took.

Supervised Learning
Maximize sum_i log p(y_i | x_i), for images x_i and their labels y_i.

Reinforcement Learning
1) We have no labels, so we sample: y_i ~ p(y | x_i).
2) Once we collect a batch of episodes, maximize sum_i A_i log p(y_i | x_i).
We call A_i the advantage; it is a number, like +1.0 or -1.0, based on how the action eventually turned out.
A positive advantage makes that action more likely in the future, for that state; a negative advantage makes it less likely.

Find more at https://karpathy.github.io/2016/05/31/rl/
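As a concrete illustration of the objective above, here is a minimal NumPy sketch of the REINFORCE-style update for the click task. It assumes a single linear policy layer for brevity (the actual agent is a ConvNet), and all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny stochastic policy: flattened grayscale pixels -> 400 click positions.
n_inputs, n_actions = 210 * 160, 400
W = rng.standard_normal((n_actions, n_inputs)) * 0.01

def policy(obs):
    """Softmax over the 400 possible click positions."""
    logits = W @ obs
    p = np.exp(logits - logits.max())
    return p / p.sum()

def sample_action(obs):
    """RL step 2: sample an action from the stochastic policy."""
    return rng.choice(n_actions, p=policy(obs))

def policy_gradient_update(episodes, lr=0.01):
    """Treat the sampled actions as 'fake' labels and weight each log-prob
    gradient by the advantage: +1.0 pushes the taken action up, -1.0 down."""
    global W
    grad = np.zeros_like(W)
    for obs, action, advantage in episodes:
        p = policy(obs)
        dlogits = -p                # d log p(action) / d logits for a softmax...
        dlogits[action] += 1.0      # ...is (one_hot(action) - p)
        grad += advantage * np.outer(dlogits, obs)
    W += lr * grad / len(episodes)  # gradient ascent on expected reward
```

A full training loop would collect episodes against the environment sketched earlier, assign each (state, action) pair a +1.0 or -1.0 advantage depending on how its episode ended, and call policy_gradient_update on the batch.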
In practice: Reinforcement vs. Supervised Learning

Example MiniWoB results:
- CNN feedforward policy: pixels -> click @ 64 screen locations.
- Running at 8 FPS with 16 workers.
- wob.mini.ClickButton-v0
- wob.mini.TicTacToe-v0

"I'd like to train a Tic Tac Toe AI."
No problem, I'll just... set up a Kubernetes cluster on AWS, spin up a Chrome browser on Ubuntu in a Docker container, initialize a 10 million parameter deep neural network running asynchronous SGD with distributed TensorFlow, connect it over VNC to the container, feed it raw pixels of a rendered Tic Tac Toe grid, and get it to click the cells with a mouse pointer.

RL is not enough...
E.g. the keyboard has ~80 keys; button mashing at random won't get us far.

Human demonstration recordings
Supervised Learning on MiniWoB
- Supervised Learning (SL) alone: wrong loss, but lots of bits. => The agent does not know which parts are important and tends to "spiral out" into unknown situations.
- Reinforcement Learning (RL) alone: correct loss, but very few bits. => A hopeless search problem.
- Not-too-original solution: initialize with SL, fine-tune with RL.

Ongoing work

- Form filling, e.g. booking flights (with Tim Shi, Percy Liang).
- JSON blob actions (work with Jim Fan).
- Making things a bit easier: OCR? Analyze the DOM.
- Sending AI to school: https://www.ixl.com/ and http://alpha.openai.com/kalite_exercises/index.html. Curriculum.

Thank you!

Universe + almost any binary: Kerbal Space Program, FoldIt, GTA 5, ...

Docker images (package list includes unzip, git, iptables, ca-certificates, dnsutils, chromedriver, selenium, tensorflow, gym, golang, ...).
Then spin up containers somewhere in the cloud and run the init script, e.g.:
1. Start the Chrome browser.
2. Navigate to a (local) URL that loads the flash file.
3. Start the VNC server.
4. Open ports and wait for connections.
(A rough sketch of such an init script appears after the closing slide.)

In Computer Vision land... ConvNets were developed here [1989 - 2012], then scaled up [2012-].
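The numbered init-script steps above can be made concrete with a short sketch. This is a hypothetical illustration, not the actual Universe image setup: the paths, ports, and the choice of Xvfb/x11vnc/Chrome flags are all assumptions.

```python
# Hypothetical container init script (illustrative only); assumes Xvfb, x11vnc,
# a Chrome build, and the task files are already present in the image.
import os
import subprocess

env = {**os.environ, "DISPLAY": ":0"}

# Virtual X display for the browser to render into.
subprocess.Popen(["Xvfb", ":0", "-screen", "0", "1024x768x24"])

# Serve the directory containing the task's HTML/flash wrapper on a local URL.
subprocess.Popen(["python3", "-m", "http.server", "8000"], cwd="/opt/task")  # assumed path

# 1-2. Start the Chrome browser and navigate it to the local URL that loads the task.
subprocess.Popen(
    ["google-chrome", "--no-sandbox", "--kiosk", "http://localhost:8000/index.html"],
    env=env,
)

# 3-4. Start the VNC server on an exposed port and block, waiting for agent connections.
subprocess.run(
    ["x11vnc", "-display", ":0", "-forever", "-shared", "-rfbport", "5900"],
    env=env,
)
```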