
HUMAN-LEVEL CONTROL THROUGH
DEEP REINFORCEMENT LEARNING
BY
VOLODYMYR MNIH, KORAY KAVUKCUOGLU, DAVID SILVER, ANDREI A. RUSU, JOEL VENESS, MARC G.
BELLEMARE, ALEX GRAVES, MARTIN RIEDMILLER, ANDREAS K. FIDJELAND, GEORG OSTROVSKI, STIG
PETERSEN, CHARLES BEATTIE, AMIR SADIK, IOANNIS ANTONOGLOU, HELEN KING, DHARSHAN
KUMARAN, DAAN WIERSTRA, SHANE LEGG, DEMIS HASSABIS
PRESENTER: DUYGU CELIK
OUTLINE
• Q-learning
• Related Works
• Architecture & Algorithm
• Results
• Contributions
• Drawbacks
• References
REINFORCEMENT LEARNING
• Value-based RL: estimate the optimal value function
• Optimal value function and the Bellman equation
• Function approximator: a Q-network trained by stochastic gradient descent (equations restated below)
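
The equations behind these bullets appeared as images on the original slide; in the notation of [4, 5] they are:

Optimal action-value function:
  Q^*(s, a) = \max_\pi \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s,\ a_t = a,\ \pi \right]

Bellman equation:
  Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]

Q-network loss, minimized by stochastic gradient descent on minibatches (θ⁻ are the target-network weights, see later slides):
  L_i(\theta_i) = \mathbb{E}_{(s, a, r, s')}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i) \right)^2 \right]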
RELATED WORKS
• Shallow RL: Q-learning combined with a function approximator
• Function approximators used in prior work: tile coding [1], linear functions [2], MLPs [3]
DEEP REINFORCEMENT LEARNING
• Deep Q-Network (DQN): Q-learning with deep learning as the function approximator
• Q-learning is model-free and off-policy (minimal sketch below)
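
A minimal sketch, not the authors' code, of what "model-free" and "off-policy" mean for the Q-learning update; the function names are my own:

import numpy as np

def epsilon_greedy(q_values, epsilon):
    # Behaviour policy: with probability epsilon take a uniformly random action,
    # otherwise the greedy one.
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def q_learning_target(reward, q_values_next, done, gamma=0.99):
    # Model-free: built from a single sampled transition, no environment model.
    # Off-policy: bootstraps from the greedy action max_a Q(s', a), regardless
    # of which action the behaviour policy actually takes next.
    return reward if done else reward + gamma * float(np.max(q_values_next))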
MODEL ARCHITECTURE
[Architecture figure from [5]]
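
The layer sizes below follow the description in [5] (four stacked 84×84 grayscale frames in, one Q-value per action out); the PyTorch framing is my own sketch, not the original implementation:

import torch
import torch.nn as nn

class DQN(nn.Module):
    # Sketch of the network described in [5].
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9 -> 7x7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),                               # Q(s, a) for each action
        )

    def forward(self, x):
        # x: (batch, 4, 84, 84) uint8 frames, scaled here to [0, 1]
        return self.net(x.float() / 255.0)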
ALGORITHM
[Algorithm figure (deep Q-learning with experience replay) from [6]]
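
A condensed sketch of the loop from [5, 6]; env, replay, the networks, and the optimizer are placeholder assumptions (the replay memory is sketched under the Experience Replay slide below):

import random
import torch
import torch.nn.functional as F

def train_dqn(env, q_net, target_net, replay, optimizer, n_actions,
              steps=1_000_000, batch_size=32, gamma=0.99,
              sync_every=10_000, epsilon=0.1):
    # env.reset()/env.step(a) are assumed to return preprocessed (4, 84, 84) tensors.
    # epsilon is fixed here for brevity; [5] anneals it from 1.0 to 0.1.
    state = env.reset()
    for t in range(steps):
        # Epsilon-greedy behaviour policy
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            with torch.no_grad():
                action = int(q_net(state.unsqueeze(0)).argmax())

        next_state, reward, done = env.step(action)
        replay.push(state, action, reward, next_state, done)  # store transition
        state = env.reset() if done else next_state

        if len(replay) >= batch_size:
            s, a, r, s2, d = replay.sample(batch_size)
            with torch.no_grad():                              # target from the frozen network
                y = r + gamma * target_net(s2).max(1).values * (1 - d)
            q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = F.smooth_l1_loss(q, y)                      # Huber loss, i.e. the error clipping of [5]
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if t % sync_every == 0:                                # periodically copy weights to the target network
            target_net.load_state_dict(q_net.state_dict())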
EXPERIENCE REPLAY
[Figure from [6]]
BENEFITS
1.Data Efficiency
2. No correlation between samples
3. No more poor local minimum
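
A minimal replay-memory sketch; the class and method names are my own, not from the paper:

import random
from collections import deque
import torch

class ReplayMemory:
    # Fixed-size buffer of transitions, sampled uniformly at random.
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, d = zip(*batch)
        return (torch.stack(s),
                torch.tensor(a, dtype=torch.int64),
                torch.tensor(r, dtype=torch.float32),
                torch.stack(s2),
                torch.tensor(d, dtype=torch.float32))

    def __len__(self):
        return len(self.buffer)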
SEPARATE NETWORK
[Figure from [6]]
BENEFITS
1. More stable training.
2. Divergence and oscillations are less likely.
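
In the loss from [5], the bootstrap target is computed with an older parameter copy θ⁻ that is held fixed between periodic updates:

  y_j = r_j + \gamma \max_{a'} Q(s'_j, a'; \theta^-)

Without the separate network the target would be y_j = r_j + \gamma \max_{a'} Q(s'_j, a'; \theta_i), which shifts after every gradient step. Freezing θ⁻ for C updates (then setting θ⁻ ← θ) breaks this feedback loop, which is what makes divergence and oscillations less likely.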
RESULTS
• 49 Atari games
• DQN achieved human-level performance in almost half of the games
[Results figure from [5]]
RESULTS
• DQN vs. a linear function approximator [Figure from [5]]
• Effects of experience replay and the target Q-network [Figure from [5]]
CONTRIBUTIONS
1. Stabilizing the training of the Q action-value function, approximated with a deep neural network, by using experience replay and a target network.
2. Designing an end-to-end RL approach (from raw pixels to actions).
3. Training a flexible network with the same algorithm, network architecture, and hyperparameters across all games.
SHORTCOMINGS
• Action selection policy: ε-greedy explores by selecting random actions uniformly, so the worst possible action is just as likely to be selected as the second best.
• Overestimation: the max operator uses the same values both to select and to evaluate an action. A solution, Double DQN, is provided in [7] (sketch below).
• Experience replay: experience transitions are sampled uniformly from the replay memory. A solution is provided in [8].
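
For reference, the Double DQN target from [7] decouples action selection from evaluation; a sketch using the same placeholder networks and batch tensors as in the algorithm sketch above:

import torch

def double_dqn_target(q_net, target_net, r, s2, d, gamma=0.99):
    # The online network selects the greedy action, the target network
    # evaluates it, so one network's estimation errors are no longer used
    # for both selection and evaluation.
    with torch.no_grad():
        best = q_net(s2).argmax(dim=1, keepdim=True)        # select with theta
        q_eval = target_net(s2).gather(1, best).squeeze(1)  # evaluate with theta^-
        return r + gamma * q_eval * (1 - d)

# Standard DQN target, for contrast (selection and evaluation both use theta^-):
#   r + gamma * target_net(s2).max(1).values * (1 - d)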
REFERENCES
1. A. A. Sherstov and P. Stone, "Function approximation via tile coding: Automating parameter choice," International Symposium on Abstraction, Reformulation, and Approximation, Springer Berlin Heidelberg, 2005.
2. M. Irodova and R. H. Sloan, "Reinforcement Learning and Function Approximation," FLAIRS Conference, 2005.
3. G. Tesauro, "Temporal difference learning and TD-Gammon," Communications of the ACM, vol. 38, no. 3, pp. 58–68, 1995.
4. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with Deep Reinforcement Learning," arXiv, 2013.
5. V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, 2015.
6. Y. Li, "Deep Reinforcement Learning: An Overview," arXiv, 2017.
7. H. van Hasselt, A. Guez, and D. Silver, "Deep Reinforcement Learning with Double Q-Learning," AAAI, 2016.
8. D. Perez-Liebana et al., "The 2014 general video game playing competition," IEEE Transactions on Computational Intelligence and AI in Games, vol. 8, no. 3, pp. 229–243, 2016.