HUMAN-LEVEL CONTROL THROUGH DEEP REINFORCEMENT LEARNING

By Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, Demis Hassabis

Presenter: Duygu Celik

OUTLINE
• Q-learning
• Related Works
• Architecture & Algorithm
• Results
• Contributions
• Drawbacks
• References

REINFORCEMENT LEARNING
• Value-based RL: estimate the optimal action-value function.
• Optimal action-value function: Q*(s, a) = max_π E[ R_t | s_t = s, a_t = a, π ]
• Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]
• Function approximator (Q-network): Q(s, a; θ) ≈ Q*(s, a), where θ are the network weights.
• Training by stochastic gradient descent on the loss
  L_i(θ_i) = E[ ( r + γ max_{a'} Q(s', a'; θ_i^-) − Q(s, a; θ_i) )^2 ],
  where θ_i^- are the weights used to compute the target.

RELATED WORKS
• Q-learning combined with shallow function approximators ("shallow RL"):
  • Tile coding [1]
  • Linear functions [2]
  • Multi-layer perceptrons (MLP) [3]

DEEP REINFORCEMENT LEARNING
• RL with a deep neural network as the function approximator.
• Q-learning is model-free and off-policy.
• The combination yields the Deep Q-Network (DQN).

MODEL ARCHITECTURE [5]
• (Figure from [5]: a convolutional neural network that takes preprocessed game frames as input and outputs one Q-value per action.)

ALGORITHM [6]
• (Figure from [6]: the deep Q-learning algorithm with experience replay and a target network.)

EXPERIENCE REPLAY [6]
• Store transitions (s, a, r, s') in a replay memory and train on minibatches sampled at random.
• Benefits:
  1. Greater data efficiency: each transition can be reused in many updates.
  2. Random sampling breaks the correlations between consecutive samples.
  3. Averaging over past behaviour smooths learning and helps avoid oscillations or poor local minima.

SEPARATE NETWORK [6]
• A separate target network, updated only periodically, generates the Q-learning targets.
• Benefits:
  1. More stable training.
  2. Lower likelihood of divergence or oscillations.
• (A minimal code sketch of experience replay and the target network is given in Appendix A at the end of these slides.)

RESULTS
• Evaluated on 49 Atari 2600 games.
• DQN achieved human-level performance in almost half of the games [5].
• DQN vs. a linear function approximator [5].
• Effects of experience replay and the target Q-network [5].

CONTRIBUTIONS
1. Stabilizing the training of a deep neural network approximation of the Q action-value function by using experience replay and a target network.
2. Designing an end-to-end RL approach, from raw pixels to actions.
3. Training a flexible network with the same algorithm, network architecture, and hyperparameters across all games.

SHORTCOMINGS
• Action selection and overestimation:
  • ε-greedy exploration selects random actions uniformly, so the worst possible action is just as likely to be chosen as the second best.
  • The max operator uses the same values both to select and to evaluate an action, which leads to overestimated action values.
  • A solution, Double DQN, is provided in [7] (see Appendix B at the end of these slides).
• Experience replay:
  • Experience transitions are sampled uniformly from the replay memory, regardless of their significance.
  • A solution is provided in [8].

REFERENCES
1. Sherstov, Alexander A., and Peter Stone. "Function Approximation via Tile Coding: Automating Parameter Choice." International Symposium on Abstraction, Reformulation, and Approximation. Springer Berlin Heidelberg, 2005.
2. Irodova, Marina, and Robert H. Sloan. "Reinforcement Learning and Function Approximation." FLAIRS Conference. 2005.
3. Tesauro, Gerald. "Temporal Difference Learning and TD-Gammon." Communications of the ACM 38.3 (1995): 58–68.
4. Mnih, Volodymyr, et al. "Playing Atari with Deep Reinforcement Learning." arXiv, 2013.
5. Mnih, Volodymyr, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518 (2015): 529–533.
6. Li, Yuxi. "Deep Reinforcement Learning: An Overview." arXiv, 2017.
7. Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." AAAI. 2016.
8. Perez-Liebana, Diego, et al. "The 2014 General Video Game Playing Competition." IEEE Transactions on Computational Intelligence and AI in Games 8.3 (2016): 229–243.
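
APPENDIX A: ILLUSTRATIVE SKETCH OF THE DQN TRAINING LOOP

The following is a minimal sketch of the training loop summarized in the Experience Replay and Separate Network slides: an ε-greedy agent stores transitions in a replay memory, trains on randomly sampled minibatches, and computes Bellman targets with a separate, periodically synchronized target network. The linear Q-function, the toy_env_step environment, and all hyperparameter values are placeholders chosen for illustration; they are not the convolutional network, Atari emulator, or settings used in [5].

# Minimal DQN-style training loop: experience replay + separate target network.
# A linear Q-function and a toy random environment stand in for the
# convolutional network and the Atari emulator (illustrative assumptions only).
import random
from collections import deque

import numpy as np

STATE_DIM, N_ACTIONS = 4, 3
GAMMA, LR, EPSILON = 0.99, 1e-2, 0.1
BATCH_SIZE, TARGET_SYNC_EVERY = 32, 100

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(N_ACTIONS, STATE_DIM))  # online Q weights
theta_target = theta.copy()                                 # target Q weights
replay = deque(maxlen=10_000)                               # experience replay memory


def q_values(weights, state):
    """Linear stand-in for Q(s, .; theta): one value per action."""
    return weights @ state


def toy_env_step(state, action):
    """Placeholder environment transition (assumption, not from the paper)."""
    next_state = rng.normal(size=STATE_DIM)
    reward = float(action == 0)      # arbitrary reward, just to run the loop
    done = rng.random() < 0.05
    return next_state, reward, done


state = rng.normal(size=STATE_DIM)
for step in range(1, 5_001):
    # epsilon-greedy action selection using the online network
    if rng.random() < EPSILON:
        action = int(rng.integers(N_ACTIONS))
    else:
        action = int(np.argmax(q_values(theta, state)))

    next_state, reward, done = toy_env_step(state, action)
    replay.append((state, action, reward, next_state, done))  # store transition
    state = rng.normal(size=STATE_DIM) if done else next_state

    if len(replay) >= BATCH_SIZE:
        # random minibatch: breaks correlations between consecutive samples
        batch = random.sample(replay, BATCH_SIZE)
        for s, a, r, s2, d in batch:
            # Bellman target uses the *target* network, held fixed between syncs
            y = r if d else r + GAMMA * np.max(q_values(theta_target, s2))
            td_error = y - q_values(theta, s)[a]
            theta[a] += LR * td_error * s                      # SGD step on (y - Q)^2

    if step % TARGET_SYNC_EVERY == 0:
        theta_target = theta.copy()                            # periodic target update

print("greedy action for a random state:",
      int(np.argmax(q_values(theta, rng.normal(size=STATE_DIM)))))

Replacing toy_env_step with a real emulator interface and q_values with a convolutional network recovers the structure of the algorithm described in [5].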
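
APPENDIX B: STANDARD VS. DOUBLE DQN TARGETS

The sketch below contrasts the standard DQN target with the Double DQN target from [7], which addresses the overestimation noted in the Shortcomings slide by letting the online network select the action while the target network evaluates it. The linear weight matrices theta and theta_target mirror Appendix A and are illustrative placeholders, not the networks used in [5] or [7].

# Target computation only; everything else in the training loop stays the same.
import numpy as np


def dqn_target(r, s2, done, theta_target, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the action."""
    if done:
        return r
    return r + gamma * np.max(theta_target @ s2)


def double_dqn_target(r, s2, done, theta, theta_target, gamma=0.99):
    """Double DQN [7]: the online network selects the action, the target
    network evaluates it, reducing the overestimation caused by the max."""
    if done:
        return r
    a_star = int(np.argmax(theta @ s2))             # selection: online weights
    return r + gamma * (theta_target @ s2)[a_star]  # evaluation: target weights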