Deep Q Learning
Jheng-Ying Yu
Taiwan Evolutionary Intelligence Laboratory
2017/03/27 Group Meeting Presentation

Brief Introduction
• Deep Learning (DL) learns the representation required to achieve an objective.
• Reinforcement Learning (RL) is for an agent with the capacity to act, selecting actions to maximize future reward.
• Combining DL and RL centers on the optimal action-value function:
  $Q^*(s,a) = \mathbb{E}_{s'}\left[\, r + \gamma \max_{a'} Q(s', a', \mathbf{w}) \mid s, a \,\right]$

RL + DL
• We seek a single agent that can solve any human-level task.
• RL defines the objective; DL gives the mechanism.
• RL + DL = general intelligence.

RL + DL
• Use deep neural networks to represent the value function, the policy, and the model.
• Optimize by stochastic gradient descent.

Q-network
• Represent the value function by a Q-network with weights $\mathbf{w}$: $Q(s, a, \mathbf{w})$.

Q-Learning
• $Q^*(s,a) = \mathbb{E}_{s'}\left[\, r + \gamma \max_{a'} Q(s', a', \mathbf{w}) \mid s, a \,\right]$
• Treat $r + \gamma \max_{a'} Q(s', a', \mathbf{w})$ as the target.
• Minimize the squared TD error (a code sketch follows the references):
  $\left( r + \gamma \max_{a'} Q(s', a', \mathbf{w}) - Q(s, a, \mathbf{w}) \right)^2$

Deep Q-Network (DQN)
• Sample experience from the dataset: experience replay.

Deep Q-Network (DQN)
• Naive loss: $\left( r + \gamma \max_{a'} Q(s', a', \mathbf{w}) - Q(s, a, \mathbf{w}) \right)^2$
• With a frozen target network $\mathbf{w}^-$: $\left( r + \gamma \max_{a'} Q(s', a', \mathbf{w}^-) - Q(s, a, \mathbf{w}) \right)^2$
• The frozen network keeps the regression target stationary, and experience replay removes the correlation between consecutive samples (a code sketch follows the references).

Over-optimism
• Suppose the learned values carry zero-mean noise $Y$: $Q_{\mathrm{approx}}(s,a) = Q_{\mathrm{target}}(s,a) + Y_{s,a}$.
• After we update the Q-value, the error pushed into the target is
  $Z = r_{s,a} + \gamma \max_{a_1} Q_{\mathrm{approx}}(s', a_1) - r_{s,a} - \gamma \max_{a_2} Q_{\mathrm{target}}(s', a_2)$
  $\;= \gamma \max_{a_1} Q_{\mathrm{approx}}(s', a_1) - \gamma \max_{a_2} Q_{\mathrm{target}}(s', a_2)$
  $\;\ge \gamma \left( Q_{\mathrm{approx}}(s', a^*) - Q_{\mathrm{target}}(s', a^*) \right) = \gamma Y_{s', a^*}$,
  where $a^* = \arg\max_{a_2} Q_{\mathrm{target}}(s', a_2)$; the inequality holds because the max over $a_1$ is at least the value at $a^*$.
• Even if $\mathbb{E}[Y] = 0$, $\mathbb{E}[Z] > 0$: taking a max over noisy estimates is biased upward.

Double DQN
• Train two DQNs: one is used to select actions, the other to evaluate them (a code sketch follows the references).

Prioritized Replay
• Weight experience according to surprise.
• Store experience in a priority queue ordered by the DQN error (a code sketch follows the references):
  $\left( r + \gamma \max_{a'} Q(s', a', \mathbf{w}^-) - Q(s, a, \mathbf{w}) \right)^2$

Dueling Network
• $Q(s, a) = V(s, \mathbf{v}) + A(s, a, \mathbf{w})$ (a code sketch follows the references)

Bridge Bidding
• It is difficult to evaluate bids before the final bid, such as the opening and intermediate bids.
• Intermediate bids are best scored by how much they help the last bid achieve the best score.

Bridge Bidding: training flow (sketched in code after the references)
1. Randomly select a data instance.
2. Generate the possible bidding sequences.
3. Initialize the cost array.
4. For all actions, determine the cost and record it.
5. Save the result in the database.
6. Select the action with the highest estimated reward and update the bidding sequence.
• Training: sample a random mini-batch from the database and perform gradient descent to update the Q-value.

Conclusion
• Use deep networks to represent the value function, the policy, and the model.
• Use a variety of deep RL paradigms to achieve stable and scalable AI.

Reference
• http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2017/Lecture/Basic%20Structure%20(v8).pdf
• http://www.algorithmdog.com/drl
• Chih-Kuan Yeh and Hsuan-Tien Lin. Automatic Bridge Bidding Using Deep Reinforcement Learning.
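
Code Sketch: Q-Learning Update
A minimal tabular sketch of the squared-TD-error objective from the Q-Learning slide. The table sizes, discount factor, and learning rate are illustrative assumptions, not values from the presentation.

```python
import numpy as np

# Minimal tabular Q-learning: minimize
# (r + gamma * max_a' Q(s', a') - Q(s, a))^2.
# All sizes and hyperparameters below are illustrative assumptions.
n_states, n_actions = 16, 4
gamma, lr = 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next, done):
    # TD target r + gamma * max_a' Q(s', a'); no bootstrap at terminal states.
    target = r if done else r + gamma * np.max(Q[s_next])
    td_error = target - Q[s, a]
    Q[s, a] += lr * td_error  # gradient step on the squared error
    return td_error
```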
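
Code Sketch: DQN with Experience Replay and a Target Network
A PyTorch sketch of one DQN training step, combining the replay buffer and the frozen weights $\mathbf{w}^-$ from the DQN slides. Network sizes, the optimizer, and all hyperparameters are assumptions made for illustration.

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA, BATCH = 8, 4, 0.99, 32  # illustrative assumptions

def make_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

q_net = make_net()                          # online weights w
target_net = make_net()                     # frozen weights w^-
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)               # experience replay buffer

def train_step():
    if len(replay) < BATCH:
        return
    batch = random.sample(replay, BATCH)    # sample experience from the dataset
    s = torch.stack([t[0] for t in batch])
    a = torch.tensor([t[1] for t in batch])
    r = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    s2 = torch.stack([t[3] for t in batch])
    done = torch.tensor([t[4] for t in batch], dtype=torch.float32)
    with torch.no_grad():
        # Stationary target r + gamma * max_a' Q(s', a', w^-).
        target = r + GAMMA * (1 - done) * target_net(s2).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a, w)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Periodically copying $\mathbf{w}$ into $\mathbf{w}^-$ with `target_net.load_state_dict(q_net.state_dict())` keeps the target slow-moving rather than permanently fixed.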
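
Code Sketch: Double DQN Target
Following the Double DQN slide, the online network selects the next action while the target network evaluates it. The helper below is hypothetical and reuses the tensor conventions of the previous sketch.

```python
import torch

def double_dqn_target(q_net, target_net, r, s2, done, gamma=0.99):
    # Selection by the online net, evaluation by the target net,
    # which counteracts the over-optimism of a single max.
    with torch.no_grad():
        a_star = q_net(s2).argmax(dim=1)
        q_eval = target_net(s2).gather(1, a_star.unsqueeze(1)).squeeze(1)
        return r + gamma * (1.0 - done) * q_eval
```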
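
Code Sketch: Prioritized Replay
The slide describes storing experience in a priority queue ordered by the DQN error; the heap below implements that reading literally. (The published prioritized replay instead samples transitions stochastically in proportion to their priority.) The transition format is an assumption.

```python
import heapq
import itertools

_tiebreak = itertools.count()   # avoids comparing transitions on equal priority
pq = []                         # min-heap of (-|td_error|, tiebreak, transition)

def push(transition, td_error):
    # Larger "surprise" (|TD error|) -> higher priority -> smaller heap key.
    heapq.heappush(pq, (-abs(td_error), next(_tiebreak), transition))

def pop_most_surprising():
    neg_err, _, transition = heapq.heappop(pq)
    return transition, -neg_err
```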
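
Code Sketch: Dueling Network Head
A PyTorch module for the decomposition $Q(s,a) = V(s,\mathbf{v}) + A(s,a,\mathbf{w})$ on the Dueling Network slide. Layer sizes are assumptions; note that the published dueling architecture additionally subtracts the mean advantage for identifiability, which the slide's plain sum omits.

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, in_dim=64, n_actions=4):       # sizes are illustrative
        super().__init__()
        self.value = nn.Linear(in_dim, 1)             # V(s, v)
        self.advantage = nn.Linear(in_dim, n_actions) # A(s, a, w)

    def forward(self, features):
        # Broadcasting adds the scalar V(s) to every action's advantage.
        return self.value(features) + self.advantage(features)
```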
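
Code Sketch: Bridge-Bidding Training Flow
A skeleton mirroring only the control flow of the bridge-bidding flowchart. Every data structure and helper here (integer bids, the random cost, the dictionary Q-table) is a hypothetical toy stand-in, not the actual procedure from Yeh and Lin's paper.

```python
import random

N_BIDS, MAX_LEN = 36, 4          # toy assumptions
database, Q = [], {}             # experience database and Q-table

def evaluate_cost(deal, sequence):
    return random.random()       # placeholder for the real bid-scoring cost

def collect_episode(deal):       # one pass of the flowchart
    sequence = []
    while len(sequence) < MAX_LEN:                       # until the final bid
        # generate the possible next bids (bids must increase, as in bridge)
        candidates = [b for b in range(N_BIDS) if not sequence or b > sequence[-1]]
        if not candidates:
            break
        # initialize the cost array and, for all actions, record the cost
        costs = {b: evaluate_cost(deal, sequence + [b]) for b in candidates}
        database.append((deal, tuple(sequence), costs))  # save in the database
        # select the action with the highest estimated reward
        best = max(candidates, key=lambda b: Q.get((tuple(sequence), b), 0.0))
        sequence.append(best)                            # update the sequence

def train_step(batch_size=8, lr=0.1):
    # sample a random mini-batch and move Q toward the recorded reward (-cost)
    for deal, seq, costs in random.sample(database, min(batch_size, len(database))):
        for b, cost in costs.items():
            Q[(seq, b)] = Q.get((seq, b), 0.0) + lr * (-cost - Q.get((seq, b), 0.0))
```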