A brief introduction to Deep Q-learning

Deep Q Learning
Jheng-Ying Yu
Taiwan Evolutionary Intelligence Laboratory
2017/03/27 Group Meeting Presentation
Brief Introduction
• Deep Learning (DL) learns the representations required to achieve an objective
• Reinforcement Learning (RL) is for an agent with the capacity to act, selecting actions so as to maximize future reward
Deep Learning
Reinforcement Learning
$Q^*(s, a) = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a'} Q(s', a', \mathbf{w}) \mid s, a \,\big]$
RL + DL
• We seek a single agent that can solve any human-level task
• RL defines the objective
• DL gives the mechanism
• RL + DL = general intelligence
RL + DL
• Use deep neural networks to represent
– Value function
– Policy
– Model
• Optimize by stochastic gradient descent
Q-network
• Represent the value function by a Q-network with weights $\mathbf{w}$
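A minimal sketch of such a Q-network in PyTorch; the two-layer MLP and the `state_dim`/`num_actions` parameters are assumptions, since the slides do not fix an architecture:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state to one Q-value per action, i.e. Q(s, ., w)."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```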
Q-Learning
• $Q^*(s, a) = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a'} Q(s', a', \mathbf{w}) \mid s, a \,\big]$
• Treat $r + \gamma \max_{a'} Q(s', a', \mathbf{w})$ as the target
• Minimize $\big(r + \gamma \max_{a'} Q(s', a', \mathbf{w}) - Q(s, a, \mathbf{w})\big)^2$
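A sketch of this squared-error objective in PyTorch; the batch layout and `gamma` value are assumptions, and the target is held fixed with `no_grad`, matching the "treat it as the target" step above:

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """(r + γ max_a' Q(s', a', w) − Q(s, a, w))², averaged over a batch.

    `actions` is an int64 tensor of shape (B,); the others are float tensors.
    """
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a, w)
    with torch.no_grad():  # the target is treated as a fixed regression label
        target = rewards + gamma * (1.0 - dones) * q_net(next_states).max(dim=1).values
    return F.mse_loss(q_sa, target)
```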
Deep Q-Network (DQN)
• Store transitions in a data set and sample experience from it
• This technique is called experience replay
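A minimal replay-buffer sketch; the capacity and the (s, a, r, s', done) transition layout are assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s', done) transitions."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experience is evicted first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size: int):
        # Uniform sampling decorrelates consecutive transitions.
        return random.sample(self.buffer, batch_size)
```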
Deep Q-Network (DQN)
• $\big(r + \gamma \max_{a'} Q(s', a', \mathbf{w}) - Q(s, a, \mathbf{w})\big)^2$
• $\big(r + \gamma \max_{a'} Q(s', a', \mathbf{w}^-) - Q(s, a, \mathbf{w})\big)^2$
• Computing the target with frozen weights $\mathbf{w}^-$ makes the target stationary, and experience replay removes the correlation between data
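A sketch of the frozen-target idea; the copy interval `every` is an assumption:

```python
import copy

def make_target_net(q_net):
    """Clone the online network; the copy's weights play the role of w⁻."""
    return copy.deepcopy(q_net)

def sync_target(q_net, target_net, step, every=1000):
    """Copy w into w⁻ every `every` steps so the target stays fixed in between."""
    if step % every == 0:
        target_net.load_state_dict(q_net.state_dict())
```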
Over-optimism
• $Q_{\mathrm{approx}}(s, a) = Q_{\mathrm{target}}(s, a) + Y_{s,a}$, where $Y_{s,a}$ is the approximation error
• After we update the Q-value, the over-estimate is
  $Z = \big(r_{s,a} + \gamma \max_{a_1} Q_{\mathrm{approx}}(s', a_1)\big) - \big(r_{s,a} + \gamma \max_{a_2} Q_{\mathrm{target}}(s', a_2)\big)$
  $\phantom{Z} = \gamma \max_{a_1} Q_{\mathrm{approx}}(s', a_1) - \gamma \max_{a_2} Q_{\mathrm{target}}(s', a_2)$
  $\phantom{Z} \ge \gamma \big(Q_{\mathrm{approx}}(s', a') - Q_{\mathrm{target}}(s', a')\big)$ for $a' = \arg\max_{a_2} Q_{\mathrm{target}}(s', a_2)$
  $\phantom{Z} = \gamma\, Y_{s',a'}$
• Even if $\mathbb{E}[Y] = 0$, $\mathbb{E}[Z] > 0$, because the max picks out whichever action's noise happens to be largest
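A tiny NumPy check of this argument under an assumed toy setup: all true Q-values are zero, so $Q_{\mathrm{approx}}$ is pure zero-mean noise $Y$, yet the max is biased upward:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy setup: Q_target(s', a) = 0 for 4 actions, Q_approx = Q_target + Y, E[Y] = 0.
Y = rng.normal(0.0, 1.0, size=(100_000, 4))
Z = Y.max(axis=1) - 0.0          # γ omitted; max_a Q_approx − max_a Q_target
print(Z.mean())                  # ≈ 1.03 > 0 even though E[Y] = 0
```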
Double DQN
• Train two DQNs: one is used to select actions, the other to evaluate them
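Concretely, the target becomes $r + \gamma\, Q(s', \arg\max_{a'} Q(s', a', \mathbf{w}), \mathbf{w}^-)$: the online network selects, the target network evaluates. A sketch, with tensor shapes assumed as in the loss above:

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        best = q_net(next_states).argmax(dim=1, keepdim=True)        # select action
        q_eval = target_net(next_states).gather(1, best).squeeze(1)  # evaluate it
    return rewards + gamma * (1.0 - dones) * q_eval
```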
Prioritized Replay
• Weight experience according to surprise
• Store experience in a priority queue according to the DQN error
  $\big(r + \gamma \max_{a'} Q(s', a', \mathbf{w}^-) - Q(s, a, \mathbf{w})\big)^2$
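A simplified priority-queue sketch keyed on the magnitude of that error; the published method samples proportionally to priority and corrects the bias with importance weights, which this omits:

```python
import heapq
import itertools

class PrioritizedReplay:
    """Pops the most 'surprising' transition first (simplified sketch)."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # tie-breaker so heapq never compares transitions

    def push(self, td_error: float, transition):
        # Negate |error| because heapq is a min-heap.
        heapq.heappush(self._heap, (-abs(td_error), next(self._tie), transition))

    def pop(self):
        _, _, transition = heapq.heappop(self._heap)
        return transition
```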
Dueling Network
• $Q(s, a) = V(s, \mathbf{v}) + A(s, a, \mathbf{w})$
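A sketch of the dueling head; the dimensions are assumptions, and subtracting the mean advantage is the identifiability convention from the dueling-network paper, which the slide's formula leaves implicit:

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q(s, a) = V(s, v) + A(s, a, w), computed by two streams on shared features."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                # V(s, v)
        self.advantage = nn.Linear(hidden, num_actions)  # A(s, a, w)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.features(state)
        a = self.advantage(h)
        # Subtracting the mean advantage keeps V and A identifiable.
        return self.value(h) + a - a.mean(dim=1, keepdim=True)
```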
Bridge Bidding
• Bids made before the final bid, such as the opening and intermediate bids, are difficult to evaluate directly
• Intermediate bids are therefore best scored by how much they help the final bid achieve the best score
Training flow (sketched in code below):
1. Randomly select a data instance
2. Generate the possible bidding sequences
3. Initialize the cost array
4. For all actions, determine the cost and record it
5. Save the result in the database
6. Select the action with the highest estimated reward and update the bidding sequence
7. Training: sample a random mini-batch from the database and perform gradient descent to update the Q-value
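A hypothetical, self-contained sketch of this flow; `estimate_reward` and `cost_of` are stand-ins for the paper's Q-network and its scoring of the finished auction, not the authors' actual code:

```python
import random

def estimate_reward(sequence, action):
    return random.random()  # stub for the Q-network's reward estimate

def cost_of(deal, sequence, action):
    return random.random()  # stub for scoring the deal if bidding ends this way

def build_database(deals, actions, max_bids=4):
    """Steps 1-6: roll out bidding sequences and record every action's cost."""
    database = []
    for deal in random.sample(deals, len(deals)):        # 1. randomly select instances
        sequence = []
        for _ in range(max_bids):                        # 2. generate a bidding sequence
            costs = {a: cost_of(deal, sequence, a)       # 3-4. cost array, for all actions
                     for a in actions}
            database.append((deal, tuple(sequence), costs))  # 5. save result
            best = max(actions, key=lambda a: estimate_reward(sequence, a))
            sequence.append(best)                        # 6. update bidding sequence
    return database

# 7. Training then samples random mini-batches from `database` and performs
#    gradient descent on the Q-network (as in the DQN loss above).
```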
Conclusion
• Use deep networks to represent the value function, policy, and model
• Use a variety of deep RL paradigms to achieve stable and scalable AI
References
• http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2017/Lecture/Basic%20Structure%20(v8).pdf
• http://www.algorithmdog.com/drl
• Chih-Kuan Yeh and Hsuan-Tien Lin, "Automatic Bridge Bidding Using Deep Reinforcement Learning"