
Double Q-learning
Or: How to solve Q-learning’s gambling issues
Hado van Hasselt
Centrum voor Wiskunde en Informatica, Amsterdam, Netherlands

Q-learning: bad performance in some noisy settings

• For instance, roulette (1 state, 171 actions)

[Figure: average action values for Q-learning on roulette with synchronous updates, α = 1/n and γ = 0.95.]

• Q-learning after 17 million updates:
  ◦ Each dollar will yield between $22.60 and $22.70
  ◦ Better to gamble than to walk away
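
The poster does not list the individual bets, so the sketch below stands in with 170 identical $1 straight-up bets on an American wheel (win $35 with probability 1/38, lose the dollar otherwise) plus one walk-away action, and runs the synchronous updates from the caption (α = 1/n, γ = 0.95). It is an illustration of the setup, not the poster's code.

```python
import random

N_BET, GAMMA = 170, 0.95
Q = [0.0] * (N_BET + 1)      # one state, so Q is just a vector over the 171 actions
n = [0] * (N_BET + 1)        # per-action update counts, giving alpha = 1/n

for sweep in range(10_000):
    m = max(Q)                         # value of the single state under the current Q
    for a in range(N_BET + 1):         # synchronous: update every action each sweep
        if a < N_BET:                  # a betting action returns to the same state
            r = 35.0 if random.random() < 1 / 38 else -1.0
            target = r + GAMMA * m
        else:                          # walking away ends the episode with reward 0
            target = 0.0
        n[a] += 1
        Q[a] += (1 / n[a]) * (target - Q[a])

# The betting values stay far above their true value (about -$0.05 per dollar)
# and above the walk-away value of 0, illustrating the overestimation.
print(max(Q[:N_BET]), Q[N_BET])
```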

Double Q-learning

• Idea: apply double estimator to Q-learning
• Use two Q-functions: QA and QB
• Use average of QA and QB to choose actions
• Update:

  QA(s, a) ← QA(s, a) + α( r + γ QB(s′, a∗) − QA(s, a) )
  or
  QB(s, a) ← QB(s, a) + α( r + γ QA(s′, b∗) − QB(s, a) )

  where a∗ = arg maxa QA(s′, a) and b∗ = arg maxa QB(s′, a)

• Each time step: update only one (e.g., random pick of QA or QB)
• Same time and space complexity as Q-learning
• Provably convergent
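
A minimal sketch of this update, assuming a Q-table stored as a dict of dicts and a finite action list; the function name and the terminal-state handling are illustrative, not from the poster.

```python
import random

def double_q_update(QA, QB, s, a, r, s_next, actions, alpha, gamma, terminal=False):
    """Update one of the two tables, using the other to evaluate the greedy action."""
    if random.random() < 0.5:
        QA, QB = QB, QA              # local swap: half the time QB is the table updated
    if terminal:
        target = r
    else:
        a_star = max(actions, key=lambda b: QA[s_next][b])   # arg max under the updated table
        target = r + gamma * QB[s_next][a_star]              # ...evaluated with the other table
    QA[s][a] += alpha * (target - QA[s][a])                  # mutates the underlying dict in place

# Action selection (e.g. epsilon-greedy) would score actions with (QA[s][b] + QB[s][b]) / 2.
```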

Results: roulette

• Double Q-learning:
  ◦ Much better than Q-learning
  ◦ Better to walk away than to gamble

Why?

• At time t: in state s, do action a, observe reward r and state s′
• Q-learning:

  Qt+1(s, a) ← Qt(s, a) + α( r + γ maxb Qt(s′, b) − Qt(s, a) )

• Qt is a noisy approximation of the optimal Q∗
• But, in presence of noise:

  E{ maxb Qt(s′, b) } > maxb E{ Qt(s′, b) }

• In other words: Q-learning is biased
• Bias is cumulative per update
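
For contrast with the double update above, the ordinary update from this box looks like the following sketch (same dict-of-dicts Q-table convention as before); the max over b in the target is exactly where E{ maxb Qt(s′, b) } > maxb E{ Qt(s′, b) } enters.

```python
def q_update(Q, s, a, r, s_next, actions, alpha, gamma, terminal=False):
    """One tabular Q-learning update; the max over next actions is the source of the bias."""
    target = r if terminal else r + gamma * max(Q[s_next][b] for b in actions)
    Q[s][a] += alpha * (target - Q[s][a])
```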

Results: small grid world

• 9 states, 4 actions per state
• Start in S
• Reward in s ≠ G: −12 or +10 (random)
• Reward in s = G: +5 (end episode)

[Figure: left: average reward per step; right: maximal value in start state S. ε-greedy exploration, ε = 1/√n, α = 1/n, γ = 0.95.]

• Reward per step:
  ◦ Q-learning: very poor
  ◦ Double Q-learning: much better, not perfect
• Maximum value in start state S:
  ◦ Q-learning: overestimation
  ◦ Double Q-learning: underestimation
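
A sketch of an environment matching this description; the 3×3 layout, the corner positions of S and G, the 50/50 draw of −12/+10, and walls that leave the agent in place are assumptions where the poster is silent.

```python
import random

ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # the four moves
START, GOAL = (0, 0), (2, 2)                   # assumed opposite corners of a 3x3 grid

def step(state, action):
    """One transition: returns (next_state, reward, done)."""
    nx = min(2, max(0, state[0] + action[0]))  # off-grid moves leave the agent in place
    ny = min(2, max(0, state[1] + action[1]))
    next_state = (nx, ny)
    if next_state == GOAL:
        return next_state, 5.0, True           # +5 and the episode ends
    return next_state, random.choice([-12.0, 10.0]), False   # noisy step reward, mean -1
```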

In general: single estimator

• Let Ai denote an unbiased noisy sample of xi
• In general: E{maxi Ai} ≥ maxi E{Ai} = maxi xi
• Therefore, often: maxi Ai > maxi xi

[Figure: f(xi) + noise is an unbiased sample of f(xi), yet max(f(xi) + noise) > max f(x) in all three examples.]

• Q-learning does this
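
A quick Monte Carlo check of the inequality above (an illustration, not from the poster): take all true values xi = 0, so maxi xi = 0, and draw unbiased Gaussian samples Ai.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000
x = np.zeros(n_actions)                          # true values: max_i x_i = 0
A = x + rng.normal(size=(n_trials, n_actions))   # each A_i is an unbiased sample of x_i

print(A.mean())              # close to 0: the individual samples are unbiased
print(A.max(axis=1).mean())  # clearly positive (about 1.5 here): max_i A_i overestimates
```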

In general: double estimator

• Idea: use two sets of unbiased estimates: A and B
• Select maximizing argument from one set: a∗ = arg maxi Ai
• Aa∗ is biased: E{Aa∗} ≥ maxi xi ≥ xa∗
• Ba∗ is unbiased: E{Ba∗} = xa∗
• Unfortunately, Ba∗ is unbiased for xa∗, not for maxi xi:
  in general, E{Ba∗} = xa∗ ≤ maxi xi
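
Extending the Monte Carlo check above to the double estimator (again just an illustration): pick a∗ with one set of samples and evaluate it with an independent second set.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000
x = np.zeros(n_actions)                          # true values: max_i x_i = 0
A = x + rng.normal(size=(n_trials, n_actions))   # used only to pick a*
B = x + rng.normal(size=(n_trials, n_actions))   # independent set, used to evaluate a*

a_star = A.argmax(axis=1)
single = A.max(axis=1)                           # single estimator: A_{a*}
double = B[np.arange(n_trials), a_star]          # double estimator: B_{a*}

print(single.mean())   # positive: overestimates max_i x_i = 0
print(double.mean())   # about 0 here; in general E{B_{a*}} = x_{a*} <= max_i x_i
```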

Conclusion

• Q-learning is biased: sometimes (huge) overestimations
  ◦ Sarsa too, depending on policy
  ◦ Value iteration too, but less noise → less bias
• Double Q-learning: sometimes underestimations
• Empirical evidence: Double Q-learning better in (some) noisy settings
• Future work: unbiased Q-learning?

http://homepages.cwi.nl/~hasselt/
Created with LaTeX beamerposter: http://www-i6.informatik.rwth-aachen.de/~dreuw/latexbeamerposter.php
Roulette wheel picture under creative commons license from http://www.flickr.com/photos/29820142@N08/
[email protected]