A  Self-normalized Martingale Tail Inequality

The self-normalized martingale tail inequality that we present here is the scalar-valued version of the more general vector-valued results obtained by Abbasi-Yadkori et al. (2011b,a). We include the proof for completeness.
Theorem 7 (Self-normalized bound for martingales). Let $\{F_t\}_{t=1}^\infty$ be a filtration. Let $\tau$ be a stopping time with respect to the filtration $\{F_{t+1}\}_{t=1}^\infty$, i.e., the event $\{\tau \le t\}$ belongs to $F_{t+1}$. Let $\{Z_t\}_{t=1}^\infty$ be a sequence of real-valued random variables such that $Z_t$ is $F_t$-measurable. Let $\{\eta_t\}_{t=1}^\infty$ be a sequence of real-valued random variables such that $\eta_t$ is $F_{t+1}$-measurable and is conditionally $R$-sub-Gaussian, i.e., $E[\exp(\lambda \eta_t) \mid F_t] \le \exp(\lambda^2 R^2/2)$ for all $\lambda \in \mathbb{R}$. Let $V > 0$ be deterministic. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
$$\frac{\left(\sum_{t=1}^\tau \eta_t Z_t\right)^2}{V + \sum_{t=1}^\tau Z_t^2} \le 2R^2 \ln\left(\frac{\sqrt{V + \sum_{t=1}^\tau Z_t^2}}{\delta \sqrt{V}}\right).$$
Proof. Pick $\lambda \in \mathbb{R}$ and let
$$D_t^\lambda = \exp\left(\frac{\eta_t \lambda Z_t}{R} - \frac{1}{2}\lambda^2 Z_t^2\right), \qquad S_t = \sum_{s=1}^t \eta_s Z_s, \qquad M_t^\lambda = \exp\left(\frac{\lambda S_t}{R} - \frac{1}{2}\lambda^2 \sum_{s=1}^t Z_s^2\right).$$
We claim that $\{M_t^\lambda\}_{t=1}^\infty$ is an $\{F_{t+1}\}_{t=1}^\infty$-adapted supermartingale. That $M_t^\lambda \in F_{t+1}$ for $t = 1, 2, \dots$ is clear from the definitions. By sub-Gaussianity, applied with $\lambda Z_t / R$ in place of $\lambda$ (recall that $Z_t$ is $F_t$-measurable), $E[D_t^\lambda \mid F_t] \le 1$. Further,
$$E[M_t^\lambda \mid F_t] = E[M_{t-1}^\lambda D_t^\lambda \mid F_t] = M_{t-1}^\lambda \, E[D_t^\lambda \mid F_t] \le M_{t-1}^\lambda\,,$$
showing that $\{M_t^\lambda\}_{t=1}^\infty$ is indeed a supermartingale.
Next we show that $M_\tau^\lambda$ is always well-defined and $E[M_\tau^\lambda] \le 1$. First define $\tilde{M} = M_\tau^\lambda$ and note that $\tilde{M}(\omega) = M_{\tau(\omega)}^\lambda(\omega)$. Thus, when $\tau(\omega) = \infty$, we need to argue about $M_\infty^\lambda(\omega)$. By the convergence theorem for nonnegative supermartingales, $\lim_{t\to\infty} M_t^\lambda(\omega)$ is almost surely well-defined, and hence $M_\tau^\lambda$ is well-defined independently of whether $\tau < \infty$ holds or not. Now let $Q_t^\lambda = M_{\min\{\tau, t\}}^\lambda$ be a stopped version of $M_t^\lambda$. Since a stopped supermartingale is again a supermartingale and $M_0^\lambda = 1$, we have $E[Q_t^\lambda] \le 1$ for all $t$, and Fatou's lemma gives $E[M_\tau^\lambda] = E[\liminf_{t\to\infty} Q_t^\lambda] \le \liminf_{t\to\infty} E[Q_t^\lambda] \le 1$.
Let $F_\infty$ be the $\sigma$-algebra generated by $\{F_t\}_{t=1}^\infty$, i.e., the tail $\sigma$-algebra. Let $\Lambda$ be a zero-mean Gaussian random variable with variance $1/V$, independent of $F_\infty$. Define $M_t = E[M_t^\Lambda \mid F_\infty]$. Clearly, we still have $E[M_\tau] = E[M_\tau^\Lambda] = E\big[E[M_\tau^\Lambda \mid \Lambda]\big] \le 1$, since $E[M_\tau^\lambda] \le 1$ for each fixed $\lambda$.
Let us calculate $M_t$. We will need the density of $\Lambda$, which is $f(\lambda) = \frac{1}{\sqrt{2\pi/V}}\, e^{-V\lambda^2/2}$. Now, it is easy to write $M_t$ explicitly:
$$M_t = E[M_t^\Lambda \mid F_\infty] = \int_{-\infty}^\infty M_t^\lambda f(\lambda)\, d\lambda = \sqrt{\frac{V}{2\pi}} \int_{-\infty}^\infty \exp\left(\frac{\lambda S_t}{R} - \frac{\lambda^2}{2}\sum_{s=1}^t Z_s^2\right) e^{-V\lambda^2/2}\, d\lambda = \exp\left(\frac{S_t^2}{2R^2\left(V + \sum_{s=1}^t Z_s^2\right)}\right) \sqrt{\frac{V}{V + \sum_{s=1}^t Z_s^2}}\,,$$
where we have used that $\int_{-\infty}^\infty \exp(a\lambda - b\lambda^2)\, d\lambda = \exp\left(a^2/(4b)\right)\sqrt{\pi/b}$ for $b > 0$.
To finish the proof, we use Markov's inequality and the fact that $E[M_\tau] \le 1$:
$$\begin{aligned}
\Pr&\left[\frac{\left(\sum_{t=1}^\tau \eta_t Z_t\right)^2}{V + \sum_{t=1}^\tau Z_t^2} \ge 2R^2 \ln\left(\frac{\sqrt{V + \sum_{t=1}^\tau Z_t^2}}{\delta\sqrt{V}}\right)\right] \\
&= \Pr\left[\frac{S_\tau^2}{2R^2\left(V + \sum_{t=1}^\tau Z_t^2\right)} \ge \ln\left(\frac{\sqrt{V + \sum_{t=1}^\tau Z_t^2}}{\delta\sqrt{V}}\right)\right] \\
&= \Pr\left[\exp\left(\frac{S_\tau^2}{2R^2\left(V + \sum_{t=1}^\tau Z_t^2\right)}\right) \ge \frac{\sqrt{V + \sum_{t=1}^\tau Z_t^2}}{\delta\sqrt{V}}\right] \\
&= \Pr\left[M_\tau \ge \frac{1}{\delta}\right] \le \delta\,.
\end{aligned}$$
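As a quick numerical sanity check (ours, not part of the original argument), the following Python snippet simulates the simplest instantiation of Theorem 7, with the deterministic stopping time $\tau = n$, $Z_t \equiv 1$, and $\eta_t \sim \mathcal{N}(0, R^2)$, which is conditionally $R$-sub-Gaussian; all variable names are our own. The empirical failure frequency of the inequality should stay below $\delta$:

```python
import numpy as np

# Monte Carlo check of Theorem 7 with tau = n fixed (a valid stopping time),
# Z_t = 1 and eta_t ~ N(0, R^2), which is conditionally R-sub-Gaussian.
rng = np.random.default_rng(0)
R, V, delta, n, trials = 1.0, 1.0, 0.05, 200, 100_000
eta = rng.normal(0.0, R, size=(trials, n))
S = eta.sum(axis=1)                 # sum_{t <= tau} eta_t Z_t with Z_t = 1
Z_sq = float(n)                     # sum_{t <= tau} Z_t^2
lhs = S**2 / (V + Z_sq)
rhs = 2 * R**2 * np.log(np.sqrt(V + Z_sq) / (delta * np.sqrt(V)))
print(f"empirical failure rate: {np.mean(lhs > rhs):.5f} (should be <= {delta})")
```

For light-tailed noise such as the Gaussian used here, the observed failure rate is typically far below $\delta$, reflecting the slack introduced by the method of mixtures.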
The theorem can be "bootstrapped" into a "stronger" statement (or at least one that looks stronger at first sight) that holds uniformly over all time steps $t$, as opposed to holding only at a particular (stopping) time $\tau$. The idea of the proof goes back at least to Freedman (1975).
Corollary 8 (Uniform Bound). Under the same assumptions as in the previous theorem, for any $\delta > 0$, with probability at least $1 - \delta$, for all $n \ge 0$,
$$\sum_{t=1}^n \eta_t Z_t \le R\sqrt{2\left(V + \sum_{t=1}^n Z_t^2\right) \ln\left(\frac{\sqrt{V + \sum_{t=1}^n Z_t^2}}{\delta\sqrt{V}}\right)}\,.$$
Proof. Define the "bad" event
$$B_t(\delta) = \left\{\omega \in \Omega : \frac{\left(\sum_{s=1}^t \eta_s Z_s\right)^2}{V + \sum_{s=1}^t Z_s^2} > 2R^2 \ln\left(\frac{\sqrt{V + \sum_{s=1}^t Z_s^2}}{\delta\sqrt{V}}\right)\right\}.$$
We are interested in bounding the probability that $\bigcup_{t \ge 0} B_t(\delta)$ happens. Define $\tau(\omega) = \min\{t \ge 0 : \omega \in B_t(\delta)\}$, with the convention that $\min \emptyset = \infty$. Then $\tau$ is a stopping time. Further,
$$\bigcup_{t \ge 0} B_t(\delta) = \{\omega : \tau(\omega) < \infty\}\,.$$
Thus, by Theorem 7,
$$\begin{aligned}
\Pr\left[\bigcup_{t \ge 0} B_t(\delta)\right] &= \Pr\left[\tau < \infty\right] \\
&= \Pr\left[\frac{\left(\sum_{t=1}^\tau \eta_t Z_t\right)^2}{V + \sum_{t=1}^\tau Z_t^2} > 2R^2 \ln\left(\frac{\sqrt{V + \sum_{t=1}^\tau Z_t^2}}{\delta\sqrt{V}}\right) \text{ and } \tau < \infty\right] \\
&\le \Pr\left[\frac{\left(\sum_{t=1}^\tau \eta_t Z_t\right)^2}{V + \sum_{t=1}^\tau Z_t^2} > 2R^2 \ln\left(\frac{\sqrt{V + \sum_{t=1}^\tau Z_t^2}}{\delta\sqrt{V}}\right)\right] \\
&\le \delta\,.
\end{aligned}$$
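As another sanity check of our own, the snippet below verifies the uniform-in-$n$ statement empirically along entire sample paths. Since the proof controls the squared sum, we check the two-sided version with $|\cdot|$; the setup (uniform $Z_t$, Gaussian $\eta_t$) is an arbitrary illustrative choice:

```python
import numpy as np

# Empirical check of Corollary 8: the bound must hold for all n simultaneously.
rng = np.random.default_rng(1)
R, V, delta, n_max, trials = 1.0, 1.0, 0.05, 500, 5_000
violations = 0
for _ in range(trials):
    Z = rng.uniform(-1.0, 1.0, size=n_max)    # Z_t fixed before eta_t is revealed
    eta = rng.normal(0.0, R, size=n_max)
    S = np.cumsum(eta * Z)                    # partial sums sum_{t <= n} eta_t Z_t
    Z_sq = np.cumsum(Z**2)
    bound = R * np.sqrt(2 * (V + Z_sq) * np.log(np.sqrt(V + Z_sq) / (delta * np.sqrt(V))))
    violations += np.any(np.abs(S) > bound)   # violation at any prefix length n
print(f"paths ever violating the bound: {violations / trials:.4f} (should be <= {delta})")
```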
B  Some Useful Tricks
Proposition 9 (Square-Root Trick). Let $a, b \ge 0$. If $z^2 \le a + bz$ then $z \le b + \sqrt{a}$.
Proof of Proposition 9. Let $q(x) = x^2 - bx - a$. The condition $z^2 \le a + bz$ can be expressed as $q(z) \le 0$. The quadratic polynomial $q(x)$ has the two roots
$$x_{1,2} = \frac{b \pm \sqrt{b^2 + 4a}}{2}\,.$$
The condition $q(z) \le 0$ implies that $z \le \max\{x_1, x_2\}$. Therefore,
$$z \le \max\{x_1, x_2\} = \frac{b + \sqrt{b^2 + 4a}}{2} \le b + \sqrt{a}\,,$$
where we have used that $\sqrt{u + v} \le \sqrt{u} + \sqrt{v}$ holds for any $u, v \ge 0$.
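A small randomized test of the square-root trick (our illustration, not part of the proof):

```python
import numpy as np

# Randomized check of Proposition 9: z^2 <= a + b*z implies z <= b + sqrt(a).
rng = np.random.default_rng(2)
a = rng.uniform(0.0, 10.0, size=1_000_000)
b = rng.uniform(0.0, 10.0, size=1_000_000)
z = rng.uniform(0.0, 30.0, size=1_000_000)
hyp = z**2 <= a + b * z                     # cases where the hypothesis holds
assert np.all(z[hyp] <= b[hyp] + np.sqrt(a[hyp]) + 1e-12)
print(f"checked {hyp.sum()} instances, no counterexample")
```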
Proposition 10 (Logarithmic Trick). Let $c \ge 1$, $f > 0$, and $\delta \in (0, 1/4]$. If $z \ge 1$ and $z \le c + f\sqrt{\ln(z/\delta)}$ then
$$z \le c + f\sqrt{2\ln\left(\frac{c+f}{\delta}\right)}\,.$$
Proof of Proposition 10. Let $g(x) = x - c - f\sqrt{\ln(x/\delta)}$ for $x \ge 1$. The condition $z \le c + f\sqrt{\ln(z/\delta)}$ can be expressed as $g(z) \le 0$. For large enough $x$, the function $g(x)$ is increasing. This is easy to see, since
$$g'(x) = 1 - \frac{f}{2x\sqrt{\ln(x/\delta)}}\,.$$
Namely, it is not hard to see that $g(x)$ is increasing for $x \ge \max\{1, f/2\}$, since for any such $x$ we have $\ln(x/\delta) \ge \ln 4 > 1$ (as $\delta \le 1/4$), and hence $2x\sqrt{\ln(x/\delta)} > 2x \ge f$, so $g'(x)$ is positive.

Clearly, $c + f\sqrt{2\ln\frac{c+f}{\delta}} \ge \max\{1, f/2\}$ since $c \ge 1$ and $\delta \in (0, 1/4]$. Therefore, it suffices to show that
$$g\left(c + f\sqrt{2\ln\frac{c+f}{\delta}}\right) \ge 0\,.$$
This is verified by the following calculation:
$$\begin{aligned}
g\left(c + f\sqrt{2\ln\frac{c+f}{\delta}}\right) &= c + f\sqrt{2\ln\frac{c+f}{\delta}} - c - f\sqrt{\ln\left(\frac{c + f\sqrt{2\ln((c+f)/\delta)}}{\delta}\right)} \\
&= f\sqrt{2\ln\frac{c+f}{\delta}} - f\sqrt{\ln\left(\frac{c + f\sqrt{2\ln((c+f)/\delta)}}{\delta}\right)} \\
&\ge f\sqrt{2\ln\frac{c+f}{\delta}} - f\sqrt{\ln\left(\frac{(c+f)\sqrt{2\ln((c+f)/\delta)}}{\delta}\right)} \\
&= f\sqrt{\ln\left(A^2\right)} - f\sqrt{\ln\left(A\sqrt{2\ln A}\right)} \ge 0\,,
\end{aligned}$$
where we have defined $A = (c+f)/\delta$. The first inequality uses $c + fu \le (c+f)u$ for $u \ge 1$ (here $u = \sqrt{2\ln((c+f)/\delta)} \ge \sqrt{2\ln 4} > 1$), and the last inequality follows from the fact that $A^2 \ge A\sqrt{2\ln A}$ for any $A > 0$.
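The logarithmic trick can be probed the same way (again, a throwaway illustration of ours); note the hypotheses $c \ge 1$, $f > 0$, $\delta \in (0, 1/4]$, $z \ge 1$:

```python
import numpy as np

# Randomized check of Proposition 10 under its hypotheses.
rng = np.random.default_rng(3)
for _ in range(200_000):
    c = rng.uniform(1.0, 5.0)
    f = rng.uniform(0.01, 5.0)
    delta = rng.uniform(1e-3, 0.25)
    z = rng.uniform(1.0, 50.0)
    if z <= c + f * np.sqrt(np.log(z / delta)):          # hypothesis
        assert z <= c + f * np.sqrt(2 * np.log((c + f) / delta)) + 1e-9
print("no counterexample found")
```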
C  Proof of Theorem 3
In this section we will need the following notation. For a given positive definite matrix $A \in \mathbb{R}^{d\times d}$, we denote by $\langle x, y \rangle_A = x^\top A y$ the inner product between two vectors $x, y \in \mathbb{R}^d$ induced by $A$, and we denote by $\|x\|_A = \sqrt{\langle x, x \rangle_A} = \sqrt{x^\top A x}$ the corresponding norm.
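For concreteness, the weighted inner product and norm are trivial to realize in code (a minimal sketch of ours; the function names are not from the paper):

```python
import numpy as np

def inner_A(x: np.ndarray, y: np.ndarray, A: np.ndarray) -> float:
    """Weighted inner product <x, y>_A = x^T A y for positive definite A."""
    return float(x @ A @ y)

def norm_A(x: np.ndarray, A: np.ndarray) -> float:
    """Weighted norm ||x||_A = sqrt(x^T A x)."""
    return float(np.sqrt(inner_A(x, x, A)))
```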
The following lemma is from Dani et al. (2008). We reproduce the proof for completeness.
Lemma 11 (Elliptical Potential). Let $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ and let $V_t = I + \sum_{s=1}^t x_s x_s^\top$ for $t = 0, 1, 2, \dots, n$. Then it holds that
$$\sum_{t=1}^n \min\left\{1, \|x_t\|_{V_{t-1}^{-1}}^2\right\} \le 2\ln(\det(V_n))\,.$$
Furthermore, if $\|x_t\|_2 \le X$ for all $t = 1, 2, \dots, n$, then
$$\ln(\det(V_n)) \le d\ln\left(1 + \frac{nX^2}{d}\right)\,.$$
Proof of Lemma 11. We use the inequality $x \le 2\ln(1 + x)$, valid for all $x \in [0, 1]$:
$$\sum_{t=1}^n \min\left\{1, \|x_t\|_{V_{t-1}^{-1}}^2\right\} \le \sum_{t=1}^n 2\ln\left(1 + \|x_t\|_{V_{t-1}^{-1}}^2\right) = 2\ln\left(\prod_{t=1}^n \left(1 + \|x_t\|_{V_{t-1}^{-1}}^2\right)\right)\,.$$
We now show that $\det(V_n) = \prod_{t=1}^n (1 + \|x_t\|_{V_{t-1}^{-1}}^2)$:
$$\begin{aligned}
\det(V_n) &= \det\left(V_{n-1} + x_n x_n^\top\right) \\
&= \det\left(V_{n-1}^{1/2}\left(I + (V_{n-1}^{-1/2} x_n)(V_{n-1}^{-1/2} x_n)^\top\right) V_{n-1}^{1/2}\right) \\
&= \det(V_{n-1}) \det\left(I + (V_{n-1}^{-1/2} x_n)(V_{n-1}^{-1/2} x_n)^\top\right) \\
&= \det(V_{n-1}) \left(1 + \|x_n\|_{V_{n-1}^{-1}}^2\right) \\
&= \cdots = \prod_{t=1}^n \left(1 + \|x_t\|_{V_{t-1}^{-1}}^2\right) \qquad \text{(since $V_0 = I$)}\,.
\end{aligned}$$
In the above calculation we have used that $\det(I + zz^\top) = 1 + \|z\|_2^2$, since all but one of the eigenvalues of $I + zz^\top$ equal $1$, and the remaining eigenvalue is $1 + \|z\|_2^2$, with associated eigenvector $z$.
To prove the second part, consider the eigenvalues $\alpha_1, \alpha_2, \dots, \alpha_d$ of $V_n$. Since $V_n$ is positive definite, the eigenvalues are positive. Recall that $\det(V_n) = \prod_{i=1}^d \alpha_i$. The bound $\|x_t\|_2 \le X$ implies a bound on the trace of $V_n$:
$$\operatorname{Trace}(V_n) = \operatorname{Trace}(I) + \sum_{t=1}^n \operatorname{Trace}\left(x_t x_t^\top\right) = d + \sum_{t=1}^n \|x_t\|_2^2 \le d + nX^2\,.$$
Recalling that $\operatorname{Trace}(V_n) = \sum_{i=1}^d \alpha_i$, we can apply the AM-GM inequality:
$$\sqrt[d]{\alpha_1 \alpha_2 \cdots \alpha_d} \le \frac{\alpha_1 + \alpha_2 + \cdots + \alpha_d}{d} = \frac{\operatorname{Trace}(V_n)}{d}\,,$$
from which the second inequality follows by taking the logarithm and multiplying by $d$.
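Both inequalities of Lemma 11 are easy to check numerically; the sketch below (ours, with arbitrary synthetic vectors) accumulates the potential along a random sequence of bounded vectors:

```python
import numpy as np

# Numerical check of the Elliptical Potential Lemma for random bounded vectors.
rng = np.random.default_rng(4)
n, d, X = 1000, 4, 1.5
xs = rng.normal(size=(n, d))
norms = np.linalg.norm(xs, axis=1, keepdims=True)
xs *= X / np.maximum(X, norms)                    # enforce ||x_t||_2 <= X
V = np.eye(d)
potential = 0.0
for x in xs:
    potential += min(1.0, x @ np.linalg.solve(V, x))  # min{1, ||x_t||^2_{V_{t-1}^{-1}}}
    V += np.outer(x, x)
log_det = np.linalg.slogdet(V)[1]
assert potential <= 2 * log_det <= 2 * d * np.log(1 + n * X**2 / d)
print(f"{potential:.2f} <= {2 * log_det:.2f} <= {2 * d * np.log(1 + n * X**2 / d):.2f}")
```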
Proof of Theorem 3. Consider the event $A$ on which $\theta_* \in \bigcap_{t=0}^\infty C_t$. By Corollary 2, the event $A$ occurs with probability at least $1 - \delta$.

The set $C_{t-1}$ is an ellipsoid with underlying covariance matrix $V_{t-1} = I + \sum_{s=1}^{t-1} X_s X_s^\top$ and center
$$\hat\theta_t = \operatorname*{argmin}_{\theta \in \mathbb{R}^d}\left(\|\theta\|_2^2 + \sum_{s=1}^{t-1}\left(\hat{Y}_s - \langle \theta, X_s \rangle\right)^2\right)\,.$$
The ellipsoid $C_{t-1}$ is non-empty since $\theta_*$ lies in it (on the event $A$). Therefore $\hat\theta_t \in C_{t-1}$. We can thus express the ellipsoid as
$$C_{t-1} = \left\{\theta \in \mathbb{R}^d : (\theta - \hat\theta_t)^\top V_{t-1} (\theta - \hat\theta_t) + \|\hat\theta_t\|_2^2 + \sum_{s=1}^{t-1}\left(\hat{Y}_s - \langle \hat\theta_t, X_s \rangle\right)^2 \le \beta_{t-1}(\delta)\right\}\,.$$
The ellipsoid is contained in the larger ellipsoid
$$C_{t-1} \subseteq \left\{\theta \in \mathbb{R}^d : (\theta - \hat\theta_t)^\top V_{t-1} (\theta - \hat\theta_t) \le \beta_{t-1}(\delta)\right\} = \left\{\theta \in \mathbb{R}^d : \|\theta - \hat\theta_t\|_{V_{t-1}} \le \sqrt{\beta_{t-1}(\delta)}\right\}\,.$$
First, we bound the instantaneous regret using that $(X_t, \tilde\theta_t) = \operatorname*{argmax}_{(x,\theta) \in D_t \times C_{t-1}} \langle x, \theta \rangle$:
$$\begin{aligned}
\langle x_* - X_t, \theta_* \rangle &= \langle x_*, \theta_* \rangle - \langle X_t, \theta_* \rangle \\
&\le \langle X_t, \tilde\theta_t \rangle - \langle X_t, \theta_* \rangle \\
&= \langle X_t, \tilde\theta_t - \theta_* \rangle \\
&= \langle X_t, \tilde\theta_t - \hat\theta_t \rangle + \langle X_t, \hat\theta_t - \theta_* \rangle \\
&\le \|X_t\|_{V_{t-1}^{-1}} \|\tilde\theta_t - \hat\theta_t\|_{V_{t-1}} + \|X_t\|_{V_{t-1}^{-1}} \|\hat\theta_t - \theta_*\|_{V_{t-1}} && \text{(Cauchy-Schwarz)} \\
&\le 2\sqrt{\beta_{t-1}(\delta)}\, \|X_t\|_{V_{t-1}^{-1}} && \text{(because $\tilde\theta_t, \theta_* \in C_{t-1}$)}\,.
\end{aligned}$$
Since we assume that $|\langle x, \theta_* \rangle| \le G$ for any $x \in D_t$ and any $t = 1, 2, \dots, n$, we can also upper bound the instantaneous regret as $\langle x_* - X_t, \theta_* \rangle \le 2\min\left\{G, \sqrt{\beta_{t-1}(\delta)}\, \|X_t\|_{V_{t-1}^{-1}}\right\}$. Summing over all $t$, we upper bound the regret:
$$\begin{aligned}
R_n &= \sum_{t=1}^n \langle x_* - X_t, \theta_* \rangle \\
&\le 2\sum_{t=1}^n \min\left\{G, \sqrt{\beta_{t-1}(\delta)}\, \|X_t\|_{V_{t-1}^{-1}}\right\} \\
&\le 2\sum_{t=1}^n \sqrt{\beta_{t-1}(\delta)}\, \min\left\{G, \|X_t\|_{V_{t-1}^{-1}}\right\} && \text{(since $\beta_{t-1}(\delta) \ge 1$)} \\
&\le 2\sqrt{\max_{0 \le t < n} \beta_t(\delta)} \sum_{t=1}^n \min\left\{G, \|X_t\|_{V_{t-1}^{-1}}\right\} \\
&\le 2\sqrt{\max_{0 \le t < n} \beta_t(\delta)}\, \max\{1, G\} \sum_{t=1}^n \min\left\{1, \|X_t\|_{V_{t-1}^{-1}}\right\} \\
&\le 2\sqrt{\max_{0 \le t < n} \beta_t(\delta)}\, \max\{1, G\} \sqrt{n \sum_{t=1}^n \min\left\{1, \|X_t\|_{V_{t-1}^{-1}}^2\right\}} && \text{(Cauchy-Schwarz)} \\
&\le 2\max\{1, G\} \sqrt{2nd \ln\left(1 + \frac{nX^2}{d}\right) \max_{0 \le t < n} \beta_t(\delta)}\,,
\end{aligned}$$
where the last inequality follows from Lemma 11.
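The chain of inequalities above is purely deterministic once $\beta_t(\delta) \ge 1$ and the bound $G$ are given, so it can be stress-tested with synthetic quantities. In the sketch below (ours), `beta` is an arbitrary sequence $\ge 1$, not the paper's confidence width:

```python
import numpy as np

# Deterministic check of the regret-summation chain from the proof of Theorem 3.
rng = np.random.default_rng(5)
n, d, G, X = 500, 5, 2.0, 1.0
xs = rng.normal(size=(n, d))
xs /= np.maximum(1.0, np.linalg.norm(xs, axis=1, keepdims=True))  # ||X_t||_2 <= X = 1
beta = 1.0 + rng.random(n)                    # arbitrary beta_{t-1}(delta) >= 1
V = np.eye(d)
u = np.empty(n)
for t in range(n):
    u[t] = np.sqrt(xs[t] @ np.linalg.solve(V, xs[t]))  # ||X_t||_{V_{t-1}^{-1}}
    V += np.outer(xs[t], xs[t])
lhs = 2 * np.sum(np.minimum(G, np.sqrt(beta) * u))     # per-round bound, summed
rhs = 2 * max(1.0, G) * np.sqrt(2 * n * d * np.log(1 + n * X**2 / d) * beta.max())
assert lhs <= rhs
print(f"summed bound {lhs:.1f} <= closed-form bound {rhs:.1f}")
```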
Proof of Theorem 4. Summing over all $t$, we upper bound the regret:
$$R_n = \sum_{t=1}^n \langle x_* - X_t, \theta_* \rangle \le \frac{1}{\Delta} \sum_{t=1}^n \langle x_* - X_t, \theta_* \rangle^2\,,$$
where the inequality follows from the fact that either $\langle x_* - X_t, \theta_* \rangle = 0$ or $\langle x_* - X_t, \theta_* \rangle > \Delta$. Then we take similar steps as in the proof of Theorem 3 to obtain
$$R_n \le \frac{1}{\Delta} \sum_{t=1}^n \langle x_* - X_t, \theta_* \rangle^2 \le \frac{4}{\Delta} \max_{0 \le t < n} \beta_t(\delta)\, \max\{1, G^2\} \sum_{t=1}^n \min\left\{1, \|X_t\|_{V_{t-1}^{-1}}^2\right\} \le \frac{8d}{\Delta} \max_{0 \le t < n} \beta_t(\delta)\, \max\{1, G^2\} \ln\left(1 + \frac{nX^2}{d}\right)\,,$$
finishing the proof of the problem-dependent bound.
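The only new ingredient relative to the proof of Theorem 3 is the gap trick in the first display, which the following snippet illustrates (ours; the per-round regret values are synthetic):

```python
import numpy as np

# The gap trick: if each per-round regret r_t is either 0 or larger than Delta,
# then r_t <= r_t^2 / Delta pointwise, hence R_n <= (1/Delta) * sum_t r_t^2.
rng = np.random.default_rng(6)
Delta = 0.3
r = np.where(rng.random(10_000) < 0.5, 0.0, rng.uniform(Delta, 2.0, size=10_000))
assert r.sum() <= (r**2).sum() / Delta + 1e-9
print(f"R_n = {r.sum():.1f} <= (1/Delta) * sum r_t^2 = {(r**2).sum() / Delta:.1f}")
```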