CONCENTRATION INEQUALITIES
MAXIM RAGINSKY
In the previous lecture, the following result was stated without proof. If X_1, …, X_n are independent Bernoulli(θ) random variables representing the outcomes of a sequence of n tosses of a coin with bias (probability of heads) θ, then for any ε ∈ (0, 1)

(1)    P( |θ̂_n − θ| ≥ ε ) ≤ 2e^{−nε²},

where

    θ̂_n = (1/n) ∑_{i=1}^n X_i

is the fraction of heads in X^n = (X_1, …, X_n). Since θ = E θ̂_n, (1) says that the sample (or empirical) average of the X_i's concentrates sharply around the statistical average θ = EX_1. Bounds like these are fundamental in statistical learning theory. In the next few lectures, we will learn the techniques needed to derive such bounds for settings much more complicated than coin tossing. This is not meant to be a complete picture; more details and additional results can be found in the excellent survey by Boucheron et al. [BBL04].
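To make the statement concrete before we develop any theory, here is a minimal Monte Carlo sketch in Python (added for illustration; it is not part of the original notes, and the parameter values are arbitrary) that estimates the deviation probability on the left-hand side of (1) and compares it with the bound:

    # Illustrative sketch (not from the notes): Monte Carlo check of the bound (1).
    # The parameters theta, eps, n, and the number of runs are arbitrary choices.
    import numpy as np

    rng = np.random.default_rng(0)
    theta, eps, n, runs = 0.3, 0.05, 1000, 20000

    # Empirical frequency of heads in each of `runs` independent sequences of n tosses.
    theta_hat = rng.binomial(n, theta, size=runs) / n

    print("empirical P(|theta_hat - theta| >= eps):",
          np.mean(np.abs(theta_hat - theta) >= eps))
    print("bound (1): 2*exp(-n*eps^2) =", 2 * np.exp(-n * eps**2))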
1. The basic tools
We start with Markov's inequality: Let X ∈ ℝ be a nonnegative random variable. Then for any t > 0 we have

(2)    P(X ≥ t) ≤ EX/t.

The proof is simple:

(3)    P(X ≥ t) = E[1{X ≥ t}]
(4)             ≤ E[X 1{X ≥ t}] / t
(5)             ≤ EX/t,

where:
• (3) uses the fact that the probability of an event can be expressed as the expectation of its indicator function:

    P(X ∈ A) = ∫_A P_X(dx) = ∫_X 1{x ∈ A} P_X(dx) = E[1{X ∈ A}];

• (4) uses the fact that X ≥ t > 0 implies X/t ≥ 1;
• (5) uses the fact that X ≥ 0 implies X 1{X ≥ t} ≤ X, so consequently E[X 1{X ≥ t}] ≤ EX.
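As a quick illustration (an example added here, not in the original notes): if X ≥ 0 and EX = 1, then P(X ≥ 10) ≤ 1/10, no matter what the distribution of X is. For X ~ Exponential(1), the true value is e^{−10} ≈ 4.5 × 10^{−5}, so Markov's inequality, while completely general, can be very loose.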
Date: January 24, 2011.
Markov's inequality leads to our first bound on the probability that a random variable deviates from its expectation by more than a given amount: Chebyshev's inequality. Let X be an arbitrary real random variable. Then for any t > 0

(6)    P( |X − EX| ≥ t ) ≤ Var X / t²,

where Var X ≜ E[|X − EX|²] = EX² − (EX)² is the variance of X. To prove (6), we apply Markov's inequality (2) to the nonnegative random variable |X − EX|²:

(7)    P( |X − EX| ≥ t ) = P( |X − EX|² ≥ t² )
(8)                      ≤ E|X − EX|² / t²,

where the first step uses the fact that the function φ(x) = x² is monotonically increasing on [0, ∞), so that for a, b ≥ 0 we have a ≥ b if and only if a² ≥ b².
Now let's apply these tools to the problem of bounding the probability that, for a coin with bias θ, the fraction of heads in n trials differs from θ by more than some ε > 0. To that end, let us represent the outcomes of the n tosses by n independent Bernoulli(θ) random variables X_1, …, X_n ∈ {0, 1}, where P(X_i = 1) = θ for all i. Let

    θ̂_n = (1/n) ∑_{i=1}^n X_i.

Then

    E θ̂_n = E[ (1/n) ∑_{i=1}^n X_i ] = (1/n) ∑_{i=1}^n EX_i = θ    (since EX_i = P(X_i = 1) = θ)

and

    Var θ̂_n = Var( (1/n) ∑_{i=1}^n X_i ) = (1/n²) ∑_{i=1}^n Var X_i = θ(1 − θ)/n,

where we have used the fact that the X_i's are i.i.d., so Var(X_1 + … + X_n) = ∑_{i=1}^n Var X_i = n Var X_1. Now we are in a position to apply Chebyshev's inequality:

(9)    P( |θ̂_n − θ| ≥ ε ) ≤ Var θ̂_n / ε² = θ(1 − θ)/(nε²).
At the very least, (9) shows that the probability of getting a bad sample decreases with sample size. Unfortunately, it does not decrease fast enough. To see why, we can appeal to the Central Limit Theorem, which (roughly) states that

    P( √(n/(θ(1 − θ))) (θ̂_n − θ) ≥ t ) ⟶ 1 − Φ(t) ≤ (1/√(2π)) · e^{−t²/2}/t    as n → ∞,

where Φ(t) = (1/√(2π)) ∫_{−∞}^{t} e^{−x²/2} dx is the standard Gaussian CDF. This would suggest something like

    P( θ̂_n − θ ≥ ε ) ≈ exp( −nε² / (2θ(1 − θ)) ),

which decays with n much faster than the right-hand side of (9).
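To see this slack numerically, here is a small Python sketch (added for illustration, not part of the original notes; the parameter values are arbitrary) that compares a Monte Carlo estimate of the deviation probability with the Chebyshev bound (9):

    # Illustrative sketch (not from the notes): the Chebyshev bound (9) vs. the actual tail.
    import numpy as np

    rng = np.random.default_rng(1)
    theta, eps, runs = 0.5, 0.1, 50000

    for n in (50, 100, 200, 400):
        theta_hat = rng.binomial(n, theta, size=runs) / n        # empirical frequencies
        empirical = np.mean(np.abs(theta_hat - theta) >= eps)    # Monte Carlo tail estimate
        chebyshev = theta * (1 - theta) / (n * eps**2)           # right-hand side of (9)
        print(f"n={n:4d}  empirical={empirical:.4f}  Chebyshev bound={chebyshev:.3f}")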
2. The Chernoff bounding trick and Hoeffding’s inequality
To fix (9), we will use a very powerful technique, known as the Chernoff bounding trick [Che52].
Let X be a nonnegative random variable. Suppose we are interested in bounding the probability
P(X ≥ t) for some particular t > 0. Observe that for any s > 0 we have
(10)    P(X ≥ t) = P( e^{sX} ≥ e^{st} ) ≤ e^{−st} E[e^{sX}],

where the first step is by monotonicity of the function φ(x) = e^{sx} and the second step is by Markov's inequality (2). The Chernoff trick is to choose an s > 0 that would make the right-hand side of (10) suitably small. In fact, since (10) holds simultaneously for all s > 0, the optimal thing to do is to take

    P(X ≥ t) ≤ inf_{s>0} e^{−st} E[e^{sX}].
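As a worked example (added here for illustration, not in the original notes), let X ~ Poisson(λ), so that E[e^{sX}] = exp(λ(e^s − 1)). For t > λ, minimizing e^{−st} exp(λ(e^s − 1)) over s > 0 gives s = log(t/λ), and hence

    P(X ≥ t) ≤ exp( −t log(t/λ) + t − λ ) = e^{−λ} (eλ/t)^t,

a bound that decays in t much faster than anything obtainable from Markov's or Chebyshev's inequality alone.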
However, often a good upper bound on the moment-generating function E[e^{sX}] is enough. One such bound was developed by Hoeffding [Hoe63] for the case when X is bounded with probability one:
Lemma 1 (Hoeffding). Let X be a random variable with EX = 0 and P(a ≤ X ≤ b) = 1 for some −∞ < a ≤ b < ∞. Then for all s > 0

(11)    E[e^{sX}] ≤ e^{s²(b−a)²/8}.
Proof. The proof uses elementary calculus and convexity. First we note that the function φ(x) = e^{sx} is convex on ℝ. Any x ∈ [a, b] can be written as

    x = ((x − a)/(b − a)) b + ((b − x)/(b − a)) a.

Hence

    e^{sx} ≤ ((x − a)/(b − a)) e^{sb} + ((b − x)/(b − a)) e^{sa}.

Since EX = 0, we have

    E[e^{sX}] ≤ (b/(b − a)) e^{sa} − (a/(b − a)) e^{sb}
             = ( b/(b − a) − (a/(b − a)) e^{s(b−a)} ) e^{sa}.

We have s(b − a) in the exponent in the parentheses. To get the same thing in the e^{sa} term multiplying the parentheses, we (with a bit of foresight) seek λ such that sa = −λs(b − a), which gives us λ = −a/(b − a). Then

    ( b/(b − a) − (a/(b − a)) e^{s(b−a)} ) e^{sa} = ( 1 − λ + λe^{s(b−a)} ) e^{−λs(b−a)}.

Now let u = s(b − a), so we can write

(12)    E[e^{sX}] ≤ (1 − λ + λe^u) e^{−λu}.

Again with a bit of foresight, let us express the right-hand side of (12) as an exponential of a function of u:

    (1 − λ + λe^u) e^{−λu} = e^{φ(u)},    where φ(u) = log(1 − λ + λe^u) − λu.

Now the whole affair hinges on our being able to show that φ(u) ≤ u²/8 for any u ≥ 0. To that end, we first note that φ(0) = φ′(0) = 0, and that φ″(u) ≤ 1/4 for all u ≥ 0. (Indeed, writing p = λe^u/(1 − λ + λe^u), a direct computation gives φ″(u) = p(1 − p) ≤ 1/4; here p ∈ [0, 1] because λ = −a/(b − a) ∈ [0, 1], which follows from a ≤ 0 ≤ b, a consequence of EX = 0.) Therefore, by Taylor's theorem we have

    φ(u) = φ(0) + φ′(0) u + (1/2) φ″(α) u²

for some α ∈ [0, u], and we can upper-bound the right-hand side of the above expression by u²/8. Thus,

    E[e^{sX}] ≤ e^{φ(u)} ≤ e^{u²/8} = e^{s²(b−a)²/8},

which gives us (11).
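As a quick sanity check (an example added here, not in the original notes), let X take the values ±1 with probability 1/2 each, so that EX = 0, a = −1, b = 1. Then

    E[e^{sX}] = cosh(s) = ∑_{k≥0} s^{2k}/(2k)! ≤ ∑_{k≥0} (s²/2)^k / k! = e^{s²/2},

which is exactly the bound (11), since s²(b − a)²/8 = s²/2. Comparing the coefficients of s² as s → 0 also shows that the constant 1/8 in (11) cannot be improved.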
We will now use the Chernoff method and the above lemma to prove the following

Theorem 1 (Hoeffding's inequality). Let X_1, …, X_n be independent random variables, such that X_i ∈ [a_i, b_i] with probability one. Let S_n ≜ ∑_{i=1}^n X_i. Then for any t > 0

(13)    P( S_n − ES_n ≥ t ) ≤ exp( −2t² / ∑_{i=1}^n (b_i − a_i)² );

(14)    P( S_n − ES_n ≤ −t ) ≤ exp( −2t² / ∑_{i=1}^n (b_i − a_i)² ).

Consequently,

(15)    P( |S_n − ES_n| ≥ t ) ≤ 2 exp( −2t² / ∑_{i=1}^n (b_i − a_i)² ).
Proof. By replacing each X_i with X_i − EX_i, we may as well assume that EX_i = 0. Then S_n = ∑_{i=1}^n X_i. Using Chernoff's trick, we write

(16)    P(S_n ≥ t) = P( e^{sS_n} ≥ e^{st} ) ≤ e^{−st} E[e^{sS_n}].

Since the X_i's are independent,

(17)    E[e^{sS_n}] = E[e^{s(X_1 + … + X_n)}] = E[ ∏_{i=1}^n e^{sX_i} ] = ∏_{i=1}^n E[e^{sX_i}].

Since X_i ∈ [a_i, b_i], we can apply Lemma 1 to write E[e^{sX_i}] ≤ e^{s²(b_i − a_i)²/8}. Substituting this into (17) and (16), we obtain

    P(S_n ≥ t) ≤ e^{−st} ∏_{i=1}^n e^{s²(b_i − a_i)²/8} = exp( −st + (s²/8) ∑_{i=1}^n (b_i − a_i)² ).

If we choose s = 4t / ∑_{i=1}^n (b_i − a_i)², then we obtain (13). The proof of (14) is similar.
Now we will apply Hoeffding's inequality to improve our crude concentration bound (9) for the sum of n independent Bernoulli(θ) random variables, X_1, …, X_n. Since each X_i ∈ {0, 1}, we can apply Theorem 1 to get, for any t > 0,

    P( | ∑_{i=1}^n X_i − nθ | ≥ t ) ≤ 2e^{−2t²/n}.

Therefore,

    P( |θ̂_n − θ| ≥ ε ) = P( | ∑_{i=1}^n X_i − nθ | ≥ nε ) ≤ 2e^{−2nε²},

which gives us the claimed bound (1).
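For a numerical sense of the improvement (an illustration added here, not in the original notes): with n = 1000 and ε = 0.05, Chebyshev's bound (9) only guarantees P(|θ̂_n − θ| ≥ ε) ≤ θ(1 − θ)/(nε²) ≤ 0.25/2.5 = 0.1, whereas the bound above gives 2e^{−2·1000·(0.05)²} = 2e^{−5} ≈ 0.013, and the gap widens exponentially as n grows.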
3. From bounded variables to bounded differences: McDiarmid’s inequality
Hoeffding’s inequality applies to sums of independent random variables. We will now develop
its generalization, due to McDiarmid [McD89], to arbitrary real-valued functions of independent
random variables that satisfy a certain condition.
Let X be some set, and consider a function g : X^n → ℝ. We say that g has bounded differences if there exist nonnegative numbers c_1, …, c_n, such that

(18)    sup_{x_1,…,x_n, x′_i ∈ X} | g(x_1, …, x_n) − g(x_1, …, x_{i−1}, x′_i, x_{i+1}, …, x_n) | ≤ c_i
for all i = 1, . . . , n. In words, if we change the ith variable while keeping all the others fixed, the
value of g will not change by more than ci .
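For example (an illustration added here, not in the original notes), take X = [0, 1] and g(x_1, …, x_n) = max_{1≤i≤n} x_i: changing a single coordinate while keeping the others fixed moves the maximum by at most 1, so g has bounded differences with c_1 = … = c_n = 1. Similarly, the average g(x_1, …, x_n) = (1/n) ∑_{i=1}^n x_i has c_1 = … = c_n = 1/n.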
Theorem 2 (McDiarmid's inequality [McD89]). Let X^n = (X_1, …, X_n) ∈ X^n be an n-tuple of independent X-valued random variables. If a function g : X^n → ℝ has bounded differences, as in (18), then, for all t > 0,

(19)    P( g(X^n) − Eg(X^n) ≥ t ) ≤ exp( −2t² / ∑_{i=1}^n c_i² );

(20)    P( Eg(X^n) − g(X^n) ≥ t ) ≤ exp( −2t² / ∑_{i=1}^n c_i² ).
Proof. Let me first sketch the general idea behind the proof. Let Z = g(X^n) and V = Z − EZ. The first step will be to write V as a sum ∑_{i=1}^n V_i, where the terms V_i are constructed so that:

(1) V_i is a function only of X^i = (X_1, …, X_i);
(2) there exists a function Ψ_i : X^{i−1} → ℝ such that, conditionally on X^{i−1},

    Ψ_i(X^{i−1}) ≤ V_i ≤ Ψ_i(X^{i−1}) + c_i.

Provided we can arrange things in this way, we can apply Lemma 1 to V_i conditionally on X^{i−1} (note that the construction below also guarantees E[V_i | X^{i−1}] = 0, as required by Lemma 1):

(21)    E[ e^{sV_i} | X^{i−1} ] ≤ e^{s²c_i²/8}.

Then, using Chernoff's method, we have

    P(Z − EZ ≥ t) = P(V ≥ t)
                  ≤ e^{−st} E[e^{sV}]
                  = e^{−st} E[ e^{s ∑_{i=1}^n V_i} ]
                  = e^{−st} E[ e^{s ∑_{i=1}^{n−1} V_i} e^{sV_n} ]
                  = e^{−st} E[ e^{s ∑_{i=1}^{n−1} V_i} E[ e^{sV_n} | X^{n−1} ] ]
                  ≤ e^{−st} e^{s²c_n²/8} E[ e^{s ∑_{i=1}^{n−1} V_i} ],

where in the next-to-last step we used the fact that V_1, …, V_{n−1} depend only on X^{n−1}, and in the last step we used (21) with i = n. If we continue peeling off the terms involving V_{n−1}, V_{n−2}, …, V_1, we will get

    P(Z − EZ ≥ t) ≤ exp( −st + (s²/8) ∑_{i=1}^n c_i² ).

Taking s = 4t / ∑_{i=1}^n c_i², we end up with (19).
It remains to construct the V_i's with the desired properties. To that end, let

    H_i(X^i) = E[Z | X^i]    and    V_i = H_i(X^i) − H_{i−1}(X^{i−1}).

Then

    ∑_{i=1}^n V_i = ∑_{i=1}^n ( E[Z | X^i] − E[Z | X^{i−1}] ) = E[Z | X^n] − EZ = Z − EZ = V.

Note that V_i depends only on X^i by construction. Moreover, let

    Ψ_i(X^{i−1}) = inf_{x ∈ X} H_i(X^{i−1}, x) − H_{i−1}(X^{i−1}),
    Ψ′_i(X^{i−1}) = sup_{x′ ∈ X} H_i(X^{i−1}, x′) − H_{i−1}(X^{i−1}),

where, owing to the fact that the X_i's are independent, we have

    H_i(X^{i−1}, x) = E[Z | X^{i−1}, X_i = x] = ∫ g(X^{i−1}, x, x_{i+1}^n) P_{X_{i+1}^n}(dx_{i+1}^n),

with x_{i+1}^n denoting the tuple (x_{i+1}, …, x_n). Then

    Ψ′_i(X^{i−1}) − Ψ_i(X^{i−1})
        = sup_{x′ ∈ X} [ H_i(X^{i−1}, x′) − H_{i−1}(X^{i−1}) ] − inf_{x ∈ X} [ H_i(X^{i−1}, x) − H_{i−1}(X^{i−1}) ]
        = sup_{x ∈ X} sup_{x′ ∈ X} [ H_i(X^{i−1}, x) − H_i(X^{i−1}, x′) ]
        = sup_{x ∈ X} sup_{x′ ∈ X} ( E[Z | X^{i−1}, X_i = x] − E[Z | X^{i−1}, X_i = x′] )
        = sup_{x ∈ X} sup_{x′ ∈ X} ∫ ( g(X^{i−1}, x, x_{i+1}^n) − g(X^{i−1}, x′, x_{i+1}^n) ) P_{X_{i+1}^n}(dx_{i+1}^n)
        ≤ sup_{x ∈ X} sup_{x′ ∈ X} ∫ | g(X^{i−1}, x, x_{i+1}^n) − g(X^{i−1}, x′, x_{i+1}^n) | P_{X_{i+1}^n}(dx_{i+1}^n)
        ≤ c_i,

where the last step follows from the bounded difference property. Thus, we can write Ψ′_i(X^{i−1}) ≤ Ψ_i(X^{i−1}) + c_i, which implies that, indeed,

    Ψ_i(X^{i−1}) ≤ V_i ≤ Ψ_i(X^{i−1}) + c_i

conditionally on X^{i−1}.
4. McDiarmid’s inequality in action
McDiarmid’s inequality is an extremely powerful and often used tool in statistical learning theory.
We will now discuss several examples of its use. To that end, we will first introduce some notation
and definitions.
Let X be some (measurable) space. If Q is a probability distribution of an X-valued random
variable X, then we can compute the expectation of any (measurable) function f : X → R w.r.t. Q.
So far, we have denoted this expectation by Ef (X) or by EQ f (X). We will often find it convenient
to use an alternative notation, Q(f ).
Let X^n = (X_1, …, X_n) be n independent identically distributed (i.i.d.) X-valued random variables with common distribution P. The main object of interest to us is the empirical distribution induced by X^n, which we will denote by P̂_{X^n}. The empirical distribution assigns the probability 1/n to each X_i, i.e.,

    P̂_{X^n} = (1/n) ∑_{i=1}^n δ_{X_i}.
Here, δ_x denotes a unit mass concentrated at a point x ∈ X, i.e., the probability distribution on X defined by

    δ_x(A) = 1{x ∈ A},    ∀ measurable A ⊆ X.

We note the following important facts about P̂_{X^n}:

(1) Being a function of the sample X^n, P̂_{X^n} is a random variable taking values in the space of probability distributions over X.

(2) The probability of a set A ⊆ X under P̂_{X^n},

    P̂_{X^n}(A) = (1/n) ∑_{i=1}^n 1{X_i ∈ A},

is the empirical frequency of the set A on the sample X^n. The expectation of P̂_{X^n}(A) is equal to P(A), the P-probability of A. Indeed,

    E P̂_{X^n}(A) = E[ (1/n) ∑_{i=1}^n 1{X_i ∈ A} ] = (1/n) ∑_{i=1}^n E[1{X_i ∈ A}] = (1/n) ∑_{i=1}^n P(X_i ∈ A) = P(A).

(3) Given a function f : X → ℝ, we can compute its expectation w.r.t. P̂_{X^n}:

    P̂_{X^n}(f) = (1/n) ∑_{i=1}^n f(X_i),

which is just the sample mean of f on X^n. It is also referred to as the empirical expectation of f on X^n. We have

    E P̂_{X^n}(f) = E[ (1/n) ∑_{i=1}^n f(X_i) ] = (1/n) ∑_{i=1}^n Ef(X_i) = Ef(X) = P(f).
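In code, both quantities are just sample averages. The following tiny Python sketch (an illustration added here, not from the notes) computes P̂_{X^n}(A) for A = [0, 1/2] and P̂_{X^n}(f) for f(x) = x² from a uniform sample; the choices of P, A, and f are arbitrary.

    # Tiny illustrative sketch (not from the notes): empirical frequency and empirical expectation.
    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.uniform(0.0, 1.0, size=1000)     # i.i.d. sample X^n with P = Uniform[0, 1]

    P_hat_A = np.mean(X <= 0.5)              # empirical frequency of A = [0, 1/2]; P(A) = 1/2
    P_hat_f = np.mean(X**2)                  # empirical expectation of f(x) = x^2; P(f) = 1/3

    print(P_hat_A, P_hat_f)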
We can now proceed to our examples.
4.1. Sums of bounded random variables. In the special case when X = ℝ, P is a probability distribution supported on a finite interval [a, b], and g(X^n) is the sum

    g(X^n) = ∑_{i=1}^n X_i,

McDiarmid's inequality simply reduces to Hoeffding's. Indeed, for any x^n ∈ [a, b]^n and x′_i ∈ [a, b] we have

    g(x^{i−1}, x_i, x_{i+1}^n) − g(x^{i−1}, x′_i, x_{i+1}^n) = x_i − x′_i ≤ b − a.

Interchanging the roles of x′_i and x_i, we get

    g(x^{i−1}, x′_i, x_{i+1}^n) − g(x^{i−1}, x_i, x_{i+1}^n) = x′_i − x_i ≤ b − a.

Hence, we may apply Theorem 2 with c_i = b − a for all i to get

    P( |g(X^n) − Eg(X^n)| ≥ t ) ≤ 2 exp( −2t² / (n(b − a)²) ).
4.2. Uniform deviations. Let X_1, …, X_n be n i.i.d. X-valued random variables with common distribution P. By the Law of Large Numbers, for any A ⊆ X and any ε > 0

    lim_{n→∞} P( |P̂_{X^n}(A) − P(A)| ≥ ε ) = 0.

In fact, we can use Hoeffding's inequality to show that

    P( |P̂_{X^n}(A) − P(A)| ≥ ε ) ≤ 2e^{−2nε²}.

This probability bound holds for each A separately. However, in learning theory we are often interested in the deviation of empirical frequencies from true probabilities simultaneously over some collection of the subsets of X. To that end, let A be such a collection and consider the function

(22)    g(X^n) ≜ sup_{A ∈ A} | P̂_{X^n}(A) − P(A) |.

Later in the course we will see that, for certain choices of A, Eg(X^n) = O(1/√n). However, regardless of what A is, it is easy to see that, by changing only one X_i, the value of g(X^n) can change at most by 1/n. Let x^n = (x_1, …, x_n), choose some other x′_i ∈ X, and let x^n_{(i)} denote x^n with x_i replaced by x′_i:

    x^n = (x^{i−1}, x_i, x_{i+1}^n),    x^n_{(i)} = (x^{i−1}, x′_i, x_{i+1}^n).
Then

    g(x^n) − g(x^n_{(i)}) = sup_{A ∈ A} | P̂_{x^n}(A) − P(A) | − sup_{A′ ∈ A} | P̂_{x^n_{(i)}}(A′) − P(A′) |
        = sup_{A ∈ A} inf_{A′ ∈ A} { | P̂_{x^n}(A) − P(A) | − | P̂_{x^n_{(i)}}(A′) − P(A′) | }
        ≤ sup_{A ∈ A} { | P̂_{x^n}(A) − P(A) | − | P̂_{x^n_{(i)}}(A) − P(A) | }
        ≤ sup_{A ∈ A} | P̂_{x^n}(A) − P̂_{x^n_{(i)}}(A) |
        = (1/n) sup_{A ∈ A} | 1{x_i ∈ A} − 1{x′_i ∈ A} |
        ≤ 1/n.

Interchanging the roles of x^n and x^n_{(i)}, we obtain

    g(x^n_{(i)}) − g(x^n) ≤ 1/n.

Thus,

    | g(x^n) − g(x^n_{(i)}) | ≤ 1/n.

Note that this bound holds for all i and all choices of x^n and x^n_{(i)}. This means that the function g defined in (22) has bounded differences with c_1 = … = c_n = 1/n. Consequently, we can use Theorem 2 to get

    P( |g(X^n) − Eg(X^n)| ≥ ε ) ≤ 2e^{−2nε²}.

This shows that the uniform deviation g(X^n) concentrates sharply around its mean Eg(X^n).
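To make this concrete, here is a small Python sketch (added for illustration, not part of the original notes). It takes A to be the class of half-lines (−∞, a], in which case g(X^n) is the Kolmogorov-Smirnov statistic, estimates Eg(X^n) by Monte Carlo, and checks the concentration bound; the sample size and ε are arbitrary.

    # Illustrative sketch (not from the notes): uniform deviations over half-lines A = (-inf, a].
    # For this class, g(X^n) is the Kolmogorov-Smirnov statistic sup_a |P_hat((-inf, a]) - P((-inf, a])|.
    import numpy as np

    rng = np.random.default_rng(3)
    n, runs, eps = 200, 5000, 0.1

    def uniform_deviation(x):
        """sup over half-lines of |empirical CDF - true CDF|, for P = Uniform[0, 1]."""
        m = len(x)
        xs = np.sort(x)                                  # the true CDF at xs is just xs
        upper = np.abs(np.arange(1, m + 1) / m - xs)     # deviation just to the right of each point
        lower = np.abs(np.arange(0, m) / m - xs)         # deviation just to the left of each point
        return max(upper.max(), lower.max())

    g = np.array([uniform_deviation(rng.uniform(size=n)) for _ in range(runs)])
    print("Monte Carlo estimate of E g(X^n):", g.mean())
    print("empirical P(|g - Eg| >= eps):", np.mean(np.abs(g - g.mean()) >= eps))
    print("McDiarmid bound: 2*exp(-2*n*eps^2) =", 2 * np.exp(-2 * n * eps**2))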
4.3. Uniform deviations continued. The same idea applies to arbitrary real-valued functions over X. Let X^n = (X_1, …, X_n) be as in the previous example. Given any function f : X → [0, 1], Hoeffding's inequality tells us that

    P( | P̂_{X^n}(f) − Ef(X) | ≥ ε ) ≤ 2e^{−2nε²}.

However, just as in the previous example, in learning theory we are primarily interested in controlling the deviations of empirical means from true means simultaneously over whole classes of functions. To that end, let F be such a class consisting of functions f : X → [0, 1] and consider the uniform deviation

    g(X^n) ≜ sup_{f ∈ F} | P̂_{X^n}(f) − P(f) |.

An argument entirely similar to the one in the previous example (exercise: verify this!) shows that this g has bounded differences with c_1 = … = c_n = 1/n. Therefore, applying McDiarmid's inequality, we obtain

    P( |g(X^n) − Eg(X^n)| ≥ ε ) ≤ 2e^{−2nε²}.

We will see later that, for certain function classes F, we will have Eg(X^n) = O(1/√n).
4.4. Kernel density estimation. For our final example, let X^n = (X_1, …, X_n) be an n-tuple of i.i.d. real-valued random variables whose common distribution P has a probability density function (pdf) f, i.e.,

    P(A) = ∫_A f(x) dx

for any measurable set A ⊆ ℝ. We wish to estimate f from the sample X^n. A popular method is to use a kernel estimate (the book by Devroye and Lugosi [DL01] has plenty of material on density estimation, including kernel methods, from the viewpoint of statistical learning theory). To that end, we pick a nonnegative function K : ℝ → ℝ that integrates to one, ∫ K(x) dx = 1 (such a function is called a kernel), as well as a positive bandwidth (or smoothing constant) h > 0 and form the estimate

    f̂_n(x) = (1/(nh)) ∑_{i=1}^n K( (x − X_i)/h ).

It is not hard to verify (another exercise!) that f̂_n is a valid pdf, i.e., that it is nonnegative and integrates to one. A common way of quantifying the performance of a density estimator is to use the L1 distance to the true density f:

    ‖f̂_n − f‖_{L1} = ∫_ℝ | f̂_n(x) − f(x) | dx.

Note that ‖f̂_n − f‖_{L1} is a random variable since it depends on the random sample X^n. Thus, we can write it as a function g(X^n) of the sample X^n. Leaving aside the problem of actually bounding Eg(X^n), we can easily establish a concentration bound for it using McDiarmid's inequality.
To do that, we need to check that g has bounded differences. Choosing x^n and x^n_{(i)} as before, we have

    g(x^n) − g(x^n_{(i)})
        = ∫_ℝ | (1/(nh)) ∑_{j=1}^{i−1} K((x − x_j)/h) + (1/(nh)) K((x − x_i)/h) + (1/(nh)) ∑_{j=i+1}^{n} K((x − x_j)/h) − f(x) | dx
          − ∫_ℝ | (1/(nh)) ∑_{j=1}^{i−1} K((x − x_j)/h) + (1/(nh)) K((x − x′_i)/h) + (1/(nh)) ∑_{j=i+1}^{n} K((x − x_j)/h) − f(x) | dx
        ≤ (1/(nh)) ∫_ℝ | K((x − x_i)/h) − K((x − x′_i)/h) | dx
        ≤ (2/(nh)) ∫_ℝ K(x/h) dx
        = 2/n,

where the first inequality uses |a| − |b| ≤ |a − b| under the integral sign (the two integrands differ only in the ith kernel term), and the second uses K ≥ 0 together with the change of variables ∫_ℝ K((x − c)/h) dx = ∫_ℝ K(x/h) dx = h for any fixed c. Thus, we see that g(X^n) has the bounded differences property with c_1 = … = c_n = 2/n, so that

    P( |g(X^n) − Eg(X^n)| ≥ ε ) ≤ 2e^{−nε²/2}.
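To connect this with practice, here is a minimal Python sketch of the kernel estimate (added for illustration, not part of the notes). The Gaussian kernel, the bandwidth h, and the standard normal target density are arbitrary choices, and the L1 error is approximated by a Riemann sum on a grid.

    # Minimal kernel density estimation sketch (not from the notes): Gaussian kernel, arbitrary bandwidth.
    import numpy as np

    rng = np.random.default_rng(4)
    n, h = 500, 0.3
    X = rng.normal(size=n)                                    # sample X^n from P = N(0, 1)

    def K(u):
        return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)         # Gaussian kernel; integrates to one

    def f_hat(x):
        # kernel estimate (1/(n*h)) * sum_i K((x - X_i)/h), vectorized over the grid points x
        return K((x[:, None] - X[None, :]) / h).mean(axis=1) / h

    grid = np.linspace(-5.0, 5.0, 2001)
    f_true = K(grid)                                          # the N(0, 1) density happens to equal K
    dx = grid[1] - grid[0]
    l1_error = np.sum(np.abs(f_hat(grid) - f_true)) * dx      # Riemann-sum approximation of ||f_hat - f||_L1
    print("approximate L1 error of the kernel estimate:", l1_error)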
References
[BBL04] S. Boucheron, O. Bousquet, and G. Lugosi. Concentration inequalities. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures in Machine Learning, pages 208–240. Springer, 2004.
[Che52] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493–507, 1952.
[DL01] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer, 2001.
[Hoe63] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
[McD89] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge University Press, 1989.