Maximum Likelihood Estimation and Complexity Regularization

R. M. Castro

May 18, 2011
1 Introduction
Recall the most basic setting of the likelihood framework. Suppose we have i.i.d.
observations/data Y1 , . . . , Yn from a distribution with density p (with respect to
a suitable dominating measure, for example, Lebesgue measure). Let Θ denote
a “parameter” space (e.g., Θ ⊆ R or Θ = {smooth functions}) and assume you
have a class of densities parameterized by θ ∈ Θ, namely
PΘ = {pθ : θ ∈ Θ} .
The maximum likelihood estimator seeks the model in that class that maximizes
the likelihood of the data, or equivalently minimizes the negative log-likelihood.
The maximum likelihood estimator θ̂n is therefore given by

\[
\hat{\theta}_n = \arg\max_{\theta \in \Theta} \prod_{i=1}^n p_\theta(Y_i) = \arg\min_{\theta \in \Theta} \, -\sum_{i=1}^n \log\big(p_\theta(Y_i)\big) . \tag{1}
\]
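As a quick illustration of (1), here is a minimal numerical sketch (not part of the original notes; it assumes NumPy and SciPy are available) for the exponential family p_θ(y) = θ e^{−θy}, y ≥ 0, where the minimizer of the negative log-likelihood is available both in closed form and by numerical optimization.

```python
# Minimal sketch (not from the notes): the MLE of eq. (1) for the exponential
# family p_theta(y) = theta * exp(-theta * y), y >= 0, computed two ways.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta_true = 2.0
Y = rng.exponential(scale=1.0 / theta_true, size=1000)   # i.i.d. observations Y_1, ..., Y_n

def neg_log_likelihood(theta):
    # -sum_i log p_theta(Y_i), with log p_theta(y) = log(theta) - theta * y
    return -np.sum(np.log(theta) - theta * Y)

theta_hat = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded").x
print(theta_hat, 1.0 / Y.mean())   # numerical minimizer vs. the closed-form MLE 1/mean(Y)
```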
Why is maximum likelihood a good idea? To better understand this we need
to introduce the Kullback-Leibler divergence. First, adopt the convention that 0 · log 0 = lim_{x→0+} x log x = 0. Let p and q be two densities (for simplicity say these are with respect to the Lebesgue measure). Then the Kullback-Leibler divergence is defined as

\[
KL(p, q) = d_{KL}(p \| q) =
\begin{cases}
\displaystyle\int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx & \text{if } p \ll q , \\[1ex]
\infty & \text{otherwise} ,
\end{cases}
\]

where p ≪ q means q dominates p, that is, for almost all x, if q(x) = 0 then p(x) = 0.
Important Facts:
• KL(p, q) ≥ 0, with equality if and only if p(x) = q(x) almost everywhere,
which means p and q are identical densities.
The proof of this follows from Jensen’s inequality.
\[
\begin{aligned}
KL(p, q) &= \int p(x) \log \frac{p(x)}{q(x)} \, dx \\
&= \mathbb{E}_p\left[ \log \frac{p(X)}{q(X)} \right] , \quad \text{where } X \text{ is a r.v. with density } p \\
&= \mathbb{E}_p\left[ -\log \frac{q(X)}{p(X)} \right] \\
&\geq -\log \mathbb{E}_p\left[ \frac{q(X)}{p(X)} \right] \\
&= -\log \int p(x) \frac{q(x)}{p(x)} \, dx \\
&= -\log \int q(x) \, dx = -\log 1 = 0 ,
\end{aligned}
\]

where the inequality follows from Jensen's inequality and is strict unless p(x) = q(x) almost everywhere (note: there are minor technicalities omitted in the above proof).
• KL(p, q) ≠ KL(q, p). In other words, the KL divergence is not symmetric.
It is therefore not a distance (in addition, it doesn’t satisfy the triangle
inequality). Let’s see an example. Let p(x) = 1{x ∈ [0, 1]} and q(x) =
1{x ∈ [0, 2]}/2. Then KL(p, q) = log 2 and KL(q, p) = ∞.
• The KL divergence is related to testing. Namely, if the null hypothesis corresponds to density p and the alternative corresponds to density q, the probabilities of type I and type II error are related to KL(p, q) and KL(q, p), respectively. In particular, larger KL divergences correspond to smaller error probabilities.
• The product-density property: Let pi and qi, i = 1, . . . , n, be univariate densities and define the following multivariate densities

p(x1 , . . . , xn ) = p1 (x1 ) · p2 (x2 ) · · · pn (xn ) ,
q(x1 , . . . , xn ) = q1 (x1 ) · q2 (x2 ) · · · qn (xn ) .

Then

\[
KL(p, q) = \sum_{i=1}^n KL(p_i, q_i) .
\]
The proof is quite simple and left as an exercise to the reader.
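The facts above are easy to check numerically. The following sketch (not from the notes; the grid sizes and example densities are illustrative choices) verifies non-negativity, the asymmetry example with the two uniform densities, and the product-density property with Gaussian factors, using simple Riemann-sum integration.

```python
# Numerical check (not from the notes) of the listed KL facts via Riemann sums.
import numpy as np

def kl(p, q, dx):
    # KL(p, q) = int p log(p/q) dx; returns inf when p is not dominated by q,
    # and uses the convention 0 * log 0 = 0 where p vanishes.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.any((p > 0) & (q == 0)):
        return np.inf
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

# Asymmetry example: p uniform on [0, 1], q uniform on [0, 2].
x = np.linspace(0.0, 2.0, 200001)
dx = x[1] - x[0]
p = np.where(x <= 1.0, 1.0, 0.0)
q = np.full_like(x, 0.5)
print(kl(p, q, dx), np.log(2))   # ~ log 2 (and non-negative)
print(kl(q, p, dx))              # inf, since p does not dominate q

# Product-density property, checked with two Gaussian factors per density.
xs = np.linspace(-10.0, 10.0, 2001)
dxs = xs[1] - xs[0]
gauss = lambda x, m, s: np.exp(-(x - m) ** 2 / (2 * s ** 2)) / np.sqrt(2 * np.pi * s ** 2)
p1, q1 = gauss(xs, 0.0, 1.0), gauss(xs, 1.0, 1.0)
p2, q2 = gauss(xs, 0.5, 2.0), gauss(xs, -0.5, 1.5)
P, Q = np.outer(p1, p2), np.outer(q1, q2)           # bivariate product densities
kl_joint = np.sum(P * np.log(P / Q)) * dxs ** 2     # 2-D Riemann sum
print(kl_joint, kl(p1, q1, dxs) + kl(p2, q2, dxs))  # the two numbers agree
```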
Now let’s look again at the MLE formulation in (1). Suppose Yi are i.i.d.
with density q. By the Strong Law of Large Numbers (SLLN), for a fixed value of θ,

\[
-\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i) \xrightarrow{a.s.} \mathbb{E}[-\log p_\theta(Y)] ,
\]
where Y has density q(·). Therefore, the negative log-likelihood can be viewed
as a surrogate for E[− log pθ (Y )]. What happens if we try to minimize this
quantity with respect to θ?
\[
\begin{aligned}
\mathbb{E}[-\log p_\theta(Y)] &= \mathbb{E}[-\log p_\theta(Y)] + \mathbb{E}[\log q(Y)] - \mathbb{E}[\log q(Y)] \\
&= \mathbb{E}\left[ \log \frac{q(Y)}{p_\theta(Y)} \right] - \mathbb{E}[\log q(Y)] \\
&= KL(q, p_\theta) - \mathbb{E}[\log q(Y)] ,
\end{aligned}
\]
where the second term is not a function of θ. Therefore, minimizing the negative log-likelihood is approximately the same as finding the density pθ that is closest, in KL divergence, to the true data density q. In particular, if q = pθ∗ for some θ∗ ∈ Θ, then we hope to recover this model when n is large enough.
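A small simulation makes this concrete, including the case where q is not in the model class. In the sketch below (an illustration, not from the notes) the true density q is a Gamma(2, 1) density, while the model class consists of exponential densities p_θ(y) = θ e^{−θy}; as n grows, the MLE approaches the parameter θ* minimizing KL(q, p_θ).

```python
# Sketch (assumptions: true density q = Gamma(2, 1), model class of exponential
# densities p_theta(y) = theta * exp(-theta * y); neither is from the notes).
# Since q is not in the class, the MLE should approach the KL-closest model.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gamma

rng = np.random.default_rng(1)
q = gamma(a=2.0, scale=1.0)

# theta minimizing E_q[-log p_theta(Y)] = -log(theta) + theta * E_q[Y]
theta_star = minimize_scalar(lambda t: -np.log(t) + t * q.mean(),
                             bounds=(1e-6, 50.0), method="bounded").x   # equals 1/E_q[Y] = 0.5

for n in (100, 1000, 100000):
    Y = q.rvs(size=n, random_state=rng)
    nll = lambda t: np.mean(-(np.log(t) - t * Y))       # (1/n) x negative log-likelihood
    theta_hat = minimize_scalar(nll, bounds=(1e-6, 50.0), method="bounded").x
    print(n, theta_hat, theta_star)                     # theta_hat approaches theta_star
```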
1.1 A more general setting
One can also consider a more general setting. Suppose we have independent, but not identically distributed data. Let Yi, i = 1, . . . , n, be independent samples of random variables with densities qi, respectively. Suppose we have a class of models parameterized by Θ, where we model the data as having joint density

\[
\prod_{i=1}^n p_{\theta,i}(y_i) .
\]
The MLE is, in this case,

\[
\hat{\theta}_n = \arg\min_{\theta \in \Theta} \, -\frac{1}{n} \sum_{i=1}^n \log p_{\theta,i}(Y_i) .
\]
Again, by the SLLN, for a fixed θ we have
\[
-\frac{1}{n} \sum_{i=1}^n \log p_{\theta,i}(Y_i) - \frac{1}{n} \sum_{i=1}^n \mathbb{E}[-\log p_{\theta,i}(Y_i)] \xrightarrow{a.s.} 0 .
\]
So, similarly to what we have seen before, the MLE is attempting to minimize (1/n) ∑_{i=1}^n E[− log p_{θ,i}(Y_i)]. As before, note that E[− log p_{θ,i}(Y_i)] = KL(q_i, p_{θ,i}) − E[log q_i(Y_i)], and so we conclude that the MLE is, in this case, trying to minimize
\[
\sum_{i=1}^n KL(q_i, p_{\theta,i}) = KL(q, p_\theta) ,
\]

where q is the true joint density of the data, and pθ is a model for the data from the class under consideration.
Example 1. The regression setting with deterministic design: Let
\[
Y_i = r(x_i) + \epsilon_i , \qquad i = 1, \ldots, n ,
\]

where εi are i.i.d. zero-mean normal random variables with variance σ². Then each Yi has density

\[
\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(Y_i - r(x_i))^2}{2\sigma^2} \right) .
\]
The joint likelihood of Y1 , . . . , Yn is
\[
\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(Y_i - r(x_i))^2}{2\sigma^2} \right) ,
\]

and the MLE is simply given by

\[
\hat{r}_n = \arg\min_{r \in \mathcal{R}} \sum_{i=1}^n (Y_i - r(x_i))^2 .
\]
So we see that we recover our familiar sum of squared residuals in this case.
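The following sketch (an illustration, not from the notes; it assumes the simple parametric class r(x) = a + bx and an arbitrary noise level) confirms numerically that minimizing the full Gaussian negative log-likelihood gives the same estimate as ordinary least squares.

```python
# Sketch of Example 1 (the linear class r(x) = a + b*x and the noise level are
# assumptions for illustration): the Gaussian MLE coincides with least squares.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 200)            # deterministic design points x_i
sigma = 0.3
Y = 1.5 + 2.0 * x + sigma * rng.standard_normal(x.size)   # Y_i = r(x_i) + eps_i

def neg_log_lik(params):
    a, b = params
    resid = Y - (a + b * x)
    # full Gaussian negative log-likelihood; the additive constant and the 1/(2 sigma^2)
    # factor do not change the minimizer, so only the squared residuals matter
    return 0.5 * x.size * np.log(2 * np.pi * sigma ** 2) + np.sum(resid ** 2) / (2 * sigma ** 2)

mle = minimize(neg_log_lik, x0=np.zeros(2)).x
lsq = np.polyfit(x, Y, deg=1)[::-1]        # least-squares (intercept, slope)
print(mle, lsq)                            # the two estimates coincide (up to tolerance)
```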
1.2 Idea behind the analysis of the MLE
Here we try to see what ingredients we need for the analysis of the MLE. Recall
that
\[
\hat{\theta}_n = \arg\min_{\theta \in \Theta} \, -\frac{1}{n} \sum_{i=1}^n \log p_{\theta,i}(Y_i) .
\]
Let q denote the joint density of the data. Let θ̃n denote the theoretical analogue
of the MLE

\[
\tilde{\theta}_n = \arg\min_{\theta \in \Theta} KL(q, p_\theta) .
\]
Note that, from the definition of the MLE we have
\[
-\frac{1}{n} \sum_{i=1}^n \log p_{\hat{\theta}_n,i}(Y_i) \leq -\frac{1}{n} \sum_{i=1}^n \log p_{\theta,i}(Y_i) ,
\]
for any θ ∈ Θ. In particular this inequality must hold true for θ̃n , and so
\[
-\frac{1}{n} \sum_{i=1}^n \log p_{\hat{\theta}_n,i}(Y_i) + \frac{1}{n} \sum_{i=1}^n \log p_{\tilde{\theta}_n,i}(Y_i) \leq 0 ,
\]
or equivalently
\[
\frac{1}{n} \sum_{i=1}^n \log \frac{q_i(Y_i)}{p_{\hat{\theta}_n,i}(Y_i)} - \frac{1}{n} \sum_{i=1}^n \log \frac{q_i(Y_i)}{p_{\tilde{\theta}_n,i}(Y_i)} \leq 0 .
\]
For each fixed θ ∈ Θ the SLLN tells us that
\[
\frac{1}{n} \sum_{i=1}^n \log \frac{q_i(Y_i)}{p_{\theta,i}(Y_i)} - \frac{1}{n} KL(q, p_\theta) \xrightarrow{a.s.} 0 , \tag{2}
\]
as n → ∞. If (2) also holds for the sequence θ̂n, then in the limit we must have

\[
\frac{1}{n} \left( KL(q, p_{\hat{\theta}_n}) - KL(q, p_{\tilde{\theta}_n}) \right) \to 0 .
\]

In many cases this also implies that |θ̂n − θ̃n| → 0, and so the MLE converges to the best possible model in the class Θ.
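Continuing the Gamma/exponential illustration used earlier (again an assumed setup, not an example from the notes), one can check numerically that the per-sample KL difference between the MLE and the KL-closest model θ̃ shrinks as n grows, as the argument above suggests.

```python
# Numerical illustration, reusing the assumed Gamma(2, 1)/exponential setup from the
# earlier sketch: the per-sample KL gap [KL(q, p_theta_hat) - KL(q, p_theta_tilde)]/n
# shrinks as n grows.
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(3)
theta_tilde = 0.5                           # minimizes KL(q, p_theta) since E_q[Y] = 2

def per_sample_kl_gap(theta):
    # per-sample KL difference; the E_q[log q] terms cancel, leaving the difference
    # of the E_q[-log p_theta(Y)] = -log(theta) + 2*theta terms
    return (-np.log(theta) + 2.0 * theta) - (-np.log(theta_tilde) + 2.0 * theta_tilde)

for n in (100, 1000, 100000):
    Y = gamma(a=2.0, scale=1.0).rvs(size=n, random_state=rng)
    theta_hat = 1.0 / Y.mean()              # closed-form MLE for the exponential model
    print(n, per_sample_kl_gap(theta_hat))  # tends to 0 as n grows
```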
It’s important to note that (2) is the most difficult aspect of the analysis,
and it requires a uniform convergence result. This is going to involve not only
the structure of the likelihood functions, but also the complexity of the class Θ.
The “larger” this class, the harder it is to guarantee (2). In what follows we are
going to consider a possible approach to deal with this problem, but first we
need some preliminary results.
2 The Hellinger Distance
As we argued, the Kullback-Leibler divergence is not a proper distance (in particular, it is not symmetric and does not satisfy the triangle inequality). However, it is not hard to define distances between densities. A notably useful one is the Hellinger distance

\[
H(p, q) = \left( \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx \right)^{1/2} .
\]
The Hellinger metric is symmetric, since H(p, q) = H(q, p), non-negative and it
satisfies the triangle inequality (although one must check this).
Remarkably the squared Hellinger distance provides a lower bound for the
KL divergence, so convergence in KL divergence implies convergence of the
Hellinger distance.
Proposition 1.

\[
H^2(p, q) \leq KL(p, q) .
\]
Proof.
\[
\begin{aligned}
H^2(p, q) &= \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx \\
&= \int p(x) \, dx + \int q(x) \, dx - 2 \int \sqrt{p(x) q(x)} \, dx \\
&= 2 \left( 1 - \int \sqrt{p(x) q(x)} \, dx \right) \\
&\leq -2 \log \int \sqrt{p(x) q(x)} \, dx , \quad \text{since } 1 - x \leq -\log x \\
&= -2 \log \int \sqrt{\frac{q(x)}{p(x)}} \, p(x) \, dx \\
&= -2 \log \mathbb{E}\left[ \sqrt{\frac{q(X)}{p(X)}} \right] , \quad \text{where } X \text{ has density } p \\
&\leq -2 \, \mathbb{E}\left[ \log \sqrt{q(X)/p(X)} \right] , \quad \text{by Jensen's inequality} \\
&= \mathbb{E}\left[ \log (p(X)/q(X)) \right] \equiv KL(p, q) .
\end{aligned}
\]
Note that we have also shown that

\[
H^2(p, q) = 2 \left( 1 - \int \sqrt{p(x) q(x)} \, dx \right) \leq -2 \log \int \sqrt{p(x) q(x)} \, dx .
\]

The quantity inside the logarithm is called the affinity, and is a measure of the similarity between two densities (it takes the value 1 if these are identical, and zero if they have disjoint support). Namely,

\[
A(p, q) = \int \sqrt{p(x) q(x)} \, dx .
\]

In summary, we have shown that

\[
H^2(p, q) \leq -2 \log A(p, q) \leq KL(p, q) .
\]
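These inequalities are easy to check numerically. The sketch below (not from the notes; the Gaussian/Laplace pair is an arbitrary illustrative choice) approximates H²(p, q), −2 log A(p, q) and KL(p, q) by Riemann sums and prints them in what should be non-decreasing order.

```python
# Numerical check (not from the notes): H^2 <= -2 log A <= KL for an arbitrary
# Gaussian/Laplace pair, using simple Riemann sums on a fine grid.
import numpy as np

x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]
p = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)   # standard Gaussian density
q = 0.5 * np.exp(-np.abs(x - 0.5))             # Laplace(0.5, 1) density

H2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx   # squared Hellinger distance
A = np.sum(np.sqrt(p * q)) * dx                    # affinity
KL = np.sum(p * np.log(p / q)) * dx                # KL divergence (p << q here)

print(H2, -2 * np.log(A), KL)   # printed in non-decreasing order
```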
2.1 The important case of Gaussian distributions
An important case to consider is when p and q are Gaussian distributions with
the same variance but different means. Let
\[
p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \theta)^2}{2\sigma^2}} .
\]
Then

\[
\begin{aligned}
KL(p_{\theta_1}, p_{\theta_2}) &= \int \log \frac{p_{\theta_1}(x)}{p_{\theta_2}(x)} \, p_{\theta_1}(x) \, dx \\
&= \int \frac{(x - \theta_2)^2 - (x - \theta_1)^2}{2\sigma^2} \, p_{\theta_1}(x) \, dx \\
&= \mathbb{E}\left[ \frac{(X - \theta_2)^2 - (X - \theta_1)^2}{2\sigma^2} \right] , \quad \text{where } X \text{ has density } p_{\theta_1} \\
&= \frac{1}{2\sigma^2} \mathbb{E}[(X - \theta_2)^2] - \frac{1}{2\sigma^2} \mathbb{E}[(X - \theta_1)^2] \\
&= \frac{1}{2\sigma^2} \mathbb{E}[(X - \theta_1 + \theta_1 - \theta_2)^2] - \frac{1}{2\sigma^2} \mathbb{E}[(X - \theta_1)^2] \\
&= \frac{1}{2\sigma^2} \left( \sigma^2 + (\theta_1 - \theta_2)^2 \right) - \frac{1}{2\sigma^2} \sigma^2 \\
&= \frac{(\theta_1 - \theta_2)^2}{2\sigma^2} .
\end{aligned}
\]
So, in this case, the KL divergence is symmetric.
Now, for the affinity and Hellinger distance we proceed in a similar fashion.
\[
\begin{aligned}
A(p_{\theta_1}, p_{\theta_2}) &= \int \sqrt{p_{\theta_1}(x) \, p_{\theta_2}(x)} \, dx \\
&= \int \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \theta_1)^2}{4\sigma^2}} e^{-\frac{(x - \theta_2)^2}{4\sigma^2}} \, dx \\
&= \int \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{x^2 - 2\theta_1 x + \theta_1^2 + x^2 - 2\theta_2 x + \theta_2^2}{4\sigma^2}} \, dx \\
&= \int \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{x^2 - 2\frac{\theta_1 + \theta_2}{2} x + \frac{\theta_1^2 + \theta_2^2}{2}}{2\sigma^2}} \, dx \\
&= \int \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{\left( x - \frac{\theta_1 + \theta_2}{2} \right)^2 - \left( \frac{\theta_1 + \theta_2}{2} \right)^2 + \frac{\theta_1^2 + \theta_2^2}{2}}{2\sigma^2}} \, dx \\
&= e^{\frac{\left( \frac{\theta_1 + \theta_2}{2} \right)^2 - \frac{\theta_1^2 + \theta_2^2}{2}}{2\sigma^2}} \int \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{\left( x - \frac{\theta_1 + \theta_2}{2} \right)^2}{2\sigma^2}} \, dx \\
&= e^{\frac{\left( \frac{\theta_1 + \theta_2}{2} \right)^2 - \frac{\theta_1^2 + \theta_2^2}{2}}{2\sigma^2}} \\
&= e^{-\frac{(\theta_1 - \theta_2)^2}{8\sigma^2}} .
\end{aligned}
\]

Therefore

\[
H^2(p_{\theta_1}, p_{\theta_2}) = 2 \left( 1 - e^{-\frac{(\theta_1 - \theta_2)^2}{8\sigma^2}} \right) ,
\]

and

\[
-2 \log A(p_{\theta_1}, p_{\theta_2}) = \frac{(\theta_1 - \theta_2)^2}{4\sigma^2} .
\]
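As a sanity check (not part of the notes; the parameter values are arbitrary), the closed-form expressions above can be compared against direct numerical integration.

```python
# Sanity check (parameter values are arbitrary): the Gaussian closed forms above
# versus direct Riemann-sum integration.
import numpy as np

theta1, theta2, sigma = 0.0, 1.3, 0.8
x = np.linspace(-15.0, 15.0, 300001)
dx = x[1] - x[0]
p = lambda t: np.exp(-(x - t) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
p1, p2 = p(theta1), p(theta2)
d2 = (theta1 - theta2) ** 2

print(np.sum(p1 * np.log(p1 / p2)) * dx, d2 / (2 * sigma ** 2))       # KL(p_theta1, p_theta2)
A = np.sum(np.sqrt(p1 * p2)) * dx
print(A, np.exp(-d2 / (8 * sigma ** 2)))                              # affinity A(p_theta1, p_theta2)
print(np.sum((np.sqrt(p1) - np.sqrt(p2)) ** 2) * dx,
      2 * (1 - np.exp(-d2 / (8 * sigma ** 2))))                       # squared Hellinger distance
print(-2 * np.log(A), d2 / (4 * sigma ** 2))                          # -2 log A
```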