Introduction to Estimation Theory: A Tutorial
Volkan Cevher
Outline
Introduction
Terminology and Preliminaries
Bayesian (Random) Parameter Estimation
Nonrandom Parameter Estimation
Questions
Georgia Institute of Technology
Center for Signal and Image Processing
Introduction
Classical detection problem:
– Design of optimum procedures for deciding between possible statistical situations given a random observation:
$$H_0 : Y_k \sim P \in \mathcal{P}_0, \quad k = 1, \ldots, n$$
$$H_1 : Y_k \sim P \in \mathcal{P}_1, \quad k = 1, \ldots, n$$
– The model has the following components:
    Parameter Space (for parametric detection problems)
    Probabilistic Mapping from Parameter Space to Observation Space
    Observation Space
    Detection Rule
Parameter Space:
– Completely characterizes the output given the mapping.
– Each hypothesis corresponds to a point in the parameter space. This mapping is one-to-one.
Probabilistic Mapping from Parameter Space to Observation Space:
– The probability law that governs the effect of a parameter on the observation.
Example 1:
Parameter space: $[-1 \;\; 0 \;\; 1]^T$
Probabilistic mapping:
$$Y_k = \begin{cases} N_k, \; p = 1/2, & N_k \sim \mathcal{N}(0, \sigma^2) \\ N_k, \; p = 1/4, & N_k \sim \mathcal{N}(-1, \sigma^2) \\ N_k, \; p = 1/4, & N_k \sim \mathcal{N}(1, \sigma^2) \end{cases}$$
Observation Space:
– Finite dimensional, i.e. $\mathbf{Y} \in \mathbb{R}^n$, where $n$ is finite.
Detection Rule:
– A mapping from the observation space to the parameters in the parameter space is called a detection rule.
Classical estimation problem:
– Interested in not making a choice among several discrete situations, but rather making a choice among a continuum of possible states.
– Think of a family of distributions on the observation space, indexed by a set of parameters.
– Given the observation, determine as accurately as possible the actual value of the parameter.
Example 2:
$$Y_k = N_k, \quad N_k \sim \mathcal{N}(\mu, \sigma^2)$$
– In this example, given the observations, the parameter $\mu$ is being estimated. Its value is not chosen among a set of discrete values, but rather is estimated as accurately as possible.
The estimation problem has the same components as the detection problem:
– Parameter Space
– Probabilistic Mapping from Parameter Space to Observation Space
– Observation Space
– Estimation Rule
The detection problem can be thought of as a special case of the estimation problem.
There are a variety of estimation procedures, differing basically in the amount of prior information about the parameter and in the performance criteria applied.
Estimation theory is less structured than detection theory. “Detection is science, estimation is art.” (Array Signal Processing, by Johnson and Dudgeon.)
Based on the a priori information about the parameter, there are two basic approaches to parameter estimation:
– Bayesian Parameter Estimation
– Nonrandom Parameter Estimation
Bayesian Parameter Estimation:
– Parameter is assumed to be a random quantity related statistically to the observation.
Nonrandom Parameter Estimation:
– Parameter is a constant without any probabilistic structure.
Terminology and Preliminaries
Estimation theory relies on jargon to characterize the properties of estimators. In this presentation, the following definitions are used:
– The set of $n$ observations is represented by the $n$-dimensional vector $\mathbf{y} \in \Gamma$ (observation space):
$$\mathbf{y} = [Y_1 \; \cdots \; Y_k \; \cdots \; Y_n]^T$$
– The values of the parameters are denoted by the vector $\boldsymbol{\theta} \in \Lambda$ (parameter space).
– The estimate of this parameter vector is denoted by $\hat{\boldsymbol{\theta}}(\mathbf{y}) : \Gamma \to \Lambda$.
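Definitions (continued):
– The estimation error $\boldsymbol{\varepsilon}(\mathbf{y})$ ($\boldsymbol{\varepsilon}$ in short) is defined by the difference between the estimate and the actual parameter:
$$\boldsymbol{\varepsilon}(\mathbf{y}) = \hat{\boldsymbol{\theta}}(\mathbf{y}) - \boldsymbol{\theta}$$
– The function $C[\mathbf{a}, \boldsymbol{\theta}] : \Lambda \times \Lambda \to \mathbb{R}^+$ is the cost of estimating a true value of $\boldsymbol{\theta}$ as $\mathbf{a}$.
– Given such a cost function $C$, the Bayes risk (average risk) of the estimator is defined by the following:
$$r(\hat{\boldsymbol{\theta}}) = E\{E\{C[\hat{\boldsymbol{\theta}}(\mathbf{Y}), \boldsymbol{\Theta}] \mid \mathbf{Y}\}\}$$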
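Example 3: Suppose we would like to minimize the Bayes risk defined by
$$r(\hat{\boldsymbol{\theta}}) = E\{E\{C[\hat{\boldsymbol{\theta}}(\mathbf{Y}), \boldsymbol{\Theta}] \mid \mathbf{Y}\}\}$$
for a given cost function $C$. By inspection, one can see that the Bayes estimate of $\boldsymbol{\theta}$ can be found (if it exists) by minimizing, for each $\mathbf{y}$, the posterior cost given $\mathbf{Y} = \mathbf{y}$:
$$E\{C[\hat{\boldsymbol{\theta}}(\mathbf{Y}), \boldsymbol{\Theta}] \mid \mathbf{Y} = \mathbf{y}\}$$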
Definitions (continued):
– An estimate is said to be unbiased if the expected value of the estimate equals the true value of the parameter, $E\{\hat{\boldsymbol{\theta}} \mid \boldsymbol{\theta}\} = \boldsymbol{\theta}$. Otherwise the estimate is said to be biased. The bias $\mathbf{b}(\boldsymbol{\theta})$ is usually considered to be additive, so that:
$$\mathbf{b}(\boldsymbol{\theta}) = E\{\hat{\boldsymbol{\theta}} \mid \boldsymbol{\theta}\} - \boldsymbol{\theta}$$
– An estimate is said to be asymptotically unbiased if the bias tends to zero as the number of observations tends to infinity.
– An estimate is said to be consistent if the mean-squared estimation error tends to zero as the number of observations becomes large:
$$\lim_{n \to \infty} E\{\boldsymbol{\varepsilon}^T \boldsymbol{\varepsilon}\} = 0$$
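The definitions of bias and consistency can be checked empirically. The following is a minimal sketch, assuming the Gaussian model of Example 2 and the sample mean as the estimator; the values of `mu`, `sigma`, the sample sizes, and the number of trials are illustrative choices, not taken from the slides.

```python
import random
import statistics

def sample_mean_estimate(y):
    """Estimate mu by the sample mean of the observations y."""
    return sum(y) / len(y)

def empirical_bias_and_mse(mu, sigma, n, trials, rng):
    """Monte Carlo estimates of the bias E{est | mu} - mu and of the MSE E{eps^2}."""
    errors = []
    for _ in range(trials):
        y = [rng.gauss(mu, sigma) for _ in range(n)]
        errors.append(sample_mean_estimate(y) - mu)
    bias = statistics.fmean(errors)
    mse = statistics.fmean(e * e for e in errors)
    return bias, mse

rng = random.Random(0)  # fixed seed so the run is repeatable
results = {
    n: empirical_bias_and_mse(mu=1.0, sigma=2.0, n=n, trials=1000, rng=rng)
    for n in (10, 100, 1000)
}
for n, (bias, mse) in results.items():
    print(f"n={n:5d}  bias={bias:+.4f}  mse={mse:.4f}  sigma^2/n={4.0 / n:.4f}")
```

For this estimator the empirical bias should stay near zero for every n, while the empirical mean-squared error should shrink roughly like sigma^2/n, which is the behaviour the consistency definition describes.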
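Definitions (continued):
– An efficient estimate has a mean-squared error that equals a particular lower bound: the Cramer-Rao bound. If an efficient estimate exists, it is optimum in the mean-squared sense: no other unbiased estimate has a smaller mean-squared error.
The following shorthand notation will also be used for brevity:
$$p_{\boldsymbol{\theta}}(\mathbf{y}) = p_{\mathbf{y} \mid \boldsymbol{\theta}}(\mathbf{y} \mid \boldsymbol{\theta}) = \text{probability density of } \mathbf{y} \text{ given } \boldsymbol{\theta}$$
$$E_{\boldsymbol{\theta}}\{\mathbf{y}\} = E\{\mathbf{y} \mid \boldsymbol{\theta}\}$$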
The following definitions and theorems will be useful later in the presentation.
Definition: Sufficiency
Suppose that $\Lambda$ is an arbitrary set. A function $T$ on $\Gamma$ is said to be a sufficient statistic for the parameter set $\Lambda$ if the distribution of $\mathbf{y}$ conditioned on $T(\mathbf{y})$ does not depend on $\boldsymbol{\theta}$ for $\boldsymbol{\theta} \in \Lambda$.
If knowing $T(\mathbf{y})$ removes any further dependence on $\boldsymbol{\theta}$ of the distribution of $\mathbf{y}$, one can conclude that $T(\mathbf{y})$ contains all the information in $\mathbf{y}$ that is useful for estimating $\boldsymbol{\theta}$. Hence, it is sufficient.
Definition: Minimal Sufficiency
A function $T$ on $\Gamma$ is said to be minimal sufficient for the parameter set $\Lambda$ if it is a function of every other sufficient statistic for $\Lambda$.
A minimal sufficient statistic represents the furthest reduction in the observation without destroying information about $\boldsymbol{\theta}$.
A minimal sufficient statistic does not necessarily exist for every problem. Even if it exists, it is usually very difficult to identify.
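The Factorization Theorem:
Suppose that the parameter set $\Lambda$ has a corresponding family of densities $p_{\boldsymbol{\theta}}$. A statistic $T$ is sufficient for $\boldsymbol{\theta}$ iff there are functions $g_{\boldsymbol{\theta}}$ and $h$ such that
$$p_{\boldsymbol{\theta}}(\mathbf{y}) = g_{\boldsymbol{\theta}}[T(\mathbf{y})] \, h(\mathbf{y})$$
for all $\mathbf{y} \in \Gamma$ and $\boldsymbol{\theta} \in \Lambda$.
Refer to the supplement for a proof.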
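Example 4: (Poor) Consider the hypothesis-testing problem $\Lambda = \{0, 1\}$ with densities $p_0$ and $p_1$. Noting that
$$p_\theta(y) = \begin{cases} p_0(y) & \text{if } \theta = 0 \\ \dfrac{p_1(y)}{p_0(y)} \, p_0(y) & \text{if } \theta = 1, \end{cases}$$
the factorization $p_\theta(y) = g_\theta[T(y)] \, h(y)$ is possible with
$$h(y) = p_0(y), \qquad T(y) = p_1(y) / p_0(y) \equiv L(y), \qquad g_\theta(t) = \begin{cases} 1 & \text{if } \theta = 0 \\ t & \text{if } \theta = 1. \end{cases}$$
Thus the likelihood ratio $L$ is a sufficient statistic for the binary hypothesis-testing problem.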
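The factorization in this example can be verified numerically. The sketch below assumes, purely for illustration, that $p_0$ and $p_1$ are unit-variance Gaussian densities with means 0 and 1; the function names are hypothetical and only mirror the symbols $h$, $T$, and $g_\theta$ used above.

```python
import math

def gaussian_pdf(y, mean, sigma=1.0):
    """Density of N(mean, sigma^2) evaluated at y."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def p(theta, y):
    """Density under hypothesis theta in {0, 1} (assumed means: 0 and 1)."""
    return gaussian_pdf(y, mean=float(theta))

def h(y):
    """h(y) = p0(y)."""
    return p(0, y)

def T(y):
    """Likelihood ratio L(y) = p1(y) / p0(y)."""
    return p(1, y) / p(0, y)

def g(theta, t):
    """g_theta(t): 1 under theta = 0, t under theta = 1."""
    return 1.0 if theta == 0 else t

def factorized(theta, y):
    """Right-hand side g_theta[T(y)] * h(y) of the factorization."""
    return g(theta, T(y)) * h(y)

for y in (-2.0, 0.0, 0.5, 3.0):
    for theta in (0, 1):
        print(theta, y, p(theta, y), factorized(theta, y))
```

Both columns of densities should agree for every pair (theta, y), which is all the factorization theorem requires for the likelihood ratio to be sufficient.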
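The Rao-Blackwell Theorem:
Suppose that $\hat{g}(\mathbf{y})$ is an unbiased estimate of $g(\boldsymbol{\theta})$ and that $T$ is sufficient for $\boldsymbol{\theta}$. Define
$$\tilde{g}[T(\mathbf{y})] = E_{\boldsymbol{\theta}}\{\hat{g}(\mathbf{Y}) \mid T(\mathbf{Y}) = T(\mathbf{y})\}$$
Then $\tilde{g}[T(\mathbf{y})]$ is also an unbiased estimate of $g(\boldsymbol{\theta})$. Furthermore,
$$\mathrm{Var}_{\boldsymbol{\theta}}(\tilde{g}[T(\mathbf{Y})]) \le \mathrm{Var}_{\boldsymbol{\theta}}(\hat{g}(\mathbf{Y})),$$
with equality iff
$$P_{\boldsymbol{\theta}}(\hat{g}(\mathbf{Y}) = \tilde{g}[T(\mathbf{Y})]) = 1.$$
Refer to the supplement for a proof.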
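A small exact illustration of the theorem may help. The sketch below assumes i.i.d. Bernoulli($\theta$) observations (a model not used in the slides), takes the crude unbiased estimate $\hat{g}(\mathbf{y}) = y_1$ of $\theta$, and conditions on the sufficient statistic $T(\mathbf{y}) = \sum_k y_k$ by brute-force enumeration with exact fractions.

```python
from fractions import Fraction
from itertools import product

def rao_blackwellize(n):
    """Return {t: E[Y_1 | sum(Y) = t]} for i.i.d. Bernoulli Y_1..Y_n.

    Given T = t, every 0/1 sequence with t ones is equally likely, so the
    conditional expectation is a plain average that does not involve theta.
    """
    totals, counts = {}, {}
    for y in product((0, 1), repeat=n):
        t = sum(y)
        totals[t] = totals.get(t, 0) + y[0]
        counts[t] = counts.get(t, 0) + 1
    return {t: Fraction(totals[t], counts[t]) for t in counts}

def variance_of(estimator, n, theta):
    """Exact variance of estimator(y) under Bernoulli(theta), by enumeration."""
    mean = Fraction(0)
    second_moment = Fraction(0)
    for y in product((0, 1), repeat=n):
        prob = Fraction(1)
        for bit in y:
            prob *= theta if bit else 1 - theta
        value = estimator(y)
        mean += prob * value
        second_moment += prob * value * value
    return second_moment - mean * mean

n = 5
theta = Fraction(3, 10)  # illustrative parameter value
g_tilde = rao_blackwellize(n)
var_crude = variance_of(lambda y: Fraction(y[0]), n, theta)
var_rb = variance_of(lambda y: g_tilde[sum(y)], n, theta)
print(g_tilde)
print(var_crude, var_rb)
```

Under these assumptions the conditioned estimate reduces to the sample mean $t/n$, and its variance is smaller than that of $y_1$ by a factor of $n$, in line with the inequality above.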
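Definition: Completeness
The parameter family $\{P_\theta;\ \theta \in \Lambda\}$ is said to be complete if the condition $E_\theta\{f(Y)\} = 0$ for all $\theta \in \Lambda$ implies that $P_\theta(f(Y) = 0) = 1$ for all $\theta \in \Lambda$.
Example 5: (Poor) Suppose that $\Gamma = \{0, 1, \ldots, n\}$, $\Lambda = (0, 1)$, and
$$p_\theta(y) = \frac{n!}{y!(n-y)!} \, \theta^y (1-\theta)^{n-y}, \quad y = 0, \ldots, n, \quad 0 < \theta < 1.$$
For any function $f$ on $\Gamma$, we have
$$E_\theta\{f(Y)\} = \sum_{y=0}^{n} f(y) \frac{n!}{y!(n-y)!} \, \theta^y (1-\theta)^{n-y} = (1-\theta)^n \sum_{y=0}^{n} a_y x^y,$$
where $x = \theta/(1-\theta)$ and $a_y = f(y)\, n!/[y!(n-y)!]$.
The condition $E_\theta\{f(Y)\} = 0$ for all $\theta \in \Lambda$ implies that
$$\sum_{y=0}^{n} a_y x^y = 0, \quad \text{for all } x > 0.$$
However, an $n$th-order polynomial has at most $n$ zeros unless all of its coefficients are zero. Hence every $a_y$, and therefore every $f(y)$, must be zero, and $\{P_\theta;\ \theta \in \Lambda\}$ is complete.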
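Definition: Exponential Families
– A class of distributions with parameter set $\Lambda$ is said to be an exponential family if there are real-valued functions $C, Q_1, \ldots, Q_m, T_1, \ldots, T_m$, and $h$ such that
$$p_{\boldsymbol{\theta}}(\mathbf{y}) = C(\boldsymbol{\theta}) \exp\left\{ \sum_{l=1}^{m} Q_l(\boldsymbol{\theta}) \, T_l(\mathbf{y}) \right\} h(\mathbf{y})$$
– $T(\mathbf{y}) = [T_1(\mathbf{y}), \ldots, T_m(\mathbf{y})]^T$ is a complete sufficient statistic (provided the set of values taken by $[Q_1(\boldsymbol{\theta}), \ldots, Q_m(\boldsymbol{\theta})]$ over $\Lambda$ contains an $m$-dimensional rectangle).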
Bayesian Parameter Estimation
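For the random observation $\mathbf{Y} \in \Gamma$, indexed by a parameter $\boldsymbol{\theta} \in \Lambda \subseteq \mathbb{R}^m$, our goal is to find a function $\hat{\boldsymbol{\theta}} : \Gamma \to \Lambda$ such that $\hat{\boldsymbol{\theta}}(\mathbf{y})$ is the best guess of the true value of $\boldsymbol{\theta}$ given $\mathbf{Y} = \mathbf{y}$.
Bayesian estimators are the estimators that minimize the Bayes risk.
The following estimators are commonly used in practice and can be distinguished by their cost functions.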
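Minimum-Mean-Squared-Error (MMSE):
– Euclidean cost function:
$$C[\mathbf{a}, \boldsymbol{\theta}] = \|\mathbf{a} - \boldsymbol{\theta}\|^2 = \sum_{i=1}^{m} C_i[a_i, \theta_i] = \sum_{i=1}^{m} (a_i - \theta_i)^2$$
– The posterior cost given $\mathbf{Y} = \mathbf{y}$ is given by
$$E\{C[\hat{\boldsymbol{\theta}}(\mathbf{Y}), \boldsymbol{\Theta}] \mid \mathbf{Y} = \mathbf{y}\} = E\{\|\hat{\boldsymbol{\theta}}(\mathbf{Y}) - \boldsymbol{\Theta}\|^2 \mid \mathbf{Y} = \mathbf{y}\}$$
$$= \|\hat{\boldsymbol{\theta}}(\mathbf{y})\|^2 - 2\,\mathrm{Re}\{[\hat{\boldsymbol{\theta}}(\mathbf{y})]^H E\{\boldsymbol{\Theta} \mid \mathbf{Y} = \mathbf{y}\}\} + E\{\|\boldsymbol{\Theta}\|^2 \mid \mathbf{Y} = \mathbf{y}\}$$
– Minimizing this posterior cost for each $\mathbf{y}$ also minimizes the Bayes risk $r(\hat{\boldsymbol{\theta}})$. Hence, on differentiating with respect to $\hat{\boldsymbol{\theta}}(\mathbf{y})$ and setting the result to zero, one obtains the Bayes estimate
$$\hat{\boldsymbol{\theta}}_{MMSE}(\mathbf{y}) = E\{\boldsymbol{\Theta} \mid \mathbf{Y} = \mathbf{y}\}$$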
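The differentiation step can be written out explicitly. The following sketch assumes a real-valued parameter (so the Hermitian transpose reduces to an ordinary transpose) and writes $\mathbf{a}$ for the candidate value $\hat{\boldsymbol{\theta}}(\mathbf{y})$ and $J(\mathbf{a})$ for the posterior cost; both symbols are introduced here only for illustration.

```latex
\begin{align*}
J(\mathbf{a}) &= \|\mathbf{a}\|^2
  - 2\,\mathbf{a}^T E\{\boldsymbol{\Theta} \mid \mathbf{Y} = \mathbf{y}\}
  + E\{\|\boldsymbol{\Theta}\|^2 \mid \mathbf{Y} = \mathbf{y}\}, \\
\nabla_{\mathbf{a}} J(\mathbf{a}) &= 2\,\mathbf{a}
  - 2\,E\{\boldsymbol{\Theta} \mid \mathbf{Y} = \mathbf{y}\} = \mathbf{0}
  \;\Longrightarrow\;
  \mathbf{a} = E\{\boldsymbol{\Theta} \mid \mathbf{Y} = \mathbf{y}\}, \\
\nabla^2_{\mathbf{a}} J(\mathbf{a}) &= 2\,\mathbf{I} \succ 0 .
\end{align*}
```

Since the Hessian is positive definite, the stationary point is the unique minimizer, which is the conditional mean.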
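Minimum-Mean-Absolute-Error (MMAE):
– Absolute error cost function:
$$C[\mathbf{a}, \boldsymbol{\theta}] = \|\mathbf{a} - \boldsymbol{\theta}\|_1 = \sum_{i=1}^{m} C_i[a_i, \theta_i] = \sum_{i=1}^{m} |a_i - \theta_i|$$
– The posterior cost given $\mathbf{Y} = \mathbf{y}$ is given by
$$E\{C[\hat{\boldsymbol{\theta}}(\mathbf{Y}), \boldsymbol{\Theta}] \mid \mathbf{Y} = \mathbf{y}\} = E\{\|\hat{\boldsymbol{\theta}}(\mathbf{Y}) - \boldsymbol{\Theta}\|_1 \mid \mathbf{Y} = \mathbf{y}\} = \sum_{i=1}^{m} \int_0^\infty P(|\hat{\theta}_i(\mathbf{y}) - \Theta_i| > x \mid \mathbf{Y} = \mathbf{y}) \, dx$$
– Here we used the fact that if $P(X \ge 0) = 1$, then
$$E\{X\} = \int_0^\infty P(X > x) \, dx$$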
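– Further simplification is also possible:
$$E\{C[\hat{\boldsymbol{\theta}}(\mathbf{Y}), \boldsymbol{\Theta}] \mid \mathbf{Y} = \mathbf{y}\} = \sum_{i=1}^{m} \left[ \int_0^\infty P(\Theta_i > x + \hat{\theta}_i(\mathbf{y}) \mid \mathbf{Y} = \mathbf{y}) \, dx + \int_0^\infty P(\Theta_i < -x + \hat{\theta}_i(\mathbf{y}) \mid \mathbf{Y} = \mathbf{y}) \, dx \right]$$
$$= \sum_{i=1}^{m} \left[ \int_{\hat{\theta}_i(\mathbf{y})}^{\infty} P(\Theta_i > t \mid \mathbf{Y} = \mathbf{y}) \, dt + \int_{-\infty}^{\hat{\theta}_i(\mathbf{y})} P(\Theta_i < t \mid \mathbf{Y} = \mathbf{y}) \, dt \right]$$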
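– Taking the derivative with respect to each $\hat{\theta}_i(\mathbf{y})$, one can see that
$$\frac{\partial}{\partial \hat{\theta}_i(\mathbf{y})} E\{C[\hat{\boldsymbol{\theta}}(\mathbf{Y}), \boldsymbol{\Theta}] \mid \mathbf{Y} = \mathbf{y}\} = P(\Theta_i < \hat{\theta}_i(\mathbf{y}) \mid \mathbf{Y} = \mathbf{y}) - P(\Theta_i > \hat{\theta}_i(\mathbf{y}) \mid \mathbf{Y} = \mathbf{y})$$
– This derivative is a nondecreasing function of $\hat{\theta}_i(\mathbf{y})$ that approaches $-1$ as $\hat{\theta}_i(\mathbf{y}) \to -\infty$ and $+1$ as $\hat{\theta}_i(\mathbf{y}) \to +\infty$. Thus $E\{C[\hat{\boldsymbol{\theta}}(\mathbf{Y}), \boldsymbol{\Theta}] \mid \mathbf{Y} = \mathbf{y}\}$ achieves its minimum where its derivative changes sign:
$$P(\Theta_i < t \mid \mathbf{Y} = \mathbf{y}) \le P(\Theta_i > t \mid \mathbf{Y} = \mathbf{y}), \quad t < \hat{\theta}_{i,MMAE}(\mathbf{y})$$
$$P(\Theta_i < t \mid \mathbf{Y} = \mathbf{y}) \ge P(\Theta_i > t \mid \mathbf{Y} = \mathbf{y}), \quad t > \hat{\theta}_{i,MMAE}(\mathbf{y})$$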
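Maximum A Posteriori Probability (MAP):
– Uniform error cost function (with tolerance $\Delta > 0$):
$$C[\mathbf{a}, \boldsymbol{\theta}] = \begin{cases} 1 & \text{if } \max_{1 \le i \le m} |a_i - \theta_i| > \Delta \\ 0 & \text{if } \max_{1 \le i \le m} |a_i - \theta_i| \le \Delta \end{cases}$$
– The posterior cost given $\mathbf{Y} = \mathbf{y}$ is given by
$$E\{C[\hat{\boldsymbol{\theta}}(\mathbf{Y}), \boldsymbol{\Theta}] \mid \mathbf{Y} = \mathbf{y}\} = 1 - P(|\hat{\theta}_1(\mathbf{y}) - \Theta_1| \le \Delta, \ldots, |\hat{\theta}_m(\mathbf{y}) - \Theta_m| \le \Delta \mid \mathbf{Y} = \mathbf{y}).$$
– Under some smoothness conditions, and for small $\Delta$, the estimator that minimizes this posterior cost is given by
$$\hat{\boldsymbol{\theta}}_{MAP}(\mathbf{y}) = \arg\max_{\boldsymbol{\theta}} \, p_{\boldsymbol{\Theta} \mid \mathbf{Y}}(\boldsymbol{\theta} \mid \mathbf{Y} = \mathbf{y})$$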
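Observations:
– MMSE Estimator:
$$\hat{\boldsymbol{\theta}}_{MMSE}(\mathbf{y}) = E\{\boldsymbol{\Theta} \mid \mathbf{Y} = \mathbf{y}\}$$
The MMSE estimate of $\boldsymbol{\theta}$ given $\mathbf{Y} = \mathbf{y}$ is the conditional mean of $\boldsymbol{\Theta}$ given $\mathbf{Y} = \mathbf{y}$.
– MMAE Estimator:
$$P(\Theta_i < t \mid \mathbf{Y} = \mathbf{y}) \le P(\Theta_i > t \mid \mathbf{Y} = \mathbf{y}), \quad t < \hat{\theta}_{i,MMAE}(\mathbf{y})$$
$$P(\Theta_i < t \mid \mathbf{Y} = \mathbf{y}) \ge P(\Theta_i > t \mid \mathbf{Y} = \mathbf{y}), \quad t > \hat{\theta}_{i,MMAE}(\mathbf{y})$$
The MMAE estimate of $\boldsymbol{\theta}$ given $\mathbf{Y} = \mathbf{y}$ is the conditional median of $\boldsymbol{\Theta}$ given $\mathbf{Y} = \mathbf{y}$.
– MAP Estimator:
$$\hat{\boldsymbol{\theta}}_{MAP}(\mathbf{y}) = \arg\max_{\boldsymbol{\theta}} \, p_{\boldsymbol{\Theta} \mid \mathbf{Y}}(\boldsymbol{\theta} \mid \mathbf{Y} = \mathbf{y})$$
The MAP estimate of $\boldsymbol{\theta}$ given $\mathbf{Y} = \mathbf{y}$ is the conditional mode of $\boldsymbol{\Theta}$ given $\mathbf{Y} = \mathbf{y}$.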
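To make the mean/median/mode correspondence concrete, here is a minimal sketch for a scalar parameter with a discrete posterior on a small grid. The grid `support` and the probabilities `probs` are hypothetical values chosen for illustration, and for a discrete posterior the uniform cost is replaced by a hit-or-miss cost.

```python
def posterior_mean(support, probs):
    """Conditional mean: the MMSE estimate."""
    return sum(t * p for t, p in zip(support, probs))

def posterior_median(support, probs):
    """Smallest grid point whose cumulative probability reaches 1/2 (MMAE estimate)."""
    total = 0.0
    for t, p in zip(support, probs):
        total += p
        if total >= 0.5:
            return t
    return support[-1]

def posterior_mode(support, probs):
    """Grid point with the largest posterior probability (MAP estimate)."""
    return max(zip(support, probs), key=lambda pair: pair[1])[0]

def expected_cost(a, support, probs, cost):
    """Posterior cost E{C[a, Theta] | Y = y} for the candidate estimate a."""
    return sum(p * cost(a, t) for t, p in zip(support, probs))

def squared(a, t):
    return (a - t) ** 2

def absolute(a, t):
    return abs(a - t)

def hit_or_miss(a, t):
    return 0.0 if a == t else 1.0

# Hypothetical skewed posterior (illustrative numbers only).
support = [0.0, 1.0, 2.0, 3.0, 4.0]
probs = [0.10, 0.35, 0.25, 0.20, 0.10]

mean = posterior_mean(support, probs)
median = posterior_median(support, probs)
mode = posterior_mode(support, probs)
for name, cost in (("squared", squared), ("absolute", absolute), ("hit-or-miss", hit_or_miss)):
    costs = {label: expected_cost(a, support, probs, cost)
             for label, a in (("mean", mean), ("median", median), ("mode", mode))}
    print(name, costs)
```

For a skewed posterior such as this one the three estimates differ, and each one gives the smallest posterior cost under its own cost function.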
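Example 6: (Poor) Given the following conditional probability density function
$$p_\theta(y) = \begin{cases} \theta e^{-\theta y} & \text{if } y \ge 0 \\ 0 & \text{if } y < 0, \end{cases}$$
$y$ has an exponential density with parameter $\theta$. Suppose $\Theta$ is also an exponential random variable with density
$$w(\theta) = \begin{cases} \alpha e^{-\alpha \theta} & \text{if } \theta \ge 0 \\ 0 & \text{if } \theta < 0. \end{cases}$$
Then, the posterior distribution of $\Theta$ given $Y = y$ is given by
$$w(\theta \mid y) = \frac{\theta e^{-\theta(y + \alpha)}}{\int_0^\infty \theta e^{-\theta(y + \alpha)} \, d\theta} = (\alpha + y)^2 \, \theta \, e^{-\theta(\alpha + y)}$$
for $\theta \ge 0$ and $y \ge 0$, and $w(\theta \mid y) = 0$ otherwise.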
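Example 7: (Continued.)
– The MMSE estimate is the mean of this distribution:
$$\hat{\theta}_{MMSE}(y) = \frac{2}{\alpha + y}$$
– The MMAE estimate is the median of this distribution:
$$\hat{\theta}_{MMAE}(y) = \frac{T_0}{\alpha + y}, \quad \text{where } T_0 \approx 1.68 \text{ solves } (1 + T_0)\, e^{-T_0} = 1/2$$
– The MAP estimate is the mode of this distribution (where it is maximum):
$$\hat{\theta}_{MAP}(y) = \frac{1}{\alpha + y}$$
– To decide which one to use, one must decide which of the three cost functions best suits the problem at hand.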
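The three estimates can be computed numerically as a sanity check. This sketch assumes the posterior $w(\theta \mid y) = (\alpha + y)^2 \theta e^{-\theta(\alpha + y)}$ from the example; the function names and the values of `y` and `alpha` are illustrative, and the median is found by bisection on the posterior CDF rather than in closed form.

```python
import math

def posterior_cdf(t, beta):
    """CDF of the density beta^2 * theta * exp(-beta * theta) on theta >= 0."""
    return 1.0 - (1.0 + beta * t) * math.exp(-beta * t)

def bayes_estimates(y, alpha):
    """Return (MMSE, MMAE, MAP) estimates for the exponential/exponential example."""
    beta = alpha + y
    mmse = 2.0 / beta   # conditional mean
    map_ = 1.0 / beta   # conditional mode
    # Conditional median: solve posterior_cdf(t) = 1/2 by bisection.
    lo, hi = 0.0, 10.0 / beta
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if posterior_cdf(mid, beta) < 0.5:
            lo = mid
        else:
            hi = mid
    mmae = 0.5 * (lo + hi)
    return mmse, mmae, map_

mmse, mmae, map_ = bayes_estimates(y=1.0, alpha=1.0)
print(mmse, mmae, map_)
```

The ordering mode < median < mean reflects the right skew of the posterior, which is why the choice of cost function matters here.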
Nonrandom Parameter Estimation
Our goal is the same as in the Bayesian parameter estimation problem: find an estimate $\hat{\boldsymbol{\theta}}(\mathbf{y})$ of $\boldsymbol{\theta}$.
Assume that the parameter set $\Lambda$ is real valued. In the nonrandom parameter estimation problem, we do not know anything about the true value of $\boldsymbol{\theta}$ other than the fact that it lies in $\Lambda$. Hence, the question we would like to answer is: given the observation $\mathbf{Y} = \mathbf{y}$, what is the best estimate of $\boldsymbol{\theta}$?
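The only averaging of the cost that can be done is with respect to the distribution of $\mathbf{Y}$ given $\boldsymbol{\theta}$, for a given cost function $C$.
A reasonable restriction to place on an estimate of $\boldsymbol{\theta}$ is that its expected value is equal to the true parameter value:
$$E_{\boldsymbol{\theta}}\{\hat{\boldsymbol{\theta}}(\mathbf{Y})\} = \boldsymbol{\theta}, \quad \boldsymbol{\theta} \in \Lambda$$
For its tractability, the squared Euclidean norm cost function will be used.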
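When the squared-error cost is used, the risk function is the following:
$$R_{\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}) = E_{\boldsymbol{\theta}}\{\|\hat{\boldsymbol{\theta}}(\mathbf{Y}) - \boldsymbol{\theta}\|^2\}, \quad \boldsymbol{\theta} \in \Lambda$$
– One cannot generally expect to minimize this risk function uniformly for all $\boldsymbol{\theta}$. This is easily seen for the squared-error cost, since for any particular value of $\boldsymbol{\theta}$, say $\boldsymbol{\theta}_0$, the conditional mean-squared error can be made zero by choosing the estimate to be identically $\boldsymbol{\theta}_0$ for all observations $\mathbf{y}$.
– However, if $\boldsymbol{\theta}$ is not close to $\boldsymbol{\theta}_0$, such an estimate would perform poorly.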
With the unbiasedness restriction, the conditional mean-squared error becomes the variance of the estimate. Hence, the unbiased estimators that minimize it are termed minimum-variance unbiased estimators (MVUEs).
The procedure for seeking MVUEs:
– Find a complete sufficient statistic $T$ for $\boldsymbol{\theta}$.
– Find any unbiased estimator $\hat{g}(\mathbf{y})$ of $g(\boldsymbol{\theta})$.
– Then, $\tilde{g}[T(\mathbf{y})] = E_{\boldsymbol{\theta}}\{\hat{g}(\mathbf{Y}) \mid T(\mathbf{Y}) = T(\mathbf{y})\}$ is an MVUE of $g(\boldsymbol{\theta})$.
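Example 8: (Poor) Consider the model
$$Y_k = N_k + \mu s_k, \quad k = 1, \ldots, n$$
where $N_1, \ldots, N_n$ are i.i.d. $\mathcal{N}(0, \sigma^2)$ noise samples, and $s_k$ is a known signal for $k = 1, \ldots, n$. Our objective is to estimate $\mu$ and $\sigma^2$.
1. The density of $\mathbf{Y}$ is given by
$$p_{\boldsymbol{\theta}}(\mathbf{y}) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{k=1}^{n} (y_k - \mu s_k)^2 \right\} = C(\boldsymbol{\theta}) \exp\{\theta_1 T_1(\mathbf{y}) + \theta_2 T_2(\mathbf{y})\} \, h(\mathbf{y}),$$
where $\boldsymbol{\theta} = [\theta_1 \;\; \theta_2]^T$ and
$$\theta_1 = \mu / \sigma^2, \qquad \theta_2 = -\frac{1}{2\sigma^2},$$
$$T_1(\mathbf{y}) = \sum_{k=1}^{n} s_k y_k, \qquad T_2(\mathbf{y}) = \sum_{k=1}^{n} y_k^2,$$
$$C(\boldsymbol{\theta}) = \left( -\frac{\theta_2}{\pi} \right)^{n/2} \exp\left\{ \frac{\theta_1^2}{4\theta_2} \sum_{k=1}^{n} s_k^2 \right\}, \qquad h(\mathbf{y}) = 1.$$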
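Example 9: (Continued.) Note that $T = [T_1 \;\; T_2]^T$ is a complete sufficient statistic for $\boldsymbol{\theta}$.
2. We wish to estimate $g_1(\boldsymbol{\theta}) = -\theta_1 / (2\theta_2) = \mu$ and $g_2(\boldsymbol{\theta}) = -1 / (2\theta_2) = \sigma^2$. Assuming that $s_1 \ne 0$, the estimate $\hat{g}_1(\mathbf{y}) = y_1 / s_1$ is an unbiased estimator of $g_1(\boldsymbol{\theta})$.
Moreover, note that
$$E_{\boldsymbol{\theta}}\{T_1^2(\mathbf{Y})\} = \mathrm{Var}_{\boldsymbol{\theta}}\{T_1(\mathbf{Y})\} + \left( E_{\boldsymbol{\theta}}\{T_1(\mathbf{Y})\} \right)^2 = n \sigma^2 \overline{s^2} + n^2 \mu^2 (\overline{s^2})^2, \quad \text{with } \overline{s^2} = (1/n) \sum_{k=1}^{n} s_k^2,$$
and that
$$E_{\boldsymbol{\theta}}\{T_2(\mathbf{Y})\} = \sum_{k=1}^{n} E_{\boldsymbol{\theta}}\{Y_k^2\} = \sum_{k=1}^{n} (\sigma^2 + \mu^2 s_k^2) = n \sigma^2 + n \mu^2 \overline{s^2}.$$
Hence, $\hat{g}_2(\mathbf{y}) = [T_2(\mathbf{y}) - T_1^2(\mathbf{y}) / (n \overline{s^2})] / (n - 1)$ is an unbiased estimate of $g_2(\boldsymbol{\theta})$.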
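Example 10: (Continued.)
3. Since $T = [T_1 \;\; T_2]^T$ is a complete sufficient statistic, the estimates
$$\tilde{g}_1[T(\mathbf{y})] = E_{\boldsymbol{\theta}}\{\hat{g}_1(\mathbf{Y}) \mid T(\mathbf{Y}) = T(\mathbf{y})\}$$
$$\tilde{g}_2[T(\mathbf{y})] = E_{\boldsymbol{\theta}}\{\hat{g}_2(\mathbf{Y}) \mid T(\mathbf{Y}) = T(\mathbf{y})\}$$
are MVUEs of $\mu$ and $\sigma^2$, respectively. Note that $\hat{g}_1(\mathbf{Y})$ and $T_1(\mathbf{Y})$ are both linear functions of $\mathbf{Y}$ and are jointly Gaussian. Hence, the MVUEs are
$$\tilde{g}_1[T(\mathbf{y})] = E_{\boldsymbol{\theta}}\{\hat{g}_1(\mathbf{Y})\} + \mathrm{Cov}[\hat{g}_1(\mathbf{Y}), T_1(\mathbf{Y})] \, [\mathrm{Var}_{\boldsymbol{\theta}}[T_1(\mathbf{Y})]]^{-1} \, [T_1(\mathbf{y}) - E_{\boldsymbol{\theta}}\{T_1(\mathbf{Y})\}]$$
$$= \mu + \sigma^2 (n \sigma^2 \overline{s^2})^{-1} [T_1(\mathbf{y}) - n \mu \overline{s^2}] = \frac{T_1(\mathbf{y})}{n \overline{s^2}} = \frac{\sum_{k=1}^{n} s_k y_k}{n \overline{s^2}} \equiv \hat{\mu}$$
$$\tilde{g}_2[T(\mathbf{y})] = \frac{T_2(\mathbf{y}) - T_1^2(\mathbf{y}) / (n \overline{s^2})}{n - 1} = \frac{1}{n - 1} \sum_{k=1}^{n} (y_k - \hat{\mu} s_k)^2 \equiv \hat{\sigma}^2$$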
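The closed-form estimates of this example translate directly into code. The sketch below assumes the model $Y_k = N_k + \mu s_k$ with a known signal; the function name `mvue_estimates`, the simulated signal, and the true values of `mu` and `sigma` are illustrative assumptions.

```python
import random

def mvue_estimates(y, s):
    """Return (mu_hat, sigma2_hat) computed from the statistics T1 and T2."""
    n = len(y)
    t1 = sum(sk * yk for sk, yk in zip(s, y))   # T1(y) = sum of s_k * y_k
    t2 = sum(yk * yk for yk in y)               # T2(y) = sum of y_k^2
    ns2 = sum(sk * sk for sk in s)              # n * mean(s^2) = sum of s_k^2
    mu_hat = t1 / ns2
    sigma2_hat = (t2 - t1 * t1 / ns2) / (n - 1)
    return mu_hat, sigma2_hat

rng = random.Random(1)                          # fixed seed for repeatability
mu, sigma = 0.7, 0.5                            # hypothetical true values
s = [1.0 + 0.1 * k for k in range(200)]         # hypothetical known signal
y = [rng.gauss(0.0, sigma) + mu * sk for sk in s]

mu_hat, sigma2_hat = mvue_estimates(y, s)
# Equivalent residual form of the variance estimate.
residual_form = sum((yk - mu_hat * sk) ** 2 for yk, sk in zip(y, s)) / (len(y) - 1)
print(mu_hat, sigma2_hat, residual_form)
```

The two expressions for the variance estimate should agree up to rounding, which confirms that the form in terms of $T_1$ and $T_2$ is the familiar residual sum of squares divided by $n - 1$.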
Maximum-Likelihood (ML) Estimation:
– For many problems arising in practice, it is not usually feasible to find MVUEs.
– Another method for seeking good estimators is needed.
– ML is one of the most commonly used methods in the signal processing literature.
Consider MAP estimation for $\boldsymbol{\theta}$:
$$\hat{\boldsymbol{\theta}}_{MAP}(\mathbf{y}) = \arg\max_{\boldsymbol{\theta} \in \Lambda} \, p_{\boldsymbol{\theta}}(\mathbf{y}) \, w(\boldsymbol{\theta})$$
– In the absence of any prior information about the parameter $\boldsymbol{\theta}$, we can assume that it is uniformly distributed ($w(\boldsymbol{\theta})$ becomes a uniform distribution), since this represents the worst-case scenario.
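ML Estimation: (Continued.)
– Hence, the MAP estimate for a given $\mathbf{y}$ is any value of $\boldsymbol{\theta}$ that maximizes $p_{\boldsymbol{\theta}}(\mathbf{y})$ over $\Lambda$.
– $p_{\boldsymbol{\theta}}(\mathbf{y})$, viewed as a function of $\boldsymbol{\theta}$, is usually called the likelihood function.
– Hence, the ML estimate is
$$\hat{\boldsymbol{\theta}}_{ML}(\mathbf{y}) = \arg\max_{\boldsymbol{\theta} \in \Lambda} \, p_{\boldsymbol{\theta}}(\mathbf{y})$$
– Maximizing $p_{\boldsymbol{\theta}}(\mathbf{y})$ is the same as maximizing $\log p_{\boldsymbol{\theta}}(\mathbf{y})$ (log-likelihood function). Therefore, when the maximum occurs at an interior point of $\Lambda$ where the log-likelihood is differentiable, a necessary condition for the maximum-likelihood estimate is
$$\left. \frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{y}) \right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}_{ML}(\mathbf{y})} = 0.$$
– The above condition is also known as the likelihood equation.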
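As a small worked case of the likelihood equation, the sketch below assumes $n$ i.i.d. observations with the exponential density $\theta e^{-\theta y}$ of Example 6 (treated here as a nonrandom-parameter problem). The log-likelihood is $n \log\theta - \theta \sum_k y_k$, so the likelihood equation $n/\theta - \sum_k y_k = 0$ gives $\hat{\theta}_{ML} = n / \sum_k y_k$. The observation values and the search grid are hypothetical.

```python
import math

def log_likelihood(theta, y):
    """log p_theta(y) for i.i.d. samples with density theta * exp(-theta * y_k)."""
    return len(y) * math.log(theta) - theta * sum(y)

def score(theta, y):
    """Derivative of the log-likelihood with respect to theta."""
    return len(y) / theta - sum(y)

def ml_estimate(y):
    """Root of the likelihood equation score(theta) = 0."""
    return len(y) / sum(y)

y = [0.5, 1.5, 0.2, 0.8, 1.0]                    # hypothetical observations
theta_ml = ml_estimate(y)

# Cross-check against a brute-force search over a coarse grid of candidates.
grid = [0.05 * k for k in range(1, 200)]
best_on_grid = max(grid, key=lambda th: log_likelihood(th, y))
print(theta_ml, score(theta_ml, y), best_on_grid)
```

The closed-form root and the grid maximizer should coincide, and the score evaluated at the estimate should be zero up to rounding.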
Cramer-Rao Bound:
– Let $\hat{\boldsymbol{\theta}}(\mathbf{Y})$ be some unbiased estimator of $\boldsymbol{\theta}$. Then the error covariance matrix of $\hat{\boldsymbol{\theta}}$ is bounded from below by the Cramer-Rao bound (refer to the supplement).
– If the Cramer-Rao bound can be satisfied with equality, only the maximum likelihood estimate achieves it. Hence, if an efficient estimate exists, it is the maximum likelihood estimate.
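A scalar case shows what efficiency looks like in practice. The sketch below assumes $n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$ samples with known $\sigma$, for which the Cramer-Rao bound on unbiased estimates of $\mu$ is $\sigma^2/n$ and the ML estimate is the sample mean. This model and all numerical values are illustrative and are not taken from the supplement.

```python
import random
import statistics

def crb_gaussian_mean(sigma, n):
    """Cramer-Rao bound for unbiased estimates of mu from n i.i.d. N(mu, sigma^2) samples."""
    return sigma ** 2 / n

def ml_mean(y):
    """ML estimate of mu for Gaussian samples with known variance: the sample mean."""
    return sum(y) / len(y)

rng = random.Random(2)                           # fixed seed for repeatability
mu, sigma, n, trials = 0.0, 1.5, 25, 4000        # hypothetical settings
estimates = [ml_mean([rng.gauss(mu, sigma) for _ in range(n)]) for _ in range(trials)]

# Mean-squared error of the estimates around the true mu.
empirical_var = statistics.pvariance(estimates, mu=mu)
bound = crb_gaussian_mean(sigma, n)
print(empirical_var, bound)
```

Up to Monte Carlo error the empirical variance should match the bound, consistent with the statement that an efficient estimate, when it exists, is the ML estimate.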
Example 11: Refer to the attached paper: “The Stochastic CRB for Array Processing: A Textbook Derivation” by Stoica, Larsson, and Gershman.
Questions