Lecture 3

Linear models
• As the first approach to estimator design, we consider the
class of problems that can be represented by a linear model.
• In general, finding the MVUE is difficult. But if the linear
model is valid, this task is straightforward.
• A model with parameters θ ∈ R^{p×1} and data x ∈ R^{n×1} is
linear if it is of the form
x = Hθ + w,
where w ∼ N(0, σ² I) and H ∈ R^{n×p}. The matrix H is called
the observation matrix or design matrix¹.
¹ http://en.wikipedia.org/wiki/Design_matrix
Linear models
• For example, the "DC level in WGN" problem belongs to
this class with
x = [x[0], x[1], . . . , x[N − 1]]^T
w = [w[0], w[1], . . . , w[N − 1]]^T
θ = [A]
H = [1, 1, . . . , 1]^T   (N ones)
• With these definitions, x[n] = A · 1 + w[n] holds for all
n = 0, 1, . . . , N − 1.
Linear models (cont.)
• Also fitting a straight line to a set of data belongs to this
class. In this case the model is
x[n] = A + Bn + w[n],
n = 0, 1, . . . , N − 1
and the problem is to find MVU estimators for A and B
assuming w[n] ∼ N(0, σ2 ).
Linear models (cont.)
• In matrix form x = Hθ + w, or
\[
\underbrace{\begin{bmatrix} x[0] \\ x[1] \\ \vdots \\ x[N-1] \end{bmatrix}}_{x}
=
\underbrace{\begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & N-1 \end{bmatrix}}_{H}
\cdot
\underbrace{\begin{bmatrix} A \\ B \end{bmatrix}}_{\theta}
+
\underbrace{\begin{bmatrix} w[0] \\ w[1] \\ \vdots \\ w[N-1] \end{bmatrix}}_{w}
\]
• The matrix H is called the observation matrix.
Linear models: finding the MVUE
• The nice thing about linear models is that the MVUE given
by the CRLB theorem can always be found.
• More specifically, the factorization
∂ ln p(x; θ)/∂θ = I(θ)(g(x) − θ)
can always be done. According to the CRLB theorem for the
vector parameter case, g(x) is then the MVUE.
• To see what the factorization looks like, let's calculate
the derivative of the log-likelihood function.
Linear models: finding the MVUE (cont.)
• The likelihood function for each sample x[n] is now
\[
p(x[n]; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} \left( x[n] - [H\theta]_n \right)^2 \right)
\]
and the joint probability for all samples
\[
p(x; \theta) = \prod_{n=0}^{N-1} p(x[n]; \theta)
= \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} \left( x[n] - [H\theta]_n \right)^2 \right]
\]
or in vector form
\[
p(x; \theta) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[ -\frac{1}{2\sigma^2} (x - H\theta)^T (x - H\theta) \right]
\]
Linear models: finding the MVUE (cont.)
• After taking the logarithm and differentiating, we get
\[
\frac{\partial \ln p(x; \theta)}{\partial \theta}
= \frac{\partial}{\partial \theta} \left[ -\frac{N}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} (x - H\theta)^T (x - H\theta) \right]
= -\frac{1}{2\sigma^2} \frac{\partial}{\partial \theta} \left[ x^T x - 2 x^T H\theta + \theta^T H^T H \theta \right].
\]
• It can be shown that for any vector v and any symmetric
matrix M the following differentiation rules hold.
\[
\frac{\partial}{\partial \theta} \left( v^T \theta \right) = v
\qquad\qquad
\frac{\partial}{\partial \theta} \left( \theta^T M \theta \right) = 2 M \theta
\]
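• As a quick sanity check, the two rules can be verified numerically with central differences. The values below are arbitrary test inputs, not part of the lecture example.

% Numerical check of the two differentiation rules (arbitrary test values)
p = 3;
v = randn(p,1);
M = randn(p); M = M + M';        % make M symmetric
theta = randn(p,1);
f1 = @(t) v'*t;                  % scalar function v'*theta
f2 = @(t) t'*M*t;                % scalar function theta'*M*theta
eps_ = 1e-6;
g1 = zeros(p,1); g2 = zeros(p,1);
for i = 1:p
    e = zeros(p,1); e(i) = eps_;
    g1(i) = (f1(theta+e) - f1(theta-e)) / (2*eps_);
    g2(i) = (f2(theta+e) - f2(theta-e)) / (2*eps_);
end
[g1, v]            % columns should agree
[g2, 2*M*theta]    % columns should agree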
Linear models: finding the MVUE (cont.)
• Using these, we can evaluate the above formula:
\[
\frac{\partial \ln p(x; \theta)}{\partial \theta}
= -\frac{1}{2\sigma^2} \left( -2 H^T x + 2 H^T H \theta \right)
= -\frac{1}{\sigma^2} \left( - H^T x + H^T H \theta \right)
\]
Linear models: finding the MVUE
• The MVUE g(x) is given by the following factorization:
∂ ln p(x; θ)/∂θ = I(θ)(g(x) − θ).
• If the square matrix H^T H is invertible², we can cleverly
multiply by the identity matrix I = H^T H · (H^T H)^{-1}:
\[
\frac{\partial \ln p(x; \theta)}{\partial \theta}
= -\frac{H^T H}{\sigma^2} \left[ -(H^T H)^{-1} H^T x + \theta \right]
= \frac{H^T H}{\sigma^2} \left[ (H^T H)^{-1} H^T x - \theta \right]
\]
² It usually is; we will return to this issue later.
Linear models: finding the MVUE
• Comparing this with the required factorization of the
CRLB theorem,
∂ ln p(x; θ)/∂θ = I(θ)(g(x) − θ),
we see immediately that the MVUE g(x) exists and is
given by
θ̂ = g(x) = (H^T H)^{-1} H^T x.
Linear models: finding the MVUE (cont.)
• Furthermore, the Fisher information matrix is
I(θ) = H^T H / σ²,
which means that the covariance matrix of the estimator is
its inverse:
Cθ̂ = σ² (H^T H)^{-1}.
Linear models: theorem
MVU estimator for the linear model. If the observed data can
be modeled as
x = Hθ + w    (1)
where x is an N × 1 vector of observations, H is a known N × p
observation matrix (with N > p) and rank p, θ is a p × 1 vector
of parameters to be estimated, and w is an N × 1 noise vector
with pdf N(0, σ² I), then the MVU estimator is
θ̂ = (H^T H)^{-1} H^T x    (2)
and the covariance matrix of θ̂ is
Cθ̂ = σ² (H^T H)^{-1}    (3)
Linear models: theorem (cont.)
Moreover, the MVU estimator is efficient in that it attains the
CRLB.
Proof. We have already proven everything except the fact that
the estimator is unbiased. The unbiasedness is easily seen:
E[θ̂] = E[(HT H)−1 HT x] = (HT H)−1 HT E[x] = (HT H)−1 HT Hθ = θ.
(Here we used the fact that E[x] = Hθ + E[w] = Hθ.)
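• As a quick check, for the earlier "DC level in WGN" model (H = [1, 1, . . . , 1]^T) the theorem gives H^T H = N and H^T x = ∑_{n=0}^{N−1} x[n], so
θ̂ = Â = (1/N) ∑_{n=0}^{N−1} x[n],   Cθ̂ = σ²/N,
i.e., the sample mean with variance σ²/N, the familiar MVUE of the DC level.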
Examples: Line fitting
• In the line fitting case the equation was:
\[
\underbrace{\begin{bmatrix} x[0] \\ x[1] \\ \vdots \\ x[N-1] \end{bmatrix}}_{x}
=
\underbrace{\begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & N-1 \end{bmatrix}}_{H}
\cdot
\underbrace{\begin{bmatrix} A \\ B \end{bmatrix}}_{\theta}
+
\underbrace{\begin{bmatrix} w[0] \\ w[1] \\ \vdots \\ w[N-1] \end{bmatrix}}_{w}
\]
Examples: Line fitting (cont.)
• Once we observe the data x and assume this model, the
MVU estimator is
θ̂ = (HT H)−1 HT x
• Writing out the matrices, we have:
\[
\begin{bmatrix} \hat{A} \\ \hat{B} \end{bmatrix}
=
\left(
\begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 0 & 1 & 2 & \cdots & N-1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & N-1 \end{bmatrix}
\right)^{-1}
\begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 0 & 1 & 2 & \cdots & N-1 \end{bmatrix}
\begin{bmatrix} x[0] \\ x[1] \\ \vdots \\ x[N-1] \end{bmatrix}
\]
Examples: Line fitting (cont.)
• Now,
\[
H^T H =
\begin{bmatrix} N & \sum_{n=0}^{N-1} n \\[2pt] \sum_{n=0}^{N-1} n & \sum_{n=0}^{N-1} n^2 \end{bmatrix}
=
\begin{bmatrix} N & \frac{N(N-1)}{2} \\[2pt] \frac{N(N-1)}{2} & \frac{N(N-1)(2N-1)}{6} \end{bmatrix}
\]
and we can show that the inverse is
\[
(H^T H)^{-1} =
\begin{bmatrix} \frac{2(2N-1)}{N(N+1)} & -\frac{6}{N(N+1)} \\[2pt] -\frac{6}{N(N+1)} & \frac{12}{N(N^2-1)} \end{bmatrix}
\]
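• The closed-form inverse is easy to verify numerically. The sketch below uses an arbitrary N (not from the lecture) and compares inv(H'*H) against the formula.

% Numerical check of the closed-form inverse of H'*H for the line-fitting model
N = 50;                                  % arbitrary test value
n = (0:N-1)';
H = [ones(N,1), n];                      % observation matrix for A + B*n
closed_form = [2*(2*N-1)/(N*(N+1)), -6/(N*(N+1));
               -6/(N*(N+1)),        12/(N*(N^2-1))];
max(max(abs(inv(H'*H) - closed_form)))   % should be at machine precision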
Examples: Line fitting (cont.)
• Finally,
\[
\begin{bmatrix} \hat{A} \\ \hat{B} \end{bmatrix}
=
\begin{bmatrix} \frac{2(2N-1)}{N(N+1)} & -\frac{6}{N(N+1)} \\[2pt] -\frac{6}{N(N+1)} & \frac{12}{N(N^2-1)} \end{bmatrix}
\begin{bmatrix} \sum_{n=0}^{N-1} x[n] \\[2pt] \sum_{n=0}^{N-1} n\, x[n] \end{bmatrix}
\]
• Below is the result of one test run, with σ² = 1000, A = 1
and B = −2. In this realization, the result was Â = −1.2583
and B̂ = −1.8730.
Examples: Line fitting (cont.)
[Figure: the noisy data and the fitted line for one realization of the test run.]
• The covariance matrix (or inverse of the Fisher information
matrix) is
\[
C_{\hat\theta} = \begin{bmatrix} 39.4059 & -0.5941 \\ -0.5941 & 0.0120 \end{bmatrix}
\]
• This shows that the estimates Â have a much higher
variance than the estimates B̂.
Examples: Line fitting (cont.)
• We can validate this by estimating the parameters from
1000 noise realizations. The histograms and the
corresponding variances are plotted below.
[Figure: histograms of the estimates over 1000 noise realizations.
Estimates for A: theoretical variance = 39.4059, sample variance = 38.6936.
Estimates for B: theoretical variance = 0.012001, sample variance = 0.011676.]
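• A minimal Matlab sketch of such a simulation is given below; N = 100 is inferred from the theoretical variances quoted above, and the other values follow the earlier test run.

% Monte Carlo check of the line-fitting estimator variances
N = 100; sigma_sq = 1000; A = 1; B = -2;   % values of the test run above
n = (0:N-1)';
H = [ones(N,1), n];
K = 1000;                                  % number of noise realizations
est = zeros(K, 2);
for k = 1:K
    x = H*[A; B] + sqrt(sigma_sq)*randn(N,1);
    est(k,:) = (H \ x)';                   % [A_hat, B_hat] for this realization
end
var(est)                                   % compare with diag(sigma_sq*inv(H'*H))'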
Amplitude of a sinusoid
• So far we have considered problems where the fitted
function itself was also linear (a constant or a straight line).
• The model also allows other cases, as long as the
relationship between the parameters and the data is linear.
• These include, for example, estimating the amplitudes of
known sinusoids. Consider the model
x[n] = A1 cos(2πf1 n + φ1 ) + A2 cos(2πf2 n + φ2 ) + B + w[n],
for n = 0, 1, . . . , N − 1, where f1 , f2 , φ1 , φ2 are known and
A1 , A2 and B are the unknowns.
Amplitude of a sinusoid (cont.)
• Then the linear model is applicable with
x = Hθ + w,
namely
\[
\underbrace{\begin{bmatrix} x[0] \\ x[1] \\ x[2] \\ \vdots \\ x[N-1] \end{bmatrix}}_{x}
=
\underbrace{\begin{bmatrix}
\cos(\phi_1) & \cos(\phi_2) & 1 \\
\cos(2\pi f_1 + \phi_1) & \cos(2\pi f_2 + \phi_2) & 1 \\
\cos(4\pi f_1 + \phi_1) & \cos(4\pi f_2 + \phi_2) & 1 \\
\vdots & \vdots & \vdots \\
\cos(2\pi f_1 (N-1) + \phi_1) & \cos(2\pi f_2 (N-1) + \phi_2) & 1
\end{bmatrix}}_{H}
\cdot
\underbrace{\begin{bmatrix} A_1 \\ A_2 \\ B \end{bmatrix}}_{\theta}
+
\underbrace{\begin{bmatrix} w[0] \\ w[1] \\ w[2] \\ \vdots \\ w[N-1] \end{bmatrix}}_{w}
\]
Amplitude of a sinusoid (cont.)
• Again, the MVU estimator is θ̂ = (HT H)−1 HT x.
• The Matlab code for this is below.
• Note that now we’re generating the simulated data exactly
according to our model. It’s interesting to see how
deviations from the model affect the performance—try it:
http://www.cs.tut.fi/courses/SGN-2607/CosFit.m.
• Try also other curves instead of the sinusoids + lines.
Code
% Let's generate a test case first:
N = 200;
n = (0:N-1)';
sigma_sq = 10; % Variance of WGN
w = sqrt(sigma_sq)*randn(N,1);
A = 1;  % This is the unknown for the estimator (A1 in the model)
B = -2; % This is the unknown for the estimator (A2 in the model)
C = 10; % This is the unknown for the estimator (B in the model)
f1 = 0.05; % This parameter the estimator knows
f2 = 0.02; % This parameter the estimator knows
theta = [A;B;C];
Code (cont.)
% The known phases are phi1 = pi/4 and phi2 = -pi/10
H = [cos(2*pi*f1*n+pi/4), cos(2*pi*f2*n-pi/10), ones(N,1)];
x = H*theta + w;
% This is the observed data.
% Now let's try to estimate theta from the data x:
% Note: Below is Matlab's preferred way for
% thEst = inv(H'*H)*H'*x
thEst = H \ x;
plot(n,H*theta, 'b-', 'LineWidth', 2);
hold on
plot(n,x,'go', 'LineWidth', 2);
plot(n,H*thEst, 'r-', 'LineWidth', 2);
hold off
Amplitude of a sinusoid, results
• Below is the result of one example run.
[Figure: one example run showing the true model, the noisy data, and the estimated sinusoid.]
• In this case θ̂ = [1.2128, −2.6580, 9.7749]^T
• The true θ = [1, −2, 10]^T.
Amplitude of a sinusoid, results (cont.)
• The covariance matrix is diagonal (the columns of H are
orthogonal for these frequencies, so H^T H and hence its
inverse are diagonal):
\[
C_{\hat\theta} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \frac{1}{2} \end{bmatrix}
\]
Linear models — other examples in Kay’s book
• Curve fitting: For example, motion under gravitational force
can be modeled using a second-order polynomial:
x(t_n) = θ_1 + θ_2 t_n + θ_3 t_n² + w(t_n),   n = 0, . . . , N − 1
In matrix form, this is given by
x = Hθ + w
or
\[
\begin{bmatrix} x(t_0) \\ x(t_1) \\ \vdots \\ x(t_{N-1}) \end{bmatrix}
=
\begin{bmatrix}
1 & t_0 & t_0^2 \\
1 & t_1 & t_1^2 \\
\vdots & \vdots & \vdots \\
1 & t_{N-1} & t_{N-1}^2
\end{bmatrix}
\begin{bmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \end{bmatrix}
+
\begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_{N-1} \end{bmatrix}
\]
Notice that for polynomial models the matrix H has a
special form; it is called a Vandermonde matrix.
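• A small Matlab sketch of such a quadratic fit (with made-up values, not from Kay's example) compares the linear-model estimator against polyfit:

% Quadratic curve fit as a linear model (illustrative values only)
N = 100;
t = (0:N-1)'/10;                      % sampling instants t_n
theta_true = [2; -1; 0.5];            % [theta1; theta2; theta3]
H = [ones(N,1), t, t.^2];             % Vandermonde-type observation matrix
x = H*theta_true + randn(N,1);
theta_hat = H \ x                     % MVU estimate
flipud(polyfit(t, x, 2)')             % same coefficients from polyfit (reordered)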
Linear models — other examples in Kay’s book
(cont.)
• The nice property of the linear model is that you can try
inserting whatever functions you can imagine as columns of
H, and let the formula decide whether they are useful or not.
• As an example, below are two linear models fitted to the
same data.
Linear models — other examples in Kay’s book
(cont.)
[Figure: two linear models fitted to the same data.
Panel 1: "Model: y = 0.002*x^2 − 0.083*x + 16.044. MSE = 4658.5243".
Panel 2: "Model: y = 0.406*x − 0.089. MSE = 15386.4108".
Axes: x(n) vs. y(n).]
• Below are some additional models (although not very
suitable ones).
Linear models — other examples in Kay’s book
(cont.)
[Figure: two further models fitted to the same data.
Panel 1: "Model: y = −39.497*sqrt(x) + 51.283*log(1+x) + 0.000*x^(−1) + 1.905*x. MSE = 5200.8578".
Panel 2: "Model: y = 2.368*cos(2*pi*0.01*x) − 85.958*(1+x)^(−1) + 42.811. MSE = 114908.1373".
Axes: x(n) vs. y(n).]
• Note that the MSE is a good indicator of model suitability.
We will discuss this later in the context of sequential least
squares.
Linear models — other examples in Kay’s book
(cont.)
• Fourier analysis
\[
x[n] = \sum_{k=1}^{M} a_k \cos\left( \frac{2\pi k n}{N} \right) + \sum_{k=1}^{M} b_k \sin\left( \frac{2\pi k n}{N} \right) + w[n],
\]
with n = 0, 1, . . . , N − 1.
• Now
θ = [a1 , a2 , · · · , aM , b1 , b2 , · · · , bM ]T
and
Linear models — other examples in Kay’s book
(cont.)
\[
H = \begin{bmatrix}
1 & 1 & \cdots & 1 & 0 & 0 & \cdots & 0 \\
\cos\frac{2\pi}{N} & \cos\frac{4\pi}{N} & \cdots & \cos\frac{2M\pi}{N} & \sin\frac{2\pi}{N} & \sin\frac{4\pi}{N} & \cdots & \sin\frac{2M\pi}{N} \\
\cos\frac{2\pi\cdot 2}{N} & \cos\frac{4\pi\cdot 2}{N} & \cdots & \cos\frac{2M\pi\cdot 2}{N} & \sin\frac{2\pi\cdot 2}{N} & \sin\frac{4\pi\cdot 2}{N} & \cdots & \sin\frac{2M\pi\cdot 2}{N} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
\cos\frac{2\pi(N-1)}{N} & \cos\frac{4\pi(N-1)}{N} & \cdots & \cos\frac{2M\pi(N-1)}{N} & \sin\frac{2\pi(N-1)}{N} & \sin\frac{4\pi(N-1)}{N} & \cdots & \sin\frac{2M\pi(N-1)}{N}
\end{bmatrix}
\]
• The MVU estimator results in the usual DFT coefficients, as
one could expect.
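• This follows because, for M < N/2, the columns of H are orthogonal, so H^T H = (N/2) I and the general formula reduces to
\[
\hat{a}_k = \frac{2}{N} \sum_{n=0}^{N-1} x[n] \cos\frac{2\pi k n}{N},
\qquad
\hat{b}_k = \frac{2}{N} \sum_{n=0}^{N-1} x[n] \sin\frac{2\pi k n}{N},
\]
i.e., scaled DFT coefficients, with covariance Cθ̂ = (2σ²/N) I.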
Linear models — other examples in Kay’s book
(cont.)
• System identification: Any linear process can be modeled
using a FIR filter.
• In system identification context, we measure the input and
the output of an unknown system ("black box"), and try to
model its properties by a FIR filter.
• The problem is essentially estimating the FIR impulse
response, and thus it’s natural to formulate the problem as
a linear model.
Linear models — other examples in Kay’s book
(cont.)
• Denote the input by u[n], and the output by x[n],
n = 0, 1, . . . , N − 1. Also denote the FIR impulse response
by h[k], k = 0, 1, . . . , p − 1. Then our model for the
measured output data is
\[
x[n] = \sum_{k=0}^{p-1} h[k]\, u[n-k] + w[n], \qquad n = 0, 1, \ldots, N-1
\]
or in matrix form
\[
x =
\underbrace{\begin{bmatrix}
u[0] & 0 & \cdots & 0 \\
u[1] & u[0] & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
u[N-1] & u[N-2] & \cdots & u[N-p]
\end{bmatrix}}_{H}
\underbrace{\begin{bmatrix}
h[0] \\ h[1] \\ \vdots \\ h[p-1]
\end{bmatrix}}_{\theta}
+ w
\]
Linear models — other examples in Kay’s book
(cont.)
• Because this is in linear model form (assuming w[n] is
WGN), the minimum variance FIR coefficient vector is
θ̂ = (HT H)−1 HT x.
• Kay continues the discussion by asking: "What is the best
selection for u[n]?" If we can select the input sequence,
which one produces the smallest variance?
• Answer: any sequence whose autocorrelation (covariance)
matrix is diagonal, that is, a white (pseudo)random sequence.
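• A minimal Matlab sketch of this identification setup, using a made-up impulse response and a white pseudorandom input (values are illustrative only):

% FIR system identification as a linear model (illustrative values only)
N = 500; p = 5;
h_true = [1; 0.5; -0.3; 0.2; 0.1];      % "black box" impulse response
u = randn(N,1);                         % white (pseudo)random input
sigma_sq = 0.1;
x = filter(h_true, 1, u) + sqrt(sigma_sq)*randn(N,1);   % measured output
% Observation matrix: row n contains u[n], u[n-1], ..., u[n-p+1]
H = toeplitz(u, [u(1); zeros(p-1,1)]);
h_est = H \ x                           % MVU estimate of the impulse response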
Automatic Bacteria Counting from Microscope
• The next example considers automatic
counting and measuring of DAPI-stained
bacteria from microscope images.
• DAPIᵃ is a fluorescent stain molecule that
binds strongly to DNA.
• When excited by ultraviolet light
(wavelength near 358 nm), it emits light at
a longer wavelength (near 461 nm, which is
blue light).
• DAPI staining is widely used in biology
and medicine for highlighting the cells for
counting, tracking and other purposes.
ᵃ 4',6-diamidino-2-phenylindole
Automatic Bacteria Counting from Microscope
• Traditionally (and even today) the number of cells is
counted manually.
• However, there are numerous automatic solutions
available.
• At our department the software CellC was developed for
this task.³
• The code is freely available at
http://sites.google.com/site/cellcsoftware/.
³ J. Selinummi, J. Seppälä, O. Yli-Harja, and J. Puhakka, "Software for quantification of labeled bacteria from digital microscope images by automated image analysis," BioTechniques, Vol. 39, No. 6, 2005, pp. 859-863.
CellC Operation
• The software consists of the following stages:
• Normalization of the background for variations in
illumination
• Extraction of cells by thresholding
• Separation of clustered cells by marker-controlled watershed
segmentation
• Finally, too small or too large objects are discarded.
• The output is an Excel file of cell sizes and locations
together with a binary image of the segmented cells.
Background Correction
• Often the illumination is not homogeneous, but is
brighter in the center.
• This can be corrected by fitting a two-dimensional
quadratic surface and subtracting the result.
• Denote the image intensity at (x_k, y_k) by z_k. Then the
quadratic model for the intensities is z = Hθ + w, or
\[
\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_N \end{bmatrix}
=
\begin{bmatrix}
x_1^2 & y_1^2 & x_1 y_1 & x_1 & y_1 & 1 \\
x_2^2 & y_2^2 & x_2 y_2 & x_2 & y_2 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
x_N^2 & y_N^2 & x_N y_N & x_N & y_N & 1
\end{bmatrix}
\begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_6 \end{bmatrix}
+ w.
\]
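• A minimal Matlab sketch of the background fit, using a synthetic image in place of the real microscope data:

% Quadratic background surface fit (synthetic image, illustrative only)
rows = 200; cols = 300;
[X, Y] = meshgrid(1:cols, 1:rows);           % pixel coordinates
I = 100 + 0.02*X + 0.3*Y - 0.0005*Y.^2 + 5*randn(rows, cols);  % uneven background + noise
x = X(:); y = Y(:); z = I(:);
H = [x.^2, y.^2, x.*y, x, y, ones(numel(z),1)];
c = H \ z;                                   % surface coefficients c1..c6
background = reshape(H*c, rows, cols);       % fitted quadratic surface
I_corrected = I - background;                % difference image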
Background Correction
Left: Blue channel with uneven illumination. Center: Fitted
quadratic surface. Right: Difference image.
z(x, y) = −0.000080x² − 0.000288y² + 0.000123xy
+ 0.022064x + 0.284020y + 106.538687
Extension: 2D Measurements
• In another project we were required to model
displacements on a 2D grid.⁴
• The measurement data consisted of 2D vector
displacement measurements.
• In other words, we know that the displacements at points
(x1 , y1 ), (x2 , y2 ), . . . , (xN , yN ) are
(∆x1 , ∆y1 ), (∆x2 , ∆y2 ), . . . , (∆xN , ∆yN ).
⁴ Manninen, T., Pekkanen, V., Rutanen, K., Ruusuvuori, P., Rönkkä, R. and Huttunen, H., "Alignment of individually adapted print patterns for ink jet printed electronics," Journal of Imaging Science and Technology, 54(5), Oct. 2010.
Extension: 2D Measurements
• This case was also modeled using a 2nd order polynomial
model:
\[
\begin{bmatrix}
\Delta x_1 & \Delta y_1 \\
\Delta x_2 & \Delta y_2 \\
\vdots & \vdots \\
\Delta x_N & \Delta y_N
\end{bmatrix}
=
\begin{bmatrix}
x_1^2 & y_1^2 & x_1 y_1 & x_1 & y_1 & 1 \\
x_2^2 & y_2^2 & x_2 y_2 & x_2 & y_2 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
x_N^2 & y_N^2 & x_N y_N & x_N & y_N & 1
\end{bmatrix}
\begin{bmatrix}
a_1 & b_1 \\
a_2 & b_2 \\
\vdots & \vdots \\
a_6 & b_6
\end{bmatrix}
+ w.
\]
• The familiar formula θ̂ = (H^T H)^{-1} H^T x applies in this
case as well.
• Note that it would have been equivalent to split this into
two linear models: one for ∆x_k and another for ∆y_k.
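• A small Matlab sketch of this (synthetic points and coefficients, purely illustrative) estimates both displacement components with one solve:

% 2D displacement model: both components estimated at once (illustrative only)
N = 50;
x = 100*rand(N,1); y = 100*rand(N,1);        % measurement locations
H = [x.^2, y.^2, x.*y, x, y, ones(N,1)];
AB_true = 0.01*randn(6,2);                   % columns: [a1..a6] and [b1..b6]
D = H*AB_true + 0.05*randn(N,2);             % measured [dx, dy] displacements
AB_est = H \ D                               % 6x2 matrix of estimated coefficients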
Results
• Below is an example of a resulting vector field.
Linear Models Summary
• If a linear model (x = Hθ + w) can be assumed, the MVUE
attaining the CRLB can be found in closed form:
θ̂ = (H^T H)^{-1} H^T x. In Matlab this is theta = H \ x;
in Excel, LINEST.
• We will continue the discussion of this topic in Chapter 8:
Least Squares (LS). It turns out that the linear LS estimator
has exactly the same formula.
• The difference is that LS assumes nothing about the noise
distribution, and thus offers no guarantees of optimality or
unbiasedness. Additionally, LS has numerous extensions
to be discussed later.