Slides 9 - Department of Electrical Engineering

Lecture 9: Sept. 17, 2012
Review LMS Algorithm
•  Least mean square (LMS) algorithm:
   ∇̂J(w(k)) = -e(k)x(k)
   w(k+1) = w(k) + µe(k)x(k)
•  The LMS algorithm is an iterative, noisy gradient descent algorithm that approximates steepest descent (SD) from one training example.
•  The LMS algorithm attempts to find the weight vector that minimizes the mean squared error cost function J(w).
•  Convergence of the SD and LMS algorithms depends on the step size µ and the eigenvalues of R.
•  The LMS algorithm is simple to implement.
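A minimal NumPy sketch of the LMS update above; the noisy linear model for d(k), the variable names, and the parameter values are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def lms(x, d, n, mu, w0=None):
    """LMS adaptive filter: one weight update per training example.

    x : (m, n) array of input vectors x(k)
    d : (m,) array of desired responses d(k)
    mu: step size
    Returns the weight trajectory and the per-step squared error.
    """
    m = x.shape[0]
    w = np.zeros(n) if w0 is None else w0.copy()
    W = np.zeros((m + 1, n))
    err = np.zeros(m)
    W[0] = w
    for k in range(m):
        e = d[k] - w @ x[k]          # error e(k) = d(k) - w(k)^T x(k)
        w = w + mu * e * x[k]        # w(k+1) = w(k) + mu e(k) x(k)
        W[k + 1] = w
        err[k] = e ** 2
    return W, err

# Illustrative use: identify an unknown weight vector from noisy data.
rng = np.random.default_rng(0)
n, m = 10, 1000
w_true = rng.standard_normal(n)
X = rng.standard_normal((m, n))               # iid input components
d = X @ w_true + 0.1 * rng.standard_normal(m)
W, err = lms(X, d, n, mu=0.01)
```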
Step size tradeoff
•  Larger step size µ: quicker convergence, but more excess mean squared error; convergence time = 2n/(µ tr(R)).
•  Smaller step size µ: slower convergence, but less excess mean squared error; Jexcess ≈ µ Jmin tr(R)/2.
•  Misadjustment (a dimensionless quantity) is proportional to the step size:
   M = Jexcess / Jmin = 0.5 µ tr(R)
   Example: M = 10%: excess MSE is 10% of Jmin and the convergence time is 20n.
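A small sketch that plugs numbers into the tradeoff formulas above, under the assumption tr(R) = n for unit-variance iid input components; the function name and default values are illustrative.

```python
def lms_tradeoff(mu, n, tr_R=None, J_min=1.0):
    """Evaluate the step-size tradeoff formulas from the slide.

    Assumes tr(R) = n for unit-variance iid input components unless given.
    """
    tr_R = n if tr_R is None else tr_R
    M = 0.5 * mu * tr_R                 # misadjustment M = Jexcess / Jmin
    J_excess = M * J_min                # excess mean squared error
    t_conv = 2 * n / (mu * tr_R)        # convergence time from the slide
    return M, J_excess, t_conv

for mu in (0.005, 0.01, 0.02, 0.1):     # step sizes used on the next slide
    print(mu, lms_tradeoff(mu, n=10))
```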
Convergence curves
[Figure: learning curves for the LMS algorithm, n = 10; MSE (0 to 14) versus iteration t (0 to 1000), one curve per step size.]
n = 10, input components iid, step sizes µ = .005, .01, .02, .1; M = 5µ
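A sketch that could reproduce learning curves of this kind, assuming a noisy linear model for the desired response and averaging the instantaneous squared error over independent runs; these modeling details are not specified on the slide.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n, T, runs_per_mu = 10, 1000, 50
w_true = rng.standard_normal(n)

for mu in (0.005, 0.01, 0.02, 0.1):
    mse = np.zeros(T)
    for _ in range(runs_per_mu):                 # average squared error over runs
        w = np.zeros(n)
        X = rng.standard_normal((T, n))          # iid unit-variance inputs, tr(R) = n
        d = X @ w_true + 0.1 * rng.standard_normal(T)
        for t in range(T):
            e = d[t] - w @ X[t]                  # instantaneous error e(t)
            mse[t] += e ** 2
            w += mu * e * X[t]                   # LMS weight update
    plt.plot(mse / runs_per_mu, label=f"mu = {mu}")

plt.xlabel("t")
plt.ylabel("MSE")
plt.legend()
plt.show()
```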
Other Iterative Algorithms
•  LMS algorithm with variable step size:
   w(k+1) = w(k) + µ(k)e(k)x(k)
   When the step size is µ(k) = µ/k, the algorithm converges almost surely to the optimal weights.
•  Conjugate gradient (CG) algorithm: gradients and line searches are used to form conjugate directions. The algorithm converges in n steps.
•  Newton's algorithm: let g(n) be the gradient and H(n) the Hessian at w(n), and approximate the cost function by
   J(w) ≈ J(w(n)) + (w-w(n))T g(n) + 0.5 (w-w(n))T H(n) (w-w(n))
   Take the gradient of the approximation and set it to zero to get
   w(n+1) = w(n) - H(n)-1 g(n)
   The algorithm involves inverting the Hessian matrix (costly).
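A minimal sketch of the Newton update on an assumed quadratic cost (where a single step reaches the minimizer); the cost, gradient, and Hessian here are illustrative, and a linear solve replaces the explicit inverse.

```python
import numpy as np

# Illustrative quadratic cost J(w) = 0.5 w^T A w - b^T w (assumed example).
A = np.array([[3.0, 1.0], [1.0, 2.0]])    # positive definite Hessian
b = np.array([1.0, -1.0])

def grad(w):
    return A @ w - b                      # gradient g(n)

def hess(w):
    return A                              # Hessian H(n); constant for a quadratic

w = np.zeros(2)
for _ in range(3):
    # Newton step w(n+1) = w(n) - H(n)^{-1} g(n); solve a system instead of inverting
    w = w - np.linalg.solve(hess(w), grad(w))

print(w, np.linalg.solve(A, b))           # matches the exact minimizer A^{-1} b
```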
Linear Filter Applications
•  Inverse Modeling: Channel Equalization
•  Adaptive Beamforming
–  Radar
–  Sonar
–  Speech enhancement
•  System Identification: Plant modeling
•  Prediction
•  Adaptive Interference Cancellation: Echo Cancellation
Single Neuron
[Diagram: single computational node with inputs x, weights w, bias w0, summer Σ producing s, and activation g(·) producing output y.]
s = wTx + w0; synaptic strength (linearly weighted sum of inputs).
y = g(s); activation or squashing function.
g(s) = sgn(s); linear threshold function.
g(s) = 1/(1 + exp(-βs)); sigmoid function.
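A tiny sketch of the node's computation with the two activation functions listed above; the input and weight values are made up for illustration.

```python
import numpy as np

def neuron(x, w, w0, activation="sigmoid", beta=1.0):
    """Single computational node: s = w^T x + w0, y = g(s)."""
    s = w @ x + w0                           # synaptic strength
    if activation == "threshold":
        return np.sign(s)                    # g(s) = sgn(s)
    return 1.0 / (1.0 + np.exp(-beta * s))   # sigmoid g(s)

# Illustrative use with made-up input and weights.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
print(neuron(x, w, w0=0.2), neuron(x, w, w0=0.2, activation="threshold"))
```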
Logistic Regression
•  Discriminative models: learn the posterior probability from training examples.
•  Let P(D = d | x(k) = x) = P(y(k) = d) = 1/(1 + exp(-βds)), where d ∈ {-1, 1}, s = wTx + w0, and β > 0.
•  Assume zero threshold and that the training examples {(x(1),d(1)),…,(x(m),d(m))} are drawn as an iid sample. The goal is to maximize the cost function
   Jp(w) = Π_{k=1}^{m} P(y(k) = d(k) | X(k) = x(k)),
   which is equivalent to maximizing log Jp(w), or minimizing
   J(w) = -1/m log Jp(w) = 1/m Σ_{k=1}^{m} log(1 + exp(-βd(k)wTx(k)))
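A small sketch that evaluates J(w) as written above; the synthetic ±1-labeled data are an assumption, and log1p is used only for numerical stability.

```python
import numpy as np

def logistic_cost(w, X, d, beta=1.0):
    """J(w) = 1/m sum_k log(1 + exp(-beta d(k) w^T x(k))), labels d in {-1, +1}."""
    margins = beta * d * (X @ w)
    return np.mean(np.log1p(np.exp(-margins)))   # log1p(z) = log(1 + z)

# Illustrative data (assumed): m examples in n dimensions with +/-1 labels.
rng = np.random.default_rng(2)
m, n = 200, 5
X = rng.standard_normal((m, n))
w_true = rng.standard_normal(n)
d = np.sign(X @ w_true + 0.1 * rng.standard_normal(m))
print(logistic_cost(np.zeros(n), X, d))          # equals log 2 at w = 0
```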
Cost function J(w)
•  J(w) is an entropy-based cost function:
   J(w) = -1/m log Jp(w) = 1/m Σ_{k=1}^{m} log(1 + exp(-βd(k)wTx(k)))
•  The gradient is given by
   ∇J(w) = -1/m Σ_{k=1}^{m} βd(k)x(k) / (1 + exp(βd(k)wTx(k)))
•  The cost function is convex; this can be verified by computing the Hessian matrix, which is positive definite.
•  The best weight is where the gradient is 0, but a closed-form solution cannot be found.
•  Use an iterative learning algorithm such as stochastic gradient descent (LMS) or Newton's algorithm.
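A sketch of batch gradient descent on this cost, using the gradient expression above; the step size, iteration count, and synthetic data are assumptions.

```python
import numpy as np

def logistic_grad(w, X, d, beta=1.0):
    """Batch gradient: -1/m sum_k beta d(k) x(k) / (1 + exp(beta d(k) w^T x(k)))."""
    m = X.shape[0]
    coeff = 1.0 / (1.0 + np.exp(beta * d * (X @ w)))
    return -(beta / m) * (X.T @ (d * coeff))

def gradient_descent(X, d, beta=1.0, mu=0.5, iters=500):
    """Repeatedly step opposite the gradient of J(w)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= mu * logistic_grad(w, X, d, beta)
    return w

# Illustrative usage on synthetic +/-1-labeled data (assumed).
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5))
d = np.sign(X @ rng.standard_normal(5))
w_hat = gradient_descent(X, d)
```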
LMS algorithm
•  Logistic regression LMS algorithm: estimate the negative gradient from one training example
   ∇̂J(w(k)) = -βd(k)x(k) / (1 + exp(βd(k)w(k)Tx(k)))
             = -βd(k)x(k) P(d(k) ≠ y(k))
   w(k+1) = w(k) + µβd(k)x(k) P(d(k) ≠ y(k))
•  Compare to the Perceptron learning algorithm update
   w(k+1) = w(k) + µd(j)x(j) I(d(j) ≠ y(j)),
   where I(·) is the indicator function.
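A sketch contrasting the two updates above on a stream of examples; the single pass over the data and the zero initialization are assumptions.

```python
import numpy as np

def logistic_lms(X, d, mu=0.1, beta=1.0):
    """Stochastic update: w += mu * beta * d(k) * x(k) * P(d(k) != y(k))."""
    w = np.zeros(X.shape[1])
    for xk, dk in zip(X, d):
        p_err = 1.0 / (1.0 + np.exp(beta * dk * (w @ xk)))  # P(d(k) != y(k))
        w += mu * beta * dk * xk * p_err
    return w

def perceptron(X, d, mu=0.1):
    """Perceptron update: change w only when the example is misclassified."""
    w = np.zeros(X.shape[1])
    for xk, dk in zip(X, d):
        if dk * np.sign(w @ xk) <= 0:        # indicator I(d != y)
            w += mu * dk * xk
    return w
```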
Nonlinear Methods
•  Multilayer feedforward networks: error back
propagation algorithm
•  Kernel methods:
–  Support Vector Machines (SVM)
–  Least squares methods
–  Radial Basis Functions (RBF)
Kernel Methods
In many classification and detection problems a linear classifier is not sufficient. However, working in higher dimensions can lead to the curse of dimensionality.
Solution: use kernel methods, where computations are done in the dual observation space.
[Diagram: samples (O, X) in the input space are mapped by Φ: X → Z into a feature space.]
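A rough illustration of the idea, assuming a polynomial kernel as the example: the feature-space inner product Φ(x)TΦ(z) is computed directly from input-space inner products, without forming Φ explicitly. The kernel choice and data are assumptions.

```python
import numpy as np

def poly_kernel(x, z, degree=2, c=1.0):
    """k(x, z) = (x^T z + c)^degree equals Phi(x)^T Phi(z) for a polynomial
    feature map Phi, without ever constructing Phi explicitly."""
    return (x @ z + c) ** degree

def gram_matrix(X, kernel=poly_kernel):
    """Gram (kernel) matrix K[i, j] = k(x(i), x(j)) over a data set."""
    m = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

# Illustrative use on a tiny assumed data set.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(gram_matrix(X))
```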
Linear SVM (dual representation)
The SVM quadratic programming problem involves computing inner products:
max W(α) = Σi α(i) - ½ Σi,j α(i)α(j)d(i)d(j)(x(i)Tx(j))
subject to α(i) ≥ 0 and Σi α(i)d(i) = 0.
The hyperplane decision function can be written as
f(x) = sgn(Σi α(i)d(i)xTx(i) + b)
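A minimal sketch of evaluating the dual decision function above, assuming the multipliers α(i) and the bias b have already been obtained from a QP solver (not shown); the support vectors and values are made up for illustration.

```python
import numpy as np

def svm_decision(x, X_sv, d_sv, alpha, b):
    """f(x) = sgn(sum_i alpha(i) d(i) x(i)^T x + b) over the support vectors."""
    return np.sign(np.sum(alpha * d_sv * (X_sv @ x)) + b)

# Illustrative values (assumed): two support vectors in 2-D.
X_sv = np.array([[1.0, 1.0], [-1.0, -1.0]])
d_sv = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
b = 0.0
print(svm_decision(np.array([0.5, 2.0]), X_sv, d_sv, alpha, b))
```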