Widrow-Hoff Learning
Carlo U. Nicola, SGI FH Aargau
With extracts from publications of: D. Demuth, University of Colorado; D.J.C. MacKay, Cambridge University

Adaline (Adaptive linear neuron)

Transfer function for Adaline

Adaline with only two inputs
The line divides the p1, p2 plane in two: a sort of classifier, like the perceptron. The two-input Adaline divides the plane along the line defined by:
w1 p1 + w2 p2 + b = 0

A digression on the Mean Square Error function (MSE)
If we want to find the optimal set of weights w that minimises the error between target and input vectors, it is convenient to define an error landscape function E(w) with the MSE and then try a gradient descent, hoping to reach a minimum. Assume that the training set consists of pairs containing an input pq and a target tq: (p1,t1), (p2,t2), ..., (pQ,tQ). The MSE function is then:
E(w) = (1/Q) Σq (tq − aq)²,   where aq is the Adaline output for input pq.
With E(w) we define a performance surface, i.e. the total error surface plotted in the space of the system coefficients (in our case the weights w). With the gradient vector (the tangent to the E(w) landscape) we then try to reach a global minimum for the weight vector:
w(k+1) = w(k) − α ∇E(w)|w=w(k)

Gradient or steepest descent
What does gradient descent (or steepest descent) mean? The MSE function is the total error landscape; the gradient is the tangent direction (a vector!), and following −grad crawls towards a minimum of the MSE (we hope so!). In the Adaline case the correction procedure performs a (stochastic) gradient descent on the n-dimensional mean square error surface, found by calculating the gradient of the error with respect to the weights. Notice the difference with the perceptron convergence procedure!

The meaning of approximate and stochastic in the LMS algorithm
The Least Mean Squares, or LMS, algorithm is a stochastic gradient algorithm that iterates each tap weight of the filter in the direction of the gradient of the squared amplitude of an error signal with respect to that tap weight. LMS is called a stochastic gradient algorithm because the gradient vector is chosen 'at random' and not, as in the steepest descent case, derived precisely from the shape of the total error surface. Random here means the instantaneous value of the gradient, which is then used as an estimator of the true quantity. In practice this simply means that LMS uses at each iteration step the actual value of the error (not the full MSE function!) to build the gradient.

How Adaline works
The main idea behind Adaline is simple: change the weights w of the input signals pq so that the LMS (least mean square) error between the teaching vector tq and the output signal a reaches a minimum. The training set consists of pairs containing an input pq and a target tq: (p1,t1), (p2,t2), ..., (pQ,tQ). With the new notation
x = [w; b],   z = [p; 1]
we tidy up the output of Adaline, a = w^T p + b = x^T z, and from that we derive an equation for the mean square error:
E(x) = E[e²] = E[(t − a)²] = E[(t − x^T z)²]

Adaline error analysis
Goal: under which conditions can we find a global minimum of the mean square error? Expanding the expectation, the mean square error for the Adaline network is a quadratic function:
E(x) = c − 2 x^T h + x^T R x,   with c = E[t²], h = E[t z] and R = E[z z^T]

Conditions for a global minimum
Goal: under which conditions can we find a global minimum x* of the weights from the mean square error? The correlation matrix R should be at least positive semidefinite (all eigenvalues ≥ 0). Setting the gradient to zero gives the stationary point:
∇E(x) = 2 R x − 2 h = 0  ⇒  x* = R⁻¹ h
The last equation holds if the matrix R is positive definite (the inverse of R then exists!).
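To make the last few slides concrete, here is a minimal numerical sketch (Python/NumPy, with a tiny made-up training set that is not from the slides). It builds the augmented vectors z = [p; 1], estimates R = E[z z^T] and h = E[t z], checks that R is positive definite, and compares the gradient of the full quadratic MSE with the single-sample 'stochastic' estimate that LMS uses.

```python
import numpy as np

# A small made-up training set: 2-dimensional inputs p and targets t.
P = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
T = np.array([1.0, 1.0, -1.0, -1.0])

# Augmented notation of the slides: z = [p; 1], x = [w; b],
# so the Adaline output is a = x^T z.
Z = np.hstack([P, np.ones((P.shape[0], 1))])

# Correlation matrix R = E[z z^T] and cross-correlation vector h = E[t z].
R = Z.T @ Z / len(T)
h = Z.T @ T / len(T)
c = np.mean(T**2)

# Condition for a unique global minimum: R positive definite.
print("eigenvalues of R:", np.linalg.eigvalsh(R))   # all > 0 -> R invertible

# Quadratic performance surface E(x) = c - 2 x^T h + x^T R x.
def E(x):
    return c - 2 * x @ h + x @ R @ x

def true_grad(x):
    return 2 * (R @ x - h)              # gradient of the full MSE

def lms_grad(x, k):
    e = T[k] - x @ Z[k]                 # instantaneous error e(k)
    return -2 * e * Z[k]                # single-sample (stochastic) estimate

x = np.zeros(3)
print("true gradient    :", true_grad(x))
print("LMS estimate, k=0:", lms_grad(x, 0))

# Global minimum of the quadratic surface: x* = R^{-1} h.
x_star = np.linalg.solve(R, h)
print("x* =", x_star, " E(x*) =", E(x_star))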
Approximate steepest descent (1)
For one sample k we get the following approximate mean square error:
Ê(x) = e²(k) = (t(k) − a(k))²
Applying the (stochastic) gradient to it means replacing ∇E(x) by ∇e²(k).

Approximate steepest descent (2)
The gradient is given by the product of the actual error with the input vector:
∇e²(k) = −2 e(k) z(k)

LMS algorithm
Finally, from the last equation of the preceding slides we can derive the learning rule for Adaline. This rule is another instance of the well known LMS (Least Mean Square) algorithm. From these equations we get (with some simple algebra) the learning rule for both the weights and the biases of Adaline:
w(k+1) = w(k) + 2α e(k) p(k)
b(k+1) = b(k) + 2α e(k)
α is a stability parameter we will estimate shortly.

Summary of Adaline LMS learning
Step 1: Initialize weights and thresholds: set the wij and the bj to small random values in the range [−1, +1].
Step 2 (gradient descent!): For each pattern pair (pk, tk) do:
(i) Input pk and calculate the actual output: aj = Σi wij pi + bj
(ii) Compute the error: e²(k) = (tk − ak)²
(iii) Adjust the weights: Δwij = 2α ej pi, where ej = tj − aj
Step 3: Repeat step 2 until the error is sufficiently low or zero.

LMS stability and the calculation of α
Without going into any details, we summarise the conditions under which the LMS algorithm finds a stable minimum. The condition says that the eigenvalues of [I − 2αR] must fall inside the unit circle. Since the eigenvalues λi of R satisfy λi > 0, this means:
0 < α < 1/λmax

Stability as a function of α
(Figure: behaviour for α = 0.1.)

Example: apple and banana sorter

Iteration 1
Banana:

Iteration 2
Apple:

Iteration 3

LMS in practice
The LMS algorithm is one of the most widely used learning procedures of all. Adaptive signal processing is largely based upon the LMS algorithm, with applications such as system modelling, statistical prediction, noise cancelling, adaptive echo cancelling, and channel equalization. As an example we will look at an adaptive filter for noise cancellation (an adaptive filter with a tapped delay line).

Noise cancellation example 1
Noise cancellation in the pilot's radio headset in a jet aircraft is a nice application of an adaptive filter. A jet engine can produce noise levels of over 140 dB. Since normal human speech occurs between 30 and 40 dB, it is clear that traditional filtering techniques will not work. To implement an adaptive technique, we place an additional microphone at the rear of the aircraft to record the engine noise directly. By taking advantage of the additional information this reference microphone gives us, we can substantially improve the signal for the pilot.
One might naively think of directly subtracting the reference noise signal from the primary signal to implement such noise cancellation. However, this will not work, because the noise at the reference microphone is not exactly the same as the noise in the jet cabin. There is a delay corresponding to the distance between the primary and reference microphones. Unknown acoustic effects, such as echoes or low-pass filtering, can also affect the noise as it travels through the fuselage of the aircraft. Even in the ideal case, the delay alone guarantees that simple subtraction will not properly cancel the noise. If we model the path from the noise source to the primary microphone as a linear system, we can devise an adaptive algorithm that trains an FIR filter to match the acoustic characteristics of the channel (a minimal sketch of such a filter follows).
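Below is a hedged sketch of that idea (Python/NumPy; all signals are synthetic stand-ins invented for this illustration, not aircraft data). A tapped delay line feeds the reference noise into an Adaline whose weights are updated with the LMS rule w(k+1) = w(k) + 2α e(k) p(k); the error e(k) is then the cleaned-up signal handed to the pilot.

```python
import numpy as np

# Synthetic illustration (all signals invented for this sketch):
# s(k) - pilot's voice, the signal we want to recover
# v(k) - engine noise picked up by the reference microphone
# m(k) - primary microphone: voice plus a delayed, attenuated copy of the noise
rng = np.random.default_rng(0)
K = 2000
k = np.arange(K)
s = 0.3 * np.sin(2 * np.pi * k / 40)          # "voice"
v = rng.standard_normal(K)                     # reference noise
noise_at_primary = 0.8 * np.roll(v, 1)         # assumed acoustic path: delay + gain
m = s + noise_at_primary                       # contaminated primary signal

n_taps = 2        # tapped delay line: v(k) and v(k-1)
alpha = 0.01      # must satisfy 0 < alpha < 1/lambda_max of R
w = np.zeros(n_taps)
restored = np.zeros(K)

for i in range(n_taps, K):
    p = v[i - n_taps + 1:i + 1][::-1]   # input vector [v(k), v(k-1)]
    a = w @ p                            # filter's estimate of the noise
    e = m[i] - a                         # error = cleaned signal estimate
    w = w + 2 * alpha * e * p            # LMS (Widrow-Hoff) update
    restored[i] = e

print("learned weights:", w)             # should approach roughly [0, 0.8]
print("residual noise power:", np.mean((restored[100:] - s[100:])**2))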
If we then apply this filter to the noise recorded at the reference microphone, we should be able to subtract out the noise recorded at the primary microphone, leaving us with an improved recording of the pilot's voice.

Noise cancellation example 2

Noise cancellation Adaline for example 2
We now apply our machinery to the two-input (one delay) Adaline:
z(k) = [v(k); v(k−1)],   a(k) = x^T z(k)

Calculation of R
R = E[z(k) z(k)^T], computed from the statistics of the noise source v(k).

Calculation of h
h = E[t(k) z(k)]. The cross-terms with the voice signal vanish (= 0), because v(k), v(k−1) and s(k) are not correlated.

Calculate the corrected weights x*
x* = R⁻¹ h. We can check how good our correction is by using the previous formula for the error: the cross-terms are again 0 (no correlation), so what remains in the error is essentially the voice signal we want (a numerical sketch follows below).

Phone echo cancellation
A last useful application of Adaline: echo cancellation in a phone conversation! Today every DSP (Digital Signal Processor) can be programmed as an Adaline adaptive filter.
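Returning to noise cancellation example 2: since the concrete numbers live in the slides' figures, here is a hedged numerical sketch under assumed signals (a sinusoidal noise source v(k) and an uncorrelated voice s(k), both invented here). It estimates R and h from the data, solves x* = R⁻¹ h, and confirms that the residual mean square error is essentially the voice power.

```python
import numpy as np

# Hedged sketch of the two-input (one delay) noise-cancelling Adaline.
# The concrete signals of example 2 are in the slides' figures; here we simply
# assume a sinusoidal noise source and a voice signal uncorrelated with it.
rng = np.random.default_rng(1)
K = 20000
k = np.arange(K)
v = 1.2 * np.sin(2 * np.pi * k / 3)        # assumed noise source
s = rng.uniform(-0.2, 0.2, K)              # assumed voice, uncorrelated with v
noise_path = 0.5 * np.roll(v, 1)           # assumed acoustic path (delay + gain)
t = s + noise_path                          # target = primary microphone signal

# Adaline inputs: z(k) = [v(k), v(k-1)] (drop the first sample, where roll wraps).
Z = np.column_stack([v, np.roll(v, 1)])[1:]
T = t[1:]

R = Z.T @ Z / len(T)                        # R = E[z z^T]
h = Z.T @ T / len(T)                        # h = E[t z]; the s-terms average to ~0
x_star = np.linalg.solve(R, h)              # optimal weights x* = R^{-1} h

print("R =\n", R)
print("h =", h)
print("x* =", x_star)

# Residual mean square error at x*: essentially the voice power,
# since v and s are uncorrelated and the filter matches the noise path.
e = T - Z @ x_star
print("residual MSE:", np.mean(e**2), " voice power:", np.mean(s**2))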