02/06

Artificial Neural Networks #1
Machine Learning CH4: 4.1 – 4.5
Promethea Pythaitha
Artificial Neural Networks

- Robust approach to approximating target functions over attributes with continuous domains as well as discrete.
- Can approximate unknown target functions.
- Target function can be:
  - Discrete-valued.
  - Real-valued.
  - Vector-valued.
Neural Net Applications

- Robust to noise in training data.
- Among the most successful methods at interpreting noisy real-world sensor data:
  - Microphone input / speech recognition.
  - Camera input.
  - Handwriting recognition.
  - Face recognition and image processing.
  - Robotic control.
  - Fuzzy neural nets.
Inspiration for Artificial Neural Nets

- Natural learning systems (i.e. brains) are composed of very complex webs of interconnected neurons.
- Each neuron receives signals (current spikes).
- When a neuron's threshold is reached, it sends its signal downstream to:
  - Other neurons
  - Physical actuators
  - Perception in the neocortex
  - Etc.

- Artificial Neural Nets are built out of densely interconnected sets of units.
- Each unit (artificial neuron) takes many real-valued inputs and produces a real-valued output.
- Output is then sent downstream to:
  - Other units within the net
  - The output layer of the net.

- The brain is estimated to contain ~100 billion neurons.
- Each neuron connects to an average of 10,000 others.
- Neuron switching speeds are on the order of a thousandth of a second (versus a ten-billionth of a second for logic gates).
- Yet brains can make decisions, recognize images, etc., VERY fast.

Hypothesis:
- Thought / information processing in the brain is the result of massively parallelized processing of distributed inputs.

Neural Nets are built on this idea:
- In parallel, process distributed data.
Artificial vs. Natural

- Many complexities of natural neural nets are not present in artificial ones:
  - Feedback (uncommon in ANNs)
  - Etc.
- Many features of artificial neural nets are not compatible with natural ones:
  - Units in artificial neural nets produce one constant output rather than a time-varying sequence of current pulses.
Neural Network Representations

ALVINN
- Learned to drive an autonomous vehicle on the highway.
- Input: 30 x 32 pixel matrix from a camera
  - 960 values (B/W pixel intensities)
- Output: steering direction for the vehicle (30 real values).
- Two-layer neural net:
  - Input (not counted)
  - Hidden layer
  - Output.
ALVINN
ALVINN explained


Typical Neural Net Structure

- All 960 inputs in the 30x32 matrix are sent to the four hidden neurons/units, where weighted linear sums are computed.
- Hidden unit: a unit whose output is only accessible within the net, not at the output layer.
- Outputs from the 4 hidden neurons are sent downstream to 30 output neurons, each of which outputs a confidence value corresponding to steering in a specific direction.
  - Fuzzy truth?
  - Probability measure?
- The program chooses the direction with the highest confidence.
Typical Neural Net Structure

- Usually, layers are connected in a directed, acyclic graph.
- Can, in general, have any structure:
  - Cyclic/acyclic
  - Directed/undirected
  - Feedforward/feedback
- The most common & practical nets are trained using Backpropagation:
  - Learning = selecting a weight value for each connection.
  - Assumes the net is directed.
  - Cycles are restricted; usually there are none in practice.
Appropriate Problems for A.N.N.s

- Great for problems with noisy data and complex sensor inputs:
  - Camera/microphone, etc.
  - vs. shaft encoder/light-sensor, etc.
- Symbolic problems:
  - As good as Decision Tree Learning!!
- BACKPROPAGATION: the most widely used technique.
Appropriate Problems for A.N.N.s

Neural net learning is suitable for problems with the following attributes:
- Instances represented by many attribute-value pairs.
  - Target function depends on a vector of predefined attributes.
  - Real-valued inputs.
  - Attributes can be correlated or independent.
- Target function can be discrete-valued, real-valued, or a vector!
  - ALVINN's target function was a 30-real-element vector.
Appropriate Problems for A.N.N.s

- Training data can contain errors/noise.
  - Neural nets are very robust to noisy data.
  - Thankfully, so are natural ones ;-)
- Long training times are acceptable.
  - Neural nets usually take longer to train than other machine-learning algorithms.
  - A few minutes to several hours.
- Fast evaluation of the learned function may be required.
  - Neural nets do compute the learned function very fast.
  - ALVINN re-computes its bearing several times per second.
Appropriate Problems for A.N.N.s

- Not important if humans can understand the learned function!!
- ALVINN:
  - 960 inputs
  - 4 hidden nodes
  - 30 outputs
- Gets somewhat messy looking to humans!!
  - Thanks to the massive parallelism and distributed data.
Perceptrons

- Basic building block of neural nets.
- Takes several real-valued inputs.
- Computes a weighted sum.
- Checks the sum against a threshold (-w0):
  - If > threshold, output +1.
  - Else, output -1.

o(x1, …, xn) = +1 if w0 + w1*x1 + … + wn*xn > 0
               -1 otherwise.

- For simplicity, let x0 = 1; then
  o(x1, …, xn) = sgn(w.x)
  - Vectors denoted in bold!!
  - The . is the vector dot product!!

- Hypothesis space:
  - All possible combinations of real-valued weights.
  - All w in R^(n+1).
- A perceptron can represent any linearly separable concept.
  - The learned hypothesis is a hyperplane in R^n.
  - The equation of the hyperplane is w.x = 0.
  - Examples: AND, OR, NAND, NOR - vs. XOR, etc.
- Any boolean function can be represented by a 2-layer perceptron network!!
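A single perceptron unit can be sketched in a few lines of Python (the function and weight names are mine, not from the chapter); the weights below are one choice that realizes AND over inputs in {-1, +1}:

```python
# Sketch of a single perceptron unit; x[0] = 1 serves as the bias input.
def perceptron(weights, x):
    """Output +1 if w . x > 0, else -1."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else -1

# AND over inputs in {-1, +1}: fires only when both inputs are +1.
w_and = [-1.5, 1.0, 1.0]               # w0 (bias), w1, w2
print(perceptron(w_and, [1, 1, 1]))    # +1
print(perceptron(w_and, [1, 1, -1]))   # -1
```

Swapping the bias for, say, +1.5 turns the same unit into OR, which is the sense in which the hypothesis space is just the set of weight vectors.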
Training a Single Perceptron

- Perceptron training rule.
- Delta training rule / gradient descent.
- These converge to different hypotheses, under different conditions!!
Perceptron Training Rule

- Start with random weights.
- Go through the training examples.
- When an error is made, update the weights:
  - For each wi: wi ← wi + Δwi, where Δwi = η(t - o)xi
- Terminology:
  - η is the learning rate
    - typically small
    - sometimes decreased as we proceed.
  - t is the value of the target function.
  - o is the value output by the perceptron.
- When the set of training examples is classified perfectly, STOP.
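The rule above can be sketched as follows (a minimal illustration; the function names, zero initialization, and epoch cap are my choices rather than the chapter's):

```python
# Minimal sketch of the perceptron training rule.
def train_perceptron(examples, n_inputs, eta=0.1, max_epochs=100):
    """examples: list of (x, t) with x[0] = 1 and t in {-1, +1}."""
    w = [0.0] * (n_inputs + 1)        # zero start, for reproducibility
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if o != t:                # update only when an error is made
                errors += 1
                for i in range(len(w)):
                    w[i] += eta * (t - o) * x[i]   # delta_wi = eta*(t - o)*xi
        if errors == 0:               # all examples classified: STOP
            return w
    return w

# OR over {-1, +1} inputs: linearly separable, so training converges.
data = [([1, -1, -1], -1), ([1, -1, 1], 1), ([1, 1, -1], 1), ([1, 1, 1], 1)]
w = train_perceptron(data, n_inputs=2)
```

On a non-separable target such as XOR the `errors == 0` test never fires, which is exactly the "infinite loop of indecision" noted below; the epoch cap is only there to keep the sketch terminating.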
Pros and Cons

- Can be proven to converge to a w that correctly classifies all training examples in finite time, if η is small enough.
- Convergence is guaranteed only if the concept is linearly separable.
  - If not, no convergence → no stopping!!
  - Can be an infinite loop of indecision!!
Delta Rule & Gradient Descent

- Addresses the problem of non-convergence for nonlinear concepts.
- Will give a linear approximation to nonlinear concepts that minimizes error.
- …how do we measure error?
- Consider a perceptron with the "thresholding" function removed.
  - Then o = w.x
- Can define error as the sum of squared deviations:
  - E = ½ Σ (td – od)², sum over all training examples d.
  - td is the target function value for example d.
  - od is the computed value of the weighted sum for d.
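The error measure above, as a small sketch (the function name and example data are my own illustrations):

```python
# E = 1/2 * sum over examples d of (t_d - o_d)^2, with o = w . x.
def squared_error(w, examples):
    total = 0.0
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))   # unthresholded output
        total += (t - o) ** 2
    return 0.5 * total

# Toy data generated by t = 1 + 2*x1 (x[0] = 1 is the bias input):
data = [([1, 0.0], 1.0), ([1, 1.0], 3.0)]
print(squared_error([1.0, 2.0], data))   # 0.0 -- the perfect weights
print(squared_error([0.0, 0.0], data))   # 5.0
```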

- With this choice of error, it can be proven that minimizing E will lead to the most probable hypothesis that fits the training data, under the assumption that the noise is normally distributed with mean 0.
- Note: the "most probable" hypothesis and the "correct" hypothesis can still be different.

- Graph E versus the weights:
  - E will always be parabolic (by definition).
  - So it has only one global minimum.
- Goal: descend to the global minimum ASAP!
- How?
  - Gradient definition.
  - Meaning of the gradient.
  - Gradient descent.

- The gradient of E tells us the direction of steepest ascent.
- So –Gradient(E) tells us the direction of steepest descent.
- The learning rule becomes: go in that direction with step size η.

Derivation of the simple update rule.

- Finally, the Delta Rule weight update is:
  Δwi = η * Σ(td – od)xid, over all training examples d.
Gradient Descent Pseudocode

- Pick an initial random weight vector w.
- Until the termination condition is met, do:
  - Set Δwi = 0 for all i.
  - For each <x, t> in D, do:
    - Run the net on x: compute o(x).
    - For each wi, do:
      - Δwi = Δwi + η(t – o)xi
  - For each wi, do:
    - wi = wi + Δwi
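The pseudocode above, sketched in Python for an unthresholded linear unit (all names, the zero initialization, and the toy dataset are illustrative choices of mine):

```python
# Batch gradient descent for a linear unit (o = w . x).
def gradient_descent(examples, n_inputs, eta=0.05, epochs=200):
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        delta = [0.0] * len(w)                 # accumulate over all examples
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(len(w)):
                delta[i] += eta * (t - o) * x[i]
        for i in range(len(w)):                # one batched update per pass
            w[i] += delta[i]
    return w

# Fit the linear target t = 1 + 2*x1 (x[0] = 1 is the bias input).
data = [([1, 0.0], 1.0), ([1, 1.0], 3.0), ([1, 2.0], 5.0)]
w = gradient_descent(data, n_inputs=1)   # w approaches [1.0, 2.0]
```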
Results of Gradient Descent

- Because of the shape of the error surface, with only a single (global) minimum, the algorithm will converge to a w with minimum squared deviation/error, as long as η is small enough.
  - This holds regardless of linear separability.
- If η is too large, the algorithm may skip over the global minimum instead of settling in.
  - A common remedy is to decrease η with time.

- Gradient descent can be used to search large or infinite hypothesis spaces when:
  - The hypothesis space is continuously parameterized.
  - The error can be differentiated with respect to the hypothesis parameters.
- However:
  - Convergence can be slooooooow.
  - If there are lots of local minima, there's no guarantee it will find the global one.
Stochastic Gradient Descent

- Stochastic gradient descent, a.k.a. incremental gradient descent.
- Instead of updating the weights massively after going through all the training examples, we update them after each example.
- This really descends the gradient of a single-example error function (one example per step):
  - Ed = ½ (td – od)²
- If η is small enough, this is as optimal as true gradient descent.
Stochastic Gradient Descent Pseudocode

- Pick an initial random weight vector w.
- Until the termination condition is met, do:
  - For each <x, t> in D, do:
    - Run the net on x: compute o(x).
    - For each wi, do:
      - wi = wi + η(t – o)xi
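The stochastic variant, sketched with the same illustrative names and toy data; the only structural change from the batch version is that the weights move inside the example loop:

```python
# Stochastic (incremental) gradient descent for a linear unit.
def stochastic_gd(examples, n_inputs, eta=0.05, epochs=200):
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))   # o = w . x
            for i in range(len(w)):
                w[i] += eta * (t - o) * x[i]           # per-example update
    return w

# Same linear target t = 1 + 2*x1 as before.
data = [([1, 0.0], 1.0), ([1, 1.0], 3.0), ([1, 2.0], 5.0)]
w = stochastic_gd(data, n_inputs=1)   # w approaches [1.0, 2.0]
```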
Results

- Compared to stochastic gradient descent, standard gradient descent takes more computation per step, but generally uses a larger step size.
- When E has multiple local minima, stochastic gradient descent can sometimes avoid them.
Perceptrons with Discrete Output

- For delta-learning/gradient descent, we discussed the unthresholded perceptron.
- It can be simply adapted to thresholded perceptrons:
  - Just use the thresholded t values as the t values for the perceptron delta-learning algorithm (with the unthresholded o values).
- Unfortunately, this does not necessarily reduce the percentage of errors the thresholded output makes on the training data, just the squared error in the unthresholded output.
Multilayer Networks and the Backpropagation Algorithm

- In order to learn non-linear decision surfaces, a system more complex than perceptrons is needed.
Choice of Base Unit

- Unthresholded perceptron: multiple layers of linear units are still linear.
- Thresholded perceptron: has a non-differentiable thresholding function.
  - Cannot compute the gradient of E.
- Need something different:
  - Must be non-linear
  - And continuously differentiable.
Sigmoid Unit

- In place of the perceptron step function, use the sigmoid function as the thresholding function.
- Sigmoid: σ(y) = 1 / (1 + e^(-y))
- The sigmoid unit computes the weighted linear sum w.x, then applies the sigmoid "squashing function".
- Steepness of the incline increases with the coefficient of –y.
- Continuously differentiable:
  - Derivative: dσ(y)/dy = σ(y)(1 - σ(y))
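A minimal sketch of the sigmoid and its derivative (function names are mine):

```python
import math

def sigmoid(y):
    """The squashing function: sigma(y) = 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_prime(y):
    """d(sigma)/dy = sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)

# sigma(0) = 0.5, and the slope there is 0.25 (its maximum).
print(sigmoid(0.0), sigmoid_prime(0.0))
```

The fact that the derivative is expressible from the output alone is what makes the error terms below cheap to compute: each unit already knows its own σ(y).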
Backpropagation Algorithm

- Learns the weights that minimize squared error, given a fixed number of units/neurons and interconnections.
- Employs gradient descent, similar to the delta rule.
- Error is measured by summing over all training examples d and output units k:
  - E = ½ Σd Σk (tkd – okd)²


Error Surface can have multiple local
minima.
 No
guarantee algorithm will find the Global
Minimum.

However, in practice, backpropagation
performs very well.
 Recall
Stochastic Gradient descent
vs.
 Gradient
descent.
Stochastic Backpropagation Algorithm

- Consider a feedforward network with two layers of sigmoid units, fully connected in one direction.
- Notation:
  - Each unit is assigned an index (i = 0, 1, 2, …).
  - xji denotes the input from unit i into unit j.
  - wji denotes the weight on the connection from i to j.
  - δn is the error term associated with unit n.
Backpropagation algorithm.
Backpropagation Explained

- Make the network; randomly initialize the weights.
- For each training example d: apply the network, calculate the squared error Ed, compute the gradient, and take a step of size η in the direction of steepest decline.
- Weight update rule:
  - Recall the delta rule: Δwi = Δwi + η(t – o)xi
  - Here we have Δwji = ηδjxji
  - The error term δj is more complex here.
 Recall
Error Term for Unit j

- Intuitively, if j is an output node k, then its error term is the standard (tk – ok) multiplied by ok(1 - ok), the derivative of the sigmoid function:
  - δk = ok(1 - ok)(tk – ok)
  - The derivative of the sigmoid appears because we're using Gradient(E).
- If j is a hidden node h, we have no th to compare it with.
  - Must sum the error terms δk of the output nodes k influenced by h, weighted by how much they were influenced by h (wkh):
  - δh = oh(1 - oh) Σ wkh δk
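The update rule and error terms above can be sketched as a small stochastic-backpropagation trainer for one hidden layer of sigmoid units and a single output (all names, the initialization range, and the hyperparameters are my choices; since the error surface has local minima, a given random seed is not guaranteed to reach zero error):

```python
import math, random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def train_backprop(examples, n_in, n_hidden, eta=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    # w_h[j][i]: weight from input i to hidden j; w_o[j]: hidden j to output.
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
           for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, t in examples:                    # x includes the bias x[0] = 1
            h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_h]
            hb = [1.0] + h                       # bias input for the output unit
            o = sigmoid(sum(w * hi for w, hi in zip(w_o, hb)))
            delta_o = o * (1 - o) * (t - o)      # output error term
            # hidden error terms: delta_h = o_h(1 - o_h) * w_oh * delta_o
            delta_h = [h[j] * (1 - h[j]) * w_o[j + 1] * delta_o
                       for j in range(n_hidden)]
            for j in range(n_hidden + 1):        # delta_w = eta * delta * x
                w_o[j] += eta * delta_o * hb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_h[j][i] += eta * delta_h[j] * x[i]
    def predict(x):
        hb = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_h]
        return sigmoid(sum(w * hi for w, hi in zip(w_o, hb)))
    return predict

# XOR is not linearly separable, so a single perceptron cannot represent it,
# but this two-layer net can learn it.
xor = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 0)]
predict = train_backprop(xor, n_in=2, n_hidden=3)
```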
Derivation of error term for unit j.
Termination Condition

- Stop after a fixed number of iterations.
- Stop when E on the training data drops below a given level.
- Stop when E on test data drops below a certain level.

- Too few iterations → too much error remains.
- Too many → overfitting the data.
Variant of Backpropagation with Momentum

- Adding momentum = making the weight update at step n depend partially on the update at step n-1:
  - Δwji(n) = ηδjxji + αΔwji(n - 1)
  - α is a small number between 0 and 1.
- Analogous to a ball rolling down a bumpy hill.
  - Momentum (α) tries to keep the ball moving in the same direction as before.
  - Can keep the 'ball' moving through local minima and plateaus.
  - If Gradient(E) does not change, increases the effective step size.
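The momentum update, sketched for a single weight (the names and constants are mine); it also shows the "increases effective step size" point numerically:

```python
# delta_w(n) = eta * grad_term + alpha * delta_w(n - 1)
def momentum_step(grad_term, prev_delta, eta=0.1, alpha=0.9):
    return eta * grad_term + alpha * prev_delta

# On a flat stretch where the gradient term stays constant, the step size
# grows geometrically toward eta * grad / (1 - alpha) = 0.1 / 0.1 = 1.0,
# a tenfold amplification over the plain eta * grad step.
delta = 0.0
for _ in range(50):
    delta = momentum_step(1.0, delta)
print(delta)   # close to 1.0
```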
Pros and Cons of Momentum

- Can provide quicker convergence.
- However, theoretically, it can also drop right through the global minimum and keep going.
Generalization to an n-Layer Network

- We simply evaluate the δk for the output nodes as before.
- Then backpropagate errors through the network, layer by layer:
  - For a node r at layer m:
  - δr = or(1 - or) Σ wsr δs, over all nodes s in layer m+1.
  - Layer m+1 is downstream from m.
  - r feeds input into the s nodes.

Generalization to an Arbitrary Acyclic Network

- We simply evaluate the δk for the output nodes as before.
- Then backpropagate errors through the network:
  - For a node r:
  - δr = or(1 - or) Σ wsr δs, over all nodes s in Downstream(r).
  - Downstream(r) = {s | s receives input from r}.
Summary

- Artificial neural networks are:
  - A practical way to learn discrete, real, and vector-valued functions.
  - Robust to noisy data.
  - Usually trained via Backpropagation.
  - Used for many real-world tasks:
    - Robot control
    - Computer creativity
      http://www.venturacountystar.com/news/2007/jul/09/computers-compose-original-melodies/

- Feedforward networks with 3 layers can approximate any function to any desired accuracy, given sufficient units/artificial neurons and connections.
  - Good accuracy is achieved even with small nets.
- Backpropagation is able to find intermediate features within the net that are not explicitly defined as attributes of the input or output.