Artificial Neural Networks #1
Machine Learning CH4: 4.1 – 4.5
Promethea Pythaitha
Artificial Neural Networks
Robust approach to approximating target functions over attributes with continuous as well as discrete domains.
Can approximate unknown target functions.
Target function can be:
Discrete-valued
Real-valued
Vector-valued
Neural Net Applications
Robust to noise in training data.
Among the most successful methods at interpreting noisy real-world sensor data:
Microphone input / speech recognition
Camera input:
Handwriting recognition
Face recognition and image processing
Robotic control
Fuzzy neural nets
Inspiration for Artificial Neural nets.
Natural learning systems (i.e. brains) are
composed of very complex webs of
interconnected neurons.
Each neuron receives signals (current spikes).
When the neuron's threshold is reached, it sends its signal downstream to…
Other neurons
Physical actuators
Perception in the Neocortex
Etc….
Artificial Neural nets are built out of
densely interconnected sets of units.
Each
unit (artificial neuron) takes many realvalued inputs, and produces a real-valued
output.
Output is then sent downstream to
Other units within net
Output layer of net.
Brain estimated to contain ~ 100 billion
neurons.
Each neuron connects to an average of
10,000 others.
Neuron switching speeds are on the order
of a thousandth of a second
(versus a ten-billionth of a second for logic
gates)
Yet brains can make decisions, recognize images, etc., VERY fast.
Hypothesis:
Thought/information processing in the brain is the result of massively parallelized processing of distributed inputs.
Neural Nets are built on this idea:
In parallel, process distributed data.
Artificial vs. Natural.
Many complexities of natural neural nets are not present in artificial ones:
Feedback (uncommon in ANNs)
Etc.
Many features of artificial neural nets are not consistent with natural ones:
Units in artificial neural nets produce one constant output rather than a time-varying sequence of current pulses.
Neural Network representations.
ALVINN learned to drive an autonomous vehicle on the highway.
Input: 30 x 32 pixel matrix from camera
960 values (B/W pixel intensities)
Output: steering direction for vehicle
(30 real values)
Two layer Neural Net:
Input (not counted)
Hidden Layer
Output.
ALVINN explained
Typical Neural Net Structure.
All 960 inputs in the 30x32 matrix are sent to the four
hidden neurons/units, where weighted linear sums are
computed.
Hidden unit: a unit whose output is only accessible in the
net, but not at the output layer.
Outputs from the 4 hidden neurons are sent downstream to 30 output neurons, each of which outputs a confidence value corresponding to steering in a specific direction.
Fuzzy truth??
Probability measure?
Program chooses direction with highest confidence.
Typical Neural Net structure.
Usually, Layers are connected in Directed,
Acyclic Graph.
Can, in general, have any structure:
Cyclic/Acyclic
Directed/Undirected
Feedforward/feedback
Most common & practical nets trained using Backpropagation.
Learning = select a weight value for each connection.
Assumes net is directed.
Cycles are restricted; usually there are none in practice.
Appropriate problems for A.N.N.s
Great for problems with noisy data and complex sensor inputs:
Camera/microphone input, etc.
vs. shaft encoder/light sensor, etc.
Symbolic problems:
As good as Decision Tree Learning!!
BACKPROPAGATION
Most widely used technique.
Appropriate problems for A.N.N.s
Neural net learning is suitable for problems with the following attributes:
Instances represented by many attribute-value pairs:
Target function depends on a vector of predefined attributes.
Real-valued inputs.
Attributes can be correlated or independent.
Target function can be discrete-valued, real-valued, or a vector!
ALVINN's target function was a 30-real-element vector.
Appropriate problems for A.N.N.s
Training data can contain errors/noise:
Neural nets are very robust to noisy data.
Thankfully so are natural ones ;-)
Long training times are acceptable:
Neural nets usually take longer than other machine-learning algorithms.
A few minutes to several hours.
Fast evaluation of the learned function may be required:
Neural nets compute the learned function very fast.
ALVINN re-computes its bearing several times per second.
Appropriate problems for A.N.N.s
Not important if humans can understand the learned function!!
ALVINN:
960 inputs
4 hidden nodes
30 outputs
Gets somewhat messy looking to humans!!
Thanks to the massive parallelism and distributed data.
Perceptrons.
Basic building block of Neural Nets
Takes several real-valued inputs.
Computes weighted sum.
Checks sum against threshold −w0:
If sum > threshold, output +1; else output −1.
o(x1, …, xn) = 1 if w0 + w1x1 + … + wnxn > 0, −1 otherwise.
For simplicity, let x0 = 1; then
o(x1, …, xn) = sgn(w · x)
Vectors denoted in bold!!
The · is the vector dot product!!
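A minimal sketch of such a unit in Python (the helper name and the example weight values are illustrative, not from the slides):

```python
# Minimal perceptron unit: o(x) = sgn(w . x), with x0 = 1 folded in.
def perceptron_output(weights, inputs):
    # Prepend the constant input x0 = 1 so w0 acts as the (negated) threshold.
    x = [1.0] + list(inputs)
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else -1

# Example: weights w = (-0.8, 0.5, 0.5) implement AND over inputs in {0, 1}
# (output +1 only when both inputs are 1, since 0.5 + 0.5 - 0.8 > 0).
for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron_output([-0.8, 0.5, 0.5], [a, b]))
```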
Hypothesis space =
All possible combinations of real-valued weights:
All w in R^(n+1).
Perceptron can represent any linearly separable concept.
Learned hypothesis is a hyperplane in R^n.
Equation of hyperplane is w · x = 0.
Example:
AND, OR, NAND, NOR vs. XOR, etc.
Any boolean function can be represented by a 2-layer perceptron network!!
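As a quick check of that last claim, here is a sketch with assumed weight values (reusing the perceptron_output helper above) expressing XOR, which no single perceptron can represent, as a two-layer network: XOR(a, b) = AND(OR(a, b), NAND(a, b)).

```python
# XOR built from linearly separable pieces: OR, NAND, then AND of the two.
def xor_net(a, b):
    h1 = perceptron_output([-0.3, 0.5, 0.5], [a, b])   # OR   (fires unless a = b = 0)
    h2 = perceptron_output([0.8, -0.5, -0.5], [a, b])  # NAND (fires unless a = b = 1)
    # Hidden outputs are in {-1, +1}; AND of two +/-1 inputs:
    return perceptron_output([-0.8, 0.5, 0.5], [h1, h2])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))  # +1 exactly when a != b
```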
Training a single Perceptron
Perceptron training rule.
Delta training rule / Gradient Descent.
These converge to different hypotheses, under different conditions!!
Perceptron training rule.
Start with random weights
Go through training examples
When an error is made, update the weights:
For each wi:
wi ← wi + Δwi
Δwi = η(t − o)xi
Terminology:
η is the learning rate
typically small
Sometimes decreases as we proceed.
t is the value of the target function
o is the value outputted by perceptron.
When the set of training examples is
completed perfectly, STOP.
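A compact sketch of this rule in Python (assumed data format: each training example is a pair (x, t) with t in {-1, +1}; perceptron_output is the helper defined earlier):

```python
import random

def train_perceptron(examples, eta=0.1, max_epochs=1000):
    """Perceptron training rule: repeat until all examples are classified."""
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]  # w0 plus n weights
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            o = perceptron_output(w, x)
            if o != t:
                errors += 1
                # Update rule: wi <- wi + eta * (t - o) * xi, with x0 = 1.
                xs = [1.0] + list(x)
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
        if errors == 0:  # training set classified perfectly: STOP
            return w
    return w  # may not have converged if the concept is not linearly separable
```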
Pros and Cons
Can be proven to converge to a w that correctly classifies all training examples in finite time, if η is small enough.
Convergence guaranteed only if the concept is linearly separable.
If not, no convergence, no stopping!!
Can be an infinite loop of indecision!!
Delta rule & Gradient descent.
Addresses problem of non-convergence
for nonlinear concepts.
Will give a linear approximation to
nonlinear concepts that will minimize error.
…how do we measure error??
Consider a perceptron with the
“thresholding” function removed.
Then o = w · x.
Can define error as the sum of squared deviations:
E = ½ Σd (td – od)², summing over all training examples d.
td is the target function value for example d.
od is the computed value of the weighted sum for d.
With this choice of error, it can be proven that
minimizing E will lead to the most probable
hypothesis that fits the training data, under the
assumption that the noise is normally distributed
with mean 0.
Note: "most probable" hypothesis and "correct" hypothesis can still be different.
Graph E versus weights:
E will always be parabolic (by definition).
So it has only one global minimum.
Goal: descend to the global minimum ASAP!
How?
Gradient definition: ∇E(w) = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]
Meaning of the Gradient.
Gradient descent.
The gradient of E tells us the direction of steepest ascent.
So −∇E tells us the direction of steepest descent.
Go in that direction with step size η.
The learning rule becomes: w ← w − η∇E(w).
Derivation of simple update rule:
∂E/∂wi = ∂/∂wi ½ Σd (td – od)² = Σd (td – od)(−xid), since od = w · xd.
Finally, the Delta Rule weight update is:
Δwi = η Σd (td – od)xid, over all training examples d.
Gradient descent pseudocode.
Pick an initial random weight vector w.
Until the termination condition is met, do:
  Set Δwi = 0 for all i.
  For each <x, t> in D, do:
    Run net on x: compute o(x).
    For each wi, do:
      Δwi ← Δwi + η(t – o)xi
  For each wi, do:
    wi ← wi + Δwi
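A minimal sketch of this pseudocode in Python for an unthresholded linear unit (assumed data format: examples are (x, t) pairs; termination here is simply a fixed number of epochs, one of the conditions discussed later):

```python
import random

def gradient_descent(examples, eta=0.05, epochs=100):
    """Batch gradient descent for a linear unit o = w . x (x0 = 1 folded in)."""
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        delta = [0.0] * len(w)                        # accumulate over ALL examples
        for x, t in examples:
            xs = [1.0] + list(x)
            o = sum(wi * xi for wi, xi in zip(w, xs))  # unthresholded output
            for i, xi in enumerate(xs):
                delta[i] += eta * (t - o) * xi
        w = [wi + di for wi, di in zip(w, delta)]      # one update per pass
    return w
```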
Results of Gradient Descent.
Because of the shape of the error surface,
with only a local minimum, the algorithm
will converge to a w with minimum
squared deviation/error as long as η is
small enough.
This holds regardless of linear separability.
If η is too large, the algorithm may skip over the global minimum instead of settling in.
A common remedy is to decrease η with time.
Gradient descent can be used to search
large or infinite hypothesis spaces when:
The hypothesis space is continuously parameterized.
The error can be differentiated with respect to the hypothesis parameters.
However:
Convergence can be slooooooow.
If there are lots of local minima, there's no guarantee it will find the global one.
Stochastic gradient descent.
Stochastic gradient descent
a.k.a. incremental gradient descent.
Instead of updating the weights massively after going through all the training examples, we update them after each example.
This really descends the gradient for a single-example error function (one example per step):
Ed = ½ (td – od)²
If η is small enough, this is as optimal as true gradient descent.
Stochastic Gradient Descent.
Pick an initial random weight vector w
Until the termination condition is met, do
For each <x, t> in D, do
Run net on x: compute o(x).
For each wi, do
wi = wi + η (t – o) xi
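The same sketch with the stochastic update (the weights change inside the per-example loop, matching the pseudocode above):

```python
import random

def stochastic_gradient_descent(examples, eta=0.05, epochs=100):
    """Incremental/stochastic gradient descent: update w after each example."""
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, t in examples:
            xs = [1.0] + list(x)
            o = sum(wi * xi for wi, xi in zip(w, xs))
            # Immediate update: wi <- wi + eta * (t - o) * xi
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
    return w
```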
Results.
Compared to stochastic gradient descent, standard gradient descent takes more computation per step, but can therefore be used with a larger step size.
When E has multiple local minima,
Stochastic gradient descent can
sometimes avoid them.
Perceptrons with discrete output.
For Delta-learning/gradient descent, we discussed unthresholded perceptrons.
It can simply be modified for thresholded perceptrons:
Just use the thresholded t values as the target values for the perceptron delta-learning algorithm (with the unthresholded o values).
Unfortunately, this may not necessarily reduce the percentage of errors in the training data made by the thresholded output, just the squared error in the unthresholded output.
Multilayer Networks and the
Backpropagation Algorithm
In order to learn non-linear decision
surfaces, a system more complex than
perceptrons is needed.
Choice of base unit:
Unthresholded perceptron: multiple layers of linear units are still linear.
Thresholded perceptron: has a non-differentiable thresholding function, so we cannot compute the gradient of E.
Need something different:
Must be non-linear
And continuously differentiable.
Sigmoid unit.
In place of the perceptron step function, use the sigmoid function as the thresholding function.
Sigmoid: σ(y) = 1 / (1 + e^(−y))
The sigmoid unit computes the weighted linear sum w · x, and then applies the sigmoid "squashing function".
Steepness of the incline increases with the coefficient of −y.
Continuously differentiable:
Derivative: dσ(y)/dy = σ(y)(1 − σ(y))
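A sketch of a sigmoid unit and its derivative in Python (the helper names are just for illustration):

```python
import math

def sigmoid(y):
    """Squashing function: maps any real y into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_unit_output(weights, inputs):
    """Weighted linear sum w . x (x0 = 1 folded in), then squash."""
    x = [1.0] + list(inputs)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

def sigmoid_derivative(o):
    """d(sigma)/dy expressed via the unit's output o = sigma(y)."""
    return o * (1.0 - o)
```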
Backpropagation algorithm.
Learns the weights that minimize squared
error given fixed # of units/neurons, and
interconnections.
Employs gradient descent similar to delta
rule.
Error is measured by:
E(w) = ½ Σd Σk (tkd – okd)², summing over all training examples d and all output units k.
Error Surface can have multiple local
minima.
No guarantee the algorithm will find the global minimum.
However, in practice, backpropagation
performs very well.
Recall: Stochastic Gradient Descent vs. Gradient Descent.
Stochastic Backpropagation Algorithm
Consider a feedforward network with two layers of sigmoid units, fully connected in one direction.
Each unit is assigned an index (i = 0, 1, 2, …).
xji denotes the input from unit i into unit j.
wji denotes the weight on the connection from i to j.
δn is the error term associated with unit n.
Backpropagation explained
Make the network, randomly initialize the
weights.
For each training example d, apply the
network, calculate the squared error Ed,
apply the gradient, and proceed a step of
size η in direction of steepest decline.
Weight update rule:
Recall the delta rule: Δwi ← Δwi + η(t – o)xi
Here we have: Δwji = ηδjxji
The error term δj is more complex here.
Recall:
Error term for unit j.
Intuitively, if j is an output node k, then its error term is the standard tk – ok multiplied by ok(1 – ok), the derivative of the sigmoid function:
δk = ok(1 – ok)(tk – ok)
Derivative of the sigmoid appears because we're using ∇E.
If j is a hidden node h, we have no th to compare it with.
Must sum the error terms δk of the output nodes k influenced by h, weighted by how much they were influenced by h (wkh):
δh = oh(1 – oh) Σk wkhδk
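Putting the pieces together, here is a minimal sketch of one stochastic-backpropagation step for the two-layer sigmoid network described above (assumed conventions: plain Python lists, a bias input x0 = 1 folded into each unit's weights, target t given as a vector of output values; sigmoid_unit_output is the helper sketched earlier):

```python
def backprop_step(x, t, w_hidden, w_output, eta=0.05):
    """One stochastic-gradient step for a 2-layer sigmoid network."""
    # Forward pass: hidden outputs, then network outputs.
    h = [sigmoid_unit_output(w, x) for w in w_hidden]
    o = [sigmoid_unit_output(w, h) for w in w_output]
    # Output error terms: delta_k = o_k (1 - o_k)(t_k - o_k).
    delta_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Hidden error terms: delta_h = o_h (1 - o_h) * sum_k w_kh * delta_k.
    # (w_output[k][j + 1] is w_kh; index 0 is the bias weight.)
    delta_h = [h[j] * (1 - h[j]) *
               sum(w_output[k][j + 1] * delta_o[k] for k in range(len(o)))
               for j in range(len(h))]
    # Weight updates: delta_w_ji = eta * delta_j * x_ji.
    xs, hs = [1.0] + list(x), [1.0] + h
    for k, dk in enumerate(delta_o):
        w_output[k] = [w + eta * dk * hi for w, hi in zip(w_output[k], hs)]
    for j, dj in enumerate(delta_h):
        w_hidden[j] = [w + eta * dj * xi for w, xi in zip(w_hidden[j], xs)]
```

Looping this over every training example until a termination condition holds gives the stochastic backpropagation algorithm.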
Derivation of error term for unit j.
Termination condition.
Stop after a fixed number of iterations.
Stop when E on training data drops below a given level.
Stop when E on test data drops below a certain level.
Too few iterations: too much error remains.
Too many: overfitting the data.
Variant of Backpropagation with
momentum.
Adding momentum = make the weight update at step n depend partially on that at step n − 1:
Δwji(n) = ηδjxji + αΔwji(n − 1)
α is a small number between 0 and 1.
Analogous to a ball rolling down a bumpy hill:
Momentum (α) tries to keep the ball moving in the same direction as before.
Can keep the 'ball' moving through local minima and plateaus.
If ∇E does not change, increases the effective step size.
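As a sketch, the only change to the weight update in the backprop code above is one remembered term per weight (prev_delta is an assumed extra piece of state, initialized to zeros):

```python
def momentum_update(w, prev_delta, j, i, eta, delta_j, x_ji, alpha=0.5):
    """Momentum variant of the update for one weight w_ji (0 < alpha < 1):
    delta(n) = eta * delta_j * x_ji + alpha * delta(n - 1)."""
    d = eta * delta_j * x_ji + alpha * prev_delta[j][i]
    w[j][i] += d
    prev_delta[j][i] = d  # remember this step for the next update
```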
Pros and Cons of Momentum
Can provide quicker convergence.
However theoretically, can also drop right
through global minimum and keep going.
Generalization to n-layer network
We simply evaluate the δk for output
nodes as before.
Then Backpropagate errors through
network layer by layer:
For a node r at layer m:
δr = or(1 – or) Σs wsrδs, over all nodes s in layer m + 1.
Layer m + 1 is downstream from m:
r feeds input into the s nodes.
Generalization to arbitrary acyclic
network
We simply evaluate the δk for output nodes
as before.
Then Backpropagate errors through
network:
For a node r:
δr = or(1 – or) Σs wsrδs, over all nodes s in Downstream(r).
Downstream(r) = {s | s receives input from r}
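A sketch of this general rule (assumed representation: downstream[r] lists the unit indices s that receive input from r, and w[s][r] is the weight on that connection):

```python
def error_term(r, o, w, downstream, delta):
    """General backprop error term for an internal unit r in an acyclic net:
    delta_r = o_r (1 - o_r) * sum over s in Downstream(r) of w_sr * delta_s."""
    return o[r] * (1.0 - o[r]) * sum(w[s][r] * delta[s] for s in downstream[r])
```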
Summary.
Artificial neural networks are:
A practical way to learn discrete, real, and vector-valued functions.
Robust to noisy data.
Usually trained via Backpropagation.
Used for many real-world tasks:
Robot control
Computer creativity
http://www.venturacountystar.com/news/2007/jul/09/computers-compose-original-melodies/
Feedforward networks with 3 layers can
approximate any function to any desired
accuracy given sufficient units/artificial
neurons and connections.
Good accuracy achieved even with small nets.
Backpropagation is able to find
intermediate features within the net, that
are not explicitly defined as attributes of
the input or output.